Broadly: convert raw event data to usable form; enforce data model
Practically: applying rules and functions to extracted data
Rules: split, group, drop
Transformations: map, apply, normalize
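A minimal sketch of the rules and transformations listed above, using only the Python standard library. The event fields (user, action, ms) and the transform function are hypothetical names chosen for illustration, not part of any particular pipeline.

```python
from collections import defaultdict

# Hypothetical raw events; the field names are assumptions for illustration.
raw_events = [
    {"user": "a", "action": "click ", "ms": "120"},
    {"user": "b", "action": "VIEW", "ms": "300"},
    {"user": "a", "action": "view", "ms": None},  # missing duration
    {"user": "b", "action": "Click", "ms": "80"},
]

def transform(events):
    """Apply rules (drop, group) and transformations (map, normalize)."""
    # Rule: drop events that violate the data model (missing duration).
    kept = [e for e in events if e["ms"] is not None]
    # Transformation: map each event to typed, normalized fields.
    cleaned = [
        {"user": e["user"], "action": e["action"].strip().lower(), "ms": int(e["ms"])}
        for e in kept
    ]
    # Rule: group the cleaned events by user.
    grouped = defaultdict(list)
    for e in cleaned:
        grouped[e["user"]].append(e)
    return dict(grouped)

by_user = transform(raw_events)
```

Dropping before mapping keeps the type conversion simple, since int() never sees a missing value.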
Broadly: Push transformed data to datastore
Practically: Publish data in usable form
Database, flat file, API
Dashboard, table, report
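As a rough illustration of the load step, the sketch below pushes already-transformed rows into an in-memory SQLite database and then runs a report-style query. The table name, column names, and load function are assumptions made for the example.

```python
import sqlite3

# Hypothetical transformed rows: (user_id, action, duration in ms).
rows = [("a", "click", 120), ("b", "view", 300), ("b", "click", 80)]

def load(conn, rows):
    """Push transformed rows into a table and return the table's row count."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS events (user_id TEXT, action TEXT, ms INTEGER)"
    )
    conn.executemany("INSERT INTO events VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]

conn = sqlite3.connect(":memory:")  # in-memory database, nothing written to disk
loaded = load(conn, rows)

# "Publish": a report-style query over the loaded data.
report = conn.execute(
    "SELECT user_id, SUM(ms) FROM events GROUP BY user_id ORDER BY user_id"
).fetchall()
```

Swapping ":memory:" for a file path would persist the table; a flat file or API target would replace the load function while leaving the transformed rows unchanged.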
Representing ETL Pipelines
Directed Acyclic Graph (DAG): a graph with finitely many vertices and edges, with each edge directed from one vertex to another, such that there is no way to start at any vertex v and follow a consistently-directed sequence of edges that eventually loops back to v again.
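One way to make the definition concrete is to write a pipeline as a mapping from each task to the tasks it depends on, then ask for an ordering that respects every edge. The sketch below uses graphlib from the Python standard library (3.9 or later); the task names are hypothetical.

```python
from graphlib import TopologicalSorter, CycleError

# Each key maps a task to the set of tasks it depends on (hypothetical names).
pipeline = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "report": {"load"},
}

# static_order() yields tasks so that every dependency comes first.
order = list(TopologicalSorter(pipeline).static_order())

# Adding an edge that loops back makes the graph cyclic, so it is no longer a DAG.
cyclic = dict(pipeline, extract={"report"})
try:
    list(TopologicalSorter(cyclic).static_order())
    is_dag = True
except CycleError:
    is_dag = False
```

A valid ordering exists exactly when the graph has no cycle, which is why pipeline schedulers can rely on the DAG property to decide what runs next.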
Examples
Halal Guys
Apartment Recommendations
This Can Get Complex
ETL Tools
Bash Scripts
Airflow and other data pipeline tools are organized around the concept of a DAG. A DAG – or a Directed Acyclic Graph – is a collection of all the tasks you want to run, organized in a way that reflects their relationships and dependencies.
Airflow Components
Operator: describes a single task in a workflow. Operators are usually (but not always) atomic, meaning they can stand on their own and don’t need to share resources with any other operators. Examples: BashOperator, PythonOperator, etc.
Task: a parameterized instance of an operator.
Sensor: a special type of operator that waits until a certain condition is met, and only then succeeds and lets downstream tasks run.
Pools: limits on the execution parallelism of arbitrary sets of tasks. (Prevents overwhelming resources.)
Lyft, Robinhood, and Zillow are among the many companies that have adopted Airflow to help manage their ETL processes.
ETL Process Stakeholders
ETL Considerations for Data Scientists
Handling Missing Values
Drop observations with missing fields.
Impute values:
Central tendency: median, mean, mode.
Modeled value.
We'll touch on this throughout the rest of the course.
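The drop and central-tendency options above can be sketched with the standard library alone. The ages column and the impute helper are hypothetical, and missing entries are assumed to be recorded as None.

```python
from statistics import mean, median, mode

# Hypothetical column with missing entries recorded as None.
ages = [23, None, 31, 23, None, 40]

observed = [a for a in ages if a is not None]

# Option 1: drop observations with a missing field.
dropped = observed

# Option 2: impute with a measure of central tendency.
def impute(values, fill):
    """Replace every None in values with the given fill value."""
    return [fill if v is None else v for v in values]

with_median = impute(ages, median(observed))
with_mean = impute(ages, mean(observed))
with_mode = impute(ages, mode(observed))
```

Each fill value is computed from the observed entries only, so the imputed column keeps its original length while the dropped column shrinks.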
Handling Outliers
Drop.
Trim.
Be mindful.
We'll touch on this throughout the rest of the course.
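A small sketch of the drop and trim options. It assumes outliers are flagged with Tukey's 1.5 × IQR rule, which is only one common convention, and it reads "trim" as clipping flagged values back to the fence rather than removing them. The data is made up for illustration.

```python
from statistics import quantiles

# Hypothetical measurements with one extreme value.
values = [10, 12, 11, 13, 12, 11, 95]

# Tukey's rule: flag points more than 1.5 * IQR outside the quartiles.
q1, _, q3 = quantiles(values, n=4)
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Drop: remove flagged observations entirely.
dropped = [v for v in values if low <= v <= high]

# Trim (read here as clipping): pull flagged values back to the nearest fence.
clipped = [min(max(v, low), high) for v in values]
```

Dropping changes the sample size while clipping preserves it, which is part of why the choice deserves care.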
Normalizing Data
Adjust range.
Adjust scale.
We'll touch on this throughout the rest of the course.
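As one possible reading of the two bullets above, "adjust range" is sketched here as min-max rescaling to [0, 1] and "adjust scale" as z-score standardization. Both function names and the sample values are assumptions for illustration.

```python
from statistics import mean, pstdev

# Hypothetical feature values.
values = [2.0, 4.0, 6.0, 8.0]

def min_max(xs):
    """Rescale to the range [0, 1]. Assumes xs is not constant."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

def z_score(xs):
    """Center on the mean and divide by the population standard deviation.
    Assumes xs is not constant, so the deviation is nonzero."""
    mu, sigma = mean(xs), pstdev(xs)
    return [(x - mu) / sigma for x in xs]

rescaled = min_max(values)
standardized = z_score(values)
```

Min-max rescaling is sensitive to extreme values because it depends on the minimum and maximum, so it is usually applied after outliers have been handled.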