Data Pipeline Orchestration: Apache Airflow
Explain the role of pipeline orchestration tools, using Airflow as an example.
What Is Pipeline Orchestration
Pipelines consist of multiple interdependent tasks. Orchestration tools manage task execution order, retries, monitoring, and alerting.
Apache Airflow
Airflow is the leading open-source orchestration tool. Pipelines are defined as DAGs (Directed Acyclic Graphs) in Python code.
DAG (Directed Acyclic Graph)
Composed of task nodes and dependency edges. Ensures tasks execute in dependency order with no circular dependencies.
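The dependency-ordered execution a DAG guarantees can be sketched with a topological sort. This is an illustrative sketch using Python's standard-library graphlib, not Airflow's internal scheduler; the task names are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical task graph: each task maps to the set of tasks it depends on.
deps = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "notify": {"load"},
}

# static_order() yields a valid execution order and raises CycleError
# if the graph contains a circular dependency.
order = list(TopologicalSorter(deps).static_order())
print(order)  # → ['extract', 'transform', 'load', 'notify']
```

An orchestrator does essentially this at scale, additionally running independent tasks in parallel once their upstream dependencies succeed.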
Operator Types
- PythonOperator: Execute a Python function.
- BashOperator: Execute a shell command.
- SQL operators (e.g., SQLExecuteQueryOperator): Execute a SQL query.
- Sensors (e.g., S3KeySensor): Wait for an external condition, such as a file arriving in S3.
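A minimal DAG file tying these pieces together might look like the sketch below, assuming Apache Airflow 2.x (2.4+ for the schedule parameter); the dag_id, task names, and commands are illustrative.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator


def transform():
    print("transforming data")  # placeholder for real transformation logic


with DAG(
    dag_id="example_etl",
    start_date=datetime(2024, 1, 1),
    schedule="0 6 * * *",  # cron expression: run daily at 06:00
    catchup=False,
    default_args={
        "retries": 2,                         # retry each failed task twice
        "retry_delay": timedelta(minutes=5),  # wait 5 minutes between retries
    },
) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo extracting")
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    load = BashOperator(task_id="load", bash_command="echo loading")

    # Dependency edges: extract must finish before transform, then load.
    extract >> transform_task >> load
```

Dropping this file into Airflow's dags/ folder is enough for the scheduler to pick it up; the >> operator is Airflow's shorthand for declaring dependency edges.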
Core Features
- Scheduling (cron expressions)
- Task retry on failure (retries, retry_delay)
- Backfill: Re-run tasks for historical dates
- Web UI for visualizing DAG run status
Modern Alternatives
Prefect, Dagster (more Pythonic, easier to test); Databricks Workflows (within the Databricks platform).
