Apache Airflow in Production: Lessons from Running It at Scale

The Scheduler Is a Single Thread

Everything about Airflow DAG design flows from one constraint: the scheduler is fundamentally serial. It scans all active DAGs, evaluates dependencies, and queues tasks in a single loop. When this loop takes too long, tasks queue but do not start, and the UI shows a growing backlog with no errors.

The scheduler loop slows down primarily because of DAG complexity. Every imported library, every database call in DAG-level code, and every dynamically generated task adds time to the scan. The rule: DAG files should be fast to parse and should do zero I/O at import time.

# BAD: database call at DAG-level (runs every scheduler scan)
from mydb import get_table_list

tables = get_table_list()  # This runs every time the file is parsed

with DAG("process_tables") as dag:
    for table in tables:
        PythonOperator(task_id=f"process_{table}", ...)

# GOOD: defer expensive operations into task execution
with DAG("process_tables") as dag:
    tables_task = PythonOperator(
        task_id="get_tables",
        python_callable=get_table_list,
    )
    # Use dynamic task mapping (Airflow 2.3+) or XCom for downstream tasks

Beyond import-time code, keep your DAG count manageable. More than 200-300 active DAGs on a single Airflow deployment starts to create scheduler pressure. If you have hundreds of similar DAGs (one per table, one per customer), consider consolidating into a parameterized DAG with dynamic task mapping rather than separate DAG files.

Operator Selection: The Choice That Determines Reliability

Most Airflow operators fall into two categories: those that run code in the worker process, and those that submit work to an external system and wait. The second category is far more reliable in production.

PythonOperator runs code directly in the Airflow worker. This means the worker must have all dependencies installed, the task holds a worker slot for its entire duration, and a worker crash means the task is lost. It is the right choice for lightweight, fast operations (triggering APIs, checking conditions, transforming small data).

KubernetesPodOperator submits a pod to Kubernetes and waits for completion. Each task gets its own isolated environment with its own dependencies. The worker releases its slot after submitting the pod. This is the right choice for heavy computation, ML training, or anything with significant dependencies.

SnowflakeOperator, BigQueryInsertJobOperator -- these submit SQL to the warehouse and poll for completion. The worker holds minimal resources while the warehouse does the work. Right choice for all warehouse-side transformations.

from airflow.providers.snowflake.operators.snowflake import SnowflakeOperator
from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator

# Warehouse SQL: submit and poll, worker barely used
transform_task = SnowflakeOperator(
    task_id="transform_silver_orders",
    sql="""
        CREATE OR REPLACE TABLE silver.orders AS
        SELECT order_id, customer_id, amount_cents / 100.0 AS amount_usd
        FROM bronze.raw_orders
        WHERE order_id IS NOT NULL
    """,
    snowflake_conn_id="snowflake_default",
)

# Heavy Python: isolated container, no dependency conflicts
ml_task = KubernetesPodOperator(
    task_id="train_churn_model",
    image="myrepo/ml-trainer:v2.1",
    cmds=["python", "train.py"],
    arguments=["--date", "{{ ds }}"],
    namespace="airflow",
    get_logs=True,
    is_delete_operator_pod=True,
)

Retry Configuration That Actually Works

Default Airflow retry behavior is too aggressive for data pipelines. Three immediate retries on a task that failed because a source API is down will fail three more times before you even get an alert, and they will consume worker slots doing so.

The retry configuration I use for production ETL tasks:

from datetime import timedelta
from airflow.models import DAG
from airflow.operators.python import PythonOperator

default_args = {
    "owner": "data-platform",
    "retries": 2,
    "retry_delay": timedelta(minutes=5),
    "retry_exponential_backoff": True,  # 5min, 10min (doubles each time)
    "max_retry_delay": timedelta(minutes=30),
    "email_on_failure": True,
    "email_on_retry": False,  # Don't spam on retries, only final failure
}

with DAG(
    "daily_etl",
    default_args=default_args,
    schedule="0 5 * * *",  # 5 AM
    catchup=False,          # NEVER enable catchup for ETL dags
    max_active_runs=1,      # Prevent concurrent runs of the same dag
) as dag:
    ...

The two settings that prevent the most production pain:

catchup=False is mandatory for ETL DAGs. With catchup enabled, if you deploy a DAG that was paused for two weeks, Airflow will attempt to run every daily interval for those two weeks at once, overwhelming your workers and potentially re-running pipelines on stale data. Always disable catchup unless you explicitly need historical backfill.

max_active_runs=1 prevents multiple concurrent runs of the same DAG. Without this, a slow DAG run can stack up with the next scheduled run, and you end up with two concurrent ETL jobs writing to the same destination table.

Handling Failures: The Alerting Hierarchy

Airflow's built-in email alerting is coarse. Task failure, task retry, DAG failure -- all can generate email notifications, and most teams either turn them all on (alert fatigue) or turn them all off (silent failures). The right answer is a tiered approach:

from airflow.models import DAG
from airflow.utils.email import send_email

def on_failure_callback(context):
    """Alert Slack on task failure, email on DAG failure."""
    dag_id = context['dag'].dag_id
    task_id = context['task_instance'].task_id
    exec_date = context['execution_date']
    log_url = context['task_instance'].log_url
    
    # Slack for immediate visibility (task-level)
    slack_message = (
        f":red_circle: *Task Failed*
"
        f"DAG: {dag_id} | Task: {task_id}
"
        f"Date: {exec_date}
"
        f"<{log_url}|View Logs>"
    )
    send_slack_alert(slack_message)

def on_dag_failure_callback(context):
    """Email the data team when a full DAG fails."""
    dag_id = context['dag'].dag_id
    exec_date = context['execution_date']
    
    send_email(
        to=["data-platform@company.com"],
        subject=f"[AIRFLOW] DAG Failed: {dag_id}",
        html_content=f"<p>DAG <b>{dag_id}</b> failed at {exec_date}.</p>",
    )

default_args = {
    "on_failure_callback": on_failure_callback,
}

with DAG(
    "critical_etl",
    on_failure_callback=on_dag_failure_callback,
    default_args=default_args,
) as dag:
    ...

XCom: Use It Sparingly

XCom (cross-communication) lets tasks pass values to downstream tasks. It is stored in the Airflow metadata database, which is where most teams run into trouble. Storing large data (DataFrames, query results, file contents) in XCom bloats the metadata database and slows down the scheduler.

XCom is designed for small values: counts, status codes, file paths, configuration parameters. If you need to pass large data between tasks, write it to object storage (S3, GCS) and pass the path via XCom:

# BAD: pushing a large DataFrame via XCom
def extract(**context):
    df = fetch_million_rows()
    context['ti'].xcom_push(key='data', value=df.to_json())  # Kills metadata DB

# GOOD: write to S3, pass the path
def extract(**context):
    df = fetch_million_rows()
    path = f"s3://my-bucket/tmp/{context['ds']}/extract.parquet"
    df.to_parquet(path)
    context['ti'].xcom_push(key='extract_path', value=path)

def transform(**context):
    path = context['ti'].xcom_pull(key='extract_path', task_ids='extract')
    df = pd.read_parquet(path)
    # ... transform ...

Connection and Variable Management

Airflow Connections and Variables are stored in the metadata database by default, which means they are visible to anyone with database access. For production, use a secrets backend:

# airflow.cfg or environment variable
# Route secrets to AWS Secrets Manager
AIRFLOW__SECRETS__BACKEND=airflow.providers.amazon.aws.secrets.secrets_manager.SecretsManagerBackend
AIRFLOW__SECRETS__BACKEND_KWARGS='{"connections_prefix": "airflow/connections", "variables_prefix": "airflow/variables"}'

# Or HashiCorp Vault
AIRFLOW__SECRETS__BACKEND=airflow.providers.hashicorp.secrets.vault.VaultBackend
AIRFLOW__SECRETS__BACKEND_KWARGS='{"connections_path": "connections", "variables_path": "variables", "mount_point": "airflow"}'

With a secrets backend configured, Airflow fetches credentials at task execution time rather than storing them in the metadata database. Rotation is handled in the secrets manager without touching Airflow configuration.

When Airflow Is the Wrong Tool

Airflow excels at scheduling and dependency management for batch workflows. It is the wrong tool for:

Real-time or near-real-time processing. Airflow's minimum scheduling granularity is effectively minutes. Sub-minute workflows belong in Kafka, Flink, or a dedicated streaming system.
Long-running tasks with external progress. A 12-hour Spark job should be submitted to Spark and monitored externally. Running it directly in an Airflow task ties up a worker slot for 12 hours.
Asset-oriented thinking. If you care more about “is my customers table fresh?” than “did my DAG run successfully?”, Dagster's asset model serves that mental model better. Airflow is task-centric; the data asset is implicit.
Teams starting from scratch with modern tooling. Airflow's operational overhead (metadata database, scheduler, workers, web server, all require separate management) is significant. For greenfield deployments, Dagster or Prefect offer better developer experience with lower operational complexity.

Airflow is an excellent tool when you know what it is good at and design your DAGs accordingly. The teams that struggle with it are usually the ones who discovered its constraints after building on top of it for two years.