Airflow High Availability Setup: A Comprehensive Guide
Apache Airflow is a robust platform for orchestrating workflows, and configuring it for high availability (HA) keeps it operational and resilient, minimizing downtime and maintaining consistent task execution across Directed Acyclic Graphs (DAGs) even in the face of failures. Whether you're running tasks with PythonOperator, sending notifications via SlackOperator, or integrating Airflow with Snowflake, an HA setup is critical for production-grade reliability. This comprehensive guide, hosted on SparkCodeHub, explores Airflow High Availability Setup: how it works, how to configure it, and best practices for robust implementation. We'll provide detailed step-by-step instructions, practical examples with code, and an extensive FAQ section. For foundational knowledge, start with Airflow Web UI Overview and pair this with Defining DAGs in Python.
What is Airflow High Availability Setup?
Airflow High Availability (HA) Setup refers to configuring an Airflow deployment for continuous operation and fault tolerance by eliminating single points of failure across its core components: the Webserver, Scheduler, and Executor, serving workflows defined in the ~/airflow/dags directory (DAG File Structure Best Practices). Built on Airflow's architecture (Airflow Architecture (Scheduler, Webserver, Executor)), HA involves running multiple instances of the Scheduler and Webserver, using a distributed Executor (e.g., CeleryExecutor), and relying on a highly available metadata database (e.g., PostgreSQL with replication) and message broker (e.g., Redis with Sentinel). Task states and execution data are tracked in the metadata database, with performance monitored via the Web UI (Monitoring Task Status in UI) and logs centralized (Task Logging and Monitoring). This setup ensures uninterrupted service, making HA essential for mission-critical, production-grade Airflow deployments handling complex, high-volume workflows.
Core Components in Detail
Airflow High Availability Setup relies on several core components, each with specific roles and configurable parameters. Below, we explore these components in depth, including their functionality, parameters, and practical code examples.
1. Multiple Schedulers: Ensuring Continuous Scheduling
Running multiple Scheduler instances ensures that DAG parsing and task scheduling continue even if one Scheduler fails. Since Airflow 2.0, Schedulers run active-active and coordinate through row-level locks in the metadata database (which requires a database with row-locking support, e.g., PostgreSQL 9.6+ or MySQL 8+), so no separate leader-election service is needed.
- Key Functionality: Distributes scheduling work across Scheduler instances using database locking, so the failure of a single Scheduler does not stop scheduling.
- Parameters (in airflow.cfg under [scheduler]):
- scheduler_heartbeat_sec (int): Heartbeat interval (e.g., 5)—frequency of checks.
- num_runs (int): DAG runs per cycle (e.g., -1)—unlimited parsing.
- [core] sql_alchemy_conn: Shared DB (e.g., "postgresql+psycopg2://...")—synchronizes Schedulers.
- Code Example (Scheduler HA Configuration):
# airflow.cfg
[core]
executor = CeleryExecutor
sql_alchemy_conn = postgresql+psycopg2://airflow:airflow@postgres:5432/airflow
[scheduler]
scheduler_heartbeat_sec = 5
num_runs = -1
- DAG Example (Scheduled by HA):
from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime
def ha_task():
    print("Task scheduled by HA Scheduler")

with DAG(
    dag_id="ha_scheduler_dag",
    start_date=datetime(2025, 4, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    task = PythonOperator(
        task_id="ha_task",
        python_callable=ha_task,
    )
This configures multiple Schedulers with a shared DB, ensuring HA for ha_scheduler_dag.
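To verify that more than one Scheduler is actively heartbeating, you can query the job table in the metadata database. The sketch below is a minimal example using SQLAlchemy; it assumes the standard Airflow metadata schema (a job table with job_type, state, hostname, and latest_heartbeat columns) and the connection string from the configuration above:
# check_schedulers.py: list running Scheduler heartbeats (assumes the standard
# Airflow metadata schema and the sql_alchemy_conn shown above)
from sqlalchemy import create_engine, text

DB_URI = "postgresql+psycopg2://airflow:airflow@postgres:5432/airflow"

engine = create_engine(DB_URI)
with engine.connect() as conn:
    rows = conn.execute(text(
        "SELECT hostname, latest_heartbeat "
        "FROM job "
        "WHERE job_type = 'SchedulerJob' AND state = 'running' "
        "ORDER BY latest_heartbeat DESC"
    ))
    for hostname, heartbeat in rows:
        print(f"Scheduler on {hostname}, last heartbeat: {heartbeat}")
If two Schedulers are healthy, you should see two rows with recent heartbeats.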
2. Highly Available Webserver: Load-Balanced Access
Running multiple Webserver instances behind a load balancer ensures continuous Web UI and API access, providing fault tolerance and scalability.
- Key Functionality: A load balancer (e.g., HAProxy) distributes requests across Webserver instances, keeping the UI and API available if one instance goes down.
- Parameters (in airflow.cfg under [webserver]):
- web_server_host (str): Host (e.g., "0.0.0.0")—binds Webserver.
- web_server_port (int): Port (e.g., 8080)—defines endpoint.
- secret_key (str): Session key (e.g., "random-secret-key")—shared across instances.
- Code Example (Webserver HA Configuration):
# airflow.cfg
[webserver]
web_server_host = 0.0.0.0
web_server_port = 8080
secret_key = random-secret-key
- HAProxy Config (haproxy.cfg):
frontend airflow_frontend
    bind *:8080
    mode tcp
    default_backend airflow_backend

backend airflow_backend
    mode tcp
    balance roundrobin
    server web1 127.0.0.1:8081 check
    server web2 127.0.0.1:8082 check
- DAG Example (Accessed via HA Webserver):
from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime
def web_task():
    print("Task accessed via HA Webserver")

with DAG(
    dag_id="ha_web_dag",
    start_date=datetime(2025, 4, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    task = PythonOperator(
        task_id="web_task",
        python_callable=web_task,
    )
This sets up HA Webservers with HAProxy, serving ha_web_dag.
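Each Webserver also exposes a /health endpoint that reports metadatabase and scheduler status, which is handy for confirming that every instance behind the load balancer is serving traffic. Below is a minimal sketch using the requests library; the instance URLs are assumptions matching the HAProxy backend above:
# check_webservers.py: poll each Webserver's /health endpoint
# (instance URLs match the HAProxy backend above)
import requests

WEBSERVERS = ["http://127.0.0.1:8081", "http://127.0.0.1:8082"]

for base_url in WEBSERVERS:
    try:
        resp = requests.get(f"{base_url}/health", timeout=5)
        resp.raise_for_status()
        status = resp.json()
        print(f"{base_url}: metadatabase={status['metadatabase']['status']}, "
              f"scheduler={status['scheduler']['status']}")
    except requests.RequestException as exc:
        print(f"{base_url}: unreachable ({exc})")
The same endpoint can back an HAProxy HTTP health check if you prefer mode http over mode tcp.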
3. Distributed Executor with Celery: Fault-Tolerant Execution
Using CeleryExecutor with multiple workers and a highly available message broker (e.g., Redis with Sentinel) ensures task execution continues despite worker failures.
- Key Functionality: Distributes task execution across Celery workers via a highly available broker (e.g., Redis with Sentinel), so tasks keep running when individual workers or the Redis master fail.
- Parameters (in airflow.cfg under [celery]):
- broker_url (str): Broker (e.g., "sentinel://...")—task queue.
- result_backend (str): Result DB (e.g., "db+postgresql://...")—stores task results.
- worker_concurrency (int): Tasks per worker (e.g., 16)—capacity.
- Code Example (Celery HA Configuration):
# airflow.cfg
[core]
executor = CeleryExecutor
[celery]
broker_url = sentinel://sentinel1:26379;sentinel://sentinel2:26379;sentinel://sentinel3:26379
result_backend = db+postgresql://airflow:airflow@postgres:5432/airflow
worker_concurrency = 16
[celery_broker_transport_options]
# master_name tells Celery which Sentinel-monitored master to use
master_name = my_master
- Redis Sentinel Config (sentinel.conf):
port 26379
sentinel monitor my_master 127.0.0.1 6379 2
sentinel down-after-milliseconds my_master 5000
sentinel failover-timeout my_master 60000
- DAG Example (Executed by HA Celery):
from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime
def celery_task():
    print("Task executed by HA Celery worker")

with DAG(
    dag_id="ha_celery_dag",
    start_date=datetime(2025, 4, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    task = PythonOperator(
        task_id="celery_task",
        python_callable=celery_task,
    )
This uses CeleryExecutor with Redis Sentinel for HA execution of ha_celery_dag.
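To see the failover mechanism at work, you can ask Sentinel for the current Redis master, which is the same discovery Celery performs when the broker fails over. Below is a minimal sketch using the redis-py Sentinel client, assuming the hostnames, ports, and master name from the configuration above:
# check_sentinel.py: discover the current Redis master via Sentinel
# (hostnames, ports, and master_name match the example configuration above)
from redis.sentinel import Sentinel

SENTINELS = [("sentinel1", 26379), ("sentinel2", 26379), ("sentinel3", 26379)]

sentinel = Sentinel(SENTINELS, socket_timeout=1)
host, port = sentinel.discover_master("my_master")
print(f"Current Redis master: {host}:{port}")

# Ping the master through the Sentinel-managed connection
master = sentinel.master_for("my_master", socket_timeout=1)
print("Master ping:", master.ping())
If a Redis replica is configured, stopping the primary and re-running the script should show the newly promoted master.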
4. HA Metadata Database: Reliable State Management
A highly available metadata database (e.g., PostgreSQL with replication) ensures consistent task state tracking and Scheduler coordination across HA instances.
- Key Functionality: Replicates the metadata database (e.g., PostgreSQL primary-replica) so task instance state and Scheduler coordination survive a database node failure.
- Parameters (in airflow.cfg under [core]):
- sql_alchemy_conn (str): DB connection (e.g., "postgresql+psycopg2://...")—HA endpoint.
- sql_alchemy_pool_size (int): Pool size (e.g., 10)—connection capacity.
- Code Example (PostgreSQL HA Configuration):
# airflow.cfg
[core]
sql_alchemy_conn = postgresql+psycopg2://airflow:airflow@postgres-ha:5432/airflow
sql_alchemy_pool_size = 10
- PostgreSQL HA Setup (Simplified Example):
# Primary PostgreSQL
docker run -d -p 5432:5432 --name postgres-primary \
-e POSTGRES_USER=airflow -e POSTGRES_PASSWORD=airflow -e POSTGRES_DB=airflow \
postgres:13
# Replica PostgreSQL (simplified, requires replication config)
docker run -d -p 5433:5432 --name postgres-replica \
-e POSTGRES_USER=airflow -e POSTGRES_PASSWORD=airflow -e POSTGRES_DB=airflow \
postgres:13
- DAG Example (State Managed by HA DB):
from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime
def db_task():
    print("Task with HA DB state management")

with DAG(
    dag_id="ha_db_dag",
    start_date=datetime(2025, 4, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    task = PythonOperator(
        task_id="db_task",
        python_callable=db_task,
    )
This configures an HA PostgreSQL DB for ha_db_dag.
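To confirm which PostgreSQL node is acting as primary and which as replica, you can run pg_is_in_recovery() against each endpoint; it returns false on the primary and true on a streaming replica. Below is a minimal sketch using psycopg2, with host and ports matching the simplified Docker setup above:
# check_postgres_roles.py: identify primary vs. replica nodes
# (host/ports match the simplified Docker setup above)
import psycopg2

NODES = {"postgres-primary": 5432, "postgres-replica": 5433}

for name, port in NODES.items():
    conn = psycopg2.connect(
        host="localhost", port=port,
        dbname="airflow", user="airflow", password="airflow",
    )
    with conn.cursor() as cur:
        cur.execute("SELECT pg_is_in_recovery()")
        in_recovery = cur.fetchone()[0]
        print(f"{name}: {'replica (in recovery)' if in_recovery else 'primary'}")
    conn.close()
With the simplified containers above (no replication configured yet), both nodes report primary; once streaming replication is in place, the replica reports in recovery.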
Key Parameters for Airflow High Availability Setup
Key parameters in airflow.cfg and HA configuration:
- scheduler_heartbeat_sec: Scheduler frequency (e.g., 5)—ensures HA.
- web_server_port: Webserver port (e.g., 8080)—defines HA endpoint.
- broker_url: Celery broker (e.g., "sentinel://...")—HA task queue.
- sql_alchemy_conn: DB connection (e.g., "postgresql+psycopg2://...")—HA state.
- worker_concurrency: Worker capacity (e.g., 16)—execution scale.
These parameters enable HA.
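When several machines must agree on these settings, it helps to print the values Airflow actually resolves on each node. Below is a minimal sketch using Airflow's configuration API, assuming a standard Airflow 2.x installation (newer releases expose the DB connection under [database] instead of [core]):
# show_ha_config.py: print the HA-relevant settings as Airflow resolves them
from airflow.configuration import conf

print("executor:", conf.get("core", "executor"))
print("sql_alchemy_conn:", conf.get("core", "sql_alchemy_conn"))  # [database] on newer versions
print("scheduler_heartbeat_sec:", conf.getint("scheduler", "scheduler_heartbeat_sec"))
print("web_server_port:", conf.getint("webserver", "web_server_port"))
print("broker_url:", conf.get("celery", "broker_url"))
print("worker_concurrency:", conf.getint("celery", "worker_concurrency"))
Run this on every Scheduler, Webserver, and worker host; a mismatch in executor, broker_url, or sql_alchemy_conn is a common cause of HA misbehavior.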
Setting Up Airflow High Availability: Step-by-Step Guide
Let’s configure Airflow for HA with multiple Schedulers, Webservers, Celery workers, and an HA database, testing with a sample DAG.
Step 1: Set Up Your Airflow Environment
1. Install Docker: Install Docker Desktop (e.g., on macOS: brew install docker). Start Docker and verify: docker --version.
2. Install Airflow with Celery: Open your terminal, navigate to your home directory (cd ~), and create a virtual environment (python -m venv airflow_env). Activate it (source airflow_env/bin/activate on Mac/Linux or airflow_env\Scripts\activate on Windows), then install Airflow (pip install "apache-airflow[celery,postgres,redis]>=2.0.0").
3. Set Up Redis with Sentinel: Start Redis and Sentinel:
docker run -d -p 6379:6379 --name redis-primary redis:6.2
docker run -d -p 26379:26379 --name sentinel1 -v $(pwd)/sentinel.conf:/etc/redis/sentinel.conf redis:6.2 redis-sentinel /etc/redis/sentinel.conf
docker run -d -p 26380:26379 --name sentinel2 -v $(pwd)/sentinel.conf:/etc/redis/sentinel.conf redis:6.2 redis-sentinel /etc/redis/sentinel.conf
Create sentinel.conf as shown in the Distributed Executor with Celery section (point the sentinel monitor line at an address the Sentinel containers can reach, rather than 127.0.0.1).
4. Set Up PostgreSQL with HA: Start primary and replica (simplified):
docker run -d -p 5432:5432 --name postgres-primary \
-e POSTGRES_USER=airflow -e POSTGRES_PASSWORD=airflow -e POSTGRES_DB=airflow \
postgres:13
docker run -d -p 5433:5432 --name postgres-replica \
-e POSTGRES_USER=airflow -e POSTGRES_PASSWORD=airflow -e POSTGRES_DB=airflow \
postgres:13
Note: Full HA requires a proper replication setup (e.g., pg_hba.conf and standby configuration on the replica; simplified here).
5. Set Up HAProxy: Create haproxy.cfg as shown in the Highly Available Webserver section, then:
docker run -d -p 8080:8080 -v $(pwd)/haproxy.cfg:/usr/local/etc/haproxy/haproxy.cfg --name haproxy haproxy:latest
Note: if the Webservers run on the host rather than in containers, replace 127.0.0.1 in haproxy.cfg with an address the HAProxy container can reach (e.g., host.docker.internal on Docker Desktop).
6. Configure Airflow: Edit ~/airflow/airflow.cfg:
[core]
executor = CeleryExecutor
sql_alchemy_conn = postgresql+psycopg2://airflow:airflow@localhost:5432/airflow
[webserver]
# RBAC and password authentication are enabled by default in Airflow 2.x
web_server_host = 0.0.0.0
web_server_port = 8081  # Adjust for the second instance (e.g., 8082)
secret_key = random-secret-key
[scheduler]
scheduler_heartbeat_sec = 5
num_runs = -1
[celery]
broker_url = sentinel://localhost:26379;sentinel://localhost:26380
result_backend = db+postgresql://airflow:airflow@localhost:5432/airflow
worker_concurrency = 16
[celery_broker_transport_options]
master_name = my_master
Duplicate this file for the second Webserver with web_server_port = 8082 (or override the port at startup as shown below).
7. Initialize the Database: Run airflow db init on one instance.
8. Create Admin User: Run:
airflow users create \
--username admin \
--firstname Admin \
--lastname User \
--email admin@example.com \
--role Admin \
--password admin123
9. Start Airflow Services: In separate terminals:
- airflow webserver -p 8081 (Webserver 1)
- airflow webserver -p 8082 (Webserver 2)
- airflow scheduler (Scheduler 1)
- airflow scheduler (Scheduler 2)
- airflow celery worker --concurrency 8 --celery-hostname worker1 (Worker 1)
- airflow celery worker --concurrency 8 --celery-hostname worker2 (Worker 2)
Step 2: Create a Sample DAG for HA Testing
1. Open a Text Editor: Use Visual Studio Code or any plain-text editor; ensure .py output.
2. Write the DAG Script: Define a DAG:
- Copy this code:
from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime, timedelta
def ha_test_task():
    print("Task running in HA environment")

with DAG(
    dag_id="ha_test_dag",
    start_date=datetime(2025, 4, 1),
    schedule_interval=timedelta(minutes=5),
    catchup=False,
    max_active_runs=2,
) as dag:
    task = PythonOperator(
        task_id="ha_test_task",
        python_callable=ha_test_task,
    )
- Save as ha_test_dag.py in ~/airflow/dags.
Step 3: Test and Monitor High Availability Setup
1. Access Web UI: Go to localhost:8080 (HAProxy), log in with admin/admin123, and verify access via the load-balanced Webservers.
2. Trigger the DAG: Toggle "ha_test_dag" to "On" in the UI, then click "Trigger DAG" for April 7, 2025. Monitor:
- ha_test_task executes, scheduled by one of the Schedulers and run by a Celery worker.
3. Test Scheduler HA: Stop one Scheduler (e.g., Ctrl+C), re-trigger, and verify the second Scheduler takes over.
4. Test Webserver HA: Stop one Webserver (e.g., Ctrl+C), refresh the UI, and verify HAProxy switches to the second Webserver.
5. Check Logs: In Graph View, click ha_test_task > "Log" to see "Task running in HA environment".
6. Optimize HA:
- Add a third Scheduler (airflow scheduler), re-trigger, and note the increased resilience.
- Increase worker_concurrency to 24, restart the workers, and observe the higher task capacity.
7. Retry DAG: If execution fails (e.g., DB unavailable), fix the connection, click "Clear," and retry.
This tests HA with redundant Schedulers, Webservers, and workers.
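You can also exercise the load-balanced endpoint programmatically by triggering ha_test_dag through Airflow's stable REST API via HAProxy. Below is a minimal sketch using requests with basic auth, assuming the admin user created earlier and that the basic-auth API backend is enabled:
# trigger_via_haproxy.py: trigger ha_test_dag through the load balancer using
# the stable REST API (assumes basic-auth API access and the admin user above)
import requests

AIRFLOW_URL = "http://localhost:8080"  # HAProxy front end
AUTH = ("admin", "admin123")

resp = requests.post(
    f"{AIRFLOW_URL}/api/v1/dags/ha_test_dag/dagRuns",
    json={"conf": {}},
    auth=AUTH,
    timeout=10,
)
resp.raise_for_status()
print("Triggered run:", resp.json()["dag_run_id"])
Stopping one Webserver and re-running the script is a quick way to confirm the API stays reachable through HAProxy.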
Key Features of Airflow High Availability Setup
Airflow High Availability Setup offers powerful features, detailed below.
Continuous Scheduling
Multiple active Schedulers, coordinated through database row-level locks, keep DAGs scheduled with no downtime when one instance fails.
Example: Scheduler HA
ha_scheduler_dag—scheduled despite failure.
Uninterrupted Web Access
Load-balanced Webservers behind HAProxy keep the UI and API reachable even when an instance goes down, enhancing reliability.
Example: Web HA
ha_web_dag—accessible via HAProxy.
Fault-Tolerant Execution
CeleryExecutor with a highly available broker (e.g., Redis Sentinel) keeps executing tasks across the remaining workers when individual workers fail.
Example: Celery HA
ha_celery_dag—runs despite worker loss.
Reliable State Management
A replicated metadata database (e.g., PostgreSQL primary-replica) preserves task state and run history across database failures.
Example: DB HA
ha_db_dag—state preserved in HA DB.
Scalable Resilience
Each HA component (Schedulers, Webservers, workers) can be scaled out independently, increasing both capacity and resilience under high load.
Example: HA Scale
ha_test_dag—runs reliably with HA setup.
Best Practices for Airflow High Availability Setup
Optimize HA with these detailed guidelines:
- Run Multiple Schedulers: Use at least two Schedulers (e.g., with scheduler_heartbeat_sec = 5) and test failover regularly (Airflow Configuration Basics).
- Test HA Components: Simulate failures (e.g., stop a Scheduler) and verify that scheduling and execution continue (DAG Testing with Python).
- Load Balance Webservers: Run two or more Webserver instances behind HAProxy and monitor access logs (Airflow Performance Tuning).
- Secure the HA Broker: Run Redis Sentinel with at least three nodes so the task queue survives a master failure, and monitor broker logs (Airflow Pools: Resource Management).
- Monitor HA: Watch logs and the UI for failover events and adjust configuration accordingly (Airflow Graph View Explained).
- Optimize the Database: Use a replicated PostgreSQL setup for the metadata DB and monitor database logs (Task Logging and Monitoring).
- Document HA: List the HA components and their endpoints (e.g., in a README) for clarity (DAG File Structure Best Practices).
- Handle Time Zones: Align schedules across HA nodes with a consistent time zone (e.g., adjust for PDT) (Time Zones in Airflow Scheduling).
These practices ensure robust HA.
FAQ: Common Questions About Airflow High Availability Setup
Here’s an expanded set of answers to frequent questions from Airflow users.
1. Why isn’t my second Scheduler working?
A database conflict is the usual cause: both Schedulers must point at the same shared database, so check sql_alchemy_conn and the Scheduler logs (Airflow Configuration Basics).
2. How do I debug HA failures?
Check the Scheduler, Webserver, and worker logs for errors (e.g., "Scheduler down") and verify that each HA component is running (Task Logging and Monitoring).
3. Why use multiple Schedulers?
Multiple Schedulers remove a single point of failure: if one stops, the others keep scheduling, eliminating downtime (Airflow Performance Tuning).
4. How do I scale HA workers?
Add more Celery workers or raise worker_concurrency (e.g., to 16) and watch the logs as load scales (Airflow XComs: Task Communication).
5. Can HA span multiple instances?
Yes. As long as all nodes share the same metadata database and broker, Schedulers, Webservers, and workers can run across multiple machines (Airflow Executors (Sequential, Local, Celery)).
6. Why is my Web UI unavailable?
This is usually a load balancer issue: check the HAProxy backends and access logs (DAG Views and Task Logs).
7. How do I monitor HA health?
Use the logs and the UI to track uptime, or export metrics to Prometheus (e.g., scheduler heartbeat metrics) (Airflow Metrics and Monitoring Tools).
8. Can HA trigger a DAG?
Yes. A sensor task can poll an HA readiness check (e.g., a custom ha_components_ready() helper) before the rest of the DAG runs, as sketched below (Triggering DAGs via UI).
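As a sketch of the pattern in that answer, a PythonSensor can gate a DAG on a hypothetical ha_components_ready() helper (illustrative only, not an Airflow API); replace its body with real checks such as the /health or Sentinel probes shown earlier:
# Sketch of FAQ #8: gate a DAG run on an HA readiness check.
# ha_components_ready() is a hypothetical helper, not part of Airflow.
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.sensors.python import PythonSensor
from datetime import datetime

def ha_components_ready():
    # Replace with real checks, e.g., polling /health or Sentinel as shown earlier
    return True

def run_after_ha_check():
    print("HA components verified, running task")

with DAG(
    dag_id="ha_gate_dag",
    start_date=datetime(2025, 4, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    wait_for_ha = PythonSensor(
        task_id="wait_for_ha",
        python_callable=ha_components_ready,
        poke_interval=60,
        timeout=600,
    )
    run_task = PythonOperator(
        task_id="run_after_ha_check",
        python_callable=run_after_ha_check,
    )
    wait_for_ha >> run_task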
Conclusion
Airflow High Availability Setup ensures uninterrupted workflows—set it up with Installing Airflow (Local, Docker, Cloud), craft DAGs via Defining DAGs in Python, and monitor with Airflow Graph View Explained. Explore more with Airflow Concepts: DAGs, Tasks, and Workflows and Airflow RBAC!