DAG Parameters and Defaults

Apache Airflow is a premier open-source platform for orchestrating workflows, and its Directed Acyclic Graphs (DAGs) are the cornerstone of that process. When you define a DAG in Python—covered in Defining DAGs in Python—parameters and defaults shape how it behaves, from when it runs to how it handles tasks. These settings are your control panel, letting you fine-tune workflows like a simple script with BashOperator or a complex pipeline with Airflow with Apache Spark. This guide, hosted on SparkCodeHub, dives deep into DAG parameters and defaults—what they are, how to set them, and why they matter. We’ll include step-by-step instructions where needed and practical examples to make it clear. New to Airflow? Start with Airflow Fundamentals, and pair this with Introduction to DAGs in Airflow for context.


What Are DAG Parameters and Defaults?

DAG parameters and defaults are the settings you apply when defining a DAG in Airflow—options that control its identity, timing, and behavior. In a Python script, you pass these as arguments to the DAG class—like dag_id, start_date, and schedule_interval—to tell Airflow what the DAG is, when it begins, and how often it runs. Defaults are the fallback values Airflow uses if you don’t specify them—e.g., catchup=True runs past dates unless you set it otherwise. These live in your DAG script, stored in the dags folder (e.g., ~/airflow/dags, per DAG File Structure Best Practices), and drive how the Scheduler (Airflow Architecture (Scheduler, Webserver, Executor)) and Executor (Airflow Executors (Sequential, Local, Celery)) handle your workflow.

Think of parameters as the instructions you give a robot chef—when to start cooking, how often, and whether to catch up on missed meals—and defaults as its built-in assumptions if you don’t say otherwise.
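To make this concrete, here's a minimal sketch—the dag_id and dates are illustrative—that sets a few parameters explicitly and leaves everything else (catchup, max_active_runs, description, and so on) to Airflow's built-in defaults:

from airflow import DAG
from datetime import datetime

# Only three parameters set explicitly; anything omitted falls back to
# Airflow's defaults (e.g., catchup=True, max_active_runs=16, no description).
with DAG(
    dag_id="example_dag",                 # identity: how Airflow names this workflow
    start_date=datetime(2025, 1, 1),      # timing: when the schedule begins
    schedule_interval="@daily",           # behavior: how often it runs
) as dag:
    pass  # tasks would go here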

Why DAG Parameters and Defaults Matter

These settings are what make your DAGs functional and flexible. Without a dag_id, Airflow can’t identify your workflow in the UI (Airflow Web UI Overview) or CLI (Airflow CLI: Overview and Usage). A missing start_date leaves the Scheduler guessing when to begin (Introduction to Airflow Scheduling), and no schedule_interval means no automation—only manual triggers (Triggering DAGs via UI). Defaults like catchup=True or max_active_runs=16 kick in if unset, affecting how Airflow backfills (Catchup and Backfill Scheduling) or limits concurrency (Task Concurrency and Parallelism). Getting them right ensures your workflow runs as intended and is tracked in the metadata database (Airflow Metadata Database Setup).

Core DAG Parameters

Let’s explore the essential parameters you’ll set in every DAG.

dag_id

The dag_id is your DAG’s unique name—like “daily_etl”—set as dag_id="daily_etl". It’s how Airflow labels it in the UI and CLI—keep it short, unique, and meaningful to avoid clashes.

start_date

The start_date—e.g., start_date=datetime(2025, 1, 1)—marks when your DAG begins. Airflow uses it with schedule_interval to calculate run dates—set it in the past for backfills, but pair with catchup=False to skip old runs unless needed.

schedule_interval

The schedule_interval—like @daily or the cron string 0 0 * * *—defines how often the DAG runs. @daily means midnight each day; a cron string like 0 9 * * * means 9 AM—customize with DAG Scheduling (Cron, Timetables). Set it to None for manual runs only.
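The same parameter accepts a preset string, a raw cron expression, or None—a quick sketch of the common values (the variable names are just for illustration):

# Values as they'd be passed to DAG(schedule_interval=...):
daily_preset = "@daily"       # preset shorthand for midnight every day
daily_cron = "0 0 * * *"      # the equivalent cron: minute hour day month weekday
nine_am_cron = "0 9 * * *"    # 9 AM every day
manual_only = None            # no schedule—trigger via UI or CLI only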

catchup

The catchup parameter—default True—decides if Airflow runs past dates from start_date. Set catchup=False to start from now—crucial for avoiding unwanted backfills (Catchup and Backfill Scheduling).
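For example, suppose today is April 7, 2025 and the DAG below was just unpaused—a hedged sketch (the dag_id is illustrative) of what the Scheduler would do:

from airflow import DAG
from datetime import datetime

with DAG(
    dag_id="catchup_demo",
    start_date=datetime(2025, 1, 1),
    schedule_interval="@daily",
    catchup=False,  # flip to True and the Scheduler backfills Jan 1-Apr 6
) as dag:
    pass

# catchup=True  -> one run per missed daily interval since start_date
# catchup=False -> only the most recent interval runs; history is skipped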

Additional DAG Parameters

Beyond the basics, these parameters fine-tune your DAG’s behavior.

max_active_runs

The max_active_runs parameter—default 16, from the max_active_runs_per_dag configuration—limits how many runs of your DAG can be active at once. Set max_active_runs=3 to allow three concurrent runs—useful for parallel schedules (Task Concurrency and Parallelism).

default_args

The default_args dictionary sets task-level defaults—like retries or retry_delay—applied to all tasks unless overridden. Define it outside the DAG block—e.g., default_args = {"retries": 2, "retry_delay": timedelta(minutes=5)}—see Task Retries and Retry Delays.
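A sketch of how the dictionary cascades—every task inherits it, and any explicit task argument wins over the default (the task ids and commands here are illustrative):

from airflow import DAG
from airflow.operators.bash import BashOperator
from datetime import datetime, timedelta

default_args = {
    "retries": 2,
    "retry_delay": timedelta(minutes=5),
}

with DAG(
    dag_id="defaults_demo",
    start_date=datetime(2025, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    default_args=default_args,
) as dag:
    inherits = BashOperator(
        task_id="inherits_defaults",            # gets retries=2 from default_args
        bash_command="echo 'default retries'",
    )
    overrides = BashOperator(
        task_id="overrides_defaults",
        bash_command="echo 'custom retries'",
        retries=5,                              # explicit argument beats the default
    )
    inherits >> overrides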

description

The description—e.g., description="Daily ETL process"—adds a note about your DAG, visible in the UI. It’s optional but helps document purpose.

Setting Up a DAG with Parameters

Let’s define a DAG with these parameters.

Step 1: Prepare Your Airflow Environment

  1. Install Airflow: Follow Installing Airflow (Local, Docker, Cloud)—open your terminal, type cd ~, press Enter, then python -m venv airflow_env, source airflow_env/bin/activate (Mac/Linux) or airflow_env\Scripts\activate (Windows), and pip install apache-airflow.
  2. Initialize the Database: Type airflow db init and press Enter—it creates ~/airflow/airflow.db and the dags folder.
  3. Start the Webserver: In one terminal, activate, type airflow webserver -p 8080, and press Enter—go to localhost:8080.
  4. Start the Scheduler: In another terminal, activate, type airflow scheduler, and press Enter—it scans ~/airflow/dags.

Step 2: Define a DAG with Parameters

  1. Open Your Text Editor: Use Notepad (Windows), TextEdit (Mac), or VS Code.
  2. Write the DAG Code: Paste this:
from airflow import DAG
from airflow.operators.bash import BashOperator
from datetime import datetime, timedelta

# Task-level defaults, applied to every task unless a task overrides them
default_args = {
    "retries": 2,
    "retry_delay": timedelta(minutes=5),
}

with DAG(
    dag_id="param_dag",                  # unique name shown in the UI and CLI
    start_date=datetime(2025, 1, 1),     # schedule baseline
    schedule_interval="@daily",          # run at midnight each day
    catchup=False,                       # don't backfill missed dates
    max_active_runs=2,                   # at most two concurrent DAG runs
    default_args=default_args,
    description="A daily workflow with custom parameters",
) as dag:
    start = BashOperator(
        task_id="start_task",
        bash_command="echo 'Starting!'",
    )
    end = BashOperator(
        task_id="end_task",
        bash_command="echo 'Done!'",
    )
    start >> end                         # start_task runs before end_task
  3. Save the File: Save as param_dag.py in ~/airflow/dags—e.g., /home/username/airflow/dags/param_dag.py or C:\Users\YourUsername\airflow\dags\param_dag.py. Ensure it’s .py (Windows: “Save As,” “All Files,” param_dag.py).

Step 3: Verify and Run the DAG

  1. Check the UI: Open your browser, go to localhost:8080, wait 10-20 seconds—see “param_dag” with its description.
  2. Trigger the DAG: In your terminal, activate, type airflow dags trigger -e 2025-04-07 param_dag, and press Enter—it runs start_task, then end_task.
  3. View Results: Refresh localhost:8080, click “param_dag”—see green circles, with logs in Task Logging and Monitoring.

This DAG runs daily, limits to two active runs, and retries tasks twice with a 5-minute delay.

Customizing DAG Parameters

Let’s tweak parameters for specific needs.

Adjusting Schedule Interval

For 9 AM daily:

with DAG(
    dag_id="morning_dag",
    start_date=datetime(2025, 1, 1),
    schedule_interval="0 9 * * *",
    catchup=False,
) as dag:
    task = BashOperator(task_id="morning_task", bash_command="echo 'Good morning!'")

Save, trigger—it runs at 9 AM (DAG Scheduling (Cron, Timetables)).

Setting Task Defaults

For custom retries:

default_args = {
    "retries": 3,
    "retry_delay": timedelta(minutes=10),
}

with DAG(
    dag_id="retry_dag",
    start_date=datetime(2025, 1, 1),
    schedule_interval="@daily",
    default_args=default_args,
) as dag:
    task = BashOperator(task_id="retry_task", bash_command="echo 'Trying!'")

Tasks retry three times, 10 minutes apart (Task Retries and Retry Delays).

Limiting Concurrent Runs

For one run at a time:

with DAG(
    dag_id="single_run_dag",
    start_date=datetime(2025, 1, 1),
    schedule_interval="@hourly",
    max_active_runs=1,
) as dag:
    task = BashOperator(task_id="single_task", bash_command="echo 'One at a time!'")

If a run’s active, the next waits (Task Concurrency and Parallelism).

Best Practices for DAG Parameters

Use unique, descriptive dag_ids—e.g., “etl_2025_daily.” Set start_date in the past—e.g., datetime(2025, 1, 1)—with catchup=False unless backfilling (Catchup and Backfill Scheduling). Define schedule_interval explicitly—avoid surprises. Set default_args for consistency—e.g., retries—and override per task if needed. Keep scripts in ~/airflow/dags—organize with DAG File Structure Best Practices.
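Pulled together, a DAG header that follows these practices might look like this sketch (the names and values are illustrative):

from airflow import DAG
from datetime import datetime, timedelta

default_args = {"retries": 2, "retry_delay": timedelta(minutes=5)}  # consistent task defaults

with DAG(
    dag_id="etl_2025_daily",                # unique and descriptive
    start_date=datetime(2025, 1, 1),        # fixed date in the past
    schedule_interval="@daily",             # explicit—no surprises
    catchup=False,                          # no backfill unless you want one
    default_args=default_args,
    description="Daily ETL for 2025 data",  # documents intent in the UI
) as dag:
    ...  # tasks go here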

FAQ: Common Questions About DAG Parameters and Defaults

Here are frequent questions about DAG parameters and defaults, with detailed answers.

1. What happens if I don’t set a dag_id in my DAG?

Airflow requires a dag_id—omit it, and you’ll get a Python error like “TypeError: DAG.__init__() missing 1 required positional argument: 'dag_id'” when the file loads. It’s mandatory—set it uniquely (e.g., “my_dag”) to identify the DAG in Airflow Web UI Overview.

2. Why does my DAG run multiple times when I only want one run?

Check catchup—default is True, so if start_date is past (e.g., January 1, 2025) and it’s April 7, 2025, Airflow backfills all missed runs. Set catchup=False to run only from now—see Catchup and Backfill Scheduling.

3. How do I stop my DAG from running past dates I don’t need?

Set catchup=False in your DAG so it starts from the current date instead of backfilling from start_date. Without it, Airflow runs every interval from start_date—control timing with DAG Scheduling (Cron, Timetables).

4. What’s the difference between start_date and schedule_interval?

start_date (e.g., datetime(2025, 1, 1)) is the baseline—Airflow counts runs from there. schedule_interval (e.g., @daily) is the frequency—daily from January 1, 2025. Together, they set the timeline—start_date is static, schedule_interval repeats (Introduction to Airflow Scheduling).
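One subtlety worth a worked example: Airflow triggers each run at the end of its interval, so the run “for” January 1 actually fires at midnight on January 2. Here’s a plain-Python sketch (no Airflow API involved) of that timeline:

from datetime import datetime, timedelta

start_date = datetime(2025, 1, 1)
interval = timedelta(days=1)  # what @daily amounts to

for i in range(3):
    logical_date = start_date + i * interval  # the date the run covers
    fires_at = logical_date + interval        # when the Scheduler triggers it
    print(f"run for {logical_date:%Y-%m-%d} fires at {fires_at:%Y-%m-%d %H:%M}")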

5. Can I override default_args for a specific task in my DAG?

Yes—set default_args={"retries": 2}, then override in a task—e.g., BashOperator(task_id="special_task", bash_command="echo 'Special!'", retries=5). The task gets 5 retries, the others 2—flexible with Task Retries and Retry Delays.

6. Why does my DAG keep running even though another instance is active?

Check max_active_runs—the default is 16 (from the max_active_runs_per_dag config), so several runs of the same DAG can be active at once. Set max_active_runs=1 to allow only one at a time—manage with Task Concurrency and Parallelism.

7. How do I see what parameters my DAG is using after defining it?

Check the UI—go to localhost:8080, click your DAG, and open “Details” for schedule_interval, start_date, and more. Or use the CLI—type airflow dags list and press Enter to confirm the DAG is registered—details in Airflow CLI: Overview and Usage.


Conclusion

DAG parameters and defaults shape your Airflow workflows—set them right with Defining DAGs in Python, install Airflow via Installing Airflow (Local, Docker, Cloud), and optimize with Airflow Performance Tuning. Monitor runs in Monitoring Task Status in UI and explore more with Airflow Concepts: DAGs, Tasks, and Workflows!