DAG Scheduling (Cron, Timetables)
Apache Airflow is a top-tier open-source platform for orchestrating workflows, and scheduling your Directed Acyclic Graphs (DAGs) is what turns static Python scripts into automated, repeatable processes. Whether you’re running a simple task with BashOperator or a complex pipeline pairing Airflow with Apache Spark, DAG scheduling—via cron expressions or custom timetables—dictates when your workflows spring to life. This guide, hosted on SparkCodeHub, explores DAG scheduling in Airflow with cron and timetables—how they work, how to set them up, and why they’re key to automation. We’ll include step-by-step instructions where needed and practical examples to bring it home. New to Airflow? Start with Airflow Fundamentals, and pair this with DAG Parameters and Defaults for a fuller picture.
What is DAG Scheduling in Airflow?
DAG scheduling in Airflow is the process of telling Airflow when to run your workflows—specifically, the tasks within your Directed Acyclic Graphs (DAGs). When you define a DAG in Python—detailed in Defining DAGs in Python—you set a schedule_interval parameter to specify timing. This could be a cron expression like 0 0 * * * (daily at midnight), a preset like @daily, or even a custom timetable for complex schedules. The Airflow Scheduler—part of Airflow Architecture (Scheduler, Webserver, Executor)—reads this from your DAG script in the dags folder (DAG File Structure Best Practices), calculates run dates based on start_date, and queues tasks accordingly. It’s what turns your DAG from a static plan into an automated workflow, tracked in the metadata database (Airflow Metadata Database Setup).
Imagine it as setting an alarm clock for your workflow—cron gives you standard times, timetables offer custom patterns, and Airflow wakes up your tasks right on cue.
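As a minimal sketch (the dag_id and task here are illustrative, not from a real pipeline), the scheduling knobs sit right on the DAG object:

from airflow import DAG
from airflow.operators.bash import BashOperator
from datetime import datetime

with DAG(
    dag_id="example_schedule",  # illustrative id
    start_date=datetime(2025, 1, 1),  # anchor date for run calculations
    schedule_interval="@daily",  # preset, equivalent to "0 0 * * *"
    catchup=False,  # don't backfill missed intervals
) as dag:
    task = BashOperator(task_id="hello", bash_command="echo 'scheduled!'")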
Why DAG Scheduling Matters
Scheduling is the heartbeat of Airflow’s automation—without it, your DAGs would sit idle, requiring manual triggers via the UI (Triggering DAGs via UI) or CLI (Airflow CLI: Overview and Usage). The schedule_interval ties into the Scheduler’s logic, ensuring tasks run when needed—daily ETLs, hourly reports, or custom intervals—while respecting dependencies (DAG Dependencies and Task Ordering). It works with start_date and catchup to handle past runs (Catchup and Backfill Scheduling), and the Executor runs them efficiently (Airflow Executors (Sequential, Local, Celery)). Proper scheduling saves time, ensures consistency, and lets you monitor progress in the web UI (Airflow Web UI Overview).
Without scheduling, you’d be stuck running tasks by hand—no automation, no scale.
Scheduling with Cron Expressions
Cron expressions are a classic way to schedule DAGs—flexible and widely understood.
What Are Cron Expressions?
A cron expression is a five-field string—minute hour day_of_month month day_of_week—defining when your DAG runs, using numbers, asterisks (* for “every”), or ranges. For example, 0 0 * * * means “every day at midnight,” and 0 9 * * 1-5 means “9 AM Monday to Friday.” Airflow’s Scheduler interprets this to queue tasks—set it as schedule_interval="0 0 * * *" in your DAG. Airflow also accepts named presets that expand to cron strings: @hourly (0 * * * *), @daily (0 0 * * *), @weekly (0 0 * * 0), @monthly (0 0 1 * *), and @yearly (0 0 1 1 *).
How Cron Expressions Work
The Scheduler reads your DAG’s schedule_interval, combines it with start_date, and calculates execution dates. Note that Airflow triggers each run at the end of its data interval: with start_date=datetime(2025, 1, 1) and schedule_interval="0 0 * * *", the run with logical date January 1, 2025 actually fires at midnight on January 2, then daily after—viewable in the UI (Monitoring Task Status in UI). If catchup=True, it backfills past dates (Catchup and Backfill Scheduling).
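To preview the dates a cron expression produces before wiring it into a DAG, here is a small sketch using croniter, a library Airflow itself depends on:

from datetime import datetime
from croniter import croniter

it = croniter("0 0 * * *", datetime(2025, 1, 1))
for _ in range(3):
    print(it.get_next(datetime))  # 2025-01-02 00:00, 2025-01-03 00:00, 2025-01-04 00:00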
Common Cron Examples
- Daily at Midnight: schedule_interval="0 0 * * *"—runs 00:00 every day.
- Hourly: schedule_interval="0 * * * *"—runs every hour at the start.
- Weekdays at 9 AM: schedule_interval="0 9 * * 1-5"—runs 09:00 Monday-Friday.
Setting Up a Cron-Scheduled DAG
Steps to Create a Cron-Scheduled DAG
1. Set Up Airflow: Install via Installing Airflow (Local, Docker, Cloud)—in your terminal, type cd ~, press Enter, then python -m venv airflow_env, source airflow_env/bin/activate (Mac/Linux) or airflow_env\Scripts\activate (Windows), and pip install apache-airflow.
2. Initialize the Database: Type airflow db init and press Enter—creates ~/airflow/airflow.db.
3. Write the DAG:
- Open a text editor (Notepad, VS Code).
- Paste:
from airflow import DAG
from airflow.operators.bash import BashOperator
from datetime import datetime

with DAG(
    dag_id="cron_dag",
    start_date=datetime(2025, 1, 1),
    schedule_interval="0 9 * * 1-5",  # 9 AM weekdays
    catchup=False,
) as dag:
    task = BashOperator(
        task_id="weekday_task",
        bash_command="echo 'Good morning, weekdays!'",
    )
- Save as cron_dag.py in ~/airflow/dags—e.g., /home/username/airflow/dags/cron_dag.py.
4. Start Services: In one terminal, activate the environment and run airflow webserver -p 8080. In another, activate and run airflow scheduler.
5. Verify: Go to localhost:8080, wait 10-20 seconds—“cron_dag” appears, scheduled for 9 AM on weekdays.
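Optionally, a quick local check—a sketch assuming the paths above—confirms the file parses and carries the schedule you expect, without waiting for the Scheduler:

import os
from airflow.models import DagBag

bag = DagBag(dag_folder=os.path.expanduser("~/airflow/dags"), include_examples=False)
print(bag.import_errors)  # {} if every file parsed cleanly
print(bag.get_dag("cron_dag").schedule_interval)  # 0 9 * * 1-5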
Scheduling with Timetables
Timetables offer custom scheduling beyond cron—introduced in Airflow 2.2 for flexibility.
What Are Timetables?
Timetables are Python classes you define to create custom schedules—like every other Tuesday or the last Friday of the month. Unlike cron’s fixed fields, timetables use Python logic for complex patterns, passed to the DAG via the timetable argument (or schedule in Airflow 2.4+)—e.g., timetable=MyCustomTimetable(). They’re powerful for non-standard intervals.
How Timetables Work
You write a Timetable subclass, override next_dagrun_info (plus infer_manual_data_interval for manual triggers), and pass an instance to the DAG’s timetable argument. The Scheduler uses it to calculate run dates—e.g., a timetable for “every second Tuesday” queues runs accordingly. The class must also be registered through a plugin in ~/airflow/plugins so the scheduler and webserver can deserialize it—manage with Airflow Plugins: Development and Usage.
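As a minimal sketch of that registration step (the file and plugin names are illustrative), assuming the EverySecondTuesday class below lives in ~/airflow/plugins/every_second_tuesday.py, add this to the same file:

from airflow.plugins_manager import AirflowPlugin

class EverySecondTuesdayPlugin(AirflowPlugin):
    name = "every_second_tuesday_plugin"
    timetables = [EverySecondTuesday]  # registers the class for (de)serialization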
Creating a Custom Timetable
Steps to Create a Timetable-Scheduled DAG
1. Ensure Airflow 2.2+: Type airflow version, press Enter—needs 2.2 or higher. Upgrade with pip install apache-airflow --upgrade if older.
2. Write the Timetable and DAG:
- Open your editor, paste:
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.timetables.base import DagRunInfo, DataInterval, Timetable
from datetime import datetime
import pendulum

class EverySecondTuesday(Timetable):
    # Note: register this class through a plugin (see the sketch above) so
    # the scheduler and webserver can deserialize it.
    def infer_manual_data_interval(self, *, run_after):
        # Required by the Timetable interface; covers manually triggered runs.
        return DataInterval.exact(run_after)

    def next_dagrun_info(self, *, last_automated_data_interval, restriction):
        if last_automated_data_interval is not None:
            # Two weeks after the previous run.
            next_start = last_automated_data_interval.start.add(weeks=2)
        else:
            # First run: from start_date (or from now, if catchup is off).
            start = restriction.earliest or pendulum.now("UTC")
            if not restriction.catchup:
                start = max(start, pendulum.now("UTC"))
            next_start = start.start_of("day")
            while next_start.weekday() != 1:  # Monday is 0, so Tuesday is 1
                next_start = next_start.add(days=1)
        next_start = next_start.set(hour=9, minute=0, second=0)  # 9 AM
        if restriction.latest is not None and next_start > restriction.latest:
            return None  # past the DAG's end_date; no more runs
        return DagRunInfo.exact(next_start)

with DAG(
    dag_id="timetable_dag",
    start_date=datetime(2025, 1, 1),
    timetable=EverySecondTuesday(),
    catchup=False,
) as dag:
    task = BashOperator(
        task_id="tuesday_task",
        bash_command="echo 'Every second Tuesday at 9 AM!'",
    )
- Save as timetable_dag.py in ~/airflow/dags.
3. Install Dependencies: pendulum ships as a core Airflow dependency; if it’s somehow missing, type pip install pendulum and press Enter.
4. Start Services: As in the cron example—airflow webserver -p 8080 in one terminal, airflow scheduler in another.
5. Verify: At localhost:8080, see “timetable_dag”—it runs every second Tuesday at 9 AM.
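You can also sanity-check the timetable’s logic outside the Scheduler. A sketch, assuming the class is importable from the file above:

import pendulum
from airflow.timetables.base import TimeRestriction
from every_second_tuesday import EverySecondTuesday  # hypothetical import; adjust to where the class lives

restriction = TimeRestriction(
    earliest=pendulum.datetime(2025, 1, 1, tz="UTC"),  # mirrors the DAG's start_date
    latest=None,
    catchup=True,  # compute from earliest rather than from "now"
)
info = EverySecondTuesday().next_dagrun_info(
    last_automated_data_interval=None, restriction=restriction
)
print(info.run_after)  # 2025-01-07T09:00:00+00:00, the first Tuesday on or after January 1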
Details in Custom Timetables in Airflow.
Key Scheduling Considerations
Catchup and Backfill
Set catchup=True to run missed intervals from start_date—e.g., January to April 2025—or False to start now (Catchup and Backfill Scheduling).
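A minimal sketch (the dag_id is illustrative): unpausing this DAG in April 2025 immediately creates a run for every missed daily interval since January 1:

from airflow import DAG
from airflow.operators.bash import BashOperator
from datetime import datetime

with DAG(
    dag_id="catchup_dag",  # illustrative id
    start_date=datetime(2025, 1, 1),
    schedule_interval="@daily",
    catchup=True,  # backfill every interval from start_date
) as dag:
    task = BashOperator(task_id="noop", bash_command="echo 'backfilled run'")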
Time Zones
Use pendulum.timezone—e.g., start_date=pendulum.datetime(2025, 1, 1, tz="America/New_York")—for local times (Time Zones in Airflow Scheduling).
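A short sketch (the dag_id is illustrative): with a zone-aware start_date, the cron fields are interpreted in that zone, so this runs at 9 AM New York time rather than 9 AM UTC:

import pendulum
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="tz_dag",  # illustrative id
    start_date=pendulum.datetime(2025, 1, 1, tz="America/New_York"),
    schedule_interval="0 9 * * *",
    catchup=False,
) as dag:
    task = BashOperator(task_id="local_task", bash_command="echo 'local 9 AM'")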
Pausing and Resuming
Pause with airflow dags pause my_dag, resume with airflow dags unpause my_dag (Pause and Resume DAGs).
Best Practices for DAG Scheduling
Use cron for standard schedules—e.g., 0 0 * * *—and timetables for custom needs (Custom Timetables in Airflow). Set start_date in the past—e.g., datetime(2025, 1, 1)—with catchup=False unless backfilling. Test with airflow dags test my_dag 2025-04-07 (DAG Testing with Python). Keep DAGs in ~/airflow/dags—organize with DAG File Structure Best Practices.
FAQ: Common Questions About DAG Scheduling (Cron, Timetables)
Here are frequent questions about scheduling DAGs, with detailed answers from online sources.
1. Why does my DAG run every minute instead of daily?
Check schedule_interval—if it’s * * * * * (every minute) instead of 0 0 * * * (daily), fix it—e.g., schedule_interval="0 0 * * *"—and resave. Test with airflow dags test my_dag 2025-04-07 (DAG Testing with Python).
2. How do I schedule a DAG to run every weekday at 8 AM?
Use schedule_interval="0 8 * * 1-5"—that’s 8 AM Monday-Friday. Set it in your DAG, save, and verify at localhost:8080—details in Cron Expressions in Airflow.
3. What’s the difference between cron and timetables for scheduling?
Cron uses five fields (e.g., 0 0 * * *) for fixed intervals—simple but limited. Timetables are Python classes for custom schedules—like every second Tuesday—offering flexibility beyond cron (Custom Timetables in Airflow).
4. Why are my DAG runs starting at the wrong time zone?
The default is UTC—set start_date=pendulum.datetime(2025, 1, 1, tz="America/New_York") for local time. Check how the UI displays times (Time Zones in Airflow Scheduling), and adjust the default in airflow.cfg (Airflow Configuration Options).
5. How do I stop a DAG from running automatically without deleting it?
Pause it—type airflow dags pause my_dag and press Enter—it stops scheduling but keeps history. Resume with airflow dags unpause my_dag (Pause and Resume DAGs).
6. Can I schedule a DAG to run only once?
Set schedule_interval=None and trigger manually with airflow dags trigger -e 2025-04-07 my_dag (Triggering DAGs via UI)—it won’t run again unless triggered. Alternatively, the @once preset schedules exactly one automatic run.
7. How do I backfill past dates for a DAG that missed its schedule?
Set catchup=True with start_date=datetime(2025, 1, 1), and Airflow runs all missed intervals from January 1, 2025. Or trigger a backfill explicitly with airflow dags backfill -s 2025-01-01 -e 2025-04-07 my_dag—details in Catchup and Backfill Scheduling.
Conclusion
DAG scheduling with cron and timetables powers Airflow’s automation—set them right with Defining DAGs in Python, install Airflow via Installing Airflow (Local, Docker, Cloud), and optimize with Airflow Performance Tuning. Monitor in Monitoring Task Status in UI and explore more with Airflow Concepts: DAGs, Tasks, and Workflows!