Airflow Metadata Database Setup
Apache Airflow is a powerful open-source platform for orchestrating workflows, and at its core lies the metadata database—a critical component that tracks everything from task statuses to run histories. Setting up this database correctly is essential to making Airflow work smoothly, whether you’re running simple scripts with BashOperator or complex pipelines that pair Airflow with Apache Spark. This guide, hosted on SparkCodeHub, dives deep into Airflow metadata database setup, covering why it matters, how to configure it with SQLite or PostgreSQL, and how to maintain it, with step-by-step instructions and practical insights along the way. New to Airflow? Start with Airflow Fundamentals, and pair this with Airflow Architecture (Scheduler, Webserver, Executor) for context.
What is the Airflow Metadata Database?
The Airflow metadata database is the memory bank of your Airflow instance—it’s where all the details about your workflows are stored. When you define a Directed Acyclic Graph (DAG) in Python—explored in Introduction to DAGs in Airflow—and run it, the database records when it started, which tasks succeeded or failed, and how long they took. It’s the backbone that the Scheduler, Webserver, and Executor rely on, tracking states (running, success, failed), run dates, and more. Without it, Airflow would be a forgetful mess—unable to retry tasks, show past runs in the UI (Airflow Web UI Overview), or manage dependencies (Task Dependencies (set_upstream, set_downstream)). By default, Airflow uses SQLite, but for bigger setups, PostgreSQL is the go-to choice—both of which we’ll set up here.
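For a concrete look at what the database holds, you can peek at it directly once it exists. Here is a minimal sketch, assuming the default SQLite file at ~/airflow/airflow.db and the standard Airflow 2.x task_instance table (column names can shift between versions, so treat it as illustrative only):
from pathlib import Path
import sqlite3

# Open the default SQLite metadata database (assumed location).
conn = sqlite3.connect(str(Path.home() / "airflow" / "airflow.db"))
# Show the five most recent task runs Airflow has recorded.
for row in conn.execute(
    "SELECT dag_id, task_id, state, start_date "
    "FROM task_instance ORDER BY start_date DESC LIMIT 5"
):
    print(row)
conn.close()
In day-to-day use you would read this information through the UI or CLI rather than raw SQL, but the query makes clear what the Scheduler, Executor, and Webserver are all sharing.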
Why the Metadata Database Matters
The metadata database is what makes Airflow smart and reliable. Imagine running a daily ETL job—without the database, Airflow wouldn’t know if yesterday’s run succeeded or needs a retry (Task Retries and Retry Delays). The Scheduler checks it to queue tasks (Introduction to Airflow Scheduling), the Executor updates it as tasks complete (Airflow Executors (Sequential, Local, Celery)), and the Webserver pulls it to display statuses (Monitoring Task Status in UI). It’s also key for backfilling past runs with Catchup and Backfill Scheduling—without history, that’s impossible. Setting it up right ensures Airflow can handle your workflows, whether small or massive.
Setting Up the Metadata Database with SQLite
SQLite is Airflow’s default database—lightweight and easy to set up, perfect for testing or small projects. Here’s how to get it running.
Step 1: Install Airflow
- Open Your Terminal: On Windows, press the Windows key, type “cmd,” and press Enter to open Command Prompt. On Mac, click the magnifying glass, type “Terminal,” and hit Enter. On Linux, press Ctrl+Alt+T or search “Terminal.”
- Navigate to Your Home Directory: Type cd ~ (Mac/Linux) or cd %userprofile% (Windows) and press Enter—e.g., /home/username or C:\Users\YourUsername.
- Create a Virtual Environment: Type python -m venv airflow_env (or python3 -m venv airflow_env if needed) and press Enter—it creates ~/airflow_env.
- Activate the Environment: Type source airflow_env/bin/activate (Mac/Linux) or airflow_env\Scripts\activate (Windows) and press Enter—see (airflow_env) in your prompt.
- Install Airflow: Type pip install apache-airflow and press Enter—it downloads and installs Airflow. Verify with airflow version—e.g., “2.4.3.”
This follows Installing Airflow (Local, Docker, Cloud)—SQLite comes built-in, no extras needed.
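Condensed into one Mac/Linux terminal session, the step looks like this (Windows users swap in the cd %userprofile% and airflow_env\Scripts\activate variants above):
cd ~
python3 -m venv airflow_env
source airflow_env/bin/activate
pip install apache-airflow
airflow version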
Step 2: Initialize the SQLite Database
- Run the Init Command: In your activated terminal, type airflow db init and press Enter. This creates ~/airflow/airflow.db and the airflow folder with airflow.cfg.
- Check the Output: You’ll see logs about creating tables like “dag” and “task_instance”—it takes a few seconds, then your prompt returns.
- Verify the File: Type ls -a ~/airflow (Mac/Linux) or dir %userprofile%\airflow (Windows) and press Enter—look for airflow.db.
This sets up SQLite as your metadata database—it’s ready for small-scale use.
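Again as one session (Mac/Linux shown):
source ~/airflow_env/bin/activate
airflow db init        # creates ~/airflow/airflow.cfg and ~/airflow/airflow.db
ls -a ~/airflow        # airflow.cfg and airflow.db should both be listed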
Step 3: Start Airflow Services
- Open a Terminal for the Webserver: Activate your environment (the same activation command from Step 1), type airflow webserver -p 8080, and press Enter—go to localhost:8080 in your browser.
- Open Another Terminal for the Scheduler: Activate again, type airflow scheduler, and press Enter—it starts scanning ~/airflow/dags.
SQLite is now tracking your workflows—add a DAG to ~/airflow/dags and watch it work.
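Note that Airflow 2’s web UI requires a login, so if localhost:8080 asks for credentials you don’t have yet, create an admin account with the users command. A minimal example, with placeholder values you should replace:
airflow users create \
    --username admin --password admin \
    --firstname Admin --lastname User \
    --role Admin --email admin@example.com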
Setting Up the Metadata Database with PostgreSQL
PostgreSQL is the choice for production—it’s robust, scalable, and handles multiple users better than SQLite. Here’s how to set it up.
Step 1: Install PostgreSQL
- Open Your Browser: Use Chrome, Firefox, etc.
- Visit PostgreSQL’s Site: Type postgresql.org in the address bar, press Enter, and click “Download” at the top.
- Download for Your System: Pick your OS—Windows, Mac, or Linux. For Windows/Mac, download the installer (e.g., postgresql-15.2-1-windows-x64.exe); for Linux (Ubuntu), skip to the Linux install step below.
- Run the Installer: Double-click the downloaded file, follow the prompts (accept defaults, set a password like “airflow123”), and install—it adds PostgreSQL to your system.
- Install on Linux (Ubuntu): In your terminal, type sudo apt update, press Enter, then sudo apt install postgresql postgresql-contrib, and press Enter—enter your system password if prompted.
- Verify Installation: Type psql --version (or psql -V) and press Enter—you’ll see something like “psql (PostgreSQL) 15.2.”
PostgreSQL is now on your machine—ready for Airflow.
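On Ubuntu, the whole install and check comes down to:
sudo apt update
sudo apt install postgresql postgresql-contrib
psql --version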
Step 2: Create a PostgreSQL Database
- Start PostgreSQL Service: On Windows/Mac, it starts automatically—on Linux, type sudo service postgresql start and press Enter.
- Access PostgreSQL: Type psql -U postgres and press Enter (on Linux, where the postgres role defaults to peer authentication, use sudo -u postgres psql instead)—enter your password (e.g., “airflow123”) if you set one during install. You’ll see the postgres=# prompt.
- Create the Database: Type CREATE DATABASE airflow; and press Enter—you’ll see “CREATE DATABASE.”
- Exit PostgreSQL: Type \q and press Enter—back to your terminal.
This creates an airflow database for Airflow to use. On Linux, also give the postgres role a password so Airflow can connect over TCP; the session sketch below shows both commands.
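The full psql session looks roughly like this; the ALTER USER line matters mainly on Linux, where the postgres role starts without a password (replace 'airflow123' with your own):
$ sudo -u postgres psql        # or psql -U postgres on Windows/Mac
postgres=# CREATE DATABASE airflow;
CREATE DATABASE
postgres=# ALTER USER postgres PASSWORD 'airflow123';
ALTER ROLE
postgres=# \q
In production you would normally create a dedicated airflow role instead of reusing the postgres superuser, but the superuser keeps this walkthrough short.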
Step 3: Install PostgreSQL Driver for Airflow
- Activate Your Airflow Environment: In your terminal, type source airflow_env/bin/activate (Mac/Linux) or airflow_env\Scripts\activate (Windows) and press Enter—see (airflow_env).
- Install the Driver: Type pip install "apache-airflow[postgres]" (the quotes stop some shells from mangling the brackets) and press Enter—it adds psycopg2, the PostgreSQL driver, to Airflow.
This ensures Airflow can talk to PostgreSQL.
Step 4: Configure Airflow to Use PostgreSQL
- Open airflow.cfg: Type nano ~/airflow/airflow.cfg (Mac/Linux) and press Enter, or open C:\Users\YourUsername\airflow\airflow.cfg in Notepad (Windows).
- Find the Database Setting: Scroll to [database] (or [core] in older versions), locate sql_alchemy_conn.
- Update the Connection String: Change it to sql_alchemy_conn = postgresql+psycopg2://postgres:airflow123@localhost:5432/airflow—replace airflow123 with your password. Save (Ctrl+O, Enter, Ctrl+X in nano; File > Save in Notepad).
- Alternative with Environment Variable: Skip editing—type export AIRFLOW__DATABASE__SQL_ALCHEMY_CONN=postgresql+psycopg2://postgres:airflow123@localhost:5432/airflow (Mac/Linux) or set AIRFLOW__DATABASE__SQL_ALCHEMY_CONN=postgresql+psycopg2://postgres:airflow123@localhost:5432/airflow (Windows) and press Enter.
This links Airflow to PostgreSQL—secure it with Airflow Configuration Options.
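After the edit, the relevant piece of airflow.cfg should read roughly like this (older releases keep the setting under [core], and the matching environment variable there is AIRFLOW__CORE__SQL_ALCHEMY_CONN):
[database]
# format: postgresql+psycopg2://<user>:<password>@<host>:<port>/<database>
sql_alchemy_conn = postgresql+psycopg2://postgres:airflow123@localhost:5432/airflow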
Step 5: Initialize the PostgreSQL Database
- Run the Init Command: In your activated terminal, type airflow db init and press Enter—it creates tables in your airflow PostgreSQL database.
- Check the Output: Look for logs about table creation—no errors means success.
PostgreSQL is now your metadata database.
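To confirm the tables landed in PostgreSQL rather than the old SQLite file, verify from both sides:
airflow db check                                   # should report a successful connection
psql -U postgres -h localhost -d airflow -c "\dt"  # lists tables such as dag, dag_run, task_instance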
Step 6: Start Airflow Services
- Launch the Webserver: In one terminal, activate your environment, type airflow webserver -p 8080, and press Enter—check localhost:8080.
- Launch the Scheduler: In another terminal, activate, type airflow scheduler, and press Enter—it starts scheduling with PostgreSQL.
Your production-ready database is live—test it with a DAG.
Maintaining the Metadata Database
Keeping your database healthy ensures Airflow runs smoothly—here’s how.
Upgrading the Database
When you update Airflow (e.g., pip install apache-airflow --upgrade), the database schema might change. Type airflow db upgrade and press Enter—it updates tables to match the new version, keeping everything compatible—see Airflow Version Upgrades.
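A typical upgrade, then, is just the following (note that Airflow 2.7+ renames the command to airflow db migrate and keeps db upgrade as a deprecated alias):
pip install --upgrade apache-airflow
airflow db upgrade      # airflow db migrate on 2.7 and later
# restart the webserver and scheduler so they pick up the new version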
Resetting the Database
If things go wrong—corrupted data or testing clutter—reset it. Type airflow db reset --yes and press Enter—it wipes all data and reinitializes. Back up airflow.db (SQLite) or dump PostgreSQL (pg_dump airflow > backup.sql) first if needed—use with caution.
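For a PostgreSQL-backed instance, a cautious reset looks like this:
pg_dump -U postgres airflow > airflow_backup.sql   # back up first
airflow db reset --yes                             # wipes and re-creates every table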
Optimizing Performance
For PostgreSQL, tweak connection limits—edit airflow.cfg, set sql_alchemy_pool_size = 10 under [database], and restart services. Monitor with Database Performance in Airflow to handle larger workloads.
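In airflow.cfg that tweak looks like the snippet below; sql_alchemy_max_overflow is a second, related knob, and both numbers are only a starting point to tune against your executor’s worker count and PostgreSQL’s max_connections:
[database]
sql_alchemy_pool_size = 10
sql_alchemy_max_overflow = 20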
Practical Example: Setting Up and Testing
Create a DAG in ~/airflow/dags/example_dag.py:
from airflow import DAG
from airflow.operators.bash import BashOperator
from datetime import datetime

with DAG(
    dag_id="db_example",
    start_date=datetime(2025, 1, 1),
    schedule_interval="@daily",
) as dag:
    task = BashOperator(
        task_id="say_hi",
        bash_command="echo 'Database is working!'",
    )
With PostgreSQL running (Steps 1-6 above), type airflow dags trigger -e 2025-04-07 db_example—it runs, and the database logs it. Check localhost:8080—see “db_example” with a green “say_hi.”
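If the run doesn’t appear, remember that newly added DAGs are paused by default, so unpause before triggering:
airflow dags unpause db_example
airflow dags trigger -e 2025-04-07 db_example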
FAQ: Common Questions About Airflow Metadata Database Setup
Here are frequent questions about the metadata database, with detailed answers from online sources.
1. Why does Airflow need a metadata database?
The database tracks everything—task states, run times, DAG history—so the Scheduler knows what’s next, the Executor updates progress, and the Webserver shows results. Without it, Airflow can’t remember past runs or retry tasks—it’s the glue that makes automation work, as in Airflow Architecture (Scheduler, Webserver, Executor).
2. Can I use SQLite for production instead of PostgreSQL?
SQLite works for testing—it’s simple, file-based, and needs no setup—but it’s not suited to production. It allows only one writer at a time and restricts Airflow to the SequentialExecutor, so it can’t keep up with concurrent users or high task volumes. PostgreSQL handles concurrency and scale far better—stick to SQLite for small tests, switch to PostgreSQL for real workloads with Database Performance in Airflow.
3. What happens if I don’t initialize the database with airflow db init?
If you skip it, commands like airflow scheduler fail—expect errors like “database not found.” Airflow needs those tables to store data—running airflow db init (Step 2 for SQLite, Step 5 for PostgreSQL) creates them. Do it after installation or config changes.
4. How do I switch from SQLite to PostgreSQL after starting with SQLite?
Install PostgreSQL and the driver (Steps 1-3 for PostgreSQL), update sql_alchemy_conn in airflow.cfg to postgresql+psycopg2://postgres:password@localhost:5432/airflow, then run airflow db init—it sets up PostgreSQL. Old SQLite data (~/airflow/airflow.db) won’t transfer—start fresh or migrate manually with tools like pgloader—see Airflow Configuration Options.
5. What should I do if I get a “connection refused” error with PostgreSQL?
It means Airflow can’t reach PostgreSQL—check whether it’s running (sudo service postgresql status on Linux, Task Manager on Windows, or Activity Monitor on Mac). Start it with sudo service postgresql start (Linux) or the PostgreSQL app (Windows/Mac). Ensure sql_alchemy_conn matches your setup—e.g., postgresql+psycopg2://postgres:airflow123@localhost:5432/airflow—and that the password is correct. Test with psql -U postgres -d airflow.
6. How do I back up my metadata database before resetting it?
For SQLite, copy ~/airflow/airflow.db—type cp ~/airflow/airflow.db ~/airflow_backup.db (Mac/Linux) or copy %userprofile%\airflow\airflow.db %userprofile%\airflow_backup.db (Windows) and press Enter. For PostgreSQL, dump it—type pg_dump -U postgres airflow > airflow_backup.sql and press Enter. Restore with cp ~/airflow_backup.db ~/airflow/airflow.db (SQLite) or psql -U postgres airflow < airflow_backup.sql (PostgreSQL).
7. Can I use the CLI to check if my database is set up correctly?
Yes—type airflow db check and press Enter—it tests the connection. If it reports a successful connection, you’re set. If it fails (e.g., “connection refused”), recheck sql_alchemy_conn in airflow.cfg or your environment variable—use Airflow CLI: Overview and Usage for more commands.
Conclusion
Setting up Airflow’s metadata database—whether with SQLite or PostgreSQL—is your foundation for reliable workflows. Follow these steps after Installing Airflow (Local, Docker, Cloud), write DAGs with Defining DAGs in Python, and monitor them in Monitoring Task Status in UI. Optimize with Database Performance in Airflow and explore more in Airflow Concepts: DAGs, Tasks, and Workflows!