The component responsible for reading our Python files that contain DAGs and creating internal DAG objects out of our code for execution has been the Scheduler itself for a pretty long time. But with Airflow 2.3.0, a lot of this code was refactored and the processing was separated into a component called DagFileProcessorManager. So if you want, you can now run a separate DagFileProcessorManager process with the following command:
$ airflow dag-processor [OPTIONS]
With [OPTIONS], you can configure a bunch of things (an example invocation follows the list):
- Log file of the process (-l|--log-file PATH)
- Daemonize the process (-D|--daemon)
- Specify the DAGs folder (-S|--subdir PATH) – if not specified, the path configured in airflow.cfg will be used.
- … and a couple of other things
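For example, assuming the standard daemonization flags that Airflow's long-running commands share (the DAGs folder and log-file paths here are just placeholders):

$ airflow dag-processor --subdir /opt/airflow/dags --daemon --log-file /var/log/airflow/dag-processor.log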
If you do use the standalone DAG processor (the command above), then make sure the Scheduler does not also do its own file processing. This is configured by enabling the following configuration option (the default is False):

[scheduler]
standalone_dag_processor = True
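The same option can also be set through Airflow's usual AIRFLOW__<SECTION>__<KEY> environment-variable convention, for example:

$ export AIRFLOW__SCHEDULER__STANDALONE_DAG_PROCESSOR=True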
So what if you do not do any of the above, i.e., run the command and change the config? If standalone_dag_processor is disabled, then the Scheduler itself runs a class called DagFileProcessorAgent, which creates a child process that runs the same DagFileProcessorManager inside it. So instead of running as a separate process (via the command above), the manager runs as part of the Scheduler itself.
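To make the parent/child layout concrete, here is a minimal sketch of that pattern using Python's multiprocessing module. The names only loosely mirror the Airflow classes; this illustrates the process structure, not Airflow's actual implementation.

```python
import multiprocessing
import time


def manager_loop(dags_folder):
    """Stand-in for DagFileProcessorManager: repeatedly scans the DAGs folder."""
    while True:
        # In Airflow this would list the .py files and queue them for parsing.
        print(f"scanning {dags_folder} for DAG files ...")
        time.sleep(5)


class Agent:
    """Stand-in for DagFileProcessorAgent: owned by the Scheduler, it only
    starts and supervises the manager as a child process."""

    def __init__(self, dags_folder):
        self.dags_folder = dags_folder
        self.process = None

    def start(self):
        self.process = multiprocessing.Process(
            target=manager_loop, args=(self.dags_folder,), daemon=True
        )
        self.process.start()


if __name__ == "__main__":
    # The "Scheduler" starts the agent, which spawns the manager as its child.
    agent = Agent("/path/to/dags")  # placeholder path
    agent.start()
    time.sleep(12)  # the Scheduler would be doing its scheduling work here
```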
One of the benefits of having a separate DAG processor process is that now you can run multiple of these processes with different dag folders. So if you want to have some form of code separation across multiple users (multi-tenant systems), then this allows for that.
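For instance, two standalone processors could each watch a different team's folder (the paths below are placeholders):

$ airflow dag-processor --subdir /opt/airflow/dags/team_a --daemon
$ airflow dag-processor --subdir /opt/airflow/dags/team_b --daemon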
I did not talk about this earlier to keep things simple, but when the
DagFileProcessorManager has to actually parse and process a DAG file, it offloads the task to a component called
DagFileProcessorProcess. So in reality, there are these two separate components/classes in Airflow responsible for DAG file parsing – processor manager and processor process.
The processor manager spawns a separate DagFileProcessorProcess to parse each file. So if there are 10 source files to be parsed, the manager will offload the parsing task to 10 different DagFileProcessorProcess instances.
These processes are created and do their work sequentially, but if you need a performance improvement, the parsing_processes configuration option can help:
[scheduler]
# The scheduler can run multiple processes in parallel to parse dags.
# This defines how many processes will run.
parsing_processes = 2
# or set via the AIRFLOW__SCHEDULER__PARSING_PROCESSES environment variable
The default value of 2 means that, at any given time, the manager will parse and process 2 files concurrently by spawning 2 DagFileProcessorProcess instances.
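As a rough mental model of how a cap like parsing_processes behaves, here is a small sketch that parses a list of files while keeping at most that many worker processes busy at once. It illustrates the bounded-parallelism idea, not Airflow's actual manager loop, and the file paths are placeholders.

```python
import multiprocessing


def parse_dag_file(path):
    # Stand-in for DagFileProcessorProcess: in Airflow this would import the
    # file, collect the DAG objects it defines, and record them for the Scheduler.
    return f"parsed {path}"


if __name__ == "__main__":
    parsing_processes = 2  # mirrors [scheduler] parsing_processes
    dag_files = [f"/path/to/dags/dag_{i}.py" for i in range(10)]  # placeholder paths

    # A pool of `parsing_processes` workers: at most 2 files are parsed at once;
    # the remaining files wait for a free worker.
    with multiprocessing.Pool(processes=parsing_processes) as pool:
        for result in pool.imap_unordered(parse_dag_file, dag_files):
            print(result)
```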