Which Airflow Components Process the DAG Files
The component responsible for reading our Python files that contain DAGs and turning that code into internal DAG objects for execution was, for a long time, the Scheduler itself. But with Airflow 2.3.0, a lot of this code was refactored and the processing was separated into a component called DagFileProcessorManager.
DagFileProcessorManager
So if you want, you can now run a separate DagFileProcessorManager process with the following command:
$ airflow dag-processor [OPTIONS]
In the [OPTIONS], you can configure a bunch of things:
- Log file of the process (-l|--logfile LOG_FILE)
- Daemonize the process (-D|--daemon)
- Specify the DAGs folder (-S|--subdir PATH) – if not specified, dags_folder from airflow.cfg will be used
- … and a couple of other things
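Putting a few of these options together, an invocation might look like the following sketch (the paths are placeholders, adjust them to your deployment):
$ airflow dag-processor -S /opt/airflow/dags -D -l /var/log/airflow/dag-processor.log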
If you do use the standalone DAG processor (the command above), then make sure the Scheduler does not also do its own file processing. This is controlled by the following configuration option (default is False):
[scheduler]
standalone_dag_processor = True
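If you manage configuration through environment variables instead of editing airflow.cfg, the usual AIRFLOW__SECTION__KEY convention applies here as well:
export AIRFLOW__SCHEDULER__STANDALONE_DAG_PROCESSOR=True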
So what if you do none of the above, i.e., you neither run the command nor change the config? If standalone_dag_processor is disabled, the Scheduler itself runs a class called DagFileProcessorAgent, which creates a child process that runs the same DagFileProcessorManager inside it. So instead of running as a separate process (via the command above), it runs as part of the Scheduler itself.
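As a rough mental model, here is an illustrative sketch (not Airflow's actual code; the function and path below are made up) of what "start the manager in a child process" boils down to:
import multiprocessing

def run_manager(dag_folder: str) -> None:
    # Stand-in for the DagFileProcessorManager's processing loop.
    print(f"manager: watching {dag_folder}")

if __name__ == "__main__":
    # Stand-in for what DagFileProcessorAgent does inside the Scheduler
    # when standalone_dag_processor is disabled.
    child = multiprocessing.Process(target=run_manager, args=("/opt/airflow/dags",))
    child.start()
    child.join()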
One of the benefits of having a separate DAG processor process is that you can now run multiple of these processes, each pointed at a different DAG folder. So if you want some form of code separation across multiple users (multi-tenant systems), this allows for that.
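For instance, two teams could each get their own processor instance (the folder names below are hypothetical):
$ airflow dag-processor -S /opt/airflow/dags/team_a -D
$ airflow dag-processor -S /opt/airflow/dags/team_b -D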
DagFileProcessorProcess
I did not talk about this earlier to keep things simple, but when the DagFileProcessorManager has to actually parse and process a DAG file, it offloads the task to a component called DagFileProcessorProcess. So in reality, there are two separate components/classes in Airflow responsible for DAG file parsing – the processor manager and the processor process.
The processor manager spawns a separate process of DagFileProcessorProcess to parse each file. So if there are 10 source files to be parsed, the manager will offload the parsing task to 10 different DagFileProcessorProcess processes.
These processes are created and run with only limited parallelism by default, but if you need a performance improvement, the parsing_processes configuration option can help:
[scheduler]
# The scheduler can run multiple processes in parallel to parse dags.
# This defines how many processes will run.
parsing_processes = 2 # or AIRFLOW__SCHEDULER__PARSING_PROCESSES env
The default value of 2 means that at any given time, the manager will parse and process 2 files concurrently by spawning 2 DagFileProcessorProcess processes at the same time.
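To make those semantics concrete, here is an illustrative sketch (again, not Airflow's actual code; the file names are made up) of a manager-like loop that caps concurrent parsing at 2 child processes:
from multiprocessing import Pool

def parse_dag_file(path: str) -> str:
    # Stand-in for what one DagFileProcessorProcess does with one file.
    return f"parsed {path}"

if __name__ == "__main__":
    dag_files = [f"/opt/airflow/dags/dag_{i}.py" for i in range(10)]
    # With processes=2, at most 2 files are being parsed at any moment,
    # mirroring parsing_processes = 2.
    with Pool(processes=2) as pool:
        for result in pool.imap_unordered(parse_dag_file, dag_files):
            print(result)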