Which Airflow Components Process the DAG Files

For a pretty long time, the component responsible for reading our Python files that contain DAGs and turning that code into internal DAG objects for execution was the Scheduler itself. But with Airflow 2.3.0, a lot of code was refactored and this processing was separated out into a component called DagFileProcessorManager.

DagFileProcessorManager

So if you want, you can now run DagFileProcessorManager as a separate process with the following command:

$ airflow dag-processor [OPTIONS]

In the [OPTIONS], you can configure a bunch of things:

  • Log file of the process (-l|--log-file LOG_FILE)
  • Daemonize the process (-D|--daemon)
  • Specify the DAGs folder (-S|--subdir PATH) – If not specified, dags_folder from airflow.cfg will be used.
  • … and a couple of other things – see the example invocation below
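
For example, a processor that watches a specific DAGs folder and runs as a daemon could be started like this (the folder path is just a placeholder):

$ airflow dag-processor --subdir /opt/airflow/dags --daemon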

If you do use the standalone DAG processor (the command above), make sure the Scheduler does not also do its own file processing. You do that by enabling the following configuration option (its default is False):

[scheduler]
standalone_dag_processor = True
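
Like any other Airflow config option, this can also be set through the standard AIRFLOW__<SECTION>__<KEY> environment-variable mapping:

$ export AIRFLOW__SCHEDULER__STANDALONE_DAG_PROCESSOR=True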

So what if you do not do all of the above, i.e., run the command and change the config? If standalone_dag_processor is disabled, the Scheduler itself runs a class called DagFileProcessorAgent, which creates a child process that runs the same DagFileProcessorManager inside it. So instead of running as a separate process (via the command above), the manager runs as part of the Scheduler itself.
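
To summarize the two modes described above:

standalone_dag_processor = False (default) -> the Scheduler starts DagFileProcessorAgent, which forks a child process running DagFileProcessorManager
standalone_dag_processor = True            -> you run DagFileProcessorManager yourself via airflow dag-processor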

One of the benefits of having a separate DAG processor process is that you can now run multiple of them, each with a different DAGs folder. So if you want some form of code separation across multiple users (multi-tenant setups), this allows for that.
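
As a sketch, two teams could each get their own processor pointed at their own folder (the paths here are just placeholders):

$ airflow dag-processor --subdir /opt/airflow/dags/team_a
$ airflow dag-processor --subdir /opt/airflow/dags/team_b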

DagFileProcessorProcess

I did not talk about this earlier to keep things simple, but when the DagFileProcessorManager has to actually parse and process a DAG file, it offloads that work to a component called DagFileProcessorProcess. So in reality, there are two separate components/classes in Airflow responsible for DAG file parsing – the processor manager and the processor process.

The processor manager spawns a separate DagFileProcessorProcess to parse each file. So if there are 10 source files to be parsed, the manager will offload the parsing work to 10 different DagFileProcessorProcess processes.

These processes do not all run at once – the parsing_processes configuration option controls how many of them run in parallel, and raising it can improve parsing performance:

[scheduler]
# The scheduler can run multiple processes in parallel to parse dags.
# This defines how many processes will run.
# (Can also be set via the AIRFLOW__SCHEDULER__PARSING_PROCESSES environment variable.)
parsing_processes = 2

The default value of 2 means that at any given time the manager will be parsing and processing at most 2 files, by spawning 2 DagFileProcessorProcess processes in parallel.
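
To make the mechanics concrete, here is a minimal, purely illustrative Python sketch – not Airflow's actual code, and the function names are made up – of a manager that fans parsing work out to a bounded pool of worker processes, the way parsing_processes caps the number of DagFileProcessorProcess instances:

from concurrent.futures import ProcessPoolExecutor
from pathlib import Path

PARSING_PROCESSES = 2  # plays the role of [scheduler] parsing_processes

def parse_dag_file(path: str) -> str:
    # Stand-in for what a DagFileProcessorProcess does: import the file,
    # collect the DAG objects defined in it, and report back to the manager.
    return f"parsed {path}"

def manager_loop(dag_folder: str) -> None:
    files = [str(p) for p in Path(dag_folder).rglob("*.py")]
    # At most PARSING_PROCESSES files are parsed at the same time,
    # each one in its own child process.
    with ProcessPoolExecutor(max_workers=PARSING_PROCESSES) as pool:
        for result in pool.map(parse_dag_file, files):
            print(result)

if __name__ == "__main__":
    manager_loop("dags")

The real manager is of course far more involved (it runs continuously, keeps track of which files changed, and so on), but the bounded pool is the part that parsing_processes controls.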
