It is important to understand how Airflow selects the source file(s) it reads to load DAGs. This process is implemented by the `DagFileProcessorProcess`, which is triggered by the `DagFileProcessorManager`. Let's go through each step involved in building the list of file paths:
- First, the path or directory from which Airflow (or the `DagFileProcessorProcess`) will read the files to be parsed/processed for loading DAGs is resolved. This path can come from two places:
  - If running the `DagFileProcessorManager` as a standalone process (outside the scheduler) via the `airflow dag-processor` command, then the directory passed to the `-S|--subdir` option is used.
  - If no `--subdir` option is passed to `airflow dag-processor`, or the command itself is not used at all, i.e., the standalone processor is not used (in this case DAG processing happens as part of the scheduler command/process), then the path specified by the `dags_folder` setting in `airflow.cfg` (the configuration file) is used.
- Out of all the files present in the path resolved in (1), only files that have the `.py` extension are considered for processing, i.e., only Python files are processed.
- Out of the files shortlisted in (2), i.e., the Python files present in the DAGs folder, only files that contain the string `airflow` are considered for processing. This check can be disabled by turning off the `dag_discovery_safe_mode` configuration option.
- Out of the final set of files shortlisted, all those that match (`glob`) patterns specified in `.airflowignore` are excluded from processing.
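As a rough illustration, the selection steps above could be sketched in Python like this. This is not Airflow's actual implementation — the function name, the plain-string safe-mode check, and the use of `glob`-style matching for `.airflowignore` are all simplifying assumptions made for this sketch:

```python
import fnmatch
from pathlib import Path

def list_dag_files(dags_folder: str, safe_mode: bool = True) -> list[str]:
    """Hypothetical sketch of Airflow's DAG file selection steps."""
    root = Path(dags_folder)

    # Step 2: only files with a .py extension are considered.
    candidates = list(root.rglob("*.py"))

    # Step 3: "safe mode" heuristic -- only keep files whose content
    # mentions "airflow"; this check can be switched off.
    if safe_mode:
        candidates = [
            p for p in candidates
            if "airflow" in p.read_text(errors="ignore").lower()
        ]

    # Step 4: exclude anything matching patterns from .airflowignore.
    ignore_file = root / ".airflowignore"
    patterns = []
    if ignore_file.exists():
        patterns = [
            line.strip()
            for line in ignore_file.read_text().splitlines()
            if line.strip() and not line.startswith("#")
        ]
    return [
        str(p) for p in candidates
        if not any(
            fnmatch.fnmatch(str(p.relative_to(root)), pat)
            for pat in patterns
        )
    ]
```

Passing `safe_mode=False` mirrors disabling the heuristic content check, in which case every `.py` file not excluded by `.airflowignore` is parsed.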
The steps above produce the final list of file paths that Airflow reads to process and create the DAG objects used by all the different components – Scheduler, Web Server and Workers.
Bonus: There are some optimizations that Airflow does to ensure it does “minimal” parsing of files. For instance, what’s the point in parsing a DAG file that has not been modified? Read this article to learn more about them.
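One such optimization can be illustrated with a simplified, hypothetical sketch (Airflow's real logic lives in the manager/processor code and is more involved): compare a file's last-modified time against the time it was last parsed, and skip the parse when nothing changed.

```python
import os

# Hypothetical cache: file path -> mtime observed at the last parse.
_last_parsed: dict[str, float] = {}

def needs_reparse(path: str) -> bool:
    """Return True only if the file changed since it was last parsed."""
    mtime = os.path.getmtime(path)
    if _last_parsed.get(path) == mtime:
        return False  # unchanged -> skip the (expensive) parse
    _last_parsed[path] = mtime
    return True
```

The first call for a given path returns `True` (never parsed before); subsequent calls return `False` until the file's modification time changes.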