Airflow Scheduler Parallelism (Concurrency)
Airflow lets you configure and tune the parallelism/concurrency of tasks at various levels (such as per DAG) and via different mechanisms (such as Pools). But at the top level, the master knob controlling the overall number of task instances that the entire cluster can run simultaneously (in the `running` state) is the scheduler's (or, more accurately, the executor's) `parallelism` configuration option, found under the `[core]` section.
This is the default value for `parallelism` in the configuration file (`airflow.cfg`):
```ini
[core]
# This defines the maximum number of task instances that can run concurrently per scheduler in
# Airflow, regardless of the worker count. Generally this value, multiplied by the number of
# schedulers in your cluster, is the maximum number of task instances with the running
# state in the metadata database.
parallelism = 32
```
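Instead of editing `airflow.cfg`, the option can also be overridden through Airflow's standard `AIRFLOW__{SECTION}__{KEY}` environment-variable convention; environment variables take precedence over the config file. The value `64` below is just an illustration:

```shell
# Override [core] parallelism for this shell session; Airflow reads
# AIRFLOW__CORE__PARALLELISM before falling back to airflow.cfg.
export AIRFLOW__CORE__PARALLELISM=64

# You can confirm what the running configuration resolves to with:
#   airflow config get-value core parallelism
echo "$AIRFLOW__CORE__PARALLELISM"
```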
As the comment block clarifies, here are the important aspects that this directive affects:
- The maximum number of task instances that can run concurrently per scheduler, across all DAGs. So if you have 2 scheduler instances running with the default value of 32, you will be able to run 64 tasks concurrently at any given moment.
- The parallelism/concurrency is enforced regardless of the worker count. For instance, if you have 10 worker instances running with a worker concurrency of 8, your infrastructure has the capacity to process 80 task instances at a time. But with 2 scheduler instances and `parallelism = 32`, you will still be processing no more than 64 tasks in parallel (even if more tasks were scheduled).
- A value of 0 means infinity. In that case, your worker instance count and worker concurrency (per instance) will define the concurrency capacity of the cluster.
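The interaction between these limits boils down to a minimum of two capacities. The helper below is a hypothetical illustration of that arithmetic (not part of Airflow's API), using the numbers from the list above:

```python
def effective_capacity(schedulers: int, parallelism: int,
                       workers: int, worker_concurrency: int) -> int:
    """Rough ceiling on simultaneously running task instances.

    parallelism = 0 means "unlimited", so only worker capacity applies.
    """
    worker_capacity = workers * worker_concurrency
    if parallelism == 0:
        return worker_capacity
    return min(schedulers * parallelism, worker_capacity)

# 10 workers x concurrency 8 = 80 worker slots, but
# 2 schedulers x parallelism 32 caps running tasks at 64.
print(effective_capacity(schedulers=2, parallelism=32,
                         workers=10, worker_concurrency=8))  # 64

# With parallelism = 0 (infinity), worker capacity is the only limit.
print(effective_capacity(schedulers=2, parallelism=0,
                         workers=10, worker_concurrency=8))  # 80
```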
`parallelism` is one of the parameters to tune if you have tasks stuck in the `scheduled` state for a while.