Airflow allows configuring and tuning the parallelism/concurrency of tasks at various levels (like DAGs) and via different approaches (like Pools). But at the top level, the father of all knobs, the one that controls the overall number of task instances the entire cluster can run simultaneously (i.e. in the `running` state), is the scheduler's (or, more accurately, the executor's) `parallelism` configuration option, which lives under the `[core]` section. This is the default value for `parallelism` in the configuration file (`airflow.cfg`):
```ini
[core]

# This defines the maximum number of task instances that can run concurrently per scheduler in
# Airflow, regardless of the worker count. Generally this value, multiplied by the number of
# schedulers in your cluster, is the maximum number of task instances with the running
# state in the metadata database.
parallelism = 32
```
As the comment block clarifies, these are the important aspects this directive affects:
- The maximum number of task instances that can run concurrently per scheduler, across all DAGs. So if you have 2 scheduler instances running with the default value of 32, you will be able to run 64 tasks concurrently at any given moment.
- The parallelism/concurrency is enforced regardless of the worker count. For instance, if you have 10 worker instances running with a concurrency of 8, your infrastructure has the capacity to process 80 task instances at a time. But with 2 scheduler instances and `parallelism=32`, you will still be processing no more than 64 tasks in parallel (even if more tasks were scheduled).
- A value of `0` means infinity. In that case, your worker instance count and the per-instance worker concurrency will define the concurrency capacity of the cluster.
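The capacity math in the points above can be sketched as a small helper. This is just back-of-the-envelope arithmetic with hypothetical cluster sizes, not something read from a live Airflow deployment:

```python
def effective_capacity(schedulers: int, parallelism: int,
                       workers: int, worker_concurrency: int) -> int:
    """Maximum task instances that can be in the running state at once."""
    worker_capacity = workers * worker_concurrency  # what the workers can physically execute
    if parallelism == 0:                            # 0 means "infinite" parallelism
        return worker_capacity                      # workers become the only cap
    scheduler_cap = schedulers * parallelism        # cap enforced at the scheduler level
    return min(scheduler_cap, worker_capacity)

# 10 workers x concurrency 8 = 80 slots, but 2 schedulers x parallelism 32
# still cap the cluster at 64 concurrently running task instances.
print(effective_capacity(schedulers=2, parallelism=32,
                         workers=10, worker_concurrency=8))  # -> 64
```

Note how the worker fleet's raw capacity (80) only becomes the binding limit when `parallelism` is set to `0`.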
`parallelism` is one of the parameters to tune if you have tasks stuck in the `scheduled` state for a while.
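As a sketch of such a tuning change (the value 64 here is purely illustrative), you can either edit the option in `airflow.cfg` or override it with Airflow's standard `AIRFLOW__SECTION__KEY` environment-variable convention:

```ini
[core]
# Allow each scheduler to keep up to 64 task instances in the running state.
parallelism = 64
```

The equivalent environment variable is `AIRFLOW__CORE__PARALLELISM=64`; remember that all running scheduler instances should see the same value so the cluster-wide cap stays predictable.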