Airflow Scheduler Parallelism (Concurrency)

Airflow allows configuring and tuning the parallelism/concurrency of tasks at various levels (such as per DAG) and via different mechanisms (such as Pools). But at the top level, the master knob that controls the overall number of task instances the entire cluster can run simultaneously (i.e., in the running state) is the scheduler's (or, more accurately, the executor's) parallelism configuration option, which lives under the [core] section.

This is the default value of parallelism in the configuration file (airflow.cfg):

[core]

# This defines the maximum number of task instances that can run concurrently per scheduler in
# Airflow, regardless of the worker count. Generally this value, multiplied by the number of
# schedulers in your cluster, is the maximum number of task instances with the running
# state in the metadata database.
parallelism = 32
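
If you want to confirm the value the scheduler actually sees (after airflow.cfg and any environment-variable overrides are merged), you can read it through Airflow's configuration API. The following is a minimal sketch; it assumes it runs in an environment where Airflow is installed and pointed at the same configuration the scheduler uses.

# Minimal sketch: print the effective [core] parallelism value.
from airflow.configuration import conf

parallelism = conf.getint("core", "parallelism")
print(f"[core] parallelism = {parallelism}")  # 0 means unlimited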

As the comment block clarifies, these are the important aspects this directive affects:

  • The maximum number of task instances that can run concurrently per scheduler, across all DAGs. So if you have 2 scheduler instances running with the default value of 32, you will be able to run up to 64 tasks concurrently at any given moment.
  • The parallelism limit is enforced regardless of the worker count. For instance, if you have 10 worker instances each running with a concurrency of 8, your infrastructure has the capacity to process 80 task instances at a time. But with 2 scheduler instances and parallelism = 32, you will still be running no more than 64 tasks in parallel (even if more tasks have been scheduled); see the arithmetic sketch after this list.
  • A value of 0 means unlimited. In that case, the number of worker instances and the worker concurrency (per instance) define the concurrency capacity of the cluster.
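
To make the interplay between these limits concrete, here is a small illustrative sketch in plain Python, using the hypothetical numbers from the list above (2 schedulers, 10 workers, worker concurrency of 8):

num_schedulers = 2        # hypothetical scheduler count
parallelism = 32          # [core] parallelism, enforced per scheduler
num_workers = 10          # hypothetical worker count
worker_concurrency = 8    # task slots per worker instance

worker_capacity = num_workers * worker_concurrency   # 80: what the workers could handle
scheduler_cap = num_schedulers * parallelism          # 64: what the executors will allow

# With parallelism = 0 there is no scheduler-side cap; otherwise the smaller limit wins.
effective_limit = worker_capacity if parallelism == 0 else min(worker_capacity, scheduler_cap)
print(effective_limit)  # 64 running task instances at most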

parallelism is one of the parameters to tune if you have tasks stuck in the scheduled state for a while.
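
One quick way to check whether you are hitting this ceiling is to count how many task instances are currently in the running state and compare that number against the number of schedulers multiplied by parallelism. The snippet below is a rough sketch that queries the metadata database through Airflow's ORM models; it assumes it runs on a machine with access to that database, and model/attribute details may vary slightly between Airflow versions.

# Rough sketch: count task instances currently in the running state.
from airflow.models import TaskInstance
from airflow.utils.session import create_session
from airflow.utils.state import State

with create_session() as session:
    running = (
        session.query(TaskInstance)
        .filter(TaskInstance.state == State.RUNNING)
        .count()
    )

print(f"Running task instances: {running}")
# If this count sits persistently at (number of schedulers) * parallelism while
# other tasks stay in the scheduled state, raising parallelism (or adding
# schedulers) is a likely fix.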
