kube-airflow provides a set of tools to run Airflow in a Kubernetes cluster. This is useful when you'd want:

- to use the LocalExecutor instead of the CeleryExecutor and delegate the actual tasks to Kubernetes by calling e.g. `kubectl run --restart=Never ...` from your tasks. This works until the concurrent `kubectl run` executions (up to the concurrency implied by the scheduler's `max_threads` and the LocalExecutor's `parallelism`; see this SO question for gotchas) consume all the resources a single airflow-scheduler pod provides, which should only happen after a pretty long time.
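For illustration, a one-off task delegated to Kubernetes in this way could look like the sketch below; the image and the command are placeholders, not something this repository ships:

```bash
# Hypothetical example of delegating work to Kubernetes from an Airflow task:
# --restart=Never creates a one-off pod that is not restarted when it finishes.
# Image, pod name and command are placeholders.
kubectl run my-task --restart=Never --image=python:3.6 -- python -c "print('hello from a pod')"
```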
This repository contains:

- `./airflow` for deployments using Helm

Create all the deployments and services to run Airflow on Kubernetes:
kubectl create -f airflow.all.yaml
It will create deployments and services for the various Airflow components.
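To check that everything came up, a standard kubectl query such as the following works (the namespace is whatever you deployed into):

```bash
# List the deployments, services and pods created from airflow.all.yaml;
# replace "yournamespace" with the namespace you used.
kubectl get deployments,services,pods --namespace yournamespace
```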
Ensure your Helm installation is done; you may need to have the TILLER_NAMESPACE environment variable set.
Deploy to Kubernetes using:
make helm-install NAMESPACE=yournamespace HELM_VALUES=/path/to/your/values.yaml
The chart provides ingress configuration and lets you customize the installation by adapting `config.yaml` to your setup.
This Helm chart allows using a "prefix" string that will be added to every Kubernetes resource name, which lets you instantiate several independent Airflow clusters in the same namespace.
Note: do NOT use characters such as " (double quote), ' (single quote), / (slash) or \ (backslash) in your passwords or in the prefix, and keep the prefix as short as possible.
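As a rough illustration, a custom values file could look like the sketch below; the key names shown here are hypothetical, so refer to the chart's `config.yaml` for the actual structure:

```yaml
# Hypothetical excerpt of a custom Helm values file -- the real key names are
# defined by the chart's config.yaml, so check it before copying this.
prefix: team1-                 # string prepended to every Kubernetes resource name
ingress:
  enabled: true
  host: airflow.example.com    # host rule used by the generated Ingress
```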
This chart basically provides two ways of deploying DAGs in your Airflow installation:

- Git-sync
- Embedded DAGs
This Helm chart provides support for Persistent Storage but not for a sidecar git-sync pod. If you are willing to contribute, do not hesitate to open a Pull Request!
Git-sync is the easiest way to automatically update your DAGs. It simply polls a Git repository on a given branch (by default every minute) and checks the new version out when it is available. The scheduler and workers see the changes almost in real time. There is no need for any other tool or for a complex rolling-update procedure.
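Conceptually, the sync loop boils down to something like the following sketch (this is not the chart's actual implementation), assuming the DAGs are checked out under /git/dags on the master branch:

```bash
# Conceptual sketch of a git-sync loop: poll the branch and check out any new
# revision so the scheduler and workers pick up DAG changes within about a minute.
# Path and branch are assumptions for illustration only.
while true; do
  git -C /git/dags fetch origin master
  git -C /git/dags checkout -f origin/master
  sleep 60
done
```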
While it is extremely cool to see your DAG appear in Airflow 60 seconds after it is merged, you should be aware of some limitations Airflow has with dynamic DAG updates:
If the scheduler reloads a dag in the middle of a dagrun then the dagrun will actually start
using the new version of the dag in the middle of execution.
This is a known issue with Airflow, and it means that in general it is unsafe to use a git-sync-like solution with Airflow without:
Also keep in mind that git-sync may not scale at all in production if you have a lot of DAGs. The best way to deploy your DAGs is to build a new Docker image containing all the DAGs and their dependencies. To do so, fork this project.
By default, we use the `airflow.cfg` configuration file hardcoded in the Docker image. This file uses a custom templating system to apply some environment variables and feed the result to the Airflow processes (basically it is just some sed).
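As a rough illustration of that templating (the real script in the image may differ, and the template file name here is a placeholder), it boils down to something like:

```bash
# Substitute a {{ POSTGRES_CREDS }} placeholder with the value of an environment
# variable before Airflow starts. "airflow.cfg.tpl" is an illustrative name.
sed "s|{{ POSTGRES_CREDS }}|${POSTGRES_CREDS}|g" airflow.cfg.tpl > airflow.cfg
```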
If you want to use your own `airflow.cfg` file without having to rebuild a complete Docker image, for example when testing new settings, there is a way to define this file in a Kubernetes configuration map:

- set `airflow.airflow_cfg.enable: true` in your values file and deploy it with `helm install -f myvalue.yaml`
- put the content of your `airflow.cfg` in the node `airflow.airflow_cfg.data`
- see `airflow/myvalue-with-airflowcfg-configmap.yaml` for an example on how to set it in your config.yaml file
- keep the custom templating placeholders used by the default `airflow.cfg` (ex: `{{ POSTGRES_CREDS }}`), or at least keep it aligned with the configuration that is actually applied
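Put together, the relevant part of a values file could look like the sketch below; the nesting follows the `airflow.airflow_cfg.*` keys above, and the configuration content is only an illustrative excerpt:

```yaml
# Minimal sketch of the values enabling the airflow.cfg ConfigMap. The [core]
# settings shown in "data" are an illustrative excerpt, not a complete file.
airflow:
  airflow_cfg:
    enable: true
    data: |
      [core]
      executor = CeleryExecutor
      sql_alchemy_conn = postgresql+psycopg2://{{ POSTGRES_CREDS }}@postgres/airflow
```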
As you can see, the Celery workers use a StatefulSet instead of a Deployment. It is used to freeze their DNS names using a Kubernetes headless Service, which allows the webserver to request the logs from each worker individually. This requires exposing a port (8793) and ensuring each worker pod's DNS name is reachable from the webserver pod, which is what the StatefulSet is for.
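For reference, a headless Service for the workers boils down to something like the following sketch; the names and labels are illustrative, only port 8793 comes from this chart:

```yaml
# Simplified headless Service: clusterIP: None gives each worker pod a stable
# DNS record, and port 8793 is the one the webserver uses to fetch task logs.
apiVersion: v1
kind: Service
metadata:
  name: worker
spec:
  clusterIP: None
  selector:
    app: airflow
    tier: worker
  ports:
    - name: worker-logs
      port: 8793
```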
If you want more control on the way you deploy your DAGs, you can use embedded DAGs, where DAGs are burned inside the Docker container deployed as Scheduler and Workers.
Be aware that this requires heavier tooling than git-sync, especially if you use CI/CD:
Example of procedure:

- Fork this project
- Copy your DAGs into the `dags` folder of this project and update `requirements-dags.txt` to declare their dependencies

You can avoid forking this project by:
- keeping a Git project dedicated to storing only your DAGs plus a dedicated requirements.txt
- gating any change to your DAGs in your CI (unit tests, `pip install -r requirements-dags.txt`, etc.)
- having your CI/CD build a new Docker image after each successful merge, using:

      DAG_PATH=$PWD
      cd /path/to/kube-airflow
      make ENBEDDED_DAGS_LOCATION=$DAG_PATH

- triggering the deployment of this new image on your Kubernetes infrastructure
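The exact commands depend on your CI system; a hypothetical follow-up step that publishes the freshly built image could look like this (local tag, registry and image path are all placeholders):

```bash
# Tag and push the image built by the make target above so the cluster can pull it.
docker tag kube-airflow:latest registry.example.com/data/kube-airflow:$GIT_COMMIT
docker push registry.example.com/data/kube-airflow:$GIT_COMMIT
```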
If you want to add specific Python dependencies to use in your DAGs, you simply declare them inside the `requirements/dags.txt` file. They will be automatically installed inside the container during the build, so you can directly use these libraries in your DAGs.
To use another file, call:
make REQUIREMENTS_TXT_LOCATION=/path/to/your/dags/requirements.txt
Please note this requires you to set up the same tooling environment in your CI/CD as when using Embedded DAGs.
Helm allows overriding the configuration to adapt it to your environment; for instance, you will probably want to specify your own ingress configuration.
`git clone` this repository and then just run:
make build
You can browse the Airflow dashboard via running:
minikube start
make browse-web
and the Flower dashboard via running:
make browse-flower
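If you are not running minikube, a plain kubectl port-forward can also expose the dashboards; the pod name and namespace below are placeholders, and 8080 is the default Airflow webserver port:

```bash
# Forward the Airflow webserver to localhost:8080; the same approach works for
# Flower on its own port. Pod name and namespace are placeholders.
kubectl port-forward web-<id> 8080:8080 --namespace yournamespace
```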
If you want to use the Ad hoc query feature, make sure you've configured the connections: go to Admin -> Connections, edit "mysql_default" and set these values (equivalent to the values in config/airflow.cfg):
Check Airflow Documentation
kubectl exec web-<id> --namespace airflow-dev -- airflow backfill tutorial -s 2015-05-01 -e 2015-06-01
For now, update the value for the `replicas` field of the deployment you want to scale and then:
make apply
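Alternatively, once you know which resource to scale, kubectl can do it directly; note that the Celery workers are a StatefulSet rather than a Deployment, and the resource name here is illustrative:

```bash
# Scale the worker StatefulSet to three replicas; adjust the name and namespace
# to match your installation.
kubectl scale statefulset worker --replicas=3 --namespace yournamespace
```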
Fork, improve and PR. ;-)