Installing Airflow on Kubernetes Using Operator

Sunday, Jul 28, 2019| Tags: kubernetes, containers, docker, airflow, helm, data engineering

Operator - “A Kubernetes Operator is an abstraction for deploying non-trivial applications on Kubernetes. It wraps the logic for deploying and operating an application using Kubernetes constructs.” –Richard Laub, staff cloud engineer at Nebulaworks

Note: I will be using an EKS cluster on AWS. You could use the same steps on other cloud providers too.

Well-crafted Kubernetes Operators pack a lot of power and help run and manage stateful applications on Kubernetes. We have earlier seen how to install Airflow on Kubernetes using Helm charts. While Helm charts help you get started fast, they may not be suitable for day-2 operations like:

  1. Upgrades
  2. Backup & restore
  3. Auto recovery
  4. Automatic/On-demand scalability
  5. Configuration management
  6. Deep insights

Let’s see how to install Airflow on Kubernetes using the Airflow Operator.

1. Get the operator

git clone https://github.com/GoogleCloudPlatform/airflow-operator
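The remaining steps run commands from inside the cloned repo, so change into it first:

# all subsequent paths are relative to the repo root
cd airflow-operator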

2. Install CRDs

kubectl apply -f config/crds
kubectl apply -f hack/appcrd.yaml
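Before moving on, you can confirm that the CustomResourceDefinitions were registered (the exact CRD names may vary by operator version):

# list the CRDs added by the operator
kubectl get crds | grep -i airflow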

3. Build operator docker image

# First we need to build the docker image for the controller
# Set this to the name of the docker registry and image you want to use
export IMG=hiprabhat/airflow-controller:latest 

# Build and push
docker build . -t ${IMG}
docker push ${IMG}
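The push will fail with an access-denied error if you are not authenticated against the registry. Log in first (shown here for Docker Hub; adjust for your registry):

# authenticate against the registry before pushing
docker login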

4. Update the Docker image in config/manager_image_patch.yaml

Update the image field to point to the controller image you just pushed:

spec:
  template:
    spec:
      containers:
      # Change the value of image field below to your controller image URL
      - image: hiprabhat/airflow-controller:latest
        name: manager

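The repo follows kubebuilder conventions, so the patched controller manifests are typically deployed through the Makefile. This is a sketch based on that convention; verify the exact target in the repo's Makefile:

# deploy the controller to the cluster (kubebuilder convention; check the Makefile)
make deploy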

5. Install Airflow

# deploy base components first
kubectl apply -f hack/sample/mysql-celery/base.yaml
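
Give the base components a moment to come up, and confirm they are Running before deploying the cluster:

# watch the base pods (MySQL etc.) until they reach Running
kubectl get pods -w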

You can specify the source of DAGs in the hack/sample/mysql-celery/cluster.yaml file.

  dags:
    subdir: "airflow/example_dags/"
    git:
      repo: "https://github.com/apache/incubator-airflow/"
      # setting once to false allows the DAGs to be refreshed every 5 minutes
      once: false
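
To pull DAGs from your own repository instead, point the git source at it. The repo URL and subdir below are placeholders, and the branch field is an assumption; check the AirflowCluster CRD for the exact fields your operator version supports.

  dags:
    subdir: "dags/"
    git:
      # placeholder repo; replace with your own DAG repository
      repo: "https://github.com/your-org/your-dags/"
      # branch is assumed to be supported; verify against the AirflowCluster spec
      branch: "master"
      once: false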

Now it’s time to deploy the Airflow components.

# after 30-60s deploy cluster components 
# using celery + git as DAG source
kubectl apply -f hack/sample/mysql-celery/cluster.yaml
# port forward to access the UI
kubectl port-forward mc-cluster-airflowui-0 8080:8080
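
If the port-forward fails, the UI pod may still be starting; list the pods created by the sample and wait until they are Running:

# the sample prefixes its pods with mc-; the UI pod is mc-cluster-airflowui-0
kubectl get pods

Once the forward is established, the Airflow UI is available at http://localhost:8080.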

[Screenshot: Airflow DAG page (airflow-using-operator.png)]

To set up authentication, follow the steps in the earlier blog post.


