Deploy Cluster Autoscaler for Kubernetes on Jetstream
The Kubernetes Cluster Autoscaler is a service that runs within a Kubernetes cluster. When there are not enough resources to accommodate the pods queued to run, it contacts the API of the cloud provider to create more Virtual Machines that join the Kubernetes cluster.
Initially the Cluster Autoscaler only supported commercial cloud providers, but in March 2019 a user contributed Openstack support based on Magnum.
As a first step you should have a Magnum-based deployment running on Jetstream, see my recent tutorial about that.
You should therefore already have a copy of the repository with all the configuration files checked out on the local machine you use to interact with the Openstack API; if not:
git clone https://github.com/zonca/jupyterhub-deploy-kubernetes-jetstream.git
and enter the folder dedicated to the autoscaler:
cd jupyterhub-deploy-kubernetes-jetstream/kubernetes_magnum/autoscaler
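At the time of writing, the folder contains (among other files) the manifests used in the rest of this tutorial:
ls
cluster-autoscaler-svcaccount.yaml
cluster-autoscaler-secret.yaml
cluster-autoscaler-deployment-master.yaml
cluster-autoscaler-deployment.yaml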
Set up credentials
We first create the service account needed by the autoscaler to interact with the Kubernetes API:
kubectl create -f cluster-autoscaler-svcaccount.yaml
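For context, a manifest like this defines a service account in the kube-system namespace and binds it to the RBAC permissions the autoscaler needs to watch pods and manage nodes. A minimal sketch of such a manifest follows; the names and the role used here are illustrative, the file in the repository is the authoritative version:
apiVersion: v1
kind: ServiceAccount
metadata:
  name: cluster-autoscaler
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: cluster-autoscaler
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-admin   # illustrative; the repository manifest may grant a narrower role
subjects:
- kind: ServiceAccount
  name: cluster-autoscaler
  namespace: kube-system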
Then we need to provide the connection details the autoscaler uses to interact with the Openstack API; those are contained in the cloud-config file of our cluster, available on the master node and set up by Magnum.
Get the IP of your master node from:
openstack server list
IP=xxx.xxx.xxx.xxx
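If you prefer not to copy the address by hand, something along these lines should work (this assumes the master node's name contains "master"; if more than one IP is listed, make sure you pick the externally reachable one):
IP=$(openstack server list -f value -c Name -c Networks | grep master | grep -oE '[0-9]{1,3}(\.[0-9]{1,3}){3}' | tail -n 1)
echo $IP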
Now ssh into the master node and access the cloud-config file:
ssh fedora@$IP
cat /etc/kubernetes/cloud-config
Now copy the [Global] section to the end of cluster-autoscaler-secret.yaml on the local machine.
Also remove the ca-file line.
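For reference, the [Global] section written by Magnum looks roughly like the following; the field names are indicative, the values are placeholders, and your file may contain additional entries. The ca-file entry is the line you need to remove:
[Global]
auth-url=https://<openstack-auth-url>:5000/v3
user-id=<trustee-user-id>
password=<trustee-password>
trust-id=<trust-id>
region=<region-name>
ca-file=<path-to-ca-bundle>
Once the section is in place, create the secret: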
kubectl create -f cluster-autoscaler-secret.yaml
Launch the Autoscaler deployment
Create the Autoscaler deployment:
kubectl create -f cluster-autoscaler-deployment-master.yaml
Alternatively, I also added a version for a cluster where we are not deploying pods on the master node, cluster-autoscaler-deployment.yaml.
Check that the deployment is active:
kubectl -n kube-system get deployments
NAME DESIRED CURRENT UP-TO-DATE AVAILABLE AGE
cluster-autoscaler 1 1 1 0 10s
And check its logs (you can get the full pod name from kubectl -n kube-system get pods):
kubectl -n kube-system logs cluster-autoscaler-59f4cf4f4-4k4p2
I0905 05:29:21.589062 1 leaderelection.go:217] attempting to acquire leader lease kube-system/cluster-autoscaler...
I0905 05:29:39.412449 1 leaderelection.go:227] successfully acquired lease kube-system/cluster-autoscaler
I0905 05:29:43.896557 1 magnum_manager_heat.go:293] For stack ID 17ab3ae7-1a81-43e6-98ec-b6ffd04f91d3, stack name is k8s-lu3bksbwsln3
I0905 05:29:44.146319 1 magnum_manager_heat.go:310] Found nested kube_minions stack: name k8s-lu3bksbwsln3-kube_minions-r4lhlv5xuwu3, ID d0590824-cc70-4da5-b9ff-8581d99c666b
If you redeploy the cluster and keep the old credentials, you'll see "Authentication failed" in the logs of the autoscaler pod: you need to update the secret every time you redeploy the cluster.
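For example, after pasting the new [Global] section into cluster-autoscaler-secret.yaml, you can recreate the secret and delete the autoscaler pod so that the deployment restarts it with the new credentials (replace the pod name with the one shown by kubectl -n kube-system get pods):
kubectl delete -f cluster-autoscaler-secret.yaml
kubectl create -f cluster-autoscaler-secret.yaml
kubectl -n kube-system delete pod cluster-autoscaler-59f4cf4f4-4k4p2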
Test the autoscaler
Now we need to produce a significant load on the cluster so that the autoscaler is triggered to request Openstack Magnum to create more Virtual Machines.
We can create a deployment of the NGINX container (any other would work for this test):
kubectl create deployment autoscaler-demo --image=nginx
And then create a large number of replicas:
kubectl scale deployment autoscaler-demo --replicas=300
We are using 2 nodes with a large amount of memory and CPU, so they can accommodate more than 200 of those pods. The rest remains in the queue:
kubectl get deployment autoscaler-demo
NAME DESIRED CURRENT UP-TO-DATE AVAILABLE AGE
autoscaler-demo 300 300 300 213 18m
And this triggers the autoscaler:
kubectl -n kube-system logs cluster-autoscaler-59f4cf4f4-4k4p2
I0905 05:34:47.401149 1 scale_up.go:689] Scale-up: setting group DefaultNodeGroup size to 2
I0905 05:34:49.267280 1 magnum_nodegroup.go:101] Increasing size by 1, 1->2
I0905 05:35:22.222387 1 magnum_nodegroup.go:67] Waited for cluster UPDATE_IN_PROGRESS status
You can also check via the Openstack API:
openstack coe cluster list
+------+------+---------+------------+--------------+--------------------+
| uuid | name | keypair | node_count | master_count | status |
+------+------+---------+------------+--------------+--------------------+
| 09fcf| k8s | comet | 2 | 1 | UPDATE_IN_PROGRESS |
+------+------+---------+------------+--------------+--------------------+
It takes about 4 minutes for a new VM to boot, be configured by Magnum and join the Kubernetes cluster.
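You can follow the new node joining the cluster in real time with, for example:
kubectl get nodes --watch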
Checking the logs again should show another line:
I0912 17:18:28.290987 1 magnum_nodegroup.go:67] Waited for cluster UPDATE_COMPLETE status
Then you should have all 3 nodes available:
kubectl get nodes
NAME STATUS ROLES AGE VERSION
k8s-6bawhy45wr5t-master-0 Ready master 38m v1.11.1
k8s-6bawhy45wr5t-minion-0 Ready <none> 38m v1.11.1
k8s-6bawhy45wr5t-minion-1 Ready <none> 30m v1.11.1
and all 300 NGINX containers deployed:
kubectl get deployments
NAME DESIRED CURRENT UP-TO-DATE AVAILABLE AGE
autoscaler-demo 300 300 300 300 35m
You can also test scaling down by reducing the number of NGINX replicas to just a few and checking in the autoscaler logs that this triggers the scale-down process.
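For example:
kubectl scale deployment autoscaler-demo --replicas=10
kubectl -n kube-system logs cluster-autoscaler-59f4cf4f4-4k4p2
After the configured delay, the logs should report that unneeded nodes are being removed, and the extra minion should disappear from kubectl get nodes and from openstack server list.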
In cluster-autoscaler-deployment-master.yaml I have configured the scale-down process to trigger after just 1 minute, to simplify testing. For production, it is better to increase this to 10 minutes or more. Check the documentation of Cluster Autoscaler 1.14 for all the other available options.
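For reference, these timings are just command-line flags passed to the cluster-autoscaler container in the deployment manifest; the relevant arguments look roughly like this (the values, node-group bounds, and cloud-config mount path are illustrative, the manifest in the repository is the authoritative version):
command:
  - ./cluster-autoscaler
  - --cloud-provider=magnum
  - --cloud-config=/config/cloud-config
  - --nodes=1:10:DefaultNodeGroup
  - --scale-down-unneeded-time=1m
  - --scale-down-delay-after-add=1m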
Note about the Cluster Autoscaler container
The Magnum provider was added in Cluster Autoscaler 1.15; however, that version is not compatible with Kubernetes 1.11, which is currently available on Jetstream. Therefore I have taken the development version of Cluster Autoscaler 1.14 and compiled it myself. I also noticed that the scale-down process was not working due to incompatible IDs when the Cloud Provider tried to look up the ID of a minion in the stack; I am now using the MachineID directly instead of going through those indices. This version is available in my fork of autoscaler, and it is built into Docker containers published in the zonca/k8s-cluster-autoscaler-jetstream repository on Docker Hub.
The image tags are the short form of the repository's git commit hash.
I build the container using the run_gobuilder.sh and run_build_autoscaler_container.sh scripts included in the repository.
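If you want to rebuild the image yourself, the process boils down to running those two scripts from a checkout of the fork (check the scripts themselves for prerequisites such as Docker and a Docker Hub login, and for any arguments they expect):
bash run_gobuilder.sh
bash run_build_autoscaler_container.sh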
Note about images used by Magnum
I have tested this deployment using the Fedora-Atomic-27-20180419 image on Jetstream at Indiana University.
The Fedora Atomic 28 image had a long hang during boot and took more than 10 minutes to start; that caused a timeout in the autoscaler, and in any case it would have been too long for a user waiting to start a notebook.
I also tried updating the Fedora Atomic 28 image with sudo atomic host upgrade, and while this fixed the slow startup issue, it produced a broken Kubernetes installation: the Kubernetes services didn't detect the master node as part of the cluster, and kubectl get nodes only showed the minion.