I currently have 2 different strategies to deploy JupyterHub on top of Kubernetes on Jetstream.

In this tutorial I'll show how to use Yuvi Panda's hubtraf to simulate load on JupyterHub, i.e. to programmatically generate a predefined number of users connecting and executing notebooks on the system.

This is especially useful to test the Cluster Autoscaler.

hubtraf assumes you are using the Dummy authenticator, which is the default installed by the zero-to-jupyterhub helm chart. If you have configured another authenticator, temporarily disable it for testing purposes.
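
For reference, enabling the Dummy authenticator in your zero-to-jupyterhub configuration only takes a couple of lines; this is a minimal sketch assuming the pre-1.0 chart schema, the exact keys depend on the chart version you deployed:

    auth:
      type: dummy
      dummy:
        # any username is accepted, all test users share this password
        password: some-test-password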

First go through the hubtraf documentation to understand its functionality.

hubtraf also has a Helm recipe to run it within Kubernetes, but the simplest way is to test from your laptop: follow the [documentation of hubtraf] to install the package and then run:

    hubtraf http://js-xxx-yyy.jetstream-cloud.org 2

This simulates 2 users connecting to the system. You can then check with:

    kubectl get pods -n jhub

that the pods are being created successfully. Also follow the output of hubtraf on the command line: it explains what hubtraf is doing and tracks how long every operation takes, so it is useful for debugging any delay in providing resources to users.
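
If you also want to see where time is spent on the Kubernetes side, a quick sketch (using the same jhub namespace as the rest of this tutorial) is to watch the pods and the recent events while hubtraf runs:

    # follow pod creation in real time
    kubectl get pods -n jhub --watch
    # list recent events (scheduling, volume attachment, image pulls) sorted by time
    kubectl get events -n jhub --sort-by='.lastTimestamp'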

Consider that the volumes created by JupyterHub for the test users will remain in Kubernetes and in Openstack; therefore, if you would like to use the same deployment for production, remember to clean up the Kubernetes PersistentVolume and PersistentVolumeClaim resources.
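
As a sketch of the cleanup (the claim-<username> naming is the zero-to-jupyterhub default, adjust it to your configuration):

    # list the PersistentVolumeClaims created for the simulated users
    kubectl get pvc -n jhub
    # deleting a PVC also removes the bound PV and the Openstack volume
    # when the storage class uses the default "Delete" reclaim policy
    kubectl delete pvc -n jhub claim-<test-username>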

Now we can test the scalability of the deployment with:

    hubtraf http://js-xxx-yyy.jetstream-cloud.org 100

Make sure you have asked XSEDE support to increase the maximum number of volumes in Openstack for your allocation, which by default is only 10. Otherwise, edit config_standard_storage.yaml and disable persistent user storage:

    singleuser:
      storage:
        type: none
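
Then apply the updated configuration with a Helm upgrade; this is a sketch assuming jhub is the name of your Helm release (adjust the release name, chart version and namespace flags to your deployment):

    helm upgrade jhub jupyterhub/jupyterhub --values config_standard_storage.yaml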

Test the Cluster Autoscaler

If you followed the tutorial to deploy the Cluster Autoscaler on Magnum, you can launch hubtraf to create a large number of pods, then check that some pods are "Running" and that the ones which do not fit on the current nodes are "Pending":

    kubectl get pods -n jhub

and then check in the logs of the autoscaler that it detects those pods as pending and requests additional nodes. For example:

    > kubectl logs -n kube-system cluster-autoscaler-hhhhhhh-uuuuuuu
    I1031 00:48:39.807384       1 scale_up.go:689] Scale-up: setting group DefaultNodeGroup size to 2
    I1031 00:48:41.583449       1 magnum_nodegroup.go:101] Increasing size by 1, 1->2
    I1031 00:49:14.141351       1 magnum_nodegroup.go:67] Waited for cluster UPDATE_IN_PROGRESS status

After 4 or 5 minutes the new node should be available and should show up in:

    kubectl get nodes

And we can check that some user pods are now running on the new node:

    kubectl get pods -n jhub -o wide

In my case the Autoscaler actually requested a 3rd node to accommodate all the user pods:

    I1031 00:48:39.807384       1 scale_up.go:689] Scale-up: setting group DefaultNodeGroup size to 2
    I1031 00:48:41.583449       1 magnum_nodegroup.go:101] Increasing size by 1, 1->2
    I1031 00:49:14.141351       1 magnum_nodegroup.go:67] Waited for cluster UPDATE_IN_PROGRESS status
    I1031 00:52:51.308054       1 magnum_nodegroup.go:67] Waited for cluster UPDATE_COMPLETE status
    I1031 00:53:01.315179       1 scale_up.go:689] Scale-up: setting group DefaultNodeGroup size to 3
    I1031 00:53:02.996583       1 magnum_nodegroup.go:101] Increasing size by 1, 2->3
    I1031 00:53:35.607158       1 magnum_nodegroup.go:67] Waited for cluster UPDATE_IN_PROGRESS status
    I1031 00:56:41.834151       1 magnum_nodegroup.go:67] Waited for cluster UPDATE_COMPLETE status

Moreover, the Cluster Autoscaler also provides useful information in the events of each "Pending" pod. For example, if it detects that creating a new node would be useless because the pod is "Pending" for some other reason (e.g. the volume quota was reached), this information will be accessible using:

    kubectl describe pod -n jhub jupyter-xxxxxxx
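
If you suspect the Openstack volume quota is the culprit, you can also verify it directly from the Openstack side; a quick sketch, assuming the openstack CLI is configured for your allocation:

    # check the current volume quota and how many volumes already exist
    openstack quota show | grep volumes
    openstack volume list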

When the simulated users disconnect (hubtraf keeps them connected for about 5 minutes by default), the autoscaler waits for its configured scale-down delay: the default is 10 minutes, but in my deployment it is 1 minute to simplify testing, see the cluster-autoscaler-deployment-master.yaml file. After this delay, the autoscaler scales down the cluster in a 2-step process: it first terminates the Openstack virtual machine and then adjusts the size of the Magnum cluster (node_count). You can monitor the process using openstack server list and openstack coe cluster list, and in the log of the autoscaler:

    I1101 06:31:10.223660       1 scale_down.go:882] Scale-down: removing empty node k8s-e2iw7axmhym7-minion-1
    I1101 06:31:16.081223       1 magnum_manager_heat.go:276] Waited for stack UPDATE_IN_PROGRESS status
    I1101 06:32:17.061860       1 magnum_manager_heat.go:276] Waited for stack UPDATE_COMPLETE status
    I1101 06:32:49.826439       1 magnum_nodegroup.go:67] Waited for cluster UPDATE_IN_PROGRESS status
    I1101 06:33:21.588022       1 magnum_nodegroup.go:67] Waited for cluster UPDATE_COMPLETE status
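
To follow the scale-down from the Openstack side, a simple sketch is to poll the two commands mentioned above while the autoscaler does its work:

    # refresh the list of VMs and the Magnum cluster status every 30 seconds
    watch -n 30 "openstack server list; openstack coe cluster list"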

Acknowledgments

Thanks to Yuvi Panda for providing hubtraf and to Julien Chastang for testing my deployments.