Andrea Zonca's bloghttp://zonca.github.io/2020-02-26T13:00:00-08:00Deploy CVMFS on Kubernetes2020-02-26T13:00:00-08:002020-02-26T13:00:00-08:00Andrea Zoncatag:zonca.github.io,2020-02-26:/2020/02/cvmfs-kubernetes.html<p><a href="https://cvmfs.readthedocs.io/">CVMFS</a> is a software distribution service used by High Energy Physics experiments at CERN
to synchronize software environments across entire collaborations.</p>
<p>In the context of a Kubernetes + JupyterHub deployment on Jetstream, for example <a href="http://zonca.github.io/2019/06/kubernetes-jupyterhub-jetstream-magnum.html">deployed using Magnum following my tutorial</a>, it is useful to use CVMFS to make the software tools of a collaboration available to all the users connected to JupyterHub, so that we can keep the base Docker image simpler and smaller.</p>
<h2>Alternatives</h2>
<p>An existing solution is <a href="https://github.com/cernops/cvmfs-csi">the CVMFS CSI driver</a>; however, it doesn't have much documentation, so I haven't tested it. It would be useful for larger deployments, but we are designing for a 5-node (possibly up to 10-node) Kubernetes cluster.</p>
<h2>Architecture</h2>
<p>We have a pod in Kubernetes (running as a privileged Docker container) which runs the CVMFS client and caches locally
(on a dedicated Openstack volume) some pre-defined CVMFS repositories (at the moment we do not support automounting).</p>
<p>Currently we are using the <code>DIRECT</code> connection for the CVMFS client, because we have just a single client which accesses
a small amount of data. For heavier usage a proxy is required instead, and it could also be deployed inside Kubernetes.</p>
<p>The same pod also runs an NFS server and exposes it internally to the Kubernetes cluster, over the local Jetstream network;
any other pod can then use an NFS volume and mount it at the <code>/cvmfs</code> folder inside the container.
We also activate the CVMFS configuration options for NFS support, following the <a href="https://cvmfs.readthedocs.io/en/stable/cpt-configure.html#nfs-server-mode">documentation</a>.</p>
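<p>As a sketch, the client options above correspond to a CVMFS configuration file like the following; the repository list and cache location are illustrative, the exact file in the Docker image may differ:</p>

```shell
# /etc/cvmfs/default.local -- illustrative values
CVMFS_REPOSITORIES=cvmfs-config.cern.ch   # example repository list
CVMFS_HTTP_PROXY=DIRECT                   # no proxy, as discussed above
CVMFS_CACHE_BASE=/var/lib/cvmfs           # cache on the dedicated volume
CVMFS_NFS_SOURCE=yes                      # NFS server mode, per the CVMFS documentation
```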
<h2>Deployment</h2>
<p>The repositories used in this deployment are:</p>
<ul>
<li><a href="https://github.com/zonca/docker-cvmfs-client">Github repository for the Docker image of the CVMFS client</a></li>
<li>Docker Hub repositories where the 2 containers are built: <a href="https://hub.docker.com/r/zonca/cvmfs-client"><code>cvmfs-client</code></a> and <a href="https://hub.docker.com/r/zonca/cvmfs-client-nfs"><code>cvmfs-client-nfs</code></a></li>
<li>The <a href="https://github.com/zonca/jupyterhub-deploy-kubernetes-jetstream/tree/master/cvmfs"><code>jupyterhub-deploy-kubernetes-jetstream</code></a> Github repository with the Kubernetes configuration files</li>
</ul>
<p>First we need to checkout the <code>jupyterhub-deploy-kubernetes-jetstream</code> repository:</p>
<div class="highlight"><pre><span></span><span class="err">git clone https://github.com/zonca/jupyterhub-deploy-kubernetes-jetstream.git</span>
<span class="err">cd jupyterhub-deploy-kubernetes-jetstream/cvmfs</span>
</pre></div>
<p>Then configure the CVMFS pod with the required repositories, see the <code>CVMFS_REPOSITORIES</code> variable in <a href="https://github.com/zonca/jupyterhub-deploy-kubernetes-jetstream/blob/master/cvmfs/pod_cvmfs_nfs.yaml"><code>pod_cvmfs_nfs.yaml</code></a>.</p>
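<p>For reference, in a pod spec the repository list is typically passed as an environment variable; this excerpt is a sketch with example values, not a copy of the actual file:</p>

```yaml
# hypothetical excerpt in the style of pod_cvmfs_nfs.yaml
containers:
  - name: cvmfs-client-nfs
    image: zonca/cvmfs-client-nfs
    env:
      - name: CVMFS_REPOSITORIES
        value: "cvmfs-config.cern.ch,sft.cern.ch"   # example repositories
```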
<p>Then deploy the pod with:</p>
<div class="highlight"><pre><span></span><span class="err">kubectl create -f pod_cvmfs_nfs.yaml</span>
</pre></div>
<p>This creates 2 Openstack volumes: a 20 GB volume for the CVMFS cache, and a 1 GB volume which serves as the <code>/cvmfs</code> root folder of the NFS server.
It also creates the <code>nfs-service</code> Service with a fixed IP, so that other pods can reference the NFS server through it.</p>
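<p>A pod can then reach the NFS server through the Service's fixed IP with a standard Kubernetes NFS volume; this is a sketch, the IP is a placeholder for the one set in the Service definition:</p>

```yaml
# hypothetical pod spec excerpt mounting the shared folder at /cvmfs
volumes:
  - name: cvmfs
    nfs:
      server: 10.254.0.100   # placeholder: the fixed ClusterIP of nfs-service
      path: "/"
containers:
  - name: test
    image: busybox
    volumeMounts:
      - name: cvmfs
        mountPath: /cvmfs
```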
<p>Finally we can create a pod that mounts the folder via NFS:</p>
<div class="highlight"><pre><span></span><span class="err">kubectl create -f test_nfs_mount.yaml</span>
</pre></div>
<p>Then get a terminal in the pod with:</p>
<div class="highlight"><pre><span></span><span class="err">bash ../terminal_pod.sh test-nfs-mount</span>
</pre></div>
<p>This creates a volume which mounts the <code>/cvmfs</code> folder shared over NFS; all the subfolders are automatically shared as well.</p>
<p>Finally we can check the content of the <code>/cvmfs</code> folder.</p>Organize calendars for a large scientific collaboration2019-12-02T12:00:00-08:002019-12-02T12:00:00-08:00Andrea Zoncatag:zonca.github.io,2019-12-02:/2019/12/organize-calendar-collaboration.html<p>Many scientific collaborations have a central calendar, often hosted on Google Calendar,
to coordinate teleconferences, meetings and events across timezones.</p>
<h3>The issue</h3>
<p>Most users are only interested in a small subset of the events; however, Google Calendar
does not allow them to subscribe to single events. The central calendar admin could invite
each person to events, but that requires a lot of work.</p>
<p>So users either subscribe to the whole calendar, and end up with a huge clutter of uninteresting events,
or copy just a subset of the events to their own calendars, and lose track of any rescheduling of the
original events.</p>
<h3>Proposed solution</h3>
<p>I recommend splitting the events across multiple calendars, for example one for each working group,
or any other categorization where most users would be interested in all the events in a calendar,
possibly with a "General" calendar for events that should interest the whole collaboration.</p>
<p>Still, we can embed all of the calendars in a single webpage; see the example below, where 2 calendars (the Monday and Tuesday telecon calendars) are visualized together (<a href="https://support.google.com/calendar/answer/41207?hl=en">see the Google Calendar documentation</a>).</p>
<iframe src="https://calendar.google.com/calendar/embed?height=600&wkst=1&bgcolor=%23ffffff&ctz=America%2FLos_Angeles&src=dTI2dnBkNnZvcm1qNHVucnVtajMzZzdwcGNAZ3JvdXAuY2FsZW5kYXIuZ29vZ2xlLmNvbQ&src=c2FwazM1OTVmcHRiZHVtOWdqZnJwdWxkbnNAZ3JvdXAuY2FsZW5kYXIuZ29vZ2xlLmNvbQ&color=%23DD4477&color=%236633CC" style="border-width:0" width="800" height="600" frameborder="0" scrolling="no"></iframe>
<p>Users can click on the "Add to Google Calendar" button at the bottom and subscribe to a subset of the calendars or to all of them.
See the screenshot below: <img alt="screenshot of add to Google Calendar" src="/images/add_google_calendar.png">.</p>
<p>As an additional benefit, we can compartmentalize permissions more easily, e.g. the leads of a working group
get write access only to their relevant calendar/calendars.</p>Simulate users on JupyterHub2019-10-30T12:00:00-07:002019-10-30T12:00:00-07:00Andrea Zoncatag:zonca.github.io,2019-10-30:/2019/10/loadtest-jupyterhub.html<p>I currently have 2 different strategies to deploy JupyterHub on top of Kubernetes on Jetstream:</p>
<ul>
<li>Using <a href="https://zonca.github.io/2019/02/kubernetes-jupyterhub-jetstream-kubespray.html">Kubespray</a></li>
<li>Using <a href="http://zonca.github.io/2019/06/kubernetes-jupyterhub-jetstream-magnum.html">Magnum</a>, which also supports the <a href="http://zonca.github.io/2019/09/kubernetes-jetstream-autoscaler.html">Cluster Autoscaler</a></li>
</ul>
<p>In this tutorial I'll show how to use Yuvi Panda's <a href="https://github.com/yuvipanda/hubtraf"><code>hubtraf</code></a> to simulate load on JupyterHub, i.e. programmatically generate a predefined number of users connecting and executing notebooks on the system.</p>
<p>This is especially useful to test the Cluster Autoscaler.</p>
<p><code>hubtraf</code> assumes you are using the Dummy authenticator, which is the default installed by the <code>zero-to-jupyterhub</code> helm chart. If you have configured another authenticator, temporarily disable it for testing purposes.</p>
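<p>For reference, with the <code>zero-to-jupyterhub</code> helm chart of this era the Dummy authenticator is selected in <code>config.yaml</code> roughly like this (a sketch; check the schema of the chart version you are using):</p>

```yaml
auth:
  type: dummy
  dummy:
    password: null   # when unset, any username/password is accepted
```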
<p>First go through the <a href="https://github.com/yuvipanda/hubtraf/blob/master/docs/index.rst#jupyterhub-traffic-simulator"><code>hubtraf</code> documentation</a> to understand its functionalities.</p>
<p><code>hubtraf</code> also has a Helm recipe to run it within Kubernetes, but the simplest way is to test from your laptop: follow the <code>hubtraf</code> documentation linked above to install the package and then run:</p>
<div class="highlight"><pre><span></span><span class="err">hubtraf http://js-xxx-yyy.jetstream-cloud.org 2</span>
</pre></div>
<p>This simulates 2 users connecting to the system. You can then check with:</p>
<div class="highlight"><pre><span></span><span class="err">kubectl get pods -n jhub</span>
</pre></div>
<p>that the pods are being created successfully. Also check the logs printed by <code>hubtraf</code> on the command line: they explain what it is doing and track the time every operation takes, which is useful to debug any delays in providing resources to users.</p>
<p>Consider that the volumes created by JupyterHub for the test users will remain in Kubernetes and in Openstack; therefore, if you would like to use the same deployment for production, remember to clean up the Kubernetes <code>PersistentVolume</code> and <code>PersistentVolumeClaim</code> resources.</p>
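<p>A minimal cleanup sketch, assuming the default <code>zero-to-jupyterhub</code> labels and the <code>jhub</code> namespace (verify the label selector against your deployment before deleting anything):</p>

```shell
# list the claims created for the simulated users, then delete them;
# with the default "Delete" reclaim policy the bound PersistentVolumes
# and the Openstack volumes are removed as well
kubectl get pvc -n jhub -l component=singleuser-storage
kubectl delete pvc -n jhub -l component=singleuser-storage
```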
<p>Now we can test scalability of the deployment with:</p>
<div class="highlight"><pre><span></span><span class="err">hubtraf http://js-xxx-yyy.jetstream-cloud.org 100</span>
</pre></div>
<p>Make sure you have asked XSEDE support to increase the maximum number of volumes in your Openstack allocation, which by default is only 10. Otherwise edit <code>config_standard_storage.yaml</code> and set:</p>
<div class="highlight"><pre><span></span><span class="n">singleuser</span><span class="o">:</span>
  <span class="n">storage</span><span class="o">:</span>
    <span class="n">type</span><span class="o">:</span> <span class="n">none</span>
</pre></div>
<h2>Test the Cluster Autoscaler</h2>
<p>If you followed the tutorial to deploy the Cluster Autoscaler on Magnum, you can launch <code>hubtraf</code> to create a large number of pods, then check that some pods are "Running" and the ones that do not fit in the current nodes are "Pending":</p>
<div class="highlight"><pre><span></span><span class="err">kubectl get pods -n jhub</span>
</pre></div>
<p>and then check in the logs of the autoscaler that it detects that those pods are pending and requests additional nodes.
For example:</p>
<div class="highlight"><pre><span></span>> kubectl logs -n kube-system cluster-autoscaler-hhhhhhh-uuuuuuu
I1031 <span class="m">00</span>:48:39.807384 <span class="m">1</span> scale_up.go:689<span class="o">]</span> Scale-up: setting group DefaultNodeGroup size to <span class="m">2</span>
I1031 <span class="m">00</span>:48:41.583449 <span class="m">1</span> magnum_nodegroup.go:101<span class="o">]</span> Increasing size by <span class="m">1</span>, <span class="m">1</span>->2
I1031 <span class="m">00</span>:49:14.141351 <span class="m">1</span> magnum_nodegroup.go:67<span class="o">]</span> Waited <span class="k">for</span> cluster UPDATE_IN_PROGRESS status
</pre></div>
<p>After 4 or 5 minutes the new node should be available and should show up in:</p>
<div class="highlight"><pre><span></span><span class="err">kubectl get nodes</span>
</pre></div>
<p>And we can check that some user pods are now running on the new node:</p>
<div class="highlight"><pre><span></span><span class="err">kubectl get pods -n jhub -o wide</span>
</pre></div>
<p>In my case the Autoscaler actually requested a 3rd node to accommodate all the user pods:</p>
<div class="highlight"><pre><span></span>I1031 <span class="m">00</span>:48:39.807384 <span class="m">1</span> scale_up.go:689<span class="o">]</span> Scale-up: setting group DefaultNodeGroup size to <span class="m">2</span>
I1031 <span class="m">00</span>:48:41.583449 <span class="m">1</span> magnum_nodegroup.go:101<span class="o">]</span> Increasing size by <span class="m">1</span>, <span class="m">1</span>->2
I1031 <span class="m">00</span>:49:14.141351 <span class="m">1</span> magnum_nodegroup.go:67<span class="o">]</span> Waited <span class="k">for</span> cluster UPDATE_IN_PROGRESS status
I1031 <span class="m">00</span>:52:51.308054 <span class="m">1</span> magnum_nodegroup.go:67<span class="o">]</span> Waited <span class="k">for</span> cluster UPDATE_COMPLETE status
I1031 <span class="m">00</span>:53:01.315179 <span class="m">1</span> scale_up.go:689<span class="o">]</span> Scale-up: setting group DefaultNodeGroup size to <span class="m">3</span>
I1031 <span class="m">00</span>:53:02.996583 <span class="m">1</span> magnum_nodegroup.go:101<span class="o">]</span> Increasing size by <span class="m">1</span>, <span class="m">2</span>->3
I1031 <span class="m">00</span>:53:35.607158 <span class="m">1</span> magnum_nodegroup.go:67<span class="o">]</span> Waited <span class="k">for</span> cluster UPDATE_IN_PROGRESS status
I1031 <span class="m">00</span>:56:41.834151 <span class="m">1</span> magnum_nodegroup.go:67<span class="o">]</span> Waited <span class="k">for</span> cluster UPDATE_COMPLETE status
</pre></div>
<p>Moreover, the Cluster Autoscaler also provides useful information in the status of each "Pending" pod. For example, if it detects that it is useless to create a new node because the pod is "Pending" for some other reason (e.g. the volume quota was reached), this information will be accessible using:</p>
<div class="highlight"><pre><span></span><span class="err">kubectl describe pod -n jhub jupyter-xxxxxxx</span>
</pre></div>
<p>When the simulated users disconnect (<code>hubtraf</code> keeps them active for about 5 minutes by default), the autoscaler waits for a configured amount of time before scaling down: the default is 10 minutes, but in my deployment it is 1 minute to simplify testing, see the <code>cluster-autoscaler-deployment-master.yaml</code> file.
After this delay, the autoscaler scales down the cluster in a 2-step process: it first terminates the Openstack Virtual Machine and then adjusts the size of the Magnum cluster (<code>node_count</code>). You can monitor the process using <code>openstack server list</code> and <code>openstack coe cluster list</code>, and the log of the autoscaler:</p>
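<p>The scale-down delay is a command-line flag of the autoscaler container; this excerpt is illustrative of the settings described above, not a verbatim copy of the file:</p>

```yaml
# hypothetical excerpt from cluster-autoscaler-deployment-master.yaml
command:
  - ./cluster-autoscaler
  - --cloud-provider=magnum
  - --nodes=1:5:DefaultNodeGroup     # min:max:nodegroup bounds
  - --scale-down-unneeded-time=1m    # 10m is the upstream default
```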
<div class="highlight"><pre><span></span>I1101 <span class="m">06</span>:31:10.223660 <span class="m">1</span> scale_down.go:882<span class="o">]</span> Scale-down: removing empty node k8s-e2iw7axmhym7-minion-1
I1101 <span class="m">06</span>:31:16.081223 <span class="m">1</span> magnum_manager_heat.go:276<span class="o">]</span> Waited <span class="k">for</span> stack UPDATE_IN_PROGRESS status
I1101 <span class="m">06</span>:32:17.061860 <span class="m">1</span> magnum_manager_heat.go:276<span class="o">]</span> Waited <span class="k">for</span> stack UPDATE_COMPLETE status
I1101 <span class="m">06</span>:32:49.826439 <span class="m">1</span> magnum_nodegroup.go:67<span class="o">]</span> Waited <span class="k">for</span> cluster UPDATE_IN_PROGRESS status
I1101 <span class="m">06</span>:33:21.588022 <span class="m">1</span> magnum_nodegroup.go:67<span class="o">]</span> Waited <span class="k">for</span> cluster UPDATE_COMPLETE status
</pre></div>
<h2>Acknowledgments</h2>
<p>Thanks Yuvi Panda for providing <code>hubtraf</code>, and thanks Julien Chastang for testing my deployments.</p>Execute Jupyter Notebooks not interactively2019-09-23T12:00:00-07:002019-09-23T12:00:00-07:00Andrea Zoncatag:zonca.github.io,2019-09-23:/2019/09/batch-notebook-execution.html<p>Over the years, I have explored how to easily scale up computation through
Jupyter Notebooks by executing them non-interactively, possibly parametrized
and remotely. This is mostly for reference.</p>
<ul>
<li><a href="https://github.com/zonca/nbsubmit"><code>nbsubmit</code></a> is a Python package which provides a Python API to send a local notebook for execution on a remote SLURM cluster (for example Comet), see <a href="https://github.com/zonca/nbsubmit/blob/master/example/multiple_jobs/submit_multiple_jobs.ipynb">an example</a>. This project is not currently maintained.</li>
<li>Back in 2017 I tested submitting notebooks to Open Science Grid, see <a href="https://github.com/zonca/batch-notebooks-condor">the <code>batch-notebooks-condor</code> repository</a></li>
<li>Back in 2016 I created scripts to template a Jupyter Notebook and launch SLURM jobs, see <a href="https://github.com/sdsc/sdsc-summer-institute-2016/blob/master/hpc3_python_hpc/slurm.shared.template"><code>slurm.shared.template</code></a> and <a href="https://github.com/sdsc/sdsc-summer-institute-2016/blob/master/hpc3_python_hpc/runipyloop.sh"><code>runipyloop.sh</code></a></li>
</ul>Deploy Cluster Autoscaler for Kubernetes on Jetstream2019-09-12T12:00:00-07:002019-09-12T12:00:00-07:00Andrea Zoncatag:zonca.github.io,2019-09-12:/2019/09/kubernetes-jetstream-autoscaler.html<p>The <a href="https://github.com/kubernetes/autoscaler">Kubernetes Cluster Autoscaler</a> is a service
that runs within a Kubernetes cluster; when there are not enough resources to accommodate
the pods that are queued to run, it contacts the API of the cloud provider to create
more Virtual Machines to join the Kubernetes cluster.</p>
<p>Initially the Cluster Autoscaler only supported commercial cloud providers, but back in
March 2019 <a href="https://github.com/kubernetes/autoscaler/pull/1690">a user contributed Openstack support based on Magnum</a>.</p>
<p>As a first step, you should have a Magnum-based deployment running on Jetstream;
see <a href="https://zonca.github.io/2019/06/kubernetes-jupyterhub-jetstream-magnum.html">my recent tutorial about that</a>.</p>
<p>You should therefore already have a copy of the repository with all the configuration
files checked out on the local machine that you are using to interact with the Openstack API;
if not:</p>
<div class="highlight"><pre><span></span><span class="err">git clone https://github.com/zonca/jupyterhub-deploy-kubernetes-jetstream.git</span>
</pre></div>
<p>and enter the folder dedicated to the autoscaler:</p>
<div class="highlight"><pre><span></span><span class="err">cd jupyterhub-deploy-kubernetes-jetstream/kubernetes_magnum/autoscaler</span>
</pre></div>
<h2>Setup credentials</h2>
<p>We first create the service account needed by the autoscaler to interact with the Kubernetes API:</p>
<div class="highlight"><pre><span></span>kubectl create -f cluster-autoscaler-svcaccount.yaml
</pre></div>
<p>Then we need to provide all the connection details for the autoscaler to interact with the Openstack API;
those are contained in the <code>cloud-config</code> file of our cluster, available on the master node and set up
by Magnum.
Get the <code>IP</code> of your master node from:</p>
<div class="highlight"><pre><span></span>openstack server list
<span class="nv">IP</span><span class="o">=</span>xxx.xxx.xxx.xxx
</pre></div>
<p>Now ssh into the master node and access the <code>cloud-config</code> file:</p>
<div class="highlight"><pre><span></span>ssh fedora@<span class="nv">$IP</span>
cat /etc/kubernetes/cloud-config
</pre></div>
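<p>For reference, the file has roughly this shape (a sketch: the fields follow the Magnum <code>cloud-config</code> format, all values are placeholders):</p>

```ini
[Global]
auth-url=https://xxx:5000/v3
user-id=xxx
password=xxx
trust-id=xxx
region=xxx
ca-file=/etc/kubernetes/ca-bundle.crt
```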
<p>now copy the <code>[Global]</code> section at the end of <code>cluster-autoscaler-secret.yaml</code> on the local machine.
Also remove the line of <code>ca-file</code></p>
<div class="highlight"><pre><span></span>kubectl create -f cluster-autoscaler-secret.yaml
</pre></div>
<h2>Launch the Autoscaler deployment</h2>
<p>Create the Autoscaler deployment:</p>
<div class="highlight"><pre><span></span>kubectl create -f cluster-autoscaler-deployment-master.yaml
</pre></div>
<p>Alternatively, I also added a version, <code>cluster-autoscaler-deployment.yaml</code>, for a cluster where we are not deploying pods on the master.</p>
<p>Check that the deployment is active:</p>
<div class="highlight"><pre><span></span>kubectl -n kube-system get pods
NAME DESIRED CURRENT UP-TO-DATE AVAILABLE AGE
cluster-autoscaler <span class="m">1</span> <span class="m">1</span> <span class="m">1</span> <span class="m">0</span> 10s
</pre></div>
<p>And check its logs:</p>
<div class="highlight"><pre><span></span>kubectl -n kube-system logs cluster-autoscaler-59f4cf4f4-4k4p2
I0905 <span class="m">05</span>:29:21.589062 <span class="m">1</span> leaderelection.go:217<span class="o">]</span> attempting to acquire leader lease kube-system/cluster-autoscaler...
I0905 <span class="m">05</span>:29:39.412449 <span class="m">1</span> leaderelection.go:227<span class="o">]</span> successfully acquired lease kube-system/cluster-autoscaler
I0905 <span class="m">05</span>:29:43.896557 <span class="m">1</span> magnum_manager_heat.go:293<span class="o">]</span> For stack ID 17ab3ae7-1a81-43e6-98ec-b6ffd04f91d3, stack name is k8s-lu3bksbwsln3
I0905 <span class="m">05</span>:29:44.146319 <span class="m">1</span> magnum_manager_heat.go:310<span class="o">]</span> Found nested kube_minions stack: name k8s-lu3bksbwsln3-kube_minions-r4lhlv5xuwu3, ID d0590824-cc70-4da5-b9ff-8581d99c666b
</pre></div>
<p>If you redeploy the cluster and keep an older secret, you'll see "Authentication failed" in the logs of the autoscaler pod: you need to update the secret every time you redeploy the cluster.</p>
<h2>Test the autoscaler</h2>
<p>Now we need to produce a significant load on the cluster so that the autoscaler is triggered to request Openstack Magnum to create more Virtual Machines.</p>
<p>We can create a deployment of the NGINX container (any other would work for this test):</p>
<div class="highlight"><pre><span></span>kubectl create deployment autoscaler-demo --image<span class="o">=</span>nginx
</pre></div>
<p>And then create a large number of replicas:</p>
<div class="highlight"><pre><span></span>kubectl scale deployment autoscaler-demo --replicas<span class="o">=</span><span class="m">300</span>
</pre></div>
<p>We are using 2 nodes with a large amount of memory and CPU, so they can accommodate more than 200 of those pods. The rest remain in the queue:</p>
<div class="highlight"><pre><span></span>kubectl get deployment autoscaler-demo
NAME DESIRED CURRENT UP-TO-DATE AVAILABLE AGE
autoscaler-demo <span class="m">300</span> <span class="m">300</span> <span class="m">300</span> <span class="m">213</span> 18m
</pre></div>
<p>And this triggers the autoscaler:</p>
<div class="highlight"><pre><span></span>kubectl -n kube-system logs cluster-autoscaler-59f4cf4f4-4k4p2
I0905 <span class="m">05</span>:34:47.401149 <span class="m">1</span> scale_up.go:689<span class="o">]</span> Scale-up: setting group DefaultNodeGroup size to <span class="m">2</span>
I0905 <span class="m">05</span>:34:49.267280 <span class="m">1</span> magnum_nodegroup.go:101<span class="o">]</span> Increasing size by <span class="m">1</span>, <span class="m">1</span>->2
I0905 <span class="m">05</span>:35:22.222387 <span class="m">1</span> magnum_nodegroup.go:67<span class="o">]</span> Waited <span class="k">for</span> cluster UPDATE_IN_PROGRESS status
</pre></div>
<p>Check also in the Openstack API:</p>
<div class="highlight"><pre><span></span>openstack coe cluster list
+------+------+---------+------------+--------------+--------------------+
<span class="p">|</span> uuid <span class="p">|</span> name <span class="p">|</span> keypair <span class="p">|</span> node_count <span class="p">|</span> master_count <span class="p">|</span> status <span class="p">|</span>
+------+------+---------+------------+--------------+--------------------+
<span class="p">|</span> 09fcf<span class="p">|</span> k8s <span class="p">|</span> comet <span class="p">|</span> <span class="m">2</span> <span class="p">|</span> <span class="m">1</span> <span class="p">|</span> UPDATE_IN_PROGRESS <span class="p">|</span>
+------+------+---------+------------+--------------+--------------------+
</pre></div>
<p>It takes about 4 minutes for a new VM to boot, be configured by Magnum and join the Kubernetes cluster.</p>
<p>Checking the logs again should show another line:</p>
<div class="highlight"><pre><span></span>I0912 <span class="m">17</span>:18:28.290987 <span class="m">1</span> magnum_nodegroup.go:67<span class="o">]</span> Waited <span class="k">for</span> cluster UPDATE_COMPLETE status
</pre></div>
<p>Then you should have all 3 nodes available:</p>
<div class="highlight"><pre><span></span>kubectl get nodes
NAME STATUS ROLES AGE VERSION
k8s-6bawhy45wr5t-master-0 Ready master 38m v1.11.1
k8s-6bawhy45wr5t-minion-0 Ready <none> 38m v1.11.1
k8s-6bawhy45wr5t-minion-1 Ready <none> 30m v1.11.1
</pre></div>
<p>and all 300 NGINX containers deployed:</p>
<div class="highlight"><pre><span></span>kubectl get deployments
NAME DESIRED CURRENT UP-TO-DATE AVAILABLE AGE
autoscaler-demo <span class="m">300</span> <span class="m">300</span> <span class="m">300</span> <span class="m">300</span> 35m
</pre></div>
<p>You can also test scaling down by scaling the number of NGINX containers back to only a few and checking in the logs
of the autoscaler that this triggers the scale-down process.</p>
<p>In <code>cluster-autoscaler-deployment-master.yaml</code> I have configured the scale-down process to trigger after just 1 minute, to simplify testing. For production, it is better to increase this to 10 minutes or more. Check the <a href="https://github.com/zonca/autoscaler/blob/cluster-autoscaler-1.14-magnum/cluster-autoscaler/FAQ.md">documentation of Cluster Autoscaler 1.14</a> for all other available options.</p>
<h2>Note about the Cluster Autoscaler container</h2>
<p>The Magnum provider was added in Cluster Autoscaler 1.15; however, this version is not compatible with Kubernetes 1.11, which is currently available on Jetstream. Therefore I have taken the development version of Cluster Autoscaler 1.14 and compiled it myself. I also noticed that the scale-down process was not working due to incompatible IDs when the Cloud Provider tried to look up the ID of a minion in the Stack; I am now directly using the MachineID instead of going through those indices. This version is available in <a href="https://github.com/zonca/autoscaler/tree/cluster-autoscaler-1.14-magnum">my fork of <code>autoscaler</code></a> and it is built into Docker containers on the <a href="https://cloud.docker.com/repository/docker/zonca/k8s-cluster-autoscaler-jetstream"><code>zonca/k8s-cluster-autoscaler-jetstream</code> repository on Docker Hub</a>.
The image tags are the short version of the repository git commit hash.</p>
<p>I build the container using the <code>run_gobuilder.sh</code> and <code>run_build_autoscaler_container.sh</code> scripts included in the repository.</p>
<h2>Note about images used by Magnum</h2>
<p>I have tested this deployment using the <code>Fedora-Atomic-27-20180419</code> image on Jetstream at Indiana University.
The Fedora Atomic 28 image had a long hang-up during boot and took more than 10 minutes to start; this caused timeouts in the autoscaler, and in any case it would have been too long for a user waiting to start a notebook.</p>
<p>I also tried updating the Fedora Atomic 28 image with <code>sudo atomic host upgrade</code>, and while this fixed the slow startup issue, it generated a broken Kubernetes installation, i.e. the Kubernetes services didn't detect the master node as part of the cluster: <code>kubectl get nodes</code> only showed the minion.</p>Create a Github account for your research group with free private repositories2019-08-24T15:00:00-07:002019-08-24T15:00:00-07:00Andrea Zoncatag:zonca.github.io,2019-08-24:/2019/08/github-for-research-groups.html<p><a href="https://github.com/">Github</a> allows a research group to create their own webpage where they can host, share and develop their software using the <code>git</code> version control system and the powerful Github online issue-tracking interface.</p>
<p>Github offers unlimited private and public repositories to research groups and classrooms.
Private repositories are useful for the early stages of development or if it is necessary to keep software secret before publication; at publication they can easily be switched to public repositories and free up their slot.</p>
<p>They also provide free data packs for <a href="https://git-lfs.github.com/"><code>git-lfs</code> (Large File Storage)</a>, which is useful to store large amounts of binary data together with your software in the same repository, without actually committing the files into <code>git</code> but using a support server. Just go into "Settings" for your organization and under "Billing" add data packs; you will notice that the cost is $0.</p>
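<p>After running <code>git lfs track</code> for a file pattern, the tracked files are recorded in <code>.gitattributes</code>; for example (the <code>*.hdf5</code> pattern is just an illustration):</p>

```
*.hdf5 filter=lfs diff=lfs merge=lfs -text
```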
<p>Here are the steps to set this up:</p>
<ul>
<li>Create a user account on Github and choose the free plan, use your <code>.edu</code> email address</li>
<li>Create an organization account for your research group</li>
<li>Go to <a href="https://education.github.com/">https://education.github.com/</a> and click on "Get benefits"</li>
<li>Choose your position, e.g. Researcher, and select that you want a discount for an organization</li>
<li>Choose the organization you created earlier and confirm that it is a "Research group"</li>
<li>Add details about your Research group</li>
<li>Finally you need to upload a picture of your University ID card and write how you plan on using the repositories</li>
<li>Within a week at most, but generally in less than 24 hours, you will be approved for unlimited private repositories.</li>
</ul>
<p>Once the organization is created, you can add key team members to the "Owners" group, and then create another group for students and collaborators.</p>
<p>Consider also that it is not necessary for every collaborator to have write access to your repositories. My recommendation is to ask a more experienced team member to administer the central repository, ask the students to fork the repository under their user accounts (forks of private repositories are always private, free and don't use any slot), and then <a href="https://help.github.com/articles/using-pull-requests">send a pull request</a> to the central repository for the administrator to review, discuss and merge.</p>
<p>See for example the organization account of <a href="https://github.com/dib-lab">"The Lab for Data Intensive Biology" led by Dr. C. Titus Brown</a>, where they share code, documentation and papers. Open Science!!</p>
<p>Other suggestions on the setup very welcome!</p>Create a Github account for your research group with free private repositories2019-08-24T15:00:00-07:002019-08-24T15:00:00-07:00Andrea Zoncatag:zonca.github.io,2019-08-24:/2019/08/github-for-research-groups.html<p><a href="https://github.com/">Github</a> allows a research group to create their own webpage where they can host, share and develop their software using the <code>git</code> version control system and the powerful Github online issue-tracking interface.</p>
<p>Github offers unlimited private and public repositories to research groups and classrooms.
Private repositories are useful for early …</p><p><a href="https://github.com/">Github</a> allows a research group to create their own webpage where they can host, share and develop their software using the <code>git</code> version control system and the powerful Github online issue-tracking interface.</p>
<p>Github offers unlimited private and public repositories to research groups and classrooms.
Private repositories are useful for early stages of development or if it is necessary to keep software secret before publication; at publication they can easily be switched to public repositories and free up their slot.</p>
<p>They also provide free data packs for <a href="https://git-lfs.github.com/"><code>git-lfs</code> (Large File Storage)</a>, which is useful to store large amounts of binary data together with your software in the same repository, without actually committing the files into <code>git</code> but using a support server. Just go into "Settings" for your organization and under "Billing" add data packs; you will notice that the cost is $0.</p>
<p>Here are the steps to set this up:</p>
<ul>
<li>Create a user account on Github and choose the free plan, using your <code>.edu</code> email address</li>
<li>Create an organization account for your research group</li>
<li>Go to <a href="https://education.github.com/">https://education.github.com/</a> and click on "Get benefits"</li>
<li>Choose your position, e.g. Researcher, and select that you want a discount for an organization</li>
<li>Choose the organization you created earlier and confirm that it is a "Research group"</li>
<li>Add details about your Research group</li>
<li>Finally you need to upload a picture of your University ID card and write how you plan on using the repositories</li>
<li>Within a week at most, but generally in less than 24 hours, you will be approved for unlimited private repositories.</li>
</ul>
<p>Once the organization is created, you can add key team members to the "Owners" group, and then create another group for students and collaborators.</p>
<p>Consider also that it is not necessary for every collaborator to have write access to your repositories. My recommendation is to ask a more experienced team member to administer the central repository, ask the students to fork the repository under their user accounts (forks of private repositories are always private, free and don't use any slot), and then <a href="https://help.github.com/articles/using-pull-requests">send a pull request</a> to the central repository for the administrator to review, discuss and merge.</p>
<p>See for example the organization account of <a href="https://github.com/dib-lab">"The Lab for Data Intensive Biology" led by Dr. C. Titus Brown</a>, where they share code, documentation and papers. Open Science!!</p>
<p>Other suggestions on the setup very welcome!</p>Ship large files with Python packages2019-08-21T18:00:00-07:002019-08-21T18:00:00-07:00Andrea Zoncatag:zonca.github.io,2019-08-21:/2019/08/large-files-python-packages.html<p>It is often useful to ship large data files together with a Python package,
a couple of scenarios are:</p>
<ul>
<li>data necessary to the functionality provided by the package, for example images, any binary or large text dataset, they could be either required just for a subset of the functionality of …</li></ul><p>It is often useful to ship large data files together with a Python package,
a couple of scenarios are:</p>
<ul>
<li>data necessary for the functionality provided by the package, for example images or any binary or large text dataset; they could be required for just a subset of the package's functionality or for all of it</li>
<li>data necessary for unit or integration testing, both example inputs and expected outputs</li>
</ul>
<p>If the data are collectively less than 2 GB compressed and do not change very often, a simple and slightly hacky solution is to use GitHub release assets. To each tagged release on GitHub it is possible to attach one or more assets smaller than 2 GB, so you can attach data to each release. The downside is that users need to make sure to use the correct dataset for the release they are using, and the first time they use the software they need to install the Python package and also download the dataset and install it in the right folder. See <a href="https://gist.github.com/zonca/52857f2425942725fb74595c4f8600e9">an example script to upload from the command line</a>.</p>
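<p>As a sketch of this approach (the organization, repository, tag and asset names below are made up for illustration), release asset download URLs follow a predictable pattern, so a package can build the URL for the dataset matching its own release tag:</p>

```python
def release_asset_url(org, repo, tag, asset_name):
    """Public download URL of a GitHub release asset
    (org/repo/tag/asset names here are placeholders)."""
    return f"https://github.com/{org}/{repo}/releases/download/{tag}/{asset_name}"


print(release_asset_url("myorg", "mypkg", "v1.2", "data.tar.gz"))
# → https://github.com/myorg/mypkg/releases/download/v1.2/data.tar.gz
```

The returned URL can then be passed to any downloader, e.g. <code>urllib.request.urlretrieve</code>, during the package's first-run data setup.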
<p>If data files are individually less than 10 MB and collectively less than 100 MB, you can add them directly into the Python package. This is the easiest and most convenient option; for example, the <a href="https://github.com/astropy/package-template"><code>astropy package template</code></a> automatically adds to the package any file inside the <code>packagename/data</code> folder.</p>
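<p>For a minimal sketch of how such packaged data can be read back (the package and file names here are invented for the demo), the standard library's <code>pkgutil.get_data</code> loads a file shipped inside a package wherever the package happens to be installed; the snippet builds a throwaway package on the fly just to have something to read:</p>

```python
import os
import pkgutil
import sys
import tempfile

# Build a throwaway package with a bundled data file to demonstrate access.
# In a real project the data file ships inside the installed package instead.
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "mypackage", "data"))
open(os.path.join(root, "mypackage", "__init__.py"), "w").close()
with open(os.path.join(root, "mypackage", "data", "example.txt"), "w") as f:
    f.write("bundled dataset")

sys.path.insert(0, root)
# get_data resolves the path relative to the package, wherever it is installed
payload = pkgutil.get_data("mypackage", "data/example.txt")
print(payload.decode())  # → bundled dataset
```
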
<p>For larger datasets I recommend hosting the files externally and using the <a href="http://docs.astropy.org/en/stable/utils/#module-astropy.utils.data"><code>astropy.utils.data</code> module</a>.
This module automates the process of retrieving a file from a remote server and caching it locally (in the user's home folder); the next time the user needs it, it is automatically retrieved from the cache:</p>
<div class="highlight"><pre><span></span>from astropy.utils import data

dataurl = "https://my-web-server.ucsd.edu/test-data/"
with data.conf.set_temp("dataurl", dataurl), data.conf.set_temp(
    "remote_timeout", 30
):
    local_file_path = data.get_pkg_data_filename("myfile.jpg")
</pre></div>
<p>Now we need to host these files publicly; here are a few options.</p>
<h3>Host on a dedicated GitHub repository</h3>
<p>If files are individually less than 100 MB and collectively a few GB, you can create a dedicated repository on GitHub and push your files there.
Then <a href="https://help.github.com/en/articles/what-is-github-pages">activate GitHub Pages</a> so that those files are published at <code>https://your-organization.github.io/your-repository/</code>.
Then use this URL as <code>dataurl</code> in the above script.</p>
<h3>Host on a Supercomputer or own server</h3>
<p>Some supercomputers offer the feature of providing public web access from specific folders; for example, NERSC allows users to publish web pages publicly, see <a href="https://www.nersc.gov/users/computational-systems/pdsf/software-and-tools/hosting-webpages/">their documentation</a>.</p>
<p>This is very useful for huge datasets because you can automatically detect whether the package is being run at NERSC and, in that case, access the files directly by their path instead of downloading them.</p>
<p>For example:</p>
<div class="highlight"><pre><span></span>import os
import warnings

from astropy.utils import data


def get_data_from_url(filename):
    """Retrieves input templates from the remote server;
    in case data are available in one of the PREDEFINED_DATA_FOLDERS defined above,
    e.g. at NERSC, those are directly returned."""
    for folder in PREDEFINED_DATA_FOLDERS:
        full_path = os.path.join(folder, filename)
        if os.path.exists(full_path):
            warnings.warn(f"Access data from {full_path}")
            return full_path
    with data.conf.set_temp("dataurl", DATAURL), data.conf.set_temp(
        "remote_timeout", 30
    ):
        warnings.warn(f"Retrieve data for {filename} (if not cached already)")
        map_out = data.get_pkg_data_filename(filename, show_progress=True)
    return map_out
</pre></div>
<p>A similar setup can be achieved on a GNU/Linux server, for example a powerful machine shared by all members of a scientific team, where a folder is dedicated to hosting these data and is also published online with Apache or NGINX.</p>
<p>The main downside of this approach is that there is no built-in version control. One possibility is to enforce a policy where no files are ever overwritten, so that version control is automatically achieved through filenames. Otherwise, use <a href="https://git-lfs.github.com/"><code>git lfs</code></a> in that folder to track any change in a dedicated local <code>git</code> repository, e.g.:</p>
<div class="highlight"><pre><span></span>git init
git lfs track <span class="s2">"*.fits"</span>
git add <span class="s2">"*.fits"</span>
git commit -m <span class="s2">"initial version of all FITS files"</span>
</pre></div>
<p>This method tracks the checksum of all the binary files and helps manage the history, even if only locally (make sure the folder is also regularly backed up). You could push it to GitHub; that would cost $5/month for each 50 GB of storage.</p>
<h3>Host on Figshare</h3>
<p>You can upload files to Figshare using the browser and create a dataset, which also comes with a DOI and a page where you can save metadata about this object.</p>
<p>Once you have made the dataset public, you can find out the URL of the actual file, which is of the form <code>https://ndownloader.figshare.com/files/2432432432</code>; therefore we can set <code>https://ndownloader.figshare.com/files/</code> as the repository and use the integer defined in Figshare as the filename. Using integers as filenames makes it a bit cryptic, but it has the great advantage that other people can do the uploading to Figshare and you can point to their files as easily as if they were yours. This is more convenient than alternatives where you instead need to give other people access to your file repository.</p>
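<p>As a tiny sketch of this convention (reusing the example file id from above), the download URL is just the base URL plus the integer id, which is also exactly the "filename" you would pass to the data-retrieval code shown earlier:</p>

```python
FIGSHARE_BASEURL = "https://ndownloader.figshare.com/files/"


def figshare_download_url(file_id):
    """Direct download URL of a public Figshare file, given its integer id."""
    return FIGSHARE_BASEURL + str(file_id)


print(figshare_download_url(2432432432))
# → https://ndownloader.figshare.com/files/2432432432
```
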
<h3>Host on Amazon S3 or other object store</h3>
<p>A public bucket on Amazon S3 or another object store provides cheap storage and built-in version control.
The cost is currently about $0.026/GB/month.</p>
<p>First log in to the AWS console and create a new bucket; set it public by turning off "Block all public access" and, under "Access Control List", set "List objects" to Yes for "Public access".</p>
<p>You could upload files with the browser, but for larger files the command line is better.</p>
<p>The files will be available at <a href="https://bucket-name.s3-us-west-1.amazonaws.com/">https://bucket-name.s3-us-west-1.amazonaws.com/</a>; the hostname changes based on the chosen region.</p>
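<p>A small helper can compose the public URL of an object. This is a sketch using the legacy <code>s3-region</code> host pattern shown above; newer regions use a <code>s3.region</code> pattern instead, so check your bucket's page in the console:</p>

```python
def s3_public_url(bucket, region, key):
    """Public URL of an object in a public S3 bucket, using the legacy
    bucket.s3-region.amazonaws.com host pattern from the example above."""
    return f"https://{bucket}.s3-{region}.amazonaws.com/{key}"


print(s3_public_url("bucket-name", "us-west-1", "myfile.fits"))
# → https://bucket-name.s3-us-west-1.amazonaws.com/myfile.fits
```
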
<h4>(Advanced) Upload files from the command line</h4>
<p>This is optional and requires some more familiarity with AWS.
Go back to the Identity and Access Management (IAM) section of the AWS console, then "Users", create a user, and create a policy that gives access to only 1 bucket (replace <code>bucket-name</code>):</p>
<div class="highlight"><pre><span></span><span class="p">{</span>
<span class="nt">"Version"</span><span class="p">:</span> <span class="s2">"2012-10-17"</span><span class="p">,</span>
<span class="nt">"Statement"</span><span class="p">:</span> <span class="p">[</span>
<span class="p">{</span>
<span class="nt">"Sid"</span><span class="p">:</span> <span class="s2">"ListObjectsInBucket"</span><span class="p">,</span>
<span class="nt">"Effect"</span><span class="p">:</span> <span class="s2">"Allow"</span><span class="p">,</span>
<span class="nt">"Action"</span><span class="p">:</span> <span class="p">[</span><span class="s2">"s3:ListBucket"</span><span class="p">],</span>
<span class="nt">"Resource"</span><span class="p">:</span> <span class="p">[</span><span class="s2">"arn:aws:s3:::bucket-name"</span><span class="p">]</span>
<span class="p">},</span>
<span class="p">{</span>
<span class="nt">"Sid"</span><span class="p">:</span> <span class="s2">"AllObjectActions"</span><span class="p">,</span>
<span class="nt">"Effect"</span><span class="p">:</span> <span class="s2">"Allow"</span><span class="p">,</span>
<span class="nt">"Action"</span><span class="p">:</span> <span class="p">[</span>
<span class="s2">"s3:*Object"</span><span class="p">,</span>
<span class="s2">"s3:PutObjectAcl"</span>
<span class="p">],</span>
<span class="nt">"Resource"</span><span class="p">:</span> <span class="p">[</span><span class="s2">"arn:aws:s3:::bucket-name/*"</span><span class="p">]</span>
<span class="p">}</span>
<span class="p">]</span>
<span class="p">}</span>
</pre></div>
<p>See the <a href="https://docs.aws.amazon.com/IAM/latest/UserGuide/reference_policies_examples_s3_rw-bucket.html">AWS documentation</a>.</p>
<p>Install <code>s3cmd</code>, then run <code>s3cmd --configure</code> to set it up and paste the Access and Secret keys. It will fail to test the configuration because it cannot list all the buckets; choose to save the configuration anyway.</p>
<p>Test it:</p>
<div class="highlight"><pre><span></span> s3cmd ls s3://bucket-name
</pre></div>
<p>Then upload your files (reduced redundancy is cheaper):</p>
<div class="highlight"><pre><span></span> s3cmd put --reduced-redundancy --acl-public *.fits s3://bucket-name
</pre></div>Deploy Kubernetes and JupyterHub on Jetstream with Magnum2019-06-14T00:00:00-07:002019-06-14T00:00:00-07:00Andrea Zoncatag:zonca.github.io,2019-06-14:/2019/06/kubernetes-jupyterhub-jetstream-magnum.html<p>This tutorial deploys Kubernetes on Jetstream with Magnum and then
JupyterHub on top of that using <a href="https://zero-to-jupyterhub.readthedocs.io/">zero-to-jupyterhub</a>.</p>
<p>In my <a href="https://zonca.github.io/2019/02/kubernetes-jupyterhub-jetstream-kubespray.html">previous tutorials</a> I deployed Kubernetes using Kubespray. The main driver to using Magnum is that there is support for autoscaling, i.e. create and destroy Openstack instances based on the load …</p><p>This tutorial deploys Kubernetes on Jetstream with Magnum and then
JupyterHub on top of that using <a href="https://zero-to-jupyterhub.readthedocs.io/">zero-to-jupyterhub</a>.</p>
<p>In my <a href="https://zonca.github.io/2019/02/kubernetes-jupyterhub-jetstream-kubespray.html">previous tutorials</a> I deployed Kubernetes using Kubespray. The main driver for using Magnum is its support for autoscaling, i.e. creating and destroying Openstack instances based on the load on JupyterHub. I haven't tested that yet, though; that will come in a following tutorial.</p>
<p>Magnum is a technology built into Openstack to deploy container orchestration engines based on templates. The main difference with Kubespray is that Magnum is far less configurable: the user does not have access to modify those templates and has just a number of parameters to set. Kubespray, instead, is based on <code>ansible</code> and the user has full control of how the system is set up; it also supports more High Availability features like multiple master nodes.
On the other hand, the <code>ansible</code> recipe takes a very long time to run, ~30 min, while Magnum creates a cluster in about 10 minutes.</p>
<h2>Setup access to the Jetstream API</h2>
<p>First install the OpenStack client; please use these exact versions. Also, please run at Indiana, which currently has the Rocky release of Openstack; the TACC deployment has an older release of Openstack.</p>
<div class="highlight"><pre><span></span><span class="err">pip install python-openstackclient==3.16 python-magnumclient==2.10</span>
</pre></div>
<p>Load your API credentials from <code>openrc.sh</code>, check <a href="https://iujetstream.atlassian.net/wiki/spaces/JWT/pages/39682064/Setting+up+openrc.sh">documentation of the Jetstream wiki for details</a>.</p>
<p>You need to have a keypair uploaded to Openstack, this just needs to be done once per account. See <a href="https://iujetstream.atlassian.net/wiki/spaces/JWT/pages/35913730/OpenStack+command+line">the Jetstream documentation</a> under the section "Upload SSH key - do this once".</p>
<h2>Create the cluster with Magnum</h2>
<p>As usual, check out the repository with all the configuration files on the machine you will use the Jetstream API from, typically your laptop.</p>
<div class="highlight"><pre><span></span><span class="err">git clone https://github.com/zonca/jupyterhub-deploy-kubernetes-jetstream</span>
<span class="err">cd jupyterhub-deploy-kubernetes-jetstream</span>
<span class="err">cd kubernetes_magnum</span>
</pre></div>
<p>Now we are ready to use Magnum to first create a cluster template and then the actual cluster. First edit <code>create_cluster.sh</code> and set the parameters of the cluster at the top; also make sure to set the keypair name.
Finally run:</p>
<div class="highlight"><pre><span></span><span class="err">bash create_network.sh</span>
<span class="err">bash create_template.sh</span>
<span class="err">bash create_cluster.sh</span>
</pre></div>
<p>I have set up a test cluster with only 1 master node and 1 normal node, but you can modify that later.</p>
<p>Check the status of your cluster, after about 10 minutes, it should be in state <code>CREATE_COMPLETE</code>:</p>
<div class="highlight"><pre><span></span><span class="err">openstack coe cluster show k8s</span>
</pre></div>
<h3>Configure kubectl locally</h3>
<p>Install the <code>kubectl</code> client locally; first check the version of the master node:</p>
<div class="highlight"><pre><span></span><span class="err">openstack server list # find the floating public IP of the master node (starts with 149_</span>
<span class="err">IP=149.xxx.xxx.xxx</span>
<span class="err">ssh fedora@$IP</span>
<span class="err">kubectl version</span>
</pre></div>
<p>Now install the same version following the <a href="https://kubernetes.io/docs/tasks/tools/install-kubectl/">Kubernetes documentation</a></p>
<p>Now configure <code>kubectl</code> on your laptop to connect to the Kubernetes cluster created with Magnum:</p>
<div class="highlight"><pre><span></span><span class="err">mkdir kubectl_secret</span>
<span class="err">cd kubectl_secret</span>
<span class="err">openstack coe cluster config k8s</span>
</pre></div>
<p>This downloads a configuration file and the required certificates, and prints <code>export KUBECONFIG=/absolute/path/to/config</code>.</p>
<p>See also the <code>update_kubectl_secret.sh</code> script to automate this step; it requires the environment variable to be already set up.</p>
<p>Execute that <code>export</code> line and then:</p>
<div class="highlight"><pre><span></span><span class="err">kubectl get nodes</span>
</pre></div>
<h2>Configure storage</h2>
<p>Magnum configures a provisioner that knows how to create Kubernetes volumes using Openstack Cinder,
but does not configure a <code>storageclass</code>; we can do that with:</p>
<div class="highlight"><pre><span></span><span class="err">kubectl create -f storageclass.yaml</span>
</pre></div>
<p>We can test this by creating a Persistent Volume Claim:</p>
<div class="highlight"><pre><span></span>kubectl create -f persistent_volume_claim.yaml
kubectl describe pv
kubectl describe pvc
</pre></div>
<div class="highlight"><pre><span></span>Name:            pvc-e8b93455-898b-11e9-a37c-fa163efb4609
Labels:          failure-domain.beta.kubernetes.io/zone=nova
Annotations:     kubernetes.io/createdby: cinder-dynamic-provisioner
                 pv.kubernetes.io/bound-by-controller: yes
                 pv.kubernetes.io/provisioned-by: kubernetes.io/cinder
Finalizers:      [kubernetes.io/pv-protection]
StorageClass:    standard
Status:          Bound
Claim:           default/pvc-test
Reclaim Policy:  Delete
Access Modes:    RWO
Capacity:        5Gi
Node Affinity:   &lt;none&gt;
Message:
Source:
    Type:      Cinder (a Persistent Disk resource in OpenStack)
    VolumeID:  2795724b-ef11-4053-9922-d854107c731f
    FSType:
    ReadOnly:  false
    SecretRef: nil
Events:          &lt;none&gt;
</pre></div>
<p>We can also test creating an actual pod with a persistent volume and check
that the volume is successfully mounted and the pod started:</p>
<div class="highlight"><pre><span></span><span class="err">kubectl create -f ../alpine-persistent-volume.yaml</span>
<span class="err">kubectl describe pod alpine</span>
</pre></div>
<h3>Note about availability zones</h3>
<p>By default Openstack servers and Openstack volumes are created in different availability zones. This causes an issue with the default Magnum templates because the Kubernetes scheduler policy needs to be modified to allow it. Kubespray does this by default, so I created a <a href="https://github.com/zonca/magnum/pull/1">fix to be applied to the Jetstream Magnum templates</a>; this needs to be re-applied after every Openstack upgrade.</p>
<h2>Install Helm</h2>
<p>The Kubernetes deployment from Magnum is not as complete as the one from Kubespray: we need
to set up <code>helm</code> and the NGINX ingress ourselves. We would also need to set up a system to automatically
deploy HTTPS certificates; I'll add this later on.</p>
<p>First <a href="https://helm.sh/docs/using_helm/#installing-helm">install the Helm client on your laptop</a>, make
sure you have configured <code>kubectl</code> correctly.</p>
<p>Then we need to create a service account to give Helm enough privileges to reconfigure the cluster:</p>
<div class="highlight"><pre><span></span><span class="err">kubectl create -f tiller_service_account.yaml</span>
</pre></div>
<p>Then we can create the <code>tiller</code> pod inside Kubernetes:</p>
<div class="highlight"><pre><span></span><span class="err">helm init --service-account tiller --wait --history-max 200</span>
</pre></div>
<div class="highlight"><pre><span></span><span class="err">kubectl get pods --all-namespaces</span>
<span class="err">NAMESPACE NAME READY STATUS RESTARTS AGE</span>
<span class="err">kube-system coredns-78df4bf8ff-f2xvs 1/1 Running 0 2d</span>
<span class="err">kube-system coredns-78df4bf8ff-pnj7g 1/1 Running 0 2d</span>
<span class="err">kube-system heapster-74f98f6489-xsw52 1/1 Running 0 2d</span>
<span class="err">kube-system kube-dns-autoscaler-986c49747-2m64g 1/1 Running 0 2d</span>
<span class="err">kube-system kubernetes-dashboard-54cb7b5997-c2vwx 1/1 Running 0 2d</span>
<span class="err">kube-system openstack-cloud-controller-manager-tf5mc 1/1 Running 3 2d</span>
<span class="err">kube-system tiller-deploy-6b5cd64488-4fkff 1/1 Running 0 20s</span>
</pre></div>
<p>And check that all the versions agree:</p>
<div class="highlight"><pre><span></span><span class="err">helm version</span>
<span class="c">Client: &version.Version{SemVer:"v2.11.0", GitCommit:"2e55dbe1fdb5fdb96b75ff144a339489417b146b", GitTreeState:"clean"}</span>
<span class="c">Server: &version.Version{SemVer:"v2.11.0", GitCommit:"2e55dbe1fdb5fdb96b75ff144a339489417b146b", GitTreeState:"clean"}</span>
</pre></div>
<h2>Setup NGINX ingress</h2>
<p>We need the NGINX web server to act as a front-end to the services running inside the Kubernetes cluster.</p>
<h3>Open HTTP and HTTPS ports</h3>
<p>First we need to open the HTTP and HTTPS ports on the master node. You can either connect to the Horizon interface,
create a new security group named <code>http_https</code> and add 2 rules, choosing HTTP and HTTPS in the Rule drop-down; or do it from the command line:</p>
<div class="highlight"><pre><span></span><span class="err">openstack security group create http_https</span>
<span class="err">openstack security group rule create --ingress --protocol tcp --dst-port 80 http_https </span>
<span class="err">openstack security group rule create --ingress --protocol tcp --dst-port 443 http_https</span>
</pre></div>
<p>Then find the name of the master node in <code>openstack server list</code> and add this security group to that instance:</p>
<div class="highlight"><pre><span></span><span class="err">openstack server add security group k8s-xxxxxxxxxxxx-master-0 http_https</span>
</pre></div>
<h3>Install NGINX ingress with Helm</h3>
<div class="highlight"><pre><span></span><span class="err">bash install_nginx_ingress.sh</span>
</pre></div>
<p>Note: the documentation says we should add this annotation to the ingress with <code>kubectl edit ingress -n jhub</code>, but I found out it is not necessary:</p>
<div class="highlight"><pre><span></span>metadata:
  annotations:
    kubernetes.io/ingress.class: nginx
</pre></div>
<p>If this is correctly working, you should be able to run <code>curl localhost</code> from the master node and get a <code>Default backend: 404</code> message.</p>
<h2>Install JupyterHub</h2>
<p>Finally, we can go back to the root of the repository and install JupyterHub. First create the secrets file:</p>
<div class="highlight"><pre><span></span><span class="err">bash create_secrets.sh</span>
</pre></div>
<p>Then edit <code>secrets.yaml</code> and modify the hostname under <code>hosts</code> to be the hostname of your master Jetstream instance, i.e. if your instance's public floating IP is <code>aaa.bbb.xxx.yyy</code>, the hostname should be <code>js-xxx-yyy.jetstream-cloud.org</code> (without <code>http://</code>).</p>
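<p>The mapping from floating IP to hostname can be sketched as follows (the IP in the example is made up; the function just applies the <code>js-xxx-yyy.jetstream-cloud.org</code> pattern described above):</p>

```python
def jetstream_hostname(floating_ip):
    """Public hostname of a Jetstream instance whose floating IP is aaa.bbb.xxx.yyy."""
    xxx, yyy = floating_ip.split(".")[2:]
    return f"js-{xxx}-{yyy}.jetstream-cloud.org"


print(jetstream_hostname("149.165.168.200"))
# → js-168-200.jetstream-cloud.org
```
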
<p>You should also check that connecting with your browser to <code>js-xxx-yyy.jetstream-cloud.org</code> shows <code>default backend - 404</code>; this means NGINX is also reachable from the internet, i.e. the web port is open on the master node.</p>
<p>Finally:</p>
<div class="highlight"><pre><span></span><span class="err">bash configure_helm_jupyterhub.sh</span>
<span class="err">bash install_jhub.sh</span>
</pre></div>
<p>Connect with your browser to <code>js-xxx-yyy.jetstream-cloud.org</code> to check if it works.</p>
<h2>Issues and feedback</h2>
<p>Please <a href="https://github.com/zonca/jupyterhub-deploy-kubernetes-jetstream/">open an issue on the repository</a> to report any issue or give feedback. There you can also find out what I am working on next.</p>
<h2>Acknowledgments</h2>
<p>Many thanks to Jeremy Fischer and Mike Lowe for solving all my tickets; this required a lot of work on their end to make it work.</p>Webinar about distributed computing with Python2019-05-30T15:00:00-07:002019-05-30T15:00:00-07:00Andrea Zoncatag:zonca.github.io,2019-05-30:/2019/05/webinar-python-hpc.html<p>Recording available of the webinar I gave about "Distributed computing with Python":</p>
<ul>
<li>Threads vs Processes, GIL</li>
<li>Just-In-Time compilation with Numba</li>
<li>Processing data larger than memory with Dask</li>
<li>Distributed computing with Dask</li>
</ul>
<p>Live demo on my favorite Supercomputer Comet at the San Diego Supercomputer Center.</p>
<ul>
<li><a href="https://www.sdsc.edu/Events/training/webinars/distributed_parallel_computing_with_python_2019/recording/">Webinar recording</a></li>
<li>Notebooks: <a href="https://github.com/zonca/python_hpc_tutorial">https://github.com …</a></li></ul><p>Recording available of the webinar I gave about "Distributed computing with Python":</p>
<ul>
<li>Threads vs Processes, GIL</li>
<li>Just-In-Time compilation with Numba</li>
<li>Processing data larger than memory with Dask</li>
<li>Distributed computing with Dask</li>
</ul>
<p>Live demo on my favorite Supercomputer Comet at the San Diego Supercomputer Center.</p>
<ul>
<li><a href="https://www.sdsc.edu/Events/training/webinars/distributed_parallel_computing_with_python_2019/recording/">Webinar recording</a></li>
<li>Notebooks: <a href="https://github.com/zonca/python_hpc_tutorial">https://github.com/zonca/python_hpc_tutorial</a></li>
</ul>Kubernetes monitoring with Prometheus and Grafana2019-04-20T00:00:00-07:002019-04-20T00:00:00-07:00Andrea Zoncatag:zonca.github.io,2019-04-20:/2019/04/kubernetes-monitoring-prometheus-grafana.html<p>In a production Kubernetes deployment it is necessary to monitor the status of the cluster effectively.
The Kubernetes ecosystem provides Prometheus to gather data from the different components of Kubernetes and Grafana
to access those data and provide real-time plotting and inspection capability.
Moreover, they both provide systems …</p><p>In a production Kubernetes deployment it is necessary to monitor the status of the cluster effectively.
The Kubernetes ecosystem provides Prometheus to gather data from the different components of Kubernetes and Grafana
to access those data and provide real-time plotting and inspection capability.
Moreover, they both provide systems to send alerts in case some conditions on the state of the cluster are met, e.g. using more than 90% of RAM or CPU.</p>
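<p>As a sketch of what such a condition looks like, below is a hypothetical Prometheus alerting rule that fires when a node uses more than 90% of RAM; the metric names assume the standard <code>node-exporter</code> exporter and may differ in your setup:</p>

```yaml
# Hypothetical alerting rule; metric names assume node-exporter
groups:
  - name: memory
    rules:
      - alert: HighMemoryUsage
        expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) > 0.9
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Node {{ $labels.instance }} is using more than 90% of RAM"
```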
<p>The only downside is that the pods that handle monitoring consume some resources themselves, so this could be significant for small clusters of fewer than 5 nodes or so, but shouldn't be a problem for typical larger production deployments.</p>
<p>Both Prometheus and Grafana can be installed separately with Helm recipes or via the Prometheus operator Helm recipe,
however those deployments do not include any preconfigured dashboards. It is easier to get started with the <code>kube-prometheus</code> project,
which not only installs Prometheus and Grafana, but also preconfigures about 10 different Grafana dashboards to explore in depth
the status of a Kubernetes cluster.</p>
<p>The main issue is that customizing it is really complicated: it requires modifying <code>jsonnet</code> templates and recompiling them with a <code>jsonnet</code> builder which requires <code>go</code>. However, I don't foresee the need to do that for most users.</p>
<p>Unfortunately it is not based on Helm, so you need to first checkout the repository:</p>
<div class="highlight"><pre><span></span><span class="err">git clone https://github.com/coreos/kube-prometheus</span>
</pre></div>
<p>and then follow the instructions <a href="https://github.com/coreos/kube-prometheus#quickstart">in the documentation</a>,
copied here for convenience:</p>
<div class="highlight"><pre><span></span><span class="err">kubectl create -f manifests/</span>
</pre></div>
<p>wait a moment; do not worry if some of the tasks fail, they should get fixed by running:</p>
<div class="highlight"><pre><span></span><span class="err">kubectl apply -f manifests/</span>
</pre></div>
<p>This creates several pods in the <code>monitoring</code> namespace:</p>
<div class="highlight"><pre><span></span><span class="err">kubectl get pods -n monitoring</span>
<span class="err">NAME                                   READY   STATUS    RESTARTS   AGE</span>
<span class="err">alertmanager-main-0                    2/2     Running   0          13m</span>
<span class="err">alertmanager-main-1                    2/2     Running   0          13m</span>
<span class="err">alertmanager-main-2                    2/2     Running   0          13m</span>
<span class="err">grafana-9d97dfdc7-zkfft                1/1     Running   0          14m</span>
<span class="err">kube-state-metrics-7c7979b6bc-srcvk    4/4     Running   0          12m</span>
<span class="err">node-exporter-b6n2w                    2/2     Running   0          14m</span>
<span class="err">node-exporter-cgp46                    2/2     Running   0          14m</span>
<span class="err">prometheus-adapter-b7d894c9c-z2ph7     1/1     Running   0          14m</span>
<span class="err">prometheus-k8s-0                       3/3     Running   1          13m</span>
<span class="err">prometheus-k8s-1                       3/3     Running   1          13m</span>
<span class="err">prometheus-operator-65c44fb7b7-8ltzs   1/1     Running   0          14m</span>
</pre></div>
<p>Then you can set up port forwarding to expose Grafana locally:</p>
<div class="highlight"><pre><span></span><span class="err">kubectl --namespace monitoring port-forward svc/grafana 3000</span>
</pre></div>
<p>Access <code>localhost:3000</code> with your browser and you should be able to navigate through all the statistics of your cluster,
see for example the screenshot below. The default credentials are user <code>admin</code> and password <code>admin</code>.</p>
<p><img alt="Screenshot of the Grafana UI" src="/images/grafana.png"></p>
<h2>Access the UI from a different machine</h2>
<p>In case you are running the configuration on a remote server and you would like to access the Grafana UI (or any other service) from your laptop, you can install <code>kubectl</code> on your laptop as well, then copy the <code>.kube/config</code> to the laptop with:</p>
<div class="highlight"><pre><span></span><span class="err"> scp -r KUBECTLMACHINE:~/.kube/config ~/.kube</span>
</pre></div>
<p>and run:</p>
<div class="highlight"><pre><span></span><span class="err"> ssh ubuntu@$IP -f -L 6443:localhost:6443 sleep 3h &</span>
</pre></div>
<p>from the laptop and then run the <code>port-forward</code> command locally on the laptop.</p>
<h2>Monitor JupyterHub</h2>
<p>Once we have <a href="https://zonca.github.io/2019/02/kubernetes-jupyterhub-jetstream-kubespray.html">deployed JupyterHub with Helm</a>, we can pull up the
"namespace" monitor and select the <code>jhub</code> namespace to visualize resource usage but also usage requests and limits of all pods created by JupyterHub and its users. See a screenshot below.</p>
<p><img alt="Screenshot of the Grafana namespace UI" src="/images/grafana_jhub.png"></p>
<h2>Setup alerts</h2>
<p>Grafana supports email alerts, but it needs an SMTP server, which is not easy to set up, and messages risk being filtered as spam.
The easiest way is to set up an alert to Slack, and optionally be notified via email of Slack messages.</p>
<p>Follow the <a href="https://grafana.com/docs/alerting/notifications/#slack">instructions for Slack in the Grafana documentation</a>:</p>
<ul>
<li>Create a Slack app, name it e.g. Grafana</li>
<li>Add the "Incoming webhook" feature</li>
<li>Create an incoming webhook in the workspace and channel you prefer on Slack</li>
<li>In the Grafana Alerting menu, set the incoming webhook URL and the channel name</li>
</ul>
<p><img alt="Screenshot of the Grafana slack notification" src="/images/grafana_slack.png"></p>Inherit group permission in folder2019-03-24T18:00:00-07:002019-03-24T18:00:00-07:00Andrea Zoncatag:zonca.github.io,2019-03-24:/2019/03/folder-inherit-group-permission.html<p>I have googled this so many times...</p>
<p>On shared systems, like Supercomputers, you often belong to many different Unix
groups, and that membership allows you to access data from specific projects you
are working on and you can share data with your collaborators.</p>
<p>If you set SGID on a folder …</p><p>I have googled this so many times...</p>
<p>On shared systems, like Supercomputers, you often belong to many different Unix
groups, and that membership allows you to access data from specific projects you
are working on and you can share data with your collaborators.</p>
<p>If you set SGID on a folder, any folder or file created in that folder will automatically
belong to the Unix group of that folder, instead of your default group.
First set the right group on the folder, recursively, so that existing files get
the right permissions:</p>
<div class="highlight"><pre><span></span><span class="err">chgrp -R somegroup sharedfolder</span>
</pre></div>
<p>Then you set the SGID so future files will automatically belong to <code>somegroup</code>:</p>
<div class="highlight"><pre><span></span><span class="err">chmod g+s sharedfolder</span>
</pre></div>
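<p>A quick way to verify that the SGID bit took effect is to look at the permission string: the group-execute slot shows <code>s</code> (or <code>S</code> if the group lacks execute permission). A minimal demo on a scratch folder (the path is illustrative):</p>

```shell
# Create a scratch folder and set the SGID bit on it
demo=$(mktemp -d)/sharedfolder
mkdir -p "$demo"
chmod g+s "$demo"
# The group-execute slot (7th character) now reads "s", e.g. drwxr-sr-x
stat -c '%A' "$demo"
```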
<p>This is very useful for example in the <code>/project</code> filesystem at NERSC, you can set
the SGID so that every file that is copied to the shared <code>/project</code> filesystem is
accessible by other collaborators.</p>
<p>Related to this is also the default <code>umask</code>: most systems by default give "read" permission
for the group, so setting SGID is enough; otherwise it is also necessary to configure <code>umask</code> properly.</p>Scale Kubernetes manually on Jetstream2019-02-22T21:00:00-08:002019-02-22T21:00:00-08:00Andrea Zoncatag:zonca.github.io,2019-02-22:/2019/02/scale-kubernetes-jupyterhub-manually.html<p>We would like to modify the number of Openstack virtual machines available to Kubernetes.
Ideally we would like to do this automatically based on the load on JupyterHub, that is the
target.
For now we will increase and decrease the size manually.
This can be useful for example if you …</p><p>We would like to modify the number of Openstack virtual machines available to Kubernetes.
Ideally we would like to do this automatically based on the load on JupyterHub, that is the
target.
For now we will increase and decrease the size manually.
This can be useful for example if you make a test deployment with only 1 worker node a week
before a workshop and then scale it up to 10 or more instances the day before the workshop
begins.</p>
<p>This assumes you have <a href="http://zonca.github.io/2019/02/kubernetes-jupyterhub-jetstream-kubespray.html">deployed Kubernetes and JupyterHub already</a></p>
<h2>Create a new Openstack Virtual Machine with Terraform</h2>
<p>To add nodes, enter the <code>inventory/$CLUSTER</code> folder and edit <code>cluster.tf</code>, increasing <code>number_of_k8s_nodes_no_floating_ip</code>; in my testing I have increased it from 1 to 3.</p>
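<p>For reference, the relevant part of <code>cluster.tf</code> looks roughly like this (only <code>number_of_k8s_nodes_no_floating_ip</code> is taken from the text above, the other variable is illustrative):</p>

```hcl
# inventory/$CLUSTER/cluster.tf -- illustrative excerpt
number_of_k8s_masters              = 1
number_of_k8s_nodes_no_floating_ip = 3   # was 1; scaled up for testing
```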
<p>Then run <code>terraform_apply.sh</code> again; this runs Terraform and creates the new resources:</p>
<div class="highlight"><pre><span></span><span class="err">Apply complete! Resources: 2 added, 0 changed, 0 destroyed.</span>
</pre></div>
<p>Check first that your machine has booted correctly running:</p>
<div class="highlight"><pre><span></span><span class="err">openstack server list</span>
</pre></div>
<div class="highlight"><pre><span></span><span class="err">+--------------------------------------+---------------------+--------+--------------------------------------------+-------------------------------------+----------+</span>
<span class="err">| ID                                   | Name                | Status | Networks                                   | Image                               | Flavor   |</span>
<span class="err">+--------------------------------------+---------------------+--------+--------------------------------------------+-------------------------------------+----------+</span>
<span class="err">| 4ea73e65-2bff-42c9-8c4b-6c6928ad1b77 | zonca-k8s-node-nf-3 | ACTIVE | zonca_k8s_network=10.0.0.7                 | JS-API-Featured-Ubuntu18-Dec-7-2018 | m1.small |</span>
<span class="err">| 0cf1552e-ef0c-48b0-ac24-571301809273 | zonca-k8s-node-nf-2 | ACTIVE | zonca_k8s_network=10.0.0.11                | JS-API-Featured-Ubuntu18-Dec-7-2018 | m1.small |</span>
<span class="err">| e3731cde-cf6e-4556-8bda-0eebc0c7f08e | zonca-k8s-master-1  | ACTIVE | zonca_k8s_network=10.0.0.9, xxx.xxx.xxx.xx | JS-API-Featured-Ubuntu18-Dec-7-2018 | m1.small |</span>
<span class="err">| 443c6861-1a13-4080-b5a3-e005bb34a77c | zonca-k8s-node-nf-1 | ACTIVE | zonca_k8s_network=10.0.0.3                 | JS-API-Featured-Ubuntu18-Dec-7-2018 | m1.small |</span>
<span class="err">+--------------------------------------+---------------------+--------+--------------------------------------------+-------------------------------------+----------+</span>
</pre></div>
<p>As expected we have now 1 master and 3 nodes.</p>
<p>Then change the folder to the root of the repository and check you can connect to it with:</p>
<div class="highlight"><pre><span></span><span class="err">ansible -i inventory/$CLUSTER/hosts -m ping all</span>
</pre></div>
<p>If any of the new nodes is unreachable, you can try rebooting it with:</p>
<div class="highlight"><pre><span></span><span class="err">openstack server reboot zonca-k8s-node-nf-3</span>
</pre></div>
<h3>Configure the new instances for Kubernetes</h3>
<p><code>kubespray</code> has a special playbook, <code>scale.yml</code>, that impacts the nodes
already running as little as possible.
I have created a script <code>k8s_scale.sh</code> in the root folder of my <code>jetstream_kubespray</code> repository,
launch:</p>
<div class="highlight"><pre><span></span><span class="err">bash k8s_scale.sh</span>
</pre></div>
<p><a href="https://github.com/kubernetes-sigs/kubespray/blob/master/docs/getting-started.md#adding-nodes">See for reference the <code>kubespray</code> documentation</a></p>
<p>Once this completes (re-run it if it stops at some point), you should see what Ansible modified:</p>
<div class="highlight"><pre><span></span><span class="err">zonca-k8s-master-1  : ok=25  changed=3  unreachable=0 failed=0</span>
<span class="err">zonca-k8s-node-nf-1 : ok=247 changed=16 unreachable=0 failed=0</span>
<span class="err">zonca-k8s-node-nf-2 : ok=257 changed=77 unreachable=0 failed=0</span>
<span class="err">zonca-k8s-node-nf-3 : ok=257 changed=77 unreachable=0 failed=0</span>
</pre></div>
<p>At this point you should check the nodes are seen by Kubernetes with <code>kubectl get nodes</code>:</p>
<div class="highlight"><pre><span></span><span class="err">NAME                  STATUS   ROLES    AGE     VERSION</span>
<span class="err">zonca-k8s-master-1    Ready    master   4h29m   v1.12.5</span>
<span class="err">zonca-k8s-node-nf-1   Ready    node     4h28m   v1.12.5</span>
<span class="err">zonca-k8s-node-nf-2   Ready    node     5m11s   v1.12.5</span>
<span class="err">zonca-k8s-node-nf-3   Ready    node     5m11s   v1.12.5</span>
</pre></div>
<h2>Reduce the number of nodes</h2>
<p>Kubernetes is built to be resilient to node losses, so you could just brutally delete a node with <code>openstack server delete</code>. However, there is a dedicated playbook, <code>remove-node.yml</code>, to remove a node cleanly, migrating any running services to other nodes and lowering the risk of anything malfunctioning.
I created a script <code>k8s_remove_node.sh</code>, pass the name of the node you would like to eliminate (or a comma separated list of many names):</p>
<div class="highlight"><pre><span></span><span class="err">bash k8s_remove_node.sh zonca-k8s-node-nf-3</span>
</pre></div>
<p>Now the node has disappeared from <code>kubectl get nodes</code>, but the underlying Openstack instance is still running; delete it with:</p>
<div class="highlight"><pre><span></span><span class="err">openstack server delete zonca-k8s-node-nf-3</span>
</pre></div>
<p>For consistency you could now modify <code>inventory/$CLUSTER/cluster.tf</code> and reduce the number of nodes accordingly.</p>Deploy Kubernetes with Kubespray 2.8.2 and JupyterHub with helm recipe 0.8 on Jetstream2019-02-22T18:00:00-08:002019-02-22T18:00:00-08:00Andrea Zoncatag:zonca.github.io,2019-02-22:/2019/02/kubernetes-jupyterhub-jetstream-kubespray.html<p>Back in September 2018 I published a <a href="https://zonca.github.io/2018/09/kubernetes-jetstream-kubespray-jupyterhub.html">tutorial to deploy Kubernetes on Jetstream</a> using Kubespray.</p>
<p>Software in the Kubernetes space moves very fast, so I decided to update the recipe to use the newer Kubespray 2.8.2 that deploys Kubernetes v1.12.5.</p>
<p>Please follow the old tutorial and …</p><p>Back in September 2018 I published a <a href="https://zonca.github.io/2018/09/kubernetes-jetstream-kubespray-jupyterhub.html">tutorial to deploy Kubernetes on Jetstream</a> using Kubespray.</p>
<p>Software in the Kubernetes space moves very fast, so I decided to update the recipe to use the newer Kubespray 2.8.2 that deploys Kubernetes v1.12.5.</p>
<p>Please follow the old tutorial and note the updates below.</p>
<h3>Switch to kubespray 2.8.2</h3>
<p>Once you get my fork of kubespray with a few fixes for Jetstream:</p>
<div class="highlight"><pre><span></span><span class="err">git clone https://github.com/zonca/jetstream_kubespray</span>
</pre></div>
<p><strong>switch to the newer 2.8.2 version</strong></p>
<div class="highlight"><pre><span></span><span class="err">git checkout -b branch_v2.8.2 origin/branch_v2.8.2</span>
</pre></div>
<p>See an <a href="https://github.com/zonca/jetstream_kubespray/pull/5">overview of my changes compared to the standard <code>kubespray</code> release 2.8.2</a>.</p>
<h3>Use the new template</h3>
<p>The name of my template is now just <code>zonca</code> instead of <code>zonca_kubespray</code>:</p>
<p>Before running Terraform, inside <code>jetstream_kubespray</code>, copy from my template:</p>
<div class="highlight"><pre><span></span><span class="err">export CLUSTER=$USER</span>
<span class="err">cp -LRp inventory/zonca inventory/$CLUSTER</span>
<span class="err">cd inventory/$CLUSTER</span>
</pre></div>
<h3>Explore kubernetes</h3>
<p>In case you are interested in exploring some of the capabilities of Kubernetes, you can check <a href="https://zonca.github.io/2018/09/kubernetes-jetstream-kubespray-explore.html">the second part of my tutorial</a>, nothing in this section is required to run JupyterHub.</p>
<h3>Install JupyterHub</h3>
<p>Finally you can use <code>helm</code> to install JupyterHub, see the <a href="https://zonca.github.io/2018/09/kubernetes-jetstream-kubespray-jupyterhub.html">last part of my tutorial</a>.</p>
<p>Consider that I have updated the repository <a href="https://github.com/zonca/jupyterhub-deploy-kubernetes-jetstream">https://github.com/zonca/jupyterhub-deploy-kubernetes-jetstream</a> to install the <code>0.8.0</code> version of the <code>helm</code> package just released yesterday, see <a href="https://blog.jupyter.org/zero-to-jupyterhub-helm-chart-0-8-b99e0a79fd2a">their blog post with more details</a>.</p>
<h3>Thanks</h3>
<p>Thanks to the Kubernetes, Kubespray and JupyterHub community for delivering great open-source software and to XSEDE for giving me the opportunity to work on this. Special thanks to my collaborators Julien Chastang and Rich Signell.</p>Deploy Pangeo on Kubernetes deployment on Jetstream created with Kubespray2018-12-20T01:00:00-08:002018-12-20T01:00:00-08:00Andrea Zoncatag:zonca.github.io,2018-12-20:/2018/12/kubernetes-jetstream-kubespray-pangeo.html<p>The <a href="http://pangeo.io/">Pangeo collaboration for Big Data Geoscience</a> maintains a helm
chart with a preconfigured JupyterHub deployment on Kubernetes which also supports launching
private dask workers.
This is very useful because the Jupyter Notebook users can launch a cluster of worker
containers inside Kubernetes and process larger amounts of data than …</p><p>The <a href="http://pangeo.io/">Pangeo collaboration for Big Data Geoscience</a> maintains a helm
chart with a preconfigured JupyterHub deployment on Kubernetes which also supports launching
private dask workers.
This is very useful because the Jupyter Notebook users can launch a cluster of worker
containers inside Kubernetes and process larger amounts of data than they could using only
their notebook container.</p>
<h2>Setup Kubernetes on Jetstream with Kubespray</h2>
<p>First check out my <a href="https://zonca.github.io/2018/09/kubernetes-jetstream-kubespray.html">tutorial on deploying Kubernetes on Jetstream with Kubespray</a>.
You just need to complete the first part, <strong>do not install</strong> JupyterHub, it is installed
as part of the Pangeo deployment.</p>
<p>I also recommend setting up <code>kubectl</code> and <code>helm</code> to run locally so that the following steps can be executed on the local machine, see the instructions at the bottom of the tutorial mentioned above;
otherwise you need to <code>ssh</code> into the master node and type <code>helm</code> commands there.</p>
<h2>Install Pangeo with Helm</h2>
<p>Pangeo publishes a <a href="https://github.com/pangeo-data/helm-chart">Helm chart</a> (a software package for Kubernetes) and we can leverage that
to setup the deployment.</p>
<p>First add the repository:</p>
<div class="highlight"><pre><span></span><span class="err">helm repo add pangeo https://pangeo-data.github.io/helm-chart/</span>
<span class="err">helm repo update</span>
</pre></div>
<p>Then download my repository with all the configuration files and helper scripts:</p>
<div class="highlight"><pre><span></span><span class="err">git clone https://github.com/zonca/jupyterhub-deploy-kubernetes-jetstream</span>
</pre></div>
<p>Create a <code>secrets.yaml</code> file running:</p>
<div class="highlight"><pre><span></span><span class="err">bash create_secrets.sh</span>
</pre></div>
<p>Then head to the <code>pangeo_helm</code> folder and customize <a href="https://github.com/zonca/jupyterhub-deploy-kubernetes-jetstream/blob/master/pangeo_helm/config_jupyterhub_pangeo_helm.yaml"><code>config_jupyterhub_pangeo_helm.yaml</code></a>:</p>
<ul>
<li>I have prepopulated very small limits for testing, increase those for production</li>
<li>I am using the docker image <code>zonca/pangeo_notebook_rsignell</code>, you can remove <code>image:</code> and the 2 lines below to use the standard Pangeo notebook image (defined in their <a href="https://github.com/pangeo-data/helm-chart/blob/master/pangeo/values.yaml"><code>values.yaml</code></a>)</li>
<li>Copy <code>cookieSecret</code> and <code>secretToken</code> from <code>secrets.yaml</code> you created above</li>
<li>Customize <code>ingress</code> - <code>hosts</code> with the hostname of your master instance</li>
</ul>
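<p>A hypothetical excerpt of <code>config_jupyterhub_pangeo_helm.yaml</code> combining those customizations; the exact nesting and key names depend on the Pangeo chart version, so treat this as a sketch rather than a working configuration:</p>

```yaml
# Hypothetical excerpt; verify key names against the chart's values.yaml
jupyterhub:
  singleuser:
    image:
      name: zonca/pangeo_notebook_rsignell
      tag: latest
    memory:
      limit: 1G        # small limits for testing, increase for production
      guarantee: 512M
  proxy:
    secretToken: "<secretToken from secrets.yaml>"
  ingress:
    enabled: true
    hosts:
      - js-XXX-YYY.jetstream-cloud.org
```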
<p>Finally you can deploy it running:</p>
<div class="highlight"><pre><span></span><span class="err">bash install_pangeo.sh</span>
</pre></div>
<p>Login by pointing your browser at <a href="http://js-XXX-YYY.jetstream-cloud.org">http://js-XXX-YYY.jetstream-cloud.org</a>, the default dummy authenticator only needs a username and empty password.</p>
<h2>Customize and launch dask workers</h2>
<p>Once you log in to the Jupyter Notebook, you can customize the <code>worker-template.yaml</code> file available in your home folder;
I have <a href="https://github.com/zonca/jupyterhub-deploy-kubernetes-jetstream/blob/master/pangeo_helm/worker_template.yaml">an example of it with very small limits</a> in the <code>pangeo_helm</code> folder.</p>
<p>This file is used by <code>dask_kubernetes</code> to launch workers on your behalf, see for example the <code>dask-array.ipynb</code> notebook available in your home folder:</p>
<div class="highlight"><pre><span></span><span class="kn">from</span> <span class="nn">dask_kubernetes</span> <span class="kn">import</span> <span class="n">KubeCluster</span>
<span class="n">cluster</span> <span class="o">=</span> <span class="n">KubeCluster</span><span class="p">(</span><span class="n">n_workers</span><span class="o">=</span><span class="mi">3</span><span class="p">)</span>
<span class="n">cluster</span>
</pre></div>
<p>This will launch 3 workers on the cluster which are then available to launch jobs on with <a href="https://dask.pydata.org"><code>dask</code></a>.</p>
<p>You can check with <code>kubectl</code> that the workers are executing:</p>
<div class="highlight"><pre><span></span>$ kubectl get pods -n pangeo
NAME READY STATUS RESTARTS AGE
dask-zonca-d191b7a4-d8jhft <span class="m">1</span>/1 Running <span class="m">0</span> 28m
dask-zonca-d191b7a4-dx9dhs <span class="m">1</span>/1 Running <span class="m">0</span> 28m
dask-zonca-d191b7a4-dzmgvv <span class="m">1</span>/1 Running <span class="m">0</span> 28m
hub-55f5bf597-f5bnt <span class="m">1</span>/1 Running <span class="m">0</span> 55m
jupyter-zonca <span class="m">1</span>/1 Running <span class="m">0</span> 38m
proxy-66576956d7-r926j <span class="m">1</span>/1 Running <span class="m">0</span> 55m
</pre></div>
<p>And also access the Dask GUI, using the menu on the left or the link provided by <code>dask_kubernetes</code> inside the Notebook.</p>
<p><img alt="Screenshot of the Dask UI" src="/images/dask_ui_workers.png"></p>Setup two factor authentication for UCSD, and Lastpass2018-12-12T18:00:00-08:002018-12-12T18:00:00-08:00Andrea Zoncatag:zonca.github.io,2018-12-12:/2018/12/twofactor-auth-ucsd.html<p>Starting at the end of January 2019 UCSD requires every employee to have activated
two factor authentication.</p>
<p>Go over to <a href="https://duo-registration.ucsd.edu">https://duo-registration.ucsd.edu</a> to register your devices and
<a href="https://twostep.ucsd.edu">https://twostep.ucsd.edu</a> to read more details.</p>
<p>Here some suggestions after I have used this for a few months.</p>
<p>The …</p><p>Starting at the end of January 2019 UCSD requires every employee to have activated
two factor authentication.</p>
<p>Go over to <a href="https://duo-registration.ucsd.edu">https://duo-registration.ucsd.edu</a> to register your devices and
<a href="https://twostep.ucsd.edu">https://twostep.ucsd.edu</a> to read more details.</p>
<p>Here some suggestions after I have used this for a few months.</p>
<p>The most convenient option is definitely to have the Duo application installed on
your phone, so that once you try to login it sends a notification to your phone,
you click accept and you're done.</p>
<p>Second best is to use the Duo or the Google Authenticator app to generate codes,
then you can copy those codes into the login form, and this is anyway useful for
VPN access, you choose the "2 Steps secured - allthroughucsd" option, type your
password followed by a comma and the code, otherwise just the password and get a
push notification on your primary device.</p>
<p>Then you can just add a mobile number and receive a text or add a landline and
receive a call.</p>
<p>I also recommend buying a security key and adding it as an authentication option
at <a href="https://duo-registration.ucsd.edu">https://duo-registration.ucsd.edu</a>, either <a href="https://store.google.com/product/titan_security_key_kit">Google Titan</a> or a <a href="https://www.yubico.com/products/yubikey-hardware/">Yubico key</a> (I have a Titan), you can
keep it always with you so that if you don't have your phone or the phone battery
is dead, you can plug the security key in your USB port on the laptop and click on
its button to authenticate.</p>
<p>Another option is to request a fob token, a device that generates and displays timed codes and that
is independent of a phone, see <a href="https://blink.ucsd.edu/technology/security/services/two-step-login/guide.html#token">instructions on the UCSD website</a>. They say there are only a limited number available and you need
to be prepared to justify why you are requesting one.</p>
<h2>Other services</h2>
<p>Now that you already have Duo installed on your phone, I recommend to also activate
two factor auth on all other services:</p>
<ul>
<li>XSEDE</li>
<li>NERSC</li>
<li>Google</li>
<li>Github</li>
<li>Amazon</li>
<li>Microsoft</li>
<li>Dropbox</li>
</ul>
<p>Consider that most of them just request the second step verification if you are on
a new device, so you need to do the verification just once in a while and it provides
a lot of security. Many of those also support the security key.</p>
<h2>Password handling with Lastpass</h2>
<p><strong>Update October 2019</strong>: Fed up with Lastpass (their interface is clunky and slow, both in Chrome and Android), I switched to <a href="https://bitwarden.com">Bitwarden</a>. It is way better and also allows sharing with another user; the only downside is that they do not offer Duo push 2FA for free (you need premium), but it still supports using Duo as a token generator.</p>
<p>As you are into security, just go all the way and also install a password manager.
UCSD provides free enterprise accounts for all employees, see <a href="https://blink.ucsd.edu/technology/security/services/lastpass/index.html">the details</a>.</p>
<p>With Lastpass, you just remember 1 strong password to decrypt all of your other passwords.
If you ever used the Google Chrome builtin password manager, this is way way better.</p>
<p>You install the Lastpass extension on your browsers and the Lastpass app on your phone.</p>
<p>The only issue with Lastpass is that by default the Lastpass app on the smartphone automatically
logs out every 30 minutes or so, so you have to re-authenticate very often. This is due to UCSD
having configured it too strictly. I recommend having a personal account, saving all of your passwords
in the personal account, and then linking it from the Enterprise account.
Now from the desktop/laptop browsers you can use your Enterprise account, from the smartphone app instead
use the personal account.</p>
<p>You can also automatically import your Google Chrome passwords into Lastpass.</p>
<p>Now you have no excuse to re-use the same password, automatically generate a 20 char random password and save it in Lastpass.</p>
<h3>Save one-time codes</h3>
<p>When you activate two factor auth on Google/Github and many other services, they also give you some one-time codes that you can use to login to the service if you do not have access to your phone, you can save them as "Notes" into the related account inside Lastpass.</p>
<h3>Activate 2 factor auth for Lastpass</h3>
<p>You should also activate 2 factor auth in Lastpass, it also supports Duo so the configuration is similar to the configuration for UCSD. Only issue is that they do not support a security key here, so you can only add your smartphone.</p>Deploy JupyterHub on a Supercomputer for a workshop or tutorial 2018 edition2018-11-07T11:00:00-08:002018-11-07T11:00:00-08:00Andrea Zoncatag:zonca.github.io,2018-11-07:/2018/11/jupyterhub-supercomputer.html<p>I described how to deploy JupyterHub with each user session running on a different
node of a Supercomputer in <a href="https://arxiv.org/abs/1805.04781">my paper for PEARC18</a>,
however things are moving fast in the space and I am employing a different strategy
this year, in particular relying on <a href="https://the-littlest-jupyterhub.readthedocs.io">the littlest JupyterHub project</a>
for the …</p><p>I described how to deploy JupyterHub with each user session running on a different
node of a Supercomputer in <a href="https://arxiv.org/abs/1805.04781">my paper for PEARC18</a>,
however things are moving fast in the space and I am employing a different strategy
this year, in particular relying on <a href="https://the-littlest-jupyterhub.readthedocs.io">the littlest JupyterHub project</a>
for the initial deployment.</p>
<h2>Initial deployment of JupyterHub</h2>
<p><a href="https://the-littlest-jupyterhub.readthedocs.io">The littlest JupyterHub project</a> has great documentation
on how to deploy JupyterHub working on a single server on a wide array of providers.</p>
<p>In my case I logged in to the <a href="https://dashboard.cloud.sdsc.edu/">dashboard</a> of <a href="http://www.sdsc.edu/services/ci/cloud.html">SDSC Cloud</a>, an OpenStack
deployment at the San Diego Supercomputer Center, and requested an instance with 16 GB of RAM and 6 vCPUs running Ubuntu 18.04. Make sure you attach a floating public IP to the instance and open port 22 for SSH and ports 80 and 443 for HTTP/HTTPS.</p>
<p>Then I followed the <a href="https://the-littlest-jupyterhub.readthedocs.io/en/latest/install/custom-server.html">installation tutorial for custom servers</a>. Just make sure that you first create on the virtual machine the admin user you specify in the installation script, and
use the same username as your Github account, as we will later set up Github authentication.</p>
<p>You can connect to the instance and check that JupyterHub is working, that you can log in with your user, and that you can access the admin panel;
for SDSC Cloud the address is <code>http://xxx-xxx-xxx-xxx.compute.cloud.sdsc.edu</code>, filled in with the instance's floating IP address.</p>
<h3>Setup HTTPS</h3>
<p>Follow the Littlest JupyterHub documentation on how to get an SSL certificate through Letsencrypt automatically; after this you should be able to access JupyterHub at <code>https://xxx-xxx-xxx-xxx.compute.cloud.sdsc.edu</code> or a custom domain you pointed there.</p>
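<p>The relevant commands follow the TLJH HTTPS documentation; the email and domain below are placeholders to replace with your own:</p>

```shell
# Enable automatic HTTPS certificates via Let's Encrypt (TLJH)
sudo tljh-config set https.enabled true
sudo tljh-config set https.letsencrypt.email you@example.com
sudo tljh-config add-item https.letsencrypt.domains jhub.example.org
sudo tljh-config reload proxy
```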
<h2>Authentication with Github</h2>
<p>Follow the Littlest JupyterHub documentation; just make sure to set the <code>http</code> address and not the <code>https</code> address.</p>
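<p>As a sketch of what the TLJH documentation walks you through (the client id, secret and callback domain are placeholders from your own Github OAuth application):</p>

```shell
# Configure the GitHub OAuthenticator in TLJH
sudo tljh-config set auth.GitHubOAuthenticator.client_id 'MY_CLIENT_ID'
sudo tljh-config set auth.GitHubOAuthenticator.client_secret 'MY_CLIENT_SECRET'
sudo tljh-config set auth.GitHubOAuthenticator.oauth_callback_url 'http://jhub.example.org/hub/oauth_callback'
sudo tljh-config set auth.type oauthenticator.github.GitHubOAuthenticator
sudo tljh-config reload
```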
<h2>Interface with Comet via batchspawner</h2>
<p>We want all users to run on Comet as a single "Gateway" user. As JupyterHub executes as the <code>root</code> user on the server, we create an SSH key for the <code>root</code> user and copy the public key to the home folder of the gateway user on Comet, so that we can SSH without a password.</p>
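<p>A sketch of this step, assuming a hypothetical gateway account named <code>gateway</code>; the host name is also a placeholder:</p>

```shell
# On the JupyterHub server, as root: create a key pair with no passphrase
ssh-keygen -t rsa -N "" -f /root/.ssh/id_rsa
# Append the public key to the gateway user's authorized_keys on Comet
ssh-copy-id -i /root/.ssh/id_rsa.pub gateway@comet.sdsc.edu
# Verify that password-less login now works
ssh gateway@comet.sdsc.edu hostname
```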
<p>Instead, if you would like each user to utilize their own XSEDE account, you need them to authenticate via XSEDE and get a certificate from the XSEDE API that can be used to login to Comet on behalf of the user, see <a href="https://github.com/jupyterhub/jupyterhub-deploy-hpc/tree/master/batchspawner-xsedeoauth-sshtunnel-sdsccomet">an example deployment of this</a>.</p>
<p>First install <code>batchspawner</code> with <code>pip</code> in the Python environment of the hub; this is different from the Python environment of the users. You can access it by logging in as the <code>root</code> user and running:</p>
<div class="highlight"><pre><span></span>export PATH=/opt/tljh/hub/bin:<span class="cp">${</span><span class="n">PATH</span><span class="cp">}</span>
</pre></div>
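<p>With the hub environment first in your <code>PATH</code>, the install itself is then just:</p>

```shell
export PATH=/opt/tljh/hub/bin:${PATH}
# this now resolves to the hub's own Python environment, not the users'
python3 -m pip install batchspawner
```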
<p>Set the configuration file: see <a href="https://gist.github.com/zonca/55f7949983e56088186e99db53548ded"><code>spawner.py</code> on this Gist</a> and copy it into the <code>/opt/tljh/config/jupyterhub_config.d</code> folder. Then add the private SSH key of the tunnelbot user, a user on the Virtual Machine with no shell (set <code>/bin/false</code> in <code>/etc/passwd</code>) that can nevertheless set up an SSH tunnel from Comet back to the Hub.</p>
<p>Also customize all paths and usernames in the file.</p>
<p>Reload the JupyterHub configuration with:</p>
<div class="highlight"><pre><span></span><span class="err">tljh-config reload</span>
</pre></div>
<p>You can then check the Hub logs with <code>sudo journalctl -r -u jupyterhub</code>.</p>
<p>The most complicated part is making sure that the environment variables defined by JupyterHub (the most important being the token which allows the singleuser server to authenticate itself with the Hub) are correctly propagated through SSH. See in <code>spawner.py</code> how I explicitly pass the variables over SSH.</p>
<p>Also, as all workshop participants access Comet with the same user account, I automatically create a folder named after their Github username and check out the Notebooks for the workshop into that folder, then start JupyterLab there so that the users do not interfere with each other. We are not worrying about security here: with the current setup a user can open a terminal inside JupyterLab and access the folder of another person.</p>
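<p>A minimal sketch of that per-user setup on Comet; the paths, the repository URL and the way the username reaches the script are all assumptions, the real logic lives in <code>spawner.py</code>:</p>

```shell
#!/bin/bash
# Hypothetical: the spawner passes the Github username as the first argument
GH_USER="$1"
WORKDIR="$HOME/workshop/$GH_USER"
mkdir -p "$WORKDIR"
# check out the workshop notebooks once per user (URL is a placeholder)
if [ ! -d "$WORKDIR/notebooks" ]; then
    git clone https://github.com/example/workshop-notebooks "$WORKDIR/notebooks"
fi
# batchspawner then starts the single-user server rooted in $WORKDIR
cd "$WORKDIR"
```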
<h2>How to setup the tunnelbot user</h2>
<ul>
<li>On the JupyterHub virtual machine, create a user named <code>tunnelbot</code></li>
<li><code>sudo su tunnelbot</code> to act as that user, then create a key with <code>ssh-keygen</code></li>
<li>enter the <code>.ssh</code> folder and <code>cp id_rsa.pub authorized_keys</code> so that the SSH key can be used from Comet to SSH password-less to the server</li>
<li>now get the <strong>private key</strong> from <code>/home/tunnelbot/.ssh/id_rsa</code> and paste it into <code>spawner.py</code></li>
<li>now make sure you set the shell of <code>tunnelbot</code> to <code>/bin/false</code> in <code>/etc/passwd</code></li>
<li>for increased security, please also follow the steps in <a href="https://askubuntu.com/questions/48129/how-to-create-a-restricted-ssh-user-for-port-forwarding">this Ask Ubuntu answer</a></li>
</ul>
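<p>The steps above can be condensed into a short script (assuming Ubuntu's <code>adduser</code>; run it on the JupyterHub virtual machine):</p>

```shell
# create the tunnelbot user without a password
sudo adduser --disabled-password --gecos "" tunnelbot
# generate its key pair with no passphrase
sudo -u tunnelbot mkdir -p /home/tunnelbot/.ssh
sudo -u tunnelbot ssh-keygen -t rsa -N "" -f /home/tunnelbot/.ssh/id_rsa
# allow the same key to log in as tunnelbot (for the tunnel from Comet)
sudo -u tunnelbot cp /home/tunnelbot/.ssh/id_rsa.pub /home/tunnelbot/.ssh/authorized_keys
# disable the shell: the account can only forward ports
sudo usermod -s /bin/false tunnelbot
```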
<h2>Acknowledgments</h2>
<p>Thanks to the Jupyter and JupyterHub teams for releasing great software with outstanding documentation, in particular Yuvi Panda for the simplicity and elegance in the design of the Littlest JupyterHub deployment.</p>Advanced pandas with Astrophysics example Notebook2018-10-26T18:00:00-07:002018-10-26T18:00:00-07:00Andrea Zoncatag:zonca.github.io,2018-10-26:/2018/10/pandas-astro-example.html<p>Taught a lesson today on advanced <code>python</code> and <code>pandas</code> based on an example application in Astrophysics with simulations of data from the <a href="https://en.wikipedia.org/wiki/Planck_(spacecraft)">Planck Satellite</a>, features also a Binder button to run it yourself. Jupyter Notebook available at: <a href="https://github.com/zonca/pandas-astro-example">https://github.com/zonca/pandas-astro-example</a> under CC-BY</p>Bring your computing to the San Diego Supercomputer Center2018-10-24T18:00:00-07:002018-10-24T18:00:00-07:00Andrea Zoncatag:zonca.github.io,2018-10-24:/2018/10/compute-at-sdsc.html<p>I am often asked what computing resources are available at the San Diego Supercomputer Center for scientists and what is the best way to be granted access. I decided to write a blog post with an overview of all the options, consider that I'm writing this in October 2018, so …</p><p>I am often asked what computing resources are available at the San Diego Supercomputer Center for scientists and what is the best way to be granted access. I decided to write a blog post with an overview of all the options, consider that I'm writing this in October 2018, so please cross-check on the official websites.</p>
<h2>Comet</h2>
<p>Our key resource is the <a href="http://www.sdsc.edu/support/user_guides/comet.html">Comet Supercomputer</a>, a traditional supercomputer with 2000 nodes, including 72 GPU nodes with 4 GPUs each.
Comet has powerful CPUs with 24 cores, lots of memory per node (128GB) and a very fast local flash drive on each node.
It is also suitable for running large amounts of single-node jobs, so you can exploit it even if you don't have multi-node parallel software.</p>
<p>Comet is an XSEDE resource. XSEDE is basically a consortium of many large US supercomputers dedicated to Science: it reviews applications from US scientists and grants them supercomputing resources for free. It is funded by the National Science Foundation.</p>
<h3>How to request resources on Comet</h3>
<p>The options below are ordered from the lowest to the largest amount of resources, which also means they are ordered by the amount of effort it takes to get each type of allocation.</p>
<p>Resources on Comet are billed in core hours (sometimes named SUs): if you request a Comet node for 1 hour you are charged 24 core hours. Comet GPUs are billed 14 core hours for each hour on each GPU; the newest Comet GPU nodes have P100s instead of K80s and are billed 1.5 times the older GPU nodes, i.e. 21 core hours per hour.
Comet also has a shared queue where you can request and be charged for a portion of a Comet node (you also get the proportional amount of memory), i.e. you can request 6 cores, pay only 6 core hours per hour, and get access to 32GB of RAM.</p>
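<p>As a worked example of these billing rules:</p>

```shell
# 4 hours on a full 24-core node
echo $(( 24 * 4 ))   # 96 core hours
# 4 hours on 6 cores in the shared queue
echo $((  6 * 4 ))   # 24 core hours
# 4 hours on one P100 GPU, billed at 21 core hours per GPU-hour
echo $(( 21 * 4 ))   # 84 core hours
```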
<h4>Trial allocation</h4>
<p>Anybody can request a trial allocation on Comet with a quick 1-paragraph justification and be approved within a day for 1000 core hours to be used within 6 months. This is useful to try Comet out and run some test jobs. See the <a href="https://portal.xsede.org/allocations/startup#trial">trial allocation page on the XSEDE website</a>.</p>
<h4>Campus champions</h4>
<p>Most US universities have a reference person that facilitates access to XSEDE supercomputers; it is often somebody in the Information Technology office, in the office of the Chancellor of Research, or a professor. This person is given a large amount of supercomputing hours on all XSEDE resources, and local professors, postdocs and graduate students can request to be added to this allocation and use many thousands of core hours, depending on availability.
Campus champions are currently available in 241 (!!) US institutions, <a href="https://www.xsede.org/web/site/community-engagement/campus-champions/current">see the list on the XSEDE website</a>.</p>
<h4>HPC@UC</h4>
<p>If you are at any of the University of California campuses, you have an expedited way of getting resources at SDSC.
You can submit a request for up to 1 million core hours (more often ~500K core hours) on Comet on the <a href="http://www.sdsc.edu/collaborate/hpc_at_uc.html">HPC@UC page</a>. It just requires a 3-page justification and is answered within 10 business days. You are not eligible if your research group has an active XSEDE allocation.</p>
<h4>Startup allocation</h4>
<p>Startup allocations are really quick to prepare: they just require a 1-page justification and the CV of the Principal Investigator, and grant up to 50K core hours on Comet. If your research is funded by NSF/NASA/NIH, remember to specify that. See the <a href="https://portal.xsede.org/allocations/startup">startup page on XSEDE for more details</a>.</p>
<p>They are reviewed continuously, so you should be approved within a few days. Generally you are supposed to utilize the amount of hours within 1 year, but if your science project is funded for a longer period, you can request a multi-year allocation.</p>
<h4>XRAC allocation</h4>
<p>XRAC allocations are full-fledged proposals; you can request up to a few million hours on Comet. Here you must provide a detailed justification of the resources requested and demonstrate that your software is able to efficiently scale up in parallel, i.e. if in production you want to run on 100 nodes, you should run it on 5/10/50/100 nodes and check that performance does not degrade too much with increased parallelism.
You should have performed those tests in a startup allocation.
The XRAC requests are reviewed quarterly; see the <a href="https://portal.xsede.org/allocations/research">Research allocations page</a>, where there is also a recorded webinar.</p>
<h2>Triton Shared Computing Cluster</h2>
<p>The Triton Shared Computing Cluster is a supercomputer at SDSC with specifications a bit lower than Comet's; it is not allocated through XSEDE, and resources are paid for by the users. XSEDE resources are always oversubscribed and often only a portion of the resources requested is granted, so scientific groups that do not get enough resources through XSEDE can complement them with an allocation on TSCC.</p>
<p>The easiest way to get computational hours on TSCC is a pay-as-you-go option where you buy an amount of core hours at 6 cents per core hour (academics have a lower rate based on affiliation).</p>
<p>But the most cost-effective way is to buy a node to be added to the cluster for 3 years with full hardware warranty, plus 1 extra year with no warranty (so if it breaks during that last year it needs to be removed).
You pay a fixed price to buy the node (~$6K) plus yearly operations (~$1.8K if not subsidized by your University, in UC campuses this is generally subsidized and is ~$.5K), see <a href="http://www.sdsc.edu/services/hpc/tscc-purchase.html">the updated costs on the TSCC page</a>, also get in touch with them directly for more details. You can also buy a node with GPUs.</p>
<p>Then, instead of having direct access to that node, you are given an allocation as big as the computing hours that your node provides to the cluster. This is great because it allows you not to be penalized for inconsistent usage patterns: you can pay for 1 node and then use tens of nodes together once in a while. If you have the yearly operations subsidized by campus, the cost per core hour is about 2 cents, which is quite competitive, and the cluster is in SDSC's machine room and professionally managed, updated and backed up.</p>
<h2>Colocation</h2>
<p>Larger collaborations might need dedicated resources: it is possible to buy your own nodes, in units of entire racks (48 Rack Units), which depending on the type of blades can be 12 or 24 nodes, and colocate them in SDSC's machine room. See the detailed costs on the <a href="http://www.sdsc.edu/services/it/colocation.html">colocation page</a>; this is a custom solution and it is not easy to give a simple cost estimate, so it is better to write and ask for a quote.</p>
<h2>Cloud resources (Virtual Machines)</h2>
<p>SDSC also manages an OpenStack deployment, which is especially suitable for running services, for example websites, databases and APIs, but is also suitable for long-running single-node jobs or interactive data analysis (think Jupyter Notebooks). And Kubernetes, of course! (See my <a href="https://zonca.github.io/2018/09/kubernetes-jetstream-kubespray.html">tutorial for Jetstream, which works also on SDSC Cloud</a>.)
This is equivalent to Amazon Elastic Compute Cloud (EC2): you pay for what you use. Within UC you provide a funding index and that is charged for each hour used; see the full pricing on the <a href="http://www.sdsc.edu/services/it/cloud.html">SDSC cloud page</a>. Roughly, you are charged 8 cents an hour for a Virtual Machine with 1 core and 4GB of RAM.</p>
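<p>At that rate, a rough estimate of the monthly cost of keeping such a VM always on:</p>

```shell
# $0.08/hour * 24 hours * 30 days, using the rate quoted above
awk 'BEGIN { printf "$%.2f per month\n", 0.08 * 24 * 30 }'   # prints $57.60 per month
```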
<h2>Feedback</h2>
<p>If you have questions please email me at zonca on the sdsc.edu domain or tweet @andreazonca.</p>Deploy JupyterHub on Kubernetes deployment on Jetstream created with Kubespray 3/32018-09-24T01:00:00-07:002018-09-24T01:00:00-07:00Andrea Zoncatag:zonca.github.io,2018-09-24:/2018/09/kubernetes-jetstream-kubespray-jupyterhub.html<p>All of the following assumes you are logged in to the master node of the <a href="https://zonca.github.io/2018/09/kubernetes-jetstream-kubespray.html">Kubernetes cluster deployed with kubespray</a> and checked out the repository:</p>
<p><a href="https://github.com/zonca/jupyterhub-deploy-kubernetes-jetstream">https://github.com/zonca/jupyterhub-deploy-kubernetes-jetstream</a></p>
<h2>Install Jupyterhub</h2>
<p>First run</p>
<div class="highlight"><pre><span></span><span class="err">bash create_secrets.sh</span>
</pre></div>
<p>to create the secret strings needed by JupyterHub then edit its output
<code>secrets …</code></p><p>All of the following assumes you are logged in to the master node of the <a href="https://zonca.github.io/2018/09/kubernetes-jetstream-kubespray.html">Kubernetes cluster deployed with kubespray</a> and checked out the repository:</p>
<p><a href="https://github.com/zonca/jupyterhub-deploy-kubernetes-jetstream">https://github.com/zonca/jupyterhub-deploy-kubernetes-jetstream</a></p>
<h2>Install Jupyterhub</h2>
<p>First run</p>
<div class="highlight"><pre><span></span><span class="err">bash create_secrets.sh</span>
</pre></div>
<p>to create the secret strings needed by JupyterHub then edit its output
<code>secrets.yaml</code> to make sure it is consistent, edit the <code>hosts</code> lines if needed. For example, supply the Jetstream DNS name of the master node <code>js-XXX-YYY.jetstream-cloud.org</code> (XXX and YYY are the last 2 groups of the floating IP of the instance AAA.BBB.XXX.YYY). See <a href="https://zonca.github.io/2018/09/kubernetes-jetstream-kubespray-explore.html">part 2</a>, "Publish service externally with ingress".</p>
<div class="highlight"><pre><span></span><span class="err">bash configure_helm_jupyterhub.sh</span>
<span class="err">bash install_jhub.sh</span>
</pre></div>
<p>Check some preliminary pods running with:</p>
<div class="highlight"><pre><span></span><span class="err">kubectl get pods -n jhub</span>
</pre></div>
<p>Once the <code>proxy</code> is running, even if <code>hub</code> is still in preparation, you can check
in the browser: you should get "Service Unavailable", which is a good sign that
the proxy is working.</p>
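<p>You can also check from the command line; a 503 status code while the hub is still starting is expected (the hostname is the placeholder from your <code>secrets.yaml</code>):</p>

```shell
# prints only the HTTP status code: expect 503 until the hub pod is ready
curl -s -o /dev/null -w "%{http_code}\n" http://js-XXX-YYY.jetstream-cloud.org/
```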
<h2>Customize JupyterHub</h2>
<p>After JupyterHub is deployed and integrated with Cinder for persistent volumes,
for any other customization (first of all authentication) you are in good hands, as the
<a href="https://zero-to-jupyterhub.readthedocs.io/en/stable/extending-jupyterhub.html">Zero-to-Jupyterhub documentation</a> is great.</p>
<p>The only setup that could be peculiar to the deployment on top of <code>kubespray</code> is HTTPS, see the next section.</p>
<h2>Setup HTTPS with letsencrypt</h2>
<p>Kubespray, instead of installing <code>kube-lego</code>, installs <a href="https://cert-manager.readthedocs.io/en/latest/index.html"><code>certmanager</code></a> to handle HTTPS certificates.</p>
<p>First we need to create an Issuer: set your email inside <code>setup_https_kubespray/https_issuer.yml</code> and create it with the usual:</p>
<div class="highlight"><pre><span></span><span class="err">kubectl create -f setup_https_kubespray/https_issuer.yml</span>
</pre></div>
<p>Then we can manually create a HTTPS certificate (<code>certmanager</code> can be configured to handle this automatically, but as we only need one domain this is pretty quick): edit <code>setup_https_kubespray/https_certificate.yml</code> and set the domain name of your master node, then create the certificate resource with:</p>
<div class="highlight"><pre><span></span><span class="err">kubectl create -f setup_https_kubespray/https_certificate.yml</span>
</pre></div>
<p>Finally we can configure JupyterHub to use this certificate: first edit your <code>secrets.yaml</code> following the file <code>setup_https_kubespray/example_letsencrypt_secrets.yaml</code> as an example, then update your JupyterHub configuration by running again:</p>
<div class="highlight"><pre><span></span><span class="err">bash install_jhub.sh</span>
</pre></div>
<h2>Setup HTTPS with custom certificates</h2>
<p>In case you have custom certificates for your domain, first create a secret in the jupyterhub namespace with:</p>
<div class="highlight"><pre><span></span><span class="err">kubectl create secret tls cert-secret --key ssl.key --cert ssl.crt -n jhub</span>
</pre></div>
<p>Then setup ingress to use this in <code>secrets.yaml</code>:</p>
<div class="highlight"><pre><span></span><span class="n">ingress</span><span class="o">:</span>
<span class="n">enabled</span><span class="o">:</span> <span class="kc">true</span>
<span class="n">hosts</span><span class="o">:</span>
<span class="o">-</span> <span class="n">js</span><span class="o">-</span><span class="n">XX</span><span class="o">-</span><span class="n">YYY</span><span class="o">.</span><span class="na">jetstream</span><span class="o">-</span><span class="n">cloud</span><span class="o">.</span><span class="na">org</span>
<span class="n">tls</span><span class="o">:</span>
<span class="o">-</span> <span class="n">hosts</span><span class="o">:</span>
<span class="o">-</span> <span class="n">js</span><span class="o">-</span><span class="n">XX</span><span class="o">-</span><span class="n">YYY</span><span class="o">.</span><span class="na">jetstream</span><span class="o">-</span><span class="n">cloud</span><span class="o">.</span><span class="na">org</span>
<span class="n">secretName</span><span class="o">:</span> <span class="n">cert</span><span class="o">-</span><span class="n">secret</span>
</pre></div>
<p>Eventually, you may need to update the certificate. This can be achieved with:</p>
<div class="highlight"><pre><span></span><span class="err">kubectl create secret tls cert-secret --key ssl.key --cert ssl.crt -n jhub \</span>
<span class="err"> --dry-run -o yaml | kubectl apply -f -</span>
</pre></div>
<h2>Setup custom HTTP headers</h2>
<p>After you have deployed JupyterHub, edit ingress:</p>
<div class="highlight"><pre><span></span><span class="err">kubectl edit ingress -n jhub</span>
</pre></div>
<p>Add a <code>configuration-snippet</code> line inside annotations:</p>
<div class="highlight"><pre><span></span><span class="n">metadata</span><span class="o">:</span>
<span class="n">annotations</span><span class="o">:</span>
<span class="n">kubernetes</span><span class="o">.</span><span class="na">io</span><span class="o">/</span><span class="n">tls</span><span class="o">-</span><span class="n">acme</span><span class="o">:</span> <span class="s2">"true"</span>
<span class="n">nginx</span><span class="o">.</span><span class="na">ingress</span><span class="o">.</span><span class="na">kubernetes</span><span class="o">.</span><span class="na">io</span><span class="o">/</span><span class="n">configuration</span><span class="o">-</span><span class="n">snippet</span><span class="o">:</span> <span class="o">|</span>
<span class="n">more_set_headers</span> <span class="s2">"X-Frame-Options: DENY"</span><span class="o">;</span>
<span class="n">more_set_headers</span> <span class="s2">"X-Xss-Protection: 1"</span><span class="o">;</span>
</pre></div>
<p>This doesn't require restarting or modifying any other resource.</p>
<h2>Modify the Kubernetes cluster size</h2>
<p>See a followup short tutorial on <a href="https://zonca.github.io/2019/02/scale-kubernetes-jupyterhub-manually.html">scaling Kubernetes manually</a>.</p>
<h2>Persistence of user data</h2>
<p>When a JupyterHub user logs in for the first time, a Kubernetes <code>PersistentVolumeClaim</code> of the size defined in the configuration file is created. This is a Kubernetes resource that defines a request for storage.</p>
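<p>For reference, a claim equivalent to what JupyterHub creates for a user looks roughly like this (the name and size are illustrative; <code>standard</code> is the Cinder-backed storage class set up by kubespray):</p>

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: claim-zonca
  namespace: jhub
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: standard
  resources:
    requests:
      storage: 1Gi
```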
<div class="highlight"><pre><span></span><span class="err">kubectl get pvc -n jhub</span>
<span class="err">NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE</span>
<span class="err">claim-zonca Bound pvc-c469967a-3968-11e9-aaad-fa163e9c7d08 1Gi RWO standard 2m34s</span>
<span class="err">hub-db-dir Bound pvc-353114a7-3968-11e9-aaad-fa163e9c7d08 1Gi RWO standard 6m34s</span>
</pre></div>
<p>Inspecting the claims we find out that we have a claim for the user and a claim to store the database of JupyterHub. They are already Bound because they have already been satisfied.</p>
<p>Those claims are then satisfied by our Openstack Cinder provisioner, which creates an Openstack volume and wraps it into a Kubernetes <code>PersistentVolume</code> resource:</p>
<div class="highlight"><pre><span></span><span class="err">kubectl get pv -n jhub</span>
<span class="err">NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS REASON AGE</span>
<span class="err">pvc-353114a7-3968-11e9-aaad-fa163e9c7d08 1Gi RWO Delete Bound jhub/hub-db-dir standard 8m52s</span>
<span class="err">pvc-c469967a-3968-11e9-aaad-fa163e9c7d08 1Gi RWO Delete Bound jhub/claim-zonca standard 5m4s</span>
</pre></div>
<p>This corresponds to Openstack volumes automatically mounted onto the node that is executing the user pod:</p>
<div class="highlight"><pre><span></span><span class="err">+--------------------------------------+-------------------------------------------------------------+-----------+------+----------------------------------------------+</span>
<span class="err">| ID | Name | Status | Size | Attached to |</span>
<span class="err">+--------------------------------------+-------------------------------------------------------------+-----------+------+----------------------------------------------+</span>
<span class="err">| e6eddaaa-d40d-4832-addd-a05343ec3a80 | kubernetes-dynamic-pvc-c469967a-3968-11e9-aaad-fa163e9c7d08 | in-use | 1 | Attached to zonca-k8s-node-nf-1 on /dev/sdc |</span>
<span class="err">| 00f1e822-8098-4633-804e-46ba44d7de7e | kubernetes-dynamic-pvc-353114a7-3968-11e9-aaad-fa163e9c7d08 | in-use | 1 | Attached to zonca-k8s-node-nf-1 on /dev/sdb |</span>
</pre></div>
<p>If the user disconnects, the Openstack volume is detached from the instance but it is not deleted, and it is mounted back, possibly on another instance, if the user logs back in.</p>
<h3>Delete and reinstall JupyterHub</h3>
<p>Delete the Helm release with:</p>
<div class="highlight"><pre><span></span><span class="err">helm delete --purge jhub</span>
</pre></div>
<p>As long as you do not delete the whole namespace, the volumes are not deleted; therefore you can re-deploy the same version or a newer version using <code>helm</code> and the same volume is mounted back for the user.</p>
<h3>Delete and recreate Openstack instances</h3>
<p>When we run terraform to delete all Openstack resources:</p>
<div class="highlight"><pre><span></span><span class="err">bash terraform_destroy.sh</span>
</pre></div>
<p>this does not include the Openstack volumes that are created by the Kubernetes persistent volume provisioner.</p>
<p>In case we are interested in keeping the same ip address, run instead:</p>
<div class="highlight"><pre><span></span><span class="err">bash terraform_destroy_keep_floatingip.sh</span>
</pre></div>
<p>The problem is that if we recreate Kubernetes, it doesn't know how to link the Openstack volumes to the Persistent Volumes of the users.
Therefore we need to back up the Persistent Volume and Persistent Volume Claim resources before tearing Kubernetes down:</p>
<div class="highlight"><pre><span></span><span class="err">kubectl get pvc -n jhub -o yaml > pvc.yaml</span>
<span class="err">kubectl get pv -n jhub -o yaml > pv.yaml</span>
</pre></div>
<p>I always recommend running <code>kubectl</code> on the local machine instead of the master node, because if you delete the master instance you lose any temporary modification to your scripts. In this case, even more importantly, if you are running on the master node please back up <code>pvc.yaml</code> and <code>pv.yaml</code> locally before running <code>terraform_destroy.sh</code> or they will be wiped out.</p>
<p>Then open the files with a text editor and delete the Persistent Volume and the Persistent Volume Claim related to <code>hub-db-dir</code>.</p>
<p>Edit <code>pv.yaml</code> and set:</p>
<div class="highlight"><pre><span></span><span class="err">  persistentVolumeReclaimPolicy: Retain</span>
</pre></div>
<p>Otherwise if you create the PV first, it is deleted because there is no PVC.</p>
<p>Also remove the <code>claimRef</code> section of all the volumes in <code>pv.yaml</code>, otherwise you get the error "two claims are bound to the same volume, this one is bound incorrectly" on the PVC.</p>
<p>Now we can proceed to create the cluster again and then restore the volumes with:</p>
<div class="highlight"><pre><span></span><span class="err">kubectl apply -f pv.yaml</span>
<span class="err">kubectl apply -f pvc.yaml</span>
</pre></div>
<h2>Feedback</h2>
<p>Feedback on this is very welcome, please open an issue on the <a href="https://github.com/zonca/jupyterhub-deploy-kubernetes-jetstream">Github repository</a> or email me at <code>zonca</code> on the domain of the San Diego Supercomputer Center (sdsc.edu).</p>Explore a Kubernetes deployment on Jetstream with Kubespray 2/32018-09-23T23:00:00-07:002018-09-23T23:00:00-07:00Andrea Zoncatag:zonca.github.io,2018-09-23:/2018/09/kubernetes-jetstream-kubespray-explore.html<p>This is the second part of the tutorial on deploying Kubernetes with <code>kubespray</code> and JupyterHub
on Jetstream.</p>
<p>In the <a href="https://zonca.github.io/2018/09/kubernetes-jetstream-kubespray.html">first part, we installed Kubernetes on Jetstream with <code>kubespray</code></a>.</p>
<p>It is optional, its main purpose is to familiarize with the Kubernetes deployment on Jetstream
and how the different components play together …</p><p>This is the second part of the tutorial on deploying Kubernetes with <code>kubespray</code> and JupyterHub
on Jetstream.</p>
<p>In the <a href="https://zonca.github.io/2018/09/kubernetes-jetstream-kubespray.html">first part, we installed Kubernetes on Jetstream with <code>kubespray</code></a>.</p>
<p>This part is optional; its main purpose is to familiarize you with the Kubernetes deployment on Jetstream
and how the different components play together before installing JupyterHub.
If you are already familiar with Kubernetes you can skip to the <a href="https://zonca.github.io/2018/09/kubernetes-jetstream-kubespray-jupyterhub.html">next part, where we will be installing
JupyterHub using the zerotojupyterhub helm recipe</a>.</p>
<p>All the files for the examples below are available on Github;
first SSH to the master node (or do this locally if you set up <code>kubectl</code> locally):</p>
<div class="highlight"><pre><span></span><span class="err">git clone https://github.com/zonca/jupyterhub-deploy-kubernetes-jetstream</span>
<span class="err">cd jupyterhub-deploy-kubernetes-jetstream</span>
</pre></div>
<h2>Test persistent storage with cinder</h2>
<p>The most important feature that brought me to choose <code>kubespray</code> as the method for installing Kubernetes
is that it automatically sets up persistent storage exploiting Jetstream Volumes.
The Jetstream team already does a great job in providing a persistent storage solution with adequate
redundancy via the Cinder project, part of OpenStack.</p>
<p><code>kubespray</code> sets up a Kubernetes provisioner so that when a container requests persistent storage,
it talks to the Openstack API and has a dedicated volume (the same type you can create with the
Jetstream Horizon web interface) automatically created and exposed to Kubernetes.</p>
<p>This is achieved through a storageclass:</p>
<div class="highlight"><pre><span></span><span class="err">kubectl get storageclass</span>
<span class="err">NAME PROVISIONER AGE</span>
<span class="err">standard (default) kubernetes.io/cinder 1h</span>
</pre></div>
<p>See the file <code>alpine-persistent-volume.yaml</code> in the repository on how we can request a Cinder volume
to be created and attached to a pod.</p>
<div class="highlight"><pre><span></span><span class="err">kubectl create -f alpine-persistent-volume.yaml</span>
</pre></div>
<p>We can test it by getting a terminal inside the container (<code>alpine</code> has no <code>bash</code>):</p>
<div class="highlight"><pre><span></span><span class="err">kubectl exec -it alpine -- /bin/sh</span>
</pre></div>
<p>Run <code>df -h</code> and check that there is a 5GB mounted file system, which is persistent.</p>
<p>Also, back on the machine with <code>openstack</code> access, see how an Openstack volume was dynamically created and attached to the running instance:</p>
<div class="highlight"><pre><span></span><span class="err">openstack volume list</span>
<span class="err">+--------------------------------------+-------------------------------------------------------------+--------+------+--------------------------------------------------+</span>
<span class="err">| ID | Name | Status | Size | Attached to |</span>
<span class="err">+--------------------------------------+-------------------------------------------------------------+--------+------+--------------------------------------------------+</span>
<span class="err">| 508f1ee7-9654-4c84-b1fc-76dd8751cd6e | kubernetes-dynamic-pvc-e83ec4d6-bb9f-11e8-8344-fa163eb22e63 | in-use | 5 | Attached to kubespray-k8s-node-nf-1 on /dev/sdb |</span>
<span class="err">+--------------------------------------+-------------------------------------------------------------+--------+------+--------------------------------------------------+</span>
</pre></div>
<h2>Test ReplicaSets, Services and Ingress</h2>
<p>In this section we will explore how to build redundancy and scale in a service with a
simple example included in the book <a href="https://github.com/luksa/kubernetes-in-action/tree/master/Chapter02/kubia">Kubernetes in Action</a>,
which, by the way, I highly recommend for getting started with Kubernetes.</p>
<p>First let's deploy a service in our Kubernetes cluster;
it simply answers HTTP requests on port 8080 with the message "You've hit kubia-manual":</p>
<div class="highlight"><pre><span></span><span class="err">cd kubia_test_ingress</span>
<span class="err">kubectl create -f kubia-manual.yaml</span>
</pre></div>
<p>We can test it by checking which IP Kubernetes assigned to the pod:</p>
<div class="highlight"><pre><span></span><span class="err">kubectl get pods -o wide</span>
</pre></div>
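<p>Instead of copying the IP by hand, we can extract it from the table; a sketch, shown on a captured sample line so it is self-contained (the IP and node name are hypothetical):</p>

```shell
# In a live cluster you can query the field directly:
#   kubectl get pod kubia-manual -o jsonpath='{.status.podIP}'
# Here we extract the 6th column (IP) from a sample `kubectl get pods -o wide` line:
sample='kubia-manual   1/1   Running   0   5m   10.233.90.7   k8s-node-nf-1'
KUBIA_MANUAL_IP=$(echo "$sample" | awk '{print $6}')
echo "$KUBIA_MANUAL_IP"
```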
<p>and assign it to the <code>KUBIA_MANUAL_IP</code> variable, then on one of the nodes:</p>
<div class="highlight"><pre><span></span>$ curl <span class="nv">$KUBIA_MANUAL_IP</span>:8080
You<span class="err">'</span>ve hit kubia-manual
</pre></div>
<p>Finally, delete it:</p>
<div class="highlight"><pre><span></span><span class="err">kubectl delete -f kubia-manual.yaml</span>
</pre></div>
<h3>Load balancing with ReplicaSets and Services</h3>
<p>Now we want to scale this service up and provide a set of 3 pods instead of just 1:</p>
<div class="highlight"><pre><span></span><span class="err">kubectl create -f kubia-replicaset.yaml</span>
</pre></div>
<p>Now we could access those pods on 3 different IP addresses, but we would like
a single entry point with automatic load balancing across them, so we create
a Kubernetes "Service" resource:</p>
<div class="highlight"><pre><span></span><span class="err">kubectl create -f kubia-service.yaml</span>
</pre></div>
<p>And test it:</p>
<div class="highlight"><pre><span></span>$ kubectl get service
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT<span class="o">(</span>S<span class="o">)</span> AGE
kubernetes ClusterIP <span class="m">10</span>.233.0.1 <none> <span class="m">443</span>/TCP 22h
kubia ClusterIP <span class="m">10</span>.233.28.205 <none> <span class="m">80</span>/TCP 45m
</pre></div>
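<p>The cluster IP from the table above can be captured in a variable; a sketch using a sample line (on the cluster, pipe the real <code>kubectl get service</code> output instead of <code>echo</code>, or use <code>-o jsonpath='{.spec.clusterIP}'</code>):</p>

```shell
# Extract the third column (CLUSTER-IP) of the row whose name is "kubia";
# the sample line mirrors the `kubectl get service` output above:
services='kubia   ClusterIP   10.233.28.205   <none>   80/TCP   45m'
KUBIA_SERVICE_IP=$(echo "$services" | awk '$1 == "kubia" {print $3}')
echo "$KUBIA_SERVICE_IP"
```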
<div class="highlight"><pre><span></span><span class="err">curl $KUBIA_SERVICE_IP</span>
</pre></div>
<p>The service listens on port 80, so we don't need <code>:8080</code> in the URL.
Run the command several times and check that different kubia pods answer.</p>
<h3>Publish service externally with ingress</h3>
<p>Open a browser and access the hostname of your master node at:</p>
<div class="highlight"><pre><span></span><span class="c">http://js-XXX-YYY.jetstream-cloud.org</span>
</pre></div>
<p>Here XXX-YYY are the last 2 groups of digits of the master instance's floating IP
AAA.BBB.XXX.YYY; each group can have 1, 2 or 3 digits.</p>
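<p>The hostname derivation can be scripted; a sketch, with a hypothetical floating IP:</p>

```shell
FLOATING_IP="149.165.157.93"                 # hypothetical master floating IP
XXX=$(echo "$FLOATING_IP" | cut -d. -f3)     # third group of digits
YYY=$(echo "$FLOATING_IP" | cut -d. -f4)     # fourth group of digits
MASTER_HOST="js-${XXX}-${YYY}.jetstream-cloud.org"
echo "http://${MASTER_HOST}"
```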
<p>The server should respond with a 404 error.</p>
<p>At this point, edit the <code>kubia-ingress.yaml</code> file and replace the <code>host</code> value with the master node domain name you just derived.</p>
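<p>The <code>host</code> replacement can also be done with <code>sed</code>; a sketch demonstrated on a sample file so it is self-contained (on the cluster, run the <code>sed</code> line against <code>kubia-ingress.yaml</code> instead; the hostname is hypothetical):</p>

```shell
MASTER_HOST="js-157-93.jetstream-cloud.org"   # hypothetical master hostname
# sample stand-in for kubia-ingress.yaml:
printf 'spec:\n  rules:\n  - host: placeholder.example.com\n' > sample-ingress.yaml
# replace whatever host is configured with the derived one:
sed -i "s/host: .*/host: ${MASTER_HOST}/" sample-ingress.yaml
grep 'host:' sample-ingress.yaml
```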
<p>Now:</p>
<div class="highlight"><pre><span></span><span class="err">kubectl create -f kubia-ingress.yaml</span>
<span class="err">kubectl get ingress</span>
</pre></div>
<p>Try again in the browser. You should now see something like:</p>
<p>"You've hit kubia-jqwwp"</p>
<p>Force-reload the browser page a few times and you will see that you are hitting different kubia pods.</p>
<p>Finally,</p>
<div class="highlight"><pre><span></span><span class="err">kubectl delete -f kubia-ingress.yaml</span>
</pre></div>Deploy Kubernetes on Jetstream with Kubespray 1/32018-09-23T18:00:00-07:002018-09-23T18:00:00-07:00Andrea Zoncatag:zonca.github.io,2018-09-23:/2018/09/kubernetes-jetstream-kubespray.html<p><strong>Please check the last version of this tutorial (which mostly redirects here but uses a newer <code>kubespray</code>) at <a href="https://zonca.github.io/2019/02/kubernetes-jupyterhub-jetstream-kubespray.html">https://zonca.github.io/2019/02/kubernetes-jupyterhub-jetstream-kubespray.html</a></strong></p>
<p>The purpose of this tutorial series is to deploy Jupyterhub on top of
Kubernetes on Jetstream.
This material was presented as a tutorial at the Gateways 2018 conference, see also <a href="https://figshare.com/articles/Hands-on_Tutorial_Deploying_Kubernetes_and_JupyterHub_on_Jetstream/7137884">the slides on Figshare</a>.</p>
<p>Compared to my <a href="https://zonca.github.io/2017/12/scalable-jupyterhub-kubernetes-jetstream.html">initial tutorial</a>, I focused on improving automation.
Instead of creating Jetstream instances via the Atmosphere web interface, then
SSHing into the instances and running <code>kubeadm</code>-based commands to set up Docker and Kubernetes, we will:</p>
<ul>
<li>Use the <code>terraform</code> recipe part of the <code>kubespray</code> project to interface with the Jetstream API and create a cluster of virtual machines</li>
<li>Run the <code>kubespray</code> ansible recipe to set up a production-ready Kubernetes deployment, optionally with High Availability features like redundant master nodes and much more, see <a href="http://kubespray.io">kubespray.io</a>.</li>
</ul>
<h2>Create Jetstream Virtual machines with Terraform</h2>
<p><code>kubespray</code> is able to deploy production-ready Kubernetes deployments and initially targeted only
commercial cloud platforms.</p>
<p>They recently added support for Openstack via a Terraform recipe which is available in <a href="https://github.com/kubernetes-incubator/kubespray/tree/master/contrib/terraform/openstack">their Github repository</a>.</p>
<p>Terraform allows us to execute recipes that describe a set of OpenStack resources and their relationships. In the context of this tutorial, we do not need to learn much about Terraform; we will simply configure and execute the recipe provided by <code>kubespray</code>.</p>
<h3>Requirements</h3>
<p>On Ubuntu 18.04, install <code>python3-openstackclient</code> with APT; any other platform works as well.
Also install <code>terraform</code> by copying the correct binary to <code>/usr/local/bin/</code>, see <a href="https://www.terraform.io/intro/getting-started/install.html">https://www.terraform.io/intro/getting-started/install.html</a>. The current version of the recipe requires Terraform <code>0.11.x</code>, <strong>not the newest 0.12</strong>.</p>
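<p>Installing Terraform boils down to unpacking a single binary; a sketch (0.11.14 is an assumed 0.11.x patch release, and the download itself is left commented out; check the releases page for the exact file):</p>

```shell
TF_VERSION=0.11.14                            # assumed 0.11.x patch release
TF_ZIP="terraform_${TF_VERSION}_linux_amd64.zip"
echo "https://releases.hashicorp.com/terraform/${TF_VERSION}/${TF_ZIP}"
# then, after downloading the zip printed above:
#   unzip "$TF_ZIP" && sudo mv terraform /usr/local/bin/ && terraform version
```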
<h3>Request API access</h3>
<p>In order to make sure your XSEDE account can access the Jetstream API, you need to contact the Helpdesk; see the <a href="https://iujetstream.atlassian.net/wiki/spaces/JWT/pages/39682057/Using+the+Jetstream+API">instructions on the Jetstream Wiki</a>. You will also receive your <strong>TACC</strong> password, which could be different from your XSEDE one (the username is generally the same).</p>
<p>Login to the TACC Horizon panel at <a href="https://tacc.jetstream-cloud.org/dashboard">https://tacc.jetstream-cloud.org/dashboard</a>, this is basically the low level web interface to OpenStack, a lot more complex and powerful than Atmosphere available at <a href="https://use.jetstream-cloud.org/application">https://use.jetstream-cloud.org/application</a>. Use <code>tacc</code> as domain, your TACC username (generally the same as your XSEDE username) and your TACC password.</p>
<p>First choose the right project you would like to charge to in the top dropdown menu (see the XSEDE website if you don't recognize the grant code).</p>
<p>Click on Compute / API Access and download the OpenRC V3 authentication file to your machine. Source it typing:</p>
<div class="highlight"><pre><span></span><span class="err">source XX-XXXXXXXX-openrc.sh</span>
</pre></div>
<p>It should ask for your TACC password. This configures all the environment variables needed by the <code>openstack</code> command line tool to interface with the OpenStack API.</p>
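<p>For reference, an OpenRC v3 file essentially exports variables along these lines (all values here are placeholders; the real file also prompts for your password with <code>read</code>):</p>

```shell
export OS_AUTH_URL=https://tacc.jetstream-cloud.org:5000/v3   # placeholder URL
export OS_PROJECT_ID=0123456789abcdef0123456789abcdef        # placeholder ID
export OS_USER_DOMAIN_NAME="tacc"
export OS_USERNAME="your-tacc-username"                      # placeholder
export OS_REGION_NAME="RegionOne"                            # placeholder
export OS_IDENTITY_API_VERSION=3
# the real file then prompts for the password, roughly:
#   read -sr OS_PASSWORD_INPUT; export OS_PASSWORD=$OS_PASSWORD_INPUT
```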
<p>Test with:</p>
<div class="highlight"><pre><span></span><span class="err">openstack flavor list</span>
</pre></div>
<p>This should return the list of available "sizes" of the Virtual Machines.</p>
<h3>Clone kubespray</h3>
<p>I had to make a few modifications to <code>kubespray</code> to adapt it to Jetstream and to backport bug fixes not yet merged upstream, so for now it is better to use my fork of <code>kubespray</code>:</p>
<div class="highlight"><pre><span></span><span class="err">git clone https://github.com/zonca/jetstream_kubespray</span>
</pre></div>
<p>See an <a href="https://github.com/zonca/jetstream_kubespray/pull/2">overview of my changes compared to the standard <code>kubespray</code> release 2.6.0</a>.</p>
<h3>Run Terraform</h3>
<p>Inside <code>jetstream_kubespray</code>, copy from my template:</p>
<div class="highlight"><pre><span></span><span class="err">export CLUSTER=$USER</span>
<span class="err">cp -LRp inventory/zonca_kubespray inventory/$CLUSTER</span>
<span class="err">cd inventory/$CLUSTER</span>
</pre></div>
<p>Open and modify <code>cluster.tf</code>: choose your image and the number of nodes.
Make sure to change the network name to something unique, like the expanded form of <code>$CLUSTER_network</code>.</p>
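<p>Note that braces matter when expanding the name; a sketch of the difference, since <code>$CLUSTER_network</code> on its own would be parsed as a single (undefined) variable:</p>

```shell
export CLUSTER=${USER:-demo}       # the tutorial sets this to your username
NETWORK_NAME="${CLUSTER}_network"  # braces: expand CLUSTER, then append _network
echo "$NETWORK_NAME"
# compare: "$CLUSTER_network" would look up a variable literally named CLUSTER_network
```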
<p>You can find suitable images (they need to be JS-API-Featured; you cannot use the same images used in Atmosphere) with:</p>
<div class="highlight"><pre><span></span><span class="err">openstack image list | grep "JS-API"</span>
</pre></div>
<p>I already preconfigured the network UUID both for IU and TACC, but you can cross-check it
by looking for the <code>public</code> network in:</p>
<div class="highlight"><pre><span></span><span class="err">openstack network list</span>
</pre></div>
<p>Initialize Terraform:</p>
<div class="highlight"><pre><span></span><span class="err">bash terraform_init.sh</span>
</pre></div>
<p>Create the resources:</p>
<div class="highlight"><pre><span></span><span class="err">bash terraform_apply.sh</span>
</pre></div>
<p>The last output log of Terraform should contain the floating IP of the master node as <code>k8s_master_fips</code>;
wait for the instance to boot, then SSH in with:</p>
<div class="highlight"><pre><span></span><span class="err">ssh ubuntu@$IP</span>
</pre></div>
<p>or <code>centos@$IP</code> for CentOS images.</p>
<p>Use OpenStack to inspect the resources that were created:</p>
<div class="highlight"><pre><span></span><span class="err">openstack server list</span>
<span class="err">openstack network list</span>
</pre></div>
<p>You can cleanup the virtual machines and all other Openstack resources (all data is lost) with <code>bash terraform_destroy.sh</code>.</p>
<h2>Install Kubernetes with <code>kubespray</code></h2>
<p>Change folder back to the root of the <code>jetstream_kubespray</code> repository.</p>
<p>First make sure you have a recent version of <code>ansible</code> installed; you also need additional modules,
so run:</p>
<div class="highlight"><pre><span></span><span class="err">pip install -r requirements.txt</span>
</pre></div>
<p>It is useful to create a <code>virtualenv</code> and install the packages inside it.
This will also install <code>ansible</code>; it is important to install <code>ansible</code> with <code>pip</code> so that the path to its modules is correct, so remove any pre-installed <code>ansible</code>.</p>
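<p>A sketch of the <code>virtualenv</code> setup (the environment name is arbitrary; the <code>if</code> guard only lets the sketch run outside the repository too):</p>

```shell
python3 -m venv kubespray-env          # create an isolated environment
. kubespray-env/bin/activate           # activate it in the current shell
# from the jetstream_kubespray root, install ansible and the other pinned deps:
if [ -f requirements.txt ]; then pip install -r requirements.txt; fi
```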
<p>Then following the <a href="https://github.com/kubernetes-incubator/kubespray/blob/master/contrib/terraform/openstack/README.md#ansible"><code>kubespray</code> documentation</a>, we setup <code>ssh-agent</code> so that <code>ansible</code> can SSH from the machine with public IP to the others:</p>
<div class="highlight"><pre><span></span><span class="err">eval $(ssh-agent -s)</span>
<span class="err">ssh-add ~/.ssh/id_rsa</span>
</pre></div>
<p>Test the connection through ansible:</p>
<div class="highlight"><pre><span></span><span class="err">ansible -i inventory/$CLUSTER/hosts -m ping all</span>
</pre></div>
<p>If a server does not answer the ping, first try to reboot it:</p>
<div class="highlight"><pre><span></span><span class="err">openstack server reboot $CLUSTER-k8s-node-nf-1</span>
</pre></div>
<p>Or delete it and run <code>terraform_apply.sh</code> to create it again.</p>
<p>Check <code>inventory/$CLUSTER/group_vars/all.yml</code>, in particular <code>bootstrap_os</code>: I set it to <code>ubuntu</code>; change it to <code>centos</code> if you used the CentOS 7 base image.</p>
<p>Due to a bug in the recipe, run (see details in the Troubleshooting notes below):</p>
<div class="highlight"><pre><span></span><span class="err">export OS_TENANT_ID=$OS_PROJECT_ID</span>
</pre></div>
<p>Finally run the full playbook; it is going to take a good 10 minutes:</p>
<div class="highlight"><pre><span></span><span class="err">ansible-playbook --become -i inventory/$CLUSTER/hosts cluster.yml</span>
</pre></div>
<p>If the playbook fails with "cannot lock the administrative directory", the Virtual Machine is running automatic updates and has locked the APT directory. Just wait a minute and launch the playbook again.</p>
<p>If the playbook gives any other error, retry the command above: sometimes tasks fail temporarily, and Ansible is designed to be executed multiple times with consistent results.</p>
<p>You should have now a Kubernetes cluster running, test it:</p>
<div class="highlight"><pre><span></span>$ ssh ubuntu@<span class="nv">$IP</span>
$ kubectl get pods --all-namespaces
NAMESPACE NAME READY STATUS RESTARTS AGE
cert-manager cert-manager-78fb746bc7-w9r94 <span class="m">1</span>/1 Running <span class="m">0</span> 2h
ingress-nginx default-backend-v1.4-7795cd847d-g25d8 <span class="m">1</span>/1 Running <span class="m">0</span> 2h
ingress-nginx ingress-nginx-controller-bdjq7 <span class="m">1</span>/1 Running <span class="m">0</span> 2h
kube-system kube-apiserver-zonca-kubespray-k8s-master-1 <span class="m">1</span>/1 Running <span class="m">0</span> 2h
kube-system kube-controller-manager-zonca-kubespray-k8s-master-1 <span class="m">1</span>/1 Running <span class="m">0</span> 2h
kube-system kube-dns-69f4c8fc58-6vhhs <span class="m">3</span>/3 Running <span class="m">0</span> 2h
kube-system kube-dns-69f4c8fc58-9jn25 <span class="m">3</span>/3 Running <span class="m">0</span> 2h
kube-system kube-flannel-7hd24 <span class="m">2</span>/2 Running <span class="m">0</span> 2h
kube-system kube-flannel-lhsvx <span class="m">2</span>/2 Running <span class="m">0</span> 2h
kube-system kube-proxy-zonca-kubespray-k8s-master-1 <span class="m">1</span>/1 Running <span class="m">0</span> 2h
kube-system kube-proxy-zonca-kubespray-k8s-node-nf-1 <span class="m">1</span>/1 Running <span class="m">0</span> 2h
kube-system kube-scheduler-zonca-kubespray-k8s-master-1 <span class="m">1</span>/1 Running <span class="m">0</span> 2h
kube-system kubedns-autoscaler-565b49bbc6-7wttm <span class="m">1</span>/1 Running <span class="m">0</span> 2h
kube-system kubernetes-dashboard-6d4dfd56cb-24f98 <span class="m">1</span>/1 Running <span class="m">0</span> 2h
kube-system nginx-proxy-zonca-kubespray-k8s-node-nf-1 <span class="m">1</span>/1 Running <span class="m">0</span> 2h
kube-system tiller-deploy-5c688d5f9b-fpfpg <span class="m">1</span>/1 Running <span class="m">0</span> 2h
</pre></div>
<p>Check that all those services are running in your cluster as well.
We have also configured NGINX to proxy any service that we will later deploy on Kubernetes;
test it with:</p>
<div class="highlight"><pre><span></span>$ wget localhost
--2018-09-24 <span class="m">03</span>:01:14-- http://localhost/
Resolving localhost <span class="o">(</span>localhost<span class="o">)</span>... <span class="m">127</span>.0.0.1
Connecting to localhost <span class="o">(</span>localhost<span class="o">)</span><span class="p">|</span><span class="m">127</span>.0.0.1<span class="p">|</span>:80... connected.
HTTP request sent, awaiting response... <span class="m">404</span> Not Found
<span class="m">2018</span>-09-24 <span class="m">03</span>:01:14 ERROR <span class="m">404</span>: Not Found.
</pre></div>
<p>Error 404 is a good sign: the service is up and serving requests, there is simply nothing to deliver yet.
Finally, verify that routing through the Jetstream instance works correctly by opening your browser
and checking that accessing <code>js-XX-XXX.jetstream-cloud.org</code> also returns a <code>default backend - 404</code> message.
If any of these tests hangs or cannot connect, there is probably a networking issue.</p>
<h2>Next</h2>
<p>Next you can <a href="https://zonca.github.io/2018/09/kubernetes-jetstream-kubespray-explore.html">explore the kubernetes deployment to learn more about how you deploy resources in the second part of my tutorial</a> or skip it and proceed directly to the <a href="http://zonca.github.io/2018/09/kubernetes-jetstream-kubespray-jupyterhub.html">third and final part of the tutorial and deploy Jupyterhub and configure it with HTTPS</a>.</p>
<h3>Troubleshooting notes</h3>
<p>Kept for future reference; you can disregard this section.</p>
<p>Failing ansible task: <code>openstack_tenant_id is missing</code></p>
<p>Fixed with <code>export OS_TENANT_ID=$OS_PROJECT_ID</code>; this should be resolved once <a href="https://github.com/kubernetes-incubator/kubespray/pull/2783">https://github.com/kubernetes-incubator/kubespray/pull/2783</a> is merged. In any case this is not blocking.</p>
<p>Failing task <code>Write cacert file</code>:</p>
<p>NOTE: I had to cherry-pick a commit from <a href="https://github.com/kubernetes-incubator/kubespray/pull/3280">https://github.com/kubernetes-incubator/kubespray/pull/3280</a>; this will be unnecessary once the fix lands upstream.</p>
<h2>(Optional) Setup kubectl locally</h2>
<p>We also set <code>kubectl_localhost: true</code> and <code>kubeconfig_localhost: true</code>
so that <code>kubectl</code> is installed on your local machine.</p>
<p>This also copies <code>admin.conf</code> to:</p>
<div class="highlight"><pre><span></span><span class="err">inventory/$CLUSTER/artifacts</span>
</pre></div>
<p>Now copy that file to <code>~/.kube/config</code>.</p>
<p>This file has an issue: it contains the internal IP of the Jetstream master node.
We cannot replace it with the public floating IP because the certificate is not valid for that address.
The best workaround is to replace it with <code>127.0.0.1</code> at the <code>server:</code> key inside <code>~/.kube/config</code>.
Then open an SSH tunnel:</p>
<div class="highlight"><pre><span></span><span class="err">ssh ubuntu@$IP -f -L 6443:localhost:6443 sleep 3h</span>
</pre></div>
<ul>
<li><code>-f</code> sends the process in the background</li>
<li>executing <code>sleep</code> for 3 hours makes the tunnel close automatically after 3 hours; alternatively, <code>-N</code> would keep the tunnel open indefinitely</li>
</ul>
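<p>The <code>server:</code> replacement can be sketched as follows; we demonstrate on a sample file with a hypothetical internal IP (run the same <code>sed</code> on <code>~/.kube/config</code> after backing it up):</p>

```shell
# sample stand-in for ~/.kube/config:
printf 'clusters:\n- cluster:\n    server: https://10.1.0.5:6443\n' > sample-kubeconfig
# rewrite the internal master IP so requests go through the SSH tunnel:
sed -i 's|server: https://[^:]*:6443|server: https://127.0.0.1:6443|' sample-kubeconfig
grep 'server:' sample-kubeconfig
```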
<h2>(Optional) Setup helm locally</h2>
<p>SSH into the master node and check the <code>helm</code> version with:</p>
<div class="highlight"><pre><span></span><span class="err">helm version</span>
</pre></div>
<p>Download the same binary version from <a href="https://github.com/helm/helm/releases">the release page on Github</a>
and copy the binary to <code>/usr/local/bin</code>. Then test it with:</p>
<div class="highlight"><pre><span></span><span class="err">helm ls</span>
</pre></div>PEARC18 Paper on Deploying Jupyterhub at scale on XSEDE2018-07-23T12:00:00-07:002018-07-23T12:00:00-07:00Andrea Zoncatag:zonca.github.io,2018-07-23:/2018/07/pearc18-paper-deploy-jupyterhub-xsede.html<p>Bob Sinkovits and I are presenting a paper at PEARC18 about:</p>
<p>"Deploying Jupyter Notebooks at scale on XSEDE resources for Science Gateways and workshops"</p>
<p>See the pre-print on Arxiv: <a href="https://arxiv.org/abs/1805.04781">https://arxiv.org/abs/1805.04781</a></p>
<p>Jupyter Notebooks provide an interactive computing environment well suited for Science.
JupyterHub is a multi-user …</p><p>Bob Sinkovits and I are presenting a paper at PEARC18 about:</p>
<p>"Deploying Jupyter Notebooks at scale on XSEDE resources for Science Gateways and workshops"</p>
<p>See the pre-print on Arxiv: <a href="https://arxiv.org/abs/1805.04781">https://arxiv.org/abs/1805.04781</a></p>
<p>Jupyter Notebooks provide an interactive computing environment well suited for Science.
JupyterHub is a multi-user Notebook environment developed by the Jupyter team.</p>
<p>In order to provide an adequate amount of memory and CPU to many users, for example during workshops,
it is necessary to leverage a distributed system, either spanning multiple Jetstream instances
or interfacing with a traditional HPC system.</p>
<p>In this work we present 3 strategies for deploying JupyterHub on XSEDE resources to support
a large number of users, each is linked to the step-by-step tutorial with all necessary configuration files:</p>
<ul>
<li><a href="https://zonca.github.io/2017/05/jupyterhub-hpc-batchspawner-ssh.html">deploy Jupyterhub on a single Jetstream instance and spawn Jupyter Notebook servers for each user on a computing node of a Supercomputer (for example Comet)</a></li>
<li><a href="https://zonca.github.io/2017/10/scalable-jupyterhub-docker-swarm-mode.html">deploy Jupyterhub on Jetstream using Docker Swarm to distributed the user's containers across many instances and providing persistent storage with quotas through a NFS share</a></li>
<li><a href="https://zonca.github.io/2017/12/scalable-jupyterhub-kubernetes-jetstream.html">deploy Jupyterhub on top of Kubernetes across Jetstream instances with persistent storage provided by the Ceph distributed filesystem</a></li>
</ul>
<p><a href="https://zonca.github.io/docs/pearc18_slides_zonca_sinkovits.pdf">Presentation slides</a></p>
<p>If you are an author at PEARC18, you can follow <a href="https://zonca.github.io/2018/05/pearc18-preprint-arxiv.html">my instructions on how to publish your preprint to Arxiv</a>.</p>Updated Singularity images for Comet2018-07-22T12:00:00-07:002018-07-22T12:00:00-07:00Andrea Zoncatag:zonca.github.io,2018-07-22:/2018/07/singularity-2.5-comet.html<p>Back in January 2017 I wrote a <a href="https://zonca.github.io/2017/01/singularity-hpc-comet.html">blog post about running Singularity on Comet</a>.</p>
<p>I recently needed to update all my container images to the latest scientific python packages,
so I also took the opportunity to create both a Docker auto-build repository on DockerHub
and a SingularityHub image.</p>
<p>Those images have a working MPI installation which has the same MPI version of Comet so
they can be used as a base for MPI programs.</p>
<p>The Docker image is based on the Jupyter Datascience notebook and therefore has Python, R and Julia;
the Singularity image on SingularityHub has only Python.
However, <code>singularity pull</code> also works with Docker containers, so the Docker container can easily
be turned into a Singularity container.</p>
<p>See <a href="https://github.com/zonca/singularity-comet">https://github.com/zonca/singularity-comet</a></p>Create DockerHub auto build2018-07-19T18:00:00-07:002018-07-19T18:00:00-07:00Andrea Zoncatag:zonca.github.io,2018-07-19:/2018/07/create-dockerhub-autobuild.html<p>It is very convenient to create Autobuild repositories on DockerHub linked to
a Github repository with a <code>Dockerfile</code>.
Then every time you commit to Github, Dockerhub is going to build the image on
their service and make it available on <a href="https://hub.docker.com">https://hub.docker.com</a> and can quickly
be pulled to any other system that supports Docker or Singularity.</p>
<p>Unfortunately, if you have many Github organizations and repositories, the process
of setting up a new repository gets stuck.</p>
<p>Fortunately we can bypass the issue by directly accessing the right URL, as suggested
<a href="https://stackoverflow.com/questions/42792240/dockerhub-create-automated-build-step-stuck-at-creating">on StackOverflow</a>.</p>
<p>I created a simple page to make this quicker, add the right parameters and it automatically
builds the right URL, see:</p>
<p><a href="https://zonca.github.io/docker-auto-build">https://zonca.github.io/docker-auto-build</a></p>How to organize code and data for simulations at NERSC2018-06-20T18:00:00-07:002018-06-20T18:00:00-07:00Andrea Zoncatag:zonca.github.io,2018-06-20:/2018/06/organize-code-data-simulations-nersc.html<p>I recently improved my strategy for organizing code and data for simulations run at NERSC,
I'll write it here for reference.</p>
<h2>Libraries</h2>
<p>I mostly use Python (often with C/C++ extensions), so I first rely on the Anaconda
module maintained by NERSC, currently <code>python/3.6-anaconda-4.4</code>.</p>
<p>If I need to add many more packages I can create a conda environment, but for installing
just 1 or 2 packages I prefer to add them to my <code>PYTHONPATH</code>.</p>
<p>I have core libraries that I rely on and often modify to run my simulations,
those should be installed on Global Common Software: <code>/global/common/software/projectname</code>
which is specifically designed to access small files like Python packages.
I generally create a subfolder and reference it with an environment variable:</p>
<div class="highlight"><pre><span></span><span class="err"> export PREFIX=/global/common/software/projectname/zonca/python_prefix</span>
</pre></div>
<p>Then I create an <code>env.sh</code> script in the source folder of the package (in Global Home) that loads
the environment:</p>
<div class="highlight"><pre><span></span><span class="err">module load python/3.6-anaconda-4.4</span>
<span class="err">export PREFIX=/global/common/software/projectname/zonca/python_prefix</span>
<span class="err">export PATH=$PREFIX/bin:$PATH</span>
<span class="err">export LD_LIBRARY_PATH=$PREFIX/lib:$LD_LIBRARY_PATH</span>
<span class="err">export PYTHONPATH=$PREFIX/lib/python3.6/site-packages:$PYTHONPATH</span>
</pre></div>
<p>This environment is automatically propagated to the computing nodes when I submit a SLURM script,
therefore I do not add any of these environment details to my SLURM scripts.</p>
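<p>For example, a SLURM script can then stay minimal, with no module loads or environment exports (a sketch; node count, time, and script names are placeholders):</p>

```shell
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --time=00:30:00
# no `module load` or PATH/PYTHONPATH exports here: the environment set up
# by env.sh at submission time is inherited by the job
srun python my_simulation.py params.cfg
```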
<p>Then I can install a package there with:</p>
<div class="highlight"><pre><span></span><span class="err">python setup.py install --prefix=$PREFIX</span>
</pre></div>
<p>or from pip:</p>
<div class="highlight"><pre><span></span><span class="err">pip install apackage --prefix=$PREFIX</span>
</pre></div>
<p>It is also common to install a newer version of a package which is already provided by
the base environment:</p>
<div class="highlight"><pre><span></span><span class="err">pip install apackage --ignore-installed --upgrade --no-deps --prefix=$PREFIX</span>
</pre></div>
<h2>Simulations SLURM scripts and configuration files</h2>
<p>I first create a repository on Github for my simulations and clone it to my home folder at NERSC.
I generally create a repository for each experiment, then I create a subfolder for each
type of simulation I am working on.</p>
<p>Inside a folder I create parameters files to configure my run and slurm scripts to launch the
simulations and put everything under version control immediately, I often create a Pull Request
on Github and ask my collaborators to cross-check the configuration before a submit a run.</p>
<p>Smaller input data files, even binaries, can be added for convenience to the Github repository.</p>
<p>Once a run has been validated, inside the simulation type folder I create a subfolder <code>runs/201806_details_about_run</code> and
add a <code>README.md</code>; this will include all the details about the simulation.
I also tag both the core library I depend on and the simulation repository with the same name e.g.:</p>
<div class="highlight"><pre><span></span><span class="err">git tag -a 201806_details_about_run -m "software version used for 201806_details_about_run"</span>
</pre></div>
<p>I'll also add the path at NERSC of the input data and output results.</p>
<p>Then for future simulations I'll keep modifying the SLURM scripts and parameter files but always have
a reference to each previous version.</p>
<h2>Larger input data and output data</h2>
<p>Larger input data and outputs are not suitable for version control and should live in a SCRATCH filesystem.
I always use the Global Scratch <code>$CSCRATCH</code>, which is available both on Edison and Cori and also
from the Jupyter Notebook environment at <a href="https://jupyter.nersc.gov">https://jupyter.nersc.gov</a>.</p>
<p>I create a root folder for the project at:</p>
<div class="highlight"><pre><span></span><span class="err">$CSCRATCH/projectname</span>
</pre></div>
<p>Then a subfolder for each simulation type:</p>
<div class="highlight"><pre><span></span><span class="err">$CSCRATCH/projectname/simulation_type_1</span>
<span class="err">$CSCRATCH/projectname/simulation_type_2</span>
</pre></div>
<p>Then I symlink those inside the simulation repository as the folder <code>out/</code>:</p>
<div class="highlight"><pre><span></span><span class="err">cd $HOME/projectname/simulation_type_1</span>
<span class="err">ln -s $CSCRATCH/projectname/simulation_type_1 out</span>
</pre></div>
<p>Therefore I can set up my simulation software to save all results inside <code>out/201806_details_about_run</code>
and this is going to be written to <code>CSCRATCH</code>.</p>
<p>This setup makes it very convenient to regularly backup everything to tape using <code>cput</code> which just backs up
files that are not already on tape, e.g.:</p>
<div class="highlight"><pre><span></span><span class="err">cd $CSCRATCH</span>
<span class="err">hsi</span>
<span class="err">cput -R projectname</span>
</pre></div>
<p>This is going to synchronize the backup on tape with the latest results on <code>CSCRATCH</code>.</p>
<p>I do the same for input files:</p>
<div class="highlight"><pre><span></span><span class="err">mkdir $CSCRATCH/projectname/input_simulation_type_1</span>
<span class="err">cd $HOME/projectname/simulation_type_1</span>
<span class="err">ln -s $CSCRATCH/projectname/input_simulation_type_1 input</span>
</pre></div>Setup private dask clusters in Kubernetes alongside JupyterHub on Jetstream2018-06-07T18:00:00-07:002018-06-07T18:00:00-07:00Andrea Zoncatag:zonca.github.io,2018-06-07:/2018/06/private-dask-kubernetes-jetstream.html<p>In this post we will leverage software made available by the <a href="https://pangeo-data.github.io">Pangeo community</a> to allow each user of a <a href="https://zonca.github.io/2017/12/scalable-jupyterhub-kubernetes-jetstream.html">Jupyterhub instance deployed on Jetstream on top of Kubernetes</a> to launch a set of <a href="https://dask.pydata.org"><code>dask</code></a> workers as containers running inside Kubernetes itself and use them for distributed computing.</p>
<p>Pangeo also maintains a deployment of this environment on Google Cloud freely accessible at <a href="https://pangeo.pydata.org">pangeo.pydata.org</a>.</p>
<p><strong>Security considerations</strong>: This deployment grants each user administrative access to the Kubernetes API, so each user could use this privilege to terminate other users' pods or dask workers. Therefore it is suitable only for a community of trusted users. There is <a href="https://github.com/pangeo-data/pangeo/issues/135#issuecomment-384320753">discussion about leveraging namespaces to limit this</a> but it hasn't been implemented yet.</p>
<h2>Deploy Kubernetes</h2>
<p>We need to first create Jetstream instances and deploy Kubernetes on them. We can follow the first part of the tutorial at <a href="https://zonca.github.io/2017/12/scalable-jupyterhub-kubernetes-jetstream.html">https://zonca.github.io/2017/12/scalable-jupyterhub-kubernetes-jetstream.html</a>.
I also tested with Ubuntu 18.04 instead of Ubuntu 16.04 and edited the <code>install-kubeadm.bash</code> accordingly; I also removed the version specifications in order to pick up the latest Kubernetes version, currently 1.10. See <a href="https://gist.github.com/zonca/5365fd2245462dedaf2297e0417c4662">my install-kubeadm-18.04.bash</a>.
Notice that <code>http://apt.kubernetes.io/</code> does not yet have Ubuntu 18.04 packages, so I left <code>xenial</code>; this should be updated in the future.</p>
<p>In order to simplify the setup we will just be using ephemeral storage, later we can update the deployment using either Rook following the <a href="https://zonca.github.io/2017/12/scalable-jupyterhub-kubernetes-jetstream.html">steps in my original tutorial</a> or a NFS share (I'll write a tutorial soon about that).</p>
<h2>Deploy Pangeo</h2>
<p>Deployment is a single step because Pangeo published a Helm recipe that depends on the Zero-to-JupyterHub recipe and deploys both together; therefore we <em>should not have deployed JupyterHub beforehand</em>.</p>
<p>First we need to create a <code>yaml</code> configuration file for the package.
Check out the Github repository with all the configuration files on the master node of Kubernetes:</p>
<div class="highlight"><pre><span></span><span class="err">git clone https://github.com/zonca/jupyterhub-deploy-kubernetes-jetstream</span>
</pre></div>
<p>In the <code>pangeo_helm</code> folder there is already a draft of the configuration file.</p>
<p>We need to:</p>
<ul>
<li>run <code>openssl</code> as instructed inside the file and paste the output tokens to the specified location</li>
<li>edit the hostname in the <code>ingress</code> section to the hostname of the Jetstream master node</li>
<li>customize the memory and CPU requirements, currently they are very low so that this can be tested also in a single small instance</li>
</ul>
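<p>The secret tokens requested by the configuration file are 32 random bytes encoded as 64 hex characters, which is what <code>openssl rand -hex 32</code> produces. As an alternative sketch (not the command the file itself asks for), Python's standard <code>secrets</code> module generates an equivalent token:</p>

```python
# Generate a 32-byte random token encoded as 64 hex characters,
# equivalent to `openssl rand -hex 32`.
import secrets

token = secrets.token_hex(32)
print(token)
```

<p>Paste the printed value where the configuration file asks for the token.</p>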
<p>We can then deploy with:</p>
<div class="highlight"><pre><span></span><span class="err">sudo helm install pangeo/pangeo -n pangeo --namespace pangeo -f config_pangeo_no_storage.yaml --version=v0.1.1-95ab292</span>
</pre></div>
<p>You can optionally check if there are newer versions of the chart at <a href="https://pangeo-data.github.io/helm-chart/">https://pangeo-data.github.io/helm-chart/</a>.</p>
<p>Then check that the pods start by inspecting their status with:</p>
<div class="highlight"><pre><span></span><span class="err">sudo kubectl -n pangeo get pods</span>
</pre></div>
<p>If any is stuck in Pending, check with:</p>
<div class="highlight"><pre><span></span><span class="err">sudo kubectl -n pangeo describe <pod-name></span>
</pre></div>
<p>Once the <code>hub</code> pod is running, you should be able to connect with your browser to <code>js-xxx-xxx.jetstream-cloud.org</code>. By default it runs with a dummy authenticator: at the login form, type any username and leave the password empty to log in.</p>
<h2>Launch a dask cluster</h2>
<p>Once you get the Jupyter Notebook instance, you should see a file named <code>worker-template.yaml</code> in your home folder; this is a template for the configuration and the allocated resources of each <code>dask</code> worker pod.
The default workers for Pangeo are beefy; for testing we can reduce their requirements, see for example my <a href="https://gist.github.com/zonca/21ef3125eee7af5c2548e505d47dc200">worker-template.yaml</a> that works on a small Jetstream VM.</p>
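<p>The template is a standard Kubernetes pod spec; a reduced version looks roughly like this (an illustrative fragment with an assumed image and limits, not the exact contents of the linked gist):</p>

```yaml
kind: Pod
metadata:
  labels:
    app: dask-worker
spec:
  restartPolicy: Never
  containers:
  - name: dask-worker
    # assumed image: reuse the single-user image of your deployment instead
    image: daskdev/dask:latest
    args: [dask-worker, --nthreads, '1', --memory-limit, 1GB, --death-timeout, '60']
    resources:
      limits:
        cpu: "1"
        memory: 1G
```

<p>Lowering <code>cpu</code> and <code>memory</code> here is what lets the workers fit on a small Jetstream VM.</p>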
<p>Then inside <code>examples/</code> we have several example notebooks that show how to use <code>dask</code> for distributed computing.
<code>dask-array.ipynb</code> shows basic functionality for distributed multi-dimensional arrays.</p>
<p>The most important piece of code is the creation of dask workers:</p>
<div class="highlight"><pre><span></span><span class="kn">from</span> <span class="nn">dask_kubernetes</span> <span class="kn">import</span> <span class="n">KubeCluster</span>
<span class="n">cluster</span> <span class="o">=</span> <span class="n">KubeCluster</span><span class="p">(</span><span class="n">n_workers</span><span class="o">=</span><span class="mi">2</span><span class="p">)</span>
<span class="n">cluster</span>
</pre></div>
<p>If we execute this cell, <code>dask_kubernetes</code> contacts the Kubernetes API using the <a href="https://github.com/pangeo-data/helm-chart/blob/master/pangeo/templates/dask-kubernetes-rbac.yaml">serviceaccount <code>daskkubernetes</code></a> mounted on the pods by the Helm chart and requests new pods to be launched.
In fact we can check on the terminal again with:</p>
<div class="highlight"><pre><span></span><span class="err">sudo kubectl -n pangeo get pods</span>
</pre></div>
<p>that new pods should be about to run.
It also provides buttons to change the number of running workers, either manually or adaptively based on the required resources.</p>
<p>This also runs the <code>dask</code> scheduler on the pod that is running the Jupyter Notebook and we can connect to it with:</p>
<div class="highlight"><pre><span></span><span class="kn">from</span> <span class="nn">dask.distributed</span> <span class="kn">import</span> <span class="n">Client</span>
<span class="n">client</span> <span class="o">=</span> <span class="n">Client</span><span class="p">(</span><span class="n">cluster</span><span class="p">)</span>
<span class="n">client</span>
</pre></div>
<p>From now on, all <code>dask</code> operations will automatically execute on the <code>dask</code> cluster.</p>
<h2>Customize the JupyterHub deployment</h2>
<p>We can then customize the JupyterHub deployment, for example to add authentication or permanent storage.
Notice that all configuration options inside <code>config_pangeo_no_storage.yaml</code> are nested under the <code>jupyterhub:</code> key; this is because <code>jupyterhub</code> is another Helm package which we are configuring through the <code>pangeo</code> Helm package.
Therefore make sure any configuration option found in my previous tutorials or in the <a href="https://zero-to-jupyterhub.readthedocs.io/en/latest/">Zero-to-Jupyterhub</a> documentation is indented accordingly.</p>
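<p>For example, a Zero-to-JupyterHub option such as the Github authenticator moves one level down in the file (an illustrative fragment with placeholder values):</p>

```yaml
# config_pangeo_no_storage.yaml: Zero-to-JupyterHub options are nested
# under the jupyterhub: key of the pangeo chart
jupyterhub:
  auth:
    type: github
  singleuser:
    memory:
      limit: 1G
```

<p>The same keys at the top level of the file would be silently ignored by the <code>pangeo</code> chart.</p>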
<p>Then we can either run:</p>
<div class="highlight"><pre><span></span><span class="err">sudo helm delete --purge pangeo</span>
</pre></div>
<p>and then install it from scratch again or just update the running cluster with:</p>
<div class="highlight"><pre><span></span><span class="err">sudo helm upgrade pangeo -f config_pangeo_no_storage.yaml</span>
</pre></div>How to post a PEARC18 paper pre-print to Arxiv2018-05-12T18:00:00-07:002018-05-12T18:00:00-07:00Andrea Zoncatag:zonca.github.io,2018-05-12:/2018/05/pearc18-preprint-arxiv.html<h2>Quick version</h2>
<ul>
<li>Make sure you have the DOI from ACM</li>
<li>If you have Latex: create a zip with sources, figures and <code>.bbl</code> (not <code>.bib</code>), no output PDF</li>
<li>If you have Word: export to PDF</li>
<li>Go to <a href="https://arxiv.org/submit">https://arxiv.org/submit</a></li>
<li>Choose the first option for license and "Computer Science" and …</li></ul><h2>Quick version</h2>
<ul>
<li>Make sure you have the DOI from ACM</li>
<li>If you have Latex: create a zip with sources, figures and <code>.bbl</code> (not <code>.bib</code>), no output PDF</li>
<li>If you have Word: export to PDF</li>
<li>Go to <a href="https://arxiv.org/submit">https://arxiv.org/submit</a></li>
<li>Choose the first option for license and "Computer Science" and "Distributed, Parallel, and Cluster Computing" for category</li>
<li>In Metadata set Comments as: "7 pages, 3 figures, PEARC '18: Practice and Experience in Advanced Research Computing, July 22--26, 2018, Pittsburgh, PA, USA"</li>
<li><strong>Make sure you set the DOI</strong> or you violate ACM rules</li>
<li>Follow instructions until you publish</li>
</ul>
<p>The step-by-step version follows:</p>
<h2>Why upload a pre-print to arXiv</h2>
<p>Journals provide an Open Access option, but it is very expensive; however, they generally allow authors to upload manuscripts before copy-editing to non-profit pre-print servers like the <code>arXiv</code>.
This makes your paper accessible to anybody without any Journal subscription, and you can upload your work months before the conference proceedings are available.</p>
<p>See for example the page of my PEARC18 paper on the <code>arXiv</code>: <a href="https://arxiv.org/abs/1805.04781">https://arxiv.org/abs/1805.04781</a></p>
<h2>License</h2>
<p>Before publishing any pre-print, you need to check on the Journal or Conference website
whether it is allowed and under what conditions.</p>
<p>PEARC18 in particular publishes with ACM, therefore we can look at the <a href="http://authors.acm.org/main.html">author rights page on the ACM website</a>.</p>
<p>Currently the requirements for posting a pre-print are:</p>
<ul>
<li>the paper needs to be accepted and peer-reviewed</li>
<li>this is the version by the author, before copy-editing, if any, by the journal</li>
<li>it needs a DOI pointing to the ACM version of the paper</li>
</ul>
<h2>Get a DOI</h2>
<p>A DOI is generated once the author chooses a license.
PEARC18 first authors should have received an email around May 10th with a link to the ACM
website to choose a license.
There are 3 choices. Open Access is quite expensive, but we do not need it: we are still allowed
to post the pre-print with either of the other 2 licenses. I personally recommend the
"license" option, which does not transfer copyright to ACM.
After completing this you should receive a DOI, which is a set of numbers of the form <code>10.1145/xxxxx.xxxxxx</code>.
Also remember to add the license text you will receive via email to the paper before going on with the upload.</p>
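<p>As a quick sanity check before pasting the DOI into the arXiv metadata, the expected shape can be verified with a short Python snippet (a minimal sketch: it only checks the <code>10.1145/digits.digits</code> form, not whether the DOI actually resolves):</p>

```python
import re

# ACM DOIs have the shape 10.1145/<digits>.<digits>; this only validates
# the shape, not that the DOI resolves to your paper.
ACM_DOI = re.compile(r"^10\.1145/\d+\.\d+$")

def looks_like_acm_doi(doi: str) -> bool:
    return bool(ACM_DOI.match(doi))

print(looks_like_acm_doi("10.1145/1234567.1234567"))  # True: well-formed
print(looks_like_acm_doi("1234567.1234567"))          # False: missing prefix
```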
<h2>Prepare your Latex submission</h2>
<p>The arXiv requires the source for any Latex paper.
If you are using the online platform <a href="https://overleaf.com">Overleaf</a>, click on "Project" and then "Download as zip" at the bottom.
If you are using anything else, create a zip file with all the paper sources and figures, <em>not the output PDF</em>. Also make sure that you include the <code>.bbl</code> file, not the <code>.bib</code>: compile your paper locally and add just the <code>.bbl</code> to the archive.
Also, the arXiv dislikes large figures, so if you already know you have some, resize them or lower their quality before submission; alternatively you can submit as-is and check whether they are accepted.</p>
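<p>The packing rules above can be scripted; here is a sketch using only the Python standard library (the extension whitelist is an assumption to adapt to your project, and figures exported as PDF would need special handling so that the compiled paper itself is not shipped):</p>

```python
import zipfile
from pathlib import Path

# File types to ship to arXiv: Latex sources, the precompiled bibliography
# (.bbl, not .bib), class/style files and common figure formats; the output
# PDF and build artifacts are left out, per the rules above.
KEEP = {".tex", ".bbl", ".cls", ".sty", ".bst", ".png", ".jpg", ".eps"}

def pack_for_arxiv(src_dir: str, out_zip: str) -> list:
    """Zip the allowed files under src_dir into out_zip; return their names."""
    packed = []
    with zipfile.ZipFile(out_zip, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(Path(src_dir).rglob("*")):
            if path.is_file() and path.suffix.lower() in KEEP:
                zf.write(path, path.relative_to(src_dir))
                packed.append(path.name)
    return packed
```

<p>Running <code>pack_for_arxiv(".", "arxiv.zip")</code> from the paper folder then produces an archive without the <code>.bib</code> or any compiled PDF.</p>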
<h2>Prepare your Word submission</h2>
<p>Export the paper as PDF.</p>
<h2>Upload to arXiv</h2>
<ul>
<li>Go to <a href="https://arxiv.org/submit">https://arxiv.org/submit</a>, either login or create a new account.</li>
<li>At the submission page, fill the form, for license, the safest is to use the first option: "arXiv.org perpetual, non-exclusive license to distribute this article (Minimal rights required by arXiv.org)"</li>
<li>For "Archive and Subject Class", choose "Computer Science" and "Distributed, Parallel, and Cluster Computing" unless in the list there is a more suitable field</li>
<li>Then upload the Latex sources zip file or the conversion of the Word file to PDF.</li>
<li>Once you have uploaded the zip file, it shows you a list of the archive content; you can delete extra files that are not needed to build the paper. If you used the Overleaf ACM template, remove <code>sample-sigconf-authordraft.tex</code></li>
<li>If the paper doesn't build, the arXiv displays the log; check in particular for missing files or unsupported packages. You can click "Add files" to upload different files</li>
<li>If the paper successfully builds, click on the "View" button to check that the PDF is fine</li>
<li>In the Metadata, complete the form; in the Comments, also add the conference information, for example "7 pages, 3 figures, PEARC '18: Practice and Experience in Advanced Research Computing, July 22--26, 2018, Pittsburgh, PA, USA"</li>
<li>Still in Metadata, <strong>make sure you add the DOI</strong>, otherwise it is a violation of the ACM conditions; the DOI is in the form <code>10.1145/xxxxxx.xxxx</code></li>
<li>Finally check the preview and finalize your submission</li>
<li>The submission is not available immediately: it will first be in the "Processing" stage and will be published in the next few days; you'll get an email with the publishing date and time.</li>
</ul>
<h2>Update your submission</h2>
<ul>
<li>Anytime before publication you can update (overwrite) your submission</li>
<li>After your pre-print is published you can update it at will but all previous versions will always be available on the arXiv servers.</li>
</ul>
<p>In order to update the publication, log in to the arXiv and click on the "Replace" icon to update your paper with a new version.</p>Launch a shared dask cluster in Kubernetes alongside JupyterHub on Jetstream2018-05-04T18:00:00-07:002018-05-04T18:00:00-07:00Andrea Zoncatag:zonca.github.io,2018-05-04:/2018/05/shared-dask-kubernetes-jetstream.html<p>Let's assume we already have a Kubernetes deployment and have installed JupyterHub; see for example my <a href="https://zonca.github.io/2017/12/scalable-jupyterhub-kubernetes-jetstream.html">previous tutorial on Jetstream</a>.
Now that users can log in and access a Jupyter Notebook, we would also like to provide them with more computing power for their interactive data exploration. The easiest way is through …</p><p>Let's assume we already have a Kubernetes deployment and have installed JupyterHub; see for example my <a href="https://zonca.github.io/2017/12/scalable-jupyterhub-kubernetes-jetstream.html">previous tutorial on Jetstream</a>.
Now that users can log in and access a Jupyter Notebook, we would also like to provide them with more computing power for their interactive data exploration. The easiest way is through <a href="https://dask.pydata.org"><code>dask</code></a>: we can launch a scheduler and any number of workers as containers inside Kubernetes so that users can leverage the computing power of many Jetstream instances at once.</p>
<p>There are 2 main strategies: we can give each user their own dask cluster with exclusive access, which is more performant but causes quick spikes in usage of the Kubernetes cluster, or we can launch a shared cluster and give all users access to it.</p>
<p>In this tutorial we cover the second scenario, we'll cover the first scenario in a following tutorial.</p>
<p>We will first deploy Jupyterhub through the Zero-to-JupyterHub guide, then launch via Helm a fixed-size dask cluster and show how users can connect, submit distributed Python jobs and monitor their execution on the dashboard.</p>
<p>The configuration files mentioned in the tutorial are available in the Github repository <a href="https://github.com/zonca/jupyterhub-deploy-kubernetes-jetstream">zonca/jupyterhub-deploy-kubernetes-jetstream</a>.</p>
<h2>Deploy JupyterHub</h2>
<p>First we start from Jupyterhub on Jetstream with Kubernetes at <a href="https://zonca.github.io/2017/12/scalable-jupyterhub-kubernetes-jetstream.html">https://zonca.github.io/2017/12/scalable-jupyterhub-kubernetes-jetstream.html</a></p>
<p>Optionally, for testing purposes, we can simplify the deployment by skipping permanent storage; if this is an option for you, see the relevant section below.</p>
<p>We want to install Jupyterhub in the <code>pangeo</code> namespace with the name <code>jupyter</code>, replace the <code>helm install</code> line in the tutorial with:</p>
<div class="highlight"><pre><span></span><span class="err">sudo helm install --name jupyter jupyterhub/jupyterhub -f config_jupyterhub_pangeo_helm.yaml --namespace pangeo</span>
</pre></div>
<p>The <code>pangeo</code> configuration file is using a different single user image which has the right version of <code>dask</code> for this tutorial.</p>
<h2>(Optional) Simplify deployment using ephemeral storage</h2>
<p>Instead of installing and configuring Rook, we can temporarily disable permanent storage to make the setup quicker and easier to maintain.</p>
<p>In the JupyterHub configuration <code>yaml</code> set:</p>
<div class="highlight"><pre><span></span><span class="n">hub</span><span class="o">:</span>
<span class="n">db</span><span class="o">:</span>
<span class="n">type</span><span class="o">:</span> <span class="n">sqlite</span><span class="o">-</span><span class="n">memory</span>
<span class="n">singleuser</span><span class="o">:</span>
<span class="n">storage</span><span class="o">:</span>
<span class="n">type</span><span class="o">:</span> <span class="n">none</span>
</pre></div>
<p>Now every time a user container is killed and restarted, all data are gone; this is good enough for testing purposes.</p>
<h2>Configure Github authentication</h2>
<p>Follow the instructions on the Zero-to-Jupyterhub documentation, at the end you should have in the YAML:</p>
<div class="highlight"><pre><span></span><span class="n">auth</span><span class="o">:</span>
<span class="n">type</span><span class="o">:</span> <span class="n">github</span>
<span class="n">admin</span><span class="o">:</span>
<span class="n">access</span><span class="o">:</span> <span class="kc">true</span>
<span class="n">users</span><span class="o">:</span> <span class="o">[</span><span class="n">zonca</span><span class="o">,</span> <span class="n">otherusername</span><span class="o">]</span>
<span class="n">github</span><span class="o">:</span>
<span class="n">clientId</span><span class="o">:</span> <span class="s2">"xxxxxxxxxxxxxxxxxxxx"</span>
<span class="n">clientSecret</span><span class="o">:</span> <span class="s2">"xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"</span>
<span class="n">callbackUrl</span><span class="o">:</span> <span class="s2">"https://js-xxx-xxx.jetstream-cloud.org/hub/oauth_callback"</span>
</pre></div>
<h2>Test Jupyterhub</h2>
<p>Connect to the master node with your browser at <code>https://js-xxx-xxx.jetstream-cloud.org</code>
and log in with your Github credentials; you should get a Jupyter Notebook.</p>
<p>You can also check that your pod is running:</p>
<div class="highlight"><pre><span></span><span class="err">sudo kubectl get pods -n pangeo</span>
<span class="err">NAME READY STATUS RESTARTS AGE</span>
<span class="err">jupyter-zonca 1/1 Running 0 2m</span>
<span class="err">......other pods</span>
</pre></div>
<h2>Install Dask</h2>
<p>We want to deploy a single dask cluster that all the users can submit jobs to.</p>
<p>Customize the <code>dask_shared/dask_config.yaml</code> file available in the repository;
for testing purposes I set limits of just 1 GB RAM and 1 CPU on each of 3 workers.
We can change <code>replicas</code> of the workers to add more.</p>
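<p>The relevant part of such a values file looks roughly like this (an illustrative fragment; check the <code>stable/dask</code> chart's own <code>values.yaml</code> for the exact key names):</p>

```yaml
# dask_config.yaml: 3 workers with 1 CPU / 1 GB RAM each
worker:
  replicas: 3
  resources:
    limits:
      cpu: 1
      memory: 1G
    requests:
      cpu: 1
      memory: 1G
```

<p>Raising <code>replicas</code> and re-running <code>helm upgrade</code> later scales the cluster out.</p>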
<div class="highlight"><pre><span></span><span class="err">sudo helm install stable/dask --name=dask --namespace=pangeo -f dask_config.yaml</span>
</pre></div>
<p>Then check that the <code>dask</code> instances are running:</p>
<div class="highlight"><pre><span></span>$ sudo kubectl get pods --namespace pangeo
NAME READY STATUS RESTARTS AGE
dask-jupyter-647bdc8c6d-mqhr4 <span class="m">1</span>/1 Running <span class="m">0</span> 22m
dask-scheduler-5d98cbf54c-4rtdr <span class="m">1</span>/1 Running <span class="m">0</span> 22m
dask-worker-6457975f74-dqhsh <span class="m">1</span>/1 Running <span class="m">0</span> 22m
dask-worker-6457975f74-lpvk4 <span class="m">1</span>/1 Running <span class="m">0</span> 22m
dask-worker-6457975f74-xzcmc <span class="m">1</span>/1 Running <span class="m">0</span> 22m
hub-7f75b59fc5-8c2pg <span class="m">1</span>/1 Running <span class="m">0</span> 6d
jupyter-zonca <span class="m">1</span>/1 Running <span class="m">0</span> 10m
proxy-6bbf67f6bd-swt7f <span class="m">2</span>/2 Running <span class="m">0</span> 6d
</pre></div>
<h3>Access the scheduler and launch a distributed job</h3>
<p><code>kube-dns</code> gives a name to each service and automatically propagates it to each pod, so we can connect to the scheduler by name:</p>
<div class="highlight"><pre><span></span><span class="kn">from</span> <span class="nn">dask.distributed</span> <span class="kn">import</span> <span class="n">Client</span>
<span class="n">client</span> <span class="o">=</span> <span class="n">Client</span><span class="p">(</span><span class="s2">"dask-scheduler:8786"</span><span class="p">)</span>
<span class="n">client</span>
</pre></div>
<p>Now we can access the 3 workers that we launched before:</p>
<div class="highlight"><pre><span></span><span class="err">Client</span>
<span class="c">Scheduler: tcp://dask-scheduler:8786</span>
<span class="c">Dashboard: http://dask-scheduler:8787/status</span>
<span class="err">Cluster</span>
<span class="c">Workers: 3</span>
<span class="c">Cores: 6</span>
<span class="c">Memory: 12.43 GB</span>
</pre></div>
<p>We can run an example computation with dask array:</p>
<div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">dask.array</span> <span class="kn">as</span> <span class="nn">da</span>
<span class="n">x</span> <span class="o">=</span> <span class="n">da</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">random</span><span class="p">((</span><span class="mi">20000</span><span class="p">,</span> <span class="mi">20000</span><span class="p">),</span> <span class="n">chunks</span><span class="o">=</span><span class="p">(</span><span class="mi">2000</span><span class="p">,</span> <span class="mi">2000</span><span class="p">))</span><span class="o">.</span><span class="n">persist</span><span class="p">()</span>
<span class="n">x</span><span class="o">.</span><span class="n">sum</span><span class="p">()</span><span class="o">.</span><span class="n">compute</span><span class="p">()</span>
</pre></div>
<h3>Access the Dask dashboard for monitoring job execution</h3>
<p>We need to set up ingress so that a path points to the Dask dashboard instead of Jupyterhub.</p>
<p>Check out the file <code>dask_shared/dask_webui_ingress.yaml</code> in the repository; it routes the path <code>/dask</code>
to the <code>dask-scheduler</code> service.</p>
<p>Create the ingress resource with:</p>
<div class="highlight"><pre><span></span><span class="err">sudo kubectl create ingress -n pangeo -f dask_webui_ingress.yaml</span>
</pre></div>
<p>All users can now access the dashboard at:</p>
<ul>
<li><a href="https://js-xxx-xxx.jetstream-cloud.org/dask/status">https://js-xxx-xxx.jetstream-cloud.org/dask/status</a></li>
</ul>
<p>Make sure to use <code>/dask/status/</code> and not only <code>/dask</code>.
Currently this is not authenticated, so this address is publicly available.
A simple way to hide it is to choose a custom name instead of <code>/dask</code> and edit
the ingress accordingly with:</p>
<div class="highlight"><pre><span></span><span class="err">sudo kubectl edit ingress dask -n pangeo</span>
</pre></div>Install a BOINC server on Jetstream2018-03-29T18:00:00-07:002018-03-29T18:00:00-07:00Andrea Zoncatag:zonca.github.io,2018-03-29:/2018/03/boinc-server-jetstream.html<p><a href="https://boinc.berkeley.edu/">BOINC</a> is the leading platform for volunteer computing.</p>
<p>Scientists can create a project on the platform and submit computational jobs that will
be executed on computers of volunteers all over the world.</p>
<p>In this post we'll deploy a BOINC server on Jetstream. All US scientists can get a free
<a href="https://jetstream-cloud.org/allocations.php">allocation …</a></p><p><a href="https://boinc.berkeley.edu/">BOINC</a> is the leading platform for volunteer computing.</p>
<p>Scientists can create a project on the platform and submit computational jobs that will
be executed on computers of volunteers all over the world.</p>
<p>In this post we'll deploy a BOINC server on Jetstream. All US scientists can get a free
<a href="https://jetstream-cloud.org/allocations.php">allocation on Jetstream via XSEDE</a>.</p>
<p>The deployment will be based on the <a href="https://github.com/marius311/boinc-server-docker">Docker setup developed by the Cosmology@Home project</a>.</p>
<h2>Prepare a Jetstream Virtual Machine</h2>
<p>First we log in to the Atmosphere Jetstream control panel and create a new instance
of Ubuntu 16.04 with Docker preinstalled; a "small" size is enough for testing.</p>
<h3>(Optional) Mount a Jetstream Volume for docker images</h3>
<p>It is ideal to have a dedicated Jetstream Volume mounted at the location where
Docker stores its data: we get more space, less usage of the root filesystem,
and no issues with the OS if we run out of disk space.</p>
<p>We can create a volume of 10/20 GB in the Jetstream control panel and attach it to
the running Virtual Machine. It will be automatically mounted at <code>/vol_b</code>; we
want to mount it instead at <code>/var/lib/docker</code>:</p>
<div class="highlight"><pre><span></span><span class="err">sudo systemctl stop docker</span>
<span class="err">sudo mv /var/lib/docker/* /vol_b/</span>
<span class="err">sudo umount /vol_b</span>
</pre></div>
<p>Replace <code>/vol_b</code> with <code>/var/lib/docker</code> in <code>/etc/fstab</code>, e.g.:</p>
<div class="highlight"><pre><span></span><span class="err">zonca@js-xxx-xxx:~$ cat /etc/fstab</span>
<span class="err">LABEL=cloudimg-rootfs / ext4 defaults 0 0</span>
<span class="err">/dev/sdb /var/lib/docker ext4 defaults,nofail 0 2</span>
</pre></div>
<p>Finally:</p>
<div class="highlight"><pre><span></span><span class="err">sudo mount /var/lib/docker</span>
<span class="err">sudo systemctl start docker</span>
</pre></div>
<h3>Update Docker</h3>
<p>Docker in 16.04 is a bit old; we want to update it to a more recent version.</p>
<p>We also want to make sure to remove the old <code>docker</code> and <code>docker-compose</code>:</p>
<div class="highlight"><pre><span></span><span class="err">sudo apt remove docker-compose docker</span>
</pre></div>
<p>Then install a recent version;
we can follow the instructions from the Docker website or use this script:</p>
<p><a href="https://gist.github.com/zonca/f5faba190f5285c68dad48e897622e90">https://gist.github.com/zonca/f5faba190f5285c68dad48e897622e90</a></p>
<p>I adapted it from <a href="https://github.com/data-8/kubeadm-bootstrap/blob/master/install-kubeadm.bash">kubeadm-bootstrap</a>.</p>
<p>Finally install the latest <code>docker-compose</code>, see the <a href="https://docs.docker.com/compose/install/#install-compose">documentation</a></p>
<p>Last step, add your user to the <code>docker</code> group:</p>
<div class="highlight"><pre><span></span><span class="err">sudo adduser $USER docker</span>
</pre></div>
<p>Log out and back in, then make sure you can run <code>docker</code> commands without sudo:</p>
<div class="highlight"><pre><span></span><span class="err">docker ps</span>
</pre></div>
<h3>Install BOINC server via Docker</h3>
<p>Follow the <a href="https://github.com/marius311/boinc-server-docker">instructions from <code>boinc-server-docker</code></a>
to launch a test deployment; in the last step, specify a <code>URL_BASE</code> so that
the deployment will be accessible from outside connections:</p>
<div class="highlight"><pre><span></span><span class="err">URL_BASE=http://$(hostname) docker-compose up -d</span>
</pre></div>
<p>You can check that the 3 containers are running with:</p>
<div class="highlight"><pre><span></span><span class="err">docker ps</span>
</pre></div>
<p>and inspect their logs with:</p>
<div class="highlight"><pre><span></span><span class="err">docker logs <container_id></span>
</pre></div>
<p>After a few minutes you should be able to check that the server is running at the
public address of your instance:</p>
<p><a href="http://js-xxx-xxx.jetstream-cloud.org/boincserver/">http://js-xxx-xxx.jetstream-cloud.org/boincserver/</a></p>
<h2>(Optional) Mount Jetstream volumes on the containers</h2>
<p>The Docker compose recipe defines 3 Docker volumes:</p>
<ul>
<li><code>mysql</code>: Data of the MySQL database</li>
<li><code>project</code>: Files about the project</li>
<li><code>results</code>: Result of the BOINC jobs</li>
</ul>
<p>Those volumes are managed internally
by Docker and stored somewhere inside <code>/var/lib/docker</code> on the host node.</p>
<p>Docker also allows mounting specific folders from the host into a container;
if we back these folders with a Jetstream volume, we get dedicated detachable Jetstream volumes
that live independently of any virtual machine.</p>
<p>Let's start with <code>mysql</code>; the same process can then be replicated for the other resources.</p>
<p>We create another Jetstream volume from Atmosphere, name it <code>mysql</code> and attach it to the virtual machine.
It will be automatically mounted at <code>/vol_c</code>; we can relocate it with:</p>
<div class="highlight"><pre><span></span><span class="err">sudo umount /vol_c</span>
</pre></div>
<p>Replace <code>vol_c</code> with <code>mysql</code> in <code>/etc/fstab</code>, finally:</p>
<div class="highlight"><pre><span></span><span class="err">sudo mount /mysql</span>
</pre></div>
<p>You can then modify the <code>docker-compose.yml</code> to use this folder instead of a Docker Volume:</p>
<p>In the <code>volumes:</code> section, remove <code>mysql:</code>, in the definition of the MySQL service,
replace:</p>
<div class="highlight"><pre><span></span><span class="c">volumes:</span>
<span class="c"> - "mysql:/var/lib/mysql"</span>
</pre></div>
<p>with:</p>
<div class="highlight"><pre><span></span><span class="c">volumes:</span>
<span class="c"> - "/mysql:/var/lib/mysql"</span>
</pre></div>
<p>So that instead of using a Docker Volume named <code>mysql</code> it creates a bind-mount to <code>/mysql</code> on the host.</p>
<h2>Test jobs</h2>
<p>Open a terminal in the BOINC server container:</p>
<div class="highlight"><pre><span></span><span class="n">docker</span> <span class="k">exec</span> <span class="o">-</span><span class="n">it</span> <span class="o"><</span><span class="n">boincserver</span><span class="o">></span> <span class="o">/</span><span class="n">bin</span><span class="o">/</span><span class="n">bash</span>
<span class="n">bin</span><span class="o">/</span><span class="n">boinc2docker_create_work</span><span class="p">.</span><span class="n">py</span> <span class="err">\</span>
<span class="n">python</span><span class="p">:</span><span class="n">alpine</span> <span class="n">python</span> <span class="o">-</span><span class="k">c</span> <span class="ss">"open('/root/shared/results/hello.txt','w').write('Hello BOINC')"</span>
</pre></div>
<p>Then we can test a client connection and execution either with a standard BOINC desktop client or on another Jetstream instance.</p>
<h3>Test with a BOINC Desktop client</h3>
<p>Follow the instructions on the <a href="https://boinc.berkeley.edu/">BOINC website</a> to install a client for your OS, also install VirtualBox, and then point the client at the URL of the BOINC server we just created.</p>
<h3>Test with a BOINC client in another Jetstream instance</h3>
<p>Create another tiny Ubuntu-with-Docker instance on Jetstream, log in, and add your user to the <code>docker</code> group:</p>
<div class="highlight"><pre><span></span><span class="err">sudo adduser $USER docker</span>
</pre></div>
<p>We need VirtualBox:</p>
<div class="highlight"><pre><span></span><span class="err">sudo apt install virtualbox-dkms</span>
</pre></div>
<p>and reboot to make sure VirtualBox is active.</p>
<div class="highlight"><pre><span></span><span class="n">URL</span><span class="o">=</span><span class="n">http</span><span class="p">:</span><span class="o">//</span><span class="n">js</span><span class="o">-</span><span class="n">xxx</span><span class="o">-</span><span class="n">xxx</span><span class="p">.</span><span class="n">jetstream</span><span class="o">-</span><span class="n">cloud</span><span class="p">.</span><span class="n">org</span><span class="o">/</span><span class="n">boincserver</span><span class="o">/</span>
<span class="n">docker</span> <span class="k">exec</span> <span class="n">boinc</span> <span class="n">boinccmd</span> <span class="c1">--create_account $URL email password name</span>
<span class="n">status</span><span class="p">:</span> <span class="n">Success</span>
<span class="n">poll</span> <span class="n">status</span><span class="p">:</span> <span class="k">operation</span> <span class="k">in</span> <span class="n">progress</span>
<span class="n">poll</span> <span class="n">status</span><span class="p">:</span> <span class="k">operation</span> <span class="k">in</span> <span class="n">progress</span>
<span class="n">poll</span> <span class="n">status</span><span class="p">:</span> <span class="k">operation</span> <span class="k">in</span> <span class="n">progress</span>
<span class="n">account</span> <span class="k">key</span><span class="p">:</span> <span class="n">de9c4cc66b8c923d04f834a0609ae742</span>
</pre></div>
<p>We can save the account key in an environment variable:</p>
<div class="highlight"><pre><span></span><span class="err">URL=http://js-xxx-xxx.jetstream-cloud.org/boincserver/</span>
<span class="err">account_key=de9c4cc66b8c923d04f834a0609ae742</span>
<span class="err">docker exec boinc boinccmd --project_attach $URL $account_key</span>
</pre></div>
<p>Then we can check the logs for the job being received and executed:</p>
<div class="highlight"><pre><span></span><span class="err">docker logs boinc</span>
</pre></div>
<div class="highlight"><pre><span></span><span class="mi">30</span><span class="o">-</span><span class="n">Mar</span><span class="o">-</span><span class="mi">2018</span><span class="w"> </span><span class="mi">13</span><span class="err">:</span><span class="mi">02</span><span class="err">:</span><span class="mi">04</span><span class="w"> </span><span class="o">[</span><span class="n">boincserver</span><span class="o">]</span><span class="w"> </span><span class="n">Started</span><span class="w"> </span><span class="n">download</span><span class="w"> </span><span class="k">of</span><span class="w"> </span><span class="n">layer_e9e858f6a2ba5a3e5a04b5799ef2de1c21a58602ffd400838ed10599f1b4a42c</span><span class="p">.</span><span class="n">tar</span><span class="p">.</span><span class="n">manual</span><span class="p">.</span><span class="n">gz</span><span class="w"></span>
<span class="mi">30</span><span class="o">-</span><span class="n">Mar</span><span class="o">-</span><span class="mi">2018</span><span class="w"> </span><span class="mi">13</span><span class="err">:</span><span class="mi">02</span><span class="err">:</span><span class="mi">06</span><span class="w"> </span><span class="o">[</span><span class="n">boincserver</span><span class="o">]</span><span class="w"> </span><span class="n">Finished</span><span class="w"> </span><span class="n">download</span><span class="w"> </span><span class="k">of</span><span class="w"> </span><span class="n">layer_10ffed26db733866a346caf7c79558e4addb23ae085a991b5e7237edaa69f8e2</span><span class="p">.</span><span class="n">tar</span><span class="p">.</span><span class="n">manual</span><span class="p">.</span><span class="n">gz</span><span class="w"></span>
<span class="mi">30</span><span class="o">-</span><span class="n">Mar</span><span class="o">-</span><span class="mi">2018</span><span class="w"> </span><span class="mi">13</span><span class="err">:</span><span class="mi">02</span><span class="err">:</span><span class="mi">06</span><span class="w"> </span><span class="o">[</span><span class="n">boincserver</span><span class="o">]</span><span class="w"> </span><span class="n">Finished</span><span class="w"> </span><span class="n">download</span><span class="w"> </span><span class="k">of</span><span class="w"> </span><span class="n">layer_e9e858f6a2ba5a3e5a04b5799ef2de1c21a58602ffd400838ed10599f1b4a42c</span><span class="p">.</span><span class="n">tar</span><span class="p">.</span><span class="n">manual</span><span class="p">.</span><span class="n">gz</span><span class="w"></span>
<span class="mi">30</span><span class="o">-</span><span class="n">Mar</span><span class="o">-</span><span class="mi">2018</span><span class="w"> </span><span class="mi">13</span><span class="err">:</span><span class="mi">02</span><span class="err">:</span><span class="mi">06</span><span class="w"> </span><span class="o">[</span><span class="n">boincserver</span><span class="o">]</span><span class="w"> </span><span class="n">Started</span><span class="w"> </span><span class="n">download</span><span class="w"> </span><span class="k">of</span><span class="w"> </span><span class="n">layer_0e650ab7661f993eff514b84c6e7b775f5be8c6dde8b63eb584f0f22ea24005f</span><span class="p">.</span><span class="n">tar</span><span class="p">.</span><span class="n">manual</span><span class="p">.</span><span class="n">gz</span><span class="w"></span>
<span class="mi">30</span><span class="o">-</span><span class="n">Mar</span><span class="o">-</span><span class="mi">2018</span><span class="w"> </span><span class="mi">13</span><span class="err">:</span><span class="mi">02</span><span class="err">:</span><span class="mi">06</span><span class="w"> </span><span class="o">[</span><span class="n">boincserver</span><span class="o">]</span><span class="w"> </span><span class="n">Started</span><span class="w"> </span><span class="n">download</span><span class="w"> </span><span class="k">of</span><span class="w"> </span><span class="n">image_4fcaf5fb5f2b8230c53b5fd4c4325df00021d45272dc4bfbb2148e5ca91ac166</span><span class="p">.</span><span class="n">tar</span><span class="p">.</span><span class="n">manual</span><span class="p">.</span><span class="n">gz</span><span class="w"></span>
<span class="mi">30</span><span class="o">-</span><span class="n">Mar</span><span class="o">-</span><span class="mi">2018</span><span class="w"> </span><span class="mi">13</span><span class="err">:</span><span class="mi">02</span><span class="err">:</span><span class="mi">07</span><span class="w"> </span><span class="o">[</span><span class="n">boincserver</span><span class="o">]</span><span class="w"> </span><span class="n">Finished</span><span class="w"> </span><span class="n">download</span><span class="w"> </span><span class="k">of</span><span class="w"> </span><span class="n">layer_0e650ab7661f993eff514b84c6e7b775f5be8c6dde8b63eb584f0f22ea24005f</span><span class="p">.</span><span class="n">tar</span><span class="p">.</span><span class="n">manual</span><span class="p">.</span><span class="n">gz</span><span class="w"></span>
<span class="mi">30</span><span class="o">-</span><span class="n">Mar</span><span class="o">-</span><span class="mi">2018</span><span class="w"> </span><span class="mi">13</span><span class="err">:</span><span class="mi">02</span><span class="err">:</span><span class="mi">07</span><span class="w"> </span><span class="o">[</span><span class="n">boincserver</span><span class="o">]</span><span class="w"> </span><span class="n">Finished</span><span class="w"> </span><span class="n">download</span><span class="w"> </span><span class="k">of</span><span class="w"> </span><span class="n">image_4fcaf5fb5f2b8230c53b5fd4c4325df00021d45272dc4bfbb2148e5ca91ac166</span><span class="p">.</span><span class="n">tar</span><span class="p">.</span><span class="n">manual</span><span class="p">.</span><span class="n">gz</span><span class="w"></span>
<span class="mi">30</span><span class="o">-</span><span class="n">Mar</span><span class="o">-</span><span class="mi">2018</span><span class="w"> </span><span class="mi">13</span><span class="err">:</span><span class="mi">02</span><span class="err">:</span><span class="mi">07</span><span class="w"> </span><span class="o">[</span><span class="n">boincserver</span><span class="o">]</span><span class="w"> </span><span class="n">Starting</span><span class="w"> </span><span class="n">task</span><span class="w"> </span><span class="n">boinc2docker_3766_1522410497</span><span class="mf">.503524</span><span class="n">_0</span><span class="w"></span>
<span class="mi">30</span><span class="o">-</span><span class="n">Mar</span><span class="o">-</span><span class="mi">2018</span><span class="w"> </span><span class="mi">13</span><span class="err">:</span><span class="mi">02</span><span class="err">:</span><span class="mi">07</span><span class="w"> </span><span class="o">[</span><span class="n">boincserver</span><span class="o">]</span><span class="w"> </span><span class="n">Sending</span><span class="w"> </span><span class="n">scheduler</span><span class="w"> </span><span class="nl">request</span><span class="p">:</span><span class="w"> </span><span class="k">To</span><span class="w"> </span><span class="k">fetch</span><span class="w"> </span><span class="k">work</span><span class="p">.</span><span class="w"></span>
<span class="mi">30</span><span class="o">-</span><span class="n">Mar</span><span class="o">-</span><span class="mi">2018</span><span class="w"> </span><span class="mi">13</span><span class="err">:</span><span class="mi">02</span><span class="err">:</span><span class="mi">07</span><span class="w"> </span><span class="o">[</span><span class="n">boincserver</span><span class="o">]</span><span class="w"> </span><span class="n">Requesting</span><span class="w"> </span><span class="k">new</span><span class="w"> </span><span class="n">tasks</span><span class="w"> </span><span class="k">for</span><span class="w"> </span><span class="n">CPU</span><span class="w"></span>
<span class="mi">30</span><span class="o">-</span><span class="n">Mar</span><span class="o">-</span><span class="mi">2018</span><span class="w"> </span><span class="mi">13</span><span class="err">:</span><span class="mi">02</span><span class="err">:</span><span class="mi">08</span><span class="w"> </span><span class="o">[</span><span class="n">boincserver</span><span class="o">]</span><span class="w"> </span><span class="n">Scheduler</span><span class="w"> </span><span class="n">request</span><span class="w"> </span><span class="nl">completed</span><span class="p">:</span><span class="w"> </span><span class="n">got</span><span class="w"> </span><span class="mi">1</span><span class="w"> </span><span class="k">new</span><span class="w"> </span><span class="n">tasks</span><span class="w"></span>
<span class="mi">30</span><span class="o">-</span><span class="n">Mar</span><span class="o">-</span><span class="mi">2018</span><span class="w"> </span><span class="mi">13</span><span class="err">:</span><span class="mi">02</span><span class="err">:</span><span class="mi">12</span><span class="w"> </span><span class="o">[</span><span class="n">---</span><span class="o">]</span><span class="w"> </span><span class="n">Vbox</span><span class="w"> </span><span class="n">app</span><span class="w"> </span><span class="n">stderr</span><span class="w"> </span><span class="n">indicates</span><span class="w"> </span><span class="n">CPU</span><span class="w"> </span><span class="n">VM</span><span class="w"> </span><span class="n">extensions</span><span class="w"> </span><span class="n">disabled</span><span class="w"></span>
<span class="mi">30</span><span class="o">-</span><span class="n">Mar</span><span class="o">-</span><span class="mi">2018</span><span class="w"> </span><span class="mi">13</span><span class="err">:</span><span class="mi">02</span><span class="err">:</span><span class="mi">13</span><span class="w"> </span><span class="o">[</span><span class="n">boincserver</span><span class="o">]</span><span class="w"> </span><span class="n">Computation</span><span class="w"> </span><span class="k">for</span><span class="w"> </span><span class="n">task</span><span class="w"> </span><span class="n">boinc2docker_3766_1522410497</span><span class="mf">.503524</span><span class="n">_0</span><span class="w"> </span><span class="n">finished</span><span class="w"></span>
<span class="mi">30</span><span class="o">-</span><span class="n">Mar</span><span class="o">-</span><span class="mi">2018</span><span class="w"> </span><span class="mi">13</span><span class="err">:</span><span class="mi">02</span><span class="err">:</span><span class="mi">13</span><span class="w"> </span><span class="o">[</span><span class="n">boincserver</span><span class="o">]</span><span class="w"> </span><span class="k">Output</span><span class="w"> </span><span class="k">file</span><span class="w"> </span><span class="n">boinc2docker_3766_1522410497</span><span class="mf">.503524</span><span class="n">_0_r207563194_0</span><span class="p">.</span><span class="n">tgz</span><span class="w"> </span><span class="k">for</span><span class="w"> </span><span class="n">task</span><span class="w"> </span><span class="n">boinc2docker_3766_1522410497</span><span class="mf">.503524</span><span class="n">_0</span><span class="w"> </span><span class="n">absent</span><span class="w"></span>
<span class="mi">30</span><span class="o">-</span><span class="n">Mar</span><span class="o">-</span><span class="mi">2018</span><span class="w"> </span><span class="mi">13</span><span class="err">:</span><span class="mi">02</span><span class="err">:</span><span class="mi">13</span><span class="w"> </span><span class="o">[</span><span class="n">boincserver</span><span class="o">]</span><span class="w"> </span><span class="n">Starting</span><span class="w"> </span><span class="n">task</span><span class="w"> </span><span class="n">boinc2docker_3766_1522410497</span><span class="mf">.503524</span><span class="n">_1</span><span class="w"></span>
<span class="mi">30</span><span class="o">-</span><span class="n">Mar</span><span class="o">-</span><span class="mi">2018</span><span class="w"> </span><span class="mi">13</span><span class="err">:</span><span class="mi">02</span><span class="err">:</span><span class="mi">18</span><span class="w"> </span><span class="o">[</span><span class="n">---</span><span class="o">]</span><span class="w"> </span><span class="n">Vbox</span><span class="w"> </span><span class="n">app</span><span class="w"> </span><span class="n">stderr</span><span class="w"> </span><span class="n">indicates</span><span class="w"> </span><span class="n">CPU</span><span class="w"> </span><span class="n">VM</span><span class="w"> </span><span class="n">extensions</span><span class="w"> </span><span class="n">disabled</span><span class="w"></span>
<span class="mi">30</span><span class="o">-</span><span class="n">Mar</span><span class="o">-</span><span class="mi">2018</span><span class="w"> </span><span class="mi">13</span><span class="err">:</span><span class="mi">02</span><span class="err">:</span><span class="mi">18</span><span class="w"> </span><span class="o">[</span><span class="n">boincserver</span><span class="o">]</span><span class="w"> </span><span class="n">Computation</span><span class="w"> </span><span class="k">for</span><span class="w"> </span><span class="n">task</span><span class="w"> </span><span class="n">boinc2docker_3766_1522410497</span><span class="mf">.503524</span><span class="n">_1</span><span class="w"> </span><span class="n">finished</span><span class="w"></span>
<span class="mi">30</span><span class="o">-</span><span class="n">Mar</span><span class="o">-</span><span class="mi">2018</span><span class="w"> </span><span class="mi">13</span><span class="err">:</span><span class="mi">02</span><span class="err">:</span><span class="mi">18</span><span class="w"> </span><span class="o">[</span><span class="n">boincserver</span><span class="o">]</span><span class="w"> </span><span class="k">Output</span><span class="w"> </span><span class="k">file</span><span class="w"> </span><span class="n">boinc2docker_3766_1522410497</span><span class="mf">.503524</span><span class="n">_1_r1095010587_0</span><span class="p">.</span><span class="n">tgz</span><span class="w"> </span><span class="k">for</span><span class="w"> </span><span class="n">task</span><span class="w"> </span><span class="n">boinc2docker_3766_1522410497</span><span class="mf">.503524</span><span class="n">_1</span><span class="w"> </span><span class="n">absent</span><span class="w"></span>
</pre></div>Use the distributed file format Zarr on Jetstream Swift object storage2018-03-03T18:00:00-08:002018-03-03T18:00:00-08:00Andrea Zoncatag:zonca.github.io,2018-03-03:/2018/03/zarr-on-jetstream.html<p><meta http-equiv="refresh" content="0; URL=https://zonca.dev/{{ url }}">
<link rel="canonical" href="https://zonca.dev/{{ url }}"></p>Install custom Python environment on Jupyter Notebooks at NERSC2017-12-21T18:00:00-08:002017-12-21T18:00:00-08:00Andrea Zoncatag:zonca.github.io,2017-12-21:/2017/12/custom-conda-python-jupyter-nersc.html<h2>Jupyter Notebooks at NERSC</h2>
<p>NERSC has provided a JupyterHub instance for quite some time to all NERSC users.
It is currently running on a dedicated large-memory node on Cori, so now it can access also data on
Cori <code>$SCRATCH</code>, not only <code>/project</code> and <code>$HOME</code>. See <a href="http://www.nersc.gov/users/data-analytics/data-analytics-2/jupyter-and-rstudio/">their documentation</a></p>
<h2>Customize your Python environment</h2>
<p>NERSC provides Anaconda in an Ubuntu container; of course, users don't have permission to write to the Anaconda folder to install new packages.</p>
<p>The easiest way to install a custom Python environment is to create another conda environment and then register its kernel with Jupyter.</p>
<p>Create a new conda environment; the best location is <code>/project</code> if you have one, otherwise <code>$HOME</code> works.
Access <a href="http://jupyter.nersc.gov">http://jupyter.nersc.gov</a> and open a terminal with "New"->"Terminal":</p>
<div class="highlight"><pre><span></span><span class="err">conda create --prefix $HOME/myconda python=3.6 ipykernel</span>
</pre></div>
<p>This is the minimal requirement; you could also add <code>anaconda</code> to get all the standard packages, or specify the <code>conda-forge</code> channel to install other packages, e.g.:</p>
<div class="highlight"><pre><span></span><span class="err">source activate myconda</span>
<span class="err">conda install -c conda-forge healpy</span>
</pre></div>
<p>Register the kernel with the Jupyter Notebook:</p>
<div class="highlight"><pre><span></span><span class="err">ipython kernel install --name myconda --user</span>
</pre></div>
<p>The kernel name specified here doesn't need to match the conda environment name, but using the same name keeps things simpler.</p>
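<p>As a quick sanity check (not part of the original NERSC instructions), you can confirm from a notebook cell running the new kernel that it uses the interpreter of your conda environment:</p>

```python
# Run this in a notebook cell using the "myconda" kernel:
# the printed path should point inside the environment,
# e.g. $HOME/myconda/bin/python
import sys

print(sys.executable)
```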
<p>Once the conda environment is active, you can also install packages with <code>pip</code>.</p>
<div class="highlight"><pre><span></span><span class="err">conda install pip</span>
<span class="err">pip install somepackage</span>
</pre></div>ECSS Symposium about Jupyterhub deployments on XSEDE2017-12-15T18:00:00-08:002017-12-15T18:00:00-08:00Andrea Zoncatag:zonca.github.io,2017-12-15:/2017/12/ecss-symposium.html<h2>Jupyter Notebooks at scale for Gateways and Workshops</h2>
<p>ECSS Symposium, 19 December 2017, Web presentation to the XSEDE <a href="https://www.xsede.org/for-users/ecss">Extended Collaborative Support Services</a>.</p>
<p>Overview on deployment options for Jupyter Notebooks at scale on XSEDE resources.</p>
<h2>Presentation</h2>
<ul>
<li><a href="https://docs.google.com/presentation/d/1vxtRaeju7qWrb_RXcsh-m2lKEDZoFBCJE0SWOMi-wNo/edit?usp=sharing">Google doc slides</a></li>
<li><a href="https://www.youtube.com/watch?v=BE6tRuJtq8c">Recording of the talk on Youtube</a></li>
</ul>
<h2>Tutorials</h2>
<p>Step-by-step tutorials and configuration files to deploy JupyterHub on XSEDE resources:</p>
<ul>
<li><a href="https://zonca.github.io/2017/05/jupyterhub-hpc-batchspawner-ssh.html">spawn Notebooks on a traditional HPC system</a></li>
<li><a href="https://zonca.github.io/2017/10/scalable-jupyterhub-docker-swarm-mode.html">setup a distributed scalable system on Jetstream instances via <strong>Docker Swarm</strong></a></li>
<li><a href="https://zonca.github.io/2017/12/scalable-jupyterhub-kubernetes-jetstream.html">setup a distributed scalable system on Jetstream instances via <strong>Kubernetes</strong></a></li>
</ul>
<h2>Publication</h2>
<p>Paper in preparation: "Deploying Jupyter Notebooks at scale on XSEDE for Science Gateways and workshops", Andrea Zonca and Robert Sinkovits, PEARC18</p>Deploy scalable Jupyterhub with Kubernetes on Jetstream2017-12-05T18:00:00-08:002017-12-05T18:00:00-08:00Andrea Zoncatag:zonca.github.io,2017-12-05:/2017/12/scalable-jupyterhub-kubernetes-jetstream.html<ul>
<li><strong>Tested in June 2018 with Ubuntu 18.04 and Kubernetes 1.10</strong></li>
<li><strong>Updated in February 2018 with newer version of <code>kubeadm-bootstrap</code>, Kubernetes 1.9.2</strong></li>
</ul>
<h2>Introduction</h2>
<p>The best infrastructure available to deploy Jupyterhub at scale is Kubernetes. Kubernetes provides a fault-tolerant system to deploy, manage and scale containers. The Jupyter team released a recipe to deploy Jupyterhub on top of Kubernetes, <a href="https://zero-to-jupyterhub.readthedocs.io">Zero to Jupyterhub</a>. In this deployment both the hub, the proxy and all Jupyter Notebooks servers for the users are running inside Docker containers managed by Kubernetes.</p>
<p>Kubernetes is a highly sophisticated system; for smaller deployments (30 to 50 users, fewer than 10 servers), another option is Docker Swarm mode, which I covered in a <a href="https://zonca.github.io/2017/10/scalable-jupyterhub-docker-swarm-mode.html">tutorial on how to deploy it on Jetstream</a>.</p>
<p>If you are not already familiar with Kubernetes, first read the <a href="https://zero-to-jupyterhub.readthedocs.io/en/latest/tools.html">section about tools in Zero to Jupyterhub</a>.</p>
<p>In this tutorial we will install Kubernetes on two Ubuntu instances on the XSEDE Jetstream OpenStack-based cloud, configure permanent storage with the Ceph distributed filesystem, and run the "Zero to Jupyterhub" recipe to install Jupyterhub on it.</p>
<h2>Setup two virtual machines</h2>
<p>First of all we need to create two virtual machines from the <a href="https://use.jetstream-cloud.org">Jetstream Atmosphere admin panel</a>. I tested this on the XSEDE Jetstream Ubuntu 16.04 image (with Docker pre-installed); for testing purposes "small" instances work, and they can be scaled up later for production. You can name them <code>master_node</code> and <code>node_1</code>, for example.
Make sure that ports 80 and 443 are open to outside connections.</p>
<p>Then you can SSH into the first machine with your XSEDE username with <code>sudo</code> privileges.</p>
<h2>Install Kubernetes</h2>
<p>The "Zero to Jupyterhub" recipe targets an already existing Kubernetes cluster, for example on Google Cloud. However, the Berkeley Data Science Education Program team, which administers one of the largest Jupyterhub deployments to date, released a set of scripts based on the <code>kubeadm</code> tool to set up Kubernetes from scratch.</p>
<p>This will install all the Kubernetes services and configure the <code>kubectl</code> command line tool for administering and monitoring the cluster, as well as the <code>helm</code> package manager to install pre-packaged services.</p>
<p>SSH into the first server and follow the instructions at <a href="https://github.com/data-8/kubeadm-bootstrap">https://github.com/data-8/kubeadm-bootstrap</a> to "Setup a Master Node";
this will also install a more recent version of Docker.</p>
<p>Once the initialization of the master node is completed, you should be able to check that several containers (pods in Kubernetes) are running:</p>
<div class="highlight"><pre><span></span><span class="err">zonca@js-xxx-xxx:~/kubeadm-bootstrap$ sudo kubectl get pods --all-namespaces</span>
<span class="err">NAMESPACE NAME READY STATUS RESTARTS AGE</span>
<span class="err">kube-system etcd-js-169-xx.jetstream-cloud.org 1/1 Running 0 1m</span>
<span class="err">kube-system kube-apiserver-js-169-xx.jetstream-cloud.org 1/1 Running 0 1m</span>
<span class="err">kube-system kube-controller-manager-js-169-xx.jetstream-cloud.org 1/1 Running 0 1m</span>
<span class="err">kube-system kube-dns-6f4fd4bdf-nxxkh 3/3 Running 0 2m</span>
<span class="err">kube-system kube-flannel-ds-rlsgb 1/1 Running 1 2m</span>
<span class="err">kube-system kube-proxy-ntmwx 1/1 Running 0 2m</span>
<span class="err">kube-system kube-scheduler-js-169-xx.jetstream-cloud.org 1/1 Running 0 2m</span>
<span class="err">kube-system tiller-deploy-69cb6984f-77nx2 1/1 Running 0 2m</span>
<span class="err">support support-nginx-ingress-controller-k4swb 1/1 Running 0 36s</span>
<span class="err">support support-nginx-ingress-default-backend-cb84895fb-qs9pp 1/1 Running 0 36s</span>
</pre></div>
<p>Also make sure routing is working: access the address of the virtual machine <code>js-169-xx.jetstream-cloud.org</code> with your web browser and verify you get the error message <code>default backend - 404</code>.</p>
<p>Then SSH to the other server and set it up as a worker following the instructions in "Setup a Worker Node" at <a href="https://github.com/data-8/kubeadm-bootstrap">https://github.com/data-8/kubeadm-bootstrap</a>.</p>
<p>Once the setup is complete on the worker, log back in to the master and check that the worker joined Kubernetes:</p>
<div class="highlight"><pre><span></span><span class="err">zonca@js-169-xx:~/kubeadm-bootstrap$ sudo kubectl get nodes</span>
<span class="err">NAME STATUS ROLES AGE VERSION</span>
<span class="err">js-168-yyy.jetstream-cloud.org Ready <none> 1m v1.9.2</span>
<span class="err">js-169-xx.jetstream-cloud.org Ready master 2h v1.9.2</span>
</pre></div>
<h2>Setup permanent storage for Kubernetes</h2>
<p>The cluster we just set up has no permanent storage, so user data would disappear every time a container is killed.
We would like to provide users with a permanent home available across the whole Kubernetes cluster, so that even if a user's container spawns again on a different server, their data are still available.</p>
<p>First we log in again to the Jetstream web interface and create two volumes (for example 10 GB each), attaching one to the master and one to the first node; they will be automatically mounted on <code>/vol_b</code> with no need to reboot the servers.</p>
<p>Kubernetes can provide Persistent Volumes, but it needs a distributed file system backend. In this tutorial we will be using <a href="https://rook.io/">Rook</a>, which sets up the Ceph distributed filesystem across the nodes.</p>
<p>We can first use Helm to install the Rook services (I ran my tests with <code>v0.6.1</code>):</p>
<div class="highlight"><pre><span></span><span class="err">sudo helm repo add rook-alpha https://charts.rook.io/alpha</span>
<span class="err">sudo helm install rook-alpha/rook</span>
</pre></div>
<p>Then check that the pods have started:</p>
<div class="highlight"><pre><span></span><span class="err">zonca@js-xxx-xxx:~/kubeadm-bootstrap$ sudo kubectl get pods</span>
<span class="err">NAME READY STATUS RESTARTS AGE</span>
<span class="err">rook-agent-2v86r 1/1 Running 0 1h</span>
<span class="err">rook-agent-7dfl9 1/1 Running 0 1h</span>
<span class="err">rook-operator-88fb8f6f5-tss5t 1/1 Running 0 1h</span>
</pre></div>
<p>Once the pods have started we can actually configure the storage: copy this <a href="https://github.com/zonca/jupyterhub-deploy-kubernetes-jetstream/blob/master/storage_rook/rook-cluster.yaml"><code>rook-cluster.yaml</code> file</a> to the master node. Better yet, clone the whole repository, as we will be using other files from it later.</p>
<p>The most important bits are:</p>
<ul>
<li><code>dataDirHostPath</code>: this is the folder where Rook saves its configuration; we can set it to <code>/var/lib/rook</code></li>
<li><code>storage: directories</code>: this is where data are stored; we can set this to <code>/vol_b</code>, the default mount point of volumes on Jetstream. This way we can more easily back them up or increase their size.</li>
<li><code>versionTag</code>: make sure this is the same as your <code>rook</code> version (you can find it with <code>sudo helm ls</code>)</li>
</ul>
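<p>Putting these settings together, the relevant part of <code>rook-cluster.yaml</code> looks roughly like this (a sketch based on the Rook v0.6 <code>Cluster</code> resource; refer to the linked file for the exact layout):</p>

```yaml
# Sketch of a Rook v0.6 Cluster resource; field names follow the
# rook.io/v1alpha1 API, the exact layout in the linked file may differ.
apiVersion: rook.io/v1alpha1
kind: Cluster
metadata:
  name: rook
  namespace: rook
spec:
  versionTag: v0.6.1              # keep in sync with the installed chart (sudo helm ls)
  dataDirHostPath: /var/lib/rook  # where Rook saves its configuration on each node
  storage:
    useAllNodes: true
    directories:
      - path: /vol_b              # the Jetstream volume mount point
```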
<p>Then run it with:</p>
<div class="highlight"><pre><span></span><span class="err">sudo kubectl create -f rook-cluster.yaml</span>
</pre></div>
<p>And wait for the services to launch:</p>
<div class="highlight"><pre><span></span><span class="err">zonca@js-xxx-xxx:~/kubeadm-bootstrap$ sudo kubectl -n rook get pods</span>
<span class="err">NAME READY STATUS RESTARTS AGE</span>
<span class="err">rook-api-68b87d48d5-xmkpv 1/1 Running 0 6m</span>
<span class="err">rook-ceph-mgr0-5ddd685b65-kw9bz 1/1 Running 0 6m</span>
<span class="err">rook-ceph-mgr1-5fcf599447-j7bpn 1/1 Running 0 6m</span>
<span class="err">rook-ceph-mon0-g7xsk 1/1 Running 0 7m</span>
<span class="err">rook-ceph-mon1-zbfqt 1/1 Running 0 7m</span>
<span class="err">rook-ceph-mon2-c6rzf 1/1 Running 0 6m</span>
<span class="err">rook-ceph-osd-82lj5 1/1 Running 0 6m</span>
<span class="err">rook-ceph-osd-cpln8 1/1 Running 0 6m</span>
</pre></div>
<p>This step launches the distributed file system Ceph on all nodes.</p>
<p>Finally we can create a new StorageClass which provides block storage for the pods to store data persistently, get <a href="https://github.com/zonca/jupyterhub-deploy-kubernetes-jetstream/blob/master/storage_rook/rook-storageclass.yaml"><code>rook-storageclass.yaml</code> from the same repository we used before</a> and execute with:</p>
<div class="highlight"><pre><span></span><span class="err">sudo kubectl create -f rook-storageclass.yaml</span>
</pre></div>
<p>You should now have the rook storageclass available:</p>
<div class="highlight"><pre><span></span><span class="err">sudo kubectl get storageclass</span>
<span class="err">NAME PROVISIONER</span>
<span class="err">rook-block rook.io/block</span>
</pre></div>
<h3>(Optional) Test Rook Persistent Storage</h3>
<p>Optionally, we can deploy a simple pod to verify that the storage system is working properly.</p>
<p>You can copy <a href="https://github.com/zonca/jupyterhub-deploy-kubernetes-jetstream/blob/master/storage_rook/alpine-rook.yaml"><code>alpine-rook.yaml</code> from Github</a>
and launch it with:</p>
<div class="highlight"><pre><span></span><span class="err">sudo kubectl create -f alpine-rook.yaml</span>
</pre></div>
<p>This creates a small Pod running Alpine Linux together with a 2 GB Persistent Volume Claim mounted under <code>/data</code>. The Persistent Volume Claim specifies the type of storage and its size; once the Pod is created, the claim asks Rook to prepare a Persistent Volume, which is then mounted into the Pod.</p>
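<p>The Pod/claim pairing described above can be sketched as two manifests. The following Python dictionaries mirror what <code>alpine-rook.yaml</code> does; the object names and the exact fields are illustrative, not the literal contents of that file:</p>

```python
# Illustrative sketch of a PersistentVolumeClaim plus a Pod mounting it,
# mirroring what alpine-rook.yaml does; names are assumptions.
pvc = {
    "apiVersion": "v1",
    "kind": "PersistentVolumeClaim",
    "metadata": {"name": "alpine-data"},
    "spec": {
        "storageClassName": "rook-block",   # the StorageClass created above
        "accessModes": ["ReadWriteOnce"],
        "resources": {"requests": {"storage": "2Gi"}},
    },
}

pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "alpine"},
    "spec": {
        "containers": [{
            "name": "alpine",
            "image": "alpine:latest",
            "command": ["sleep", "infinity"],
            # the claim ends up mounted under /data
            "volumeMounts": [{"name": "data", "mountPath": "/data"}],
        }],
        "volumes": [{
            "name": "data",
            "persistentVolumeClaim": {"claimName": "alpine-data"},
        }],
    },
}
```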
<p>We can verify that the Persistent Volume is created and associated with the claim, and check the pod logs:</p>
<div class="highlight"><pre><span></span><span class="err">sudo kubectl get pv</span>
<span class="err">sudo kubectl get pvc</span>
<span class="err">sudo kubectl get logs alpine</span>
</pre></div>
<p>We can get a shell in the pod with:</p>
<div class="highlight"><pre><span></span><span class="err">sudo kubectl exec -it alpine -- /bin/sh</span>
</pre></div>
<p>access <code>/data/</code> and make sure we can write some files.</p>
<p>Once you have completed testing, you can delete the pod and the Persistent Volume Claim with:</p>
<div class="highlight"><pre><span></span><span class="err">sudo kubectl delete -f alpine-rook.yaml</span>
</pre></div>
<p>The Persistent Volume will be automatically deleted by Kubernetes after a few minutes.</p>
<h2>Setup HTTPS with letsencrypt</h2>
<p>We need <code>kube-lego</code> to automatically obtain an HTTPS certificate from Let's Encrypt.
For more information see the Ingress section of the <a href="http://zero-to-jupyterhub.readthedocs.io/en/latest/advanced.html">Zero to Jupyterhub Advanced topics</a>.</p>
<p>First we need to customize the Kube Lego configuration, edit the <code>config_kube-lego_helm.yaml</code> file from the repository and set your email address, then:</p>
<div class="highlight"><pre><span></span><span class="err">sudo helm install stable/kube-lego --namespace=support --name=lego -f config_kube-lego_helm.yaml</span>
</pre></div>
<p>Then, after you deploy Jupyterhub, if you have HTTPS trouble you should check the logs of the kube-lego pod. First find the name of the pod with:</p>
<div class="highlight"><pre><span></span><span class="err">sudo kubectl get pods -n support</span>
</pre></div>
<p>Then check its logs:</p>
<div class="highlight"><pre><span></span><span class="err">sudo kubectl logs -n support lego-kube-lego-xxxxx-xxx</span>
</pre></div>
<h2>Install Jupyterhub</h2>
<p>Read all of the documentation of "Zero to Jupyterhub", then download <a href="https://github.com/zonca/jupyterhub-deploy-kubernetes-jetstream/blob/master/config_jupyterhub_helm.yaml"><code>config_jupyterhub_helm.yaml</code> from the repository</a>, customize it with the URL of the master node (for Jetstream <code>js-xxx-xxx.jetstream-cloud.org</code>), generate the random strings for security, and finally run the Helm chart:</p>
<div class="highlight"><pre><span></span><span class="err">sudo helm repo add jupyterhub https://jupyterhub.github.io/helm-chart/</span>
<span class="err">sudo helm repo update</span>
<span class="err">sudo helm install jupyterhub/jupyterhub --version=v0.6 --name=jup \</span>
<span class="err"> --namespace=jup -f config_jupyterhub_helm.yaml</span>
</pre></div>
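<p>The "random strings for security" mentioned above (for example the proxy secret token in <code>config_jupyterhub_helm.yaml</code>; the exact key name depends on the chart version) can be generated with Python's <code>secrets</code> module:</p>

```python
import secrets

# Generate a 32-byte random string encoded as 64 hex characters, suitable
# for the secret token fields in config_jupyterhub_helm.yaml.
token = secrets.token_hex(32)
print(token)
```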
<p>Once you modify the configuration you can update the deployment with:</p>
<div class="highlight"><pre><span></span><span class="err">sudo helm upgrade jup jupyterhub/jupyterhub -f config_jupyterhub_helm.yaml</span>
</pre></div>
<h3>Test Jupyterhub</h3>
<p>Connect to the public URL of your master node instance at: <a href="https://js-xxx-xxx.jetstream-cloud.org">https://js-xxx-xxx.jetstream-cloud.org</a></p>
<p>Try to login with your XSEDE username and password and check if Jupyterhub works properly.</p>
<p>If something is wrong, check:</p>
<div class="highlight"><pre><span></span><span class="err">sudo kubectl --namespace=jup get pods</span>
</pre></div>
<p>Get the name of the <code>hub</code> pod and check the logs:</p>
<div class="highlight"><pre><span></span><span class="err">sudo kubectl --namespace=jup logs hub-xxxx-xxxxxxx</span>
</pre></div>
<p>Check that Rook is working properly:</p>
<div class="highlight"><pre><span></span><span class="err">sudo kubectl --namespace=jup get pv</span>
<span class="err">sudo kubectl --namespace=jup get pvc</span>
<span class="err">sudo kubectl --namespace=jup describe pvc claim-YOURXSEDEUSERNAME</span>
</pre></div>
<h2>Administration tips</h2>
<h3>Add more servers to Kubernetes</h3>
<p>We can create more Ubuntu instances (with a volume attached) and add them to Kubernetes by repeating the same setup we performed on the first worker node.
Once the node joins Kubernetes, it will be automatically used as a node for the distributed filesystem by Rook and be available to host user containers.</p>
<h3>Remove a server from Kubernetes</h3>
<p>First launch the <code>kubectl drain</code> command to move the currently running pods to other nodes:</p>
<div class="highlight"><pre><span></span><span class="err">sudo kubectl get nodes</span>
<span class="err">sudo kubectl drain <node name></span>
</pre></div>
<p>Then suspend or delete the instance on the Jetstream admin panel.</p>
<h3>Configure a different authentication system</h3>
<p>"Zero to Jupyterhub" supports out of the box authentication with:</p>
<ul>
<li>XSEDE credentials with CILogon</li>
<li>Many Campuses credentials with CILogon</li>
<li>Globus</li>
<li>Google</li>
</ul>
<p>See <a href="https://zero-to-jupyterhub.readthedocs.io/en/latest/extending-jupyterhub.html#authenticating-with-oauth2">the documentation</a> and modify <code>config_jupyterhub_helm.yaml</code> accordingly.</p>
<h2>Acknowledgements</h2>
<ul>
<li>The Jupyter team, in particular Yuvi Panda, for providing a great software platform and an easy-to-use resource for deploying it, and for direct support in debugging my issues</li>
<li>XSEDE Extended Collaborative Support Services for supporting part of my time to work on deploying Jupyterhub on Jetstream and providing computational time on Jetstream</li>
<li>Pacific Research Platform, in particular John Graham, Thomas DeFanti and Dmitry Mishin (SDSC) for access to their Kubernetes platform for testing</li>
<li>XSEDE Jetstream's Jeremy Fischer for prompt answers to my questions on Jetstream</li>
</ul>Store a conda environment inside a Notebook2017-12-04T18:00:00-08:002017-12-04T18:00:00-08:00Andrea Zoncatag:zonca.github.io,2017-12-04:/2017/12/store-conda-environment-inside-notebook.html<p>Last August, during the Container Analysis Environments Workshop held at Urbana-Champaign,
we had a discussion about reproducibility in Jupyter Notebooks.
The idea came up of storing all the details about the Python environment inside the Notebook,
in its metadata.</p>
<p>I released an experimental package on Github (and PyPI):</p>
<p><a href="https://github.com/zonca/nbenv">https …</a></p><p>Last August, during the Container Analysis Environments Workshop held at Urbana-Champaign,
we had a discussion about reproducibility in Jupyter Notebooks.
The idea came up of storing all the details about the Python environment inside the Notebook,
in its metadata.</p>
<p>I released an experimental package on Github (and PyPI):</p>
<p><a href="https://github.com/zonca/nbenv">https://github.com/zonca/nbenv</a></p>
<p>For simplicity it only supports <code>conda</code> environments, but it also supports having <code>pip</code>-installed packages
inside those environments.</p>
<p>It automatically saves the <code>conda</code> environment as metadata inside the <code>.ipynb</code> document and then provides
a command line tool to inspect it and create a new <code>conda</code> environment based on it.</p>
<p>I am not sure this is the best design, please open Issues on Github to send me feedback!</p>How to modify Singularity images on a Supercomputer2017-11-06T18:00:00-08:002017-11-06T18:00:00-08:00Andrea Zoncatag:zonca.github.io,2017-11-06:/2017/11/modify-singularity-images.html<h2>Introduction</h2>
<p><a href="http://singularity.lbl.gov/">Singularity</a> allows to run your own OS within most Supercomputers, see my previous post about <a href="https://zonca.github.io/2017/01/singularity-hpc-comet.html">Running Ubuntu on Comet via Singularity</a></p>
<p>Singularity's adoption by High Performance Computing centers has been driven by its strict security model. It never allows a user in a container to have <code>root</code> privileges unless …</p><h2>Introduction</h2>
<p><a href="http://singularity.lbl.gov/">Singularity</a> allows to run your own OS within most Supercomputers, see my previous post about <a href="https://zonca.github.io/2017/01/singularity-hpc-comet.html">Running Ubuntu on Comet via Singularity</a></p>
<p>Singularity's adoption by High Performance Computing centers has been driven by its strict security model. It never allows a user in a container to have <code>root</code> privileges unless the user is <code>root</code> on the Host system.</p>
<p>This means that you can only modify containers on a machine where you have <code>root</code>. Therefore you generally build a container on your local machine and then copy it to a Supercomputer.
The process is tedious if you are still tweaking your container and modifying it often, since each time you have to copy over a 4 or maybe 8 GB container image.</p>
<p>In the next section I'll investigate possible solutions/workarounds.</p>
<h2>Use DockerHub</h2>
<p>Singularity can pull a container from DockerHub, so it is convenient if you are already using Docker, maybe to provide a simple way to install your software.</p>
<p>I found out that if you use the automatic build of your container by DockerHub itself, this is very slow: sometimes it takes 30 minutes to have your new container built.</p>
<p>Therefore it is best to build your container locally and then push it to DockerHub. A Docker image is organized in filesystem layers, so for small tweaks to your image you transfer tens of MB to DockerHub instead of GB.</p>
<p>Then from the Supercomputer you can run <code>singularity pull docker://ubuntu:latest</code> with no need of <code>root</code> privileges. Singularity keeps a cache of the docker layers, so you would download just the layers modified in the previous step.</p>
<h2>Build your application locally</h2>
<p>If you are modifying an application often you could build a Singularity container with all the requirements, copy it to the Supercomputer and then build your application there. This is also useful if the architecture of your CPU is different between your local machine and the Supercomputer and you are worried the compiler would not apply all the possible optimizations.</p>
<p>In this case you can use <code>singularity shell</code> to get a terminal inside the container, then build your software with the compiler toolchain available <strong>inside the container</strong> and then install it to your <code>$HOME</code> folder, then modify your <code>$PATH</code> and <code>$LD_LIBRARY_PATH</code> to execute and load libraries from this local folder.</p>
<p>This is also useful in case the container has already an application installed but you want to develop on it. You can follow this process and then mask the installed application with your new version.</p>
<p>Of course this makes your analysis <strong>not portable</strong>, since the software is not available inside the container.</p>
<h3>Freeze your application inside the container</h3>
<p>Once you have completed tweaking the application on the Supercomputer, you can now switch back to your local machine, get the last version of your application and install it system-wide inside the container so that it will be portable.</p>
<p>On the other hand, you might be concerned about performance and prefer to have the application built on the Supercomputer. You can run the build process (e.g. <code>make</code> or <code>python setup.py build</code>) on the Supercomputer in your home folder, then sync the build artifacts back to your local machine and run the install process there (e.g. <code>sudo make install</code> or <code>sudo python setup.py install</code>). Optionally use <code>sshfs</code> to mount the build folder on both machines and make the process transparent.</p>
<h2>Use a local Singularity registry</h2>
<p>Singularity released <a href="https://singularityhub.github.io/singularity-registry/inst/"><code>singularity-registry</code></a>, an application to build a local image registry, like DockerHub, that can take care of building containers.</p>
<p>This can be hosted locally at a Supercomputing Center to provide a local building service. For example the Texas Advanced Computing Center <a href="https://www.slideshare.net/JohnFonner1/biocontainers-for-supercomputers-2000-accessible-discoverable-singularity-apps">locally builds Singularity images from BioContainers</a>, software packages for the Life Sciences.</p>
<p>Otherwise, for example, a user at SDSC could install Singularity Registry on SDSC Cloud and configure it to mount one of Comet's filesystems and build the container images there. Even installing Singularity Registry on Jetstream could be an option thanks to its fast connection to other XSEDE resources.</p>
<h2>Feedback</h2>
<p>If you have any feedback, please reach me at <a href="https://twitter.com/andreazonca">@andreazonca</a> or find my email from there.</p>Deploy scalable Jupyterhub on Docker Swarm mode2017-10-26T18:00:00-07:002017-10-26T18:00:00-07:00Andrea Zoncatag:zonca.github.io,2017-10-26:/2017/10/scalable-jupyterhub-docker-swarm-mode.html<h2>Introduction</h2>
<p>Jupyterhub generally requires roughly 500MB per user for light data processing and many GB for heavy data processing; therefore it is often necessary to deploy it across multiple machines to support many users.</p>
<p>The recommended scalable deployment for Jupyterhub is on Kubernetes, see <a href="https://zonca.github.io/2016/05/jupyterhub-docker-swarm.html">Zero to Jupyterhub</a> (and I'll cover …</p><h2>Introduction</h2>
<p>Jupyterhub generally requires roughly 500MB per user for light data processing and many GB for heavy data processing; therefore it is often necessary to deploy it across multiple machines to support many users.</p>
<p>The recommended scalable deployment for Jupyterhub is on Kubernetes, see <a href="https://zonca.github.io/2016/05/jupyterhub-docker-swarm.html">Zero to Jupyterhub</a> (and I'll cover it next). However, the learning curve for Kubernetes is quite steep; I believe that for smaller deployments (30-50 users, about 10 users per machine) where high availability is not critical, deploying on Docker in Swarm mode is a simpler option.</p>
<p>In the past I have covered a <a href="https://zonca.github.io/2016/05/jupyterhub-docker-swarm.html">Jupyterhub deployment on the old version of Docker Swarm</a> using <code>DockerSpawner</code>. The most important difference is that the last version of Docker has a more sophisticated "Swarm mode" that allows you to launch and manage services instead of individual containers, support for this is provided by <a href="https://github.com/cassinyio/SwarmSpawner"><code>SwarmSpawner</code></a>. Thanks to the new architecture, we do not need to have actual Unix accounts on the Host but all users can run with the <code>jovyan</code> user account defined only inside the Docker containers. Then we can also deploy Jupyterhub itself as a Docker container instead of installing it on the Host.</p>
<h2>Setup a Virtual Machine for the Hub</h2>
<p>First of all we need to create a Virtual Machine. I tested this on the XSEDE Jetstream CentOS 7 image (with Docker pre-installed), but I would recommend Ubuntu 16.04, which is more widely used so it is easier to find support for it.
The same setup would work on a bare-metal server.</p>
<p>Make sure that a recent version of Docker is installed, I used <code>17.07.0-ce</code>.</p>
<p>Setup networking so that port 80 and 443 are accessible for HTTP and HTTPS. Associate a Public IP to this instance so that it is accessible from the Internet.</p>
<p>Add your user to the <code>docker</code> group so you do not need <code>sudo</code> to run <code>docker</code> commands. Check that <code>docker</code> works running <code>docker info</code>.</p>
<h3>Clone the config files repository</h3>
<p>I recommend creating the folder <code>/etc/jupyterhub</code>, setting ownership to your user and cloning my configuration repository there:</p>
<div class="highlight"><pre><span></span><span class="err">git clone https://github.com/zonca/deploy-jupyterhub-dockerswarm /etc/jupyterhub</span>
</pre></div>
<h3>Setup Swarm</h3>
<p>The first node is going to be the <em>Master</em> node of the Swarm, launch:</p>
<div class="highlight"><pre><span></span><span class="err">docker swarm init --advertise-addr INTERNAL_IP_ADDRESS</span>
</pre></div>
<p>It is better to use an internal IP address, for example on Jetstream the <code>192.xxx.xxx.xxx</code> IP. This is the address that the other instances will use to connect to this node.</p>
<p>This command will print out the command that the other nodes will need to run to join this swarm; save it for later (you can recover it with <code>docker swarm join-token</code>).</p>
<h3>Install the NGINX web server</h3>
<p>NGINX is going to sit in front of Jupyterhub as a proxy and handle SSL (configured at the end of this tutorial); we are also going to run NGINX as a Docker service:</p>
<div class="highlight"><pre><span></span><span class="err">docker pull nginx:latest</span>
</pre></div>
<p>Now let's test that Docker and the networking is working correctly, launch <code>nginx</code> with the default configuration:</p>
<div class="highlight"><pre><span></span><span class="err">docker service create \</span>
<span class="err"> --name nginx \</span>
<span class="err"> --publish 80:80 \</span>
<span class="err"> nginx</span>
</pre></div>
<p>This creates a service, and the service in turn creates the containers; check with <code>docker service ls</code> and <code>docker ps</code>. If a container dies, the service will automatically relaunch it.
Now if you connect to your instance from an external machine you should see the NGINX welcome page.
If this is not the case, check <code>docker ps -a</code> and <code>docker logs CONTAINER_ID</code> to debug the issue.</p>
<p>Finally remove the service with:</p>
<div class="highlight"><pre><span></span><span class="err">docker service rm nginx</span>
</pre></div>
<p>Now run the service with the configuration for Jupyterhub, edit <code>nginx.conf</code> and replace <code>SERVER_URL</code> then launch:</p>
<div class="highlight"><pre><span></span><span class="err">bash ngnx_service.sh</span>
</pre></div>
<p>At this point you should get a Gateway error if you connect to your instance with a browser.</p>
<h3>Install Jupyterhub</h3>
<p>Before launching Jupyterhub you need to create a Docker network so that the containers in the swarm can communicate easily:</p>
<div class="highlight"><pre><span></span><span class="err">docker network create --driver overlay jupyterhub</span>
</pre></div>
<p>You can launch the official Jupyterhub 0.8.0 container as a service with:</p>
<div class="highlight"><pre><span></span><span class="err">docker service create \</span>
<span class="err"> --name jupyterhubserver \</span>
<span class="err"> --network jupyterhub \</span>
<span class="err"> --detach=true \</span>
<span class="err"> jupyterhub/jupyterhub:0.8.0</span>
</pre></div>
<p>This runs Jupyterhub with the default <code>jupyterhub_config.py</code>, using local auth and the local spawner.
If you connect to the instance now you should see the Jupyterhub login page, but you cannot log in because you don't have
a user account inside the container. We'll set up authentication next.</p>
<h4>Configure Jupyterhub</h4>
<p>Next we want to customize the hub: first log in on <a href="http://hub.docker.com">http://hub.docker.com</a> and create a new repository,
then follow the instructions there to set up <code>docker push</code> on your server so you can push your image
to the registry.</p>
<p>This is necessary because Swarm might spawn the service on a different machine, so it needs an external
registry to make sure it pulls the right image.</p>
<p>You can now customize the hub image in <code>/etc/jupyterhub/hub</code> with <code>docker build . -t yourusername/jupyterhub-docker</code>
and push it remotely with <code>docker push yourusername/jupyterhub-docker</code>.</p>
<p>This image includes <code>oauthenticator</code> for Github, Google, CILogon and Globus authentication and <code>swarmspawner</code> for
spawning containers for the users.</p>
<p>We can now create <code>jupyterhub_config.py</code>; for now we just want temporary home folders, so replace the <code>mounts</code> variable with <code>[]</code> in <code>c.SwarmSpawner.container_spec</code>. Then customize the server URL <code>server_url.com</code> and IP <code>SERVER_IP</code> (this will be needed later).
At the bottom of <code>jupyterhub_config.py</code> we can also customize CPU and memory constraints. Unfortunately there is no easy way to set a custom disk space limit.</p>
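<p>A sketch of the options discussed above, to be assigned in <code>jupyterhub_config.py</code>. The option names follow the SwarmSpawner documentation as I understand it; verify them against the version you install, and treat the image name and limits as placeholder values:</p>

```python
# Sketch of the SwarmSpawner settings discussed above; option names are
# based on SwarmSpawner's documentation and should be checked against the
# installed version. The image name and limit values are assumptions.
container_spec = {
    "Image": "jupyterhub/singleuser:0.8",
    "mounts": [],  # [] = temporary home folders inside the container
}

# CPU/memory constraints; Docker has no comparable knob for disk space.
resource_spec = {
    "cpu_limit": int(2 * 1e9),   # Swarm expresses CPUs in units of 1e-9 CPU
    "mem_limit": int(2 * 1e9),   # bytes, i.e. roughly 2 GB per user
}

# In jupyterhub_config.py these become:
# c.SwarmSpawner.container_spec = container_spec
# c.SwarmSpawner.resource_spec = resource_spec
```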
<p>Follow the documentation of <code>oauthenticator</code> to setup authentication.</p>
<p>Create the folder <code>/var/nfs</code>; we will configure it later, but it is hardcoded in the script that launches the service.</p>
<p>Temporarily remove from <code>launch_service_jupyterhub.sh</code> the line:</p>
<div class="highlight"><pre><span></span><span class="err">--mount src=nfsvolume,dst=/var/nfs \</span>
</pre></div>
<p>Launch the service from <code>/etc/jupyterhub</code> with <code>bash launch_service_jupyterhub.sh</code>.</p>
<p>Check in the script that we mount the Docker socket into the container so that Jupyterhub can launch Docker containers for the users. We also mount the <code>/etc/jupyterhub</code> folder so that the Hub has access to <code>jupyterhub_config.py</code>, and we constrain it to run on the manager node of this Swarm, which ensures it always runs on this first node. We could later add another manager node for resiliency and the Hub could potentially spawn there with no issues.</p>
<p>At this point we have a first working configuration of Jupyterhub; try to log in and check that the notebooks are working.
This configuration has no permanent storage, so users will have a home folder inside their container and will be able to
write Notebooks and data there until the container filesystem reaches the 10GB image size limit, so about 5GB in practice.
If they log out and log back in they will find their files still there, but if they do "Close my Server" from the control panel,
or if for any other reason their container is removed, they will lose their data.
So this setup could be used for short workshops or demos.</p>
<h2>Setup other nodes</h2>
<p>We can create another Virtual Machine with the same version of Docker and make sure that the two machines internally have all the port open to simplify networking. Any additional machine <strong>needs no open ports</strong> to the outside world, all connections will go through nginx.</p>
<p>We can have it join the Swarm by pasting the token got at Swarm initialization on the first node.</p>
<p>Now when Jupyterhub launches a single user container, it could spawn either on this server or on the first server, Swarm will automatically take care of load balancing. It will also automatically download the Docker image specified in <code>jupyterhub_config.py</code>.</p>
<p>We can add as many nodes as necessary.</p>
<h2>Setup Permanent storage</h2>
<p>Surprisingly enough, Swarm has no easy way to set up permanent storage that would automatically move data from one node to another when a user container is re-spawned on another server. There are some volume plugins, but I believe their configuration is so complex that at that point it would be better to switch directly to Kubernetes.
To achieve a simpler setup that I believe can easily handle a few tens of users, we can use NFS. Moreover, Docker volumes can handle NFS natively, so we don't even need home folders owned by each user: we can just point Docker volumes to our NFS folder, let Docker manage them for us, and use one single user. Users cannot access other people's files because only their own folder is mounted into their container.</p>
<h3>Setup a NFS server</h3>
<p>First we need to decide which server acts as the NFS server. For small deployments the first server, which runs the hub, can also handle this; for more performance we might want a dedicated server that only runs NFS and is part of the internal network, but does not participate in the Swarm, so that no user containers run on it.</p>
<p>In a Cloud environment like Jetstream or Amazon, it is useful to create a Volume and attach it to that instance, so that we can enlarge it later or back it up independently of the instance, and it will survive the Hub instance. Make sure to choose the XFS filesystem if you need to set up disk space constraints. Mount it at <code>/var/nfs/</code> and make sure it is writable by any user.</p>
<p>On that server we can install NFS following the OS instructions and setup <code>/etc/exports</code> with:</p>
<div class="highlight"><pre><span></span><span class="err">/var/nfs *(rw,sync,no_subtree_check)</span>
</pre></div>
<p>The NFS port is accessible only on the internal network anyway so we can just accept any connection.</p>
<p>SSH into any of the Swarm nodes and check this works fine with:</p>
<div class="highlight"><pre><span></span><span class="err">sudo mount 192.NFS.SRV.IP:/var/nfs /mnt</span>
<span class="err">touch /mnt/writing_works</span>
</pre></div>
<h3>Setup Jupyterhub to use Docker Volumes over NFS</h3>
<p>In <code>/etc/jupyterhub/jupyterhub_config.py</code> we should configure the mounts to <code>swarmspawner</code>:</p>
<div class="highlight"><pre><span></span><span class="err">mounts = [{'type': 'volume',</span>
<span class="err"> 'source': 'jupyterhub-user-{username}',</span>
<span class="err"> 'target': notebook_dir,</span>
<span class="err"> 'no_copy' : True,</span>
<span class="err"> 'driver_config' : {</span>
<span class="err"> 'name' : 'local',</span>
<span class="err"> 'options' : {</span>
<span class="err"> 'type' : 'nfs4',</span>
<span class="err"> 'o' : 'addr=SERVER_IP,rw',</span>
<span class="err"> 'device' : ':/var/nfs/{username}/'</span>
<span class="err"> }</span>
<span class="err"> },</span>
<span class="err">}]</span>
</pre></div>
<p>Replace <code>SERVER_IP</code> with your server, this tells the Docker <code>local</code> Volume driver to mount folders <code>/var/nfs/{username}</code> as home folders of the single user notebook container.</p>
<p>The only problem is that these folders need to pre-exist, so I modified the <code>swarmspawner</code> plugin to create them the first time a user authenticates; please let me know if there is a better way and I'll improve this tutorial.
See the branch <code>createfolder</code> on <a href="https://github.com/zonca/SwarmSpawner/tree/createfolder">my fork of <code>swarmspawner</code></a>.
In order to install this you need to modify your custom <code>jupyterhub-docker</code> image to install from there (see the commented-out section in <code>hub/Dockerfile</code>).
Often the <code>Authenticator</code> transforms the username into a hash, so I added a feature to this spawner that also creates a text file <code>HASH_email.txt</code> storing the user's email, so that it is easier to check directly from the filesystem who owns a specific folder.</p>
<p>For this to work the Hub needs access to <code>/var/nfs/</code>, the best way to achieve this is to create another Volume, add the <code>NFS_SERVER_IP</code> and launch on the first server:</p>
<div class="highlight"><pre><span></span><span class="err">bash create_volume_nfs.sh</span>
</pre></div>
<p>Then uncomment the <code>--mount src=nfsvolume,dst=/var/nfs \</code> line from <code>launch_service_jupyterhub.sh</code> and relaunch the service so that it is available locally.</p>
<p>At this point you should test that if you login, then stop/kill the container, your data should still be there when you launch it again.</p>
<h3>Setup user quota</h3>
<p>The Docker local Volume driver does not support setting a user quota, so we have to resort to our filesystem. You can modify <code>/etc/fstab</code> to mount the XFS volume with the <code>pquota</code> option, which supports setting a limit on a folder and all of its subfolders. We cannot use user quotas because all of the users run under the same UNIX account.</p>
<p>Create a folder <code>/var/nfs/testquota</code> and then test that setting quota is working with:</p>
<div class="highlight"><pre><span></span><span class="err">sudo set_quota.sh /var/nfs testquota</span>
</pre></div>
<p>There should be a space between <code>/var/nfs</code> and <code>testquota</code>, then check with:</p>
<div class="highlight"><pre><span></span><span class="err">bash get_quota.sh</span>
</pre></div>
<p>You should see a quota of <code>1GB</code> for that folder. Modify <code>set_quota.sh</code> to choose another size.</p>
<h4>Automatically set quotas</h4>
<p>We want the quota to be set automatically each time the spawner creates another folder; <code>incrond</code> can monitor a folder for newly created files and launch the <code>set_quota.sh</code> script for us.</p>
<p>Install the <code>incrond</code> package and make sure it is active and restarted on boot. Then customize it with <code>sudo incrontab -e</code> and paste the content of <code>incrontab</code> in <code>/etc/jupyterhub</code>.</p>
<p>Now delete your user folder in <code>/var/nfs</code> and launch Jupyterhub again to check that the folder is created with the correct quota. The spawner also creates a <code>/var/nfs/{username}_QUOTA_NOT_SET</code> that is deleted then by the <code>set_quota.sh</code> script.</p>
<h2>Setup HTTPS</h2>
<p>We would like to setup NGINX to provide SSL encryption for Jupyterhub using the free Letsencrypt service. The main issue is that those certificates need to be renewed every few months, so we need a service running regularly to take care of that.</p>
<p>The simplest option would be to add <code>--publish 8000</code> to the Jupyterhub so that Jupyterhub exposes its port to the host and then remove the NGINX Docker container and install NGINX and certbot directly on the first host following <a href="https://www.digitalocean.com/community/tutorials/how-to-secure-nginx-with-let-s-encrypt-on-ubuntu-16-04">a standard setup</a>.</p>
<p>However, to keep the setup more modular, we'll proceed and use another NGINX container that comes equipped with automatic Let's Encrypt certificates request and renewal available at: <a href="https://github.com/linuxserver/docker-letsencrypt">https://github.com/linuxserver/docker-letsencrypt</a>.</p>
<h3>Modify networking setup</h3>
<p>One complication is that this container requires additional privileges to handle networking that are not available in Swarm mode, so we will run this container outside of the Swarm on the first node.</p>
<p>We need to make the <code>jupyterhub</code> network that we created before attachable by containers outside the Swarm.</p>
<div class="highlight"><pre><span></span><span class="err">docker service rm nginx</span>
<span class="err">bash remove_service_jupyterhub.sh</span>
<span class="err">docker network rm jupyterhub</span>
<span class="err">docker network create --driver overlay --attachable jupyterhub</span>
</pre></div>
<p>Then add <code>--publish 8000</code> to <code>launch_service_jupyterhub.sh</code> and start Jupyterhub again. Make sure that if you SSH to the first node you can run <code>wget localhost:8000</code> successfully, but if you try to access <code>yourdomain:8000</code> from the internet you <strong>should not</strong> be able to connect (the port should be closed by the OpenStack networking configuration, for example).</p>
<h3>Test the NGINX/Letsencrypt container</h3>
<p>Create a volume to save the configuration and the logs (optionally on the NFS volume):</p>
<div class="highlight"><pre><span></span><span class="err">docker volume create --driver local nginx_volume</span>
</pre></div>
<p>Test the container running:</p>
<div class="highlight"><pre><span></span><span class="err">docker run \</span>
<span class="err"> --cap-add=NET_ADMIN \</span>
<span class="err"> --name nginx \</span>
<span class="err"> -p 443:443 \</span>
<span class="err"> -e EMAIL=your_email@domain.edu \</span>
<span class="err"> -e URL=your.domain.org \</span>
<span class="err"> -v nginx_volume:/config \</span>
<span class="err"> linuxserver/letsencrypt</span>
</pre></div>
<p>If this works correctly, connect to <a href="https://your.domain.org">https://your.domain.org</a>: you should see a valid SSL certificate and a welcome message. If not, check <code>docker logs nginx</code>.</p>
<h3>Configure NGINX to proxy Jupyterhub</h3>
<p>We can use <code>letsencrypt_container_nginx.conf</code> to handle NGINX configuration with HTTPS support; it loads the certificates from a path automatically created by the <code>letsencrypt</code> container.</p>
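<p>As a rough sketch (not the actual <code>letsencrypt_container_nginx.conf</code>; the upstream name and certificate paths are assumptions based on the <code>letsencrypt</code> container's <code>/config</code> layout), the proxy section looks like:</p>

```nginx
server {
    listen 443 ssl;
    server_name your.domain.org;

    # certificates requested automatically by the letsencrypt container
    ssl_certificate     /config/keys/letsencrypt/fullchain.pem;
    ssl_certificate_key /config/keys/letsencrypt/privkey.pem;

    location / {
        # forward everything to Jupyterhub over the attachable overlay network
        proxy_pass http://jupyterhub:8000;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        # WebSocket support, required by the notebook server
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
    }
}
```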
<p>Customize <code>launch_letsencrypt_container.sh</code> and then run it; it will recreate the NGINX container and also bind-mount the NGINX configuration into the container.</p>
<p>Now you should be able to connect to your server over HTTPS and access Jupyterhub.</p>
<h2>Feedback</h2>
<p>Feedback appreciated, <a href="https://twitter.com/andreazonca">@andreazonca</a></p>
<p>I am also available to support US scientists to deploy scientific gateways through the <a href="https://www.xsede.org/for-users/ecss">XSEDE ECSS consultation program</a>.</p>Setup automated testing on a Github repository with Travis-ci2017-09-06T18:00:00-07:002017-09-06T18:00:00-07:00Andrea Zoncatag:zonca.github.io,2017-09-06:/2017/09/automated-testing-travis-ci-github.html<h2>Introduction</h2>
<p>It is good practice in software development to implement extensive testing of the codebase in order to quickly catch any bug introduced into the code when implementing new features.</p>
<p>The suite of tests should be easy to execute (possibly one single command, for example with the <code>py.test</code> runner …</p><h2>Introduction</h2>
<p>It is good practice in software development to implement extensive testing of the codebase in order to quickly catch any bug introduced into the code when implementing new features.</p>
<p>The suite of tests should be easy to execute (possibly one single command, for example with the <code>py.test</code> runner) and quick to run (more than 1 minute would make it tedious to run).</p>
<p>The developers should run the unit test suite every time they implement a change to the codebase to make sure nothing else has been broken.</p>
<p>However, once a commit has been pushed to Github, it is also useful to have the tests executed automatically, for at least two reasons:</p>
<ul>
<li>Run tests in all the environments that need to be supported by the software, for example with different versions of Python or different versions of a key required external dependency</li>
<li>Run tests in a clean environment that has less risk of being contaminated by a misconfiguration in one of the developers' environments</li>
</ul>
<h2>Travis-CI</h2>
<p>Travis is a free web-based service that registers a trigger on Github so that every time a commit is pushed to Github or a Pull Request is opened, it launches an isolated Ubuntu container (macOS is also supported) for each of the configurations that we want to test, builds the software (if needed) and then runs the tests.</p>
<p>The only requirement for the free service is that the Github project be public; there are paid plans for private repositories.</p>
<h2>Setup on Travis-CI</h2>
<ul>
<li>Go to <a href="http://travis-ci.org">http://travis-ci.org</a> and login with a Github account</li>
<li>In order to automatically configure the hook on Github, Travis requests write privileges to your Github account, annoying but convenient</li>
<li>Leave all default options, just make sure that Pull Requests are automatically tested</li>
<li>If you have the repository both under an organization and as a fork under your account, you can choose to test both or just the organization repository; either way, your pull requests will be tested before merging.</li>
</ul>
<h2>Preparation of the test scripts</h2>
<p>In order to automate running the test scripts on Travis-CI, it is important that the test scripts return an exit code different from zero to signal that the tests failed.</p>
<p>If you are using a test running tool like <code>pytest</code>, this is automatically done for you. If you are using bash scripts instead, make sure the script calls <code>exit 1</code> when it detects an error.</p>
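<p>As an illustration (the checked command is a stand-in, not taken from any real test suite), a minimal bash test script that signals failure to Travis-CI could look like this:</p>

```shell
#!/bin/bash
# Minimal sketch of a test script: exit with a non-zero code on failure
# so that Travis-CI marks the build as broken.
result=$(echo "2 + 2" | tr -d ' ')   # placeholder for a real test command
if [ "$result" != "2+2" ]; then
    echo "test failed: unexpected output '$result'" >&2
    exit 1
fi
echo "all tests passed"
```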
<h2>Configuration of the repository</h2>
<ul>
<li>
<p>Create a new branch on your repository:</p>
<div class="highlight"><pre><span></span><span class="err">git checkout -b test_travis</span>
</pre></div>
</li>
<li>
<p>Add a <code>.travis.yml</code> (mind that it starts with a dot) configuration file</p>
</li>
<li>
<p>Inside this file you can configure how your project is built and tested, for the simple case of <code>bash</code> or <code>perl</code> scripts you can just write:</p>
<div class="highlight"><pre><span></span><span class="n">dist</span><span class="o">:</span> <span class="n">trusty</span>
<span class="n">language</span><span class="o">:</span> <span class="n">bash</span>
<span class="n">script</span><span class="o">:</span>
<span class="o">-</span> <span class="n">cd</span> <span class="n">$TRAVIS_BUILD_DIR</span><span class="o">/</span><span class="n">tests</span><span class="o">;</span> <span class="n">bash</span> <span class="n">run_test</span><span class="o">.</span><span class="na">sh</span>
</pre></div>
</li>
<li>
<p>Check the Travis-CI documentation for advanced configuration options</p>
</li>
<li>Now push these changes to your fork of the main repository and then create a Pull Request to the main repository</li>
<li>Go to <a href="https://travis-ci.org/YOUR_ORGANIZATION/YOUR_REPO">https://travis-ci.org/YOUR_ORGANIZATION/YOUR_REPO</a> to check the build status and the log</li>
<li>Once your Pull Request passes the tests, merge it to the main repository so that the master branch will also be tested for all future commits.</li>
</ul>
<h2>Python example</h2>
<p>In the following example, Travis-CI will create 8 builds: each of the 4 versions of Python will be tested with each of the 2 versions of <code>numpy</code>:</p>
<div class="highlight"><pre><span></span><span class="n">language</span><span class="o">:</span> <span class="n">python</span>
<span class="n">python</span><span class="o">:</span>
<span class="o">-</span> <span class="s2">"2.7"</span>
<span class="o">-</span> <span class="s2">"3.4"</span>
<span class="o">-</span> <span class="s2">"3.5"</span>
<span class="o">-</span> <span class="s2">"3.6"</span>
<span class="n">env</span><span class="o">:</span>
<span class="o">-</span> <span class="n">NUMPY_VERSION</span><span class="o">=</span><span class="mf">1.12</span><span class="o">.</span><span class="mi">1</span>
<span class="o">-</span> <span class="n">NUMPY_VERSION</span><span class="o">=</span><span class="mf">1.13</span><span class="o">.</span><span class="mi">1</span>
<span class="err">#</span> <span class="n">command</span> <span class="n">to</span> <span class="n">install</span> <span class="n">dependencies</span><span class="o">,</span> <span class="n">requirements</span><span class="o">.</span><span class="na">txt</span> <span class="n">should</span> <span class="n">NOT</span> <span class="k">include</span> <span class="n">numpy</span>
<span class="n">install</span><span class="o">:</span>
<span class="o">-</span> <span class="n">pip</span> <span class="n">install</span> <span class="o">-</span><span class="n">r</span> <span class="n">requirements</span><span class="o">.</span><span class="na">txt</span> <span class="n">numpy</span><span class="o">==</span><span class="n">$NUMPY_VERSION</span>
<span class="err">#</span> <span class="n">command</span> <span class="n">to</span> <span class="n">run</span> <span class="n">tests</span>
<span class="n">script</span><span class="o">:</span>
<span class="o">-</span> <span class="n">pytest</span> <span class="err">#</span> <span class="n">or</span> <span class="n">py</span><span class="o">.</span><span class="na">test</span> <span class="k">for</span> <span class="n">Python</span> <span class="n">versions</span> <span class="mf">3.5</span> <span class="n">and</span> <span class="n">below</span>
</pre></div>
<h2>Badge in README</h2>
<p>An aesthetic touch: left-click on the "Build Passing" badge on the Travis-CI page for your repository, choose "Markdown" and paste the code into the <code>README.md</code> of your repository on Github. This will show in real time whether the latest version of the code is passing the tests.</p>Deployment of Jupyterhub with Globus Auth to spawn Notebook on Comet in Singularity containers2017-08-11T18:00:00-07:002017-08-11T18:00:00-07:00Andrea Zoncatag:zonca.github.io,2017-08-11:/2017/08/jupyterhub-globus-comet-singularity.html<h2>Build Singularity containers to run single user notebook applications</h2>
<p>Follow the instructions at <a href="https://github.com/zonca/singularity-comet">https://github.com/zonca/singularity-comet</a> to build images from the <code>ubuntu_anaconda_jupyterhub.def</code> and <code>centos_anaconda_jupyterhub.def</code> definition files, or use the containers I have already built on Comet:</p>
<div class="highlight"><pre><span></span><span class="err">/oasis/scratch/comet/zonca/temp_project/centos_anaconda_jupyterhub.img</span>
<span class="err">/oasis/scratch/comet …</span></pre></div><h2>Build Singularity containers to run single user notebook applications</h2>
<p>Follow the instructions at <a href="https://github.com/zonca/singularity-comet">https://github.com/zonca/singularity-comet</a> to build images from the <code>ubuntu_anaconda_jupyterhub.def</code> and <code>centos_anaconda_jupyterhub.def</code> definition files, or use the containers I have already built on Comet:</p>
<div class="highlight"><pre><span></span><span class="err">/oasis/scratch/comet/zonca/temp_project/centos_anaconda_jupyterhub.img</span>
<span class="err">/oasis/scratch/comet/zonca/temp_project/ubuntu_anaconda_cmb_jupyterhub.img</span>
</pre></div>
<p>These containers have Centos 7 and Ubuntu 16.04 base images, MPI support (not needed for this), Anaconda 4.4.0, the Jupyterhub (for the <code>jupyterhub-singleuser</code> script) and Jupyterlab (for the awesomeness) packages.</p>
<h2>Initial setup of Jupyterhub with Ansible</h2>
<p>First we want to use the Ansible playbook provided by the Jupyter team to set up an Ubuntu Virtual Machine, for example on SDSC Cloud or XSEDE Jetstream.
This already sets up a Jupyterhub instance on a single machine with Github authentication, NGINX with Let's Encrypt SSL and spawning of Notebooks as local processes.</p>
<p>Start from: <a href="https://zonca.github.io/2017/02/automated-deployment-jupyterhub-ansible.html">Automated deployment of Jupyterhub with Ansible</a></p>
<p>There is a compatibility error with <code>conda</code> 4.3 and above; I had to fix this (and provided a PR upstream), so I used the version at <a href="https://github.com/zonca/jupyterhub-deploy-teaching/tree/globus_singularity">https://github.com/zonca/jupyterhub-deploy-teaching/tree/globus_singularity</a>.
In particular, check the example configuration file in the <code>host_vars/</code> folder.</p>
<p>Once we have executed the scripts, connect to the Virtual Machine, login with Github and check that Notebooks are working.</p>
<h2>Setup Authentication with Globus</h2>
<p>Next we can SSH into the Jupyterhub Virtual Machine and customize Jupyterhub configuration in <code>/etc/jupyterhub</code></p>
<p><code>oauthenticator</code> should already be installed, but it needs the Globus SDK to support authentication with Globus:</p>
<div class="highlight"><pre><span></span><span class="n">sudo</span><span class="w"> </span><span class="o">/</span><span class="n">opt</span><span class="o">/</span><span class="n">conda</span><span class="o">/</span><span class="n">bin</span><span class="o">/</span><span class="n">pip</span><span class="w"> </span><span class="n">install</span><span class="w"> </span><span class="n">globus_sdk</span><span class="o">[</span><span class="n">jwt</span><span class="o">]</span><span class="w"></span>
</pre></div>
<p>Then follow the instructions to setup Globus Auth: <a href="https://github.com/jupyterhub/oauthenticator#globus-setup">https://github.com/jupyterhub/oauthenticator#globus-setup</a></p>
<p>You should now have added these lines to <code>/etc/jupyterhub/jupyterhub_config.py</code>:</p>
<div class="highlight"><pre><span></span><span class="kn">from</span> <span class="nn">oauthenticator.globus</span> <span class="kn">import</span> <span class="n">GlobusOAuthenticator</span>
<span class="n">c</span><span class="o">.</span><span class="n">JupyterHub</span><span class="o">.</span><span class="n">authenticator_class</span> <span class="o">=</span> <span class="n">GlobusOAuthenticator</span>
<span class="n">c</span><span class="o">.</span><span class="n">GlobusOAuthenticator</span><span class="o">.</span><span class="n">oauth_callback_url</span> <span class="o">=</span> <span class="s1">'https://xxx-xxx-xxx-xxx.compute.cloud.sdsc.edu/hub/oauth_callback'</span>
<span class="n">c</span><span class="o">.</span><span class="n">GlobusOAuthenticator</span><span class="o">.</span><span class="n">client_id</span> <span class="o">=</span> <span class="s1">''</span>
<span class="n">c</span><span class="o">.</span><span class="n">GlobusOAuthenticator</span><span class="o">.</span><span class="n">client_secret</span> <span class="o">=</span> <span class="s1">''</span>
</pre></div>
<p>You should now be able to log in with your Globus ID credentials; see the documentation to support credentials from institutions supported by Globus Auth.
After login, don't worry if you get an error when starting your notebook.</p>
<h2>Setup Spawning with Batchspawner</h2>
<p>In my last post about spawning Notebooks on Comet I was using XSEDE authentication so that each user would have to use their own Comet account.
In this scenario instead we imagine a Gateway system where the administrator shares their own allocation with the Gateway users.
Therefore you should create an SSH keypair for the <code>root</code> user on the Jupyterhub Virtual Machine and make sure you can log in to Comet as the Gateway user with no password required.</p>
<p>Then you need to install <code>batchspawner</code>:</p>
<div class="highlight"><pre><span></span><span class="err">git clone https://github.com/jupyterhub/batchspawner.git</span>
<span class="err">cd batchspawner/</span>
<span class="err">sudo /opt/conda/bin/pip install .</span>
</pre></div>
<p>Then configure the Spawner, see <a href="https://gist.github.com/zonca/aaed55502c4b16535fe947791d02ac32">my configuration of Jupyterhub: <code>jupyterhub_config.py</code></a>.</p>
<p>You should modify <code>comet_spawner.py</code> to point to your Gateway user home folder and then fill all the details in <code>jupyterhub_config.py</code> marked by the <code>CONF</code> string.</p>
<p>In <code>CometSpawner</code> I also create a form for the user to choose the parameters of the job and also the Singularity image they want to use.</p>
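<p>Such a form is defined through the standard <code>Spawner.options_form</code> trait; a hedged sketch follows (field names and queue values are illustrative, not taken from the actual <code>comet_spawner.py</code>; the image paths are the containers listed above):</p>

```python
# Hypothetical excerpt of jupyterhub_config.py: present a form with the
# job parameters and the Singularity image to use.
c.CometSpawner.options_form = """
<label for="queue">Queue</label>
<select name="queue"><option>compute</option><option>debug</option></select>
<label for="hours">Job length (hours)</label>
<input name="hours" value="1">
<label for="image">Singularity image</label>
<select name="image">
  <option>/oasis/scratch/comet/zonca/temp_project/centos_anaconda_jupyterhub.img</option>
  <option>/oasis/scratch/comet/zonca/temp_project/ubuntu_anaconda_cmb_jupyterhub.img</option>
</select>
"""
```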
<p>Here the spawner uses <code>SSH</code> to connect to the Comet login node and submit jobs as the Gateway user.</p>
<p>At this point you should be able to login and launch a job on Comet, execute <code>squeue</code> on Comet to check if that works or look in the home folder of the Gateway user for the logfile of the job and in <code>/var/log/jupyterhub</code> on the Virtual machine for errors.</p>
<h2>Setup tunneling</h2>
<p>Finally we need a way for the gateway Virtual Machine to access the port on the Comet computing node in order to proxy the Notebook application back to the user.</p>
<p>The simplest solution is to create a user <code>tunnelbot</code> on the VM with no shell access, then create an SSH keypair and paste the <strong>private</strong> key into the <code>jupyterhub_config.py</code> file (contact me if you have a better solution!).
The job on Comet then sets up an SSH tunnel between the Comet computing node and the Jupyterhub VM.</p>
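<p>The tunnel line in the job script is roughly the following sketch (hostname, port variable and key path are placeholders, not taken from the actual job script):</p>

```shell
# Reverse tunnel: expose the notebook port of the computing node on the
# Jupyterhub VM, where the proxy expects to find it.
ssh -i "$KEYFILE" -o StrictHostKeyChecking=no -N -f \
    -R "$PORT:localhost:$PORT" tunnelbot@jupyterhub-vm.example.org
```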
<h2>Improvements</h2>
<p>To keep the setup simple, all users run in the home folder of the Gateway user; for a real deployment, it is possible to create a subfolder for each user beforehand and then use Singularity to mount that as the home folder.</p>How to create pull requests on Github2017-06-30T11:00:00-07:002017-06-30T11:00:00-07:00Andrea Zoncatag:zonca.github.io,2017-06-30:/2017/06/quick-github-pull-requests.html<p>Pull Requests are the web-based version of sending software patches via email to code maintainers.
They allow a person that has no access to a code repository to submit a code change to the repository administrator for review and 1-click merging.</p>
<h2>Preparation</h2>
<ul>
<li>Create a free Github account at <a href="https://github.com">https://github …</a></li></ul><p>Pull Requests are the web-based version of sending software patches via email to code maintainers.
They allow a person that has no access to a code repository to submit a code change to the repository administrator for review and 1-click merging.</p>
<h2>Preparation</h2>
<ul>
<li>Create a free Github account at <a href="https://github.com">https://github.com</a></li>
<li>Login on Github with your credentials</li>
<li>Go to the homepage of the repository, for example <a href="https://github.com/sdsc/sdsc-summer-institute-2017">https://github.com/sdsc/sdsc-summer-institute-2017</a></li>
</ul>
<h2>Small changes via Github.com</h2>
<p>For small changes, like creating a folder and uploading a few files, or a quick fix to an existing file, you don't even need to use the <code>git</code> command line client.</p>
<ul>
<li>If you need to <strong>create a folder</strong><ul>
<li>click on "Create new file"</li>
<li>in the "Name your file..." box, insert: "yourfolder/README.md"</li>
<li>in the README.md write a description of the content of the folder; you can use Markdown syntax (see <a href="https://github.com/adam-p/markdown-here/wiki/Markdown-Cheatsheet">the Markdown Cheatsheet</a>)</li>
<li>create a bullet list with description of the files you will be uploading next</li>
<li>Click on "Propose new file"</li>
<li>this will ask you to create a Pull Request, follow the prompts and make sure to confirm at the end that you want to create a Pull Request, you have to click twice on "Create Pull Request" buttons</li>
</ul>
</li>
<li>If you want to upload files into the folder you just created, you need an additional step (if you are uploading to a folder that already exists in the original repo, skip this):<ul>
<li>Go to the fork of the original repository that was created automatically under your account, for example: <a href="https://github.com/YOURUSERNAME/sdsc-summer-institute-2017">https://github.com/YOURUSERNAME/sdsc-summer-institute-2017</a></li>
<li>Click on the dropdown "Branch" menu and look for the branch named <code>patch-1</code>, or <code>patch-n</code> if you have more.</li>
</ul>
</li>
<li>Click on the "Upload files" button, select and upload all files, a few notes:<ul>
<li>do not upload zip archives</li>
<li>do not upload large data files, Github is for code</li>
<li>if you are uploading binary files like images, reduce them to a small size first</li>
<li>this will ask you to create a Pull Request, follow the prompts and make sure to confirm at the end that you want to create a Pull Request, you have to click twice on "Create Pull Request" buttons</li>
</ul>
</li>
<li>Check that your pull request appeared in the Pull Requests area of the repository, for example <a href="https://github.com/sdsc/sdsc-summer-institute-2017/pulls">https://github.com/sdsc/sdsc-summer-institute-2017/pulls</a></li>
</ul>
<h2>Update a previously created Pull Request via Github.com</h2>
<p>If the repository maintainer has some feedback on your Pull Request, you can update it to accommodate any requested change.</p>
<ul>
<li>Go to the fork of the original repository that was created automatically under your account, for example: <a href="https://github.com/YOURUSERNAME/sdsc-summer-institute-2017">https://github.com/YOURUSERNAME/sdsc-summer-institute-2017</a></li>
<li>Click on the dropdown "Branch" menu and look for the branch named <code>patch-1</code>, or <code>patch-n</code> if you have more.</li>
<li>Now make changes to files or upload new files, then confirm and write a commit message from the web interface</li>
<li>Check that your changes appear as updates inside the Pull Request you created before, for example <a href="https://github.com/sdsc/sdsc-summer-institute-2017/pull/N">https://github.com/sdsc/sdsc-summer-institute-2017/pull/N</a> where N is the number assigned to your Pull Request</li>
</ul>
<h2>Use the command line client</h2>
<p>For more control, and especially if you expect the repository maintainer to make changes to your Pull Request before merging it, it is better to use <code>git</code>.</p>
<ul>
<li>Click on the "Fork" button on the top right of the repository</li>
<li>Now you should be on the copy of the repository under your own account, for example <a href="https://github.com/YOURUSERNAME/sdsc-summer-institute-2017">https://github.com/YOURUSERNAME/sdsc-summer-institute-2017</a></li>
<li>
<p>Now open your terminal; if you have never used <code>git</code> before, set it up with:</p>
<div class="highlight"><pre><span></span>$ git config --global user.name <span class="s2">"Your Name"</span>
$ git config --global user.email <span class="s2">"your@email.edu"</span>
</pre></div>
</li>
<li>
<p>Then clone the repository with:</p>
<div class="highlight"><pre><span></span><span class="err">git clone https://github.com/YOURUSERNAME/sdsc-summer-institute-2017</span>
</pre></div>
</li>
<li>
<p>Enter the repository folder</p>
</li>
<li>
<p>Create a branch to isolate your changes with:</p>
<div class="highlight"><pre><span></span><span class="err">git checkout -b "add_XXXX_material"</span>
</pre></div>
</li>
<li>
<p>Now create folders and modify files; you can use any text editor</p>
</li>
<li>
<p>Once you are done making modifications, you can stage them to be committed; the following adds everything inside the folder:</p>
<div class="highlight"><pre><span></span><span class="err">git add my_folder</span>
</pre></div>
</li>
<li>
<p>It is generally better to add each file explicitly instead, to make sure you don't accidentally commit the wrong files:</p>
<div class="highlight"><pre><span></span><span class="err">git add my_folder/aaa.txt my_folder/README.md</span>
</pre></div>
</li>
<li>
<p>Then write these changes to history with a commit:</p>
<div class="highlight"><pre><span></span><span class="err">git commit -m "Added material about XXXX"</span>
</pre></div>
</li>
<li>
<p>Push changes to Github</p>
<div class="highlight"><pre><span></span><span class="err">git push -u origin add_XXXX_material</span>
</pre></div>
</li>
<li>
<p>Now go to the homepage of the original repository, for example <a href="https://github.com/sdsc/sdsc-summer-institute-2017">https://github.com/sdsc/sdsc-summer-institute-2017</a></p>
</li>
<li>There should be a yellow notice saying that it detected a recently pushed branch, click on "Compare and Pull Request"</li>
<li>Add a description</li>
<li>Confirm with the green "Create Pull Request" button</li>
</ul>
<p>In case you want to update your Pull Request, repeat the steps of <code>git add</code>, <code>git commit</code> and <code>git push</code>, any changes will be reflected inside the pull request.</p>Deploy Jupyterhub on a Supercomputer with SSH Authentication2017-05-16T22:00:00-07:002017-05-16T22:00:00-07:00Andrea Zoncatag:zonca.github.io,2017-05-16:/2017/05/jupyterhub-hpc-batchspawner-ssh.html<p>The best way to deploy Jupyterhub with an interface to a Supercomputer is through the use of <code>batchspawner</code>. I have a sample deployment explained in an older blog post: <a href="https://zonca.github.io/2017/02/sample-deployment-jupyterhub-hpc.html">https://zonca.github.io/2017/02/sample-deployment-jupyterhub-hpc.html</a></p>
<p>This setup however requires a OAUTH service, in this case provided by XSEDE …</p><p>The best way to deploy Jupyterhub with an interface to a Supercomputer is through the use of <code>batchspawner</code>. I have a sample deployment explained in an older blog post: <a href="https://zonca.github.io/2017/02/sample-deployment-jupyterhub-hpc.html">https://zonca.github.io/2017/02/sample-deployment-jupyterhub-hpc.html</a></p>
<p>This setup however requires an OAuth service, in this case provided by XSEDE, to authenticate the users via the web and then provide an X509 certificate that is then used by <code>batchspawner</code> to
connect to the Supercomputer on behalf of the user and submit the job to spawn a notebook.</p>
<p>In case an authentication service of this type is not available, another option is to use SSH authentication.</p>
<p>The starting point is a server with vanilla Jupyterhub installed; good practice is to use an already available Ansible recipe, like <a href="https://zonca.github.io/2017/02/automated-deployment-jupyterhub-ansible.html">https://zonca.github.io/2017/02/automated-deployment-jupyterhub-ansible.html</a>, which deploys Jupyterhub in a safer way, e.g. with an NGINX frontend and HTTPS.</p>
<p>First we want to set up authentication; the simplest way to start is to use the default authentication with local UNIX user accounts and possibly add Github later.
In any case all the users need an account both on the Supercomputer and on the Jupyterhub server, with the same username; this is tedious but is the simplest way to allow them to authenticate on the Supercomputer.
Then we need to save the <strong>private</strong> SSH key into each user's <code>.ssh</code> folder and make sure they can SSH to the Supercomputer with no password required.</p>
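<p>For each user, the key setup could be sketched as follows (username and host are placeholders; run as root on the Jupyterhub server):</p>

```shell
# Create a passphrase-less keypair owned by the user, install the public
# key on the Supercomputer, then verify that no password prompt appears.
sudo -u alice ssh-keygen -t rsa -N "" -f /home/alice/.ssh/id_rsa
sudo -u alice ssh-copy-id alice@supercomputer.example.org
sudo -u alice ssh -o BatchMode=yes alice@supercomputer.example.org true && echo "passwordless SSH OK"
```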
<p>Then we can install <code>batchspawner</code> and configure Jupyterhub to use it. In the <code>batchspawner</code> configuration in <code>jupyterhub_config.py</code>, you have to prefix the scheduler commands with <code>ssh</code> so that Jupyterhub can connect to the Supercomputer to submit the job:</p>
<div class="highlight"><pre><span></span><span class="err">c.SlurmSpawner.batch_submit_cmd = 'ssh {username}@{host} sbatch'</span>
</pre></div>
<p>See for example <a href="https://github.com/jupyterhub/jupyterhub-deploy-hpc/blob/master/batchspawner-xsedeoauth-sshtunnel-sdsccomet/jupyterhub_config.py#L66">my configuration for Comet</a> and replace <code>gsissh</code> with <code>ssh</code>.</p>
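<p>The other scheduler commands can be wrapped the same way; a hedged sketch follows (the <code>batch_query_cmd</code> and <code>batch_cancel_cmd</code> traits come from <code>batchspawner</code>; the exact <code>squeue</code> format string is an assumption, not taken from the linked configuration):</p>

```python
# Hypothetical excerpt of jupyterhub_config.py: run every scheduler
# command on the Supercomputer over SSH.
c.SlurmSpawner.batch_submit_cmd = 'ssh {username}@{host} sbatch'
c.SlurmSpawner.batch_query_cmd  = 'ssh {username}@{host} squeue -h -j {job_id} -o "%T %B"'
c.SlurmSpawner.batch_cancel_cmd = 'ssh {username}@{host} scancel {job_id}'
```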
<p>Now when users connect, they are authenticated with local UNIX user accounts username and password and then Jupyterhub uses their SSH key to launch a job on the Supercomputer.</p>
<p>The last issue is how to proxy the Jupyterhub running on a computing node back to the server. One option is to create a user on the server with no terminal access but with the ability to create tunnels; then, at the end of the job, set up a tunnel using an SSH private key pasted into the job script itself, see for example <a href="https://github.com/jupyterhub/jupyterhub-deploy-hpc/blob/master/batchspawner-xsedeoauth-sshtunnel-sdsccomet/jupyterhub_config.py#L54">my setup on Comet</a>.</p>Configure Globus on your local machine for GridFTP with XSEDE authentication2017-04-19T12:00:00-07:002017-04-19T12:00:00-07:00Andrea Zoncatag:zonca.github.io,2017-04-19:/2017/04/globus-gridftp-local.html<p>All the commands are executed on your local machine; the purpose of this tutorial is to be able to use <code>globus-url-copy</code> to efficiently copy data back and forth between your local machine and an XSEDE Supercomputer on the command line.</p>
<p>For a simpler point and click web interface, install Globus …</p><p>All the commands are executed on your local machine, the purpose of this tutorial is to be able to use <code>globus-url-copy</code> to copy efficiently data back and forth between your local machine and a XSEDE Supercomputer on the command line.</p>
<p>For a simpler point-and-click web interface, install Globus Connect Personal instead: <a href="https://www.globus.org/globus-connect-personal">https://www.globus.org/globus-connect-personal</a></p>
<h2>Install Globus toolkit</h2>
<p>See <a href="http://toolkit.globus.org/toolkit/docs/latest-stable/admin/install/#install-toolkit">http://toolkit.globus.org/toolkit/docs/latest-stable/admin/install/#install-toolkit</a></p>
<p>On Ubuntu, download the <code>deb</code> package of the Globus repository and install it:</p>
<div class="highlight"><pre><span></span><span class="err">wget http://www.globus.org/ftppub/gt6/installers/repo/globus-toolkit-repo_latest_all.deb</span>
<span class="err">sudo dpkg -i globus-toolkit-repo_latest_all.deb</span>
<span class="err">sudo apt-get install globus-data-management-client</span>
</pre></div>
<h2>Install XSEDE certificates on your machine</h2>
<div class="highlight"><pre><span></span><span class="err">wget https://software.xsede.org/security/xsede-certs.tar.gz</span>
<span class="err">tar xvf xsede-certs.tar.gz</span>
<span class="err">sudo mv certificates /etc/grid-security</span>
</pre></div>
<p>Full instructions here:</p>
<p><a href="https://software.xsede.org/production/CA/CA-install.html">https://software.xsede.org/production/CA/CA-install.html</a></p>
<h2>Authenticate with the myproxy provided by XSEDE</h2>
<p>Authenticate with your XSEDE user and password:</p>
<div class="highlight"><pre><span></span><span class="err">myproxy-logon -s myproxy.xsede.org -l $USER -t 36</span>
</pre></div>
<p>You can specify the lifetime of the certificate in hours with <code>-t</code>.</p>
<p>You should get a certificate:</p>
<div class="highlight"><pre><span></span><span class="err">A credential has been received for user zonca in /tmp/x509up_u1000.</span>
</pre></div>
<p>You can check how much time is left on a certificate by running <code>grid-proxy-info</code>.</p>
<h2>Run globus-url-copy</h2>
<p>For example copy to my home on Comet:</p>
<div class="highlight"><pre><span></span><span class="err">globus-url-copy -vb -p 4 local_file.tar.gz gsiftp://oasis-dm.sdsc.edu///home/zonca/</span>
</pre></div>
<p>See the quickstart guide on the most used <code>globus-url-copy</code> options:</p>
<p><a href="http://toolkit.globus.org/toolkit/docs/latest-stable/gridftp/user/#gridftp-user-basic">http://toolkit.globus.org/toolkit/docs/latest-stable/gridftp/user/#gridftp-user-basic</a></p>
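<p>Copying in the opposite direction works the same way; a sketch (same endpoint as above, filename hypothetical):</p>

```shell
# Copy a file from my home on Comet back to the current local directory.
globus-url-copy -vb -p 4 gsiftp://oasis-dm.sdsc.edu///home/zonca/results.tar.gz "file://$PWD/"
```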
<h2>Synchronize 2 folders</h2>
<p>Only copy new files using the <code>-sync</code> and <code>-sync-level</code> options:</p>
<div class="highlight"><pre><span></span><span class="err">-sync</span>
<span class="err"> Only transfer files where the destination does not exist or differs from the source. -sync-level controls how to determine if files differ.</span>
<span class="err">-sync-level number</span>
<span class="err"> Criteria for determining if files differ when performing a sync transfer. The default sync level is 2.</span>
</pre></div>
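<p>My reading of those two options as a Python predicate (a sketch for illustration, not the actual <code>globus-url-copy</code> implementation):</p>

```python
def needs_transfer(level, src, dst):
    """Decide whether a sync transfer should copy src over dst.

    src and dst are dicts with keys: exists, size, mtime, checksum
    (a simplification of what globus-url-copy actually inspects).
    """
    if not dst["exists"]:
        return True          # -sync always copies missing destinations
    if level == 0:
        return False         # destination exists: skip
    if level == 1:
        return src["size"] != dst["size"]
    if level == 2:
        return dst["mtime"] < src["mtime"]
    if level == 3:
        return src["checksum"] != dst["checksum"]
    raise ValueError("sync level must be 0-3")
```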
<p>The available levels are:</p>
<ul>
<li>Level 0 will only transfer if the destination does not exist.</li>
<li>Level 1 will transfer if the size of the destination does not match the size of the source.</li>
<li>Level 2 will transfer if the time stamp of the destination is older than the time stamp of the source.</li>
<li>Level 3 will perform a checksum of the source and destination and transfer if the checksums do not match.</li>
</ul>Sample deployment of Jupyterhub in HPC on SDSC Comet2017-02-26T12:00:00-08:002017-02-26T12:00:00-08:00Andrea Zoncatag:zonca.github.io,2017-02-26:/2017/02/sample-deployment-jupyterhub-hpc.html<p>I have deployed an experimental Jupyterhub service (ask me privately if you would like access) installed on a <a href="http://www.sdsc.edu/services/it/cloud.html">SDSC Cloud</a> virtual machine that spawns single user Jupyter notebooks on Comet computing nodes using <a href="https://github.com/jupyterhub/batchspawner"><code>batchspawner</code></a> and then proxies the Notebook back to the user using SSH-tunneling.</p>
<h2>Functionality</h2>
<p>This kind of setup …</p><p>I have deployed an experimental Jupyterhub service (ask me privately if you would like access) installed on a <a href="http://www.sdsc.edu/services/it/cloud.html">SDSC Cloud</a> virtual machine that spawns single user Jupyter notebooks on Comet computing nodes using <a href="https://github.com/jupyterhub/batchspawner"><code>batchspawner</code></a> and then proxies the Notebook back to the user using SSH-tunneling.</p>
<h2>Functionality</h2>
<p>This kind of setup is functionally equivalent to launching a job yourself on Comet, starting <code>jupyter notebook</code> and SSH-tunneling the port to your local machine, but far more convenient. You just open your browser to the Jupyterhub instance, authenticate with your XSEDE credentials, choose the queue and job length, and wait for the Notebook job to be ready (generally a matter of minutes).</p>
<h2>Rationale</h2>
<p>Jupyter Notebooks have many use cases on HPC; they can be used for:</p>
<ul>
<li>In-situ visualization</li>
<li>Interactive data analysis when local resources are not enough, either in terms of RAM or disk space</li>
<li>Monitoring other running jobs</li>
<li>Launch <a href="https://github.com/ipython/ipyparallel">IPython Parallel</a> jobs and distribute computation to them in parallel</li>
<li>Interact with a running Spark cluster (we support Spark on Comet)</li>
</ul>
<p>More on this on my <a href="https://zonca.github.io/2015/04/jupyterhub-hpc.html">Run Jupyterhub on a Supercomputer</a> old blog post.</p>
<h2>Setup details</h2>
<p>The Jupyter team created a repository for sample HPC deployments; I added all the configuration files of my deployment there, with full details about the setup:</p>
<ul>
<li><a href="https://github.com/jupyterhub/jupyterhub-deploy-hpc/tree/master/batchspawner-xsedeoauth-sshtunnel-sdsccomet">Sample deployment in the <code>jupyterhub-deploy-hpc</code> repository</a></li>
</ul>
<p>Please send feedback opening an issue in that repository and tagging <code>@zonca</code>.</p>Customize your Python environment in Jupyterhub2017-02-24T12:00:00-08:002017-02-24T12:00:00-08:00Andrea Zoncatag:zonca.github.io,2017-02-24:/2017/02/customize-python-environment-jupyterhub.html<p>Usecase: You have access to a Jupyterhub server and you would like to install some packages but cannot use <code>pip install</code> and modify the systemwide Python installation.</p>
<h2>Check if conda is available</h2>
<p>First check if the Python installation you have access to is based on Anaconda, open a Notebook and …</p><p>Usecase: You have access to a Jupyterhub server and you would like to install some packages but cannot use <code>pip install</code> and modify the systemwide Python installation.</p>
<h2>Check if conda is available</h2>
<p>First check if the Python installation you have access to is based on Anaconda, open a Notebook and type:</p>
<div class="highlight"><pre><span></span><span class="sx">!which conda</span>
</pre></div>
<p><code>!</code> executes bash commands instead of Python; here we check whether the <code>conda</code> package manager is installed.</p>
<p>If not, the setup is a bit tedious, so see my tutorial on <a href="https://zonca.github.io/2015/10/use-own-python-in-jupyterhub.html">installing Anaconda in your home folder</a>.</p>
<h2>Create a conda environment</h2>
<p>Conda lets you create independent environments in your home folder; these environments are writable, so you can install any other package with <code>pip</code> or <code>conda install</code>.</p>
<div class="highlight"><pre><span></span><span class="sx">!conda create -n myownenv --clone root</span>
</pre></div>
<p>You can declare all the packages you want to install, but a good starting point is simply to clone the <code>root</code> environment: this links all the global packages into your home folder, and you can then customize the environment further.</p>
<h2>Create a Jupyter Notebook kernel to launch this new environment</h2>
<p>We need to notify Jupyter of this new Python environment by creating a Kernel, from a Notebook launch:</p>
<div class="highlight"><pre><span></span><span class="sx">!source activate myownenv; ipython kernel install --user --name myownenv</span>
</pre></div>
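<p>Under the hood, <code>ipython kernel install</code> writes a <code>kernel.json</code> spec that tells Jupyter how to start the environment's Python. A sketch of building one by hand (the environment path is a hypothetical example, and for illustration the file is written to a temporary directory rather than <code>~/.local/share/jupyter/kernels</code>):</p>

```python
import json
import os
import tempfile

# Hypothetical location of the cloned conda environment
env_python = os.path.expanduser("~/.conda/envs/myownenv/bin/python")

# Fields follow the Jupyter kernelspec format
spec = {
    "argv": [env_python, "-m", "ipykernel_launcher", "-f", "{connection_file}"],
    "display_name": "myownenv",
    "language": "python",
}

# ipython kernel install would place this under the user kernels directory;
# here we write it to a throwaway folder just to show the file contents
kernel_dir = tempfile.mkdtemp()
with open(os.path.join(kernel_dir, "kernel.json"), "w") as f:
    json.dump(spec, f, indent=1)
```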
<h2>Launch a Notebook</h2>
<p>Go back to the Jupyterhub dashboard, reload the page, now you should have another option in the <code>New</code> menu that says <code>myownenv</code>.</p>
<p>In order to use your new kernel with an existing notebook, click on the notebook file in the dashboard; it will launch with the default kernel, which you can then change from the top menu <code>Kernel</code> > <code>Change kernel</code>.</p>
<h2>Install new packages</h2>
<p>Inside a Notebook using the <code>myownenv</code> environment you can install other packages running:</p>
<div class="highlight"><pre><span></span><span class="sx">!conda install newpackagename</span>
</pre></div>
<p>or:</p>
<div class="highlight"><pre><span></span><span class="sx">!pip install newpackagename</span>
</pre></div>Automated deployment of Jupyterhub with Ansible2017-02-03T18:00:00-08:002017-02-03T18:00:00-08:00Andrea Zoncatag:zonca.github.io,2017-02-03:/2017/02/automated-deployment-jupyterhub-ansible.html<p>Last year I wrote some tutorials on simple deployments of Jupyterhub on Ubuntu 16.04 on the OpenStack deployment <a href="http://www.sdsc.edu/services/it/cloud.html">SDSC Cloud</a>, even if most of the steps would also be suitable on other resources like Amazon EC2.</p>
<p>In more detail:</p>
<ul>
<li><a href="https://zonca.github.io/2016/04/jupyterhub-sdsc-cloud.html">Manually installing Jupyterhub on a single Virtual Machine with users …</a></li></ul><p>Last year I wrote some tutorials on simple deployments of Jupyterhub on Ubuntu 16.04 on the OpenStack deployment <a href="http://www.sdsc.edu/services/it/cloud.html">SDSC Cloud</a>, even if most of the steps would also be suitable on other resources like Amazon EC2.</p>
<p>In more detail:</p>
<ul>
<li><a href="https://zonca.github.io/2016/04/jupyterhub-sdsc-cloud.html">Manually installing Jupyterhub on a single Virtual Machine with users running inside Docker containers</a></li>
<li><a href="https://zonca.github.io/2016/04/jupyterhub-image-sdsc-cloud.html">Quick deployment of the above using a pre-built image</a></li>
<li><a href="https://zonca.github.io/2016/05/jupyterhub-docker-swarm.html">Jupyterhub distributing user containers on other nodes using Docker Swarm</a></li>
</ul>
<p>The Jupyter team has released an automated script to deploy Jupyterhub on a single server, see <a href="http://jupyterhub-deploy-teaching.readthedocs.io">Jupyterhub-deploy-teaching</a>.</p>
<p>In this tutorial we will use this script to deploy Jupyterhub to SDSC Cloud using:</p>
<ul>
<li>NGINX handling HTTPS with Letsencrypt certificate</li>
<li>Github authentication</li>
<li>Local or Docker user notebooks</li>
<li>Grading with <code>nbgrader</code></li>
<li>Memory limit for Docker containers</li>
</ul>
<h2>Setup a Virtual Machine to run Jupyterhub</h2>
<p>First create an Ubuntu 16.04 Virtual Machine; a default server image works fine.</p>
<p>In case you are deploying on SDSC Cloud, follow the steps in "Create a Virtual Machine in OpenStack" on my first tutorial at <a href="https://zonca.github.io/2016/04/jupyterhub-sdsc-cloud.html">https://zonca.github.io/2016/04/jupyterhub-sdsc-cloud.html</a>.</p>
<p>You will also need a DNS entry pointing to the server to create an SSL certificate with Let's Encrypt. Either ask your institution to provide a DNS A record, e.g. <code>test-jupyterhub.ucsd.edu</code>, that points to the public IP of the server,
or use the entry SDSC Cloud already provides in the form <code>xxx-xxx-xxx-xxx.compute.cloud.sdsc.edu</code>.</p>
<p>If you plan on using <code>nbgrader</code>, you need to create the home folder for the instructor beforehand, so SSH into the server and create a user with your Github username, e.g. I had to execute <code>sudo adduser zonca</code>.</p>
<h2>Setup your local machine to run the automation scripts</h2>
<p>Automation of the server setup is provided by the <a href="http://ansible.com">Ansible</a> software tool: it lets you describe a server configuration in great detail (a "playbook"), then connects via SSH to a Virtual Machine and runs Python to install and set up all the required software.</p>
<p>On your local machine, install <code>Ansible</code> (at least version 2.1), see the <a href="http://docs.ansible.com/ansible/intro_installation.html#getting-ansible">Ansible docs</a>; for Ubuntu just add the <a href="https://launchpad.net/~ansible/+archive/ubuntu/ansible">Ansible PPA repository</a>.
I tested this with Ansible version 2.2.1.0.</p>
<p>Then you need to configure passwordless SSH connection to your Virtual Machine. Download your SSH key from the OpenStack dashboard, copy it to your <code>~/.ssh</code> folder and then add an entry to <code>.ssh/config</code> for the server:</p>
<div class="highlight"><pre><span></span><span class="err">Host xxx-xxx-xxx-xxx.compute.cloud.sdsc.edu</span>
<span class="err"> HostName xxx-xxx-xxx-xxx.compute.cloud.sdsc.edu</span>
<span class="err"> User ubuntu</span>
<span class="err"> IdentityFile "~/.ssh/sdsccloud.key"</span>
</pre></div>
<p>At this point you should be able to SSH into the machine without typing any password with <code>ssh xxx-xxx-xxx-xxx.compute.cloud.sdsc.edu</code>.</p>
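<p>If you manage several such Virtual Machines, it can be handy to generate the stanza programmatically; a small sketch (the hostname and key filename are placeholders):</p>

```python
STANZA = """Host {host}
    HostName {host}
    User {user}
    IdentityFile "{key}"
"""

def ssh_config_entry(host, user="ubuntu", key="~/.ssh/sdsccloud.key"):
    """Render one ~/.ssh/config stanza for an OpenStack VM."""
    return STANZA.format(host=host, user=user, key=key)

print(ssh_config_entry("xxx-xxx-xxx-xxx.compute.cloud.sdsc.edu"))
```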
<h2>Configure and run the Ansible script</h2>
<p>Follow the <a href="http://jupyterhub-deploy-teaching.readthedocs.io/en/latest/installation.html">Jupyterhub-deploy-teaching documentation</a> to checkout the script, configure and run it.</p>
<p>The only modification you need to make if you are on SDSC Cloud is that the remote user is <code>ubuntu</code> and not <code>root</code>: edit <code>ansible.cfg</code> in the root of the repository and
replace <code>remote_user=root</code> with <code>remote_user=ubuntu</code>.</p>
<p>As an example, see the <a href="https://gist.github.com/zonca/fd2400a2069b5769f32b1c4b57eb97dc">configuration I used</a>, just:</p>
<ul>
<li>copy it into <code>host_vars</code></li>
<li>rename it to your public DNS record</li>
<li>fill in <code>proxy_auth_token</code>, Github OAuth credentials for authentication</li>
<li>replace <code>zonca</code> with your Github username everywhere</li>
</ul>
<p>The exact version of the <code>jupyterhub-deploy-teaching</code> code I used for testing is <a href="https://github.com/zonca/jupyterhub-deploy-teaching/releases/tag/sdsc_cloud_jan_17">on the <code>sdsc_cloud_jan_17</code> tag on Github</a>.</p>
<h2>Test the deployment</h2>
<p>Connect to <a href="https://xxx-xxx-xxx-xxx.compute.cloud.sdsc.edu">https://xxx-xxx-xxx-xxx.compute.cloud.sdsc.edu</a> on your browser, you should be redirected to Github for authentication and then access a Jupyter Notebook instance with the Python 3, R and bash kernels running locally on the machine.</p>
<h2>Optional: Docker</h2>
<p>In order to provide isolation and resource limits to the users, it is useful to run single user Jupyter Notebooks inside Docker containers.</p>
<p>You will need to SSH into the Virtual Machine and follow the next steps.</p>
<h3>Install Docker</h3>
<p>First of all we need to install and configure Docker on the machine, see:</p>
<ul>
<li><a href="https://docs.docker.com/engine/installation/linux/ubuntu/">https://docs.docker.com/engine/installation/linux/ubuntu/</a></li>
<li><a href="https://docs.docker.com/engine/installation/linux/linux-postinstall/">https://docs.docker.com/engine/installation/linux/linux-postinstall/</a></li>
</ul>
<h3>Install dockerspawner</h3>
<p>Then install the Jupyterhub plugin <code>dockerspawner</code>, which handles launching single user Notebooks inside Docker containers; we install from master instead of PyPI to avoid an error when setting the memory limit.</p>
<div class="highlight"><pre><span></span><span class="err">pip install git+https://github.com/jupyterhub/dockerspawner</span>
</pre></div>
<h3>Setup the Docker container to run user Notebooks</h3>
<p>We can first pull the standard <code>systemuser</code> image: this Docker container mounts each user's home folder inside the container, so data persists even if the container gets deleted.</p>
<div class="highlight"><pre><span></span><span class="err">docker pull jupyterhub/systemuser</span>
</pre></div>
<p>If you do not need <a href="http://nbgrader.readthedocs.io"><code>nbgrader</code></a> this image is enough; otherwise we have to build our own image. First check out my Github repository in the home folder of the <code>ubuntu</code> user on the server with:</p>
<div class="highlight"><pre><span></span><span class="err">git clone https://github.com/zonca/systemuser-nbgrader</span>
</pre></div>
<p>then edit the <code>nbgrader_config.py</code> file to set the correct <code>course_id</code>, and build the container image running inside the <code>systemuser-nbgrader</code> folder:</p>
<div class="highlight"><pre><span></span><span class="err">docker build -t systemuser-nbgrader .</span>
</pre></div>
<h3>Configure Jupyterhub to use dockerspawner</h3>
<p>Then add some configuration for dockerspawner to <code>/etc/jupyterhub/jupyterhub_config.py</code>:</p>
<div class="highlight"><pre><span></span><span class="n">c</span><span class="o">.</span><span class="n">JupyterHub</span><span class="o">.</span><span class="n">spawner_class</span> <span class="o">=</span> <span class="s1">'dockerspawner.SystemUserSpawner'</span>
<span class="n">c</span><span class="o">.</span><span class="n">DockerSpawner</span><span class="o">.</span><span class="n">container_image</span> <span class="o">=</span> <span class="s2">"systemuser-nbgrader"</span> <span class="c1"># delete this line if you just need `jupyterhub/systemuser`</span>
<span class="n">c</span><span class="o">.</span><span class="n">Spawner</span><span class="o">.</span><span class="n">mem_limit</span> <span class="o">=</span> <span class="s1">'500M'</span> <span class="c1"># or 1G for GB, probably 300M is minimum required just to run simple calculations</span>
<span class="n">c</span><span class="o">.</span><span class="n">DockerSpawner</span><span class="o">.</span><span class="n">volumes</span> <span class="o">=</span> <span class="p">{</span><span class="s2">"/srv/nbgrader/exchange"</span><span class="p">:</span><span class="s2">"/tmp/exchange"</span><span class="p">}</span> <span class="c1"># this is necessary for nbgrader to transfer homework back and forth between students and instructor</span>
<span class="n">c</span><span class="o">.</span><span class="n">DockerSpawner</span><span class="o">.</span><span class="n">remove_containers</span> <span class="o">=</span> <span class="bp">True</span>
<span class="c1"># The docker instances need access to the Hub, so the default loopback port doesn't work:</span>
<span class="kn">from</span> <span class="nn">IPython.utils.localinterfaces</span> <span class="kn">import</span> <span class="n">public_ips</span>
<span class="n">c</span><span class="o">.</span><span class="n">JupyterHub</span><span class="o">.</span><span class="n">hub_ip</span> <span class="o">=</span> <span class="n">public_ips</span><span class="p">()[</span><span class="mi">0</span><span class="p">]</span>
</pre></div>
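<p>The <code>mem_limit</code> string accepts suffixes like <code>M</code> and <code>G</code>; a quick sketch of how such values map to bytes (my own helper for illustration — JupyterHub does its own parsing internally):</p>

```python
UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}

def mem_limit_bytes(limit):
    """Convert a limit such as '500M' or '1G' into a byte count."""
    suffix = limit[-1].upper()
    if suffix in UNITS:
        return int(limit[:-1]) * UNITS[suffix]
    return int(limit)  # plain number of bytes

print(mem_limit_bytes("500M"))  # 524288000
```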
<h3>Test the deployment with Docker</h3>
<p>Connect to <a href="https://xxx-xxx-xxx-xxx.compute.cloud.sdsc.edu">https://xxx-xxx-xxx-xxx.compute.cloud.sdsc.edu</a> in your browser: you should be redirected to Github for authentication and then access a Jupyter Notebook instance with the Python 2 or Python 3 kernels. Open a Notebook and run <code>!hostname</code> in the first cell; you should get a Docker hash instead of the machine name, confirming you are inside a container.</p>
<p>SSH into the machine, run <code>docker ps</code> to find the hash of a running container and then <code>docker stats HASH</code> to check memory usage and the current limit.</p>
<p>Check that you can connect to the <code>nbgrader</code> <code>formgrade</code> service, which allows you to manually grade assignments, at <a href="https://xxx-xxx-xxx-xxx.compute.cloud.sdsc.edu/services/formgrade-COURSEID">https://xxx-xxx-xxx-xxx.compute.cloud.sdsc.edu/services/formgrade-COURSEID</a>; replace <code>COURSEID</code> with the course identifier you set up in the Ansible script.</p>
<h3>Pre-built image</h3>
<p>I also have a saved Virtual Machine snapshot on SDSC Cloud named <code>jupyterhub_ansible_nbgrader_coleman</code></p>How to publish your research software to Github2017-02-01T18:00:00-08:002017-02-01T18:00:00-08:00Andrea Zoncatag:zonca.github.io,2017-02-01:/2017/02/publish-research-software-github.html<ul>
<li>Do you want to make your research software available publicly on Github?</li>
<li>Has your reviewer asked to publish the code described in your paper?</li>
<li>Would you like to collaborate on your research software with other people, either local or remote?</li>
</ul>
<p>Nowadays many journals require that the software used to produce …</p><ul>
<li>Do you want to make your research software available publicly on Github?</li>
<li>Has your reviewer asked to publish the code described in your paper?</li>
<li>Would you like to collaborate on your research software with other people, either local or remote?</li>
</ul>
<p>Nowadays many journals require that the software used to produce results described in a scientific paper be made available publicly
for other peers to be able to reproduce the results or even just explore the analysis more in detail.</p>
<p>The most popular platform is <a href="http://github.com">Github</a>: it lets you create a homepage for your software, keeps track of every future code change, and makes it easy for people to report issues or contribute patches.</p>
<p>I'll assume familiarity with working from the command line.</p>
<h2>Prepare your software for publication</h2>
<p>First make sure your code is all inside a single root folder (with any number of subfolders), then clean up any build artifacts, data or executables present in your tree of folders.
Ideally you should only have the source code and documentation.
If you have small datasets (<10MB total) it is convenient to store them inside the repository, otherwise better host them on dedicated free services like <a href="http://figshare.com">Figshare</a>.</p>
<p>You should cleanup the build and installation process for your code, if any, and ideally you should structure your code in a standard format to ease adoption, for example using a project template generated by <a href="https://github.com/audreyr/cookiecutter">Cookiecutter</a>.</p>
<p>You should create a <code>README.md</code> file in the root folder of your project, this is very important because it will be transformed into HTML and displayed in the homepage of your software project. Here you should use the Markdown formatting language, see <a href="https://help.github.com/articles/basic-writing-and-formatting-syntax/">a Markdown cheatsheet on Github</a>, to explain:</p>
<ul>
<li>short description of your software</li>
<li>build/usage requirements for your process</li>
<li>installation instructions (and point to another file <code>INSTALL.md</code> for more details)</li>
<li>quickstart section</li>
<li>link to usage examples</li>
<li>link to your paper about the project</li>
<li>list of developers</li>
<li>optionally: how users can get support (i.e. a mailing list)</li>
</ul>
<p>Finally you should choose a license: otherwise, even if the project is public, nobody is allowed to modify and re-use it legally.
Create a <code>LICENSE</code> file in the root of your folder tree and paste in the content of the license. I recommend the MIT license, which is very permissive and simple: <a href="https://choosealicense.com/licenses/mit/">https://choosealicense.com/licenses/mit/</a></p>
<h2>Create an account on Github</h2>
<p>Second step is to create an account on Github: this just requires a username, email and password, choose your username carefully because it will become the
root internet address of all your software projects, i.e. <code>https://github.com/username/software-name</code>.</p>
<p>A Github account is free and allows any number of public software projects. Private repositories are generally available only on paid accounts; however,
anyone with a <code>.edu</code> email address can get unlimited private repositories by applying for the <a href="https://education.github.com/discount_requests/new">academic discount</a>.</p>
<h2>Create a repository on Github</h2>
<p>Github hosts software inside a version control system, <code>git</code>, which stores the complete history of all the incremental changes over time and lets you easily
recover previous versions of the software. Each software project is stored in a repository, which includes both the current version and all previous versions of the software. <code>git</code> is a more modern alternative to <code>subversion</code>.</p>
<p>First you need to create a repository on Github: authenticate on Github.com and click on the "New Repository" button, choose a name for your software project and leave all other options as default.</p>
<h2>Publish your software on Github</h2>
<p>Make sure that the <code>git</code> command line tool is available on the machine where your code is stored, install it from your package manager or see <a href="https://git-scm.com/downloads">installation instructions on the git website</a>.</p>
<p>Finally you can follow the instructions on the repository homepage <code>https://github.com/username/software-name</code> in the section <strong>..or create a new repository on the command line</strong>,
make sure you are in the root folder of your repository and follow these steps:</p>
<p>Turn the current folder into a <code>git</code> repository:</p>
<div class="highlight"><pre><span></span><span class="err">git init</span>
</pre></div>
<p>Add recursively all files and folders, otherwise specify filenames or wildcard to pick only some, <strong>be careful not to accidentally upload sensitive content like passwords</strong>:</p>
<div class="highlight"><pre><span></span><span class="err">git add *</span>
</pre></div>
<p>Store into the repository a first version of the software:</p>
<div class="highlight"><pre><span></span><span class="err">git commit -m "first version of the software"</span>
</pre></div>
<p>Tell <code>git</code> the address of the remote repository on Github (make sure to use your username and the name you chose for your software project):</p>
<div class="highlight"><pre><span></span><span class="err">git remote add origin https://github.com/username/software-name</span>
</pre></div>
<p>Upload the software to Github:</p>
<div class="highlight"><pre><span></span><span class="err">git push -u origin master</span>
</pre></div>
<p>You can then check in your browser that all the code you meant to publish is available on Github.</p>
<h2>Update your software</h2>
<p>Whenever in the future you need to make modifications to the software:</p>
<ul>
<li>edit the files</li>
<li><code>git add filename1 filename2</code> to prepare them for commit</li>
<li><code>git commit -m "bugfix"</code> creates a version in the history with an explanatory commit message</li>
<li><code>git push</code> to publish to Github</li>
</ul>
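<p>The same edit/add/commit cycle can be scripted; a sketch using Python's <code>subprocess</code> in a throwaway repository so it is safe to run anywhere (assumes <code>git</code> is on your PATH; a real workflow would end with <code>git push</code>):</p>

```python
import os
import subprocess
import tempfile

def run(*cmd, cwd):
    """Run a command in cwd, raising if it fails."""
    return subprocess.run(cmd, cwd=cwd, check=True,
                          capture_output=True, text=True)

repo = tempfile.mkdtemp()
run("git", "init", cwd=repo)
# edit the files
with open(os.path.join(repo, "analysis.py"), "w") as f:
    f.write("print('hello')\n")
# prepare them for commit
run("git", "add", "analysis.py", cwd=repo)
# the -c flags avoid depending on a globally configured git identity
run("git", "-c", "user.email=you@example.com", "-c", "user.name=You",
    "commit", "-m", "bugfix", cwd=repo)
log = run("git", "log", "--oneline", cwd=repo).stdout
```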
<p>For more details on <code>git</code>, check the <a href="https://swcarpentry.github.io/git-novice/">Software Carpentry lessons</a>.</p>Run Ubuntu in HPC with Singularity2017-01-13T12:00:00-08:002017-01-13T12:00:00-08:00Andrea Zoncatag:zonca.github.io,2017-01-13:/2017/01/singularity-hpc-comet.html<ul>
<li>Ever wanted to <code>sudo apt install</code> packages on a Supercomputer?</li>
<li>Ever wanted to freeze your software environment and reproduce a calculation after some time?</li>
<li>Ever wanted to dump your software environment to a file and move it to another Supercomputer? or wanted the same software on your laptop and on …</li></ul><ul>
<li>Ever wanted to <code>sudo apt install</code> packages on a Supercomputer?</li>
<li>Ever wanted to freeze your software environment and reproduce a calculation after some time?</li>
<li>Ever wanted to dump your software environment to a file and move it to another Supercomputer? or wanted the same software on your laptop and on a computing node?</li>
</ul>
<p>If your answer to any of those question is yes, read on! Otherwise, well, still read on, it's awesome!</p>
<h2>Singularity</h2>
<p><a href="http://singularity.lbl.gov">Singularity</a> is a software project by Lawrence Berkeley Labs to provide a safe container technology for High Performance Computing,
and it has been available for some time on my favorite Supercomputer, i.e. Comet at the San Diego Supercomputer Center.</p>
<p>You can read more details on their website; in summary, you choose your own Operating System (any GNU/Linux distribution), describe its configuration in a standard format or even
import an existing <code>Dockerfile</code> (from the popular Docker container technology) and Singularity is able to build an image contained in a single file.
This file can then be executed on any Linux machine with Singularity installed (even on a Comet computing node), so you can run Ubuntu 16.10 or Red Hat 5 or any other flavor, your choice!
It doesn't need a daemon running like Docker does; you can just execute a command inside the container by running:</p>
<div class="highlight"><pre><span></span><span class="err">singularity exec /path/to/your/image.img your_executable</span>
</pre></div>
<p>And your executable is run within the OS of the container.</p>
<p>The container technology is just sandboxing the environment, not executing a complete OS inside the host OS, so the loss of performance is minimal.</p>
<p>In summary, referring to the questions above:</p>
<ul>
<li>This allows you to <code>sudo apt install</code> any package inside this environment when it is on your laptop, and then copy it to any Supercomputer and run your software inside that OS.</li>
<li>You can store this image to help reproduce your scientific results anytime in the future</li>
<li>You can develop your software inside a Singularity container and never have to worry about environment issues when you are ready for production runs on HPC or moving across different Supercomputers</li>
</ul>
<h2>Build a Singularity image for SDSC Comet with MPI support</h2>
<p>One of the trickiest things for such technology in HPC is support for MPI, the key stack for high speed network communication. I have prepared a tutorial on Github on how to build either a CentOS 7 or a Ubuntu 16.04 Singularity container for Comet that lets you use the <code>mpirun</code> command provided by the host OS on Comet while executing code that supports MPI within the container.</p>
<ul>
<li><a href="https://github.com/zonca/singularity-comet">https://github.com/zonca/singularity-comet</a></li>
</ul>
<h2>More complicated setup for Julia with MPI support</h2>
<p>For a project that needed a setup with Julia with MPI support I built a more complicated container, see:</p>
<ul>
<li><a href="https://github.com/zonca/singularity-comet/tree/master/debian_julia">https://github.com/zonca/singularity-comet/tree/master/debian_julia</a></li>
</ul>
<h2>Prebuilt containers</h2>
<p>I also made my containers available on Comet; they are located in my scratch space:</p>
<p><code>/oasis/scratch/comet/zonca/temp_project</code></p>
<p>and are named <code>Centos7.img</code>, <code>Ubuntu.img</code> and <code>julia.img</code>.</p>
<p>You can also copy those images to your local machine and customize them more.</p>
<h2>Trial accounts on Comet</h2>
<p>If you don't have an account on Comet yet, you can request a trial allocation:</p>
<p><a href="https://www.xsede.org/web/xup/allocations-overview#types-trial">https://www.xsede.org/web/xup/allocations-overview#types-trial</a></p>
<p>Enjoy!</p>Jupyterhub Docker Spawner with GPU support2016-10-12T12:00:00-07:002016-10-12T12:00:00-07:00Andrea Zoncatag:zonca.github.io,2016-10-12:/2016/10/dockerspawner-cuda.html<p><a href="https://github.com/jupyterhub/dockerspawner">Docker Spawner</a> allows users of Jupyterhub to run Jupyter Notebook inside isolated Docker Containers.
Access to the host NVIDIA GPU was not allowed until NVIDIA released the <a href="https://github.com/NVIDIA/nvidia-docker">NVIDIA-docker</a> plugin.</p>
<h2>Build the Docker image</h2>
<p>In order to make Jupyerhub work with NVIDIA-docker we need to build a Jupyterhub docker image for …</p><p><a href="https://github.com/jupyterhub/dockerspawner">Docker Spawner</a> allows users of Jupyterhub to run Jupyter Notebook inside isolated Docker Containers.
Access to the host NVIDIA GPU was not allowed until NVIDIA released the <a href="https://github.com/NVIDIA/nvidia-docker">NVIDIA-docker</a> plugin.</p>
<h2>Build the Docker image</h2>
<p>In order to make Jupyterhub work with NVIDIA-docker we need to build a Jupyterhub docker image for <code>dockerspawner</code> that combines either the <code>dockerspawner</code> <code>singleuser</code> or <code>systemuser</code> image with the <code>nvidia-docker</code> image.</p>
<p>The Jupyter <code>systemuser</code> images are built in several steps, so let's use them as a starting point; conveniently, both images start from Ubuntu 14.04.</p>
<ul>
<li>Download the <code>nvidia-docker</code> repository</li>
<li>In <code>ubuntu-14.04/cuda/8.0/runtime/Dockerfile</code>, replace <code>FROM ubuntu:14.04</code> with <code>FROM jupyterhub/systemuser</code></li>
<li>Build this image <code>sudo docker build -t systemuser-cuda-runtime runtime</code></li>
<li>In <code>ubuntu-14.04/cuda/8.0/devel/Dockerfile</code>, replace <code>FROM cuda:8.0-runtime</code> with <code>FROM systemuser-cuda-runtime</code></li>
<li>Build this image <code>sudo docker build -t systemuser-cuda-devel devel</code></li>
</ul>
<p>Now we have 2 images, either just CUDA 8.0 runtime or also the compiler <code>nvcc</code> and other development tools.</p>
<p>Make sure the image itself runs from the command line on the host:</p>
<div class="highlight"><pre><span></span><span class="err">sudo nvidia-docker run --rm systemuser-cuda-devel nvidia-smi</span>
</pre></div>
<h2>Configure Jupyterhub</h2>
<p>In <code>jupyterhub_config.py</code>, first of all set the right image:</p>
<div class="highlight"><pre><span></span><span class="err">c.DockerSpawner.container_image = "systemuser-cuda-devel"</span>
</pre></div>
<p>However this is not enough: <code>nvidia-docker</code> images need special flags to work properly and mount the host GPU into the containers, which is usually done by calling <code>nvidia-docker</code> instead of <code>docker</code> from the command line.
In <code>dockerspawner</code>, however, we are using the docker library directly, so we need to configure the equivalent settings there.</p>
<p>First of all, we can get the correct flags by calling from the host machine:</p>
<div class="highlight"><pre><span></span><span class="err">curl -s localhost:3476/docker/cli</span>
</pre></div>
<p>The result for my machine is:</p>
<div class="highlight"><pre><span></span><span class="err">--volume-driver=nvidia-docker --volume=nvidia_driver_361.93.02:/usr/local/nvidia:ro --device=/dev/nvidiactl --device=/dev/nvidia-uvm --device=/dev/nvidia-uvm-tools --device=/dev/nvidia0 --device=/dev/nvidia1</span>
</pre></div>
<p>Now we can configure <code>dockerspawner</code> using those values, in my case:</p>
<div class="highlight"><pre><span></span><span class="err">c.DockerSpawner.read_only_volumes = {"nvidia_driver_361.93.02":"/usr/local/nvidia"}</span>
<span class="err">c.DockerSpawner.extra_create_kwargs = {"volume_driver":"nvidia-docker"}</span>
<span class="err">c.DockerSpawner.extra_host_config = { "devices":["/dev/nvidiactl","/dev/nvidia-uvm","/dev/nvidia-uvm-tools","/dev/nvidia0","/dev/nvidia1"] }</span>
</pre></div>
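<p>Rather than copying the flags by hand, you can derive the <code>dockerspawner</code> settings from the <code>curl</code> output programmatically. This is a hypothetical sketch (the helper name is my own, and it only handles the flag types shown above):</p>

```python
import shlex

def parse_nvidia_cli_flags(flags):
    """Split the output of `curl -s localhost:3476/docker/cli` into the
    pieces DockerSpawner needs: volume driver, read-only volumes, devices."""
    driver, volumes, devices = None, {}, []
    for token in shlex.split(flags):
        if token.startswith("--volume-driver="):
            driver = token.split("=", 1)[1]
        elif token.startswith("--volume="):
            # format is source:destination[:mode]
            src, dst = token.split("=", 1)[1].split(":")[:2]
            volumes[src] = dst
        elif token.startswith("--device="):
            devices.append(token.split("=", 1)[1])
    return driver, volumes, devices
```

<p>The three return values map directly to <code>extra_create_kwargs["volume_driver"]</code>, <code>read_only_volumes</code> and <code>extra_host_config["devices"]</code> in the configuration above.</p>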
<h2>Test it</h2>
<p>Login with Jupyterhub, try this notebook: <a href="http://nbviewer.jupyter.org/gist/zonca/a14af3b92ab472580f7b97b721a2251e">http://nbviewer.jupyter.org/gist/zonca/a14af3b92ab472580f7b97b721a2251e</a></p>
<h2>Current issues</h2>
<ul>
<li>Environment on the Jupyterhub kernel is missing <code>LD_LIBRARY_PATH</code>, running directly on the image instead is fine</li>
<li>I'd like to test using <code>numba</code> in Jupyterhub, but that requires <code>cudatoolkit</code> 8.0 which is not available yet in Anaconda</li>
</ul>Jupyterhub deployment on multiple nodes with Docker Swarm2016-05-24T12:00:00-07:002016-05-24T12:00:00-07:00Andrea Zoncatag:zonca.github.io,2016-05-24:/2016/05/jupyterhub-docker-swarm.html<p>This post is part of a series on deploying Jupyterhub on OpenStack tailored at workshops, in the previous posts I showed:</p>
<ul>
<li><a href="http://zonca.github.io/2016/04/jupyterhub-sdsc-cloud.html">How to deploy a Jupyterhub on a single server with Docker and Python/R/Julia support</a></li>
<li><a href="http://zonca.github.io/2016/04/jupyterhub-image-sdsc-cloud.html">How to deploy the previous server from a pre-built image and customize it …</a></li></ul><p>This post is part of a series on deploying Jupyterhub on OpenStack tailored at workshops, in the previous posts I showed:</p>
<ul>
<li><a href="http://zonca.github.io/2016/04/jupyterhub-sdsc-cloud.html">How to deploy a Jupyterhub on a single server with Docker and Python/R/Julia support</a></li>
<li><a href="http://zonca.github.io/2016/04/jupyterhub-image-sdsc-cloud.html">How to deploy the previous server from a pre-built image and customize it</a></li>
</ul>
<p>The limitation of a single server setup is that it cannot scale beyond the resources available on that server, especially memory. Therefore, for a workshop that requires loading large amounts of data or that has many students, a multi-server setup is recommended.</p>
<p>Fortunately Docker already provides that flexibility thanks to <a href="https://docs.docker.com/swarm/overview/">Docker Swarm</a>. Docker Swarm exposes a Docker interface that behaves like a normal single-server instance but launches containers on a pool of servers, so only minimal changes are needed on the Jupyterhub server.</p>
<p>Jupyterhub will interface with the Docker Swarm service running locally, Docker Swarm will take care of launching containers across the other nodes. Each container will launch a Jupyter Notebook server for a single user, then Jupyterhub will proxy the container port to the users. Users won't connect directly to the nodes in the Docker Swarm pool. </p>
<h2>Setup the Jupyterhub server</h2>
<p>Let's start from the public image already available, see just the first section "Create a Virtual Machine in OpenStack with the pre-built image" in <a href="http://zonca.github.io/2016/04/jupyterhub-image-sdsc-cloud.html">http://zonca.github.io/2016/04/jupyterhub-image-sdsc-cloud.html</a> for instructions on how to get the Jupyterhub single server running.</p>
<h3>Setup Docker Swarm</h3>
<p>First of all we need to have Docker accessible remotely so we need to configure it to listen on a TCP port, edit <code>/etc/init/docker.conf</code> and replace <code>DOCKER_OPTS=</code> in the <code>start</code> section with:</p>
<div class="highlight"><pre><span></span><span class="err">DOCKER_OPTS="-H tcp://0.0.0.0:2375 -H unix:///var/run/docker.sock"</span>
</pre></div>
<p>Port 2375 is not open in the OpenStack security group configuration, so this is not a security issue.</p>
<p>Then we need to run two Swarm services in Docker containers. The first is Consul, a distributed key-value store listening on port 8500 that Swarm needs to keep track of all the available nodes:</p>
<div class="highlight"><pre><span></span><span class="err">docker run --restart=always -d -p 8500:8500 --name=consul progrium/consul -server -bootstrap</span>
</pre></div>
<p>The second is the manager, which provides the interface to Docker Swarm:</p>
<div class="highlight"><pre><span></span><span class="err">HUB_LOCAL_IP=$(ip route get 8.8.8.8 | awk 'NR==1 {print $NF}')</span>
<span class="err">docker run --restart=always -d -p 4000:4000 swarm manage -H :4000 --replication --advertise $HUB_LOCAL_IP:4000 consul://$HUB_LOCAL_IP:8500</span>
</pre></div>
<p>This sets <code>HUB_LOCAL_IP</code> to the internal ip of the instance, then starts the Manager container.</p>
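<p>Note that taking the last field of the first line of <code>ip route get</code> is fragile: newer versions of iproute2 append extra fields (such as <code>uid</code>) after the source address. A more robust approach, sketched here in Python (my own helper, not part of the original setup), is to take the token right after <code>src</code>:</p>

```python
def local_ip_from_route(route_output):
    """Extract the local source IP from `ip route get 8.8.8.8` output
    by taking the field that follows the "src" keyword."""
    fields = route_output.split()
    return fields[fields.index("src") + 1]
```

<p>If you stick with the shell one-liner, just double-check on your distribution that the last field really is the IP address.</p>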
<p>We are running both with automatic restarting, so that they are launched again in case of failure or after reboot.</p>
<p>You can check if the containers are running with:</p>
<div class="highlight"><pre><span></span><span class="err">docker ps -a</span>
</pre></div>
<p>and then you can check if connection works with Docker Swarm on port 4000:</p>
<div class="highlight"><pre><span></span><span class="err">docker -H :4000 ps -a</span>
</pre></div>
<p>Check the Docker documentation for a more robust setup with multiple Consul services and a backup Manager.</p>
<h3>Setup Jupyterhub</h3>
<p>Following the work by Jess Hamrick for the <a href="https://github.com/compmodels/jupyterhub">compmodels Jupyterhub deployment</a>, we can get <code>jupyterhub_config.py</code> from <a href="https://gist.github.com/zonca/83d222df8d0b9eaebd02b83faa676753">https://gist.github.com/zonca/83d222df8d0b9eaebd02b83faa676753</a> and copy it into the home of the ubuntu user.</p>
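<p>The gist linked above is authoritative; as a rough sketch, the Swarm-relevant part of the configuration is pointing <code>dockerspawner</code> at the Swarm manager on port 4000 instead of the local Docker socket. The exact settings below are my assumptions, check the gist:</p>

```python
# sketch of the Swarm-relevant fragments of jupyterhub_config.py;
# `c` is the configuration object JupyterHub provides at startup
import os

# dockerspawner builds its client from the environment (docker-py's
# kwargs_from_env), so point DOCKER_HOST at the Swarm manager
os.environ["DOCKER_HOST"] = "tcp://127.0.0.1:4000"

c.JupyterHub.spawner_class = "dockerspawner.SystemUserSpawner"
c.DockerSpawner.container_image = "jupyter/systemuser"
# containers run on remote nodes, so the hub must listen on an address
# reachable from them, not on the loopback interface
c.JupyterHub.hub_ip = "10.XX.XX.XX"  # internal IP of the hub instance
```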
<h3>Share users home via NFS</h3>
<p>We now have a distributed system and we need a central location to store the users' home folders, so that even if they happen to get containers on different servers, they can still access their files.</p>
<p>Install NFS with the package manager:</p>
<div class="highlight"><pre><span></span><span class="err">sudo apt-get install nfs-kernel-server</span>
</pre></div>
<p>edit <code>/etc/exports</code>, add:</p>
<div class="highlight"><pre><span></span><span class="err">/home *(rw,sync,no_root_squash)</span>
</pre></div>
<p>The NFS ports are not open in the OpenStack security group configuration, so this export is not reachable from outside.</p>
<h2>Setup networking</h2>
<p>Before preparing a node, create a new security group under Compute -> Access & Security and name it <code>swarmsecgroup</code>.</p>
<p>We need to allow open traffic between <code>swarmsecgroup</code> and the security group of the Jupyterhub instance, <code>jupyterhubsecgroup</code> in my previous tutorial. So in the new <code>swarmsecgroup</code>, add this rule: </p>
<ul>
<li>Add Rule</li>
<li>Rule: ALL TCP</li>
<li>Direction: Ingress</li>
<li>Remote: Security Group</li>
<li>Security Group: <code>jupyterhubsecgroup</code></li>
</ul>
<p>Add another rule replacing Ingress with Egress.
Now open the <code>jupyterhubsecgroup</code> group and add the same 2 rules, just make sure to choose as target "Security Group" <code>swarmsecgroup</code>.</p>
<p>On the <code>swarmsecgroup</code> also add a Rule for SSH traffic from any source choosing CIDR and 0.0.0.0/0, you can disable this after having executed the configuration.</p>
<h2>Setup the Docker Swarm nodes</h2>
<h3>Launch a plain Ubuntu instance</h3>
<p>Launch a new instance, call it <code>swarmnode</code>, choose the size depending on your requirements, then choose "Boot from image" and pick Ubuntu 14.04 LTS (16.04 should work as well, but I haven't tested it yet). Remember to choose a Key Pair under Access & Security and assign the Security Group <code>swarmsecgroup</code>.</p>
<p>Temporarily add a floating IP to this instance in order to SSH into it, see my first tutorial for more details.</p>
<h3>Setup Docker Swarm</h3>
<p>First install Docker engine:</p>
<div class="highlight"><pre><span></span><span class="err">sudo apt update</span>
<span class="err">sudo apt install apt-transport-https ca-certificates</span>
<span class="err">sudo apt-key adv --keyserver hkp://p80.pool.sks-keyservers.net:80 --recv-keys 58118E89F3A912897C070ADBF76221572C52609D</span>
<span class="err">echo "deb https://apt.dockerproject.org/repo ubuntu-trusty main" | sudo tee /etc/apt/sources.list.d/docker.list </span>
<span class="err">sudo apt update</span>
<span class="err">sudo apt install -y docker-engine</span>
<span class="err">sudo usermod -aG docker ubuntu</span>
</pre></div>
<p>Then make the same edit we did on the hub, edit <code>/etc/init/docker.conf</code> and replace <code>DOCKER_OPTS=</code> in the <code>start</code> section with:</p>
<div class="highlight"><pre><span></span><span class="err">DOCKER_OPTS="-H tcp://0.0.0.0:2375 -H unix:///var/run/docker.sock"</span>
</pre></div>
<p>Restart Docker with:</p>
<div class="highlight"><pre><span></span><span class="err">sudo service docker restart</span>
</pre></div>
<p>Then run the container that interfaces with Swarm:</p>
<div class="highlight"><pre><span></span><span class="err">HUB_LOCAL_IP=10.XX.XX.XX</span>
<span class="err">NODE_LOCAL_IP=$(ip route get 8.8.8.8 | awk 'NR==1 {print $NF}')</span>
<span class="err">docker run --restart=always -d swarm join --advertise=$NODE_LOCAL_IP:2375 consul://$HUB_LOCAL_IP:8500</span>
</pre></div>
<p>Replace <code>10.XX.XX.XX</code> in the <code>HUB_LOCAL_IP</code> variable with the internal address of the Jupyterhub server.</p>
<h3>Setup mounting the home filesystem</h3>
<div class="highlight"><pre><span></span><span class="err">sudo apt-get install autofs</span>
</pre></div>
<p>add in <code>/etc/auto.master</code>:</p>
<div class="highlight"><pre><span></span><span class="err">/home /etc/auto.home</span>
</pre></div>
<p>create <code>/etc/auto.home</code>:</p>
<div class="highlight"><pre><span></span><span class="err">echo "* $HUB_LOCAL_IP:/home/&" | sudo tee /etc/auto.home</span>
</pre></div>
<p>using the internal IP of the hub.</p>
<div class="highlight"><pre><span></span><span class="err">sudo service autofs restart</span>
</pre></div>
<p>verify by doing:</p>
<div class="highlight"><pre><span></span><span class="err">ls /home/ubuntu</span>
</pre></div>
<p>or </p>
<div class="highlight"><pre><span></span><span class="err">ls /home/training01</span>
</pre></div>
<p>you should see the same files that were on the Jupyterhub server.</p>
<h3>Create users</h3>
<p>As we are using system users and mounting the home filesystem it is important that users have the same UID on all nodes, so we are going to run on the node the same script we ran on the Jupyterhub server:</p>
<div class="highlight"><pre><span></span><span class="err"> bash create_users.sh</span>
</pre></div>
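<p>If you want to double-check that the UIDs really do match across machines, a small sketch like the following (my own helper, not part of the original scripts) can compare two <code>/etc/passwd</code> dumps:</p>

```python
def uid_mismatches(passwd_a, passwd_b):
    """Given the text of /etc/passwd from two machines, return the users
    present on both whose numeric UIDs differ."""
    def parse(text):
        entries = {}
        for line in text.strip().splitlines():
            fields = line.split(":")
            entries[fields[0]] = fields[2]  # username -> uid
        return entries
    a, b = parse(passwd_a), parse(passwd_b)
    return {user: (a[user], b[user])
            for user in a.keys() & b.keys() if a[user] != b[user]}
```

<p>Run it on the hub's and a node's <code>/etc/passwd</code>; an empty result means home directories mounted over NFS will have consistent ownership.</p>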
<h3>Test Jupyterhub</h3>
<p>Login on the Jupyterhub instance with 2 or more different users, then check on the console of the Hub that the containers were launched on the <code>swarmnode</code> instance:</p>
<div class="highlight"><pre><span></span><span class="err"> docker -H :4000 ps -a</span>
</pre></div>
<h2>Create more nodes</h2>
<p>Now that we have created a fully functioning node, we can clone it to create more nodes to accommodate more users.</p>
<h3>Create a snapshot of the node</h3>
<p>First we need to delete all Docker containers, ssh into the <code>swarmnode</code> and execute:</p>
<div class="highlight"><pre><span></span><span class="err"> docker rm -f $(docker ps -a -q)</span>
</pre></div>
<p>Each Docker engine has a unique identifying key; we need to remove it so that a new one is regenerated on each clone.</p>
<div class="highlight"><pre><span></span><span class="err">sudo service docker stop</span>
<span class="err">sudo rm /etc/docker/key.json</span>
</pre></div>
<p>Then from Compute->Instances choose "Create Snapshot", call it <code>swarmnodeimage</code>.</p>
<h3>Launch other nodes</h3>
<p>Click on Launch instance->"Boot from Snapshot"-><code>swarmnodeimage</code>, choose the <code>swarmsecgroup</code> Security Group. Choose any number of instances you need.</p>
<p>Each node will need to launch the Swarm container with its own local ip, not the same as our first node. Therefore we need to use the "Post Creation"->"Direct Input" and add this script: </p>
<table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre>1
2
3
4</pre></div></td><td class="code"><div class="highlight"><pre><span></span><span class="ch">#!/bin/bash</span>
<span class="nv">HUB_LOCAL_IP</span><span class="o">=</span><span class="m">10</span>.XX.XX.XX
<span class="nv">NODE_LOCAL_IP</span><span class="o">=</span><span class="k">$(</span>ip route get <span class="m">8</span>.8.8.8 <span class="p">|</span> awk <span class="s1">'NR==1 {print $NF}'</span><span class="k">)</span>
docker run --restart<span class="o">=</span>always -d swarm join --advertise<span class="o">=</span><span class="nv">$NODE_LOCAL_IP</span>:2375 consul://<span class="nv">$HUB_LOCAL_IP</span>:8500
</pre></div>
</td></tr></table>
<p><code>HUB_LOCAL_IP</code> is the internal network IP address of the Jupyterhub instance and <code>NODE_LOCAL_IP</code> will be filled with the IP of the OpenStack image just created.</p>
<p>See for example Jupyterhub with 3 remote Swarm nodes running containers for 4 training users:</p>
<div class="highlight"><pre><span></span>$ docker -H :4000 ps -a
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
60189f208df2 zonca/jupyterhub-datascience-systemuser <span class="s2">"tini -- sh /srv/sing"</span> <span class="m">11</span> seconds ago Up <span class="m">7</span> seconds <span class="m">10</span>.128.1.28:32769->8888/tcp swarmnodes-1/jupyter-training04
1d7b05caedb1 zonca/jupyterhub-datascience-systemuser <span class="s2">"tini -- sh /srv/sing"</span> <span class="m">36</span> seconds ago Up <span class="m">32</span> seconds <span class="m">10</span>.128.1.27:32768->8888/tcp swarmnodes-2/jupyter-training03
733c5ff0a5ed zonca/jupyterhub-datascience-systemuser <span class="s2">"tini -- sh /srv/sing"</span> <span class="m">58</span> seconds ago Up <span class="m">54</span> seconds <span class="m">10</span>.128.1.29:32768->8888/tcp swarmnodes-3/jupyter-training02
282abce201dd zonca/jupyterhub-datascience-systemuser <span class="s2">"tini -- sh /srv/sing"</span> About a minute ago Up About a minute <span class="m">10</span>.128.1.28:32768->8888/tcp swarmnodes-1/jupyter-training01
29b2d394fab9 swarm <span class="s2">"/swarm join --advert"</span> <span class="m">13</span> minutes ago Up <span class="m">13</span> minutes <span class="m">2375</span>/tcp swarmnodes-2/romantic_easley
8fd3d32fe849 swarm <span class="s2">"/swarm join --advert"</span> <span class="m">13</span> minutes ago Up <span class="m">13</span> minutes <span class="m">2375</span>/tcp swarmnodes-3/clever_mestorf
1ae073f7b78b swarm <span class="s2">"/swarm join --advert"</span> <span class="m">13</span> minutes ago Up <span class="m">13</span> minutes <span class="m">2375</span>/tcp swarmnodes-1/jovial_goldwasser
</pre></div>
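<p>With many users it can be handy to see how Swarm spread the notebook containers over the nodes. In the listing above container names are prefixed with the node name, so a quick tally (a sketch assuming that <code>node/name</code> convention) is:</p>

```python
from collections import Counter

def containers_per_node(container_names):
    """Tally Swarm container names of the form "node/name" by node."""
    return Counter(name.split("/")[0] for name in container_names)
```

<p>Feeding it the names from the example output above would show two containers on <code>swarmnodes-1</code> and one each on the other two nodes.</p>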
<h2>Where to go from here</h2>
<p>At this level the deployment is quite complicated, so it is probably worth automating it with an <code>ansible</code> playbook; that will be the subject of the next blog post. I expect the result will be a simplified version of <a href="https://github.com/compmodels/jupyterhub-deploy">Jess Hamrick's compmodels deployment</a>. Still, I recommend starting with a manual setup to understand how the different pieces work.</p>
<h2>Troubleshooting</h2>
<p>If <code>docker -H :4000 ps -a</code> gives the error:</p>
<div class="highlight"><pre><span></span><span class="err">Error response from daemon: No elected primary cluster manager</span>
</pre></div>
<p>it means the Consul container is broken, remove it and create it again.</p>
<h2>Acknowledgments</h2>
<p>Thanks to Jess Hamrick for sharing the setup of her <a href="https://github.com/compmodels">compmodels class on Github</a>, the Jupyter team for releasing such great tools and Kevin Coakley and the rest of the <a href="http://www.sdsc.edu/services/it/cloud.html">SDSC Cloud</a> team for OpenStack support and resources.</p>Quick Jupyterhub deployment for workshops with pre-built image2016-04-28T12:00:00-07:002016-04-28T12:00:00-07:00Andrea Zoncatag:zonca.github.io,2016-04-28:/2016/04/jupyterhub-image-sdsc-cloud.html<p>This tutorial explains how to use a OpenStack image I already built to quickly deploy a Jupyterhub Virtual Machine that can provide a good initial setup for a workshop, providing students access to Python 2/3, Julia, R, file editor and terminal with bash.</p>
<p>For details about building the instance …</p><p>This tutorial explains how to use a OpenStack image I already built to quickly deploy a Jupyterhub Virtual Machine that can provide a good initial setup for a workshop, providing students access to Python 2/3, Julia, R, file editor and terminal with bash.</p>
<p>For details about building the instance yourself for more customization, see the full tutorial at <a href="http://zonca.github.io/2016/04/jupyterhub-sdsc-cloud.html">http://zonca.github.io/2016/04/jupyterhub-sdsc-cloud.html</a>.</p>
<h2>Create a Virtual Machine in OpenStack with the pre-built image</h2>
<p>Follow the 3 steps at <a href="http://zonca.github.io/2016/04/jupyterhub-sdsc-cloud.html">the step by step tutorial</a> under "Create a Virtual Machine in OpenStack":</p>
<ul>
<li>Network setup</li>
<li>Create a new Virtual Machine: here instead of choosing the base <code>ubuntu</code> image, choose <code>jupyterhub_docker</code>, also you can choose any size, I recommend to start with a <code>c1.large</code> for experimentation, you can then resize it later to a more powerful instance depending on the needs of your workshop</li>
<li>Give public IP to the instance</li>
</ul>
<h2>Connect to Jupyterhub</h2>
<p>The Jupyterhub instance is ready! Just open your browser and connect to the floating IP of the instance you just created.</p>
<p>The browser should show a security error related to the fact that the pre-installed SSL certificate is not trusted, click on "Advanced properties" and choose to connect anyway, we'll see later how to fix this.</p>
<p>You already have 50 training users, named <code>training01</code> to <code>training50</code>, all with the same password <code>jupyterhubSDSC</code> (see below how to change it). Check that you can login and create a notebook.</p>
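<p>The naming scheme above (zero-padded, two-digit suffixes) is easy to reproduce in scripts, for example to iterate over accounts or generate a roster; a small sketch (my own helper):</p>

```python
def training_users(n):
    """Usernames training01 .. trainingNN, zero-padded to two digits,
    matching the accounts pre-created on the image."""
    return ["training%02d" % i for i in range(1, n + 1)]
```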
<h2>Administer the Jupyterhub instance</h2>
<p>Login into the Virtual Machine with <code>ssh -i jupyterhub.pem ubuntu@xxx.xxx.xxx.xxx</code> using the key file and the public IP setup in the previous steps.</p>
<p>To get rid of the annoying "unable to resolve host" warning, add the hostname of the machine (check by running <code>hostname</code>) to <code>/etc/hosts</code>, i.e. the first line should become something like <code>127.0.0.1 localhost jupyterhub</code> if <code>jupyterhub</code> is the hostname</p>
<h3>Change password/add more users</h3>
<p>In the home folder of the <code>ubuntu</code> user, there is a file named <code>create_users.sh</code>; edit it to change the <code>PASSWORD</code> variable and the number of users from <code>50</code> to a larger number. Then run it with <code>bash create_users.sh</code>. Training users <strong>cannot SSH</strong> into the machine.</p>
<p>Use <code>sudo passwd trainingXX</code> to change the password of a single user.</p>
<h3>Setup a domain (needed for SSL certificate)</h3>
<p>If you do not know how to get a domain name, here are some options:</p>
<ul>
<li>you can generally request a subdomain name from your institution, see for example <a href="http://blink.ucsd.edu/technology/help-desk/sysadmin-resources/domain.html#Register-your-domain-name">UCSD</a></li>
<li>if you own a domain, go in the DNS settings, add a record of type A to a subdomain, like <code>jupyterhub.yourdomain.com</code> that points to the floating IP of the Jupyterhub instance</li>
<li>you can get a free dynamic dns at websites like <a href="https://noip.com">noip.com</a></li>
</ul>
<p>In each case you need to have a DNS record of type A that points to the floating IP of the Jupyterhub instance.</p>
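<p>Once the record is in place, you can verify that it resolves to the floating IP before requesting a certificate; a quick sketch (the helper name is my own):</p>

```python
import socket

def dns_points_to(domain, expected_ip):
    """Check that a domain resolves to the expected floating IP."""
    resolved = {info[4][0] for info in socket.getaddrinfo(domain, None)}
    return expected_ip in resolved
```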
<h3>Setup a SSL Certificate</h3>
<p><a href="https://letsencrypt.org/">Letsencrypt</a> provides free SSL certificates by using a command line client.</p>
<p>SSH into the server, run:</p>
<div class="highlight"><pre><span></span><span class="err">git clone https://github.com/letsencrypt/letsencrypt</span>
<span class="err">cd letsencrypt</span>
<span class="err">sudo service nginx stop</span>
<span class="err">./letsencrypt-auto certonly --standalone -d jupyterhubdeploy.ddns.net</span>
</pre></div>
<p>Follow the instructions at the terminal to obtain a certificate.</p>
<p>Now open the nginx configuration file: <code>sudo vim /etc/nginx/nginx.conf</code></p>
<p>And modify the SSL certificate lines:</p>
<div class="highlight"><pre><span></span><span class="err">ssl_certificate /etc/letsencrypt/live/yoursub.domain.edu/cert.pem;</span>
<span class="err">ssl_certificate_key /etc/letsencrypt/live/yoursub.domain.edu/privkey.pem;</span>
</pre></div>
<p>Start NGINX:</p>
<div class="highlight"><pre><span></span><span class="err">sudo service nginx start</span>
</pre></div>
<p>Connect again to Jupyterhub and check that your browser correctly detects that the HTTPS connection is safe.</p>
<h2>Comments? Suggestions?</h2>
<ul>
<li><a href="http://twitter.com/andreazonca">Twitter</a></li>
<li>Email <code>zonca</code> on the domain <code>sdsc.edu</code></li>
</ul>Deploy Jupyterhub on a Virtual Machine for a Workshop2016-04-16T12:00:00-07:002016-04-16T12:00:00-07:00Andrea Zoncatag:zonca.github.io,2016-04-16:/2016/04/jupyterhub-sdsc-cloud.html<p>This tutorial describes the steps to install a Jupyterhub instance on a single machine suitable for hosting a workshop, suitable for having people login with training accounts on Jupyter Notebooks running Python 2/3, R, Julia with also Terminal access on Docker containers.
Details about the setup:</p>
<ul>
<li>Jupyterhub installed with …</li></ul><p>This tutorial describes the steps to install a Jupyterhub instance on a single machine suitable for hosting a workshop, with people logging in with training accounts to Jupyter Notebooks running Python 2/3, R and Julia, plus terminal access, on Docker containers.
Details about the setup:</p>
<ul>
<li>Jupyterhub installed with Anaconda directly on the host, proxied by NGINX under HTTPS with self-signed certificate</li>
<li>Login with Linux account credentials created previously by the administrator, data in /home are persistent across sessions</li>
<li>Each user runs in a separated Docker container with access to Python 2, Python 3, R and Julia kernels, they can also open the Notebook editor and the terminal</li>
<li>Using a single machine you have to consider that the biggest constraint is going to be memory usage, as a rule of thumb consider 100-200 MB/user plus 5x-10x the amount of data you are loading from disk, depending on the kind of analysis. For a multi-node setup you need to look into Docker Swarm.</li>
</ul>
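<p>The memory rule of thumb in the last point can be turned into a quick back-of-the-envelope estimate; the default numbers below are just the ranges quoted above, not measurements:</p>

```python
def required_ram_gb(n_users, data_gb_per_user=0.0,
                    per_user_mb=200, data_factor=10):
    """Rough RAM estimate in GB: a per-user baseline plus a multiple of
    the data each user loads from disk (5x-10x, depending on analysis)."""
    return n_users * (per_user_mb / 1024.0 + data_factor * data_gb_per_user)
```

<p>For example, 50 users with no large datasets already need on the order of 10 GB just for the notebook processes.</p>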
<p>I am using the OpenStack deployment at the San Diego Supercomputer Center, <a href="http://www.sdsc.edu/services/it/cloud.html">SDSC Cloud</a>, AWS deployments should just replace the first section on Creating a VM and setting up Networking, see <a href="https://github.com/jupyterhub/jupyterhub/wiki/Deploying-JupyterHub-on-AWS">the Jupyterhub wiki</a>.</p>
<p>If you intend to run on SDSC Cloud, I have a pre-built image of this deployment you can setup and run quickly, see <a href="http://zonca.github.io/2016/04/jupyterhub-image-sdsc-cloud.html">see my followup tutorial</a>.</p>
<h1>Create a Virtual Machine in OpenStack</h1>
<p>First of all we need to launch a new Virtual Machine and configure the network.</p>
<ul>
<li>Login to the SDSC Cloud OpenStack dashboard</li>
</ul>
<h2>Network setup</h2>
<p>Jupyterhub will be proxied to the standard HTTPS port by NGINX, and we also want to redirect HTTP to HTTPS, so we open both ports. We also open SSH for the administrators to log in, plus a custom TCP rule so that the Docker containers can connect to the Jupyterhub hub running on port 8081; that port is opened only to the subnet running the Docker containers.</p>
<ul>
<li>Compute -> Access & Security -> Security Groups -> Create Security Group and name it <code>jupyterhubsecgroup</code></li>
<li>Click on Manage Rules </li>
<li>Click on add rule, choose the HTTP rule and click add</li>
<li>Repeat the last step with HTTPS and SSH</li>
<li>Click on add rule again, choose Custom TCP Rule, set port 8081 and set CIDR 172.17.0.0/24 (this is needed so that the containers can connect to the hub)</li>
</ul>
<h2>Create a new Virtual Machine</h2>
<p>We choose Ubuntu here, also other distributions should work fine.</p>
<ul>
<li>Compute -> Access & Security -> Key Pairs -> Create key pair, name it <code>jupyterhub</code> and download it to your local machine</li>
<li>Instances -> Launch Instance, Choose a name, Choose "Boot from image" in Boot Source and Ubuntu as Image name, Choose any size, depending on the number of users (TODO add link to Jupyterhub docs)</li>
<li>Under "Access & Security" choose Key Pair <code>jupyterhub</code> and Security Groups <code>jupyterhubsecgroup</code></li>
<li>Click <code>Launch</code> to create the instance</li>
</ul>
<h2>Give public IP to the instance</h2>
<p>By default in SDSC Cloud machines do not have a public IP.</p>
<ul>
<li>Compute -> Access & Security -> Floating IPs -> Allocate IP To Project, "Allocate IP" to request a public IP</li>
<li>Click on the "Associate" button of the IP just requested and under "Port to be associated" choose the instance just created</li>
</ul>
<h1>Setup Jupyterhub in the Virtual Machine</h1>
<p>In this section we will install and configure Jupyterhub and NGINX to run on the Virtual Machine.</p>
<ul>
<li>login into the Virtual Machine with <code>ssh -i jupyterhub.pem ubuntu@xxx.xxx.xxx.xxx</code> using the key file and the public IP setup in the previous steps</li>
<li>add the hostname of the machine (check by running <code>hostname</code>) to <code>/etc/hosts</code>, i.e. the first line should become something like <code>127.0.0.1 localhost jupyterhub</code> if <code>jupyterhub</code> is the hostname</li>
</ul>
<h2>Setup Jupyterhub</h2>
<div class="highlight"><pre><span></span><span class="err">wget --no-check-certificate https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh</span>
<span class="err">bash Miniconda3-latest-Linux-x86_64.sh</span>
</pre></div>
<p>use all defaults, answer "yes" to modify PATH</p>
<div class="highlight"><pre><span></span><span class="err">sudo apt-get install npm nodejs-legacy</span>
<span class="err">sudo npm install -g configurable-http-proxy</span>
<span class="err">conda install traitlets tornado jinja2 sqlalchemy</span>
<span class="err">pip install jupyterhub</span>
</pre></div>
<p>For authentication to work, the <code>ubuntu</code> user needs to be able to read the <code>/etc/shadow</code> file:</p>
<div class="highlight"><pre><span></span><span class="err">sudo adduser ubuntu shadow</span>
</pre></div>
<h2>Setup the web server</h2>
<p>We will use the NGINX web server to proxy Jupyterhub and handle HTTPS for us; this is recommended for deployments on the public internet.</p>
<div class="highlight"><pre><span></span><span class="err">sudo apt install nginx</span>
</pre></div>
<p><strong>SSL Certificate</strong>: Optionally later, once we have assigned a domain to the Virtual Machine, we can install <code>letsencrypt</code> and get a real certificate, <a href="http://zonca.github.io/2016/04/jupyterhub-image-sdsc-cloud.html">see my followup tutorial</a>, for simplicity here we are just using self-signed certificates that will give warnings on the first time users connect to the server, but still will keep the traffic encrypted.</p>
<div class="highlight"><pre><span></span><span class="err">sudo mkdir /etc/nginx/ssl</span>
<span class="err">sudo openssl req -x509 -nodes -days 365 -newkey rsa:2048 -keyout /etc/nginx/ssl/nginx.key -out /etc/nginx/ssl/nginx.crt</span>
</pre></div>
<p>Get <code>/etc/nginx/nginx.conf</code> from https://gist.github.com/zonca/08c413a37401bdc9d2a7f65a7af44462</p>
<h1>Setup Docker Spawner</h1>
<p>By default Jupyterhub runs notebooks as processes owned by each system user, for more security and isolation, we want Notebook to run in Docker containers, which are something like lightweight Virtual Machines running inside our server.</p>
<h2>Install Docker</h2>
<ul>
<li>Source: https://docs.docker.com/engine/installation/linux/ubuntulinux/#prerequisites</li>
</ul>
<div class="highlight"><pre><span></span><span class="err">sudo apt update</span>
<span class="err">sudo apt install apt-transport-https ca-certificates</span>
<span class="err">sudo apt-key adv --keyserver hkp://p80.pool.sks-keyservers.net:80 --recv-keys 58118E89F3A912897C070ADBF76221572C52609D</span>
<span class="err">echo "deb https://apt.dockerproject.org/repo ubuntu-trusty main" | sudo tee /etc/apt/sources.list.d/docker.list </span>
<span class="err">sudo apt update</span>
<span class="err">sudo apt install docker-engine</span>
<span class="err">sudo usermod -aG docker ubuntu</span>
</pre></div>
<p>Logout and login again for the group to take effect</p>
<h2>Install and configure DockerSpawner</h2>
<div class="highlight"><pre><span></span><span class="err">pip install dockerspawner</span>
<span class="err">docker pull jupyter/systemuser</span>
<span class="err">conda install ipython jupyter</span>
</pre></div>
<p>Create <code>jupyterhub_config.py</code> in the home folder of the ubuntu user with this content:</p>
<div class="highlight"><pre><span></span><span class="n">c</span><span class="o">.</span><span class="n">JupyterHub</span><span class="o">.</span><span class="n">confirm_no_ssl</span> <span class="o">=</span> <span class="bp">True</span>
<span class="n">c</span><span class="o">.</span><span class="n">JupyterHub</span><span class="o">.</span><span class="n">spawner_class</span> <span class="o">=</span> <span class="s1">'dockerspawner.SystemUserSpawner'</span>
<span class="c1"># The docker instances need access to the Hub, so the default loopback port doesn't work:</span>
<span class="kn">from</span> <span class="nn">IPython.utils.localinterfaces</span> <span class="kn">import</span> <span class="n">public_ips</span>
<span class="n">c</span><span class="o">.</span><span class="n">JupyterHub</span><span class="o">.</span><span class="n">hub_ip</span> <span class="o">=</span> <span class="n">public_ips</span><span class="p">()[</span><span class="mi">0</span><span class="p">]</span>
</pre></div>
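<p>The <code>public_ips</code> helper comes from IPython. As a hedged sketch, the same non-loopback IP can also be found with the standard library alone; the <code>8.8.8.8</code> address below is only used to let the kernel select a route, no packet is actually sent, and the function name is mine:</p>

```python
import socket

def default_route_ip():
    """Return the IP this host would use to reach the internet.

    A UDP connect() does not send any packets; it only asks the kernel
    which local address would be used for that route.
    """
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        s.connect(("8.8.8.8", 80))
        return s.getsockname()[0]
    except OSError:
        return "127.0.0.1"  # no route available: fall back to loopback
    finally:
        s.close()

# In jupyterhub_config.py one could then set:
# c.JupyterHub.hub_ip = default_route_ip()
```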
<h1>Connect to Jupyterhub</h1>
<p>From the home folder of the <code>ubuntu</code> user, type <code>jupyterhub</code> to launch the Jupyterhub process; see below how to start it automatically at boot. Use CTRL-C to stop it.</p>
<p>Open a browser and connect to the floating IP you set for your instance; this should redirect to HTTPS. Click "Advanced" in the safety warning caused by the self-signed SSL certificate and log in with the training credentials.</p>
<p>Instead of using the IP, you can use any domain that points to that same IP with a DNS record of type A, or get a dynamic DNS for free on a website like http://noip.com.
Once you have a custom domain, you can configure letsencrypt to have a proper HTTPS certificate so that users do not get any warning when connecting to the instance. I will add this to the optional steps below.</p>
<h1>Optional: Automatically start jupyterhub at boot</h1>
<p>Save https://gist.github.com/zonca/aaeaf3c4e7339127b482d759866e5f39 as <code>/etc/init.d/jupyterhub</code></p>
<div class="highlight"><pre><span></span><span class="err">sudo chmod +x /etc/init.d/jupyterhub</span>
<span class="err">sudo service jupyterhub start</span>
<span class="err">sudo update-rc.d jupyterhub defaults</span>
</pre></div>
<h1>Optional: Create training user accounts</h1>
<p>Add user accounts on Jupyterhub by creating standard Linux users with <code>adduser</code> interactively or with a batch script.</p>
<p>For example, the following batch script creates 10 users, all with the same password:</p>
<table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre>1
2
3
4
5
6
7
8</pre></div></td><td class="code"><div class="highlight"><pre><span></span><span class="ch">#!/bin/bash</span>
<span class="nv">PASSWORD</span><span class="o">=</span>samepasswordforallusers
<span class="nv">NUMBER_OF_USERS</span><span class="o">=</span><span class="m">10</span>
<span class="k">for</span> n in <span class="sb">`</span>seq -f <span class="s2">"%02g"</span> <span class="m">1</span> <span class="nv">$NUMBER_OF_USERS</span><span class="sb">`</span>
<span class="k">do</span>
<span class="nb">echo</span> creating user training<span class="nv">$n</span>
<span class="nb">echo</span> training<span class="nv">$n</span>:<span class="nv">$PASSWORD</span>::::/home/training<span class="nv">$n</span>:/bin/bash <span class="p">|</span> sudo newusers
<span class="k">done</span>
</pre></div>
</td></tr></table>
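<p>The same <code>newusers</code>-format lines can also be generated in Python, for example to preview them before piping to <code>sudo newusers</code>. The password and user count below mirror the bash script above:</p>

```python
# Generate newusers(8)-format lines: user:password:uid:gid:gecos:home:shell
# (the empty uid/gid/gecos fields let the system pick defaults).
password = "samepasswordforallusers"
number_of_users = 10

lines = [
    f"training{n:02d}:{password}::::/home/training{n:02d}:/bin/bash"
    for n in range(1, number_of_users + 1)
]
print("\n".join(lines))
```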
<p>Also add <code>AllowUsers ubuntu</code> to <code>/etc/ssh/sshd_config</code> so that training users cannot SSH into the host machine.</p>
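<p>Editing <code>sshd_config</code> by hand works fine; as a small hedged sketch, the directive can also be appended idempotently from a script (the helper name is mine, the config path is the standard one):</p>

```python
from pathlib import Path

def ensure_allow_users(config_path, user="ubuntu"):
    """Append an AllowUsers directive to sshd_config if it is not present."""
    path = Path(config_path)
    directive = f"AllowUsers {user}"
    lines = path.read_text().splitlines() if path.exists() else []
    if directive not in lines:
        lines.append(directive)
        path.write_text("\n".join(lines) + "\n")

# After editing, reload sshd, e.g.: sudo service ssh reload
```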
<h1>Optional: Add the R and Julia kernels</h1>
<ul>
<li>SSH into the instance</li>
<li><code>git clone https://github.com/jupyter/dockerspawner</code></li>
<li><code>cd dockerspawner</code></li>
</ul>
<p>Modify the file <code>singleuser/Dockerfile</code>, replace <code>FROM jupyter/scipy-notebook</code> with <code>FROM jupyter/datascience-notebook</code></p>
<div class="highlight"><pre><span></span><span class="err">docker build -t datascience-singleuser singleuser</span>
</pre></div>
<p>Modify the file <code>systemuser/Dockerfile</code>, replace <code>FROM jupyter/singleuser</code> with <code>FROM datascience-singleuser</code></p>
<div class="highlight"><pre><span></span><span class="err">docker build -t datascience-systemuser systemuser</span>
</pre></div>
<p>Finally in <code>jupyterhub_config.py</code>, select the new docker image:</p>
<div class="highlight"><pre><span></span><span class="err">c.DockerSpawner.container_image = "datascience-systemuser"</span>
</pre></div>Use your own Python installation (kernel) in Jupyterhub2015-10-05T12:00:00-07:002015-10-05T12:00:00-07:00Andrea Zoncatag:zonca.github.io,2015-10-05:/2015/10/use-own-python-in-jupyterhub.html<p><strong>Updated February 2017</strong></p>
<p>You have access to a Jupyterhub server, but the Python installation provided does not satisfy your needs;
how can you use your own?</p>
<h2>Install Anaconda</h2>
<p>If you don't already have your own Python installation on the Jupyterhub server you have access to, you can install Anaconda in your home …</p><p><strong>Updated February 2017</strong></p>
<p>You have access to a Jupyterhub server, but the Python installation provided does not satisfy your needs;
how can you use your own?</p>
<h2>Install Anaconda</h2>
<p>If you don't already have your own Python installation on the Jupyterhub server you have access to, you can install Anaconda in your home folder. I assume here you have a permanent home folder on the server.</p>
<p>In order to type commands, you can either
open a Jupyterhub Terminal, or run them in the IPython notebook prefixed with <code>!</code>.</p>
<ul>
<li><code>!wget https://repo.continuum.io/archive/Anaconda3-2.3.0-Linux-x86_64.sh</code></li>
<li><code>!bash ./Anacon*</code></li>
</ul>
<h2>Create a kernel file for Jupyterhub</h2>
<p>You probably already know you can have Python 2 and Python 3 kernels on the same Jupyter notebook installation. In the same way you can create your own <code>KernelSpec</code> that instead launches another Python installation.</p>
<p>IPython can automatically create a <code>KernelSpec</code> for you, from the IPython notebook, run:</p>
<div class="highlight"><pre><span></span><span class="err">!~/anaconda3/bin/ipython kernel install --user --name anaconda</span>
</pre></div>
<p>In case your path is different, just insert the full path to <code>ipython</code> from the Python installation you would like to use.</p>
<p>This will create a file <code>kernel.json</code> in <code>~/.local/share/jupyter/kernels/anaconda</code>.</p>
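<p>The generated <code>kernel.json</code> looks roughly like the following; this is a sketch of the typical content, the paths and the launcher module are assumptions that depend on your installation and <code>ipykernel</code> version:</p>

```python
import json

# Sketch of ~/.local/share/jupyter/kernels/anaconda/kernel.json;
# the argv path is an assumption matching the Anaconda install above.
spec = {
    "argv": [
        "/home/youruser/anaconda3/bin/python",
        "-m", "ipykernel",
        "-f", "{connection_file}",
    ],
    "display_name": "anaconda",
    "language": "python",
}
print(json.dumps(spec, indent=1))
```

The <code>{connection_file}</code> placeholder is filled in by Jupyter at kernel launch time.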
<p>You can also add KernelSpecs for other <code>conda</code> environments doing:</p>
<div class="highlight"><pre><span></span><span class="sx">!source activate environmentname</span>
<span class="sx">!ipython kernel install --user --name environmentname</span>
</pre></div>
<h2>Launch a Notebook</h2>
<p>Go back to the Jupyterhub dashboard, reload the page, now you should have another option in the <code>New</code> menu that says <code>My Anaconda</code>.</p>
<p>In order to use your new kernel with an existing notebook, click on the notebook file in the dashboard; it will launch with the default kernel, and you can then change kernel from the top menu <code>Kernel</code> > <code>Change kernel</code>.</p>IPython/Jupyter notebook setup on NERSC Edison2015-09-24T20:00:00-07:002015-09-24T20:00:00-07:00Andrea Zoncatag:zonca.github.io,2015-09-24:/2015/09/ipython-jupyter-notebook-nersc-edison.html<h2>Introduction</h2>
<p>This tutorial explains the setup to run an IPython Notebook on a computing node on the supercomputer Edison at NERSC and forward its port encrypted with SSH to the browser on a local laptop.
This setup is a bit more complicated than other supercomputers, i.e. see <a href="http://zonca.github.io/2015/09/ipython-jupyter-notebook-sdsc-comet.html">my tutorial …</a></p><h2>Introduction</h2>
<p>This tutorial explains the setup to run an IPython Notebook on a computing node on the supercomputer Edison at NERSC and forward its port encrypted with SSH to the browser on a local laptop.
This setup is a bit more complicated than on other supercomputers (see, for example, <a href="http://zonca.github.io/2015/09/ipython-jupyter-notebook-sdsc-comet.html">my tutorial for Comet</a>) for two reasons:</p>
<ul>
<li>Edison's computing nodes run a stripped down OS, with no support for SSH, unless you activate <a href="https://www.nersc.gov/users/computational-systems/hopper/cluster-compatibility-mode/">Cluster Compatibility Mode</a> (CCM) </li>
<li>On Edison you generally don't have direct access to a computing node; even if you request an interactive node, you actually have access to an intermediary node (MOM node), from which <code>aprun</code> sends a job for execution on the computing node.</li>
</ul>
<h2>Quick reference</h2>
<ul>
<li>Install the IPython notebook and make sure it is in the path; I recommend installing Anaconda 64bit in your home folder or on scratch.</li>
<li>Make sure you can ssh passwordless within Edison, i.e. <code>ssh edison</code> from an Edison login node works without a password</li>
<li>Create a folder <code>notebook</code> in your home, get <code>notebook_job.pbs</code> and <code>launch_notebook_and_tunnel_to_login.sh</code> from <a href="https://gist.github.com/zonca/357d36347fd5addca8f0">https://gist.github.com/zonca/357d36347fd5addca8f0</a></li>
<li>Change the port number and customize options (duration)</li>
<li><code>qsub notebook_job.pbs</code></li>
<li>From laptop, launch <code>bash tunnel_laptop_edisonlogin.sh ##</code> from <a href="https://gist.github.com/zonca/5f8b5ccb826a774d3f89">https://gist.github.com/zonca/5f8b5ccb826a774d3f89</a>, where <code>##</code> is the Edison login node number in 2 digits, like <code>03</code>. First you need to modify the port number.</li>
<li>From laptop, open browser and connect to <code>http://localhost:YOURPORT</code></li>
</ul>
<h2>Detailed walkthrough</h2>
<h3>One time setup on Edison</h3>
<p>Make sure that <code>ipython notebook</code> works on a login node; one option is to install
Anaconda 64bit from http://continuum.io/downloads#py34. Choose Python 3.</p>
<p>You need to be able to SSH from one node to another on Edison without a password. Create a new SSH key pair with <code>ssh-keygen</code>, hit enter to keep all default options, and DO NOT ENTER A PASSWORD. Then use <code>ssh-copy-id edison.nersc.gov</code> and enter your password to make sure the key is copied to the authorized keys.
Now you can check it works by executing:</p>
<div class="highlight"><pre><span></span><span class="err">ssh edison.nersc.gov</span>
</pre></div>
<p>from the login node and make sure you are NOT asked for your password.</p>
<h3>Configure the script for TORQUE and submit the job</h3>
<p>Create a <code>notebook</code> folder on your home on Edison.</p>
<p>Copy <code>notebook_job.pbs</code> and <code>launch_notebook_and_tunnel_to_login.sh</code> from <a href="https://gist.github.com/zonca/357d36347fd5addca8f0">https://gist.github.com/zonca/357d36347fd5addca8f0</a> to the <code>notebook</code> folder.</p>
<p>Change the port number in the <code>launch_notebook_and_tunnel_to_login.sh</code> script to a port of your choosing between 7000 and 9999, referenced as YOURPORT in the rest of the tutorial. Two users on the same login node cannot forward the same port, so try to avoid common port numbers such as 8000, 9000, 8080 or 8888.</p>
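<p>A free port in that range can also be picked programmatically; this is a hedged sketch (the range and the excluded defaults follow the text above, and binding is only a quick availability check, another user could still grab the port afterwards):</p>

```python
import random
import socket

COMMON_PORTS = {8000, 8080, 8888, 9000}

def pick_port(low=7000, high=9999, attempts=50):
    """Return a bindable port in [low, high], avoiding common defaults."""
    for _ in range(attempts):
        port = random.randint(low, high)
        if port in COMMON_PORTS:
            continue
        with socket.socket() as s:
            try:
                s.bind(("", port))  # succeeds only if the port is free now
                return port
            except OSError:
                continue  # already taken by another user
    raise RuntimeError("no free port found")

print(pick_port())
```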
<p>Choose the duration of your job; for initial testing it is better to keep it at 30 minutes so your job starts sooner.</p>
<p>Submit the job to the scheduler:</p>
<div class="highlight"><pre><span></span><span class="err">qsub notebook_job.pbs</span>
</pre></div>
<p>Wait for the job to start running, you should see <code>R</code> in:</p>
<div class="highlight"><pre><span></span><span class="err">qstat -u $USER</span>
</pre></div>
<p>The script launches an IPython notebook on a computing node and tunnels its port to the login node.</p>
<p>You can check that everything worked by checking that no errors show up in the <code>notebook.log</code> file, and that you can access the notebook page with <code>wget</code>:</p>
<div class="highlight"><pre><span></span><span class="err">wget localhost:YOURPORT</span>
</pre></div>
<p>should download a <code>index.html</code> file in the current folder, and NOT give an error like "Connection refused".</p>
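<p>The same check can be done from Python instead of <code>wget</code>; a small sketch, where the port argument is whatever you chose above and the function name is mine:</p>

```python
from urllib.error import URLError
from urllib.request import urlopen

def notebook_reachable(port, host="localhost"):
    """True if an HTTP server answers on the given port, like the wget check."""
    try:
        with urlopen(f"http://{host}:{port}/", timeout=5) as response:
            return response.status == 200
    except (URLError, OSError):
        return False  # e.g. "Connection refused": the tunnel is not up
```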
<h3>Tunnel the port to your laptop</h3>
<h4>Linux / MAC</h4>
<p>Download the <code>tunnel_laptop_edisonlogin.sh</code> script from <a href="https://gist.github.com/zonca/357d36347fd5addca8f0">https://gist.github.com/zonca/357d36347fd5addca8f0</a>.</p>
<p>Customize the script with your port number and your username.</p>
<p>Launch <code>bash tunnel_laptop_edisonlogin.sh ##</code> where <code>##</code> is the Edison login node you launched the job from in 2 digits, e.g. <code>03</code>.</p>
<p>The script forwards the port from the login node of Edison to your laptop.</p>
<h4>Windows</h4>
<p>Install <code>putty</code>.</p>
<p>Follow tutorial for local port forwarding on <a href="http://howto.ccs.neu.edu/howto/windows/ssh-port-tunneling-with-putty/">http://howto.ccs.neu.edu/howto/windows/ssh-port-tunneling-with-putty/</a></p>
<ul>
<li>set <code>edison##-eth5.nersc.gov</code> as remote host, where <code>##</code> is the Edison login node you launched the job from in 2 digits, e.g. <code>03</code> and set 22 as SSH port</li>
<li>set YOURPORT as tunnel port, replace both 8080 and 80 in the tutorial with your port number. </li>
</ul>
<h3>Connect to the Notebook</h3>
<p>Open a browser and type <code>http://localhost:YOURPORT</code> in the address bar.</p>
<p>As you can see in the screenshot from my local browser, the <code>hostname</code> is one of Edison's computing nodes:</p>
<p><img alt="test_edison_screenshot.png" src="/images/test_edison_screenshot.png"></p>
<h2>Acknowledgements</h2>
<p>Thanks to Lisa Gerhardt from NERSC user support for helping me understand Edison's configuration.</p>IPython/Jupyter notebook setup on SDSC Comet2015-09-17T20:00:00-07:002015-09-17T20:00:00-07:00Andrea Zoncatag:zonca.github.io,2015-09-17:/2015/09/ipython-jupyter-notebook-sdsc-comet.html<h2>Introduction</h2>
<p>This tutorial explains the setup to run an IPython Notebook on a computing node on the supercomputer Comet at the San Diego Supercomputer Center and forward the port encrypted with SSH to the browser on a local laptop.</p>
<h2>Quick reference</h2>
<ul>
<li>Add <code>module load python scipy</code> to <code>.bashrc</code></li>
<li>Make sure …</li></ul><h2>Introduction</h2>
<p>This tutorial explains the setup to run an IPython Notebook on a computing node on the supercomputer Comet at the San Diego Supercomputer Center and forward the port encrypted with SSH to the browser on a local laptop.</p>
<h2>Quick reference</h2>
<ul>
<li>Add <code>module load python scipy</code> to <code>.bashrc</code></li>
<li>Make sure you can ssh passwordless within Comet, i.e. <code>ssh comet.sdsc.edu</code> from a Comet login node works without a password</li>
<li>Get <code>submit_slurm_comet.sh</code> from <a href="https://gist.github.com/zonca/5f8b5ccb826a774d3f89">https://gist.github.com/zonca/5f8b5ccb826a774d3f89</a></li>
<li>Change the port number and customize options (duration)</li>
<li><code>sbatch submit_slurm_comet.sh</code></li>
<li>Remember the login node you are using</li>
<li>From laptop, use <code>bash tunnel_notebook_comet.sh N</code> where N is the Comet login number (e.g. 2) from <a href="https://gist.github.com/zonca/5f8b5ccb826a774d3f89">https://gist.github.com/zonca/5f8b5ccb826a774d3f89</a></li>
<li>From laptop, open browser and connect to <code>http://localhost:YOURPORT</code></li>
</ul>
<h2>Detailed walkthrough</h2>
<h3>One time setup on Comet</h3>
<p>Log in to a Comet login node, edit the <code>.bashrc</code> file in your home folder (with <code>nano .bashrc</code> for example) and add <code>module load python scipy</code> at the bottom. This makes sure you always have the Python environment loaded in all your jobs. Log out, log back in, and make sure that <code>module list</code> shows <code>python</code> and <code>scipy</code>.</p>
<p>You need to be able to SSH from one node to another on Comet without a password. Create a new SSH key pair with <code>ssh-keygen</code>, hit enter to keep all default options, and DO NOT ENTER A PASSWORD. Then use <code>ssh-copy-id comet.sdsc.edu</code> and enter your password to make sure the key is copied to the authorized keys.
Now you can check it works by executing:</p>
<div class="highlight"><pre><span></span><span class="err">ssh comet.sdsc.edu</span>
</pre></div>
<p>from the login node and make sure you are NOT asked for your password.</p>
<h3>Configure the script for SLURM and submit the job</h3>
<p>Copy <code>submit_slurm_comet.sh</code> from <a href="https://gist.github.com/zonca/5f8b5ccb826a774d3f89">https://gist.github.com/zonca/5f8b5ccb826a774d3f89</a> on your home on Comet.</p>
<p>Change the port number in the script to a port of your choosing between 8000 and 9999, referenced as YOURPORT in the rest of the tutorial. Two users on the same login node cannot forward the same port, so try to avoid common port numbers such as 8000, 9000, 8080 or 8888.</p>
<p>Choose whether you prefer to use a full node to have access to all 24 cores and 128GB of RAM or if you only need 1 core and 5GB of RAM and change the top of the script accordingly.</p>
<p>Choose the duration of your job; for initial testing it is better to keep it at 30 minutes so your job starts straight away.</p>
<p>Submit the job to the scheduler:</p>
<div class="highlight"><pre><span></span><span class="err">sbatch submit_slurm_comet.sh</span>
</pre></div>
<p>Wait for the job to start running, you should see <code>R</code> in:</p>
<div class="highlight"><pre><span></span><span class="err">squeue -u $USER</span>
</pre></div>
<p>The script launches an IPython notebook on a computing node and tunnels its port to the login node.</p>
<p>You can check that everything worked by checking that no errors show up in the <code>notebook.log</code> file, and that you can access the notebook page with <code>wget</code>:</p>
<div class="highlight"><pre><span></span><span class="err">wget localhost:YOURPORT</span>
</pre></div>
<p>should download a <code>index.html</code> file in the current folder, and NOT give an error like "Connection refused".</p>
<p>Check which login node you were using on Comet, i.e. the hostname in your terminal on Comet, for example <code>comet-ln2</code>.</p>
<h3>Tunnel the port to your laptop</h3>
<h4>Linux / MAC</h4>
<p>Download the <code>tunnel_notebook_comet.sh</code> script from <a href="https://gist.github.com/zonca/5f8b5ccb826a774d3f89">https://gist.github.com/zonca/5f8b5ccb826a774d3f89</a>.</p>
<p>Customize the script with your port number.</p>
<p>Launch <code>bash tunnel_notebook_comet.sh N</code> where N is the Comet login node number. So if you were on <code>comet-ln2</code>, use <code>bash tunnel_notebook_comet.sh 2</code>.</p>
<p>The script forwards the port from the login node of comet to your laptop.</p>
<h4>Windows</h4>
<p>Install <code>putty</code>.</p>
<p>Follow tutorial for local port forwarding on <a href="https://www.akadia.com/services/ssh_putty.html/">https://www.akadia.com/services/ssh_putty.html/</a></p>
<ul>
<li>set <code>comet-ln2.sdsc.edu</code> as remote host, 22 as SSH port</li>
<li>set YOURPORT as tunnel port, replace both 8080 and 80 in the tutorial with your port number. </li>
</ul>
<h3>Connect to the Notebook</h3>
<p>Open a browser and type <code>http://localhost:YOURPORT</code> in the address bar.</p>Run Jupyterhub on a Supercomputer2015-04-02T09:00:00-07:002015-04-02T09:00:00-07:00Andrea Zoncatag:zonca.github.io,2015-04-02:/2015/04/jupyterhub-hpc.html<blockquote>
<p><strong>Summary</strong>: I developed a plugin for <a href="https://github.com/jupyter/jupyterhub" title="jupyterhub">Jupyterhub</a>: <a href="https://github.com/zonca/remotespawner">RemoteSpawner</a>. It has a proof-of-concept interface with the Supercomputer Gordon at UC San Diego to spawn IPython Notebook instances as jobs through the queue and tunnel the interface back to the Jupyterhub instance.</p>
</blockquote>
<p>The IPython (recently renamed Jupyter) Notebook is a powerful tool …</p><blockquote>
<p><strong>Summary</strong>: I developed a plugin for <a href="https://github.com/jupyter/jupyterhub" title="jupyterhub">Jupyterhub</a>: <a href="https://github.com/zonca/remotespawner">RemoteSpawner</a>. It has a proof-of-concept interface with the Supercomputer Gordon at UC San Diego to spawn IPython Notebook instances as jobs through the queue and tunnel the interface back to the Jupyterhub instance.</p>
</blockquote>
<p>The IPython (recently renamed Jupyter) Notebook is a powerful tool for analyzing and visualizing data in Python and other programming languages.
A key feature is that a single document contains code, figures, text and equations.
Everything is saved in a single .ipynb file that can be shared, executed and modified. See an <a href="http://nbviewer.ipython.org/github/waltherg/notebooks/blob/master/2013-12-03-Crank_Nicolson.ipynb" title="example notebook">example Notebook on integration of partial differential equations</a>.</p>
<p>The Jupyter Notebook is a Python application with a web frontend, i.e. the interface runs in the user browser.
This setup makes it suitable for any kind of remote computing, in particular running the Jupyter Notebook on a computing node of a Supercomputer, and exporting the interface HTTP port to a local browser.
Setting up tunneling via SSH is tedious, in particular if the user does not have a public IP address.</p>
<p><a href="https://github.com/jupyter/jupyterhub" title="jupyterhub">Jupyterhub</a>, developed by the Jupyter team, comes to the rescue by providing a web application that manages and proxies multiple instances of the Jupyter Notebook for any number of users.
Jupyterhub natively only spawns local processes, but supports plugins to extend its functionality.</p>
<p>I have been developing a proof-of-concept plugin (<a href="https://github.com/zonca/remotespawner">RemoteSpawner</a>) designed to run on a web server; once a user is authenticated, it connects to the login node of a Supercomputer and submits a Jupyter Notebook job.
As soon as the job starts executing, it sets up SSH tunneling with the Jupyterhub host so that
Jupyterhub can provide the Notebook interface to the user.
This setup allows users to simply access a Supercomputer via browser, with all their Python environment and data.</p>
<p>I am looking for interested parties either as users or as collaborators to help further development. See more information about the project below.</p>
<h2>Test it yourself</h2>
<p>In order to have a feeling on how Jupyterhub works, you can test in your browser at:</p>
<ul>
<li><a href="http://tmpnb.org">http://tmpnb.org</a></li>
</ul>
<p>This service by Rackspace creates temporary Jupyter Notebooks on the fly. If you click on <code>Welcome.ipynb</code>,
you can see an example Notebook.</p>
<p>The purpose of my project is to have a web interface to access Jupyter Notebooks that are
running on computing nodes of a Supercomputer, so that users can access the environment and
data on a Supercomputer from their browser and run data-intensive processing.</p>
<h2>Tour of Jupyterhub on the Gordon Supercomputer</h2>
<p>I'll show some screenshots to display how a test Jupyterhub installation on my machine is integrated with <a href="http://www.sdsc.edu/us/resources/gordon/">Gordon</a> thanks to the plugin.</p>
<p>Jupyterhub is accessed publicly via browser and the user can log in. Jupyterhub supports authentication via <code>PAM</code>/<code>LDAP</code>, so it could be integrated with XSEDE credentials; at the moment I am testing with local authentication.</p>
<p><img alt="jupyterhub-hpc-login.png" src="/images/jupyterhub-hpc-login.png"></p>
<p>Once the user is authenticated, Jupyterhub connects via <code>SSH</code> to a login node on Gordon and submits a batch serial job using <code>qsub</code>. The web interface waits for the job to start running. A dedicated queue with a quick turnaround would be useful for this kind of jobs.</p>
<p><img alt="jupyterhub-hpc-refresh.png" src="/images/jupyterhub-hpc-refresh.png">
<img alt="jupyterhub-hpc-job.png" src="/images/jupyterhub-hpc-job.png"></p>
<p>When the job starts running, it first sets up <code>SSH</code> tunneling between the Jupyterhub host and the computing node, then starts the Jupyter Notebook.
As soon as the web interface detects that the job is running, it proxies the tunneled HTTP port for the user. From this point the Jupyter Notebook works exactly like it would on a local machine.</p>
<p>See an example Notebook printing the hostname of the computing node:</p>
<p><img alt="jupyterhub-hpc-testnotebook.png" src="/images/jupyterhub-hpc-testnotebook.png"></p>
<p>Two other useful features of the Jupyter Notebook are a terminal:</p>
<p><img alt="jupyterhub-hpc-terminal.png" src="/images/jupyterhub-hpc-terminal.png"></p>
<p>and an editor that runs in the browser:</p>
<p><img alt="jupyterhub-hpc-editor.png" src="/images/jupyterhub-hpc-editor.png"></p>
<h2>Launch Jupyterhub parallel to access hundreds of computing engines</h2>
<p>The Notebook also supports using Torque to run Python computing engines and send them computationally intensive serial functions for load-balanced execution.</p>
<p>In the Notebook interface, in the <code>Clusters</code> tab, it is possible to choose the number of engines and click start to submit a job to the queue system:</p>
<p><img alt="jupyterhub-hpc-clusterlaunch.png" src="/images/jupyterhub-hpc-clusterlaunch.png"></p>
<p>This will pack 16 jobs per node (Gordon has 16-core CPUs) and make them available from the notebook; see an example where I process 1000 files with 128 engines running in a different job on Gordon:</p>
<ul>
<li><a href="http://nbviewer.ipython.org/gist/zonca/9bd94d8782af037704ff">Example of Jupyterhub Parallel</a></li>
</ul>Accelerate groupby operation on pixels with Numba2015-03-24T09:00:00-07:002015-03-24T09:00:00-07:00Andrea Zoncatag:zonca.github.io,2015-03-24:/2015/03/numba-groupby-pixels.html<p><a href="/notebooks/numba_groupby_pixels.ipynb">Download the original IPython notebook</a></p>
<h2>Astrophysics background</h2>
<p>It is very common in Astrophysics to work with sky pixels. The sky is tessellated in patches with specific properties and a sky map is then a collection of intensity values for each pixel. The most common pixelization used in Cosmology is <a href="http://healpix.jpl.nasa.gov">HEALPix …</a></p><p><a href="/notebooks/numba_groupby_pixels.ipynb">Download the original IPython notebook</a></p>
<h2>Astrophysics background</h2>
<p>It is very common in Astrophysics to work with sky pixels. The sky is tessellated in patches with specific properties and a sky map is then a collection of intensity values for each pixel. The most common pixelization used in Cosmology is <a href="http://healpix.jpl.nasa.gov">HEALPix</a>.</p>
<p>Measurements from telescopes are then represented as an array of pixels that encode the pointing of the instrument at each timestamp and the measurement output.</p>
<h2>Sample timeline</h2>
<div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>
<span class="kn">import</span> <span class="nn">numba</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="nn">np</span>
</pre></div>
<p>For simplicity let's assume we have a sky with 50K pixels:</p>
<div class="highlight"><pre><span></span><span class="err">NPIX = 50000</span>
</pre></div>
<p>And we have 50 million measurements from our instrument:</p>
<div class="highlight"><pre><span></span><span class="err">NTIME = int(50 * 1e6)</span>
</pre></div>
<p>The pointing of our instrument is an array of pixels, random in our sample case:</p>
<div class="highlight"><pre><span></span><span class="err">pixels = np.random.randint(0, NPIX, NTIME)</span>
</pre></div>
<p>Our data are also random:</p>
<div class="highlight"><pre><span></span><span class="err">timeline = np.random.randn(NTIME)</span>
</pre></div>
<h2>Create a map of the sky with pandas</h2>
<p>One of the most common operations is to sum all of our measurements in a sky map, so the value of each pixel in our sky map will be the sum of each individual measurement.
The easiest way is to use the <code>groupby</code> operation in <code>pandas</code>:</p>
<div class="highlight"><pre><span></span><span class="n">timeline_pandas</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">Series</span><span class="p">(</span><span class="n">timeline</span><span class="p">,</span> <span class="n">index</span><span class="o">=</span><span class="n">pixels</span><span class="p">)</span>
<span class="n">timeline_pandas</span><span class="p">.</span><span class="n">head</span><span class="p">()</span>
<span class="mi">46889</span> <span class="mf">0.407097</span>
<span class="mi">3638</span> <span class="mf">1.300001</span>
<span class="mi">6345</span> <span class="mf">0.174931</span>
<span class="mi">15742</span> <span class="o">-</span><span class="mf">0.255958</span>
<span class="mi">34308</span> <span class="mf">1.147338</span>
<span class="nl">dtype</span><span class="p">:</span> <span class="n">float64</span>
<span class="nf">%time</span> <span class="n">m</span> <span class="o">=</span> <span class="n">timeline_pandas</span><span class="p">.</span><span class="n">groupby</span><span class="p">(</span><span class="n">level</span><span class="o">=</span><span class="mi">0</span><span class="p">).</span><span class="n">sum</span><span class="p">()</span>
<span class="n">CPU</span> <span class="nl">times</span><span class="p">:</span> <span class="n">user</span> <span class="mf">4.09</span> <span class="n">s</span><span class="p">,</span> <span class="nl">sys</span><span class="p">:</span> <span class="mi">471</span> <span class="n">ms</span><span class="p">,</span> <span class="nl">total</span><span class="p">:</span> <span class="mf">4.56</span> <span class="n">s</span>
<span class="n">Wall</span> <span class="nl">time</span><span class="p">:</span> <span class="mf">4.55</span> <span class="n">s</span>
</pre></div>
<h2>Create a map of the sky with numba</h2>
<p>We would like to improve the performance of this operation using <code>numba</code>, which automatically produces C-speed compiled code from pure Python functions.</p>
<p>First we need to develop a pure Python version of the code, test it, and then have <code>numba</code> optimize it:</p>
<div class="highlight"><pre>def groupby_python(index, value, output):
    for i in range(index.shape[0]):
        output[index[i]] += value[i]

m_python = np.zeros_like(m)
%time groupby_python(pixels, timeline, m_python)
CPU times: user 37.5 s, sys: 0 ns, total: 37.5 s
Wall time: 37.6 s
np.testing.assert_allclose(m_python, m)
</pre></div>
<p>As expected, pure Python is much slower than the <code>pandas</code> version, which is implemented in <code>cython</code>.</p>
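<p>To make the semantics of the loop concrete, here is the same accumulation run on tiny toy inputs (plain Python lists with made-up values, just for illustration):</p>

```python
def groupby_python(index, value, output):
    # accumulate each sample into the output bin selected by its index
    for i in range(len(index)):
        output[index[i]] += value[i]

# toy data: 4 samples falling into 3 bins
pixels = [0, 2, 2, 1]
timeline = [1.0, 2.0, 3.0, 4.0]
m_python = [0.0, 0.0, 0.0]
groupby_python(pixels, timeline, m_python)
print(m_python)  # [1.0, 4.0, 5.0]
```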
<h3>Optimize the function with numba.jit</h3>
<p><code>numba.jit</code> takes a function as input and creates a compiled version that does not depend on slow Python calls; this is enforced by <code>nopython=True</code>: <code>numba</code> throws an error if the function cannot be compiled in <code>nopython</code> mode.</p>
<div class="highlight"><pre>groupby_numba = numba.jit(groupby_python, nopython=True)
m_numba = np.zeros_like(m)
%time groupby_numba(pixels, timeline, m_numba)
CPU times: user 274 ms, sys: 5 ms, total: 279 ms
Wall time: 278 ms
np.testing.assert_allclose(m_numba, m)
</pre></div>
<p>The performance improvement is about 100x over pure Python and 20x over <code>pandas</code>, pretty good!</p>
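<p>As a side note, <code>numpy</code> itself offers <code>np.add.at</code>, an unbuffered version of the same grouped accumulation: it needs no compilation step, although it may not match the speed of the compiled loop on large arrays. A sketch on toy data with made-up values:</p>

```python
import numpy as np

pixels = np.array([0, 2, 2, 1])            # toy bin indices
timeline = np.array([1.0, 2.0, 3.0, 4.0])  # toy samples
m_at = np.zeros(3)

# unbuffered: repeated indices accumulate, unlike m_at[pixels] += timeline
np.add.at(m_at, pixels, timeline)
print(m_at)  # [1. 4. 5.]
```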
<h2>Use numba.jit as a decorator</h2>
<p>The exact same result is obtained if we use <code>numba.jit</code> as a decorator:</p>
<div class="highlight"><pre>@numba.jit(nopython=True)
def groupby_numba(index, value, output):
    for i in range(index.shape[0]):
        output[index[i]] += value[i]
</pre></div>Software Carpentry setup for Chromebook2015-02-10T20:00:00-08:002015-02-10T20:00:00-08:00Andrea Zoncatag:zonca.github.io,2015-02-10:/2015/02/software-carpentry-setup-chromebook.html<p>In this post I'll provide instructions on how to install the main requirements of a <a href="http://software-carpentry.org">Software Carpentry workshop</a> on
a Chromebook: Bash, git, IPython notebook and R.</p>
<h2>Switch the Chromebook to Developer mode</h2>
<p>ChromeOS is very restrictive on what users can install on the machine.
The only way to get …</p><p>In this post I'll provide instructions on how to install the main requirements of a <a href="http://software-carpentry.org">Software Carpentry workshop</a> on
a Chromebook: Bash, git, IPython notebook and R.</p>
<h2>Switch the Chromebook to Developer mode</h2>
<p>ChromeOS is very restrictive on what users can install on the machine.
The only way to get around this is to switch to developer mode.</p>
<p>Switching to Developer mode <strong>wipes</strong> all the data on the local disk and
may void the warranty; do it at your own risk.</p>
<p>Instructions are available on the <a href="http://www.chromium.org/chromium-os/developer-information-for-chrome-os-devices">ChromeOS wiki</a>: you need
to click on your device name and follow the instructions there.
For most devices you need to switch the device off, then hold down <code>ESC</code> and <code>Refresh</code> and poke the <code>Power</code> button, then press <code>Ctrl-D</code> at the
Recovery screen (there is no prompt, you have to know to do it).
This will wipe the device and activate Developer mode.</p>
<p>Once you reboot and enter your Google credentials, the Chromebook will copy back from Google servers all of your settings.</p>
<p>Now you are in Developer mode, the main feature is that you have a <code>root</code> (superuser) shell you can activate using <code>Ctrl-Alt-T</code>.</p>
<p>The worst issue of Developer mode is that at each boot the system displays a scary screen warning that OS verification is off and asks if you would like to leave Developer mode. If you press <code>Ctrl-D</code> or wait 30 seconds, it boots ChromeOS in Developer mode; if you instead hit Space, it wipes
everything and switches back to Normal mode.</p>
<h2>Install Ubuntu with crouton</h2>
<p>You can now install Ubuntu using <a href="https://github.com/dnschneid/crouton">crouton</a>; full instructions are on its page, in summary:</p>
<ul>
<li>First you need to install the <a href="https://goo.gl/OVQOEt">Crouton Chrome extension</a> on ChromeOS</li>
<li>Download the latest release from <a href="https://goo.gl/fd3zc">https://goo.gl/fd3zc</a></li>
<li>Open the ChromeOS shell using <code>Ctrl-Alt-T</code>, type <code>shell</code> at the prompt and hit enter</li>
<li>Run <code>sudo sh ~/Downloads/crouton -t xfce,xiwi -r trusty</code>; this installs Ubuntu Trusty with the xfce desktop and uses <code>xiwi</code> to run it in a window.</li>
</ul>
<p>Now you can have Ubuntu running in a window of the Chromebook browser by:</p>
<ul>
<li>Press <code>Ctrl-Alt-T</code></li>
<li>type <code>shell</code> at the prompt and hit enter</li>
<li>type <code>sudo startxfce4</code></li>
</ul>
<p>What is great about <code>crouton</code> is that it is not like a Virtual Machine: Ubuntu runs at full performance on the same Linux kernel as ChromeOS.</p>
<h2>Install scientific computing stack</h2>
<p>You can now follow the instructions for
Linux at <a href="http://software-carpentry.org/v5/setup.html">http://software-carpentry.org/v5/setup.html</a>, summary of commands to run in a terminal:</p>
<ul>
<li><code>sudo apt install nano</code></li>
<li><code>sudo apt install git</code></li>
<li>To install R: <code>sudo apt install r-base</code></li>
<li>Download Anaconda Python 3 64bit for Linux from <a href="http://continuum.io/downloads">http://continuum.io/downloads</a> and execute it</li>
</ul>
<p>Anaconda will run under Ubuntu but when you open an IPython notebook, it will automatically open a new tab in the main browser of ChromeOS, not
inside the Ubuntu window.</p>
<h2>Final note</h2>
<p>I admit it looks scary, but I have personally followed this procedure successfully on 2 Chromebooks: Samsung Chromebook 1 and Toshiba Chromebook 2.</p>
<p>See the screenshot of my Chromebook below: the Ubuntu window on the right runs <code>git</code>, <code>nano</code> and <code>IPython notebook</code>; the <code>IPython notebook</code> window opens in Chrome, in the left window (click to enlarge).</p>
<p><a href="/images/screenshot-chromebook.png"><img src="/images/screenshot-chromebook.png" alt="Screenshot Chromebook click for full resolution" style="width: 730px;"/></a></p>
<p>It is also possible to switch the Chromebook to Developer mode and install Anaconda and git directly there; however, I think that for a complete scientific computing platform it is a lot better to have all of the packages provided by Ubuntu.</p>Zero based indexing2014-10-22T10:00:00-07:002014-10-22T10:00:00-07:00Andrea Zoncatag:zonca.github.io,2014-10-22:/2014/10/zero-based-indexing.html<h2>Reads</h2>
<ul>
<li>Dijkstra: <a href="https://www.cs.utexas.edu/~EWD/transcriptions/EWD08xx/EWD831.html">https://www.cs.utexas.edu/~EWD/transcriptions/EWD08xx/EWD831.html</a></li>
<li>Guido van Rossum: <a href="https://plus.google.com/115212051037621986145/posts/YTUxbXYZyfi">https://plus.google.com/115212051037621986145/posts/YTUxbXYZyfi</a></li>
</ul>
<h2>Comment</h2>
<p>For Europeans zero based indexing feels reasonable if we think of floors in a house,
the lowest floor is ground floor, then 1st floor and so on …</p><h2>Reads</h2>
<ul>
<li>Dijkstra: <a href="https://www.cs.utexas.edu/~EWD/transcriptions/EWD08xx/EWD831.html">https://www.cs.utexas.edu/~EWD/transcriptions/EWD08xx/EWD831.html</a></li>
<li>Guido van Rossum: <a href="https://plus.google.com/115212051037621986145/posts/YTUxbXYZyfi">https://plus.google.com/115212051037621986145/posts/YTUxbXYZyfi</a></li>
</ul>
<h2>Comment</h2>
<p>For Europeans, zero-based indexing feels reasonable if we think of floors in a house:
the lowest floor is the ground floor, then the 1st floor and so on.</p>
<p>A house with 2 stories has a ground floor and a 1st floor. In this way it is natural to index
zero-based and to count 1-based.</p>
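<p>The house analogy, spelled out in Python:</p>

```python
# index zero-based, count one-based: like floors in a European house
floors = ["ground", "1st"]
assert floors[0] == "ground"  # the lowest floor has index 0
assert floors[1] == "1st"     # the next one is the 1st floor
assert len(floors) == 2       # but the house has 2 stories
```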
<p>What about <strong>slicing</strong> instead? This is a separate issue from indexing.
The main problem here is that if you include the upper bound then you cannot express
the empty slice.
Also it is elegant to express the first <code>n</code> elements as <code>a[:n]</code>. Slicing <code>a[i:j]</code> excludes
the upper bound, so it is probably easier to understand a slice of <code>n</code> elements starting at <code>i</code> if we express it as <code>a[i:i+n]</code>.</p>Write unit tests as cells of IPython notebooks2014-09-30T14:00:00-07:002014-09-30T14:00:00-07:00Andrea Zoncatag:zonca.github.io,2014-09-30:/2014/09/unit-tests-ipython-notebook.html<h2>What?</h2>
<p>Plugin for <code>py.test</code> to write unit tests as cells in IPython notebooks:</p>
<ul>
<li>Homepage on Github: <a href="https://github.com/zonca/pytest-ipynb">https://github.com/zonca/pytest-ipynb</a></li>
<li>PyPi : <a href="https://pypi.python.org/pypi/pytest-ipynb/">https://pypi.python.org/pypi/pytest-ipynb/</a></li>
<li>Install with <code>pip install pytest-ipynb</code></li>
</ul>
<h2>Why?</h2>
<p>Many unit testing frameworks in Python, first of all the <code>unittest</code> package in the standard …</p><h2>What?</h2>
<p>Plugin for <code>py.test</code> to write unit tests as cells in IPython notebooks:</p>
<ul>
<li>Homepage on Github: <a href="https://github.com/zonca/pytest-ipynb">https://github.com/zonca/pytest-ipynb</a></li>
<li>PyPi : <a href="https://pypi.python.org/pypi/pytest-ipynb/">https://pypi.python.org/pypi/pytest-ipynb/</a></li>
<li>Install with <code>pip install pytest-ipynb</code></li>
</ul>
<h2>Why?</h2>
<p>Many unit testing frameworks in Python, first of all the <code>unittest</code> package in the standard library, work very well for automating unit tests, but make it very difficult to interactively debug a failed test.</p>
<p><a href="http://pytest.org"><code>py.test</code></a> alleviates this problem by letting you write plain Python functions with <code>assert</code> statements (no boilerplate code); it discovers them automatically in any file whose name starts with <code>test</code> and writes a useful report.</p>
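<p>For example, a minimal test file (hypothetical names) is just:</p>

```python
# test_arithmetic.py -- py.test discovers files and functions starting with "test"
def add(a, b):
    return a + b

def test_add():
    # a plain assert, no boilerplate classes or assertEqual methods
    assert add(2, 3) == 5
```

<p>Running <code>py.test</code> in that directory finds <code>test_add</code> and runs it automatically.</p>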
<p>I wrote a plugin for <code>py.test</code>, <a href="https://pypi.python.org/pypi/pytest-ipynb"><code>pytest-ipynb</code></a>, that goes a step further and runs unit tests written as cells of any IPython notebook named <code>test*.ipynb</code>.</p>
<p>The advantage is that any issue is easy to reproduce and debug by opening the test notebook interactively; you then clean the notebook outputs and add it to the software repository.</p>
<p>More details on Github: <a href="https://github.com/zonca/pytest-ipynb">https://github.com/zonca/pytest-ipynb</a></p>
<p>Suggestions welcome as comments or github issues.</p>
<p>(Yes, works with Python 3)</p>How to perform code review for scientific software2014-08-28T17:00:00-07:002014-08-28T17:00:00-07:00Andrea Zoncatag:zonca.github.io,2014-08-28:/2014/08/code-review-for-scientific-computing.html<p>Code review is the formal process where a programmer inspects in detail a piece of software developed by somebody else, in order to improve code quality by catching bugs and improving readability and usability.
It is used extensively in industry, not so much in academia.</p>
<p>There has been some discussion about this …</p><p>Code review is the formal process where a programmer inspects in detail a piece of software developed by somebody else, in order to improve code quality by catching bugs and improving readability and usability.
It is used extensively in industry, not so much in academia.</p>
<p>There has been some discussion about this lately, see:
* <a href="http://ivory.idyll.org/blog/on-code-review-of-scientific-code.html">A few thoughts on code review of scientific code</a> by Titus Brown
* <a href="http://mozillascience.org/code-review-for-science-what-we-learned/">Code review for science: What we learned</a> by Kaitlin Thaney</p>
<p>I participated in the <a href="http://software-carpentry.org/blog/2014/01/code-review-round-2.html">second code review pilot study of Software Carpentry</a>, where I was paired with a research group in Genomics and reviewed some of their analysis code.
In this blog post I'd like to write about some guidelines and best practices on how to perform code review of scientific code.</p>
<p>Code review is best applied to libraries, prior to publication, because an improvement in code quality can help future users of the code; one-off analysis scripts benefit less from the process.</p>
<h2>How to do a code review of a large codebase</h2>
<p>The code review process should be performed on ~200-400 lines of code at a time.
The first thing is to ask the code author whether she can identify different functionalities of the code that could be packaged and distributed separately; modularity really helps in maintaining software in the long term.</p>
<p>Then the author should follow these steps to get ready for the code review:</p>
<ul>
<li>For each of the packages identified previously, the code author should create a separate repository, generally on Github, possibly under an organization account (see <a href="http://zonca.github.io/2014/08/github-for-research-groups.html">Github for research groups</a>).</li>
<li>Create a blank project in the programming language of choice (hopefully Python!) using a pre-defined standard template; I recommend <a href="https://github.com/audreyr/cookiecutter">CookieCutter</a>.</li>
<li>Write a <code>README.md</code> file explaining exactly the functionality of the code in general</li>
<li>Clone the repository locally, add, commit and push the blank project with <code>README.md</code> to the <code>master</code> branch on Github</li>
<li>Identify a portion of the software of about ~200-400 lines that has a defined functionality and that could be reviewed together. It doesn't necessarily need to be in a runnable state, at the beginning we can start the code review without running the code.</li>
<li>Create a new branch locally and copy, add, commit this file or this set of files to the repository and push to Github</li>
<li>Access the web interface of Github, it should have detected that you just pushed a new branch and asked if you want to create a pull request. Create a pull request with a few details on the code under review.</li>
<li>Point the reviewer to the pull request</li>
</ul>
<h2>How to review an improvement to the software</h2>
<p>The implementation of a feature should be performed on a separate branch, then it is straightforward to push it to Github, create a pull request and ask reviewers to look at the set of changes.</p>
<h2>How to perform the actual code review</h2>
<p>Coding style should not be the main focus of the review; the most important feedback for the author is high-level comments on software organization. The reviewer should focus on what makes the software more usable and more maintainable.</p>
<p>A few examples:</p>
<ul>
<li>can some parts of the code be simplified?</li>
<li>is there any functionality that could be replaced by an existing library?</li>
<li>is it clear what each part of the software is doing?</li>
<li>is there a more straightforward way of splitting the code into files?</li>
<li>is documentation enough?</li>
<li>are there some function arguments or function names that could be easily misinterpreted by a user?</li>
</ul>
<p>The purpose is to improve the code, but also to help the code author to improve her coding skills.</p>
<p>On the Github pull requests interface, it is possible both to write general comments, and to click on a single line of code and write an inline comment.</p>
<h2>How to implement reviewer's recommendations</h2>
<p>The author can improve the code locally on the same branch used in the pull request, then commit and push the changes to Github; the changes will be automatically added to the existing pull request, so the reviewer can start another iteration of the review process.</p>
<p>Comments and suggestions are welcome.</p>Create a Github account for your research group with free private repositories2014-08-19T15:00:00-07:002014-08-19T15:00:00-07:00Andrea Zoncatag:zonca.github.io,2014-08-19:/2014/08/github-for-research-groups.html<p>See the <strong>updated version</strong> at <a href="https://zonca.github.io/2019/08/github-for-research-groups.html">https://zonca.github.io/2019/08/github-for-research-groups.html</a></p>
<p><a href="https://github.com/">Github</a> allows a research group to create their own webpage where they can host, share and develop their software using the <code>git</code> version control system and the powerful Github online issue-tracking interface.</p>
<p>Since February 2014 Github also …</p><p>See the <strong>updated version</strong> at <a href="https://zonca.github.io/2019/08/github-for-research-groups.html">https://zonca.github.io/2019/08/github-for-research-groups.html</a></p>
<p><a href="https://github.com/">Github</a> allows a research group to create their own webpage where they can host, share and develop their software using the <code>git</code> version control system and the powerful Github online issue-tracking interface.</p>
<p>Since February 2014 Github also offers 20 private repositories to research groups and classrooms, plus unlimited public repositories.
Private repositories are useful in early stages of development or when it is necessary to keep software secret before publication; at publication they can easily be switched to public repositories and free up their slots.</p>
<p>Here the steps to set this up:</p>
<ul>
<li>Create a user account on Github and choose the free plan, use your <code>.edu</code> email address</li>
<li>Create an organization account for your research group</li>
<li>Go to https://education.github.com/ and click on "Request a discount"</li>
<li>Choose your position, e.g. Researcher, and select that you want a discount for an organization</li>
<li>Choose the organization you created earlier and confirm that it is a "Research group"</li>
<li>Add details about your Research group</li>
<li>Finally you need to upload a picture of your University ID card and write how you plan on using the repositories</li>
<li>Within a week at most, but generally in less than 24 hours, you will be approved for 20 private repositories.</li>
</ul>
<p>Once the organization is created, you can add key team members to the "Owners" group, and then create another group for students and collaborators.</p>
<p>Consider also that it is not necessary for every collaborator to have write access to your repositories. My recommendation is to ask a more experienced team member to administer the central repository, ask the students to fork the repository under their user accounts (forks of private repositories are always private, free and don't use any slot), and then <a href="https://help.github.com/articles/using-pull-requests">send a pull request</a> to the central repository for the administrator to review, discuss and merge.</p>
<p>See for example the organization account of the <a href="https://github.com/ged-lab">"Genomics, Evolution, and Development" at Michigan State U led by Dr. C. Titus Brown</a> where they share code, documentation and papers. Open Science!!</p>
<p>Other suggestions on the setup very welcome!</p>Thoughts on a career as a computational scientist2014-06-05T14:00:00-07:002014-06-05T14:00:00-07:00Andrea Zoncatag:zonca.github.io,2014-06-05:/2014/06/career-as-a-computational-scientist.html<p>Recently I've been asked what are the prospects of a wannabe computational scientist,
both in terms of training and in terms of job opportunities.</p>
<p>So I am writing this blog post about my personal experience.</p>
<h2>What is a computational scientist?</h2>
<p>In my understanding, a computational scientist is a scientist with …</p><p>Recently I've been asked what are the prospects of a wannabe computational scientist,
both in terms of training and in terms of job opportunities.</p>
<p>So I am writing this blog post about my personal experience.</p>
<h2>What is a computational scientist?</h2>
<p>In my understanding, a computational scientist is a scientist with strong skills in scientific computing who
spends most of the day building software.</p>
<p>Usually there are 2 main areas, in any field of science:</p>
<ol>
<li><em>Data analysis</em>: historically only a few fields of science, e.g. Astrophysics, had to deal with large amounts
of experimental data; nowadays every field can generate
extremely large amounts of data thanks to modern technology.
The task of the computational scientist is generally to analyze the data, i.e. cleanup, check systematic effects,
calibrate, understand and reduce to a form to be used for scientific exploitation.
Generally a second phase of data analysis involves model fitting, i.e. check which theoretical models best fit the
data and estimate their parameters with error bars, this requires knowledge of Statistics and Bayesian techniques,
like Markov Chain Monte Carlo (MCMC).</li>
<li><em>Simulations</em>: production of artificial data, either in its own right to aid the understanding of scientific models, or
to reproduce experimental data in order to characterize the response of a scientific instrument. </li>
</ol>
<h2>Skills of a computational scientist</h2>
<p>Starting out as a computational scientist nowadays is quite easy; with a background in any field of science, it is possible to improve computational skills thanks to several learning resources, for example:</p>
<ul>
<li>Free online video classes on <a href="https://www.coursera.org/courses?search=python">Coursera</a>, <a href="https://www.udacity.com/courses#!/data-science">Udacity</a> and others</li>
<li><a href="http://software-carpentry.org">Software Carpentry</a> runs bootcamps for scientists to improve their computational skills</li>
<li>Online tutorials on <a href="http://scipy-lectures.github.io/">Python for scientific computing</a></li>
<li>Books, e.g. <a href="http://shop.oreilly.com/product/0636920023784.do">Python for Data Analysis</a></li>
</ul>
<p>Basically it is important to have good experience with at least one programming language; Python is the safest option because:</p>
<ul>
<li>it is well established in many fields of science</li>
<li>its syntax is easier to learn than most other common programming languages</li>
<li>it has the largest number of scientific libraries </li>
<li>it is easy to interface with other languages, i.e. we can reuse legacy code implemented in C/C++/FORTRAN</li>
<li>it can be used also when developing something unusual for a computational scientist, like web development (<code>django</code>) or interfacing with hardware (<code>pyserial</code>).</li>
</ul>
<p>Python performance is comparable to C/C++/Java when we make use of optimized libraries like <code>numpy</code>, <code>pandas</code> and <code>scipy</code>, which
provide Python frontends to highly optimized C or Fortran code; it is therefore necessary to avoid explicit for loops and learn
to write "vectorized" code, which processes entire arrays and matrices in one step.</p>
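<p>For example, a sum of squares written as an explicit loop versus the "vectorized" form (a small illustrative example, not from the original post):</p>

```python
import numpy as np

a = np.arange(100_000, dtype=np.float64)

# explicit for loop: each element passes through the Python interpreter
total_loop = 0.0
for x in a:
    total_loop += x * x

# vectorized: the whole array is processed in one call to optimized C code,
# typically orders of magnitude faster on large arrays
total_vec = (a * a).sum()

assert np.isclose(total_loop, total_vec)
```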
<p>Some important Python tools to learn are:</p>
<ul>
<li><code>IPython</code> notebooks to write documents with code, documentation and plots embedded </li>
<li><code>numpy</code> and <code>pandas</code> for data management</li>
<li><code>matplotlib</code> for plotting</li>
<li><code>h5py</code> or <code>pytables</code>, HDF5 binary files manipulation</li>
<li><a href="http://www.jeffknupp.com/blog/2013/08/16/open-sourcing-a-python-project-the-right-way/">how to publish a Python package</a></li>
<li><code>emcee</code> for MCMC</li>
<li><code>scipy</code> for signal processing, FFT, optimization, integration, 2d array processing</li>
<li><code>scikit-learn</code> for Machine Learning</li>
<li><code>scikit-image</code> for image processing </li>
<li>Object oriented programming</li>
</ul>
<p>For parallel programming:</p>
<ul>
<li><code>IPython parallel</code> for distributing large amounts of serial and independent jobs on a cluster</li>
<li><code>PyTrilinos</code> for distributed linear algebra (high level operations with data distributed across nodes, automatic MPI communication)</li>
<li><code>mpi4py</code> for manually creating data communication via MPI</li>
</ul>
<p>On top of Python, it is also useful to learn a bit of shell scripting with <code>bash</code>, which is better suited for simple automation tasks,
and it is fundamental to learn version control with git or mercurial.</p>
<h2>My experience</h2>
<p>I trained as an Aerospace Engineer for my Master degree, and then moved to a PhD in Astrophysics in Milano,
where I worked in the Planck collaboration and took care of simulating the inband response of the Low Frequency Instrument
detectors.
During my PhD I developed a good proficiency with Python, mainly using it for task automation and plotting.
My previous programming experience was minimal, only some Matlab during the last year of my Master degree, but I found Python really easy to use
and learned it myself from books and online tutorials.
With no formal education in Computer Science, the most complicated concept to grasp was Object Oriented programming; at the time
I was moonlighting as a web developer and I familiarized myself with OO using Django models.
After my PhD I got a PostDoc position at the University of California, Santa Barbara; there I had access to supercomputers for the first time
and my job involved analyzing large amounts of data.
During 4 years at UCSB I had the great opportunity of choosing my own tools, implementing my own software for data processing,
so I immediately saw the value of improving my understanding of software development best practices.</p>
<p>Unfortunately in science there is usually a push toward hacking together a quick and dirty solution to get results out and move forward;
I instead focused on learning how to build easily maintainable libraries that I could re-use in the future. This
involved learning more advanced Python, version control, unit testing and so on. I learned these tools by reading tutorials and
documentation on the web, answers on StackOverflow, blog posts.
It also helped that I became one of the core developers of <code>healpy</code>, a Python package for processing pixelized sky maps.</p>
<p>In 2013, in the 4th year of my PostDoc and with the Planck mission nearing its end in 2015, I was looking for a position
as a computational scientist, mainly as a research scientist (i.e. doing research/data analysis full time, with a long-term contract)
at research labs like Berkeley Lab or Jet Propulsion Laboratory, or in a research group in Cosmology/Astrophysics or in
High Performance Computing.</p>
<p>I was hired at the San Diego Supercomputer Center in December 2013 as permanent staff, mainly thanks to my experience with data analysis,
Python and parallel programming. Here I collaborate with research groups in any field of science and help them deploy and optimize their software on supercomputers at SDSC or at other XSEDE centers.</p>
<h2>Thoughts about a career as a computational scientist</h2>
<p>After a PhD program, a computational scientist with experience in either data analysis or simulation, especially with experience in parallel programming, should quite easily find a position as a PostDoc: lots of research groups have huge amounts of data and need skilled software development labor.</p>
<p>I believe the complicated part is the next step: faculty jobs favour scientists with the best scientific publications, and software development is generally not recognized as a first-class scientific product.
Very interesting opportunities in Academia are Research Scientist positions either at research facilities, for example Lawrence Berkeley Labs and NASA Jet Propulsion Laboratory, or at supercomputer centers. These jobs are often permanent positions, unless the institution runs out of funding, and allow working 100% on research.
Another opportunity is to work as a Research Scientist in a specific research group at a University; this is less common, and depends on the availability of long-term funding.</p>
<p>Still, the total number of available positions in Academia is not very high, therefore it is very important to also keep open the opportunity of a job in Industry. Fortunately nowadays most skills of a computational scientist are very well recognized in Industry, so I recommend choosing, whenever possible, tools that are also widely used outside of Academia, for example Python, version control with Git, shell scripting, unit testing, databases, multi-core programming, parallel programming, GPU programming and so on.</p>
<p><em>Acknowledgement</em>: thanks to Priscilla Kelly for discussion on this topic and review of the post</p>
<p><em>Comments/feedback</em>: comment on the blog using Google+ or tweet to <a href="http://twitter.com/andreazonca">@andreazonca</a></p>Machine learning at scale with Python2014-03-20T20:00:00-07:002014-03-20T20:00:00-07:00Andrea Zoncatag:zonca.github.io,2014-03-20:/2014/03/machine-learning-at-scale-with-python.html<p>My talk for the San Diego Data Science meetup: <a href="http://www.meetup.com/San-Diego-Data-Science-R-Users-Group/events/170967362/">http://www.meetup.com/San-Diego-Data-Science-R-Users-Group/events/170967362/</a></p>
<p>About:</p>
<ul>
<li>Setup <a href="http://star.mit.edu/cluster/">StarCluster</a> to launch EC2 instances</li>
<li>Running IPython Notebook on Amazon EC2</li>
<li>Running single node Machine Learning jobs using multiple cores</li>
<li>
<p>Distributing jobs with IPython parallel to multiple EC2 instances</p>
</li>
<li>
<p>See HTML5 <strong>slides …</strong></p></li></ul><p>My talk for the San Diego Data Science meetup: <a href="http://www.meetup.com/San-Diego-Data-Science-R-Users-Group/events/170967362/">http://www.meetup.com/San-Diego-Data-Science-R-Users-Group/events/170967362/</a></p>
<p>About:</p>
<ul>
<li>Setup <a href="http://star.mit.edu/cluster/">StarCluster</a> to launch EC2 instances</li>
<li>Running IPython Notebook on Amazon EC2</li>
<li>Running single node Machine Learning jobs using multiple cores</li>
<li>
<p>Distributing jobs with IPython parallel to multiple EC2 instances</p>
</li>
<li>
<p>See HTML5 <strong>slides</strong>: <a href="http://bit.ly/ml-ec2">http://bit.ly/ml-ec2</a></p>
</li>
<li>See the IPython notebook sources of the slides: <a href="http://bit.ly/ml-ec2-ipynb">http://bit.ly/ml-ec2-ipynb</a></li>
</ul>
<p>Finally the Github repository with additional material, under MIT license:
<a href="https://github.com/zonca/machine-learning-at-scale-with-python">https://github.com/zonca/machine-learning-at-scale-with-python</a></p>
<p>Any feedback is appreciated, google+, twitter or email.</p>Python on Gordon2014-03-20T19:30:00-07:002014-03-20T19:30:00-07:00Andrea Zoncatag:zonca.github.io,2014-03-20:/2014/03/setup-ipython-notebook-parallel-Gordon.html<p>Gordon has already a <code>python</code> environment setup which can be activated by loading the <code>python</code> module:</p>
<div class="highlight"><pre><span></span><span class="err">module load python # add this to .bashrc to load it at every login</span>
</pre></div>
<h3>Install virtualenv</h3>
<p>Then we need to setup a sandboxed local environment to install other packages, by using <code>virtualenv</code>, get the link …</p><p>Gordon has already a <code>python</code> environment setup which can be activated by loading the <code>python</code> module:</p>
<div class="highlight"><pre><span></span><span class="err">module load python # add this to .bashrc to load it at every login</span>
</pre></div>
<h3>Install virtualenv</h3>
<p>Then we need to set up a sandboxed local environment to install other packages using <code>virtualenv</code>: get the link to the latest version from <a href="https://pypi.python.org/pypi/virtualenv">https://pypi.python.org/pypi/virtualenv</a>, then download it on Gordon and unpack it, e.g.</p>
<div class="highlight"><pre><span></span><span class="err">wget --no-check-certificate https://pypi.python.org/packages/source/v/virtualenv/virtualenv-1.11.2.tar.gz</span>
<span class="err">tar xzvf virtualenv*tar.gz</span>
</pre></div>
<p>Then create your own virtualenv and load it:</p>
<div class="highlight"><pre><span></span><span class="err">mkdir ~/venv</span>
<span class="err">python virtualenv-*/virtualenv.py ~/venv/py</span>
<span class="err">source ~/venv/py/bin/activate # add this to .bashrc to load it at every login</span>
</pre></div>
<p>you can restore your previous environment by deactivating the virtualenv:</p>
<div class="highlight"><pre><span></span><span class="err">deactivate # from your bash prompt</span>
</pre></div>
<h3>Install IPython</h3>
<p>Using <code>pip</code> you can install <code>IPython</code> and all dependencies for the notebook and parallel tools running:</p>
<div class="highlight"><pre><span></span><span class="err">pip install ipython pyzmq tornado jinja2</span>
</pre></div>
<h3>Configure the IPython notebook</h3>
<p>For interactive data exploration, you can run the <code>IPython</code> notebook on a computing node on Gordon and export the web interface to your local machine, which also embeds all the plots.
Configuring the tunnelling over SSH is complicated, so I created a script; it takes a little time to set up but is then very easy to use, see <a href="https://github.com/pyHPC/ipynbhpc">https://github.com/pyHPC/ipynbhpc</a>.</p>
<h3>Configure IPython parallel</h3>
<p><a href="http://ipython.org/ipython-doc/stable/parallel/">IPython parallel</a> on Gordon allows you to launch a <code>PBS</code> job with tens (or hundreds) of Python engines and then easily submit hundreds (or thousands) of serial jobs to be executed with automatic load balancing.
First of all, create the default configuration files:</p>
<div class="highlight"><pre><span></span><span class="err">ipython profile create --parallel</span>
</pre></div>
<p>Then, in <code>~/.ipython/profile_default/ipcluster_config.py</code>, you need to set:</p>
<div class="highlight"><pre><span></span><span class="err">c.IPClusterStart.controller_launcher_class = 'LocalControllerLauncher' </span>
<span class="err">c.IPClusterStart.engine_launcher_class = 'PBS' </span>
<span class="err">c.PBSLauncher.batch_template_file = u'/home/REPLACEWITHYOURUSER/.ipython/profile_default/pbs.engine.template' # "~" does not work</span>
</pre></div>
<p>You also need to allow connections to the controller from other hosts, setting in <code>~/.ipython/profile_default/ipcontroller_config.py</code>: </p>
<div class="highlight"><pre><span></span><span class="err">c.HubFactory.ip = '*'</span>
<span class="err">c.HubFactory.engine_ip = '*'</span>
</pre></div>
<p>Finally create the PBS template <code>~/.ipython/profile_default/pbs.engine.template</code>:</p>
<table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre> 1
2
3
4
5
6
7
8
9
10</pre></div></td><td class="code"><div class="highlight"><pre><span></span><span class="ch">#!/bin/bash</span>
<span class="c1">#PBS -q normal</span>
<span class="c1">#PBS -N ipcluster</span>
<span class="c1">#PBS -l nodes={n/16}:ppn=16:native</span>
<span class="c1">#PBS -l walltime=01:00:00</span>
<span class="c1">#PBS -o ipcluster.out</span>
<span class="c1">#PBS -e ipcluster.err</span>
<span class="c1">#PBS -m abe</span>
<span class="c1">#PBS -V</span>
mpirun_rsh -np <span class="o">{</span>n<span class="o">}</span> -hostfile <span class="nv">$PBS_NODEFILE</span> ipengine
</pre></div>
</td></tr></table>
<p>Here we chose to run 16 IPython engines per Gordon node, so each has access to 4 GB of RAM; if you need more, just change 16 to 8 in the template, for example.</p>
<h3>Run IPython parallel</h3>
<p>You can submit a job to the queue by running the following, where <code>n</code> is the number of engines you want to use, so it needs to be a multiple of the <code>ppn</code> chosen in the PBS template:</p>
<div class="highlight"><pre><span></span><span class="err">ipcluster start --n=32 &</span>
</pre></div>
<p>In this case we are requesting 2 nodes with 16 IPython engines each; check with:</p>
<div class="highlight"><pre><span></span><span class="err">qstat -u $USER</span>
</pre></div>
<p>Basically, <code>ipcluster</code> runs an <code>ipcontroller</code> on the login node and submits a job to PBS for running the <code>ipengines</code> on the computing nodes.</p>
<p>Once the PBS job is running, check that the engines are connected by opening an IPython session on the login node and printing the <code>ids</code>:</p>
<div class="highlight"><pre><span></span><span class="n">In</span> <span class="p">[</span><span class="mi">1</span><span class="p">]:</span> <span class="kn">from</span> <span class="nn">IPython.parallel</span> <span class="kn">import</span> <span class="n">Client</span>
<span class="n">In</span> <span class="p">[</span><span class="mi">2</span><span class="p">]:</span> <span class="n">rc</span> <span class="o">=</span> <span class="n">Client</span><span class="p">()</span>
<span class="n">In</span> <span class="p">[</span><span class="mi">3</span><span class="p">]:</span> <span class="n">rc</span><span class="o">.</span><span class="n">ids</span>
</pre></div>
<p>You can stop the cluster (kills <code>ipcontroller</code> and runs <code>qdel</code> on the PBS job) either by sending CTRL-c to <code>ipcluster</code> or running:</p>
<div class="highlight"><pre><span></span><span class="err">ipcluster stop # from bash console</span>
</pre></div>
<h3>Submit jobs to IPython parallel</h3>
<p>As soon as <code>ipcluster</code> is executed, <code>ipcontroller</code> is ready to queue jobs, which will then be consumed by the engines once they are running.
The easiest method to submit jobs with automatic load balancing is to create a load-balanced view:</p>
<div class="highlight"><pre><span></span><span class="n">In</span> <span class="p">[</span><span class="mi">1</span><span class="p">]:</span> <span class="kn">from</span> <span class="nn">IPython.parallel</span> <span class="kn">import</span> <span class="n">Client</span>
<span class="n">In</span> <span class="p">[</span><span class="mi">2</span><span class="p">]:</span> <span class="n">rc</span> <span class="o">=</span> <span class="n">Client</span><span class="p">()</span>
<span class="n">In</span> <span class="p">[</span><span class="mi">3</span><span class="p">]:</span> <span class="n">lview</span> <span class="o">=</span> <span class="n">rc</span><span class="o">.</span><span class="n">load_balanced_view</span><span class="p">()</span> <span class="c1"># default load-balanced view</span>
</pre></div>
<p>and then use its <code>map</code> method:</p>
<div class="highlight"><pre><span></span><span class="k">def</span> <span class="nf">exp_10</span><span class="p">(</span><span class="n">x</span><span class="p">):</span>
    <span class="k">return</span> <span class="n">x</span><span class="o">**</span><span class="mi">10</span>
<span class="n">list_of_args</span> <span class="o">=</span> <span class="nb">range</span><span class="p">(</span><span class="mi">100</span><span class="p">)</span>
<span class="n">result</span> <span class="o">=</span> <span class="n">lview</span><span class="o">.</span><span class="n">map</span><span class="p">(</span><span class="n">exp_10</span><span class="p">,</span> <span class="n">list_of_args</span><span class="p">)</span>
</pre></div>
<p>In this code <code>IPython</code> will distribute the list of arguments to the engines with load balancing; the function will be evaluated for each argument and the results copied back to the connecting client running on the login node.</p>
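<p>Without a running cluster you can prototype the same function-plus-<code>map</code> pattern locally with the standard library. The sketch below is my own local stand-in (threads instead of remote engines, not part of the Gordon setup); it keeps the same call shape as <code>lview.map</code>, so the worker function can be tested before submitting it to the cluster:</p>

```python
from multiprocessing.pool import ThreadPool  # local threads stand in for remote engines

def exp_10(x):
    return x ** 10

pool = ThreadPool(4)                    # 4 local workers instead of IPython engines
result = pool.map(exp_10, range(100))   # same call shape as lview.map
pool.close()
pool.join()
print(result[:3])  # [0, 1, 1024]
```

<p>Once the function behaves as expected locally, swapping the pool for the load-balanced view is a one-line change.</p>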
<h3>Submit non-python jobs to IPython parallel</h3>
<p>Let's assume you have a list of commands you want to run in a text file, one command per line; these could be implemented in any programming language, e.g.:</p>
<div class="highlight"><pre><span></span><span class="err">date &> date.log</span>
<span class="err">hostname &> hostname.log</span>
</pre></div>
<p>Then you create a function that executes one of those commands:</p>
<div class="highlight"><pre><span></span><span class="k">def</span> <span class="nf">run_command</span><span class="p">(</span><span class="n">command</span><span class="p">):</span>
    <span class="kn">import</span> <span class="nn">subprocess</span>
    <span class="c1"># call (unlike Popen) waits for the command to complete</span>
    <span class="k">return</span> <span class="n">subprocess</span><span class="o">.</span><span class="n">call</span><span class="p">(</span><span class="n">command</span><span class="p">,</span> <span class="n">shell</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
</pre></div>
<p>Then apply this function to the list of commands:</p>
<div class="highlight"><pre><span></span><span class="err">list_of_commands = open("commands.txt").readlines()</span>
<span class="err">lview.map(run_command, list_of_commands)</span>
</pre></div>
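<p>The same pattern can also be tested serially without any IPython engines. This is a plain-Python sketch of mine (the throwaway temporary file stands in for <code>commands.txt</code>); it runs each line with <code>subprocess</code>, waiting for each command, and collects the exit codes:</p>

```python
import os
import subprocess
import tempfile

# write a throwaway commands file (stand-in for commands.txt)
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write("echo one > /dev/null\n")
    f.write("echo two > /dev/null\n")
    path = f.name

# run each command and wait for it to finish, as the engines would
list_of_commands = open(path).readlines()
exit_codes = [subprocess.call(cmd, shell=True) for cmd in list_of_commands]
os.remove(path)
print(exit_codes)  # [0, 0]
```

<p>A nonzero entry in <code>exit_codes</code> flags a failed command, which is also a useful check when the same list is later mapped onto the engines.</p>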
<p>I created a script that automates this process, see <a href="https://gist.github.com/zonca/8994544">https://gist.github.com/zonca/8994544</a>; you can run it as:</p>
<div class="highlight"><pre><span></span><span class="err">./ipcluster_run_commands.py commands.txt</span>
</pre></div>Build Software Carpentry lessons with Pelican2014-02-26T23:00:00-08:002014-02-26T23:00:00-08:00Andrea Zoncatag:zonca.github.io,2014-02-26:/2014/02/build-software-carpentry-with-pelican.html<p><a href="http://www.software-carpentry.org">Software Carpentry</a> offers bootcamps for scientists to teach basic programming skills.
All the material, mainly about bash, git, Python and R is <a href="http://github.com/swcarpentry/bc">available on Github</a> under Creative Commons.</p>
<p>The content is either in Markdown or in IPython notebook format, and is currently built using Jekyll, nbconvert and Pandoc.
Basically, the …</p><p><a href="http://www.software-carpentry.org">Software Carpentry</a> offers bootcamps for scientists to teach basic programming skills.
All the material, mainly about bash, git, Python and R is <a href="http://github.com/swcarpentry/bc">available on Github</a> under Creative Commons.</p>
<p>The content is either in Markdown or in IPython notebook format, and is currently built using Jekyll, nbconvert and Pandoc.
Basically, the requirement is to make it easy for bootcamp instructors to set up their own website, modify the content, and have the website updated.</p>
<p>I created a fork of the Software Carpentry repository and configured Pelican for creating the website:</p>
<ul>
<li><a href="https://github.com/swcarpentry-pelican/bootcamp-pelican">bootcamp-pelican repository</a>: contains Markdown lessons in <code>lessons</code> (version v5), <code>.ipynb</code> in <code>notebooks</code> and news items in <code>news</code>.</li>
<li><a href="https://github.com/swcarpentry-pelican/swcarpentry-pelican.github.io">bootcamp-pelican Github pages</a>: This repository contains the output HTML</li>
<li><a href="http://swcarpentry-pelican.github.io/">bootcamp-pelican website</a>: this is the URL where Github publishes automatically the content of the previous repository</li>
</ul>
<p>Pelican handles fenced code blocks, see <a href="http://swcarpentry-pelican.github.io/">http://swcarpentry-pelican.github.io/</a> and conversion of IPython notebooks, see <a href="http://swcarpentry-pelican.github.io/lessons/numpy-notebook.html">http://swcarpentry-pelican.github.io/lessons/numpy-notebook.html</a></p>
<h2>How to setup the repositories for a new bootcamp</h2>
<ol>
<li><a href="https://github.com/organizations/new">create a new Organization on Github</a> and add all the other instructors, name it: <code>swcarpentry-YYYY-MM-DD-INST</code> where <code>INST</code> is the institution name, e.g. <code>NYU</code></li>
<li><a href="https://github.com/swcarpentry-pelican/bootcamp-pelican/fork">Fork the <code>bootcamp-pelican</code> repository</a> under the organization account</li>
<li>Create a new repository in your organization named <code>swcarpentry-YYYY-MM-DD-INST.github.io</code> that will host the HTML of the website, also tick <strong>initialize with README</strong>, it will help later.</li>
</ol>
<p>Now you can either prepare the build environment on your laptop or have the web service <code>travis-ci</code> automatically update the website whenever you update the repository (even from the Github web interface!).</p>
<h2>Build/Update the website from your laptop</h2>
<ol>
<li>Clone the <code>bootcamp-pelican</code> repository of your organization locally</li>
<li>
<p>Create a <code>Python</code> virtual environment and install requirements with:</p>
<div class="highlight"><pre><span></span><span class="err">cd bootcamp-pelican</span>
<span class="err">virtualenv swcpy</span>
<span class="err">. swcpy/bin/activate</span>
<span class="err">pip install -r requirements.txt</span>
</pre></div>
</li>
<li>
<p>Clone the <code>swcarpentry-YYYY-MM-DD-INST.github.io</code> in the output folder as:</p>
<div class="highlight"><pre><span></span><span class="err">git clone git@github.com:swcarpentry-YYYY-MM-DD-INST/swcarpentry-YYYY-MM-DD-INST.github.io.git output</span>
</pre></div>
</li>
<li>
<p>Build or Update the website with Pelican running</p>
<div class="highlight"><pre><span></span><span class="err">fab build</span>
</pre></div>
</li>
<li>
<p>You can display the website in your browser locally with:</p>
<div class="highlight"><pre><span></span><span class="err">fab serve</span>
</pre></div>
</li>
<li>
<p>Finally you can publish it to Github with:</p>
<div class="highlight"><pre><span></span><span class="err">cd output</span>
<span class="err">git add .</span>
<span class="err">git push origin master</span>
</pre></div>
</li>
</ol>
<h2>Configure Travis-ci to automatically build and publish the website</h2>
<ol>
<li>Go to <a href="http://travis-ci.org">http://travis-ci.org</a> and login with Github credentials</li>
<li>Under <a href="https://travis-ci.org/profile">https://travis-ci.org/profile</a> click on the organization name on the left and activate the webhook setting <code>ON</code> on your <code>bootcamp-pelican</code> repository</li>
<li>Now it is necessary to setup the credentials for <code>travis-ci</code> to write to the repository</li>
<li>Go to <a href="https://github.com/settings/tokens/new">https://github.com/settings/tokens/new</a> and create a new token with default permissions</li>
<li>
<p>Install the <code>travis</code> tool (in debian/ubuntu <code>sudo gem install travis</code>) and run from any machine (not necessary to have a clone of the repository):</p>
<div class="highlight"><pre><span></span><span class="err">travis encrypt -r swcarpentry-YYYY-MM-DD-INST/bootcamp-pelican GH_TOKEN=TOKENGOTATTHEPREVIOUSSTEP</span>
</pre></div>
<p>otherwise I've set up a web application that does the encryption in your browser, see: <a href="http://travis-encrypt.github.io">http://travis-encrypt.github.io</a></p>
</li>
<li>Open <code>.travis.yml</code> on the website and replace the string under <code>env: global: secure:</code> with the string from <code>travis encrypt</code></li>
<li>Push the modified <code>.travis.yml</code> to trigger the first build by Travis, and then check the log on <a href="http://travis-ci.org">http://travis-ci.org</a></li>
</ol>
<p>Now any change on the source repository will be picked up automatically by Travis and used to update the website.</p>openproceedings: Github/FigShare based publishing platform for conference proceedings2014-02-13T23:30:00-08:002014-02-13T23:30:00-08:00Andrea Zoncatag:zonca.github.io,2014-02-13:/2014/02/openproceedings-github-figshare-pelican-conference-proceedings.html<p>Github provides a great interface for gathering, peer reviewing and accepting papers for conference proceedings, the second step is to publish them on a website either in HTML or PDF form or both.
The Scipy conference is at the forefront on this and did great work in peer reviewing on …</p><p>Github provides a great interface for gathering, peer reviewing and accepting papers for conference proceedings, the second step is to publish them on a website either in HTML or PDF form or both.
The Scipy conference is at the forefront on this and did great work in peer reviewing on Github, see: <a href="https://github.com/scipy-conference/scipy_proceedings/pull/61">https://github.com/scipy-conference/scipy_proceedings/pull/61</a>.</p>
<p>I wanted to develop a system to make it easier to continuously publish updated versions of the papers and also leverage FigShare to provide a long-term repository, a sharing interface and a <a href="http://en.wikipedia.org/wiki/Digital_object_identifier">DOI</a>.</p>
<p>I based it on the blog engine <a href="http://getpelican.com"><code>Pelican</code></a>, developed a plugin <a href="http://github.com/openproceedings/pelican_figshare_pdf"><code>figshare_pdf</code></a> to upload a PDF of an article via API and configured <a href="http://travis-ci.org">Travis-ci</a> as building platform.</p>
<p>See more details on the project page on Github:
<a href="https://github.com/openproceedings/openproceedings-buildbot">https://github.com/openproceedings/openproceedings-buildbot</a></p>wget file from google drive2014-01-31T18:00:00-08:002014-01-31T18:00:00-08:00Andrea Zoncatag:zonca.github.io,2014-01-31:/2014/01/wget-file-from-google-drive.html<p>Sometimes it is useful, even more if you have a chromebook, to upload a file to Google Drive and then use <code>wget</code> to retrieve it from a server remotely.</p>
<p>In order to do this you need to make the file available to "Anyone with the link", then click on that …</p><p>Sometimes it is useful, even more if you have a chromebook, to upload a file to Google Drive and then use <code>wget</code> to retrieve it from a server remotely.</p>
<p>In order to do this you need to make the file available to "Anyone with the link", then click on that link from your local machine and get to the download page that displays a Download button.
Now right-click and select "Show page source" (in Chrome), search for "downloadUrl", and copy the URL that starts with <code>https://docs.google.com</code>, for example:</p>
<div class="highlight"><pre><span></span><span class="c">https://docs.google.com/uc?id\u003d0ByPZe438mUkZVkNfTHZLejFLcnc\u0026export\u003ddownload\u0026revid\u003d0ByPZe438mUkZbUIxRkYvM2dwbVduRUxSVXNERm0zZFFiU2c0PQ</span>
</pre></div>
<p>This is unicode, so open <code>Python</code> and do:</p>
<div class="highlight"><pre><span></span><span class="err">download_url = "PASTE HERE"</span>
<span class="err">print download_url.decode("unicode_escape")</span>
<span class="err">u'https://docs.google.com/uc?id=0ByPZe438mUkZVkNfTHZLejFLcnc&export=download&revid=0ByPZe438mUkZbUIxRkYvM2dwbVduRUxSVXNERm0zZFFiU2c0PQ'</span>
</pre></div>
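<p>The snippet above is Python 2; in Python 3 the same decoding works by round-tripping through <code>bytes</code>. A short sketch (the file id here is made up and shortened for illustration):</p>

```python
# the page source contains \u-escaped characters; decode them to get a plain URL
download_url = r"https://docs.google.com/uc?id\u003dABC123\u0026export\u003ddownload"
decoded = download_url.encode("ascii").decode("unicode_escape")
print(decoded)  # https://docs.google.com/uc?id=ABC123&export=download
```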
<p>The last url can be pasted into a terminal and used with <code>wget</code>.</p>Run IPython Notebook on a HPC Cluster via PBS2013-12-18T16:30:00-08:002013-12-18T16:30:00-08:00Andrea Zoncatag:zonca.github.io,2013-12-18:/2013/12/run-ipython-notebook-on-HPC-cluster-via-PBS.html<p>The <a href="http://ipython.org/notebook.html">IPython notebook</a> is a great tool for data exploration
and visualization.
It is suitable in particular for analyzing a large amount of data remotely on a computing node
of a HPC cluster and visualize it in a browser that runs on a local machine.
In this configuration, the interface …</p><p>The <a href="http://ipython.org/notebook.html">IPython notebook</a> is a great tool for data exploration
and visualization.
It is suitable in particular for analyzing large amounts of data remotely on a computing node
of an HPC cluster and visualizing it in a browser that runs on a local machine.
In this configuration the interface is local and very responsive, while the memory
and CPU horsepower are provided by an HPC computing node.</p>
<p>Also, it is possible to keep the notebook server running, disconnect and reconnect later from
another machine to the same session.</p>
<p>I created a script which is very general and can be used on most HPC cluster and published it on Github:</p>
<p><a href="https://github.com/pyHPC/ipynbhpc">https://github.com/pyHPC/ipynbhpc</a></p>
<p>Once the script is running, it is possible to connect to <code>localhost:PORT</code> and visualize the
IPython notebook; see the following screenshot of Chromium running locally on my machine
connected to an IPython notebook running on a Gordon computing node:</p>
<p><img src="/images/run-ipython-notebook-on-HPC-cluster-via-PBS_screenshot.png" alt="IPython notebook on Gordon" style="width: 730px;"/></p>Joining San Diego Supercomputer Center2013-12-10T13:30:00-08:002013-12-10T13:30:00-08:00Andrea Zoncatag:zonca.github.io,2013-12-10:/2013/12/joining-sandiego-supercomputer-center.html<p><code>TL;DR</code>
Left UCSB after 4 years, got staff position at San Diego Supercomputer Center within UCSD, will be helping research groups analyze their data on Gordon and more. Still 20% on Planck.</p>
<p>I spent 4 great years at UCSB with Peter Meinhold working on analyzing Cosmic Microwave Background data …</p><p><code>TL;DR</code>
Left UCSB after 4 years, got staff position at San Diego Supercomputer Center within UCSD, will be helping research groups analyze their data on Gordon and more. Still 20% on Planck.</p>
<p>I spent 4 great years at UCSB with Peter Meinhold working on analyzing Cosmic Microwave Background data from the ESA Planck space mission.
Cosmology is fascinating, and I enjoyed working with a very open-minded team that always left me great freedom in choosing the techniques and the software tools for the job.</p>
<p>My work has been mainly focused on understanding and characterizing large amounts of data using <code>Python</code> (and <code>C++</code>) on NERSC supercomputers.
I was neither interested nor fit for a traditional academic career, and I was looking for a job that allowed me to focus on doing research/data analysis full time.</p>
<p>The perfect opportunity showed up, as the San Diego Supercomputer Center was looking for a computational scientist with a strong scientific background in any field of science to help research teams jump into supercomputing, specifically newcomers. This involves having the opportunity to collaborate with groups in any area of science, the first projects I am going to work on will be in Astrophysics, Quantum Chemistry and Genomics!</p>
<p>I also have the opportunity to continue my work on calibration and mapmaking of Planck data in collaboration with UCSB for 20% of my time.</p>Published paper on Destriping Cosmic Microwave Background Polarimeter data2013-11-20T21:30:00-08:002013-11-20T21:30:00-08:00Andrea Zoncatag:zonca.github.io,2013-11-20:/2013/11/published-paper-destriping-CMB-polarimeter.html<p>TL;DR version:</p>
<ul>
<li>Preprint on arxiv: <a href="http://arxiv.org/abs/1309.5609">Destriping Cosmic Microwave Background Polarimeter data</a></li>
<li>Destriping <code>python</code> code on github: <a href="https://github.com/zonca/dst"><code>dst</code></a></li>
<li>Output maps and sample input data on figshare: <a href="http://figshare.com/articles/BMachine_40GHz_CMB_Polarimeter_sky_maps/644507">BMachine 40GHz CMB Polarimeter sky maps</a></li>
<li>(Paywalled published paper: <a href="http://dx.doi.org/10.1016/j.ascom.2013.10.002">Destriping Cosmic Microwave Background Polarimeter data</a>)</li>
</ul>
<p>My last paper was published by <a href="http://www.journals.elsevier.com/astronomy-and-computing/">Astronomy and Computing …</a></p><p>TL;DR version:</p>
<ul>
<li>Preprint on arxiv: <a href="http://arxiv.org/abs/1309.5609">Destriping Cosmic Microwave Background Polarimeter data</a></li>
<li>Destriping <code>python</code> code on github: <a href="https://github.com/zonca/dst"><code>dst</code></a></li>
<li>Output maps and sample input data on figshare: <a href="http://figshare.com/articles/BMachine_40GHz_CMB_Polarimeter_sky_maps/644507">BMachine 40GHz CMB Polarimeter sky maps</a></li>
<li>(Paywalled published paper: <a href="http://dx.doi.org/10.1016/j.ascom.2013.10.002">Destriping Cosmic Microwave Background Polarimeter data</a>)</li>
</ul>
<p>My last paper was published by <a href="http://www.journals.elsevier.com/astronomy-and-computing/">Astronomy and Computing</a>.</p>
<p>The paper is focused on Cosmic Microwave Background data destriping, a map-making technique which exploits the fast
scanning of instruments in order to efficiently remove correlated low-frequency noise, generally caused by thermal
fluctuations and gain instability of the amplifiers.</p>
<p>The paper treats in particular the case of destriping data from a polarimeter, i.e. an instrument which directly measures
the polarized signal from the sky, which allows some simplification compared to the case of a radiometer that is merely
polarization-sensitive.</p>
<p>I implemented a fully parallel <code>python</code> implementation of the algorithm based on:</p>
<ul>
<li><a href="http://trilinos.sandia.gov/packages/pytrilinos/"><code>PyTrilinos</code></a> for Distributed Linear Algebra via MPI</li>
<li><code>HDF5</code> for I/O</li>
<li><code>cython</code> for improving the performance of the inner loops</li>
</ul>
<p>The code is available on Github under GPL.</p>
<p>The output maps for about 30 days of data from the UCSB B-Machine polarimeter at 37.5 GHz are available on FigShare.</p>
<p>The experience of publishing with ASCOM was really positive, I received 2 very helpful reviews that drove me to
work on several improvements on the paper.</p>Jiffylab multiuser IPython notebooks2013-10-14T10:30:00-07:002013-10-14T10:30:00-07:00Andrea Zoncatag:zonca.github.io,2013-10-14:/2013/10/jiffylab-multiuser-ipython-notebooks.html<p><a href="https://github.com/ptone/jiffylab">jiffylab</a> is a very interesting project by <a href="https://twitter.com/ptone">Preston Holmes</a> to provide sandboxed IPython notebooks instances on a server using <a href="http://www.docker.io/">docker</a>.
There are several use cases, for example:</p>
<ul>
<li>In a tutorial about <code>python</code>, give users instant access to a working IPython notebook</li>
<li>In a tutorial about some specific <code>python</code> package, give …</li></ul><p><a href="https://github.com/ptone/jiffylab">jiffylab</a> is a very interesting project by <a href="https://twitter.com/ptone">Preston Holmes</a> to provide sandboxed IPython notebooks instances on a server using <a href="http://www.docker.io/">docker</a>.
There are several use cases, for example:</p>
<ul>
<li>In a tutorial about <code>python</code>, give users instant access to a working IPython notebook</li>
<li>In a tutorial about some specific <code>python</code> package, give users instant access to a python environment with that package already installed</li>
<li>Give students in a research group access to <code>python</code> on a server with several packages preinstalled, maintained, and updated by an expert user.</li>
</ul>
<h2>How to install <a href="https://github.com/ptone/jiffylab">jiffylab</a> on Ubuntu 12.04</h2>
<ul>
<li><a href="http://docs.docker.io/en/latest/installation/ubuntulinux/#ubuntu-precise">Install <code>docker</code> on Ubuntu Precise</a></li>
<li>Copy-paste each line of <code>linux-setup.sh</code> to a terminal, to check what is going on step by step</li>
<li>To start the application, change user to <code>jiffylabweb</code>:</li>
</ul>
<div class="highlight"><pre><span></span>sudo su jiffylabweb
<span class="nb">cd</span> /usr/local/etc/jiffylab/webapp/
python app.py <span class="c1">#run in debug mode</span>
</pre></div>
<ul>
<li>Point your browser to the server to check debugging messages, if any.</li>
<li>Finally start the application in production mode:</li>
</ul>
<div class="highlight"><pre><span></span>python server.py <span class="c1">#run in production mode</span>
</pre></div>
<h2>How <code>jiffylab</code> works</h2>
<p>Each user gets a sandboxed IPython notebook instance; users can save their notebooks and reconnect to the same session later. Main things missing:</p>
<ul>
<li>No real authentication system / no HTTPS connection; an easy workaround would be to allow access only from a local network/VPN/SSH tunnel</li>
<li>No scientific packages preinstalled, need to customize the docker image to have <code>numpy</code>, <code>matplotlib</code>, <code>pandas</code>...</li>
<li>No access to a common read-only filesystem; I think this is the most pressing missing feature, <a href="https://github.com/ptone/jiffylab/issues/12">issue already on Github</a></li>
</ul>
<p>I think that just adding the common filesystem would be enough to make the project usable for giving students an easy way to get started with python.</p>
<h2>Few screenshots</h2>
<h3>Login page</h3>
<p><img src="/images/jiffylab_intro.png" alt="Jiffylab Login page" style="width: 730px;"/></p>
<h3>IPython notebook dashboard</h3>
<p><img src="/images/jiffylab_dashboard.png" alt="Jiffylab IPython notebook dashboard" style="width: 730px;"/></p>
<h3>IPython notebook</h3>
<p><img src="/images/jiffylab_notebook.png" alt="Jiffylab IPython notebook" style="width: 730px;"/></p>How to log exceptions in Python2013-10-01T10:30:00-07:002013-10-01T10:30:00-07:00Andrea Zoncatag:zonca.github.io,2013-10-01:/2013/10/how-to-log-exceptions-in-python.html<p>Sometimes it is useful to just catch any exception, write details to a log file and continue execution.</p>
<p>In the <code>Python</code> standard library, it is possible to use the <code>logging</code> module and the built-in exceptions to achieve this.
First of all, we want to catch any exception while still being able to …</p><p>Sometimes it is useful to just catch any exception, write details to a log file and continue execution.</p>
<p>In the <code>Python</code> standard library, it is possible to use the <code>logging</code> module and the built-in exceptions to achieve this.
First of all, we want to catch any exception while still being able to access all information about it:</p>
<div class="highlight"><pre><span></span><span class="k">try</span><span class="p">:</span>
<span class="n">my_function_1</span><span class="p">()</span>
<span class="k">except</span> <span class="ne">Exception</span> <span class="k">as</span> <span class="n">e</span><span class="p">:</span>
<span class="nb">print</span> <span class="n">e</span><span class="o">.</span><span class="vm">__class__</span><span class="p">,</span> <span class="n">e</span><span class="o">.</span><span class="vm">__doc__</span><span class="p">,</span> <span class="n">e</span><span class="o">.</span><span class="n">message</span>
</pre></div>
<p>Then we want to write those to a logging file, so we need to setup the logging module:</p>
<div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">logging</span>
<span class="n">logging</span><span class="o">.</span><span class="n">basicConfig</span><span class="p">(</span> <span class="n">filename</span><span class="o">=</span><span class="s2">"main.log"</span><span class="p">,</span>
<span class="n">filemode</span><span class="o">=</span><span class="s1">'w'</span><span class="p">,</span>
<span class="n">level</span><span class="o">=</span><span class="n">logging</span><span class="o">.</span><span class="n">DEBUG</span><span class="p">,</span>
<span class="nb">format</span><span class="o">=</span> <span class="s1">'</span><span class="si">%(asctime)s</span><span class="s1"> - </span><span class="si">%(levelname)s</span><span class="s1"> - </span><span class="si">%(message)s</span><span class="s1">'</span><span class="p">,</span>
<span class="p">)</span>
</pre></div>
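<p>Putting the two pieces together, a minimal sketch (Python 3 syntax; the function name is illustrative):</p>

```python
# A minimal sketch (Python 3 syntax) combining the two pieces above:
# catch any exception, log its class, docstring and message, and keep going.
import logging

logging.basicConfig(filename="main.log",
                    filemode="w",
                    level=logging.DEBUG,
                    format="%(asctime)s - %(levelname)s - %(message)s",
                    force=True)  # Python >= 3.8: replace any existing handlers

def my_function_1():
    [][0]  # deliberately raises IndexError

try:
    my_function_1()
except Exception as e:
    # e.__doc__ is the docstring of the exception class, str(e) the message
    logging.error("Function my_function_1() raised %s (%s): %s",
                  e.__class__.__name__, e.__doc__, e)
```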
<p><a href="https://gist.github.com/zonca/6782980">In the following gist</a> everything together, with also <a href="http://stackoverflow.com/questions/2380073/how-to-identify-what-function-call-raise-an-exception-in-python">function name detection from Alex Martelli</a>:</p>
<script src="https://gist.github.com/zonca/6782980.js"></script>
<p>Here the output log:</p>
<div class="highlight"><pre><span></span>2013-10-01 11:32:56,466 - ERROR - Function my_function_1() raised <type 'exceptions.IndexError'> (Sequence index out of range.): Some indexing error
2013-10-01 11:32:56,466 - ERROR - Function my_function_2() raised <class 'my_module.MyException'> (This is my own Exception): Something went quite wrong
2013-10-01 11:32:56,466 - ERROR - Function my_function_1_wrapper() raised <type 'exceptions.IndexError'> (Sequence index out of range.): Some indexing error
</pre></div>Google Plus comments plugin for Pelican2013-09-27T17:45:00-07:002013-09-27T17:45:00-07:00Andrea Zoncatag:zonca.github.io,2013-09-27:/2013/09/google-plus-comments-plugin-for-pelican.html<p>There have recently been several discussions about
<a href="http://www.popsci.com/science/article/2013-09/why-were-shutting-our-comments">whether comments are useful on blogs</a>.
I think it is important to find better ways to connect blogs to social networks.
In my opinion the most suitable social network for this is Google+, because there is space for larger discussion, without Twitter's …</p><p>There have recently been several discussions about
<a href="http://www.popsci.com/science/article/2013-09/why-were-shutting-our-comments">whether comments are useful on blogs</a>.
I think it is important to find better ways to connect blogs to social networks.
In my opinion the most suitable social network for this is Google+, because there is space for larger discussion, without Twitter's character limit.</p>
<p>So, for my small blog I've decided to implement the Google+ commenting system, which Google originally implemented just for Blogger but that <a href="http://browsingthenet.blogspot.com/2013/04/google-plus-comments-on-any-website.html">works on any website</a>.</p>
<p>See it in action below.</p>
<p>The plugin is available in the <code>googleplus_comments</code> branch in:</p>
<p><a href="https://github.com/zonca/pelican-plugins/tree/googleplus_comments/googleplus_comments">https://github.com/zonca/pelican-plugins/tree/googleplus_comments/googleplus_comments</a></p>How to automatically build your Pelican blog and publish it to Github Pages2013-09-26T13:45:00-07:002013-09-26T13:45:00-07:00Andrea Zoncatag:zonca.github.io,2013-09-26:/2013/09/automatically-build-pelican-and-publish-to-github-pages.html<p>Something I like a lot about Jekyll, the Github static blog generator, is that you just push commits to your repository and Github takes care of re-building and publishing your website.
Thanks to this, it is possible to create a quick blog post from the Github web interface, without the …</p><p>Something I like a lot about Jekyll, the Github static blog generator, is that you just push commits to your repository and Github takes care of re-building and publishing your website.
Thanks to this, it is possible to create a quick blog post from the Github web interface, without the need to use a machine with a Python environment.</p>
<p>The Pelican developers have a <a href="http://blog.getpelican.com/using-pelican-with-heroku.html">method for building and deploying Pelican on Heroku</a>, which is really useful, but I would like instead to use Github Pages.</p>
<p>I realized that the best way to do this is to rely on <a href="https://travis-ci.org/">Travis-CI</a>, as the build/deploy workflow is pretty similar to the install/unit-test workflow Travis is designed for.</p>
<h2>How to setup Pelican to build on Travis</h2>
<p>I suggest using two separate git repositories on Github for the source and the built website; let's first create only the repository for the source:</p>
<ul>
<li>create the <code>yourusername.github.io-source</code> repository for Pelican and add it as <code>origin</code> in your Pelican folder repository</li>
</ul>
<p>add a <code>requirements.txt</code> file in your Pelican folder:</p>
<div class="highlight"><pre><span></span>github:zonca/zonca.github.io-source/requirements.txt
</pre></div>
<p>add a <code>.travis.yml</code> file to your repository:</p>
<div class="highlight"><pre><span></span>github:zonca/zonca.github.io-source/.travis.yml
</pre></div>
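<p>For reference, such a file looks roughly like the following sketch; the repository details and the encrypted token are placeholders, the actual file is in the repository embedded above:</p>

```yaml
language: python
python:
  - "2.7"
install:
  - pip install -r requirements.txt
script:
  - pelican content -o output -s publishconf.py
after_success:
  - bash deploy.sh
env:
  global:
    # added by "travis encrypt GH_TOKEN=... --add env.global"
    - secure: "ENCRYPTED_TOKEN_PLACEHOLDER"
```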
<p>In order to create the encrypted token under env, you can log in to the Github web interface to get an <a href="https://help.github.com/articles/creating-an-access-token-for-command-line-use">Authentication Token</a>, and then install the <code>travis</code> command line tool with:</p>
<div class="highlight"><pre><span></span><span class="c1"># on Ubuntu you need ruby dev</span>
sudo apt-get install ruby1.9.1-dev
sudo gem install travis
</pre></div>
<p>and run from inside the repository:</p>
<div class="highlight"><pre><span></span>travis encrypt GH_TOKEN=LONGTOKENFROMGITHUB --add env.global
</pre></div>
<p>Then add also the <code>deploy.sh</code> script and update the global variable with yours:</p>
<div class="highlight"><pre><span></span>github:zonca/zonca.github.io-source/deploy.sh
</pre></div>
<p>Then we can create the repository that will host the actual blog:</p>
<ul>
<li>create the <code>yourusername.github.io</code> repository for the website (with initial readme, so you can clone it)</li>
</ul>
<p>Finally we can go to <a href="https://travis-ci.org/">Travis-CI</a>, connect our Github profile, and activate Continuous Integration on our <code>yourusername.github.io-source</code> repository.</p>
<p>Now, you can push a new commit to your source repository and check on Travis if the build and deploy are successful, hopefully they are (joking, no way it is going to work on the first try!).</p>clviewer, interactive plot of CMB spectra2013-09-17T18:30:00-07:002013-09-17T18:30:00-07:00Andrea Zoncatag:zonca.github.io,2013-09-17:/2013/09/clviewer-interactive-plot-of-CMB-spectra.html<p>Today it was HackDay at <a href="http://dotastronomy.com">.Astronomy</a>, so I felt compelled to hack something together myself,
creating something I have been thinking about for a while after my previous work on <a href="http://zonca.github.io/2013/08/interactive-figures-planck-power-spectra.html">Interactive CMB power spectra in the browser</a>.</p>
<p>The idea is to get text files from a user and load them in …</p><p>Today it was HackDay at <a href="http://dotastronomy.com">.Astronomy</a>, so I felt compelled to hack something together myself,
creating something I have been thinking about for a while after my previous work on <a href="http://zonca.github.io/2013/08/interactive-figures-planck-power-spectra.html">Interactive CMB power spectra in the browser</a>.</p>
<p>The idea is to get text files from a user and load them into a browser-based interactive display built on top of the <a href="http://d3js.org">d3.js</a> and <a href="http://code.shutterstock.com/rickshaw/">rickshaw</a> libraries.</p>
<p>Similar to <a href="http://nbviewer.ipython.org/">nbviewer</a>, I think it is very handy to load data from <a href="https://gist.github.com/">Github gists</a>, because then there is no need of uploading files and it is easier to circulate links.</p>
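<p>Extracting the files from a gist is a single call to the Github API, which returns a JSON payload with a <code>files</code> mapping; a minimal sketch (function names are illustrative, not the app's actual code):</p>

```python
# Sketch of the gist-loading step (function names are illustrative).
# The Github API returns JSON with a "files" dict mapping each filename
# to metadata that includes the raw "content".
import json
from urllib.request import urlopen

def gist_files(payload):
    """Extract {filename: content} from a Github gist API payload."""
    return {name: meta["content"] for name, meta in payload["files"].items()}

def load_gist(gist_id):
    """Fetch a gist by id and return its files as {filename: content}."""
    with urlopen("https://api.github.com/gists/%s" % gist_id) as response:
        return gist_files(json.load(response))
```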
<p>So I created a small web app, in <code>Python</code> of course, using <a href="http://flask.pocoo.org/">Flask</a> and deployed on <a href="http://heroku.com">Heroku</a>.
It just gets a gist number, calls the Github APIs to load the files, and displays them in the browser:</p>
<ul>
<li>Application website: <a href="http://clviewer.herokuapp.com">http://clviewer.herokuapp.com</a></li>
<li>Example input data: <a href="https://gist.github.com/zonca/6599016">https://gist.github.com/zonca/6599016</a></li>
<li>Example interactive plot: <a href="http://clviewer.herokuapp.com/6599016">http://clviewer.herokuapp.com/6599016</a></li>
<li>Source: <a href="https://github.com/zonca/clviewer">https://github.com/zonca/clviewer</a></li>
</ul>Planck CMB map at high resolution2013-09-10T14:00:00-07:002013-09-10T14:00:00-07:00Andrea Zoncatag:zonca.github.io,2013-09-10:/2013/09/Planck-CMB-map-at-high-resolution.html<p>Prompted by a colleague, I created a high-resolution version of the Cosmic Microwave Background map in MollWeide projection released by the Planck collaboration, available on the <a href="http://irsa.ipac.caltech.edu/data/Planck/release_1/all-sky-maps/previews/COM_CompMap_CMB-smica_2048_R1.20/index.html">Planck Data Release Website</a> in FITS format.</p>
<p>The map is a PNG at a resolution of 17469x8796 pixels, which is suitable for printing at …</p><p>Prompted by a colleague, I created a high-resolution version of the Cosmic Microwave Background map in MollWeide projection released by the Planck collaboration, available on the <a href="http://irsa.ipac.caltech.edu/data/Planck/release_1/all-sky-maps/previews/COM_CompMap_CMB-smica_2048_R1.20/index.html">Planck Data Release Website</a> in FITS format.</p>
<p>The map is a PNG at a resolution of 17469x8796 pixels, which is suitable for printing at 300dpi up to 60x40 inch, or 150x100 cm, file size is about 150MB.</p>
<p><em>Update</em>: now with Planck color scale</p>
<p><em>Update</em>: the previous version had grayed-out pixels in the galactic plane, representing the fraction of the sky that cannot be reconstructed due to bright galactic sources. The latest version uses inpainting to create a constrained CMB realization with the same statistics as the observed CMB to fill the unobserved pixels; more details in the <a href="http://www.sciops.esa.int/wikiSI/planckpla/index.php?title=CMB_and_astrophysical_component_maps&instance=Planck_Public_PLA">Planck Explanatory Supplement</a>. </p>
<ul>
<li>
<p><a href="http://dx.doi.org/10.6084/m9.figshare.795296">High Resolution image on FigShare</a></p>
</li>
<li>
<p>Small size preview:</p>
</li>
</ul>
<p><img alt="Preview of Planck CMB map" src="/images/Planck-CMB-map-at-high-resolution_planck_cmb_map.jpg"></p>
<ul>
<li>Python code:</li>
</ul>
<script src="https://gist.github.com/zonca/6515744.js"></script>Run Hadoop Python jobs on Amazon with MrJob2013-09-02T02:36:00-07:002013-09-02T02:36:00-07:00Andrea Zoncatag:zonca.github.io,2013-09-02:/2013/09/run-hadoop-python-jobs-on-amazon-with-mrjob.html<p><br/>
First we need to install mrjob with:
<br/>
<blockquote class="tr_bq">
pip install mrjob
</blockquote>
I am starting with a simple example of word counting. Previously I implemented this directly using the hadoop streaming interface, therefore mapper and reducer were scripts that read from standard input and print to standard output, see mapper.py and …</p><p><br/>
First we need to install mrjob with:
<br/>
<blockquote class="tr_bq">
pip install mrjob
</blockquote>
I am starting with a simple example of word counting. Previously I implemented this directly using the Hadoop streaming interface, so the mapper and reducer were scripts that read from standard input and print to standard output; see mapper.py and reducer.py in:
<br/>
<br/>
<a href="https://github.com/zonca/python-wordcount-hadoop">
https://github.com/zonca/python-wordcount-hadoop
</a>
<br/>
<br/>
With MrJob the interface is instead a little different: we implement the mapper method of our MRJob subclass, which already gets a "line" argument, and yield the output as tuples like ("word", 1).
<br/>
<div>
MrJob makes the implementation of the reducer particularly simple. Using hadoop-streaming directly, we also needed to first parse the output of the mapper back into python objects, while MrJob does it for you and directly provides the key and the iterator of counts, which we just need to sum.
</div>
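<p>Stripped of the MRJob class scaffolding, the word-count logic boils down to the following sketch (in the real code these are methods of an MRJob subclass):</p>

```python
# Word-count mapper/reducer pattern sketched as plain functions
# (in the real code these are methods of an MRJob subclass).
import re

WORD_RE = re.compile(r"[\w']+")

def mapper(line):
    # emit ("word", 1) for every word found in the input line
    for word in WORD_RE.findall(line):
        yield word.lower(), 1

def reducer(word, counts):
    # MrJob groups the mapper output by key and hands us an iterator of counts
    yield word, sum(counts)
```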
<div>
<br/>
<a name="more">
</a>
</div>
<div>
The code is pretty simple:
<br/>
<br/>
<script src="http://gist-it.appspot.com/github/zonca/python-wordcount-hadoop/blob/master/mrjob/word_count_mrjob.py">
</script>
<div>
<br/>
</div>
First we can test locally with 2 different methods, either:
<br/>
<br/>
<blockquote class="tr_bq">
python word_count_mrjob.py gutemberg/20417.txt.utf-8
</blockquote>
<br/>
or:
<br/>
<br/>
<blockquote class="tr_bq">
python word_count_mrjob.py --runner=local gutemberg/20417.txt.utf-8
</blockquote>
<br/>
The first is a simple local test; the second sets some Hadoop variables and uses multiprocessing to run the mapper in parallel.
<br/>
<div>
<br/>
</div>
<span style="font-size: large;">
Run on Amazon Elastic Map Reduce
</span>
<br/>
<br/>
</div>
<div>
Next step is submitting the job to EMR.
<br/>
First get an account on Amazon Web Services from
<a href="http://aws.amazon.com/">
aws.amazon.com
</a>
.
<br/>
<br/>
Setup MrJob with Amazon:
<br/>
<br/>
<a href="http://pythonhosted.org/mrjob/guides/emr-quickstart.html#amazon-setup">
http://pythonhosted.org/mrjob/guides/emr-quickstart.html#amazon-setup
</a>
<br/>
<br/>
<div>
Then we just need to choose the "emr" runner for MrJob to take care of:
</div>
<div>
<ul>
<li>
Copy the python module to Amazon S3, with requirements
</li>
<li>
Copy the input data to S3
</li>
<li>
Create a small EC2 instance (of course we could set it up to run 1000 instead)
</li>
<li>
Run Hadoop to process the jobs
</li>
<li>
Create a local web service that allows easy monitoring of the cluster
</li>
<li>
When completed, copy the results back (this can be disabled to just leave the results on S3).
</li>
</ul>
</div>
<div>
e.g.:
</div>
<blockquote class="tr_bq">
python word_count_mrjob.py --runner=emr --aws-region=us-west-2 gutemberg/20417.txt.utf-8
</blockquote>
<div>
It is important to make sure that the aws-region used by MrJob is the same one we used for creating the SSH key on the EC2 console in the MrJob configuration step, since SSH keys are region-specific.
<br/>
<br/>
<span style="font-size: large;">
Logs and output of the run
</span>
<br/>
<br/>
MrJob copies the needed files to S3:
<br/>
<blockquote class="tr_bq">
. runemr.sh
<br/>
using configs in /home/zonca/.mrjob.conf
<br/>
using existing scratch bucket mrjob-ecd1d07aeee083dd
<br/>
using s3://mrjob-ecd1d07aeee083dd/tmp/ as our scratch dir on S3
<br/>
creating tmp directory /tmp/mrjobjob.zonca.20130901.192250.785550
<br/>
Copying non-input files into s3://mrjob-ecd1d07aeee083dd/tmp/mrjobjob.zonca.20130901.192250.785550/files/
<br/>
Waiting 5.0s for S3 eventual consistency
<br/>
Creating Elastic MapReduce job flow
<br/>
Job flow created with ID: j-2E83MO9QZQILB
<br/>
Created new job flow j-2E83MO9QZQILB
</blockquote>
Creates the instances:
<br/>
<blockquote class="tr_bq">
Job launched 30.9s ago, status STARTING: Starting instances
<br/>
Job launched 123.9s ago, status BOOTSTRAPPING: Running bootstrap actions
<br/>
Job launched 250.5s ago, status RUNNING: Running step (mrjobjob.zonca.20130901.192250.785550: Step 1 of 1)
</blockquote>
Creates an SSH tunnel to the tracker:
<br/>
<blockquote class="tr_bq">
Opening ssh tunnel to Hadoop job tracker
<br/>
Connect to job tracker at: http://localhost:40630/jobtracker.jsp
</blockquote>
</div>
Therefore we can connect to that address to check realtime information about the cluster running on EC2, for example:
<br/>
<br/>
<div class="separator" style="clear: both; text-align: center;">
<a href="http://zonca.github.io/images/run-hadoop-python-jobs-on-amazon-with-mrjob_s1600_awsjobdetails.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;">
<img border="0" height="588" src="http://zonca.github.io/images/run-hadoop-python-jobs-on-amazon-with-mrjob_s640_awsjobdetails.png" width="640"/>
</a>
</div>
<br/>
Once the job completes, MrJob copies the output back to the local machine, here are few lines from the file:
<br/>
<blockquote class="tr_bq">
"maladies"	1<br/>
"malaria"	5<br/>
"male"	18<br/>
"maleproducing"	1<br/>
"males"	5<br/>
"mammal"	10<br/>
"mammalInstinctive"	1<br/>
"mammalian"	4<br/>
"mammallike"	1<br/>
"mammals"	87<br/>
"mammoth"	5<br/>
"mammoths"	1<br/>
"man"	152
</blockquote>
I've been positively impressed that it is so easy to implement and run a MapReduce job with MrJob without the need to directly manage EC2 instances or the Hadoop installation.
<br/>
This same setup could be used on GB of data with hundreds of instances.
</div></p>Interactive figures in the browser: CMB Power Spectra2013-08-30T08:52:00-07:002013-08-30T08:52:00-07:00Andrea Zoncatag:zonca.github.io,2013-08-30:/2013/08/interactive-figures-planck-power-spectra.html<p>
For a long time I've been curious about trying out
<span style="font-family: Courier New, Courier, monospace;">
d3.js
</span>
, the javascript plotting library which is becoming the standard for interactive plotting in the browser.
<br/>
</p>
<div>
<br/>
</div>
<div>
What is really appealing is the capability of sharing with other people powerful interactive visualization simply via the link to a web page …</div><p>
For a long time I've been curious about trying out
<span style="font-family: Courier New, Courier, monospace;">
d3.js
</span>
, the javascript plotting library which is becoming the standard for interactive plotting in the browser.
<br/>
</p>
<div>
<br/>
</div>
<div>
What is really appealing is the capability of sharing with other people powerful interactive visualization simply via the link to a web page. This will hopefully be the future of scientific publications, as envisioned, for example, by
<a href="https://www.authorea.com/">
Authorea
</a>
.
</div>
<div>
<a name="more">
</a>
An interesting example related to my work on Planck is a plot of the many angular power spectra of the anisotropies of the Cosmic Microwave Background temperature.
</div>
<div>
The CMB Power spectra describe how the temperature fluctuations were distributed in the sky as a function of the angular scale, for example the largest peak at about 1 degree means that the brightest cold/warm spots of the CMB have that angular size, see
<a href="http://www.strudel.org.uk/blog/astro/001030.shtml">
The Universe Simulator in the browser
</a>
.
</div>
<div>
The
<a href="http://irsa.ipac.caltech.edu/data/Planck/release_1/ancillary-data/">
Planck Collaboration released
</a>
a combined spectrum, which aggregates several channels to give the best result, spectra frequency by frequency (for some frequencies split in detector-sets) and a best-fit spectrum given a Universe Model.
</div>
<div>
It is also interesting to compare to the latest release spectrum by WMAP with 9 years of data.
</div>
<div>
<br/>
</div>
<div>
The plan is to create a visualization where it is easier to zoom to different angular scales on the horizontal axis and quickly show/hide each curve.
</div>
<div>
For this I used
<a href="http://code.shutterstock.com/rickshaw/">
rickshaw
</a>
, a library based on
<span style="font-family: Courier New, Courier, monospace;">
d3.js
</span>
<span style="font-family: inherit;">
which makes it easier to create time-series plots.
</span>
</div>
<div>
<span style="font-family: inherit;">
In fact most of the features are already implemented, it is just a matter of configuring them, see the code on github:
</span>
<a href="https://github.com/zonca/visualize-planck-cl">
https://github.com/zonca/visualize-planck-cl
</a>
</div>
<div>
The most complex task is actually to load all the data, previously converted to JSON, in the background from the server and push them into a data structure which is understood by rickshaw.
</div>
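<p>The series structure rickshaw understands is a dict with a name and a data list of {x, y} points; the conversion step can be sketched as follows (names and the numbers below are illustrative):</p>

```python
# Sketch: convert one power spectrum into the series structure rickshaw
# understands, a dict with a "name" and a "data" list of {x, y} points
# (names and the numbers below are illustrative).
import json

def spectrum_to_series(name, ell, cl):
    return {"name": name,
            "data": [{"x": int(l), "y": float(c)} for l, c in zip(ell, cl)]}

series = spectrum_to_series("example", [2, 3, 4], [1.2e3, 9.8e2, 8.1e2])
payload = json.dumps([series])  # ship this to the browser
```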
<div>
<br/>
</div>
<div>
Check out the result:
</div>
<div style="text-align: center;">
<b>
<a href="http://bit.ly/planck-spectra">
http://bit.ly/planck-spectra
</a>
</b>
</div>
<div>
<br/>
</div>Planck CTP angular power spectrum ell binning2013-08-20T23:03:00-07:002013-08-20T23:03:00-07:00Andrea Zoncatag:zonca.github.io,2013-08-20:/2013/08/planck-ctp-angular-power-spectrum-ell.html<p>
Planck released a binning of the angular power spectrum in the Explanatory supplement,
<br/>
unfortunately the file is in PDF format, not easily machine-readable:
<br/>
<br/>
<a href="http://www.sciops.esa.int/wikiSI/planckpla/index.php?title=Frequency_maps_angular_power_spectra&instance=Planck_Public_PLA">
http://www.sciops.esa.int/wikiSI/planckpla/index.php?title=Frequency_maps_angular_power_spectra&instance=Planck_Public_PLA
</a>
<br/>
<br/>
So here is a csv version:
<br/>
<a href="https://gist.github.com/zonca/6288439">
https://gist.github.com/zonca/6288439
</a>
<br/>
<br/>
Follows embedded …</p><p>
Planck released a binning of the angular power spectrum in the Explanatory supplement,
<br/>
unfortunately the file is in PDF format, not easily machine-readable:
<br/>
<br/>
<a href="http://www.sciops.esa.int/wikiSI/planckpla/index.php?title=Frequency_maps_angular_power_spectra&instance=Planck_Public_PLA">
http://www.sciops.esa.int/wikiSI/planckpla/index.php?title=Frequency_maps_angular_power_spectra&instance=Planck_Public_PLA
</a>
<br/>
<br/>
So here is a csv version:
<br/>
<a href="https://gist.github.com/zonca/6288439">
https://gist.github.com/zonca/6288439
</a>
<br/>
<br/>
Follows embedded gist.
<br/>
<br/>
<a name="more">
</a>
<br/>
<br/>
<br/>
<script src="https://gist.github.com/zonca/6288439.js">
</script>
</p>HEALPix map of the Earth using healpy2013-08-08T19:07:00-07:002013-08-08T19:07:00-07:00Andrea Zoncatag:zonca.github.io,2013-08-08:/2013/08/healpix-map-of-earth-using-healpy.html<p>
HEALPix maps can also be used to create equal-area pixelized maps of the Earth; RGB colors are not supported in healpy, so we need to convert the image to a single color scale.
<br/>
The best use case is applying spherical harmonic transforms, e.g. a smoothing filter, in this case HEALPix …</p><p>
HEALPix maps can also be used to create equal-area pixelized maps of the Earth; RGB colors are not supported in healpy, so we need to convert the image to a single color scale.
<br/>
The best use case is applying spherical harmonic transforms, e.g. a smoothing filter, for which HEALPix/healpy tools are really efficient.
<br/>
However, other tools for transforming between angles (coordinates), 3d vectors and pixels might be useful.
<br/>
<br/>
<a name="more">
</a>
<br/>
I've created an IPython notebook that provides a simple example:
<br/>
<br/>
<a href="http://nbviewer.ipython.org/6187504">
http://nbviewer.ipython.org/6187504
</a>
<br/>
<br/>
Here is the output Mollweide projection provided by healpy:
<br/>
<br/>
</p>
<div class="separator" style="clear: both; text-align: center;">
<a href="http://zonca.github.io/images/healpix-map-of-earth-using-healpy_s1600_download.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;">
<img border="0" height="230" src="http://zonca.github.io/images/healpix-map-of-earth-using-healpy_s400_download.png" width="400"/>
</a>
</div>
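<p>The conversion to a single channel mentioned above can be done with a standard luminance weighting; a numpy sketch of just this step (the notebook linked above does the full pixelization):</p>

```python
# Sketch: collapse an (ny, nx, 3) RGB image to a single luminance channel,
# since a healpy map holds one scalar value per pixel
# (standard Rec. 601 luma weights).
import numpy as np

def rgb_to_gray(rgb):
    weights = np.array([0.299, 0.587, 0.114])
    return rgb[..., :3] @ weights
```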
<p><br/>
Few notes:
<br/>
<br/>
<div>
</div>
<br/>
<ul style="-webkit-text-stroke-width: 0px; color: black; font-family: 'Times New Roman'; font-size: medium; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px;">
<li>
always use
<span style="font-family: Courier New, Courier, monospace;">
flip="geo"
</span>
for plotting, otherwise maps are flipped East-West
</li>
<li>
increase the resolution of the plots (which is different from the resolution of the map array) by providing at least xsize=2000 to mollview and a reso lower than 1 to gnomview
</li>
</ul></p>Export google analytics data via API with Python2013-08-04T17:47:00-07:002013-08-04T17:47:00-07:00Andrea Zoncatag:zonca.github.io,2013-08-04:/2013/08/export-google-analytics-data-via-api.html<p>
Fun weekend hacking project: export google analytics data using the google APIs.
<br/>
<br/>
Clone the latest version of the API client from:
<br/>
<br/>
<a href="https://code.google.com/p/google-api-python-client">
https://code.google.com/p/google-api-python-client
</a>
<br/>
<br/>
there is an example for accessing analytics APIs in the samples/analytics folder,
<br/>
but you need to fill in client_secrets.json.
<br/>
<br/>
You can …</p><p>
Fun weekend hacking project: export google analytics data using the google APIs.
<br/>
<br/>
Clone the latest version of the API client from:
<br/>
<br/>
<a href="https://code.google.com/p/google-api-python-client">
https://code.google.com/p/google-api-python-client
</a>
<br/>
<br/>
there is an example for accessing analytics APIs in the samples/analytics folder,
<br/>
but you need to fill in client_secrets.json.
<br/>
<br/>
You can get the credentials from the APIs console:
<br/>
<br/>
<a href="https://code.google.com/apis/console">
https://code.google.com/apis/console
</a>
<br/>
<br/>
In SERVICES: activate google analytics
<br/>
In API Access: Create a "Client ID for installed applications" choosing "Other" as a platform
<br/>
<br/>
Copy the client id and the client secret to client_secrets.json.
<br/>
<br/>
<a name="more">
</a>
<br/>
Now you only need the profile ID of the google analytics account; it is in the google analytics web interface: just choose the website, then click on Admin, then on the profile name in the profile tab, and then on profile settings.
<br/>
<br/>
You can then run:
<br/>
<br/>
</p>
<blockquote class="tr_bq">
python core_reporting_v3_reference.py ga:PROFILEID
</blockquote>
<p>The first time you run it, it will open a browser for authentication, but then the auth token is saved and used for future requests.
<br/>
<br/>
This retrieves from the APIs the visits to the website from search, with keywords and the number of visits, for example for my blog:
<br/>
<br/>
<div class="highlight"><pre>Total Metrics For All Results:
This query returned 25 rows.
But the query matched 30 total results.
Here are the metric totals for the matched total results.
Metric Name = ga:visits
Metric Total = 174
Rows:
google    (not provided)                  121
google    andrea zonca                     17
google    butterworth filter python         4
google    andrea zonca blog                 2
google    healpix for ubuntu                2
google    healpy install ubuntu             2
google    python butterworth filter         2
google    zonca andrea                      2
google    andrea zonca buchrain luzern      1
google    andrea zonca it                   1
google    astrofisica in pillole            1
google    bin data healpy                   1
google    ellipticity fwhm                  1
google    enthought and healpy              1
google    fwhm                              1
google    healpix apt-get                   1
google    healpix repository ubuntu         1
google    healpix ubuntu 12.04 install      1
google    healpy ubuntu                     1
google    install healpix ubuntu            1
google    ipython cluster task output       1
google    numpy pink noise                  1
google    pink noise numpy                  1
google    python 1/f noise                  1
google    python apply mixin                1</pre></div>
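<p>The rows above come back from the API as lists of strings, so they are easy to post-process in Python. A minimal sketch with made-up sample rows (not real API output), aggregating visits per source:</p>

```python
# Illustrative post-processing of (source, keyword, visits) rows,
# with fabricated sample data -- not the actual API client code.
from collections import defaultdict

def total_visits_per_source(rows):
    """Sum the visits column (last field) grouped by source (first field)."""
    totals = defaultdict(int)
    for source, keyword, visits in rows:
        totals[source] += int(visits)
    return dict(totals)

rows = [
    ["google", "(not provided)", "121"],
    ["google", "andrea zonca", "17"],
    ["bing", "healpy", "3"],
]
print(total_visits_per_source(rows))  # {'google': 138, 'bing': 3}
```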
<div>
<br/>
</div></p>Processing sources in Planck maps with Hadoop and Python2013-07-15T08:16:00-07:002013-07-15T08:16:00-07:00Andrea Zoncatag:zonca.github.io,2013-07-15:/2013/07/processing-planck-sources-with-hadoop.html<h2>
Purpose
</h2>
<div>
The purpose of this post is to investigate how to process in parallel the sources extracted from full sky maps, in this case the maps released by Planck, using Hadoop instead of more traditional MPI-based HPC custom software.
</div>
<div>
Hadoop is the MapReduce implementation most used in the enterprise world and …</div><h2>
Purpose
</h2>
<div>
The purpose of this post is to investigate how to process in parallel the sources extracted from full sky maps, in this case the maps released by Planck, using Hadoop instead of more traditional MPI-based HPC custom software.
</div>
<div>
Hadoop is the MapReduce implementation most used in the enterprise world, and it has traditionally been used to process huge amounts of text data (~TBs), e.g. web pages or logs, over thousands of commodity computers connected over Ethernet.
</div>
<div>
It distributes the data across the nodes on a distributed file-system (HDFS) and then analyzes them locally on each node (the "map" step). The output of the map step is traditionally a set of text (key, value) pairs, which are sorted by the framework and passed to the "reduce" step, which typically aggregates them and saves them back to the distributed file-system.
</div>
<div>
Hadoop makes this process robust by rerunning failed jobs, distributing the data with redundancy and re-distributing it in case of failures, among many other features.
</div>
<div>
Most scientists use HPC supercomputers for running large data-processing software. Using HPC is necessary for algorithms that require frequent communication across the nodes, implemented via MPI calls over a dedicated high-speed network (e.g. InfiniBand). However, HPC resources are often used to run a large number of loosely coupled jobs, i.e. each job runs mostly independently of the others, with only some aggregation performed at the end. In these cases a robust and flexible framework like Hadoop can be beneficial.
</div>
<div>
<a name="more">
</a>
</div>
<h2>
Problem description
</h2>
<div>
The Planck collaboration (btw I'm part of it...) released in May 2013 a set of full sky maps in Temperature at 9 different frequencies and catalogs of point and extended galactic and extragalactic sources:
</div>
<div>
<a href="http://irsa.ipac.caltech.edu/Missions/planck.html">
http://irsa.ipac.caltech.edu/Missions/planck.html
</a>
</div>
<div>
Each catalog contains about 1000 sources, and the collaboration released the location and flux of each source.
</div>
<div>
The purpose of the analysis is to read each of the sky maps, slice out the section of the map around each source and perform some analysis on that patch of sky. As a simple example, to test the infrastructure, I am just going to compute the mean of the pixels located within 10 arcminutes of the center of each source.
</div>
<div>
In a production run, we might for example run aperture photometry on each source, or fit for the source center to check pointing accuracy.
</div>
<h2>
Sources
</h2>
<p>All files are available on github:
<br/>
<div>
<a href="https://github.com/zonca/planck-sources-hadoop">
https://github.com/zonca/planck-sources-hadoop
</a>
</div>
<h2>
Hadoop setup
</h2>
<div>
I am running on the San Diego Supercomputing data intensive cluster Gordon:
</div>
<div>
<a href="http://www.sdsc.edu/us/resources/gordon/">
http://www.sdsc.edu/us/resources/gordon/
</a>
</div>
<div>
SDSC has a simplified Hadoop setup based on shell scripts,
<a href="http://www.sdsc.edu/us/resources/gordon/gordon_hadoop.html">
myHadoop
</a>
, which allows running Hadoop as a regular PBS job.
</div>
<div>
The most interesting feature is that the Hadoop distributed file-system HDFS is set up on the low-latency local flash drives, one of the distinctive features of Gordon.
</div>
<h3>
Using Python with Hadoop-streaming
</h3>
<div>
Hadoop applications natively run in Java; however, thanks to Hadoop-streaming, we can use stdin and stdout to communicate with a script implemented in any programming language.
</div>
<div>
One of the most common choices for scientific applications is Python.
</div>
<h3>
Application design
</h3>
<div>
The best way to decrease the coupling between parallel jobs for this application is to analyze a patch of sky at a time, instead of one source at a time, looping through all the sources in that region.
</div>
<div>
This way the largest amount of data, the sky map, is read only once per process, and all of its sources are processed together. I pre-process the sky map by splitting it into 10x10 degree patches, saving a 2-column array with pixel index and map temperature (
<a href="https://github.com/zonca/planck-sources-hadoop/blob/master/preprocessing.py">
preprocessing.py
</a>
).
</div>
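<div>
The patch-splitting step can be sketched as follows. This is an illustrative stand-in with made-up data, not the actual preprocessing.py, which works on HEALPix maps:
</div>

```python
# Illustrative sketch of the patch-splitting idea (not the actual
# preprocessing.py): assign each pixel to a 10x10 degree patch keyed
# by the truncated longitude/latitude of its center.
import numpy as np

def patch_key(lon_deg, lat_deg, patch_size=10):
    """Return a (lon_bin, lat_bin) patch identifier for a pixel center."""
    lon_bin = int(lon_deg % 360) // patch_size * patch_size
    lat_bin = (int(lat_deg + 90) // patch_size) * patch_size - 90
    return lon_bin, lat_bin

# group (pixel index, temperature) pairs by patch (fabricated sample data)
pixels = np.arange(4)
lon = np.array([3.0, 12.5, 3.2, 355.0])
lat = np.array([41.0, 47.0, 42.0, -2.0])
temp = np.array([4.5e-4, 3.4e-4, 4.7e-4, 3.8e-4])

patches = {}
for p, lo, la, t in zip(pixels, lon, lat, temp):
    patches.setdefault(patch_key(lo, la), []).append((p, t))
```

<div>
Each patch would then be saved to its own binary file, named after the patch key.
</div>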
<div>
Of course this will produce jobs of very different lengths, due to the different effective sky area at the poles and at the equator and to the varying number of sources per patch, but that is not something we need to worry about: load balancing is exactly what Hadoop takes care of.
</div>
<h2>
Implementation
</h2>
<h3>
Input data
</h3>
<div>
The pre-processed patches of sky are available in binary format on a lustre file-system shared by the processes.
</div>
<div>
Therefore the text input files for the Hadoop jobs are just the lists of filenames of the sky patches, one per row.
</div>
<h3>
Mapper
</h3>
<div>
<a href="https://github.com/zonca/planck-sources-hadoop/blob/master/mapper.py">
mapper.py
</a>
</div>
<div>
<br/>
</div>
<div>
The mapper is fed by Hadoop via stdin with a number of lines extracted from the input files and returns a (key, value) text output for each source and for each statistic we compute on the source.
</div>
<div>
In this simple scenario, the only returned key printed to stdout is "SOURCENAME_10arcminmean".
</div>
<div>
For example, we can run a serial test by running:
</div>
<div>
<br/>
</div>
<div>
<div>
<span style="font-family: Courier New, Courier, monospace;">
echo plancktest/submaps/030_045_025 | ./mapper.py
</span>
</div>
</div>
<div>
<span style="font-family: Courier New, Courier, monospace;">
<br/>
</span>
</div>
<div>
<span style="font-family: inherit;">
and the returned output is:
</span>
</div>
<div>
<span style="font-family: inherit;">
<br/>
</span>
</div>
<div>
<div>
<span style="font-family: Courier New, Courier, monospace;">
PCCS1 030 G023.00+40.77_10arcminmean
<span class="Apple-tab-span" style="white-space: pre;">
</span>
4.49202e-04
</span>
</div>
<div>
<span style="font-family: Courier New, Courier, monospace;">
PCCS1 030 G023.13+42.14_10arcminmean
<span class="Apple-tab-span" style="white-space: pre;">
</span>
3.37773e-04
</span>
</div>
<div>
<span style="font-family: Courier New, Courier, monospace;">
PCCS1 030 G023.84+45.26_10arcminmean
<span class="Apple-tab-span" style="white-space: pre;">
</span>
4.69427e-04
</span>
</div>
<div>
<span style="font-family: Courier New, Courier, monospace;">
PCCS1 030 G024.32+48.81_10arcminmean
<span class="Apple-tab-span" style="white-space: pre;">
</span>
3.79832e-04
</span>
</div>
<div>
<span style="font-family: Courier New, Courier, monospace;">
PCCS1 030 G029.42+43.41_10arcminmean
<span class="Apple-tab-span" style="white-space: pre;">
</span>
4.11600e-04
</span>
</div>
<div style="font-family: inherit;">
<br/>
</div>
</div>
<h3>
Reducer
</h3>
<div>
There is no need for a reducer in this scenario, so Hadoop will just use the default IdentityReducer, which simply aggregates all the mappers' outputs into a single output file.
</div>
<h3>
Hadoop call
</h3>
<div>
<a href="https://github.com/zonca/planck-sources-hadoop/blob/master/run.pbs">
run.pbs
</a>
</div>
<div>
<br/>
</div>
<div>
The hadoop call is:
</div>
<div>
<br/>
</div>
<div>
<div>
<span style="font-family: Courier New, Courier, monospace;">
<code>
$HADOOP_HOME/bin/hadoop --config $HADOOP_CONF_DIR jar $HADOOP_HOME/contrib/streaming/hadoop-streaming*.jar -file $FOLDER/mapper.py -mapper $FOLDER/mapper.py -input /user/$USER/Input/* -output /user/$USER/Output
</code>
</span>
</div>
</div>
<div>
<br/>
</div>
<div>
So we are using the Hadoop-streaming interface and providing just the mapper; the input text files (the lists of sky patches) have already been copied to HDFS, and the output then needs to be copied from HDFS back to the local file-system, see run.pbs.
</div>
<h2>
Hadoop run and results
</h2>
<div>
For testing purposes we have used just 2 of the 9 maps (30 and 70 GHz), and processed a total of ~2000 sources running Hadoop on 4 nodes.
</div>
<div>
Processing takes about 5 minutes. Hadoop automatically chooses the number of mappers, and in this case only uses 2, as I think it reserves a couple of nodes for the scheduler and auxiliary processes.
</div>
<div>
The outputs of the mappers are then joined, sorted and written to a single file; see the output file
</div>
<div>
<a href="https://github.com/zonca/planck-sources-hadoop/blob/master/output/SAMPLE_RESULT_part-00000">
output/SAMPLE_RESULT_part-00000
</a>
.
</div>
<div>
See the full log
<a href="https://github.com/zonca/planck-sources-hadoop/blob/master/sample_logs.txt">
sample_logs.txt
</a>
extracted running:
</div>
<div>
<span style="font-family: Courier New, Courier, monospace;">
/opt/hadoop/bin/hadoop job -history output
</span>
</div>
<h3>
<span style="font-family: inherit;">
Comparison of the results with the catalog
</span>
</h3>
<div>
<span style="font-family: inherit;">
Just for a rough consistency check, I compared the normalized temperatures computed with Hadoop using just the mean of the pixels in a radius of 10 arcmin to the fluxes computed by the Planck collaboration. I find a general agreement with the expected noise excess.
</span>
</div>
<div>
<br/>
<div class="separator" style="clear: both; text-align: left;">
<a href="http://zonca.github.io/images/processing-planck-sources-with-hadoop_s1600_download.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;">
<img border="0" src="http://zonca.github.io/images/processing-planck-sources-with-hadoop_s1600_download.png"/>
</a>
</div>
<h2>
Conclusion
</h2>
<div>
The main advantage of using Hadoop is scalability: this same setup could be used on AWS or Cloudera with hundreds of nodes. All the complexity of scaling is managed by Hadoop.
</div>
<div>
The main concern is loading the data: on an HPC supercomputer it is easy to load directly from a high-performance shared disk, while in a cloud environment we might opt for a similar setup loading data from S3. The best option, however, would be to use Hadoop itself and stream the data to the mapper in the input files. This is complicated by the fact that Hadoop-streaming only supports text, not binary, so the options would be either to find a way to pack the binary data into a text file or to use Hadoop-pipes instead of Hadoop-streaming.
</div>
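<div>
As a sketch of the text-packing option (an assumption on my part, not something used in this run): binary pixel data can be base64-encoded into one text line per patch, which fits through Hadoop-streaming's text-only channel.
</div>

```python
# Sketch: round-trip a binary array through a single base64 text line,
# so it could travel through Hadoop-streaming's text-only interface.
import base64
import numpy as np

data = np.array([4.5e-4, 3.4e-4, 4.7e-4])
line = "patch_030_045\t" + base64.b64encode(data.tobytes()).decode("ascii")

# the mapper would split the line and decode the payload
name, payload = line.split("\t")
decoded = np.frombuffer(base64.b64decode(payload))
assert np.array_equal(decoded, data)
```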
<div>
<br/>
</div>
</div></p>How to use the IPython notebook on a small computing cluster2013-06-22T11:12:00-07:002013-06-22T11:12:00-07:00Andrea Zoncatag:zonca.github.io,2013-06-22:/2013/06/how-to-use-ipython-notebook-on-small.html<p><a href="http://ipython.org/ipython-doc/dev/interactive/htmlnotebook.html">The IPython notebook</a> is a powerful and easy to use interface for using Python and particularly useful when running remotely, because it allows the interface to run locally in your browser, while the computing kernel runs remotely on the cluster.</p>
<h2>1) Configure IPython notebook:</h2>
<p>First time you use the notebook …</p><p><a href="http://ipython.org/ipython-doc/dev/interactive/htmlnotebook.html">The IPython notebook</a> is a powerful and easy to use interface for using Python and particularly useful when running remotely, because it allows the interface to run locally in your browser, while the computing kernel runs remotely on the cluster.</p>
<h2>1) Configure IPython notebook:</h2>
<p>The first time you use the notebook you need to follow these configuration steps:</p>
<ul>
<li>Login to the cluster</li>
<li>
<p>Load the python environment, for example:</p>
<div class="highlight"><pre><span></span><span class="err">module load pythonEPD</span>
</pre></div>
</li>
<li>
<p>Create the profile files:</p>
<div class="highlight"><pre><span></span><span class="err">ipython profile create # creates the configuration files</span>
<span class="err">vim .ipython/profile_default/ipython_notebook_config.py</span>
</pre></div>
<p>set a password, see instructions in the file.</p>
</li>
<li>
<p>Change the port to something specific to you, <strong>please change this to avoid conflict with other users</strong>:</p>
<div class="highlight"><pre><span></span><span class="err">c.NotebookApp.port = 8900</span>
</pre></div>
</li>
<li>
<p>Set a certificate to serve the notebook over https:</p>
<div class="highlight"><pre><span></span><span class="err">c.NotebookApp.certfile = u'/home/zonca/mycert.pem'</span>
</pre></div>
<p>or create a new certificate, see <a href="http://ipython.org/ipython-doc/dev/interactive/htmlnotebook.html">the documentation</a></p>
</li>
<li>
<p>Set:</p>
<div class="highlight"><pre><span></span><span class="err">c.NotebookApp.open_browser = False</span>
</pre></div>
</li>
</ul>
<h2>2) Run the notebook for testing on the login node.</h2>
<p>You can use IPython notebook on the login node if you do not use much memory, e.g. < 300MB.
SSH into the login node, then at the terminal run:</p>
<div class="highlight"><pre><span></span><span class="err">ipython notebook --pylab=inline</span>
</pre></div>
<p>open the browser on your local machine and connect to (always use https, replace 8900 with your port):</p>
<div class="highlight"><pre><span></span><span class="c">https://LOGINNODEURL:8900</span>
</pre></div>
<p>Dismiss all the browser complaints about the certificate and go ahead.</p>
<h2>3) Run the notebook on a computing node</h2>
<p>You should always use a computing node whenever you need a large amount of resources.</p>
<p>Create a folder <code>notebooks/</code> in your home and copy this script as <code>runipynb.pbs</code> into that folder:</p>
<script src="https://gist.github.com/zonca/5840518.js">
</script>
<p>replace <code>LOGINNODEURL</code> with the url of the login node of your cluster.</p>
<p>NOTICE: you need to ask the sysadmin to set <code>GatewayPorts yes</code> in <code>sshd_config</code> on the login node to allow access externally to the notebook.</p>
<p>Submit the job to the queue running:</p>
<div class="highlight"><pre><span></span><span class="err">qsub runipynb.pbs</span>
</pre></div>
<p>Then from your local machine connect to (replace 8900 with your port):</p>
<div class="highlight"><pre><span></span><span class="c">https://LOGINNODEURL:8900</span>
</pre></div>
<h2>Other introductory python resources</h2>
<ul>
<li><a href="http://scipy-lectures.github.io/">Scientific computing with Python</a>, large and detailed introduction to Python, Numpy, Matplotlib, Scipy</li>
<li>My <a href="https://github.com/zonca/PythonHPC">Python for High performance computing</a>: slides and few ipython notebook examples, see the README</li>
<li>My <a href="https://github.com/zonca/healpytut/blob/master/healpytut.pdf?raw=true">short Python and healpy tutorial</a></li>
</ul>IPython parallell setup on Carver at NERSC2013-04-11T05:53:00-07:002013-04-11T05:53:00-07:00Andrea Zoncatag:zonca.github.io,2013-04-11:/2013/04/ipython-parallell-setup-on-carver-at.html<p>
IPython parallel is one of the easiest ways to spawn several Python sessions on a Supercomputing cluster and process jobs in parallel.
<br/>
<br/>
On Carver, the basic setup is running a controller on the login node, and submitting engines to the computing nodes via PBS.
<br/>
<br/>
<a name="more">
</a>
<br/>
First create your configuration files running …</p><p>
IPython parallel is one of the easiest ways to spawn several Python sessions on a Supercomputing cluster and process jobs in parallel.
<br/>
<br/>
On Carver, the basic setup is running a controller on the login node, and submitting engines to the computing nodes via PBS.
<br/>
<br/>
<a name="more">
</a>
<br/>
First create your configuration files running:
<br/>
<br/>
<span style="font-family: Courier New, Courier, monospace;">
ipython profile create --parallel
</span>
<br/>
<br/>
Then in ~/.config/ipython/profile_default/ipcluster_config.py you just need to set:
<br/>
<br/>
<span style="font-family: Courier New, Courier, monospace;">
c.IPClusterStart.controller_launcher_class = 'LocalControllerLauncher'
</span>
<br/>
<span style="font-family: Courier New, Courier, monospace;">
c.IPClusterStart.engine_launcher_class = 'PBS'
</span>
<br/>
<span style="font-family: Courier New, Courier, monospace;">
c.PBSLauncher.batch_template_file = u'~/.config/ipython/profile_default/pbs.engine.template'
</span>
<br/>
<br/>
You also need to allow connections to the controller from other hosts, setting in ~/.config/ipython/profile_default/ipcontroller_config.py:
<br/>
<br/>
<span style="font-family: Courier New, Courier, monospace;">
c.HubFactory.ip = '*'
</span>
<br/>
</p>
<div>
<br/>
</div>
<p>The batch_template_file option above points to the PBS engine template.
<br/>
<br/>
Next a couple of examples of pbs templates, for 2 or 8 processes per node:
<script src="https://gist.github.com/zonca/5334225.js">
</script>
<br/>
IPython configuration does not seem to be flexible enough to add a parameter for specifying the processes per node.
<br/>
So I just created a bash script that gets as parameters the processes per node and the total number of nodes:
<br/>
<br/>
<span style="font-family: Courier New, Courier, monospace;">
ipc 8 2 # 2 nodes with 8ppn, 16 total engines
</span>
<br/>
<span style="font-family: Courier New, Courier, monospace;">
ipc 2 3 # 3 nodes with 2ppn, 6 total engines
</span>
<br/>
<br/>
<span style="font-family: inherit;">
Once the engines are running, jobs can be submitted opening an IPython shell on the login node and run:
</span>
<br/>
<span style="font-family: inherit;">
<br/>
</span>
<br/>
<div class="highlight"><pre>from IPython.parallel import Client
rc = Client()

lview = rc.load_balanced_view() # default load-balanced view

def serial_func(argument):
    pass

parallel_result = lview.map(serial_func, list_of_arguments)</pre></div>
<br/>
<div style="font-family: inherit;">
<br/>
</div>
<div>
<span style="font-family: inherit;">
The serial function is sent to the engines and executed for each element of the list of arguments.
</span>
</div>
<div>
<span style="font-family: inherit;">
If the function returns a value, then it is transferred back to the login node.
</span>
</div>
<div>
<span style="font-family: inherit;">
If the returned values are memory-consuming, it is also possible to keep the controller on the login node but execute the interactive IPython session in an interactive job.
</span>
</div>
<div style="font-family: inherit;">
<br/>
</div>
<div style="font-family: inherit;">
<br/>
</div></p>Simple Mixin usage in python2013-04-08T01:34:00-07:002013-04-08T01:34:00-07:00Andrea Zoncatag:zonca.github.io,2013-04-08:/2013/04/simple-mixin-usage-in-python.html<p>
One situation where Mixins are useful in Python is when you need to modify a method of similar classes that you are importing from a package.
<br/>
</p>
<div>
<br/>
</div>
<div>
For just a single class, it is easier to just create a derived class, but if the same modification must be applied to several …</div><p>
One situation where Mixins are useful in Python is when you need to modify a method of similar classes that you are importing from a package.
<br/>
</p>
<div>
<br/>
</div>
<div>
For just a single class, it is easier to just create a derived class, but if the same modification must be applied to several classes, then it is cleaner to implement this modification once in a Mixin and then apply it to all of them.
</div>
<div>
<br/>
<a name="more">
</a>
</div>
<div>
Here an example in Django:
</div>
<div>
<br/>
</div>
<div>
Django has several generic view classes that allow pulling objects from the database and feeding them to the HTML templates.
</div>
<div>
<br/>
</div>
<div>
One for example shows the detail of a specific object:
</div>
<div>
<br/>
</div>
<div>
<span style="font-family: Courier New, Courier, monospace;">
from django.views.generic.detail import DetailView
</span>
</div>
<div>
<div>
<br/>
</div>
<div>
This class has a get_object method that gets an object from the database given a primary key.
</div>
<div>
We need to modify this method to allow access to an object only to the user that owns it.
</div>
<div>
<br/>
</div>
<div>
We first implement a Mixin, i.e. an independent class that only implements the method we wish to override:
</div>
<div>
<br/>
</div>
<div class="highlight"><pre>class OwnedObjectMixin(object):
    def get_object(self, *args, **kwargs):
        obj = super(OwnedObjectMixin, self).get_object(*args, **kwargs)
        if not obj.user == self.request.user:
            raise Http404
        return obj</pre></div>
<div>
<br/>
</div>
</div>
<div>
<span style="font-family: inherit;">
Then we create a new derived class which inherits both from the Mixin and from the class we want to modify.
</span>
</div>
<div>
<span style="font-family: inherit;">
<br/>
</span>
</div>
<div class="highlight"><pre>class ProtectedDetailView(OwnedObjectMixin, DetailView):
    pass</pre></div>
<div>
<span style="font-family: Courier New, Courier, monospace;">
<br/>
</span>
</div>
<div>
This overrides the get_object method of DetailView with the one from OwnedObjectMixin, while the call to super invokes the get_object method of DetailView. It has the same effect as subclassing DetailView and overriding get_object, but the same Mixin can be applied to other classes.
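<br/>
<br/>
The same mechanism can be demonstrated in plain Python, outside Django, with hypothetical classes (just to illustrate the method resolution order):

```python
# Plain-Python illustration of the mixin pattern: the mixin's method runs
# first, delegates to the base class via super(), and adds a check on top.
class Base(object):
    def get_object(self):
        return "object"

class CheckingMixin(object):
    def get_object(self):
        result = super(CheckingMixin, self).get_object()
        return "checked:" + result

class Combined(CheckingMixin, Base):
    pass

print(Combined().get_object())  # checked:object
print([c.__name__ for c in Combined.__mro__])
# ['Combined', 'CheckingMixin', 'Base', 'object']
```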
</div>Noise in spectra and map domain2013-04-08T01:32:00-07:002013-04-08T01:32:00-07:00Andrea Zoncatag:zonca.github.io,2013-04-08:/2013/04/noise-in-spectra-and-map-domain.html<h3>
Spectra
</h3>
<p>NET or $\sigma$ is the standard deviation of the noise, measured in mK/sqrt(Hz), typical values for microwave amplifiers are 0.2-5.
<br/>
This is the natural unit of the amplitude spectra (ASD), therefore the high frequency tail of the ASD should get to the expected value of the …</p><h3>
Spectra
</h3>
<p>NET or $\sigma$ is the standard deviation of the noise, measured in mK/sqrt(Hz), typical values for microwave amplifiers are 0.2-5.
<br/>
This is the natural unit of the amplitude spectra (ASD), therefore the high frequency tail of the ASD should get to the expected value of the NET.
<br/>
NET can also be expressed in mK sqrt(s), which is NOT the same unit.
<br/>
<b>
mK/sqrt(Hz)
</b>
refers to an integration bandwidth of 1 Hz that assumes a 6 dB/octave rolloff; its integration time is only about 0.5 seconds.
<br/>
<b>
mK/sqrt(s)
</b>
instead refers to integration time of 1 second, therefore assumes a top hat bandpass.
<br/>
There is therefore a factor of sqrt(2) between the two conventions: mK/sqrt(Hz) = sqrt(2) * mK sqrt(s)
<br/>
See appendix B of Noise Properties of the Planck-LFI Receivers
<br/>
<a href="http://arxiv.org/abs/1001.4608">
http://arxiv.org/abs/1001.4608
</a>
<br/>
<h3>
Maps
</h3>
To estimate the map-domain noise we instead need to integrate the sigma over the time spent per pixel; in this case it is easier to convert the noise to sigma per sample, multiplying by the square root of the sampling frequency:
<br/>
<br/>
$\sigma_{\rm sample} = {\rm NET} \sqrt{f_{\rm samp}}$
<br/>
<br/>
Then the variance per pixel is $\sigma_{\rm sample}^2 / hits$.
<br/>
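<br/>
Putting the two formulas together with made-up instrument numbers (not Planck values), a minimal sketch:

```python
# Sketch of the map-domain noise estimate with fabricated numbers:
# NET in mK/sqrt(Hz), sampling frequency in Hz, hits per pixel.
import math

net = 0.5            # mK/sqrt(Hz)
f_samp = 32.5        # Hz
hits = 1000          # samples accumulated in one pixel

sigma_per_sample = net * math.sqrt(f_samp)      # mK per sample
pixel_variance = sigma_per_sample**2 / hits     # mK^2 per pixel
pixel_sigma = math.sqrt(pixel_variance)         # mK per pixel
```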
<h3>
Angular power spectra
</h3>
<div>
$C_\ell$ of the variance map is just the variance map multiplied by the pixel area divided by the integration time.
<br/>
<br/>
$$C_\ell = \Omega_{\rm pix} \langle \frac{\sigma^2}{\tau} \rangle = \Omega_{\rm pix} \langle \frac{\sigma^2 f_{\rm samp}}{hits} \rangle$$
</div></p>Basic fork/pull git workflow2013-04-06T07:52:00-07:002013-04-06T07:52:00-07:00Andrea Zoncatag:zonca.github.io,2013-04-06:/2013/04/basic-forkpull-git-workflow.html<div dir="ltr">
Typical simple workflow for a (github) repository with a few users.
</div>
<div dir="ltr">
<b>
<br/>
</b>
</div>
<div dir="ltr">
<b>
Permissions configuration:
</b>
</div>
<div dir="ltr">
Main developers have write access to the repository; occasional contributors are expected to fork and create pull requests.
</div>
<div dir="ltr">
</div>
<p><a name="more">
</a>
<br/>
<div dir="ltr">
<b>
Main developer:
</b>
Small bug fixes go directly into master:
</div>
<div dir="ltr">
<span style="font-family: Courier New, Courier, monospace;">
<br/>
</span>
</div>
<div dir="ltr">
<span style="font-family: Courier New, Courier, monospace;">
git checkout master
<br/>
# update from repository, better use rebase in …</span></div></p><div dir="ltr">
Typical simple workflow for a (github) repository with a few users.
</div>
<div dir="ltr">
<b>
<br/>
</b>
</div>
<div dir="ltr">
<b>
Permissions configuration:
</b>
</div>
<div dir="ltr">
Main developers have write access to the repository; occasional contributors are expected to fork and create pull requests.
</div>
<div dir="ltr">
</div>
<p><a name="more">
</a>
<br/>
<div dir="ltr">
<b>
Main developer:
</b>
Small bug fixes go directly into master:
</div>
<div dir="ltr">
<span style="font-family: Courier New, Courier, monospace;">
<br/>
</span>
</div>
<div dir="ltr">
<span style="font-family: Courier New, Courier, monospace;">
git checkout master
<br/>
# update from repository, better use rebase in case there are unpushed commits
<br/>
git pull --rebase
<br/>
git commit -m "commit message"
<br/>
git push
</span>
</div>
<div dir="ltr">
<br/>
</div>
<div dir="ltr">
More complex feature, better use a branch:
</div>
<div dir="ltr">
<span style="font-family: Courier New, Courier, monospace;">
<br/>
</span>
</div>
<div dir="ltr">
<span style="font-family: Courier New, Courier, monospace;">
git checkout -b featurebranch
<br/>
git commit -am "commit message"
<br/>
# work and make several commits
<br/>
# backup and share to github
<br/>
git push origin featurebranch
</span>
</div>
<div dir="ltr">
<span style="font-family: Courier New, Courier, monospace;">
<br/>
</span>
</div>
<div dir="ltr">
<span style="font-family: inherit;">
When ready to merge (cannot push cleanly anymore after any rebasing):
</span>
<span style="font-family: Courier New, Courier, monospace;">
<br/>
</span>
<br/>
<span style="font-family: inherit;">
<br/>
</span>
</div>
<div dir="ltr">
<span style="font-family: Courier New, Courier, monospace;">
# reorder, squash some similar commits, better commit msg
</span>
<br/>
<span style="font-family: Courier New, Courier, monospace;">
git rebase -i HEAD~10
</span>
<br/>
<span style="font-family: Courier New, Courier, monospace;">
# before merging move commits all together to the end of history
</span>
<br/>
<span style="font-family: Courier New, Courier, monospace;">
git rebase master
</span>
<br/>
<span style="font-family: Courier New, Courier, monospace;">
git checkout master
</span>
<br/>
<span style="font-family: Courier New, Courier, monospace;">
git merge featurebranch
</span>
<br/>
<span style="font-family: Courier New, Courier, monospace;">
git push
</span>
<br/>
<span style="font-family: Courier New, Courier, monospace;">
# branch is fully merged, no need to keep it
</span>
<br/>
<span style="font-family: Courier New, Courier, monospace;">
git branch -d featurebranch
</span>
<br/>
<span style="font-family: Courier New, Courier, monospace;">
git push origin --delete featurebranch
</span>
</div>
<div dir="ltr">
<br/>
</div>
<div dir="ltr">
Optionally, if the feature requires discussion within the team, it is better to create a pull request.
<br/>
After cleanup and rebase, instead of merging to master:
<br/>
<span style="font-family: Courier New, Courier, monospace;">
<br/>
</span>
</div>
<div dir="ltr">
<span style="font-family: Courier New, Courier, monospace;">
# create new branch
<br/>
git checkout -b readyfeaturebranch
<br/>
git push origin readyfeaturebranch
</span>
</div>
<div dir="ltr">
Go to GitHub and create a pull request from the new branch to master (GitHub now offers a shortcut for creating a pull request from the last pushed branch).
</div>
<div dir="ltr">
<br/>
</div>
<div dir="ltr">
During the discussion on the pull request, any commit to the readyfeaturebranch is added to the pull request.
<br/>
When ready, either merge automatically on GitHub or do it manually as above.
</div>
<div dir="ltr">
<br/>
</div>
<div dir="ltr">
<b>
For occasional developers:
</b>
Just fork the repo on github to their account, work on a branch there, and then create a pull request on the github web interface from the branch to master on the main repository.
</div></p>Interactive 3D plot of a sky map2013-03-12T19:49:00-07:002013-03-12T19:49:00-07:00Andrea Zoncatag:zonca.github.io,2013-03-12:/2013/03/interactive-3d-plot-of-sky-map.html<p><a href="http://code.enthought.com/projects/mayavi/">
Mayavi
</a>
is a Python package from Enthought for 3D visualization. Here is a simple example of creating an interactive 3D map from a HEALPix-pixelized sky map:
<br/>
<div>
<br/>
<div>
<div class="separator" style="clear: both; text-align: center;">
<a href="http://zonca.github.io/images/interactive-3d-plot-of-sky-map_s1600_snapshot.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;">
<img border="0" height="271" src="http://zonca.github.io/images/interactive-3d-plot-of-sky-map_s400_snapshot.png" width="400"/>
</a>
</div>
<div class="separator" style="clear: both; text-align: center;">
<br/>
</div>
<br/>
<a name="more">
</a>
<br/>
Here is the code:
<br/>
<script src="https://gist.github.com/zonca/5146356.js">
</script>
<br/>
<br/>
The output is a beautiful interactive 3D map; Mayavi allows you to pan, zoom, and rotate.
<br/>
UPDATE 13 Mar: there was a bug (found by Marius Millea) in the script; there is no problem in the projection!
<br/>
<div class="separator" style="clear: both; text-align: center;">
<br/>
</div>
Mayavi can be installed on Ubuntu by installing
<span style="font-family: Courier New, Courier, monospace;">
python-vtk
</span>
and then running
<span style="font-family: Courier New, Courier, monospace;">
sudo pip install mayavi.
</span>
</div>
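The essential step behind a plot like this is mapping each pixel's spherical coordinates (theta, phi) to Cartesian points on the unit sphere, which Mayavi then renders. A quick pure-Python sketch of that conversion — the theta/phi grid below is only a stand-in for what healpy's pix2ang would return, and the mlab call is indicative, not tested here:

```python
import math

def sph2cart(theta, phi, r=1.0):
    """Convert colatitude theta and longitude phi to Cartesian (x, y, z)."""
    return (r * math.sin(theta) * math.cos(phi),
            r * math.sin(theta) * math.sin(phi),
            r * math.cos(theta))

# Stand-in for healpy.pix2ang: a coarse theta/phi grid over the sphere.
points = [sph2cart(t * math.pi / 10, p * 2 * math.pi / 20)
          for t in range(11) for p in range(20)]

# With Mayavi one would then render the points, e.g.:
# from mayavi import mlab
# mlab.points3d(*zip(*points), scale_factor=0.05)
```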
</div></p>How to cite HDF5 in bibtex2013-02-27T00:42:00-08:002013-02-27T00:42:00-08:00Andrea Zoncatag:zonca.github.io,2013-02-27:/2013/02/how-to-cite-hdf5-in-bibtex.html<p>
Here is the BibTeX entry:
<br/>
<br/>
<script src="https://gist.github.com/zonca/5043796.js">
</script>
<br/>
Reference:
<br/>
<a href="http://www.hdfgroup.org/HDF5-FAQ.html#gcite">
http://www.hdfgroup.org/HDF5-FAQ.html#gcite
</a>
</p>Compile healpix C++ to javascript2013-01-28T21:06:00-08:002013-01-28T21:06:00-08:00Andrea Zoncatag:zonca.github.io,2013-01-28:/2013/01/tag:blogger.html<p>
Compile C++ -> LLVM with clang
<br/>
<br/>
Convert LLVM -> Javascript:
<br/>
<a href="https://github.com/kripken/emscripten/wiki/Tutorial">
https://github.com/kripken/emscripten/wiki/Tutorial
</a>
</p>Elliptic beams, FWHM and ellipticity2013-01-18T00:58:00-08:002013-01-18T00:58:00-08:00Andrea Zoncatag:zonca.github.io,2013-01-18:/2013/01/elliptic-beams-fwhm-and-ellipticity.html<p><span style="background-color: white; color: #222222; font-family: arial, sans-serif; font-size: 13px;">
The relationship between the Full Width Half Max, FWHM (min, max, and average) and the
</span>
<br/>
<span style="background-color: white; color: #222222; font-family: arial, sans-serif; font-size: 13px;">
ellipticity is:
</span>
<br/>
<br style="background-color: white; color: #222222; font-family: arial, sans-serif; font-size: 13px;"/>
<span style="font-family: Courier New, Courier, monospace;">
<span style="background-color: white; color: #222222; font-size: 13px;">
FWHM = sqrt(FWHM_min * FWHM_max)
</span>
<br style="background-color: white; color: #222222; font-size: 13px;"/>
<span style="background-color: white; color: #222222; font-size: 13px;">
e = FWHM_max/FWHM_min
</span>
</span>
<br/>
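These relations are straightforward to encode; a quick sketch (units are whatever the FWHMs are given in, e.g. arcmin):

```python
import math

def beam_fwhm_and_ellipticity(fwhm_min, fwhm_max):
    """Geometric-mean FWHM and ellipticity of an elliptic beam."""
    fwhm = math.sqrt(fwhm_min * fwhm_max)
    e = fwhm_max / fwhm_min
    return fwhm, e

# e.g. a 30 x 33 arcmin beam:
fwhm, e = beam_fwhm_and_ellipticity(30.0, 33.0)  # fwhm ~ 31.46, e = 1.1
```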
<span style="background-color: white; color: #222222; font-family: arial, sans-serif; font-size: 13px;">
<br/>
</span></p>Ubuntu PPA for HEALPix and healpy2012-12-17T10:37:00-08:002012-12-17T10:37:00-08:00Andrea Zoncatag:zonca.github.io,2012-12-17:/2012/12/ubuntu-ppa-for-healpix-and-healpy.html<p><br/>
<b>
HEALPix C, C++
</b>
version 3.00 and
<b>
healpy
</b>
version 1.4.1 are now available in a PPA repository for Ubuntu 12.04 Precise and Ubuntu 12.10 Quantal.
<br/>
<br/>
First remove your previous version of
<span style="font-family: Courier New, Courier, monospace;">
healpy
</span>
, just find the location of the package:
<br/>
<br/>
<span style="font-family: Courier New, Courier, monospace;">
> python -c "import healpy; print healpy.__file__"
</span>
<br/>
<br/>
and remove it:
<br/>
<br/>
<span style="font-family: Courier New, Courier, monospace;">
> sudo rm -r /some-base-path/site-packages/healpy*
</span>
<br/>
<div style="font-family: 'Courier New', Courier, monospace;">
<span style="font-family: Courier New, Courier, monospace;">
<br/>
</span>
</div>
<span style="font-family: inherit;">
Then add the apt repository and install the packages:
</span>
<br/>
<div style="font-family: 'Courier New', Courier, monospace;">
<span style="font-family: Courier New, Courier, monospace;">
<br/>
</span>
</div>
<span style="font-family: Courier New, Courier, monospace;">
> sudo add-apt-repository ppa:zonca/healpix
</span>
<br/>
<span style="font-family: Courier New, Courier, monospace;">
> sudo apt-get update
</span>
<br/>
<span style="font-family: Courier New, Courier, monospace;">
> sudo apt-get install healpix-cxx libhealpix-cxx-dev
</span>
<span style="font-family: 'Courier New', Courier, monospace;">
</span>
<span style="font-family: 'Courier New', Courier, monospace;">
libchealpix0
</span>
<span style="font-family: 'Courier New', Courier, monospace;">
libchealpix-dev python-healpy
</span>
<br/>
<br/>
<div>
<div>
<span style="font-family: Courier New, Courier, monospace;">
> which anafast_cxx
</span>
</div>
<div>
<span style="font-family: Courier New, Courier, monospace;">
/usr/bin/anafast_cxx
</span>
</div>
</div>
<div>
<span style="font-family: Courier New, Courier, monospace;">
</span>
<br/>
<div>
<span style="font-family: Courier New, Courier, monospace;">
> python -c "import healpy; print healpy.__version__"
</span>
</div>
<span style="font-family: Courier New, Courier, monospace;">
</span>
<br/>
<div>
<span style="font-family: Courier New, Courier, monospace;">
1.4.1
</span>
</div>
</div></p>Butterworth filter with Python2012-10-06T00:00:00-07:002012-10-06T00:00:00-07:00Andrea Zoncatag:zonca.github.io,2012-10-06:/2012/10/butterworth-filter-with-python.html<p>
Using IPython notebook of course:
<br/>
<br/>
<a href="http://nbviewer.ipython.org/3843014/">
http://nbviewer.ipython.org/3843014/
</a>
</p>IPython.parallel for Planck data analysis at NERSC2012-09-27T06:24:00-07:002012-09-27T06:24:00-07:00Andrea Zoncatag:zonca.github.io,2012-09-27:/2012/09/ipythonparallel-for-planck-data.html<p><a href="http://www.esa.int/planck">
Planck
</a>
is a Space mission for high precision measurements of the
<a href="http://en.wikipedia.org/wiki/Cosmic_microwave_background_radiation">
Cosmic Microwave Background
</a>
(CMB); data are received as timestreams of output voltages from the two instruments on board, the Low and High Frequency Instruments [LFI / HFI].
<br/>
<br/>
The key phase in data reduction is map-making, where data are binned into a map of the microwave emission of our galaxy, the CMB, and extragalactic sources. This phase is intrinsically parallel and requires simultaneous access to all the data, so it requires fully parallel MPI-based software.
<br/>
<br/>
However, preparing the data for map-making involves several tasks that are serial per file but data- and I/O-intensive, and therefore need to be parallelized across files.
<br/>
<br/>
<a name="more">
</a>
<br/>
IPython.parallel offers the easiest solution for managing a large number of trivially parallel jobs.
<br/>
<br/>
The first task is pointing reconstruction, where we interpolate and apply several rotations and corrections to low-sampled satellite quaternions stored on disk and then write the output dense detector pointing to disk.
<br/>
The pointing files total about 2.5 TB split into about 3000 files; these files can be processed independently, so we implement a function that processes one file, to be used interactively for debugging and testing.
<br/>
We then launch an IPython cluster, typically between 20 and 300 engines on Carver (NERSC), and use the exact same function to process all ~3000 files in parallel.
<br/>
The IPython
<a href="http://ipython.org/ipython-doc/dev/api/generated/IPython.parallel.client.view.html?highlight=apply_async#IPython.parallel.client.view.LoadBalancedView">
BalancedView
</a>
controller automatically balances the queue, so we get maximum efficiency; it is also possible to leave the cluster running and submit other instances of the job to be added to its queue.
<br/>
<br/>
The second task is calibration and dipole removal, which processes about 1.2 TB of data but needs to read the dense pointing from disk, so it is very I/O intensive. Also in this case we can submit the ~3000 jobs to an IPython.parallel cluster.
<br/>
<br/>
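The pattern above — write a function that processes one file, debug it interactively, then map the very same function over a load-balanced pool of workers — is not specific to IPython.parallel. A minimal sketch of the same idea with the standard library's concurrent.futures, as a stand-in (the per-file function below is a hypothetical placeholder, not the actual pipeline code):

```python
from concurrent.futures import ProcessPoolExecutor

def process_pointing_file(filename):
    """Stand-in for the per-file task: interpolate quaternions,
    apply corrections, write dense detector pointing to disk."""
    return "processed " + filename

filenames = ["pointing_%04d.fits" % i for i in range(8)]

if __name__ == "__main__":
    # First debug interactively on one file...
    print(process_pointing_file(filenames[0]))
    # ...then map the exact same function over a pool of workers.
    with ProcessPoolExecutor(max_workers=4) as pool:
        results = list(pool.map(process_pointing_file, filenames))
```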
In a next post I'll describe in detail my setup and how I organize my code to make it easy to swap back and forth between debugging code interactively and running production runs in parallel.</p>homepage on about.me2012-09-26T22:19:00-07:002012-09-26T22:19:00-07:00Andrea Zoncatag:zonca.github.io,2012-09-26:/2012/09/homepage-on-aboutme.html<p>
moved my homepage to about.me:
<br/>
<br/>
<a href="http://about.me/andreazonca">
http://about.me/andreazonca
</a>
<br/>
<br/>
it is quite nice and minimal, as most of it is just links to other websites, e.g. arXiv for publications, LinkedIn for CV, GitHub for code.
<br/>
So I'm going to use andreazonca.com as blog, hosted on blogger.
</p>doctests and unittests happiness 22012-08-16T14:07:00-07:002012-08-16T14:07:00-07:00Andrea Zoncatag:zonca.github.io,2012-08-16:/2012/08/doctests-and-unittests-happiness-2.html<blockquote>
nosetests -v --with-doctest
<br/>
Doctest: healpy.pixelfunc.ang2pix ... ok
<br/>
Doctest: healpy.pixelfunc.get_all_neighbours ... ok
<br/>
Doctest: healpy.pixelfunc.get_interp_val ... ok
<br/>
Doctest: healpy.pixelfunc.get_map_size ... ok
<br/>
Doctest: healpy.pixelfunc.get_min_valid_nside ... ok
<br/>
Doctest: healpy.pixelfunc.get_neighbours ... ok
</blockquote>
<p><br/>
<a name="more">
</a>
<br/>
<blockquote>
Doctest: healpy.pixelfunc.isnpixok ... ok
<br/>
Doctest: healpy.pixelfunc.isnsideok ... ok
<br/>
Doctest: healpy.pixelfunc.ma ... ok
<br/>
Doctest: healpy.pixelfunc.maptype ... ok
<br/>
Doctest: healpy.pixelfunc.mask_bad ... ok
<br/>
Doctest: healpy.pixelfunc.mask_good ... ok
<br/>
Doctest: healpy.pixelfunc.max_pixrad ... ok
<br/>
Doctest: healpy.pixelfunc.nest2ring ... ok
<br/>
Doctest: healpy.pixelfunc.npix2nside ... ok
<br/>
Doctest: healpy.pixelfunc.nside2npix ... ok
<br/>
Doctest: healpy.pixelfunc.nside2pixarea ... ok
<br/>
Doctest: healpy.pixelfunc.nside2resol ... ok
<br/>
Doctest: healpy.pixelfunc.pix2ang ... ok
<br/>
Doctest: healpy.pixelfunc.pix2vec ... ok
<br/>
Doctest: healpy.pixelfunc.reorder ... ok
<br/>
Doctest: healpy.pixelfunc.ring2nest ... ok
<br/>
Doctest: healpy.pixelfunc.ud_grade ... ok
<br/>
Doctest: healpy.pixelfunc.vec2pix ... ok
<br/>
Doctest: healpy.rotator.Rotator ... ok
<br/>
test_write_map_C (test_fitsfunc.TestFitsFunc) ... ok
<br/>
test_write_map_IDL (test_fitsfunc.TestFitsFunc) ... ok
<br/>
test_write_alm (test_fitsfunc.TestReadWriteAlm) ... ok
<br/>
test_write_alm_256_128 (test_fitsfunc.TestReadWriteAlm) ... ok
<br/>
test_ang2pix_nest (test_pixelfunc.TestPixelFunc) ... ok
<br/>
test_ang2pix_ring (test_pixelfunc.TestPixelFunc) ... ok
<br/>
test_nside2npix (test_pixelfunc.TestPixelFunc) ... ok
<br/>
test_nside2pixarea (test_pixelfunc.TestPixelFunc) ... ok
<br/>
test_nside2resol (test_pixelfunc.TestPixelFunc) ... ok
<br/>
test_inclusive (test_query_disc.TestQueryDisc) ... ok
<br/>
test_not_inclusive (test_query_disc.TestQueryDisc) ... ok
<br/>
test_anafast (test_sphtfunc.TestSphtFunc) ... ok
<br/>
test_anafast_iqu (test_sphtfunc.TestSphtFunc) ... ok
<br/>
test_anafast_xspectra (test_sphtfunc.TestSphtFunc) ... ok
<br/>
test_synfast (test_sphtfunc.TestSphtFunc) ... ok
<br/>
test_cartview_nocrash (test_visufunc.TestNoCrash) ... ok
<br/>
test_gnomview_nocrash (test_visufunc.TestNoCrash) ... ok
<br/>
test_mollview_nocrash (test_visufunc.TestNoCrash) ... ok
<br/>
<br/>
----------------------------------------------------------------------
<br/>
Ran 43 tests in 19.077s
<br/>
<br/>
OK
</blockquote></p>compile python module with mpi support2012-07-06T16:08:00-07:002012-07-06T16:08:00-07:00Andrea Zoncatag:zonca.github.io,2012-07-06:/2012/07/compile-python-module-with-mpi-support.html<p>
CC=mpicc LDSHARED="mpicc -shared" python setup.py build_ext -i
</p>some python resources2011-11-01T23:02:00-07:002011-11-01T23:02:00-07:00Andrea Zoncatag:zonca.github.io,2011-11-01:/2011/11/some-python-resources.html<p>
python tutorial:
<br/>
<a href="http://docs.python.org/tutorial/">
http://docs.python.org/tutorial/
<br/>
</a>
numpy tutorial [arrays]:
<br/>
<a href="http://www.scipy.org/Tentative_NumPy_Tutorial">
http://www.scipy.org/Tentative_NumPy_Tutorial
<br/>
</a>
plotting tutorial:
<br/>
<a href="http://matplotlib.sourceforge.net/users/pyplot_tutorial.html">
http://matplotlib.sourceforge.net/users/pyplot_tutorial.html
</a>
<br/>
<br/>
free online books:
<br/>
<a href="http://diveintopython.org/toc/index.html">
http://diveintopython.org/toc/index.html
<br/>
</a>
<a href="http://www.ibiblio.org/swaroopch/byteofpython/read/">
http://www.ibiblio.org/swaroopch/byteofpython/read/
</a>
<br/>
<br/>
install enthought python:
<br/>
<a href="http://www.enthought.com/products/edudownload.php">
http://www.enthought.com/products/edudownload.php
</a>
<br/>
<br/>
video tut:
<br/>
http://www.youtube.com/watch?v=YW8jtSOTRAU&feature=channel
</p>cfitsio wrapper in python2011-06-21T04:43:00-07:002011-06-21T04:43:00-07:00Andrea Zoncatag:zonca.github.io,2011-06-21:/2011/06/cfitsio-wrapper-in-python.html<p>
After several issues with pyfits, and tired of it being so overengineered, I wrote my own FITS I/O package in Python, wrapping the C library cfitsio with ctypes.
<br/>
<br/>
Pretty easy: the first version was developed entirely in one day.
<br/>
<br/>
<a href="https://github.com/zonca/pycfitsio">
https://github.com/zonca/pycfitsio
</a>
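The ctypes approach is generic; here is a minimal sketch of the idea using the C runtime's strlen (wrapping cfitsio works the same way, just with cfitsio's own functions and signatures):

```python
import ctypes

# Generic illustration of the ctypes technique (not pycfitsio itself):
# load the C symbols already linked into the process and declare
# one function's return and argument types.
libc = ctypes.CDLL(None)
libc.strlen.restype = ctypes.c_size_t
libc.strlen.argtypes = [ctypes.c_char_p]

n = libc.strlen(b"cfitsio")  # -> 7
```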
</p>unit testing happiness2011-06-21T04:39:00-07:002011-06-21T04:39:00-07:00Andrea Zoncatag:zonca.github.io,2011-06-21:/2011/06/unit-testing-happiness.html<pre>nosetests -v<br/>test_all_cols (pycfitsio.test.TestPyCfitsIoRead) ... ok<br/>test_colnames (pycfitsio.test.TestPyCfitsIoRead) ... ok<br/>test_move (pycfitsio.test.TestPyCfitsIoRead) ... ok<br/>test_open_file (pycfitsio.test.TestPyCfitsIoRead) ... ok<br/>test_read_col (pycfitsio.test.TestPyCfitsIoRead) ... ok<br/>test_read_hdus (pycfitsio.test.TestPyCfitsIoRead) ... ok<br/>test_create (pycfitsio.test.TestPyCfitsIoWrite) ... ok<br/>test_write (pycfitsio.test.TestPyCfitsIoWrite) ... ok<br/><br/>----------------------------------------------------------------------<br/>Ran 8 tests in 0.016s<br/><br/>OK</pre>Pink noise (1/f noise) simulations in numpy2011-05-18T23:49:00-07:002011-05-18T23:49:00-07:00Andrea Zoncatag:zonca.github.io,2011-05-18:/2011/05/pink-noise-1f-noise-simulations-in-numpy.html<p><a href="https://gist.github.com/979729">
https://gist.github.com/979729
</a>
<br/>
<br/>
<a href="http://zonca.github.io/images/pink-noise-1f-noise-simulations-in-numpy_05_oneoverf1.png">
<img alt="" class="alignnone size-medium wp-image-128" height="225" src="http://zonca.github.io/images/pink-noise-1f-noise-simulations-in-numpy_05_oneoverf1.png" title="oneoverf" width="300"/>
</a></p>Vim regular expressions2011-04-29T02:14:00-07:002011-04-29T02:14:00-07:00Andrea Zoncatag:zonca.github.io,2011-04-29:/2011/04/vim-regular-expressions.html<p>
very good reference of the usage of regular expressions in VIM:
<br/>
<br/>
<a href="http://www.softpanorama.org/Editors/Vimorama/vim_regular_expressions.shtml">
http://www.softpanorama.org/Editors/Vimorama/vim_regular_expressions.shtml
</a>
</p>set python logging level2011-04-13T01:02:00-07:002011-04-13T01:02:00-07:00Andrea Zoncatag:zonca.github.io,2011-04-13:/2011/04/set-python-logging-level.html<p>
Calling logging.basicConfig is often useless: if the logging module has already been configured upfront by one of the imported libraries, the call is silently ignored.
<br/>
<br/>
The solution is to set the level directly in the root logger:
<br/>
<code>
logging.root.level = logging.DEBUG
</code>
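A minimal sketch of both the problem and the fix, simulating the offending library with a first basicConfig call:

```python
import logging

# Simulate a library that configures logging at import time.
logging.basicConfig(level=logging.WARNING)

# A later basicConfig call is silently ignored: the root logger
# already has handlers, so this does NOT lower the level.
logging.basicConfig(level=logging.DEBUG)
assert logging.root.level == logging.WARNING

# Setting the level on the root logger directly always works.
logging.root.level = logging.DEBUG
```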
</p>pyfits memory leak in new_table2011-03-28T17:22:00-07:002011-03-28T17:22:00-07:00Andrea Zoncatag:zonca.github.io,2011-03-28:/2011/03/pyfits-memory-leak-in-newtable.html<p>
I found a memory leak in pyfits.new_table: data were NOT deleted when the table was deleted. I prepared a test on github, using
<a href="http://mg.pov.lt/objgraph/" title="objgraph">
objgraph
</a>
, which shows that data are still in memory:
<br/>
<a name="more">
</a>
<a href="https://gist.github.com/884298">
https://gist.github.com/884298
</a>
<br/>
<br/>
The issue was solved by Erik Bray of STScI on March 28th, 2011; see the bug report:
<br/>
<a href="http://trac6.assembla.com/pyfits/ticket/49">
http://trac6.assembla.com/pyfits/ticket/49
<br/>
</a>
and changeset:
<br/>
<a href="http://trac6.assembla.com/pyfits/changeset/844">
http://trac6.assembla.com/pyfits/changeset/844
</a>
</p>ipython and PyTrilinos2011-02-16T19:10:00-08:002011-02-16T19:10:00-08:00Andrea Zoncatag:zonca.github.io,2011-02-16:/2011/02/ipython-and-pytrilinos.html<ol>
<br/>
<li>
start ipcontroller
</li>
<br/>
<li>
start ipengines:
<br/>
<code>
mpiexec -n 4 ipengine --mpi=pytrilinos
</code>
</li>
<br/>
<li>
start ipython 0.11:
<br/>
<code>
import PyTrilinos
<br/>
from IPython.kernel import client
<br/>
mec = client.MultiEngineClient()
<br/>
%load_ext parallelmagic
<br/>
mec.activate()
<br/>
px import PyTrilinos
<br/>
px comm=PyTrilinos.Epetra.PyComm()
<br/>
px print(comm.NumProc())
</code>
</li>
<br/>
</ol>git make local branch tracking origin2011-02-02T02:58:00-08:002011-02-02T02:58:00-08:00Andrea Zoncatag:zonca.github.io,2011-02-02:/2011/02/git-make-local-branch-tracking-origin.html<p><code>
git branch --set-upstream master origin/master
</code>
<br/>
<br/>
you obtain the same result as initial cloning</p>memory map npy files2011-01-07T21:04:00-08:002011-01-07T21:04:00-08:00Andrea Zoncatag:zonca.github.io,2011-01-07:/2011/01/memory-map-npy-files.html<p>
Mem-map the stored array, and then access the second row directly from disk:
<br/>
<br/>
<code>
X = np.load('/tmp/123.npy', mmap_mode='r')
<br/>
X[1]  # reads only the second row from disk
</code>
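A self-contained sketch of the same trick, writing a throwaway array to a temporary file first:

```python
import os
import tempfile
import numpy as np

path = os.path.join(tempfile.mkdtemp(), "arr.npy")
np.save(path, np.arange(12).reshape(4, 3))

# Memory-map the stored array: nothing is read until it is indexed.
X = np.load(path, mmap_mode="r")
second_row = np.asarray(X[1])  # only this row is pulled from disk
# second_row is [3 4 5]
```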
</p>force local install of python module2010-12-03T22:18:00-08:002010-12-03T22:18:00-08:00Andrea Zoncatag:zonca.github.io,2010-12-03:/2010/12/force-local-install-of-python-module.html<p><code>
python setup.py install --prefix FOLDER
<br/>
</code>
<br/>
<br/>
creates lib/python2.6/site-packages; to force a flat local install you should use:
<br/>
<br/>
<code>
python setup.py install --install-lib FOLDER
</code></p>gnome alt f2 popup launcher2010-08-31T18:14:00-07:002010-08-31T18:14:00-07:00Andrea Zoncatag:zonca.github.io,2010-08-31:/2010/08/gnome-alt-f2-popup-launcher.html<p>
<br/>
<code>
gnome-panel-control --run-dialog
</code>
</p>switch to interactive backend with ipython -pylab2010-08-21T00:33:00-07:002010-08-21T00:33:00-07:00Andrea Zoncatag:zonca.github.io,2010-08-21:/2010/08/switch-to-interactive-backend-with.html<p>
objective:
<br/>
</p>
<ol>
<br/>
<li>
when running ipython without pylab or executing scripts you want to use an image matplotlib backend like Agg
</li>
<br/>
<li>
just when calling ipython -pylab you want to use an interactive backend like GTKAgg or TKAgg
</li>
<br/>
</ol>
<p><br/>
<a name="more">
</a>
<br/>
<br/>
you first need to set up the default backend in .matplotlib/matplotlibrc as
<strong>
Agg
</strong>
:
<br/>
<code>
backend : Agg
</code>
<br/>
then set up your IPython to switch to an interactive backend: in the IPython file Shell.py, in the class MatplotlibShellBase, at about line 516, add:
<br/>
<code>
matplotlib.use('GTKAgg')
</code>
<br/>
after the first import of matplotlib</p>numpy dtypes and fits keywords2010-08-04T21:57:00-07:002010-08-04T21:57:00-07:00Andrea Zoncatag:zonca.github.io,2010-08-04:/2010/08/numpy-dtypes-and-fits-keywords.html<p><code>
bool: 'L',
<br/>
uint8: 'B',
<br/>
int16: 'I',
<br/>
int32: 'J',
<br/>
int64: 'K',
<br/>
float32: 'E',
<br/>
float64: 'D',
<br/>
complex64: 'C',
<br/>
complex128: 'M'
</code></p>count hits with numpy2010-07-23T15:18:00-07:002010-07-23T15:18:00-07:00Andrea Zoncatag:zonca.github.io,2010-07-23:/2010/07/count-hits-with-numpy.html<p>
I have an array where I record hits
<br/>
<code>
a=np.zeros(5)
</code>
<br/>
and an array with the indices of the hits, for example I have 2 hits on index 2
<br/>
<code>
hits=np.array([2,2])
</code>
<br/>
so I want to increase index 2 of a by 2
<br/>
<a name="more">
</a>
<br/>
I tried:
<br/>
<code>
a[hits]+=1
</code>
<br/>
but it gives array([ 0., 0., 1., 0., 0.])
<br/>
Does someone have a suggestion? The answer turned out to be np.bincount:
<br/>
<code>
bins=np.bincount(hits)
<br/>
a[:len(bins)] += bins
<br/>
a
<br/>
array([ 0., 0., 2., 0., 0.])
</code>
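Putting it together as a runnable sketch (why the naive version fails is noted in the comment):

```python
import numpy as np

a = np.zeros(5)
hits = np.array([2, 2])

# Fancy-indexing assignment (a[hits] += 1) buffers the writes, so a
# repeated index is only incremented once; bincount counts every hit.
bins = np.bincount(hits)
a[:len(bins)] += bins

# a is now [0., 0., 2., 0., 0.]
```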
</p>change column name in a fits with pyfits2010-06-30T22:06:00-07:002010-06-30T22:06:00-07:00Andrea Zoncatag:zonca.github.io,2010-06-30:/2010/06/change-column-name-in-fits-with-pyfits.html<p>
There is no way to change it by manipulating the dtype of the data array.
<br/>
<code>
a=pyfits.open('filename.fits')
<br/>
a[1].header.update('TTYPE1','newname')
</code>
<br/>
you need to change the header, using the update method on the right TTYPE keyword, and then write the FITS file out again using a.writeto.
</p>healpix coordinates2010-06-23T01:01:00-07:002010-06-23T01:01:00-07:00Andrea Zoncatag:zonca.github.io,2010-06-23:/2010/06/healpix-coordinates.html<p>
Healpix measures the
<strong>
colatitude
</strong>
theta from 0 at the north pole to pi at the south pole,
<br/>
so the conversion is:
<br/>
<code>
theta = pi/2 - latitude
</code>
<br/>
<strong>
longitude
</strong>
and phi, instead, consistently run from 0 to 2*pi, with
<br/>
</p>
<ul>
<br/>
<li>
zero on vernal equinox (for
<a href="http://en.wikipedia.org/wiki/Ecliptic_coordinate_system">
ecliptic
</a>
).
</li>
<br/>
<li>
zero in the direction from Sun to galactic center (for
<a href="http://en.wikipedia.org/wiki/Galactic_coordinate_system">
galactic
</a>
)
</li>
<br/>
</ul>parallel computing the python way2010-06-21T07:27:00-07:002010-06-21T07:27:00-07:00Andrea Zoncatag:zonca.github.io,2010-06-21:/2010/06/parallel-computing-python-way.html<p>
forget MPI:
<br/>
<a href="http://showmedo.com/videotutorials/series?name=N49qyIFOh">
http://showmedo.com/videotutorials/series?name=N49qyIFOh
</a>
</p>quaternions for python2010-06-21T07:21:00-07:002010-06-21T07:21:00-07:00Andrea Zoncatag:zonca.github.io,2010-06-21:/2010/06/quaternions-for-python.html<p>
the situation is pretty problematic, I hope someday
<strong>
scipy
</strong>
will add a python package for rotating and interpolating quaternions, up to now:
<br/>
</p>
<ul>
<br/>
<li>
<a href="http://cgkit.sourceforge.net/doc2/quat.html">
http://cgkit.sourceforge.net/doc2/quat.html
</a>
: slow, bad interaction with numpy, I could not find a simple way to turn a list of N quaternions to a 4xN array without a loop
</li>
<br/>
<li>
<a href="http://cxc.harvard.edu/mta/ASPECT/tool_doc/pydocs/Quaternion.html">
http://cxc.harvard.edu/mta/ASPECT/tool_doc/pydocs/Quaternion.html
</a>
: more lightweight, does not implement quaternion interpolation
</li>
<br/>
</ul>change permission recursively to folders only2010-03-23T17:58:00-07:002010-03-23T17:58:00-07:00Andrea Zoncatag:zonca.github.io,2010-03-23:/2010/03/change-permission-recursively-to.html<p><code>
find . -type d -exec chmod 777 {} \;
</code></p>aptitude search 'and'2010-03-16T22:50:00-07:002010-03-16T22:50:00-07:00Andrea Zoncatag:zonca.github.io,2010-03-16:/2010/03/aptitude-search.html<p>
this is really something
<strong>
really annoying
</strong>
about aptitude, if you run:
<br/>
<code>
aptitude search linux headers
</code>
<br/>
it will perform an 'or' search. To perform an 'and' search, which I need 99.9% of the time, you need quotation marks:
<br/>
<code>
aptitude search 'linux headers'
</code>
</p>using numpy dtype with loadtxt2010-03-03T22:49:00-08:002010-03-03T22:49:00-08:00Andrea Zoncatag:zonca.github.io,2010-03-03:/2010/03/using-numpy-dtype-with-loadtxt.html<p>
Let's say you want to read a text file like this:
<br/>
<br/>
<br/>
</p>
<blockquote>
#filename start end
<br/>
fdsafda.fits 23143214 23143214
<br/>
safdsafafds.fits 21423 23423432
</blockquote>
<p><br/>
<br/>
<br/>
<a name="more">
</a>
<br/>
you can use dtype to create a custom array, which is very flexible as you can work by row or columns with strings and floats in the same array:
<br/>
<code>
dt = np.dtype({'names': ['filename', 'start', 'end'], 'formats': ['S100', np.float64, np.float64]})
<br/>
</code>
[I also tried using np.str instead of S100 without success; does anyone know why?]
<br/>
then give this as input to loadtxt to load the file and create the array.
<br/>
<code>
a = np.loadtxt(open('yourfile.txt'),dtype=dt)
</code>
<br/>
so each element is:
<br/>
<code>
('dsafsadfsadf.fits', 1.6287776249537126e+18, 1.6290301584937428e+18)
<br/>
</code>
<br/>
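Putting the pieces together, a self-contained sketch (the file contents are the hypothetical ones from the example above, loaded from a string for convenience):

```python
import io
import numpy as np

# hypothetical file contents matching the example above
text = """#filename start end
fdsafda.fits 23143214 23143214
safdsafafds.fits 21423 23423432
"""

# structured dtype: one string column and two float columns
dt = np.dtype({'names': ['filename', 'start', 'end'],
               'formats': ['S100', np.float64, np.float64]})

# the header line starts with '#' and is skipped by default
a = np.loadtxt(io.StringIO(text), dtype=dt)

print(a[0])         # first row as a record
print(a['start'])   # the start column as a float array
```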
but you can get the array of start or end times using:
<br/>
<code>
a['start']
</code></p>Stop ipcluster from a script2010-02-19T02:23:00-08:002010-02-19T02:23:00-08:00Andrea Zoncatag:zonca.github.io,2010-02-19:/2010/02/stop-ipcluster-from-script.html<p>
Ipcluster is easy to start, but not trivial to stop from a script after the processing has finished; here's the solution:
<br/>
<code>
from IPython.kernel import client
<br/>
mec = client.MultiEngineClient()
<br/>
mec.kill(controller=True)
</code>
</p>Correlation2010-01-28T00:45:00-08:002010-01-28T00:45:00-08:00Andrea Zoncatag:zonca.github.io,2010-01-28:/2010/01/correlation.html<p><strong>
Expectation value
</strong>
or first moment of a random variable is the probability weighted sum of the possible values (weighted mean).
<br/>
The expectation value of a 6-sided die is (1+2+3+4+5+6)/6 = 3.5
<br/>
<br/>
<strong>
Covariance
</strong>
of 2 random variables is:
<br/>
<code>
COV(X,Y)=E[(X-E(X))(Y-E(Y))]=E …</code></p><p><strong>
Expectation value
</strong>
or first moment of a random variable is the probability weighted sum of the possible values (weighted mean).
<br/>
The expectation value of a 6-sided die is (1+2+3+4+5+6)/6 = 3.5
<br/>
<br/>
<strong>
Covariance
</strong>
of 2 random variables is:
<br/>
<code>
COV(X,Y) = E[(X-E(X))(Y-E(Y))] = E(XY) - E(X)E(Y)
</code>
<br/>
i.e. the difference between the expected value of their product and the product of their expected values.
<br/>
So if the variables change together, they have a high covariance; if they are independent, their covariance is zero.
<br/>
<br/>
<strong>
Variance
</strong>
is the covariance of a variable with itself:
<br/>
<code>
COV(X,X) = VAR(X) = E(X^2) - E(X)^2
</code>
<br/>
<br/>
<strong>
Standard deviation
</strong>
is the square root of the variance.
<br/>
<br/>
<strong>
Correlation
</strong>
is:
<br/>
<code>
COR(X,Y) = COV(X,Y) / (STDEV(X) * STDEV(Y))
</code>
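These identities are easy to check numerically with numpy; a quick sketch (my own, not from the original post), using a synthetic pair of correlated variables:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100_000)
y = x + rng.normal(size=100_000)  # correlated with x by construction

# COV(X,Y) = E(XY) - E(X)E(Y)
cov = np.mean(x * y) - np.mean(x) * np.mean(y)

# COR(X,Y) = COV(X,Y) / (STDEV(X) * STDEV(Y))
cor = cov / (np.std(x) * np.std(y))

print(cov)  # close to 1, the variance of x
print(cor)  # close to 1/sqrt(2), since VAR(Y) = 2
```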
<br/>
<br/>
<br/>
<a href="http://mathworld.wolfram.com/Covariance.html">
http://mathworld.wolfram.com/Covariance.html
</a></p>execute bash script remotely with ssh2010-01-07T14:37:00-08:002010-01-07T14:37:00-08:00Andrea Zoncatag:zonca.github.io,2010-01-07:/2010/01/execute-bash-script-remotely-with-ssh.html<p>
a bash script launched remotely via ssh does not load the environment, if this is an issue it is necessary to specify --login when calling bash:
<br/>
<br/>
<code>
ssh user@remoteserver.com 'bash --login life_om/cronodproc' | mail your@email.com -s cronodproc
</code>
</p>lock pin hold a package using apt on ubuntu2010-01-07T13:49:00-08:002010-01-07T13:49:00-08:00Andrea Zoncatag:zonca.github.io,2010-01-07:/2010/01/lock-pin-hold-package-using-apt-on.html<p>
set hold:
<br/>
<code>
echo packagename hold | dpkg --set-selections
</code>
<br/>
<br/>
check, should be
<strong>
hi
</strong>
:
<br/>
<code>
dpkg -l packagename
</code>
<br/>
<br/>
unset hold:
<br/>
<code>
echo packagename install | dpkg --set-selections
</code>
</p>load arrays from a text file with numpy2010-01-05T16:32:00-08:002010-01-05T16:32:00-08:00Andrea Zoncatag:zonca.github.io,2010-01-05:/2010/01/load-arrays-from-text-file-with-numpy.html<p>
space-separated text file with 5 arrays in columns:
<br/>
<br/>
[sourcecode language="python"]
<br/>
ods,rings,gains,offsets,rparams = np.loadtxt(filename,unpack=True)
<br/>
[/sourcecode]
<br/>
<br/>
quite impressive...
</p>Latest Maxima and WxMaxima for Ubuntu Karmic2009-12-15T11:20:00-08:002009-12-15T11:20:00-08:00Andrea Zoncatag:zonca.github.io,2009-12-15:/2009/12/latest-maxima-and-wxmaxima-for-ubuntu.html<p><a href="http://zeus.nyf.hu/~blahota/maxima/karmic/" title="maxima for ubuntu">
http://zeus.nyf.hu/~blahota/maxima/karmic/
</a>
<br/>
<br/>
on the maxima mailing lists they suggested installing the sbcl build, so I first installed sbcl from the Ubuntu repositories and then maxima and wxmaxima from this URL.</p>number of files in a folder and subfolders2009-12-10T18:16:00-08:002009-12-10T18:16:00-08:00Andrea Zoncatag:zonca.github.io,2009-12-10:/2009/12/number-of-files-in-folder-and-subfolders.html<p>
folders are not counted
<br/>
<code>
find . -type f | wc -l
</code>
</p>forcefully unmount a disk partition2008-09-17T15:14:00-07:002008-09-17T15:14:00-07:00Andrea Zoncatag:zonca.github.io,2008-09-17:/2008/09/forcefully-unmount-disk-partition.html<p>
check which processes are accessing a partition:
<br/>
<br/>
[sourcecode language="bash"]lsof | grep '/opt'[/sourcecode]
<br/>
<br/>
kill all the processes accessing the partition (check what you're killing, you could lose data):
<br/>
<br/>
[sourcecode language="bash"]fuser -km /opt[/sourcecode]
<br/>
<br/>
try to unmount now:
<br/>
[sourcecode language="bash"]umount /opt[/sourcecode]
</p>netcat: quickly send binaries through network2008-04-29T12:25:00-07:002008-04-29T12:25:00-07:00Andrea Zoncatag:zonca.github.io,2008-04-29:/2008/04/netcat-quickly-send-binaries-through.html<p>
just start nc in server mode on localhost:
<br/>
<br/>
[sourcecode language='bash'] nc -l -p 3333 [/sourcecode]
<br/>
<br/>
send a string to localhost on port 3333:
<br/>
<br/>
[sourcecode language='bash'] echo "hello world" | nc localhost 3333 [/sourcecode]
<br/>
<br/>
you'll see the string you sent appear on the server side.
<br/>
<br/>
very useful for sending binaries, see …</p><p>
just start nc in server mode on localhost:
<br/>
<br/>
[sourcecode language='bash'] nc -l -p 3333 [/sourcecode]
<br/>
<br/>
send a string to localhost on port 3333:
<br/>
<br/>
[sourcecode language='bash'] echo "hello world" | nc localhost 3333 [/sourcecode]
<br/>
<br/>
you'll see the string you sent appear on the server side.
<br/>
<br/>
very useful for sending binaries, see
<a href="http://www.g-loaded.eu/2006/11/06/netcat-a-couple-of-useful-examples/">
examples
</a>
.
</p>Decibels, dB and dBm, in terms of Power and Amplitude2008-03-29T02:13:00-07:002008-03-29T02:13:00-07:00Andrea Zoncatag:zonca.github.io,2008-03-29:/2008/03/decibels-db-and-dbm-in-terms-of-power.html<p>
It's not difficult, but I always have some doubts...
<br/>
</p>
<h4>
Power
</h4>
<p><br/>
$latex L_{dB} = 10 log_{10} \left( \dfrac{P_1}{P_0} \right) $
<br/>
<br/>
10 dB increase for a factor 10 increase in the ratio
<br/>
<br/>
3 dB = doubling
<br/>
<br/>
40 dB = 10000 times
<br/>
<h4>
Amplitude
</h4>
<br/>
$latex L_{dB} = 10 log_{10} \left( \dfrac{A_1^2}{A_0 …</p><p>
It's not difficult, but I always have some doubts...
<br/>
</p>
<h4>
Power
</h4>
<p><br/>
$latex L_{dB} = 10 log_{10} \left( \dfrac{P_1}{P_0} \right) $
<br/>
<br/>
10 dB increase for a factor 10 increase in the ratio
<br/>
<br/>
3 dB = doubling
<br/>
<br/>
40 dB = 10000 times
<br/>
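These rules of thumb follow directly from the formula; a quick numeric check (my own, not from the original post):

```python
import math

def db(ratio):
    """Express a power ratio in decibels."""
    return 10 * math.log10(ratio)

print(db(10))      # 10 dB for a factor 10
print(db(2))       # ~3 dB for a doubling
print(db(10_000))  # 40 dB for a factor 10000
```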
<h4>
Amplitude
</h4>
<br/>
$latex L_{dB} = 10 log_{10} \left( \dfrac{A_1^2}{A_0^2} \right) = 20 log_{10} \left( \dfrac{A_1}{A_0} \right) $
<br/>
<h4>
dBm
</h4>
<br/>
dBm is an absolute measure, obtained as a ratio to 1 mW:
<br/>
<br/>
$latex L_{dBm} = 10 log_{10} \left( \dfrac{P_1}{1 mW} \right) $
<br/>
<ul>
<br/>
<li>
0 dBm = 1 mW
</li>
<br/>
<li>
3 dBm ≈ 2 mW
</li>
<br/>
</ul></p>Relation between Power density and temperature in an antenna2008-03-28T18:29:00-07:002008-03-28T18:29:00-07:00Andrea Zoncatag:zonca.github.io,2008-03-28:/2008/03/relation-between-power-density-and.html<p>
Considering an antenna placed inside a blackbody enclosure at temperature T, the power received per unit bandwidth is:
<br/>
$latex \omega = kT$
<br/>
<br/>
where k is Boltzmann constant.
<br/>
<br/>
This relationship derives from considering a constant brightness $latex B$ in all directions; the Rayleigh-Jeans law then gives:
<br/>
<br/>
$latex B = \dfrac{2kT}{\lambda^2 …</p><p>
Considering an antenna placed inside a blackbody enclosure at temperature T, the power received per unit bandwidth is:
<br/>
$latex \omega = kT$
<br/>
<br/>
where k is Boltzmann constant.
<br/>
<br/>
This relationship derives from considering a constant brightness $latex B$ in all directions; the Rayleigh-Jeans law then gives:
<br/>
<br/>
$latex B = \dfrac{2kT}{\lambda^2}$
<br/>
<br/>
Power per unit bandwidth is obtained by integrating brightness over antenna beam
<br/>
<br/>
$latex \omega = \frac{1}{2} A_e \int \int B \left( \theta , \phi \right) P_n \left( \theta , \phi \right) d \Omega $
<br/>
<br/>
for constant brightness, $latex \int \int P_n \left( \theta , \phi \right) d \Omega = \Omega_A $, therefore
<br/>
<br/>
$latex \omega = \dfrac{kT}{\lambda^2}A_e\Omega_A $
<br/>
<br/>
where:
<br/>
</p>
<ul>
<br/>
<li>
$latex A_e$ is antenna effective aperture
</li>
<br/>
<li>
$latex \Omega_A$ is antenna beam area
</li>
<br/>
</ul>
<p><br/>
$latex \lambda^2 = A_e\Omega_A $ (another post should discuss this relation)
<br/>
<br/>
finally:
<br/>
<br/>
$latex \omega = kT $
<br/>
<br/>
which is the same noise power of a resistor.
<br/>
<br/>
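As a quick sanity check (my own numbers, not from the post): at T = 290 K this formula gives the familiar thermal-noise floor of about -174 dBm/Hz:

```python
import math

k = 1.380649e-23  # Boltzmann constant, J/K
T = 290.0         # room temperature, K

power_per_hz = k * T  # received power per unit bandwidth, W/Hz
dbm_per_hz = 10 * math.log10(power_per_hz / 1e-3)  # relative to 1 mW

print(power_per_hz)  # ~4.0e-21 W/Hz
print(dbm_per_hz)    # ~ -174 dBm/Hz
```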
source: Kraus, Radio Astronomy, p. 107</p>Producing PDF from XML files2008-03-28T16:27:00-07:002008-03-28T16:27:00-07:00Andrea Zoncatag:zonca.github.io,2008-03-28:/2008/03/producing-pdf-from-xml-files.html<p>
I need to produce formatted pdf from XML data input file.
<br/>
The most standard way seems to be to use
<a href="http://www.w3schools.com/xsl" title="w3schools tutorial">
XSL stylesheets.
</a>
<br/>
Associating an XSL sheet with an XML file lets most browsers render it directly as HTML; this can be used to publish XML sheets on the web.
<br/>
<br/>
The quick and …</p><p>
I need to produce formatted pdf from XML data input file.
<br/>
The most standard way seems to be to use
<a href="http://www.w3schools.com/xsl" title="w3schools tutorial">
XSL stylesheets.
</a>
<br/>
Associating an XSL sheet with an XML file lets most browsers render it directly as HTML; this can be used to publish XML sheets on the web.
<br/>
<br/>
The quick and dirty way to produce PDF could be printing them from Firefox, but an interesting option is to use
<a href="http://cyberelk.net/tim/software/xmlto/" title="xmlto homepage">
xmlto
</a>
, a script that runs an XSL transformation and renders an XML file as PDF or other formats. It would be interesting to test this script and understand whether it needs DocBook XML input specifically or accepts any XML.
</p>vim customization2006-10-17T10:49:00-07:002006-10-17T10:49:00-07:00Andrea Zoncatag:zonca.github.io,2006-10-17:/2006/10/vim-costumization.html<p>
it is about Perl, but it suggests very useful tricks for programming with vim:
<br/>
http://mamchenkov.net/wordpress/2004/05/10/vim-for-perl-developers/
</p>using gnu find2006-10-03T14:00:00-07:002006-10-03T14:00:00-07:00Andrea Zoncatag:zonca.github.io,2006-10-03:/2006/10/using-gnu-find.html<p>
list all the directories, excluding hidden ones and "." itself:
<br/>
</p>
<blockquote>
find . -maxdepth 1 -type d -not -name ".*"
</blockquote>
<p><br/>
find some string in all files matching a pattern in the subfolders (with grep -r you cannot specify the type of file)
<br/>
<blockquote>
find . -name '*.py' -exec grep -i pdb '{}' \;
</blockquote></p>beginners bash guide2006-10-03T13:56:00-07:002006-10-03T13:56:00-07:00Andrea Zoncatag:zonca.github.io,2006-10-03:/2006/10/beginners-bash-guide.html<p>
great guide with many examples:
<br/>
<br/>
http://tille.xalasys.com/training/bash/
</p>tar quickref2006-09-25T13:19:00-07:002006-09-25T13:19:00-07:00Andrea Zoncatag:zonca.github.io,2006-09-25:/2006/09/tar-quickref.html<p>
compress: tar cvzf foo.tgz *.cc *.h
<br/>
check inside: tar tzf foo.tgz | grep file.txt
<br/>
extract: tar xvzf foo.tgz
<br/>
extract 1 file only: tar xvzf foo.tgz path/to/file.txt
</p>software carpentry2006-09-25T12:51:00-07:002006-09-25T12:51:00-07:00Andrea Zoncatag:zonca.github.io,2006-09-25:/2006/09/software-carpentry.html<p>
basic software for scientists and engineers:
<br/>
http://www.swc.scipy.org/
<br/>
</p>Free software for scientific data processing2006-09-22T13:35:00-07:002006-09-22T13:35:00-07:00Andrea Zoncatag:zonca.github.io,2006-09-22:/2006/09/software-libero-per-il-trattamento-di.html<p>
in the search for the best environment for scientific data analysis, these articles are worth reading:
<br/>
<br/>
http://www.pluto.it/files/journal/pj0501/swlibero-scie1.html
<br/>
<br/>
http://www.pluto.it/files/journal/pj0504/swlibero-scie2.html
<br/>
<br/>
http://www.pluto.it/files/journal/pj0505/swlibero-scie3.html
</p>command line processing2006-09-22T13:34:00-07:002006-09-22T13:34:00-07:00Andrea Zoncatag:zonca.github.io,2006-09-22:/2006/09/command-line-processing.html<p>
Very useful summary of many Linux command-line processing tools (great Perl one-liners):
<br/>
<br/>
http://grad.physics.sunysb.edu/~leckey/personal/forget/
</p>awk made easy2006-09-22T13:20:00-07:002006-09-22T13:20:00-07:00Andrea Zoncatag:zonca.github.io,2006-09-22:/2006/09/awk-made-easy.html<p><strong>
awk '/REGEX/ {print NR "\t" $9 "\t" $4"_"$5 ;}' file.txt
</strong>
<br/>
supports extended REGEX like perl ( e.g. [:blank:] Space or tab characters )
<br/>
NR is line number
<br/>
NF Number of fields
<br/>
$n is the column to be printed, $0 is the whole row
<br/>
<br/>
if it is only necessary to print …</p><p><strong>
awk '/REGEX/ {print NR "\t" $9 "\t" $4"_"$5 ;}' file.txt
</strong>
<br/>
supports extended REGEX like perl ( e.g. [:blank:] Space or tab characters )
<br/>
NR is line number
<br/>
NF Number of fields
<br/>
$n is the column to be printed, $0 is the whole row
<br/>
<br/>
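A tiny synthetic example (my own, not from the post) showing NR and the field variables in action:

```shell
# two lines of three whitespace-separated fields;
# print the line number and the second field of each line
printf 'a 1 x\nb 2 y\n' | awk '{print NR "\t" $2}'
# prints:
# 1	1
# 2	2
```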
if it is only necessary to print columns of a file, it is easier to use cut:
<br/>
<br/>
uname -a | cut -d" " -f1,3,11,12
<br/>
<br/>
-d: or -d" " is the delimiter
<br/>
-f1,3 are the fields to be displayed
<br/>
other options: -s doesn't show lines without delimiters, --complement is self-explanatory
<br/>
condition on a specific field:
<br/>
$<field> ~ /<string>/ searches for the string in the specified field.
<br/>
<br/>
you can use awk also in pipes:
<br/>
ll | awk 'NR!=1 {s+=$5} END {print "Average: " s/(NR-1)}'
<br/>
END processes the whole file and then prints the results
<br/>
<br/>
tutorial on using awk from the command line:
<br/>
<a href="http://www.vectorsite.net/tsawk_3.html#m1" target="_blank" title="awk tutorial">
http://www.vectorsite.net/tsawk_3.html#m1
</a></p>astrophysics pills2006-09-20T13:39:00-07:002006-09-20T13:39:00-07:00Andrea Zoncatag:zonca.github.io,2006-09-20:/2006/09/pillole-di-astrofisica.html<p>
well-explained curiosities by Annibale D'Ercole; the idea of having a basic level and an advanced level is interesting
<br/>
<a href="http://www.bo.astro.it/sait/spigolature/spigostart.html" target="_blank" title="spigolature astronomiche">
http://www.bo.astro.it/sait/spigolature/spigostart.html
</a>
</p>