Andrea Zonca's bloghttp://zonca.github.io/2020-02-26T13:00:00-08:00Deploy CVMFS on Kubernetes2020-02-26T13:00:00-08:002020-02-26T13:00:00-08:00Andrea Zoncatag:zonca.github.io,2020-02-26:/2020/02/cvmfs-kubernetes.html<p><a href="https://cvmfs.readthedocs.io/">CVMFS</a> is a software distribution service used by High Energy Physics experiments at CERN
to synchronize software environments across entire collaborations.</p>
<p>In the context of a Kubernetes + JupyterHub deployment on Jetstream, for example <a href="http://zonca.github.io/2019/06/kubernetes-jupyterhub-jetstream-magnum.html">deployed using Magnum following my tutorial</a>, it is useful to use CVMFS to make the software tools of a collaboration available to all the users connected to JupyterHub, so that we can keep the base Docker image simpler and smaller.</p>
<h2>Alternatives</h2>
<p>An existing solution is <a href="https://github.com/cernops/cvmfs-csi">the CVMFS CSI driver</a>; however, it doesn't have much documentation, so I haven't tested it. It would be useful for larger deployments, but we are designing for a 5-node (possibly up to 10-node) Kubernetes cluster.</p>
<h2>Architecture</h2>
<p>We have a pod in Kubernetes (running as a privileged Docker container) which runs the CVMFS client and caches locally
(on a dedicated Openstack volume) some pre-defined CVMFS repositories (at the moment we do not support automounting).</p>
<p>Currently we are using the <code>DIRECT</code> connection for the CVMFS client, because we have just a single client which accesses
a small amount of data. For heavier usage a proxy is required instead, and it could also be deployed inside Kubernetes.</p>
<p>The same pod also runs an NFS server and exposes it internally to the Kubernetes cluster, over the local Jetstream network;
any other pod can then use an NFS volume and mount it at the <code>/cvmfs</code> folder inside the container.
We also activate the CVMFS configuration options for NFS support, following the <a href="https://cvmfs.readthedocs.io/en/stable/cpt-configure.html#nfs-server-mode">documentation</a>.</p>
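<p>As a sketch, the client options above correspond to a CVMFS configuration file like the following; the repository list and cache location are illustrative, the exact file in the Docker image may differ:</p>

```shell
# /etc/cvmfs/default.local -- illustrative values
CVMFS_REPOSITORIES=cvmfs-config.cern.ch   # example repository list
CVMFS_HTTP_PROXY=DIRECT                   # no proxy, as discussed above
CVMFS_CACHE_BASE=/var/lib/cvmfs           # cache on the dedicated volume
CVMFS_NFS_SOURCE=yes                      # NFS server mode, per the CVMFS documentation
```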
<h2>Deployment</h2>
<p>The repositories used in this deployment are:</p>
<ul>
<li><a href="https://github.com/zonca/docker-cvmfs-client">Github repository for the Docker image of the CVMFS client</a></li>
<li>Docker Hub repositories where the 2 containers are built: <a href="https://hub.docker.com/r/zonca/cvmfs-client"><code>cvmfs-client</code></a> and <a href="https://hub.docker.com/r/zonca/cvmfs-client-nfs"><code>cvmfs-client-nfs</code></a></li>
<li>The <a href="https://github.com/zonca/jupyterhub-deploy-kubernetes-jetstream/tree/master/cvmfs"><code>jupyterhub-deploy-kubernetes-jetstream</code></a> Github repository with the Kubernetes configuration files</li>
</ul>
<p>First we need to checkout the <code>jupyterhub-deploy-kubernetes-jetstream</code> repository:</p>
<div class="highlight"><pre><span></span><span class="err">git clone https://github.com/zonca/jupyterhub-deploy-kubernetes-jetstream.git</span>
<span class="err">cd jupyterhub-deploy-kubernetes-jetstream/cvmfs</span>
</pre></div>
<p>Then configure the CVMFS pod with the required repositories, see the <code>CVMFS_REPOSITORIES</code> variable in <a href="https://github.com/zonca/jupyterhub-deploy-kubernetes-jetstream/blob/master/cvmfs/pod_cvmfs_nfs.yaml"><code>pod_cvmfs_nfs.yaml</code></a>.</p>
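<p>For reference, in a pod spec the repository list is typically passed as an environment variable; this excerpt is a sketch with example values, not a copy of the actual file:</p>

```yaml
# hypothetical excerpt in the style of pod_cvmfs_nfs.yaml
containers:
  - name: cvmfs-client-nfs
    image: zonca/cvmfs-client-nfs
    env:
      - name: CVMFS_REPOSITORIES
        value: "cvmfs-config.cern.ch,sft.cern.ch"   # example repositories
```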
<p>Then deploy the pod with:</p>
<div class="highlight"><pre><span></span><span class="err">kubectl create -f pod_cvmfs_nfs.yaml</span>
</pre></div>
<p>This creates 2 Openstack volumes: a 20 GB volume for the CVMFS cache, and a 1 GB volume which serves as the <code>/cvmfs</code> root folder of the NFS server.
It also creates the <code>nfs-service</code> Service with a fixed IP, so that other pods can reference the NFS server through it.</p>
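<p>A pod can then reach the NFS server through the Service's fixed IP with a standard Kubernetes NFS volume; this is a sketch, the IP is a placeholder for the one set in the Service definition:</p>

```yaml
# hypothetical pod spec excerpt mounting the shared folder at /cvmfs
volumes:
  - name: cvmfs
    nfs:
      server: 10.254.0.100   # placeholder: the fixed ClusterIP of nfs-service
      path: "/"
containers:
  - name: test
    image: busybox
    volumeMounts:
      - name: cvmfs
        mountPath: /cvmfs
```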
<p>Finally we can create a pod that mounts the folder via NFS:</p>
<div class="highlight"><pre><span></span><span class="err">kubectl create -f test_nfs_mount.yaml</span>
</pre></div>
<p>Then get a terminal in the pod with:</p>
<div class="highlight"><pre><span></span><span class="err">bash ../terminal_pod.sh test-nfs-mount</span>
</pre></div>
<p>This creates a volume which mounts the <code>/cvmfs</code> folder shared over NFS; all the subfolders are automatically shared as well.</p>
<p>Finally we can check the content of the <code>/cvmfs</code> folder.</p>Organize calendars for a large scientific collaboration2019-12-02T12:00:00-08:002019-12-02T12:00:00-08:00Andrea Zoncatag:zonca.github.io,2019-12-02:/2019/12/organize-calendar-collaboration.html<p>Many scientific collaborations have a central calendar, often hosted on Google Calendar,
to coordinate teleconferences, meetings and events across timezones.</p>
<h3>The issue</h3>
<p>Most users are only interested in a small subset of the events; however, Google Calendar
does not allow them to subscribe to single events. The central calendar admin could invite
each person to events, but that requires a lot of work.</p>
<p>So users either subscribe to the whole calendar, and end up with a huge clutter of uninteresting events,
or copy just a subset of the events to their own calendars, and lose track of any rescheduling of the
original events.</p>
<h3>Proposed solution</h3>
<p>I recommend splitting the events across multiple calendars, for example one for each working group,
or any other categorization where most users would be interested in all the events in a calendar,
possibly with a "General" calendar for events that should interest the whole collaboration.</p>
<p>Still, we can embed all of the calendars in a single webpage; see the example below, where 2 calendars (the Monday and Tuesday telecon calendars) are visualized together (<a href="https://support.google.com/calendar/answer/41207?hl=en">see the Google Calendar documentation</a>).</p>
<iframe src="https://calendar.google.com/calendar/embed?height=600&wkst=1&bgcolor=%23ffffff&ctz=America%2FLos_Angeles&src=dTI2dnBkNnZvcm1qNHVucnVtajMzZzdwcGNAZ3JvdXAuY2FsZW5kYXIuZ29vZ2xlLmNvbQ&src=c2FwazM1OTVmcHRiZHVtOWdqZnJwdWxkbnNAZ3JvdXAuY2FsZW5kYXIuZ29vZ2xlLmNvbQ&color=%23DD4477&color=%236633CC" style="border-width:0" width="800" height="600" frameborder="0" scrolling="no"></iframe>
<p>Users can click on the "Add to Google Calendar" button at the bottom and subscribe to a subset of the calendars or to all of them.
See the screenshot below: <img alt="screenshot of add to Google Calendar" src="/images/add_google_calendar.png">.</p>
<p>As an additional benefit, we can compartmentalize permissions more easily, e.g. the leads of a working group
get write access only to their relevant calendar/calendars.</p>Simulate users on JupyterHub2019-10-30T12:00:00-07:002019-10-30T12:00:00-07:00Andrea Zoncatag:zonca.github.io,2019-10-30:/2019/10/loadtest-jupyterhub.html<p>I currently have 2 different strategies to deploy JupyterHub on top of Kubernetes on Jetstream:</p>
<ul>
<li>Using <a href="https://zonca.github.io/2019/02/kubernetes-jupyterhub-jetstream-kubespray.html">Kubespray</a></li>
<li>Using <a href="http://zonca.github.io/2019/06/kubernetes-jupyterhub-jetstream-magnum.html">Magnum</a>, which also supports the <a href="http://zonca.github.io/2019/09/kubernetes-jetstream-autoscaler.html">Cluster Autoscaler</a></li>
</ul>
<p>In this tutorial I'll show how to use Yuvi Panda's <a href="https://github.com/yuvipanda/hubtraf"><code>hubtraf</code></a> to simulate load on JupyterHub, i.e. programmatically generate a predefined number of users connecting and executing notebooks on the system.</p>
<p>This is especially useful to test the Cluster Autoscaler.</p>
<p><code>hubtraf</code> assumes you are using the Dummy authenticator, which is the default installed by the <code>zero-to-jupyterhub</code> helm chart. If you have configured another authenticator, temporarily disable it for testing purposes.</p>
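<p>For reference, with the <code>zero-to-jupyterhub</code> helm chart of this era the Dummy authenticator is selected in <code>config.yaml</code> roughly like this (a sketch; check the schema of the chart version you are using):</p>

```yaml
auth:
  type: dummy
  dummy:
    password: null   # when unset, any username/password is accepted
```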
<p>First go through the <a href="https://github.com/yuvipanda/hubtraf/blob/master/docs/index.rst#jupyterhub-traffic-simulator"><code>hubtraf</code> documentation</a> to understand its functionalities.</p>
<p><code>hubtraf</code> also has a Helm recipe to run it within Kubernetes, but the simplest way is to test from your laptop: follow the <code>hubtraf</code> documentation linked above to install the package and then run:</p>
<div class="highlight"><pre><span></span><span class="err">hubtraf http://js-xxx-yyy.jetstream-cloud.org 2</span>
</pre></div>
<p>This simulates 2 users connecting to the system. You can then check with:</p>
<div class="highlight"><pre><span></span><span class="err">kubectl get pods -n jhub</span>
</pre></div>
<p>that the pods are being created successfully. Also check the logs printed by <code>hubtraf</code> on the command line: they explain what it is doing and track the time every operation takes, which is useful to debug any delays in providing resources to users.</p>
<p>Consider that the volumes created by JupyterHub for the test users will remain in Kubernetes and in Openstack; therefore, if you would like to use the same deployment for production, remember to clean up the Kubernetes <code>PersistentVolume</code> and <code>PersistentVolumeClaim</code> resources.</p>
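<p>A minimal cleanup sketch, assuming the default <code>zero-to-jupyterhub</code> labels and the <code>jhub</code> namespace (verify the label selector against your deployment before deleting anything):</p>

```shell
# list the claims created for the simulated users, then delete them;
# with the default "Delete" reclaim policy the bound PersistentVolumes
# and the Openstack volumes are removed as well
kubectl get pvc -n jhub -l component=singleuser-storage
kubectl delete pvc -n jhub -l component=singleuser-storage
```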
<p>Now we can test scalability of the deployment with:</p>
<div class="highlight"><pre><span></span><span class="err">hubtraf http://js-xxx-yyy.jetstream-cloud.org 100</span>
</pre></div>
<p>Make sure you have asked XSEDE support to increase the maximum number of volumes in your Openstack allocation, which by default is only 10. Otherwise edit <code>config_standard_storage.yaml</code> and set:</p>
<div class="highlight"><pre><span></span><span class="n">singleuser</span><span class="o">:</span>
  <span class="n">storage</span><span class="o">:</span>
    <span class="n">type</span><span class="o">:</span> <span class="n">none</span>
</pre></div>
<h2>Test the Cluster Autoscaler</h2>
<p>If you followed the tutorial to deploy the Cluster Autoscaler on Magnum, you can launch <code>hubtraf</code> to create a large number of pods, then check that some pods are "Running" and the ones that do not fit in the current nodes are "Pending":</p>
<div class="highlight"><pre><span></span><span class="err">kubectl get pods -n jhub</span>
</pre></div>
<p>and then check in the logs of the autoscaler that it detects that those pods are pending and requests additional nodes.
For example:</p>
<div class="highlight"><pre><span></span>> kubectl logs -n kube-system cluster-autoscaler-hhhhhhh-uuuuuuu
I1031 <span class="m">00</span>:48:39.807384 <span class="m">1</span> scale_up.go:689<span class="o">]</span> Scale-up: setting group DefaultNodeGroup size to <span class="m">2</span>
I1031 <span class="m">00</span>:48:41.583449 <span class="m">1</span> magnum_nodegroup.go:101<span class="o">]</span> Increasing size by <span class="m">1</span>, <span class="m">1</span>->2
I1031 <span class="m">00</span>:49:14.141351 <span class="m">1</span> magnum_nodegroup.go:67<span class="o">]</span> Waited <span class="k">for</span> cluster UPDATE_IN_PROGRESS status
</pre></div>
<p>After 4 or 5 minutes the new node should be available and should show up in:</p>
<div class="highlight"><pre><span></span><span class="err">kubectl get nodes</span>
</pre></div>
<p>And we can check that some user pods are now running on the new node:</p>
<div class="highlight"><pre><span></span><span class="err">kubectl get pods -n jhub -o wide</span>
</pre></div>
<p>In my case the Autoscaler actually requested a 3rd node to accommodate all the user pods:</p>
<div class="highlight"><pre><span></span>I1031 <span class="m">00</span>:48:39.807384 <span class="m">1</span> scale_up.go:689<span class="o">]</span> Scale-up: setting group DefaultNodeGroup size to <span class="m">2</span>
I1031 <span class="m">00</span>:48:41.583449 <span class="m">1</span> magnum_nodegroup.go:101<span class="o">]</span> Increasing size by <span class="m">1</span>, <span class="m">1</span>->2
I1031 <span class="m">00</span>:49:14.141351 <span class="m">1</span> magnum_nodegroup.go:67<span class="o">]</span> Waited <span class="k">for</span> cluster UPDATE_IN_PROGRESS status
I1031 <span class="m">00</span>:52:51.308054 <span class="m">1</span> magnum_nodegroup.go:67<span class="o">]</span> Waited <span class="k">for</span> cluster UPDATE_COMPLETE status
I1031 <span class="m">00</span>:53:01.315179 <span class="m">1</span> scale_up.go:689<span class="o">]</span> Scale-up: setting group DefaultNodeGroup size to <span class="m">3</span>
I1031 <span class="m">00</span>:53:02.996583 <span class="m">1</span> magnum_nodegroup.go:101<span class="o">]</span> Increasing size by <span class="m">1</span>, <span class="m">2</span>->3
I1031 <span class="m">00</span>:53:35.607158 <span class="m">1</span> magnum_nodegroup.go:67<span class="o">]</span> Waited <span class="k">for</span> cluster UPDATE_IN_PROGRESS status
I1031 <span class="m">00</span>:56:41.834151 <span class="m">1</span> magnum_nodegroup.go:67<span class="o">]</span> Waited <span class="k">for</span> cluster UPDATE_COMPLETE status
</pre></div>
<p>Moreover, the Cluster Autoscaler also provides useful information in the status of each "Pending" pod. For example, if it detects that it is useless to create a new node because the pod is "Pending" for some other reason (e.g. the volume quota was reached), this information will be accessible using:</p>
<div class="highlight"><pre><span></span><span class="err">kubectl describe pod -n jhub jupyter-xxxxxxx</span>
</pre></div>
<p>When the simulated users disconnect (<code>hubtraf</code> keeps them active for about 5 minutes by default), the autoscaler waits for a configured amount of time before scaling down: the default is 10 minutes, but in my deployment it is 1 minute to simplify testing, see the <code>cluster-autoscaler-deployment-master.yaml</code> file.
After this delay, the autoscaler scales down the cluster in a 2-step process: it first terminates the Openstack Virtual Machine and then adjusts the size of the Magnum cluster (<code>node_count</code>). You can monitor the process using <code>openstack server list</code> and <code>openstack coe cluster list</code>, and the log of the autoscaler:</p>
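<p>The scale-down delay is a command-line flag of the autoscaler container; this excerpt is illustrative of the settings described above, not a verbatim copy of the file:</p>

```yaml
# hypothetical excerpt from cluster-autoscaler-deployment-master.yaml
command:
  - ./cluster-autoscaler
  - --cloud-provider=magnum
  - --nodes=1:5:DefaultNodeGroup     # min:max:nodegroup bounds
  - --scale-down-unneeded-time=1m    # 10m is the upstream default
```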
<div class="highlight"><pre><span></span>I1101 <span class="m">06</span>:31:10.223660 <span class="m">1</span> scale_down.go:882<span class="o">]</span> Scale-down: removing empty node k8s-e2iw7axmhym7-minion-1
I1101 <span class="m">06</span>:31:16.081223 <span class="m">1</span> magnum_manager_heat.go:276<span class="o">]</span> Waited <span class="k">for</span> stack UPDATE_IN_PROGRESS status
I1101 <span class="m">06</span>:32:17.061860 <span class="m">1</span> magnum_manager_heat.go:276<span class="o">]</span> Waited <span class="k">for</span> stack UPDATE_COMPLETE status
I1101 <span class="m">06</span>:32:49.826439 <span class="m">1</span> magnum_nodegroup.go:67<span class="o">]</span> Waited <span class="k">for</span> cluster UPDATE_IN_PROGRESS status
I1101 <span class="m">06</span>:33:21.588022 <span class="m">1</span> magnum_nodegroup.go:67<span class="o">]</span> Waited <span class="k">for</span> cluster UPDATE_COMPLETE status
</pre></div>
<h2>Acknowledgments</h2>
<p>Thanks Yuvi Panda for providing <code>hubtraf</code>, and thanks Julien Chastang for testing my deployments.</p>Execute Jupyter Notebooks not interactively2019-09-23T12:00:00-07:002019-09-23T12:00:00-07:00Andrea Zoncatag:zonca.github.io,2019-09-23:/2019/09/batch-notebook-execution.html<p>Over the years, I have explored how to easily scale up computation through
Jupyter Notebooks by executing them non-interactively, possibly parametrized
and remotely. This is mostly for reference.</p>
<ul>
<li><a href="https://github.com/zonca/nbsubmit"><code>nbsubmit</code></a> is a Python package which provides a Python API to send a local notebook for execution on a remote SLURM cluster (for example Comet), see <a href="https://github.com/zonca/nbsubmit/blob/master/example/multiple_jobs/submit_multiple_jobs.ipynb">an example</a>. This project is not currently maintained.</li>
<li>Back in 2017 I tested submitting notebooks to Open Science Grid, see <a href="https://github.com/zonca/batch-notebooks-condor">the <code>batch-notebooks-condor</code> repository</a></li>
<li>Back in 2016 I created scripts to template a Jupyter Notebook and launch SLURM jobs, see <a href="https://github.com/sdsc/sdsc-summer-institute-2016/blob/master/hpc3_python_hpc/slurm.shared.template"><code>slurm.shared.template</code></a> and <a href="https://github.com/sdsc/sdsc-summer-institute-2016/blob/master/hpc3_python_hpc/runipyloop.sh"><code>runipyloop.sh</code></a></li>
</ul>Deploy Cluster Autoscaler for Kubernetes on Jetstream2019-09-12T12:00:00-07:002019-09-12T12:00:00-07:00Andrea Zoncatag:zonca.github.io,2019-09-12:/2019/09/kubernetes-jetstream-autoscaler.html<p>The <a href="https://github.com/kubernetes/autoscaler">Kubernetes Cluster Autoscaler</a> is a service
that runs within a Kubernetes cluster; when there are not enough resources to accommodate
the pods that are queued to run, it contacts the API of the cloud provider to create
more Virtual Machines to join the Kubernetes cluster.</p>
<p>Initially the Cluster Autoscaler only supported commercial cloud providers, but back in
March 2019 <a href="https://github.com/kubernetes/autoscaler/pull/1690">a user contributed Openstack support based on Magnum</a>.</p>
<p>As a first step, you should have a Magnum-based deployment running on Jetstream;
see <a href="https://zonca.github.io/2019/06/kubernetes-jupyterhub-jetstream-magnum.html">my recent tutorial about that</a>.</p>
<p>You should therefore already have a copy of the repository with all the configuration
files checked out on the local machine that you are using to interact with the Openstack API;
if not:</p>
<div class="highlight"><pre><span></span><span class="err">git clone https://github.com/zonca/jupyterhub-deploy-kubernetes-jetstream.git</span>
</pre></div>
<p>and enter the folder dedicated to the autoscaler:</p>
<div class="highlight"><pre><span></span><span class="err">cd jupyterhub-deploy-kubernetes-jetstream/kubernetes_magnum/autoscaler</span>
</pre></div>
<h2>Setup credentials</h2>
<p>We first create the service account needed by the autoscaler to interact with the Kubernetes API:</p>
<div class="highlight"><pre><span></span>kubectl create -f cluster-autoscaler-svcaccount.yaml
</pre></div>
<p>Then we need to provide all the connection details for the autoscaler to interact with the Openstack API;
those are contained in the <code>cloud-config</code> file of our cluster, available on the master node and set up
by Magnum.
Get the <code>IP</code> of your master node from:</p>
<div class="highlight"><pre><span></span>openstack server list
<span class="nv">IP</span><span class="o">=</span>xxx.xxx.xxx.xxx
</pre></div>
<p>Now ssh into the master node and access the <code>cloud-config</code> file:</p>
<div class="highlight"><pre><span></span>ssh fedora@<span class="nv">$IP</span>
cat /etc/kubernetes/cloud-config
</pre></div>
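<p>For reference, the file has roughly this shape (a sketch: the fields follow the Magnum <code>cloud-config</code> format, all values are placeholders):</p>

```ini
[Global]
auth-url=https://xxx:5000/v3
user-id=xxx
password=xxx
trust-id=xxx
region=xxx
ca-file=/etc/kubernetes/ca-bundle.crt
```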
<p>now copy the <code>[Global]</code> section at the end of <code>cluster-autoscaler-secret.yaml</code> on the local machine.
Also remove the line of <code>ca-file</code></p>
<div class="highlight"><pre><span></span>kubectl create -f cluster-autoscaler-secret.yaml
</pre></div>
<h2>Launch the Autoscaler deployment</h2>
<p>Create the Autoscaler deployment:</p>
<div class="highlight"><pre><span></span>kubectl create -f cluster-autoscaler-deployment-master.yaml
</pre></div>
<p>Alternatively, I also added a version, <code>cluster-autoscaler-deployment.yaml</code>, for a cluster where we are not deploying pods on the master.</p>
<p>Check that the deployment is active:</p>
<div class="highlight"><pre><span></span>kubectl -n kube-system get pods
NAME DESIRED CURRENT UP-TO-DATE AVAILABLE AGE
cluster-autoscaler <span class="m">1</span> <span class="m">1</span> <span class="m">1</span> <span class="m">0</span> 10s
</pre></div>
<p>And check its logs:</p>
<div class="highlight"><pre><span></span>kubectl -n kube-system logs cluster-autoscaler-59f4cf4f4-4k4p2
I0905 <span class="m">05</span>:29:21.589062 <span class="m">1</span> leaderelection.go:217<span class="o">]</span> attempting to acquire leader lease kube-system/cluster-autoscaler...
I0905 <span class="m">05</span>:29:39.412449 <span class="m">1</span> leaderelection.go:227<span class="o">]</span> successfully acquired lease kube-system/cluster-autoscaler
I0905 <span class="m">05</span>:29:43.896557 <span class="m">1</span> magnum_manager_heat.go:293<span class="o">]</span> For stack ID 17ab3ae7-1a81-43e6-98ec-b6ffd04f91d3, stack name is k8s-lu3bksbwsln3
I0905 <span class="m">05</span>:29:44.146319 <span class="m">1</span> magnum_manager_heat.go:310<span class="o">]</span> Found nested kube_minions stack: name k8s-lu3bksbwsln3-kube_minions-r4lhlv5xuwu3, ID d0590824-cc70-4da5-b9ff-8581d99c666b
</pre></div>
<p>If you redeploy the cluster and keep an older secret, you'll see "Authentication failed" in the logs of the autoscaler pod: you need to update the secret every time you redeploy the cluster.</p>
<h2>Test the autoscaler</h2>
<p>Now we need to produce a significant load on the cluster so that the autoscaler is triggered to request Openstack Magnum to create more Virtual Machines.</p>
<p>We can create a deployment of the NGINX container (any other would work for this test):</p>
<div class="highlight"><pre><span></span>kubectl create deployment autoscaler-demo --image<span class="o">=</span>nginx
</pre></div>
<p>And then create a large number of replicas:</p>
<div class="highlight"><pre><span></span>kubectl scale deployment autoscaler-demo --replicas<span class="o">=</span><span class="m">300</span>
</pre></div>
<p>We are using 2 nodes with a large amount of memory and CPU, so they can accommodate more than 200 of those pods. The rest remain in the queue:</p>
<div class="highlight"><pre><span></span>kubectl get deployment autoscaler-demo
NAME DESIRED CURRENT UP-TO-DATE AVAILABLE AGE
autoscaler-demo <span class="m">300</span> <span class="m">300</span> <span class="m">300</span> <span class="m">213</span> 18m
</pre></div>
<p>And this triggers the autoscaler:</p>
<div class="highlight"><pre><span></span>kubectl -n kube-system logs cluster-autoscaler-59f4cf4f4-4k4p2
I0905 <span class="m">05</span>:34:47.401149 <span class="m">1</span> scale_up.go:689<span class="o">]</span> Scale-up: setting group DefaultNodeGroup size to <span class="m">2</span>
I0905 <span class="m">05</span>:34:49.267280 <span class="m">1</span> magnum_nodegroup.go:101<span class="o">]</span> Increasing size by <span class="m">1</span>, <span class="m">1</span>->2
I0905 <span class="m">05</span>:35:22.222387 <span class="m">1</span> magnum_nodegroup.go:67<span class="o">]</span> Waited <span class="k">for</span> cluster UPDATE_IN_PROGRESS status
</pre></div>
<p>Check also in the Openstack API:</p>
<div class="highlight"><pre><span></span>openstack coe cluster list
+------+------+---------+------------+--------------+--------------------+
<span class="p">|</span> uuid <span class="p">|</span> name <span class="p">|</span> keypair <span class="p">|</span> node_count <span class="p">|</span> master_count <span class="p">|</span> status <span class="p">|</span>
+------+------+---------+------------+--------------+--------------------+
<span class="p">|</span> 09fcf<span class="p">|</span> k8s <span class="p">|</span> comet <span class="p">|</span> <span class="m">2</span> <span class="p">|</span> <span class="m">1</span> <span class="p">|</span> UPDATE_IN_PROGRESS <span class="p">|</span>
+------+------+---------+------------+--------------+--------------------+
</pre></div>
<p>It takes about 4 minutes for a new VM to boot, be configured by Magnum and join the Kubernetes cluster.</p>
<p>Checking the logs again should show another line:</p>
<div class="highlight"><pre><span></span>I0912 <span class="m">17</span>:18:28.290987 <span class="m">1</span> magnum_nodegroup.go:67<span class="o">]</span> Waited <span class="k">for</span> cluster UPDATE_COMPLETE status
</pre></div>
<p>Then you should have all 3 nodes available:</p>
<div class="highlight"><pre><span></span>kubectl get nodes
NAME STATUS ROLES AGE VERSION
k8s-6bawhy45wr5t-master-0 Ready master 38m v1.11.1
k8s-6bawhy45wr5t-minion-0 Ready <none> 38m v1.11.1
k8s-6bawhy45wr5t-minion-1 Ready <none> 30m v1.11.1
</pre></div>
<p>and all 300 NGINX containers deployed:</p>
<div class="highlight"><pre><span></span>kubectl get deployments
NAME DESIRED CURRENT UP-TO-DATE AVAILABLE AGE
autoscaler-demo <span class="m">300</span> <span class="m">300</span> <span class="m">300</span> <span class="m">300</span> 35m
</pre></div>
<p>You can also test scaling down by scaling the number of NGINX containers back to only a few and checking in the logs
of the autoscaler that this triggers the scale-down process.</p>
<p>In <code>cluster-autoscaler-deployment-master.yaml</code> I have configured the scale-down process to trigger after just 1 minute, to simplify testing. For production, it is better to increase this to 10 minutes or more. Check the <a href="https://github.com/zonca/autoscaler/blob/cluster-autoscaler-1.14-magnum/cluster-autoscaler/FAQ.md">documentation of Cluster Autoscaler 1.14</a> for all other available options.</p>
<h2>Note about the Cluster Autoscaler container</h2>
<p>The Magnum provider was added in Cluster Autoscaler 1.15; however, this version is not compatible with Kubernetes 1.11, which is currently available on Jetstream. Therefore I have taken the development version of Cluster Autoscaler 1.14 and compiled it myself. I also noticed that the scale-down process was not working due to incompatible IDs when the Cloud Provider tried to look up the ID of a minion in the Stack; I am now directly using the MachineID instead of going through those indices. This version is available in <a href="https://github.com/zonca/autoscaler/tree/cluster-autoscaler-1.14-magnum">my fork of <code>autoscaler</code></a> and it is built into Docker containers on the <a href="https://cloud.docker.com/repository/docker/zonca/k8s-cluster-autoscaler-jetstream"><code>zonca/k8s-cluster-autoscaler-jetstream</code> repository on Docker Hub</a>.
The image tags are the short version of the repository git commit hash.</p>
<p>I build the container using the <code>run_gobuilder.sh</code> and <code>run_build_autoscaler_container.sh</code> scripts included in the repository.</p>
<h2>Note about images used by Magnum</h2>
<p>I have tested this deployment using the <code>Fedora-Atomic-27-20180419</code> image on Jetstream at Indiana University.
The Fedora Atomic 28 image had a long hang-up during boot and took more than 10 minutes to start; this caused timeouts in the autoscaler, and in any case it would have been too long for a user waiting to start a notebook.</p>
<p>I also tried updating the Fedora Atomic 28 image with <code>sudo atomic host upgrade</code>, and while this fixed the slow startup issue, it generated a broken Kubernetes installation, i.e. the Kubernetes services didn't detect the master node as part of the cluster: <code>kubectl get nodes</code> only showed the minion.</p>Create a Github account for your research group with free private repositories2019-08-24T15:00:00-07:002019-08-24T15:00:00-07:00Andrea Zoncatag:zonca.github.io,2019-08-24:/2019/08/github-for-research-groups.html<p><a href="https://github.com/">Github</a> allows a research group to create their own webpage where they can host, share and develop their software using the <code>git</code> version control system and the powerful Github online issue-tracking interface.</p>
<p>Github offers unlimited private and public repositories to research groups and classrooms.
Private repositories are useful for the early stages of development or if it is necessary to keep software secret before publication; at publication they can easily be switched to public repositories and free up their slot.</p>
<p>They also provide free data packs for <a href="https://git-lfs.github.com/"><code>git-lfs</code> (Large File Storage)</a>, which is useful to store large amounts of binary data together with your software in the same repository, without actually committing the files into <code>git</code> but using a support server. Just go into "Settings" for your organization and under "Billing" add data packs; you will notice that the cost is $0.</p>
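<p>After running <code>git lfs track</code> for a file pattern, the tracked files are recorded in <code>.gitattributes</code>; for example (the <code>*.hdf5</code> pattern is just an illustration):</p>

```
*.hdf5 filter=lfs diff=lfs merge=lfs -text
```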
<p>Here are the steps to set this up:</p>
<ul>
<li>Create a user account on Github and choose the free plan, use your <code>.edu</code> email address</li>
<li>Create an organization account for your research group</li>
<li>Go to <a href="https://education.github.com/">https://education.github.com/</a> and click on "Get benefits"</li>
<li>Choose your position, e.g. Researcher, and select that you want a discount for an organization</li>
<li>Choose the organization you created earlier and confirm that it is a "Research group"</li>
<li>Add details about your Research group</li>
<li>Finally you need to upload a picture of your University ID card and write how you plan on using the repositories</li>
<li>Within a week at most, but generally in less than 24 hours, you will be approved for unlimited private repositories.</li>
</ul>
<p>Once the organization is created, you can add key team members to the "Owners" group, and then create another group for students and collaborators.</p>
<p>Consider also that it is not necessary for every collaborator to have write access to your repositories. My recommendation is to ask a more experienced team member to administer the central repository, ask the students to fork the repository under their user accounts (forks of private repositories are always private, free and don't use any slot), and then <a href="https://help.github.com/articles/using-pull-requests">send a pull request</a> to the central repository for the administrator to review, discuss and merge.</p>
<p>See for example the organization account of <a href="https://github.com/dib-lab">"The Lab for Data Intensive Biology" led by Dr. C. Titus Brown</a>, where they share code, documentation and papers. Open Science!!</p>
<p>Other suggestions on the setup very welcome!</p>Create a Github account for your research group with free private repositories2019-08-24T15:00:00-07:002019-08-24T15:00:00-07:00Andrea Zoncatag:zonca.github.io,2019-08-24:/2019/08/github-for-research-groups.html<p><a href="https://github.com/">Github</a> allows a research group to create their own webpage where they can host, share and develop their software using the <code>git</code> version control system and the powerful Github online issue-tracking interface.</p>
<p>Github offers unlimited private and public repositories to research groups and classrooms.
Private repositories are useful for early …</p><p><a href="https://github.com/">Github</a> allows a research group to create their own webpage where they can host, share and develop their software using the <code>git</code> version control system and the powerful Github online issue-tracking interface.</p>
<p>Github offers unlimited private and public repositories to research groups and classrooms.
Private repositories are useful for early stages of development or if it is necessary to keep software secret before publication; at publication they can easily be switched to public repositories and free up their slot.</p>
<p>They also provide free data packs for <a href="https://git-lfs.github.com/"><code>git-lfs</code> (Large File Storage)</a>, which is useful to store large amounts of binary data together with your software in the same repository, without actually committing the files into <code>git</code> but using a support server. Just go into "Settings" for your organization and under "Billing" add data packs; you will notice that the cost is $0.</p>
<p>Here are the steps to set this up:</p>
<ul>
<li>Create a user account on Github and choose the free plan, using your <code>.edu</code> email address</li>
<li>Create an organization account for your research group</li>
<li>Go to <a href="https://education.github.com/">https://education.github.com/</a> and click on "Get benefits"</li>
<li>Choose your position, e.g. Researcher, and select that you want a discount for an organization</li>
<li>Choose the organization you created earlier and confirm that it is a "Research group"</li>
<li>Add details about your Research group</li>
<li>Finally you need to upload a picture of your University ID card and write how you plan on using the repositories</li>
<li>Within a week at most, but generally in less than 24 hours, you will be approved for unlimited private repositories.</li>
</ul>
<p>Once the organization is created, you can add key team members to the "Owners" group, and then create another group for students and collaborators.</p>
<p>Consider also that it is not necessary for every collaborator to have write access to your repositories. My recommendation is to ask a more experienced team member to administer the central repository, ask the students to fork the repository under their user accounts (forks of private repositories are always private, free and don't use any slot), and then <a href="https://help.github.com/articles/using-pull-requests">send a pull request</a> to the central repository for the administrator to review, discuss and merge.</p>
<p>See for example the organization account of <a href="https://github.com/dib-lab">"The Lab for Data Intensive Biology" led by Dr. C. Titus Brown</a>, where they share code, documentation and papers. Open Science!!</p>
<p>Other suggestions on the setup very welcome!</p>Ship large files with Python packages2019-08-21T18:00:00-07:002019-08-21T18:00:00-07:00Andrea Zoncatag:zonca.github.io,2019-08-21:/2019/08/large-files-python-packages.html<p>It is often useful to ship large data files together with a Python package,
a couple of scenarios are:</p>
<ul>
<li>data necessary to the functionality provided by the package, for example images, any binary or large text dataset, they could be either required just for a subset of the functionality of …</li></ul><p>It is often useful to ship large data files together with a Python package,
a couple of scenarios are:</p>
<ul>
<li>data necessary for the functionality provided by the package, for example images or any binary or large text dataset; they could be required for just a subset of the package's functionality or for all of it</li>
<li>data necessary for unit or integration testing, both example inputs and expected outputs</li>
</ul>
<p>If the data are collectively less than 2 GB compressed and do not change very often, a simple and slightly hacky solution is to use GitHub release assets. To each tagged release on GitHub it is possible to attach one or more assets smaller than 2 GB, so you can attach data to each release. The downside is that users need to make sure to use the correct dataset for the release they are using, and the first time they use the software they need to install the Python package and also download the dataset and install it in the right folder. See <a href="https://gist.github.com/zonca/52857f2425942725fb74595c4f8600e9">an example script to upload from the command line</a>.</p>
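<p>As a sketch of this approach (the organization, repository, tag and asset names below are made up for illustration), release asset download URLs follow a predictable pattern, so a package can build the URL for the dataset matching its own release tag:</p>

```python
def release_asset_url(org, repo, tag, asset_name):
    """Public download URL of a GitHub release asset
    (org/repo/tag/asset names here are placeholders)."""
    return f"https://github.com/{org}/{repo}/releases/download/{tag}/{asset_name}"


print(release_asset_url("myorg", "mypkg", "v1.2", "data.tar.gz"))
# → https://github.com/myorg/mypkg/releases/download/v1.2/data.tar.gz
```

The returned URL can then be passed to any downloader, e.g. <code>urllib.request.urlretrieve</code>, during the package's first-run data setup.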
<p>If data files are individually less than 10 MB and collectively less than 100 MB, you can add them directly into the Python package. This is the easiest and most convenient option; for example, the <a href="https://github.com/astropy/package-template"><code>astropy package template</code></a> automatically adds to the package any file inside the <code>packagename/data</code> folder.</p>
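<p>For a minimal sketch of how such packaged data can be read back (the package and file names here are invented for the demo), the standard library's <code>pkgutil.get_data</code> loads a file shipped inside a package wherever the package happens to be installed; the snippet builds a throwaway package on the fly just to have something to read:</p>

```python
import os
import pkgutil
import sys
import tempfile

# Build a throwaway package with a bundled data file to demonstrate access.
# In a real project the data file ships inside the installed package instead.
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "mypackage", "data"))
open(os.path.join(root, "mypackage", "__init__.py"), "w").close()
with open(os.path.join(root, "mypackage", "data", "example.txt"), "w") as f:
    f.write("bundled dataset")

sys.path.insert(0, root)
# get_data resolves the path relative to the package, wherever it is installed
payload = pkgutil.get_data("mypackage", "data/example.txt")
print(payload.decode())  # → bundled dataset
```
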
<p>For larger datasets I recommend hosting the files externally and using the <a href="http://docs.astropy.org/en/stable/utils/#module-astropy.utils.data"><code>astropy.utils.data</code> module</a>.
This module automates the process of retrieving a file from a remote server and caching it locally (in the user's home folder); the next time the user needs it, it is automatically retrieved from the cache:</p>
<div class="highlight"><pre><span></span>from astropy.utils import data

dataurl = "https://my-web-server.ucsd.edu/test-data/"
with data.conf.set_temp("dataurl", dataurl), data.conf.set_temp(
    "remote_timeout", 30
):
    local_file_path = data.get_pkg_data_filename("myfile.jpg")
</pre></div>
<p>Now we need to host these files publicly; here are a few options.</p>
<h3>Host on a dedicated GitHub repository</h3>
<p>If files are individually less than 100 MB and collectively a few GB, you can create a dedicated repository on GitHub and push your files there.
Then <a href="https://help.github.com/en/articles/what-is-github-pages">activate GitHub Pages</a> so that those files are published at <code>https://your-organization.github.io/your-repository/</code>.
Then use this URL as <code>dataurl</code> in the above script.</p>
<h3>Host on a Supercomputer or own server</h3>
<p>Some supercomputers offer the feature of providing public web access from specific folders; for example, NERSC allows users to publish web pages publicly, see <a href="https://www.nersc.gov/users/computational-systems/pdsf/software-and-tools/hosting-webpages/">their documentation</a>.</p>
<p>This is very useful for huge datasets because you can automatically detect whether the package is being run at NERSC and, in that case, access the files directly by their path instead of downloading them.</p>
<p>For example:</p>
<div class="highlight"><pre><span></span>import os
import warnings

from astropy.utils import data


def get_data_from_url(filename):
    """Retrieves input templates from the remote server;
    in case data are available in one of the PREDEFINED_DATA_FOLDERS defined above,
    e.g. at NERSC, those are directly returned."""
    for folder in PREDEFINED_DATA_FOLDERS:
        full_path = os.path.join(folder, filename)
        if os.path.exists(full_path):
            warnings.warn(f"Access data from {full_path}")
            return full_path
    with data.conf.set_temp("dataurl", DATAURL), data.conf.set_temp(
        "remote_timeout", 30
    ):
        warnings.warn(f"Retrieve data for {filename} (if not cached already)")
        map_out = data.get_pkg_data_filename(filename, show_progress=True)
    return map_out
</pre></div>
<p>A similar setup can be achieved on a GNU/Linux server, for example a powerful machine shared by all members of a scientific team, where a folder is dedicated to hosting these data and is also published online with Apache or NGINX.</p>
<p>The main downside of this approach is that there is no built-in version control. One possibility is to enforce a policy where no files are ever overwritten, so that version control is automatically achieved through filenames. Otherwise, use <a href="https://git-lfs.github.com/"><code>git lfs</code></a> in that folder to track any change in a dedicated local <code>git</code> repository, e.g.:</p>
<div class="highlight"><pre><span></span>git init
git lfs track <span class="s2">"*.fits"</span>
git add <span class="s2">"*.fits"</span>
git commit -m <span class="s2">"initial version of all FITS files"</span>
</pre></div>
<p>This method tracks the checksum of all the binary files and helps manage the history, even if only locally (make sure the folder is also regularly backed up). You could push it to GitHub; that would cost $5/month for each 50 GB of storage.</p>
<h3>Host on Figshare</h3>
<p>You can upload files to Figshare using the browser and create a dataset, which also comes with a DOI and a page where you can save metadata about this object.</p>
<p>Once you have made the dataset public, you can find out the URL of the actual file, which is of the form <code>https://ndownloader.figshare.com/files/2432432432</code>; therefore we can set <code>https://ndownloader.figshare.com/files/</code> as the repository and use the integer defined in Figshare as the filename. Using integers as filenames makes it a bit cryptic, but it has the great advantage that other people can do the uploading to Figshare and you can point to their files as easily as if they were yours. This is more convenient than alternatives where you instead need to give other people access to your file repository.</p>
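<p>As a tiny sketch of this convention (reusing the example file id from above), the download URL is just the base URL plus the integer id, which is also exactly the "filename" you would pass to the data-retrieval code shown earlier:</p>

```python
FIGSHARE_BASEURL = "https://ndownloader.figshare.com/files/"


def figshare_download_url(file_id):
    """Direct download URL of a public Figshare file, given its integer id."""
    return FIGSHARE_BASEURL + str(file_id)


print(figshare_download_url(2432432432))
# → https://ndownloader.figshare.com/files/2432432432
```
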
<h3>Host on Amazon S3 or other object store</h3>
<p>A public bucket on Amazon S3 or another object store provides cheap storage and built-in version control.
The cost is currently about $0.026/GB/month.</p>
<p>First log in to the AWS console and create a new bucket; set it public by turning off "Block all public access" and, under "Access Control List", set "List objects" to Yes for "Public access".</p>
<p>You could upload files with the browser, but for larger files the command line is better.</p>
<p>The files will be available at <a href="https://bucket-name.s3-us-west-1.amazonaws.com/">https://bucket-name.s3-us-west-1.amazonaws.com/</a>; the hostname changes based on the chosen region.</p>
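<p>A small helper can compose the public URL of an object. This is a sketch using the legacy <code>s3-region</code> host pattern shown above; newer regions use a <code>s3.region</code> pattern instead, so check your bucket's page in the console:</p>

```python
def s3_public_url(bucket, region, key):
    """Public URL of an object in a public S3 bucket, using the legacy
    bucket.s3-region.amazonaws.com host pattern from the example above."""
    return f"https://{bucket}.s3-{region}.amazonaws.com/{key}"


print(s3_public_url("bucket-name", "us-west-1", "myfile.fits"))
# → https://bucket-name.s3-us-west-1.amazonaws.com/myfile.fits
```
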
<h4>(Advanced) Upload files from the command line</h4>
<p>This is optional and requires some more familiarity with AWS.
Go back to the Identity and Access Management (IAM) section of the AWS console, then "Users", create a user, and create a policy that gives access to only 1 bucket (replace <code>bucket-name</code>):</p>
<div class="highlight"><pre><span></span><span class="p">{</span>
<span class="nt">"Version"</span><span class="p">:</span> <span class="s2">"2012-10-17"</span><span class="p">,</span>
<span class="nt">"Statement"</span><span class="p">:</span> <span class="p">[</span>
<span class="p">{</span>
<span class="nt">"Sid"</span><span class="p">:</span> <span class="s2">"ListObjectsInBucket"</span><span class="p">,</span>
<span class="nt">"Effect"</span><span class="p">:</span> <span class="s2">"Allow"</span><span class="p">,</span>
<span class="nt">"Action"</span><span class="p">:</span> <span class="p">[</span><span class="s2">"s3:ListBucket"</span><span class="p">],</span>
<span class="nt">"Resource"</span><span class="p">:</span> <span class="p">[</span><span class="s2">"arn:aws:s3:::bucket-name"</span><span class="p">]</span>
<span class="p">},</span>
<span class="p">{</span>
<span class="nt">"Sid"</span><span class="p">:</span> <span class="s2">"AllObjectActions"</span><span class="p">,</span>
<span class="nt">"Effect"</span><span class="p">:</span> <span class="s2">"Allow"</span><span class="p">,</span>
<span class="nt">"Action"</span><span class="p">:</span> <span class="p">[</span>
<span class="s2">"s3:*Object"</span><span class="p">,</span>
<span class="s2">"s3:PutObjectAcl"</span>
<span class="p">],</span>
<span class="nt">"Resource"</span><span class="p">:</span> <span class="p">[</span><span class="s2">"arn:aws:s3:::bucket-name/*"</span><span class="p">]</span>
<span class="p">}</span>
<span class="p">]</span>
<span class="p">}</span>
</pre></div>
<p>See the <a href="https://docs.aws.amazon.com/IAM/latest/UserGuide/reference_policies_examples_s3_rw-bucket.html">AWS documentation</a>.</p>
<p>Install <code>s3cmd</code>, then run <code>s3cmd --configure</code> to set it up and paste the Access and Secret keys. It will fail to test the configuration because it cannot list all the buckets; choose to save the configuration anyway.</p>
<p>Test it:</p>
<div class="highlight"><pre><span></span> s3cmd ls s3://bucket-name
</pre></div>
<p>Then upload your files (reduced redundancy is cheaper):</p>
<div class="highlight"><pre><span></span> s3cmd put --reduced-redundancy --acl-public *.fits s3://bucket-name
</pre></div>Deploy Kubernetes and JupyterHub on Jetstream with Magnum2019-06-14T00:00:00-07:002019-06-14T00:00:00-07:00Andrea Zoncatag:zonca.github.io,2019-06-14:/2019/06/kubernetes-jupyterhub-jetstream-magnum.html<p>This tutorial deploys Kubernetes on Jetstream with Magnum and then
JupyterHub on top of that using <a href="https://zero-to-jupyterhub.readthedocs.io/">zero-to-jupyterhub</a>.</p>
<p>In my <a href="https://zonca.github.io/2019/02/kubernetes-jupyterhub-jetstream-kubespray.html">previous tutorials</a> I deployed Kubernetes using Kubespray. The main driver to using Magnum is that there is support for autoscaling, i.e. create and destroy Openstack instances based on the load …</p><p>This tutorial deploys Kubernetes on Jetstream with Magnum and then
JupyterHub on top of that using <a href="https://zero-to-jupyterhub.readthedocs.io/">zero-to-jupyterhub</a>.</p>
<p>In my <a href="https://zonca.github.io/2019/02/kubernetes-jupyterhub-jetstream-kubespray.html">previous tutorials</a> I deployed Kubernetes using Kubespray. The main driver for using Magnum is its support for autoscaling, i.e. creating and destroying Openstack instances based on the load on JupyterHub. I haven't tested that yet, though; that will come in a following tutorial.</p>
<p>Magnum is a technology built into Openstack to deploy container orchestration engines based on templates. The main difference with Kubespray is that Magnum is far less configurable: the user does not have access to modify those templates and has just a number of parameters to set. Kubespray, instead, is based on <code>ansible</code> and the user has full control of how the system is set up; it also supports more High Availability features like multiple master nodes.
On the other hand, the <code>ansible</code> recipe takes a very long time to run, ~30 min, while Magnum creates a cluster in about 10 minutes.</p>
<h2>Setup access to the Jetstream API</h2>
<p>First install the OpenStack client; please use these exact versions. Also, please run at Indiana, which currently has the Rocky release of Openstack; the TACC deployment has an older release of Openstack.</p>
<div class="highlight"><pre><span></span><span class="err">pip install python-openstackclient==3.16 python-magnumclient==2.10</span>
</pre></div>
<p>Load your API credentials from <code>openrc.sh</code>, check <a href="https://iujetstream.atlassian.net/wiki/spaces/JWT/pages/39682064/Setting+up+openrc.sh">documentation of the Jetstream wiki for details</a>.</p>
<p>You need to have a keypair uploaded to Openstack, this just needs to be done once per account. See <a href="https://iujetstream.atlassian.net/wiki/spaces/JWT/pages/35913730/OpenStack+command+line">the Jetstream documentation</a> under the section "Upload SSH key - do this once".</p>
<h2>Create the cluster with Magnum</h2>
<p>As usual, check out the repository with all the configuration files on the machine you will use the Jetstream API from, typically your laptop.</p>
<div class="highlight"><pre><span></span><span class="err">git clone https://github.com/zonca/jupyterhub-deploy-kubernetes-jetstream</span>
<span class="err">cd jupyterhub-deploy-kubernetes-jetstream</span>
<span class="err">cd kubernetes_magnum</span>
</pre></div>
<p>Now we are ready to use Magnum to first create a cluster template and then the actual cluster. First edit <code>create_cluster.sh</code> and set the parameters of the cluster at the top; also make sure to set the keypair name.
Finally run:</p>
<div class="highlight"><pre><span></span><span class="err">bash create_network.sh</span>
<span class="err">bash create_template.sh</span>
<span class="err">bash create_cluster.sh</span>
</pre></div>
<p>I have set up a test cluster with only 1 master node and 1 normal node, but you can modify that later.</p>
<p>Check the status of your cluster, after about 10 minutes, it should be in state <code>CREATE_COMPLETE</code>:</p>
<div class="highlight"><pre><span></span><span class="err">openstack coe cluster show k8s</span>
</pre></div>
<h3>Configure kubectl locally</h3>
<p>Install the <code>kubectl</code> client locally; first check the version of the master node:</p>
<div class="highlight"><pre><span></span><span class="err">openstack server list # find the floating public IP of the master node (starts with 149_</span>
<span class="err">IP=149.xxx.xxx.xxx</span>
<span class="err">ssh fedora@$IP</span>
<span class="err">kubectl version</span>
</pre></div>
<p>Now install the same version following the <a href="https://kubernetes.io/docs/tasks/tools/install-kubectl/">Kubernetes documentation</a></p>
<p>Now configure <code>kubectl</code> on your laptop to connect to the Kubernetes cluster created with Magnum:</p>
<div class="highlight"><pre><span></span><span class="err">mkdir kubectl_secret</span>
<span class="err">cd kubectl_secret</span>
<span class="err">openstack coe cluster config k8s</span>
</pre></div>
<p>This downloads a configuration file and the required certificates, and prints <code>export KUBECONFIG=/absolute/path/to/config</code>.</p>
<p>See also the <code>update_kubectl_secret.sh</code> script to automate this step; it requires the environment variable to be already set up.</p>
<p>Execute that <code>export</code> line and then:</p>
<div class="highlight"><pre><span></span><span class="err">kubectl get nodes</span>
</pre></div>
<h2>Configure storage</h2>
<p>Magnum configures a provisioner that knows how to create Kubernetes volumes using Openstack Cinder,
but does not configure a <code>storageclass</code>; we can do that with:</p>
<div class="highlight"><pre><span></span><span class="err">kubectl create -f storageclass.yaml</span>
</pre></div>
<p>We can test this by creating a Persistent Volume Claim:</p>
<div class="highlight"><pre><span></span>kubectl create -f persistent_volume_claim.yaml
kubectl describe pv
kubectl describe pvc
</pre></div>
<div class="highlight"><pre><span></span>Name:            pvc-e8b93455-898b-11e9-a37c-fa163efb4609
Labels:          failure-domain.beta.kubernetes.io/zone=nova
Annotations:     kubernetes.io/createdby: cinder-dynamic-provisioner
                 pv.kubernetes.io/bound-by-controller: yes
                 pv.kubernetes.io/provisioned-by: kubernetes.io/cinder
Finalizers:      [kubernetes.io/pv-protection]
StorageClass:    standard
Status:          Bound
Claim:           default/pvc-test
Reclaim Policy:  Delete
Access Modes:    RWO
Capacity:        5Gi
Node Affinity:   &lt;none&gt;
Message:
Source:
    Type:      Cinder (a Persistent Disk resource in OpenStack)
    VolumeID:  2795724b-ef11-4053-9922-d854107c731f
    FSType:
    ReadOnly:  false
    SecretRef: nil
Events:          &lt;none&gt;
</pre></div>
<p>We can also test creating an actual pod with a persistent volume and check
that the volume is successfully mounted and the pod started:</p>
<div class="highlight"><pre><span></span><span class="err">kubectl create -f ../alpine-persistent-volume.yaml</span>
<span class="err">kubectl describe pod alpine</span>
</pre></div>
<h3>Note about availability zones</h3>
<p>By default Openstack servers and Openstack volumes are created in different availability zones. This causes an issue with the default Magnum templates because the Kubernetes scheduler policy needs to be modified to allow it. Kubespray does this by default, so I created a <a href="https://github.com/zonca/magnum/pull/1">fix to be applied to the Jetstream Magnum templates</a>; this needs to be re-applied after every Openstack upgrade.</p>
<h2>Install Helm</h2>
<p>The Kubernetes deployment from Magnum is not as complete as the one from Kubespray: we need
to set up <code>helm</code> and the NGINX ingress ourselves. We would also need to set up a system to automatically
deploy HTTPS certificates; I'll add this later on.</p>
<p>First <a href="https://helm.sh/docs/using_helm/#installing-helm">install the Helm client on your laptop</a>, make
sure you have configured <code>kubectl</code> correctly.</p>
<p>Then we need to create a service account to give Helm enough privileges to reconfigure the cluster:</p>
<div class="highlight"><pre><span></span><span class="err">kubectl create -f tiller_service_account.yaml</span>
</pre></div>
<p>Then we can create the <code>tiller</code> pod inside Kubernetes:</p>
<div class="highlight"><pre><span></span><span class="err">helm init --service-account tiller --wait --history-max 200</span>
</pre></div>
<div class="highlight"><pre><span></span><span class="err">kubectl get pods --all-namespaces</span>
<span class="err">NAMESPACE NAME READY STATUS RESTARTS AGE</span>
<span class="err">kube-system coredns-78df4bf8ff-f2xvs 1/1 Running 0 2d</span>
<span class="err">kube-system coredns-78df4bf8ff-pnj7g 1/1 Running 0 2d</span>
<span class="err">kube-system heapster-74f98f6489-xsw52 1/1 Running 0 2d</span>
<span class="err">kube-system kube-dns-autoscaler-986c49747-2m64g 1/1 Running 0 2d</span>
<span class="err">kube-system kubernetes-dashboard-54cb7b5997-c2vwx 1/1 Running 0 2d</span>
<span class="err">kube-system openstack-cloud-controller-manager-tf5mc 1/1 Running 3 2d</span>
<span class="err">kube-system tiller-deploy-6b5cd64488-4fkff 1/1 Running 0 20s</span>
</pre></div>
<p>And check that all the versions agree:</p>
<div class="highlight"><pre><span></span><span class="err">helm version</span>
<span class="c">Client: &version.Version{SemVer:"v2.11.0", GitCommit:"2e55dbe1fdb5fdb96b75ff144a339489417b146b", GitTreeState:"clean"}</span>
<span class="c">Server: &version.Version{SemVer:"v2.11.0", GitCommit:"2e55dbe1fdb5fdb96b75ff144a339489417b146b", GitTreeState:"clean"}</span>
</pre></div>
<h2>Setup NGINX ingress</h2>
<p>We need the NGINX web server to act as a front-end to the services running inside the Kubernetes cluster.</p>
<h3>Open HTTP and HTTPS ports</h3>
<p>First we need to open the HTTP and HTTPS ports on the master node. You can either connect to the Horizon interface,
create a new security group named <code>http_https</code> and add 2 rules, choosing HTTP and HTTPS in the Rule drop-down; or do it from the command line:</p>
<div class="highlight"><pre><span></span><span class="err">openstack security group create http_https</span>
<span class="err">openstack security group rule create --ingress --protocol tcp --dst-port 80 http_https </span>
<span class="err">openstack security group rule create --ingress --protocol tcp --dst-port 443 http_https</span>
</pre></div>
<p>Then find the name of the master node in <code>openstack server list</code> and add this security group to that instance:</p>
<div class="highlight"><pre><span></span><span class="err">openstack server add security group k8s-xxxxxxxxxxxx-master-0 http_https</span>
</pre></div>
<h3>Install NGINX ingress with Helm</h3>
<div class="highlight"><pre><span></span><span class="err">bash install_nginx_ingress.sh</span>
</pre></div>
<p>Note: the documentation says we should add this annotation to the ingress with <code>kubectl edit ingress -n jhub</code>, but I found out it is not necessary:</p>
<div class="highlight"><pre><span></span>metadata:
  annotations:
    kubernetes.io/ingress.class: nginx
</pre></div>
<p>If this is correctly working, you should be able to run <code>curl localhost</code> from the master node and get a <code>Default backend: 404</code> message.</p>
<h2>Install JupyterHub</h2>
<p>Finally, we can go back to the root of the repository and install JupyterHub. First create the secrets file:</p>
<div class="highlight"><pre><span></span><span class="err">bash create_secrets.sh</span>
</pre></div>
<p>Then edit <code>secrets.yaml</code> and modify the hostname under <code>hosts</code> to be the hostname of your master Jetstream instance, i.e. if your instance's public floating IP is <code>aaa.bbb.xxx.yyy</code>, the hostname should be <code>js-xxx-yyy.jetstream-cloud.org</code> (without <code>http://</code>).</p>
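<p>The mapping from floating IP to hostname can be sketched as follows (the IP in the example is made up; the function just applies the <code>js-xxx-yyy.jetstream-cloud.org</code> pattern described above):</p>

```python
def jetstream_hostname(floating_ip):
    """Public hostname of a Jetstream instance whose floating IP is aaa.bbb.xxx.yyy."""
    xxx, yyy = floating_ip.split(".")[2:]
    return f"js-{xxx}-{yyy}.jetstream-cloud.org"


print(jetstream_hostname("149.165.168.200"))
# → js-168-200.jetstream-cloud.org
```
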
<p>You should also check that connecting with your browser to <code>js-xxx-yyy.jetstream-cloud.org</code> shows <code>default backend - 404</code>; this means NGINX is also reachable from the internet, i.e. the web port is open on the master node.</p>
<p>Finally:</p>
<div class="highlight"><pre><span></span><span class="err">bash configure_helm_jupyterhub.sh</span>
<span class="err">bash install_jhub.sh</span>
</pre></div>
<p>Connect with your browser to <code>js-xxx-yyy.jetstream-cloud.org</code> to check if it works.</p>
<h2>Issues and feedback</h2>
<p>Please <a href="https://github.com/zonca/jupyterhub-deploy-kubernetes-jetstream/">open an issue on the repository</a> to report any issue or give feedback. There you can also find out what I am working on next.</p>
<h2>Acknowledgments</h2>
<p>Many thanks to Jeremy Fischer and Mike Lowe for solving all my tickets; this required a lot of work on their end to make it work.</p>Webinar about distributed computing with Python2019-05-30T15:00:00-07:002019-05-30T15:00:00-07:00Andrea Zoncatag:zonca.github.io,2019-05-30:/2019/05/webinar-python-hpc.html<p>Recording available of the webinar I gave about "Distributed computing with Python":</p>
<ul>
<li>Threads vs Processes, GIL</li>
<li>Just-In-Time compilation with Numba</li>
<li>Processing data larger than memory with Dask</li>
<li>Distributed computing with Dask</li>
</ul>
<p>Live demo on my favorite Supercomputer Comet at the San Diego Supercomputer Center.</p>
<ul>
<li><a href="https://www.sdsc.edu/Events/training/webinars/distributed_parallel_computing_with_python_2019/recording/">Webinar recording</a></li>
<li>Notebooks: <a href="https://github.com/zonca/python_hpc_tutorial">https://github.com …</a></li></ul><p>Recording available of the webinar I gave about "Distributed computing with Python":</p>
<ul>
<li>Threads vs Processes, GIL</li>
<li>Just-In-Time compilation with Numba</li>
<li>Processing data larger than memory with Dask</li>
<li>Distributed computing with Dask</li>
</ul>
<p>Live demo on my favorite Supercomputer Comet at the San Diego Supercomputer Center.</p>
<ul>
<li><a href="https://www.sdsc.edu/Events/training/webinars/distributed_parallel_computing_with_python_2019/recording/">Webinar recording</a></li>
<li>Notebooks: <a href="https://github.com/zonca/python_hpc_tutorial">https://github.com/zonca/python_hpc_tutorial</a></li>
</ul>Kubernetes monitoring with Prometheus and Grafana2019-04-20T00:00:00-07:002019-04-20T00:00:00-07:00Andrea Zoncatag:zonca.github.io,2019-04-20:/2019/04/kubernetes-monitoring-prometheus-grafana.html<p>In a production Kubernetes deployment it is necessary to monitor the status of the cluster effectively.
The Kubernetes ecosystem provides Prometheus to gather data from the different components of Kubernetes and Grafana
to access those data and provide real-time plotting and inspection capability.
Moreover, they both provide systems …</p><p>In a production Kubernetes deployment it is necessary to monitor the status of the cluster effectively.
The Kubernetes ecosystem provides Prometheus to gather data from the different components of Kubernetes and Grafana
to access those data and provide real-time plotting and inspection capability.
Moreover, they both provide systems to send alerts in case some conditions on the state of the cluster are met, e.g. using more than 90% of RAM or CPU.</p>
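<p>As a sketch of what such a condition looks like, below is a hypothetical Prometheus alerting rule that fires when a node uses more than 90% of RAM; the metric names assume the standard <code>node-exporter</code> exporter and may differ in your setup:</p>

```yaml
# Hypothetical alerting rule; metric names assume node-exporter
groups:
  - name: memory
    rules:
      - alert: HighMemoryUsage
        expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) > 0.9
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Node {{ $labels.instance }} is using more than 90% of RAM"
```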
<p>The only downside is that the pods that handle monitoring consume some resources themselves, so this could be significant for small clusters of fewer than 5 nodes or so, but shouldn't be a problem for typical larger production deployments.</p>
<p>Both Prometheus and Grafana can be installed separately with Helm recipes or via the Prometheus operator Helm recipe,
however those deployments do not include any preconfigured dashboards. It is easier to get started with the <code>kube-prometheus</code> project,
which not only installs Prometheus and Grafana, but also preconfigures about 10 different Grafana dashboards to explore in depth
the status of a Kubernetes cluster.</p>
<p>The main issue is that customizing it is really complicated: it requires modifying <code>jsonnet</code> templates and recompiling them with a <code>jsonnet</code> builder which requires <code>go</code>. However, I don't foresee the need to do that for most users.</p>
<p>Unfortunately it is not based on Helm, so you need to first checkout the repository:</p>
<div class="highlight"><pre><span></span><span class="err">git clone https://github.com/coreos/kube-prometheus</span>
</pre></div>
<p>and then follow the instructions <a href="https://github.com/coreos/kube-prometheus#quickstart">in the documentation</a>,
copied here for convenience:</p>
<div class="highlight"><pre><span></span><span class="err">kubectl create -f manifests/</span>
</pre></div>
<p>wait a moment; do not worry if some of the tasks fail, they should get fixed by running:</p>
<div class="highlight"><pre><span></span><span class="err">kubectl apply -f manifests/</span>
</pre></div>
<p>This creates several pods in the <code>monitoring</code> namespace:</p>
<div class="highlight"><pre><span></span><span class="err">kubectl get pods -n monitoring</span>
<span class="err">NAME                                   READY   STATUS    RESTARTS   AGE</span>
<span class="err">alertmanager-main-0                    2/2     Running   0          13m</span>
<span class="err">alertmanager-main-1                    2/2     Running   0          13m</span>
<span class="err">alertmanager-main-2                    2/2     Running   0          13m</span>
<span class="err">grafana-9d97dfdc7-zkfft                1/1     Running   0          14m</span>
<span class="err">kube-state-metrics-7c7979b6bc-srcvk    4/4     Running   0          12m</span>
<span class="err">node-exporter-b6n2w                    2/2     Running   0          14m</span>
<span class="err">node-exporter-cgp46                    2/2     Running   0          14m</span>
<span class="err">prometheus-adapter-b7d894c9c-z2ph7     1/1     Running   0          14m</span>
<span class="err">prometheus-k8s-0                       3/3     Running   1          13m</span>
<span class="err">prometheus-k8s-1                       3/3     Running   1          13m</span>
<span class="err">prometheus-operator-65c44fb7b7-8ltzs   1/1     Running   0          14m</span>
</pre></div>
<p>Then you can set up port forwarding to expose Grafana locally:</p>
<div class="highlight"><pre><span></span><span class="err">kubectl --namespace monitoring port-forward svc/grafana 3000</span>
</pre></div>
<p>Access <code>localhost:3000</code> with your browser and you should be able to navigate through all the statistics of your cluster,
see for example the screenshot below. The default credentials are user <code>admin</code> and password <code>admin</code>.</p>
<p><img alt="Screenshot of the Grafana UI" src="/images/grafana.png"></p>
<h2>Access the UI from a different machine</h2>
<p>In case you are running the configuration on a remote server and you would like to access the Grafana UI (or any other service) from your laptop, you can install <code>kubectl</code> on your laptop as well, then copy the <code>.kube/config</code> to the laptop with:</p>
<div class="highlight"><pre><span></span><span class="err"> scp -r KUBECTLMACHINE:~/.kube/config ~/.kube</span>
</pre></div>
<p>and run:</p>
<div class="highlight"><pre><span></span><span class="err"> ssh ubuntu@$IP -f -L 6443:localhost:6443 sleep 3h &</span>
</pre></div>
<p>from the laptop and then run the <code>port-forward</code> command locally on the laptop.</p>
<h2>Monitor JupyterHub</h2>
<p>Once we have <a href="https://zonca.github.io/2019/02/kubernetes-jupyterhub-jetstream-kubespray.html">deployed JupyterHub with Helm</a>, we can pull up the
"namespace" monitor and select the <code>jhub</code> namespace to visualize resource usage but also usage requests and limits of all pods created by JupyterHub and its users. See a screenshot below.</p>
<p><img alt="Screenshot of the Grafana namespace UI" src="/images/grafana_jhub.png"></p>
<h2>Setup alerts</h2>
<p>Grafana supports email alerts, but it needs an SMTP server, which is not easy to set up, and messages risk being filtered as spam.
The easiest way is to set up an alert to Slack, and optionally be notified via email of Slack messages.</p>
<p>Follow the <a href="https://grafana.com/docs/alerting/notifications/#slack">instructions for Slack in the Grafana documentation</a>:</p>
<ul>
<li>Create a Slack app, name it e.g. Grafana</li>
<li>Add the "Incoming webhook" feature</li>
<li>Create an incoming webhook in the workspace and channel you prefer on Slack</li>
<li>In the Grafana Alerting menu, set the incoming webhook URL and the channel name</li>
</ul>
<p><img alt="Screenshot of the Grafana slack notification" src="/images/grafana_slack.png"></p>Inherit group permission in folder2019-03-24T18:00:00-07:002019-03-24T18:00:00-07:00Andrea Zoncatag:zonca.github.io,2019-03-24:/2019/03/folder-inherit-group-permission.html<p>I have googled this so many times...</p>
<p>On shared systems, like Supercomputers, you often belong to many different Unix
groups, and that membership allows you to access data from specific projects you
are working on and you can share data with your collaborators.</p>
<p>If you set SGID on a folder …</p><p>I have googled this so many times...</p>
<p>On shared systems, like Supercomputers, you often belong to many different Unix
groups, and that membership allows you to access data from specific projects you
are working on and you can share data with your collaborators.</p>
<p>If you set SGID on a folder, any folder or file created in that folder will automatically
belong to the Unix group of that folder, instead of your default group.
First set the right group on the folder, recursively, so that existing files get
the right permissions:</p>
<div class="highlight"><pre><span></span><span class="err">chgrp -R somegroup sharedfolder</span>
</pre></div>
<p>Then you set the SGID so future files will automatically belong to <code>somegroup</code>:</p>
<div class="highlight"><pre><span></span><span class="err">chmod g+s sharedfolder</span>
</pre></div>
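<p>A quick way to verify that the SGID bit took effect is to look at the permission string: the group-execute slot shows <code>s</code> (or <code>S</code> if the group lacks execute permission). A minimal demo on a scratch folder (the path is illustrative):</p>

```shell
# Create a scratch folder and set the SGID bit on it
demo=$(mktemp -d)/sharedfolder
mkdir -p "$demo"
chmod g+s "$demo"
# The group-execute slot (7th character) now reads "s", e.g. drwxr-sr-x
stat -c '%A' "$demo"
```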
<p>This is very useful for example in the <code>/project</code> filesystem at NERSC, you can set
the SGID so that every file that is copied to the shared <code>/project</code> filesystem is
accessible by other collaborators.</p>
<p>Related to this is also the default <code>umask</code>: most systems by default give "read" permission
for the group, so setting SGID is enough; otherwise it is also necessary to configure <code>umask</code> properly.</p>Scale Kubernetes manually on Jetstream2019-02-22T21:00:00-08:002019-02-22T21:00:00-08:00Andrea Zoncatag:zonca.github.io,2019-02-22:/2019/02/scale-kubernetes-jupyterhub-manually.html<p>We would like to modify the number of Openstack virtual machines available to Kubernetes.
Ideally we would like to do this automatically based on the load on JupyterHub, that is the
target.
For now we will increase and decrease the size manually.
This can be useful for example if you …</p><p>We would like to modify the number of Openstack virtual machines available to Kubernetes.
Ideally we would like to do this automatically based on the load on JupyterHub, that is the
target.
For now we will increase and decrease the size manually.
This can be useful for example if you make a test deployment with only 1 worker node a week
before a workshop and then scale it up to 10 or more instances the day before the workshop
begins.</p>
<p>This assumes you have <a href="http://zonca.github.io/2019/02/kubernetes-jupyterhub-jetstream-kubespray.html">deployed Kubernetes and JupyterHub already</a></p>
<h2>Create a new Openstack Virtual Machine with Terraform</h2>
<p>To add nodes, enter the <code>inventory/$CLUSTER</code> folder and edit <code>cluster.tf</code>, increasing <code>number_of_k8s_nodes_no_floating_ip</code>; in my testing I have increased it from 1 to 3.</p>
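<p>For reference, the relevant part of <code>cluster.tf</code> looks roughly like this (only <code>number_of_k8s_nodes_no_floating_ip</code> is taken from the text above, the other variable is illustrative):</p>

```hcl
# inventory/$CLUSTER/cluster.tf -- illustrative excerpt
number_of_k8s_masters              = 1
number_of_k8s_nodes_no_floating_ip = 3   # was 1; scaled up for testing
```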
<p>Then run <code>terraform_apply.sh</code> again; this runs Terraform and creates the new resources:</p>
<div class="highlight"><pre><span></span><span class="err">Apply complete! Resources: 2 added, 0 changed, 0 destroyed.</span>
</pre></div>
<p>Check first that your machine has booted correctly running:</p>
<div class="highlight"><pre><span></span><span class="err">openstack server list</span>
</pre></div>
<div class="highlight"><pre><span></span><span class="err">+--------------------------------------+---------------------+--------+--------------------------------------------+-------------------------------------+----------+</span>
<span class="err">| ID                                   | Name                | Status | Networks                                   | Image                               | Flavor   |</span>
<span class="err">+--------------------------------------+---------------------+--------+--------------------------------------------+-------------------------------------+----------+</span>
<span class="err">| 4ea73e65-2bff-42c9-8c4b-6c6928ad1b77 | zonca-k8s-node-nf-3 | ACTIVE | zonca_k8s_network=10.0.0.7                 | JS-API-Featured-Ubuntu18-Dec-7-2018 | m1.small |</span>
<span class="err">| 0cf1552e-ef0c-48b0-ac24-571301809273 | zonca-k8s-node-nf-2 | ACTIVE | zonca_k8s_network=10.0.0.11                | JS-API-Featured-Ubuntu18-Dec-7-2018 | m1.small |</span>
<span class="err">| e3731cde-cf6e-4556-8bda-0eebc0c7f08e | zonca-k8s-master-1  | ACTIVE | zonca_k8s_network=10.0.0.9, xxx.xxx.xxx.xx | JS-API-Featured-Ubuntu18-Dec-7-2018 | m1.small |</span>
<span class="err">| 443c6861-1a13-4080-b5a3-e005bb34a77c | zonca-k8s-node-nf-1 | ACTIVE | zonca_k8s_network=10.0.0.3                 | JS-API-Featured-Ubuntu18-Dec-7-2018 | m1.small |</span>
<span class="err">+--------------------------------------+---------------------+--------+--------------------------------------------+-------------------------------------+----------+</span>
</pre></div>
<p>As expected we have now 1 master and 3 nodes.</p>
<p>Then change the folder to the root of the repository and check you can connect to it with:</p>
<div class="highlight"><pre><span></span><span class="err">ansible -i inventory/$CLUSTER/hosts -m ping all</span>
</pre></div>
<p>If any of the new nodes is unreachable, you can try rebooting it with:</p>
<div class="highlight"><pre><span></span><span class="err">openstack server reboot zonca-k8s-node-nf-3</span>
</pre></div>
<h3>Configure the new instances for Kubernetes</h3>
<p><code>kubespray</code> has a special playbook, <code>scale.yml</code>, that impacts the nodes
already running as little as possible.
I have created a script <code>k8s_scale.sh</code> in the root folder of my <code>jetstream_kubespray</code> repository,
launch:</p>
<div class="highlight"><pre><span></span><span class="err">bash k8s_scale.sh</span>
</pre></div>
<p><a href="https://github.com/kubernetes-sigs/kubespray/blob/master/docs/getting-started.md#adding-nodes">See for reference the <code>kubespray</code> documentation</a></p>
<p>Once this completes (re-run it if it stops at some point), you should see what Ansible modified:</p>
<div class="highlight"><pre><span></span><span class="err">zonca-k8s-master-1  : ok=25  changed=3  unreachable=0 failed=0</span>
<span class="err">zonca-k8s-node-nf-1 : ok=247 changed=16 unreachable=0 failed=0</span>
<span class="err">zonca-k8s-node-nf-2 : ok=257 changed=77 unreachable=0 failed=0</span>
<span class="err">zonca-k8s-node-nf-3 : ok=257 changed=77 unreachable=0 failed=0</span>
</pre></div>
<p>At this point you should check the nodes are seen by Kubernetes with <code>kubectl get nodes</code>:</p>
<div class="highlight"><pre><span></span><span class="err">NAME                  STATUS   ROLES    AGE     VERSION</span>
<span class="err">zonca-k8s-master-1    Ready    master   4h29m   v1.12.5</span>
<span class="err">zonca-k8s-node-nf-1   Ready    node     4h28m   v1.12.5</span>
<span class="err">zonca-k8s-node-nf-2   Ready    node     5m11s   v1.12.5</span>
<span class="err">zonca-k8s-node-nf-3   Ready    node     5m11s   v1.12.5</span>
</pre></div>
<h2>Reduce the number of nodes</h2>
<p>Kubernetes is built to be resilient to node losses, so you could just brutally delete a node with <code>openstack server delete</code>. However, there is a dedicated playbook, <code>remove-node.yml</code>, to remove a node cleanly, migrating any running services to other nodes and lowering the risk of anything malfunctioning.
I created a script <code>k8s_remove_node.sh</code>, pass the name of the node you would like to eliminate (or a comma separated list of many names):</p>
<div class="highlight"><pre><span></span><span class="err">bash k8s_remove_node.sh zonca-k8s-node-nf-3</span>
</pre></div>
<p>Now the node has disappeared from <code>kubectl get nodes</code>, but the underlying Openstack instance is still running; delete it with:</p>
<div class="highlight"><pre><span></span><span class="err">openstack server delete zonca-k8s-node-nf-3</span>
</pre></div>
<p>For consistency you could now modify <code>inventory/$CLUSTER/cluster.tf</code> and reduce the number of nodes accordingly.</p>Deploy Kubernetes with Kubespray 2.8.2 and JupyterHub with helm recipe 0.8 on Jetstream2019-02-22T18:00:00-08:002019-02-22T18:00:00-08:00Andrea Zoncatag:zonca.github.io,2019-02-22:/2019/02/kubernetes-jupyterhub-jetstream-kubespray.html<p>Back in September 2018 I published a <a href="https://zonca.github.io/2018/09/kubernetes-jetstream-kubespray-jupyterhub.html">tutorial to deploy Kubernetes on Jetstream</a> using Kubespray.</p>
<p>Software in the Kubernetes space moves very fast, so I decided to update the recipe to use the newer Kubespray 2.8.2 that deploys Kubernetes v1.12.5.</p>
<p>Please follow the old tutorial and …</p><p>Back in September 2018 I published a <a href="https://zonca.github.io/2018/09/kubernetes-jetstream-kubespray-jupyterhub.html">tutorial to deploy Kubernetes on Jetstream</a> using Kubespray.</p>
<p>Software in the Kubernetes space moves very fast, so I decided to update the recipe to use the newer Kubespray 2.8.2 that deploys Kubernetes v1.12.5.</p>
<p>Please follow the old tutorial and note the updates below.</p>
<h3>Switch to kubespray 2.8.2</h3>
<p>Once you get my fork of kubespray with a few fixes for Jetstream:</p>
<div class="highlight"><pre><span></span><span class="err">git clone https://github.com/zonca/jetstream_kubespray</span>
</pre></div>
<p><strong>switch to the newer 2.8.2 version</strong></p>
<div class="highlight"><pre><span></span><span class="err">git checkout -b branch_v2.8.2 origin/branch_v2.8.2</span>
</pre></div>
<p>See an <a href="https://github.com/zonca/jetstream_kubespray/pull/5">overview of my changes compared to the standard <code>kubespray</code> release 2.8.2</a>.</p>
<h3>Use the new template</h3>
<p>The name of my template is now just <code>zonca</code> instead of <code>zonca_kubespray</code>:</p>
<p>Before running Terraform, inside <code>jetstream_kubespray</code>, copy from my template:</p>
<div class="highlight"><pre><span></span><span class="err">export CLUSTER=$USER</span>
<span class="err">cp -LRp inventory/zonca inventory/$CLUSTER</span>
<span class="err">cd inventory/$CLUSTER</span>
</pre></div>
<h3>Explore kubernetes</h3>
<p>In case you are interested in exploring some of the capabilities of Kubernetes, you can check <a href="https://zonca.github.io/2018/09/kubernetes-jetstream-kubespray-explore.html">the second part of my tutorial</a>, nothing in this section is required to run JupyterHub.</p>
<h3>Install JupyterHub</h3>
<p>Finally you can use <code>helm</code> to install JupyterHub, see the <a href="https://zonca.github.io/2018/09/kubernetes-jetstream-kubespray-jupyterhub.html">last part of my tutorial</a>.</p>
<p>Consider that I have updated the repository <a href="https://github.com/zonca/jupyterhub-deploy-kubernetes-jetstream">https://github.com/zonca/jupyterhub-deploy-kubernetes-jetstream</a> to install the <code>0.8.0</code> version of the <code>helm</code> package just released yesterday, see <a href="https://blog.jupyter.org/zero-to-jupyterhub-helm-chart-0-8-b99e0a79fd2a">their blog post with more details</a>.</p>
<h3>Thanks</h3>
<p>Thanks to the Kubernetes, Kubespray and JupyterHub community for delivering great open-source software and to XSEDE for giving me the opportunity to work on this. Special thanks to my collaborators Julien Chastang and Rich Signell.</p>Deploy Pangeo on Kubernetes deployment on Jetstream created with Kubespray2018-12-20T01:00:00-08:002018-12-20T01:00:00-08:00Andrea Zoncatag:zonca.github.io,2018-12-20:/2018/12/kubernetes-jetstream-kubespray-pangeo.html<p>The <a href="http://pangeo.io/">Pangeo collaboration for Big Data Geoscience</a> maintains a helm
chart with a preconfigured JupyterHub deployment on Kubernetes which also supports launching
private dask workers.
This is very useful because the Jupyter Notebook users can launch a cluster of worker
containers inside Kubernetes and process larger amounts of data than …</p><p>The <a href="http://pangeo.io/">Pangeo collaboration for Big Data Geoscience</a> maintains a helm
chart with a preconfigured JupyterHub deployment on Kubernetes which also supports launching
private dask workers.
This is very useful because the Jupyter Notebook users can launch a cluster of worker
containers inside Kubernetes and process larger amounts of data than they could using only
their notebook container.</p>
<h2>Setup Kubernetes on Jetstream with Kubespray</h2>
<p>First check out my <a href="https://zonca.github.io/2018/09/kubernetes-jetstream-kubespray.html">tutorial on deploying Kubernetes on Jetstream with Kubespray</a>.
You just need to complete the first part, <strong>do not install</strong> JupyterHub, it is installed
as part of the Pangeo deployment.</p>
<p>I also recommend setting up <code>kubectl</code> and <code>helm</code> to run locally so that the following steps can be executed on the local machine, see the instructions at the bottom of the tutorial mentioned above;
otherwise you need to <code>ssh</code> into the master node and type <code>helm</code> commands there.</p>
<h2>Install Pangeo with Helm</h2>
<p>Pangeo publishes a <a href="https://github.com/pangeo-data/helm-chart">Helm chart</a> (a software package for Kubernetes) and we can leverage that
to setup the deployment.</p>
<p>First add the repository:</p>
<div class="highlight"><pre><span></span><span class="err">helm repo add pangeo https://pangeo-data.github.io/helm-chart/</span>
<span class="err">helm repo update</span>
</pre></div>
<p>Then download my repository with all the configuration files and helper scripts:</p>
<div class="highlight"><pre><span></span><span class="err">git clone https://github.com/zonca/jupyterhub-deploy-kubernetes-jetstream</span>
</pre></div>
<p>Create a <code>secrets.yaml</code> file running:</p>
<div class="highlight"><pre><span></span><span class="err">bash create_secrets.sh</span>
</pre></div>
<p>Then head to the <code>pangeo_helm</code> folder and customize <a href="https://github.com/zonca/jupyterhub-deploy-kubernetes-jetstream/blob/master/pangeo_helm/config_jupyterhub_pangeo_helm.yaml"><code>config_jupyterhub_pangeo_helm.yaml</code></a>:</p>
<ul>
<li>I have prepopulated very small limits for testing, increase those for production</li>
<li>I am using the docker image <code>zonca/pangeo_notebook_rsignell</code>, you can remove <code>image:</code> and the 2 lines below to use the standard Pangeo notebook image (defined in their <a href="https://github.com/pangeo-data/helm-chart/blob/master/pangeo/values.yaml"><code>values.yaml</code></a>)</li>
<li>Copy <code>cookieSecret</code> and <code>secretToken</code> from <code>secrets.yaml</code> you created above</li>
<li>Customize <code>ingress</code> - <code>hosts</code> with the hostname of your master instance</li>
</ul>
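<p>A hypothetical excerpt of <code>config_jupyterhub_pangeo_helm.yaml</code> combining those customizations; the exact nesting and key names depend on the Pangeo chart version, so treat this as a sketch rather than a working configuration:</p>

```yaml
# Hypothetical excerpt; verify key names against the chart's values.yaml
jupyterhub:
  singleuser:
    image:
      name: zonca/pangeo_notebook_rsignell
      tag: latest
    memory:
      limit: 1G        # small limits for testing, increase for production
      guarantee: 512M
  proxy:
    secretToken: "<secretToken from secrets.yaml>"
  ingress:
    enabled: true
    hosts:
      - js-XXX-YYY.jetstream-cloud.org
```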
<p>Finally you can deploy it running:</p>
<div class="highlight"><pre><span></span><span class="err">bash install_pangeo.sh</span>
</pre></div>
<p>Login by pointing your browser at <a href="http://js-XXX-YYY.jetstream-cloud.org">http://js-XXX-YYY.jetstream-cloud.org</a>, the default dummy authenticator only needs a username and empty password.</p>
<h2>Customize and launch dask workers</h2>
<p>Once you log in to the Jupyter Notebook, you can customize the <code>worker-template.yaml</code> file available in your home folder;
I have <a href="https://github.com/zonca/jupyterhub-deploy-kubernetes-jetstream/blob/master/pangeo_helm/worker_template.yaml">an example of it with very small limits</a> in the <code>pangeo_helm</code> folder.</p>
<p>This file is used by <code>dask_kubernetes</code> to launch workers on your behalf, see for example the <code>dask-array.ipynb</code> notebook available in your home folder:</p>
<div class="highlight"><pre><span></span><span class="kn">from</span> <span class="nn">dask_kubernetes</span> <span class="kn">import</span> <span class="n">KubeCluster</span>
<span class="n">cluster</span> <span class="o">=</span> <span class="n">KubeCluster</span><span class="p">(</span><span class="n">n_workers</span><span class="o">=</span><span class="mi">3</span><span class="p">)</span>
<span class="n">cluster</span>
</pre></div>
<p>This will launch 3 workers on the cluster which are then available to launch jobs on with <a href="https://dask.pydata.org"><code>dask</code></a>.</p>
<p>You can check with <code>kubectl</code> that the workers are executing:</p>
<div class="highlight"><pre><span></span>$ kubectl get pods -n pangeo
NAME READY STATUS RESTARTS AGE
dask-zonca-d191b7a4-d8jhft <span class="m">1</span>/1 Running <span class="m">0</span> 28m
dask-zonca-d191b7a4-dx9dhs <span class="m">1</span>/1 Running <span class="m">0</span> 28m
dask-zonca-d191b7a4-dzmgvv <span class="m">1</span>/1 Running <span class="m">0</span> 28m
hub-55f5bf597-f5bnt <span class="m">1</span>/1 Running <span class="m">0</span> 55m
jupyter-zonca <span class="m">1</span>/1 Running <span class="m">0</span> 38m
proxy-66576956d7-r926j <span class="m">1</span>/1 Running <span class="m">0</span> 55m
</pre></div>
<p>And also access the Dask GUI, using the menu on the left or the link provided by <code>dask_kubernetes</code> inside the Notebook.</p>
<p><img alt="Screenshot of the Dask UI" src="/images/dask_ui_workers.png"></p>Setup two factor authentication for UCSD, and Lastpass2018-12-12T18:00:00-08:002018-12-12T18:00:00-08:00Andrea Zoncatag:zonca.github.io,2018-12-12:/2018/12/twofactor-auth-ucsd.html<p>Starting at the end of January 2019 UCSD requires every employee to have activated
two factor authentication.</p>
<p>Go over to <a href="https://duo-registration.ucsd.edu">https://duo-registration.ucsd.edu</a> to register your devices and
<a href="https://twostep.ucsd.edu">https://twostep.ucsd.edu</a> to read more details.</p>
<p>Here some suggestions after I have used this for a few months.</p>
<p>The …</p><p>Starting at the end of January 2019 UCSD requires every employee to have activated
two factor authentication.</p>
<p>Go over to <a href="https://duo-registration.ucsd.edu">https://duo-registration.ucsd.edu</a> to register your devices and
<a href="https://twostep.ucsd.edu">https://twostep.ucsd.edu</a> to read more details.</p>
<p>Here some suggestions after I have used this for a few months.</p>
<p>The most convenient option is definitely to have the Duo application installed on
your phone, so that once you try to login it sends a notification to your phone,
you click accept and you're done.</p>
<p>Second best is to use the Duo or the Google Authenticator app to generate codes,
then you can copy those codes into the login form, and this is anyway useful for
VPN access, you choose the "2 Steps secured - allthroughucsd" option, type your
password followed by a comma and the code, otherwise just the password and get a
push notification on your primary device.</p>
<p>Then you can just add a mobile number and receive a text or add a landline and
receive a call.</p>
<p>I also recommend buying a security key and adding it as an authentication option
at <a href="https://duo-registration.ucsd.edu">https://duo-registration.ucsd.edu</a>, either <a href="https://store.google.com/product/titan_security_key_kit">Google Titan</a> or a <a href="https://www.yubico.com/products/yubikey-hardware/">Yubico key</a> (I have a Titan), you can
keep it always with you so that if you don't have your phone or the phone battery
is dead, you can plug the security key in your USB port on the laptop and click on
its button to authenticate.</p>
<p>Another option is to request a fob token, a device that generates and displays timed codes and that
is independent of a phone, see <a href="https://blink.ucsd.edu/technology/security/services/two-step-login/guide.html#token">instructions on the UCSD website</a>. They say there are only a limited number available and you need
to be prepared to justify why you are requesting one.</p>
<h2>Other services</h2>
<p>Now that you already have Duo installed on your phone, I recommend to also activate
two factor auth on all other services:</p>
<ul>
<li>XSEDE</li>
<li>NERSC</li>
<li>Google</li>
<li>Github</li>
<li>Amazon</li>
<li>Microsoft</li>
<li>Dropbox</li>
</ul>
<p>Consider that most of them just request the second step verification if you are on
a new device, so you need to do the verification just once in a while and it provides
a lot of security. Many of those also support the security key.</p>
<h2>Password handling with Lastpass</h2>
<p><strong>Update October 2019</strong>: Fed up with Lastpass (their interface is clunky and slow, both in Chrome and Android), I switched to <a href="https://bitwarden.com">Bitwarden</a>. It is way better and also allows sharing with another user; the only downside is that they do not offer Duo push 2FA for free (you need premium), but it still supports using Duo as a token generator.</p>
<p>As you are into security, just go all the way and also install a password manager.
UCSD provides free enterprise accounts for all employees, see <a href="https://blink.ucsd.edu/technology/security/services/lastpass/index.html">the details</a>.</p>
<p>With Lastpass, you just remember 1 strong password to decrypt all of your other passwords.
If you ever used the Google Chrome builtin password manager, this is way way better.</p>
<p>You install the Lastpass extension on your browsers and the Lastpass app on your phone.</p>
<p>The only issue with Lastpass is that by default the Lastpass app on the smartphone automatically
logs out every 30 minutes or so, so you have to re-authenticate very often. This is due to UCSD
having configured it too strictly. I recommend having a personal account, saving all of your passwords
in the personal account, and then linking it from the Enterprise account.
Now from the desktop/laptop browsers you can use your Enterprise account, from the smartphone app instead
use the personal account.</p>
<p>You can also automatically import your Google Chrome passwords into Lastpass.</p>
<p>Now you have no excuse to re-use the same password, automatically generate a 20 char random password and save it in Lastpass.</p>
<h3>Save one-time codes</h3>
<p>When you activate two factor auth on Google/Github and many other services, they also give you some one-time codes that you can use to login to the service if you do not have access to your phone, you can save them as "Notes" into the related account inside Lastpass.</p>
<h3>Activate 2 factor auth for Lastpass</h3>
<p>You should also activate 2 factor auth in Lastpass, it also supports Duo so the configuration is similar to the configuration for UCSD. Only issue is that they do not support a security key here, so you can only add your smartphone.</p>Deploy JupyterHub on a Supercomputer for a workshop or tutorial 2018 edition2018-11-07T11:00:00-08:002018-11-07T11:00:00-08:00Andrea Zoncatag:zonca.github.io,2018-11-07:/2018/11/jupyterhub-supercomputer.html<p>I described how to deploy JupyterHub with each user session running on a different
node of a Supercomputer in <a href="https://arxiv.org/abs/1805.04781">my paper for PEARC18</a>,
however things are moving fast in the space and I am employing a different strategy
this year, in particular relying on <a href="https://the-littlest-jupyterhub.readthedocs.io">the littlest JupyterHub project</a>
for the …</p><p>I described how to deploy JupyterHub with each user session running on a different
node of a Supercomputer in <a href="https://arxiv.org/abs/1805.04781">my paper for PEARC18</a>,
however things are moving fast in the space and I am employing a different strategy
this year, in particular relying on <a href="https://the-littlest-jupyterhub.readthedocs.io">the littlest JupyterHub project</a>
for the initial deployment.</p>
<h2>Initial deployment of JupyterHub</h2>
<p><a href="https://the-littlest-jupyterhub.readthedocs.io">The littlest JupyterHub project</a> has great documentation
on how to deploy JupyterHub working on a single server on a wide array of providers.</p>
<p>In my case I logged in to the <a href="https://dashboard.cloud.sdsc.edu/">dashboard</a> of <a href="http://www.sdsc.edu/services/ci/cloud.html">SDSC Cloud</a>, an OpenStack
deployment at the San Diego Supercomputer Center, and requested an instance with 16 GB of RAM and 6 vCPUs running Ubuntu 18.04. Make sure you attach a floating public IP to the instance and open port 22 for SSH and ports 80 and 443 for HTTP/HTTPS.</p>
<p>Then I followed the <a href="https://the-littlest-jupyterhub.readthedocs.io/en/latest/install/custom-server.html">installation tutorial for custom servers</a>. Just make sure that you first create on the virtual machine the admin user you specify in the installation script, and
use the same username as your Github account, as we will later set up Github authentication.</p>
<p>You can connect to the instance and check that JupyterHub is working, that you can log in with your user, and that you can access the admin panel;
for SDSC Cloud the address is <code>http://xxx-xxx-xxx-xxx.compute.cloud.sdsc.edu</code>, filled in with the instance's floating IP address.</p>
<h3>Setup HTTPS</h3>
<p>Follow the Littlest JupyterHub documentation on how to get an SSL certificate through Letsencrypt automatically; after this you should be able to access JupyterHub at <code>https://xxx-xxx-xxx-xxx.compute.cloud.sdsc.edu</code> or a custom domain you pointed there.</p>
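<p>The relevant commands follow the TLJH HTTPS documentation; the email and domain below are placeholders to replace with your own:</p>

```shell
# Enable automatic HTTPS certificates via Let's Encrypt (TLJH)
sudo tljh-config set https.enabled true
sudo tljh-config set https.letsencrypt.email you@example.com
sudo tljh-config add-item https.letsencrypt.domains jhub.example.org
sudo tljh-config reload proxy
```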
<h2>Authentication with Github</h2>
<p>Follow the Littlest JupyterHub documentation; just make sure to set the <code>http</code> address and not the <code>https</code> address.</p>
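<p>As a sketch of what the TLJH documentation walks you through (the client id, secret and callback domain are placeholders from your own Github OAuth application):</p>

```shell
# Configure the GitHub OAuthenticator in TLJH
sudo tljh-config set auth.GitHubOAuthenticator.client_id 'MY_CLIENT_ID'
sudo tljh-config set auth.GitHubOAuthenticator.client_secret 'MY_CLIENT_SECRET'
sudo tljh-config set auth.GitHubOAuthenticator.oauth_callback_url 'http://jhub.example.org/hub/oauth_callback'
sudo tljh-config set auth.type oauthenticator.github.GitHubOAuthenticator
sudo tljh-config reload
```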
<h2>Interface with Comet via batchspawner</h2>
<p>We want all users to run on Comet as a single "Gateway" user. As JupyterHub executes as the <code>root</code> user on the server, we create an SSH key for the <code>root</code> user and copy the public key to the home folder of the gateway user on Comet, so that we can SSH without a password.</p>
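<p>A sketch of this step, assuming a hypothetical gateway account named <code>gateway</code>; the host name is also a placeholder:</p>

```shell
# On the JupyterHub server, as root: create a key pair with no passphrase
ssh-keygen -t rsa -N "" -f /root/.ssh/id_rsa
# Append the public key to the gateway user's authorized_keys on Comet
ssh-copy-id -i /root/.ssh/id_rsa.pub gateway@comet.sdsc.edu
# Verify that password-less login now works
ssh gateway@comet.sdsc.edu hostname
```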
<p>Instead, if you would like each user to utilize their own XSEDE account, you need them to authenticate via XSEDE and get a certificate from the XSEDE API that can be used to login to Comet on behalf of the user, see <a href="https://github.com/jupyterhub/jupyterhub-deploy-hpc/tree/master/batchspawner-xsedeoauth-sshtunnel-sdsccomet">an example deployment of this</a>.</p>
<p>First install <code>batchspawner</code> with <code>pip</code> in the Python environment of the hub; this is different from the Python environment of the users. You can access it by logging in as the <code>root</code> user and running:</p>
<div class="highlight"><pre><span></span>export PATH=/opt/tljh/hub/bin:<span class="cp">${</span><span class="n">PATH</span><span class="cp">}</span>
</pre></div>
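<p>With the hub environment first in your <code>PATH</code>, the install itself is then just:</p>

```shell
export PATH=/opt/tljh/hub/bin:${PATH}
# this now resolves to the hub's own Python environment, not the users'
python3 -m pip install batchspawner
```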
<p>Set the configuration file: see <a href="https://gist.github.com/zonca/55f7949983e56088186e99db53548ded"><code>spawner.py</code> on this Gist</a> and copy it into the <code>/opt/tljh/config/jupyterhub_config.d</code> folder. Then add the private SSH key of the tunnelbot user, a user on the Virtual Machine with no shell (set <code>/bin/false</code> in <code>/etc/passwd</code>) that can nevertheless set up an SSH tunnel from Comet back to the Hub.</p>
<p>Also customize all paths and usernames in the file.</p>
<p>Reload the JupyterHub configuration with:</p>
<div class="highlight"><pre><span></span><span class="err">tljh-config reload</span>
</pre></div>
<p>You can then check the Hub logs with <code>sudo journalctl -r -u jupyterhub</code>.</p>
<p>The most complicated part is making sure that the environment variables defined by JupyterHub (the most important being the token which allows the singleuser server to authenticate itself with the Hub) are correctly propagated through SSH. See in <code>spawner.py</code> how I explicitly pass the variables over SSH.</p>
<p>Also, as all workshop participants access Comet with the same user account, I automatically create a folder named after their Github username and check out the Notebooks for the workshop into that folder, then start JupyterLab there so that the users do not interfere with each other. We are not worrying about security here: with the current setup a user can open a terminal inside JupyterLab and access the folder of another person.</p>
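<p>A minimal sketch of that per-user setup on Comet; the paths, the repository URL and the way the username reaches the script are all assumptions, the real logic lives in <code>spawner.py</code>:</p>

```shell
#!/bin/bash
# Hypothetical: the spawner passes the Github username as the first argument
GH_USER="$1"
WORKDIR="$HOME/workshop/$GH_USER"
mkdir -p "$WORKDIR"
# check out the workshop notebooks once per user (URL is a placeholder)
if [ ! -d "$WORKDIR/notebooks" ]; then
    git clone https://github.com/example/workshop-notebooks "$WORKDIR/notebooks"
fi
# batchspawner then starts the single-user server rooted in $WORKDIR
cd "$WORKDIR"
```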
<h2>How to setup the tunnelbot user</h2>
<ul>
<li>On the JupyterHub virtual machine, create a user named <code>tunnelbot</code></li>
<li><code>sudo su tunnelbot</code> to act as that user, then create a key with <code>ssh-keygen</code></li>
<li>enter the <code>.ssh</code> folder and <code>cp id_rsa.pub authorized_keys</code> so that the SSH key can be used from Comet to SSH password-less to the server</li>
<li>now get the <strong>private key</strong> from <code>/home/tunnelbot/.ssh/id_rsa</code> and paste it into <code>spawner.py</code></li>
<li>now make sure you set the shell of <code>tunnelbot</code> to <code>/bin/false</code> in <code>/etc/passwd</code></li>
<li>for increased security, please also follow the steps in <a href="https://askubuntu.com/questions/48129/how-to-create-a-restricted-ssh-user-for-port-forwarding">this Ask Ubuntu answer</a></li>
</ul>
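<p>The steps above can be condensed into a short script (assuming Ubuntu's <code>adduser</code>; run it on the JupyterHub virtual machine):</p>

```shell
# create the tunnelbot user without a password
sudo adduser --disabled-password --gecos "" tunnelbot
# generate its key pair with no passphrase
sudo -u tunnelbot mkdir -p /home/tunnelbot/.ssh
sudo -u tunnelbot ssh-keygen -t rsa -N "" -f /home/tunnelbot/.ssh/id_rsa
# allow the same key to log in as tunnelbot (for the tunnel from Comet)
sudo -u tunnelbot cp /home/tunnelbot/.ssh/id_rsa.pub /home/tunnelbot/.ssh/authorized_keys
# disable the shell: the account can only forward ports
sudo usermod -s /bin/false tunnelbot
```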
<h2>Acknowledgments</h2>
<p>Thanks to the Jupyter and JupyterHub teams for releasing great software with outstanding documentation, in particular Yuvi Panda for the simplicity and elegance in the design of the Littlest JupyterHub deployment.</p>Advanced pandas with Astrophysics example Notebook2018-10-26T18:00:00-07:002018-10-26T18:00:00-07:00Andrea Zoncatag:zonca.github.io,2018-10-26:/2018/10/pandas-astro-example.html<p>Taught a lesson today on advanced <code>python</code> and <code>pandas</code> based on an example application in Astrophysics with simulations of data from the <a href="https://en.wikipedia.org/wiki/Planck_(spacecraft)">Planck Satellite</a>, features also a Binder button to run it yourself. Jupyter Notebook available at: <a href="https://github.com/zonca/pandas-astro-example">https://github.com/zonca/pandas-astro-example</a> under CC-BY</p>Bring your computing to the San Diego Supercomputer Center2018-10-24T18:00:00-07:002018-10-24T18:00:00-07:00Andrea Zoncatag:zonca.github.io,2018-10-24:/2018/10/compute-at-sdsc.html<p>I am often asked what computing resources are available at the San Diego Supercomputer Center for scientists and what is the best way to be granted access. I decided to write a blog post with an overview of all the options, consider that I'm writing this in October 2018, so …</p><p>I am often asked what computing resources are available at the San Diego Supercomputer Center for scientists and what is the best way to be granted access. I decided to write a blog post with an overview of all the options, consider that I'm writing this in October 2018, so please cross-check on the official websites.</p>
<h2>Comet</h2>
<p>Our key resource is the <a href="http://www.sdsc.edu/support/user_guides/comet.html">Comet Supercomputer</a>, a traditional supercomputer with 2000 nodes, including 72 GPU nodes with 4 GPUs each.
Comet has powerful CPUs with 24 cores, lots of memory per node (128GB) and a very fast local flash drive on each node.
It is also suitable for running large amounts of single-node jobs, so you can exploit it even if you don't have multi-node parallel software.</p>
<p>Comet is an XSEDE resource. XSEDE is basically a consortium of many large US supercomputers dedicated to Science: it reviews applications from US scientists and grants them supercomputing resources for free. It is funded by the National Science Foundation.</p>
<h3>How to request resources on Comet</h3>
<p>The options below are ordered from the lowest to the largest amount of resources, which also means they are ordered by the amount of effort it takes to get each type of allocation.</p>
<p>Resources on Comet are billed in core hours (sometimes named SUs): if you request a Comet node for 1 hour you are charged 24 core hours. Comet GPUs are billed 14 core hours for each hour on each GPU; the newest Comet GPU nodes have P100s instead of K80s and are billed 1.5 times the older GPU nodes, i.e. 21 core hours per hour.
Comet also has a shared queue where you can request and be charged for a portion of a Comet node (you also get the proportional amount of memory), i.e. you can request 6 cores, pay only 6 core hours per hour, and get access to 32GB of RAM.</p>
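<p>As a worked example of these billing rules:</p>

```shell
# 4 hours on a full 24-core node
echo $(( 24 * 4 ))   # 96 core hours
# 4 hours on 6 cores in the shared queue
echo $((  6 * 4 ))   # 24 core hours
# 4 hours on one P100 GPU, billed at 21 core hours per GPU-hour
echo $(( 21 * 4 ))   # 84 core hours
```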
<h4>Trial allocation</h4>
<p>Anybody can request a trial allocation on Comet with a quick 1-paragraph justification and be approved within a day for 1000 core hours to be used within 6 months. This is useful to try Comet out and run some test jobs. See the <a href="https://portal.xsede.org/allocations/startup#trial">trial allocation page on the XSEDE website</a>.</p>
<h4>Campus champions</h4>
<p>Most US universities have a reference person that facilitates access to XSEDE supercomputers; it is often somebody in the Information Technology office, in the office of the Chancellor of Research, or a professor. This person is given a large amount of supercomputing hours on all XSEDE resources, and local professors, postdocs and graduate students can request to be added to this allocation and use many thousands of core hours, depending on availability.
Campus champions are currently available in 241 (!!) US institutions, <a href="https://www.xsede.org/web/site/community-engagement/campus-champions/current">see the list on the XSEDE website</a>.</p>
<h4>HPC@UC</h4>
<p>If you are at any of the University of California campuses, you have an expedited way of getting resources at SDSC.
You can submit a request for up to 1 million core hours (more often ~500K core hours) on Comet on the <a href="http://www.sdsc.edu/collaborate/hpc_at_uc.html">HPC@UC page</a>. It just requires a 3-page justification and is answered within 10 business days. You are not eligible if your research group has an active XSEDE allocation.</p>
<h4>Startup allocation</h4>
<p>Startup allocations are really quick to prepare: they just require a 1-page justification and the CV of the Principal Investigator, and grant up to 50K core hours on Comet. If your research is funded by NSF/NASA/NIH, remember to specify that. See the <a href="https://portal.xsede.org/allocations/startup">startup page on XSEDE for more details</a>.</p>
<p>They are reviewed continuously, so you should be approved within a few days. Generally you are supposed to utilize the amount of hours within 1 year, but if your science project is funded for a longer period, you can request a multi-year allocation.</p>
<h4>XRAC allocation</h4>
<p>XRAC allocations are full-fledged proposals; you can request up to a few million hours on Comet. Here you must provide a detailed justification of the resources requested and demonstrate that your software is able to efficiently scale up in parallel, i.e. if in production you want to run on 100 nodes, you should run it on 5/10/50/100 nodes and check that performance does not degrade too much with increased parallelism.
You should have performed those tests in a startup allocation.
The XRAC requests are reviewed quarterly; see the <a href="https://portal.xsede.org/allocations/research">Research allocations page</a>, where there is also a recorded webinar.</p>
<h2>Triton Shared Computing Cluster</h2>
<p>The Triton Shared Computing Cluster is a supercomputer at SDSC with specifications a bit lower than Comet's; it is not allocated through XSEDE, and resources are paid for by the users. XSEDE resources are always oversubscribed and often only a portion of the resources requested is granted, so scientific groups that do not get enough resources through XSEDE can complement them with an allocation on TSCC.</p>
<p>The easiest way to get computational hours on TSCC is a pay-as-you-go option where you buy an amount of core hours at 6 cents per core hour (academics have a lower rate based on affiliation).</p>
<p>But the most cost-effective way is to buy a node to be added to the cluster for 3 years with full hardware warranty, plus 1 extra year with no warranty (so if it breaks during that last year it needs to be removed).
You pay a fixed price to buy the node (~$6K) plus yearly operations (~$1.8K if not subsidized by your University, in UC campuses this is generally subsidized and is ~$.5K), see <a href="http://www.sdsc.edu/services/hpc/tscc-purchase.html">the updated costs on the TSCC page</a>, also get in touch with them directly for more details. You can also buy a node with GPUs.</p>
<p>Then, instead of having direct access to that node, you are given an allocation as big as the computing hours that your node provides to the cluster. This is great because it allows you not to be penalized for inconsistent usage patterns: you can pay for 1 node and then use tens of nodes together once in a while. If you have the yearly operations subsidized by campus, the cost per core hour is about 2 cents, which is quite competitive, and the cluster is in SDSC's machine room and professionally managed, updated and backed up.</p>
<h2>Colocation</h2>
<p>Larger collaborations might need dedicated resources: it is possible to buy your own nodes, in units of entire racks (48 Rack Units), which depending on the type of blades can be 12 or 24 nodes, and colocate them in SDSC's machine room. See the detailed costs on the <a href="http://www.sdsc.edu/services/it/colocation.html">colocation page</a>; this is a custom solution and it is not easy to give a simple cost estimate, so it is better to write and ask for a quote.</p>
<h2>Cloud resources (Virtual Machines)</h2>
<p>SDSC also manages an OpenStack deployment, which is especially suitable for running services, for example websites, databases and APIs, but is also suitable for long-running single-node jobs or interactive data analysis (think Jupyter Notebooks). And Kubernetes, of course! (See my <a href="https://zonca.github.io/2018/09/kubernetes-jetstream-kubespray.html">tutorial for Jetstream, which works also on SDSC Cloud</a>.)
This is equivalent to Amazon Elastic Compute Cloud (EC2): you pay for what you use. Within UC you provide a funding index and that is charged for each hour used; see the full pricing on the <a href="http://www.sdsc.edu/services/it/cloud.html">SDSC cloud page</a>. Roughly, you are charged 8 cents an hour for a Virtual Machine with 1 core and 4GB of RAM.</p>
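<p>At that rate, a rough estimate of the monthly cost of keeping such a VM always on:</p>

```shell
# $0.08/hour * 24 hours * 30 days, using the rate quoted above
awk 'BEGIN { printf "$%.2f per month\n", 0.08 * 24 * 30 }'   # prints $57.60 per month
```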
<h2>Feedback</h2>
<p>If you have questions please email me at zonca on the sdsc.edu domain or tweet @andreazonca.</p>Deploy JupyterHub on Kubernetes deployment on Jetstream created with Kubespray 3/32018-09-24T01:00:00-07:002018-09-24T01:00:00-07:00Andrea Zoncatag:zonca.github.io,2018-09-24:/2018/09/kubernetes-jetstream-kubespray-jupyterhub.html<p>All of the following assumes you are logged in to the master node of the <a href="https://zonca.github.io/2018/09/kubernetes-jetstream-kubespray.html">Kubernetes cluster deployed with kubespray</a> and checked out the repository:</p>
<p><a href="https://github.com/zonca/jupyterhub-deploy-kubernetes-jetstream">https://github.com/zonca/jupyterhub-deploy-kubernetes-jetstream</a></p>
<h2>Install Jupyterhub</h2>
<p>First run</p>
<div class="highlight"><pre><span></span><span class="err">bash create_secrets.sh</span>
</pre></div>
<p>to create the secret strings needed by JupyterHub then edit its output
<code>secrets …</code></p><p>All of the following assumes you are logged in to the master node of the <a href="https://zonca.github.io/2018/09/kubernetes-jetstream-kubespray.html">Kubernetes cluster deployed with kubespray</a> and checked out the repository:</p>
<p><a href="https://github.com/zonca/jupyterhub-deploy-kubernetes-jetstream">https://github.com/zonca/jupyterhub-deploy-kubernetes-jetstream</a></p>
<h2>Install Jupyterhub</h2>
<p>First run</p>
<div class="highlight"><pre><span></span><span class="err">bash create_secrets.sh</span>
</pre></div>
<p>to create the secret strings needed by JupyterHub then edit its output
<code>secrets.yaml</code> to make sure it is consistent, edit the <code>hosts</code> lines if needed. For example, supply the Jetstream DNS name of the master node <code>js-XXX-YYY.jetstream-cloud.org</code> (XXX and YYY are the last 2 groups of the floating IP of the instance AAA.BBB.XXX.YYY). See <a href="https://zonca.github.io/2018/09/kubernetes-jetstream-kubespray-explore.html">part 2</a>, "Publish service externally with ingress".</p>
<div class="highlight"><pre><span></span><span class="err">bash configure_helm_jupyterhub.sh</span>
<span class="err">bash install_jhub.sh</span>
</pre></div>
<p>Check some preliminary pods running with:</p>
<div class="highlight"><pre><span></span><span class="err">kubectl get pods -n jhub</span>
</pre></div>
<p>Once the <code>proxy</code> is running, even if <code>hub</code> is still in preparation, you can check
in the browser: you should get "Service Unavailable", which is a good sign that
the proxy is working.</p>
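<p>You can also check from the command line; a 503 status code while the hub is still starting is expected (the hostname is the placeholder from your <code>secrets.yaml</code>):</p>

```shell
# prints only the HTTP status code: expect 503 until the hub pod is ready
curl -s -o /dev/null -w "%{http_code}\n" http://js-XXX-YYY.jetstream-cloud.org/
```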
<h2>Customize JupyterHub</h2>
<p>After JupyterHub is deployed and integrated with Cinder for persistent volumes,
for any other customization (first of all authentication) you are in good hands, as the
<a href="https://zero-to-jupyterhub.readthedocs.io/en/stable/extending-jupyterhub.html">Zero-to-Jupyterhub documentation</a> is great.</p>
<p>The only setup that could be peculiar to the deployment on top of <code>kubespray</code> is HTTPS, see the next section.</p>
<h2>Setup HTTPS with letsencrypt</h2>
<p>Kubespray, instead of installing <code>kube-lego</code>, installs <a href="https://cert-manager.readthedocs.io/en/latest/index.html"><code>certmanager</code></a> to handle HTTPS certificates.</p>
<p>First we need to create an Issuer: set your email inside <code>setup_https_kubespray/https_issuer.yml</code> and create it with the usual:</p>
<div class="highlight"><pre><span></span><span class="err">kubectl create -f setup_https_kubespray/https_issuer.yml</span>
</pre></div>
<p>Then we can manually create a HTTPS certificate (<code>certmanager</code> can be configured to handle this automatically, but as we only need one domain this is pretty quick): edit <code>setup_https_kubespray/https_certificate.yml</code> and set the domain name of your master node, then create the certificate resource with:</p>
<div class="highlight"><pre><span></span><span class="err">kubectl create -f setup_https_kubespray/https_certificate.yml</span>
</pre></div>
<p>Finally we can configure JupyterHub to use this certificate: first edit your <code>secrets.yaml</code> following the file <code>setup_https_kubespray/example_letsencrypt_secrets.yaml</code> as an example, then update your JupyterHub configuration by running again:</p>
<div class="highlight"><pre><span></span><span class="err">bash install_jhub.sh</span>
</pre></div>
<h2>Setup HTTPS with custom certificates</h2>
<p>In case you have custom certificates for your domain, first create a secret in the jupyterhub namespace with:</p>
<div class="highlight"><pre><span></span><span class="err">kubectl create secret tls cert-secret --key ssl.key --cert ssl.crt -n jhub</span>
</pre></div>
<p>Then setup ingress to use this in <code>secrets.yaml</code>:</p>
<div class="highlight"><pre><span></span><span class="n">ingress</span><span class="o">:</span>
<span class="n">enabled</span><span class="o">:</span> <span class="kc">true</span>
<span class="n">hosts</span><span class="o">:</span>
<span class="o">-</span> <span class="n">js</span><span class="o">-</span><span class="n">XX</span><span class="o">-</span><span class="n">YYY</span><span class="o">.</span><span class="na">jetstream</span><span class="o">-</span><span class="n">cloud</span><span class="o">.</span><span class="na">org</span>
<span class="n">tls</span><span class="o">:</span>
<span class="o">-</span> <span class="n">hosts</span><span class="o">:</span>
<span class="o">-</span> <span class="n">js</span><span class="o">-</span><span class="n">XX</span><span class="o">-</span><span class="n">YYY</span><span class="o">.</span><span class="na">jetstream</span><span class="o">-</span><span class="n">cloud</span><span class="o">.</span><span class="na">org</span>
<span class="n">secretName</span><span class="o">:</span> <span class="n">cert</span><span class="o">-</span><span class="n">secret</span>
</pre></div>
<p>Eventually, you may need to update the certificate. This can be achieved with:</p>
<div class="highlight"><pre><span></span><span class="err">kubectl create secret tls cert-secret --key ssl.key --cert ssl.crt -n jhub \</span>
<span class="err"> --dry-run -o yaml | kubectl apply -f -</span>
</pre></div>
<h2>Setup custom HTTP headers</h2>
<p>After you have deployed JupyterHub, edit ingress:</p>
<div class="highlight"><pre><span></span><span class="err">kubectl edit ingress -n jhub</span>
</pre></div>
<p>Add a <code>configuration-snippet</code> line inside annotations:</p>
<div class="highlight"><pre><span></span><span class="n">metadata</span><span class="o">:</span>
<span class="n">annotations</span><span class="o">:</span>
<span class="n">kubernetes</span><span class="o">.</span><span class="na">io</span><span class="o">/</span><span class="n">tls</span><span class="o">-</span><span class="n">acme</span><span class="o">:</span> <span class="s2">"true"</span>
<span class="n">nginx</span><span class="o">.</span><span class="na">ingress</span><span class="o">.</span><span class="na">kubernetes</span><span class="o">.</span><span class="na">io</span><span class="o">/</span><span class="n">configuration</span><span class="o">-</span><span class="n">snippet</span><span class="o">:</span> <span class="o">|</span>
<span class="n">more_set_headers</span> <span class="s2">"X-Frame-Options: DENY"</span><span class="o">;</span>
<span class="n">more_set_headers</span> <span class="s2">"X-Xss-Protection: 1"</span><span class="o">;</span>
</pre></div>
<p>This doesn't require restarting or modifying any other resource.</p>
<h2>Modify the Kubernetes cluster size</h2>
<p>See a followup short tutorial on <a href="https://zonca.github.io/2019/02/scale-kubernetes-jupyterhub-manually.html">scaling Kubernetes manually</a>.</p>
<h2>Persistence of user data</h2>
<p>When a JupyterHub user logs in for the first time, a Kubernetes <code>PersistentVolumeClaim</code> of the size defined in the configuration file is created. This is a Kubernetes resource that defines a request for storage.</p>
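<p>For reference, a claim equivalent to what JupyterHub creates for a user looks roughly like this (the name and size are illustrative; <code>standard</code> is the Cinder-backed storage class set up by kubespray):</p>

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: claim-zonca
  namespace: jhub
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: standard
  resources:
    requests:
      storage: 1Gi
```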
<div class="highlight"><pre><span></span><span class="err">kubectl get pvc -n jhub</span>
<span class="err">NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE</span>
<span class="err">claim-zonca Bound pvc-c469967a-3968-11e9-aaad-fa163e9c7d08 1Gi RWO standard 2m34s</span>
<span class="err">hub-db-dir Bound pvc-353114a7-3968-11e9-aaad-fa163e9c7d08 1Gi RWO standard 6m34s</span>
</pre></div>
<p>Inspecting the claims we find out that we have a claim for the user and a claim to store the database of JupyterHub. They are already Bound because they have already been satisfied.</p>
<p>Those claims are then satisfied by our Openstack Cinder provisioner, which creates an Openstack volume and wraps it into a Kubernetes <code>PersistentVolume</code> resource:</p>
<div class="highlight"><pre><span></span><span class="err">kubectl get pv -n jhub</span>
<span class="err">NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS REASON AGE</span>
<span class="err">pvc-353114a7-3968-11e9-aaad-fa163e9c7d08 1Gi RWO Delete Bound jhub/hub-db-dir standard 8m52s</span>
<span class="err">pvc-c469967a-3968-11e9-aaad-fa163e9c7d08 1Gi RWO Delete Bound jhub/claim-zonca standard 5m4s</span>
</pre></div>
<p>This corresponds to Openstack volumes automatically mounted onto the node that is executing the user pod:</p>
<div class="highlight"><pre><span></span><span class="err">+--------------------------------------+-------------------------------------------------------------+-----------+------+----------------------------------------------+</span>
<span class="err">| ID | Name | Status | Size | Attached to |</span>
<span class="err">+--------------------------------------+-------------------------------------------------------------+-----------+------+----------------------------------------------+</span>
<span class="err">| e6eddaaa-d40d-4832-addd-a05343ec3a80 | kubernetes-dynamic-pvc-c469967a-3968-11e9-aaad-fa163e9c7d08 | in-use | 1 | Attached to zonca-k8s-node-nf-1 on /dev/sdc |</span>
<span class="err">| 00f1e822-8098-4633-804e-46ba44d7de7e | kubernetes-dynamic-pvc-353114a7-3968-11e9-aaad-fa163e9c7d08 | in-use | 1 | Attached to zonca-k8s-node-nf-1 on /dev/sdb |</span>
</pre></div>
<p>If the user disconnects, the Openstack volume is detached from the instance but it is not deleted, and it is mounted back, possibly on another instance, if the user logs back in.</p>
<h3>Delete and reinstall JupyterHub</h3>
<p>Delete the Helm release with:</p>
<div class="highlight"><pre><span></span><span class="err">helm delete --purge jhub</span>
</pre></div>
<p>As long as you do not delete the whole namespace, the volumes are not deleted; therefore you can re-deploy the same version or a newer version using <code>helm</code> and the same volume is mounted back for the user.</p>
<h3>Delete and recreate Openstack instances</h3>
<p>When we run terraform to delete all Openstack resources:</p>
<div class="highlight"><pre><span></span><span class="err">bash terraform_destroy.sh</span>
</pre></div>
<p>this does not include the Openstack volumes that are created by the Kubernetes persistent volume provisioner.</p>
<p>In case we are interested in keeping the same ip address, run instead:</p>
<div class="highlight"><pre><span></span><span class="err">bash terraform_destroy_keep_floatingip.sh</span>
</pre></div>
<p>The problem is that if we recreate Kubernetes, it doesn't know how to link the Openstack volumes to the Persistent Volumes of the users.
Therefore we need to back up the Persistent Volume and Persistent Volume Claim resources before tearing Kubernetes down:</p>
<div class="highlight"><pre><span></span><span class="err">kubectl get pvc -n jhub -o yaml > pvc.yaml</span>
<span class="err">kubectl get pv -n jhub -o yaml > pv.yaml</span>
</pre></div>
<p>I always recommend running <code>kubectl</code> on the local machine instead of the master node, because if you delete the master instance you lose any temporary modification to your scripts. In this case, even more importantly, if you are running on the master node please back up <code>pvc.yaml</code> and <code>pv.yaml</code> locally before running <code>terraform_destroy.sh</code> or they will be wiped out.</p>
<p>Then open the files with a text editor and delete the Persistent Volume and the Persistent Volume Claim related to <code>hub-db-dir</code>.</p>
<p>Edit <code>pv.yaml</code> and set:</p>
<div class="highlight"><pre><span></span><span class="err">  persistentVolumeReclaimPolicy: Retain</span>
</pre></div>
<p>Otherwise if you create the PV first, it is deleted because there is no PVC.</p>
<p>Also remove the <code>claimRef</code> section of all the volumes in <code>pv.yaml</code>, otherwise you get the error "two claims are bound to the same volume, this one is bound incorrectly" on the PVC.</p>
<p>Now we can proceed to create the cluster again and then restore the volumes with:</p>
<div class="highlight"><pre><span></span><span class="err">kubectl apply -f pv.yaml</span>
<span class="err">kubectl apply -f pvc.yaml</span>
</pre></div>
<h2>Feedback</h2>
<p>Feedback on this is very welcome, please open an issue on the <a href="https://github.com/zonca/jupyterhub-deploy-kubernetes-jetstream">Github repository</a> or email me at <code>zonca</code> on the domain of the San Diego Supercomputer Center (sdsc.edu).</p>Explore a Kubernetes deployment on Jetstream with Kubespray 2/32018-09-23T23:00:00-07:002018-09-23T23:00:00-07:00Andrea Zoncatag:zonca.github.io,2018-09-23:/2018/09/kubernetes-jetstream-kubespray-explore.html<p>This is the second part of the tutorial on deploying Kubernetes with <code>kubespray</code> and JupyterHub
on Jetstream.</p>
<p>In the <a href="https://zonca.github.io/2018/09/kubernetes-jetstream-kubespray.html">first part, we installed Kubernetes on Jetstream with <code>kubespray</code></a>.</p>
<p>It is optional, its main purpose is to familiarize with the Kubernetes deployment on Jetstream
and how the different components play together …</p><p>This is the second part of the tutorial on deploying Kubernetes with <code>kubespray</code> and JupyterHub
on Jetstream.</p>
<p>In the <a href="https://zonca.github.io/2018/09/kubernetes-jetstream-kubespray.html">first part, we installed Kubernetes on Jetstream with <code>kubespray</code></a>.</p>
<p>This part is optional; its main purpose is to familiarize you with the Kubernetes deployment on Jetstream
and how the different components play together before installing JupyterHub.
If you are already familiar with Kubernetes you can skip to the <a href="https://zonca.github.io/2018/09/kubernetes-jetstream-kubespray-jupyterhub.html">next part, where we will be installing
JupyterHub using the zerotojupyterhub helm recipe</a>.</p>
<p>All the files for the examples below are available on Github;
first SSH to the master node (or do this locally if you set up <code>kubectl</code> locally):</p>
<div class="highlight"><pre><span></span><span class="err">git clone https://github.com/zonca/jupyterhub-deploy-kubernetes-jetstream</span>
<span class="err">cd jupyterhub-deploy-kubernetes-jetstream</span>
</pre></div>
<h2>Test persistent storage with cinder</h2>
<p>The most important feature that brought me to choose <code>kubespray</code> as the method for installing Kubernetes
is that it automatically sets up persistent storage exploiting Jetstream Volumes.
The Jetstream team already does a great job in providing a persistent storage solution with adequate
redundancy via the Cinder project, part of OpenStack.</p>
<p><code>kubespray</code> sets up a Kubernetes provisioner so that when a container requests persistent storage,
it talks to the Openstack API and has a dedicated volume (the same type you can create with the
Jetstream Horizon web interface) automatically created and exposed to Kubernetes.</p>
<p>This is achieved through a storageclass:</p>
<div class="highlight"><pre><span></span><span class="err">kubectl get storageclass</span>
<span class="err">NAME PROVISIONER AGE</span>
<span class="err">standard (default) kubernetes.io/cinder 1h</span>
</pre></div>
<p>See the file <code>alpine-persistent-volume.yaml</code> in the repository on how we can request a Cinder volume
to be created and attached to a pod.</p>
<div class="highlight"><pre><span></span><span class="err">kubectl create -f alpine-persistent-volume.yaml</span>
</pre></div>
<p>We can test it by getting a terminal inside the container (<code>alpine</code> has no <code>bash</code>):</p>
<div class="highlight"><pre><span></span><span class="err">kubectl exec -it alpine -- /bin/sh</span>
</pre></div>
<p>Run <code>df -h</code> and check that there is a 5GB mounted file system, which is persistent.</p>
<p>Also, back on the machine with <code>openstack</code> access, see how an Openstack volume was dynamically created and attached to the running instance:</p>
<div class="highlight"><pre><span></span><span class="err">openstack volume list</span>
<span class="err">+--------------------------------------+-------------------------------------------------------------+--------+------+--------------------------------------------------+</span>
<span class="err">| ID | Name | Status | Size | Attached to |</span>
<span class="err">+--------------------------------------+-------------------------------------------------------------+--------+------+--------------------------------------------------+</span>
<span class="err">| 508f1ee7-9654-4c84-b1fc-76dd8751cd6e | kubernetes-dynamic-pvc-e83ec4d6-bb9f-11e8-8344-fa163eb22e63 | in-use | 5 | Attached to kubespray-k8s-node-nf-1 on /dev/sdb |</span>
<span class="err">+--------------------------------------+-------------------------------------------------------------+--------+------+--------------------------------------------------+</span>
</pre></div>
<h2>Test ReplicaSets, Services and Ingress</h2>
<p>In this section we will explore how to build redundancy and scale in a service with a
simple example included in the book <a href="https://github.com/luksa/kubernetes-in-action/tree/master/Chapter02/kubia">Kubernetes in Action</a>,
which, by the way, I highly recommend for getting started with Kubernetes.</p>
<p>First let's deploy a service in our Kubernetes cluster;
it simply answers HTTP requests on port 8080 with the message "You've hit kubia-manual":</p>
<div class="highlight"><pre><span></span><span class="err">cd kubia_test_ingress</span>
<span class="err">kubectl create -f kubia-manual.yaml</span>
</pre></div>
<p>We can test it by checking which IP Kubernetes assigned to the pod:</p>
<div class="highlight"><pre><span></span><span class="err">kubectl get pods -o wide</span>
</pre></div>
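<p>Instead of copying the IP by hand, we can extract it from the table; a sketch, shown on a captured sample line so it is self-contained (the IP and node name are hypothetical):</p>

```shell
# In a live cluster you can query the field directly:
#   kubectl get pod kubia-manual -o jsonpath='{.status.podIP}'
# Here we extract the 6th column (IP) from a sample `kubectl get pods -o wide` line:
sample='kubia-manual   1/1   Running   0   5m   10.233.90.7   k8s-node-nf-1'
KUBIA_MANUAL_IP=$(echo "$sample" | awk '{print $6}')
echo "$KUBIA_MANUAL_IP"
```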
<p>and assign it to the <code>KUBIA_MANUAL_IP</code> variable, then on one of the nodes:</p>
<div class="highlight"><pre><span></span>$ curl <span class="nv">$KUBIA_MANUAL_IP</span>:8080
You<span class="err">'</span>ve hit kubia-manual
</pre></div>
<p>Finally, delete it:</p>
<div class="highlight"><pre><span></span><span class="err">kubectl delete -f kubia-manual.yaml</span>
</pre></div>
<h3>Load balancing with ReplicaSets and Services</h3>
<p>Now we want to scale this service up and provide a set of 3 pods instead of just 1:</p>
<div class="highlight"><pre><span></span><span class="err">kubectl create -f kubia-replicaset.yaml</span>
</pre></div>
<p>Now we could access those pods on 3 different IP addresses, but we would like
a single entry point with automatic load balancing across them, so we create
a Kubernetes "Service" resource:</p>
<div class="highlight"><pre><span></span><span class="err">kubectl create -f kubia-service.yaml</span>
</pre></div>
<p>And test it:</p>
<div class="highlight"><pre><span></span>$ kubectl get service
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT<span class="o">(</span>S<span class="o">)</span> AGE
kubernetes ClusterIP <span class="m">10</span>.233.0.1 <none> <span class="m">443</span>/TCP 22h
kubia ClusterIP <span class="m">10</span>.233.28.205 <none> <span class="m">80</span>/TCP 45m
</pre></div>
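<p>The cluster IP from the table above can be captured in a variable; a sketch using a sample line (on the cluster, pipe the real <code>kubectl get service</code> output instead of <code>echo</code>, or use <code>-o jsonpath='{.spec.clusterIP}'</code>):</p>

```shell
# Extract the third column (CLUSTER-IP) of the row whose name is "kubia";
# the sample line mirrors the `kubectl get service` output above:
services='kubia   ClusterIP   10.233.28.205   <none>   80/TCP   45m'
KUBIA_SERVICE_IP=$(echo "$services" | awk '$1 == "kubia" {print $3}')
echo "$KUBIA_SERVICE_IP"
```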
<div class="highlight"><pre><span></span><span class="err">curl $KUBIA_SERVICE_IP</span>
</pre></div>
<p>The service listens on port 80, so we don't need <code>:8080</code> in the URL.
Run the command several times and check that different kubia pods answer.</p>
<h3>Publish service externally with ingress</h3>
<p>Open a browser and access the hostname of your master node at:</p>
<div class="highlight"><pre><span></span><span class="c">http://js-XXX-YYY.jetstream-cloud.org</span>
</pre></div>
<p>Here XXX-YYY are the last 2 groups of digits of the master instance's floating IP
AAA.BBB.XXX.YYY; each group can have 1, 2 or 3 digits.</p>
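<p>The hostname derivation can be scripted; a sketch, with a hypothetical floating IP:</p>

```shell
FLOATING_IP="149.165.157.93"                 # hypothetical master floating IP
XXX=$(echo "$FLOATING_IP" | cut -d. -f3)     # third group of digits
YYY=$(echo "$FLOATING_IP" | cut -d. -f4)     # fourth group of digits
MASTER_HOST="js-${XXX}-${YYY}.jetstream-cloud.org"
echo "http://${MASTER_HOST}"
```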
<p>The server should respond with a 404 error.</p>
<p>At this point, edit the <code>kubia-ingress.yaml</code> file and replace the <code>host</code> value with the master node domain name you just derived.</p>
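<p>The <code>host</code> replacement can also be done with <code>sed</code>; a sketch demonstrated on a sample file so it is self-contained (on the cluster, run the <code>sed</code> line against <code>kubia-ingress.yaml</code> instead; the hostname is hypothetical):</p>

```shell
MASTER_HOST="js-157-93.jetstream-cloud.org"   # hypothetical master hostname
# sample stand-in for kubia-ingress.yaml:
printf 'spec:\n  rules:\n  - host: placeholder.example.com\n' > sample-ingress.yaml
# replace whatever host is configured with the derived one:
sed -i "s/host: .*/host: ${MASTER_HOST}/" sample-ingress.yaml
grep 'host:' sample-ingress.yaml
```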
<p>Now:</p>
<div class="highlight"><pre><span></span><span class="err">kubectl create -f kubia-ingress.yaml</span>
<span class="err">kubectl get ingress</span>
</pre></div>
<p>Try again in the browser. You should now see something like:</p>
<p>"You've hit kubia-jqwwp"</p>
<p>Force-reload the browser page a few times and you will see that you are hitting different kubia pods.</p>
<p>Finally,</p>
<div class="highlight"><pre><span></span><span class="err">kubectl delete -f kubia-ingress.yaml</span>
</pre></div>Deploy Kubernetes on Jetstream with Kubespray 1/32018-09-23T18:00:00-07:002018-09-23T18:00:00-07:00Andrea Zoncatag:zonca.github.io,2018-09-23:/2018/09/kubernetes-jetstream-kubespray.html<p><strong>Please check the last version of this tutorial (which mostly redirects here but uses a newer <code>kubespray</code>) at <a href="https://zonca.github.io/2019/02/kubernetes-jupyterhub-jetstream-kubespray.html">https://zonca.github.io/2019/02/kubernetes-jupyterhub-jetstream-kubespray.html</a></strong></p>
<p>The purpose of this tutorial series is to deploy Jupyterhub on top of
Kubernetes on Jetstream.
This material was presented as a tutorial at the Gateways 2018 conference, see also <a href="https://figshare.com/articles/Hands-on_Tutorial_Deploying_Kubernetes_and_JupyterHub_on_Jetstream/7137884">the slides on Figshare</a>.</p>
<p>Compared to my <a href="https://zonca.github.io/2017/12/scalable-jupyterhub-kubernetes-jetstream.html">initial tutorial</a>, I focused on improving automation.
Instead of creating Jetstream instances via the Atmosphere web interface, then
SSHing into the instances and running <code>kubeadm</code>-based commands to set up Docker and Kubernetes, we will:</p>
<ul>
<li>Use the <code>terraform</code> recipe part of the <code>kubespray</code> project to interface with the Jetstream API and create a cluster of virtual machines</li>
<li>Run the <code>kubespray</code> ansible recipe to set up a production-ready Kubernetes deployment, optionally with High Availability features like redundant master nodes and much more, see <a href="http://kubespray.io">kubespray.io</a>.</li>
</ul>
<h2>Create Jetstream Virtual machines with Terraform</h2>
<p><code>kubespray</code> is able to deploy production-ready Kubernetes deployments and initially targeted only
commercial cloud platforms.</p>
<p>They recently added support for Openstack via a Terraform recipe which is available in <a href="https://github.com/kubernetes-incubator/kubespray/tree/master/contrib/terraform/openstack">their Github repository</a>.</p>
<p>Terraform allows us to execute recipes that describe a set of OpenStack resources and their relationships. In the context of this tutorial, we do not need to learn much about Terraform; we will simply configure and execute the recipe provided by <code>kubespray</code>.</p>
<h3>Requirements</h3>
<p>On Ubuntu 18.04, install <code>python3-openstackclient</code> with APT; any other platform works as well.
Also install <code>terraform</code> by copying the correct binary to <code>/usr/local/bin/</code>, see <a href="https://www.terraform.io/intro/getting-started/install.html">https://www.terraform.io/intro/getting-started/install.html</a>. The current version of the recipe requires Terraform <code>0.11.x</code>, <strong>not the newest 0.12</strong>.</p>
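<p>Installing Terraform boils down to unpacking a single binary; a sketch (0.11.14 is an assumed 0.11.x patch release, and the download itself is left commented out; check the releases page for the exact file):</p>

```shell
TF_VERSION=0.11.14                            # assumed 0.11.x patch release
TF_ZIP="terraform_${TF_VERSION}_linux_amd64.zip"
echo "https://releases.hashicorp.com/terraform/${TF_VERSION}/${TF_ZIP}"
# then, after downloading the zip printed above:
#   unzip "$TF_ZIP" && sudo mv terraform /usr/local/bin/ && terraform version
```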
<h3>Request API access</h3>
<p>In order to make sure your XSEDE account can access the Jetstream API, you need to contact the Helpdesk; see the <a href="https://iujetstream.atlassian.net/wiki/spaces/JWT/pages/39682057/Using+the+Jetstream+API">instructions on the Jetstream Wiki</a>. You will also receive your <strong>TACC</strong> password, which could be different from your XSEDE one (the username is generally the same).</p>
<p>Login to the TACC Horizon panel at <a href="https://tacc.jetstream-cloud.org/dashboard">https://tacc.jetstream-cloud.org/dashboard</a>, this is basically the low level web interface to OpenStack, a lot more complex and powerful than Atmosphere available at <a href="https://use.jetstream-cloud.org/application">https://use.jetstream-cloud.org/application</a>. Use <code>tacc</code> as domain, your TACC username (generally the same as your XSEDE username) and your TACC password.</p>
<p>First choose the right project you would like to charge to in the top dropdown menu (see the XSEDE website if you don't recognize the grant code).</p>
<p>Click on Compute / API Access and download the OpenRC V3 authentication file to your machine. Source it typing:</p>
<div class="highlight"><pre><span></span><span class="err">source XX-XXXXXXXX-openrc.sh</span>
</pre></div>
<p>It should ask for your TACC password. This configures all the environment variables needed by the <code>openstack</code> command line tool to interface with the OpenStack API.</p>
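<p>For reference, an OpenRC v3 file essentially exports variables along these lines (all values here are placeholders; the real file also prompts for your password with <code>read</code>):</p>

```shell
export OS_AUTH_URL=https://tacc.jetstream-cloud.org:5000/v3   # placeholder URL
export OS_PROJECT_ID=0123456789abcdef0123456789abcdef        # placeholder ID
export OS_USER_DOMAIN_NAME="tacc"
export OS_USERNAME="your-tacc-username"                      # placeholder
export OS_REGION_NAME="RegionOne"                            # placeholder
export OS_IDENTITY_API_VERSION=3
# the real file then prompts for the password, roughly:
#   read -sr OS_PASSWORD_INPUT; export OS_PASSWORD=$OS_PASSWORD_INPUT
```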
<p>Test with:</p>
<div class="highlight"><pre><span></span><span class="err">openstack flavor list</span>
</pre></div>
<p>This should return the list of available "sizes" of the Virtual Machines.</p>
<h3>Clone kubespray</h3>
<p>I had to make a few modifications to <code>kubespray</code> to adapt it to Jetstream and to backport bug fixes not yet merged upstream, so for now it is better to use my fork of <code>kubespray</code>:</p>
<div class="highlight"><pre><span></span><span class="err">git clone https://github.com/zonca/jetstream_kubespray</span>
</pre></div>
<p>See an <a href="https://github.com/zonca/jetstream_kubespray/pull/2">overview of my changes compared to the standard <code>kubespray</code> release 2.6.0</a>.</p>
<h3>Run Terraform</h3>
<p>Inside <code>jetstream_kubespray</code>, copy from my template:</p>
<div class="highlight"><pre><span></span><span class="err">export CLUSTER=$USER</span>
<span class="err">cp -LRp inventory/zonca_kubespray inventory/$CLUSTER</span>
<span class="err">cd inventory/$CLUSTER</span>
</pre></div>
<p>Open and modify <code>cluster.tf</code>: choose your image and the number of nodes.
Make sure to change the network name to something unique, like the expanded form of <code>$CLUSTER_network</code>.</p>
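<p>Note that braces matter when expanding the name; a sketch of the difference, since <code>$CLUSTER_network</code> on its own would be parsed as a single (undefined) variable:</p>

```shell
export CLUSTER=${USER:-demo}       # the tutorial sets this to your username
NETWORK_NAME="${CLUSTER}_network"  # braces: expand CLUSTER, then append _network
echo "$NETWORK_NAME"
# compare: "$CLUSTER_network" would look up a variable literally named CLUSTER_network
```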
<p>You can find suitable images (they need to be JS-API-Featured; you cannot use the same images used in Atmosphere) with:</p>
<div class="highlight"><pre><span></span><span class="err">openstack image list | grep "JS-API"</span>
</pre></div>
<p>I already preconfigured the network UUID both for IU and TACC, but you can cross-check it
by looking for the <code>public</code> network in:</p>
<div class="highlight"><pre><span></span><span class="err">openstack network list</span>
</pre></div>
<p>Initialize Terraform:</p>
<div class="highlight"><pre><span></span><span class="err">bash terraform_init.sh</span>
</pre></div>
<p>Create the resources:</p>
<div class="highlight"><pre><span></span><span class="err">bash terraform_apply.sh</span>
</pre></div>
<p>The last output log of Terraform should contain the floating IP of the master node as <code>k8s_master_fips</code>;
wait for the instance to boot, then SSH in with:</p>
<div class="highlight"><pre><span></span><span class="err">ssh ubuntu@$IP</span>
</pre></div>
<p>or <code>centos@$IP</code> for CentOS images.</p>
<p>Use OpenStack to inspect the resources that were created:</p>
<div class="highlight"><pre><span></span><span class="err">openstack server list</span>
<span class="err">openstack network list</span>
</pre></div>
<p>You can cleanup the virtual machines and all other Openstack resources (all data is lost) with <code>bash terraform_destroy.sh</code>.</p>
<h2>Install Kubernetes with <code>kubespray</code></h2>
<p>Change folder back to the root of the <code>jetstream_kubespray</code> repository.</p>
<p>First make sure you have a recent version of <code>ansible</code> installed; you also need additional modules,
so run:</p>
<div class="highlight"><pre><span></span><span class="err">pip install -r requirements.txt</span>
</pre></div>
<p>It is useful to create a <code>virtualenv</code> and install the packages inside it.
This will also install <code>ansible</code>; it is important to install <code>ansible</code> with <code>pip</code> so that the path to its modules is correct, so remove any pre-installed <code>ansible</code>.</p>
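<p>A sketch of the <code>virtualenv</code> setup (the environment name is arbitrary; the <code>if</code> guard only lets the sketch run outside the repository too):</p>

```shell
python3 -m venv kubespray-env          # create an isolated environment
. kubespray-env/bin/activate           # activate it in the current shell
# from the jetstream_kubespray root, install ansible and the other pinned deps:
if [ -f requirements.txt ]; then pip install -r requirements.txt; fi
```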
<p>Then following the <a href="https://github.com/kubernetes-incubator/kubespray/blob/master/contrib/terraform/openstack/README.md#ansible"><code>kubespray</code> documentation</a>, we setup <code>ssh-agent</code> so that <code>ansible</code> can SSH from the machine with public IP to the others:</p>
<div class="highlight"><pre><span></span><span class="err">eval $(ssh-agent -s)</span>
<span class="err">ssh-add ~/.ssh/id_rsa</span>
</pre></div>
<p>Test the connection through ansible:</p>
<div class="highlight"><pre><span></span><span class="err">ansible -i inventory/$CLUSTER/hosts -m ping all</span>
</pre></div>
<p>If a server does not answer the ping, first try to reboot it:</p>
<div class="highlight"><pre><span></span><span class="err">openstack server reboot $CLUSTER-k8s-node-nf-1</span>
</pre></div>
<p>Or delete it and run <code>terraform_apply.sh</code> to create it again.</p>
<p>Check <code>inventory/$CLUSTER/group_vars/all.yml</code>, in particular <code>bootstrap_os</code>: I set it to <code>ubuntu</code>; change it to <code>centos</code> if you used the CentOS 7 base image.</p>
<p>Due to a bug in the recipe, run (see details in the Troubleshooting notes below):</p>
<div class="highlight"><pre><span></span><span class="err">export OS_TENANT_ID=$OS_PROJECT_ID</span>
</pre></div>
<p>Finally run the full playbook; it is going to take a good 10 minutes:</p>
<div class="highlight"><pre><span></span><span class="err">ansible-playbook --become -i inventory/$CLUSTER/hosts cluster.yml</span>
</pre></div>
<p>If the playbook fails with "cannot lock the administrative directory", the Virtual Machine is running automatic updates and has locked the APT directory. Just wait a minute and launch the playbook again.</p>
<p>If the playbook gives any other error, retry the command above: sometimes tasks fail temporarily, and Ansible is designed to be executed multiple times with consistent results.</p>
<p>You should have now a Kubernetes cluster running, test it:</p>
<div class="highlight"><pre><span></span>$ ssh ubuntu@<span class="nv">$IP</span>
$ kubectl get pods --all-namespaces
NAMESPACE NAME READY STATUS RESTARTS AGE
cert-manager cert-manager-78fb746bc7-w9r94 <span class="m">1</span>/1 Running <span class="m">0</span> 2h
ingress-nginx default-backend-v1.4-7795cd847d-g25d8 <span class="m">1</span>/1 Running <span class="m">0</span> 2h
ingress-nginx ingress-nginx-controller-bdjq7 <span class="m">1</span>/1 Running <span class="m">0</span> 2h
kube-system kube-apiserver-zonca-kubespray-k8s-master-1 <span class="m">1</span>/1 Running <span class="m">0</span> 2h
kube-system kube-controller-manager-zonca-kubespray-k8s-master-1 <span class="m">1</span>/1 Running <span class="m">0</span> 2h
kube-system kube-dns-69f4c8fc58-6vhhs <span class="m">3</span>/3 Running <span class="m">0</span> 2h
kube-system kube-dns-69f4c8fc58-9jn25 <span class="m">3</span>/3 Running <span class="m">0</span> 2h
kube-system kube-flannel-7hd24 <span class="m">2</span>/2 Running <span class="m">0</span> 2h
kube-system kube-flannel-lhsvx <span class="m">2</span>/2 Running <span class="m">0</span> 2h
kube-system kube-proxy-zonca-kubespray-k8s-master-1 <span class="m">1</span>/1 Running <span class="m">0</span> 2h
kube-system kube-proxy-zonca-kubespray-k8s-node-nf-1 <span class="m">1</span>/1 Running <span class="m">0</span> 2h
kube-system kube-scheduler-zonca-kubespray-k8s-master-1 <span class="m">1</span>/1 Running <span class="m">0</span> 2h
kube-system kubedns-autoscaler-565b49bbc6-7wttm <span class="m">1</span>/1 Running <span class="m">0</span> 2h
kube-system kubernetes-dashboard-6d4dfd56cb-24f98 <span class="m">1</span>/1 Running <span class="m">0</span> 2h
kube-system nginx-proxy-zonca-kubespray-k8s-node-nf-1 <span class="m">1</span>/1 Running <span class="m">0</span> 2h
kube-system tiller-deploy-5c688d5f9b-fpfpg <span class="m">1</span>/1 Running <span class="m">0</span> 2h
</pre></div>
<p>Check that all those services are running in your cluster as well.
We have also configured NGINX to proxy any service that we will later deploy on Kubernetes;
test it with:</p>
<div class="highlight"><pre><span></span>$ wget localhost
--2018-09-24 <span class="m">03</span>:01:14-- http://localhost/
Resolving localhost <span class="o">(</span>localhost<span class="o">)</span>... <span class="m">127</span>.0.0.1
Connecting to localhost <span class="o">(</span>localhost<span class="o">)</span><span class="p">|</span><span class="m">127</span>.0.0.1<span class="p">|</span>:80... connected.
HTTP request sent, awaiting response... <span class="m">404</span> Not Found
<span class="m">2018</span>-09-24 <span class="m">03</span>:01:14 ERROR <span class="m">404</span>: Not Found.
</pre></div>
<p>Error 404 is a good sign: the service is up and serving requests, there is simply nothing to deliver yet.
Finally, verify that routing through the Jetstream instance works correctly by opening your browser
and checking that accessing <code>js-XX-XXX.jetstream-cloud.org</code> also returns a <code>default backend - 404</code> message.
If any of these tests hangs or cannot connect, there is probably a networking issue.</p>
<h2>Next</h2>
<p>Next you can <a href="https://zonca.github.io/2018/09/kubernetes-jetstream-kubespray-explore.html">explore the kubernetes deployment to learn more about how you deploy resources in the second part of my tutorial</a> or skip it and proceed directly to the <a href="http://zonca.github.io/2018/09/kubernetes-jetstream-kubespray-jupyterhub.html">third and final part of the tutorial and deploy Jupyterhub and configure it with HTTPS</a>.</p>
<h3>Troubleshooting notes</h3>
<p>Kept for future reference; you can disregard this section.</p>
<p>Failing ansible task: <code>openstack_tenant_id is missing</code></p>
<p>Fixed with <code>export OS_TENANT_ID=$OS_PROJECT_ID</code>; this should be resolved once <a href="https://github.com/kubernetes-incubator/kubespray/pull/2783">https://github.com/kubernetes-incubator/kubespray/pull/2783</a> is merged. In any case this is not blocking.</p>
<p>Failing task <code>Write cacert file</code>:</p>
<p>NOTE: I had to cherry-pick a commit from <a href="https://github.com/kubernetes-incubator/kubespray/pull/3280">https://github.com/kubernetes-incubator/kubespray/pull/3280</a>; this will be unnecessary once the fix lands upstream.</p>
<h2>(Optional) Setup kubectl locally</h2>
<p>We also set <code>kubectl_localhost: true</code> and <code>kubeconfig_localhost: true</code>
so that <code>kubectl</code> is installed on your local machine.</p>
<p>This also copies <code>admin.conf</code> to:</p>
<div class="highlight"><pre><span></span><span class="err">inventory/$CLUSTER/artifacts</span>
</pre></div>
<p>Now copy that file to <code>~/.kube/config</code>.</p>
<p>This file has an issue: it contains the internal IP of the Jetstream master node.
We cannot replace it with the public floating IP because the certificate is not valid for that address.
The best workaround is to replace it with <code>127.0.0.1</code> at the <code>server:</code> key inside <code>~/.kube/config</code>.
Then open an SSH tunnel:</p>
<div class="highlight"><pre><span></span><span class="err">ssh ubuntu@$IP -f -L 6443:localhost:6443 sleep 3h</span>
</pre></div>
<ul>
<li><code>-f</code> sends the process in the background</li>
<li>executing <code>sleep</code> for 3 hours makes the tunnel close automatically after 3 hours; alternatively, <code>-N</code> would keep the tunnel open indefinitely</li>
</ul>
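<p>The <code>server:</code> replacement can be sketched as follows; we demonstrate on a sample file with a hypothetical internal IP (run the same <code>sed</code> on <code>~/.kube/config</code> after backing it up):</p>

```shell
# sample stand-in for ~/.kube/config:
printf 'clusters:\n- cluster:\n    server: https://10.1.0.5:6443\n' > sample-kubeconfig
# rewrite the internal master IP so requests go through the SSH tunnel:
sed -i 's|server: https://[^:]*:6443|server: https://127.0.0.1:6443|' sample-kubeconfig
grep 'server:' sample-kubeconfig
```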
<h2>(Optional) Setup helm locally</h2>
<p>SSH into the master node and check the <code>helm</code> version with:</p>
<div class="highlight"><pre><span></span><span class="err">helm version</span>
</pre></div>
<p>Download the same binary version from <a href="https://github.com/helm/helm/releases">the release page on Github</a>
and copy the binary to <code>/usr/local/bin</code>. Then test it with:</p>
<div class="highlight"><pre><span></span><span class="err">helm ls</span>
</pre></div>PEARC18 Paper on Deploying Jupyterhub at scale on XSEDE2018-07-23T12:00:00-07:002018-07-23T12:00:00-07:00Andrea Zoncatag:zonca.github.io,2018-07-23:/2018/07/pearc18-paper-deploy-jupyterhub-xsede.html<p>Bob Sinkovits and I are presenting a paper at PEARC18 about:</p>
<p>"Deploying Jupyter Notebooks at scale on XSEDE resources for Science Gateways and workshops"</p>
<p>See the pre-print on Arxiv: <a href="https://arxiv.org/abs/1805.04781">https://arxiv.org/abs/1805.04781</a></p>
<p>Jupyter Notebooks provide an interactive computing environment well suited for Science.
JupyterHub is a multi-user …</p><p>Bob Sinkovits and I are presenting a paper at PEARC18 about:</p>
<p>"Deploying Jupyter Notebooks at scale on XSEDE resources for Science Gateways and workshops"</p>
<p>See the pre-print on Arxiv: <a href="https://arxiv.org/abs/1805.04781">https://arxiv.org/abs/1805.04781</a></p>
<p>Jupyter Notebooks provide an interactive computing environment well suited for Science.
JupyterHub is a multi-user Notebook environment developed by the Jupyter team.</p>
<p>In order to provide an adequate amount of memory and CPU to many users, for example during workshops,
it is necessary to leverage a distributed system, either spanning multiple Jetstream instances
or interfacing with a traditional HPC system.</p>
<p>In this work we present 3 strategies for deploying JupyterHub on XSEDE resources to support
a large number of users, each is linked to the step-by-step tutorial with all necessary configuration files:</p>
<ul>
<li><a href="https://zonca.github.io/2017/05/jupyterhub-hpc-batchspawner-ssh.html">deploy Jupyterhub on a single Jetstream instance and spawn Jupyter Notebook servers for each user on a computing node of a Supercomputer (for example Comet)</a></li>
<li><a href="https://zonca.github.io/2017/10/scalable-jupyterhub-docker-swarm-mode.html">deploy Jupyterhub on Jetstream using Docker Swarm to distributed the user's containers across many instances and providing persistent storage with quotas through a NFS share</a></li>
<li><a href="https://zonca.github.io/2017/12/scalable-jupyterhub-kubernetes-jetstream.html">deploy Jupyterhub on top of Kubernetes across Jetstream instances with persistent storage provided by the Ceph distributed filesystem</a></li>
</ul>
<p><a href="https://zonca.github.io/docs/pearc18_slides_zonca_sinkovits.pdf">Presentation slides</a></p>
<p>If you are an author at PEARC18, you can follow <a href="https://zonca.github.io/2018/05/pearc18-preprint-arxiv.html">my instructions on how to publish your preprint to Arxiv</a>.</p>Updated Singularity images for Comet2018-07-22T12:00:00-07:002018-07-22T12:00:00-07:00Andrea Zoncatag:zonca.github.io,2018-07-22:/2018/07/singularity-2.5-comet.html<p>Back in January 2017 I wrote a <a href="https://zonca.github.io/2017/01/singularity-hpc-comet.html">blog post about running Singularity on Comet</a>.</p>
<p>I recently needed to update all my container images to the latest scientific python packages,
so I also took the opportunity to create both a Docker auto-build repository on DockerHub
and a SingularityHub image.</p>
<p>Those images have a working MPI installation which has the same MPI version of Comet so
they can be used as a base for MPI programs.</p>
<p>The Docker image is based on the Jupyter Datascience notebook and therefore has Python, R and Julia;
the Singularity image on SingularityHub has only Python.
However, <code>singularity pull</code> also works with Docker containers, so the Docker container can easily
be turned into a Singularity container.</p>
<p>See <a href="https://github.com/zonca/singularity-comet">https://github.com/zonca/singularity-comet</a></p>Create DockerHub auto build2018-07-19T18:00:00-07:002018-07-19T18:00:00-07:00Andrea Zoncatag:zonca.github.io,2018-07-19:/2018/07/create-dockerhub-autobuild.html<p>It is very convenient to create Autobuild repositories on DockerHub linked to
a Github repository with a <code>Dockerfile</code>.
Then every time you commit to Github, Dockerhub is going to build the image on
their service and make it available on <a href="https://hub.docker.com">https://hub.docker.com</a> and can quickly
be pulled to any other system that supports Docker or Singularity.</p>
<p>Unfortunately, if you have many Github organizations and repositories, the process
of setting up a new repository gets stuck.</p>
<p>Fortunately we can bypass the issue by directly accessing the right URL, as suggested
<a href="https://stackoverflow.com/questions/42792240/dockerhub-create-automated-build-step-stuck-at-creating">on StackOverflow</a>.</p>
<p>I created a simple page to make this quicker, add the right parameters and it automatically
builds the right URL, see:</p>
<p><a href="https://zonca.github.io/docker-auto-build">https://zonca.github.io/docker-auto-build</a></p>How to organize code and data for simulations at NERSC2018-06-20T18:00:00-07:002018-06-20T18:00:00-07:00Andrea Zoncatag:zonca.github.io,2018-06-20:/2018/06/organize-code-data-simulations-nersc.html<p>I recently improved my strategy for organizing code and data for simulations run at NERSC,
I'll write it here for reference.</p>
<h2>Libraries</h2>
<p>I mostly use Python (often with C/C++ extensions), so I first rely on the Anaconda
module maintained by NERSC, currently <code>python/3.6-anaconda-4.4</code>.</p>
<p>If I need to add many more packages I can create a conda environment, but for installing
just 1 or 2 packages I prefer to add them to my <code>PYTHONPATH</code>.</p>
<p>I have core libraries that I rely on and often modify to run my simulations,
those should be installed on Global Common Software: <code>/global/common/software/projectname</code>
which is specifically designed to access small files like Python packages.
I generally create a subfolder and reference it with an environment variable:</p>
<div class="highlight"><pre><span></span><span class="err"> export PREFIX=/global/common/software/projectname/zonca/python_prefix</span>
</pre></div>
<p>Then I create an <code>env.sh</code> script in the source folder of the package (in Global Home) that loads
the environment:</p>
<div class="highlight"><pre><span></span><span class="err">module load python/3.6-anaconda-4.4</span>
<span class="err">export PREFIX=/global/common/software/projectname/zonca/python_prefix</span>
<span class="err">export PATH=$PREFIX/bin:$PATH</span>
<span class="err">export LD_LIBRARY_PATH=$PREFIX/lib:$LD_LIBRARY_PATH</span>
<span class="err">export PYTHONPATH=$PREFIX/lib/python3.6/site-packages:$PYTHONPATH</span>
</pre></div>
<p>This environment is automatically propagated to the computing nodes when I submit a SLURM script,
therefore I do not add any of these environment details to my SLURM scripts.</p>
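<p>For example, a SLURM script can then stay minimal, with no module loads or environment exports (a sketch; node count, time, and script names are placeholders):</p>

```shell
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --time=00:30:00
# no `module load` or PATH/PYTHONPATH exports here: the environment set up
# by env.sh at submission time is inherited by the job
srun python my_simulation.py params.cfg
```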
<p>Then I can install a package there with:</p>
<div class="highlight"><pre><span></span><span class="err">python setup.py install --prefix=$PREFIX</span>
</pre></div>
<p>or from pip:</p>
<div class="highlight"><pre><span></span><span class="err">pip install apackage --prefix=$PREFIX</span>
</pre></div>
<p>It is also common to install a newer version of a package which is already provided by
the base environment:</p>
<div class="highlight"><pre><span></span><span class="err">pip install apackage --ignore-installed --upgrade --no-deps --prefix=$PREFIX</span>
</pre></div>
<h2>Simulations SLURM scripts and configuration files</h2>
<p>I first create a repository on Github for my simulations and clone it to my home folder at NERSC.
I generally create a repository for each experiment, then I create a subfolder for each
type of simulation I am working on.</p>
<p>Inside a folder I create parameters files to configure my run and slurm scripts to launch the
simulations and put everything under version control immediately, I often create a Pull Request
on Github and ask my collaborators to cross-check the configuration before a submit a run.</p>
<p>Smaller input data files, even binaries, can be added for convenience to the Github repository.</p>
<p>Once a run has been validated, inside the simulation type folder I create a subfolder <code>runs/201806_details_about_run</code> and
add a <code>README.md</code>; this will include all the details about the simulation.
I also tag both the core library I depend on and the simulation repository with the same name e.g.:</p>
<div class="highlight"><pre><span></span><span class="err">git tag -a 201806_details_about_run -m "software version used for 201806_details_about_run"</span>
</pre></div>
<p>I'll also add the path at NERSC of the input data and output results.</p>
<p>Then for future simulations I'll keep modifying the SLURM scripts and parameter files but always have
a reference to each previous version.</p>
<h2>Larger input data and output data</h2>
<p>Larger input data and outputs are not suitable for version control and should live in a SCRATCH filesystem.
I always use the Global Scratch <code>$CSCRATCH</code>, which is available both on Edison and Cori and also
from the Jupyter Notebook environment at <a href="https://jupyter.nersc.gov">https://jupyter.nersc.gov</a>.</p>
<p>I create a root folder for the project at:</p>
<div class="highlight"><pre><span></span><span class="err">$CSCRATCH/projectname</span>
</pre></div>
<p>Then a subfolder for each simulation type:</p>
<div class="highlight"><pre><span></span><span class="err">$CSCRATCH/projectname/simulation_type_1</span>
<span class="err">$CSCRATCH/projectname/simulation_type_2</span>
</pre></div>
<p>Then I symlink those inside the simulation repository as the folder <code>out/</code>:</p>
<div class="highlight"><pre><span></span><span class="err">cd $HOME/projectname/simulation_type_1</span>
<span class="err">ln -s $CSCRATCH/projectname/simulation_type_1 out</span>
</pre></div>
<p>Therefore I can set up my simulation software to save all results inside <code>out/201806_details_about_run</code>
and this is going to be written to <code>CSCRATCH</code>.</p>
<p>This setup makes it very convenient to regularly backup everything to tape using <code>cput</code> which just backs up
files that are not already on tape, e.g.:</p>
<div class="highlight"><pre><span></span><span class="err">cd $CSCRATCH</span>
<span class="err">hsi</span>
<span class="err">cput -R projectname</span>
</pre></div>
<p>This is going to synchronize the backup on tape with the latest results on <code>CSCRATCH</code>.</p>
<p>I do the same for input files:</p>
<div class="highlight"><pre><span></span><span class="err">mkdir $CSCRATCH/projectname/input_simulation_type_1</span>
<span class="err">cd $HOME/projectname/simulation_type_1</span>
<span class="err">ln -s $CSCRATCH/projectname/input_simulation_type_1 input</span>
</pre></div>Setup private dask clusters in Kubernetes alongside JupyterHub on Jetstream2018-06-07T18:00:00-07:002018-06-07T18:00:00-07:00Andrea Zoncatag:zonca.github.io,2018-06-07:/2018/06/private-dask-kubernetes-jetstream.html<p>In this post we will leverage software made available by the <a href="https://pangeo-data.github.io">Pangeo community</a> to allow each user of a <a href="https://zonca.github.io/2017/12/scalable-jupyterhub-kubernetes-jetstream.html">Jupyterhub instance deployed on Jetstream on top of Kubernetes</a> to launch a set of <a href="https://dask.pydata.org"><code>dask</code></a> workers as containers running inside Kubernetes itself and use them for distributed computing.</p>
<p>Pangeo also maintains a deployment of this environment on Google Cloud freely accessible at <a href="https://pangeo.pydata.org">pangeo.pydata.org</a>.</p>
<p><strong>Security considerations</strong>: This deployment grants each user administrative access to the Kubernetes API, so each user could use this privilege to terminate other users' pods or dask workers. Therefore it is suitable only for a community of trusted users. There is <a href="https://github.com/pangeo-data/pangeo/issues/135#issuecomment-384320753">discussion about leveraging namespaces to limit this</a> but it hasn't been implemented yet.</p>
<h2>Deploy Kubernetes</h2>
<p>We need to first create Jetstream instances and deploy Kubernetes on them. We can follow the first part of the tutorial at <a href="https://zonca.github.io/2017/12/scalable-jupyterhub-kubernetes-jetstream.html">https://zonca.github.io/2017/12/scalable-jupyterhub-kubernetes-jetstream.html</a>.
I also tested with Ubuntu 18.04 instead of Ubuntu 16.04 and edited the <code>install-kubeadm.bash</code> accordingly; I also removed the version specifications in order to pick up the latest Kubernetes version, currently 1.10. See <a href="https://gist.github.com/zonca/5365fd2245462dedaf2297e0417c4662">my install-kubeadm-18.04.bash</a>.
Notice that <code>http://apt.kubernetes.io/</code> does not yet have Ubuntu 18.04 packages, so I left <code>xenial</code>; this should be updated in the future.</p>
<p>In order to simplify the setup we will just be using ephemeral storage, later we can update the deployment using either Rook following the <a href="https://zonca.github.io/2017/12/scalable-jupyterhub-kubernetes-jetstream.html">steps in my original tutorial</a> or a NFS share (I'll write a tutorial soon about that).</p>
<h2>Deploy Pangeo</h2>
<p>Deployment is a single step because Pangeo published a Helm recipe that depends on the Zero-to-JupyterHub recipe and deploys both together; therefore we <em>should not have deployed JupyterHub beforehand</em>.</p>
<p>First we need to create a <code>yaml</code> configuration file for the package.
Check out the Github repository with all the configuration files on the master node of Kubernetes:</p>
<div class="highlight"><pre><span></span><span class="err">git clone https://github.com/zonca/jupyterhub-deploy-kubernetes-jetstream</span>
</pre></div>
<p>In the <code>pangeo_helm</code> folder there is already a draft of the configuration file.</p>
<p>We need to:</p>
<ul>
<li>run <code>openssl</code> as instructed inside the file and paste the output tokens to the specified location</li>
<li>edit the hostname in the <code>ingress</code> section to the hostname of the Jetstream master node</li>
<li>customize the memory and CPU requirements, currently they are very low so that this can be tested also in a single small instance</li>
</ul>
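<p>The secret tokens requested by the configuration file are 32 random bytes encoded as 64 hex characters, which is what <code>openssl rand -hex 32</code> produces. As an alternative sketch (not the command the file itself asks for), Python's standard <code>secrets</code> module generates an equivalent token:</p>

```python
# Generate a 32-byte random token encoded as 64 hex characters,
# equivalent to `openssl rand -hex 32`.
import secrets

token = secrets.token_hex(32)
print(token)
```

<p>Paste the printed value where the configuration file asks for the token.</p>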
<p>We can then deploy with:</p>
<div class="highlight"><pre><span></span><span class="err">sudo helm install pangeo/pangeo -n pangeo --namespace pangeo -f config_pangeo_no_storage.yaml --version=v0.1.1-95ab292</span>
</pre></div>
<p>You can optionally check if there are newer versions of the chart at <a href="https://pangeo-data.github.io/helm-chart/">https://pangeo-data.github.io/helm-chart/</a>.</p>
<p>Then check that the pods start by inspecting their status with:</p>
<div class="highlight"><pre><span></span><span class="err">sudo kubectl -n pangeo get pods</span>
</pre></div>
<p>If any is stuck in Pending, check with:</p>
<div class="highlight"><pre><span></span><span class="err">sudo kubectl -n pangeo describe <pod-name></span>
</pre></div>
<p>Once the <code>hub</code> pod is running, you should be able to connect with your browser to <code>js-xxx-xxx.jetstream-cloud.org</code>. By default it runs with a dummy authenticator: at the login form, type any username and leave the password empty to log in.</p>
<h2>Launch a dask cluster</h2>
<p>Once you get the Jupyter Notebook instance, you should see a file named <code>worker-template.yaml</code> in your home folder; this is a template for the configuration and the allocated resources of each <code>dask</code> worker pod.
The default workers for Pangeo are beefy; for testing we can reduce their requirements, see for example my <a href="https://gist.github.com/zonca/21ef3125eee7af5c2548e505d47dc200">worker-template.yaml</a> that works on a small Jetstream VM.</p>
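<p>The template is a standard Kubernetes pod spec; a reduced version looks roughly like this (an illustrative fragment with an assumed image and limits, not the exact contents of the linked gist):</p>

```yaml
kind: Pod
metadata:
  labels:
    app: dask-worker
spec:
  restartPolicy: Never
  containers:
  - name: dask-worker
    # assumed image: reuse the single-user image of your deployment instead
    image: daskdev/dask:latest
    args: [dask-worker, --nthreads, '1', --memory-limit, 1GB, --death-timeout, '60']
    resources:
      limits:
        cpu: "1"
        memory: 1G
```

<p>Lowering <code>cpu</code> and <code>memory</code> here is what lets the workers fit on a small Jetstream VM.</p>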
<p>Then inside <code>examples/</code> we have several example notebooks that show how to use <code>dask</code> for distributed computing.
<code>dask-array.ipynb</code> shows basic functionality for distributed multi-dimensional arrays.</p>
<p>The most important piece of code is the creation of dask workers:</p>
<div class="highlight"><pre><span></span><span class="kn">from</span> <span class="nn">dask_kubernetes</span> <span class="kn">import</span> <span class="n">KubeCluster</span>
<span class="n">cluster</span> <span class="o">=</span> <span class="n">KubeCluster</span><span class="p">(</span><span class="n">n_workers</span><span class="o">=</span><span class="mi">2</span><span class="p">)</span>
<span class="n">cluster</span>
</pre></div>
<p>If we execute this cell, <code>dask_kubernetes</code> contacts the Kubernetes API using the <a href="https://github.com/pangeo-data/helm-chart/blob/master/pangeo/templates/dask-kubernetes-rbac.yaml">serviceaccount <code>daskkubernetes</code></a> mounted on the pods by the Helm chart and requests new pods to be launched.
In fact we can check on the terminal again with:</p>
<div class="highlight"><pre><span></span><span class="err">sudo kubectl -n pangeo get pods</span>
</pre></div>
<p>that new pods should be about to run.
It also provides buttons to change the number of running workers, either manually or adaptively based on the required resources.</p>
<p>This also runs the <code>dask</code> scheduler on the pod that is running the Jupyter Notebook and we can connect to it with:</p>
<div class="highlight"><pre><span></span><span class="kn">from</span> <span class="nn">dask.distributed</span> <span class="kn">import</span> <span class="n">Client</span>
<span class="n">client</span> <span class="o">=</span> <span class="n">Client</span><span class="p">(</span><span class="n">cluster</span><span class="p">)</span>
<span class="n">client</span>
</pre></div>
<p>From now on, all <code>dask</code> operations will automatically execute on the <code>dask</code> cluster.</p>
<h2>Customize the JupyterHub deployment</h2>
<p>We can then customize the JupyterHub deployment, for example to add authentication or permanent storage.
Notice that all configuration options inside <code>config_pangeo_no_storage.yaml</code> are nested under the <code>jupyterhub:</code> key; this is because <code>jupyterhub</code> is another Helm package which we are configuring through the <code>pangeo</code> Helm package.
Therefore make sure any configuration option found in my previous tutorials or in the <a href="https://zero-to-jupyterhub.readthedocs.io/en/latest/">Zero-to-Jupyterhub</a> documentation is indented accordingly.</p>
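<p>For example, a Zero-to-JupyterHub option such as the Github authenticator moves one level down in the file (an illustrative fragment with placeholder values):</p>

```yaml
# config_pangeo_no_storage.yaml: Zero-to-JupyterHub options are nested
# under the jupyterhub: key of the pangeo chart
jupyterhub:
  auth:
    type: github
  singleuser:
    memory:
      limit: 1G
```

<p>The same keys at the top level of the file would be silently ignored by the <code>pangeo</code> chart.</p>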
<p>Then we can either run:</p>
<div class="highlight"><pre><span></span><span class="err">sudo helm delete --purge pangeo</span>
</pre></div>
<p>and then install it from scratch again or just update the running cluster with:</p>
<div class="highlight"><pre><span></span><span class="err">sudo helm upgrade pangeo -f config_pangeo_no_storage.yaml</span>
</pre></div>How to post a PEARC18 paper pre-print to Arxiv2018-05-12T18:00:00-07:002018-05-12T18:00:00-07:00Andrea Zoncatag:zonca.github.io,2018-05-12:/2018/05/pearc18-preprint-arxiv.html<h2>Quick version</h2>
<ul>
<li>Make sure you have the DOI from ACM</li>
<li>If you have Latex: create a zip with sources, figures and <code>.bbl</code> (not <code>.bib</code>), no output PDF</li>
<li>If you have Word: export to PDF</li>
<li>Go to <a href="https://arxiv.org/submit">https://arxiv.org/submit</a></li>
<li>Choose the first option for license and "Computer Science" and …</li></ul><h2>Quick version</h2>
<ul>
<li>Make sure you have the DOI from ACM</li>
<li>If you have Latex: create a zip with sources, figures and <code>.bbl</code> (not <code>.bib</code>), no output PDF</li>
<li>If you have Word: export to PDF</li>
<li>Go to <a href="https://arxiv.org/submit">https://arxiv.org/submit</a></li>
<li>Choose the first option for license and "Computer Science" and "Distributed, Parallel, and Cluster Computing" for category</li>
<li>In Metadata set Comments as: "7 pages, 3 figures, PEARC '18: Practice and Experience in Advanced Research Computing, July 22--26, 2018, Pittsburgh, PA, USA"</li>
<li><strong>Make sure you set the DOI</strong> or you violate ACM rules</li>
<li>Follow instructions until you publish</li>
</ul>
<p>The step-by-step version follows:</p>
<h2>Why upload a pre-print to arXiv</h2>
<p>Journals provide an Open Access option, but it is very expensive; however, they generally allow authors to upload manuscripts before copy-editing to non-profit pre-print servers like the <code>arXiv</code>.
This makes your paper accessible to anybody without any Journal subscription, and you can upload your work months before the conference proceedings are available.</p>
<p>See for example the page of my PEARC18 paper on the <code>arXiv</code>: <a href="https://arxiv.org/abs/1805.04781">https://arxiv.org/abs/1805.04781</a></p>
<h2>License</h2>
<p>Before publishing any pre-print, you need to check on the Journal or Conference website
whether it is allowed and under what conditions.</p>
<p>PEARC18 in particular publishes with ACM, therefore we can look at the <a href="http://authors.acm.org/main.html">author rights page on the ACM website</a>.</p>
<p>Currently the requirements for posting a pre-print are:</p>
<ul>
<li>the paper needs to be accepted and peer-reviewed</li>
<li>this is the version by the author, before copy-editing, if any, by the journal</li>
<li>it needs a DOI pointing to the ACM version of the paper</li>
</ul>
<h2>Get a DOI</h2>
<p>A DOI is generated once the author chooses a license.
PEARC18 first authors should have received an email around May 10th with a link to the ACM
website to choose a license.
There are 3 choices. Open Access is quite expensive, but we do not need it: we are still allowed
to post the pre-print with either of the other 2 licenses. I personally recommend the
"license" option, which does not transfer copyright to ACM.
After completing this you should receive a DOI, which is a set of numbers of the form <code>10.1145/xxxxx.xxxxxx</code>.
Also remember to add the license text you will receive via email to the paper before going on with the upload.</p>
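<p>As a quick sanity check before pasting the DOI into the arXiv metadata, the expected shape can be verified with a short Python snippet (a minimal sketch: it only checks the <code>10.1145/digits.digits</code> form, not whether the DOI actually resolves):</p>

```python
import re

# ACM DOIs have the shape 10.1145/<digits>.<digits>; this only validates
# the shape, not that the DOI resolves to your paper.
ACM_DOI = re.compile(r"^10\.1145/\d+\.\d+$")

def looks_like_acm_doi(doi: str) -> bool:
    return bool(ACM_DOI.match(doi))

print(looks_like_acm_doi("10.1145/1234567.1234567"))  # True: well-formed
print(looks_like_acm_doi("1234567.1234567"))          # False: missing prefix
```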
<h2>Prepare your Latex submission</h2>
<p>The arXiv requires the source for any Latex paper.
If you are using the online platform <a href="https://overleaf.com">Overleaf</a>, click on "Project" and then "Download as zip" at the bottom.
If you are using anything else, create a zip file with all the paper sources and figures, <em>not the output PDF</em>. Also make sure that you include the <code>.bbl</code> file, not the <code>.bib</code>: compile your paper locally and add just the <code>.bbl</code> to the archive.
Also, the arXiv dislikes large figures, so if you already know you have some, resize them or lower their quality before submission; alternatively you can submit as-is and check whether they are accepted.</p>
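<p>The packing rules above can be scripted; here is a sketch using only the Python standard library (the extension whitelist is an assumption to adapt to your project, and figures exported as PDF would need special handling so that the compiled paper itself is not shipped):</p>

```python
import zipfile
from pathlib import Path

# File types to ship to arXiv: Latex sources, the precompiled bibliography
# (.bbl, not .bib), class/style files and common figure formats; the output
# PDF and build artifacts are left out, per the rules above.
KEEP = {".tex", ".bbl", ".cls", ".sty", ".bst", ".png", ".jpg", ".eps"}

def pack_for_arxiv(src_dir: str, out_zip: str) -> list:
    """Zip the allowed files under src_dir into out_zip; return their names."""
    packed = []
    with zipfile.ZipFile(out_zip, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(Path(src_dir).rglob("*")):
            if path.is_file() and path.suffix.lower() in KEEP:
                zf.write(path, path.relative_to(src_dir))
                packed.append(path.name)
    return packed
```

<p>Running <code>pack_for_arxiv(".", "arxiv.zip")</code> from the paper folder then produces an archive without the <code>.bib</code> or any compiled PDF.</p>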
<h2>Prepare your Word submission</h2>
<p>Export the paper as PDF.</p>
<h2>Upload to arXiv</h2>
<ul>
<li>Go to <a href="https://arxiv.org/submit">https://arxiv.org/submit</a>, either login or create a new account.</li>
<li>At the submission page, fill the form, for license, the safest is to use the first option: "arXiv.org perpetual, non-exclusive license to distribute this article (Minimal rights required by arXiv.org)"</li>
<li>For "Archive and Subject Class", choose "Computer Science" and "Distributed, Parallel, and Cluster Computing" unless in the list there is a more suitable field</li>
<li>Then upload the Latex sources zip file or the conversion of the Word file to PDF.</li>
<li>Once you have uploaded the zip file, it shows you a list of the archive content; you can delete extra files that are not needed to build the paper. If you used the Overleaf ACM template, remove <code>sample-sigconf-authordraft.tex</code></li>
<li>If the paper doesn't build, the arXiv displays the log; check in particular for missing files or unsupported packages. You can click "Add files" to upload different files</li>
<li>If the paper successfully builds, click on the "View" button to check that the PDF is fine</li>
<li>In the Metadata, complete the form; in the Comments, also add the conference information, for example "7 pages, 3 figures, PEARC '18: Practice and Experience in Advanced Research Computing, July 22--26, 2018, Pittsburgh, PA, USA"</li>
<li>Still in Metadata, <strong>make sure you add the DOI</strong>, otherwise it is a violation of the ACM conditions; the DOI is in the form <code>10.1145/xxxxxx.xxxx</code></li>
<li>Finally check the preview and finalize your submission</li>
<li>The submission is not available immediately: it will first be in the "Processing" stage and will be published in the next few days; you'll get an email with the publishing date and time.</li>
</ul>
<h2>Update your submission</h2>
<ul>
<li>Anytime before publication you can update (overwrite) your submission</li>
<li>After your pre-print is published you can update it at will but all previous versions will always be available on the arXiv servers.</li>
</ul>
<p>In order to update the publication, log in to the arXiv and click on the "Replace" icon to update your paper with a new version.</p>Launch a shared dask cluster in Kubernetes alongside JupyterHub on Jetstream2018-05-04T18:00:00-07:002018-05-04T18:00:00-07:00Andrea Zoncatag:zonca.github.io,2018-05-04:/2018/05/shared-dask-kubernetes-jetstream.html<p>Let's assume we already have a Kubernetes deployment and have installed JupyterHub; see for example my <a href="https://zonca.github.io/2017/12/scalable-jupyterhub-kubernetes-jetstream.html">previous tutorial on Jetstream</a>.
Now that users can log in and access a Jupyter Notebook, we would also like to provide them with more computing power for their interactive data exploration. The easiest way is through …</p><p>Let's assume we already have a Kubernetes deployment and have installed JupyterHub; see for example my <a href="https://zonca.github.io/2017/12/scalable-jupyterhub-kubernetes-jetstream.html">previous tutorial on Jetstream</a>.
Now that users can log in and access a Jupyter Notebook, we would also like to provide them with more computing power for their interactive data exploration. The easiest way is through <a href="https://dask.pydata.org"><code>dask</code></a>: we can launch a scheduler and any number of workers as containers inside Kubernetes so that users can leverage the computing power of many Jetstream instances at once.</p>
<p>There are 2 main strategies: we can give each user their own dask cluster with exclusive access, which is more performant but causes quick spikes in usage of the Kubernetes cluster, or we can launch a shared cluster and give all users access to it.</p>
<p>In this tutorial we cover the second scenario, we'll cover the first scenario in a following tutorial.</p>
<p>We will first deploy Jupyterhub through the Zero-to-JupyterHub guide, then launch via Helm a fixed-size dask cluster and show how users can connect, submit distributed Python jobs and monitor their execution on the dashboard.</p>
<p>The configuration files mentioned in the tutorial are available in the Github repository <a href="https://github.com/zonca/jupyterhub-deploy-kubernetes-jetstream">zonca/jupyterhub-deploy-kubernetes-jetstream</a>.</p>
<h2>Deploy JupyterHub</h2>
<p>First we start from Jupyterhub on Jetstream with Kubernetes at <a href="https://zonca.github.io/2017/12/scalable-jupyterhub-kubernetes-jetstream.html">https://zonca.github.io/2017/12/scalable-jupyterhub-kubernetes-jetstream.html</a></p>
<p>Optionally, for testing purposes, we can simplify the deployment by skipping permanent storage; if this is an option for you, see the relevant section below.</p>
<p>We want to install Jupyterhub in the <code>pangeo</code> namespace with the name <code>jupyter</code>, replace the <code>helm install</code> line in the tutorial with:</p>
<div class="highlight"><pre><span></span><span class="err">sudo helm install --name jupyter jupyterhub/jupyterhub -f config_jupyterhub_pangeo_helm.yaml --namespace pangeo</span>
</pre></div>
<p>The <code>pangeo</code> configuration file is using a different single user image which has the right version of <code>dask</code> for this tutorial.</p>
<h2>(Optional) Simplify deployment using ephemeral storage</h2>
<p>Instead of installing and configuring Rook, we can temporarily disable permanent storage to make the setup quicker and easier to maintain.</p>
<p>In the JupyterHub configuration <code>yaml</code> set:</p>
<div class="highlight"><pre><span></span><span class="n">hub</span><span class="o">:</span>
<span class="n">db</span><span class="o">:</span>
<span class="n">type</span><span class="o">:</span> <span class="n">sqlite</span><span class="o">-</span><span class="n">memory</span>
<span class="n">singleuser</span><span class="o">:</span>
<span class="n">storage</span><span class="o">:</span>
<span class="n">type</span><span class="o">:</span> <span class="n">none</span>
</pre></div>
<p>Now every time a user container is killed and restarted, all data are gone; this is good enough for testing purposes.</p>
<h2>Configure Github authentication</h2>
<p>Follow the instructions on the Zero-to-Jupyterhub documentation, at the end you should have in the YAML:</p>
<div class="highlight"><pre><span></span><span class="n">auth</span><span class="o">:</span>
<span class="n">type</span><span class="o">:</span> <span class="n">github</span>
<span class="n">admin</span><span class="o">:</span>
<span class="n">access</span><span class="o">:</span> <span class="kc">true</span>
<span class="n">users</span><span class="o">:</span> <span class="o">[</span><span class="n">zonca</span><span class="o">,</span> <span class="n">otherusername</span><span class="o">]</span>
<span class="n">github</span><span class="o">:</span>
<span class="n">clientId</span><span class="o">:</span> <span class="s2">"xxxxxxxxxxxxxxxxxxxx"</span>
<span class="n">clientSecret</span><span class="o">:</span> <span class="s2">"xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"</span>
<span class="n">callbackUrl</span><span class="o">:</span> <span class="s2">"https://js-xxx-xxx.jetstream-cloud.org/hub/oauth_callback"</span>
</pre></div>
<h2>Test Jupyterhub</h2>
<p>Connect to the master node with your browser at <code>https://js-xxx-xxx.jetstream-cloud.org</code>
and log in with your Github credentials; you should get a Jupyter Notebook.</p>
<p>You can also check that your pod is running:</p>
<div class="highlight"><pre><span></span><span class="err">sudo kubectl get pods -n pangeo</span>
<span class="err">NAME READY STATUS RESTARTS AGE</span>
<span class="err">jupyter-zonca 1/1 Running 0 2m</span>
<span class="err">......other pods</span>
</pre></div>
<h2>Install Dask</h2>
<p>We want to deploy a single dask cluster that all the users can submit jobs to.</p>
<p>Customize the <code>dask_shared/dask_config.yaml</code> file available in the repository;
for testing purposes I set limits of just 1 GB RAM and 1 CPU on each of 3 workers.
We can change <code>replicas</code> of the workers to add more.</p>
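<p>The relevant part of such a values file looks roughly like this (an illustrative fragment; check the <code>stable/dask</code> chart's own <code>values.yaml</code> for the exact key names):</p>

```yaml
# dask_config.yaml: 3 workers with 1 CPU / 1 GB RAM each
worker:
  replicas: 3
  resources:
    limits:
      cpu: 1
      memory: 1G
    requests:
      cpu: 1
      memory: 1G
```

<p>Raising <code>replicas</code> and re-running <code>helm upgrade</code> later scales the cluster out.</p>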
<div class="highlight"><pre><span></span><span class="err">sudo helm install stable/dask --name=dask --namespace=pangeo -f dask_config.yaml</span>
</pre></div>
<p>Then check that the <code>dask</code> instances are running:</p>
<div class="highlight"><pre><span></span>$ sudo kubectl get pods --namespace pangeo
NAME READY STATUS RESTARTS AGE
dask-jupyter-647bdc8c6d-mqhr4 <span class="m">1</span>/1 Running <span class="m">0</span> 22m
dask-scheduler-5d98cbf54c-4rtdr <span class="m">1</span>/1 Running <span class="m">0</span> 22m
dask-worker-6457975f74-dqhsh <span class="m">1</span>/1 Running <span class="m">0</span> 22m
dask-worker-6457975f74-lpvk4 <span class="m">1</span>/1 Running <span class="m">0</span> 22m
dask-worker-6457975f74-xzcmc <span class="m">1</span>/1 Running <span class="m">0</span> 22m
hub-7f75b59fc5-8c2pg <span class="m">1</span>/1 Running <span class="m">0</span> 6d
jupyter-zonca <span class="m">1</span>/1 Running <span class="m">0</span> 10m
proxy-6bbf67f6bd-swt7f <span class="m">2</span>/2 Running <span class="m">0</span> 6d
</pre></div>
<h3>Access the scheduler and launch a distributed job</h3>
<p><code>kube-dns</code> gives a name to each service and automatically propagates it to each pod, so we can connect to the scheduler by name:</p>
<div class="highlight"><pre><span></span><span class="kn">from</span> <span class="nn">dask.distributed</span> <span class="kn">import</span> <span class="n">Client</span>
<span class="n">client</span> <span class="o">=</span> <span class="n">Client</span><span class="p">(</span><span class="s2">"dask-scheduler:8786"</span><span class="p">)</span>
<span class="n">client</span>
</pre></div>
<p>Now we can access the 3 workers that we launched before:</p>
<div class="highlight"><pre><span></span><span class="err">Client</span>
<span class="c">Scheduler: tcp://dask-scheduler:8786</span>
<span class="c">Dashboard: http://dask-scheduler:8787/status</span>
<span class="err">Cluster</span>
<span class="c">Workers: 3</span>
<span class="c">Cores: 6</span>
<span class="c">Memory: 12.43 GB</span>
</pre></div>
<p>We can run an example computation with dask array:</p>
<div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">dask.array</span> <span class="kn">as</span> <span class="nn">da</span>
<span class="n">x</span> <span class="o">=</span> <span class="n">da</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">random</span><span class="p">((</span><span class="mi">20000</span><span class="p">,</span> <span class="mi">20000</span><span class="p">),</span> <span class="n">chunks</span><span class="o">=</span><span class="p">(</span><span class="mi">2000</span><span class="p">,</span> <span class="mi">2000</span><span class="p">))</span><span class="o">.</span><span class="n">persist</span><span class="p">()</span>
<span class="n">x</span><span class="o">.</span><span class="n">sum</span><span class="p">()</span><span class="o">.</span><span class="n">compute</span><span class="p">()</span>
</pre></div>
<h3>Access the Dask dashboard for monitoring job execution</h3>
<p>We need to set up ingress so that a path points to the Dask dashboard instead of Jupyterhub.</p>
<p>Check out the file <code>dask_shared/dask_webui_ingress.yaml</code> in the repository; it routes the path <code>/dask</code>
to the <code>dask-scheduler</code> service.</p>
<p>Create the ingress resource with:</p>
<div class="highlight"><pre><span></span><span class="err">sudo kubectl create ingress -n pangeo -f dask_webui_ingress.yaml</span>
</pre></div>
<p>All users can now access the dashboard at:</p>
<ul>
<li><a href="https://js-xxx-xxx.jetstream-cloud.org/dask/status">https://js-xxx-xxx.jetstream-cloud.org/dask/status</a></li>
</ul>
<p>Make sure to use <code>/dask/status/</code> and not only <code>/dask</code>.
Currently this is not authenticated, so this address is publicly available.
A simple way to hide it is to choose a custom name instead of <code>/dask</code> and edit
the ingress accordingly with:</p>
<div class="highlight"><pre><span></span><span class="err">sudo kubectl edit ingress dask -n pangeo</span>
</pre></div>Install a BOINC server on Jetstream2018-03-29T18:00:00-07:002018-03-29T18:00:00-07:00Andrea Zoncatag:zonca.github.io,2018-03-29:/2018/03/boinc-server-jetstream.html<p><a href="https://boinc.berkeley.edu/">BOINC</a> is the leading platform for volunteer computing.</p>
<p>Scientists can create a project on the platform and submit computational jobs that will
be executed on computers of volunteers all over the world.</p>
<p>In this post we'll deploy a BOINC server on Jetstream. All US scientists can get a free
<a href="https://jetstream-cloud.org/allocations.php">allocation …</a></p><p><a href="https://boinc.berkeley.edu/">BOINC</a> is the leading platform for volunteer computing.</p>
<p>Scientists can create a project on the platform and submit computational jobs that will
be executed on computers of volunteers all over the world.</p>
<p>In this post we'll deploy a BOINC server on Jetstream. All US scientists can get a free
<a href="https://jetstream-cloud.org/allocations.php">allocation on Jetstream via XSEDE</a>.</p>
<p>The deployment will be based on the <a href="https://github.com/marius311/boinc-server-docker">Docker setup developed by the Cosmology@Home project</a>.</p>
<h2>Prepare a Jetstream Virtual Machine</h2>
<p>First we log in to the Atmosphere Jetstream control panel and create a new instance
of Ubuntu 16.04 with Docker preinstalled; a "small" size is enough for testing.</p>
<h3>(Optional) Mount a Jetstream Volume for docker images</h3>
<p>It is ideal to have a dedicated Jetstream Volume mounted at the location where
Docker stores its data: we get more space, less usage of the root filesystem,
and no issues with the OS if we run out of disk space.</p>
<p>We can create a volume of 10/20 GB in the Jetstream control panel and attach it to
the running Virtual Machine. It will be automatically mounted at <code>/vol_b</code>; we
want to mount it instead at <code>/var/lib/docker</code>:</p>
<div class="highlight"><pre><span></span><span class="err">sudo systemctl stop docker</span>
<span class="err">sudo mv /var/lib/docker/* /vol_b/</span>
<span class="err">sudo umount /vol_b</span>
</pre></div>
<p>Replace <code>/vol_b</code> with <code>/var/lib/docker</code> in <code>/etc/fstab</code>, e.g.:</p>
<div class="highlight"><pre><span></span><span class="err">zonca@js-xxx-xxx:~$ cat /etc/fstab</span>
<span class="err">LABEL=cloudimg-rootfs / ext4 defaults 0 0</span>
<span class="err">/dev/sdb /var/lib/docker ext4 defaults,nofail 0 2</span>
</pre></div>
<p>Finally:</p>
<div class="highlight"><pre><span></span><span class="err">sudo mount /var/lib/docker</span>
<span class="err">sudo systemctl start docker</span>
</pre></div>
<h3>Update Docker</h3>
<p>Docker in 16.04 is a bit old; we want to update it to a more recent version.</p>
<p>We also want to make sure to remove the old <code>docker</code> and <code>docker-compose</code>:</p>
<div class="highlight"><pre><span></span><span class="err">sudo apt remove docker-compose docker</span>
</pre></div>
<p>Then install a recent version;
we can follow the instructions from the Docker website or use this script:</p>
<p><a href="https://gist.github.com/zonca/f5faba190f5285c68dad48e897622e90">https://gist.github.com/zonca/f5faba190f5285c68dad48e897622e90</a></p>
<p>I adapted it from <a href="https://github.com/data-8/kubeadm-bootstrap/blob/master/install-kubeadm.bash">kubeadm-bootstrap</a>.</p>
<p>Finally install the latest <code>docker-compose</code>, see the <a href="https://docs.docker.com/compose/install/#install-compose">documentation</a></p>
<p>Last step, add your user to the <code>docker</code> group:</p>
<div class="highlight"><pre><span></span><span class="err">sudo adduser $USER docker</span>
</pre></div>
<p>Log out and back in, then make sure you can run <code>docker</code> commands without sudo:</p>
<div class="highlight"><pre><span></span><span class="err">docker ps</span>
</pre></div>
<h3>Install BOINC server via Docker</h3>
<p>Follow the <a href="https://github.com/marius311/boinc-server-docker">instructions from <code>boinc-server-docker</code></a>
to launch a test deployment; in the last step, specify a <code>URL_BASE</code> so that
the deployment will be accessible from outside connections:</p>
<div class="highlight"><pre><span></span><span class="err">URL_BASE=http://$(hostname) docker-compose up -d</span>
</pre></div>
<p>You can check that the 3 containers are running with:</p>
<div class="highlight"><pre><span></span><span class="err">docker ps</span>
</pre></div>
<p>and inspect their logs with:</p>
<div class="highlight"><pre><span></span><span class="err">docker logs <container_id></span>
</pre></div>
<p>After a few minutes you should be able to check that the server is running at the
public address of your instance:</p>
<p><a href="http://js-xxx-xxx.jetstream-cloud.org/boincserver/">http://js-xxx-xxx.jetstream-cloud.org/boincserver/</a></p>
<h2>(Optional) Mount Jetstream volumes on the containers</h2>
<p>The Docker compose recipe defines 3 Docker volumes:</p>
<ul>
<li><code>mysql</code>: Data of the MySQL database</li>
<li><code>project</code>: Files about the project</li>
<li><code>results</code>: Result of the BOINC jobs</li>
</ul>
<p>Those volumes are managed internally
by Docker and stored somewhere inside <code>/var/lib/docker</code> on the host node.</p>
<p>Docker also allows mounting specific folders from the host into a container;
if we back these folders with a Jetstream volume, we get dedicated detachable Jetstream volumes
that live independently of any virtual machine.</p>
<p>Let's start with <code>mysql</code>; the same process can then be replicated for the other resources.</p>
<p>We create another Jetstream volume from Atmosphere, name it <code>mysql</code> and attach it to the virtual machine.
It will be automatically mounted at <code>/vol_c</code>; we can relocate it with:</p>
<div class="highlight"><pre><span></span><span class="err">sudo umount /vol_c</span>
</pre></div>
<p>Replace <code>vol_c</code> with <code>mysql</code> in <code>/etc/fstab</code>, finally:</p>
<div class="highlight"><pre><span></span><span class="err">sudo mount /mysql</span>
</pre></div>
<p>You can then modify the <code>docker-compose.yml</code> to use this folder instead of a Docker Volume:</p>
<p>In the <code>volumes:</code> section, remove <code>mysql:</code>, in the definition of the MySQL service,
replace:</p>
<div class="highlight"><pre><span></span><span class="c">volumes:</span>
<span class="c"> - "mysql:/var/lib/mysql"</span>
</pre></div>
<p>with:</p>
<div class="highlight"><pre><span></span><span class="c">volumes:</span>
<span class="c"> - "/mysql:/var/lib/mysql"</span>
</pre></div>
<p>So that instead of using a Docker Volume named <code>mysql</code> it creates a bind-mount to <code>/mysql</code> on the host.</p>
<h2>Test jobs</h2>
<p>Open a terminal in the BOINC server container:</p>
<div class="highlight"><pre><span></span><span class="n">docker</span> <span class="k">exec</span> <span class="o">-</span><span class="n">it</span> <span class="o"><</span><span class="n">boincserver</span><span class="o">></span> <span class="o">/</span><span class="n">bin</span><span class="o">/</span><span class="n">bash</span>
<span class="n">bin</span><span class="o">/</span><span class="n">boinc2docker_create_work</span><span class="p">.</span><span class="n">py</span> <span class="err">\</span>
<span class="n">python</span><span class="p">:</span><span class="n">alpine</span> <span class="n">python</span> <span class="o">-</span><span class="k">c</span> <span class="ss">"open('/root/shared/results/hello.txt','w').write('Hello BOINC')"</span>
</pre></div>
<p>Then we can test a client connection and execution either with a standard BOINC desktop client or on another Jetstream instance.</p>
<h3>Test with a BOINC Desktop client</h3>
<p>Follow the instructions on the <a href="https://boinc.berkeley.edu/">BOINC website</a> to install a client for your OS, also install VirtualBox, and then point the client at the URL of the BOINC server we just created.</p>
<h3>Test with a BOINC client in another Jetstream instance</h3>
<p>Create another tiny Ubuntu-with-Docker instance on Jetstream, log in, and add your user to the <code>docker</code> group:</p>
<div class="highlight"><pre><span></span><span class="err">sudo adduser $USER docker</span>
</pre></div>
<p>We need VirtualBox:</p>
<div class="highlight"><pre><span></span><span class="err">sudo apt install virtualbox-dkms</span>
</pre></div>
<p>and reboot to make sure VirtualBox is active.</p>
<div class="highlight"><pre><span></span><span class="n">URL</span><span class="o">=</span><span class="n">http</span><span class="p">:</span><span class="o">//</span><span class="n">js</span><span class="o">-</span><span class="n">xxx</span><span class="o">-</span><span class="n">xxx</span><span class="p">.</span><span class="n">jetstream</span><span class="o">-</span><span class="n">cloud</span><span class="p">.</span><span class="n">org</span><span class="o">/</span><span class="n">boincserver</span><span class="o">/</span>
<span class="n">docker</span> <span class="k">exec</span> <span class="n">boinc</span> <span class="n">boinccmd</span> <span class="c1">--create_account $URL email password name</span>
<span class="n">status</span><span class="p">:</span> <span class="n">Success</span>
<span class="n">poll</span> <span class="n">status</span><span class="p">:</span> <span class="k">operation</span> <span class="k">in</span> <span class="n">progress</span>
<span class="n">poll</span> <span class="n">status</span><span class="p">:</span> <span class="k">operation</span> <span class="k">in</span> <span class="n">progress</span>
<span class="n">poll</span> <span class="n">status</span><span class="p">:</span> <span class="k">operation</span> <span class="k">in</span> <span class="n">progress</span>
<span class="n">account</span> <span class="k">key</span><span class="p">:</span> <span class="n">de9c4cc66b8c923d04f834a0609ae742</span>
</pre></div>
<p>We can save the account key in an environment variable:</p>
<div class="highlight"><pre><span></span><span class="err">URL=http://js-xxx-xxx.jetstream-cloud.org/boincserver/</span>
<span class="err">account_key=de9c4cc66b8c923d04f834a0609ae742</span>
<span class="err">docker exec boinc boinccmd --project_attach $URL $account_key</span>
</pre></div>
<p>Then we can check the logs for the job being received and executed:</p>
<div class="highlight"><pre><span></span><span class="err">docker logs boinc</span>
</pre></div>
<div class="highlight"><pre><span></span><span class="mi">30</span><span class="o">-</span><span class="n">Mar</span><span class="o">-</span><span class="mi">2018</span><span class="w"> </span><span class="mi">13</span><span class="err">:</span><span class="mi">02</span><span class="err">:</span><span class="mi">04</span><span class="w"> </span><span class="o">[</span><span class="n">boincserver</span><span class="o">]</span><span class="w"> </span><span class="n">Started</span><span class="w"> </span><span class="n">download</span><span class="w"> </span><span class="k">of</span><span class="w"> </span><span class="n">layer_e9e858f6a2ba5a3e5a04b5799ef2de1c21a58602ffd400838ed10599f1b4a42c</span><span class="p">.</span><span class="n">tar</span><span class="p">.</span><span class="n">manual</span><span class="p">.</span><span class="n">gz</span><span class="w"></span>
<span class="mi">30</span><span class="o">-</span><span class="n">Mar</span><span class="o">-</span><span class="mi">2018</span><span class="w"> </span><span class="mi">13</span><span class="err">:</span><span class="mi">02</span><span class="err">:</span><span class="mi">06</span><span class="w"> </span><span class="o">[</span><span class="n">boincserver</span><span class="o">]</span><span class="w"> </span><span class="n">Finished</span><span class="w"> </span><span class="n">download</span><span class="w"> </span><span class="k">of</span><span class="w"> </span><span class="n">layer_10ffed26db733866a346caf7c79558e4addb23ae085a991b5e7237edaa69f8e2</span><span class="p">.</span><span class="n">tar</span><span class="p">.</span><span class="n">manual</span><span class="p">.</span><span class="n">gz</span><span class="w"></span>
<span class="mi">30</span><span class="o">-</span><span class="n">Mar</span><span class="o">-</span><span class="mi">2018</span><span class="w"> </span><span class="mi">13</span><span class="err">:</span><span class="mi">02</span><span class="err">:</span><span class="mi">06</span><span class="w"> </span><span class="o">[</span><span class="n">boincserver</span><span class="o">]</span><span class="w"> </span><span class="n">Finished</span><span class="w"> </span><span class="n">download</span><span class="w"> </span><span class="k">of</span><span class="w"> </span><span class="n">layer_e9e858f6a2ba5a3e5a04b5799ef2de1c21a58602ffd400838ed10599f1b4a42c</span><span class="p">.</span><span class="n">tar</span><span class="p">.</span><span class="n">manual</span><span class="p">.</span><span class="n">gz</span><span class="w"></span>
<span class="mi">30</span><span class="o">-</span><span class="n">Mar</span><span class="o">-</span><span class="mi">2018</span><span class="w"> </span><span class="mi">13</span><span class="err">:</span><span class="mi">02</span><span class="err">:</span><span class="mi">06</span><span class="w"> </span><span class="o">[</span><span class="n">boincserver</span><span class="o">]</span><span class="w"> </span><span class="n">Started</span><span class="w"> </span><span class="n">download</span><span class="w"> </span><span class="k">of</span><span class="w"> </span><span class="n">layer_0e650ab7661f993eff514b84c6e7b775f5be8c6dde8b63eb584f0f22ea24005f</span><span class="p">.</span><span class="n">tar</span><span class="p">.</span><span class="n">manual</span><span class="p">.</span><span class="n">gz</span><span class="w"></span>
<span class="mi">30</span><span class="o">-</span><span class="n">Mar</span><span class="o">-</span><span class="mi">2018</span><span class="w"> </span><span class="mi">13</span><span class="err">:</span><span class="mi">02</span><span class="err">:</span><span class="mi">06</span><span class="w"> </span><span class="o">[</span><span class="n">boincserver</span><span class="o">]</span><span class="w"> </span><span class="n">Started</span><span class="w"> </span><span class="n">download</span><span class="w"> </span><span class="k">of</span><span class="w"> </span><span class="n">image_4fcaf5fb5f2b8230c53b5fd4c4325df00021d45272dc4bfbb2148e5ca91ac166</span><span class="p">.</span><span class="n">tar</span><span class="p">.</span><span class="n">manual</span><span class="p">.</span><span class="n">gz</span><span class="w"></span>
<span class="mi">30</span><span class="o">-</span><span class="n">Mar</span><span class="o">-</span><span class="mi">2018</span><span class="w"> </span><span class="mi">13</span><span class="err">:</span><span class="mi">02</span><span class="err">:</span><span class="mi">07</span><span class="w"> </span><span class="o">[</span><span class="n">boincserver</span><span class="o">]</span><span class="w"> </span><span class="n">Finished</span><span class="w"> </span><span class="n">download</span><span class="w"> </span><span class="k">of</span><span class="w"> </span><span class="n">layer_0e650ab7661f993eff514b84c6e7b775f5be8c6dde8b63eb584f0f22ea24005f</span><span class="p">.</span><span class="n">tar</span><span class="p">.</span><span class="n">manual</span><span class="p">.</span><span class="n">gz</span><span class="w"></span>
<span class="mi">30</span><span class="o">-</span><span class="n">Mar</span><span class="o">-</span><span class="mi">2018</span><span class="w"> </span><span class="mi">13</span><span class="err">:</span><span class="mi">02</span><span class="err">:</span><span class="mi">07</span><span class="w"> </span><span class="o">[</span><span class="n">boincserver</span><span class="o">]</span><span class="w"> </span><span class="n">Finished</span><span class="w"> </span><span class="n">download</span><span class="w"> </span><span class="k">of</span><span class="w"> </span><span class="n">image_4fcaf5fb5f2b8230c53b5fd4c4325df00021d45272dc4bfbb2148e5ca91ac166</span><span class="p">.</span><span class="n">tar</span><span class="p">.</span><span class="n">manual</span><span class="p">.</span><span class="n">gz</span><span class="w"></span>
<span class="mi">30</span><span class="o">-</span><span class="n">Mar</span><span class="o">-</span><span class="mi">2018</span><span class="w"> </span><span class="mi">13</span><span class="err">:</span><span class="mi">02</span><span class="err">:</span><span class="mi">07</span><span class="w"> </span><span class="o">[</span><span class="n">boincserver</span><span class="o">]</span><span class="w"> </span><span class="n">Starting</span><span class="w"> </span><span class="n">task</span><span class="w"> </span><span class="n">boinc2docker_3766_1522410497</span><span class="mf">.503524</span><span class="n">_0</span><span class="w"></span>
<span class="mi">30</span><span class="o">-</span><span class="n">Mar</span><span class="o">-</span><span class="mi">2018</span><span class="w"> </span><span class="mi">13</span><span class="err">:</span><span class="mi">02</span><span class="err">:</span><span class="mi">07</span><span class="w"> </span><span class="o">[</span><span class="n">boincserver</span><span class="o">]</span><span class="w"> </span><span class="n">Sending</span><span class="w"> </span><span class="n">scheduler</span><span class="w"> </span><span class="nl">request</span><span class="p">:</span><span class="w"> </span><span class="k">To</span><span class="w"> </span><span class="k">fetch</span><span class="w"> </span><span class="k">work</span><span class="p">.</span><span class="w"></span>
<span class="mi">30</span><span class="o">-</span><span class="n">Mar</span><span class="o">-</span><span class="mi">2018</span><span class="w"> </span><span class="mi">13</span><span class="err">:</span><span class="mi">02</span><span class="err">:</span><span class="mi">07</span><span class="w"> </span><span class="o">[</span><span class="n">boincserver</span><span class="o">]</span><span class="w"> </span><span class="n">Requesting</span><span class="w"> </span><span class="k">new</span><span class="w"> </span><span class="n">tasks</span><span class="w"> </span><span class="k">for</span><span class="w"> </span><span class="n">CPU</span><span class="w"></span>
<span class="mi">30</span><span class="o">-</span><span class="n">Mar</span><span class="o">-</span><span class="mi">2018</span><span class="w"> </span><span class="mi">13</span><span class="err">:</span><span class="mi">02</span><span class="err">:</span><span class="mi">08</span><span class="w"> </span><span class="o">[</span><span class="n">boincserver</span><span class="o">]</span><span class="w"> </span><span class="n">Scheduler</span><span class="w"> </span><span class="n">request</span><span class="w"> </span><span class="nl">completed</span><span class="p">:</span><span class="w"> </span><span class="n">got</span><span class="w"> </span><span class="mi">1</span><span class="w"> </span><span class="k">new</span><span class="w"> </span><span class="n">tasks</span><span class="w"></span>
<span class="mi">30</span><span class="o">-</span><span class="n">Mar</span><span class="o">-</span><span class="mi">2018</span><span class="w"> </span><span class="mi">13</span><span class="err">:</span><span class="mi">02</span><span class="err">:</span><span class="mi">12</span><span class="w"> </span><span class="o">[</span><span class="n">---</span><span class="o">]</span><span class="w"> </span><span class="n">Vbox</span><span class="w"> </span><span class="n">app</span><span class="w"> </span><span class="n">stderr</span><span class="w"> </span><span class="n">indicates</span><span class="w"> </span><span class="n">CPU</span><span class="w"> </span><span class="n">VM</span><span class="w"> </span><span class="n">extensions</span><span class="w"> </span><span class="n">disabled</span><span class="w"></span>
<span class="mi">30</span><span class="o">-</span><span class="n">Mar</span><span class="o">-</span><span class="mi">2018</span><span class="w"> </span><span class="mi">13</span><span class="err">:</span><span class="mi">02</span><span class="err">:</span><span class="mi">13</span><span class="w"> </span><span class="o">[</span><span class="n">boincserver</span><span class="o">]</span><span class="w"> </span><span class="n">Computation</span><span class="w"> </span><span class="k">for</span><span class="w"> </span><span class="n">task</span><span class="w"> </span><span class="n">boinc2docker_3766_1522410497</span><span class="mf">.503524</span><span class="n">_0</span><span class="w"> </span><span class="n">finished</span><span class="w"></span>
<span class="mi">30</span><span class="o">-</span><span class="n">Mar</span><span class="o">-</span><span class="mi">2018</span><span class="w"> </span><span class="mi">13</span><span class="err">:</span><span class="mi">02</span><span class="err">:</span><span class="mi">13</span><span class="w"> </span><span class="o">[</span><span class="n">boincserver</span><span class="o">]</span><span class="w"> </span><span class="k">Output</span><span class="w"> </span><span class="k">file</span><span class="w"> </span><span class="n">boinc2docker_3766_1522410497</span><span class="mf">.503524</span><span class="n">_0_r207563194_0</span><span class="p">.</span><span class="n">tgz</span><span class="w"> </span><span class="k">for</span><span class="w"> </span><span class="n">task</span><span class="w"> </span><span class="n">boinc2docker_3766_1522410497</span><span class="mf">.503524</span><span class="n">_0</span><span class="w"> </span><span class="n">absent</span><span class="w"></span>
<span class="mi">30</span><span class="o">-</span><span class="n">Mar</span><span class="o">-</span><span class="mi">2018</span><span class="w"> </span><span class="mi">13</span><span class="err">:</span><span class="mi">02</span><span class="err">:</span><span class="mi">13</span><span class="w"> </span><span class="o">[</span><span class="n">boincserver</span><span class="o">]</span><span class="w"> </span><span class="n">Starting</span><span class="w"> </span><span class="n">task</span><span class="w"> </span><span class="n">boinc2docker_3766_1522410497</span><span class="mf">.503524</span><span class="n">_1</span><span class="w"></span>
<span class="mi">30</span><span class="o">-</span><span class="n">Mar</span><span class="o">-</span><span class="mi">2018</span><span class="w"> </span><span class="mi">13</span><span class="err">:</span><span class="mi">02</span><span class="err">:</span><span class="mi">18</span><span class="w"> </span><span class="o">[</span><span class="n">---</span><span class="o">]</span><span class="w"> </span><span class="n">Vbox</span><span class="w"> </span><span class="n">app</span><span class="w"> </span><span class="n">stderr</span><span class="w"> </span><span class="n">indicates</span><span class="w"> </span><span class="n">CPU</span><span class="w"> </span><span class="n">VM</span><span class="w"> </span><span class="n">extensions</span><span class="w"> </span><span class="n">disabled</span><span class="w"></span>
<span class="mi">30</span><span class="o">-</span><span class="n">Mar</span><span class="o">-</span><span class="mi">2018</span><span class="w"> </span><span class="mi">13</span><span class="err">:</span><span class="mi">02</span><span class="err">:</span><span class="mi">18</span><span class="w"> </span><span class="o">[</span><span class="n">boincserver</span><span class="o">]</span><span class="w"> </span><span class="n">Computation</span><span class="w"> </span><span class="k">for</span><span class="w"> </span><span class="n">task</span><span class="w"> </span><span class="n">boinc2docker_3766_1522410497</span><span class="mf">.503524</span><span class="n">_1</span><span class="w"> </span><span class="n">finished</span><span class="w"></span>
<span class="mi">30</span><span class="o">-</span><span class="n">Mar</span><span class="o">-</span><span class="mi">2018</span><span class="w"> </span><span class="mi">13</span><span class="err">:</span><span class="mi">02</span><span class="err">:</span><span class="mi">18</span><span class="w"> </span><span class="o">[</span><span class="n">boincserver</span><span class="o">]</span><span class="w"> </span><span class="k">Output</span><span class="w"> </span><span class="k">file</span><span class="w"> </span><span class="n">boinc2docker_3766_1522410497</span><span class="mf">.503524</span><span class="n">_1_r1095010587_0</span><span class="p">.</span><span class="n">tgz</span><span class="w"> </span><span class="k">for</span><span class="w"> </span><span class="n">task</span><span class="w"> </span><span class="n">boinc2docker_3766_1522410497</span><span class="mf">.503524</span><span class="n">_1</span><span class="w"> </span><span class="n">absent</span><span class="w"></span>
</pre></div>Use the distributed file format Zarr on Jetstream Swift object storage2018-03-03T18:00:00-08:002018-03-03T18:00:00-08:00Andrea Zoncatag:zonca.github.io,2018-03-03:/2018/03/zarr-on-jetstream.html<p><meta http-equiv="refresh" content="0; URL=https://zonca.dev/{{ url }}">
<link rel="canonical" href="https://zonca.dev/{{ url }}"></p>Install custom Python environment on Jupyter Notebooks at NERSC2017-12-21T18:00:00-08:002017-12-21T18:00:00-08:00Andrea Zoncatag:zonca.github.io,2017-12-21:/2017/12/custom-conda-python-jupyter-nersc.html<h2>Jupyter Notebooks at NERSC</h2>
<p>NERSC has provided a JupyterHub instance for quite some time to all NERSC users.
It is currently running on a dedicated large-memory node on Cori, so now it can access also data on
Cori <code>$SCRATCH</code>, not only <code>/project</code> and <code>$HOME</code>. See <a href="http://www.nersc.gov/users/data-analytics/data-analytics-2/jupyter-and-rstudio/">their documentation</a></p>
<h2>Customize your Python environment</h2>
<p>NERSC provides Anaconda in an Ubuntu container; of course, users don't have permission to write to the Anaconda folder to install new packages.</p>
<p>The easiest way to install a custom Python environment is to create another conda environment and then register its kernel with Jupyter.</p>
<p>Create a new conda environment; the best location is <code>/project</code> if you have one, otherwise <code>$HOME</code> works.
Access <a href="http://jupyter.nersc.gov">http://jupyter.nersc.gov</a> and open a terminal with "New"->"Terminal":</p>
<div class="highlight"><pre><span></span><span class="err">conda create --prefix $HOME/myconda python=3.6 ipykernel</span>
</pre></div>
<p>This is the minimal requirement; you could also add <code>anaconda</code> to get all the standard packages, or specify the <code>conda-forge</code> channel to install other packages, e.g.:</p>
<div class="highlight"><pre><span></span><span class="err">source activate myconda</span>
<span class="err">conda install -c conda-forge healpy</span>
</pre></div>
<p>Register the kernel with the Jupyter Notebook:</p>
<div class="highlight"><pre><span></span><span class="err">ipython kernel install --name myconda --user</span>
</pre></div>
<p>The kernel name specified here doesn't need to match the conda environment name, but using the same name keeps things simpler.</p>
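<p>As a quick sanity check (not part of the original NERSC instructions), you can confirm from a notebook cell running the new kernel that it uses the interpreter of your conda environment:</p>

```python
# Run this in a notebook cell using the "myconda" kernel:
# the printed path should point inside the environment,
# e.g. $HOME/myconda/bin/python
import sys

print(sys.executable)
```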
<p>Once the conda environment is active, you can also install packages with <code>pip</code>.</p>
<div class="highlight"><pre><span></span><span class="err">conda install pip</span>
<span class="err">pip install somepackage</span>
</pre></div>ECSS Symposium about Jupyterhub deployments on XSEDE2017-12-15T18:00:00-08:002017-12-15T18:00:00-08:00Andrea Zoncatag:zonca.github.io,2017-12-15:/2017/12/ecss-symposium.html<h2>Jupyter Notebooks at scale for Gateways and Workshops</h2>
<p>ECSS Symposium, 19 December 2017, Web presentation to the XSEDE <a href="https://www.xsede.org/for-users/ecss">Extended Collaborative Support Services</a>.</p>
<p>Overview on deployment options for Jupyter Notebooks at scale on XSEDE resources.</p>
<h2>Presentation</h2>
<ul>
<li><a href="https://docs.google.com/presentation/d/1vxtRaeju7qWrb_RXcsh-m2lKEDZoFBCJE0SWOMi-wNo/edit?usp=sharing">Google doc slides</a></li>
<li><a href="https://www.youtube.com/watch?v=BE6tRuJtq8c">Recording of the talk on Youtube</a></li>
</ul>
<h2>Tutorials</h2>
<p>Step-by-step tutorials and configuration files to deploy JupyterHub on XSEDE resources:</p>
<ul>
<li><a href="https://zonca.github.io/2017/05/jupyterhub-hpc-batchspawner-ssh.html">spawn Notebooks on a traditional HPC system</a></li>
<li><a href="https://zonca.github.io/2017/10/scalable-jupyterhub-docker-swarm-mode.html">setup a distributed scalable system on Jetstream instances via <strong>Docker Swarm</strong></a></li>
<li><a href="https://zonca.github.io/2017/12/scalable-jupyterhub-kubernetes-jetstream.html">setup a distributed scalable system on Jetstream instances via <strong>Kubernetes</strong></a></li>
</ul>
<h2>Publication</h2>
<p>Paper in preparation: "Deploying Jupyter Notebooks at scale on XSEDE for Science Gateways and workshops", Andrea Zonca and Robert Sinkovits, PEARC18</p>Deploy scalable Jupyterhub with Kubernetes on Jetstream2017-12-05T18:00:00-08:002017-12-05T18:00:00-08:00Andrea Zoncatag:zonca.github.io,2017-12-05:/2017/12/scalable-jupyterhub-kubernetes-jetstream.html<ul>
<li><strong>Tested in June 2018 with Ubuntu 18.04 and Kubernetes 1.10</strong></li>
<li><strong>Updated in February 2018 with newer version of <code>kubeadm-bootstrap</code>, Kubernetes 1.9.2</strong></li>
</ul>
<h2>Introduction</h2>
<p>The best infrastructure available to deploy Jupyterhub at scale is Kubernetes. Kubernetes provides a fault-tolerant system to deploy, manage and scale containers. The Jupyter team released a recipe to deploy Jupyterhub on top of Kubernetes, <a href="https://zero-to-jupyterhub.readthedocs.io">Zero to Jupyterhub</a>. In this deployment both the hub, the proxy and all Jupyter Notebooks servers for the users are running inside Docker containers managed by Kubernetes.</p>
<p>Kubernetes is a highly sophisticated system; for smaller deployments (30 to 50 users, fewer than 10 servers), another option is Docker Swarm mode, which I covered in a <a href="https://zonca.github.io/2017/10/scalable-jupyterhub-docker-swarm-mode.html">tutorial on how to deploy it on Jetstream</a>.</p>
<p>If you are not already familiar with Kubernetes, first read the <a href="https://zero-to-jupyterhub.readthedocs.io/en/latest/tools.html">section about tools in Zero to Jupyterhub</a>.</p>
<p>In this tutorial we will install Kubernetes on two Ubuntu instances on the XSEDE Jetstream OpenStack-based cloud, configure permanent storage with the Ceph distributed filesystem, and run the "Zero to Jupyterhub" recipe to install Jupyterhub on it.</p>
<h2>Setup two virtual machines</h2>
<p>First of all we need to create two virtual machines from the <a href="https://use.jetstream-cloud.org">Jetstream Atmosphere admin panel</a>. I tested this on the XSEDE Jetstream Ubuntu 16.04 image (with Docker pre-installed); for testing purposes "small" instances work, and they can be scaled up later for production. You can name them <code>master_node</code> and <code>node_1</code>, for example.
Make sure that ports 80 and 443 are open to outside connections.</p>
<p>Then you can SSH into the first machine with your XSEDE username with <code>sudo</code> privileges.</p>
<h2>Install Kubernetes</h2>
<p>The "Zero to Jupyterhub" recipe targets an already existing Kubernetes cluster, for example on Google Cloud. However, the Berkeley Data Science Education Program team, which administers one of the largest Jupyterhub deployments to date, released a set of scripts based on the <code>kubeadm</code> tool to set up Kubernetes from scratch.</p>
<p>This will install all the Kubernetes services and configure the <code>kubectl</code> command line tool for administering and monitoring the cluster, as well as the <code>helm</code> package manager to install pre-packaged services.</p>
<p>SSH into the first server and follow the instructions at <a href="https://github.com/data-8/kubeadm-bootstrap">https://github.com/data-8/kubeadm-bootstrap</a> to "Setup a Master Node";
this will also install a more recent version of Docker.</p>
<p>Once the initialization of the master node is completed, you should be able to check that several containers (pods in Kubernetes) are running:</p>
<div class="highlight"><pre><span></span><span class="err">zonca@js-xxx-xxx:~/kubeadm-bootstrap$ sudo kubectl get pods --all-namespaces</span>
<span class="err">NAMESPACE NAME READY STATUS RESTARTS AGE</span>
<span class="err">kube-system etcd-js-169-xx.jetstream-cloud.org 1/1 Running 0 1m</span>
<span class="err">kube-system kube-apiserver-js-169-xx.jetstream-cloud.org 1/1 Running 0 1m</span>
<span class="err">kube-system kube-controller-manager-js-169-xx.jetstream-cloud.org 1/1 Running 0 1m</span>
<span class="err">kube-system kube-dns-6f4fd4bdf-nxxkh 3/3 Running 0 2m</span>
<span class="err">kube-system kube-flannel-ds-rlsgb 1/1 Running 1 2m</span>
<span class="err">kube-system kube-proxy-ntmwx 1/1 Running 0 2m</span>
<span class="err">kube-system kube-scheduler-js-169-xx.jetstream-cloud.org 1/1 Running 0 2m</span>
<span class="err">kube-system tiller-deploy-69cb6984f-77nx2 1/1 Running 0 2m</span>
<span class="err">support support-nginx-ingress-controller-k4swb 1/1 Running 0 36s</span>
<span class="err">support support-nginx-ingress-default-backend-cb84895fb-qs9pp 1/1 Running 0 36s</span>
</pre></div>
<p>Also make sure routing is working: access the address of the virtual machine <code>js-169-xx.jetstream-cloud.org</code> with your web browser and verify you get the error message <code>default backend - 404</code>.</p>
<p>Then SSH to the other server and set it up as a worker following the instructions in "Setup a Worker Node" at <a href="https://github.com/data-8/kubeadm-bootstrap">https://github.com/data-8/kubeadm-bootstrap</a>.</p>
<p>Once the setup is complete on the worker, log back in to the master and check that the worker joined Kubernetes:</p>
<div class="highlight"><pre><span></span><span class="err">zonca@js-169-xx:~/kubeadm-bootstrap$ sudo kubectl get nodes</span>
<span class="err">NAME STATUS ROLES AGE VERSION</span>
<span class="err">js-168-yyy.jetstream-cloud.org Ready <none> 1m v1.9.2</span>
<span class="err">js-169-xx.jetstream-cloud.org Ready master 2h v1.9.2</span>
</pre></div>
<h2>Setup permanent storage for Kubernetes</h2>
<p>The cluster we just set up has no permanent storage, so user data would disappear every time a container is killed.
We would like to provide users with a permanent home available across the whole Kubernetes cluster, so that even if a user's container spawns again on a different server, their data are still available.</p>
<p>First we log in again to the Jetstream web interface and create two volumes (for example 10 GB each), attaching one to the master and one to the first node; they will be automatically mounted on <code>/vol_b</code> with no need to reboot the servers.</p>
<p>Kubernetes can provide Persistent Volumes, but it needs a distributed file system backend. In this tutorial we will be using <a href="https://rook.io/">Rook</a>, which sets up the Ceph distributed filesystem across the nodes.</p>
<p>We can first use Helm to install the Rook services (I ran my tests with <code>v0.6.1</code>):</p>
<div class="highlight"><pre><span></span><span class="err">sudo helm repo add rook-alpha https://charts.rook.io/alpha</span>
<span class="err">sudo helm install rook-alpha/rook</span>
</pre></div>
<p>Then check that the pods have started:</p>
<div class="highlight"><pre><span></span><span class="err">zonca@js-xxx-xxx:~/kubeadm-bootstrap$ sudo kubectl get pods</span>
<span class="err">NAME READY STATUS RESTARTS AGE</span>
<span class="err">rook-agent-2v86r 1/1 Running 0 1h</span>
<span class="err">rook-agent-7dfl9 1/1 Running 0 1h</span>
<span class="err">rook-operator-88fb8f6f5-tss5t 1/1 Running 0 1h</span>
</pre></div>
<p>Once the pods have started we can actually configure the storage: copy this <a href="https://github.com/zonca/jupyterhub-deploy-kubernetes-jetstream/blob/master/storage_rook/rook-cluster.yaml"><code>rook-cluster.yaml</code> file</a> to the master node. Better yet, clone the whole repository, as we will be using other files from it later.</p>
<p>The most important bits are:</p>
<ul>
<li><code>dataDirHostPath</code>: this is the folder where Rook saves its configuration; we can set it to <code>/var/lib/rook</code></li>
<li><code>storage: directories</code>: this is where data are stored; we can set this to <code>/vol_b</code>, the default mount point of volumes on Jetstream. This way we can more easily back them up or increase their size.</li>
<li><code>versionTag</code>: make sure this is the same as your <code>rook</code> version (you can find it with <code>sudo helm ls</code>)</li>
</ul>
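<p>Putting these settings together, the relevant part of <code>rook-cluster.yaml</code> looks roughly like this (a sketch based on the Rook v0.6 <code>Cluster</code> resource; refer to the linked file for the exact layout):</p>

```yaml
# Sketch of a Rook v0.6 Cluster resource; field names follow the
# rook.io/v1alpha1 API, the exact layout in the linked file may differ.
apiVersion: rook.io/v1alpha1
kind: Cluster
metadata:
  name: rook
  namespace: rook
spec:
  versionTag: v0.6.1              # keep in sync with the installed chart (sudo helm ls)
  dataDirHostPath: /var/lib/rook  # where Rook saves its configuration on each node
  storage:
    useAllNodes: true
    directories:
      - path: /vol_b              # the Jetstream volume mount point
```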
<p>Then run it with:</p>
<div class="highlight"><pre><span></span><span class="err">sudo kubectl create -f rook-cluster.yaml</span>
</pre></div>
<p>And wait for the services to launch:</p>
<div class="highlight"><pre><span></span><span class="err">zonca@js-xxx-xxx:~/kubeadm-bootstrap$ sudo kubectl -n rook get pods</span>
<span class="err">NAME READY STATUS RESTARTS AGE</span>
<span class="err">rook-api-68b87d48d5-xmkpv 1/1 Running 0 6m</span>
<span class="err">rook-ceph-mgr0-5ddd685b65-kw9bz 1/1 Running 0 6m</span>
<span class="err">rook-ceph-mgr1-5fcf599447-j7bpn 1/1 Running 0 6m</span>
<span class="err">rook-ceph-mon0-g7xsk 1/1 Running 0 7m</span>
<span class="err">rook-ceph-mon1-zbfqt 1/1 Running 0 7m</span>
<span class="err">rook-ceph-mon2-c6rzf 1/1 Running 0 6m</span>
<span class="err">rook-ceph-osd-82lj5 1/1 Running 0 6m</span>
<span class="err">rook-ceph-osd-cpln8 1/1 Running 0 6m</span>
</pre></div>
<p>This step launches the distributed file system Ceph on all nodes.</p>
<p>Finally we can create a new StorageClass which provides block storage for the pods to store data persistently, get <a href="https://github.com/zonca/jupyterhub-deploy-kubernetes-jetstream/blob/master/storage_rook/rook-storageclass.yaml"><code>rook-storageclass.yaml</code> from the same repository we used before</a> and execute with:</p>
<div class="highlight"><pre><span></span><span class="err">sudo kubectl create -f rook-storageclass.yaml</span>
</pre></div>
<p>You should now have the rook storageclass available:</p>
<div class="highlight"><pre><span></span><span class="err">sudo kubectl get storageclass</span>
<span class="err">NAME PROVISIONER</span>
<span class="err">rook-block rook.io/block</span>
</pre></div>
<h3>(Optional) Test Rook Persistent Storage</h3>
<p>Optionally, we can deploy a simple pod to verify that the storage system is working properly.</p>
<p>You can copy <a href="https://github.com/zonca/jupyterhub-deploy-kubernetes-jetstream/blob/master/storage_rook/alpine-rook.yaml"><code>alpine-rook.yaml</code> from Github</a>
and launch it with:</p>
<div class="highlight"><pre><span></span><span class="err">sudo kubectl create -f alpine-rook.yaml</span>
</pre></div>
<p>This creates a small Pod running Alpine Linux together with a 2 GB Persistent Volume Claim mounted under <code>/data</code>. The Persistent Volume Claim specifies the type of storage and its size; once the Pod is created, the claim asks Rook to prepare a Persistent Volume, which is then mounted into the Pod.</p>
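<p>The Pod/claim pairing described above can be sketched as two manifests. The following Python dictionaries mirror what <code>alpine-rook.yaml</code> does; the object names and the exact fields are illustrative, not the literal contents of that file:</p>

```python
# Illustrative sketch of a PersistentVolumeClaim plus a Pod mounting it,
# mirroring what alpine-rook.yaml does; names are assumptions.
pvc = {
    "apiVersion": "v1",
    "kind": "PersistentVolumeClaim",
    "metadata": {"name": "alpine-data"},
    "spec": {
        "storageClassName": "rook-block",   # the StorageClass created above
        "accessModes": ["ReadWriteOnce"],
        "resources": {"requests": {"storage": "2Gi"}},
    },
}

pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "alpine"},
    "spec": {
        "containers": [{
            "name": "alpine",
            "image": "alpine:latest",
            "command": ["sleep", "infinity"],
            # the claim ends up mounted under /data
            "volumeMounts": [{"name": "data", "mountPath": "/data"}],
        }],
        "volumes": [{
            "name": "data",
            "persistentVolumeClaim": {"claimName": "alpine-data"},
        }],
    },
}
```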
<p>We can verify that the Persistent Volume is created and associated with the claim, and check the pod logs:</p>
<div class="highlight"><pre><span></span><span class="err">sudo kubectl get pv</span>
<span class="err">sudo kubectl get pvc</span>
<span class="err">sudo kubectl get logs alpine</span>
</pre></div>
<p>We can get a shell in the pod with:</p>
<div class="highlight"><pre><span></span><span class="err">sudo kubectl exec -it alpine -- /bin/sh</span>
</pre></div>
<p>access <code>/data/</code> and make sure we can write some files.</p>
<p>Once you have completed testing, you can delete the pod and the Persistent Volume Claim with:</p>
<div class="highlight"><pre><span></span><span class="err">sudo kubectl delete -f alpine-rook.yaml</span>
</pre></div>
<p>The Persistent Volume will be automatically deleted by Kubernetes after a few minutes.</p>
<h2>Setup HTTPS with letsencrypt</h2>
<p>We need <code>kube-lego</code> to automatically obtain an HTTPS certificate from Let's Encrypt.
For more information see the Ingress section of the <a href="http://zero-to-jupyterhub.readthedocs.io/en/latest/advanced.html">Zero to Jupyterhub Advanced topics</a>.</p>
<p>First we need to customize the Kube Lego configuration, edit the <code>config_kube-lego_helm.yaml</code> file from the repository and set your email address, then:</p>
<div class="highlight"><pre><span></span><span class="err">sudo helm install stable/kube-lego --namespace=support --name=lego -f config_kube-lego_helm.yaml</span>
</pre></div>
<p>Then, after you deploy Jupyterhub, if you have HTTPS trouble you should check the logs of the kube-lego pod. First find the name of the pod with:</p>
<div class="highlight"><pre><span></span><span class="err">sudo kubectl get pods -n support</span>
</pre></div>
<p>Then check its logs:</p>
<div class="highlight"><pre><span></span><span class="err">sudo kubectl logs -n support lego-kube-lego-xxxxx-xxx</span>
</pre></div>
<h2>Install Jupyterhub</h2>
<p>Read all of the documentation of "Zero to Jupyterhub", then download <a href="https://github.com/zonca/jupyterhub-deploy-kubernetes-jetstream/blob/master/config_jupyterhub_helm.yaml"><code>config_jupyterhub_helm.yaml</code> from the repository</a>, customize it with the URL of the master node (for Jetstream <code>js-xxx-xxx.jetstream-cloud.org</code>), generate the random strings for security, and finally run the Helm chart:</p>
<div class="highlight"><pre><span></span><span class="err">sudo helm repo add jupyterhub https://jupyterhub.github.io/helm-chart/</span>
<span class="err">sudo helm repo update</span>
<span class="err">sudo helm install jupyterhub/jupyterhub --version=v0.6 --name=jup \</span>
<span class="err"> --namespace=jup -f config_jupyterhub_helm.yaml</span>
</pre></div>
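<p>The "random strings for security" mentioned above (for example the proxy secret token in <code>config_jupyterhub_helm.yaml</code>; the exact key name depends on the chart version) can be generated with Python's <code>secrets</code> module:</p>

```python
import secrets

# Generate a 32-byte random string encoded as 64 hex characters, suitable
# for the secret token fields in config_jupyterhub_helm.yaml.
token = secrets.token_hex(32)
print(token)
```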
<p>Once you modify the configuration you can update the deployment with:</p>
<div class="highlight"><pre><span></span><span class="err">sudo helm upgrade jup jupyterhub/jupyterhub -f config_jupyterhub_helm.yaml</span>
</pre></div>
<h3>Test Jupyterhub</h3>
<p>Connect to the public URL of your master node instance at: <a href="https://js-xxx-xxx.jetstream-cloud.org">https://js-xxx-xxx.jetstream-cloud.org</a></p>
<p>Try to login with your XSEDE username and password and check if Jupyterhub works properly.</p>
<p>If something is wrong, check:</p>
<div class="highlight"><pre><span></span><span class="err">sudo kubectl --namespace=jup get pods</span>
</pre></div>
<p>Get the name of the <code>hub</code> pod and check the logs:</p>
<div class="highlight"><pre><span></span><span class="err">sudo kubectl --namespace=jup logs hub-xxxx-xxxxxxx</span>
</pre></div>
<p>Check that Rook is working properly:</p>
<div class="highlight"><pre><span></span><span class="err">sudo kubectl --namespace=jup get pv</span>
<span class="err">sudo kubectl --namespace=jup get pvc</span>
<span class="err">sudo kubectl --namespace=jup describe pvc claim-YOURXSEDEUSERNAME</span>
</pre></div>
<h2>Administration tips</h2>
<h3>Add more servers to Kubernetes</h3>
<p>We can create more Ubuntu instances (with a volume attached) and add them to Kubernetes by repeating the same setup we performed on the first worker node.
Once the node joins Kubernetes, it will be automatically used as a node for the distributed filesystem by Rook and be available to host user containers.</p>
<h3>Remove a server from Kubernetes</h3>
<p>First launch the <code>kubectl drain</code> command to move the currently running pods to other nodes:</p>
<div class="highlight"><pre><span></span><span class="err">sudo kubectl get nodes</span>
<span class="err">sudo kubectl drain <node name></span>
</pre></div>
<p>Then suspend or delete the instance on the Jetstream admin panel.</p>
<h3>Configure a different authentication system</h3>
<p>"Zero to Jupyterhub" supports out of the box authentication with:</p>
<ul>
<li>XSEDE credentials with CILogon</li>
<li>Many Campuses credentials with CILogon</li>
<li>Globus</li>
<li>Google</li>
</ul>
<p>See <a href="https://zero-to-jupyterhub.readthedocs.io/en/latest/extending-jupyterhub.html#authenticating-with-oauth2">the documentation</a> and modify <code>config_jupyterhub_helm.yaml</code> accordingly.</p>
<h2>Acknowledgements</h2>
<ul>
<li>The Jupyter team, in particular Yuvi Panda, for providing a great software platform and an easy-to-use resource for deploying it, and for direct support in debugging my issues</li>
<li>XSEDE Extended Collaborative Support Services for supporting part of my time to work on deploying Jupyterhub on Jetstream and providing computational time on Jetstream</li>
<li>Pacific Research Platform, in particular John Graham, Thomas DeFanti and Dmitry Mishin (SDSC) for access to their Kubernetes platform for testing</li>
<li>XSEDE Jetstream's Jeremy Fischer for prompt answers to my questions on Jetstream</li>
</ul>Store a conda environment inside a Notebook2017-12-04T18:00:00-08:002017-12-04T18:00:00-08:00Andrea Zoncatag:zonca.github.io,2017-12-04:/2017/12/store-conda-environment-inside-notebook.html<p>Last August, during the Container Analysis Environments Workshop held at Urbana-Champaign,
we had a discussion about reproducibility in Jupyter Notebooks.
The idea came up of storing all the details about the Python environment inside the Notebook,
in its metadata.</p>
<p>I released an experimental package on Github (and PyPI):</p>
<p><a href="https://github.com/zonca/nbenv">https …</a></p><p>Last August, during the Container Analysis Environments Workshop held at Urbana-Champaign,
we had a discussion about reproducibility in Jupyter Notebooks.
The idea came up of storing all the details about the Python environment inside the Notebook,
in its metadata.</p>
<p>I released an experimental package on Github (and PyPI):</p>
<p><a href="https://github.com/zonca/nbenv">https://github.com/zonca/nbenv</a></p>
<p>For simplicity it only supports <code>conda</code> environments, but it also supports having <code>pip</code>-installed packages
inside those environments.</p>
<p>It automatically saves the <code>conda</code> environment as metadata inside the <code>.ipynb</code> document and then provides
a command line tool to inspect it and create a new <code>conda</code> environment based on it.</p>
<p>I am not sure this is the best design, please open Issues on Github to send me feedback!</p>How to modify Singularity images on a Supercomputer2017-11-06T18:00:00-08:002017-11-06T18:00:00-08:00Andrea Zoncatag:zonca.github.io,2017-11-06:/2017/11/modify-singularity-images.html<h2>Introduction</h2>
<p><a href="http://singularity.lbl.gov/">Singularity</a> allows to run your own OS within most Supercomputers, see my previous post about <a href="https://zonca.github.io/2017/01/singularity-hpc-comet.html">Running Ubuntu on Comet via Singularity</a></p>
<p>Singularity's adoption by High Performance Computing centers has been driven by its strict security model. It never allows a user in a container to have <code>root</code> privileges unless …</p><h2>Introduction</h2>
<p><a href="http://singularity.lbl.gov/">Singularity</a> allows to run your own OS within most Supercomputers, see my previous post about <a href="https://zonca.github.io/2017/01/singularity-hpc-comet.html">Running Ubuntu on Comet via Singularity</a></p>
<p>Singularity's adoption by High Performance Computing centers has been driven by its strict security model. It never allows a user in a container to have <code>root</code> privileges unless the user is <code>root</code> on the Host system.</p>
<p>This means that you can only modify containers on a machine where you have <code>root</code>. Therefore you generally build a container on your local machine and then copy it to a Supercomputer.
The process is tedious if you are still tweaking your container and modifying it often, since each time you have to copy over a 4 or maybe 8 GB container image.</p>
<p>In the next section I'll investigate possible solutions/workarounds.</p>
<h2>Use DockerHub</h2>
<p>Singularity can pull a container from DockerHub, so it is convenient if you are already using Docker, maybe to provide a simple way to install your software.</p>
<p>I found out that if you use the automatic build of your container by DockerHub itself, this is very slow: sometimes it takes 30 minutes to have your new container built.</p>
<p>Therefore it is best to build your container locally and then push it to DockerHub. A Docker image is organized in filesystem layers, so for small tweaks to your image you transfer tens of MB to DockerHub instead of GB.</p>
<p>Then from the Supercomputer you can run <code>singularity pull docker://ubuntu:latest</code> with no need of <code>root</code> privileges. Singularity keeps a cache of the docker layers, so you would download just the layers modified in the previous step.</p>
<h2>Build your application locally</h2>
<p>If you are modifying an application often you could build a Singularity container with all the requirements, copy it to the Supercomputer and then build your application there. This is also useful if the architecture of your CPU is different between your local machine and the Supercomputer and you are worried the compiler would not apply all the possible optimizations.</p>
<p>In this case you can use <code>singularity shell</code> to get a terminal inside the container, then build your software with the compiler toolchain available <strong>inside the container</strong> and then install it to your <code>$HOME</code> folder, then modify your <code>$PATH</code> and <code>$LD_LIBRARY_PATH</code> to execute and load libraries from this local folder.</p>
<p>This is also useful in case the container has already an application installed but you want to develop on it. You can follow this process and then mask the installed application with your new version.</p>
<p>Of course this makes your analysis <strong>not portable</strong>, since the software is not available inside the container.</p>
<h3>Freeze your application inside the container</h3>
<p>Once you have completed tweaking the application on the Supercomputer, you can now switch back to your local machine, get the last version of your application and install it system-wide inside the container so that it will be portable.</p>
<p>On the other hand, you might be concerned about performance and prefer to have the application built on the Supercomputer. You can run the build process (e.g. <code>make</code> or <code>python setup.py build</code>) on the Supercomputer in your home folder, then sync the build artifacts back to your local machine and run the install process there (e.g. <code>sudo make install</code> or <code>sudo python setup.py install</code>). Optionally use <code>sshfs</code> to mount the build folder on both machines and make the process transparent.</p>
<h2>Use a local Singularity registry</h2>
<p>Singularity released <a href="https://singularityhub.github.io/singularity-registry/inst/"><code>singularity-registry</code></a>, an application to build a local image registry, like DockerHub, that can take care of building containers.</p>
<p>This can be hosted locally at a Supercomputing Center to provide a local building service. For example the Texas Advanced Computing Center <a href="https://www.slideshare.net/JohnFonner1/biocontainers-for-supercomputers-2000-accessible-discoverable-singularity-apps">locally builds Singularity images from BioContainers</a>, software packages for the Life Sciences.</p>
<p>Otherwise, for example, a user at SDSC could install Singularity Registry on SDSC Cloud and configure it to mount one of Comet's filesystems and build the container images there. Even installing Singularity Registry on Jetstream could be an option thanks to its fast connection to other XSEDE resources.</p>
<h2>Feedback</h2>
<p>If you have any feedback, please reach me at <a href="https://twitter.com/andreazonca">@andreazonca</a> or find my email from there.</p>Deploy scalable Jupyterhub on Docker Swarm mode2017-10-26T18:00:00-07:002017-10-26T18:00:00-07:00Andrea Zoncatag:zonca.github.io,2017-10-26:/2017/10/scalable-jupyterhub-docker-swarm-mode.html<h2>Introduction</h2>
<p>Jupyterhub generally requires roughly 500MB per user for light data processing and many GB for heavy data processing; therefore it is often necessary to deploy it across multiple machines to support many users.</p>
<p>The recommended scalable deployment for Jupyterhub is on Kubernetes, see <a href="https://zonca.github.io/2016/05/jupyterhub-docker-swarm.html">Zero to Jupyterhub</a> (and I'll cover …</p><h2>Introduction</h2>
<p>Jupyterhub generally requires roughly 500MB per user for light data processing and many GB for heavy data processing; therefore it is often necessary to deploy it across multiple machines to support many users.</p>
<p>The recommended scalable deployment for Jupyterhub is on Kubernetes, see <a href="https://zonca.github.io/2016/05/jupyterhub-docker-swarm.html">Zero to Jupyterhub</a> (and I'll cover it next). However, the learning curve for Kubernetes is quite steep; I believe that for smaller deployments (30-50 users, about 10 users per machine) where high availability is not critical, deploying on Docker in Swarm mode is a simpler option.</p>
<p>In the past I have covered a <a href="https://zonca.github.io/2016/05/jupyterhub-docker-swarm.html">Jupyterhub deployment on the old version of Docker Swarm</a> using <code>DockerSpawner</code>. The most important difference is that the last version of Docker has a more sophisticated "Swarm mode" that allows you to launch and manage services instead of individual containers, support for this is provided by <a href="https://github.com/cassinyio/SwarmSpawner"><code>SwarmSpawner</code></a>. Thanks to the new architecture, we do not need to have actual Unix accounts on the Host but all users can run with the <code>jovyan</code> user account defined only inside the Docker containers. Then we can also deploy Jupyterhub itself as a Docker container instead of installing it on the Host.</p>
<h2>Setup a Virtual Machine for the Hub</h2>
<p>First of all we need to create a Virtual Machine. I tested this on the XSEDE Jetstream CentOS 7 image (with Docker pre-installed), but I would recommend Ubuntu 16.04, which is more widely used so it is easier to find support for it.
The same setup would work on a bare-metal server.</p>
<p>Make sure that a recent version of Docker is installed, I used <code>17.07.0-ce</code>.</p>
<p>Setup networking so that port 80 and 443 are accessible for HTTP and HTTPS. Associate a Public IP to this instance so that it is accessible from the Internet.</p>
<p>Add your user to the <code>docker</code> group so you do not need <code>sudo</code> to run <code>docker</code> commands. Check that <code>docker</code> works running <code>docker info</code>.</p>
<h3>Clone the config files repository</h3>
<p>I recommend creating the folder <code>/etc/jupyterhub</code>, setting ownership to your user and cloning my configuration repository there:</p>
<div class="highlight"><pre><span></span><span class="err">git clone https://github.com/zonca/deploy-jupyterhub-dockerswarm /etc/jupyterhub</span>
</pre></div>
<h3>Setup Swarm</h3>
<p>The first node is going to be the <em>Master</em> node of the Swarm, launch:</p>
<div class="highlight"><pre><span></span><span class="err">docker swarm init --advertise-addr INTERNAL_IP_ADDRESS</span>
</pre></div>
<p>It is better to use an internal IP address, for example on Jetstream the <code>192.xxx.xxx.xxx</code> IP. This is the address that the other instances will use to connect to this node.</p>
<p>This command will print out the command that the other nodes will need to run to join this swarm; save it for later (you can recover it with <code>docker swarm join-token</code>).</p>
<h3>Install the NGINX web server</h3>
<p>NGINX is going to sit in front of Jupyterhub as a proxy and handle SSL (configured at the end of this tutorial); we are also going to run NGINX as a Docker service:</p>
<div class="highlight"><pre><span></span><span class="err">docker pull nginx:latest</span>
</pre></div>
<p>Now let's test that Docker and the networking is working correctly, launch <code>nginx</code> with the default configuration:</p>
<div class="highlight"><pre><span></span><span class="err">docker service create \</span>
<span class="err"> --name nginx \</span>
<span class="err"> --publish 80:80 \</span>
<span class="err"> nginx</span>
</pre></div>
<p>This creates a service, and the service in turn creates the containers; check with <code>docker service ls</code> and <code>docker ps</code>. If a container dies, the service will automatically relaunch it.
Now if you connect to your instance from an external machine you should see the NGINX welcome page.
If this is not the case, check <code>docker ps -a</code> and <code>docker logs CONTAINER_ID</code> to debug the issue.</p>
<p>Finally remove the service with:</p>
<div class="highlight"><pre><span></span><span class="err">docker service rm nginx</span>
</pre></div>
<p>Now run the service with the configuration for Jupyterhub, edit <code>nginx.conf</code> and replace <code>SERVER_URL</code> then launch:</p>
<div class="highlight"><pre><span></span><span class="err">bash ngnx_service.sh</span>
</pre></div>
<p>At this point you should get a Gateway error if you connect to your instance with a browser.</p>
<h3>Install Jupyterhub</h3>
<p>Before launching Jupyterhub you need to create a Docker network so that the containers in the swarm can communicate easily:</p>
<div class="highlight"><pre><span></span><span class="err">docker network create --driver overlay jupyterhub</span>
</pre></div>
<p>You can launch the official Jupyterhub 0.8.0 container as a service with:</p>
<div class="highlight"><pre><span></span><span class="err">docker service create \</span>
<span class="err"> --name jupyterhubserver \</span>
<span class="err"> --network jupyterhub \</span>
<span class="err"> --detach=true \</span>
<span class="err"> jupyterhub/jupyterhub:0.8.0</span>
</pre></div>
<p>This runs Jupyterhub with the default <code>jupyterhub_config.py</code>, using local auth and the local spawner.
If you connect to the instance now you should see the Jupyterhub login page, but you cannot log in because you don't have
a user account inside the container. We'll set up authentication next.</p>
<h4>Configure Jupyterhub</h4>
<p>Next we want to customize the hub: first log in on <a href="http://hub.docker.com">http://hub.docker.com</a> and create a new repository,
then follow the instructions there to set up <code>docker push</code> on your server so you can push your image
to the registry.</p>
<p>This is necessary because Swarm might spawn the service on a different machine, so it needs an external
registry to make sure it pulls the right image.</p>
<p>You can now customize the hub image in <code>/etc/jupyterhub/hub</code> with <code>docker build . -t yourusername/jupyterhub-docker</code>
and push it remotely with <code>docker push yourusername/jupyterhub-docker</code>.</p>
<p>This image includes <code>oauthenticator</code> for Github, Google, CILogon and Globus authentication and <code>swarmspawner</code> for
spawning containers for the users.</p>
<p>We can now create <code>jupyterhub_config.py</code>; for now we just want temporary home folders, so replace the <code>mounts</code> variable with <code>[]</code> in <code>c.SwarmSpawner.container_spec</code>. Then customize the server URL <code>server_url.com</code> and IP <code>SERVER_IP</code> (this will be needed later).
At the bottom of <code>jupyterhub_config.py</code> we can also customize CPU and memory constraints. Unfortunately there is no easy way to set a custom disk space limit.</p>
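<p>A sketch of the options discussed above, to be assigned in <code>jupyterhub_config.py</code>. The option names follow the SwarmSpawner documentation as I understand it; verify them against the version you install, and treat the image name and limits as placeholder values:</p>

```python
# Sketch of the SwarmSpawner settings discussed above; option names are
# based on SwarmSpawner's documentation and should be checked against the
# installed version. The image name and limit values are assumptions.
container_spec = {
    "Image": "jupyterhub/singleuser:0.8",
    "mounts": [],  # [] = temporary home folders inside the container
}

# CPU/memory constraints; Docker has no comparable knob for disk space.
resource_spec = {
    "cpu_limit": int(2 * 1e9),   # Swarm expresses CPUs in units of 1e-9 CPU
    "mem_limit": int(2 * 1e9),   # bytes, i.e. roughly 2 GB per user
}

# In jupyterhub_config.py these become:
# c.SwarmSpawner.container_spec = container_spec
# c.SwarmSpawner.resource_spec = resource_spec
```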
<p>Follow the documentation of <code>oauthenticator</code> to setup authentication.</p>
<p>Create the folder <code>/var/nfs</code>; we will configure it later, but it is hardcoded in the script that launches the service.</p>
<p>Temporarily remove from <code>launch_service_jupyterhub.sh</code> the line:</p>
<div class="highlight"><pre><span></span><span class="err">--mount src=nfsvolume,dst=/var/nfs \</span>
</pre></div>
<p>Launch the service from <code>/etc/jupyterhub</code> with <code>bash launch_service_jupyterhub.sh</code>.</p>
<p>Check in the script that we mount the Docker socket into the container so that Jupyterhub can launch Docker containers for the users. We also mount the <code>/etc/jupyterhub</code> folder so that the Hub has access to <code>jupyterhub_config.py</code>, and we constrain it to run on the manager node of this Swarm, which ensures it always runs on this first node. We could later add another manager node for resiliency and the Hub could potentially spawn there with no issues.</p>
<p>At this point we have a first working configuration of Jupyterhub; try to log in and check that the notebooks are working.
This configuration has no permanent storage, so users will have a home folder inside their container and will be able to
write Notebooks and data there until the container filesystem reaches the 10GB image size limit, so about 5GB in practice.
If they log out and log back in they will find their files still there, but if they do "Close my Server" from the control panel,
or if for any other reason their container is removed, they will lose their data.
So this setup could be used for short workshops or demos.</p>
<h2>Setup other nodes</h2>
<p>We can create another Virtual Machine with the same version of Docker and make sure that the two machines internally have all the port open to simplify networking. Any additional machine <strong>needs no open ports</strong> to the outside world, all connections will go through nginx.</p>
<p>We can have it join the Swarm by pasting the token got at Swarm initialization on the first node.</p>
<p>Now when Jupyterhub launches a single user container, it could spawn either on this server or on the first server, Swarm will automatically take care of load balancing. It will also automatically download the Docker image specified in <code>jupyterhub_config.py</code>.</p>
<p>We can add as many nodes as necessary.</p>
<h2>Setup Permanent storage</h2>
<p>Surprisingly enough, Swarm has no easy way to set up permanent storage that would automatically move data from one node to another when a user container is re-spawned on another server. There are some volume plugins, but I believe their configuration is so complex that at that point it would be better to switch directly to Kubernetes.
To achieve a simpler setup that I believe can easily handle a few tens of users, we can use NFS. Moreover, Docker volumes can handle NFS natively, so we don't even need home folders owned by each user: we can just point Docker volumes to our NFS folder, let Docker manage them for us, and use one single user. Users cannot access other people's files because only their own folder is mounted into their container.</p>
<h3>Setup a NFS server</h3>
<p>First we need to decide which server acts as the NFS server. For small deployments the first server, which runs the hub, can also handle this; for more performance we might want a dedicated server that only runs NFS and is part of the internal network, but does not participate in the Swarm, so that no user containers run on it.</p>
<p>In a Cloud environment like Jetstream or Amazon, it is useful to create a Volume and attach it to that instance, so that we can enlarge it later or back it up independently of the instance, and it will survive the Hub instance. Make sure to choose the XFS filesystem if you need to set up disk space constraints. Mount it at <code>/var/nfs/</code> and make sure it is writable by any user.</p>
<p>On that server we can install NFS following the OS instructions and setup <code>/etc/exports</code> with:</p>
<div class="highlight"><pre><span></span><span class="err">/var/nfs *(rw,sync,no_subtree_check)</span>
</pre></div>
<p>The NFS port is accessible only on the internal network anyway so we can just accept any connection.</p>
<p>SSH into any of the Swarm nodes and check this works fine with:</p>
<div class="highlight"><pre><span></span><span class="err">sudo mount 192.NFS.SRV.IP:/var/nfs /mnt</span>
<span class="err">touch /mnt/writing_works</span>
</pre></div>
<h3>Setup Jupyterhub to use Docker Volumes over NFS</h3>
<p>In <code>/etc/jupyterhub/jupyterhub_config.py</code> we should configure the mounts to <code>swarmspawner</code>:</p>
<div class="highlight"><pre><span></span><span class="err">mounts = [{'type': 'volume',</span>
<span class="err"> 'source': 'jupyterhub-user-{username}',</span>
<span class="err"> 'target': notebook_dir,</span>
<span class="err"> 'no_copy' : True,</span>
<span class="err"> 'driver_config' : {</span>
<span class="err"> 'name' : 'local',</span>
<span class="err"> 'options' : {</span>
<span class="err"> 'type' : 'nfs4',</span>
<span class="err"> 'o' : 'addr=SERVER_IP,rw',</span>
<span class="err"> 'device' : ':/var/nfs/{username}/'</span>
<span class="err"> }</span>
<span class="err"> },</span>
<span class="err">}]</span>
</pre></div>
<p>Replace <code>SERVER_IP</code> with your server, this tells the Docker <code>local</code> Volume driver to mount folders <code>/var/nfs/{username}</code> as home folders of the single user notebook container.</p>
<p>The only problem is that these folders need to pre-exist, so I modified the <code>swarmspawner</code> plugin to create them the first time a user authenticates; please let me know if there is a better way and I'll improve this tutorial.
See the branch <code>createfolder</code> on <a href="https://github.com/zonca/SwarmSpawner/tree/createfolder">my fork of <code>swarmspawner</code></a>.
In order to install this you need to modify your custom <code>jupyterhub-docker</code> image to install from there (see the commented-out section in <code>hub/Dockerfile</code>).
Often the <code>Authenticator</code> transforms the username into a hash, so I added a feature to this spawner that also creates a text file <code>HASH_email.txt</code> storing the user's email, so that it is easier to check directly from the filesystem who owns a specific folder.</p>
<p>For this to work the Hub needs access to <code>/var/nfs/</code>, the best way to achieve this is to create another Volume, add the <code>NFS_SERVER_IP</code> and launch on the first server:</p>
<div class="highlight"><pre><span></span><span class="err">bash create_volume_nfs.sh</span>
</pre></div>
<p>Then uncomment the <code>--mount src=nfsvolume,dst=/var/nfs \</code> line from <code>launch_service_jupyterhub.sh</code> and relaunch the service so that it is available locally.</p>
<p>At this point you should test that if you login, then stop/kill the container, your data should still be there when you launch it again.</p>
<h3>Setup user quota</h3>
<p>The Docker local Volume driver does not support setting a user quota, so we have to resort to our filesystem. You can modify <code>/etc/fstab</code> to mount the XFS volume with the <code>pquota</code> option, which supports setting a limit on a folder and all of its subfolders. We cannot use user quotas because all of the users run under the same UNIX account.</p>
<p>Create a folder <code>/var/nfs/testquota</code> and then test that setting quota is working with:</p>
<div class="highlight"><pre><span></span><span class="err">sudo set_quota.sh /var/nfs testquota</span>
</pre></div>
<p>There should be a space between <code>/var/nfs</code> and <code>testquota</code>, then check with:</p>
<div class="highlight"><pre><span></span><span class="err">bash get_quota.sh</span>
</pre></div>
<p>You should see a quota of <code>1GB</code> for that folder. Modify <code>set_quota.sh</code> to choose another size.</p>
<h4>Automatically set quotas</h4>
<p>We want the quota to be set automatically each time the spawner creates another folder; <code>incrond</code> can monitor a folder for newly created files and launch the <code>set_quota.sh</code> script for us.</p>
<p>Install the <code>incrond</code> package and make sure it is active and restarted on boot. Then customize it with <code>sudo incrontab -e</code> and paste the content of <code>incrontab</code> in <code>/etc/jupyterhub</code>.</p>
<p>Now delete your user folder in <code>/var/nfs</code> and launch Jupyterhub again to check that the folder is created with the correct quota. The spawner also creates a <code>/var/nfs/{username}_QUOTA_NOT_SET</code> that is deleted then by the <code>set_quota.sh</code> script.</p>
<h2>Setup HTTPS</h2>
<p>We would like to setup NGINX to provide SSL encryption for Jupyterhub using the free Letsencrypt service. The main issue is that those certificates need to be renewed every few months, so we need a service running regularly to take care of that.</p>
<p>The simplest option would be to add <code>--publish 8000</code> to the Jupyterhub so that Jupyterhub exposes its port to the host and then remove the NGINX Docker container and install NGINX and certbot directly on the first host following <a href="https://www.digitalocean.com/community/tutorials/how-to-secure-nginx-with-let-s-encrypt-on-ubuntu-16-04">a standard setup</a>.</p>
<p>However, to keep the setup more modular, we'll proceed and use another NGINX container that comes equipped with automatic Let's Encrypt certificates request and renewal available at: <a href="https://github.com/linuxserver/docker-letsencrypt">https://github.com/linuxserver/docker-letsencrypt</a>.</p>
<h3>Modify networking setup</h3>
<p>One complication is that this container requires additional privileges to handle networking that are not available in Swarm mode, so we will run this container outside of the Swarm on the first node.</p>
<p>We need to make the <code>jupyterhub</code> network that we created before attachable by containers outside the Swarm.</p>
<div class="highlight"><pre><span></span><span class="err">docker service rm nginx</span>
<span class="err">bash remove_service_jupyterhub.sh</span>
<span class="err">docker network rm jupyterhub</span>
<span class="err">docker network create --driver overlay --attachable jupyterhub</span>
</pre></div>
<p>Then add <code>--publish 8000</code> to <code>launch_service_jupyterhub.sh</code> and start Jupyterhub again. Make sure that if you SSH to the first node you can run <code>wget localhost:8000</code> successfully, but if you try to access <code>yourdomain:8000</code> from the internet you <strong>should not</strong> be able to connect (the port should be closed by the OpenStack networking configuration, for example).</p>
<h3>Test the NGINX/Letsencrypt container</h3>
<p>Create a volume to save the configuration and the logs (optionally on the NFS volume):</p>
<div class="highlight"><pre><span></span><span class="err">docker volume create --driver local nginx_volume</span>
</pre></div>
<p>Test the container running:</p>
<div class="highlight"><pre><span></span><span class="err">docker run \</span>
<span class="err"> --cap-add=NET_ADMIN \</span>
<span class="err"> --name nginx \</span>
<span class="err"> -p 443:443 \</span>
<span class="err"> -e EMAIL=your_email@domain.edu \</span>
<span class="err"> -e URL=your.domain.org \</span>
<span class="err"> -v nginx_volume:/config \</span>
<span class="err"> linuxserver/letsencrypt</span>
</pre></div>
<p>If this works correctly, connect to <a href="https://your.domain.org">https://your.domain.org</a>: you should see a valid SSL certificate and a welcome message. If not, check <code>docker logs nginx</code>.</p>
<h3>Configure NGINX to proxy Jupyterhub</h3>
<p>We can use <code>letsencrypt_container_nginx.conf</code> to handle NGINX configuration with HTTPS support; it loads the certificates from a path automatically created by the <code>letsencrypt</code> container.</p>
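<p>As a rough sketch (not the actual <code>letsencrypt_container_nginx.conf</code>; the upstream name and certificate paths are assumptions based on the <code>letsencrypt</code> container's <code>/config</code> layout), the proxy section looks like:</p>

```nginx
server {
    listen 443 ssl;
    server_name your.domain.org;

    # certificates requested automatically by the letsencrypt container
    ssl_certificate     /config/keys/letsencrypt/fullchain.pem;
    ssl_certificate_key /config/keys/letsencrypt/privkey.pem;

    location / {
        # forward everything to Jupyterhub over the attachable overlay network
        proxy_pass http://jupyterhub:8000;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        # WebSocket support, required by the notebook server
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
    }
}
```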
<p>Customize <code>launch_letsencrypt_container.sh</code> and then run it; it will recreate the NGINX container and also bind-mount the NGINX configuration into the container.</p>
<p>Now you should be able to connect to your server over HTTPS and access Jupyterhub.</p>
<h2>Feedback</h2>
<p>Feedback appreciated, <a href="https://twitter.com/andreazonca">@andreazonca</a></p>
<p>I am also available to support US scientists to deploy scientific gateways through the <a href="https://www.xsede.org/for-users/ecss">XSEDE ECSS consultation program</a>.</p>Setup automated testing on a Github repository with Travis-ci2017-09-06T18:00:00-07:002017-09-06T18:00:00-07:00Andrea Zoncatag:zonca.github.io,2017-09-06:/2017/09/automated-testing-travis-ci-github.html<h2>Introduction</h2>
<p>It is good practice in software development to implement extensive testing of the codebase in order to quickly catch any bug introduced into the code when implementing new features.</p>
<p>The suite of tests should be easy to execute (possibly one single command, for example with the <code>py.test</code> runner …</p><h2>Introduction</h2>
<p>It is good practice in software development to implement extensive testing of the codebase in order to quickly catch any bug introduced into the code when implementing new features.</p>
<p>The suite of tests should be easy to execute (possibly one single command, for example with the <code>py.test</code> runner) and quick to run (more than 1 minute would make it tedious to run).</p>
<p>The developers should run the unit test suite every time they implement a change to the codebase to make sure nothing else has been broken.</p>
<p>However, once a commit has been pushed to Github, it is also useful to have the tests executed automatically, for at least two reasons:</p>
<ul>
<li>Run tests in all the environments that need to be supported by the software, for example with different versions of Python or different versions of a key required external dependency</li>
<li>Run tests in a clean environment that has less risk of being contaminated by a misconfiguration in one of the developers' environments</li>
</ul>
<h2>Travis-CI</h2>
<p>Travis is a free web-based service that registers a trigger on Github so that every time a commit is pushed to Github or a Pull Request is opened, it launches an isolated Ubuntu container (macOS is also supported) for each of the configurations that we want to test, builds the software (if needed) and then runs the tests.</p>
<p>The only requirement for the free service is that the Github project be public; there are paid plans for private repositories.</p>
<h2>Setup on Travis-CI</h2>
<ul>
<li>Go to <a href="http://travis-ci.org">http://travis-ci.org</a> and login with a Github account</li>
<li>In order to automatically configure the hook on Github, Travis requests write privileges to your Github account, annoying but convenient</li>
<li>Leave all default options, just make sure that Pull Requests are automatically tested</li>
<li>If you have the repository both under an organization and as a fork under your account, you can choose to test both or just the organization repository; either way, your pull requests will be tested before merging.</li>
</ul>
<h2>Preparation of the test scripts</h2>
<p>In order to automate running the test scripts on Travis-CI, it is important that the test scripts return an exit code different from zero to signal that the tests failed.</p>
<p>If you are using a test running tool like <code>pytest</code>, this is automatically done for you. If you are using bash scripts instead, make sure the script calls <code>exit 1</code> when it detects an error.</p>
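<p>As an illustration (the checked command is a stand-in, not taken from any real test suite), a minimal bash test script that signals failure to Travis-CI could look like this:</p>

```shell
#!/bin/bash
# Minimal sketch of a test script: exit with a non-zero code on failure
# so that Travis-CI marks the build as broken.
result=$(echo "2 + 2" | tr -d ' ')   # placeholder for a real test command
if [ "$result" != "2+2" ]; then
    echo "test failed: unexpected output '$result'" >&2
    exit 1
fi
echo "all tests passed"
```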
<h2>Configuration of the repository</h2>
<ul>
<li>
<p>Create a new branch on your repository:</p>
<div class="highlight"><pre><span></span><span class="err">git checkout -b test_travis</span>
</pre></div>
</li>
<li>
<p>Add a <code>.travis.yml</code> (mind that it starts with a dot) configuration file</p>
</li>
<li>
<p>Inside this file you can configure how your project is built and tested, for the simple case of <code>bash</code> or <code>perl</code> scripts you can just write:</p>
<div class="highlight"><pre><span></span><span class="n">dist</span><span class="o">:</span> <span class="n">trusty</span>
<span class="n">language</span><span class="o">:</span> <span class="n">bash</span>
<span class="n">script</span><span class="o">:</span>
<span class="o">-</span> <span class="n">cd</span> <span class="n">$TRAVIS_BUILD_DIR</span><span class="o">/</span><span class="n">tests</span><span class="o">;</span> <span class="n">bash</span> <span class="n">run_test</span><span class="o">.</span><span class="na">sh</span>
</pre></div>
</li>
<li>
<p>Check the Travis-CI documentation for advanced configuration options</p>
</li>
<li>Now push these changes to your fork of the main repository and then create a Pull Request to the main repository</li>
<li>Go to <a href="https://travis-ci.org/YOUR_ORGANIZATION/YOUR_REPO">https://travis-ci.org/YOUR_ORGANIZATION/YOUR_REPO</a> to check the build status and the log</li>
<li>Once your Pull Request passes the tests, merge it to the main repository so that the master branch will also be tested for all future commits.</li>
</ul>
<h2>Python example</h2>
<p>In the following example, Travis-CI will create 8 builds: each of the 4 versions of Python will be tested with each of the 2 versions of <code>numpy</code>:</p>
<div class="highlight"><pre><span></span><span class="n">language</span><span class="o">:</span> <span class="n">python</span>
<span class="n">python</span><span class="o">:</span>
<span class="o">-</span> <span class="s2">"2.7"</span>
<span class="o">-</span> <span class="s2">"3.4"</span>
<span class="o">-</span> <span class="s2">"3.5"</span>
<span class="o">-</span> <span class="s2">"3.6"</span>
<span class="n">env</span><span class="o">:</span>
<span class="o">-</span> <span class="n">NUMPY_VERSION</span><span class="o">=</span><span class="mf">1.12</span><span class="o">.</span><span class="mi">1</span>
<span class="o">-</span> <span class="n">NUMPY_VERSION</span><span class="o">=</span><span class="mf">1.13</span><span class="o">.</span><span class="mi">1</span>
<span class="err">#</span> <span class="n">command</span> <span class="n">to</span> <span class="n">install</span> <span class="n">dependencies</span><span class="o">,</span> <span class="n">requirements</span><span class="o">.</span><span class="na">txt</span> <span class="n">should</span> <span class="n">NOT</span> <span class="k">include</span> <span class="n">numpy</span>
<span class="n">install</span><span class="o">:</span>
<span class="o">-</span> <span class="n">pip</span> <span class="n">install</span> <span class="o">-</span><span class="n">r</span> <span class="n">requirements</span><span class="o">.</span><span class="na">txt</span> <span class="n">numpy</span><span class="o">==</span><span class="n">$NUMPY_VERSION</span>
<span class="err">#</span> <span class="n">command</span> <span class="n">to</span> <span class="n">run</span> <span class="n">tests</span>
<span class="n">script</span><span class="o">:</span>
<span class="o">-</span> <span class="n">pytest</span> <span class="err">#</span> <span class="n">or</span> <span class="n">py</span><span class="o">.</span><span class="na">test</span> <span class="k">for</span> <span class="n">Python</span> <span class="n">versions</span> <span class="mf">3.5</span> <span class="n">and</span> <span class="n">below</span>
</pre></div>
<h2>Badge in README</h2>
<p>An aesthetic touch: left-click on the "Build Passing" badge on the Travis-CI page for your repository, choose "Markdown" and paste the code into the <code>README.md</code> of your repository on Github. This will show in real time whether the latest version of the code is passing the tests.</p>Deployment of Jupyterhub with Globus Auth to spawn Notebook on Comet in Singularity containers2017-08-11T18:00:00-07:002017-08-11T18:00:00-07:00Andrea Zoncatag:zonca.github.io,2017-08-11:/2017/08/jupyterhub-globus-comet-singularity.html<h2>Build Singularity containers to run single user notebook applications</h2>
<p>Follow the instructions at <a href="https://github.com/zonca/singularity-comet">https://github.com/zonca/singularity-comet</a> to build images from the <code>ubuntu_anaconda_jupyterhub.def</code> and <code>centos_anaconda_jupyterhub.def</code> definition files, or use the containers I have already built on Comet:</p>
<div class="highlight"><pre><span></span><span class="err">/oasis/scratch/comet/zonca/temp_project/centos_anaconda_jupyterhub.img</span>
<span class="err">/oasis/scratch/comet …</span></pre></div><h2>Build Singularity containers to run single user notebook applications</h2>
<p>Follow the instructions at <a href="https://github.com/zonca/singularity-comet">https://github.com/zonca/singularity-comet</a> to build images from the <code>ubuntu_anaconda_jupyterhub.def</code> and <code>centos_anaconda_jupyterhub.def</code> definition files, or use the containers I have already built on Comet:</p>
<div class="highlight"><pre><span></span><span class="err">/oasis/scratch/comet/zonca/temp_project/centos_anaconda_jupyterhub.img</span>
<span class="err">/oasis/scratch/comet/zonca/temp_project/ubuntu_anaconda_cmb_jupyterhub.img</span>
</pre></div>
<p>These containers have Centos 7 and Ubuntu 16.04 base images, MPI support (not needed for this), Anaconda 4.4.0, the Jupyterhub (for the <code>jupyterhub-singleuser</code> script) and Jupyterlab (for the awesomeness) packages.</p>
<h2>Initial setup of Jupyterhub with Ansible</h2>
<p>First we want to use the Ansible playbook provided by the Jupyter team to set up an Ubuntu Virtual Machine, for example on SDSC Cloud or XSEDE Jetstream.
This already sets up a Jupyterhub instance on a single machine with Github authentication, NGINX with Let's Encrypt SSL and spawning of Notebooks as local processes.</p>
<p>Start from: <a href="https://zonca.github.io/2017/02/automated-deployment-jupyterhub-ansible.html">Automated deployment of Jupyterhub with Ansible</a></p>
<p>There is a compatibility error with <code>conda</code> 4.3 and above; I had to fix this (and provided a PR upstream), so I used the version at <a href="https://github.com/zonca/jupyterhub-deploy-teaching/tree/globus_singularity">https://github.com/zonca/jupyterhub-deploy-teaching/tree/globus_singularity</a>.
In particular, check the example configuration file in the <code>host_vars/</code> folder.</p>
<p>Once we have executed the scripts, connect to the Virtual Machine, login with Github and check that Notebooks are working.</p>
<h2>Setup Authentication with Globus</h2>
<p>Next we can SSH into the Jupyterhub Virtual Machine and customize Jupyterhub configuration in <code>/etc/jupyterhub</code></p>
<p><code>oauthenticator</code> should already be installed, but it needs the Globus SDK to support authentication with Globus:</p>
<div class="highlight"><pre><span></span><span class="n">sudo</span><span class="w"> </span><span class="o">/</span><span class="n">opt</span><span class="o">/</span><span class="n">conda</span><span class="o">/</span><span class="n">bin</span><span class="o">/</span><span class="n">pip</span><span class="w"> </span><span class="n">install</span><span class="w"> </span><span class="n">globus_sdk</span><span class="o">[</span><span class="n">jwt</span><span class="o">]</span><span class="w"></span>
</pre></div>
<p>Then follow the instructions to setup Globus Auth: <a href="https://github.com/jupyterhub/oauthenticator#globus-setup">https://github.com/jupyterhub/oauthenticator#globus-setup</a></p>
<p>You should now have added these lines to <code>/etc/jupyterhub/jupyterhub_config.py</code>:</p>
<div class="highlight"><pre><span></span><span class="kn">from</span> <span class="nn">oauthenticator.globus</span> <span class="kn">import</span> <span class="n">GlobusOAuthenticator</span>
<span class="n">c</span><span class="o">.</span><span class="n">JupyterHub</span><span class="o">.</span><span class="n">authenticator_class</span> <span class="o">=</span> <span class="n">GlobusOAuthenticator</span>
<span class="n">c</span><span class="o">.</span><span class="n">GlobusOAuthenticator</span><span class="o">.</span><span class="n">oauth_callback_url</span> <span class="o">=</span> <span class="s1">'https://xxx-xxx-xxx-xxx.compute.cloud.sdsc.edu/hub/oauth_callback'</span>
<span class="n">c</span><span class="o">.</span><span class="n">GlobusOAuthenticator</span><span class="o">.</span><span class="n">client_id</span> <span class="o">=</span> <span class="s1">''</span>
<span class="n">c</span><span class="o">.</span><span class="n">GlobusOAuthenticator</span><span class="o">.</span><span class="n">client_secret</span> <span class="o">=</span> <span class="s1">''</span>
</pre></div>
<p>You should now be able to log in with your Globus ID credentials; see the documentation to support credentials from institutions supported by Globus Auth.
After login, don't worry if you get an error when starting your notebook.</p>
<h2>Setup Spawning with Batchspawner</h2>
<p>In my last post about spawning Notebooks on Comet I was using XSEDE authentication so that each user would have to use their own Comet account.
In this scenario instead we imagine a Gateway system where the administrator shares their own allocation with the Gateway users.
Therefore you should create an SSH keypair for the <code>root</code> user on the Jupyterhub Virtual Machine and make sure you can log in to Comet as the Gateway user with no password required.</p>
<p>Then you need to install <code>batchspawner</code>:</p>
<div class="highlight"><pre><span></span><span class="err">git clone https://github.com/jupyterhub/batchspawner.git</span>
<span class="err">cd batchspawner/</span>
<span class="err">sudo /opt/conda/bin/pip install .</span>
</pre></div>
<p>Then configure the Spawner, see <a href="https://gist.github.com/zonca/aaed55502c4b16535fe947791d02ac32">my configuration of Jupyterhub: <code>jupyterhub_config.py</code></a>.</p>
<p>You should modify <code>comet_spawner.py</code> to point to your Gateway user home folder and then fill all the details in <code>jupyterhub_config.py</code> marked by the <code>CONF</code> string.</p>
<p>In <code>CometSpawner</code> I also create a form for the user to choose the parameters of the job and also the Singularity image they want to use.</p>
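<p>Such a form is defined through the standard <code>Spawner.options_form</code> trait; a hedged sketch follows (field names and queue values are illustrative, not taken from the actual <code>comet_spawner.py</code>; the image paths are the containers listed above):</p>

```python
# Hypothetical excerpt of jupyterhub_config.py: present a form with the
# job parameters and the Singularity image to use.
c.CometSpawner.options_form = """
<label for="queue">Queue</label>
<select name="queue"><option>compute</option><option>debug</option></select>
<label for="hours">Job length (hours)</label>
<input name="hours" value="1">
<label for="image">Singularity image</label>
<select name="image">
  <option>/oasis/scratch/comet/zonca/temp_project/centos_anaconda_jupyterhub.img</option>
  <option>/oasis/scratch/comet/zonca/temp_project/ubuntu_anaconda_cmb_jupyterhub.img</option>
</select>
"""
```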
<p>Here the spawner uses <code>SSH</code> to connect to the Comet login node and submit jobs as the Gateway user.</p>
<p>At this point you should be able to login and launch a job on Comet, execute <code>squeue</code> on Comet to check if that works or look in the home folder of the Gateway user for the logfile of the job and in <code>/var/log/jupyterhub</code> on the Virtual machine for errors.</p>
<h2>Setup tunneling</h2>
<p>Finally we need a way for the gateway Virtual Machine to access the port on the Comet computing node in order to proxy the Notebook application back to the user.</p>
<p>The simplest solution is to create a user <code>tunnelbot</code> on the VM with no shell access, then create an SSH keypair and paste the <strong>private</strong> key into the <code>jupyterhub_config.py</code> file (contact me if you have a better solution!).
The job on Comet then sets up an SSH tunnel between the Comet computing node and the Jupyterhub VM.</p>
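<p>The tunnel line in the job script is roughly the following sketch (hostname, port variable and key path are placeholders, not taken from the actual job script):</p>

```shell
# Reverse tunnel: expose the notebook port of the computing node on the
# Jupyterhub VM, where the proxy expects to find it.
ssh -i "$KEYFILE" -o StrictHostKeyChecking=no -N -f \
    -R "$PORT:localhost:$PORT" tunnelbot@jupyterhub-vm.example.org
```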
<h2>Improvements</h2>
<p>To keep the setup simple, all users run in the home folder of the Gateway user; for a real deployment, it is possible to create a subfolder for each user beforehand and then use Singularity to mount that as the home folder.</p>How to create pull requests on Github2017-06-30T11:00:00-07:002017-06-30T11:00:00-07:00Andrea Zoncatag:zonca.github.io,2017-06-30:/2017/06/quick-github-pull-requests.html<p>Pull Requests are the web-based version of sending software patches via email to code maintainers.
They allow a person that has no access to a code repository to submit a code change to the repository administrator for review and 1-click merging.</p>
<h2>Preparation</h2>
<ul>
<li>Create a free Github account at <a href="https://github.com">https://github …</a></li></ul><p>Pull Requests are the web-based version of sending software patches via email to code maintainers.
They allow a person that has no access to a code repository to submit a code change to the repository administrator for review and 1-click merging.</p>
<h2>Preparation</h2>
<ul>
<li>Create a free Github account at <a href="https://github.com">https://github.com</a></li>
<li>Login on Github with your credentials</li>
<li>Go to the homepage of the repository, for example <a href="https://github.com/sdsc/sdsc-summer-institute-2017">https://github.com/sdsc/sdsc-summer-institute-2017</a></li>
</ul>
<h2>Small changes via Github.com</h2>
<p>For small changes, like creating a folder and uploading a few files, or a quick fix to an existing file, you don't even need to use the <code>git</code> command line client.</p>
<ul>
<li>If you need to <strong>create a folder</strong><ul>
<li>click on "Create new file"</li>
<li>in the "Name your file..." box, insert: "yourfolder/README.md"</li>
<li>in the README.md write a description of the content of the folder; you can use Markdown syntax (see <a href="https://github.com/adam-p/markdown-here/wiki/Markdown-Cheatsheet">the Markdown Cheatsheet</a>)</li>
<li>create a bullet list with description of the files you will be uploading next</li>
<li>Click on "Propose new file"</li>
<li>this will ask you to create a Pull Request, follow the prompts and make sure to confirm at the end that you want to create a Pull Request, you have to click twice on "Create Pull Request" buttons</li>
</ul>
</li>
<li>If you want to upload files into the folder you just created, you need an additional step (if you are uploading to a folder that already exists in the original repo, skip this):<ul>
<li>Go to the fork of the original repository that was created automatically under your account, for example: <a href="https://github.com/YOURUSERNAME/sdsc-summer-institute-2017">https://github.com/YOURUSERNAME/sdsc-summer-institute-2017</a></li>
<li>Click on the dropdown "Branch" menu and look for the branch named <code>patch-1</code>, or <code>patch-n</code> if you have more.</li>
</ul>
</li>
<li>Click on the "Upload files" button, select and upload all files, a few notes:<ul>
<li>do not upload zip archives</li>
<li>do not upload large data files, Github is for code</li>
<li>if you are uploading binary files like images, reduce them to a small size first</li>
<li>this will ask you to create a Pull Request, follow the prompts and make sure to confirm at the end that you want to create a Pull Request, you have to click twice on "Create Pull Request" buttons</li>
</ul>
</li>
<li>Check that your pull request appeared in the Pull Requests area of the repository, for example <a href="https://github.com/sdsc/sdsc-summer-institute-2017/pulls">https://github.com/sdsc/sdsc-summer-institute-2017/pulls</a></li>
</ul>
<h2>Update a previously created Pull Request via Github.com</h2>
<p>If the repository maintainer has some feedback on your Pull Request, you can update it to accommodate any requested change.</p>
<ul>
<li>Go to the fork of the original repository that was created automatically under your account, for example: <a href="https://github.com/YOURUSERNAME/sdsc-summer-institute-2017">https://github.com/YOURUSERNAME/sdsc-summer-institute-2017</a></li>
<li>Click on the dropdown "Branch" menu and look for the branch named <code>patch-1</code>, or <code>patch-n</code> if you have more.</li>
<li>Now make changes to files or upload new files, then confirm and write a commit message from the web interface</li>
<li>Check that your changes appear as updates inside the Pull Request you created before, for example <a href="https://github.com/sdsc/sdsc-summer-institute-2017/pull/N">https://github.com/sdsc/sdsc-summer-institute-2017/pull/N</a> where N is the number assigned to your Pull Request</li>
</ul>
<h2>Use the command line client</h2>
<p>For more control, and especially if you expect the repository maintainer to make changes to your Pull Request before merging it, it is better to use <code>git</code>.</p>
<ul>
<li>Click on the "Fork" button on the top right of the repository</li>
<li>Now you should be on the copy of the repository under your own account, for example <a href="https://github.com/YOURUSERNAME/sdsc-summer-institute-2017">https://github.com/YOURUSERNAME/sdsc-summer-institute-2017</a></li>
<li>
<p>Now open your terminal; if you have never used <code>git</code> before, set it up with:</p>
<div class="highlight"><pre><span></span>$ git config --global user.name <span class="s2">"Your Name"</span>
$ git config --global user.email <span class="s2">"your@email.edu"</span>
</pre></div>
</li>
<li>
<p>Then clone the repository with:</p>
<div class="highlight"><pre><span></span><span class="err">git clone https://github.com/YOURUSERNAME/sdsc-summer-institute-2017</span>
</pre></div>
</li>
<li>
<p>Enter the repository folder</p>
</li>
<li>
<p>Create a branch to isolate your changes with:</p>
<div class="highlight"><pre><span></span><span class="err">git checkout -b "add_XXXX_material"</span>
</pre></div>
</li>
<li>
<p>Now create folders and modify files; you can use any text editor</p>
</li>
<li>
<p>Once you are done making modifications, you can stage them to be committed; the following adds everything inside the folder:</p>
<div class="highlight"><pre><span></span><span class="err">git add my_folder</span>
</pre></div>
</li>
<li>
<p>It is generally better to add each file explicitly instead, to make sure you don't accidentally commit the wrong files:</p>
<div class="highlight"><pre><span></span><span class="err">git add my_folder/aaa.txt my_folder/README.md</span>
</pre></div>
</li>
<li>
<p>Then write these changes to history with a commit:</p>
<div class="highlight"><pre><span></span><span class="err">git commit -m "Added material about XXXX"</span>
</pre></div>
</li>
<li>
<p>Push changes to Github</p>
<div class="highlight"><pre><span></span><span class="err">git push -u origin add_XXXX_material</span>
</pre></div>
</li>
<li>
<p>Now go to the homepage of the original repository, for example <a href="https://github.com/sdsc/sdsc-summer-institute-2017">https://github.com/sdsc/sdsc-summer-institute-2017</a></p>
</li>
<li>There should be a yellow notice saying that it detected a recently pushed branch, click on "Compare and Pull Request"</li>
<li>Add a description</li>
<li>Confirm with the green "Create Pull Request" button</li>
</ul>
<p>In case you want to update your Pull Request, repeat the steps of <code>git add</code>, <code>git commit</code> and <code>git push</code>, any changes will be reflected inside the pull request.</p>Deploy Jupyterhub on a Supercomputer with SSH Authentication2017-05-16T22:00:00-07:002017-05-16T22:00:00-07:00Andrea Zoncatag:zonca.github.io,2017-05-16:/2017/05/jupyterhub-hpc-batchspawner-ssh.html<p>The best way to deploy Jupyterhub with an interface to a Supercomputer is through the use of <code>batchspawner</code>. I have a sample deployment explained in an older blog post: <a href="https://zonca.github.io/2017/02/sample-deployment-jupyterhub-hpc.html">https://zonca.github.io/2017/02/sample-deployment-jupyterhub-hpc.html</a></p>
<p>This setup however requires a OAUTH service, in this case provided by XSEDE …</p><p>The best way to deploy Jupyterhub with an interface to a Supercomputer is through the use of <code>batchspawner</code>. I have a sample deployment explained in an older blog post: <a href="https://zonca.github.io/2017/02/sample-deployment-jupyterhub-hpc.html">https://zonca.github.io/2017/02/sample-deployment-jupyterhub-hpc.html</a></p>
<p>This setup however requires an OAuth service, in this case provided by XSEDE, to authenticate the users via the web and then provide an X509 certificate that is then used by <code>batchspawner</code> to
connect to the Supercomputer on behalf of the user and submit the job to spawn a notebook.</p>
<p>In case an authentication service of this type is not available, another option is to use SSH authentication.</p>
<p>The starting point is a server with vanilla Jupyterhub installed; good practice is to use an already available Ansible recipe, like <a href="https://zonca.github.io/2017/02/automated-deployment-jupyterhub-ansible.html">https://zonca.github.io/2017/02/automated-deployment-jupyterhub-ansible.html</a>, which deploys Jupyterhub in a safer way, e.g. with an NGINX frontend and HTTPS.</p>
<p>First we want to set up authentication; the simplest way to start is to use the default authentication with local UNIX user accounts and possibly add Github later.
In any case all the users need an account both on the Supercomputer and on the Jupyterhub server, with the same username; this is tedious but is the simplest way to allow them to authenticate on the Supercomputer.
Then we need to save the <strong>private</strong> SSH key into each user's <code>.ssh</code> folder and make sure they can SSH to the Supercomputer with no password required.</p>
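<p>For each user, the key setup could be sketched as follows (username and host are placeholders; run as root on the Jupyterhub server):</p>

```shell
# Create a passphrase-less keypair owned by the user, install the public
# key on the Supercomputer, then verify that no password prompt appears.
sudo -u alice ssh-keygen -t rsa -N "" -f /home/alice/.ssh/id_rsa
sudo -u alice ssh-copy-id alice@supercomputer.example.org
sudo -u alice ssh -o BatchMode=yes alice@supercomputer.example.org true && echo "passwordless SSH OK"
```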
<p>Then we can install <code>batchspawner</code> and configure Jupyterhub to use it. In the <code>batchspawner</code> configuration in <code>jupyterhub_config.py</code>, you have to prefix the scheduler commands with <code>ssh</code> so that Jupyterhub can connect to the Supercomputer to submit the job:</p>
<div class="highlight"><pre><span></span><span class="err">c.SlurmSpawner.batch_submit_cmd = 'ssh {username}@{host} sbatch'</span>
</pre></div>
<p>See for example <a href="https://github.com/jupyterhub/jupyterhub-deploy-hpc/blob/master/batchspawner-xsedeoauth-sshtunnel-sdsccomet/jupyterhub_config.py#L66">my configuration for Comet</a> and replace <code>gsissh</code> with <code>ssh</code>.</p>
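<p>The other scheduler commands can be wrapped the same way; a hedged sketch follows (the <code>batch_query_cmd</code> and <code>batch_cancel_cmd</code> traits come from <code>batchspawner</code>; the exact <code>squeue</code> format string is an assumption, not taken from the linked configuration):</p>

```python
# Hypothetical excerpt of jupyterhub_config.py: run every scheduler
# command on the Supercomputer over SSH.
c.SlurmSpawner.batch_submit_cmd = 'ssh {username}@{host} sbatch'
c.SlurmSpawner.batch_query_cmd  = 'ssh {username}@{host} squeue -h -j {job_id} -o "%T %B"'
c.SlurmSpawner.batch_cancel_cmd = 'ssh {username}@{host} scancel {job_id}'
```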
<p>Now when users connect, they are authenticated with local UNIX user accounts username and password and then Jupyterhub uses their SSH key to launch a job on the Supercomputer.</p>
<p>The last issue is how to proxy the Jupyterhub running on a computing node back to the server. One option is to create a user on the server with no terminal access but with the ability to create tunnels; then, at the end of the job, set up a tunnel using an SSH private key pasted into the job script itself, see for example <a href="https://github.com/jupyterhub/jupyterhub-deploy-hpc/blob/master/batchspawner-xsedeoauth-sshtunnel-sdsccomet/jupyterhub_config.py#L54">my setup on Comet</a>.</p>Configure Globus on your local machine for GridFTP with XSEDE authentication2017-04-19T12:00:00-07:002017-04-19T12:00:00-07:00Andrea Zoncatag:zonca.github.io,2017-04-19:/2017/04/globus-gridftp-local.html<p>All the commands are executed on your local machine; the purpose of this tutorial is to be able to use <code>globus-url-copy</code> to efficiently copy data back and forth between your local machine and an XSEDE Supercomputer on the command line.</p>
<p>For a simpler point and click web interface, install Globus …</p><p>All the commands are executed on your local machine, the purpose of this tutorial is to be able to use <code>globus-url-copy</code> to copy efficiently data back and forth between your local machine and a XSEDE Supercomputer on the command line.</p>
<p>For a simpler point-and-click web interface, install Globus Connect Personal instead: <a href="https://www.globus.org/globus-connect-personal">https://www.globus.org/globus-connect-personal</a></p>
<h2>Install Globus toolkit</h2>
<p>See <a href="http://toolkit.globus.org/toolkit/docs/latest-stable/admin/install/#install-toolkit">http://toolkit.globus.org/toolkit/docs/latest-stable/admin/install/#install-toolkit</a></p>
<p>On Ubuntu, download the <code>deb</code> package of the Globus repository and install it:</p>
<div class="highlight"><pre><span></span><span class="err">wget http://www.globus.org/ftppub/gt6/installers/repo/globus-toolkit-repo_latest_all.deb</span>
<span class="err">sudo dpkg -i globus-toolkit-repo_latest_all.deb</span>
<span class="err">sudo apt-get install globus-data-management-client</span>
</pre></div>
<h2>Install XSEDE certificates on your machine</h2>
<div class="highlight"><pre><span></span><span class="err">wget https://software.xsede.org/security/xsede-certs.tar.gz</span>
<span class="err">tar xvf xsede-certs.tar.gz</span>
<span class="err">sudo mv certificates /etc/grid-security</span>
</pre></div>
<p>Full instructions here:</p>
<p><a href="https://software.xsede.org/production/CA/CA-install.html">https://software.xsede.org/production/CA/CA-install.html</a></p>
<h2>Authenticate with the myproxy provided by XSEDE</h2>
<p>Authenticate with your XSEDE user and password:</p>
<div class="highlight"><pre><span></span><span class="err">myproxy-logon -s myproxy.xsede.org -l $USER -t 36</span>
</pre></div>
<p>You can specify the lifetime of the certificate in hours with <code>-t</code>.</p>
<p>You should get a certificate:</p>
<div class="highlight"><pre><span></span><span class="err">A credential has been received for user zonca in /tmp/x509up_u1000.</span>
</pre></div>
<p>You can check how much time is left on a certificate by running <code>grid-proxy-info</code>.</p>
<h2>Run globus-url-copy</h2>
<p>For example copy to my home on Comet:</p>
<div class="highlight"><pre><span></span><span class="err">globus-url-copy -vb -p 4 local_file.tar.gz gsiftp://oasis-dm.sdsc.edu///home/zonca/</span>
</pre></div>
<p>See the quickstart guide on the most used <code>globus-url-copy</code> options:</p>
<p><a href="http://toolkit.globus.org/toolkit/docs/latest-stable/gridftp/user/#gridftp-user-basic">http://toolkit.globus.org/toolkit/docs/latest-stable/gridftp/user/#gridftp-user-basic</a></p>
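<p>Copying in the opposite direction works the same way; a sketch (same endpoint as above, filename hypothetical):</p>

```shell
# Copy a file from my home on Comet back to the current local directory.
globus-url-copy -vb -p 4 gsiftp://oasis-dm.sdsc.edu///home/zonca/results.tar.gz "file://$PWD/"
```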
<h2>Synchronize 2 folders</h2>
<p>Only copy new files using the <code>-sync</code> and <code>-sync-level</code> options:</p>
<div class="highlight"><pre><span></span><span class="err">-sync</span>
<span class="err"> Only transfer files where the destination does not exist or differs from the source. -sync-level controls how to determine if files differ.</span>
<span class="err">-sync-level number</span>
<span class="err"> Criteria for determining if files differ when performing a sync transfer. The default sync level is 2.</span>
</pre></div>
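<p>My reading of those two options as a Python predicate (a sketch for illustration, not the actual <code>globus-url-copy</code> implementation):</p>

```python
def needs_transfer(level, src, dst):
    """Decide whether a sync transfer should copy src over dst.

    src and dst are dicts with keys: exists, size, mtime, checksum
    (a simplification of what globus-url-copy actually inspects).
    """
    if not dst["exists"]:
        return True          # -sync always copies missing destinations
    if level == 0:
        return False         # destination exists: skip
    if level == 1:
        return src["size"] != dst["size"]
    if level == 2:
        return dst["mtime"] < src["mtime"]
    if level == 3:
        return src["checksum"] != dst["checksum"]
    raise ValueError("sync level must be 0-3")
```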
<p>The available levels are:</p>
<ul>
<li>Level 0 will only transfer if the destination does not exist.</li>
<li>Level 1 will transfer if the size of the destination does not match the size of the source.</li>
<li>Level 2 will transfer if the time stamp of the destination is older than the time stamp of the source.</li>
<li>Level 3 will perform a checksum of the source and destination and transfer if the checksums do not match.</li>
</ul>Sample deployment of Jupyterhub in HPC on SDSC Comet2017-02-26T12:00:00-08:002017-02-26T12:00:00-08:00Andrea Zoncatag:zonca.github.io,2017-02-26:/2017/02/sample-deployment-jupyterhub-hpc.html<p>I have deployed an experimental Jupyterhub service (ask me privately if you would like access) installed on a <a href="http://www.sdsc.edu/services/it/cloud.html">SDSC Cloud</a> virtual machine that spawns single user Jupyter notebooks on Comet computing nodes using <a href="https://github.com/jupyterhub/batchspawner"><code>batchspawner</code></a> and then proxies the Notebook back to the user using SSH-tunneling.</p>
<h2>Functionality</h2>
<p>This kind of setup …</p><p>I have deployed an experimental Jupyterhub service (ask me privately if you would like access) installed on a <a href="http://www.sdsc.edu/services/it/cloud.html">SDSC Cloud</a> virtual machine that spawns single user Jupyter notebooks on Comet computing nodes using <a href="https://github.com/jupyterhub/batchspawner"><code>batchspawner</code></a> and then proxies the Notebook back to the user using SSH-tunneling.</p>
<h2>Functionality</h2>
<p>This kind of setup is functionally equivalent to launching a job yourself on Comet, starting <code>jupyter notebook</code> and SSH-tunneling the port to your local machine, but far more convenient. You just open your browser to the Jupyterhub instance, authenticate with your XSEDE credentials, choose the queue and job length, and wait for the Notebook job to be ready (generally a matter of minutes).</p>
<h2>Rationale</h2>
<p>Jupyter Notebooks have many use cases on HPC; they can be used for:</p>
<ul>
<li>In-situ visualization</li>
<li>Interactive data analysis when local resources are not enough, either in terms of RAM or disk space</li>
<li>Monitoring other running jobs</li>
<li>Launch <a href="https://github.com/ipython/ipyparallel">IPython Parallel</a> jobs and distribute computation to them in parallel</li>
<li>Interact with a running Spark cluster (we support Spark on Comet)</li>
</ul>
<p>More on this on my <a href="https://zonca.github.io/2015/04/jupyterhub-hpc.html">Run Jupyterhub on a Supercomputer</a> old blog post.</p>
<h2>Setup details</h2>
<p>The Jupyter team created a repository for sample HPC deployments; I added all the configuration files of my deployment there, with full details about the setup:</p>
<ul>
<li><a href="https://github.com/jupyterhub/jupyterhub-deploy-hpc/tree/master/batchspawner-xsedeoauth-sshtunnel-sdsccomet">Sample deployment in the <code>jupyterhub-deploy-hpc</code> repository</a></li>
</ul>
<p>Please send feedback opening an issue in that repository and tagging <code>@zonca</code>.</p>Customize your Python environment in Jupyterhub2017-02-24T12:00:00-08:002017-02-24T12:00:00-08:00Andrea Zoncatag:zonca.github.io,2017-02-24:/2017/02/customize-python-environment-jupyterhub.html<p>Usecase: You have access to a Jupyterhub server and you would like to install some packages but cannot use <code>pip install</code> and modify the systemwide Python installation.</p>
<h2>Check if conda is available</h2>
<p>First check if the Python installation you have access to is based on Anaconda, open a Notebook and …</p><p>Usecase: You have access to a Jupyterhub server and you would like to install some packages but cannot use <code>pip install</code> and modify the systemwide Python installation.</p>
<h2>Check if conda is available</h2>
<p>First check if the Python installation you have access to is based on Anaconda, open a Notebook and type:</p>
<div class="highlight"><pre><span></span><span class="sx">!which conda</span>
</pre></div>
<p><code>!</code> executes bash commands instead of Python; here we check whether the <code>conda</code> package manager is installed.</p>
<p>If not, the setup is a bit tedious, so see my tutorial on <a href="https://zonca.github.io/2015/10/use-own-python-in-jupyterhub.html">installing Anaconda in your home folder</a>.</p>
<h2>Create a conda environment</h2>
<p>Conda lets you create independent environments in your home folder; these environments are writable, so you can install any other package with <code>pip</code> or <code>conda install</code>.</p>
<div class="highlight"><pre><span></span><span class="sx">!conda create -n myownenv --clone root</span>
</pre></div>
<p>You can declare all the packages you want to install, but a good starting point is simply to clone the <code>root</code> environment: this links all the global packages into your home folder, and you can then customize the environment further.</p>
<h2>Create a Jupyter Notebook kernel to launch this new environment</h2>
<p>We need to notify Jupyter of this new Python environment by creating a Kernel, from a Notebook launch:</p>
<div class="highlight"><pre><span></span><span class="sx">!source activate myownenv; ipython kernel install --user --name myownenv</span>
</pre></div>
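<p>Under the hood, <code>ipython kernel install</code> writes a <code>kernel.json</code> spec that tells Jupyter how to start the environment's Python. A sketch of building one by hand (the environment path is a hypothetical example, and for illustration the file is written to a temporary directory rather than <code>~/.local/share/jupyter/kernels</code>):</p>

```python
import json
import os
import tempfile

# Hypothetical location of the cloned conda environment
env_python = os.path.expanduser("~/.conda/envs/myownenv/bin/python")

# Fields follow the Jupyter kernelspec format
spec = {
    "argv": [env_python, "-m", "ipykernel_launcher", "-f", "{connection_file}"],
    "display_name": "myownenv",
    "language": "python",
}

# ipython kernel install would place this under the user kernels directory;
# here we write it to a throwaway folder just to show the file contents
kernel_dir = tempfile.mkdtemp()
with open(os.path.join(kernel_dir, "kernel.json"), "w") as f:
    json.dump(spec, f, indent=1)
```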
<h2>Launch a Notebook</h2>
<p>Go back to the Jupyterhub dashboard, reload the page, now you should have another option in the <code>New</code> menu that says <code>myownenv</code>.</p>
<p>In order to use your new kernel with an existing notebook, click on the notebook file in the dashboard; it will launch with the default kernel, which you can then change from the top menu <code>Kernel</code> > <code>Change kernel</code>.</p>
<h2>Install new packages</h2>
<p>Inside a Notebook using the <code>myownenv</code> environment you can install other packages running:</p>
<div class="highlight"><pre><span></span><span class="sx">!conda install newpackagename</span>
</pre></div>
<p>or:</p>
<div class="highlight"><pre><span></span><span class="sx">!pip install newpackagename</span>
</pre></div>Automated deployment of Jupyterhub with Ansible2017-02-03T18:00:00-08:002017-02-03T18:00:00-08:00Andrea Zoncatag:zonca.github.io,2017-02-03:/2017/02/automated-deployment-jupyterhub-ansible.html<p>Last year I wrote some tutorials on simple deployments of Jupyterhub on Ubuntu 16.04 on the OpenStack deployment <a href="http://www.sdsc.edu/services/it/cloud.html">SDSC Cloud</a>, even if most of the steps would also be suitable on other resources like Amazon EC2.</p>
<p>In more detail:</p>
<ul>
<li><a href="https://zonca.github.io/2016/04/jupyterhub-sdsc-cloud.html">Manually installing Jupyterhub on a single Virtual Machine with users …</a></li></ul><p>Last year I wrote some tutorials on simple deployments of Jupyterhub on Ubuntu 16.04 on the OpenStack deployment <a href="http://www.sdsc.edu/services/it/cloud.html">SDSC Cloud</a>, even if most of the steps would also be suitable on other resources like Amazon EC2.</p>
<p>In more detail:</p>
<ul>
<li><a href="https://zonca.github.io/2016/04/jupyterhub-sdsc-cloud.html">Manually installing Jupyterhub on a single Virtual Machine with users running inside Docker containers</a></li>
<li><a href="https://zonca.github.io/2016/04/jupyterhub-image-sdsc-cloud.html">Quick deployment of the above using a pre-built image</a></li>
<li><a href="https://zonca.github.io/2016/05/jupyterhub-docker-swarm.html">Jupyterhub distributing user containers on other nodes using Docker Swarm</a></li>
</ul>
<p>The Jupyter team has released an automated script to deploy Jupyterhub on a single server, see <a href="http://jupyterhub-deploy-teaching.readthedocs.io">Jupyterhub-deploy-teaching</a>.</p>
<p>In this tutorial we will use this script to deploy Jupyterhub to SDSC Cloud using:</p>
<ul>
<li>NGINX handling HTTPS with Letsencrypt certificate</li>
<li>Github authentication</li>
<li>Local or Docker user notebooks</li>
<li>Grading with <code>nbgrader</code></li>
<li>Memory limit for Docker containers</li>
</ul>
<h2>Setup a Virtual Machine to run Jupyterhub</h2>
<p>First create an Ubuntu 16.04 Virtual Machine; a default server image works fine.</p>
<p>In case you are deploying on SDSC Cloud, follow the steps in "Create a Virtual Machine in OpenStack" on my first tutorial at <a href="https://zonca.github.io/2016/04/jupyterhub-sdsc-cloud.html">https://zonca.github.io/2016/04/jupyterhub-sdsc-cloud.html</a>.</p>
<p>You will also need a DNS entry pointing to the server to create an SSL certificate with Let's Encrypt. Either ask your institution to provide a DNS A record, e.g. <code>test-jupyterhub.ucsd.edu</code>, that points to the public IP of the server,
or use the entry SDSC Cloud already provides in the form <code>xxx-xxx-xxx-xxx.compute.cloud.sdsc.edu</code>.</p>
<p>If you plan on using <code>nbgrader</code>, you need to create the home folder for the instructor beforehand, so SSH into the server and create a user with your Github username, e.g. I had to execute <code>sudo adduser zonca</code>.</p>
<h2>Setup your local machine to run the automation scripts</h2>
<p>Automation of the server setup is provided by the <a href="http://ansible.com">Ansible</a> software tool: it lets you describe a server configuration in great detail (a "playbook"), then connects via SSH to a Virtual Machine and runs Python to install and set up all the required software.</p>
<p>On your local machine, install <code>Ansible</code> (at least version 2.1), see the <a href="http://docs.ansible.com/ansible/intro_installation.html#getting-ansible">Ansible docs</a>; for Ubuntu just add the <a href="https://launchpad.net/~ansible/+archive/ubuntu/ansible">Ansible PPA repository</a>.
I tested this with Ansible version 2.2.1.0.</p>
<p>Then you need to configure passwordless SSH connection to your Virtual Machine. Download your SSH key from the OpenStack dashboard, copy it to your <code>~/.ssh</code> folder and then add an entry to <code>.ssh/config</code> for the server:</p>
<div class="highlight"><pre><span></span><span class="err">Host xxx-xxx-xxx-xxx.compute.cloud.sdsc.edu</span>
<span class="err"> HostName xxx-xxx-xxx-xxx.compute.cloud.sdsc.edu</span>
<span class="err"> User ubuntu</span>
<span class="err"> IdentityFile "~/.ssh/sdsccloud.key"</span>
</pre></div>
<p>At this point you should be able to SSH into the machine without typing any password with <code>ssh xxx-xxx-xxx-xxx.compute.cloud.sdsc.edu</code>.</p>
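<p>If you manage several such Virtual Machines, it can be handy to generate the stanza programmatically; a small sketch (the hostname and key filename are placeholders):</p>

```python
STANZA = """Host {host}
    HostName {host}
    User {user}
    IdentityFile "{key}"
"""

def ssh_config_entry(host, user="ubuntu", key="~/.ssh/sdsccloud.key"):
    """Render one ~/.ssh/config stanza for an OpenStack VM."""
    return STANZA.format(host=host, user=user, key=key)

print(ssh_config_entry("xxx-xxx-xxx-xxx.compute.cloud.sdsc.edu"))
```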
<h2>Configure and run the Ansible script</h2>
<p>Follow the <a href="http://jupyterhub-deploy-teaching.readthedocs.io/en/latest/installation.html">Jupyterhub-deploy-teaching documentation</a> to checkout the script, configure and run it.</p>
<p>The only modification you need to make if you are on SDSC Cloud is that the remote user is <code>ubuntu</code> and not <code>root</code>: edit <code>ansible.cfg</code> in the root of the repository and
replace <code>remote_user=root</code> with <code>remote_user=ubuntu</code>.</p>
<p>As an example, see the <a href="https://gist.github.com/zonca/fd2400a2069b5769f32b1c4b57eb97dc">configuration I used</a>, just:</p>
<ul>
<li>copy it into <code>host_vars</code></li>
<li>rename it to your public DNS record</li>
<li>fill in <code>proxy_auth_token</code>, Github OAuth credentials for authentication</li>
<li>replace <code>zonca</code> with your Github username everywhere</li>
</ul>
<p>The exact version of the <code>jupyterhub-deploy-teaching</code> code I used for testing is <a href="https://github.com/zonca/jupyterhub-deploy-teaching/releases/tag/sdsc_cloud_jan_17">on the <code>sdsc_cloud_jan_17</code> tag on Github</a>.</p>
<h2>Test the deployment</h2>
<p>Connect to <a href="https://xxx-xxx-xxx-xxx.compute.cloud.sdsc.edu">https://xxx-xxx-xxx-xxx.compute.cloud.sdsc.edu</a> on your browser, you should be redirected to Github for authentication and then access a Jupyter Notebook instance with the Python 3, R and bash kernels running locally on the machine.</p>
<h2>Optional: Docker</h2>
<p>In order to provide isolation and resource limits to the users, it is useful to run single user Jupyter Notebooks inside Docker containers.</p>
<p>You will need to SSH into the Virtual Machine and follow the next steps.</p>
<h3>Install Docker</h3>
<p>First of all we need to install and configure Docker on the machine, see:</p>
<ul>
<li><a href="https://docs.docker.com/engine/installation/linux/ubuntu/">https://docs.docker.com/engine/installation/linux/ubuntu/</a></li>
<li><a href="https://docs.docker.com/engine/installation/linux/linux-postinstall/">https://docs.docker.com/engine/installation/linux/linux-postinstall/</a></li>
</ul>
<h3>Install dockerspawner</h3>
<p>Then install the Jupyterhub plugin <code>dockerspawner</code>, which handles launching single user Notebooks inside Docker containers; we install from master instead of PyPI to avoid an error when setting the memory limit.</p>
<div class="highlight"><pre><span></span><span class="err">pip install git+https://github.com/jupyterhub/dockerspawner</span>
</pre></div>
<h3>Setup the Docker container to run user Notebooks</h3>
<p>We can first pull the standard <code>systemuser</code> image: this Docker container mounts each user's home folder inside the container, so data persists even if the container gets deleted.</p>
<div class="highlight"><pre><span></span><span class="err">docker pull jupyterhub/systemuser</span>
</pre></div>
<p>If you do not need <a href="http://nbgrader.readthedocs.io"><code>nbgrader</code></a> this image is enough; otherwise we have to build our own image. First check out my Github repository in the home folder of the <code>ubuntu</code> user on the server with:</p>
<div class="highlight"><pre><span></span><span class="err">git clone https://github.com/zonca/systemuser-nbgrader</span>
</pre></div>
<p>then edit the <code>nbgrader_config.py</code> file to set the correct <code>course_id</code>, and build the container image running inside the <code>systemuser-nbgrader</code> folder:</p>
<div class="highlight"><pre><span></span><span class="err">docker build -t systemuser-nbgrader .</span>
</pre></div>
<h3>Configure Jupyterhub to use dockerspawner</h3>
<p>Then add some configuration for dockerspawner to <code>/etc/jupyterhub/jupyterhub_config.py</code>:</p>
<div class="highlight"><pre><span></span><span class="n">c</span><span class="o">.</span><span class="n">JupyterHub</span><span class="o">.</span><span class="n">spawner_class</span> <span class="o">=</span> <span class="s1">'dockerspawner.SystemUserSpawner'</span>
<span class="n">c</span><span class="o">.</span><span class="n">DockerSpawner</span><span class="o">.</span><span class="n">container_image</span> <span class="o">=</span> <span class="s2">"systemuser-nbgrader"</span> <span class="c1"># delete this line if you just need `jupyterhub/systemuser`</span>
<span class="n">c</span><span class="o">.</span><span class="n">Spawner</span><span class="o">.</span><span class="n">mem_limit</span> <span class="o">=</span> <span class="s1">'500M'</span> <span class="c1"># or 1G for GB, probably 300M is minimum required just to run simple calculations</span>
<span class="n">c</span><span class="o">.</span><span class="n">DockerSpawner</span><span class="o">.</span><span class="n">volumes</span> <span class="o">=</span> <span class="p">{</span><span class="s2">"/srv/nbgrader/exchange"</span><span class="p">:</span><span class="s2">"/tmp/exchange"</span><span class="p">}</span> <span class="c1"># this is necessary for nbgrader to transfer homework back and forth between students and instructor</span>
<span class="n">c</span><span class="o">.</span><span class="n">DockerSpawner</span><span class="o">.</span><span class="n">remove_containers</span> <span class="o">=</span> <span class="bp">True</span>
<span class="c1"># The docker instances need access to the Hub, so the default loopback port doesn't work:</span>
<span class="kn">from</span> <span class="nn">IPython.utils.localinterfaces</span> <span class="kn">import</span> <span class="n">public_ips</span>
<span class="n">c</span><span class="o">.</span><span class="n">JupyterHub</span><span class="o">.</span><span class="n">hub_ip</span> <span class="o">=</span> <span class="n">public_ips</span><span class="p">()[</span><span class="mi">0</span><span class="p">]</span>
</pre></div>
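<p>The <code>mem_limit</code> string accepts suffixes like <code>M</code> and <code>G</code>; a quick sketch of how such values map to bytes (my own helper for illustration — JupyterHub does its own parsing internally):</p>

```python
UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}

def mem_limit_bytes(limit):
    """Convert a limit such as '500M' or '1G' into a byte count."""
    suffix = limit[-1].upper()
    if suffix in UNITS:
        return int(limit[:-1]) * UNITS[suffix]
    return int(limit)  # plain number of bytes

print(mem_limit_bytes("500M"))  # 524288000
```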
<h3>Test the deployment with Docker</h3>
<p>Connect to <a href="https://xxx-xxx-xxx-xxx.compute.cloud.sdsc.edu">https://xxx-xxx-xxx-xxx.compute.cloud.sdsc.edu</a> in your browser: you should be redirected to Github for authentication and then access a Jupyter Notebook instance with the Python 2 or Python 3 kernels. Open a Notebook and run <code>!hostname</code> in the first cell; you should get a Docker hash instead of the machine name, confirming you are inside a container.</p>
<p>SSH into the machine, run <code>docker ps</code> to find the hash of a running container and then <code>docker stats HASH</code> to check memory usage and the current limit.</p>
<p>Check that you can connect to the <code>nbgrader</code> <code>formgrade</code> service, which allows you to manually grade assignments, at <a href="https://xxx-xxx-xxx-xxx.compute.cloud.sdsc.edu/services/formgrade-COURSEID">https://xxx-xxx-xxx-xxx.compute.cloud.sdsc.edu/services/formgrade-COURSEID</a>; replace <code>COURSEID</code> with the course identifier you set up in the Ansible script.</p>
<h3>Pre-built image</h3>
<p>I also have a saved Virtual Machine snapshot on SDSC Cloud named <code>jupyterhub_ansible_nbgrader_coleman</code></p>How to publish your research software to Github2017-02-01T18:00:00-08:002017-02-01T18:00:00-08:00Andrea Zoncatag:zonca.github.io,2017-02-01:/2017/02/publish-research-software-github.html<ul>
<li>Do you want to make your research software available publicly on Github?</li>
<li>Has your reviewer asked to publish the code described in your paper?</li>
<li>Would you like to collaborate on your research software with other people, either local or remote?</li>
</ul>
<p>Nowadays many journals require that the software used to produce …</p><ul>
<li>Do you want to make your research software available publicly on Github?</li>
<li>Has your reviewer asked to publish the code described in your paper?</li>
<li>Would you like to collaborate on your research software with other people, either local or remote?</li>
</ul>
<p>Nowadays many journals require that the software used to produce results described in a scientific paper be made available publicly
for other peers to be able to reproduce the results or even just explore the analysis more in detail.</p>
<p>The most popular platform is <a href="http://github.com">Github</a>: it lets you create a homepage for your software, keeps track of every future code change, and makes it easy for people to report issues or contribute patches.</p>
<p>I'll assume familiarity with working from the command line.</p>
<h2>Prepare your software for publication</h2>
<p>First make sure your code is all inside a single root folder (with any number of subfolders), then clean up any build artifacts, data or executables present in your tree of folders.
Ideally you should only have the source code and documentation.
If you have small datasets (<10MB total) it is convenient to store them inside the repository, otherwise better host them on dedicated free services like <a href="http://figshare.com">Figshare</a>.</p>
<p>You should cleanup the build and installation process for your code, if any, and ideally you should structure your code in a standard format to ease adoption, for example using a project template generated by <a href="https://github.com/audreyr/cookiecutter">Cookiecutter</a>.</p>
<p>You should create a <code>README.md</code> file in the root folder of your project, this is very important because it will be transformed into HTML and displayed in the homepage of your software project. Here you should use the Markdown formatting language, see <a href="https://help.github.com/articles/basic-writing-and-formatting-syntax/">a Markdown cheatsheet on Github</a>, to explain:</p>
<ul>
<li>short description of your software</li>
<li>build/usage requirements for your process</li>
<li>installation instructions (and point to another file <code>INSTALL.md</code> for more details)</li>
<li>quickstart section</li>
<li>link to usage examples</li>
<li>link to your paper about the project</li>
<li>list of developers</li>
<li>optionally: how users can get support (i.e. a mailing list)</li>
</ul>
<p>Finally you should choose a license: otherwise, even if the project is public, nobody is allowed to modify and re-use it legally.
Create a <code>LICENSE</code> file in the root of your folder tree and paste in the content of the license. I recommend the MIT license, which is very permissive and simple: <a href="https://choosealicense.com/licenses/mit/">https://choosealicense.com/licenses/mit/</a></p>
<h2>Create an account on Github</h2>
<p>Second step is to create an account on Github: this just requires a username, email and password, choose your username carefully because it will become the
root internet address of all your software projects, i.e. <code>https://github.com/username/software-name</code>.</p>
<p>A Github account is free and allows any number of public software projects. Private repositories are generally available only on paid accounts; however,
anyone with a <code>.edu</code> email address can get unlimited private repositories by applying for the <a href="https://education.github.com/discount_requests/new">academic discount</a>.</p>
<h2>Create a repository on Github</h2>
<p>Github hosts software inside a version control system, <code>git</code>, which stores the complete history of all the incremental changes over time and lets you easily
recover previous versions of the software. Each software project is stored in a repository, which includes both the current version and all previous versions of the software. <code>git</code> is a more modern alternative to <code>subversion</code>.</p>
<p>First you need to create a repository on Github: authenticate on Github.com and click on the "New Repository" button, choose a name for your software project and leave all other options as default.</p>
<h2>Publish your software on Github</h2>
<p>Make sure that the <code>git</code> command line tool is available on the machine where your code is stored, install it from your package manager or see <a href="https://git-scm.com/downloads">installation instructions on the git website</a>.</p>
<p>Finally you can follow the instructions on the repository homepage <code>https://github.com/username/software-name</code> in the section <strong>..or create a new repository on the command line</strong>,
make sure you are in the root folder of your repository and follow these steps:</p>
<p>Turn the current folder into a <code>git</code> repository:</p>
<div class="highlight"><pre><span></span><span class="err">git init</span>
</pre></div>
<p>Add recursively all files and folders, otherwise specify filenames or wildcard to pick only some, <strong>be careful not to accidentally upload sensitive content like passwords</strong>:</p>
<div class="highlight"><pre><span></span><span class="err">git add *</span>
</pre></div>
<p>Store into the repository a first version of the software:</p>
<div class="highlight"><pre><span></span><span class="err">git commit -m "first version of the software"</span>
</pre></div>
<p>Tell <code>git</code> the address of the remote repository on Github (make sure to use your username and the name you chose for your software project):</p>
<div class="highlight"><pre><span></span><span class="err">git remote add origin https://github.com/username/software-name</span>
</pre></div>
<p>Upload the software to Github:</p>
<div class="highlight"><pre><span></span><span class="err">git push -u origin master</span>
</pre></div>
<p>You can then check in your browser that all the code you meant to publish is available on Github.</p>
<h2>Update your software</h2>
<p>Whenever in the future you need to make modifications to the software:</p>
<ul>
<li>edit the files</li>
<li><code>git add filename1 filename2</code> to prepare them for commit</li>
<li><code>git commit -m "bugfix"</code> creates a version in the history with an explanatory commit message</li>
<li><code>git push</code> to publish to Github</li>
</ul>
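<p>The same edit/add/commit cycle can be scripted; a sketch using Python's <code>subprocess</code> in a throwaway repository so it is safe to run anywhere (assumes <code>git</code> is on your PATH; a real workflow would end with <code>git push</code>):</p>

```python
import os
import subprocess
import tempfile

def run(*cmd, cwd):
    """Run a command in cwd, raising if it fails."""
    return subprocess.run(cmd, cwd=cwd, check=True,
                          capture_output=True, text=True)

repo = tempfile.mkdtemp()
run("git", "init", cwd=repo)
# edit the files
with open(os.path.join(repo, "analysis.py"), "w") as f:
    f.write("print('hello')\n")
# prepare them for commit
run("git", "add", "analysis.py", cwd=repo)
# the -c flags avoid depending on a globally configured git identity
run("git", "-c", "user.email=you@example.com", "-c", "user.name=You",
    "commit", "-m", "bugfix", cwd=repo)
log = run("git", "log", "--oneline", cwd=repo).stdout
```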
<p>For more details on <code>git</code>, check the <a href="https://swcarpentry.github.io/git-novice/">Software Carpentry lessons</a>.</p>Run Ubuntu in HPC with Singularity2017-01-13T12:00:00-08:002017-01-13T12:00:00-08:00Andrea Zoncatag:zonca.github.io,2017-01-13:/2017/01/singularity-hpc-comet.html<ul>
<li>Ever wanted to <code>sudo apt install</code> packages on a Supercomputer?</li>
<li>Ever wanted to freeze your software environment and reproduce a calculation after some time?</li>
<li>Ever wanted to dump your software environment to a file and move it to another Supercomputer? or wanted the same software on your laptop and on …</li></ul><ul>
<li>Ever wanted to <code>sudo apt install</code> packages on a Supercomputer?</li>
<li>Ever wanted to freeze your software environment and reproduce a calculation after some time?</li>
<li>Ever wanted to dump your software environment to a file and move it to another Supercomputer? or wanted the same software on your laptop and on a computing node?</li>
</ul>
<p>If your answer to any of those question is yes, read on! Otherwise, well, still read on, it's awesome!</p>
<h2>Singularity</h2>
<p><a href="http://singularity.lbl.gov">Singularity</a> is a software project by Lawrence Berkeley Labs to provide a safe container technology for High Performance Computing,
and it has been available for some time on my favorite Supercomputer, i.e. Comet at the San Diego Supercomputer Center.</p>
<p>You can read more details on their website; in summary, you choose your own Operating System (any GNU/Linux distribution), describe its configuration in a standard format or even
import an existing <code>Dockerfile</code> (from the popular Docker container technology) and Singularity is able to build an image contained in a single file.
This file can then be executed on any Linux machine with Singularity installed (even on a Comet computing node), so you can run Ubuntu 16.10 or Red Hat 5 or any other flavor, your choice!
It doesn't need a daemon running like Docker does; you can just execute a command inside the container by running:</p>
<div class="highlight"><pre><span></span><span class="err">singularity exec /path/to/your/image.img your_executable</span>
</pre></div>
<p>And your executable is run within the OS of the container.</p>
<p>The container technology is just sandboxing the environment, not executing a complete OS inside the host OS, so the loss of performance is minimal.</p>
<p>In summary, referring to the questions above:</p>
<ul>
<li>This allows you to <code>sudo apt install</code> any package inside this environment when it is on your laptop, and then copy it to any Supercomputer and run your software inside that OS.</li>
<li>You can store this image to help reproduce your scientific results anytime in the future</li>
<li>You can develop your software inside a Singularity container and never have to worry about environment issues when you are ready for production runs on HPC or moving across different Supercomputers</li>
</ul>
<h2>Build a Singularity image for SDSC Comet with MPI support</h2>
<p>One of the trickiest things for such technology in HPC is support for MPI, the key stack for high speed network communication. I have prepared a tutorial on Github on how to build either a CentOS 7 or a Ubuntu 16.04 Singularity container for Comet that lets you use the <code>mpirun</code> command provided by the host OS on Comet while executing code that supports MPI within the container.</p>
<ul>
<li><a href="https://github.com/zonca/singularity-comet">https://github.com/zonca/singularity-comet</a></li>
</ul>
<h2>More complicated setup for Julia with MPI support</h2>
<p>For a project that needed a setup with Julia with MPI support I built a more complicated container, see:</p>
<ul>
<li><a href="https://github.com/zonca/singularity-comet/tree/master/debian_julia">https://github.com/zonca/singularity-comet/tree/master/debian_julia</a></li>
</ul>
<h2>Prebuilt containers</h2>
<p>I also made my containers available on Comet; they are located in my scratch space:</p>
<p><code>/oasis/scratch/comet/zonca/temp_project</code></p>
<p>and are named <code>Centos7.img</code>, <code>Ubuntu.img</code> and <code>julia.img</code>.</p>
<p>You can also copy those images to your local machine and customize them more.</p>
<h2>Trial accounts on Comet</h2>
<p>If you don't have an account on Comet yet, you can request a trial allocation:</p>
<p><a href="https://www.xsede.org/web/xup/allocations-overview#types-trial">https://www.xsede.org/web/xup/allocations-overview#types-trial</a></p>
<p>Enjoy!</p>Jupyterhub Docker Spawner with GPU support2016-10-12T12:00:00-07:002016-10-12T12:00:00-07:00Andrea Zoncatag:zonca.github.io,2016-10-12:/2016/10/dockerspawner-cuda.html<p><a href="https://github.com/jupyterhub/dockerspawner">Docker Spawner</a> allows users of Jupyterhub to run Jupyter Notebook inside isolated Docker Containers.
Access to the host NVIDIA GPU was not allowed until NVIDIA released the <a href="https://github.com/NVIDIA/nvidia-docker">NVIDIA-docker</a> plugin.</p>
<h2>Build the Docker image</h2>
<p>In order to make Jupyerhub work with NVIDIA-docker we need to build a Jupyterhub docker image for …</p><p><a href="https://github.com/jupyterhub/dockerspawner">Docker Spawner</a> allows users of Jupyterhub to run Jupyter Notebook inside isolated Docker Containers.
Access to the host NVIDIA GPU was not allowed until NVIDIA released the <a href="https://github.com/NVIDIA/nvidia-docker">NVIDIA-docker</a> plugin.</p>
<h2>Build the Docker image</h2>
<p>In order to make Jupyterhub work with NVIDIA-docker we need to build a Jupyterhub docker image for <code>dockerspawner</code> that combines either the <code>dockerspawner</code> <code>singleuser</code> or <code>systemuser</code> image with the <code>nvidia-docker</code> image.</p>
<p>The Jupyter <code>systemuser</code> images are built in several steps, so let's use them as a starting point; conveniently, both images start from Ubuntu 14.04.</p>
<ul>
<li>Download the <code>nvidia-docker</code> repository</li>
<li>In <code>ubuntu-14.04/cuda/8.0/runtime/Dockerfile</code>, replace <code>FROM ubuntu:14.04</code> with <code>FROM jupyterhub/systemuser</code></li>
<li>Build this image <code>sudo docker build -t systemuser-cuda-runtime runtime</code></li>
<li>In <code>ubuntu-14.04/cuda/8.0/devel/Dockerfile</code>, replace <code>FROM cuda:8.0-runtime</code> with <code>FROM systemuser-cuda-runtime</code></li>
<li>Build this image <code>sudo docker build -t systemuser-cuda-devel devel</code></li>
</ul>
<p>Now we have 2 images, either just CUDA 8.0 runtime or also the compiler <code>nvcc</code> and other development tools.</p>
<p>Make sure the image itself runs from the command line on the host:</p>
<div class="highlight"><pre><span></span><span class="err">sudo nvidia-docker run --rm systemuser-cuda-devel nvidia-smi</span>
</pre></div>
<h2>Configure Jupyterhub</h2>
<p>In <code>jupyterhub_config.py</code>, first of all set the right image:</p>
<div class="highlight"><pre><span></span><span class="err">c.DockerSpawner.container_image = "systemuser-cuda-devel"</span>
</pre></div>
<p>However this is not enough: <code>nvidia-docker</code> images need special flags to work properly and mount the host GPU into the containers, which is usually done by calling <code>nvidia-docker</code> instead of <code>docker</code> from the command line.
In <code>dockerspawner</code>, however, we are using the docker library directly, so we need to configure the equivalent settings there.</p>
<p>First of all, we can get the correct flags by calling from the host machine:</p>
<div class="highlight"><pre><span></span><span class="err">curl -s localhost:3476/docker/cli</span>
</pre></div>
<p>The result for my machine is:</p>
<div class="highlight"><pre><span></span><span class="err">--volume-driver=nvidia-docker --volume=nvidia_driver_361.93.02:/usr/local/nvidia:ro --device=/dev/nvidiactl --device=/dev/nvidia-uvm --device=/dev/nvidia-uvm-tools --device=/dev/nvidia0 --device=/dev/nvidia1</span>
</pre></div>
<p>Now we can configure <code>dockerspawner</code> using those values, in my case:</p>
<div class="highlight"><pre><span></span><span class="err">c.DockerSpawner.read_only_volumes = {"nvidia_driver_361.93.02":"/usr/local/nvidia"}</span>
<span class="err">c.DockerSpawner.extra_create_kwargs = {"volume_driver":"nvidia-docker"}</span>
<span class="err">c.DockerSpawner.extra_host_config = { "devices":["/dev/nvidiactl","/dev/nvidia-uvm","/dev/nvidia-uvm-tools","/dev/nvidia0","/dev/nvidia1"] }</span>
</pre></div>
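<p>Rather than copying the flags by hand, you can derive the <code>dockerspawner</code> settings from the <code>curl</code> output programmatically. This is a hypothetical sketch (the helper name is my own, and it only handles the flag types shown above):</p>

```python
import shlex

def parse_nvidia_cli_flags(flags):
    """Split the output of `curl -s localhost:3476/docker/cli` into the
    pieces DockerSpawner needs: volume driver, read-only volumes, devices."""
    driver, volumes, devices = None, {}, []
    for token in shlex.split(flags):
        if token.startswith("--volume-driver="):
            driver = token.split("=", 1)[1]
        elif token.startswith("--volume="):
            # format is source:destination[:mode]
            src, dst = token.split("=", 1)[1].split(":")[:2]
            volumes[src] = dst
        elif token.startswith("--device="):
            devices.append(token.split("=", 1)[1])
    return driver, volumes, devices
```

<p>The three return values map directly to <code>extra_create_kwargs["volume_driver"]</code>, <code>read_only_volumes</code> and <code>extra_host_config["devices"]</code> in the configuration above.</p>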
<h2>Test it</h2>
<p>Login with Jupyterhub, try this notebook: <a href="http://nbviewer.jupyter.org/gist/zonca/a14af3b92ab472580f7b97b721a2251e">http://nbviewer.jupyter.org/gist/zonca/a14af3b92ab472580f7b97b721a2251e</a></p>
<h2>Current issues</h2>
<ul>
<li>Environment on the Jupyterhub kernel is missing <code>LD_LIBRARY_PATH</code>, running directly on the image instead is fine</li>
<li>I'd like to test using <code>numba</code> in Jupyterhub, but that requires <code>cudatoolkit</code> 8.0 which is not available yet in Anaconda</li>
</ul>Jupyterhub deployment on multiple nodes with Docker Swarm2016-05-24T12:00:00-07:002016-05-24T12:00:00-07:00Andrea Zoncatag:zonca.github.io,2016-05-24:/2016/05/jupyterhub-docker-swarm.html<p>This post is part of a series on deploying Jupyterhub on OpenStack tailored at workshops, in the previous posts I showed:</p>
<ul>
<li><a href="http://zonca.github.io/2016/04/jupyterhub-sdsc-cloud.html">How to deploy a Jupyterhub on a single server with Docker and Python/R/Julia support</a></li>
<li><a href="http://zonca.github.io/2016/04/jupyterhub-image-sdsc-cloud.html">How to deploy the previous server from a pre-built image and customize it …</a></li></ul><p>This post is part of a series on deploying Jupyterhub on OpenStack tailored at workshops, in the previous posts I showed:</p>
<ul>
<li><a href="http://zonca.github.io/2016/04/jupyterhub-sdsc-cloud.html">How to deploy a Jupyterhub on a single server with Docker and Python/R/Julia support</a></li>
<li><a href="http://zonca.github.io/2016/04/jupyterhub-image-sdsc-cloud.html">How to deploy the previous server from a pre-built image and customize it</a></li>
</ul>
<p>The limitation of a single server setup is that it cannot scale beyond the resources available on that server, especially memory. Therefore, for a workshop that requires loading large amounts of data or that has many students, a multi-server setup is recommended.</p>
<p>Fortunately Docker already provides that flexibility thanks to <a href="https://docs.docker.com/swarm/overview/">Docker Swarm</a>. Docker Swarm exposes a Docker interface that behaves like a normal single-server instance but launches containers on a pool of servers, so only minimal changes are needed on the Jupyterhub server.</p>
<p>Jupyterhub will interface with the Docker Swarm service running locally, Docker Swarm will take care of launching containers across the other nodes. Each container will launch a Jupyter Notebook server for a single user, then Jupyterhub will proxy the container port to the users. Users won't connect directly to the nodes in the Docker Swarm pool. </p>
<h2>Setup the Jupyterhub server</h2>
<p>Let's start from the public image already available, see just the first section "Create a Virtual Machine in OpenStack with the pre-built image" in <a href="http://zonca.github.io/2016/04/jupyterhub-image-sdsc-cloud.html">http://zonca.github.io/2016/04/jupyterhub-image-sdsc-cloud.html</a> for instructions on how to get the Jupyterhub single server running.</p>
<h3>Setup Docker Swarm</h3>
<p>First of all we need to have Docker accessible remotely so we need to configure it to listen on a TCP port, edit <code>/etc/init/docker.conf</code> and replace <code>DOCKER_OPTS=</code> in the <code>start</code> section with:</p>
<div class="highlight"><pre><span></span><span class="err">DOCKER_OPTS="-H tcp://0.0.0.0:2375 -H unix:///var/run/docker.sock"</span>
</pre></div>
<p>Port 2375 is not open in the OpenStack security group configuration, so this is not a security issue.</p>
<p>Then we need to run two Swarm services in Docker containers. The first is Consul, a distributed key-value store listening on port 8500 that Swarm needs to keep track of all the available nodes:</p>
<div class="highlight"><pre><span></span><span class="err">docker run --restart=always -d -p 8500:8500 --name=consul progrium/consul -server -bootstrap</span>
</pre></div>
<p>The second is the manager, which provides the interface to Docker Swarm:</p>
<div class="highlight"><pre><span></span><span class="err">HUB_LOCAL_IP=$(ip route get 8.8.8.8 | awk 'NR==1 {print $NF}')</span>
<span class="err">docker run --restart=always -d -p 4000:4000 swarm manage -H :4000 --replication --advertise $HUB_LOCAL_IP:4000 consul://$HUB_LOCAL_IP:8500</span>
</pre></div>
<p>This sets <code>HUB_LOCAL_IP</code> to the internal ip of the instance, then starts the Manager container.</p>
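<p>Note that taking the last field of the first line of <code>ip route get</code> is fragile: newer versions of iproute2 append extra fields (such as <code>uid</code>) after the source address. A more robust approach, sketched here in Python (my own helper, not part of the original setup), is to take the token right after <code>src</code>:</p>

```python
def local_ip_from_route(route_output):
    """Extract the local source IP from `ip route get 8.8.8.8` output
    by taking the field that follows the "src" keyword."""
    fields = route_output.split()
    return fields[fields.index("src") + 1]
```

<p>If you stick with the shell one-liner, just double-check on your distribution that the last field really is the IP address.</p>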
<p>We are running both with automatic restarting, so that they are launched again in case of failure or after reboot.</p>
<p>You can check if the containers are running with:</p>
<div class="highlight"><pre><span></span><span class="err">docker ps -a</span>
</pre></div>
<p>and then you can check if connection works with Docker Swarm on port 4000:</p>
<div class="highlight"><pre><span></span><span class="err">docker -H :4000 ps -a</span>
</pre></div>
<p>Check the Docker documentation for a more robust setup with multiple Consul services and a backup Manager.</p>
<h3>Setup Jupyterhub</h3>
<p>Following the work by Jess Hamrick for the <a href="https://github.com/compmodels/jupyterhub">compmodels Jupyterhub deployment</a>, we can get <code>jupyterhub_config.py</code> from <a href="https://gist.github.com/zonca/83d222df8d0b9eaebd02b83faa676753">https://gist.github.com/zonca/83d222df8d0b9eaebd02b83faa676753</a> and copy it into the home of the ubuntu user.</p>
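<p>The gist linked above is authoritative; as a rough sketch, the Swarm-relevant part of the configuration is pointing <code>dockerspawner</code> at the Swarm manager on port 4000 instead of the local Docker socket. The exact settings below are my assumptions, check the gist:</p>

```python
# sketch of the Swarm-relevant fragments of jupyterhub_config.py;
# `c` is the configuration object JupyterHub provides at startup
import os

# dockerspawner builds its client from the environment (docker-py's
# kwargs_from_env), so point DOCKER_HOST at the Swarm manager
os.environ["DOCKER_HOST"] = "tcp://127.0.0.1:4000"

c.JupyterHub.spawner_class = "dockerspawner.SystemUserSpawner"
c.DockerSpawner.container_image = "jupyter/systemuser"
# containers run on remote nodes, so the hub must listen on an address
# reachable from them, not on the loopback interface
c.JupyterHub.hub_ip = "10.XX.XX.XX"  # internal IP of the hub instance
```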
<h3>Share users home via NFS</h3>
<p>We now have a distributed system and we need a central location to store the users' home folders, so that even if they happen to get containers on different servers, they can still access their files.</p>
<p>Install NFS with the package manager:</p>
<div class="highlight"><pre><span></span><span class="err">sudo apt-get install nfs-kernel-server</span>
</pre></div>
<p>edit <code>/etc/exports</code>, add:</p>
<div class="highlight"><pre><span></span><span class="err">/home *(rw,sync,no_root_squash)</span>
</pre></div>
<p>The NFS ports are not open in the OpenStack security group configuration, so this export is not reachable from outside.</p>
<h2>Setup networking</h2>
<p>Before preparing a node, create a new security group under Compute -> Access & Security and name it <code>swarmsecgroup</code>.</p>
<p>We need to allow open traffic between <code>swarmsecgroup</code> and the security group of the Jupyterhub instance, <code>jupyterhubsecgroup</code> in my previous tutorial. So in the new <code>swarmsecgroup</code>, add this rule: </p>
<ul>
<li>Add Rule</li>
<li>Rule: ALL TCP</li>
<li>Direction: Ingress</li>
<li>Remote: Security Group</li>
<li>Security Group: <code>jupyterhubsecgroup</code></li>
</ul>
<p>Add another rule replacing Ingress with Egress.
Now open the <code>jupyterhubsecgroup</code> group and add the same 2 rules, just make sure to choose as target "Security Group" <code>swarmsecgroup</code>.</p>
<p>On the <code>swarmsecgroup</code> also add a Rule for SSH traffic from any source choosing CIDR and 0.0.0.0/0, you can disable this after having executed the configuration.</p>
<h2>Setup the Docker Swarm nodes</h2>
<h3>Launch a plain Ubuntu instance</h3>
<p>Launch a new instance, call it <code>swarmnode</code>, choose the size depending on your requirements, then choose "Boot from image" and pick Ubuntu 14.04 LTS (16.04 should work as well, but I haven't tested it yet). Remember to choose a Key Pair under Access & Security and assign the Security Group <code>swarmsecgroup</code>.</p>
<p>Temporarily add a floating IP to this instance in order to SSH into it, see my first tutorial for more details.</p>
<h3>Setup Docker Swarm</h3>
<p>First install Docker engine:</p>
<div class="highlight"><pre><span></span><span class="err">sudo apt update</span>
<span class="err">sudo apt install apt-transport-https ca-certificates</span>
<span class="err">sudo apt-key adv --keyserver hkp://p80.pool.sks-keyservers.net:80 --recv-keys 58118E89F3A912897C070ADBF76221572C52609D</span>
<span class="err">echo "deb https://apt.dockerproject.org/repo ubuntu-trusty main" | sudo tee /etc/apt/sources.list.d/docker.list </span>
<span class="err">sudo apt update</span>
<span class="err">sudo apt install -y docker-engine</span>
<span class="err">sudo usermod -aG docker ubuntu</span>
</pre></div>
<p>Then make the same edit we did on the hub, edit <code>/etc/init/docker.conf</code> and replace <code>DOCKER_OPTS=</code> in the <code>start</code> section with:</p>
<div class="highlight"><pre><span></span><span class="err">DOCKER_OPTS="-H tcp://0.0.0.0:2375 -H unix:///var/run/docker.sock"</span>
</pre></div>
<p>Restart Docker with:</p>
<div class="highlight"><pre><span></span><span class="err">sudo service docker restart</span>
</pre></div>
<p>Then run the container that interfaces with Swarm:</p>
<div class="highlight"><pre><span></span><span class="err">HUB_LOCAL_IP=10.XX.XX.XX</span>
<span class="err">NODE_LOCAL_IP=$(ip route get 8.8.8.8 | awk 'NR==1 {print $NF}')</span>
<span class="err">docker run --restart=always -d swarm join --advertise=$NODE_LOCAL_IP:2375 consul://$HUB_LOCAL_IP:8500</span>
</pre></div>
<p>Replace <code>10.XX.XX.XX</code> in the <code>HUB_LOCAL_IP</code> variable with the internal address of the Jupyterhub server.</p>
<h3>Setup mounting the home filesystem</h3>
<div class="highlight"><pre><span></span><span class="err">sudo apt-get install autofs</span>
</pre></div>
<p>add in <code>/etc/auto.master</code>:</p>
<div class="highlight"><pre><span></span><span class="err">/home /etc/auto.home</span>
</pre></div>
<p>create <code>/etc/auto.home</code>:</p>
<div class="highlight"><pre><span></span><span class="err">echo "* $HUB_LOCAL_IP:/home/&" | sudo tee /etc/auto.home</span>
</pre></div>
<p>using the internal IP of the hub.</p>
<div class="highlight"><pre><span></span><span class="err">sudo service autofs restart</span>
</pre></div>
<p>verify by doing:</p>
<div class="highlight"><pre><span></span><span class="err">ls /home/ubuntu</span>
</pre></div>
<p>or </p>
<div class="highlight"><pre><span></span><span class="err">ls /home/training01</span>
</pre></div>
<p>you should see the same files that were on the Jupyterhub server.</p>
<h3>Create users</h3>
<p>As we are using system users and mounting the home filesystem it is important that users have the same UID on all nodes, so we are going to run on the node the same script we ran on the Jupyterhub server:</p>
<div class="highlight"><pre><span></span><span class="err"> bash create_users.sh</span>
</pre></div>
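<p>If you want to double-check that the UIDs really do match across machines, a small sketch like the following (my own helper, not part of the original scripts) can compare two <code>/etc/passwd</code> dumps:</p>

```python
def uid_mismatches(passwd_a, passwd_b):
    """Given the text of /etc/passwd from two machines, return the users
    present on both whose numeric UIDs differ."""
    def parse(text):
        entries = {}
        for line in text.strip().splitlines():
            fields = line.split(":")
            entries[fields[0]] = fields[2]  # username -> uid
        return entries
    a, b = parse(passwd_a), parse(passwd_b)
    return {user: (a[user], b[user])
            for user in a.keys() & b.keys() if a[user] != b[user]}
```

<p>Run it on the hub's and a node's <code>/etc/passwd</code>; an empty result means home directories mounted over NFS will have consistent ownership.</p>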
<h3>Test Jupyterhub</h3>
<p>Login on the Jupyterhub instance with 2 or more different users, then check on the console of the Hub that the containers were launched on the <code>swarmnode</code> instance:</p>
<div class="highlight"><pre><span></span><span class="err"> docker -H :4000 ps -a</span>
</pre></div>
<h2>Create more nodes</h2>
<p>Now that we have created a fully functioning node, we can clone it to create more nodes to accommodate more users.</p>
<h3>Create a snapshot of the node</h3>
<p>First we need to delete all Docker containers, ssh into the <code>swarmnode</code> and execute:</p>
<div class="highlight"><pre><span></span><span class="err"> docker rm -f $(docker ps -a -q)</span>
</pre></div>
<p>Each Docker engine has a unique identifying key; we need to remove it so that a new one is regenerated on each clone.</p>
<div class="highlight"><pre><span></span><span class="err">sudo service docker stop</span>
<span class="err">sudo rm /etc/docker/key.json</span>
</pre></div>
<p>Then from Compute->Instances choose "Create Snapshot", call it <code>swarmnodeimage</code>.</p>
<h3>Launch other nodes</h3>
<p>Click on Launch instance->"Boot from Snapshot"-><code>swarmnodeimage</code>, choose the <code>swarmsecgroup</code> Security Group. Choose any number of instances you need.</p>
<p>Each node will need to launch the Swarm container with its own local ip, not the same as our first node. Therefore we need to use the "Post Creation"->"Direct Input" and add this script: </p>
<table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre>1
2
3
4</pre></div></td><td class="code"><div class="highlight"><pre><span></span><span class="ch">#!/bin/bash</span>
<span class="nv">HUB_LOCAL_IP</span><span class="o">=</span><span class="m">10</span>.XX.XX.XX
<span class="nv">NODE_LOCAL_IP</span><span class="o">=</span><span class="k">$(</span>ip route get <span class="m">8</span>.8.8.8 <span class="p">|</span> awk <span class="s1">'NR==1 {print $NF}'</span><span class="k">)</span>
docker run --restart<span class="o">=</span>always -d swarm join --advertise<span class="o">=</span><span class="nv">$NODE_LOCAL_IP</span>:2375 consul://<span class="nv">$HUB_LOCAL_IP</span>:8500
</pre></div>
</td></tr></table>
<p><code>HUB_LOCAL_IP</code> is the internal network IP address of the Jupyterhub instance and <code>NODE_LOCAL_IP</code> will be filled with the IP of the OpenStack image just created.</p>
<p>See for example Jupyterhub with 3 remote Swarm nodes running containers for 4 training users:</p>
<div class="highlight"><pre><span></span>$ docker -H :4000 ps -a
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
60189f208df2 zonca/jupyterhub-datascience-systemuser <span class="s2">"tini -- sh /srv/sing"</span> <span class="m">11</span> seconds ago Up <span class="m">7</span> seconds <span class="m">10</span>.128.1.28:32769->8888/tcp swarmnodes-1/jupyter-training04
1d7b05caedb1 zonca/jupyterhub-datascience-systemuser <span class="s2">"tini -- sh /srv/sing"</span> <span class="m">36</span> seconds ago Up <span class="m">32</span> seconds <span class="m">10</span>.128.1.27:32768->8888/tcp swarmnodes-2/jupyter-training03
733c5ff0a5ed zonca/jupyterhub-datascience-systemuser <span class="s2">"tini -- sh /srv/sing"</span> <span class="m">58</span> seconds ago Up <span class="m">54</span> seconds <span class="m">10</span>.128.1.29:32768->8888/tcp swarmnodes-3/jupyter-training02
282abce201dd zonca/jupyterhub-datascience-systemuser <span class="s2">"tini -- sh /srv/sing"</span> About a minute ago Up About a minute <span class="m">10</span>.128.1.28:32768->8888/tcp swarmnodes-1/jupyter-training01
29b2d394fab9 swarm <span class="s2">"/swarm join --advert"</span> <span class="m">13</span> minutes ago Up <span class="m">13</span> minutes <span class="m">2375</span>/tcp swarmnodes-2/romantic_easley
8fd3d32fe849 swarm <span class="s2">"/swarm join --advert"</span> <span class="m">13</span> minutes ago Up <span class="m">13</span> minutes <span class="m">2375</span>/tcp swarmnodes-3/clever_mestorf
1ae073f7b78b swarm <span class="s2">"/swarm join --advert"</span> <span class="m">13</span> minutes ago Up <span class="m">13</span> minutes <span class="m">2375</span>/tcp swarmnodes-1/jovial_goldwasser
</pre></div>
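<p>With many users it can be handy to see how Swarm spread the notebook containers over the nodes. In the listing above container names are prefixed with the node name, so a quick tally (a sketch assuming that <code>node/name</code> convention) is:</p>

```python
from collections import Counter

def containers_per_node(container_names):
    """Tally Swarm container names of the form "node/name" by node."""
    return Counter(name.split("/")[0] for name in container_names)
```

<p>Feeding it the names from the example output above would show two containers on <code>swarmnodes-1</code> and one each on the other two nodes.</p>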
<h2>Where to go from here</h2>
<p>At this level the deployment is quite complicated, so it is probably worth automating it with an <code>ansible</code> playbook; that will be the subject of the next blog post. I expect the result will be a simplified version of <a href="https://github.com/compmodels/jupyterhub-deploy">Jess Hamrick's compmodels deployment</a>. Still, I recommend starting with a manual setup to understand how the different pieces work.</p>
<h2>Troubleshooting</h2>
<p>If <code>docker -H :4000 ps -a</code> gives the error:</p>
<div class="highlight"><pre><span></span><span class="err">Error response from daemon: No elected primary cluster manager</span>
</pre></div>
<p>it means the Consul container is broken, remove it and create it again.</p>
<h2>Acknowledgments</h2>
<p>Thanks to Jess Hamrick for sharing the setup of her <a href="https://github.com/compmodels">compmodels class on Github</a>, the Jupyter team for releasing such great tools and Kevin Coakley and the rest of the <a href="http://www.sdsc.edu/services/it/cloud.html">SDSC Cloud</a> team for OpenStack support and resources.</p>Quick Jupyterhub deployment for workshops with pre-built image2016-04-28T12:00:00-07:002016-04-28T12:00:00-07:00Andrea Zoncatag:zonca.github.io,2016-04-28:/2016/04/jupyterhub-image-sdsc-cloud.html<p>This tutorial explains how to use a OpenStack image I already built to quickly deploy a Jupyterhub Virtual Machine that can provide a good initial setup for a workshop, providing students access to Python 2/3, Julia, R, file editor and terminal with bash.</p>
<p>For details about building the instance …</p><p>This tutorial explains how to use a OpenStack image I already built to quickly deploy a Jupyterhub Virtual Machine that can provide a good initial setup for a workshop, providing students access to Python 2/3, Julia, R, file editor and terminal with bash.</p>
<p>For details about building the instance yourself for more customization, see the full tutorial at <a href="http://zonca.github.io/2016/04/jupyterhub-sdsc-cloud.html">http://zonca.github.io/2016/04/jupyterhub-sdsc-cloud.html</a>.</p>
<h2>Create a Virtual Machine in OpenStack with the pre-built image</h2>
<p>Follow the 3 steps at <a href="http://zonca.github.io/2016/04/jupyterhub-sdsc-cloud.html">the step by step tutorial</a> under "Create a Virtual Machine in OpenStack":</p>
<ul>
<li>Network setup</li>
<li>Create a new Virtual Machine: here instead of choosing the base <code>ubuntu</code> image, choose <code>jupyterhub_docker</code>, also you can choose any size, I recommend to start with a <code>c1.large</code> for experimentation, you can then resize it later to a more powerful instance depending on the needs of your workshop</li>
<li>Give public IP to the instance</li>
</ul>
<h2>Connect to Jupyterhub</h2>
<p>The Jupyterhub instance is ready! Just open your browser and connect to the floating IP of the instance you just created.</p>
<p>The browser should show a security error related to the fact that the pre-installed SSL certificate is not trusted, click on "Advanced properties" and choose to connect anyway, we'll see later how to fix this.</p>
<p>You already have 50 training users, named <code>training01</code> to <code>training50</code>, all with the same password <code>jupyterhubSDSC</code> (see below how to change it). Check that you can login and create a notebook.</p>
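<p>The naming scheme above (zero-padded, two-digit suffixes) is easy to reproduce in scripts, for example to iterate over accounts or generate a roster; a small sketch (my own helper):</p>

```python
def training_users(n):
    """Usernames training01 .. trainingNN, zero-padded to two digits,
    matching the accounts pre-created on the image."""
    return ["training%02d" % i for i in range(1, n + 1)]
```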
<h2>Administer the Jupyterhub instance</h2>
<p>Login into the Virtual Machine with <code>ssh -i jupyterhub.pem ubuntu@xxx.xxx.xxx.xxx</code> using the key file and the public IP setup in the previous steps.</p>
<p>To get rid of the annoying "unable to resolve host" warning, add the hostname of the machine (check by running <code>hostname</code>) to <code>/etc/hosts</code>, i.e. the first line should become something like <code>127.0.0.1 localhost jupyterhub</code> if <code>jupyterhub</code> is the hostname</p>
<h3>Change password/add more users</h3>
<p>In the home folder of the <code>ubuntu</code> user, there is a file named <code>create_users.sh</code>; edit it to change the <code>PASSWORD</code> variable and the number of users from <code>50</code> to a larger number. Then run it with <code>bash create_users.sh</code>. Training users <strong>cannot SSH</strong> into the machine.</p>
<p>Use <code>sudo passwd trainingXX</code> to change the password of a single user.</p>
<h3>Setup a domain (needed for SSL certificate)</h3>
<p>If you do not know how to get a domain name, here are some options:</p>
<ul>
<li>you can generally request a subdomain name from your institution, see for example <a href="http://blink.ucsd.edu/technology/help-desk/sysadmin-resources/domain.html#Register-your-domain-name">UCSD</a></li>
<li>if you own a domain, go in the DNS settings, add a record of type A to a subdomain, like <code>jupyterhub.yourdomain.com</code> that points to the floating IP of the Jupyterhub instance</li>
<li>you can get a free dynamic dns at websites like <a href="https://noip.com">noip.com</a></li>
</ul>
<p>In each case you need to have a DNS record of type A that points to the floating IP of the Jupyterhub instance.</p>
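<p>Once the record is in place, you can verify that it resolves to the floating IP before requesting a certificate; a quick sketch (the helper name is my own):</p>

```python
import socket

def dns_points_to(domain, expected_ip):
    """Check that a domain resolves to the expected floating IP."""
    resolved = {info[4][0] for info in socket.getaddrinfo(domain, None)}
    return expected_ip in resolved
```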
<h3>Setup a SSL Certificate</h3>
<p><a href="https://letsencrypt.org/">Letsencrypt</a> provides free SSL certificates by using a command line client.</p>
<p>SSH into the server, run:</p>
<div class="highlight"><pre><span></span><span class="err">git clone https://github.com/letsencrypt/letsencrypt</span>
<span class="err">cd letsencrypt</span>
<span class="err">sudo service nginx stop</span>
<span class="err">./letsencrypt-auto certonly --standalone -d jupyterhubdeploy.ddns.net</span>
</pre></div>
<p>Follow the instructions at the terminal to obtain a certificate.</p>
<p>Now open the nginx configuration file: <code>sudo vim /etc/nginx/nginx.conf</code></p>
<p>And modify the SSL certificate lines:</p>
<div class="highlight"><pre><span></span><span class="err">ssl_certificate /etc/letsencrypt/live/yoursub.domain.edu/cert.pem;</span>
<span class="err">ssl_certificate_key /etc/letsencrypt/live/yoursub.domain.edu/privkey.pem;</span>
</pre></div>
<p>Start NGINX:</p>
<div class="highlight"><pre><span></span><span class="err">sudo service nginx start</span>
</pre></div>
<p>Connect again to Jupyterhub and check that your browser correctly detects that the HTTPS connection is safe.</p>
<h2>Comments? Suggestions?</h2>
<ul>
<li><a href="http://twitter.com/andreazonca">Twitter</a></li>
<li>Email <code>zonca</code> on the domain <code>sdsc.edu</code></li>
</ul>Deploy Jupyterhub on a Virtual Machine for a Workshop2016-04-16T12:00:00-07:002016-04-16T12:00:00-07:00Andrea Zoncatag:zonca.github.io,2016-04-16:/2016/04/jupyterhub-sdsc-cloud.html<p>This tutorial describes the steps to install a Jupyterhub instance on a single machine suitable for hosting a workshop, suitable for having people login with training accounts on Jupyter Notebooks running Python 2/3, R, Julia with also Terminal access on Docker containers.
Details about the setup:</p>
<ul>
<li>Jupyterhub installed with …</li></ul><p>This tutorial describes the steps to install a Jupyterhub instance on a single machine suitable for hosting a workshop, with people logging in with training accounts to Jupyter Notebooks running Python 2/3, R and Julia, plus terminal access, on Docker containers.
Details about the setup:</p>
<ul>
<li>Jupyterhub installed with Anaconda directly on the host, proxied by NGINX under HTTPS with self-signed certificate</li>
<li>Login with Linux account credentials created previously by the administrator, data in /home are persistent across sessions</li>
<li>Each user runs in a separated Docker container with access to Python 2, Python 3, R and Julia kernels, they can also open the Notebook editor and the terminal</li>
<li>Using a single machine you have to consider that the biggest constraint is going to be memory usage, as a rule of thumb consider 100-200 MB/user plus 5x-10x the amount of data you are loading from disk, depending on the kind of analysis. For a multi-node setup you need to look into Docker Swarm.</li>
</ul>
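<p>The memory rule of thumb in the last point can be turned into a quick back-of-the-envelope estimate; the default numbers below are just the ranges quoted above, not measurements:</p>

```python
def required_ram_gb(n_users, data_gb_per_user=0.0,
                    per_user_mb=200, data_factor=10):
    """Rough RAM estimate in GB: a per-user baseline plus a multiple of
    the data each user loads from disk (5x-10x, depending on analysis)."""
    return n_users * (per_user_mb / 1024.0 + data_factor * data_gb_per_user)
```

<p>For example, 50 users with no large datasets already need on the order of 10 GB just for the notebook processes.</p>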
<p>I am using the OpenStack deployment at the San Diego Supercomputer Center, <a href="http://www.sdsc.edu/services/it/cloud.html">SDSC Cloud</a>, AWS deployments should just replace the first section on Creating a VM and setting up Networking, see <a href="https://github.com/jupyterhub/jupyterhub/wiki/Deploying-JupyterHub-on-AWS">the Jupyterhub wiki</a>.</p>
<p>If you intend to run on SDSC Cloud, I have a pre-built image of this deployment you can setup and run quickly, see <a href="http://zonca.github.io/2016/04/jupyterhub-image-sdsc-cloud.html">see my followup tutorial</a>.</p>
<h1>Create a Virtual Machine in OpenStack</h1>
<p>First of all we need to launch a new Virtual Machine and configure the network.</p>
<ul>
<li>Login to the SDSC Cloud OpenStack dashboard</li>
</ul>
<h2>Network setup</h2>
<p>Jupyterhub will be proxied to the standard HTTPS port by NGINX, and we also want to redirect HTTP to HTTPS, so we open both ports. We also open SSH for the administrators to log in, plus a custom TCP rule so that the Docker containers can connect to the Jupyterhub hub running on port 8081; that port is opened only to the subnet running the Docker containers.</p>
<ul>
<li>Compute -> Access & Security -> Security Groups -> Create Security Group and name it <code>jupyterhubsecgroup</code></li>
<li>Click on Manage Rules </li>
<li>Click on add rule, choose the HTTP rule and click add</li>
<li>Repeat the last step with HTTPS and SSH</li>
<li>Click on add rule again, choose Custom TCP Rule, set port 8081 and set CIDR 172.17.0.0/24 (this is needed so that the containers can connect to the hub)</li>
</ul>
<h2>Create a new Virtual Machine</h2>
<p>We choose Ubuntu here, also other distributions should work fine.</p>
<ul>
<li>Compute -> Access & Security -> Key Pairs -> Create key pair, name it <code>jupyterhub</code> and download it to your local machine</li>
<li>Instances -> Launch Instance, Choose a name, Choose "Boot from image" in Boot Source and Ubuntu as Image name, Choose any size, depending on the number of users (TODO add link to Jupyterhub docs)</li>
<li>Under "Access & Security" choose Key Pair <code>jupyterhub</code> and Security Groups <code>jupyterhubsecgroup</code></li>
<li>Click <code>Launch</code> to create the instance</li>
</ul>
<h2>Give public IP to the instance</h2>
<p>By default in SDSC Cloud machines do not have a public IP.</p>
<ul>
<li>Compute -> Access & Security -> Floating IPs -> Allocate IP To Project, "Allocate IP" to request a public IP</li>
<li>Click on the "Associate" button of the IP just requested and under "Port to be associated" choose the instance just created</li>
</ul>
<h1>Setup Jupyterhub in the Virtual Machine</h1>
<p>In this section we will install and configure Jupyterhub and NGINX to run on the Virtual Machine.</p>
<ul>
<li>login into the Virtual Machine with <code>ssh -i jupyterhub.pem ubuntu@xxx.xxx.xxx.xxx</code> using the key file and the public IP setup in the previous steps</li>
<li>add the hostname of the machine (check by running <code>hostname</code>) to <code>/etc/hosts</code>, i.e. the first line should become something like <code>127.0.0.1 localhost jupyterhub</code> if <code>jupyterhub</code> is the hostname</li>
</ul>
<h2>Setup Jupyterhub</h2>
<div class="highlight"><pre><span></span><span class="err">wget --no-check-certificate https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh</span>
<span class="err">bash Miniconda3-latest-Linux-x86_64.sh</span>
</pre></div>
<p>use all defaults, answer "yes" to modify PATH</p>
<div class="highlight"><pre><span></span><span class="err">sudo apt-get install npm nodejs-legacy</span>
<span class="err">sudo npm install -g configurable-http-proxy</span>
<span class="err">conda install traitlets tornado jinja2 sqlalchemy</span>
<span class="err">pip install jupyterhub</span>
</pre></div>
<p>For authentication to work, the <code>ubuntu</code> user needs to be able to read the <code>/etc/shadow</code> file:</p>
<div class="highlight"><pre><span></span><span class="err">sudo adduser ubuntu shadow</span>
</pre></div>
<h2>Setup the web server</h2>
<p>We will use the NGINX web server to proxy Jupyterhub and handle HTTPS for us; this is recommended for deployments on the public internet.</p>
<div class="highlight"><pre><span></span><span class="err">sudo apt install nginx</span>
</pre></div>
<p><strong>SSL Certificate</strong>: Optionally later, once we have assigned a domain to the Virtual Machine, we can install <code>letsencrypt</code> and get a real certificate, <a href="http://zonca.github.io/2016/04/jupyterhub-image-sdsc-cloud.html">see my followup tutorial</a>, for simplicity here we are just using self-signed certificates that will give warnings on the first time users connect to the server, but still will keep the traffic encrypted.</p>
<div class="highlight"><pre><span></span><span class="err">sudo mkdir /etc/nginx/ssl</span>
<span class="err">sudo openssl req -x509 -nodes -days 365 -newkey rsa:2048 -keyout /etc/nginx/ssl/nginx.key -out /etc/nginx/ssl/nginx.crt</span>
</pre></div>
<p>Get <code>/etc/nginx/nginx.conf</code> from https://gist.github.com/zonca/08c413a37401bdc9d2a7f65a7af44462</p>
<h1>Setup Docker Spawner</h1>
<p>By default Jupyterhub runs notebooks as processes owned by each system user, for more security and isolation, we want Notebook to run in Docker containers, which are something like lightweight Virtual Machines running inside our server.</p>
<h2>Install Docker</h2>
<ul>
<li>Source: https://docs.docker.com/engine/installation/linux/ubuntulinux/#prerequisites</li>
</ul>
<div class="highlight"><pre><span></span><span class="err">sudo apt update</span>
<span class="err">sudo apt install apt-transport-https ca-certificates</span>
<span class="err">sudo apt-key adv --keyserver hkp://p80.pool.sks-keyservers.net:80 --recv-keys 58118E89F3A912897C070ADBF76221572C52609D</span>
<span class="err">echo "deb https://apt.dockerproject.org/repo ubuntu-trusty main" | sudo tee /etc/apt/sources.list.d/docker.list </span>
<span class="err">sudo apt update</span>
<span class="err">sudo apt install docker-engine</span>
<span class="err">sudo usermod -aG docker ubuntu</span>
</pre></div>
<p>Logout and login again for the group to take effect</p>
<h2>Install and configure DockerSpawner</h2>
<div class="highlight"><pre><span></span><span class="err">pip install dockerspawner</span>
<span class="err">docker pull jupyter/systemuser</span>
<span class="err">conda install ipython jupyter</span>
</pre></div>
<p>Create <code>jupyterhub_config.py</code> in the home folder of the ubuntu user with this content:</p>
<div class="highlight"><pre><span></span><span class="n">c</span><span class="o">.</span><span class="n">JupyterHub</span><span class="o">.</span><span class="n">confirm_no_ssl</span> <span class="o">=</span> <span class="bp">True</span>
<span class="n">c</span><span class="o">.</span><span class="n">JupyterHub</span><span class="o">.</span><span class="n">spawner_class</span> <span class="o">=</span> <span class="s1">'dockerspawner.SystemUserSpawner'</span>
<span class="c1"># The docker instances need access to the Hub, so the default loopback port doesn't work:</span>
<span class="kn">from</span> <span class="nn">IPython.utils.localinterfaces</span> <span class="kn">import</span> <span class="n">public_ips</span>
<span class="n">c</span><span class="o">.</span><span class="n">JupyterHub</span><span class="o">.</span><span class="n">hub_ip</span> <span class="o">=</span> <span class="n">public_ips</span><span class="p">()[</span><span class="mi">0</span><span class="p">]</span>
</pre></div>
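<p>The <code>public_ips</code> helper comes from IPython. As a hedged sketch, the same non-loopback IP can also be found with the standard library alone; the <code>8.8.8.8</code> address below is only used to let the kernel select a route, no packet is actually sent, and the function name is mine:</p>

```python
import socket

def default_route_ip():
    """Return the IP this host would use to reach the internet.

    A UDP connect() does not send any packets; it only asks the kernel
    which local address would be used for that route.
    """
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        s.connect(("8.8.8.8", 80))
        return s.getsockname()[0]
    except OSError:
        return "127.0.0.1"  # no route available: fall back to loopback
    finally:
        s.close()

# In jupyterhub_config.py one could then set:
# c.JupyterHub.hub_ip = default_route_ip()
```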
<h1>Connect to Jupyterhub</h1>
<p>From the home folder of the <code>ubuntu</code> user, type <code>jupyterhub</code> to launch the Jupyterhub process; see below how to start it automatically at boot. Use CTRL-C to stop it.</p>
<p>Open a browser and connect to the floating IP you set for your instance; this should redirect to HTTPS. Click "Advanced" in the safety warning caused by the self-signed SSL certificate and log in with the training credentials.</p>
<p>Instead of using the IP, you can use any domain that points to that same IP with a DNS record of type A, or get a dynamic DNS for free on a website like http://noip.com.
Once you have a custom domain, you can configure letsencrypt to have a proper HTTPS certificate so that users do not get any warning when connecting to the instance. I will add this to the optional steps below.</p>
<h1>Optional: Automatically start jupyterhub at boot</h1>
<p>Save https://gist.github.com/zonca/aaeaf3c4e7339127b482d759866e5f39 as <code>/etc/init.d/jupyterhub</code></p>
<div class="highlight"><pre><span></span><span class="err">sudo chmod +x /etc/init.d/jupyterhub</span>
<span class="err">sudo service jupyterhub start</span>
<span class="err">sudo update-rc.d jupyterhub defaults</span>
</pre></div>
<h1>Optional: Create training user accounts</h1>
<p>Add user accounts on Jupyterhub by creating standard Linux users with <code>adduser</code> interactively or with a batch script.</p>
<p>For example, the following batch script creates 10 users, all with the same password:</p>
<table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre>1
2
3
4
5
6
7
8</pre></div></td><td class="code"><div class="highlight"><pre><span></span><span class="ch">#!/bin/bash</span>
<span class="nv">PASSWORD</span><span class="o">=</span>samepasswordforallusers
<span class="nv">NUMBER_OF_USERS</span><span class="o">=</span><span class="m">10</span>
<span class="k">for</span> n in <span class="sb">`</span>seq -f <span class="s2">"%02g"</span> <span class="m">1</span> <span class="nv">$NUMBER_OF_USERS</span><span class="sb">`</span>
<span class="k">do</span>
<span class="nb">echo</span> creating user training<span class="nv">$n</span>
<span class="nb">echo</span> training<span class="nv">$n</span>:<span class="nv">$PASSWORD</span>::::/home/training<span class="nv">$n</span>:/bin/bash <span class="p">|</span> sudo newusers
<span class="k">done</span>
</pre></div>
</td></tr></table>
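<p>The same <code>newusers</code>-format lines can also be generated in Python, for example to preview them before piping to <code>sudo newusers</code>. The password and user count below mirror the bash script above:</p>

```python
# Generate newusers(8)-format lines: user:password:uid:gid:gecos:home:shell
# (the empty uid/gid/gecos fields let the system pick defaults).
password = "samepasswordforallusers"
number_of_users = 10

lines = [
    f"training{n:02d}:{password}::::/home/training{n:02d}:/bin/bash"
    for n in range(1, number_of_users + 1)
]
print("\n".join(lines))
```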
<p>Also add <code>AllowUsers ubuntu</code> to <code>/etc/ssh/sshd_config</code> so that training users cannot SSH into the host machine.</p>
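<p>Editing <code>sshd_config</code> by hand works fine; as a small hedged sketch, the directive can also be appended idempotently from a script (the helper name is mine, the config path is the standard one):</p>

```python
from pathlib import Path

def ensure_allow_users(config_path, user="ubuntu"):
    """Append an AllowUsers directive to sshd_config if it is not present."""
    path = Path(config_path)
    directive = f"AllowUsers {user}"
    lines = path.read_text().splitlines() if path.exists() else []
    if directive not in lines:
        lines.append(directive)
        path.write_text("\n".join(lines) + "\n")

# After editing, reload sshd, e.g.: sudo service ssh reload
```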
<h1>Optional: Add the R and Julia kernels</h1>
<ul>
<li>SSH into the instance</li>
<li><code>git clone https://github.com/jupyter/dockerspawner</code></li>
<li><code>cd dockerspawner</code></li>
</ul>
<p>Modify the file <code>singleuser/Dockerfile</code>, replace <code>FROM jupyter/scipy-notebook</code> with <code>FROM jupyter/datascience-notebook</code></p>
<div class="highlight"><pre><span></span><span class="err">docker build -t datascience-singleuser singleuser</span>
</pre></div>
<p>Modify the file <code>systemuser/Dockerfile</code>, replace <code>FROM jupyter/singleuser</code> with <code>FROM datascience-singleuser</code></p>
<div class="highlight"><pre><span></span><span class="err">docker build -t datascience-systemuser systemuser</span>
</pre></div>
<p>Finally in <code>jupyterhub_config.py</code>, select the new docker image:</p>
<div class="highlight"><pre><span></span><span class="err">c.DockerSpawner.container_image = "datascience-systemuser"</span>
</pre></div>Use your own Python installation (kernel) in Jupyterhub2015-10-05T12:00:00-07:002015-10-05T12:00:00-07:00Andrea Zoncatag:zonca.github.io,2015-10-05:/2015/10/use-own-python-in-jupyterhub.html<p><strong>Updated February 2017</strong></p>
<p>You have access to a Jupyterhub server, but the Python installation provided does not satisfy your needs;
how can you use your own?</p>
<h2>Install Anaconda</h2>
<p>If you don't already have your own Python installation on the Jupyterhub server you have access to, you can install Anaconda in your home …</p><p><strong>Updated February 2017</strong></p>
<p>You have access to a Jupyterhub server, but the Python installation provided does not satisfy your needs;
how can you use your own?</p>
<h2>Install Anaconda</h2>
<p>If you don't already have your own Python installation on the Jupyterhub server you have access to, you can install Anaconda in your home folder. I assume here you have a permanent home folder on the server.</p>
<p>In order to type commands, you can either
open a Jupyterhub Terminal, or run them in the IPython notebook prefixed with <code>!</code>.</p>
<ul>
<li><code>!wget https://repo.continuum.io/archive/Anaconda3-2.3.0-Linux-x86_64.sh</code></li>
<li><code>!bash ./Anacon*</code></li>
</ul>
<h2>Create a kernel file for Jupyterhub</h2>
<p>You probably already know you can have Python 2 and Python 3 kernels on the same Jupyter notebook installation. In the same way you can create your own <code>KernelSpec</code> that instead launches another Python installation.</p>
<p>IPython can automatically create a <code>KernelSpec</code> for you, from the IPython notebook, run:</p>
<div class="highlight"><pre><span></span><span class="err">!~/anaconda3/bin/ipython kernel install --user --name anaconda</span>
</pre></div>
<p>In case your path is different, just insert the full path to <code>ipython</code> from the Python installation you would like to use.</p>
<p>This will create a file <code>kernel.json</code> in <code>~/.local/share/jupyter/kernels/anaconda</code>.</p>
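<p>The generated <code>kernel.json</code> looks roughly like the following; this is a sketch of the typical content, the paths and the launcher module are assumptions that depend on your installation and <code>ipykernel</code> version:</p>

```python
import json

# Sketch of ~/.local/share/jupyter/kernels/anaconda/kernel.json;
# the argv path is an assumption matching the Anaconda install above.
spec = {
    "argv": [
        "/home/youruser/anaconda3/bin/python",
        "-m", "ipykernel",
        "-f", "{connection_file}",
    ],
    "display_name": "anaconda",
    "language": "python",
}
print(json.dumps(spec, indent=1))
```

The <code>{connection_file}</code> placeholder is filled in by Jupyter at kernel launch time.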
<p>You can also add KernelSpecs for other <code>conda</code> environments doing:</p>
<div class="highlight"><pre><span></span><span class="sx">!source activate environmentname</span>
<span class="sx">!ipython kernel install --user --name environmentname</span>
</pre></div>
<h2>Launch a Notebook</h2>
<p>Go back to the Jupyterhub dashboard, reload the page, now you should have another option in the <code>New</code> menu that says <code>My Anaconda</code>.</p>
<p>In order to use your new kernel with an existing notebook, click on the notebook file in the dashboard; it will launch with the default kernel, and you can then change kernel from the top menu <code>Kernel</code> > <code>Change kernel</code>.</p>IPython/Jupyter notebook setup on NERSC Edison2015-09-24T20:00:00-07:002015-09-24T20:00:00-07:00Andrea Zoncatag:zonca.github.io,2015-09-24:/2015/09/ipython-jupyter-notebook-nersc-edison.html<h2>Introduction</h2>
<p>This tutorial explains the setup to run an IPython Notebook on a computing node on the supercomputer Edison at NERSC and forward its port encrypted with SSH to the browser on a local laptop.
This setup is a bit more complicated than other supercomputers, i.e. see <a href="http://zonca.github.io/2015/09/ipython-jupyter-notebook-sdsc-comet.html">my tutorial …</a></p><h2>Introduction</h2>
<p>This tutorial explains the setup to run an IPython Notebook on a computing node on the supercomputer Edison at NERSC and forward its port encrypted with SSH to the browser on a local laptop.
This setup is a bit more complicated than on other supercomputers (see, for example, <a href="http://zonca.github.io/2015/09/ipython-jupyter-notebook-sdsc-comet.html">my tutorial for Comet</a>) for two reasons:</p>
<ul>
<li>Edison's computing nodes run a stripped down OS, with no support for SSH, unless you activate <a href="https://www.nersc.gov/users/computational-systems/hopper/cluster-compatibility-mode/">Cluster Compatibility Mode</a> (CCM) </li>
<li>On Edison you generally don't have direct access to a computing node; even if you request an interactive node, you actually have access to an intermediary node (MOM node), from which <code>aprun</code> sends a job for execution on the computing node.</li>
</ul>
<h2>Quick reference</h2>
<ul>
<li>Install the IPython notebook and make sure it is in the path; I recommend installing Anaconda 64bit in your home folder or on scratch.</li>
<li>Make sure you can ssh passwordless within Edison, i.e. <code>ssh edison</code> from an Edison login node works without a password</li>
<li>Create a folder <code>notebook</code> in your home, get <code>notebook_job.pbs</code> and <code>launch_notebook_and_tunnel_to_login.sh</code> from <a href="https://gist.github.com/zonca/357d36347fd5addca8f0">https://gist.github.com/zonca/357d36347fd5addca8f0</a></li>
<li>Change the port number and customize options (duration)</li>
<li><code>qsub notebook_job.pbs</code></li>
<li>From laptop, launch <code>bash tunnel_laptop_edisonlogin.sh ##</code> from <a href="https://gist.github.com/zonca/5f8b5ccb826a774d3f89">https://gist.github.com/zonca/5f8b5ccb826a774d3f89</a>, where <code>##</code> is the Edison login node number in 2 digits, like <code>03</code>. First you need to modify the port number.</li>
<li>From laptop, open browser and connect to <code>http://localhost:YOURPORT</code></li>
</ul>
<h2>Detailed walkthrough</h2>
<h3>One time setup on Edison</h3>
<p>Make sure that <code>ipython notebook</code> works on a login node; one option is to install
Anaconda 64bit from http://continuum.io/downloads#py34. Choose Python 3.</p>
<p>You need to be able to SSH from one node to another on Edison without a password. Create a new SSH key pair with <code>ssh-keygen</code>, hit enter to keep all default options, and DO NOT ENTER A PASSWORD. Then use <code>ssh-copy-id edison.nersc.gov</code> and enter your password to make sure the key is copied to the authorized keys.
Now you can check it works by executing:</p>
<div class="highlight"><pre><span></span><span class="err">ssh edison.nersc.gov</span>
</pre></div>
<p>from the login node and make sure you are NOT asked for your password.</p>
<h3>Configure the script for TORQUE and submit the job</h3>
<p>Create a <code>notebook</code> folder on your home on Edison.</p>
<p>Copy <code>notebook_job.pbs</code> and <code>launch_notebook_and_tunnel_to_login.sh</code> from <a href="https://gist.github.com/zonca/357d36347fd5addca8f0">https://gist.github.com/zonca/357d36347fd5addca8f0</a> to the <code>notebook</code> folder.</p>
<p>Change the port number in the <code>launch_notebook_and_tunnel_to_login.sh</code> script to a port of your choosing between 7000 and 9999, referenced as YOURPORT in the rest of the tutorial. Two users on the same login node cannot forward the same port, so try to avoid common port numbers such as 8000, 9000, 8080 or 8888.</p>
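<p>A free port in that range can also be picked programmatically; this is a hedged sketch (the range and the excluded defaults follow the text above, and binding is only a quick availability check, another user could still grab the port afterwards):</p>

```python
import random
import socket

COMMON_PORTS = {8000, 8080, 8888, 9000}

def pick_port(low=7000, high=9999, attempts=50):
    """Return a bindable port in [low, high], avoiding common defaults."""
    for _ in range(attempts):
        port = random.randint(low, high)
        if port in COMMON_PORTS:
            continue
        with socket.socket() as s:
            try:
                s.bind(("", port))  # succeeds only if the port is free now
                return port
            except OSError:
                continue  # already taken by another user
    raise RuntimeError("no free port found")

print(pick_port())
```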
<p>Choose the duration of your job; for initial testing it is better to keep it at 30 minutes so your job starts sooner.</p>
<p>Submit the job to the scheduler:</p>
<div class="highlight"><pre><span></span><span class="err">qsub notebook_job.pbs</span>
</pre></div>
<p>Wait for the job to start running, you should see <code>R</code> in:</p>
<div class="highlight"><pre><span></span><span class="err">qstat -u $USER</span>
</pre></div>
<p>The script launches an IPython notebook on a computing node and tunnels its port to the login node.</p>
<p>You can check that everything worked by checking that no errors show up in the <code>notebook.log</code> file, and that you can access the notebook page with <code>wget</code>:</p>
<div class="highlight"><pre><span></span><span class="err">wget localhost:YOURPORT</span>
</pre></div>
<p>should download a <code>index.html</code> file in the current folder, and NOT give an error like "Connection refused".</p>
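<p>The same check can be done from Python instead of <code>wget</code>; a small sketch, where the port argument is whatever you chose above and the function name is mine:</p>

```python
from urllib.error import URLError
from urllib.request import urlopen

def notebook_reachable(port, host="localhost"):
    """True if an HTTP server answers on the given port, like the wget check."""
    try:
        with urlopen(f"http://{host}:{port}/", timeout=5) as response:
            return response.status == 200
    except (URLError, OSError):
        return False  # e.g. "Connection refused": the tunnel is not up
```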
<h3>Tunnel the port to your laptop</h3>
<h4>Linux / MAC</h4>
<p>Download the <code>tunnel_laptop_edisonlogin.sh</code> script from <a href="https://gist.github.com/zonca/357d36347fd5addca8f0">https://gist.github.com/zonca/357d36347fd5addca8f0</a>.</p>
<p>Customize the script with your port number and your username.</p>
<p>Launch <code>bash tunnel_laptop_edisonlogin.sh ##</code> where <code>##</code> is the Edison login node you launched the job from in 2 digits, e.g. <code>03</code>.</p>
<p>The script forwards the port from the login node of Edison to your laptop.</p>
<h4>Windows</h4>
<p>Install <code>putty</code>.</p>
<p>Follow tutorial for local port forwarding on <a href="http://howto.ccs.neu.edu/howto/windows/ssh-port-tunneling-with-putty/">http://howto.ccs.neu.edu/howto/windows/ssh-port-tunneling-with-putty/</a></p>
<ul>
<li>set <code>edison##-eth5.nersc.gov</code> as remote host, where <code>##</code> is the Edison login node you launched the job from in 2 digits, e.g. <code>03</code> and set 22 as SSH port</li>
<li>set YOURPORT as tunnel port, replace both 8080 and 80 in the tutorial with your port number. </li>
</ul>
<h3>Connect to the Notebook</h3>
<p>Open a browser and type <code>http://localhost:YOURPORT</code> in the address bar.</p>
<p>As you can see in the screenshot from my local browser, the <code>hostname</code> is one of Edison's computing nodes:</p>
<p><img alt="test_edison_screenshot.png" src="/images/test_edison_screenshot.png"></p>
<h2>Acknowledgements</h2>
<p>Thanks to Lisa Gerhardt from NERSC user support for helping me understand Edison's configuration.</p>IPython/Jupyter notebook setup on SDSC Comet2015-09-17T20:00:00-07:002015-09-17T20:00:00-07:00Andrea Zoncatag:zonca.github.io,2015-09-17:/2015/09/ipython-jupyter-notebook-sdsc-comet.html<h2>Introduction</h2>
<p>This tutorial explains the setup to run an IPython Notebook on a computing node on the supercomputer Comet at the San Diego Supercomputer Center and forward the port encrypted with SSH to the browser on a local laptop.</p>
<h2>Quick reference</h2>
<ul>
<li>Add <code>module load python scipy</code> to <code>.bashrc</code></li>
<li>Make sure …</li></ul><h2>Introduction</h2>
<p>This tutorial explains the setup to run an IPython Notebook on a computing node on the supercomputer Comet at the San Diego Supercomputer Center and forward the port encrypted with SSH to the browser on a local laptop.</p>
<h2>Quick reference</h2>
<ul>
<li>Add <code>module load python scipy</code> to <code>.bashrc</code></li>
<li>Make sure you can ssh passwordless within Comet, i.e. <code>ssh comet.sdsc.edu</code> from a Comet login node works without a password</li>
<li>Get <code>submit_slurm_comet.sh</code> from <a href="https://gist.github.com/zonca/5f8b5ccb826a774d3f89">https://gist.github.com/zonca/5f8b5ccb826a774d3f89</a></li>
<li>Change the port number and customize options (duration)</li>
<li><code>sbatch submit_slurm_comet.sh</code></li>
<li>Remember the login node you are using</li>
<li>From laptop, use <code>bash tunnel_notebook_comet.sh N</code> where N is the Comet login number (e.g. 2) from <a href="https://gist.github.com/zonca/5f8b5ccb826a774d3f89">https://gist.github.com/zonca/5f8b5ccb826a774d3f89</a></li>
<li>From laptop, open browser and connect to <code>http://localhost:YOURPORT</code></li>
</ul>
<h2>Detailed walkthrough</h2>
<h3>One time setup on Comet</h3>
<p>Log in to a Comet login node, edit the <code>.bashrc</code> file in your home folder (with <code>nano .bashrc</code> for example) and add <code>module load python scipy</code> at the bottom. This makes sure you always have the Python environment loaded in all your jobs. Log out, log back in, and make sure that <code>module list</code> shows <code>python</code> and <code>scipy</code>.</p>
<p>You need to be able to SSH from one node to another on Comet without a password. Create a new SSH key pair with <code>ssh-keygen</code>, hit enter to keep all default options, and DO NOT ENTER A PASSWORD. Then use <code>ssh-copy-id comet.sdsc.edu</code> and enter your password to make sure the key is copied to the authorized keys.
Now you can check it works by executing:</p>
<div class="highlight"><pre><span></span><span class="err">ssh comet.sdsc.edu</span>
</pre></div>
<p>from the login node and make sure you are NOT asked for your password.</p>
<h3>Configure the script for SLURM and submit the job</h3>
<p>Copy <code>submit_slurm_comet.sh</code> from <a href="https://gist.github.com/zonca/5f8b5ccb826a774d3f89">https://gist.github.com/zonca/5f8b5ccb826a774d3f89</a> on your home on Comet.</p>
<p>Change the port number in the script to a port of your choosing between 8000 and 9999, referenced as YOURPORT in the rest of the tutorial. Two users on the same login node cannot forward the same port, so try to avoid common port numbers such as 8000, 9000, 8080 or 8888.</p>
<p>Choose whether you prefer to use a full node to have access to all 24 cores and 128GB of RAM or if you only need 1 core and 5GB of RAM and change the top of the script accordingly.</p>
<p>Choose the duration of your job; for initial testing it is better to keep it at 30 minutes so your job starts straight away.</p>
<p>Submit the job to the scheduler:</p>
<div class="highlight"><pre><span></span><span class="err">sbatch submit_slurm_comet.sh</span>
</pre></div>
<p>Wait for the job to start running, you should see <code>R</code> in:</p>
<div class="highlight"><pre><span></span><span class="err">squeue -u $USER</span>
</pre></div>
<p>The script launches an IPython notebook on a computing node and tunnels its port to the login node.</p>
<p>You can check that everything worked by checking that no errors show up in the <code>notebook.log</code> file, and that you can access the notebook page with <code>wget</code>:</p>
<div class="highlight"><pre><span></span><span class="err">wget localhost:YOURPORT</span>
</pre></div>
<p>should download a <code>index.html</code> file in the current folder, and NOT give an error like "Connection refused".</p>
<p>Check which login node you were using on Comet, i.e. the hostname in your terminal on Comet, for example <code>comet-ln2</code>.</p>
<h3>Tunnel the port to your laptop</h3>
<h4>Linux / MAC</h4>
<p>Download the <code>tunnel_notebook_comet.sh</code> script from <a href="https://gist.github.com/zonca/5f8b5ccb826a774d3f89">https://gist.github.com/zonca/5f8b5ccb826a774d3f89</a>.</p>
<p>Customize the script with your port number.</p>
<p>Launch <code>bash tunnel_notebook_comet.sh N</code> where N is the Comet login node number. So if you were on <code>comet-ln2</code>, use <code>bash tunnel_notebook_comet.sh 2</code>.</p>
<p>The script forwards the port from the login node of comet to your laptop.</p>
<h4>Windows</h4>
<p>Install <code>putty</code>.</p>
<p>Follow tutorial for local port forwarding on <a href="https://www.akadia.com/services/ssh_putty.html/">https://www.akadia.com/services/ssh_putty.html/</a></p>
<ul>
<li>set <code>comet-ln2.sdsc.edu</code> as remote host, 22 as SSH port</li>
<li>set YOURPORT as tunnel port, replace both 8080 and 80 in the tutorial with your port number. </li>
</ul>
<h3>Connect to the Notebook</h3>
<p>Open a browser and type <code>http://localhost:YOURPORT</code> in the address bar.</p>Run Jupyterhub on a Supercomputer2015-04-02T09:00:00-07:002015-04-02T09:00:00-07:00Andrea Zoncatag:zonca.github.io,2015-04-02:/2015/04/jupyterhub-hpc.html<blockquote>
<p><strong>Summary</strong>: I developed a plugin for <a href="https://github.com/jupyter/jupyterhub" title="jupyterhub">Jupyterhub</a>: <a href="https://github.com/zonca/remotespawner">RemoteSpawner</a>. It has a proof-of-concept interface with the Supercomputer Gordon at UC San Diego to spawn IPython Notebook instances as jobs through the queue and tunnel the interface back to the Jupyterhub instance.</p>
</blockquote>
<p>The IPython (recently renamed Jupyter) Notebook is a powerful tool …</p><blockquote>
<p><strong>Summary</strong>: I developed a plugin for <a href="https://github.com/jupyter/jupyterhub" title="jupyterhub">Jupyterhub</a>: <a href="https://github.com/zonca/remotespawner">RemoteSpawner</a>. It has a proof-of-concept interface with the Supercomputer Gordon at UC San Diego to spawn IPython Notebook instances as jobs through the queue and tunnel the interface back to the Jupyterhub instance.</p>
</blockquote>
<p>The IPython (recently renamed Jupyter) Notebook is a powerful tool for analyzing and visualizing data in Python and other programming languages.
A key feature is that a single document contains code, figures, text and equations.
Everything is saved in a single .ipynb file that can be shared, executed and modified. See an <a href="http://nbviewer.ipython.org/github/waltherg/notebooks/blob/master/2013-12-03-Crank_Nicolson.ipynb" title="example notebook">example Notebook on integration of partial differential equations</a>.</p>
<p>The Jupyter Notebook is a Python application with a web frontend, i.e. the interface runs in the user browser.
This setup makes it suitable for any kind of remote computing, in particular running the Jupyter Notebook on a computing node of a Supercomputer, and exporting the interface HTTP port to a local browser.
Setting up tunneling via SSH is tedious, in particular if the user does not have a public IP address.</p>
<p><a href="https://github.com/jupyter/jupyterhub" title="jupyterhub">Jupyterhub</a>, developed by the Jupyter team, comes to the rescue by providing a web application that manages and proxies multiple instances of the Jupyter Notebook for any number of users.
Jupyterhub natively only spawns local processes, but supports plugins to extend its functionality.</p>
<p>I have been developing a proof-of-concept plugin (<a href="https://github.com/zonca/remotespawner">RemoteSpawner</a>) designed to run on a web server; once a user is authenticated, it connects to the login node of a Supercomputer and submits a Jupyter Notebook job.
As soon as the job starts executing, it sets up SSH tunneling with the Jupyterhub host so that
Jupyterhub can provide the Notebook interface to the user.
This setup allows users to simply access a Supercomputer via browser, with all their Python environment and data.</p>
<p>I am looking for interested parties either as users or as collaborators to help further development. See more information about the project below.</p>
<h2>Test it yourself</h2>
<p>In order to have a feeling on how Jupyterhub works, you can test in your browser at:</p>
<ul>
<li><a href="http://tmpnb.org">http://tmpnb.org</a></li>
</ul>
<p>This service by Rackspace creates temporary Jupyter Notebooks on the fly. If you click on <code>Welcome.ipynb</code>,
you can see an example Notebook.</p>
<p>The purpose of my project is to have a web interface to access Jupyter Notebooks that are
running on computing nodes of a Supercomputer, so that users can access the environment and
data on a Supercomputer from their browser and run data-intensive processing.</p>
<h2>Tour of Jupyterhub on the Gordon Supercomputer</h2>
<p>I'll show some screenshots to display how a test Jupyterhub installation on my machine is integrated with <a href="http://www.sdsc.edu/us/resources/gordon/">Gordon</a> thanks to the plugin.</p>
<p>Jupyterhub is accessed publicly via browser and the user can log in. Jupyterhub supports authentication via <code>PAM</code>/<code>LDAP</code>, so it could be integrated with XSEDE credentials; at the moment I am testing with local authentication.</p>
<p><img alt="jupyterhub-hpc-login.png" src="/images/jupyterhub-hpc-login.png"></p>
<p>Once the user is authenticated, Jupyterhub connects via <code>SSH</code> to a login node on Gordon and submits a batch serial job using <code>qsub</code>. The web interface waits for the job to start running. A dedicated queue with a quick turnaround would be useful for this kind of jobs.</p>
<p><img alt="jupyterhub-hpc-refresh.png" src="/images/jupyterhub-hpc-refresh.png">
<img alt="jupyterhub-hpc-job.png" src="/images/jupyterhub-hpc-job.png"></p>
<p>When the job starts running, it first sets up <code>SSH</code> tunneling between the Jupyterhub host and the computing node, then starts the Jupyter Notebook.
As soon as the web interface detects that the job is running, it proxies the tunneled HTTP port for the user. From this point the Jupyter Notebook works exactly like it would on a local machine.</p>
<p>See an example Notebook printing the hostname of the computing node:</p>
<p><img alt="jupyterhub-hpc-testnotebook.png" src="/images/jupyterhub-hpc-testnotebook.png"></p>
<p>Two other useful features of the Jupyter Notebook are a terminal:</p>
<p><img alt="jupyterhub-hpc-terminal.png" src="/images/jupyterhub-hpc-terminal.png"></p>
<p>and an editor that runs in the browser:</p>
<p><img alt="jupyterhub-hpc-editor.png" src="/images/jupyterhub-hpc-editor.png"></p>
<h2>Launch Jupyterhub parallel to access hundreds of computing engines</h2>
<p>The Notebook also supports using Torque to run Python computing engines and send them computationally intensive serial functions for load-balanced execution.</p>
<p>In the Notebook interface, in the <code>Clusters</code> tab, it is possible to choose the number of engines and click start to submit a job to the queue system:</p>
<p><img alt="jupyterhub-hpc-clusterlaunch.png" src="/images/jupyterhub-hpc-clusterlaunch.png"></p>
<p>This will pack 16 jobs per node (Gordon has 16-core CPUs) and make them available from the notebook; see an example where I process 1000 files with 128 engines running in a different job on Gordon:</p>
<ul>
<li><a href="http://nbviewer.ipython.org/gist/zonca/9bd94d8782af037704ff">Example of Jupyterhub Parallel</a></li>
</ul>Accelerate groupby operation on pixels with Numba2015-03-24T09:00:00-07:002015-03-24T09:00:00-07:00Andrea Zoncatag:zonca.github.io,2015-03-24:/2015/03/numba-groupby-pixels.html<p><a href="/notebooks/numba_groupby_pixels.ipynb">Download the original IPython notebook</a></p>
<h2>Astrophysics background</h2>
<p>It is very common in Astrophysics to work with sky pixels. The sky is tessellated in patches with specific properties and a sky map is then a collection of intensity values for each pixel. The most common pixelization used in Cosmology is <a href="http://healpix.jpl.nasa.gov">HEALPix …</a></p><p><a href="/notebooks/numba_groupby_pixels.ipynb">Download the original IPython notebook</a></p>
<h2>Astrophysics background</h2>
<p>It is very common in Astrophysics to work with sky pixels. The sky is tessellated in patches with specific properties and a sky map is then a collection of intensity values for each pixel. The most common pixelization used in Cosmology is <a href="http://healpix.jpl.nasa.gov">HEALPix</a>.</p>
<p>Measurements from telescopes are then represented as an array of pixels that encode the pointing of the instrument at each timestamp and the measurement output.</p>
<h2>Sample timeline</h2>
<div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>
<span class="kn">import</span> <span class="nn">numba</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="nn">np</span>
</pre></div>
<p>For simplicity let's assume we have a sky with 50K pixels:</p>
<div class="highlight"><pre><span></span><span class="err">NPIX = 50000</span>
</pre></div>
<p>And we have 50 million measurements from our instrument:</p>
<div class="highlight"><pre><span></span><span class="err">NTIME = int(50 * 1e6)</span>
</pre></div>
<p>The pointing of our instrument is an array of pixels, random in our sample case:</p>
<div class="highlight"><pre><span></span><span class="err">pixels = np.random.randint(0, NPIX, NTIME)</span>
</pre></div>
<p>Our data are also random:</p>
<div class="highlight"><pre><span></span><span class="err">timeline = np.random.randn(NTIME)</span>
</pre></div>
<h2>Create a map of the sky with pandas</h2>
<p>One of the most common operations is to sum all of our measurements in a sky map, so the value of each pixel in our sky map will be the sum of each individual measurement.
The easiest way is to use the <code>groupby</code> operation in <code>pandas</code>:</p>
<div class="highlight"><pre><span></span><span class="n">timeline_pandas</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">Series</span><span class="p">(</span><span class="n">timeline</span><span class="p">,</span> <span class="n">index</span><span class="o">=</span><span class="n">pixels</span><span class="p">)</span>
<span class="n">timeline_pandas</span><span class="p">.</span><span class="n">head</span><span class="p">()</span>
<span class="mi">46889</span> <span class="mf">0.407097</span>
<span class="mi">3638</span> <span class="mf">1.300001</span>
<span class="mi">6345</span> <span class="mf">0.174931</span>
<span class="mi">15742</span> <span class="o">-</span><span class="mf">0.255958</span>
<span class="mi">34308</span> <span class="mf">1.147338</span>
<span class="nl">dtype</span><span class="p">:</span> <span class="n">float64</span>
<span class="nf">%time</span> <span class="n">m</span> <span class="o">=</span> <span class="n">timeline_pandas</span><span class="p">.</span><span class="n">groupby</span><span class="p">(</span><span class="n">level</span><span class="o">=</span><span class="mi">0</span><span class="p">).</span><span class="n">sum</span><span class="p">()</span>
<span class="n">CPU</span> <span class="nl">times</span><span class="p">:</span> <span class="n">user</span> <span class="mf">4.09</span> <span class="n">s</span><span class="p">,</span> <span class="nl">sys</span><span class="p">:</span> <span class="mi">471</span> <span class="n">ms</span><span class="p">,</span> <span class="nl">total</span><span class="p">:</span> <span class="mf">4.56</span> <span class="n">s</span>
<span class="n">Wall</span> <span class="nl">time</span><span class="p">:</span> <span class="mf">4.55</span> <span class="n">s</span>
</pre></div>
<h2>Create a map of the sky with numba</h2>
<p>We would like to improve the performance of this operation using <code>numba</code>, which automatically produces C-speed compiled code from pure Python functions.</p>
<p>First we need to develop a pure Python version of the code, test it, and then have <code>numba</code> optimize it:</p>
<div class="highlight"><pre>def groupby_python(index, value, output):
    for i in range(index.shape[0]):
        output[index[i]] += value[i]

m_python = np.zeros_like(m)
%time groupby_python(pixels, timeline, m_python)
CPU times: user 37.5 s, sys: 0 ns, total: 37.5 s
Wall time: 37.6 s
np.testing.assert_allclose(m_python, m)
</pre></div>
<p>As expected, pure Python is much slower than the <code>pandas</code> version, which is implemented in <code>cython</code>.</p>
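<p>To make the semantics of the loop concrete, here is the same accumulation run on tiny toy inputs (plain Python lists with made-up values, just for illustration):</p>

```python
def groupby_python(index, value, output):
    # accumulate each sample into the output bin selected by its index
    for i in range(len(index)):
        output[index[i]] += value[i]

# toy data: 4 samples falling into 3 bins
pixels = [0, 2, 2, 1]
timeline = [1.0, 2.0, 3.0, 4.0]
m_python = [0.0, 0.0, 0.0]
groupby_python(pixels, timeline, m_python)
print(m_python)  # [1.0, 4.0, 5.0]
```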
<h3>Optimize the function with numba.jit</h3>
<p><code>numba.jit</code> takes a function as input and creates a compiled version that does not depend on slow Python calls; this is enforced by <code>nopython=True</code>: <code>numba</code> throws an error if the function cannot be compiled in <code>nopython</code> mode.</p>
<div class="highlight"><pre>groupby_numba = numba.jit(groupby_python, nopython=True)
m_numba = np.zeros_like(m)
%time groupby_numba(pixels, timeline, m_numba)
CPU times: user 274 ms, sys: 5 ms, total: 279 ms
Wall time: 278 ms
np.testing.assert_allclose(m_numba, m)
</pre></div>
<p>The performance improvement is about 100x over pure Python and 20x over <code>pandas</code>, pretty good!</p>
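<p>As a side note, <code>numpy</code> itself offers <code>np.add.at</code>, an unbuffered version of the same grouped accumulation: it needs no compilation step, although it may not match the speed of the compiled loop on large arrays. A sketch on toy data with made-up values:</p>

```python
import numpy as np

pixels = np.array([0, 2, 2, 1])            # toy bin indices
timeline = np.array([1.0, 2.0, 3.0, 4.0])  # toy samples
m_at = np.zeros(3)

# unbuffered: repeated indices accumulate, unlike m_at[pixels] += timeline
np.add.at(m_at, pixels, timeline)
print(m_at)  # [1. 4. 5.]
```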
<h2>Use numba.jit as a decorator</h2>
<p>The exact same result is obtained if we use <code>numba.jit</code> as a decorator:</p>
<div class="highlight"><pre>@numba.jit(nopython=True)
def groupby_numba(index, value, output):
    for i in range(index.shape[0]):
        output[index[i]] += value[i]
</pre></div>Software Carpentry setup for Chromebook2015-02-10T20:00:00-08:002015-02-10T20:00:00-08:00Andrea Zoncatag:zonca.github.io,2015-02-10:/2015/02/software-carpentry-setup-chromebook.html<p>In this post I'll provide instructions on how to install the main requirements of a <a href="http://software-carpentry.org">Software Carpentry workshop</a> on
a Chromebook: Bash, git, IPython notebook and R.</p>
<h2>Switch the Chromebook to Developer mode</h2>
<p>ChromeOS is very restrictive on what users can install on the machine.
The only way to get …</p><p>In this post I'll provide instructions on how to install the main requirements of a <a href="http://software-carpentry.org">Software Carpentry workshop</a> on
a Chromebook: Bash, git, IPython notebook and R.</p>
<h2>Switch the Chromebook to Developer mode</h2>
<p>ChromeOS is very restrictive on what users can install on the machine.
The only way to get around this is to switch to developer mode.</p>
<p>Switching to Developer mode <strong>wipes</strong> all the data on the local disk and
may void the warranty; do it at your own risk.</p>
<p>Instructions are available on the <a href="http://www.chromium.org/chromium-os/developer-information-for-chrome-os-devices">ChromeOS wiki</a>: you need
to click on your device name and follow the instructions there.
For most devices you need to switch the device off, then hold down <code>ESC</code> and <code>Refresh</code> and poke the <code>Power</code> button, then press <code>Ctrl-D</code> at the
Recovery screen (there is no prompt, you have to know to do it).
This will wipe the device and activate Developer mode.</p>
<p>Once you reboot and enter your Google credentials, the Chromebook will copy back from Google servers all of your settings.</p>
<p>Now you are in Developer mode, the main feature is that you have a <code>root</code> (superuser) shell you can activate using <code>Ctrl-Alt-T</code>.</p>
<p>The worst issue of Developer mode is that at each boot the system displays a scary screen warning that OS verification is off and asks if you would like to leave Developer mode. If you press <code>Ctrl-D</code> or wait 30 seconds, it boots ChromeOS in Developer mode; if you instead hit Space, it wipes
everything and switches back to Normal mode.</p>
<h2>Install Ubuntu with crouton</h2>
<p>You can now install Ubuntu using <a href="https://github.com/dnschneid/crouton">crouton</a>; full instructions are on its page, in summary:</p>
<ul>
<li>First you need to install the <a href="https://goo.gl/OVQOEt">Crouton Chrome extension</a> on ChromeOS</li>
<li>Download the latest release from <a href="https://goo.gl/fd3zc">https://goo.gl/fd3zc</a></li>
<li>Open the ChromeOS shell using <code>Ctrl-Alt-T</code>, type <code>shell</code> at the prompt and hit enter</li>
<li>Run <code>sudo sh ~/Downloads/crouton -t xfce,xiwi -r trusty</code>; this installs Ubuntu Trusty with the xfce desktop and uses <code>xiwi</code> to run it in a window.</li>
</ul>
<p>Now you can have Ubuntu running in a window of the Chromebook browser by:</p>
<ul>
<li>Press <code>Ctrl-Alt-T</code></li>
<li>type <code>shell</code> at the prompt and hit enter</li>
<li>type <code>sudo startxfce4</code></li>
</ul>
<p>What is great about <code>crouton</code> is that it is not like a Virtual Machine: Ubuntu runs at full performance on the same Linux kernel as ChromeOS.</p>
<h2>Install scientific computing stack</h2>
<p>You can now follow the instructions for
Linux at <a href="http://software-carpentry.org/v5/setup.html">http://software-carpentry.org/v5/setup.html</a>, summary of commands to run in a terminal:</p>
<ul>
<li><code>sudo apt install nano</code></li>
<li><code>sudo apt install git</code></li>
<li>To install R: <code>sudo apt install r-base</code></li>
<li>Download Anaconda Python 3 64bit for Linux from <a href="http://continuum.io/downloads">http://continuum.io/downloads</a> and execute it</li>
</ul>
<p>Anaconda will run under Ubuntu but when you open an IPython notebook, it will automatically open a new tab in the main browser of ChromeOS, not
inside the Ubuntu window.</p>
<h2>Final note</h2>
<p>I admit it looks scary, but I have personally followed this procedure successfully on 2 Chromebooks: Samsung Chromebook 1 and Toshiba Chromebook 2.</p>
<p>See the screenshot of my Chromebook below: the Ubuntu window on the right runs <code>git</code>, <code>nano</code> and <code>IPython notebook</code>; the <code>IPython notebook</code> window opens in Chrome, in the left window (click to enlarge).</p>
<p><a href="/images/screenshot-chromebook.png"><img src="/images/screenshot-chromebook.png" alt="Screenshot Chromebook click for full resolution" style="width: 730px;"/></a></p>
<p>It is also possible to switch the Chromebook to Developer mode and install Anaconda and git directly there; however, I think that for a complete scientific computing platform it is a lot better to have all of the packages provided by Ubuntu.</p>Zero based indexing2014-10-22T10:00:00-07:002014-10-22T10:00:00-07:00Andrea Zoncatag:zonca.github.io,2014-10-22:/2014/10/zero-based-indexing.html<h2>Reads</h2>
<ul>
<li>Dijkstra: <a href="https://www.cs.utexas.edu/~EWD/transcriptions/EWD08xx/EWD831.html">https://www.cs.utexas.edu/~EWD/transcriptions/EWD08xx/EWD831.html</a></li>
<li>Guido van Rossum: <a href="https://plus.google.com/115212051037621986145/posts/YTUxbXYZyfi">https://plus.google.com/115212051037621986145/posts/YTUxbXYZyfi</a></li>
</ul>
<h2>Comment</h2>
<p>For Europeans zero based indexing feels reasonable if we think of floors in a house,
the lowest floor is ground floor, then 1st floor and so on …</p><h2>Reads</h2>
<ul>
<li>Dijkstra: <a href="https://www.cs.utexas.edu/~EWD/transcriptions/EWD08xx/EWD831.html">https://www.cs.utexas.edu/~EWD/transcriptions/EWD08xx/EWD831.html</a></li>
<li>Guido van Rossum: <a href="https://plus.google.com/115212051037621986145/posts/YTUxbXYZyfi">https://plus.google.com/115212051037621986145/posts/YTUxbXYZyfi</a></li>
</ul>
<h2>Comment</h2>
<p>For Europeans, zero-based indexing feels reasonable if we think of floors in a house:
the lowest floor is the ground floor, then the 1st floor and so on.</p>
<p>A house with 2 stories has a ground floor and a 1st floor. In this way it is natural to index
zero-based and to count 1-based.</p>
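<p>The house analogy, spelled out in Python:</p>

```python
# index zero-based, count one-based: like floors in a European house
floors = ["ground", "1st"]
assert floors[0] == "ground"  # the lowest floor has index 0
assert floors[1] == "1st"     # the next one is the 1st floor
assert len(floors) == 2       # but the house has 2 stories
```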
<p>What about <strong>slicing</strong> instead? This is a separate issue from indexing.
The main problem here is that if you include the upper bound then you cannot express
the empty slice.
Also it is elegant to express the first <code>n</code> elements as <code>a[:n]</code>. Slicing <code>a[i:j]</code> excludes
the upper bound, so it is probably easier to understand a slice of <code>n</code> elements starting at <code>i</code> if we express it as <code>a[i:i+n]</code>.</p>Write unit tests as cells of IPython notebooks2014-09-30T14:00:00-07:002014-09-30T14:00:00-07:00Andrea Zoncatag:zonca.github.io,2014-09-30:/2014/09/unit-tests-ipython-notebook.html<h2>What?</h2>
<p>Plugin for <code>py.test</code> to write unit tests as cells in IPython notebooks:</p>
<ul>
<li>Homepage on Github: <a href="https://github.com/zonca/pytest-ipynb">https://github.com/zonca/pytest-ipynb</a></li>
<li>PyPi : <a href="https://pypi.python.org/pypi/pytest-ipynb/">https://pypi.python.org/pypi/pytest-ipynb/</a></li>
<li>Install with <code>pip install pytest-ipynb</code></li>
</ul>
<h2>Why?</h2>
<p>Many unit testing frameworks in Python, first of all the <code>unittest</code> package in the standard …</p><h2>What?</h2>
<p>Plugin for <code>py.test</code> to write unit tests as cells in IPython notebooks:</p>
<ul>
<li>Homepage on Github: <a href="https://github.com/zonca/pytest-ipynb">https://github.com/zonca/pytest-ipynb</a></li>
<li>PyPi : <a href="https://pypi.python.org/pypi/pytest-ipynb/">https://pypi.python.org/pypi/pytest-ipynb/</a></li>
<li>Install with <code>pip install pytest-ipynb</code></li>
</ul>
<h2>Why?</h2>
<p>Many unit testing frameworks in Python, first of all the <code>unittest</code> package in the standard library, work very well for automating unit tests, but make it very difficult to interactively debug a failed test.</p>
<p><a href="http://pytest.org"><code>py.test</code></a> alleviates this problem by letting you write plain Python functions with <code>assert</code> statements (no boilerplate code); it discovers them automatically in any file whose name starts with <code>test</code> and writes a useful report.</p>
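<p>For example, a minimal test file (hypothetical names) is just:</p>

```python
# test_arithmetic.py -- py.test discovers files and functions starting with "test"
def add(a, b):
    return a + b

def test_add():
    # a plain assert, no boilerplate classes or assertEqual methods
    assert add(2, 3) == 5
```

<p>Running <code>py.test</code> in that directory finds <code>test_add</code> and runs it automatically.</p>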
<p>I wrote a plugin for <code>py.test</code>, <a href="https://pypi.python.org/pypi/pytest-ipynb"><code>pytest-ipynb</code></a>, that goes a step further and runs unit tests written as cells of any IPython notebook named <code>test*.ipynb</code>.</p>
<p>The advantage is that any issue is easy to reproduce and debug by opening the test notebook interactively; you then clean the notebook outputs and add it to the software repository.</p>
<p>More details on Github: <a href="https://github.com/zonca/pytest-ipynb">https://github.com/zonca/pytest-ipynb</a></p>
<p>Suggestions welcome as comments or github issues.</p>
<p>(Yes, works with Python 3)</p>How to perform code review for scientific software2014-08-28T17:00:00-07:002014-08-28T17:00:00-07:00Andrea Zoncatag:zonca.github.io,2014-08-28:/2014/08/code-review-for-scientific-computing.html<p>Code review is the formal process where a programmer inspects in detail a piece of software developed by somebody else, in order to improve code quality by catching bugs and improving readability and usability.
It is used extensively in industry, not so much in academia.</p>
<p>There has been some discussion about this …</p><p>Code review is the formal process where a programmer inspects in detail a piece of software developed by somebody else, in order to improve code quality by catching bugs and improving readability and usability.
It is used extensively in industry, not so much in academia.</p>
<p>There has been some discussion about this lately, see:
* <a href="http://ivory.idyll.org/blog/on-code-review-of-scientific-code.html">A few thoughts on code review of scientific code</a> by Titus Brown
* <a href="http://mozillascience.org/code-review-for-science-what-we-learned/">Code review for science: What we learned</a> by Kaitlin Thaney</p>
<p>I participated in the <a href="http://software-carpentry.org/blog/2014/01/code-review-round-2.html">second code review pilot study of Software Carpentry</a>, where I was paired with a research group in Genomics and reviewed some of their analysis code.
In this blog post I'd like to write about some guidelines and best practices on how to perform code review of scientific code.</p>
<p>Code review is best applied to libraries, prior to publication, because an improvement in code quality can help future users of the code; one-off analysis scripts benefit less from the process.</p>
<h2>How to do a code review of a large codebase</h2>
<p>The code review process should be performed on ~200-400 lines of code at a time.
The first thing is to ask the code author whether she can identify different functionalities of the code that could be packaged and distributed separately; modularity really helps in maintaining software in the long term.</p>
<p>Then the author should follow these steps to get ready for the code review:</p>
<ul>
<li>For each of the packages identified previously, the code author should create a separate repository, generally on Github, possibly under an organization account (see <a href="http://zonca.github.io/2014/08/github-for-research-groups.html">Github for research groups</a>).</li>
<li>Create a blank project in the programming language of choice (hopefully Python!) using a pre-defined standard template; I recommend <a href="https://github.com/audreyr/cookiecutter">CookieCutter</a>.</li>
<li>Write a <code>README.md</code> file explaining exactly the functionality of the code in general</li>
<li>Clone the repository locally, add, commit and push the blank project with <code>README.md</code> to the <code>master</code> branch on Github</li>
<li>Identify a portion of the software of about ~200-400 lines that has a defined functionality and that could be reviewed together. It doesn't necessarily need to be in a runnable state, at the beginning we can start the code review without running the code.</li>
<li>Create a new branch locally and copy, add, commit this file or this set of files to the repository and push to Github</li>
<li>Access the web interface of Github, it should have detected that you just pushed a new branch and asked if you want to create a pull request. Create a pull request with a few details on the code under review.</li>
<li>Point the reviewer to the pull request</li>
</ul>
<h2>How to review an improvement to the software</h2>
<p>The implementation of a feature should be performed on a separate branch, then it is straightforward to push it to Github, create a pull request and ask reviewers to look at the set of changes.</p>
<h2>How to perform the actual code review</h2>
<p>Coding style should not be the main focus of the review; the most important feedback for the author is high-level comments on software organization. The reviewer should focus on what makes the software more usable and more maintainable.</p>
<p>A few examples:</p>
<ul>
<li>can some parts of the code be simplified?</li>
<li>is there any functionality that could be replaced by an existing library?</li>
<li>is it clear what each part of the software is doing?</li>
<li>is there a more straightforward way of splitting the code into files?</li>
<li>is documentation enough?</li>
<li>are there some function arguments or function names that could be easily misinterpreted by a user?</li>
</ul>
<p>The purpose is to improve the code, but also to help the code author to improve her coding skills.</p>
<p>On the Github pull requests interface, it is possible both to write general comments, and to click on a single line of code and write an inline comment.</p>
<h2>How to implement reviewer's recommendations</h2>
<p>The author can improve the code locally on the same branch used in the pull request, then commit and push the changes to Github; the changes will be automatically added to the existing pull request, so the reviewer can start another iteration of the review process.</p>
<p>Comments and suggestions are welcome.</p>Create a Github account for your research group with free private repositories2014-08-19T15:00:00-07:002014-08-19T15:00:00-07:00Andrea Zoncatag:zonca.github.io,2014-08-19:/2014/08/github-for-research-groups.html<p>See the <strong>updated version</strong> at <a href="https://zonca.github.io/2019/08/github-for-research-groups.html">https://zonca.github.io/2019/08/github-for-research-groups.html</a></p>
<p><a href="https://github.com/">Github</a> allows a research group to create their own webpage where they can host, share and develop their software using the <code>git</code> version control system and the powerful Github online issue-tracking interface.</p>
<p>Since February 2014 Github also …</p><p>See the <strong>updated version</strong> at <a href="https://zonca.github.io/2019/08/github-for-research-groups.html">https://zonca.github.io/2019/08/github-for-research-groups.html</a></p>
<p><a href="https://github.com/">Github</a> allows a research group to create their own webpage where they can host, share and develop their software using the <code>git</code> version control system and the powerful Github online issue-tracking interface.</p>
<p>Since February 2014 Github also offers 20 private repositories to research groups and classrooms, plus unlimited public repositories.
Private repositories are useful in early stages of development or when it is necessary to keep software secret before publication; at publication they can easily be switched to public repositories and free up their slots.</p>
<p>Here the steps to set this up:</p>
<ul>
<li>Create a user account on Github and choose the free plan, use your <code>.edu</code> email address</li>
<li>Create an organization account for your research group</li>
<li>Go to https://education.github.com/ and click on "Request a discount"</li>
<li>Choose your position, e.g. Researcher, and select that you want a discount for an organization</li>
<li>Choose the organization you created earlier and confirm that it is a "Research group"</li>
<li>Add details about your Research group</li>
<li>Finally you need to upload a picture of your University ID card and write how you plan on using the repositories</li>
<li>Within a week at most, but generally in less than 24 hours, you will be approved for 20 private repositories.</li>
</ul>
<p>Once the organization is created, you can add key team members to the "Owners" group, and then create another group for students and collaborators.</p>
<p>Consider also that it is not necessary for every collaborator to have write access to your repositories. My recommendation is to ask a more experienced team member to administer the central repository, ask the students to fork the repository under their user accounts (forks of private repositories are always private, free and don't use any slot), and then <a href="https://help.github.com/articles/using-pull-requests">send a pull request</a> to the central repository for the administrator to review, discuss and merge.</p>
<p>See for example the organization account of the <a href="https://github.com/ged-lab">"Genomics, Evolution, and Development" at Michigan State U led by Dr. C. Titus Brown</a> where they share code, documentation and papers. Open Science!!</p>
<p>Other suggestions on the setup very welcome!</p>Thoughts on a career as a computational scientist2014-06-05T14:00:00-07:002014-06-05T14:00:00-07:00Andrea Zoncatag:zonca.github.io,2014-06-05:/2014/06/career-as-a-computational-scientist.html<p>Recently I've been asked what are the prospects of a wannabe computational scientist,
both in terms of training and in terms of job opportunities.</p>
<p>So I am writing this blog post about my personal experience.</p>
<h2>What is a computational scientist?</h2>
<p>In my understanding, a computational scientist is a scientist with …</p><p>Recently I've been asked what are the prospects of a wannabe computational scientist,
both in terms of training and in terms of job opportunities.</p>
<p>So I am writing this blog post about my personal experience.</p>
<h2>What is a computational scientist?</h2>
<p>In my understanding, a computational scientist is a scientist with strong skills in scientific computing who
spends most of the day building software.</p>
<p>Usually there are 2 main areas, in any field of science:</p>
<ol>
<li><em>Data analysis</em>: historically only a few fields of science, e.g. Astrophysics, had to deal with large amounts
of experimental data; nowadays every field can generate
extremely large amounts of data thanks to modern technology.
The task of the computational scientist is generally to analyze the data, i.e. cleanup, check systematic effects,
calibrate, understand and reduce to a form to be used for scientific exploitation.
Generally a second phase of data analysis involves model fitting, i.e. check which theoretical models best fit the
data and estimate their parameters with error bars, this requires knowledge of Statistics and Bayesian techniques,
like Markov Chain Monte Carlo (MCMC).</li>
<li><em>Simulations</em>: production of artificial data, either in its own right to aid the understanding of scientific models, or
to reproduce experimental data in order to characterize the response of a scientific instrument. </li>
</ol>
<h2>Skills of a computational scientist</h2>
<p>Starting out as a computational scientist nowadays is quite easy; with a background in any field of science, it is possible to improve computational skills thanks to several learning resources, for example:</p>
<ul>
<li>Free online video classes on <a href="https://www.coursera.org/courses?search=python">Coursera</a>, <a href="https://www.udacity.com/courses#!/data-science">Udacity</a> and others</li>
<li><a href="http://software-carpentry.org">Software Carpentry</a> runs bootcamps for scientists to improve their computational skills</li>
<li>Online tutorials on <a href="http://scipy-lectures.github.io/">Python for scientific computing</a></li>
<li>Books, e.g. <a href="http://shop.oreilly.com/product/0636920023784.do">Python for Data Analysis</a></li>
</ul>
<p>Basically it is important to have good experience with at least one programming language; Python is the safest option because:</p>
<ul>
<li>it is well established in many fields of science</li>
<li>its syntax is easier to learn than most other common programming languages</li>
<li>it has the largest number of scientific libraries </li>
<li>it is easy to interface with other languages, i.e. we can reuse legacy code implemented in C/C++/FORTRAN</li>
<li>it can be used also when developing something unusual for a computational scientist, like web development (<code>django</code>) or interfacing with hardware (<code>pyserial</code>).</li>
</ul>
<p>Python performance is comparable to C/C++/Java when we make use of optimized libraries like <code>numpy</code>, <code>pandas</code> and <code>scipy</code>, which
provide Python frontends to highly optimized C or Fortran code; it is therefore necessary to avoid explicit for loops and learn
to write "vectorized" code, which processes entire arrays and matrices in one step.</p>
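<p>For example, a sum of squares written as an explicit loop versus the "vectorized" form (a small illustrative example, not from the original post):</p>

```python
import numpy as np

a = np.arange(100_000, dtype=np.float64)

# explicit for loop: each element passes through the Python interpreter
total_loop = 0.0
for x in a:
    total_loop += x * x

# vectorized: the whole array is processed in one call to optimized C code,
# typically orders of magnitude faster on large arrays
total_vec = (a * a).sum()

assert np.isclose(total_loop, total_vec)
```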
<p>Some important Python tools to learn are:</p>
<ul>
<li><code>IPython</code> notebooks to write documents with code, documentation and plots embedded </li>
<li><code>numpy</code> and <code>pandas</code> for data management</li>
<li><code>matplotlib</code> for plotting</li>
<li><code>h5py</code> or <code>pytables</code>, HDF5 binary files manipulation</li>
<li><a href="http://www.jeffknupp.com/blog/2013/08/16/open-sourcing-a-python-project-the-right-way/">how to publish a Python package</a></li>
<li><code>emcee</code> for MCMC</li>
<li><code>scipy</code> for signal processing, FFT, optimization, integration, 2d array processing</li>
<li><code>scikit-learn</code> for Machine Learning</li>
<li><code>scikit-image</code> for image processing </li>
<li>Object oriented programming</li>
</ul>
<p>For parallel programming:</p>
<ul>
<li><code>IPython parallel</code> for distributing large amounts of serial and independent jobs on a cluster</li>
<li><code>PyTrilinos</code> for distributed linear algebra (high level operations with data distributed across nodes, automatic MPI communication)</li>
<li><code>mpi4py</code> for manually creating data communication via MPI</li>
</ul>
<p>On top of Python, it is also useful to learn a bit of shell scripting with <code>bash</code>, which is better suited for simple automation tasks,
and it is fundamental to learn version control with git or mercurial.</p>
<h2>My experience</h2>
<p>I trained as an Aerospace Engineer for my Master degree, and then moved to a PhD in Astrophysics in Milano,
where I worked in the Planck collaboration and took care of simulating the inband response of the Low Frequency Instrument
detectors.
During my PhD I developed a good proficiency with Python, mainly using it for task automation and plotting.
My previous programming experience was minimal, only some Matlab during the last year of my Master degree, but I found Python really easy to use
and learned it myself from books and online tutorials.
With no formal education in Computer Science, the most complicated concept to grasp was Object Oriented programming; at the time
I was moonlighting as a web developer and I familiarized myself with OO using Django models.
After my PhD I got a PostDoc position at the University of California, Santa Barbara; there I had access to supercomputers for the first time
and my job involved analyzing large amounts of data.
During 4 years at UCSB I had the great opportunity of choosing my own tools, implementing my own software for data processing,
so I immediately saw the value of improving my understanding of software development best practices.</p>
<p>Unfortunately in science there is usually a push toward hacking together a quick and dirty solution to get results out and move forward;
I instead focused on learning how to build easily maintainable libraries that I could re-use in the future. This
involved learning more advanced Python, version control, unit testing and so on. I learned these tools by reading tutorials and
documentation on the web, answers on StackOverflow, blog posts.
It also helped that I became one of the core developers of <code>healpy</code>, a Python package for processing pixelized sky maps.</p>
<p>In 2013, in the 4th year of my PostDoc and with the Planck mission nearing its end in 2015, I was looking for a position
as a computational scientist, mainly as a research scientist (i.e. doing research/data analysis full time, with a long-term contract)
at research labs like Berkeley Lab or Jet Propulsion Laboratory, or in a research group in Cosmology/Astrophysics or in
High Performance Computing.</p>
<p>I was hired at the San Diego Supercomputer Center in December 2013 as permanent staff, mainly thanks to my experience with data analysis,
Python and parallel programming. Here I collaborate with research groups in any field of science and help them deploy and optimize their software on supercomputers at SDSC or at other XSEDE centers.</p>
<h2>Thoughts about a career as a computational scientist</h2>
<p>After a PhD program, a computational scientist with experience in either data analysis or simulation, especially with experience in parallel programming, should quite easily find a position as a PostDoc: lots of research groups have huge amounts of data and need skilled software development labor.</p>
<p>I believe the complicated part is the next step: faculty jobs favour scientists with the best scientific publications, and software development is generally not recognized as a first-class scientific product.
Very interesting opportunities in Academia are Research Scientist positions either at research facilities, for example Lawrence Berkeley Labs and NASA Jet Propulsion Laboratory, or at supercomputer centers. These jobs are often permanent positions, unless the institution runs out of funding, and allow working 100% on research.
Another opportunity is to work as a Research Scientist in a specific research group at a University; this is less common, and depends on the availability of long-term funding.</p>
<p>Still, the total number of available positions in Academia is not very high, therefore it is very important to also keep open the opportunity of a job in Industry. Fortunately nowadays most skills of a computational scientist are very well recognized in Industry, so I recommend choosing, whenever possible, tools that are also widely used outside of Academia, for example Python, version control with Git, shell scripting, unit testing, databases, multi-core programming, parallel programming, GPU programming and so on.</p>
<p><em>Acknowledgement</em>: thanks to Priscilla Kelly for discussion on this topic and review of the post</p>
<p><em>Comments/feedback</em>: comment on the blog using Google+ or tweet to <a href="http://twitter.com/andreazonca">@andreazonca</a></p>Machine learning at scale with Python2014-03-20T20:00:00-07:002014-03-20T20:00:00-07:00Andrea Zoncatag:zonca.github.io,2014-03-20:/2014/03/machine-learning-at-scale-with-python.html<p>My talk for the San Diego Data Science meetup: <a href="http://www.meetup.com/San-Diego-Data-Science-R-Users-Group/events/170967362/">http://www.meetup.com/San-Diego-Data-Science-R-Users-Group/events/170967362/</a></p>
<p>About:</p>
<ul>
<li>Setup <a href="http://star.mit.edu/cluster/">StarCluster</a> to launch EC2 instances</li>
<li>Running IPython Notebook on Amazon EC2</li>
<li>Running single node Machine Learning jobs using multiple cores</li>
<li>
<p>Distributing jobs with IPython parallel to multiple EC2 instances</p>
</li>
<li>
<p>See HTML5 <strong>slides …</strong></p></li></ul><p>My talk for the San Diego Data Science meetup: <a href="http://www.meetup.com/San-Diego-Data-Science-R-Users-Group/events/170967362/">http://www.meetup.com/San-Diego-Data-Science-R-Users-Group/events/170967362/</a></p>
<p>About:</p>
<ul>
<li>Setup <a href="http://star.mit.edu/cluster/">StarCluster</a> to launch EC2 instances</li>
<li>Running IPython Notebook on Amazon EC2</li>
<li>Running single node Machine Learning jobs using multiple cores</li>
<li>
<p>Distributing jobs with IPython parallel to multiple EC2 instances</p>
</li>
<li>
<p>See HTML5 <strong>slides</strong>: <a href="http://bit.ly/ml-ec2">http://bit.ly/ml-ec2</a></p>
</li>
<li>See the IPython notebook sources of the slides: <a href="http://bit.ly/ml-ec2-ipynb">http://bit.ly/ml-ec2-ipynb</a></li>
</ul>
<p>Finally the Github repository with additional material, under MIT license:
<a href="https://github.com/zonca/machine-learning-at-scale-with-python">https://github.com/zonca/machine-learning-at-scale-with-python</a></p>
<p>Any feedback is appreciated, google+, twitter or email.</p>Python on Gordon2014-03-20T19:30:00-07:002014-03-20T19:30:00-07:00Andrea Zoncatag:zonca.github.io,2014-03-20:/2014/03/setup-ipython-notebook-parallel-Gordon.html<p>Gordon has already a <code>python</code> environment setup which can be activated by loading the <code>python</code> module:</p>
<div class="highlight"><pre><span></span><span class="err">module load python # add this to .bashrc to load it at every login</span>
</pre></div>
<h3>Install virtualenv</h3>
<p>Then we need to setup a sandboxed local environment to install other packages, by using <code>virtualenv</code>, get the link …</p><p>Gordon has already a <code>python</code> environment setup which can be activated by loading the <code>python</code> module:</p>
<div class="highlight"><pre><span></span><span class="err">module load python # add this to .bashrc to load it at every login</span>
</pre></div>
<h3>Install virtualenv</h3>
<p>Then we need to set up a sandboxed local environment to install other packages using <code>virtualenv</code>: get the link to the latest version from <a href="https://pypi.python.org/pypi/virtualenv">https://pypi.python.org/pypi/virtualenv</a>, then download it on Gordon and unpack it, e.g.</p>
<div class="highlight"><pre><span></span><span class="err">wget --no-check-certificate https://pypi.python.org/packages/source/v/virtualenv/virtualenv-1.11.2.tar.gz</span>
<span class="err">tar xzvf virtualenv*tar.gz</span>
</pre></div>
<p>Then create your own virtualenv and load it:</p>
<div class="highlight"><pre><span></span><span class="err">mkdir ~/venv</span>
<span class="err">python virtualenv-*/virtualenv.py ~/venv/py</span>
<span class="err">source ~/venv/py/bin/activate # add this to .bashrc to load it at every login</span>
</pre></div>
<p>you can restore your previous environment by deactivating the virtualenv:</p>
<div class="highlight"><pre><span></span><span class="err">deactivate # from your bash prompt</span>
</pre></div>
<h3>Install IPython</h3>
<p>Using <code>pip</code> you can install <code>IPython</code> and all dependencies for the notebook and parallel tools running:</p>
<div class="highlight"><pre><span></span><span class="err">pip install ipython pyzmq tornado jinja2</span>
</pre></div>
<h3>Configure the IPython notebook</h3>
<p>For interactive data exploration, you can run the <code>IPython</code> notebook on a computing node on Gordon and export the web interface to your local machine, which also embeds all the plots.
Configuring the tunnelling over SSH is complicated, so I created a script; it takes a little time to set up but is then very easy to use, see <a href="https://github.com/pyHPC/ipynbhpc">https://github.com/pyHPC/ipynbhpc</a>.</p>
<h3>Configure IPython parallel</h3>
<p><a href="http://ipython.org/ipython-doc/stable/parallel/">IPython parallel</a> on Gordon allows you to launch a <code>PBS</code> job with tens (or hundreds) of Python engines and then easily submit hundreds (or thousands) of serial jobs to be executed with automatic load balancing.
First of all, create the default configuration files:</p>
<div class="highlight"><pre><span></span><span class="err">ipython profile create --parallel</span>
</pre></div>
<p>Then, in <code>~/.ipython/profile_default/ipcluster_config.py</code>, you need to set:</p>
<div class="highlight"><pre><span></span><span class="err">c.IPClusterStart.controller_launcher_class = 'LocalControllerLauncher' </span>
<span class="err">c.IPClusterStart.engine_launcher_class = 'PBS' </span>
<span class="err">c.PBSLauncher.batch_template_file = u'/home/REPLACEWITHYOURUSER/.ipython/profile_default/pbs.engine.template' # "~" does not work</span>
</pre></div>
<p>You also need to allow connections to the controller from other hosts, setting in <code>~/.ipython/profile_default/ipcontroller_config.py</code>: </p>
<div class="highlight"><pre><span></span><span class="err">c.HubFactory.ip = '*'</span>
<span class="err">c.HubFactory.engine_ip = '*'</span>
</pre></div>
<p>Finally create the PBS template <code>~/.ipython/profile_default/pbs.engine.template</code>:</p>
<table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre> 1
2
3
4
5
6
7
8
9
10</pre></div></td><td class="code"><div class="highlight"><pre><span></span><span class="ch">#!/bin/bash</span>
<span class="c1">#PBS -q normal</span>
<span class="c1">#PBS -N ipcluster</span>
<span class="c1">#PBS -l nodes={n/16}:ppn=16:native</span>
<span class="c1">#PBS -l walltime=01:00:00</span>
<span class="c1">#PBS -o ipcluster.out</span>
<span class="c1">#PBS -e ipcluster.err</span>
<span class="c1">#PBS -m abe</span>
<span class="c1">#PBS -V</span>
mpirun_rsh -np <span class="o">{</span>n<span class="o">}</span> -hostfile <span class="nv">$PBS_NODEFILE</span> ipengine
</pre></div>
</td></tr></table>
<p>Here we chose to run 16 IPython engines per Gordon node, so each has access to 4 GB of RAM; if you need more, just change 16 to 8 in the template, for example.</p>
<h3>Run IPython parallel</h3>
<p>You can submit a job to the queue by running the following, where <code>n</code> is the number of engines you want to use, so it needs to be a multiple of the <code>ppn</code> chosen in the PBS template:</p>
<div class="highlight"><pre><span></span><span class="err">ipcluster start --n=32 &</span>
</pre></div>
<p>In this case we are requesting 2 nodes with 16 IPython engines each; check with:</p>
<div class="highlight"><pre><span></span><span class="err">qstat -u $USER</span>
</pre></div>
<p>Basically, <code>ipcluster</code> runs an <code>ipcontroller</code> on the login node and submits a job to PBS for running the <code>ipengines</code> on the computing nodes.</p>
<p>Once the PBS job is running, check that the engines are connected by opening an IPython session on the login node and printing the <code>ids</code>:</p>
<div class="highlight"><pre><span></span><span class="n">In</span> <span class="p">[</span><span class="mi">1</span><span class="p">]:</span> <span class="kn">from</span> <span class="nn">IPython.parallel</span> <span class="kn">import</span> <span class="n">Client</span>
<span class="n">In</span> <span class="p">[</span><span class="mi">2</span><span class="p">]:</span> <span class="n">rc</span> <span class="o">=</span> <span class="n">Client</span><span class="p">()</span>
<span class="n">In</span> <span class="p">[</span><span class="mi">3</span><span class="p">]:</span> <span class="n">rc</span><span class="o">.</span><span class="n">ids</span>
</pre></div>
<p>You can stop the cluster (kills <code>ipcontroller</code> and runs <code>qdel</code> on the PBS job) either by sending CTRL-c to <code>ipcluster</code> or running:</p>
<div class="highlight"><pre><span></span><span class="err">ipcluster stop # from bash console</span>
</pre></div>
<h3>Submit jobs to IPython parallel</h3>
<p>As soon as <code>ipcluster</code> is executed, <code>ipcontroller</code> is ready to queue jobs, which will then be consumed by the engines once they are running.
The easiest method to submit jobs with automatic load balancing is to create a load-balanced view:</p>
<div class="highlight"><pre><span></span><span class="n">In</span> <span class="p">[</span><span class="mi">1</span><span class="p">]:</span> <span class="kn">from</span> <span class="nn">IPython.parallel</span> <span class="kn">import</span> <span class="n">Client</span>
<span class="n">In</span> <span class="p">[</span><span class="mi">2</span><span class="p">]:</span> <span class="n">rc</span> <span class="o">=</span> <span class="n">Client</span><span class="p">()</span>
<span class="n">In</span> <span class="p">[</span><span class="mi">3</span><span class="p">]:</span> <span class="n">lview</span> <span class="o">=</span> <span class="n">rc</span><span class="o">.</span><span class="n">load_balanced_view</span><span class="p">()</span> <span class="c1"># default load-balanced view</span>
</pre></div>
<p>and then use its <code>map</code> method:</p>
<div class="highlight"><pre><span></span><span class="k">def</span> <span class="nf">exp_10</span><span class="p">(</span><span class="n">x</span><span class="p">):</span>
    <span class="k">return</span> <span class="n">x</span><span class="o">**</span><span class="mi">10</span>
<span class="n">list_of_args</span> <span class="o">=</span> <span class="nb">range</span><span class="p">(</span><span class="mi">100</span><span class="p">)</span>
<span class="n">result</span> <span class="o">=</span> <span class="n">lview</span><span class="o">.</span><span class="n">map</span><span class="p">(</span><span class="n">exp_10</span><span class="p">,</span> <span class="n">list_of_args</span><span class="p">)</span>
</pre></div>
<p>In this code <code>IPython</code> will distribute the list of arguments to the engines with load balancing; the function will be evaluated for each argument and the results copied back to the connecting client running on the login node.</p>
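<p>Without a running cluster you can prototype the same function-plus-<code>map</code> pattern locally with the standard library. The sketch below is my own local stand-in (threads instead of remote engines, not part of the Gordon setup); it keeps the same call shape as <code>lview.map</code>, so the worker function can be tested before submitting it to the cluster:</p>

```python
from multiprocessing.pool import ThreadPool  # local threads stand in for remote engines

def exp_10(x):
    return x ** 10

pool = ThreadPool(4)                    # 4 local workers instead of IPython engines
result = pool.map(exp_10, range(100))   # same call shape as lview.map
pool.close()
pool.join()
print(result[:3])  # [0, 1, 1024]
```

<p>Once the function behaves as expected locally, swapping the pool for the load-balanced view is a one-line change.</p>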
<h3>Submit non-python jobs to IPython parallel</h3>
<p>Let's assume you have a list of commands you want to run in a text file, one command per line; these could be implemented in any programming language, e.g.:</p>
<div class="highlight"><pre><span></span><span class="err">date &> date.log</span>
<span class="err">hostname &> hostname.log</span>
</pre></div>
<p>Then you create a function that executes one of those commands:</p>
<div class="highlight"><pre><span></span><span class="k">def</span> <span class="nf">run_command</span><span class="p">(</span><span class="n">command</span><span class="p">):</span>
    <span class="kn">import</span> <span class="nn">subprocess</span>
    <span class="c1"># call (unlike Popen) waits for the command to complete</span>
    <span class="k">return</span> <span class="n">subprocess</span><span class="o">.</span><span class="n">call</span><span class="p">(</span><span class="n">command</span><span class="p">,</span> <span class="n">shell</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
</pre></div>
<p>Then apply this function to the list of commands:</p>
<div class="highlight"><pre><span></span><span class="err">list_of_commands = open("commands.txt").readlines()</span>
<span class="err">lview.map(run_command, list_of_commands)</span>
</pre></div>
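<p>The same pattern can also be tested serially without any IPython engines. This is a plain-Python sketch of mine (the throwaway temporary file stands in for <code>commands.txt</code>); it runs each line with <code>subprocess</code>, waiting for each command, and collects the exit codes:</p>

```python
import os
import subprocess
import tempfile

# write a throwaway commands file (stand-in for commands.txt)
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write("echo one > /dev/null\n")
    f.write("echo two > /dev/null\n")
    path = f.name

# run each command and wait for it to finish, as the engines would
list_of_commands = open(path).readlines()
exit_codes = [subprocess.call(cmd, shell=True) for cmd in list_of_commands]
os.remove(path)
print(exit_codes)  # [0, 0]
```

<p>A nonzero entry in <code>exit_codes</code> flags a failed command, which is also a useful check when the same list is later mapped onto the engines.</p>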
<p>I created a script that automates this process, see <a href="https://gist.github.com/zonca/8994544">https://gist.github.com/zonca/8994544</a>; you can run it as:</p>
<div class="highlight"><pre><span></span><span class="err">./ipcluster_run_commands.py commands.txt</span>
</pre></div>Build Software Carpentry lessons with Pelican2014-02-26T23:00:00-08:002014-02-26T23:00:00-08:00Andrea Zoncatag:zonca.github.io,2014-02-26:/2014/02/build-software-carpentry-with-pelican.html<p><a href="http://www.software-carpentry.org">Software Carpentry</a> offers bootcamps for scientists to teach basic programming skills.
All the material, mainly about bash, git, Python and R is <a href="http://github.com/swcarpentry/bc">available on Github</a> under Creative Commons.</p>
<p>The content is either in Markdown or in IPython notebook format, and is currently built using Jekyll, nbconvert and Pandoc.
Basically, the …</p><p><a href="http://www.software-carpentry.org">Software Carpentry</a> offers bootcamps for scientists to teach basic programming skills.
All the material, mainly about bash, git, Python and R is <a href="http://github.com/swcarpentry/bc">available on Github</a> under Creative Commons.</p>
<p>The content is either in Markdown or in IPython notebook format, and is currently built using Jekyll, nbconvert and Pandoc.
Basically, the requirement is to make it easy for bootcamp instructors to set up their own website, modify the content, and have the website updated.</p>
<p>I created a fork of the Software Carpentry repository and configured Pelican for creating the website:</p>
<ul>
<li><a href="https://github.com/swcarpentry-pelican/bootcamp-pelican">bootcamp-pelican repository</a>: contains Markdown lessons in <code>lessons</code> (version v5), <code>.ipynb</code> in <code>notebooks</code> and news items in <code>news</code>.</li>
<li><a href="https://github.com/swcarpentry-pelican/swcarpentry-pelican.github.io">bootcamp-pelican Github pages</a>: This repository contains the output HTML</li>
<li><a href="http://swcarpentry-pelican.github.io/">bootcamp-pelican website</a>: this is the URL where Github publishes automatically the content of the previous repository</li>
</ul>
<p>Pelican handles fenced code blocks, see <a href="http://swcarpentry-pelican.github.io/">http://swcarpentry-pelican.github.io/</a> and conversion of IPython notebooks, see <a href="http://swcarpentry-pelican.github.io/lessons/numpy-notebook.html">http://swcarpentry-pelican.github.io/lessons/numpy-notebook.html</a></p>
<h2>How to setup the repositories for a new bootcamp</h2>
<ol>
<li><a href="https://github.com/organizations/new">create a new Organization on Github</a> and add all the other instructors, name it: <code>swcarpentry-YYYY-MM-DD-INST</code> where <code>INST</code> is the institution name, e.g. <code>NYU</code></li>
<li><a href="https://github.com/swcarpentry-pelican/bootcamp-pelican/fork">Fork the <code>bootcamp-pelican</code> repository</a> under the organization account</li>
<li>Create a new repository in your organization named <code>swcarpentry-YYYY-MM-DD-INST.github.io</code> that will host the HTML of the website, also tick <strong>initialize with README</strong>, it will help later.</li>
</ol>
<p>Now you can either prepare the build environment on your laptop or have the web service <code>travis-ci</code> automatically update the website whenever you update the repository (even from the Github web interface!).</p>
<h2>Build/Update the website from your laptop</h2>
<ol>
<li>Clone the <code>bootcamp-pelican</code> repository of your organization locally</li>
<li>
<p>Create a <code>Python</code> virtual environment and install requirements with:</p>
<div class="highlight"><pre><span></span><span class="err">cd bootcamp-pelican</span>
<span class="err">virtualenv swcpy</span>
<span class="err">. swcpy/bin/activate</span>
<span class="err">pip install -r requirements.txt</span>
</pre></div>
</li>
<li>
<p>Clone the <code>swcarpentry-YYYY-MM-DD-INST.github.io</code> in the output folder as:</p>
<div class="highlight"><pre><span></span><span class="err">git clone git@github.com:swcarpentry-YYYY-MM-DD-INST/swcarpentry-YYYY-MM-DD-INST.github.io.git output</span>
</pre></div>
</li>
<li>
<p>Build or Update the website with Pelican running</p>
<div class="highlight"><pre><span></span><span class="err">fab build</span>
</pre></div>
</li>
<li>
<p>You can display the website in your browser locally with:</p>
<div class="highlight"><pre><span></span><span class="err">fab serve</span>
</pre></div>
</li>
<li>
<p>Finally you can publish it to Github with:</p>
<div class="highlight"><pre><span></span><span class="err">cd output</span>
<span class="err">git add .</span>
<span class="err">git push origin master</span>
</pre></div>
</li>
</ol>
<h2>Configure Travis-ci to automatically build and publish the website</h2>
<ol>
<li>Go to <a href="http://travis-ci.org">http://travis-ci.org</a> and login with Github credentials</li>
<li>Under <a href="https://travis-ci.org/profile">https://travis-ci.org/profile</a> click on the organization name on the left and activate the webhook setting <code>ON</code> on your <code>bootcamp-pelican</code> repository</li>
<li>Now it is necessary to setup the credentials for <code>travis-ci</code> to write to the repository</li>
<li>Go to <a href="https://github.com/settings/tokens/new">https://github.com/settings/tokens/new</a> and create a new token with default permissions</li>
<li>
<p>Install the <code>travis</code> tool (in debian/ubuntu <code>sudo gem install travis</code>) and run from any machine (not necessary to have a clone of the repository):</p>
<div class="highlight"><pre><span></span><span class="err">travis encrypt -r swcarpentry-YYYY-MM-DD-INST/bootcamp-pelican GH_TOKEN=TOKENGOTATTHEPREVIOUSSTEP</span>
</pre></div>
<p>otherwise I've set up a web application that does the encryption in your browser, see: <a href="http://travis-encrypt.github.io">http://travis-encrypt.github.io</a></p>
</li>
<li>Open <code>.travis.yml</code> on the website and replace the string under <code>env: global: secure:</code> with the string from <code>travis encrypt</code></li>
<li>Push the modified <code>.travis.yml</code> to trigger the first build by Travis, and then check the log on <a href="http://travis-ci.org">http://travis-ci.org</a></li>
</ol>
<p>Now any change on the source repository will be picked up automatically by Travis and used to update the website.</p>openproceedings: Github/FigShare based publishing platform for conference proceedings2014-02-13T23:30:00-08:002014-02-13T23:30:00-08:00Andrea Zoncatag:zonca.github.io,2014-02-13:/2014/02/openproceedings-github-figshare-pelican-conference-proceedings.html<p>Github provides a great interface for gathering, peer reviewing and accepting papers for conference proceedings, the second step is to publish them on a website either in HTML or PDF form or both.
The Scipy conference is at the forefront on this and did great work in peer reviewing on …</p><p>Github provides a great interface for gathering, peer reviewing and accepting papers for conference proceedings, the second step is to publish them on a website either in HTML or PDF form or both.
The Scipy conference is at the forefront on this and did great work in peer reviewing on Github, see: <a href="https://github.com/scipy-conference/scipy_proceedings/pull/61">https://github.com/scipy-conference/scipy_proceedings/pull/61</a>.</p>
<p>I wanted to develop a system to make it easier to continuously publish updated versions of the papers and also leverage FigShare to provide a long-term repository, a sharing interface and a <a href="http://en.wikipedia.org/wiki/Digital_object_identifier">DOI</a>.</p>
<p>I based it on the blog engine <a href="http://getpelican.com"><code>Pelican</code></a>, developed a plugin <a href="http://github.com/openproceedings/pelican_figshare_pdf"><code>figshare_pdf</code></a> to upload a PDF of an article via API and configured <a href="http://travis-ci.org">Travis-ci</a> as building platform.</p>
<p>See more details on the project page on Github:
<a href="https://github.com/openproceedings/openproceedings-buildbot">https://github.com/openproceedings/openproceedings-buildbot</a></p>wget file from google drive2014-01-31T18:00:00-08:002014-01-31T18:00:00-08:00Andrea Zoncatag:zonca.github.io,2014-01-31:/2014/01/wget-file-from-google-drive.html<p>Sometimes it is useful, even more if you have a chromebook, to upload a file to Google Drive and then use <code>wget</code> to retrieve it from a server remotely.</p>
<p>In order to do this you need to make the file available to "Anyone with the link", then click on that …</p><p>Sometimes it is useful, even more if you have a chromebook, to upload a file to Google Drive and then use <code>wget</code> to retrieve it from a server remotely.</p>
<p>In order to do this you need to make the file available to "Anyone with the link", then click on that link from your local machine and get to the download page that displays a Download button.
Now right-click and select "Show page source" (in Chrome), search for "downloadUrl", and copy the URL that starts with <code>https://docs.google.com</code>, for example:</p>
<div class="highlight"><pre><span></span><span class="c">https://docs.google.com/uc?id\u003d0ByPZe438mUkZVkNfTHZLejFLcnc\u0026export\u003ddownload\u0026revid\u003d0ByPZe438mUkZbUIxRkYvM2dwbVduRUxSVXNERm0zZFFiU2c0PQ</span>
</pre></div>
<p>This is unicode, so open <code>Python</code> and do:</p>
<div class="highlight"><pre><span></span><span class="err">download_url = "PASTE HERE"</span>
<span class="err">print download_url.decode("unicode_escape")</span>
<span class="err">u'https://docs.google.com/uc?id=0ByPZe438mUkZVkNfTHZLejFLcnc&export=download&revid=0ByPZe438mUkZbUIxRkYvM2dwbVduRUxSVXNERm0zZFFiU2c0PQ'</span>
</pre></div>
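<p>The snippet above is Python 2; in Python 3 the same decoding works by round-tripping through <code>bytes</code>. A short sketch (the file id here is made up and shortened for illustration):</p>

```python
# the page source contains \u-escaped characters; decode them to get a plain URL
download_url = r"https://docs.google.com/uc?id\u003dABC123\u0026export\u003ddownload"
decoded = download_url.encode("ascii").decode("unicode_escape")
print(decoded)  # https://docs.google.com/uc?id=ABC123&export=download
```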
<p>The last url can be pasted into a terminal and used with <code>wget</code>.</p>Run IPython Notebook on a HPC Cluster via PBS2013-12-18T16:30:00-08:002013-12-18T16:30:00-08:00Andrea Zoncatag:zonca.github.io,2013-12-18:/2013/12/run-ipython-notebook-on-HPC-cluster-via-PBS.html<p>The <a href="http://ipython.org/notebook.html">IPython notebook</a> is a great tool for data exploration
and visualization.
It is suitable in particular for analyzing a large amount of data remotely on a computing node
of a HPC cluster and visualize it in a browser that runs on a local machine.
In this configuration, the interface …</p><p>The <a href="http://ipython.org/notebook.html">IPython notebook</a> is a great tool for data exploration
and visualization.
It is suitable in particular for analyzing large amounts of data remotely on a computing node
of an HPC cluster and visualizing it in a browser that runs on a local machine.
In this configuration the interface is local and very responsive, while the memory
and CPU horsepower are provided by an HPC computing node.</p>
<p>Also, it is possible to keep the notebook server running, disconnect and reconnect later from
another machine to the same session.</p>
<p>I created a script which is very general and can be used on most HPC cluster and published it on Github:</p>
<p><a href="https://github.com/pyHPC/ipynbhpc">https://github.com/pyHPC/ipynbhpc</a></p>
<p>Once the script is running, it is possible to connect to <code>localhost:PORT</code> and visualize the
IPython notebook; see the following screenshot of Chromium running locally on my machine
connected to an IPython notebook running on a Gordon computing node:</p>
<p><img src="/images/run-ipython-notebook-on-HPC-cluster-via-PBS_screenshot.png" alt="IPython notebook on Gordon" style="width: 730px;"/></p>Joining San Diego Supercomputer Center2013-12-10T13:30:00-08:002013-12-10T13:30:00-08:00Andrea Zoncatag:zonca.github.io,2013-12-10:/2013/12/joining-sandiego-supercomputer-center.html<p><code>TL;DR</code>
Left UCSB after 4 years, got staff position at San Diego Supercomputer Center within UCSD, will be helping research groups analyze their data on Gordon and more. Still 20% on Planck.</p>
<p>I spent 4 great years at UCSB with Peter Meinhold working on analyzing Cosmic Microwave Background data …</p><p><code>TL;DR</code>
Left UCSB after 4 years, got staff position at San Diego Supercomputer Center within UCSD, will be helping research groups analyze their data on Gordon and more. Still 20% on Planck.</p>
<p>I spent 4 great years at UCSB with Peter Meinhold working on analyzing Cosmic Microwave Background data from the ESA Planck space mission.
Cosmology is fascinating, and I enjoyed working with a very open-minded team that always left me great freedom in choosing the techniques and the software tools for the job.</p>
<p>My work has been mainly focused on understanding and characterizing large amounts of data using <code>Python</code> (and <code>C++</code>) on NERSC supercomputers.
I was neither interested nor fit for a traditional academic career, and I was looking for a job that allowed me to focus on doing research/data analysis full time.</p>
<p>The perfect opportunity showed up, as the San Diego Supercomputer Center was looking for a computational scientist with a strong scientific background in any field of science to help research teams jump into supercomputing, specifically newcomers. This involves having the opportunity to collaborate with groups in any area of science, the first projects I am going to work on will be in Astrophysics, Quantum Chemistry and Genomics!</p>
<p>I also have the opportunity to continue my work on calibration and mapmaking of Planck data in collaboration with UCSB for 20% of my time.</p>Published paper on Destriping Cosmic Microwave Background Polarimeter data2013-11-20T21:30:00-08:002013-11-20T21:30:00-08:00Andrea Zoncatag:zonca.github.io,2013-11-20:/2013/11/published-paper-destriping-CMB-polarimeter.html<p>TL;DR version:</p>
<ul>
<li>Preprint on arxiv: <a href="http://arxiv.org/abs/1309.5609">Destriping Cosmic Microwave Background Polarimeter data</a></li>
<li>Destriping <code>python</code> code on github: <a href="https://github.com/zonca/dst"><code>dst</code></a></li>
<li>Output maps and sample input data on figshare: <a href="http://figshare.com/articles/BMachine_40GHz_CMB_Polarimeter_sky_maps/644507">BMachine 40GHz CMB Polarimeter sky maps</a></li>
<li>(Paywalled published paper: <a href="http://dx.doi.org/10.1016/j.ascom.2013.10.002">Destriping Cosmic Microwave Background Polarimeter data</a>)</li>
</ul>
<p>My last paper was published by <a href="http://www.journals.elsevier.com/astronomy-and-computing/">Astronomy and Computing …</a></p><p>TL;DR version:</p>
<ul>
<li>Preprint on arxiv: <a href="http://arxiv.org/abs/1309.5609">Destriping Cosmic Microwave Background Polarimeter data</a></li>
<li>Destriping <code>python</code> code on github: <a href="https://github.com/zonca/dst"><code>dst</code></a></li>
<li>Output maps and sample input data on figshare: <a href="http://figshare.com/articles/BMachine_40GHz_CMB_Polarimeter_sky_maps/644507">BMachine 40GHz CMB Polarimeter sky maps</a></li>
<li>(Paywalled published paper: <a href="http://dx.doi.org/10.1016/j.ascom.2013.10.002">Destriping Cosmic Microwave Background Polarimeter data</a>)</li>
</ul>
<p>My last paper was published by <a href="http://www.journals.elsevier.com/astronomy-and-computing/">Astronomy and Computing</a>.</p>
<p>The paper is focused on Cosmic Microwave Background data destriping, a map-making technique which exploits the fast
scanning of instruments in order to efficiently remove correlated low-frequency noise, generally caused by thermal
fluctuations and gain instability of the amplifiers.</p>
<p>The paper treats in particular the case of destriping data from a polarimeter, i.e. an instrument which directly measures
the polarized signal from the sky, which allows some simplification compared to the case of a radiometer that is merely
polarization-sensitive.</p>
<p>I implemented a fully parallel <code>python</code> implementation of the algorithm based on:</p>
<ul>
<li><a href="http://trilinos.sandia.gov/packages/pytrilinos/"><code>PyTrilinos</code></a> for Distributed Linear Algebra via MPI</li>
<li><code>HDF5</code> for I/O</li>
<li><code>cython</code> for improving the performance of the inner loops</li>
</ul>
<p>The code is available on Github under GPL.</p>
<p>The output maps for about 30 days of data from the UCSB B-Machine polarimeter at 37.5 GHz are available on FigShare.</p>
<p>The experience of publishing with ASCOM was really positive, I received 2 very helpful reviews that drove me to
work on several improvements on the paper.</p>Jiffylab multiuser IPython notebooks2013-10-14T10:30:00-07:002013-10-14T10:30:00-07:00Andrea Zoncatag:zonca.github.io,2013-10-14:/2013/10/jiffylab-multiuser-ipython-notebooks.html<p><a href="https://github.com/ptone/jiffylab">jiffylab</a> is a very interesting project by <a href="https://twitter.com/ptone">Preston Holmes</a> to provide sandboxed IPython notebooks instances on a server using <a href="http://www.docker.io/">docker</a>.
There are several use cases, for example:</p>
<ul>
<li>In a tutorial about <code>python</code>, give users instant access to a working IPython notebook</li>
<li>In a tutorial about some specific <code>python</code> package, give …</li></ul><p><a href="https://github.com/ptone/jiffylab">jiffylab</a> is a very interesting project by <a href="https://twitter.com/ptone">Preston Holmes</a> to provide sandboxed IPython notebooks instances on a server using <a href="http://www.docker.io/">docker</a>.
There are several use cases, for example:</p>
<ul>
<li>In a tutorial about <code>python</code>, give users instant access to a working IPython notebook</li>
<li>In a tutorial about some specific <code>python</code> package, give users instant access to a python environment with that package already installed</li>
<li>Give students in a research group access to <code>python</code> on a server with several packages preinstalled, maintained, and updated by an expert user.</li>
</ul>
<h2>How to install <a href="https://github.com/ptone/jiffylab">jiffylab</a> on Ubuntu 12.04</h2>
<ul>
<li><a href="http://docs.docker.io/en/latest/installation/ubuntulinux/#ubuntu-precise">Install <code>docker</code> on Ubuntu Precise</a></li>
<li>Copy-paste each line of <code>linux-setup.sh</code> to a terminal, to check what is going on step by step</li>
<li>To start the application, change user to <code>jiffylabweb</code>:</li>
</ul>
<div class="highlight"><pre><span></span>sudo su jiffylabweb
<span class="nb">cd</span> /usr/local/etc/jiffylab/webapp/
python app.py <span class="c1">#run in debug mode</span>
</pre></div>
<ul>
<li>Point your browser to the server to check debugging messages, if any.</li>
<li>Finally start the application in production mode:</li>
</ul>
<div class="highlight"><pre><span></span>python server.py <span class="c1">#run in production mode</span>
</pre></div>
<h2>How <code>jiffylab</code> works</h2>
<p>Each user gets a sandboxed IPython notebook instance; users can save their notebooks and reconnect to the same session later. Main things missing:</p>
<ul>
<li>No real authentication system / no HTTPS connection; an easy workaround would be to allow access only from a local network/VPN/SSH tunnel</li>
<li>No scientific packages preinstalled, need to customize the docker image to have <code>numpy</code>, <code>matplotlib</code>, <code>pandas</code>...</li>
<li>No access to a common read-only filesystem; I think this is the most pressing missing feature, <a href="https://github.com/ptone/jiffylab/issues/12">issue already on Github</a></li>
</ul>
<p>I think that just adding the common filesystem would be enough to make the project usable for giving students an easy way to get started with python.</p>
<h2>Few screenshots</h2>
<h3>Login page</h3>
<p><img src="/images/jiffylab_intro.png" alt="Jiffylab Login page" style="width: 730px;"/></p>
<h3>IPython notebook dashboard</h3>
<p><img src="/images/jiffylab_dashboard.png" alt="Jiffylab IPython notebook dashboard" style="width: 730px;"/></p>
<h3>IPython notebook</h3>
<p><img src="/images/jiffylab_notebook.png" alt="Jiffylab IPython notebook" style="width: 730px;"/></p>How to log exceptions in Python2013-10-01T10:30:00-07:002013-10-01T10:30:00-07:00Andrea Zoncatag:zonca.github.io,2013-10-01:/2013/10/how-to-log-exceptions-in-python.html<p>Sometimes it is useful to just catch any exception, write details to a log file and continue execution.</p>
<p>In the <code>Python</code> standard library, it is possible to use the <code>logging</code> module and the built-in exceptions to achieve this.
First of all, we want to catch any exception while still being able to …</p><p>Sometimes it is useful to just catch any exception, write details to a log file and continue execution.</p>
<p>In the <code>Python</code> standard library, it is possible to use the <code>logging</code> module and the built-in exceptions to achieve this.
First of all, we want to catch any exception while still being able to access all information about it:</p>
<div class="highlight"><pre><span></span><span class="k">try</span><span class="p">:</span>
<span class="n">my_function_1</span><span class="p">()</span>
<span class="k">except</span> <span class="ne">Exception</span> <span class="k">as</span> <span class="n">e</span><span class="p">:</span>
<span class="nb">print</span> <span class="n">e</span><span class="o">.</span><span class="vm">__class__</span><span class="p">,</span> <span class="n">e</span><span class="o">.</span><span class="vm">__doc__</span><span class="p">,</span> <span class="n">e</span><span class="o">.</span><span class="n">message</span>
</pre></div>
<p>Then we want to write those to a logging file, so we need to setup the logging module:</p>
<div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">logging</span>
<span class="n">logging</span><span class="o">.</span><span class="n">basicConfig</span><span class="p">(</span> <span class="n">filename</span><span class="o">=</span><span class="s2">"main.log"</span><span class="p">,</span>
<span class="n">filemode</span><span class="o">=</span><span class="s1">'w'</span><span class="p">,</span>
<span class="n">level</span><span class="o">=</span><span class="n">logging</span><span class="o">.</span><span class="n">DEBUG</span><span class="p">,</span>
<span class="nb">format</span><span class="o">=</span> <span class="s1">'</span><span class="si">%(asctime)s</span><span class="s1"> - </span><span class="si">%(levelname)s</span><span class="s1"> - </span><span class="si">%(message)s</span><span class="s1">'</span><span class="p">,</span>
<span class="p">)</span>
</pre></div>
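<p>Putting the two pieces together, a minimal sketch (Python 3 syntax; the function name is illustrative):</p>

```python
# A minimal sketch (Python 3 syntax) combining the two pieces above:
# catch any exception, log its class, docstring and message, and keep going.
import logging

logging.basicConfig(filename="main.log",
                    filemode="w",
                    level=logging.DEBUG,
                    format="%(asctime)s - %(levelname)s - %(message)s",
                    force=True)  # Python >= 3.8: replace any existing handlers

def my_function_1():
    [][0]  # deliberately raises IndexError

try:
    my_function_1()
except Exception as e:
    # e.__doc__ is the docstring of the exception class, str(e) the message
    logging.error("Function my_function_1() raised %s (%s): %s",
                  e.__class__.__name__, e.__doc__, e)
```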
<p><a href="https://gist.github.com/zonca/6782980">In the following gist</a> everything together, with also <a href="http://stackoverflow.com/questions/2380073/how-to-identify-what-function-call-raise-an-exception-in-python">function name detection from Alex Martelli</a>:</p>
<script src="https://gist.github.com/zonca/6782980.js"></script>
<p>Here the output log:</p>
<div class="highlight"><pre><span></span>2013-10-01 11:32:56,466 - ERROR - Function my_function_1() raised <type 'exceptions.IndexError'> (Sequence index out of range.): Some indexing error
2013-10-01 11:32:56,466 - ERROR - Function my_function_2() raised <class 'my_module.MyException'> (This is my own Exception): Something went quite wrong
2013-10-01 11:32:56,466 - ERROR - Function my_function_1_wrapper() raised <type 'exceptions.IndexError'> (Sequence index out of range.): Some indexing error
</pre></div>Google Plus comments plugin for Pelican2013-09-27T17:45:00-07:002013-09-27T17:45:00-07:00Andrea Zoncatag:zonca.github.io,2013-09-27:/2013/09/google-plus-comments-plugin-for-pelican.html<p>There have recently been several discussions about
<a href="http://www.popsci.com/science/article/2013-09/why-were-shutting-our-comments">whether comments are useful on blogs</a>.
I think it is important to find better ways to connect blogs to social networks.
In my opinion the most suitable social network for this is Google+, because there is space for larger discussion, without Twitter's …</p><p>There have recently been several discussions about
<a href="http://www.popsci.com/science/article/2013-09/why-were-shutting-our-comments">whether comments are useful on blogs</a>.
I think it is important to find better ways to connect blogs to social networks.
In my opinion the most suitable social network for this is Google+, because there is space for larger discussion, without Twitter's character limit.</p>
<p>So, for my small blog I've decided to implement the Google+ commenting system, which Google originally implemented just for Blogger but that <a href="http://browsingthenet.blogspot.com/2013/04/google-plus-comments-on-any-website.html">works on any website</a>.</p>
<p>See it in action below.</p>
<p>The plugin is available in the <code>googleplus_comments</code> branch in:</p>
<p><a href="https://github.com/zonca/pelican-plugins/tree/googleplus_comments/googleplus_comments">https://github.com/zonca/pelican-plugins/tree/googleplus_comments/googleplus_comments</a></p>How to automatically build your Pelican blog and publish it to Github Pages2013-09-26T13:45:00-07:002013-09-26T13:45:00-07:00Andrea Zoncatag:zonca.github.io,2013-09-26:/2013/09/automatically-build-pelican-and-publish-to-github-pages.html<p>Something I like a lot about Jekyll, the Github static blog generator, is that you just push commits to your repository and Github takes care of re-building and publishing your website.
Thanks to this, it is possible to create a quick blog post from the Github web interface, without the …</p><p>Something I like a lot about Jekyll, the Github static blog generator, is that you just push commits to your repository and Github takes care of re-building and publishing your website.
Thanks to this, it is possible to create a quick blog post from the Github web interface, without the need to use a machine with a Python environment.</p>
<p>The Pelican developers have a <a href="http://blog.getpelican.com/using-pelican-with-heroku.html">method for building and deploying Pelican on Heroku</a>, which is really useful, but I would like instead to use Github Pages.</p>
<p>I realized that the best way to do this is to rely on <a href="https://travis-ci.org/">Travis-CI</a>, as the build/deploy workflow is pretty similar to the install/unit-test workflow Travis is designed for.</p>
<h2>How to setup Pelican to build on Travis</h2>
<p>I suggest using two separate git repositories on Github for the source and the built website; let's first create only the repository for the source:</p>
<ul>
<li>create the <code>yourusername.github.io-source</code> repository for Pelican and add it as <code>origin</code> in your Pelican folder repository</li>
</ul>
<p>add a <code>requirements.txt</code> file in your Pelican folder:</p>
<div class="highlight"><pre><span></span>github:zonca/zonca.github.io-source/requirements.txt
</pre></div>
<p>add a <code>.travis.yml</code> file to your repository:</p>
<div class="highlight"><pre><span></span>github:zonca/zonca.github.io-source/.travis.yml
</pre></div>
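<p>For reference, such a file looks roughly like the following sketch; the repository details and the encrypted token are placeholders, the actual file is in the repository embedded above:</p>

```yaml
language: python
python:
  - "2.7"
install:
  - pip install -r requirements.txt
script:
  - pelican content -o output -s publishconf.py
after_success:
  - bash deploy.sh
env:
  global:
    # added by "travis encrypt GH_TOKEN=... --add env.global"
    - secure: "ENCRYPTED_TOKEN_PLACEHOLDER"
```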
<p>In order to create the encrypted token under env, you can log in to the Github web interface to get an <a href="https://help.github.com/articles/creating-an-access-token-for-command-line-use">Authentication Token</a>, and then install the <code>travis</code> command line tool with:</p>
<div class="highlight"><pre><span></span><span class="c1"># on Ubuntu you need ruby dev</span>
sudo apt-get install ruby1.9.1-dev
sudo gem install travis
</pre></div>
<p>and run from inside the repository:</p>
<div class="highlight"><pre><span></span>travis encrypt GH_TOKEN=LONGTOKENFROMGITHUB --add env.global
</pre></div>
<p>Then add also the <code>deploy.sh</code> script and update the global variable with yours:</p>
<div class="highlight"><pre><span></span>github:zonca/zonca.github.io-source/deploy.sh
</pre></div>
<p>Then we can create the repository that will host the actual blog:</p>
<ul>
<li>create the <code>yourusername.github.io</code> repository for the website (with initial readme, so you can clone it)</li>
</ul>
<p>Finally we can go to <a href="https://travis-ci.org/">Travis-CI</a>, connect our Github profile, and activate Continuous Integration on our <code>yourusername.github.io-source</code> repository.</p>
<p>Now, you can push a new commit to your source repository and check on Travis if the build and deploy are successful, hopefully they are (joking, no way it is going to work on the first try!).</p>clviewer, interactive plot of CMB spectra2013-09-17T18:30:00-07:002013-09-17T18:30:00-07:00Andrea Zoncatag:zonca.github.io,2013-09-17:/2013/09/clviewer-interactive-plot-of-CMB-spectra.html<p>Today it was HackDay at <a href="http://dotastronomy.com">.Astronomy</a>, so I felt compelled to hack something together myself,
creating something I have been thinking about for a while after my previous work on <a href="http://zonca.github.io/2013/08/interactive-figures-planck-power-spectra.html">Interactive CMB power spectra in the browser</a>.</p>
<p>The idea is to get text files from a user and load them in …</p><p>Today it was HackDay at <a href="http://dotastronomy.com">.Astronomy</a>, so I felt compelled to hack something together myself,
creating something I have been thinking about for a while after my previous work on <a href="http://zonca.github.io/2013/08/interactive-figures-planck-power-spectra.html">Interactive CMB power spectra in the browser</a>.</p>
<p>The idea is to get text files from a user and load them into a browser-based interactive display built on top of the <a href="http://d3js.org">d3.js</a> and <a href="http://code.shutterstock.com/rickshaw/">rickshaw</a> libraries.</p>
<p>Similar to <a href="http://nbviewer.ipython.org/">nbviewer</a>, I think it is very handy to load data from <a href="https://gist.github.com/">Github gists</a>, because then there is no need of uploading files and it is easier to circulate links.</p>
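<p>Extracting the files from a gist is a single call to the Github API, which returns a JSON payload with a <code>files</code> mapping; a minimal sketch (function names are illustrative, not the app's actual code):</p>

```python
# Sketch of the gist-loading step (function names are illustrative).
# The Github API returns JSON with a "files" dict mapping each filename
# to metadata that includes the raw "content".
import json
from urllib.request import urlopen

def gist_files(payload):
    """Extract {filename: content} from a Github gist API payload."""
    return {name: meta["content"] for name, meta in payload["files"].items()}

def load_gist(gist_id):
    """Fetch a gist by id and return its files as {filename: content}."""
    with urlopen("https://api.github.com/gists/%s" % gist_id) as response:
        return gist_files(json.load(response))
```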
<p>So I created a small web app, in <code>Python</code> of course, using <a href="http://flask.pocoo.org/">Flask</a> and deployed on <a href="http://heroku.com">Heroku</a>.
It just gets a gist number, calls the Github APIs to load the files, and displays them in the browser:</p>
<ul>
<li>Application website: <a href="http://clviewer.herokuapp.com">http://clviewer.herokuapp.com</a></li>
<li>Example input data: <a href="https://gist.github.com/zonca/6599016">https://gist.github.com/zonca/6599016</a></li>
<li>Example interactive plot: <a href="http://clviewer.herokuapp.com/6599016">http://clviewer.herokuapp.com/6599016</a></li>
<li>Source: <a href="https://github.com/zonca/clviewer">https://github.com/zonca/clviewer</a></li>
</ul>Planck CMB map at high resolution2013-09-10T14:00:00-07:002013-09-10T14:00:00-07:00Andrea Zoncatag:zonca.github.io,2013-09-10:/2013/09/Planck-CMB-map-at-high-resolution.html<p>Prompted by a colleague, I created a high-resolution version of the Cosmic Microwave Background map in MollWeide projection released by the Planck collaboration, available on the <a href="http://irsa.ipac.caltech.edu/data/Planck/release_1/all-sky-maps/previews/COM_CompMap_CMB-smica_2048_R1.20/index.html">Planck Data Release Website</a> in FITS format.</p>
<p>The map is a PNG at a resolution of 17469x8796 pixels, which is suitable for printing at …</p><p>Prompted by a colleague, I created a high-resolution version of the Cosmic Microwave Background map in MollWeide projection released by the Planck collaboration, available on the <a href="http://irsa.ipac.caltech.edu/data/Planck/release_1/all-sky-maps/previews/COM_CompMap_CMB-smica_2048_R1.20/index.html">Planck Data Release Website</a> in FITS format.</p>
<p>The map is a PNG at a resolution of 17469x8796 pixels, which is suitable for printing at 300dpi up to 60x40 inch, or 150x100 cm, file size is about 150MB.</p>
<p><em>Update</em>: now with Planck color scale</p>
<p><em>Update</em>: the previous version had grayed-out pixels in the galactic plane, representing the fraction of the sky that cannot be reconstructed due to bright galactic sources. The latest version uses inpainting to create a constrained CMB realization with the same statistics as the observed CMB to fill the unobserved pixels; more details in the <a href="http://www.sciops.esa.int/wikiSI/planckpla/index.php?title=CMB_and_astrophysical_component_maps&instance=Planck_Public_PLA">Planck Explanatory Supplement</a>. </p>
<ul>
<li>
<p><a href="http://dx.doi.org/10.6084/m9.figshare.795296">High Resolution image on FigShare</a></p>
</li>
<li>
<p>Small size preview:</p>
</li>
</ul>
<p><img alt="Preview of Planck CMB map" src="/images/Planck-CMB-map-at-high-resolution_planck_cmb_map.jpg"></p>
<ul>
<li>Python code:</li>
</ul>
<script src="https://gist.github.com/zonca/6515744.js"></script>Run Hadoop Python jobs on Amazon with MrJob2013-09-02T02:36:00-07:002013-09-02T02:36:00-07:00Andrea Zoncatag:zonca.github.io,2013-09-02:/2013/09/run-hadoop-python-jobs-on-amazon-with-mrjob.html<p><br/>
First we need to install mrjob with:
<br/>
<blockquote class="tr_bq">
pip install mrjob
</blockquote>
I am starting with a simple example of word counting. Previously I implemented this directly using the hadoop streaming interface, therefore mapper and reducer were scripts that read from standard input and print to standard output, see mapper.py and …</p><p><br/>
First we need to install mrjob with:
<br/>
<blockquote class="tr_bq">
pip install mrjob
</blockquote>
I am starting with a simple example of word counting. Previously I implemented this directly using the Hadoop streaming interface, so the mapper and reducer were scripts that read from standard input and print to standard output; see mapper.py and reducer.py in:
<br/>
<br/>
<a href="https://github.com/zonca/python-wordcount-hadoop">
https://github.com/zonca/python-wordcount-hadoop
</a>
<br/>
<br/>
With MrJob the interface is instead a little different: we implement the mapper method of our MRJob subclass, which already gets a "line" argument, and yield the output as tuples like ("word", 1).
<br/>
<div>
MrJob makes the implementation of the reducer particularly simple. Using hadoop-streaming directly, we also needed to first parse the output of the mapper back into python objects, while MrJob does it for you and directly provides the key and the iterator of counts, which we just need to sum.
</div>
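<p>Stripped of the MRJob class scaffolding, the word-count logic boils down to the following sketch (in the real code these are methods of an MRJob subclass):</p>

```python
# Word-count mapper/reducer pattern sketched as plain functions
# (in the real code these are methods of an MRJob subclass).
import re

WORD_RE = re.compile(r"[\w']+")

def mapper(line):
    # emit ("word", 1) for every word found in the input line
    for word in WORD_RE.findall(line):
        yield word.lower(), 1

def reducer(word, counts):
    # MrJob groups the mapper output by key and hands us an iterator of counts
    yield word, sum(counts)
```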
<div>
<br/>
<a name="more">
</a>
</div>
<div>
The code is pretty simple:
<br/>
<br/>
<script src="http://gist-it.appspot.com/github/zonca/python-wordcount-hadoop/blob/master/mrjob/word_count_mrjob.py">
</script>
<div>
<br/>
</div>
First we can test locally with 2 different methods, either:
<br/>
<br/>
<blockquote class="tr_bq">
python word_count_mrjob.py gutemberg/20417.txt.utf-8
</blockquote>
<br/>
or:
<br/>
<br/>
<blockquote class="tr_bq">
python word_count_mrjob.py --runner=local gutemberg/20417.txt.utf-8
</blockquote>
<br/>
The first is a simple local test; the second sets some Hadoop variables and uses multiprocessing to run the mapper in parallel.
<br/>
<div>
<br/>
</div>
<span style="font-size: large;">
Run on Amazon Elastic Map Reduce
</span>
<br/>
<br/>
</div>
<div>
Next step is submitting the job to EMR.
<br/>
First get an account on Amazon Web Services from
<a href="http://aws.amazon.com/">
aws.amazon.com
</a>
.
<br/>
<br/>
Setup MrJob with Amazon:
<br/>
<br/>
<a href="http://pythonhosted.org/mrjob/guides/emr-quickstart.html#amazon-setup">
http://pythonhosted.org/mrjob/guides/emr-quickstart.html#amazon-setup
</a>
<br/>
<br/>
<div>
Then we just need to choose the "emr" runner for MrJob to take care of:
</div>
<div>
<ul>
<li>
Copy the python module to Amazon S3, with requirements
</li>
<li>
Copy the input data to S3
</li>
<li>
Create a small EC2 instance (of course we could set it up to run 1000 instead)
</li>
<li>
Run Hadoop to process the jobs
</li>
<li>
Create a local web service that allows easy monitoring of the cluster
</li>
<li>
When completed, copy the results back (this can be disabled to just leave the results on S3).
</li>
</ul>
</div>
<div>
e.g.:
</div>
<blockquote class="tr_bq">
python word_count_mrjob.py --runner=emr --aws-region=us-west-2 gutemberg/20417.txt.utf-8
</blockquote>
<div>
It is important to make sure that the aws-region used by MrJob is the same one we used for creating the SSH key on the EC2 console in the MrJob configuration step, since SSH keys are region-specific.
<br/>
<br/>
<span style="font-size: large;">
Logs and output of the run
</span>
<br/>
<br/>
MrJob copies the needed files to S3:
<br/>
<blockquote class="tr_bq">
. runemr.sh
<br/>
using configs in /home/zonca/.mrjob.conf
<br/>
using existing scratch bucket mrjob-ecd1d07aeee083dd
<br/>
using s3://mrjob-ecd1d07aeee083dd/tmp/ as our scratch dir on S3
<br/>
creating tmp directory /tmp/mrjobjob.zonca.20130901.192250.785550
<br/>
Copying non-input files into s3://mrjob-ecd1d07aeee083dd/tmp/mrjobjob.zonca.20130901.192250.785550/files/
<br/>
Waiting 5.0s for S3 eventual consistency
<br/>
Creating Elastic MapReduce job flow
<br/>
Job flow created with ID: j-2E83MO9QZQILB
<br/>
Created new job flow j-2E83MO9QZQILB
</blockquote>
Creates the instances:
<br/>
<blockquote class="tr_bq">
Job launched 30.9s ago, status STARTING: Starting instances
<br/>
Job launched 123.9s ago, status BOOTSTRAPPING: Running bootstrap actions
<br/>
Job launched 250.5s ago, status RUNNING: Running step (mrjobjob.zonca.20130901.192250.785550: Step 1 of 1)
</blockquote>
Creates an SSH tunnel to the tracker:
<br/>
<blockquote class="tr_bq">
Opening ssh tunnel to Hadoop job tracker
<br/>
Connect to job tracker at: http://localhost:40630/jobtracker.jsp
</blockquote>
</div>
Therefore we can connect to that address to check realtime information about the cluster running on EC2, for example:
<br/>
<br/>
<div class="separator" style="clear: both; text-align: center;">
<a href="http://zonca.github.io/images/run-hadoop-python-jobs-on-amazon-with-mrjob_s1600_awsjobdetails.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;">
<img border="0" height="588" src="http://zonca.github.io/images/run-hadoop-python-jobs-on-amazon-with-mrjob_s640_awsjobdetails.png" width="640"/>
</a>
</div>
<br/>
Once the job completes, MrJob copies the output back to the local machine, here are few lines from the file:
<br/>
<blockquote class="tr_bq">
"maladies"	1<br/>
"malaria"	5<br/>
"male"	18<br/>
"maleproducing"	1<br/>
"males"	5<br/>
"mammal"	10<br/>
"mammalInstinctive"	1<br/>
"mammalian"	4<br/>
"mammallike"	1<br/>
"mammals"	87<br/>
"mammoth"	5<br/>
"mammoths"	1<br/>
"man"	152
</blockquote>
I've been positively impressed that it is so easy to implement and run a MapReduce job with MrJob without the need to directly manage EC2 instances or the Hadoop installation.
<br/>
This same setup could be used on GB of data with hundreds of instances.
</div></p>Interactive figures in the browser: CMB Power Spectra2013-08-30T08:52:00-07:002013-08-30T08:52:00-07:00Andrea Zoncatag:zonca.github.io,2013-08-30:/2013/08/interactive-figures-planck-power-spectra.html<p>
For a long time I've been curious about trying out
<span style="font-family: Courier New, Courier, monospace;">
d3.js
</span>
, the javascript plotting library which is becoming the standard for interactive plotting in the browser.
<br/>
</p>
<div>
<br/>
</div>
<div>
What is really appealing is the capability of sharing with other people powerful interactive visualization simply via the link to a web page …</div><p>
For a long time I've been curious about trying out
<span style="font-family: Courier New, Courier, monospace;">
d3.js
</span>
, the javascript plotting library which is becoming the standard for interactive plotting in the browser.
<br/>
</p>
<div>
<br/>
</div>
<div>
What is really appealing is the capability of sharing with other people powerful interactive visualization simply via the link to a web page. This will hopefully be the future of scientific publications, as envisioned, for example, by
<a href="https://www.authorea.com/">
Authorea
</a>
.
</div>
<div>
<a name="more">
</a>
An interesting example related to my work on Planck is a plot of the many angular power spectra of the anisotropies of the Cosmic Microwave Background temperature.
</div>
<div>
The CMB Power spectra describe how the temperature fluctuations were distributed in the sky as a function of the angular scale, for example the largest peak at about 1 degree means that the brightest cold/warm spots of the CMB have that angular size, see
<a href="http://www.strudel.org.uk/blog/astro/001030.shtml">
The Universe Simulator in the browser
</a>
.
</div>
<div>
The
<a href="http://irsa.ipac.caltech.edu/data/Planck/release_1/ancillary-data/">
Planck Collaboration released
</a>
a combined spectrum, which aggregates several channels to give the best result, spectra frequency by frequency (for some frequencies split in detector-sets) and a best-fit spectrum given a Universe Model.
</div>
<div>
It is also interesting to compare to the latest release spectrum by WMAP with 9 years of data.
</div>
<div>
<br/>
</div>
<div>
The plan is to create a visualization where it is easier to zoom to different angular scales on the horizontal axis and quickly show/hide each curve.
</div>
<div>
For this I used
<a href="http://code.shutterstock.com/rickshaw/">
rickshaw
</a>
, a library based on
<span style="font-family: Courier New, Courier, monospace;">
d3.js
</span>
<span style="font-family: inherit;">
which makes it easier to create time-series plots.
</span>
</div>
<div>
<span style="font-family: inherit;">
In fact most of the features are already implemented, it is just a matter of configuring them, see the code on github:
</span>
<a href="https://github.com/zonca/visualize-planck-cl">
https://github.com/zonca/visualize-planck-cl
</a>
</div>
<div>
The most complex task is actually to load all the data, previously converted to JSON, in the background from the server and push them into a data structure which is understood by rickshaw.
</div>
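<p>The series structure rickshaw understands is a dict with a name and a data list of {x, y} points; the conversion step can be sketched as follows (names and the numbers below are illustrative):</p>

```python
# Sketch: convert one power spectrum into the series structure rickshaw
# understands, a dict with a "name" and a "data" list of {x, y} points
# (names and the numbers below are illustrative).
import json

def spectrum_to_series(name, ell, cl):
    return {"name": name,
            "data": [{"x": int(l), "y": float(c)} for l, c in zip(ell, cl)]}

series = spectrum_to_series("example", [2, 3, 4], [1.2e3, 9.8e2, 8.1e2])
payload = json.dumps([series])  # ship this to the browser
```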
<div>
<br/>
</div>
<div>
Check out the result:
</div>
<div style="text-align: center;">
<b>
<a href="http://bit.ly/planck-spectra">
http://bit.ly/planck-spectra
</a>
</b>
</div>
<div>
<br/>
</div>Planck CTP angular power spectrum ell binning2013-08-20T23:03:00-07:002013-08-20T23:03:00-07:00Andrea Zoncatag:zonca.github.io,2013-08-20:/2013/08/planck-ctp-angular-power-spectrum-ell.html<p>
Planck released a binning of the angular power spectrum in the Explanatory supplement,
<br/>
unfortunately the file is in PDF format, not easily machine-readable:
<br/>
<br/>
<a href="http://www.sciops.esa.int/wikiSI/planckpla/index.php?title=Frequency_maps_angular_power_spectra&instance=Planck_Public_PLA">
http://www.sciops.esa.int/wikiSI/planckpla/index.php?title=Frequency_maps_angular_power_spectra&instance=Planck_Public_PLA
</a>
<br/>
<br/>
So here is a csv version:
<br/>
<a href="https://gist.github.com/zonca/6288439">
https://gist.github.com/zonca/6288439
</a>
<br/>
<br/>
Follows embedded …</p><p>
Planck released a binning of the angular power spectrum in the Explanatory supplement,
<br/>
unfortunately the file is in PDF format, not easily machine-readable:
<br/>
<br/>
<a href="http://www.sciops.esa.int/wikiSI/planckpla/index.php?title=Frequency_maps_angular_power_spectra&instance=Planck_Public_PLA">
http://www.sciops.esa.int/wikiSI/planckpla/index.php?title=Frequency_maps_angular_power_spectra&instance=Planck_Public_PLA
</a>
<br/>
<br/>
So here is a csv version:
<br/>
<a href="https://gist.github.com/zonca/6288439">
https://gist.github.com/zonca/6288439
</a>
<br/>
<br/>
Follows embedded gist.
<br/>
<br/>
<a name="more">
</a>
<br/>
<br/>
<br/>
<script src="https://gist.github.com/zonca/6288439.js">
</script>
</p>HEALPix map of the Earth using healpy2013-08-08T19:07:00-07:002013-08-08T19:07:00-07:00Andrea Zoncatag:zonca.github.io,2013-08-08:/2013/08/healpix-map-of-earth-using-healpy.html<p>
HEALPix maps can also be used to create equal-area pixelized maps of the Earth; RGB colors are not supported in healpy, so we need to convert the image to a single color scale.
<br/>
The best use case is applying spherical harmonic transforms, e.g. a smoothing filter, in this case HEALPix …</p><p>
HEALPix maps can also be used to create equal-area pixelized maps of the Earth; RGB colors are not supported in healpy, so we need to convert the image to a single color scale.
<br/>
The best use case is applying spherical harmonic transforms, e.g. a smoothing filter, for which HEALPix/healpy tools are really efficient.
<br/>
However, other tools for transforming between angles (coordinates), 3d vectors and pixels might be useful.
<br/>
<br/>
<a name="more">
</a>
<br/>
I've created an IPython notebook that provides a simple example:
<br/>
<br/>
<a href="http://nbviewer.ipython.org/6187504">
http://nbviewer.ipython.org/6187504
</a>
<br/>
<br/>
Here is the output Mollweide projection provided by healpy:
<br/>
<br/>
</p>
<div class="separator" style="clear: both; text-align: center;">
<a href="http://zonca.github.io/images/healpix-map-of-earth-using-healpy_s1600_download.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;">
<img border="0" height="230" src="http://zonca.github.io/images/healpix-map-of-earth-using-healpy_s400_download.png" width="400"/>
</a>
</div>
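<p>The conversion to a single channel mentioned above can be done with a standard luminance weighting; a numpy sketch of just this step (the notebook linked above does the full pixelization):</p>

```python
# Sketch: collapse an (ny, nx, 3) RGB image to a single luminance channel,
# since a healpy map holds one scalar value per pixel
# (standard Rec. 601 luma weights).
import numpy as np

def rgb_to_gray(rgb):
    weights = np.array([0.299, 0.587, 0.114])
    return rgb[..., :3] @ weights
```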
<p><br/>
Few notes:
<br/>
<br/>
<div>
</div>
<br/>
<ul style="-webkit-text-stroke-width: 0px; color: black; font-family: 'Times New Roman'; font-size: medium; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px;">
<li>
always use
<span style="font-family: Courier New, Courier, monospace;">
flip="geo"
</span>
for plotting, otherwise maps are flipped East-West
</li>
<li>
increase the resolution of the plots (which is different from the resolution of the map array) by providing at least xsize=2000 to mollview and a reso lower than 1 to gnomview
</li>
</ul></p>Export google analytics data via API with Python2013-08-04T17:47:00-07:002013-08-04T17:47:00-07:00Andrea Zoncatag:zonca.github.io,2013-08-04:/2013/08/export-google-analytics-data-via-api.html<p>
Fun weekend hacking project: export google analytics data using the google APIs.
<br/>
<br/>
Clone the latest version of the API client from:
<br/>
<br/>
<a href="https://code.google.com/p/google-api-python-client">
https://code.google.com/p/google-api-python-client
</a>
<br/>
<br/>
there is an example for accessing analytics APIs in the samples/analytics folder,
<br/>
but you need to fill in client_secrets.json.
<br/>
<br/>
You can …</p><p>
Fun weekend hacking project: export google analytics data using the google APIs.
<br/>
<br/>
Clone the latest version of the API client from:
<br/>
<br/>
<a href="https://code.google.com/p/google-api-python-client">
https://code.google.com/p/google-api-python-client
</a>
<br/>
<br/>
there is an example for accessing analytics APIs in the samples/analytics folder,
<br/>
but you need to fill in client_secrets.json.
<br/>
<br/>
You can get the credentials from the APIs console:
<br/>
<br/>
<a href="https://code.google.com/apis/console">
https://code.google.com/apis/console
</a>
<br/>
<br/>
In SERVICES: activate google analytics
<br/>
In API Access: Create a "Client ID for installed applications" choosing "Other" as a platform
<br/>
<br/>
Copy the client id and the client secret to client_secrets.json.
<br/>
<br/>
<a name="more">
</a>
<br/>
Now you only need the profile ID of the google analytics account; it is in the google analytics web interface: just choose the website, then click on Admin, then on the profile name in the profile tab, and then on profile settings.
<br/>
<br/>
You can then run:
<br/>
<br/>
</p>
<blockquote class="tr_bq">
python core_reporting_v3_reference.py ga:PROFILEID
</blockquote>
<p>The first time you run it, it will open a browser for authentication, but then the auth token is saved and used for future requests.
<br/>
<br/>
This retrieves from the APIs the visits to the website from search, with keywords and the number of visits, for example for my blog:
<br/>
<br/>
<div class="highlight"><pre>Total Metrics For All Results:
This query returned 25 rows.
But the query matched 30 total results.
Here are the metric totals for the matched total results.
Metric Name = ga:visits
Metric Total = 174
Rows:
google    (not provided)                  121
google    andrea zonca                     17
google    butterworth filter python         4
google    andrea zonca blog                 2
google    healpix for ubuntu                2
google    healpy install ubuntu             2
google    python butterworth filter         2
google    zonca andrea                      2
google    andrea zonca buchrain luzern      1
google    andrea zonca it                   1
google    astrofisica in pillole            1
google    bin data healpy                   1
google    ellipticity fwhm                  1
google    enthought and healpy              1
google    fwhm                              1
google    healpix apt-get                   1
google    healpix repository ubuntu         1
google    healpix ubuntu 12.04 install      1
google    healpy ubuntu                     1
google    install healpix ubuntu            1
google    ipython cluster task output       1
google    numpy pink noise                  1
google    pink noise numpy                  1
google    python 1/f noise                  1
google    python apply mixin                1</pre></div>
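<p>The rows above come back from the API as lists of strings, so they are easy to post-process in Python. A minimal sketch with made-up sample rows (not real API output), aggregating visits per source:</p>

```python
# Illustrative post-processing of (source, keyword, visits) rows,
# with fabricated sample data -- not the actual API client code.
from collections import defaultdict

def total_visits_per_source(rows):
    """Sum the visits column (last field) grouped by source (first field)."""
    totals = defaultdict(int)
    for source, keyword, visits in rows:
        totals[source] += int(visits)
    return dict(totals)

rows = [
    ["google", "(not provided)", "121"],
    ["google", "andrea zonca", "17"],
    ["bing", "healpy", "3"],
]
print(total_visits_per_source(rows))  # {'google': 138, 'bing': 3}
```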
<div>
<br/>
</div></p>Processing sources in Planck maps with Hadoop and Python2013-07-15T08:16:00-07:002013-07-15T08:16:00-07:00Andrea Zoncatag:zonca.github.io,2013-07-15:/2013/07/processing-planck-sources-with-hadoop.html<h2>
Purpose
</h2>
<div>
The purpose of this post is to investigate how to process in parallel the sources extracted from full sky maps, in this case the maps released by Planck, using Hadoop instead of more traditional MPI-based HPC custom software.
</div>
<div>
Hadoop is the MapReduce implementation most used in the enterprise world and …</div><h2>
Purpose
</h2>
<div>
The purpose of this post is to investigate how to process in parallel the sources extracted from full sky maps, in this case the maps released by Planck, using Hadoop instead of more traditional MPI-based HPC custom software.
</div>
<div>
Hadoop is the MapReduce implementation most used in the enterprise world, and it has traditionally been used to process huge amounts of text data (~TBs), e.g. web pages or logs, over thousands of commodity computers connected over Ethernet.
</div>
<div>
It distributes the data across the nodes on a distributed file-system (HDFS) and then analyzes them locally on each node (the "map" step). The output of the map step is traditionally a set of text (key, value) pairs, which are sorted by the framework and passed to the "reduce" step, which typically aggregates them and saves them back to the distributed file-system.
</div>
<div>
Hadoop makes this process robust by rerunning failed jobs, distributing the data with redundancy and re-distributing it in case of failures, among many other features.
</div>
<div>
Most scientists use HPC supercomputers for running large data-processing software. Using HPC is necessary for algorithms that require frequent communication across the nodes, implemented via MPI calls over a dedicated high-speed network (e.g. InfiniBand). However, HPC resources are often used to run a large number of loosely coupled jobs, i.e. each job runs mostly independently of the others, with only some aggregation performed at the end. In these cases a robust and flexible framework like Hadoop can be beneficial.
</div>
<div>
<a name="more">
</a>
</div>
<h2>
Problem description
</h2>
<div>
The Planck collaboration (btw I'm part of it...) released in May 2013 a set of full sky maps in Temperature at 9 different frequencies and catalogs of point and extended galactic and extragalactic sources:
</div>
<div>
<a href="http://irsa.ipac.caltech.edu/Missions/planck.html">
http://irsa.ipac.caltech.edu/Missions/planck.html
</a>
</div>
<div>
Each catalog contains about 1000 sources, and the collaboration released the location and flux of each source.
</div>
<div>
The purpose of the analysis is to read each of the sky maps, slice out the section of the map around each source and perform some analysis on that patch of sky. As a simple example, to test the infrastructure, I am just going to compute the mean of the pixels located within 10 arcminutes of the center of each source.
</div>
<div>
In a production run, we might for example run aperture photometry on each source, or fit for the source center to check pointing accuracy.
</div>
<h2>
Sources
</h2>
<p>All files are available on github:
<br/>
<div>
<a href="https://github.com/zonca/planck-sources-hadoop">
https://github.com/zonca/planck-sources-hadoop
</a>
</div>
<h2>
Hadoop setup
</h2>
<div>
I am running on the San Diego Supercomputing data intensive cluster Gordon:
</div>
<div>
<a href="http://www.sdsc.edu/us/resources/gordon/">
http://www.sdsc.edu/us/resources/gordon/
</a>
</div>
<div>
SDSC has a simplified Hadoop setup based on shell scripts,
<a href="http://www.sdsc.edu/us/resources/gordon/gordon_hadoop.html">
myHadoop
</a>
, which allows running Hadoop as a regular PBS job.
</div>
<div>
The most interesting feature is that the Hadoop distributed file-system HDFS is set up on the low-latency local flash drives, one of the distinctive features of Gordon.
</div>
<h3>
Using Python with Hadoop-streaming
</h3>
<div>
Hadoop applications natively run in Java; however, thanks to Hadoop-streaming, we can use stdin and stdout to communicate with a script implemented in any programming language.
</div>
<div>
One of the most common choices for scientific applications is Python.
</div>
<h3>
Application design
</h3>
<div>
The best way to decrease the coupling between parallel jobs for this application is to analyze a patch of sky at a time, instead of one source at a time, looping through all the sources in that region.
</div>
<div>
This way the largest amount of data, the sky map, is read only once per process, and all of its sources are processed together. I pre-process the sky map by splitting it into 10x10 degree patches, saving a 2-column array with pixel index and map temperature (
<a href="https://github.com/zonca/planck-sources-hadoop/blob/master/preprocessing.py">
preprocessing.py
</a>
).
</div>
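<div>
The patch-splitting step can be sketched as follows. This is an illustrative stand-in with made-up data, not the actual preprocessing.py, which works on HEALPix maps:
</div>

```python
# Illustrative sketch of the patch-splitting idea (not the actual
# preprocessing.py): assign each pixel to a 10x10 degree patch keyed
# by the truncated longitude/latitude of its center.
import numpy as np

def patch_key(lon_deg, lat_deg, patch_size=10):
    """Return a (lon_bin, lat_bin) patch identifier for a pixel center."""
    lon_bin = int(lon_deg % 360) // patch_size * patch_size
    lat_bin = (int(lat_deg + 90) // patch_size) * patch_size - 90
    return lon_bin, lat_bin

# group (pixel index, temperature) pairs by patch (fabricated sample data)
pixels = np.arange(4)
lon = np.array([3.0, 12.5, 3.2, 355.0])
lat = np.array([41.0, 47.0, 42.0, -2.0])
temp = np.array([4.5e-4, 3.4e-4, 4.7e-4, 3.8e-4])

patches = {}
for p, lo, la, t in zip(pixels, lon, lat, temp):
    patches.setdefault(patch_key(lo, la), []).append((p, t))
```

<div>
Each patch would then be saved to its own binary file, named after the patch key.
</div>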
<div>
Of course this will produce jobs of very different lengths, due to the different effective sky area at the poles and at the equator and to the varying number of sources per patch, but that is not something we need to worry about: load balancing is exactly what Hadoop takes care of.
</div>
<h2>
Implementation
</h2>
<h3>
Input data
</h3>
<div>
The pre-processed patches of sky are available in binary format on a lustre file-system shared by the processes.
</div>
<div>
Therefore the text input files for the Hadoop jobs are just the lists of filenames of the sky patches, one per row.
</div>
<h3>
Mapper
</h3>
<div>
<a href="https://github.com/zonca/planck-sources-hadoop/blob/master/mapper.py">
mapper.py
</a>
</div>
<div>
<br/>
</div>
<div>
The mapper is fed by Hadoop via stdin with a number of lines extracted from the input files and returns a (key, value) text output for each source and for each statistic we compute on the source.
</div>
<div>
In this simple scenario, the only returned key printed to stdout is "SOURCENAME_10arcminmean".
</div>
<div>
For example, we can run a serial test by running:
</div>
<div>
<br/>
</div>
<div>
<div>
<span style="font-family: Courier New, Courier, monospace;">
echo plancktest/submaps/030_045_025 | ./mapper.py
</span>
</div>
</div>
<div>
<span style="font-family: Courier New, Courier, monospace;">
<br/>
</span>
</div>
<div>
<span style="font-family: inherit;">
and the returned output is:
</span>
</div>
<div>
<span style="font-family: inherit;">
<br/>
</span>
</div>
<div>
<div>
<span style="font-family: Courier New, Courier, monospace;">
PCCS1 030 G023.00+40.77_10arcminmean
<span class="Apple-tab-span" style="white-space: pre;">
</span>
4.49202e-04
</span>
</div>
<div>
<span style="font-family: Courier New, Courier, monospace;">
PCCS1 030 G023.13+42.14_10arcminmean
<span class="Apple-tab-span" style="white-space: pre;">
</span>
3.37773e-04
</span>
</div>
<div>
<span style="font-family: Courier New, Courier, monospace;">
PCCS1 030 G023.84+45.26_10arcminmean
<span class="Apple-tab-span" style="white-space: pre;">
</span>
4.69427e-04
</span>
</div>
<div>
<span style="font-family: Courier New, Courier, monospace;">
PCCS1 030 G024.32+48.81_10arcminmean
<span class="Apple-tab-span" style="white-space: pre;">
</span>
3.79832e-04
</span>
</div>
<div>
<span style="font-family: Courier New, Courier, monospace;">
PCCS1 030 G029.42+43.41_10arcminmean
<span class="Apple-tab-span" style="white-space: pre;">
</span>
4.11600e-04
</span>
</div>
<div style="font-family: inherit;">
<br/>
</div>
</div>
<h3>
Reducer
</h3>
<div>
There is no need for a reducer in this scenario, so Hadoop will just use the default IdentityReducer, which simply aggregates all the mappers' outputs into a single output file.
</div>
<h3>
Hadoop call
</h3>
<div>
<a href="https://github.com/zonca/planck-sources-hadoop/blob/master/run.pbs">
run.pbs
</a>
</div>
<div>
<br/>
</div>
<div>
The hadoop call is:
</div>
<div>
<br/>
</div>
<div>
<div>
<span style="font-family: Courier New, Courier, monospace;">
<code>
$HADOOP_HOME/bin/hadoop --config $HADOOP_CONF_DIR jar $HADOOP_HOME/contrib/streaming/hadoop-streaming*.jar -file $FOLDER/mapper.py -mapper $FOLDER/mapper.py -input /user/$USER/Input/* -output /user/$USER/Output
</code>
</span>
</div>
</div>
<div>
<br/>
</div>
<div>
So we are using the Hadoop-streaming interface and providing just the mapper; the input text files (the lists of sky patches) have already been copied to HDFS, and the output then needs to be copied from HDFS back to the local file-system, see run.pbs.
</div>
<h2>
Hadoop run and results
</h2>
<div>
For testing purposes we have used just 2 of the 9 maps (30 and 70 GHz), and processed a total of ~2000 sources running Hadoop on 4 nodes.
</div>
<div>
Processing takes about 5 minutes. Hadoop automatically chooses the number of mappers, and in this case only uses 2, as I think it reserves a couple of nodes for the scheduler and auxiliary processes.
</div>
<div>
The outputs of the mappers are then joined, sorted and written to a single file; see the output file
</div>
<div>
<a href="https://github.com/zonca/planck-sources-hadoop/blob/master/output/SAMPLE_RESULT_part-00000">
output/SAMPLE_RESULT_part-00000
</a>
.
</div>
<div>
See the full log
<a href="https://github.com/zonca/planck-sources-hadoop/blob/master/sample_logs.txt">
sample_logs.txt
</a>
extracted running:
</div>
<div>
<span style="font-family: Courier New, Courier, monospace;">
/opt/hadoop/bin/hadoop job -history output
</span>
</div>
<h3>
<span style="font-family: inherit;">
Comparison of the results with the catalog
</span>
</h3>
<div>
<span style="font-family: inherit;">
Just for a rough consistency check, I compared the normalized temperatures computed with Hadoop using just the mean of the pixels in a radius of 10 arcmin to the fluxes computed by the Planck collaboration. I find a general agreement with the expected noise excess.
</span>
</div>
<div>
<br/>
<div class="separator" style="clear: both; text-align: left;">
<a href="http://zonca.github.io/images/processing-planck-sources-with-hadoop_s1600_download.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;">
<img border="0" src="http://zonca.github.io/images/processing-planck-sources-with-hadoop_s1600_download.png"/>
</a>
</div>
<h2>
Conclusion
</h2>
<div>
The main advantage of using Hadoop is scalability: this same setup could be used on AWS or Cloudera with hundreds of nodes. All the complexity of scaling is managed by Hadoop.
</div>
<div>
The main concern is loading the data: on an HPC supercomputer it is easy to load directly from a high-performance shared disk, while in a cloud environment we might opt for a similar setup loading data from S3. The best option, however, would be to use Hadoop itself and stream the data to the mapper in the input files. This is complicated by the fact that Hadoop-streaming only supports text, not binary, so the options would be either to find a way to pack the binary data into a text file or to use Hadoop-pipes instead of Hadoop-streaming.
</div>
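<div>
As a sketch of the text-packing option (an assumption on my part, not something used in this run): binary pixel data can be base64-encoded into one text line per patch, which fits through Hadoop-streaming's text-only channel.
</div>

```python
# Sketch: round-trip a binary array through a single base64 text line,
# so it could travel through Hadoop-streaming's text-only interface.
import base64
import numpy as np

data = np.array([4.5e-4, 3.4e-4, 4.7e-4])
line = "patch_030_045\t" + base64.b64encode(data.tobytes()).decode("ascii")

# the mapper would split the line and decode the payload
name, payload = line.split("\t")
decoded = np.frombuffer(base64.b64decode(payload))
assert np.array_equal(decoded, data)
```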
<div>
<br/>
</div>
</div></p>How to use the IPython notebook on a small computing cluster2013-06-22T11:12:00-07:002013-06-22T11:12:00-07:00Andrea Zoncatag:zonca.github.io,2013-06-22:/2013/06/how-to-use-ipython-notebook-on-small.html<p><a href="http://ipython.org/ipython-doc/dev/interactive/htmlnotebook.html">The IPython notebook</a> is a powerful and easy to use interface for using Python and particularly useful when running remotely, because it allows the interface to run locally in your browser, while the computing kernel runs remotely on the cluster.</p>
<h2>1) Configure IPython notebook:</h2>
<p>First time you use the notebook …</p><p><a href="http://ipython.org/ipython-doc/dev/interactive/htmlnotebook.html">The IPython notebook</a> is a powerful and easy to use interface for using Python and particularly useful when running remotely, because it allows the interface to run locally in your browser, while the computing kernel runs remotely on the cluster.</p>
<h2>1) Configure IPython notebook:</h2>
<p>The first time you use the notebook you need to follow these configuration steps:</p>
<ul>
<li>Login to the cluster</li>
<li>
<p>Load the python environment, for example:</p>
<div class="highlight"><pre><span></span><span class="err">module load pythonEPD</span>
</pre></div>
</li>
<li>
<p>Create the profile files:</p>
<div class="highlight"><pre><span></span><span class="err">ipython profile create # creates the configuration files</span>
<span class="err">vim .ipython/profile_default/ipython_notebook_config.py</span>
</pre></div>
<p>set a password, see instructions in the file.</p>
</li>
<li>
<p>Change the port to something specific to you, <strong>please change this to avoid conflict with other users</strong>:</p>
<div class="highlight"><pre><span></span><span class="err">c.NotebookApp.port = 8900</span>
</pre></div>
</li>
<li>
<p>Set a certificate to serve the notebook over https:</p>
<div class="highlight"><pre><span></span><span class="err">c.NotebookApp.certfile = u'/home/zonca/mycert.pem'</span>
</pre></div>
<p>or create a new certificate, see <a href="http://ipython.org/ipython-doc/dev/interactive/htmlnotebook.html">the documentation</a></p>
</li>
<li>
<p>Set:</p>
<div class="highlight"><pre><span></span><span class="err">c.NotebookApp.open_browser = False</span>
</pre></div>
</li>
</ul>
<h2>2) Run the notebook for testing on the login node.</h2>
<p>You can use IPython notebook on the login node if you do not use much memory, e.g. < 300MB.
SSH into the login node, then at the terminal run:</p>
<div class="highlight"><pre><span></span><span class="err">ipython notebook --pylab=inline</span>
</pre></div>
<p>open the browser on your local machine and connect to (always use https, replace 8900 with your port):</p>
<div class="highlight"><pre><span></span><span class="c">https://LOGINNODEURL:8900</span>
</pre></div>
<p>Dismiss all the browser complaints about the certificate and go ahead.</p>
<h2>3) Run the notebook on a computing node</h2>
<p>You should always use a computing node whenever you need a large amount of resources.</p>
<p>Create a folder <code>notebooks/</code> in your home and copy this script as <code>runipynb.pbs</code> into that folder:</p>
<script src="https://gist.github.com/zonca/5840518.js">
</script>
<p>replace <code>LOGINNODEURL</code> with the url of the login node of your cluster.</p>
<p>NOTICE: you need to ask the sysadmin to set <code>GatewayPorts yes</code> in <code>sshd_config</code> on the login node to allow access externally to the notebook.</p>
<p>Submit the job to the queue running:</p>
<div class="highlight"><pre><span></span><span class="err">qsub runipynb.pbs</span>
</pre></div>
<p>Then from your local machine connect to (replace 8900 with your port):</p>
<div class="highlight"><pre><span></span><span class="c">https://LOGINNODEURL:8900</span>
</pre></div>
<h2>Other introductory python resources</h2>
<ul>
<li><a href="http://scipy-lectures.github.io/">Scientific computing with Python</a>, large and detailed introduction to Python, Numpy, Matplotlib, Scipy</li>
<li>My <a href="https://github.com/zonca/PythonHPC">Python for High performance computing</a>: slides and few ipython notebook examples, see the README</li>
<li>My <a href="https://github.com/zonca/healpytut/blob/master/healpytut.pdf?raw=true">short Python and healpy tutorial</a></li>
</ul>IPython parallell setup on Carver at NERSC2013-04-11T05:53:00-07:002013-04-11T05:53:00-07:00Andrea Zoncatag:zonca.github.io,2013-04-11:/2013/04/ipython-parallell-setup-on-carver-at.html<p>
IPython parallel is one of the easiest ways to spawn several Python sessions on a Supercomputing cluster and process jobs in parallel.
<br/>
<br/>
On Carver, the basic setup is running a controller on the login node, and submitting engines to the computing nodes via PBS.
<br/>
<br/>
<a name="more">
</a>
<br/>
First create your configuration files running …</p><p>
IPython parallel is one of the easiest ways to spawn several Python sessions on a Supercomputing cluster and process jobs in parallel.
<br/>
<br/>
On Carver, the basic setup is running a controller on the login node, and submitting engines to the computing nodes via PBS.
<br/>
<br/>
<a name="more">
</a>
<br/>
First create your configuration files running:
<br/>
<br/>
<span style="font-family: Courier New, Courier, monospace;">
ipython profile create --parallel
</span>
<br/>
<br/>
Then in ~/.config/ipython/profile_default/ipcluster_config.py you just need to set:
<br/>
<br/>
<span style="font-family: Courier New, Courier, monospace;">
c.IPClusterStart.controller_launcher_class = 'LocalControllerLauncher'
</span>
<br/>
<span style="font-family: Courier New, Courier, monospace;">
c.IPClusterStart.engine_launcher_class = 'PBS'
</span>
<br/>
<span style="font-family: Courier New, Courier, monospace;">
c.PBSLauncher.batch_template_file = u'~/.config/ipython/profile_default/pbs.engine.template'
</span>
<br/>
<br/>
You also need to allow connections to the controller from other hosts, setting in ~/.config/ipython/profile_default/ipcontroller_config.py:
<br/>
<br/>
<span style="font-family: Courier New, Courier, monospace;">
c.HubFactory.ip = '*'
</span>
<br/>
</p>
<div>
<br/>
</div>
<p>The batch_template_file option above points to the PBS engine template.
<br/>
<br/>
Next a couple of examples of pbs templates, for 2 or 8 processes per node:
<script src="https://gist.github.com/zonca/5334225.js">
</script>
<br/>
IPython configuration does not seem to be flexible enough to add a parameter for specifying the processes per node.
<br/>
So I just created a bash script that gets as parameters the processes per node and the total number of nodes:
<br/>
<br/>
<span style="font-family: Courier New, Courier, monospace;">
ipc 8 2 # 2 nodes with 8ppn, 16 total engines
</span>
<br/>
<span style="font-family: Courier New, Courier, monospace;">
ipc 2 3 # 3 nodes with 2ppn, 6 total engines
</span>
<br/>
<br/>
<span style="font-family: inherit;">
Once the engines are running, jobs can be submitted opening an IPython shell on the login node and run:
</span>
<br/>
<span style="font-family: inherit;">
<br/>
</span>
<br/>
<div class="highlight"><pre>from IPython.parallel import Client
rc = Client()

lview = rc.load_balanced_view() # default load-balanced view

def serial_func(argument):
    pass

parallel_result = lview.map(serial_func, list_of_arguments)</pre></div>
<br/>
<div style="font-family: inherit;">
<br/>
</div>
<div>
<span style="font-family: inherit;">
The serial function is sent to the engines and executed for each element of the list of arguments.
</span>
</div>
<div>
<span style="font-family: inherit;">
If the function returns a value, then it is transferred back to the login node.
</span>
</div>
<div>
<span style="font-family: inherit;">
If the returned values are memory-consuming, it is also possible to keep the controller on the login node but execute the interactive IPython session in an interactive job.
</span>
</div>
<div style="font-family: inherit;">
<br/>
</div>
<div style="font-family: inherit;">
<br/>
</div></p>Simple Mixin usage in python2013-04-08T01:34:00-07:002013-04-08T01:34:00-07:00Andrea Zoncatag:zonca.github.io,2013-04-08:/2013/04/simple-mixin-usage-in-python.html<p>
One situation where Mixins are useful in Python is when you need to modify a method of similar classes that you are importing from a package.
<br/>
</p>
<div>
<br/>
</div>
<div>
For just a single class, it is easier to just create a derived class, but if the same modification must be applied to several …</div><p>
One situation where Mixins are useful in Python is when you need to modify a method of similar classes that you are importing from a package.
<br/>
</p>
<div>
<br/>
</div>
<div>
For just a single class, it is easier to just create a derived class, but if the same modification must be applied to several classes, then it is cleaner to implement this modification once in a Mixin and then apply it to all of them.
</div>
<div>
<br/>
<a name="more">
</a>
</div>
<div>
Here an example in Django:
</div>
<div>
<br/>
</div>
<div>
Django has several generic view classes that allow pulling objects from the database and feeding them to the HTML templates.
</div>
<div>
<br/>
</div>
<div>
One for example shows the detail of a specific object:
</div>
<div>
<br/>
</div>
<div>
<span style="font-family: Courier New, Courier, monospace;">
from django.views.generic.detail import DetailView
</span>
</div>
<div>
<div>
<br/>
</div>
<div>
This class has a get_object method that gets an object from the database given a primary key.
</div>
<div>
We need to modify this method to allow access to an object only to the user that owns it.
</div>
<div>
<br/>
</div>
<div>
We first implement a Mixin, i.e. an independent class that only implements the method we wish to override:
</div>
<div>
<br/>
</div>
<div class="highlight"><pre>class OwnedObjectMixin(object):
    def get_object(self, *args, **kwargs):
        obj = super(OwnedObjectMixin, self).get_object(*args, **kwargs)
        if not obj.user == self.request.user:
            raise Http404
        return obj</pre></div>
<div>
<br/>
</div>
</div>
<div>
<span style="font-family: inherit;">
Then we create a new derived class which inherits both from the Mixin and from the class we want to modify.
</span>
</div>
<div>
<span style="font-family: inherit;">
<br/>
</span>
</div>
<div class="highlight"><pre>class ProtectedDetailView(OwnedObjectMixin, DetailView):
    pass</pre></div>
<div>
<span style="font-family: Courier New, Courier, monospace;">
<br/>
</span>
</div>
<div>
This overrides the get_object method of DetailView with the one from OwnedObjectMixin, while the call to super invokes the get_object method of DetailView. It has the same effect as subclassing DetailView and overriding get_object, but the same Mixin can be applied to other classes.
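<br/>
<br/>
The same mechanism can be demonstrated in plain Python, outside Django, with hypothetical classes (just to illustrate the method resolution order):

```python
# Plain-Python illustration of the mixin pattern: the mixin's method runs
# first, delegates to the base class via super(), and adds a check on top.
class Base(object):
    def get_object(self):
        return "object"

class CheckingMixin(object):
    def get_object(self):
        result = super(CheckingMixin, self).get_object()
        return "checked:" + result

class Combined(CheckingMixin, Base):
    pass

print(Combined().get_object())  # checked:object
print([c.__name__ for c in Combined.__mro__])
# ['Combined', 'CheckingMixin', 'Base', 'object']
```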
</div>Noise in spectra and map domain2013-04-08T01:32:00-07:002013-04-08T01:32:00-07:00Andrea Zoncatag:zonca.github.io,2013-04-08:/2013/04/noise-in-spectra-and-map-domain.html<h3>
Spectra
</h3>
<p>NET or $\sigma$ is the standard deviation of the noise, measured in mK/sqrt(Hz), typical values for microwave amplifiers are 0.2-5.
<br/>
This is the natural unit of the amplitude spectra (ASD), therefore the high frequency tail of the ASD should get to the expected value of the …</p><h3>
Spectra
</h3>
<p>NET or $\sigma$ is the standard deviation of the noise, measured in mK/sqrt(Hz), typical values for microwave amplifiers are 0.2-5.
<br/>
This is the natural unit of the amplitude spectra (ASD), therefore the high frequency tail of the ASD should get to the expected value of the NET.
<br/>
NET can also be expressed in mK sqrt(s), which is NOT the same unit.
<br/>
<b>
mK/sqrt(Hz)
</b>
refers to an integration bandwidth of 1 Hz that assumes a 6 dB/octave rolloff; its integration time is only about 0.5 seconds.
<br/>
<b>
mK/sqrt(s)
</b>
instead refers to integration time of 1 second, therefore assumes a top hat bandpass.
<br/>
There is therefore a factor of sqrt(2) between the two conventions: mK/sqrt(Hz) = sqrt(2) * mK sqrt(s)
<br/>
See appendix B of Noise Properties of the Planck-LFI Receivers
<br/>
<a href="http://arxiv.org/abs/1001.4608">
http://arxiv.org/abs/1001.4608
</a>
<br/>
<h3>
Maps
</h3>
To estimate the map-domain noise we instead need to integrate the sigma over the time spent per pixel; in this case it is easier to convert the noise to sigma per sample, multiplying by the square root of the sampling frequency:
<br/>
<br/>
$\sigma_{\rm sample} = {\rm NET} \sqrt{f_{\rm samp}}$
<br/>
<br/>
Then the variance per pixel is $\sigma_{\rm sample}^2 / hits$.
<br/>
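<br/>
Putting the two formulas together with made-up instrument numbers (not Planck values), a minimal sketch:

```python
# Sketch of the map-domain noise estimate with fabricated numbers:
# NET in mK/sqrt(Hz), sampling frequency in Hz, hits per pixel.
import math

net = 0.5            # mK/sqrt(Hz)
f_samp = 32.5        # Hz
hits = 1000          # samples accumulated in one pixel

sigma_per_sample = net * math.sqrt(f_samp)      # mK per sample
pixel_variance = sigma_per_sample**2 / hits     # mK^2 per pixel
pixel_sigma = math.sqrt(pixel_variance)         # mK per pixel
```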
<h3>
Angular power spectra
</h3>
<div>
$C_\ell$ of the variance map is just the variance map multiplied by the pixel area divided by the integration time.
<br/>
<br/>
$$C_\ell = \Omega_{\rm pix} \langle \frac{\sigma^2}{\tau} \rangle = \Omega_{\rm pix} \langle \frac{\sigma^2 f_{\rm samp}}{hits} \rangle$$
</div></p>Basic fork/pull git workflow2013-04-06T07:52:00-07:002013-04-06T07:52:00-07:00Andrea Zoncatag:zonca.github.io,2013-04-06:/2013/04/basic-forkpull-git-workflow.html<div dir="ltr">
Typical simple workflow for a (github) repository with a few users.
</div>
<div dir="ltr">
<b>
<br/>
</b>
</div>
<div dir="ltr">
<b>
Permissions configuration:
</b>
</div>
<div dir="ltr">
Main developers have write access to the repository; occasional contributors are expected to fork and create pull requests.
</div>
<div dir="ltr">
</div>
<p><a name="more">
</a>
<br/>
<div dir="ltr">
<b>
Main developer:
</b>
Small bug fixes go directly into master:
</div>
<div dir="ltr">
<span style="font-family: Courier New, Courier, monospace;">
<br/>
</span>
</div>
<div dir="ltr">
<span style="font-family: Courier New, Courier, monospace;">
git checkout master
<br/>
# update from repository, better use rebase in …</span></div></p><div dir="ltr">
Typical simple workflow for a (github) repository with a few users.
</div>
<div dir="ltr">
<b>
<br/>
</b>
</div>
<div dir="ltr">
<b>
Permissions configuration:
</b>
</div>
<div dir="ltr">
Main developers have write access to the repository; occasional contributors are expected to fork and create pull requests.
</div>
<div dir="ltr">
</div>
<p><a name="more">
</a>
<br/>
<div dir="ltr">
<b>
Main developer:
</b>
Small bug fixes go directly into master:
</div>
<div dir="ltr">
<span style="font-family: Courier New, Courier, monospace;">
<br/>
</span>
</div>
<div dir="ltr">
<span style="font-family: Courier New, Courier, monospace;">
git checkout master
<br/>
# update from repository, better use rebase in case there are unpushed commits
<br/>
git pull --rebase
<br/>
git commit -m "commit message"
<br/>
git push
</span>
</div>
<div dir="ltr">
<br/>
</div>
<div dir="ltr">
More complex feature, better use a branch:
</div>
<div dir="ltr">
<span style="font-family: Courier New, Courier, monospace;">
<br/>
</span>
</div>
<div dir="ltr">
<span style="font-family: Courier New, Courier, monospace;">
git checkout -b featurebranch
<br/>
git commit -am "commit message"
<br/>
# work and make several commits
<br/>
# backup and share to github
<br/>
git push origin featurebranch
</span>
</div>
<div dir="ltr">
<span style="font-family: Courier New, Courier, monospace;">
<br/>
</span>
</div>
<div dir="ltr">
<span style="font-family: inherit;">
When ready to merge (cannot push cleanly anymore after any rebasing):
</span>
<span style="font-family: Courier New, Courier, monospace;">
<br/>
</span>
<br/>
<span style="font-family: inherit;">
<br/>
</span>
</div>
<div dir="ltr">
<span style="font-family: Courier New, Courier, monospace;">
# reorder, squash some similar commits, better commit msg
</span>
<br/>
<span style="font-family: Courier New, Courier, monospace;">
git rebase -i HEAD~10
</span>
<br/>
<span style="font-family: Courier New, Courier, monospace;">
# before merging move commits all together to the end of history
</span>
<br/>
<span style="font-family: Courier New, Courier, monospace;">
git rebase master
</span>
<br/>
<span style="font-family: Courier New, Courier, monospace;">
git checkout master
</span>
<br/>
<span style="font-family: Courier New, Courier, monospace;">
git merge featurebranch
</span>
<br/>
<span style="font-family: Courier New, Courier, monospace;">
git push
</span>
<br/>
<span style="font-family: Courier New, Courier, monospace;">
# branch is fully merged, no need to keep it
</span>
<br/>
<span style="font-family: Courier New, Courier, monospace;">
git branch -d featurebranch
</span>
<br/>
<span style="font-family: Courier New, Courier, monospace;">
git push origin --delete featurebranch
</span>
</div>
<div dir="ltr">
<br/>
</div>
<div dir="ltr">
Optionally, if the feature requires discussion within the team, it is better to create a pull request.
<br/>
After cleanup and rebase, instead of merging to master:
<br/>
<span style="font-family: Courier New, Courier, monospace;">
<br/>
</span>
</div>
<div dir="ltr">
<span style="font-family: Courier New, Courier, monospace;">
# create new branch
<br/>
git checkout -b readyfeaturebranch
<br/>
git push origin readyfeaturebranch
</span>
</div>
<div dir="ltr">
Go to GitHub and create a pull request from the new branch to master (GitHub now offers a shortcut for creating a pull request from the last pushed branch).
</div>
<div dir="ltr">
<br/>
</div>
<div dir="ltr">
During the discussion on the pull request, any commit to the readyfeaturebranch is added to the pull request.
<br/>
When ready, either merge automatically on GitHub or do it manually as above.
</div>
<div dir="ltr">
<br/>
</div>
<div dir="ltr">
<b>
For occasional developers:
</b>
Just fork the repo on github to their account, work on a branch there, and then create a pull request on the github web interface from the branch to master on the main repository.
</div></p>Interactive 3D plot of a sky map2013-03-12T19:49:00-07:002013-03-12T19:49:00-07:00Andrea Zoncatag:zonca.github.io,2013-03-12:/2013/03/interactive-3d-plot-of-sky-map.html<p><a href="http://code.enthought.com/projects/mayavi/">
Mayavi
</a>
is a Python package from Enthought for 3D visualization. Here is a simple example of creating an interactive 3D map from a HEALPix-pixelized sky map:
<br/>
<div>
<br/>
<div>
<div class="separator" style="clear: both; text-align: center;">
<a href="http://zonca.github.io/images/interactive-3d-plot-of-sky-map_s1600_snapshot.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;">
<img border="0" height="271" src="http://zonca.github.io/images/interactive-3d-plot-of-sky-map_s400_snapshot.png" width="400"/>
</a>
</div>
<div class="separator" style="clear: both; text-align: center;">
<br/>
</div>
<br/>
<a name="more">
</a>
<br/>
Here is the code:
<br/>
<script src="https://gist.github.com/zonca/5146356.js">
</script>
<br/>
<br/>
The output is a beautiful interactive 3D map; Mayavi allows you to pan, zoom, and rotate.
<br/>
UPDATE 13 Mar: there was a bug (found by Marius Millea) in the script; there is no problem in the projection!
<br/>
<div class="separator" style="clear: both; text-align: center;">
<br/>
</div>
Mayavi can be installed on Ubuntu by installing
<span style="font-family: Courier New, Courier, monospace;">
python-vtk
</span>
and then running
<span style="font-family: Courier New, Courier, monospace;">
sudo pip install mayavi.
</span>
</div>
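The essential step behind a plot like this is mapping each pixel's spherical coordinates (theta, phi) to Cartesian points on the unit sphere, which Mayavi then renders. A quick pure-Python sketch of that conversion — the theta/phi grid below is only a stand-in for what healpy's pix2ang would return, and the mlab call is indicative, not tested here:

```python
import math

def sph2cart(theta, phi, r=1.0):
    """Convert colatitude theta and longitude phi to Cartesian (x, y, z)."""
    return (r * math.sin(theta) * math.cos(phi),
            r * math.sin(theta) * math.sin(phi),
            r * math.cos(theta))

# Stand-in for healpy.pix2ang: a coarse theta/phi grid over the sphere.
points = [sph2cart(t * math.pi / 10, p * 2 * math.pi / 20)
          for t in range(11) for p in range(20)]

# With Mayavi one would then render the points, e.g.:
# from mayavi import mlab
# mlab.points3d(*zip(*points), scale_factor=0.05)
```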
</div></p>How to cite HDF5 in bibtex2013-02-27T00:42:00-08:002013-02-27T00:42:00-08:00Andrea Zoncatag:zonca.github.io,2013-02-27:/2013/02/how-to-cite-hdf5-in-bibtex.html<p>
Here is the BibTeX entry:
<br/>
<br/>
<script src="https://gist.github.com/zonca/5043796.js">
</script>
<br/>
Reference:
<br/>
<a href="http://www.hdfgroup.org/HDF5-FAQ.html#gcite">
http://www.hdfgroup.org/HDF5-FAQ.html#gcite
</a>
</p>Compile healpix C++ to javascript2013-01-28T21:06:00-08:002013-01-28T21:06:00-08:00Andrea Zoncatag:zonca.github.io,2013-01-28:/2013/01/tag:blogger.html<p>
Compile C++ -> LLVM with clang
<br/>
<br/>
Convert LLVM -> Javascript:
<br/>
<a href="https://github.com/kripken/emscripten/wiki/Tutorial">
https://github.com/kripken/emscripten/wiki/Tutorial
</a>
</p>Elliptic beams, FWHM and ellipticity2013-01-18T00:58:00-08:002013-01-18T00:58:00-08:00Andrea Zoncatag:zonca.github.io,2013-01-18:/2013/01/elliptic-beams-fwhm-and-ellipticity.html<p><span style="background-color: white; color: #222222; font-family: arial, sans-serif; font-size: 13px;">
The relationship between the Full Width Half Max, FWHM (min, max, and average) and the
</span>
<br/>
<span style="background-color: white; color: #222222; font-family: arial, sans-serif; font-size: 13px;">
ellipticity is:
</span>
<br/>
<br style="background-color: white; color: #222222; font-family: arial, sans-serif; font-size: 13px;"/>
<span style="font-family: Courier New, Courier, monospace;">
<span style="background-color: white; color: #222222; font-size: 13px;">
FWHM = sqrt(FWHM_min * FWHM_max)
</span>
<br style="background-color: white; color: #222222; font-size: 13px;"/>
<span style="background-color: white; color: #222222; font-size: 13px;">
e = FWHM_max/FWHM_min
</span>
</span>
<br/>
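These relations are straightforward to encode; a quick sketch (units are whatever the FWHMs are given in, e.g. arcmin):

```python
import math

def beam_fwhm_and_ellipticity(fwhm_min, fwhm_max):
    """Geometric-mean FWHM and ellipticity of an elliptic beam."""
    fwhm = math.sqrt(fwhm_min * fwhm_max)
    e = fwhm_max / fwhm_min
    return fwhm, e

# e.g. a 30 x 33 arcmin beam:
fwhm, e = beam_fwhm_and_ellipticity(30.0, 33.0)  # fwhm ~ 31.46, e = 1.1
```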
<span style="background-color: white; color: #222222; font-family: arial, sans-serif; font-size: 13px;">
<br/>
</span></p>Ubuntu PPA for HEALPix and healpy2012-12-17T10:37:00-08:002012-12-17T10:37:00-08:00Andrea Zoncatag:zonca.github.io,2012-12-17:/2012/12/ubuntu-ppa-for-healpix-and-healpy.html<p><br/>
<b>
HEALPix C, C++
</b>
version 3.00 and
<b>
healpy
</b>
version 1.4.1 are now available in a PPA repository for Ubuntu 12.04 Precise and Ubuntu 12.10 Quantal.
<br/>
<br/>
First remove your previous version of
<span style="font-family: Courier New, Courier, monospace;">
healpy
</span>
, just find the location of the package:
<br/>
<br/>
<span style="font-family: Courier New, Courier, monospace;">
> python -c "import healpy; print healpy.__file__"
</span>
<br/>
<br/>
and remove it:
<br/>
<br/>
<span style="font-family: Courier New, Courier, monospace;">
> sudo rm -r /some-base-path/site-packages/healpy*
</span>
<br/>
<div style="font-family: 'Courier New', Courier, monospace;">
<span style="font-family: Courier New, Courier, monospace;">
<br/>
</span>
</div>
<span style="font-family: inherit;">
Then add the apt repository and install the packages:
</span>
<br/>
<div style="font-family: 'Courier New', Courier, monospace;">
<span style="font-family: Courier New, Courier, monospace;">
<br/>
</span>
</div>
<span style="font-family: Courier New, Courier, monospace;">
> sudo add-apt-repository ppa:zonca/healpix
</span>
<br/>
<span style="font-family: Courier New, Courier, monospace;">
> sudo apt-get update
</span>
<br/>
<span style="font-family: Courier New, Courier, monospace;">
> sudo apt-get install healpix-cxx libhealpix-cxx-dev
</span>
<span style="font-family: 'Courier New', Courier, monospace;">
</span>
<span style="font-family: 'Courier New', Courier, monospace;">
libchealpix0
</span>
<span style="font-family: 'Courier New', Courier, monospace;">
libchealpix-dev python-healpy
</span>
<br/>
<br/>
<div>
<div>
<span style="font-family: Courier New, Courier, monospace;">
> which anafast_cxx
</span>
</div>
<div>
<span style="font-family: Courier New, Courier, monospace;">
/usr/bin/anafast_cxx
</span>
</div>
</div>
<div>
<span style="font-family: Courier New, Courier, monospace;">
</span>
<br/>
<div>
<span style="font-family: Courier New, Courier, monospace;">
> python -c "import healpy; print healpy.__version__"
</span>
</div>
<span style="font-family: Courier New, Courier, monospace;">
</span>
<br/>
<div>
<span style="font-family: Courier New, Courier, monospace;">
1.4.1
</span>
</div>
</div></p>Butterworth filter with Python2012-10-06T00:00:00-07:002012-10-06T00:00:00-07:00Andrea Zoncatag:zonca.github.io,2012-10-06:/2012/10/butterworth-filter-with-python.html<p>
Using IPython notebook of course:
<br/>
<br/>
<a href="http://nbviewer.ipython.org/3843014/">
http://nbviewer.ipython.org/3843014/
</a>
</p>IPython.parallel for Planck data analysis at NERSC2012-09-27T06:24:00-07:002012-09-27T06:24:00-07:00Andrea Zoncatag:zonca.github.io,2012-09-27:/2012/09/ipythonparallel-for-planck-data.html<p><a href="http://www.esa.int/planck">
Planck
</a>
is a Space mission for high precision measurements of the
<a href="http://en.wikipedia.org/wiki/Cosmic_microwave_background_radiation">
Cosmic Microwave Background
</a>
(CMB); data are received as timestreams of output voltages from the two instruments on board, the Low and High Frequency Instruments [LFI / HFI].
<br/>
<br/>
The key phase in data reduction is map-making, where data are binned into a map of the microwave emission of our galaxy, the CMB, and extragalactic sources. This phase is intrinsically parallel and requires simultaneous access to all the data, so it requires fully parallel MPI-based software.
<br/>
<br/>
However, preparing the data for map-making involves several tasks that are serial per file but data- and I/O-intensive, and therefore need to be parallelized across files.
<br/>
<br/>
<a name="more">
</a>
<br/>
IPython.parallel offers the easiest solution for managing a large number of trivially parallel jobs.
<br/>
<br/>
The first task is pointing reconstruction, where we interpolate and apply several rotations and corrections to low-sampled satellite quaternions stored on disk and then write the output dense detector pointing to disk.
<br/>
The pointing files total about 2.5 TB split into about 3000 files; these files can be processed independently, so we implement a function that processes one file, to be used interactively for debugging and testing.
<br/>
We then launch an IPython cluster, typically between 20 and 300 engines on Carver (NERSC), and use the exact same function to process all ~3000 files in parallel.
<br/>
The IPython
<a href="http://ipython.org/ipython-doc/dev/api/generated/IPython.parallel.client.view.html?highlight=apply_async#IPython.parallel.client.view.LoadBalancedView">
BalancedView
</a>
controller automatically balances the queue, so we get maximum efficiency; it is also possible to leave the cluster running and submit other instances of the job to be added to its queue.
<br/>
<br/>
The second task is calibration and dipole removal, which processes about 1.2 TB of data but needs to read the dense pointing from disk, so it is very I/O intensive. Also in this case we can submit the ~3000 jobs to an IPython.parallel cluster.
<br/>
<br/>
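The pattern above — write a function that processes one file, debug it interactively, then map the very same function over a load-balanced pool of workers — is not specific to IPython.parallel. A minimal sketch of the same idea with the standard library's concurrent.futures, as a stand-in (the per-file function below is a hypothetical placeholder, not the actual pipeline code):

```python
from concurrent.futures import ProcessPoolExecutor

def process_pointing_file(filename):
    """Stand-in for the per-file task: interpolate quaternions,
    apply corrections, write dense detector pointing to disk."""
    return "processed " + filename

filenames = ["pointing_%04d.fits" % i for i in range(8)]

if __name__ == "__main__":
    # First debug interactively on one file...
    print(process_pointing_file(filenames[0]))
    # ...then map the exact same function over a pool of workers.
    with ProcessPoolExecutor(max_workers=4) as pool:
        results = list(pool.map(process_pointing_file, filenames))
```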
In a next post I'll describe in detail my setup and how I organize my code to make it easy to swap back and forth between debugging code interactively and running production runs in parallel.</p>homepage on about.me2012-09-26T22:19:00-07:002012-09-26T22:19:00-07:00Andrea Zoncatag:zonca.github.io,2012-09-26:/2012/09/homepage-on-aboutme.html<p>
moved my homepage to about.me:
<br/>
<br/>
<a href="http://about.me/andreazonca">
http://about.me/andreazonca
</a>
<br/>
<br/>
it is quite nice and minimal, as most of it is just links to other websites, e.g. arXiv for publications, LinkedIn for CV, GitHub for code.
<br/>
So I'm going to use andreazonca.com as blog, hosted on blogger.
</p>doctests and unittests happiness 22012-08-16T14:07:00-07:002012-08-16T14:07:00-07:00Andrea Zoncatag:zonca.github.io,2012-08-16:/2012/08/doctests-and-unittests-happiness-2.html<blockquote>
nosetests -v --with-doctest
<br/>
Doctest: healpy.pixelfunc.ang2pix ... ok
<br/>
Doctest: healpy.pixelfunc.get_all_neighbours ... ok
<br/>
Doctest: healpy.pixelfunc.get_interp_val ... ok
<br/>
Doctest: healpy.pixelfunc.get_map_size ... ok
<br/>
Doctest: healpy.pixelfunc.get_min_valid_nside ... ok
<br/>
Doctest: healpy.pixelfunc.get_neighbours ... ok
</blockquote>
<p><br/>
<a name="more">
</a>
<br/>
<blockquote>
Doctest: healpy.pixelfunc.isnpixok ... ok
<br/>
Doctest: healpy.pixelfunc.isnsideok ... ok
<br/>
Doctest: healpy.pixelfunc.ma ... ok
<br/>
Doctest: healpy.pixelfunc.maptype ... ok
<br/>
Doctest: healpy.pixelfunc.mask_bad ... ok
<br/>
Doctest: healpy.pixelfunc.mask_good ... ok
<br/>
Doctest: healpy.pixelfunc.max_pixrad ... ok
<br/>
Doctest: healpy.pixelfunc.nest2ring ... ok
<br/>
Doctest: healpy.pixelfunc.npix2nside ... ok
<br/>
Doctest: healpy.pixelfunc.nside2npix ... ok
<br/>
Doctest: healpy.pixelfunc.nside2pixarea ... ok
<br/>
Doctest: healpy.pixelfunc.nside2resol ... ok
<br/>
Doctest: healpy.pixelfunc.pix2ang ... ok
<br/>
Doctest: healpy.pixelfunc.pix2vec ... ok
<br/>
Doctest: healpy.pixelfunc.reorder ... ok
<br/>
Doctest: healpy.pixelfunc.ring2nest ... ok
<br/>
Doctest: healpy.pixelfunc.ud_grade ... ok
<br/>
Doctest: healpy.pixelfunc.vec2pix ... ok
<br/>
Doctest: healpy.rotator.Rotator ... ok
<br/>
test_write_map_C (test_fitsfunc.TestFitsFunc) ... ok
<br/>
test_write_map_IDL (test_fitsfunc.TestFitsFunc) ... ok
<br/>
test_write_alm (test_fitsfunc.TestReadWriteAlm) ... ok
<br/>
test_write_alm_256_128 (test_fitsfunc.TestReadWriteAlm) ... ok
<br/>
test_ang2pix_nest (test_pixelfunc.TestPixelFunc) ... ok
<br/>
test_ang2pix_ring (test_pixelfunc.TestPixelFunc) ... ok
<br/>
test_nside2npix (test_pixelfunc.TestPixelFunc) ... ok
<br/>
test_nside2pixarea (test_pixelfunc.TestPixelFunc) ... ok
<br/>
test_nside2resol (test_pixelfunc.TestPixelFunc) ... ok
<br/>
test_inclusive (test_query_disc.TestQueryDisc) ... ok
<br/>
test_not_inclusive (test_query_disc.TestQueryDisc) ... ok
<br/>
test_anafast (test_sphtfunc.TestSphtFunc) ... ok
<br/>
test_anafast_iqu (test_sphtfunc.TestSphtFunc) ... ok
<br/>
test_anafast_xspectra (test_sphtfunc.TestSphtFunc) ... ok
<br/>
test_synfast (test_sphtfunc.TestSphtFunc) ... ok
<br/>
test_cartview_nocrash (test_visufunc.TestNoCrash) ... ok
<br/>
test_gnomview_nocrash (test_visufunc.TestNoCrash) ... ok
<br/>
test_mollview_nocrash (test_visufunc.TestNoCrash) ... ok
<br/>
<br/>
----------------------------------------------------------------------
<br/>
Ran 43 tests in 19.077s
<br/>
<br/>
OK
</blockquote></p>compile python module with mpi support2012-07-06T16:08:00-07:002012-07-06T16:08:00-07:00Andrea Zoncatag:zonca.github.io,2012-07-06:/2012/07/compile-python-module-with-mpi-support.html<p>
CC=mpicc LDSHARED="mpicc -shared" python setup.py build_ext -i
</p>some python resources2011-11-01T23:02:00-07:002011-11-01T23:02:00-07:00Andrea Zoncatag:zonca.github.io,2011-11-01:/2011/11/some-python-resources.html<p>
python tutorial:
<br/>
<a href="http://docs.python.org/tutorial/">
http://docs.python.org/tutorial/
<br/>
</a>
numpy tutorial [arrays]:
<br/>
<a href="http://www.scipy.org/Tentative_NumPy_Tutorial">
http://www.scipy.org/Tentative_NumPy_Tutorial
<br/>
</a>
plotting tutorial:
<br/>
<a href="http://matplotlib.sourceforge.net/users/pyplot_tutorial.html">
http://matplotlib.sourceforge.net/users/pyplot_tutorial.html
</a>
<br/>
<br/>
free online books:
<br/>
<a href="http://diveintopython.org/toc/index.html">
http://diveintopython.org/toc/index.html
<br/>
</a>
<a href="http://www.ibiblio.org/swaroopch/byteofpython/read/">
http://www.ibiblio.org/swaroopch/byteofpython/read/
</a>
<br/>
<br/>
install enthought python:
<br/>
<a href="http://www.enthought.com/products/edudownload.php">
http://www.enthought.com/products/edudownload.php
</a>
<br/>
<br/>
video tut:
<br/>
http://www.youtube.com/watch?v=YW8jtSOTRAU&feature=channel
</p>cfitsio wrapper in python2011-06-21T04:43:00-07:002011-06-21T04:43:00-07:00Andrea Zoncatag:zonca.github.io,2011-06-21:/2011/06/cfitsio-wrapper-in-python.html<p>
After several issues with pyfits, and tired of it being so overengineered, I wrote my own FITS I/O package in Python, wrapping the C library cfitsio with ctypes.
<br/>
<br/>
Pretty easy: the first version was developed entirely in one day.
<br/>
<br/>
<a href="https://github.com/zonca/pycfitsio">
https://github.com/zonca/pycfitsio
</a>
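The ctypes approach is generic; here is a minimal sketch of the idea using the C runtime's strlen (wrapping cfitsio works the same way, just with cfitsio's own functions and signatures):

```python
import ctypes

# Generic illustration of the ctypes technique (not pycfitsio itself):
# load the C symbols already linked into the process and declare
# one function's return and argument types.
libc = ctypes.CDLL(None)
libc.strlen.restype = ctypes.c_size_t
libc.strlen.argtypes = [ctypes.c_char_p]

n = libc.strlen(b"cfitsio")  # -> 7
```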
</p>unit testing happiness2011-06-21T04:39:00-07:002011-06-21T04:39:00-07:00Andrea Zoncatag:zonca.github.io,2011-06-21:/2011/06/unit-testing-happiness.html<pre>nosetests -v<br/>test_all_cols (pycfitsio.test.TestPyCfitsIoRead) ... ok<br/>test_colnames (pycfitsio.test.TestPyCfitsIoRead) ... ok<br/>test_move (pycfitsio.test.TestPyCfitsIoRead) ... ok<br/>test_open_file (pycfitsio.test.TestPyCfitsIoRead) ... ok<br/>test_read_col (pycfitsio.test.TestPyCfitsIoRead) ... ok<br/>test_read_hdus (pycfitsio.test.TestPyCfitsIoRead) ... ok<br/>test_create (pycfitsio.test.TestPyCfitsIoWrite) ... ok<br/>test_write (pycfitsio.test.TestPyCfitsIoWrite) ... ok<br/><br/>----------------------------------------------------------------------<br/>Ran 8 tests in 0.016s<br/><br/>OK</pre>Pink noise (1/f noise) simulations in numpy2011-05-18T23:49:00-07:002011-05-18T23:49:00-07:00Andrea Zoncatag:zonca.github.io,2011-05-18:/2011/05/pink-noise-1f-noise-simulations-in-numpy.html<p><a href="https://gist.github.com/979729">
https://gist.github.com/979729
</a>
<br/>
<br/>
<a href="http://zonca.github.io/images/pink-noise-1f-noise-simulations-in-numpy_05_oneoverf1.png">
<img alt="" class="alignnone size-medium wp-image-128" height="225" src="http://zonca.github.io/images/pink-noise-1f-noise-simulations-in-numpy_05_oneoverf1.png" title="oneoverf" width="300"/>
</a></p>Vim regular expressions2011-04-29T02:14:00-07:002011-04-29T02:14:00-07:00Andrea Zoncatag:zonca.github.io,2011-04-29:/2011/04/vim-regular-expressions.html<p>
very good reference of the usage of regular expressions in VIM:
<br/>
<br/>
<a href="http://www.softpanorama.org/Editors/Vimorama/vim_regular_expressions.shtml">
http://www.softpanorama.org/Editors/Vimorama/vim_regular_expressions.shtml
</a>
</p>set python logging level2011-04-13T01:02:00-07:002011-04-13T01:02:00-07:00Andrea Zoncatag:zonca.github.io,2011-04-13:/2011/04/set-python-logging-level.html<p>
Calling logging.basicConfig is often useless: if the logging module has already been configured upfront by one of the imported libraries, the call is silently ignored.
<br/>
<br/>
The solution is to set the level directly in the root logger:
<br/>
<code>
logging.root.level = logging.DEBUG
</code>
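A minimal sketch of both the problem and the fix, simulating the offending library with a first basicConfig call:

```python
import logging

# Simulate a library that configures logging at import time.
logging.basicConfig(level=logging.WARNING)

# A later basicConfig call is silently ignored: the root logger
# already has handlers, so this does NOT lower the level.
logging.basicConfig(level=logging.DEBUG)
assert logging.root.level == logging.WARNING

# Setting the level on the root logger directly always works.
logging.root.level = logging.DEBUG
```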
</p>pyfits memory leak in new_table2011-03-28T17:22:00-07:002011-03-28T17:22:00-07:00Andrea Zoncatag:zonca.github.io,2011-03-28:/2011/03/pyfits-memory-leak-in-newtable.html<p>
I found a memory leak in pyfits.new_table: data were NOT deleted when the table was deleted. I prepared a test on github, using
<a href="http://mg.pov.lt/objgraph/" title="objgraph">
objgraph
</a>
, which shows that data are still in memory:
<br/>
<a name="more">
</a>
<a href="https://gist.github.com/884298">
https://gist.github.com/884298
</a>
<br/>
<br/>
The issue was solved by Erik Bray of STScI on March 28th, 2011; see the bug report:
<br/>
<a href="http://trac6.assembla.com/pyfits/ticket/49">
http://trac6.assembla.com/pyfits/ticket/49
<br/>
</a>
and changeset:
<br/>
<a href="http://trac6.assembla.com/pyfits/changeset/844">
http://trac6.assembla.com/pyfits/changeset/844
</a>
</p>ipython and PyTrilinos2011-02-16T19:10:00-08:002011-02-16T19:10:00-08:00Andrea Zoncatag:zonca.github.io,2011-02-16:/2011/02/ipython-and-pytrilinos.html<ol>
<br/>
<li>
start ipcontroller
</li>
<br/>
<li>
start ipengines:
<br/>
<code>
mpiexec -n 4 ipengine --mpi=pytrilinos
</code>
</li>
<br/>
<li>
start ipython 0.11:
<br/>
<code>
import PyTrilinos
<br/>
from IPython.kernel import client
<br/>
mec = client.MultiEngineClient()
<br/>
%load_ext parallelmagic
<br/>
mec.activate()
<br/>
px import PyTrilinos
<br/>
px comm=PyTrilinos.Epetra.PyComm()
<br/>
px print(comm.NumProc())
</code>
</li>
<br/>
</ol>git make local branch tracking origin2011-02-02T02:58:00-08:002011-02-02T02:58:00-08:00Andrea Zoncatag:zonca.github.io,2011-02-02:/2011/02/git-make-local-branch-tracking-origin.html<p><code>
git branch --set-upstream master origin/master
</code>
<br/>
<br/>
you obtain the same result as initial cloning</p>memory map npy files2011-01-07T21:04:00-08:002011-01-07T21:04:00-08:00Andrea Zoncatag:zonca.github.io,2011-01-07:/2011/01/memory-map-npy-files.html<p>
Mem-map the stored array, and then access the second row directly from disk:
<br/>
<br/>
<code>
X = np.load('/tmp/123.npy', mmap_mode='r')
<br/>
X[1]  # reads only the second row from disk
</code>
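A self-contained sketch of the same trick, writing a throwaway array to a temporary file first:

```python
import os
import tempfile
import numpy as np

path = os.path.join(tempfile.mkdtemp(), "arr.npy")
np.save(path, np.arange(12).reshape(4, 3))

# Memory-map the stored array: nothing is read until it is indexed.
X = np.load(path, mmap_mode="r")
second_row = np.asarray(X[1])  # only this row is pulled from disk
# second_row is [3 4 5]
```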
</p>force local install of python module2010-12-03T22:18:00-08:002010-12-03T22:18:00-08:00Andrea Zoncatag:zonca.github.io,2010-12-03:/2010/12/force-local-install-of-python-module.html<p><code>
python setup.py install --prefix FOLDER
<br/>
</code>
<br/>
<br/>
creates lib/python2.6/site-packages; to force a flat local install you should use:
<br/>
<br/>
<code>
python setup.py install --install-lib FOLDER
</code></p>gnome alt f2 popup launcher2010-08-31T18:14:00-07:002010-08-31T18:14:00-07:00Andrea Zoncatag:zonca.github.io,2010-08-31:/2010/08/gnome-alt-f2-popup-launcher.html<p>
<br/>
<code>
gnome-panel-control --run-dialog
</code>
</p>switch to interactive backend with ipython -pylab2010-08-21T00:33:00-07:002010-08-21T00:33:00-07:00Andrea Zoncatag:zonca.github.io,2010-08-21:/2010/08/switch-to-interactive-backend-with.html<p>
objective:
<br/>
</p>
<ol>
<br/>
<li>
when running ipython without pylab or executing scripts you want to use an image matplotlib backend like Agg
</li>
<br/>
<li>
just when calling ipython -pylab you want to use an interactive backend like GTKAgg or TKAgg
</li>
<br/>
</ol>
<p><br/>
<a name="more">
</a>
<br/>
<br/>
you first need to set up the default backend in .matplotlib/matplotlibrc as
<strong>
Agg
</strong>
:
<br/>
<code>
backend : Agg
</code>
<br/>
then set up your IPython to switch to an interactive backend: in the IPython file Shell.py, in the class MatplotlibShellBase, at about line 516, add:
<br/>
<code>
matplotlib.use('GTKAgg')
</code>
<br/>
after the first import of matplotlib</p>numpy dtypes and fits keywords2010-08-04T21:57:00-07:002010-08-04T21:57:00-07:00Andrea Zoncatag:zonca.github.io,2010-08-04:/2010/08/numpy-dtypes-and-fits-keywords.html<p><code>
bool: 'L',
<br/>
uint8: 'B',
<br/>
int16: 'I',
<br/>
int32: 'J',
<br/>
int64: 'K',
<br/>
float32: 'E',
<br/>
float64: 'D',
<br/>
complex64: 'C',
<br/>
complex128: 'M'
</code></p>count hits with numpy2010-07-23T15:18:00-07:002010-07-23T15:18:00-07:00Andrea Zoncatag:zonca.github.io,2010-07-23:/2010/07/count-hits-with-numpy.html<p>
I have an array where I record hits
<br/>
<code>
a=np.zeros(5)
</code>
<br/>
and an array with the indices of the hits, for example I have 2 hits on index 2
<br/>
<code>
hits=np.array([2,2])
</code>
<br/>
so I want to increase index 2 of a by 2
<br/>
<a name="more">
</a>
<br/>
I tried:
<br/>
<code>
a[hits]+=1
</code>
<br/>
but it gives array([ 0., 0., 1., 0., 0.])
<br/>
Does someone have a suggestion? The answer turned out to be np.bincount:
<br/>
<code>
bins=np.bincount(hits)
<br/>
a[:len(bins)] += bins
<br/>
a
<br/>
array([ 0., 0., 2., 0., 0.])
</code>
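Putting it together as a runnable sketch (why the naive version fails is noted in the comment):

```python
import numpy as np

a = np.zeros(5)
hits = np.array([2, 2])

# Fancy-indexing assignment (a[hits] += 1) buffers the writes, so a
# repeated index is only incremented once; bincount counts every hit.
bins = np.bincount(hits)
a[:len(bins)] += bins

# a is now [0., 0., 2., 0., 0.]
```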
</p>change column name in a fits with pyfits2010-06-30T22:06:00-07:002010-06-30T22:06:00-07:00Andrea Zoncatag:zonca.github.io,2010-06-30:/2010/06/change-column-name-in-fits-with-pyfits.html<p>
There is no way to change it by manipulating the dtype of the data array.
<br/>
<code>
a=pyfits.open('filename.fits')
<br/>
a[1].header.update('TTYPE1','newname')
</code>
<br/>
you need to change the header, using the update method on the right TTYPE keyword, and then write the FITS file out again using a.writeto.
</p>healpix coordinates2010-06-23T01:01:00-07:002010-06-23T01:01:00-07:00Andrea Zoncatag:zonca.github.io,2010-06-23:/2010/06/healpix-coordinates.html<p>
Healpix measures the
<strong>
colatitude
</strong>
theta from 0 at the north pole to pi at the south pole,
<br/>
so the conversion is:
<br/>
<code>
theta = pi/2 - latitude
</code>
<br/>
<strong>
longitude
</strong>
and phi, instead, consistently run from 0 to 2*pi, with
<br/>
</p>
<ul>
<br/>
<li>
zero on vernal equinox (for
<a href="http://en.wikipedia.org/wiki/Ecliptic_coordinate_system">
ecliptic
</a>
).
</li>
<br/>
<li>
zero in the direction from Sun to galactic center (for
<a href="http://en.wikipedia.org/wiki/Galactic_coordinate_system">
galactic
</a>
)
</li>
<br/>
</ul>parallel computing the python way2010-06-21T07:27:00-07:002010-06-21T07:27:00-07:00Andrea Zoncatag:zonca.github.io,2010-06-21:/2010/06/parallel-computing-python-way.html<p>
forget MPI:
<br/>
<a href="http://showmedo.com/videotutorials/series?name=N49qyIFOh">
http://showmedo.com/videotutorials/series?name=N49qyIFOh
</a>
</p>quaternions for python2010-06-21T07:21:00-07:002010-06-21T07:21:00-07:00Andrea Zoncatag:zonca.github.io,2010-06-21:/2010/06/quaternions-for-python.html<p>
the situation is pretty problematic, I hope someday
<strong>
scipy
</strong>
will add a python package for rotating and interpolating quaternions, up to now:
<br/>
</p>
<ul>
<br/>
<li>
<a href="http://cgkit.sourceforge.net/doc2/quat.html">
http://cgkit.sourceforge.net/doc2/quat.html
</a>
: slow, bad interaction with numpy, I could not find a simple way to turn a list of N quaternions to a 4xN array without a loop
</li>
<br/>
<li>
<a href="http://cxc.harvard.edu/mta/ASPECT/tool_doc/pydocs/Quaternion.html">
http://cxc.harvard.edu/mta/ASPECT/tool_doc/pydocs/Quaternion.html
</a>
: more lightweight, does not implement quaternion interpolation
</li>
<br/>
</ul>change permission recursively to folders only2010-03-23T17:58:00-07:002010-03-23T17:58:00-07:00Andrea Zoncatag:zonca.github.io,2010-03-23:/2010/03/change-permission-recursively-to.html<p><code>
find . -type d -exec chmod 777 {} \;
</code></p>aptitude search 'and'2010-03-16T22:50:00-07:002010-03-16T22:50:00-07:00Andrea Zoncatag:zonca.github.io,2010-03-16:/2010/03/aptitude-search.html<p>
this is really something
<strong>
really annoying
</strong>
about aptitude, if you run:
<br/>
<code>
aptitude search linux headers
</code>
<br/>
it will perform an 'or' search. To perform an 'and' search, which I need 99.9% of the time, you need quotation marks:
<br/>
<code>
aptitude search 'linux headers'
</code>
</p>using numpy dtype with loadtxt2010-03-03T22:49:00-08:002010-03-03T22:49:00-08:00Andrea Zoncatag:zonca.github.io,2010-03-03:/2010/03/using-numpy-dtype-with-loadtxt.html<p>
Let's say you want to read a text file like this:
<br/>
<br/>
<br/>
</p>
<blockquote>
#filename start end
<br/>
fdsafda.fits 23143214 23143214
<br/>
safdsafafds.fits 21423 23423432
</blockquote>
<p><br/>
<br/>
<br/>
<a name="more">
</a>
<br/>
you can use dtype to create a custom array, which is very flexible as you can work by row or columns with strings and floats in the same array:
<br/>
<code>
dt = np.dtype({'names': ['filename', 'start', 'end'], 'formats': ['S100', np.float64, np.float64]})
<br/>
</code>
[I also tried using np.str instead of S100 without success; does anyone know why?]
<br/>
then give this as input to loadtxt to load the file and create the array.
<br/>
<code>
a = np.loadtxt(open('yourfile.txt'),dtype=dt)
</code>
<br/>
so each element is:
<br/>
<code>
('dsafsadfsadf.fits', 1.6287776249537126e+18, 1.6290301584937428e+18)
<br/>
</code>
<br/>
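Putting the pieces together, a self-contained sketch (the file contents are the hypothetical ones from the example above, loaded from a string for convenience):

```python
import io
import numpy as np

# hypothetical file contents matching the example above
text = """#filename start end
fdsafda.fits 23143214 23143214
safdsafafds.fits 21423 23423432
"""

# structured dtype: one string column and two float columns
dt = np.dtype({'names': ['filename', 'start', 'end'],
               'formats': ['S100', np.float64, np.float64]})

# the header line starts with '#' and is skipped by default
a = np.loadtxt(io.StringIO(text), dtype=dt)

print(a[0])         # first row as a record
print(a['start'])   # the start column as a float array
```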
but you can get the array of start or end times using:
<br/>
<code>
a['start']
</code></p>Stop ipcluster from a script2010-02-19T02:23:00-08:002010-02-19T02:23:00-08:00Andrea Zoncatag:zonca.github.io,2010-02-19:/2010/02/stop-ipcluster-from-script.html<p>
Ipcluster is easy to start, but not trivial to stop from a script after the processing has finished; here's the solution:
<br/>
<code>
from IPython.kernel import client
<br/>
mec = client.MultiEngineClient()
<br/>
mec.kill(controller=True)
</code>
</p>Correlation2010-01-28T00:45:00-08:002010-01-28T00:45:00-08:00Andrea Zoncatag:zonca.github.io,2010-01-28:/2010/01/correlation.html<p><strong>
Expectation value
</strong>
or first moment of a random variable is the probability weighted sum of the possible values (weighted mean).
<br/>
The expectation value of a 6-sided die is (1+2+3+4+5+6)/6 = 3.5
<br/>
<br/>
<strong>
Covariance
</strong>
of 2 random variables is:
<br/>
<code>
COV(X,Y)=E[(X-E(X))(Y-E(Y))]=E …</code></p><p><strong>
Expectation value
</strong>
or first moment of a random variable is the probability weighted sum of the possible values (weighted mean).
<br/>
The expectation value of a 6-sided die is (1+2+3+4+5+6)/6 = 3.5
<br/>
<br/>
<strong>
Covariance
</strong>
of 2 random variables is:
<br/>
<code>
COV(X,Y) = E[(X-E(X))(Y-E(Y))] = E(XY) - E(X)E(Y)
</code>
<br/>
i.e. the difference between the expected value of their product and the product of their expected values.
<br/>
So if the variables change together, they have a high covariance; if they are independent, their covariance is zero.
<br/>
<br/>
<strong>
Variance
</strong>
is the covariance of a variable with itself:
<br/>
<code>
COV(X,X) = VAR(X) = E(X^2) - E(X)^2
</code>
<br/>
<br/>
<strong>
Standard deviation
</strong>
is the square root of the variance.
<br/>
<br/>
<strong>
Correlation
</strong>
is:
<br/>
<code>
COR(X,Y) = COV(X,Y) / (STDEV(X) * STDEV(Y))
</code>
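These identities are easy to check numerically with numpy; a quick sketch (my own, not from the original post), using a synthetic pair of correlated variables:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100_000)
y = x + rng.normal(size=100_000)  # correlated with x by construction

# COV(X,Y) = E(XY) - E(X)E(Y)
cov = np.mean(x * y) - np.mean(x) * np.mean(y)

# COR(X,Y) = COV(X,Y) / (STDEV(X) * STDEV(Y))
cor = cov / (np.std(x) * np.std(y))

print(cov)  # close to 1, the variance of x
print(cor)  # close to 1/sqrt(2), since VAR(Y) = 2
```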
<br/>
<br/>
<br/>
<a href="http://mathworld.wolfram.com/Covariance.html">
http://mathworld.wolfram.com/Covariance.html
</a></p>execute bash script remotely with ssh2010-01-07T14:37:00-08:002010-01-07T14:37:00-08:00Andrea Zoncatag:zonca.github.io,2010-01-07:/2010/01/execute-bash-script-remotely-with-ssh.html<p>
a bash script launched remotely via ssh does not load the environment, if this is an issue it is necessary to specify --login when calling bash:
<br/>
<br/>
<code>
ssh user@remoteserver.com 'bash --login life_om/cronodproc' | mail your@email.com -s cronodproc
</code>
</p>lock pin hold a package using apt on ubuntu2010-01-07T13:49:00-08:002010-01-07T13:49:00-08:00Andrea Zoncatag:zonca.github.io,2010-01-07:/2010/01/lock-pin-hold-package-using-apt-on.html<p>
set hold:
<br/>
<code>
echo packagename hold | dpkg --set-selections
</code>
<br/>
<br/>
check, should be
<strong>
hi
</strong>
:
<br/>
<code>
dpkg -l packagename
</code>
<br/>
<br/>
unset hold:
<br/>
<code>
echo packagename install | dpkg --set-selections
</code>
</p>load arrays from a text file with numpy2010-01-05T16:32:00-08:002010-01-05T16:32:00-08:00Andrea Zoncatag:zonca.github.io,2010-01-05:/2010/01/load-arrays-from-text-file-with-numpy.html<p>
space-separated text file with 5 arrays in columns:
<br/>
<br/>
[sourcecode language="python"]
<br/>
ods,rings,gains,offsets,rparams = np.loadtxt(filename,unpack=True)
<br/>
[/sourcecode]
<br/>
<br/>
quite impressive...
</p>Latest Maxima and WxMaxima for Ubuntu Karmic2009-12-15T11:20:00-08:002009-12-15T11:20:00-08:00Andrea Zoncatag:zonca.github.io,2009-12-15:/2009/12/latest-maxima-and-wxmaxima-for-ubuntu.html<p><a href="http://zeus.nyf.hu/~blahota/maxima/karmic/" title="maxima for ubuntu">
http://zeus.nyf.hu/~blahota/maxima/karmic/
</a>
<br/>
<br/>
on the maxima mailing lists they suggested installing the sbcl build, so I first installed sbcl from the Ubuntu repositories and then maxima and wxmaxima from this URL.</p>number of files in a folder and subfolders2009-12-10T18:16:00-08:002009-12-10T18:16:00-08:00Andrea Zoncatag:zonca.github.io,2009-12-10:/2009/12/number-of-files-in-folder-and-subfolders.html<p>
folders are not counted
<br/>
<code>
find . -type f | wc -l
</code>
</p>forcefully unmount a disk partition2008-09-17T15:14:00-07:002008-09-17T15:14:00-07:00Andrea Zoncatag:zonca.github.io,2008-09-17:/2008/09/forcefully-unmount-disk-partition.html<p>
check which processes are accessing a partition:
<br/>
<br/>
[sourcecode language="bash"]lsof | grep '/opt'[/sourcecode]
<br/>
<br/>
kill all the processes accessing the partition (check what you're killing, you could lose data):
<br/>
<br/>
[sourcecode language="bash"]fuser -km /opt[/sourcecode]
<br/>
<br/>
try to unmount now:
<br/>
[sourcecode language="bash"]umount /opt[/sourcecode]
</p>netcat: quickly send binaries through network2008-04-29T12:25:00-07:002008-04-29T12:25:00-07:00Andrea Zoncatag:zonca.github.io,2008-04-29:/2008/04/netcat-quickly-send-binaries-through.html<p>
just start nc in server mode on localhost:
<br/>
<br/>
[sourcecode language='bash'] nc -l -p 3333 [/sourcecode]
<br/>
<br/>
send a string to localhost on port 3333:
<br/>
<br/>
[sourcecode language='bash'] echo "hello world" | nc localhost 3333 [/sourcecode]
<br/>
<br/>
you'll see the string you sent appear on the server side.
<br/>
<br/>
very useful for sending binaries, see …</p><p>
just start nc in server mode on localhost:
<br/>
<br/>
[sourcecode language='bash'] nc -l -p 3333 [/sourcecode]
<br/>
<br/>
send a string to localhost on port 3333:
<br/>
<br/>
[sourcecode language='bash'] echo "hello world" | nc localhost 3333 [/sourcecode]
<br/>
<br/>
you'll see the string you sent appear on the server side.
<br/>
<br/>
very useful for sending binaries, see
<a href="http://www.g-loaded.eu/2006/11/06/netcat-a-couple-of-useful-examples/">
examples
</a>
.
</p>Decibels, dB and dBm, in terms of Power and Amplitude2008-03-29T02:13:00-07:002008-03-29T02:13:00-07:00Andrea Zoncatag:zonca.github.io,2008-03-29:/2008/03/decibels-db-and-dbm-in-terms-of-power.html<p>
It's not difficult, but I always have some doubts...
<br/>
</p>
<h4>
Power
</h4>
<p><br/>
$latex L_{dB} = 10 log_{10} \left( \dfrac{P_1}{P_0} \right) $
<br/>
<br/>
10 dB increase for a factor 10 increase in the ratio
<br/>
<br/>
3 dB = doubling
<br/>
<br/>
40 dB = 10000 times
<br/>
<h4>
Amplitude
</h4>
<br/>
$latex L_{dB} = 10 log_{10} \left( \dfrac{A_1^2}{A_0 …</p><p>
It's not difficult, but I always have some doubts...
<br/>
</p>
<h4>
Power
</h4>
<p><br/>
$latex L_{dB} = 10 log_{10} \left( \dfrac{P_1}{P_0} \right) $
<br/>
<br/>
10 dB increase for a factor 10 increase in the ratio
<br/>
<br/>
3 dB = doubling
<br/>
<br/>
40 dB = 10000 times
<br/>
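These rules of thumb follow directly from the formula; a quick numeric check (my own, not from the original post):

```python
import math

def db(ratio):
    """Express a power ratio in decibels."""
    return 10 * math.log10(ratio)

print(db(10))      # 10 dB for a factor 10
print(db(2))       # ~3 dB for a doubling
print(db(10_000))  # 40 dB for a factor 10000
```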
<h4>
Amplitude
</h4>
<br/>
$latex L_{dB} = 10 log_{10} \left( \dfrac{A_1^2}{A_0^2} \right) = 20 log_{10} \left( \dfrac{A_1}{A_0} \right) $
<br/>
<h4>
dBm
</h4>
<br/>
dBm is an absolute measure, obtained as a ratio to 1 mW:
<br/>
<br/>
$latex L_{dBm} = 10 log_{10} \left( \dfrac{P_1}{1 mW} \right) $
<br/>
<ul>
<br/>
<li>
0 dBm = 1 mW
</li>
<br/>
<li>
3 dBm ≈ 2 mW
</li>
<br/>
</ul></p>Relation between Power density and temperature in an antenna2008-03-28T18:29:00-07:002008-03-28T18:29:00-07:00Andrea Zoncatag:zonca.github.io,2008-03-28:/2008/03/relation-between-power-density-and.html<p>
Considering an antenna placed inside a blackbody enclosure at temperature T, the power received per unit bandwidth is:
<br/>
$latex \omega = kT$
<br/>
<br/>
where k is Boltzmann constant.
<br/>
<br/>
This relationship derives from considering a constant brightness $latex B$ in all directions; the Rayleigh-Jeans law then gives:
<br/>
<br/>
$latex B = \dfrac{2kT}{\lambda^2 …</p><p>
Considering an antenna placed inside a blackbody enclosure at temperature T, the power received per unit bandwidth is:
<br/>
$latex \omega = kT$
<br/>
<br/>
where k is Boltzmann constant.
<br/>
<br/>
This relationship derives from considering a constant brightness $latex B$ in all directions; the Rayleigh-Jeans law then gives:
<br/>
<br/>
$latex B = \dfrac{2kT}{\lambda^2}$
<br/>
<br/>
Power per unit bandwidth is obtained by integrating brightness over antenna beam
<br/>
<br/>
$latex \omega = \frac{1}{2} A_e \int \int B \left( \theta , \phi \right) P_n \left( \theta , \phi \right) d \Omega $
<br/>
<br/>
for constant brightness, $latex \int \int P_n \left( \theta , \phi \right) d \Omega = \Omega_A $, therefore
<br/>
<br/>
$latex \omega = \dfrac{kT}{\lambda^2}A_e\Omega_A $
<br/>
<br/>
where:
<br/>
</p>
<ul>
<br/>
<li>
$latex A_e$ is antenna effective aperture
</li>
<br/>
<li>
$latex \Omega_A$ is antenna beam area
</li>
<br/>
</ul>
<p><br/>
$latex \lambda^2 = A_e\Omega_A $ (another post should discuss this relation)
<br/>
<br/>
finally:
<br/>
<br/>
$latex \omega = kT $
<br/>
<br/>
which is the same noise power of a resistor.
<br/>
<br/>
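As a quick sanity check (my own numbers, not from the post): at T = 290 K this formula gives the familiar thermal-noise floor of about -174 dBm/Hz:

```python
import math

k = 1.380649e-23  # Boltzmann constant, J/K
T = 290.0         # room temperature, K

power_per_hz = k * T  # received power per unit bandwidth, W/Hz
dbm_per_hz = 10 * math.log10(power_per_hz / 1e-3)  # relative to 1 mW

print(power_per_hz)  # ~4.0e-21 W/Hz
print(dbm_per_hz)    # ~ -174 dBm/Hz
```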
source: Kraus, Radio Astronomy, p. 107</p>Producing PDF from XML files2008-03-28T16:27:00-07:002008-03-28T16:27:00-07:00Andrea Zoncatag:zonca.github.io,2008-03-28:/2008/03/producing-pdf-from-xml-files.html<p>
I need to produce formatted pdf from XML data input file.
<br/>
The most standard way seems to be to use
<a href="http://www.w3schools.com/xsl" title="w3schools tutorial">
XSL stylesheets.
</a>
<br/>
Associating an XSL sheet with an XML file lets most browsers render it directly as HTML; this can be used to publish XML sheets on the web.
<br/>
<br/>
The quick and …</p><p>
I need to produce formatted pdf from XML data input file.
<br/>
The most standard way seems to be to use
<a href="http://www.w3schools.com/xsl" title="w3schools tutorial">
XSL stylesheets.
</a>
<br/>
Associating an XSL sheet with an XML file lets most browsers render it directly as HTML; this can be used to publish XML sheets on the web.
<br/>
<br/>
The quick and dirty way to produce PDF could be printing them from Firefox, but an interesting option is to use
<a href="http://cyberelk.net/tim/software/xmlto/" title="xmlto homepage">
xmlto
</a>
, a script that runs an XSL transformation and renders an XML file as PDF or other formats. It would be interesting to test this script and understand whether it needs DocBook XML input specifically or accepts any XML.
</p>vim customization2006-10-17T10:49:00-07:002006-10-17T10:49:00-07:00Andrea Zoncatag:zonca.github.io,2006-10-17:/2006/10/vim-costumization.html<p>
it is about Perl, but it suggests very useful tricks for programming with vim:
<br/>
http://mamchenkov.net/wordpress/2004/05/10/vim-for-perl-developers/
</p>using gnu find2006-10-03T14:00:00-07:002006-10-03T14:00:00-07:00Andrea Zoncatag:zonca.github.io,2006-10-03:/2006/10/using-gnu-find.html<p>
list all the directories, excluding hidden ones and "." itself:
<br/>
</p>
<blockquote>
find . -maxdepth 1 -type d -not -name ".*"
</blockquote>
<p><br/>
find some string in all files matching a pattern in the subfolders (with grep -r you cannot specify the type of file)
<br/>
<blockquote>
find . -name '*.py' -exec grep -i pdb '{}' \;
</blockquote></p>beginners bash guide2006-10-03T13:56:00-07:002006-10-03T13:56:00-07:00Andrea Zoncatag:zonca.github.io,2006-10-03:/2006/10/beginners-bash-guide.html<p>
great guide with many examples:
<br/>
<br/>
http://tille.xalasys.com/training/bash/
</p>tar quickref2006-09-25T13:19:00-07:002006-09-25T13:19:00-07:00Andrea Zoncatag:zonca.github.io,2006-09-25:/2006/09/tar-quickref.html<p>
compress: tar cvzf foo.tgz *.cc *.h
<br/>
check inside: tar tzf foo.tgz | grep file.txt
<br/>
extract: tar xvzf foo.tgz
<br/>
extract 1 file only: tar xvzf foo.tgz path/to/file.txt
</p>software carpentry2006-09-25T12:51:00-07:002006-09-25T12:51:00-07:00Andrea Zoncatag:zonca.github.io,2006-09-25:/2006/09/software-carpentry.html<p>
basic software for scientists and engineers:
<br/>
http://www.swc.scipy.org/
<br/>
</p>Free software for scientific data processing2006-09-22T13:35:00-07:002006-09-22T13:35:00-07:00Andrea Zoncatag:zonca.github.io,2006-09-22:/2006/09/software-libero-per-il-trattamento-di.html<p>
in the search for the best environment for scientific data analysis, these articles are worth reading:
<br/>
<br/>
http://www.pluto.it/files/journal/pj0501/swlibero-scie1.html
<br/>
<br/>
http://www.pluto.it/files/journal/pj0504/swlibero-scie2.html
<br/>
<br/>
http://www.pluto.it/files/journal/pj0505/swlibero-scie3.html
</p>command line processing2006-09-22T13:34:00-07:002006-09-22T13:34:00-07:00Andrea Zoncatag:zonca.github.io,2006-09-22:/2006/09/command-line-processing.html<p>
Very useful summary of many Linux command-line processing tools (great Perl one-liners):
<br/>
<br/>
http://grad.physics.sunysb.edu/~leckey/personal/forget/
</p>awk made easy2006-09-22T13:20:00-07:002006-09-22T13:20:00-07:00Andrea Zoncatag:zonca.github.io,2006-09-22:/2006/09/awk-made-easy.html<p><strong>
awk '/REGEX/ {print NR "\t" $9 "\t" $4"_"$5 ;}' file.txt
</strong>
<br/>
supports extended REGEX like perl ( e.g. [:blank:] Space or tab characters )
<br/>
NR is line number
<br/>
NF Number of fields
<br/>
$n is the column to be printed, $0 is the whole row
<br/>
<br/>
if it is only necessary to print …</p><p><strong>
awk '/REGEX/ {print NR "\t" $9 "\t" $4"_"$5 ;}' file.txt
</strong>
<br/>
supports extended REGEX like perl ( e.g. [:blank:] Space or tab characters )
<br/>
NR is line number
<br/>
NF Number of fields
<br/>
$n is the column to be printed, $0 is the whole row
<br/>
<br/>
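A tiny synthetic example (my own, not from the post) showing NR and the field variables in action:

```shell
# two lines of three whitespace-separated fields;
# print the line number and the second field of each line
printf 'a 1 x\nb 2 y\n' | awk '{print NR "\t" $2}'
# prints:
# 1	1
# 2	2
```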
if it is only necessary to print columns of a file, it is easier to use cut:
<br/>
<br/>
uname -a | cut -d" " -f1,3,11,12
<br/>
<br/>
-d: or -d" " is the delimiter
<br/>
-f1,3 are the fields to be displayed
<br/>
other options: -s doesn't show lines without delimiters, --complement is self-explanatory
<br/>
condition on a specific field:
<br/>
$<field> ~ /<string>/ searches for the string in the specified field.
<br/>
<br/>
you can use awk also in pipes:
<br/>
ll | awk 'NR!=1 {s+=$5} END {print "Average: " s/(NR-1)}'
<br/>
END processes the whole file and then prints the results
<br/>
<br/>
tutorial on using awk from the command line:
<br/>
<a href="http://www.vectorsite.net/tsawk_3.html#m1" target="_blank" title="awk tutorial">
http://www.vectorsite.net/tsawk_3.html#m1
</a></p>astrophysics pills2006-09-20T13:39:00-07:002006-09-20T13:39:00-07:00Andrea Zoncatag:zonca.github.io,2006-09-20:/2006/09/pillole-di-astrofisica.html<p>
well-explained curiosities by Annibale D'Ercole; the idea of having a basic level and an advanced level is interesting
<br/>
<a href="http://www.bo.astro.it/sait/spigolature/spigostart.html" target="_blank" title="spigolature astronomiche">
http://www.bo.astro.it/sait/spigolature/spigostart.html
</a>
</p>