Use the distributed file format Zarr on Jetstream Swift object storage

jupyter
jetstream
Published

March 3, 2018

Zarr

Zarr is a fairly new file format designed for cloud computing; see the documentation and a webinar for more details.

Zarr is also supported by Dask, the parallel computing framework for Python, and the Dask team implemented storage backends for Google Cloud Storage and Amazon S3.

Use OpenStack Swift on Jetstream for object storage

Jetstream also offers (currently in beta) access to object storage via OpenStack Swift. This is a separate service from the Jetstream Virtual Machines, so you do not need to spin up a Virtual Machine dedicated to storing the data; you can just use the object storage already provided by Jetstream.

Read Zarr files from object store

If somebody else has already uploaded files to the object store and set their visibility to “public”, anybody can read them.

See this notebook

You need the OpenStack RC file (version 3) from: https://iu.jetstream-cloud.org/project/api_access/

pip install python-openstackclient

Source the OpenStack RC file and enter the password when prompted. This is the TACC password, NOT the XSEDE password. I know.

Now create EC2 credentials with:

openstack ec2 credentials create -f json > ec2.json
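The key pair written to ec2.json can be turned into the AWS-style profile used further down. This is a minimal sketch, assuming the JSON output stores the key pair under the keys `access` and `secret`; the function name `ec2_json_to_aws_config` is made up for this example and is not part of any library.

```python
import json
from pathlib import Path


def ec2_json_to_aws_config(ec2_json_text, region="RegionOne"):
    """Render an AWS-style [default] profile from the ec2.json contents."""
    creds = json.loads(ec2_json_text)
    # Assumption: the key pair is stored under "access" and "secret".
    lines = [
        "[default]",
        f"region={region}",
        f"aws_access_key_id={creds['access']}",
        f"aws_secret_access_key={creds['secret']}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    ec2_file = Path("ec2.json")
    if ec2_file.exists():
        # Print the profile so it can be reviewed before saving it by hand.
        print(ec2_json_to_aws_config(ec2_file.read_text()))
```

The script only prints the profile instead of writing to ~/.aws/config, so it cannot overwrite an existing configuration.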

Test whether we can access the object store.

I installed this on js-169-169

Actually, to list a container we can skip the EC2 credentials and just use the openstack client:

openstack object list zarr_pangeo

Save the credentials in ~/.aws/config:

[default]
region=RegionOne
aws_access_key_id=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
aws_secret_access_key=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

Then list the container from Python with s3fs:

import s3fs
fs = s3fs.S3FileSystem(client_kwargs=dict(endpoint_url="https://iu.jetstream-cloud.org:8080"))
fs.ls("zarr_pangeo")
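Once listing works, the same endpoint settings can be handed to Dask to open a Zarr array lazily. This is a rough sketch, not the code from the gist linked below: the helper names `s3_options` and `open_zarr_as_dask` are hypothetical, the array path passed in is a placeholder you have to fill in, and the `anon=True` option for public containers is untested here.

```python
ENDPOINT_URL = "https://iu.jetstream-cloud.org:8080"  # endpoint used in this post


def s3_options(anon=False, endpoint_url=ENDPOINT_URL):
    """Keyword arguments for s3fs.S3FileSystem pointing at the Jetstream endpoint."""
    return {"anon": anon, "client_kwargs": {"endpoint_url": endpoint_url}}


def open_zarr_as_dask(url, anon=False):
    """Lazily open the Zarr array at `url` (e.g. "s3://<container>/<array>")."""
    # Imported here so s3_options() works even without dask installed.
    # dask uses fsspec/s3fs under the hood for "s3://" URLs.
    import dask.array as da

    # storage_options are forwarded to the S3 filesystem backend.
    return da.from_zarr(url, storage_options=s3_options(anon=anon))
```

With credentials in ~/.aws/config, calling `open_zarr_as_dask` on an array stored under the `zarr_pangeo` container should return a dask array whose chunks are only fetched when computed.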

Zarr with dask on 1 node works fine

https://gist.github.com/zonca/071bbd8cbb9d15b1789865acb9e66de8

Need to test:

* access from multiple nodes with distributed
* read-only access without authentication