The software described in this contribution is developed as part of subtask 1 of the D-ANA task within the Obelics work package. However, because the software is distributed as a Docker container and we run an instance of it as a JupyterHub web service, it is also listed in this document.
The size of astronomical datasets has increased dramatically over the years; terabyte-sized datasets are no longer an exception. This trend will only accelerate; the SKA is expected to produce nearly 1 TB of archived data each day. This means that it will no longer be feasible for astronomers to download these huge datasets and perform the data reduction on their own machines, as is currently the practice. Instead, the data reduction is likely to be done close to where the data is archived, in central data processing centres, with the astronomer operating remotely on the data.
One way of facilitating this is through Jupyter notebooks. Jupyter is a web-based application that allows users to create interactive notebooks, which can include annotated text and graphics as well as executable code. Currently Jupyter supports more than 40 different programming languages, including Python, R, and Matlab. Jupyter is designed to be extended, which makes it easy to add additional languages.
As part of the Obelics work package we have created a Jupyter kernel for CASA, a widely used software package for processing astronomical data. The kernel allows all CASA tasks to be run from inside a Jupyter notebook, albeit non-interactively. Tasks that normally spawn a GUI window are wrapped so that their output is saved to an image instead, which is then displayed inside the notebook.
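The wrapping idea can be illustrated with a minimal sketch. This is not the kernel's actual implementation: the decorator save_to_image, the stand-in task fake_plot_task, and the keyword names showgui and plotfile are all assumptions chosen for illustration.

```python
import functools
import os
import tempfile


def save_to_image(task):
    """Wrap a plotting task so it writes an image file instead of opening a window."""
    @functools.wraps(task)
    def wrapper(*args, **kwargs):
        # Reserve a temporary file name for the rendered plot.
        fd, path = tempfile.mkstemp(suffix=".png")
        os.close(fd)
        # Force headless operation, whatever the caller asked for.
        # (showgui / plotfile are assumed keyword names for this sketch.)
        kwargs["showgui"] = False
        kwargs["plotfile"] = path
        task(*args, **kwargs)
        return path
    return wrapper


def fake_plot_task(vis, showgui=True, plotfile=None):
    """Hypothetical stand-in for a GUI task; writes placeholder bytes only."""
    if showgui:
        raise RuntimeError("would open a GUI window")
    with open(plotfile, "wb") as f:
        f.write(b"\x89PNG\r\n\x1a\n")  # PNG signature only, not a valid image


headless_plot = save_to_image(fake_plot_task)
image_path = headless_plot("example.ms")
```

Inside a notebook, the returned path could then be shown with IPython.display.Image(filename=image_path), assuming the task wrote a real image.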
The notebook format also has the great advantage that all steps of the data reduction are preserved inside the notebook. This means that the whole data reduction process is self-documenting and fully repeatable. It also allows users to very easily make changes to their pipeline and then rerun the pipeline steps affected.
Because Jupyter requires a much more current Python distribution than the one provided in NRAO's CASA releases, a custom build of CASA is required. We distribute a Docker image containing a version of CASA which uses the most recent (I)Python, matplotlib, etc. Note that this version of CASA can only be used from within Jupyter.
Installation is as simple as executing: docker pull penngwyn/jupytercasa
The source code is located here: https://github.com/aardk/jupyter-casa
Even though we wrap all CASA tasks so that they will not launch a GUI window, the Qt-based CASA tasks unfortunately still require X11. Tasks such as plotms will not start unless X11 is working, even when they never open a window. Therefore the local X11 socket needs to be shared with the Docker container.
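Before starting the container it can help to confirm that the host has something to share. The following is a minimal sketch, assuming the conventional socket directory /tmp/.X11-unix; the function name x11_problems is hypothetical and not part of the distributed software.

```python
import os


def x11_problems(environ=None, socket_dir="/tmp/.X11-unix"):
    """Return a list of reasons why X11 sharing is likely to fail (empty if none)."""
    environ = os.environ if environ is None else environ
    problems = []
    # The container receives DISPLAY from the host, so it must be set here.
    if not environ.get("DISPLAY"):
        problems.append("DISPLAY is not set")
    # The socket directory is what gets bind-mounted into the container.
    if not os.path.isdir(socket_dir):
        problems.append("%s does not exist" % socket_dir)
    return problems


# Report any problems found on the current host.
for problem in x11_problems():
    print(problem)
```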
The simplest incantation to start Jupyter on a recent Ubuntu is:
docker run --rm -p 8888:8888 -i -t -v /tmp/.X11-unix:/tmp/.X11-unix -e DISPLAY=$DISPLAY penngwyn/jupytercasa /bin/sh -c "jupyter notebook"
Note that the --rm option will make Docker delete the container after use.
Of course, the above example is not very useful, as the container will not be able to access locally stored measurement sets. Fortunately, adding a data directory to the Docker container is simple with the -v option:
docker run --rm -p 8888:8888 -i -t -v /tmp/.X11-unix:/tmp/.X11-unix -e DISPLAY=$DISPLAY -v PATH_TO_DATA_DIR:/home/jupyter/data penngwyn/jupytercasa /bin/sh -c "jupyter notebook"
Here, PATH_TO_DATA_DIR should be replaced with the full path to your local data directory.
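For repeated use, the command line can be assembled by a small launcher script. The sketch below only builds and prints the command and does not run Docker. The function name docker_command and its parameters are hypothetical; the image name, mount points, and options are those given in the text.

```python
import os
import shlex

IMAGE = "penngwyn/jupytercasa"


def docker_command(data_dir=None, display=":0", host_port=8888):
    """Build the docker run argument list described in the text."""
    cmd = ["docker", "run", "--rm",
           "-p", "%d:8888" % host_port,
           "-i", "-t",
           # Share the local X11 socket with the container.
           "-v", "/tmp/.X11-unix:/tmp/.X11-unix",
           "-e", "DISPLAY=" + display]
    if data_dir is not None:
        # Docker needs a full path for the host side of the mount.
        cmd += ["-v", os.path.abspath(data_dir) + ":/home/jupyter/data"]
    cmd += [IMAGE, "/bin/sh", "-c", "jupyter notebook"]
    return cmd


# Print a shell-safe version of the command for inspection.
print(" ".join(shlex.quote(part) for part in docker_command("/data/vla")))
```

The resulting list could be passed to subprocess.call once it has been checked. Here /data/vla is only a placeholder for a local data directory.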
We have created an example notebook which contains the NRAO continuum VLA tutorial. To run the notebook locally, download the data files from the NRAO wiki and make the data available to the Docker container as explained above.
For demonstration purposes, we have set up a multi-user Jupyter notebook server, built on JupyterHub, that uses the CASA Jupyter kernel. For the time being, this server can be accessed at http://jupyter.jive.nl.