As we enter the era of Big Data, astronomy and astroparticle physics infrastructures face new challenges and must adapt their data models accordingly.
The raw data can no longer be retrieved and analysed on the end user's own computer or computing infrastructure. Solutions for handling data at these scales are therefore required, and with them, services that help users work with the data have to be provided.
The OBELICS D-INT Services Repository collects several technologies enabling the integration of analysis software. Some of these technologies have been developed within the ASTERICS project; others, namely Rucio and the DIRAC framework, are developed externally but were evaluated for their use in astroparticle physics and radio astronomy.
A common aspect of all these technologies is that they enable user interaction with large datasets stored on one or more clusters located remotely from the user. Aspects of the integration frameworks that need attention are A&A (Authentication & Authorization), the user interface, and the reproducibility of scientific results.
Since the integration task is not unique to astroparticle physics and radio astronomy, many solutions have already been developed that solve a part of the problem. For example, there are over 200 generic systems that deal with the specification of workflows.
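To illustrate what such a workflow specification amounts to, the following minimal sketch expresses a two-step analysis as an explicit dependency graph in plain Python; the step names, file names and functions are hypothetical and merely stand in for what a dedicated workflow system would describe and execute.

```python
# Minimal, hypothetical sketch of a workflow specification in plain Python.
# Real workflow systems add scheduling, provenance tracking and execution
# on remote resources; here the "specification" is just a dependency graph.

def calibrate(raw_file):
    return raw_file.replace(".raw", ".cal")

def make_image(cal_file):
    return cal_file.replace(".cal", ".fits")

# Each step declares which steps' outputs it consumes.
workflow = {
    "calibrate": {"depends_on": [], "run": calibrate},
    "image": {"depends_on": ["calibrate"], "run": make_image},
}

# Naive sequential executor that respects the declared dependencies.
results = {}
for step in ["calibrate", "image"]:
    deps = workflow[step]["depends_on"]
    inputs = [results[d] for d in deps] if deps else ["obs001.raw"]
    results[step] = workflow[step]["run"](inputs[0])
print(results)  # {'calibrate': 'obs001.cal', 'image': 'obs001.fits'}
```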
As analyses in astronomy, astrophysics and astroparticle physics tend to increase in complexity with the capabilities of modern computing systems, the basic requirement of (short- and long-term) reproducibility is becoming harder to achieve.
The large number of dependencies on other software packages that are implicitly used to obtain a result is non-trivial to reproduce exactly, and sometimes not all dependencies are even recognised explicitly.
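As a minimal illustration of one part of the problem, the sketch below records the versions of the Python packages that an analysis session has actually imported, so that they can be stored alongside a result; it uses only the standard library and, by construction, does not capture system-level or otherwise implicit dependencies.

```python
# Minimal sketch: snapshot the versions of the Python packages an analysis
# actually imported, so the software stack can be reconstructed later.
# System libraries, compilers and data files are not covered.
import json
import sys
import importlib.metadata as md

def snapshot_dependencies():
    versions = {}
    for module_name in list(sys.modules):
        top_level = module_name.split(".")[0]
        try:
            versions[top_level] = md.version(top_level)
        except md.PackageNotFoundError:
            pass  # built-in or local module without package metadata
    return versions

# Store this next to the scientific result it belongs to.
print(json.dumps(snapshot_dependencies(), indent=2, sort_keys=True))
```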
In the framework of the CTA Data Management group (CTA-DM), INAF-OACT developed a science gateway and an interactive desktop connected to the federated authentication and authorization infrastructure (INAF CTA AAI). The science gateway leverages open-source technologies to provide web access to a set of tools and services widely used by the CTA community.
The figure below shows the INAF CTA workspace.
Gammapy is an open-source Python package for gamma-ray astronomy built on Numpy and Astropy. It can be found at gammapy.org. It starts from high-level data, preferentially (but not exclusively) in the open DL3 format. This format describes collections of reconstructed particles together with the Instrument Response Functions (IRFs) necessary to derive physical quantities from them. Gammapy produces higher-level, publication-ready products such as sky maps and spectra.
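A minimal sketch of reading DL3 data with Gammapy is shown below; it assumes the publicly released H.E.S.S. DL3-DR1 test dataset pointed to by the $GAMMAPY_DATA environment variable, and the directory path and observation identifiers are illustrative (the exact API may also vary between Gammapy versions).

```python
# Minimal sketch: open a DL3 data store and inspect a few observations.
# Assumes $GAMMAPY_DATA points at the public test datasets (e.g. the
# H.E.S.S. DL3-DR1 release); the path and obs_ids are illustrative.
from gammapy.data import DataStore

data_store = DataStore.from_dir("$GAMMAPY_DATA/hess-dl3-dr1")

# Each observation bundles reconstructed events and the IRFs
# (effective area, energy dispersion, PSF) needed to derive physics results.
observations = data_store.get_observations([23523, 23526])

for obs in observations:
    print(obs.obs_id, len(obs.events.table))
```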
The APERTIF upgrade of the Westerbork Synthesis Radio Telescope will be used to perform a full survey of the radio sky over several years. The data products generated by APERTIF will be stored in ALTA, the Apertif Long Term Archive. Only a few fixed types of data products will be supported, which is acceptable because of the survey nature of the instrument.
Some numbers characterizing the Apertif Long Term Archive:
J4G combines a multi-user version of Jupyter notebooks with gamma-ray astronomy tools. It provides remote single-user Jupyter notebooks and is integrated with the INAF-CTA Authentication and Authorization Infrastructure (INAF-CTA AAI). It makes user data available by deploying the Jupyter notebooks in user space, close to the user's data, thanks to an integrated cloud data service environment (based on ownCloud). J4G thus offers user-friendly and reproducible computing in HPC/HTC environments.
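A hypothetical, minimal sketch of how such a multi-user notebook service could be wired to an external OAuth-based AAI and to per-user data directories is given below, using the JupyterHub and oauthenticator packages; the endpoints, client identifier and paths are placeholders, and the actual J4G configuration may differ.

```python
# jupyterhub_config.py -- hypothetical sketch of a J4G-like deployment.
# The OAuth endpoints, client id and data path below are placeholders,
# not the actual INAF-CTA AAI or ownCloud configuration.
from oauthenticator.generic import GenericOAuthenticator

c = get_config()  # noqa: F821  (injected by JupyterHub when loading the file)

# Delegate login to an external OAuth/OIDC identity provider (the AAI).
c.JupyterHub.authenticator_class = GenericOAuthenticator
c.GenericOAuthenticator.authorize_url = "https://aai.example.org/oauth/authorize"
c.GenericOAuthenticator.token_url = "https://aai.example.org/oauth/token"
c.GenericOAuthenticator.client_id = "j4g-notebooks"

# Start each user's single-user notebook server next to their synced data.
c.Spawner.notebook_dir = "/data/{username}"
```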
The Large Synoptic Survey Telescope (LSST) [1], currently under construction in Chile, is designed to conduct a ten-year survey of the dynamic universe. This large-aperture, wide-field, ground-based telescope will map the entire southern sky in just a few nights in six optical bands from 320 to 1050 nm with its 3.2-gigapixel camera. LSST will take about 2000 exposures per observing night, for a total raw data volume of about 20 TB per 24-hour period.
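A quick back-of-envelope check, using only the figures quoted above (the bytes-per-pixel value is an assumption), shows the data size per exposure that these numbers imply.

```python
# Back-of-envelope check of the LSST raw data rate quoted above.
# Only the exposure count and nightly volume come from the text;
# the 2 bytes-per-pixel figure is an assumption for the cross-check.
exposures_per_night = 2000
raw_volume_per_night_tb = 20

gb_per_exposure = raw_volume_per_night_tb / exposures_per_night * 1000  # ~10 GB

# Cross-check against the 3.2-gigapixel camera (excluding overheads):
pixels_per_exposure = 3.2e9
approx_gb_from_pixels = pixels_per_exposure * 2 / 1e9  # ~6.4 GB

print(f"~{gb_per_exposure:.0f} GB per exposure implied by the nightly volume")
print(f"~{approx_gb_from_pixels:.1f} GB per exposure from pixel count alone")
```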
Rucio provides services that allow scientific collaborations to manage large volumes of data spread across facilities at multiple institutions and organisations under different administrative domains. Within Rucio, all distributed data and storage are federated in a single unified namespace. It provides a data bookkeeping catalog and takes care of file transfers, deletions, and many other advanced data management features, such as policy-based data lifecycle management with Quality-of-Service guarantees and the automation of large-scale, repetitive operational tasks.
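The sketch below shows how these features typically appear to a user through the Rucio Python client; the scope, dataset and storage-element names are hypothetical, and a configured Rucio server with valid credentials is assumed.

```python
# Minimal sketch of the Rucio Python client. Scope, dataset and RSE names
# are hypothetical; a configured Rucio server and credentials are assumed.
from rucio.client import Client

client = Client()

# Browse the unified namespace: list the datasets registered in a scope.
for name in client.list_dids("user.jdoe", filters={}, did_type="dataset"):
    print(name)

# Declare a policy rather than moving files by hand: keep two replicas of
# a dataset on tape-backed storage; Rucio schedules the needed transfers.
client.add_replication_rule(
    dids=[{"scope": "user.jdoe", "name": "obs_2019_campaign"}],
    copies=2,
    rse_expression="tier1_tape",
)
```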
DIRAC is a software framework for distributed computing that acts as an intermediate layer between users and resources: as such, it is considered an interware. DIRAC allows for interoperability by providing simplified interfaces to all kinds of computing resources (Illustration 2): grid worker nodes, cloud virtual machines, or direct access to a batch system, in both HTC and HPC contexts.
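As an illustration of this simplified interface, the sketch below submits a trivial job through the DIRAC Python API; the job name and executable are placeholders, and a configured DIRAC client with a valid proxy is assumed.

```python
# Minimal sketch: submit a trivial job through the DIRAC Python API.
# A configured DIRAC client and a valid grid proxy are assumed;
# the job name and executable are placeholders.
from DIRAC.Core.Base import Script
Script.parseCommandLine()  # initialise the DIRAC client environment

from DIRAC.Interfaces.API.Job import Job
from DIRAC.Interfaces.API.Dirac import Dirac

job = Job()
job.setName("obelics_demo")
job.setExecutable("echo", arguments="hello from the interware")
job.setCPUTime(3600)  # requested CPU time in seconds

dirac = Dirac()
result = dirac.submitJob(job)
print(result)  # S_OK/S_ERROR dictionary, with the job ID on success
```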