Rucio

Rucio is a project that provides services that allow scientific collaborations to manage large volumes of data spread across facilities at multiple institutions and organisations under different administrative domains. Within Rucio, all distributed data and storage are federated in a single unified namespace. It provides a data bookkeeping catalog and takes care of file transfers, deletions, and many other advanced data management features, such as policy-based data lifecycle management with Quality of Service guarantees and the automation of large-scale, repetitive operational tasks.

Rucio interfaces with well-established products and frameworks from industry and scientific domains, such as object stores and cloud providers. It is a data management solution that can cover the needs of different communities in the scientific domain, e.g., high-energy physics (HEP), astronomy, or biology.

Rucio is open-source software released under the Apache 2 licence. Project management lies with the ATLAS collaboration (https://atlas.cern/), which includes several organisations (full list: https://github.com/rucio/rucio/blob/master/AUTHORS.rst).

Usage in scientific projects

Rucio was originally developed to meet the data management needs of the ATLAS experiment at the Large Hadron Collider (LHC). Rucio manages all ATLAS data on the Worldwide LHC Computing Grid (WLCG), which spans more than 130 data centres. It has demonstrated production use at the required scale and functionality: it currently handles more than 350 Petabytes of data, stored in 1 billion files and accessed by more than 2000 physicists.

Additionally, Rucio has been put into production for two other experiments, AMS (http://www.ams02.org/) and Xenon1T (http://www.xenon1t.org/), demonstrating viability at different scales and usage patterns. For example, Xenon1T has 5.6 Petabytes of data, stored in 100,000 files distributed across 5 data centres.

Rucio is under evaluation by other collaborations as a common solution, for example,

 

Advantages

Rucio is suited to scientific collaborations that need to manage large volumes of data spread across facilities at multiple institutions under different administrative domains. It gives a distributed scientific community ways to access and share data and automates a number of repetitive administrative tasks.

Some projects are comparable to Rucio, although they do not necessarily provide the same features or target the same audience:

iRODS (https://irods.org) is open-source data management software that operates at a smaller scale than Rucio and typically targets single data centres.

DIRAC (http://diracgrid.org) data management was originally developed for the LHCb experiment. DIRAC provides less advanced data management features and operates at a smaller scale.

FTS3 (http://fts3-service.web.cern.ch/) and GlobusOnline (https://www.globus.org/tags/globus-online) are services that transfer and delete data between two storage systems. They do not provide data management features such as catalogs, policies, or lifecycle management. Both can be used through Rucio as pluggable transfer services.

 

Feedback from end users

The general user feedback is very good. We specifically aim for quick response times and short turnaround times on feature and bug-fix requests.

End users can give feedback on Rucio in several ways: through the Rucio users mailing list (rucio-users@googlegroups.com) and through the issue tracker of the GitHub repository (https://github.com/rucio/rucio/issues/new).

For ATLAS (2000 active users), feedback is also collected through the Distributed Analysis Support Team (DAST), the ATLAS internal issue tracker on JIRA, and our ATLAS internal support mailing list. DAST provides first-level user support. We also collect user feedback via our developers mailing list (rucio-dev@cern.ch), mostly from our contact persons in collaborations outside ATLAS (such as AMS and Xenon1T).

 

Contact persons & website

Author: 
CERN