Professor Michael Hildreth from the University of Notre Dame argues that while efforts are underway to ensure that the results of particle physics experiments (and the knowledge behind them) are preserved, the future will ultimately require a new way of working.
Experimental particle physics knows something about Big Data. At the Large Hadron Collider (LHC), collisions occurring 40 million times per second each generate up to one megabyte of data that must be extracted from the large detectors. Massively parallel data acquisition and filtering systems examine the products of each collision and reduce that 40 terabytes per second to merely a gigabyte per second of ‘interesting’ collisions. In aggregate, this results in annual ‘raw’ datasets of the order of 100 petabytes per experiment, which must be processed by several stages of software before the particles sought by physicists are recognisable. This processing takes place on hundreds of thousands of compute cores spread around the globe and connected to the Worldwide LHC Computing Grid. Physicists construct hundreds of individual analyses studying various aspects of this dataset, adding further layers of software and processing before the end results, hundreds of journal articles, can appear.
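The scale of this reduction can be checked with simple arithmetic, using the approximate figures quoted above (a back-of-the-envelope sketch, not official experiment numbers):

```python
# Back-of-the-envelope check of the LHC data-reduction figures above.
collision_rate = 40e6   # collisions per second
event_size = 1e6        # up to ~1 MB of data per collision, in bytes

raw_rate = collision_rate * event_size   # bytes/second off the detectors
print(raw_rate / 1e12)                   # 40.0 -> ~40 TB/s, as quoted

kept_rate = 1e9                          # ~1 GB/s of 'interesting' collisions
rejection = raw_rate / kept_rate
print(rejection)                         # 40000.0 -> only ~1 in 40,000 kept
```

In other words, the trigger and filtering systems discard all but roughly one collision in forty thousand before anything is written to permanent storage.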
This is a highly orchestrated but, in the end, highly decentralised system. Each analyst has access to the processed data representing hundreds of petabytes of raw collision data. His or her individual work, limited only by creativity, available compute resources, and the journal reviewer’s opinion, is the science derived from these data, and it is what contributes to our study of the ultimate building blocks of the Universe. If the analyst moves institutions, or leaves the field, what becomes of this work? How can it be preserved, either to check its validity, or to pass on its insight and infrastructure to a new round of physicists? Even more important, how can these gigantic datasets, and the software that processes them into ‘physics-useful’ data, be kept for future generations and citizen scientists? The LHC, its associated experiments, and the effort to run the complex for many years have a total cost easily exceeding US$10bn. The data collected, and the knowledge required to access and process it, are the products of this investment, and will likely never be reproduced. As LHC physicists, we have an obligation to preserve this work, including its intellectual products, the analyses, so that it remains available to future scientists. How might we make our data FAIR (Findable, Accessible, Interoperable, Reusable)?
Since 2012 I, along with many others, have led various projects to explore the possibilities and develop technologies for knowledge preservation in particle physics. Various fields, especially astrophysics, have led the way in making datasets public and ensuring their long-term preservation. Particle physics datasets, especially those at the LHC, are both larger and much more complex, however, than many of those that have been successfully ‘preserved’. The huge collaborations and the complexity and diversity of the software stacks present many new challenges. In 2012, I assembled a multi-university US team to form the DASPOS1 (Data and Software Preservation for Open Science) project, which began to investigate systematically the ingredients and procedures necessary for knowledge preservation in particle physics. The team included physicists, digital librarians, and computer scientists, all of whom brought necessary skills to what was one of the earliest projects of this kind in the USA.
The DASPOS effort comprised a wide range of investigations. At a microscopic level, clever methods were developed to trace all possible external calls from a piece of software, mapping out everything it uses when it is running. One could then catalogue the dependencies and attempt to collect and bundle them to make the software self-contained and preservable. This led to work with virtual machines and various types of Linux containers. One insight from this work is that generating an executable program that can run on generic cloud computing or other non-custom resources requires the same encapsulation that preservation does; the same tools can therefore serve both purposes. The advent of simple ways to build and manage Linux containers, such as the Docker ecosystem, has dramatically improved the ease with which individual users can capture their work.
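To make the dependency-mapping idea concrete, here is a minimal sketch in the same spirit (my own illustration, not the DASPOS tooling, which traced dependencies at the system-call level): Python’s standard-library `ModuleFinder` statically lists every module a script would pull in, so the set could then be catalogued and bundled.

```python
# A minimal sketch of mapping out a program's dependencies so they can be
# catalogued and bundled. The DASPOS tooling traced external calls at the
# system level; here, Python's standard-library ModuleFinder statically
# lists the modules a toy "analysis" script would pull in.
import os
import tempfile
from modulefinder import ModuleFinder

script = "import json\nprint(json.dumps({'events': 42}))\n"
with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
    f.write(script)
    path = f.name

finder = ModuleFinder()
finder.run_script(path)          # static analysis: the script is not executed
deps = sorted(finder.modules)    # every module the script depends on
os.unlink(path)

print("json" in deps)            # True: the json dependency is captured
```

The real problem is, of course, much harder: shared libraries, data files, and environment configuration all have to be captured as well, which is what makes container-based encapsulation so attractive.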
Capturing an executable, however, is only a minor piece in a much larger framework. For example, in order for an executable to be reusable after it is captured, it needs metadata describing what it does, so that another user can understand its function, its required inputs and dependencies, and its outputs. ‘Smart Containers’ were developed that wrap a standard Linux container with an additional metadata layer to provide this functionality and flexibility. More generally, no metadata vocabulary or ontological system existed that could describe a particle physics analysis, so we had to build one. Another common problem is the record-keeping involved in capturing the provenance of a given piece of data that may be derived from several chained processing steps. Several techniques were explored here, including systems that would automatically regenerate data products if any intermediate software or input data changed. Many of these studies were undertaken at a fundamental level, in an attempt to understand the computer science and semantics involved before settling on a final product or strategy.
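The provenance idea, automatically regenerating a data product when any of its inputs or software change, can be sketched with content fingerprints. This is a toy illustration of the principle only, not one of the systems DASPOS explored:

```python
# A minimal sketch of provenance-aware regeneration: a derived product is
# stamped with a hash of everything it depends on (its input data and the
# version of the code that produced it) and is rebuilt only when either changes.
import hashlib
import json

def fingerprint(inputs: dict, code_version: str) -> str:
    """Deterministic hash of the product's complete provenance."""
    payload = json.dumps({"inputs": inputs, "code": code_version}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

cache = {}  # fingerprint -> derived data product

def derive(inputs: dict, code_version: str) -> dict:
    fp = fingerprint(inputs, code_version)
    if fp not in cache:                       # regenerate only on change
        cache[fp] = {"sum": sum(inputs.values()), "provenance": fp}
    return cache[fp]

a = derive({"run1": 10, "run2": 32}, "v1.0")
b = derive({"run1": 10, "run2": 32}, "v1.0")   # unchanged: cache hit
c = derive({"run1": 10, "run2": 32}, "v1.1")   # code changed: regenerated
print(a is b, a is c)   # True False
```

The same fingerprint doubles as a provenance record: given a product, one can look up exactly which inputs and code version produced it.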
Preserving the legacy of large particle physics projects
This work took place as part of a global effort focused on preserving the legacy of large particle physics projects. Overall co-ordination and a broader forum have been provided by the Data Preservation in High Energy Physics (DPHEP)2 subcommittee of ICFA, the International Committee on Future Accelerators, and by several national laboratories, including CERN, SLAC, DESY, and Fermilab. CERN has been the hub at the centre of activity and new development, while the other labs host preservation projects safeguarding data and software from their previous experiments. In particular, through efforts in its Scientific Information Services Group and its Computing Division, CERN has made significant investments in data and knowledge preservation, adopting some of the work from the DASPOS project while creating significant infrastructure. CERN now hosts the CERN Open Data Portal,3 where multiple petabytes of data, preserved software, extensive examples, and instructions can be found. None of this would be possible, however, without policies encouraging the release of software and data. After more than a decade of discussions, I am pleased to report that all of the LHC experiments have now agreed to release large fractions of their data to the public.4
Preserving the knowledge, too
Providing access to data for analysis, even with examples, does not preserve the knowledge behind the science done at the LHC. For that, effort is required to capture and document the analyses performed by the physicists. Thus, as a complement to the Open Data Portal, the DASPOS team worked with the groups at CERN to create the CERN Analysis Preservation portal (CAP).5 The CAP attempts to preserve a record of the ingredients of a physics analysis: what the topic was, which data were analysed, what software was used, and so on. Connections have been built from CAP directly to the databases of the individual experiments, pulling as much information as possible from internal sources into the CAP records. Even web pages and other documents can be included, providing broad documentation, and links to the analyst’s code repositories are recorded. A metadata vocabulary adapted from the DASPOS model provides a searchable description of the analysis information.
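A much-simplified record in this spirit might look as follows. The field names and values here are my own illustration, not the actual CAP schema:

```python
# A hypothetical, heavily simplified analysis-preservation record in the
# spirit of CAP. All field names and values are illustrative only.
record = {
    "title": "Search for a hypothetical resonance",
    "dataset": "/ExperimentData/Run2/AOD",        # which data were analysed
    "software": {"framework": "AnalysisFW", "version": "3.2"},
    "code_repository": "https://example.org/analysis-code.git",
    "workflow": ["skim", "select", "fit"],        # processing steps, in order
}

# A searchable description requires consistent keys; a minimal validity check:
required = {"title", "dataset", "software"}
print(required.issubset(record))   # True
```

The value of such a record lies less in any single field than in its uniformity: once every analysis is described with the same vocabulary, the whole collection becomes searchable.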
We can go one step further, however. ‘Behind’ the CAP, the DASPOS team partnered with CERN to create the REANA framework,6 a reproducible-research data analysis platform that enables analysis re-use. REANA allows the re-execution of extremely complicated workflows, orchestrating the Linux containers that hold the executables for each individual processing step and routing data inputs and outputs through the analysis chain. Using these tools, an entire physics analysis can be captured and stored for later use, even accessing data from the Open Data Portal.
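The chaining that REANA performs can be illustrated with a toy serial-workflow runner. In REANA each step is a command executed inside its own Linux container; in this sketch, plain Python functions stand in for the containerized executables:

```python
# A toy serial-workflow runner illustrating the chaining REANA performs.
# Each "step" here is a plain function standing in for a containerized
# executable; the runner routes one step's output into the next step's input.
def select_events(data):
    return [x for x in data if x > 2]       # step 1: filter raw "events"

def histogram(events):
    counts = {}
    for e in events:                        # step 2: summarise the selection
        counts[e] = counts.get(e, 0) + 1
    return counts

def run_workflow(steps, payload):
    """Execute the steps in order, chaining outputs to inputs."""
    for step in steps:
        payload = step(payload)
    return payload

result = run_workflow([select_events, histogram], [1, 3, 3, 2, 5])
print(result)   # {3: 2, 5: 1}
```

Because each step is self-contained, rerunning the whole chain later, or swapping one step for an improved version, requires no changes to the others; that is the property REANA exploits at full scale.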
This offers a number of advantages. First, the analysis becomes reproducible as it can be rerun at any time with consistent results. Second, the analysis becomes shareable. Anyone with the analysis framework and the components can run the analysis and examine the outputs at various steps. This offers a potentially ideal way of training new students, for example, since they have a baseline to which they can compare their own work. Third, the preserved ingredients are ‘composable’: in principle, any of the processing steps can be lifted out of a given workflow and reused elsewhere if appropriate.
The composability and flexibility of the REANA system have opened opportunities in other areas of research. I am co-lead on a different project, SCAILFIN (Scalable CyberInfrastructure for Artificial Intelligence and Likelihood Free Inference),7 which is using the REANA framework’s capabilities to orchestrate complex machine-learning training workflows on leadership-class high-performance computing systems. While we are currently working on machine learning in high energy physics, the workflows themselves are completely generic. These problems of software/executable preservation and reuse are general; developments engineered to be flexible can be useful across many different fields of research.
A different way of working
In a recent article,8 my colleagues and I argue that we need a different way of working. While something like the REANA infrastructure does serve the purpose of preserving physics analyses and making them reproducible and reusable, our current workflows do not keep much of the information REANA needs to make them ‘preservable’. We need better tools to facilitate this kind of work. Where might they come from? This raises a difficult problem. At a broad level, science and scientists are caught between opposing forces: the push towards FAIR data, and the realisation that making one’s data FAIR is a tremendous amount of work. Mandates from funding agencies or governments for FAIR data often come with no additional funding, not even for the archives that would store the data. Moreover, blanket requirements for open data often lack the subtlety to evaluate which data should be made open and which data is not worth the effort. Finally, as mentioned above, the tools that would facilitate the creation of FAIR data in all fields of science do not exist. Without some incentive, it is hard to see what would motivate scientists to create and adopt new tools and techniques just for the sake of being able to make their data FAIR.
Among the sciences, particle physics is relatively advanced in pursuit of FAIR data because of its fairly uniform data configurations, common analysis techniques, and the huge existing common computing and processing infrastructure. Even with these advantages and all of the work done on the preservation front, we are still far away from being able to easily produce and archive FAIR data. The public LHC data has been reanalysed by non-LHC scientists to advance the understanding of quantum chromodynamics. Other uses, including the reinterpretation of analyses with new physics models, have also arisen. Are these enough to justify the effort expended to publish and host these large datasets? Time will tell.
These technical and policy topics have been at the centre of many workshops over the past few years, several of which I have led. There are also many international groups, such as those within the Research Data Alliance, dedicated to these topics. In particular, I co-lead a group dedicated to preservation tools, techniques, and policies, where we attempt to bridge the gap between researchers and archivists by understanding what kinds of tools researchers need to preserve their data, with FAIR data as an aspirational goal. It is fair to say, pun intended, that the recent enthusiasm surrounding the concept of FAIR data and its potential benefits is gathering momentum, sparking both creativity and funding. If science at large is to make progress, however, this enthusiasm must be matched by recognition and support of the infrastructure development that is required. I hope we are up to the task.
- Chen, X., Dallmeier-Tiessen, S., Dasler, R. et al. ‘Open is not enough’. Nature Phys 15, 113–119 (2019). https://doi.org/10.1038/s41567-018-0342-2
Please note, this article will also appear in the fifth edition of our quarterly publication.