Minisymposium: The Exabyte Data Challenge

Various data-intensive scientific domains must deal with Exabytes of data before they reach the Exaflop. Data management at these extreme scales is challenging and spans the entire workflow, from pre-processing and data production to data analysis. While many research approaches and science databases aim to manage data and push their limits over time, practitioners still struggle to manage their data even in the Petabyte era, for instance to achieve high performance and to easily locate data upon request. With billions of files, manual and fine-grained data management in HPC environments reaches the limits of its scalability. Various domain-specific solutions have been developed that mitigate performance and management issues and enable data management in the Petabyte era. However, new storage technologies and heterogeneous environments increase the challenges, and with them the development effort for individual solutions.

In this minisymposium, speakers from environmental science (the Met Office and ECMWF), CERN, and the Square Kilometre Array will address this matter for different domains; each speaker will present the challenges faced in their scientific domain today, give an outlook for the future, and present state-of-the-art approaches the community follows to mitigate the data deluge.

This minisymposium is organized as part of the PASC official schedule.

Date Friday, June 14th, 2019
Venue HG D 1.1
Contact Dr. Julian Kunkel

This workshop is powered by the Virtual Institute for I/O and ESiWACE 1).

Agenda

  • 13:30 Fighting the Data Deluge with Data-Centric Middleware (Julian Kunkel)
    An Exabyte of storage occupied by computational simulations will be reached long before Exaflop systems are built. Motivated by workflows in climate and weather, the ESiWACE project is developing the Earth System Data Middleware, which focuses on optimizing performance across the heterogeneous storage landscape. The talk concludes by discussing the need for community development of standards that will lead to next-generation interfaces enabling data-centric processing.
    Slides
  • 14:00 The CERN Tape Archive: Preparing for the Exabyte Storage Era (Michael Davis)
    The High Energy Physics experiments at CERN generate a deluge of data which must be efficiently archived for later retrieval and analysis. During the first two Runs of the LHC (2009-2018), over 250 PB of physics data was collected and archived to tape. CERN faces two main challenges for archival storage over the next decade. First, the rate of data taking and the total volume of data will increase exponentially due to improvements in the luminosity and availability of the LHC and upgrades to the detectors and data acquisition system. Data archival is expected to reach 150 PB/year during Run–3 (2021-2023), increasing to 400 PB/year during Run–4 (2025-). The integrated total data on tape will exceed one Exabyte within the next few years. Second, constraints in available computing power and disk capacity will change the way in which archival storage is used by the experiments. This presentation will describe these challenges and outline the preparations that the CERN IT Storage Group is making for the Exabyte storage era.
    Slides
  • 14:30 The Met Office Cold Storage Future: Tape or Cloud? (Richard Lawrence)
    The Met Office hosts one of the largest environmental science archives in the world, using tape as the primary storage mechanism. The archive holds over 275 petabytes of data today and is expected to grow beyond 5 exabytes during the next decade. Combined with daily ingress and egress rates that each exceed 200 terabytes, the Met Office needs to ensure that the archive does not become the bottleneck for the production of our operational weather forecasts and for research needs. We will examine the current archive system, look at its pain points, and discuss how the Met Office expects the needs of the archive to change in the short term. We then outline the UK government principle of 'cloud first' for digital designs and ask whether it can be applied to large-scale Science IT. The talk will conclude by assessing how our current approach measures up to public cloud offerings, looking at the capability, risk, benefits, and costs of a cloud-based archive.
    Slides
  • 15:00 ECMWF's Extreme Data Challenges Towards an Exascale Weather Forecasting System (Tiago Quintino, Simon Smart, James Hawkes, Baudouin Raoult)
    ECMWF's operational weather forecast generates massive I/O in short bursts, currently approaching 100 TiB per day, in two hour-long windows. From this output, millions of user-defined daily products are generated and disseminated to member states and commercial clients all over the world. As ECMWF aims to achieve Exascale NWP by 2025, we expect to handle around 1 PiB of model data per day and to generate hundreds of millions of daily products. This poses a strong challenge to a complex workflow that is already facing I/O bottlenecks. To help tackle this challenge, ECMWF is developing multiple solutions and changes to its workflows and incrementally bringing them into operations. For example, it has developed a high-performance distributed object store that manages the model output for the needs of our NWP and climate simulations, making data available via scientifically meaningful requests and integrating seamlessly with the rest of the operational workflow. We will present how ECMWF is leveraging this and other technologies to address current performance issues in our operations, while at the same time preparing for technology changes in the hardware and system landscape and the convergence between HPC and Cloud provisioning.
  • 15:30 End
1)
ESiWACE has received funding from the European Union’s Horizon 2020 Research and Innovation Programme under Grant Agreement No 675191