BoF: Large-Scale Data Management with Data Lakes
Abstract
Large-scale data management is challenging for both users and data centers. Users struggle to organize the millions of files, and the associated software, involved in scientific workflows. Data centers struggle with the complexity of providing and optimizing storage environments without knowing the exact intent of their users. Creating data management plans and clearly defining the information life cycle and workflows supports documentation and increases reproducibility and portability. Many workflows integrate user-specific metadata into search engines, allowing users to navigate their data. Concepts such as data lakes and lakehouses are becoming popular as central storage. Data lakes aim to integrate data from diverse sources into a unified management system while retaining the data in its original format. The idea is to dump scientific data into the lake and organize it following the FAIR principles. In addition, a research data management solution should not only ensure data preservation but also support scientists in complying with good scientific practice. Developing good data management practice is difficult and domain-specific; interaction with users who face similar challenges therefore accelerates solution development. The aim of the BoF is to foster community building around this topic and to discuss with the audience in order to identify common problems and their individual solutions. First, several speakers from industry, data centers, and academia give lightning talks on large-scale data management with a particular focus on data lakes. In the second part, surveys, discussions, and community building take place.
The BoF takes place as part of ISC HPC.
Date | June 1st, 2022, 16:00-17:00
Venue | CCH, Hamburg
Contact | Hendrik Nolte
This BoF is powered by the NHR, the Virtual Institute for I/O, the Journal of High-Performance Storage, and ESiWACE.
Organisation
The BoF is organised by
- Hendrik Nolte (GWDG), hendrik.nolte@gwdg.de
- Julian Kunkel (Georg-August-Universität Göttingen/GWDG), julian.kunkel@gwdg.de
- Stefano Claudio Gorini (ETHZ-CSCS)
Agenda
- 16:00 Welcome – Hendrik Nolte – Slides
- Challenges with Data Lakes – Hendrik Nolte – Slides
- Data Management Challenges, Potential and Questions – Julian Kunkel – Slides
- Moving Data – Stefano Claudio Gorini – Slides
- Housing data lake: a storage viewpoint – Jean-Thomas Acquaviva – Slides
- Blue Brain Nexus – Felix Schürmann – Slides
- Challenges and perspectives using iRODS for data management – Terrell Russell (RENCI)