BoF: Large-Scale Data Management with Data Lakes

Large-scale data management is challenging for both users and data centers. Users struggle to organize the millions of files involved in scientific workflows, along with the associated software. Data centers face the complexity of providing and optimizing storage environments without knowing the exact intent of their users. Creating data management plans and clearly defining the information life cycle and workflows supports documentation and increases reproducibility and portability. Many workflows integrate user-specific metadata into search engines, allowing users to navigate their data. Concepts such as data lakes and lakehouses are becoming popular as central storage. Data lakes aim to integrate data from diverse sources into a unified management system while retaining the data in its original format. The idea is to dump scientific data into the lake and organize it following the FAIR principles. In addition, a research data management solution should not only ensure data preservation but also support scientists in complying with good scientific practice. Developing good data management practice is difficult and domain-specific; therefore, interaction among users with similar challenges accelerates the development of solutions. The aim of the BoF is to foster community building around this topic and to discuss with the audience in order to identify common problems and their individual solutions. In the first part, several speakers from industry, data centers, and academia give lightning talks on large-scale data management, with a particular focus on data lakes. In the second part, surveys, discussions, and community building take place.

The BoF takes place as part of ISC HPC.

Date June 1st, 2022, 16:00-17:00
Venue CCH, Hamburg
Contact Hendrik Nolte

This BoF is powered by the NHR, the Virtual Institute for I/O, the Journal of High-Performance Storage, and ESiWACE 1).

The BoF is organised by

Agenda

  • 16:00 Welcome – Hendrik Nolte – Slides
  • Challenges with Data Lakes – Hendrik Nolte – Slides
  • Data Management Challenges, Potential and Questions – Julian Kunkel – Slides
  • Moving Data – Stefano Claudio Gorini – Slides
  • Housing data lake: a storage viewpoint – Jean-Thomas Acquaviva – Slides
  • Blue Brain Nexus – Felix Schürmann – Slides
  • Challenges and perspectives using iRODS for data management – Terrell Russell (RENCI)

1) ESiWACE is funded by the European Union’s Horizon 2020 research and innovation programme under grant agreement No. 823988.