Minisymposium: Leveraging Data Lakes to Manage and Process Scientific Data

Abstract

In recent years, data lakes have become increasingly popular as central storage, particularly for unstructured data. Generally, data lakes aim to integrate heterogeneous data from diverse sources into a unified information management system, where data is retained in its original format. Storing data in its raw format, as opposed to imposing a schema on write as is commonly done in a data warehouse, supports the reuse and sharing of already collected data. The idea is to basically dump the data into the lake and later fish for knowledge using sophisticated analysis tools. This approach, however, is quite challenging, since it has to be ensured that all data, regardless of the number or size of the data sets, can be found and accessed later on. In addition, especially for domain researchers in public research institutions, a research data management solution should not only ensure the preservation of the data, but also support and guide scientists in complying with good scientific practice from the very beginning. To discuss the current challenges and their possible solutions, and to share personal insights into data lakes, we bring different experts together and discuss the potential and the technical approaches with the scientific community.

The minisymposium is held in conjunction with the PASC conference.

Date		June 28th, 2022
Venue		Basel, Switzerland
Contact		Hendrik Nolte (GWDG) Hendrik.Nolte@gwdg.de

This minisymposium is supported by the NHR, the Virtual Institute for I/O, the Journal of High-Performance Storage, and ESiWACE ¹⁾.

The minisymposium is organised by

Hendrik Nolte (GWDG), Hendrik.Nolte@gwdg.de
Julian Kunkel (Georg-August-Universität Göttingen/GWDG), julian.kunkel@gwdg.de

Data Integration in Data Lakes – Prof. Dr. Rihan Hai (Affiliation: TU Delft)
Slides

Although big data has been discussed for some years, it still poses many research challenges, such as the variety of data. Diverse data sources often exist in information silos, which are collections of non-integrated data management systems with heterogeneous schemas, query languages, and data models. It is very difficult to efficiently integrate, access, and query the large volume of diverse data in these information silos with traditional 'schema-on-write' approaches such as data warehouses. Data lake systems, which are repositories that store raw data in its original formats and provide a common access interface, have been proposed as a solution to this problem. In this talk, I will discuss the landscape of existing data lake problems and our solutions for integrating multiple heterogeneous data sources in data lakes. I will also introduce recent advances in supporting AI in data lakes.

Enabling industrialized analysis of textual documents in data lakes – Dr. Pegdwendé Nicolas Sawadogo (Affiliation: Fondation de l'AP-HP)
Slides

The concept of the data lake was introduced in 2010 by James Dixon as an alternative to data warehouses for big data analysis and management. Unlike data warehouses, data lakes follow a schema-on-read approach to better support ad hoc analyses. In the absence of a fixed schema, data from the lake can be handled in many different ways. This, however, makes industrialized analyses from data lakes difficult. More recently, the concept of the data lakehouse has been proposed as a solution to enable industrialized analyses in data lakes. It consists of merging the best of the data lake and data warehouse concepts. Nevertheless, data lakehouses are still limited, as they essentially focus on structured data management. Yet the majority of big data is made up of unstructured data, including textual data. To remedy the limitations of data lakehouses, we introduce a new approach to enable industrialized analyses on textual documents from a data lake. Our approach is based on techniques from the information retrieval and text mining domains. In this presentation, we particularly focus on architecture and metadata management, which are essential issues when building a data lake system.

Utilizing Data Lakes for Managing Multidisciplinary Research Data – Dr. Mark Greiner (Affiliation: Max Planck Institute for Chemical Energy Conversion)
Slides

Scientific research institutes face many of the same challenges as commercial organizations when it comes to managing data. Just as for commercial organizations, a common situation is data silos or, even worse, data swamps. The fundamental problem is that the continual manual effort needed to govern data proves to be too much for many research institutes. A possible solution would be to automate as much of the process as possible and to minimize duplicated effort. In the present talk, we discuss our current efforts to improve the data management of a mid-sized academic research institution. We show that, while some aspects of data management, such as data ingestion, processing, and reporting, are very similar to those faced in commercial organizations, some other aspects are quite specific to the use case of academic research. In these cases, we adapt or rebuild custom modules to accommodate the unique workflows of researchers. In the end, we aim to make use of known best practices and technologies while embracing the uniqueness of research practices.

A FAIR Digital Object-Based Data Lake Architecture to Support Various User Groups and Scientific Domains – Hendrik Nolte (Affiliation: Gesellschaft für wissenschaftliche Datenverarbeitung mbH Göttingen)
Slides

Across various domains, data lakes are successfully utilized to centrally store all data of an organization in its raw format. This promises high reusability of the stored data, since a schema is applied on read, which prevents information loss due to ETL (Extract, Transform, Load) processes. Despite this schema-on-read approach, some modeling is mandatory to ensure proper data integration, comprehensibility, and quality. These data models are maintained within a central data catalog which can be queried. To further organize the data in the data lake, different architectures have been proposed, such as the widely known zone architecture, where data is assigned to different zones according to its degree of processing. In this talk, a novel data lake architecture based on FAIR (Findable, Accessible, Interoperable, Reusable) Digital Objects (FDOs) with (high-performance) processing capabilities is presented. These FDOs abstract away the handling of the underlying mass storage and databases, thereby enforcing a homogeneous state, while offering flat yet easily comprehensible research data management. The FDOs are connected by a provenance-centered graph. Users can define generic workflows, which are reproducible by design, making this data lake implementation ideally suited for science.

¹⁾ ESiWACE is funded by the European Union’s Horizon 2020 research and innovation programme under grant agreement No. 823988.
