Course: High-Performance Data Analytics

Data-driven science requires handling large volumes of data in a short period of time. Executing efficient workflows is challenging for both users and systems. This module introduces concepts, principles, tools, system architectures, techniques, and algorithms for large-scale data analytics using distributed and parallel computing. We will investigate the state of the art in processing data workloads using solutions from High-Performance Computing and Big Data Analytics.

Note that the lecture will be given online. I will run a survey about the exercise and will probably offer hybrid attendance for it.

Contact		Julian Kunkel
Location		Virtual, meeting room
Time		Monday 16:15-17:45 (lecture), Monday 12:15-13:45 (lunch exercise!)
Language		English
Module		Module B.Inf.1712: Vertiefung Hochleistungsrechnen; Module M.Inf.1236: High-Performance Data Analytics
SWS		4
Credits		6
Contact time		56 hours
Independent study		124 hours
Exam		17.03., 10:00-12:00, room MN09 (Geowissenschaften); second exam date: Friday 08.04.2022, 10:00-12:00, in person, room MN09 (Geowissenschaften)

Topics cover:

Challenges in high-performance data analytics
Use-cases for large-scale data analytics
Performance models for parallel systems and workload execution
Data models to organize data and (No)SQL solutions for data management
Industry-relevant processing models with tools like Hadoop, Spark, and ParaView
System architectures for processing large data volumes
Relevant algorithms and data structures
Visual Analytics
Parallel and distributed file systems

Guest talks from academia and industry will be incorporated into the teaching to demonstrate the applicability of the topics.

Weekly laboratory practicals and tutorials will guide students in learning the concepts and tools. In the process, students will form a learning community and integrate peer learning into the practicals. Students will have opportunities to present their solutions to challenging tasks in class, developing presentation skills and gaining confidence in the topics.

Assign big data challenges to a given use-case
Outline use-case examples for high-performance data analytics
Estimate performance and runtime for a given workload and system
Create a suitable hardware configuration to execute a given workload within a deadline
Construct suitable data models for a given use-case and discuss their pros and cons
Discuss the rationales behind the design decisions for the tools
Describe the concept of visual analytics and its potential in scientific workflows
Compare the features and architectures of NoSQL solutions to the abstract concept of a parallel file system
Appraise the requirements for designing system architectures for systems storing and processing data
Apply distributed algorithms and data structures to a given problem instance and illustrate their processing steps in pseudocode
Explain the importance of hardware characteristics when executing a given workload
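
As an illustration of the objectives on estimating runtime and sizing hardware for a deadline, the following is a minimal back-of-the-envelope sketch. It is not taken from the course material. It assumes a purely I/O-bound workload that scales perfectly across nodes, and the function names and the throughput figure are hypothetical.

```python
import math


def estimate_runtime_s(data_bytes: float, node_throughput_bps: float, nodes: int) -> float:
    """Estimate the runtime in seconds of an I/O-bound scan.

    Assumption: throughput adds up linearly across nodes (perfect scaling),
    and compute, network, and startup costs are ignored.
    """
    if nodes < 1 or node_throughput_bps <= 0:
        raise ValueError("need at least one node and a positive throughput")
    return data_bytes / (node_throughput_bps * nodes)


def nodes_for_deadline(data_bytes: float, node_throughput_bps: float, deadline_s: float) -> int:
    """Smallest node count that meets the deadline under the same assumptions."""
    if node_throughput_bps <= 0 or deadline_s <= 0:
        raise ValueError("throughput and deadline must be positive")
    return max(1, math.ceil(data_bytes / (node_throughput_bps * deadline_s)))


if __name__ == "__main__":
    ten_tb = 10 * 10**12        # 10 TB of input data
    per_node = 200 * 10**6      # hypothetical 200 MB/s per node
    print(estimate_runtime_s(ten_tb, per_node, nodes=10))        # 5000.0 seconds
    print(nodes_for_deadline(ten_tb, per_node, deadline_s=3600))  # 14 nodes
```

A real system rarely scales perfectly, so such an estimate is best read as a lower bound on runtime and on the number of nodes required.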

Written exam (90 min) or oral exam (approx. 30 min), depending on the number of attendees.

See the learning objectives.

25.10.21 - Lecture Overview. Use Cases. – Slides – Exercise
- Exercise: There is no meeting today!
- Exercise sheet 1 is due next week!
- Exercise topics: Discussion of use cases covering business/industry and science. Sketching the analytics pipeline for a use case.
01.11.21 - Data Models and Data Processing Strategies – Slides – Exercise
- Exercise: Developing data models for selected use cases. Researching performance for HPDA. Python Word-Count.
08.11.21 - Databases and Data Warehouses – Slides – Exercise
- Exercise: Developing a database schema and SQL queries.
15.11.21 - Distributed Storage and Processing with Hadoop – Slides – Exercise
- Exercise: MapReduce processing with Python. Sketching the difference between SQL running via Hadoop (and Hive) vs. a traditional relational database vs. a data warehouse
22.11.21 - Dataflow Computation and Big Data SQL using Hive – Slides Hive – Slides Dataflow – Exercise
- Exercise: MapReduce via Streaming in Hadoop. Developing a dataflow system in Python.
29.11.21 - Columnar Access and Document Storage – Slides – Exercise
- Exercise: Managing data using MongoDB
06.12.21 - In-Memory Computation – Slides – Exercise
- Exercise: Data processing using Spark
13.12.21 - Stream Processing – Slides – Exercise
- Exercise: Streaming concepts and crime data
20.12.21 - The Apache Ecosystem and Beyond – Slides – This slide deck is optional and not subject to examination
- Exercise: None
10.01.22 - Designing Distributed Systems and Performance Modelling – Slides – Exercise
- During the exercise, we discuss any questions you may have.
- Exercise: RESTful Services. Consistent Hashing. Performance analysis of I/O mappings of use cases to systems.
17.01.22 - Visual Analytics and Large-Scale Data Analysis – Slides – Exercise
- Exercise: Developing a visualization using GoJS
24.01.22 - Data Storage – Slides – Exercise
- Exercise: NetCDF data model and benchmarking.
31.01.22 - A Data Lake Use Case for scientific research data management – Mark Greiner (MPI CEC) – Slides
- Exercise: discussion of last week's exercises
07.02.22 - Summary – Slides
- Exercise: Q&A Session
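
Several exercises in the agenda refer to word count and MapReduce processing in Python. The sketch below shows the general map, shuffle, and reduce pattern on a single machine. It is a simplified illustration and not the course's reference solution. The function names are hypothetical, and no Hadoop API is used.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every whitespace-separated word."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by their key."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(word: str, counts: List[int]) -> Tuple[str, int]:
    """Reduce: sum the counts collected for one word."""
    return word, sum(counts)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce sequentially over the input lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(word, counts) for word, counts in shuffle(pairs).items())


if __name__ == "__main__":
    docs = ["big data needs big systems", "data moves slowly"]
    print(word_count(docs))
```

In a distributed setting, the map and reduce functions would run on many nodes and the framework would perform the shuffle; here everything runs in one process to keep the structure visible.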

Example Scripts: https://github.com/JulianKunkel/hpda-samples
