Data-driven science requires handling large volumes of data quickly. Executing workflows efficiently is challenging for users as well as for systems. This
module introduces concepts, principles, tools, system architectures, techniques, and
algorithms for large-scale data analytics using distributed and parallel computing.
We will investigate the state of the art in processing data-intensive workloads using solutions from
High-Performance Computing and Big Data Analytics.
| Contact | Julian Kunkel, Jonathan Decker |
| Location | Virtual |
| Time | Monday 16:15-17:45 (lecture), Monday 12:15-13:45 (lunch exercise, starts 1 week later) |
| Language | English |
| Module | Module B.Inf.1712: Vertiefung Hochleistungsrechnen, Module M.Inf.1236: High-Performance Data Analytics |
| SWS | 4 |
| Credits | 6 |
| Contact time | 56 hours |
| Independent study | 124 hours |
| Exam | Written exam date: TBD, 2nd exam: TBD, in TBD |
Please note that we plan to record sessions (lectures and seminar talks) with the intent of providing the recordings
via BBB to other students, but also of publishing and linking the recordings on YouTube for future terms.
If you appear in any of the recordings via voice, camera, or screen share, we need your consent to publish the recordings.
See also this slide.
Topics
Topics cover:
Challenges in high-performance data analytics
Use-cases for large-scale data analytics
Performance models for parallel systems and workload execution
Data models to organize data and (No)SQL solutions for data management
Industry-relevant processing models with tools like Hadoop, Spark, and ParaView
System architectures for processing large data volumes
Relevant algorithms and data structures
Visual Analytics
Parallel and distributed file systems
Weekly laboratory practicals and tutorials will guide students in learning the concepts
and tools. In the process, students will form a learning community and
integrate peer learning into the practicals. Students will have opportunities to present
their solutions to challenging tasks in class, developing presentation
skills and gaining confidence in the topics.
Learning Objectives
Assign big data challenges to a given use-case
Outline use-case examples for high-performance data analytics
Estimate performance and runtime for a given workload and system
Create a suitable hardware configuration to execute a given workload within a deadline
Construct suitable data models for a given use-case and discuss their pros/cons
Discuss the rationales behind the design decisions for the tools
Describe the concept of visual analytics and its potential in scientific workflows
Compare the features and architectures of NoSQL solutions to the abstract concept of a parallel file system
Appraise the requirements for designing system architectures for systems storing and processing data
Apply distributed algorithms and data structures to a given problem instance and illustrate their processing steps in pseudocode
Explain the importance of hardware characteristics when executing a given workload
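To illustrate the performance-estimation objectives above, here is a back-of-the-envelope sketch. All figures (data volume, bandwidth, node count, startup cost) are hypothetical examples chosen for illustration, not measurements of any real system:

```python
# Back-of-the-envelope runtime estimate for a data-parallel workload.
# All numbers below are hypothetical examples.

data_volume = 10 * 10**12         # 10 TB to process, in bytes
per_node_bandwidth = 500 * 10**6  # 500 MB/s sustained read per node
nodes = 20                        # nodes reading in parallel
startup_overhead = 30             # seconds of fixed job-startup cost

# Assuming ideal scaling: I/O time shrinks linearly with node count,
# while the startup overhead stays constant (cf. Amdahl's law).
io_time = data_volume / (nodes * per_node_bandwidth)
runtime = startup_overhead + io_time
print(f"Estimated runtime: {runtime:.0f} s")  # prints "Estimated runtime: 1030 s"
```

Such a model gives a lower bound under ideal conditions; contention, stragglers, and metadata overhead typically push real runtimes higher.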
Examination
Written (90 Min.) or oral (ca. 30 Min.) → depends on the number of attendees (typically written).
See the learning objectives.
Example Exam:
Agenda
11.11.2026 - Distributed Storage and Processing with Hadoop
Slides Exercises
16.12.2026 - Designing Distributed Systems and Performance Modelling
Slides Exercises
During the exercise, we discuss any questions you may have.
Exercise: RESTful services, consistent hashing, and performance analysis of I/O mappings of use cases to systems.
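One of the exercise topics above, consistent hashing, can be sketched in a few lines. This is an illustrative toy implementation (node names and the virtual-node count are made up), not the reference solution for the exercise:

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    """Map a string to a point on the hash ring."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class ConsistentHashRing:
    """Minimal consistent-hash ring with virtual nodes."""
    def __init__(self, nodes, vnodes=100):
        self.vnodes = vnodes
        self._ring = []  # sorted list of (point, node) pairs
        for node in nodes:
            self.add_node(node)

    def add_node(self, node):
        # Each physical node owns many points ("virtual nodes") on the
        # ring so that keys spread evenly across nodes.
        for i in range(self.vnodes):
            bisect.insort(self._ring, (_hash(f"{node}#{i}"), node))

    def get_node(self, key):
        # A key belongs to the first ring point clockwise from its hash.
        idx = bisect.bisect(self._ring, (_hash(key), "")) % len(self._ring)
        return self._ring[idx][1]

ring = ConsistentHashRing(["server-a", "server-b", "server-c"])
node = ring.get_node("user42")  # repeated lookups return the same server
```

The key property: when a node is added or removed, only the keys adjacent to that node's ring points move, instead of nearly all keys as with naive modulo hashing.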
06.01.2027 - Visual Analytics and Large-Scale Data Analysis
Slides Exercises
20.01.2027 - The Apache Ecosystem and Beyond
This slide deck is optional and not subject to examination.
Slides
Links