Course: High-Performance Data Analytics
Data-driven science requires handling large volumes of data in a short amount of time. Executing workflows efficiently is challenging for users as well as for systems. This module introduces concepts, principles, tools, system architectures, techniques, and algorithms for large-scale data analytics using distributed and parallel computing. We will investigate the state of the art in processing data-intensive workloads with solutions from High-Performance Computing and Big Data Analytics.
Key information
Contact | Julian Kunkel, Jonathan Decker
Location | Virtual
Time | Monday 16:15-17:45 (lecture), Monday 12:15-13:45 (exercise, starts one week later)
Language | English
Module | Module B.Inf.1712: Vertiefung Hochleistungsrechnen, Module M.Inf.1236: High-Performance Data Analytics
SWS | 4
Credits | 6
Contact time | 56 hours
Independent study | 124 hours
Exam | Written exam: 10.02.2023, 14:00, Geo MN15 (be punctual!); 2nd exam: 05.04.2023, 14:00, Geo MN14
Please note that we plan to record the sessions (lectures and seminar talks), both to provide the recordings to other students via BBB and to publish and link them on YouTube for future terms. If you appear in any of the recordings via voice, camera, or screen share, we need your consent to publish the recordings. See also this slide.
Topics
The topics covered include:
- Challenges in high-performance data analytics
- Use-cases for large-scale data analytics
- Performance models for parallel systems and workload execution
- Data models to organize data and (No)SQL solutions for data management
- Industry-relevant processing models with tools such as Hadoop, Spark, and ParaView
- System architectures for processing large data volumes
- Relevant algorithms and data structures
- Visual Analytics
- Parallel and distributed file systems
Guest talks from academia and industry will be incorporated into the teaching to demonstrate the applicability of these topics.
Weekly laboratory practicals and tutorials will guide students in learning the concepts and tools. In the process, students will form a learning community and integrate peer learning into the practicals. Students will have opportunities to present their solutions to challenging tasks in class, developing presentation skills and gaining confidence in the topics.
Learning Objectives
- Assign big data challenges to a given use-case
- Outline use-case examples for high-performance data analytics
- Estimate performance and runtime for a given workload and system (see the sketch after this list)
- Create a suitable hardware configuration to execute a given workload within a deadline
- Construct suitable data models for a given use-case and discuss their pros and cons
- Discuss the rationales behind the design decisions for the tools
- Describe the concept of visual analytics and its potential in scientific workflows
- Compare the features and architectures of NoSQL solutions to the abstract concept of a parallel file system
- Appraise the requirements for designing system architectures for systems storing and processing data
- Apply distributed algorithms and data structures to a given problem instance and illustrate their processing steps in pseudocode
- Explain the importance of hardware characteristics when executing a given workload
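As a small illustration of the kind of performance and runtime estimate meant above, here is a back-of-the-envelope calculation in Python. All figures (data volume, node count, bandwidth, overhead) are assumptions chosen only for this sketch and are not course material.

```python
# Back-of-the-envelope runtime estimate for a bandwidth-bound, data-parallel scan.
# All numbers below are illustrative assumptions, not measured values.

data_volume_gb = 10_000          # total input data to scan (GB)
nodes = 16                       # number of compute nodes
read_bw_per_node_gbs = 2.0       # sustained read bandwidth per node (GB/s)
startup_overhead_s = 30          # job startup, scheduling, data distribution (s)

# Assuming the scan parallelises perfectly across nodes:
aggregate_bw = nodes * read_bw_per_node_gbs           # GB/s
runtime_s = startup_overhead_s + data_volume_gb / aggregate_bw

print(f"Estimated runtime: {runtime_s:.0f} s (~{runtime_s / 60:.1f} min)")
```

From such an estimate one can work backwards to a hardware configuration, for example by solving for the number of nodes needed to meet a deadline.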
Examination
Written (90 min) or oral (approx. 30 min), depending on the number of attendees.
See the learning objectives.
Agenda
- We will not have an exercise meeting in the first week!
- Exercise: Discussion of use cases covering business/industry and science. Sketching the analytics pipeline for a use case.
- Exercise: Developing data models for selected use cases. Researching performance for HPDA. Python word count (sketch below).
- Exercise: Developing a database schema and SQL queries (sketch below).
- Exercise: MapReduce processing with Python. Sketching the difference between SQL running via Hadoop (and Hive) vs. a traditional relational database vs. a data warehouse.
- Exercise: MapReduce via streaming in Hadoop (sketch below). Developing a dataflow system in Python.
- Exercise: Managing data using MongoDB (sketch below).
- Exercise: Data processing using Spark (sketch below).
- Exercise: Streaming concepts and crime data (sketch below).
- During the exercise, we discuss any questions you may have.
- Exercise: RESTful services. Consistent hashing (sketch below). Performance analysis of I/O mappings of use cases to systems.
- Exercise: Performance analysis (sketch below) and OpenAPI.
- Exercise: NetCDF data model and benchmarking (sketch below).
- 30.01.23 - The Apache Ecosystem and Beyond (this slide deck is optional and not subject to examination)
  - Slides
  - Exercise: Discussion of last week's exercises
- 06.02.23 - Summary
  - Slides
  - Exercise: Q&A session
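The Python word-count exercise can be approached with a few lines of standard-library code; the sketch below is one possible starting point, with a placeholder input file name.

```python
from collections import Counter

# Count word frequencies in a text file ("input.txt" is a placeholder name).
with open("input.txt", encoding="utf-8") as f:
    counts = Counter(word for line in f for word in line.lower().split())

# Print the ten most frequent words with their counts.
for word, count in counts.most_common(10):
    print(f"{word}\t{count}")
```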
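For the database-schema exercise, a minimal sketch using SQLite from Python; the table, column names, and sample rows are invented for illustration and are not the course's data set.

```python
import sqlite3

# In-memory database with a toy schema; all names and values are illustrative.
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE measurement (
                 station TEXT,
                 ts      TEXT,
                 temp    REAL)""")
con.executemany("INSERT INTO measurement VALUES (?, ?, ?)",
                [("GOE", "2023-01-30T12:00", 3.5),
                 ("GOE", "2023-01-30T13:00", 4.1),
                 ("HAN", "2023-01-30T12:00", 2.8)])

# Aggregate query: average temperature per station.
for row in con.execute("SELECT station, AVG(temp) FROM measurement GROUP BY station"):
    print(row)
```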
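For the MapReduce and Hadoop streaming exercises, a common pattern is a mapper and a reducer that read from stdin and write to stdout. The sketch below shows word count in that style; it is a generic illustration rather than the course's reference solution. Saved as, say, wordcount.py (a hypothetical file name), it can be tested locally with `cat input.txt | python3 wordcount.py map | sort | python3 wordcount.py reduce`.

```python
#!/usr/bin/env python3
# Word count in the Hadoop Streaming style: the mapper emits "word<TAB>1" lines,
# and the reducer sums the counts of keys that arrive already sorted.
import sys
from itertools import groupby

def mapper():
    for line in sys.stdin:
        for word in line.lower().split():
            print(f"{word}\t1")

def reducer():
    pairs = (line.rstrip("\n").split("\t") for line in sys.stdin)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        print(f"{word}\t{sum(int(count) for _, count in group)}")

if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()
```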
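For the MongoDB exercise, a minimal pymongo sketch; it assumes a local MongoDB instance on the default port, and the database, collection, and field names are made up for illustration.

```python
from pymongo import MongoClient

# Connect to a local MongoDB (assumes mongod is running on the default port).
client = MongoClient("mongodb://localhost:27017")
col = client["hpda"]["articles"]   # database/collection names are illustrative

# Insert a couple of toy documents.
col.insert_many([
    {"title": "Parallel file systems", "year": 2022, "tags": ["storage", "hpc"]},
    {"title": "Stream processing",     "year": 2023, "tags": ["streaming"]},
])

# Query: all documents tagged "hpc", newest first.
for doc in col.find({"tags": "hpc"}).sort("year", -1):
    print(doc["title"], doc["year"])
```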
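For the Spark exercise, a minimal PySpark word count using the RDD API; it assumes pyspark is installed and uses a placeholder input path.

```python
from pyspark.sql import SparkSession

# Classic word count with the RDD API; "input.txt" is a placeholder path.
spark = SparkSession.builder.appName("wordcount").getOrCreate()

counts = (spark.sparkContext.textFile("input.txt")
          .flatMap(lambda line: line.lower().split())
          .map(lambda word: (word, 1))
          .reduceByKey(lambda a, b: a + b))

# Show the ten most frequent words.
for word, count in counts.takeOrdered(10, key=lambda kv: -kv[1]):
    print(word, count)

spark.stop()
```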
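For the streaming exercise, one way to experiment with streaming concepts locally is a generator-based pipeline with a sliding window, as sketched below; the toy events are placeholders, not the actual crime data set.

```python
from collections import Counter, deque

def records():
    """Toy event stream; in the exercise this would come from a file or socket."""
    # (category, hour) pairs, purely illustrative placeholders.
    yield from [("THEFT", 3), ("ASSAULT", 7), ("THEFT", 12),
                ("BURGLARY", 14), ("THEFT", 21)]

def windowed_counts(stream, window=3):
    """Count categories over a sliding window of the last `window` events."""
    buf = deque(maxlen=window)
    for category, _hour in stream:
        buf.append(category)
        yield Counter(buf)

for counts in windowed_counts(records()):
    print(dict(counts))
```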
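For the consistent-hashing part of the exercise, a minimal hash-ring sketch; the node names and the replica count are arbitrary choices for illustration.

```python
import bisect
import hashlib

class HashRing:
    """Minimal consistent-hashing ring: maps keys to nodes so that adding or
    removing a node only remaps a small fraction of the keys."""

    def __init__(self, nodes, replicas=100):
        # Place `replicas` virtual points per node on the ring.
        self._ring = sorted((self._hash(f"{node}#{i}"), node)
                            for node in nodes for i in range(replicas))
        self._keys = [h for h, _ in self._ring]

    @staticmethod
    def _hash(value):
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def node_for(self, key):
        # First ring position clockwise of the key's hash (wrapping around).
        idx = bisect.bisect(self._keys, self._hash(key)) % len(self._keys)
        return self._ring[idx][1]

ring = HashRing(["node-a", "node-b", "node-c"])
print(ring.node_for("customer:42"))
```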
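For the performance-analysis exercise, a small helper for timing a piece of code repeatedly and reporting the spread; the workload being timed below is a placeholder.

```python
import time
import statistics

def time_it(fn, repeats=5):
    """Run a callable several times and return (median, best, worst) runtimes."""
    samples = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - t0)
    return statistics.median(samples), min(samples), max(samples)

# Placeholder workload: sum of squares over one million integers.
median, best, worst = time_it(lambda: sum(i * i for i in range(1_000_000)))
print(f"median {median:.3f}s  best {best:.3f}s  worst {worst:.3f}s")
```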
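For the NetCDF exercise, a minimal sketch using the netCDF4 package to write and read a small file; the file name, variable name, and values are made up for illustration.

```python
import numpy as np
from netCDF4 import Dataset

# Write a small NetCDF file; file, dimension, and variable names are illustrative.
with Dataset("demo.nc", "w") as nc:
    nc.createDimension("time", 4)
    temp = nc.createVariable("temperature", "f4", ("time",))
    temp.units = "degC"
    temp[:] = np.array([3.5, 4.1, 2.8, 5.0], dtype="f4")

# Read it back and inspect the data model (dimensions, variables, attributes).
with Dataset("demo.nc") as nc:
    var = nc.variables["temperature"]
    print(var.units, var[:].mean())
```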
Links
- Example Scripts: https://github.com/JulianKunkel/hpda-samples