
Course: High-Performance Data Analytics

Data-driven science requires handling large volumes of data in a short period of time. Executing efficient workflows is challenging for users and for systems alike. This module introduces concepts, principles, tools, system architectures, techniques, and algorithms for large-scale data analytics using distributed and parallel computing. We will investigate the state of the art in processing data workloads using solutions from High-Performance Computing and Big Data Analytics.

Contact Julian Kunkel
Location Virtual, meeting room
Time Monday 16-18 (lecture), Monday 12-14 (lunch exercise!)
Language English
Module B.Inf.1712: Vertiefung Hochleistungsrechnen (Advanced High-Performance Computing), M.Inf.1236: High-Performance Data Analytics
Credits 6
Contact time 56 hours
Independent study 124 hours

Topics cover:

  • Challenges in high-performance data analytics
  • Use-cases for large-scale data analytics
  • Performance models for parallel systems and workload execution
  • Data models to organize data and (No)SQL solutions for data management
  • Industry-relevant processing models with tools like Hadoop, Spark, and ParaView
  • System architectures for processing large data volumes
  • Relevant algorithms and data structures
  • Visual Analytics
  • Parallel and distributed file systems

Guest talks from academia and industry will be incorporated into the teaching to demonstrate the applicability of these topics.

Weekly laboratory practicals and tutorials will guide students to learn the concepts and tools. In the process of learning, students will form a learning community and integrate peer learning into the practicals. Students will have opportunities to present their solutions to the challenging tasks in the class. Students will develop presentation skills and gain confidence in the topics.

Learning objectives:

  • Assign big data challenges to a given use-case
  • Outline use-cases for high-performance data analytics
  • Estimate performance and runtime for a given workload and system
  • Create a suitable hardware configuration to execute a given workload within a deadline
  • Construct suitable data models for a given use-case and discuss their pros and cons
  • Discuss the rationale behind the design decisions of the tools covered in the course
  • Describe the concept of visual analytics and its potential in scientific workflows
  • Compare the features and architectures of NoSQL solutions to the abstract concept of a parallel file system
  • Appraise the requirements for designing system architectures for systems storing and processing data
  • Apply distributed algorithms and data structures to a given problem instance and illustrate their processing steps
  • Explain the importance of hardware characteristics when executing a given workload

Examination: written (90 min.) or oral (approx. 30 min.)

See the learning objectives.

  • 25.10.21 - Lecture Overview. Use Cases.
    • Exercise: There is no exercise today!
  • 01.11.21 - System Architectures and Distributed Algorithms
    • Exercise: Discussion of use cases covering business/industry and science. Sketching the analytics pipeline for a use case.
  • 08.11.21 - Data Models and Data Processing Strategies
    • Exercise: Sketching system architectures and the execution of distributed algorithms.
  • 15.11.21 - Databases and Data Warehouses
    • Exercise: Developing data models for selected use cases. Sketching the processing pipeline.
  • 22.11.21 - Distributed Processing (with Hadoop)
    • Exercise: Developing a database schema and SQL queries.
  • 29.11.21 - Designing Distributed Systems and Performance Modelling
    • Exercise: Data processing with Hadoop.
  • 06.12.21 - Dataflow Computation
    • Exercise: Performance analysis of scenarios. Analysing mappings of use cases to systems.
  • 13.12.21 - Columnar Access and Document Storage
    • Exercise: Developing a dataflow system.
  • 20.12.21 - In-Memory Computation
    • Exercise: Processing data using HBase and MongoDB
  • 10.01.22 - Stream Processing
    • Exercise: Data processing using Spark
  • 17.01.22 - Visual Analytics and Large-Scale Data Analysis
    • Exercise: Sketching stream workflows for use cases
  • 24.01.22 - Storage Systems in Cloud and HPC
    • Exercise: Developing a visualization using GoJS
  • 31.01.22 - INVITED TALK – TBA
    • Exercise: Performance analysis of storage solutions
  • 07.02.22 - Summary
    • Exercise: Q&A Session
  • Last modified: 2021-10-03 20:46
  • by Julian Kunkel