Practical: High-Performance Computing System Administration

High-Performance Computing System Administration is essential for managing HPC resources, not only as a user but also as a cluster administrator. In this practical course, you receive an introduction to the basics of Linux and to using HPC resources in two sessions. At the end of these sessions, you will be assigned a topic: a tool related to HPC system administration, which you will test and evaluate. After the end of the term, a one-week block course will go into more depth on HPC system administration. At the end of the semester, you will hand in a report describing your evaluation of your assigned topic.

Key information

Contact		Julian Kunkel, Jonathan Decker
Location		Virtual Support Room
Time		26.10.22 14:15-17:45, 02.11.22 14:15-17:45, 20-24.02.23 5-day block course
Language		English
Module		M.Inf.1831: High-Performance Computing System Administration
SWS		4
Credits		5,6(,9) (depending on the course)
Contact time		up to 84 hours (63 full hours), depending on the course
Independent study		up to 186 hours

Please note that we plan to record sessions (lectures and seminar talks), both to provide the recordings to other students via BBB and to publish and link them on YouTube for future terms. If you appear in any of the recordings via voice, camera or screen share, we need your consent to publish them. See also this slide.

Required Prior Knowledge

- No prior skills or knowledge are required
- An understanding of Linux basics, prior experience with Linux, and the ability to use a Bash shell are beneficial
- We will provide a short crash course at the beginning of the course and link supplementary training material

Learning Objectives

- Discuss theoretical facts related to networking, compute and storage resources
- Integrate cluster hardware consisting of multiple compute and storage nodes into a “supercomputer”
- Configure system services that allow the efficient management of the cluster hardware and software, including network services such as DHCP, DNS, NFS, IPMI, SSHD
- Install software and provide it to multiple users
- Compile end-user applications and execute them on multiple nodes
- Analyze system and application performance using benchmarks and tools
- Formulate security policies and good practices for administrators
- Apply tools for hardening the system such as firewalls and intrusion detection
- Describe and document the system configuration

Topics for Practical Works

- Intrusion detection tools for HPC
- Encryption tools
- Image Management and network booting with Warewulf
- Software Management with modules/spack
- Resource Management with SLURM
- Managing object storage
- Managing cluster file systems in user space (GlusterFS, FUSE, SeaweedFS)
- File system management (NFSv4, Ceph, BeeGFS)
- Performance analysis tools
- Monitoring system performance
- Application and system benchmarks
- Virtualization tools for HPC (e.g., Charliecloud, Singularity, Shifter)
- Scalable databases with e.g., Elasticsearch, Postgres
- Kernel compilation and configuration
- Security infrastructures and intrusion systems
- Deep Packet Analysis and filtering
- Berkeley Packet Filters (eBPF)
- Firewalls
- Kernel splicing
- Scalable software management and distribution for Python
- Forensic tools
- Cluster-wide User/Group management (e.g., LDAP)
- Scalable logging and log-file analysis

Agenda

Block Sessions 2022-10-26

26.10.22 14:15 - 17:45
- 14:15 - Welcome/Structure of the Course – Julian Kunkel slides
  - Forming support groups
- 14:30 Linux Crash Course – Jonathan Decker preparation exercise sheet slides exercise sheet
  - Command Line
  - Some basic commands
  - Remote access to the Scientific Compute Cluster
- 16:00 break
- 16:15 Linux Exercise – Jonathan Decker
- 16:45 First steps running applications on the cluster using Slurm – Ruben Kellner slides
  - Running applications on multiple nodes using srun
  - Getting an overview of the available hardware (documentation, sinfo)
  - Outlook: running a parallel program and measuring different types of applications
- 17:15 SLURM Exercise exercise
- 17:30 Exercise - Homework – Jonathan Decker homework primes.c
  - Virtual Linux machine setup
  - Assessing the performance of running applications
02.11.22 14:15 - 17:45
- 14:15 Homework discussion – Jonathan Decker
- 14:35 Introduction to Git – Christian Köhler slides
  - 15:00 Exercise for Git exercise
- 15:20 break
- 15:30 Compilation of applications via cmake, Autotools, make – Trevor Khwam slides
  - Exercise for cmake, Autotools, make exercise
- 16:10 Software management with Spack – Trevor Khwam slides
- 16:30 break
- 16:45 Running containers with Singularity – Azat Khuziyakhmetov slides
- 17:15 Assignment information and topics – Julian Kunkel, Jonathan Decker slides
You work on your topic independently, with occasional meetings with your supervisor.
- We encourage you to collaborate in teams on your independent topics.
20-24.02.23 4.5-day block course 9:00 - 18:00
- Schedule to be announced
31.03.23 Deadline for the submission of the report

Block Seminar 2023-02-23

This part is attended by BSc/MSc students and GWDG academy participants

Note: There are only breaks for lecture slots in the schedule. You can take a break during exercises as necessary. Preparation sheets: Preparation Configure Network

Monday 20.02.2023

09:00 - 10:00 Welcome, Organization of the block course – Julian Kunkel Slides – Exercise 1
- Agenda of the week
- Format of the “group work”
- Exercise (10 min): Introduce yourself in the “learning groups”
- Tutorial (10 min): Demo; setting up cloud resources from a fresh account
- Exercise (20 min): Is your cloud setup working?
- Plenary (10 min): Discussion of the format, Q&A
10:00 - 12:00 Cluster Management – Hendrik Nolte Slides Exercise
- “How to boot a thousand nodes”
- Lecture (20 min): Motivation, components of cluster management (DNS, DHCP, PXE boot process, images, resource management, monitoring, hardware components)
- Management Demo (this is what it is supposed to look like in the end)
- Exercise (30 min): Describing the responsibility of Warewulf components and the boot process
  - “Role playing”
- Lecture: Technical details and administration of dnsmasq, DHCP, and investigating log files
- Exercise: Warewulf hands-on
12:00 - 12:45 Lunch Break
12:45 - 14:45 Cluster Management with Warewulf – Hendrik Nolte Slides Exercise
- Lecture: Warewulf configuration
- Demo: Image creation and deployment
- Exercise: Image creation with Warewulf
- Lecture: Advanced topics, system and runtime overlays, kernel management
- Exercise: System setup
14:45 - 15:00 Break
15:00 - 16:00 Network File System Setup – Ruben Kellner Slides Exercise
- Lecture(15 min): NFS Introduction
- Exercise(30 min): Setup of a basic NFS Server and client
- Plenary Discussion(15 min)
16:00 - 18:00 Slurm administration – Timon Vogt Slides Exercise
- Slurm installation, basic configuration, testing
- Lecture: introduction to Slurm
- Tutorial: server installation, basic configuration and testing (flexible break)
- Exercise: adjustments of the configuration, integration of the cluster nodes, testing

Tuesday 21.02.2023

09:00 - 10:00 Best practices for administrators – Vanessa End Slides Exercise
- Lecture (20 min): processes and management, documentation, frameworks: ITIL, PRINCE2
- Exercise (20 min): Discussion of the best practices, searching for related work, critical discussion of your own experience with the setup of Warewulf and Slurm
- Plenary discussion (20 min)
10:00 - 12:00 Setting Up Containers – Freja Nordsiek Slides Tutorial 1 Exercise 1 Tutorial 2 Exercise 2
- Lecture (15 min): Introduction to containers and their management
- Demo + Q&A (10 min): Outlook - the scope of container management using Docker/Singularity → what you will learn by the end of the session
- Lecture (15 min): Setting up Podman and testing it
- Exercise (30 min)
- Lecture (15 min): Installing and configuring Singularity on the cluster from source
- Exercise (30 min)
- Plenary discussion (15 min)
12:00 - 12:45 Lunch Break
12:45 - 14:45 Monitoring in HPC – Marcus Merz Slides Tutorial
- Lecture(15 min): Monitoring introduction and software stacks
- Lecture(5 min): InfluxDB
- Exercise(20 min): Installing InfluxDB
- Lecture(5 min): Telegraf
- Exercise(20 min): Installing Telegraf
- Lecture(5 min): Grafana
- Exercise(35 min): Installing Grafana and setting up a dashboard for an example application (Slurm)
- Plenary discussion (15 min)
14:45 - 15:00 Break
15:00 - 16:00 Service Catalogue – Marcus Merz Slides Exercise Exercise Solution
- Lecture(15 min): Service catalogue introduction, privacy concerns and risk management
- Exercise(10 min): Describing an application for a service catalogue (Telegraf, Influx, Slurm, …)
- Plenary discussion (5 min)
16:00 - 17:30 Security and security policies – Trevor Tabougua Slides Exercise Exercise Solution
- Lecture(30 min): Security introduction + Demo
  - Discussing an existing service and its security implications
- Exercise(15 min): Theoretical investigation of an existing service (the one from before)
- Exercise(30 min): Describe a new service and its security implications and add it to a service catalogue
- Plenary discussion (15 min)
17:30 - 18:00 Intelligent Platform Management Interface (IPMI) – Nils Kanning Slides
- Lecture(15 min): IPMI introduction
- Plenary discussion (15 min)

Wednesday 22.02.2023

RzGö live hardware demonstration and Hands-on. If you are a remote participant, we request that you revisit the previous material and prepare questions for Q&A sessions.

On-site participation is limited to 20 participants.

09:00 Meet in the lobby of GWDG, Burckhardtweg 4, 37077 Göttingen (bus stop Burckhardtweg)
09:15 Network interconnects – Sebastian Krey Slides Whiteboard Exercise Hardware
- Lecture(20 min): HPC Interconnects, Fabric Manager, RDMA, VLAN, LACP
- Exercise(20 min): Cable planning
10:15 Group 1 Introduction to our onsite hardware – Sebastian Krey
10:30-14:00 Group 1 Hands-on Hardware Exercises
12:00 - 12:45 Lunch Break
13:00-14:00 Group 2 Tour in the data center
14:00-15:00 Group 1 Tour in the data center
14:15 Group 2 Introduction to our onsite hardware – Sebastian Krey
14:30-18:00 Group 2 Hands-on Hardware Exercises
Setting up hardware
- Plug in a small cluster
- BIOS settings
- Installation of Warewulf
- Mounting of InfiniBand cards
- Configuration of InfiniBand
- RDMA performance test
18:00

Thursday 23.02.2023

09:00 - 10:30 Tracking Issues and Collaborative Work with GitLab – Martin Paleico Slides Exercise
- Lecture(10 min): GitLab introduction
- Exercise(25 min): Installing GitLab Community Edition
- Lecture(15 min): Best practices for using Git for issue tracking and collaboration
  - Examples from GWDG
- Exercise(25 min): Discussing practices for issue tracking
- Plenary Discussion(15 min)
10:30 - 12:00 Ticket Systems – Stefanie Mühlhausen + Sadegh Keshtkar Slides Tutorial Exercise
- Lecture(10 min): Introduction to ticketing systems and ticket workflows
- Tutorial(10 min): Demonstration of features
- Exercise(30 min): Install Znuny
- Plenary Discussion(10 min)
- Exercise (20 min): Testing out Znuny
- Plenary Discussion(10 min)
12:00 - 12:45 Lunch Break
12:45 - 13:15 WEKA FS – Christoph Hottenroth Slides
- Lecture(10 min): Introduction to WEKA FS
- Demo(10 min): Deployment and usage
- Plenary Discussion(10 min)
13:15 - 14:30 Provisioning of an Environment for Parallel Computing – Artur Wachtel Slides Exercise
- Lecture(15 min): Providing a joint software environment with environment modules and Spack
- Exercise(45 min): Installing MPI and GROMACS and providing module descriptions (to be tested by other group members)
- Plenary Discussion(15 min)
14:30 - 15:00 ClusterShell – Artur Wachtel Slides Exercise
- Lecture (10 min): Introduction
- Exercise (15 min): Installation and testing
- Plenary Discussion(5 min)
15:00 - 16:45 Student Presentations
- Aaron Kurda – Resource Management with SLURM
- Sonal Lakhotia – Encryption Tools Slides
- Dominik Mann – Forensic Tools Slides
- Zoya Masih – On demand file systems with BeeGFS
- Matthias Mildenberger – Security infrastructures and intrusion systems Slides
16:45 - 17:00 Break
17:00 - 18:00 Student Presentations
- David Nelles – Resource Management with SLURM Slides
- Winfired Oed – Virtualization tools for HPC (e.g., Charliecloud, Singularity, Shifter) Slides
- Lars Quentin – Evaluation of Time-Series Databases Slides

Friday 24.02.2023

09:00 - 10:00 Benchmarking – Aasish Kumar Sharma Slides Tutorial Exercise Code
- Lecture(35 min): Benchmarking
- Exercise(15 min): Real system benchmarking on your VMs
- Plenary Discussion(10 min)
10:00 - 11:30 Performance Estimation – Julian Kunkel Slides Exercise
- Lecture(20 min): Hardware characteristics and performance estimates in distributed systems
- Exercise(35 min): Theoretical performance assessment
- Plenary Discussion(35 min)
11:30 - 12:00 Certificates and PKI – Jonathan Decker Slides Exercise
- Lecture(25 min): Introduction to Certificates and PKI
- Plenary Discussion(5 min)
12:00 - 12:45 Lunch Break
12:45 - 13:45 Certificates and PKI – Jonathan Decker
- Lecture(15 min): Certificates and PKI - In Practice
- Exercise(30 min): Create, inspect and install certificates into a web server
- Plenary Discussion(15 min)
13:45 - 15:00 Firewalls – Julian Kunkel – Slides Exercise NFT Ruleset
- Lecture(15 min): Introduction to firewalls
- Exercise(45 min): Exploring firewall rules, port scanning with nmap, internet access for the nodes using NAT
- Plenary Discussion(15 min)
15:00 - 16:45 Student Presentations
- Johannes Richter – Application and system benchmarks Slides
- Julius Sieg – Encryption Tools
- Lukas Steinegger – Monitoring System Performance Slides
- Linus Weber – Scalable logging and log-file analysis Slides
- Silin Zhao – Application and system benchmarks Slides
16:45 - 17:00 Break
17:00 - 18:00 General Q&A session and organisational information for students

Examination

The exam is conducted through a report. The report should cover the evaluation of the assigned tool. The report should describe:

- What the tool is, what it is used for
- How the tool was set up
- How you evaluated it
- The results of your evaluation
- Discussion of problems and potential of the tool
- Conclusion

We recommend using the LaTeX templates we provide here: https://hps.vi4io.org/teaching/ressources/start#templates

Topic Distribution

Encryption Tools¹⁾ – Sonal Lakhotia Report
Encryption Tools²⁾ – Julius Sieg
Forensic Tools³⁾ – Dominik Mann Report Complementary
Security infrastructures and intrusion systems⁴⁾ – Matthias Mildenberger Report
Scalable logging and log-file analysis⁵⁾ – Linus Weber Report
Scalable databases with e.g., Elasticsearch, Postgres⁶⁾ – Jakob Schmitz
Virtualization tools for HPC (e.g., Charliecloud, Singularity, Shifter)⁷⁾ – Winfired Oed Report
Virtualization tools for HPC (e.g., Charliecloud, Singularity, Shifter)⁸⁾ – Frederik Hennecke
Resource Management with SLURM⁹⁾ – Aaron Kurda
Resource Management with SLURM¹⁰⁾ – David Nelles Report
On demand file systems with BeeGFS¹¹⁾ – Zoya Masih Report
Application and system benchmarks¹²⁾ – Silin Zhao Report
Application and system benchmarks¹³⁾ – Johannes Richter Report
Evaluation of Time-Series Databases¹⁴⁾ – Lars Quentin
Monitoring System Performance¹⁵⁾ – Lukas Steinegger Report Code
Managing cluster file systems in user space¹⁶⁾ – Tim Dettmar
Performance analysis tools¹⁷⁾ – Nicolas Alqas Alyas
Performance analysis/measurements with Cassandra and HBase¹⁸⁾ – Abdul Rafay Report Code

¹⁾, ²⁾ Supervisor: Hendrik Nolte
³⁾, ⁴⁾ Supervisor: Artur Wachtel
⁵⁾ Supervisor: Christoph Hottenroth
⁶⁾ Supervisor: Zoya Masih
⁷⁾, ⁸⁾ Supervisor: Azat Khuziyakhmetov
⁹⁾, ¹⁰⁾ Supervisor: Vanessa End
¹¹⁾, ¹⁸⁾ Supervisor: Julian Kunkel
¹²⁾, ¹³⁾, ¹⁴⁾, ¹⁵⁾ Supervisor: Marcus Merz
¹⁶⁾ Supervisor: Sebastian Krey
¹⁷⁾ Supervisor: Jack Ogaja
