Practical: High-Performance Computing System Administration

High-Performance Computing System Administration is essential for managing HPC resources not only as a user but also as a cluster administrator. As part of this practical course, you will take part in a hands-on one-week block course, which introduces the basics of Linux and of using HPC resources and then goes into depth on HPC system administration. At the end of the block course you will choose a tool related to HPC system administration as your topic, evaluate that tool, and hand in a report at the end of the semester. A supervisor who is an expert on the assigned tool will be assigned to guide you.

Key information

Contact		Julian Kunkel, Jonathan Decker
Location		Virtual Main Room Support Room
Time		16.10.23-20.10.23 5-day block course
Language		English
Module		M.Inf.1831: High-Performance Computing System Administration
SWS		4
Credits		6
Contact time		up to 84 hours (63 full hours), depending on the course
Independent study		up to 186 hours

Please note that we plan to record sessions (lectures and seminar talks) in order to provide the recordings via BBB to other students and to publish and link them on YouTube for future terms. If you appear in any of the recordings via voice, camera or screen share, we need your consent to publish the recordings. See also this Slide.

Required Prior Knowledge

No prior skills or knowledge are required
An understanding of Linux basics, prior experience with Linux, and the ability to operate a Bash shell are beneficial
We will provide a short crash course at the beginning of the course and link supplementary training material

Learning Objectives

Discuss theoretical facts related to networking, compute and storage resources
Integrate cluster hardware consisting of multiple compute and storage nodes into a “supercomputer”
Configure system services that allow the efficient management of the cluster hardware and software including network services such as DHCP, DNS, NFS, IPMI, SSHD.
Install software and provide it to multiple users
Compile end-user applications and execute them on multiple nodes
Analyze system and application performance using benchmarks and tools
Formulate security policies and good practice for administrators
Apply tools for hardening the system such as firewalls and intrusion detection
Describe and document the system configuration

Topics for Practical Works

Intrusion detection tools for HPC
Encryption tools
Image Management and network booting with Warewulf
Software Management with modules/spack
Resource Management with SLURM
Managing object storage
Managing cluster file systems in user space (GlusterFS, FUSE, SeaWeedFS)
File system management (NFSv4, Ceph, BeeGFS)
Performance analysis tools
Monitoring system performance
Application and system benchmarks
Virtualization tools for HPC (e.g., Charliecloud, Singularity, Shifter)
Scalable databases with e.g., Elasticsearch, Postgres
Kernel compilation and configuration
Security infrastructures and intrusion systems
Deep Packet Analysis and filtering
Berkeley Packet Filters (eBPF)
Firewalls
Kernel splicing
Scalable software management and distribution for Python
Forensic tools
Cluster wide User/Group management (e.g. LDAP)
Scalable logging and log-file analysis
Web Hosting Software Stacks (e.g. LAMP)

Agenda

Block Seminar 16.10.23-20.10.23

This part is attended by BSc/MSc students and GWDG academy participants

Note: There are only breaks for lecture slots in the schedule. You can take a break during exercises as necessary. Preparation sheets: Preparation

Monday 16.10.2023

09:00 - 10:00 Welcome, Organization of the block course – Julian Kunkel Slides Exercise
- Agenda of the week
- Forming support groups
- Format of the “group work”
- Exercise (10 min): Introduce yourself in the “learning groups”
- Tutorial (10 min): Demo; setting up cloud resources from a fresh account
- Exercise (20 min): Is your cloud setup working?
- Plenary (10 min): Discussion of the format, Q&A
10:00 - 11:30 Linux Crash Course – Ruben Kellner slides
- Command Line
- Some basic commands
- Remote access to the Scientific Compute Cluster
11:30 - 12:00 Linux Exercise – Ruben Kellner exercise
12:00 - 12:45 Lunch Break
12:45 - 13:15 First steps running applications on the cluster using Slurm – Patrick Höhn slides
- Running applications on multiple nodes using srun
- Getting an overview of the available hardware (documentation, sinfo)
- Outlook of running a parallel program, measuring different types of applications
13:15 - 13:45 Slurm Exercise exercise
13:45 - 14:00 break
14:00 - 14:30 Introduction to Git – Christian Köhler slides
14:30 - 15:00 Git Exercise exercise
15:00 - 15:40 Compilation of applications via cmake, Autotools, make – Trevor Khwam slides
- Exercise for cmake, Autotools, make exercise
15:40 - 16:00 Software management with Spack – Trevor Khwam slides
16:00 - 16:15 break
16:15 - 16:45 Running containers with Singularity – Azat Khuziyakhmetov slides
16:45 - 18:00 Exercise - Virtual Machine and Slurm – Jonathan Decker homework primes.zip
- Virtual Linux machine setup
- Assessing the performance of running applications
- Complete other unfinished exercises

Tuesday 17.10.2023

09:00 - 10:00 Firewalls – Julian Kunkel – Slides Exercise NFT Ruleset
- Lecture(15 min): Introduction to firewalls
- Exercise(35 min): Exploring firewall rules, port scanning with nmap, internet access for the nodes using NAT
- Plenary Discussion(10 min)
10:00 - 12:00 Certificates and PKI – Jonathan Decker Slides Exercise
- Lecture(45 min): Introduction to certificates and PKI
- Exercise(55 min): Create, inspect and install certificates into a web server
- Plenary Discussion(20 min)
12:00 - 12:45 Lunch Break
12:45 - 14:45 Cluster Management (with Warewulf) – Niklas Bölter Slides Exercise Slides Exercise
- “How to boot a thousand nodes”
- Lecture (20 min): Motivation, components of cluster management (DNS, DHCP, PXE-Boot process, images, resource management, monitoring, hardware-components)
- Management Demo
- Exercise (30 min): Describing the responsibility of Warewulf components and the boot process
- Lecture: Technical details and administration of dnsmasq, DHCP, and investigating logfiles
- Exercise 1
- Lecture: Warewulf configuration
- Demo: Image creation and deployment
- Exercise 2
14:45 - 15:00 Break
15:00 - 16:00 Network File System Setup – Patrick Höhn Slides Exercise
- Lecture(15 min): NFS Introduction
- Exercise(30 min): Setup of a basic NFS Server and client
- Plenary Discussion(15 min)
16:00 - 18:00 Slurm administration – Timon Vogt Slides Exercise Config
- Slurm installation, basic configuration, testing
- Lecture: introduction to Slurm
- Tutorial: server installation, basic configuration and testing (flexible break)
- Exercise: adjustments of the configuration, integration of the cluster nodes, testing

Wednesday 18.10.2023

09:00 - 11:00 Setting Up Containers – Freja Nordsiek Slides Tutorial 1 Exercise 1 Tutorial 2 Exercise 2
- Lecture (15 min): Introduction to containers and their management
- Demo + Q&A (10 min): Outlook - the scope of container management in Docker and Singularity ecosystems
- Lecture (15 min): Setting up Podman and testing it
- Exercise (30 min)
- Lecture (15 min): Installing and configuring Apptainer and testing it
- Exercise (30 min)
- Plenary discussion (15 min)
11:00 - 12:00 Best practices for administrators – Stefanie Mühlhausen Slides Exercise
- Lecture (20 min): processes and management, documentation, frameworks: ITIL, PRINCE2
- Exercise (20 min): Discussion of best practices, searching for related work, critical discussion of your own experience with the setup of Warewulf and Slurm
- Plenary discussion (20 min)
12:00 - 12:45 Lunch Break
12:45 - 14:45 Monitoring in HPC – Marcus Merz Slides Tutorial
- Lecture(15 min): Monitoring introduction and software stacks
- Lecture(5 min): InfluxDB
- Exercise(20 min): Installing InfluxDB
- Lecture(5 min): Telegraf
- Exercise(20 min): Installing Telegraf
- Lecture(5 min): Grafana
- Exercise(35 min): Installing Grafana and setting up a dashboard for an example application (Slurm)
- Plenary discussion (15 min)
14:45 - 15:00 Break
15:00 - 16:00 Service Catalogue – Marcus Merz Slides Exercise Exercise Solution
- Lecture(15 min): Service catalogue introduction, privacy concerns and risk management
- Exercise(10 min): Describing an application for a service catalogue (Telegraf, Influx, Slurm, …)
- Plenary discussion (5 min)
16:00 - 17:30 Security and security policies – Trevor Tabougua Slides Exercise Exercise Solution
- Lecture(30 min): Security introduction + Demo
  - Discussing an existing service and its security implications
- Exercise(15 min): Theoretical investigation of an existing service (the one from before)
- Exercise(30 min): Describe a new service and its security implications and add it to a service catalogue
- Plenary discussion (15 min)
17:30 - 18:00 Intelligent Platform Management Interface (IPMI) – Nils Kanning Slides
- Lecture(15 min): IPMI introduction
- Plenary discussion (15 min)

Thursday 19.10.2023

RzGö live hardware demonstration and Hands-on. If you are a remote participant, we request that you revisit the previous material and prepare questions for Q&A sessions.

On-site is limited to up to 20 participants.

Group 1
- 09:00 Group 1 Meet at GWDG Burckhardtweg 4, 37077 Göttingen in the lobby - (Bus stop Burckhardtweg)
- 09:15 Group 1 Network interconnects – Sebastian Krey Slides Exercise Hardware
- Lecture(20 min): HPC Interconnects, Fabric Manager, RDMA, VLAN, LACP
- Exercise(20 min): Cable planning
- 10:15 Group 1 Introduction to our onsite hardware – Sebastian Krey
  Smartboard Group 1
- 10:30-13:00 Group 1 Hands-on Hardware Exercises
- 13:00-14:00 Group 1 Tour in the data center
Group 2
- 11:45 Group 2 Meet at GWDG Burckhardtweg 4, 37077 Göttingen in the lobby - (Bus stop Burckhardtweg)
- 12:00-13:00 Group 2 Tour in the data center
- 13:30 Group 2 Network interconnects – Sebastian Krey Slides Exercise Hardware
- Lecture(20 min): HPC Interconnects, Fabric Manager, RDMA, VLAN, LACP
- Exercise(20 min): Cable planning
- 14:30 Group 2 Introduction to our onsite hardware – Sebastian Krey
  Smartboard Group 2
- 14:45-17:30 Group 2 Hands-on Hardware Exercises
Setting up hardware
- Plug in a small cluster
- BIOS settings
- Installation of Warewulf
- Mounting of InfiniBand cards
- Configuration of InfiniBand
- RDMA performance test

Friday 20.10.2023

09:00 - 10:30 Tracking Issues and Collaborative Work with GitLab – Martin Paleico Slides Exercise
- Lecture(10 min): GitLab introduction
- Exercise(25 min): Installing GitLab Community Edition
- Lecture(15 min): Best practices for using Git for issue tracking and collaboration
  - Examples from GWDG
- Exercise(25 min): Discussing practices for issue tracking
- Plenary Discussion(15 min)
10:30 - 12:00 Ticket Systems – Sadegh Keshtkar Slides Tutorial Exercise
- Lecture(10 min): Introduction to ticketing systems and ticket workflows
- Tutorial(10 min): Demonstration of features
- Exercise(30 min): Install Znuny
- Plenary Discussion(10 min)
- Exercise (20 min): Testing out Znuny
- Plenary Discussion(10 min)
12:00 - 12:45 Lunch Break
12:45 - 13:15 WEKA FS – Christoph Hottenroth Slides
- Lecture(10 min): Introduction to WEKA FS
- Demo(10 min): Deployment and usage
- Plenary Discussion(10 min)
13:15 - 14:30 Provisioning of an Environment for Parallel Computing – Artur Wachtel Slides Exercise
- Lecture(15 min): Providing a joint software environment with environment modules and Spack
- Exercise(45 min): Installing MPI and GROMACS and providing module descriptions (other group members to test)
- Plenary Discussion(15 min)
14:30 - 15:00 ClusterShell – Artur Wachtel Slides Exercise
- Lecture (10 min): Introduction
- Exercise (15 min): Installation and testing
- Plenary Discussion(5 min)
15:00 - 15:15 Break
15:15 - 16:15 Benchmarking – Aasish Kumar Sharma Slides Tutorial Exercise
- Lecture(35 min): Benchmarking
- Exercise(15 min): Real system benchmarking on your VMs
- Plenary Discussion(10 min)
16:15 - 17:15 Performance Estimation – Julian Kunkel Slides Exercise
- Lecture(20 min): Hardware characteristics and performance estimates in distributed systems
- Exercise(35 min): Theoretical performance assessment
- Plenary Discussion(15 min)
17:15 - 17:30 Break
17:30 - 18:00 General Q&A session and organisational information for students slides

Student Project Work

2023-11-03 - Send your requested topic to us by this day
2023-11-10 - We assign a supervisor per student by this day
- Contact your supervisor
- Work on your topic
- Write your reports
- Get feedback from supervisor
2024-03-31 - Submit final report as PDF by email to jonathan.decker@uni-goettingen.de

Examination

The exam is conducted through a report. The report should cover the evaluation of the assigned tool. The report should describe:

What the tool is, what it is used for
How the tool was set up
How you evaluated it
The results of your evaluation
Discussion of problems and potential of the tool
Conclusion

The report should not exceed 15 pages (only counting raw text in the main part, the full report including cover pages and appendix may be longer). It is not sufficient to repeat the documentation of the tool in your own words.

We recommend using the LaTeX templates provided by us here: https://hps.vi4io.org/teaching/ressources/start#templates

Examination Requirement

To be admitted to the examination, you have to show that you attended the majority of the sessions of the block course. To prove this, please send us 1-2 pages of notes on the course. These can be the personal notes you took during the sessions; they do not need to be a formatted document and serve only to show that you took the course. They do NOT need to be complete solutions to the exercises; a few sentences on your takeaways per section are enough.

If you joined the course late or had to miss out on some of the sessions, you can find the recordings on BBB and the materials on this web page. The exercises can be completed on a personal VM.

Topic Distribution

Student	Supervisor	Topic	Submissions
Jakob Hampel	Stefanie Mühlhausen	Ticketing Systems: Interfaces/Performance/Comparison	Report
Joao Soares	Timon Vogt	Web Hosting Software Stacks Supabase vs Pocketbase	Report
Andre Buderus	Hauke Kirchner	Scalable software management and distribution for Python
Jakob Dieterle	Freja Nordsiek	File system management (NFSv4, Ceph, BeeGFS)
Qumeng Sun	Marcus Merz	Intrusion detection tools for HPC	Report
Mohamed Basuony	Hendrik Nolte	Scalable software management and distribution for Python
Zilin Song	Timon Vogt	Resource Management with SLURM
Abdellah Omar Adolf	Marcus Merz	Monitoring System Performance
Michael Hubert Duah	Jaromir Nemecek	Image Management and network booting with Warewulf
Tim Dettmar	Sebastian Krey	HPC networking with libibverbs and libfabric (fallback Managing Cluster File Systems in user space)	Report
Frederik Hennecke	Zoya Masih	Berkeley Packet Filters	Report
Mehmet Niyazi Kayi	Julian Rüger	Cluster wide User/Group management (e.g. LDAP)
Surendhar Muthukumar	Freja Nordsiek	Managing cluster file systems in user space (GlusterFS, FUSE, SeaWeedFS)	Report
Ashutosh Jaiswal	Narges Lux	Application and System Benchmarks
Pranay Bhatia	Jonathan Decker	Kubernetes for HPC	Report
Sunny Jain	Lars Quentin	Scalable databases with e.g. Elasticsearch, Postgres	Report
Lars Quentin	Marcus Merz	Prometheus Scalability Evaluation for HPC Monitoring	Report
Chinaza Ogo Obiagazie	Julian Rüger	Cluster wide User/Group management (e.g. LDAP)
