Knowledge of High-Performance Computing (HPC) system administration is essential for managing HPC resources
not only as a user but also as a cluster administrator. As part of this practical course, you will take part in a hands-on one-week block course, which introduces the basics of Linux and of using HPC resources and then goes into depth on HPC system administration.
At the end of the block course, you will choose a tool related to HPC system administration as your topic, evaluate that tool, and hand in a report at the end of the semester.
For this, you will be assigned a supervisor who is an expert on the tool and can guide you.
Contact | Julian Kunkel, Jonathan Decker |
Location | Virtual: Main Room, Support Room |
Time | 07.10.24-11.10.24 5-day block course |
Language | English |
Module | M.Inf.1831: High-Performance Computing System Administration |
SWS | 4 |
Credits | 6 |
Contact time | up to 84 hours (63 full hours), depending on the course |
Independent study | up to 186 hours |
Please note that we plan to record sessions (lectures and seminar talks) with the intent of providing the recordings
via BBB to other students and of publishing and linking them on YouTube for future terms.
If you appear in any of the recordings via voice, camera or screen share, we need your consent to publish the recordings.
See also this Slide.
Required Prior Knowledge
No prior skills or knowledge are required
An understanding of Linux basics, prior experience with Linux, and the ability to work in a Bash shell are beneficial
We will provide a short crash course at the beginning of the course and link supplementary training material
Learning Objectives
Discuss theoretical facts related to networking, compute and storage resources
Integrate cluster hardware consisting of multiple compute and storage nodes into a “supercomputer”
Configure system services that allow the efficient management of the cluster hardware and software, including network services such as DHCP, DNS, NFS, IPMI, SSHD
Install software and provide it to multiple users
Compile end-user applications and execute them on multiple nodes
Analyze system and application performance using benchmarks and tools
Formulate security policies and good practice for administrators
Apply tools for hardening the system such as firewalls and intrusion detection
Describe and document the system configuration
Topics for Practical Works
Intrusion detection tools for HPC
Encryption tools
Image management and network booting with Warewulf
Software management with Modules/Spack
Resource management with SLURM
Managing object storage
Managing cluster file systems in user space (GlusterFS, FUSE, SeaweedFS)
File system management (NFSv4, Ceph, BeeGFS)
Performance analysis tools
Monitoring system performance
Application and system benchmarks
Virtualization tools for HPC (e.g., Charliecloud, Singularity, Shifter)
Scalable databases (e.g., Elasticsearch, Postgres)
Kernel compilation and configuration
Security infrastructures and intrusion systems
Deep packet analysis and filtering
Berkeley Packet Filters (eBPF)
Firewalls
Kernel splicing
Scalable software management and distribution for Python
Forensic tools
Cluster-wide user/group management (e.g., LDAP)
Scalable logging and log-file analysis
Web hosting software stacks (e.g., LAMP)
Agenda
Block Seminar 07.10.24-11.10.24
This part is attended by BSc/MSc students and GWDG academy participants
Note: There are only breaks for lecture slots in the schedule. You can take a break during exercises as necessary.
Preparation sheets: Preparation
Monday 07.10.2024
Tuesday 08.10.2024
Wednesday 09.10.2024
Thursday 10.10.2024
RzGö live hardware demonstration and Hands-on.
If you are a remote participant, we request that you revisit the previous material and prepare questions for Q&A sessions.
On-site participation is limited to 20 participants.
Group 1
09:00 Group 1 Meet at GWDG Burckhardtweg 4, 37077 Göttingen in the lobby - (Bus stop Burckhardtweg)
-
Lecture (20 min): HPC Interconnects, Fabric Manager, RDMA, VLAN, LATP
Exercise (20 min): Cable planning
-
10:30-13:00 Group 1 Hands-on Hardware Exercises
13:00-14:00 Group 1 Tour in the data center
Group 2
11:45 Group 2 Meet at GWDG Burckhardtweg 4, 37077 Göttingen in the lobby - (Bus stop Burckhardtweg)
12:00-13:00 Group 2 Tour in the data center
-
Lecture (20 min): HPC Interconnects, Fabric Manager, RDMA, VLAN, LATP
Exercise (20 min): Cable planning
-
14:45-17:30 Group 2 Hands-on Hardware Exercises
Setting up hardware
Friday 11.10.2024
Student Project Work
2024-11-01 - Send us your requested topic by this day
2024-11-08 - We assign a supervisor to each student by this day
2025-03-31 - Submit your final report as a PDF by email to jonathan.decker@uni-goettingen.de
Examination
The examination consists of a written report.
The report should cover the evaluation of the assigned tool.
The report should describe:
What the tool is and what it is used for
How the tool was set up
How you evaluated it
The results of your evaluation
Discussion of problems and potential of the tool
Conclusion
The report should not exceed 15 pages, counting only the raw text of the main part; the full report including cover pages and appendix may be longer.
It is not sufficient to repeat the documentation of the tool in your own words.
We recommend using the LaTeX templates we provide here: https://hps.vi4io.org/teaching/ressources/start#templates
Examination Requirement
To be admitted to the examination, you have to show that you attended the majority of the sessions of the block course.
To prove this, please send 1-2 pages of notes on the course to us.
These can be the personal notes you took during the sessions; they do not need to be a formatted document and serve only to prove that you took the course.
They do NOT need to be complete solutions to the exercises; a few sentences on your takeaways per section are enough.
If you joined the course late or had to miss out on some of the sessions, you can find the recordings on BBB and the materials on this web page.
The exercises can be completed on a personal VM.
Topic Distribution
Student | Supervisor | Topic | Submissions |
Your Name | Your Supervisor | Your Topic | Report |