====== Practical: High-Performance Computing System Administration ======

High-Performance Computing (HPC) system administration is essential for managing HPC resources, not only as a user but also as a cluster administrator. As part of this practical course, you will take part in a hands-on one-week block course, which introduces the basics of Linux and of using HPC resources and then goes into depth on HPC system administration. At the end of the block course, you will choose a topic, namely a tool related to HPC system administration, evaluate that tool, and hand in a report at the end of the semester. A supervisor who is an expert on the assigned tool will be assigned to guide you.

===== Key information =====

|| Contact || [[about:people:julian_kunkel|Julian Kunkel]], [[about:people:jonathan_decker|Jonathan Decker]] ||
|| Location || Virtual [[https://meet.gwdg.de/b/jul-pfo-7mr-txo|Main Room]] [[https://meet.gwdg.de/b/jul-mii-pfh-shu|Support Room]] ||
|| Time || 13.10.25-17.10.25 5-day block course ||
|| Language || English ||
|| Module || M.Inf.1831: High-Performance Computing System Administration ||
|| SWS || 4 ||
|| Credits || 6 (+ 3 with M.Inf.1834) ||
|| Contact time || up to 84 hours (63 full hours), depending on the course ||
|| Independent study || up to 186 hours ||

Please note that we plan to record sessions (lectures and seminar talks) with the intent of providing the recordings to other students via BBB and of publishing and linking them on YouTube for future terms. If you appear in any of the recordings via voice, camera or screen share, we need your consent to publish the recordings. See also this {{ :teaching:templates:dataprivacy_student_notice_slide.pdf |Slide}}.
==== Required Prior Knowledge ====

  * No prior skills or knowledge are required
  * Understanding of Linux basics, prior experience with Linux, and the ability to operate a Bash shell are beneficial
  * We will provide a short crash course at the beginning of the course and link supplementary training material

===== Learning Objectives =====

  * Discuss theoretical facts related to networking, compute and storage resources
  * Integrate cluster hardware consisting of multiple compute and storage nodes into a “supercomputer”
  * Configure system services that allow the efficient management of the cluster hardware and software, including network services such as DHCP, DNS, NFS, IPMI, SSHD
  * Install software and provide it to multiple users
  * Compile end-user applications and execute them on multiple nodes
  * Analyze system and application performance using benchmarks and tools
  * Formulate security policies and good practice for administrators
  * Apply tools for hardening the system such as firewalls and intrusion detection
  * Describe and document the system configuration

===== Topics for Practical Works =====

  * LLM RAG Agent based on ChatAI
  * Automating Simple Maintenance Tasks in HPC Systems Using Python and Shell Scripts
  * Extending the Linux kernel scheduler
  * Confidential Computing (HPC/Cloud)
  * Python Performance Optimization leveraging Native Implementations (Numba/CPython/PyO3/Nuitka/transpyle)
  * Parallel filesystems performance optimization & benchmarking (incl. AI/ML)
  * Longhorn as Kubernetes persistent storage in the HPC environment
  * AI for monitoring
  * I/O Performance for ML models
  * Nvidia Nsight Systems on HPC cluster, profiling AI workloads remotely
  * Web testing Shiny applications
  * Neuromorphic Computing
  * Effective Intrusion Detection System (IDS) Strategies in HPC Environments
  * Regression Testing for HPC
  * Global Optimization (of Clusters) with Genetic Algorithms
  * FPGA Computing with SciEngine
  * RISC-V: State of the union
  * Benchmarking of HPC Systems
  * Security in Cloud and HPC
  * GPU Computing with WebAssembly
  * Parallelization with Dask + Xarray
  * What's new in the Kubernetes ecosystem
  * Containers in HPC
  * Function-as-a-service in HPC
  * Encryption tools
  * Image Management and network booting with Warewulf
  * Software Management with modules/spack
  * Resource Management with SLURM
  * Managing object storage
  * Managing cluster file systems in user space (GlusterFS, FUSE, SeaWeedFS)
  * File system management (NFSv4, Ceph, BeeGFS)
  * Performance analysis tools
  * Monitoring system performance
  * Application and system benchmarks
  * Virtualization tools for HPC (e.g., CharlieCloud, Singularity, Shifter)
  * Scalable databases (e.g., Elasticsearch, Postgres)
  * Kernel compilation and configuration
  * Forensic tools
  * WebAssembly in Kubernetes
  * Vector database performance comparison with Postgres
  * Confidential Container Attestation

===== Agenda =====

==== Block Seminar 13.10.25-17.10.25 ====

//This part is attended by BSc/MSc students and GWDG academy participants//

Note: The schedule only lists breaks for lecture slots. You can take a break during exercises as necessary.

Preparation sheets: {{ :teaching:autumn_term_2025:hpcsa:hpcsa-prepe0.pdf|Preparation}}

=== Monday 13.10.2025 ===

  * 09:00 - 10:00 **Welcome**, Organization of the block course -- //Julian Kunkel// {{ :teaching:autumn_term_2025:hpcsa:hpcsa-welcome.pdf |Slides}} {{ :teaching:autumn_term_2025:hpcsa:welcome-exercise.pdf |Exercise}}
    * Agenda of the week
    * Forming support groups
    * Format of the "group work"
    * Exercise (10 min): Introduce yourself in the "learning groups"
    * Tutorial (10 min): Demo; setting up cloud resources from a fresh account
    * Exercise (20 min): Is your cloud setup working?
    * Plenary (10 min): Discussion of the format, Q&A
  * 10:00 - 12:00 **Cluster Management with Warewulf Part 1** -- //Freja Nordsiek// {{ :teaching:autumn_term_2025:hpcsa:warewulf-presentation-1.pdf |Slides}} {{ :teaching:autumn_term_2025:hpcsa:warewulf-exercise-1.pdf |Exercise}} {{ :teaching:autumn_term_2025:hpcsa:warewulf-presentation-2.pdf |Slides}} {{ :teaching:autumn_term_2025:hpcsa:warewulf-exercise-2.pdf |Exercise}}
    * "How to boot a thousand nodes"
    * Lecture (20 min): Motivation, components of cluster management (DNS, DHCP, PXE-Boot process, images, resource management, monitoring, hardware components)
    * Management Demo
    * Exercise (30 min): Describing the responsibilities of Warewulf components and the boot process
    * Lecture: Technical details and administration of dnsmasq, DHCP, and investigating logfiles
    * Exercise 1
    * Lecture: Warewulf configuration
    * Demo: Image creation and deployment
    * Exercise 2
  * 12:00 - 12:45 //Lunch Break//
  * 12:45 - 15:00 **Cluster Management with Warewulf Part 2** -- //Freja Nordsiek//
  * 15:00 - 15:15 //Break//
  * 15:15 - 17:00 **Slurm administration** -- //Freja Nordsiek// {{ :teaching:autumn_term_2025:hpcsa:slurm-admin-slides.pdf |Slides}} {{ :teaching:autumn_term_2025:hpcsa:slurm-admin-exercise.pdf |Exercise}} {{ :teaching:autumn_term_2025:hpcsa:slurm-files.zip |Config}}
    * Slurm installation, basic configuration, testing
    * Lecture: Introduction to Slurm
    * Tutorial: Server installation, basic configuration, and testing (flexible break)
    * Exercise: Adjustments of the configuration, integration of the cluster nodes, testing

=== Tuesday 14.10.2025 ===

  * 09:00 - 10:00 **Recap Warewulf and Slurm Installation** -- //Freja Nordsiek//
  * 10:00 - 11:00 **User Management with Warewulf** -- //Freja Nordsiek//
  * 11:00 - 12:00 **Network File System Setup** -- //Patrick Höhn// {{ :teaching:autumn_term_2025:hpcsa:-nfs-slides.pdf |Slides}} {{ :teaching:autumn_term_2025:hpcsa:nfs-exercisesheet.pdf |Exercise}}
    * Lecture (15 min): NFS Introduction
    * Exercise (30 min): Setup of a basic NFS server and client
    * Plenary discussion (15 min)
  * 12:00 - 12:45 //Lunch Break//
  * 12:45 - 14:45 **Provisioning of an Environment for Parallel Computing** -- //Artur Wachtel// {{ :teaching:autumn_term_2025:hpcsa:environment.pdf |Slides}} {{ :teaching:autumn_term_2025:hpcsa:environment-exercise.pdf |Exercise}}
    * Lecture (15 min): Providing a joint software environment with environment modules and Spack
    * Exercise (45 min): Installing MPI and Gromacs and providing module descriptions (other group members to test)
    * Plenary discussion (15 min)
  * 14:45 - 15:00 //Break//
  * 15:00 - 17:00 **Monitoring in HPC** -- //Marcus Merz// {{ :teaching:autumn_term_2025:hpcsa:monitoring.pdf |Slides}} {{ :teaching:autumn_term_2025:hpcsa:monitoring-tutorial.pdf |Tutorial}}
    * Lecture (15 min): Monitoring introduction and software stacks
    * Lecture (5 min): InfluxDB
    * Exercise (20 min): Installing InfluxDB
    * Lecture (5 min): Telegraf
    * Exercise (20 min): Installing Telegraf
    * Lecture (5 min): Grafana
    * Exercise (35 min): Installing Grafana and setting up a dashboard for an example application (Slurm)
    * Plenary discussion (15 min)

=== Wednesday 15.10.2025 ===

  * 09:00 - 10:00 **Best practices for administrators** -- //Kevin Lüdemann// {{ :teaching:autumn_term_2025:hpcsa:best-practices.pdf |Slides}} {{ :teaching:autumn_term_2025:hpcsa:best-practices-exercise.pdf |Exercise}}
    * Lecture (20 min): Processes and management, documentation, frameworks: ITIL, PRINCE2
    * Exercise (20 min): Discussion of the best practices, searching for related work, and a critical discussion of your own experience with the setup of Warewulf and Slurm
    * Plenary discussion (20 min)
  * 10:00 - 11:00 **Firewalls** -- //Freja Nordsiek// {{ :teaching:autumn_term_2025:hpcsa:firewalls.pdf |Slides}} {{ :teaching:autumn_term_2025:hpcsa:firewalls-exercise.pdf |Exercise}} {{ :teaching:autumn_term_2025:hpcsa:firewalls-rules.zip|NFT Ruleset}}
    * Lecture (15 min): Introduction to firewalls
    * Exercise (35 min): Exploring firewall rules, port scanning with nmap, and internet access for the nodes using NAT
    * Plenary discussion (10 min)
  * 11:00 - 12:00 **Security and security policies** -- //Trevor Tabougua// {{ :teaching:autumn_term_2025:hpcsa:security-slides.pdf |Slides}} {{ :teaching:autumn_term_2025:hpcsa:security-and-security-policies-exercise.pdf |Exercise}} {{ :teaching:autumn_term_2025:hpcsa:security-and-security-policies-exercise-solution.pdf |Exercise Solution}}
    * Lecture (30 min): Security introduction + Demo
    * Discussing an existing service and its security implications
    * Exercise (15 min): Theoretical investigation of an existing service (the one from before)
    * Exercise (30 min): Describe a new service and its security implications, and add it to a service catalogue
    * Plenary discussion (15 min)
  * 12:00 - 12:45 //Lunch Break//
  * 12:45 - 13:45 **Intelligent Platform Management Interface (IPMI)** -- //Nils Kanning// {{ :teaching:autumn_term_2025:hpcsa:ipmi.pdf |Slides}}
    * Lecture (15 min): IPMI introduction
    * Plenary discussion (45 min)
  * 13:45 - 14:15 **ClusterShell** -- //Artur Wachtel// {{ :teaching:autumn_term_2025:hpcsa:clush-slides.pdf |Slides}} {{ :teaching:autumn_term_2025:hpcsa:clush-exercise.pdf |Exercise}}
    * Lecture (10 min): Introduction
    * Exercise (15 min): Installation and testing
    * Plenary discussion (5 min)
  * 14:15 - 14:30 //Break//
  * 14:30 - 16:30 **Documentation Writing** -- //Kevin Lüdemann// {{ :teaching:autumn_term_2025:hpcsa:ipmi.pdf |Slides}}

=== Thursday 16.10.2025 ===

  * 09:00 - 10:30 **Benchmarking** -- //Aasish Kumar Sharma// {{ :teaching:autumn_term_2025:hpcsa:benchmark.pdf |Slides}} {{ :teaching:autumn_term_2025:hpcsa:benchmark-tutorial.pdf |Tutorial}} {{ :teaching:autumn_term_2025:hpcsa:-benchmark-exercise-new.pdf |Exercise}}
    * Lecture (35 min): Benchmarking
    * Exercise (15 min): Real system benchmarking on your VMs
    * Plenary discussion (10 min)
  * 10:30 - 12:00 **Performance Estimation** -- //Julian Kunkel// {{ :teaching:autumn_term_2025:hpcsa:performance-estimation.pdf |Slides}} {{ :teaching:autumn_term_2025:hpcsa:performance-estimation-exercise.pdf |Exercise}}
    * Lecture (20 min): Hardware characteristics and performance estimates in distributed systems
    * Exercise (35 min): Theoretical performance assessment
    * Plenary discussion (15 min)
  * 12:00 - 12:45 //Lunch Break//
  * 12:45 - 16:00 **Working on other students' clusters and testing their documentation** -- //Kevin Lüdemann// {{ :teaching:autumn_term_2025:hpcsa:performance-estimation.pdf |Slides}}
  * 16:00 - 17:00 **General Q&A session and organisational information for students** -- //Jonathan Decker// {{ :teaching:autumn_term_2025:hpcsa:hpcsa-assignment.pdf |Slides}}

=== Friday 17.10.2025 ===

RzGö live hardware demonstration and hands-on session. If you are a remote participant, we request that you revisit the previous material and prepare questions for the Q&A sessions. On-site participation is limited to 20 participants.

  * Group 1
    * 09:00 Group 1 Meet at GWDG Burckhardtweg 4, 37077 Göttingen in the lobby - (Bus stop Burckhardtweg)\\
    * 09:15 Group 1 **Network interconnects** -- //Sebastian Krey// {{ :teaching:autumn_term_2025:hpcsa:interconnect.pdf |Slides}} {{ :teaching:autumn_term_2025:hpcsa:interconnect-exercise.pdf |Exercise}} {{ :teaching:autumn_term_2025:hpcsa:hardware.pdf |Hardware}}
      * Lecture (20 min): HPC Interconnects, Fabric Manager, RDMA, VLAN, LATP
      * Exercise (20 min): Cable planning
    * 10:15 Group 1 **Introduction to our onsite hardware** -- //Sebastian Krey// \\ {{ :teaching:autumn_term_2025:hpcsa:smartboard-group-1.pdf |Smartboard Group 1}}
    * 10:30-13:00 Group 1 Hands-on Hardware Exercises
    * 13:00-14:00 Group 1 Tour in the data center
  * Group 2
    * 11:45 Group 2 Meet at GWDG Burckhardtweg 4, 37077 Göttingen in the lobby - (Bus stop Burckhardtweg)\\
    * 12:00-13:00 Group 2 Tour in the data center
    * 13:30 Group 2 **Network interconnects** -- //Sebastian Krey// {{ :teaching:autumn_term_2025:hpcsa:interconnect.pdf |Slides}} {{ :teaching:autumn_term_2025:hpcsa:interconnect-exercise.pdf |Exercise}} {{ :teaching:autumn_term_2025:hpcsa:hardware.pdf |Hardware}}
      * Lecture (20 min): HPC Interconnects, Fabric Manager, RDMA, VLAN, LATP
      * Exercise (20 min): Cable planning
    * 14:30 Group 2 **Introduction to our onsite hardware** -- //Sebastian Krey// \\ {{ :teaching:autumn_term_2025:hpcsa:smartboard-group-2.pdf |Smartboard Group 2}}
    * 14:45-17:30 Group 2 Hands-on Hardware Exercises
      * Setting up hardware
      * Plug in a small cluster
      * BIOS settings
      * Installation of Warewulf
      * Mounting of InfiniBand cards
      * Configuration of InfiniBand
      * RDMA performance test

==== Student Project Work ====

  * 2025-11-03 - Send us your requested topic by this day
  * 2025-11-10 - We assign a supervisor to each student by this day
  * Contact your supervisor
  * Work on your topic
  * Write your report
  * Get feedback from your supervisor
  * 2026-03-31 - Submit the final report as a PDF by email to jonathan.decker@uni-goettingen.de

===== Examination =====

The exam is conducted through a report. The report should cover the evaluation of the assigned tool. The report should describe:

  * What the tool is, what it is used for
  * How the tool was set up
  * How you evaluated it
  * The results of your evaluation
  * Discussion of problems and potential of the tool
  * Conclusion

The report should not exceed 15 pages (counting only raw text in the main part; the full report, including cover pages and appendix, may be longer). It is not sufficient to repeat the documentation of the tool in your own words. We recommend using the LaTeX templates we provide here: https://hps.vi4io.org/teaching/ressources/start#templates

===== Examination Requirement =====

In order to be allowed to take the examination, you have to show that you have attended the majority of the sessions of the block course. To prove this, please send 1-2 pages of notes on the course to us.
These can be the personal notes you took during the sessions; they do not need to be a formatted document and serve only to prove that you attended the course. They do NOT need to be complete solutions to the exercises; a few sentences on your takeaways per section are enough. If you joined the course late or had to miss some of the sessions, you can find the recordings on BBB and the materials on this web page. The exercises can be completed on a personal VM.

===== Topic Distribution =====

|| **Student** || **Supervisor** || **Topic** || **Submissions** ||
|| Your Name || Your Supervisor || Your Topic || {{ :teaching:autumn_term_2025:stud:report.pdf |Report}} ||