The theses offered below are intended for MSc students but can also be reduced in scope and handed out as BSc theses.
Serverless computing, or Function-as-a-Service (FaaS), has emerged as a new computing paradigm over the last few years. A number of open-source FaaS platforms are based on Kubernetes, as the container orchestration platform maps well to the components required for FaaS. However, most approaches to FaaS are still relatively naive and leave many performance improvements on the table. This work focuses on these limitations, aims to solve at least one of them, and implements a proof of concept. Finally, the performance improvements should be benchmarked in a virtualized environment and on the HPC system.
Customer-facing systems that handle sensitive data, such as patient information, must comply with strict data protection laws. To comply with these laws even during a security breach, confidential computing should be used; however, modern use cases require scalable multi-user systems with GPU acceleration for ML inference workloads. This thesis comprises setting up confidential computing on top of a Kubernetes cluster using Kata Containers, Confidential Containers, and NVIDIA Confidential GPU Computing, as well as measuring the performance costs of using a confidential compute stack.
Retrieval-Augmented Generation (RAG) is a method for providing an LLM with additional data via documents that are automatically added to queries. This method has been implemented in a number of ways, including in open-source projects such as PrivateGPT or H2OGPT. These projects commonly feature answers that state which documents, and which pages within them, were used to complete the query. However, these references can be inaccurate, as the system is forced to make a selection even when no document was actually used. For this thesis topic, a student would set up such a RAG LLM system and augment its retrieval component to provide more accurate references to documents, or none if no documents from the given set were used.
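A minimal sketch of the kind of thresholded retrieval step such a system could build on (all identifiers are hypothetical, and the embeddings are assumed to be precomputed): documents are only cited when their similarity to the query exceeds a cutoff, so the system can honestly report that no source was used.

```python
# Hedged sketch of retrieval with abstention: cite only documents whose
# cosine similarity to the query exceeds a threshold; return [] otherwise.
import numpy as np

def retrieve_references(query_emb, doc_embs, doc_ids, threshold=0.75):
    """Return ids of documents similar enough to the query, or [] if none."""
    # Cosine similarity between the query and every document embedding.
    sims = doc_embs @ query_emb / (
        np.linalg.norm(doc_embs, axis=1) * np.linalg.norm(query_emb))
    ranked = np.argsort(sims)[::-1]
    # Abstain instead of forcing a selection: keep only above-threshold hits.
    return [doc_ids[i] for i in ranked if sims[i] >= threshold]
```

The threshold value itself would be a tunable of the thesis: too low and spurious references reappear, too high and genuine sources are dropped.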
In this thesis, the student will explore current AI methods to create smart algorithms that can catch problems in computer systems before they turn serious. Think of it as developing a high-tech 'early warning system'. The journey will involve working with data, crafting algorithms, and running simulations to see how well they work. Plus, you'll get to integrate your creations into real computing systems, making them more reliable and reducing downtime.
This research explores the potential of edge computing technologies in enabling real-time predictive maintenance within compute continuum systems. The objective is to develop a framework that utilizes edge computing for immediate data processing and decision-making, enhancing the overall efficiency and responsiveness of maintenance protocols. The thesis will involve both theoretical and practical aspects, including system design, implementation, and testing in real-world scenarios.
To answer the question of how to make AI-driven maintenance work smoothly in huge computing systems, we need to find out what makes scaling up so tricky and come up with efficient ways to improve it. The student will investigate the scalability challenges associated with implementing AI-based predictive maintenance in large-scale compute continuum systems. The research will focus on identifying key scalability issues and developing innovative solutions to enhance the performance and effectiveness of predictive maintenance strategies. It will include a thorough analysis of current systems, the proposal of new methodologies, and an evaluation of their impact on large-scale system maintenance.
A VAST storage system will be installed as part of the new KISSKI data center. VAST storage systems offer different protocol flavours to access the storage backend, i.e. NFS, S3, SMB, and mixed modes. Since projects at the new data center should be executed efficiently, it is important to gain insights into the potential performance of machine learning workloads on each of these protocols. The proposed thesis will fill this gap and provide recommendations for future projects.
LIGGGHTS is a common code used for the simulation of macroscopic particles. It is based on the well-known molecular dynamics code LAMMPS. The variant used within the thesis is the academic fork LIGGGHTS-PFM, which is under active development. Since LAMMPS already has some modules for GPU processing, the goal of the thesis is to modify LIGGGHTS-PFM to make use of these capabilities. In a first step, the best strategy for implementing LIGGGHTS-PFM on GPUs should be evaluated. Based on this, a concept and initial steps of the implementation are expected. However, it is not required that all features of LIGGGHTS-PFM are implemented within the scope of the thesis. The enhancement is expected to improve run-time performance and pave the way for particle simulations on GPUs. General programming experience is required. Knowledge of GPU computing and particle transport is beneficial but not mandatory.
preCICE, as already presented at the GöHPCoffee, is a multiphysics framework which allows the combination of various simulation codes to perform coupled simulations. These include coupled thermal problems as well as topics related to fluid-structure interaction. So far, there is no possibility to perform a coupled particle simulation using preCICE, since the only particle solver is not publicly available. The aim of this thesis is to mitigate this limitation by implementing a preCICE adapter for the particle solver LIGGGHTS-PFM. One possibility could be the modification of an existing OpenFOAM adapter in preCICE. In addition, the thesis will compare the achievable performance with other coupling libraries using LIGGGHTS and its derivatives. General programming experience is required. Knowledge of simulation technology and particle transport, especially in LIGGGHTS, is beneficial but not mandatory.
This thesis aims to enhance the computational efficiency of GPU-based applications on GWDG clusters. A performance model will be developed that takes into account the GPU architecture, application characteristics, and GWDG cluster configuration. The model will be implemented and its accuracy evaluated using a set of benchmark applications. It will then be used to identify performance bottlenecks and optimize these applications. The expected outcome is an improved understanding of GPU performance on GWDG clusters, leading to more efficient utilization of these resources and potentially significant performance gains for GPU-based applications.
Iterative optimization algorithms are used in various areas of computer science and related fields, including machine learning, artificial intelligence, and image reconstruction. For large-scale problems, these algorithms can be parallelized to run on multiple CPUs and GPUs. In this work, an existing image-reconstruction framework for computational Magnetic Resonance Imaging (MRI) will be parallelized using the Message Passing Interface (MPI) standard. Benchmarks and performance analysis of the parallel implementations will be performed on a national supercomputer.
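As an illustration of the general pattern (a hedged sketch, not the actual MRI framework): each MPI rank holds a slice of the measurement data, computes a partial gradient, and the ranks combine it with an Allreduce before all taking the identical update step.

```python
# Hedged mpi4py sketch: data-parallel gradient descent on 0.5*||A x - y||^2,
# where A and y are toy stand-ins for the reconstruction operator and data.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

n = 1024                                   # image size (toy example)
rng = np.random.default_rng(seed=rank)
A_local = rng.standard_normal((64, n))     # this rank's slice of the operator
y_local = rng.standard_normal(64)          # this rank's slice of the data
x = np.zeros(n)                            # current iterate, identical on all ranks

for it in range(100):
    # Local contribution to the gradient.
    g_local = A_local.T @ (A_local @ x - y_local)
    g = np.empty_like(g_local)
    comm.Allreduce(g_local, g, op=MPI.SUM)  # sum partial gradients across ranks
    x -= 1e-4 * g                           # identical update on every rank
```

Run with e.g. `mpirun -np 8 python reconstruct.py`; the benchmarking part of the thesis would then measure how the Allreduce cost scales with rank count and problem size.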
Training Generative Adversarial Networks (GANs) involves training both a generator and a discriminator network in an alternating procedure. This procedure can be complex and consumes a comparatively large amount of computational resources. In this work, HPC performance tools will be used to profile the training of GANs and characterise their performance on GPUs.
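One possible starting point, sketched under the assumption of a PyTorch setup (the models, optimizers, and a single-logit discriminator output are stand-ins), is to wrap one alternating training step in torch.profiler and attribute GPU time to the two passes separately:

```python
# Hedged sketch: profile one alternating GAN step with torch.profiler,
# labelling the discriminator and generator passes for the trace.
import torch
from torch.profiler import profile, record_function, ProfilerActivity

def profiled_gan_step(G, D, opt_G, opt_D, real, z):
    bce = torch.nn.functional.binary_cross_entropy_with_logits
    ones = torch.ones(real.size(0), 1, device=real.device)
    zeros = torch.zeros(z.size(0), 1, device=z.device)
    with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
        with record_function("discriminator_step"):
            opt_D.zero_grad()
            fake = G(z).detach()
            loss_d = bce(D(real), ones) + bce(D(fake), zeros)
            loss_d.backward()
            opt_D.step()
        with record_function("generator_step"):
            opt_G.zero_grad()
            loss_g = bce(D(G(z)), torch.ones(z.size(0), 1, device=z.device))
            loss_g.backward()
            opt_G.step()
    # Per-operator breakdown of where GPU time is spent.
    print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```

Dedicated HPC tools such as Nsight Systems could then be compared against this application-level view.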
Data Processing Units (DPUs) are programmable SoC-based SmartNICs with the capability to offload processing tasks that are normally performed by CPUs. Using their onboard processors, DPUs can perform in-network data analysis in addition to the traditional NIC functions. Specifically, the host system can offload data-intensive workloads to DPUs for Big Data analytics and AI/ML acceleration. Anomalies in computer networks can be attributed to hardware or software failures, cyber-attacks, or misconfigurations. In-network analysis of network data can help reduce serious damage in the case of cyber-attacks or similar security breaches. Big Data analytics tools like Spark Streaming can enable real-time data processing before ML/DL algorithms are applied for anomaly detection. In this work, machine learning models will be trained and deployed on DPUs to perform in-network inference on network data for anomaly detection. The results are expected to demonstrate the potential of deploying DPUs for cybersecurity.
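For illustration, a small sketch of the kind of anomaly detector that could later be deployed for in-network inference; the flow features here are synthetic placeholders, not real captured traffic, and the choice of an isolation forest is only one of many candidate models.

```python
# Hedged sketch: train an unsupervised anomaly detector on "benign" flow
# features, then flag outliers. Real features would come from traffic capture.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
normal = rng.normal(loc=0.0, scale=1.0, size=(5000, 4))    # benign flows (toy)
model = IsolationForest(contamination=0.01, random_state=0).fit(normal)

# Mix of benign-looking and clearly shifted (anomalous) samples.
test = np.vstack([rng.normal(size=(5, 4)),
                  rng.normal(loc=6.0, size=(5, 4))])
print(model.predict(test))   # +1 = normal, -1 = anomalous
```

The thesis would additionally address converting such a model to a format executable on the DPU's onboard processors.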
Especially in the noisy intermediate-scale quantum computing era, hybrid quantum-classical approaches are among the most promising for achieving early advantages over classical computing. For these approaches, an integration with HPC systems is mandatory. The goal of this project is to design and implement a workflow for running hybrid codes on our HPC systems, starting with quantum computing simulators, extending this to cloud-available real quantum computers, and providing perspectives for future systems made available otherwise. Possible aspects of this work are Jupyter-based user interfaces, containerization, scheduling, and costs of hybrid workloads. The final result should be a PoC covering at least some important aspects.
Neuromorphic computers, i.e., computers whose design is inspired by the human brain, are mostly intended for machine learning. However, recent results show that they may prove advantageous for NP-complete optimization problems as well. In this area they compete with (future) quantum computers, especially with quantum annealing and adiabatic approaches. The goal of this project is to explore the SpiNNaker systems available at GWDG regarding their use for this type of problem. A successful project would encompass the implementation of a toy problem and a comparison with implementations on other platforms.
The present thesis delves into the exciting research field of personalized teaching in High Performance Computing (HPC). The objective is to identify innovative methods and technologies that enable tailoring educational content in the field of high-performance computing to the individual needs of students. By examining adaptive learning platforms, machine learning, and personalized teaching strategies, the thesis will contribute to the efficient transfer of knowledge in HPC courses. The insights from this research aim not only to enhance teaching in high-performance computing but also to provide new perspectives for the advancement of personalized teaching approaches in other technology-intensive disciplines.
This thesis focuses on the compilation and analysis of training materials from various scientific institutions in the High Performance Computing (HPC) domain. The initial phase involves utilizing scraping techniques to gather diverse training resources from different sources. Subsequently, the study employs methods derived from Machine Learning and Statistics to conduct a comprehensive analysis of the collected materials. The research aims to provide insights into the existing landscape of HPC training materials, identify commonalities, and offer recommendations for optimizing content delivery in this crucial field.
This groundbreaking thesis endeavors to transform the landscape of High Performance Computing (HPC) education by leveraging the capabilities of Large Language Models (LLMs). The primary focus is on developing an interactive training environment where LLMs are employed to dynamically generate tailored instructional content for HPC courses. Additionally, the study explores the proficiency of LLMs in providing coding support, assessing the quality of their output, and discerning their effectiveness in facilitating a seamless learning experience.
Utilizing advanced analytics on highly regulated health data is an important topic. It can be used to search for unknown correlations between different biomarkers to predict certain events, like a septic shock, or to detect an early onset of dementia in very mildly impaired patients. These techniques require large data sets and correspondingly scalable compute infrastructure on the one hand, but also the highest security standards on the other. Normal HPC systems do not provide this trusted compute environment. Within this thesis, an existing SecureHPC platform is extended.
This thesis delves into the realm of computer science education with a particular focus on High Performance Computing (HPC). Rather than implementing new tools, the research centers on the field of didactics, aiming to explore and assess various pedagogical concepts applied to existing HPC training materials. Leveraging Machine Learning tools, this study seeks to identify prevalent didactic approaches, analyze their effectiveness, and ascertain which strategies prove most promising. This work is tailored for those with an interest in computer science education, emphasizing the importance of refining instructional methods in the dynamic and evolving landscape of High Performance Computing.
This thesis focuses on the evolution of the certification processes within the High Performance Computing (HPC) domain, specifically addressing the adaptation and porting of an existing prototype from the HPC Certification Forum. The objective is to redefine, optimize and automate the certification procedures, emphasizing the validation of knowledge and skills in HPC. The study involves the redevelopment of the prototype to align with current industry standards and technological advancements. By undertaking this project, the research aims to contribute to the establishment of robust and up-to-date certification mechanisms and standards that effectively assess and endorse competencies in the dynamic field of High Performance Computing.
Participating in courses is time-consuming and requires an active teacher. Especially if these courses cover basic learning topics, this approach is very work-intensive. A more feasible approach would be to design self-study material, host it in the documentation at docs.hpc.gwdg.de, and link it in an FAQ. The work to be done here is to analyse the courses we teach and evaluate their suitability as self-study courses. The aim of this work is to transfer one or two courses to, for example, Jupyter notebooks and test their effectiveness. The results should be a guideline on which courses can benefit from the self-study approach, together with a guide on how to implement such a course.
In this day and age, everybody has heard of the Cloud, is using cloud services and most people know that you can deploy parallel applications on cloud infrastructure. Meanwhile, HPC is still stuck in its narrow niche of a select few power users and experts. Few everyday people even know what HPC means. It is easy to get access to large amounts of computing power by renting time on various cloud services. But how do applications deployed on a cloud service like the GWDG cloud compare to their twins deployed on HPC clusters in terms of performance? How well suited are different parallelization schemes to run on both systems? The goal of this project is to get some insight into these questions and benchmark a few applications to get concrete numbers, compare both approaches and present the results in an accessible and clear way.
While the HPC world is dominated by x86 architectures, RISC-V is a promising, evolving alternative. To prepare for RISC-V-based HPC and become familiar with architecture-specific details, StarFive VisionFive 2 eval boards have been procured [1]. These need to be configured to run Linux according to the documentation; compiler toolchains and libraries need to be set up and tested; and a benchmark or other proof of operability performed. Familiarity with electronics equipment is beneficial; knowledge of the Linux command line is a must. [1] https://www.heise.de/hintergrund/RiscV-Board-Erste-Schritte-mit-Starfive-Visionfive-2-7444668.html
While the data analytics tool Apache Spark has been available on GWDG systems for multiple years, Dask is an emerging topic. Spark is primarily used with Scala (and supports Python as well); Dask, on the other hand, is part of the Python ecosystem. The project proposal is to compare the deployment methods on an HPC system (via Slurm in our case) and the monitoring possibilities and tooling available, and to develop, run, and evaluate a concrete application example on both platforms.
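A minimal sketch of the Dask deployment path via Slurm using dask-jobqueue; the partition name and resource settings are placeholders, not the actual cluster configuration:

```python
# Hedged sketch: start Dask workers as Slurm jobs and run a toy computation.
from dask_jobqueue import SLURMCluster
from dask.distributed import Client
import dask.array as da

cluster = SLURMCluster(
    queue="medium",          # placeholder partition name
    cores=16,                # cores per Slurm job / Dask worker
    memory="32GB",
    walltime="01:00:00",
)
cluster.scale(jobs=4)        # submit 4 Slurm jobs as Dask workers
client = Client(cluster)

# Toy workload: mean of a large random array, computed across the workers.
x = da.random.random((100_000, 10_000), chunks=(10_000, 10_000))
print(x.mean().compute())
```

The Spark side of the comparison would require a corresponding standalone-cluster launch inside a Slurm allocation, which is one of the deployment differences the project would document.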
We want to develop an AI model that predicts the best-suited technical supporter for each newly submitted question in a technical support system. We assume that every case in the past has been solved by a single supporter, entirely independently. Based on the historical communications of each case with its supporter, we will use the attention mechanism to understand the contextual meaning of these conversations, so that we can treat this supervised NLP task like a normal classification task. After the model has been implemented, we will explore its best hyperparameters for time and accuracy performance and export it as an ONNX file. We will then attempt to execute the ONNX file in different ONNX runtimes, on GPU and CPU, for retraining (with respect to variations in time consumption) and for inference (with respect to variations in accuracy), in order to assess portability and interoperability. Our task is to explore the maximum compatibility of our ONNX file across different ONNX runtimes.
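A hedged sketch of the export-and-inference part of this pipeline, with a simple stand-in network instead of the actual attention-based classifier:

```python
# Hedged sketch: export a (stand-in) PyTorch classifier to ONNX and run it
# with onnxruntime; swap the execution provider to compare CPU and GPU.
import torch
import onnxruntime as ort

model = torch.nn.Sequential(            # stand-in for the trained classifier
    torch.nn.Linear(256, 64), torch.nn.ReLU(), torch.nn.Linear(64, 8))
model.eval()
dummy = torch.randn(1, 256)
torch.onnx.export(model, dummy, "supporter_classifier.onnx",
                  input_names=["features"], output_names=["logits"],
                  dynamic_axes={"features": {0: "batch"}})

# Use ["CUDAExecutionProvider"] instead to run on GPU.
sess = ort.InferenceSession("supporter_classifier.onnx",
                            providers=["CPUExecutionProvider"])
logits = sess.run(["logits"], {"features": dummy.numpy()})[0]
print(logits.argmax(axis=1))            # predicted supporter class
```

The compatibility study would repeat this inference step across the different ONNX runtimes and versions under consideration.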
1) Theory: To explore and present "Efficient ML Workload Mapping Techniques" in a heterogeneous HPC landscape. 2) Practical: Data collection and demonstration of an optimized workflow based on a sample workload.
1) Theory: To explore and present "Efficient ML Workload Scheduling Techniques" in a heterogeneous HPC landscape. 2) Practical: Data collection and demonstration of an optimized workflow based on a sample workload.
1) Theory: To explore and present how the HPC resource mapping problem can be solved using a quantum approach. 2) Practical: To prepare a toy model with an HPC workload use case, try the approach, and evaluate the results.
Evolutionary algorithms are an established means for optimization tasks in a variety of fields. An existing code used for molecular clusters shall be investigated, now with a simpler target system, with regard to, e.g., alternative parallelization schemes, more efficient operators, and better convergence behavior of the optimization routines used therein.
Improving I/O performance is critical for optimizing the overall efficiency of HPC systems. Enhanced I/O performance leads to faster data processing, which is crucial for AI workloads that require quick and efficient handling of large datasets. The thesis should focus on analyzing different I/O metrics in HPC systems and/or developing models to predict I/O performance, especially under AI workloads, which are data-intensive and have unique I/O patterns. The ultimate goal is to improve the performance and reliability of HPC systems, making them more effective for advanced computational tasks, including AI applications.
This master's thesis explores the cutting-edge domain of real-time medical image processing, focusing on secure inference using MRI/CT scan data. The study encompasses segmentation/detection/classification of MRI/CT data, leveraging datasets from UMG and other public sources. The core of the research involves training deep learning models on the SCC/Grete cluster, followed by real-time inference on various high-performance computing (HPC) systems. A comparative analysis of inference performance across these HPC systems forms a crucial part of this investigation. The thesis aims to contribute significant insights into the optimization of real-time medical image processing in secure environments, adhering to stringent data privacy standards. This research necessitates a master’s student with a background in applying deep learning to image data and some proficiency in PyTorch or TensorFlow.
In phylogenetic tree reconstructions, we describe the evolutionary relationship between biological sequences in terms of their shared ancestry. To reconstruct such a tree, multiple approaches exist, including maximum likelihood and Bayesian methods. Among the most commonly used implementations of these methods are RAxML and MrBayes, both of which are available on SCC. In this project, you will identify a suitable benchmarking suite and use it to benchmark RAxML and MrBayes on SCC.
In phylogenetic tree reconstructions, we describe the evolutionary relationship between biological sequences in terms of their shared ancestry. To reconstruct such a tree, multiple approaches exist, including maximum likelihood and Bayesian methods. Among the most commonly used implementations of these methods are RAxML and MrBayes, both of which are available on SCC. In this project, you will identify and establish a typical workflow on SCC, from data management to documentation. This project is especially suitable for students enrolled in the Computer Science (M.Ed.) programme.
Proteins are involved in every biological process in every living cell. To assess exactly how a protein functions, knowing its amino acid sequence alone is not enough. Instead, its three-dimensional structure needs to be determined as well. In the last few years, a number of AI-based approaches have been put forward. In this project, you will compare and benchmark the performance of AlphaFold and alternative models on the SCC.
The naive simulation of interacting condensed-matter systems is an ocean-boiling problem because of the exponential growth of the Hilbert space dimension. This offers a great opportunity to apply a range of analytical approximations and advanced numerical methods in HPC.
High-Performance Computing (HPC) systems generate vast amounts of monitoring data, which, when effectively analyzed, can provide critical insights into system performance, resource utilization, and potential issues. This project delves into the creation and implementation of a data analysis pipeline tailored for HPC monitoring data. The key components and stages of the pipeline, including data collection, preprocessing, storage solutions, and advanced analytics will be studied.
Large-scale HPC environments provided as a service are dynamically changing environments using a range of different technologies and resources, and HPC operation is ensured by staff with a broad range of competences. Systems ensuring cyber as well as physical security significantly influence, at all levels, how HPC and its supporting systems are used and supervised. Monitoring and reporting are required by everyone who comes into contact with HPC: users want to monitor the processing of their jobs and any problems; procurement requires the information necessary for further planning of investments and operating expenses; the estimated workload for managing the HPC resource is data required by the HR department. Failing technical components of HPC and its infrastructure can cause significant damage or disable services.

In an ideal world, HPC systems would be designed and operated flawlessly, and users, admins, and vendors would not make mistakes; it would not be necessary to invest in additional processes, equipment, and human resources to watch these systems. But just as there is no ideal technical solution and operation, there is no ideal monitoring either. It is necessary to find ways not only to technically monitor the HPC environment and its infrastructure, but also to present and use the obtained information in a real human environment, and to find the right border between the theoretically possible and the realistically useful. To do this, one must properly understand what is needed, how to monitor, and how to structure the computing resource monitoring service itself. Monitoring and reporting are therefore an integral part of the ITIL and ITSM methodologies [6].

[1] ISO 9001 https://www.iso.org/ [2] ISO 14001 https://www.iso.org/ [3] ISO 22301 https://www.iso.org/ [4] ISO 27001 https://www.iso.org/ [5] ISO 27004 https://www.iso.org/ [6] ITIL (v4) https://en.wikipedia.org/wiki/ITIL
During quantum computations, unwanted disturbances from the environment and imperfect hardware can cause errors, which materialize as noise in the results. These disturbances can be electromagnetic radiation, material vibrations, the earth's magnetic field, etc. This topic focuses on developing a strong understanding of noise in quantum computations through the use of different quantum simulators. Students will review research on noise, run different simulations, present models, and potentially benchmark a selection of the quantum software packages available for noise simulation. A lot of research is currently being done on this topic, with great resources available from, for example, Qiskit, Q-CTRL and QuTiP.
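As a taste of what such simulations look like, here is a minimal sketch using Qiskit Aer (assuming the qiskit and qiskit-aer packages): a Bell-state circuit is run with a simple depolarizing noise model, so the injected gate noise becomes visible as unexpected '01'/'10' counts.

```python
# Hedged sketch: simulate a Bell-state circuit under depolarizing gate noise.
from qiskit import QuantumCircuit, transpile
from qiskit_aer import AerSimulator
from qiskit_aer.noise import NoiseModel, depolarizing_error

noise = NoiseModel()
noise.add_all_qubit_quantum_error(depolarizing_error(0.01, 1), ["h"])   # 1-qubit gates
noise.add_all_qubit_quantum_error(depolarizing_error(0.05, 2), ["cx"])  # 2-qubit gates

qc = QuantumCircuit(2, 2)
qc.h(0)
qc.cx(0, 1)
qc.measure([0, 1], [0, 1])

sim = AerSimulator(noise_model=noise)
counts = sim.run(transpile(qc, sim), shots=4000).result().get_counts()
print(counts)   # ideally only '00'/'11'; '01'/'10' counts reveal the noise
```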
Quantum computers are beginning to show progress in broader areas of interest as the hardware becomes more practical. However, qubits (quantum bits) have error rates that accumulate and degrade computational accuracy in long-running applications. Besides improving the hardware to combat errors, software techniques exist as well. This topic aims to understand and compare the implementation of error correction and error mitigation. Students will develop implementations of both concepts using noisy quantum simulators to drastically reduce the error in large quantum circuits. This research is ideal for students with a curious mind for a novel computing style, quantum 'wizardry', optimization, and no fear of linear algebra.
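One possible entry point on the error-correction side, sketched under the assumption that Qiskit Aer serves as the noisy simulator: a 3-qubit bit-flip repetition code whose majority-vote readout suppresses a physical flip probability p to roughly 3p².

```python
# Hedged sketch: 3-qubit bit-flip repetition code under bit-flip noise that
# acts during idle ('id') gates; majority voting decodes the logical value.
from qiskit import QuantumCircuit, transpile
from qiskit_aer import AerSimulator
from qiskit_aer.noise import NoiseModel, pauli_error

p = 0.05
flip = pauli_error([("X", p), ("I", 1 - p)])
noise = NoiseModel()
noise.add_all_qubit_quantum_error(flip, ["id"])

qc = QuantumCircuit(3, 3)
qc.x(0)                      # prepare logical |1>
qc.cx(0, 1); qc.cx(0, 2)     # encode into |111>
for q in range(3):
    qc.id(q)                 # noisy channel on each physical qubit
qc.measure(range(3), range(3))

sim = AerSimulator(noise_model=noise)
shots = 10_000
counts = sim.run(transpile(qc, sim, optimization_level=0),
                 shots=shots).result().get_counts()

# Logical error: majority of qubits read 0 although logical |1> was encoded.
logical_errors = sum(c for bits, c in counts.items() if bits.count("1") <= 1)
print("logical error rate:", logical_errors / shots)   # ~3*p**2, well below p
```

Error mitigation, by contrast, leaves the circuit unencoded and post-processes the noisy results (e.g. zero-noise extrapolation); comparing both regimes is the core of the topic.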
Quantum graphs can simulate the mutual influence that nearby objects exert on each other, e.g. planets with gravity, molecules with electric charge, pizza restaurants with sales, cell towers with network coverage, etc. Optimizing such a graph, where possible, has many commercial applications. This experimental topic aims to understand how fluctuations in a network can occur and how to construct an optimized network instead. Students will construct weighted graphs and use quantum-mechanical effects such as entanglement to simulate the network effect. This research is ideal for students interested in graph theory, basic quantum computing and optimization.
Objective: The objective of this thesis is to develop and evaluate methods and tools to harden OpenStack environments for customers and improve their security.

Tasks:
Vulnerability analysis: Perform a detailed analysis of OpenStack environments to identify open ports, unused services, and other potential vulnerabilities, using tools such as Nmap, Nessus, or OpenVAS.
Automatic detection of security vulnerabilities: Develop an automated process for detecting security vulnerabilities in OpenStack environments, e.g. using tools such as Ansible, SaltStack, or Puppet.
Wazuh for CVE detection: Integrate Wazuh, an open-source security information and event management (SIEM) system, into the OpenStack environment to detect and track Common Vulnerabilities and Exposures (CVEs).
Hardening of OpenStack components: Perform hardening measures for OpenStack components such as Nova, Neutron, Cinder, and Keystone to improve their security.
Security policy development: Develop security policies and procedures for the OpenStack environment to ensure that all security measures are implemented consistently and effectively.

Methodology:
Literature review: Conduct a comprehensive literature review on OpenStack security, hardening, and CVE detection.
Experimental work: Conduct experiments and tests in a lab environment to evaluate the developed methods and tools.
Case study: Perform a case study in a real OpenStack environment to demonstrate the practicability of the developed methods and tools.