Open Theses
PhD
MSc & BSc
The theses offered below are intended as MSc theses but can also be reduced in scope and offered as BSc theses.
Characterizing HPC storage systems
HPC storage systems exhibit complex behavior that is unfortunately often not well understood. As part of this work, the student would execute various storage benchmarks on different storage systems at GWDG and aim to understand their performance, creating a characterization of these systems. We will also document the results with the aim of producing a publication.
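As a rough illustration, the kind of baseline measurement such a characterization starts from can be sketched in a few lines of Python. The scratch path and transfer sizes below are placeholders; a real study would build on established tools such as IOR or fio.

```python
# Minimal sketch of a sequential-write micro-benchmark; the target path,
# block size and block count are illustrative placeholders.
import os
import time

TARGET = "/scratch/benchmark.tmp"   # hypothetical path on the system under test
BLOCK = 4 * 1024 * 1024             # 4 MiB per write
BLOCKS = 256                        # 1 GiB in total

buf = os.urandom(BLOCK)
start = time.perf_counter()
with open(TARGET, "wb") as f:
    for _ in range(BLOCKS):
        f.write(buf)
    f.flush()
    os.fsync(f.fileno())            # make sure the data actually reaches storage
elapsed = time.perf_counter() - start

print(f"sequential write: {BLOCK * BLOCKS / elapsed / 1e6:.1f} MB/s")
os.remove(TARGET)
```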
Containers for Parallel Applications
Parallel applications on HPC systems often rely on system-specific MPI (Message Passing Interface) and interconnect libraries, for example for Infiniband or OmniPath networks. This partially offsets one of the main advantages of containerizing such applications, namely portability between different platforms. The goal of this project is to evaluate different ways of integrating system-specific communication libraries into containers, allowing these containers to be ported to a different platform with minimal effort. A PoC should be implemented and benchmarked against running natively on a system.
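A minimal sketch of the final benchmarking step could look like the following, assuming the host MPI launcher starts one container process per rank (a common pattern); the application binary, rank count and image name are placeholders.

```python
# Timing harness comparing a native MPI run against the same application
# inside a Singularity/Apptainer container; paths and names are hypothetical.
import subprocess
import time

RANKS = 4
NATIVE = ["mpirun", "-np", str(RANKS), "./mpi_app"]
CONTAINER = ["mpirun", "-np", str(RANKS),
             "singularity", "exec", "mpi_app.sif", "/opt/app/mpi_app"]

def timed(cmd):
    start = time.perf_counter()
    subprocess.run(cmd, check=True)
    return time.perf_counter() - start

print(f"native:    {timed(NATIVE):.2f} s")
print(f"container: {timed(CONTAINER):.2f} s")
```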
Characterizing Neuromorphic Hardware
The SpiNNaker 2 processor is a platform for bio-inspired algorithms and AI applications, and we have some sample hardware available. As part of this project, the system's behavior and performance should be characterized and captured in system models. While various frameworks will be supported at later stages, working with the system currently requires C knowledge.
Fixing Shortcomings of Kubernetes Serverless Technologies
Serverless Computing or Function-as-a-Service (FaaS) has emerged as a new paradigm for computing over the last few years. A number of open-source FaaS platforms are based on Kubernetes, as the container orchestration platform maps well to the components required for FaaS. However, most approaches to FaaS are still relatively naive and leave many performance improvements on the table. This work focuses on said limitations and aims to solve at least one of them and implement a proof of concept. Finally, the performance improvements should be benchmarked in a virtualized environment and on the HPC system.
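One typical limitation is cold-start latency after scale-to-zero. A minimal sketch of how such an effect can be measured against a deployed function is shown below; the gateway URL is a placeholder for whatever the chosen platform (e.g. OpenFaaS or Knative) exposes.

```python
# Cold-start vs. warm-start latency measurement; endpoint is hypothetical.
import statistics
import time
import requests

URL = "http://gateway.example.com/function/hello"  # placeholder endpoint

def invoke():
    start = time.perf_counter()
    r = requests.get(URL, timeout=60)
    r.raise_for_status()
    return time.perf_counter() - start

cold = invoke()                       # first call after scale-to-zero
warm = [invoke() for _ in range(20)]  # subsequent calls hit a warm replica
print(f"cold start: {cold*1000:.0f} ms, "
      f"warm median: {statistics.median(warm)*1000:.0f} ms")
```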
Evaluating the Capabilities of K8SGPT
K8SGPT (https://k8sgpt.ai/) is a Kubernetes tool that can use the OpenAI API or self-hosted AI APIs (such as LocalAI, https://github.com/go-skynet/LocalAI) to analyse a given cluster. On paper this sounds great, as it promises to find and explain relevant issues within the complexity of a K8s cluster. But how capable is it really? What limitations apply, what is the overhead, and how do the OpenAI API and LocalAI compare? For this topic, evaluation methods should be developed and applied to test clusters. Finally, a recommendation should be given on which use cases can benefit from K8SGPT and which cannot.
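A simple starting point for the overhead comparison could be a wrapper that times the analysis per backend, along the lines of the sketch below. The flags used here (--explain, --backend, --output) are taken from the k8sgpt documentation and should be verified against the installed version.

```python
# Hedged sketch of a per-backend timing harness around the k8sgpt CLI.
import subprocess
import time

def run_analysis(backend):
    start = time.perf_counter()
    result = subprocess.run(
        ["k8sgpt", "analyze", "--explain", "--backend", backend,
         "--output", "json"],
        capture_output=True, text=True, check=True)
    return time.perf_counter() - start, result.stdout

for backend in ("openai", "localai"):
    elapsed, report = run_analysis(backend)
    print(f"{backend}: {elapsed:.1f} s, {len(report)} bytes of findings")
```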
Governance for a Data Lake
In order to give full data sovereignty to the individual user and to enforce compliance with policies such as the GDPR for personal data, reliable governance is required that enforces a homogeneous policy across all involved systems, such as a cloud environment, an HPC system, or network-accessible storage.
Developing a Benchmark for Research Data Management Systems
With the continuous increase in research data and the success of data-driven methods, research data management systems (RDMS) are becoming increasingly important to ensure best scientific practices. Particularly for HPC users, these systems are not yet part of the daily toolbox. However, HPC admins are increasingly confronted with users requesting specific RDMS. To tackle this issue in a broader and more generic way, it is desirable to systematically benchmark different systems in order to draw conclusions about their general applicability in HPC. In this work, such a novel RDMS benchmark is developed, or composed from existing ones.
Implementing Enforced Data Management Plans for Data Projects on HPC Systems
Along with the increase in available compute power of high-performance computing (HPC) systems and the success of novel data-driven methods, the amount of data processed and the number of user groups increase as well. This has given rise to two big challenges: On the one hand, the traditional interaction scheme of users with modern HPC systems becomes more and more unsuited to dealing with large data sets and many independent tasks working on these data sets. This highly manual way of working can quickly lead to unreproducible results and to data loss due to missing backups, since the data is stored fragmented across multiple storage tiers. On the other hand, domain-specific data management systems have been established to ease the burden of data and process management, particularly for inexperienced users; these systems, however, only offer a very rigid, tool-specific interaction scheme. This has resulted in a gap between these two user groups, which even hinders large-scale collaborations across different domains. In this work you will implement an automated and actionable Data Management Plan (DMP) tool, make a qualitative assessment of the DMPs, and run quantitative I/O benchmarks to analyse the data placement strategies.
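To give a feel for what "actionable" can mean in practice, here is a minimal sketch in which a machine-readable DMP (plain JSON, with illustrative field names) is validated before a data project is set up on the HPC system; the actual schema would be part of the thesis.

```python
# Hedged sketch: gate project creation on a valid, machine-readable DMP.
import json

REQUIRED = {"project_id", "responsible_person", "storage_tier",
            "retention_period_days", "backup_required"}

def validate_dmp(path):
    with open(path) as f:
        dmp = json.load(f)
    missing = REQUIRED - dmp.keys()
    if missing:
        raise ValueError(f"DMP is missing required fields: {sorted(missing)}")
    return dmp

# dmp = validate_dmp("project_dmp.json")  # would run before project creation
```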
Scientific container monitoring: Development of a tool to monitor a vast number of scientific containers in a multi-node, multi-user HPC environment
In the realm of scientific computing, containerization is gaining ever-growing relevance, as the advantages of containerized applications, especially in the multi-user environment of an HPC system, are numerous: encapsulation of dependencies, ease of portability to other systems, ease of deployment, and much more. Yet, while a multitude of container runtimes and container management solutions exist by now, HPC monitoring software that specifically takes containerized applications into account, and that can generate and display monitoring data that is "container-aware" and resolves down to the level of individual containers within individual compute nodes, is still very much lacking. In this thesis, you will develop your own HPC monitoring software that specifically targets containerized applications on the GWDG's HPC systems. Your software will then be deployed on the Scientific Compute Cluster (SCC) of the GWDG to monitor the containers that researchers are running on it, and you will analyze which additional insights administrators of HPC systems can gain if their monitoring software is "container-aware".
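The node-local part of such a collector can be quite small; the sketch below reads per-cgroup memory usage, assuming a cgroup v2 hierarchy. The exact path layout depends on the container runtime in use and is an assumption here.

```python
# Sketch of a node-local, "container-aware" metric probe (cgroup v2 assumed).
from pathlib import Path

CGROUP_ROOT = Path("/sys/fs/cgroup")   # unified cgroup v2 hierarchy

def cgroup_memory_usage():
    """Yield (cgroup name, memory.current in bytes) for all readable cgroups."""
    for mem_file in CGROUP_ROOT.rglob("memory.current"):
        try:
            yield mem_file.parent.name, int(mem_file.read_text())
        except (OSError, ValueError):
            continue  # cgroup vanished or is not readable

for name, used in sorted(cgroup_memory_usage(), key=lambda x: -x[1])[:10]:
    print(f"{used / 2**20:10.1f} MiB  {name}")
```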
Benchmarking phylogenetic tree reconstructions
In phylogenetic tree reconstructions, we describe the evolutionary relationship between biological sequences in terms of their shared ancestry. To reconstruct such a tree, multiple approaches exist, including maximum likelihood and Bayesian methods. Among the most commonly used implementations of these methods are RAxML and MrBayes, both of which are available on SCC. In this project, you will identify a suitable benchmarking suite and use it to benchmark RAxML and MrBayes on SCC.
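As an illustration of what a scaling measurement for one of these tools could look like, the sketch below times RAxML with varying thread counts. The alignment file and model are placeholders, and the exact binary name depends on the module installed on SCC (e.g. raxmlHPC-PTHREADS-SSE3 or -AVX2).

```python
# Hedged sketch of a thread-scaling run for RAxML; input and binary name
# are assumptions to be replaced by the actual setup on SCC.
import subprocess
import time

ALIGNMENT = "alignment.phy"   # hypothetical input alignment
for threads in (1, 2, 4, 8, 16):
    cmd = ["raxmlHPC-PTHREADS", "-T", str(threads),
           "-s", ALIGNMENT, "-n", f"bench_T{threads}",
           "-m", "GTRGAMMA", "-p", "12345"]
    start = time.perf_counter()
    subprocess.run(cmd, check=True)
    print(f"{threads:2d} threads: {time.perf_counter() - start:.1f} s")
```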
Prototyping common workflows in phylogenetic tree reconstructions
In phylogenetic tree reconstructions, we describe the evolutionary relationship between biological sequences in terms of their shared ancestry. To reconstruct such a tree, multiple approaches exist, including maximum likelihood and Bayesian methods. Among the most commonly used implementations of these methods are RAxML and MrBayes, both of which are available on SCC. In this project, you will identify and establish a typical workflow on SCC, from data management to documentation. This project is especially suitable for students enrolled in the Computer Science (M.Ed.) programme.
Benchmarking AlphaFold and alternative models for protein structure prediction
Proteins are involved in every biological process in every living cell. To assess how a protein functions exactly, knowing its amino acid sequence alone is not enough. Instead, its three-dimensional structure needs to be determined as well. In recent years, a number of AI-based approaches have been put forward. In this project, you will compare and benchmark the performance of AlphaFold and alternative models on the SCC.
Upscaling single cell analysis using the HPC
In single cell analyses the focus is, as the name suggests, on the individual cell. For example, the set of expressed genes can be determined via RNA-Seq. Monocle3 is a widely used toolkit for analysing such data and is available on SCC. In this project, you will benchmark the existing single cell analysis pipeline at GWDG, with particular attention to the potential for upscaling.
Personalized Medicine using the HPC
The project aims to leverage the power of HPC and NLP to revolutionize personalized medicine by developing targeted treatments tailored to each patient's unique genetic makeup and characteristics. To achieve this goal, the project will start by using data from Kaggle to differentiate between mutations that promote tumor growth and those that do not in malignant tumors. Currently, the manual interpretation of genetic alterations is time-consuming and requires a clinical pathologist to manually examine and categorize each genetic mutation using data from text-based clinical literature. To address this challenge, the project will develop algorithms that categorize genetic variants based on clinical evidence. After developing these algorithms, the project will focus on creating an NLP-based personalized medicine application that can run on the HPC cluster. The application will process large volumes of medical records, genomic data, and other relevant data sources to generate personalized treatment recommendations for patients. Finally, the project will evaluate and optimize the performance of the application in terms of speed, scalability, and efficiency. This will involve testing the application on a variety of datasets and scenarios, identifying bottlenecks, and implementing optimizations to improve its performance. Overall, the project's goal is to leverage the power of HPC and NLP to revolutionize personalized medicine and improve patient outcomes. By developing an automated, data-driven approach to personalized treatment, the project aims to reduce the time and cost of treatment while improving its effectiveness.
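A possible first baseline for the variant-classification step is a classical text-classification pipeline, sketched below. The file and column names follow the layout of the Kaggle "Personalized Medicine" dataset but should be treated as assumptions and adapted to the data actually used.

```python
# Hedged sketch of a TF-IDF + logistic-regression baseline for classifying
# genetic variants from clinical text; dataset layout is an assumption.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

texts = pd.read_csv("training_text", sep=r"\|\|", engine="python",
                    skiprows=1, names=["ID", "Text"])
labels = pd.read_csv("training_variants")      # assumed columns: ID, Gene, Variation, Class
data = texts.merge(labels, on="ID")

model = make_pipeline(
    TfidfVectorizer(max_features=50000, stop_words="english"),
    LogisticRegression(max_iter=1000))
scores = cross_val_score(model, data["Text"], data["Class"], cv=3)
print("cross-validated accuracy:", scores.mean())
```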
Benchmarking Applications on Cloud vs. HPC Systems
In this day and age, everybody has heard of the Cloud, most people use cloud services, and many know that parallel applications can be deployed on cloud infrastructure. Meanwhile, HPC is still stuck in its narrow niche of a select few power users and experts. Few everyday people even know what HPC means. It is easy to get access to large amounts of computing power by renting time on various cloud services. But how do applications deployed on a cloud service like the GWDG cloud compare to their twins deployed on HPC clusters in terms of performance? How well suited are different parallelization schemes to run on both systems? The goal of this project is to get some insight into these questions, benchmark a few applications to get concrete numbers, compare both approaches, and present the results in an accessible and clear way.
Parallelization of Iterative Optimization Algorithms for Image Processing using MPI
We are finalizing the topic; details will be provided on request.
Performance Analysis of Generative Neural Networks
We are finalizing the topic; details will be provided on request.
Containerizing On-Premise HPC Services with Singularity
Containerized applications are becoming more popular than ever. One of the biggest problems when compiling software on an operating system is breaking the dependencies of other installed software. Containers encapsulate an application as a single executable package of software that puts the application code together with all of the related configuration files, libraries, and dependencies required for it to run. They also maximize scalability and flexibility during the deployment process, which are among the most important points of DevOps culture. This project aims to containerize one of the most popular HPC software packages, Slurm, so that Slurm can be easily deployed and upgraded via CI/CD pipelines without compatibility issues.
Monitoring of service utilization using DevOps strategies
Accurate accounting of service usage statistics is vital to the prioritization of offered utilities and resources. This accounting is not a trivial task, since each service presents its data heterogeneously. The work consists of setting up scripts to gather, centralize, and process the access data of a number of different web-accessible services, and of implementing an interactive web solution for displaying the data to the end user (i.e., service administrators). The implemented solutions need to be maintainable, extendable, and robust.
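The gathering and normalization step could be structured roughly as in the sketch below: each service gets a small adapter that emits usage records in one common schema, which can then be stored centrally and served to a dashboard. Service names, adapters, and the output file are placeholders.

```python
# Hedged sketch of per-service adapters normalizing usage data into one schema.
import datetime
import json

def fetch_gitlab_usage():
    # placeholder: would query the service's API or parse its access logs
    return [{"service": "gitlab", "user": "jdoe", "count": 42}]

def fetch_jupyter_usage():
    return [{"service": "jupyter", "user": "jdoe", "count": 7}]

ADAPTERS = [fetch_gitlab_usage, fetch_jupyter_usage]

records = []
for adapter in ADAPTERS:
    for rec in adapter():
        rec["date"] = datetime.date.today().isoformat()
        records.append(rec)

with open("usage.json", "w") as f:   # stand-in for a central database
    json.dump(records, f, indent=2)
```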
Evolutionary Algorithm for Global Optimization
Evolutionary algorithms are an established means for optimization tasks in a variety of fields. An existing code used for molecular clusters shall be investigated on a now simpler target system with regard to, e.g., a different parallelization scheme, more efficient operators, and better convergence behavior of the optimization routines used therein.
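For orientation, the basic loop underlying such codes is shown below as a toy (mu + lambda) evolution strategy minimizing a simple quadratic instead of a molecular-cluster energy; the existing code is, of course, considerably more involved.

```python
# Toy sketch of a (mu + lambda) evolutionary loop on a stand-in fitness function.
import random

def fitness(x):
    return sum(xi * xi for xi in x)        # stand-in for the energy function

def mutate(x, sigma=0.1):
    return [xi + random.gauss(0.0, sigma) for xi in x]

MU, LAMBDA, DIM = 10, 40, 6
population = [[random.uniform(-5, 5) for _ in range(DIM)] for _ in range(MU)]

for generation in range(200):
    offspring = [mutate(random.choice(population)) for _ in range(LAMBDA)]
    population = sorted(population + offspring, key=fitness)[:MU]

print("best fitness:", fitness(population[0]))
```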
Integration of HPC systems and Quantum Computers
Especially in the noisy intermediate-scale quantum computing era, hybrid quantum-classical approaches are among the most promising to achieve some early advantages over classical computing. For these approaches an integration with HPC systems is mandatory. The goal of this project is to design and implement a workflow that allows running hybrid codes on our HPC systems and, as a first step, on quantum computing simulators, to extend this to cloud-available real quantum computers, and to provide perspectives for future systems made available otherwise. Possible aspects of this work are Jupyter-based user interfaces, containerization, scheduling, and costs of hybrid workloads. The final result should be a PoC covering at least some important aspects.
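To illustrate the kind of workload such a workflow has to support, here is a minimal hybrid loop, assuming Qiskit with the Aer simulator is available: a classical parameter scan drives a one-qubit parameterized circuit, which is the basic pattern behind variational algorithms.

```python
# Hedged sketch of a hybrid quantum-classical loop on a simulator.
import math
from qiskit import QuantumCircuit, transpile
from qiskit_aer import AerSimulator

sim = AerSimulator()

def expectation_z(theta, shots=2000):
    qc = QuantumCircuit(1)
    qc.ry(theta, 0)
    qc.measure_all()
    counts = sim.run(transpile(qc, sim), shots=shots).result().get_counts()
    return (counts.get("0", 0) - counts.get("1", 0)) / shots

# classical outer loop: pick the angle minimizing <Z>
best = min((expectation_z(t), t)
           for t in [i * math.pi / 20 for i in range(21)])
print(f"min <Z> = {best[0]:.2f} at theta = {best[1]:.2f}")
```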
Quantum Machine Learning in Forestry
The ForestCare project uses machine learning methods to analyze diverse sets of data on forest health. Quantum machine learning (QML) is the application of quantum computers to machine learning tasks and is considered to have great potential. Therefore, testing the application of QML to new areas is of high interest for future applications, even if no realizable advantage of this approach is to be expected in the current era of noisy intermediate-scale quantum computers. In this explorative project the goal is to identify suitable representations of ForestCare data and QML methods to apply to this data, and to compare the results to the classical approach. This project requires some enthusiasm for understanding the workings of ML and QML methods and a willingness to be challenged.
Comparing the SeaweedFS Object Store to Established Reference Systems such as MinIO and Ceph
The S3 standard for object storage established by AWS has gained in popularity, so much so that other storage platforms have also implemented S3 backends. However, as these platforms each have their own implementation, they also have their own performance profile. How well can the relatively recent implementation in SeaweedFS keep up with battle-tested systems such as MinIO and Ceph? Another important factor is how complex a platform is to manage: MinIO is easier to manage than SeaweedFS, which in turn is less complex than Ceph.
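Because all three systems speak S3, one and the same micro-benchmark can be pointed at each of them; a minimal sketch using boto3 is shown below. The endpoint URL, credentials, and bucket name are placeholders.

```python
# Hedged sketch of an S3 PUT/GET micro-benchmark against any S3-compatible
# endpoint (SeaweedFS, MinIO, Ceph RGW); connection details are placeholders.
import time
import boto3

s3 = boto3.client("s3",
                  endpoint_url="http://seaweedfs.example.com:8333",
                  aws_access_key_id="ACCESS_KEY",
                  aws_secret_access_key="SECRET_KEY")

BUCKET, OBJECTS, SIZE = "bench", 100, 1024 * 1024
payload = b"x" * SIZE

start = time.perf_counter()
for i in range(OBJECTS):
    s3.put_object(Bucket=BUCKET, Key=f"obj-{i}", Body=payload)
put_time = time.perf_counter() - start

start = time.perf_counter()
for i in range(OBJECTS):
    s3.get_object(Bucket=BUCKET, Key=f"obj-{i}")["Body"].read()
get_time = time.perf_counter() - start

print(f"PUT: {OBJECTS * SIZE / put_time / 1e6:.1f} MB/s, "
      f"GET: {OBJECTS * SIZE / get_time / 1e6:.1f} MB/s")
```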
Backing Up a Live BeeGFS Filesystem via a Synchronized Snapshot
BeeGFS [1] is a parallel filesystem often used in HPC environments. At present, the only way to safely back up its contents is to completely shut it down [2] and then do the backup, creating downtime. This downtime can be minimized by using CoW filesystems for the backing storage and taking a snapshot, but it cannot yet be eliminated. For an HPC environment meant to run users' jobs around the clock, downtimes must be rare, which is incompatible with taking regular backups. The goal is to develop a scheme that makes all storage and meta-data servers perform a synchronized snapshot with at most a pause short enough that the BeeGFS clients and user applications do not crash. Then, another process (a secondary, read-only set of servers) could run over the snapshots to facilitate the actual backup. The simplest scenario is where there is only a single meta-data and a single storage buddy-pair, with two nodes each running one member of each pair. The more common scenario is where there are several meta-data and storage buddy-pairs. Can such a backup scheme be done for the simplest scenario? The more common scenario? [1] https://www.beegfs.io [2] https://doc.beegfs.io/latest/advanced_topics/backup.html#full-system-backup
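A very rough sketch of the coordination step alone is shown below, assuming the storage and meta-data targets sit on ZFS datasets and are reachable via SSH; host and dataset names are placeholders, and a real scheme would additionally have to quiesce BeeGFS briefly so the per-node snapshots are mutually consistent.

```python
# Hedged sketch: trigger snapshots on all backing ZFS datasets in parallel.
import subprocess
from concurrent.futures import ThreadPoolExecutor

TARGETS = {            # hypothetical host -> backing dataset mapping
    "meta01": "tank/beegfs_meta",
    "stor01": "tank/beegfs_storage",
    "stor02": "tank/beegfs_storage",
}

def snapshot(host, dataset, tag="backup"):
    return subprocess.run(
        ["ssh", host, "zfs", "snapshot", f"{dataset}@{tag}"],
        capture_output=True, text=True)

with ThreadPoolExecutor(len(TARGETS)) as pool:
    futures = {pool.submit(snapshot, h, d): h for h, d in TARGETS.items()}
    for fut, host in futures.items():
        print(host, "ok" if fut.result().returncode == 0 else "FAILED")
```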
How to efficiently access free earth observation data for data analysis on HPC Systems?
In recent years the amount of freely available Earth observation data has increased. Besides ESA's Sentinel mission [1] and NASA's Landsat mission [2], various open data initiatives have arisen. For example, several federal states in Germany publish geographical and Earth observation data, such as orthophotos or lidar data, free of charge [3,4]. However, one bottleneck at the moment is the accessibility of this data. Before analyzing it, researchers need to put a substantial amount of work into downloading and pre-processing it. Big platforms such as Google [5] and Amazon [6] offer these data sets, making working in their environments significantly more comfortable. To promote and simplify data analysis in Earth observation on HPC systems, approaches for convenient data access need to be developed. In the best case, the resulting data is analysis-ready so that researchers can directly jump into their research. The goal of this project is to explore the current state of available services and technologies (data cubes [7], INSPIRE [8], STAC [9]) and to implement a workflow that provides a selected data set to users of our HPC system. [1] https://sentinels.copernicus.eu/ [2] https://landsat.gsfc.nasa.gov/ [3] https://www.geoportal-th.de/de-de/ [4] https://www.geodaten.niedersachsen.de/startseite/ [5] https://developers.google.com/earth-engine/datasets [6] https://aws.amazon.com/de/earth/ [7] https://datacube.remote-sensing.org/ [8] https://inspire.ec.europa.eu/ [9] https://stacspec.org/
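As an illustration of STAC-based data discovery, the sketch below queries a public STAC API for Sentinel-2 scenes; it assumes the pystac-client package and the Earth Search endpoint, and the bounding box (roughly around Göttingen) and date range are arbitrary examples.

```python
# Hedged sketch of programmatic data discovery via a STAC API.
from pystac_client import Client

catalog = Client.open("https://earth-search.aws.element84.com/v1")
search = catalog.search(
    collections=["sentinel-2-l2a"],
    bbox=[9.8, 51.5, 10.0, 51.6],          # example area of interest
    datetime="2023-06-01/2023-06-30",
    max_items=5,
)
for item in search.items():
    print(item.id, item.properties.get("eo:cloud_cover"))
```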
Performance optimization of deep learning model training and inference
Recent advances in deep learning, such as image (Rombach et al. 2022) and text generation (OpenAI 2023), have led to an increase in the number of AI publications worldwide (Zhang et al. 2022). The breakthrough in deep learning is only possible because of evolving hardware and software that allows big data sets to be processed efficiently. Further, most of the accuracy gains result from increasingly complex models (Schwartz et al. 2019). From 2013 to 2019, the computing power required for training deep learning models increased by a factor of 300,000 (Schwartz et al. 2019). Therefore, performance optimization of deep learning model training and inference is highly relevant. Profiling with tools such as DeepSpeed [1] and the built-in PyTorch Profiler [2] helps identify an existing model's bottlenecks. Depending on the profiling results, different optimization strategies, such as data and model parallelism, could be applied. Further, tools such as PyTorch Lightning's trainer [3] and Horovod [4] can be tested to use the cluster's resources efficiently. [1] https://github.com/microsoft/DeepSpeed [2] https://pytorch.org/tutorials/recipes/recipes/profiler_recipe.html [3] https://lightning.ai/docs/pytorch/latest/accelerators/gpu_intermediate.html [4] https://github.com/horovod/horovod Dodge, Jesse et al. (2022). Measuring the Carbon Intensity of AI in Cloud Instances. doi: 10.48550/ARXIV.2206.05229. url: https://arxiv.org/abs/2206.05229. OpenAI (2023). GPT-4 Technical Report. arXiv: 2303.08774 [cs.CL]. Rombach, Robin et al. (2022). "High-Resolution Image Synthesis with Latent Diffusion Models". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). url: https://github.com/CompVis/latent-diffusion, https://arxiv.org/abs/2112.10752. Schwartz, Roy et al. (2019). "Green AI". In: CoRR abs/1907.10597. arXiv: 1907.10597. url: http://arxiv.org/abs/1907.10597. Zhang, Daniel et al. (2022). The AI Index 2022 Annual Report. arXiv: 2205.03468 [cs.AI].
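A minimal example of profiling a single training step with the built-in PyTorch profiler is sketched below; the tiny model is only a stand-in for the network actually under investigation.

```python
# Hedged sketch: profile one training step with torch.profiler.
import torch
from torch.profiler import profile, ProfilerActivity

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Sequential(torch.nn.Linear(1024, 1024), torch.nn.ReLU(),
                            torch.nn.Linear(1024, 10)).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
x = torch.randn(256, 1024, device=device)
y = torch.randint(0, 10, (256,), device=device)

activities = [ProfilerActivity.CPU]
if torch.cuda.is_available():
    activities.append(ProfilerActivity.CUDA)

with profile(activities=activities) as prof:
    loss = torch.nn.functional.cross_entropy(model(x), y)
    loss.backward()
    optimizer.step()

print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=10))
```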
Putting RISC-V Eval Boards with Linux and Toolchains into Operation
While the HPC world is dominated by x86 architectures, RISC-V is a promising, evolving architecture. To prepare for work with RISC-V-based HPC and to get familiar with architecture-specific details, StarFive VisionFive 2 eval boards have been procured [1]. These need to be configured to run Linux according to the documentation, compiler toolchains and libraries need to be set up and tested, and some benchmark or other proof of operability needs to be performed. Familiarity with electronics equipment and the Linux command line is an advantage. [1] https://www.heise.de/hintergrund/RiscV-Board-Erste-Schritte-mit-Starfive-Visionfive-2-7444668.html