Open Theses

The theses offered below are intended for MSc students but can also be reduced in scope and offered as BSc theses.

Running Kubernetes Workloads on HPC Systems

Kubernetes is a container orchestrator designed to run containerized, scalable applications on cloud systems. It is often used to run large data analytics or deep learning workloads. When transferring such workloads to HPC systems, users face the challenge of adapting their workflows to HPC-specific resource managers like Slurm. The goal of this project is to evaluate how to allow users to run their Kubernetes workloads natively with Slurm, or how to significantly simplify and automate the transfer of such workloads to the Slurm scheduler while retaining containerization. A proof of concept (PoC) should be implemented.

Containers for Parallel Applications

Parallel applications on HPC systems often rely on system-specific MPI (Message Passing Interface) and interconnect libraries, for example for InfiniBand or Omni-Path networks. This partially offsets one main advantage of containerizing such applications, namely portability between different platforms. The goal of this project is to evaluate different ways of integrating system-specific communication libraries into containers so that these containers can be ported to a different platform with minimal effort. A PoC should be implemented and benchmarked against running natively on a system.

Development of a provenance-aware ad-hoc interface for a data lake

In order to support the entire development life cycle of a data-driven project, an ad-hoc interface is needed for the first exploratory phase. Here, scientists want to import data from the data lake into, e.g., a Jupyter notebook to work with and explore their data and processes interactively. In order to securely save the results back into the data lake, it is of utmost importance to audit provenance information about the transformations performed on the user's local system. In this project, such an interface will be developed. This requires development on both sides, a web server as well as a local library, along with a scientific elaboration of the conceptual approach.

Semantic classification of metadata attributes in a data lake using machine learning

In order to find the data one is looking for in a heterogeneous data lake containing various data sets, an efficient querying mechanism is needed that detects synonyms and semantic relationships. Historically, this was done using ontologies. Using those, however, requires additional effort from the individual user. In this work, machine learning methods shall be explored to map individual attributes into a semantic space, where similarity analysis can be performed to find similar data sets.
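
The similarity analysis mentioned above can be illustrated with a minimal sketch. It assumes each attribute name is mapped to a vector and compares vectors by cosine similarity. The "embedding" here is a toy stand-in (character trigram counts), not a learned model, and the attribute names are hypothetical.

```python
from collections import Counter
from math import sqrt


def trigram_vector(name: str) -> Counter:
    """Toy embedding: counts of character trigrams in a padded, lowercased name."""
    padded = f"  {name.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(count * b[key] for key, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(query: str, attributes: list[str]) -> list[tuple[str, float]]:
    """Rank attribute names by similarity to the query, best first."""
    q = trigram_vector(query)
    scored = [(attr, cosine_similarity(q, trigram_vector(attr))) for attr in attributes]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical metadata attribute names from different data sets.
    attrs = ["temperature", "temp_celsius", "author", "creation_date"]
    for attr, score in most_similar("temperatur", attrs):
        print(f"{attr}: {score:.2f}")
```

Trigram counts only capture spelling variants. True synonyms with different spellings would need a learned semantic embedding in place of `trigram_vector`, which is the part this thesis is meant to explore.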

Governance for a Data Lake

In order to give full data sovereignty to the individual user and to enforce compliance with certain policies, such as the GDPR for personal data, reliable governance is required that enforces a homogeneous policy across all involved systems, such as a cloud environment, an HPC system, or network-accessible storage.

Enabling Parallel IO in eCryptFS

eCryptFS is a stacked filesystem for Linux that allows transparent encryption. However, it breaks when multiple nodes try to write to the same file. As part of this work, this issue will be fixed by implementing file locking and a common file header to enable parallel IO into the same file from multiple nodes.

Monitoring of service utilization using DevOps strategies

Accurate accounting of service usage statistics is vital to the prioritization of offered utilities and resources. This accounting is not a trivial task, since each service presents its data differently. The work will consist of setting up scripts to gather, centralize, and process the access data of a number of different web-accessible services, and of implementing an interactive web solution for displaying the data to the end user (i.e., service administrators). The implemented solutions need to be maintainable, extendable, and robust.

How to efficiently access free earth observation data for data analysis on HPC Systems?

In recent years, the amount of freely available Earth observation data has increased. Besides ESA's Sentinel mission and NASA's Landsat mission, various open data initiatives have arisen. For example, several federal states in Germany publish geographical and Earth observation data, such as orthophotos or lidar data, free of charge. However, one current bottleneck is the accessibility of this data. Before analyzing it, researchers need to put a substantial amount of work into downloading and pre-processing. Big platforms such as Google and Amazon offer these data sets, making working in their environments significantly more comfortable. To promote and simplify Earth observation data analysis on HPC systems, approaches for convenient data access need to be developed. In a best-case scenario, the resulting data is analysis-ready so that researchers can jump directly into their research. The goal of this project is to explore the current state of available services and technologies (data cubes, INSPIRE, STAC) and to implement a workflow that provides a selected data set to users of our HPC system.

Contributing unused HPC resources to grid computing projects using BOINC by backfilling

HPC clusters are usually highly utilized, but resources are often temporarily unused while the cluster aggregates free nodes for very large jobs. The aim of this project is to develop a concept for running the BOINC client on HPC compute nodes without an internet connection, via the Slurm scheduler, in order to use these resources.

Scientific container monitoring: Development of a tool to monitor a vast number of scientific containers in a multi-node, multi-user HPC environment

In scientific computing, containerization is becoming ever more relevant, as the advantages of containerized applications, especially in the multi-user environment of an HPC system, are numerous: encapsulation of dependencies, ease of portability to other systems, ease of deployment, and much more. Yet, while a multitude of container runtimes and container management solutions now exist, HPC monitoring software is still lacking that specifically takes containerized applications into account, that generates and displays "container-aware" monitoring data, and that can resolve down to individual containers within individual compute nodes. In this thesis, you will develop your own HPC monitoring software that specifically targets containerized applications on the GWDG's HPC system. Your software will then be deployed on the Scientific Compute Cluster (SCC) of the GWDG to monitor the containers that researchers are running on it, and you will analyze which additional insights administrators of HPC systems can gain if their monitoring software is "container-aware".

Characterizing HPC storage systems

HPC storage systems exhibit complex behavior that is unfortunately often not well understood. As part of this work, the student will execute various storage benchmarks on different storage systems at GWDG and aim to understand their performance, creating a characterization of these systems. We will also document the results and aim to create a publication.

Using containers to develop training/assessment scenarios for High-Performance Computing

Learning to use HPC systems is non-trivial for practitioners. The HPC Certification Forum aims to characterize the competencies and to develop scenarios for their assessment / examination. These are to be implemented in containers and will be deployed on ... The choice of competencies and learning scenarios is up to the student!

Text to image relation in digital collections

Books and magazines are digitized page by page. Once the scan is complete, the object is ingested as one item and can be viewed page by page. Each page is a single digital file. In reality, however, the content is not strictly text only: there are often figures and images inside the publication. There are also more image-heavy publications such as art magazines. The goal is to create an algorithm that identifies images and provides the bounding box for each identified image. Why this is interesting: the bounding boxes can be used to calculate a percentage value for the whole publication (e.g. 20 % text, 80 % image). If this is done for a whole series of a certain magazine, trends can be determined (e.g. fewer, but larger images). The images can also be displayed independently of the text and can be used for further searches.
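
As a rough illustration of the percentage calculation described above, the following sketch computes the share of page area covered by image bounding boxes. It assumes axis-aligned boxes in pixel coordinates that lie within the page. The `Box` type and function names are hypothetical, and the detection step itself (the actual thesis topic) is not shown.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box in pixel coordinates (x0 < x1, y0 < y1)."""
    x0: int
    y0: int
    x1: int
    y1: int


def union_area(boxes: list[Box]) -> int:
    """Area covered by at least one box, counting overlapping regions once."""
    # Coordinate compression: box edges split the plane into grid cells,
    # and each cell is either fully inside or fully outside every box.
    xs = sorted({b.x0 for b in boxes} | {b.x1 for b in boxes})
    ys = sorted({b.y0 for b in boxes} | {b.y1 for b in boxes})
    area = 0
    for xa, xb in zip(xs, xs[1:]):
        for ya, yb in zip(ys, ys[1:]):
            if any(b.x0 <= xa and xb <= b.x1 and b.y0 <= ya and yb <= b.y1
                   for b in boxes):
                area += (xb - xa) * (yb - ya)
    return area


def image_share(page_width: int, page_height: int, boxes: list[Box]) -> float:
    """Fraction of a single page covered by image bounding boxes."""
    return union_area(boxes) / (page_width * page_height)


def publication_image_share(pages: list[tuple[int, int, list[Box]]]) -> float:
    """Area-weighted image share over all (width, height, boxes) pages."""
    covered = sum(union_area(boxes) for _, _, boxes in pages)
    total = sum(width * height for width, height, _ in pages)
    return covered / total
```

Counting overlapping boxes only once avoids shares above 100 % when a detector returns nested or overlapping boxes. Note that the remaining area includes margins and whitespace, so treating it as "text" is a simplification.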

Object outline detection of digital assets

A drawing, painting, or scan of a cultural artifact contains different things such as a landscape, food, persons, or text. All these items have a certain meaning, and digital humanists want to annotate them. On complex objects, however, this is a tedious process: an outline must be drawn manually to perform the annotation, which is time-consuming. Automatic object recognition would be very helpful here. It does not need to identify the depicted item, but it should create a good outline and object segmentation of the items shown. The algorithm should support different materials (drawings, paintings, photos) when creating the outline. Different repositories should be used as training material (images are accessible via a REST-based API). If possible, the demonstrations should be done in a Jupyter notebook.

Movie scene detection

In archives and broadcast studios, video footage is annotated manually. Annotators note the timestamps and describe what can be seen in each sequence, e.g. a certain person with a name (e.g. interview partner), a scene, or an object (e.g. country hall, street view, country hall entrance, country hall backyard). There has been a lot of research on detecting these scenes automatically. The task is to get an overview of the current state of the art. If usable software is available, it should be tested on different assets. The results should be recorded in order to determine whether this is something that can be integrated into a metadata generation workflow.

Comparison of Distributed Computing Frameworks

While the data analytics tool Apache Spark has been available on GWDG systems for multiple years, Dask is an upcoming topic. Spark is primarily used with Scala (and supports Python as well), whereas Dask is part of the Python ecosystem. The project proposal is to compare the deployment methods on an HPC system (via Slurm in our case) and the available monitoring possibilities and tooling, and to develop, run, and evaluate a concrete application example on both platforms.

Benchmarking Quantum Computing Simulators

The application of quantum computers to data science problems may become a powerful tool in the future. However, the current generation of noisy intermediate scale quantum computers can only tackle small problems, and their availability is very limited. Therefore quantum computing simulators (QCS) running on HPC systems are an important alternative for current research. In this project benchmarks for QCS should be implemented based on existing quantum circuits. The benchmarks can cover both pure quantum and quantum-classical algorithms and should be designed with the perspective to compare results to real quantum computers and classical approaches. The benchmarks will be used to compare simulators selected and installed by the student.

Integration of HPC systems and Quantum Computers

Especially in the noisy intermediate scale quantum computing era, hybrid quantum-classical approaches are among the most promising ways to achieve some early advantages over classical computing. For these approaches, integration with HPC systems is mandatory. The goal of this project is to design and implement a workflow for running hybrid codes using our HPC systems and, as a first step, quantum computing simulators, to extend this to real quantum computers available in the cloud, and to provide perspectives for future systems made available by other means. Possible aspects of this work are Jupyter-based user interfaces, containerization, scheduling, and costs of hybrid workloads. The final result should be a PoC covering at least some important aspects.

Quantum Machine Learning in Forestry

The project ForestCare uses machine learning methods to analyze diverse sets of data on forest health. Quantum machine learning (QML) is the application of quantum computers to machine learning tasks, which researchers consider to have great potential. In the current era of noisy intermediate scale quantum computers, however, no realizable advantage of this approach is to be expected. Nevertheless, testing the application of QML to new areas remains of high interest for future applications. In this explorative project, the goal is to identify suitable data reductions of ForestCare data and QML methods to apply to these reduced sets, and to compare the results to the classical approach. This project requires some enthusiasm for understanding the workings of ML and QML methods and a willingness to be challenged.

Performance Analysis of Containerized MPI Applications

Potential benefits of containerization are reduced complexity in installing and running software and increased portability of the software across multiple systems. For MPI applications, containerization can result in undesired effects on the application’s performance. An attempt to optimize the containerization might be complicated and will require some sacrifices in portability. In this project, the performance of a containerized OpenFOAM, a CFD MPI application, will be measured and analysed using various performance tools.

Parallelization of Iterative Optimization Algorithms for Image Processing using MPI

We are finalizing the topic, details will be provided on request.

Utilizing Dask in Scientific Workflows

Evaluation of existing workflows with the goal to create a benchmark suite, performance analysis and evaluation... that may lead to improvements of Dask. Dask has come a long way, supporting most of the scientific stack, and more recently some scientific communities have been attempting to use Dask for model simulations, whereas it used to be used only for the post-processing workflow. Commercial users are also slowly switching from Spark to Dask, as the latter is considered cost-effective and can offer large performance improvements.

Recommendation System for performance monitoring and analysis in HPC

Performance monitoring of HPC applications is crucial for optimizing a software stack and detecting various problems in software or hardware. However, inexperienced users might find it difficult to get started with performance metrics, and more experienced users can easily overlook some issues at runtime. An automated system that generates recommendations based on the monitoring metrics can help with optimization and problem detection.
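
One simple way to picture such a system is a set of rules evaluated against a job's monitoring metrics. The sketch below is only an illustration under stated assumptions: the metric names, thresholds, and advice texts are all hypothetical, and a real system might learn rules from data instead of hard-coding them.

```python
from dataclasses import dataclass
from typing import Callable

Metrics = dict[str, float]


@dataclass(frozen=True)
class Rule:
    """A condition on job metrics plus the advice to give when it holds."""
    name: str
    applies: Callable[[Metrics], bool]
    advice: str


# Hypothetical metrics, all expressed as fractions between 0 and 1.
# Missing metrics default to values that do not trigger the rule.
RULES = [
    Rule("low_cpu",
         lambda m: m.get("cpu_utilization", 1.0) < 0.5,
         "CPU utilization is low: consider requesting fewer cores."),
    Rule("high_memory",
         lambda m: m.get("memory_used_fraction", 0.0) > 0.9,
         "Memory is nearly exhausted: consider requesting more memory."),
    Rule("high_io_wait",
         lambda m: m.get("io_wait_fraction", 0.0) > 0.3,
         "The job spends much time waiting for I/O: check file access patterns."),
]


def recommend(metrics: Metrics, rules: list[Rule] = RULES) -> list[str]:
    """Return the advice of every rule whose condition holds for the metrics."""
    return [rule.advice for rule in rules if rule.applies(metrics)]


if __name__ == "__main__":
    job = {"cpu_utilization": 0.2, "memory_used_fraction": 0.95}
    for line in recommend(job):
        print(line)
```

Keeping rules as data makes it easy to add, remove, or tune them without touching the evaluation logic.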

Performance indicators for HPC operation and management

We are finalizing the topic, details will be provided on request.

Automatic system analysis for stability prediction and operator guidance in HPC

We are finalizing the topic, details will be provided on request.

Monitoring and evaluating application usage in the data center

We are finalizing the topic, details will be provided on request.

Common Sys-Op knowledgebase that can be used by humans and machine learning

We are finalizing the topic, details will be provided on request.

  • Last modified: 2022-07-07 19:41
  • by Julian Kunkel