Open Theses

The theses offered below are intended as MSc theses but can also be reduced in scope and handed out as BSc theses.

How to efficiently access free earth observation data for data analysis on HPC Systems?

In recent years, the amount of freely available Earth observation data has increased. Besides ESA's Sentinel missions [1] and NASA's Landsat missions [2], various open data initiatives have arisen. For example, several federal states in Germany publish geographical and Earth observation data, such as orthophotos or lidar data, free of charge [3,4]. However, one current bottleneck is the accessibility of this data: before analyzing it, researchers need to put a substantial amount of work into downloading and pre-processing. Big platforms such as Google [5] and Amazon [6] offer these data sets, making working in their environments significantly more comfortable. To promote and simplify data analysis in Earth observation on HPC systems, approaches for convenient data access need to be developed. In the best case, the resulting data is analysis-ready so that researchers can jump directly into their research. The goal of this project is to explore the current state of available services and technologies (data cubes [7], INSPIRE [8], STAC [9]) and to implement a workflow that provides a selected data set to users of our HPC system.

[1] https://sentinels.copernicus.eu/
[2] https://landsat.gsfc.nasa.gov/
[3] https://www.geoportal-th.de/de-de/
[4] https://www.geodaten.niedersachsen.de/startseite/
[5] https://developers.google.com/earth-engine/datasets
[6] https://aws.amazon.com/de/earth/
[7] https://datacube.remote-sensing.org/
[8] https://inspire.ec.europa.eu/
[9] https://stacspec.org/
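
To make the data-access step concrete, the following toy sketch filters simplified STAC-like catalog entries by bounding box and acquisition date. The item layout and IDs are invented for illustration; a real workflow would query a STAC API via a client library rather than an in-memory list.

```python
from datetime import datetime

def filter_items(items, bbox, start, end):
    """Return items whose bounding box intersects `bbox` and whose
    acquisition time falls within [start, end] (toy STAC-like dicts)."""
    west, south, east, north = bbox
    hits = []
    for item in items:
        w, s, e, n = item["bbox"]
        # two boxes intersect iff they overlap on both axes
        if e < west or w > east or n < south or s > north:
            continue
        t = datetime.fromisoformat(item["properties"]["datetime"])
        if start <= t <= end:
            hits.append(item)
    return hits

# invented example items in the simplified layout assumed above
items = [
    {"id": "scene-summer", "bbox": [9.9, 51.4, 10.1, 51.6],
     "properties": {"datetime": "2022-07-05T10:30:00"}},
    {"id": "scene-winter", "bbox": [9.9, 51.4, 10.1, 51.6],
     "properties": {"datetime": "2022-01-01T10:30:00"}},
]
summer = filter_items(items, bbox=(9.0, 51.0, 11.0, 52.0),
                      start=datetime(2022, 6, 1), end=datetime(2022, 8, 31))
```

An analysis-ready service would perform this kind of spatio-temporal selection server-side and hand back only the matching assets.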

Evolutionary Algorithm for Global Optimization

Evolutionary algorithms are an established means for optimization tasks in a variety of fields. An existing code used for the optimization of molecular clusters shall be investigated on a simpler target system, with regard to, e.g., alternative parallelization schemes, more efficient operators, and better convergence behavior of the optimization routines used therein.
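
As a minimal illustration of the optimization loop involved, the sketch below implements a toy truncation-selection scheme with Gaussian mutation and minimizes a simple test function. It stands in for neither the existing molecular-cluster code nor its operators; the test function, parameters, and selection scheme are all illustrative choices.

```python
import random

def evolve(fitness, dim=5, pop_size=30, generations=200,
           mutation_sigma=0.3, seed=42):
    """Minimize `fitness` with an elitist truncation-selection loop:
    keep the better half each generation, refill with mutated copies."""
    rng = random.Random(seed)
    pop = [[rng.uniform(-5, 5) for _ in range(dim)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness)                 # ascending: best first
        parents = pop[: pop_size // 2]        # survivors (elitism)
        children = [[x + rng.gauss(0, mutation_sigma) for x in p]
                    for p in parents]         # Gaussian mutation
        pop = parents + children
    pop.sort(key=fitness)
    return pop[0]

def sphere(x):
    """Classic test function; global minimum 0 at the origin."""
    return sum(v * v for v in x)

best = evolve(sphere)
```

Real operators for cluster structures (crossover respecting geometry, niching, etc.) are considerably more involved; the loop structure, however, stays the same.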

Parallelization of Iterative Optimization Algorithms for Image Processing using MPI

We are finalizing the topic; details will be provided on request.

Performance indicators for HPC operation and management

We are finalizing the topic; details will be provided on request.

Utilizing Dask in Scientific Workflows

Evaluation of existing workflows with the goal of creating a benchmark suite, plus performance analysis and evaluation that may lead to improvements of Dask. Dask has come a long way in supporting most of the scientific stack, and more recently parts of the scientific community have been attempting to use Dask for model simulations, where it used to be employed only in post-processing workflows. Commercial users are also slowly switching from Spark to Dask, as the latter is cost-effective and offers large performance improvements.
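
As a starting point for such a benchmark suite, a minimal pure-Python timing harness might look as follows. The workloads here are placeholders; a real suite would execute actual Dask workflows and additionally record memory usage, task-stream data, and scaling behavior.

```python
import time

def benchmark(workflows, repeats=3):
    """Time each workflow callable `repeats` times and report the best
    wall-clock time per workflow -- a minimal harness sketch, assuming
    argument-free callables."""
    results = {}
    for name, fn in workflows.items():
        timings = []
        for _ in range(repeats):
            t0 = time.perf_counter()
            fn()
            timings.append(time.perf_counter() - t0)
        results[name] = min(timings)   # best-of-N reduces scheduler noise
    return results

# placeholder workloads standing in for real Dask graphs
report = benchmark({
    "sum_squares": lambda: sum(i * i for i in range(100_000)),
    "join_strings": lambda: ",".join(map(str, range(10_000))),
})
```

Taking the minimum over repeats is a common choice for CPU-bound micro-benchmarks; for I/O-heavy Dask workflows, reporting the distribution would be more informative.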

Applications of Scientific Machine Learning

Scientific machine learning (SML) is a form of artificial intelligence (AI) that uses data-driven techniques to automate the process of scientific inquiry and discovery. It is used to analyze, interpret, and make predictions based on large amounts of data. SML can be applied in various scientific and engineering disciplines, including materials science, medicine, physics, and chemistry. Use cases include drug discovery, disease diagnosis, and personalized medicine. For example, SML can identify patterns in medical data that could lead to a better understanding of a particular disease and to new treatments. Additionally, SML can be used to detect anomalies in materials data, understand the underlying processes, and optimize manufacturing processes. The aim of a thesis would be to apply SML workflows to large data sets and to investigate their training and performance behavior.

Development of a provenance aware ad-hoc interface for a data lake

In order to support the entire development life cycle of a data-driven project, an ad-hoc interface is needed for the first, exploratory phase. Here, scientists want to import data from the data lake into, e.g., a Jupyter notebook, to work with and explore their data and processes interactively. In order to securely save results back into the data lake, it is of utmost importance to audit provenance information of the transformations performed on the user's local system. In this project, such an interface will be developed. This requires development on both sides, a web server as well as a local library, along with a scientific elaboration of the conceptual approach.

Semantic classification of metadata attributes in a data lake using machine learning

In order to find the data one is looking for in a heterogeneous data lake containing various data sets, an efficient querying mechanism is needed that can detect synonyms and semantic relationships. Historically, this was done using ontologies; using those, however, requires additional effort from each individual user. In this work, machine learning methods shall be explored to map individual attributes into a semantic space, where similarity analysis can be performed to find similar data sets.
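
The similarity analysis in such a semantic space can be sketched as follows. The three-dimensional attribute embeddings are invented for illustration; in practice the vectors would come from a trained embedding model.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar(query, vocabulary):
    """Rank attribute names by cosine similarity to the query vector."""
    return sorted(vocabulary,
                  key=lambda name: cosine(query, vocabulary[name]),
                  reverse=True)

# toy embeddings: similar attributes should land close together
embeddings = {
    "temperature": [0.9, 0.1, 0.0],
    "air_temp":    [0.85, 0.15, 0.05],
    "salinity":    [0.05, 0.9, 0.2],
}
ranking = most_similar([0.88, 0.12, 0.02], embeddings)
```

With real embeddings, "temperature" and "air_temp" ending up as mutual nearest neighbours is exactly the synonym detection the querying mechanism needs.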

Governance for a Data Lake

In order to give full data sovereignty to the individual user and to enforce compliance with certain policies, like the GDPR for personalized data, a reliable governance is required which enforces a homogeneous policy (or governance) across all involved systems, such as a cloud environment, an HPC system, or network-accessible storage.

Characterizing HPC storage systems

HPC storage systems exhibit complex behavior that is unfortunately often not well understood. As part of this work, the student will execute various storage benchmarks on different storage systems at GWDG and aim to understand their performance, creating a characterization for these systems. We will also document the results and aim to produce a publication.
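
A trivial sequential throughput probe illustrates the kind of measurement involved. This is a sketch only: a serious characterization would rely on established benchmarks such as IOR or fio, defeat the page cache, and vary block sizes, access patterns, and node counts.

```python
import os
import tempfile
import time

def write_read_throughput(size_mib=16, block_kib=1024):
    """Sequential write and read throughput (MiB/s) for a temporary file.
    Read results are heavily cache-influenced in this naive form."""
    block = b"x" * (block_kib * 1024)
    n_blocks = size_mib * 1024 // block_kib
    with tempfile.NamedTemporaryFile(delete=False) as f:
        path = f.name
        t0 = time.perf_counter()
        for _ in range(n_blocks):
            f.write(block)
        f.flush()
        os.fsync(f.fileno())          # include flush-to-device in the timing
        write_s = time.perf_counter() - t0
    t0 = time.perf_counter()
    with open(path, "rb") as f:
        while f.read(block_kib * 1024):
            pass
    read_s = time.perf_counter() - t0
    os.unlink(path)
    return size_mib / write_s, size_mib / read_s

w_mib_s, r_mib_s = write_read_throughput(size_mib=4)
```

Running such probes on different file systems (home, scratch, node-local) already exposes large performance differences and is a useful sanity check before launching full benchmark campaigns.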

Comparing the SeaweedFS Object Store to Established Reference Systems such as MinIO and Ceph

The S3 standard for object storage established by AWS has gained popularity, so much so that other storage platforms have also implemented S3 backends. However, as these platforms each have their own implementation, they also have their own performance profile. Can a relatively recent implementation such as SeaweedFS keep up with battle-tested systems such as MinIO and Ceph? Another important factor is how complex a platform is to manage: MinIO is easier to manage than SeaweedFS, which in turn is less complex than Ceph.

Scientific container monitoring: Development of a tool to monitor a vast number of scientific containers in a multi-node, multi-user HPC environment

In the realm of scientific computing, containerization is gaining ever-growing relevance, as the advantages of containerized applications, especially in the multi-user environment of an HPC system, are numerous: encapsulation of dependencies, ease of portability to other systems, ease of deployment, and much more. Yet, while a multitude of container runtimes and container management solutions exist by now, HPC monitoring software that specifically takes containerized applications into account, i.e. that generates and displays monitoring data that is "container-aware" and resolves down to the level of individual containers on individual compute nodes, is still very much lacking. In this thesis, you will develop your own HPC monitoring software that specifically targets containerized applications on the GWDG's HPC systems. Your software will then be deployed on the Scientific Compute Cluster (SCC) of the GWDG to monitor the containers that researchers are running on it, and you will analyze which additional insights administrators of HPC systems can gain if their monitoring software is "container-aware".

Containers for Parallel Applications

Parallel applications on HPC systems often rely on system-specific MPI (Message Passing Interface) and interconnect libraries, for example for InfiniBand or OmniPath networks. This partially offsets one main advantage of containerizing such applications, namely the portability between different platforms. The goal of this project is to evaluate different ways of integrating system-specific communication libraries into containers, allowing these containers to be ported to a different platform with minimal effort. A proof of concept (PoC) should be implemented and benchmarked against running natively on a system.

Performance Analysis of Containerized MPI Application

Potential benefits of containerization are the reduction of complexity in software installation and execution, and increased portability of the software across multiple systems. For MPI applications, however, containerization can result in undesired effects on the application's performance. Attempts to optimize the containerization can be complicated and will require some sacrifices in portability. In this project, the performance of containerized OpenFOAM, a CFD MPI application, will be measured and analysed using various performance tools.

Developing a Benchmark for Research Data Management Systems

With the continuous increase in research data and the success of data-driven methods, research data management systems (RDMS) are becoming increasingly important to ensure best scientific practices. Particularly for HPC users, those systems are not yet part of the daily toolbox. However, HPC admins are increasingly confronted with users requesting specific RDMS. To tackle this issue in a broader and more generic way, it is desirable to systematically benchmark different systems to draw conclusions about their general applicability in HPC. In this work, such a novel RDMS benchmark is developed, or composed from existing ones.

Monitoring of service utilization using DevOps strategies

Accurate accounting of service usage statistics is vital to the prioritization of offered utilities and resources. This accounting is not a trivial task, since each service presents its data heterogeneously. The work will consist of setting up scripts to gather, centralize, and process the access data of a number of different web-accessible services, and of implementing an interactive web solution for displaying the data to the end user (i.e., service administrators). The implemented solutions need to be maintainable, extendable, and robust.
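
As a sketch of the gathering step, the snippet below parses hypothetical access-log lines in the combined log format and aggregates request counts per service. The log lines, paths, and the "first path segment equals service" convention are assumptions for illustration; each real service will need its own adapter.

```python
import re
from collections import Counter

# hypothetical combined-log-format lines (invented for illustration)
SAMPLE_LOG = """\
10.0.0.1 - - [07/Jul/2022:10:00:01 +0200] "GET /jupyter/ HTTP/1.1" 200 512
10.0.0.2 - - [07/Jul/2022:10:00:05 +0200] "GET /gitlab/api HTTP/1.1" 200 128
10.0.0.1 - - [07/Jul/2022:10:01:09 +0200] "POST /jupyter/login HTTP/1.1" 302 64
"""

# capture the first path segment of the request line as the service name
LINE_RE = re.compile(r'"\w+ /(?P<service>[^/ ]+)')

def requests_per_service(log_text):
    """Count requests per top-level path segment."""
    counts = Counter()
    for line in log_text.splitlines():
        m = LINE_RE.search(line)
        if m:
            counts[m.group("service")] += 1
    return counts

usage = requests_per_service(SAMPLE_LOG)
```

The centralizing step would run such adapters on a schedule and push the counters into a common store that the web dashboard reads from.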

Monitoring and evaluating application usage in the data center

We are finalizing the topic; details will be provided on request.

Monitoring and Analyzing the GPU Utilization

GPU utilization in high-performance computing is an important metric to monitor, as it can help ensure optimal performance and resource allocation. GPU utilization indicates how much of the total available GPU processing power is being used at any given time. Monitoring it can help identify bottlenecks that may be causing performance issues and point to areas of the system that need additional resources or optimization. Additionally, it can help ensure that the best possible performance is being achieved from the GPU and that resources are well spent. Within this thesis, you will provide a solution for effective monitoring of GPU usage on our new GPU cluster "Grete".
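
One way to collect the utilization metric is to poll `nvidia-smi` in CSV mode and parse its output. The sample below is hard-coded so the sketch runs without a GPU, and the recorded numbers are invented.

```python
import csv
import io

# sample output of:
#   nvidia-smi --query-gpu=index,utilization.gpu,memory.used --format=csv
# captured here as a string so the sketch runs without a GPU
SAMPLE = """\
index, utilization.gpu [%], memory.used [MiB]
0, 87 %, 30122 MiB
1, 3 %, 410 MiB
"""

def parse_gpu_utilization(text):
    """Return {gpu_index: utilization_percent} from nvidia-smi CSV output."""
    reader = csv.reader(io.StringIO(text))
    next(reader)                      # skip the header row
    util = {}
    for row in reader:
        idx = int(row[0].strip())
        util[idx] = int(row[1].strip().rstrip(" %"))
    return util

util = parse_gpu_utilization(SAMPLE)
```

A monitoring agent would run such a poll periodically per node and ship the samples to a time-series database for dashboards and alerting.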

Text to image relation in digital collections

Books and magazines are digitized page by page. Once the scan is complete, the object is ingested as one item and can be viewed page by page; each page is a single digital file. In reality, however, the content is not strictly text: there are often figures and images inside the publication, and there are also more image-heavy publications such as art magazines. The goal is to create an algorithm that identifies images and provides a bounding box for each identified image. Why this is interesting: the bounding boxes can be used to calculate a percentage value for the whole publication (e.g. 20 % text, 80 % images). If this is done for the whole run of a certain magazine, trends can be determined (e.g. fewer but larger images). The images can also be displayed independently of the text and used for further searches.
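
Once bounding boxes are available, the percentage value follows from simple area arithmetic. The sketch below uses invented page and box coordinates and does not deduplicate overlapping boxes.

```python
def image_fraction(page_size, image_boxes):
    """Fraction of a page covered by detected image bounding boxes.
    `page_size` is (width, height); boxes are (x, y, width, height)."""
    page_area = page_size[0] * page_size[1]
    image_area = sum(w * h for (_, _, w, h) in image_boxes)
    return min(image_area / page_area, 1.0)   # clamp in case boxes overlap

# toy page: A4-ish pixel canvas with two detected images
fraction = image_fraction((2480, 3508),
                          [(100, 200, 1200, 900), (0, 2000, 2480, 700)])
```

Averaging this fraction over all pages of an issue, and then over issues, yields the per-publication trend described above.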

Object outline detection of digital assets

A drawing, painting, or scan of a cultural artifact contains different things such as a landscape, food, persons, or text. All these items have a certain meaning, and digital humanists want to annotate them. On complex objects, however, this is a tedious process: an outline must be drawn manually to perform the annotation, which is time-consuming. It would therefore be very helpful to perform automatic object recognition. It need not identify the depicted item, but it should create a good outline and segmentation of the shown items. The algorithm should support different materials (drawings, paintings, photos). Different repositories should be used as training material (the images are accessible via a REST-based API). If possible, the demonstrations should be done in a Jupyter notebook.

Movie scene detection

In archives and broadcast studios, video footage is annotated manually: annotators note the timestamps and describe what can be seen in each sequence, e.g. a certain named person (such as an interview partner), a scene, or an object (e.g. country hall, street view, country hall entrance, country hall backyard). There has been a lot of research on detecting these scenes automatically [1][2][3]. The task is to get an overview of the current state of the art. If usable software is available, it should be tested on different assets, and the results should be documented in order to identify whether this is something that can be integrated into a metadata generation workflow.

[1] https://journals.sagepub.com/doi/pdf/10.1177/1550147719845277
[2] https://ieeexplore.ieee.org/document/1211489
[3] https://openaccess.thecvf.com/content_CVPR_2020/papers/Rao_A_Local-to-Global_Approach_to_Multi-Modal_Movie_Scene_Segmentation_CVPR_2020_paper.pdf
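
A classic baseline in this literature is shot-boundary detection via histogram differencing between consecutive frames; the sketch below applies it to synthetic grey-scale frames. Modern scene segmentation, e.g. [3], instead learns multi-modal features, so this only illustrates the simplest building block.

```python
def histogram(frame, bins=8):
    """Normalized grey-value histogram of a frame given as a flat
    list of 0-255 intensity values."""
    counts = [0] * bins
    for v in frame:
        counts[min(v * bins // 256, bins - 1)] += 1
    total = len(frame)
    return [c / total for c in counts]

def detect_cuts(frames, threshold=0.5):
    """Indices where the L1 histogram difference between consecutive
    frames exceeds `threshold` -- the classic hard-cut heuristic."""
    cuts = []
    prev = histogram(frames[0])
    for i, frame in enumerate(frames[1:], start=1):
        cur = histogram(frame)
        diff = sum(abs(a - b) for a, b in zip(prev, cur))
        if diff > threshold:
            cuts.append(i)
        prev = cur
    return cuts

# synthetic footage: three dark frames, then a hard cut to two bright frames
dark, bright = [10] * 64, [240] * 64
cuts = detect_cuts([dark, dark, dark, bright, bright])
```

Hard cuts are the easy case; gradual transitions and semantic scene boundaries are where the surveyed learning-based methods become necessary.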

Benchmarking Quantum Computing Simulators

The application of quantum computers to data science problems may become a powerful tool in the future. However, the current generation of noisy intermediate-scale quantum computers can only tackle small problems, and their availability is very limited. Therefore, quantum computing simulators (QCS) running on HPC systems are an important alternative for current research. In this project, benchmarks for QCS should be implemented based on existing quantum circuits. The benchmarks can cover both pure quantum and quantum-classical algorithms and should be designed with the perspective of comparing results to real quantum computers and classical approaches. The benchmarks will be used to compare simulators selected and installed by the student.
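
To illustrate what such a simulator computes, the following minimal dense statevector sketch prepares a Bell state in pure Python. Production QCS use optimized, often distributed statevector or tensor-network representations; the gate set and helper names here are ad hoc.

```python
import math

def apply_gate(state, gate, target):
    """Apply a single-qubit gate (2x2 matrix) to qubit `target` of a
    dense statevector; qubit 0 is the least significant bit."""
    new = [0j] * len(state)
    for i, amp in enumerate(state):
        bit = (i >> target) & 1
        for out_bit in (0, 1):
            j = i ^ ((bit ^ out_bit) << target)
            new[j] += gate[out_bit][bit] * amp
    return new

def apply_cnot(state, control, target):
    """Swap amplitudes where the control bit is set (CNOT)."""
    new = list(state)
    for i in range(len(state)):
        if (i >> control) & 1:
            j = i ^ (1 << target)
            if i < j:
                new[i], new[j] = state[j], state[i]
    return new

H = [[1 / math.sqrt(2), 1 / math.sqrt(2)],
     [1 / math.sqrt(2), -1 / math.sqrt(2)]]

# |00> -> (|00> + |11>)/sqrt(2)
state = [1 + 0j, 0j, 0j, 0j]
state = apply_gate(state, H, target=0)
state = apply_cnot(state, control=0, target=1)
probs = [abs(a) ** 2 for a in state]
```

The exponential growth of the statevector (2^n amplitudes for n qubits) is exactly what makes HPC resources, and careful benchmarking, relevant for QCS.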

Integration of HPC systems and Quantum Computers

Especially in the noisy intermediate-scale quantum computing era, hybrid quantum-classical approaches are among the most promising to achieve some early advantages over classical computing. For these approaches, an integration with HPC systems is mandatory. The goal of this project is to design and implement a workflow that allows running hybrid codes using our HPC systems and, as a first step, quantum computing simulators, to extend this to cloud-accessible real quantum computers, and to provide perspectives for future systems made available otherwise. Possible aspects of this work are Jupyter-based user interfaces, containerization, scheduling, and the costs of hybrid workloads. The final result should be a PoC covering at least some important aspects.

Quantum Machine Learning in Forestry

The project ForestCare uses machine learning methods to analyze diverse sets of data on forest health. Quantum machine learning (QML) is the application of quantum computers to machine learning tasks and is considered by researchers to have great potential. In the current era of noisy intermediate-scale quantum computers, however, no realizable advantage of this approach is to be expected. Nevertheless, testing QML applications in new areas remains of high interest for future use. In this explorative project, the goal is to identify suitable data reductions of the ForestCare data and QML methods to apply to these reduced sets, and to compare the results to the classical approach. This project requires some enthusiasm for understanding the workings of ML and QML methods and a willingness to be challenged.

A Quantum-Classic Hybrid Machine Learning Approach for Performance Analysis of Heterogeneous HPC System

In order to model the performance of heterogeneous HPC systems via ML techniques, a large number of parameters and hyper-parameters must be handled. Such a model could serve to optimize for factors such as energy efficiency. The objective of this work is to explore tools and techniques of classical, quantum, and quantum-classical ML.

A Study on Parallel Vs Quantum Scheduling Approach for Workload Mapping in HPC System Landscapes

The focus of this research is a performance comparison of a classical parallel scheduling approach with a quantum scheduling approach. The solution will be evaluated with regard to speed and resource usage.

Benchmarking phylogenetic tree reconstructions

In phylogenetic tree reconstructions, we describe the evolutionary relationship between biological sequences in terms of their shared ancestry. To reconstruct such a tree, multiple approaches exist, including maximum likelihood and Bayesian methods. Among the most commonly used implementations of these methods are RAxML and MrBayes, both of which are available on SCC. In this project, you will identify a suitable benchmarking suite and use it to benchmark RAxML and MrBayes on SCC.

Prototyping common workflows in phylogenetic tree reconstructions

In phylogenetic tree reconstructions, we describe the evolutionary relationship between biological sequences in terms of their shared ancestry. To reconstruct such a tree, multiple approaches exist, including maximum likelihood and Bayesian methods. Among the most commonly used implementations of these methods are RAxML and MrBayes, both of which are available on SCC. In this project, you will identify and establish a typical workflow on SCC, from data management to documentation. This project is especially suitable for students enrolled in the Computer Science (M.Ed.) programme.

Benchmarking AlphaFold and alternative models for protein structure prediction

Proteins are involved in every biological process in every living cell. To assess how exactly a protein functions, knowing its amino acid sequence alone is not enough; its three-dimensional structure needs to be determined as well. In the last year, we have seen a number of AI-based approaches put forward. In this project, you will compare and benchmark the performance of AlphaFold and alternative models on the SCC.

Upscaling single cell analysis using the HPC

In single cell analyses, the focus is, as the name suggests, on the individual cell. For example, the set of expressed genes can be determined via RNA-Seq. Monocle3 is a widely used toolkit for analysing such data and is available on SCC. In this project, you will benchmark the existing single cell analysis pipeline at GWDG with particular attention to the potential for upscaling.

  • Last modified: 2022-07-07 19:41
  • by Julian Kunkel