Open Theses

PhD

Please note the PhD application process.

MSc & BSc

The offered theses below are intended for MSc but can also be reduced in scope and handed out as BSc theses.

Trustworthy AI: Continuous Governance for LLM Services

Large Language Models (LLMs) are increasingly deployed as AI services within universities, public administrations, and research infrastructures. While local deployment improves data sovereignty and privacy, it does not automatically guarantee trustworthy, safe, and compliant AI behavior. LLMs may generate biased, harmful, or misleading outputs and may violate institutional policies or regulatory requirements. This thesis investigates how LLM services can be continuously monitored using a decentralized governance framework. The proposed approach employs specialized governance agents that evaluate different ethical dimensions, including fairness, privacy, explainability, safety, and regulatory compliance. The aim is to generate an overall ethical risk assessment and compliance score.

Exploring Quantum Computing Use Cases

In the quantum computing test center QUICS (Quantum Innovation and Computing for SME), we explore the potential of quantum computing for real-world use cases. The goal of this thesis is to characterize the (future) applicability of quantum computing for a problem defined by an industry partner or internally by the QUICS project itself. This involves characterizing the state-of-the-art approach to the problem in classical computing, defining a suitable benchmark for comparisons, selecting a quantum computing approach, and defining and implementing a simplified proof of concept (PoC) that can run on present-day quantum computers. Ideally, the thesis will use the PoC to discuss the requirements for future quantum advantage. The thesis will be conducted within the QUICS team and, when applicable, in collaboration with an industry partner.

AI-assisted language learning

The current generation of AI models is language-based, which opens up many possibilities for using AI in language learning. Despite this potential, there is a lack of free and open-source language learning applications, especially ones that support language courses such as those offered by the university. The goal of this project is to change that by developing a language learning app that builds on spaced repetition and uses LLMs to bridge the gap between vocabulary learning and full texts.
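To make the spaced-repetition idea concrete, the following minimal sketch schedules vocabulary cards with a Leitner-style box system. The review intervals and the names `Card`, `review`, and `due_cards` are assumptions chosen for illustration; they do not describe an existing application, and the thesis is free to use a different scheduling algorithm.

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Hypothetical review intervals in days per Leitner box (assumption, to be tuned).
INTERVALS = {1: 1, 2: 3, 3: 7, 4: 14, 5: 30}


@dataclass
class Card:
    word: str
    translation: str
    box: int = 1            # new cards start in the first box
    due: date = date.min    # new cards are due immediately


def review(card: Card, correct: bool, today: date) -> Card:
    """Move the card up one box on success, back to box 1 on failure."""
    card.box = min(card.box + 1, max(INTERVALS)) if correct else 1
    card.due = today + timedelta(days=INTERVALS[card.box])
    return card


def due_cards(cards: list[Card], today: date) -> list[Card]:
    """Return the cards whose next review date has been reached."""
    return [c for c in cards if c.due <= today]
```

An LLM component could then, for example, generate short texts that only use words from the boxes a learner has already reached.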

AI-assisted programming learning

AI is transforming education. AI chatbots are everywhere, but more useful patterns are emerging only slowly. In our CS Bachelor program, we use the programming learning environment SmartBeans, which provides students with tasks and automatic feedback based on unit testing. However, this feedback is limited in scope and usefulness. The goal of this thesis is to improve the learning experience by adding state-of-the-art AI methods that go beyond chat and target well-known factors of efficient learning, e.g. cognitive load. The focus can be either on the AI methods or on improving learning.

Performance Evaluation of LLM Inference Engines

While vLLM is a widely used inference engine for serving LLMs, there are alternatives that have the potential to deliver better performance by replacing or extending vLLM. Notable options are the Modular platform with MAX, ServerlessLLM, and LMCache. Performance improvements may be limited to certain use cases. The overarching goal of this topic is to explore potential performance improvements for the Chat AI platform.

Operating Kubernetes with AI Engineers

Projects such as K8sGPT as well as MCP servers for Kubernetes enable LLMs to interact directly with Kubernetes clusters. This project explores how well LLM-based engineers can maintain a given Kubernetes cluster by completing typical maintenance tasks such as adjusting workloads and migrating between versions.

Enabling Automatic Context Switching for Personal AI Assistants

Personal AI assistants such as OpenClaw and Hermes can handle a wide range of use cases, which skills and plugins extend even further. However, because they serve as general-purpose assistants, a user might give them several complex tasks over the course of a day, each of which can cause the agent system to evict or compress its memory. As a result, information that the user expects to be present in the agent's memory is lost, and the user has to re-explain tasks and approaches. This thesis aims to enable personal AI assistants to manage tasks as sessions, so that they can keep memory or notes per task and understand when to load which session data.

Comparison of Distributed Computing Frameworks

While the data analytics tool Apache Spark has been available on GWDG systems for several years, Dask is an emerging topic. Spark is primarily used with Scala (and supports Python as well), whereas Dask is part of the Python ecosystem. The proposed project compares the deployment methods on an HPC system (via Slurm in our case) as well as the available monitoring options and tooling, and develops, runs, and evaluates a concrete example application on both platforms.

Speculative KV Caches for Efficient LLM Inference

Key-Value caches are a critical component of Large Language Model inference, but they introduce significant memory and latency overheads. Speculative KV caches are techniques that predict and manage cache entries ahead of time, making them available when needed and improving inference efficiency. This thesis explores the design, implementation, and evaluation of such techniques for modern LLMs.

Adaptive Attention Budgeting for Efficient Large Language Model Inference

This thesis investigates dynamic allocation of attention computation across transformer layers and attention heads based on input characteristics. The goal is to reduce inference cost while preserving model quality by learning or estimating per-input attention budgets.

Automatic Prompt Classification for Intelligent Language Model Routing

On the Chat AI platform, the default model accounts for one of the highest proportions of user interactions, suggesting that many users rely on the preselected option rather than manually choosing a model. This thesis focuses on classifying user prompts into task categories (such as reasoning, coding, creative writing, summarization, and information retrieval). Based on the predicted task, requests are dynamically assigned to specialized models, with the goal of outperforming a single default-model approach in terms of accuracy, latency, and cost.
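The classify-then-route idea can be illustrated with a deliberately simple keyword baseline. This is only a sketch: the keyword lists, the model identifiers in `ROUTES`, and the function names are hypothetical, and a thesis would most likely replace `classify` with a trained classifier.

```python
import re

# Hypothetical keyword lists per task category (assumption, not a real taxonomy).
KEYWORDS = {
    "coding": {"python", "function", "bug", "compile", "code"},
    "summarization": {"summarize", "summary", "shorten"},
    "creative_writing": {"poem", "story", "lyrics"},
    "reasoning": {"prove", "derive", "why"},
}

# Hypothetical model identifiers; real deployments would use their own names.
ROUTES = {
    "coding": "code-model",
    "summarization": "small-fast-model",
    "creative_writing": "large-general-model",
    "reasoning": "reasoning-model",
}
DEFAULT_MODEL = "default-model"


def classify(prompt: str) -> str:
    """Return the category with the most keyword hits, or 'other' if none match."""
    tokens = set(re.findall(r"[a-z]+", prompt.lower()))
    scores = {task: len(tokens & words) for task, words in KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "other"


def route(prompt: str) -> str:
    """Map a prompt to a model, falling back to the default model."""
    return ROUTES.get(classify(prompt), DEFAULT_MODEL)
```

Such a baseline is useful mainly as a reference point: a learned router should beat it on accuracy, latency, and cost to justify its added complexity.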

Graph Transformer Architectures for Workflow Scheduling: Beyond Standard GNNs

Graph transformer models, which combine attention mechanisms with graph structure, consistently outperform standard message-passing GNNs on combinatorial optimization tasks in recent benchmarks (2024-2026). Applied to workflow scheduling, graph transformers can capture global task-dependency patterns and long-range interactions that locality-limited GNNs miss, a limitation documented in the Grapheon RL benchmark at large scales (rnc5000, DOI 10.5281/zenodo.20432418). This thesis replaces the GNN backbone of the Grapheon RL architecture with a graph transformer encoder (e.g., GraphGPS or Exphormer) and evaluates whether attention-based representations improve scheduling quality on rnc300-rnc1000 workflow instances from the published STG dataset. The student will compare objective gap, inference speed, and training convergence against the published GNN-RL baselines under the same homogeneous and heterogeneous system configurations. The thesis contributes a systematically evaluated architectural extension to the open benchmark.

Carbon-Aware Multi-Objective Scheduling for HPC Workflows Using Reinforcement Learning

Data centers in Europe are now subject to the EU Energy Efficiency Directive (2023/1791) and are under growing pressure from funders and institutions to report and minimize operational carbon emissions. HPC schedulers that factor in carbon intensity of the grid alongside performance are emerging as a key tool, but dedicated benchmarks combining workflow-level quality metrics with carbon cost remain rare. This thesis extends the Grapheon RL framework to a three-objective scheduler minimizing makespan, energy consumption, and carbon cost simultaneously. The student will integrate synthetic carbon intensity traces (modeled after real-world grid data from the European Energy Exchange or carbon-aware open APIs), reformulate the RL reward to include a carbon penalty term, and retrain and evaluate on the published STG workflow dataset. The evaluation reports Pareto frontier trade-offs and compares carbon savings against performance-only Grapheon RL and HEFT baselines. The thesis outcome serves as a reproducible carbon-aware scheduling baseline for the group's benchmark.
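As a rough illustration of the two evaluation ingredients named above, the sketch below shows a scalarized reward with a carbon penalty term and a filter that extracts the Pareto front from a set of (makespan, energy, carbon) results. The weights, units, and function names are assumptions made for this example and are not taken from the Grapheon RL code base.

```python
def reward(makespan_s, energy_kwh, carbon_intensity_g_per_kwh,
           w_time=1.0, w_energy=0.1, w_carbon=0.01):
    """Negative weighted cost; the weights are hypothetical and need tuning."""
    carbon_g = energy_kwh * carbon_intensity_g_per_kwh  # gCO2 = kWh * gCO2/kWh
    return -(w_time * makespan_s + w_energy * energy_kwh + w_carbon * carbon_g)


def dominates(a, b):
    """True if a is at least as good as b in every objective and better in one."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))


def pareto_front(points):
    """Keep the points that no other point dominates (all objectives minimized)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]
```

The same schedule earns a higher reward when it runs during a period of low grid carbon intensity, which is the behavior a carbon-aware scheduler is meant to learn.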

Transfer Learning Evaluation of GNN-RL Schedulers Across Workflow Families

A practical requirement for production GNN-RL schedulers is the ability to generalize beyond the workflow family used for training. The Grapheon RL model (DOI 10.1109/COMPSAC65507.2025.00341) is trained on Standard Task Graph (STG) instances. It is an open question how well it transfers to other workflow families such as Pegasus CyberShake, Montage, or synthetic BLAST pipelines, which differ in graph structure, depth, parallelism ratio, and task heterogeneity. This BSc thesis systematically evaluates Grapheon RL transfer to at least two non-STG workflow families without any fine-tuning, with lightweight fine-tuning (10-50 additional episodes), and with full retraining. The student will report normalized objective gap, schedule feasibility, and inference speed under the homogeneous 3-node configuration from the benchmark. The thesis produces a transfer learning guide: which workflow structural properties predict successful generalization, and how many fine-tuning samples are sufficient for a new family.

Online Adaptive Scheduling with Continual Reinforcement Learning for Shifting HPC Workloads

Production HPC clusters experience workload distribution drift over time: new workflow types appear, system configurations change, and peak load periods vary by season or funding cycle. A GNN-RL scheduler trained on a fixed dataset degrades as the deployment distribution shifts away from training, requiring costly full retraining. Continual reinforcement learning methods (Elastic Weight Consolidation, PackNet, progressive networks) address this stability-plasticity dilemma by enabling an agent to learn new tasks without catastrophic forgetting. This thesis implements an online continual learning wrapper for the Grapheon RL scheduler that updates model weights on a rolling window of recent scheduling decisions from the GWDG SCC cluster workload logs. The student will evaluate forgetting rate on previous workflow sizes, adaptation speed to new workflow families, and compute cost per update step. A comparison with periodic full retraining quantifies the practical case for online adaptation. Access to anonymized SCC job logs is available through the GWDG HPS group.
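For orientation, the regularization term used by Elastic Weight Consolidation can be written in a few lines. The sketch below uses plain Python lists instead of a deep learning framework; the function and parameter names are illustrative assumptions, and the diagonal Fisher information values are expected to be estimated elsewhere.

```python
def ewc_penalty(params, anchor_params, fisher, lam=1.0):
    """EWC term: (lam / 2) * sum_i F_i * (theta_i - theta_i_old)^2.

    params        -- current weights
    anchor_params -- weights learned on the previous workload distribution
    fisher        -- diagonal Fisher information, one value per weight
    """
    if not (len(params) == len(anchor_params) == len(fisher)):
        raise ValueError("all inputs must have the same length")
    return 0.5 * lam * sum(
        f * (p - a) ** 2 for p, a, f in zip(params, anchor_params, fisher)
    )


def total_loss(task_loss, params, anchor_params, fisher, lam=1.0):
    """Loss on the new data plus the penalty for drifting from old weights."""
    return task_loss + ewc_penalty(params, anchor_params, fisher, lam)
```

Weights with high Fisher values are held close to their previous values, while unimportant weights remain free to adapt to the new workload distribution.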

Knowledge Graph Modeling of HPC Resources and Workflow Dependencies

Machine-readable modeling of HPC system resources and workflow task dependencies is a prerequisite for both exact optimization and learning-based scheduling. Current approaches use flat JSON or ad-hoc formats that lack semantic expressiveness. Knowledge graphs and linked-data ontologies offer a structured alternative that enables reasoning, constraint checking, and integration with LLM-based planners via graph query languages (SPARQL) or schema retrieval. This thesis designs and implements a knowledge graph schema for heterogeneous HPC systems and Standard Task Graph workflows, building on the node and workflow JSON definitions from the Stage 2 benchmark (DOI 10.5281/zenodo.20432418). The student will populate a reference knowledge graph instance, evaluate its expressiveness against the scheduling constraints used in Grapheon RL, and demonstrate two end-use cases: constraint validation at submission time and semantic query for suitable node allocation. The outcome is a reusable schema with open-source tooling for KG population from existing JSON workflow descriptions.

Monitoring and Controlling Irregular Behavior in Agentic AI Systems with Access to Digital and Physical Resources

This thesis addresses a new class of risks emerging from agentic AI systems that are able to perform actions, use tools, access resources, execute software, control devices, or make operational decisions. As AI systems move from passive recommendation toward automated decision-making and action execution, it becomes important to monitor whether an agent behaves within its allowed boundaries. The main objective is to design a lightweight monitoring and control framework that can detect irregular, unauthorized, or suspicious behavior in AI agents. Such behavior may include executing unexpected commands, accessing restricted files, modifying system settings, using tools outside the assigned task, consuming abnormal resources, or making decisions that conflict with predefined rules and human intentions. The project may focus on software agents operating in desktop or operating-system-level environments, AI assistants executing automated tasks, business-process agents managing resources, or simulated cyber-physical agents interacting with actuators. The student will define allowed and disallowed behavior patterns, collect or simulate agent activity logs, and develop a monitoring mechanism that can detect deviations from expected behavior. A Master thesis may focus on implementing and evaluating a prototype supervisory system that monitors agent actions and detects rule violations. A PhD-level thesis may extend the work by developing a more general framework for runtime agent governance, combining rule-based monitoring, machine learning, anomaly detection, policy checking, formal constraints, and human-in-the-loop approval mechanisms. A possible motivating example is a business or resource-management scenario in which an autonomous assistant supervises financial or operational decisions and prevents risky or unauthorized actions by another actor. This illustrates the broader need for trusted third-party AI supervisors that monitor agents and enforce operational boundaries.
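A rule-based monitor of the kind described here could start from a check like the following minimal sketch. The `Action` record, the allowed tool names, and the `/workspace` root are hypothetical choices for illustration; a real prototype would derive its policy from the task definition and the agent framework in use.

```python
import posixpath
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass(frozen=True)
class Action:
    tool: str     # name of the tool the agent wants to call
    target: str   # path or argument the tool is applied to


# Hypothetical policy: which tools are allowed and where files may be touched.
ALLOWED_TOOLS = {"read_file", "write_file", "run_tests"}
FILE_TOOLS = {"read_file", "write_file"}
ALLOWED_ROOT = PurePosixPath("/workspace")


def violations(action: Action) -> list[str]:
    """Return a list of human-readable rule violations (empty if allowed)."""
    found = []
    if action.tool not in ALLOWED_TOOLS:
        found.append(f"tool '{action.tool}' is not allowed")
    if action.tool in FILE_TOOLS:
        # Normalize first so that '..' segments cannot escape the allowed root.
        path = PurePosixPath(posixpath.normpath(action.target))
        if path != ALLOWED_ROOT and ALLOWED_ROOT not in path.parents:
            found.append(f"path '{path}' is outside {ALLOWED_ROOT}")
    return found
```

A supervisor would call `violations` before each action is executed and either block the action or ask a human for approval when the list is not empty.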

Environmental Sustainability Labeling for AI Services Based on Energy, Runtime, and Resource Usage

This thesis focuses on designing a transparent labeling system for evaluating the environmental impact of AI services. The main idea is to measure how much computational effort, runtime, memory, CPU, GPU, and energy an AI service requires, and then translate these measurements into a simple and understandable label. The labeling concept can be inspired by energy labels used for household appliances, such as A, B, C, D, and E. However, the student has freedom to define the exact scoring method, evaluation metrics, and visualization style. The AI services may include classification models, generative AI models, anomaly detection tools, scheduling systems, or other selected AI applications. A Bachelor thesis can focus on implementing a prototype and evaluating a small number of AI models. A Master thesis can extend the work by designing a more formal scoring model, comparing multiple infrastructures, or including sustainability indicators such as estimated carbon impact.
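As one possible starting point, measured power and runtime can be converted into energy and then mapped to a letter grade. The thresholds below (watt-hours per 1,000 requests) are invented for this sketch and carry no empirical meaning; defining a defensible scale is part of the thesis.

```python
# Hypothetical upper bounds in Wh per 1,000 inference requests (assumption).
THRESHOLDS = [(1.0, "A"), (5.0, "B"), (20.0, "C"), (100.0, "D")]


def energy_wh(avg_power_watts: float, runtime_seconds: float) -> float:
    """Energy in watt-hours from average power draw and runtime."""
    return avg_power_watts * runtime_seconds / 3600.0


def label(wh_per_1000_requests: float) -> str:
    """Map an energy figure to a label from A (best) to E (worst)."""
    for upper_bound, grade in THRESHOLDS:
        if wh_per_1000_requests <= upper_bound:
            return grade
    return "E"
```

For instance, a service drawing 300 W on average for 60 seconds to answer 1,000 requests consumes 5 Wh and would receive the label B under these example thresholds.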

Meta Machine Intelligence for Adaptive Error Detection in High-Performance Computing Systems

This thesis investigates how multiple AI models can be used together for detecting errors, failures, and abnormal behavior in high-performance computing systems. Instead of relying on one fixed detection model, the system should analyze the current situation and select the most suitable model based on system context. The context may include workload behavior, node status, resource usage, error patterns, log messages, or historical system behavior. The student can explore different strategies such as model selection, ensemble learning, rule-based routing, machine learning-based routing, or LLM-assisted log interpretation. A Bachelor thesis may compare several anomaly detection models on HPC or server logs. A Master thesis may design an adaptive Meta Machine Intelligence layer that selects the best model depending on the current system condition.

Lightweight Edge AI for Detecting Irregular Behavior in Device and Sensor Logs

This thesis aims to develop a lightweight anomaly detection system for device logs, sensor data, or small-scale monitoring environments. The focus is on AI methods that can operate under limited computational resources, such as embedded systems, Raspberry Pi devices, IoT nodes, or edge computing environments. The system should detect irregular behavior using indicators such as timestamp frequency, error messages, temperature changes, signal variations, or resource usage. Students can freely choose the application domain, for example smart devices, environmental sensors, robotics, small server nodes, or industrial monitoring. A Bachelor thesis may compare lightweight machine learning models for anomaly detection. A Master thesis may investigate TinyML, online learning, model compression, or hybrid signal-processing and AI-based detection methods.

Benchmarking AI Models for Resource-Aware System Monitoring

This thesis focuses on benchmarking different AI models for monitoring tasks under resource constraints. The student will compare models not only based on accuracy, but also based on runtime, memory usage, energy consumption, inference latency, and deployment complexity. The monitoring task may involve anomaly detection, failure prediction, log classification, sensor analysis, or workload prediction. The project gives students freedom to select the models, datasets, and evaluation environment. A Bachelor thesis may compare a small number of models for one monitoring task. A Master thesis may design a more systematic benchmarking framework and propose guidelines for selecting models based on system constraints.

Development of new applications for the SpiNNaker-2 neuromorphic computing platform

SpiNNaker is a new kind of computer architecture, initially designed to efficiently perform simulations of spiking neural networks. It consists of a large number of low-power ARM cores connected by an efficient message-passing network. This architecture, together with the flexibility of the spiking neuron model, also makes it well suited for accelerating other types of algorithms, such as optimization problems, constraint problems, live image and signal processing, AI/ML, cellular automata, finite element simulations, distributed partial differential equations, and embedded, robotics, and low-power applications in general. As part of the Future Technology Platform, the GWDG has acquired a number of SpiNNaker boards that will be available for the thesis. In this thesis, you will develop one (or more) applications for SpiNNaker, using either the high-level Python or the low-level C/C++ software stack, characterize your solution, compare it to a pure CPU/GPU solution (or other hardware in the Future Technology Platform), apply it to a real case study if possible, and study the power consumption of your program.

Federated Learning from Private LLM-Agent Trajectories for Adaptive Agent Control

This thesis investigates how LLM-based agents can improve their task-solving behavior from locally generated interaction trajectories without sharing private data. During task execution, an agent produces trajectories consisting of actions, observations, tool calls, failed attempts, corrections, and final outcomes. The project studies privacy-preserving methods for using these trajectories to improve agent control while keeping the base LLM frozen or only lightly adapted. Three improvement strategies may be considered: training a lightweight next-action ranker or critic, storing and retrieving successful trajectories as contextual memory, and optionally fine-tuning small adapters such as LoRA on trajectory-derived examples. Federated learning is used to aggregate improvements across simulated private environments while raw trajectories remain local. The thesis will evaluate whether trajectory-based federated improvement increases task success, reduces unnecessary steps, and lowers execution cost compared to local-only and non-adaptive agent baselines.

FedToolAgent: Privacy-Preserving Federated Learning of Tool-Use Policies for LLM Agents

This thesis investigates how LLM-based agents can learn better tool-use behavior across distributed private environments without sharing sensitive task data or execution logs. Modern LLM agents interact with external tools such as search engines, databases, code execution environments, document processors, validators, and domain-specific APIs. The project focuses on learning a lightweight tool-use policy that decides which tool should be called next, when it should be used, and how tool-call outcomes should guide subsequent actions. Each client trains locally from its own tool-use logs, including successful and failed tool calls, execution cost, latency, and task outcomes. Federated learning is then used to aggregate policy updates while keeping raw logs and private data local. The thesis evaluates whether federated tool-use learning improves task success, reduces unnecessary tool calls, lowers execution cost, and generalizes across heterogeneous client environments.
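The aggregation step can be illustrated with plain federated averaging (FedAvg), in which the server combines client weights in proportion to the number of local samples. This is a minimal sketch with weights as flat Python lists; the function name and data layout are assumptions, and the thesis may well use a different aggregation rule or an existing federated learning framework.

```python
def fed_avg(client_updates):
    """Sample-weighted average of client weight vectors.

    client_updates -- list of (num_local_samples, weights) tuples, where
                      weights is a list of floats of identical length.
    Only these weight vectors leave the clients; raw logs stay local.
    """
    total = sum(n for n, _ in client_updates)
    if total <= 0:
        raise ValueError("at least one client must contribute samples")
    dim = len(client_updates[0][1])
    if any(len(w) != dim for _, w in client_updates):
        raise ValueError("all weight vectors must have the same length")
    return [
        sum(n * w[i] for n, w in client_updates) / total
        for i in range(dim)
    ]
```

A client with three times as many logged tool calls therefore has three times the influence on the shared policy, which is one of the design choices worth questioning for heterogeneous client environments.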

FedCollab: Federated Optimization of Collaboration Protocols in LLM-Based Multi-Agent Systems

This thesis investigates how collaboration protocols in LLM-based multi-agent systems can be optimized across distributed private environments without sharing sensitive task data or interaction logs. In multi-agent systems, several specialized agents may cooperate through roles such as planner, executor, reviewer, critic, retriever, or validator. The quality of the final result depends not only on the individual agents, but also on the collaboration protocol: how tasks are decomposed, which agent acts first, how intermediate results are exchanged, how disagreements are resolved, and when the system should stop. This project studies lightweight learning methods for adapting such collaboration protocols from local multi-agent interaction logs. Federated learning is used to aggregate protocol improvements across simulated private clients while keeping raw conversations, task inputs, and agent traces local. The thesis will evaluate whether federated protocol optimization improves task success, reduces redundant communication, lowers execution cost, and increases robustness compared to fixed multi-agent collaboration patterns.

FedAgenticRAG: Privacy-Preserving Multi-Agent Reasoning over Distributed Knowledge Bases

This thesis investigates a privacy-preserving agentic Retrieval-Augmented Generation framework for reasoning over distributed knowledge bases without centralizing sensitive documents. In many real-world settings, relevant knowledge is distributed across different organizations, databases, departments, or user-owned collections, where direct data sharing is not possible. The proposed system uses multiple specialized LLM-based agents, such as local retrievers, summarizers, validators, and reasoning agents, to process knowledge locally and exchange only controlled intermediate representations, evidence summaries, or model updates. The work explores how federated or decentralized coordination mechanisms can support multi-step reasoning across private knowledge sources while preserving data ownership and access constraints. The thesis will evaluate the system in terms of answer quality, evidence grounding, privacy preservation, communication cost, and robustness compared to centralized and single-agent RAG baselines.

FedGuardAgent: Federated Safety Policy Learning for Autonomous LLM Agents

This thesis investigates how autonomous LLM-based agents can learn and improve safety policies across distributed private environments without sharing sensitive interaction logs or task data. As LLM agents increasingly interact with tools, files, APIs, code execution environments, and external systems, they require safeguards that decide when an action is allowed, risky, should be modified, or must be blocked. The project focuses on training lightweight safety components, such as risk classifiers, policy checkers, action filters, or guard agents, from locally observed agent behavior and safety outcomes. Federated learning is used to aggregate safety-policy improvements across multiple clients while raw conversations, tool calls, documents, and execution traces remain local. The thesis will evaluate whether federated safety learning improves unsafe-action detection, reduces policy violations, preserves useful task completion, and generalizes across heterogeneous agent environments.

FedMARL-Agent: Federated Multi-Agent Reinforcement Learning for Privacy-Preserving LLM Agent Coordination

This thesis investigates how multiple LLM-based agents can learn better coordination strategies across distributed private environments using federated multi-agent reinforcement learning. In complex agentic systems, several agents may interact through roles such as planner, executor, retriever, tool user, critic, validator, or summarizer. Their overall performance depends on coordination decisions, including task allocation, turn-taking, communication, conflict resolution, and stopping behavior. The project studies how local multi-agent interaction logs and reward signals can be used to train lightweight coordination policies while keeping raw conversations, task data, tool outputs, and execution traces local. Federated learning is used to aggregate policy updates across clients, enabling privacy-preserving improvement of multi-agent coordination. The thesis will evaluate whether federated reinforcement learning improves task success, reduces redundant communication, lowers execution cost, and generalizes across heterogeneous agent environments compared to fixed coordination protocols and local-only learning.

Containerized Edge AI: Deploying AI Services with Lightweight Kubernetes

This thesis investigates how AI inference services can be deployed, managed, and evaluated in edge computing environments using lightweight container orchestration. As AI applications move closer to sensors, users, and embedded devices, edge platforms must support reliable execution under limited compute, memory, storage, and network conditions. Container technologies and lightweight Kubernetes distributions such as K3s or MicroK8s provide a promising approach for packaging, scaling, updating, and monitoring AI services outside traditional cloud data centers. The project focuses on building a reproducible edge AI deployment environment in which one or more selected AI inference services are containerized and deployed on a small edge setup or simulated edge cluster. The student will study the practical trade-offs between simple container-based deployment and Kubernetes-based orchestration, including how orchestration affects startup time, inference latency, resource consumption, scalability, service monitoring, and update mechanisms. The thesis will evaluate whether lightweight Kubernetes provides practical benefits for operating AI services at the edge compared to simpler deployment approaches. The expected outcome is a working prototype, a reproducible deployment workflow, and a practical analysis of the advantages and limitations of container orchestration for edge AI workloads.

Deployment and Evaluation of Lightweight AI Models on Edge and Embedded Systems

This thesis investigates how lightweight AI models can be implemented, optimized, and evaluated on resource-constrained edge and embedded systems. Many practical AI applications require local processing close to sensors, devices, or users in order to reduce latency, limit data transfer, improve privacy, or operate without continuous cloud connectivity. However, embedded and edge platforms often provide limited memory, compute power, energy availability, and hardware acceleration compared to conventional servers. The project focuses on selecting a representative AI task, such as image classification, sensor-data analysis, anomaly detection, or simple signal processing, and deploying one or more suitable models on an edge or embedded platform. The student may investigate techniques such as quantization, pruning, TinyML, lightweight neural networks, classical machine learning baselines, or hardware-aware inference optimization. The implementation will be evaluated with respect to accuracy, inference latency, memory usage, resource consumption, deployment complexity, and robustness under constrained runtime conditions. The thesis will compare different model and deployment choices to identify practical trade-offs for AI processing on small devices. The expected outcome is a reproducible prototype and a structured evaluation that provides guidance on selecting and deploying AI models for edge and embedded environments.

Implementation of a preCICE Adapter for the particle transport simulator LIGGGHTS

preCICE, as already presented at the GöHPCoffee, is a multiphysics framework that allows various simulation codes to be combined into coupled simulations. These can include both coupled thermal problems and fluid-structure interaction. So far, it is not possible to perform a coupled particle simulation using preCICE, since the only particle solver with an adapter is not publicly available. The aim of this thesis is to address this limitation by implementing a preCICE adapter for the particle solver LIGGGHTS-PFM. One possibility could be to modify the existing OpenFOAM adapter for preCICE. In addition, the thesis will compare the achievable performance with other coupling libraries using LIGGGHTS and its derivatives. General programming experience is required. Knowledge of simulation technology and particle transport, especially of LIGGGHTS, is beneficial but not mandatory.

Framework for automated ML and empirical model generation

Although drilling technology traditionally originates from the oil and gas sector, it plays a crucial role in the emerging fields of carbon capture and storage, geothermal energy, and hydrogen storage. For these new fields to be widely adopted, it is crucial to optimize wellbore construction costs. In my research, I have used mathematical models, both statistical and empirical, to replicate scenarios from previous drilling projects. In my previous paper "Framework for automated generation of real-time rate of penetration models" (doi:10.1016/j.petrol.2022.110369), I created a framework for the automatic parametrization of models for a single variable based on preprocessed measurement data. These models include both empirical models from the literature and models trained using machine learning algorithms from sklearn. In a recent Master thesis, a new simulation framework was developed in Python that can use the parametrized models for research and education in the drilling industry. Compared to the implementation in the paper, the new version will integrate several models from the literature to enable a more comprehensive simulation experience for both researchers and students. Upon successful completion of the project, the applicant will have gained hands-on experience with a real-world problem in the area of mathematical and ML modeling. The results are also planned to be submitted for scientific publication, so this is your chance to get your first paper published.

Checkpoint Integrity Verification in Distributed ML Training

Checkpoints are essential to reliable distributed ML training, enabling recovery from faults and fault-induced data corruption. However, as data science workloads are increasingly deployed in shared, untrusted data center environments, checkpoints stored in shared storage become vulnerable to tampering, unauthorized modification, and rollback attacks. These threats undermine both reliability and security guarantees. This work investigates the intersection of reliability and security by asking: how can we detect checkpoint tampering and ensure the integrity of critical recovery data in distributed ML pipelines? We design and implement a checkpoint integrity verification system that combines cryptographic integrity verification with provenance tracking to detect unauthorized modifications and corruption in checkpoint files (for example, HDF5 or NetCDF). The system integrates with containerized environments (Apptainer) and is evaluated against adversarial fault injection attacks. We demonstrate detection efficacy under realistic threat models and provide empirical evidence of how security hardening can protect reliability mechanisms in data center ML infrastructure. Publishing the results in a peer-reviewed journal will give you the opportunity to establish yourself as a published author early in your career.
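As a rough illustration of how cryptographic integrity verification and provenance tracking could be combined, the following sketch hashes each checkpoint file and stores the hash in an HMAC-signed record that is chained to the previous record. This is only one possible design under stated assumptions, not the system to be built in the thesis: the record layout, the function names, and the use of a single shared secret key are all hypothetical.

```python
import hashlib
import hmac
import json


def file_digest(path, chunk_size=1 << 20):
    """SHA-256 of a file, read in chunks so large checkpoints need little memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def _tag(key, step, digest, prev_tag):
    """HMAC over the record body (hypothetical layout: step, sha256, prev)."""
    body = {"step": step, "sha256": digest, "prev": prev_tag}
    payload = json.dumps(body, sort_keys=True).encode()
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def sign_record(key, step, digest, prev_tag):
    """Create a provenance record chained to the previous record's tag."""
    return {"step": step, "sha256": digest, "prev": prev_tag,
            "tag": _tag(key, step, digest, prev_tag)}


def verify_chain(key, records, paths):
    """Return True only if every record is authentic, in order, and matches its file."""
    if len(records) != len(paths):
        return False
    prev_tag, prev_step = "", -1
    for record, path in zip(records, paths):
        expected = _tag(key, record["step"], record["sha256"], record["prev"])
        if not hmac.compare_digest(expected, record["tag"]):
            return False  # record was forged or modified
        if record["prev"] != prev_tag or record["step"] <= prev_step:
            return False  # chain broken or records reordered
        if file_digest(path) != record["sha256"]:
            return False  # checkpoint content changed or swapped
        prev_tag, prev_step = record["tag"], record["step"]
    return True
```

Because each record covers the tag of its predecessor, replacing a checkpoint with an older one or reordering records fails verification. Detecting an attacker who removes the newest checkpoints together with their records would additionally require keeping the latest tag in a trusted location.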

eBPF-Based Runtime Fault Injection Framework for HPC Workloads

Understanding how machine learning workloads behave under realistic hardware faults is critical to designing resilient data science systems for data centers. However, fault injection tools often require source code modification, specialized hardware, or are specific to particular I/O libraries, limiting their applicability to diverse HPC applications. This work investigates how runtime fault injection can characterize ML workload resilience without invasive instrumentation. We present an eBPF-based fault injection framework that enables controlled injection of I/O faults, network delays, and memory corruption into containerized HPC applications at runtime, targeting HDF5 parallel I/O operations. The framework integrates with checkpoint-enabled training pipelines and is evaluated on Kubernetes clusters. We demonstrate how different fault types degrade training progress and checkpoint reliability, and provide a systematic characterization of failure modes that informs both reliability and security strategies for ML workloads in shared data center environments. Your findings will form the basis of a peer-reviewed publication, giving you the chance to publish your first paper.

LLM Poisoning Detection via Training Data Attribution

Data poisoning attacks—where malicious training samples introduce backdoors, adversarial behaviors, or degraded model performance—represent a critical security threat to machine learning systems deployed in data centers. Unlike traditional security defenses, detecting poisoning requires understanding the influence of individual training examples on model behavior. This work investigates whether gradient-based data attribution can effectively identify poisoned training data in large language models before deployment. We develop a poisoning detection system that uses influence scoring and attribution methods to rank training examples by their impact on model outputs, enabling flagging of suspicious samples that could introduce backdoors or adversarial behaviors. The system is prototyped on smaller language models and systematically evaluated against benchmark poisoning attacks including backdoor injection, trojan insertion, and adversarial fine-tuning. We provide empirical evidence on detection accuracy, false positive rates, and robustness to adaptive attacks, contributing to the understanding of how data integrity verification can be embedded into ML training pipelines. Your findings will form the basis of a peer-reviewed publication, giving you the chance to publish your first paper.
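To make the idea of gradient-based influence scoring more concrete, here is a toy sketch. It scores each training example by the dot product between its loss gradient and the loss gradient of a probe example, in the spirit of gradient-similarity attribution methods. A tiny logistic regression model stands in for a language model, and all names and data are hypothetical, so this is an illustration of the principle rather than the method the thesis will use.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def loss_gradient(weights, features, label):
    """Gradient of the logistic loss for one example with respect to the weights."""
    pred = sigmoid(sum(w * x for w, x in zip(weights, features)))
    return [(pred - label) * x for x in features]


def influence_scores(weights, train_set, probe):
    """Dot product of each training gradient with the probe gradient.

    A positive score means a gradient step on that training example would,
    to first order, also reduce the loss on the probe; a negative score
    means it would increase it.
    """
    probe_grad = loss_gradient(weights, probe[0], probe[1])
    scores = []
    for features, label in train_set:
        grad = loss_gradient(weights, features, label)
        scores.append(sum(a * b for a, b in zip(grad, probe_grad)))
    return scores


def rank_by_influence(scores):
    """Indices of training examples, most influential (by magnitude) first."""
    return sorted(range(len(scores)), key=lambda i: abs(scores[i]), reverse=True)


# Hypothetical toy data: (features, label) pairs and one probe example.
weights = [0.0, 0.0]
train_set = [([1.0, 0.0], 1), ([0.0, 1.0], 1), ([2.0, 0.0], 0)]
probe = ([1.0, 0.0], 1)

scores = influence_scores(weights, train_set, probe)
ranking = rank_by_influence(scores)
```

In a poisoning detection setting, the probe would be an input that triggers the suspected backdoor behavior, and the highest-ranked training examples would be flagged for inspection.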

Development of Text-to-SQL/XML Conversational AI for Planarian Research Database

The Rink Lab at the Max Planck Institute for Multidisciplinary Sciences investigates why some animals can regenerate lost body parts while others cannot, using planarian flatworms as a model system - species that range from being able to regrow an entire organism from tiny fragments to those with limited regenerative ability. We have developed a comprehensive database containing planarian genome, transcriptomes, functional annotations, and gene expression data. This Master’s thesis project will focus on creating a Text-to-SQL/XML conversational AI system that enables natural language queries of the database, including the implementation and fine-tuning of NLP/AI models for query translation and intent recognition, systematic evaluation of different approaches, and the development of a GUI-based conversational interface for intuitive database exploration.