# Experiments in the scope of Seminar with Practical: Scalable Computing Systems and Applications in AI, Big Data and HPC
## Topic: LLM Benchmarking Frameworks and their limitations
#### Author: Niclas Unger

All the code for the performed experiments can be found in the lm_eval_harness_experiments.ipynb file.<br>
An API key for the GWDG ChatAI platform is required to run the experiments.<br>
All evaluation results follow the naming pattern gsm8k_....json .<br>
If a rerun of the experiments is desired, the yaml files containing the custom prompt-wrapping settings should be moved into lm-evaluation-harness/lm_eval/tasks/gsm8k (inside the cloned GitHub repo).<br>
After cloning the repository, it may be necessary to replace the api_models.py file in lm-evaluation-harness/lm_eval/models with the file provided here, to avoid running into API rate-limit issues.<br>
Almost all experiments were performed only on the first 50 questions of the GSM8K benchmark, and mostly with 1-shot prompts, since higher settings quickly exceed the API rate limit.<br>
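A rerun with the settings above might look like the following sketch using the lm-evaluation-harness CLI. The model name and base URL below are placeholders, not taken from this repo; substitute the model and endpoint of your GWDG ChatAI account:

```shell
# Sketch of a single evaluation run, assuming lm-evaluation-harness is installed
# (pip install lm-eval). The model name and base_url are placeholders.
export OPENAI_API_KEY="<your GWDG ChatAI key>"

lm_eval \
  --model openai-chat-completions \
  --model_args model=<chatai-model-name>,base_url=<chatai-endpoint>/v1 \
  --tasks gsm8k \
  --num_fewshot 1 \
  --limit 50 \
  --output_path gsm8k_results.json
```

`--limit 50` restricts the run to the first 50 questions and `--num_fewshot 1` selects 1-shot prompting, matching the settings used in the experiments.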
