Publications


  • A Similarity Study of I/O Traces via String Kernels (Raul Torres, Julian Kunkel, Manuel F. Dolz, Thomas Ludwig), In The Journal of Supercomputing, pp. 1–13, Springer, ISSN: 0920-8542, 2018-07-03
    BibTeX DOI
    Abstract: Understanding I/O for data-intense applications is the foundation for the optimization of these applications. Classifying the applications according to the expressed I/O access pattern eases the analysis. An access pattern can be seen as a fingerprint of an application. In this paper, we address the classification of traces by first converting them into a weighted string representation. Since string objects can be easily compared using kernel methods, we explore their use for fingerprinting I/O patterns. To improve accuracy, we propose a novel string kernel function called the kast2 spectrum kernel. The similarity matrices, obtained after applying the mentioned kernel over a set of examples from a real application, were analyzed using kernel principal component analysis and hierarchical clustering. The evaluation showed that two out of four I/O access pattern groups were completely identified, while the other two groups formed a single cluster due to the intrinsic similarity of their members. The proposed strategy can promisingly be applied to other similarity problems involving tree-like structured data.
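    A minimal sketch of the string-kernel idea behind this paper (for illustration only: a plain k-spectrum kernel rather than the paper's kast2 kernel, and the toy traces and operation encoding are assumptions):
```python
# Illustrative k-spectrum string kernel over I/O traces encoded as strings,
# plus the normalized similarity matrix used as input for clustering or kernel PCA.
from collections import Counter
import math

def spectrum(s, k):
    """Count all substrings of length k (the k-spectrum of s)."""
    return Counter(s[i:i + k] for i in range(len(s) - k + 1))

def spectrum_kernel(a, b, k=3):
    """Inner product of the two k-spectra."""
    sa, sb = spectrum(a, k), spectrum(b, k)
    return sum(cnt * sb[sub] for sub, cnt in sa.items() if sub in sb)

def similarity_matrix(traces, k=3):
    """Normalized kernel matrix K_ij / sqrt(K_ii * K_jj)."""
    n = len(traces)
    K = [[spectrum_kernel(traces[i], traces[j], k) for j in range(n)] for i in range(n)]
    return [[K[i][j] / math.sqrt(K[i][i] * K[j][j]) for j in range(n)] for i in range(n)]

# Hypothetical traces: one character per I/O call (o=open, r=read, w=write, c=close).
traces = ["orrrrwc", "orrrrrwc", "owwwwrc", "owwwwwrc"]
for row in similarity_matrix(traces):
    print(["%.2f" % v for v in row])
```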
  • Poster: Advanced Computation and I/O Methods for Earth-System Simulations (AIMES) (Julian Kunkel, Thomas Ludwig, Thomas Dubos, Naoya Maruyama, Takayuki Aoki, Günther Zängl, Hisashi Yashiro, Ryuji Yoshida, Hirofumi Tomita, Masaki Satoh, Yann Meurdesoif, Nabeeh Jumah, Anastasiia Novikova, Anja Gerbes), ISC HPC, Frankfurt, Germany, 2018-06-26
    BibTeX URL PDF
    Abstract: The Advanced Computation and I/O Methods for Earth-System Simulations (AIMES) project addresses the key issues of programmability, computational efficiency and I/O limitations that are common in next-generation icosahedral earth-system models. Ultimately, the project is intended to foster development of best-practices and useful norms by cooperating on shared ideas and components. During the project, we will ensure that the developed concepts and tools are not only applicable for earth-science but for other scientific domains as well. In this poster we show the project's plan and progress during the first two years of the project lifecycle.
  • Poster: The Virtual Institute for I/O and the IO-500 (Julian Kunkel, Jay Lofstead, John Bent), ISC HPC, Frankfurt, Germany, 2018-06-26
    BibTeX URL PDF
    Abstract: The research community in high-performance computing is organized loosely. There are many distinct resources such as homepages of research groups and benchmarks. The Virtual Institute for I/O aims to provide a hub for the community and particularly newcomers to find relevant information in many directions. It hosts the comprehensive data center list (CDCL). Similarly to the top500, it contains information about supercomputers and their storage systems. I/O benchmarking, and particularly the intercomparison of measured performance between sites, is tricky as there are more hardware components involved and configurations to take into account. Therefore, together with the community, we standardized an HPC I/O benchmark, the IO-500 benchmark, for which the first list was released during Supercomputing in November 2017. This poster introduces the Virtual Institute for I/O, the high-performance storage list and the effort for the IO-500, all of which are unfunded community projects.
  • Poster: A user-controlled GGDML Code Translation Technique for Performance Portability of Earth System Models (Nabeeh Jumah, Julian Kunkel), ISC HPC, Frankfurt, Germany, 2018-06-26
    BibTeX URL PDF
    Abstract: Demand for high-performance computing is increasing in earth system modeling, and in natural sciences in general. Unfortunately, automatic optimizations done by compilers are not enough to make use of target machines' capabilities. Manual code adjustments are mandatory to exploit hardware capabilities. However, optimizing for one architecture may degrade performance for other architectures. This loss of portability is a challenge. Our approach involves the use of the GGDML language extensions to write a higher-level modeling code, and the use of a user-controlled source-to-source translation technique. Translating the code results in an optimized version for the target machine. The contributions of this poster are: 1) the use of a highly-configurable code translation technique to transform higher-level code into target-machine-optimized code; 2) evaluation of code transformation for multi-core and GPU-based machines, in both single and multi-node configurations.
  • Poster: Performance Conscious HPC (PeCoH) - 2018 (Kai Himstedt, Nathanael Hübbe, Sandra Schröder, Hendryk Bockelmann, Michael Kuhn, Julian Kunkel, Thomas Ludwig, Stephan Olbrich, Matthias Riebisch, Markus Stammberger, Hinnerk Stüben), ISC HPC, Frankfurt, Germany, 2018-06-26
    BibTeX URL PDF
    Abstract: In PeCoH, we establish the Hamburg HPC Competence Center (HHCC) as a virtual institution, which coordinates and fosters joint performance engineering activities between the local compute centers DKRZ, RRZ and TUHH RZ. Together, we will implement user services to support performance engineering on a basic level and provide a basis for co-development, user education and dissemination of performance engineering concepts. We will evaluate methods to raise user awareness for performance engineering and bring them into production environments in order to tune standard software as well as individual software. Specifically, we address cost-awareness, provide representative success stories, and provide basic and advanced HPC knowledge as online content resulting in a certification system.
  • Poster: International HPC Certification Program (Julian Kunkel, Kai Himstedt, Weronika Filinger, Jean-Thomas Acquaviva, William Jalby, Lev Lafayette), ISC HPC, Frankfurt, Germany, 2018-06-26
    BibTeX URL PDF
    Abstract: The HPC community has always considered the training of new and existing HPC practitioners to be of high importance to its growth. The significance of training will increase even further in the era of Exascale when HPC encompasses even more scientific disciplines. This diversification of HPC practitioners challenges the traditional training approaches, which are not able to satisfy the specific needs of users, often coming from non-traditional HPC disciplines and only interested in learning a particular set of skills. HPC centres are struggling to identify and overcome the gaps in users’ knowledge. How should we support prospective and existing users who are not aware of their own knowledge gaps?
    We are working towards the establishment of an International HPC Certification program that would clearly categorize, define and examine HPC competences, similarly to a school curriculum.
    Ultimately, we aim for the certificates to be recognized and respected by the HPC community and industry.
  • Poster: Automatic Profiling for Climate Modeling (Anja Gerbes, Nabeeh Jumah, Julian Kunkel), Euro LLVM, Bristol, United Kingdom, 2018-04-17
    BibTeX URL PDF
    Abstract: Some applications, like climate modeling, are time consuming and include lengthy simulations. Hence, their code is performance sensitive. Spending more time on the optimization of specific code parts can improve total performance. Profiling an application is a well-known technique to do that. Many tools are available for developers to get performance information about their code. Our Python package, the Performance Analysis and Source-Code Instrumentation Toolsuite (PASCIT), enables automatic instrumentation of a user’s source code: developers mark the parts that they need performance information about. We present an effort to profile climate modeling codes with two alternative methods: • usage of the GGDML translation tool to directly mark the computational kernels of an application for profiling; • usage of the GGDML translation tool to generate a serial code in a first step and then use LLVM/Clang to instrument some code parts with a profiler’s directives. The resulting codes are profiled with the LIKWID profiler. Alternatively, we use perf and OProfile’s ocount & operf to measure hardware characteristics. The resulting performance reports, which visualize the measured hardware performance counters as radar charts, LaTeX tables and box plots, help scientists understand the bottlenecks of their codes.
  • A Survey of Storage Systems for High-Performance Computing (Jakob Lüttgau, Michael Kuhn, Kira Duwe, Yevhen Alforov, Eugen Betke, Julian Kunkel, Thomas Ludwig), In Supercomputing Frontiers and Innovations, Series: Volume 5, Number 1, pp. 31–58, (Editors: Jack Dongarra, Vladimir Voevodin), Publishing Center of South Ural State University (454080, Lenin prospekt, 76, Chelyabinsk, Russia), 2018-04
    BibTeX URL DOI PDF
    Abstract: In current supercomputers, storage is typically provided by parallel distributed file systems for hot data and tape archives for cold data. These file systems are often compatible with local file systems due to their use of the POSIX interface and semantics, which eases development and debugging because applications can easily run both on workstations and supercomputers. There is a wide variety of file systems to choose from, each tuned for different use cases and implementing different optimizations. However, the overall application performance is often held back by I/O bottlenecks due to insufficient performance of file systems or I/O libraries for highly parallel workloads. Performance problems are dealt with using novel storage hardware technologies as well as alternative I/O semantics and interfaces. These approaches have to be integrated into the storage stack seamlessly to make them convenient to use. Upcoming storage systems abandon the traditional POSIX interface and semantics in favor of alternative concepts such as object and key-value storage; moreover, they heavily rely on technologies such as NVM and burst buffers to improve performance. Additional tiers of storage hardware will increase the importance of hierarchical storage management. Many of these changes will be disruptive and require application developers to rethink their approaches to data management and I/O. A thorough understanding of today's storage infrastructures, including their strengths and weaknesses, is crucially important for designing and implementing scalable storage systems suitable for the demands of exascale computing.
  • Benefit of DDN's IME-Fuse and IME-Lustre File Systems for I/O Intensive HPC Applications (Eugen Betke, Julian Kunkel), Lecture Notes in Computer Science, to appear, (Editors: Rio Yokota, Michele Weiland, David Keyes, Carsten Trinitis), Springer, ISC Team, WOPSSS workshop, ISC HPC, Frankfurt, Germany, 2018
    BibTeX
    Abstract: Many scientific applications are limited by the I/O performance offered by parallel file systems on conventional storage systems. Flash-based burst buffers provide significantly better performance than HDD-backed storage, but at the expense of capacity. Burst buffers are considered as the next step towards achieving wire-speed of the interconnect and providing more predictable low-latency I/O, which are the holy grail of storage. A critical evaluation of storage technology is mandatory as there is no long-term experience with performance behavior for particular application scenarios. The evaluation enables data centers to choose the right products and system architects to integrate them into HPC architectures. This paper investigates the native performance of DDN-IME, a flash-based burst buffer solution. Then, it takes a closer look at the IME-FUSE file system, which uses IME as a burst buffer and a Lustre file system as back-end. Finally, by utilizing a NetCDF benchmark, it estimates the performance benefit for climate applications.
  • Cost and Performance Modeling for Earth System Data Management and Beyond (Jakob Lüttgau, Julian Kunkel), Lecture Notes in Computer Science, to appear, (Editors: Rio Yokota, Michele Weiland, David Keyes, Carsten Trinitis), Springer, ISC Team, HPC-IODC workshop, ISC HPC, Frankfurt, Germany, 2018
    BibTeX
    Abstract: Current and anticipated storage environments confront domain scientists and data center operators with usability, performance and cost challenges. The amount of data upcoming systems will be required to handle is expected to grow exponentially, mainly due to increasing resolution and affordable compute power. Unfortunately, the relationship between cost and performance is not always well understood, requiring considerable effort for educated procurement. Within the Centre of Excellence in Simulation of Weather and Climate in Europe (ESiWACE), models to better understand the cost and performance of current and future systems are being explored. This paper presents models and methodology focusing on, but not limited to, data centers used in the context of climate and numerical weather prediction. The paper concludes with a case study of alternative deployment strategies and outlines the challenges in anticipating their impact on cost and performance. By publishing these early results, we would like to make the case to work towards standard models and methodologies collaboratively as a community, to create sufficient incentives for vendors to provide specifications in formats which are compatible with these modeling tools. In addition to that, we see application for such formalized models and information in I/O-related middleware, which is expected to make automated but reasonable decisions in increasingly heterogeneous data centers.
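    A toy illustration of the kind of tiered cost/performance model discussed here (all tier names, prices and rates below are invented for the example, not figures from the paper):
```python
import math

# name: (capacity per unit in TB, cost per unit in EUR, throughput per unit in GB/s)
tiers = {
    "nvme_burst_buffer": (10,    8000, 5.0),
    "disk_parallel_fs":  (100,  20000, 1.5),
    "tape_archive":      (1000, 30000, 0.3),
}

def provision(tier, capacity_tb, bandwidth_gbs):
    """Units needed to satisfy both a capacity and a bandwidth target, and their cost."""
    cap, cost, bw = tiers[tier]
    units = max(math.ceil(capacity_tb / cap), math.ceil(bandwidth_gbs / bw))
    return units, units * cost

for tier in tiers:
    units, total = provision(tier, capacity_tb=2000, bandwidth_gbs=50)
    print(f"{tier:>18}: {units:4d} units, {total:>10,d} EUR")
```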
  • Towards an HPC Certification Program (Julian Kunkel, Kai Himstedt, Nathanael Hübbe, Hinnerk Stüben, Sandra Schröder, Michael Kuhn, Matthias Riebisch, Stephan Olbrich, Thomas Ludwig, Weronika Filinger, Jean-Thomas Acquaviva, Anja Gerbes, Lev Lafayette), In Journal of Computational Science Education, to appear, 2018
    BibTeX
    Abstract: The HPC community has always considered the training of new and existing HPC practitioners to be of high importance to its growth. This diversification of HPC practitioners challenges the traditional training approaches, which are not able to satisfy the specific needs of users, often coming from non-traditional HPC disciplines, and only interested in learning a particular set of competences. Challenges for HPC centres are to identify and overcome the gaps in users’ knowledge, while users struggle to identify relevant skills. We have developed a first version of an HPC certification program that would clearly categorize, define, and examine competences. Making clear what skills are required of or recommended for a competent HPC user would benefit both the HPC service providers and practitioners. Moreover, it would allow centres to bundle together skills that are most beneficial for specific user roles and scientific domains. From the perspective of content providers, existing training material can be mapped to competences, allowing users to quickly identify and learn the skills they require. Finally, certificates recognized by the whole HPC community would simplify the inter-comparison of independently offered courses and provide additional incentive for participation.
  • Toward Understanding I/O Behavior in HPC Workflows (Jakob Lüttgau, Shane Snyder, Philip Carns, Justin M. Wozniak, Julian Kunkel, Thomas Ludwig), to appear, IEEE, PDSW-DISCS, Dallas, Texas, 2018
    BibTeX
    Abstract: Scientific discovery increasingly depends on complex workflows consisting of multiple phases and sometimes millions of parallelizable tasks or pipelines. These workflows access storage resources for a variety of purposes, including preprocessing, simulation output, and postprocessing steps. Unfortunately, most workflow models focus on the scheduling and allocation of computational resources for tasks while the impact on storage systems remains a secondary objective and an open research question. I/O performance is not usually accounted for in workflow telemetry reported to users. In this paper, we present an approach to augment the I/O efficiency of the individual tasks of workflows by combining workflow description frameworks with system I/O telemetry data. A conceptual architecture and a prototype implementation for HPC data center deployments are introduced. We also identify and discuss challenges that will need to be addressed by workflow management and monitoring systems for HPC in the future. We demonstrate how real-world applications and workflows could benefit from the approach, and we show how the approach helps communicate performance-tuning guidance to users.
  • Comparison of Clang Abstract Syntax Trees using String Kernels (Raul Torres, Julian Kunkel, Manuel F. Dolz, Thomas Ludwig), to appear, CADO 2018, Orleans, France, 2018
    BibTeX
    Abstract: Abstract Syntax Trees (ASTs) are intermediate representations widely used by compiler frameworks. One of their strengths is that they can be used to determine the similarity among a collection of programs. In this paper we propose a novel comparison method that converts ASTs into weighted strings in order to get similarity matrices and quantify the level of correlation among codes. To evaluate the approach, we leveraged the corresponding strings derived from the Clang ASTs of a set of 100 source code examples written in C. Our kernel and two other string kernels from the literature were used to obtain similarity matrices among those examples. Next, we used Hierarchical Clustering to visualize the results. Our solution was able to identify different clusters formed by examples that shared similar semantics. We demonstrated that the proposed strategy can be promisingly applied to similarity problems involving trees or strings.
  • Towards Green Scientific Data Compression Through High-Level I/O Interfaces (Yevhen Alforov, Anastasiia Novikova, Michael Kuhn, Julian Kunkel, Thomas Ludwig), to appear, Springer, SBAC-PAD 2018, Lyon, France, 2018
    BibTeX
    Abstract: Every HPC system today has to cope with a deluge of data generated by scientific applications, simulations or large-scale experiments. The upscaling of supercomputer systems and infrastructures generally results in a dramatic increase of their energy consumption. In this paper, we argue that techniques like data compression can lead to significant gains in terms of power efficiency by reducing both network and storage requirements. To that end, we propose a novel methodology for achieving on-the-fly intelligent determination of energy efficient data reduction for a given data set by leveraging state-of-the-art compression algorithms and metadata at application-level I/O. We motivate our work by analyzing the energy and storage saving needs of real-life scientific HPC applications, and review the various compression techniques that can be applied. We find that the resulting data reduction can decrease the data volume transferred and stored by as much as 80% in some cases, consequently leading to significant savings in storage and networking costs.
  • Performance Portability of Earth System Models with User-Controlled GGDML code Translation (Nabeeh Jum'ah, Julian Kunkel), Lecture Notes in Computer Science, to appear, (Editors: Rio Yokota, Michele Weiland, David Keyes, Carsten Trinitis), Springer, ISC Team, P3MA workshop, ISC HPC, Frankfurt, Germany, 2018
    BibTeX
    Abstract: The increasing need for performance of earth system modeling and other scientific domains pushes the computing technologies in diverse architectural directions. The development of models needs technical expertise and skills of using tools that are able to exploit the hardware capabilities. The heterogeneity of architectures complicates the development and the maintainability of the models. To improve the software development process of earth system models, we provide an approach that simplifies the code maintainability by fostering separation of concerns while providing performance portability. We propose the use of high-level language extensions that reflect scientific concepts. Scientists can use the programming language of their choice to develop models; however, they can optionally use the language extensions wherever they need. The code translation is driven by configurations that are separated from the model source code. These configurations are prepared by scientific programmers to optimally use the machine’s features. The main contribution of this paper is the demonstration of a user-controlled source-to-source translation technique of earth system models that are written with higher-level semantics. We discuss a flexible code translation technique that is driven by the users through a configuration input that is prepared especially to transform the code, and we use this technique to produce OpenMP- or OpenACC-enabled code, besides MPI, to support multi-node configurations.
  • Tools for Analyzing Parallel I/O (Julian Kunkel, Eugen Betke, Matt Bryson, Philip Carns, Rosemary Francis, Wolfgang Frings, Roland Laifer, Sandra Mendez), Lecture Notes in Computer Science, to appear, (Editors: Rio Yokota, Michele Weiland, David Keyes, Carsten Trinitis), Springer, ISC Team, HPC-IODC workshop, ISC HPC, Frankfurt, Germany, 2018
    BibTeX
    Abstract: Parallel application I/O performance often does not meet user expectations. Additionally, slight access pattern modifications may lead to significant changes in performance due to complex interactions between hardware and software. These issues call for sophisticated tools to capture, analyze, understand, and tune application I/O. In this paper, we highlight advances in monitoring tools to help address these issues. We also describe best practices, identify issues in measurement and analysis, and provide practical approaches to translate parallel I/O analysis into actionable outcomes for users, facility operators, and researchers.
  • Understanding Metadata Latency with MDWorkbench (Julian Kunkel, George S. Markomanolis), Lecture Notes in Computer Science, to appear, (Editors: Rio Yokota, Michele Weiland, David Keyes, Carsten Trinitis), Springer, ISC Team, WOPSSS workshop, ISC HPC, Frankfurt, Germany, 2018
    BibTeX
    Abstract: While parallel file systems often satisfy the need of applications with bulk synchronous I/O, they lack capabilities of dealing with metadata-intense workloads. Typically, in procurements, the focus lies on the aggregated metadata throughput using the MDTest benchmark. However, metadata performance is crucial for interactive use. Metadata benchmarks involve even more parameters compared to I/O benchmarks. There are several aspects that are currently uncovered and, therefore, not in the focus of vendors to investigate, particularly response latency and interactive workloads operating on a working set of data. The lack of capabilities from file systems can be observed when looking at the IO-500 list, where metadata performance between the best and worst systems does not differ significantly. In this paper, we introduce a new benchmark called MDWorkbench which generates a reproducible workload emulating many concurrent users or – in an alternative view – queuing systems. This benchmark provides a detailed latency profile, overcomes caching issues, and provides a method to assess the quality of the observed throughput. We evaluate the benchmark on state-of-the-art parallel file systems with GPFS (IBM Spectrum Scale), Lustre, Cray’s Datawarp, and DDN IME, and conclude that we can reveal characteristics that could not be identified before.
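    A minimal sketch of the latency-profile idea (this is not MDWorkbench itself; the operation mix and file count are arbitrary): create, stat and delete many small files and report per-operation latency percentiles rather than a single aggregate throughput number:
```python
# Per-operation metadata latency microbenchmark on a temporary directory.
import os, tempfile, time, statistics

def bench(n=1000):
    latencies = {"create": [], "stat": [], "unlink": []}
    with tempfile.TemporaryDirectory() as d:
        for i in range(n):
            path = os.path.join(d, f"obj{i}")
            t0 = time.perf_counter()
            with open(path, "wb") as f:
                f.write(b"x")
            t1 = time.perf_counter()
            os.stat(path)
            t2 = time.perf_counter()
            os.unlink(path)
            t3 = time.perf_counter()
            latencies["create"].append(t1 - t0)
            latencies["stat"].append(t2 - t1)
            latencies["unlink"].append(t3 - t2)
    for op, vals in latencies.items():
        vals.sort()
        print(f"{op:>7}: median {statistics.median(vals) * 1e6:8.1f} us, "
              f"p99 {vals[int(0.99 * len(vals))] * 1e6:8.1f} us")

bench()
```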
  • Towards Decoupling the Selection of Compression Algorithms from Quality Constraints – an Investigation of Lossy Compression Efficiency (Julian Kunkel, Anastasiia Novikova, Eugen Betke), In Supercomputing Frontiers and Innovations, Series: Volume 4, Number 4, pp. 17–33, (Editors: Jack Dongarra, Vladimir Voevodin), 2017-12
    BibTeX URL DOI PDF
    Abstract: Data intense scientific domains use data compression to reduce the storage space needed. Lossless data compression preserves information accurately but lossy data compression can achieve much higher compression rates depending on the tolerable error margins. There are many ways of defining precision and of exploiting this knowledge; therefore, the field of lossy compression is subject to active research. From the perspective of a scientist, only the qualitative definition of the implied loss of data precision should matter. With the Scientific Compression Library (SCIL), we are developing a meta-compressor that allows users to define various quantities for acceptable error and expected performance behavior. The library then picks a suitable chain of algorithms satisfying the user's requirements; the ongoing work is a preliminary stage for the design of an adaptive selector. This approach is a crucial step towards a scientifically safe use of much-needed lossy data compression, because it disentangles the task of determining scientific characteristics of tolerable noise from the task of determining an optimal compression strategy. Future algorithms can be used without changing application code. In this paper, we evaluate various lossy compression algorithms for compressing different scientific datasets (Isabel, ECHAM6), and focus on the analysis of synthetically created data that serves as a blueprint for many observed datasets. We also briefly describe the available quantities of SCIL to define data precision and introduce two efficient compression algorithms for individual data points. This shows that the best algorithm depends on user settings and data properties.
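    An illustrative sketch of the selection idea, not SCIL's actual interface: given an absolute error tolerance, compare a lossless codec against a simple quantize-then-compress candidate and keep the smallest result that still respects the tolerance (codec choices and the synthetic field are assumptions):
```python
import zlib
import numpy as np

def lossless(data):
    """Lossless baseline: zlib over the raw bytes, zero error."""
    return len(zlib.compress(data.tobytes())), 0.0

def quantized(data, abs_tol):
    """Uniform quantization with step 2*abs_tol keeps the max error <= abs_tol."""
    step = 2 * abs_tol
    q = np.round(data / step).astype(np.int32)
    err = float(np.max(np.abs(q * step - data)))
    return len(zlib.compress(q.tobytes())), err

def select(data, abs_tol):
    """Pick the candidate with the smallest output that respects the tolerance."""
    candidates = {"zlib": lossless(data), "quantize+zlib": quantized(data, abs_tol)}
    ok = {name: size for name, (size, err) in candidates.items() if err <= abs_tol}
    best = min(ok, key=ok.get)
    return best, ok[best]

rng = np.random.default_rng(0)
field = np.cumsum(rng.normal(size=100_000)).astype(np.float64)  # smooth synthetic field
print(select(field, abs_tol=1e-3))
```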
  • Poster: Toward Decoupling the Selection of Compression Algorithms from Quality Constraints (Julian Kunkel, Anastasiia Novikova, Eugen Betke), SC17, Denver, CO, USA, 2017-11-14
    BibTeX PDF
  • Understanding Hardware and Software Metrics with respect to Power Consumption (Julian Kunkel, Manuel F. Dolz), In Sustainable Computing: Informatics and Systems, Series: Sustainable Computing, (Editors: Ishfaq Ahmad), Elsevier, ISSN: 2210-5379, 2017-11-04
    BibTeX URL DOI
    Abstract: Analyzing and understanding energy consumption of applications is an important task which allows researchers to develop novel strategies for optimizing and conserving energy. A typical methodology is to reduce the complexity of real systems and applications by developing a simplified performance model from observed behavior. In the literature, many of these models are known; however, inherent to any simplification is that some measured data cannot be explained well. While analyzing a model's accuracy, it is highly important to identify the properties of such prediction errors. Such knowledge can then be used to improve the model or to optimize the benchmarks used for training the model parameters. For such a benchmark suite, it is important that the benchmarks cover all the aspects of system behavior to avoid overfitting of the model for certain scenarios. It is not trivial to identify the overlap between the benchmarks and answer the question if a benchmark causes different hardware behavior. Inspection of all the available hardware and software counters by humans is a tedious task given the large amount of real-time data they produce. In this paper, we utilize statistical techniques to foster understanding and to investigate hardware counters as potential indicators of energy behavior. We capture hardware and software counters including power with a fixed frequency and analyze the resulting timelines of these measurements. The concepts introduced can be applied to any set of measurements in order to compare them to another set of measurements. We demonstrate how these techniques can aid in identifying interesting behavior and significantly reduce the number of features that must be inspected. Next, we propose counters that can potentially be used for building linear models for prediction with a relative accuracy of 3%. Finally, we validate the completeness of a benchmark suite, from the point of view of using the available architectural components, for generating accurate models.
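    A sketch of the feature-screening step described above, on synthetic data (the counter timelines and the power relationship are invented; real measurements would come from hardware counters and a power meter):
```python
# Rank hardware counters by correlation with power, then fit a least-squares
# linear model on the top-ranked counters and report its relative error.
import numpy as np

rng = np.random.default_rng(1)
samples, n_counters = 2000, 40
counters = rng.random((samples, n_counters))
power = 50 + 30 * counters[:, 3] + 15 * counters[:, 17] + rng.normal(0, 1, samples)

# Screen features: absolute Pearson correlation of each counter with power.
corr = [abs(np.corrcoef(counters[:, i], power)[0, 1]) for i in range(n_counters)]
top = np.argsort(corr)[-2:]                       # keep the two strongest indicators
X = np.column_stack([counters[:, top], np.ones(samples)])

coef, *_ = np.linalg.lstsq(X, power, rcond=None)  # least-squares linear model
pred = X @ coef
rel_err = np.mean(np.abs(pred - power) / power)
print("selected counters:", top, "mean relative error: %.1f%%" % (100 * rel_err))
```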
  • Poster: Icosahedral Modeling with GGDML (Nabeeh Jumah, Julian Kunkel, Günther Zängl, Hisashi Yashiro, Thomas Dubos, Yann Meurdesoif), DKRZ user workshop 2017, Hamburg, Germany, 2017-10-09
    BibTeX PDF
    Abstract: The atmospheric and climate sciences and the natural sciences in general increasingly demand higher performance computing. Unfortunately, the gap between the scientific modeling and the diversity of the hardware architectures that manufacturers provide to fulfill the needs for performance cannot be filled by general-purpose languages and compilers. Scientists need to manually optimize their models to exploit the machine capabilities. This leads to code redundancies when targeting different machines. This is not trivial when considering heterogeneous computing as a basis for exascale computing.
    In order to provide performance portability to the icosahedral climate modeling we have developed a set of higher-level language extensions we call GGDML. The extensions provide semantically higher-level constructs allowing scientific problems to be expressed with scientific concepts. This eliminates the need to explicitly provide lower-level machine-dependent code. Scientists still use the general-purpose language. The GGDML code is translated by a source-to-source translation tool that optimizes the generated code for a specific machine. The translation process is driven by configurations that are provided independently from the source code.
    In this poster we review some GGDML extensions and we focus mainly on the configurable code translation of the higher-level code.
  • GGDML: Icosahedral Models Language Extensions (Nabeeh Jumah, Julian Kunkel, Günther Zängl, Hisashi Yashiro, Thomas Dubos, Yann Meurdesoif), In Journal of Computer Science Technology Updates, Series: Volume 4, Number 1, pp. 1–10, Cosmos Scholars Publishing House, 2017-06-21
    BibTeX URL DOI
    Abstract: The optimization opportunities of a code base are not completely exploited by compilers. In fact, there are optimizations that must be done within the source code. Hence, if the code developers skip some details, some performance is lost. Thus, the use of a general-purpose language to develop performance-demanding software (e.g. climate models) needs more care from the developers. They should take into account hardware details of the target machine.
    Besides, code written for high performance on one machine will show lower performance on another one. The developers usually write multiple optimized sections or even code versions for the different target machines. Such codes are complex and hard to maintain.
    In this article we introduce a higher-level code development approach, where we develop a set of extensions to the language that is used to write a model’s code. Our extensions form a domain-specific language (DSL) that abstracts domain concepts and leaves the lower level details to a configurable source-to-source translation process.
    The purpose of the developed extensions is to support icosahedral climate/atmospheric model development. We have started with the three icosahedral models: DYNAMICO, ICON, and NICAM. The collaboration with scientists from the weather/climate sciences enabled agreed-upon extensions. When suggesting an extension, we kept in mind that it should represent a higher-level domain-based concept and carry no lower-level details.
    The introduced DSL (GGDML, the General Grid Definition and Manipulation Language) hides optimization details like memory layout. It reduces the code size of a model to less than one third of its original size in terms of lines of code. The development costs of a model with GGDML are therefore reduced significantly.
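    An illustrative sketch of configuration-driven source-to-source translation in the spirit of GGDML (the FOREACH directive, configuration keys and generated C below are invented for this example and are not actual GGDML syntax):
```python
# Expand a higher-level "FOREACH cell" statement into plain, OpenMP- or
# OpenACC-annotated C, depending on a configuration kept apart from the model code.
import re

MODEL_CODE = """\
FOREACH cell
  T[cell] = T[cell] + dt * flux[cell];
END FOREACH
"""

CONFIGS = {
    "multicore": {"pragma": "#pragma omp parallel for"},
    "gpu":       {"pragma": "#pragma acc parallel loop"},
    "serial":    {"pragma": ""},
}

def translate(code, target):
    cfg = CONFIGS[target]
    body = re.search(r"FOREACH cell\n(.*?)END FOREACH", code, re.S).group(1)
    lines = [cfg["pragma"]] if cfg["pragma"] else []
    lines.append("for (int cell = 0; cell < n_cells; cell++) {")
    lines += ["  " + l for l in body.strip().splitlines()]
    lines.append("}")
    return "\n".join(lines)

print(translate(MODEL_CODE, "multicore"))
```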
  • Poster: Towards Performance Portability for Atmospheric and Climate Models with the GGDML DSL (Nabeeh Jumah, Julian Kunkel, Günther Zängl, Hisashi Yashiro, Thomas Dubos, Yann Meurdesoif), ISC 2017, Frankfurt, Germany, 2017-06-20
    BibTeX URL
    Abstract: Demand for high-performance computing is increasing in atmospheric and climate sciences, and in natural sciences in general. Unfortunately, automatic optimizations done by compilers are not enough to make use of target machines' capabilities. Manual code adjustments are mandatory to exploit hardware capabilities. However, optimizing for one architecture may degrade performance for other architectures. This loss of portability is a challenge. With GGDML we examine an approach for icosahedral-grid based climate and atmospheric models that is based on a domain-specific language (DSL) which fosters separation of concerns between domain scientists and computer scientists. Our DSL extends the Fortran language with concepts from domain science, apart from any technical descriptions such as hardware-based optimization. The approach aims to achieve high performance, portability and maintainability through a compilation infrastructure principally built upon configurations from computer scientists. Fortran code extended with novel semantics from the DSL goes through the meta-DSL based compilation procedure. This generates high-performance code that is aware of platform features, based on the provided configurations. We show that our approach reduces code significantly (to 40%) and improves readability for the models DYNAMICO, ICON and NICAM. We also show that the whole approach is viable in terms of performance portability, as it allows generating platform-optimized code with minimal configuration changes. With a few lines, we are able to switch between two different memory representations during compilation and achieve double the performance. In addition, applying inlining and loop fusion yields a 10 percent performance improvement.
  • Poster: FortranTestGenerator: Automatic and Flexible Unit Test Generation for Legacy HPC Code (Christian Hovy, Julian Kunkel), ISC High Performance 2017, Frankfurt, 2017-06-20
    BibTeX PDF
    Abstract: Unit testing is an established practice in professional software development. However, in high-performance computing (HPC) with its scientific applications, it is not widely applied. Besides general problems regarding testing of scientific software, for many HPC applications the effort of creating small test cases with a consistent set of test data is high. We have created a tool called FortranTestGenerator to reduce the effort of creating unit tests for subroutines of an existing Fortran application. It is based on Capture & Replay (C&R), that is, it extracts data while running the original application and uses the extracted data as test input data. The tool automatically generates code for capturing the input data and a basic test driver which can be extended by the developer to a meaningful unit test. A static source code analysis is conducted to reduce the number of captured variables. Code is generated based on flexibly customizable templates. Thus, both the capturing process and the unit tests can easily be integrated into an existing software ecosystem.
  • Poster: Enhanced Adaptive Compression in Lustre (Anna Fuchs, Michael Kuhn, Julian Kunkel, Thomas Ludwig), ISC High Performance 2017, Frankfurt, Germany, 2017-06-20
    BibTeX URL
  • Poster: Performance Conscious HPC (PeCoH) (Julian Kunkel, Michael Kuhn, Thomas Ludwig, Matthias Riebisch, Stephan Olbrich, Hinnerk Stüben, Kai Himstedt, Hendryk Bockelmann, Markus Stammberger), ISC High Performance 2017, Frankfurt, Germany, 2017-06-20
    BibTeX URL
  • Poster: Advanced Computation and I/O Methods for Earth-System Simulations (AIMES) (Julian Kunkel, Thomas Ludwig, Thomas Dubos, Naoya Maruyama, Takayuki Aoki, Günther Zängl, Hisashi Yashiro, Ryuji Yoshida, Hirofumi Tomita, Masaki Satoh, Yann Meurdesoif, Nabeeh Jumah, Anastasiia Novikova), ISC 2017, Frankfurt, Germany, 2017-06-20
    BibTeX URL
    Abstract: The Advanced Computation and I/O Methods for Earth-System Simulations (AIMES) project addresses the key issues of programmability, computational efficiency and I/O limitations that are common in next-generation icosahedral earth-system models. Ultimately, the project is intended to foster development of best-practices and useful norms by cooperating on shared ideas and components. During the project, we ensure that the developed concepts and tools are not only applicable for earth-science but for other scientific domains as well.
  • Poster: The Virtual Institute for I/O and the IO-500 (Julian Kunkel, Jay Lofstead, John Bent), ISC High Performance 2017, Frankfurt, Germany, 2017-06-20
    BibTeX PDF
  • Wissenschaftliches Rechnen - Scientific Computing - 2016 (Yevhen Alforov, Eugen Betke, Konstantinos Chasapis, Anna Fuchs, Fabian Große, Nabeeh Jumah, Michael Kuhn, Julian Kunkel, Hermann Lenhart, Jakob Lüttgau, Philipp Neumann, Anastasiia Novikova, Jannek Squar, Thomas Ludwig), Research Group: Scientific Computing, University of Hamburg (Deutsches Klimarechenzentrum GmbH, Bundesstraße 45a, D-20146 Hamburg), 2017-06-19
    BibTeX PDF
  • SFS: A Tool for Large Scale Analysis of Compression Characteristics (Julian Kunkel), Research Papers (4), Research Group: Scientific Computing, University of Hamburg (Deutsches Klimarechenzentrum GmbH, Bundesstraße 45a, D-20146 Hamburg), 2017-05-05
    BibTeX PDF
    Abstract: Data centers manage petabytes of storage. Identifying a fast lossless compression algorithm that can be enabled on the storage system and potentially reduces data by an additional 10% is significant. However, it is not trivial to evaluate algorithms on huge data pools, as this evaluation requires running the algorithms and, thus, is costly, too. Therefore, there is a need for tools to optimize such an analysis. In this paper, the open source tool SFS is described, which performs these scans efficiently. While based on an existing open source tool, SFS builds on a proven method to scan huge quantities of data using statistical sampling. Additionally, we present results of 162 variants of various algorithms conducted on three data pools with scientific data and one more general-purpose data pool. Based on this analysis, promising classes of algorithms are identified.
  • Interaktiver C-Programmierkurs, ICP (Julian Kunkel, Jakob Lüttgau), In HOOU Content Projekte der Vorprojektphase 2015/16 – Sonderband zum Fachmagazin Synergie (Kerstin Mayrberger), pp. 182–186, Universität Hamburg (Universität Hamburg, Mittelweg 177, 20148 Hamburg), ISBN: 978-3-924330-57-6, 2017-04-10
    BibTeX URL
    Abstract: Programming languages form the basis for automated data processing in the digital world. Although the basic concepts are easy to understand, only a small proportion of people master these tools. The reasons for this are deficits in education and the high entry barrier to providing a productive programming environment. In particular, learning a programming language requires practical use of the language, comparable to learning a foreign language. The goal of the project is the creation of an interactive course for teaching the C programming language. The interactivity and the automatic feedback offered are oriented towards the needs of the participants and provide the opportunity to build up and extend knowledge autodidactically. The lessons contain both introductions to specific subtopics and more demanding exercises that foster academic problem-solving skills. This serves different academic target groups and introduces people from various parts of civil society to computer science. The programming course developed in this project and the programming platform can be used freely worldwide, and the source code and the lessons are available under open-source licenses and can therefore be adapted to individual needs. In particular, this enables participation and the contribution of new lessons to the platform.
  • Poster: Intelligent Selection of Compiler Options to Optimize Compile Time and Performance (Anja Gerbes, Julian Kunkel, Nabeeh Jumah), Euro LLVM, Saarbrücken, 2017-03-27
    BibTeX URL PDF
    Abstract: The efficiency of the optimization process during the compilation is crucial for the later execution behavior of the code. The achieved performance depends on the hardware architecture and the compiler's capabilities to extract this performance. Code optimization can be a CPU- and memory-intensive process which – for large codes – can lead to high compilation times during development. Optimization also influences the debuggability of the resulting binary; for example, by storing data in registers. During development, it would be interesting to compile files individually with appropriate flags that enable debugging and provide high (near-production) performance during the testing but with moderate compile times. We are exploring the creation of a tool to identify code regions that are candidates for higher optimization levels. We follow two different approaches to identify the most efficient code optimization: 1) compiling different files with different options by brute force; 2) using profilers to identify the relevant code regions that should be optimized. Since big projects comprise hundreds of files, brute force is not efficient. The problem in, e.g., climate applications is that codes have too many files to test them individually. Improving this strategy using a profiler, we can identify the time consuming regions (and files) and then repeatedly refine our selection. Then, the relevant files are evaluated with different compiler flags to determine a good compromise of the flags. Once the appropriate flags are determined, this information could be retained across builds and shared between users. In our poster, we motivate and demonstrate this strategy on a stencil code derived from climate applications. The experiments done throughout this work are carried out on a recent Intel Skylake (i7-6700 CPU @ 3.40GHz) machine. We compare the performance of the compilers clang (version 3.9.1) and gcc (version 6.3.0) for various optimization flags and using profile guided optimization (PGO), both with the traditional compile-with-instrumentation/run/recompile phases and when using the perf tool for dynamic instrumentation. The results show that, in general, more time (2x) is spent compiling code at higher optimization levels, though gcc takes a little less time than clang. Yet the performance of the application after compiling the whole code with O3 was comparable to that of applying O3 optimization to the right subset of files. Thus, the approach proves to be effective for repositories where compilation is analyzed to guide subsequent compilations. Based on these results, we are building a prototype tool that can be embedded into build systems and that realizes the aforementioned strategies of brute-force testing and profile-guided analysis of relevant compilation flags.
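    A hedged sketch of the brute-force part of this strategy (compiler invocation only; the source file name, the flag sets and the omission of the PGO steps are simplifications):
```python
# Compile one file with different optimization levels, record compile time and
# run time of the resulting binary, and print the trade-off.
import subprocess, time

SOURCE, BINARY = "stencil.c", "./stencil"   # hypothetical stencil benchmark
FLAG_SETS = [["-O0", "-g"], ["-O2"], ["-O3", "-march=native"]]

for flags in FLAG_SETS:
    t0 = time.perf_counter()
    subprocess.run(["gcc", *flags, SOURCE, "-o", BINARY[2:]], check=True)
    compile_time = time.perf_counter() - t0

    t0 = time.perf_counter()
    subprocess.run([BINARY], check=True)
    run_time = time.perf_counter() - t0

    print(f"{' '.join(flags):>18}: compile {compile_time:6.2f} s, run {run_time:6.2f} s")
```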
  • A Novel String Representation and Kernel Function for the Comparison of I/O Access Patterns (Raul Torres, Julian Kunkel, Manuel Dolz, Thomas Ludwig), In International Conference on Parallel Computing Technologies, Lecture Notes in Computer Science (10421), pp. 500–512, (Editors: Victor Malyshkin), Springer, PaCT, Nizhni Novgorod, Russia, ISBN: 978-3-319-62932-2, 2017
    BibTeX DOI PDF
    Abstract: Parallel I/O access patterns act as fingerprints of a parallel program. In order to extract meaningful information from these patterns, they have to be represented appropriately. Due to the fact that string objects can be easily compared using Kernel Methods, a conversion to a weighted string representation is proposed in this paper, together with a novel string kernel function called Kast Spectrum Kernel. The similarity matrices, obtained after applying the mentioned kernel over a set of examples from a real application, were analyzed using Kernel Principal Component Analysis (Kernel PCA) and Hierarchical Clustering. The evaluation showed that 2 out of 4 I/O access pattern groups were completely identified, while the other 2 formed a single cluster due to the intrinsic similarity of their members. The proposed strategy can be promisingly applied to other similarity problems involving tree-like structured data.
  • An MPI-IO In-Memory Driver for Non-Volatile Pooled Memory of the Kove XPD (Julian Kunkel, Eugen Betke), In High Performance Computing: ISC High Performance 2017 International Workshops, DRBSD, ExaComm, HCPM, HPC-IODC, IWOPH, IXPUG, P^3MA, VHPC, Visualization at Scale, WOPSSS, Lecture Notes in Computer Science (10524), pp. 644–655, (Editors: Julian Kunkel, Rio Yokota, Michaela Taufer, John Shalf), Springer, ISC High Performance, Frankfurt, Germany, ISBN: 978-3-319-67629-6, 2017
    BibTeX DOI PDF
    Abstract: Many scientific applications are limited by the performance offered by parallel file systems. SSD based burst buffers provide significantly better performance than HDD backed storage but at the expense of capacity. Clearly, achieving wire-speed of the interconnect and predictable low latency I/O is the holy grail of storage. In-memory storage promises to provide optimal performance exceeding SSD based solutions. Kove's XPD offers pooled memory for cluster systems. This remote memory is asynchronously backed up to storage devices of the XPDs and considered to be non-volatile. Albeit the system offers various APIs to access this memory, such as treating it as a block device, it does not allow exposing it as a file system that offers POSIX or MPI-IO semantics. In this paper, we 1) describe the XPD-MPIIO-driver which supports the scale-out architecture of the XPDs. This MPI-agnostic driver enables high-level libraries to utilize the XPD’s memory as storage. 2) A thorough performance evaluation of the XPD is conducted. This includes scale-out testing of the infrastructure and metadata operations but also performance variability. We show that the driver and storage architecture is able to nearly saturate wire-speed of Infiniband (60+ GiB/s with 14 FDR links) while providing low latency and little performance variability.
  • Toward Decoupling the Selection of Compression Algorithms from Quality Constraints (Julian Kunkel, Anastasiia Novikova, Eugen Betke, Armin Schaare), In High Performance Computing: ISC High Performance 2017 International Workshops, DRBSD, ExaComm, HCPM, HPC-IODC, IWOPH, IXPUG, P^3MA, VHPC, Visualization at Scale, WOPSSS, Lecture Notes in Computer Science (10524), pp. 1–12, (Editors: Julian Kunkel, Rio Yokota, Michaela Taufer, John Shalf), Springer, ISC High Performance, Frankfurt, Germany, ISBN: 978-3-319-67629-6, 2017
    BibTeX DOI PDF
    Abstract: Data intense scientific domains use data compression to reduce the storage space needed. Lossless data compression preserves the original information accurately but on the domain of climate data usually yields a compression factor of only 2:1. Lossy data compression can achieve much higher compression rates depending on the tolerable error/precision needed. Therefore, the field of lossy compression is still subject to active research. From the perspective of a scientist, the compression algorithm does not matter but the qualitative information about the implied loss of precision of data is a concern. With the Scientific Compression Library (SCIL), we are developing a meta-compressor that allows users to set various quantities that define the acceptable error and the expected performance behavior. The ongoing work is a preliminary stage for the design of an automatic compression algorithm selector. The task of this missing key component is the construction of appropriate chains of algorithms to yield the user's requirements. This approach is a crucial step towards a scientifically safe use of much-needed lossy data compression, because it disentangles the task of determining scientific characteristics of tolerable noise from the task of determining an optimal compression strategy given target noise levels and constraints. Future algorithms can be used without change in the application code, once they are integrated into SCIL. In this paper, we describe the user interfaces and quantities, two compression algorithms and evaluate SCIL’s ability for compressing climate data. This will show that the novel algorithms are competitive with the state-of-the-art compressors ZFP and SZ and illustrate that the best algorithm depends on user settings and data properties.
  • Simulation of Hierarchical Storage Systems for TCO and QoS (Jakob Lüttgau, Julian Kunkel), In High Performance Computing: ISC High Performance 2017 International Workshops, DRBSD, ExaComm, HCPM, HPC-IODC, IWOPH, IXPUG, P^3MA, VHPC, Visualization at Scale, WOPSSS, Lecture Notes in Computer Science (10524), pp. 116–128, (Editors: Julian Kunkel, Rio Yokota, Michaela Taufer, John Shalf), Springer, ISC High Performance, Frankfurt, Germany, ISBN: 978-3-319-67629-6, 2017
    BibTeX DOI PDF
    Abstract: Due to the variety of storage technologies, deep storage hierarchies turn out to be the most feasible choice to meet performance and cost requirements when handling vast amounts of data. Long-term archives employed by scientific users are mainly reliant on tape storage, as it remains the most cost-efficient option. Archival systems are often loosely integrated into the HPC storage infrastructure. In expectation of exascale systems and in situ analysis, burst buffers will also require integration with the archive. Exploring new strategies and developing open software for tape systems is a hurdle due to the lack of affordable storage silos and availability outside of large organizations and due to increased wariness requirements when dealing with ultra-durable data. Lessening these problems by providing virtual storage silos should enable community-driven innovation and enable site operators to add features where they see fit while being able to verify strategies before deploying on production systems. Different models for the individual components in tape systems are developed. The models are then implemented in a prototype simulation using discrete event simulation. The work shows that the simulations can be used to approximate the behavior of tape systems deployed in the real world and to conduct experiments without requiring a physical tape system.
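    A minimal event-driven sketch in the spirit of the simulation described above (drive count, mount/seek delay, bandwidth and the request trace are invented numbers):
```python
# Requests queue for a limited number of tape drives; each request pays a
# mount/seek delay plus a size-dependent transfer time. Completion events are
# kept in a min-heap and reported with per-request waiting times.
import heapq, random

DRIVES, MOUNT_SEEK_S, BANDWIDTH_MBS = 2, 90.0, 300.0
random.seed(0)
requests = [(i * 30.0, random.uniform(1_000, 50_000)) for i in range(10)]  # (arrival s, size MB)

drive_free_at = [0.0] * DRIVES        # next time each drive becomes available
events = []                           # (finish time, arrival time, drive)
for arrival, size_mb in requests:     # FIFO assignment to the earliest-free drive
    d = min(range(DRIVES), key=lambda i: drive_free_at[i])
    start = max(arrival, drive_free_at[d])
    finish = start + MOUNT_SEEK_S + size_mb / BANDWIDTH_MBS
    drive_free_at[d] = finish
    heapq.heappush(events, (finish, arrival, d))

while events:
    finish, arrival, d = heapq.heappop(events)
    print(f"request arrived {arrival:6.1f}s done {finish:7.1f}s on drive {d} "
          f"(waited {finish - arrival:6.1f}s)")
```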
  • Real-Time I/O-Monitoring of HPC Applications with SIOX, Elasticsearch, Grafana and FUSE (Eugen Betke, Julian Kunkel), In High Performance Computing: ISC High Performance 2017 International Workshops, DRBSD, ExaComm, HCPM, HPC-IODC, IWOPH, IXPUG, P^3MA, VHPC, Visualization at Scale, WOPSSS, Lecture Notes in Computer Science (10524), pp. 158–170, (Editors: Julian Kunkel, Rio Yokota, Michaela Taufer, John Shalf), Springer, ISC High Performance, Frankfurt, Germany, ISBN: 978-3-319-67629-6, 2017
    BibTeX DOI PDF
    Abstract: The starting point for our work was a demand for an overview of applications’ I/O behavior that provides information about the usage of our HPC system “Mistral”. We suspect that some applications are running using inefficient I/O patterns, and probably, are wasting a significant amount of machine hours. To tackle the problem, we focus on detection of poor I/O performance, identification of these applications, and description of I/O behavior. Instead of gathering I/O statistics from global system variables, like many other monitoring tools do, in our approach statistics come directly from the I/O interfaces POSIX, MPI, HDF5 and NetCDF. For interception of I/O calls we use an instrumentation library that is dynamically linked with LD_PRELOAD at program startup. The HPC on-line monitoring framework is built on top of open source software: Grafana, SIOX, Elasticsearch and FUSE. This framework collects I/O statistics from applications and mount points. The latter is used for non-intrusive monitoring of virtual memory allocated with mmap(), i.e., no code adaption is necessary. The framework is evaluated, showing its effectiveness, and critically discussed.
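    A hedged sketch of the reporting path only, not the SIOX/FUSE collectors themselves (the Elasticsearch endpoint, index name and record layout are assumptions):
```python
# Push one I/O statistics record into Elasticsearch, where a Grafana dashboard
# can pick it up.
import json, time, urllib.request

ES_URL = "http://localhost:9200/io-monitoring/_doc"   # assumed local Elasticsearch index

def report(job_id, interface, bytes_written, calls):
    doc = {
        "@timestamp": int(time.time() * 1000),
        "job_id": job_id,
        "interface": interface,          # e.g. POSIX, MPI-IO, HDF5, NetCDF
        "bytes_written": bytes_written,
        "calls": calls,
    }
    req = urllib.request.Request(
        ES_URL, data=json.dumps(doc).encode(),
        headers={"Content-Type": "application/json"}, method="POST")
    with urllib.request.urlopen(req) as resp:
        return resp.status

if __name__ == "__main__":
    print(report(job_id="4711", interface="POSIX", bytes_written=1 << 20, calls=256))
```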
  • Poster: Predicting I/O-performance in HPC using Artificial Neural Networks (Jan Fabian Schmid, Julian Kunkel), ISC High Performance 2016, Frankfurt, 2016-06-21
    BibTeX PDF
    Abstract: Tools are demanded that help users of HPC-facilities to implement efficient input/output (I/O) in their programs. It is difficult to find the best access parameters and patterns due to complex parallel storage systems. To develop tools which support the implementation of efficient I/O, a computational model of the storage system is key. For single hard disk systems such a model can be derived analytically [1]; however, for the complex storage system of a supercomputer these models become too difficult to configure [2]. Therefore, we searched for good predictors of I/O performance using a machine learning approach with artificial neural networks (ANNs). A hypothesis was then proposed: the I/O-path significantly influences the time needed to access a file. In our analysis we used ANNs with different input information for the prediction of access times. To use I/O-paths as input for the ANNs, we developed a method which approximates the different I/O-paths the storage system used during a benchmark test. This method utilizes error classes.
  • Poster: Analyzing Data Properties using Statistical Sampling Techniques – Illustrated on Scientific File Formats and Compression Features (Julian Kunkel), ISC High Performance 2016, Frankfurt, 2016-06-21 – Awards: Best Poster
    BibTeX PDF
    Abstract: Understanding the characteristics of data stored in data centers helps computer scientists in identifying the most suitable storage infrastructure to deal with these workloads. For example, knowing the relevance of file formats allows optimizing the relevant file formats but also helps in a procurement to define useful benchmarks. Existing studies that investigate performance improvements and techniques for data reduction such as deduplication and compression operate on a small set of data. Some of those studies claim the selected data is representative and scale their result to the scale of the data center. One hurdle of evaluating novel schemes on the complete data is the vast amount of data stored and, thus, the resources required to analyze the complete data set. Even if this were feasible, the costs for running many of those experiments must be justified. This poster investigates stochastic sampling methods to compute and analyze quantities of interest on file numbers but also on the occupied storage space. It is demonstrated that scanning 1% of files and data volume is sufficient on DKRZ's supercomputer to obtain accurate results. This not only speeds up the analysis process but reduces costs of such studies significantly. Contributions of this poster are: 1) investigation of the inherent error when operating only on a subset of data, 2) presentation of methods that help future studies to mitigate this error, and 3) illustration of the approach with a study for scientific file types and compression.
  • Interaktiver C-Programmierkurs, ICP (Julian Kunkel, Jakob Lüttgau), In Synergie, Fachmagazin für Digitalisierung in der Lehre (2), pp. 74–75, 2016-11-16
    BibTeX URL
    Abstract: Programming languages form the basis for automated data processing in the digital world. Although the basic concepts are easy to understand, only a small proportion of people master these tools. The reasons for this are deficits in education and the high entry barrier to providing a productive programming environment. In particular, learning a programming language requires practical use of the language, comparable to learning a foreign language. The goal of the project is the creation of an interactive course for teaching the C programming language. The interactivity and the automatic feedback offered are oriented towards the needs of the participants and provide the opportunity to build up and extend knowledge autodidactically. The lessons contain both introductions to specific subtopics and more demanding exercises that foster academic problem-solving skills. This serves different academic target groups and introduces people from various parts of civil society to computer science. The programming course developed in this project and the programming platform can be used freely worldwide, and the source code and the lessons are available under open-source licenses and can therefore be adapted to individual needs. In particular, this enables participation and the contribution of new lessons to the platform.
  • Analyzing Data Properties using Statistical Sampling – Illustrated on Scientific File Formats (Julian Kunkel), In Supercomputing Frontiers and Innovations, Series: Volume 3, Number 3, pp. 19–33, (Editors: Jack Dongarra, Vladimir Voevodin), 2016-10
    BibTeX URL DOI
    Abstract: Understanding the characteristics of data stored in data centers helps computer scientists in identifying the most suitable storage infrastructure to deal with these workloads. For example, knowing the relevance of file formats allows optimizing the relevant formats but also helps in a procurement to define benchmarks that cover these formats. Existing studies that investigate performance improvements and techniques for data reduction such as deduplication and compression operate on a subset of data. Some of those studies claim the selected data is representative and scale their result to the scale of the data center. One hurdle of running novel schemes on the complete data is the vast amount of data stored and, thus, the resources required to analyze the complete data set. Even if this would be feasible, the costs for running many of those experiments must be justified. This paper investigates stochastic sampling methods to compute and analyze quantities of interest on file numbers but also on the occupied storage space. It will be demonstrated that on our production system, scanning 1% of files and data volume is sufficient to draw conclusions. This speeds up the analysis process and reduces costs of such studies significantly.
  • Predicting I/O Performance in HPC Using Artificial Neural Networks (Jan Fabian Schmid, Julian Kunkel), In Supercomputing Frontiers and Innovations, Series: Volume 3, Number 3, pp. 34–39, (Editors: Jack Dongarra, Vladimir Voevodin), 2016-10
    BibTeX URL DOI
    Abstract: The prediction of file access times is an important part of modeling a supercomputer's storage system. These models can be used to develop analysis tools which support users in implementing efficient I/O behavior. In this paper, we analyze and predict the access times of a Lustre file system from the client perspective. To this end, we measured file access times in various test series and developed different models for predicting access times.
    The evaluation shows that in models utilizing artificial neural networks the average prediction error is about 30% smaller than in linear models. A phenomenon in the distribution of file access times is of particular interest: file accesses with identical parameters show several typical access times. The typical access times usually differ by orders of magnitude and can be explained with a different processing of the file accesses in the storage system - an alternative I/O path. We investigate a method to automatically determine the alternative I/O path and quantify the significance of knowledge about the internal processing. It is shown that the prediction error is improved significantly with this approach.
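    The multi-modal access-time phenomenon described above can be made visible with a small, illustrative sketch: cluster the logarithm of measured access times and report one typical time per cluster. KMeans and the two-group assumption are illustrative choices, not the method used in the paper.

```python
# Sketch: separate repeated measurements of one access configuration into
# "typical" access-time groups that differ by orders of magnitude.
import numpy as np
from sklearn.cluster import KMeans

def typical_times(access_times_s, n_groups=2):
    log_t = np.log10(np.asarray(access_times_s)).reshape(-1, 1)
    labels = KMeans(n_clusters=n_groups, n_init=10, random_state=0).fit_predict(log_t)
    return sorted(10 ** log_t[labels == k].mean() for k in range(n_groups))

# Synthetic example: cached accesses around 0.2 ms, uncached around 8 ms.
rng = np.random.default_rng(0)
times = np.concatenate([rng.lognormal(np.log(2e-4), 0.2, 200),
                        rng.lognormal(np.log(8e-3), 0.2, 50)])
print([f"{t * 1e3:.2f} ms" for t in typical_times(times)])
```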
  • Analyzing Data Properties using Statistical Sampling Techniques – Illustrated on Scientific File Formats and Compression Features (Julian Kunkel), In High Performance Computing: ISC High Performance 2016 International Workshops, ExaComm, E-MuCoCoS, HPC-IODC, IXPUG, IWOPH, P3MA, VHPC, WOPSSS, Lecture Notes in Computer Science (9945 2016), pp. 130–141, (Editors: Michela Taufer, Bernd Mohr, Julian Kunkel), Springer, ISC-HPC 2017, Frankfurt, Germany, ISBN: 978-3-319-46079-6, 2016-06
    BibTeX DOI PDF
    Abstract: Understanding the characteristics of data stored in data centers helps computer scientists in identifying the most suitable storage infrastructure to deal with these workloads. For example, knowing the relevance of file formats allows optimizing the relevant formats but also helps in a procurement to define benchmarks that cover these formats. Existing studies that investigate performance improvements and techniques for data reduction such as deduplication and compression operate on a small set of data. Some of those studies claim the selected data is representative and scale their result to the scale of the data center. One hurdle of running novel schemes on the complete data is the vast amount of data stored and, thus, the resources required to analyze the complete data set. Even if this would be feasible, the costs for running many of those experiments must be justified. This paper investigates stochastic sampling methods to compute and analyze quantities of interest on file numbers but also on the occupied storage space. It will be demonstrated that on our production system, scanning 1% of files and data volume is sufficient to draw conclusions. This speeds up the analysis process and reduces costs of such studies significantly. The contributions of this paper are: (1) the systematic investigation of the inherent analysis error when operating only on a subset of data, (2) the demonstration of methods that help future studies to mitigate this error, (3) the illustration of the approach on a study for scientific file types and compression for a data center.
  • Data Compression for Climate Data (Michael Kuhn, Julian Kunkel, Thomas Ludwig), In Supercomputing Frontiers and Innovations, Series: Volume 3, Number 1, pp. 75–94, (Editors: Jack Dongarra, Vladimir Voevodin), 2016-06
    BibTeX URL DOI PDF
  • Towards Automatic and Flexible Unit Test Generation for Legacy HPC Code (Christian Hovy, Julian Kunkel), In Proceedings of the Fourth International Workshop on Software Engineering for High Performance Computing in Computational Science and Engineering, SEHPCCSE16, Salt Lake City, Utah, USA, 2016
    BibTeX DOI
    Abstract: Unit testing is an established practice in professional software development. However, in high-performance computing (HPC) with its scientific applications, it is not widely applied. Besides general problems regarding testing of scientific software, for many HPC applications the effort of creating small test cases with a consistent set of test data is high. We have created a tool called FortranTestGenerator that significantly reduces the effort of creating unit tests for subroutines of an existing Fortran application. It is based on Capture & Replay (C&R), that is, it extracts data while running the original application and uses the extracted data as test input data. The tool automatically generates code for capturing the input data and a basic test driver which can be extended by the developer to an appropriate unit test. A static source code analysis is conducted to reduce the number of captured variables. Code is generated based on flexibly customizable templates. Thus, both the capturing process and the unit tests can easily be integrated into an existing software ecosystem. Since most HPC applications use message passing for parallel processing, we also present an approach to extend our C&R model to MPI communication. This allows the extraction of unit tests from massively parallel applications that can be run with a single process.
  • Poster: Interaktiver C Kurs (ICP) (Julian Kunkel, Thomas Ludwig, Jakob Lüttgau, Dion Timmermann, Christian Kautz, Volker Skwarek), Campus Innovation 2015, Hamburg, 2015-11-27
    BibTeX URL
    Abstract: Programming languages are the basis for automated data processing in the digital world. Although the basic concepts are easy to understand, only a small share of people masters these tools. The reasons for this are deficits in education and the entry barrier of providing a productive programming environment. In particular, learning a programming language requires practical use of the language. Integrating programming courses into the Hamburg Open Online University not only improves the offering for students but also opens up access to computer science for people outside the field.
  • Poster: Advanced Data Sieving for Non-Contiguous I/O (Enno Zickler, Julian Kunkel), Frankfurt, Germany, 2015-07-13
    BibTeX URL
  • Monitoring energy consumption with SIOX (Julian Kunkel, Alvaro Aguilera, Nathanael Hübbe, Marc Wiedemann, Michaela Zimmer), In Computer Science – Research and Development, Series: Volume 30, Number 2, pp. 125–133, Springer, ISSN: 1865-2034, 2015-05
    BibTeX URL DOI PDF
    Abstract: In the face of the growing complexity of HPC systems, their growing energy costs, and the increasing difficulty to run applications efficiently, a number of monitoring tools have been developed during the last years. SIOX is one such endeavor, with a uniquely holistic approach: not only does it aim to record a certain kind of data, but to make all relevant data available for analysis and optimization. Among other sources, this encompasses data from hardware energy counters and trace data from different hardware/software layers. However, not all data that can be recorded should be recorded. As such, SIOX needs good heuristics to determine when and what data needs to be collected, and the energy consumption can provide an important signal about when the system is in a state that deserves closer attention. In this paper, we show that SIOX can use Likwid to collect and report the energy consumption of applications, and present how this data can be visualized using SIOX's web-interface. Furthermore, we outline how SIOX can use this information to intelligently adjust the amount of data it collects, allowing it to reduce the monitoring overhead while still providing complete information about critical situations.
  • Identifying Relevant Factors in the I/O-Path using Statistical Methods (Julian Kunkel), Research Papers (3), Research Group: Scientific Computing, University of Hamburg (Deutsches Klimarechenzentrum GmbH, Bundesstraße 45a, D-20146 Hamburg), 2015-03-14
    BibTeX PDF
    Abstract: File systems of supercomputers are complex systems of hardware and software. They utilize many optimization techniques such as the cache hierarchy to speed up data access. Unfortunately, this complexity makes assessing I/O difficult. It is impossible to predict the performance of a single I/O operation without knowing the exact system state, as optimizations such as client-side caching of the parallel file system may speed up performance significantly. I/O tracing and characterization tools help to capture the application workload and quantitatively assess the performance. However, a user has to decide whether the obtained performance is acceptable. In this paper, a density-based method from statistics is investigated to build a model which assists administrators in identifying relevant causes (performance factors). Additionally, the model can be applied to purge unexpectedly slow operations that are caused by significant congestion on a shared resource. It is sketched how this could be used in the long term to automatically assess performance and identify the likely cause. The main contribution of the paper is the presentation of a novel methodology to identify relevant performance factors by inspecting the observed execution time on the client side. Starting from a black box model, the methodology is applicable without fully understanding all hardware and software components of the complex system. It then guides the analysis from observations and fosters identification of the most significant performance factors in the I/O path. To evaluate the approach, a model is trained on DKRZ's supercomputer Mistral and validated on synthetic benchmarks. It is demonstrated that the methodology is currently able to distinguish between several client-side storage cases such as sequential and random memory layout, and cached or uncached data, but this will be extended in the future to include server-side I/O factors as well.
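    A hedged sketch of a density-based flagging step in this spirit (not the paper's model): fit a kernel density estimate over the logarithm of observed execution times and mark operations that fall into low-density regions above the bulk as candidates for congestion-induced slowdowns. The quantile threshold is invented for illustration.

```python
# Sketch: flag unexpectedly slow operations via a kernel density estimate
# over log execution times (the 5% density quantile is an invented threshold).
import numpy as np
from scipy.stats import gaussian_kde

def flag_slow(times_s, density_quantile=0.05):
    log_t = np.log10(np.asarray(times_s))
    density = gaussian_kde(log_t)(log_t)
    cutoff = np.quantile(density, density_quantile)
    # Low density *and* slower than the bulk of observations -> suspicious.
    return (density < cutoff) & (log_t > np.median(log_t))

rng = np.random.default_rng(0)
times = np.concatenate([rng.lognormal(np.log(1e-3), 0.3, 500),
                        [0.05, 0.08]])          # two congested outliers
print(np.nonzero(flag_slow(times))[0])          # indices of flagged operations
```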
  • An analytical methodology to derive power models based on hardware and software metrics (Manuel F. Dolz, Julian Kunkel, Konstantinos Chasapis, Sandra Catalan), In Computer Science - Research and Development, pp. 1–10, Springer US, ISSN: 1865-2042, 2015
    BibTeX DOI PDF
    Abstract: The use of models to predict the power consumption of a system is an appealing alternative to wattmeters since they avoid hardware costs and are easy to deploy. In this paper, we present an analytical methodology to build models with a reduced number of features in order to estimate power consumption at node level. We aim at building simple power models by performing a per-component analysis (CPU, memory, network, I/O) through the execution of four standard benchmarks. While they are executed, information from all the available hardware counters and resource utilization metrics provided by the system is collected. Based on correlations among the recorded metrics and their correlation with the instantaneous power, our methodology allows us (i) to identify the significant metrics and (ii) to assign weights to the selected metrics in order to derive reduced models. The reduction also aims at extracting models that are based on a set of hardware counters and utilization metrics that can be obtained simultaneously and, thus, can be gathered and computed on-line. The utility of our procedure is validated using real-life applications on an Intel Sandy Bridge architecture.
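    A compact sketch of the described reduction idea: correlate monitored metrics with instantaneous power, keep only sufficiently correlated metrics, and fit a linear model whose coefficients serve as weights. The metric names, the correlation threshold and the synthetic samples are assumptions for illustration; real input would come from hardware counters and utilization metrics.

```python
# Sketch: derive a reduced linear power model from monitored metrics
# (placeholder metric names and synthetic samples; threshold is illustrative).
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

def reduced_power_model(df, power_col="power_w", min_abs_corr=0.3):
    metrics = [c for c in df.columns if c != power_col]
    corr = df[metrics].corrwith(df[power_col]).abs()
    selected = corr[corr >= min_abs_corr].index.tolist()        # (i) significant metrics
    model = LinearRegression().fit(df[selected], df[power_col]) # (ii) per-metric weights
    return selected, model

rng = np.random.default_rng(1)
df = pd.DataFrame({"cpu_util": rng.uniform(0, 1, 300),
                   "mem_bw_gbs": rng.uniform(0, 20, 300),
                   "net_mbs": rng.uniform(0, 100, 300)})
df["power_w"] = 80 + 60 * df.cpu_util + 2 * df.mem_bw_gbs + rng.normal(0, 2, 300)
cols, model = reduced_power_model(df)
print(cols, dict(zip(cols, model.coef_.round(2))))
```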
  • Speicherung großer Datenmengen und Energieeffizienz (Thomas Ludwig, Manuel Dolz, Michael Kuhn, Julian Kunkel, Hermann Lenhart), Max-Planck-Gesellschaft (München), 2015
    BibTeX URL
  • Predicting Performance of Non-Contiguous I/O with Machine Learning (Julian Kunkel, Eugen Betke, Michaela Zimmer), In High Performance Computing, 30th International Conference, ISC High Performance 2015, Lecture Notes in Computer Science (9137), pp. 257–273, (Editors: Julian Martin Kunkel, Thomas Ludwig), ISC High Performance, Frankfurt, ISSN: 0302-9743, 2015
    BibTeX DOI PDF
  • Poster: SIOX: An Infrastructure for Monitoring and Optimization of HPC-I/O (Julian Kunkel, Michaela Zimmer, Marc Wiedemann, Nathanael Hübbe, Alvaro Aguilera, Holger Mickler, Xuan Wang, Andrij Chut, Thomas Bönisch), ISC'14 Leipzig, 2014-06-23
    BibTeX URL
    Abstract: Performance analysis and optimization of high-performance I/O systems is a daunting task. Mainly, this is due to the overwhelmingly complex interplay of the involved hardware and software layers. The Scalable I/O for Extreme Performance (SIOX) project provides a versatile environment for monitoring I/O activities and learning from this information. The goal of SIOX is to automatically suggest and apply performance optimizations, and to assist in locating and diagnosing performance problems. In this poster, we present the current status of SIOX. Our modular architecture covers instrumentation of POSIX, MPI and other high-level I/O libraries; the monitoring data is recorded asynchronously into a global database, and recorded traces can be visualized. Furthermore, we offer a set of primitive plug-ins with additional features to demonstrate the flexibility of our architecture: a surveyor plug-in to keep track of the observed spatial access patterns; an fadvise plug-in for injecting hints to achieve read-ahead for strided access patterns; and an optimizer plug-in which monitors the performance achieved with different MPI-IO hints, automatically supplying the best known hint-set when no hints were explicitly set. The presentation of the technical status is accompanied by a demonstration of some of these features on our 20 node cluster. In additional experiments, we analyze the overhead for concurrent access, for MPI-IO's 4-levels of access, and for an instrumented climate application. While our prototype is not yet full-featured, it demonstrates the potential and feasibility of our approach.
  • Whitepaper: E10 – Exascale IO (Andre Brinkmann, Toni Cortes, Hugo Falter, Julian Kunkel, Sai Narasimhamurthy), 2014-06
    BibTeX URL PDF
  • Exascale Storage Systems – An Analytical Study of Expenses (Julian Kunkel, Michael Kuhn, Thomas Ludwig), In Supercomputing Frontiers and Innovations, Series: Volume 1, Number 1, pp. 116–134, (Editors: Jack Dongarra, Vladimir Voevodin), 2014-06
    BibTeX URL
  • Feign: In-Silico Laboratory for Researching I/O Strategies (Jakob Lüttgau, Julian Kunkel), In Parallel Data Storage Workshop (PDSW), 2014 9th, pp. 43–48, SC14, New Orleans, 2014
    BibTeX URL
  • A Comparison of Trace Compression Methods for Massively Parallel Applications in Context of the SIOX Project (Alvaro Aguilera, Holger Mickler, Julian Kunkel, Michaela Zimmer, Marc Wiedemann, Ralph Müller-Pfefferkorn), In Tools for High Performance Computing 2013, pp. 91–105, ISBN: 978-3-319-08143-4, 2014
    BibTeX
  • The SIOX Architecture – Coupling Automatic Monitoring and Optimization of Parallel I/O (Julian Kunkel, Michaela Zimmer, Nathanael Hübbe, Alvaro Aguilera, Holger Mickler, Xuan Wang, Andrij Chut, Thomas Bönisch, Jakob Lüttgau, Roman Michel, Johann Weging), In Supercomputing, Supercomputing, pp. 245–260, (Editors: Julian Kunkel, Thomas Ludwig, Hans Meuer), Lecture Notes in Computer Science, ISC events, ISC'14, Leipzig, ISBN: 978-3-319-07517-4, 2014
    BibTeX DOI PDF
    Abstract: Performance analysis and optimization of high-performance I/O systems is a daunting task. Mainly, this is due to the overwhelmingly complex interplay of the involved hardware and software layers. The Scalable I/O for Extreme Performance (SIOX) project provides a versatile environment for monitoring I/O activities and learning from this information. The goal of SIOX is to automatically suggest and apply performance optimizations, and to assist in locating and diagnosing performance problems. In this paper, we present the current status of SIOX. Our modular architecture covers instrumentation of POSIX, MPI and other high-level I/O libraries; the monitoring data is recorded asynchronously into a global database, and recorded traces can be visualized. Furthermore, we offer a set of primitive plug-ins with additional features to demonstrate the flexibility of our architecture: A surveyor plug-in to keep track of the observed spatial access patterns; an fadvise plug-in for injecting hints to achieve read-ahead for strided access patterns; and an optimizer plug-in which monitors the performance achieved with different MPI-IO hints, automatically supplying the best known hint-set when no hints were explicitly set. The presentation of the technical status is accompanied by a demonstration of some of these features on our 20 node cluster. In additional experiments, we analyze the overhead for concurrent access, for MPI-IO’s 4-levels of access, and for an instrumented climate application. While our prototype is not yet full-featured, it demonstrates the potential and feasibility of our approach.
  • ICON DSL: A Domain-Specific Language for climate modeling (Raul Torres, Leonidas Lindarkis, Julian Kunkel, Thomas Ludwig), In WOLFHPC 2013 Third International Workshop on Domain-Specific Languages and High-Level Frameworks for High Performance Computing, SC13, Denver, 2013-11-18
    BibTeX URL
  • Poster: Source-to-Source Translation for Climate Models (Raul Torres, Leonidas Lindarkis, Julian Kunkel), International Supercomputing Conference 2013, Leipzig, Germany, 2013-06-17
    BibTeX URL
  • Towards Self-optimization in HPC I/O (Michaela Zimmer, Julian Kunkel, Thomas Ludwig), In Supercomputing, Lecture Notes in Computer Science (7905), pp. 422–434, (Editors: Julian Martin Kunkel, Thomas Ludwig, Hans Werner Meuer), Springer (Berlin, Heidelberg), ISC 2013, Leipzig, Germany, ISBN: 978-3-642-38749-4, ISSN: 0302-9743, 2013-06
    BibTeX DOI PDF
    Abstract: Performance analysis and optimization of high-performance I/O systems is a daunting task. Mainly, this is due to the overwhelmingly complex interplay of internal processes while executing application programs. Unfortunately, there is a lack of monitoring tools to reduce this complexity to a bearable level. For these reasons, the project Scalable I/O for Extreme Performance (SIOX) aims to provide a versatile environment for recording system activities and learning from this information. While still under development, SIOX will ultimately assist in locating and diagnosing performance problems and automatically suggest and apply performance optimizations. The SIOX knowledge path is concerned with the analysis and utilization of data describing the cause-and-effect chain recorded via the monitoring path. In this paper, we present our refined modular design of the knowledge path. This includes a description of logical components and their interfaces, details about extracting, storing and retrieving abstract activity patterns, a concept for tying knowledge to these patterns, and the integration of machine learning. Each of these tasks is illustrated through examples. The feasibility of our design is further demonstrated with an internal component for anomaly detection, permitting intelligent monitoring to limit the SIOX system's impact on system resources.
  • Evaluating Lossy Compression on Climate Data (Nathanael Hübbe, Al Wegener, Julian Kunkel, Yi Ling, Thomas Ludwig), In Supercomputing, Lecture Notes in Computer Science (7905), pp. 343–356, (Editors: Julian Martin Kunkel, Thomas Ludwig, Hans Werner Meuer), Springer (Berlin, Heidelberg), ISC 2013, Leipzig, Germany, ISBN: 978-3-642-38749-4, ISSN: 0302-9743, 2013-06
    BibTeX DOI PDF
    Abstract: While the amount of data used by today's high-performance computing (HPC) codes is huge, HPC users have not broadly adopted data compression techniques, apparently because of a fear that compression will either unacceptably degrade data quality or that compression will be too slow to be worth the effort. In this paper, we examine the effects of three lossy compression methods (GRIB2 encoding, GRIB2 using JPEG 2000 and LZMA, and the commercial Samplify APAX algorithm) on decompressed data quality, compression ratio, and processing time. A careful evaluation of selected lossy and lossless compression methods is conducted, assessing their influence on data quality, storage requirements and performance. The differences between input and decoded datasets are described and compared for the GRIB2 and APAX compression methods. Performance is measured using the compressed file sizes and the time spent on compression and decompression. The test data consists of both 9 synthetic datasets exposing compression behavior and 123 climate variables output from a climate model. The benefits of lossy compression for HPC systems are described and are related to our findings on data quality.
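    As a toy stand-in for such an evaluation (it does not use GRIB2, JPEG 2000 or APAX), the sketch below quantizes a field to a fixed number of bits, compresses it losslessly, and reports the compression ratio together with the maximum absolute error introduced.

```python
# Sketch: compression ratio and introduced error of a simple
# quantize-then-deflate scheme on a synthetic field (illustrative only).
import zlib
import numpy as np

def lossy_ratio(field, bits=12):
    lo, hi = field.min(), field.max()
    scale = (2 ** bits - 1) / (hi - lo)
    quantized = np.round((field - lo) * scale).astype(np.uint16)
    compressed = zlib.compress(quantized.tobytes(), level=9)
    restored = quantized / scale + lo
    return field.nbytes / len(compressed), np.abs(field - restored).max()

field = 250 + 30 * np.sin(np.linspace(0, 40, 1_000_000))  # e.g. a temperature field in K
ratio, max_err = lossy_ratio(field)
print(f"compression ratio {ratio:.1f}:1, max abs error {max_err:.4f}")
```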
  • Using Simulation to Validate Performance of MPI(-IO) Implementations (Julian Kunkel), In Supercomputing, Lecture Notes in Computer Science (7905), pp. 181–195, (Editors: Julian Martin Kunkel, Thomas Ludwig, Hans Werner Meuer), Springer (Berlin, Heidelberg), ISC 2013, Leipzig, Germany, ISBN: 978-3-642-38749-4, ISSN: 0302-9743, 2013-06
    BibTeX DOI PDF
    Abstract: Parallel file systems and MPI implementations aim to exploit available hardware resources in order to achieve optimal performance. Since performance is influenced by many hardware and software factors, achieving optimal performance is a daunting task. For these reasons, optimized communication and I/O algorithms are still subject to research. While the complexity of collective MPI operations is sometimes discussed in the literature, theoretic assessment of the measurements is de facto non-existent. Instead, conducted analysis is typically limited to performance comparisons to previous algorithms. However, observable performance is not only determined by the quality of an algorithm. At run-time, performance could be degraded due to unexpected implementation issues and triggered hardware and software exceptions. By applying a model that resembles the system, simulation allows us to estimate the performance. With this approach, the non-functional requirement for performance of an implementation can be validated and run-time inefficiencies can be localized. In this paper we demonstrate how simulation can be applied to assess observed performance of collective MPI calls and parallel I/O. PIOsimHD, an event-driven simulator, is applied to validate observed performance on our 10 node cluster. The simulator replays recorded application activity and point-to-point operations of collective operations. It also offers the option to record trace files for visual comparison to recorded behavior. With the innovative introspection into behavior, several bottlenecks in system and implementation are localized.
  • Performance-optimized clinical IMRT planning on modern CPUs (Peter Ziegenhein, Cornelis Ph Kamerling, Mark Bangert, Julian Kunkel, Uwe Oelfke), In Physics in Medicine and Biology, Series: Volume 58 Number 11, IOP Publishing, ISSN: 1361-6560, 2013-05-08
    BibTeX URL DOI
    Abstract: Intensity modulated treatment plan optimization is a computationally expensive task. The feasibility of advanced applications in intensity modulated radiation therapy as every day treatment planning, frequent re-planning for adaptive radiation therapy and large-scale planning research severely depends on the runtime of the plan optimization implementation. Modern computational systems are built as parallel architectures to yield high performance. The use of GPUs, as one class of parallel systems, has become very popular in the field of medical physics. In contrast we utilize the multi-core central processing unit (CPU), which is the heart of every modern computer and does not have to be purchased additionally. In this work we present an ultra-fast, high precision implementation of the inverse plan optimization problem using a quasi-Newton method on pre-calculated dose influence data sets. We redefined the classical optimization algorithm to achieve a minimal runtime and high scalability on CPUs. Using the proposed methods in this work, a total plan optimization process can be carried out in only a few seconds on a low-cost CPU-based desktop computer at clinical resolution and quality. We have shown that our implementation uses the CPU hardware resources efficiently with runtimes comparable to GPU implementations, at lower costs.
  • Reducing the HPC-Datastorage Footprint with MAFISC – Multidimensional Adaptive Filtering Improved Scientific data Compression (Nathanael Hübbe, Julian Kunkel), In Computer Science - Research and Development, Series: Volume 28, Issue 2-3, pp. 231–239, Springer, 2013-05
    BibTeX URL PDF
    Abstract: Large HPC installations today also include large data storage installations. Data compression can significantly reduce the amount of data, and it was one of our goals to find out how much compression can do for climate data. The price of compression is, of course, the need for additional computational resources, so our second goal was to relate the savings of compression to the costs it necessitates. In this paper we present the results of our analysis of typical climate data. A lossless algorithm based on these insights is developed and its compression ratio is compared to that of standard compression tools. As it turns out, this algorithm is general enough to be useful for a large class of scientific data, which is the reason we speak of MAFISC as a method for scientific data compression. A numeric problem for lossless compression of scientific data is identified and a possible solution is given. Finally, we discuss the economics of data compression in HPC environments using the example of the German Climate Computing Center.
  • Towards I/O Analysis of HPC Systems and a Generic Architecture to Collect Access Patterns (Marc Wiedemann, Julian Kunkel, Michaela Zimmer, Thomas Ludwig, Michael Resch, Thomas Bönisch, Xuan Wang, Andriy Chut, Alvaro Aguilera, Wolfgang E. Nagel, Michael Kluge, Holger Mickler), In Computer Science - Research and Development, Series: 28, pp. 241–251, Springer New York Inc. (Hamburg, Berlin, Heidelberg), ISSN: 1865-2034, 2013-05
    BibTeX URL PDF
    Abstract: In high-performance computing applications, a high-level I/O call will trigger activities on a multitude of hardware components. These are massively parallel systems supported by huge storage systems and internal software layers. Their complex interplay currently makes it impossible to identify the causes for and the locations of I/O bottlenecks. Existing tools indicate when a bottleneck occurs but provide little guidance in identifying the cause or improving the situation. We have thus initiated Scalable I/O for Extreme Performance to find solutions for this problem. To achieve this goal in SIOX, we will build a system to record access information on all layers and components, to recognize access patterns, and to characterize the I/O system. The system will ultimately be able to recognize the causes of the I/O bottlenecks and propose optimizations for the I/O middleware that can improve I/O performance, such as throughput rate and latency. Furthermore, the SIOX system will be able to support decision making while planning new I/O systems. In this paper, we introduce the SIOX system and describe its current status: We first outline our approach for collecting the required access information. We then provide the architectural concept, the methods for reconstructing the I/O path and an excerpt of the interface for data collection. This paper focuses especially on the architecture, which collects and combines the relevant access information along the I/O path, and which is responsible for the efficient transfer of this information. An abstract modelling approach allows us to better understand the complexity of the analysis of the I/O activities on parallel computing systems, and an abstract interface allows us to adapt the SIOX system to various HPC file systems.
  • A Study on Data Deduplication in HPC Storage Systems (Dirk Meister, Jürgen Kaiser, Andre Brinkmann, Michael Kuhn, Julian Kunkel, Toni Cortes), In Proceedings of the ACM/IEEE Conference on High Performance Computing (SC), IEEE Computer Society, SC'12, Salt Lake City, USA, 2012-11-10
    BibTeX
  • Simulating parallel programs on application and system level (Julian Kunkel), In Computer Science – Research and Development, Series: Volume 28 Number 2-3, Springer (Berlin, Heidelberg), ISSN: 1865-2042, 2012-06
    BibTeX URL DOI PDF
    Abstract: Understanding the measured performance of parallel applications in real systems is difficult: with the aim to utilize the resources available, optimizations deployed in hardware and software layers build up to complex systems. However, in order to identify bottlenecks the performance must be assessed. This paper introduces PIOsimHD, an event-driven simulator for MPI-IO applications and the underlying (heterogeneous) cluster computers. With the help of the simulator, runs of MPI-IO applications can be conducted in silico; this includes detailed simulation of collective communication patterns as well as simulation of parallel I/O. The simulation estimates upper bounds for expected performance and helps assessing observed performance. Together with HDTrace, an environment which allows tracing the behavior of MPI programs and internals of MPI and PVFS, PIOsimHD enables us to localize inefficiencies, to conduct research on optimizations for communication algorithms, and to evaluate arbitrary and future systems. In this paper the simulator is introduced and an excerpt of the conducted validation is presented, which demonstrates the accuracy of the models for our cluster.
  • Simulating Application and System Interaction with PIOsimHD (Julian Kunkel, Thomas Ludwig), In Proceedings of the Work in Progress Session, 20th Euromicro International Conference on Parallel, Distributed and Network-Based Processing, SEA-Publications (31), (Editors: Erwin Grosspietsch, Konrad Klöckner), Institute for Systems Engineering and Automation (Johannes Kepler University Linz), Munich Network Management Team, PDP 2012, Garching, Germany, ISBN: 978-3-902457-31-8, 2012
    BibTeX
  • Scientific Computing: Performance and Efficiency in Climate Models (Sandra Schröder, Michael Kuhn, Nathanael Hübbe, Julian Kunkel, Timo Minartz, Petra Nerge, Florens Wasserfall, Thomas Ludwig), In Proceedings of the Work in Progress Session, 20th Euromicro International Conference on Parallel, Distributed and Network-Based Processing, SEA-Publications (31), (Editors: Erwin Grosspietsch, Konrad Klöckner), Institute for Systems Engineering and Automation (Johannes Kepler University Linz), Munich Network Management Team, PDP 2012, Garching, Germany, ISBN: 978-3-902457-31-8, 2012
    BibTeX
  • Optimizations for Two-Phase Collective I/O (Michael Kuhn, Julian Kunkel, Yuichi Tsujita, Hidetaka Muguruma, Thomas Ludwig), In Applications, Tools and Techniques on the Road to Exascale Computing, Advances in Parallel Computing (22), pp. 455–462, (Editors: Koen De Bosschere, Erik H. D'Hollander, Gerhard R. Joubert, David Padua, Frans Peters), IOS Press (Amsterdam, Berlin, Tokyo, Washington DC), University of Ghent, ELIS Department, ParCo 2011, Ghent, Belgium, ISBN: 978-1-61499-040-6, ISSN: 0927-5452, 2012
    BibTeX
    Abstract: The performance of parallel distributed file systems suffers from many clients executing a large number of operations in parallel, because the I/O subsystem can be easily overwhelmed by the sheer amount of incoming I/O operations. This, in turn, can slow down the whole distributed system. Many optimizations exist that try to alleviate this problem. Client-side optimizations perform preprocessing to minimize the amount of work the file servers have to do. Server-side optimizations use server-internal knowledge to improve performance. This paper provides an overview of existing client-side optimizations and presents new modifications of the Two-Phase protocol. Interleaved Two-Phase is a modification of ROMIO's Two-Phase protocol, which iterates over the file differently to reduce the number of seek operations on disk. Pipelined Two-Phase uses a pipelined scheme which overlaps I/O and communication phases to utilize the network and I/O subsystems concurrently.
  • Tool Environments to Measure Power Consumption and Computational Performance (Timo Minartz, Daniel Molka, Julian Kunkel, Michael Knobloch, Michael Kuhn, Thomas Ludwig), In Handbook of Energy-Aware and Green Computing (Ishfaq Ahmad, Sanjay Ranka), Chapters: 31, pp. 709–743, Chapman and Hall/CRC Press Taylor and Francis Group (6000 Broken Sound Parkway NW, Boca Raton, FL 33487), ISBN: 978-1-4398-5040-4, 2012
    BibTeX
  • Tracing and Visualization of Energy-Related Metrics (Timo Minartz, Julian M. Kunkel, Thomas Ludwig), In 26th IEEE International Parallel & Distributed Processing Symposium Workshops, IEEE Computer Society, HPPAC 2012, Shanghai, China, 2012
    BibTeX
    Abstract: In an effort to reduce the energy consumption of high-performance computing centers, a number of new approaches have been developed in the last few years. One of these approaches is to switch hardware to lower power states in phases of device idleness or low utilization. Even if the concepts are already quite clear, tools to identify these phases in applications and to determine their impact on performance and power consumption are still missing. In this paper, we integrate the tracing of energy-related metrics into our existing tracing environment in an effort to correlate them with the application. We implement tracing of performance and sleep states of the processor, the disk and the network device states in addition to the node power consumption. The exemplary energy efficiency analysis visually correlates the application with the energy-related metrics. With this correlation, it is possible to identify and further avoid waiting times caused by mode switches initiated by the user or the system.
  • Simulation-Aided Performance Evaluation of Server-Side Input/Output Optimizations (Michael Kuhn, Julian Kunkel, Thomas Ludwig), In 20th Euromicro International Conference on Parallel, Distributed and Network-Based Processing, pp. 562–566, (Editors: Rainer Stotzka, Michael Schiffers, Yiannis Cotronis), IEEE Computer Society (Los Alamitos, Washington, Tokyo), Munich Network Management Team, PDP 2012, Garching, Germany, ISBN: 978-0-7695-4633-9, ISSN: 1066-6192, 2012
    BibTeX
    Abstract: The performance of parallel distributed file systems suffers from many clients executing a large number of operations in parallel, because the I/O subsystem can be easily overwhelmed by the sheer amount of incoming I/O operations. Many optimizations exist that try to alleviate this problem. Client-side optimizations perform preprocessing to minimize the amount of work the file servers have to do. Server-side optimizations use server-internal knowledge to improve performance. The HDTrace framework contains components to simulate, trace and visualize applications. It is used as a testbed to evaluate optimizations that could later be implemented in real-life projects. This paper compares existing client-side optimizations and newly implemented server-side optimizations and evaluates their usefulness for I/O patterns commonly found in HPC. Server-directed I/O chooses the order of non-contiguous I/O operations and tries to aggregate as many operations as possible to decrease the load on the I/O subsystem and improve overall performance. The results show that server-side optimizations beat client-side optimizations in terms of performance for many use cases. Integrating such optimizations into parallel distributed file systems could alleviate the need for sophisticated client-side optimizations. Due to their additional knowledge of internal workflows, server-side optimizations may be better suited to provide high performance in general.
  • IOPm – Modeling the I/O Path with a Functional Representation of Parallel File System and Hardware Architecture (Julian Kunkel, Thomas Ludwig), In 20th Euromicro International Conference on Parallel, Distributed and Network-Based Processing, pp. 554–561, (Editors: Rainer Stotzka, Michael Schiffers, Yiannis Cotronis), IEEE Computer Society (Los Alamitos, Washington, Tokyo), Munich Network Management Team, PDP 2012, Garching, Germany, ISBN: 978-0-7695-4633-9, ISSN: 1066-6192, 2012
    BibTeX
    Abstract: The I/O path model (IOPm) is a graphical representation of the architecture of parallel file systems and the machine they are deployed on. With help of IOPm, file system and machine configurations can be quickly analyzed and distinguished from each other. Contrary to typical representations of the machine and file system architecture, the model visualizes the data or metadata path of client access. Abstract functionality of hardware components such as client and server nodes is covered as well as software aspects such as high-level I/O libraries, collective I/O and caches. Redundancy could be represented, too. Besides the advantage of a standardized representation for analysis, IOPm assists in identifying and communicating bottlenecks in the machine and file system configuration by highlighting performance-relevant functionalities. By abstracting functionalities from the components they are hosted on, IOPm will enable building interfaces to monitor file system activity.
  • Visualization of MPI(-IO) Datatypes (Julian Kunkel, Thomas Ludwig), In Applications, Tools and Techniques on the Road to Exascale Computing, Advances in Parallel Computing (22), pp. 473–480, (Editors: Koen De Bosschere, Erik H. D'Hollander, Gerhard R. Joubert, David Padua, Frans Peters), IOS Press (Amsterdam, Berlin, Tokyo, Washington DC), University of Ghent, ELIS Department, ParCo 2011, Ghent, Belgium, ISBN: 978-1-61499-040-6, ISSN: 0927-5452, 2012
    BibTeX
    Abstract: To permit easy and efficient access to non-contiguous regions in memory for communication and I/O, the message passing interface offers nested datatypes. Since nested datatypes can be very complicated, the understanding of non-contiguous access patterns and the debugging of wrongly accessed memory regions is hard for the developer. HDTrace is an environment which allows tracing the behavior of MPI programs and simulating them for arbitrary virtual cluster configurations. It is designed to record all MPI parameters including MPI datatypes. In this paper we present the capabilities to visualize the usage of derived datatypes for communication and I/O accesses: a simple hierarchical view is introduced which presents them in a compact form and allows digging into the nested datatypes. File regions accessed in non-contiguous I/O calls can be visualized in terms of the original datatype. The presented feature assists developers in understanding the datatype layout and spatial I/O access patterns of their application.
  • HDTrace – A Tracing and Simulation Environment of Application and System Interaction (Julian Kunkel), Research Papers (2), Research Group: Scientific Computing, University of Hamburg (Deutsches Klimarechenzentrum GmbH, Bundesstraße 45a, D-20146 Hamburg), 2011-01-23
    BibTeX PDF
    Abstract: HDTrace is an environment which allows tracing and simulating the behavior of MPI programs on a cluster. It explicitly includes support to trace internals of MPICH2 and the parallel file system PVFS. With this support it enables localizing inefficiencies, conducting research on new algorithms and evaluating future systems. Simulation provides upper bounds of expected performance and helps to assess observed performance, as potential performance gains of optimizations can be approximated.
    In this paper the environment is introduced and several examples depict how it assists in revealing internal behavior and spotting bottlenecks. In an example with PVFS, the inefficient write-out of a matrix diagonal could be identified either by inspecting the PVFS server behavior or by simulation. Additionally, the simulation showed that in theory the operation should finish 20 times faster on our cluster; by applying correct MPI hints this potential could be exploited.
  • Towards an Energy-Aware Scientific I/O Interface – Stretching the ADIOS Interface to Foster Performance Analysis and Energy Awareness (Julian Kunkel, Timo Minartz, Michael Kuhn, Thomas Ludwig), In Computer Science - Research and Development, Series: 1, (Editors: Thomas Ludwig), Springer (Berlin / Heidelberg, Germany), 2011
    BibTeX DOI PDF
    Abstract: Intelligently switching energy saving modes of CPUs, NICs and disks is mandatory to reduce the energy consumption. Hardware and operating system have a limited perspective of future performance demands; thus, automatic control is suboptimal. However, it is tedious for developers to control the hardware themselves. In this paper we propose an extension of an existing I/O interface which on the one hand is easy to use and on the other hand could steer energy saving modes more efficiently. Furthermore, the proposed modifications are beneficial for performance analysis and provide even more information to the I/O library to improve performance. When a user annotates the program with the proposed interface, I/O, communication and computation phases are labeled by the developer. Run-time behavior is then characterized for each phase; this knowledge could then be exploited by the new library.
  • System Performance Comparison of Stencil Operations with the Convey HC-1 (Julian Kunkel, Petra Nerge), Technical Reports (1), Research Group: Scientific Computing, University of Hamburg (Deutsches Klimarechenzentrum GmbH, Bundesstraße 45a, D-20146 Hamburg), 2010-11-16
    BibTeX URL
    Abstract: In this technical report, our first experiences with a Convey HC-1 are documented. Several stencil application kernels are evaluated and related work in the area of CPUs, GPUs and FPGAs is discussed. Performance of the C and Fortran stencil benchmarks in single and double precision is reported. Benchmarks were run on Blizzard, the IBM supercomputer at DKRZ, the working group's Intel Westmere cluster and the Convey HC-1 provided at KIT.
    With the Vector personality, performance of the Convey system is not convincing. However, there lies potential in programming custom personalities. The major issue is to approximate the performance of an implementation on an FPGA before the time-consuming implementation is performed.
  • Classification of Network Computers Based on Distribution of ICMP-echo Round-trip Times (Julian Kunkel, Jan C. Neddermeyer, Thomas Ludwig), Research Papers (1), Staats- und Universitätsbibliothek Hamburg (Carl von Ossietzky, Von-Melle-Park 3, 20146 Hamburg), 2010-09-28
    BibTeX URL
    Abstract: Classification of network hosts into groups of similar hosts allows an attacker to transfer knowledge gathered from one host of a group to others. In this paper we demonstrate that it is possible to classify hosts by inspecting the distributions of the response times from ICMP echo requests. In particular, it is shown that the response time of a host is like a fingerprint covering components inside the network, the host software, as well as some hardware aspects of the target.
    This allows identifying nodes consisting of similar hardware and OS. Instances of virtual machines hosted on the same physical hardware can be detected in the same way. To understand the influence of hardware and software components, a simple model is built and the quantitative contribution of each component to the round-trip time is briefly evaluated.
    Several experiments show the successful application of the classifier inside an Ethernet LAN and over the Internet.
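    A rough sketch of the classification idea (host names, probe count and clustering threshold are placeholders): collect round-trip times per host via the system ping command, compare the empirical distributions with the two-sample Kolmogorov-Smirnov statistic, and cluster hosts hierarchically on that distance.

```python
# Sketch: group hosts by the distribution of their ping round-trip times
# (uses the system ping; KS distance and average linkage are illustrative).
import re
import subprocess
import numpy as np
from scipy.stats import ks_2samp
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def rtt_samples(host, count=50):
    out = subprocess.run(["ping", "-c", str(count), host],
                         capture_output=True, text=True).stdout
    return np.array([float(m) for m in re.findall(r"time=([\d.]+)", out)])

def classify(hosts, count=50, threshold=0.3):
    samples = [rtt_samples(h, count) for h in hosts]
    n = len(hosts)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            dist[i, j] = dist[j, i] = ks_2samp(samples[i], samples[j]).statistic
    labels = fcluster(linkage(squareform(dist), method="average"),
                      threshold, criterion="distance")
    return dict(zip(hosts, labels))

# Hypothetical usage:
# print(classify(["node01", "node02", "webserver.example.org"]))
```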
  • Poster: Benchmarking Application I/O in the Community (Julian Kunkel, Olga Mordvinova, Dennis Runz, Michael Kuhn, Thomas Ludwig), International Supercomputing Conference, Hamburg, Germany, 2010-06-01
    BibTeX URL Abstract PDF
  • Poster: Simulation of Cluster Power Consumption and Energy-to-Solution (Timo Minartz, Julian Kunkel, Thomas Ludwig), International Conference on Energy-Efficient Computing and Networking, Passau, Germany, 2010-04-14
    BibTeX URL
  • From experimental setup to bioinformatics: an RNAi screening platform to identify host factors involved in HIV-1 replication (Kathleen Börner, Johannes Hermle, Christoph Sommer, Nigel P. Brown, Bettina Knapp, Bärbel Glass, Julian Kunkel, Gloria Torralba, Jürgen Reymann, Nina Beil, Jürgen Beneke, Rainer Pepperkok, Reinhard Schneider, Thomas Ludwig, Michael Hausmann, Fred Hamprecht, Holger Erfle, Lars Kaderali, Hans-Georg Kräusslich, Maik J. Lehmann), In Biotechnology Journal, Series: 5-1, pp. 39–49, WILEY-VCH (Weinheim, Germany), ISSN: 1860-7314, 2010-01
    BibTeX URL DOI
    Abstract: RNA interference (RNAi) has emerged as a powerful technique for studying loss of function phenotypes by specific down-regulation of gene expression, allowing the investigation of virus-host interactions by large scale high-throughput RNAi screens. Here we comprehensively describe a robust and sensitive siRNA screening platform consisting of an experimental setup, single-cell image analysis and statistical as well as bioinformatics analyses. The workflow has been established to elucidate host gene functions exploited by viruses, monitoring both suppression and enhancement of viral replication simultaneously by fluorescence microscopy. The platform comprises a two-stage procedure in which potential host-factors were first identified in a primary screen and afterwards retested in a validation screen to confirm true positive hits. Subsequent bioinformatics analysis allows the identification of cellular genes participating in metabolic pathways and cellular networks utilized by viruses for efficient infection. Our workflow has been used to investigate host factor usage by the human immunodeficiency virus-1 (HIV 1) but can also be adapted to different viruses. Importantly, the provided platform can be used to guide further screening approaches, thus contributing to fill in current gaps in our understanding of virus-host interactions.
  • Tracing Performance of MPI-I/O with PVFS2: A Case Study of Optimization (Yuichi Tsujita, Julian Kunkel, Stephan Krempel, Thomas Ludwig), In Parallel Computing: From Multicores and GPU's to Petascale, pp. 379–386, IOS Press, PARCO 2009, ISBN: 978-1-60750-530-3, 2010
    BibTeX URL DOI
  • Collecting Energy Consumption of Scientific Data (Julian Kunkel, Olga Mordvinova, Michael Kuhn, Thomas Ludwig), In Computer Science - Research and Development, Series: 3, pp. 1–9, (Editors: Thomas Ludwig), Springer (Berlin / Heidelberg, Germany), ISSN: 1865-2034, 2010
    BibTeX URL DOI
    Abstract: In this paper the data life cycle management is extended by accounting for energy consumption during the life cycle of files. Information about the energy consumption of data not only allows accounting for the correct costs of its life cycle but also provides feedback to the user and administrator, and improves awareness of the energy consumption of file I/O. Ideas to realize a storage landscape which determines the energy consumption for maintaining and accessing each file are discussed. We propose to add new extended attributes to file metadata which make it possible to compute the energy consumed during the life cycle of each file.
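    On Linux, such extended-attribute bookkeeping could look roughly like the following sketch; the attribute name and the joule values are made up for illustration.

```python
# Sketch: account the energy consumed by accesses to a file in an extended
# attribute (Linux user xattrs; the attribute name and values are made up).
import os

ATTR = "user.energy_joules"

def add_energy(path, joules):
    try:
        current = float(os.getxattr(path, ATTR).decode())
    except OSError:                 # attribute not set yet
        current = 0.0
    os.setxattr(path, ATTR, f"{current + joules}".encode())

def energy_of(path):
    try:
        return float(os.getxattr(path, ATTR).decode())
    except OSError:
        return 0.0

# Hypothetical usage:
# add_energy("/scratch/run42/output.nc", joules=0.8)
# print(energy_of("/scratch/run42/output.nc"))
```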
  • Simulation of power consumption of energy efficient cluster hardware (Timo Minartz, Julian Kunkel, Thomas Ludwig), In Computer Science - Research and Development, Series: 3, pp. 165–175, (Editors: Thomas Ludwig), Springer (Berlin / Heidelberg, Germany), ISSN: 1865-2034, 2010
    BibTeX URL DOI
    Abstract: In recent years the power consumption of high-performance computing clusters has become a growing problem because the number and size of cluster installations has been rising. The high power consumption of clusters is a consequence of their design goal: High performance. With low utilization, cluster hardware consumes nearly as much energy as when it is fully utilized. Theoretically, in these low utilization phases cluster hardware can be turned off or switched to a lower power consuming state. We designed a model to estimate power consumption of hardware based on the utilization. Applications are instrumented to create utilization trace files for a simulator realizing this model. Different hardware components can be simulated using multiple estimation strategies. An optimal strategy determines an upper bound of energy savings for existing hardware without affecting the time-to-solution. Additionally, the simulator can estimate the power consumption of efficient hardware which is energy-proportional. This way the minimum power consumption can be determined for a given application. Naturally, this minimal power consumption provides an upper bound for any power saving strategy. After evaluating the correctness of the simulator several different strategies and energy-proportional hardware are compared.
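    A tiny sketch of such an estimation model: integrate power over a utilization trace, once with a fixed idle power (existing hardware) and once assuming ideal energy-proportional hardware, which bounds the achievable savings. The power values and phases are invented.

```python
# Sketch: estimate node energy from a utilization trace under two models
# (invented power values; the real simulator covers several components
# and switching strategies).
def energy_wh(trace, idle_w=150.0, max_w=300.0, proportional=False):
    """trace: list of (seconds, utilization in [0, 1]) intervals."""
    joules = 0.0
    for seconds, util in trace:
        if proportional:
            power = max_w * util                     # ideal energy-proportional node
        else:
            power = idle_w + (max_w - idle_w) * util # existing hardware with high idle power
        joules += power * seconds
    return joules / 3600.0

trace = [(600, 0.9), (1200, 0.05), (600, 0.8)]       # hypothetical phases
print(f"existing hardware: {energy_wh(trace):.1f} Wh, "
      f"energy-proportional bound: {energy_wh(trace, proportional=True):.1f} Wh")
```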
  • I/O Performance Evaluation with Parabench – Programmable I/O Benchmark (Olga Mordvinova, Dennis Runz, Julian Kunkel, Thomas Ludwig), In Procedia Computer Science, Series: 1-1, pp. 2119–2128, Elsevier B.V (Amsterdam, Netherlands), ISSN: 1877-0509, 2010
    BibTeX URL DOI
    Abstract: Choosing an appropriate cluster file system for a specific high performance computing application is challenging and depends mainly on the specific application I/O needs. There is a wide variety of I/O requirements: Some implementations require reading and writing large datasets, others out-of-core data access, or they have database access requirements. Application access patterns reflect different I/O behavior and can be used for performance testing. This paper presents the programmable I/O benchmarking tool Parabench. It has access patterns as input, which can be adapted to mimic behavior for a rich set of applications. Using this benchmarking tool, composed patterns can be automatically tested and easily compared on different local and cluster file systems. Here we introduce the design of the proposed benchmark, focusing on the Parabench programming language, which was developed for flexible pattern creation. We also demonstrate here an exemplary usage of Parabench and its capabilities to handle the POSIX and MPI-IO interfaces.
  • Tracing Internal Communication in MPI and MPI-I/O (Julian Kunkel, Yuichi Tsujita, Olga Mordvinova, Thomas Ludwig), In International Conference on Parallel and Distributed Computing, Applications and Technologies, PDCAT, pp. 280–286, IEEE Computer Society (Washington, DC, USA), Hiroshima University, PDCAT-09, Higashi Hiroshima, Japan, ISBN: 978-0-7695-3914-0, 2009-12-29
    BibTeX DOI
    Abstract: MPI implementations can realize MPI operations with any algorithm that fulfills the specified semantics. To provide optimal efficiency, the MPI implementation might choose the algorithm dynamically, depending on the parameters given to the function call. However, this selection is not transparent to the user. While this abstraction is appropriate for common users, achieving best performance with fixed parameter sets requires knowledge of internal processing. Also, for developers of collective operations it might be useful to understand timing issues inside the communication or I/O call. In this paper we extended the PIOviz environment to trace MPI internal communication. This allows the user to see PVFS server behavior together with the behavior in the MPI application and inside MPI itself. We present some analysis results for these capabilities for MPICH2 on a Beowulf cluster.
  • USB Flash Drives as an Energy Efficiency Storage Alternative (Olga Mordvinova, Julian Kunkel, Christian Baun, Thomas Ludwig, Marcel Kunze), In Proceedings of the 10th IEEE/ACM International Conference on Grid Computing, pp. 175–182, IEEE Computer Society (Washington, DC, USA), IEEE/ACM, GRID-09, Banff, Alberta, Canada, ISBN: 978-1-4244-5148-7, 2009-10
    BibTeX DOI
  • Poster: Data Storage and Processing for High Throughput RNAi Screening (Julian Kunkel, Thomas Ludwig, M. Hemberger, G. Torralba, E. Schmitt, M. Hausmann, V. Lindenstruth, N. Brown, R. Schneider), German Symposium on Systems Biology 2009, Heidelberg, Germany, 2009
    BibTeX PDF
  • Dynamic file system semantics to enable metadata optimizations in PVFS (Michael Kuhn, Julian Kunkel, Thomas Ludwig), In Concurrency and Computation: Practice and Experience, Series: 21-14, pp. 1775–1788, John Wiley and Sons Ltd. (Chichester, UK), ISSN: 1532-0626, 2009
    BibTeX URL DOI
    Abstract: Modern file systems maintain extensive metadata about stored files. While metadata typically is useful, there are situations when the additional overhead of such a design becomes a problem in terms of performance. This is especially true for parallel and cluster file systems, where every metadata operation is even more expensive due to their architecture. In this paper several changes made to the parallel cluster file system Parallel Virtual File System (PVFS) are presented. The changes target at the optimization of workloads with large numbers of small files. To improve the metadata performance, PVFS was modified such that unnecessary metadata is not managed anymore. Several tests with a large quantity of files were performed to measure the benefits of these changes. The tests have shown that common file system operations can be sped up by a factor of two even with relatively few changes.
  • Small-file Access in Parallel File Systems (Philip Carns, Sam Lang, Robert Ross, Murali Vilayannur, Julian Kunkel, Thomas Ludwig), In IPDPS '09: Proceedings of the 2009 IEEE International Symposium on Parallel and Distributed Processing, pp. 1–11, IEEE Computer Society (Washington, DC, USA), University of Rome, IPDPS-09, Rome, Italy, ISBN: 978-1-4244-3751-1, 2009
    BibTeX URL DOI
    Abstract: Today's computational science demands have resulted in ever larger parallel computers, and storage systems have grown to match these demands. Parallel file systems used in this environment are increasingly specialized to extract the highest possible performance for large I/O operations, at the expense of other potential workloads. While some applications have adapted to I/O best practices and can obtain good performance on these systems, the natural I/O patterns of many applications result in generation of many small files. These applications are not well served by current parallel file systems at very large scale. This paper describes five techniques for optimizing small-file access in parallel file systems for very large scale systems. These five techniques are all implemented in a single parallel file system (PVFS) and then systematically assessed on two test platforms. A microbenchmark and the mdtest benchmark are used to evaluate the optimizations at an unprecedented scale. We observe as much as a 905% improvement in small-file create rates, 1,106% improvement in small-file stat rates, and 727% improvement in small-file removal rates, compared to a baseline PVFS configuration on a leadership computing platform using 16,384 cores.
  • Using Non-blocking I/O Operations in High Performance Computing to Reduce Execution Times (David Buettner, Julian Kunkel, Thomas Ludwig), In Proceedings of the 16th European PVM/MPI Users' Group Meeting on Recent Advances in Parallel Virtual Machine and Message Passing Interface, pp. 134–142, Springer-Verlag (Berlin, Heidelberg), CSC - IT, EuroPVM/MPI-09, Espoo, Finland, ISBN: 978-3-642-03769-6, 2009
    BibTeX URL DOI
    Abstract: As supercomputers become faster, the I/O part of applications can become a real problem in regard to overall execution times. System administrators and developers of hardware or software components reduce execution times by creating new and optimized parts for the supercomputers. While this helps a lot in the struggle to minimize I/O times, adjustment of the execution environment is not the only option to improve overall application behavior. In this paper we examine whether the application programmer can also contribute by making use of non-blocking I/O operations. After an analysis of non-blocking I/O operations and their potential for shortening execution times, we present a benchmark that was created and run in order to see whether the theoretical promises also hold in practice.
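    The abstract above concerns overlapping computation with file I/O via MPI's non-blocking operations. As a minimal sketch (not taken from the paper; the file name, buffer size, and placement of the compute phase are illustrative assumptions), the standard MPI-IO calls involved look roughly like this in C:

      /* Start a non-blocking write, compute while the I/O proceeds, then wait. */
      #include <mpi.h>
      #include <stdlib.h>

      int main(int argc, char **argv)
      {
          MPI_Init(&argc, &argv);

          MPI_File fh;
          MPI_File_open(MPI_COMM_WORLD, "out.dat",
                        MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

          const int n = 1 << 20;                           /* illustrative buffer size */
          double *buf = malloc(n * sizeof(double));
          for (int i = 0; i < n; i++) buf[i] = (double)i;

          MPI_Request req;
          MPI_File_iwrite(fh, buf, n, MPI_DOUBLE, &req);   /* returns immediately */

          /* ... computation that does not touch buf can run here ... */

          MPI_Wait(&req, MPI_STATUS_IGNORE);               /* complete the I/O before reusing buf */

          free(buf);
          MPI_File_close(&fh);
          MPI_Finalize();
          return 0;
      }

    Whether such overlap actually shortens execution time depends on the MPI implementation and the underlying file system, which is exactly what the paper's benchmark evaluates.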
  • Bottleneck Detection in Parallel File Systems with Trace-Based Performance Monitoring (Julian Kunkel, Thomas Ludwig), In Euro-Par '08: Proceedings of the 14th international Euro-Par conference on Parallel Processing, pp. 212–221, Springer-Verlag (Berlin, Heidelberg), University of Las Palmas de Gran Canaria, Euro-Par-08, Las Palmas de Gran Canaria, Spain, ISBN: 978-3-540-85450-0, 2008
    BibTeX URL DOI
    Abstract: Today we recognize a high demand for powerful storage. In industry this issue is tackled either with large storage area networks, or by deploying parallel file systems on top of RAID systems or on smaller storage networks. The bigger the system gets, the more important is the ability to analyze the performance and to identify bottlenecks in the architecture and the applications. We extended the performance monitor available in the parallel file system PVFS2 by including statistics of the server process and information about the system. Performance monitor data is available during runtime, and the server process was modified to store this data in off-line traces suitable for post-mortem analysis. These values can be used to detect bottlenecks in the system. Measured results demonstrate how they help to identify bottlenecks and may assist in ranking the servers according to their capabilities.
  • Directory-Based Metadata Optimizations for Small Files in PVFS (Michael Kuhn, Julian Kunkel, Thomas Ludwig), In Euro-Par '08: Proceedings of the 14th international Euro-Par conference on Parallel Processing, pp. 90–99, Springer-Verlag (Berlin, Heidelberg), University of Las Palmas de Gran Canaria, Euro-Par-08, Las Palmas de Gran Canaria, Spain, ISBN: 978-3-540-85450-0, 2008 – Awards: Best Paper
    BibTeX DOI
    Abstract: Modern file systems maintain extensive metadata about stored files. While this usually is useful, there are situations when the additional overhead of such a design becomes a problem in terms of performance. This is especially true for parallel and cluster file systems, where, due to their design, every metadata operation is even more expensive. In this paper several changes made to the parallel cluster file system PVFS are presented. The changes are targeted at the optimization of workloads with large numbers of small files. To improve metadata performance, PVFS was modified such that unnecessary metadata is not managed anymore. Several tests with a large quantity of files were performed to measure the benefits of these changes. The tests have shown that common file system operations can be sped up by a factor of two even with relatively few changes.
  • Analysis of the MPI-IO Optimization Levels with the PIOViz Jumpshot Enhancement (Thomas Ludwig, Stephan Krempel, Michael Kuhn, Julian Kunkel, Christian Lohse), In Recent Advances in Parallel Virtual Machine and Message Passing Interface, Lecture Notes in Computer Science (4757), pp. 213–222, (Editors: Franck Cappello, Thomas Hérault, Jack Dongarra), Springer (Berlin / Heidelberg, Germany), Institut national de recherche en informatique et automatique, EuroPVM/MPI-07, Paris, France, ISBN: 978-3-540-75415-2, 2007
    BibTeX URL DOI
    Abstract: With MPI-IO we see various alternatives for programming file I/O. The overall program performance depends on many different factors. A new trace analysis environment provides deeper insight into the client/server behavior and visualizes events of both process types. We investigate the influence of making independent vs. collective calls together with access to contiguous and non-contiguous data regions in our MPI-IO program. Combined client and server traces exhibit reasons for observed I/O performance.
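    To make the compared alternatives concrete, the sketch below (an illustrative assumption, not the paper's benchmark; block length and file name are made up) sets a strided file view for non-contiguous access and shows the independent write call next to its collective counterpart:

      #include <mpi.h>

      #define BLOCK 1024   /* illustrative block length in ints */

      int main(int argc, char **argv)
      {
          MPI_Init(&argc, &argv);
          int rank, size;
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);
          MPI_Comm_size(MPI_COMM_WORLD, &size);

          MPI_File fh;
          MPI_File_open(MPI_COMM_WORLD, "strided.dat",
                        MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

          /* Non-contiguous pattern: each rank owns every size-th block of the file. */
          MPI_Datatype filetype;
          MPI_Type_vector(4, BLOCK, BLOCK * size, MPI_INT, &filetype);
          MPI_Type_commit(&filetype);
          MPI_File_set_view(fh, (MPI_Offset)rank * BLOCK * sizeof(int),
                            MPI_INT, filetype, "native", MPI_INFO_NULL);

          int buf[4 * BLOCK];
          for (int i = 0; i < 4 * BLOCK; i++) buf[i] = rank;

          /* Independent I/O: each process issues its request on its own. */
          MPI_File_write(fh, buf, 4 * BLOCK, MPI_INT, MPI_STATUS_IGNORE);

          /* Collective alternative: all processes call together, which lets the
           * MPI library apply optimizations such as two-phase I/O:
           * MPI_File_write_all(fh, buf, 4 * BLOCK, MPI_INT, MPI_STATUS_IGNORE); */

          MPI_Type_free(&filetype);
          MPI_File_close(&fh);
          MPI_Finalize();
          return 0;
      }

    Combined client and server traces, as produced by the PIOViz Jumpshot enhancement, then reveal how these call variants behave inside the I/O servers.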
  • Performance Evaluation of the PVFS2 Architecture (Julian Kunkel, Thomas Ludwig), In PDP '07: Proceedings of the 15th Euromicro International Conference on Parallel, Distributed and Network-Based Processing, pp. 509–516, IEEE Computer Society (Washington, DC, USA), Euromicro, PDP-07, Napoli, Italy, ISBN: 0-7695-2784-1, 2007
    BibTeX DOI
    Abstract: As the complexity of parallel file systems' software stacks increases, it gets harder to reveal the reasons for performance bottlenecks in these software layers. This paper introduces a method which eliminates the influence of the physical storage on performance analysis in order to find these bottlenecks. Also, the influence of the hardware components on the performance is modeled to estimate the maximum achievable performance of a parallel file system. The paper focuses on the Parallel Virtual File System 2 (PVFS2) and shows results for file creation as well as small and large contiguous I/O requests.
  • Tracing the MPI-IO Calls' Disk Accesses (Thomas Ludwig, Stephan Krempel, Julian Kunkel, Frank Panse, Dulip Withanage), In Recent Advances in Parallel Virtual Machine and Message Passing Interface, Lecture Notes in Computer Science (4192), pp. 322–330, (Editors: Bernd Mohr, Jesper Larsson Träff, Joachim Worringen, Jack Dongarra), Springer (Berlin / Heidelberg, Germany), C&C Research Labs, NEC Europe Ltd., and the Research Centre Jülich, EuroPVM/MPI-06, Bonn, Germany, ISBN: 3-540-39110-X, 2006
    BibTeX URL DOI
    Abstract: With parallel file I/O we are faced with the situation that we do not have appropriate tools to get an insight into the I/O server behavior depending on the I/O calls in the corresponding parallel MPI program. We present an approach that allows us to also get event traces from the I/O server environment and to merge them with the client trace. Corresponding events will be matched and visualized. We integrate this functionality into the parallel file system PVFS2 and the MPICH2 tool Jumpshot.
    Keywords: Performance Analyzer, Parallel I/O, Visualization, Trace-based Tools, PVFS2