HPC-IODC: HPC I/O in the Data Center Workshop
News
2021-12-20: The workshop has been accepted at ISC HPC
Abstract
Managing scientific data at a large scale is challenging for both scientists and the host data centre.
The storage and file systems deployed within a data centre are expected to meet users' requirements for data integrity and high performance across heterogeneous and concurrently running applications.
With new storage technologies and layers in the memory hierarchy, the picture is becoming even murkier. To effectively manage the data load within a data centre, I/O experts must understand how users expect to use the storage and what services they should provide to enhance user productivity.
In this workshop, we bring together I/O experts from data centres and application workflows to share current practices for scientific workflows, issues, and obstacles for both hardware and the software stack, and R&D to overcome these issues. We seek to ensure that a systems-level perspective is included in these discussions.
The workshop content is built on tracks with calls for papers and talks.
We are excited to announce that research papers will be published open access in Springer LNCS, with extended manuscripts appearing in the Journal of High-Performance Storage as well.
Contributions to both tracks are peer-reviewed and require submission of the respective research paper or idea for your presentation via EasyChair (see the complete description in Track: Research Papers).
The workshop is held in conjunction with ISC HPC during the ISC workshop day.
Note that attendance at ISC workshops requires a workshop pass.
See also our last year's workshop web page.
This workshop is powered by the Virtual Institute for I/O, the Journal of High-Performance Storage, and ESiWACE.
Please find the summary of our workshop here.
Organisation
The workshop is organised by
Agenda
You must register to attend the workshop.
We are currently finalizing the agenda.
09:00
Welcome –
Julian Kunkel –
Slides
09:15 Speed introduction and goal formulation – Every attendee is invited to introduce themselves with 1-2 sentences regarding their key interests in today's workshop
09:30 Systems –
09:30
Extreme-scale I/O and Storage Resources in Heterogeneous Modular Supercomputing Architectures –
Sarah Neuwirth –
Slides
Emerging from the DEEP project series, the Modular Supercomputing Architecture (MSA) concept breaks with the traditional HPC system architecture approach (based on replicating many identical compute nodes, possibly integrating heterogeneous processing resources within each node) by integrating heterogeneous computing resources in a modular way at the system level. An MSA system consists of a collection of modules with different architecture and/or performance characteristics, each of them being a homogeneous cluster of potentially large size. This approach brings substantial benefits for heterogeneous applications and workflows since each part can be run on exactly matching computing resources, thereby improving time to solution and energy use. In this talk, the MSA ecosystem at the Jülich Supercomputing Centre for supporting extreme-scale application I/O and data-centric workflows will be highlighted. First, the latest hardware infrastructure, including the Scalable Storage Service Module (SSSM), the All-Flash Storage Module (AFSM) and Network-Attached Memory (NAM), will be introduced. This discussion will be complemented by an overview of the current status of MSA's software infrastructure to support parallel I/O, data-centric workflows, and resiliency. The talk will conclude with an overview of ongoing EU research projects involving MSA and future research directions, including managing data heterogeneity, data scalability, and data placement.
10:00
Challenges through the demands of HPC and AI –
Sebastian Krey –
Slides
In this talk, we present the storage architecture at GWDG and how we aim to handle the upcoming challenges created by the increasing adoption of AI methods. At GWDG we operate more than 20 PB of high-performance storage (Lustre, BeeGFS, IME), enterprise storage (Spectrum Scale, StorNext, NetApp) with a total size of around 15 PB, and an increasing amount (approaching 30 PB) of object storage (mainly Ceph). Modern AI workflows with hundreds of GPUs have huge IOPS requirements, which can only be satisfied by high-performance all-flash storage systems. At the same time, the required amount of storage space grows so fast (especially from high-resolution measurement systems) that it is impossible, from an economic point of view, to store everything on the high-performance storage systems. This means a proper data management plan, with well-designed workflows between data ingest, pre-processing, compute, post-processing, and archival of results, is of increasing importance.
10:30
Making storage more secure –
Chris J. Newburn –
Slides
We used to trust people. That's getting harder to do as the value of computing resources and data grows, and as the likelihood increases that you'll need to protect your storage data from attack, or at least be able to demonstrate reasonable protection (e.g., for HIPAA or GDPR). The compute nodes that issue requests to data filers can no longer be trusted, since applications may break out of their containers and root the node. This has implications for storage. Isolation among users and tenants in shared file systems is the next phase of data isolation that's part of the drive toward zero trust in modern data centers.
In this talk, we'll look at how fresh DPU hardware and innovations in managing security impact the data center. We'll enumerate a set of attacks and show how they can be mitigated by shifting work out of the untrusted compute nodes and into the more-trusted infrastructure control plane. We'll evaluate the benefits of doing that with DPUs and show how applying techniques on the filer end can complement what you can do with a DPU. We'll highlight areas where standardization is lacking and community involvement may help. Finally, we'll show how this could be integrated into NVIDIA's data center management software for an end-to-end solution.
Bio: Chris J. Newburn, who goes by CJ, is a Principal Architect who drives HPC strategy and the SW product roadmap in NVIDIA Compute Software, with a special focus on data center architecture and security, IO, systems, and programming models for scale. CJ has contributed to both hardware and software technologies for the past 30 years and has over 100 patents. He is a community builder with a passion for extending the core capabilities of hardware and software platforms from HPC into AI, data science, and visualization. He's delighted to have worked on volume products that his Mom used and that help researchers do their life's work in science that previously wasn't possible.
11:00 Coffee break
11:30 Expert talks –
11:30
DASI: decoupling data collocation and storage performance using scientifically meaningful object identifiers –
Olivier Iffrig, James Hawkes, Simon Smart and Tiago Quintino –
Slides
Storing scientific data on HPC systems is traditionally done by encoding chosen parameters into the file and directory structure of a data repository. This has two major drawbacks. First, a compromise between the accuracy of the description and the complexity of the naming convention must be made, at the expense of either uniqueness or readability of the names. Second, the directory hierarchy often reflects a combination of the scientific parameter taxonomy and the I/O constraints on the storage system, meaning that storage performance is coupled to data organisation. We present an alternative approach that provides semantic access to data using scientifically meaningful keywords, while abstracting storage constraints from the scientific workflow. The Data Access and Storage Interface (DASI) is a software library developed within the IO-SEA EuroHPC project that indexes data with a one-to-one correspondence to scientific metadata, allowing efficient use of the underlying storage backend, be it a traditional parallel file system or an object store. It builds upon years of experience in data storage at ECMWF using a domain-specific data description.
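To illustrate the idea of semantic, metadata-keyed access, here is a minimal Python sketch. It is not the DASI API; the keys and the in-memory backend are purely hypothetical and only show how scientifically meaningful keywords can replace file paths while the storage backend stays hidden behind the interface.
```python
# Illustrative sketch of semantic, metadata-keyed data access. This is NOT the DASI API;
# it only shows the idea of addressing data by scientifically meaningful keys instead of
# file paths, with the storage backend (here a plain dictionary) hidden behind the interface.
class SemanticStore:
    def __init__(self):
        self._index = {}  # maps canonicalised metadata -> stored payload

    @staticmethod
    def _key(metadata):
        # Canonicalise the metadata dictionary into a hashable, order-independent key.
        return tuple(sorted(metadata.items()))

    def archive(self, metadata, payload):
        # A real implementation would write to a parallel file system or object store.
        self._index[self._key(metadata)] = payload

    def retrieve(self, metadata):
        return self._index[self._key(metadata)]


store = SemanticStore()
key = {"experiment": "forecast", "date": "2022-05-01", "step": 6, "param": "temperature"}
store.archive(key, b"...field data...")
field = store.retrieve(key)
```
Because the lookup depends only on the metadata keys, the layout underneath (directory hierarchy, object store, tape) can change without touching the scientific workflow, which is the decoupling the talk describes.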
12:00
Supporting malleability in the GekkoFS distributed burst buffer file system –
Marc-André Vef –
Slides
The growing need to process huge data sets is one of the main drivers for building exascale HPC systems. However, the flat storage hierarchies found in classic HPC architectures no longer satisfy the performance requirements of data-processing applications. Uncoordinated file access and limited bandwidth can make the centralized back-end parallel file system a serious bottleneck. At the same time, emerging multi-tier storage hierarchies come with the potential to remove this barrier. However, maximizing performance requires careful control to avoid congestion and balance computational with storage performance. As of now, appropriate interfaces and policies for managing such an enhanced I/O stack are still lacking.
In this talk, we will introduce the ADMIRE EuroHPC project, which aims to establish this control by creating an active I/O stack that dynamically adjusts computation and storage requirements through intelligent global coordination, malleability of computation and I/O, and scheduling of storage resources along all levels of the storage hierarchy. We will further discuss one of its main components to accelerate I/O performance — the GekkoFS distributed burst buffer file system — its role within the ADMIRE ecosystem, and how it will be extended in the course of the project to support a malleable storage system that can dynamically react to the requirements of the ADMIRE I/O stack.
12:30
Discussion –
IO500: Emerging Access Patterns and Features –
Julian Kunkel –
Slides
This session first introduces the access patterns available in the IO500 standard and extended modes. Then these patterns and other feature requests (such as GPUDirect) are discussed with the community.
13:00 Lunch break
14:00 Performance –
14:00
Phobos: an open-source scalable object store optimized for tapes (and more…) –
Thomas Leibovici, Philippe Deniel –
Slides
Phobos is an open-source parallel object store designed to manage large volumes of data. It can manage various kinds of storage, from SSD devices to tape libraries.
Phobos is developed at CEA, where it has been in production since 2016 to manage the many petabytes of the France Genomique dataset, hosted in the TGCC compute center. Very large datasets are handled efficiently on inexpensive media without sacrificing scalability, performance, or fault-tolerance requirements. Phobos is designed to offer different layouts, such as mirrored double write and erasure coding. I/O on magnetic tapes is optimized through the use of dedicated scheduling policies applied during the allocation of storage resources.
Phobos natively supports the control of tape libraries via SCSI and relies on well-known standards (such as writing to tapes via LTFS) to avoid any dependency on a proprietary format. It provides several interfaces, including S3 and the possibility of being an HSM backend for Lustre in a Lustre/HSM configuration. Its API also makes it easy to add other front-ends, including an interface to present data as a POSIX filesystem (under development).
This presentation covers the design of Phobos, which is intended for use in an exascale context, as well as some future use cases it can address effectively.
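Since Phobos exposes an S3 interface among its front-ends, a client interaction could look like the following hedged sketch using boto3. The endpoint URL, bucket name, object key, and credentials are placeholders for illustration and are not actual Phobos defaults.
```python
# Minimal sketch of talking to an S3-compatible front-end such as the one Phobos exposes.
# The endpoint URL, bucket name, object key, and credentials are placeholders for this
# illustration and are not actual Phobos defaults.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="http://phobos-gateway.example.org:9000",  # hypothetical gateway address
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
)

# Store an object; the object store decides internally which medium (SSD, tape, ...) it lands on.
with open("reads.bam", "rb") as f:
    s3.put_object(Bucket="genomics", Key="sample-001/reads.bam", Body=f)

# Retrieve it again through the same interface.
obj = s3.get_object(Bucket="genomics", Key="sample-001/reads.bam")
data = obj["Body"].read()
```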
14:30
Characterizing Infiniband routes to support data intensive I/O –
Sebastian Oeste –
Slides
15:00
FUSE-Based File System Monitoring with IOFS –
Marcus Boden, Lars Quentin, Julian Kunkel –
Slides
IOFS is an open-source tool to monitor the I/O patterns of applications. The goal of IOFS is to give users an easy-to-use monitoring solution to debug and evaluate the I/O profile of their applications. IOFS uses FUSE to mount a directory and intercepts all I/O calls made to that directory. The software counts the different calls made to that file system and monitors the sizes of read and write operations. Additionally, timings are taken to measure the latency of the calls. Finally, the metrics are sent to a time-series database such as InfluxDB, where they can be visualized with a frontend such as Grafana. The collected metrics include all individual file system calls, aggregate metadata, and read and write operations. This solution provides detailed performance monitoring of applications without the need for any change in the code, recompilation, or specific libraries. We include Docker recipes for InfluxDB and Grafana as well as templates for visualization with Grafana, which allow users to test and evaluate the tool easily.
To assess the performance penalty introduced by FUSE itself and by intercepting all file system operations, we ran IO500 benchmarks on two different parallel file systems and compared the results when using IOFS to the native ones. The resulting slowdown of more than 50% shows that this is a debugging and performance engineering tool, not a live monitoring tool for production workloads. However, as a debugging solution, it enables users to gain deeper insight into the I/O profile of their applications. We also give an outlook on automatic performance assessment.
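To illustrate the FUSE-interception approach that IOFS builds on (this is not IOFS's own code), a minimal Python sketch using the fusepy bindings could mirror a source directory and count the calls passing through it:
```python
# Minimal sketch of the FUSE-interception idea behind IOFS, using the fusepy bindings
# (pip install fusepy). This is NOT IOFS's own code: it merely mirrors a source directory
# and counts the file system calls passing through it. IOFS additionally records call
# latencies and read/write sizes and exports them to InfluxDB.
import os
import sys
from collections import Counter

from fuse import FUSE, Operations  # provided by the fusepy package


class CountingFS(Operations):
    def __init__(self, root):
        self.root = root
        self.calls = Counter()  # per-operation call counters

    def _full(self, path):
        return os.path.join(self.root, path.lstrip("/"))

    def getattr(self, path, fh=None):
        self.calls["getattr"] += 1
        st = os.lstat(self._full(path))
        keys = ("st_mode", "st_nlink", "st_size", "st_uid", "st_gid",
                "st_atime", "st_mtime", "st_ctime")
        return {k: getattr(st, k) for k in keys}

    def readdir(self, path, fh):
        self.calls["readdir"] += 1
        return [".", ".."] + os.listdir(self._full(path))

    def open(self, path, flags):
        self.calls["open"] += 1
        return os.open(self._full(path), flags)

    def read(self, path, size, offset, fh):
        self.calls["read"] += 1
        os.lseek(fh, offset, os.SEEK_SET)
        return os.read(fh, size)

    def release(self, path, fh):
        self.calls["release"] += 1
        os.close(fh)
        print(dict(self.calls))  # in IOFS, such counters are sent to a time-series database
        return 0


if __name__ == "__main__":
    # Usage: python counting_fs.py <source-dir> <mount-point>
    FUSE(CountingFS(sys.argv[1]), sys.argv[2], foreground=True, nothreads=True)
```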
15:30
A Model of Checkpoint Behavior for Parallel Scientific Applications – (virtual)
Betzabeth Leon, Sandra Mendez, Daniel Franco, Dolores Rexachs, and Emilio Luque –
Slides
Due to the increasing size and complexity of computer systems, reducing the overhead of fault tolerance techniques has become important in recent years. One such technique is checkpointing, which saves a snapshot with the information that has been computed up to a specific moment, suspending the execution of the application and consuming I/O resources and network bandwidth. Characterizing the files generated when checkpointing a parallel application is useful to determine the resources consumed and the impact on the I/O system. It is also important to characterize the application that performs checkpoints; one of these characteristics is whether the application performs I/O. In this paper, we present a model of checkpoint behavior for parallel applications that perform I/O; this depends on the application and on other factors such as the number of processes, the mapping of processes, and the type of I/O used. These characteristics also influence scalability, the resources consumed, and the impact on the I/O system. Our model describes the behavior of the checkpoint size based on the characteristics of the system and the type (or model) of I/O used, such as the number of I/O aggregator processes, the buffer size used by the two-phase I/O optimization technique, and the components of collective file I/O operations. The BT benchmark and FLASH I/O are analyzed under different configurations of aggregator processes and buffer sizes to explain our approach. Experimental results show how these parameters have more impact on the shared memory zone of the checkpoint file.
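As a purely illustrative aside (not the model presented in the paper), characterizing the files left behind by a checkpoint run can start with something as simple as the following Python sketch; the directory name is hypothetical.
```python
# Purely illustrative aside, not the model from the paper: a first step towards characterizing
# the files left behind by a checkpoint run is simply summarising file counts and sizes.
# The directory name "./ckpt" is hypothetical.
import os
from collections import defaultdict

def characterize_checkpoint_dir(path):
    summary = defaultdict(lambda: {"files": 0, "bytes": 0})
    for root, _dirs, files in os.walk(path):
        for name in files:
            suffix = os.path.splitext(name)[1] or "<none>"
            entry = summary[suffix]
            entry["files"] += 1
            entry["bytes"] += os.path.getsize(os.path.join(root, name))
    return dict(summary)

for suffix, stats in characterize_checkpoint_dir("./ckpt").items():
    print(f"{suffix}: {stats['files']} files, {stats['bytes'] / 2**20:.1f} MiB")
```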
16:00 Coffee break
16:30 Student Mentoring Session – chair:
16:30
Benchmarking the I/O performance of the world’s 5th largest CPU-based Supercomputer, ARCHER2 –
Shrey Bhardwaj, Paul Bartholomew and Mark Parsons –
Slides
The ACF operated by EPCC hosts a relatively new HPC machine, ARCHER2, an HPE Cray EX supercomputing system with a total of 750,080 compute cores across 5,860 dual-socket nodes with AMD EPYC 7742 64-core processors. It is currently the 5th largest CPU-based supercomputer and sits at the 22nd spot in the Top500 list. The work file system consists of 14.5 PB of HPE Cray ClusterStor storage connected via the HPE Cray Slingshot interconnect. The system also includes a newly installed 1.1 PB HPE Cray E1000F NVMe burst buffer. A prototype benchmarking application in C, benchmark_c, was created to simulate a typical HPC I/O workload. It writes arrays of configurable dimensions and increasing sizes through different I/O libraries, such as MPI-IO, HDF5, and the ADIOS2 HDF5, BP4, and BP5 I/O engines. The program can benchmark write performance while varying the number of nodes, tasks per node, stripe sizes, and stripe counts, and optionally writing to different output directories such as the NVMe storage on ARCHER2. This analysis can be done for strong and weak scaling of the arrays. The theoretical bandwidth for read/write operations of the Cray ClusterStor storage server is stated as 30 GB/s. The peak I/O bandwidth achieved with this program was 13.3 GB/s when utilising the ADIOS2 BP4 engine running on 2048 MPI ranks. This talk will explore the reasons for the large difference between the theoretical and achieved I/O bandwidths obtained by running benchmark_c on ARCHER2, using various I/O profilers. Additionally, the manufacturer states the maximum NVMe read/write performance as 80/55 GB/s. We expect to gain access to this burst buffer so that benchmarking results for this layer can be presented in the talk.
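The measurement approach resembles the following hedged sketch, written with mpi4py rather than the actual C code of benchmark_c: every rank writes one contiguous block with collective MPI-IO, and the aggregate bandwidth is derived from the slowest rank. The file name and block size are illustrative choices, not taken from benchmark_c.
```python
# Hedged sketch of the kind of measurement benchmark_c performs, written with mpi4py
# rather than the actual C code: every rank writes one contiguous 64 MiB block with
# collective MPI-IO, and the aggregate bandwidth is derived from the slowest rank.
# The file name and block size are illustrative choices, not taken from benchmark_c.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, nprocs = comm.Get_rank(), comm.Get_size()

block = np.full(64 * 2**20 // 8, rank, dtype=np.float64)   # 64 MiB of doubles per rank

fh = MPI.File.Open(comm, "bench.dat", MPI.MODE_CREATE | MPI.MODE_WRONLY)
comm.Barrier()
t0 = MPI.Wtime()
fh.Write_at_all(rank * block.nbytes, block)                 # collective write, one block per rank
fh.Close()
elapsed = comm.allreduce(MPI.Wtime() - t0, op=MPI.MAX)      # slowest rank defines the bandwidth

if rank == 0:
    total_gib = nprocs * block.nbytes / 2**30
    print(f"wrote {total_gib:.1f} GiB in {elapsed:.2f} s -> {total_gib / elapsed:.2f} GiB/s")
```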
17:00
IO500 based Model Exploration: From IO500 List and Clusters Information –
Radita Liem –
Slides
The IO500 benchmark proposes a balanced I/O performance model instead of focusing on one or two metrics that might not be representative of most data centers' usage. In addition to the benchmark suite, the IO500 list provides complementary information that can help data centers see what other data centers are doing and learn from them. In this work, we try to map the information provided by the IO500 list using the Bounding-Box of User Expectation (BBoUE) model, which is based on the IO500 concept, and supplement it with I/O information collected by data centers.
17:30 Discussion
18:00 End
Registration
The workshop will be a full day split into two half-day parts.
The first half-day is an unofficial workshop using the BigBlueButton video conferencing system.
The second half-day is the official ISC-HPC workshop, which uses Zoom.
Attendance of the first half is free. However, registration via the ISC HPC system is mandatory to attend the second half of the workshop. We hope you register for both sessions!
Please register for the session using our form.
For attendance of the afternoon session, please register at ISC HPC.
Program Committee
Thomas Bönisch (HLRS)
Suren Byna (Lawrence Berkeley National Laboratory)
Matthew Curry (Sandia National Laboratories)
Philippe Deniel (CEA)
Sandro Fiore (University of Trento)
Javier Garcia Blas (Carlos III University)
Stefano Gorini (Swiss National Supercomputing Centre)
Adrian Jackson (The University of Edinburgh)
Ivo Jimenez (University of California, Santa Cruz)
Glenn Lockwood (Lawrence Berkeley National Laboratory)
George S. Markomanolis (Oak Ridge National Laboratory)
Sandra Mendez (Barcelona Supercomputing Center (BSC))
Feiyi Wang (Oak Ridge National Laboratory)
Xue Wei (Tsinghua University)
Bing Xie (Oak Ridge National Laboratory)
Participation
The workshop is integrated into ISC-HPC.
We welcome everybody to join the workshop, including:
I/O experts from data centres and industry.
Researchers/Engineers working on high-performance I/O for data centres.
Domain scientists and computer scientists interested in discussing I/O issues.
Vendors are also welcome, but their presentations must align with data centre topics (e.g., how they manage their own clusters) and not focus on commercial aspects.
The call for papers and talks is already open. We accept early submissions and typically proceed with them within 45 days.
We particularly encourage early submission of abstracts to indicate your interest in submitting.
You may be interested in joining our mailing list at the Virtual Institute for I/O.
We especially welcome participants who are willing to give a presentation about the I/O of their institution's data centre.
Note that such presentations should cover the topics mentioned below.
Call for Papers/Contributions (CfP)
Track: Research Papers
The research track accepts papers covering state-of-the-practice and research dedicated to storage in the data centre.
Proceedings will appear in ISC's post-conference workshop proceedings in Springer's LNCS. Extended versions have a chance of acceptance in the first issue of the JHPS journal.
We will apply the more restrictive review criteria from JHPS and use the open workflow of the JHPS journal for managing the proceedings. For interaction, we will rely on EasyChair, so please submit the metadata to EasyChair before the deadline.
For the workshop, we accept papers with up to 12 pages (excluding references) in LNCS format.
You may already submit an extended version suitable for the JHPS in JHPS format. Upon submission, please indicate potential sections for the extended version (e.g., by setting a light red background colour).
There are two JHPS templates, a LaTeX and a Word template.
The JHPS template can easily be converted to the LNCS Word format, so the effort for authors to obtain both publications is minimal. See Springer's Manuscript Preparation, Layout & Templates.
For accepted papers, the length of the talk during the workshop depends on how controversial and novel the approach is (the length is decided based on the preference provided by the authors and feedback from the reviewers).
All relevant work in the area of data centre storage will be published in our joint workshop proceedings. We simply believe the available time is best used to discuss controversial topics.
Topics
The relevant topics for papers cover all aspects of data centre I/O, including:
Application workflows
User productivity and costs
Performance monitoring
Dealing with heterogeneous storage
Data management aspects
Archiving and long-term data management
State-of-the-practice (e.g., using or optimising a storage system for data centre workloads)
Research that tackles data centre I/O challenges
Cloud/Edge storage aspects
Paper Deadlines
2022-03-01: Submission deadline (extended), AoE
2022-04-24: Author notification
2022-05-22: Pre-final submission for ISC (Papers to be shared during the workshop. We will also use the JHPS papers, if available.)
2022-06-02: Workshop (to be announced)
2022-07-02: Camera-ready papers for ISC (needed for ISC's post-conference workshop proceedings; we embrace the opportunity for authors to improve their papers based on the feedback received during the workshop)
2022-08-24: Camera-ready papers for the extended JHPS paper (this depends on the authors' ability to incorporate feedback into their submission in the incubator)
Review Criteria
The main acceptance criterion is the relevance of the approach to be presented, i.e., the core idea is novel and worthwhile to be discussed in the community.
Considering that the camera-ready version of the papers is due after the workshop, we pursue two rounds of reviews:
Acceptance for the workshop (as a talk).
Acceptance as a paper *after* the workshop, incorporating feedback from the workshop.
After the first review, all papers undergo a shepherding process.
The criteria for The Journal of High-Performance Storage are described on its webpage.
Track: Expert Talks
The topics of interest in this track include, but are not limited to:
We also accept industry talks, provided that they focus on operational issues in data centres and omit marketing.
We use EasyChair for managing the interaction with the program committee.
If you are interested in participating, please submit a short (1/2 page) abstract of your intended talk together with a brief bio.
Abstract Deadlines
Content
Where possible, the following items should be integrated into a talk covering your data centre.
We hope your site's administrators will support you in gathering the information with little effort.
Workload characterisation
Scientific Workflow (give a short introduction)
A typical use-case (if multiple are known, feel free to present more)
Involved number of files/amount of data
Job mix
Node utilisation (relative to peak performance)
System view
Architecture
Schema of the client/server infrastructure
Capacities (Tape, Disk, etc.)
Potential peak-performance of the storage
Theoretical
Optional: Performance results of acceptance tests.
Software/Middleware used, e.g. NetCDF 4.X, HDF5, …
Monitoring infrastructure
Tools and systems used to gather and analyse utilisation
Actual observed performance in production
Throughput graphs of the storage (e.g., from Ganglia)
Metadata throughput (Ops/s)
Files on the storage
Number of files (if possible, per file type)
Distribution of file sizes
Issues/Obstacles
Hardware
Software
Pain points (what is seen as the most significant problem(s) and suggested solutions, if known)
Conducted R&D (that aims to mitigate issues)
Future perspective
Known or projected future workload characterisation
Scheduled hardware upgrades and new capabilities we should focus on exploiting as a community
Ideal system characteristics and how they address current problems or challenges
What hardware should be added
What software should be developed to make things work better (capabilities perspective)
Items requiring discussion
Track: Student Mentoring Sessions
To foster the next generation of data-related practitioners and researchers, students are encouraged to submit an abstract following the expert-talk guidelines above, as long as their research is aligned with these topics. At the workshop, the students will be given 10 minutes to talk about what they are working on, followed by 10-15 minutes of conversation with the community present about how to further the work, what the impact could be, alternative research directions, and other topics to help the students progress in their studies. We encourage students to work with a shepherd towards a JHPS paper illustrating their research, based on the feedback obtained during the workshop.