HPC-IODC: HPC I/O in the Data Center Workshop

News

Due to COVID-19, the workshop will be fully virtual. The workshop will be a full day split into two half-day parts.

The first half-day is an unofficial workshop free of charge. Please connect here
The second half-day is the official ISC-HPC workshop.

Abstract

Managing scientific data at a large scale is challenging for both scientists and the host data centre.

The storage and file systems deployed within a data centre are expected to meet users' requirements for data integrity and high performance across heterogeneous and concurrently running applications.

With new storage technologies and layers in the memory hierarchy, the picture is becoming even murkier. To effectively manage the data load within a data centre, I/O experts must understand how users expect to use the storage and what services they should provide to enhance user productivity.

In this workshop, we bring together I/O experts from data centres and application workflows to share current practices for scientific workflows, issues, and obstacles for both hardware and the software stack, and R&D to overcome these issues. We seek to ensure that a systems-level perspective is included in these discussions.

The workshop content is built on the tracks with calls for papers/talks:

Research paper track – Requesting submissions regarding state-of-the-practice and research about I/O in the data centre (see our topic list).
Talks from I/O experts – Requesting submissions of talks.
Student Mentoring Sessions

We are excited to announce that research papers will be published open access in Springer LNCS, and extended manuscripts in the Journal of High-Performance Storage as well. Contributions to both tracks are peer-reviewed and require submission of the respective research paper or idea for your presentation via EasyChair (see the complete description in Track: Research Papers).

The workshop is held in conjunction with ISC-HPC during the ISC workshop day. Note that attendance at ISC workshops requires a workshop pass. See also last year's workshop web page.

Date		Friday, July 2nd, 2021
Venue		Virtual
Contact		Dr. Julian Kunkel

This workshop is powered by the Virtual Institute for I/O, the Journal of High-Performance Storage, and ESiWACE ¹⁾.

Please find the summary of our workshop here.

Organisation

The workshop is organised by

Julian Kunkel (Department of Computer Science, University of Reading, UK), j.m.kunkel@reading.ac.uk
Jay Lofstead (Sandia National Lab, USA), gflofst@sandia.gov
Jean-Thomas Acquaviva (DDN, France), jtacquaviva@ddn.com

Agenda

The workshop will be a full day split into two half-day parts. You must register to attend the workshop.

Please see our workshop summary paper from last year.

Times are listed in CEST (GMT+2); subtract 7 hours for US Central (CDT).
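The seven-hour difference between CEST and CDT can be checked with a minimal sketch like the one below. It assumes fixed UTC offsets that hold on the workshop date (summer time in both zones); the function name `cest_to_cdt` is hypothetical and not part of any workshop tooling.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets, assumed valid for 2 July 2021 (summer time in both zones).
CEST = timezone(timedelta(hours=2), "CEST")
CDT = timezone(timedelta(hours=-5), "CDT")


def cest_to_cdt(hhmm: str) -> str:
    """Convert an agenda time given as HH:MM in CEST to HH:MM in CDT."""
    hour, minute = map(int, hhmm.split(":"))
    start = datetime(2021, 7, 2, hour, minute, tzinfo=CEST)
    return start.astimezone(CDT).strftime("%H:%M")


if __name__ == "__main__":
    # Print a few agenda slots in both time zones.
    for slot in ("08:55", "14:00", "18:15"):
        print(slot, "CEST =", cest_to_cdt(slot), "CDT")
```

For other time zones, only the second offset would need to change.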

Morning session

The unofficial morning session will use Big Blue Button as a video conferencing system.

08:55 Welcome
Slides - Video
09:10 Research session (chair Jay Lofstead)
- 09:10 H3: An Application-Level, Low-Overhead Object Store – Antony Chazapis, Efstratios Politis, Giorgos Kalaentzis, Christos Kozanitis and Angelos Bilas
  Slides - Video
- 09:40 A Workflow for Identifying Jobs with Similar I/O Behavior Utilizing Time Series Analysis – Julian Kunkel, Eugen Betke
  Slides - Video
- 10:10 Analyzing the I/O patterns of Deep Learning Applications – Sandra Mendez, Edixon Párraga, Betzabeth León, Román Bond, Diego Encinas, Aprigio Bezerra, Dolores Rexachs and Emilio Luque
  Slides - Video
10:30 Virtual coffee break with peer discussion
11:00 Expert talks (chair Jean-Thomas)
- 11:00 The CERN Unified Environment – Dan van der Ster and Jakub T. Moscicki (CERN)
  Slides - Video
- 11:30 Optimising performance through data localisation – Adrian Jackson (EPCC)
  Abstract: The availability of high performance storage devices, NVMe and NVRAM, provides the potential for high performance local job storage within compute nodes. This could be utilised as burst buffer/temporary storage, through ephemeral job-local filesystems, or as a long term storage target. In this talk we outline our experiences with utilising such storage for applications, the performance that can be achieved, and the trade-offs involved.
  Slides - Video
- 12:00 Studying the Elbencho benchmark on multiple GPU architectures – George Markomanolis (CSC)
  Abstract: Elbencho is a GPU-oriented distributed storage benchmark for file systems and block devices. In this short talk, we describe the procedure for porting the benchmark to AMD GPUs, illustrate what worked and what failed, and analyze some performance aspects. We profile the benchmark for the first time on such hardware to better understand how it performs and to identify differences from the NVIDIA architecture.
  Slides - Video
- 12:30 MeluXina - a new generation supercomputer – Valentin Plugaru (CTO LuxProvide)
  Abstract: LuxProvide is home to the MeluXina supercomputer, built as one of the new generation of European supercomputers and part of the EuroHPC network. MeluXina is designed as a modular system to offer world-class HPC, HPDA and AI services for a wide variety of workloads and application domains. This talk will focus on MeluXina's architecture and technologies, software ecosystem and platform services. We hope to be able to present some of the results coming out of system acceptance processes, as well as from early access demonstrators run during MeluXina's first weeks in production.
  Slides - Video

13:00 Lunch break
13:45 Migration to Zoom

Afternoon session

The official ISC-HPC session uses Zoom as the video conferencing system.

14:00 Welcome
14:05 Panel: The impact of HPC and Cloud convergence on storage (chair Jean-Thomas)
The convergence of HPC and Cloud influences how HPC workflows are executed and, moreover, how data is managed. In this panel, speakers from academia and industry discuss the approaches, challenges, and future prospects, focusing on the storage perspective. The panelists represent the whole value chain, ranging from users to storage vendors. The panel is organized as a brief round of talks (10 min each) followed by a 30-minute panel discussion involving the audience.

Speakers – Video
- Bruno Silva (Amazon Web Services)
  Slides
- Richard Lawrence (MetOffice)
  Slides
- Bingsheng He (National University of Singapore)
  Slides
- Alberto Scionti (LINKS Foundation)
  Slides
- Vasileios Baousis (ECMWF)
  Slides

15:30 Student mentoring session (chair Jay Lofstead) - Video
- 15:30 Smart Mapping of Scientific Workflows on to Heterogeneous Hardware Resources – Erdem Yilmaz
  Slides
- 15:45 TeraCache: Efficient Caching over Fast Storage Devices – Iacovos G. Kolokasis, Angelos Bilas
  Slides
- 16:00 HPC and Cloud Convergence – Frank Gadban
  Slides
- 16:15 Profiling the I/O Software Stack for Distributed Storage Systems – Luke Logan, Anthony Kougkas and Xian-He Sun
  Slides
16:30 Virtual coffee break
17:00 Expert talks (chair Julian Kunkel)
- Architecture and Performance of the Perlmutter 35 PB All-NVMe Lustre File System at NERSC – Glenn Lockwood, Alberto Chiusole, Lisa Gerhardt, David Paul and Nicholas Wright
  Slides - Video
17:30 Discussion
18:15 End

Registration

The workshop will be a full day split into two half-day parts. The first half-day is an unofficial workshop using the Big Blue Button video conferencing system. The second half-day is the official ISC-HPC workshop, which uses Zoom. Attendance of the first half is free. However, registration via the ISC-HPC system is mandatory to attend the second half of the workshop. We hope you register for both sessions!

Please register for the morning session using our form. For attendance of the afternoon session, please register at ISC-HPC.

Program Committee

Thomas Boenisch (High-Performance Computing Center Stuttgart)
Suren Byna (Lawrence Berkeley National Laboratory)
Matthew Curry (Sandia National Laboratories)
Philippe Deniel (CEA)
Sandro Fiore (University of Trento)
Wolfgang Frings (Juelich Supercomputing Centre)
Javier Garcia Blas (Carlos III University)
Stefano Gorini (Swiss National Supercomputing Centre)
Adrian Jackson (The University of Edinburgh)
Ivo Jimenez (University of California, Santa Cruz)
Anthony Kougkas (Illinois Institute of Technology)
Glenn Lockwood (Lawrence Berkeley National Laboratory)
Carlos Maltzahn (University of California, Santa Cruz)
George S. Markomanolis (Oak Ridge National Laboratory)
Sandra Mendez (Barcelona Supercomputing Center (BSC))
Robert Ross (Argonne National Laboratory)
Feiyi Wang (Oak Ridge National Laboratory)
Xue Wei (Tsinghua University)
Bing Xie (Oak Ridge National Lab)

Participation

The workshop is integrated into ISC-HPC. We welcome everybody to join the workshop, including:

I/O experts from data centres and industry.
Researchers/Engineers working on high-performance I/O for data centres.
Domain scientists and computer scientists interested in discussing I/O issues.
Vendors are also welcome, but their presentations must align with data centre topics (e.g., how they manage their own clusters) and not focus on commercial aspects.

The call for papers and talks is already open. We accept early submissions and typically review them within 45 days. We particularly encourage early submission of abstracts to indicate your interest in submitting.

You may be interested in joining our mailing list at the Virtual Institute for I/O.

We especially welcome participants who are willing to give a presentation about the I/O of their institution's data centre. Note that such presentations should cover the topics mentioned below.

Call for Papers/Contributions (CfP)

Track: Research Papers

The research track accepts papers covering state-of-the-practice and research dedicated to storage in the data centre.

Proceedings will appear in ISC's post-conference workshop proceedings in Springer's LNCS. Extended versions have a chance of acceptance in the first issue of the JHPS journal. We will apply the more restrictive review criteria from JHPS and use the open workflow of the JHPS journal for managing the proceedings. For interaction, we will rely on EasyChair, so please submit the metadata to EasyChair before the deadline.

For the workshop, we accept papers of up to 12 pages (excluding references) in LNCS format. You may already submit an extended version suitable for the JHPS in JHPS format. Upon submission, please indicate potential sections for the extended version by setting a light red background colour. There are two JHPS templates, a LaTeX and a Word template. The JHPS template can be easily converted to the LNCS Word format, so authors can obtain both publications with minimal effort. See the Manuscript Preparation, Layout & Templates, Springer.

For accepted papers, the length of the talk during the workshop depends on how controversial and novel the approach is (the length is decided based on the preference provided by the authors and feedback from the reviewers). All relevant work in the area of data centre storage will be published in our joint workshop proceedings. We simply believe the available time is best used to discuss controversial topics.

Topics

The relevant topics for papers cover all aspects of data centre I/O, including:

Application workflows
User productivity and costs
Performance monitoring
Dealing with heterogeneous storage
Data management aspects
Archiving and long term data management
State-of-the-practice (e.g., using or optimising a storage system for data centre workloads)
Research that tackles data centre I/O challenges

Paper Deadlines

2021-02-24: Submission deadline: AoE ²⁾
2021-03-24: Submission deadline extended: AoE ³⁾
- Note: The call for papers and talks is already open.
- We appreciate early submissions of abstracts and full papers and review them within 45 days.
2021-04-24: Author notification
2021-06-10: Pre-final submission for ISC (Papers to be shared during the workshop. We will also use the JHPS papers, if available.)
2021-07-02: Workshop (to be announced)
2021-07-24: Camera-ready papers for ISC ⁴⁾ – As they are needed for ISC's post-conference workshop proceedings. We embrace the opportunity for authors to improve their papers based on the feedback received during the workshop.
2021-08-24: Camera-ready papers for the extended JHPS paper (It depends on the author's ability to incorporate feedback into their submission in the incubator.)

Review Criteria

The main acceptance criterion is the relevance of the approach to be presented, i.e., the core idea is novel and worth discussing in the community. Considering that the camera-ready version of the papers is due after the workshop, we pursue two rounds of reviews:

Acceptance for the workshop (as a talk).
Acceptance as a paper *after* the workshop, incorporating feedback from the workshop.

After the first review, all papers undergo a shepherding process.

The criteria for The Journal of High-Performance Storage are described on its webpage.

Track: Expert Talks

The topics of interest in this track include, but are not limited to:

A description of the operational aspects of your data centre
A particular solution for specific data centre workloads in production

We also accept industry talks, provided that they focus on operational issues in data centres and omit marketing.

We use EasyChair for managing the interaction with the program committee. If you are interested in participating, please submit a short (half-page) abstract of your intended talk together with a brief bio.

Abstract Deadlines

Submission deadline: 2021-04-24 AoE
Author notification: 2021-05-03

Content

Where possible, a talk covering your data centre should integrate the following items. We hope your site's administrator will help you gather the information with little effort.

Workload characterisation
1. Scientific Workflow (give a short introduction)
  1. A typical use-case (if multiple are known, feel free to present more)
  2. Involved number of files/amount of data
2. Job mix
  1. Node utilisation (related to peak-performance)
System view
1. Architecture
  1. Schema of the client/server infrastructure
    1. Capacities (Tape, Disk, etc.)
  2. Potential peak-performance of the storage
    1. Theoretical
    2. Optional: Performance results of acceptance tests.
  3. Software/Middleware used, e.g. NetCDF 4.X, HDF5, …
2. Monitoring infrastructure
  1. Tools and systems used to gather and analyse utilisation
3. Actual observed performance in production
  1. Throughput graphs of the storage (e.g., from Ganglia)
  2. Metadata throughput (Ops/s)
4. Files on the storage
  1. Number of files (if possible, per file type)
  2. Distribution of file sizes
Issues/Obstacles
1. Hardware
2. Software
3. Pain points (what is seen as the most significant problem(s) and suggested solutions, if known)
Conducted R&D (that aims to mitigate issues)
1. Future perspective
2. Known or projected future workload characterisation
3. Scheduled hardware upgrades and new capabilities we should focus on exploiting as a community
4. Ideal system characteristics and how it addresses current problems or challenges
5. What hardware should be added
6. What software should be developed to make things work better (capabilities perspective)
7. Items requiring discussion
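For the "Files on the storage" items in the list above (number of files and distribution of file sizes), the following is a minimal sketch of how such a distribution could be gathered. It is an illustration under stated assumptions, not a tool used by the workshop: the names `size_bucket`, `histogram`, and `scan_sizes` are hypothetical, and on a large parallel file system a site would more likely rely on the file system's own scanning or policy tools than on a plain directory walk.

```python
import os
from collections import Counter


def size_bucket(size_bytes: int) -> int:
    """Return k such that 2**k <= size < 2**(k+1); empty files map to -1."""
    return size_bytes.bit_length() - 1


def histogram(sizes):
    """Count how many file sizes fall into each power-of-two bucket."""
    return Counter(size_bucket(s) for s in sizes)


def scan_sizes(root):
    """Yield sizes of directory entries listed as files under root.

    Hypothetical helper: symlinks are not followed, and entries that
    vanish or cannot be read during the walk are skipped.
    """
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                yield os.lstat(os.path.join(dirpath, name)).st_size
            except OSError:
                continue


if __name__ == "__main__":
    # Scan the current directory and print one line per bucket.
    for k, count in sorted(histogram(scan_sizes(".")).items()):
        label = "empty" if k < 0 else f">= 2^{k} B"
        print(f"{label:>12}: {count}")
```

Power-of-two buckets keep the output compact across many orders of magnitude, which suits a single slide; per-file-type counts could be added by keying the counter on the file extension as well.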

Track: Student Mentoring Sessions

To foster the next generation of data-related practitioners and researchers, students are encouraged to submit an abstract following the expert talk guidelines above, provided their research is aligned with these topics. At the workshop, the students will be given 10 minutes to talk about what they are working on, followed by 10-15 minutes of conversation with the community present about how to further the work, what the impact could be, alternative research directions, and other topics to help the students progress in their studies. We encourage students to work with a shepherd towards a JHPS paper illustrating their research based on the feedback obtained during the workshop.

¹⁾

ESiWACE is funded by the European Union’s Horizon 2020 research and innovation programme under grant agreement No. 823988.

²⁾ , ³⁾

Anywhere on Earth

⁴⁾

tentative
