BoF: Analyzing Parallel I/O

Parallel I/O performance can be a critical bottleneck for applications, yet users are often ill-equipped to identify and diagnose I/O performance issues. The increasingly complex hierarchies of storage hardware and software deployed on many systems only compound this problem. Tools that can effectively capture, analyze, and tune I/O behavior on these systems empower users to realize performance gains for many applications.

In this BoF, we form a community around best practices in analyzing parallel I/O and cover recent advances that help address the challenges outlined above, drawing on the expertise of the users, I/O researchers, and administrators in attendance.

The primary objectives of this BoF are to: 1) highlight recent advances in tools and techniques for monitoring I/O activity in data centers, 2) discuss experiences with and limitations of current approaches, and 3) derive a roadmap for future I/O tools aimed at capturing, assessing, predicting, and optimizing I/O.

The BoF is held in conjunction with the SC19 (Supercomputing) conference. The official schedule is listed below.

Date: Wednesday, November 20th, 2019
Time: 5:15 pm – 6:45 pm
Venue: Room 220, Denver, USA

The BoF is powered by the Virtual Institute for I/O and ESiWACE 1).

The BoF is organized by:

The agenda consists of a series of talks followed by a longer discussion:

  • Introduction – Shane Snyder
  • What's new with Darshan – Shane Snyder (a short Darshan usage sketch follows the agenda)
  • HPC Storage as a Blank Canvas in Google Cloud – Dean Hildebrand (Google)
  • Eugen Betke (DKRZ)
  • Kevin Huck (University of Oregon)
  • State of IO profiling in Forge – Florent Lebeau (ARM)
  • Research community I/O patterns – Gordon Gibb (EPCC)
    We have used a combination of Cray LASSi and EPCC SAFE to analyse the I/O profiles of different research communities, based on the I/O activity of all jobs run on ARCHER, the UK National Supercomputing Service, over a period of six months. The patterns reveal the different I/O requirements of the communities and will allow us to design better HPC services in the future.
  • Panel and discussion – moderated by Julian Kunkel
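
For attendees less familiar with the tools named above, the following is a minimal sketch of how a Darshan log might be inspected with the PyDarshan package. The file name example.darshan is a placeholder, and the calls reflect the PyDarshan documentation rather than the BoF material itself, so consult the current release before relying on them.

  # Minimal sketch: inspecting a Darshan I/O characterization log with PyDarshan.
  # "example.darshan" is a placeholder file name, not part of the BoF material.
  import darshan

  report = darshan.DarshanReport("example.darshan", read_all=True)

  # List which instrumentation modules (e.g. POSIX, MPI-IO, STDIO) recorded data.
  print(list(report.modules.keys()))

  # Per-record POSIX counters as pandas DataFrames for further analysis,
  # assuming the log contains POSIX records.
  posix = report.records["POSIX"].to_df()
  print(posix["counters"].head())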

Eugen Betke completed his studies in computer science in 2015, specializing in machine learning and I/O performance. In his master's thesis he applied machine learning methods to predict I/O performance. At the beginning of 2016 he started as a researcher at the German Climate Computing Center (DKRZ). His key areas are the analysis and optimization of HPC I/O; for example, he developed a cluster-wide monitoring system for Lustre on Mistral.

Dr Gordon Gibb is an Applications Consultant at EPCC, The University of Edinburgh. He obtained an MPhys in Astrophysics at the University of St Andrews, where he went on to receive a PhD in Solar Physics, followed by several positions as a research software engineer. At EPCC, he is a member of the computational science and engineering team for ARCHER, the UK's national supercomputing service. His work has included general technical support for ARCHER users and the optimisation and porting of codes to ARCHER; he is also the point of contact for one of the UK's high-end computing consortia.

Dean Hildebrand is a Technical Director for HPC and enterprise storage in the Office of the CTO (OCTO) at Google Cloud. He has authored over 100 scientific publications and patents and is currently focused on making HPC and enterprise storage first-class citizens in the cloud. He received a B.Sc. degree in computer science from the University of British Columbia in 1998, and M.S. and Ph.D. degrees in computer science from the University of Michigan in 2003 and 2007, respectively.

Julian Kunkel is a Lecturer in the Computer Science Department at the University of Reading. He manages several research projects revolving around high-performance computing and particularly high-performance storage. Besides his main goal of providing efficient and performance-portable I/O, his HPC-related interests include data reduction techniques, performance analysis of parallel applications and parallel I/O, management of cluster systems, cost-efficiency considerations, and software engineering of scientific software.

Shane Snyder is a software engineer in the Mathematics and Computer Science Division of Argonne National Laboratory. He received his master's degree in computer engineering from Clemson University in 2013. His research interests primarily include the design of high-performance distributed storage systems and the characterization and analysis of I/O workloads on production HPC systems.

1)
ESiWACE has received funding from the European Union's Horizon 2020 Research and Innovation Programme under Grant Agreement No. 675191.