BoF: Analyzing Parallel I/O

Parallel I/O performance can be a critical bottleneck for applications, yet users are often ill-equipped to identify and diagnose I/O performance issues. The increasingly complex hierarchies of storage hardware and software deployed on many systems only compound this problem. Tools that can effectively capture, analyze, and tune I/O behavior on these systems empower users to realize performance gains across many applications.

In this BoF, we aim to form a community around best practices in analyzing parallel I/O and to cover recent advances that help address these problems, drawing on the expertise of the users, I/O researchers, and administrators in attendance.

The primary objectives of this BoF are to: 1) highlight recent advances in tools and techniques for monitoring I/O activity in data centers, 2) discuss experiences with and limitations of current approaches, and 3) derive a roadmap for future I/O tools, with the goal of capturing, assessing, predicting, and optimizing I/O.

The BoF is held in conjunction with the Supercomputing (SC19) conference. The official schedule is listed below.

Date: Wednesday, November 20th, 2019
Time: 5:15 pm – 6:45 pm
Venue: Room 220, Denver, USA

The BoF is powered by the Virtual Institute for I/O and ESiWACE 1).

The BoF is organized by

The agenda consists of a series of short (10-minute) talks followed by a longer discussion:

  • Introduction – Shane Snyder
    Slides
  • What's new with Darshan – Shane Snyder
  • HPC Storage as a Blank Canvas in Google Cloud – Dean Hildebrand (Google)
    Slides
  • Timeline-based I/O Behavior Assessment of Parallel Jobs – Eugen Betke (DKRZ)
    Slides
  • Measuring I/O with TAU – Kevin Huck (University of Oregon)
    Slides
  • State of IO profiling in Forge – Florent Lebeau (ARM)
    Slides
  • Research community I/O patterns – Gordon Gibb (EPCC)
    Slides
    We have used a combination of Cray LASSi and EPCC SAFE to analyse the I/O profiles of different research communities, based on all jobs run on the UK National Supercomputing Service, ARCHER, over a period of six months. The patterns reveal the different I/O requirements of the communities and will allow us to design better HPC services in the future.
  • Tracking User-Perceived I/O Slowdown via Probing – Julian Kunkel
    Slides
  • Panel and discussion – (moderated by Julian Kunkel)
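To illustrate the probing idea behind the last talk: a small, regularly issued I/O request can serve as a canary, since its latency reflects the slowdown a user would perceive on a loaded file system. The following is a minimal sketch of such a probe, not the tool presented in the talk; the function name and parameters are illustrative.

```python
import os
import time


def probe_io_latency(path, size=1 << 20):
    """Issue one small write and one read against `path` and return
    the observed (write_seconds, read_seconds) latencies."""
    data = os.urandom(size)
    fname = os.path.join(path, "io_probe.tmp")

    # Time a synchronous write: flush + fsync so the probe measures
    # the storage system, not just the page cache.
    t0 = time.perf_counter()
    with open(fname, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())
    write_s = time.perf_counter() - t0

    # Time reading the probe file back.
    t0 = time.perf_counter()
    with open(fname, "rb") as f:
        f.read()
    read_s = time.perf_counter() - t0

    os.remove(fname)
    return write_s, read_s
```

Run periodically (e.g. once a minute from a cron job) against the parallel file system, spikes in these latencies indicate user-perceived slowdown that can be correlated with server-side monitoring.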

Eugen Betke completed his studies in computer science in 2015, specializing in machine learning and I/O performance. In his master's thesis he applied machine learning methods to predict I/O performance. At the beginning of 2016 he joined the German Climate Computing Center (DKRZ) as a researcher. His key areas are the analysis and optimization of HPC I/O; for example, he developed a cluster-wide monitoring system for Lustre on Mistral.

Dr Gordon Gibb is an Applications Consultant at EPCC, the University of Edinburgh. He obtained an MPhys in Astrophysics from the University of St Andrews, where he went on to receive a PhD in Solar Physics, followed by several positions as a research software engineer. At EPCC, he is a member of the computer science and engineering team for ARCHER, the UK's national supercomputing service. His work has included general technical support for ARCHER users and the optimisation and porting of codes to ARCHER, and he is the point of contact for one of the UK's high-end computing consortia.

Dean Hildebrand is a Technical Director for HPC and enterprise storage in the Office of the CTO (OCTO) at Google Cloud. He has authored over 100 scientific publications and patents and is currently focused on making HPC and enterprise storage first-class citizens in the cloud. He received a B.Sc. degree in computer science from the University of British Columbia in 1998 and M.S. and Ph.D. degrees in computer science from the University of Michigan in 2003 and 2007, respectively.

Kevin Huck is Research Faculty and a Computer Scientist in the Oregon Advanced Computing Institute for Science and Society at the University of Oregon. He was awarded his PhD (2009) in Computer and Information Science from the University of Oregon. Previously, he worked in various private-industry efforts and as a postdoc at the Barcelona Supercomputing Center (2009-2011). He works primarily in the area of large-scale parallel performance measurement, analysis, and visualization. He is the creator and primary developer of APEX (Autonomic Performance Environment for eXascale), a measurement and feedback-control infrastructure for asynchronous, user-level threading runtimes like OpenMP and HPX, and of PerfExplorer, a data-mining framework for large-scale parallel performance analysis. His other research interests include application and workflow performance measurement, analysis, aggregation, and visualization, as well as lightweight measurement, dynamic runtime optimization, and feedback/control systems for asynchronous multitasking runtimes.

Julian Kunkel is a Lecturer at the Computer Science Department at the University of Reading. He manages several research projects revolving around High-Performance Computing and particularly high-performance storage. Besides his main goal to provide efficient and performance-portable I/O, his HPC-related interests are: data reduction techniques, performance analysis of parallel applications and parallel I/O, management of cluster systems, cost-efficiency considerations, and software engineering of scientific software.

Florent Lebeau is a solution architect at Arm, providing customer training across the broad range of debugging, profiling, and optimization tools. Having worked in HPC for many years, he brings valuable knowledge and experience in the practical use of parallel programming and development tools. He joined the Arm HPC Tools team after working as an engineer for Allinea Software and at CAPS enterprise, where he developed profiling tools for HMPP Workbench and provided training on parallel technologies. Florent graduated from the University of Dundee with an MSc in Applied Computing.

Shane Snyder is a software engineer in the Mathematics and Computer Science Division of Argonne National Laboratory. He received his master's degree in computer engineering from Clemson University in 2013. His research interests primarily include the design of high-performance distributed storage systems and the characterization and analysis of I/O workloads on production HPC systems.

1)
ESiWACE has received funding from the European Union’s Horizon 2020 Research and Innovation Programme under Grant Agreement No 675191