  * 11:30 **Research talks** -- chair Julian Kunkel
    * **A Reinforcement Learning Strategy to Tune Request Scheduling at the I/O Forwarding Layer** -- Jean Luca Bez, Francieli Zanon Boito, Ramon Nou, Alberto Miranda, Toni Cortes, Philippe O. A. Navaux \\ //I/O optimization techniques can improve performance for the access patterns they were designed to target, but they often decrease performance for others. Moreover, these techniques usually depend on precise tuning of their parameters, a task that commonly falls to the users. We propose an approach to tune parameters dynamically at runtime based on the I/O workload observed by the system. We focus on the I/O forwarding layer as it is transparent to applications and file-system independent. Our approach uses a reinforcement learning technique so that the system learns the best parameter value for each observed access pattern during its execution, eliminating the need for a complex and time-consuming training phase. We evaluate our proposal with the TWINS scheduling algorithm, designed for the I/O forwarding layer to reduce contention and coordinate accesses to the data servers. We demonstrate that our approach reaches 88% precision in parameter selection within the first hundreds of observations of an access pattern, achieving 99% of the optimal performance.// (An illustrative sketch of such runtime tuning follows this session block.)
    * **Data Systems at Scale in Climate and Weather: Activities in the ESiWACE Project** -- Julian Kunkel (University of Reading) \\ The ESiWACE project aims to enable global eddy-resolving weather and climate simulations on the upcoming (pre-)Exascale supercomputers. In this talk, a selection of efforts to mitigate the effects of the data deluge from such high-resolution simulations is introduced. In particular, we describe the advances in the Earth System Data Middleware (ESDM), which enables scalable data management and supports the inhomogeneous storage stack. ESDM provides a NetCDF-compatible layer in a high-performance and portable fashion. A selection of performance results is given, and ongoing efforts for workflow support and active storage are discussed.
    * **Phobos, a scale-out object store implementing tape library support** -- Patrice Lucas (CEA), __Philippe Deniel (CEA)__, Thomas Leibovici (CEA) \\ Phobos is an open-source scale-out distributed object store providing access to multiple backends, from flash and hard drives to tape libraries. Very large datasets can be efficiently managed on inexpensive storage media without giving up performance, scalability or fault tolerance. Phobos is designed to offer several data layouts, such as mirroring or erasure coding. I/Os through tape drives are optimized by dedicated resource scheduling policies. Developed at CEA, Phobos has been in production since 2016, managing the France Genomique multi-petabyte dataset at TGCC.
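To illustrate the runtime-tuning idea from the reinforcement-learning talk above, here is a minimal, self-contained sketch. It is not the authors' implementation: it uses a simple epsilon-greedy bandit per observed access pattern to pick a value for a hypothetical forwarding-layer parameter (standing in for a TWINS window size) from noisy performance feedback, conveying how a good value can be learned online without a separate training phase.

<code python>
import random
from collections import defaultdict

# Illustrative only: candidate values for a hypothetical forwarding-layer
# parameter (standing in for a TWINS window size); not taken from the paper.
CANDIDATES = [0.125, 0.25, 0.5, 1.0, 2.0]
EPSILON = 0.1  # exploration rate

# Per observed access pattern, keep running statistics for each candidate.
counts = defaultdict(lambda: [0] * len(CANDIDATES))
rewards = defaultdict(lambda: [0.0] * len(CANDIDATES))

def choose(pattern):
    """Epsilon-greedy choice of a parameter value for one access pattern."""
    if random.random() < EPSILON:
        return random.randrange(len(CANDIDATES))
    means = [r / c if c else float("inf")  # untried values get tried first
             for r, c in zip(rewards[pattern], counts[pattern])]
    return max(range(len(CANDIDATES)), key=means.__getitem__)

def update(pattern, idx, observed_bandwidth):
    """Feed back the observed performance of the chosen value (higher is better)."""
    counts[pattern][idx] += 1
    rewards[pattern][idx] += observed_bandwidth

def simulated_bandwidth(pattern, value):
    """Toy stand-in for the real system: each pattern has one best value."""
    best = {"contiguous": 0.5, "strided": 2.0}[pattern]
    return 100.0 - 40.0 * abs(value - best) + random.gauss(0.0, 2.0)

for _ in range(2000):  # online loop: observe a pattern, pick a value, learn
    pattern = random.choice(["contiguous", "strided"])
    idx = choose(pattern)
    update(pattern, idx, simulated_bandwidth(pattern, CANDIDATES[idx]))

for pattern in ("contiguous", "strided"):
    best = max(range(len(CANDIDATES)), key=counts[pattern].__getitem__)
    print(pattern, "-> most frequently chosen value:", CANDIDATES[best])
</code>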
  
  * //13:00 Virtual Lunch break//
  * 14:00 **Expert talks** -- chair Jay Lofstead
    * **The ALICE data management pipeline** -- Massimo Lamanna (CERN)
    * **Accelerating your Application I/O with UnifyFS** -- Kathryn Mohror (Lawrence Livermore National Laboratory) \\ UnifyFS is a user-level file system that is highly specialized for fast shared file access on high-performance computing (HPC) systems with distributed burst buffers. UnifyFS delivers significant performance improvements over general-purpose file systems by supporting the specific needs of HPC workloads with reduced POSIX semantics support called "lamination semantics." In this talk, we will give an introductory overview of how to use the lightweight UnifyFS file system to improve the I/O performance of HPC applications. We will describe how UnifyFS works with burst buffers, the benefits and limitations of lamination semantics, and how users can incorporate UnifyFS into their jobs. Finally, we will detail the current implementation status of UnifyFS and our plans for the future. (A conceptual sketch of the write-then-laminate-then-read pattern appears after the agenda.)
    * **How to recognise I/O bottlenecks and what to do about them** -- Rosemary Francis (Ellexus) \\ Dr Rosemary Francis is CEO and technical founder of Ellexus, the I/O profiling company. Rosemary will be sharing industry perspectives on how to recognise I/O bottlenecks and what to do about them. The delicate and often dynamic balance between I/O, CPU and memory can hide some easy wins in terms of improving throughput on-prem and reducing costs in the cloud. Equally, improving I/O is also about reducing the load on shared storage and not just about the incremental improvements of individual applications.
  * 15:30 **Discussion of hot topics** -- chair Julian Kunkel
  * 16:00 **Expert talks** -- chair Jean-Thomas Acquaviva
    * **Managing Decades of Scientific Data in Practice at NERSC** -- Glenn Lockwood (NERSC) \\ The National Energy Research Scientific Computing Center (NERSC) has been operating since 1974 and has been storing and preserving user data continuously for over 45 years as a result. This has resulted in NERSC building significant expertise in how to store and manage user data for long periods of time--a decade or more--and the practical factors that must be considered when data must be retained for longer than the lifetime of the physical components of the data center, including the entire data center facility itself. As the relevance of HPC extends beyond modeling and simulation and the usable lifetime of data extends from months to years or decades, these best practices in long-term data stewardship are likely to become more important to more HPC facilities. To this end, we present here some of the practical considerations, best practices, and lessons learned from managing the scientific data of NERSC's thousands of users over a period of four decades.
    * **Portable Validations of Scientific Explorations with Container-native Workflows** -- Ivo Jimenez (UC Santa Cruz) \\ Researchers working in computer, computational or data science often find it difficult to reproduce experiments from artifacts like code, data, diagrams and results left behind by previous researchers. The code developed on one machine often fails to run on other machines due to differences in hardware architecture, OS, and software dependencies, among others. This is accompanied by the difficulty in understanding how artifacts are organized, as well as in using them in the correct order. Software container technology, such as Docker, can solve most of the practical issues of portability, and in particular, container-native workflow engines can significantly aid experimenters in their work. In this talk, we introduce Popper, a container-native workflow engine that executes each step of a workflow in a separate dedicated container without assuming the presence of a Kubernetes cluster or any cloud-based Kubernetes service. With Popper, researchers can build and validate workflows easily in almost any environment of their choice, including local machines, SLURM-based HPC clusters, CI services, or Kubernetes-based cloud computing environments. To exemplify the suitability of this workflow engine, we present three case studies where we take examples from Machine Learning and High Performance Computing and turn them into Popper workflows. We also discuss how Popper can be used to aid in preparing artifacts associated with article submissions to conferences and journals, and in particular give an overview of the Journal of High-Performance Storage, a new eJournal that combines open reviews, living papers, digital reproducibility, and open access. (A miniature sketch of container-per-step execution appears after the agenda.)
    * **Identifying the performance issue of HDF5 on Summit** -- Xie Bing (Oak Ridge National Laboratory)
  * 17:30 **Discussion of hot topics** -- chair Jay Lofstead
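As a rough illustration of the "lamination semantics" mentioned in the UnifyFS talk: files are first written by the producer, then laminated, after which they are treated as immutable and become safe for other processes to read. The sketch below is purely conceptual and does not use UnifyFS's actual API; the laminate() helper is a hypothetical stand-in, simulated here by flipping a local file to read-only.

<code python>
import os
import stat
import tempfile

def laminate(path):
    """Hypothetical stand-in for lamination (not UnifyFS's real API):
    after this call the file is treated as immutable and safe to read.
    Here this is merely simulated by making the local file read-only."""
    os.chmod(path, stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH)

# Write phase: the producer writes its output; readers must not access it yet.
path = os.path.join(tempfile.mkdtemp(), "checkpoint.dat")
with open(path, "wb") as f:
    f.write(b"\x00" * 1024)  # stand-in for checkpoint data

# Lamination: from this point the file's contents are frozen.
laminate(path)

# Read phase: consumers can now safely read the laminated file.
with open(path, "rb") as f:
    data = f.read()
print(len(data), "bytes read after lamination")
</code>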
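Similarly, the container-per-step execution model described in the Popper talk can be conveyed in miniature. The sketch below is not Popper's implementation, and the step definitions are not Popper's workflow format; it simply runs each step of a toy workflow in its own Docker container (assuming a local Docker installation), with the working directory shared between steps.

<code python>
import os
import subprocess

# Illustrative step list (not Popper's workflow format): each step names a
# container image and the command to run inside it.
steps = [
    {"id": "prepare", "image": "python:3.11-slim",
     "args": ["python", "-c", "print('preparing input data')"]},
    {"id": "analyze", "image": "python:3.11-slim",
     "args": ["python", "-c", "print('running the analysis')"]},
]

for step in steps:
    print(f"==> step {step['id']} ({step['image']})")
    # Each step runs in its own fresh container; the current directory is
    # bind-mounted so consecutive steps can share files.
    cmd = ["docker", "run", "--rm",
           "-v", f"{os.getcwd()}:/workspace", "-w", "/workspace",
           step["image"], *step["args"]]
    subprocess.run(cmd, check=True)
</code>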