Research
The research group High-Performance Storage improves the capabilities of storage landscapes by applying smart concepts. We speak the languages of big data analytics and high-performance computing and apply our knowledge to meet the needs of environmental modeling.
Our motto is: Limitless Storage – Limitless Possibilities
More information about specific projects and our publications can be found in the navigation bar on the left.
Goals
The general goals of our research are:
- Data-centric Input/Output architectures
- Efficient execution of data-driven workflows
- Autonomous storage systems making intelligent decisions (on behalf of the users)
Research Interests
Our research interests with respect to high-performance I/O include, but are not limited to:
- Smart storage and semantic I/O interfaces
- Self-optimizing systems via machine learning
- Performance analysis methods, tools and benchmarks
- Modeling of performance and costs
- Data reduction techniques
- Efficient software stacks
- Management of data-driven workflows
- Utilization of complex storage landscapes
- Performance portability
- Management of I/O in cluster systems
Besides the interest in HPC I/O, we also conduct general research on HPC, software development, and training.
High-Performance Computing
Supercomputers combine the performance of hundreds or thousands of office computers to tackle problems that could not otherwise be solved on PCs in a reasonable amount of time. Scientific workflows executing on such systems typically involve either computer-aided simulation or the processing of Experimental and Observational Data (EOD).
With respect to computer-aided simulation, having the capabilities offered by supercomputers at hand, scientists no longer have to conduct time-consuming and error-prone experiments in the real world. Instead, the modeling and simulation of the laws of nature within computer systems offer a well-defined environment for experimental investigation. Models for climate, protein folding or nanomaterials, for example, can be simulated and manipulated at will without being restricted by the laws of nature. This method leads to new observations and understandings of phenomena which would otherwise be too fast or too slow to comprehend in vitro.
The processing of observational data from sources such as sensor networks and satellites, along with other data-driven workflows, is yet another challenge, as it is usually dominated by the input/output of data.
In any case, with the improvement of computing performance, better experiments can be designed and conducted. As such, a thorough understanding of hardware and software design is vital to providing the necessary computing power for scientists. This understanding has developed into its own branch within the computer science field: High-Performance Computing (HPC). High-performance computing is the discipline in which supercomputers are designed, integrated, programmed, and maintained.
Supercomputers are tools used in the natural sciences to analyze scientific questions in silico. Indeed, HPC provides a new model of scientific inquiry – that is, a new way to obtain scientific knowledge. Mahootian and Eastman state:
The volume of observational data and power of high-performance computing has increased by several orders of magnitude and reshaped the practice and concept of science, and indeed the philosophy of science.
To conclude, HPC systems provide the compute and storage infrastructure to address questions of relevance to fundamental research, industry, and the most complex social challenges.
High-Performance Storage
High-performance storage provides the hardware and software technology to store and query the largest data volumes at a high velocity of input/output while ensuring data consistency. On Exascale systems, i.e., systems performing 10^18 floating-point operations per second, workflows will harness hundreds of thousands of processors with millions of threads while producing or reading Petabytes to Exabytes of data.
Traditionally, a parallel application uses an I/O middleware such as NetCDF or HDF5 to access data. The middleware provides a data access and manipulation API for higher-level objects such as variables. Naturally, these interfaces provide operations for accessing and manipulating data that are tailored to the needs of users. Historically, such middleware is also responsible for the conversion of the user's data into the file, which is just a byte array. To do so, it uses a widely available file system interface such as POSIX or MPI-IO. In the last decade, data centers realized that existing I/O middleware is unable to exploit the parallel file systems deployed in HPC systems for various reasons. As a consequence, data centers started to develop new middleware such as PNetCDF, SIONlib, GLEAN, ADIOS, PLFS, and dataClay. Several of these interfaces are now used in applications that run on large-scale machines.
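To illustrate the conversion such middleware performs, here is a minimal sketch in plain Python (not a real NetCDF/HDF5 implementation; the format and function names are invented for illustration). It serializes a named variable into a flat byte array and hands it to the POSIX-style byte-stream interface, then reconstructs the higher-level object on read:

```python
import os
import struct
import tempfile

def write_variable(path, name, values):
    """Mimic what I/O middleware does: convert a named variable
    (here, a list of floats) into a flat byte array and write it
    through the POSIX-level file interface."""
    encoded_name = name.encode("utf-8")
    with open(path, "wb") as f:                            # plain byte stream
        f.write(struct.pack("<I", len(encoded_name)))      # header: name length
        f.write(encoded_name)                              # header: variable name
        f.write(struct.pack("<I", len(values)))            # header: element count
        f.write(struct.pack(f"<{len(values)}d", *values))  # payload: doubles

def read_variable(path):
    """Inverse operation: reconstruct the named variable from raw bytes."""
    with open(path, "rb") as f:
        name_len = struct.unpack("<I", f.read(4))[0]
        name = f.read(name_len).decode("utf-8")
        count = struct.unpack("<I", f.read(4))[0]
        values = list(struct.unpack(f"<{count}d", f.read(8 * count)))
    return name, values

path = os.path.join(tempfile.mkdtemp(), "temperature.dat")
write_variable(path, "temperature", [13.7, 14.2, 15.0])
print(read_variable(path))  # ('temperature', [13.7, 14.2, 15.0])
```

Real middleware adds metadata, chunking, compression, and parallel access on top of this basic mapping, but the principle is the same: higher-level objects on top, a byte array underneath.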
Recent advances in new storage technologies, such as in-memory storage and non-volatile memory, promise to bring high-capacity, non-volatile storage with performance characteristics (latency/bandwidth/energy consumption) that bridge the gap between DDR memory and SSD/HDD. However, they require careful integration into the existing I/O stack or demand the development of next-generation storage systems.
Future high-performance storage systems will need internal management systems and interfaces that provide capabilities far beyond those currently possible. In particular, they need to support fine-grained data access prioritization and adaptively improved performance using internal replication and revised data layouts, all with acceptable resiliency. This must be achieved in the presence of millions of simultaneous threads, not all under a single application's control, and all doing I/O. Where multiple tiers are present, data replication and migration should optimally adapt on the fly to the requirements of individual workflows and the overall system load. All this must be achieved with system interoperability and standardized application programming interfaces (APIs). Additionally, data centers face challenges supporting mixed workflows of HPC and data analytics; the general consensus is that this needs to change and requires new methods and new thinking about how to access storage, describe data, and manage workflows.
The efficient, convenient, and robust data management and execution of data-driven workflows are key for productivity in computer-aided RD&E, particularly for data-intensive research such as climate/weather with its complex processing workflows. Still, the storage stack is based on low-level I/O that requires complex manual tuning. In this environment, we are researching a novel I/O API that will lift the abstraction to a new level, paving the way for intelligent storage systems. One key benefit of these systems is the exploitation of heterogeneous storage and compute infrastructures by scheduling user workloads efficiently across a system topology, a concept called Liquid Computing. These systems can improve the data handling over time without user intervention and lead towards an era of smart system infrastructure. They have the potential to become the core I/O infrastructure in scientific computing, but also enable us to host big-data tools like Spark in an efficient manner.
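To give a flavor of what lifting the abstraction could mean, the following is a purely hypothetical Python sketch, not the group's actual API: instead of tuning low-level I/O by hand, the user declares intent (expected lifetime, access pattern), and a trivial stand-in for an intelligent storage system picks a storage tier from those hints. All names and tier choices here are invented for illustration:

```python
from dataclasses import dataclass

@dataclass
class Variable:
    """User-declared intent for a dataset (hypothetical interface)."""
    name: str
    dims: tuple
    lifetime: str = "temporary"   # hint: "temporary" | "campaign" | "archive"
    access: str = "sequential"    # hint: "sequential" | "random"

def choose_tier(var: Variable) -> str:
    """A trivial stand-in for the decisions a smart storage system
    could derive from declared intent instead of manual tuning."""
    if var.lifetime == "temporary":
        return "node-local NVMe"   # short-lived scratch data stays close
    if var.access == "random":
        return "burst buffer"      # absorb random I/O before the file system
    return "parallel file system"  # long-lived, streaming data

v = Variable("precipitation", dims=(1440, 720),
             lifetime="campaign", access="sequential")
print(choose_tier(v))  # parallel file system
```

A real system would of course also weigh system load, tier capacities, and learned access histories; the point of the sketch is only that placement decisions move from the user into the storage system.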
We believe that intelligent storage systems are the solution.
Please visit The Virtual Institute for I/O.