ECMWF steps up work on I/O issues in supercomputing

Antonino Bonanni, Tiago Quintino, Simon Smart


ECMWF presented its plans for an input/output (I/O) Workload Simulator at an international meeting held at the Centre on 23 and 24 February 2016. The event brought together the eight partner organisations involved in NEXTGenIO, a project addressing I/O challenges in supercomputing.

NEXTGenIO aims to develop innovative solutions to tackle I/O bottlenecks as high-performance computing (HPC) moves towards exascale capabilities. The project started on 1 October 2015 and is set to run for three years. It is an EU-funded Horizon 2020 project with a budget of 8.1 million euros and is co-ordinated by EPCC, the supercomputing centre at the University of Edinburgh. The outcome of the project will be a prototype HPC system designed by Fujitsu with Intel 3D XPoint Non-Volatile RAM (NVRAM), including newly developed systemware and an adapted application stack.

Workload Simulator

As one of the main application providers, ECMWF is contributing by developing an I/O Workload Simulator (IOWS), which will be used to evaluate the new HPC system and may be used in future HPC procurement.

The IOWS aims to simulate workloads from real HPC systems, both theoretically by means of software and practically by using a collection of lightweight, portable applications on real hardware. To do this, it will record metrics on operational systems. The recorded workload will be interpreted and modelled to reduce its complexity and gain insights into its structure. This model will then be simulated in software, which will permit an analysis of how varying the properties of the workload or of the hardware impacts the overall workflow.
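The record, model and simulate steps described above can be illustrated with a deliberately small sketch. It is not the IOWS design: the JobRecord fields, the averaging used as a "model", the assumption that jobs run one after another, and the function names are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass
class JobRecord:
    """One recorded job. The fields are hypothetical, not the IOWS schema."""
    name: str
    compute_seconds: float
    bytes_written: float


def model_workload(records):
    """Reduce the recorded jobs to one averaged job per job name.

    Returns a list of (average_job, count) pairs.
    """
    groups = {}
    for record in records:
        groups.setdefault(record.name, []).append(record)
    model = []
    for name, jobs in groups.items():
        count = len(jobs)
        average = JobRecord(
            name,
            sum(j.compute_seconds for j in jobs) / count,
            sum(j.bytes_written for j in jobs) / count,
        )
        model.append((average, count))
    return model


def simulate(model, write_bandwidth, io_scale=1.0):
    """Estimate total runtime in seconds, assuming jobs run sequentially.

    write_bandwidth is in bytes per second; io_scale multiplies the
    volume of data written, to mimic a heavier or lighter workload.
    """
    total = 0.0
    for job, count in model:
        io_seconds = job.bytes_written * io_scale / write_bandwidth
        total += count * (job.compute_seconds + io_seconds)
    return total


# Made-up measurements standing in for metrics recorded on a real system.
recorded = [
    JobRecord("forecast", 100.0, 1e9),
    JobRecord("forecast", 200.0, 3e9),
    JobRecord("products", 50.0, 1e9),
]
workload_model = model_workload(recorded)
slow_storage = simulate(workload_model, write_bandwidth=1e8)
fast_storage = simulate(workload_model, write_bandwidth=1e9)
doubled_output = simulate(workload_model, write_bandwidth=1e8, io_scale=2.0)
```

Comparing slow_storage with fast_storage shows the effect of changing a hardware property, while doubled_output shows the effect of changing a workload property. A realistic model would also need to capture concurrency, read traffic and dependencies between jobs.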

The IOWS will also be able to execute the modelled workload on a physical HPC system, and it will ideally permit the scaling of modelled workloads for deployment on prototype HPC systems during development or procurement.

I/O libraries in the IFS

ECMWF will also adapt the I/O libraries in the Integrated Forecasting System (IFS) to take advantage of NVRAM technology and reduce the impact of I/O on daily operations. Currently, data is written to a parallel disk-based storage system whenever it passes between components of the operational system (for example between the IFS and Product Generation). This forms a growing bottleneck as the volume of data generated increases. As part of the project, NVRAM will be used as a large, distributed caching layer to connect different components of the operational system. This will minimise the usage of disk-based I/O in the time-critical path.
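The idea of a caching layer between components can be sketched as a two-tier store. This is a toy model under stated assumptions, not the IFS I/O libraries: plain dictionaries stand in for NVRAM and for disk, and the TieredStore class, its methods and the field key are hypothetical.

```python
class TieredStore:
    """Toy two-tier store: a fast tier in front of a slow tier."""

    def __init__(self):
        self.fast = {}       # stand-in for the NVRAM caching layer
        self.disk = {}       # stand-in for parallel disk-based storage
        self.disk_reads = 0  # reads that had to go to the slow tier

    def write(self, key, data):
        """Producer side: data lands in the fast tier only."""
        self.fast[key] = data

    def read(self, key):
        """Consumer side: serve from the fast tier when possible."""
        if key in self.fast:
            return self.fast[key]
        self.disk_reads += 1
        return self.disk[key]

    def flush(self):
        """Move cached data to disk and empty the fast tier.

        Intended to run outside the time-critical path.
        """
        self.disk.update(self.fast)
        self.fast.clear()


store = TieredStore()
store.write("t850/step6", b"field")        # e.g. the forecast model writes a field
field = store.read("t850/step6")           # a downstream component reads it
reads_before_flush = store.disk_reads      # no disk access so far
store.flush()                              # later, archive to disk
archived = store.read("t850/step6")        # now served from the slow tier
reads_after_flush = store.disk_reads
```

In this sketch the hand-over from producer to consumer never touches the disk tier; disk is reached only after the flush, which can be scheduled when it does not delay the time-critical components.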

Participants in the NEXTGenIO meeting. NEXTGenIO project members met at ECMWF in February 2016.