ECMWF high-performance computing facility boosts forecasts and research

The Atos high-performance computing facility in ECMWF's data centre in Bologna, Italy.

In October 2022, ECMWF started to use its Atos high-performance computing facility (HPCF). It has already allowed the Centre to increase the resolution of its medium-range ensemble forecasts and to provide more frequent extended-range forecasts with more ensemble members.

These changes were brought in when the Integrated Forecasting System (IFS) was upgraded to Cycle 48r1 in June 2023.

In addition, the new HPCF enables ECMWF to do efficient research on further resolution upgrades as well as other improvements of the IFS.

The four-cluster structure of the Atos system also makes it easier to perform software upgrades than was possible with the two clusters of the previous system.

Greater resolution on the HPCF

The new HPCF has enabled ECMWF to increase the resolution of its medium-range ensemble forecasts (ENS) from 18 km to 9 km in Cycle 48r1. The ensemble resolution now matches that of our current single high-resolution forecasts (HRES).

“This is a really significant step in line with our strategy, which gives primacy to the ensemble,” says Director of Research Andy Brown. “It has brought significant improvements to a wide range of metrics, both concerning upper-air and near-surface weather.”

ECMWF’s medium-range forecasts extend 15 days ahead, while extended-range forecasts cover a period of 46 days. The powerful HPCF has also allowed us to increase the number of ensemble members of extended-range forecasts from 51 to 101, as well as increasing their frequency from twice a week to daily.

“The larger extended-range ensemble improves the representation of uncertainty. And running the forecasts every day means that users can benefit from fresher forecasts. They can also combine runs over multiple days to produce an even larger ensemble,” Andy explains.
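As a rough illustration of the numbers involved, the sketch below counts extended-range ensemble members before and after the upgrade, and the size of a lagged ensemble built by pooling daily runs. It assumes that pooling simply concatenates all members of each run; the function names are hypothetical, and how users actually weight or combine older runs is not specified here.

```python
def members_per_week(members_per_run: int, runs_per_week: int) -> int:
    """Total ensemble member forecasts produced in one week."""
    return members_per_run * runs_per_week


def lagged_ensemble_size(members_per_run: int, max_lag_days: int) -> int:
    """Size of an ensemble pooling today's run with the runs from up to
    max_lag_days earlier (assumption: one run per day, all members kept)."""
    if members_per_run < 1 or max_lag_days < 0:
        raise ValueError("need at least one member and a non-negative lag")
    return members_per_run * (max_lag_days + 1)


# Before Cycle 48r1: 51 members, twice a week. After: 101 members, daily.
print("member forecasts per week, before:", members_per_week(51, 2))
print("member forecasts per week, after: ", members_per_week(101, 7))

# Pooling the daily 101-member runs over days 0 to -3.
for lag in range(4):
    print(f"days 0 to -{lag}:", lagged_ensemble_size(101, lag), "members")
```

Under these assumptions, pooling four consecutive daily runs yields 404 members, at the price of including forecasts that started up to three days earlier.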

Difference in extended-range re-forecasts starting on day 0 and 3 days earlier

The chart shows the difference in Madden-Julian Oscillation bivariate correlation between 5-member ensemble re-forecasts starting on 1 November and 1 February 1989–2016 and re-forecasts starting 3 days earlier. Positive values indicate an improvement resulting from using forecasts without a 3-day delay. The vertical red lines indicate the 90% level of statistical significance. The red, green and cyan curves represent the difference when using a lagged ensemble with more members (days 0 to –1, days 0 to –2 and days 0 to –3, respectively).

The Cycle 48r1 upgrade also included an increase in the resolution of the 4D-Var data assimilation system, which combines short-range forecasts with observations.

“The change allows the system to extract more information from the wealth of Earth observations, and thus to improve forecasts,” Andy says. The resolution of the Ensemble of Data Assimilations will increase further in future cycles.

In addition, the new HPCF will be used to produce the next generation of reanalyses of weather and climate (ERA6), ocean conditions (OCEAN6), and atmospheric composition (EAC5).

Finally, research will be conducted on still higher-resolution models, in readiness for the next high-performance computing facility.

ECMWF’s new data centre in Bologna.

How the new HPCF helps

All of these changes and preparations for the future require increased computing power, which the new Atos HPCF provides.

For example, the upgrade from 18 km to 9 km resolution in the medium-range ensembles leads to a factor of eight increase in computational cost, and a factor of 2.5 increase in the volume of data written to the parallel filesystems, compared to the previous cycle.
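One common back-of-envelope way to arrive at a factor of eight is sketched below. It assumes that cost is proportional to the number of horizontal grid points multiplied by the number of time steps, that the time step is shortened in proportion to the grid spacing, and that the number of vertical levels is unchanged. The article does not give this breakdown, so the scaling rule and the function name `relative_cost` are illustrative assumptions rather than ECMWF's own accounting.

```python
def relative_cost(old_spacing_km: float, new_spacing_km: float) -> float:
    """Estimated cost ratio when the horizontal grid spacing changes.

    Assumptions (not taken from the article):
      - grid points scale with (old / new) ** 2, for two horizontal dimensions
      - the number of time steps scales with (old / new)
      - vertical levels and per-point work are unchanged
    """
    if old_spacing_km <= 0 or new_spacing_km <= 0:
        raise ValueError("grid spacings must be positive")
    ratio = old_spacing_km / new_spacing_km
    grid_point_factor = ratio ** 2
    time_step_factor = ratio
    return grid_point_factor * time_step_factor


# Medium-range ensemble upgrade in Cycle 48r1: 18 km to 9 km.
print(relative_cost(18, 9))
```

The quoted factor of 2.5 for data volume does not follow from this simple model, since output size depends on which fields are written and how often, not only on the number of grid points.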

“A cycle upgrade also brings with it new software that we had not evaluated and benchmarked when we procured the machine. A good example of this is OOPS (Object-Oriented Prediction System), which has been implemented for the first time in Cycle 48r1,” says senior HPC analyst Ioan Hadade.

The benefits of having four clusters are reliability and maintainability. The bigger the system, the more ways its parts can interact and the more things can go wrong. With four clusters, the impact of any problem is minimised, and tasks such as software upgrades can be carried out more easily and safely than with the previous two clusters.

“No matter how much testing you do, or how good your test environment is, implementing a new software version or hardware fix on a large system almost always throws up its own problems, so running at scale is a challenge. With the Atos, we can test at scale by running an upgrade on one of the four clusters while operations run on two clusters, which still means we’ve got one cluster to spare if required by operations,” says Mike Hawkins, ECMWF’s Head of HPC and Storage Section.

The Atos system has a layer of high-performance solid-state disk (SSD) storage, whose performance is more predictable than that of traditional disk storage. “It helps to deliver the same performance day in, day out, and it’s a bit faster, too,” Mike says.

The Atos HPCF is not just a system for running forecasts quickly; it also supports a great deal of research. Unlike in the past, researchers now log in directly to the HPCF to do their work rather than working on their desktops. “It’s a different operating model: having more resources in the centre should better support our more distributed way of working, and it should be the same experience if you are working from one of our offices or from home,” says Mike.

Access to HPCF in Bologna

Schematic overview of the new operating model. Users log in directly to the HPCF to do research work. The ‘MCLOUD VDI’ (Virtual Desktop Infrastructure) is hosted on the European Weather Cloud, a cloud computing infrastructure focused on meteorological data and developed in collaboration with EUMETSAT.

A lot of work has been done to get the HPCF ready for the Cycle 48r1 upgrade. “A cycle upgrade brings with it new code that we had not evaluated when we bought the machine,” says Ioan.

“There has been a lot of fine-tuning to run Cycle 48r1 on the Atos HPCF,” he says. “It’s like an orchestra that has to be conducted, and with Cycle 48r1 the orchestra got very big.”

More information on the HPCF is available on the ECMWF web page about our supercomputer facility.