To make full use of the diverse range of computing hardware available to the EU Destination Earth (DestinE) initiative across various EuroHPC systems, the ECMWF Integrated Forecasting System (IFS) has been adapted for graphics processing unit (GPU) accelerators. The latest version of ECMWF’s open-source operational wave model, ecWAM 1.5 (https://github.com/ecmwf-ifs/ecwam), is now GPU-ready, as part of an ongoing collaborative effort that includes ECMWF, CINECA, Nvidia and AMD.
Scientific significance
Ocean surface waves, generated primarily by the action of wind, are an integral component of the Earth system due to their role in modulating momentum, heat and moisture fluxes between the ocean and atmosphere. Third-generation spectral wave models are the best tools we have for modelling wave generation, transformation, and decay across the global oceans. These models explicitly resolve the full two-dimensional wave action spectrum and evolve it through physically based source terms representing wind input, nonlinear wave–wave interactions, and dissipation, without imposing constraints on spectral shape or wave growth. They also describe key propagation processes such as advection, refraction, and shoaling. ecWAM is ECMWF’s state-of-the-art third-generation spectral wave model and a core component of the IFS Earth system model.
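Schematically, the evolution of the spectrum in such models can be written as a balance equation, shown here in a simplified deep-water form for illustration (the operational formulation is more general):

```latex
\frac{\partial F(f,\theta)}{\partial t} + \mathbf{c}_g \cdot \nabla F
  = S_{\mathrm{in}} + S_{\mathrm{nl}} + S_{\mathrm{ds}},
```

where \(F(f,\theta)\) is the wave spectrum as a function of frequency and direction, \(\mathbf{c}_g\) is the group velocity governing propagation, and \(S_{\mathrm{in}}\), \(S_{\mathrm{nl}}\) and \(S_{\mathrm{ds}}\) are the source terms for wind input, nonlinear wave–wave interactions and dissipation, respectively.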
GPU-adaptation strategy
Because ecWAM contains a diverse set of computational patterns, a mixed strategy was used to adapt it to GPUs. The wave propagation kernel uses a bespoke low-order computational stencil, and was adapted for GPU execution via manual, tailored code optimisations and the insertion of pragma directives.
Source-term computation forms the other major scientific component of ecWAM. It represents the physical processes that generate, dissipate and redistribute wave energy. This is a large and scientifically fast-evolving part of the code, so a manual GPU-adaptation strategy would be unsustainable. Here, ECMWF’s in-house source-to-source translation tool Loki (https://github.com/ecmwf-ifs/loki) is used to generate the GPU code automatically from the central processing unit (CPU) code as a preprocessing step in the ecWAM build procedure. Moreover, as Nvidia and AMD GPUs use distinct programming models, generating the GPU-optimised code with Loki is essential for targeting both architectures without highly intrusive code modifications. As the ecWAM source-term computation kernel features the same “single-column” memory access pattern (with each column comprising one grid point) as the grid-point computations in the IFS dynamics and physics, the same Loki GPU-adaptation recipes can be reused.
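The "single-column" pattern can be illustrated with a minimal sketch (toy source terms and made-up constants, not ecWAM's physics): each grid point touches only its own spectral data, so the outer loop over columns is trivially parallel, which is precisely what lets a source-to-source tool mechanically map it to GPU threads:

```c
#define NPROMA 4   /* block of grid points, mimicking IFS-style blocking (illustrative) */
#define NFRE   3   /* number of spectral frequencies (toy value) */

/* Toy source-term update. Column jl reads and writes only spec[jl][*]
   and wind[jl]: the "single-column" memory access pattern. */
void source_terms(double spec[NPROMA][NFRE], const double wind[NPROMA], double dt) {
    /* Outer loop over independent columns: a translation tool can hoist
       this loop and turn it into the GPU thread-parallel dimension. */
    for (int jl = 0; jl < NPROMA; ++jl) {
        for (int m = 0; m < NFRE; ++m) {
            double sinput  =  0.1  * wind[jl];      /* toy wind-input term */
            double sdissip = -0.05 * spec[jl][m];   /* toy dissipation term */
            spec[jl][m] += dt * (sinput + sdissip);
        }
    }
}
```

Because no column depends on another, the transformation is purely structural and needs no understanding of the physics, which is what makes the automated approach sustainable as the science evolves.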
Data transfers between the separate memory spaces of a CPU and a GPU represent one of the key challenges of adaptation. To address this, the ecWAM data structures have been rewritten around Field-API (https://github.com/ecmwf-ifs/field_api), an array data management abstraction co-developed with Météo-France for the highly bespoke memory data layouts used in the IFS. Field-API provides vendor-agnostic support for highly optimised CPU-to-GPU data transfers and makes this functionality transparent to scientific developers behind an intuitive and non-intrusive abstraction. More details about Loki, Field-API and the wider IFS GPU-adaptation strategy can be found in the article by Lange et al. titled ‘GPU-adaptation of IFS for EuroHPC machines’ in this Newsletter.
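The core idea behind such an abstraction can be sketched as follows. This is a hypothetical miniature, not the real field_api interface: the wrapper owns a host and a device copy of an array, tracks which side is current, and copies only when the requested side is stale (here the "device" buffer is simulated with ordinary host memory for illustration):

```c
#include <stdlib.h>
#include <string.h>

/* Minimal Field-API-style wrapper (hypothetical API for illustration). */
typedef struct {
    double *host;
    double *device;        /* stands in for GPU memory in this sketch */
    size_t  n;
    int     device_stale;  /* host copy is newer than device copy */
    int     host_stale;    /* device copy is newer than host copy */
    int     transfers;     /* count of simulated CPU<->GPU copies */
} Field;

Field *field_new(size_t n) {
    Field *f = malloc(sizeof *f);
    f->host = calloc(n, sizeof *f->host);
    f->device = calloc(n, sizeof *f->device);
    f->n = n; f->device_stale = 1; f->host_stale = 0; f->transfers = 0;
    return f;
}

/* Device-side view: copy host->device only if the device copy is stale. */
double *field_get_device(Field *f) {
    if (f->device_stale) {
        memcpy(f->device, f->host, f->n * sizeof *f->host);  /* simulated H2D */
        f->device_stale = 0; f->transfers++;
    }
    f->host_stale = 1;  /* assume the device kernel will write to it */
    return f->device;
}

/* Host-side view: copy device->host only if the host copy is stale. */
double *field_get_host(Field *f) {
    if (f->host_stale) {
        memcpy(f->host, f->device, f->n * sizeof *f->host);  /* simulated D2H */
        f->host_stale = 0; f->transfers++;
    }
    f->device_stale = 1;
    return f->host;
}

void field_free(Field *f) { free(f->host); free(f->device); free(f); }
```

Scientific code simply asks for a host or device view and never issues transfers explicitly, which is what makes the abstraction non-intrusive: transfer logic lives in one place rather than being scattered through the model.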
Modularisation and open sourcing
Modularisation and open sourcing have also played a critical role in the GPU-adaptation of ecWAM. Modularisation, specifically the ability to build and test ecWAM independently of the IFS, significantly shortens the development cycle, and open sourcing greatly simplifies collaboration with external contributors.
The initial GPU-adaptation of the wave propagation kernel was performed by CINECA under a contracted DestinE activity. Both modularisation and open sourcing were instrumental in enabling this work to be completed in a timely manner. ecWAM has also benefited immensely from direct vendor contributions, initially from Nvidia and, more recently, AMD, which again was facilitated by open sourcing. Perhaps even more importantly, having a modular and open-source ecWAM also greatly lowers the barrier to continued optimisation, both by internal and external developers.
Computational performance
Whilst there is still room for optimisation, significant speedups have been achieved across a variety of GPU architectures, as shown in the first figure. The entire ecWAM timestep is now executed on GPU, and data is exchanged between CPU and GPU only at the coupling points with the IFS, where forcings are passed in either direction (see the second figure). Limiting the CPU–GPU data transfers in this way allows the GPU computational performance gains to translate directly into reduced overall model-execution time.
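The resulting control flow can be summarised in a schematic driver loop (illustrative structure only, with stand-in function names, not actual ecWAM code): transfers occur only at the IFS coupling exchange, while every kernel in between stays on the device:

```c
/* Stub operations with counters, so the transfer pattern is observable. */
static int h2d = 0, d2h = 0, gpu_kernels = 0;

static void receive_ifs_forcing(void)          { h2d++; }         /* CPU -> GPU */
static void propagate_on_gpu(void)             { gpu_kernels++; } /* stays on device */
static void integrate_source_terms_on_gpu(void){ gpu_kernels++; } /* stays on device */
static void send_fields_to_ifs(void)           { d2h++; }         /* GPU -> CPU */

/* Schematic coupled driver: one H2D and one D2H transfer per timestep,
   at the IFS coupling points; the wave timestep itself never leaves the GPU. */
void run_coupled_steps(int nsteps) {
    for (int step = 0; step < nsteps; ++step) {
        receive_ifs_forcing();
        propagate_on_gpu();
        integrate_source_terms_on_gpu();
        send_fields_to_ifs();
    }
}
```

With this structure, transfer cost grows only with the number of coupling exchanges, not with the number of GPU kernels, so kernel-level speedups carry through to the total model runtime.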