ECMWF Newsletter #164

New HPC Test and Early Migration System

Cristian Simarro

 

After ECMWF signed a four-year contract with Atos for the supply of Atos’s BullSequana XH2000 supercomputer, Atos began installing a high-performance computing (HPC) Test and Early Migration System (TEMS) at ECMWF’s data centre in Reading, UK, in February 2020. The system comprises 60 nodes:

  • 40 compute nodes (parallel jobs)
  • 20 GPIL nodes (general purpose and interactive login).

The GPIL nodes are intended to integrate the interactive and postprocessing work that is currently done on ecgate and the Linux Clusters.

One of the main differences between ECMWF’s current Cray XC40 system and the future BullSequana XH2000 is the processor technology, which changes from Intel Broadwell to AMD EPYC Rome. Although both implement variants of the ubiquitous x86_64 instruction set, the latter has many more cores per processor. With this new layout, a correct process binding configuration is essential for achieving good performance.
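To illustrate what process binding means in practice, here is a minimal Python sketch that assigns each MPI rank on a node its own non-overlapping block of cores. It is not ECMWF's configuration. The function name `core_blocks` is hypothetical, the 128-core node assumes two 64-core processors, and the socket-by-socket core numbering is an assumption; real core numbering and NUMA layout depend on the system.

```python
def core_blocks(cores_per_node, ranks_per_node, threads_per_rank):
    """Return one contiguous list of core IDs per rank on a single node."""
    if ranks_per_node * threads_per_rank > cores_per_node:
        raise ValueError("more threads requested than cores available")
    # Spread ranks evenly so each owns an equal, non-overlapping slice of the node.
    stride = cores_per_node // ranks_per_node
    return [
        list(range(rank * stride, rank * stride + threads_per_rank))
        for rank in range(ranks_per_node)
    ]


# Assumed node: 2 sockets x 64 cores = 128 cores, numbered socket by socket.
blocks = core_blocks(cores_per_node=128, ranks_per_node=8, threads_per_rank=16)
for rank, cores in enumerate(blocks):
    print(f"rank {rank}: cores {cores[0]}-{cores[-1]}")
```

Under these assumptions, ranks 0 to 3 stay on the first socket and ranks 4 to 7 on the second, so no rank's threads straddle two sockets.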

Test system. The HPC Test and Early Migration System supplied by Atos was installed at ECMWF in February 2020.

In addition, the queuing system will move from the Portable Batch System (PBS), with Cray’s aprun as the job launcher, to Slurm, with srun as the launcher.
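As a rough illustration of what this change means for job launch lines, the sketch below maps three common aprun options to their nearest srun counterparts: `-n` (total tasks) to `--ntasks`, `-N` (tasks per node) to `--ntasks-per-node`, and `-d` (CPUs per task) to `--cpus-per-task`. The helper name `srun_from_aprun` and the executable name `./model.exe` are hypothetical, and the mapping is only approximate; real job scripts also need their PBS directives translated.

```python
def srun_from_aprun(total_tasks, tasks_per_node, cpus_per_task, executable):
    """Build an srun command list roughly equivalent to `aprun -n -N -d`."""
    return [
        "srun",
        f"--ntasks={total_tasks}",              # aprun -n
        f"--ntasks-per-node={tasks_per_node}",  # aprun -N
        f"--cpus-per-task={cpus_per_task}",     # aprun -d
        executable,
    ]


# Hypothetical job: 64 tasks, 8 per node, 16 CPUs each.
command = srun_from_aprun(64, 8, 16, "./model.exe")
print(" ".join(command))
```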

After the initial integration of the TEMS into the Centre’s systems, ECMWF installed the environment ‘module’ system and other commonly required third-party software packages. Soon after, several teams across the organisation started to explore which combinations of compilers and Message Passing Interface (MPI) implementations work best in different scenarios. First tests indicated that the best results are often achieved by combining the Intel compiler with either Intel MPI or HPC-X-boosted OpenMPI for code development. The GNU compiler suite is available as an alternative.

Since the computational capacity of the TEMS is small and its configuration is ongoing, access is limited to ECMWF application migration teams and selected Member State users (e.g. developers of Time-Critical suites), by invitation only.

The TEMS specification

60 nodes with: 

  • 2 x AMD Rome 7H12
  • 512 GiB memory
  • 1 TB local SSD (only in the GPIL nodes)
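For a sense of scale, the specification above can be turned into aggregate figures with a few lines of arithmetic. The sketch assumes 64 cores per processor, which is AMD's published core count for the EPYC 7H12 and is not stated in the article; the other numbers are taken from the lists above.

```python
NODES = 60
GPIL_NODES = 20
SOCKETS_PER_NODE = 2
CORES_PER_SOCKET = 64  # assumption: AMD's published core count for the EPYC 7H12
MEMORY_PER_NODE_GIB = 512
SSD_PER_GPIL_NODE_TB = 1

total_cores = NODES * SOCKETS_PER_NODE * CORES_PER_SOCKET
total_memory_tib = NODES * MEMORY_PER_NODE_GIB / 1024
total_ssd_tb = GPIL_NODES * SSD_PER_GPIL_NODE_TB

print(f"cores: {total_cores}")
print(f"memory: {total_memory_tib} TiB")
print(f"local SSD (GPIL nodes only): {total_ssd_tb} TB")
```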