ECMWF Cray Supercomputer  

ECMWF's High Performance Computing Facility (HPCF) is the result of a competitive procurement carried out in 2012 and 2013, which led to ECMWF awarding a two-phase service contract to Cray UK Ltd to supply and support the HPCF until mid-2018. The contract was signed on 24 June 2013.

In 2015, Cray and ECMWF signed a contract amendment extending the support period to 2020. This allowed Cray and ECMWF to upgrade the main systems in the first half of 2016 with the latest generation of Intel processors and to add extra memory and storage. In the second half of 2016, we added a new standalone cluster with 32 Intel Xeon Phi processors to support the work of the Scalability Programme.

The main system in both phases comprises two identical Cray XC clusters, continuing ECMWF's successful design of two self-sufficient clusters, each with its own storage but with equal access to the high performance working storage of the other. This cross-connection of storage provides most of the benefits of a single very large system, while the dual clusters add significantly to its resilience and allow flexibility in performing maintenance and upgrades; combined with separate resilient power and cooling systems, they protect against a wide range of possible failures.

The first phase started producing operational forecasts on 17 September 2014. The second phase was accepted on 29 June 2016, having produced its first operational forecast on 6 June.

System description

The Cray HPCF has two identical Cray XC40 clusters. Each has 20 cabinets of compute nodes and 13 of storage and weighs more than 50 metric tonnes. In Phase 2, the bulk of the system consists of compute nodes with two Intel Xeon EP E5-2695 v4 “Broadwell” processors, each with 18 cores. Four compute nodes sit on one blade, sixteen blades sit in a chassis and there are three chassis in a cabinet. This gives a maximum of 192 nodes or 6,912 processor cores per cabinet, 50% more than Phase 1. The actual number of compute nodes in a cabinet is sometimes lower than this maximum because, as well as compute nodes, each cluster has a number of “Service Nodes”. These have space for a PCI-Express card to support a connection to external resources such as storage or networks; they are consequently twice the size of a compute node, so only two fit on one blade.
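
For illustration, here is a quick back-of-the-envelope check in Python of the node and core counts implied by this packaging hierarchy; all of the figures are taken from the description above.

# Node and core counts implied by the XC40 packaging described above.
NODES_PER_BLADE = 4
BLADES_PER_CHASSIS = 16
CHASSIS_PER_CABINET = 3
CPUS_PER_NODE = 2
CORES_PER_CPU = {"Phase 1 (Ivy Bridge)": 12, "Phase 2 (Broadwell)": 18}

nodes_per_cabinet = NODES_PER_BLADE * BLADES_PER_CHASSIS * CHASSIS_PER_CABINET
print(nodes_per_cabinet)                                   # 192 nodes per cabinet

for phase, cores in CORES_PER_CPU.items():
    print(phase, nodes_per_cabinet * CPUS_PER_NODE * cores)
# Phase 1 (Ivy Bridge) 4608
# Phase 2 (Broadwell) 6912  -> 50% more cores per cabinet than Phase 1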

Cray XC40 compute blade
An XC40 compute blade. In the main part of the blade, you can see the heat sinks for the eight CPU chips of the four nodes. At the back of the blade is the Aries router.

High performance interconnect

Connecting all of this processing power together is the Aries™ interconnect, a network technology that Cray developed with support from the US Defense Advanced Research Projects Agency's High Productivity Computing Systems (HPCS) programme. The interconnect uses a “dragonfly” topology, so called because the dragonfly's wide body and narrow wings mirror the large number of short local electrical connections and the relatively small number of longer-distance optical connections.

Each blade in the system has a single Aries chip, and all the nodes on the blade connect to it via PCI-Express Gen3 links capable of a transfer rate of 16 gigabytes per second in each direction. Each Aries chip then connects via the chassis backplane to every other blade in the chassis. A chip has five further electrical connections, one to each of the other chassis in a group of two cabinets, which Cray describes as an “electrical group”. A further level of the network uses optical links to connect every electrical group to every other electrical group in the system; electrical connections are cheaper than optical ones but are limited to lengths of a few metres, so the longer-distance links are optical. The Aries chip design also removes the need for external routers.
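
As a rough illustration, the sketch below (in Python) simply tallies the connections around a single Aries chip using the figures given above; the number of optical links per group depends on the size and configuration of the system, so it is not included.

# Tally of the connections around one Aries chip, using the figures above.
NODES_PER_BLADE = 4                 # PCI-Express Gen3 links, ~16 GB/s in each direction
BLADES_PER_CHASSIS = 16
CHASSIS_PER_ELECTRICAL_GROUP = 6    # two cabinets x three chassis

pcie_links = NODES_PER_BLADE                             # node-to-Aries links on the blade
backplane_links = BLADES_PER_CHASSIS - 1                 # to every other blade in the chassis
electrical_links = CHASSIS_PER_ELECTRICAL_GROUP - 1      # to the other chassis in the 2-cabinet group

print(pcie_links, backplane_links, electrical_links)     # 4 15 5
print(NODES_PER_BLADE * 16, "GB/s of node bandwidth per direction into one Aries chip")  # 64 GB/s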

Cray Aries interconnect
A diagram of an XC40 compute blade. Each blade has four dual-socket nodes and an Aries router chip.

 

Dragonfly topology of the Cray Aries interconnect
The "dragonfly" topology of the Cray Aries interconnect has a large number of local electrical connections and a small number of longer distance optical connections.

Operating system

The nodes of the Cray system are optimised for their particular function. The bulk of the nodes run in “Extreme Scalability Mode”, in which each node runs a stripped-down version of the Linux operating system. Reducing the number of operating system tasks running on a node to the minimum is a key element of providing a highly scalable environment for applications: any time spent not running the user's application is wasted. If the application is a tightly coupled parallel one, in which results must be exchanged with processes running on other nodes before work can progress, then a delay caused by an operating system interruption on one node can leave other nodes idle, waiting for input, and so increase the runtime of the whole application.
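
This effect can be illustrated with a toy model in Python (a sketch only; the numbers are purely illustrative): in a bulk-synchronous job every node must finish its share of a step before the next step can start, so each step takes as long as the slowest node.

# Toy model of operating-system "jitter" in a tightly coupled parallel job.
# Each step finishes only when the slowest node has finished, so a rare
# OS interruption on any one node delays all of the others.
import random

random.seed(1)

def runtime(nodes, steps, work=1.0, interrupt_prob=0.01, interrupt_cost=0.5):
    total = 0.0
    for _ in range(steps):
        slowest = 0.0
        for _ in range(nodes):
            t = work
            if random.random() < interrupt_prob:    # occasional OS interruption
                t += interrupt_cost
            slowest = max(slowest, t)
        total += slowest                             # the step ends with the slowest node
    return total

print(runtime(nodes=1, steps=1000))      # ~1005: interruptions barely matter on one node
print(runtime(nodes=3600, steps=1000))   # ~1500: some node is interrupted in almost every step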

The other two types of nodes in a Cray system are "Service" nodes and "Multiple Applications Multiple User (MAMU)" nodes.  

MAMU nodes, which at ECMWF serve as pre-/post-processing nodes (PPN), run a full version of the Linux operating system and allow more than one batch application to run on a node. This mode is important: approximately four-fifths of the jobs run on the ECMWF systems require less than one full node. These jobs prepare the input for, and clean up after, the main parallel jobs. While there is a huge number of them, they account for less than 1% of the processing time offered by the system.

Cray compute cluster jobs
Each of the Cray compute clusters in the HPCF processes more than 200,000 jobs per day. The majority of these jobs are small and require less than one full node to run on. Despite the large number of these small jobs, they account for less than 1% of the processing time offered by the system.

Service nodes are generally not visible to users. They perform a number of functions, such as connecting the compute system to the storage and to the ECMWF networks, running the batch scheduler, and monitoring and controlling the system as a whole.

Storage 

High performance storage

High performance working storage for the compute clusters is provided by Lustre file systems on integrated Cray Sonexion appliances. Each cluster has two main pools of storage, one for time-critical operational work and the other for research work. Segregating time-critical from research storage helps avoid I/O contention between the workloads and thus limits the variability of run times for time-critical work. While each cluster has its own high performance working storage and is self-sufficient, it also has access, at equal performance, to the storage resources of the other cluster. This cross mounting allows work to be run flexibly on either cluster, in effect making the pair behave like a single system. It does introduce the risk that an issue on one storage system can affect both compute clusters, but, if necessary, the cross mounts can be dropped to limit the impact of any instability to a single compute cluster.

Each of our XC40 systems has about 10 petabytes of storage and offers more than 350 gigabytes per second of I/O bandwidth. 
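
To put those two figures together, here is a rough arithmetic sketch in Python (decimal units assumed) of how long it would take to stream the entire file system at full bandwidth.

# Rough sense of scale: time to stream the whole file system at full bandwidth.
capacity_gb = 10_000_000       # ~10 petabytes, expressed in gigabytes (decimal units)
bandwidth_gb_s = 350           # aggregate I/O bandwidth, gigabytes per second

seconds = capacity_gb / bandwidth_gb_s
print(round(seconds / 3600, 1), "hours to read or write the full 10 PB at full speed")   # ~7.9 hours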

Each Lustre file system has a metadata server, which provides the directory hierarchy and information about individual files, such as who owns them and who can access them, and a number of object servers, which provide the storage space for the files themselves.
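
This division of labour can be pictured with a highly simplified Python sketch (an illustration only, not the real Lustre protocol or API): the metadata server answers questions about names and ownership, while the file contents are striped across the object servers so that reads and writes can use many servers in parallel.

# Highly simplified picture of a Lustre-style file system: a metadata server
# records ownership, while file data is striped across several object servers.
STRIPE_SIZE = 4                                 # toy stripe size, in bytes
object_servers = [{} for _ in range(4)]         # each maps file name -> list of stripes
metadata = {}                                   # file name -> owner

def write_file(name, owner, data):
    metadata[name] = owner
    stripes = [data[i:i + STRIPE_SIZE] for i in range(0, len(data), STRIPE_SIZE)]
    for i, stripe in enumerate(stripes):        # round-robin across the object servers
        object_servers[i % len(object_servers)].setdefault(name, []).append(stripe)

def read_file(name):
    assert name in metadata                     # ownership/permission checks would happen here
    pieces, i = [], 0
    while True:
        stored = object_servers[i % len(object_servers)].get(name, [])
        if i // len(object_servers) >= len(stored):
            break
        pieces.append(stored[i // len(object_servers)])
        i += 1
    return b"".join(pieces)

write_file("example.grib", "ecmwf_user", b"0123456789abcdef")
print(read_file("example.grib"))                # b'0123456789abcdef'

In the real system a file is typically striped over a subset of the object storage targets; the point of the model is only that metadata and bulk data are handled by different servers, which is what lets the aggregate I/O bandwidth scale with the number of object servers.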

Sonexion single cabinet
A Cray Sonexion storage appliance. The rack contains a metadata server and six storage building blocks. Each building block has 80 disk drives and provides either 120 or 240 terabytes of usable storage.

General-purpose storage

The second type of storage in the HPCF is general-purpose storage provided by NetApp network-attached storage appliances via NFS. This storage provides space for home file systems and for storing applications. Compared to the Lustre file systems its capacity is relatively small, at 38 terabytes (enough for ten million copies of the complete works of Shakespeare), but it is very reliable and offers a number of advanced features, such as file system snapshots and replication, which Lustre does not currently implement.

Overview

The diagram below shows the basic components of the system.

Basic components of the Cray system

 

Facts and figures compared to previous HPCF

                                                  Phase 1 - Cray XC30                     Phase 2 - Cray XC40

Compute clusters                                  2                                       2
Peak performance (teraflops)                      3,593                                   8,499
Sustained performance on ECMWF codes (teraflops)  200                                     333

Each compute cluster

Compute nodes                                     3,505                                   3,610
Compute cores                                     84,120                                  129,960
Operating system                                  Cray CLE 5.2                            Cray CLE 5.2 UP04
High performance interconnect                     Cray Aries                              Cray Aries

Each compute node

Memory per node (gibibytes)                       64 (60 nodes with 128 and 4 with 256)   128 (4 nodes with 256)
Processor type                                    Intel E5-2697 v2 "Ivy Bridge"           Intel E5-2695 v4 "Broadwell"
CPU chips per node                                2                                       2
Cores per CPU chip                                12                                      18

Each processor core

Threads per core                                  1 or 2                                  1 or 2
Clock frequency (gigahertz)                       2.7                                     2.1
Operations per clock cycle                        8                                       16
L1 cache per core (kibibytes)                     32                                      64
L2 cache per core, private (kibibytes)            256                                     256
L3 cache per chip, shared (mebibytes)             30                                      45

Each compute cluster (storage)

High performance parallel storage (petabytes)     6.0                                     10.0
General-purpose storage (terabytes)               38                                      38
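
The headline peak-performance figures follow, to a reasonable approximation, from the per-core values in the table: cores per cluster × clock frequency × operations per clock cycle, summed over the two clusters. Here is a quick check in Python; the quoted figures are a few per cent lower than this simple product, which we assume reflects details of exactly which nodes are counted.

# Approximate theoretical peak from the table values:
# cores per cluster x clock (GHz) x floating-point operations per cycle, for two clusters.
def peak_teraflops(cores_per_cluster, clock_ghz, ops_per_cycle, clusters=2):
    gigaflops = clusters * cores_per_cluster * clock_ghz * ops_per_cycle
    return gigaflops / 1000.0

print(round(peak_teraflops(84_120, 2.7, 8)))     # 3634 teraflops (Phase 1; quoted: 3,593)
print(round(peak_teraflops(129_960, 2.1, 16)))   # 8733 teraflops (Phase 2; quoted: 8,499)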

For more information, see our Cray user documentation.