

# Fujitsu's Architectures and Collaborations for Weather Prediction and Climate Research

**Ross Nobes** 

Fujitsu Laboratories of Europe



ECMWF Workshop on HPC in Meteorology, October 2014

Copyright 2014 FUJITSU LIMITED

## Fujitsu's Approach to the HPC Market




#### The K computer



- June 2011 No.1 in TOP500 List (ISC11)
- November 2011 Consecutive No. 1 in TOP500 List (SC11)
- November 2011 ACM Gordon Bell Prize Peak-Performance (SC11)
- November 2012 No.1 in Three HPC Challenge Award Benchmarks (SC12)
- November 2012 ACM Gordon Bell Prize (SC12)
- November 2013 No.1 in Class 1 and 2 of the HPC Challenge Awards (SC13)
- June 2014 No.1 in Graph 500 "Big Data" Supercomputer Ranking (ISC14)

#### Build on the success of the K computer



## **Evolution of PRIMEHPC**


|              | K computer                | PRIMEHPC FX10  | Post-FX10                    |
|--------------|---------------------------|----------------|------------------------------|
| CPU          | SPARC64 VIIIfx            | SPARC64 IXfx   | SPARC64 XIfx                 |
| Peak perf.   | 128 GFLOPS                | 236.5 GFLOPS   | 1 TFLOPS ~                   |
| # of cores   | 8                         | 16             | 32 + 2                       |
| Memory       | DDR3 SDRAM                | $\leftarrow$   | HMC                          |
| Interconnect | Tofu Interconnect         | $\leftarrow$   | Tofu Interconnect 2          |
| System size  | 11 PFLOPS                 | Max. 23 PFLOPS | Max. 100 PFLOPS              |
| Link BW      | 5 GB/s x 2 (bidirectional) | $\leftarrow$   | 12.5 GB/s x 2 (bidirectional) |



#### Continuity in Architecture for Compatibility

- Upwards compatible CPU:
  - Binary-compatible with the K computer & PRIMEHPC FX10
  - Good byte/flop balance
- New features:
  - New instructions
  - Improved micro architecture
- For distributed parallel executions:
  - Compatible interconnect architecture
  - Improved interconnect bandwidth





#### 32 + 2 Core SPARC64 XIfx



- Rich micro architecture improves single thread performance
- 2 additional assistant cores to avoid OS jitter and to handle non-blocking MPI functions

|                     |                                       | K       | FX10     | Post-FX10                                                      |
|---------------------|---------------------------------------|---------|----------|----------------------------------------------------------------|
| Peak FP performance |                                       | 128 GF  | 236.5 GF | 1-TF class                                                     |
| Core config.        | Execution unit                        | FMA × 2 | FMA × 2  | FMA × 2                                                        |
|                     | SIMD                                  | 128 bit | 128 bit  | 256 bit wide                                                   |
|                     | Dual SP mode                          | NA      | NA       | 2x DP performance                                              |
|                     | Integer SIMD                          | NA      | NA       | Supported                                                      |
|                     | Single thread performance enhancement | -       | -        | Improved OOO execution, better branch prediction, larger cache |

#### **Flexible SIMD Operations**

New 256-bit wide SIMD functions enable versatile operations

- Four double-precision calculations
- Strided load/store, Indirect (list) load/store, Permutation, Concatenation
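The access patterns listed above can be illustrated with a small model. This is only a sketch of the semantics in plain Python: the function names are hypothetical and are not SPARC64 XIfx instruction mnemonics, and a register is modelled as a list of four doubles (256 bits / 64 bits).

```python
# Illustrative model of 256-bit SIMD operations on double-precision data.
# All function names are hypothetical; they are not real instruction names.

LANES = 256 // 64  # four 64-bit doubles fit in one 256-bit register


def strided_load(memory, base, stride):
    """Load LANES elements starting at `base`, stepping by `stride`."""
    return [memory[base + i * stride] for i in range(LANES)]


def indirect_load(memory, indices):
    """Load LANES elements from arbitrary positions given by an index list."""
    assert len(indices) == LANES
    return [memory[i] for i in indices]


def permute(register, pattern):
    """Reorder the lanes of one register according to `pattern`."""
    return [register[p] for p in pattern]


def concatenate(low, high, shift):
    """Take LANES consecutive lanes from the pair (low, high), starting at `shift`."""
    return (low + high)[shift:shift + LANES]


memory = [float(i) for i in range(32)]
a = strided_load(memory, base=0, stride=3)   # [0.0, 3.0, 6.0, 9.0]
b = indirect_load(memory, [7, 1, 30, 12])    # [7.0, 1.0, 30.0, 12.0]
c = permute(a, [3, 2, 1, 0])                 # [9.0, 6.0, 3.0, 0.0]
d = concatenate(a, b, shift=2)               # [6.0, 9.0, 7.0, 1.0]
```

Strided and indirect accesses matter for codes whose data are not contiguous in memory, which would otherwise have to be handled with scalar loads.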



## Hybrid Memory Cube (HMC)



- The increased arithmetic performance of the processor needs higher memory bandwidth
  - The required memory bandwidth can almost be met by HMC (480 GB/s)
  - Interconnect is boosted to 12.5 GB/s x 2 (bi-directional) with optical link

| Peak performance/node              | K   | FX10  | post-FX10 |
|------------------------------------|-----|-------|-----------|
| DP performance (Gflops)            | 128 | 236.5 | Over 1TF  |
| Memory Bandwidth (GB/s)            | 64  | 85    | 480       |
| Interconnect Link Bandwidth (GB/s) | 5   | 5     | 12.5      |
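The byte/flop balance implied by this table can be worked out directly. The sketch below uses the table values as given; the post-FX10 peak is taken as exactly 1000 Gflops, which is an assumption, since the slides only state "over 1 TF".

```python
# Per-node figures copied from the table above.
# Assumption: post-FX10 peak is taken as 1000 Gflops ("over 1 TF" in the slides).
nodes = {
    "K": {"gflops": 128.0, "mem_gbs": 64.0, "link_gbs": 5.0},
    "FX10": {"gflops": 236.5, "mem_gbs": 85.0, "link_gbs": 5.0},
    "post-FX10": {"gflops": 1000.0, "mem_gbs": 480.0, "link_gbs": 12.5},
}


def bytes_per_flop(bandwidth_gbs, gflops):
    """Bytes that can be moved per floating-point operation at peak rates."""
    return bandwidth_gbs / gflops


balance = {
    name: {
        "memory": bytes_per_flop(spec["mem_gbs"], spec["gflops"]),
        "link": bytes_per_flop(spec["link_gbs"], spec["gflops"]),
    }
    for name, spec in nodes.items()
}

for name, ratios in balance.items():
    print(f"{name:10s} memory B/F = {ratios['memory']:.2f}, "
          f"link B/F = {ratios['link']:.4f}")
```

Under these assumptions the memory balance is roughly 0.50 for K, 0.36 for FX10 and 0.48 for post-FX10, so the move to HMC brings the ratio back close to that of the K computer.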



| Amount per processor | Capacity       | Memory BW |
|----------------------|----------------|-----------|
| HMC x8               | 32 GB          | 480 GB/s  |
| DDR4-DIMM x8         | 32~128 GB      | 154 GB/s  |
| GDDR5 x16            | 8 GB           | 320 GB/s  |

HMC was adopted for main memory since its bandwidth is about three times that of DDR4

HMC can deliver the required bandwidth for high performance multi-core processors
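As a rough check of these statements, the memory options can be compared against a bandwidth target. The target here is an assumption and not a figure from the slides: it supposes a 1 Tflops processor that should keep the 0.5 byte/flop ratio of the K computer, which gives 500 GB/s.

```python
# Bandwidth per processor for each memory option, from the table above (GB/s).
memory_bw_gbs = {"HMC x8": 480.0, "DDR4-DIMM x8": 154.0, "GDDR5 x16": 320.0}

# Assumption (not from the slides): a 1000 Gflops processor that should keep
# the K computer's 0.5 byte/flop balance needs 500 GB/s.
target_gbs = 0.5 * 1000.0

# Fraction of the assumed target that each option delivers.
fraction_of_target = {name: bw / target_gbs for name, bw in memory_bw_gbs.items()}

# Ratio of HMC bandwidth to DDR4 bandwidth.
hmc_vs_ddr4 = memory_bw_gbs["HMC x8"] / memory_bw_gbs["DDR4-DIMM x8"]

for name, frac in fraction_of_target.items():
    print(f"{name:13s} delivers {frac:.0%} of the assumed 500 GB/s target")
print(f"HMC / DDR4 bandwidth ratio: {hmc_vs_ddr4:.2f}")
```

With this assumed target, HMC reaches 96% of the requirement, which is consistent with the statement that the required bandwidth can almost be met, and the HMC to DDR4 ratio comes out slightly above three.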



## **Tofu2 Interconnect**



#### Successor to Tofu Interconnect

- Highly scalable, 6-dimensional mesh/torus topology
- Logical 3D, 2D or 1D torus network from the user's point of view
- Increased link bandwidth by 2.5 times to 100 Gbps
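A mixed mesh/torus topology can be pictured by listing the direct neighbours of a node. The sketch below is a generic illustration, not Fujitsu's routing or mapping code: the function name, the axis sizes and the choice of which axes wrap around are all assumptions made for the example.

```python
def neighbours(coord, shape, wraps):
    """Return the direct neighbours of `coord` in a mixed mesh/torus.

    shape[i] is the size of axis i; wraps[i] is True for a torus axis
    (wrap-around links) and False for a mesh axis (no wrap-around).
    """
    result = set()
    for axis, (size, wrap) in enumerate(zip(shape, wraps)):
        for step in (-1, 1):
            pos = coord[axis] + step
            if wrap:
                pos %= size
            elif not 0 <= pos < size:
                continue  # mesh axis: no link beyond the edge
            if pos != coord[axis]:
                neighbour = list(coord)
                neighbour[axis] = pos
                result.add(tuple(neighbour))
    return sorted(result)


# Assumed example: three torus axes of size 4 plus three small axes of
# sizes 2, 3 and 2, of which only the middle one wraps around.
shape = (4, 4, 4, 2, 3, 2)
wraps = (True, True, True, False, True, False)
links = neighbours((0, 0, 0, 0, 0, 0), shape, wraps)
print(len(links))  # number of direct links of this node
```

For the assumed shape, each node has ten direct neighbours: two on each torus axis and one on each size-2 mesh axis.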



## A Next Generation Interconnect

#### Optical-dominant: 2/3 of network links are optical



#### Reduction in the Chip Area Size

- Process technology shrinks from 65 to 20 nm
- System-on-chip integration eliminates the host bus interface
- Chip area shrinks to 1/3 size



Tofu1 < 300mm<sup>2</sup> InterConnect Controller (65nm)



Tofu2 < 100mm<sup>2</sup> – SPARC64<sup>™</sup> XIfx (20nm)

# Throughput of Single Put Transfer

#### Achieved 11.46 GB/s throughput, 92% of the 12.5 GB/s link bandwidth
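The efficiency figure can be reproduced from the numbers in the slides, assuming it is measured against the 12.5 GB/s peak link bandwidth quoted for Tofu2. The same sketch checks the unit conversion to 100 Gbps and the 2.5 times increase over the 5 GB/s Tofu link.

```python
tofu1_link_gbs = 5.0    # GB/s per direction, from the slides
tofu2_link_gbs = 12.5   # GB/s per direction, from the slides
measured_put_gbs = 11.46

# Assumption: efficiency is measured throughput over peak link bandwidth.
efficiency = measured_put_gbs / tofu2_link_gbs

link_gbps = tofu2_link_gbs * 8              # gigabytes/s to gigabits/s
speedup = tofu2_link_gbs / tofu1_link_gbs   # Tofu2 relative to Tofu

print(f"efficiency = {efficiency:.1%}")
print(f"link rate  = {link_gbps:.0f} Gbps, {speedup:.1f}x the Tofu link")
```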




# Throughput of Concurrent Put Transfers

#### Linear increase in throughput without the host bus bottleneck





| Pattern    | Method                          | Latency   |
|------------|---------------------------------|-----------|
| One-way    | Put 8-byte to memory            | 0.87 µsec |
|            | Put 8-byte to cache             | 0.71 µsec |
| Round-trip | Put 8-byte ping-pong by CPU     | 1.42 µsec |
|            | Put 8-byte ping-pong by session | 1.41 µsec |
|            | Atomic RMW 8-byte               | 1.53 µsec |

# **Communication Intensive Applications**



- Fujitsu's math library provides 3D-FFT functionality with improved scalability for massively parallel execution
  - Significantly improved interconnect bandwidth
  - Optimised process configuration enabled by Tofu interconnect and job manager
  - Optimised MPI library provides high-performance collective communications



#### PRIMEHPC FX series


## **Enduring Programming Model**




#### Smaller, Faster, More Efficient

Highly integrated components with high-density packaging.

The performance of one chassis corresponds to approximately one cabinet of the K computer.





Scale-Out Smart for HPC and Cloud Computing

#### Fujitsu Server PRIMERGY CX400 M1 (CX2550 M1 / CX2570 M1)




## Fujitsu PRIMERGY Portfolio





# Fujitsu PRIMERGY CX400 M1




#### Feature Overview

- 4 dual-socket nodes in 2U density
- Up to 24 storage drives
- Choice of server nodes
  - Dual socket servers with Intel® Xeon® processor E5-2600 v3 product family (Haswell)
  - Optionally up to two GPGPU or co-processor cards
- DDR4 memory technology
- Cool-safe® Advanced Thermal Design enables operation at higher ambient temperatures



# Fujitsu PRIMERGY CX2550 M1

#### **Feature Overview**

- Highest performance & density
  - Condensed half-width-1U server
  - Up to 4x CX2550 M1 into a CX400 M1 2U chassis
  - Intel Xeon E5-2600 v3 product family, 16 DIMMs per server node with up to 1,024 GB DDR4 memory
- High reliability & low complexity
  - Variable local storage: 6x 2.5" drives per node, 24 in total
  - Support for up to 2x PCIe SSD per node for fast caching
  - Hot-plug for server nodes, power supplies and disk drives enables enhanced availability and easy serviceability



# Fujitsu PRIMERGY CX2570 M1


#### Feature Overview

- HPC optimization
  - Intel Xeon E5-2600 v3 product family, 16 DIMMs per server node with up to 1,024 GB DDR4 memory
  - Optional two high-end GPGPU/co-processor cards (Nvidia Tesla, Grid or Intel Xeon Phi)
- High reliability & low complexity
  - Variable local storage: 6x 2.5" drives per node, 12 in total
  - Support for up to 2x PCIe SSD per node for fast caching
  - Hot-plug for server nodes, power supplies and disk drives enables enhanced availability and easy serviceability






#### **Extreme Weather**

- Strong interest within Wales in impacts of extreme weather events
- Fujitsu has established collaborations in this area
  - HPC Wales studentships
  - Knowledge Transfer Partnership with Swansea University





# HPC Wales – Fujitsu PhD Studentships

20 PhD studentships in Welsh Government priority sectors including "Energy and Environment"

| University         | Academic         | Project topic                                                                                          |
|--------------------|------------------|--------------------------------------------------------------------------------------------------------|
| Cardiff University | Dr Michaela Bray | Extreme weather events in W[...] flood inundation modelling; Unified Model linked to river-e[...]      |
| Bangor University  | Dr Reza Hashemi  | Simulating the impacts of clin[...] dynamics; Integrated catchment-to-coas[...]                        |
| Bangor University  | Dr Simon Creer   | Biogeochemical analysis of the effects of drought on CO<sub>2</sub> release from peat bogs and fens    |



#### Knowledge Transfer Partnership



- UK Government programme to promote skills uptake
- Partnership between Swansea University and Fujitsu
- Funding from Welsh Government including overseas visits
- Met Office Unified Model performance
- Modelling of extreme weather and flood risk



#### Contribute to the development of NG-ACCESS

Next Generation Australian Community Climate and Earth-System Simulator (NG-ACCESS) - A Roadmap 2014-2019

#### Collaboration with NCI



PROVIDING AUSTRALIAN RESEARCHERS WITH WORLD-CLASS HIGH-END COMPUTING SERVICES




 Planned team of 10: 4 funded positions + 2 in-kind (NCI) + 2 in-kind (BoM) + 2 in-kind (Fujitsu)



#### ACCESS





## **Collaborative Activities**



- Understanding of scalability bottlenecks
- IO server optimisation
- Improved use of thread (OpenMP) parallelism

#### Activities within the ACCESS Optimisation Project

| Area              | Activities |
|-------------------|------------|
| Atmosphere        | <ul> <li>Scaling and optimisation of 2-3 global configurations of UM</li> <li>Benchmark configurations of UM provided by the Bureau</li> </ul> |
| Ocean             | <ul> <li>Scaling and optimisation of two global resolution configurations of the MOM ocean model</li> <li>Coupled MOM-CICE utilising the OASIS3-MCT coupler</li> </ul> |
| Coupled Climate   | <ul> <li>Performance characteristics of ACCESS-CM 1.3 and then introducing a 0.25 degree global ocean into ACCESS-CM 1.4</li> </ul> |
| Data Assimilation | <ul> <li>Collect performance and scalability data for the various data assimilation codes used in ACCESS Parallel Suites</li> <li>Initial work will focus on scalability of 4DVAR</li> </ul> |

#### Summary

Fujitsu is committed to developing high-end HPC platforms

- Fujitsu selected as RIKEN's partner in FLAGSHIP 2020 project to develop an exascale-class supercomputer
- Post-FX10 is a step on the way to exascale computing

#### October 1, 2014

#### **RIKEN Selects Fujitsu to Develop New Supercomputer**

Oct. 1 — Following an open bidding process, Fujitsu Ltd. has been selected to work with RIKEN to develop the basic design for Japan's next-generation supercomputer.

Fujitsu also offers x86 cluster solutions across the range from departmental to petascale

Fujitsu is promoting collaborative research and development programmes with key partners and customers
