### Experiences Optimizing NWP for Intel's Xeon Phi Processor

#### **John Michalakes**

Naval Research Laboratory Marine Meteorology Division University Corporation for Atmospheric Research Boulder, Colorado USA

17<sup>th</sup> Workshop on High Performance Computing in Meteorology

24 October 2016

## **WRF Performance Timeline**



WRF results on KNL from Gokhale & Michalakes in:

Reinders, J. and J. Jeffers. Intel Xeon Phi Processor High Performance Programming, 2nd Edition Knights Landing Edition. Morgan Kaufman. 2016. ISBN: 9780128091944

## Outline

- A Brief History of Time(-ings)
- Hardware and software
  - Knights Landing overview
  - Applications: NEPTUNE, WRF and kernels
- Optimizing for KNL
  - Roofline
  - Baselines and optimizations
- Other observations
- Summary

### Hardware: Xeon Multi/Many-core Computing Platforms



- Up to ~ 1 TFLOP/s in DP\*
- Up to ~ 2 TFLOP/s in SP\*
- Up to 3072 GiB DDR4 RAM\*
- ~ 154 GB/s bandwidth\*
- Hardware-rich: forgiving of sub-optimal code

- PCIe add-in card
- ~ 1.2 TFLOP/s in DP
- ~ 2.4 TFLOP/s in SP
- Up to 16 GiB GDDR5 RAM
- ~ 176 GB/s bandwidth

- Socket version or coprocessor
- 3+ TFLOP/s in DP
- 6+ TFLOP/s in SP
- Up to 16 GiB MCDRAM
- ~ 400 GB/s MCDRAM bandwidth
- Up to 384 GiB DDR4 RAM
- ~ 90 GB/s DDR4 bandwidth
- Supports common OS

http://colfaxresearch.com/how-series-archive/

### Hardware: Xeon Multi/Many-core Computing Platforms



- Intel Xeon Phi 7250 (Knights Landing) announced at ISC'16 in June
  - 14 nanometer feature size, > 8 billion transistors
  - 68 cores, 1.4 GHz modified "Silvermont" with out-of-order instruction execution
  - Two 512-bit wide Vector Processing Units per core
  - Peak ~3 TF/s double precision, ~6 TF/s single precision
  - 16 GB MCDRAM (on-chip) memory, > 400 GB/s bandwidth
  - "Hostless" no separate host processor and no "offload" programming
  - Binary compatible ISA (instruction set architecture)

# Models: NEPTUNE/NUMA



Andreas Mueller, NPS, Monterey, CA; and M. Kopera, S. Marras, and F. X. Giraldo. "Towards Operational Weather Prediction at 3.0km Global Resolution With the Dynamical Core NUMA". 96<sup>th</sup> Amer. Met. Society Annual Mtg. January, 2016. https://ams.confex.com/ams/96Annual/webprogram/Paper288883.html

# Models: NEPTUNE/NUMA

- Spectral element
  - 4<sup>th</sup>, 5<sup>th</sup> and higher-order\* continuous Galerkin (discontinuous planned)
  - Cubed Sphere (also icosahedral)
  - Adaptive Mesh Refinement (AMR) in development





NEPTUNE 72-h forecast (5 km resolution) of accumulated precipitation for Hurr. Sandy



Example of Adaptive Grid tracking a severe event courtesy: Frank Giraldo, NPS

- Computationally dense but highly scalable
  - Constant width-one halo communication
  - Good locality for next generation HPC
  - Supports hybrid OpenMP/MPI (new!)

\*"This is not the same 'order' as is used to identify the leading term of the error in finite-difference schemes, which in fact describes accuracy. Evaluation of Gaussian quadrature over N+1 LGL quadrature points will be exact to machine precision as long as the polynomial integrand is of the order 2x(N-1) -3, or less." Gabersek et al. MWR Apr 2012.DOI: 10.1175/MWR-D-11-00144.1

- Most work in MIC programming involves *optimization* to achieve best share of peak performance
  - Parallelism:
    - KNL has up to 288 hardware threads (4 per core) and a total of more than 2000 floating point units on the chip
    - Exposing coarse, medium and especially fine-grain (vector) parallelism in application to efficiently use Xeon Phi
  - Locality:
    - Reorganizing and restructuring data structures and computation to improve data reuse in cache and reduce floating point units idle-time waiting for data: *memory latency*
    - Kernels that do not have high data reuse or that do not fit in cache require high memory bandwidth
- The combination of these factors affecting performance on the Xeon Phi (or any contemporary processor) can be characterized in terms of *computational intensity*, and conceptualized using The Roofline Model

- Roofline Model of Processor Performance
  - Bounds application performance as a function of computational intensity – the number of floating point operations per byte moved between the processor and a level of the memory hierarchy.



Thanks: Doug Doerfler, LBNL



\*Sometimes referred to as:

**Arithmetic intensity** (registers→L1): *largely algorithmic* **Operational intensity** (LLC→DRAM): *improvable* 

*Williams, S. W., A. Waterman, D. Patterson.* Electrical Engineering and Computer Sciences, University of California at Berkeley Technical Report No. UCB/EECS-2008-134. October 17, 2008

slide 21

- Roofline Model of Processor Performance
  - Bounds application performance as a function of computational intensity
  - If intensity is high enough, application is
     "compute bound" by floating point capability
    - 2333 GFLOP/s double precision vector & fma



(Vectorization means the processor can perform 8 double precision operations at once as long as the operations are independent)



- Roofline Model of Processor Performance
  - Bounds application performance as a function of computational intensity
  - If intensity is high enough, application is
     "compute bound" by floating point capability
    - 2333 GFLOP/s double precision vector & FMA
    - 1166 GFLOP/s double precision vector, no FMA



(FMA instructions perform a floating point multiply and a floating point addition as the same operation when the compiler can detect such expressions of the form:  $result = A^*B+C$ )



- Roofline Model of Processor Performance
  - Bounds application performance as a function of computational intensity
  - If intensity is high enough, application is
     "compute bound" by floating point capability
    - 2333 GFLOP/s double precision vector & FMA
    - 1166 GFLOP/s double precision vector, no FMA
    - 145 GFLOP/s double precision no vector nor FMA





- Roofline Model of Processor Performance
  - Bounds application performance as a function of computational intensity
  - If intensity is high enough, application is
     "compute bound" by floating point capability
  - If intensity is not enough to satisfy demand for data by the processor's floating point units, the application is "memory bound"
    - 128 GB main memory (DRAM)





- Roofline Model of Processor Performance
  - Bounds application performance as a function of computational intensity
  - If intensity is high enough, application is
     "compute bound" by floating point capability
  - If intensity is not enough to satisfy demand for data by the processor's floating point units, the application is "memory bound"
    - 128 GB main memory (DRAM)
    - 16 GB High Bandwidth memory (MCDRAM)



*Williams, S. W., A. Waterman, D. Patterson.* Electrical Engineering and Computer Sciences, University of California at Berkeley Technical Report No. UCB/EECS-2008-134. October 17, 2008 Among the ways to use MCDRAM:

- Configure KNL to use MCDRAM as direct mapped Cache
- Configure KNL to use MCDRAM as a NUMA region numactl –membind=0 ./executable (main memory) numactl –membind=1 ./executable (MCDRAM)
- More info: <u>http://colfaxresearch.com/knl-mcdram</u>

- Roofline Model of Processor Performance
  - Bounds application performance as a function of computational intensity
  - If intensity is high enough, application is
     "compute bound" by floating point capability
  - If intensity is not enough to satisfy demand for data by the processor's floating point units, the application is "memory bound"
    - 128 GB main memory (DRAM)
    - 16 GB High Bandwidth memory (MCDRAM)
  - KNL is nominally 3 TFLOP/sec but to saturate full floating point capability, need:
    - 0.35 flops per byte from L1 cache



- Roofline Model of Processor Performance
  - Bounds application performance as a function of computational intensity
  - If intensity is high enough, application is
     "compute bound" by floating point capability
  - If intensity is not enough to satisfy demand for data by the processor's floating point units, the application is "memory bound"
    - 128 GB main memory (DRAM)
    - 16 GB High Bandwidth memory (MCDRAM)
  - KNL is nominally 3 TFLOP/sec but to saturate full floating point capability, need:
    - 0.35 flops per byte from L1 cache
    - 1 flop per byte from L2 cache



- Roofline Model of Processor Performance
  - Bounds application performance as a function of computational intensity
  - If intensity is high enough, application is
     "compute bound" by floating point capability
  - If intensity is not enough to satisfy demand for data by the processor's floating point units, the application is "memory bound"
    - 128 GB main memory (DRAM)
    - 16 GB High Bandwidth memory (MCDRAM)
  - KNL is nominally 3 TFLOP/sec but to saturate full floating point capability, need:
    - 0.35 flops per byte from L1 cache
    - 1 flop per byte from L2 cache
    - 6 flops per byte from high bandwidth memory



- Roofline Model of Processor Performance
  - Bounds application performance as a function of computational intensity
  - If intensity is high enough, application is
     "compute bound" by floating point capability
  - If intensity is not enough to satisfy demand for data by the processor's floating point units, the application is "memory bound"
    - 128 GB main memory (DRAM)
    - 16 GB High Bandwidth memory (MCDRAM)
  - KNL is nominally 3 TFLOP/sec but to saturate full floating point capability, need:
    - 0.35 flops per byte from L1 cache
    - 1 flop per byte from L2 cache
    - 6 flops per byte from high bandwidth memory
    - 25 flops per byte from main memory



- Roofline Model of Processor Performance
  - Bounds application performance as a function of computational intensity
  - If intensity is high enough, application is
     "compute bound" by floating point capability
  - If intensity is not enough to satisfy demand for data by the processor's floating point units, the application is "memory bound"
    - 128 GB main memory (DRAM)
    - 16 GB High Bandwidth memory (MCDRAM)
  - KNL is nominally 3 TFLOP/sec but to saturate full floating point capability, need:
    - 0.35 flops per byte from L1 cache
    - 1 flop per byte from L2 cache
    - 6 flops per byte from high bandwidth memory
    - 25 flops per byte from main memory
  - Hard to come by in real applications!

slide 28

 NEPTUNE benefits from MCDRAM (breaks through the DRAM ceiling) but realizing only a fraction of the MCDRAM ceiling



NEPTUNE E14P3L40 (U.S. Navy Global Model Prototype)



#### NEPTUNE E14P3L40 (U.S. Navy Global Model Prototype)



NEPTUNE **E28**P3L40 (U.S. Navy Global Model Prototype)



NEPTUNE diffusion (create\_laplacian)

### Optimizations

40

#### - Compile time specification of important loop ranges

- Number of variables per grid point
- Dimensions of an element
- Flattening nested loops to collapse for vectorization
- Facilitating inlining
  - Subroutines called from element loops in CREATE\_LAPLACIAN and CREATE\_RHS
  - Matrix multiplies (need to try MKL DGEMM)
- Splitting apart loops where part vectorizes and part doesn't
- Fixing unaligned data (still a benefit on KNL)

| Create_<br>laplacian | CPU<br>Time | L2 Hit<br>Rate | L2 Hit<br>Bound | L2 Miss<br>Bound | L2 Miss<br>Count | CPI Rate | Instructions |
|----------------------|-------------|----------------|-----------------|------------------|------------------|----------|--------------|
| Orig.                | 963.645     | 0.80597        | 0.0286706       | 0.0933823        | 585,017,550      | 1.38354  | 9.9862E+11   |
| New                  | 422.305     | 0.81075        | 0.0584720       | 0.184662         | 507,015,210      | 2.13113  | 2.8283E+11   |

Hardware Threads (68 cores)



NEPTUNE diffusion (create\_laplacian)

# **NEPTUNE Results**

| Time in seconds for 30 time steps<br>Lower is better |                            | create_laplacian |                    |       | Who      |                                 |         |
|------------------------------------------------------|----------------------------|------------------|--------------------|-------|----------|---------------------------------|---------|
|                                                      |                            | Original         | Optimized Benefit: |       | Original | With optimized create_laplacian | Benefit |
| E14                                                  | KNL (68c)                  | 1.53             | 1.20               | 1.28x | 7.45     | 6.66                            | 1.09x   |
| (15,288<br>elements)                                 | BDW (72c 2S)               | 1.43             | 0.72               | 1.99x | 6.98     | 6.31                            | 1.11x   |
|                                                      | Advantage KNL $ ightarrow$ | 0.93x            | 0.60x              | -     | 0.94x    | 0.93x                           | -       |
| E28                                                  | E28 (KNL)                  | 5.92             | 2.65               | 2.23x | 26.4     | 23.5                            | 1.15x   |
| (61,152<br>elements)                                 | E28 (2S<br>BDW)            | 6.07             | 3.61               | 1.68x | 32.2     | 30.2                            | 1.07x   |
|                                                      | Advantage KNL $ ightarrow$ | 1.03x            | 1.36x              | -     | 1.22x    | 1.31x                           | -       |

- Benchmark comparisons for two "supercell" workloads: E28 workload is 4x E14
- Original NEPTUNE code and with optimizations to most expensive diffusion routine
- Hardware:
  - KNL: Knights Landing 7250 (B0), 68c, 1.4 GHz
  - BDW: E5-2697v4, 2.3 GHz, 18c x 2 sockets

## **NEPTUNE Results**



KNL efficiency increases with larger workload; Broadwell declines.

### **Realizing performance potential?**

- Whole code
  - NEPTUNE: 56 GF/s (< 2 percent D.P. peak fp)</p>
  - WRF: 125 GF/s (~2 percent S.P. peak fp)
- Discouraged? Remember:
  - NWP is multiphasic with mix of different roofline characteristics
    - Optimized create\_laplacian (NEPTUNE diffusion) is reaching 318 GF/s (> 10 percent D.P. peak fp)
    - Optimized WSM5 (WRF microphysics) is reaching 447 GF/s (7.5 percent of S.P. peak fp)
  - Relatively flat profile need to pay specific attention to large swath of code

### Other observations: hybrid parallelism

### NEPTUNE on KNL

- Best performance with straight MPI, 2 ranks per core
- WRF on KNL (CONUS 12km benchmark)
  - Best performance with **all OpenMP** threads, 2 per core
- Why the difference?
  - Both WRF and KNL use high-level SPMD threads
  - But NEPTUNE does many hundreds of OMP BARRIER synchronizations per time step
    - Needed to prevent race conditions on node points shared between neighboring elements computed on different threads
    - Andreas Mueller's work on Blue Gene/Q showed OpenMP was a win in spite of OMP BARRIER -- BG/P barriers more efficient?
    - "CGd" organization (used in HOMME) will eliminate shared nodes between neighboring elements, at cost of additional memory
- Be careful of "first touch" in Sub-NUMA clustering mode

## **Other observations: NUMA modes**



Be careful of "first touch" in Sub-NUMA clustering mode

# Other observations: SMT (hyperthreading)

- Both WRF and NEPTUNE give best results using half the available (2 of 4) hardware threads per KNL core
- Earlier KNC gave best results using 3 of 4 available hardware threads per core.
- Why?
  - KNL cores have out-of order instruction execution
  - ILP is hiding some of the latency that in-order KNC needed 3 threads to hide



## **Physics Kernels (Microphysics)**

| WSM5 Microphysics<br>(single precision)                                                                         | original                            | optimized                                                                        | original→optim.                                                                                                   |  |
|-----------------------------------------------------------------------------------------------------------------|-------------------------------------|----------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------|--|
| KNL                                                                                                             | 102 GF/s                            | 447 GF/s                                                                         | 4.4x                                                                                                              |  |
| BDW                                                                                                             | 147 GF/s 307 GF/s                   |                                                                                  | 2.1x                                                                                                              |  |
| KNL / BDW                                                                                                       | 0.69 x <b>1.46 x</b>                |                                                                                  |                                                                                                                   |  |
| 500<br>450<br>400<br>(uo 350<br>300<br>400<br>Knights Corner (60 cord<br>143.4 GF/s<br>400<br>50 GF/s<br>0<br>0 | W<br>W<br>W<br>W<br>W<br>W<br>W<br> | /SM5 (Xeon)<br>/SM5 (Xeon Phi)<br>/SM5 (GPU)<br>/RF (Xeon)<br>/RF (Xeon Phi)<br> | Knights Landing (68 cores)<br>447 GF/s<br>Broadwell (2S, 36 cores)<br>307 GF/s<br>well (2S, 28 cores)<br>259 GF/s |  |
| Jan-11 Jan-12                                                                                                   | 2 Jan-13                            | Jan-14 Jai                                                                       | n-15 Jan-16                                                                                                       |  |

http://www2.mmm.ucar.edu/wrf/WG2/GPU/WSM5.htm

# Physics Kernels (radiation and chemistry)

| RRTMG Radiation<br>(double precision) | original | optimized | original→optim. |
|---------------------------------------|----------|-----------|-----------------|
| КМС                                   | 10 GF/s  | 36 GF/s   | 3.6x            |
| KNL (7250 68c, 1.4 GHz)               | 64 GF/s  | 116 GF/s  | 1.8x            |
| BDW (2S E5-2697v4 2.3 GHz)            | 95 GF/s  | 128 GF/s  | 1.35x           |
| KNL / KNC                             | 3.6 x    | 2.2 x     |                 |
| KNL / BDW                             | 0.67 x   | 0.91 x    |                 |

Michalakes, Iacono and Jessup. Optimizing Weather Model Radiative Transfer Physics for Intel's Many Integrated Core (MIC) Architecture. *Parallel Processing Letters*. World Scientific. Accepted.

| <b>Reactive Chemistry</b> | original | optimized | original→optim. |
|---------------------------|----------|-----------|-----------------|
| KNL                       | 32 GF/s  | 70 GF/s   | 2.18 x          |
| BDW                       | 59 GF/s  | 34 GF/s   | 0.57 x          |
| KNL / BDW (unopt.)        | 0.54 x   | 1.18 x    |                 |

http://www2.mmm.ucar.edu/wrf/WG2/GPU/Chem.htm

# **Overall experiences**

Lift

- Lift factors
  - Improved memory bandwidth in KNL's MCDRAM
  - Standard languages, programming models, and tools (next slide)
  - Improvements for Xeon usually benefit Phi and vice versa
- Drag factors
  - Low computational intensity, flat profiles in NWP models
  - Unexamined use of double precision, -fp-model precise, fullprecision intrinsics
  - No hardware floating point counters
    - Today's performance based analysis depends on computational intensity, which requires data traffic and floating point measurements of code and representative (ie. *large*) workloads
    - Only an emulator (SDE) is available that runs orders of magnitude slower than real-time
  - Leap-frogging Xeon and Xeon Phi

## **Tools overview**

|                                              | Profiling                               | Opcount<br>and FP<br>rate                            | Memory traffic<br>and BW                                          | Thread utilization | Vector<br>utilization                     | Roofline<br>analysis               | Execution<br>traces                                              |
|----------------------------------------------|-----------------------------------------|------------------------------------------------------|-------------------------------------------------------------------|--------------------|-------------------------------------------|------------------------------------|------------------------------------------------------------------|
| Intel Vtune<br>Amplifier                     | By routine,<br>line, and<br>instruction |                                                      | Timeseries B/W<br>output but and<br>traffic counts                | Histogram          | General<br>exploration<br>output          |                                    | B/W, CPU<br>time, spin-<br>time, MPI<br>comms, as fcn<br>of time |
| Intel Software<br>Development<br>Emulator    |                                         | Only way to<br>count FP<br>ops on<br>current<br>Xeon | Count load/store<br>between<br>registers and first<br>level cache |                    |                                           |                                    | Not tried                                                        |
| Intel Advisor                                | Not tried                               | Not tried                                            | Not tried                                                         |                    | Routine by<br>routine and<br>line by line | Routine by<br>routine<br>(In beta) |                                                                  |
| MPE/Jumpshot<br>(ANL/LANS)                   | By routine,<br>Instrument-<br>ed region |                                                      |                                                                   |                    |                                           |                                    | Detailed<br>Gantt plots                                          |
| Empirical<br>Roofline Toolkit<br>(NERSC/LBL) |                                         | Max FP<br>rate                                       | Multiple levels<br>of memory<br>hierarchy                         |                    |                                           | Yes                                |                                                                  |
| Linux Perf<br>PAPI or other                  | Possible                                | Possible                                             | Count<br>load/store<br>between<br>LLC and Mem.                    |                    | Possible                                  |                                    | Possible                                                         |

### Acknowledgements

- Intel: Mike Greenfield, Ruchira Sasanka, Indraneil Gokhale, Larry Meadows, Zakhar Matveev, Dmitry Petunin, Dmitry Dontsov, Naik Sumedh, Karthik Raman
- NRL and NPS: Alex Reinecke, Kevin Viner, Jim Doyle, Sasa Gabersek, Andreas Mueller (now at ECMWF), Frank Giraldo
- NCAR: Bill Skamarock, Michael Duda, John Dennis, Chris Kerr
- NOAA: Mark Govett, Jim Rosinski, Tom Henderson (now at Spire)
- Google search, which usually leads to something John McAlpin has already figured out...