3.3 Data independence
The achievable parallelism within the IFS differs for each algorithmic step and depends primarily on the data independence within that step. For practical and efficiency reasons, not all of the available parallelism is exploited, as discussed below.
3.3.1 Grid point computations
With the exception of the semi-Lagrangian advection and the mixed fine/coarse-grained radiation calculations, grid-point computations contain only vertical dependencies. All grid columns can therefore be treated as independent of each other, allowing an arbitrary distribution of columns to processors. There are 138346 columns in a TL319L31 resolution model. An extreme decomposition of one grid column per processor would work, but it would suffer from excessively fine granularity (high overheads in calling all the physics subroutines compared to useful work) and from dynamic load imbalances. Additionally, performance on a computer with a vector architecture would suffer greatly because of the short vector lengths.
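The arbitrary distribution of columns described above can be illustrated with a minimal sketch. It assumes a simple contiguous block distribution; the function name `distribute_columns` and the processor count of 64 are hypothetical choices for illustration and are not taken from the IFS code.

```python
def distribute_columns(n_columns, n_procs):
    """Return one (start, stop) column index range per processor.

    Assumption: contiguous blocks whose sizes differ by at most one column.
    """
    base, extra = divmod(n_columns, n_procs)
    ranges = []
    start = 0
    for rank in range(n_procs):
        # The first `extra` processors each take one additional column.
        count = base + (1 if rank < extra else 0)
        ranges.append((start, start + count))
        start += count
    return ranges


N_COLUMNS = 138346  # column count quoted in the text for TL319L31
ranges = distribute_columns(N_COLUMNS, 64)  # 64 processors is a hypothetical choice
sizes = [stop - start for start, stop in ranges]
print(min(sizes), max(sizes))
```

With a few thousand columns per processor, the cost of calling the physics subroutines is amortized over many columns and the inner loops remain long enough for a vector machine. Dynamic imbalances are not addressed by this static split.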
The semi-Lagrangian advection requires access to data from columns in the upwind direction in order to compute departure points. This is handled by means of additional messages that transfer the data (potentially) required as input to this calculation before they are needed.
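The idea of transferring potentially required upwind data in advance can be sketched as a halo on a one-dimensional periodic row of grid points. This is an illustration only: the functions `halo_width` and `halo_points`, the bound based on a maximum wind speed, and all numerical values are assumptions made here, not a description of how the IFS sizes or exchanges its halos.

```python
import math


def halo_width(max_wind, dt, spacing):
    """Upper bound on the number of grid points a trajectory spans in one step.

    Assumption: the departure point lies at most max_wind * dt upwind.
    """
    return math.ceil(max_wind * dt / spacing)


def halo_points(start, stop, width, n_points):
    """Indices within `width` points outside [start, stop) on a periodic row.

    Both sides are included because the wind direction is not known in advance.
    """
    left = [(start - k) % n_points for k in range(width, 0, -1)]
    right = [(stop + k) % n_points for k in range(width)]
    return left + right


# Hypothetical values: 100 m/s wind, 3600 s time step, 62.5 km grid spacing.
width = halo_width(max_wind=100.0, dt=3600.0, spacing=62500.0)
# A processor owning points 0..159 of a 640-point periodic row.
needed = halo_points(0, 160, width, 640)
print(width, needed)
```

Because the halo is sized for the worst case, some of the transferred data will not be used in a given step, which is the sense of "potentially required" in the text.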
The coarse-grained radiation calculations are performed on a grid that is finer near the poles than towards the equator. Therefore, the calculations require a redistribution of columns to processors in order to achieve reasonable static load balancing.
3.3.2 Fourier transform
For the fast Fourier transforms (FFT), the Fourier coefficients are fully determined for each field from the grid-point data on a latitude. The individual latitudes are all independent, as are the vertical levels and the fields. In practice, independence across the fields is not exploited in the current code, so the quantity of exploitable parallelism is limited to (number of latitudes x levels), or 9920 for a TL319L31 model. This approach allows an efficient serial FFT routine to be used and avoids the fine-grain parallelism that is unavoidable in parallel FFT implementations. An additional performance constraint is imposed when running on a vector machine. Best performance is obtained by using a vectorized FFT in which the vector dimension runs across multiple transforms. In the IFS the vector dimension is over (field types) x (levels), so any parallelism greater than the number of latitude rows (320) has to be gained at the expense of vector performance.
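The trade-off between parallelism and vector length can be made concrete with a small arithmetic sketch. It assumes a simplified model in which latitudes are shared out first and levels are split only once the processor count exceeds the number of latitude rows. The function `fft_layout` and the default of 4 field types are hypothetical; the values 320 and 31 are those quoted in the text.

```python
def fft_layout(n_procs, n_lat=320, n_levels=31, n_fields=4):
    """Return (level_groups, vector_length) for a given processor count.

    Assumed model: each processor handles whole latitudes while
    n_procs <= n_lat; beyond that, the levels are split into groups.
    """
    # Integer ceiling of n_procs / n_lat.
    level_groups = max(1, -(-n_procs // n_lat))
    if level_groups > n_levels:
        raise ValueError("more processors than (latitudes x levels) work items")
    levels_per_group = -(-n_levels // level_groups)
    # The vector dimension runs over (field types) x (levels held locally).
    return level_groups, n_fields * levels_per_group


for n_procs in (64, 320, 640, 9920):
    print(n_procs, fft_layout(n_procs))
```

Under these assumptions the vector length stays at (field types) x 31 up to 320 processors and falls once the levels have to be divided, reaching the number of field types alone at the limit of 9920.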
3.3.3 Legendre transform
In the Legendre transforms, the zonal wave numbers, m, are dependent in the north-south direction, but are independent of each other, of the vertical levels, and of the fields. As with the FFTs, independence across fields is not exploited. Because of the triangular nature of the truncation, the work content of each transform decreases as m increases, so to achieve a good load balance the zonal waves are coupled together in pairs, each pair combining a long and a short wave number. This pairing halves the available parallelism, leaving parallelism of order ((number of zonal waves / 2) x levels), or 4960 for a TL319L31 model. Again, in practice, vector performance issues reduce this somewhat. The transform is accomplished using a matrix product routine with the vector dimension maximized by combining all fields to obtain a vector length of (field types) x (levels). Parallelism greater than 160 may be achieved at the expense of vector performance.
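The effect of pairing wave numbers can be checked with a short sketch. It assumes that the work for zonal wave number m is proportional to the number of total wave numbers n retained for it under triangular truncation, N - m + 1, and it pairs m with N - m. This is one natural pairing chosen for illustration; the text does not state which pairing the IFS uses, and the function names are hypothetical.

```python
def wave_work(m, n_trunc):
    """Count the total wave numbers n with m <= n <= n_trunc."""
    return n_trunc - m + 1


def pair_waves(n_trunc):
    """Pair zonal wave number m with n_trunc - m.

    Assumption: n_trunc is odd, so the number of zonal waves is even
    and every wave number has a partner.
    """
    n_waves = n_trunc + 1
    return [(m, n_trunc - m) for m in range(n_waves // 2)]


N_TRUNC = 319   # TL319
N_LEVELS = 31   # L31
pairs = pair_waves(N_TRUNC)
work = [wave_work(a, N_TRUNC) + wave_work(b, N_TRUNC) for a, b in pairs]
print(len(pairs), len(pairs) * N_LEVELS, set(work))
# prints: 160 4960 {321}
```

Every pair carries the same amount of work under this model, which gives the 160 balanced units and the parallelism of 4960 quoted above.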
3.3.4 Spectral computations
Computations in spectral space have different data dependencies in their various phases. The horizontal diffusion calculations have no dependencies. The semi-implicit time-stepping scheme imposes a vertical coupling, which means that data independence is restricted to wave space (m, n). However, if the stretched/rotated grid options are used, then coupling exists again in the n components and the m components. A similar n dependency exists for some semi-implicit schemes not used operationally at ECMWF. At a minimum, data independence is always available over the zonal waves m or the total wave number n. For practical and communication efficiency reasons, horizontal computations are performed with levels and m waves distributed as in the Legendre transforms. If the levels are distributed among processors, a transposition is needed to bring all levels together for a subset of the (m, n) spectral coefficients.
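The transposition mentioned above can be sketched as a regrouping of data between two layouts. The example simulates the processors as a list of dictionaries in a single process and uses a tiny truncation. The helper names `split` and `transpose`, the contiguous ownership rule, and the sizes are assumptions for illustration; a real implementation would use message passing.

```python
def split(items, n_parts):
    """Split a list into n_parts contiguous chunks of near-equal size."""
    base, extra = divmod(len(items), n_parts)
    chunks, start = [], 0
    for p in range(n_parts):
        stop = start + base + (1 if p < extra else 0)
        chunks.append(items[start:stop])
        start = stop
    return chunks


def transpose(by_level, n_procs, coeffs):
    """Regroup from a level-distributed to a coefficient-distributed layout.

    by_level[p] maps (level, coeff) -> value for the levels owned by p.
    The result maps the same keys, grouped by the owner of each coefficient.
    """
    owner = {c: q for q, chunk in enumerate(split(coeffs, n_procs)) for c in chunk}
    by_coeff = [dict() for _ in range(n_procs)]
    for local in by_level:  # each sending processor
        for (lev, c), value in local.items():
            by_coeff[owner[c]][(lev, c)] = value  # stands in for a message
    return by_coeff


# Hypothetical small case: truncation 5, 4 levels, 2 processors.
n_trunc, n_levels, n_procs = 5, 4, 2
coeffs = [(m, n) for m in range(n_trunc + 1) for n in range(m, n_trunc + 1)]
levels = list(range(n_levels))
by_level = [
    {(lev, c): float(lev * 100 + i) for lev in chunk for i, c in enumerate(coeffs)}
    for chunk in split(levels, n_procs)
]
by_coeff = transpose(by_level, n_procs, coeffs)
print([len(d) for d in by_coeff])
```

After the regrouping, each simulated processor holds every level for its own subset of (m, n) coefficients, which is the layout the vertically coupled semi-implicit calculations require.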