Does better mean better forecasts for everyone?

9 December 2025
Zied Ben Bouallègue
Martin Janoušek
Leonardo Olivetti (Uppsala University, Sweden)

Machine learning (ML) approaches to weather forecasting have made the headlines over the past two years, and some ML models are now operational. At ECMWF, the AIFS Single is one such system. Alongside other models, it generates forecasts with lead times of up to 15 days, four times daily.

The AIFS Single forecast undergoes continuous evaluation. It is routinely compared with the deterministic physics-based IFS Control in terms of domain-averaged scores, computed over a season, for example. Typically, forecast errors are plotted as a function of forecast lead time for different domains.

In many of these comparisons, the AIFS Single outperforms the IFS Control in terms of domain-averaged root mean squared error (RMSE) (Figure 1). But averages can hide regional differences. Are these gains consistent everywhere, or do some areas benefit more than others?

Inspired by a recent study on the concept of fairness by Olivetti and Messori (2025), we examined whether the benefits of using the ML-based approach are perceived equally across the globe or whether certain regions exhibit different patterns of performance. 


Figure 1: Root mean squared error (RMSE) of 2 m temperature forecasts as a function of the forecast lead time over summer 2025 in the northern (left) and southern (right) hemisphere extratropics. The AIFS Single shows a clear advantage over the IFS Control at all lead times in both hemispheres.

Methodology 

To address this, we used the following methodology. First, we computed the average score over a season at each grid point, for each forecast lead time and each model separately. The operational IFS analysis was used to verify 2 m temperature (T2m) in this instance. In Figure 2, we show the differences in RMSE between the AIFS Single and the IFS Control at two lead times, day 1 and day 6.
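As an illustration, the per-grid-point averaging might look like the following minimal sketch in Python. The array shapes, variable names and synthetic data are hypothetical placeholders, not the operational verification code.

```python
import numpy as np

def gridpoint_rmse(forecasts, analyses):
    """RMSE at each grid point, averaged over a season.

    forecasts, analyses: arrays of shape (n_days, n_lat, n_lon) for one
    model, one variable (here T2m) and one forecast lead time.
    Returns an (n_lat, n_lon) field of RMSE values.
    """
    return np.sqrt(np.mean((forecasts - analyses) ** 2, axis=0))

# Hypothetical example: one summer season (92 days) on a coarse 4-degree grid.
rng = np.random.default_rng(0)
analyses = rng.normal(285.0, 10.0, size=(92, 45, 90))   # verifying analysis
fc_aifs = analyses + rng.normal(0.0, 1.0, size=analyses.shape)
fc_ifs = analyses + rng.normal(0.0, 1.2, size=analyses.shape)

rmse_aifs = gridpoint_rmse(fc_aifs, analyses)           # (45, 90)
rmse_ifs = gridpoint_rmse(fc_ifs, analyses)
rmse_diff = rmse_aifs - rmse_ifs                        # negative = AIFS better (Figure 2)
```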


Figure 2: Difference in RMSE of 2 m temperature forecasts between the AIFS Single and the IFS Control for a forecast lead time of 1 day (top) and 6 days (bottom).

Next, we counted the number of grid points where the AIFS Single performs better (lower RMSE) than the IFS Control – the blue grid points in Figure 2. Let's denote this number c_all. Normalising it by the total number of grid points in the domain, n_all, gives the ratio r_all = c_all / n_all. A value equal to 0.5 means the AIFS is, on average, better than the IFS at half of the grid points; values above 0.5 indicate the AIFS is superior at more than half of them.
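A minimal sketch of this counting step, under the same hypothetical setup (the optional area weighting is our addition; the article counts grid points without weighting):

```python
import numpy as np

def better_ratio(rmse_a, rmse_b, weights=None):
    """Fraction of grid points where model A beats model B (lower RMSE).

    rmse_a, rmse_b: (n_lat, n_lon) RMSE fields for the two models.
    weights: optional (n_lat, n_lon) area weights (e.g. cos(latitude));
    the article counts grid points, so the default is unweighted.
    """
    better = rmse_a < rmse_b               # the blue grid points in Figure 2
    if weights is None:
        return better.mean()               # r_all = c_all / n_all
    return np.sum(better * weights) / np.sum(weights)

# r_all above 0.5 means the AIFS is better at more than half of the grid points.
```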

We then applied different masks to select sub-samples of the verification domains according to four criteria:  

1) Land mask: land–sea mask greater than 0.5  

2) Flat mask: model orography lower than 500 m  

3) High GDP mask: logarithm of GDP per capita (adjusted for Purchasing Power Parity, downscaled by Wang et al., 2022) greater than nine

4) Urban mask: urban tile cover greater than 0.05  

These four masks are shown in Figure 3. 
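As an illustration, these thresholds could be turned into boolean masks as in the sketch below; the input field names are hypothetical placeholders for the IFS static fields and the GDP dataset.

```python
def build_masks(lsm, orog, log_gdp, urban_frac):
    """Boolean masks for the four stratification criteria (Figure 3).

    lsm        - land-sea mask, 0 (sea) to 1 (land)
    orog       - model orography in metres
    log_gdp    - log GDP per capita (PPP-adjusted, Wang et al., 2022)
    urban_frac - urban tile cover fraction, 0 to 1
    All inputs are (n_lat, n_lon) numpy fields on the verification grid.
    """
    return {
        "land":     lsm > 0.5,
        "flat":     orog < 500.0,
        "high_gdp": log_gdp > 9.0,
        "urban":    urban_frac > 0.05,
    }
```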


Figure 3: Masks used to stratify the verification results: the land mask based on the IFS land–sea mask (top), the flat mask based on the IFS model orography (second from top), the high GDP mask based on GDP per capita estimates (third from top), and the urban mask based on the IFS urban cover fraction (bottom).

After applying each mask, we again counted the number of grid points, c_mask, where the AIFS Single is on average better than the IFS Control. We counted the total number of grid points after masking, n_mask, and estimated the new ratio r_mask = c_mask / n_mask. Finally, we calculated the multiplicative factor m = r_mask / r_all. This factor measures how the comparison changes when we focus on certain areas (land only, flat areas only, high GDP areas only, or urban areas only) rather than the whole domain. If the masking has no impact, m is close to 1. If the masking makes the AIFS compare better with the IFS, m is greater than 1.
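Putting the pieces together, the masked ratio and the multiplicative factor might be computed as follows (again a sketch under the assumptions above, not the operational code):

```python
def multiplicative_factor(rmse_aifs, rmse_ifs, mask):
    """Multiplicative factor m = r_mask / r_all for one boolean mask.

    rmse_aifs, rmse_ifs: (n_lat, n_lon) RMSE fields; mask: boolean field.
    """
    better = rmse_aifs < rmse_ifs       # AIFS better at these grid points
    r_all = better.mean()               # c_all / n_all, whole domain
    r_mask = better[mask].mean()        # c_mask / n_mask, masked area only
    return r_mask / r_all

# m close to 1: masking has no impact; m > 1: the AIFS compares better
# with the IFS inside the masked area than over the domain as a whole.
```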

What we found 

We applied this methodology to T2m verification against the operational analysis, examining the northern and southern hemisphere extratropics separately. In Figure 4, the upper panels show the percentage of grid points for which the AIFS is, on average, better than the IFS over the summer of 2025. The above-defined ratios r_all and r_mask are shown as percentages by the black and coloured bars, respectively. The corresponding multiplicative factor m is shown for each mask in the lower panels.


Figure 4: Upper panels: percentage of grid points for which the AIFS Single is on average better than the IFS Control in terms of RMSE over the whole domain (black bars) and for four sub-samples (coloured bars) as a function of forecast lead time. Lower panels: multiplicative factor of the four sub-samples, taking the full-domain result (black bars) as a reference. The four sub-samples are defined by the masks in Figure 3. Results for the northern hemisphere (top) and the southern hemisphere (bottom) over summer 2025.

The results in the upper panels of Figure 4 align with those in Figure 1: the AIFS outperforms the IFS at more than half of the grid points. When focusing on sub-samples defined by the geographical masks, however, a more complex picture appears. Orography appears to play a key role in the northern hemisphere, while ocean coverage influences the southern hemisphere results.

The signals for high GDP and urban areas are weaker and seem to correlate with performance over land in general. As these masks represent small proportions of the whole sample, more robust results could be achieved by increasing the number of verification days. 

In the northern hemisphere, the better performance over flat areas could be related to the non-uniform distribution of in-situ observations: a higher density of observations entering the data assimilation system might come from lower-altitude land areas. Further investigations will therefore include a mask based on observation density.

Take-home message and outlook 

This verification exercise illustrates how geographical masks can be used for stratification. Extending beyond standard categories such as season, forecast lead time, and geographical domain, stratification can reveal hidden model biases. Although the AIFS Single delivers substantial improvements in 2 m temperature forecasts across most regions, these gains are not uniformly distributed. 

Two of the stratifying masks (orography and land–sea mask) were known to the model during training but not directly optimised for. The two socio-economic masks were completely unknown to it. The disparity in skill in the latter case is particularly interesting because it is often assumed that if a model does not explicitly optimise for a variable, it will not exhibit bias towards its subcategories ("fairness through unawareness", Dwork et al., 2012). In practice, we see that biases can still emerge through associations between these variables and other features in the training data (e.g. urban tiles and the land–sea mask).

Understanding where and for which areas a model performs better or worse is an important step in highlighting potential areas for improvement. Identifying stratification variables associated with weaker performance can motivate strategies for model improvement – for example, weaker performance over complex orography suggests that an increase in spatial resolution should be a key priority.

When applied to non-weather-related features, such as socio-economic factors, the goal could be to achieve more uniform improvements by incorporating these variables into the model’s optimisation function. This approach is labelled as “fairness through awareness” (Dwork et al., 2012). Achieving truly even-handed improvements across all regions remains a complex but valuable aspiration for each new model cycle. 

 

DOI
10.21957/47b0c23469