Farewell to the external AI models

Machine learning (ML) models have become an increasingly important component of the global weather forecasting landscape.

Over the past few years at ECMWF, several externally developed models have been run alongside the Integrated Forecasting System (IFS) in an experimental setting, allowing their performance to be assessed against established numerical methods.

The upgrade of the IFS to Cycle 50r1 has provided a test of how robust these models are to changes in their underlying data.

In this blog post, we discuss the impact of the IFS upgrade on the performance of the ML models run at ECMWF, and why this has prompted the decision to stop running external models in real time. 

The IFS and ML models

The IFS is the global numerical weather prediction (NWP) model run operationally by ECMWF.

It receives regular updates, with each upgrade incorporating new science to improve its performance. The IFS is also the backbone of the ERA5 reanalysis dataset, which is used to train ML models for global weather forecasting. ERA5 is based on IFS Cycle 41r2 – the operational cycle in 2016.

ML models for global weather forecasting are generally trained on approximately 40 years of ERA5 data and, in some cases, fine-tuned on analyses from more recent versions of the IFS. This is the case for the Artificial Intelligence Forecasting System (AIFS), ECMWF’s ML-based forecasting model that has been operational since February 2025. More precisely, AIFS v1.1 is fine-tuned on seven years of operational analysis spanning January 2016 to January 2023.

A verification experiment

Since 2023, ECMWF has run several external ML models in experimental mode. Forecasts from a selection of these models are routinely made available on openCharts, providing the weather community with quick and easy access to ML-based predictions. This effort predates the creation of the AIFS and has helped build a better understanding of the structure and performance of these new types of forecasts, which in turn helped initiate the uptake of ML in global weather forecasting.

At the beginning of 2026, the set of external ML models running daily at ECMWF comprised the following: Pangu-Weather (Huawei), GraphCast (Google DeepMind), Aurora (Microsoft Research), and FourCastNet (Nvidia). These models were run twice daily (at 0 and 12 UTC) and were initialised with the IFS operational analysis.

Prior to the 50r1 upgrade, we conducted a verification experiment to simulate the impact of the new IFS cycle on ML model performance. The ML models were initialised with pre-operational 50r1 analysis over a two-month period (21 January to 20 March 2026), and verification scores were computed using 50r1 analysis as the reference. The results were compared with those from the same models initialised with and verified against the 49r1 analysis (the operational analysis). For surface variables, forecasts were assessed against SYNOP observations.
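The paired design described above can be sketched in a few lines. This is a minimal illustration, not ECMWF's verification code: the function name `area_weighted_rmse`, the toy grids and the cosine-latitude weighting are all assumptions made for the example, since the post does not say how scores are aggregated. The point it shows is that each set of forecasts is scored against the analysis of the cycle that initialised it.

```python
import math

def area_weighted_rmse(forecast, reference, lats_deg):
    """RMSE between two lat-lon fields, weighting each row by cos(latitude).

    forecast, reference: lists of rows (one row per latitude).
    lats_deg: latitude of each row in degrees.
    The cos(latitude) weighting is an assumption of this sketch.
    """
    weighted_sq_sum = 0.0
    weight_sum = 0.0
    for f_row, r_row, lat in zip(forecast, reference, lats_deg):
        w = math.cos(math.radians(lat))
        for f, r in zip(f_row, r_row):
            weighted_sq_sum += w * (f - r) ** 2
            weight_sum += w
    return math.sqrt(weighted_sq_sum / weight_sum)

# Toy T850 fields (K) on a 2 x 3 grid; all values are made up.
lats = [60.0, 30.0]
analysis_49r1 = [[270.0, 271.0, 272.0], [280.0, 281.0, 282.0]]
analysis_50r1 = [[270.5, 271.5, 272.5], [280.5, 281.5, 282.5]]
forecast_from_49r1 = [[271.0, 272.0, 273.0], [281.0, 282.0, 283.0]]
forecast_from_50r1 = [[272.5, 273.5, 274.5], [282.5, 283.5, 284.5]]

# Each forecast is verified against the analysis of its own cycle.
rmse_49r1 = area_weighted_rmse(forecast_from_49r1, analysis_49r1, lats)
rmse_50r1 = area_weighted_rmse(forecast_from_50r1, analysis_50r1, lats)
```

In this toy case the 50r1-initialised forecast has the larger error against its own analysis, which is the pattern reported for the fine-tuned models.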

The impact on scores

The IFS upgrade is motivated by research advances that have improved IFS forecast performance. This is reflected in Figures 1 and 2, which show a reduction in the root-mean-squared error (RMSE) of IFS forecasts for both upper-air and surface variables. However, without adaptive measures in place, the impact of the IFS upgrade on ML model performance is neutral or negative.

We can distinguish two groups of ML models. The first group (Pangu-Weather and FourCastNet) shows a neutral impact, while the second group (GraphCast, Aurora, AIFS v1.1) shows a reduction in performance after the change in initial conditions. The models that are negatively impacted are those that have a fine-tuning step in their optimisation process. This fine-tuning step is key to producing the best possible forecasts, but it introduces greater sensitivity to the exact analysis used for initialisation.

Multi-panel line plots showing RMSE difference and bias versus lead time (1–10 days) for Z500 and T850 across six models; GraphCast shows the largest negative RMSE differences, while biases generally become more negative with lead time.

Figure 1: Impact of using initial conditions from IFS Cycle 50r1 on forecast performance for geopotential height at 500 hPa (Z500) and temperature at 850 hPa (T850) over the northern hemisphere. Upper panels: RMSE difference, where negative values indicate a degradation in performance with the IFS upgrade and positive values indicate an improvement. Lower panels: mean error (bias) of 49r1-initialised forecasts (solid lines) and of 50r1-initialised forecasts (dashed lines).
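The two quantities plotted in Figure 1 can be written down concretely. The sketch below assumes the RMSE difference is taken as the 49r1 score minus the 50r1 score, which is the ordering that matches the sign convention in the caption (negative means degradation with the upgrade). The function names and all numbers are hypothetical.

```python
def rmse_difference(rmse_49r1, rmse_50r1):
    """Per-lead-time RMSE difference, 49r1 minus 50r1 (assumed ordering).

    Negative values mean the 50r1-initialised forecast has the larger error.
    """
    return [old - new for old, new in zip(rmse_49r1, rmse_50r1)]

def mean_error(forecasts, references):
    """Bias: average of (forecast - reference) over all verification pairs."""
    errors = [f - r for f, r in zip(forecasts, references)]
    return sum(errors) / len(errors)

# Made-up Z500 RMSE values (m) at lead times of 1, 5 and 10 days.
rmse_old = [10.0, 40.0, 90.0]
rmse_new = [10.0, 42.0, 95.0]
diff = rmse_difference(rmse_old, rmse_new)

# Made-up forecast and analysis values (m) at a single lead time.
bias = mean_error([5500.0, 5520.0, 5480.0], [5502.0, 5521.0, 5483.0])
```

Here `diff` is zero at day 1 and negative at days 5 and 10, which would plot as a neutral impact at short range and a degradation at longer lead times.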

Results for upper-air variables (Figure 1) and surface variables (Figure 2) show similar behaviour, although surface variables exhibit greater variability.

Two line plots showing RMSE difference versus lead time (1–10 days) for 2T and 10FF across models; GraphCast shows the largest negative differences, while AIFS v1.1 trends positive at longer lead times.

Figure 2: Impact of IFS Cycle 50r1 on surface variable performance, shown as the RMSE difference for 2 m temperature and 10 m wind speed over the northern hemisphere.

While the overall bias aggregated over the northern hemisphere is rather similar after the change in initial conditions (bottom plots of Figure 1), looking at maps of scores reveals local discrepancies. Differences in mean state and RMSE at day six are shown in Figure 3. In this example, we show results only for GraphCast and Pangu-Weather, as representative members of their respective groups mentioned above (the more- and less-impacted ML models after the IFS upgrade).

Global maps showing mean temperature difference (left) and RMSE difference (right) for GraphCast and Pangu-Weather, with mostly negative (blue/green) differences indicating generally lower errors and cooler biases across many regions.

Figure 3: Mean error difference (left) and root-mean-squared error (RMSE) difference (right) of T850 forecasts at day six. GraphCast (top) and Pangu-Weather (bottom) forecasts initialised with 50r1 are compared with forecasts initialised with 49r1.

In the top plots of Figure 3, we see that GraphCast tends to be colder over north-eastern Europe and Russia when initialised with 50r1, and that this bias contributes to an increase in RMSE over the same region. This change in the forecast mean state could be explained by the change in the sea-ice configuration in 50r1. Conversely, the change in initial conditions has little impact on the mean bias of Pangu-Weather and its resulting error fields (bottom plots of Figure 3).
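Maps such as those in Figure 3 are built by computing scores at each grid point over the set of forecast dates, rather than averaging over the hemisphere. A minimal sketch follows. It assumes the differences are taken as 50r1 minus 49r1 (the caption does not state the ordering), and the helper names and values are made up for illustration.

```python
import math

def pointwise_scores(forecasts, analyses):
    """Mean error and RMSE at each grid point over a set of dates.

    forecasts, analyses: lists over dates of flat lists over grid points.
    Returns (mean_error, rmse), each a list over grid points.
    """
    n_dates = len(forecasts)
    n_points = len(forecasts[0])
    mean_err = [0.0] * n_points
    mean_sq = [0.0] * n_points
    for fc, an in zip(forecasts, analyses):
        for i in range(n_points):
            err = fc[i] - an[i]
            mean_err[i] += err / n_dates
            mean_sq[i] += err * err / n_dates
    return mean_err, [math.sqrt(v) for v in mean_sq]

def difference_map(scores_50r1, scores_49r1):
    """Grid-point difference, 50r1 minus 49r1 (ordering assumed here)."""
    return [new - old for new, old in zip(scores_50r1, scores_49r1)]

# Two dates, two grid points of day-six T850 (K); values are made up.
analyses = [[270.0, 280.0], [272.0, 282.0]]
fc_49r1 = [[271.0, 280.0], [273.0, 282.0]]
fc_50r1 = [[268.0, 280.0], [270.0, 282.0]]

me_49, rmse_49 = pointwise_scores(fc_49r1, analyses)
me_50, rmse_50 = pointwise_scores(fc_50r1, analyses)
me_diff = difference_map(me_50, me_49)
rmse_diff = difference_map(rmse_50, rmse_49)
```

At the first grid point the 50r1-initialised forecast is colder and has a larger RMSE, while the second point is unchanged. For brevity the toy example reuses one set of analyses for both experiments; in the experiment described in this post, each forecast set is verified against the analysis of its own cycle.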

Implications for ML models

Fine-tuning is a strategy used to mitigate the reanalysis-to-forecast domain shift. When the forecast itself is subject to a domain shift, for example, due to a cycle upgrade, the fine-tuning strategy can be partially undermined, as illustrated by the results presented here.

To counter this negative effect, the new version of the AIFS (AIFS v2) has been fine-tuned on a seven-year dataset that includes prototype 50r1 data in addition to the operational analysis. Other organisations that run (and fine-tune) data-driven models using ECMWF analysis can expect to see similar effects on their models.

As for the external ML models, the performance degradation following the IFS upgrade to 50r1 has prompted a discussion at ECMWF about the current added value of these first-generation ML models. While these models demonstrated the potential of ML-based approaches and inspired the creation of the AIFS, they are increasingly out of date. Both GraphCast and Aurora are negatively affected by 50r1. For Aurora and Pangu-Weather, the lack of precipitation forecasts limits their utility. In addition, both GraphCast and FourCastNet have been superseded by probabilistic models developed by their respective research groups.

Perhaps more importantly, we now see a wider range of real-time model forecasts from operational centres. Examples include AICON from the German Meteorological Service (DWD), which provides a direct data feed; GEML from Environment and Climate Change Canada (ECCC); and AIGFS and AIGEFS from the US National Oceanic and Atmospheric Administration (NOAA). Together with upcoming additions from the operational meteorological community, these provide a diverse and regularly updated set of models to examine as we continue to understand the role of ML-based forecasting systems.

As such, we have decided to stop the real-time operation of the external ML models with the implementation of 50r1. Charts, such as T850 forecasts based on Pangu-Weather, will no longer be available on openCharts. However, users can continue to generate ML-based forecasts by running the ML models themselves using ECMWF’s open data.

Each of the external ML models – FourCastNet, Pangu-Weather, GraphCast, and Aurora – has made a significant contribution to the development of this field. By making their models open source, their developers have enabled the community to build trust in data-driven modelling. ECMWF acknowledges and greatly values these contributions.

DOI

10.21957/fa8ad01483