Evaluating training strategies in tropical cyclone forecasts

19 January 2026
Jef Philippé (Ghent University/RMI)
Dieter Van den Bleeken (Ghent University/RMI)
Michiel Van Ginderachter (Ghent University/RMI)
Michael Maier-Gerber
Mario Santa Cruz
Zied Ben Bouallègue

Tropical cyclones (TCs) are extreme weather phenomena that serve as a critical test for global data-driven weather models. Characterised by intense winds and heavy rainfall, they can lead to enormous destruction. According to the World Meteorological Organization (WMO), TCs accounted for roughly 17% of all weather- and climate-related disasters over the past five decades, causing on average 43 fatalities and €70 million in damages each day. Furthermore, the IPCC’s Sixth Assessment Report has provided observational evidence for an increase in the intensity of the most severe TCs. Reliable forecasts are therefore needed to support more effective preventive measures.

While deterministic state-of-the-art machine learning weather-prediction (MLWP) models show superior track forecasts, they underperform ECMWF’s physics-based Integrated Forecasting System (IFS) in terms of TC intensity. This applies to ECMWF’s AIFS Single as well as to external models such as Pangu-Weather, GraphCast, FourCastNet and Aurora (all of which feature daily forecasts in the ECMWF open charts catalogue), suggesting a limitation common to ML approaches rather than a model-specific issue. This shortcoming can be attributed to factors such as the rarity of extremes (a classic data-imbalance problem in the training dataset), model spatial resolution, and the smoothness bias present in ML-produced forecasts (a consequence of mean squared error loss minimisation). While such smoothing generally improves global accuracy scores compared to the IFS and contributes to accurate TC track forecasts, it reduces forecast “activity” in the sense discussed in the ‘Accuracy versus activity’ blog and leads to an underestimation of extremes.
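The smoothing effect of MSE minimisation can be illustrated with a toy numpy sketch (unrelated to any actual model code): when several outcomes are plausible for the same initial state, the point forecast that minimises the MSE is their mean, so the tails of the distribution, i.e. the extremes, are never predicted.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: for one atmospheric state there are many plausible outcomes
# (e.g. a cyclone deepening to different central pressures). A deterministic
# model trained with MSE is driven towards the conditional mean of these
# outcomes, which is smoother (less variable) than any single outcome.
outcomes = 960.0 + 15.0 * rng.standard_normal(2_000)   # plausible pressures (hPa)

# Scan candidate point forecasts and pick the one with the lowest MSE.
candidates = np.linspace(900.0, 1020.0, 1_201)         # 0.1 hPa grid
mse = ((candidates[:, None] - outcomes[None, :]) ** 2).mean(axis=1)
best = candidates[np.argmin(mse)]

print(f"MSE-optimal forecast: {best:.1f} hPa")
print(f"Mean of outcomes:     {outcomes.mean():.1f} hPa")
# The optimum coincides with the mean: deep extremes in the tail are smoothed away.
```

The same argument applies gridpoint-wise to a full forecast field, which is why an MSE-trained deterministic model tends to underestimate peak TC intensity.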

In a collaboration between the Royal Meteorological Institute of Belgium (RMI)/Ghent University and ECMWF, we examine a potential contributor to this smoothing behaviour that has received little attention so far: the training strategy. In particular, the impact of applying rollout during finetuning is assessed for predicting TC intensity and tracks.

Exploring different training strategies within Anemoi

The training procedure of the operational AIFS Single, developed under the open-source Anemoi framework, consists of two stages. The first, and longest in terms of training steps, is pre-training on ERA5 reanalysis, during which the model learns to predict a single 6-hour timestep ahead. In the second stage, finetuning, the model is further trained on IFS operational analyses. During this stage, rollout training is typically applied, meaning that the model predicts multiple timesteps in an autoregressive manner with a horizon-averaged loss. For the operational AIFS, the horizon was gradually increased from 12 to 72 hours.

A rollout procedure is common among state-of-the-art MLWP models (with varying horizons and hyperparameters), as it improves the accuracy of long-term forecasts. However, by penalising the accumulation of errors over the horizon, the loss also suppresses fine-scale variability with lower predictability, producing smoother forecasts. Distinguishing pre-training effects from rollout-induced smoothing is therefore essential.
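Schematically, rollout training with a horizon-averaged loss can be sketched as follows (a minimal numpy illustration with made-up names, not Anemoi’s actual implementation):

```python
import numpy as np

def rollout_loss(step_fn, x0, targets):
    """Horizon-averaged MSE over an autoregressive rollout.

    step_fn : maps the state at time t to the predicted state at t+6h
    x0      : initial state (e.g. an analysis), shape (n,)
    targets : verifying states at each 6h step, shape (horizon, n)
    """
    state, losses = x0, []
    for target in targets:            # horizon steps, e.g. 12 steps for 72 hours
        state = step_fn(state)        # feed the model its own output back in
        losses.append(np.mean((state - target) ** 2))
    return float(np.mean(losses))     # average the loss over the whole horizon

# Toy example: a damping "model" rolled out against matching toy targets.
rng = np.random.default_rng(1)
x0 = rng.standard_normal(100)
targets = np.stack([x0 * 0.9 ** (k + 1) for k in range(12)])  # 12 x 6h = 72h
loss = rollout_loss(lambda s: 0.9 * s, x0, targets)
print(f"rollout loss: {loss:.3e}")   # near zero: the toy model matches the targets
```

Because every step’s error feeds into the loss at all later steps, the optimisation favours predictions that stay close to the verifying states over the full horizon, which is exactly what damps less-predictable fine-scale structure.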


Figure 1: Schematic overview of the training procedure, illustrating the two main stages (pre-training and finetuning) and indicating which configurations were used for each of the three models.

Accordingly, three models with different training strategies were trained, as summarised in Figure 1. One model underwent only the pre-training stage, whereas the other two had additional finetuning, one with rollout and one without. The finetuning period was one year shorter than that of the operational AIFS, enabling TC evaluation over several years and thus more robust statistics for these rare events. The models’ architecture was kept close to that of the operational AIFS, and all forecasts were initialised from the IFS operational analysis.

Impact of training steps on TC forecast performance 

The three models are evaluated in terms of TC forecast performance by verifying against observed best track data. In the case of pre-training only, TC track errors are comparable to those of the IFS, but the model features a pronounced weak-intensity bias from as early as 12 hours after initialisation. 
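For reference, TC position error against best-track data is typically measured as the great-circle distance between the forecast and observed cyclone centres. A standard haversine sketch (the exact verification code used for this study may differ):

```python
import numpy as np

def track_error_km(lat_f, lon_f, lat_o, lon_o):
    """Great-circle distance (km) between forecast and observed TC centres."""
    R = 6371.0  # mean Earth radius in km
    p1, p2 = np.radians(lat_f), np.radians(lat_o)
    dp = p2 - p1
    dl = np.radians(lon_o) - np.radians(lon_f)
    a = np.sin(dp / 2) ** 2 + np.cos(p1) * np.cos(p2) * np.sin(dl / 2) ** 2
    return 2 * R * np.arcsin(np.sqrt(a))

# Example: forecast centre one degree of latitude away from the observed centre.
err = track_error_km(15.0, 88.0, 16.0, 88.0)
print(f"{err:.0f} km")  # ~111 km, i.e. roughly one degree of latitude
```

Averaging this distance over all matched forecast–observation pairs at each lead time gives the position-error curves of the kind shown in Figure 2a.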

In contrast, the model with additional finetuning and rollout, mirroring the operational AIFS setup, demonstrates substantial improvements in TC track accuracy, consistent with gains previously reported for the AIFS and other MLWP models. Despite its strong track performance, however, this model systematically and substantially underestimates TC intensity, with performance trailing that of the IFS.

The model finetuned without rollout offers a promising alternative for TC forecasting. It preserves most of the track-accuracy improvements while significantly reducing the intensity bias. In fact, it predicts TC intensities comparable to those of the IFS and even outperforms the IFS at lead times greater than 36 hours.


Figure 2: Comparison of mean errors in 2022–2024 tropical cyclone forecasts for (a) position and (b) central pressure as a function of forecast time. This is based on a global verification against the International Best Track Archive for Climate Stewardship (IBTrACS v4r01). Forecasts are homogenised to have a consistent number of cases between models. The verification is based on TCs that are present in the observation database at the forecast initial time. For each lead time, the number of cases is displayed directly below the graphs. Vertical bars indicate the 2.5%–97.5% confidence intervals.

These results show that using rollout during finetuning significantly enhances TC track forecasts compared to the IFS, in addition to the gains seen in headline scores. At the same time, it has a large detrimental effect on TC intensity. The TC weak-intensity bias of MLWP models is commonly attributed to the coarse resolution of the training data and the smoothing effect introduced by the mean squared error-based loss function. Both undoubtedly play significant roles, but our results show that the impact of loss minimisation depends strongly on the chosen training strategy, and that a clearer distinction should be made between training with and without rollout.

Case study: Tropical Cyclone Mocha

The systematic verification results are supported by a case study of TC Mocha (2023) in the North Indian Ocean (Figure 3). In its four-day forecast valid on 13 May 2023 at 12 UTC – about a day before landfall in Myanmar – the finetuned model without rollout predicted a minimum central pressure of 946 hPa, closest to the observed 944 hPa. In comparison, the IFS predicted 959 hPa, and the finetuned model with rollout 971 hPa. Moreover, the more accurate central pressure forecast is embedded within a pressure field broadly similar to that of the IFS, so the profile at larger radii is not distorted.


Figure 3: Mean sea-level pressure fields associated with Tropical Cyclone Mocha from the finetuned model with rollout (left), the finetuned model without rollout (middle) and the IFS (right), initialised on 9 May 2023 at 12 UTC and valid for 13 May 2023 at 12 UTC.

Relationship to the accuracy–activity trade-off

At first glance, completely removing the rollout may seem like the solution, but this comes at a cost in general verification scores. Figure 4 shows how each of the models performs in the accuracy–activity trade-off for one of the key headline scores – 500 hPa geopotential height in the northern hemisphere. Relative to the IFS, the model that underwent pre-training only exhibits a degradation in accuracy with increasing forecast time, accompanied by an initial loss of forecast activity that is neutralised at longer lead times.

Conversely, finetuning with rollout results in a considerable increase in forecast accuracy, especially at longer lead times. Omitting the rollout places the AIFS between these two strongly opposing behaviours in terms of accuracy, with a clear gain compared to the IFS, while exhibiting the highest forecast activity of all models.


Figure 4: Forecast accuracy–activity trade-off for geopotential height at 500 hPa in the northern hemisphere. The plot shows the forecast accuracy relative to the IFS control forecast versus the forecast activity relative to the IFS analysis. Results are plotted for lead times from day 1 (dots) to day 10 (squares). Accuracy is measured with the anomaly correlation coefficient.
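The two quantities in Figure 4 can be sketched as follows (a simplified numpy illustration: activity as the standard deviation of the forecast anomaly, accuracy as the anomaly correlation with the analysis; operational verification additionally applies latitude weighting and a proper climatology):

```python
import numpy as np

def acc(fc, an, clim):
    """Anomaly correlation coefficient between a forecast and the analysis."""
    fa, aa = fc - clim, an - clim
    fa, aa = fa - fa.mean(), aa - aa.mean()
    return float((fa * aa).sum() / np.sqrt((fa ** 2).sum() * (aa ** 2).sum()))

def activity(fc, clim):
    """Forecast activity: standard deviation of the forecast anomaly."""
    return float(np.std(fc - clim))

# Toy fields: an "analysis" anomaly, a noisy-but-active forecast, and a
# damped (smoothed) forecast.
rng = np.random.default_rng(2)
clim = np.zeros(10_000)
an = rng.standard_normal(10_000)
fc_sharp = an + 0.3 * rng.standard_normal(10_000)
fc_smooth = 0.6 * an

for name, fc in [("sharp", fc_sharp), ("smooth", fc_smooth)]:
    rel_act = activity(fc, clim) / activity(an, clim) - 1.0
    print(f"{name}: ACC={acc(fc, an, clim):.2f}, relative activity={rel_act:+.1%}")
# Here the smoothed forecast is a damped copy of the analysis: it scores a
# higher ACC while being much less active - the trade-off discussed above.
```

This toy case mirrors the behaviour in Figure 4: damping the less-predictable scales can raise ACC-type accuracy scores while pushing activity well below that of the analysis.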

Summary

While the operational AIFS Single has delivered major improvements in TC track forecasting, its TC intensity predictions have so far remained less competitive than those of the IFS. This study assessed the impact of different training strategies on TC track and intensity performance. Our results show that rollout is the most influential component: it enhances large-scale forecast accuracy and TC track skill, but it also reduces forecast activity, which leads to a pronounced TC weak-intensity bias. Omitting rollout shifts the AIFS to a different position in the accuracy–activity trade-off, enabling it to retain most of the track improvements while significantly improving intensity forecasts, surpassing the IFS for TC central pressure beyond the 36-hour lead time, despite using an MSE-based loss function.

These findings provide guidance for future developments of the AIFS Single and other deterministic MLWP models. They demonstrate that deterministically trained systems are capable of making skilful TC intensity forecasts.

Top banner image: The Copernicus Sentinel-3 mission captured this image of the powerful Cyclone Mocha on 13 May 2023 as it made its way across the Bay of Bengal heading northeast towards Bangladesh and Myanmar. © ESA, contains modified Copernicus Sentinel data (2023), processed by ESA

DOI
10.21957/d7d335141c