Forecast verification is central to numerical weather prediction (NWP).
Operational centres generate vast amounts of forecast data, and their accuracy is evaluated using a variety of statistical measures, such as the root-mean-square error (RMSE), the anomaly correlation coefficient (ACC), the Brier score, the continuous ranked probability score (CRPS) and related metrics. These measures quantify total errors effectively, but they do not directly assess how much physically meaningful information a forecast contains about the future state of the atmosphere, or of any other forecast component of the Earth system.
Our new study introduces a framework designed to address this. By separating the information contained in a forecast from the noise, the work offers a more transparent way to assess intrinsic forecast performance. This is particularly important as new approaches such as machine-learning-based emulators become more widely used.
How forecast quality is traditionally measured
To check the accuracy of weather predictions, forecasts are analysed using statistical assessments of error and skill, typically known as “scores”. For single (one-shot) forecasts, two of the most widely used scores are RMSE, which measures the average magnitude of forecast errors in a root-mean-square sense, and ACC, which assesses how well forecasts reproduce anomalies relative to climatology.
However, while these metrics are powerful, they have limitations. They provide estimates of total error, which is important for end users of forecast products, but they do not allow the intrinsic information content of the forecasts to be evaluated. As a result, the correspondence between forecast scores and the true skill of a forecast is not exact. RMSE and ACC, in particular, can generally be improved (RMSE reduced, ACC increased) by smoothing the forecast, even to the point where the forecast becomes unrealistic.
This makes it hard to fairly evaluate different forecasting approaches, such as single models, averages of multiple forecasts, and artificial intelligence (AI)/machine learning (ML)-based predictions.
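To illustrate how smoothing can improve the traditional scores, consider a minimal synthetic example (a toy construction for illustration only, not taken from the paper): the forecast and the observations share a predictable large-scale anomaly, while their small-scale variability is unpredictable and mutually independent. Simply discarding the small scales improves both RMSE and ACC.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000  # number of verification points (toy values throughout)

# Synthetic anomalies: a predictable large-scale part shared by forecast and
# observations, plus unpredictable small-scale parts that are independent.
large_scale = rng.normal(0.0, 1.0, n)
obs = large_scale + rng.normal(0.0, 0.7, n)    # observed anomalies
fcst = large_scale + rng.normal(0.0, 0.7, n)   # raw forecast anomalies
fcst_smooth = large_scale                      # idealised smoothing: small scales removed

def rmse(f, o):
    return np.sqrt(np.mean((f - o) ** 2))

def acc(f, o):
    # Anomaly correlation (the series are already anomalies).
    return np.corrcoef(f, o)[0, 1]

print(f"raw forecast      RMSE={rmse(fcst, obs):.3f}  ACC={acc(fcst, obs):.3f}")
print(f"smoothed forecast RMSE={rmse(fcst_smooth, obs):.3f}  ACC={acc(fcst_smooth, obs):.3f}")
# The smoothed forecast scores better on both metrics, yet it contains no
# additional information about the atmosphere: it has simply discarded
# variability it could not predict.
```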
A shift in perspective: information, noise, reliability and resolution
Building on earlier work by Feng et al. (2024), our study introduces a new way of interpreting forecast errors by separating them into two components: information error and noise error (see Figure 1).
Information error represents the part of the forecast error associated with missing signal, while noise error captures the remaining error unrelated to the observed atmospheric state. Information error is proposed as a measure of the statistical resolution of a forecast system.
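The precise definitions are given in the paper; as a rough sketch of the idea (an illustrative construction of our own, not the paper's formulation), one can split the forecast error into a component that is linearly related to the observed anomaly and a residual that is uncorrelated with it:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

# Toy anomalies (illustrative values only).
signal = rng.normal(0.0, 1.0, n)
obs = signal + rng.normal(0.0, 0.5, n)           # observed anomalies
fcst = 0.8 * signal + rng.normal(0.0, 0.6, n)    # imperfect forecast anomalies

err = fcst - obs

# Part of the error that is linearly related to the observed state
# ("missing signal"), obtained by regressing the error on the observations.
beta = np.cov(err, obs, bias=True)[0, 1] / np.var(obs)
err_info = beta * (obs - obs.mean())
# Remaining error, uncorrelated with the observed state ("noise").
err_noise = err - err_info

print(f"total MSE        : {np.mean(err ** 2):.3f}")
print(f"information part : {np.mean(err_info ** 2):.3f}")
print(f"noise part       : {np.mean(err_noise ** 2):.3f}")
# The two parts add up to the total MSE because they are uncorrelated
# by construction.
```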
This distinction connects to two attributes that characterise forecast performance: statistical reliability and statistical resolution.
Figure 1: Separation of forecast error into information and noise components.
Statistical reliability refers to how well, on average, forecast probabilities match the observed frequencies of the corresponding outcomes (for example, whether events forecast with a 70% probability occur about 70% of the time), and it can often be improved after the forecast is produced through calibration and bias correction.
In contrast, statistical resolution measures the intrinsic predictive capability of a forecast system, irrespective of whether the forecast is a single forecast or an ensemble forecast. It represents how well a forecast distinguishes between different possible weather outcomes. Unlike reliability, resolution cannot be improved through post-processing.
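A simple synthetic example illustrates the difference (again a toy sketch, with linear regression standing in for real calibration): post-processing a biased, over-confident forecast reduces its RMSE, but its correlation with the observations, used here as a crude proxy for resolution, does not change.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000

signal = rng.normal(0.0, 1.0, n)
obs = signal + rng.normal(0.0, 0.5, n)
# A biased, over-confident toy forecast of the same signal.
fcst = 1.5 + 1.4 * signal + rng.normal(0.0, 0.4, n)

# Idealised post-processing: linear regression of the observations on the forecast.
slope, intercept = np.polyfit(fcst, obs, 1)
fcst_cal = intercept + slope * fcst

def rmse(f, o):
    return np.sqrt(np.mean((f - o) ** 2))

for name, f in [("raw forecast", fcst), ("calibrated  ", fcst_cal)]:
    print(f"{name}  RMSE={rmse(f, obs):.3f}  correlation={np.corrcoef(f, obs)[0, 1]:.3f}")
# Calibration removes the bias and the over-confidence, so the RMSE drops, but the
# correlation with the observations (a crude proxy for resolution) is unchanged:
# post-processing cannot add predictive information that was not already there.
```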
What the new metrics reveal in practice
Our study gives various examples of the practical application of these ideas, one of them using ensemble forecasts from ECMWF’s operational system. Ensemble forecasting produces multiple simulations of the atmosphere to represent uncertainty, and the average of these simulations – the ensemble mean – often shows lower error according to traditional measures.
However, the new analysis reveals a more nuanced picture. While the ensemble mean reduces overall error, it does so by smoothing out unpredictable features in the forecast, which in our framework corresponds to reducing the noise error component.
On the other hand, the ensemble mean has the same information error as any of the perturbed ensemble members, which is greater than the information error of the control (unperturbed) forecast. In other words, the statistical resolution of the ensemble mean, i.e. its ability to distinguish different atmospheric outcomes, is the same as that of any perturbed ensemble member. This is to be expected, since statistical resolution cannot, by definition, be increased by post-processing, but it is not universally appreciated in our community.
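The mechanism can be mimicked with a toy ensemble (a synthetic sketch, not ECMWF data): all perturbed members share the same predictable signal and differ only in unpredictable perturbations, so averaging lowers the RMSE and the variance without adding any signal that was not already in each member.

```python
import numpy as np

rng = np.random.default_rng(3)
n_members, n = 50, 100_000

# Toy ensemble: every perturbed member shares the same predictable signal and
# adds its own unpredictable perturbation; the observations have their own
# unpredictable component (illustrative values only).
signal = rng.normal(0.0, 1.0, n)
obs = signal + rng.normal(0.0, 0.6, n)
members = signal + rng.normal(0.0, 0.6, (n_members, n))
ens_mean = members.mean(axis=0)

def rmse(f, o):
    return np.sqrt(np.mean((f - o) ** 2))

print(f"average member RMSE : {np.mean([rmse(m, obs) for m in members]):.3f}")
print(f"ensemble mean RMSE  : {rmse(ens_mean, obs):.3f}")
print(f"member std dev      : {members.std():.3f}")
print(f"ensemble mean std   : {ens_mean.std():.3f}")
# Averaging damps the unpredictable member perturbations, so the ensemble mean has
# a lower RMSE and a smoother (lower-variance) field, yet the only signal it
# contains is the signal already present in each perturbed member.
```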
This distinction highlights an important insight: a forecast can score better on conventional performance metrics without increasing – and sometimes while reducing – its fundamental predictive skill.
The framework also proves useful in model development. When testing changes to a forecasting system, standard scores alone may suggest improvement, but the new decomposition can reveal whether the changes enhance genuine predictive information or are merely an artefact of reduced forecast variability.
Implications for forecasting and model development
As weather prediction evolves, the ability to evaluate forecast skill accurately becomes increasingly important. The evolving landscape of machine-learning-based forecasting systems, alongside traditional physics-based models, creates new challenges for fair comparison. Different approaches may produce forecasts with different levels of smoothness or variability, making conventional metrics harder to interpret.
By distinguishing intrinsic predictive information from noise, our proposed framework provides a more transparent basis for comparison. It offers a way to track whether model improvements genuinely enhance predictive capability, supports more informed development decisions, and helps ensure that apparent improvements in forecast accuracy reflect real advances in understanding the atmosphere.
More broadly, our work highlights a fundamental principle in forecast evaluation: not all reductions in forecast errors represent better predictions. Understanding what a forecast truly knows – rather than simply how small its errors appear – may be essential for the next generation of numerical weather prediction.
As forecasting technologies continue to advance, measuring skill in terms of information and noise could help researchers and operational centres focus on what matters most: improving our ability to predict the atmosphere’s behaviour with meaningful accuracy.
Read the paper
Bonavita, M. & A.J. Geer, 2026: Forecast verification using information and noise. Quarterly Journal of the Royal Meteorological Society, e70109. https://doi.org/10.1002/qj.70109
Top banner image: © kokoroyuki / iStock / Getty Images Plus