Abstract

The impact of developments in weather forecasting is measured using forecast verification, but many developments, though useful, improve medium-range forecast scores by less than 0.5% (equivalent to roughly 0.5 h of forecast skill). Chaotic variability in forecast quality makes it hard to achieve statistical significance when comparing such developments to a control. For example, with 60 separate forecasts and a 95% confidence interval, a change in the quality of the 5-day forecast would need to be larger than 1% to be statistically significant under a Student's t-test. The first aim of this study is simply to illustrate the importance of significance testing in forecast verification, and to point out the surprisingly large sample sizes required to attain significance. Further, chaotic noise is correlated in time and can generate apparently systematic improvements across different forecast ranges, so a 'run' of good scores is not necessarily evidence of statistical significance. Even with significance testing, forecast experiments have sometimes appeared to generate too many strange and unrepeatable results, and a second aim has been to investigate this. By constructing an independent realisation of the null distribution used in the hypothesis testing, from 1,885 paired forecasts (about 2.5 years of testing), it is possible to build an alternative significance test that makes no statistical assumptions about the data. This is used to test experimentally the validity of the normal statistical framework for forecast scores, and it shows that naive application of the Student's t-test does generate too many false results. A known issue is temporal autocorrelation in forecast scores, which can be corrected by inflating the error bars, but typical inflation factors (such as those based on an AR(1) model) are not large enough and are unreliable for smaller samples. Also, the importance of statistical multiplicity has not been appreciated.
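The quantities above can be sketched with some back-of-envelope arithmetic. This is an illustrative sketch, not the paper's method: the 4% standard deviation of paired day-5 score differences, the AR(1) coefficient of 0.5, and the count of 14 independent tests are hypothetical values chosen so the numbers roughly match the 1%, error-bar-inflation, and 1-in-2 figures quoted in the text (the family-wise formula also assumes independent tests, which correlated scores violate).

```python
import math

def min_detectable_change(sd_diff, n, t_crit=2.001):
    """Half-width of the two-sided 95% CI for a mean paired difference.

    t_crit ~= 2.001 is the 97.5% Student's t quantile for
    n - 1 = 59 degrees of freedom (n = 60 paired forecasts).
    """
    return t_crit * sd_diff / math.sqrt(n)

def effective_sample_size(n, rho):
    """Reduce n for AR(1) autocorrelation rho between successive scores.

    Equivalent to inflating the error bars by sqrt((1 + rho) / (1 - rho)).
    """
    return n * (1 - rho) / (1 + rho)

def family_wise_rate(alpha, m):
    """Chance of at least one false positive across m independent tests."""
    return 1 - (1 - alpha) ** m

print(round(min_detectable_change(4.0, 60), 2))  # 1.03 -> ~1% change needed
print(effective_sample_size(60, 0.5))            # 20.0 effective forecasts
print(round(family_wise_rate(0.05, 14), 2))      # 0.51 -> about 1 in 2
```

The effective-sample-size correction shows why AR(1)-style inflation matters: with a lag-1 correlation of 0.5, sixty daily forecasts carry only about twenty independent samples, widening the confidence interval by a factor of sqrt(3).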
For example, across three forecast experiments, there is a 1 in 2 chance of getting a false result through multiplicity. The t-test can be reliably used to interpret the significance of changes in forecast scores, but only when correctly adjusted for autocorrelation, and when the effects of multiplicity are properly considered.

1 Introduction

Forecast

Published in: ECMWF Technical Memoranda, ECMWF. https://www.ecmwf.int/node/15287