European open data for machine learning applications

Meteorological data are recognised as highly valuable by the European Union (EU). New EU directives aim to make these data more accessible, fuelling artificial intelligence and data-driven innovation. In this context, RODEO (https://rodeo-project.eu/), a project funded by the EU and EUMETNET (https://eumetnet.eu/), has been established to enhance and unify open access to public meteorological data across Europe.

Within the framework of RODEO, two example datasets have been developed to demonstrate how open meteorological observations can be used in machine learning (ML) applications. Building on data curated by project partners, the datasets have been designed with the constraints and requirements of both domain experts and ML experts in mind. They show how meteorological data can serve not only for training ML models but also for validating the quality of their output.

Weather data for ML training

The Operational Programme for the Exchange of weather Radar information (OPERA; https://eumetnet.eu/observations/weather-radar-network/), supported by EUMETNET, aims to harmonise and improve the exchange of weather radar information between national meteorological services across Europe. The OPERA pan-European 2D composites form a homogenised and consistent dataset with an archive of up to ten years.

In RODEO, different radar datasets based on OPERA were built to support a range of ML applications, such as nowcasting, medium-range forecasting (e.g. AIFS), and direct training from observations (e.g. Weather Generator). Covering the period 2013–2023, the datasets vary in spatial and temporal resolution to suit specific application needs.

The published datasets include both 1-hour and 6-hour accumulated precipitation. The 6-hour data are intended to align with popular reanalysis datasets, such as ERA5, used for medium-range forecast ML training.
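The relationship between the two accumulation periods can be illustrated with a minimal sketch. It assumes hourly fields stacked along a leading time axis, with the first field starting a 6-hour window; the function name `accumulate` and the array layout are hypothetical and do not describe how the published RODEO files were produced.

```python
import numpy as np


def accumulate(hourly, window=6):
    """Sum hourly fields into non-overlapping `window`-hour totals.

    hourly: array of shape (time, y, x) in mm; NaN marks missing data.
    Trailing hours that do not fill a complete window are dropped.
    """
    hourly = np.asarray(hourly, dtype=float)
    n_windows = hourly.shape[0] // window
    trimmed = hourly[: n_windows * window]
    blocks = trimmed.reshape(n_windows, window, *hourly.shape[1:])
    # A plain sum propagates NaN, so one missing hour marks the
    # whole window as missing at that grid point.
    return blocks.sum(axis=1)


# Synthetic example: 12 hours of 1 mm/h with one missing value.
hourly = np.ones((12, 2, 2))
hourly[7, 0, 0] = np.nan
six_hourly = accumulate(hourly)
```

Propagating missing values is a deliberately strict choice for this sketch; a more tolerant rule, such as accepting windows with a minimum number of valid hours, would need its own justification.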

Datasets are available at the native 2 km resolution, as well as at coarser resolutions reprojected onto reduced Gaussian grids, a type of grid used by the Integrated Forecasting System (IFS) and the AIFS, at O96 (approximately 1°) and N320 (approximately 31 km). Although coarsening can reduce some fine-scale features, these versions can significantly ease experimentation and exploration, particularly in the context of medium-range weather forecasting. Interpolation was performed using a conservative approach implemented in the Meteorological Interpolation and Regridding (MIR; https://www.ecmwf.int/en/newsletter/152/computing/new-ecmwf-interpolation-package-mir) package to ensure consistency of precipitation totals between the original and reprojected fields.
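The idea behind conservative interpolation can be sketched on a much simpler case than the one handled by MIR. The example below assumes a regular equal-area grid coarsened by an integer factor, where block averaging preserves the area-integrated precipitation. It is not the MIR algorithm and does not target a reduced Gaussian grid; the function name `coarsen_conservative` is hypothetical.

```python
import numpy as np


def coarsen_conservative(field, factor):
    """Block-average a 2D field on an equal-area grid by an integer factor."""
    field = np.asarray(field, dtype=float)
    ny, nx = field.shape
    if ny % factor or nx % factor:
        raise ValueError("grid dimensions must be divisible by factor")
    blocks = field.reshape(ny // factor, factor, nx // factor, factor)
    return blocks.mean(axis=(1, 3))


rng = np.random.default_rng(0)
fine = rng.gamma(shape=0.5, scale=2.0, size=(8, 8))  # synthetic precipitation in mm
factor = 4
coarse = coarsen_conservative(fine, factor)

# Area-integrated totals: precipitation depth times cell area.
# One coarse cell covers factor**2 fine cells of unit area.
fine_total = fine.sum() * 1.0
coarse_total = coarse.sum() * factor**2
```

In this simplified setting the two totals agree to floating-point precision, which is the property a conservative scheme is meant to guarantee. Peak values in `coarse` are lower than in `fine`, which illustrates the loss of fine-scale features.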

Fig 1b. **Precipitation composites.** Example of OPERA data showing total precipitation (tp) composites for 6 October 2018 at 23:05 UTC. Projection on the original Lambert Azimuthal Equal Area (LAEA) grid (top) and after interpolation to an N320 grid using MIR (bottom).

As MIR requires input data in GRIB2 format, the original HDF5 data were re-encoded accordingly. This not only enabled MIR-based reprojection but also made it simpler to ingest OPERA data into the ECMWF MARS archive, improving user accessibility and reproducibility.

Access to the OPERA dataset is provided through Anemoi (https://www.ecmwf.int/en/newsletter/181/news/introducing-anemoi-new-collaborative-framework-ml-weather-forecasting), an open-source framework co-developed by ECMWF and several European national meteorological services. The primary objective of Anemoi is to empower meteorological organisations to train ML models using their own data, simplifying the process through the provision of shared tools and workflows.

Climate statistics for ML forecast verification

The European Climate Assessment and Dataset (ECA&D; https://www.ecad.eu/) collects daily in situ surface meteorological observations for 13 climate variables. It currently contains around 24,500 stations from 89 participants across 65 countries. The data, which are free for research and educational purposes, serve as the backbone for the 'Climate node' of the World Meteorological Organization Region VI, which provides climate services for regional monitoring.

In RODEO, a dataset based on ECA&D was designed for applications in forecast verification. Targeted at ML developers, the data are provided alongside dedicated verification scripts for assessing precipitation forecasts over Europe. The data consist not only of observations but also of the climate statistics necessary for computing state-of-the-art verification metrics, such as the Stable and Equitable Error in Probability Space (SEEPS), a performance measure used at ECMWF as a supplementary headline score. Calculating this score is not straightforward because it requires prior knowledge of the precipitation climatology at each station where the forecast is verified.
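To illustrate why the station climatology is needed, here is a minimal sketch of SEEPS in its commonly published three-category form (dry, light, heavy), in which the light category is climatologically twice as likely as the heavy one. The function names, argument layout and thresholds are hypothetical and are not taken from the SEEPS4ALL scripts; `p1` stands for the climatological probability of a dry day at the station.

```python
import numpy as np


def seeps_matrix(p1):
    """3x3 SEEPS error matrix: rows = forecast category, columns = observed.

    p1 is the climatological probability of the dry category.
    """
    if not 0.0 < p1 < 1.0:
        raise ValueError("p1 must lie strictly between 0 and 1")
    return 0.5 * np.array([
        [0.0, 1.0 / (1.0 - p1), 4.0 / (1.0 - p1)],
        [1.0 / p1, 0.0, 3.0 / (1.0 - p1)],
        [1.0 / p1 + 3.0 / (2.0 + p1), 3.0 / (2.0 + p1), 0.0],
    ])


def categorise(precip, dry_threshold, light_heavy_threshold):
    """Map precipitation amounts to 0 (dry), 1 (light) or 2 (heavy)."""
    precip = np.asarray(precip, dtype=float)
    return ((precip > dry_threshold).astype(int)
            + (precip > light_heavy_threshold).astype(int))


def seeps(forecast, observed, p1, dry_threshold, light_heavy_threshold):
    """Mean SEEPS error at one station; 0 is perfect, lower is better."""
    matrix = seeps_matrix(p1)
    f_cat = categorise(forecast, dry_threshold, light_heavy_threshold)
    o_cat = categorise(observed, dry_threshold, light_heavy_threshold)
    return float(matrix[f_cat, o_cat].mean())
```

Both `p1` and the light/heavy threshold depend on the local climate, which is why a score of this kind cannot be computed from forecasts and observations alone. With this matrix, a forecast that always predicts the same category has an expected score of 1 under the station climatology.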

The verification dataset is called SEEPS4ALL. It contains the information needed to compute the SEEPS score, as well as local climate percentiles that can be used to compute other scores and skill measures. Climate statistics are crucial for assessing forecasts of extreme precipitation and high-impact events, and this is showcased with verification scripts for both deterministic and probabilistic forecasts.
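As an illustration of how local climate percentiles can enter a probabilistic score, the sketch below computes a Brier score for the event "precipitation exceeds a station-specific threshold", where the threshold would be a climate percentile at each station. The function name, array shapes and numbers are hypothetical and do not reproduce the SEEPS4ALL scripts.

```python
import numpy as np


def brier_score(ensemble, observed, threshold):
    """Brier score for exceedance of a per-station threshold.

    ensemble:  array of shape (member, station)
    observed:  array of shape (station,)
    threshold: array of shape (station,), e.g. a local climate percentile
    """
    ensemble = np.asarray(ensemble, dtype=float)
    observed = np.asarray(observed, dtype=float)
    threshold = np.asarray(threshold, dtype=float)
    # Forecast probability = fraction of members exceeding the threshold.
    prob = (ensemble > threshold).mean(axis=0)
    event = (observed > threshold).astype(float)
    return float(np.mean((prob - event) ** 2))


# Synthetic example: 4 members, 2 stations, thresholds in mm.
ensemble = np.array([[1.0, 12.0], [3.0, 8.0], [5.0, 20.0], [0.0, 15.0]])
observed = np.array([4.0, 5.0])
threshold = np.array([2.0, 10.0])
score = brier_score(ensemble, observed, threshold)
```

Using a percentile rather than a fixed amount means the event has the same climatological frequency at every station, which makes scores easier to compare across regions with different climates.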

In the ML community, benchmarking is widely recognised as a key driver of progress. SEEPS4ALL promotes the benchmarking of daily precipitation forecasts against in situ observations over Europe.

The RODEO project has shown how European open meteorological data can be transformed into practical, ML-ready datasets. At a time of extraordinary innovation, RODEO has demonstrated the value of collaboration in developing datasets, and it facilitates not only access to open weather data but also their use.

Data access

The RODEO-ML datasets are publicly available under their respective licences and can be downloaded from the S3 bucket s3://ecmwf-rodeo-benchmark. The accompanying code repository at https://github.com/ecmwf/rodeo-ai-static-datasets provides detailed instructions on downloading the datasets, along with examples and guidance on using the OPERA datasets for ML applications and the ECA&D datasets for verification.
