The winner of an ECMWF-organised Kaggle in Class competition on ‘Predicting the impact of air quality on mortality rates’, Matthias Gehrig from Germany, was announced on 5 May 2017. Matthias did a fantastic job tackling this challenge: after 60 submissions he scored a root-mean-square error of 0.29023 on the private leaderboard just 12 minutes before the competition’s deadline. After the competition he revealed that he used a simple Excel spreadsheet and public leaderboard feedback to model the mortality trend, which he then used as input to xgboost, along with other features derived from the input data provided. The xgboost library implements eXtreme Gradient Boosting, a scalable tree boosting algorithm widely used by data scientists and winners of other Kaggle machine learning competitions.
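For readers unfamiliar with the scoring metric, the root-mean-square error mentioned above can be sketched in a few lines of Python. This is only an illustration of the standard formula, not the competition's scoring code; the function name `rmse` and its arguments are assumptions made for the example.

```python
import math


def rmse(observed, predicted):
    """Root-mean-square error between two equally long sequences of numbers."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    if not observed:
        raise ValueError("at least one value is required")
    # Square each difference, average the squares, then take the square root.
    squared_errors = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))
```

Lower values indicate predictions that are closer to the observed mortality rates, and a score of zero would mean a perfect match.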
Wide range of strategies
Poor air quality is a significant public health issue. According to a COMEAP (UK Committee on the Medical Effects of Air Pollutants) report, the burden of particulate air pollution in the UK in 2008 was estimated to be equivalent to “nearly 29,000 deaths at typical ages and an associated loss of population life of 340,000 life years”. The goal of the competition was to predict mortality rates due to cardiovascular and respiratory diseases and cancer for each English region using daily means of temperature and of the surface concentrations of ozone (O3), nitrogen dioxide (NO2), PM10 (particulate matter with a diameter of 10 micrometres or less) and PM2.5 (2.5 micrometres or less).
The competition attracted 51 data enthusiasts, from beginners to Kaggle Grandmasters, from several countries including China, India, Mexico, the United Kingdom, France, Germany and Poland. Some of them actively participated in the forum discussions, highlighting the challenges the data posed. The modelling strategies adopted varied widely, and even the choice of the most important predictors differed markedly across submissions. Some participants pointed out that environmental factors (such as pollutant concentrations) were less influential than temporal trends. Many considered handling missing values to be the most challenging part of the competition.
The competition’s organisers provided background information on the data and sample code. This helped first-time Kagglers (about 25% of the participants), who said that ‘it was easy to get started’. The example code was provided in two programming languages, Python and R, which were also the two main tools used by the participants. Additional information and alternative modelling strategies were discussed in the forum, which the majority of participants found useful.
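As an illustration of the missing-value problem that many participants singled out, the sketch below fills gaps in a daily series by linear interpolation between the nearest observed days. It is one simple option among many and is not taken from the sample code or from any submission; the function name `fill_gaps` and the use of `None` to mark a missing day are assumptions made for the example.

```python
def fill_gaps(values):
    """Return a copy of a daily series with None entries filled in.

    Interior gaps are linearly interpolated between the nearest observed
    neighbours; leading and trailing gaps copy the nearest observed value.
    """
    filled = list(values)
    known = [i for i, v in enumerate(filled) if v is not None]
    if not known:
        raise ValueError("series has no observed values")

    # Gaps before the first or after the last observation: copy that value.
    for i in range(known[0]):
        filled[i] = filled[known[0]]
    for i in range(known[-1] + 1, len(filled)):
        filled[i] = filled[known[-1]]

    # Interior gaps: interpolate between each pair of consecutive observations.
    for left, right in zip(known, known[1:]):
        span = right - left
        for i in range(left + 1, right):
            weight = (i - left) / span
            filled[i] = filled[left] + weight * (filled[right] - filled[left])
    return filled
```

Whether interpolation is appropriate depends on how long the gaps are and on the variable concerned. For long gaps in pollutant concentrations, other choices (such as leaving the values missing and relying on a model that tolerates them) may be preferable.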
Promoting data use
The Kaggle competition is one of several outreach activities to promote data from ECMWF and the EU’s Copernicus Earth observation programme. It was launched on the last day of the Open Data Week held at ECMWF earlier this year. Some of the public datasets provided by the ECMWF-run Copernicus Atmosphere Monitoring Service (CAMS regional reanalysis) were used to assemble the predictors (temperature and pollutant concentrations) in the competition’s training and testing data. The outcome variable was obtained from mortality and population counts for the English regions provided by the UK’s Office for National Statistics.
The competition was designed to reach out to a new audience: the rapidly growing number of data scientists and data enthusiasts across the world. Introducing some of the ECMWF and Copernicus open datasets to a wider audience could help to unleash their potential and encourage their use in fields not directly related to weather and climate. At the same time, showcasing machine learning approaches, which are domain agnostic, provided a novel perspective and generated interest among ECMWF staff and the weather and climate community.
The design of a data science competition is, however, not straightforward. Kaggle provides an infrastructure for competitions at different levels of complexity. At the lower end are Kaggle in Class competitions, in which both training and testing data can be in the public domain and there is no monetary prize. We opted for this category for two reasons: we only used open datasets and we wanted to ensure the competition was beginner friendly. In the future, we also plan to run advanced competitions with monetary prizes. These generally draw more attention, especially from expert Kagglers, but they can only be set up using private datasets for the outcome variable. Suggestions are welcome!