How to make Earth science workflows more reproducible

20 May 2019


ECMWF will hold an international workshop from 14 to 16 October 2019 at its headquarters in Reading, UK, to investigate how workflows in Earth sciences can be made more reproducible.

“The issue of reproducibility affects everyone working in the field,” says Claudia Vitolo, one of the workshop organisers.

“In our research experiments and output, we use coding and procedures tailor-made for what we want to achieve. We are pleased when others want to reproduce our results or try something similar. But there may be a hitch: what we have done may not be easily shareable or reproducible.”

One of the questions to be addressed at the workshop is which technologies and best coding practices to use to minimise such problems.

Similar issues can arise within a single research centre, for instance when old data, computer code or workflows have to be moved to a new platform, such as a new high-performance computing facility.

The case of wildfire emissions

Mark Parrington works on wildfire emissions for the EU-funded Copernicus Atmosphere Monitoring Service (CAMS) implemented by ECMWF.

In late 2015, he was monitoring the large-scale emissions of carbon in Indonesia resulting from wildfires linked to the dry conditions associated with the widely reported strong El Niño event.

After he posted one of his plots on Twitter, he was contacted by a researcher from Thailand who wanted to produce similar output locally.


Time series of total carbon emissions from Indonesia according to CAMS Global Fire Assimilation System (GFAS) data from 1 August to 30 November 2015, shown together with average and minimum/maximum emissions for the previous available years (2003–2014).

It turned out that there were some obstacles to overcome. “The entire GFAS dataset, going back to 2003, was too voluminous to download and store, and at the time no generic, open-source script was available to manipulate and plot the data for particular countries or regions,” Mark says.

ECMWF scientists worked to resolve these issues, and today the Forestry Research Center at Kasetsart University, Thailand, uses CAMS Global Fire Assimilation System (GFAS) data in its forecasts and analysis of wildfires in the region.
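To illustrate the kind of generic script that was missing, here is a minimal Python sketch of the two steps behind a plot like the one described above: summing gridded values inside a latitude/longitude box, and computing the average and minimum/maximum for each day across a set of reference years. It is not the code used by CAMS. The function names, the bounding box and the numbers are invented for illustration, and the sketch assumes each grid value is already a total for its cell, so it ignores the area weighting a real emissions calculation would need.

```python
from statistics import mean


def regional_total(field, lats, lons, bbox):
    """Sum a 2-D field over the grid cells whose centres fall inside bbox.

    field: list of rows, one row per latitude in `lats`.
    bbox:  (south, north, west, east) in degrees (hypothetical convention).
    """
    south, north, west, east = bbox
    total = 0.0
    for i, lat in enumerate(lats):
        if not south <= lat <= north:
            continue
        for j, lon in enumerate(lons):
            if west <= lon <= east:
                total += field[i][j]
    return total


def envelope(series_by_year, reference_years):
    """Return (mean, min, max) for each day across the reference years.

    series_by_year maps a year to a list of daily regional totals;
    all reference years must cover the same days in the same order.
    """
    days = zip(*(series_by_year[year] for year in reference_years))
    return [(mean(day), min(day), max(day)) for day in days]


# Invented 3x3 grid and three-day series, for illustration only.
lats = [10.0, 0.0, -10.0]
lons = [90.0, 100.0, 110.0]
field = [[1.0, 2.0, 3.0],
         [4.0, 5.0, 6.0],
         [7.0, 8.0, 9.0]]
print(regional_total(field, lats, lons, (-10.0, 0.0, 100.0, 110.0)))

series_by_year = {
    2003: [1.0, 2.0, 3.0],
    2004: [3.0, 2.0, 1.0],
    2005: [2.0, 5.0, 2.0],
}
print(envelope(series_by_year, [2003, 2004, 2005]))
```

In practice the daily series for the year of interest would be plotted against this envelope. Keeping the region and the reference period as parameters is what lets the same script serve a different country or a different set of years.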

Moving tools to the data

Claudia believes that, in many cases, moving the tools to the data rather than the data to the tools can be part of the solution. The concept underlies the design of the Climate Data Store managed by the EU-funded Copernicus Climate Change Service (C3S) implemented by ECMWF.

The Climate Data Store includes a toolbox which enables users to manipulate and visualise the raw data they are interested in without having to download them. A similar data store for atmospheric composition is being developed by CAMS.


The C3S Climate Data Store enables users to transform data into information without first downloading large volumes of raw data.

Cloud computing can also make it easier to offer large volumes of data, as well as tools, to a wide range of users without any large-scale data transfers.

“But ensuring reproducibility across clouds is a complex task because different cloud services work in different ways,” says Claudia.

“Reusing existing code on different platforms and leveraging high-performance computing and cloud facilities to handle large data volumes are just some examples of circumstances in which robust reproducible code is paramount. Our aim in running this workshop is to enable the sharing of experiences gained in different environments so that we can learn from each other.”

Workshop details

The workshop ‘Building reproducible workflows for Earth sciences’ will take place at ECMWF’s headquarters in Reading, UK, from 14 to 16 October 2019.

The registration deadline is 30 August 2019.

Keynote speakers include Carol Willing (Willing Consulting) on JupyterLab and JupyterHub; Ana Trisovic (CERN, University of Chicago) on data preservation and reproducibility at CERN and climate data reproducibility and openness; and Tamsin Edwards (King’s College London) on challenges of reproducible research in academia.

For more details and to register, please visit the workshop page.