Data archive growth: Escaping from the black hole

Paul Burton, Bentorey Hernandez Cruz


Continuing to increase the size of ECMWF’s data archive at historical rates is becoming unsustainable due to rising costs. To limit future growth, the Centre has developed a tool to facilitate the removal of obsolete data, and it will take measures to reduce the amount of data written into the archive.

Archive growth

The data archive, comprising the MARS (Meteorological Archival and Retrieval System) database for structured meteorological data, and ECFS (ECMWF File Storage) for all other data, has been an indispensable part of the computational infrastructure of ECMWF for many years. It allows researchers to easily examine, share and compare results from their experiments, supporting the continual improvement of the Integrated Forecasting System, whilst also providing an invaluable long‐term archive of operational forecast and reanalysis data for use both inside and outside ECMWF.

Thanks to the good design and implementation of the data archive system, there has been an almost irresistible temptation to treat it as a black hole for data, with users simply adding ever-increasing amounts to the archive every year. The graph shows the amount of data users have written into the archive over the past seven years. The growth of the archive is exponential, increasing at a rate of around 45% per year, driven by the ever-increasing computational power of ECMWF's high-performance computing (HPC) resources. Historically, ECMWF accommodated this growth by simply buying more tapes and associated hardware. Tape technology grew cheaper at a rate corresponding to the increase in HPC performance per dollar, so the cost of the archive relative to the HPC resource stayed stable. In recent years, however, the cost of archive storage has almost stagnated, so increasing the size of the archive at historical rates is becoming unsustainable within existing budgets.
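To put a 45% annual growth rate in perspective, a quick calculation shows how fast an exponentially growing archive doubles. This is an illustration assuming the rate holds constant, not a forecast:

```python
import math

growth_rate = 0.45  # roughly 45% growth per year, as observed historically

# Doubling time for exponential growth: t = ln(2) / ln(1 + r)
doubling_years = math.log(2) / math.log(1 + growth_rate)
print(f"Archive doubles roughly every {doubling_years:.1f} years")

# Size relative to today after a decade at the same rate
factor_10y = (1 + growth_rate) ** 10
print(f"After 10 years the archive would be about {factor_10y:.0f}x its current size")
```

At 45% per year the archive doubles roughly every 1.9 years, and a decade of such growth multiplies it about 41-fold, which makes clear why buying more tapes cannot keep pace once storage prices stop falling.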

Data archive growth. The increase in the growth of ECMWF’s data archive has been driven by three areas: research, the ERA-Interim and ERA5 weather and climate reanalyses, and operations.

New tool

To address this unaffordable growth, two big purges have taken place in the research data archive over the last four years, which together have removed over 100 PiB of data. The total archive, however, is still growing at an unsustainable rate, so further action is now being taken.

We have developed a Data Lifetime Management (DLM) Tool which allows research users to classify all their experimental data according to its expected use (e.g. ‘test’/‘long‐term reference’/‘publication’). Each classification has a lifetime associated with it, and when a given dataset has exceeded its lifetime, its owner is prompted to either delete it or reclassify it. The tool is under active development to give users more information about their data and how it is being used, which will allow them to manage its lifetime effectively.
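The classify-then-expire logic described above can be sketched in a few lines. The class names, lifetimes and function below are illustrative assumptions for this article, not the DLM Tool's actual categories or interface:

```python
from datetime import date, timedelta

# Illustrative lifetimes per classification; the DLM Tool's real
# categories and durations may differ.
LIFETIMES = {
    "test": timedelta(days=90),
    "publication": timedelta(days=365 * 2),
    "long-term reference": timedelta(days=365 * 5),
}

def needs_review(classification: str, created: date, today: date) -> bool:
    """Return True when a dataset has outlived its classification's
    lifetime, so its owner should be prompted to delete or reclassify it."""
    return today - created > LIFETIMES[classification]

# A 'test' experiment archived six months ago has exceeded its
# 90-day lifetime, so its owner would be prompted.
print(needs_review("test", date(2024, 1, 1), date(2024, 7, 1)))  # True
```

The key design point is that the lifetime is attached to the classification rather than to each dataset individually, so a user's single choice at archiving time determines when they will next be asked about the data.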

Deleting old data is only a temporary solution, however; the fundamental challenge is to reduce the amount of data written into the archive in the first place. To this end, an Archive Working Group has been convened to consider how this can be achieved, addressing all users of the ECMWF archive, including research users, Copernicus and reanalysis activities, and our operational forecasting system.

The changes in behaviour prompted by the DLM Tool and the Archive Working Group will mean an end to treating the data archive as a black hole for data. There will be difficult decisions to make on the journey to reducing the amount of data written to the archive, but the reward will be an affordable and sustainable data archive that can continue to support the activities of ECMWF and its Member States for many years to come.