
Upgrade makes Fields Database more resilient

Tiago Quintino, Simon Smart, Baudouin Raoult, Manuel Fuentes, Matthias Zink, Sebastien Villaume, John Hodkinson, Anna Mueller-Quintino, Axel Bonet-Cassagneau, Oliver Treiber, Christian Weihrauch

A major upgrade of ECMWF’s Fields Database (FDB) software library makes the short-term storage of meteorological fields much more resilient and flexible, thus minimising delays in the dissemination of forecasts. The change to the new version (FDB5) for time-critical operations was implemented as part of the upgrade of the Integrated Forecasting System (IFS) to IFS Cycle 46r1, described in this Newsletter. The FDB is an internally provided service, used as part of ECMWF’s weather forecasting software stack. It operates as a domain-specific object store, designed to store, index and serve meteorological fields produced by the IFS. It acts as the first level of storage for recently created objects: it efficiently receives all model output and derived post-processing fields and makes them available to the post-processing tasks in the forecast pipeline, as well as to users.
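The idea of a domain-specific object store can be illustrated with a short sketch: fields are indexed by a dictionary of MARS-like metadata keys (parameter, date, step, and so on) rather than by file paths. The class and method names below are purely illustrative and are not the real FDB API.

```python
# Minimal sketch of a domain-specific object store in the spirit of the FDB.
# Fields are addressed by scientific metadata, not by filenames; the store
# canonicalises the key so that the order of metadata items does not matter.

class FieldStore:
    """Stores binary field data under a canonical metadata key (illustrative)."""

    def __init__(self):
        self._index = {}  # canonical key -> field bytes

    @staticmethod
    def _canonical(key: dict) -> tuple:
        # Sort the metadata items so lookups are order-independent.
        return tuple(sorted(key.items()))

    def archive(self, key: dict, data: bytes) -> None:
        self._index[self._canonical(key)] = data

    def retrieve(self, key: dict) -> bytes:
        return self._index[self._canonical(key)]


store = FieldStore()
store.archive({"param": "2t", "step": 0, "date": "20190611"}, b"GRIB...")
# The same field is found regardless of the order of the request keys.
field = store.retrieve({"date": "20190611", "step": 0, "param": "2t"})
```

In the real system, such requests use MARS keywords and the data are GRIB-encoded fields; the sketch only shows the indexing principle.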

The FDB serves as a ‘hot-object’ cache inside the high-performance computing facility (HPCF) for the Meteorological Archival and Retrieval System (MARS). MARS makes many decades of meteorological observations and forecasts available to a wide range of end users and operational systems. Around 80% of MARS requests are served from the FDB directly, typically for very recently produced data. A subset of this data is later re-aggregated and archived into the permanent archive for long-term availability.

Every day, more than 200 TiB of data is written to the FDB and some 370 TiB is read from it, across both core operations and research activities. More than 100 TiB of this data is then moved to MARS for archiving. At any given time, the total content of the operational FDB is estimated to be between 4 and 5 PiB.

The place of the FDB in ECMWF’s supercomputing and data handling infrastructure. The FDB works as a data cache layer between the HPCF and MARS. Most data requests are handled from the FDB, with only a few passing through to the MARS disk cache (HDD cache) and HPSS tape system.

The fourth version of the FDB (FDB4) was one of the most venerable pieces of software still in operation, with code containing references to the original Cray machine from more than 20 years ago! As such, it had quirks that showed its age and restricted the operational forecast system. For example, FDB4 did not handle some error conditions well, which could lead to corruption of the indexing information or the data itself. Furthermore, only one forecast component could write to each database within FDB4 at any one time without risking index corruption, which placed severe constraints on the design of the forecast pipeline and suites.

Main benefits

To address these deficiencies, development of FDB5 began in 2015. This version brings several improvements:

  1. Most importantly, FDB5 is now a transactional store designed in line with ACID (Atomicity, Consistency, Isolation, Durability) database principles. This means it is robust and resilient to failures. A model or system failure does not corrupt existing data, and the model is able to restart where it stopped. This saves critical minutes in the event of a catastrophic model crash and restart, minimising forecast delivery delays.
  2. By improving the consistency semantics, some restrictions on ECMWF’s workflows have been lifted, enabling more flexible suite design. For example, FDB5 is no longer restricted to a single serial writer per database.
  3. FDB5 supports direct-to-MARS archiving, which will reduce overall congestion of the HPCF storage system by avoiding the creation of intermediate files when archiving to MARS. Moreover, FDB5 has stronger verification of the data and disallows fields which are not recognised by MARS.
  4. FDB5 separates access via a configurable front-end API from storage back-ends. This creates a great deal of flexibility to develop and configure the system to make use of new storage technologies and paradigms without having to explicitly modify the forecasting workflow.
  5. FDB5 should have slightly better performance for the same hardware thanks to an improved indexing scheme for data retrievals, although this improvement is likely to be noticeable for very large datasets only, such as ECMWF’s operational forecasts.
  6. FDB5 makes the I/O software stack more flexible and adaptable to new technologies, such as object-stores and non-volatile storage class memories (NVRAM).
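The atomicity property in the first point can be sketched with a classic POSIX technique: write the data to a temporary file and make it visible only through an atomic rename, so that a crash mid-write never leaves a partially written field in place. This is an illustrative pattern for failure-resilient writes under the stated assumptions, not the actual FDB5 implementation.

```python
# Illustrative sketch of an all-or-nothing write on a POSIX filesystem.
# Readers either see the complete new file or no file at all, because
# rename() within one filesystem is atomic.

import os
import tempfile


def atomic_write(path: str, data: bytes) -> None:
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)  # temp file on the same filesystem
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # ensure the bytes reach the disk
        os.rename(tmp, path)      # atomic: the file appears fully written, or not at all
    except BaseException:
        os.unlink(tmp)            # a failed write leaves no partial data behind
        raise
```

A crash before the `rename` leaves the target untouched, which is the behaviour that lets a restarted model resume where it stopped without first repairing corrupted data.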

Bringing the newly developed FDB5 into operational use took some time. Existing components of the forecast pipeline depended on and were hard-wired to use FDB4. In particular, the previous Product Generation software (ProdGen) needed to be retired and replaced with a newer system (pgen) before FDB5 could be brought into time-critical operations. Although improved performance was not necessarily the main goal of the development, the adoption of FDB5 into time-critical operations, along with corresponding changes to the software pipeline, has halved the time spent on product generation I/O.