ECMWF has developed a distributed computation service for scalable post-processing, called Hermes. The new service will support ECMWF’s Scalability Programme by bringing computations closer to the data, thus saving bandwidth and reducing the need for client-side processing power.
Users have been requesting data from ECMWF's Meteorological Archiving and Retrieval System (MARS) for over 30 years. Internal users can access MARS via a command-line client, which pulls data from the archive to the user's workstation and performs any requested post-processing locally. External users, on the other hand, frequently use the MARS Web API, where the post-processing is carried out on one of the machines serving ECMWF's web infrastructure and the processed data is sent to the user via HTTP.
With increases in resolution and in light of the strategic importance placed on ensembles, transferring entire fields to the client for post-processing can involve large data volumes and waste bandwidth, especially when only a small sub-area or a single point is extracted from a long time series. More control over where the post-processing happens is therefore required, since clients may not have the necessary processing power or the post-processing may take too long.
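A back-of-envelope calculation illustrates the scale of the saving. The grid sizes and the 4-byte value size below are illustrative assumptions, not ECMWF figures; the point is the ratio between a full field and the sub-area a user actually wants.

```python
BYTES_PER_VALUE = 4  # single-precision float, assumed for illustration

def field_size_bytes(nlon: int, nlat: int) -> int:
    """Uncompressed size of one gridded field."""
    return nlon * nlat * BYTES_PER_VALUE

# A hypothetical global 0.1-degree regular lat-lon grid.
full_field = field_size_bytes(3600, 1801)

# A 10 x 10 degree sub-area on the same grid.
sub_area = field_size_bytes(100, 101)

savings = 1 - sub_area / full_field
print(f"full field: {full_field / 1e6:.1f} MB, "
      f"sub-area: {sub_area / 1e3:.1f} kB, "
      f"bandwidth saved: {savings:.1%}")
```

For a long time series the saving multiplies by the number of fields transferred, which is why extracting the sub-area server-side pays off.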
ECMWF's Scalability Programme, launched in 2014, recognises the need to bring computations closer to the data. As part of this programme, the Hermes project has developed a distributed computation service to complement MARS while using a similar architecture. Hermes has focused on post-processing fields using the new interpolation package MIR. The successful conclusion of the project means that the Hermes package can be part of future data processing system developments at ECMWF.
Hermes employs a broker–worker model of distributed computing, in which a client sends a request to a broker, which queues all incoming requests for dispatch to a worker. All communication is done via the Transmission Control Protocol (TCP) using specially encoded messages. The system is transactional and automatically recovers from failures by resubmitting any request that failed to be processed, e.g. because a worker node crashed or a network connection was lost. Workers are ‘stateless’, which means that they process any work item independently. They can connect or disconnect from the broker at any time, allowing the system to be scaled up and down dynamically based on load and available computational resources. Hermes is not tied to any particular architecture, network topology, interconnect or even a shared file system and has been successfully deployed on desktops, ECMWF's high-performance computing facility, a dedicated cluster and in a cloud service.
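The broker–worker pattern described above can be sketched in a few lines. This is a minimal in-process illustration, not Hermes code: threads and a shared queue stand in for TCP connections and encoded messages, so the queuing, stateless workers and resubmission-on-failure behaviour are easy to see. All class and function names are invented for the sketch.

```python
import queue
import threading

class Broker:
    """Queues incoming requests and collects results from workers."""
    def __init__(self):
        self.pending = queue.Queue()  # requests waiting for any worker
        self.results = {}

    def submit(self, request_id, payload):
        self.pending.put((request_id, payload))

    def resubmit(self, request_id, payload):
        # Transactional recovery: a request that failed to be processed
        # (e.g. the worker crashed) simply goes back on the queue.
        self.pending.put((request_id, payload))

    def complete(self, request_id, result):
        self.results[request_id] = result

def worker(broker, fail_first=False):
    """Stateless worker: processes any item independently, then asks for more."""
    failed_once = False
    while True:
        try:
            request_id, payload = broker.pending.get(timeout=0.5)
        except queue.Empty:
            return  # no more work; a real worker could disconnect here
        if fail_first and not failed_once:
            failed_once = True
            broker.resubmit(request_id, payload)  # simulate a mid-job crash
            continue
        broker.complete(request_id, payload.upper())  # stand-in for post-processing

broker = Broker()
for i in range(3):
    broker.submit(i, f"field-{i}")

# Two workers connect; one is rigged to fail on its first item.
threads = [threading.Thread(target=worker, args=(broker, i == 0)) for i in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(broker.results)  # all requests complete despite the simulated failure
```

Because workers hold no state between items, scaling up is just starting more `worker` threads (or, in the real system, connecting more worker processes to the broker), and scaling down is letting them exit.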
Hermes leverages much of the software stack developed and maintained at ECMWF. By building on established MARS technology, Hermes is able to provide the same Quality of Service (QoS) as MARS, and similar resilience characteristics.
Hermes as an interpolation service
To address the need for scalable post-processing outlined above, a prototype was initially developed in which Hermes provides an interpolation service. Support for the MARS protocol (client and server) has been added, allowing the Hermes broker to fully integrate with the MARS server and to respond to requests from the widely deployed MARS client that users are familiar with. As illustrated in the diagram, the crucial difference from the previous setup is that any post-processing is performed server-side by the Hermes worker and only the processed data is sent back to the user. This cuts bandwidth requirements, brings the processing closer to the data and enables data delivery to thin clients (such as Python).
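From the thin client's point of view, post-processing is just extra keywords in an ordinary MARS-style request; the server decides where the work runs. The sketch below renders such a request in MARS's familiar `verb, key = value` text form. The `build_request` helper and the specific keyword values are illustrative, not a real client API.

```python
def build_request(**keywords):
    """Render a request in MARS's 'verb, key = value' text form (sketch only)."""
    lines = ["retrieve"] + [f"    {k} = {v}" for k, v in keywords.items()]
    return ",\n".join(lines)

request = build_request(
    **{"class": "ea"},      # 'class' is a Python keyword, hence this spelling
    param="2t",             # 2-metre temperature
    date="2020-01-01",
    grid="0.25/0.25",       # target grid: interpolation happens server-side
    area="60/-10/50/2",     # N/W/S/E sub-area: only this region is returned
)
print(request)
```

The `grid` and `area` keywords are the ones a Hermes worker would act on server-side, so the client receives only the small, already-processed field.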
Hermes’s first deployment will be together with MARS, to serve interpolated ERA5 climate reanalysis data for the Climate Data Store, where Hermes workers will use the new interpolation package MIR (ECMWF Newsletter No. 152, pp 36–39) to process fields. The workers will be collocated with MARS data movers to minimise unnecessary data movement and optimise resource usage.