Peter Dueben, coordinator of AI and machine learning activities at ECMWF and coordinator of the MAELSTROM project
The MAchinE Learning for Scalable meTeoROlogy and climate (MAELSTROM) project is funded under the EuroHPC Joint Undertaking (grant agreement No 955513) and coordinated by ECMWF. The three-year project started on 1 April 2021 and aims to help prepare the weather and climate community for large-scale machine learning applications.
Machine learning continues to be a hot topic for Earth system sciences. Machine learning tools promise to improve both the extraction of relevant information from large datasets and the learning of complex, non-linear systems, across many applications. However, as outlined in ECMWF’s machine learning roadmap (see also the seminar), the growth of machine learning applications also comes with challenges for weather and climate prediction centres such as ECMWF. For example, domain scientists and machine learning scientists are used to different tools and infrastructure (Fortran on CPUs versus Python on GPUs).
One of the areas in which machine learning will arguably have the most significant impact is how high-performance computing (HPC) is done. Artificial intelligence (AI) is a multi-trillion-dollar market, a multiple of the value of the entire supercomputing market. Therefore, machine learning, and deep learning in particular, is changing the landscape of HPC, and it is difficult to overestimate the impact that AI will have on hardware developments in the coming years, and indeed is already having today.
While customised processors are being developed for deep learning applications, such as the Tensor Processing Unit (TPU) of Google or customised AI chips by Cerebras, commodity hardware for the general HPC market will also include accelerators for deep learning, such as the Tensor Core accelerator on NVIDIA Volta GPUs. This specialised hardware is typically optimised for dense linear algebra calculations at low numerical precision and will likely allow for significant improvements in performance and cost reductions for applications that can make use of this arithmetic. Consequently, weather and climate scientists should have an interest in exploring the new capabilities and learning how to use this hardware for their needs.
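To give a feel for what "dense linear algebra at low numerical precision" means in practice, the following is a minimal sketch in pure Python. It rounds the inputs of a matrix product to IEEE half precision and compares the result with a full double-precision product. The function names (`to_half`, `matmul`, `normalised_max_error`, `random_matrix`) are hypothetical and chosen for illustration only; the code emulates rounding in software and does not use or model any specific accelerator. Accumulation stays in double precision here, which loosely mirrors the common mixed-precision pattern of low-precision inputs with higher-precision accumulation.

```python
import random
import struct
from typing import List

Matrix = List[List[float]]


def to_half(x: float) -> float:
    """Round x to the nearest IEEE 754 half-precision (16-bit) value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def matmul(a: Matrix, b: Matrix, half_inputs: bool = False) -> Matrix:
    """Naive dense matrix product, optionally with inputs rounded to half precision.

    Products and sums are still accumulated in Python's double precision.
    """
    if half_inputs:
        a = [[to_half(v) for v in row] for row in a]
        b = [[to_half(v) for v in row] for row in b]
    columns = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in columns] for row in a]


def normalised_max_error(reference: Matrix, approx: Matrix) -> float:
    """Largest absolute difference, divided by the largest reference magnitude."""
    scale = max(abs(v) for row in reference for v in row)
    diff = max(abs(r - p) for rr, pr in zip(reference, approx) for r, p in zip(rr, pr))
    return diff / scale


def random_matrix(n: int, rng: random.Random) -> Matrix:
    """Square matrix with entries drawn uniformly from [-1, 1] (illustrative data)."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(n)] for _ in range(n)]


if __name__ == "__main__":
    rng = random.Random(0)
    a, b = random_matrix(32, rng), random_matrix(32, rng)
    error = normalised_max_error(matmul(a, b), matmul(a, b, half_inputs=True))
    print(f"normalised max error with half-precision inputs: {error:.2e}")
```

Whether an error of this size is acceptable depends on the application, which is why precision and solution accuracy need to be assessed together for each use case.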
Furthermore, the complexity of machine learning tools can be increased arbitrarily, so the accuracy of machine learning methods is limited only by the quality and amount of data available for training and by the computational and data handling infrastructure (both software and hardware). Weather and climate science is data-rich, and for many application areas an almost unlimited amount of training data could be generated (for example, for the emulation of model components using deep learning). It is therefore of interest to develop large-scale machine learning solutions that have many millions of trainable parameters, take the three-dimensional state of the global atmosphere as input, are trained on many terabytes of data, and require the use of large supercomputers.
Here, the MAELSTROM project will make significant contributions. In a first step, MAELSTROM will explore six promising machine learning applications in weather and climate science that cover all important components of the weather and climate prediction workflow. The applications include the blending of citizen observations and social media data with numerical weather forecasts, the use of neural network emulators to speed up weather forecast models and data assimilation, improvements of local weather predictions via forecast post-processing, and bespoke weather forecasts to support energy production in Europe. For each application, benchmark datasets will be published online for training and development of machine learning tools.
In a second step, MAELSTROM will design a software framework to enable scientists to apply and compare machine learning tools and libraries efficiently across a wide range of computer systems. A user interface will link application developers with compute system designers. Automated benchmarking and error detection of machine learning solutions will be performed during the development phase.
In a third step, MAELSTROM will benchmark compute system designs for the different applications across a range of computing systems with regard to energy consumption, time-to-solution, numerical precision and solution accuracy. Customised compute systems, optimised for application needs, will be designed to strengthen Europe’s high-performance computing portfolio and to pull recent hardware developments, driven by general machine learning applications, toward the needs of weather and climate applications.
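As an illustration of what benchmarking several of these criteria together can look like, here is a toy sketch in pure Python. It is not MAELSTROM's benchmarking framework: the workload (a one-dimensional diffusion stencil), the names `diffuse`, `run_benchmark` and `Result`, and the choice of root-mean-square error as the accuracy metric are all assumptions made for this example. Energy consumption is omitted because measuring it requires hardware counters that the standard library does not expose.

```python
import math
import struct
import time
from dataclasses import dataclass
from typing import Callable, Dict, List


def to_half(x: float) -> float:
    """Round x to the nearest IEEE 754 half-precision (16-bit) value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def diffuse(state: List[float], steps: int, rounding: Callable[[float], float]) -> List[float]:
    """Toy workload: explicit 1D diffusion on a periodic grid.

    The state is passed through `rounding` after every step to emulate
    storing it at a given numerical precision.
    """
    n = len(state)
    u = [rounding(v) for v in state]
    for _ in range(steps):
        u = [rounding(u[i] + 0.25 * (u[i - 1] - 2.0 * u[i] + u[(i + 1) % n]))
             for i in range(n)]
    return u


@dataclass
class Result:
    name: str
    time_to_solution_s: float  # wall-clock time of one run
    rmse: float                # error against the double-precision reference


def run_benchmark(configs: Dict[str, Callable[[float], float]],
                  state: List[float], steps: int) -> List[Result]:
    """Run the workload once per precision and record time and accuracy."""
    reference = diffuse(state, steps, float)  # float() leaves doubles unchanged
    results = []
    for name, rounding in configs.items():
        start = time.perf_counter()
        solution = diffuse(state, steps, rounding)
        elapsed = time.perf_counter() - start
        rmse = math.sqrt(sum((s - r) ** 2 for s, r in zip(solution, reference))
                         / len(reference))
        results.append(Result(name, elapsed, rmse))
    return results


if __name__ == "__main__":
    initial = [math.sin(2.0 * math.pi * i / 64) for i in range(64)]
    for result in run_benchmark({"float64": float, "float16": to_half}, initial, 50):
        print(f"{result.name}: {result.time_to_solution_s:.4f} s, RMSE {result.rmse:.2e}")
```

Because the half-precision rounding is emulated in software, the timings here say nothing about real accelerator performance. The sketch only shows the shape of a benchmark record that pairs time-to-solution with accuracy for each precision.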
Figure 1: MAELSTROM’s co-design cycle.
The three steps outlined above will form a co-design cycle, shown in Figure 1, that will allow for feedback between application, software, and hardware developments. The main outcomes of the project are large-scale machine learning applications for the domain of weather and climate science, a software framework to optimise usability and training efficiency for machine learning at scale, and bespoke compute system designs for optimal application performance and energy efficiency. MAELSTROM’s machine learning solutions will serve as a blueprint for a wide range of machine learning applications on supercomputers in the future.
Figure 2: The MAELSTROM family.
To realise this ambitious plan, MAELSTROM will draw on the capabilities of a multi-disciplinary consortium (see Figure 2). The coordination by ECMWF will allow for tight interactions with one of the world’s leading global weather prediction centres and the Copernicus Atmosphere Monitoring and Climate Change Services.
The Norwegian Meteorological Institute is one of the first meteorological services exploring Internet-of-Things (IoT) data to improve forecasts.
The group of Torsten Hoefler at ETH is one of Europe's leading teams for supercomputing applications, HPC software, and diagnostics. They are also driving the DEEP500 machine learning inter-comparison project.
The University of Luxembourg participates in MAELSTROM via the NVIDIA AI Technology Center Luxembourg, which will provide essential know-how in the optimisation of machine learning techniques for both solution quality and HPC efficiency and performance modelling.
4cast is one of the very few companies that already use machine learning for weather predictions operationally, generating local wind predictions to advise wind farms. 4cast also brings considerable expertise in the development of workflow tools to the consortium.
E4 is one of the few European companies that has recently deployed a TOP500 system.
Finally, Juelich maintains some of the largest supercomputers in Europe with leading groups in compute system design, and the group of Martin Schultz is coordinating several large machine learning projects (e.g. DeepRain and IntelliAQ) in the weather and climate domain.
MAELSTROM will be an exciting opportunity to improve machine learning applications for weather and climate science and to make a significant contribution to machine learning in HPC – a scientific domain that is currently moving at breath-taking speed.
The MAELSTROM project has received funding from the European High-Performance Computing Joint Undertaking (JU) under grant agreement No 955513. The JU receives support from the European Union’s Horizon 2020 research and innovation programme and from the United Kingdom, Germany, Italy, Luxembourg, Switzerland and Norway.