(Left to right): Patricia de Rosnay, ECMWF Coupled Assimilation Team Leader. Filipe Aires, Director of Research CNRS/LERMA (Laboratoire d’Etudes du Rayonnement et de la Matière en Astrophysique et Atmosphères), France. Pete Weston and David Fairbairn, ECMWF Scientists, Coupled Assimilation.
We have been working with Dr Filipe Aires from CNRS/LERMA (Laboratoire d’Etudes du Rayonnement et de la Matière en Astrophysique et Atmosphères) to investigate neural network (NN) approaches for land data assimilation. The collaboration began in February 2020 when Pete Weston spent a week working with Filipe in Paris.
Initial results from the ongoing collaboration are very promising, showing that an NN model can outperform currently used inversion techniques in some areas of the globe. The work will help to inform long-term strategies for the assimilation of satellite-derived observations of surface variables into land surface models.
Current approaches to assimilation of satellite-derived observations
The modern approach in data assimilation is to assimilate level 1 (L1) measurements where possible. However, this requires a fast and accurate observation operator to produce simulated observation equivalents using short-range model forecast variables as inputs. For scatterometer instruments (such as Metop-B ASCAT, used to derive soil moisture and other variables) the L1 measurement is backscatter, and simulating backscatter measurements is still challenging over heterogeneous land surfaces.
Therefore, in the ECMWF Integrated Forecasting System (IFS) we currently assimilate level 2 (L2) scatterometer-derived soil moisture, rescaled to volumetric soil moisture using a cumulative distribution function (CDF) matching approach. Assimilating L2 products, which often directly represent model variables, avoids the need for an observation operator. However, L2 products can have complex and unknown error statistics and correlations, which are introduced by the retrieval algorithms and are difficult to represent accurately.
Neural network approaches to link satellite observations and model variables
In this study we looked at NN approaches to represent the relation between scatterometer measurements and surface soil moisture, which are compared to state-of-the-art, operational approaches.
Figure 1 shows a schematic of the various methods including combinations of NN and CDF-matching with both forward (options 4 and 5) and inverse (options 1, 2 or 3) approaches. Initial results show that inverse modelling performs better than forward modelling.
A combined approach, using the global perspective of an NN model with the localised properties of a traditional CDF-matching, also shows promising results. This should be investigated further in the future, in the context of the EUMETSAT Satellite Application Facilities on Support to Operational Hydrology and Water Management (H SAF) and further collaboration between Filipe Aires and ECMWF.
Figure 1: Schematic showing the various combined methodologies tested in this study. σ40 is ASCAT equivalent backscatter at 40 degree incidence angle. From Aires et al, 2021.
Preparing input data for neural network training
The observations used were Metop-B ASCAT backscatter measurements. To reduce the effect of variable vegetation on the observations, the raw backscatter measurements, made at several incidence angles and by three separate radar beams, are linearly regressed to produce a single equivalent backscatter measurement at a 40 degree incidence angle.
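The regression to a 40 degree reference angle can be illustrated with a minimal sketch. It assumes an ordinary least-squares straight line through (incidence angle, backscatter) pairs for one location, with angles centred on 40 degrees so that the intercept is the equivalent backscatter. The function name and the sample values are invented for illustration; the exact regression used in the study may differ.

```python
def sigma40_from_beams(angles_deg, sigma_db, ref_angle=40.0):
    """Fit a least-squares line to (angle, backscatter) pairs.

    Returns the backscatter evaluated at ref_angle and the fitted slope.
    Hypothetical helper, not the operational ASCAT processing.
    """
    n = len(angles_deg)
    # Centre the angles on the reference angle so the intercept is sigma40.
    x = [a - ref_angle for a in angles_deg]
    mean_x = sum(x) / n
    mean_y = sum(sigma_db) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, sigma_db))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Invented measurements (dB) that follow sigma = -10.0 - 0.1 * (angle - 40).
angles = [34.0, 45.0, 55.0, 36.0, 47.0, 58.0]
sigma = [-9.4, -10.5, -11.5, -9.6, -10.7, -11.8]
sigma40, slope = sigma40_from_beams(angles, sigma)
```

Centring on the reference angle keeps the quantity of interest in a single fitted coefficient, which makes the result easy to inspect.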
Quality control of the observations is important to avoid measurements prone to gross errors. In this case we avoid wetland, mountainous and snow-covered areas where the link between backscatter measurements and soil moisture is weaker.
The observed variables are then interpolated to model locations. In future work the sensitivity of the NN method results to the choice of interpolation will be investigated.
The model data come from the ERA5 reanalysis (Hersbach et al, 2020). ERA5 was chosen because it combines a wide range of observations from the global observing system with a state-of-the-art numerical weather prediction model to provide high spatial and temporal resolution analyses of thousands of meteorological variables over a long period. A reanalysis was chosen instead of operational analyses because a reanalysis has a near-constant scientific configuration, whereas operational models are regularly updated with new scientific methods. The neural network training needs a long time period, spanning several years, to sample many different meteorological and hydrological conditions. A period of 4 years was used here but, for pragmatic reasons, only data from every 10th day entered the training. The training was run on local desktop machines with CPUs; in the future, a much larger database could be used by moving to more powerful machines with GPUs.
The next key step was choosing which model variables to use as inputs (features) in the neural network training. This was done by examining correlations between the observed backscatter and the various model variables, and keeping only those model variables with significant correlations. If model variables with low correlations to the observed data were chosen, the neural network would risk fitting noise and spurious effects in the data rather than the real physical links between the observations and model variables. After this procedure the following model variables were chosen: volumetric soil moisture at 0–7 cm depth; soil temperature at 0–7 cm depth; leaf area index; and magnitude of the diurnal cycle of 2 metre temperature. Figure 2 shows the correlations between the chosen variables: soil moisture and leaf area index are positively correlated with backscatter, whereas soil temperature and the magnitude of the diurnal temperature cycle are negatively correlated with it.
Figure 2: Correlation matrix between ASCAT equivalent backscatter at 40 degree incidence angle (σ40) and chosen ERA5 model variables: soil temperature at 0–7 cm depth (ST1); leaf area index (LAI); magnitude of diurnal cycle of 2 metre temperature (ΔT); volumetric soil moisture at 0–7 cm depth (SM). From Aires et al, 2021.
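The correlation-based screening of candidate inputs can be sketched as follows. This is only an illustration: the threshold of 0.3 on the absolute Pearson correlation is an assumption (the study refers to "significant correlations" without giving a value), and the variable names and numbers are invented.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def select_features(backscatter, candidates, min_abs_corr=0.3):
    """Keep candidates whose |correlation| with backscatter reaches the threshold.

    min_abs_corr is an assumed, illustrative threshold.
    """
    corr = {name: pearson(backscatter, values) for name, values in candidates.items()}
    kept = [name for name, r in corr.items() if abs(r) >= min_abs_corr]
    return kept, corr


# Invented collocated samples: backscatter (dB) and three candidate variables.
backscatter = [-12.0, -11.0, -10.0, -9.0, -8.0, -7.0]
candidates = {
    "soil_moisture": [0.10, 0.14, 0.19, 0.22, 0.27, 0.30],
    "soil_temperature": [300.0, 297.0, 295.0, 292.0, 290.0, 287.0],
    "unrelated_variable": [1.0, -1.0, 0.0, 0.0, -1.0, 1.0],
}
kept, corr = select_features(backscatter, candidates)
```

In this toy example the two physically linked variables are kept, with correlations of opposite sign, and the uncorrelated variable is dropped.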
Neural network method
Once the model variables and observations had been collocated in the input database, the neural network training could be run. In the inverse method the chosen target was the ERA5 soil moisture, and the features (inputs) were ASCAT backscatter, ERA5 leaf area index, ERA5 soil temperature and ERA5 magnitude of the diurnal cycle of 2 metre temperature. In the forward method the target was ASCAT backscatter and the inputs were the four chosen ERA5 variables. Because the relationship between features and targets is relatively simple, the neural network has a single hidden layer of 10 neurons, as shown in Figure 3, and is very quick to train. Using more hidden layers and/or neurons would risk overfitting, which we wish to avoid.
Figure 3: Schematic showing the structure of the neural network for the forward approach.
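To give a sense of how small such a network is, here is a minimal sketch of the forward pass of a network with four inputs, one hidden layer of 10 neurons and one output. The tanh activation, the random untrained weights and the assumption that inputs have been standardised are illustrative choices, not details taken from the study.

```python
import math
import random

N_INPUTS = 4   # e.g. backscatter, leaf area index, soil temperature, diurnal cycle
N_HIDDEN = 10  # single hidden layer of 10 neurons


def init_weights(seed=0):
    """Random, untrained weights; training would adjust these values."""
    rng = random.Random(seed)
    w1 = [[rng.uniform(-0.5, 0.5) for _ in range(N_INPUTS)] for _ in range(N_HIDDEN)]
    b1 = [0.0] * N_HIDDEN
    w2 = [rng.uniform(-0.5, 0.5) for _ in range(N_HIDDEN)]
    b2 = 0.0
    return w1, b1, w2, b2


def predict(features, weights):
    """Forward pass: tanh hidden layer (assumed activation), linear output."""
    w1, b1, w2, b2 = weights
    hidden = [
        math.tanh(sum(w * x for w, x in zip(row, features)) + b)
        for row, b in zip(w1, b1)
    ]
    return sum(w * h for w, h in zip(w2, hidden)) + b2


def n_parameters():
    """Total number of trainable weights and biases."""
    return N_HIDDEN * N_INPUTS + N_HIDDEN + N_HIDDEN + 1


weights = init_weights()
# Invented, standardised feature values for one sample.
estimate = predict([0.5, -0.2, 1.0, 0.3], weights)
```

With this layout there are only 61 trainable parameters, which is consistent with the network being quick to train on desktop machines.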
The neural network training is run using 60% of the database and validated against an independent 20% of the database. This procedure produces weights which describe the transformation from input variables to target variables. We can then assess how well the neural network reproduces the targets by comparing its predictions against the real targets, using the remaining 20% of the database as an independent test dataset. Splitting the database in this way is necessary to avoid overfitting the noise in the full database at the expense of the real physical features and relationships between variables.
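The 60/20/20 partition can be sketched as below. The random shuffle with a fixed seed is an assumption made for illustration; the article does not say how samples were assigned to each subset.

```python
import random


def split_indices(n_samples, train_frac=0.6, valid_frac=0.2, seed=0):
    """Shuffle sample indices and split them into train, validation and test sets."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # assumed: random assignment of samples
    n_train = round(train_frac * n_samples)
    n_valid = round(valid_frac * n_samples)
    train = indices[:n_train]
    valid = indices[n_train:n_train + n_valid]
    test = indices[n_train + n_valid:]  # whatever remains, about 20%
    return train, valid, test


train, valid, test = split_indices(1000)
```

The three index lists are disjoint, so the test set plays no part in fitting the weights or in deciding when to stop training.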
The NN model is compared with a CDF-matching approach where the statistics (mean and standard deviation) of the input dataset are matched to the output (or target) dataset using a linear transformation. The two approaches can also be combined by applying the CDF-matching to the NN outputs.
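The linear form of CDF-matching described here, which matches only the mean and standard deviation, can be sketched as follows. The function names and sample values are invented, and in practice the transformation might be fitted separately for each location; that detail is an assumption.

```python
import math


def mean_std(values):
    """Mean and (population) standard deviation of a sequence."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return mean, std


def fit_linear_matching(source, target):
    """Gain and offset so that the rescaled source has the target mean and std."""
    mean_s, std_s = mean_std(source)
    mean_t, std_t = mean_std(target)
    gain = std_t / std_s
    offset = mean_t - gain * mean_s
    return gain, offset


def rescale(values, gain, offset):
    """Apply the fitted linear transformation."""
    return [gain * v + offset for v in values]


# Invented example: backscatter (dB) rescaled to volumetric soil moisture (m3/m3).
source = [-12.0, -11.0, -10.0, -9.0, -8.0]
target = [0.10, 0.20, 0.15, 0.30, 0.25]
gain, offset = fit_linear_matching(source, target)
matched = rescale(source, gain, offset)
```

The same two functions could be applied to NN outputs instead of raw observations, which is the essence of the combined approach.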
Figure 4 shows a snapshot of the results for the inverse approach. The pure CDF-matching approach appears to produce the smallest standard deviation of soil moisture in the tropics, whereas the NN model works better in northern mid-latitudes. The combined approach works better than the NN model in central Asia and Africa, but worse at high latitudes.
Similar results from the forward approach do not seem as good, with larger normalised standard deviations (not shown). However, the two are difficult to compare, as the outputs of the forward approach are backscatter values and the outputs of the inverse approach are volumetric soil moisture. One possible explanation for the poorer forward results is that in this study the observations are interpolated to the model locations, whereas a more orthodox approach is to interpolate model variables to observation locations. The latter will be attempted as part of future work to develop a forward operator, which could lead to attempts to directly assimilate backscatter.
Figure 4: Global normalised standard deviations of CDF-matching (upper), NN (middle) and combined NN and CDF-matching (lower) derived soil moisture normalised by the standard deviation of the ERA5 soil moisture. Derived from Aires et al, 2021.
Overall, as the result of a one-week visit and some follow-up over the subsequent months, it has been shown that an NN model can outperform the currently used CDF-matching approach in some areas of the globe. These initial results are promising, and the aim is to take them further in collaboration with Filipe Aires and in connection with our EUMETSAT H SAF activities. It is envisaged that the NN model will provide a new approach to derive ASCAT soil moisture observations for H SAF. In this context, the ASCAT NN observations will be assimilated to produce 10 km resolution root-zone soil moisture products both in near-real time and over the scatterometer data record.
This study will be published in 2021 in the Quarterly Journal of the Royal Meteorological Society. It is currently in press, available online in Early View: https://doi.org/10.1002/qj.3997.