A statistical method trained and optimized to retrieve seven-layer relative
humidity (RH) profiles is presented and evaluated with measurements from
radiosondes. The method makes use of the microwave payload of the
Megha-Tropiques platform, namely the SAPHIR sounder and the MADRAS imager.
The approach, based on a generalized additive model (GAM), embeds both the
physical and statistical characteristics of the inverse problem in the
training phase, and no explicit thermodynamical constraint – such as a
temperature profile or an integrated water vapor content – is provided to the
model at the stage of retrieval. The model is built for cloud-free conditions
in order to avoid the cases of scattering of the microwave radiation in the
18.7–183.31 GHz range covered by the payload. Two instrumental
configurations are tested: a SAPHIR-MADRAS scheme and a SAPHIR-only scheme,
the latter to cope with the end of MADRAS data acquisition in January 2013 for
technical reasons. A comparison to machine learning algorithms (artificial
neural network and support-vector machine) shows equivalent performance over
a large realistic set, promising low errors (biases

Atmospheric water vapor is a key parameter of the climate
system, and understanding how it varies as the climate evolves
relies on a thorough documentation of its horizontal and vertical
distributions

While direct measurements by radiosondes are the simplest way to look at
the vertical structure of the relative humidity (RH) field, the network of
stations (permanent or not) is unequally distributed between the two
hemispheres and there is a clear gap of data over the oceans

Upper tropospheric humidity (UTH) can be one way to interpret these “water
vapor” measurements. The retrieval of UTH was initiated by

The first aim of this study is to perform an analysis of the
contribution of the two microwave instruments of the Megha-Tropiques mission,
operating since October 2011, for the retrieval of layer-averaged RH
profiles. The SAPHIR sounder and the MADRAS imager are both dedicated to
improving the documentation of the atmospheric water cycle. In a previous
paper,

The second objective is to demonstrate the potential of purely statistical
methods in the following problem: given a set of brightness temperatures
(BTs) provided by a space-borne radiometer, what is the vertical distribution
of RH and what are the expected limits of such an approach? Many retrieval
approaches exist; however, to our knowledge, few of them estimate the RH profile
from a simple input data set restricted to the BTs. Indeed, most of the
approaches are physically based iterative techniques such as a

The current operational retrieval (version 6, released in 2013) of water
vapor profiles (layer and level products) from the instruments of the Aqua
mission (namely AIRS, AMSU and HSB (Humidity Sounder for Brazil, INPE), see

As in

Section 2 describes the data at hand and the context of the work. The three non-linear models (GAM, MLP and LS-SVM) and their design for the study are detailed in Sect. 3. Section 4 is dedicated to the evaluation of the estimations over a realistic data set, which provides a large evaluation sample. The application to Megha-Tropiques measurements is discussed in Sect. 5 with a comparison to radiosonde measurements. Finally, Sect. 6 draws conclusions from the study and discusses the ongoing work.

Megha-Tropiques is an Indo-French satellite that is dedicated to the
observation of the energy budget and of the water cycle within the tropical
belt (

Observational characteristics of SAPHIR and MADRAS.

Characteristics of the database.

High-quality RH soundings sampling the tropical troposphere and reasonably
collocated in space and time with Megha-Tropiques observations are quite
scarce, which leads us to use a synthetic training set to overcome the problem.
This set is made of thermodynamical profiles representative of the
30

The RH profiles come from the Analyzed RadioSoundings Archive (ARSA,

In the current study, the profiles are vertically restricted to the
troposphere (from the surface up to 85

Only clear-sky conditions are considered. Indeed, as underlined by

The base is finally made of 1631 thermodynamic 22-level profiles that cover
the tropical oceans (30

Given a set of BTs, the expected accuracy of the estimated RH will obviously
depend strongly on the atmospheric area under consideration. Therefore, for a
specific atmospheric layer, the relevant inputs will not necessarily be the
same as for the layer above. One can indeed expect that the estimation of
RH in the mid-troposphere will not significantly benefit from MADRAS
measurements, while these should be an asset for a surface layer. This is why
layer-dependent models are considered here. The RH profiles were analyzed to
group the 22 original levels in relatively homogeneous layers. First, the
analysis of the variance–covariance matrix determined groups of correlated
successive levels. Then, self-organizing maps (SOM, also known as Kohonen maps)

Figure

The RTTOV fast radiative transfer model, version 9.3 (Radiative Transfer for
Television and Infrared Observation Satellite Operational Vertical Sounder,

The simulations also make use of the instrumental noise to have a realistic
base of work. The radiometric sensitivity is often considered as the
instrumental noise since it gives the minimum variation in the measured
upwelling radiation that a specific channel can detect (noise-equivalent

With the end of MADRAS data acquisition after almost 15 months of measurements, two
configurations of the RH retrieval method have been considered: a
SAPHIR-only scheme and a SAPHIR-MADRAS scheme, the latter being associated with
a selection of the optimal channels. For the former configuration, all SAPHIR
channels are used. For the latter configuration, a selection of the BTs is
performed because the BTs that are significantly relevant to the RH
retrieval of a given layer are not necessarily the same as those relevant
to another layer. For this purpose, the optimal subset of channels is
determined with the Gram–Schmidt orthogonalization (GSO) procedure (see

Relationship between the RH of two atmospheric layers (in %RH) and
the associated BT (in

To ensure the consistency between the mathematical descriptions of the three
statistical models, the notation will be as follows: the estimation of the
RH

The GAM, MLP and LS-SVM models are built with three different statistical
supervised learning techniques. Overall, the learning phase consists of using
a set of training examples to produce an inferred function. Each example is a
pair made of an input vector (

The data set described in Sect.

The input vector

GAMs have recently started to be used in environmental studies as a surrogate
to traditional MLP thanks to their ability to model nonlinear behaviors while
providing a control of the physical content of the statistical relationships
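
As a purely illustrative sketch of the additive structure that underlies a GAM, the Python code below fits y ≈ α + Σ f_j(x_j) by backfitting, the classical fitting loop for additive models. It is not the model used in this study: the smoothers are simple bin averages rather than penalized regression splines, and the function names (`bin_index`, `fit_additive_model`, `predict_additive`), the number of bins and the number of iterations are assumptions made for the example.

```python
from statistics import fmean


def bin_index(x, lo, hi, n_bins):
    """Map x to an equal-width bin index in [0, n_bins - 1]."""
    if hi == lo:
        return 0
    return min(n_bins - 1, max(0, int((x - lo) / (hi - lo) * n_bins)))


def fit_additive_model(X, y, n_bins=10, n_iter=20):
    """Backfitting with bin-average smoothers (hypothetical toy smoother).

    X is a list of input vectors, y the list of targets.
    Returns (alpha, ranges, f) where f[j][b] is the value of the j-th
    partial function in bin b.
    """
    n, p = len(X), len(X[0])
    alpha = fmean(y)  # intercept: mean of the target
    ranges = [(min(r[j] for r in X), max(r[j] for r in X)) for j in range(p)]
    bins = [[bin_index(r[j], ranges[j][0], ranges[j][1], n_bins) for r in X]
            for j in range(p)]
    f = [[0.0] * n_bins for _ in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            sums = [0.0] * n_bins
            counts = [0] * n_bins
            for i in range(n):
                # Partial residual: remove intercept and all other components.
                others = sum(f[k][bins[k][i]] for k in range(p) if k != j)
                sums[bins[j][i]] += y[i] - alpha - others
                counts[bins[j][i]] += 1
            fj = [s / c if c else 0.0 for s, c in zip(sums, counts)]
            # Center the component so that the intercept stays identifiable.
            mean_fj = sum(fj[bins[j][i]] for i in range(n)) / n
            f[j] = [v - mean_fj for v in fj]
    return alpha, ranges, f


def predict_additive(model, x):
    """Evaluate the fitted additive model at a single input vector x."""
    alpha, ranges, f = model
    n_bins = len(f[0])
    return alpha + sum(
        f[j][bin_index(x[j], ranges[j][0], ranges[j][1], n_bins)]
        for j in range(len(f))
    )
```

Because each input contributes through its own one-dimensional function, the fitted components can be inspected individually, which is the kind of control over the physical content of the relationships referred to above.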

Part of the GAM fitting process is to choose the appropriate degree of
smoothness of the regression splines. The smoothing parameter

An artificial neural network is an interconnection of simple computational
elements (nodes or neurons) using functions that are usually non-linear,
monotonically increasing and differentiable

SVMs are kernel methods

The LS-SVM training procedure consists of estimating the set of adjustable
parameters

Since LS-SVM models are linear-in-their-parameters models, the solution of
the training phase is unique and can be computed straightforwardly, using the
set of

The retrievals of layer-averaged RH profiles provided by GAM, MLP and LS-SVM
are compared for the two schemes. The following criteria are computed over
the test set (

As mentioned in Sect.

For the SAPHIR-only scheme, all channels are used.
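
The evaluation criteria used in this section (mean bias, standard deviation and correlation coefficient) can be sketched in a few lines of Python. The function name `evaluation_criteria`, the sign convention (estimated minus observed) and the use of the population standard deviation of the errors are assumptions made for the example, not definitions taken from the study.

```python
from math import sqrt
from statistics import fmean, pstdev


def evaluation_criteria(observed, estimated):
    """Return (bias, sd, r) for paired RH values in %RH.

    bias: mean of (estimated - observed)            [assumed sign convention]
    sd:   population standard deviation of errors   [assumed definition]
    r:    Pearson correlation between the two series
    """
    errors = [e - o for o, e in zip(observed, estimated)]
    bias = fmean(errors)
    sd = pstdev(errors)
    mo, me = fmean(observed), fmean(estimated)
    cov = sum((o - mo) * (e - me) for o, e in zip(observed, estimated))
    var_o = sum((o - mo) ** 2 for o in observed)
    var_e = sum((e - me) ** 2 for e in estimated)
    r = cov / sqrt(var_o * var_e)
    return bias, sd, r
```

In practice, such a function would be applied layer by layer to the test set, giving one (bias, SD, correlation) triplet per atmospheric layer and per model.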

An impact study of the pre-processing of the data on the accuracy of RH
retrieval shows that, whatever the atmospheric layer or algorithm considered,
the improvement obtained with PCA is negligible (

From here on, noise-free BTs are considered in order to only assess the
statistical approaches. The radiometric noise of the two instruments is
implemented for the evaluation of the retrieval of RH with profiles
considered as reference profiles. Vertical profiles of mean biases, SD and

Vertical profiles of

Mean bias (in %RH), standard deviation (SD, in %RH) and
correlation coefficient (

The LS-SVM technique provides overall the best results, with the highest correlation coefficients and the lowest variance for five of the seven layers considered in this study. In theory, these three learning methods are equivalent, but the conditions of their implementation differ somewhat. First, since LS-SVM models are linear in their parameters, an exact validation method was implemented, and the resulting procedure for selecting the relevant inputs is quite efficient. Second, MLP models are nonlinear with respect to the adjusted parameters, and their training amounts to a nonlinear optimization: several trainings with different initializations must be performed, with no guarantee of achieving the best generalization capability for a given network architecture. From this point of view, the LS-SVM approach is thus more successful. Finally, concerning the GAM approach, the smoothing splines used guarantee nonlinear behavior, continuity and smoothness, which are important characteristics in a learning algorithm. Another convenient characteristic for splines is that they are monotonic: the backfitting algorithm can estimate parametric and non-parametric components of the model simultaneously.
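
The uniqueness of the LS-SVM solution follows from the fact that training reduces to a single linear system. The sketch below illustrates this with the standard LS-SVM regression formulation and a Gaussian (RBF) kernel; it is a minimal illustration rather than the implementation used in the study, and the function names, the kernel choice and the hyperparameter values `gamma` and `sigma` are assumptions.

```python
import math


def rbf_kernel(u, v, sigma):
    """Gaussian kernel between two input vectors."""
    d2 = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-d2 / (2.0 * sigma ** 2))


def solve_linear_system(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x


def train_lssvm(X, y, gamma=100.0, sigma=1.0):
    """Solve [[0, 1^T], [1, K + I/gamma]] [b, alpha]^T = [0, y]^T."""
    n = len(X)
    A = [[0.0] + [1.0] * n]
    for i in range(n):
        A.append([1.0] + [
            rbf_kernel(X[i], X[j], sigma) + (1.0 / gamma if i == j else 0.0)
            for j in range(n)
        ])
    sol = solve_linear_system(A, [0.0] + list(y))
    return sol[0], sol[1:]  # bias term b, coefficients alpha


def predict_lssvm(X_train, b, alpha, x, sigma=1.0):
    """Evaluate the trained model at a single input vector x."""
    return b + sum(a * rbf_kernel(x, xi, sigma) for a, xi in zip(alpha, X_train))
```

No random initialization is involved, so repeated trainings on the same data give the same model, in contrast with the MLP. The cost is a dense system whose size grows with the number of training examples, which is consistent with the computational burden noted for large data sets.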

The three methods perform equivalently:

Examples of three estimations of RH profiles (in %RH) extracted from the database using the SAPHIR-MADRAS configuration. The observed profile is the thick gray line and the three estimations (plain, dashed, dots, respectively, for MLP, GAM and LS-SVM) are in black.

RH and the associated errors (both in %RH) projected on the

Scatter-plots of the observed RH versus the estimated RH (in %RH)
for layer 4 (top row) and layer 6 (bottom row). The estimations are done
using GAMs trained from SAPHIR-only BTs (left-hand side column) and from
SAPHIR and MADRAS BTs (right-hand side column). The dashed line is

The errors obtained from the GAM estimation are projected on the

In the following, noisy BTs are used in order to discuss the results over the
realistic instrumental configurations. Two GAMs are optimized for each
atmospheric layer, one for each instrumental scheme: a SAPHIR-MADRAS scheme
and a SAPHIR-only scheme. The evaluations over the validation set are
summarized in Table

When the SAPHIR-only scheme is used, such a statement can be extended to the
top layer (

Like other similar radiometers with varying viewing geometries, SAPHIR
observations are subject to the so-called “limb effect”, described for
instance in

Observed RH profiles gathered from the CINDY/DYNAMO/AMIE international field
experiment are used to evaluate the estimated RH profiles. With the first
orbit of Megha-Tropiques executed on 13 October 2011, this large-scale
campaign is ideal to perform such an exercise. It took place over the October 2011–March 2012 period in the Indian Ocean and was dedicated to better
understanding the processes involved in the initiation of the Madden–Julian
Oscillation and to improving its simulation and prediction (Cooperative Indian
Ocean Experiment on Intraseasonal Variability in the Year 2011/Dynamics of
the Madden–Julian Oscillation/ARM Madden–Julian Oscillation Investigation
Experiment, hereafter C/D/A). Measurements related to the atmospheric and
oceanic states have been collected from radars, microphysics probes, a
mooring network and an upper air sounding network. One can refer to

The restriction of the training of the GAMs to clear-sky conditions requires
a cloud mask. Therefore, cloud-free cases are detected from the radiosounding
record itself (RH limited to 100 %RH) and are associated with the

For each of the seven layers, the observed RH is defined by the mean of the
measurements that fit into the pressure boundaries, assuming that this mean
will be representative of the layer. This assumption is very simple,
especially since the tropospheric RH is characterized by strong vertical
gradients induced by complex transport and thermodynamic processes
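
The layer averaging described above can be sketched as follows. This is a minimal illustration: the function name `layer_mean_rh`, the half-open interval convention and the example layer boundaries are hypothetical, and pressures are assumed to be expressed in hPa.

```python
def layer_mean_rh(pressures_hpa, rh_percent, layer_bounds_hpa):
    """Average radiosonde RH measurements within each pressure layer.

    pressures_hpa:    pressure of each measurement (assumed in hPa)
    rh_percent:       RH of each measurement (in %RH)
    layer_bounds_hpa: list of (top, bottom) pressure pairs, top < bottom;
                      a measurement belongs to a layer if top <= p < bottom
    Returns one mean per layer, or None when a layer holds no measurement.
    """
    means = []
    for top, bottom in layer_bounds_hpa:
        values = [rh for p, rh in zip(pressures_hpa, rh_percent)
                  if top <= p < bottom]
        means.append(sum(values) / len(values) if values else None)
    return means
```

An unweighted mean treats every measurement equally, so a layer with uneven vertical sampling or a sharp humidity gradient may be poorly represented by this single value, which is the limitation noted above.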

Vertical profiles of

Relative humidity (in %) of the layer 400–600

The approach has been adapted to continental cases, where the influence of
the surface emissivity on the measured brightness temperature at the top of
the atmosphere needs to be taken into account

Comparisons (not shown) to radiosoundings launched from a continental site in Ouagadougou, Burkina Faso (a dedicated field campaign during summer 2012) reveal similar performance in the mid-tropospheric layers, the surface layers being slightly better estimated.

Figure

Microwave observations from the SAPHIR and MADRAS microwave radiometers of
the Megha-Tropiques satellite are used to retrieve seven-layer RH profiles. For
this purpose, optimized GAMs were trained for each atmospheric layer over a
realistic set of synthetic observations. This set is composed of 18 years of
radiosonde profiles covering the tropical belt (

To assess the performance of the GAM, two other algorithms based on supervised learning, namely an MLP and an LS-SVM, have also been trained and optimized using adapted validation methods. To our knowledge, the LS-SVM modeling technique has never been applied to remote sensing retrievals, even though it solves the major problem of local minima, a common pitfall when using neural networks (such as the MLP). While the three modeling methods come from different theoretical backgrounds, they achieve roughly the same performance, with the LS-SVM approach providing slightly better results. We assume that these improvements come from its built-in regularization mechanism, but they are associated with a heavy computational burden that compromises its implementation when considering large data sets (such as satellite measurements).

The intercomparison of the three models points towards the definition of the
problem given the inputs at hand. The combination of SAPHIR and MADRAS or the
use of SAPHIR-only makes it possible to perform a robust estimation of RH in
the 150–950

Following this work, our current efforts focus on the estimation of the
conditional error associated with the retrieval itself. Indeed, because the
widths and altitudes of the weighting functions of SAPHIR are strongly
dependent on the thermodynamical state of the atmosphere (the drier the
atmosphere, the wider the layer; the maximum of sensitivity shifting from the
upper troposphere towards the mid-troposphere), it is clearly expected that
the robustness of the RH estimation will be conditioned by the state of the
atmosphere. The aim will be to provide the probability density function of
the relative humidity conditioned on given BTs (a given state of the atmosphere) and thus
address the issue of non-Gaussian distribution of the relative humidity at a
given height. The knowledge of such information fits into the current work
done within the Global Energy and Water Cycle Experiment (GEWEX) Water Vapor
Assessment (G-VAP:

The authors thank the LMD/ABC(t)/ARA group for producing and making
available to the community their “ARSA” radiosounding database (available
from