Ice water path retrievals from Meteosat-9 using quantile regression neural networks

. The relationship between geostationary radiances and ice water path (IWP) is complex, and traditional retrieval approaches are not optimal. This work applies machine learning to improve the IWP retrieval from Meteosat-9 observations, with a focus on low latitudes, training the models against retrievals based on CloudSat. Advantages of machine learning include avoiding explicit physical assumptions on the data, an efﬁcient use of information from all channels, and easily leveraging spatial information. Thermal infrared (IR) retrievals are used as input to achieve a performance independent of the solar angle. They are compared with retrievals including solar reﬂectances as well as a subset of IR channels for compatibility with historical sensors. The retrievals are accomplished with quantile regression neural networks. This network type provides case-speciﬁc uncertainty estimates, compatible with non-Gaussian errors, and is ﬂexible enough to be applied to different network architectures. Spatial information is incorporated into the network through a convolutional neural network (CNN) architecture. This choice outperforms architectures that only work pixel-wise. In fact, the CNN shows a good retrieval performance by using only IR channels. This makes it possible to compute diurnal cycles, a problem that CloudSat cannot resolve due to its limited temporal and spatial sampling. These retrievals compare favourably with IWP retrievals in CLAAS, a dataset based on a traditional approach. These results highlight the possibilities to overcome limitations from physics-based approaches using machine learning while providing efﬁcient, probabilistic IWP retrieval methods. Moreover, they suggest this ﬁrst work can be extended to higher latitudes as well as that geostationary data can be considered as a complement to the upcoming Ice Cloud Imager mission, for example, to bridge the gap in temporal sampling with respect to space-based radars.


Introduction
Clouds remain among the main factors that hinder climate models from giving a confident value for climate sensitivity. According to the Sixth Assessment Report, the last report from the Intergovernmental Panel on Climate Change (IPCC, 2021), there is now high confidence in the feedbacks associated with the subtropical marine low-cloud regime and altitude of high clouds. There has been less progress on the tropical high-cloud amount feedback, and this component is the largest contributor to the overall cloud feedback uncertainty (Forster et al., 2021). That said, global warming is ongoing and will continue. One of the critical aspects of this warming is changes in precipitation, which are also difficult to predict with accuracy. A quantity related to both of these modelling challenges is the mass of ice hydrometeors. At low ice concentrations, the radiative forcing of ice clouds follows the ice water content (IWC, kg m −3 ), albeit altitude and time of day must also be considered. Precipitation in the form of snow, graupel, and hail is directly linked to the masses of ice hydrometeors inside the atmosphere, but also rain is affected by the IWC above.
The total amount of ice hydrometeors is normally reported as the ice water path (IWP, kg m −2 ), which is one of the essential climate variables from the Global Climate Observing System (GCOS, 2021). Despite it being an integrated value and the fact that it should be easier to constrain than levelspecific IWC, there has been little progress in IWP both in measurements and in models (Waliser et al., 2009; Published by Copernicus Publications on behalf of the European Geosciences Union.

5702
A. Amell et al.: IWP from Meteosat-9 with QRNNs et al., 2011;Duncan and Eriksson, 2018). The most accurate global data on IWP should be provided by retrievals based on CloudSat reflectivities such as DARDAR (Cazenave et al., 2019). Where the CloudSat 94 GHz radar measures it gives accurate information at high vertical resolution, except for high IWP values where attenuation and multiple scattering decrease the retrieval accuracy. However, the swath width of CloudSat is only 1.4 km.
Passive instruments offer a much higher horizontal coverage. Therefore, passive observations are a good complement to CloudSat. In 2025 the Ice Cloud Imager (ICI) will be launched aboard Metop Second Generation (Metop-SG) B and will provide information on IWP over a swath 1500 km wide with a 16 km horizontal resolution . That is, the geographical coverage of the ICI radiometer is more than a factor of 100 higher than CloudSat, and close to global coverage is obtained on a daily basis. The ICI IWP retrieval accuracy should be comparable to the one enabled by a 94 GHz radar (Pfreundschuh et al., 2020) and retrievals of coarse IWC profiles should be possible (Brath et al., 2018), but cloud radars are still far superior in terms of spatial resolution.
Microwave instruments to date are only operated on satellites in low orbits. Observations from geostationary satellites are an important complement as they provide short revisit times. For instance, the Spinning Enhanced Visible and InfraRed Imager (SEVIRI) instrument aboard the Meteosat Second Generation (MSG) of geostationary satellites (Schmetz et al., 2002) provides full disc images every 15 min. The main challenge with retrievals from geostationary satellites is the complex relationship between visible and infrared (VISIR) radiances and IWP.
Nakajima and King (1990) found that the reflectances at 0.75 µm are primarily sensitive to cloud optical thickness τ , while reflectances at 2.16 µm are primarily sensitive to effective droplet radius r e . Assuming a sufficiently representative r e value, then IWP can be estimated from these two parameters (Stephens, 1978). This solar bispectral method constitutes the foundation of several IWP retrieval methods based on VISIR radiances. This includes the Cloud Physical Properties algorithm (CPP, first published by Roebeling et al., 2006), the Daytime Cloud Optical and Microphysical Properties (DCOMP) algorithm used in Pathfinder Atmospheric Extended (PATMOS-X, Walther and Heidinger, 2012), the Moderate Resolution Imaging Spectroradiometer (MODIS) cloud properties product (Platnick et al., 2017), or the NASA Clouds and the Earth's Radiant Energy System (CERES) project algorithms (Minnis et al., 2011, the latter originally developed for MODIS but also adapted for other polar-orbiting (Minnis et al., 2016a) and geostationary imagers (Minnis et al., 2008) in the Satellite Cloud and Radiation Property retrieval System (SatCORPS). All these retrieval algorithms estimate IWP from τ and r e , where the last two properties are derived by solar reflectances using physics-based methods. The CERES algorithms have a nighttime retrieval algorithm for these parameters, but according to Minnis et al. (2011, p. 4386), its τ and r e values should be considered experimental.
Machine learning (ML) methods, and in particular artificial neural network (NN) approaches, are promising candidates for remote sensing retrievals. This is primarily due to the fact that they do not require explicit assumptions used in pure physics-based models. ML methods instead can find non-linear relationships by learning from data, whether these data consist of physical observations or are obtained through physical simulations. Only concerning ice optical thickness, Yost et al. (2021) remark that new editions of the CERES algorithms must consider NNs to improve its estimates, such as the work from Kox et al. (2014) and Minnis et al. (2016b). The Cirrus Properties from SEVIRI (CiPS,  directly retrieves IWP from SEVIRI using an NN trained against Cloud-Aerosol Lidar with Orthogonal Polarization (CALIOP) observations. The CALIOP lidar signal quickly attenuates in thick clouds. Therefore, thick clouds and large IWPs are not well represented by CALIOP observations, which constrains CiPS to thin ice clouds. Holl et al. (2014), Islam and Srivastava (2015) and Mastro et al. (2022) also directly retrieve IWP with NNs, but they make use of combinations of microwave and infrared observations, as they found this to be advantageous compared to using infrared data only.
In any case, the use of NNs for IWP retrieval from VISIR passive imagers remains largely unexplored. This work contributes towards filling this gap for the SEVIRI instrument. NNs with low-latitude, Meteosat-9 SEVIRI observations are trained against DARDAR collocations. This choice of reference data for the NNs does not constrain the retrieval to small IWP values, but rather targets the full IWP range. Retrievals with only thermal infrared (IR) channels are analysed to overcome the daylight-only limitation of VISIR retrievals. Moreover, IR retrievals with a selection of channels based on the previous Meteosat generation are also evaluated. Additionally, the retrievals obtained here are compared with retrievals from CLAAS edition 2.1 (Finkensieper et al., 2020), a dataset based on the CPP algorithm and SEVIRI observations.
The NN method used here, quantile regression neural networks (QRNNs), was analysed by Pfreundschuh et al. (2018) in the context of remote sensing retrievals. QRNNs estimate the posterior distribution of Bayesian retrievals, and thus can provide uncertainties for individual retrievals. QRNN is a flexible method that enables the use of different NN architectures: a convolutional neural network (CNN) architecture is integrated in a QRNN to evaluate whether using spatial information from the observations is advantageous. Both providing case-specific ML errors and using multiple footprints in the ML retrieval are new features for IWP retrievals.
IWP retrievals are the primary focus of this work. Nonetheless, two other properties are also retrieved from SE-VIRI observations: the mean ice mass height and the mean ice mass size. These two variables, planned to be retrieved by the ICI product released right after its commissioning , are referred to here as "auxiliary" variables.

Reference data: DARDAR
The DARDAR-cloud product (Delanoë and Hogan, 2010) synergistically combines radar and lidar measurements from the CloudSat and CALIPSO tandem to provide cloud properties at a horizontal and vertical resolution of 1.4 km and 60 m, respectively. The CloudSat satellite mission was designed to cross the Equator in ascending orbit after 13:30 local mean time, with a repeat cycle of 16 d (Stephens et al., 2002). For a given location, CloudSat-derived products, such as DAR-DAR, can then only be provided at two different times, corresponding to the observations in the ascending and descending orbits, with a large time span in between. We refer to this as "daytime and nighttime observations". In April 2011 Cloud-Sat was forced to switch to daylight-only operations due to a battery anomaly, thus only allowing daytime observations (Nayak et al., 2012).
IWP, mean mass height (Z m ), and mean mass size (D m ) for an atmospheric ice column can be derived from the DAR-DAR cloud properties. In discrete form, these quantities are defined as where Z m and D m are only defined for IWP > 0, Z is the set of indices defining the variable values at each DARDAR bin height, IWC i is ice water content, z i is bin height range, N * 0,i is the intercept parameter of the normalized size distribution of ice particles (Delanoë et al., 2005(Delanoë et al., , 2014, all at bin height z i , i ∈ Z, and ρ w = 1000 kg m −3 the density of water. In this work, Z consisted of all indices for heights above sea level. A comprehensive derivation of these variables is provided in Appendix A.

Input data: SEVIRI from Meteosat-9
Meteosat-9 carries the SEVIRI instrument (Aminou et al., 1997;Schmid, 2000), the MSG imager. SEVIRI allows for observations of the earth in 12 spectral channels (Table 1) with a maximum repeat cycle of 15 min for the full earth disc scan. The images have a sampling distance of 3 km at subsatellite point for all channels except the high-resolution visible (HRV) channel, which is 1 km. That is, the channels pro-vide a ground resolution of 3 × 3 km 2 at nadir, with this resolution becoming worse when increasing the incidence angle. Therefore, SEVIRI offers a better temporal resolution and spatial coverage than DARDAR, although at a worse, varying ground resolution.
Launched in December 2005, Meteosat-9 was the primary operational satellite located at a nominal longitude of 0 • between April 2007 and January 2013 (WMO, 2022). It has also been located at the commissioning longitude of −6.5 • , and operational longitudes of 9.5, 3.5, and, currently, 45.5 • (EUMETSAT, 2022). These changes in longitude make the ground resolution be dependent on time; for a given position on earth, only observations taken from the same operational longitude are strictly directly comparable.
The SEVIRI images are provided in a geostationary projection specified by Wolf (1999), which we refer to as "SE-VIRI projection". Satpy (Raspaud et al., 2021) was used to read the SEVIRI images, which retains the native SE-VIRI observation grid and projection. In addition, this library automatically handles the erroneous georeferencing offset present in Meteosat images until 2017 (EUMETSAT, 2017, Sect. 3.1.4).

Collocations
Rectified level 1.5 Meteosat-9 SEVIRI data (EUMETSAT, 2017) and DARDAR-cloud version 2.1.1 between 6 May 2008 and 31 March 2011 were collocated to form the dataset used in this work. All but a few, for practical reasons, publicly available samples in this time range were used.
A DARDAR profile taken during a SEVIRI scan was collocated with this scan, and profiles taken in between consecutive scans were assigned the closest scan in time.
Temporally collocated DARDAR profiles were duplicated at the edges of the horizontal swath, calculated on the SE-VIRI projection, to account for the DARDAR horizontal resolution. IWP, Z m , and D m were computed for each DAR-DAR profile. All variable values in a SEVIRI pixel were averaged, weighted by the profile IWP, to obtain one DARDAR value per SEVIRI pixel (Fig. 1). It can be seen that averaging all profiles and then computing IWP, Z m , and D m from an averaged DARDAR profile is equivalent to the IWP-weighted average (see Appendix A).
Finally, the collocated images were divided into nonoverlapping samples of 32 × 32 pixels with the DARDAR swath in the centre. The division grid of the samples was randomly placed to diminish any possible bias. Figure 2 shows the region of interest (ROI), which ranges [−17 • , +40 • ] in longitude and [−17 • , +15 • ] in latitude. All samples covering any part of the ROI were randomly split into a training set, validation set, and test set of 60 %, 20 %, and 20 %, respectively, in size totalling more than 10 6 pixels with reference data. Based on DARDAR flags, 68 % of all 32 × 32 pixels samples have at least one pixel flagged with ice content; 56 % of all pixels are flagged as such. Further details on the collocated data, which include the relationships between the SEVIRI channels and the collocated DARDAR IWP, can be found in Sect. S1 in the Supplement.
3 Machine learning

Quantile regression neural networks
For a cumulative distribution function F x|y (x), the quantile x τ at level τ ∈ [0, 1] is the value such that The expectation with respect to x of the loss function, is minimized by the quantile x τ (Koenker, 2005, pp. 5-6).
A quantile regression neural network (QRNN) is an artificial NN that seeks to minimize L τ . In this work, QRNNs are employed in a multi-task learning setting for multiple quantile regression, minimizing where T = {0.01, 0.02, 0.03, . . ., 0.98, 0.99}, that is, all percentiles, and |T | is the cardinality of T . By extension, we use the term QRNN also for this multiple quantile regression. The main advantage of QRNNs with respect to NNs that minimize the mean squared error (MSE) is that QRNNs can model aleatoric uncertainty. This type of uncertainty describes the inability of the observations y to fully determine x due to hidden variables, therefore it cannot be reduced by increasing the number of training data. QRNNs model this uncertainty by estimating F x|y (x) at multiple quantile levels as illustrated in Fig. 3. This not only makes the regression robust against outliers but also provides a more complete description of the data distribution: a case-specific uncertainty can be assigned to each prediction. Pfreundschuh et al. (2018) observed that QRNNs approximate well the posterior distribution of Bayesian remote sensing retrievals, with uncertainty estimates consistent with non-Gaussian retrieval errors. In addition, quantile regression enjoys the property of equivariance to monotone transformations (Koenker, 2005). This enables training on a log-transformed response variable and back-transforming of the estimates, a useful property for right-skewed data distributions.
Quantile crossing is the major drawback of QRNNs. This problem consists of a lack of monotonicity in the quantile estimation, and is illustrated also in Fig. 3. Values derived Figure 3. Simplified retrieval example of x from y. A model trained to minimize the MSE can only aspire to predict the expected value of x at a given y, indicated by the line µ, while a quantile regression can describe the aleatoric uncertainty estimating quantiles at level τ . These quantiles can then be used to estimate F x|y (x) and derive µ from it. However, quantile crossing can occur if the quantiles are not perfectly estimated. from a QRNN experiencing severe quantile crossing can then be inaccurate. Several approaches exist to overcome quantile crossing. In this work quantile crossing is corrected a posteriori. The correction consists of an isotonic regression of the predicted quantilesx τ constrained at all quantile levels. That is, the optimization problem, is solved to find the corrected quantilesx

Network architectures
The QRNN approach was implemented in two different network architectures. Retrievals based on single SEVIRI pixels were done using a multilayer perceptron (MLP To exploit spatial correlations among neighbouring SE-VIRI pixels, the CNN presented in Fig. 4 was used. This CNN consists of convolutional blocks, based on the Xception network (Chollet, 2017), with an asymmetric encoderdecoder, U-net-like architecture (Ronneberger et al., 2015), and residual connections. The CNN hyperparameters were chosen with the same method as the MLP, using a regular grid of {64, 128} filters and {0, 1, 2, 4} Xception blocks. . The CNN architecture used, for an input image of spatial size H × W pixels and C channels, producing an output of the same dimensions with M channels. Here M = 99, corresponding to all percentiles. Block widths relate to the spatial sizes at each stage (not to scale). In each convolutional layer 128 filters were used, and n Xception means that n consecutive Xception blocks are applied, where it was chosen n = 2. Depthwise separable convolutions (Sep-Conv), with a 3×3 kernel, preserve the spatial size using a replicate padding of 1 at the depthwise convolution. GELU: Gaussian error linear unit (Hendrycks and Gimpel, 2020), BN: batch normalization (Ioffe and Szegedy, 2015), 1 × 1 Conv: pointwise convolution. Strides of 1, otherwise indicated.
The three input feature settings presented in Table 2 were explored. The HRV channel was disregarded in all cases. The channel selection for the IR subset was made to represent the Meteosat visible and infrared imager (MVIRI), the imager in the previous Meteosat generation. MVIRI only had two IR channels with spectral ranges 5.7-7.1 and 10.5-12.5 µm; SEVIRI channel 5 covers the former range, while channels 9 and 10 cover the latter. The IR subset had a special treatment: inputs from channels 9 and 10 were averaged with weights inversely proportional to the difference between their central wavelengths and the MVIRI central wavelength. That is, these channels were combined and fed to the network as x MVIRI,11.5 = w 9 x 9 + w 10 x 10 w 9 + w 10 (8) where x 9 and x 10 are the standardized channel 9 and 10 values. This aimed to synthesize the MVIRI channel with central wavelength 11.5 µm. The satellite zenith angle was included in all three settings to take into account the varying ground resolution.

Training methodology
The 32 × 32 pixels images were fed to the networks using a batch size of 128 images in all trainings and network models. The input data were standardized with the training set statistics, and invalid input values replaced with −999 999. The networks were trained with the Adam optimizer (Kingma and Ba, 2015) with base learning rate set to 0.001. All networks were evaluated on the validation loss as well as the number of quantile crossings, both only computed for pixels with reference values. The learning curves for all networks presented in this work can be found in the Supplement (Fig. S7). Early stopping on the validation loss determined the selected network state. A log transform was applied to train for IWP and D m . Each time the data was accessed, zero IWP values were replaced with samples from a log-uniform distribution between 10 −8 and 10 −6 kg m −2 (the minimum non-zero IWP in the dataset is of the order of 10 −6 kg m −2 ). The images were randomly mirrored and rotated 0, 90, 180, or 270 • each time the training data were accessed, for a better generalization ability of the network. After the first epoch, one additional pass of the training data was used to average the batch normalization statistics of each batch. These averaged statistics were then frozen and used throughout the rest of the training, empirically observed to help generalization.

Retrieval results
We examined two aspects for retrieving IWP from Meteosat-9 SEVIRI images: the channel selection and the use of spatial information. The evaluation of QRNNs or, more generally, probabilistic predictions is not straightforward. The common summary statistics root mean squared error (RMSE), mean absolute error (MAE), and bias require a point estimate. QRNNs do not provide a unique point estimate, therefore one value has to be selected to compute these statistics. Throughout this section, we use the expected value (mean) of the distribution as the QRNN point estimate. The distribution was obtained by constructing a cumulative distribution function from linearly interpolating the predicted quantiles, as illustrated in Fig. 3, and linearly extrapolating quantiles at level τ ∈ {0, 1} from the two nearest quantiles. That is, a continuous distribution function was constructed from each QRNN prediction. Quantiles at level τ = 0 were clipped to 0 to avoid implausible negative values resulting from the linear extrapolation.
A probabilistic measure of performance is the continuous ranked probability score (CRPS), which here is defined for each prediction as where 1 is the indicator function, x is the scalar reference value, andF x|y (x) is the cumulative distribution function estimated with the QRNN. We denote the mean and median of all CRPS values from a QRNN as CRPS µ and CRPS m , respectively. The RMSE, MAE, and bias can be relatively misleading if the data range several orders of magnitude, as in the case of IWP. Therefore, the QRNNs should not be judged only on these summary statistics but rather mainly with the plots presented in, for example, Fig. 5. Furthermore, it should not be concerning that quantiles at extremal quantile levels, such as levels τ ∈ {0.01, 0.02, 0.98, 0.99}, show unrealistic values. These quantiles have a small contribution and we observed that they can show noticeable variations between different trainings.
Another summary statistic can be computed from point estimates: the correlation between the retrieved value and the reference value. The Spearman correlation coefficient r S is used here. This statistic measures the monotonic relationship between two variables, where r S = ±1 implies perfect correlation, and 0 no correlation at all.  (Table 2). Networks trained only with daytime observations. The solid curves indicate the median value of each prediction at the local DARDAR IWP. The statistics are computed for the expected value. In brown is the best value, and in bold font is the second best value. Summary statistics in kg m −2 . Statistics and predictions computed using all daytime test data, with an average DARDAR IWP of 1.14 × 10 −1 kg m −2 .
Finally, the shape of a predicted distribution by a QRNN can be analysed computing its skewness γ 1 and kurtosis β 2 . For a random variable X, they are defined by The skewness measures the distribution asymmetry about its mean: the more positive γ 1 is, the larger the right-skew, and the more negative γ 1 , the larger the left-skew. On the other hand, kurtosis measures the contribution of the tails to the rest of the distribution.

Channel selection
We used MLPs to analyse the selection of input features. The trainings performed for this analysis only used daytime samples to facilitate the networks to leverage the visible channels. Figure 5 summarizes the findings for the test set. We make two main observations. Firstly, the predicted distributions are right-skewed, as the expected value is larger than the median (τ = 0.5). Figure 6 shows skewness and kurtosis frequencies for MLP predictions. It is observed that the retrieval distributions tend to be non-Gaussian. As a reference, Gaussian distributions have γ 1 = 0 and β 2 = 3. It can also be observed that VISIR retrievals tend to be more skewed and have more information in the tails than the two other options. Secondly, the visible channels are useful for the retrieval of larger IWP values. This is not only observed from the expected value being closer to the identity line in the range 10 −1 -10 +1 kg m −2 , but also from the fact that the model is more confident as there is less spread in the predicted quantiles, particularly for larger IWP values.
Although using the VISIR setting favours the retrieval, this setting is restricted to daytime retrievals. This implies that the retrieval performance is dependent on the solar angles. Because of the CloudSat orbit (Sect. 2), there is little variation in the solar angles range in the DARDAR data used. It is erroneous to execute QRNN VISIR retrievals from obser-vations with solar angles not found in the training data. An IR-only retrieval is thus preferred for a constant performance throughout the full day, as it is independent of the solar angles.
Concerning the IR and IR subset settings, the main difference between the retrievals resides in the confidence of the models. Particularly at low IWP, there is more spread among the quantiles for the IR subset retrieval. All summary statistics except bias also disfavour the IR subset setting, although it can be argued that the two options have a similar performance.

Spatial information
Building on the previous section arguments for IR-only retrievals, we examined the use of spatial information for the IR input setting. Figure 7 shows the retrieval performance with and without the use of spatial information, corresponding to the CNN and MLP, respectively. In this case, both daytime and nighttime observations were used for training and evaluation of the models. It can be observed that using spatial information improves the retrieval of IWP: not only does it tend to be closer to the identity line, but it also shows better summary statistics, except for the bias.
The main advantage of using the CNN is a decrease in uncertainty. This argument comes from a lower CRPS for the CNN, as well as smaller spread for the expected value with respect to the reference values. That is, the CNN network most likely leverages local spatial patterns to improve the retrieval uncertainty. Analogously to Sect. 4.1 and based on Fig. 6, it can also be inferred that retrieval distributions constructed from the predicted quantiles are not necessarily Gaussian. In this case, however, the CNN distributions tend to be less skewed and have lighter tails than the MLP, which can be considered preferable.

Auxiliary variables
IWP is an integrated value of ice water content, but it does not provide any information about at what height IWC is located nor information about the size of the ice crystals constituting the IWC. Estimating the auxiliary variables Z m and D m provides information of these two problems, and they help to characterize better atmospheric ice.
Given that the CNN was shown to reduce the retrieval uncertainty, we analysed the retrieval of Z m and D m with the same CNN architecture. The data employed for training and evaluating the networks were the same as in Sect. 4.2. Nevertheless, the networks struggled to model D m when this variable was derived from columns with low IWP. Excluding all D m with IWP ≤ 10 −3 kg m −2 from the dataset enabled the networks to model it. This situation was not experienced when training for Z m , but for simplicity Z m values with IWP ≤ 10 −3 kg m −2 were also excluded.
The retrievals of Z m and D m as well as the distributions of the training set are presented in Fig. 8. The expected value of Z m (Fig. 8a) follows closely the identity line for 10-15 km, and the expected value of D m (Fig. 8b) for 0-200 µm, in both cases with relatively low spread. Nevertheless, it should not be surprising when comparing them with the probability distribution functions (PDFs), as the models can leverage a priori information for the retrieval in these cases. Furthermore, any effects of multilayer clouds in the Z m retrieval are unclear, as is the relationship between retrievals of Z m and cloud top heights; these questions can be considered for further research.

Comparison with CLAAS
The CLAAS dataset edition 2.1 (Finkensieper et al., 2020) provides cloud properties derived from MSG satellites. One of the cloud properties provided in this dataset is IWP. The retrieval in CLAAS is based on the two-stage CPP algorithm (CM SAF, 2016). In the first stage of the algorithm, the cloud type is determined using an IR-based algorithm. The cloud type is then reduced to a cloud top phase indicator: clear, liquid, or ice. In the second stage, the cloud optical thickness τ and the particle effective radius r e are computed for cloudy pixels. This retrieval uses the SEVIRI channels 1 and 3, and compares the observed reflectances with look-up tables of simulated reflectances. Following Stephens (1978), IWP in CLAAS is calculated as IWP = 2τ r e ρ/3, using ρ = 930 kg m −3 for the ice density. An inconvenience with the CPP algorithm is the variability of r e throughout thick ice clouds. This may make r e totally unrepresentative of the ice column in Eq. (13). A further inconvenience is the channel selection, which only makes it possible to retrieve IWP for daytime observations, and can be affected by sunglints. CLAAS IWP has been analysed against DARDAR and compared with MODIS retrievals (Benas et al., 2017;CM SAF, 2020b) and, despite its limitations for IWP retrievals, it can be considered a reasonable dataset of reference. We compared our CNN, IR-only, QRNN IWP retrieval with two products in CLAAS: the IWP instantaneous retrieval and its monthly mean diurnal cycle.

Instantaneous IWP retrieval
The instantaneous IWP product from CLAAS corresponds to retrieving IWP from each Meteosat observation with the CPP algorithm. For the observations in our test set, all matching CLAAS retrievals used Meteosat-9 observations. All daytime observations in the test set also present in CLAAS were used in this comparison.
The CLAAS-2.1 instantaneous IWP is provided at the SEVIRI native grid, and it also suffers from the erroneous Figure 7. IWP predictions using spatial information (CNN) and without using spatial information (MLP). Networks trained with both daytime and nighttime observations with the IR input settings ( Table 2). The solid curves indicate the median value of each prediction at the local DARDAR IWP. The statistics are computed for the expected value, and the bold font indicates the best value. Summary statistics in kg m −2 . Statistics and predictions computed using all test data, with an average DARDAR IWP of 1.19 × 10 −1 kg m −2 . The rightmost plot corresponds to the black curves from the other plots.
georeferencing offset in Meteosat images (CM SAF, 2020a, pp. 14-15). The offset in the CLAAS data was also handled with Satpy (Raspaud et al., 2021). DARDAR data were collocated with CLAAS following the method described in Sect. 2.3, assuming zero IWP for the CLAAS data when the cloud phase indicated by CLAAS is not ice. Figure 9 compares the retrieval in CLAAS and the QRNN approach. It is seen that the QRNN mean has a tendency to be closer to DARDAR. This should not be surprising given that the network learns from DARDAR data. Furthermore, the retrieved r e with the CPP algorithm is likely smaller than the DARDAR retrieval (CM SAF, 2020b), and therefore the estimated IWP with Eq. (13) should be smaller. The CPP algorithm also characterizes a cloud as either of water or ice phase, ignoring mixed-phase states. This restriction is not made in the QRNN retrieval. This dichotomy in the CPP algorithm might be related to the abrupt drop of its curve in Fig. 9. Besides, the QRNN retrieval has a better monotonic relationship with DARDAR, as indicated by the higher r S .
The CLAAS dataset evaluation from Benas et al. (2017) provides an analogous Fig. 9, which is different from the results presented here. We note, however, their different collocation strategy, that they do not restrict the collocations to the region of interest of this work, and, significantly, that they exclude any DARDAR profiles that are not only of ice cloud phase.
The IWP probability distribution functions of the retrievals under comparison are shown in Fig. 10. It is observed that the expected value of the QRNN can retrieve larger values of IWP than the CPP algorithm. The difference of the CNN PDF with respect to the DARDAR PDF might be corrected a posteriori, and can be considered for further research. Figure 10 also presents the case when one sample from the QRNN is used as a point estimate instead of the mean. The PDF constructed from this point estimate follows the DARDAR PDF closer, and shows that the QRNN uncertainty can capture DARDAR IWP values well from IR-only observations, even the larger values. The ability to capture larger IWP values combined with a better correlation with DARDAR indicate that the CNN, IR-only QRNN retrieval performs better than the CPP algorithm retrieval for IWP retrievals.

Monthly mean diurnal cycles
The monthly mean diurnal cycles product in CLAAS is obtained by averaging hour-wise all observations over all days of a month (CM SAF, 2020a). The files for this CLAAS product covering the full year of 2012 indicate that Meteosat-9 was the data source. Consequently, and because the training did not contain data for 2012, we replicated this product for the CNN, IR input settings QRNN retrieval using all Meteosat-9 observations taken in 2012, except for one clearly faulty observation.
Two tropical areas were used to compare the monthly mean diurnal cycles, denominated as ocean and land areas (Fig. 2). These areas were chosen based on Fig. 10e from Benas et al. (2017), where it is seen that they have high IWP on average. All pixel values in them were averaged to compute a single monthly mean diurnal cycle per area. Figure 11 shows the diurnal cycles for 4 months, and Fig. S6 for all 12 months.
There are two primary differences between the diurnal cycles from the two methods. There is a general discrepancy in the IWP magnitude, with this being lower in CLAAS. This agrees well with the remarks in the previous section, where CLAAS IWP results were smaller than the QRNN retrieval and the DARDAR IWP. Mean IWP values significantly lower than the DARDAR values are also observed in similar retrievals, such as retrievals from MODIS . Secondly, CLAAS does not retrieve IWP during nighttime, an inconvenience that our IR-only retrieval does not present. Figure 11 also shows the local solar time (LST) coverage of all samples in the training set. It is seen that the network learnt to make use of the physical information from only roughly 25 min of LST in either daytime or nighttime to make retrievals at other times. This was possible as a consequence of selecting the QRNN that only uses IR channels.
It is worth noting that an exhaustive validation of the diurnal cycles is impossible as there are no DARDAR data outside these time ranges for both areas. In addition, currently available DARDAR version 2.1.1 data for 2012 are scarce and only for daytime: the number of DARDAR retrievals in a month ranges from 0 to 6 overpasses for the chosen areas. This makes the spatial and temporal coverage excessively sparse to estimate a comparable monthly mean value from DARDAR. Bearing this in mind, and that IR retrievals do not Figure 9. IWP predictions from our approach (QRNN with CNN architecture, IR input settings) and from CLAAS. The QRNN is trained with daytime and nighttime observations, but the test data in both cases contain only daytime observations. The QRNN expected value is used as its retrieval estimate. The solid curves indicate the median value of each measure at the local DARDAR IWP. Summary statistics in kg m −2 . Predictions and statistics computed for all test set samples also available in CLAAS. depend on solar angles, it is reasonable again to assume that the QRNN retrievals using the CNN and IR input settings are more accurate than the physics-based CPP algorithm.

Discussion
The performance of any ML retrieval will depend on, among other factors, its capacity to learn from the training data, its expressivity to represent the retrieval complexity, and its abil- Figure 11. Monthly mean diurnal cycles for 4 months in 2012 from CLAAS and the CNN, IR-only QRNN retrieval for the areas in Fig. 2. The QRNN expected value is used as the retrieval value. Local solar time (LST) approximated from UTC as LST = UTC + 12 h/180 • · longitude. The grey areas indicate the LST coverage of the DARDAR profiles in the training set. Note the different vertical axes.
ity to generalize to unforeseen data. In the case of NNs, these factors can be tackled by exploring different architectures and training methodologies. Achieving better ML results depends on the computational resources available. However, the quality of the training data is a main determinant.
The results presented here show that QRNNs learn to represent DARDAR retrievals from SEVIRI observations to a certain extent, where DARDAR is considered a ground truth. However, DARDAR contains retrieval errors; it does not necessarily represent the exact atmospheric state. In DARDARcloud version 3, the retrievals show a 24 % reduction in IWP on average (Cazenave et al., 2019). Nonetheless, this work used an older version to compare with the validation works of the CLAAS dataset (Benas et al., 2017;CM SAF, 2020b). If a product is to be created from this work, then a subsequent refined DARDAR product should be used for the best quality of the reference data.
Concerning NN architectures, the CNN improved the IWP retrieval performance by decreasing the QRNN uncertainty. Moreover, it generally presents better summary statistics than its MLP counterpart, except for the bias. There is no reason to believe that a CNN approach inherently leads to a worse bias. Training the selected CNN is hard: nearly twice more parameters are optimized in the CNN than in the MLP (532 787 and 304 169 parameters, respectively). Therefore, better trainings or other CNNs can reduce the bias.
The retrievals for an image with the CNN can appear smoothed out when compared with the MLP retrievals. A particular example is presented in Fig. 12. This can be a consequence of using spatial information. DARDAR consists of narrow stripes, hence there are few neighbouring SEVIRI pixels with collocated reference data: this scarcity in spatial reference information may discourage predicting large differences between neighbouring pixels.
The DARDAR narrow stripes also make it unfeasible to determine, visually, which network architecture would produce retrievals from SEVIRI IR images that differ less from a hypothetical analogous DARDAR image. The performance of the networks can then only be evaluated on the results presented here, on visual animations, or compared Figure 12. IWP retrievals (same colour scale) from a sample where the CNN may smooth out the retrieved value. The RGB composite is generated with channels 1 (blue), 2 (green), and 3 (red), using the natural_color composite from Satpy (Raspaud et al., 2021). Ice clouds strongly absorb in λ = 1.64 µm, and hence appear bluish.
with retrievals from non-geostationary satellites, which consequently cannot provide the same temporal coverage.
Visual animations for the ocean and land areas for retrievals from all observations in January 2012 are provided as a video supplement, where Fig. 13 illustrates few of these retrievals. These animations show the IWP retrieval by the MLP and the CNN, both using only IR channels, and the QRNN expected value as the point estimate. It is observed that while both networks generally agree when there is IWP, they can show clear differences in the retrieved values: in Fig. 13a and b, the CNN retrieval clearly differentiates between the left and right sides with respect to the MLP retrieval. Additional research might be able to explain the cause A.  with QRNNs of such differences. On the other hand, it can be considered that the CNN qualitatively favours the IWP retrieval in the time dimension, since the MLP temporal evolution for one pixel exhibits a small noise-like pattern (Fig. 13c-f show a few contiguous retrievals, but it is clear in the video supplement).
This work covered only retrievals from Meteosat-9, constrained by the range of daytime and nighttime observations from CloudSat. Retrievals from Meteosat-9 can complement DARDAR, in both time and spatial dimensions, by providing nearly instantaneous retrievals for positions not even sampled by CloudSat. The models developed can further complement DARDAR if they are transferred to other MSG satellites. This would allow us to cover a much larger time span of IWP retrievals. This model transferability should, in principle, be possible as they all carry the SEVIRI instrument, and can be a line of research.
A more demanding challenge would be that of transferring the models to the previous or next Meteosat generation for an even larger coverage: approaches such as the one performed here, where a subset of IR channels were used to approximate the previous Meteosat imager (Sect. 3.3), may be too simplistic to represent data from the actual instruments. Whether the models are transferred to other MSG satellites or other Meteosat generations, or even for the shifts in the Meteosat-9 nominal longitude, a comprehensive validation against DARDAR is impossible due to the time period it has covered.
SEVIRI is an instrument that was not specifically designed to characterize atmospheric ice. Compared with the expected retrieval performance from ICI (Eriksson et al., 2020, Fig. 9a), the IWP retrievals obtained here are not too distant from the upcoming ICI retrievals, where the latter follows closely the identity line for IWP > 3×10 −2 kg m −2 , but the IR-only, QRNN IWP retrievals from SEVIRI are better for IWP < 10 −2 kg m −2 . However, the IWP retrievals from ICI will, in principle, have much less uncertainty. The same can be said for Z m . Nevertheless, ICI is expected to be able to retrieve D m much better: the ICI retrieval follows closely the identity line, even for the larger D m values, and it also presents less uncertainty. Overall, ICI will provide more accurate retrievals.
Nonetheless, the sun-synchronous orbit of Metop-SG B, the satellites that will carry ICI, poses a similar challenge for temporal coverage to CloudSat, although the ICI wide swath will provide semi-global coverage on a daily basis. If the models developed here are transferred to other current or upcoming Meteosat satellites, then near-instantaneous retrievals from a Meteosat satellite can complement the more accurate ICI retrievals. In addition, the finer spatial resolution of SEVIRI can further complement the coarser ICI resolution. A further consideration is that there will be no collocations between ICI and space-based cloud radars (CloudSat and EarthCARE) at lower latitudes as they are all in sunsynchronous orbits but will cross the Equator at different lo-cal times. One way to overcome the time difference in observations would be to use retrievals like the ones presented here as a common reference.
Covering the observational gaps from ICI and space-based cloud radars with geostationary data allows us to obtain diurnal cycles as in Sect. 5.2. Cloud diurnal cycles are a feature often overlooked. Concerning IWP on a regional scale, limb sounding observations have revealed discrepancies with global climate models (Eriksson et al., 2014;Jiang et al., 2015). Discrepancies of the diurnal cycle of clouds in such models increase uncertainties in climate projections (Yin and Porporato, 2017). The availability of diurnal cycles from satellite observations should help obtain more realistic model simulations by better constraining them.
The results presented here lead to extending the work to a larger ROI. However, the worse resolution of geostationary images when increasing the incidence angle comes at a cost. Firstly, it implies more difficulties in accurately resolving the actual atmospheric state. Secondly, the DARDAR data result in less quality when resampled to match the geostationary images. On the one hand, the larger the incidence angle, the more DARDAR profiles are located in one SEVIRI pixel. On the other hand, pixels at large incidence angles contain more aggregated information than at smaller angles. These inconveniences, all a product of the worse resolution, create a significant irregularity in the reference data. Extending this work to a full-disc retrieval can benefit from handling this quality irregularity; more advanced approaches than just using the satellite zenith angle can be considered for this issue. Further research can include using uncertainty in the reference data when training the models, where this uncertainty can be based on the disagreement among the DARDAR profiles in a SEVIRI pixel.

Conclusions
Progress in characterizing the mass of ice hydrometeors helps to model better the climate sensitivity and changes in precipitation. The frequent observations from geostationary satellites can contribute to this progress, but IWP retrievals from geostationary radiances are complex. Traditional approaches for these retrievals present limitations: among others, they rely on solar reflectances and estimate IWP indirectly. ML models designed to avoid such inconveniences can be trained to approximate IWP retrievals from active sensors.
In this work, QRNNs are used to retrieve IWP from the SEVIRI instrument aboard Meteosat-9. QRNNs reproduce Bayesian retrievals, and it is seen that the IWP retrievals tend to be non-Gaussian in the different configurations explored. The use of solar reflectances helps the QRNN retrieval, but this restricts the retrieval to the small range of daytime solar angles in the reference data. Hence, thermal IR retrievals are preferred. A subset of IR channels based on the previous Me- . Panels (c-f) show the retrieval at t observations from the retrieval in panel (a), zoomed into the area shown in red in panel (b); analogously for panels (g-j) for panel (b). Same colour scale as in Fig. 12. teosat generation satellites show promising results, but further work is required to evaluate the compatibility with their imagers.
Spatial information incorporated in the QRNN through a CNN improves the IWP retrieval. Therefore, retrievals from a CNN, IR-only QRNN trained with daytime and nighttime observations are advantageous. The QRNN retrievals based on an IR-only CNN not only provide retrievals at any time of the day but also approximate better DARDAR than the physics-based CPP algorithm. Overall, these IR retrievals suggest extending the work to cover larger ROIs, as well as considering ML retrievals in the preparations for upcoming missions.
There will be no collocations between the upcoming ICI and space-based cloud radars at tropical latitudes. Geostationary data can then act as a bridge in time between the two types of observations. For the ROI used in this work, which consists of low latitudes, retrievals of IWP as well as Z m from geostationary radiances can complement in time and space ICI, but D m retrievals are far from the expected ICI performance. In addition, the models trained in this work also provide a basis for retrievals from other Meteosat satellites in the current and next generations.
Appendix A: Derivation of IWP, Z m and D m from DARDAR Let z 0 and z max be the range of heights in which we are interested, P define the set of indexes identifying the profiles we want to combine, and |P| its cardinality. DARDAR variables are provided at a discrete range of heights. Let Z contain the set of indexes identifying the DARDAR heights between z 0 and z max , and z i the bin height range at height z i , i ∈ Z. Furthermore, let w l define the importance of profile l ∈ P. Throughout these derivations, we assume that all profiles are equally important, and the subindexes refer to the DARDAR variable values at the corresponding height and profile. The variables Z m and D m are only defined when IWP > 0, that is, IWC i = 0 for some z i .
Concerning the average IWP from a set of profiles, we have where we see that IWP from a set of profiles is the arithmetic mean of each profile IWP.
Regarding Z m , we have Therefore, the Z m of a set of profiles is given by averaging the Z m of each profile weighted by its IWP.
To derive the expression for D m we first need to introduce the diameter of an equivalent melted particle d eq , the particle size distribution for diameter d eq and height z as n(d eq , z), and ρ w = 1000 kg m −3 , the water density. From Delanoë et al. (2014), we have that IWC(z) = π ρ w 6 +∞ 0 d 3 eq n(d eq , z)dd eq (A6) Then, the D m for a set of profiles indexed by P is D m = z max z 0 +∞ 0 d 4 eq n(d eq , z) dd eq dz z max z 0 +∞ 0 d 3 eq n(d eq , z) dd eq dz (A13) = l∈P i∈Z w l z il +∞ 0 d 4 eq n(d eq , z il ) dd eq k∈P i∈Z w k z ik +∞ 0 d 3 eq n(d eq , z ik ) dd eq (A14) = l∈P i∈Z z il +∞ 0 d 4 eq n(d eq , z il ) dd eq k∈P i∈Z z ik +∞ 0 d 3 eq n(d eq , z ik ) dd eq (A15) · i∈Z z il +∞ 0 d 3 eq n(d eq ,z il ) dd eq i∈Z z il +∞ 0 d 3 eq n(d eq ,z il ) dd eq 1 = l∈P i∈Z z il +∞ 0 d 3 eq n(d eq , z il ) dd eq D m,l k∈P i∈Z z ik +∞ 0 d 3 eq n(d eq , z ik ) dd eq (A16) = l∈P i∈Z z il IWC il D m,l k∈P i∈Z z ik IWC ik = l∈P IWP l D m,l k∈P IWP k .
That is, it is also given by averaging the D m of each profile weighted by its IWP.
Author contributions. All authors contributed to the project through discussions. AA collected the data, performed the training and evaluation of the models, and prepared the manuscript, including all visual material. PE supervised the project and provided scientific feedback. SP participated with the retrieval of D m and machine learning feedback. SP also developed the quantnn Python package (Pfreundschuh, 2022), which has been used to implement the retrievals.
Competing interests. The authors declare that they have no conflict of interest.
Disclaimer. Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Acknowledgements. The contributions from PE and SP were covered by Swedish National Space Agency grant 154/19. The computations were performed on resources at Chalmers Centre for Computational Science and Engineering (C3SE) provided by the Swedish National Infrastructure for Computing (SNIC). We would also like to thank the Satpy contributors that clarified the presence of a Meteosat georeferencing offset to us and how to handle it. Finally, we thank the three anonymous referees for their constructive comments.
Financial support. This research has been supported by the Swedish National Space Agency (grant no. 154/19).
Review statement. This paper was edited by Jian Xu and reviewed by three anonymous referees.