Improved cloud detection for the Aura Microwave Limb Sounder: Training an artificial neural network on colocated MLS and Aqua-MODIS data

An improved cloud detection algorithm for the Aura Microwave Limb Sounder (MLS) is presented. This new algorithm is based on a feedforward artificial neural network and uses as input, for each MLS limb scan, a vector of 1,710 brightness temperatures provided by MLS observations from 15 different tangent altitudes and up to 13 spectral channels in each of 10 different MLS bands. The model has been trained on global cloud properties reported by Aqua's Moderate Resolution Imaging Spectroradiometer (MODIS). In total, the colocated MLS-MODIS data set consists of 162,117 combined scenes sampled on 208 days over 2005-2020. We show that the algorithm correctly classifies > 96% of cloudy and clear instances for previously unseen MLS scans. A comparison to the current MLS cloudiness flag used in "Level 2" processing reveals a substantial improvement in classification performance. For all profiles in the colocated MLS-MODIS data set, the algorithm successfully detects 97.8% of profiles affected by clouds, up from 15.8% for the Level 2 flagging. Meanwhile, false positives reported for actually clear profiles are reduced to 1.7%, down from 6.2% in Level 2. The classification performance is not dependent on geolocation. The new cloudiness flag is applied to determine average global cloud cover between 2015 and 2019, successfully reproducing the spatial patterns of mid-level to high clouds reported in previous studies. It is also applied to four example cloud fields to illustrate the reliable performance for different cloud structures with varying degrees of complexity. Training a similar model on MODIS-retrieved cloud top pressure yields reliable predictions with correlation coefficients greater than 0.99. The combination of cloudiness flag and predicted cloud top pressure provides the means to identify MLS profiles in the presence of high-reaching convection.

Copyright statement. ©2020. California Institute of Technology. Government sponsorship acknowledged.

analysis of lower-stratospheric water vapor enhancements associated with overshooting convection. Currently, studies of these events rely on computationally expensive colocation of water vapor profiles with cloud properties from different observational sources (e.g., Tinney and Homeyer, 2020; Werner et al., 2020; Yu et al., 2020).
This study describes the training and validation of an improved MLS cloud detection scheme employing a feedforward artificial neural network ("ANN" hereinafter). This algorithm is designed to classify clear and cloudy conditions for individual MLS profiles, based purely on the sampled MLS radiances. Two specific goals are set for the new algorithm: (i) detection of both high and mid-level clouds (e.g., stratocumulus and altostratus), and (ii) detection of less opaque clouds containing lower amounts of liquid or ice water. Observed cloud conditions, used to train the ANN, are provided by the cloud products reported by the Moderate Resolution Imaging Spectroradiometer (MODIS) aboard NASA's Aqua platform. Of the major satellite instruments, Aqua MODIS observations are ideal for this study, as they provide operational cloud products on a global scale that are essentially coincident and concurrent with the MLS observations.

The manuscript is structured as follows: section 2 describes both the MLS and MODIS data used in this study. Then a short introduction to the general setup of a feedforward ANN is given in section 3.1, followed by specifics on the output (section 3.2), input (section 3.3), and the training and validation procedure (section 3.4) of the developed models. Results of applying the new algorithm to MLS data are given in section 4, which includes a statistical comparison of the prediction performance.

Aura MLS samples brightness temperatures (T_B) in five spectral frequency ranges around 118, 190, 240, 640, and 2,500 GHz (Waters et al., 2006) (the latter, measured with separate, independent optics, was deactivated in 2010 and is not considered here). Multiple bands, consisting of 4-25 spectral channels, cover each of these frequency ranges. The exact position of the respective bands is dictated by the different absorption features of the various atmospheric constituents that MLS observes.
MLS makes ≈ 3,500 daily vertical limb scans (called major frames; MAFs), each consisting of 125 minor frames (MIFs) that can be associated with tangent pressures (p_tan) at different altitudes in the atmosphere. These observations provide the input for profile retrievals of a wide-ranging set of atmospheric trace gas concentrations, including water vapor, ozone, and nitric acid.
The respective Level 2 Geophysical Product (L2GP) files also report a status diagnostic for every MLS profile, which includes flags indicating high and low cloud influence. The most recent MLS dataset is version 5; however, at the time the ANN was being developed, reprocessing of the entire 16-year MLS record with the v5 software had not yet been completed. Accordingly, L2GP cloudiness flags in this study are provided by the version 4.2x data products (Livesey et al., 2020), and v4.2x is also the source for the Level 1 radiance measurements used herein. The spatial resolution of MLS Level 2 products varies from species to species, but typical values are 3 km in the vertical and 5 × 500 km in the cross-track and along-track dimensions. The distance along the orbit track between adjacent sampled profiles is ≈ 165 km.

Global cloud variables used in this study are provided by retrievals from the Aqua-MODIS instrument, which precedes the Aura overpass by about 15 minutes. However, because of the differences in their viewing geometries, the true time separation between MLS and MODIS measurements is substantially smaller than 15 minutes (see section 3.2). MODIS collects radiance data from 36 spectral bands in the wavelength range between 0.415-14.235 µm. For a majority of the channel observations and subsequently retrieved cloud properties, the spatial resolution at nadir is 1,000 m, although the pixel dimensions increase towards the edges of a MODIS granule. Each granule has a viewing swath width of 2,330 km, enabling MODIS to provide global coverage every two days. More information on MODIS and its cloud product algorithms (the current version is Data Collection 6.1) is given in Ardanuy et al. (1992); Barnes et al. (1998); Platnick et al. (2017). Each pixel, j, within a MODIS granule reports a value for the cloud flag, a cloud top pressure (p_CT^j), cloud optical thickness (τ^j), and effective droplet radius (r_eff^j).
These last two variables are used to derive the total water path (Q_T^j), which contains both the liquid and ice water path and characterizes the amount of water in a remotely sensed cloud column. It can be calculated following the discussions in Brenguier et al. (2000); Miller et al. (2016):

Q_T^j = Γ ρ^j τ^j r_eff^j,   (1)

where ρ^j is the bulk density of water in either the liquid or ice phase (following the cloud phase retrieval for pixel j), and the factor Γ accounts for the vertical cloud structure. For vertically homogeneous clouds it can be shown that Γ = 2/3.

Table 1 lists the 208 days that comprise the global data set used in this study. It consists of eleven random days from each year between 2005 and 2020, as well as a pair of consecutive days to bring the yearly coverage to thirteen days. Particular attention was paid to ensure that each month is represented (close to) equally in the final data set.
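As a concrete illustration, the total water path formula above can be sketched in a few lines of Python. The bulk densities of 1,000 and 917 kg m^-3 for liquid and ice water are standard textbook values assumed here, not taken from the text:

```python
import numpy as np

def total_water_path(tau, r_eff, ice_phase, gamma=2.0 / 3.0):
    """Q_T = Gamma * rho * tau * r_eff, with Gamma = 2/3 for
    vertically homogeneous clouds. tau is the cloud optical thickness,
    r_eff the effective droplet radius in metres, and ice_phase a
    boolean (or boolean array) selecting the ice-phase retrievals."""
    rho = np.where(ice_phase, 917.0, 1000.0)  # bulk density in kg m^-3 (assumed values)
    return gamma * rho * tau * r_eff          # total water path in kg m^-2

# A liquid-phase cloud with tau = 10 and r_eff = 10 micrometres
# yields a Q_T of roughly 0.067 kg m^-2 (about 67 g m^-2):
q_t = total_water_path(10.0, 10e-6, False)
```

Note that this is vectorized, so MODIS pixel arrays for τ^j, r_eff^j, and the phase flag can be passed directly.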
The new algorithm employs a multilayer perceptron, which is a subcategory of feedforward ANNs that sequentially connects neurons between different layers. An introduction to multilayer perceptrons is given in section 3.1. The output vector containing the labels (i.e., the binary cloud classifications) based on a colocated MLS-MODIS data set, and the input matrices, which consist of MLS T_B observations, are described in sections 3.2 and 3.3, respectively. The choice of hyperparameters, the training setup, and the validation results from the algorithm are provided in section 3.4.

The weights that connect the input to the output data are determined by the "Keras" library for Python (version 2.2.4; Chollet et al., 2015) with "TensorFlow" (version 1.13.1) as the backend (Abadi et al., 2016). Figure 1 illustrates the general setup of a simplified multilayer perceptron that contains four layers. The input layer (shown in blue) consists of m = 3 vectors that contain selected MLS brightness temperatures T_B1, T_B2, and T_B3. The input layer is succeeded by two hidden layers (shown in green) with two neurons each (N_{h1-1} and N_{h1-2}, as well as N_{h2-1} and N_{h2-2}) and the respective bias vectors (B_1 and B_2). The following output layer (shown in orange) consists of a single vector (L̂; containing the predicted labels) and a corresponding bias (B_L). The brightness temperature vectors (T_Bi; i = 1, 2, 3) used as input for the ANN are provided by T_B observations in selected channels, bands, and minor frames. They are of length n, which describes the number of scalar MLS observations (T_Bi^j). This means that i = 1, 2, 3 brightness temperatures were sampled by MLS at j = 1, ..., n major frames. Similarly, there is a scalar label L^j for each MAF, so L is also of length n.

Algorithm description
At each neuron N_{h1-k}, k = 1, 2 in the first hidden layer a scalar value γ^j_{1-1} and γ^j_{1-2} for each of the j MAFs is calculated:

γ^j_{1-1} = w_{1,1-1} T_B1^j + w_{2,1-1} T_B2^j + w_{3,1-1} T_B3^j + B_{1-1},   (2)
γ^j_{1-2} = w_{1,1-2} T_B1^j + w_{2,1-2} T_B2^j + w_{3,1-2} T_B3^j + B_{1-2}.   (3)

These values are subsequently modified by an activation function, which introduces non-linearity into the neuron output.

The hyperbolic tangent activation function is applied, which is shown to be very efficient during training because of its steep gradients (e.g., LeCun et al., 1989; LeCun et al., 1998) and yields new values Γ^j_{1-1} and Γ^j_{1-2}. For the second hidden layer, the scalar neuron values at N_{h2-k}, k = 1, 2 for each MAF j are derived as:

γ^j_{2-1} = w_{1,2-1} Γ^j_{1-1} + w_{2,2-1} Γ^j_{1-2} + B_{2-1},   (4)
γ^j_{2-2} = w_{1,2-2} Γ^j_{1-1} + w_{2,2-2} Γ^j_{1-2} + B_{2-2}.   (5)
As before, these values are transformed by the hyperbolic tangent activation function, which yields the transformed neuron values Γ^j_{2-1} and Γ^j_{2-2}. Finally, the neuron output from N_{h2-1} and N_{h2-2} is connected to the single vector L̂ in the output layer. For each MAF j the respective scalar value λ^j is calculated as:

λ^j = w_{1,L} Γ^j_{2-1} + w_{2,L} Γ^j_{2-2} + B_L.   (6)

We aim for a binary, two-class classification setup (i.e., either cloudy or clear designations). As a result, the softmax function normalizes the λ^j results at the output layer. The softmax activation function is identical to the logistic sigmoid function for a binary, two-class classification setup. This means that the predicted neuron output in the output layer is calculated as:

L̂^j = 1 / (1 + e^{-λ^j}).   (7)

The ideal weights in Eqs.
(2), (3), (4), (5), and (6) need to be derived iteratively by evaluating a loss function (χ), which is the log-loss function (or cross-entropy) in the classification setup. If L^j and L̂^j are the individual elements of the two output vectors L and L̂ (i.e., the prescribed and currently predicted labels), χ for two classes is defined as:

χ = -(1/n) Σ_{j=1}^{n} [ L^j ln L̂^j + (1 - L^j) ln(1 - L̂^j) ].   (8)

Note that in case of L^j = 0 or L̂^j = 0 an infinitesimal quantity ≈ 0 is added to the respective label to avoid the undefined ln 0.
The "Keras" library includes multiple optimizers to minimize Eq. (8) in a numerically efficient way. The exact setup and choice of hyperparameters need to be determined carefully via cross-validation during the training process (see section 3.4).
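The forward pass of Eqs. (2)-(7) and the loss of Eq. (8) can be sketched in a few lines of NumPy. This is a toy illustration with random weights, not the trained model; the small `eps` plays the role of the infinitesimal quantity mentioned above:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions: m = 3 brightness temperatures, 2 neurons per hidden layer, n = 5 MAFs.
n, m, h = 5, 3, 2
TB = rng.normal(250.0, 10.0, size=(n, m))             # T_B inputs for n MAFs
W1, b1 = rng.normal(size=(m, h)), rng.normal(size=h)  # weights/bias, first hidden layer
W2, b2 = rng.normal(size=(h, h)), rng.normal(size=h)  # weights/bias, second hidden layer
wL, bL = rng.normal(size=h), rng.normal()             # weights/bias, output layer

G1 = np.tanh(TB @ W1 + b1)           # Eqs. (2)-(3) plus tanh activation
G2 = np.tanh(G1 @ W2 + b2)           # Eqs. (4)-(5) plus tanh activation
lam = G2 @ wL + bL                   # Eq. (6)
L_hat = 1.0 / (1.0 + np.exp(-lam))   # Eq. (7): sigmoid, i.e. two-class softmax

def log_loss(L, L_hat, eps=1e-12):
    """Binary cross-entropy, Eq. (8); eps avoids the undefined ln 0."""
    L_hat = np.clip(L_hat, eps, 1.0 - eps)
    return -np.mean(L * np.log(L_hat) + (1.0 - L) * np.log(1.0 - L_hat))

chi = log_loss(np.array([1.0, 0.0, 1.0, 0.0, 1.0]), L_hat)
```

In practice, Keras performs this forward pass and loss evaluation internally; the sketch only makes the equations concrete.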

The labels from colocated MLS-MODIS cloud data
Training data for the output vector L, which contains the prescribed labels for Eq. (8), are derived from the colocated MODIS cloud products. If one considers a 1° × 1° box (in latitude and longitude; blue box) around an MLS profile (blue "x"), then each of the n_per pixels reports a cloudiness flag, as well as a total water path (Q_T^j) and a cloud top pressure (p_CT^j), with j = 1, 2, ..., n_per denoting the individual pixels within the 1° × 1° box. Note that for legibility the cloud properties of only three MODIS pixels are shown. For the respective MLS profile, these parameters are aggregated to more general cloud statistics consisting of the cloud cover (C) within the 1° × 1° box, as well as the median total water path (Q_T) and median cloud top pressure (p_CT).
Figure 2b shows the global distribution of sample frequencies for the colocated MLS-MODIS training data set within grid boxes of 60° × 60° (latitude and longitude). While not every grid box contains the same number of profiles, each area contains at least 5,000 samples. Apart from a single grid box over Africa, the higher latitudes tend to contain more samples, because both Aqua and Aura are polar-orbiting satellites. A majority of grid boxes contain more than 8,000 samples.
The aggregated profile-level cloud statistics are used to define the observed clear sky and cloudy conditions. All profiles that are characterized by C ≥ 2/3, p_CT < 700 hPa, and Q_T > 50 g m^-2 are labeled as cloudy, while profiles with C < 1/3 and Q_T < 25 g m^-2 are considered to be associated with clear sky samples. While the cloud cover threshold is somewhat arbitrary, the p_CT limit for cloudy observations and the Q_T thresholds are carefully selected. The large opacity of the atmosphere for longer path lengths means that MLS shows almost no sensitivity towards clouds with p_CT > 700 hPa (see section 3.3). This upper pressure limit, which in the 1976 US Standard Atmosphere (COESA, 1976) is located at an altitude of ∼ 3 km, is around the lower limit of observed cloud tops of mid-level cloud types (e.g., altostratus, altocumulus). Meanwhile, the 10th and 25th percentiles of all profiles containing clouds within the 1° perimeter, regardless of C, are Q_T ≈ 25 g m^-2 and Q_T ≈ 50 g m^-2, respectively. These definitions have an additional benefit: they almost evenly split the data set into cloudy and clear sky profiles (52.0% and 48.0%, respectively), which improves the reliability of the trained weights for the cloud classification. Naturally, these definitions leave some profiles undefined (e.g., those with C in the range 1/3-2/3). These profiles (comparable in number to the combined cloudy and clear classes) cannot be included in the training of the ANN, as they lack a prescribed label.
The discussion in section 4.1 provides an analysis of the ANN performance for a redefined classification based on a simple threshold of C = 0.5 (in addition to a positive Q T ) to distinguish between cloudy and clear sky profiles.
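The primary labeling rules above can be sketched as a simple Python function (the function name is illustrative; thresholds are exactly those stated in the text):

```python
def label_profile(C, p_CT, Q_T):
    """Prescribed label for one MLS profile from the aggregated MODIS
    statistics: cloud cover C, median cloud top pressure p_CT (hPa),
    and median total water path Q_T (g m^-2). Returns 1 (cloudy),
    0 (clear sky), or None (undefined; excluded from training)."""
    if C >= 2.0 / 3.0 and p_CT < 700.0 and Q_T > 50.0:
        return 1  # cloudy
    if C < 1.0 / 3.0 and Q_T < 25.0:
        return 0  # clear sky
    return None   # e.g. 1/3 <= C < 2/3, or low/thin clouds
```

Profiles returning `None` lack a prescribed label and are dropped from the training set, as described above.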
It is important to note that the difference in viewing geometry between MLS and MODIS (i.e., limb geometry versus nadir viewing) induces a considerable degree of uncertainty in the colocation. While it is reasonable to assume that the majority of a potential cloud signal (or lack thereof) will come from the 1° × 1° box around the respective MLS profile, there are certain scenarios that will lead to a false classification. The most likely such scenario consists of an MLS line-of-sight that passes through a high-altitude cloud before a clear sky 1° × 1° box. Here, MLS will detect a strong cloud signal, even though the nadir-viewing MODIS instrument does not record any cloudiness at the location of the respective MLS profile. Less likely is the scenario of a very low-altitude cloud located right after (in terms of an MLS line-of-sight) a clear sky 1° × 1° box. This would also result in a false cloud classification (if the MODIS observations are taken as reference). However, because of the increase in atmospheric opacity, the sensitivity of the MLS instrument towards signals further along the line-of-sight decreases, and it is less likely that MLS would detect these cloud signals in any case. One contributor to the overall uncertainty that is of less concern is the time difference between the Aqua and Aura orbits (≈ 15 minutes). Because MLS looks forward in the limb, the temporal discrepancy between the sampling of individual MLS profiles and the colocated MODIS pixels is in the range of 0.6-1.4 minutes. The results presented in section 3.4 illustrate that by training the ANN with a large data set, as well as cross-validating the training results against a large number of random validation data, the contributions of uncertainties associated with colocation (both in space and time) can be considered small and do not overly impact the reliability of the cloud detection algorithm.

The input matrix from MLS brightness temperature observations
Figures 3a-c show the spectral behavior of T_B sampled in MLS bands 2, 33, and 14 at MIF = 15, which on average corresponds to p_tan ∼ 576 hPa (at an altitude of ∼ 4.5 km in the 1976 US Standard Atmosphere). In this section we mostly omit the superscript "j" to indicate the statistical analysis of all T_B^j in the respective band (j = 1, 2, ..., n). The median T_B is shown for profiles associated with clear sky (orange) and cloudy conditions (blue), based on the classifications from the colocated MLS-MODIS data set. Median clear sky profiles exhibit consistently larger T_B than cloudy observations, with differences of up to 10 K. This behavior confirms the findings in Wu et al. (2006), where ice clouds at an altitude of 4.7 km reduce band 33 T_B at the lower minor frames (i.e., larger p_tan). The interquartile ranges (IQRs) of the two different data sets are very close for band 2 observations (i.e., within 1-2 K), while there is overlap for the T_B sampled in bands 33 and 14.
To illustrate the reduced sensitivity of MLS to signals from very low clouds, the median T_B from profiles with p_CT > 700 hPa is shown in green (for clarity the corresponding IQR is omitted). These profiles behave similarly to clear sky observations, and the difference in median T_B is less than 1 K. The significant contrast in median T_B between clear sky and cloudy profiles, especially for band 2 and partly for band 33, might suggest the possibility of a simple cloud detection approach via thresholds. However, the respective IQRs often overlap, which indicates that a simple T_B threshold would miss about 25% of the clear and cloudy data, respectively. Moreover, the behavior illustrated in Figure 3 is specific to the latitudinal range of -30° to +30°. For higher latitudes, changes in atmospheric temperature and composition yield a noticeable decrease in the observed contrast, while close to the poles the clear sky profiles almost always have lower T_B than the cloudy observations (even at the lower MIFs). A more sophisticated classification approach, with T_B samples from additional MLS bands and minor frames, is necessary to derive a more reliable global cloud detection.

Table 2 details the MLS bands, as well as the respective channels and MIFs, that comprise the m × n input matrix for the ANN. The input matrix consists of m different T_B^j, sampled in individual channels (within the respective MLS bands) and MIFs, at n different times. To reduce the computational costs during the training of the model, not all MLS observations are considered. Instead, three different bands are chosen from the 190, 240, and 640 GHz regions, respectively. Those are bands 2, 3, 6, bands 7, 8, 33, and bands 10, 14, 28 for the three receivers. These bands were carefully selected after a statistical analysis of the altitude-dependent contrast in observed T_B between clear and cloudy profiles.
This contrast is generally low (in the range of 1 K) for the observations from the 118 GHz region, so only band 1 from this receiver is included in the model input. For most bands, every second channel is included in the input (except for band 33, which only has 4 channels in total), while considering every third MIF in the range 7-49 yields a reasonable vertical resolution between 15 hPa (for the highest altitudes) and 150 hPa (at the lowest altitudes). Overall, the input matrix for the training and validation of the ANN is of shape 1,710 × 162,117, i.e., it consists of m = 1,710 different features (T_B^j at different frequencies and altitudes) from n = 162,117 MAFs (either classified as clear sky or cloudy).
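The channel/MIF subsampling, together with the per-feature standardization used in section 3.4, can be sketched as follows. The array sizes are illustrative toy values, not the actual MLS channel layout:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy stand-in for raw radiances: (MAFs, channels across the selected bands, MIFs).
radiances = rng.normal(250.0, 10.0, size=(100, 90, 125))

channels = np.arange(0, 90, 2)  # every second channel (band 33 is an exception)
mifs = np.arange(7, 50, 3)      # every third MIF in the range 7-49 (15 tangent altitudes)

# Flatten the selected (channel, MIF) pairs into one feature vector per MAF.
X = radiances[:, channels][:, :, mifs].reshape(radiances.shape[0], -1)

# Standardize each feature to zero mean and unit variance.
X = (X - X.mean(axis=0)) / X.std(axis=0)
```

Standardization matters here because T_B dynamic ranges differ strongly between channels and tangent altitudes.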

Training and validation
The "Keras" Python library provides convenient ways to manage the setup, training, and validation of ANN models. The optimal weights for Eqs. (2)-(6) are derived in three steps: (i) determining the most appropriate hyperparameters via k-fold cross-validation, (ii) training and validating a number of different models with the best set of hyperparameters on multiple, random splits between training and validation data sets, and (iii) comparing the performance scores for the different model runs to evaluate the stability of the approach and pick the best set of weights. Each model is set up with two hidden layers. The number of neurons per hidden layer is set to 856, which corresponds to the (rounded) average between the number of nodes in the input and output layers (i.e., 1,710 and 1, respectively).
The hyperparameters to be determined are (i) the optimizer for the cloud classification, (ii) the learning rate, (iii) the mini-batch size, and (iv) the value for the weight decay (i.e., the L2 regularization parameter). While the optimizer choice and learning rate control how quickly and accurately the minimum of the cost function in Eq. (8) is determined, the values for mini-batches and the L2 regularization characterize the level of noise and degrees of freedom in the models, which have a noticeable impact on model performance for new, previously unseen data. More information about ANN hyperparameters and their impact on the reliability of model predictions can be found in, e.g., Reed and Marks II (1999); Goodfellow et al. (2016).

Note that the number of epochs (i.e., the number of iterations during the training process) is not considered an important hyperparameter for this study. Instead, the models are run with a large number of epochs, and the lowest validation loss is recorded, so an increase in validation loss during the training (i.e., cases where the model is overfitting the training data at some point) has no impact on the overall performance evaluation.
At first, the data set is randomly shuffled and split into k = 4 parts. Subsequently, one of the four parts is used as the validation data set, and the other three are used to train the ANN with a certain set of hyperparameters. Here, each of the 1,710 features is individually standardized, i.e., each feature is transformed to have a mean value of 0 and unit variance. This step is essential for a successful ANN training, as the individual features are characterized by different dynamic ranges. Meanwhile, the labels for clear and cloudy profiles are simply set to 0 and 1, respectively. After model convergence and determination of a set of performance scores, the model is discarded and a different set of three parts is used for training (the remaining fourth part provides the validation data). From the confusion matrix M (with elements tp, tn, fp, and fn denoting the numbers of true positives, true negatives, false positives, and false negatives, respectively), the accuracy (Ac), F1 score (F1), and Matthews correlation coefficient (Mcc) can be derived as:

Ac = (tp + tn) / (tp + tn + fp + fn),
F1 = 2 tp / (2 tp + fp + fn),
Mcc = (tp · tn - fp · fn) / √[(tp + fp)(tp + fn)(tn + fp)(tn + fn)].

While Ac quantifies the proportion of correctly classified samples, F1 describes the harmonic mean value between precision (proportion of true positives in the positively predicted ensemble, i.e., the ratio of tp to tp + fp) and recall (proportion of correctly predicted true positives, i.e., the ratio of tp to tp + fn). Generally, F1 assigns more relevance to false predictions and is more suitable for imbalanced classes. Meanwhile, all elements of the confusion matrix are important in determining the Mcc, which yields values between -1 and 1 and thus is analogous to a correlation coefficient.
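The three scores can be computed directly from the confusion-matrix elements; a straightforward sketch:

```python
import math

def scores(tp, tn, fp, fn):
    """Accuracy, F1 score, and Matthews correlation coefficient from
    the binary confusion matrix elements (true/false positives/negatives)."""
    ac = (tp + tn) / (tp + tn + fp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return ac, f1, mcc

# A perfect classifier yields Ac = F1 = Mcc = 1:
ac, f1, mcc = scores(50, 50, 0, 0)
```

Note that Mcc is undefined when any marginal sum is zero; production code would guard against that degenerate case.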
This analysis revealed that the stochastic gradient descent optimizer, using a learning rate of 0.001 and a Nesterov momentum value of 0.9, yielded the overall best validation scores. The best weight decay and mini-batch size values were found to be 5 × 10^-4 and 1024 (i.e., 0.8% of the training data), respectively. Therefore, developing multiple models with a reasonable split of training and validation data, as well as careful monitoring of the spread in validation scores, is imperative. In this study, 100 different models are developed. Before each model run, the data set is randomly shuffled and split into 75% training and 25% validation data. As mentioned earlier, each model is run with a large number of epochs, and the weights associated with the lowest validation loss are recorded.
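Putting the architecture and these hyperparameters together, the model can be sketched with the modern tf.keras API. Note this is an assumption about the API surface: the study used the standalone Keras 2.2.4 with TensorFlow 1.13.1, so the exact calls in the original code differ:

```python
import tensorflow as tf

l2 = tf.keras.regularizers.l2(5e-4)  # weight decay (L2 regularization)
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(1710,)),                                  # 1,710 T_B features
    tf.keras.layers.Dense(856, activation="tanh", kernel_regularizer=l2),  # hidden layer 1
    tf.keras.layers.Dense(856, activation="tanh", kernel_regularizer=l2),  # hidden layer 2
    tf.keras.layers.Dense(1, activation="sigmoid"),                        # cloudiness probability P
])
model.compile(
    optimizer=tf.keras.optimizers.SGD(learning_rate=0.001, momentum=0.9, nesterov=True),
    loss="binary_crossentropy",
    metrics=["accuracy"],
)
# model.fit(X_train, y_train, batch_size=1024, epochs=..., validation_data=...)
```

Training with many epochs while keeping the weights of the lowest validation loss corresponds to Keras model checkpointing on `val_loss`.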
The output of each ANN model is a cloudiness probability (P) between 0 (clear) and 1 (cloudy). Note that throughout this study we simply group each prediction in either the clear or cloudy class, i.e., MAFs with predicted probabilities 0 ≤ P < 0.5 are considered to be sampled under clear sky conditions, while MAFs with 0.5 ≤ P ≤ 1 are considered to be cloudy. The one exception is the discussion in section 4.2, where the actually predicted P are employed to study the ANN performance for undefined cloud conditions (with respect to the clear sky and cloudy definitions presented in section 3.2).
A summary of the derived prediction statistics is shown in Figure 4a. Each histogram shows the average percentage of correctly predicted clear sky (i.e., tn, orange shading) and cloudy (i.e., tp, blue shading) labels for all 100 validation data sets.

Probabilities for different cloud conditions
The clear sky and cloudy classes defined in section 3.2 leave a number of profiles unaccounted for (i.e., neither clear sky nor cloudy), such as those with 1/3 ≤ C < 2/3 or p_CT > 700 hPa. While it is reasonable to only train the model on the confidently clear and cloudy conditions, it is essential to understand the ANN performance for the undefined, in-between cases. Profiles with larger cloud cover are generally flagged as probably cloudy (P > 0.5). However, only profiles that also have Q_T > 100 g m^-2 are reliably predicted to have P > 0.75. The less-confident identification of the Q_T > 100 g m^-2 cases reflects the fact that many of them have low cloud tops, p_CT > 700 hPa, and are thus not readily observed by MLS. As noted in section 3.3, these profiles exhibit similar spectral behavior to clear ones, and the ANN is expected to miss most of these clouds. With increasing Q_T even profiles with smaller cloud fractions (as little as C = 0.25) are flagged as cloudy. Note that the P results become noisy for very large Q_T > 500 g m^-2, conditions that are observed for less than 4% of the total samples (< 1% for Q_T > 1000 g m^-2).
The behavior of predicted P for observations with p_CT < 700 hPa is shown in Figure 6b. Here, the confidently clear predictions remain largely unchanged. However, probably cloudy predictions are observed for C > 0.5, even for low Q_T. Confidently cloudy predictions dominate the previously defined cloudy class (C ≥ 2/3 and Q_T > 50 g m^-2).
In order to evaluate the ANN performance when more of these uncertain cases are encompassed in the validation, Table 3 includes a comparison of the binary performance scores for a redefined set of cases classified as clear and cloudy according to less conservative thresholds for the cloud cover and the total water path (C < 0.5 and Q_T < 25 g m^-2 for clear sky profiles, C ≥ 0.5 and Q_T ≥ 25 g m^-2 for cloudy profiles). These changes increase the validation data set from n = 162,117 to n = 328,286 profiles. Due to the looser definitions, there is a significant drop in performance scores, which can mostly be attributed to a lower true positive rate (i.e., cloud detection) of 0.69 and 0.08 for the ANN classification and v4.2x, respectively. The fraction of false positives (i.e., false prediction of cloudiness for actually clear profiles) remains basically unchanged (increases of 0.02 and 0.00 for the ANN and v4.2x flags, respectively). This means that even with a looser cloudiness definition, the ANN does not yield a multitude of false cloud classifications; rather, the algorithm fails to detect a larger fraction of cloudy profiles. This mostly applies to lower-level clouds and those with small Q_T. The ANN still detects 73.0% of cloudy profiles with Q_T ≥ 1,000 g m^-2 (compared to 17.3% for the v4.2x flag). As a consequence of the reduced true positive rates for the redefined class definitions, the derived F1 for the ANN is reduced to 0.81 (from 0.98), while F1 for the current v4.2x flag drops from 0.26 to 0.14.

Geolocation-dependent performance and global cloud cover distribution
The spectral behavior for clear sky and cloudy profiles shown in Figure 3 only applies for observations made in the latitudinal range of -30° to +30°. As mentioned in section 3.3, the contrast between the two classes of data decreases with increasing latitude. Nevertheless, the derived global cloud cover values, as well as the observed spatial patterns of mid to high clouds, agree well with those reported in King et al. (2013); Lacagnina and Selten (2014).
As before, we are interested in comparing the results of the new ANN classification to the ones from the current v4.2x cloud flag. Therefore, a similar map of derived global cloud cover from the current v4.2x cloud flag is shown in Figure 7d. In contrast to the ANN results, the calculated C is below 32% almost everywhere. This behavior is consistent with the focus of the v4.2x flagging on significantly opaque clouds.

Example scenes
The analysis in the previous sections centered on statistical metrics and the reproduction of large-scale, global cloud patterns.
There, the cloud flag based on the new ANN algorithm yields reliable results, both in comparison to the current v4.2x status flag and as a standalone product. However, a more qualitative assessment of the model performance for individual cloud scenes provides additional confidence in the technique, as well as insights into the classification performance for different cloud types.

Again, profiles are flagged as cloudy when P ≥ 0.5. Similarly, Figure 9 shows two example cloud fields over the Asian summer monsoon region, which also regularly contains overshooting convection from mesoscale cloud systems. The first scene is shown in Figures 9a-d. Note that the two example scenes in Figure 9 represent previously unseen data for the ANN, i.e., the models were not trained on these MLS observations.

Predicting cloud top pressure
The results in section 4 illustrate that the proposed ANN algorithm can successfully detect the subtle cloud signatures in the spectral T_B profiles shown in Figure 3. For many MLS bands, the differences between cloudy and clear sky T_B are usually in the range of just a few Kelvin, and the spectral behavior heavily depends on the respective MIF (i.e., the pressure level at the tangent point of each scan). This section demonstrates how this behavior can be used in a similar ANN setup to infer cloud top pressure; the model setup largely follows the classification setup. The labels in the output layer, instead of being set to either "0" or "1" (i.e., clear sky or cloudy), now contain the respective p_CT reported by the colocated MLS-MODIS data set. A simple linear function replaces the "softmax" activation in the output layer, i.e., L̂^j = λ^j in Eq. (7). Similarly, the model optimizer, learning rate, and mini-batch size reported in section 3.4 for the cloud classification ANN provide the best set of hyperparameters; here the only change concerns the weight decay parameter, which is turned off. As before, the model with the best validation loss provides the weights for the following evaluation.
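A regression variant of the described setup can be sketched with the modern tf.keras API. Two assumptions are made here: the study used Keras 2.2.4/TensorFlow 1.13.1 (so the exact calls differ), and the mean-squared-error loss below is a conventional choice for such a regression, not stated in the text:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(1710,)),
    tf.keras.layers.Dense(856, activation="tanh"),  # weight decay turned off
    tf.keras.layers.Dense(856, activation="tanh"),
    tf.keras.layers.Dense(1, activation="linear"),  # predicted p_CT: L_hat = lambda
])
model.compile(
    optimizer=tf.keras.optimizers.SGD(learning_rate=0.001, momentum=0.9, nesterov=True),
    loss="mse",  # assumed regression loss
)
```

Apart from the linear output and the absent L2 regularization, the architecture and optimizer settings mirror the classification setup from section 3.4.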
Joint histograms of true (in the sense that they are the prescribed labels to train the ANN) and predicted p_CT for all cloudy profiles in the colocated MLS-MODIS data set are presented in Figure 10. Further example scenes cover the continental United States during summer. The first example (panels a-b) is characterized by high clouds in the northern part with p_CT as low as ∼ 200 hPa, while at the southern tip there are low clouds with p_CT > 600 hPa. The ANN reliably reproduces the low cloud top pressures in the north; however, the p_CT values for the low clouds in the south are slightly underestimated, with predicted p_CT ≈ 500-550 hPa. The second example scene (panels c-d) consists of a mix of high and mid-level clouds with p_CT < 450 hPa, which the ANN predictions correctly reproduce. The last example scene (panels e-f) shows a complicated mix of low, mid-level, and high clouds that basically covers the full p_CT range, as well as some clear sky areas in between. The ANN algorithm detects the small mid-level convection in the north, followed by very low clouds and the cloud gap over the center of the scene. It also correctly detects the subsequent large band of high clouds with p_CT < 350 hPa over the southern continental United States.

As with the cloud detection algorithm, the prediction performance for p_CT appears to decline with increasing cloud top pressure, consistent with the reduced contrast between clear-sky and cloudy T_B around p_CT ∼ 700 hPa, as shown in Figure 3. However, the ANN can reliably distinguish between high-reaching convection with p_CT < 300 hPa and mid- to low-level clouds.
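This distinction amounts to a simple combination of the two ANN outputs. The helper below (the function name and exact comparison are illustrative, not part of the MLS processing) flags a profile as high-reaching convection when it is classified as cloudy and its predicted p_CT falls below 300 hPa:

```python
import numpy as np

def flag_high_reaching_convection(cloudy, p_ct_hpa, threshold_hpa=300.0):
    """Flag profiles classified as cloudy whose predicted cloud top
    pressure lies below the given threshold (default 300 hPa)."""
    cloudy = np.asarray(cloudy, dtype=bool)
    p_ct_hpa = np.asarray(p_ct_hpa, dtype=float)
    return cloudy & (p_ct_hpa < threshold_hpa)

# Four example profiles: ANN cloudiness flag and predicted p_CT in hPa.
cloudy = [True, True, False, True]
p_ct = [220.0, 650.0, 280.0, 310.0]
deep = flag_high_reaching_convection(cloudy, p_ct)
```

Only the first profile (cloudy, p_CT = 220 hPa) is flagged; the clear profile at 280 hPa is excluded regardless of its predicted pressure.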
The current MLS cloud flags, reported in the Level 2 Geophysical Product of version 4.2x, are designed to identify profiles that are influenced by significantly opaque clouds, with the main goal being to identify cases where retrieved composition profiles may have been adversely affected either by the clouds or by the steps taken in the retrieval to exclude cloud-affected radiances.
In this study, we present an improved cloud detection scheme based on the popular "Keras" Python library for setting up, testing, and validating feedforward artificial neural networks (ANNs). This new algorithm is shown not only to reliably detect high and mid-level convection containing even small amounts of cloud water, but also to distinguish between high-reaching and mid- to low-level convection.
A closer look at monsoon regions reveals that the ANN can reliably identify diverse cloud fields, including those characterized by low-level clouds and low Q_T. Together with the consistently high statistical agreement, these global and regional examples of successful cloud detection illustrate that the predefined cloudiness conditions (following thresholds for C, p_CT, and Q_T) are reasonable.
Moreover, the uncertainties arising from associating MLS observations in the limb with nadir MODIS images do not seem to substantially impact the reliability of the ANN algorithm.

This study demonstrates that the ANN algorithm can not only detect cloud influences in individual MLS profiles, but also reliably predict the MODIS-retrieved p_CT. This is illustrated by high correlation coefficients of > 0.99 and by the good model performance for three example cloud fields of varying degrees of complexity.
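The quoted agreement corresponds to the Pearson correlation coefficient between the true (MODIS-retrieved) and predicted p_CT. A sketch of the metric using synthetic stand-in data follows; the real evaluation uses all cloudy profiles in the colocated MLS-MODIS data set:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for MODIS "true" p_CT labels and ANN predictions (hPa).
p_true = rng.uniform(150.0, 900.0, size=1000)
p_pred = p_true + rng.normal(0.0, 20.0, size=1000)  # small regression error

# Pearson correlation coefficient between labels and predictions
r = np.corrcoef(p_true, p_pred)[0, 1]
```

With a prediction error of ∼20 hPa over a ∼750 hPa dynamic range, `r` exceeds 0.99, matching the order of agreement reported above.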
This new cloud classification scheme, which will be included in future versions of the MLS data processing beyond v4.2x, provides the means to reliably identify profiles with potential cloud influence. As mentioned in the introduction, this new algorithm will facilitate future research on reducing uncertainties in the retrieval of atmospheric constituents in the presence of clouds. Moreover, studies on convective moistening of the lowermost stratosphere, as well as on cloud scavenging of atmospheric pollutants, will benefit from these new capabilities.
Figure 1. Simplified sketch of the algorithm setup, including three vectors in the input layer (blue) that contain MLS brightness temperatures (T_i; i = 1-3), two hidden layers (green) with two neurons (N_h1-k and N_h2-k; k = 1-2) and one "bias" node each (B_k; k = 1-2), and an output layer (orange) with the labels vector (L) and "bias" node (B_L). Also shown are the input weights (ω_{i,k}; i = 0-3, k = 1-2), connecting weights (k = 0-2, l = 1-2), and output weights (Ω_l; l = 0-2) that connect the input variables to the neurons in the first hidden layer, the neurons of the two hidden layers to each other, and the neurons from the second hidden layer to the labels vector, respectively.

Table 2. Details of the input variables for the ANN algorithm, which consist of MLS brightness temperature observations in 10 different bands from 4 radiometers. Besides the official radiometer and band designations, the local oscillator (LO) and primary species of interest in the respective band are given, as well as the ranges of minor frames (MIFs) and channels used as input for the ANN.
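The forward pass sketched in Figure 1 can be written out directly. The NumPy sketch below uses random placeholder weights, the toy dimensions of the figure, and a sigmoid hidden-layer activation; the activation choice and the input standardization are assumptions made purely for illustration, whereas the trained model takes 1,710 brightness temperatures and fitted weights:

```python
import numpy as np

def sigmoid(z):
    """Illustrative hidden-layer activation (an assumption, not the paper's choice)."""
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)

# Toy dimensions matching the Figure 1 sketch: three inputs (T1-T3), two
# hidden layers with two neurons each, and a two-entry labels vector L.
W_in = rng.normal(size=(2, 3)); b_in = rng.normal(size=2)    # input weights and bias B_1
W_hid = rng.normal(size=(2, 2)); b_hid = rng.normal(size=2)  # connecting weights and bias B_2
W_out = rng.normal(size=(2, 2)); b_out = rng.normal(size=2)  # output weights and bias B_L

# Standardized brightness-temperature inputs (standardization assumed here).
T = np.array([-0.5, 0.1, 1.2])

h1 = sigmoid(W_in @ T + b_in)     # first hidden layer
h2 = sigmoid(W_hid @ h1 + b_hid)  # second hidden layer
L = W_out @ h2 + b_out            # labels vector (linear output shown here)
```

The bias nodes B_1, B_2, and B_L of the figure appear here as the additive vectors `b_in`, `b_hid`, and `b_out`.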