The SPARC water vapor assessment II: Assessment of satellite measurements of upper tropospheric water vapor

William Read1, Gabriele Stiller2, Stefan Lossow2, Michael Kiefer2, Farahnaz Khosrawi2, Dale Hurst3, Holger Vömel4, Karen Rosenlof5, Bianca M. Dinelli6, Piera Raspollini7, Gerald E. Nedoluha8, John C. Gille9,10, Yasuko Kasai11, Patrick Eriksson12, Christopher E. Sioris13, Kaley A. Walker14, Katja Weigel15, John P. Burrows15, and Alexei Rozanov15 1Jet Propulsion Laboratory, California Institute of Technology, Pasadena, Ca., USA. 2Karlsruhe Institute of Technology, Institute of Meteorology and Climate Research, Karlsruhe, Germany. 3Global Monitoring Division, NOAA, Earth System Research Laboratory, Boulder, Colorado, USA. 4Earth Observing Laboratory, National Center for Atmospheric Research, Boulder, Colorado, USA. 5Chemical Science Division, NOAA, Earth System Research Laboratory, Boulder, Colorado, USA. 6Instituto di Scienze dell’Atmosfera e del Clima del Consiglio Nazionale delle Ricerche (ISAC-CNR), Via Gobetti, 101, 40129 Bologna, Italy. 7Instituto di Fisica Applicata del Consiglio Nazionale delle Ricerche (IFAC-CNR), Via Madonna del Piano, 10, 50019 Sesto Fiorentino, Italy. 8Naval Research Laboratory, Remote Sensing Division, 4555 Overlook Avenue Southwest, Washington, DC 20375, USA. 9National Center for Atmospheric Research, Atmospheric Chemistry Observations & Modeling Laboratory, P.O. Box 3000, Boulder, CO. 80307-3000, USA. 10University of Colorado, Atmospheric and Oceanic Sciences, Boulder, CO 80309-0311, USA. 11National Institute of Information and Communications Technology (NICT), 20 THz Research Center, 4-2-1 Nukui-kita, Koganei, Tokyo 184–8795, Japan. 12Chalmers University of Technology, Department of Space, Earth and Environment, Hörsalsvägen 11, 41296 Göteborg, Sweden. 13York University, Center for Research in Earth and Space Science, 4700 Keele Street, Toronto, Ontario M3J 1P3, Canada. 14University of Toronto, Department of Physics, 60 St. George Street, Toronto, Ontario M5S 1A7, Canada. 
15University of Bremen, Institute of Environmental Physics, Otto-Hahn-Allee 1, 28334 Bremen, Germany. Correspondence: Read (william.g.read@jpl.nasa.gov)

The satellite data sets are described in more detail in a companion paper by Walker and Stiller (2021). The data sets are quality screened following the recommendations of each data set provider. They were then read and repackaged in a common format that contains the following fields: year, UT time, longitude, latitude, day/night or sunrise/sunset flag, tropopause height, height, pressure, H2O concentration, and H2O concentration uncertainty. Figure 1 lists the data sets used here and the color and symbol coding used when multiple data sets are shown in a plot.
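As an illustration only, the common format could be represented by a record such as the one sketched below. The class name, field names, and units (hours, km, hPa, ppmv) are assumptions made for this sketch; the text above lists the fields but does not specify how they are stored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class H2ORecord:
    """One measurement in a hypothetical common-format table (names and units assumed)."""
    year: int
    ut_time_hours: float          # UT time; hours is an assumed unit
    longitude_deg: float
    latitude_deg: float
    flag: str                     # "day", "night", "sunrise", or "sunset"
    tropopause_height_km: float
    height_km: float
    pressure_hpa: float
    h2o_ppmv: float
    h2o_uncertainty_ppmv: float

    def is_below_tropopause(self) -> bool:
        # True when the measurement height lies below the tropopause height.
        return self.height_km < self.tropopause_height_km


# Made-up example values, not taken from any data set.
rec = H2ORecord(2007, 13.75, -105.2, 40.0, "day", 11.5, 10.0, 261.0, 85.0, 12.0)
print(rec.is_below_tropopause())
```

Carrying the tropopause height in each record makes it straightforward to restrict a comparison to tropospheric data.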

Comparison Methods
Tropospheric H2O is highly variable both temporally and spatially, which makes accuracy assessments difficult. Three comparison methods are used here to assess the data: coincident comparisons, time series, and gridded maps. For coincident comparisons we compare measurement pairs that are within 2.5° in longitude and latitude and 3 hours in time. The spatial coincidence is roughly the weighting function width for a limb sounder, so there is no benefit to using a tighter criterion. The temporal matching criterion is rather arbitrary but is well under a diurnal time difference (12 hours). In principle, coincident pair matches are the best method of comparing two data sets, but they have two limitations. The first is obtaining enough coincidences for robust statistics; the second is obtaining good global coverage of the matches. For example, comparing a limb viewer in a sun-synchronous orbit with an occultation instrument means the only available coincidences will occur at specific latitudes where the local viewing time is within ±3 hours of sunrise or sunset.
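The coincidence criterion above can be sketched as a simple pair search. This is a minimal illustration, not the matching code used in the assessment: the `Obs` class, the function names, and the convention that time is given in hours since a common epoch are assumptions of the sketch.

```python
from dataclasses import dataclass


@dataclass
class Obs:
    time_hours: float   # hours since an arbitrary common epoch (assumed convention)
    lon: float          # degrees east
    lat: float          # degrees north
    h2o_ppmv: float


def lon_diff(a: float, b: float) -> float:
    """Smallest absolute longitude difference in degrees, handling the date line."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)


def find_coincidences(xs, ys, max_deg=2.5, max_hours=3.0):
    """Return all (x, y) pairs within max_deg in longitude and latitude and max_hours in time."""
    pairs = []
    for x in xs:
        for y in ys:
            if (abs(x.time_hours - y.time_hours) <= max_hours
                    and abs(x.lat - y.lat) <= max_deg
                    and lon_diff(x.lon, y.lon) <= max_deg):
                pairs.append((x, y))
    return pairs


# Made-up example: one sonde and three satellite profiles.
sonde = [Obs(12.0, -105.2, 40.0, 80.0)]
satellite = [
    Obs(13.5, -104.0, 41.0, 70.0),   # close in space and time
    Obs(20.0, -105.0, 40.0, 75.0),   # 8 hours later
    Obs(12.5, 179.0, 40.0, 60.0),    # far away in longitude
]
pairs = find_coincidences(sonde, satellite)
print(len(pairs))
```

The brute-force double loop is adequate for sonde comparisons with a few hundred launches; for satellite-to-satellite matching a spatial index or time-sorted search would likely be needed.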
Time series comparisons are useful to see how well the instruments track temporal changes and interannual variability. These comparisons are also useful for detecting possible drifts in the H2O retrievals. This is why we only used BFH sites that have frequent launches (monthly or more frequent) over several years.
The third comparison methodology compares gridded data maps. Many scientific studies are interested in global distributions of H2O during the year and in how these change interannually. One advantage of this type of comparison is that small-scale variability over time and space is averaged out. A disadvantage is that there can be significant sampling biases. For example, limb viewing infrared instruments are heavily cloud contaminated in the tropics and will show a significant dry bias compared to maps made from nadir viewing or submillimeter instruments (Millán et al., 2018). This study will show that the sampling bias is more than a factor of two in the upper troposphere. There will also be temporal biases because sun-synchronous orbiters sample only two local times, missing much of the diurnal cycle (Eriksson et al., 2010).
not well mixed and therefore there is typically a large humidity variation within the measured volume that will not be captured in an in-situ measurement. Using MOZAIC (Measurement of OZone and water vapor by Airbus In-service airCraft, Marenco et al., 1998) data during the UARS MLS upper tropospheric validation, it was noted that over 100 km of level flight, MOZAIC humidity typically showed 20-30% variability (Read et al., 2001). An example of what a "coincident" comparison between an in-situ and a volume-averaged satellite measurement might look like is shown in Figure 2. For this figure we generate a random sequence of 200 values having a mean of 100 and a 1-standard-deviation variability of 25 (25%), shown as black asterisks (*). These values are sorted and plotted.
Note that the curve is nonlinear, with low and high values curving away from the mean value of 100. A BFH can measure any one of these values. Although it is more likely that the in-situ hygrometer will measure a subvolume that is close to the mean value, it may measure a value in another region of the volume that departs significantly from the mean. A remote measurement will always sample a large volume and measure an average value; however, the average derived depends on the instrument measurement response to the H2O concentration. This is shown as the gray plus (+) symbols in the figure.
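The synthetic experiment behind Figure 2 can be imitated with a short sketch, which also evaluates the three ways of reporting percent differences that are discussed in the next paragraph. Several choices here are assumptions and not taken from the text: the in-situ values are treated as dataset x, the satellite is idealized as returning the arithmetic mean of the volume for every sample, the concentration statistics are converted to percent by dividing by the mean of x, and draws are floored at a small positive value so that ratios stay defined. The numbers produced will therefore differ from those quoted for Figure 2.

```python
import random
import statistics as st


def percent_stats(x, y):
    """Return {method: (mean %, std %)} of y - x computed three ways."""
    rel_x = [100.0 * (yi - xi) / xi for xi, yi in zip(x, y)]
    rel_mean = [100.0 * (yi - xi) / (0.5 * (xi + yi)) for xi, yi in zip(x, y)]
    diff = [yi - xi for xi, yi in zip(x, y)]
    ref = st.mean(x)  # assumed normalization for the concentration-based method
    return {
        "rel_x": (st.mean(rel_x), st.stdev(rel_x)),
        "rel_mean": (st.mean(rel_mean), st.stdev(rel_mean)),
        "conc": (100.0 * st.mean(diff) / ref, 100.0 * st.stdev(diff) / ref),
    }


rng = random.Random(0)
# 200 in-situ values: mean 100, standard deviation 25, floored at 1 to stay positive.
in_situ = sorted(max(rng.gauss(100.0, 25.0), 1.0) for _ in range(200))
# Idealized satellite: always reports the volume average (an assumption).
volume_mean = st.mean(in_situ)
satellite = [volume_mean] * len(in_situ)

stats = percent_stats(in_situ, satellite)
for name, (bias, spread) in stats.items():
    print(f"{name:8s} mean = {bias:6.2f}%  std = {spread:6.2f}%")
```

For positive values, the per-pair percent difference relative to x is never smaller than the one relative to the pair mean, and both are convex in x, so both methods report a positive mean bias even though the two data sets agree on average by construction. Only the concentration-based statistic returns a zero mean.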
Upper tropospheric water vapor has a large dynamic range, from 2 ppmv to 1000 ppmv. Therefore it makes most sense to assess the degree of agreement in terms of percent of humidity. Reporting the results of a comparison between dataset x and dataset y can be done in three ways. First, one can compute percent differences relative to dataset x and calculate the mean and standard deviation of the comparison. Second, one can compute the percent difference relative to the mean of the x and y datasets. Third, one can compute the mean and standard deviation in concentration and convert the result into percent. In our example in Figure 2, computing the statistics in percent relative to the x dataset gives a mean bias of 9.4% and a standard deviation of 38.8%. If the comparison is made relative to the mean of the x and y datasets, the mean bias and standard deviation are reduced to 3.9% and 28.9%, respectively. When the analysis is done in concentration units and converted to percent, it produces a mean difference of 0% and a standard deviation of 27%. Ideally the expected result should be 0% for the mean and 25% for the standard deviation; therefore, the comparisons are analyzed in concentration and converted to percent afterwards.
Figure 3 shows a coincident humidity comparison between AIRS and the Earth System Research Laboratory (ESRL) BFH that is routinely launched from Boulder on a monthly basis. The comparison pressure is 261 hPa. Next to it is a comparison between MLS and the BFH at 100 hPa. Averaging kernels were not applied to either data set. The BFH data have been sorted by value and show a shape similar to that in Figure 2. Likewise, both AIRS and MLS show a generally flatter response. In addition to the spatial averaging variation, there is atmospheric variability, which accounts for why the slope for the satellite measurement is not zero as it is in the demonstration plot (Figure 2).
Additionally, satellite spatial averaging and the retrieval itself over the sampled volume will exhibit some nonlinearities and non-Gaussian behavior. Therefore, while a comparison like that in Figure 2 may not look good, the agreement may actually be as good as one can expect because of the very different characteristics of the measurements themselves. It is also important to recognize that the dynamic range of measurements at 261 hPa is 10-400 ppmv, whereas at the stratospheric 100 hPa level it runs from 2.5 to 7.5 ppmv. The smaller dynamic range for the stratospheric measurements suggests that H2O is more tightly regulated by large-scale atmospheric processes (e.g. tropical tropopause temperature, transport, and chemistry). This is shown in the MLS-Aura comparison, which generally tracks the BFH values even through to the highest values. MLS-Aura does, however, appear to overestimate the extreme lower values. BFH versus MLS-Aura and the MIPAS suite comparisons at 261 hPa look similar to that shown for AIRS, and at 100 hPa the MIPAS suite comparisons look similar to that shown for MLS-Aura.
data sets with and without application of the averaging kernel. The averaging kernel is applied (Livesey et al., 2018) to the coincident sonde profiles for the MIPAS and MLS retrievals. Although applying the averaging kernel to a highly vertically resolved measurement when comparing to a remote sensing measurement is the proper method for comparison, in practice there are some limitations. These include neglect of nonlinear forward model effects (which cause the averaging kernel function to depend on profile shape and amount) and truncation effects when the balloon does not achieve high altitude. For some of the retrievals, applying the averaging kernel makes the agreement worse, in particular for the MIPAS-Oxford retrieval. For MLS-Aura, which has the largest number of coincidences, applying the averaging kernel makes very little difference.
The same is also true of the MIPAS-IMK retrieval. Offline simulation studies done on MLS support the above result. The averaging kernel is important for the 121 hPa and lower pressure levels but was not important for the higher pressure levels (Read et al., 2008). Averaging kernels are not available for all the data sets used here, for example AIRS. Therefore there is no advantage to be gained from using the averaging kernel, and for consistency in the handling of the data sets it is not used in the following analysis.
The comparisons in Figure 5 show mean coincident agreement within several tens of percent of the BFH, with a scatter about the mean value of 20-60% for most instruments. AIRS shows the best agreement overall. MIPAS-IMK typically shows 20% agreement but with a positive bias. The other MIPAS retrievals (Bologna, ESA, and Oxford) are mostly drier. MLS-Aura is also drier and consistently shows a significant dry bias at the level that is 2-3 km below the tropopause. For the mid to high latitudes this is near 215 hPa, and for the tropical latitudes it is at 147 hPa. Curiously, the MIPAS-Bologna retrieval also exhibits this behavior for the mid-latitude comparisons, but it is not possible to determine whether this is linked to the tropopause height because there were no suitable comparisons in the tropics. HIRDLS shows moist biases for the mid-latitude sites but a small dry bias at the Heredia (tropical) site.

Time Series Comparisons
Another way to look at the humidity data is through a time series. This type of comparison shows how well each satellite data set captures seasonal cycles and interannual variability. Comparisons are shown in two formats. Figure 6 shows an overlay of BFH sonde measurements at Boulder with a smoothed reconstruction of satellite measurements in the vicinity of Boulder (±2.5° longitude and latitude). Temporal coincidence with the actual Boulder sonde launches is not imposed. As is shown in the figure, most of the data sets capture similar annual cycles with varying degrees of fidelity relative to the sonde. Interannual variability is similar among the majority of the data sets and the sonde. For example, 2007 shows higher values and a stronger seasonal amplitude than the two succeeding years. Figure 7 shows the time series over Hilo (Hawaii), a tropical site. As with Boulder, most of the satellite retrievals capture the seasonal cycles seen in the BFH sonde data. One exception is MLS-Aura at 147 hPa, which shows a much weaker amplitude than the BFH sonde and is also drier (MLS-Aura is not unique in this respect, though). This feature was noted in a comparison report by Hegglin et al. (2013). A possible explanation could be related to the tropopause height dependence of the dry bias seen in the MLS-Aura sonde comparisons. Over the tropical sites, the tropopause rises and falls by ∼1.5 km over the year, and thus the MLS-Aura bias also rises and falls with it, causing a potential flattening of the annual cycle. Notice in Figure 5 that the bias gradient with the tropopause height is rather steep. The mid-latitude locations, where the tropopause is near 147 hPa, show a 50-60% dry bias for MLS-Aura at 215 hPa and a much smaller 10% dry bias at 147 hPa. For the tropical sites, where the tropopause is near 100 hPa, the dry bias at 215 hPa drops to 10-15% but increases to 40% at 147 hPa.
Therefore a seasonally modulating tropopause height would be expected to modulate the MLS dry bias significantly for the level that is 2-3 km below the tropopause, in this case the 147 hPa level, and also for the levels above and below but to a lesser extent. Subsequent investigation of this bias suggests that it is caused by a pointing difference error between the radiometer that measures water vapor and the radiometer that measures O2 for pointing. This bias will be corrected in version 5, which is under development. Early version 5 testing is showing that the current pointing error in v4 is flattening the 147 hPa annual cycle (in contrast to accentuating it). The BFH data show the individual measurements in addition to the smoothed curve. Only smoothed curves are shown for the satellite data sets to avoid excessive clutter.
The BFH shows a weak annual cycle at 147 hPa and stronger ones at lower altitudes. This behavior is captured by most of the data sets. Figure 9 shows a smoothed time series comparison over Lauder, New Zealand, a southern hemisphere mid-latitude site.
The BFH shows an irregular seasonal cycle that in most years is weak at 147 hPa except at the beginning of 2007. Most satellite measurements show larger seasonal cycles with a more regular phasing. The phasing does differ among the satellites.
Exploring the question of seasonal amplitudes and phase further, the time series data are fitted to a periodic function that yields a mean value, an annual cycle amplitude, and a phase. Interannual variability is ignored in this fit; thus the result should be viewed as a "climatology". The result for Boulder, USA is shown in Figure 10. The data sets capture the annual cycle with the correct phase. Figure 11 shows a comparison of the fitted function to Hilo data. It is noteworthy that MLS-Aura greatly underestimates the seasonal cycle at 147 hPa relative to the other data sets and BFH sondes. This feature is present regardless of whether averaging kernels are applied, and a likely cause has been identified.
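A fit of this kind can be sketched as an ordinary least-squares problem. The exact functional form used in the assessment is not given in the text, so the single annual harmonic below, mean + amp·cos(2π(t − phase)) with t in years, is an assumption, as are the function names. The sketch uses only the standard library and solves the 3×3 normal equations directly.

```python
import math


def solve3(a, b):
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting."""
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    n = 3
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def fit_annual_cycle(t_years, values):
    """Least-squares fit of mean + amp*cos(2*pi*(t - phase)); returns (mean, amp, phase).

    phase is the fraction of the year (0 to 1) at which the fitted cycle peaks.
    """
    w = 2.0 * math.pi
    basis = [(1.0, math.cos(w * t), math.sin(w * t)) for t in t_years]
    ata = [[sum(b[i] * b[j] for b in basis) for j in range(3)] for i in range(3)]
    atb = [sum(b[i] * v for b, v in zip(basis, values)) for i in range(3)]
    mean, c, s = solve3(ata, atb)
    return mean, math.hypot(c, s), (math.atan2(s, c) / w) % 1.0


# Synthetic, noise-free check: four years of irregular, roughly monthly samples.
t = [i / 12.0 + 0.03 * (i % 3) for i in range(48)]
v = [50.0 + 20.0 * math.cos(2.0 * math.pi * (ti - 0.55)) for ti in t]
mean, amp, phase = fit_annual_cycle(t, v)
print(round(mean, 3), round(amp, 3), round(phase, 3))
```

Because all years are fitted with one set of coefficients, interannual variability is averaged into the mean and amplitude, which is why the result is best read as a climatology.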
Figure 12 summarizes the results from fitting a periodic function to the coincident data for the 6 sonde sites. Ideally, the left panels in Figure 12 should be similar to the center panel in Figure 5. The difference between these is that Figure
Figure 13 shows mean biases between the BFH and the instrument data sets derived in two ways: from mean differences of spatial/temporal coincidences, and from the mean value obtained by fitting a periodic function to all data that are only spatially coincident. AIRS and MLS-Aura, the data sets with the best statistics for the coincident comparisons, show the best agreement between a mean derived from a time series fit and a mean from a coincident comparison. Even for these data sets, the difference between the two methods is as large as 20%. This lack of consistency between the fitted values and the direct coincidences prompts us not to use derived biases as a proxy for direct coincidences when summarizing results later in this paper.
Figure 10. For each data set, a periodic one-year function is fitted to the time series data near Boulder, Co. and plotted for one year. Interannual variability is averaged over for each data set. Therefore this figure is a climatology for that data set.
Figure 11. Same as Figure 8, but for Hilo, Hawaii.

Satellite to Satellite Coincident Comparisons
Coincidences between satellite-based data sets are discussed here. Figure 16 shows a coincident match scatter plot comparison between MLS-Aura and ACE-FTS as a probability density function (PDF). Only data below the MERRA-5.2 tropopause height are considered. A relative density amount for each contour is shown in the color bar. Because the number of coincidences decreases with altitude and only data below the tropopause are compared, the scale is relative. The number in all bins is divided by
Figure 17 shows a comparison between MLS-Aura and MIPAS-IMK. Note that there are essentially no coincidences in the tropics due to the 3 hour coincidence criterion (the equator crossing local time for Aura is 13:45 versus 10:00 for ENVISAT). As with ACE-FTS, MLS-Aura tends to be drier. Figure 18 shows a scatter comparison between MIPAS-ESA and MIPAS-IMK.

Since these are different retrievals from the same instrument, all measurements are coincident and all latitudes are covered. This shows that MIPAS-IMK is more humid than the ESA product. One thing that is noteworthy in all these plots is the stretched S

Gridded Map Comparisons
Gridded map comparison is another method by which climatologies can be compared. It has the advantage of not requiring coincidences, and the variability between individual measurements should average down. Its weakness is that sampling biases can significantly affect the comparison. Figure 19 shows 3
Figure 21 shows humidity maps at 175 hPa from 9 other data sets. HIRDLS, MLS-Aura, MIPAS, and SCIAMACHY are limb viewing instruments; SMR (in UTH retrieval mode) and TES are downward viewing instruments. Because these data sets sample the Earth less frequently than either AIRS or MLS, the grid box size is 10° longitude by 6° latitude. The maps fall into two distinct groups: MLS-Aura, SMR, and TES show very moist tropics with moist features coincident with frequent convective activity over the tropical continents and the maritime continent, whereas HIRDLS, the MIPAS suite, and SCIAMACHY show a more featureless and less moist tropics. These differences are all attributable to cloud impacts. HIRDLS, MIPAS, and SCIAMACHY measure infrared and ultraviolet radiation in the limb and are very often cloud contaminated. Their tropical sampling is poor and only the driest, cloud-free scenes can be processed. The limb geometry is especially problematic because of the long absorption path length in the atmosphere. The result is that the deep tropics are not well sampled by these instruments. The large missing data region over the southern Atlantic Ocean and South America in the SCIAMACHY map is caused by the South Atlantic Anomaly, where no retrievals are made for this instrument. The microwave instruments (MLS-Aura and SMR) and the nadir-looking infrared TES instrument can better deal with cloudy scenes and therefore show more moisture in the tropics and well-defined convective features. These features must be kept in mind when making climatological maps from satellite data. Climatological maps for other heights are presented in the supplement.
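The gridding step can be illustrated with a minimal box-averaging sketch using the 10° by 6° box size quoted above. The function name, the input layout of (longitude, latitude, value) tuples, and the box indexing that starts at 180°W and 90°S are assumptions of this sketch and not the processing actually used for the maps.

```python
import math
from collections import defaultdict


def grid_mean(obs, dlon=10.0, dlat=6.0):
    """Average (lon, lat, value) samples into dlon x dlat boxes.

    Returns {(i_lon, i_lat): (mean, count)} with indices counted from 180W and 90S.
    """
    sums = defaultdict(lambda: [0.0, 0])
    nlat = int(round(180.0 / dlat))
    for lon, lat, value in obs:
        i = int(math.floor(((lon + 180.0) % 360.0) / dlon))
        # Clamp so that a sample exactly at 90N falls in the last latitude row.
        j = min(int(math.floor((lat + 90.0) / dlat)), nlat - 1)
        sums[(i, j)][0] += value
        sums[(i, j)][1] += 1
    return {key: (total / n, n) for key, (total, n) in sums.items()}


# Made-up samples: two fall in the same box, one in the tropics.
samples = [(-105.2, 40.0, 80.0), (-101.0, 41.5, 60.0), (10.0, 0.0, 200.0)]
grid = grid_mean(samples)
for box, (mean, count) in sorted(grid.items()):
    print(box, mean, count)
```

Keeping the sample count alongside each box mean matters for the sampling issue discussed above: a box that is filled only by a few cloud-free scenes can be flagged or excluded before two maps are compared.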
Figure 22 shows a scatter plot of the mapped grid values with MLS-Aura on the x-axis and various instruments on the y-axis. The correlation is generally good between the instrument pairs. The MIPAS suite and HIRDLS are drier for moist values relative to MLS for reasons previously described. SCIAMACHY has no measurements in regions associated with active convection, probably because the UV backscatter is affected by even thinner clouds than the IR. TES and SMR are more moist than MLS-Aura for all values of humidity. Scatter plots for other heights are shown in the supplement.
Another submillimeter radiometer, SMILES, has dense enough data coverage to produce climatological maps. SMILES operated for 6 months on the International Space Station (ISS). The instrument was not specifically designed to measure H2O, but its radiances are affected by it, which provides an opportunity for its measurement. Three independent humidity retrievals are available for SMILES, using three different approaches. The NICT product retrieves H2O from the line wing shape in its A and B radiometers. The JPL product fits the radiance growth curve in the window regions of each of its available radiometer bands (A, B, or C), relying on knowledge of the H2O continuum function. The Chalmers product retrieves from the opaque down-looking radiance, similar to its upper tropospheric humidity product on SMR. Table 1 gives the altitude ranges of these retrievals. The supplement has maps showing these comparisons and scatter plots. Using MLS-Aura as a comparison standard, all these retrievals show significant biases; qualitatively, however, they do show the same patterns, but over limited altitude ranges. In summary, the Chalmers retrieval produces good qualitative results from 280-200 hPa, the JPL retrieval does so from 200-125 hPa, and the NICT retrieval from 175-125 hPa. The NICT A band retrieval is much drier than the B band retrieval. The JPL retrievals suffer from high value artifacts at ∼45° S that are not detected in quality screening. As mentioned previously, all these retrievals show significant (> factor of 2), usually moist, biases relative to MLS-Aura.
Climatologies for all of 2005 for several occultation instruments are compared to MLS-Aura. The sampling of the occultation instruments is much sparser than that of a passive thermal emission instrument. Moreover, many of these occultation instruments are placed in orbits that emphasize coverage at high latitudes in the interest of studying polar ozone chemistry.
Coincidences are available for a subset of the BFH sites for some instruments. Thus those instruments will have less geographical sampling in their assessment. The MIPAS retrieval suite, for example, has no coincidences with tropically located sondes.
The sondes and instruments with suitable coincidences are summarized in Figure 5.
For those instruments for which there are direct sonde comparisons available, a mean bias can be established. For example, for MLS-Aura it is -25% for p>200 hPa and -31% for p<200 hPa. When another instrument is compared to MLS-Aura in a satellite coincident or a gridded map comparison, the MLS-sonde bias is added to those comparison results in order to "correct"
the BFH represents the best accuracy standard for measuring humidity in the upper troposphere and lower stratosphere. The BFH itself is considered to be accurate to 10%. Figure 23 is the upper tropospheric equivalent to Figure 1 in the first assessment report (Kley et al., 2000), which summarized the stratospheric humidity sensors of the pre-2000 era. The spread in the variability bar shown arises from many factors, such as location and concentration dependencies, sampling differences, possible dependencies of the averaging kernel smoothing effect on profile shape, and many possible systematic error contributions, such as errors in atmospheric temperature and interfering species, whose errors may not be uniform under all conditions in which these comparisons are made. Although an attempt has been made to reference these biases relative to the BFH, there are some inconsistencies. The mapped comparisons between MLS-Aura and AIRS typically show agreement within ±20% for the H2O bins, pressures, and seasons considered. However, when the MLS-Aura dry bias relative to the sonde is added, it suggests that a climatological map produced by AIRS should have an overall dry bias of 20%. In contrast, the same gridded map comparison for MLS-Aura, which is based on the same comparison with AIRS except that AIRS is the reference measurement, shows only a slight (<10%) dry bias. This is because the AIRS to BFH adjustment is -2% and -6% for the higher and lower pressure ranges in Figure 23.
The differences relative to the BFH reference arise because MLS shows a strong bias dependence on the height of the tropopause. In short, for the pressure levels considered here the MLS bias runs from near 0% up to 60%, with the largest values occurring when the tropopause is 2-3 km above the compared pressure level. The adjustment is roughly an average of these conditions. AIRS does not show this behavior; its bias adjustment is therefore not tropopause height dependent and is more robustly applicable. What is not included in Figure 23 is an additional scatter resulting from the variability of paired differences for the satellite coincident comparisons and from the variability of paired grid box value differences for the gridded maps. These are typically ∼30% for these comparisons, and therefore an additional ∼30% variability would have to be added to that shown in Figure 23 when considering a single matched pair comparison.
Upper tropospheric H2O is a highly variable field in space and time. In the atmosphere, H2O can vary by a couple of orders of magnitude. Figure 23 shows that for most of the instruments, the comparisons among themselves and with BFH sondes indicate mean agreement within ∼30% but with large spreads, suggesting something like a factor of two agreement. In contrast to stratospheric comparisons, where H2O is well mixed and it is possible to quantify biases to within a few percent, such a precise assessment cannot be realized for the upper troposphere. The problem is that the measurements sample atmospheric volumes differently where concentration gradients are large. The measurement systems have nonlinear responses to changes in water vapor amounts. The retrievals also require temperature in their inversion, which may also have large vertical gradients.

In short, it is probable that a comparison between two satellites or with balloon sondes (discussed earlier) will show different degrees of agreement for a large ensemble of coincident data, making it impossible to establish a single bias number by height and latitude.
In closing, some features specific to certain instruments will be discussed. It is clear from the gridded map comparisons that high clouds in the tropical upper troposphere have a significant impact on infrared and ultraviolet limb viewers (see Table 1).

While the limb geometry allows low concentrations of H2O to be measured and does not require a thermal lapse, the long horizontal path length makes cloud encounters much more likely. The MIPAS retrieval suite and SCIAMACHY demonstrated good agreement with mid and high latitude sondes; however, their clear sky sampling limitation causes a severe undersampling of the tropics, leading to a dry bias. This limitation was so severe that for Figure 23, moist value bins were not included in the assessment summary. This limitation needs to be kept in mind for science investigations. The microwave limb viewers MLS-Aura/UARS, SMR, and SMILES are more immune to clouds because the longer measurement wavelength is less subject to cloud emission and scattering. Nadir sounding geometries work better than limb geometries in cloudy scenes because the imaged scene is small compared to the horizontal distance covered by a limb viewer, so they can observe scenes in close proximity to clouds without being contaminated by them. In addition, AIRS, by using highly spatially resolved pixels, can apply a cloud clearing scheme to derive a cloud-free signal. Therefore AIRS and TES, although they are infrared instruments, can better observe in cloudy regions and avoid the severe sampling bias. As mentioned before, these instruments have a relatively short path length in the atmosphere and require a thermal lapse to measure humidity. Therefore they are unable to make measurements from ∼3 km below the tropopause upward, or where H2O concentrations are <10-20 ppmv.
Among the limb viewers, MLS-Aura has the highest daily sampling, has one of the longest running operations (it is still in operation), and is the least affected by clouds. Therefore it was probably one of the better instruments to use as a reference for comparing the others, which was often done in this study. Having said that, the one significant feature is that MLS-Aura shows a significant dry bias (50%) in any level that is ∼2.5 km below the tropopause. This bias reduces to <20% above and below this critical level. The bias behavior is not caused by retrieval smoothing that could be corrected by including the averaging kernels. The cause is most likely a pointing error in the retrieval system. The pointing error comes from a combination of two sources, one being a field-of-view alignment measurement and the other a sideband measurement. The needed adjustment for the field-of-view alignment is within its pre-launch uncertainty, but the sideband adjustment is ∼15 times larger than its pre-launch uncertainty, which confounded its discovery. Version 5, currently in production, corrects for these deficiencies, as will be shown in a future publication.
The occultation sounders can provide accurate profile measurements, ACE-FTS being the best amongst them in terms of sampling a wide range of concentration values, a long operational period (still in operation), and producing accurate measurements. However, the high temporal and spatial variability of H2O in the upper troposphere, along with the sparse sampling from occultation instruments, limits the usefulness of these measurements mostly to validation studies.
Instruments such as SMILES and HIRDLS had short operational lifetimes (6 months and 3 years, respectively). The science that can be done with these measurements would be limited to features unique to that instrument. For example, HIRDLS has the best vertical resolution (1 km) among the satellite suite (typically 3 km). SMILES was mounted on the ISS and thus its measurements sample the full diurnal cycle. This was exploited in a cloud study (Jiang et al., 2015). The water vapor products from SMILES are research products that the instrument was not specifically designed to measure. Although qualitatively the mapped fields are mostly reasonable, biases are large and artifacts are present (see the supplement for more detail).
The last observation derived from this study concerns the quality of the Vaisala-RS92 radiosonde hygrometer in the uppermost troposphere. It is well known that the response time of the humicap sensor in the Vaisala-RS92 slows as the air becomes more desiccated (Miloshevich et al., 2009). This leads to erroneous measurements. Time lag correction algorithms have been applied to some of these sondes, and only corrected sondes have been used here. This is in contrast to the sondes used by the radiosonde network, which uses an algorithm provided by Vaisala that does not include the time lag correction. The motivation for including the Vaisala-RS92 profiles was to greatly expand the number of sonde profiles available so that more satellite data sets could be compared. Unfortunately, in the uppermost troposphere the Vaisala-RS92 shows inconsistent results and is therefore best not used for pressures less than 200 hPa. The agreement is much better for pressures between 300 and 200 hPa but shows a dry bias of 20%. The expanded Vaisala-RS92 data set does allow an assessment to be made for ACE-FTS, SMILES, and SMR.
After correcting for the 20% dry bias, and only considering pressure levels > 200 hPa, the mean agreement is 8% for ACE-FTS, -10% for SMILES-JPL, and 110% for SMR. The variability of the differences between the Vaisala-RS92 and the satellite instruments is quite large (∼100%).

In conclusion, with exceptions noted in the text, most of the satellite instruments do a realistic job of tracking upper tropospheric humidity changes. Precise quantitative assessment is much more difficult because the nature of these measurements, coupled with the sharp vertical and horizontal gradients in H2O, leads to large variability of the coincident pair differences between the data sets. Even the MIPAS suite, in which four retrieval products use the same radiance signal, sample exactly the same volume, and have perfect spatial and temporal coincidence, shows surprisingly large biases and variability, underscoring significant sensitivities to the forward models. Science investigations using these data need to take these features into consideration and be aware of the sampling and measurement durations of these data sets (Table 1).
Having said this, and ignoring some notable anomalies (e.g. MLS-Aura large dry bias in a 2-3 km layer below the tropopause