Probabilistic retrieval of volcanic SO2 layer height and cumulative mass loading using the Cross-track Infrared Sounder (CrIS)

During most volcanic eruptions and many periods of volcanic unrest, detectable quantities of sulfur dioxide (SO2) are injected into the atmosphere at a wide range of altitudes, from ground level to the lower stratosphere. Because the fine ash fraction of a volcanic plume is, at times, collocated with SO2 emissions, global tracking of volcanic SO2 is useful in tracking the hazard long after ash detection becomes dominated by noise. Typically, retrievals of SO2 loading have relied heavily on hyperspectral ultraviolet measurements. More recently, infrared sounders have provided additional loading measurements and estimates of the SO2 layer altitude, adding significant value to real-time monitoring of volcanic emissions as well as climatological analyses. These methods leverage the relative simplicity of infrared radiative transfer calculations, providing fast and accurate physics-based retrievals of loading and altitude. In this study, we detail a probabilistic enhancement of an infrared SO2 retrieval method, based on a modified trace-gas retrieval, to estimate SO2 loading and altitude probabilistically using the Cross-track Infrared Sounder (CrIS) on the Joint Polar Satellite System (JPSS) series of satellites. The methodology requires the characterization of real SO2-free spectra aggregated seasonally and spatially. The probabilistic approach replaces point estimates of loading and altitude with non-parametric probability density functions, fully quantifying the retrieval uncertainty. This framework adds significant value over basic loading and altitude retrieval because it can be readily incorporated into Monte Carlo forecasting of volcanic emission transport. We highlight results, including successes and challenges, from analysis of several recent significant eruptions: the 22 June 2019 eruption of Raikoke volcano, Kuril Islands; the mid-December 2016 eruption of Bogoslof volcano; and the 26 June 2018 eruption of Sierra Negra volcano, Galapagos Islands.
This retrieval method is currently being implemented in the VOLcanic Cloud Analysis Toolkit (VOLCAT), where it will be used to generate additional cloud object properties for real-time detection, characterization, and tracking of volcanic clouds in support of aviation safety.


1 Introduction

During most volcanic eruptions and many periods of volcanic unrest, detectable quantities of sulfur dioxide (SO2) are injected into the atmosphere at a wide range of altitudes, from ground level to the lower stratosphere. Often early in eruptions the fine ash fraction of a volcanic plume is collocated with SO2 emissions, and so ash tracking can be performed by proxy; however, later, ash and SO2 tend to evolve along different trajectories due to subtle differences in altitude and removal processes (Karagulian et al., …).

[…] This retrieval method is currently being implemented in the VOLcanic Cloud Analysis Toolkit (VOLCAT, https://volcano.ssec.wisc.edu/; Pavolonis et al., 2013, 2015a, b, 2018), where it will be used to generate additional cloud object properties for real-time detection, characterization, and tracking of volcanic clouds in support of aviation safety.

2 Probabilistic SO2 layer retrieval theory

2.1 Classical methods for height retrieval

As a preliminary, we discuss several methods which we describe here as "classical". In fact, these methods are relatively recent; however, they do not make full use of the probability spaces which we will exploit here. Previous analyses of the height and distribution of volcanic SO2 plumes using data from the Infrared Atmospheric Sounding Interferometer (IASI) instrument by Carboni et al. (2012), Clarisse et al. (2014), and Carboni et al. (2016) utilized the ability to linearize a forward radiative transfer model around a climatological mean state for the concentration of trace SO2, a method originally outlined by Walker et al. (2011). Employing the notation of Rodgers (2000), the theory of this retrieval relates a set of parameters governing the concentration of a trace gas (SO2 in this case), comprising the true state x, to a set of measurements y (typically brightness temperature spectra) by a forward radiative transfer model F.
As in Pavolonis (2010), all radiative transfer model simulations used in this study were performed using the LBLDIS tool (Turner, 2005), which utilizes the Line-by-Line Radiative Transfer Model (LBLRTM; Clough and Iacono, 1995) to compute gaseous absorption and the Discrete Ordinate Radiative Transfer (DISORT) model to complete the radiative transfer calculation (including multiple scattering).

The above trace-gas methods for infrared sensors rely on the ability to write the data and the true state each as a sum of the climatological average (ȳ, x̄) and an anomaly (Δy, Δx), as y = ȳ + Δy and x = x̄ + Δx, respectively. Linearizing around the climatological average gives

y = F(x̄; u) + K Δx + ε_tot,

where u is a collection of best estimates of all auxiliary parameters necessary to the forward model, including the state of the atmosphere, the surface, and the instrument; K is the Jacobian matrix of F linearized about x̄; and ε_tot is the total error associated with the measurement, linearization, and other surface and atmospheric properties (including clouds) that influence the measured radiances. Because it is inferred that ȳ = F(x̄; u), the equation for the error can be written simply as

ε_tot = Δy − K Δx.

From this formula, the retrieval of Δx can proceed either by maximum likelihood estimation, iterative methods such as Levenberg-Marquardt and gradient descent algorithms, or other methods.
The analysis of Carboni et al. (2012) imposed an a priori Gaussian vertical distribution over pressure coordinates for the SO2 concentration, retrieving only the total column SO2, the mean pressure, and the standard deviation (spread). By contrast, Clarisse et al. (2014) developed a system in which the SO2 is assumed to exist in a narrow layer and an iterative retrieval is performed for the total column concentration. For a very narrow layer, the concentration profile of anomalous SO2 (the true state) can be represented with a Dirac delta function of height,

Δx(h′) = x δ(h′ − h₀),

where the gas of mass loading x is concentrated at the altitude h₀. In reality, the layer has some thickness and the concentration is non-uniform. However, this abstraction is made for the benefit of the complex probability calculations to follow. Using this form of the concentration profile, the model spectrum is then a function of the mass loading and the layer height. Because the model Jacobian for the trace-gas retrieval is calculated at the background state, the Jacobian is the limit

K(h) = lim_{ε→0} [F(ε δ(· − h); u) − F(0; u)] / ε.

For SO2, the climatological background state (x̄) is taken to be zero. In practice, calculating the limit is computationally prohibitive, so the limit is dropped and the perturbation (ε) is taken as 5 DU. Below we refer to the height-dependent Jacobian calculated at the zero background simply as K(h). Additionally, radiative transfer models do not have an implementation of the Dirac delta, and consequently, a layer of finite thickness (1 km) is used as an approximation to the Dirac delta.
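The finite-difference approximation of the zero-background Jacobian can be sketched as follows. This is an illustrative stand-in, not the paper's code: `forward_model(loading_du, height_km)` is a hypothetical placeholder for an LBLDIS run with a 1 km thick layer at the given height, and the formal limit ε → 0 is replaced by the fixed 5 DU perturbation described above.

```python
import numpy as np

def jacobian_columns(forward_model, heights, eps_du=5.0):
    """Approximate K(h) = [F(eps*delta_layer(h); u) - F(0; u)] / eps.

    forward_model(loading_du, height_km) -> simulated spectrum (1-D array).
    Returns an array of shape (n_heights, n_channels).
    """
    y0 = forward_model(0.0, heights[0])        # SO2-free background spectrum
    return np.stack([(forward_model(eps_du, h) - y0) / eps_du
                     for h in heights])
```

Any forward model with the assumed call signature can be plugged in; the 5 DU step is inherited from the text rather than chosen here.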
Following Clarisse et al. (2014), this Jacobian can be used to calculate a height-dependent statistical z-score, measuring the relative confidence in the state estimate:

z(h; y, ȳ) = [Kᵀ(h) S⁻¹ K(h)]^(−1/2) Kᵀ(h) S⁻¹ (y − ȳ),

where the error covariance matrix S is built up from spectral residuals (y − ȳ) on measurements with low or no detectable SO2 present. The criteria for this discrimination are detailed in a later section. Note that the factor Kᵀ(h) S⁻¹ K(h) is a scalar at each height h. Here, z(h; y, ȳ) is the statistical z-score (number of standard deviations from the mean) of finding the SO2 anomaly at elevation h given the data y and linearization about the mean spectrum ȳ. Using this z-score function of height, Clarisse et al. (2014) estimated the layer height (which we refer to as h_C) as that which maximizes the z-score function:

h_C = arg max_h z(h; y, ȳ).

Throughout the remainder of this work we use a functional notation to describe this height retrieval:

h_C = g(y; ȳ, S).
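As a minimal numerical sketch of this classical retrieval (variable names are ours, not the paper's), the z-score at every candidate height and the arg max estimate can be computed as:

```python
import numpy as np

def z_scores(K_of_h, S_inv, dy):
    """z(h; y, ybar) for every candidate height.

    K_of_h: (n_heights, n_channels) Jacobian columns K(h);
    S_inv:  inverse of the SO2-free error covariance S;
    dy:     spectral anomaly y - ybar.
    """
    a = np.einsum('hc,cd,hd->h', K_of_h, S_inv, K_of_h)   # scalar K^T S^-1 K at each h
    return np.einsum('hc,cd,d->h', K_of_h, S_inv, dy) / np.sqrt(a)

def classical_height(K_of_h, S_inv, dy, heights):
    """h_C = argmax_h z(h; y, ybar), the Clarisse et al. (2014)-style estimate."""
    z = z_scores(K_of_h, S_inv, dy)
    return heights[int(np.argmax(z))]
```

The `einsum` contractions evaluate the scalar quadratic form Kᵀ S⁻¹ K at every height in one pass, which keeps the screening step cheap enough to run on every field of view.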

Probabilistic enhancement
Although the classical methods make use of maximum likelihood estimation, and are therefore probabilistic in the sense that they retrieve the maximum likelihood estimate and a retrieval uncertainty estimate, they make an assumption of normality about the input and output variables. Although this is workable for many types of retrieval, it is unsuitable for the height retrieval in particular, as follows. Instead of calculating only the mean and covariance of the climatological background, we treat the background spectrum as a random vector Y, where the channels (each of the sampled wavenumbers) are correlated random variables, each characterized by its own marginal probability distribution. In reality, these distributions may belong to a family of parameterized distributions; however, in this study they are non-parametric, characterized only by a histogram on each channel and a covariance structure,

S = E[(Y − E[Y])(Y − E[Y])ᵀ],

where the expectation operator is approximated by averaging over all samples. In this framework, the z-score function is a conditional random variable given the layer height h,

Z(h) = z(h; y, Y),

and the height is a random variable,

H = g(y; Y, S).

The underlying assumption made by the classical method of Clarisse et al. (2014) is that Y is a multivariate normal random vector with mean ȳ = E[Y] and covariance S, meaning that Z is a standard normal random variable (Z ∼ N(0, 1)). This fact about the z-score is expected to hold in the present case with the full probabilistic characterization of Y, which may not be Gaussian, because the z-score is a sum over all of the channels in Y. The principal enhancement over the classical method comes from the height retrieval. Because the function g is highly nonlinear, due to the arg max function, in general the mean height is not equal to the height calculated by Clarisse et al. (2014), that is,

E[H] ≠ g(y; ȳ, S) = h_C.

Furthermore, h_C is not generally the maximum likelihood height either (h_C ≠ mode[H]). Consequently, without a clear understanding of what h_C is measuring in terms of statistics, its probability distribution is unknown.
This study aims to estimate the probability distribution of H and show the importance of its probability density function (PDF) in making predictions about the cloud. We enhance the method of Clarisse et al. (2014) by retrieving a probabilistic SO2 layer; that is, we retrieve the SO2 layer height as a random variable H (Fig. 1).
In practice, we retrieve this PDF in a Bayesian framework. In this framework, we represent the height PDF as the posterior distribution resulting from a prior estimate f_H^prior(h) of the distribution and a likelihood function L(h; y) based on the data (evidence):

f_H(h) ∝ f_H^prior(h) L(h; y).
The likelihood function is constructed by Monte-Carlo (MC) sampling of Y, retrieving the height due to each background spectrum given by MC random sampling according to the marginal PDFs of Y and its covariance S (Fig. 1). The process for sampling this non-Gaussian correlated random vector is detailed in Appendix A. In this study we generate 10,000 samples of Y. Each height sample is generated as

h_s = g(y; y_s, S),

that is, we construct the random variable from elements of the sample space h_s ∈ Ω_H. The likelihood function is then reconstructed by kernel density estimation (KDE) of these height samples, that is,

L(h; y) ≈ (1/N) Σ_s K_b(h − h_s),

where K_b is a kernel of bandwidth b and N is the number of samples. We impose a Gaussian prior with mean and variance given by MC sampling using the model columns that make up the Jacobian, with noise added. Specifically, we use the Jacobian corresponding to the traditional Clarisse et al. (2014) height retrieval (h_C). The appropriate zero-mean noise is generated by Y − E[Y], and the model spectral anomaly is F(δ(h − h_C); u) − F(0; u) = K(h_C). The mean and variance can therefore be expressed as

μ_prior = E[g(Y + K(h_C); ȳ, S)],  σ²_prior = Var[g(Y + K(h_C); ȳ, S)].

The resulting posterior is illustrated in Fig. 2, where the proportionality is eliminated by normalizing the posterior PDF such that the total probability is unity.
Although slower than retrieving the layer height (h_C) due to a mean spectrum alone, this distribution provides significantly more information, including the full PDF of H. This PDF may be used to calculate the maximum likelihood, expected, or median value of the retrieved height, or probabilities of finding the plume in a given altitude interval. Additionally, this PDF is essential for calculating the mass loading correctly according to probability theory, as detailed in the following section.
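The prior-times-KDE-likelihood construction above can be sketched on a discrete height grid as follows. This is a simplified illustration under our own naming, assuming the MC height samples h_s have already been produced by re-running the arg max retrieval on sampled background spectra:

```python
import numpy as np
from scipy.stats import gaussian_kde, norm

def posterior_height_pdf(height_samples, prior_mean, prior_std, h_grid):
    """Posterior f_H(h) = prior * KDE-likelihood, normalized to unit
    total probability on a uniform height grid."""
    likelihood = gaussian_kde(height_samples)(h_grid)   # L(h; y) from MC samples
    prior = norm.pdf(h_grid, loc=prior_mean, scale=prior_std)
    post = prior * likelihood
    dh = h_grid[1] - h_grid[0]                          # uniform grid assumed
    return post / (post.sum() * dh)                     # total probability = 1
```

From the returned grid of densities, the mode, mean, median, or any altitude-interval probability follows by straightforward array operations.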

Probabilistic Mass Loading
Although this method is used primarily for detection (using z-scores) and height estimation, it is also used to estimate the mass loading, which we treat here as a random variable X. Because the total column depends strongly on the layer height, we calculate the conditional loading random variable after Walker et al. (2011, 2012):

X | H=h = cos θ [Kᵀ(h) S⁻¹ K(h)]⁻¹ Kᵀ(h) S⁻¹ (y − Y),

where an air mass factor equal to the cosine of the satellite zenith angle (cos θ) has been applied. This formula is in some sense a probabilistic enhancement of an optimal unconstrained least-squares mass loading estimate. Similar to the z-scores, this function is normally distributed at every height, with mean E[X | H=h] and variance Var[X | H=h]. In practice, these are calculated by the same MC sampling used to generate the height samples. The conditional loading is a random function of height which generally reflects the principles of the Beer-Lambert law for any given realization. If the height of the SO2 layer were known exactly, this function could be queried and the loading could be estimated. However, since the height of the layer is known only probabilistically, that is, as measured by the PDF f_H(h), additional computation is required to determine the loading.
Although we do not know the vertical profile of SO2 concentration, the retrieval assumptions involve the thin-layer representation of the SO2 as a Dirac delta. The total column is then

∫₀^∞ X δ(h′ − H) dh′ = X,

which does not depend upon the height. More generally, the cumulative mass loading below a certain altitude h is

X(h) = ∫₀^h X δ(h′ − H) dh′,

which is zero for h < H and equals X for h > H. Because the concentration profile is linear in the loading and the conditional loading calculated above is normally distributed, we take the loading to be normally distributed, thus requiring only two parameters: the mean and variance, which can be found using the conditional loading function calculated above. In this study, integration is performed numerically over a discretely sampled height domain. Sketches of the proofs for the mean, variance, and covariance formulae below are detailed in Appendix B.
The mean loading is found by the law of total expectation:

E[X(h)] = ∫₀^h E[X | H=h′] f_H(h′) dh′,

where the expectation of the conditional loading is taken as the mean of the samples.
The variance can also be calculated from the statistics of the conditional loading expectation:

Var[X(h)] = ∫₀^h (Var[X | H=h′] + E[X | H=h′]²) f_H(h′) dh′ − E[X(h)]²,

where Var[X | H=h] and E[X | H=h] were previously estimated by MC sampling.
The covariance between the cumulative mass loadings up to two altitudes h = a and h = b (with a ≤ b) is

Cov[X(a), X(b)] = ∫₀^a (Var[X | H=h′] + E[X | H=h′]²) f_H(h′) dh′ − E[X(a)] E[X(b)],

which is always less than Var[X(a)] unless a = b, since the loading is cumulative, yielding larger values at higher altitudes.

Of particular importance, these formulae may be used to calculate the expectation and variance values of the mass loading between two altitudes. The expected value is

E[X(b) − X(a)] = ∫_a^b E[X | H=h′] f_H(h′) dh′,

and the variance for this loading is

Var[X(b) − X(a)] = Var[X(a)] + Var[X(b)] − 2 Cov[X(a), X(b)].

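The between-altitude statistics can be evaluated numerically as Riemann sums on the discretely sampled height domain. A minimal sketch, with our own (illustrative) names for the MC-estimated conditional moments:

```python
import numpy as np

def band_loading_stats(h_grid, f_H, mu_cond, var_cond, a, b):
    """Mean and variance of the mass loading between altitudes a and b.

    f_H:      retrieved height PDF on a uniform h_grid;
    mu_cond:  E[X | H=h] from MC sampling, on the same grid;
    var_cond: Var[X | H=h] from MC sampling, on the same grid.
    Uses the law of total expectation/variance for X(b) - X(a).
    """
    dh = h_grid[1] - h_grid[0]
    band = (h_grid >= a) & (h_grid < b)
    mean = np.sum(mu_cond[band] * f_H[band]) * dh
    second = np.sum((var_cond[band] + mu_cond[band]**2) * f_H[band]) * dh
    return mean, second - mean**2
```

Setting a = 0 and b above the cloud top recovers the total-column mean and variance; the same expression, expanded, matches the Var and Cov combination given in the text.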
In this system, we retrieve probabilistic SO2 information in two stages. In the first stage, we perform an initial detection using the classical method (Eq. 5) to pre-screen each CrIS field of view (FOV) that likely contains SO2, taken as an initial maximum z-score greater than 5, that is, z(h_C; y, ȳ) > 5 (Walker et al., 2011; Clarisse et al., 2014). In the second stage, we retrieve the height PDF and the mean and variance cumulative loading functions of height for each CrIS FOV that satisfies this initial z-score threshold. In both stages, the background spectrum and covariance are interpolated spatially at the CrIS FOV center location by a bilinear interpolation scheme using the four nearest grid cells (Appendix C).
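The two-stage gating can be sketched as follows. Both callables are hypothetical placeholders, not functions from the paper's code: `z_of_fov` would compute the initial maximum z-score for a FOV, and `probabilistic_retrieval` stands in for the full MC/Bayesian second stage.

```python
def two_stage(fovs, z_of_fov, probabilistic_retrieval, z_min=5.0):
    """Stage 1: cheap classical z-score screen on every FOV.
    Stage 2: expensive probabilistic retrieval, run only on FOVs
    whose initial maximum z-score exceeds z_min (z > 5 in the text)."""
    results = {}
    for fov in fovs:
        if z_of_fov(fov) > z_min:                     # stage 1: detection
            results[fov] = probabilistic_retrieval(fov)  # stage 2: full PDF
    return results
```

The design point is simply that the arg max screen costs one pass over the Jacobian columns, so the 10,000-sample MC stage is paid only where SO2 is likely present.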

Specialized Retrieval for Strong Loading
For strong SO2 columns, which we define here heuristically as z > 200, an alternate retrieval is needed to increase sensitivity to large loading. This is clear from examination of the chosen formula for approximating the Jacobian (Eqn. 4), in which the background state (x̄) is zero DU and the perturbation (ε) is 5 DU. For strong loading, the sensitivity of the Jacobians to additional SO2 is greatly reduced for most CrIS channels, so linearized Jacobians with only a 5 DU anomaly will drastically underpredict the loading for a given brightness temperature difference (Fig. 2b). In order to construct a Jacobian that is more sensitive at higher mass loading values, the Jacobian must be dominated by channels with approximately linear responses to SO2 loading. However, by applying this new retrieval only when strong loading (z > 200) is detected by the original retrieval, nontrivial brightness temperature differences are guaranteed. In theory, a sequence of increasingly restricted retrievals could be implemented to increase the sensitivity to even stronger loading values; such a study is beyond the scope of the present work.

Background State Construction
At every stage of the retrieval, the background state of the volcanic SO 2 -free atmosphere must be accurate in order for this linearized method to succeed. Consequently, calculating accurate statistics of the background state is paramount. The CrIS instrument collects almost 3 million spectra per day, allowing for robust characterization of the background spectrum including variation across seasons and locations.

In constructing the background spectrum PDFs and covariance matrix, the periods with little or no SO 2 must be determined.
We utilize the detailed record of global volcanic SO2 emissions from the IASI SO2 retrieval algorithm (L. Clarisse, pers. comm.) between 1 November 2017 and 1 November 2018, collecting all spectra measured on days when the maximum total column present anywhere in the atmosphere was less than 1 DU SO2 (Fig. 3a).
This omission leaves more than 3.6×10⁸ SO2-free spectra collected by CrIS over the one-year period. We classify the spectra regionally and seasonally, partitioning the data into four seasons and 5° × 5° latitude-longitude grid cells, yielding 10,368 bins (Fig. 4), each of which has a full set of PDFs for each channel and a covariance matrix characterizing the correlation structure among the channels. This partitioning reduces the overall variability represented in the mean spectra, and thus also reduces the magnitude of the error covariance matrix entries, while still capturing the fundamental variability due to spectral trends between regions and throughout a year.

[…] support (Fig. 5c). This underestimate early in the Raikoke cloud history was likely due to two factors: channel saturation despite the specialized strong-column retrieval, and the fact that the footprint of the CrIS FOVs leaves many gaps where extremal values could have been present. As described above, more specialized retrievals could be devised to increase the sensitivity to very strong loading as in the Raikoke case; however, such schemes are beyond the scope of this work, which is principally concerned with advances in height information. Within about one day, the ash and SO2 were entrained into a large extratropical cyclone which heavily distorted the dispersion of the cloud, with the SO2 cloud being pushed to the north and dispersing in both easterly and westerly directions (Fig. 6). Early in this complex dispersion, SO2 columns remained strong despite a rapid decline in eruptive output. This is most likely a result of the convergence caused by entrainment into the cyclone.
Based on the probabilistic retrieval and tropopause data from the National Centers for Environmental Prediction (NCEP), it is clear that the vast majority of the SO2 cloud mass was in the lower stratosphere, with only a small lower layer in the mid-to-upper troposphere which had mostly dispersed after the first week (Fig. 6b-l). After one month, the SO2 cloud had spread out over most of the northern hemisphere above 30° N, with most columns < 2 DU; however, some columns remained as strong as 20 DU. After two months, traces of the SO2 cloud remained over Northern Canada and the Hudson Bay, with columns up to 1 DU.

[…] The Alaska Volcano Observatory (AVO) was not able to issue a Volcanic Activity Notice (VAN) for this event, and consequently no height information was generated until the reanalysis of Schneider et al. (2020), in which a cloud height of 6.1 km was determined.

This small pulse was not observable by the multispectral infrared remote sensing methods, nor by automated analysis of multispectral signatures and cloud growth rates (Pavolonis et al., 2013, 2015a, b, 2018; Schneider et al., 2020). The CrIS estimated heights are mainly clustered between 5-8 km, with some scatter due to localized cloud-edge effects (Fig. 7c,d). This is broadly consistent with the reanalysis of Schneider et al. (2020).
Of particular importance in this small cloud, made up of only a few CrIS FOVs: the midwave IR for FOV 7 on the CrIS instrument aboard S-NPP is significantly noisier (above specification) than that of other FOVs (Han et al., 2013), and consequently the FOV 7 retrievals are highly suspect and have been omitted (Fig. 7). Because the probabilistic framework allows the calculation of a mean cumulative mass loading, we may derive a formula for the mean or expected concentration profile by similar means as for Eq. 19:

E[x(h)] = d/dh E[X(h)] = E[X | H=h] f_H(h),

which is shown for FOVs in the detected Bogoslof cloud (Fig. 7e). This example demonstrates that the CrIS SO2 detection and characterization scheme is sufficiently sensitive to capture some small emissions which are generally difficult to observe by other means.
3.3 Test Case III: Resolving Strong Stratification in an SO2 Plume, Sierra Negra, Galapagos Islands, 2018

On 26 June 2018, after a period of elevated seismicity, the onset of a major eruption at Sierra Negra was signaled by volcanic tremor at 19:40 UTC, producing an ash and SO2 plume at 20:09 UTC (Carn et al., 2018; Vasconez et al., 2018; Hedelt et al., 2019). The first CrIS observation also occurred at 20:09 UTC from S-NPP, detecting SO2 above Sierra Negra on 3 adjacent FOVs at the edge of scan, with maximum initial z-scores of approximately 9, 16, and 31. Subsequent overpasses show the plume rising to approximately 14-19 km and spreading in a complex manner due to vertical wind shear, as evidenced by the lower plume altitudes spreading towards the west and the upper plume altitudes spreading towards the southeast (Fig. 8). The significant shearing of the eruption column enables observations of the cloud at many levels. Consequently, this eruption forms a good opportunity to highlight the broad sensitivity of this method in detecting and characterizing SO2 at every elevation from the vent (1,124 m asl) up to ∼14-19 km and potentially higher. Additionally, this example highlights the strength of the probabilistic height retrieval, enabling the retrieval of any desired confidence interval on the height. Here we compute the 90% confidence interval (Fig. 8), highlighting the fact that the 95th and 5th percentiles are not in general symmetric about the median, nor the same distance from it at different locations within the plume. Because our method retrieves consistent statistics across all measurements, we can ensure the stability of the method, and of derived probabilities in particular, giving good smoothness even without post-processing. As described above, this is not necessarily the case for other height retrievals, which compute a single estimate with constant uncertainty.

Comparison with other data
Although a deep analysis of the differences between the present method and others is beyond the scope of the present work, here we highlight a brief, representative comparison of our SO 2 retrievals with data from TROPOMI, IASI, and CALIOP during the evolution of the Raikoke eruption cloud.
As mentioned above, our strongest total column loading from the Raikoke cloud was 432 DU. This is significantly lower than the maximum detected by TROPOMI (>900 DU) and several other UV-based methods (∼1000 DU, S. Carn, pers. comm.). This suggests that our method, despite the integrated height estimate and the specialized retrieval for strong loading, currently cannot fully capture extremely high column loading values; however, away from these extreme values, our retrieval performs well in comparison to TROPOMI and IASI (Fig. 9 a-e). Other than the relatively few columns with unusually large loading, the largest discrepancy between the TROPOMI-retrieved Raikoke cloud and that from CrIS is that the CrIS retrieval does not resolve the smaller diffuse cloud to the south of the main cloud, although the CrIS retrieval does resolve some similar features elsewhere in the cloud. This may be due to the very high spatial resolution of the TROPOMI data (7 × 3.5 km² pixels at nadir) compared with the coarse CrIS resolution (∼154 km² FOVs at nadir, with gaps between FOVs). An additional contributing factor is that the CrIS retrieval starts with an initial detection of the z-score (Fig. 9 j) and only retrieves the height and loading for FOVs with z > 5. Evidence for this smaller diffuse cloud is present in the initial z-score field, although it mostly presents with a z-score below this threshold. Lastly, this discrepancy may also be due to spectral interference from water vapor in the CrIS SO2 infrared band (1300-1410 cm⁻¹), although our approach is generally insensitive to the presence of background levels of water vapor by construction.
The strength of our approach lies in our ability to generate physics-based PDFs for the height. Because our height retrieval is based on the operational algorithm in use for IASI, our retrieved heights are very similar to those from IASI, although there are key differences. As mentioned above, the IASI heights represent the height retrieved due to the mean background spectrum; however, because the retrieval of height is not necessarily linear (due to the arg max function), the retrieved height is not the expected-value height. Inspection of the PDFs generated by this approach shows that they are typically non-symmetric, meaning that IASI heights are also not necessarily the maximum likelihood estimated heights, although exact comparison is not possible due to the orbital separation between the satellites carrying IASI (METOP-A, B) and those carrying CrIS (S-NPP, NOAA-20). Because of this mathematical point, the IASI heights also cannot be said to represent a consistent metric or statistic of the height, meaning that there may be significant inconsistency between measurements. At least for the Raikoke cloud, the IASI heights are almost entirely bound within the CrIS 90% confidence interval (Fig. 9 f,g,i) and are very similar to the CrIS median heights (Fig. 9 h). For the snapshot of the Raikoke cloud shown in Fig. 9, the largest differences in height appear at the southernmost part of the cloud, where the IASI heights are >19 km altitude over a significant region. Furthermore, the IASI height estimate (Fig. 9 f) varies significantly over nearby, continuous parts of the cloud, whereas the CrIS median height is more consistent across space, with some minor variation due to noise.
Although not shown here, Hedelt et al. (2019) have recently developed a new SO2 height retrieval for TROPOMI using inverse learning machines. Although computationally expensive to train, such an approach has the advantage of computational speed of the inversion once deployed, though it has not yet been incorporated into the TROPOMI SO2 data product as of this writing.

Because we retrieve a PDF on each CrIS FOV rather than a single estimate, we can compare the PDFs directly to CALIOP data as CALIPSO passes over the cloud. Here we show an example comparison from Raikoke; however, a full comparison for every overpass of the Raikoke cloud is the subject of future work. For the first several days after the eruption, there was still significant ash suspended in the dispersing cloud, leading to the appearance of several highly attenuating layers in CALIOP data between 10-15 km (Fig. 10 a,b). The comparison we focus on is between CrIS retrievals from 14:18:00-14:24:00 UTC […] bimodal PDFs, which may have several interpretations (Fig. 10 b,c). The CrIS retrieval assigns significant probability mass to the same locations as the strongly attenuating layers, suggesting that they are at least a mix of ash, SO2, and sulfate aerosols.
Where this cloud rises up to 14-15 km, the CrIS retrieval also assigns significant probability mass to a lower layer between 11-13 km (Fig. 10 b,c). This may indicate the presence of a separate molecular SO2 layer at this altitude. Indeed, this part of the CALIOP layer is very strongly attenuating and may completely shadow any evidence of a lower, weakly attenuating sulfate aerosol layer. However, this lower layer may merely be a (true) probability artifact inherited from the background spectrum probability space. If the background spectrum has multiple modes (for example, one mode representing deep convective cloud radiances and another for cloud-free radiances), then multiple populations of the Monte-Carlo height samples may accumulate, leading to a multimodal height PDF.

By retrieving PDFs for height and cumulative mass loading, it is possible to enhance time series analysis of SO2 clouds accordingly, enabling the generation of time series with quantified uncertainty. As an example, we calculate the total and stratospheric SO2 mass time series probabilistically as sums of many independent retrievals (Fig. 11).

To estimate the mass of an SO2 cloud, the values of the retrieval on the CrIS FOVs must be interpolated onto a continuous grid spanning the cloud. In the present study, we use an equal-area grid with area element δA and perform nearest-neighbor interpolation of the CrIS FOV retrievals. Consequently, we calculate the total cloud mass in units of kilotons (kt) of SO2 as a Riemann sum,

M = κ δA Σ_i X_i(h_∞),

where κ = 2.8617 × 10⁻¹¹ kt m⁻² DU⁻¹ and the X_i(h_∞) are the gridded cumulative mass loading values at the top of the atmosphere h_∞. By aggregating enough CrIS FOVs to completely cover the area of a cloud, a total cloud mass can be approximated.
Furthermore, because the loading measurements are independent random variables, the total cloud mass is a normal random variable, and the mean and variance of the cloud mass are

E[M] = κ δA Σ_i E[X_i(h_∞)],  Var[M] = (κ δA)² Σ_i Var[X_i(h_∞)].

The mass of SO2 in the stratosphere can be calculated similarly if the height of the tropopause h_trop is available from ancillary data sources. The mass loading in the stratosphere is X_strat = X(h_∞) − X(h_trop), which is a normal random variable independent from all other locations. By analogy to the total cloud mass calculation, the mean and variance of the stratospheric cloud mass are therefore

E[M_strat] = κ δA Σ_i E[X_strat,i],  Var[M_strat] = (κ δA)² Σ_i Var[X_strat,i],

where the mean and variance of X_strat are calculated as in Eqns. 22 and 23.
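Because the gridded loadings are treated as independent normal random variables, the mass sums above reduce to a few array reductions. A minimal sketch (function and argument names are ours):

```python
import numpy as np

KAPPA = 2.8617e-11   # kt m^-2 DU^-1, as in the text

def cloud_mass_stats(mean_loading_du, var_loading_du, cell_area_m2):
    """Mean and variance of total cloud mass (kt) as a Riemann sum over an
    equal-area grid: means add, and by independence variances add too."""
    c = KAPPA * cell_area_m2
    return c * np.sum(mean_loading_du), c**2 * np.sum(var_loading_du)
```

The stratospheric mass follows by passing the gridded means and variances of X(h_∞) − X(h_trop) instead of the full columns.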
Additionally, we compute from these results the "instantaneous" e-folding time of the SO2 as τ = −M(t)/Ṁ(t), with Ṁ(t) calculated by finite difference and the PDFs computed by standard results in probability theory (Fieller, 1932; Hinkley, 1969; DeGroot and Schervish, 2012; Appendix E).
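The finite-difference e-folding time can be sketched in a few lines; `np.gradient` uses centered differences in the interior, which is one reasonable reading of the finite-difference scheme described above (the paper does not specify the stencil):

```python
import numpy as np

def efolding_time(t, M):
    """Instantaneous e-folding time tau = -M(t) / dM/dt, with the time
    derivative estimated by (centered) finite differences."""
    return -M / np.gradient(M, t)
```

For a pure exponential decay M(t) = M0 exp(-t/tau0), the interior estimates recover tau0; on real, noisy mass time series the quotient is itself noisy, consistent with the scatter noted in Fig. 11c,d.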
From Fig. 11 it is clear that the SO2 cloud, as characterized by CrIS, did not immediately show the exponential mass decay that has been used in similar studies of large eruptions (e.g., Read et al., 1993; Carn et al., 2017; Krotkov et al., 2010). The observed trend is likely a combination of retrieval limitations and genuine atmospheric chemical properties. As described above, the CrIS SO2 loading underestimate early in the Raikoke cloud history was likely due to channel saturation despite the specialized strong-column retrieval. The strongest CrIS total columns were only approximately 50% of the strongest columns reported at the time: 432 DU from CrIS vs. >900 DU from TROPOMI and >1000 DU (S. Carn, pers. comm.).
Such strong columns did not persist for many overpasses; however, the total SO2 mass did not begin an exponential decay until approximately 20 days after SO2 injection, suggesting that in the first several weeks the reaction kinetics, and consequently the e-folding time, were strongly varying. This could have been the result of limited hydroxyl radical (OH) availability early in the cloud history (e.g., Theys et al., 2013; Sekiya et al., 2016). As the e-folding times shown in Fig. 11c,d are calculated by finite difference, they are quite noisy; however, they exhibit the same trend of slow decay early in the cloud history, settling into a constant decay rate later (∼10 day e-folding time). The large uncertainty in the stratospheric e-folding time after approximately 40 days is the result of the mean mass decaying below the order of magnitude of the standard deviation of the mass and the particular details of the PDF calculation (Hinkley, 1969).
i) The new probabilistic enhancement of existing hyperspectral IR SO2 retrieval techniques enables the retrieval of PDFs for SO2 height and cumulative mass loading, adding significant statistical power, precision, and consistency to global SO2 detection, tracking, and forecasting efforts. Retrieving these PDFs enables the calculation of many new quantities, including exceedance probabilities for concentration, layer height constraint probabilities, mean concentration profiles, and mass at different layers with uncertainty. Although these capabilities are primarily beneficial for operational SO2 monitoring, these methods are relevant to climatological studies because of the ability to calculate PDFs for the stratospheric fraction of the total mass of a given SO2 cloud where the tropopause height is available from ancillary sources.
ii) This technique is capable of resolving strong SO2 mass loading based on a specialized retrieval with less sensitive CrIS channels; however, it is limited in its ability to resolve extreme mass loading values like some of those observed in the recent eruption of Raikoke Volcano, Kuril Islands. Because of the improved spatial resolution over IASI and the technique's sensitivity, we can resolve small clouds that are undetectable by other means. Additionally, the technique can adequately resolve height information across a broad range of plume altitudes, including the lower stratosphere.
iii) Preliminary comparisons suggest that this method compares well with other measurements of SO2 mass loading and altitude; however, the probabilistic framework adds significant value over these techniques, especially in the retrieval of height information. Cross sections through these probability clouds compare very well with cloud heights from CALIOP lidar backscatter.
iv) This technique enables the probabilistic characterization of SO2 clouds for long-term analysis of cloud evolution and key time-varying parameters such as total mass, stratospheric mass, and e-folding time.
v) The algorithms presented here are currently being integrated into VOLCAT, where they will be used for operational SO2 cloud detection, characterization, and tracking in support of aviation safety. We anticipate future work to include more comprehensive comparison of height PDFs with CALIOP lidar backscatter data, application of these techniques to similar instruments including IASI and the Atmospheric Infrared Sounder (AIRS), development of volcanic degassing and aviation-focused products, and analysis of errors in the trace gas technique induced by a warming background atmosphere.
Code and data availability. The Level-1B CrIS data utilized in this study are available from the Goddard Earth Sciences Data and Information Services Center (GES-DISC, https://disc.gsfc.nasa.gov/). The tropopause data used here to define the lower limit of the stratosphere are available from the NOAA Earth Science Research Laboratory (ESRL, https://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanalysis.html). The code developed to generate samples of the SO2-free background spectrum by the NORTA process is available in a git repository at https://gitlab.ssec.wisc.edu/dhyman/trace_gas_background_spectra. This repository also includes a list of the SO2-free days discussed in the text as well as GES-DISC links for all of the CrIS granules collected during those days.
where Φ is the univariate standard normal CDF. In this formulation, the individual marginals can be used to transform the components of the standard normal vector. The crux of this method lies in generating a standard normal random vector Z with an appropriate correlation structure ρ_Z which, upon component-wise transformation as above, will produce the desired random vector Y with the desired covariance structure S. That is, a correlated Z must be generated with the correlation matrix which, after transformation, results in the correlation matrix ρ_Y = Corr(Y) associated with the known covariance matrix S.
This is accomplished by solving for each of the unique N(N − 1)/2 correlations ρ_Z(i, j) in the lower triangle of ρ_Z. The correlation transformation function is

ρ_Y(i, j) = c_ij(ρ_Z(i, j)) = Corr(F_i⁻¹[Φ(Z_i)], F_j⁻¹[Φ(Z_j)]),

which requires the evaluation of the expectation

E[Y_i Y_j] = ∫∫ F_i⁻¹[Φ(z_i)] F_j⁻¹[Φ(z_j)] ϕ(z_i, z_j; ρ_Z(i, j)) dz_i dz_j,   (Eqn. A5)

where F_i⁻¹ is the inverse CDF of the i-th marginal and ϕ(z_i, z_j; ρ_Z(i, j)) is the bivariate standard normal density function with correlation ρ_Z(i, j) between the variables Z_i and Z_j. These problems may be inverted individually by various methods outlined in the original formulation of Cario and Nelson (1997).
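A minimal NORTA sampler illustrating the component-wise transformation can be sketched in a few lines. Here the marginals are toy empirical samples and ρ_Z is assumed to be already matched (i.e., the pairwise inverse problems are solved); the empirical quantile function stands in for F_i⁻¹:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Toy stand-ins: 3 "channels" with empirical marginal samples and a target
# correlation matrix rho_Z for the underlying Gaussian (assumed solved).
marginal_samples = [rng.gamma(2.0, 1.0, 5000),      # channel 0
                    rng.normal(250.0, 5.0, 5000),   # channel 1
                    rng.lognormal(0.0, 0.3, 5000)]  # channel 2
rho_Z = np.array([[1.0, 0.8, 0.3],
                  [0.8, 1.0, 0.5],
                  [0.3, 0.5, 1.0]])

def norta_sample(n, rho_Z, marginal_samples, rng):
    """Draw n NORTA vectors: correlated standard normals -> uniforms -> marginals."""
    L = np.linalg.cholesky(rho_Z)                  # rho_Z must be positive definite
    Z = rng.standard_normal((n, len(marginal_samples))) @ L.T
    U = norm.cdf(Z)                                # componentwise Phi(Z_i)
    # Empirical inverse CDF of each channel via the sample quantile function
    Y = np.column_stack([np.quantile(s, U[:, i])
                         for i, s in enumerate(marginal_samples)])
    return Y

Y = norta_sample(10000, rho_Z, marginal_samples, rng)
```

The resulting Y reproduces the channel marginals exactly (up to sampling noise) while carrying correlations close to, but not equal to, ρ_Z, which is why the correlation-matching step described next is needed.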

A1 Application of NORTA to background spectrum
For the purposes of generating the correlated random background spectrum in the present study, we make several modifications to the classical method of Cario and Nelson (1997) to make our problem tractable while retaining the high fidelity of the CrIS measurements. In the present study, 177 CrIS channels are used, representing the FSR mid-wave band between 1300 and 1410 cm⁻¹. This yields 15,576 independent correlation-matching inverse problems, which are solved in the present study by gradient descent iteration.
Because each of the correlation-matching inverse problems requires multiple rounds of numerical integration of Eqn. A5, we make several modifications to increase computational efficiency. Because many of the channel correlations are strong, the bivariate standard normal distribution ϕ(z_i, z_j; ρ_Z(i, j)) for each such pair of channels is very narrow. Consequently, for a typical rectangular sampling domain, the integrand of Eqn. A5 is approximately zero almost everywhere except for a concentrated region in which accurate numerical integration is challenging. To reduce wasted integrand samples, we make a standard transformation of the bivariate normal distribution, cast the domain in polar coordinates, and approximate the integral over a finite radial domain [0, R]:

E[Y_i Y_j] ≈ ∫₀^{2π} ∫₀^R g_ij(r, θ) (r/2π) e^{−r²/2} dr dθ,

where g_ij(r, θ) is the product of the transformed i-th and j-th components of the desired random vector. This approximates the full improper integral within tolerance ε, with the fixed value ε = 10⁻⁶ in the present study. The terminal radius is estimated conservatively using the maximum and minimum values of the components, which were recorded during the generation of S and the marginal distributions from sample data. In the absence of knowledge of these minima and maxima, they could be estimated from the marginal distributions as fixed percentiles of these distributions.
The estimated radial limit of integration R is derived by requiring that the neglected tail of the integral is smaller than the tolerance:

|∫₀^{2π} ∫_R^∞ g_ij(r, θ) (r/2π) e^{−r²/2} dr dθ| ≤ ε.

This inequality can be solved for R under the conservative estimation that

|g_ij| ≤ max|Y_i| · max|Y_j| ≡ G_ij,

which is finite in this case due to the collection of a finite number of spectra for the generation of the channel marginals. Upon integrating, R may be chosen conservatively to satisfy

G_ij e^{−R²/2} ≤ ε,  i.e.,  R ≥ √(2 ln(G_ij/ε)).

This transformation to scaled polar coordinates ensures that the curvature and gradients in the integrand are as small as possible, so that numerical integration is as accurate as possible for a given domain sampling, up to the order of the method employed. Additionally, the sensible radial limitation of this integral ensures that it can be computed efficiently to within tolerance ε without the inclusion of samples which minimally affect the total (approximately Gaussian tails).
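The truncated polar-coordinate quadrature and the conservative radius R = √(2 ln(G/ε)) implied by the tail bound can be checked numerically. In the sketch below (illustrative, not the operational code) both marginals are taken to be standard normal, so F⁻¹(Φ(z)) = z and the exact value of E[Y_i Y_j] is ρ_Z(i, j) itself:

```python
import numpy as np

rho = 0.7          # target Gaussian correlation rho_Z(i, j)
eps = 1e-6         # integration tolerance from the text
G = 25.0           # assumed bound on |y_i * y_j| (finite for empirical marginals)
R = np.sqrt(2.0 * np.log(G / eps))   # conservative radial limit

# Decorrelating transform: (z_i, z_j) = L @ u with u isotropic standard normal
L = np.array([[1.0, 0.0],
              [rho, np.sqrt(1.0 - rho**2)]])

# Midpoint rule on the polar rectangle [0, R] x [0, 2*pi]
nr, nt = 400, 400
r = (np.arange(nr) + 0.5) * (R / nr)
t = (np.arange(nt) + 0.5) * (2.0 * np.pi / nt)
rr, tt = np.meshgrid(r, t, indexing="ij")
u = np.stack([rr * np.cos(tt), rr * np.sin(tt)])      # shape (2, nr, nt)
z = np.einsum("ab,bij->aij", L, u)                    # correlated coordinates

# With standard normal marginals, F^{-1}(Phi(z)) = z, so the integrand product
# g_ij = z_i * z_j and the exact value of E[Y_i Y_j] is rho itself.
integrand = z[0] * z[1] * (rr / (2.0 * np.pi)) * np.exp(-rr**2 / 2.0)
estimate = integrand.sum() * (R / nr) * (2.0 * np.pi / nt)
```

With G = 25 and ε = 10⁻⁶, R ≈ 5.8, and the quadrature recovers ρ to well within the tolerance.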
Although each channel-pair inverse problem can be solved separately by Newton's method or other algorithms, we solve all of the problems jointly, restating the problem as a gradient-descent minimization of the total square correlation error

E(r_Z) = ‖C(r_Z) − r_Y‖²,

where the vectors r_Z, r_Y ∈ R^15,576 are comprised of the unique lower triangular elements of ρ_Z and ρ_Y respectively and C : R^15,576 → R^15,576 is the correlation transformation function cast as a vector function. Although this seems intractable due to the extreme dimensionality, there is no coupling between these dimensions (the pairwise inverse problems are independent), and consequently, the Jacobian is zero everywhere except for its main diagonal. The gradient descent method of Barzilai and Borwein (1988) produces fast convergence to a global minimum due to the monotonicity and bounding properties of the correlation functions C_ij as described by Cario and Nelson (1997). At each iteration, the NORTA process is completed, and the errors on the synthesized channel marginal distributions and spectral covariance matrix are used as convergence criteria.
The vectorization of these independent problems allows for standardization of the convergence criteria using global (total) error minimization. Because the error is measured on the final product (the desired random spectra), fewer iterations are required than if each problem were solved separately and the random spectra generated afterward.
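The joint solve can be prototyped compactly. The sketch below uses a closed-form stand-in for the correlation transformation — for uniform marginals, ρ_Y = (6/π) arcsin(ρ_Z/2), a classical result — so the recovered ρ_Z can be checked against a known answer; the operational version would instead evaluate C by the quadrature of Eqn. A5:

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in forward map C: the exact NORTA correlation transform for uniform
# marginals, C(rho_Z) = (6/pi) * arcsin(rho_Z / 2).
def C(r):
    return (6.0 / np.pi) * np.arcsin(r / 2.0)

def C_prime(r):
    return (3.0 / np.pi) / np.sqrt(1.0 - (r / 2.0) ** 2)

# Target output correlations r_Y for many independent pairwise problems
r_true = rng.uniform(-0.7, 0.7, size=10000)
r_Y = C(r_true)

# Jointly minimize ||C(r_Z) - r_Y||^2; the Jacobian is diagonal, so the
# gradient is elementwise.
def grad(r):
    return 2.0 * C_prime(r) * (C(r) - r_Y)

r = np.zeros_like(r_Y)          # initial guess
g = grad(r)
step = 1e-2
for _ in range(100):
    r_new = np.clip(r - step * g, -0.999, 0.999)
    g_new = grad(r_new)
    dr, dg = r_new - r, g_new - g
    denom = float(np.dot(dg, dg))
    if denom > 0:
        step = abs(float(np.dot(dr, dg))) / denom   # Barzilai-Borwein step
    r, g = r_new, g_new
```

A single scalar Barzilai–Borwein step serves all 10,000 decoupled scalar problems at once, which is the same structural trick used for the 15,576 channel-pair problems in the text.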
Appendix B: Proof of mass loading mean, variance, and covariance formulae

The calculation of the mean and variance makes extensive use of Fubini's Theorem, allowing reordering of iterated integration.
Because these integrals contain the Dirac delta, it is not simple to show that the conditions of Fubini's theorem are satisfied, due to the Lebesgue non-integrability of the Dirac delta, which in turn stems from the fact that the Dirac delta is not a function but a distribution. However, we proceed assuming that the iterated integrals can be interchanged. We remark that a proof using limits of functions ("nascent delta functions") approaching the Dirac delta could be substituted here at great cost to simplicity and readability. If a finite-thickness layer must be used, then all convolutions with the Dirac delta below are replaced with convolutions with a nascent delta function, and the convolutions will result in filtered or locally averaged quantities. For example, if a rectangular nascent delta is used, the results of convolution will be smoothed, locally averaged versions of the desired functions.
The final integral formulas will be expressed in terms of integrals of these filtered functions.
The expected cumulative mass loading up to height h is

E[X̄(h)] = ∫₀^h ∫_H ∫_X x δ(η − η₀) f_{X,H}(x, η₀) dx dη₀ dη = ∫₀^h ∫_H ∫_X x δ(η − η₀) f_{X|H}(x | η₀) f_H(η₀) dx dη₀ dη,

where H and X are the domains of height and loading (both σ-finite measure spaces) and the last equality follows from the definition of the conditional probability density. Rearranging this iterated integral gives

E[X̄(h)] = ∫_H E(X | H = η₀) f_H(η₀) (∫₀^h δ(η − η₀) dη) dη₀.

Because of the symmetry of the delta function (δ(η − η₀) = δ(η₀ − η)), the integral properties of the delta function yield

E[X̄(h)] = ∫₀^h E(X | H = η) f_H(η) dη.

Because the algebraic form of the variance is Var(X̄(h)) = E[X̄²(h)] − [E(X̄(h))]², only the second moment of the cumulative loading E[X̄²(h)] must be calculated to complete the formula. This quantity is

E[X̄²(h)] = ∫₀^h ∫₀^h ∫_H ∫_X x² δ(η − η₀) δ(η′ − η₀) f_{X,H}(x, η₀) dx dη₀ dη′ dη = ∫₀^h ∫_H E(X² | H = η₀) f_H(η₀) δ(η₀ − η) dη₀ dη.

Since the dummy variable η always runs between 0 < η < h, the delta function δ(η₀ − η) is always centered in the interval 0 < η₀ < h, and the integral properties of the delta function can be applied again:

E[X̄²(h)] = ∫₀^h E(X² | H = η) f_H(η) dη.

Substitution of E(X² | H = η) = Var(X | H = η) + [E(X | H = η)]² into the formula for E[X̄²(h)], and subsequently into the formula above for Var(X̄(h)), completes the proof of the variance.
Similarly, for the covariance between cumulative mass loadings up to two altitudes h = a and h = b with b ≥ a, only the mixed expectation E[X̄(b)X̄(a)] must be calculated:

E[X̄(b)X̄(a)] = ∫₀^b ∫₀^a ∫_H ∫_X x² δ(η_a − η₀) δ(η_b − η₀) f_{X,H}(x, η₀) dx dη₀ dη_a dη_b.

Since the dummy variables satisfy 0 < η_a < a ≤ b and 0 < η_b < b, it follows that η_a ∈ H, η_b ∈ H, and 0 < η_a < η_b, so that

E[X̄(b)X̄(a)] = ∫₀^a E(X² | H = η) f_H(η) dη.

Using the algebraic formula for the variance and covariance, we attain by substitution

Cov[X̄(a), X̄(b)] = ∫₀^a E(X² | H = η) f_H(η) dη − E[X̄(a)] E[X̄(b)] = Var[X̄(a)] − E[X̄(a)] (E[X̄(b)] − E[X̄(a)]),

which is always less than Var[X̄(a)] unless a = b, since the loading is cumulative, yielding larger values at higher altitudes.
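The single-layer formulas above can be sanity-checked by Monte Carlo under assumed toy distributions for (H, X); every distributional choice below is invented for the check:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000

# Assumed toy model: layer height H ~ N(10, 2^2) km (clipped positive) and
# conditional loading X | H = eta ~ N(5 + 0.3 * eta, 1) DU.
H = np.clip(rng.normal(10.0, 2.0, n), 1e-3, None)
X = rng.normal(5.0 + 0.3 * H, 1.0)

h = 11.0                      # evaluate the cumulative loading at 11 km
Xbar = X * (H <= h)           # single-layer model: Xbar(h) = X * 1(H <= h)

# Formula side, by quadrature on a fine height grid:
#   E[Xbar(h)]   = int_0^h E(X | H = eta) f_H(eta) d eta
#   Var[Xbar(h)] = int_0^h E(X^2 | H = eta) f_H(eta) d eta - E[Xbar(h)]^2
eta = np.linspace(1e-3, h, 20_000)
d_eta = eta[1] - eta[0]
f_H = np.exp(-0.5 * ((eta - 10.0) / 2.0) ** 2) / (2.0 * np.sqrt(2.0 * np.pi))
m1 = 5.0 + 0.3 * eta          # E(X | H = eta)
m2 = 1.0 + m1**2              # E(X^2 | H = eta) = Var + mean^2
mean_formula = float(np.sum(m1 * f_H) * d_eta)
var_formula = float(np.sum(m2 * f_H) * d_eta - mean_formula**2)
```

The Monte Carlo moments of Xbar agree with the quadrature of the formulas to within sampling error.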

Remark: Of particular importance, these formulae may be used to calculate the expectation and variance of the mass loading between two altitudes a < b. The expected value is

E[X̄(b) − X̄(a)] = ∫_a^b E(X | H = η) f_H(η) dη,

and the variance is

Var[X̄(b) − X̄(a)] = ∫_a^b E(X² | H = η) f_H(η) dη − (∫_a^b E(X | H = η) f_H(η) dη)².

Appendix C: Bilinear interpolation of background spectrum

To smooth the changes between retrievals in adjacent background cells, we use a bilinear interpolation of the background spectra. For a general quantity Q, the bilinear interpolation is represented as

Q(x, y) = c_x c_y Q(x₀, y₀) + (1 − c_x) c_y Q(x₁, y₀) + c_x (1 − c_y) Q(x₀, y₁) + (1 − c_x)(1 − c_y) Q(x₁, y₁)

between the corner points (x₀, y₀), (x₁, y₀), (x₀, y₁), (x₁, y₁), using the scalings c_x = (x₁ − x)/(x₁ − x₀) and c_y = (y₁ − y)/(y₁ − y₀). We interpolate the inverse error covariance matrix S⁻¹ by this formula. However, because our background spectrum is characterized probabilistically as a set of N = 10,000 samples (y_s ∈ Ω_Y) of a correlated random vector Y in each seasonal 5° × 5° background grid cell, the samples cannot be interpolated directly by the above formula. Instead, we treat Y(x, y) as the collection of n(x_i, y_j) samples drawn from each of the corner populations Y(x_i, y_j). This generates a total of N samples for the interpolated background spectrum.
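A sketch of the sample-resampling interpolation follows, under one assumed reading of the scheme: the number of samples drawn from each corner is proportional to its bilinear weight. All numerical values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)

# Each corner of the background cell carries N samples of a (here 4-channel)
# background spectrum; corner statistics are invented for the sketch.
N, nchan = 10_000, 4
corners = {(0, 0): rng.normal(260.0, 2.0, (N, nchan)),
           (1, 0): rng.normal(262.0, 2.0, (N, nchan)),
           (0, 1): rng.normal(255.0, 2.0, (N, nchan)),
           (1, 1): rng.normal(258.0, 2.0, (N, nchan))}

def interp_background(x, y, x0=0.0, x1=1.0, y0=0.0, y1=1.0):
    """Build N interpolated samples by drawing n_ij = (bilinear weight) * N
    samples (rounded) from each corner's sample population."""
    cx = (x1 - x) / (x1 - x0)
    cy = (y1 - y) / (y1 - y0)
    w = {(0, 0): cx * cy, (1, 0): (1 - cx) * cy,
         (0, 1): cx * (1 - cy), (1, 1): (1 - cx) * (1 - cy)}
    counts = {k: int(round(wk * N)) for k, wk in w.items()}
    counts[(1, 1)] += N - sum(counts.values())   # fix rounding to total N
    parts = [corners[k][rng.choice(N, size=c, replace=False)]
             for k, c in counts.items() if c > 0]
    return np.concatenate(parts, axis=0)

Y = interp_background(0.25, 0.5)
```

The interpolated population mean then lands at the bilinear combination of the corner means, while the samples remain real (non-synthetic) draws from the corner populations.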
E2 Probabilistic decay rate coefficient and e-folding time

We treat the above time series of PDFs of SO2 mass as a random process M_t. As a continuous process, the conversion of SO2 into sulfur aerosols can be modelled kinetically by the differential equation Ṁ_t = −k_t M_t, where k_t is the instantaneous decay rate coefficient. Below we generate k_t and the e-folding time τ_t = k_t⁻¹ as random processes from M_t.
To make this calculation in practice, a finite difference formula is needed for Ṁ_t. We write this as a general 2α-order accuracy central finite difference formula for the first derivative:

Ṁ_t ≈ (1/Δt) ∑_{i=−α}^{α} δ_{t+i} M_{t+i},

where the δ_{t+i} are the central difference scheme coefficients. As before, the weighted sum of normal random variables is also normally distributed. Consequently,

E(Ṁ_t) = (1/Δt) ∑_{i=−α}^{α} δ_{t+i} E(M_{t+i}),  Var(Ṁ_t) = (1/Δt²) ∑_{i=−α}^{α} δ²_{t+i} Var(M_{t+i}),

where each covariance term is zero because each measurement is independent. Because each Ṁ_t is normally distributed, this sequence of means and variances fully parameterizes its random process. To calculate k_t, we must also calculate Cov(M_t, Ṁ_t):

Cov(M_t, Ṁ_t) = (1/Δt) ∑_{i=−α}^{α} δ_{t+i} Cov(M_t, M_{t+i}) = (1/Δt) δ_t Var(M_t) = 0,

where the last two equalities follow from the independence of each M_t and the fact that the central coefficient δ_t in any central finite difference for the first derivative is zero. For non-central differences this is not zero, the last equality does not hold, and Cov(M_t, Ṁ_t) = (1/Δt) δ_t Var(M_t). With random processes for the mass and mass rate of change calculated, we can calculate the decay rate coefficient as a function of these two random processes:

k_t = −Ṁ_t / M_t,

which is a ratio of two Gaussian random processes which may be correlated depending on the finite differencing scheme. The calculation of such a ratio of random variables (Fieller, 1932; Hinkley, 1969) describes the uncertainty of the decay coefficient at each time as a PDF f_{k_t}(k).
Calculating the PDF for the instantaneous e-folding time τ_t = k_t⁻¹ is performed by applying standard rules for functions of random variables (from DeGroot and Schervish, 2012):

f_{τ_t}(τ) = (1/τ²) f_{k_t}(1/τ).

Notably, neither the distribution for the decay rate coefficient nor that for the e-folding time is Gaussian.
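In practice the Fieller/Hinkley ratio PDF can be cross-checked (or approximated outright) by Monte Carlo sampling of the two Gaussians; the moments below are invented for the sketch:

```python
import numpy as np

rng = np.random.default_rng(4)

# Assumed Gaussian moments for the mass and its central-difference derivative
# at one time step (illustrative values, in kt and kt/day)
mu_M, sd_M = 1000.0, 50.0          # M_t
mu_dM, sd_dM = -100.0, 10.0        # Mdot_t

n = 1_000_000
M = rng.normal(mu_M, sd_M, n)
dM = rng.normal(mu_dM, sd_dM, n)   # independent of M for a central scheme

k = -dM / M                        # decay rate coefficient k_t = -Mdot_t / M_t
tau = 1.0 / k                      # e-folding time tau_t = 1 / k_t

# Empirical PDF of tau (days); the point estimate mu_M / (-mu_dM) is 10 days
hist, edges = np.histogram(tau, bins=200, range=(5.0, 15.0), density=True)
mode = 0.5 * (edges[np.argmax(hist)] + edges[np.argmax(hist) + 1])
```

The histogram exhibits the right skew expected of a ratio of Gaussians, consistent with the remark that neither k_t nor τ_t is Gaussian.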
Author contributions. DMH and MJP conceived of the main concepts. DMH developed the mathematical framework and details for the probabilistic retrieval. DMH developed the code used to construct the background spectra and perform the retrieval. MJP performed all radiative transfer modelling. MJP and DMH analyzed the sensitivity of the Jacobians and developed the specialized retrieval for strong loading. DMH and MJP both tested and tuned the retrieval as well as interpreted and discussed the results. DMH and MJP wrote the manuscript.
Competing interests. The authors declare no competing interests.