Estimating and Reporting Uncertainties in Remotely Sensed Atmospheric Composition and Temperature

Remote sensing of atmospheric state variables typically relies on the inverse solution of the radiative transfer equation. An adequately characterized retrieval provides information on the uncertainties of the estimated state variables as well as on how any constraint or a priori assumption affects the estimate. Reported characterization data should be intercomparable between different instruments, empirically validatable, grid-independent, usable without detailed knowledge of the instrument or retrieval technique, traceable, and still have reasonable data volume. The latter condition of adequacy may force one to work 5 with representative rather than individual characterization data. The main sources of uncertainty are measurement noise, calibration errors, simplifications and idealizations in the radiative transfer model and retrieval scheme, auxiliary data errors, and uncertainties in atmospheric or instrumental parameters. Some of these errors affect the result in a random way, while others chiefly cause a bias or are of mixed character. Beyond this, it is of utmost importance to know the influence of any constraint 1 https://doi.org/10.5194/amt-2019-350 Preprint. Discussion started: 14 October 2019 c © Author(s) 2019. CC BY 4.0 License.


Introduction
Observations from remote sensing instruments are central to many studies in atmospheric science. The robustness of the conclusions drawn in these studies is critically dependent on the characteristics of the reported data, including their uncertainty, 5 resolution, and dependence on any a priori information. Adequate communication of these data characteristics is therefore essential. Further, when, as is increasingly the case, observations from multiple sensors are considered, it is important that these characteristics be described in a manner that allows for appropriate intercomparison of those characteristics and the observations they describe. In the satellite community, however, the definition of what constitutes "adequate communication" is far from uniform. Currently, multiple retrieval methods are used by different remote sounding instrument groups, and various 10 approaches to error or uncertainty estimation are applied. Furthermore, reported uncertainties are not always readily intercomparable. For example, the metrics used as uncertainty values for a data set might not be properly defined (as, say, 1σ or 2σ values, or as an appropriate confidence interval), uncertainty values might not be adequately described as "random" or "systematic" in nature (let alone any more nuanced description of inter-error correlations), spatial resolution information or the influence of a priori content might not be given, etc. This paper discusses these issues and proposes a common framework for 15 the appropriate communication of uncertainty and other measurement characteristics. This review has been undertaken under the aegis of the 'Towards Unified Error Reporting' (TUNER) project and was carried out by retrieval experts from the atmospheric remote sensing community (including active participation from eight different instrument science teams) who have come together to tackle the (arguably daunting) goal of establishing a consensus approach for reporting errors, hopefully enabling more robust scientific studies using the retrieved geophysical data products. This review 20 paper, the first 'foundational' paper from the TUNER team, is mainly addressed to the providers of remotely sensed data. A paper addressed to the data users, guiding them through the correct use of the uncertainty information, is currently being written (Livesey et al., in preparation).
Most concepts presented in this paper rely on the assumption that providing the user with the result of the retrieval, a measure of estimated error or uncertainty along with correlation information, and sensitivity to possible a priori information used is 25 sufficient for most scientific uses. In other words, there is no need for more detailed discussion of the expected distribution of the retrieved values around a true value (or around the expectation value of the retrievals) to be provided. That said, we recognize that they might be useful for some specialized quantitative applications.
The well-informed reader will already be acquainted with most of the material in this paper, although those less familiar with retrieval algorithms may find it a useful introduction. Firstly we list conditions of adequacy of the reporting of error and 30 uncertainty (desiderata), which summarize the information that should be provided to the data user (Section 2). Next, before diving headlong into the technical details, Section 3 attempts to offer some necessary clarification of various terminological issues. In Section 4 we lay down the formal background. In particular, we discuss the retrieval equation and try to provide unambiguous interpretations of all involved terms, enabling the informed reader to map their own notation and terminology to that discussed herein. In our discussions of retrieval theory we will not reinvent the wheel but build heavily on the framework laid out by Rodgers (1976Rodgers ( , 1990Rodgers ( , 2000. Importantly, however, our characterization is applicable beyond the retrieval schemes endorsed therein, including many in every day use among remote sounding teams. Section 5 discusses how the theory translates into real world problems, centering on how the full retrieval problem is decomposed into sub-problems. Following this, we 5 turn towards error estimation and uncertainty assessment. We then systemize and discuss the various sources of retrieval error (Section 6) and, if applicable, their dependence on the retreval scheme chosen. We identify data characterization methods currently in use and relate them to the theoretical concepts presented. Recommendations on unified error reporting for spaceborne atmospheric temperature and composition measurements are given in Section 7. In these recommendations we refrain from stipulating conventions and confine ourselves to recommendations that can directly be inferred from the conditions 10 of adequacy. Finally, we identify unsolved problems and applications which might not be fully covered by our framework (Section 8).

Conditions of adequacy for diagnostic metadata
With the ultimate goal of presenting a list of recommendations to the community of data providers, we first discuss a list of desired properties of diagnostic metadata from the point-of-view of a data user. We refer to diagnostic metadata as error or 15 uncertainty estimates and all information on the content of a priori data, spatial resolution, and the like. The list of possible metadata to characterize retrievals of atmospheric state variables is huge, but some of them are more useful than others. Here we define conditions of adequacy (CoA) for error and uncertainty reporting. These conditions will be used as criteria which metadata are indeed essential and should thus find their way into the recommendations.
CoA 1. The error estimates should be intercomparable among different instruments, retrieval schemes, and/or error estimation 20 schemes.
CoA 2. The estimated errors should be independent of the vertical grid such that correct propagation of the errors to a different grid yields the same errorestimates as the direct evaluation for a retrieval on the new grid would do.
CoA 3. The error budget shall be useable without detailed technical knowledge of the instrument or retrieval technique. This enables the data user to correctly apply error propagation laws and calculate uncertainty in higher level data products. inconsistent'; we understand this in a sense that the underlying methodology and mathematical tools are the same, and that the differences are restricted to the interpretation of the terms under dispute.
The GUM-stipulated framework, however, does present a dilemma when seeking to unify terminology in the TUNER arena.
On the one hand, we are not in favour of brushing away the common interpretation whereby the term 'estimated error' is used for a statistical quantity that reflects the difference between the true value and the value inferred from the measurement. It 5 remains to be seen whether the new terminology stipulated by the JCGM (JCGM, 2008a and2008b) will be widely accepted.
Accordingly, given the significant heritage within the atmospheric remote sensing community, renaming long-established concepts would not promote our goal of 'unification'. In recent scientific literature, terms like 'estimated measurement error', 'error analysis', 'error covariance matrix' or 'standard error of the mean' are still widely in use, and replacement terms like 'standard uncertainty of the mean' etc., are rarely invoked. On the other hand, we recognize that explicitly breaking with the 10 official stipulations of the JCGM does not advance the overall goal of 'unification' either.
For the purposes of the following discussion we define 'error' to be the difference between an unknown truth and a value inferred from measurements. 'Uncertainty' describes the distribution of an error. This can be summarized with metrics such as the total squared error, which can be decomposed into systematic and random components that are reflected by bias and variance. We will often use the word 'error' as a part of composite terms, (e.g., 'parameter error', 'noise error', 'retrieval 15 error', 'estimated error ', etc.). When we use a composite containing the term 'error', this does not imply that the uncertainty interpretation is excluded, and conversely, when we use a composite term containing the term 'uncertainty', this does not imply that the error interpretation is excluded. The use of the term 'error' as a generic term in the sense of 'measurement noise causes an error in the inferred quantity' is probably uncontroversial and can be accepted both by adherents of the error concept and adherents of the uncertainty concept. 20 We think that no terminology is per se better than another one, as long as it is clearly defined. Instead of further fueling the terminological conflict, we try to concentrate on the content and to lay down an error reporting framework custom-tailored to remote measurements of atmospheric temperature and constituents that is more detailed and specific than most of the previous literature. 25 Regardless of whether one prefers to call the estimated retrieval error 'uncertainty' or the uncertainty of the measurement 'estimated error', there are still two different ways to evaluate this quantity. One relies on generalized Gaussian error propagation or, particularly in grossly nonlinear problems, on sensitivity studies, either as case studies or in a Monte Carlo sense. Uncertainties of input quantities are propagated through the data analysis system to yield the uncertainties of the target quantities.

Ex ante versus ex post error estimates
The other way relies on a statistical analysis of the results, e.g., by comparison to other observations. Many different terms 30 are commonly used to distinguish between these different approaches. In Joint Committee for Guides in Metrology (JCGM) (2008a), the first fall into their 'category B', while the second are 'category A '. Von Clarmann (2006) distinguishes between ex ante and ex post error estimates, reflecting the fact that error propagation can be calculated even before the measurement has been made, while the statistical analysis of the measurements requires the availability of actual measurements. Along the same line of thought, one could also talk about error prediction versus evidence of errors. Since error estimation is deterministic with respect to the estimated variances (but certainly not with respect to the actual realizations of the measurement error), and since statistical analysis of any evidence follows the laws of inductive logic (Carnap and Stegmüller, 1959), one could also distinguish between deductive and inductive error estimation. Others prefer to use the terms 'bottom up' and 'top down' for this dichotomy. This study focuses chiefly on ex ante error estimation. To validate these estimates, ex post error estimation is 5 relevant, as expounded, e.g., by Keppens et al. (2019).

Retrieval theory and notation
Measurements -also most so-called direct measurements -invoke inverse methods. The only exception is a direct comparison where the measurand is directly accessible via human sensation, like length measurement by comparison with a yardstick or determination of colour by comparison with a colour table. The inverse nature of most measurements is due to the fact that 10 the measurand x is the cause and the measured signal y is the effect. These are connected via a natural regularity which is formalized via a function which maps the discretized measurand onto the respective observable signal, and where m and n designate the number of measured data points and the number of state values, respectively. 15 In the macroscopic world, exempt from quantum processes, the measured effect is thus, for given conditions, a deterministic unambiguous function of the measurand. While microscopic processes can admittedly be indeterministic, their statistical treatment for ensembles of sufficient size leads to deterministic laws. Irreducibly non-deterministic components contribute to the measurement noise. In contrast, the conclusion from the measured signal y to the measurand x is not always unambiguous because in many cases the inverse function 20 g : R m → R n : y → x = f −1 (y) can only be approximated due to the over-or underdetermined or otherwise ill-posed nature of the problem and the large rank of the matrix to be inverted.
In some cases, the inverse process can be quite trivial, e.g., in the case of a temperature measurement with a mercury thermometer. The causal process is the thermal expansion of mercury, and the inverse conclusion goes from the volume of the 25 mercury to the ambient temperature. The scale of the mercury thermometer is simply a pretabulated solution of the inverse process for various temperatures. In other applications, such as remote sensing of the atmosphere from space, the inverse process is slightly more complicated because f −1 (y) does not usually exist. Related workarounds to solve this problem are discussed below.
Remote sensing of the atmospheric state from space relies in one form or another on the radiative transfer equation (Chan-30 drasekhar, 1960). This equation is deterministic in a sense that its formulation f simulates the measured signal via causal processes. The deterministic characteristic of f in the macroscopic world is achieved via a statistical treatment of the underlying microscopic processes. While its forward solution allows the calculation of the radiance received by the instrument, its inverse solution allows for the determination of the state of the atmosphere from a known radiance signal.
Roughly following the notation of Rodgers (2000), we define F to be the radiative transfer model which approximates f .
F is a vector-valued non-linear function and deviates from f in that it is discrete in space and frequency, involves numerical 5 approximations and may not include the full physics of radiative transfer. x ∈ R n is the vector representing the atmospheric state, and y ∈ R m the vector containing the measured radiance signal. The elements of x contain both the 'target variables' and 'joint-fit' variables. Target variables are those variables we are actually interested in. Conversely, the joint-fit variables are variables needed by F that, while not the focus of our interest, have to be sought in the inversion because they may not be accurately known and their uncertainties would otherwise make an unacceptably large contribution to the total error budget. 10 Typically m = n, i.e., the dimension of x does not equal the dimension of y. For the overdetermined case (m > n), Gauss (1809) 1 suggested an approximate inversion obtained by minimizing the sum of squares of the residual F (x)−y. If we assume, for now, that F is linear and that Gaussian distributions of are adequate to characterize the measurement (see Section 5.5 for related problems) the unconstrained 2 solution of the inverse problem iŝ K is the Jacobian matrix with the elements K ij = ∂yi ∂xj , x 0 represents an initial guess of the atmospheric state, and S y,total is the covariance matrix characterizing the total measurement error. Here theˆsymbol indicates that, due to measurement noise mentioned above, and other uncertainties and ambiguities which will be discussed below, the result of the inversion is only an estimate of the measurand x. In most real-world applications, only measurement noise is considered here, while other measurement uncertainties like calibration errors are neglected at this stage. Since the solution provided by Eq. 3 does not 20 consider any prior information, it is a "maximum likelihood" solution in the sense of Fisher (1922Fisher ( , 1925 One major difference between our notation and Rodgers' notation refers to the error covariance matrices S. We use two subscripts. The first indicates if the uncertainties refer to the retrieved quantities x or to the ingoing quantities. The second subscript specifies the source of the uncertainty. For example, S y,noise is noise in the measurement data, while S x,noise is the measurement noise mapped into the retrieved atmospheric state. In other words, S x,noise is the error component in x due to 25 the error source S y,noise . In some cases, e.g., if any ambiguity can be excluded or if the sources of the error are not known, the second subscript can be missing.
By explicitly assuming equally distributed, i.e., uniform prior, state values Gauss (1809, p. 211) gave this solution a probabilistic interpretation without clashing with the Bayes (1763) theorem. In a linear context and for measurement errors following 1 The first publication of a least squares method was actually by Legendre (1805), but Gauss is said to have had this idea about ten years before. Obviously unaware of Legendre's work, also Robert Adrain (1808) proposed the least squares method as the most advantageous solution in this context. See Merriman (1877), Sprott (1978), or Stahl (2006 for a deeper discussion of the priority regarding this method. 2 See below for a deeper discussion of this term. 3 See below for a deeper discussion of this term. a normal 4 distribution around the true value, the Gaussian least squares solution corresponds formally -but certainly not in terms of its interpretation -to a maximum likelihood solution in the terminology of Fisher (1922Fisher ( , 1925 (thus the index ML inx ML ). An interesting overview on the history of maximum likelihood estimates is given by Hald (1999), while the justification of this method is critically discussed in Aldrich (1997). For instructive discussions of the relevance of the Bayes theorem in inductive statistics, see, e.g., Bar-Hillel (1980) and Thompson and Shuman (1987). The original Gaussian least 5 squares method was valid for independent measurement errors only. The introduction of the correlation coefficient by Galton (1888) and Pearson (1896) paved the way towards a wider range of applications. The matrix formulation as used today, where correlated measurement errors are represented in the measurement error covariance matrix S y , owes much to Yule (1907), Fisher (1925), and Aitken (1935). A reconstruction of the historical development of this technique was performed by Aldrich (1998). 10 If the inverse problem is underdetermined (m < n) or ill-posed in a sense that the K T S −1 y,total K matrix is singular or has a high condition number, then a constraint has to be used. Even in formally well-conditioned problems but large measurement noise, the use of a constraint can be helpful. With a prior assumption on the atmospheric state x a and a regularization matrix R we can modify Eq. (3) in a way that the matrix inversion can be accomplished. This so-called regularized solution is (von Clarmann et al. 2003, building upon Rodgers 2000Phillips 1962;Tikhonov 1963;Twomey 1963;Steck and von Clarmann 15 2001) Many choices of the regularization matrix R are possible. With the (n−1)×n first order differences matrix L 1 and γ a scaling parameter to control the strength of the regularization and balancing the units, the choice of 20 renders fields of profiles of atmospheric state variables that are smoothed in the sense of reduced altitude-to-altitude differences of thex reg − x a profile, thus avoiding unphysical oscillations that typically result from instabilities associated with ill-posed inverse problems.
If we represent the best known a priori statistics about the targeted atmospheric state as x a , its covariance matrix as S a , the inverse of this matrix as R, and continue to assume Gaussian error distributions, then we get a Bayesian solution that 25 is usually referred to as 'optimal estimate' (Rodgers, 1976) or 'maximum a posteriori (MAP) solution' (Rodgers, 2000) and is fully compatible with the Bayes (1763) theorem and information theory by Shannon (1948), and thus gives the solution a probabilistic interpretation in the sense of the maximum a posteriori probability: The formalism of Eq. 6 can also be used without committing oneself to a probabilistic interpretation of S a . For example, S a can be rescaled to give less weight to the priori information.
This equation, however, holds only if the variability of the atmospheric state is fairly well covered by a Gaussian probability density function. To characterize the variability of highly variable trace gases, a log-normal probability density function can be more adequate. It avoids, e.g., that non-zero a priori probability densities are assigned to negative mixing ratios. Technically, 5 this is achieved by using Eq. (6) but re-interpreting x as the logarithm of the concentrations and S a as the covariance matrix of these logarithms. This is, for instance, important for tropospheric water vapour (e.g., Hase et al., 2004or Schneider et al., 2006. However, there is a price to be paid, in that this then casts the measurement error in terms of a log-normal distribution also. The positive bias caused by the retrieval of logarithms of concentrations in the case of measurement noise oscillating arount zero signal has been investigated by Funke and von Clarmann (2012).

10
For brevity, we define the gain matrix (Rodgers, 2000) which will play an essential role in error estimation. The remainder of this paper broadly identifies all relevant sources of uncertainties including measurement noise, approximations, idealizations, and assumptions.
Tables 1 and 2 summarize the retrieval schemes used by a number of satellite data processors. 15

Retrieval in the real world
Application of Eqs. (3-6) usually involves many approximations and idealizations including discretization, decomposition of the argument of the radiative transfer function into variables and parameters, spatial decomposition, and non-linearity issues, just to name a few. Since all these approximations give rise to retrieval errors, a full understanding of them is of utmost importance when quantifying the error budget of a measurement. 20

Discretization
At least on macroscopic scales, atmospheric state variables are construed as continuously varying in space and time. In the retrieval equations they are, however, represented by vectors with a finite number of elements, each representing the atmospheric state at a gridpoint. If the discretization is too fine, a stronger regularization is needed to fight ill-posedness of the inversion, while a too coarse discretization can cause errors in the radiative transfer modelling and limits the spatial resolution of the 25 solution. In a maximum likelihood retrieval, the grid-width is identical to the spatial resolution of the retrieval.
In this context we note that the atmospheric state does not necessarily need to be represented as vertical profiles where each element of x is a state variable at a certain altitude or location, or represents an atmospheric layer. Alternative representations include, e.g., principal components/empirical orthogonal functions (see, e.g., Boukabara et al. 2011;Munchak et al. 2016;Duncan and Kummerow 2016). These can be inferred from an ensemble of spatially highly resolved prior measurements. 30 The unknowns in the retrieval are the weights of the principal components. Complete or partial neglect of higher principal components will regularize the retrieval. Such an approach is under consideration for the Atmospheric Limb Tracker for Investigation of the Upcoming Stratosphere (ALTIUS) mission (Fussen et al., 2016). A similar approach was tried for the Scanning Imaging Absorption Spectrometer for Atmospheric Chartography (SCIAMACHY) (Doicu et al., 2007), for the multichannel infrared radiometer on the Geostationary Operational Environmental Satellite (GOES-13) and infrared sounder on the Indian National Satellite (INSAT-3D) by Jindal et al. (2016). The retrieval of vertical column amounts by simply scaling the 5 initial guess profile reduces the profile retrieval to a single degree of freedom. Alternatively, the altitude axis of the profile can be stretched or compressed using the so-called 'downwelling factor' as suggested by Toon et al. (1992). These approaches are often used for analysis of measurements which do not provide direct information on the vertical distribution of the target species. Particularly in the greenhouse gas monitoring community, retrieved column amounts of target species are divided by the molecular oxygen column amount retrieved with the same instrument. Rescaling of the quotient by the 0.20946 gives 10 the column-averaged dry-air mole fraction XCO 2 or XCH 4 . The benefit of this approach is a cancellation of multiplicative systematic error components (see, e.g., Wallace and Livingston, 1990;Yang et al., 2002;Wunch et al., 2010;Reuter et al., 2011). Similar arguments hold for isotopic ratios (e.g., Piccolo et al., 2009;Schneider et al., 2016) or ratios between trace gas profiles (e.g. García et al., 2018).

5.2
The measurement error covariance matrix 15 Typically in real-world applications, the measurement error S y in Eqs. (3-6) contains only measurement noise, while other sources of measurement error are often ignored during the retrieval (see Section 6.1 for details) and typically analyzed after performing the retrieval. Since this treatment deprives any solution from its claimed optimality (Cressie, 2018), in some cases the measurement noise is artificially "inflated" to account for potential calibration uncertainties. A method to include multiple types of uncertainties in the measurement error covariance matrix is discussed in Marks and Rodgers (1993), Tarantola and 20 Valette (1982), Eriksson (2000), and von Clarmann et al. (2001). These authors discuss the possibility of mapping all relevant error contributions into the measurement space and include them in the S y matrix 5 . Rodgers (2000, Section 4.1.2) views this problem from a different perspective but the suggested solution is mathematically equivalent to the approach suggested above.

Variables and parameters
While the measurement typically depends on a large number of atmospheric state variables, only a few of them are actually 25 dealt with as unknowns. The other variables are assumed to be known and are dealt with as constant parameters. For example, in an ozone profile retrieval the atmospheric temperature profile may be assumed to be known and thus not be included in the retrieval vector x. With this, the forward problem can be formalized as where b is the vector of parameters, which are separated in the argument of function f by the semicolon. The respective inverse solution reads in its general form The uncertainties of parameters b affect the estimatex and thus a parameter error term has to be included in the error budget.

5
Practical reasons typically force one to decompose the inverse problem, e.g., to reduce the size of the problem in order to achieve numerical efficiency. Often a part of the measurement is virtually insensitive to some of the atmospheric state variables.
The general idea of decomposition is to isolate subsets of the entire set of measurements that are mainly sensitive to only a subset of the unknown variables. This decomposition can be made according to spectral or geometrical criteria (see below).
Decomposition of the inverse problem can be done either in an "optimal" or in a "non-optimal" way. The optimal decom-10 position solves the inverse problem sequentially, where at each step the retrieval is made for the full x-vector but based only on a subset of the measurements, whereby each measurement is used only once during the entire process (see, e.g., Rodgers, 2000, Ch. 5.8.1.3; his requirement of a diagonal measurement covariance matrix can be replaced by the weaker requirement of a block-diagonal covariance matrix if the algebra is adjusted accordingly). Initially, the retrieval, which typically is patently under-determined because of the temporarily ignored measurements, is constrained by an initial S a matrix. For each subsequent 15 step, the S a matrix is replaced by the so-called 'retrieval covariance matrix' of the preceding step. Within linear theory, the solution of such sequential methods is equivalent to the direct solution of the full inverse problem.
More frequently used is non-optimal decomposition. Here the relevance of some components of the state vector for the 20 measurements is temporarily ignored, and the retrieval solves the inverse problem only for a part of the state values, using only a subset of the measurements. This approach lends itself to problems where it is adequate to assume that the Jacobian matrix K has an almost block diagonal structure, that is, that there are state variables which have no significant influence on some of the measurements under analysis and vice versa. In the following we discuss spectral and spatial decomposition.

Spectral decomposition 25
Not all spectral gridpoints or channels of a spectrometer or a multi-channel radiometer are equally sensitive to all unknown variables. For example, the subset of the measurements used to retrieve the ozone concentration may be insensitive to the concentration of water vapour (Flittner et al., 2000). In such cases, the abundances of various species can be retrieved in sequence, using dedicated "microwindows" in infrared spectroscopy (see, e.g. von Clarmann and Echle, 1998;Echle et al., 2000;Dudhia et al., 2002), different spectral regions in microwave radiometry (Livesey et al., 2003(Livesey et al., , 2006 or measurements 30 in the ultraviolet and visible (UV-VIS) spectral range (e.g., Bovensmann and M. Gottwald, 2011). In these cases, a subset of spectral points is selected for analysis. Those unknowns which have sizeable impact on the signal at these spectral points are retrieved. When in later steps other spectral points are analyzed, the results of the first steps can be used and either be treated as known parameters, or as a priori information in an optimal sequential scheme. Uncertainties entailed by this procedure are associated with the following problems: (1) In the first step some of the disregarded variables may still introduce some error; (2) retrieval errors of all kinds resulting from a prior step of the sequential scheme propagate onto the results of later steps; and 5 (3) inconsistencies in spectroscopic parameters between different spectral points can cause a spurious residual signal when, e.g., the concentration of a gas retrieved in one part of the spectrum is used as a known parameter in the analysis of another part of the spectrum.
Spectral decomposition is also often used for the retrieval of a single species. For example, Kramarova et al. (2018) retrieve ozone sequentially in different spectral bands. An alternative to spectral decomposition is the simultaneous analysis of the full 10 spectrum (e.g., Serio et al., 2016). In cases when spectroscopic data are consistent over the entire spectral range it will best exploit the observational information.

Geometric decomposition
In the case of nadir sounding, lines of sight referring to different ground-pixels cross different parts of the atmosphere and can thus be analyzed independently without sizeable loss of information. In limb sounding, first suggested around the same time 15 by Gille (1968), Blamont and Luton (1972), Hays et al. (1973) and Donehue et al. (1974) for different scientific contexts, the situation is more complicated because the retrieval of a state value at a given altitude depends on the knowledge of the same state value at other altitudes passed by the line of sight.
If the same air parcel is seen under multiple geometries, the measurements have a tomographic nature. Since the simultaneous retrieval of all these intertwined measurements easily exceeds available computational resources, often only a subset of the 20 measurement geometries are analyzed in one step.
More specifically, the algorithm can be constructed such that only a subset of the measurements are needed to retrieve the atmospheric state corresponding to a given subset of the state vector elements that affect signals along the raypath of the considered measurement. The two most prominent examples are single profile retrievals and onion peeling. Typical approaches to decompose the entity of measurements geometrically are listed in Table 1. 25 In some cases, the geometric profile reconstruction is decoupled from the spectral inversion. In order to gain numerical efficiency, the inversion can be performed in sequential steps. Such an approach is realized for GOMOS two-step inversion, which decompose the retrievals into the spectral inversion followed by the vertical inversion using the concept of effective cross-sections (Kyrölä et al., 1993).

30
Optimal decomposition techniques formally retrieve all relevant variables x in each step but measurement information y of only a subset of the measurement geometries is used. Since, in a maximum likelihood setting, such a retrieval would be hopelessly underdetermined, sequential estimation as described above lends itself to this class of problem. Every state variable can be updated as soon as new information becomes available. In contrast, non-optimal techniques will not update any quantity once retrieved.

Single profile retrieval vs. 2D/3D-retrievals
The vast majority of limb sounding retrievals assume local spherical homogeneity of the atmosphere, i.e. considering only vertical variations in the atmospheric state around the line of sight, and neglecting horizontal variability (e.g. Gille, 1968;5 McKee et al., 1969a, b;House and Ohring, 1969;Carlotti, 1988). Russell III and Drayson (1972) explicitly state the assumption, and only a small number of retrieval schemes relinquish it. In solar occultation observations, where the measurement geometry is determined by the position of the sun and the instrument and where at most one sunset and one sunrise can be observed per orbit, there is not much choice; tomographic multi-limb-scan retrievals are out of reach and the single profile retrieval is the way to go. 10 For limb measurements, von Clarmann (1993) suggested a non-optimal decomposition similar to "onion peeling" (see below) but in the horizontal domain. This approach, however, was never put into action. Carlotti et al. (2001) proposed to solve the inverse problem for a full satellite orbit instead of for single limb-scans. This tomographic method was published under the name 'geofit'. Steck et al. (2005)  approaches, whereby a 2-dimensional along-track curtain of profiles is simultaneously retrieved from multiple sets of limb scans. A similar approach is used for SCIAMACHY retrievals of metals (Scharringhausen et al., 2008;Langowski et al., 2014) and NO (Bender et al., 2013(Bender et al., , 2017 and for OMPS-LP ozone . Dudhia andLivesey (1996) andvon Clarmann et al. (2009) use prior information on the horizontal variation of state variables in a single limb-scan retrieval. The latter scheme lends itself particularly to reprocessing of data when the initial processing 20 information on the horizontal variability is already available. This approach has been critically analyzed by Castelli et al. (2016). Tomographic approaches and the effect of horizontal gradients were investigated for SCIAMACHY limb measurements by Pukite et al. (2008) and Pukite et al. (2010). A series of OSIRIS orbits allowed the tomographic analysis of polar mesospheric clouds (Hultgren et al., 2013). This application was preceded by theoretical studies on OSIRIS infrared channels tailored for tomography.

25
Most other limb sounding retrieval schemes use the spherical homogeneity approximation, although this approach can be challenged for limb sounders. For example, Kiefer et al. (2010) provided evidence of biases in trace gas retrievals from MIPAS limb emission spectra due to horizontal temperature gradients. Thus, neglect of the horizontal variation of the atmospheric state needs either to be corrected or to be considered in the error budget.
In the case of nadir sounding, single profile retrievals seem to be the natural thing to do, since a raypath associated with 30 one geolocation intersects each altitude level only once. However, in the UV-VIS spectral range multiple scattering along with strong inhomogeneities in the surface reflection or cloud coverage might cause some interplay between the neighbouring pixels.

Onion peeling
In the "onion peeling" approach, (Gille, 1968;McKee et al., 1969a, b;House and Ohring, 1969;Russell III and Drayson, 1972;Goldman and Saunders, 1979) the collection of limb measurements in a vertical scan is decomposed into a sequence of retrievals, each dealing with one tangent altitude, starting at the top and working down. This method builds upon the fact that the bulk of the information obtained along the horizontal line of sight originates from the vicinity of the tangent point, with limited 5 information from above and essentially none from below. In the first step, the measurement associated with the uppermost tangent altitude is analyzed and the profile above is scaled. Then the second tangent altitude from the top is used and the profile between this tangent altitude and the tangent altitude above is scaled; this is repeated until the lowermost tangent altitude is reached. Often the discretization of the atmospheric state corresponds to the tangent altitude pattern, i.e., there is one profile point per tangent altitude, and the profile shape between the points is determined by interpolation. Gaussian elimination is 10 already provided by the measurement geometry, and the Jacobian K has a quasi-triangular structure 6 . This approach, however, is prone to instabilities (von Clarmann et al., 1991). The alternative would be that layer values are retrieved instead of level values.
In the early era of limb sounding and solar occultation measurements, onion peeling was the work-horse data analysis algorithm and was used, among others, in the following missions: LIMS (Bailey and Gille, 1986), ATMOS (Norton and 15 Rinsland, 1991), HALOE (Russell III et al., 1993), CRISTA (Offermann et al., 1999). More recently, onion-peeling related algorithms have been used, e.g., for, TIMED-SABER (Russell III et al., 1994), AIM-SOFIE (Gordley et al., 2009a), and SCIAMACHY (Noël et al., 2018). When more computer power along with quasi-analytical algorithms to calculate larger Jacobians became available, onion peeling was often superseded by global-fit-like algorithms (Carlotti, 1988) which solve the inverse problem for the entire limb sequence in one leap. 20 Approaches related to onion peeling are the Mill-Drayson method (1978) and the 'interleave method' (Thompson and Gordley, 2009). The Mill-Drayson method starts with the lowermost tangent altitudes and scales the entire profile of the atmospheric state variables above to minimize the residual between measurement and modeled signal. Next, the second tangent altitude from bottom is used to scale the related upper segment of the profile above the related tangent altitude. Several iterations over the limb scan are made. The goal is to avoid the typical onion-peeling error propagation which tends to trigger oscillations in 25 the profiles. This method became somewhat obsolete with the advent of numerical regularization. Without knowledge of the original method by Mill and Drayson this method has been applied to the SOFIE instrument by Marshall et al. (2011).
The interleave method decomposes the limb scan in multiple disjoint subsets of measurements, e.g., such that one set contains the tangent altitudes with even numbers and the other those with the odd numbers. For each subset of measurements an independent onion peeling retrieval is performed. Finally both the resulting profiles are merged to give one profile. The 30 goal of this method is to get rid of the onion-peeling oscillations, which is achieved by having thicker layers and thus better sensitivity -at the cost of degraded vertical resolution -in each retrieval step. The interleave method has been used, e.g., for HALOE and SABER. 6 We are saying 'quasi-traingular' here because, due to over-determination at each tangent altitude, K can have more rows than columns.
As will be seen below, rigorous error propagation for onion peeling retrievals and its variants is tedious and thus rarely performed. Instead, Monte-Carlo-type sensitivity studies can be performed on the basis of simulated measurements superimposed with artificial noise, which are analyzed using the onion-peeling scheme. The error estimate is then provided by the variance of the ensemble results around the reference value at each altitude.

Chahine's relaxation method 5
The Chahine relaxation method (Chahine, 1968(Chahine, , 1970 was originally suggested to retrieve vertical profiles of the temperature from measurements of the emerging specific intensity at several frequencies in the infrared spectral range. Later this method has been adapted by employing the geometrical decomposition to the retrieval of vertical distributions of atmospheric trace gases from the measurements of the scattered solar light in limb viewing geometry (e.g., Sioris et al., 2003Sioris et al., , 2004. Essentially, the measurement and state vectors have to be constructed in a way that for each of its components the following 10 linear relationship can be considered as an acceptable approximation: In the original approach of Chahine, the measurement vector y comprised measured radiances, F (x a ) the related modeled radiances, the state vector x comprised Planck functions at certain pressure levels, and spectral decomposition was applied, i.e., Eq. 11 was solved for each frequency independently. In the approach of Sioris, the measurement vector contained trace gas 20 slant columns at each line-of-sight, the state vector contained trace gas number densities at altitude levels and Eq. 11 needed to be solved for each line-of-sight independently, i.e., the geometrical decomposition was employed.
The Chahine relaxation method is a nested iteration of the type The inner loop runs over the altitude indices j and is usually started at the top of the atmosphere and procedes downwards, 25 similarly to the onion peeling method. However, in this inner loop, the information retrieved at higher levels is not directly used when Eq. 11 is solved for lower layers. Instead, the same current guess profile x i is used to evaluatex i+1 for all altitudes j. Only after finishing the inner iteration over the altitudes j, the state vector x i is updated with x i+1 . The outer iteration over i is repeated until convergence is reached.
Similarly to the onion peeling, rigorous error propagation for the Chahine relaxation method is challenging and the same 30 approach as suggested for the onion peeling method can be used instead.

Nonlinearity issues
The radiative transfer equation is nonlinear. This problem can be remedied by putting the retrieval equation used in an iterative context, e.g., for maximum likelihood or regularized problems, respectively, where i is the iteration index. The last term in the latter equation assures that the prior information will not be 'forgotten' during the iteration (see, Rodgers 2000, p. 88).

10
To avoid seeking anx that is beyond the range of validity of the linear approximation y( Levenberg (1948); Marquardt (1963) suggested a method that limits the stepwidth x i+1 − x i and turns it towards the direction of the steepest descent of the object function: where λ is a scalar that is adjusted during the iteration according to the local non-linearity of F and I is unity. Marks and Rodgers (1993) and Rodgers (2000) suggest the following variant: Butz et al. (2012) have found that in some cases a reduced step-size Gauss-Newton algorithm works much better than the Levenberg-Marquardt algorithm (Eq. 15).
Many inverse radiative transfer problems are only "moderately non-linear" (in the sense of Rodgers, 2000) in that the 25 retrieval equations are solved iteratively, to cope with non-linearity, but linear error estimation around the best estimate is considered adequate. If error bars are so large that they exceed the range around the best estimate where the true function Carlo methods, is that the posterior distributions, which can significantly deviate from the Gaussian ones, can be explored and characterized in detail, as demonstrated by Tamminen and Kyrölä (2001), Tamminen (2004), and Brynjarsdottir et al. (2018).
Also neural network based concepts have been developed and investigated in this context (see, e.g. Pfreundschuh et al., 2018).
Monte Carlo error estimates exceed the computational resources needed for the retrieval by far. Thus, they are often not apt for routine applications but their range of application remains limited to representative test cases.

6 Sources of Errors
There are multiple categories of errors and uncertainties in atmospheric state variables retrieved from satellite measurements.
These are: 1. errors caused by less than perfect measurements, which include measurement noise and calibration errors, 2. errors caused by inaccuracies of the radiative transfer model used in the data analysis, which include numerical ap- 10 proximations, missing physical processes, or uncertainties in the values used as constants by the model, particularly spectroscopic parameters, 3. errors caused by decomposing the inverse problem, giving rise to parameter errors, 4. errors caused by the constraint applied to the retrieval, which does not allow the retrieval to produce the solution that is best compatible with the measurements. 15 Another factor that can cause discrepancies between two sets of measurements is that the measurements might not refer to exactly the same air mass or the same time. This, along with natural variability, often explains the differences encountered (see, e.g., Sofieva et al., 2008;Verhoelst et al., 2015;. In the following sections, these categories of errors and uncertainties are discussed in more detail.

20
In remote sensing a number of processing steps are necessary to obtain a calibrated signal in physical units from the raw data.
The latter are usually referred to as the Level-0 data. Their units depend on the instrument type and the related quantities can be detector voltages, photon counts or similar. Level-1 processing transforms the Level-0 data into calibrated measurement data, which no longer depend on the particular measurement device used, such as radiance units or transmission. These are conventionally referred to as Level-1 data. If multiple processing steps are required, distinctions can be made between Level-25 1a, Level-1b, etc. data, but this distinction is of no relevance here. These Level-1 data come with auxiliary data describing the geolocation and time of the measurement, the measurement geometry, and so forth. The Level-1 data are the input to the retrieval of the atmospheric state. Estimates of the atmospheric state variables are referred to as the Level-2 data product.
We use a convention that all uncertainties in the Level-1 data -including metadata -fall into the category "measurement uncertainties". The main sources of measurement uncertainties include but are not limited to measurement noise, including 30 discretization noise; zero calibration error (i.e., that the measurement signal is non-zero even though the true radiance signal is zero, which can be understood as an additive calibration error); gain calibration (this is a multiplicative calibration error); higher order errors (e.g., nonlinear detector response); uncertainties in auxiliary data, such as measurement geometry in terms of tangent altitude, the exact time of the measurement, etc.; and straylight. Further, all these errors can be subject to a drift, i.e., there can be some time-dependence. 5

Measurement Noise
Measurement noise is usually conceived as a statistical uncertainty which is described by the error variance of each single spectral data point. Usually, the uncertainties are considered as uncorrelated between the single components of the measurement vector, which implies a diagonal noise covariance matrix. In some cases, however, the measurement noise covariance matrix S y has off-diagonal elements, e.g., in Fourier transform spectrometry if apodization (see, e.g., Norton and Beer 1976) and/or 10 zero-filling is applied.
According to generalized Gaussian error analysis, the mapping of measurement noise onto the resultx reg is sors. Equation (18) is applicable also to maximum likelihood retrievals just by setting the R term in the gain function G to zero. After excessive and cheerful cancellation this finally gives Error correlations between the elements of x are implicitly considered. It is important to note that such correlations will typically be present even if the measurement errors are uncorrelated and if no regularization is applied. For some retrievals, 20 the so-called retrieval covariance matrix S x is evaluated using Eq. (10). These error estimates, however, represent not only the mapping of the measurement noise onto the retrieved quantity but also the error introduced by the application of the constraint, i.e., the 'smoothing error' in the terminology by Rodgers (2000). Related problems are discussed in Section 6.4. The retrieval error evaluated by this method will represent a meaningful quantity only if the a priori covariance matrix S a represents the actual variability of the atmospheric state rather than any ad hoc assumptions. 25 For some instruments the error estimate is based on the analysis of the residuals between the measurements and the best fitting modeled spectrum. Gauss (1821) has proven that the "residual sum of squares divided by the number of degrees of freedom is an unbiased estimator of σ 2 " (translation into modern terminology by Aldrich 1998). This Gaussian σ contains not only measurement noise but also other error components. Application of Eq. (18) to a residual-based noise characterization may be deemed more realistic than the application of this equation to pure measurement noise. However, not all uncertainties will show up in the residual. For example, spectroscopic band intensity errors of the target species will be fully compensated by erroneous retrieved concentrations and will thus create no additional spectral residual. Thus the residual-based error analysis will not provide the total uncertainty of the retrieved state variable, nor does it allow for decomposition of the error budget into its components. It is suitable to estimate the retrieval noise error S x,noise only if it can be assumed that the residual is dominated by the measurement noise. Residual based uncertainty estimation is used for, e.g., SCIAMACHY (U. Bremen) or ACE-FTS.
Non-optimal decomposition of the inverse problem, such as single profile retrieval, single species retrieval, etc. (Section 5.4.2), however, causes the following problem: S x,noise contains only the noise-induced uncertainties associated with 5 the current step of the inversion process. Propagated noise from preceding retrieval steps is formally dealt with as parameter error (see Section 6.3).
The mapping of measurement noise into the retrieval domain depends on the retrieval approach chosen. Naturally, noise has a larger effect when regularization is kept small in order to get the best possible spatial resolution, because noise and resolution are competing quantities. However, there are also other choices in the retrieval scheme which have bearing on the measurement 10 noise as evaluated above. In the ideal case, when the retrieval vector represents the entire atmospheric state with all its relevant variables, S x,noise covers all uncertainties associated with everything other than the target variable. For example, if one is interested in the error of ozone abundances, any uncertainty in the ozone mixing ratio caused by water vapour uncertainties is implicitly included in S x,noise , as suggested by Marks and Rodgers (1993), Tarantola and Valette (1982), Eriksson (2000), or von Clarmann et al. (2001); for a different perspective on this issue, see Section 4.1.2 in Rodgers (2000). The situation 15 is different in a decomposed retrieval (Section 5.4). In the case of species-wise decomposition, the uncertainty entailed by the uncertainty of an interfering species is evaluated as parameter error. The same holds for onion peeling error propagation (Section 5.4.5). Here retrieval noise, i.e., the mapping of the measurement noise on the retrieval, accounts only for the noise of the analysis of a single tangent altitude, while the noise propagated downwards from higher altitudes is formally considered to be a parameter error. As a consequence, retrieval noise estimates from two datasets are not necessarily intercomparable. A 20 sensible comparison is only possible between the total random errors, because the partitioning between noise and parameter errors depends on the retrieval system chosen and in particular how the inverse problem is decomposed into sub-problems.
In the context of error propagation in the the Levenberg-Marquardt algorithm (Section 5.5), it is important to distinguish two different applications.
(a) If the Levenberg-Marquardt algorithm is used only to dampen each iteration step and the iteration is only truncated after 25 full convergence has been reached, then the λI term has no sizeable impact on the solution, even if λ = 0 at the final iteration.
Thus, λI must not be included in the gain matrix G used for error estimation.
(b) Sometimes the Levenberg-Marquardt iteration is intentionally stopped before full convergence is reached. The rationale is to use the regularizing characteristics of the λI term which would be lost after too many iterations. The discussion of this approach of regularization is beyond the scope of this paper, and it must suffice to mention that in this case the retrieval error 30 has to be evaluated as suggested by Ceccherini and Ridolfi (2010).

Calibration Uncertainties
Besides measurement noise, calibration uncertainties also contribute to the measurement error (see, e.g., Kleinert et al., 2018).
Often the transformation from the raw data y raw (such as detector voltage) to the data in physical units y (such as spectral radiance) uses a linear scheme such as where a is a zero level offset correction and b is a gain calibration coefficient (e.g., Revercomb et al., 1988, their Eq. 2). In the case of spectral measurements, both a and b are usually a function of frequency. Even after careful radiometric calibration, there will always be a residual zero level and gain calibration uncertainty. 5 Among the satellite missions considered here, the following schemes to assess the zero level calibration error are in use, or at least possible: -Propagation of the assumed zero level calibration error in the retrieved target quantity σ x,zero from the zero level calibration uncertainty in the measurement domain σ y,zero , using linear mapping of the type 10 -A zero level correction is jointly fitted along with the target variables. In this case, this error component does not need to be assessed separately but is automatically included in the noise-induced error, at least if no constraint is applied to the zero offset correction. Since this additional fit variable tends to destabilize the retrieval, noise-induced errors will become larger. This approach has been chosen for MIPAS-IMK, Odin/SMR, and for some of the MLS data products.
-The zero level uncertainty is added as a fully correlated component to the measurement error covariance matrix S x,noise 15 and thus needs no extra treatment. It is then accounted for by the error evaluated using Eq. 18. We are not aware of any processor using this method.
-The zero level uncertainty is deemed negligibly small and thus not evaluated. This approach has been chosen by SAGE I, SAGE II, SAGE III, SCIAMACHY, ACE-FTS, and OMPS LP.
Similar arguments hold for the gain calibration uncertainty, and in theory the same methods can be applied. In emission 20 spectroscopy, however, gain calibration uncertainty is much harder to distinguish from concentration changes of the target species or temperature changes than offset calibration. For MIPAS-IMK the linear mapping method is used. On the contrary, for many limb-scatter retrievals a normalization with respect to a higher tangent height is done. As a result, the gain correction, b, mostly cancels out (von Savigny et al., 2003, e.g.,).
Occasionally, application of Eq. (20) is inadequate, e.g., if the detector response function is nonlinear (see, e.g., Kleinert 25 et al., 2018). We are not aware of any data product where uncertainties of the coefficients of the nonlinear detector response function are routinely considered in the error budget of the Level-2 products. Arguably all calibration constants can be timedependent and thus cause a drift. This issue is discussed in Section 6.7.
Another issue is frequency calibration. A spectral shift translates into a radiometric error that is highly correlated across the spectral line. The impact of such an error on the retrieval result is highly dependent on the retrieval setup and the selection of the the zero level correction. Residual frequency calibration errors after correction are still an issue of the Level-2 error budget.
Since the radiometric error induced by a spectral calibration error is antisymmetric to the line center, its effect on the retrieval results will be different when the microwindow contains only part of the line.
For Odin/SMR and MIPAS a frequency offset is fitted as a scalar value characterizing a complete limb scan. Where necessary, for SCIAMACHY and OMPS (IUP Bremen), in addition to the Level-1 correction from ESA or NASA, respectively, a spectral 5 shift/squeeze correction is determined during the pre-processing step by performing spectral fits for each line of sight and spectral window individually. IUP-DOAS and BIRA retrievals also use a shift/squeeze correction. For TES, the frequency calibration is performed as part of the Level-1B processing and is not included in the error covariances supplied with the Level 2 product. OMPS LP depends on the well-characterized Fraunhofer structure in the solar spectrum to establish and maintain its spectral registration (Jaross et al., 2014), and this work is done as a part of the Level-1 processing. For SAGE II, the filter used 10 for the water vapor retrieval changed after launch but appeared to stabilize and a static correction for the filter spectral location and bandpass is applied in the retrieval (Thomason et al., 2004). For SAGE III/ISS, spectral calibration is performed for each observation by analyzing the apparent unobstructed solar spectrum.

Instrument characterization errors
Under instrument characterization errors we subsume instrument line shape errors (uncertainties in the spectral response func- 15 tion of the instrument), uncertainties in the field of view characterization, and so forth. Which of the error sources in this category are relevant depends on the particular instrument under assessment.
The preflight characterization of the spectral response function of the instrument typically relies on a monochromatic signal.
Once in space, narrow spectral lines can be used to determine possible drifts in the instrument spectral line shape.
Depending on the field-of-view width and a shape of the response function, the field-of-view characterization can be of 20 crucial importance for limb-scatter sensors, because the limb-scatter radiance varies by more than 5 orders of magnitude between tangent altitudes of 0 km and 100 km. In this case, small errors in the field-of-view characterization may lead to large errors in the measured limb radiances at higher tangent altitudes. Also limb scanning emission and solar occultation measurements show a sizeable sensitivity to field-of-view uncertainties.
Instrument characterization errors are, if at all, typically evaluated using linear mapping. 25

Auxiliary data errors
We understand auxiliary data errors to refer to quantities that come along with the measurement data but are not part of the y vector. Typical examples are time registration errors, uncertainties in the measurement geometry such as tangent altitude pointing, and so forth. Due to the variable nature of the errors under this category, it is impossible to suggest a common scheme.
Some of these errors can be assessed by sensitivity studies or linear mapping, using the same formalism as discussed under 30 parameter errors. Alternatively, the uncertain auxiliary data can be jointly retrieved with the target variables. In the following, the most prominent auxiliary data uncertainties are listed, and their treatment by the instrument groups is documented.
In limb sounding, pointing errors propagate to the result for various reasons. Depending on the design of the retrieval scheme, different mechanisms may play a role. For example, the amount of air seen along the line of sight and the atmospheric state variables depend crucially on the tangent altitude. In the case of vertical gradients of atmospheric state variables, the assignment of a value which is per se correct to an erroneous altitude causes an error. Occultation measurements using the sun as background radiation source can depend on which part of the solar disk is seen by the instrument. The residual pointing 5 error to be considered in the error estimation depends on the pointing correction schemes applied. For MIPAS-IMK limb emission measurements, the first step of the retrieval chain is the simultaneous retrieval of temperature and tangent altitudes (von Clarmann et al., 2003). Results are used as known parameters in subsequent retrieval steps where trace gas abundances are retrieved. Residual errors of temperature and tangent altitudes are treated as parameter error in subsequent steps. For OMPS-LP measurements, the pointing correction is derived from radiance measurements using the absolute radiance residual method (ARRM) and the Rayleigh scattering attitude sensor (RSAS) methods (Scott et al., 1996;Moy et al., 2017). Also OSIRIS uses the RSAS method . For SCIAMACHY a correction to the pointing information is derived by analyzing measurements in the occultation geometry (Bramstedt et al., 2017) and is implemented to Level 0 to 1 data processing. The effect of residual pointing errors is assessed via Monte-Carlo-type studies for representative profiles (Rahpoe et al., 2013). Earlier SCIAMACHY analysis relied on a pointing retrieval using limb radiances below 300 nm. This so-called 15 "knee-method" uses the known altitude of the maximum of the limb radiances originating from Rayleigh-scattering (Kaiser et al., 2004;von Savigny et al., 2005). For ODIN/SMR a scalar pointing correction is fitted for the entire limb scan.

Model errors
The radiative transfer model used in the retrieval solves the radiative transfer equation and makes the signal comparable to what the instrument would see by integration over the finite field of view and by convolution with the spectral instrument response 20 function. A lot can go wrong here as our knowledge on radiative transfer can be erroneous or inaccurate. Some known physics, such as non-local thermodynamic equilibrium, line coupling, or more sophisticated than usual line-shape functions, may be disregarded for reasons of computational efficiency. Time constraints can also lead to numerical integration being performed with limited precision or weak spectral transitions being ignored. The goal in formulating the radiative transfer model is to keep model errors from known sources much smaller than the measurement error while maintaining computational efficiency. 25 Naturally, any unknown sources of model error are hardest to quantify. In the following, the most relevant types of known model errors are discussed.

Incomplete models
Some relevant physical processes included in f may be left unaccounted for by the radiative transfer model F in use (Section 4).
Typical examples are non-local thermodynamic equilibrium (Non-LTE) emission, line coupling, or line shape issues. Non-LTE 30 emissions occur when air density is so low that the excited molecule after absorption of a photon or in its nascent state will re-emit radiation before quenching redistributes the energy towards a Boltzmann distribution (e.g., López-Puertas and Taylor, 2001). Line mixing is a high pressure phenomenon where collisions transfer angular momentum, entailing energy transfer between energetically adjacent transitions (e.g., Armstrong, 1982;Bulanin et al., 1984;Strow and Gentry, 1986;Hartmann and Boulet, 1991;Hartmann et al., 2009;Thompson et al., 2012;Alvarado et al., 2013). The usual line-shape models, such as the Voigt lineshape (Voigt, 1912), may not adequately represent the true line shape (e.g., Galatry, 1961;Berman, 1972;Pickett, 1980;Thompson et al., 2012;Long and Hodges, 2012;Mendonca et al., 2019). Issues related to Zeeman coefficients, most relevant in microwave spectroscopy of mesospheric oxygen, are discussed in Larsson et al. (2019). The impacts of Zeeman 5 splitting on microwave and submillimeter lines are often (but not universally) ignored, although proper accounting is essential for adequate representation of mesospheric signals.
Critical issues in ultraviolet or visible remote sensing are scattering and polarization. Different levels of sophistication of models refer to the treatment of sphericity of the atmosphere and orders of scattering accounted for.
If a complete model is available but not used for the operational retrieval for reasons of computational efficiency, the effect of 10 the missing processes can be assessed via sensitivity analyses based on the complete model and considered in the error budget.
If the error is of a systematic nature, the related bias can even be corrected for, and only the residual scatter begs consideration in the error analysis.
In stellar occultation, the forward model for retrievals of traces gases from UV-VIS measurements does not include the deterministic description of stellar spectra perturbations due to scintillations. This omission is not only due to complicated 15 description of wave propagation in random media, but also by a stochastic nature of small-scale air density irregularities generated by small-vertical-scale gravity waves and turbulence. These perturbations can be, however, characterized and added as an additional, correlated in wavelength component to the measurement noise, as shown in Sofieva et al. (2009).
If no complete model is available, then it can only be hoped that the related error is sufficiently small compared to the other error sources that it has no bearing on the total error budget. 20

Numerical issues
The numerical solution of the radiative transfer equation requires a lot of integration, e.g., to integrate the spectral radiances over the field of view based on a finite number of so-called pencil beams; the spectral grid on which the radiative transfer is calculated has a finite width; radiative transfer through the atmosphere is in most models based on a finite number of layers or levels, just to name a few. Any improvement of computational accuracy goes along with increased computational effort. 25 For most satellite data processors, the setting is chosen in a way that these issues produce a retrieval error which is so small compared to the leading error sources that it can be ignored in the error budget.

Model constants
The main constants of relevance here include spectroscopic data, quenching rates, refractive indices, etc. The values of other constants (radius of Earth, gas constant, molecular weights, etc.) are known at an accuracy which renders analysis of related 30 retrieval errors unnecessary. Estimation of the impact of spectroscopic errors poses some serious problems.
A major problem in the propagation of spectroscopic data errors is that, in some cases, no uncertainties of cross-sections are available. Also, when they are available, information on error correlations is not provided. If a retrieval uses, say, a large number of ozone lines, it would be of utmost importance to know whether errors in the intensity of these lines are correlated (e.g., because the uncertainties are attributed to uncertainties in the gas amount in the cell used in the lab where the spectroscopic parameters were measured) or uncorrelated (because errors are dominated by noise in the lab measurement or because the spectroscopic information stems from different lab measurements). In the uncorrelated case the errors would randomize while in the correlated case they would fully survive the error propagation for a retrieval using multiple spectral lines. 5 To exemplify another issue, consider a gas-wise sequential retrieval where H 2 O is retrieved first, and this H 2 O profile is then used as a known parameter in a retrieval of ozone in another spectral region. errors not propagating into the retrieved ozone concentrations.
The usual way to estimate the propagation of spectroscopic data errors is to conduct sensitivity studies with perturbed spectroscopic data. Since, as stated above, the correlations between spectroscopic data errors are unknown and not reported in commonly used spectroscopic databases, these sensitivity studies render only a crude estimate of the related retrieval error.
In the case of retrievals of trace gas abundances, one might argue that uncertainties of the line intensity can be mapped 15 directly onto the target concentration retrieval. Because both the line intensity and abundance appear reciprocally in the exponent of Beer's law, the non-linearity of the radiative transfer equation has no bearing on the line intensity error propagation. It has, however, been shown that it is not sufficient to restrict related error analysis to the line intensities. For example, pressure broadening has a sizeable effect in the infrared and microwave regions (e.g. Urban et al., 2005;Glatthor et al., 2018). In this case no direct mapping is possible and full sensitivity studies are needed. Connor et al. (2016) had found, in their linear error 20 analysis for OCO-2 retrievals, that a constant perturbation even to CO 2 line intensities did not in fact map to a constant impact on the XCO 2 retrievals within the NASA OCO-2 algorithm. Within that algorithm, the impact of line intensity perturbation on the retrieved XCO 2 varies spatially and appears to depend on the surface brightness. Inclusion of surface albedo terms in the state vector for the OCO-2 algorithm gives rise to this information cross-talk.
The propagation of uncertainties of model constants follows the same formalisms as proposed for uncertainties in atmo- 25 spheric parameters (Section 6.3).
In the NASA ACOS/OCO-2 and OCO-3 CO 2 retrieval algorithm spectral residuals caused by imperfect spectroscopy, solar model and instrument characterization are dealt with by fitting scaling factors to fixed spectral residual patterns. These patterns are the Empirical Orthogonal Functions (EOFs) that result from a singular value decomposition of spectral residuals from training retrievals (O'Dell et al., 2018). A similar approach was adopted independently by Lange and Landgraf (2018) for 30 retrievals of methane from GOSAT thermal infrared spectra.

Parameter errors
We define parameter errors as those errors originating from the decomposition of the full retrieval problem such that a part of the atmospheric state is assumed to be already known and thus not included in the retrieval vector x (see, Section 5.4). The assumed values can derive from either a preceding retrieval step or from climatologies or any other source of prior information.
The ideal sequence of operations has the first atmospheric state variables retrieved being those whose signal is only weakly dependent on or interfered by other state variables. Once known, these values can be used for subsequent retrieval steps as "fixed" parameters. Error propagation has to be considered.
The impact ∆x of errors in parameters can be estimated via sensitivity studies, where a measurement is simulated with 5 parameter b that is perturbed by a certain amount ∆b, e.g., one standard deviation of its uncertainty.
If the parameter b is a vector, whose elements' error correlations are known and relevant, generalized Gaussian error propa-10 gation can be applied where K b is the Jacobian matrix representing the sensitivities ∂ym ∂bj of the measurements with respect to a changing parameter b j .
Depending on the source of the information on the parameter vector -climatology, preceding retrieval step, independent 15 measurements, or whatsoever, the parameter errors can be correlated or uncorrelated in space and time.
Occasionally errors are of a mixed nature, e.g., if a quantity is jointly retrieved along with the target quantity but strongly constrained. In this case, the parameter actually is part of the retrieval vectorx but its value still depends largely on the a priori information. Uncertainties that derive for this situation are discussed in Section 6.4. 20 In onion peeling (Section 5.4.5) the ray path with the highest tangent altitude is analyzed first. In the second step, the results of the first step are used as known parameters. Thus the retrieval error of the first step has to be considered as a source of parameter error in the second step, and so forth. Explicit error propagation through an onion peeling retrieval has been studied, e.g., by Noël et al. (2016).

Error propagation in onion peeling
Alternatively, the onion peeling retrieval error can be estimated using a Monte Carlo method. For the solution profile x a 25 limb sequence of measurements is calculated. Artificial noise with the same characteristics as the real measurement noise is superimposed upon the measurements. A sample of limb sequences is generated, based on the same forward radiative transfer calculations but different in the actual realization of the random noise. For each of these simulated limb sequences a retrieval is performed and, from the scatter of these results, the retrieval error covariance matrix S x,noise is calculated.

A priori information
In order to avoid wording that is too abstract, we assume that the retrieval vector represents vertical profiles of atmospheric state variables. However, with some adjustments the mathematical concept is applicable to 2-D or 3-D fields of atmospheric state variables as well. The framework is also applicable to column retrievals. In this case, the retrieval vector has only one element. 5 By performing regularized retrievals invoking Eq. (4) or variants of it, the retrieved atmospheric state will deviate from the one which is most consistent with the pure measurement information. As a consequence, this can introduce additional bias and distortion, and the resolution can be degraded with respect to the true state of the atmosphere beyond the degradation caused by the finite retrieval grid. The resulting profile is a mixture of the measurement information and the a priori information used.
For the interpretation of constrained retrievals it is of utmost importance to have tools available to diagnose the content of a 10 priori information in the retrievals. As in Rodgers (1976Rodgers ( , 1990Rodgers ( , 2000, one can calculate the derivative of the retrieved state with respect to the true state, and call the resulting matrix the "averaging kernel matrix" The rows of the averaging kernel represent the weighting functions, which determine to what degree the result at altitude 15 level i depends on the true atmospheric state at altitude level j. Its columns represent the response of the retrieval to a delta perturbation at a single altitude level. If a joint retrieval of profiles of multiple different quantities is made, the above refers to the diagonal blocks of the averaging kernel matrix which refer to the quantity under consideration. The presence of non-negligible off-diagonal blocks indicates a significant interference between the species introduced by the regularization scheme. The averaging kernel of a fully converged Levenberg-Marquardt retrieval equals that of the respective retrieval without the 20 λI term because after convergence this term has no impact on the solution. If the iteration is ended prematurely in order to use the Levenberg-Marquardt method to regularize ill-posed inverse problems, then the averaging kernel has to be calculated as suggested by Ceccherini and Ridolfi (2010).
When S y,total is approximated by S y,noise in the retrieval, the same approximation must be used for the evaluation of the averaging kernel. That is to say, in this case the averaging kernel should be calculated involving S y,noise instead of S y,total .

25
Conversely, the derivative of the retrieved state with respect to the a priori information is I − A. With this, the retrieval can be rewritten aŝ For the retrieval of column amounts, the sensitivity of the column to the true state values at different altitudes can be represented by the column averaging kernel A col , which is, as opposed to the averaging kernel described above, not a square matrix but a 30 row vector (Wunch et al., 2010, see, e.g.,) where h T is the column operator whose multiplication with the vertical profile yields the vertical column density and where A is the regular profile averaging kernel as described above.
Usually regularization will entail that the retrieved statex is a smoothed and possibly biased representation of the true state x. Rodgers (2000, p. 48) offers two possible interpretations of the retrieved statex. It can either be conceived as a smoothed estimate of the true state, or it can be construed as an estimate of the smoothed true state. The choice of the interpretation has 5 major impacts on the error budget, which are discussed below. All this is not to say that the effect of the prior information is restricted to smoothing. The resulting profile shape can be distorted in a sense that the extrema of a profile can be shifted upward or downward (see, e.g., the HOCl profiles in Jackman et al., 2008, their Fig. 12) or bias the result (see, e.g., Bhartia et al., 2013). The averaging kernel matrix contains information on the dependence of the result on the true state and the a priori assumption, the vertical resolution, and the information displacement. As stated above, a retrieval can be understood as a smoothed estimate of the truth or an estimate of the smoothed truth. In the first case, any deviation between the estimate and the truth which is caused by the regularization of the retrieval has to be included in the error budget. Rodgers (2000) calls this error component 'smoothing error' and has suggested the following formalism to estimate it (Rodgers, 1990): While in principle this formulation complies with generalized Gaussian error propagation, this concept has been criticized (von Clarmann, 2014). The main criticism refers to the fact that this estimate does not refer to the difference between the retrieved and the true state but only to the difference between the estimate and the true state as represented on the grid on which S a has been evaluated. This leads to the undesirable effect that a smoothing error evaluated on a coarse grid will be smaller 20 than a smoothing error evaluated on a fine grid. Further, a smoothing error evaluated on a coarse grid and then propagated onto a fine grid, will be smaller than the smoothing error evaluated directly on the fine grid, although the interpolation between the grids is a linear operation, which is another undesirable outcome.
Another criticism that may be applied to the concept of the smoothing error is that it forces one to meander around between a subjective (personalist) and an objective concept of probability. In the retrieval (Eq. 6), S a represents parameters of a person- 25 alist's probability distribution. That is to say, the underlying concept of probability is a subjective one, describing the agent's (lack of) knowledge or information about a value which is determined in the true world. Conversely, S a in Eq. (27) is required to represent parameters of an objective probability distribution, equivalent to a frequency distribution.
Since interpolation of profiles to other grids is a standard operation, it is not advisable to include the smoothing error in the error budget without a caveat. Instead, the averaging kernels should be communicated to the user, allowing them to evaluate the smoothing error on the final working grid.
In this context it should be mentioned that error estimates according to Eq. (10) include a smoothing error component and should not be used to calculate the error budget because the data user might not be aware of related problems and might, when interpolating profiles on a finer grid, propagate these error estimates to the finer grid.
Further, Rodgers (2000) points out that Eq. (27) will only yield a meaningful smoothing error if S a is not just a constraint matrix chosen ad hoc to regularize the inversion but a real statistical description of the variability of the actual states around 5 the mean state used as x a . This criterion should even be more rigorous: The maxim of the most specific reference class has to be applied (Hempel, 1965). For example, to calculate the smoothing error of a midlatitude ozone profile retrieval we cannot use a global ozone climatology. To calculate the smoothing error of a polar winter ozone retrieval, we must not use an ozone climatology built from a whole year of polar ozone data. The most specific reference class will be a homogeneous reference class whose internal variability is, as far as known, purely random. 10 Not all applications of a retrieval scheme of the type Eq. (6) use a climatological mean profile as a priori. For example, for the upcoming TEMPO mission, actual ozone measurements have been tested to be used as a priori (see, e.g., Johnson et al., 2018). In such applications, the S a matrix contains the estimated uncertainties of the individual ozone measurements used instead of the climatological variability. The standard approach of maximum a posteriori retrievals with climatological prior is based on the assumption that a climatology based on data collected in the past will also be true for the actual case. Hume (1748) 15 was the first to show that this assumption cannot conclusively be inferred from anything. The use of actual measurement data from independent sources as prior information dispenses with this assumption and reduces the maximum a posteriori retrieval to a sort of optimal average of two independent measurements. The smoothing error evaluated for this kind of retrieval scheme represents the propagated uncertainties of the measurement(s) used as prior information.
A particular problem is the evaluation of the smoothing error difference (occasionally, perhaps more adequately, called 20 'smoothing difference error') of a pair of measurements. For this purpose, Rodgers and Connor (2003) suggest that the retrieved profiles should first be transformed to the same a priori profile x c , using the general transformation scheme Eq. (10.48) of Rodgers (2000) x where profile and covariance matrix x a,old and S a,old represent the initially used prior information to be removed, and where x a,new and S a,new represent the new prior information be included instead. In the given application, the old prior information is that used for the retrieval, and the new one, x a,new , is the prior information x c , valid for the comparison profile. This Here S c is the a priori covariance matrix describing the variability of the atmospheric state around a priori profile x c valid for the comparison profile. Rodgers and Connor (2003), however, do not specify what type of x c profile is adequate. The common 5 a priori profile x c can by no means be freely chosen. Here the maxim of the most specific reference class Hempel (1965) becomes important again. When large samples of collocated measurements are compared, the appropriate reference class for instrument 1 is not necessarily the appropriate reference class for instrument 2, and vice versa, because both instruments might typically sample different parts of the atmosphere. The adequate sample with which to build the statistics needed to evaluate the smoothing error of the difference is that which is representative for the actual collocations of both measurements. 10 The criticism of the smoothing error as formulated above (After Eq. 27) does not apply to the smoothing error difference. have the same vertical resolution. This is not typically given when two measurements are compared, and a part of the observed differences is thus due to related artefacts.
If the contrast in resolution is large enough to consider the better resolved measurement as both practically ideal compared 20 to the other one and practically free of a priori information, then it is common practice to apply the averaging kernel of the coarser resolved measurement to the better resolved measurement (see Section 6.4.3 for concepts of altitude resolution).
Here index 1 refers to the better resolved measurement and index 2 to the coarser resolved one. The other indices are selfexplanatory. To our best knowledge, this approach was first suggested by Connor et al. (1994). This approach is also commonly 25 applied when measurements are compared to model data. In this case the averaging kernel of the measurement is applied to the modeled atmospheric state. In data assimilation the averaging kernel has to be included in the observation operator. Within linear theory and the assumption in force that the better resolved data set contains no sizeable amount of a priori information,  2017), encountered the following difficulty: When the MIPAS averaging kernels were applied to modeled Nitric Oxide (NO) distributions, the discrepancies between the modeled and the measured NO distributions were found to be larger than in the comparison without application of MIPAS averaging kernels. averaging kernels calculated for the modeled NO distribution. The latter approach, i.e., to calculate dedicated averaging kernels for model atmospheres, has been chosen, e.g., by Schneider et al. (2017) and references therein.
In this context another caveat is in order: Averaging kernels usually depend on the units in which the atmospheric state is expressed. For example, averaging kernels evaluated for volume mixing ratios must not be applied to number density profiles.
Some authors prefer to use so-called 'fractional averaging kernels' instead, which refer to the relative instead of the absolute 10 change of the state variable and are thus unit-independent (Keppens et al., 2015). However, again the caveat applies that these can be calculated only within linear theory, with the assumption in force that the retrieved profile is sufficiently close to the true profile.

Altitude resolution
Often the full information contained in the averaging kernel is summarized in simpler terms. The most important simple 15 diagnostics that partially describe the content of the averaging kernel are vertical resolution, information displacement, and measurement response. We first discuss the concept of vertical resolution.
Vertical resolution of the retrieval, not to be confused with the vertical resolution of the instrument itself, describes the ability to distinguish separate features in a vertical profile. It is limited by both the vertical retrieval grid on which the results are presented and by correlations of the data at adjacent grid points caused by the regularization term of the retrieval. Contrary 20 to common belief, a wide field of view or an observation geometry other than limb or with coarse vertical sampling do not per se degrade the vertical resolution of the measurements. The altitude resolution of the retrieval is determined only by the vertical grid and the regularization. It goes without saying, however, that a wide field of view or any sub-optimal observation geometry often forces the retrieval scientist to use a stronger regularization to get useful results, which, in turn, will degrade the altitude resolution. Thus, the field of view geometry or sampling have an indirect influence on the vertical resolution of the retrieval, 25 which is fully accounted for by the averaging kernel matrix and does not need extra treatment. Vertical oversampling in limb sounding, i.e. the use of a tangent altitude spacing finer than the width of the instantaneous field of view of the instrument still allows a useful vertical resolution finer than the field of view (Roscoe and Hill, 2002). A measurement mode of this type has been employed, e.g., for MIPAS for the measurements recorded after 2004 (Fischer et al., 2008), and is standard for submillimeter and microwave measurements such as MLS (Barath et al., 1993;Waters et al., 2006) or ODIN/SMR (Urban et al., 30 2005).
Rodgers (2000) reports four measures of the vertical resolution, all based on the averaging kernel matrix. Two of these measures are commonly used. The first is the full width at half maximum of the respective row of the averaging kernel matrix, or, in the case of a retrieval of multiple quantities, the part of the row associated with diagonal block associated with the quantity of interest. The second is the reciprocal data density, which is the local grid width divided by the respective diagonal value of the averaging kernel matrix (Purser and Huang, 1993). In less than well-behaved retrievals or at the extreme ends of the profiles, where the maximum of the averaging kernel does not coincide with its nominal altitude, the latter provides better intelligible results.
The Backus and Gilbert (1970)

spread (shown here in a generalized variant introduced by Rodgers 2000)
where z is altitude and A the respective element of the averaging kernel matrix A, was found by Keppens et al. (2015) to be most informative under certain circumstances. Obviously in the case of the retrieval of different state variables the summation here and all similar applications should only be performed inside the diagonal sub-blocks, corresponding to each retrieval quantity. Among other reasons discussed towards the end of the section, going outside the sub-blocks could mean that even 10 different units would be mixed.
A drawback of the Backus-Gilbert spread is that it depends largely on the grid on which the retrieval is performed. The averaging kernel of a retrieval performed on a finer vertical grid will have more pronounced side-lobes which are simply not resolved by an averaging kernel evaluated on a coarser grid. The Backus-Gilbert spread is very sensitive to such side-lobes and will thus inadequately 'punish' the fine-grid retrieval by giving large weight to these side-lobes and thus assigning a large 15 'spread' to them. It thus does not seem suitable for a largely grid-independent measure of the vertical resolution.
Obviously, the altitude resolution can be altitude dependent. Usually, the averaging kernel matrix is evaluated on the grid on which the retrieval is performed, because the Jacobians needed are often a by-product of the retrieval. The disadvantage of this approach, however, is that the averaging kernel does not represent any subgrid smoothing effects. Averaging kernels evaluated on a finer grid, which, by the way, are no longer square, can in principle be provided if the related Jacobians are 20 made available, but this is hardly ever done. The ideal averaging kernel is the identity matrix. This averaging kernel matrix corresponds to a maximum likelihood retrieval, where the weight of prior information is zero. Here the altitude resolution is equal to the gridwidth of the retrieval. In agreement with our intuition, the altitude resolution cannot be better than the width of the grid on which the retrieval is performed.
It is a common misconception that the averaging kernel characterizes the vertical resolution of the estimated profilesx. 25 The retrieval as represented by Eqs (3), (4), (6), or (25) is a correction of an initially guessed or a priori assumed profile. The altitude resolution obtained from the averaging kernel characterizes only the correction term but not the a priori component. If the a priori profile is highly structured and thus resolves fine scales, these structures are propagated onto the resultx.
As is often the case, precision and resolution share a trade space in remote sounding retrievals. We see from Eq. (18) (with Eq. 7 inserted) that weaker regularization will increase the impact of measurement noise. Conversely, weaker regularization 30 will, according to Eq. (24), push the averaging kernel towards the identity matrix, which is associated with the optimally obtainable resolution of a profile at a given discretization.
In the context of altitude resolution, a cautionary note is in order. The altitude resolution is neither identical to the grid width nor with the information smearing. In a regularized retrieval the vertical resolution is coarser than the retrieval grid. Only in an unconstrained maximum likelihood retrieval is the vertical resolution equal with the gridwidth. Conversely, the vertical resolution of measurements that are sensitive to a very small air parcel is only limited by the vertical grid, and the sampling theorem (Shannon, 1948) applies. That is to say, in situ measurements from a cruising aicraft have no altitude resolution in the sense as defined here although the measurements may be practically point measurements due to the small vertical extent of the air parcel probed. In remote sensing the radiative transfer equation, which is integrated over all altitudes relevant to the 5 retrieved profile, acts as an anti-aliasing filter, and the sampling theorem is of no concern.
Another concept closely related to the concept of altitude resolution is that of the degrees of freedom of the retrieval. This number is calculated as the trace of the averaging kernel matrix (Rodgers, 2000) 7 .

Information displacement
Ideally the maximum, the mean, and the median of the averaging kernel coincide with the nominal altitude but "it ain't nec-10 essarily so" (George and Ira Gershwin, 1935). Any displacement reflects the fact that the interpretation of the retrieved profile without consideration of the averaging kernel is, mildly speaking, misleading. This problem can often be remedied by comparing the retrieved profile not with any reference profile, but with a reference profile to which the averaging kernel matrix of the remotely sensed profile has been applied according to Eq. (30). Again, the caveat that this method is valid only within linear theory applies. A measure of the information displacement is the centroid offset of the averaging kernel (see, e.g., Keppens 6.4.5 Regularization bias and measurement response 20 As mentioned above, regularization may not only smooth the retrieval but may bias or distort it. The related component of the We have to distinguish two cases: Firstly, smoothing can cause biases, because, e.g., a sharp maximum of the true profile will always be reproduced too low, and the wings of the maximum will be reproduced too high. This type of bias, however, is only 25 relevant if the retrieval is conceived as a smoothed representation of the truth as discussed in Section 6.4.1. If the retrieval is conceived as an estimate of the smoothed truth as discussed in Section 6.4.2, this type of bias is of no concern. Secondly, regularization can cause a bias by pushing the result systematically towards higher or lower values. Any such effect besides mere smoothing is characterized by the measurement response function q (occasionally also called 'vertical sensitivity'), a concept which goes back to an idea of Eriksson (2000) and Baron et al. (2002). It is defined as the sum over the row of the averaging kernel matrix.
In the case of a multi-species profile retrieval, the sum is calculated over the subblock or the averaging kernel matrix referring to the profile under assessment. If the regularization of a retrieval provides a smoothed version of the truth, without systematically pushing results towards greater or smaller values, the sum of the elements over each row of the averaging kernel 5 should be unity. Any deviation of the row-sums from unity thus hints at an influence of the constraint that is beyond pure smoothing. The measurement response function is retrieval-unit-dependent.
Even if the averaging kernel matrix is far from unity, a measurement response function close to unity indicates that the retrieval is, putting measurement errors aside for a moment, a smoothed but unbiased representation of the true profile. Conversely, values of the measurement response function deviating by an appreciable amount from unity indicate a large influence 10 of the prior information not only on the profile shape but also on the integrated values. Interpretation of the measurement response, however, requires some caution. Any non-zero x a − x will cause a bias in this case.
The row sum of the averaging kernel, which makes up the measurement response, consists of summands which refer to a perturbation by the same amount in each layer, where, again, the 'sameness' is unit-dependent. Such a perturbation can be fully realistic in one layer and fully unrealistic in other layers, depending on the retrieval-units. The evaluation of the measurement 15 response is particularly problematic in cases where the profile values cover a wide dynamic range. This is the case, e.g., for the H 2 O mixing ratios or when the retrieval units are number density. For example, a certain perturbation in terms of number density of a certain trace gas at lower altitudes can correspond to merely moderate changes in the mixing ratio, while the same perturbation in terms of number density at high altitudes where air density is low will correspond to fully unrealistic mixing ratios. Thus, averaging kernels evaluated in units of number density can fake a large dependence of the result on values at 20 higher altitudes which does not exist in the real world. These large contributions from higher layers can lead to unrealistic large values of the measurement response.

Regularization crosstalk
The discussion of the averaging kernel matrix and smoothing error was focused on the retrieval of single quantities so far, e.g., vertical profiles of a single state variable. Often, however, multiple different state variables are jointly retrieved in one leap. In If the full (multi-variable) averaging kernel matrices are stored, the resulting parameter errors can be evaluated using Eq. 27.
The case under discussion lies between the extremes of treating the other variable as a known parameter during the retrieval and the unconstrained joint-fit of both quantities.

Related issues
In the context of averaging kernels and vertical resolution a few further remarks are in order.
-Time series of state values at a given altitude are particularly problematic when the averaging kernel is time-dependent in itself. Here it may help to remove the prior information from the data along with resampling in order to achieve A = I as suggested by von Clarmann et al. (2015).
-While averaging kernels of maximum likelihood retrievals are unity on the native grid on which the retrieval has been performed, any interpolation to finer grids will entail non-unity averaging kernels.

5
-Averaging optimal estimates will not usually create optimal averages. This is particularly true when the prior information is the same for each retrieval, e.g., a climatological data set. This is because the weight of the prior information will be too large in the average (see, e.g., Ceccherini et al., 2014).
-Even if optimal estimation is conceived in an objectivist sense, where the prior information can be conceived as the frequency distribution of true states, any deviation of the assumed frequency distribution from the true one is an additional 10 error source which is not typically considered in estimated error budgets.

Unknown error components
Error estimation will never be perfect, not only because the input variables of error estimation are uncertain in themselves, but also because there always are error sources that those responsible for the error estimation may not be aware of. Povey and Grainger (2015) propose "to present multiple self-consistent realisations of a data set as a means of depicting unquantified 15 uncertainties." It is obvious that such ensemble techniques are well suited to investigate the non-linear interaction of multiple known error sources, to obtain sensitivity information if the data processors contributing to the ensemble consider different types of known uncertainties, or to identify the spread of results which may result from different numerical implementations.
These authors, however, fall short of telling us how such ensemble techniques should provide information on the effect of unknown error sources. The problem is that none of the data processors contributing to the ensemble has the unknown mech-20 anism implemented, and the unknown uncertainty will cause an unknown bias of the ensemble mean rather than scatter of the ensemble.
The only way known to us to gain confidence that all relevant error sources have been considered is to compare multiple independent measurements based on different measurement systems where we can fairly safely exclude that they all are affected by the same type of systematic effect. If the discrepancies between the results of different instruments can be explained by the 25 combined error budgets, we have reason to believe that the error budgets of the instruments under comparison are fairly complete (e.g., Rodgers and Connor, 2003;von Clarmann, 2006). For at least three independent measurements the random components of the error can be pinpointed quite safely , and references therein) (Loew et al., 2017, and references therein), and for a large number of independent instruments one can assume that even the bias of the mean will approach zero.

Natural variability
It goes without saying that natural variability in a sense that the atmospheric state at place s 1 and time t 1 differs from the one at s 2 and t 2 is not a genuine retrieval error. However, when in a validation context two independent measurements of the same state variable are compared and the measurements do not refer to exactly the same airmass, the spatial or temporal mismatch of the measurements along with natural variability will contribute to the difference. Often, natural variability is invoked as a 5 universal excuse if validation studies hint at unexplained discrepancies. To allow a more quantitative assessment of the role of natural variability in validation, tools to assess the impact of less-than-perfect collocations is provided by, e.g., Sofieva et al. (2008), Verhoelst et al. (2015) or . The latter tool estimates the difference between two measurements that is explained by natural variability and is based on a parametrization of high-resolution model data. It saves the validation scientist from the need of dedicated model studies for each comparison. Also dense and precise high-resolution measurements can be 10 used as so-called 'fiducial reference measurements'. The latter approach allows simultaneous evaluation of natural variability and validation of error estimates, as discussed in Staten and Reichler (2009) and Sofieva et al. (2014).

Drifts
Instrument drift we understand is a false trend in the derived state variables which is caused by an unstable instrument. At first order, a drift can be avoided if regular and frequent calibration are performed or if the self-calibrating measurement procedures 15 are employed. However, higher order effects, e.g., related to the non-linearity of the calibration curve, can lead to noticeable drifts. Eckert et al. (2014), e.g., found drifts in MIPAS ozone even though regular calibration was performed. This was for the reason that, due to detector aging, the non-linearity of the detector sensitivity changed with time. Also the notorious aging problem of spacebased UV-measurements, degradation in the sense of a reduction in throughput due to intense solar radiation can cause drifts. Most often the cause are the coatings of optical elements.

20
Whenever ex ante drift estimates are available, they should of course be communicated to the data user. Since, however, drifts usually can be determined only reliably towards the end of a mission, it does not make sense to require drift estimates in data characterization papers, which are typically written in the early phase of a mission. A deeper discussion of drifts is found, e.g., in Hubert et al. (2016).
Spaceborne UV measurements are typically affected by particularly severe instrumental degradation, i.e., loss of throughput. 25 This is usually caused by optical coatings degrading when exposed to UV radiation. If a tangent altitude normalization approach or another self-calibration approach is used, this degradation is not necessarily a big problem, but the signal to noise ratio will decrease over time.
The SBUV/2 instruments use an on-board calibration system to track relative spectral and temporal changes in diffuser reflectivity using a mercury lamp (e.g., DeLand et al., 2012). Since the solar diffuser is the only additional optical element be-30 tween radiance and irradiance measurements, this system enables an accurate throughput change correction to be derived from SBUV/2 solar measurements. This correction is applied in the Level 1 processing and is not included in the error covariances.
Intrinsically self-calibrating measurement geometries such as solar and stellar occultation or regular calibration measurements using internal sources at first order remove this error. This does, however, not apply to drifts of the shape of the nonlinear detector response function as discussed above. To date, these drifts are not evaluated as part of the routine error analysis of the Level-2 product but they are assessed by careful comparison with other instruments. While it is not easily possible to get absolute drift estimates from this, at least the relative drifts between instruments can be estimated (e.g., Eckert et al., 2014;5 Laeng et al., 2017;Hubert et al., 2012;Rahpoe et al., 2015;DeLand et al., 2012). It is important to note that relative drifts between instruments may have causes beyond time-dependent calibration changes (e.g., a drift in tangent height registration as shown in Livesey et al. 2018, or Kramarova et al. 2018.

Combination of error components
Within linear theory, errors of different sources combine additively and follow Gaussian error propagation. We have k covari-10 ance matrices of the dimension n × n representing the errors of k different sources S x,1 , S x,2 , ... S x,k and get where I n is an n × n identity matrix and where the C matrices represent covariances among the error sources. For independent error sources, these covariances are zero and combination of errors comes down to summing up the error covariance matrices. 15 Beyond linear theory, the interaction of various error sources is best studied by means of ensemble sensitivity studies (see, e.g., Kulawik et al., 2019) Some data providers publish total error estimates. This practice is also endorsed by Joint Committee for Guides in Metrology (JCGM) (2008a). There are, however, compelling arguments in favour of publishing the individual error components.
Specifically, depending on the application of the data, the same type of error can act as random or systematic error. For ex-20 ample, in trend estimation constant biases of the target gas will fully cancel out. Conversely, if, e.g., the total chlorine budget is calculated, the systematic (i.e., time-independent) error components of the parent chlorine species can be fully uncorrelated among the species and thus have to be treated like fully uncorrelated random errors when the error of the total chlorine budget is estimated. In other words, the extent to which error components are 'systematic' is domain-dependent. An error which is systematic in time can be random in the altitude, species, or some other domain. Thus, the data user may be better helped by 25 being given access to the individual error components and some advice on systematicity in the various domains.
The goal of the TUNER effort has always been to bring the atmospheric remote sensing community together to enable better science. While a great deal of work has been performed over many decades, certain questions about the intercomparability of different data sets continue to linger and can only be answered if the data provided satisfy the conditions of adequacy described in this paper. While TUNER is not the first attempt at achieving this lofty goal (and may not be the last), we believe that the 5 TUNER group is well-suited to this task. With the aim of establishing a consensus on error reporting, the TUNER group is comprised of remote sensing retrieval experts representing instruments with well over a century of combined operational time and experience. Comprising both data providers and data users, the TUNER consortium aims to "practice what they preach" in the hopes that data from past, present, and future instruments may finally be used in a consistent and intercomparable fashion.
Based on the framework and consensus terminology outlined above, and in response to the conditions of adequacy formu-10 lated in Section 2, recommendations have been developed on how uncertainties shall be assessed and data characterization shall be reported. These recommendations may seem less specific than the reader might expect, but one-size-fits-all recommendations were found to be inadequate for the variety of instruments under consideration. In the following, we state the general principles that we consider to be useful. Further, we formulate recommendations with respect to the evaluation and reporting of random errors, systematic errors, and further diagnostic data. The respective conditions of adequacy which led to a particular 15 recommendation are listed in brackets (see Section 2). When appropriate, the recommendation is followed by an example or a short discussion in order to elucidate the rationale behind the recommendation.
R 1. The language and notation used to describe the error budget must be clearly defined. This can be accomplished either by explicit definitions of all terms and symbols used or by reference to any available document that lays down a self-consistent terminology. We hope that this paper serves that purpose and that the terminology and notation introduced here will be found R 2. The choice of which error estimation scheme is adequate depends on the instrument and the specific retrieval scheme.
Thus, no 'one-size-fits-all' error estimation scheme is recommended here. The responsibility for judging which treatment of uncertainties is adequate lies with the retrieval scientist, because only they can judge which error sources and error propagation mechanisms are relevant for a particular instrument or data product. Every effort should be made to make the error budget as 25 complete as possible in the sense that all sizeable sources of uncertainty are included, either via linear mapping, sensitivity studies, or whatever is appropriate for the particular case under assessment [CoA 1, CoA 5,CoA 3]. An overview of the most commonly used retrieval schemes is given is Sections 4 and 5. Error sources are discussed in Section 6.
R 3. The ideal approach is to report the substantive contributions from each relevant error component separately [CoA 5, CoA 4]. The reason for this recommendation is that an estimated error component due to one particular error source can be 30 of random characteristic in one application and of systematic characteristic in another application. For example, errors due to uncertain strengths of spectral lines are random if, say, the chlorine budget is calculated from multiple chlorine-containing constituents, each having its own uncertainty due to spectroscopic data. Conversely, in the analysis of a time series of one species the estimated errors due to erroneous line intensities act as a systematic error. The data user is able to consider the relevant error components only if the error contributions are reported separately. If, in addition, the total error is reported, it should include the systematic and the random components.
R 4. For each error source, it is often necessary to know if the resulting error components are independent between two subsets of data within a certain domain (time, space, species, etc.). For example, the error component due to tangent altitude 5 uncertainties can be correlated between different species retrieved from the same measurement. The error component due to spectroscopic data may be correlated in the altitude domain but uncorrelated between different species, etc. We recommend that data providers describe the correlation within each relevant domain either qualitatively or quantitatively, wherever possible.
[CoA 5; CoA 3]. The need of this is illustrated by the example already described under Recommendation R 3. Another example are quasi-systematic errors which are random in the long run only but can be highly correlated on shorter time scales. 10 R 5. When instrument groups make the error components available, they should also indicate which of them contribute primarily to the random error and which contribute primarily to the systematic error. Classification and combination of errors is most helpful to the data user if it is made by their systematic vs. random nature rather than by origin [CoA 5;CoA 3]. This is important, e.g., in the context of validation. If estimated errors are reported as aggregated parameter errors, and some of them are of systematic nature while the others are of random nature, the data user will not be able to judge which fraction of the bias 15 or the standard deviation of the differences between two measurement systems is explained by the systematic or random error, respectively. On the face of it, this recommendation looks redundant with Recommendation R 4 applied to the time domain but it is not. Components of the error budget may be strongly autocorrelated in the time domain but still lead to zero bias and thus contribute to the random error only.
R 6. The meaning of the reported uncertainties shall be clarified. Do they refer to ±1σ, ±2σ, etc. or to a specified confidence 20 limit, such as 95% or 99%? Note that generalized Gaussian error propagation will usually produce error estimates in terms of variances, while Monte-Carlo-type sensitivity studies enable the confidence limits to be directly estimated. If the one is transformed to the other, the assumed underlying distribution shall be reported [CoA 1; CoA 4].
R 7. For all error components, the assumed ingoing uncertainties shall be reported in the relevant documentation, otherwise error propagation would not be traceable. It should also be reported which correlation characteristics were assumed (e.g., scalar R 8. If the retrieval uses prior information in the sense of Eq. 4 or Eq. 6, the a priori profiles must be reported to allow the data 30 user to apply Eq. (30) or variants of it [CoA 5,CoA 4]. Also for column retrievals where an a priori profile is scaled to obtain the best fit and then integrated over altitude to render the column, the a priori profile should be reported, because the column can depend on the assumed profile shape.
R 9. In addition to the error budget, averaging kernels (Eq. 24) should be reported. If a certain retrieval scheme does not give direct access to averaging kernels (e.g., onion peeling) then averaging kernels shall be determined by sensitivity studies based on delta perturbations of the profile. For retrieval approaches using truncated singular value decomposition or related approaches, the final altitude resolution shall be expressed as averaging kernels. For global fit maximum likelihood retrievals (no regularization) the averaging kernels are by definition unity, but only in the native retrieval grid. In such cases, regridding 5 of data will give rise to non-unity averaging kernels. At the very least, the original grid and the interpolation scheme shall be reported. Ideally the data provider calculates the averaging kernels on the final grid on which the data are provided to the user.
R 10. The space to which the averaging kernel applies (e.g., linear/logarithmic, mixing ratio/density, absolute/relative, etc.) 10 shall be reported. This is particularly important when data are reported in a form that differs from that of the retrieval state vector [CoA 1, CoA 5, CoA 3, CoA 4]. E.g., the averaging kernels resulting from a retrieval of the logarithms of mixing ratios must not be applied to the mixing ratios themselves. It is thus of utmost importance to communicate to the data user to which quantities the averaging kernels refer. R 12. If an altitude-resolved retrieval is performed in any other space than state value over altitude, pressure, or likewise (e.g., if eigenvectors or similar are used, see Section 5.1), then the final result should in addition be presented as vertical profiles 25 and also all diagnostic data (error estimates, averaging kernels) should be transformed to the respective representation [CoA 5, CoA 2, CoA 3]. While these alternative representations certainly have their advantages, the data producer is in a better position than the data user to provide the diagnostic data for a profile representation.
R 13. Communication of a complete error budget for each profile, broken down to all components with all correlation information, along with averaging kernels and a priori information used, is not always technically feasible and often creates The following recommendations R 14-17 are applicable to the case when only representative diagnostic data are available.
In this context we would like to mention that there exist methods to convey the information content of a measurement at drastically reduced data volume (Migliorini et al., 2008). Such methods are particularly convenient in the context of data For example, in infrared emission spectroscopy the precision of concentration retrievals is usually worse for a colder atmo- 15 sphere. With this information a data user who is using a retrieval of a particular cold day which is not well represented by the sample error estimates is warned that the actual precision may be worse than the reported one.  CoA 5,]. In this context we distinguish between random and systematic errors.
1. We consider random error estimation schemes as adequate if a combination of the deduced error and the less-than-perfect spatial or temporal coincidences between two data sets and natural variability together explain the observed standard 30 deviation of the differences between two data sets. If predicted random errors fail to explain observed differences, they should be reassessed. Methods to find out which of the compared data sets has an inadequate random error estimate have been described in, e.g., Fioletov et al. (2006) 2. We consider estimates of the systematic errors to be adequate if they, along with sampling biases and after accounting for different vertical/horizontal/temporal resolutions and content of a priori information, explain the observed biases between independent instruments [CoA 5].
We consider it undesirable and a source of confusion to still report over-optimistic or pessimistic ex-ante error budgets without a related caveat if validation studies show that there is strong indication that the actual errors are significantly larger or smaller. 5 On the face of it, the list of recommendations appears quite weak, leaving a lot of freedom to the data provider. This is, however, not the case. Recommendation R 2, that the error budget should be as complete as possible, along with Recommendation R 18, which gives a criterion for the completeness of the error budget quickly make the apparent freedom disappear.
Admittedly, these recommendations will not guarantee perfect compliance with the conditions of adequacy, but due to the competing needs of rigor versus practicability the problem seems overconstrained. In other words, you 'can't always get what 10 you want' (Jagger and Richards, 1969). However, we are still confident that they help to unify uncertainty reporting in the community of remote sensing of atmospheric composition and temperature. These recommendations have been developed from the perspective of mainly satellite-borne limb sounding and occultation observations but some of these concepts are equally applicable to other types of remote sensing missions.

15
In this paper we have discussed conventional (as opposed to machine learning and artificial-intelligence based approaches) error estimation methods for Bayesian and non-Bayesian retrieval methods. The choice of the retrieval method is a dilemma. If likelihood-based methods are chosen, the retrieval lacks a probabilistic interpretation and ad hoc constraints will imply a bias, at least if the retrieval is conceived as a smoothed estimate of the true state. This horn of the dilemma is avoided by Bayesian methods, which use probabilistic constraints. Adherents of likelihood-based methods, however, will point out the second horn 20 of the dilemma, which is, that it is never warranted that the a priori statistics chosen indeed represents the true background state. Further, they will raise the concern that Bayesian methods, even if based on the true background statistics, may render bias-free estimates in the long run, but may be off the true atmospheric state in a single case. The decision for the acceptance of the one or the other horn of the dilemma is a philosophical one and in most cases it cannot be based on scientific grounds.
The only recommendation we can offer in this respect is a plea for mutual tolerance. Regardless which approach is chosen, the 25 data characterization has to be consistent with the retrieval method chosen. This paper tries to provide the scientific basis for this.
This paper is mainly addressed to providers of Level-2 data, i.e., data on atmospheric state variables. Some data users, however, prefer to work directly with Level-1 data, i.e., with measured radiances or transmissions. For example, the direct data assimilation of measured signals is sometimes preferred over the assimilation of retrieved state variables (e.g., Andersson 30 et al., 1994). The radiative transfer forward model is in this case included in the assimilation scheme. The advantage of this method is that it avoids all problems related to a priori knowledge and regularization. We have touched this approach only upon passing in this paper and do not want do delve deeper into this. The only caveat we wish to add is that the observation error covariance matrix should not include measurement noise only but also contributions by uncertain parameters not assimilated (Section 5.2). This will typically lead to non-sparse observation error covariance matrices which may be the source of some further headache.
In some fields of remote sensing of the atmosphere, retrieval methods based on artificial intelligence, neural networks and machine learning are explored (Lary et al., 2016). A precondition for unification of error reporting of classical and artificial-5 intelligence based retrieval schemes seems to be semantic connectibility. The glossary by Stanford University (2020) is considered as an important first step. With respect to the data characterization of retrieval products generated with such algorithms, two cases have to be distinguished. The first case is that a neural network is used as a surrogate radiative transfer forward model, while the retrieval still follows the concepts presented in this paper. In this case, the error estimation and data characterization strategies discussed in Section 6 are still applicable, and the approximative nature of the neural-network based radiative transfer 10 calculation can simply be conceived as a further source of forward modelling error. The second case is that machine learning algorithms are directly used for the retrieval. In this case, complete data characterization appears to be more challenging to us.
Sensitivity studies or supervised learning of uncertainty prediction may be two possible pathways towards data characterization of artificial-intelligence based retrievals. In either case it seems important to us that the data user is provided with the same full data characterization as required for the conventional retrieval schemes. 15 But even with the conventional retrieval and error estimation schemes there is a lot of homework to do. We hope that this review paper has identified the most relevant problems in this field and provides a conceptual framework to adequately characterize remotely sensed atmospheric temperature and composition data.    (1) For processors with multiple data products, the actual regularization may vary depending on the retrieved atmospheric parameter.
(2) sequential estimation using a Kalman filter (3) under consideration of horizontal gradients (4) sequential estimation in the spectral domain 5 (5) subsets of orbits are used.
(6) SAGE team's best guess as original documentation was lost.
(7) onion peeling for H 2 O.      (1) For processors with multiple data products, the actual regularization may vary depending on the retrieved atmospheric parameter, and whether it is a column or profile.
Code and data availability. N/A Author contributions. TvC, DAD and NJL organized the project. TvC, NJL and RD wrote major parts of the text. All authors contributed to the discussion of the paper and particularly the recommendations Competing interests. TvC, CvS and GPS are associate editors of AMT but have not been involved in the evaluation of this paper.
Acknowledgements. The World Meteorological Organization (WMO) has provided travel support through the Stratosphere-troposphere Pro-