This paper discusses a best-practice representation of uncertainty in satellite remote sensing data. An estimate of uncertainty is necessary to make appropriate use of the information conveyed by a measurement. Traditional error propagation quantifies the uncertainty in a measurement due to well-understood perturbations in a measurement and in auxiliary data – known, quantified “unknowns”. The under-constrained nature of most satellite remote sensing observations requires the use of various approximations and assumptions that produce non-linear systematic errors that are not readily assessed – known, unquantifiable “unknowns”. Additional errors result from the inability to resolve all scales of variation in the measured quantity – unknown “unknowns”. The latter two categories of error are dominant in under-constrained remote sensing retrievals, and the difficulty of their quantification limits the utility of existing uncertainty estimates, degrading confidence in such data.

This paper proposes the use of ensemble techniques to present multiple self-consistent realisations of a data set as a means of depicting unquantified uncertainties. These are generated using various systems (different algorithms or forward models) believed to be appropriate to the conditions observed. Benefiting from the experience of the climate modelling community, an ensemble provides a user with a more complete representation of the uncertainty as understood by the data producer and greater freedom to consider different realisations of the data.

All measurements are subject to error, the difference between the value
obtained and the theoretical true value (or measurand). Errors are
traditionally classified as “random” or “systematic” depending on whether they
would have zero or non-zero mean (respectively) over an infinite
number of measurements made under the same circumstances. The uncertainty of a
measurement describes the expected magnitude of the error by characterising
the distribution of error that would be found if the measurement were
infinitely repeated. These concepts are sketched in
Fig.
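The distinction can be illustrated numerically. In the following sketch (all values are invented for illustration), repeated simulated measurements of a known true value reveal the systematic error as the mean of the error distribution and the random error as its spread:

```python
import numpy as np

rng = np.random.default_rng(0)

true_value = 290.0          # the measurand, e.g. a temperature in K
bias = 0.5                  # systematic error (non-zero mean)
sigma = 1.2                 # random error (standard deviation)

# An "infinitely" repeated measurement, approximated by a large sample
measurements = true_value + bias + rng.normal(0.0, sigma, size=100_000)
errors = measurements - true_value

print(f"mean error (systematic): {errors.mean():.2f}")
print(f"std of error (random uncertainty): {errors.std():.2f}")
```

The histogram of `errors` is the distribution of error sketched in the figure; its mean and standard deviation recover the bias and noise used to generate it.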

Uncertainty is a vital component of data as it provides

a means of efficiently and consistently communicating the strengths and limitations of data to users, and

a metric with which to compare and consolidate different estimates of a measurand.

This paper aims to present a succinct outline of uncertainty and validation
and their best-practice application to satellite remote sensing of the
environment. Satellite remote sensing is a sequence of processes that
estimate a geophysical quantity from a measurement of the current or voltage
produced by a space-based detector in response to the radiation incident upon
it. Each step in processing, formally described in Table

An illustration of error and uncertainty. The error in a
measurement (purple arrow) is the difference between the true value of the measurand
(solid blue) and the value measured (dashed red). The black line shows the frequency
distribution of values that would be obtained if the measurement were infinitely
repeated, referred to as the distribution of error.

Standardised methods for uncertainty estimation can be insufficient for
satellite remote sensing data as they assume a well-constrained measurement
where the sources of error are established –

Ensemble techniques, a method widely used in the weather and climate communities, provide multiple self-consistent realisations of a data set as a means of representing non-linear error propagation and variations resulting from ambiguous representations of natural processes. This paper argues that such techniques provide an effective means to represent and communicate the uncertainty resulting from the latter two categories of “unknowns” affecting satellite remote sensing data.

Satellite data processing levels, adapted from

The discussions to follow aim to be accessible to both users and producers of
satellite remote sensing data, and the issues considered apply (theoretically)
to all satellite-based instruments. The relative importance of each point
will depend on the precise technique in question, and the concepts will not be
discussed for every possible measurement. Illustrative examples will
primarily draw from the characterisation of aerosol, cloud, and the surface
with a hypothetical nadir-viewing radiometer in a low Earth orbit (

Section

A generalised description of a retrieval technique is that it uses
observations

If a hat denotes the theoretical true value of a quantity or function, the
error in the retrieval is given by

Random fluctuations in the measurement, such as thermal fluctuations and shot
noise. These are unavoidable but generally propagate linearly and are (at least
approximately) normally distributed, such that the uncertainty can be represented
by the standard deviation of their distribution. When using
Eq. (

where

Simplifications and approximations made in the technique. These errors are systematic and are unlikely to be quantified (as they would have been included in the forward model if they were). Such errors are commonly characterised through validation.

The degree to which the observation is representative of the situation it is proposed to describe. These are especially important for satellite observations, where measurements are averaged over some volume of the atmosphere that does not necessarily correspond to the scale of physical perturbations, such as turbulent mixing or cloud contamination.

These considerations compound when considering the uncertainty resulting from
the use of auxiliary parameters,

The metrological community has prepared an extensive summary of best-practice
in the assessment of uncertainty in measurements – the

In clause 0.4, the GUM states that an ideal method for evaluating uncertainty
should be

These conventions apply equally to satellite remote sensing data but
represent an impractical ideal that does not help an analyst fully represent
their understanding of the uncertainty in their data. This is due to the
simplistic treatment of systematic errors. Clause 3.2.4 of the GUM states
that, “It is assumed that the result of a measurement has been corrected for
all recognized significant systematic effects and that every effort has been
made to identify such effects.” While data producers put significant effort
into identifying systematic errors, their quantification can be a difficult
and occasionally impossible task. For such errors, it is unclear that their
distribution is symmetric, such that the emphasis on traditional error
propagation contributes to many analysts neglecting important systematic
errors as they cannot be quantified with confidence

The magnitude and nature of the systematic errors experienced are a function of the state observed. A common example is the differing treatment of land and sea surfaces. Averaging adjacent retrievals will not necessarily combine errors sampled from the same distribution. As the uncertainty of a retrieval is a function of the environment observed, it must be ascertained on a pixel-by-pixel basis to be meaningful.

The basis chosen to describe a system also impacts the expression of
uncertainty. Consider the retrieval of cloud top temperature or pressure from
measurements by a nadir-viewing infrared radiometer

If errors are expected to be small (as in the radiance to temperature transform), the non-linearity will be minimal and a variance-based representation of error is sensible. Otherwise, the distribution of error may be skewed or asymmetric such that one value is insufficient to describe it. Ensemble techniques can provide the additional information required to characterise the distribution of error properly.
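The effect can be demonstrated with a hypothetical non-linear transform (an exponential is used here purely for illustration): a symmetric, Gaussian error in the measured quantity produces a skewed error distribution in the derived quantity, so a single standard deviation no longer describes it:

```python
import numpy as np

rng = np.random.default_rng(1)

x_true = 1.0
x = rng.normal(x_true, 0.3, size=100_000)   # symmetric measurement error

# Hypothetical non-linear retrieval step
y = np.exp(x)
y_true = np.exp(x_true)
err = y - y_true

# The propagated errors are no longer symmetric about zero
skew = np.mean(((err - err.mean()) / err.std()) ** 3)
print(f"mean error: {err.mean():.3f} (non-zero despite unbiased input)")
print(f"skewness:   {skew:.2f} (0 for a symmetric distribution)")
```

The propagated distribution acquires both a bias and a positive skew, even though the input error was unbiased and symmetric.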

Distortion of the distribution of error for different selections of
measurand when observing a cloud. (Non-linearities exaggerated for illustration.)

As illustrated above, the standard error propagation techniques do not
properly represent the distribution of non-linear errors. In such situations,
the uncertainty can be approximated by the variation in an ensemble of
individually self-consistent predictions. An example is numerical weather
prediction (NWP). Rather than predict the weather from the output of a single
model run, multiple runs are performed

Non-linear error propagation in satellite remote sensing observations can be
characterised via ensembles. Each member of the ensemble adds a random
perturbation to the measurements
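A minimal sketch of this approach, using an invented one-variable forward model: each ensemble member perturbs the measurement within its noise, the retrieval is repeated, and the spread of the retrieved states characterises the propagated uncertainty, non-linearity included:

```python
import numpy as np

rng = np.random.default_rng(2)

def forward(x):
    """Hypothetical non-linear forward model: state -> measurement."""
    return x ** 3

def retrieve(y):
    """Inverse of the forward model for this one-variable example."""
    return np.cbrt(y)

y_obs = 8.0        # the actual measurement (so the state is ~2)
sigma_y = 0.5      # measurement noise standard deviation

# Each ensemble member re-runs the retrieval on a perturbed measurement
members = retrieve(y_obs + rng.normal(0.0, sigma_y, size=5_000))

print(f"retrieved state:      {retrieve(y_obs):.3f}")
print(f"ensemble spread (1σ): {members.std():.3f}")
```

For a real retrieval, `retrieve` would be an iterative inversion rather than an analytic inverse, but the principle is unchanged: the ensemble spread stands in for an analytically propagated uncertainty.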

Ensembles are also widely used in the climate modelling community

Such ensembles could be useful to assess the impact of a priori assumptions in poorly constrained retrievals (such as the selection of aerosol microphysical properties). To illustrate the concept, consider estimating the volume of an aluminium bucket knowing only its mass. As the density of aluminium is known and the thickness of metal used to make the bucket is assumed, the mass can be converted into a surface area. The volume is then determined from the surface area by assuming the shape and height of the bucket. That choice of shape (i.e. the forward model) will greatly affect how the retrieval interprets the mass measurement.

This is portrayed in Fig.

When the bucket is assumed to have a height of 12 cm (purple), the three different models produce consistent results for masses between 0.15 and 0.3 kg. The error due to using an inappropriate model there will be small but increases for masses > 0.3 kg. The error is a function of the true state.

For a height of 24 cm (red) the models diverge greatly; a 0.32 kg bucket could have a volume between 0.10 and 11 L. Thus, the use of an incorrect model will introduce substantial error. The error is a function of the forward model's parameters.

In this example the actual shape of the bucket is not known, so it is not possible to rigorously quantify the error resulting from the choice of forward model. Without additional information, the results for a hemispherical bucket are just as valid as a conical one despite their significantly different interpretations of the data (e.g. a hemispherical bucket has a minimum mass for a given height while a conical one does not).

An ensemble of forward models for the volume of a bucket (

The form of the ensemble will depend on its intended use and a priori knowledge. In this example, the ensemble would be three estimates of the volume (one for each shape). The uncertainty resulting from errors in the weight, density, and thickness would be given separately for each ensemble member. If genuinely nothing were known about the height, the ensemble could be extended to represent a range of heights. In reality, some auxiliary information will exist that should constrain the values.
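The bucket ensemble can be sketched numerically. The density, wall thickness, and the cylinder/cone/hemisphere trio below are illustrative assumptions standing in for the ensemble of forward models:

```python
import math

RHO = 2700.0    # density of aluminium, kg m^-3
THICK = 1e-3    # assumed wall thickness, m

def area_from_mass(mass):
    """Convert mass to the surface area of metal used."""
    return mass / (RHO * THICK)

def cylinder_volume(mass, height):
    # Open-topped cylinder: A = pi r^2 + 2 pi r h  ->  solve for r
    a = area_from_mass(mass)
    r = -height + math.sqrt(height ** 2 + a / math.pi)
    return math.pi * r ** 2 * height

def cone_volume(mass, height):
    # Open-topped cone (apex down): lateral area A = pi r sqrt(r^2 + h^2)
    a = area_from_mass(mass)
    lo, hi = 0.0, 10.0
    for _ in range(60):                      # bisect for the rim radius
        r = 0.5 * (lo + hi)
        if math.pi * r * math.sqrt(r ** 2 + height ** 2) < a:
            lo = r
        else:
            hi = r
    return math.pi * r ** 2 * height / 3.0

def hemisphere_volume(mass, height=None):
    # Hemisphere: A = 2 pi r^2; its height is fixed at r by geometry
    r = math.sqrt(area_from_mass(mass) / (2.0 * math.pi))
    return 2.0 * math.pi * r ** 3 / 3.0

# An ensemble of volume estimates for a 0.25 kg bucket assumed 12 cm tall
mass, height = 0.25, 0.12
ensemble = {f.__name__: f(mass, height) * 1e3   # m^3 -> litres
            for f in (cylinder_volume, cone_volume, hemisphere_volume)}
print(ensemble)
```

Each entry of `ensemble` is one self-consistent realisation of the volume; nothing in the mass measurement itself distinguishes between them.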

The standard deviation across ensemble members may be a useful proxy where
the models are consistent, as in the 12 cm slice, but not generally.
Non-linear errors can be most meaningfully described through an ensemble,
with which many users already have extensive experience

This example is artificial but illustrates the utility of ensemble techniques to satellite remote sensing
data.

Retrievals of aerosol optical depth are strongly affected by the choice of aerosol microphysical
properties. Analogous to the choice of bucket shape, these properties alter the form of the forward
model and introduce unquantifiable errors. An ensemble can be produced by evaluating the observations
with various models, as currently performed by the MISR

A variety of techniques can be used to merge multiple satellite sensors into a single, long-term
product, such as the Jason-1 and Jason-2 mean sea-level missions

Retrieval parameters and auxiliary data have associated uncertainties. Where the propagation of
these is highly non-linear, they can be estimated via ensemble techniques analogous to the NWP approach, as
done by

Errors that are correlated over large temporal and/or spatial scales are impractical to calculate and
represent with traditional covariance matrices. Ensembles have been used to represent these in sea surface
temperature (SST) products

In essence, the ensemble approach is useful for characterising the error
resulting from an incomplete description of the situation observed. At the
expense of increased data volume, an ensemble provides the user with

a more appropriate representation of the uncertainty resulting from the realisation of the problem, and

the freedom to select the portrayal(s) of the data most appropriate to their purposes.

Despite their extensive use in the community (and this paper), the classification of errors as random or systematic is limited. A random error can appear to introduce a systematic bias after propagation through a non-linear equation, as the propagated distribution becomes asymmetric, and the distribution of a systematic error has finite width. These terms are better understood as synonyms for the non-technical meanings of noise and bias, respectively.

The GUM chose to eschew the classification of error altogether, instead classifying uncertainties as type A and B depending on whether they are calculated from an observed frequency distribution (i.e. traditional statistical techniques) or an assumed probability density function. This provides an important focus on the different techniques through which uncertainty is calculated but does not address the interest of data users in understanding the cause of errors in a measurement. The source of an error affects how it is realised and its relative importance in the eyes of data producers and users. Five classifications of error by source are proposed, which will be discussed in turn.

Measurement errors result from statistical variation in the measurand or random fluctuations in the detector and electronics. To assess these accurately, it is important that a measurement is traceable to a well-documented standard. This requires the straightforward (if not simple) comparison of an instrument to a thoroughly characterised reference. Further, the response of any instrument will evolve over time, necessitating the periodic repeat of calibration procedures.

Satellite radiometers are characterised prior to launch

Vicarious methods of calibration can be used, whereby the response of the
instrument to a known stimulus is considered

Retrievals using satellite observations virtually always require auxiliary
information as there is insufficient information available to retrieve all
parameters of the atmosphere and the surface simultaneously. For example, the
accuracy of line-by-line radiative transfer calculations depends upon the
spectroscopic data used

It is not always practical to evaluate the most precise formulation of a
forward model. For example, the atmosphere may be approximated as plane
parallel to simplify the equations or look-up tables (LUTs) may be used rather
than solving the equations of radiative transfer. Such approximations will
introduce error. Often known as “forward model error”

How a measurand is defined affects which errors are relevant. Summarising clause D.3 of the GUM, consider the use of a micrometer to measure the thickness of a sheet of paper. As the sheet will not be uniform, the true value depends on the precise location of the measurement. Hence, when measuring “the thickness of this sheet of paper”, the variation of thickness across the sheet is an additional source of error to be considered when estimating the uncertainty. This error can be neglected by defining the measurand as “the thickness of this sheet of paper at this point”, but that is of little practical use. Similarly, “the thickness of a sheet of paper from this supplier” is a more useful measurand, for which the error due to variations between different sheets would also need to be considered.

A datum in a satellite product is understood to represent an average of some
physical quantity over the observed pixel at a specified time. Compared to
the situations considered in the GUM, these suffer a number of important
limitations.

It is not possible to redefine the scope of the measurand (i.e. changing from “this sheet of paper” to
“a sheet from this supplier”) as that is prescribed by the optics of the instrument. What will be called
the

The perturbations are not necessarily independent. For example, in the open ocean it is reasonable to expect that mixing will homogenise SST over a pixel, but in coastal waters variations in depth and sediment concentration introduce spatially correlated perturbations that will not average to zero.

Unlike the thickness example, it is not possible to repeat the observation. Atmospheric states evolve over minutes to hours and influence (to some extent) any environmental observation such that two instruments can never strictly observe the same state. This contrasts with laboratory-based measurements, where experiments generally accumulate statistical confidence through repeated measurement of equivalent circumstances.

The last point can be addressed by averaging adjacent pixels from the same
sensor. When done with Level 1 data, this is known as superpixeling
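The benefit and the limit of such averaging can be sketched with invented noise values: averaging an N × N block of pixels reduces the random component of error by a factor of N but leaves any systematic component shared by the block untouched:

```python
import numpy as np

rng = np.random.default_rng(3)

n = 4                         # superpixel is an n x n block
sigma = 1.0                   # random noise per pixel
bias = 0.3                    # systematic error shared by the block

# Many superpixels, each averaging n*n noisy Level 1 pixels
pixels = bias + rng.normal(0.0, sigma, size=(20_000, n * n))
superpixels = pixels.mean(axis=1)

print(f"random error per pixel:      {sigma:.2f}")
print(f"random error per superpixel: {superpixels.std():.3f}  (~ sigma/n)")
print(f"systematic error remains:    {superpixels.mean():.3f}")
```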

When Level 2 data are aggregated onto a regular grid, the result is Level 3
data. Averages over hundreds of kilometres and days to weeks are similar to
the scales evaluated by climate models, and the volume of data is vastly more
manageable. Such data are susceptible to additional limitations.

The definition of the measurand is even more important. It may appear sufficient to describe
a product as (for example) “average SST in March 2005 over 30–31

Satellite products are only representative of the time they observe

Resolution errors are a function of the pixel size and the variability of the measured quantity. A satellite datum is interpreted as a spatial average over the footprint of the pixel. This presumes that the value retrieved is equal to the average of retrievals from infinitely high spatial resolution data (i.e. the derivative of the product with respect to the measurement is linear for variations within the pixel). While this approximation holds in many circumstances, it is not universally true and certainly breaks down as pixels are aggregated to represent a larger spatial scale.

For retrievals that use an a priori constraint, each retrieved value contains a contribution from the a priori. When averaging, if the a priori is not “removed” from the value, it will contribute repeatedly to the average, biasing it. Neglecting covariance between state vector elements, this can be done via

To account for covariance, see Eq. (10.47) of
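One common scalar form of this correction can be sketched as follows (a hedged illustration, not necessarily the formulation cited): if each retrieval satisfies x̂ = x_a + a (x − x_a) for a scalar averaging kernel element a, the a priori contribution can be deconvolved from each value before averaging:

```python
import numpy as np

x_a = 10.0                                  # shared a priori value
x_hat = np.array([10.4, 10.9, 10.2, 11.1])  # retrieved values
a = np.array([0.6, 0.8, 0.5, 0.9])          # averaging kernel elements

# A naive average retains a (1 - a) weighted contribution from x_a
naive = x_hat.mean()

# Deconvolve each retrieval before averaging: x' = x_a + (x_hat - x_a) / a
deconvolved = x_a + (x_hat - x_a) / a
corrected = deconvolved.mean()

print(f"naive average:     {naive:.2f}")
print(f"corrected average: {corrected:.2f}")
```

The naive average is pulled toward the a priori; the deconvolved average is not, at the cost of amplifying the random error where a is small.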

The interaction of cloud with the radiation field is sufficiently complex and
variable that it is not generally possible to retrieve its properties
simultaneously with the surface and/or other atmospheric constituents. Hence,
most atmospheric measurements are pre-filtered for the presence of cloud via
one of a plethora of empirical techniques

The filtering process impacts the sampling of the product, as regions with
persistent cloud cover will be neglected. Level 3 products are particularly
susceptible to these sampling effects. The concept is also known as
“fair-weather bias” as the exclusively clear-sky conditions considered are
not necessarily representative of the long-term average conditions that the
measurand purports to describe (an example can be found in

Filtering can remove exceptional events. Aerosol retrievals often assume all data with optical thickness above some threshold are cloud contaminated, but it is possible for dust or volcanic ash to achieve an optical thickness above any useful threshold. This systematically removes high optical depths from long-term averages, producing a low bias in average products and failing to characterise the largest (and potentially most important) events. Such limits should be stated within the product definition to make this distinction clear.

Sampling is also affected by the instrument swath. As examined in

The stochastic change in TOA radiance due to the presence of cloud (or other
optically thick layer such as smoke or volcanic ash) is a long-standing
problem in satellite remote sensing. The issue is that the forward model,

If there is no a priori knowledge of which system is appropriate, the forward
model could be formed from the linear sum of all possible systems;
e.g.

Another technique is to assume the measurements are of a specific system
(i.e. one of the weights is unity and the others are zero). The choice of
system is based on prior knowledge, usually relative values of radiances or
their spatial variability (e.g. the cloud flagging discussed in
Sect.

An alternative approach is to perform a retrieval with each relevant system
in turn and choose a posteriori the best system

The combined impact of approximation, resolution, and system errors was
defined as “structural uncertainty” by

Measurement and parameter errors are both intrinsic sources of uncertainty in a retrieval. Measurement errors affect the quantities measured and analysed by the retrieval. Parameter errors are propagated from auxiliary inputs, such as meteorological data or empirical constants. Resolution errors result from finite sampling of a constantly varying system. These can be especially important as satellites do not sample the environment randomly but with a systematic bias due to the satellite's orbit and quality control or filtering.

Approximation errors represent aspects of the analysis that could have been done more precisely but do not affect the fundamental measurand. A plane parallel atmosphere is a simplification of the real world; it would not be observed. System errors express choices in the analysis that alter the measurand. An assumed aerosol optical model will represent a possible state of particulates in the atmosphere; it may be unlikely but still possible. The system error results from the difference between the assumed system and reality.

One-dimensional representation of a retrieval considering multiple systems (realisations of the forward model that do not necessarily retrieve the same variable). For a system, the retrieved state is the minimum of its cost function (indicated by a circle). The state with globally minimal cost (across all systems) is a posteriori taken as the best representation of the observed environment.

Validation is a vital step in the production of any data set, confirming that
the data and methodology are fit for their purpose. Often thought of as the
conclusion of data generation, it provides guidance for future development of
the algorithm and so is better considered a step in the cycle of retrieval
development (see Fig.

This paper construes a validation as a comparison against real data only. There is use in evaluating the performance of an algorithm against simulated data, but that is considered a step in retrieval refinement (confirming it behaves as expected in controlled conditions) rather than a validation.

The cycle of retrieval development. The initial formulation and algorithm are repeatedly revised in light of internal validation activities. When consistent results are achieved, an external validation is performed (and published) to begin the operational cycle, where data are generated and disseminated. The application and critique of the data by the scientific community then feeds into further refinement of the algorithm (or entirely new algorithms). The development and operational cycles continue independent of the larger cycle but over time operations will increasingly dominate resources as the product becomes increasingly fit for purpose.

Users will be most familiar with external validation – the comparison of observations from two or more instruments. This focuses on quantifying the correlation and difference between data sets. While such validation activities are fundamental to the characterisation and minimisation of systematic errors, they should not be confused with a quantification of uncertainty. Validation techniques are neither universal (being dependent on the collocation criteria), internally consistent (as external data are used), nor transferable (being representative of only the conditions considered).

When comparing two data sets, neither quantifies “the truth” (even when one is substantially more precise than the other). Both have associated errors, random and systematic, such that all that can be said is that the products are consistent with each other. Also, simply because two measurements purport to quantify the same measurand does not mean they actually do. Weighting functions illustrate the difference in sensitivity between instruments.

As an illustration, consider cloud top height (CTH). The entire cloud emits
thermal radiation, much of which will be scattered or absorbed within the
cloud. Radiation from the cloud observed by a satellite corresponds to
photons that found an unimpeded path to TOA. Hence, a radiometer quantifies
an average of the cloud's temperature profile weighted by the probability
that a photon from that level can arrive at TOA. The distribution of the
weight is known as the weighting function, and is sketched in red in
Fig.

A very simple model of this situation assumes that radiation increases
linearly with optical path
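This can be sketched numerically with an invented extinction profile. The transmittance from each level to TOA defines the weighting function, and the radiometer's effective cloud radiating height falls below the physical (lidar-like) cloud top:

```python
import numpy as np

z = np.linspace(0.0, 12.0, 1_200)            # height grid, km
dz = z[1] - z[0]

# Hypothetical cloud: uniform extinction between 8 and 10 km
extinction = np.where((z > 8.0) & (z < 10.0), 1.0, 0.0)   # km^-1

# Optical depth from each level to the top of the atmosphere
tau_to_toa = np.cumsum(extinction[::-1])[::-1] * dz
transmittance = np.exp(-tau_to_toa)

# Weighting function: probability a photon from level z reaches TOA
weight = extinction * transmittance
effective_height = np.sum(z * weight) / np.sum(weight)

lidar_top = z[extinction > 0].max()
print(f"physical cloud top (lidar): {lidar_top:.2f} km")
print(f"effective radiating height: {effective_height:.2f} km")
```

The weighted height sits several hundred metres below the geometric cloud top, anticipating the bias discussed below.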

The Cloud-Aerosol Lidar with Orthogonal Polarization (CALIOP) is commonly
used to validate CTH

Schematic of the weighting functions for CTH for an infrared
radiometer (red) and lidar (black), with dashed lines denoting the value retrieved.

A direct comparison of these two products will find that radiometer-retrieved
CTHs are consistently lower than those from the lidar. To validate the
satellite against the lidar properly, it is necessary to use the satellite's
weighting function to calculate an “effective cloud radiating height” from
the lidar profiles (see, for example,

More formally, a weighting function describes the dependence of a measurement
on the underlying state. When the state chosen to describe a measurement is
not an orthogonal basis of the observed state, a variable in the state vector
will not uniquely determine an element of the true state. The relationship
between the retrieved state and true state is expressed by the averaging
kernel

Consider where

The off-diagonal elements of the averaging kernel represent aspects of the state that cannot be resolved by the chosen basis and forward model. Here, a two-layer cloud cannot be properly represented when the basis only describes the properties of a single-layer cloud. The characterisation of an averaging kernel may require the use of an extended state vector and simulations with a more detailed model. (If the retrieval had been posed over that extended state vector, the averaging kernel would have been diagonal.)
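A minimal numpy sketch of the aliasing this describes (the matrix values are invented): with a non-diagonal averaging kernel A, the mapping x̂ = x_a + A(x − x_a) lets a perturbation in an unresolved element of the true state leak into the retrieved one:

```python
import numpy as np

# Extended state: [upper-cloud property, lower-cloud property]
x_a = np.array([250.0, 270.0])      # a priori (e.g. layer temperatures, K)

# Invented averaging kernel: the retrieval only resolves one combination
A = np.array([[0.7, 0.3],
              [0.0, 0.0]])          # second element is not retrieved

def retrieved(x_true):
    return x_a + A @ (x_true - x_a)

single_layer = np.array([240.0, 270.0])   # only the upper layer differs
two_layer = np.array([240.0, 260.0])      # lower layer perturbed too

print(retrieved(single_layer))  # upper layer partially resolved
print(retrieved(two_layer))     # lower-layer change aliases into element 0
```

The same change in the upper layer yields different retrieved values depending on the unresolved lower layer, exactly the behaviour a diagonal averaging kernel would preclude.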

Retrievals will be compared over some collection of observations representing only a subset of the realisable state vectors (e.g. a SST product compared to ship-based measurements will only encapsulate the variation in SST over major shipping lanes rather than globally). As systematic errors are circumstantial, this collection represents only a sample of the complete distribution – just as the definition of a measurand frames how its value can be understood and used, the scope of a validation frames the understanding of systematic errors.

Towards the aim of repeatability, validation should be performed in a manner such that, if an additional source of data were introduced (e.g. a new instrument site or satellite orbit), the conclusions would not be expected to change. In the highly common case that there are insufficient data to achieve this, the scope of the validation should be clearly outlined.

One would naïvely judge whether two retrievals are consistent by considering,

Different algorithms have distinct sensitivities to the same input
information. Products from different sensors consider distinct inputs and so
react differently to the same atmospheric state. Even where channels with
similar wavelengths are used, they will have different band passes which
subtly affect their sensitivity (weighting functions). For example, the
scattering properties of smaller droplets change more rapidly with wavelength
than those of larger droplets. In Fig.

When independent observations are not available to externally validate data,
one can compare a product to model output provided the model is sampled as if
viewed by a satellite. The retrieval's averaging kernel and weighting
functions are necessary to translate the physical variables quantified by the
model (e.g. particle number density) into the observed measurand. Further, a
method for estimating the random error variance of a geophysical variable
from three collocated data sets was proposed by
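A generic sketch of such a three-way estimate (an illustration of the idea, not necessarily the cited formalism): when three collocated data sets have mutually independent random errors, the error variance of each can be estimated from products of pairwise differences:

```python
import numpy as np

rng = np.random.default_rng(4)

n = 200_000
truth = rng.normal(20.0, 2.0, size=n)        # e.g. an SST field, deg C

# Three collocated estimates with independent random errors
x = truth + rng.normal(0.0, 0.5, size=n)
y = truth + rng.normal(0.0, 0.8, size=n)
z = truth + rng.normal(0.0, 0.3, size=n)

# Error variance of x from pairwise differences (biases assumed removed)
var_x = np.mean((x - y) * (x - z))
print(f"estimated sigma_x: {np.sqrt(var_x):.3f}  (true value 0.5)")
```

The common truth cancels in each difference, leaving only products of error terms whose expectation is the error variance of the shared member.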

The formalism of

Equation (

When one product is of much higher resolution, such as the comparison against CALIOP described in Sect.

As Eq. (

Expected error envelopes are a common means of presenting the result of a
validation of, for example, aerosol optical depth

This is an efficient means of communicating the results of the validation
against AERONET and conveys a quantitative measure of the degree of certainty
the data producer has in their product. It is not, strictly, an estimation of
uncertainty. Such validation techniques are neither universal (being
dependent on the collocation criteria), internally consistent (as external
data are used), nor transferable (being representative of only the conditions
considered). Though envelopes provide a diagnostic approximation of the
uncertainty, additional correction is necessary to use them as prognostic
uncertainties

This application of envelopes conveys an incorrect appreciation of the uncertainty to users as it implies well-constrained random and systematic components. Though stratification by relevant circumstances (e.g. over desert, high aerosol loading) indicates that the error depends on the state observed, a simple expression cannot usefully communicate the distribution of error in any particular measurement. Only pixel-level estimates provide an uncertainty consistent with its widely accepted definition, and the presentation of ensembles (already used in the calculation of these envelopes) can better represent the distribution of errors not quantified in that estimate.
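An envelope check of this kind can be sketched as follows, against synthetic collocations; the envelope coefficients ±(0.05 + 0.15 τ) are illustrative assumptions, not a particular product's published values:

```python
import numpy as np

rng = np.random.default_rng(5)

# Synthetic collocations: sun-photometer-like truth and satellite retrievals
tau_true = rng.gamma(2.0, 0.1, size=10_000)
tau_sat = tau_true + rng.normal(0.0, 1.0, size=10_000) * (0.02 + 0.1 * tau_true)

# Expected error envelope, e.g. +/-(0.05 + 0.15 tau)
envelope = 0.05 + 0.15 * tau_true
within = np.abs(tau_sat - tau_true) <= envelope

print(f"fraction within envelope: {within.mean():.2%}")
```

The single reported fraction summarises the whole validation but says nothing about the error distribution of any individual retrieval, which is the limitation argued above.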

Internal validation is a less frequently discussed means to assess the precision and consistency of measurements.

Repeated observations of an unchanged target should sample the distribution
of error, such that a histogram of the observations should be Gaussian with a
standard deviation equivalent to the uncertainty. An opportunity for this
type of repeated observation is rare with satellite instruments. More common
is the sampling of the same point in successive orbits (often near the
poles), assembling pairs of measurements of similar (if not identical)
atmospheric states

Atmospheric variation may increase the observed variability, so a standard deviation larger than expected is not necessarily cause for concern. A normalised variance less than one, however, usually indicates an underestimation of the uncertainty. Significant departure from a Gaussian distribution is indicative of unidentified systematic errors. If the variable is expected to be homogeneous across a region, all observations there can be used to validate the uncertainty directly, as the variance of the observations should be no less than that implied by the average of the uncertainties.
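A sketch of this check with synthetic values: differences between paired observations, normalised by their combined uncertainty, should have a standard deviation of one if the reported uncertainties are well estimated:

```python
import numpy as np

rng = np.random.default_rng(6)

n = 50_000
truth = rng.normal(0.0, 1.0, size=n)         # the (unknown) common state
u1, u2 = 0.4, 0.6                            # reported uncertainties

obs1 = truth + rng.normal(0.0, u1, size=n)
obs2 = truth + rng.normal(0.0, u2, size=n)

# Normalised differences: std ~ 1 if uncertainties are well estimated,
# > 1 if they are underestimated (or the state varied between observations)
norm_diff = (obs1 - obs2) / np.hypot(u1, u2)
print(f"std of normalised differences: {norm_diff.std():.3f}")
```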

Using different forward model assumptions, statistical techniques, and/or
filtering methods can produce results that may be consistent with themselves
and external validation but not with each other. Differences between
retrievals, in the absence of external validation data or a programming
error, indicate variations in the state within the unconstrained state space.
They form an ensemble that illuminates where the formulation of the problem
is most relevant, highlighting where future research could be concentrated to
represent the observations more carefully

Example of an error budget.

Confidence in data is communicated to users through uncertainty estimates and quality assurance statements. The quantification of uncertainty illustrates how new data relate to the existing body of knowledge, but there is also the user's qualitative sense of the “worth” of data. To what extent does it constrain the variables they are investigating? When and where are the data most robust and when and where do they effectively convey no information? What do they quantify that was not already known? The aims of the user frame these questions. A detailed case study requires reliable uncertainty estimates to incorporate varied measurements and understand the limitations of the information provided, but it is impractical for a 20-year model climatology to consider a single measurement, still less its uncertainty.

Further, the “unknown unknowns” affecting satellite remote sensing data are not completely indescribable. Information such as “results are often unreliable over deserts” is still important to users, even if the uncertainty cannot be quantified. Dialogue with users is important both to improve their understanding of the data and to gather feedback for future development.

The aim of an error budget is to classify the contributions to the
uncertainty by their source. At its simplest this may be in the form of a
table, as suggested in Table
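As a minimal, hypothetical illustration of such a table (all sources and magnitudes invented for the example), independent 1σ contributions combine in quadrature to give the total quantified uncertainty, and the fractional variance identifies the dominant terms:

```python
import math

# Hypothetical error budget: each entry is a 1-sigma contribution in the
# units of the measurand, assumed independent and Gaussian.
error_budget = {
    "instrument noise":             0.010,
    "calibration":                  0.020,
    "auxiliary data":               0.015,
    "forward-model approximation":  0.030,
}

# Independent contributions combine in quadrature.
total = math.sqrt(sum(u**2 for u in error_budget.values()))

for source, u in error_budget.items():
    share = 100.0 * u**2 / total**2
    print(f"{source:<30s} {u:6.3f}  ({share:4.1f}% of variance)")
print(f"{'total (quadrature sum)':<30s} {total:6.3f}")
```

Such a table makes plain which term dominates the budget and therefore where effort to reduce the uncertainty would be best spent.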

Quality assurance (or flagging) is a qualitative judgement of the performance of a retrieval and the suitability of that technique for processing the data. This complements the uncertainty, whose calculation assumes that the forward model is appropriate to the observed circumstances. Statistical distributions are unsuited to show when an algorithm fails to converge, converges to an unphysical state, encounters incomprehensible data, or observes circumstances beyond the ability of its model to describe. Provided it is described in the language of a statement of confidence, quality assurance provides useful information.

The difficulty is that a simple flag is a coarse means of communication. For
example, MODIS Collection 5 aerosol products provided a data quality flag of
value 0, 1, 2, or 3 to describe increasing confidence in the retrieval
method; in practice, many users simply discarded all data with a flag less
than 3, losing potentially useful information.

However, such filtering is a logical response to this presentation of information. A more useful scheme would provide multiple separate flags (e.g. presence of cloud, challenging surface conditions, failure to converge) in a bit mask. When these are properly documented, they allow an attentive user to evaluate the impact of using data degraded by a specific feature, while even a less invested user may be inspired to consider, if only briefly, the most appropriate flags for their purposes.
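A sketch of such a bit-mask scheme (the bit assignments and condition names here are illustrative, not those of any real product):

```python
# Each bit records one independent condition, so users can filter on the
# conditions relevant to their study rather than a single aggregate score.
FLAG_CLOUD       = 1 << 0   # possible cloud contamination
FLAG_SURFACE     = 1 << 1   # challenging surface (e.g. desert, sun glint)
FLAG_NO_CONVERGE = 1 << 2   # retrieval failed to converge
FLAG_EXTRAPOLATE = 1 << 3   # state outside look-up-table bounds

def describe(flag: int) -> list[str]:
    """Return the names of the conditions set in a quality flag."""
    names = {FLAG_CLOUD: "cloud", FLAG_SURFACE: "surface",
             FLAG_NO_CONVERGE: "no convergence",
             FLAG_EXTRAPOLATE: "extrapolation"}
    return [name for bit, name in names.items() if flag & bit]

# A user tolerant of surface effects but not of cloud can test the one bit
# that matters to them:
flag = FLAG_CLOUD | FLAG_SURFACE
print(describe(flag))               # ['cloud', 'surface']
print(bool(flag & FLAG_CLOUD))      # True -> reject this pixel
```

The same flag value supports different, documented filtering decisions by different users, which a single 0–3 score cannot.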

Satellite remote sensing data have existed for several decades, but the
retrieved geophysical quantities evolve as additional auxiliary data become
available and new scientific problems appear. For example, AVHRR measurements
from 1978 are still reprocessed for climate studies. Development of a
retrieval system should continue until either:

1. the data set sufficiently addresses the needs of its users; or

2. the maximal amount of information has been extracted from the measurement and additional information is required to meet the needs of users.

The sequence of scientific output needed to underpin satellite observations. The instrument, calibration, and algorithm descriptions may be contained in one or more publications. Significant iterations of the retrieval algorithm are usually described in a new publication.

Levels of system maturity, as defined in

Excerpts of the system maturity matrix defined by

The appropriate presentation of data with thorough documentation and metadata produced using a publicly available, consistently realised computer code is a desirable aim. Such features should be included in any algorithm from inception to minimise simple mistakes and the misunderstanding of data by users. However, the presence of such features does not address the scientific quality or importance of the data.

The proposed metric simply counts the citations the data have received, disregarding the variety of applications and their impact upon scientific understanding. Participation in international data assessments works towards this aim, but only when there are multiple means of observing or evaluating a measurand. These are not available for many environmental variables, and such data sets should not be considered immature if they make the best use of the information available (goal 2).

It is important that an inexperienced user should not misinterpret data with a high maturity index as being more accurate or suited to a particular study. A mature data set is one which is near the end of its development cycle in that it is agreed to be fit for purpose by the scientific community. This must not be confused with a data set that fully constrains the measurand.

With specific regard to the evaluation of uncertainty:

As discussed in Sect.

The spatial covariance of error in a product can only be quantified through validation against spatially distributed, independent data. Satellite remote sensing is used for many environmental products precisely because they are impractical to measure from the ground, and in such cases it is not possible to assess the error covariance independently. Ensemble techniques may be useful there.

A distinction must be made between internal and external validation activities. An international assessment of multiple, independent products from different measurement techniques that quantify equivalent measurands represents the external validation of a mature research area. An internal validation of differing algorithms from the same sensor evaluates the relative properties of the algorithms, not their suitability for quantifying the measurand.

Monitoring the progress of algorithm development must be done in a manner
which encourages researchers to follow the fundamental scientific method
(Fig.

An appreciation of the range of values consistent with a measurement is
necessary to apply and to contextualise data. Three qualities were identified by
the

Measurement and parameter errors are generally well represented by the traditional propagation of random perturbations. These are useful but only describe one aspect of the uncertainty – the “unknowns” that are known and quantifiable. Approximation and system errors represent the inability of the analysis to describe the environment observed and are the dominant source of error in most passive satellite remote sensing data (as it is not possible to constrain the complex behaviour of the environment with a few TOA radiances). Data producers are aware of these additional “unknowns”, such as the representation of the surface's bi-directional reflectance, but cannot quantify them in the manner required for traditional error propagation (i.e. they are known, unquantifiable unknowns). Even well-constrained analyses will be affected by system errors resulting from quality control, cloud filtering being the most common. Resolution errors describe the disconnect between what occurs in nature and the means by which it is observed, primarily resulting from the instrument's sampling.
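The traditional propagation mentioned above can be sketched for a toy forward function (the function, state values, and uncertainties are invented for illustration): the output variance is J S<sub>x</sub> Jᵀ, with J the Jacobian of the function and S<sub>x</sub> the input error covariance.

```python
import numpy as np

def f(x):
    """Toy scalar 'retrieval' combining two inputs non-trivially."""
    return x[0] * np.exp(-x[1])

def jacobian(f, x, eps=1e-6):
    """Forward-difference Jacobian of a scalar function of a vector."""
    x = np.asarray(x, dtype=float)
    J = np.empty(x.size)
    for i in range(x.size):
        dx = np.zeros_like(x)
        dx[i] = eps
        J[i] = (f(x + dx) - f(x)) / eps
    return J

x = np.array([1.0, 0.5])            # state at which to linearise
S_x = np.diag([0.01**2, 0.02**2])   # independent input uncertainties
J = jacobian(f, x)
var_y = J @ S_x @ J                 # linear propagation: J S_x J^T
print(f"propagated uncertainty: {np.sqrt(var_y):.4f}")
```

This is exactly the "known, quantifiable unknowns" machinery: it is valid only to the extent that `f` is accurate and approximately linear over the scale of the input errors.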

The difficulty with the last three categories of error is that they can be highly non-linear – their magnitude and nature depend upon the state observed and the ability of the forward model to describe it. Propagation of errors assumes that the equations used are accurate and that errors affect them linearly. Uncertainties currently reported with satellite remote sensing data neither represent the actual (non-linear) distribution of errors nor the full range of information known about the errors.
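The limitation of the linear assumption can be demonstrated with a deliberately non-linear toy function (illustrative only): Monte Carlo propagation of a Gaussian input error produces a spread, and a skewed distribution, that the linearised estimate misstates.

```python
import numpy as np

rng = np.random.default_rng(1)

def f(x):
    return np.exp(x)          # strongly non-linear toy forward function

x0, sigma = 0.0, 0.5          # Gaussian input error of width sigma about x0

# Linearised estimate: |df/dx| * sigma evaluated at x0.
linear_sigma = abs(np.exp(x0)) * sigma

# Monte Carlo estimate: propagate samples through the full function.
samples = f(rng.normal(x0, sigma, 100_000))

print(f"linear estimate of spread: {linear_sigma:.3f}")
print(f"Monte Carlo spread:        {samples.std():.3f}")
```

Here the Monte Carlo spread exceeds the linear estimate and the output distribution is skewed rather than Gaussian, so a single reported standard deviation does not represent the actual distribution of errors.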

This can be addressed in various ways. Firstly, uncertainty estimates in satellite remote sensing data must be presented at the pixel level. Blanket quantifications (a single value applied across a data set) misrepresent the dependence of error upon state and rely on external information. While pixel-level estimates will not represent the impact of unquantified unknowns, it is important that uncertainty be presented in a context that represents the data producer's confidence in and understanding of their data.

Ensemble techniques can be used to represent unquantifiable unknowns. The under-constrained nature of many satellite observations means that multiple realisations of a data set that are consistent with measurements can be derived by using conflicting descriptions of the environment, such as assumptions of particle microphysical properties or differing calibration coefficients. In the absence of a priori constraints, each of these realisations is feasible and should be presented together. This is common practice in the climate modelling community, and the satellite remote sensing community should capitalise on users' experience to improve communication of the uncertainty in products.
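A minimal sketch of the idea (the "retrieval" and the assumption sets are entirely invented): the same measurement is inverted under several conflicting but individually plausible assumptions, and all realisations are reported together, with their spread serving as a proxy for the unquantified uncertainty.

```python
def retrieve(measurement, gain, model_param):
    """Toy 'retrieval': invert y = gain * x / (1 + model_param * x)."""
    y = measurement / gain
    return y / (1.0 - model_param * y)

measurement = 0.80

# Conflicting but individually plausible choices of calibration gain and an
# assumed forward-model parameter (values are illustrative).
assumptions = [
    {"gain": 0.98, "model_param": 0.10},
    {"gain": 1.00, "model_param": 0.10},
    {"gain": 1.02, "model_param": 0.15},
    {"gain": 1.00, "model_param": 0.20},
]

ensemble = [retrieve(measurement, **a) for a in assumptions]
spread = max(ensemble) - min(ensemble)

print("ensemble members:", [f"{x:.3f}" for x in ensemble])
print(f"spread (a proxy for unquantified uncertainty): {spread:.3f}")
```

Reporting all members, rather than collapsing them to one value with an error bar, preserves the information that the differences are systematic consequences of the assumptions, not random noise.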

The manner in which a measurand is defined affects both the sources of error
that must be considered (e.g. resolution errors) and the manner in which the
data must be compared with other measurements. In an under-constrained
problem, it is often not possible to report a value that is uniquely
constrained by those conditions (i.e. the state vector elements do not form a
basis of the observed conditions). This can result in the retrieved value
being sensitive to multiple features of the environment, as quantified by the
averaging kernel. When comparing data sets, it is important to ensure that
equivalent quantities are being compared or biases will be observed that are
a function of the system definition rather than an error in the retrieval.
The necessary transforms were outlined in
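In the usual optimal-estimation notation (symbols here are generic, not taken from the source), a retrieval relates to the true state through the averaging kernel, and a finely resolved reference profile should therefore be smoothed by that kernel before comparison:

```latex
\hat{\mathbf{x}} \approx \mathbf{x}_a + \mathbf{A}\,(\mathbf{x} - \mathbf{x}_a) + \boldsymbol{\epsilon},
\qquad
\mathbf{x}_s = \mathbf{x}_a + \mathbf{A}\,(\mathbf{x}_{\mathrm{ref}} - \mathbf{x}_a),
```

where \(\mathbf{x}_a\) is the a priori state, \(\mathbf{A}\) the averaging kernel, \(\boldsymbol{\epsilon}\) the retrieval error, and \(\mathbf{x}_s\) the reference profile \(\mathbf{x}_{\mathrm{ref}}\) expressed as the retrieval would observe it. Comparing \(\hat{\mathbf{x}}\) with \(\mathbf{x}_s\), rather than with \(\mathbf{x}_{\mathrm{ref}}\) directly, avoids apparent biases that are a function of the system definition.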

As not all errors can be quantified, qualitative information is also
necessary to appreciate the applicability of data and, as a data set evolves,
it is important to assess both the degree to which it represents a scientific
advancement and the degree to which it satisfies the needs of its users. This
information can be conveyed through product user guides, validation studies,
quality assurance flags, and/or measures of a retrieval system's maturity. It
is important both that this information is readily available to users and
that it is communicated in the language of a statement of confidence.
Continuous interaction with users will be necessary to refine these reports
and ensure they communicate the desired information. Of particular importance
are the following:

an error budget outlining the quantified sources of error;

a description of the available quality control information and its physical meaning to enable users to apply it in an educated fashion;

known weaknesses of the data that are not represented by the uncertainty.

This paper concentrated on passive remote sensing, but the clear communication of uncertainty to users is equally important in active remote sensing. The different definitions of active and passive measurands must be appreciated if the two are to be compared. Active data are generally better constrained than passive data and are often analysed with analytical equations, so approximations and system choices are substantially less important, though still present (for example, the Ångström coefficient, the lidar ratio, and multiple scattering). These errors are minimised, in part, by selecting measurands closely aligned with the measurement (e.g. backscatter, extinction, reflectivity, depolarisation), though approximation and system errors can become important when calculating more poorly constrained physical parameters such as particle size or number. Resolution errors are more obvious in active sensing due to the narrow swath of such instruments.

Evaluating the quality of an algorithm using existing metrics limits the ability of the satellite remote sensing community to communicate their understanding of the uncertainties in their products to users in an efficient or effective manner. Without that dialogue, users can neither use data appropriately nor feed back to data producers to improve them. The hope is that, by representing uncertainties in satellite remote sensing data through ensembles, understanding of the limitations of the data will increase, highlighting areas for future research. Through continual communication among the entire scientific community, unknown unknowns can become known and, eventually, make the use of ensembles unnecessary as understanding of the environment converges upon the truth.

This work is supported by the European Space Agency (ESA) through the Aerosol_cci and Cloud_cci projects and through the Natural Environment Research Council's support of the National Centre for Earth Observation. For their inspirational and insightful conversations, thanks must be given to the participants of the Aerosol_CCI uncertainty workshop on 4 September 2014; the AeroSat meeting on 27–28 September 2014; and SST_CCI Uncertainty Workshop of 18–20 November 2014. The authors are indebted to Claire Bulgin, Gerrit de Leeuw, Thomas Holzer-Popp, John Kennedy, Greg McGarragh, Chris Merchant, Andy Sayer, and an anonymous referee for their useful comments. Edited by: A. Kokhanovsky