Statistical modelling of collocation uncertainty in atmospheric thermodynamic profiles

The quantification of measurement uncertainty of atmospheric parameters is a key factor in assessing the uncertainty of global change estimates given by numerical prediction models. One of the critical contributions to the uncertainty budget is related to the collocation mismatch in space and time among observations made at different locations. This is particularly important for vertical atmospheric profiles obtained by radiosondes or lidar. In this paper we propose a statistical modelling approach capable of explaining the relationship between collocation uncertainty and a set of environmental factors, height and distance between imperfectly collocated trajectories. The new statistical approach is based on the heteroskedastic functional regression (HFR) model which extends the standard functional regression approach and allows a natural definition of uncertainty profiles. Along this line, a five-fold decomposition of the total collocation uncertainty is proposed, giving both a profile budget and an integrated column budget. HFR is a data-driven approach valid for any atmospheric parameter, which can be assumed smooth. It is illustrated here by means of the collocation uncertainty analysis of relative humidity from two stations involved in the GCOS reference upper-air network (GRUAN). In this case, 85 % of the total collocation uncertainty is ascribed to reducible environmental error, 11 % to irreducible environmental error, 3.4 % to adjustable bias, 0.1 % to sampling error and 0.2 % to measurement error.


Introduction
While global availability of profiling measurements of atmospheric parameters is increasing, full exploitation of these measurements is still far from being achieved.In fact, the lack of an extensive effort, on a global scale, aimed at coordinating the operation of available measurement stations towards harmonized and traceable observations, uncertainty included, has hampered exploitation of the data.GRUAN (GCOS Reference Upper-Air Network, www.gruan.org) is a network aiming at rectifying this issue in order to provide traceable measurements of essential climate variables (ECVs), namely pressure, temperature, water vapour, wind and aerosol, with their uncertainty, over a long-term period (GCOS-112, 2007).The quantification of the uncertainty budget is one of the key priorities for GRUAN (Seidel et al., 2011).
Instrumental contribution to the error budget (random and systematic uncertainties) has been investigated for various sensors, e.g.Raman lidars (e.g.Whiteman et al., 2001), radio sondes (Immler et al., 2010) or weather radars (e.g.O'Connor et al., 2005).On the other hand, one of the critical contributions to the uncertainty budget is related to the collocation mismatch in space and time among pairs of sensors.Although these different measurements (of the same atmospheric parameter) are assumed to be nominally collocated, there is a real physical separation between their actual measurement locations and timing.This assumption is generally true for ground-based observation, or when one is groundbased and another satellite-based.

A. Fassò et al.: Collocation uncertainty in atmospheric profiles
Estimates of the representativeness error resulting from the effects of small-scale turbulence have been performed in many cases, for example, for rawinsonde wind measurements (e.g.Frehlich and Sharman, 2004) or high-resolution radiosonde wind shear (Houchi et al., 2010).
However, there is a need for flexible statistical modelling, capable of assessing jointly the dynamic impact of both the imperfect collocation of atmospheric observations and environmental factors on collocation uncertainty.The approach should be flexible enough to cover atmospheric processes characterized by regimes ranging from quasi-linear (e.g. a horizontally homogeneous atmosphere) to non-linear.This approach would extend the VAM approach of Pougatchev et al. (2009), which is restricted to linear modelling and has no capability to model the impact of environmental factors.
Radiosondes provide one of the primary data sources for vertical atmospheric profiles (Immler et al., 2010), but they are affected by uncontrolled drift once they are launched (Seidel et al., 2011).Radiosonde data have been extensively used for a wide range of applications including intercomparison with ground-based and space-based remote sensing systems, atmospheric model evaluations, and studies of atmospheric variability.However, these studies have mainly assumed the radiosonde measurement to represent the atmospheric conditions over a given area, as if they came from a fixed measurement location, and neglect the impact of their uncontrolled drift.In satellite validation, it is often assumed that radiosondes are spatially collocated with the satellite field of view.However, representativeness error can be minimized only if the validation is performed in homogeneous conditions.If the error is small for integrated variables in ideal atmospheric conditions (Vogelmann et al., 2011), quite often the uncertainty introduced by representativeness dominates the error budget of the validation experiment.Uncontrolled radiosonde drift may also affect the evaluation of model data when the representativeness of observations is not quantified.These data sets should represent the range of conditions influencing the model prediction and not the "truth" for the location.
Spatial collocation mismatch does not seem to play a big role in the radiance matching, due to the large footprint characterizing these measurements.On the contrary, temporal collocation and time interpolation are critical to achieving these results due to the related vertical thermodynamic factors (Tobin et al., 2006).
The satellite validation community considers, as a priority, the availability of robust collocation criteria that would increase the matches by a significant amount at an affordable cost due to data synergy.Appropriate collocation criteria are strongly required to combine different measurements, to potentially reduce the overall uncertainty in the atmospheric profile measurement (Tobin et al., 2006;Calbet et al., 2011).For example, in the former paper, temporal collocation and time interpolation were critical to achieving good correlation between the ground and satellite observations, although collocation does not seem to play a big role in radiance matching.
In this study, we aim at two objectives.The first is a general statistical modelling approach to understand the vertical profiles of collocation uncertainty for any climate variable, in relation to environmental factors, altitude of measurement and distance between trajectories.The second objective is an illustrative example based on relative humidity data from ground rawinsonde measurements, which are made from two different locations at almost the same time.The case study is important because humidity is known to have large forecast errors even on small time and space scales.
To do this, we merge two statistical methodologies.One is heteroskedastic regression, which has been used for calibration, see e.g.Bhaumik and Gibbons (2005) and Spiegelman et al. (2011), and financial data, see e.g. the classical Engle (1982).According to this approach, the error variance in a regression model is not constant but is a function of some variables.The second methodology is statistical functional data analysis, which dates back to the eighties, see e.g. the primer of Ramsay and Silverman (2005).In the last decade these methods have been increasingly developed and used in various scientific areas and especially in life and environment observation.For example, Ruiz-Medina and Espejo (2012) proposed spatial interpolation of functional ocean surface temperature and Ignaccolo et al. (2013) proposed regional zoning according to functional air quality data.Moreover, Sangalli et al. (2013) proposed functional regression for complex spatial configurations which are important, for example, in the study of hemodynamic forces, see Ettinger et al. (2013).Following this statistical framework and developing the idea of Ignaccolo (2013), we propose the heteroskedastic functional regression (HFR) model, which extends the standard functional regression approach to cover for non-constant functional conditional variance, as an effective approach to understand and decompose the uncertainty of the atmospheric thermodynamic profiles.
The rest of the paper is organized as follows: in Sect. 2 we discuss radiosonde collocation in general and introduce the Beltsville-Sterling data set, which is used throughout the paper as a motivating and illustrating case study.In Sect.3, the HFR model for collocation uncertainty of generic atmospheric thermodynamic profiles is presented.Using this general modelling approach, a method for computing conditional and marginal uncertainty profiles is proposed.In Sect.3.3, a total collocation uncertainty budget is introduced by means of a model-based five-fold uncertainty decomposition.Section 4 illustrates the method using data from two North American stations involved in GRUAN, and focuses on collocation uncertainty of relative humidity, which is selected because it is characterized by high vertical variability.Section 5 gives concluding remarks.

Case study introduction
Although the general statistical approach is developed in Sect.3, a motivating example is introduced here.To do this we present below the specific problem of sonde collocation and the data sets to be used with an introduction to the parameters considered.This is done in the context of the two GRUAN stations mentioned earlier: the Beltsville and Sterling locations.

Collocation uncertainty for radiosonde
Data used in this work consist of radiosounding profiles of pressure, temperature, humidity and wind measured at the Howard University research site in Beltsville, Maryland, USA (39.054 • , − 76.877 • , 53 m a.s.l.), which is also a GRUAN site, and the U.S. National Weather Service (NWS) operational site located in Sterling, Virginia, USA (38.98 • , − 77.47 • , 88 m a.s.l.).
These two sites, being separated by about 50 km, are selected because of their relatively close proximity, representing a similar climate regime.Moreover, they would serve as a good example of using one GRUAN research site to understand a non-GRUAN site, where knowledge can be transferred to a larger network represented by NWS.
Beltsville soundings are based on RS92-SGP sondes, manufactured by Vaisala Inc., while Sterling uses the Radiosonde Replacement System (RRS), built by Lockheed Martin Sippican.The latter is referred to hereafter as Mark IIA and is very similar to the LMS6 sonde (Nash et al., 2010).Differences in the vertical sounding of the atmosphere between the two sensor types are known.In dry regions of the troposphere, it has been reported amply that relative humidity derived from Mark IIA (LMS6 sonde) shows substantial errors.This limitation has been reported by Blackmore and Taubvurtzel (1999) to be a result of errors in calibration, sensor hysteresis, and sensor response time.They also report that at low temperatures, the time response slows down significantly, resulting in large relative humidity errors.During the latest WMO intercomparison of high-quality radiosonde systems (Nash et al., 2010), the RS92 version tested has shown systematic errors of less than 2 % in relative humidity and random errors of about 5 % from the surface to the lower stratosphere, whereas the LMS6 exhibited significant biases in the upper troposphere and layers above.Moreover, the LMS6 sensor did reveal a day-night difference, but which is significant only in the upper troposphere; see Miloshevich et al. (2006).
Most of the sonde-to-sonde or other comparisons reported in the literature are a result of multi-payload sonde launches where two or more sondes are tethered to a single balloon to minimize the atmospheric variability, which is commonly assumed to be zero.These types of comparisons come mainly from coordinated and intensive field campaigns; see Miloshevich et al. (2006).These intensive operations tend to be expensive, are held less frequently, and do not offer climatically representative data across seasons and climates.As a result, the data and opportunity for building statistics is usually limited and cross-instrument and cross-network knowledge transfer is limited.
The Beltsville-Sterling radiosonde flights are launched on separate balloon payloads, are operated by different operators, are different instruments and sample the atmospheric profile with some variability.Quantifying the latter is a major issue.Traditionally, a simple averaged ensemble comparison as shown in Fig. 1 is done.The time-height matched difference between two data pairs from the Beltsville-Sterling flights is averaged to show temperature comparisons.The temperature profile difference of sondes launched from these two sites within a 3 h window did not show that large of a difference.As can be seen from the figure, the temperature difference (standard deviation) was well within about a percent.As expected, the water vapour mixing ratio (g kg − 1 ) varied greatly for the same sondes throughout the tropopause, above about 2 km.Mid-tropospheric mean differences of about fifty percent were recorded.Alternatively, comparison of the integrated column water vapour amounts between these two stations revealed correlation coefficients of 0.95 or better.The difference in these comparative plots is a result of the measurement location mismatch, instrument quality, and statistical sampling of the atmospheric variability.These types of "standard" statistics plots, while important in understanding the overall characteristics of the atmospheric state variables, cannot be used to do quantitative contribution of the different error components.The statistical "tool kit" described here has quantitative description and separation of the different error components as its goal.Note also that despite the performance limitations of the RRS, we proceed with using the Beltsville-Sterling data in our study here to demonstrate the efficacy of the statistics developed.

Beltsville-Sterling data
For this study, we selected n = 32 pairs of vertical profile data in the vertical range of 100 to 10 000 m launched between July 2006 and September 2009.A flight from Beltsville was matched to Sterling launch if the launch time was within 3 h.
In particular, we focus on collocation of relative humidity (rh, in %) and we try to explain its profile uncertainty using covariates or predictors, which can be interpreted as environmental factors, and are given by water vapour mixing ratio (mr), pressure (p), temperature (T ), time (t), wind vector (u, v), height above sea level (h) and coordinates (lat, lon).The differences in natural variability and collocation mismatch between these variables may be appreciated in Fig. 3, where water vapour mixing ratio, relative humidity, wind and temperature data are plotted.The collocation mismatch, plotted as the time-height matched difference data, for relative humidity and pressure is plotted in Fig. 4, showing quite a strong variability for humidity at all altitude levels without an apparent pattern.Hence it is challenging for the HFR model to try to explain this strong variability, which is assumed to be a cumulative contribution of errors from the instrument overall performance and from water vapour spatial variability induced by the drifts shown in Fig. 2. Note that the depicted distances between the collocated profile trajectories is between 45 and 95 km.

Modelling collocation uncertainty
Let y denote the measurement of an atmospheric quantity µ along a sonde trajectory, e.g. the profile of relative humidity to be recorded by Beltsville balloon launches.A measurement at spatial point s and time t is denoted here by y(s, t), where the measurement position is s = (lat, lon, h), with measurement height h in the range h 0 -h 1 , which is 100-10 000 m in the example considered, and measurement time is t ≥ t 0 , where t 0 is the sonde release time.Considering only height, the space-time trajectory can be described as y(h) = y (s(h), t (h)).In other words, we represent data y(s, t) as vertical profile functions y(h), whose exact spacetime trajectory can be represented by the vector function h → (y(h), lat(h), lon(h), t (h)).
Using the approach known as functional data analysis (FDA) described e.g. by Ramsay and Silverman (2005), we consider the vertical profile of an atmospheric variable to be considered by a single object defined by a smooth function µ (h).According to standard measurement error decomposition, an observation profile, labelled by launch place and time (s i , t i ), i = 1, ..., n, is given by a random function where µ (h) is the profile of the physical quantity under study, assumed continuous, and ε i (h) is the corresponding error.Since the measurement error ε is assumed with constant variance, Var (ε(h)) = σ 2 ε , and zero mean, E (ε(h)) = 0, it follows that we are considering an unbiased instrument.

HFR collocation model
Consider, now, measurements of the same variable from two instruments, e.g. the radiosounding of relative humidity profiles at Beltsville and Sterling, giving measurements y and y 0 respectively, which can be represented by Eq. ( 1) and have equal uncertainty σ ε = σ ε 0 .
The collocation error follows by comparing y and y 0 at the same height; hence, following the stochastic model approach of the previous section, we have (2) In this formula µ = µ− µ 0 is the collocation drift and ε = ε − ε 0 is the collocation measurement error, with variance σ 2 ε = 2σ 2 ε .From a practical point of view the observed collocated profiles y(h) and y 0 (h) are not observed exactly at the same height h, so the collocation error y (h) is not directly computable for every height h.Vice versa, µ (h) and µ 0 (h) are continuous functions and µ (h) may be easily computed for every height h.In this paper, we assume that measurements, conditionally on a set of q different environmental factors denoted by x(h) = x 1 (h), . .., x q (h) , are independent random functions and we focus on modelling their conditional mean and variance as functions of x(h).For example, in the Beltsville-Sterling case study, candidate components for x(h) are all measured profiles of the atmospheric state introduced in Sect.2.2.We will address the spatial correlation among different profiles e.g.among y i (h) and y j (h) characterized by different launch places s i and s j in Sect.3.4.
Here, we assume that the collocation drift is a heteroskedastic functional regression model given by where the prime denotes matrix transposition.In this model, we assume that the trend is locally linearly related to x but the global relation is not assumed linear.Moreover, the error ω is assumed to be a conditionally heteroskedastic component with zero mean and uncorrelated with x .The term heteroskedasticity is derived from the ancient Greek language and means varying variance, which is what is assumed in this paper.In particular, the conditional variance, namely σ 2 ω (h|x) = Var (ω(h)|x), is assumed to be a linear or loglinear function of x.The former case is given by and the latter is given by The choice between these skedastic models is based on the data under study and both describe the uncertainty unaccounted for by the locally linear component β(h) x(h) = q j = 1 β j (h)x j (h).Therefore, Eqs. ( 3) and ( 4) or (5) define a heteroskedastic functional regression or HFR model.
In the subsequent discussion, symbols β and γ denote functional estimates of β and γ obtained on historical data as discussed in Appendix B.

Decomposition of collocation uncertainty
We consider first the conditional collocation uncertainty profile given by the following conditional mean squared error: This equation is a direct consequence of Eq. ( 3) and gives the decomposition of the uncertainty profile for each value of the set of factors x at height h, as the sum of the collocation squared drift β(h) x(h) 2 plus the conditional variance function σ 2 ω (h|x).In practice, the first term could be reduced www.atmos-meas-tech.net/7/1803/2014/or even cancelled by observing factors x(h) and adjusting the collocation mismatch accordingly, that is * y = y − β x .Thus, we will refer to this term as the conditional reducible collocation error.Also, the conditional variance σ 2 ω (h|x) depends on x but this error component cannot be reduced even if x is known.Consequently, we will refer to this as the conditional irreducible collocation error.
Averaging the effects of environmental factors x, we get the (marginal) uncertainty profile σ 2 y (h) = E( 2 y ).Its decomposition extends the heteroskedastic uncertainty decomposition of Fassò et al. (2003) and is given by where is the expectation of Eq. ( 6) and defines the collocation uncertainty profile.Moreover, as discussed in Appendix A1, this quantity is decomposed into three terms: In this equation, similar to Eq. ( 6), σ 2 x + σ 2 ω defines the environmental error and σ 2 β (h) is the sampling error discussed below.
The first term on the right-hand side in Eq. ( 8), namely σ 2 x , is the (marginal) reducible collocation uncertainty and is given by where x (h) = E x(h)x(h) is the functional second moment matrix of x(h) and, in practice, σ 2 x (h) is computed using the estimate β.In some cases Eq. (3) has a constant term β 0 which does not depend on the covariates x or on height h.It is worth observing that in this case β 0 is the constant component of the collocation bias, and, if known, one can easily adjust for it.Hence it is interesting to evaluate the relevance of this constant using a decomposition which separates the impact on uncertainty of the constant β 0 from the non-constant covariates.Denoting this term by x ∼ β 0 , we write whose computational and theoretical details are given in the Appendix A2.As above, in practice it is computed using an estimated β0 .
The second term on the right-hand side in Eq. ( 8), namely σ 2 ω , is the irreducible collocation uncertainty and is given by the average of the estimated skedastic function (4) or (5), i.e.
with computation details shown in Appendix A3.
Finally the third summand of Eq. ( 8), namely σ 2 β (h), is the estimation uncertainty or sampling error and is given by where β (h) = Var( β(h)|x) is the estimation functional variance covariance matrix of β(h).When the historical information used for estimation is large, β is small and σ 2 β can be neglected, otherwise this type of random error is irreducible.
Note that, in Eq. ( 7), the functional object σ 2 y (h) gives the uncertainty profile at height h irrespective of the particular value assumed by the set of factors x and improves the vertical resolution of the usual grouped collocation uncertainty estimate.The latter is known to be given by where h j and h 0 j are matched by some proximity criterion and n h is the number of elements in the vertical bin h ± h , with vertical size 2 h which is usually between 100 and 500 m, as in Fig. 1; see e.g.Immler et al. (2010) and Sun et al. (2010).

Total uncertainty budget
Summarizing the above discussion, we have the following total profile uncertainty budget: In order to get a simple uncertainty decomposition, Eq. ( 12) can be integrated along the atmospheric column, giving the total column uncertainty budget where the column uncertainties σ2 I for I = y, x ∼ β 0 , ω, β are given by the profile averages and Formula (13) gives a scalar uncertainty decomposition which is related to the usual concept of total uncertainty.

Spatial correlation
The HFR model introduced assumes conditional independence among different profiles y i (h) and y j (h).This is a consequence of the natural assumption of independence among the instrumental errors, say ε i and ε j , and independence among the model errors ω i and ω j .This assumption is valid since we are considering here only two stations and two independent instruments.
When profile data at many spatial sites are available, a suitable extension of the present HFR model can be devised to take into account the possible spatial dependence among different profiles y i (h) and y j (h) characterized by different launch places s i and s j .Functional spatial statistics has been recently developed, see e.g.Delicado et al. (2010) and Ignaccolo et al. (2014), and could be used to handle such data in the framework of a spatial HFR model.

Case study results
In this section, we use the data introduced in Sect.2, and model collocation uncertainty of relative humidity in Beltsville radiosonde data as a function of the corresponding rh levels in the Sterling soundings as well as the rest of the measured atmospheric variables: water vapour mixing ratio (mr), pressure (p), temperature (T ), time (t), wind vector (u, v), height above sea level (h) and coordinates (lat, lon) from both collocated radiosondes.The resulting collocation error analysis corresponds to forecasting the single rh sensor rather than all radiosonde ECVs.In particular, since we also use the collocation error of the mixing ratio, mr , the heteroskedastic component (Eq.5) of this HFR model describes the variability of relative humidity for fixed water vapour content in dry air mass.
Before using the estimation algorithm of Appendix B, to avoid scale effects and facilitate interpretation, the covariates data have been standardized so that the total profile average is zero and the total profile variance is unity.Moreover, preliminary smoothing is applied to y and x using penalized cubic B splines with regularly spaced knots every 50 m and a small smoothing factor λ = 1 giving small errors σ ε , with the worst case arising for relative humidity where σ ε = 0.64 %.
Then, model selection involved fitting and testing a large number of alternative models for various combinations of different covariates.To do this, the inclusion/exclusion of each functional covariate x j (h) has been decided upon the Wald-type test for zero effect of smooth components (Wood, 2013) and the comparison of the optimized REML scores.
As a result, relative humidity and the water vapour mixing ratio from Sterling radiosondes, respectively rh 0 and mr 0 , and the difference in vapour, mr , have been included, while the other covariates related to the rest of the atmospheric information as well as time, space and distance have been found not to be significant for this data set.This results in the following "water only" model for the mean of the collocation drift: where the standard deviations of the corresponding quantities are given in the bracketed subscripts.The beta functions are plotted in Fig. 5 with 95 % confidence bands and show the stable influence of rh 0 on the collocation drift, which hints at an approximately linear relation, ceteris paribus.Moreover, the increasing behaviour of β3 compensates for the sharp decrease in | mr | related to the behaviour of mr shown in Fig. 3. Interestingly, an anonymous referee noted that the behaviour of β3 in Fig. 5 is consistent with the results of approximations based on explicit physics modelling; see e.g.Pruppacher and Klett (2010).Although the two pictures are similar but not equal, this is a confirmation that our general approach gives sensible results in special cases.The same referee also makes the following shareable comment: "It is not then surprising that most of the variance in the data is explained by this function".Nonetheless, modelling the reducible error component of the HFR model, in addition to β 3 , allows one to assess the role of β 0 , β 1 and β 2 and, in principle, of any other environmental factor for which data are available.Moreover, we also consider the heteroskedastic component ω, which is not covered by the above-mentioned explicit physics and needs to be estimated based on data.For the case study considered, the latter gives an uncertainty quota about 10 % of the total, whilst the sampling error is limited to 0.1 %.In general, we believe that our approach goes beyond previous analyses, increasing the knowledge about the components of the total uncertainty budget of Eq. ( 12).
It is worth observing that, after accounting for the above "water only" covariates, the collocation drift mean does not depend on the distance of the paired trajectories.The intercept term β0 = 3.40 characterizes the constant collocation bias between Beltsville and Sterling radiosoundings, and could be used for adjusting collocated measurements in practice.
With an adjusted determination coefficient R 2 = 0.886, this model misses only 11.4 % of the collocation uncertainty which is covered by σ 2 ω (h).The latter is estimated by the functional logarithmic regression model applied to the squares of first-order model functional errors ω2 = (µ − β x) 2 .In doing this we find that the irreducible uncertainty of relative humidity depends on pressure (p 0 ), temperature (T 0 ) and relative humidity (rh 0 ) in Sterling radiosondes, and the difference in pressure ( p ), longitude ( lon ), water vapour ( mr ) and wind ( u , v ).This gives the functional model of Eq. ( 15), whose γ functions are given in Fig. 6a and b.
Note that, whilst the mean part of the HFR model given by Eq. ( 14) is a "water only" model, the heteroskedastic component of Eq. ( 15) has a number of other environmental variables which are statistically significant.In particular, the east-west distance lon is important especially at lower altitudes, below 3000 m, within the boundary layer, where the variability of water vapour is large.The north-south direction was not significant in the model, which is consistent with both the marked trajectory anisotropy shown by Fig.   and the prevailing wind direction.After accounting for all the above, the resulting estimated skedastic equation is given by log ω 2 (h) = 2.68 The model given by Eqs. ( 14) and ( 15) is used to compute the profile and column budgets given by Eqs. ( 12) and (13), respectively.In particular, Fig. 7 clearly shows that the major part of the uncertainty is related to the observed atmospheric conditions as described by Eq. ( 14).Moreover, we can see that the environmental error σ 2 ω is smaller inside the boundary layer, so the environmental trend µ has a greater uncertainty contribution at these altitudes.
It is worth noting that, after fixing the atmospheric conditions as in model ( 14), the collocation drift does not depend on the distance between paired trajectories.Nevertheless, the conditional uncertainty, model (15), depends on the distance along parallels ( lon ) as mentioned above.Figure 6b shows that this estimate is less precise below 2000 m.In other words, after adjusting for the other environmental factors, the distance cannot be used as a correction factor for the relative humidity collocation error but is a determinant of the collocation uncertainty, especially above 2000 m.Moreover, for these stations, the distance along meridians is not a key factor in relative humidity collocation uncertainty.From a technical point of view, it should be noted that in Fig. 7 the total uncertainty σ y is not directly computed on the data, because y and y 0 from Beltsville and Sterling cannot be matched exactly for every height h and we avoid binning as in formula (11).Instead, thanks to the model-based approach, we computed σ 2 y (h) for all h using Eqs.( 7) and ( 13); see also the discussion in Appendix A1.
Taking averages of the uncertainty profiles, Table 1 shows the role of the various components in the uncertainty budget for this particular data set.The major source of uncertainty is given by the reducible environmental error which is related to the water vapour mixing ratio at Sterling and to the collocation difference of water vapour content in the dry air mass, as shown by Eq. ( 14) and Fig. 5.Note that this collocation difference of water vapour content stems from a combination of the short-term variability of water vapour and the sensor response to that variability: it is instructive to find this known fact through purely statistical model formulations.The second appreciable source of uncertainty is the irreducible environmental error (10.8 %), which has been shown to depend partly on distance along parallels.Moreover, it can be observed that the measurement error σ ε = 0.9 and estimation uncertainty σ β = 0.46 are quite small.Last but not least, a simple constant bias correction for these data would reduce the collocation uncertainty by about 3.4 %.As mentioned above, the small size of the measurement error σ is a confirmation of the appropriateness of the smoothing used.From the statistical point of view, this is also confirmed by the effective degrees of freedom of all components of model ( 14)-( 15), which are always smaller than 20 and are computed by mgcv, which is discussed in Appendix B.

Conclusions
This paper proposes a new and general statistical method for defining and computing the total collocation uncertainty budget of an ECV.The output is presented both as a profile uncertainty budget and an integrated total column budget.The model used is based on an extension of the classical functional regression model, which is able to cover for heteroskedasticity and allows the decomposition of total uncertainty up to five different components, namely constant bias, reducible and irreducible environmental errors, sampling error and measurement error.Moreover, the conditional uncertainty may be computed for any set of environmental conditions, providing, inter alia, more information about the factors determining the collocation uncertainty.The proposed method is self-assessing, in the sense that it is able to consider the information content of the data for the model and evaluate the size of the sampling error with respect to the other uncertainty components.Although smoothing of www.atmos-meas-tech.net/7/1803/2014/vertical profiles is applied, this is data adaptive and the vertical resolution is higher than similar literature on this topic.
When available, an explicit physics approximation gives results which are expected to be consistent with the mean collocation drift given by the first part of the HFR model.In addition to this, the HFR model can assimilate any other environmental covariate and, more importantly, thanks to the heteroskedastic component, it can quantify the various contributions to uncertainty, giving a new and detailed budget.
This approach has been tested on radiosounding profiles of humidity provided by the GRUAN site at Beltsville, Maryland, and the NWS operational site located in Sterling, Virginia, USA.Although, as mentioned above, the method is quite general and data driven, for the sites considered, we found that the collocation drift strongly depends on the direction of air mass advection and not on the distance of the paired trajectories.Moreover, the collocation error has an adjustable constant bias amounting to 3.4 % of the total collocation uncertainty.The model performed better below 3000 m of altitude and, globally, it missed only 11.4 % of the collocation uncertainty for relative humidity.The latter uncertainty, named irreducible environmental uncertainty, is related to wind and distance in the east-west direction.
From the presented case study, we conclude that the collocation uncertainty of relative humidity is related to physical quantities and, in principle, could be reduced by inclusion of auxiliary information.

Appendix A
This appendix considers some mathematical details about uncertainty computations of Sect.3.2 and the uncertainty decomposition of Eqs. ( 7) and (8).To do this, we simplify the functional notation µ = µ(h) and observe that, in practice, µ and ω are estimated on suitable historical data by μ and ω respectively.Moreover, the decomposition for the constant term β 0 of Eq. ( 9) is discussed.

A1 Mean estimation and marginal uncertainty
In this section, we discuss the uncertainty decomposition of Eq. ( 8), focusing on µ and ignoring the heteroskedastic structure of ω, which is considered in Appendix A3.In particular we consider the estimation error β = β − β so that the total collocation error is defined by adding the sampling error as follows: Note that, since β x = β x + β x , adding the latter term to y is the same as using β x instead of β x in the definition of y .Now observe that E( ε |x) = E (ω|x) = 0 and E( βω|x) is neglectable for large sample sizes under weak assumptions.Hence, we have and Eq. ( 8) follows the taking expectation of E( 2 y |x), namely Note that, in practice, σ 2 x is computed for β = β and x is suitably estimated on the historical data.Similarly σ 2 β is computed using suitable formulas.For example, under least square estimation, we have where X is the functional design matrix.

A2 Uncertainty for constant bias term
If Eq. (3) has a constant term β 0 , it may be written as where x(h) = (1, • x (h) ) and β(h) = (β 0 , • β (h) ) .Note that β 0 is an identifiable coefficient since no instrumental bias has been assumed in the previous section and, in practice, it is easy to adjust for this collocation bias and its relevance may be evaluated by decomposing σ2 x as in Eq. ( 9) with σ2 x∼β 0 = σ2 x − β 2 0 .Note that, if we are using a centered design with E( • x (h)) = 0, this apparently obvious decomposition exactly separates .In general, an interaction term between β 0 and • x, which is usually small, remains embedded in σ2 x∼β 0 .In fact, using simple algebra, we can see that σ2 x = β 2 0 + σ2

A3 Conditional variance estimation and marginal uncertainty
In this appendix, Eq. ( 10) is considered for both linear and loglinear heteroskedastic models (4) and ( 5).Considering first the linear case, according to the conditional heteroskedastic literature (see e.g.Chatfield, 1995), we write where η and ζ are zero mean and independent random variables, both independent of x, with η also unit variance and symmetrically distributed.This motivates the assumptions we made in this paper on ω, namely E(ω|x) = 0 and E(ω 2 |x) = γ x.Now, by simple algebra we write which in practice is computed for γ = γ .When enough data are available, Var(ω) may be estimated using the sample variance of the residuals in Eq. ( 13).

Figure 1 .
Figure 1.Comparison of radiosonde flights made at Beltsville and NWS-Sterling.Data pairs were matched if they were within 3 h.The temperature difference and the standard deviation of the difference are shown as absolute difference and in percent as well as the number of data pairs used at each layer.

Figure 2 .
Figure 2. The displacement of collocated trajectories is given by the difference of positions for two collocated instruments at the same altitude.X axis is distance along parallels and Y axis along meridians.Distance range is 45-95 km.

Figure 6b .
Figure 6b.Gamma functions of collocation error ω 2 for relative humidity in Beltsville using model (15).Top left: difference in longitude ( lon); top right: difference in mixing ratio ( mr); bottom left: difference in wind, u direction ( u); bottom right: difference in wind, v direction ( v).

Table 1 .
Budget of total collocation uncertainty for relative humidity ( rh) in Beltsville-Sterling, using the HFR model of Eqs.(14) and (15).