A data-driven persistence test for robust (probabilistic) quality control of measured environmental time series: constant value episodes

Kaffashzadeh, Najmeh

doi:https://doi.org/10.5194/amt-16-3085-2023

Articles | Volume 16, issue 12

Research article

21 Jun 2023

Abstract

Robust quality control is a prerequisite and an essential component in any data application. That is especially important for time series of environmental observations such as air quality due to their dynamic and irreversible nature. One of the common issues in these data is constant value episodes (CVEs), where a set of consecutive data values remains constant over a given period. Although CVEs are often considered to be an indicator of sensor failure or other measurement errors and are removed during quality control procedures, there are situations when CVEs reflect natural environmental phenomena, and they should not be removed from the data or analysis. Assessing whether the CVEs are erroneous data or valid observations is a challenge. As there are no formal procedures established for this, their classification is based on subjective judgment and is therefore uncertain and irreproducible. This paper presents a novel test procedure, i.e., constant value test, to estimate the probability of CVEs being valid data. The theoretical foundation of this test is based on statistical characteristics and probability theory and takes into account the numerical precision of the data values. The test is a data-driven (parametric) approach, which makes it usable for time series analysis in different environmental research domains, as long as serial dependency is given and the data distribution is not too different from Gaussian. The robustness of the test was demonstrated with sensitivity studies using synthetic data with different distributions. Example applications to measured air temperature and ozone mixing ratio data confirm the versatility of the test.

Received: 13 Nov 2022 – Discussion started: 18 Jan 2023 – Revised: 25 Apr 2023 – Accepted: 16 May 2023 – Published: 21 Jun 2023

1 Introduction

Millions of sensors monitor the environment every day, and their data are used in many applications such as trend analysis (Fang et al., 2013; Mills et al., 2016, 2018; Chang et al., 2017; Fleming et al., 2018; Lefohn et al., 2018) and forecasts (Gardner, 1999; Zhang et al., 2012; Debry and Mallet, 2014; Zhou et al., 2019) to provide important information on global challenges such as climate change, air quality, soil degradation, etc. The measurement process can be interpreted as sampling from a true distribution of atmospheric state variables, for example, temperature or air pollutant concentration, at a given location. Each measured value is an estimation of “truth” that has been obtained through a set of data samples (Grant and Leavenworth, 1996). A common feature of many environmental time series is the fact that the true distribution changes with time. This makes such measurements irreproducible.

Measured data can be contaminated by various errors such as systematic, random, non-representative and gross errors (Gandin, 1988; Steinacker et al., 2011). These errors can arise from poor sensor calibration, long-term sensor drift, noise, processes that cannot be resolved by an observational network, and mistakes during data processing, decoding or transmission. Some of these errors arise from unpredictable natural phenomena such as floods, fire, frost and animal activities (Campbell et al., 2013) that cannot be documented in every detail. Although many efforts are devoted to developing advanced analytical tools and methods, these errors can have deleterious effects on the statistical analyses. For instance, outliers, i.e., values far outside of the norm for a variable or population, can increase the error variance or reduce the power of statistical tests (Osborne and Overbay, 2004). Specifically, constant value episodes (CVEs) can cause departures from normality in analyses that assume a normal distribution, for example, linear regression. Thus, even the most sophisticated statistical model can be vulnerable to unknown and potentially erroneous data. If such errors in the data are not identified by applying quality control (QC) procedures, the information obtained from the data will be misleading, and the results from scientific data analyses can be unreliable and biased. Therefore, robust QC procedures are an essential component in the data production chain and a requirement for a more reliable quantification of trends or other statistical analyses.

Many research initiatives and environmental monitoring programs have thus established standards and guidelines for QC procedures. Most of them rely on visual screening of data, and therefore personal inspection, and on manual elimination of erroneous values based on empirical knowledge and investigator experience. Several advanced tools such as GCE (Scully-Allison et al., 2018), CoTeDe (Castelão, 2016), and AutoQC (Good et al., 2022) and comprehensive user manuals such as QARTOD (Bushnell et al., 2019) and WMO-AWS (Zahumenský, 2004) have been developed with precise rules to overcome this subjectivity. However, their application is often limited to a few variables or specific data sets, for example, from limited geographic regions with relatively homogeneous conditions. This, in turn, can be problematic if one wants to assemble global data sets of various environmental variables. For example, in the Tropospheric Ozone Assessment Report (TOAR), a global database with ground-level ozone measurements at more than 10 000 locations around the world was built with data from more than 30 different contributors (Schultz et al., 2017). Different QC procedures at these agencies and sites led to increased uncertainty in the assessment. At this scale of data, manual inspection methods are not only error prone but also impractical. It is therefore desirable to develop a more generic, robust and data-driven approach for the QC of environmental monitoring time series.

The focus of this study is to develop a QC test for CVEs as the first element for such data-driven QC. CVEs are a common feature in air quality time series and other environmental data sets. As an example, in a specific 35-year-long ozone time series with hourly sampling, CVEs with a length of 2 occurred 20 313 times. Therefore, about 6.7 % of the data values are CVEs, meaning that such incidents are expected to occur naturally about 16 times per 10 d in the hourly data. CVEs with longer lengths, e.g., 3, 4 and 5, occurred 6190, 2887 and 1681 times, respectively, which corresponds to about 4.85, 2.26 and 1.31 incidents per 10 d of hourly data. While they can be detected through a persistence test, a qualified judgment whether such data are erroneous or not is a difficult undertaking. If CVEs are excluded from the data (Horsburgh et al., 2015; Gudmundsson et al., 2018), the results of the analysis, such as model–data comparisons (Bey et al., 2001; Horowitz et al., 2003; Dawson et al., 2008; Emmons et al., 2010; Lamarque et al., 2012; Rasmussen et al., 2012; Tilmes et al., 2012; Im et al., 2015; Schnell et al., 2015; Lyapina et al., 2016; Sofen et al., 2016), can become biased. That can be an issue in (re)analysis products (Inness et al., 2019; Hersbach et al., 2020), where assimilation processes reduce misfits between observations and their modeled values. If the models correctly capture CVE events, excluding the CVEs will lead to a type I error. On the other hand, if CVEs originating from instrument malfunctions are included in the analysis, this will increase type I and type II errors and likely produce unreliable results.
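The conversion from total counts to expected incidents per 10 d can be retraced with a few lines of arithmetic. The sketch below assumes a complete hourly series of 35 years with 365 d each; the exact number of valid values in the series is not stated here, so the results are only approximate. The name `expected_per_window` is hypothetical.

```python
# Number of CVEs by episode length, as quoted in the text.
cve_counts = {2: 20313, 3: 6190, 4: 2887, 5: 1681}

# Assumption: a complete hourly series of 35 years with 365 d each.
n_values = 35 * 365 * 24
window = 10 * 24  # number of hourly values in 10 d


def expected_per_window(count, n_total=n_values, n_window=window):
    """Average number of episodes per window of n_window values."""
    return count * n_window / n_total


rates = {length: expected_per_window(count) for length, count in cve_counts.items()}
for length, rate in rates.items():
    print(f"length {length}: about {rate:.2f} episodes per 10 d")
```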

This study presents a new (QC) test procedure, i.e., constant value test (CVT), which estimates the probability of a CVE representing valid data. Data users can select a threshold of an acceptable probability depending on their scientific study or data analysis task. The CVT is entirely data-driven and makes very few assumptions about the properties of the underlying values' distribution and probability density function (Gaussian). Currently, the method is valid for data with a Gaussian frequency distribution. Possible extensions of the method are discussed in the conclusions section. In principle, it is possible to use the technique of statistical simulations to examine how the CVE probabilities change for non-Gaussian distributions. However, this is beyond the scope of this paper. Due to its generality, the test is applicable for a wide variety of environmental variables with a serial dependency (autocorrelation). The article structure is as follows: the method (CVT) is described in Sect. 2. In Sect. 3, the approach is evaluated using synthetic data for demonstration purposes. The results of three real test cases are discussed in Sect. 4. And, finally, conclusions are given in Sect. 5.

2 Methodology

Before describing the proposed method, we briefly summarize some issues with existing methods. In existing QC frameworks, the persistence test is typically defined based on the minimum expected variability, but this requires prior knowledge about the true statistical distribution of the measurements. For example, Zahumenský (2004) has defined that air temperature measurements shall be flagged as “doubtful or suspected value” if the measured variable varies by less than 0.1 K over 60 min. Such a priori assumptions may lead to false data labeling when environmental conditions are exceptionally stable and the true data variability is reduced for some period of time. For instance, temperature variation of 0.1 K can occur in the morning when radiative forcing is small, e.g., on a foggy day in autumn. In measurements of air pollutant concentrations, longer periods of zero values can be found if the measured concentrations are below the instrument detection limit or if chemical conversion leads to a complete removal of a species. For example, ground-level ozone concentrations at urban sites remain zero for several hours if there is a high level of nitrogen oxide emission.
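A fixed-threshold persistence rule of the kind described above can be sketched as follows. This is only an illustration of the general idea, not the procedure of any specific manual: the function name `flag_low_variability`, the window length in samples and the example temperatures are hypothetical, and the number of samples that spans 60 min depends on the sampling interval.

```python
def flag_low_variability(values, window, min_range=0.1):
    """Flag every sample inside a window whose max - min is below min_range."""
    flags = [False] * len(values)
    for start in range(len(values) - window + 1):
        segment = values[start:start + window]
        if max(segment) - min(segment) < min_range:
            for i in range(start, start + window):
                flags[i] = True
    return flags


# Hypothetical air temperatures in K; 'window' is the assumed number of
# samples covering the 60 min period of the rule.
temps = [281.2, 281.2, 281.25, 281.2, 281.9, 282.6, 283.4]
flags = flag_low_variability(temps, window=4)
print(flags)
```

Such a rule flags the first four samples regardless of whether the atmosphere was truly stable, which is the limitation the paragraph above points out.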

The assessment of CVEs will also have to depend on the numerical precision or resolution (res), which is the number of significant digits with which an observation is recorded (Chapman, 2005). For example, historical measurements of ground-level ozone (Azusa station) in the EPA Air Quality System (AQS) in the 1980s were reported with a resolution of 8 ppb (parts per billion). Another pollutant in the EPA AQS database for which reporting precision has changed over time since 1980 is carbon monoxide at the Fresno station (California state). So, it is not uncommon to find episodes of several hours when all measurements are reported as the same value, and it would be implausible to remove all of them as “erroneous measurements”.

The CVT takes these considerations into account and provides a data-driven approach with very few a priori assumptions. It consists of two main procedures: first, CVEs need to be found and the length of the episodes must be recorded, then, in the second step, the probability of each CVE being a period of valid data with low variability is estimated. While the first procedure can be simply implemented by taking the differences of consecutive values, a possible complication arises if the time series contains missing data or if the data were irregularly sampled. While the software accompanying this paper has a provision to deal with missing data, we ignore the second issue for the purpose of this paper and require that the time series has been sampled at regular intervals. The following method description focuses on the estimation of the likelihood that two or more constant values occur in reality and are thus not necessarily resulting from measurement or data processing errors.
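The first procedure, finding CVEs and recording their lengths from the differences of consecutive values, might look like the minimal sketch below. It is not the software accompanying the paper. The function name `find_cves` is hypothetical, a regularly sampled series is assumed, and treating a missing value (`None`) as the end of an episode is an assumption made here for illustration.

```python
def find_cves(values):
    """Return (start_index, length) for each run of >= 2 equal consecutive values.

    Assumption: None marks a missing value and terminates the current run.
    """
    episodes = []
    run_start = None
    for i in range(1, len(values)):
        prev, cur = values[i - 1], values[i]
        repeated = prev is not None and cur is not None and cur - prev == 0
        if repeated:
            if run_start is None:
                run_start = i - 1          # the run began at the previous value
        elif run_start is not None:
            episodes.append((run_start, i - run_start))
            run_start = None
    if run_start is not None:              # a run that reaches the end of the series
        episodes.append((run_start, len(values) - run_start))
    return episodes


series = [3.1, 3.1, 3.1, 2.9, None, 2.9, 4.0, 4.0]
episodes = find_cves(series)
print(episodes)
```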

2.1 Statistical background

To describe the joint process of a given time series, we assume such a stochastic process can be represented as a multivariate Gaussian distribution (Tong, 1990; Rencher, 2002). Let $X = (x_{1}, \dots, x_{n})$ be a series of random variables; the joint probability density function of a multivariate Gaussian distribution, 𝒩(μ, Σ), can be written as

\begin{matrix} (1) & f_{X} (x_{1}, \dots, x_{n}) = \frac{\exp (- \frac{1}{2} (x - μ)^{T} Σ^{- 1} (x - μ))}{\sqrt{(2 π)^{n} | Σ |}} . \end{matrix}

Here, μ is an n×1 mean vector and Σ is an n×n positive definite covariance matrix. In the stationary case, without loss of generality, μ can be assumed to be a constant, and Σ can be represented as the product of a finite constant variance σ² and an (auto)correlation matrix with elements ∅(ij), $\{i = 1, \dots, n; j = 1, \dots, n\}$, with ∅(ij)=1 if i=j (diagonal) and $0 \leq \emptyset (i j) \leq 1$ if i≠j (off-diagonal) for a given time series.

Long-range approximation of an environmental time series is generally unnecessary and computationally expensive (e.g., Wincek and Reinsel, 1986; Guttorp et al., 1994; Niu, 1996; Fioletov and Shepherd, 2003; Kumar and De Ridder, 2010). Here we assume that an environmental time series is auto-correlated and can be approximated by a first-order autoregressive (AR(1)) process (Tiao et al., 1990; Weatherhead et al., 1998, 2000; Reinsel et al., 2002). By the definition of an AR(1) process, x_i, i.e., the data value at time i, can be written as

\begin{matrix} (2) & x_{i} = const + \emptyset x_{i - 1} + ε_{i} . \end{matrix}

Here, ε_i is white noise, and const is an offset. With the assumption of the AR(1) process, the correlation matrix can be approximated by one parameter, ∅, since $Corr (X_{i}, X_{i - h}) = \emptyset^{|h|}$ (the correlation between any two points depends only on the time interval, h); thus, the stochastic process is governed by three parameters, i.e., μ, σ² and ∅.
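The relation between Eq. (2) and the three parameters can be checked numerically. The sketch below simulates a stationary Gaussian AR(1) series and compares the sample autocorrelation at lag h with ∅^|h|. It relies on the standard stationary AR(1) relations μ = const/(1 − ∅) and σ² = σ_ε²/(1 − ∅²), which are not spelled out in the text; the function names and the seed are hypothetical.

```python
import math
import random


def simulate_ar1(n, mu, sigma, phi, seed=0):
    """Simulate x_i = const + phi * x_{i-1} + eps_i with Gaussian white noise."""
    rng = random.Random(seed)
    const = mu * (1.0 - phi)                      # stationary mean: const / (1 - phi)
    noise_sd = sigma * math.sqrt(1.0 - phi ** 2)  # stationary variance: noise_var / (1 - phi^2)
    x = rng.gauss(mu, sigma)                      # start from the stationary distribution
    series = [x]
    for _ in range(n - 1):
        x = const + phi * x + rng.gauss(0.0, noise_sd)
        series.append(x)
    return series


def autocorrelation(series, lag):
    """Sample autocorrelation of the series at the given lag."""
    n = len(series)
    mean = sum(series) / n
    var = sum((v - mean) ** 2 for v in series) / n
    cov = sum((series[i] - mean) * (series[i - lag] - mean) for i in range(lag, n)) / n
    return cov / var


series = simulate_ar1(20000, mu=10.0, sigma=4.0, phi=0.8)
for lag in (1, 2, 3):
    print(lag, round(autocorrelation(series, lag), 3), round(0.8 ** lag, 3))
```

For a long series, the two printed columns should be close to each other, which illustrates why a single parameter ∅ is sufficient to describe the correlation matrix under the AR(1) assumption.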

The general likelihood of an AR(1) process can be approximated using the first-order Markov property as

\begin{matrix} (3) & p (x_{1}, \dots, x_{n}) = p (x_{1}) \prod_{k = 2}^{n} p (x_{k} | x_{k - 1}), \end{matrix}

where p(x₁) is the density of the initial state, which is not critical in this study, because the focus is placed on the probability of a consecutive state that is identical to the previous value, i.e., the second term, and $p (x_{k} | x_{k - 1})$ represents the probability distribution of x_k depending only on $x_{k - 1}$. The above equation is a general form without a distributional assumption. To derive the explicit form for the Gaussian case, we start from a univariate and a bivariate probability density function:

\begin{matrix} (4) & f (x_{k - 1}) = \frac{1}{σ \sqrt{2 π}} \exp \left( - \frac{1}{2} \left[ \frac{(x_{k - 1} - μ)^{2}}{σ^{2}} \right] \right), \\ (5) & f (x_{k - 1}, x_{k}) = \frac{1}{2 π σ^{2} \sqrt{1 - \emptyset^{2}}} \exp \left( - \frac{1}{2 (1 - \emptyset^{2})} \left[ \frac{(x_{k - 1} - μ)^{2}}{σ^{2}} + \frac{(x_{k} - μ)^{2}}{σ^{2}} - \frac{2 \emptyset (x_{k - 1} - μ) (x_{k} - μ)}{σ^{2}} \right] \right) . \end{matrix}

Then the conditional probability distribution of X_t given $X_{t - 1} = c$ can be derived using Bayes' theorem and written as (see Appendix A)

\begin{matrix} (6) & p (x_{t} | x_{t - 1} = c) \sim N (μ + \emptyset (c - μ), (1 - \emptyset^{2}) σ^{2}), \end{matrix}

where c is an arbitrary constant. The implication of such a formulation is that the resulting probability is also a function of c: if the statistical model parameters $(μ, σ^{2}, \emptyset)$ are fixed, a value of c close to the mean, μ, will result in a relatively higher probability density than a value of c far away from it.
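As a numerical cross-check of Eq. (6), the ratio of the bivariate density in Eq. (5) to the univariate density in Eq. (4) can be compared with the Gaussian density that Eq. (6) specifies. The sketch below is only such a check under the stated Gaussian AR(1) assumptions; the function names and the parameter values are hypothetical.

```python
import math
from statistics import NormalDist


def marginal_pdf(x_prev, mu, sigma):
    """Univariate Gaussian density, Eq. (4)."""
    return math.exp(-0.5 * (x_prev - mu) ** 2 / sigma ** 2) / (sigma * math.sqrt(2 * math.pi))


def joint_pdf(x_prev, x, mu, sigma, phi):
    """Bivariate Gaussian density of two consecutive values, Eq. (5)."""
    a, b = x_prev - mu, x - mu
    quad = (a * a + b * b - 2 * phi * a * b) / sigma ** 2
    norm = 2 * math.pi * sigma ** 2 * math.sqrt(1 - phi ** 2)
    return math.exp(-quad / (2 * (1 - phi ** 2))) / norm


def conditional_pdf(x, c, mu, sigma, phi):
    """Density of x_t given x_{t-1} = c, Eq. (6)."""
    dist = NormalDist(mu + phi * (c - mu), sigma * math.sqrt(1 - phi ** 2))
    return dist.pdf(x)


mu, sigma, phi = 10.0, 4.0, 0.8
c, x = 13.0, 11.5
ratio = joint_pdf(c, x, mu, sigma, phi) / marginal_pdf(c, mu, sigma)
print(ratio, conditional_pdf(x, c, mu, sigma, phi))

# Density of repeating the value c, for increasing distance of c from the mean.
repeat_density = [conditional_pdf(mu + k * sigma, mu + k * sigma, mu, sigma, phi) for k in range(4)]
print(repeat_density)
```

The second printout shows the dependence on c described above: the density of observing the same value again decreases as c moves away from μ.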

2.2 Constant value episode (CVE) probability

The estimation of the CVT probability consists of the following two steps.

Step 1. Deriving a joint probability density. For a series of (dependent) events, A_k with $1 \leq k \leq n$ , the joint density of probability can be described through a product of multiple conditional probabilities as

\begin{matrix} (7) & \begin{aligned} p (A_{n} \cap \dots \cap A_{1}) & = p (A_{1}) \prod_{k = 2}^{n} p (A_{k} | \cap_{j = 1}^{k - 1} A_{j}) \\ & = p (A_{1}) \prod_{k = 2}^{n} p (A_{k} | A_{k - 1}) . \end{aligned} \end{matrix}

The first equality follows from the chain rule of the joint distribution (Schum, 2001); the second equality holds for the special case of an AR(1) process.

Step 2. Imposing a distributional assumption to the joint probability distribution. From Eq. (6), the probability of consecutive values in a series with Gaussian probability density can be determined by

\begin{matrix} (8) & \begin{aligned} P ({CVE}_{t = 1, c \neq 0}) & = p (x_{t} = c | x_{t - 1} = c) \\ & = \int_{c - res / 2}^{c + res / 2} \frac{1}{σ \sqrt{2 π (1 - \emptyset^{2})}} \exp \left( - \frac{1}{2} \left[ \frac{((c - μ) - \emptyset (c - μ))^{2}}{(1 - \emptyset^{2}) σ^{2}} \right] \right) . \end{aligned} \end{matrix}

The integral reflects the fact that digital data are recorded with finite numerical precision. Then, according to the property of an AR(1) process, the probability of a CVE with a length of t can be calculated by raising P(CVE₁) to the power of t−1 as

\begin{matrix} (9) & P ({CVE}_{t, c \neq 0}) = \left( \int_{c - res / 2}^{c + res / 2} \frac{1}{σ \sqrt{2 π (1 - \emptyset^{2})}} \exp \left( - \frac{1}{2} \left[ \frac{((c - μ) - \emptyset (c - μ))^{2}}{(1 - \emptyset^{2}) σ^{2}} \right] \right) \right)^{t - 1} . \end{matrix}

Since this equation is designed for a constant event, the marginal probability remains constant for each CVE. To diminish the influence of CVEs on μ, the CVEs were excluded first, and then μ, σ and ∅ were calculated.
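One possible reading of Eq. (9) in code is sketched below; it is not the software accompanying the paper. The sketch assumes that the integral runs over the possible values of x_t, i.e., that the conditional Gaussian density of Eq. (6) is integrated over the interval [c − res/2, c + res/2], and it evaluates this integral with the cumulative distribution function. The function name `cve_probability` and the parameter values are hypothetical.

```python
import math
from statistics import NormalDist


def cve_probability(c, t, mu, sigma, phi, res):
    """Probability of a CVE of length t at value c under a Gaussian AR(1) model.

    Assumption: the integrand of Eq. (9) is the conditional density of
    x_t given x_{t-1} = c from Eq. (6), integrated over one resolution bin.
    """
    if t < 2:
        raise ValueError("a CVE consists of at least two values")
    cond = NormalDist(mu + phi * (c - mu), sigma * math.sqrt(1.0 - phi ** 2))
    p_repeat = cond.cdf(c + res / 2) - cond.cdf(c - res / 2)  # one repeated value
    return p_repeat ** (t - 1)                                 # t - 1 repeats in a row


p3 = cve_probability(c=10.0, t=3, mu=10.0, sigma=4.0, phi=0.8, res=0.01)
p4 = cve_probability(c=10.0, t=4, mu=10.0, sigma=4.0, phi=0.8, res=0.01)
print(p3, p4)
```

Because res is usually small compared with σ, the integral is close to res times the density at c, so a coarser resolution or a stronger autocorrelation raises the probability, while each additional repeated value lowers it by the same factor.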

For non-normal cases, the explicit parameterization of a non-independent joint distribution is difficult to derive due to mathematical challenges and often does not have a closed form. The non-parametric alternative is to use empirical distribution (Epanechnikov, 1969; Waterman and Whiteman, 1978) or kernel distribution (Hwang et al., 1994; Duong and Hazelton, 2005), but this approach is not desirable for database management at this stage because it is difficult to develop a unified framework that is adequate for all situations. Besides, the empirical distribution estimates a probability without taking into account auto-correlation, i.e., independent of the adjacent data points.

The AR(1) assumption can be relaxed by increasing the order of autocorrelation without too much complexity. For example, for an AR(2) process, one could specify the covariance matrix in Eq. (1) as

\begin{matrix} (10) & Σ = \left[ \begin{array}{ccc} σ^{2} & σ^{2} \emptyset_{1} & σ^{2} \emptyset_{2} \\ σ^{2} \emptyset_{1} & σ^{2} & σ^{2} \emptyset_{1} \\ σ^{2} \emptyset_{2} & σ^{2} \emptyset_{1} & σ^{2} \end{array} \right] \end{matrix}

and modify Eq. (7) in step 1 as

\begin{matrix} (11) & p (A_{n} \cap \dots \cap A_{1}) = p (A_{1}) p (A_{2} | A_{1}) \prod_{k = 3}^{n} p (A_{k} | A_{k - 1}, A_{k - 2}), \end{matrix}

then update the conditional probability parameterized by $(μ, σ^{2}, \emptyset_{1}, \emptyset_{2})$ in step 2. The more general extension of the autoregressive model is beyond the scope of this study; it is described in Box et al. (2015).

For variables with extra incidences of zero, such as nitrogen oxides (NO, NO₂) and ozone, the lower limit of the integration in Eq. (9) was changed from c − res/2 to 0. Note that in reality “zero” values in measurements may actually be recorded as small positive or negative numbers. This detail is ignored in the following because there is no universally applicable correction available. Some data sets may require a linear or non-linear bias correction, while for other data sets a simple cutoff, e.g., set to zero if $|\mathrm{value}| <$ threshold, may be more appropriate.
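The two ideas in this paragraph, the simple cutoff and the modified lower integration limit, can be sketched as follows. This is an illustration under assumptions rather than the paper's implementation: the conditional density of Eq. (6) is integrated from 0 to res/2 for an episode at c = 0, and the function names, the threshold and the parameter values are hypothetical.

```python
import math
from statistics import NormalDist


def clip_to_zero(values, threshold):
    """Simple cutoff: set a value to zero if its magnitude is below the threshold."""
    return [0.0 if abs(v) < threshold else v for v in values]


def zero_cve_probability(t, mu, sigma, phi, res):
    """Probability of a CVE of length t at c = 0, integrating from 0 to res / 2.

    Assumption: the integrand is the conditional density of x_t given
    x_{t-1} = 0 from Eq. (6).
    """
    if t < 2:
        raise ValueError("a CVE consists of at least two values")
    cond = NormalDist(mu + phi * (0.0 - mu), sigma * math.sqrt(1.0 - phi ** 2))
    p_repeat = cond.cdf(res / 2) - cond.cdf(0.0)
    return p_repeat ** (t - 1)


cleaned = clip_to_zero([0.3, -0.2, 5.0], threshold=0.5)
p2 = zero_cve_probability(t=2, mu=10.0, sigma=4.0, phi=0.8, res=1.0)
p3 = zero_cve_probability(t=3, mu=10.0, sigma=4.0, phi=0.8, res=1.0)
print(cleaned, p2, p3)
```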

3 Model sensitivity test

The P in Eq. (9) is affected by the parameters μ, σ, ∅, c, t and res. A simulation study was developed to evaluate the sensitivity of P to each parameter. Several experiments were conducted by generating a synthetic data series to demonstrate the influence of each parameter. For each experiment, the CVT was performed over a range of possible values.

A set of first-order autoregressive, AR(1), time series with hourly time steps and a length of 240 values (10 d) was generated using Eq. (2) and a random noise generator. As a reference case (ref), we set μ=10, σ=4 and ∅=0.8. The numerical precision was defined as 0.01. Four sets of CVEs with the same length (t=3) were added to this time series. The distance of the CVE from the mean, i.e., c−μ, was given as 0, 1, 2 and 3σ (see Fig. 1). In this figure, four CVEs are illustrated with a color code, i.e., red, blue, cyan and black, which are shown with boxes. The P varies from $7.67 \times 10^{- 6}$ for the first CVE to $4.77 \times 10^{- 7}$ for the fourth (last) CVE. As stated in Sect. 2.1, the value of P decreases as c−μ increases. CVEs which are further away from the mean are less likely to occur in nature.

Figure 1. A synthetic AR(1) time series with Gaussian data distribution and four arbitrarily selected CVEs of length t=3, with μ=10, σ=4, ∅=0.8, and $c - μ = 0$, 4, 8, and 12, respectively. The CVEs are shown using a color code, i.e., red, blue, cyan and black. The numerical precision (res) is chosen as 0.01.
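The dependence of P on the distance c−μ in this experiment can be illustrated with the nominal reference parameters. The sketch below assumes that Eq. (9) integrates the conditional density of Eq. (6) over one resolution bin, and it uses the prescribed values μ=10, σ=4 and ∅=0.8 directly instead of estimating them from a generated series. It is not the software accompanying the paper, so the absolute values need not coincide with those reported above; only the decrease of P with increasing c−μ is illustrated. The function name `cve_probability` is hypothetical.

```python
import math
from statistics import NormalDist


def cve_probability(c, t, mu, sigma, phi, res):
    """CVE probability for length t, assuming Eq. (6) is integrated over one bin."""
    cond = NormalDist(mu + phi * (c - mu), sigma * math.sqrt(1.0 - phi ** 2))
    p_repeat = cond.cdf(c + res / 2) - cond.cdf(c - res / 2)
    return p_repeat ** (t - 1)


# Nominal reference case of the sensitivity experiment.
mu, sigma, phi, res, t = 10.0, 4.0, 0.8, 0.01, 3

probabilities = []
for k in range(4):                    # c - mu = 0, 1, 2 and 3 sigma
    c = mu + k * sigma
    p = cve_probability(c, t, mu, sigma, phi, res)
    probabilities.append(p)
    print(f"c - mu = {k} sigma: P = {p:.3e}")
```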
