Linear regression techniques are widely used in atmospheric science, but they are
often improperly applied due to lack of consideration or inappropriate
handling of measurement uncertainty. In this work, numerical experiments are
performed to evaluate the performance of five linear regression techniques,
significantly extending previous works by Chu and Saylor. The five techniques
are ordinary least squares (OLS), Deming regression (DR), orthogonal distance
regression (ODR), weighted ODR (WODR), and York regression (YR). We first
introduce a new data generation scheme that employs the Mersenne twister (MT)
pseudorandom number generator. The numerical simulations are also improved
by (a) refining the parameterization of nonlinear measurement
uncertainties, (b) including a linear measurement uncertainty, and
(c) including WODR for comparison. Results show that DR, WODR and YR produce
an accurate slope, but the intercept by WODR and YR is overestimated and the
degree of bias is more pronounced with a low

Linear regression is heavily used in atmospheric science to
derive the slope and intercept of

Ordinary least squares (OLS) regression is the most widely used method due
to its simplicity. In OLS, it is assumed that independent variables are
error-free. This is the case for certain applications, such as determining a
calibration curve of an instrument in analytical chemistry. For example, a
known amount of analyte (e.g., through weighing) can be used to calibrate
the instrument output response (e.g., voltage). However, in many other
applications, such as inter-instrument comparison,

To overcome the drawback of OLS, a number of error-in-variables regression
models (also known as bivariate fittings; Cantrell, 2008) or total
least-squares methods (Markovsky and Van Huffel, 2007) have been developed. Deming (1943)
proposed an approach by minimizing sum of squares of

In principle, a best-fit regression line should have greater dependence on
the more precise data points rather than the less reliable ones. Chu (2005)
performed a comparison study of OLS and DR specifically focusing on the EC
tracer method application and found that the slope estimated by DR is closer to
the correct value than OLS but may still overestimate the ideal value. Saylor
et al. (2006) extended the comparison work of Chu (2005) by including a
regression technique developed by York et al. (2004). They found that the
slope overestimation by DR in the study of Chu (2005) was due to improper
configuration of the weighting parameter,

In this study, we extend the work by Saylor et al. (2006) to achieve four
objectives. The first is to propose a new data generation scheme by applying
the Mersenne twister (MT) pseudorandom number generator for evaluation of
linear regression techniques. In the study of Chu (2005), data generation is
achieved by a varietal sine function, which has limitations in sample size,
sample distribution, and nonadjustable correlation (

OLS only considers the errors in
dependent variables (

ODR minimizes the sum of the squared
orthogonal distances from all data points to the regressed line and
considers equal error variances (i.e., distance of AC in Fig. S1):
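For reference, the textbook form of this objective can be written as follows. The symbols are assumptions made for illustration and may differ from the paper's own notation: $k$ is the slope, $b$ the intercept, and $(x_i, y_i)$ the $N$ observed data points. Each term is the squared perpendicular distance from a point to the line $y = kx + b$.

```latex
S = \sum_{i=1}^{N} \frac{\left(y_i - k x_i - b\right)^2}{1 + k^2}
```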

Unlike ODR, which considers
even error in

Deming (1943) proposed the following function to
minimize both the

The York method (York et al., 2004) introduces the
correlation coefficient of errors in

A summary of the five regression techniques is given in Table S1 in the Supplement. It is worth noting that OLS and DR have closed-form expressions for calculating slope and intercept. In contrast, ODR, WODR and YR need to be solved iteratively. This needs to be taken into consideration when choosing a regression algorithm for handling very large datasets.
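The difference between closed-form and iterative solutions can be illustrated with a minimal Python sketch. It is not the Igor Pro code used in this study. The function names, the `delta` convention (ratio of the y error variance to the x error variance) and the use of a single error correlation `r` for all points are assumptions made for illustration. The iterative routine follows the general scheme of York et al. (2004), starting from the OLS slope.

```python
import math


def _moments(x, y):
    """Return the means and the centered sums of squares and cross-products."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    return mx, my, sxx, syy, sxy


def ols(x, y):
    """Closed-form OLS: errors are assumed to be in y only."""
    mx, my, sxx, _, sxy = _moments(x, y)
    slope = sxy / sxx
    return slope, my - slope * mx


def deming(x, y, delta=1.0):
    """Closed-form Deming fit (requires a nonzero x-y covariance).

    delta = var(y error) / var(x error); this convention is an assumption
    of the sketch. delta = 1 corresponds to equal error variances.
    """
    mx, my, sxx, syy, sxy = _moments(x, y)
    d = syy - delta * sxx
    slope = (d + math.sqrt(d * d + 4.0 * delta * sxy * sxy)) / (2.0 * sxy)
    return slope, my - slope * mx


def york(x, y, wx, wy, r=0.0, tol=1e-12, max_iter=200):
    """Iterative fit in the style of York et al. (2004).

    wx, wy are per-point weights (inverse error variances); r is one
    error correlation coefficient applied to every point (assumption).
    """
    b = ols(x, y)[0]  # OLS slope as the starting guess
    alpha = [math.sqrt(a * c) for a, c in zip(wx, wy)]
    for _ in range(max_iter):
        w = [a * c / (a + b * b * c - 2.0 * b * r * al)
             for a, c, al in zip(wx, wy, alpha)]
        sw = sum(w)
        xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
        ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
        u = [xi - xbar for xi in x]
        v = [yi - ybar for yi in y]
        beta = [wi * (ui / c + b * vi / a - (b * ui + vi) * r / al)
                for wi, ui, vi, a, c, al in zip(w, u, v, wx, wy, alpha)]
        b_new = (sum(wi * bi * vi for wi, bi, vi in zip(w, beta, v))
                 / sum(wi * bi * ui for wi, bi, ui in zip(w, beta, u)))
        converged = abs(b_new - b) < tol
        b = b_new
        if converged:
            break
    return b, ybar - b * xbar


# Illustrative data only (not from the study).
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [3.1, 4.9, 7.2, 8.8, 11.1, 12.9]
print("OLS   :", ols(x, y))
print("Deming:", deming(x, y, delta=1.0))
print("York  :", york(x, y, [1.0] * 6, [1.0] * 6))
```

OLS and Deming each need a single pass over the data, whereas the York-style routine repeats such passes until the slope stops changing, which is where the extra cost for large datasets comes from. With constant weights and `r = 0`, the iterative result should coincide with the Deming fit for `delta = wx / wy`.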

A computer program (Scatter Plot; Wu, 2017a) with a graphical user interface (GUI) in Igor Pro (WaveMetrics, Inc. Lake Oswego, OR, USA) was developed to facilitate the implementation of error-in-variables regression (including DR, WODR and YR). Two other Igor Pro-based computer programs, Histbox (Wu, 2017b) and Aethalometer data processor (Wu, 2017c), are used for data analysis and visualization in this study.

Two types of data are used for regression comparison. The first type is synthetic data generated by computer programs, which can be used in the EC tracer method (Turpin and Huntzicker, 1995) to demonstrate the regression application. The true “slope” and “intercept” are assigned during data generation, allowing quantitative comparison of the bias of each regression scheme. The second type of data comes from ambient measurement of light absorption, OC and EC in Guangzhou for demonstration in a real-world application.

In this study, numerical simulations are conducted in Igor Pro (WaveMetrics, Inc. Lake Oswego, OR, USA) through custom codes. Two types of generation schemes are employed: one is based on the MT pseudorandom number generator (Matsumoto and Nishimura, 1998) and the other is based on the sine function described by Chu (2005).

The general form of linear regression on

To make the discussion easier to follow, we intentionally avoid discussion
using the abstract general form and instead opt to use a real-world
application case in atmospheric science. Because linear regression has been
heavily applied to OC and EC data, we use OC and EC data as an example to
demonstrate the regression application in atmospheric science. In the EC
tracer method, OC (mixture) is

Weighting of variables is a crucial input for errors-in-variables linear
regression methods such as DR, YR and WODR. In practice, the weights are
usually defined as the inverse of the measurement error variance (Eq. 5).
When measurement errors are considered, measured concentrations
(Conc.

Two types of measurement error are considered in this study. The first type
is

Uniform distribution has been used in previous studies (Cox et al., 2003;
Chu, 2005; Saylor et al., 2006) and is adopted in this study to parameterize
measurement error. For a uniform distribution in the interval

The Mersenne twister (MT) is a pseudorandom number generator (PRNG) developed
by Matsumoto and Nishimura (1998). MT has been widely adopted by mainstream
numerical analysis software (e.g., MATLAB, SPSS, SAS and Igor Pro) as well as
popular programming languages (e.g., R, Python, IDL, C

In this section, we will use POC as

Flowchart of data generation steps using MT.
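An MT-based generation scheme of this kind can be sketched in Python, whose standard `random` module uses the Mersenne twister as its core generator. This is a simplified illustration and not the Igor Pro procedure used in this study. The function name, the uniform distribution of the true x values, the value range and the proportional (relative) error half-widths are all assumptions.

```python
import random


def generate_synthetic(n, slope, intercept, x_range=(0.5, 10.0),
                       rel_err_x=0.1, rel_err_y=0.1, seed=12345):
    """Generate true and 'measured' (x, y) pairs with uniform errors.

    Hypothetical helper: true x values are drawn uniformly from x_range,
    true y values follow the assigned slope and intercept, and each
    measured value receives an error drawn uniformly from
    [-rel_err * |value|, +rel_err * |value|].
    """
    rng = random.Random(seed)  # Mersenne twister (MT19937) in CPython
    x_true, y_true, x_meas, y_meas = [], [], [], []
    for _ in range(n):
        xt = rng.uniform(*x_range)
        yt = slope * xt + intercept
        half_x = rel_err_x * abs(xt)
        half_y = rel_err_y * abs(yt)
        # A uniform error on [-h, h] has a standard deviation of h / sqrt(3).
        x_true.append(xt)
        y_true.append(yt)
        x_meas.append(xt + rng.uniform(-half_x, half_x))
        y_meas.append(yt + rng.uniform(-half_y, half_y))
    return x_true, y_true, x_meas, y_meas


x_true, y_true, x_meas, y_meas = generate_synthetic(1000, slope=4.0, intercept=0.0)
print(len(x_meas), min(x_meas), max(x_meas))
```

Because the true slope and intercept are assigned before the errors are added, any regression result on `x_meas` and `y_meas` can be compared directly with the assigned values. Fixing the seed makes each run reproducible.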

Besides MT, the sine function data generation scheme is included in this study for two purposes. First, the sine function scheme was adopted in two previous studies (Chu, 2005; Saylor et al., 2006), so including it helps to verify whether the Igor codes for the various regression approaches reproduce the results of those studies. Second, the cross-check between results from the sine function and MT schemes provides circumstantial evidence that the MT scheme works as expected.

In this section,

POC

Sampling was conducted from Feb 2012 to Jan 2013 at the suburban Nancun (NC)
site (23

In the following comparisons, six regression approaches are compared using
two data generation schemes (Chu sine function and MT) separately, as
illustrated in Fig. 4. Each data generation scheme considers both ®, SigmaPlot®,
GraphPad Prism®),

Overview of the comparison study design.

In this section, the scheme of Chu (2005) is adopted for data generation to
obtain a benchmark of six regression approaches. With different setups of
slope, intercept and

Summary of six regression approaches comparison with 5000 runs for 18 cases.

A comparison of the regression techniques results with

As shown in Fig. 5, for the zero-intercept case (Case 1), OLS significantly
underestimates the slope (2.95

Regression results on synthetic data, Case 1 (Slope

For Case 3, LOD

Slope and intercept biases by different regression schemes in two
test scenarios (A and B) in which the assumed error for one of the
regression variables deviates from the actual measurement error. In Test A
data generation,

An uneven LOD

Cases 5 and 6 represent the results from using

In this section, MT is adopted for data generation to obtain a benchmark of
six regression approaches. Both

Cases 7 and 8 use data generated by MT and

To test the overestimation/underestimation dependency on the true slope,
Case 9 (slope

These results imply that if the true slope is less than 1, the improper
weighting (

Cases 13 and 14 (Table 1) represent the results from using

Regression results using ambient

As discussed above, inappropriate

In the

In atmospheric applications, there are scenarios in which a priori error in one of the variables is unknown, or the measurement error described cannot be trusted. For example, in the case of comparing model prediction and measurement data, the uncertainty of model prediction data is unknown. A second example is the case in which measurement uncertainty cannot be determined due to the lack of duplicated or collocated measurements and as a result, an arbitrarily assumed uncertainty is used. Such a case was illustrated in the study by Flanagan et al. (2006). They found that in the Speciation Trends Network (STN), the whole-system uncertainty retrieved by data from collocated samplers was different from the arbitrarily assumed 5 % uncertainty. Additionally, the discrepancy between the actual uncertainty obtained through collocated samplers and the arbitrarily assumed uncertainty varied by chemical species. To investigate the performance of different regression approaches in these cases, two tests (A and B) are conducted.
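The effect of an assumed uncertainty that deviates from the actual one can be illustrated with a small Python sketch. It does not reproduce the configuration of Tests A and B. The data range, error half-widths, seed and function names are assumptions chosen for illustration, and only Deming regression is shown. Data are generated with a known slope, then fitted once with the error variance ratio that matches the generated errors and once with a deliberately wrong ratio.

```python
import math
import random


def deming_slope(x, y, delta):
    """Closed-form Deming slope; delta = var(y error) / var(x error)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    d = syy - delta * sxx
    return (d + math.sqrt(d * d + 4.0 * delta * sxy * sxy)) / (2.0 * sxy)


def simulate(n=5000, true_slope=4.0, half_x=1.0, half_y=4.0, seed=2024):
    """Hypothetical generator: zero intercept, uniform errors in x and y."""
    rng = random.Random(seed)  # Mersenne twister in CPython
    x_true = [rng.uniform(0.0, 10.0) for _ in range(n)]
    x = [xt + rng.uniform(-half_x, half_x) for xt in x_true]
    y = [true_slope * xt + rng.uniform(-half_y, half_y) for xt in x_true]
    return x, y


x, y = simulate()

# Uniform errors on [-h, h] have variance h^2 / 3, so the actual
# variance ratio of the generated errors is (half_y / half_x)^2 = 16.
slope_actual = deming_slope(x, y, delta=16.0)
# Deliberately mis-specified ratio: equal error variances are assumed.
slope_assumed = deming_slope(x, y, delta=1.0)

print("true slope            : 4.0")
print("delta matches errors  :", slope_actual)
print("delta assumed to be 1 :", slope_assumed)
```

In this setup, the fit that uses the matching ratio should recover the assigned slope more closely than the fit with the mis-specified ratio. The size and direction of the bias depend on the chosen slope and error magnitudes, so the numbers from this sketch should not be read as results of the study.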

In Test A, the actual measurement error for

The user interface of the Scatter Plot Igor program. The program and
its operation manual are available from

In Test B,

The results from these two tests suggest that, if one of the
measurement errors described cannot be trusted or a priori error in one of
the variables is unknown, WODR, DR

This section demonstrates the application of the six regression approaches to a
light absorption coefficient and EC dataset collected at a suburban site in
Guangzhou. As mentioned in Sect. 4.4, measurement uncertainties are crucial
inputs for DR, YR and WODR. The measurement precision of the Aethalometer is
5 % (Hansen, 2005), while that of EC by the RT-ECOC analyzer is 24 % (Bauer et al.,
2009). These measurement uncertainties are used in the DR, YR and WODR
calculations. The dataset contains 6926 data points with an

As shown in Fig. 7, the

Regression comparison is also performed on hourly OC and EC data. Regression
on OC

As discussed in this section, the ambient data confirm the results obtained from the method comparison with synthetic data. The advantage of using synthetic data to evaluate regression approaches is that the ideal slope and intercept are assigned during data generation, so the bias of each regression approach can be quantified.

This study aims to provide a benchmark of commonly used linear regression
algorithms using a new data generation scheme (MT). Six regression
approaches are tested, i.e., OLS, DR (

The application of error-in-variables regression is often overlooked in
atmospheric studies, partly due to the lack of a dedicated tool for
implementing the regression. To facilitate the implementation of
error-in-variables regression (including DR, WODR and YR), a computer
program (Scatter Plot) with a GUI in Igor Pro
(WaveMetrics, Inc. Lake Oswego, OR, USA) was developed (Fig. 8). It is packed
with many useful features for data analysis and plotting, including batch
plotting, data masking via GUI, color coding in the

OC, EC and

Ordinary least squares (OLS) calculation steps.

First calculate average of observed

Besides

Slope by OLS can be used as the initial

Then calculate

Slope and intercept can be obtained from

The supplement related to this article is available online at:

The authors declare that they have no conflict of interest.

This work was supported by the National Natural Science Foundation of China (grant nos. 41605002, 41475004 and 21607056), the NSFC of Guangdong Province (grant no. 2015A030313339), and the Guangdong Province Public Interest Research and Capacity Building Special Fund (grant no. 2014B020216005). The author would like to thank Bin Yu Kuang at HKUST for the discussions on mathematics and Stephen M. Griffith at HKUST for the valuable comments.

Edited by: Willy Maenhaut
Reviewed by: two anonymous referees