Low-cost sensors are often co-located with reference instruments to assess their performance and establish calibration equations, but limited
discussion has focused on whether the duration of this calibration period can be optimized. We placed a multipollutant monitor that contained
sensors that measured particulate matter smaller than 2.5
Instrument calibration is one of the main processes used to ensure instrument accuracy. In one method of calibration, measurements from an uncalibrated
instrument are compared with those from a reference instrument; the comparison is then used to adjust the output of the uncalibrated instrument and to check
whether the adjusted data meet performance standards (often in terms of accuracy and precision). In the case of low-cost air pollution sensors, the raw
output is often a voltage or resistance instead of a concentration, so a calibration curve is needed to convert the raw output into practical
units. Cross-sensitivities to environmental conditions or other pollutants, nonlinear responses, and variability between sensor units are common
difficulties that must be considered when working with low-cost sensor data (Van Zoest et al., 2019; Levy Zamora, 2022; Li et al., 2021; Spinelle
et al., 2015; Ripoll et al., 2019). Several methodologies have been used to derive the calibration equations needed to convert the raw data into
usable concentrations, such as exposing the sensors to known concentrations in a laboratory setting and co-locating the sensors with a reference
instrument, often in a setting similar to the one in which the sensor will be used (Taylor, 2016; Zimmerman et al., 2018; Mead et al., 2013; Ikram et al., 2012;
Hagler et al., 2018; Cross et al., 2017; Holstius et al., 2014; Mukherjee et al., 2019; Gao et al., 2015; Heimann et al., 2015; Air Quality Sensor Performance Evaluation Center,
2016a, b, 2017; Levy Zamora et al., 2018a). Field co-location is a widely used calibration method,
but it requires a trade-off between the time spent collecting calibration data and the time spent collecting data at the final measurement location. There
is currently no standardized co-location duration, and the reported co-location durations for low-cost sensors with reference instruments in recent
work have varied from several days to several months (Mukherjee et al., 2019; Gao et al., 2015; Topalović et al., 2019; Kim et al., 2018; Spinelle et al., 2017; Pinto et al., 2014; Datta et al., 2020). To date, little discussion has
focused on whether the selected periods were adequate for the deployment period or whether the calibration period can be optimized in future studies
(Topalović et al., 2019; Okorn and Hannigan, 2021). In one study that assessed the impacts of the co-location duration for a low-cost sensor, Okorn and Hannigan (2021) randomly selected calibration periods of up to
6 weeks in duration from 6 weeks of methane data in Los Angeles. The calibration equations were then applied to data from an earlier month at the same
location. They reported that longer calibration periods (i.e., 6 weeks) produced fits with a lower bias than fits from shorter calibration periods
(i.e., 1 week). In that study, the 1-week calibrations yielded the best
The central goal of this work was to identify the key factors that influence the co-location duration required to obtain sufficient data
to achieve consistent calibration curves for five low-cost sensors (particulate matter smaller than 2.5
Data collected at two sites were used in the co-location analyses based on the availability of reference instrumentation. The
We use different subsets of the full co-location period to create a suite of hypothetical co-location durations on which the calibration models
are trained. For each hypothetical calibration co-location scenario (i.e., ranging from 1 to 180 consecutive days in 1
Sensor data from the calibration period were used to determine the coefficients for multiple linear regression (MLR) models based on previously
identified environmental factors that influence the concentration for each sensor (Levy Zamora, 2022). A generic MLR model is given by
For each calibration period tested, the RMSE and correlation coefficient values were determined by comparing the 1
An RMSE value of 0 would indicate perfect agreement between the reference and the sensor. The correlation coefficient is a measure of the linear
correlation between two data sets. It is a value between −1 and 1.
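As a minimal sketch of this fit-and-evaluate step, the Python example below fits an MLR model by ordinary least squares on a calibration period and then computes the RMSE and the Pearson correlation coefficient on a separate evaluation period. The synthetic data, the choice of predictors (raw signal, temperature, and RH), and the function names `fit_mlr`, `predict_mlr`, `rmse`, and `pearson_r` are illustrative assumptions, not the models or data used in this study.

```python
import numpy as np

def fit_mlr(X, y):
    """Return least-squares MLR coefficients; the first entry is the intercept."""
    A = np.column_stack([np.ones(len(X)), X])  # prepend an intercept column
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict_mlr(coef, X):
    """Apply fitted coefficients to a predictor matrix."""
    return coef[0] + X @ coef[1:]

def rmse(reference, estimate):
    """Root mean square error between reference and calibrated sensor values."""
    diff = np.asarray(estimate, dtype=float) - np.asarray(reference, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))

def pearson_r(reference, estimate):
    """Pearson correlation coefficient between two series."""
    return float(np.corrcoef(reference, estimate)[0, 1])

# Synthetic, hypothetical data: a raw sensor signal plus temperature and RH.
rng = np.random.default_rng(0)
n = 500
raw = rng.uniform(0.2, 1.5, n)    # hypothetical raw output (e.g., volts)
temp = rng.uniform(-5.0, 35.0, n) # hypothetical temperature (deg C)
rh = rng.uniform(10.0, 90.0, n)   # hypothetical relative humidity (%)
X = np.column_stack([raw, temp, rh])
reference = 2.0 + 20.0 * raw - 0.1 * temp + 0.05 * rh + rng.normal(0.0, 0.5, n)

# Train on a calibration period, then evaluate on the remaining data.
cal, ev = slice(0, 200), slice(200, n)
coef = fit_mlr(X[cal], reference[cal])
estimate = predict_mlr(coef, X[ev])

print("RMSE:", rmse(reference[ev], estimate))
print("r:", pearson_r(reference[ev], estimate))
```

In this sketch the evaluation data are drawn from the same distribution as the calibration data, so the fit transfers well; the analyses in this paper concern the less favorable case in which conditions during the two periods differ.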
We hypothesize that a user could strategically choose a co-location period to minimize the calibration period and that the co-location duration is not the
only factor to consider when optimizing the co-location of an instrument for calibration. In these analyses, we use the term “coverage” to indicate the
representativeness of environmental conditions during a calibration period compared to those observed across the full data set (calibration and
evaluation periods). To visualize how the environmental conditions during the calibration period compared with those during the evaluation period, we
compared the range of temperature, RH, and other key pollutants from each period. For example, if RH across the full data set ranged from 10 % to 90 % and
RH during the calibration period ranged from 20 % to 60 %, the RH coverage of that calibration period would be 50 % (40/80). Descriptive
statistics of the reference data used in the calibration models from the full year are displayed in Table S1.
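The coverage calculation in the RH example can be sketched in Python as follows. The function name `coverage` and the use of the minimum and maximum to define each range are assumptions made for illustration; the ranges could equally be defined in another way (for example, with percentiles).

```python
import numpy as np

def coverage(calibration_values, full_values):
    """Fraction of the full data set's range spanned by the calibration period."""
    cal = np.asarray(calibration_values, dtype=float)
    full = np.asarray(full_values, dtype=float)
    full_min, full_max = np.nanmin(full), np.nanmax(full)
    full_span = full_max - full_min
    if full_span == 0:
        raise ValueError("The full data set has zero range.")
    # Clip the calibration range to the full range so coverage cannot exceed 1.
    low = max(np.nanmin(cal), full_min)
    high = min(np.nanmax(cal), full_max)
    return float(max(high - low, 0.0) / full_span)

# RH example from the text: full range 10-90 %, calibration range 20-60 %.
full_rh = [10, 35, 60, 90]
cal_rh = [20, 45, 60]
print(f"RH coverage: {coverage(cal_rh, full_rh):.0%}")  # RH coverage: 50%
```

The same function can be applied to temperature or to a reference pollutant concentration to obtain the coverage of each predictor for a candidate calibration period.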
The median and range (1st–99th percentile) of the RMSE from 250 calibration runs from six co-location lengths (1
The potential range of the
The potential range of RMSE values for a given co-location length for three low-cost sensors (
The range of RMSE values from 250 calibration periods in the sensitivity analysis of six co-location durations (i.e., 1
The median and range (1st–99th percentile) of correlation coefficients (
The ranges of correlation coefficients for the five low-cost sensors are shown in Table 2, and the box plots of the
Example comparison of two potential 1-week calibration periods. These were selected to illustrate the range of potential RMSE values that can result from using different periods of the same co-location duration. In the example here, “Calibration Period 1” yielded more accurate concentrations (shown in green; RMSE
The results show that the calibration performance from shorter-term co-locations varies considerably depending on the chosen co-location period. If a
user wanted all 250 potential co-location periods for the
Median RMSE values for
Comparison of the median RMSE (
Based on these results, we hypothesized that a key element governing good calibration outcomes is whether the calibration co-location period is
representative of the evaluation period in terms of the required predictors in Eq. (1). Note that the required predictors are distinct for each sensor
type, so optimal periods may differ by sensor. To evaluate this hypothesis, the median RMSE values for three sensors (
The
For all three sensors in Fig. 4, the RMSE values decreased as the concentration coverage increased, but the decrease was particularly notable for the
Baltimore, MD, experiences a broad range of meteorological conditions each year, so the co-location
duration must be long enough to capture an adequate range of conditions to fully characterize the calibration curves. The pollutants also exhibit
significant seasonal variation at this location. In other regions where the weather conditions are less variable, shorter co-location durations may be
more likely to produce accurate results. This is the primary reason why a “coverage” approach might be more useful than a fixed duration for
identifying appropriate co-location periods. Moreover, we applied the calibration equations to data from a full year, but shorter co-location
durations would likely be satisfactory if the calibration and measurement periods both took place under similar conditions (e.g., within one
season). For example, if we limited the calibration and evaluation periods to between 1 June and 31 August 2019 (peak
If little is known about key predictors at the measurement sites, which is likely at remote locations, it may be possible to use historical meteorological data and general information about pollutant patterns (e.g., emissions and seasonal concentration patterns) to determine a representative range of conditions. Future work should explore whether a combination of multiple, shorter calibration periods in different seasons may produce reasonable calibrations for year-round data sets. However, in all cases, it is advisable to lengthen the estimated co-location period to allow for data loss or unusual air quality events, which improves the probability of a well-performing calibration.
In this study, we assessed five pairs of co-located reference and low-cost sensor data sets (
The data shown in the paper are available upon request from the corresponding author.
Additional figures shown in the Supplement include (1) the start times of 250 randomly selected
MLZ and KK conceived of the presented idea. MLZ performed the calculations and created the visualization. All authors participated in designing and deploying instruments for data collection; optimizing the analytical approach; analysis and discussion of results; and contributing to writing the final manuscript.
The contact author has declared that none of the authors has any competing interests.
Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
This publication was developed under Assistance Agreement No. RD835871 awarded by the U.S. Environmental Protection Agency (EPA) to Yale University. It has not been formally reviewed by the EPA. The views expressed in this document are solely those of the authors and do not necessarily reflect those of the Agency. The EPA does not endorse any products or commercial services mentioned in this publication. The authors thank the Maryland Department of the Environment Air and Radiation Management Administration for allowing us to co-locate our sensors with their instruments at the Baltimore sites. Misti Levy Zamora is supported by the National Institute of Environmental Health Sciences of the National Institutes of Health (award nos. K99ES029116 and R00ES029116). The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. Drew R. Gentner has had externally funded projects on low-cost air quality monitoring technology (EPA, HKF Technology), where the developed technology has been licensed by Yale to HKF Technology. Abhirup Datta is supported by the National Science Foundation (grant no. DMS-1915803) and the National Institute of Environmental Health Sciences (NIEHS; grant no. R01ES033739). Colby Buehler is supported by the National Science Foundation Graduate Research Fellowship Program (grant no. DGE1752134). Any opinions, findings, conclusions, or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.
This research has been supported by the Environmental Protection Agency (grant no. RD835871), the National Institute of Environmental Health Sciences (grant nos. K99ES029116, R00ES029116, and R01ES033739), and the National Science Foundation (grant nos. DMS-1915803 and DGE1752134).
This paper was edited by Maria Dolores Andrés Hernández and reviewed by Sreekanth Vakacherla and one anonymous referee.