Understanding and improving the quality of data generated from low-cost sensors represents a crucial step in using these sensors to fill gaps
in air quality measurement and understanding. This paper shows results from
a 10-month-long campaign that included side-by-side measurements and
comparison between reference instruments approved by the United States
Environmental Protection Agency (EPA) and low-cost
particulate matter sensors in Bartlesville, Oklahoma. At this rural site in
the Midwestern United States, the instruments typically encountered only low
(under 20 µg m⁻³) concentrations of particulate matter.
Traditional particulate matter measurements are taken using stationary instruments that cost tens if not hundreds of thousands of dollars. The high cost limits data collection to certain entities such as government agencies and research institutions that take measurements through field campaigns and through networks of stationary sensors. However, research has shown that these traditional measurements do not capture the spatial variations in particulate matter (Apte et al., 2017; Mazaheri et al., 2018). Low-cost sensors are increasingly being used in attempts to better map the spatial and temporal variations in particulate matter (Ahangar et al., 2019; Bi et al., 2020; Gao et al., 2015; Li et al., 2020; Zikova et al., 2017). Governments, citizen scientists, and device manufacturers are connecting these low-cost devices to build large air quality measurement networks. Understanding and improving the quality of this type of data is crucial in determining its appropriate applications. Though there has been a significant amount of research in recent years on the topic (Feenstra et al., 2019; Holstius et al., 2014; Jiao et al., 2016; Malings et al., 2020; Papapostolou et al., 2017; Williams et al., 2019, 2018), there is an ongoing effort to understand (1) how to concisely describe the performance of a low-cost sensor and (2) what best practices can maximize data quality while keeping costs down. Rather than presenting evaluation results for specific low-cost sensors, this study focuses on evaluation methods that can improve the use of all low-cost sensors.
Much of the performance characterization of low-cost sensors has focused on the correlation (R²) between sensor and reference measurements.
A past evaluation of a sensor is a useful predictor of future data quality, but quality assurance techniques are needed to ensure that data quality continues to meet expectations. Calibrations are an important component of quality assurance. During a calibration, low-cost sensors and reference instruments measure the same mass of air for a period of time, and adjustments are then made to better align sensor measurements with the reference. Though laboratory comparisons would be more consistent, only location-specific field comparisons are able to capture the full variety of particle sizes and compositions that a sensor will encounter once deployed (Datta et al., 2020; Jayaratne et al., 2020). However, there are different calibration techniques with varying costs (Hasenfratz et al., 2015; Holstius et al., 2014; Malings et al., 2020; Stanton et al., 2018; Williams et al., 2019), and the requirements for a successful field calibration are not always clear. This technical gap is explored in this study by evaluating changes in linear regression parameters over time and their dependence on the amount of data included. A recent publication from the United States Environmental Protection Agency (Duvall et al., 2021) begins to address these issues and standardize evaluation practices, though they acknowledge that this is an evolving topic.
This study of low-cost particulate matter sensors was conducted in a rural
area of the Midwestern United States (Bartlesville, Oklahoma). This area is
interesting for evaluation as it typically sees lower concentrations of
PM2.5 than the urban areas where most sensor evaluations have been conducted.
Data were collected at the Phillips 66 Research Center in Bartlesville,
Oklahoma. Bartlesville is approximately 75 km north of Tulsa and has a
population of approximately 36 000. Measurements were collected over 9 months from May 2018 to January 2019 and for 1 additional month in April
2019. Particulate matter concentrations were typically low (under 20 µg m⁻³).
Reference measurements were collected using a Met One Beta Attenuation
Monitor 1020 (BAM) and a Teledyne T640 (T640). Though both instruments are
considered Federal Equivalent Methods (FEM) by the United States
Environmental Protection Agency (EPA), the BAM uses beta ray attenuation to
measure the mass of PM2.5, while the T640 uses optical scattering measurements to determine PM2.5 mass.
Low-cost particulate matter sensors were evaluated by comparing samples
taken within 4 m of the reference instruments. Four brands of low-cost
(less than USD 300 per sensor) nephelometric-type particulate matter sensors
were initially evaluated through comparison with reference measurements for
1 month in May 2018. The specific brands of sensors are not identified
here, as the primary goal of this work is exploration of different methods
of evaluation. Each sensor provided measurements in near-real time, but these
were averaged to 1 h intervals for comparison with the BAM. The correlation
(R²) of each brand with the BAM over this month was used to identify the best-performing brand.
Eight replicas of the best-performing brand of low-cost particulate matter
sensor were placed with the BAM from May 2018 to January 2019. The overall
correlation (R²) between each sensor replica and the BAM was evaluated over this period.
The eight sensor replicas were divided into four pairs with different measurement agreement requirements for the data from the sensors in each pair (first averaged to 1 h intervals). The agreement requirement for each pair was defined as a maximum allowed difference between the paired measurements. Though only absolute (µg m⁻³) requirements are presented here, relative (percentage-based) requirements could be applied in the same way.
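The pairing approach can be expressed as a simple screening step. The snippet below is a minimal sketch using invented hourly values and an assumed 5 µg m⁻³ absolute threshold; neither the threshold nor the data come from this study.

```python
import numpy as np

def flag_disagreement(pair_a, pair_b, abs_threshold=5.0):
    """Return a boolean mask marking 1 h intervals where two duplicate
    sensors disagree by more than abs_threshold (ug m-3)."""
    pair_a = np.asarray(pair_a, dtype=float)
    pair_b = np.asarray(pair_b, dtype=float)
    return np.abs(pair_a - pair_b) > abs_threshold

# Invented hourly averages from one sensor pair (ug m-3)
a = np.array([8.0, 9.5, 12.0, 30.0, 11.0])
b = np.array([8.5, 9.0, 11.0, 12.0, 10.5])
bad = flag_disagreement(a, b)  # only the 30 vs. 12 interval is flagged
clean_a = a[~bad]              # keep only intervals where the pair agrees
```

A relative threshold could be implemented identically by comparing the absolute difference against a fraction of the pair mean.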
Figure 1. Hourly measurements from the paired duplicate sensors.
Figure 1 shows that good agreement was observed between measurements from
duplicate sensors. Figure 2a supports this observation, showing close
correlation between bias, defined as the difference between sensor and reference measurements, from one sensor to the next. In contrast, sensor bias showed little correlation with meteorological variables.
The lack of correlation with meteorology suggests that a different external factor, such as particle properties, may influence sensor measurements. Previous research has observed the impact of particle composition on the accuracy of low-cost sensors (Giordano et al., 2021; Kuula et al., 2020; Levy Zamora et al., 2019). Particle size has also been observed to influence measurements (Stavroulas et al., 2020; Zou et al., 2021a). Very small particles go undetected, and other particles can be incorrectly sized by the optical detectors used in low-cost particulate matter sensors. Regardless of the cause of the varying yet correlated sensor response, the data here suggest that low-cost measurements of meteorology will not be sufficient to improve low-cost sensor data. It may be possible to improve sensor data through measurements of particle properties, but the high cost of these measurements would undo the benefit of the low sensor price.
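The pattern of bias that is shared across replicas but unrelated to meteorology can be illustrated with synthetic data. In the sketch below, the shared error component, the noise levels, and the temperature series are all invented for illustration and are not values from this study.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
reference = rng.uniform(2, 20, n)           # "true" PM2.5 (ug m-3)
shared_error = rng.normal(0, 2, n)          # error component common to replicas
sensor_1 = reference + shared_error + rng.normal(0, 0.3, n)
sensor_2 = reference + shared_error + rng.normal(0, 0.3, n)
temperature = rng.uniform(0, 35, n)         # independent meteorological series

bias_1 = sensor_1 - reference               # bias = sensor minus reference
bias_2 = sensor_2 - reference

r_bias = np.corrcoef(bias_1, bias_2)[0, 1]       # high: replicas share their bias
r_temp = np.corrcoef(bias_1, temperature)[0, 1]  # near zero: bias unrelated to T
```

When the dominant error source is common to all replicas, bias is nearly identical across sensors even though it is invisible to meteorological covariates.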
The T640 was available for comparisons for approximately 1 month in April
2019. In contrast to the BAM, which reports data only in 1 h intervals,
the T640 was programmed to report a measurement every minute, allowing
comparisons to the high-time-resolution data offered by sensors. In
addition, the month of April provided useful comparisons as elevated
concentrations of particulate matter were observed on 3 different days,
likely due to nearby agricultural burning. Under different evaluation
conditions, the R² between sensor and reference measurements varied considerably.
The changes to R² under each set of conditions are shown in Fig. 3.
1 h averaged data comparisons between a sensor and the T640 are used as
a baseline for comparison (Fig. 3b). To highlight differences, the 3 d that included high particulate matter concentrations were not included, except when indicated (Fig. 3e). Figure 3a–c shows that less averaging
(higher-time-resolution data) negatively impacted R².
Figure 3d shows the difference in comparison due to reference instrument.
The EPA considers both the BAM and the T640 to be FEM instruments, but both the slope and the R² of the sensor comparison changed depending on which reference instrument was used.
Figure 3e is the only panel in Fig. 3 that includes data from the 3 d
in April 2019 on which higher concentrations of particulate matter were
observed. Most 1 h measurements were below 50 µg m⁻³, and including the days with higher concentrations increased the apparent R².
Figure 3 shows that the circumstances surrounding an evaluation such as
averaging time, reference instrument, and the presence of high particulate
matter concentrations can be very influential on the performance results for
a sensor, even with other factors being held equal. The averaging time and
the choice of reference instrument could become smaller issues as standard
evaluation procedures are developed, such as those recently proposed by the
United States Environmental Protection Agency (Duvall et al., 2021). However, the influence of concentration range on R² is harder to control: a wider range of concentrations generally yields a higher R² even when the underlying sensor performance is unchanged.
A prediction interval (PI) between sensor and reference data offers a
robust yet straightforward interpretation of sensor measurements. A 95 %
PI suggests that one can be 95 % confident that any new measurement will
be within its bounds; thus, a new sensor reading can be converted to a range
of estimates with statistical confidence. The width of this PI is a useful
way to show the performance of the sensor. Though a PI is calculated from a
linear regression just like R², the PI expresses uncertainty in concentration units and can be applied directly to new measurements.
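The PI described here can be computed with the standard formula for a prediction interval on a new observation from a simple linear regression. The sketch below uses synthetic sensor/reference pairs; the slope, intercept, and noise level are illustrative assumptions, not results from this study.

```python
import numpy as np
from scipy import stats

def prediction_interval(x, y, x_new, level=0.95):
    """Bounds of a prediction interval for a new observation, from a
    simple linear regression of reference (y) on sensor (x) data."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = x.size
    slope, intercept = np.polyfit(x, y, 1)
    resid = y - (slope * x + intercept)
    s = np.sqrt(np.sum(resid ** 2) / (n - 2))   # residual standard error
    sxx = np.sum((x - x.mean()) ** 2)
    t = stats.t.ppf(0.5 + level / 2, df=n - 2)
    y_hat = slope * x_new + intercept
    half = t * s * np.sqrt(1 + 1 / n + (x_new - x.mean()) ** 2 / sxx)
    return y_hat - half, y_hat + half

# Synthetic sensor/reference pairs (slope 0.8, intercept 1, noise sd 2)
rng = np.random.default_rng(1)
sensor = rng.uniform(0, 50, 500)
reference = 0.8 * sensor + 1.0 + rng.normal(0, 2, 500)
lo, hi = prediction_interval(sensor, reference, x_new=20.0)
```

The interval (lo, hi) is the range of reference-equivalent concentrations consistent with a new sensor reading of 20 at the chosen confidence level.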
Figure 4 shows the PI for data that were collected at 5 min intervals in
April 2019. After fitting a linear regression to these data, it was found
that residuals generally increased with increasing concentration, suggesting
that bias can be higher as concentrations increase. This is also seen in the
Fig. 2d comparison between low-cost sensor bias and BAM measurements. In
order to ensure a correct linear fit and PI, these trends in residuals were
eliminated by transforming both the sensor and reference data prior to the
fit. Through examination it was found that residual trends were best
eliminated by raising both the sensor and reference data to the 0.4 power.
Future applications of this method to various sensors may find that different
powers or transformation methods are needed to eliminate trends that are
observed in residuals. Even duplicates of the same sensor may require
different transformations if taking measurements in different locations. A
detailed analysis of residuals is an important step in all model
development. In addition to the transformation, it is worth noting that measurements less than 70 µg m⁻³ make up the large majority of the data, so the fit and PI are most tightly constrained in that range.
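The residual-trend check described above can be sketched as follows. The synthetic data use multiplicative noise as a stand-in for the concentration-dependent residuals observed here, and the 0.4 power follows the transformation used in this study; other datasets may require a different power or transformation.

```python
import numpy as np

rng = np.random.default_rng(2)
sensor = rng.uniform(1, 200, 2000)
# Synthetic reference data whose noise grows with concentration
reference = 0.9 * sensor * (1 + rng.normal(0, 0.15, 2000))

def abs_residual_trend(x, y):
    """Correlation between |residual| and x after a linear fit;
    values near zero indicate a uniform residual spread."""
    slope, intercept = np.polyfit(x, y, 1)
    resid = y - (slope * x + intercept)
    return np.corrcoef(np.abs(resid), x)[0, 1]

trend_raw = abs_residual_trend(sensor, reference)
# Raising both series to the 0.4 power reduces the residual trend
trend_pow = abs_residual_trend(sensor ** 0.4, reference ** 0.4)
```

Checking that the transformed residuals no longer trend with concentration is the criterion for an acceptable transformation; PI bounds computed on the transformed scale can then be raised to the 1/0.4 power to return to concentration units.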
An example of a prediction interval evaluation for 5 min data from a single sensor in April 2019 that includes periods of high concentration. The curved lines are the upper and lower limits of the 95 %
prediction interval. A visual interpretation of a new sensor measurement of
200 µg m⁻³ is also shown.
Once data have been analyzed using the method shown in Fig. 4, the
interpretation of new sensor data becomes easy. As shown in Fig. 4, a new
sensor measurement at 200 µg m⁻³ corresponds to a range of likely reference concentrations, read from the upper and lower PI limits.
A range can be provided for any new sensor measurement following this
method with 95 % confidence. In some cases that uncertainty range may
limit the ability to distinguish one concentration from another. For
example, in Fig. 4 the estimated ranges for 5 min averaged sensor
measurements of 150 and 200 µg m⁻³ overlap, so the two readings cannot be confidently distinguished from one another.
Figure 5 shows the dependence of average uncertainty on the concentration
(sensor estimate) and on the averaging time. For the sake of simplicity Fig. 5 only shows the average difference between the sensor estimate and both the upper and lower PI. In the example in Fig. 4, uncertainty would be calculated as ((PI_upper − estimate) + (estimate − PI_lower))/2, which is simply half the width of the prediction interval.
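The dependence of this half-width measure on averaging time can be illustrated with synthetic data. The sketch below assumes purely random 5 min sensor noise, so averaging shrinks the uncertainty roughly as the square root of the window length; real sensor bias (see Fig. 2) would reduce this benefit.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 12 * 24 * 30                              # one month of 5 min intervals
truth = 10 + 5 * np.sin(np.arange(n) / 500)   # slowly varying PM2.5 (ug m-3)
sensor = truth + rng.normal(0, 4, n)          # sensor = truth plus random noise

def half_width(sensor, truth, window):
    """Block-average to `window` 5 min points, then report half the
    95 % spread of (sensor - truth), a stand-in for half the PI width."""
    m = (sensor.size // window) * window
    s = sensor[:m].reshape(-1, window).mean(axis=1)
    t = truth[:m].reshape(-1, window).mean(axis=1)
    return 1.96 * np.std(s - t)

u_5min = half_width(sensor, truth, 1)
u_1h = half_width(sensor, truth, 12)    # 12 x 5 min = 1 h
u_1d = half_width(sensor, truth, 288)   # 288 x 5 min = 1 d
```

Under this purely random error model the half-width shrinks monotonically with longer averaging, mirroring the averaging-time dependence shown in Fig. 5.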
An alternative view of the prediction interval, which shows how this interval varies with concentration and with averaging time.
A PI evaluation such as that shown in Figs. 4 and 5 offers more information about what to expect from future sensor measurements than a single metric such as R² does.
Picking a single comparison point allows users to quickly compare
measurement uncertainty between different sensor types, as they might
currently do using R².
The analysis shown in Figs. 4 and 5 relies on a user collecting enough
data to predict the PI bounds that capture 95 % of future data points.
Without sufficient data the PI will be incorrect or will only cover
measurements over a limited range. This is illustrated in the 1 d averaged
uncertainty in Fig. 5, where uncertainty is only calculated for
concentrations ranging from approximately 5 to 25 µg m⁻³, the only range covered by sufficient data at that averaging time.
Linear regression parameters from 1 h measurements over a 2-month period in 2018. Panel (a) shows the daily range of values; the remaining panels show parameters calculated over rolling 1, 7, and 14 d windows.
Figure 6 shows variations to linear fit parameters between 1 h averaged
sensor and reference measurements collected during a 2-month period.
Figure 6a shows the range of values observed each day during this period. In
Fig. 6b–d a linear regression is calculated for 1 h measurements in a
rolling timeframe of 1, 7, or 14 d. Figure 6b shows that with just 1 d of 1 h averaged data, the fitted slope varies widely from day to day.
Just as for the slope, the intercept varies substantially when short windows are used, with the variation shrinking as the window lengthens.
Figure 6d shows residual standard error over time, which can be used as a surrogate for the width of a PI. Calibrations using 1 d of data often have low error, as they include only a narrow range of concentrations. The residual standard error changes over time, even using 7 and 14 d intervals. All variations in residual standard error should be captured for a thorough calibration. It is especially important to capture the maximum in residual standard error to ensure that the prediction interval captures the full range of uncertainty in future measurements.
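A rolling-window analysis of this kind can be sketched as follows, using synthetic hourly data with an assumed constant sensor response (the 0.8 slope and noise level are invented for illustration).

```python
import numpy as np

def rolling_fit(sensor, reference, window):
    """Slope, intercept, and residual standard error from a linear fit
    over each rolling window of hourly data."""
    out = []
    for i in range(len(sensor) - window + 1):
        x = sensor[i:i + window]
        y = reference[i:i + window]
        slope, intercept = np.polyfit(x, y, 1)
        resid = y - (slope * x + intercept)
        rse = np.sqrt(np.sum(resid ** 2) / (window - 2))
        out.append((slope, intercept, rse))
    return np.array(out)

# Two months of synthetic hourly data with a fixed sensor response
rng = np.random.default_rng(4)
hours = 24 * 60
reference = rng.uniform(2, 25, hours)
sensor = reference / 0.8 + rng.normal(0, 2, hours)

fits_1d = rolling_fit(sensor, reference, 24)      # 1 d windows
fits_7d = rolling_fit(sensor, reference, 24 * 7)  # 7 d windows
spread_1d = fits_1d[:, 0].std()   # day-to-day scatter of the fitted slope
spread_7d = fits_7d[:, 0].std()   # much smaller with week-long windows
```

Even with a perfectly stable sensor, 1 d windows produce noticeably scattered slope estimates; real data add time-varying particle properties on top of this sampling noise.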
In short, Fig. 6 shows that even with up to 7 or 14 d of data, variations in linear regression parameters are still observed. Considine et al. (2021) observed that approximately 3 weeks were needed for a simple linear sensor calibration, and 7–8 weeks were required if using a more complicated correction such as machine learning. Duvall et al. (2021) recommended 30 d to develop a calibration. The goal of any calibration for low-cost particulate matter sensors should be to observe all variations in particle size, composition, and concentration. Thorough calibrations may require lengthy periods of collocation to capture all variations in these parameters and in the range of concentrations observed. An analysis of linear fit parameters over time such as in Fig. 6 can be helpful in determining if the full range of situations has been observed and captured by the resulting calibration model.
Understanding the behavior of low-cost particulate matter sensors over time
is important in planning for the amount of data required to calibrate the
sensor. The results presented in this study are useful in analyzing the
strengths and weaknesses of different calibration methods. Stanton et al. (2018) outlined four methods that could be used to capture sensor and reference measurements for calibration.
The routine collocation method is a useful starting point in any sensor network. A period of side-by-side sensor and reference data captures the slope, intercept, and typical error that is associated with sensor measurements. Figure 6 suggests that this method will have mixed results if calibrating over short time periods but can be reliable given enough time to capture all variations in slope, concentration, and residual standard error. The method does not capture the location bias that may be observed if sensors respond differently to the particles at the sensor location compared to those at the calibration location. Despite these limitations, Fig. 6 suggests that this method can likely improve sensor data if utilized appropriately.
The precision between sensors (see Fig. 2) suggests there is potential for a routine collocation method to greatly improve data, though there may be difficulties in accounting for location bias. The question remains of how much distance can be allowed between the reference sensor and the field-deployed network sensors before this method fails. This allowable distance will depend on how quickly particle properties change between the reference instrument and the network of sensors.
A mobile approach to calibration is attractive in that it can capture variations in sensor response at different locations, with different types of particles, compared to a fixed reference location. The challenge with this approach is capturing enough data to thoroughly characterize the slope and prediction interval of sensor measurements. Figure 6 shows that even a full day of 1 h data (24 points) can lead to unusual estimates of the slope between sensor and reference measurements, and a single-measurement spot check against a mobile reference would be even less reliable. Such spot checks might slowly improve an existing calibration over time, but whether that improvement happens within a reasonable timeframe would require further exploration.
The “golden sensor” method relies on a calibrated sensor to calibrate other sensors in the network. The challenge with this approach is that the true concentration is no longer known once the golden sensor leaves the reference instrument; its measurements are then only estimates with an associated 95 % prediction interval. The level of uncertainty in those measurements (see Fig. 4) would make it very difficult to pass a calibration from one sensor to another without greatly increasing uncertainty.
Regardless of the choice of calibration method, it is important to consider variations in sensor data when conducting a calibration and when interpreting future sensor results. As Fig. 6 shows, calibrations may take weeks or longer in order to capture all variations in the external factors that influence sensor response. If these variations are not captured correctly, then the resulting calibration may miss important changes to sensor response that occur due to changing environmental variables. Reliance on a calibration that does not account for all variance in measurements may make sensor data less reliable. A calibration that also provides a PI ensures that future results can be interpreted with statistical rigor.
Low-cost sensors have potential to provide a better understanding of temporal and spatial trends of pollutants like particulate matter. Evaluations of low-cost particulate matter sensors alongside reference instruments in Bartlesville, Oklahoma, have been used to identify methods that ensure more consistent evaluation and interpretation of sensor data.
Bias in sensor measurements varied over time but was very closely correlated
from one sensor to the next (see Fig. 2). Because bias is so closely
correlated, sensors can be deployed in pairs as a simple way to identify
erroneous measurements (see Fig. 1). Finding ways to efficiently and
effectively determine sensor performance is critical as sensors become more
widely adopted. Two of the most popular evaluation metrics, the correlation (R²) and the parameters of a linear fit, depend strongly on the circumstances of the evaluation, such as averaging time, reference instrument, and concentration range; a prediction interval offers a more robust way to interpret sensor data.
The raw data from this study are available upon request from the corresponding author.
The contact author has declared that there are no competing interests.
Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
The author thanks John Gingerich and Irby Bailey for their assistance in carrying out the experiments described in this study.
This paper was edited by Pierre Herckes and reviewed by three anonymous referees.