A dense network of monitoring stations can facilitate assessment of the
intracity spatial distribution and temporal variability of air quality.
However, the cost of implementing such a network can be prohibitive if
traditional high-quality, expensive monitoring systems are used. To this end,
the Real-time Affordable Multi-Pollutant (RAMP) monitor has been developed,
which can measure up to five gases including the criteria pollutant gases
carbon monoxide (CO), nitrogen dioxide (NO2), ozone (O3), and sulfur dioxide (SO2).
Current regulatory methods for assessing urban air quality rely on a small
network of monitoring stations providing highly precise measurements (at a
commensurately high setup and operating cost) of specific air pollutants
(e.g., Snyder et al., 2013). The United States Environmental Protection Agency
(EPA) determines compliance with national air quality standards at the county
level using data collected by local monitoring stations. Many rural counties
have at most a single monitoring site; urban counties may be more densely
instrumented, though not at the neighborhood scale. For instance, the
Allegheny County Health Department (ACHD) maintains a network of 10
monitoring stations which collect continuous and/or 24 h data for the
roughly 2000 km² county.
One approach to increasing the spatial resolution of air quality data is the use of dense networks of low-cost sensor packages. Low-cost monitors are instruments which combine one or more comparatively inexpensive sensors (typically electrochemical or metal oxide sensors) with independent power sources and wireless communication systems. This allows larger numbers of monitors to be deployed at a cost similar to that of a more traditional monitoring network as described above. The general goals of low-cost sensing include supplementing existing regulatory networks, monitoring air quality in areas that have historically lacked such monitoring (for example, in developing countries), and increasing community involvement in air quality monitoring through the provision of sensors and the resulting data to community volunteers, supporting more informed public decision-making and engagement in air quality issues (Snyder et al., 2013; Loh et al., 2017; Turner et al., 2017). Several pilot programs of low-cost sensor network deployment have been attempted in Cambridge, UK (Mead et al., 2013); Imperial Valley, California (Sadighi et al., 2018; English et al., 2017); and Pittsburgh, Pennsylvania (Zimmerman et al., 2018).
There are several trade-offs resulting from the use of low-cost sensors.
These sensors are less precise and sensitive than regulatory-grade
instruments at typical ambient concentrations due to cross-sensitivities to
other pollutants and the dependence of the sensor response on ambient temperature
and humidity (Popoola et al., 2016). These interactions are often nonlinear,
meaning that linear regression models developed under controlled laboratory
conditions are often insufficient to accurately translate the raw sensor
responses into concentration measures (Castell et al., 2017). Due to the
variety of interactions and atmospheric conditions which can affect sensor
performance, covering the range of conditions to which the sensor will be
exposed using laboratory calibrations is difficult. Field calibrations of the
sensors are thus necessary, with the sensors being collocated with highly
accurate regulatory-grade instruments. Various calibration methods that have
been explored include the determination of sensor calibrations from physical
and chemical principles (Masson et al., 2015), higher-dimensional models to
capture nonlinear interactions (Cross et al., 2017), and nonparametric
approaches including artificial neural networks (Spinelle et al., 2015) and
random forests (Zimmerman et al., 2018).
There remain several unanswered questions with respect to the calibration of data collected by low-cost sensors which we seek to answer in this work by examining data collected by almost 70 Real-time Affordable Multi-Pollutant (RAMP) monitors over periods ranging up to 18 months in the city of Pittsburgh, PA, USA. First, although various models have been applied to perform calibrations in different contexts, a thorough comparison on a common set of data of several different forms of calibration models applied to multi-pollutant measurements has yet to be performed. We seek to provide such a comparison and thereby draw robust conclusions about which calibration approaches work best overall and in specific contexts. Second, in previous work with the RAMP monitors and in work with other sensors, unique models have been developed for each sensor. This requires that extensive collocation data be collected for each low-cost sensor, which may not be feasible if large sensor networks are to be deployed. Therefore, it is important to investigate how well a single generalized calibration model can perform when applied across different individual sensors. Third, it is important to quantify the generalizability of models calibrated using data collected at a specific location to other locations across the same city where the sensors might be deployed, which may not share the same ratios of pollutants. This question is examined with several RAMPs that are co-located with regulatory monitors in the city of Pittsburgh, PA, USA. Finally, we seek to address the stability of calibration models over time by tracking changes in performance over the course of a year, and from one year to the next. Overall, we find support for using a generalized model for a network of RAMPs, developed based on local collocation of a subset of RAMPs. This reduces the need to collocate each node of a network, which otherwise can significantly increase network operating costs. 
These results will help guide future deployment efforts for RAMP or similar lower-cost air quality monitors.
The RAMP monitor (Fig. 1) was jointly developed by the Center for Atmospheric
Particle Studies at Carnegie Mellon University (CMU) and a private company,
SenSevere (Pittsburgh, PA). The RAMP package combines a power supply, control
circuitry, cellular network communications capability, a memory card for data
storage, and up to five gas sensors in a weatherproof enclosure. All RAMPs
incorporate a nondispersive infrared (NDIR) CO2 sensor.
Following Zimmerman et al. (2018), RAMP monitors are deployed outdoors in a
parking lot on the CMU campus for calibration based on collocated
monitoring with regulatory-grade instruments. The parking lot
(40
In addition to collocation at the CMU campus, special collocation
deployments of RAMP monitors were performed to allow independent
comparisons between the RAMP monitor data and regulatory monitors at
different locations. One RAMP monitor was collocated with ACHD regulatory
monitors at their Lawrenceville site (40
Various computational models were applied to the sensor readings of the RAMPs
(i.e., the net signal, or raw response minus reference signal, from each
electrochemical gas sensor, together with the outputs of the on-board
temperature and relative humidity sensors).
Models using each of these algorithms were calibrated in three separate
categories. First, individualized RAMP calibration models (iRAMPs)
were created for each RAMP, using only the data collected by gas sensors in
that RAMP and the regulatory monitors. Individualized models are applied only
to data from the RAMP on which they were trained. Second, from these
individualized models, a best individual calibration model (bRAMP)
was chosen, which performed best out of all the individualized models on a
testing data set with respect to correlation (Pearson r).
In all cases, models were calibrated using training data, which consist of
the RAMP monitor data collected during the collocation period (which are
measurements of the input variables, i.e., the signals from the various gas
sensors) together with the readings of the regulatory-grade instruments with
which the RAMP monitor was collocated (which are the targets for the output
variables). These collocation data are down-averaged from their original
sampling rates to 15 min averages, to ensure stability of the trained models
and minimize the effects of noise on the training process. From the
collocation data, eight equally sized, equally spaced time intervals are
selected to serve as training data for the calibration models. The amount of
training data is selected to be either 80 % of the collocation data or
4 weeks of data (corresponding to 2688 15 min averaged data points),
whichever is smaller. The minimum amount of training data is 21 days; if
fewer
are available, no iRAMP model is trained for this RAMP, and thus no
iRAMP model performance can be assessed for it (although bRAMP and gRAMP
models trained on other RAMPs are still applied to this RAMP for testing).
Training data for gRAMP models are obtained in the same way, although in that
case it is the data for the virtual typical RAMP which are divided,
rather than data for individual RAMPs. Any remaining data from the
collocation period are left aside as a separate testing set, on which the
performance of the trained models is evaluated. Note that due to differences
in which RAMPs and/or regulatory-grade instruments were operating at a given
time, training and testing periods are not necessarily the same for all RAMPs
and gases; for example, a certain time may be part of the training period for
the CO model for one RAMP and be part of the testing period for the
Linear regression models represent perhaps the simplest and most common
method for gas sensor calibration and have been used extensively in prior
work (Spinelle et al., 2013, 2015; Zimmerman et al., 2018). A linear
regression model (sometimes called a multi-linear regression model in the
case that there are multiple inputs) describes the output as an affine
function of the inputs. Here, linear functions are used where the sets of
inputs are restricted to the signal of the sensor for the gas in question
along with temperature and RH. For example, the calibrated
measurement of CO from the RAMP is modeled as an affine function of the CO
sensor signal, temperature, and RH.
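Schematically, with $S_{\mathrm{CO}}$ denoting the CO sensor signal, $T$ temperature, $RH$ relative humidity, and $\beta_0,\ldots,\beta_3$ fitted coefficients (notation assumed for illustration, not the paper's own):

```latex
\widehat{\mathrm{CO}} = \beta_0 + \beta_1 S_{\mathrm{CO}} + \beta_2 T + \beta_3 \, RH
```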
In addition to linear regressions, quadratic regressions were also applied.
These are the same as linear regressions but can involve second-order
interactions of the input variables. For example, for CO, a quadratic
regression function would be of the following form:
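Schematically, with $S_{\mathrm{CO}}$ the CO sensor signal, $T$ temperature, $RH$ relative humidity, and $\beta_i$ fitted coefficients (notation assumed for illustration), the quadratic form adds squared and cross terms to the affine form:

```latex
\widehat{\mathrm{CO}} = \beta_0 + \beta_1 S_{\mathrm{CO}} + \beta_2 T + \beta_3 RH
+ \beta_4 S_{\mathrm{CO}}^2 + \beta_5 T^2 + \beta_6 RH^2
+ \beta_7 S_{\mathrm{CO}} T + \beta_8 S_{\mathrm{CO}} RH + \beta_9 T \, RH
```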
The main advantages of linear and quadratic regression models are their ease of implementation and calibration, as well as their interpretability: the relative magnitudes of the regression coefficients correspond to the relative importance of the different inputs in producing the output. The main disadvantage of these models is their inability to capture input–output relationships more complicated than a second-order polynomial. The training and application of linear and quadratic regression models are implemented using custom-written routines for the MATLAB programming language (version R2016b).
Gaussian processes are a form of regression which generalizes the
multivariate Gaussian distribution to infinite dimensionality (Rasmussen and
Williams, 2006). For the purposes of calibration, we make use of a simplified
variant of a Gaussian process model. From the training data, both the signals
of the RAMP monitors and the readings of the regulatory-grade instruments are
transformed such that their distributions during the training period can be
approximately modeled as standard normal distributions. This transformation
is accomplished by means of a piecewise linear transformation, in which the
domain is segmented and for each segment different linear mappings are
applied. After this transformation, an empirical mean vector and covariance
matrix are computed from the transformed training data.
Given a new set of signal measurements from a RAMP, the calibrated
concentration estimate is taken as the conditional mean of the (transformed)
target variable given these measurements under the fitted multivariate normal
distribution, then mapped back to concentration units.
The main advantage of a Gaussian process calibration model of this form is its robustness to incomplete or inaccurate information; for example, if a signal from one gas sensor were missing or corrupted by a large voltage spike, in the former case the missing input could be "filled in" by the correlated measurements of other sensors, while in the latter case estimates would be "reined in" by the more reasonable measures of the other sensors. A major disadvantage of this calibration model is its continued use of what is essentially a linear regression formula, the only difference being in the nonlinear transformation from the original measurement space to the standard normal variable space used by the model. Furthermore, during the calibration process, the ratios of concentration for the pollutants at the collocation site may be learned by the model, making it less likely to predict differing ratios during field deployment. The training and application of Gaussian process calibration models are accomplished using custom-written routines in the MATLAB programming language.
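The conditional-mean idea can be illustrated with a toy sketch (all correlation values here are assumptions for illustration, and this is not the authors' implementation; variables are taken as already transformed to standard normal):

```python
# Illustrative conditional-Gaussian estimate: a zero-mean target variable
# is estimated from two observed sensor signals using assumed covariances.

def conditional_mean_2obs(cov_to, cov_oo, x_obs):
    """Conditional mean of a zero-mean target given two observed signals.

    cov_to: covariances between target and each observed signal.
    cov_oo: 2x2 covariance matrix between the observed signals.
    """
    # Solve cov_oo @ w = x_obs for w by Cramer's rule (2x2 system).
    det = cov_oo[0][0] * cov_oo[1][1] - cov_oo[0][1] * cov_oo[1][0]
    w0 = (x_obs[0] * cov_oo[1][1] - cov_oo[0][1] * x_obs[1]) / det
    w1 = (cov_oo[0][0] * x_obs[1] - cov_oo[1][0] * x_obs[0]) / det
    # Conditional mean = cov_to @ w.
    return cov_to[0] * w0 + cov_to[1] * w1

# Assumed (hypothetical) correlations between the target and two signals.
cov_to = [0.8, 0.3]                    # target vs. each signal
cov_oo = [[1.0, 0.2], [0.2, 1.0]]      # between the two signals
est = conditional_mean_2obs(cov_to, cov_oo, x_obs=[1.0, 0.5])
```

If the second signal were missing, conditioning on the first alone would give 0.8 × 1.0 = 0.8, illustrating how correlated signals can "fill in" for missing or corrupted inputs.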
The clustering model presented here seeks to estimate the outputs
corresponding to new inputs by searching for input–output pairs in the
training data for which the distance (by a predefined distance metric in a
potentially high-dimensional space) between the new input and the training
inputs is minimized and using the average of several outputs corresponding
to these nearby inputs (the nearest neighbors). In a traditional
k-nearest-neighbor approach, this average is taken over a fixed number k of
the closest training points.
A major advantage of this approach is its simplicity and flexibility, allowing it to capture complicated nonlinear input–output relationships by referring to past records of these relationships, rather than attempting to determine the actual pattern which these relationships follow. Such a method can perform very well when the relationships are stable, and when any new input with which the model is presented is similar to at least one of the inputs from the training period. However, as with all nonparametric models, generalizing beyond the training period is difficult, and the model will tend to perform poorly if the nearest neighbors of a new input are in fact quite far away, in terms of the distance metric used, from this input.
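A minimal nearest-neighbor sketch of this idea follows (the toy data and the choice of Euclidean distance are assumptions for illustration, not the paper's clustering variant):

```python
import math

# Minimal k-nearest-neighbor calibration sketch: the estimate for a new
# input is the mean of the outputs of the k closest training inputs
# under Euclidean distance.

def knn_estimate(train_inputs, train_outputs, new_input, k=3):
    dists = [
        (math.dist(x, new_input), y)
        for x, y in zip(train_inputs, train_outputs)
    ]
    dists.sort(key=lambda pair: pair[0])
    nearest = [y for _, y in dists[:k]]
    return sum(nearest) / len(nearest)

# Hypothetical training data: inputs are (sensor signal, temperature)
# pairs; outputs are reference concentrations.
X = [(0.1, 20.0), (0.2, 21.0), (0.9, 25.0), (1.0, 26.0), (0.15, 19.5)]
y = [100.0, 120.0, 480.0, 500.0, 110.0]
est = knn_estimate(X, y, (0.12, 20.2), k=3)  # mean of the three nearest
```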
The artificial neural network model, or simply neural network, is a machine learning paradigm which seeks to replicate, in a simplified manner, the functioning of an animal brain in order to perform tasks in pattern recognition and classification (Aleksander and Morton, 1995). A basic neural network consists of several successive layers of “neurons”. These neurons each receive a weighted combination of inputs from a higher layer (or the signal inputs, if they are in the top layer) and apply a simple but nonlinear function to them, producing a single output which is then fed on into the next layer. By including a variety of possible functions performed by the neurons and appropriately tuning the weights applied to inputs fed from one layer to the next, highly complicated nonlinear transformations can be performed in successive small steps.
Neural networks have been applied to a large number of problems, including the calibration of low-cost gas sensors (Spinelle et al., 2015). Neural networks represent an extremely versatile framework and are able to capture nearly any nonlinear input–output relationship (Hornik, 1991). Unfortunately, to do so may require vast numbers of training data, which it is not always practical to obtain. Calibration of these models is also a time-consuming process, requiring many iterations to tune the weightings applied to values passed from one layer to the next. In this work, neural networks were trained and applied using the “Netlab” toolbox for MATLAB (Nabney, 2002). The network has a single hidden layer with 20 nodes. To limit the computation time needed for model training, the number of allowable iterations of the training algorithm was capped at 10 000; this cap was typically reached during the training.
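The per-neuron computation described above can be sketched as a forward pass through a tiny network (the weights and the tanh activation are illustrative choices, not those of the Netlab models used in this work):

```python
import math

# Forward pass of a one-hidden-layer network: each hidden neuron applies
# tanh to a weighted sum of the inputs; the output layer is a weighted
# sum of the hidden activations.

def mlp_forward(x, w_hidden, b_hidden, w_out, b_out):
    hidden = [
        math.tanh(sum(wi * xi for wi, xi in zip(w, x)) + b)
        for w, b in zip(w_hidden, b_hidden)
    ]
    return sum(wo * h for wo, h in zip(w_out, hidden)) + b_out

# Hypothetical weights for a 2-input, 2-hidden-neuron, 1-output network.
y = mlp_forward(
    x=[0.5, -1.0],
    w_hidden=[[1.0, 0.5], [-0.5, 1.0]],
    b_hidden=[0.0, 0.1],
    w_out=[2.0, -1.0],
    b_out=0.5,
)
```

Training consists of iteratively tuning these weights and biases so the network output matches the reference concentrations; that step is omitted here.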
A random forest model is a machine learning method which makes use of a large number of decision “trees”. These trees are hierarchical sets of rules which group input variables based on thresholding (e.g., the third input variable is above or below a given value). The thresholds used for these rules as well as the inputs they are applied to and the order in which they are applied are calibrated during training. The final groupings of input variables from the training data, located at the end or “leaves” of the branching decision tree, are then associated with the mean values of the output variables for this group (similar to a clustering model). For estimating an output given a new set of inputs, each decision tree within the random forest applies its sequence of rules to assign the new data to a specific leaf, and outputs the value associated with that leaf. The output of the random forest is the average of the outputs of each of its trees.
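A single decision tree of the kind described can be sketched as a hierarchy of threshold rules ending in leaf means (the thresholds and leaf values below are hypothetical):

```python
# Toy decision-tree sketch: threshold rules route an input to a leaf,
# whose value is the mean of the training outputs that landed there.

def tiny_tree(signal, temperature):
    if signal < 0.5:
        # Low-signal branch, split again on temperature.
        return 105.0 if temperature < 20.0 else 125.0   # leaf means
    # High-signal branch.
    return 480.0 if temperature < 20.0 else 510.0

# A forest averages many such trees; with a single tree the forest
# output equals the tree output.
est = tiny_tree(0.3, 22.0)
```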
A primary shortcoming of the random forest model (which it shares with other
nonparametric methods) is its inability to generalize beyond the range of the
training data set, i.e., outputs of a random forest model for new data can
only be within the range of the values included as part of the training data.
For this reason, the standard random forest model was expanded into a hybrid
random forest–linear regression model. The use of this approach for RAMP data
was suggested by Zimmerman et al. (2018). Furthermore, it is similar to the
approach of Hagan et al. (2018), who hybridize nearest-neighbor and linear
regression models. In this modified model, a random forest is first applied
to new data to estimate the concentrations of the various measured
pollutants; the concentration of a given gas, such as CO, is then estimated
via a linear regression taking these random forest concentration estimates as
its inputs.
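A minimal sketch of the hybrid idea follows (the stand-in "forest" values and all numbers are hypothetical; in the actual model a trained random forest produces the first-pass estimates):

```python
# Hybrid sketch: first-pass "forest" estimates are refined by a
# least-squares linear fit against reference values; the linear step
# can extrapolate beyond the range of the training outputs.

def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for one predictor."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    return slope, my - slope * mx

# Hypothetical first-pass forest estimates vs. reference concentrations.
forest_est = [100.0, 150.0, 200.0, 250.0]
reference = [110.0, 160.0, 210.0, 260.0]
slope, intercept = fit_line(forest_est, reference)

# The linear step can produce values beyond the training output range,
# which a pure random forest cannot.
corrected = slope * 300.0 + intercept
```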
In the following section, the performance of the calibration models in
translating sensor signals to concentration estimates is assessed in several
ways. It should be noted that the metrics presented here are applied only
for testing data, i.e., data which were not used to build the calibration
models. Model performance on the training data is expected to be higher, and
thus less representative of the true capability of the model. The estimation
bias is assessed as the mean normalized bias (MNB), the average difference
between the estimated and actual values, divided by the mean of the actual
values.
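In symbols, for $n$ paired estimates $\hat{y}_i$ and reference values $y_i$ (notation assumed for illustration):

```latex
\mathrm{MNB} = \frac{\tfrac{1}{n}\sum_{i=1}^{n} \left(\hat{y}_i - y_i\right)}{\tfrac{1}{n}\sum_{i=1}^{n} y_i}
```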
In addition to the above metrics, EPA methods for evaluating precision and
bias errors are used as outlined in Camalier et al. (2007). To summarize, the
precision error is evaluated as
Assigned lower limits for censoring small measurement values.
Using the EPA precision and bias calculations allows for these values to be compared against performance guidelines for various sensing applications, as presented in Williams et al. (2014) and listed in Table 2. For the RAMP monitors, a primary goal is to achieve data quality sufficient for hotspot identification and characterization (Tier II) or personal exposure monitoring (Tier IV), which requires that both precision and error bias metrics be below 30 %. A supplemental goal is to achieve performance sufficient for supplemental monitoring (Tier III), requiring precision and bias metrics below 20 %.
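The quoted thresholds can be encoded directly (only the 20 % and 30 % limits mentioned above are represented; see Williams et al., 2014, for the full tier definitions):

```python
# Check precision and bias percentages against a tier's common limit.

def meets_tier(precision_pct, bias_pct, limit_pct):
    """True if both metrics fall below the tier's limit."""
    return precision_pct < limit_pct and bias_pct < limit_pct

# Hypothetical metric values for illustration.
hotspot_ok = meets_tier(25.0, 28.0, limit_pct=30.0)       # Tier II/IV
supplemental_ok = meets_tier(18.0, 15.0, limit_pct=20.0)  # Tier III
strict_fail = meets_tier(25.0, 28.0, limit_pct=20.0)      # fails Tier III
```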
EPA air quality sensor performance guidelines for various applications. Reproduced from Williams et al. (2014).
In this section, we examine the performance of the RAMP gas sensors and the
various calibration models applied to their data. We will focus our attention
on the CO, NO,
Figure 2 presents a comparison of the performance of various calibration
models applied to testing data collected at the CMU site during 2017. As
described in Sect. 2.3, collocation data are divided into training and
testing sets, with the former (always being between 3 and 4 weeks in
total duration) used for model development and the latter used to test the
developed model using the assessment metrics described in Sect. 2.4, as
presented in Fig. 2. All models in the figure are of the iRAMP category,
being developed using only data collected by a single RAMP and the collocated
regulatory-grade instruments. In the figure, squares indicate the median
performance across all RAMPs for each performance metric, and the error bars
span from the 25th to 75th percentiles of each metric across the RAMPs.
For CO, 48 iRAMP models are compared; for NO, 19 models; for
Comparative performance of various individualized RAMP calibration models across gases measured by the RAMPs. Models are trained and tested on distinct subsets of collocation data collected at the CMU site during 2017; performance shown is based on the testing data set only. Proximity to the lower-left corner of each figure indicates better performance. Note the differing vertical axis scales.
Performance of iRAMP calibration models with respect to EPA air quality sensor performance guidelines as assessed at the CMU site. Entries in the table denote which models meet the corresponding guidelines for each gas (LR: linear regression; LQR: limited quadratic regression; CQR: complete quadratic regression; GP: Gaussian process; CL: clustering; NN: neural network; HY: hybrid random forest–linear regression).
Typically, several of the model types provide similar performance for a given
gas. For CO and
Durations and ranges of testing and training data at CMU in 2017.
Durations of the training and testing periods are in days. Ranges indicated
are in parts per billion for all gases, degrees Celsius for temperature (
Performance data for iRAMP models at CMU in 2017 (Avg. is the
average; SD is the standard deviation). The “No.” sub-column under
“Model” indicates the total number of iRAMP models developed for each gas.
Slope and
Next, we examine how the performance of the best individual models (bRAMP) and of the general models (gRAMP) applied to all RAMPs compare to the performance of the individualized RAMP (iRAMP) models presented in Sect. 3.1. Evaluation is carried out on the testing data collected at the CMU site in 2017. For simplicity, we restrict ourselves to three models for each gas, chosen from among the better-performing iRAMP models and including at least one parametric and one nonparametric approach. Figure 3 presents these comparisons.
Comparative performance of individualized (iRAMP – square), best individual (bRAMP – diamond), and general (gRAMP – circle) model categories across gases measured by the RAMPs. The modeling algorithms used for each gas correspond to three of the better-performing algorithms identified among the individualized models. Models are trained and tested on distinct subsets of collocation data collected at the CMU site during 2017; performance shown is based on the testing data set only. Proximity to the lower-left corner of each figure indicates better performance.
Across all gases and models, iRAMP models tend to perform best, as might be
expected since these models are both trained and applied to data collected by
a single RAMP monitor, and therefore will account for any peculiarities of
individual sensors. Between the bRAMP
models, in which a model is trained using data from a single RAMP and applied
across multiple RAMPs, and gRAMP models, which are trained on data from a
virtual typical RAMP (composed of the median signal from several RAMPs) and
then applied across other RAMPs, it is difficult to say which approach would
be better based on these results, as they vary by gas as well as by modeling
approach. For parametric models (i.e., linear and quadratic regression) the
bRAMP and gRAMP versions typically have a similar performance, although there
is less variability in performance for the gRAMP versions. For nonparametric
models (i.e., neural network and hybrid models), performance of bRAMP
versions is typically better than the gRAMP versions, although in the case of
Figure 4 depicts the performance of calibration models for two RAMP monitors deployed at two EPA monitoring stations operated by the ACHD (one monitor is deployed to each station). Filled markers indicate the performance of the models at these sites, while hollow markers indicate the 2017 testing period performance of the corresponding RAMP when it was at the CMU site for comparison. For each gas type, different calibration models are used, chosen from among the models depicted in Fig. 3. Models trained at the CMU site (as presented in previous sections) are used to correct data collected by the RAMP monitor at the station. Note that all data collected at either deployment site are treated as testing data, and that no data from these other sites are used to calibrate the models. Also note that not all gases monitored by RAMPs are monitored by the stations, which is why only one station may appear in each plot.
Comparative performance of individual and general models for RAMPs
deployed to ACHD monitoring stations (filled markers), compared to the
performance of the same RAMPs at the CMU site (hollow markers). For example,
a filled green marker indicates the performance of a RAMP at the
Lawrenceville site, while a hollow green marker indicates the performance of
that same RAMP when it was at the CMU site. The modeling algorithm used for
each gas corresponds to the most consistent algorithm identified among the
models depicted in Fig. 3: limited quadratic regression (LQR) for CO, neural
network (NN) for NO, and hybrid random forest–linear regression models (HY)
for
Overall, there tends to be a change in model performance at either of the
deployment sites compared to the CMU site. This is to be expected to some
degree, as the concentration range and mixture of gases (especially at the
Parkway East site, which is located next to a major highway) can be
different at a new site (where the model was not trained), and thus
cross-sensitivities of the sensors may be affected. These differences appear
to be greatest for CO, with performance being
To evaluate the performance of these sensors in a different way, EPA-style
precision and bias metrics are provided in Fig. 5. Only CO,
Comparative performance of individual and general models for RAMPs deployed to ACHD monitoring stations using EPA performance criteria. Dotted lines indicate the outer limits of each performance tier. Performance shown for the CMU site is based on performance across all RAMPs at that site based on testing data only. Proximity to the lower-left corner of each figure indicates better performance.
We now examine the change in performance of calibration models over time. Figure 6 shows the performance of models developed based on data collected at the CMU site in both 2016 and 2017 and tested on data collected in either of these years.
Comparative performance of models in 2016 and 2017. The “models” year indicates the year from which training data collected at the CMU site are used to calibrate the model; the “data” year indicates the year from which testing data collected at the CMU site are used to evaluate the model.
Training and testing data for 2017 represent the same training and testing periods as used for previous results. For 2016, training and testing data are divided using the same procedure as was applied for 2017 data, as discussed in Sect. 2.3. For example, the results for “2016 data, 2017 models” represent the performance of models calibrated using the training data subset of the 2017 CMU site data when applied to the testing data subset of the 2016 CMU site data. A change in performance between these two models on data from the same year will indicate the degree to which the models have changed from one year to the next; likewise, a change in performance for the same model applied to data from different years will indicate the degree to which sensor responses have changed over time. Note that NO is omitted here because data to build calibration models for this gas were not collected in 2016. Also note that results presented in the rest of this paper only use data collected in 2017 for model training and evaluation.
A drop in performance when models from one year are applied to data collected
in the next year is consistently observed for all models and gases, with
Additionally, calibration model performance was assessed as a function of
averaging time. Note that the calibration models discussed in this paper are
developed using RAMP data averaged over 15 min intervals, as discussed in
Sect. 2.3. However, these models may be applied to raw RAMP signals averaged
over longer or shorter time periods. Furthermore, the calibrated data can
also be averaged over different periods. To investigate the effects of
averaging time on calibration model performance, we assess the performance of
RAMPs calibrated with gRAMP models for CO,
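The down-averaging step referenced here (Sect. 2.3) can be sketched as follows (the bin size and sample data are illustrative):

```python
from collections import defaultdict

# Down-average raw samples into 15 min bins: samples are
# (seconds-since-start, value) pairs; each bin's mean is reported.

def average_15min(samples):
    bins = defaultdict(list)
    for t, v in samples:
        bins[int(t // 900)].append(v)   # 900 s = 15 min
    return {b: sum(vs) / len(vs) for b, vs in sorted(bins.items())}

# Hypothetical raw readings at 5 and 10 min spacing.
raw = [(0, 100.0), (300, 110.0), (600, 120.0), (900, 200.0), (1500, 220.0)]
averaged = average_15min(raw)  # one mean value per 15 min interval
```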
Finally, we track the performance of RAMPs over time at specific deployment
locations, as depicted in Fig. 7, to evaluate changes in calibration model
field performance over time. This is carried out using three RAMP monitors; one was
deployed at the ACHD Lawrenceville station from January through September of
2017, as well as during November and December 2017. A second RAMP was kept at
the CMU site year-round, where it was collocated with regulatory-grade
instruments intermittently between May and October. The third RAMP was
deployed at the ACHD Parkway East site beginning in November of 2017. The
same gRAMP calibration models as depicted in Fig. 4 (using the training data
collected at the CMU site in 2017) are used; note that the RAMP present at
the CMU site was a part of the training set of RAMPs for the gRAMP model,
while the other two RAMPs were not. Performance of CO,
Tracking the performance of RAMP monitors deployed to ACHD
Lawrenceville, ACHD Parkway East, and CMU over time. Statistics are computed
for each week. Results shown correspond to those of models trained using data
collected at the CMU site during 2017. For CO, the generalized limited
quadratic regression model is used; for
Performance of the calibrated
Based on the results presented in Sect. 3.1, complete quadratic regression
and hybrid models give the best and most consistent performance across all
gases. Of these, the hybrid models, combining the complicated non-polynomial
behaviors of random forest models (capable of capturing unknown sensor
cross-sensitivities) with the generalization performance of parametric linear
models, tend to generalize best for NO,
Overall, in most cases the generic bRAMP and generalized gRAMP calibration
models perform worse than the individualized iRAMP models at the calibration
site, but the decline in performance may be manageable and acceptable
depending on the use case. For example, for
Based on comparisons between the performance of models from one year to the
next, as well as the analysis of changes in performance of the RAMP monitors
collocated with regulatory-grade instruments for long periods, some sensors,
such as the
It can generally be expected that RAMP monitors will at least meet Tier II or
Tier IV EPA performance criteria (
In comparing different methods for the calibration of electrochemical sensor
data, it was found that in some cases, e.g., for CO, simple parametric
models, such as quadratic functions of a limited subset of the available
inputs, were sufficient to transform the signals to concentration estimates
with a reasonable degree of accuracy. For other gases, e.g.,
Although there is a reduction in performance as a result of not using individualized monitor calibration models when these are calibrated and tested at the same location, the use of a single calibration model across multiple monitors, representing either the best of available individualized models or a general model developed for a typical monitor, tends to give more consistent generalization performance when tested at a new site. This suggests that variability in the responses of individual sensors for the same gas when exposed to the same conditions (such as would be accounted for when developing separate calibration models for each monitor) tends to be lower than the variability in the response of a single sensor when exposed to different ambient environmental conditions and a different mixture of gases (such as is experienced when the monitor is moved to a new site). Models that are developed and/or applied across multiple monitors will avoid “overfitting” to the specific response characteristics of a single sensor in a single environment. Thus, considering that it is impractical to perform a collocation for each monitor at the location where it is to be deployed, there is little benefit to developing individualized calibration models for each monitor when their performance will be similar to (if not worse than) that of a generalized model when the monitor is moved to another location.
There are several additional qualitative advantages to using generalized models. First, the calibration effort is reduced, since not every monitor needs to be present for collocation and separate models do not have to be created for each one. For example, while only 48 RAMPs had sufficient data at the CMU site in 2017 to calibrate individualized CO models, general models can be calibrated and applied for all 68 RAMPs that were at the CMU site during this period, as well as for additional RAMPs that were never collocated there but have the same gas sensors installed. Second, collocation data collected from multiple monitors at different sites can be combined when creating a generalized model, whereas individualized models would require each monitor to be present at each collocation site. A wider range of ambient gas concentrations can therefore be reflected in the training data, allowing for better generalization. Finally, generalized models are robust against noise in individual sensors, which can mis-calibrate an individualized model but is less likely to do so when data from multiple sensors are averaged. Therefore, for future deployments, generalized models applicable across all monitors should be used.
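The benefit of pooling collocation data across monitors and sites can be sketched with synthetic data. In this hypothetical Python/NumPy example (illustrative only; the study's code is MATLAB, and all gains, concentration ranges, and noise levels below are invented), an individualized quadratic model trained on one monitor at a site with a narrow concentration range is compared against a generalized model that pools data from a second monitor with a slightly different sensor gain at a site spanning a wider range; both are then evaluated at concentrations beyond the first monitor's training range:

```python
import numpy as np

def synth(rng, n, gain, lo, hi):
    """Hypothetical collocation data: reference CO (ppb) and a raw
    sensor signal with a mildly nonlinear response plus noise."""
    ref = rng.uniform(lo, hi, n)
    signal = gain * ref + 2e-4 * ref**2 + rng.normal(0.0, 5.0, n)
    return signal, ref

def fit_quadratic(signal, ref):
    """Least-squares fit of ref ~ b0 + b1*s + b2*s^2."""
    X = np.column_stack([np.ones_like(signal), signal, signal**2])
    beta, *_ = np.linalg.lstsq(X, ref, rcond=None)
    return beta

def mae(beta, signal, ref):
    X = np.column_stack([np.ones_like(signal), signal, signal**2])
    return float(np.mean(np.abs(X @ beta - ref)))

def trial(seed):
    rng = np.random.default_rng(seed)
    # Monitor A is collocated where CO spans only a narrow range;
    # monitor B (slightly different gain) sees a much wider range.
    s_a, r_a = synth(rng, 100, 1.00, 100.0, 300.0)
    s_b, r_b = synth(rng, 100, 1.03, 100.0, 1000.0)
    beta_ind = fit_quadratic(s_a, r_a)                    # individualized
    beta_gen = fit_quadratic(np.concatenate([s_a, s_b]),
                             np.concatenate([r_a, r_b]))  # generalized
    # Monitor A is then deployed where concentrations exceed its
    # original training range.
    s_t, r_t = synth(rng, 200, 1.00, 100.0, 1000.0)
    return mae(beta_ind, s_t, r_t), mae(beta_gen, s_t, r_t)

results = np.array([trial(seed) for seed in range(20)])
print("individualized (narrow-range training) MAE:", results[:, 0].mean())
print("generalized (pooled wide-range training) MAE:", results[:, 1].mean())
```

In this synthetic setting, the individualized model extrapolates poorly outside its narrow training range, while the pooled model's small gain-mismatch bias is the lesser penalty, mirroring the bias-variance trade-off described above.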
For long-term deployments, it is recommended that model performance be re-evaluated periodically (every 6 months to 1 year, using limited collocation campaigns with a subset of the deployed sensors) and that development of new calibration models be contingent on the outcomes of these re-evaluations. This recommendation stems from the noticeable change in performance when models trained on one year of data were used to process data collected in the subsequent year. If generalized models are used, model development can be performed using only a representative subset of monitors collecting data across a range of temperature and humidity conditions, allowing most monitors to remain deployed in the field (although periodic “sanity checks” should be made for field-deployed monitors to ensure all on-board sensors are operating properly). Another option is to maintain a few “gold standard” monitors collocated with regulatory-grade instruments year-round and to use these monitors to develop the generalized models applied to all field-deployed monitors over the same period. Determining how many monitors are necessary to develop a sufficiently robust generalized model is a topic of ongoing work.
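A periodic re-evaluation of this kind could be as simple as comparing calibrated monitor output against a regulatory-grade reference over a limited collocation window and flagging model redevelopment when the error exceeds a chosen threshold. The sketch below (Python with synthetic data; the function name and the 50 ppb threshold are arbitrary placeholders, not values from the study) illustrates the idea:

```python
import numpy as np

def needs_recalibration(calibrated, reference, mae_threshold=50.0):
    """Flag model redevelopment if the mean absolute error between
    calibrated monitor output and a regulatory-grade reference,
    collected during a limited re-collocation window, exceeds a
    chosen threshold (in the pollutant's units, e.g., ppb)."""
    err = np.abs(np.asarray(calibrated) - np.asarray(reference))
    mae = float(err.mean())
    return mae > mae_threshold, mae

# Synthetic example: a calibration whose output has drifted by +60 ppb.
rng = np.random.default_rng(2)
reference = rng.uniform(100.0, 500.0, 200)  # reference CO, ppb
drifted = reference + 60.0 + rng.normal(0.0, 10.0, 200)

flag, mae = needs_recalibration(drifted, reference)
print(f"recalibrate: {flag} (MAE = {mae:.1f} ppb)")
```

In practice the threshold and error metric would be chosen to match the relevant performance criteria, and the check would be run per gas and per monitor.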
All data (reference monitor data, RAMP raw signal data, and
calibrated RAMP data for both training and testing) and code (in the MATLAB
language) needed to recreate the results discussed here are provided online at
The supplement related to this article is available online at:
CM analyzed data collected by the RAMP sensors, implemented the calibration models, and primarily wrote the paper. CM, RT, AH, and SPNK deployed, maintained, and collected data from RAMP sensors and maintained regulatory-grade instruments at the CMU site. NZ provided data from RAMP sensors and regulatory-grade instruments from the CMU site collected in 2016. NZ and LBK aided with the design of random forest and neural network calibration models, respectively, and provided general advice on the paper. AAP and RS conceptualized the research, acquired funding, provided general guidance to the research, and assisted in writing and revising the paper.
The authors declare that they have no conflict of interest.
Funding for this study was provided by the Environmental Protection Agency (assistance agreement no. 83628601) and the Heinz Endowment Fund (grant nos. E2375 and E3145). The authors thank Aja Ellis, Provat K. Saha, and S. Rose Eilenberg for their assistance with deploying and maintaining the RAMP network and Ellis S. Robinson for assistance with the CMU collocation site. The authors also thank the ACHD, including Darrel Stern and Daniel Nadzam, for their cooperation and assistance with sensor deployments.

Edited by: Jun Wang
Reviewed by: two anonymous referees