1 Gaussian Process regression model for dynamically calibrating a wireless low-cost particulate matter sensor network in Delhi

Wireless low-cost particulate matter sensor networks (WLPMSNs) are transforming air quality monitoring by providing PM information at finer spatial and temporal resolutions; however, large-scale WLPMSN calibration and maintenance remain a challenge because the manual labor involved in initial calibration by collocation and routine recalibration is intensive, the transferability of the calibration models determined from initial collocation to new deployment sites is questionable as calibration factors typically vary with urban heterogeneity of operating conditions and aerosol optical 15 properties, and the stability of low-cost sensors can develop drift or degrade over time. This study presents a simultaneous Gaussian Process regression (GPR) and simple linear regression pipeline to calibrate and monitor dense WLPMSNs on the fly by leveraging all available reference monitors across an area without resorting to pre-deployment collocation calibration. We evaluated our method for Delhi where the PM2.5 measurements of all 22 regulatory reference and 10 low-cost nodes were available in 59 valid days from January 1, 2018 to March 31, 2018 (PM2.5 averaged 138 ± 31 μg m-3 among 22 reference 20 stations) using a leave-one-out cross-validation (CV) over the 22 reference nodes. We showed that our approach can achieve an overall 30 % prediction error (RMSE: 33 μg m-3) at a 24 h scale and is robust as underscored by the small variability in the GPR model parameters and in the model-produced calibration factors for the low-cost nodes among the 22-fold CV. We revealed that the accuracy of our calibrations depends on the degree of homogeneity of PM concentrations, and decreases with increasing local source contributions. As by-products of dynamic calibration, our algorithm can be adapted for 25 automated large-scale WLPMSN monitoring as simulations proved its capability of differentiating malfunctioning or singular low-cost nodes within a network via model-generated calibration factors with the aberrant nodes having slopes close to 0 and intercepts close to the global mean of true PM2.5 and of tracking the drift of low-cost nodes accurately within 4 % error for all the simulation scenarios. The simulation results showed that ~20 reference stations are optimum for our solution in Delhi and confirmed that low-cost nodes can extend the spatial precision of a network by decreasing the extent of pure 30 interpolation among only reference stations. Our solution has substantial implications in reducing the amount of manual labor for the calibration and surveillance of extensive WLPMSNs, improving the spatial comprehensiveness of PM evaluation, and enhancing the accuracy of WLPMSNs. Atmos. Meas. Tech. Discuss., https://doi.org/10.5194/amt-2019-55 Manuscript under review for journal Atmos. Meas. Tech. Discussion started: 1 March 2019 c © Author(s) 2019. CC BY 4.0 License.


Introduction
Low-cost air quality (AQ) sensors that report high time resolution data (e.g., ≤ 1 h) in near real time offer excellent potential for supplementing existing regulatory AQ monitoring networks by providing enhanced estimates of the spatial and temporal variabilities of air pollutants (Snyder et al., 2013).Certain low-cost particulate matter (PM) sensors demonstrated satisfactory performance benchmarked against Federal Equivalent Methods (FEMs) or research-grade instruments in some previous field studies (Holstius et al., 2014;Gao et al., 2015;SCAQMD, 2015a-b;Jiao et al., 2016;Kelly et al., 2017;Mukherjee et al., 2017;SCAQMD, 2017a-c;Crilley et al., 2018;Feinberg et al., 2018;Johnson et al., 2018;Zheng et al., 2018).Application-wise, low-cost PM sensors have had success in identifying urban fine particle (PM2.5, with a diameter of 2.5 µm and smaller) hotspots in Xi'an, China (Gao et al., 2015), mapping urban air quality with additional dispersion model information in Oslo, Norway (Schneider et al., 2017), monitoring smoke from prescribed fire in Colorado, US (Kelleher et al., 2018), measuring a traveler's exposure to PM2.5 in various microenvironments in Southeast Asia (Ozler et al., 2018), and building up a detailed city-wide temporal and spatial indoor PM2.5 exposure profile in Beijing, China (Zuo et al., 2018).
On the down side, researchers have been plagued by calibration-related issues since their emergence.One common brute force solution is initial calibration by collocation with reference analyzers before field deployment and follow-up routine recalibration.Yet, the transferability of these pre-determined calibrations at collocation sites to new deployment sites is questionable as calibration factors typically vary with operating conditions such as PM mass concentrations, relative humidity, temperature, and aerosol optical properties (Holstius et al., 2014;Austin et al., 2015;Wang et al., 2015;Lewis and Edwards, 2016;Crilley et al., 2018;Jayaratne et al., 2018;Zheng et al., 2018).Complicating this further, the pre-generated calibration curves may only apply for a short term as the stability of low-cost sensors can develop drift or degrade over time (Lewis and Edwards, 2016;Jiao et al., 2016;Hagler et al., 2018).Routine recalibrations which require frequent transit of the deployed sensors between the field and the reference sites are not only too labor intensive for a large-scale network but also still cannot address the impact of urban heterogeneity of ambient conditions on calibration models (Kizel et al., 2018).
As such, calibrating sensors on-the-fly while they are deployed in the field is highly desirable.Takruri et al. (2009) showed that the Interacting Multiple Model (IMM) algorithm combined with the Support Vector Regression (SVR)-Unscented Kalman Filter (UKF) can automatically and successfully detect and correct low-cost sensor measurement errors in the field; however, the implementation of this algorithm still requires pre-deployment calibrations.Fishbain and Moreno-Centeno (2016) designed a self-calibration strategy for low-cost nodes with no need for collocation by exploiting the raw signal differences between all possible pairs of nodes.The learned calibrated measurements are the vectors whose pairwise differences are closest in normalized projected Cook-Kress (NPCK) distance to the corresponding pairwise raw signal differences given all possible pairs over all time steps.However, this strategy did not include reference measurements in the self-calibration procedure, and therefore the tuned measurements were still essentially raw signals (although instrument noise was dampened).An alternative calibration method involves chain calibration of the low-cost nodes in the field with only the first node calibrated by collocation with reference analyzers and the remaining nodes calibrated sequentially by their respective previous node along the chain (Kizel et al., 2018).While this node-to-node calibration procedure proved its merits in reducing collocation burden and data loss during calibration/relocation/recalibration and accommodating the influence of urban heterogeneity on calibration models, it is only suitable for relatively small networks because calibration errors propagate through chains and can inflate toward the end of a long chain (Kizel et al., 2018).
In this paper, we introduce a simultaneous Gaussian Process regression (GPR) and simple linear regression pipeline to calibrate PM2.5 readings of any number of low-cost PM sensors on the fly in the field without resorting to pre-deployment collocation calibration by leveraging all available reference monitors across an area (e.g., Delhi N 28.6139, E 77.2089).The proposed strategy is theoretically sound since the GPR (also known as Kriging) can capture the spatial covariance inherent in the data and has been widely used for spatial data interpolation (e.g., Holdaway, 1996;Di et al., 2016;Schneider et al., 2017) and the simple linear regression calibration can adjust for disagreements between low-cost sensor and reference instrument measurements and lead to more consistent spatial interpolation.This paper focuses on 1) quantifying experimentally the daily performance of our dynamic calibration model in Delhi during winter season based on model prediction accuracy on the holdout reference nodes during leave-one-out cross-validations (CV) and low-cost node calibration accuracy; 2) revealing the potential pitfalls of employing a dynamic calibration algorithm; 3) demonstrating the ability of our algorithm to auto-detect the faulty and auto-correct drift nodes within a network via computational simulation, therefore the practicality of adapting our algorithm for automated large-scale sensor network monitoring; and 4) studying computationally the optimal number of reference stations across Delhi to support our technique and the usefulness of low-cost sensors for extending the spatial precision of a sensor network.To the best of our knowledge, this is the first study to apply such a non-static calibration technique to a wireless low-cost PM sensor network in a heavily polluted region such as India and the first to present methods of auto-monitoring dense AQ sensor networks.

Low-cost node configuration
The low-cost packages used in the present study (dubbed "Atmos") shown in Fig. 1a  of PMS7003 are similar to its PMS1003, PMS3003, and PMS5003 counterparts and have been extensively discussed in previous studies (Kelly et al., 2017;Zheng et al., 2018;and Sayahi et al., 2018, respectively).The inlet and outlet of PMS7003 were aligned with two slots on the box to ensure unrestricted airflow into the sensor.The PM and meteorology data are read over the serial TTL interface every three seconds, aggregated every 1 min in memory on the device, before being transmitted by a Quectel M66 GPRS module through the mobile 2G cellular network to an online database.The Atmos can also store the data on a local microSD card in case of transmission failure.Users have the option to configure the frequencies of data transfer and logging to 5, 10, 15, 30, and 60 minutes via a press key on the device and are able to view the settings on the LCD.All components of the Atmos monitors (key parts are labelled in Fig. 1b) are integrated to a customdesigned printed circuit board (PCB) which is controlled by a STMicroelectronics microcontroller (model STM32F051).
Each Atmos was continuously powered up by a 5V 2A USB wall charger but also comes with a fail-safe 3.7V-2600 mAh rechargeable Li-ion battery in case of power outage that can last up to 10 hours at a 1 min transmission frequency and 20 hours at a 5 min frequency.
The Atmos network's server architecture was also developed by Respirer Living Sciences and built on the following opensource components: KairosDB as our primary fast scalable time series database built on Apache Cassandra, custom-made Java libraries for ingesting data and for providing XML/JSON/CSV-based access to aggregated time series data, HTML5/JavaScript for creating the front-end dashboard, and LeafletJS for visualizing Atmos networks on maps.

Reference PM2.5 data
Hourly ground-level PM2.5 concentrations from 21 monitoring stations operated by the Central Pollution Control Board (CPCB), the Delhi Pollution Control Committee (DPCC), the India Meteorological Department (IMD), and the Uttar Pradesh and Haryana States Pollution Control Boards (SPCBs) (https://app.cpcbccr.com/ccr/#/caaqm-dashboard/caaqmlanding,last access: 18 September 2018) and from one monitoring station operated by the U.S. Embassy in New Delhi (https://www.airnow.gov/index.cfm?action=airnow.global_summary#India$New_Delhi,last access: 18 September 2018) were available in our study domain of Delhi and its satellite cities including Gurgaon, Faridabad, Noida, and Ghaziabad from January 1, 2018 to March 31, 2018 (winter season) and were used as the reference measurements in our Delhi PM sensor network.The topographical, climatic, and air quality conditions of Delhi are well documented by Tiwari et al. (2012 and2015) and Gorai et al. (2018).Briefly, Delhi experiences unusually high PM2.5 concentrations over winter season due to a combination of increased biomass burning for heating, shallower boundary layer mixing height, diminished wet scavenging by precipitation, lower wind speed, and trapping of air pollutants by the Himalayan topology.Figure 2 visualizes the spatial distribution of these 22 reference monitors (red icons) and Table 1 lists their latitudes and longitudes.No station of the 22 reference monitors is known for regional background monitoring.The complex local built environment in Delhi arising from the densely and intensively mixed land use (Tiwari, 2002) and the significant contributions to air pollution from all vehicular, industrial (small scale industries and major power plants), commercial (diesel generators and tandoors), and residential (diesel generators and biomass burning) sectors (CPCB, 2009;Gorai et al., 2018) render the PM2.5 concentrations relatively unconnected to the land-use patterns.We removed 104 1 h observations (labeled invalid and missing) from the U.S.
Embassy dataset based on its reported QA/QC (quality assurance/quality control) remarks; however, the same procedure was not applied to the remaining 21 Indian government monitoring stations due to lack of relevant information.GPR requires concurrent measurements of all the stations but certain stations had a high fraction of missing values.Therefore, to maximize the number of complete concurrent observations for modelling in order to significantly increase the model accuracy, we linearly interpolated PM2.5 values for the hours with missing measurements for each station, after which we averaged the hourly data to daily resolution as the model inputs.

Low-cost node PM2.5 data
Hourly uncalibrated PM2.5 measurements from 10 Atmos low-cost nodes across Delhi between January 1, 2018 and March 31, 2018 were downloaded using our custom-designed Application Program Interface (API).Figure 2 shows the sampling locations of these 10 low-cost nodes as blue icons and Table 1 specifies their latitudes and longitudes.In our current study, the factors governing the siting of these nodes consist of the ground contact personnel availability, the resource availability such as strong mobile network signal and 24/7 main power supply, the location physical accessibility, and some other common criteria for sensor deployment (e.g., locations away from major pollution sources, situated in a place where free flow of air is available, and protected from vandalism and extreme weather).Similar to the preprocessing of the reference PM2.5 data, we linearly interpolated the missing hourly PM2.5 for each low-cost node and then aggregated the hourly data at a daily interval.The comparison of initial 1 h PM2.5's completeness with that of after missing data imputation for both reference and low-cost nodes is detailed in Table 1 and the periods over which data were imputed for each site are illustrated in Fig. S1.To remove the prospective outliers such as erroneous surges/nadirs existing in the datasets of the 21 Indian government reference nodes and the 10 low-cost nodes or unreasonable interpolated measurements introduced during handling the missing data, we employed the Local Outlier Factor (LOF) algorithm with 20 neighbors considered (a number that works well in general) to remove a conservative ~10% of the 32-dimensional (22 reference + 10 low-cost nodes) 24 h PM2.5 datasets.LOF is an unsupervised anomaly detection method that assigns each multi-dimensional data point an LOF score, defined as the ratio of the average local density of its k nearest neighboring data points (k = 20 in our study) to its own local density, to measure the relative degree of isolation of the given data point with respect to its neighbors (Breunig, et al., 2000).Normal observations tend to have LOF scores near 1 while outliers significantly larger than 1.The LOF therefore identifies the outliers as those multi-dimensional observations with the top x% (x = 10 in our study) LOF scores.A total of 59 days' PM2.5 measurements common to all 32 nodes in the network were left (see Fig. S1) and used for our model evaluation.

Simultaneous GPR and simple linear regression calibration model
Figure 3 shows the overall schema for the simultaneous GPR and simple linear regression dynamic calibration model.Under the context of not knowing beforehand the true calibration factors for the low-cost nodes, a leave-one-out CV approach (i.e., holding one of the 22 reference nodes out of modelling each run for model predictive performance evaluation) was adopted as a surrogate to estimate our proposed model accuracy of calibrating the low-cost nodes.For each of the 22-fold CV, 31 node locations (denoted Γ = { ' , … ,  *' }) were available, where  , is the latitude and longitude of node i.Let  ,. represent the daily PM2.5 measurement of node i on day t and  .∈ ℝ *' denote the concatenation of the daily PM2.5 measurements recorded by the 31 nodes on day t.Given a finite number of node locations, a Gaussian Process (GP) becomes a Multivariate Gaussian Distribution over the nodes in the form of: where  ∈ ℝ *' represents the mean function (assumed to be  in this study) and  ∈ ℝ *'×*' with Σ ,; = = , ,  ; ; @ represents the covariance function/kernel function.
For simplicity's sake, the kernel function was set to a squared exponential (SE) covariance term to capture the spatiallycorrelated signals coupled with another component to constrain the independent noise: (Rasmussen and Williams, 2006) (2) where  B C , , and  Q C are the model hyperparameters (to be optimized) that control the signal magnitude, characteristic lengthscale, and noise magnitude, respectively.
What separates our method from standard GP applications is the simultaneous incorporation of calibration for the low-cost nodes using a simple linear regression model into the spatial model.Linear regression has previously been shown to be effective at calibrating PM sensors (Zheng et al., 2018).Initially (step two in Fig. 3), each low-cost node was calibrated based on its closest reference node (Eq. 3) to bridge disagreements between low-cost and reference node measurements which led to a more consistent spatial interpolation and a faster convergence during model optimization.After standardizing the PM2.5 measurements for each node by subtracting the mean and scaling to unit variance (i.e., transforming the PM2.5 measurements to have a zero mean and unit variance), a GPR model was fit to all 31 nodes (i.e., 10 initialized low-cost nodes and 21 reference nodes) as described in Eq. ( 4) and step three in Fig. 3. Then the GPR was trained to maximize the log marginal likelihood over all 59 days (Eq.5) using an L-BFGS-B optimizer (Byrd et al., 1994).To avoid bad local minima, several random hyperparameter initializations were tried and the initialization with the best log marginal likelihood was chosen.where  , and  , are the slope and intercept, respectively, of the calibration equation for low-cost node i based on its closest reference node;  , is all the daily PM2.5 measurements of either the initially-calibrated low-cost node i or reference node i; and  . is the concatenation of all 31 nodes' PM2.5 measurements on day t.
where  is a vector of  B C , , and Once the optimum  for the (initial) GPR was found, we used the learned covariance function to find the mean of each lowcost node i's Gaussian Distribution conditional on the remaining 30 nodes within the network (i.e.,  v|w ,. ) on day t as described mathematically in Eq. ( 6)-( 8) and repeatedly did so until all 59 days'  v|w ,. (i.e.,  |  ) were found and then recalibrated that low-cost node i based on the  |  .This procedure is summarized graphically in Fig. 3 step four and was performed iteratively for all low-cost nodes one at a time.The reasoning behind this step is given in the Supplement.A highlevel interpretation of this step is that the target low-cost node is calibrated by being weighted over the remaining nodes within the network and the       K' term computes the weights.In contrast to the inverse distance weighting interpolation which will weight the nodes used for calibration equally if they are equally distant from the target node, the GPR will value sparse information more and lower the importance of redundant information (suppose all the nodes are equally distant from the target node) as shown in Fig. S2.
,. are the covariance between the low-cost node i and itself, the low-cost node i and the remaining 30 nodes, the remaining 30 nodes and the low-cost node i, the remaining 30 nodes and themselves, and the low-cost node i conditional on the remaining 30 nodes and itself, respectively, on day t.
Iterative optimizations alternated between the GPR covariance function and the low-cost node measurements (Fig. 3 steps five and six) until the GPR parameters  converged.The final GPR was used to predict the 59-day PM2.5 measurements of the holdout reference node (Fig. 3 step seven) following the Cholesky decomposition algorithm (Rasmussen and Williams, 2006) with the standardized predictions being transformed back to the original PM2.5 measurement scale at the end.Metrics including root mean square errors (RMSE, Eq. 9) and percent errors defined as RMSE normalized by the average of the true measurements of the holdout reference node in this study (Eq.10) were calculated for each fold and further averaged over all 22 folds to assess the accuracy and sensitivity of our simultaneous GPR and simple linear regression calibration model.
where  , and  ‡ ˆ are the true and model predicted 59 daily PM2.5 measurements of the holdout reference node i.

Assessment of GPR model performance
The optimum values of the GPR model parameters including the signal variance ( B C ), the characteristic length-scale (), and the noise variance ( Q C ) are shown in Fig. S3.The  B C , , and  Q C from the 22-fold leave-one-out CV averaged 0.53 ± 0.02, 97.89 ± 5.47 km, and 0.47 ± 0.01, respectively.The small variability in all the parameters among all the folds indicates that the model is fairly robust to the different combinations of reference nodes.The learned length-scale can be interpreted as the modeled spatial pattern of PM2.5 being relatively consistent within approximately 98 km, suggesting that the optimized model majorly captures a global trend rather than fine-grained local variations.

Accuracy of reference node prediction
We start by showing the accuracy of model prediction on the 22 reference nodes using leave-one-out CV (when the low-cost node measurements were included in our spatial prediction).Without any prior knowledge of the true calibration factors for the low-cost nodes, the holdout reference node prediction accuracy is a statistically sound proxy for estimating how well our technique can calibrate the low-cost nodes.The performance scores (including RMSE and percent error) for each reference station sorted by the 3-month mean PM2.5 in descending order are listed in Table 2.An overall 30 % prediction error (equivalent to an RMSE of 33 µg m -3 ) at a 24 h scale was achieved on the reference nodes following our calibration procedure.Although the technique's performance is decent especially considering the minimal amount of field work involved, its calibration error is nearly 3 times higher than the one of the low-cost nodes that were well calibrated by collocation with an environmental b-attenuation monitors (E-BAM) in our previous study (error: 11 %; RMSE: 13 µg m -3 ) under similar PM2.5 concentrations at the same temporal resolution (Zheng et al., 2018).The suboptimal on-the-fly mapping accuracy is a result of the optimized model's ability to simulate only the global trend well.From a different perspective, the GPR method would have modeled the spatial pattern of PM2.5 in Delhi well had the natural spatial covariance among the nodes not been disturbed by the complex and prevalent local sources there.As a substantiation of the flawed local PM2.5 variation modelling, the reference node mapping accuracy follows a pattern with relatively quality prediction for those nodes whose means close to the global mean (e.g., global mean ± SD as highlighted with shading in Table 2) while poor prediction for the means wide of the global mean (and particularly in the lower end).
It is of particular interest to validate the value of establishing a relatively dense wireless sensor network in Delhi by examining if the addition of the low-cost nodes can truly lend a performance boost to the spatial interpolation among sensor locations.We juxtapose the interpolation performance using the full sensor network (including both the reference and lowcost nodes) with that using only the reference nodes in Fig. 5.In this context, the unnormalized RMSE is less representative than the percent error of the model interpolation performance because of the unequal numbers of overlapping 24 h observations for all the nodes (59 data points) and for only the reference nodes (87 data points).The comparison revealed that the inclusion of the 10 low-cost devices on top of the regulatory grade monitors can reduce mean and median interpolation error by roughly 2 %.While only a marginal improvement at the scale of 10, the outcome hints that denselydeployed low-cost nodes can have great promise of significantly decreasing the amount of pure interpolation among sensor locations, therefore benefitting the spatial precision of a network.We will explore more about the significance of the lowcost nodes for the network performance in Sect.3.3.3.

Accuracy of low-cost node calibration
Next  4b, which stands in marked contrast to the true wide variability across the reference sites (Fig. 4a).In other words, the geostatistical techniques can calibrate the low-cost nodes dynamically, with the important caveat that it is effective only if the degree of urban homogeneity in PM2.5 is high (e.g., the local contributions are as small a fraction of the regional ones as possible or the local contributions are prevalent but of similar magnitudes).Otherwise, quality predictions will only apply for those nodes whose means are close to the global mean.In this study, our MRU and IITD sites are similar to the IITM site from the studies by Tiwari et al. (2012 and2015), which are all on campus and free from major pollution sources and therefore qualified to be regional background sites.The PM2.5 regional background concentration during winter in Delhi was then estimated to be approximately 72 µg m -3 .The global mean of the 22 reference sites was 138 µg m -3 , thus the mean local contribution across Delhi was roughly 66 µg m -3 .Clearly this ~1:1 regional-to-local ratio did not fully support the technique.Alternatively, prior information about urban PM2.5 spatial patterns such as high-spatial-resolution annual average concentration basemap from air pollution dispersion models can dramatically improve the on-the-fly calibration performance by correcting for the concentration range-specific biases (Schneider et al., 2017).

Simulation results
While the exact values of the calibration factors derived from the GPR model fell short of faithfully recovering the original picture of PM2.5 spatiotemporal gradients in Delhi, these values of one low-cost node relative to another in the network (Sect. 3.3.1)or relative to itself over time (Sect.3.3.2) turned out to be useful in facilitating automated large-scale sensor network monitoring.

Simulation of low-cost node failure or under heavy influence of local sources
One way to simulate the conditions of low-cost node failure or under heavy influence of local sources is to replace their true signals with values from random number generators so that the inherent spatial correlations are corrupted.In this study, we simulated how the model-produced calibration factors change when all (10), nine, seven, three, and one of the low-cost nodes within the network malfunction or are subject to strong local disturbance.We have three major observations from evaluating the simulation results (Fig. 8 and Fig. S4).indicate that the GPR model enables automated and streamlined process of instantly spotting any malfunctioning or singular low-cost nodes within a large-scale sensor network.Third, the performance of the GPR model seems to be rather uninfluenced by changing the true signals to random numbers (see Fig. S4, 33 % error rate when all low-cost nodes are random vs. baseline 30 % error rate).One possible explanation is that the prevalent and intricate air pollution sources in Delhi have already dramatically weakened the natural spatial correlations.This means that a significant degree of randomness has already been imposed on the low-cost nodes in Delhi prior to our complete randomness experiment.
Nevertheless, our not so accurate on-the-fly calibration model has created a useful algorithm for supervising large-scale sensor networks in real time as a by-product.

Simulation of low-cost node drift
We further investigated the feasibility of applying the GPR model to track the drift of low-cost nodes accurately over time.
We simulated drift conditions by first setting random percentages of intercept and slope drift, respectively, for each individual low-cost node and for each simulation run.Next, we adjusted the signals of each low-cost node over the entire study period given these randomly selected percentages using Eq. ( 11).Then, we rebuilt a GPR model based on these driftadjusted signals and evaluated if the new model-generated calibration factors matched our expected predetermined percentage drift relative to the true (baseline) calibration factors.11) where   , true intercept ¢ , true slope ¢ , percentage intercept drift ¢ , percentage slope drift ¢ , and  _ are a vector of the true signals, the standard model-derived intercept, the standard model-derived slope, the randomly generated percentage of intercept drift, the randomly generated percentage of slope drift, and a vector of the drift-adjusted signals, respectively, over the full study period for low-cost node i.
The performance of the model for predicting the drift was examined under a variety of scenarios including assuming that all (10), eight, six, four, and two of the low-cost nodes developed various degrees of drift such as significant (11 %-99 %), marginal (1 %-10 %), and a balanced mixture of significant and marginal.The testing results for 10, six, and two low-cost nodes are displayed in Table 3 and those for eight and four nodes are in Table S1.Overall, the model demonstrates excellent drift predictive power with less than 4 % errors for all the simulation scenarios.The model proves to be most accurate (within 1 % error) when low-cost nodes only drifted marginally regardless of the number of nodes drift.In contrast, significant and particularly a mixture of significant and marginal drifts might lead to marginally larger errors.We also notice that the intercept drifts are slightly harder to accurately capture than the slope drifts.Similar to the simulation of low-cost node failure/under strong local impact as described in Sect 3.3.1, the performance of the model for predicting the measurements of the 22 holdout reference nodes across the 22-fold leave-one-out CV was untouched by the drift conditions (see Fig. S5).This unaltered performance can be attributable to the fact that the drift simulations only involve simple linear transformations as shown in Eq. ( 11).The quality drift estimation has therefore presented another convincing case of how useful our original algorithm can be applied to dynamically monitoring dense sensor networks, as a by-product of calibrating low-cost nodes.We can rebuild a model such as every week using a rolling window (to keep the number of observations for model construction roughly unchanged) to assess the drifts in the model space over time.After that, the true calibration factors obtained from the initial collocation with reference instruments prior to deployment can be adjusted accordingly based on the model-estimated drifts.This procedure allows for real-time drift corrections to low-cost node measurements.

Optimal number of reference nodes
Questions remain unsolved are 1) what the optimum or minimum number of reference instruments is to sustain this technique and 2) if the inclusion of low-cost nodes can effectively assist in lowering the technique's calibration/mapping inaccuracy.It is interesting to note that optimizing the model's calibration accuracy can not only directly fulfill the fundamental calibration task, but also help better the sensor network monitoring capability as an added bonus.To address these two outstanding issues, we randomly sampled subsets of all the 22 reference nodes within the network in increments of one node (i.e. from 1 to 21 nodes) and implemented our algorithm with and without incorporating the low-cost nodes, before finally computing the mean percent errors in predicting all the holdout reference nodes.To get the performance scores as close to truth as possible but without incurring excessive computational cost in the meantime, the sampling was repeated 100 times for each subset size.The calibration error in this section was defined as the mean percent errors in predicting all the holdout reference nodes further averaged over 100 simulation runs for each subset size.
Figure 9 describes the 24 h calibration percent error rate of the model as a function of the number of reference stations used for modelling with and without involving the low-cost nodes.The error rates generally decrease as the number of reference instruments increases (full network: from ~40 % with 1 node to ~29 % with 21 nodes; network excluding low-cost nodes: from ~43 % to ~30 %) but are somewhat locally variable and most pronounced when five, seven, and eight reference nodes are simulated.These bumps might simply be the result of five, seven, and eight reference nodes being relatively non-ideal (with regard to their neighboring numbers) for the technique, although the possibility of non-convergence due to the limited 100 simulation runs for each scenario cannot be ruled out.The 19 or 20 nodes emerge as the optimum numbers of reference nodes with the lowest errors of close to 28 %, while 17 to 21 nodes all yield comparably low inaccuracies (all below 30 %).
The pattern discovered in our research shares certain similarities with Schneider et al. (2017) who studied the relationship between the accuracy of using colocation-calibrated low-cost nodes to map urban AQ and the number of simulated low-cost nodes for their urban-scale air pollution dispersion model and kriging-fueled data fusion technique in Oslo, Norway.Both studies indicate that at least roughly 20 nodes are essential to start producing acceptable degree of accuracy.Unlike when the number of simulated reference nodes is favorable for carrying out the technique).And for the comparatively ideal 17-20 nodes, we even observed approximately non-overlapping 95 % confidence intervals, suggesting significantly lower errors are yielded when low-cost nodes are incorporated.The accuracy gains are still relatively minor because of the suboptimal size of the low-cost node network (i.e., 10).We postulate that once the low-cost node network scales up to 100s, the model constructed using the full network information can be more accurate than the one with only the information of reference nodes by considerable margins.

Conclusions
This study introduced a simultaneous GPR and simple linear regression pipeline to calibrate wireless low-cost PM sensor networks (up to any scale) on the fly in the field by capitalizing on all available reference monitors across an area without the requirement of pre-deployment collocation calibration.We evaluated our method for Delhi where 22 reference and 10 lowcost nodes were available from January 1, 2018 to March 31, 2018 (global average of the 3-month mean PM2.5 among 22 reference stations: 138 ± 31 µg m -3 ) using a leave-one-out CV over the 22 reference nodes.We demonstrated that our approach can achieve excellent robustness and decent accuracy, as underscored by the low variability in the GPR model parameters and model-produced calibration factors for low-cost nodes and by an overall 30 % prediction error (equivalent to an RMSE of 33 µg m -3 ) at a 24 h scale, respectively, among the 22-fold CV.Closer investigations into 1) the large model calibration errors (~50 %) at two Atmos regional background sites (3-month mean PM2.5: ~72 µg m -3 ) where our E-BAMs were collocated; 2) the similarly large model prediction errors at the comparatively clean Pusa and Sector 62 reference sites; 3) and the washed-out local variability in the model calibrated low-cost sites revealed that the performance of our technique (and more generally the geostatistical techniques) can calibrate the low-cost nodes dynamically, but effective only if the degree of urban homogeneity in PM2.5 is high (e.g., the local contributions are as small a fraction of the regional ones as possible or the local contributions are prevalent but of similar magnitudes).Otherwise, quality predictions will only apply for those nodes whose means are close to the global mean.Despite our algorithm's non-ideal calibration accuracy for Delhi, it holds the promise of being adapted for automated and streamlined large-scale wireless sensor network monitoring and of significantly reducing the amount of manual labor involved in the surveillance and maintenance.Simulations proved our algorithm's capability of differentiating malfunctioning or singular low-cost nodes within a network and of tracking the drift of low-cost nodes accurately with less than 4 % errors for all the simulation scenarios.Finally, our simulation results Two directions are possible for our future work.The first one is to expand both the longitudinal and the cross-sectional scopes of field studies and examine how well our solution works for more extensive networks in a larger geographical area over longer periods of deployment (when sensors are expected to actually drift, degrade, or malfunction).This enables us to validate the practical use of our method for calibration and surveillance more confidently.The second is to explore the infusion of information about urban PM2.5 spatial patterns such as high-spatial-resolution annual average concentration basemap from air pollution dispersion models (Schneider et al., 2017) into our current algorithm to further improve the onthe-fly calibration performance by correcting for the concentration range-specific biases.
were developed by Respirer Living Sciences (http://atmos.urbansciences.in/,last access: 30 November 2018) and cost ~ USD 300 per unit.The Atmos monitor measures 20.3 cm L × 12.1 cm W × 7.6 cm H, weighs 500 g, and is housed in an IP65 (Ingress Protection rating 65) enclosure with a liquid crystal display (LCD) on the front showing real-time PM mass concentrations and various debugging messages.It includes a Plantower PMS7003 sensor (~ USD 25; dimension: 4.8 cm L × 3.7 cm W × 1.2 cm H) to measure PM1, PM2.5, and PM10 mass concentrations, an Adafruit DHT22 sensor to measure temperature and relative humidity, and an ultra-compact Quectel L80 GPS model to retrieve accurate locations in real time.The operating principle and configuration Atmos.Meas.Tech.Discuss., https://doi.org/10.5194/amt-2019-55Manuscript under review for journal Atmos.Meas.Tech.Discussion started: 1 March 2019 c Author(s) 2019.CC BY 4.0 License.
Figure4apresents the box plot of the daily averaged PM2.5 at each available reference site across Delhi from January 1, 2018 to March 31, 2018.The Vasundhara and DTU sites were the most polluted stations with the PM2.5 averaging 194 ± 104 µg m -3 and 193 ± 90 µg m -3 , respectively.The Pusa and Sector 62 sites had the lowest mean PM2.5, averaging 86 ± 40 µg m -3 and 88 ± 36 µg m -3 , respectively.Spatially, the global average of the 3-month mean PM2.5 of the 22 reference stations was found to be 138 ± 31 µg m -3 .This pronounced spatial variation in mean PM2.5 in Delhi (as reflected by the high SD of 31 µg m -3 ) coupled with the stronger temporal variation for each station even at a 24 h scale (range: 35-104 µg m -3 , see Fig.4a) caused nonuniform calibration performance of the GPR model across Delhi, as detailed in Sect.3.2.
we describe the technique's accuracy of low-cost node calibration.The model-produced calibration factors are shown in Fig.6.The intercepts and slopes for each unique low-cost device varied little among all the 22 CV folds, reiterating the stability of the GPR model.The values of these calibration factors resemble those obtained in the previous field work, with slopes comparable to South Coast Air Quality Management District's evaluations on the Plantower PMS models (SCAQMD, 2017a-c) and intercepts comparable to our Kanpur, India post-monsoon study(Zheng et al., 2018).Two low-cost nodes (i.e., MRU and IITD) were collocated with two E-BAMs throughout the entire study.This allows us to take their model-derived calibration factors and calibrate the corresponding raw values of the low-cost nodes before Atmos.Meas.Tech.Discuss., https://doi.org/10.5194/amt-2019-55Manuscript under review for journal Atmos.Meas.Tech.Discussion started: 1 March 2019 c Author(s) 2019.CC BY 4.0 License.computing the calibration accuracy based on the ground truth (i.e., E-BAM measurements).Figures7a and 7bshow the scatterplots of the collocated E-BAM measurements against the model-calibrated low-cost nodes at the MRU and the IITD sites, respectively.The two sites had similarly large calibration errors (~50 %) because their concentrations were both near the lower end of PM2.5 spectrum in Delhi.These high error rates echo the conditions found at the comparatively clean Pusa and Sector 62 reference sites.The scatterplots also reveal the reason why the technique especially has trouble calibrating low-concentration sites-the technique overpredicted the PM2.5 concentrations at the low-concentration sites to match the levels as if subject to the natural spatial variation.The washed-out local variability after model calibration more obviously manifests in Fig.
First, the normal calibration factors are quite distinct from those of the low-cost nodes with random signals.Compared to the normal values (see Fig. 8 bottom right panel), the ones of the low-cost nodes with random signals have slopes close to 0 and intercepts close to the global mean of true PM2.5 in Delhi (most clearly Atmos.Meas.Tech.Discuss., https://doi.org/10.5194/amt-2019-55Manuscript under review for journal Atmos.Meas.Tech.Discussion started: 1 March 2019 c Author(s) 2019.CC BY 4.0 License.shown in Fig. 8 top left panel).Second, the calibration factors of the normal low-cost nodes are not affected by the aberrant nodes within the network (see Fig. 8 top right, middle left, middle right, and bottom left panels).These two observations Atmos.Meas.Tech.Discuss., https://doi.org/10.5194/amt-2019-55Manuscript under review for journal Atmos.Meas.Tech.Discussion started: 1 March 2019 c Author(s) 2019.CC BY 4.0 License.
Atmos.Meas.Tech.Discuss., https://doi.org/10.5194/amt-2019-55Manuscript under review for journal Atmos.Meas.Tech.Discussion started: 1 March 2019 c Author(s) 2019.CC BY 4.0 License.Schneider et al. (2017) who further expanded the scope to 150 nodes by generating new synthetic stations from their established model and showed a "the more, the merrier" trend up to 50 stations, we restricted ourselves to only realistic data to investigate the relationship since we suspect that stations created from our model with approximately 30 % errors might introduce large noise which could misrepresent the true pattern.We agree with Schneider et al. (2017) that such relationships are location-specific and cannot be blindly transferred to other study sites.At last, including low-cost nodes in the model building can most of the time reduce the model's errors notably when more than nine reference nodes are sampled (i.e., Atmos.Meas.Tech.Discuss., https://doi.org/10.5194/amt-2019-55Manuscript under review for journal Atmos.Meas.Tech.Discussion started: 1 March 2019 c Author(s) 2019.CC BY 4.0 License.confirmed that the low-cost nodes are beneficial for the spatial precision of a sensor network by decreasing the extent of pure interpolation among only reference stations, highlighting the substantial significance of dense deployments of low-cost AQ devices for a new generation of AQ monitoring network.

Figure 1 :
Figure 1: (a) Front view of the low-cost node.(b) Key components of the low-cost node.

Figure 2 :
Figure 2: Locations of the 22 reference nodes (red icons) and 10 low-cost nodes (blue icons) that form the Delhi PM sensor network.

Figure 3 :
Figure 3: The overall schema for the simultaneous GPR and simple linear regression calibration model.

FigureFigure 5 :
Figure 4: a) Box plots of the 24 h aggregated true ambient PM2.5 mass concentrations measured by the 22 government reference monitors across Delhi from January 1 to March 31, 2018.b) Box plots of the low-cost node 24 h aggregated PM2.5 mass concentrations calibrated by the optimized GPR model.In both a) and b), mean and SD of the PM2.5 mass concentrations for each individual site are superimposed on the box plots.5

Figure 6 :
Figure 6: Box plots of the learned calibration factors (i.e., intercept and slope) for each individual low-cost node from the 22 optimized GPR models across the 22-fold leave-one-out CV.

Figure 7 :
Figure 7: Correlation plots comparing the GPR model-calibrated low-cost node PM2.5 mass concentrations to the collocated E-BAM measurements at a) MRU and b) IITD sites.In both a) and b), correlation of determination (R 2 ), RMSE, percent error, and mean of the true ambient PM2.5 mass concentrations throughout the study (from January 1 to March 31, 2018) are superimposed on the correlation plots.

Figure 8 :
Figure 8: Learned calibration factors for each individual low-cost node from the optimized GPR models by replacing measurements of all (top left), nine (top right), seven (middle left), three (middle right), one (bottom left), and zero (bottom right) of the low-cost nodes with random integers bounded by the min and max of the true signals reported by the corresponding lowcost nodes.Note that the nine, seven, three, and one low-cost nodes (whose true signals are replaced with random integers) were 5

Figure 9 :
Figure 9: Average 24 h percent errors of the GPR model for predicting the holdout reference nodes in the network as a function of the number of reference stations used for the model construction under two scenarios -using the full sensor network information by including both reference and low-cost nodes and using only the reference nodes for the model construction.Note each data point (mean value) is derived from 100 simulation runs.The error bars indicating 95 % CI of the means are based on 1000 5 Atmos.Meas.Tech.Discuss., https://doi.org/10.5194/amt-2019-55Manuscript under review for journal Atmos.Meas.Tech.Discussion started: 1 March 2019 c Author(s) 2019.CC BY 4.0 License.Table 2: Summary of the GPR model 24 h performance scores (including RMSE and percent error) for predicting the measurements of the 22 holdout reference nodes across the 22-fold leave-one-out CV when the full sensor network is used.The mean of the true ambient PM2.5 mass concentrations throughout the study (from January 1 to March 31, 2018) for each individual reference node is provided.The reference nodes with the means of true PM2.5 inside the range of [global mean ± SD, i.e.
Atmos.Meas.Tech.Discuss., https://doi.org/10.5194/amt-2019-55Manuscript under review for journal Atmos.Meas.Tech.Discussion started: 1 March 2019 c Author(s) 2019.CC BY 4.0 License.Table 3: Comparison of predetermined percentages of drift to those estimated from the GPR model for intercept and slope, respectively, for each individual low-cost node, assuming all (10), six, and two of the low-cost nodes developed various degrees of drift such as significant (11 %-99 %), marginal (1 %-10 %), and a balanced mixture of significant and marginal.Note the sensors that drifted, the percentages of drift, and which sensors drifted significantly or marginally are randomly chosen.The results reported under each scenario are based on averages of 10 simulation runs.