Towards low-cost and high-performance air pollution measurements using machine learning calibration techniques

Air pollution is a key public health issue in urban areas worldwide. The development of low-cost air pollution sensors is consequently a major research priority. However, low-cost sensors often fail to attain sufficient measurement performance compared to state-of-the-art measurement stations, and typically require calibration procedures in expensive laboratory settings. As a result, there has been much debate about calibration techniques that could make their performance more reliable, while also developing calibration procedures that can be carried out without access to advanced laboratories. One repeatedly proposed strategy is low-cost sensor calibration through co-location with public measurement stations. The idea is that, using a regression function, the low-cost sensor signals can be calibrated against the station reference signal, to be then deployed separately with performances similar to the original stations. Here we test the idea of using machine learning algorithms for such regression tasks using hourly-averaged co-location data for nitrogen dioxide (NO2) and particulate matter of particle sizes smaller than 10 µm (PM10) at three different locations in the urban area of London, UK. Specifically, we compare the performance of Ridge regression, a linear statistical learning algorithm, to two non-linear algorithms in the form of Random Forest (RF) regression and Gaussian Process regression (GPR). We further benchmark the performance of all three machine learning methods against the more common Multiple Linear Regression (MLR). We obtain very good out-of-sample R2-scores (coefficient of determination) > 0.7, frequently exceeding 0.8, for the machine learning calibrated low-cost sensors. In contrast, the performance of MLR is more dependent on random variations in the sensor hardware and co-located signals, and is also more sensitive to the length of the co-location period used for calibration.


Introduction
Air pollutants such as nitrogen dioxide (NO2) and particulate matter (PM) have harmful impacts on human health, the ecosystem, and public infrastructure (European Environment Agency, 2019). Moving towards reliable and high-density air pollution measurements is consequently of prime importance. The development of new low-cost sensors, hand in hand with novel sensor calibration methods, has been at the forefront of current research efforts in this discipline (e.g. Mead et al., 2013; Moltchanov et al., 2015; Lewis et al., 2018; Zimmerman et al., 2018; Sadighi et al., 2018; Tanzer et al., 2019; Eilenberg et al., 2020; Sayahi et al., 2020). Here we present insights from a case study using low-cost air pollution sensors for measurements at three separate locations in the urban area of London, UK. Our focus is on testing the advantages and disadvantages of machine learning calibration techniques for low-cost NO2 and PM10 sensors. The principal idea is to calibrate the sensors through co-location with established high-performance air pollution measurement stations (Fig. 1). Such calibration techniques, if successful, could complement more expensive laboratory-based calibration approaches, thereby further reducing the costs of the overall measurement process (e.g. Spinelle et al., 2015; Zimmerman et al., 2018; Munir et al., 2019). For the sensor calibration, we compare three machine learning regression techniques in the form of Ridge regression, Random Forest (RF) regression, and Gaussian Process regression (GPR), and contrast the results to those obtained with standard Multiple Linear Regression (MLR). RF regression has been studied in the context of NO2 co-location calibrations before, with very promising results (Zimmerman et al., 2018). Equally for NO2, but not for PM10, different linear versions of GPR have been explored in previous studies.

Table 1. Overview of the measurement sites and the corresponding maximum co-location periods, which vary for each specific sensor node.
Note that reference measurements for NO2 and PM10 are only available for two of the three sites each. Sensors that were co-located for at least 820 active measurement hours are identified by their sensor IDs in the last column. Further note that the only sensor used to measure at multiple sites is sensor 19, which is therefore used to test the feasibility of site transfers.

Co-location set-up and calibration variables
In total, we co-located up to 30 nodes, labelled by identifiers (IDs) 1 to 30. For our NO2 measurements, we considered the following 15 sensor signals per node to be important for the calibration process: the NO sensor (plus its baseline signal to remove noise), the NO2 + O3 sensor (plus baseline), the two intermediate-cost NO2 sensors (NO2-A43F, NO2-B43F) plus their respective baselines, the two cheaper MICS sensors, three different temperature sensors and two relative humidity sensors. All 15 signals can be used for calibration against the reference measurements obtained with the co-located higher cost measurement devices. We discuss the relative importance of the different signals, e.g. the relative importance of the different NO2 sensors or the influence of temperature and humidity, in section 3. For the PM10 calibrations, we used two devices of the same type of low-cost PM sensor, resulting in 2×10 different particle measures used in the PM10 calibrations. In addition, we included the respective sensor signals for temperature and relative humidity, providing us with in total 24 calibration signals for PM10.

Calibration algorithms
We evaluate four regression calibration strategies for low-cost NO2 and PM10 devices, by means of co-location of the devices with the aforementioned air quality measurement reference stations. The four different regression methods - multiple linear regression (MLR), Ridge regression, Random Forest (RF) regression, and Gaussian Process regression (GPR) - are introduced in detail in the following subsections. As we will show in section 3, the relative skill of the calibration methods depends on the chemical species to be measured, the sample size available for calibration, as well as certain user preferences.
We will additionally consider the issue of site transferability for sensor node 19, including its dependence on the calibration algorithm used. We note that we do not include the manufacturer calibration of the low-cost sensors in our comparison here, mainly because we found that this method, which is a simple linear regression based on certain laboratory measurement relationships, provided us with negative R2-scores when compared with reference sensors in the field. This result is in line with other studies that reported differences between sensor performances in the field and under laboratory calibrations (see e.g. Mead et al., 2013; Lewis et al., 2018; Rai and Kumar, 2018).

Ridge and Multiple Linear Regression
Ridge regression is a linear least squares regression augmented by L2-regularization to address the bias-variance trade-off (Hoerl and Kennard, 1970; James et al., 2013; Nowack et al., 2018, 2019). Using statistical cross-validation, the regression fit is optimized by minimizing the cost function

L(c) = Σ_{t=1}^{n} ( y_t − Σ_{j=1}^{p} c_j x_{j,t} )² + α Σ_{j=1}^{p} c_j²     (1)

over n hourly reference measurements of pollutant y (i.e. NO2, PM10). The x_{j,t} are p non-calibrated measurement signals from the low-cost sensors, representing signals for the pollutant itself as well as signals recorded for environmental variables (temperature, humidity) and other chemical species that might cause interference with the signal in question. The cost function (1) determines the optimization goal. Its first term is the ordinary least squares regression error; the second term puts a penalty on too large regression coefficients and thus avoids overfitting in high-dimensional settings by nudging the regression towards small regression coefficients c_j. Smaller (larger) values of the regularization coefficient α put weaker (stronger) constraints on the size of the coefficients, thereby favoring overfitting (underfitting). We find the value for α through 5-fold cross-validation, i.e. each data set is split into five ordered time slices and α is optimized by fitting regressions for large ranges of α-values on four of the slices at a time; the best α is then found by evaluating the out-of-sample prediction error on each corresponding remaining slice using the R2-score. Each slice is used once for the evaluation step. Before the training procedure, all signals are scaled to unit variance and zero mean so as to ensure that all signals are weighted equally in the regression optimization, which we explain in more detail at the end of this section.
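This cross-validated choice of α can be sketched with scikit-learn. The data below are synthetic stand-ins for the co-location measurements (all variable names are illustrative), but the pipeline mirrors the described procedure: standard-scaling followed by Ridge, with α selected via 5-fold cross-validation on ordered, unshuffled slices and scored by the R2-score:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in for co-location data: p = 15 raw sensor signals,
# reference pollutant as a noisy linear mixture of those signals.
n, p = 600, 15
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + 0.5 * rng.normal(size=n)

# Standard-scale predictors, then Ridge; alpha chosen by 5-fold CV on
# ordered (unshuffled) time slices, evaluated with the R2-score.
model = GridSearchCV(
    make_pipeline(StandardScaler(), Ridge()),
    param_grid={"ridge__alpha": np.logspace(-4, 4, 20)},
    cv=KFold(n_splits=5, shuffle=False),
    scoring="r2",
)
model.fit(X, y)
best_alpha = model.best_params_["ridge__alpha"]
```

The same search grid and fold structure can be reused for the other regression methods discussed below.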
Through the constraint on the regression slopes, Ridge regression can handle settings with many predictors, here calibration variables, even in the context of strong collinearity in those predictors (Dormann et al., 2013; Nowack et al., 2018, 2019). The resulting linear regression function f_Ridge,

ŷ(t) = f_Ridge(x(t)) = Σ_{j=1}^{p} c_j x_j(t),

provides estimates for pollutant mixing ratios ŷ at any time t, i.e. a calibrated low-cost sensor signal, based on new sensor readings x_j(t). f_Ridge represents a calibration function because it is not just based on a regression of the pollutant signal itself against the reference, but on multiple simultaneous predictors, including those representing known interfering factors.
Multiple Linear Regression (MLR) is the simple non-regularized case of Ridge, i.e. where α is set to zero. MLR is therefore a good benchmark to evaluate the importance of regularization and, when compared to RF and GPR below, of non-linearity in the relationships. As MLR does not regularize its coefficients, it is expected to increasingly lose performance in settings with many (non-linear) calibration relationships. This loss of MLR performance in high-dimensional regression spaces is related to the 'curse of dimensionality' in machine learning, which expresses the observation that one requires an exponentially increasing number of samples to constrain the regression coefficients as the number of predictors is increased linearly (Bishop, 2006). We will illustrate this phenomenon for the case of our NO2 sensor calibrations below.
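The effect described here can be illustrated on synthetic data (a hypothetical example, not our sensor data): with many strongly collinear predictors and few training samples, unregularized MLR overfits the noise while Ridge remains stable out-of-sample:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import r2_score

rng = np.random.default_rng(1)
# Hypothetical set-up: 40 strongly collinear predictors, only 60 training samples.
n_train, n_test, p = 60, 200, 40
base = rng.normal(size=(n_train + n_test, 1))
X = base + 0.1 * rng.normal(size=(n_train + n_test, p))  # collinear columns
y = base[:, 0] + 0.5 * rng.normal(size=n_train + n_test)

X_tr, X_te = X[:n_train], X[n_train:]
y_tr, y_te = y[:n_train], y[n_train:]

mlr = LinearRegression().fit(X_tr, y_tr)   # alpha = 0, no regularization
ridge = Ridge(alpha=10.0).fit(X_tr, y_tr)  # L2-regularized

r2_mlr = r2_score(y_te, mlr.predict(X_te))
r2_ridge = r2_score(y_te, ridge.predict(X_te))
```

With this set-up the regularized model generalizes noticeably better, mirroring the MLR degradation we observe for small co-location sample sizes.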

Finally, we note that for Ridge regression, as also for GPR described below, the predictors x_j must be normalized to a common range. For Ridge this is straightforward to understand, as the regression coefficients, once the predictors are normalized, provide direct measures of the importance of each predictor for the overall pollutant signal. If not normalized, the coefficients will additionally weight the relative magnitude of predictor signals, which can differ by orders of magnitude (e.g. temperature at around 273 Kelvin, but a measurement signal for a trace gas on the order of 0.5 amplifier units). As a result, the predictors would be penalized differently through the same α in equation (1), which could mean that certain predictors are effectively not considered in the regressions. Here, we normalize all predictors in all regressions to zero mean and unit standard deviation according to the samples included in each training dataset.
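As a brief illustration of this normalization step (with made-up numbers for the two example scales mentioned above), the scaler statistics are fitted on the training samples only and can then be applied to any later data:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical raw predictors on very different scales, as in the text:
# temperature near 273 K next to a trace-gas signal near 0.5 amplifier units.
X_train = np.array([
    [273.1, 0.52],
    [274.0, 0.48],
    [272.5, 0.55],
    [273.6, 0.47],
])

scaler = StandardScaler().fit(X_train)  # statistics from the training data only
X_scaled = scaler.transform(X_train)

col_means = X_scaled.mean(axis=0)  # ~0 for each column
col_stds = X_scaled.std(axis=0)    # ~1 for each column
```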

Random Forest regression
Random Forest (RF) regression is one of the most widely used non-linear machine learning algorithms (Breiman and Friedman, 1997; Breiman, 2001), and has already found applications in air pollution sensor calibration as well as in other aspects of atmospheric chemistry (Keller and Evans, 2019; Nowack et al., 2018, 2019; Sherwen et al., 2019; Zimmerman et al., 2018; Malings et al., 2019). It follows the idea of ensemble learning, where multiple machine learning models together make more reliable predictions than the individual models. Each RF regression object consists of a collection (i.e. ensemble) of graphical tree models, which split training data by learning decision rules (Figure 3). Each of these decision trees consists of a sequence of nodes, which branch into multiple tree levels until the end of the tree (the 'leaf' level) is reached. Each leaf node contains at least one, or several, samples from the training data. The average of these samples is the prediction of each tree for any measurement of predictors x defining a new traversal of the tree to the given leaf node. In contrast to Ridge regression, there is more than one tunable hyperparameter to address overfitting. One of these hyperparameters is the maximum tree depth, i.e. the maximal number of levels within each tree, as deeper trees allow for a more detailed grouping of samples. Similarly, one can constrain the minimum number of samples required to define a split or a leaf node. Note that in real examples, branches can have different depths, i.e. the leaf nodes can occur at different levels of the tree hierarchy, see e.g. Figure 2 in Zimmerman et al. (2018). Once the decision rules for each node and tree are learned from training data, each tree can be presented with new sensor readings x at a time t to predict the pollutant concentration y_t. The decision rules depend, inter alia, on the tree structure and random sampling through bootstrapping, which we optimize through 5-fold cross-validation.
Based on the values x_i, each set of predictors follows routes through the trees. The training samples collected in the corresponding leaf node define the tree-specific prediction for y. By averaging the K tree-wise predictions we combat tree-specific overfitting, and finally obtain a more regularized Random Forest prediction ȳ_t.
The RF training process tunes the parameter thresholds for each binary decision tree node. By introducing randomness, for example by selecting a sub-set of samples from the training data set (bootstrapping), each tree provides a somewhat different data representation. This random element is used to obtain a better, averaged prediction over all trees in the ensemble, which is less prone to overfitting than individual regression trees. We here cross-validated the scikit-learn implementation of RF regression (Pedregosa et al., 2011) over problem-specific ranges for the minimum number of samples required to define a split and the minimum number of samples to define a leaf node. The implementation uses an optimised version of the Classification And Regression Trees (CART) algorithm.
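A sketch of this cross-validation over the two split-control hyperparameters, again on synthetic stand-in data and with illustrative parameter ranges:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, KFold

rng = np.random.default_rng(2)
# Synthetic non-linear calibration problem (hypothetical signals).
n, p = 500, 5
X = rng.uniform(-1, 1, size=(n, p))
y = np.sin(3 * X[:, 0]) + X[:, 1] ** 2 + 0.1 * rng.normal(size=n)

# Cross-validate the two split-control hyperparameters named in the text,
# using ordered (unshuffled) 5-fold CV; bootstrapping is enabled by default.
search = GridSearchCV(
    RandomForestRegressor(n_estimators=100, bootstrap=True, random_state=0),
    param_grid={
        "min_samples_split": [2, 5, 10],
        "min_samples_leaf": [1, 2, 5],
    },
    cv=KFold(n_splits=5, shuffle=False),
    scoring="r2",
)
search.fit(X, y)
```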

Gaussian Process regression
Gaussian process regression (GPR) is a widely used Bayesian machine learning method to estimate non-linear dependencies (Rasmussen and Williams, 2006; Pedregosa et al., 2011; Lewis et al., 2016; De Vito et al., 2018; Runge et al., 2019; Malings et al., 2019; Nowack et al., 2020; Mansfield et al., 2020). In GPR, the aim is to find a distribution over possible functions that fit the data. We first define a prior over possible functions that is updated according to the data using Bayes' theorem, which provides us with a posterior distribution over possible functions. The prior distribution is a Gaussian Process (GP) with mean µ and a covariance function, or kernel, k, which describes the covariance between any two points x_i and x_j. We here standard-scale (i.e. centre) our data so that µ = 0, meaning our GP is entirely defined by the covariance function. Being a kernel method, the performance of GPR depends strongly on the kernel (covariance function) design, as it determines the shape of the prior and posterior of the Gaussian Process, and in particular the characteristics of the function we are able to learn from the data. Owing to the time-varying, continuous but also oscillating nature of air pollution sensor signals, we here use a sum kernel of a radial basis function (RBF) kernel, a white noise kernel, a Matérn kernel and a Dot-Product kernel. The RBF kernel, also known as squared exponential kernel, is defined as

k_RBF(x_i, x_j) = exp( − d(x_i, x_j)² / (2 l²) ).

It is parameterized by a length scale l > 0, and d is the Euclidean distance. The length scale determines the scale of variation in the data and is learned during the Bayesian update, i.e. for a shorter length scale the function is more flexible. However, it also determines the extrapolation scale of the function, meaning that any extrapolation beyond the length scale is probably unreliable. RBF kernels are particularly helpful to model smooth variations in the data.
The Matérn kernel is defined by

k_Matérn(x_i, x_j) = (1 / (Γ(ν) 2^(ν−1))) ( (√(2ν) / l) d(x_i, x_j) )^ν K_ν( (√(2ν) / l) d(x_i, x_j) ),

where K_ν is a modified Bessel function and Γ the gamma function (Pedregosa et al., 2011). We here choose ν = 1.5 as the default setting for the kernel, which determines the smoothness of the function. Overall, the Matérn kernel is useful to model less smooth variations in the data than the RBF kernel. The Dot-Product kernel is parameterized by a hyperparameter σ², and we found that adding this kernel to the sum of kernels significantly improved our results empirically. The white noise kernel simply allows for independently and identically normally-distributed noise on the data, specified through a variance parameter. This parameter is similar to, and will interact with, the α noise level described below, which is, however, tested systematically through cross-validation.
The Python scikit-learn implementation of the algorithm used here is based on algorithm 2.1 of Rasmussen and Williams (2006). We optimized the kernel parameters in the same way as for the other regression methods through 5-fold cross-validation, and subject to the noise α-parameter of the scikit-learn GPR package (Pedregosa et al., 2011). This
parameter is not to be confused with the α regularization parameter for Ridge regression; it takes the role of smoothing the kernel function so as to address overfitting. It represents a value added to the diagonal of the kernel matrix during the fitting process, with larger α values corresponding to a greater noise level in the measurements of the outputs. However, we note that there is some equivalence with the α parameter in Ridge, as the method is effectively a form of Tikhonov regularization that is also used in Ridge regression (Pedregosa et al., 2011). Both inputs and outputs to the GPR function were standard-scaled to zero mean and unit variance based on the training data. For each GPR optimization, we chose 25 optimizer restarts with different initializations of the kernel parameters, which is necessary to approximate the best possible solution maximizing the log-marginal likelihood of the fit. More background on GPR can be found in Rasmussen and Williams (2006).
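The kernel composition and noise handling described above can be sketched as follows (synthetic 1-D data; for brevity we use 3 optimizer restarts here rather than the 25 used in the study):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, Matern, DotProduct, WhiteKernel

rng = np.random.default_rng(3)
# Small synthetic 1-D calibration-like signal: smooth trend plus noise.
X = np.linspace(0, 10, 120)[:, None]
y = np.sin(X[:, 0]) + 0.3 * X[:, 0] + 0.1 * rng.normal(size=120)

# Sum kernel as described in the text: RBF + Matern(nu=1.5) + Dot-Product + white noise.
kernel = (
    RBF(length_scale=1.0)
    + Matern(length_scale=1.0, nu=1.5)
    + DotProduct()
    + WhiteKernel(noise_level=0.01)
)

# alpha adds to the diagonal of the kernel matrix (the smoothing/noise term);
# kernel hyperparameters are refined by maximizing the log-marginal likelihood,
# restarted from several random initializations.
gpr = GaussianProcessRegressor(kernel=kernel, alpha=1e-3, n_restarts_optimizer=3)
gpr.fit(X, y)
y_pred = gpr.predict(X)
```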

Cross-validation
For all regression models, we performed 5-fold cross-validation where the data is first split into training and test sets, keeping samples ordered by time. The training data is afterwards divided into five consecutive subsets (folds) of equal length. If the number of training samples is not divisible by five, leaving a remainder of r samples, then the first r folds contain one surplus sample compared to the remaining folds. Each fold is used once as a validation set while the remaining four folds are used for training.

The best set of model hyperparameters, or kernel functions, is found according to the average generalization error on these validation sets. After the best cross-validated hyperparameters are found, we refit the regression models on the entire training data using these hyperparameter settings (e.g. the α value for which we found the best out-of-sample performance for Ridge regression).
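The fold construction described here matches the default behaviour of scikit-learn's unshuffled KFold, which can be verified directly (603 is an illustrative sample count, not one from the study):

```python
import numpy as np
from sklearn.model_selection import KFold

# 603 ordered training samples: 603 = 5 * 120 + 3, so the first three folds
# receive 121 samples and the last two receive 120, with time order preserved.
X = np.arange(603)[:, None]

cv = KFold(n_splits=5, shuffle=False)
splits = list(cv.split(X))
fold_sizes = [len(val_idx) for _, val_idx in splits]

# Each validation fold is a consecutive time slice, as required here.
first_val = splits[0][1]
```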

NO2 sensor calibration
The skill of a sensor calibration function is expected to increase with sample size, i.e. the number of measurements used in the calibration process, but will also depend on aspects of the sampling environment. For co-location measurements, there will be time-dependent fluctuations in the value ranges encountered for the predictors (e.g. low-cost sensor signals, humidity, temperature) and predictands (reference NO2, PM10). The calibration range in turn affects the performance of the calibration function: if faced with values outside its training range, the function effectively has to perform an extrapolation rather than an interpolation, i.e. the function is not well constrained outside its training domain. This limitation is particularly critical for non-linear machine learning functions (Nowack et al., 2018; Zimmerman et al., 2018; Hagan et al., 2018). Calibration performance will further vary for each device, even for sensors of the same make, due to unavoidable randomness in the sensor production process (Mead et al., 2013; Castell et al., 2017). To characterize these various influences, we here test the dependence of three machine learning calibration methods, as well as of MLR, on sample size and co-location period for a number of NO2 sensors.
The NO2 co-location data at CR7 is ideally suited for this purpose. 21 sensor nodes of the same make were co-located with a LAQN reference during the period October to December 2018 (Table 1). We actually co-located 30 sensor sets at the site, but we excluded any sensors with fewer than 820 hours (samples) after outlier removal from our evaluation. The remaining sensors measure sometimes overlapping, but different, time intervals as a result of varying co-location start and end times as well as accidental sensor malfunctions. To detect these malfunctions, and to exclude the corresponding samples, we removed outliers (evidenced by unrealistically large measurement signals) at the original time resolution of our measurements, i.e. < 1 minute and prior to hourly averaging. To detect outliers for removal, we used the Median Absolute Deviation (MAD) method, also known as the 'robust Z-score method', which identifies outliers for each variable based on their univariate deviation from their training data median. Since the median is itself a statistic robust to outliers, it is typically a better measure to identify outliers than, for example, a deviation from the mean. Accordingly, we excluded any samples t from the training and test data where the quantity

z_{j,t} = | x_{j,t} − x̃_j | / MAD_j,   with MAD_j = median_t ( | x_{j,t} − x̃_j | ),

takes on values > 7 for any of the predictors, where x̃_j is the training data median value of each predictor. To train and cross-validate our calibration models, we took the first 820 hours measured by each sensor set and split them into 600 hours for training and cross-validation, leaving 220 hours to measure the final skill on an out-of-sample test set. We highlight again that the test set will cover different time intervals for different sensors, meaning that further randomness is introduced in how we measure calibration skill.
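The MAD-based screening can be sketched as follows; note that the exact scaling convention of the robust Z-score (e.g. whether a consistency factor of 0.6745 is applied) is an assumption here:

```python
import numpy as np

def mad_outlier_mask(X_train, X, threshold=7.0):
    """Flag samples where any predictor deviates strongly from its
    training-data median, scored as |x - median| / MAD (one common
    convention, assumed here). Returns True for samples to drop."""
    med = np.median(X_train, axis=0)
    mad = np.median(np.abs(X_train - med), axis=0)
    scores = np.abs(X - med) / mad
    return (scores > threshold).any(axis=1)

rng = np.random.default_rng(4)
X = rng.normal(size=(1000, 3))
X[0, 1] = 50.0  # an unrealistically large signal, e.g. a sensor malfunction

mask = mad_outlier_mask(X, X)
X_clean = X[~mask]
```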
However, the relationships for each of the four calibration methods are learned from exactly the same data, and their predictions are also evaluated on the same data, meaning that their robustness and performance can still be directly compared. To measure calibration skill we used two standard metrics in the form of the R2-score (coefficient of determination, 1 − residual sum of squares / total sum of squares) and the RMSE between the reference measurements and our calibrated signals on the test sets. For particularly poor calibration functions, the R2-score can take on arbitrarily large negative values, whereas a value of 1 implies a perfect prediction. An R2-score of 0 is equivalent to a function that predicts the correct long-term time average of the data, but no fluctuations therein.
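The behaviour of the two skill metrics at these reference points can be verified with a toy example:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_true = np.array([10.0, 20.0, 30.0, 40.0])

r2_perfect = r2_score(y_true, y_true)                        # perfect prediction: 1
r2_mean = r2_score(y_true, np.full(4, y_true.mean()))        # mean-only prediction: 0
r2_poor = r2_score(y_true, np.array([40.0, 10.0, 5.0, 0.0])) # worse than the mean: < 0

# RMSE of the mean-only prediction, for comparison.
rmse_mean = np.sqrt(mean_squared_error(y_true, np.full(4, y_true.mean())))
```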
As discussed in section 2.1, each of AirPublic's co-location sets measures 15 signals (the predictors, or inputs) that we consider relevant for the NO2 sensor calibration against the LAQN reference signal for NO2 (the predictand, or output). Each of the 15 inputs will potentially be systematically linearly or non-linearly correlated with the output, which allows us to learn a calibration function from the measurement data. Once we know this function, we should be able to make accurate predictions given new inputs to reproduce the LAQN reference. As we fit two linear and two non-linear algorithms, certain transformations of the inputs can be useful to facilitate the learning process. For example, a relationship between an input and the output might be an exponential dependence in the original time series, so that applying a logarithmic transformation could lead to an approximately linear relationship that might be easier to learn for a linear regression function. We therefore compared three set-ups with different sets of predictors:

1. Using the 15 input time series as provided (label I15).

2. Adding a log-transformed version of each input, doubling the number of predictors (I30).

3. Additionally adding exponentially transformed versions of the inputs.
In these transformations, A_max is the maximum value of each predictor's time series and ε = 10^−9; the latter prevents possible divisions by zero whereas the former prevents infinite values in the logarithmic function.
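A sketch of such a predictor augmentation; the exact shift used in the study is not reproduced here, so the offset by the column-wise maximum absolute value is an illustrative assumption:

```python
import numpy as np

def augment_with_log(X, eps=1e-9):
    """Append a log-transformed copy of each predictor column.
    As one plausible variant (an assumption, not the study's exact formula),
    each column is shifted by its maximum absolute value so the argument of
    the logarithm stays positive, with eps guarding against log(0)."""
    A_max = np.abs(X).max(axis=0)
    X_log = np.log(X + A_max + eps)
    return np.hstack([X, X_log])

rng = np.random.default_rng(5)
X15 = rng.normal(size=(100, 15))  # stand-in for the 15 raw signals
X30 = augment_with_log(X15)       # doubled predictor set, akin to I30
```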

Comparison of regression models for all predictors
For a first comparison of the calibration performance of the four methods, we show R2-scores and RMSEs in Table 2. We next consider the performance dependence on the sample size of the training data (Figure 4). The advantages of the machine learning methods become even more evident for smaller numbers of training samples, even if we consider case (b) with 30 predictors, i.e. I30, for which we found that MLR performs fairly well if trained on 600 hours of data. The mean R2-score and RMSE (µg m−3) quickly deteriorate for smaller sample sizes for MLR, in particular below a threshold of less than 400 hours of training data. Ridge regression - its statistical learning equivalent - always outperforms MLR. In contrast, both GPR and RF regression already perform well at small sample sizes of less than 300 hours. While all methods converge towards similar performance approaching 600 hours of training data (Table 2), MLR generally performs worse than Ridge regression and significantly worse than RF regression and GPR.
Further evidence for the advantages of the machine learning methods is provided in Figure 5, showing boxplots of the R2-score distributions across all 21 sensor nodes depending on sample size (300, 400, 500, 600 hours) and regression method. While the median sensor performances of MLR, Ridge and GPR ultimately become comparable, MLR is typically found to yield a number of poorly performing calibration functions, with some R2-scores well below 0.6 even for 600 training hours. In contrast, the distributions are far narrower for the machine learning methods: GPR and RF do not show a single extreme outlier even after being trained on only 400 hours of data, providing strong indications that the two methods are the most reliable. After 600 hours, one can effectively expect that all sensors will provide R2-scores > 0.7 if trained using GPR. Overall, this highlights again that machine learning methods will provide better average skill, but are also expected to provide more reliable calibration functions.

Calibration performance depending on predictor choices and NO2 device
Tests (a) to (c) listed in Table 2 indicate that the machine learning regressions for NO2, specifically GPR, can benefit slightly from additional logarithmic predictor transformations, but that adding exponential transformations on top of these predictors does not further increase predictive skill, as measured through the R2-score and RMSE. Incorporating the logarithmic transformations, we next tested the importance of various predictors to achieve a certain level of calibration skill (rows (d) to (i) in Table 2). We first tested three set-ups: one using only the sensor signals of the two cheaper MICS devices (d), and then set-ups with the more expensive Alphasense A43F (e) and B43F (f) devices. Using just the MICS devices, the R2-score drops from 0.75-0.81 for the machine learning methods to around zero, meaning that hardly any of the variation in the true NO2 reference signal is captured. Using our calibration approach here, the MICS devices would therefore not be sufficient to achieve a meaningful measurement performance. The picture looks slightly better, albeit still far from perfect, for the individual A43F and B43F devices, for which R2-scores of almost 0.5 are reached using the non-linear calibration methods. We note that the linear MLR and Ridge methods do not achieve the same performance, but Ridge outperforms MLR. The most recently developed Alphasense sensor used in our study, B43F, is the best performing standalone sensor. If we add the NO/O3 sensor and the humidity and temperature signals to the predictors - case (g) - its performance almost reaches that of the I30 configuration.
This implies that the interference with NO/ozone, temperature and humidity might be significant and has to be taken into account in the calibration, and that if only one sensor could be chosen for the measurements, the B43F sensor would be the best choice. Further adding the A43F sensor to the predictors improves the predictive skill only mildly (h). Finally, we note that, in this stationary sensor setting, further predictive skill can be gained by considering past measurement values.

Here, we included the one-hour lagged signal of the best B43F sensor (i). This is clearly only possible if there is a delayed consistency, or autocorrelation, in the data, which here leads to the maximum R2 generalization score of 0.84 for GPR, and to related gains in terms of the RMSE. While this is an interesting feature, we will not consider such set-ups in the following, because we intend the sensors to be transferable among locations and to rely only on live signals for the hour of measurement in question.

In summary, using all sensor signals in combination is a robust and skilful set-up for our NO2 sensor calibration and is therefore a prudent choice, at least if one of the machine learning methods is used to control for the curse of dimensionality. In particular, the B43F sensor is important to consider in the calibration, but further calibration skill is gained by also considering environmental factors and additional NO2 devices.

PM10 sensor calibration
In the same way as for NO2, we tested several calibration settings for the PM10 sensors. For this purpose, we consider the measurements for the location CarPark, where we co-located three sensors (IDs 19, 25 and 26) with a higher cost device (Table 1). A well-known limitation of RF regression is that its predictions cannot exceed the range of target values encountered during training. However, this problem is not entirely exclusive to RFs, but is inherited by all methods, with RFs only being the most prominent case. We illustrate the more general issue, which will occur in any co-location calibration setting, in Figure 6. In the training data, there are no pollution values beyond ca. 40 µg m−3, so that the RF predictions simply level off around that value. We note that this effect is somewhat alleviated by using GPR, and even more so by Ridge regression. For the latter, this behaviour is intuitive, as the linear relationships learned by Ridge will hold to a good approximation even under extrapolation to previously unseen values. However, even for Ridge regression the predictions eventually deviate from the 1:1 line for the highest pollution levels. This aspect will be crucial to consider for any co-location calibration approach, as is also evident from the poor MLR performance, despite it being another linear method. In addition, MLR sometimes predicts substantially negative values, producing an overall R2-score of below 0.3, whereas the machine learning methods appear to avoid the problem of negative predictions almost entirely. In conclusion, we highlight the necessity for co-location studies to ensure that the maximum pollution values encountered during training and testing/deployment are as similar as possible. Extrapolations by more than 10-20 µg m−3 beyond the training range appear to be unreliable even if Ridge regression is used as the calibration algorithm, which is the best of our four methods to combat such extrapolation issues.
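The leveling-off of RF predictions outside the training range, and the contrast with a linear method, can be reproduced on a toy example (synthetic data; the 40 µg m−3 cut-off mimics the situation in Figure 6):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge

rng = np.random.default_rng(6)
# Train only on low pollution values (toy example: y is linear in x),
# then predict for a value far above the training range.
x_train = rng.uniform(0, 40, size=(300, 1))
y_train = x_train[:, 0] + rng.normal(scale=1.0, size=300)

rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(x_train, y_train)
ridge = Ridge(alpha=1.0).fit(x_train, y_train)

x_new = np.array([[80.0]])            # well outside the training domain
rf_pred = rf.predict(x_new)[0]        # levels off near the training maximum
ridge_pred = ridge.predict(x_new)[0]  # linear fit keeps extrapolating
```

The RF prediction is bounded by the leaf-node averages of the training data, whereas the linear fit continues along the learned slope.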
A test with additional log-transformations (I48) of the predictors led to test score improvements for the two linear methods (Table 3), in particular for MLR (R2 = 0.7) but also for Ridge regression (R2 = 0.8). This implies that the log-transformations have helped linearize certain predictor-predictand relationships. Further exp-transformations (I72), which also further increase the predictor dimensionality, did not lead to an improvement in calibration skill. We therefore ran one final test using the I48 set-up but without relative humidity and temperature included as predictors. This test confirmed that the sensor signals indeed experience a slight interference from humidity and temperature, at least considering the machine learning regressions. Notably, this loss of skill is not observed for MLR, for which the R2-score actually improves. A likely explanation for this behaviour is the curse of dimensionality, which affects MLR more significantly than the three machine learning methods, so that the reduction in collinear dimensions (given the sample size constraint) is more beneficial than the information gained by including temperature and humidity.

In summary, we have found that Ridge regression and GPR are the two most reliable and high-performing calibration methods for the PM10 sensor. We are able to attain very good R2-scores > 0.7 for all four regression methods, though. An important point to highlight is the characteristics of the training domain, in particular of the pollution levels encountered during the training data measurements. If the value range is not sufficient to cover the range of interest for future measurement campaigns, then Ridge regression might be the most robust choice to alleviate the underprediction of the most extreme pollution values.
However, the extrapolation power of any method is limited, so we underline the need to carefully check whether a training data set fulfils these crucial criteria; see also similar discussions in other calibration contexts (Hagan et al., 2018; Zimmerman et al., 2018; Malings et al., 2019).

Site transferability
Finally, we aim to address the question of site transferability, i.e. how reliably a sensor calibrated through co-location can be used to measure air pollution at a different location. One of the sensor nodes (ID 19) was used for NO2 measurements at both locations CR7 and CR9, and was also used to measure PM10 at CR9 and CarPark, allowing us to address this question for our methodology. Note that these tests also include a shift in the time of year (Table 1), which has been hypothesized to be one potentially limiting factor in site transferability. The results of these transferability tests for PM10 (from CR9 to CarPark and vice versa) and NO2 (from CR7 to CR9 and vice versa) are shown in Figures 7 and 8, respectively.
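Conceptually, a transferability test amounts to fitting each calibration model on co-location data from one site and scoring it against the reference at the other. A minimal scikit-learn sketch follows; the helper name and model settings are our own illustrative choices, not the exact configurations used in the study:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import r2_score

def site_transfer_scores(X_a, y_a, X_b, y_b):
    """Fit each calibration model at site A, then compute its out-of-site R2 at site B."""
    models = {
        "MLR": LinearRegression(),
        "Ridge": Ridge(alpha=1.0),
        "RF": RandomForestRegressor(n_estimators=200, random_state=0),
        "GPR": GaussianProcessRegressor(normalize_y=True),
    }
    return {name: r2_score(y_b, m.fit(X_a, y_a).predict(X_b)) for name, m in models.items()}

# Toy example: both "sites" share the same underlying sensor-to-reference relation
rng = np.random.default_rng(0)
X_a = rng.uniform(0, 10, size=(200, 2))
y_a = 2 * X_a[:, 0] + X_a[:, 1] + rng.normal(0, 0.1, 200)
X_b = rng.uniform(0, 10, size=(200, 2))
y_b = 2 * X_b[:, 0] + X_b[:, 1] + rng.normal(0, 0.1, 200)
scores = site_transfer_scores(X_a, y_a, X_b, y_b)
```

In practice, the transfer skill depends strongly on how well the training pollution range at site A covers the conditions at site B, as discussed below.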
For PM10, we trained the regressions, using the I24 predictor set-up, on 400 hours of data at either location. This emulates a situation in which, according to our results above, we limit the co-location period to the minimum number of samples required to achieve reasonable performance across all four regression methods. To mitigate issues related to extrapolation (Figure 6), we selected the last 400 hours of the time series for location CarPark (Figure 7a), and hours 600 to 1000 of the time series for location CR9 (Figure 7b), so as to include the maximum possible pollution range in the training data. The maximum pollution values found within the two time slices differ only by ca. ±10 µg m−3, for which at least Ridge regression should provide reasonable extrapolation performance. For the resulting predictions at location CarPark, using models trained on the CR9 data, we achieve generally very good R2-scores ranging between 0.67 for RFs and 0.78 for Ridge regression. The site-transferred measurement performance of sensor 19 is therefore almost as good as that of sensor 26 at the co-location site itself (Table 3), i.e. we cannot detect any significant loss in measurement performance due to the site transfer. A surprising element is that MLR performs almost as well as Ridge regression in this case, whereas it performed poorly for sensor 26, where it only achieved an R2-score of 0.28 (Table 3). This underlines our previous observation that the performance of MLR is more sensitive to the specific sensor hardware, with sometimes low performance for relatively small sample sizes (Figure 5). However, our results also show again that linear methods appear to generally perform well for our PM10 sensors, with Ridge regression being the best-performing method, as indicated by the inset numbers in Figures 7 and 8. Overall, these site-transfer PM10 results from CR9 to CarPark imply that sensors calibrated through co-location can achieve high measurement performance distant from the co-location site.
Figure 7. Tests of PM10 sensor site transfers using calibration models trained on 400 hours of data. (a) Predictions for the four regression models (as labelled) and reference measurements at location CarPark, using models trained at CR9, and (b) at location CR9, using models trained on data measured at CarPark. The inset values provide the R2-scores for each method relative to the reference, as well as the corresponding recall and precision for the detection of the strongest pollution events, which are typically of particular interest in real-life situations (here defined as events when two values within the last three hours exceeded a threshold of 35 µg m−3). For compactness, we only show data for times at which both reference and low-cost sensor data were available, and label these hours as a consecutive timeline.
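The event definition used for the inset recall and precision scores can be sketched as follows; this is our own reimplementation of the description in the caption, and the function names are illustrative:

```python
import numpy as np

def pollution_events(series, threshold):
    """Flag hours where at least two of the last three hourly values exceed threshold."""
    s = np.asarray(series, dtype=float)
    exceed = (s > threshold).astype(int)
    # Rolling count over the current hour and the two preceding ones
    counts = np.convolve(exceed, np.ones(3, dtype=int), mode="full")[: len(s)]
    return counts >= 2

def precision_recall(pred_events, ref_events):
    """Precision and recall of predicted events against reference events."""
    tp = np.sum(pred_events & ref_events)
    precision = tp / max(np.sum(pred_events), 1)
    recall = tp / max(np.sum(ref_events), 1)
    return precision, recall
```

For example, with hourly values [0, 40, 40, 0, 0, 40, 0, 40] µg m−3 and a 35 µg m−3 threshold, events are flagged at hours 2, 3 and 7.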
However, we do find that site transferability is not always as straightforward as for this particular case. For example, for the inverse transfer using models trained at CarPark and predicting PM10 pollution levels at CR9, we find lower R2-scores overall. There is consistency in the sense that MLR and Ridge remain the best-performing methods for sensor 19, with R2-scores of around 0.5, but the sensors now miss several significant pollution events in the time series. We note, however, that many of the most extreme pollution events are still detected, which is evident from the still relatively high precision and recall scores for all methods. These results underline that, in general, at least a good performance can be achieved with co-location calibrations, but that there are also significant challenges posed by site transfers. In particular, the problem is not necessarily symmetric among sites, i.e. the skill of the method can depend on the direction of the transfer, even if the pollution levels at both sites are similar. We therefore hope that our insights and results will motivate further work in this direction, with the aim of identifying possible causes of such effects. We discuss some of the possible reasons for this behaviour in Sect. 4.

Figure 8. (a) Predictions for the four regression models (as labelled) and reference measurements at location CR7, using models trained on data from CR9, and (b) vice versa. The inset values provide the R2-scores for each method relative to the reference, as well as the corresponding recall and precision for the detection of the strongest pollution events, which are typically of particular interest in real-life situations. Since the two locations were subject to very different pollution ranges (note the different value ranges on the y-axes), these are defined in (a) as 45 µg m−3 and in (b) as 90 µg m−3, and we indicate these thresholds by gray dashed lines.
An extreme pollution event occurs when the threshold is exceeded for two of the last three hours. For compactness, we only show data for times at which both reference and low-cost sensor data were available, and label these hours as a consecutive timeline.
Similarly, we find promising results for the NO2 sensor site transfer using the I30 predictor set-up (Figure 8). The key challenge for the sensor transfer from CR7 to CR9 is that the maximal pollution levels at the two locations differ strongly, with peak concentrations being around 100 µg m−3 greater at CR9. To allow for the best possible learning opportunities for the regression algorithms, we therefore used all available samples for training, i.e. 1482 samples at CR9 and 829 samples at CR7. This leads to overall good performance of the non-linear RF and GPR methods at location CR7 using models trained at CR9. As no extrapolation is necessary, these methods achieve good R2-scores > 0.6 and also a good balance of precision and recall. The results are, however, slightly worse than for the same-site calibrations (Table 2). Ridge regression has a tendency to overpredict NO2 pollution levels in this particular case, likely because it cannot capture some non-linear effects that would have limited the prediction values. As a result, it reproduces almost all extreme pollution events where the concentration of NO2 exceeds 45 µg m−3 (recall = 0.98), but also predicts many false pollution events (precision = 0.49).
Despite the large sample size, MLR performs poorly for both site transfers (R2 = 0.07 and -0.31). In particular, MLR underpredicts at CR7, sometimes even providing impossible negative pollution estimates, whereas it produces several runaway positive values at CR9 (Figure 8). However, at CR9 all methods struggle with the impossible challenge of extrapolating far outside their training domain, which is effectively an extreme demonstration of the effects of an ill-considered training range (cf. Figure 6). Among the machine learning methods, the effect is, as expected, most prominent for RF regression (R2 = 0.36), which cannot predict any pollution values beyond those encountered at the training stage. This is a serious limitation and means that the method scores nil on precision and recall for any extreme pollution events at CR9, where NO2 levels exceeded 90 µg m−3. GPR is slightly better at extrapolating beyond its training domain (compare also Figure 6), but still not well enough to reproduce any of the extreme pollution events, giving rise to equally low precision and recall. Ridge regression, as a regularized linear method, performs best in the sense that it is able to reproduce at least a few of the extreme events (recall = 0.05) while predicting no false extreme events (precision = 1.0), and it still achieves an R2-score of 0.53. Nonetheless, it is clear from the time series in Figure 8 that none of the regression methods works for this site transfer, simply because the extrapolation range is too large.

Discussion and conclusions

We have compared four different regression methods to calibrate a number of low-cost NO2 and PM10 sensors against reference measurement signals, by means of co-location at three separate sites in London, UK. Comparing the four regression methods, our main conclusions are:
1. For the 21 NO2 sensors, Gaussian Process regression (GPR) generally performs best at the same measurement site, followed by Ridge regression, Random Forest (RF) regression, and Multiple Linear Regression (MLR). For the single-sensor PM10 calibration, we find that Ridge regression and GPR attain about the same measurement performance. We note that in particular the relative performance of GPR differs greatly from a recent study by Malings et al. (2019), likely due to our different choice of kernel design.
2. Special care must be taken with the calibration conditions, in particular if sensors are thereafter used for measurements in areas where higher pollution levels are to be expected. The linear Ridge method can best mitigate the catastrophic measurement failure in such extrapolation settings, but also fails if measurement signals deviate by more than around 10-20 µg m−3 from the maximum pollution level in the training data. For our NO2 measurements with site transfer, we find that the non-linear methods GPR and RF outperform Ridge, assuming that the training pollution range encapsulates the range of values encountered at the new site. For the PM10 sensor calibrations and corresponding site transfers, we find that Ridge regression is the highest-performing and most reliable calibration algorithm overall.

3. The machine-learning-calibrated low-cost sensors typically achieve high performance, with R2-scores often exceeding 0.8 on new, unseen test data.
On another note, we highlight that we sometimes found significant signals in our test data sets that were not reproduced by our low-cost sensor nodes, see e.g. the pollution peak at t ≈ 140 in Figure 7a, even though the measured pollution value lies well within the training data range. It is hard to assign reasons to this surprising sensor behaviour, as our low-cost sensors are apparently able to capture most of the other pollution spikes well in the same dataset. One possible reason is a calibration blind spot, i.e. a new type of sensor interference that we did not encounter in the training data. However, we think that this is unlikely, given that the behaviour is not found frequently, at least in this particular time series. Two other options are (a) imperfect co-location, i.e. we might have missed an important local pollution plume by chance, or (b) temporary sensor failures that were removed by the MAD outlier removal, i.e. our sensors were temporarily inoperational at the time of a pollution spike that dominated the values for the given measurement hour. We hope that future measurement campaigns can provide further insights into such calibration challenges, and that our study can motivate further work in this direction.
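For reference, a standard MAD-based outlier filter of the kind mentioned above can be sketched as follows; the threshold of 3.5 and the exact formulation are illustrative assumptions, not necessarily the precise procedure applied to our sensor data:

```python
import numpy as np

def mad_outlier_mask(x, n_sigmas=3.5):
    """Flag samples whose robust z-score, based on the median absolute
    deviation (MAD), exceeds n_sigmas."""
    x = np.asarray(x, dtype=float)
    med = np.median(x)
    mad = np.median(np.abs(x - med))
    # 0.6745 rescales the MAD to be comparable to a standard deviation
    # for normally distributed data
    robust_z = 0.6745 * (x - med) / max(mad, 1e-12)
    return np.abs(robust_z) > n_sigmas
```

A filter like this is robust to the outliers themselves, but, as noted above, it can also discard genuine short-lived pollution spikes within a measurement hour.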
In conclusion, our results underline the potential of machine learning algorithms for the calibration of co-located low-cost NO2 and PM10 sensors. At the same time, we highlight several significant challenges that will always have to be considered in similar calibration processes. For example, this includes the need for a well-adjusted calibration dataset to avoid calibration failure if the algorithm needs to extrapolate to higher pollution values, and the role of individual choices relating to the combination of calibration variables and calibration algorithms, e.g. concerning the curse of dimensionality, predictor transformations and linearity in the predictor-predictand relationships. Recent studies indicate that the issues related to extrapolation can be mitigated through the application of hybrid models, in which non-linear machine learning models are used within the training domain and a simpler linear regression approach otherwise (Hagan et al., 2018; Malings et al., 2019). We note that Ridge regression in particular could be a good compromise, as it does not require a somewhat arbitrary hybrid-model definition. Having said that, we also found that even high-dimensional linear methods have ultimately limited extrapolation skill (Figure 6), so that the consideration of the training data pollution range remains of fundamental importance. We hope that such insights will contribute to ever less expensive and more spatially dense measurements of air pollution in the future, and that our work will motivate additional measurement campaigns, testing of other calibration algorithms, and further low-cost sensor development.
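The hybrid idea discussed by Hagan et al. (2018) and Malings et al. (2019) can be sketched in a few lines; this is our own simplified illustration (class name, primary-channel switch and model choices are assumptions, not their published implementations):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge

class HybridCalibrator:
    """Use a non-linear model inside the training range of a primary sensor
    channel, and fall back to a regularized linear model outside it."""

    def __init__(self, primary_col=0):
        self.primary_col = primary_col
        self.nonlinear = RandomForestRegressor(n_estimators=100, random_state=0)
        self.linear = Ridge(alpha=1.0)

    def fit(self, X, y):
        col = X[:, self.primary_col]
        self.lo_, self.hi_ = col.min(), col.max()  # training-domain bounds
        self.nonlinear.fit(X, y)
        self.linear.fit(X, y)
        return self

    def predict(self, X):
        inside = (X[:, self.primary_col] >= self.lo_) & (X[:, self.primary_col] <= self.hi_)
        pred = self.linear.predict(X)          # linear extrapolation by default
        if inside.any():
            pred[inside] = self.nonlinear.predict(X[inside])  # non-linear within domain
        return pred

# Toy demonstration: linear truth, training range capped at 40
rng = np.random.default_rng(0)
X = rng.uniform(0, 40, size=(300, 1))
y = 1.5 * X[:, 0] + rng.normal(0, 1, size=300)
hybrid = HybridCalibrator().fit(X, y)
```

The switch between models at a sharp domain boundary is exactly the "somewhat arbitrary hybrid-model definition" noted above; using Ridge throughout avoids that discontinuity at the cost of missing non-linear structure within the training domain.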