Articles | Volume 15, issue 9
Research article
13 May 2022

Long-term behavior and stability of calibration models for NO and NO2 low-cost sensors

Horim Kim, Michael Müller, Stephan Henne, and Christoph Hüglin

Low-cost sensors are considered to exhibit great potential to complement classical air quality measurements in existing monitoring networks. However, the use of low-cost sensors poses some challenges. In this study, the behavior and performance of electrochemical sensors for NO and NO2 were determined over a longer operating period in a real-world deployment. After careful calibration of the sensors, based on co-location with reference instruments at a rural traffic site during 6 months and by using robust linear regression and random forest regression, the coefficient of determination of both types of sensors was high (R2> 0.9), and the root mean square error (RMSE) of NO and NO2 sensors was about 6.8 and 3.5 ppb, respectively, for 10 min mean concentrations. The RMSE of the NO2 sensors, however, more than doubled when the sensors were deployed without recalibration for a 1-year period at other site types (including urban background locations), where the range and the variability of air pollutant concentrations differed from the calibration site. This indicates a significant effect of relocation of the sensors on the quality of their data. During deployment, we found that the NO2 sensors are capable of distinguishing general pollution levels, but they proved unsuitable for accurate measurements, mainly due to significant biases. In order to investigate the long-term stability of the original calibration, the sensors were reinstalled at the calibration site after deployment. Surprisingly, the coefficient of determination and the RMSE of the NO sensor remained almost unchanged after more than 1 year of operation. In contrast, the performance of the NO2 sensors clearly deteriorated as indicated by a higher RMSE (about 7.5 ppb, 10 min mean concentrations) and a lower coefficient of determination (R2= 0.59).

1 Introduction

Severe negative impacts of urban air pollution on human health are still a major concern. Today, millions of city dwellers suffer from exposure to increased levels of air pollutants (Mage et al., 1996; Pascal et al., 2013; WHO, 2016). Nonetheless, existing air quality monitoring approaches are not always sufficient for a detailed understanding of urban air quality and human exposure, as the large spatial and temporal variability of air pollutants challenges any monitoring (Marshall et al., 2008; Tan et al., 2014). Conventional instruments for air quality monitoring provide precise information on pollutant concentrations, but the approach is based on a limited number of point measurements. Their high acquisition and operational costs, as well as the requirement of specific expertise in operation of the instruments, are, however, the main obstacles to using them in larger numbers to achieve a denser spatial coverage of air pollutant observations in a city (Snyder et al., 2013; Kumar et al., 2015). Therefore, new techniques and strategies for measuring urban air quality with higher spatial and temporal resolution are highly desirable. Low-cost sensors (LCSs) have lately attracted a lot of attention as they have the potential to fill this gap. They are cost-effective, can be employed in large numbers, are very simple to use, and, in principle, require little maintenance (Karagulian et al., 2019).

LCS systems have already been used in various studies, and their potential for air quality monitoring has been demonstrated (e.g., Jiao et al., 2016; Cross et al., 2017; Hagan et al., 2018; Malings et al., 2019). However, the data quality that can be achieved with LCSs is often the main issue and a limiting factor. Multiple studies emphasize that high reliability of the sensors and appropriate calibration strategies are prerequisites for meaningful applications. Factors that can have a large influence on the data quality of LCSs for atmospheric measurements are the interference from ambient temperature and relative humidity (Bigi et al., 2018; Zimmerman et al., 2018) and insufficient sensitivity and selectivity. Low-cost sensors for reactive atmospheric gases have been shown to be cross-sensitive to other gases; for example, electrochemical sensors for nitrogen dioxide (NO2) were found to be cross-sensitive to ozone (Mead et al., 2013; Mueller et al., 2017).

Low-cost sensors need to be calibrated before they are used for atmospheric composition measurements (Peltier et al., 2021). Laboratory calibration against reference material often has major drawbacks, and calibration of LCSs is therefore predominantly done based on co-location with reference instruments operated in traditional air quality monitoring stations. Thus, the sensor output and other relevant environmental variables (e.g., temperature and relative humidity) are related to the true concentration values as represented by the reference measurement in parametric (e.g., Jiao et al., 2016; Kim et al., 2018; Malings et al., 2019) and nonparametric regression models (e.g., Cross et al., 2017; Hagan et al., 2018) or by using machine learning techniques (Bigi et al., 2018; Smith et al., 2019). The obtained (mathematical) relationship forms a calibration model that can be used for converting the raw sensor data into a concentration of the air pollutant to be measured. Such a sensor calibration approach generally works well, although there are some challenging aspects that need to be taken into account. Firstly, for achieving a robust calibration model, it is important that the co-located measurements cover all environmental conditions and the full concentration range of the measured pollutant that the sensor will experience in a subsequent deployment (Peltier et al., 2021). This requirement can often only be fulfilled through selection of a well-suited reference station and a rather long duration of co-location measurements (Hagler et al., 2018), which may be in conflict with the rather short lifetime of air quality sensors. Secondly, it is currently unclear how long a sensor calibration model derived from co-location measurements can be applied and how often recalibration of sensors needs to be done. Finally, the limited transferability of calibration models to new locations can be an important issue. This means that relocation of calibrated sensors may lead to data quality of the sensors that differs from that during co-location (Bigi et al., 2018).

In the present study, the above-mentioned challenges related to the calibration of sensors by co-location measurements are investigated. Four low-cost sensor systems for measuring ambient NO and NO2 were co-located with reference instruments for 6 months at a rural measurement location next to a highway (Haerkingen site) with widely varying NO and NO2 levels. After co-location, the LCSs were deployed for 1 year at four locations in the city of Zurich (Switzerland), co-located with NO2 diffusion tube samplers for NO2 sensor performance assessment. After deployment, the sensor units were brought back to the original co-location site (Haerkingen), where the LCSs again measured beside reference instruments for another 4 months in order to evaluate the long-term stability of the employed calibration models.

2 Materials and methods

2.1 Sensor unit

The four sensor units (denoted as AC009, AC010, AC011, and AC012) utilized in this study were jointly developed by Empa and Decentlab GmbH. They were already utilized and described in detail in a previous study (Bigi et al., 2018). Each sensor unit includes two electrochemical sensors each for NO (Alphasense NO-B4) and NO2 (Alphasense NO2-B43F) and a combined relative humidity and temperature sensor (Sensirion SHT21) (Bigi et al., 2018). It should be pointed out here that the NO2 sensors used have an O3 scrubber membrane mounted on top of the inlet to prevent interference from ambient O3. The O3 scrubber has been reported to have a capacity of 250 ppm h−1 of O3 (Li et al., 2021) and thus a limited lifetime. The four electrochemical sensors in each sensor unit are denoted as NO_A, NO_B, NO2_A, and NO2_B in this study. All sensors recorded the measured data as 10 min mean values, and the data were transmitted and saved with the corresponding timestamp in a database operated by Decentlab GmbH.

2.2 Co-location and deployment sites

The four sensor units were deployed next to reference instruments during two co-location campaigns. The first co-location campaign had a duration of 6 months (29 June–12 December 2018; for AC009, the start date was 25 July 2018) and was done for sensor calibration and evaluation of sensor performance. The second 4-month-long (12 December 2019–31 March 2020) co-location campaign was done about 1.5 years after the first co-location campaign with the aim of assessing the long-term stability of calibrated sensors and re-evaluation of sensor performance after an extended time period of operation. During the time between the two co-location campaigns, the sensors were deployed in a small sensor network in Zurich.

2.2.1 Co-location site

The co-location measurements were done at the Haerkingen air quality monitoring site. The Haerkingen site (HAE: 47.31° N, 7.82° E; 480 m a.s.l.) is part of the Swiss National Air Pollution Monitoring Network NABEL and situated 20 m north of a major highway (A1, 90 000 vehicles per day) in an open and rural environment. Thus, the concentrations of traffic-related air pollutants like NO and NO2 strongly depend on wind direction and traffic activity or daytime and span a wide concentration range (Bigi et al., 2018). Reference NO and NO2 concentrations at HAE were measured as 10 min mean values using a chemiluminescence instrument (T200, Teledyne Technologies Inc.); measurements of other air pollutants and meteorological variables are also available. As mentioned above, data from the first co-location campaign were used for finding the best calibration models and for evaluation of the performance of recently calibrated sensors. The data from the second campaign were used for evaluation of the long-term stability of the sensor calibration (i.e., the applicability of the calibration models determined during the first co-location campaign) and determination of changes of the performance of the sensors after deployment over an extended time period without recalibration.

2.2.2 Deployment sites

In between the two co-location campaigns, the sensor units were deployed at four different locations in the city of Zurich for measurement of the NO and NO2 concentrations during 11 months from 13 December 2018 to 31 October 2019. However, the data acquisition of AC011 was paused from 9 August 2019 and that of AC010 from 27 September 2019 due to insects interrupting the air flow into the sensor units. Geographical information including the site labels is presented in Table 1. An objective of this deployment was to analyze the sensor performance at various locations in the city with different ranges of air pollutant concentrations. The sites ZSBS and ZMAN are urban traffic sites next to major roads in the city of Zurich. The total traffic volume on the roads near ZSBS and ZMAN is 20 000 and 50 000 vehicles per day, respectively. Consequently, high NO and NO2 concentrations can be expected at these two sites. Conversely, ZBLG is located in an urban green area surrounded by residential buildings, and ZRIS is in a rural area on the outskirts of the city. Hence, NO and NO2 concentrations were expected to be lower at these two sites than at the two traffic sites.

NO2 passive diffusion samplers (Palmes et al., 1976) were located at the four deployment sites close to the sensor units. These four samplers are part of the NO2 passive diffusion sampler network operated by the department of Environment and Health Protection of the City of Zurich (UGZ). For comparison with the sensor data, integrated values of NO2 concentrations were available biweekly from 4 December 2018 to 5 November 2019. In total, 24 concentration pairs were considered at each site. Even though the passive samplers provide data with a coarse temporal resolution (biweekly averages), the observations are known for their accuracy.

Table 1. Four deployment sites of low-cost sensors. The coordinate system refers to the world geodetic system, WGS84, obtained from the geo-mapping platform of the Swiss Confederation. © Swisstopo


2.3 Sensor calibration

Two calibration methods were utilized and evaluated in this study: robust linear regression (Huber, 2004) and random forest regression (Breiman, 2001). For finding a suitable calibration model for the individual NO and NO2 sensors in a sensor unit, the concentrations measured by the reference instrument have been used as target variables; the voltage of the corresponding sensor (V_Sensor), the temperature (V_Temperature), and the relative humidity (V_RelativeHumidity) as provided by the sensor unit have been used as predictors. It is well known that low-cost sensors for measurement of atmospheric trace gases can be influenced by external factors like temperature and relative humidity (Peltier et al., 2021). In an earlier study by Mueller et al. (2017), it was observed that for the NO2 sensors the amplitude of the sensor response caused by varying relative humidity is of similar magnitude to the sensor response caused by typical ambient levels of NO2. In addition, it was found in Mueller et al. (2017) that the NO2 sensors showed a delayed and exponentially decaying response upon changes in relative humidity. Therefore, an additional variable, D_RH, was introduced for compensation of the effect of changing relative humidity on the raw sensor signal.

$$ D_{\mathrm{RH}} = \sum_{\Delta t = 0}^{-500} \Delta S_{\mathrm{RH}}(t + \Delta t)\, \exp\!\left(\frac{\Delta t}{\Delta t_0}\right) \qquad (1) $$

ΔS_RH represents the change in relative humidity (in %), Δt is the corresponding time lag in minutes, and Δt_0 is a time constant. Changes in relative humidity up to 500 min back in time are considered and weighted using the exponential term exp(Δt/Δt_0). Similar to Mueller et al. (2017), various values for Δt_0 were examined in this study (60, 90, 120, and 150 min) for finding the value that leads to the best-performing sensor calibration models.
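As a minimal sketch of how Eq. (1) could be evaluated on 10 min data, the Python function below sums exponentially weighted humidity changes over the preceding 500 min. The function name, the argument names, and the choice to take the humidity change as the difference between consecutive 10 min values are assumptions made for illustration; the original processing code is not part of this text.

```python
import math

def humidity_history_term(rh, step_min=10, window_min=500, tau_min=150.0):
    """Return D_RH for each sample of an evenly spaced RH series (in %).

    Assumption (not from the paper): the RH change at a time step is the
    difference to the previous sample; the first sample has change 0.
    """
    n_lags = window_min // step_min  # 50 lags for 10 min data
    rh_change = [0.0] + [rh[i] - rh[i - 1] for i in range(1, len(rh))]
    result = []
    for i in range(len(rh)):
        total = 0.0
        for k in range(n_lags + 1):
            j = i - k
            if j < 0:
                break  # no data further back in time
            lag_min = -k * step_min  # time lag is zero or negative
            total += rh_change[j] * math.exp(lag_min / tau_min)
        result.append(total)
    return result
```

A single step change in relative humidity therefore produces a contribution that decays with the time constant `tau_min`, which is the delayed, exponentially decaying behavior the variable is meant to capture.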

2.3.1 Robust linear regression

Robust multiple linear regression is a commonly used method in the field of calibration. The effectiveness of the methodology for calibration of air quality low-cost sensors has already been shown in several studies (e.g., Spinelle et al., 2015; Hagan et al., 2018). Robust regression is a technique that reduces the model distortion and bias induced by unusual observations and outliers (Andersen, 2008) by limiting their impact. The rlm() function from the R package MASS (Venables and Ripley, 2002, Ver. 7.3-54) was used for robust regression modeling.
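The idea of limiting the impact of outliers can be illustrated with iteratively reweighted least squares and Huber weights. The Python sketch below is not the rlm() implementation used in the study: it handles a single predictor only, and the tuning constant 1.345, the scale estimate based on the median absolute residual, and all function names are assumptions chosen for illustration.

```python
import statistics

def weighted_line_fit(x, y, w):
    """Weighted least-squares fit of y = intercept + slope * x."""
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

def huber_line_fit(x, y, k=1.345, n_iter=50):
    """Robust line fit: residuals beyond k * scale get weights below 1."""
    w = [1.0] * len(x)
    intercept, slope = weighted_line_fit(x, y, w)
    for _ in range(n_iter):
        resid = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
        # Robust scale from the median absolute residual (assumed choice).
        scale = statistics.median([abs(r) for r in resid]) / 0.6745
        if scale == 0.0:
            break  # at least half of the points are fitted exactly
        w = [1.0 if abs(r) <= k * scale else k * scale / abs(r) for r in resid]
        intercept, slope = weighted_line_fit(x, y, w)
    return intercept, slope

# Synthetic illustration (not measured data): a line with one gross outlier.
xs = [float(i) for i in range(10)]
ys = [2.0 + 3.0 * xi for xi in xs]
ys[9] += 100.0
ols_fit = weighted_line_fit(xs, ys, [1.0] * len(xs))
robust_fit = huber_line_fit(xs, ys)
```

In this synthetic case the ordinary least-squares slope is pulled strongly towards the outlier, whereas the reweighted fit stays close to the slope of the undisturbed points.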

Figure 1. A scheme of the 5-fold cross validation utilized for the calibration model evaluation.


2.3.2 Random forest regression

Random forest regression is an ensemble learning method in which a large number of decision trees are grown, each on a bootstrap sample of the training data. Each node in a decision tree is split by using the best option among a subset of predictors that are randomly chosen at that node (Breiman, 2001). For regression, the predictions of the individual trees are then averaged to obtain the model output. Random forest models showed great performance in previous studies; however, it was also shown that overfitting of the model may occur during calibration (Zimmerman et al., 2018). Moreover, the method cannot adequately predict values that are beyond the range of the training data set (Malings et al., 2019). In the present work, the randomForest() function in the R package randomForest (Liaw and Wiener, 2002) was employed for this modeling task. For the random forest models the number of decision trees was set to 1000, and the chosen minimum number of observations in a terminal node was 100.
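The inability to predict beyond the training range follows from how a regression forest forms its output in the standard formulation (Breiman, 2001). As a short sketch, with notation introduced here only for illustration (B is the number of trees, and \bar{y}_b(x) is the mean of the training targets y_i in the terminal node of tree b into which the input x falls):

```latex
\hat{y}(x) = \frac{1}{B} \sum_{b=1}^{B} \bar{y}_b(x),
\qquad
\min_i y_i \;\le\; \hat{y}(x) \;\le\; \max_i y_i .
```

Because every terminal-node mean is an average of training targets, the averaged forest output cannot leave the range of reference concentrations seen during training.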

2.4 Evaluation of sensor performance

2.4.1 Model selection

Owing to the numerous possible combinations of predictor variables including their interactions in the above-mentioned modeling approaches, a selection of 22 robust linear regression models and 9 random forest models was evaluated. The variables in each model are introduced in Tables S1 and S2 in the Supplement. Among the total of 31 calibration models, two models (one robust linear regression model and one random forest model) were identified as the best-performing models and selected for further investigation in the study. For the evaluation, 80 % of the sensor data from the first co-location campaign were randomly chosen and used for model training. The other 20 % were used for model testing. For every model, only a single validation with one training and one testing data set was performed because of computational limitations. The selection was based on the normalized root mean square error (nRMSE) calculated for each model, and the models with the lowest nRMSE for both NO and NO2 were chosen.

2.4.2 K-fold cross validation

Unlike the single validation process in the model selection, the chosen calibration models were then evaluated by k-fold cross-validation. The method is widely used to estimate the prediction error and the model accuracy. In the present study, the number of folds (k) is 5, following the literature recommendation (Rodriguez et al., 2010). For the validation, low-cost sensor data were randomly split into five different sub-groups (see Fig. 1). In each fold, four sub-groups of data (80 % of total) were used as training data, and the remaining sub-group was used for testing the model. As a result of the 5-fold cross-validation, predictions of the five test data sets were obtained and combined into a single data set that was evaluated by comparison with the corresponding reference NO and NO2 concentrations.
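The procedure described above can be sketched in a few lines of Python. The function names and the `fit` and `predict` callables are hypothetical placeholders for any calibration model; the random seed is likewise an assumption added for reproducibility of the illustration.

```python
import random

def k_fold_indices(n, k=5, seed=0):
    """Randomly split the indices 0..n-1 into k sub-groups of similar size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validated_predictions(x, y, fit, predict, k=5, seed=0):
    """Train on k-1 sub-groups, predict the held-out one, pool all predictions."""
    pooled = [None] * len(x)
    for test_idx in k_fold_indices(len(x), k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(x)) if i not in held_out]
        model = fit([x[i] for i in train_idx], [y[i] for i in train_idx])
        for i in test_idx:
            pooled[i] = predict(model, x[i])
    return pooled
```

Every observation is predicted exactly once by a model that was not trained on it, so the pooled predictions can be compared with the reference values as a single data set.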

2.4.3 Evaluation approach

For the evaluation of the sensor calibration performance, several statistical metrics (see Table 2) in combination with target diagrams and Taylor diagrams have been considered. The structure of target diagrams has been detailed in studies by Bigi et al. (2018) and Zimmerman et al. (2018). The statistical measures were calculated using the tdStats() function in the R package tdr (Lamigueiro, 2018).

Figure 2. A schematic of the structure of the target diagram. The red point indicates the position of an exemplary point in the diagram.


Table 2. Statistical metrics visualized in target diagrams. In the terms in the third column, ŷ_i represents a concentration measurement with a calibrated low-cost sensor, and y_i represents the corresponding concentration measurement with the reference instrument; σ_ŷ and σ_y are the empirical standard deviations of sensor and reference data.


A schematic of a target diagram is presented in Fig. 2, and its related statistical metrics are introduced in Table 2. The statistical metrics represented in target diagrams are the root mean square error (RMSE) and the mean bias error (MBE). The RMSE is a nonnegative measure for the error of the model predictions (or here the sensor measurements) and is defined as the square root of the mean squared difference between the model predictions and the reference values. Similarly, the MBE is the mean difference between predicted and reference values, representing the mean bias of the model predictions. With the help of the MBE, the bias part of the RMSE can be removed, leading to a metric denoted as the centered root mean square error (CRMSE). For the target diagram, both CRMSE and MBE are normalized by the standard deviation of the reference values (σ_y) and plotted on the x axis and the y axis, respectively. Note that for reference sites where the range of prevailing concentrations is large, σ_y is also large, and consequently normalized MBE and nRMSE tend to be smaller at such sites. Also note that CRMSE itself is nonnegative; the sign of CRMSE/σ_y in target diagrams is, however, determined by the sign of (σ_ŷ − σ_y). The interesting feature of target diagrams is that a multitude of information about the model behavior can easily be captured (Zimmerman et al., 2018). (i) The distance between a point and the origin represents the normalized RMSE (nRMSE, RMSE/σ_y). (ii) MBE/σ_y > 0 and MBE/σ_y < 0 indicate model predictions that systematically overestimate or underestimate the reference, respectively. (iii) The standard deviation of the model prediction is larger (CRMSE/σ_y > 0) or smaller (CRMSE/σ_y < 0) than that of the reference data. (iv) The standard deviation of the model residuals is larger (outside of the circle of radius 1) or smaller (inside of the circle of radius 1) than that of the reference measurements.
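The metrics behind a target diagram can be sketched as follows in Python. The function name and dictionary keys are illustrative, and the use of population standard deviations (division by n) is an assumption made so that RMSE² = MBE² + CRMSE² holds exactly; it is not taken from the tdStats() implementation used in the study.

```python
import math

def target_diagram_metrics(pred, ref):
    """Return MBE, RMSE, CRMSE, nRMSE and the target-diagram coordinates."""
    n = len(ref)
    mean_p = sum(pred) / n
    mean_r = sum(ref) / n
    mbe = mean_p - mean_r
    rmse = math.sqrt(sum((p - r) ** 2 for p, r in zip(pred, ref)) / n)
    # Centered RMSE: RMSE after removing the mean of each series.
    crmse = math.sqrt(
        sum(((p - mean_p) - (r - mean_r)) ** 2 for p, r in zip(pred, ref)) / n
    )
    sd_p = math.sqrt(sum((p - mean_p) ** 2 for p in pred) / n)
    sd_r = math.sqrt(sum((r - mean_r) ** 2 for r in ref) / n)
    # By convention the x coordinate carries the sign of (sd_p - sd_r).
    sign = 1.0 if sd_p >= sd_r else -1.0
    return {
        "mbe": mbe,
        "rmse": rmse,
        "crmse": crmse,
        "nrmse": rmse / sd_r,
        "x": sign * crmse / sd_r,  # signed CRMSE / sigma_y
        "y": mbe / sd_r,           # MBE / sigma_y
    }
```

With these definitions the distance of the point (x, y) from the origin equals the nRMSE, which is feature (i) of the target diagram.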

In addition, Taylor diagrams are presented to visualize three statistical metrics: (1) Pearson's correlation coefficients (r) are presented as the azimuthal angles from the y axis. (2) Normalized standard deviations (σ_ŷ/σ_y) of models are given as the distance from the origin. (3) CRMSE (see Table 2) is proportional to the distance between the data points and the reference point on the x axis (Taylor, 2001). The diagrams thus display metrics that are not shown in target diagrams. Depending on the range of r, the shape of a diagram is either a semicircle (−1 ≤ r ≤ 1) or a quarter circle (0 ≤ r ≤ 1).
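The three quantities shown in a Taylor diagram are linked by the law of cosines (Taylor, 2001), which is why they can be drawn in a single polar plot. Written in the normalized form used here, and assuming that the standard deviations and the CRMSE are computed with the same normalization:

```latex
\left(\frac{\mathrm{CRMSE}}{\sigma_y}\right)^{2}
  = 1 + \left(\frac{\sigma_{\hat{y}}}{\sigma_y}\right)^{2}
  - 2\,\frac{\sigma_{\hat{y}}}{\sigma_y}\, r
```

The left-hand side is the squared distance between a data point at radius σ_ŷ/σ_y and the reference point at radius 1, with the correlation coefficient r fixing the angle between them.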

2.4.4 Sensor data filtering

The sensor systems had some obvious malfunctioning periods. The data acquired during such periods were excluded from the sensor performance evaluation. A malfunction period in this study was defined as a period when the raw signal of an electrochemical sensor recorded only stable, non-fluctuating voltages; such periods were detected only during the second co-location period. In contrast to the sensors, the corresponding measurements from the reference instrument during the identified malfunctioning periods showed temporally varying pollutant concentrations. Figure S15 illustrates an example of such a sensor malfunctioning period. The reasons for the sensor malfunctioning could, however, not be identified. The exact time periods of malfunctioning sensors and a detailed description of how erroneous sensor data have been identified and eliminated are given in the Supplement (Sect. S3.3.1).

Figure 3. Box plots of nRMSE by Δt_0 of D_RH. Five models (four RLM and one RF model) were considered for each Δt_0; hence 40 data points are contained in each box plot.


3 Results and discussion

3.1 Sensor calibration

The individual sensors for NO and NO2 in the four sensor units were calibrated using the measurements of the first co-location campaign at the air quality monitoring site in Haerkingen. First, the calibration models with the most accurate prediction of NO and NO2 were selected, and the sensor performance was assessed. Second, the selected models were applied for the determination of the data quality of the sensors during the urban deployment and for reassessment of the sensor performance during the second co-location campaign. In the present study, the air pollutant measurements are presented in parts per billion (ppb) for the reason of practicality and readability, although the unit of nanomoles per mole (nmol mol−1) is a more accurate representation for the mole fractions of chemical species. For the same reason, the term “concentration” is also used when discussing mole fractions of measured air pollutants.

3.1.1 Selection of calibration models

A total of 22 robust linear regression (RLM) models and 9 random forest (RF) models (see Tables S1 and S2) were evaluated, and the best RLM and RF models (in terms of normalized root mean square error) were selected for further investigation and sensor performance analysis. Figure 3 demonstrates that for calibration models that include D_RH as a predictor, the selection of Δt_0 = 150 min leads to the lowest mean nRMSE for both NO and NO2. For the next step, models with D_RH (Δt_0 = 150 min) and all other models without D_RH as predictor were compared with respect to nRMSE. Figure 4 depicts the nRMSE of each model. It is evident that the models that do not include the temperature signal (T) as predictor variable showed higher nRMSE than the models including T. This is mainly due to the temperature dependence of the electrochemical sensors and the wide range of temperatures encountered during the first co-location period (−4 to 35 °C). In addition, including D_RH in the model decreases the nRMSE of the NO2 sensors more than including relative humidity (RH), while there is only a negligible difference for NO (see nRMSE between RLM_5 and 9 and between RF_6 and 8). The effect of relative humidity on the gas sensor was observed in previous research (Mueller et al., 2017). However, our current results emphasize that the temporal variability or history of relative humidity is more influential than the present relative humidity itself. Overall, the RLM_22 and RF_6 models were chosen for further investigation in this study as they resulted in the smallest nRMSE in each class of models. For NO2 the two selected models clearly outperformed all other models, whereas for NO other models showed similarly good performance (Fig. 4). The predictor variables for the two models are presented in Table 3.

Figure 4. nRMSE values for each model. Eight data points from individual sensors of NO and NO2 are shown separately, and mean nRMSE values by pollutant are depicted by larger points.


Table 3. Model variables considered in the selected RLM and RF calibration models. The two models were denoted in the study as the “RLM” and “RF” models.


3.1.2 Calibration evaluation

The calibrated sensor data from the first co-location campaign were analyzed, and the different statistical metrics were calculated. Table 4 provides the statistical metrics for sensor performance evaluation calculated as the mean of all sensors in the four sensor units and for concentration measurements with 10 min time resolution. The considered statistical metrics and their graphical illustration in target and Taylor diagrams as shown in Fig. 5 demonstrate that no significant difference between the performance of sensors calibrated using RLM or RF models can be found. The considered statistical metrics were also calculated for different concentration ranges set at the rounded 25 %, 50 %, and 75 % quantiles of the reference concentration measured at the Haerkingen site. The intention of this division into sub-groups was to investigate the sensor performance for different air pollutant levels. The performance evaluation for such different sub-groups was motivated by the fact that for sensor calibration, a co-location site that covers the full range of the target pollutant concentration and of all other influential environmental variables is needed. Indeed, as shown in Fig. 6a, the sensors were exposed to a wide concentration range at the Haerkingen station, from 0 ppb up to 232 ppb for NO and from 0.1 ppb up to 67 ppb for NO2. However, for deployment in a sensor network, the sensors may be used at locations representing different site types and narrower concentration ranges, including background environments.

The division demonstrated two important features that were not revealed when calculating the metrics from the full data set. First, at low concentrations (up to the 25 % quantile), nRMSE for the NO sensors was much higher (RLM: 11.23, RF: 8.35) than for the full concentration range (RLM: 0.30, RF: 0.26; see Table 4). The same behavior was observed for the NO2 sensors, where nRMSE for the RLM was 1.80 (RF: 1.39) at low concentrations and 0.43 (RF: 0.30) for the full concentration range. This demonstrates a clear limitation of the LCSs utilized in this study; even though the sensors were frequently exposed to low target pollutant concentrations, the physical properties of the sensor (e.g., high sensor noise) fundamentally limit the measurement accuracy in this concentration range. Second, the mean bias error in each of the considered concentration ranges reveals that the overall MBEs, which were close to zero, are actually the result of compensation between overestimation at low NO and NO2 concentrations and underestimation at high concentrations. To be specific, the MBE values at low concentrations (up to the 25 % quantile) were 2–4 ppb, whereas those in the highest concentration range were similar in magnitude but negative. The calibration models were not able to remove the bias completely, and the residual plots in Figs. S9–S12 did not clearly reveal this behavior and did not clearly indicate any model deficiency. Instead, the opposite sign of the significant sensor bias at low and high concentrations is masked when the overall MBE is considered, bearing the risk of misinterpretation of the sensor performance when sensors are deployed in a network and predominantly operate in a narrower concentration range. Figure 5a and c also illustrate that MBE/σ_y is near zero, but Fig. 6b and d show that the MBE should be interpreted more carefully.
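The masking effect described above can be made concrete with a short Python sketch that computes the MBE separately for concentration ranges defined by thresholds on the reference values. The function name is illustrative, and the numbers below are synthetic values chosen only to show the compensation; they are not measurements from this study.

```python
def mean_bias_by_range(pred, ref, edges):
    """MBE per reference-concentration range; edges are ascending thresholds."""
    groups = [[] for _ in range(len(edges) + 1)]
    for p, r in zip(pred, ref):
        # A value equal to a threshold is assigned to the higher range.
        group = sum(r >= e for e in edges)
        groups[group].append(p - r)
    return [sum(g) / len(g) if g else None for g in groups]

# Synthetic illustration: positive bias at low, negative bias at high levels.
ref = [1.0, 2.0, 10.0, 11.0, 30.0, 31.0, 60.0, 62.0]
offsets = [3.0, 3.0, 1.0, 1.0, -1.0, -1.0, -3.0, -3.0]
pred = [r + o for r, o in zip(ref, offsets)]

overall_mbe = sum(p - r for p, r in zip(pred, ref)) / len(ref)
per_range_mbe = mean_bias_by_range(pred, ref, edges=[5.0, 20.0, 40.0])
```

In this constructed example the overall MBE is zero although every individual range is biased, which is the situation that an evaluation on the full data set alone would hide.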

Furthermore, Fig. 5a and c indicate that all the CRMSE/σ_y values were negative in the first co-location period, which means that the standard deviation of the model prediction is smaller than that of the reference data. The low standard deviation in the predicted concentration is not surprising because the prediction could not completely reproduce the extreme concentration values in the reference data. The same feature was identified in σ_ŷ/σ_y of the highest concentration range (NO ≥ 28 ppb, NO2 ≥ 26 ppb), where the values are < 1 in both RLM and RF models (illustrated in Fig. 6c and e). This implies that the predictions have small dynamics in this range, meaning that the extreme concentrations are poorly captured by the sensors. Moreover, the scatter plots in Figs. S5–S8 show that RF models exhibit an upper limit in their predictions. The models cannot predict concentrations of NO above ∼130 ppb and of NO2 above ∼40 ppb, and these limits lead to lower standard deviations of the RF model compared to RLM, as shown in Fig. 5b and d. The main reason for this flaw is that RF models can only adequately predict within the concentration range that is covered by the training data. This deficiency has already been reported previously (Bigi et al., 2018; Zimmerman et al., 2018).

Figure 5. Target diagrams (a and c) and Taylor diagrams (b and d) of each pollutant during the calibration evaluation period (first) and the sensor relocation period after the deployment (second). In the target diagram, the centered root mean square error (CRMSE) normalized by the standard deviation of reference data (σ_y) is stated on the x axis, whereas the normalized mean bias error (MBE/σ_y) is stated on the y axis. The distance from the origin indicates the nRMSE. Note that by convention the sign of CRMSE/σ_y in target diagrams is determined by the sign of (σ_ŷ − σ_y). A Taylor diagram illustrates three statistical metrics (Taylor, 2001): (1) Pearson's correlation coefficients (r) are presented as the azimuthal angles from the y axis. (2) Normalized standard deviations (σ_ŷ/σ_y) of models are given as the distance from the origin. (3) CRMSE is proportional to the distance between the data points and the reference point (Ref) on the x axis. The reference point (Ref) is the point where an ideal sensor (r = 1, σ_ŷ/σ_y = 1) is located in Taylor diagrams.


Table 4. Statistical metrics calculated for 10 min concentration values from all NO and NO2 sensors in the period of calibration evaluation (first) and of the assessment after deployment (second). The sensor data were divided by rounded values of the 25 %, 50 %, and 75 % quantiles of the reference concentration of Haerkingen station in the first co-location campaign in order to keep a comparable number of data points in each sub-group. R2 denotes the coefficient of determination.


Figure 6. Statistical metrics calculated for each reference concentration sub-group from the first co-location period. The sub-groups were defined by rounded values of the 25 %, 50 %, and 75 % quantiles of the NO and NO2 concentrations during the period. Panel (a) shows the density plots of the pollutant concentrations, panels (b) and (d) show the target diagrams of each concentration range in both co-location periods, and panels (c) and (e) show the Taylor diagrams for the same concentration ranges.
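The sub-grouping by rounded quartiles can be sketched as follows. This is an illustration only: the function names, the choice of the "inclusive" quantile method, and the example values are assumptions, and the original analysis may define the quantiles differently. Values equal to an edge are assigned to the upper group, which matches the "≥" notation used for the highest concentration range.

```python
from bisect import bisect_right
from statistics import quantiles


def quantile_edges(ref):
    """Rounded 25 %, 50 %, and 75 % quantiles of the reference values."""
    return [round(q) for q in quantiles(ref, n=4, method="inclusive")]


def split_by_edges(ref, pred, edges):
    """Group (reference, prediction) pairs by reference concentration range."""
    groups = {i: [] for i in range(len(edges) + 1)}
    for r, p in zip(ref, pred):
        # bisect_right places a value equal to an edge in the upper group.
        groups[bisect_right(edges, r)].append((r, p))
    return groups


# Hypothetical reference concentrations (ppb) and predictions.
ref = list(range(1, 41))
pred = [0.8 * r + 3 for r in ref]

edges = quantile_edges(ref)
groups = split_by_edges(ref, pred, edges)
```

The metrics of each sub-group can then be computed separately from the pairs in `groups[0]` to `groups[3]`. If rounding makes two edges equal, the group between them stays empty.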


3.2 Sensor performance

3.2.1 Deployment period

After the first co-location period, the four sensor units were relocated and deployed at four sites in Zurich. Figure 7 illustrates the comparison between the measurements with the NO2 sensors and the NO2 passive samplers at the four deployment sites. It should be mentioned again that two of the deployment sites, ZRIS and ZBLG, were located in suburban and urban background locations as shown in Fig. S4a and c, with relatively low concentrations of NO2 compared to the two other locations (ZSBS and ZMAN), which were strongly influenced by emissions from nearby road traffic. At ZRIS, where sensor unit AC009 was deployed, the biweekly NO2 concentration measured with the passive samplers was 2–9 ppb, and at ZBLG (AC010) the NO2 concentration ranged between 6 and 22 ppb. At the two urban traffic locations ZSBS (AC011) and ZMAN (AC012), the NO2 concentrations were higher and ranged from 19 to 31 ppb (ZSBS) and from 33 to 48 ppb (ZMAN). In general, AC011 agreed well with the passive sampler data for both calibration models, although the sensor measurements were biased high by 3.5 ppb (RLM) and 2.4 ppb (RF) on average (Figs. 7 and 8). For the suburban and the urban background sites ZRIS and ZBLG, the measurements with the sensor units (AC009 and AC010) showed clear mean biases for both calibration modeling approaches. Finally, at the most polluted site, ZMAN, the RF models substantially underestimated the passive sampler data (Fig. 7), which is also visible in the time series in Fig. S13. The observed underestimation of peak NO2 concentrations by RF models results from the fact that predictions cannot go beyond the concentration range that has been covered by the training data. Compared to RF models, the RLM calibration applied to AC012 resulted in a much smaller bias, although the scatter between sensor and passive sampler data was high, as expressed by a large CRMSE (Fig. 8).
Figure 7 shows that the LCSs used for NO2 enable a distinction between less polluted (ZRIS and ZBLG) and more polluted sites (ZSBS and ZMAN) and therefore a general differentiation of locations with regard to the prevailing NO2 levels. However, the limitations of these LCSs for providing accurate measurements of the biweekly NO2 concentrations are clearly visible. This can also be seen from Fig. 8, where RMSE, CRMSE, and MBE are shown for the co-location and the deployment periods. For comparison, the statistical metrics in Fig. 8 for the two co-location periods were also calculated from biweekly averaged data. Figure 8a illustrates that the RMSEs during the deployment period were much larger than during the first co-location period, which mainly resulted from an increasing bias (MBE) as discussed above but also from an increasing CRMSE. This observation implies an effect of relocation on the performance of the sensors, i.e., a change in the data quality provided by the sensors when they are operated in locations other than the one chosen for co-location measurements and sensor calibration. A likely reason for the observed decrease in data quality during the deployment was the smaller range of NO2 concentrations prevailing at the deployment sites compared to the co-location site, in particular for the suburban and urban background sites ZRIS and ZBLG. As also seen in the co-location period, the applied sensor calibration models tended to overestimate NO2 at low concentrations (Tables 4 and 4). Another reason for the decreasing data quality of the sensors during the deployment might be differing combinations of external influencing factors such as air temperature, relative humidity, and interfering gases during deployment, leading to a reduced applicability of the calibration models and possibly larger measurement errors as expressed by the higher CRMSE.
Figure 8 also implies that for achieving the best possible data quality, implementation of strategies for the detection and the correction of sensor bias during deployment is needed.

Finally, we find that, in agreement with the first co-location period, the RF calibration model generally performed slightly better than the RLM model. This can be seen in the somewhat smaller RMSE of the sensors calibrated using RF models and also the lower CRMSE, in particular for the sensors that were deployed at the more highly polluted sites (AC011 and AC012). However, AC012 showed a highly negative MBE, indicating the earlier mentioned inability to correctly predict air pollutant concentrations for conditions that have not been covered in the data used for training the calibration model. This should be kept in mind when sensors are calibrated using this modeling approach.

Figure 7. NO2 sensor performance during the deployment in Zurich, illustrated by comparing the sensor concentrations to those of the NO2 passive samplers at each site.


Figure 8. RMSE and CRMSE (a) as well as MBE (b) for NO2 during the co-location periods and the deployment period. The metrics for the co-location periods are also calculated from biweekly averaged data in order to facilitate the comparison with sensor performance during deployment.
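How the averaging interval affects these metrics can be illustrated with a short sketch. It uses hypothetical 10 min data (a constant bias of 3 ppb plus an alternating error of ±5 ppb); the function names, dates, and values are assumptions and do not reproduce the data of this study. The sketch aggregates the 10 min values to biweekly means and evaluates RMSE, MBE, and CRMSE on both time scales, using RMSE² = MBE² + CRMSE².

```python
from datetime import datetime, timedelta
from math import sqrt
from statistics import mean


def biweekly_means(times, values, start):
    """Average values in consecutive 14 d bins counted from `start`."""
    bins = {}
    for t, v in zip(times, values):
        bins.setdefault((t - start) // timedelta(days=14), []).append(v)
    return [mean(vs) for _, vs in sorted(bins.items())]


def error_metrics(pred, ref):
    """RMSE, MBE, and CRMSE of paired predictions and reference values."""
    diffs = [p - r for p, r in zip(pred, ref)]
    mbe = mean(diffs)
    crmse = sqrt(mean([(d - mbe) ** 2 for d in diffs]))
    rmse = sqrt(mean([d ** 2 for d in diffs]))
    return {"rmse": rmse, "mbe": mbe, "crmse": crmse}


# Hypothetical 28 d of 10 min data: reference with a daily pattern, sensor
# with a constant bias of 3 ppb and an alternating error of +/-5 ppb.
start = datetime(2019, 1, 1)
times = [start + timedelta(minutes=10 * i) for i in range(28 * 144)]
reference = [20.0 + (i % 144) / 10 for i in range(len(times))]
sensor = [r + 3.0 + (5.0 if i % 2 == 0 else -5.0)
          for i, r in enumerate(reference)]

ten_min = error_metrics(sensor, reference)
biweekly = error_metrics(biweekly_means(times, sensor, start),
                         biweekly_means(times, reference, start))
```

In this constructed case the alternating error averages out in the biweekly means, whereas the bias remains unchanged. This is why metrics for different periods are only comparable when computed for the same averaging interval.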


3.2.2 Assessment of sensor performance after deployment

After the deployment at the four locations of the small sensor network in Zurich, the sensors were installed again at the co-location site Haerkingen for 4 months (12 December 2019–31 March 2020). During the second co-location period we observed some distinct sensor malfunctioning for several short periods. Hence, data filtering was implemented prior to performance analysis. Figure 9 illustrates that the data obtained during the identified malfunctioning periods caused severe underestimation of the true NO and NO2 concentrations and therefore had to be removed. Specific malfunctioning periods for each sensor unit are elaborated in Table S6. An exact cause or single influence factor for sensor malfunctioning could not be identified. However, it is hypothesized that specific weather conditions may be the reason for the sensors not working properly. Meteorological data collected at the co-location site indicate that rain events occurred before and after most of the time periods that had to be filtered. Electrochemical sensors are known for their vulnerability towards humidity. Thus, the penetration of raindrops into the housing of the sensor units may cause significant disturbance of the sensors or other components of the sensor units. Nevertheless, rain events were not the sole factor for the sensor malfunction because rain also occurred during the first co-location campaign without noticeable sensor malfunctioning, and rain events during the second co-location campaign did not always result in erroneous sensor signals. In addition, the possibility of interference by low battery was checked; however, none of these issues were detected from sensor log files. It is therefore speculated that an interaction between rain events and other meteorological factors such as wind speed caused the sensor malfunctioning. 
Because the exact reasons for the erroneous sensor data remain unknown, the applied data filtering was based on visual screening of the sensor signal.

Figure 9. Scatter plots of sensor concentrations calibrated using the RLM model versus reference measurements during the second co-location period for (a) NO and (b) NO2 from AC010 (10 min values). Sensor malfunctioning periods were visually identified and filtered (red dots). Similar scatter plots for the other sensor units can be found in Figs. S16–S19.


For the second co-location period, the air pollutant concentrations reported by the sensors were calculated using the models developed in the first co-location campaign and applied during deployment. Tables 4 and 4 present the statistical metrics for hourly concentration measurements and a comparison with the values from the first co-location period. Surprisingly, the sensors still showed a comparable performance for NO after more than a year of operation. The average RMSE of the 10 min NO sensor measurements slightly decreased from 7.9 to 7.6 ppb for the RLM models and from 6.8 to 6.1 ppb for the RF models. The target and Taylor diagrams in Fig. 5 show no substantial change in the metrics calculated from the NO measurements. In contrast to NO, the performance of the NO2 sensors clearly decreased over this extended period of operation; for example, the average RMSE for 10 min NO2 measurements increased by 3–4 ppb for both calibration models (Table 4 and Fig. 8). The target diagram for NO2 in Fig. 5 indicates that this increase in RMSE is due to both an increase in mean bias and an increase in the random component of the error as expressed by the CRMSE. The latter is also visible in the Taylor diagram for NO2 as a reduced correlation coefficient during the second co-location period. In addition, the target and Taylor diagrams for the different quartiles of observed 10 min mean values (Fig. 6) show that the above-described change in sensor performance over the extended time period is also visible when the different concentration ranges are evaluated separately.

A similar degradation of the performance of the same NO2 sensor type has been reported by Li et al. (2021). In their study, sensor performance degradation was noticeable after 200–400 d of deployment, a time period that was in agreement with the expected lifetime of the O3 scrubber as calculated from its reported capacity and the O3 concentration at the deployment site. It is therefore reasonable to assume that the decrease in NO2 sensor performance observed in this study is also influenced or caused by saturation of the O3 scrubber of the NO2 sensors. At the co-location site Haerkingen and in the urban background of Zurich, annual mean concentrations of O3 are about 21 and 25 ppb, respectively. This means that the expected lifetime of the O3 scrubber is about 13 to 17 months, which is comparable to the situation described by Li et al. (2021).

4 Conclusions

In this study, some of the main difficulties associated with the use of low-cost sensors for measuring air quality were investigated. In particular we analyzed calibration, long-term stability of the sensor output, and the effect of relocation on sensor performance, i.e., the change of sensor behavior when used at a different location than the site used for calibration through co-location with a reference instrument. Although only two specific types of sensors were used in this study, some general conclusions can be drawn. Co-location with reference instruments is a pragmatic and appropriate approach for the calibration of individual low-cost sensors. However, the duration of the co-location measurements should be sufficiently long so that a wide range of environmental conditions which may occur during deployment are covered. In addition, the chosen co-location site should allow the full concentration range expected during deployment to be covered. Otherwise, the calibration model extrapolates to conditions that have not been covered in the data used for training, leading to higher uncertainty and for some approaches (e.g., random forest regression) to significant bias.
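Whether a calibration data set covers the conditions met during deployment can be checked before sensors are relocated. The following minimal sketch reports, for each variable, the fraction of deployment observations that fall outside the range seen during co-location. The function name, the variable names, and all values are hypothetical; a per-variable range check is also only a first screening, since it does not detect unseen combinations of variables.

```python
def coverage_report(calibration, deployment):
    """Fraction of deployment values outside the calibration range, per variable."""
    report = {}
    for name, cal_values in calibration.items():
        lo, hi = min(cal_values), max(cal_values)
        dep_values = deployment[name]
        outside = sum(1 for v in dep_values if v < lo or v > hi)
        report[name] = outside / len(dep_values)
    return report


# Hypothetical observations during co-location and deployment.
calibration = {
    "temperature_c": [-2, 5, 12, 25, 31],
    "no2_ppb": [1, 8, 20, 35, 40],
}
deployment = {
    "temperature_c": [0, 10, 33, 35],
    "no2_ppb": [3, 12, 44, 48],
}

report = coverage_report(calibration, deployment)
```

A large fraction for any variable indicates that the calibration model would have to extrapolate for that share of the deployment data.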

In this study, duration and site of co-location were chosen accordingly. The sensors were calibrated using two widely used statistical approaches, and the corresponding sensor performance was evaluated. During co-location with reference instruments, the sensors showed no overall bias and had a rather small CRMSE, when the full data set was analyzed. However, when the performance metrics were calculated for different concentration ranges (i.e., the quartiles of the observed concentrations), it was observed that the applied calibration models led to sensor measurements that were biased high at low concentrations and biased low at high concentrations. A similar behavior was observed for the NO2 sensors when deployed in a small sensor network in the city of Zurich. In this case, the data quality of the sensors was much lower than expected from their performance during co-location with a reference instrument. For a relatively clean city like Zurich, the achieved data quality was not sufficient for meaningful quantitative measurements of NO2. However, the sensors were capable of distinguishing between locations with lower (0–20 ppb), medium (20–30 ppb), and higher (40–50 ppb) NO2 levels as shown in Fig. 7. An important factor for lower than expected data quality was seen in the fact that sensors were typically deployed in locations where the concentration range of the target air pollutant was considerably smaller than at the co-location site (e.g., at urban background locations). The calibration models derived from co-location with reference instruments might strongly be influenced by measurements at the highest prevailing concentrations and might therefore not be optimal for cleaner locations.

Another important limitation of low-cost air quality sensors may be their lifetime and the frequency of recalibration. For the electrochemical sensors for NO we found no change in response behavior over a time period of more than 18 months, and the data quality was therefore constant over time. In contrast, the electrochemical sensor for NO2 showed decreasing performance over time, and frequent interventions such as recalibration or replacement may be needed for achieving the best possible data quality. After about 18 months of deployment, the electrochemical sensors started to malfunction sporadically and during shorter time periods. Although the exact reasons remained unknown, this behavior might indicate the aging effects of the sensors themselves or of other parts of the sensor unit. The occurrence of these malfunctions with increasing time of use might indicate that the quality control of sensors deployed in networks needs to be strengthened over time.

Code and data availability

The observation data (sensors and reference), the calibration models, and all data analysis codes (all based on the R programming language) are available in an open-access repository (Kim et al., 2022).


The supplement related to this article is available online.

Author contributions

HK and MM developed the raw data acquisition from the Decentlab GmbH and the sensor calibration models. MM established the field measurement of the sensors by organizing the sensor co-location measurements at the Haerkingen site and deployed the sensors in Zurich. HK implemented the preprocessing of the raw data, sensor calibration, and post-processing of the prediction and carried out the validation and data analysis for the study. HK performed the results and wrote the manuscript. CH and SH supervised the whole study and were actively involved in the writing of the manuscript. All authors reviewed and agreed to the published version of the paper.

Competing interests

The contact author has declared that neither they nor their co-authors have any competing interests.


Publisher’s note: Copernicus Publications remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


The support and the data from the Swiss National Air Pollution Monitoring Network NABEL (FOEN/Empa) are gratefully acknowledged.

Review statement

This paper was edited by Albert Presto and reviewed by Laurent Spinelle and one anonymous referee.


Anderson, R.: Modern methods for robust regression, Sage Publications, Inc., 1st Edn., Vol. 152, California, United States, 2008.

Bigi, A., Mueller, M., Grange, S. K., Ghermandi, G., and Hueglin, C.: Performance of NO, NO2 low cost sensors and three calibration approaches within a real world application, Atmos. Meas. Tech., 11, 3717–3735, 2018.

Breiman, L.: Random forests, Mach. Learn., 45, 5–32, 2001.

Cross, E. S., Williams, L. R., Lewis, D. K., Magoon, G. R., Onasch, T. B., Kaminsky, M. L., Worsnop, D. R., and Jayne, J. T.: Use of electrochemical sensors for measurement of air pollution: correcting interference response and validating measurements, Atmos. Meas. Tech., 10, 3575–3588, 2017.

Hagan, D. H., Isaacman-VanWertz, G., Franklin, J. P., Wallace, L. M. M., Kocar, B. D., Heald, C. L., and Kroll, J. H.: Calibration and assessment of electrochemical air quality sensors by co-location with regulatory-grade instruments, Atmos. Meas. Tech., 11, 315–328, 2018.

Hagler, G. S. W., Williams, R., Papapostolou, V., and Polidori, A.: Air Quality Sensors and Data Adjustment Algorithms: When Is It No Longer a Measurement?, Environ. Sci. Technol., 52, 5530–5531, 2018.

Huber, P. J.: Robust statistics, Vol. 523, John Wiley & Sons, New Jersey, United States, 308 pp., 2004.

Jiao, W., Hagler, G., Williams, R., Sharpe, R., Brown, R., Garver, D., Judge, R., Caudill, M., Rickard, J., Davis, M., Weinstock, L., Zimmer-Dauphinee, S., and Buckley, K.: Community Air Sensor Network (CAIRSENSE) project: evaluation of low-cost sensor performance in a suburban environment in the southeastern United States, Atmos. Meas. Tech., 9, 5281–5292, 2016.

Karagulian, F., Barbiere, M., Kotsev, A., Spinelle, L., Gerboles, M., Lagler, F., Redon, N., Crunaire, S., and Borowiak, A.: Review of the Performance of Low-Cost Sensors for Air Quality Monitoring, Atmosphere, 10, 1–7, 2019.

Kim, H., Mueller, M., Henne, S., and Hueglin, C.: Long-term behavior and stability of calibration models for NO and NO2 low cost sensors, Zenodo [data set], 2022.

Kim, J., Shusterman, A. A., Lieschke, K. J., Newman, C., and Cohen, R. C.: The BErkeley Atmospheric CO2 Observation Network: field calibration and evaluation of low-cost air quality sensors, Atmos. Meas. Tech., 11, 1937–1946, 2018.

Kumar, P., Morawska, L., Martani, C., Biskos, G., Neophytou, M., Di Sabatino, S., Bell, M., Norford, L., and Britter, R.: The rise of low-cost sensing for managing air pollution in cities, Environ. Int., 75, 199–205, 2015.

Lamigueiro, O. P.: tdr: Target Diagram, R package version 0.13 (last access: 4 May 2022), 2018.

Li, J., Hauryliuk, A., Malings, C., Eilenberg, S. R., Subramanian, R., and Presto, A. A.: Characterizing the Aging of Alphasense NO2 Sensors in Long-Term Field Deployments, ACS Sensors, 6, 2952–2959, 2021.

Liaw, A. and Wiener, M.: Classification and Regression by randomForest, R News, 2, 18–22 (last access: 4 May 2022), 2002.

Mage, D., Ozolins, G., Peterson, P., Webster, A., Orthofer, R., Vandeweerd, V., and Gwynne, M.: Urban air pollution in megacities of the world, Atmos. Environ., 30, 681–686, 1996.

Malings, C., Tanzer, R., Hauryliuk, A., Kumar, S. P. N., Zimmerman, N., Kara, L. B., Presto, A. A., and Subramanian, R.: Development of a general calibration model and long-term performance evaluation of low-cost sensors for air pollutant gas monitoring, Atmos. Meas. Tech., 12, 903–920, 2019.

Marshall, J. D., Nethery, E., and Brauer, M.: Within-urban variability in ambient air pollution: Comparison of estimation methods, Atmos. Environ., 42, 1359–1369, 2008.

Mead, M., Popoola, O., Stewart, G., Landshoff, P., Calleja, M., Hayes, M., Baldovi, J., McLeod, M., Hodgson, T., Dicks, J., Lewis, A., Cohen, J., Baron, R., Saffell, J., and Jones, R.: The use of electrochemical sensors for monitoring urban air quality in low-cost, high-density networks, Atmos. Environ., 70, 186–203, 2013.

Mueller, M., Meyer, J., and Hueglin, C.: Design of an ozone and nitrogen dioxide sensor unit and its long-term operation within a sensor network in the city of Zurich, Atmos. Meas. Tech., 10, 3783–3799, 2017.

Palmes, E. D., Gunnison, A., DiMattio, J., and Tomczyk, C.: Personal sampler for nitrogen dioxide, Am. Ind. Hyg. Assoc. J., 37, 570–577, 1976.

Pascal, M., Corso, M., Chanel, O., Declercq, C., Badaloni, C., Cesaroni, G., Henschel, S., Meister, K., Haluza, D., Martin-Olmedo, P., and Medina, S.: Assessing the public health impacts of urban air pollution in 25 European cities: results of the Aphekom project, Sci. Total Environ., 449, 390–400, 2013.

Peltier, R. E., Castell, N., Clements, A. L., Dye, T., Hüglin, C., Kroll, J. H., Ning, Z., Parsons, M., Penza, M., Reisen, F., and von Schneidemesser, E.: An Update on Low-cost Sensors for the Measurement of Atmospheric Composition, December 2020 (WMO – No.1215), World Meteorological Organization (WMO), Geneva, 90 pp., 2021.

Rodriguez, J. D., Perez, A., and Lozano, J. A.: Sensitivity Analysis of k-Fold Cross Validation in Prediction Error Estimation, IEEE T. Pattern Anal., 32, 569–575, 2010.

Smith, K. R., Edwards, P. M., Ivatt, P. D., Lee, J. D., Squires, F., Dai, C., Peltier, R. E., Evans, M. J., Sun, Y., and Lewis, A. C.: An improved low-power measurement of ambient NO2 and O3 combining electrochemical sensor clusters and machine learning, Atmos. Meas. Tech., 12, 1325–1336, 2019.

Snyder, E. G., Watkins, T. H., Solomon, P. A., Thoma, E. D., Williams, R. W., Hagler, G. S., Shelow, D., Hindin, D. A., Kilaru, V. J., and Preuss, P. W.: The changing paradigm of air pollution monitoring, Environ. Sci. Technol., 47, 11369–11377, 2013.

Spinelle, L., Gerboles, M., Villani, M. G., Aleixandre, M., and Bonavitacola, F.: Field calibration of a cluster of low-cost available sensors for air quality monitoring. Part A: Ozone and nitrogen dioxide, Sensors and Actuators B Chem., 215, 249–257, 2015.

Tan, Y., Lipsky, E. M., Saleh, R., Robinson, A. L., and Presto, A. A.: Characterizing the spatial variation of air pollutants and the contributions of high emitting vehicles in Pittsburgh, PA, Environ. Sci. Technol., 48, 14186–14194, 2014.

Taylor, K. E.: Summarizing multiple aspects of model performance in a single diagram, J. Geophys. Res.-Atmos., 106, 7183–7192, 2001.

Venables, W. N. and Ripley, B. D.: Modern Applied Statistics with S, Springer, New York, 4th Edn. (last access: 4 May 2022), 2002.

WHO: Ambient air pollution: a global assessment of exposure and burden of disease, World Health Organization (WHO), Geneva, 131 pp., 2016.

Zimmerman, N., Presto, A. A., Kumar, S. P. N., Gu, J., Hauryliuk, A., Robinson, E. S., Robinson, A. L., and Subramanian, R.: A machine learning calibration model using random forests to improve sensor performance for lower-cost air quality monitoring, Atmos. Meas. Tech., 11, 291–313, 2018.

Short summary
In this study, the performance of electrochemical sensors for NO and NO2 for measuring air quality was determined over a longer operating period. The performance of NO sensors remained reliable for more than 18 months. However, the NO2 sensors showed decreasing performance over time. During deployment, we found that the NO2 sensors can distinguish general pollution levels, but they proved unsuitable for accurate measurements due to significant biases.