Articles | Volume 17, issue 13
Research article
03 Jul 2024

Transferability of machine-learning-based global calibration models for NO2 and NO low-cost sensors

Ayah Abu-Hani, Jia Chen, Vigneshkumar Balamurugan, Adrian Wenzel, and Alessandro Bigi

It is essential to accurately assess and verify the effects of air pollution on human health and the environment in order to develop effective mitigation strategies. A higher-density sensor network allows more accurate analysis of air pollution. Recent studies have demonstrated that low-cost sensors can quantify air pollution at high spatial resolution, alleviating the coarse spatial coverage of conventional monitoring stations. However, the reliability of such sensors has been questioned because of concerns about the quality and accuracy of their data. In response to these concerns, active research efforts have focused on leveraging machine learning (ML) techniques in the calibration of low-cost sensors. These efforts have shown promising results for automatic calibration, which could significantly reduce the effort and cost of traditional calibration methods and improve the performance of low-cost sensors.

As a contribution to this promising research field, this study investigates the calibration transferability between identical low-cost sensor units (SUs) for NO2 and NO using ML-based global models. Global models would further reduce calibration efforts and costs by eliminating the need for individual calibrations, especially in networks of tens or hundreds of low-cost sensors. This study used a dataset acquired from four SUs deployed at three distinct locations in Switzerland. We also propose using O3 measurements from nearby reference stations to address the cross-sensitivity effect. This strategy aims to enhance model accuracy, as most electrochemical NO2 and NO sensors are strongly cross-sensitive to O3. The results show excellent calibration transferability between SUs located at the same site (Case A), with average model performance of R2 = 0.90 ± 0.05 and root mean square error (RMSE) = 3.4 ± 0.9 ppb for NO2 and R2 = 0.97 ± 0.02 and RMSE = 3.1 ± 0.8 ppb for NO. There is also relatively good transferability between SUs deployed at different sites (Case B), with average performance of R2 = 0.65 ± 0.08 and RMSE = 5.5 ± 0.4 ppb for NO2 and R2 = 0.82 ± 0.05 and RMSE = 5.8 ± 0.8 ppb for NO. The results also show a substantial improvement in the calibration models when O3 measurements are integrated, which is more pronounced when SUs are situated in regions with elevated O3 concentrations. Although the findings of this study are based on a specific type of sensor and sensor model, the methodology is flexible and can be applied to other low-cost sensors with different target pollutants and sensing technologies. Furthermore, this study highlights the value of leveraging publicly available data sources to improve the reliability of low-cost air quality sensors.

1 Introduction

Interest in air quality (AQ) has increased significantly over the past decades as a result of the severe impact of air pollution on the environment and public health (WHO, 2004). Major air pollutants such as carbon monoxide (CO), nitric oxide and nitrogen dioxide (NO + NO2 = NOx), particulate matter (PM), and anthropogenic volatile organic compounds (VOCs) originate mainly from anthropogenic activities that directly and indirectly affect AQ and public health (Kelly and Fussell, 2015). Consequently, monitoring and mitigating air pollution are of utmost importance in support of sustainable development. To date, official regulatory monitoring stations use high-precision instruments based on optical measurement principles (e.g., chemiluminescence for NO2), which are very costly. The unit price of a fully equipped regulatory monitoring station ranges from EUR 50 000 to 100 000, in addition to maintenance and operating costs (Mead et al., 2013; Ionascu et al., 2021). The current European Ambient Air Quality Directive (2008/50/EC), implemented by EU member states, sets several microscale siting requirements for monitoring stations measuring pollutants subject to regulatory limits. Among these, traffic-oriented sampling points should be within 10 m of the road edge and at least 25 m from high-traffic intersections. These high costs and space requirements limit the number of stations and hence their spatial distribution. Moreover, as shown by several studies (e.g., Zhu et al., 2020; Beckwith et al., 2019; Baruah et al., 2023), NO2 hotspots at urban sites are not fully represented by their corresponding monitoring stations. To bridge this gap, it is crucial to increase the spatial coverage of air quality monitoring. One way to achieve this is to use networks of low-cost sensors in combination with modeling.

The drive to extend the spatial coverage of air quality monitoring, combined with advances in sensor technology, has paved the way for the use of low-cost sensors in air quality monitoring (Ionascu et al., 2021). Owing to their affordability, portability, and simple deployment, low-cost sensors have attracted wide interest (Karagulian et al., 2019; Suriano and Penza, 2022; Snyder et al., 2013; Bigi et al., 2018). However, concerns about the stability of their performance and the quality of their data have limited their large-scale implementation. Low-cost gas sensors are mostly metal-oxide and electrochemical sensors (Spinelle et al., 2015; Borrego et al., 2016; Mijling et al., 2018), and when deployed under ambient conditions, they suffer from drift, cross-sensitivity, and biases dependent on relative humidity or temperature (Masson et al., 2015; Mueller et al., 2017; Maag et al., 2018; Tagle et al., 2020; Papaconstantinou et al., 2023). These sensors are generally subject to two main sources of error: internal errors arising from the sensor's working principle and external errors resulting from environmental factors. Internal errors include variable detection limits, drift, and non-linear response. Nominally identical sensors can exhibit different biases even when deployed at the same site, mainly due to manufacturing tolerances. External errors are mainly attributed to environmental factors such as temperature and relative humidity, as well as cross-sensitivity to interfering gases (Ionascu et al., 2021; Giordano et al., 2021). Low-cost sensors can exhibit significant biases at certain temperature or relative humidity levels, or in response to changes in these variables. For example, both Masson et al. (2015) and Tagle et al. (2020) reported large biases for NO2 electrochemical cells during periods of high relative humidity (above 75 %).
Widely used NO2 electrochemical cells have been shown to have significant cross-sensitivity to O3 (Miech et al., 2021; Spinelle et al., 2017; Alphasense Ltd, 2022). As a solution, an O3 scrubber was added (Hossain et al., 2016; Alphasense Ltd, 2022), and the filter material was shown to remove O3 without affecting the signal due to the target NO2. Notwithstanding this scrubber, NO2 cells still show some O3 interference. For example, according to Miech et al. (2021), the Alphasense NO2-B43F exhibits 6.6 % cross-sensitivity to O3, which also increases with time, as indicated by Spinelle et al. (2017). This interference therefore induces a bias in the sensor response.
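To put the 6.6 % figure in perspective, the following back-of-the-envelope estimate converts it into an apparent NO2 signal. It assumes that cross-sensitivity is expressed as the ratio of the sensor response to O3 to its response to the same concentration of NO2, and it uses a hypothetical ambient O3 level of 40 ppb rather than a value measured in this study:

```latex
\Delta[\mathrm{NO_2}]_{\mathrm{apparent}} \approx s_{\mathrm{O_3}}\,[\mathrm{O_3}] = 0.066 \times 40~\mathrm{ppb} \approx 2.6~\mathrm{ppb}
```

An interference of a few ppb is of the same order as the calibration errors reported in Sect. 3, which illustrates why residual O3 sensitivity matters for NO2 calibration.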

Low-cost sensor biases can be partially mitigated through calibration, usually performed either under controlled laboratory conditions or by field co-location next to a reference monitoring station (Miech et al., 2021). Studies show that field co-location is preferable and more commonly used, as it maximizes the performance of such sensors in real-world applications (Spinelle et al., 2017; Suriano and Penza, 2022; Kureshi et al., 2022). Successful calibration has the potential to significantly enhance the AQ measurement process and reduce overall costs (Zimmerman et al., 2018; Munir et al., 2019; Van Zoest et al., 2019). However, the type and amount of processing applied to air quality sensor data can lead to confusion about whether the processed data remain a true sensor measurement or become a blend of secondary data and predictions. To address this issue, Schneider et al. (2019) proposed a standardized terminology for the processing levels of low-cost air quality sensor systems, ranging from Level 0 (raw sensor output) to Level 4 (processed data with spatial interpolation or assimilation into models). Each level serves different purposes, and data usability varies depending on the application. The proposed terminology aims to enhance the use and understanding of this technology and to ensure that the methods applied are well documented and fit for their intended purpose.

Several calibration techniques have been reported in the literature, ranging from environmental factor correction (Miech et al., 2021; Van Zoest et al., 2019; Kim et al., 2018) and simple linear regression models (Okorn and Hannigan, 2021) to machine learning (ML) techniques (Nowack et al., 2021; Bigi et al., 2018; Zimmerman et al., 2018; Spinelle et al., 2015; Ionascu et al., 2021). Although some low-cost sensor outputs show an approximately linear relationship with the target pollutant, this linearity varies with time due to sensor aging (Li et al., 2021). ML algorithms have shown a superior ability to capture such complex sensor behavior, especially when covariates that account for meteorological and environmental variability are included. One of the most popular ML algorithms is random forest, an ensemble algorithm based on decision trees (Breiman, 2001). Random forest, along with other commonly used methods such as multiple linear regression, support vector regression, and artificial neural networks, has been widely employed in low-cost air quality sensor calibration and in some aspects of atmospheric chemistry, as it tends to outperform linear regression models (Nowack et al., 2021; Bigi et al., 2018; Zimmerman et al., 2018; Spinelle et al., 2015; Ionascu et al., 2021). Most studies in the literature investigate the individual localized calibration approach, in which a single calibration model is created for each sensor unit (SU) after co-location with a reference instrument (Zimmerman et al., 2018; Spinelle et al., 2015). Recent works, such as Bigi et al. (2018), Sahu et al. (2021), Van Zoest et al. (2019), and Nowack et al. (2021), have studied individual calibration models with respect to site transferability, investigating whether a co-location-based calibration at one location produces reliable measurements at a different location. Bigi et al. (2018) reported a root mean square error (RMSE) of about 6.5 ppb for both NO2 and NO.

Only a few studies consider the calibration transferability (global calibration) among different sensors of the same make, including site transferability. A study conducted by Malings et al. (2019) evaluated the performance of individualized calibration models versus generalized calibration models. Individualized models are built based on data from a single sensor, while generalized models combine data from all sensors of the same type. The researchers found that the most effective calibration model type varied by sensor technology; for example, simpler regression models produced the best results for electrochemical CO sensors, while more complex models, such as artificial neural networks and random forest models, provided the best results for NO2 sensors. Although the outcomes varied, it was found that generalized models performed better at new locations compared with individualized models, despite slightly lower performance during initial calibration. Vikram et al. (2019) proposed a method for improving calibration transfer of NO2 and O3 by training calibration models on multiple sites. They rotated nine SUs among three sites with reference monitors and introduced a novel split-neural-network (split-NN) approach which incorporates two sets of models: a global calibration model that combines data from a set of similar sensors spread across different training environments and sensor-specific calibration models that correct the sensor-to-sensor variations. The approach demonstrates versatility, accommodating linear regressors (LRs) or NNs for sensor-specific models and utilizing a two-layer NN for global calibration. The researchers found that the split-NN method performed better than random forest, reducing errors by 0 %–11 % for NO2 and 6 %–13 % for O3. 
Training their models on two sites and testing them on a third site with no overlap between the training and test data distributions (“Level 2” benchmark as classified by Schneider et al., 2019) resulted in an RMSE between 6 and 8 ppb for NO2. Another study by Okorn and Hannigan (2021) examined the transferability of simple LR calibration models between several metal-oxide sensor systems (pods), focusing on ozone and methane. In their study, calibrations were transferred among pods at the same location (i.e., the sensors shared the same environmental variability). They suggested using a standardization approach to normalize sensor signals for enhanced calibration transferability among units. A recent study by Wang et al. (2023) examined the calibration transfer performance of five low-cost SUs for PM and NO2. The five SUs were co-located with a reference-grade monitor at one site for 4 weeks, and then two units were transferred to another site for a 16 d mobile campaign 6 months after the first deployment. The results showed transferability between SUs located at the same site (same stationary settings), with the coefficient of determination (R2) of the best-performing calibration models exceeding 0.80 for PM and ranging around 0.70 for NO2. However, models trained in stationary settings were difficult to transfer to mobile settings with different environmental characteristics.

In our study, we developed global ML-based calibration models for electrochemical cells targeting NO2 and NO, using data from the low-cost SUs employed in a previous study by Bigi et al. (2018). We focus on calibration transferability among SUs deployed at the same location (i.e., the same environmental characteristics) and at different locations (i.e., different environmental characteristics), with no explicit overlap between the training and testing data distributions. This approach uses simple standardization to account for sensor-to-sensor variations, unlike the approach proposed by Vikram et al. (2019), which uses an ML-based method. In addition, this study explores potential improvements to model transferability using additional information (O3) from nearby regulatory air quality monitoring stations. This helps to disentangle the O3 interference that persists in NO2 cells despite the presence of an O3 scrubber (Spinelle et al., 2017; Miech et al., 2021; Li et al., 2021). While the extensive literature supports the integration of O3 as an input variable for NO2 calibration (Mead et al., 2013; Miech et al., 2021; Spinelle et al., 2015), there is limited support for its inclusion in NO calibration. In this study, we present the results of this scenario, which may be of interest to researchers in this field. The incorporation of information from nearby regulatory monitoring stations corresponds to Level 3 in the classification by Schneider et al. (2019). Finally, this study provides an opportunity to examine the influence of geographical and seasonal variations on calibration transferability.

In Sect. 2, the sensor units, deployment sites, and calibration methods are described. Results and discussion are found in Sect. 3. Finally, the main conclusions are drawn in Sect. 4. All data processing was performed with MATLAB (MathWorks, Natick, MA, USA) version R2021b.

Table 1. Details of the two-phase campaign of SU deployments.


2 Materials and methods

2.1 Sensor units

This study used data collected from four SUs developed jointly by Empa, the Swiss Federal Laboratories for Materials Science and Technology, and Decentlab GmbH. These SUs were described and employed in previous studies (Bigi et al., 2018; Kim et al., 2018). Each SU consists of four electrochemical sensors, namely two NO2 sensors (Alphasense NO2-B43F) and two NO sensors (Alphasense NO-B4), along with a temperature (T) and relative humidity (RH) sensor (Sensirion SHT21). All signals were sampled every 20 s, aggregated to 1 min mean values, and transmitted to a central database every 180 min. The four SUs are denoted AC009, AC010, AC011, and AC012, and the electrochemical sensors are denoted NO_A, NO_B, NO2_A, and NO2_B, with signals provided in millivolts. Throughout this study, the signal of each electrochemical sensor is the voltage difference between the working electrode (WE) and the auxiliary electrode (AE). Data collected from the SUs and their corresponding reference instruments were preprocessed for outlier removal and smoothing and averaged over 10 min, following the same procedure as described in Bigi et al. (2018).
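The aggregation chain described above (WE − AE difference, 20 s samples to 1 min means, then 10 min means) can be sketched as follows. This is a minimal Python illustration that assumes gap-free, regularly spaced samples and simply drops incomplete trailing blocks; the outlier removal and smoothing steps of Bigi et al. (2018) are not reproduced, and the function names are hypothetical.

```python
from statistics import mean


def differential_signal(we_mv, ae_mv):
    """Sensor signal (mV) as working-electrode minus auxiliary-electrode voltage."""
    return [we - ae for we, ae in zip(we_mv, ae_mv)]


def block_average(values, block_size):
    """Mean over consecutive, non-overlapping blocks; an incomplete trailing block is dropped."""
    n_full = len(values) // block_size
    return [mean(values[i * block_size:(i + 1) * block_size]) for i in range(n_full)]


# Hypothetical 20 min of gap-free 20 s readings (60 samples per electrode).
we = [300.0 + i for i in range(60)]
ae = [250.0] * 60
signal = differential_signal(we, ae)     # 50.0, 51.0, ..., 109.0 mV
one_min = block_average(signal, 3)       # 3 x 20 s samples per minute -> 20 values
ten_min = block_average(one_min, 10)     # 10 x 1 min means per 10 min block -> 2 values
```

In real data, gaps and timestamps would have to be handled explicitly, e.g., by grouping on clock intervals rather than on sample counts.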

2.2 Deployment sites and co-location

Over a two-phase campaign, SUs were deployed at three locations representing different emission and meteorological conditions, in continuous co-location with regulatory stations of the National Air Pollution Monitoring Network (NABEL) (Bigi et al., 2018). A detailed description of the two-phase campaign is given in Table 1. The first phase began in April 2017 and lasted approximately 3 months, during which the four SUs were installed at the rural site of Härkingen (HAE), facing a major highway. This particular location exposed the sensors both to traffic-related pollutants, when southerly winds carry polluted air from the highway, and to cleaner air masses, when northerly winds flow over the rural area. After the first phase, the SUs were moved to two different locations: AC009 and AC010 were installed at Zurich-Kaserne (ZUE), while AC011 and AC012 were installed in Lausanne (LAU). The second phase lasted around 4 months (from 28 July to 5 December 2017). All reference stations provide measurements of NO, NO2, O3, temperature, and relative humidity. Figure 1 summarizes the meteorological variables and pollutant concentrations at the different deployment sites, as measured by the reference instruments. In the vicinity of the co-location site ZUE, four other regulatory air quality monitoring stations are located within a radius of approximately 2.7 km. In Lausanne, two nearby stations are situated within a radius of about 10.7 km of the co-location site (LAU), while none is available near Härkingen (see Fig. 2). These nearby monitoring stations provide O3 measurements that are used to assess the potential improvement of calibration models when addressing cross-sensitivity to O3. For comparison, the same assessment is conducted using O3 measurements from the co-location reference stations (ZUE and LAU).
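Selecting nearby stations within a given radius of a co-location site (about 2.7 km in Zurich and 10.7 km in Lausanne) amounts to a great-circle distance filter. The sketch below uses the haversine formula with a mean Earth radius of 6371 km; the station names and coordinates are hypothetical placeholders, not the actual stations shown in Fig. 2.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance (km) between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def stations_within(site, stations, radius_km):
    """Sorted names of stations whose distance from `site` (lat, lon) is at most radius_km."""
    lat0, lon0 = site
    return sorted(name for name, (lat, lon) in stations.items()
                  if haversine_km(lat0, lon0, lat, lon) <= radius_km)


# Hypothetical coordinates, not the real station locations.
site = (47.00, 8.00)
stations = {"S1": (47.01, 8.00), "S2": (47.00, 8.02), "S3": (47.10, 8.00)}
nearby = stations_within(site, stations, radius_km=2.7)   # ["S1", "S2"]
```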

Figure 1. Box plots showing meteorological variables and pollutant concentrations at the deployment sites, based on 10 min averaged data. The central line indicates the median, the star represents the mean, and the bottom and top edges of the box indicate the 25th and 75th percentiles, respectively. The whiskers extend to the minimum and maximum values.


Figure 2. Map of the low-cost SU co-location sites and nearby monitoring stations in Zurich and Lausanne (© Google Earth 2023).

2.3 Calibration

Calibration transfer methods may reduce the effort needed to obtain valuable air pollutant measurements from low-cost sensors. Figure 3 illustrates the two cases investigated in this study to examine the transferability of calibration between different but nominally identical SUs. In Case A, each global calibration model was trained on the dataset from one SU, denoted the “primary SU”, and then applied to the remaining SUs at the same location, denoted “secondary SUs”. This case represents the ideal scenario and serves as a benchmark. In Case B, each global calibration model was applied to secondary SUs installed at sites different from that of the primary SU. Every SU served once as the primary SU in both cases.
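The split into Case A and Case B can be expressed as a simple rule on ordered (primary, secondary) pairs. The sketch below applies it to the phase-2 deployment described in Sect. 2.2 (AC009 and AC010 at ZUE, AC011 and AC012 at LAU). It enumerates every ordered pair, which is an assumption: the pairs actually evaluated in this study, including the phase-1 co-location at HAE, are those reported in Fig. 9 and the Supplement.

```python
from itertools import permutations

# Phase-2 deployment from Sect. 2.2: SU -> co-location site.
PHASE2_SITES = {"AC009": "ZUE", "AC010": "ZUE", "AC011": "LAU", "AC012": "LAU"}


def transfer_pairs(sites):
    """Split ordered (primary, secondary) SU pairs into Case A (same site) and Case B (different sites)."""
    case_a, case_b = [], []
    for primary, secondary in permutations(sorted(sites), 2):
        target = case_a if sites[primary] == sites[secondary] else case_b
        target.append((primary, secondary))
    return case_a, case_b


case_a, case_b = transfer_pairs(PHASE2_SITES)
# case_a: [("AC009", "AC010"), ("AC010", "AC009"), ("AC011", "AC012"), ("AC012", "AC011")]
# case_b: the remaining 8 cross-site pairs, e.g. ("AC009", "AC011")
```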

Figure 3. Scheme of the two cases of calibration transfer between different SUs (left) and the architecture of the global calibration model (right).


The calibration transfer approach is particularly advantageous in networks consisting of a large number of low-cost SUs. Instead of individually characterizing and calibrating each SU, it may suffice to characterize and calibrate a representative SU or a subset of units and then apply the resulting global calibration models to the remaining units in the network. These models can also be applied to other SUs of the same type, both those close to the calibrated (primary) SUs (e.g., in the same city under similar emission conditions) and those further away (e.g., in the same city under different emission conditions).

Our calibration strategy (illustrated in Fig. 3) is designed to enhance model performance by reducing the effects of cross-sensitivity and sensor-to-sensor variability. As O3 can cause significant interference in low-cost NO2 sensors, O3 measurements were included in the feature set of the calibration models.

2.3.1 Data investigation and preparation

Evaluating the consistency of SUs is recommended to determine whether similar electrochemical sensors respond similarly to changes in the target gas (Giordano et al., 2021). Higher consistency and fewer error sources, such as sensor-to-sensor variations, favor calibration transferability. Consistency was therefore assessed and addressed by (1) the pairwise Pearson correlation (R) between identical low-cost sensors deployed at the same site (Table 2), which indicates significant correlations between the low-cost sensors; (2) the Pearson correlation (R) between low-cost sensors and their corresponding reference measurements (Table 3); and (3) standardization of features, because identical sensors may have different baseline levels even when they come from the same manufacturer and are deployed at the same location (Fig. 4). To tackle this issue, standardization (Z score) was applied so that all features have a mean of zero (μ = 0) and a standard deviation of one (σ = 1). This yields nearly uniform signals from the electrochemical sensors across all SUs, especially when they are exposed to similar environmental conditions. Overall, standardization reduces sensor-to-sensor variations, making global calibration feasible.
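A minimal sketch of the Z-score step follows. It assumes that μ and σ are computed separately for each feature of each SU over that SU's own record, using the population standard deviation; the study's exact choice of statistics and reference period is not specified here. The two hypothetical SUs differ only by a constant baseline offset, which the standardization removes.

```python
from statistics import mean, pstdev


def zscore(values):
    """Standardize a signal to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values, mu)
    if sigma == 0:
        raise ValueError("a constant signal cannot be standardized")
    return [(v - mu) / sigma for v in values]


def standardize_su(features):
    """Apply the Z score separately to every feature (column) recorded by one SU."""
    return {name: zscore(column) for name, column in features.items()}


# Hypothetical NO2_A signals (mV) from two SUs with different baselines.
su_1 = standardize_su({"NO2_A": [10, 12, 14, 16]})
su_2 = standardize_su({"NO2_A": [-40, -38, -36, -34]})
# After standardization, both SUs yield the same series.
```

Note that a Z score removes offsets and rescales gains, but not differences in the shape of the response; those remain for the calibration model to handle.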

Table 2. Pairwise Pearson correlation (R) between the electrochemical sensors of different SUs.


O3 measurements from the nearby monitoring stations are available at 1 h resolution; therefore, the calibration models in this study are trained and tested on 1 h data. When training a model (with primary SU data), O3 measurements were obtained from the co-location reference stations (either ZUE or LAU). When testing the global models (with secondary SU data), O3 measurements were obtained (1) in Case A, from the reference station within the NABEL network (HAE, ZUE, or LAU), since the secondary SUs are located at the same co-location site as the primary SUs, and (2) in Case B, from the nearby monitoring stations, replicating a real-world scenario in which secondary SUs are installed at a different location without co-location with reference instruments.
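The hourly alignment with O3 and the choice of O3 source described above can be sketched as follows. The sketch assumes hourly values are labeled by the start of the hour (real station data may use end-of-interval timestamps, so both series must follow the same convention), keeps only hours present in both series, and encodes the train/test rules of this section in a hypothetical helper `o3_source`.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean


def hourly_means(records):
    """Average (timestamp, value) records into hours labeled by their start (assumed convention)."""
    buckets = defaultdict(list)
    for t, v in records:
        buckets[t.replace(minute=0, second=0, microsecond=0)].append(v)
    return {hour: mean(vals) for hour, vals in buckets.items()}


def join_with_o3(su_hourly, o3_hourly):
    """Keep only hours present in both series, as (hour, su_value, o3_value) tuples."""
    common = sorted(su_hourly.keys() & o3_hourly.keys())
    return [(h, su_hourly[h], o3_hourly[h]) for h in common]


def o3_source(role, case, colocation_station, nearby_station=None):
    """Hypothetical helper: which station's O3 feeds the model (rules of Sect. 2.3.1)."""
    if role not in ("train", "test"):
        raise ValueError(f"unknown role: {role!r}")
    if role == "train" or case == "A":
        return colocation_station   # primary SU, or secondary SU at the same reference site
    if case == "B":
        if nearby_station is None:
            raise ValueError("Case B needs a nearby, non-co-located O3 station")
        return nearby_station       # e.g. Station Z1 or L1 (Fig. 2)
    raise ValueError(f"unknown case: {case!r}")


# Hypothetical 10 min SU values and hourly O3 (ppb) from a nearby station.
su = hourly_means([(datetime(2017, 8, 1, 0, 0), 1.0),
                   (datetime(2017, 8, 1, 0, 10), 3.0),
                   (datetime(2017, 8, 1, 1, 0), 5.0)])
o3 = {datetime(2017, 8, 1, 0): 30.0, datetime(2017, 8, 1, 2): 40.0}
rows = join_with_o3(su, o3)   # only 00:00 is present in both series
```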

Table 3. Pearson correlation (R) between the electrochemical sensors of the SUs and their corresponding reference instruments.


Figure 4. Comparison of raw NO2_A measurements before and after applying the Z score, for each SU deployed at HAE. A negative voltage indicates that the auxiliary electrode has a higher voltage than the working electrode, which occurs, for example, if the electronic zero points of the two electrodes differ significantly. Applying a Z score to the raw data minimizes this artifact.


Figure 5. Daily O3 measurements from the co-location reference stations and the other nearby stations in Zurich and Lausanne. The inset tables list the Pearson correlation (R) between each nearby station and the co-location reference station.


O3 measurements from the co-location reference sites and the other nearby monitoring stations for the entire year 2017 were analyzed to examine the consistency of O3 concentrations among these stations (see Fig. 5). The daily averages of these measurements show a strong correlation between the co-location reference station and all nearby stations, in both Zurich and Lausanne, as reported in the inset tables of Fig. 5. This indicates consistent O3 variability across all stations in each city, suggesting that O3 measurements from any of the nearby stations could help the models capture O3 cross-sensitivity.
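The consistency check above reduces to averaging each station's O3 record to daily means and computing the Pearson correlation over the days common to both stations. A minimal sketch with hypothetical data follows; it assumes gap-free records, whereas real data would need a minimum-coverage rule for each day.

```python
from collections import defaultdict
from datetime import datetime
from math import sqrt
from statistics import mean


def daily_means(hourly):
    """Average a {datetime: value} series to {date: daily mean}."""
    buckets = defaultdict(list)
    for t, v in hourly.items():
        buckets[t.date()].append(v)
    return {day: mean(vals) for day, vals in buckets.items()}


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def station_correlation(ref_hourly, other_hourly):
    """Pearson R between the daily O3 means of two stations over their common days."""
    d_ref, d_oth = daily_means(ref_hourly), daily_means(other_hourly)
    days = sorted(d_ref.keys() & d_oth.keys())
    return pearson_r([d_ref[d] for d in days], [d_oth[d] for d in days])


# Hypothetical hourly O3 (ppb): the second station is a linear function of the first.
ref = {datetime(2017, 1, d, h): 10.0 * d + h for d in range(1, 4) for h in range(24)}
other = {t: 2.0 * v + 5.0 for t, v in ref.items()}
r = station_correlation(ref, other)
```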

2.3.2 ML-based calibration transfer method

In this study, three ML-based calibration algorithms were used: multivariate linear regression (MLR), support vector regression (SVR), and random forest (RF). These algorithms were employed to estimate atmospheric concentrations of NO2 and NO from a set of features (predictors). The choice of ML algorithms and features followed Bigi et al. (2018), as we used the same dataset. For training the global calibration models for NO2 and NO, six features were initially used: the voltage signals of the four electrochemical sensors (NO_A, NO_B, NO2_A, and NO2_B), temperature, and relative humidity. Additionally, we propose incorporating O3 from nearby monitoring stations as a seventh feature. To evaluate the influence of O3 on the performance of the global models, two sets of models were formulated and assessed: one relied exclusively on SU data as features, while the other added O3 to the feature set. The two sets were then compared.
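To illustrate the two feature sets, the sketch below fits an ordinary least squares model, as a stand-in for the MLR fitted with MATLAB's fitlm(), once with the six SU features and once with O3 appended. The data are synthetic and constructed so that the target partly depends on O3; they only show the mechanics of the comparison, not the size of the effect found in this study. The normal-equation solver is written out for self-containment; a numerically more robust solver (e.g., QR-based) would be preferable in practice.

```python
import random
from math import sqrt

SU_FEATURES = ["NO_A", "NO_B", "NO2_A", "NO2_B", "T", "RH"]  # hypothetical column names


def design_matrix(rows, use_o3):
    """Build the feature matrix; O3 is appended as a seventh feature when requested."""
    names = SU_FEATURES + (["O3"] if use_o3 else [])
    return [[row[n] for n in names] for row in rows]


def fit_mlr(X, y):
    """Ordinary least squares with intercept via the normal equations (A^T A) b = A^T y."""
    A = [[1.0] + list(r) for r in X]          # prepend the intercept column
    p = len(A[0])
    M = [[sum(r[i] * r[j] for r in A) for j in range(p)] for i in range(p)]
    v = [sum(r[i] * t for r, t in zip(A, y)) for i in range(p)]
    for col in range(p):                      # Gaussian elimination, partial pivoting
        piv = max(range(col, p), key=lambda k: abs(M[k][col]))
        M[col], M[piv], v[col], v[piv] = M[piv], M[col], v[piv], v[col]
        for k in range(col + 1, p):
            f = M[k][col] / M[col][col]
            for j in range(col, p):
                M[k][j] -= f * M[col][j]
            v[k] -= f * v[col]
    coef = [0.0] * p
    for i in reversed(range(p)):              # back substitution
        coef[i] = (v[i] - sum(M[i][j] * coef[j] for j in range(i + 1, p))) / M[i][i]
    return coef                               # [intercept, one slope per feature]


def predict_mlr(coef, X):
    return [coef[0] + sum(c * x for c, x in zip(coef[1:], r)) for r in X]


def rmse(obs, pred):
    return sqrt(sum((o - q) ** 2 for o, q in zip(obs, pred)) / len(obs))


# Synthetic, standardized data in which the target partly depends on O3.
rng = random.Random(0)
rows = [{n: rng.gauss(0.0, 1.0) for n in SU_FEATURES + ["O3"]} for _ in range(200)]
target = [2.0 + 1.5 * r["NO2_A"] + 1.2 * r["NO2_B"] - 0.8 * r["O3"] for r in rows]

X_no, X_o3 = design_matrix(rows, False), design_matrix(rows, True)
coef_no, coef_o3 = fit_mlr(X_no, target), fit_mlr(X_o3, target)
rmse_no = rmse(target, predict_mlr(coef_no, X_no))   # O3 effect left unexplained
rmse_o3 = rmse(target, predict_mlr(coef_o3, X_o3))   # close to zero for this toy data
```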

The models were trained and analyzed in MATLAB using the fitlm() function for MLR, the LIBSVM software package for SVR, and the TreeBagger() function for RF. A k-fold cross-validation approach was used to address overfitting, in which the training dataset (primary SU) was divided into five folds (blocks), as depicted in Fig. 6. We chose k = 5 based on the recommendation by Rodriguez et al. (2009). One block (20 % of the dataset) was used for validation, and the remaining blocks (80 % of the dataset) were used for training. This process was repeated five times, so that each block served once as the validation set. In each split, the block sampling approach introduced by Schultz et al. (2021) was followed to avoid spurious correlation between the training and validation sets. A grid search was applied to find the best hyperparameters, which were subsequently used to train the model on the entire training set. The model was then evaluated on a test dataset (secondary SUs). In RF models, variable (predictor) importance can be calculated by randomly permuting the values of each variable in the out-of-bag observations of each decision tree and averaging the resulting increase in estimation error over the forest (Breiman, 2001). The larger the increase in estimation error, the more important the variable is to the model.
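The blocked cross-validation and grid search can be sketched as follows. This is a simple contiguous-block variant in plain Python: it assumes the samples are in time order, does not insert gaps between blocks, and uses a hypothetical one-parameter toy model in place of SVR or RF. As in the text, the selected hyperparameter would then be used to refit the model on the entire primary-SU dataset before testing on secondary SUs.

```python
from math import sqrt


def blocked_kfold(n_samples, k=5):
    """Yield (train_idx, val_idx) pairs in which each validation fold is one contiguous block."""
    bounds = [i * n_samples // k for i in range(k + 1)]
    for i in range(k):
        val = list(range(bounds[i], bounds[i + 1]))
        train = list(range(0, bounds[i])) + list(range(bounds[i + 1], n_samples))
        yield train, val


def rmse(obs, pred):
    return sqrt(sum((o - q) ** 2 for o, q in zip(obs, pred)) / len(obs))


def grid_search(X, y, param_grid, fit, predict, k=5):
    """Pick the hyperparameter with the lowest mean validation RMSE over k blocked folds."""
    best_param, best_score = None, float("inf")
    for param in param_grid:
        scores = []
        for train, val in blocked_kfold(len(y), k):
            model = fit([X[i] for i in train], [y[i] for i in train], param)
            pred = predict(model, [X[i] for i in val])
            scores.append(rmse([y[i] for i in val], pred))
        mean_score = sum(scores) / len(scores)
        if mean_score < best_score:
            best_param, best_score = param, mean_score
    return best_param, best_score


# Toy usage with a hypothetical one-parameter model: predict `param * mean(y_train)`.
def toy_fit(X_train, y_train, param):
    return param * sum(y_train) / len(y_train)


def toy_predict(model, X_val):
    return [model] * len(X_val)


X_demo = [[float(i)] for i in range(20)]
y_demo = [3.0] * 20
best, score = grid_search(X_demo, y_demo, [0.5, 1.0, 1.5], toy_fit, toy_predict)
```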

Figure 6. Illustration of the k-fold cross-validation approach.


3 Results and discussion

3.1 Performance of calibration transfer

The global calibration models were evaluated using R2 and RMSE as goodness-of-fit metrics, as described in Appendix A. Figure 7 summarizes the overall evaluation results for the two sets of global calibration models (with and without O3) for NO2 and NO in both cases. The results presented here and throughout this paper are based on O3 measurements from “Station Z1” in Zurich and “Station L1” in Lausanne (Fig. 2). The results indicate successful transferability of the calibration models across SUs for NO2 and NO, with Case A outperforming Case B. In Case A, errors between the primary and secondary SUs are minimal, primarily because both SUs share the same environmental characteristics. Moreover, on average, RF consistently outperforms MLR and SVR, in line with the conclusions of Bigi et al. (2018), who investigated individual calibration models using the same dataset. The main outcome of this study is that global calibration models perform better when O3 measurements from nearby monitoring stations are included in the feature set. In Case A, the RF-based NO2 models achieved their highest transferability performance with an R2 of 0.96 and an RMSE of 2.0 ppb, with averages of 0.90 ± 0.05 for R2 and 3.4 ± 0.9 ppb for RMSE. In Case B, the best NO2 performance was an R2 of 0.76 and an RMSE of 5.0 ppb, with averages of 0.65 ± 0.08 for R2 and 5.5 ± 0.4 ppb for RMSE. NO models displayed better transferability than NO2 models. In Case A, the RF-based NO models achieved an R2 of 0.99 and an RMSE of 1.6 ppb, with averages of 0.97 ± 0.02 for R2 and 3.1 ± 0.8 ppb for RMSE. In Case B, the best performance for NO models was characterized by an R2 of 0.87 and an RMSE of 5.0 ppb, with averages of 0.82 ± 0.05 for R2 and 5.8 ± 0.8 ppb for RMSE. Further details can be found in Tables S1 through S6 in the Supplement. These results compare favorably with the existing literature: for example, Vikram et al. (2019) reported NO2 RMSEs between 6 and 8 ppb when testing on an unseen site, and Wang et al. (2023) reported R2 values around 0.70 for NO2 units located at the same site.
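For reference, the two goodness-of-fit metrics and the “mean ± SD” summaries can be computed as below. The R2 definition used here (1 − SS_res/SS_tot) is an assumption, since the study's exact definition is given in Appendix A, and the use of the sample standard deviation across primary → secondary pairs is likewise assumed.

```python
from math import sqrt
from statistics import mean, stdev


def rmse(obs, pred):
    """Root mean square error, in the units of the observations (ppb here)."""
    return sqrt(sum((o - p) ** 2 for o, p in zip(obs, pred)) / len(obs))


def r_squared(obs, pred):
    """Coefficient of determination 1 - SS_res / SS_tot (assumed definition)."""
    mean_obs = mean(obs)
    ss_res = sum((o - p) ** 2 for o, p in zip(obs, pred))
    ss_tot = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - ss_res / ss_tot


def summarize(scores):
    """'mean ± SD' across primary -> secondary pairs (sample SD assumed)."""
    return mean(scores), stdev(scores)
```

Under this definition, R2 can become negative for a transferred model that predicts worse than the mean of the reference data, which is one reason to report RMSE alongside it.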

Figure 7. Average results of evaluating the performance of global calibration models for NO and NO2, based on MLR, SVR, and RF, for both Case A and Case B. O3 measurements were obtained from Station Z1 in Zurich and Station L1 in Lausanne.


The inclusion of O3 measurements resulted in noteworthy improvements in predictive accuracy and generalizability, as indicated by higher R2 and lower RMSE values, particularly for the SVR and RF models. To assess the impact on the global models comprehensively, we explored incorporating O3 measurements from each of the nearby monitoring stations in Zurich and Lausanne. Every station contributed positively to model performance. Figure 8 reports the average improvements (%) in R2 and RMSE of the global NO2 RF models for each nearby station in comparison with the co-location reference station, in both Zurich and Lausanne.

Figure 8. Average improvements (%) in R2 and RMSE of global NO2 RF models in Case B when including O3 measurements from each nearby station, in comparison with the co-location reference station, in Zurich and Lausanne.


To better understand this finding, we examined each global model's performance in terms of RMSE (%), as shown in Fig. 9. In Case A, notable improvements were observed for all RF-based models for NO2 and NO, reaching up to 42 % for the NO2 models and up to 25 % for the NO models. In Case B, the improvements were more pronounced when the secondary SUs were located at ZUE, where the RF-based models improved by up to 17 % and 21 % for NO2 and NO, respectively. No significant improvement was observed when the secondary SUs were located at LAU. This can be attributed to higher O3 levels at ZUE (background site) than at LAU (traffic site) (Fig. 1), leading to stronger O3 cross-sensitivity effects on the low-cost sensors at ZUE. Including O3 measurements therefore allowed the models to capture and account for this influence, improving prediction accuracy. The feature importance plots (Fig. 10) provide further support, showing a higher importance of O3 in the NO2 and NO models at ZUE than at LAU, which reinforces the role of O3 in explaining the variability of the sensor signals.
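The percentages in Fig. 9 follow the relative-change formula (new − old) / old × 100 %. A minimal sketch with hypothetical values follows. Under this formula, a decrease in RMSE yields a negative value; reporting it as a positive improvement, as in the text, implies flipping the sign for error metrics, which is our assumption about the convention.

```python
def relative_change_pct(new, old):
    """Relative change (new - old) / old x 100 %, the formula given for Fig. 9."""
    if old == 0:
        raise ValueError("reference value 'old' must be non-zero")
    return (new - old) / old * 100.0


# Hypothetical numbers, not results from this study.
r2_gain = relative_change_pct(0.66, 0.60)      # about +10 %: higher R2 is better
rmse_change = relative_change_pct(4.0, 5.0)    # -20 %: the RMSE dropped by a fifth
rmse_improvement = -rmse_change                # reported as a positive improvement (assumed sign convention)
```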

Figure 9. Performance enhancement achieved by incorporating O3 for both Case A and Case B. The x axis represents primary SU → secondary SU. The relative improvement was computed as (new − old) / old × 100 %.
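The relative-improvement formula in the Fig. 9 caption, (new − old) / old × 100 %, is simple, but its sign convention matters: a better model has a higher R2 but a lower RMSE. The sketch below applies the formula as stated; the helper `rmse_improvement` and its sign flip are assumptions made for illustration (the caption does not state how RMSE reductions were signed), and all numbers are hypothetical.

```python
def relative_improvement(new: float, old: float) -> float:
    """Relative change in percent, (new - old) / old * 100, as in the Fig. 9 caption."""
    if old == 0:
        raise ValueError("the baseline ('old') value must be non-zero")
    return (new - old) / old * 100.0


def rmse_improvement(rmse_new: float, rmse_old: float) -> float:
    """ASSUMPTION: an RMSE *reduction* is reported as a positive improvement,
    so the sign of the relative change is flipped for RMSE."""
    return -relative_improvement(rmse_new, rmse_old)


# Hypothetical values for illustration only (not results from the study).
r2_without_o3, r2_with_o3 = 0.75, 0.90
rmse_without_o3, rmse_with_o3 = 4.0, 3.0  # ppb

print(round(relative_improvement(r2_with_o3, r2_without_o3), 1))  # 20.0
print(rmse_improvement(rmse_with_o3, rmse_without_o3))            # 25.0
```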


Figure 10. Feature importance plots for the NO2 and NO models based on 1 h measurements, including reference O3.


The results also reveal that transferring the NO2 and NO calibration models to SU (AC010) resulted in the lowest performance among all secondary SUs. Table 3 suggests a possible reason: for SU (AC010), NO2_A correlates more strongly with the reference NO2 measurements than NO2_B does. Furthermore, the feature importance plots (Fig. 10) indicate that NO2_A has a greater influence than NO2_B on predicting NO2 in the models trained with SU (AC010) as the primary SU, whereas the opposite holds for the remaining calibration models. We therefore infer that discrepancies between the training and test datasets in how counterpart features correlate with the reference substantially affect calibration transfer between SUs of the same make. Larger discrepancies suggest that the model may not generalize well to new data, which raises concerns about its overall performance. Accordingly, when selecting the primary SU for the final global calibration model, it is crucial to choose an SU whose feature importance is representative of the other SUs to which the model will be transferred.
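The reasoning above compares how strongly counterpart features (NO2_A and NO2_B) correlate with the reference at the primary and secondary SUs. A minimal sketch of such a pre-transfer check follows; the function names, the "strongest feature" criterion, and the data are hypothetical illustrations, not the procedure used to produce Table 3 or Fig. 10.

```python
from math import sqrt
from typing import Dict, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def correlation_profile(features: Dict[str, Sequence[float]],
                        reference: Sequence[float]) -> Dict[str, float]:
    """Correlation of each counterpart feature (e.g. NO2_A, NO2_B) with the reference."""
    return {name: pearson(values, reference) for name, values in features.items()}


def strongest_feature(profile: Dict[str, float]) -> str:
    """Name of the feature with the largest absolute correlation."""
    return max(profile, key=lambda name: abs(profile[name]))


def max_disparity(primary: Dict[str, float], secondary: Dict[str, float]) -> float:
    """Largest absolute difference in correlation between the two SUs."""
    return max(abs(primary[name] - secondary[name]) for name in primary)


# Hypothetical co-location data (ppb), for illustration only.
reference = [10.0, 20.0, 30.0, 40.0, 50.0]
primary = correlation_profile(
    {"NO2_A": [15.0, 18.0, 35.0, 36.0, 50.0], "NO2_B": [10.0, 20.0, 30.0, 40.0, 50.0]},
    reference)
secondary = correlation_profile(
    {"NO2_A": [10.0, 20.0, 30.0, 40.0, 50.0], "NO2_B": [20.0, 10.0, 40.0, 30.0, 50.0]},
    reference)

# Different dominant counterpart features flag a potentially risky pairing.
print(strongest_feature(primary), strongest_feature(secondary))  # NO2_B NO2_A
print(round(max_disparity(primary, secondary), 2))               # 0.2
```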

In some cases, the poor performance of ML-based calibration models can be attributed to the nature of the ML algorithms themselves. For example, many meteorological variables exhibit periodic variations and are correlated in time and space, and these correlations change over time. ML algorithms are generally not aware of these relationships and have difficulty extrapolating periodic features correctly (Grover et al., 2015). Another possible reason for poor performance is the existence of "unknown error sources" whose influences are not captured by the ML models. Because the primary and secondary SUs differ in space and time, the models are exposed to different external errors, which can significantly affect their performance. Future work could therefore address such problems with measures such as feature engineering, i.e., computing derived variables that help ML models recognize the more complex relationships imposed by varying environmental conditions (Schultz et al., 2021; deSouza et al., 2022).
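One common form of the feature engineering mentioned above is to encode periodic variables as sine/cosine pairs, so that a model sees, for example, 23:00 and 00:00 as neighbours. The sketch below is an illustrative assumption: it is not taken from the cited studies, and it only changes how periodicity is represented; it does not let a model extrapolate beyond its training range.

```python
import math
from typing import Tuple


def cyclic_encode(value: float, period: float) -> Tuple[float, float]:
    """Map a periodic variable (e.g. hour of day, wind direction) onto the unit
    circle, so that values at both ends of the cycle end up close together."""
    angle = 2.0 * math.pi * (value % period) / period
    return math.sin(angle), math.cos(angle)


# Hour 23 and hour 0 are adjacent in time; the raw values (23 vs 0) hide this,
# whereas the (sin, cos) pair preserves it.
h0, h12, h23 = (cyclic_encode(h, 24) for h in (0, 12, 23))
print(math.dist(h23, h0) < math.dist(h12, h0))  # True

# The same helper works for wind direction in degrees (period 360).
print(cyclic_encode(90, 360))
```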

Figure 11. Average performance of the global calibration models for the Modena dataset at the urban-background (UB) and urban-traffic (UT) sites.


3.2 Validity of the calibration transfer approach using a different dataset

Finally, to validate the reliability and effectiveness of our approach, we applied it to a different dataset collected in Modena, in the Po Valley, a European air pollution hotspot. The dataset was described and investigated in a previous work by Baruah et al. (2023). This allowed us to assess the robustness of the approach under different conditions and to identify the conditions necessary for its successful implementation. The Modena dataset consists of measurements obtained from 12 SUs deployed in Modena, Italy. Two different sites were selected for co-locating these SUs with reference stations: an urban-background site, where NO2, NO, and O3 reference measurements are available, and an urban-traffic site, where only NO and NO2 reference measurements are available. Figures S1 and S2 in the Supplement illustrate the temporal deployment of the Modena SUs and the pollutant concentrations measured by the reference instruments. The deployment periods are sparsely distributed and span approximately 20 months. Deployments at the urban-traffic site were the shortest, with some SUs deployed for only about 2 weeks. The Modena SUs are equipped with temperature and relative humidity sensors and three electrochemical sensors: NO2 (Alphasense NO2-B43F), NO (Alphasense NO-B4), and OX (Alphasense OX-B431). Following our calibration strategy, because an OX sensor is available, it is used as the source of O3 information. After analyzing this dataset, we identified the best feature combination, given in Eq. (1).

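$$
\begin{aligned}
\mathrm{NO_2} &= \text{function}\left(\mathrm{NO2\_we},\ \mathrm{NO2\_aux},\ \mathrm{NO\_we},\ \mathrm{NO\_aux},\ \mathrm{OX\_we},\ \mathrm{OX\_aux},\ T,\ \mathrm{RH}\right)\\
\mathrm{NO} &= \text{function}\left(\mathrm{NO2\_we},\ \mathrm{NO2\_aux},\ \mathrm{NO\_we},\ \mathrm{NO\_aux},\ \mathrm{OX\_we},\ \mathrm{OX\_aux},\ T,\ \mathrm{RH}\right)
\end{aligned}
\tag{1}
$$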
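Eq. (1) fixes the input vector of both the NO2 and NO models. The sketch below assembles that vector from one SU record in a fixed order and drops incomplete timestamps. The `_we` and `_aux` names are read here as the working- and auxiliary-electrode signals of each Alphasense cell; the dictionary layout, units, values, and the drop-incomplete-rows policy are hypothetical choices for illustration.

```python
import math
from typing import Dict, List, Optional, Sequence

# Feature order of Eq. (1); the NO2 and NO models share the same inputs.
EQ1_FEATURES = ("NO2_we", "NO2_aux", "NO_we", "NO_aux", "OX_we", "OX_aux", "T", "RH")


def to_feature_row(record: Dict[str, float],
                   names: Sequence[str] = EQ1_FEATURES) -> Optional[List[float]]:
    """Return the record's features in Eq. (1) order, or None if any is missing or NaN."""
    row: List[float] = []
    for name in names:
        value = record.get(name)
        if value is None or math.isnan(value):
            return None
        row.append(float(value))
    return row


def build_feature_matrix(records: Sequence[Dict[str, float]]) -> List[List[float]]:
    """Keep only complete records, so every row has all eight Eq. (1) features."""
    rows = (to_feature_row(r) for r in records)
    return [row for row in rows if row is not None]


# Hypothetical 1 h records from one SU (signals in mV, T in degC, RH in %).
records = [
    {"NO2_we": 250.0, "NO2_aux": 230.0, "NO_we": 310.0, "NO_aux": 290.0,
     "OX_we": 410.0, "OX_aux": 395.0, "T": 18.5, "RH": 62.0},
    {"NO2_we": 255.0, "NO2_aux": 231.0, "NO_we": 305.0, "NO_aux": 288.0,
     "OX_we": 415.0, "OX_aux": 396.0, "T": 19.0, "RH": float("nan")},
]
X = build_feature_matrix(records)
print(len(X), len(X[0]))  # 1 8
```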

We also performed a correlation analysis (see Table S7). It showed that all NO low-cost sensors, as well as some NO2 and OX low-cost sensors, correlate very poorly with their corresponding reference measurements at the urban-background site. Figure 11 shows the overall calibration transfer performance of the NO2 and NO models at the two sites; additional details are given in Figs. S3–S7. The results can be summarized as follows: (1) they are consistent with the results from the Switzerland dataset: RF outperforms MLR and SVR, calibration transfer within the same site (Case A) performs better than in Case B, and NO models transfer better than NO2 models. (2) Some models could not be transferred successfully, presumably because of low correlations (both pairwise and with the corresponding reference measurements), which are most prominent for the NO low-cost sensors at the urban-background site. Moreover, the sparse deployment of SUs at the urban-background site and the short co-location period at the urban-traffic site may limit the generalizability of the global models. (3) Despite their shorter co-location period, the urban-traffic data yielded better calibration transfer than the urban-background data, especially in Case B.
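The calibration transfer evaluated above amounts to fitting a model on the primary SU's co-location data and applying it unchanged to a secondary SU. A minimal sketch of that idea follows, using MLR (one of the three model types compared) implemented with plain least squares; the RF and SVR models of the study are not reproduced, and the two-feature dataset is synthetic and noise-free, so the transfer is perfect by construction.

```python
from math import sqrt
from typing import List, Sequence

Matrix = List[List[float]]


def solve_linear(a: Matrix, b: List[float]) -> List[float]:
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def fit_mlr(X: Sequence[Sequence[float]], y: Sequence[float]) -> List[float]:
    """Ordinary least squares with intercept, via the normal equations."""
    A = [[1.0, *row] for row in X]
    p = len(A[0])
    ata = [[sum(r[i] * r[j] for r in A) for j in range(p)] for i in range(p)]
    aty = [sum(r[i] * t for r, t in zip(A, y)) for i in range(p)]
    return solve_linear(ata, aty)


def predict(coef: Sequence[float], X: Sequence[Sequence[float]]) -> List[float]:
    return [coef[0] + sum(c * v for c, v in zip(coef[1:], row)) for row in X]


def rmse(y: Sequence[float], y_hat: Sequence[float]) -> float:
    return sqrt(sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y))


# Synthetic data: reference = 1 + 2*we - aux (hypothetical relation).
X_primary = [[1, 0], [2, 1], [3, 1], [4, 3], [5, 2]]
y_primary = [3, 4, 6, 6, 9]
X_secondary = [[2, 0], [6, 2]]
y_secondary = [5, 11]

coef = fit_mlr(X_primary, y_primary)   # trained on the primary SU only
y_pred = predict(coef, X_secondary)    # applied unchanged to the secondary SU
print([round(c, 6) for c in coef], round(rmse(y_secondary, y_pred), 6))
```

With real sensor data, the secondary-SU RMSE would of course be non-zero; the gap between primary and secondary performance is what the Case A and Case B comparisons measure.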

Based on our analysis of the Modena dataset, three main conditions appear to be required for the proposed calibration protocol to achieve the best transferability of calibration models: (1) high correlation (pairwise, as well as with the reference measurements); (2) a sufficiently long co-location period; and (3) multiple electrochemical sensors dedicated to the same pollutant, as in the Switzerland dataset, which can enhance data reliability. Several studies support this claim. For example, Bigi et al. (2018) showed that using a pair of sensors for NO2 and NO led to better calibration model performance than using a single sensor. Moreover, Smith et al. (2019) reported the effectiveness of employing an array of sensors rather than a single sensor: they used the instantaneous median signal of six identical electrochemical sensors for NO2 and O3, which minimized random drifts and inter-sensor differences and thus addressed some limitations of individual sensors.
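The array approach of Smith et al. (2019) rests on taking, at each timestamp, the median across several identical cells. The sketch below shows only that median step, assuming the cell signals are already time-aligned; the subsequent machine learning stage of their work is not reproduced, and the six signal series are hypothetical.

```python
from statistics import median
from typing import List, Sequence


def array_median(signals: Sequence[Sequence[float]]) -> List[float]:
    """Instantaneous (per-timestamp) median across identical sensors.

    signals[k][t] is the signal of sensor k at timestamp t; the sensors are
    assumed to be time-aligned and to have the same number of samples.
    """
    if len({len(s) for s in signals}) != 1:
        raise ValueError("all sensors must have the same number of samples")
    return [median(values) for values in zip(*signals)]


# Six hypothetical, identical NO2 cells; the last one shows a spurious spike at t = 0.
cells = [
    [10.0, 11.0, 12.0],
    [10.0, 11.0, 12.0],
    [11.0, 12.0, 13.0],
    [9.0, 10.0, 11.0],
    [10.0, 11.0, 12.0],
    [50.0, 11.0, 12.0],
]
print(array_median(cells))  # [10.0, 11.0, 12.0]
```

The median is used rather than the mean because a single faulty or drifting cell (the spike at t = 0) barely moves it.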

4 Conclusions

This study investigated the transferability of ML-based calibration models for NO2 and NO across identical low-cost SUs deployed at the same site and at distant locations within Switzerland. Moreover, it advocated enhancing the NO2 and NO global calibration models by incorporating O3 measurements from available nearby monitoring stations. This augmentation aims to mitigate the cross-sensitivity of the low-cost sensors to O3 in the absence of dedicated O3 low-cost sensors (i.e., OX sensors) and is expected to improve model performance. The results showed excellent calibration transferability between SUs located at the same site (Case A), with the average performance of the RF-based models being R2 = 0.90 ± 0.05 and RMSE = 3.4 ± 0.9 ppb for NO2 and R2 = 0.97 ± 0.02 and RMSE = 3.1 ± 0.8 ppb for NO. The results also showed good transferability between SUs deployed at distant locations (Case B), with an average performance of R2 = 0.65 ± 0.08 and RMSE = 5.5 ± 0.4 ppb for NO2 and R2 = 0.82 ± 0.05 and RMSE = 5.8 ± 0.8 ppb for NO. These results compare favorably with those reported in the existing literature.

Our study indicates that, to achieve optimal performance of the global calibration model, the sensors should correlate strongly with their corresponding reference measurements. Additionally, similar pollutant levels should occur at the primary and secondary SU locations, because some machine learning algorithms (e.g., tree-based models such as RF) cannot extrapolate beyond the range of the training data. Employing multiple electrochemical cells targeting the same pollutant within each SU may enhance data reliability, although care is needed to avoid overfitting. Finally, although this study demonstrated improved performance of the NO calibration models when O3 was incorporated, there is limited evidence in the literature to support this inclusion for NO.
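Tree-based models such as RF predict averages of training target values, so their output cannot leave the range seen during training. A simple pre-transfer check is therefore to measure how often the secondary SU's data fall outside the primary SU's training range. The sketch below does this per variable; the function names and the O3 values are hypothetical, and the same check can be applied to the reference concentrations.

```python
from typing import Dict, Sequence, Tuple


def training_ranges(train: Dict[str, Sequence[float]]) -> Dict[str, Tuple[float, float]]:
    """(min, max) of each variable in the primary SU's training data."""
    return {name: (min(values), max(values)) for name, values in train.items()}


def out_of_range_fraction(train: Dict[str, Sequence[float]],
                          test: Dict[str, Sequence[float]]) -> Dict[str, float]:
    """Per variable, the fraction of secondary-SU samples outside the training range."""
    ranges = training_ranges(train)
    fractions = {}
    for name, values in test.items():
        lo, hi = ranges[name]
        outside = sum(1 for v in values if v < lo or v > hi)
        fractions[name] = outside / len(values)
    return fractions


# Hypothetical O3 levels (ppb) at a primary and a secondary location.
primary = {"O3": [5.0, 20.0, 35.0, 50.0]}
secondary = {"O3": [10.0, 40.0, 60.0, 70.0]}
print(out_of_range_fraction(primary, secondary))  # {'O3': 0.5}
```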

To conclude, our study provides new insights into the capability of ML models to generalize calibration models and emphasizes the importance of using publicly available data sources to improve the reliability of low-cost air quality sensors.

Appendix A: Evaluation metrics and raw results of calibration transfer approach

Three metrics were used to evaluate the overall calibration performance: R2, RMSE, and mean absolute error (MAE), given in Eqs. (A1)–(A3), respectively (Jolliff et al., 2009).


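$$R^2 = 1 - \frac{\sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n} \left(y_i - \bar{y}\right)^2} \tag{A1}$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2} \tag{A2}$$

$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left|y_i - \hat{y}_i\right| \tag{A3}$$

where y denotes the reference measurements, ŷ the values predicted by the calibration model, ȳ the mean of the reference values, and n the number of samples. R2 measures how much of the variation in the dependent variable (i.e., the reference measurements) is explained by the independent variables (features); it equals 1 for a perfect fit, typically lies between 0 and 1, and becomes negative when the predictions are worse than the reference mean. RMSE and MAE quantify the deviation between the calibrated values and their corresponding reference values.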
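The three metrics can be computed directly from paired reference and calibrated values. The sketch below assumes that R2 is the coefficient of determination, 1 − Σ(y − ŷ)² / Σ(y − ȳ)²; the input values are hypothetical.

```python
from math import sqrt
from typing import Sequence


def r2(y: Sequence[float], y_hat: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_bar = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, y_hat))
    ss_tot = sum((a - y_bar) ** 2 for a in y)
    return 1.0 - ss_res / ss_tot


def rmse(y: Sequence[float], y_hat: Sequence[float]) -> float:
    """Root mean square error."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y))


def mae(y: Sequence[float], y_hat: Sequence[float]) -> float:
    """Mean absolute error."""
    return sum(abs(a - b) for a, b in zip(y, y_hat)) / len(y)


# Hypothetical reference and calibrated NO2 values (ppb).
y_ref = [10.0, 20.0, 30.0, 40.0]
y_cal = [12.0, 18.0, 33.0, 37.0]
print(round(r2(y_ref, y_cal), 3), round(rmse(y_ref, y_cal), 3), mae(y_ref, y_cal))  # 0.948 2.55 2.5
```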

Data availability

All raw data can be provided by the authors upon request.


The supplement related to this article is available online at:

Author contributions

AAH and JC conceived the study, with AB contributing the datasets. AAH designed the study's methodology, prepared the data, and conducted analysis and machine learning modeling. AAH, JC, VB, AW, and AB contributed to the analysis of the results. JC provided comprehensive supervision throughout the project. AAH wrote the paper, which was revised and edited by JC, VB, AW, and AB.

Competing interests

The contact author has declared that none of the authors has any competing interests.


Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims made in the text, published maps, institutional affiliations, or any other geographical representation in this paper. While Copernicus Publications makes every effort to include appropriate place names, the final responsibility lies with the authors.


Acknowledgements

The TUM authors wish to express their thanks to their funders: the German Academic Exchange Service (DAAD) and the Institute for Advanced Study at the Technical University of Munich. Alessandro Bigi acknowledges funding from the European Union NextGenerationEU program.

Financial support

The TUM authors are supported by the German Academic Exchange Service (Deutscher Akademischer Austauschdienst, DAAD) (grant no. 57552340) and the Institute for Advanced Study at the Technical University of Munich (grant no. 291763). Alessandro Bigi is supported by the “ECOSISTER” project (grant no. CUP E93C22001100001), funded by the European Union NextGenerationEU program, under the National Recovery and Resilience Plan (NRRP) Mission 4 Component 2 Investment Line 1.5.

Review statement

This paper was edited by Albert Presto and reviewed by two anonymous referees.


References

Alphasense Ltd: Technical specifications Version 1.0 for NO2-B43F, September 2022 (last access: 1 September 2022), 2022.

Baruah, A., Zivan, O., Bigi, A., and Ghermandi, G.: Evaluation of low-cost gas sensors to quantify intra-urban variability of atmospheric pollutants, Environmental Science: Atmospheres, 3, 830–841, 2023.

Beckwith, M., Bates, E., Gillah, A., and Carslaw, N.: NO2 hotspots: are we measuring in the right places?, Atmospheric Environment: X, 2, 100025, 2019.

Bigi, A., Mueller, M., Grange, S. K., Ghermandi, G., and Hueglin, C.: Performance of NO, NO2 low cost sensors and three calibration approaches within a real world application, Atmos. Meas. Tech., 11, 3717–3735, 2018.

Borrego, C., Costa, A., Ginja, J., Amorim, M., Coutinho, M., Karatzas, K., Sioumis, T., Katsifarakis, N., Konstantinidis, K., De Vito, S., Esposito, E., Smith, P., André, N., Gérard, P., Francis, L. A., Castell, N., Schneider, P., Viana, M., Minguillón, M. C., Reimringer, W., Otjes, R. P., von Sicard, O., Pohle, R., Elen, B., Suriano, D., Pfister, V., Prato, M., Dipinto, S., and Penza, M.: Assessment of air quality microsensors versus reference methods: The EuNetAir joint exercise, Atmos. Environ., 147, 246–263, 2016.

Breiman, L.: Random forests, Mach. Learn., 45, 5–32, 2001.

deSouza, P., Kahn, R., Stockman, T., Obermann, W., Crawford, B., Wang, A., Crooks, J., Li, J., and Kinney, P.: Calibrating networks of low-cost air quality sensors, Atmos. Meas. Tech., 15, 6309–6328, 2022.

Giordano, M. R., Malings, C., Pandis, S. N., Presto, A. A., McNeill, V., Westervelt, D. M., Beekmann, M., and Subramanian, R.: From low-cost sensors to high-quality data: A summary of challenges and best practices for effectively calibrating low-cost particulate matter mass sensors, J. Aerosol Sci., 158, 105833, 2021.

Grover, A., Kapoor, A., and Horvitz, E.: A deep hybrid model for weather forecasting, in: KDD '15: The 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Sydney, NSW, Australia, 10–13 August 2015, Association for Computing Machinery, 379–386, 2015.

Hossain, M., Saffell, J., and Baron, R.: Differentiating NO2 and O3 at low cost air quality amperometric gas sensors, ACS Sensors, 1, 1291–1294, 2016.

Ionascu, M.-E., Castell, N., Boncalo, O., Schneider, P., Darie, M., and Marcu, M.: Calibration of CO, NO2, and O3 using airify: A low-cost sensor cluster for air quality monitoring, Sensors, 21, 7977, 2021.

Jolliff, J. K., Kindle, J. C., Shulman, I., Penta, B., Friedrichs, M. A., Helber, R., and Arnone, R. A.: Summary diagrams for coupled hydrodynamic-ecosystem model skill assessment, J. Marine Syst., 76, 64–82, 2009.

Karagulian, F., Barbiere, M., Kotsev, A., Spinelle, L., Gerboles, M., Lagler, F., Redon, N., Crunaire, S., and Borowiak, A.: Review of the performance of low-cost sensors for air quality monitoring, Atmosphere, 10, 506, 2019.

Kelly, F. J. and Fussell, J. C.: Air pollution and public health: emerging hazards and improved understanding of risk, Environ. Geochem. Hlth., 37, 631–649, 2015.

Kim, J., Shusterman, A. A., Lieschke, K. J., Newman, C., and Cohen, R. C.: The BErkeley Atmospheric CO2 Observation Network: field calibration and evaluation of low-cost air quality sensors, Atmos. Meas. Tech., 11, 1937–1946, 2018.

Kureshi, R. R., Mishra, B. K., Thakker, D., John, R., Walker, A., Simpson, S., Thakkar, N., and Wante, A. K.: Data-driven techniques for low-cost sensor selection and calibration for the use case of air quality monitoring, Sensors, 22, 1093, 2022.

Li, J., Hauryliuk, A., Malings, C., Eilenberg, S. R., Subramanian, R., and Presto, A. A.: Characterizing the aging of Alphasense NO2 sensors in long-term field deployments, ACS Sensors, 6, 2952–2959, 2021.

Maag, B., Zhou, Z., and Thiele, L.: A survey on sensor calibration in air pollution monitoring deployments, IEEE Internet Things, 5, 4857–4870, 2018.

Malings, C., Tanzer, R., Hauryliuk, A., Kumar, S. P. N., Zimmerman, N., Kara, L. B., Presto, A. A., and Subramanian, R.: Development of a general calibration model and long-term performance evaluation of low-cost sensors for air pollutant gas monitoring, Atmos. Meas. Tech., 12, 903–920, 2019.

Masson, N., Piedrahita, R., and Hannigan, M.: Quantification method for electrolytic sensors in long-term monitoring of ambient air quality, Sensors, 15, 27283–27302, 2015.

Mead, M., Popoola, O., Stewart, G., Landshoff, P., Calleja, M., Hayes, M., Baldovi, J., McLeod, M., Hodgson, T., Dicks, J., Lewis, A., Cohen, J., Baron, R., Saffell, J. R., and Jones, R. L.: The use of electrochemical sensors for monitoring urban air quality in low-cost, high-density networks, Atmos. Environ., 70, 186–203, 2013.

Miech, J. A., Stanton, L., Gao, M., Micalizzi, P., Uebelherr, J., Herckes, P., and Fraser, M. P.: Calibration of low-cost NO2 sensors through environmental factor correction, Toxics, 9, 281, 2021.

Mijling, B., Jiang, Q., de Jonge, D., and Bocconi, S.: Field calibration of electrochemical NO2 sensors in a citizen science context, Atmos. Meas. Tech., 11, 1297–1312, 2018.

Mueller, M., Meyer, J., and Hueglin, C.: Design of an ozone and nitrogen dioxide sensor unit and its long-term operation within a sensor network in the city of Zurich, Atmos. Meas. Tech., 10, 3783–3799, 2017.

Munir, S., Mayfield, M., Coca, D., Jubb, S. A., and Osammor, O.: Analysing the performance of low-cost air quality sensors, their drivers, relative benefits and calibration in cities—A case study in Sheffield, Environ. Monit. Assess., 191, 1–22, 2019.

Nowack, P., Konstantinovskiy, L., Gardiner, H., and Cant, J.: Machine learning calibration of low-cost NO2 and PM10 sensors: non-linear algorithms and their impact on site transferability, Atmos. Meas. Tech., 14, 5637–5655, 2021.

Okorn, K. and Hannigan, M.: Improving Air Pollutant Metal Oxide Sensor Quantification Practices through: An Exploration of Sensor Signal Normalization, Multi-Sensor and Universal Calibration Model Generation, and Physical Factors Such as Co-Location Duration and Sensor Age, Atmosphere, 12, 645, 2021.

Papaconstantinou, R., Demosthenous, M., Bezantakos, S., Hadjigeorgiou, N., Costi, M., Stylianou, M., Symeou, E., Savvides, C., and Biskos, G.: Field evaluation of low-cost electrochemical air quality gas sensors under extreme temperature and relative humidity conditions, Atmos. Meas. Tech., 16, 3313–3329, 2023.

Rodriguez, J. D., Perez, A., and Lozano, J. A.: Sensitivity analysis of k-fold cross validation in prediction error estimation, IEEE T. Pattern Anal., 32, 569–575, 2009.

Sahu, R., Nagal, A., Dixit, K. K., Unnibhavi, H., Mantravadi, S., Nair, S., Simmhan, Y., Mishra, B., Zele, R., Sutaria, R., Motghare, V. M., Kar, P., and Tripathi, S. N.: Robust statistical calibration and characterization of portable low-cost air quality monitoring sensors to quantify real-time O3 and NO2 concentrations in diverse environments, Atmos. Meas. Tech., 14, 37–52, 2021.

Schneider, P., Bartonova, A., Castell, N., Dauge, F. R., Gerboles, M., Hagler, G. S., Huglin, C., Jones, R. L., Khan, S., Lewis, A. C., Mijling, B., Müller, M., Penza, M., Spinelle, L., Stacey, B., Vogt, M., Wesseling, J., and Williams, R. W.: Toward a unified terminology of processing levels for low-cost air-quality sensors, Environ. Sci. Technol., 53, 8485–8487, 2019.

Schultz, M. G., Betancourt, C., Gong, B., Kleinert, F., Langguth, M., Leufen, L. H., Mozaffari, A., and Stadtler, S.: Can deep learning beat numerical weather prediction?, Philos. T. Roy. Soc. A, 379, 20200097, 2021.

Smith, K. R., Edwards, P. M., Ivatt, P. D., Lee, J. D., Squires, F., Dai, C., Peltier, R. E., Evans, M. J., Sun, Y., and Lewis, A. C.: An improved low-power measurement of ambient NO2 and O3 combining electrochemical sensor clusters and machine learning, Atmos. Meas. Tech., 12, 1325–1336, 2019.

Snyder, E. G., Watkins, T. H., Solomon, P. A., Thoma, E. D., Williams, R. W., Hagler, G. S., Shelow, D., Hindin, D. A., Kilaru, V. J., and Preuss, P. W.: The changing paradigm of air pollution monitoring, Environ. Sci. Technol., 47, 11369–11377, 2013.

Spinelle, L., Gerboles, M., Villani, M. G., Aleixandre, M., and Bonavitacola, F.: Field calibration of a cluster of low-cost available sensors for air quality monitoring. Part A: Ozone and nitrogen dioxide, Sensor. Actuat. B-Chem., 215, 249–257, 2015.

Spinelle, L., Kotsev, A., Signorini, M., and Gerboles, M.: Evaluation of low-cost sensors for air pollution monitoring: Effect of gaseous interfering compounds and meteorological conditions, EUR 28601 EN, Publications Office of the European Union, 2017.

Suriano, D. and Penza, M.: Assessment of the performance of a low-cost air quality monitor in an indoor environment through different calibration models, Atmosphere, 13, 567, 2022.

Tagle, M., Rojas, F., Reyes, F., Vásquez, Y., Hallgren, F., Lindén, J., Kolev, D., Watne, Å. K., and Oyola, P.: Field performance of a low-cost sensor in the monitoring of particulate matter in Santiago, Chile, Environ. Monit. Assess., 192, 171, 2020.

Van Zoest, V., Osei, F. B., Stein, A., and Hoek, G.: Calibration of low-cost NO2 sensors in an urban air quality network, Atmos. Environ., 210, 66–75, 2019.

Vikram, S., Collier-Oxandale, A., Ostertag, M. H., Menarini, M., Chermak, C., Dasgupta, S., Rosing, T., Hannigan, M., and Griswold, W. G.: Evaluating and improving the reliability of gas-phase sensor system calibrations across new locations for ambient measurements and personal exposure monitoring, Atmos. Meas. Tech., 12, 4211–4239, 2019.

Wang, A., Machida, Y., deSouza, P., Mora, S., Duhl, T., Hudda, N., Durant, J. L., Duarte, F., and Ratti, C.: Leveraging machine learning algorithms to advance low-cost air sensor calibration in stationary and mobile settings, Atmos. Environ., 301, 119692, 2023.

WHO: Health aspects of air pollution: results from the WHO project "Systematic review of health aspects of air pollution in Europe", WHO Regional Office for Europe, Report Nr. E83080, p. 30, 2004.

Zhu, Y., Chen, J., Bi, X., Kuhlmann, G., Chan, K. L., Dietrich, F., Brunner, D., Ye, S., and Wenig, M.: Spatial and temporal representativeness of point measurements for nitrogen dioxide pollution levels in cities, Atmos. Chem. Phys., 20, 13241–13251, 2020.

Zimmerman, N., Presto, A. A., Kumar, S. P. N., Gu, J., Hauryliuk, A., Robinson, E. S., Robinson, A. L., and Subramanian, R.: A machine learning calibration model using random forests to improve sensor performance for lower-cost air quality monitoring, Atmos. Meas. Tech., 11, 291–313, 2018.

Short summary
This study examined the transferability of machine learning calibration models among low-cost sensor units targeting NO2 and NO. The global models were evaluated under similar and different emission conditions. To counter cross-sensitivity, the study proposed integrating O3 measurements from nearby reference stations in Switzerland. The models improved substantially when O3 measurements were incorporated, and the improvement was more pronounced in regions with elevated O3 concentrations.