This work is distributed under the Creative Commons Attribution 4.0 License.
Low-Cost Air Quality Sensor Evaluation and Calibration in Contrasting Aerosol Environments
Abstract. The use of low-cost sensors (LCS) in air quality monitoring has been gaining interest across all walks of society, including community and citizen scientists, academic research groups, environmental agencies, and the private sector. Traditional air monitoring, performed by regulatory agencies, involves expensive regulatory-grade equipment and requires ongoing maintenance and quality control checks. The low price tag, minimal operating cost, ease of use, and open data access are the primary driving factors behind the popularity of LCS. This study discusses the role and associated challenges of PM2.5 sensors in monitoring air quality. We present the results of evaluations of the PurpleAir (PA) PA-II LCS against regulatory-grade PM2.5 federal equivalent methods (FEM) and the development of sensor calibration algorithms. The LCS calibration was performed for 2 to 4 weeks during December 2019–January 2020 in Raleigh, NC, and Delhi, India, to evaluate data quality under different aerosol loadings and environmental conditions. This exercise aims to develop a robust calibration model that takes PA-measured parameters (i.e., PM2.5, temperature, relative humidity) as input and provides bias-corrected PM2.5 output at an hourly scale. The calibration model therefore relies on simultaneous FEM PM2.5 measurements as the target output during model development. We applied various statistical and machine learning methods to achieve a regional calibration model. The results from our study indicate that, with proper calibration, PA sensors can provide bias-corrected PM2.5 data within 12 % mean absolute percentage bias at the hourly scale and within 6 % for daily averages. Our study also suggests that pre-deployment calibrations developed at local or regional scales should be performed for PA sensors so that field data can be corrected for scientific analysis.
Status: closed
RC1: 'Comment on AMT-2022-140', Anonymous Referee #1, 05 Jul 2022
General Comments
This manuscript describes the development and performance of a machine learning algorithm that used PM2.5 concentrations, temperatures, and relative humidity values reported by PurpleAir sensors to predict the PM2.5 concentration that would be reported by a PM2.5 monitor with the U.S. Environmental Protection Agency Federal Equivalent Method (FEM) designation. Two different machine learning algorithms were trained and tested using data from Raleigh, NC, USA and Delhi, India, respectively. It's nice that the authors examined the performance of the PurpleAir sensors in two locations with very different levels of PM2.5 pollution. I'm most concerned that a) the authors might not have used appropriate techniques to split their data into training and validation sets and b) model performance metrics seem to be presented for predictions calculated using the same data that were used to train the model. I'm also concerned that most of the PurpleAir sensors tested in Raleigh were not collocated with the FEM and it's not clear what was done with the data from the non-collocated sensors.
Specific Comments
1. Page 7, Lines 15-19: “In order to develop the final sensor calibration algorithm, we used the following steps: 1) quality control the PA data; 2) randomly divide the data into training (75%) and validation (25%) datasets; 3) train the algorithm using the training dataset and validate using the validation dataset; and 4) repeat steps 2 and 3 ten times (i.e., 10-fold cross validation) using random data selection.”
My first concern is that steps 3 and 4 seem inconsistent with each other and with the authors’ presentation of their results. 10-fold cross validation typically implies that: i) the data were split into ten folds, ii) nine folds (90% of the data) were used to train the model while the remaining fold (10% of the data) was used to test the model, and then iii) step ii was repeated for all ten folds. I don’t see how a 75% training/25% validation split corresponds to 10-fold cross validation. Why are results presented for both the training and validation data in Table 1 and Figure 5? Performance metrics should not be shown for predictions generated using the same data that were used to train the model. How come training/validation data are shown separately in Figures 4 and 5, but together in Figures 6, 7, 8, and 9? Were the results shown in Figures 6–9 generated using 10-fold cross validation? Or were 75% of the data points shown in Figures 6–9 generated from the same data used to train the models?
My second concern is that the authors should not use random data selection to split the data into 75% training/25% validation or to split the data into folds for cross-validation. The assumption that observations are independent and identically distributed does not apply to grouped data or to time series data. The authors’ data are both grouped and time-series. The data are grouped by PurpleAir sensor: the authors have multiple PurpleAir sensors and multiple observations from each of those sensors. Data from each individual PurpleAir sensor might be dependent upon properties of that sensor. Time-series data are autocorrelated: each observation is related to some number of observations recorded immediately before and immediately after that observation. One approach for time-series data is to split up the data such that the training and validation sets do not overlap in time. If the training and validation data sets overlap in time, then the metrics reported for the validation set represent an overly-optimistic picture of the model’s performance.
References:
- https://scikit-learn.org/stable/modules/cross_validation.html#group-cv
- https://scikit-learn.org/stable/modules/cross_validation.html#timeseries-cv
2. Section 3.2: Only 5 of the PurpleAir sensors were collocated with the FEM in Research Triangle Park for two weeks. The other 84 PurpleAir sensors collected data at a residential location in Apex, NC. Were data from the 84 PurpleAir sensors that were never collocated with the FEM also used to train and validate the model? Or did the authors only use data from the 5 sensors that were collocated with the FEM to train and validate the model? Are data from the 84 sensors that were not collocated with the FEM included in any of the results presented in the manuscript (outside of Figure S4)? If data from those 84 sensors were used to train and validate the model, how were those data treated? Were those 84 sensors assumed to measure the same pollution as the FEM? How far from the FEM was the location in Apex? Please clarify the methods in the manuscript.
3. Page 3, Lines 34-35: “The PMS 5003 OPS is a nephelometer that measures particle loading through light scattering (wavelength~650 nm) (Hagan and Kroll, 2020a).”: Hagan and Kroll (2020a) is a good reference for this statement, but Ouimette et al. (https://doi.org/10.5194/amt-15-655-2022) provides a more comprehensive description of PMS 5003 operation.
4. Page 3, Line 42 through Page 4, Line 1: “The primary sensor-reported data include PM1, PM2.5, and PM10 concentrations with a factory-specified correction factor for ambient measurements (CF=ATM), concentrations with CF=1 factor recommended by the manufacturer for use in indoor environments…” I don’t think it’s sufficient or appropriate to simply state the manufacturer-provided specifications here given that: a) there doesn’t seem to be any scientifically-valid reason to use the CF=ATM stream instead of the CF=1 stream for ambient monitoring and b) more detailed information on the PMS 5003 outputs is available from other sources. For sensor-reported concentrations below 30 μg m-3, the PM2.5 CF=ATM stream is equal to the PM2.5 CF=1 stream. For sensor reported concentrations above 30 μg m-3, the PM2.5 CF=ATM stream has a nonlinear response and reports lower concentrations than the PM2.5 CF=1 stream.
See the information provided under the heading “Comparison between Std. Particle and Std. Atmosphere” on the aqicn.org page for the PMS 5003 (https://aqicn.org/sensor/pms5003-7003/).
Also see Section 2.2.1 on page 4619 of Barkjohn et al. (https://doi.org/10.5194/amt-14-4617-2021): “The two data columns have a [cf_atm] / [cf_1] = 1 relationship below roughly 25 μg m-3 (as reported by the sensor) and then transition to a two-thirds ratio at higher concentration ([cf_1] concentrations are higher).” Additionally, Barkjohn et al. found that the PM2.5 CF=1 stream was more strongly correlated with 24-hour average ambient PM2.5 concentrations measured using FRM and FEM instruments than the PM2.5 CF=ATM stream.
Did the authors evaluate whether their model performed any better or worse if trained with the PM2.5 CF=1 stream instead of the PM2.5 CF=ATM stream?
5. Page 10, Line 10: Is there a US EPA AQS site identification number for this location?
6. Page 10, Line 25: “…suggesting an overall underestimation by PA in clean conditions (PM2.5 < 10 µg m-3).” I don’t think this generalization is appropriate. Other U.S.-based studies (Barkjohn et al.: https://doi.org/10.5194/amt-14-4617-2021; Tryner et al.: https://doi.org/10.1016/j.atmosenv.2019.117067) found that PurpleAir sensors overestimated ambient PM2.5 concentrations, so while it is true that the mean bias was < 0 for the authors’ dataset, similar results are not always observed at low concentrations.
7. Page 10, Lines 31-32: “It is also important to note that the chemical composition of particles in Delhi and Raleigh is expected to be different.” This statement is true: There are likely to be differences in the sources of particles in Delhi and Raleigh and, as a result, the chemical composition of the particulate matter is likely to vary between these two locations, but I think differences in particle size distribution are likely to affect PurpleAir accuracy more than differences in optical properties and particle density. See Hagan and Kroll (https://doi.org/10.5194/amt-13-6343-2020) and Ouimette et al. (https://doi.org/10.5194/amt-15-655-2022).
8. Page 10, Lines 32–33: Hagan et al. (https://doi.org/10.1021/acs.estlett.9b00393) also report data on the composition of ambient PM in Delhi in the winter.
9. Page 13, Line 8: “We also evaluated the importance of each input parameter…” Please include the methods used for this analysis in the manuscript.
10. Page 21, Lines 6-7: “The performance of the model for other seasons (and thus for other weather conditions) will need to be evaluated as part of future work.” Differences in temperature and humidity are not the only reason why the model might perform differently in other seasons. I suggest revising this text to reflect the fact that differences in model performance across seasons are likely to also be influenced by differences in pollution sources. Several prior publications have described how pollution sources and PurpleAir performance vary across seasons in locations around the world:
- Sayahi et al., 2019, Salt Lake City, Utah, USA: https://doi.org/10.1016/j.envpol.2018.11.065
- McFarlane et al., 2021, Accra, Ghana: https://doi.org/10.1021/acsearthspacechem.1c00217
- Raheja et al., 2022, Lomé, Togo: https://doi.org/10.1021/acsearthspacechem.1c00391
Technical Corrections
11. Page 2, Lines 22-23: Barkjohn et al., 2021 (https://doi.org/10.5194/amt-14-4617-2021) seems to be missing from this list of citations.
12. Figures 2, 3, 4, 5, 6, 7, 9, 10, and 11: The two sites are referred to as “Delhi” and “Raleigh” throughout the text, but as “India” and “NC” in the figures. Please update the plot labels to also refer to the two sites as Delhi and Raleigh.
13. Figures 2, 3, 4, 5, 6, 8, 9, 10, and 11: All x- and y-axes should have descriptive labels with units (not just variable names from the authors’ code).
- I assume the units are μg m-3 on all x- and y-axes in Figures 2, 3, 4, 5, and 6?
- The “PA (ML CALIBRATED – Daily)” labels on the y axes in Figure 6 make it look like one value has been subtracted from another, but I don’t think that’s what the authors are showing in those plots. How about “24-h average ML-calibrated PA PM2.5 (μg m-3)”?
- I assume the units are μg m-3 on the y-axes in Figures 8 and 9?
- I assume that all PM2.5 concentrations shown on in Figures 10 and 11 are in μg m-3?
14. Figures 4, 5, and 6: “The color scale represents the density of data points.” I do not understand what this means.
15. Figure S1: What do the red lines represent? +/- 1 standard deviation? Please explain in the figure caption.
16. Figure S3.1: This image is not legible. Please provide a higher-resolution image.
17. Figure S4: Please use more informative axis labels. Why does the caption say “ADD THIS FIGURE”?
Citation: https://doi.org/10.5194/amt-2022-140-RC1
AC1: 'Reply on RC1', Pawan Gupta, 28 Sep 2022
General Comments
This manuscript describes the development and performance of a machine learning algorithm that used PM2.5 concentrations, temperatures, and relative humidity values reported by PurpleAir sensors to predict the PM2.5 concentration that would be reported by a PM2.5 monitor with the U.S. Environmental Protection Agency Federal Equivalent Method (FEM) designation. Two different machine learning algorithms were trained and tested using data from Raleigh, NC, USA and Delhi, India, respectively. It's nice that the authors examined the performance of the PurpleAir sensors in two locations with very different levels of PM2.5 pollution. I'm most concerned that a) the authors might not have used appropriate techniques to split their data into training and validation sets and b) model performance metrics seem to be presented for predictions calculated using the same data that were used to train the model. I'm also concerned that most of the PurpleAir sensors tested in Raleigh were not collocated with the FEM and it's not clear what was done with the data from the non-collocated sensors.
Response:
We thank the reviewer for the feedback. We apologize for the confusion caused by unclear language in the text. We have revised the manuscript text to clarify the approach. We present our detailed responses below under specific comments, but provide a high-level summary here:
- The model development was done in two steps. The first step aimed to identify the appropriate machine learning algorithm (MLA), while the second step refined it further. For the second step, we used the standard 10-fold cross-validation technique. Section 4.2 has been revised to clarify our approach.
- We present performance metrics for both training & validation datasets separately for the model development portion. Table 1, Figure 5, and Figure S3 all show training and validation results separately. Once the final model is chosen, we present our analysis using the combined dataset.
- Due to logistical challenges, only five PurpleAir (PA) sensors were collocated with FEM in the Raleigh region, and the rest of the PA sensors were collocated with those five PA sensors. Model development for Raleigh is performed only using the data from the 5 PA sensors. The model is then applied to the data from the remaining 84 sensors and compared against the data from the 5 sensors (Figure S4). This is explained in section 3.2, and the results are discussed in section 5.3. We have further edited section 5.3 to make it clearer.
Specific Comments
- Page 7, Lines 15-19: “In order to develop the final sensor calibration algorithm, we used the following steps: 1) quality control the PA data; 2) randomly divide the data into training (75%) and validation (25%) datasets; 3) train the algorithm using the training dataset and validate using the validation dataset; and 4) repeat steps 2 and 3 ten times (i.e., 10-fold cross validation) using random data selection.”
My first concern is that steps 3 and 4 seem inconsistent with each other and with the authors’ presentation of their results. 10-fold cross validation typically implies that: i) the data were split into ten folds, ii) nine folds (90% of the data) were used to train the model while the remaining fold (10% of the data) was used to test the model, and then iii) step ii was repeated for all ten folds. I don’t see how a 75% training/25% validation split corresponds to 10-fold cross validation. Why are results presented for both the training and validation data in Table 1 and Figure 5? Performance metrics should not be shown for predictions generated using the same data that were used to train the model. How come training/validation data are shown separately in Figures 4 and 5, but together in Figures 6, 7, 8, and 9? Were the results shown in Figures 6–9 generated using 10-fold cross validation? Or were 75% of the data points shown in Figures 6–9 generated from the same data used to train the models?
Response
- Thanks for bringing this to our attention. We have revised the text in sections 4.2 and 5.3 to clarify the approach. As described in section 4.2, the model development is performed in two steps. The first step tests several different models to identify the best performing MLA; in this exercise, the data are split into 75 % training and 25 % validation (Table 1, Figure 5) to select the best performing MLA for further analysis. Figure 5 shows the training and validation results for the Random Forest (RF) MLA summarized in Table 1. The second step trains the chosen MLA (i.e., Random Forest) on the sensor data for final model development. In this second step, we used a 10-fold cross-validation approach (90 % training, 10 % validation) to train the final ML model (Figure S3), and an ensemble of the 10 fold models was used as the final model (see the illustrative sketch at the end of this response). The text has been revised to clear up this confusion.
- Model performance results are shown for both training and testing for the model development and finalization steps (Table 1, Figure 5, Figure S3). Showing both datasets reveals how similar or different the model performance is between the two and provides further insight into the model. Our results show consistent model behavior, suggesting an optimized model.
- After the model was optimized, the remaining analyses use the combined dataset to ensure a large sample size. Results for the daily mean and other error dependencies are shown in Figures 6-11.
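For readers who want a concrete picture of this two-step workflow, a minimal sketch is given below. It uses scikit-learn (the library named in the manuscript), but the file name, column names, and hyperparameters are hypothetical placeholders, not the authors' actual code.

```python
# Minimal sketch of the two-step workflow described above (hypothetical
# column names and settings; not the authors' actual code).
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import KFold, train_test_split

df = pd.read_csv("collocated_hourly.csv")              # hypothetical input file
X = df[["pa_pm25_atm", "temperature", "rh"]].values    # PA-reported inputs
y = df["fem_pm25"].values                              # collocated FEM target

# Step 1: 75 % / 25 % random split, used to compare candidate MLAs
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.25, random_state=0)
rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_tr, y_tr)
print("RF validation MAE:", mean_absolute_error(y_val, rf.predict(X_val)))

# Step 2: 10-fold cross-validation of the chosen MLA (Random Forest); the
# model trained on each fold is kept, and the final calibration is the
# ensemble mean of the 10 fold models.
models = []
for train_idx, test_idx in KFold(n_splits=10, shuffle=True, random_state=0).split(X):
    m = RandomForestRegressor(n_estimators=100, random_state=0)
    m.fit(X[train_idx], y[train_idx])
    print("fold MAE:", mean_absolute_error(y[test_idx], m.predict(X[test_idx])))
    models.append(m)

def calibrate(X_new):
    """Bias-corrected PM2.5: mean prediction of the 10 fold-trained models."""
    return np.mean([m.predict(X_new) for m in models], axis=0)
```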
My second concern is that the authors should not use random data selection to split the data into 75% training/25% validation or to split the data into folds for cross-validation. The assumption that observations are independent and identically distributed does not apply to grouped data or to time series data. The authors’ data are both grouped and time-series. The data are grouped by PurpleAir sensor: the authors have multiple PurpleAir sensors and multiple observations from each of those sensors. Data from each individual PurpleAir sensor might be dependent upon properties of that sensor. Time-series data are autocorrelated: each observation is related to some number of observations recorded immediately before and immediately after that observation. One approach for time-series data is to split up the data such that the training and validation sets do not overlap in time. If the training and validation data sets overlap in time, then the metrics reported for the validation set represent an overly-optimistic picture of the model’s performance.
References:
- https://scikit-learn.org/stable/modules/cross_validation.html#group-cv
- https://scikit-learn.org/stable/modules/cross_validation.html#timeseries-cv
Response
- We appreciate the reviewer’s comments and suggestions. We agree that it would have been useful to divide the data into non-overlapping time periods. If this collocation had been conducted for one year (which is logistically very difficult for 50-100 sensors), we would have been able to split the data in more than one way and assess the model performance over different time windows. Unfortunately, given the nature of this calibration exercise and the limited number of days available for calibration collocation (2 to 4 weeks) within a single season, dividing the data into separate time periods would make the datasets very small and would not produce enough samples to generate useful statistics. In this exercise, we aim to calibrate PA sensors against collocated regulatory-grade monitors. As discussed in sections 2 & 5, PA sensor performance depends on the sensor's own properties, meteorological conditions (RH & T), and PM2.5 loading and characteristics, and our input variables account for all of these aspects. The data are randomly split into 10 folds, and the model’s consistent performance (Fig. S3) clearly demonstrates that the trained model is optimized.
- As mentioned in the manuscript, we are currently conducting a year-long collocation of 1 to 2 sensors against the FEM. Once data become available from this longer collocation, we will be able to assess the suggested approach (sketched below for reference) as part of future work.
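For completeness, a minimal sketch of the two scikit-learn splitters the reviewer references is shown below; the arrays and sensor IDs are synthetic placeholders, and this is not part of the manuscript's analysis.

```python
# Sketch of group-aware and time-ordered cross-validation splits in
# scikit-learn, as suggested in the reviewer's references. All data here
# are synthetic placeholders for illustration only.
import numpy as np
from sklearn.model_selection import GroupKFold, TimeSeriesSplit

rng = np.random.default_rng(0)
n = 1000
X = rng.normal(size=(n, 3))             # e.g., PA PM2.5, temperature, RH
y = rng.normal(size=n)                  # e.g., collocated FEM PM2.5
sensor_id = rng.integers(0, 5, size=n)  # grouping variable: which PA unit

# Grouped CV: all observations from a given sensor land in the same fold,
# so validation sensors are never seen during training.
for train_idx, test_idx in GroupKFold(n_splits=5).split(X, y, groups=sensor_id):
    assert set(sensor_id[train_idx]).isdisjoint(sensor_id[test_idx])

# Time-series CV: training indices always precede validation indices
# (rows are assumed to be ordered chronologically).
for train_idx, test_idx in TimeSeriesSplit(n_splits=5).split(X):
    assert train_idx.max() < test_idx.min()
```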
- Section 3.2: Only 5 of the PurpleAir sensors were collocated with the FEM in Research Triangle Park for two weeks. The other 84 PurpleAir sensors collected data at a residential location in Apex, NC. Were data from the 84 PurpleAir sensors that were never collocated with the FEM also used to train and validate the model? Or did the authors only use data from the 5 sensors that were collocated with the FEM to train and validate the model? Are data from the 84 sensors that were not collocated with the FEM included in any of the results presented in the manuscript (outside of Figure S4)? If data from those 84 sensors were used to train and validate the model, how were those data treated? Were those 84 sensors assumed to measure the same pollution as the FEM? How far from the FEM was the location in Apex? Please clarify the methods in the manuscript.
Response
- The remaining 84 sensors were never used for ML model training. The Raleigh model is trained using the 5 collocated sensors only. The trained model was then applied to the remaining 84 sensors, and the calibrated data were compared with those from the 5 sensors (Figure S4). This comparison demonstrates the optimization of the Raleigh model and also serves as an independent validation on a dataset that was not part of the model development. Section 5.3 has been clarified to explain this.
- Page 3, Lines 34-35: “The PMS 5003 OPS is a nephelometer that measures particle loading through light scattering (wavelength~650 nm) (Hagan and Kroll, 2020a).”: Hagan and Kroll (2020a) is a good reference for this statement, but Ouimette et al. (https://doi.org/10.5194/amt-15-655-2022) provides a more comprehensive description of PMS 5003 operation.
Response
- Thanks for providing the reference. We have included it in the revised draft.
- Page 3, Line 42 through Page 4, Line 1: “The primary sensor-reported data include PM1, PM2.5, and PM10 concentrations with a factory-specified correction factor for ambient measurements (CF=ATM), concentrations with CF=1 factor recommended by the manufacturer for use in indoor environments…” I don’t think it’s sufficient or appropriate to simply state the manufacturer-provided specifications here given that: a) there doesn’t seem to be any scientifically-valid reason to use the CF=ATM stream instead of the CF=1 stream for ambient monitoring and b) more detailed information on the PMS 5003 outputs is available from other sources. For sensor-reported concentrations below 30 μg m-3, the PM2.5 CF=ATM stream is equal to the PM2.5 CF=1 stream. For sensor reported concentrations above 30 μg m-3, the PM2.5 CF=ATM stream has a nonlinear response and reports lower concentrations than the PM2.5 CF=1 stream.
See the information provided under the heading “Comparison between Std. Particle and Std. Atmosphere” on the aqicn.org page for the PMS 5003 (https://aqicn.org/sensor/pms5003-7003/).
Also see Section 2.2.1 on page 4619 of Barkjohn et al. (https://doi.org/10.5194/amt-14-4617-2021): “The two data columns have a [cf_atm] / [cf_1] = 1 relationship below roughly 25 μg m-3 (as reported by the sensor) and then transition to a two-thirds ratio at higher concentration ([cf_1] concentrations are higher).” Additionally, Barkjohn et al. found that the PM2.5 CF=1 stream was more strongly correlated with 24-hour average ambient PM2.5 concentrations measured using FRM and FEM instruments than the PM2.5 CF=ATM stream.
Did the authors evaluate whether their model performed any better or worse if trained with the PM2.5 CF=1 stream instead of the PM2.5 CF=ATM stream?
Response
- We understand that both CF=1 and CF=ATM data streams are being used in the research community. We have chosen CF=ATM as recommended by the sensor provider (“According to the PMS5003 manual, CF = 1 values should be used for indoor monitoring and ATM values should be used for atmospheric monitoring”). One of the main reasons is that the CF=ATM setting is what is used and visualized by many sensor users, as it is the default reported on the PurpleAir website. Therefore, we wanted to develop and test a calibration approach for the commonly used and reported PA data (CF=ATM).
- Further, from our perspective, both are derived parameters rather than raw measurements. We chose to use the CF=ATM field for the reason noted above.
- There are several studies that have used CF=ATM to develop calibration coefficients. For example:
- Walker (2018) - https://www.sciencedirect.com/science/article/pii/S135223102100251X#bib25
- Delp and Singer, 2020 - W.W. Delp, B.C. Singer, Wildfire smoke adjustment factors for low-cost and professional PM2.5 monitors with optical sensors, Sensors, 20 (2020), 3683, https://doi.org/10.3390/s20133683.
- In fact, there is a third data stream, “particle number counts”, which has also been used in the literature. Wallace et al., 2021 (https://www.sciencedirect.com/science/article/pii/S135223102100251X#bib33) provide an excellent literature review on PA calibration (see their Table 1). We therefore reference this important paper in our introduction (Page 2, lines 1-5) and do not repeat all the details in this paper to avoid redundancy.
- We have not used the CF=1 data stream in our analysis and thus cannot comment on its performance with the MLA. However, given that the CF=1 and CF=ATM data streams are usually well correlated, that ML learns the patterns in the relationship between input and output variables, and based on the literature review, we speculate that a calibration ML model built with the CF=1 data stream would not differ significantly.
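As a rough illustration of the relationship between the two streams summarized by Barkjohn et al. (2021) and the aqicn.org comparison cited above, the sketch below uses a simple step approximation; the actual transition is gradual, the breakpoint value is approximate, and this is not a manufacturer-documented formula.

```python
# Rough, illustrative approximation of how the PMS5003 PM2.5 CF=ATM and CF=1
# streams relate (per Barkjohn et al., 2021): roughly equal below ~25 ug m-3
# (sensor-reported), with [cf_atm]/[cf_1] approaching about two-thirds at
# higher concentrations. The step used here is a simplification of a gradual
# transition and is for intuition only.
def approx_cf_atm_from_cf1(cf_1: float, breakpoint: float = 25.0) -> float:
    if cf_1 <= breakpoint:
        return cf_1
    return (2.0 / 3.0) * cf_1

for c in (5.0, 20.0, 60.0, 150.0):
    print(f"cf_1 = {c:6.1f}  ->  approx cf_atm = {approx_cf_atm_from_cf1(c):6.1f}")
```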
- Page 10, Line 10: Is there a US EPA AQS site identification number for this location?
Response
- The US EPA AQS site ID number has been added.
- Page 10, Line 25: “…suggesting an overall underestimation by PA in clean conditions (PM2.5 < 10 µg m-3).” I don’t think this generalization is appropriate. Other U.S.-based studies (Barkjohn et al.: https://doi.org/10.5194/amt-14-4617-2021; Tryner et al.: https://doi.org/10.1016/j.atmosenv.2019.117067) found that PurpleAir sensors overestimated ambient PM2.5 concentrations, so while it is true that the mean bias was < 0 for the authors’ dataset, similar results are not always observed at low concentrations.
Response
- We have edited the text to make it clearer that it is for our study. We agree that it is difficult to generalize this.
- Barkjohn et al. (2021) performed all of their analysis on 24-hour means, whereas our results are based on an hourly timescale. Therefore, the comparison may not be consistent, given the potential diurnal variability in sensor performance in a region.
- Sayahi et al. (2019) report similar results: T. Sayahi, A. Butterfield, K.E. Kelly, Long-term field evaluation of the Plantower PMS low-cost particulate matter sensors, Environ. Pollut., 245 (2019), pp. 932-940, https://doi.org/10.1016/j.envpol.2018.11.065.
- Page 10, Lines 31-32: “It is also important to note that the chemical composition of particles in Delhi and Raleigh is expected to be different.” This statement is true: There are likely to be differences in the sources of particles in Delhi and Raleigh and, as a result, the chemical composition of the particulate matter is likely to vary between these two locations, but I think differences in particle size distribution are likely to affect PurpleAir accuracy more than differences in optical properties and particle density. See Hagan and Kroll (https://doi.org/10.5194/amt-13-6343-2020) and Ouimette et al. (https://doi.org/10.5194/amt-15-655-2022).
Response
- We have revised the text for clarity and rephrased it to refer to particle characteristics in general.
- Changes in both chemical composition and size distribution ultimately affect the optical properties of the particles, and thus the scattering measurements from the PA unit, which are later converted into a mass concentration.
- Page 10, Lines 32–33: Hagan et al. (https://doi.org/10.1021/acs.estlett.9b00393) also report data on the composition of ambient PM in Delhi in the winter.
Response
- Thanks for pointing this out; we agree that Delhi has been studied heavily, and there are several publications that report chemical composition. To avoid excessive citations, we chose one of them. Nevertheless, we have now added this reference.
- Page 13, Line 8: “We also evaluated the importance of each input parameter…” Please include the methods used for this analysis in the manuscript.
Response
- We revised the text and added details on the method.
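Since the specific method is described only in the revised manuscript, the snippet below is merely a generic illustration of two common ways input-parameter importance can be assessed for a random forest calibration model (impurity-based and permutation importance); the data and feature names are synthetic placeholders, not the authors' analysis.

```python
# Generic illustration of assessing input-parameter importance for a random
# forest regressor; not necessarily the method used in the manuscript.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

X, y = make_regression(n_samples=500, n_features=3, noise=0.5, random_state=0)
feature_names = ["pa_pm25", "temperature", "rh"]   # hypothetical inputs

rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importances, built into the fitted forest
print(dict(zip(feature_names, rf.feature_importances_.round(3))))

# Permutation importance: mean drop in score when each input is shuffled
result = permutation_importance(rf, X, y, n_repeats=10, random_state=0)
print(dict(zip(feature_names, result.importances_mean.round(3))))
```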
- Page 21, Lines 6-7: “The performance of the model for other seasons (and thus for other weather conditions) will need to be evaluated as part of future work.” Differences in temperature and humidity are not the only reason why the model might perform differently in other seasons. I suggest revising this text to reflect the fact that differences in model performance across seasons are likely to also be influenced by differences in pollution sources. Several prior publications have described how pollution sources and PurpleAir performance vary across seasons in locations around the world:
- Sayahi et al., 2019, Salt Lake City, Utah, USA: https://doi.org/10.1016/j.envpol.2018.11.065
- McFarlane et al., 2021, Accra, Ghana: https://doi.org/10.1021/acsearthspacechem.1c00217
- Raheja et al., 2022, Lomé, Togo: https://doi.org/10.1021/acsearthspacechem.1c00391
Response
- Thanks for catching this. It was an oversight on our end. We have revised the text to reflect factors other than T and RH.
Technical Corrections
- Page 2, Lines 22-23: Barkjohn et al., 2021 (https://doi.org/10.5194/amt-14-4617-2021) seems to be missing from this list of citations.
Response
- It is listed on page 23, line 10, in the original manuscript.
- Figures 2, 3, 4, 5, 6, 7, 9, 10, and 11: The two sites are referred to as “Delhi” and “Raleigh” throughout the text, but as “India” and “NC” in the figures. Please update the plot labels to also refer to the two sites as Delhi and Raleigh.
Response
- Thanks for catching it. We have revised the figures.
- Figures 2, 3, 4, 5, 6, 8, 9, 10, and 11: All x- and y-axes should have descriptive labels with units (not just variable names from the authors’ code).
Response
- Thanks for catching it. We have revised the figures.
- I assume the units are μg m-3 on all x- and y-axes in Figures 2, 3, 4, 5, and 6?
Response
- Yes, revised.
- The “PA (ML CALIBRATED – Daily)” labels on the y axes in Figure 6 make it look like one value has been subtracted from another, but I don’t think that’s what the authors are showing in those plots. How about “24-h average ML-calibrated PA PM2.5 (μg m-3)”?
Response
- It is the daily mean PA value; the label has been revised for clarity.
- I assume the units are μg m-3 on the y-axes in Figures 8 and 9?
Response
- Yes, revised.
- I assume that all PM2.5 concentrations shown in Figures 10 and 11 are in μg m-3?
Response
- Yes, revised.
- Figures 4, 5, and 6: “The color scale represents the density of data points.” I do not understand what this means.
Response
- The figure caption is revised to provide more information.
- Scatter-density plots are standard practice when a large number of data points are presented in a single plot.
- Please see https://rdrr.io/cran/aqfig/man/scatterplot.density.html for more information.
- “The plotting region of the scatterplot is divided into bins. The number of data points falling within each bin is summed and then plotted using the image function. This is particularly useful when there are so many points that each point cannot be distinctly identified.”
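A minimal Python analogue of such a scatter-density plot (binning the plotting region and coloring each bin by its point count, as in the R function quoted above) is sketched below with synthetic data; it is illustrative only and not the code used for the manuscript figures.

```python
# Minimal sketch of a scatter-density plot: bin the x-y plane and color each
# bin by the number of points falling in it. Synthetic data for illustration.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
fem_pm25 = rng.gamma(shape=2.0, scale=20.0, size=5000)           # stand-in FEM PM2.5
pa_pm25 = fem_pm25 + rng.normal(scale=8.0, size=fem_pm25.size)   # stand-in PA PM2.5

fig, ax = plt.subplots()
h = ax.hist2d(fem_pm25, pa_pm25, bins=60, cmin=1, cmap="viridis")
fig.colorbar(h[3], ax=ax, label="Number of data points per bin")
ax.set_xlabel("FEM PM2.5 (µg m-3)")
ax.set_ylabel("PA PM2.5 (µg m-3)")
plt.show()
```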
- Figure S1: What do the red lines represent? +/- 1 standard deviation? Please explain in the figure caption.
Response
- Yes, the red lines represent ±1 standard deviation; the caption has been revised accordingly.
- Figure S3.1: This image is not legible. Please provide a higher-resolution image.
Response
- Revised; the figure has been updated with a higher-resolution image.
- Figure S4: Please use more informative axis labels. Why does the caption say “ADD THIS FIGURE”?
Response
- This was an oversight. The figure caption is revised with more details.
Citation: https://doi.org/10.5194/amt-2022-140-AC1
RC2: 'Comment on amt-2022-140', Anonymous Referee #2, 10 Aug 2022
General Comments
This manuscript presents the results of a performance evaluation of low-cost PurpleAir (PA-II) sensors against regulatory-grade PM2.5 monitors and develops machine-learning-based sensor calibration algorithms. The study developed two machine learning algorithms, for Raleigh, NC, USA and Delhi, India, using PM2.5 concentrations, temperatures, and relative humidity values reported by PA-II sensors. The ten-fold cross-validation technique is used to validate the developed algorithms. The manuscript is very well organized and uses comprehensive statistical techniques.
Specific Comments
Page 7, Line 7: “We used Scikit-learn (sklearn) machine learning library in Python (https://scikit-learn.org).”
The authors may like to specify the version of library used, if applicable.
Page 7, Line 15-18: “In order to develop the final sensor calibration algorithm, we used the following steps: 1) quality control the PA data; 2) randomly divide the data into training (75%) and validation (25%) datasets; 3) train the algorithm using the training dataset and validate using the validation dataset; and 4) repeat steps 2 and 3 ten times (i.e., 10-fold cross-validation) using random data selection.”
As per general understanding, ten-fold cross-validation randomly divides the data into a 90% training dataset and a 10% validation dataset, and the procedure is repeated ten times. Why does step 2 in the above text mention dividing the data into 75% and 25%?
Technical Corrections
Fig. 2 to 11: The plot labels given are India and NC throughout the manuscript, whereas in the text the sites are referred to as Delhi and Raleigh. This discrepancy needs to be addressed.
Fig. 2 to 11: The units need to be specified on the axis labels for clarity.
Citation: https://doi.org/10.5194/amt-2022-140-RC2
AC2: 'Reply on RC2', Pawan Gupta, 28 Sep 2022
General Comments
This manuscript presents the results of a performance evaluation of low-cost PurpleAir (PA-II) sensors against regulatory-grade PM2.5 monitors and develops machine-learning-based sensor calibration algorithms. The study developed two machine learning algorithms, for Raleigh, NC, USA and Delhi, India, using PM2.5 concentrations, temperatures, and relative humidity values reported by PA-II sensors. The ten-fold cross-validation technique is used to validate the developed algorithms. The manuscript is very well organized and uses comprehensive statistical techniques.
Response
- We thank the reviewer for the feedback. We provide our response below to the specific comments.
Specific Comments
Page 7, Line 7: “We used Scikit-learn (sklearn) machine learning library in Python (https://scikit-learn.org).”
The authors may like to specify the version of library used, if applicable.
Response
- Version is added to the text.
Page 7, Line 15-18: “In order to develop the final sensor calibration algorithm, we used the following steps: 1) quality control the PA data; 2) randomly divide the data into training (75%) and validation (25%) datasets; 3) train the algorithm using the training dataset and validate using the validation dataset; and 4) repeat steps 2 and 3 ten times (i.e., 10-fold cross-validation) using random data selection.”
As per general understanding, ten-fold cross-validation randomly divides the data into a 90% training dataset and a 10% validation dataset, and the procedure is repeated ten times. Why does step 2 in the above text mention dividing the data into 75% and 25%?
Response
- Thanks for bringing this to our attention. We have revised the text in sections 4.2 and 5.3 to clarify the approach. As described in section 4.2, the model development is performed in two steps. The first step tests several different models to identify the best performing MLA; in this exercise, the data are split into 75 % training and 25 % validation (Table 1, Figure 5) to select the best performing MLA for further analysis. Figure 5 shows the training and validation results for the Random Forest (RF) MLA summarized in Table 1. The second step trains the chosen MLA (i.e., Random Forest) on the sensor data for final model development. In this second step, we used a 10-fold cross-validation approach (90 % training, 10 % validation) to train the final ML model (Figure S3), and an ensemble of the 10 fold models was used as the final model. The text has been revised to clear up this confusion.
Technical Corrections
Fig. 2 to 11: The plot labels given are India and NC throughout the manuscript, whereas in the text the sites are referred to as Delhi and Raleigh. This discrepancy needs to be addressed.
Response
- All figures have been revised to make the labels consistent with the text.
Fig. 2 to 11: The units need to be specified on the axis labels for clarity.
Response
- Units are added in the figure axis labels or provided in the caption.
Citation: https://doi.org/10.5194/amt-2022-140-AC2
-
AC2: 'Reply on RC2', Pawan Gupta, 28 Sep 2022
Status: closed
-
RC1: 'Comment on AMT-2022-140', Anonymous Referee #1, 05 Jul 2022
General Comments
This manuscript describes the development and performance of a machine learning algorithm that used PM2.5 concentrations reported by PurpleAir sensors, temperatures reported by PurpleAir sensors, and relative humidity values reported by PurpleAir sensors to predict the PM2.5 concentration that would be reported by a PM2.5 monitor with the U.S. Environmental Protection Agency Federal Equivalent Method (FEM) designation. Two different machine learning algorithms were trained and testing using data from Raleigh, NC, USA and Delhi, India, respectively. It's nice that the authors' examined the performance of the PurpleAir sensors in two locations with very different levels of PM2.5 pollution. I'm most concerned that a) the authors might not have used appropriate techniques to split their data into training and validation sets and b) model performance metrics seem to be presented for predictions calculated using the same data that were used to train the model. I'm also concerned that most of the PurpleAir sensors tested in Raleigh were not collocated with the FEM and it's not clear what was done with the data from the non-collocated sensors.
Specific Comments
1. Page 7, Lines 15-19: “In order to develop the final sensor calibration algorithm, we used the following steps: 1) quality control the PA data; 2) randomly divide the data into training (75%) and validation (25%) datasets; 3) train the algorithm using the training dataset and validate using the validation dataset; and 4) repeat steps 2 and 3 ten times (i.e., 10-fold cross validation) using random data selection.”
My first concern is that steps 3 and 4 seem inconsistent with each other and with the authors’ presentation of their results. 10-fold cross validation typically implies that: i) the data were split into ten folds, ii) nine folds (90% of the data) were used to train the model while the remaining fold (10% of the data) was used to test the model, and then iii) step ii was repeated for all ten folds. I don’t see how a 75% training/25% validation split corresponds to 10-fold cross validation. Why are results presented for both the training and validation data in Table 1 and Figure 5? Performance metrics should not be shown for predictions generated using the same data that were used to train the model. How come training/validation data are shown separately in Figures 4 and 5, but together in Figures 6, 7, 8, and 9? Were the results shown in Figures 6–9 generated using 10-fold cross validation? Or were 75% of the data points shown in Figures 6–9 generated from the same data used to train the models?
My second concern is that the authors should not use random data selection to split the data into 75% training/25% validation or to split the data into folds for cross-validation. The assumption that observations are independent and identically distributed does not apply to grouped data or to time series data. The authors’ data are both grouped and time-series. The data are grouped by PurpleAir sensor: the authors have multiple PurpleAir sensors and multiple observations from each of those sensors. Data from each individual PurpleAir sensor might be dependent upon properties of that sensor. Time-series data are autocorrelated: each observation is related to some number of observations recorded immediately before and immediately after that observation. One approach for time-series data is to split up the data such that the training and validation sets do not overlap in time. If the training and validation data sets overlap in time, then the metrics reported for the validation set represent an overly-optimistic picture of the model’s performance.
References:
- https://scikit-learn.org/stable/modules/cross_validation.html#group-cv
- https://scikit-learn.org/stable/modules/cross_validation.html#timeseries-cv
2. Section 3.2: Only 5 of the PurpleAir sensors were collocated with the FEM in Research Triangle Park for two weeks. The other 84 PurpleAir sensors collected data at a residential location in Apex, NC. Were data from the 84 PurpleAir sensors that were never collocated with the FEM also used to train and validate the model? Or did the authors only use data from the 5 sensors that were collocated with the FEM to train and validate the model? Are data from the 84 sensors that were not collocated with the FEM included in any of the results presented in the manuscript (outside of Figure S4)? If data from those 84 sensors were used to train and validate the model, how were those data treated? Were those 84 sensors assumed to measure the same pollution as the FEM? How far from the FEM was the location in Apex? Please clarify the methods in the manuscript.
3. Page 3, Lines 34-35: “The PMS 5003 OPS is a nephelometer that measures particle loading through light scattering (wavelength~650 nm) (Hagan and Kroll, 2020a).”: Hagan and Kroll (2020a) is a good reference for this statement, but Ouimette et al. (https://doi.org/10.5194/amt-15-655-2022) provides a more comprehensive description of PMS 5003 operation.
4. Page 3, Line 42 through Page 4, Line 1: “The primary sensor-reported data include PM1, PM2.5, and PM10 concentrations with a factory-specified correction factor for ambient measurements (CF=ATM), concentrations with CF=1 factor recommended by the manufacturer for use in indoor environments…” I don’t think it’s sufficient or appropriate to simply state the manufacturer-provided specifications here given that: a) there doesn’t seem to be any scientifically-valid reason to use the CF=ATM stream instead of the CF=1 stream for ambient monitoring and b) more detailed information on the PMS 5003 outputs is available from other sources. For sensor-reported concentrations below 30 μg m-3, the PM2.5 CF=ATM stream is equal to the PM2.5 CF=1 stream. For sensor reported concentrations above 30 μg m-3, the PM2.5 CF=ATM stream has a nonlinear response and reports lower concentrations than the PM2.5 CF=1 stream.
See the information provided under the heading “Comparison between Std. Particle and Std. Atmosphere” on the aqicn.org page for the PMS 5003 (https://aqicn.org/sensor/pms5003-7003/).
Also see Section 2.2.1 on page 4619 of Barkjohn et al. (https://doi.org/10.5194/amt-14-4617-2021): “The two data columns have a [cf_atm] / [cf_1] = 1 relationship below roughly 25 μg m-3 (as reported by the sensor) and then transition to a two-thirds ratio at higher concentration ([cf_1] concentrations are higher).” Additionally, Barkjohn et al. found that the PM2.5 CF=1 stream was more strongly correlated with 24-hour average ambient PM2.5 concentrations measured using FRM and FEM instruments than the PM2.5 CF=ATM stream.
Did the authors evaluate whether their model performed any better or worse if trained with the PM2.5 CF=1 stream instead of the PM2.5 CF=ATM stream?
5. Page 10, Line 10: Is there a US EPA AQS site identification number for this location?
6. Page 10, Line 25: “…suggesting an overall underestimation by PA in clean conditions (PM2.5 < 10 µg m-3).” I don’t think this generalization is appropriate. Other U.S.-based studies (Barkjohn et al.: https://doi.org/10.5194/amt-14-4617-2021; Tryner et al.: https://doi.org/10.1016/j.atmosenv.2019.117067) found that PurpleAir sensors overestimated ambient PM2.5 concentrations, so while it is true that the mean bias was < 0 for the authors’ dataset, similar results are not always observed at low concentrations.
7. Page 10, Lines 31-32: “It is also important to note that the chemical composition of particles in Delhi and Raleigh is expected to be different.” This statement is true: There are likely to be differences in the sources of particles in Delhi and Raleigh and, as a result, the chemical composition of the particulate matter is likely to vary between these two locations, but I think differences in particle size distribution are likely to affect PurpleAir accuracy more than differences in optical properties and particle density. See Hagan and Kroll (https://doi.org/10.5194/amt-13-6343-2020) and Ouimette et al. (https://doi.org/10.5194/amt-15-655-2022).
8. Page 10, Lines 32–33: Hagan et al. (https://doi.org/10.1021/acs.estlett.9b00393) also report data on the composition of ambient PM in Delhi in the winter.
9. Page 13, Line 8: “We also evaluated the importance of each input parameter…” Please include the methods used for this analysis in the manuscript.
10. Page 21, Lines 6-7: “The performance of the model for other seasons (and thus for other weather conditions) will need to be evaluated as part of future work.” Differences in temperature and humidity are not the only reason why the model might perform differently in other seasons. I suggest the revising this text to reflect the fact that differences in model performance across seasons is likely to also be influenced by differences in pollution sources. Several prior publications have described how pollution sources and PurpleAir performance vary across seasons in locations around the world:
- Sayahi et al., 2019, Salt Lake City, Utah, USA: https://doi.org/10.1016/j.envpol.2018.11.065
- McFarlane et al., 2021, Accra, Ghana: https://doi.org/10.1021/acsearthspacechem.1c00217
- Raheja et al., 2022, Lomé, Togo: https://doi.org/10.1021/acsearthspacechem.1c00391
Technical Corrections
11. Page 2, Lines 22-23: Barkjohn et al., 2021 (https://doi.org/10.5194/amt-14-4617-2021) seems to be missing from this list of citations.
12. Figures 2, 3, 4, 5, 6, 7, 9, 10, and 11: The two sites are referred to as “Delhi” and “Raleigh” throughout the text, but as “India” and “NC” in the figures. Please update the plot labels to also refer to the two sites as Delhi and Raleigh.
13. Figures 2, 3, 4, 5, 6, 8, 9, 10, and 11: All x- and y-axes should have descriptive labels with units (not just variable names from the authors’ code).
- I assume the units are μg m-3 on all x- and y-axes in Figures 2, 3, 4, 5, and 6?
- The “PA (ML CALIBRATED – Daily)” labels on the y axes in Figure 6 make it look like one value has been subtracted from another, but I don’t think that’s what the authors are showing in those plots. How about “24-h average ML-calibrated PA PM2.5 (μg m-3)”?
- I assume the units are μg m-3 on the y-axes in Figures 8 and 9?
- I assume that all PM2.5 concentrations shown on in Figures 10 and 11 are in μg m-3?
14. Figures 4, 5, and 6: “The color scale represents the density of data points.” I do not understand what this means.
15. Figure S1: What do the red lines represent? +/- 1 standard deviation? Please explain in the figure caption.
16. Figure S3.1: This image is not legible. Please provide a higher-resolution image.
17. Figure S4: Please use more informative axis labels. Why does the caption say “ADD THIS FIGURE”?
Citation: https://doi.org/10.5194/amt-2022-140-RC1 -
AC1: 'Reply on RC1', Pawan Gupta, 28 Sep 2022
General Comments
This manuscript describes the development and performance of a machine learning algorithm that used PM2.5 concentrations reported by PurpleAir sensors, temperatures reported by PurpleAir sensors, and relative humidity values reported by PurpleAir sensors to predict the PM2.5 concentration that would be reported by a PM2.5 monitor with the U.S. Environmental Protection Agency Federal Equivalent Method (FEM) designation. Two different machine learning algorithms were trained and testing using data from Raleigh, NC, USA and Delhi, India, respectively. It's nice that the authors' examined the performance of the PurpleAir sensors in two locations with very different levels of PM2.5 pollution. I'm most concerned that a) the authors might not have used appropriate techniques to split their data into training and validation sets and b) model performance metrics seem to be presented for predictions calculated using the same data that were used to train the model. I'm also concerned that most of the PurpleAir sensors tested in Raleigh were not collocated with the FEM and it's not clear what was done with the data from the non-collocated sensors.
Response:
We thank the reviewer for the feedback. We apologize for the confusion caused by unclear language in the text. We have revised the manuscript text to clarify the approach. We present our detailed responses below under specific comments, but provide a high-level summary here:
- The model development was done in two steps. The first step aimed to identifying the appropriate machine learning algorithm (MLA), while the second step then refined it further. For this second step, we used the standard 10-fold cross-validation technique. Section 4.2 has been revised to clarify our approach.
- We present performance metrics for both training & validation datasets separately for the model development portion. Table 1, Figure 5, and Figure S3 all show training and validation results separately. Once the final model is chosen, we present our analysis using the combined dataset.
- Due to logistical challenges, only five PurpleAir (PA) sensors were collocated with FEM in the Raleigh region, and the rest of the PA sensors were collocated with those five PA sensors. Model development for Raleigh is performed only using the data from the 5 PA sensors. The model is then applied to the data from the remaining 84 sensors and compared against the data from the 5 sensors (Figure S4). This is explained in section 3.2, and the results are discussed in section 5.3. We have further edited section 5.3 to make it clearer.
Specific Comments
- Page 7, Lines 15-19: “In order to develop the final sensor calibration algorithm, we used the following steps: 1) quality control the PA data; 2) randomly divide the data into training (75%) and validation (25%) datasets; 3) train the algorithm using the training dataset and validate using the validation dataset; and 4) repeat steps 2 and 3 ten times (i.e., 10-fold cross validation) using random data selection.”
My first concern is that steps 3 and 4 seem inconsistent with each other and with the authors’ presentation of their results. 10-fold cross validation typically implies that: i) the data were split into ten folds, ii) nine folds (90% of the data) were used to train the model while the remaining fold (10% of the data) was used to test the model, and then iii) step ii was repeated for all ten folds. I don’t see how a 75% training/25% validation split corresponds to 10-fold cross validation. Why are results presented for both the training and validation data in Table 1 and Figure 5? Performance metrics should not be shown for predictions generated using the same data that were used to train the model. How come training/validation data are shown separately in Figures 4 and 5, but together in Figures 6, 7, 8, and 9? Were the results shown in Figures 6–9 generated using 10-fold cross validation? Or were 75% of the data points shown in Figures 6–9 generated from the same data used to train the models?
Response
- Thanks for bringing this to our attention. We have revised the text in sections 4.2 and 5.3 to clarify the approach. As described in section 4.2, the model development is performed in two steps. The first step tests several different models to identify the best performing MLA. In this exercise, data are split into 75% and 25% for training and validation, respectively, (Table 1, Figure 5) to select the best performing MLA to be used for further analysis. Figure 5 shows the training and validation results for the Random Forest (RF) MLA summarized in Table 1. The second step trains the chosen MLA (i.e., Random Forest) on the sensor data for final model development. In this second step, we used a 10-fold cross-validation approach (90% and 10%) to train the final ML model (Figure S3), and an ensemble of 10 models were used as the final model. The text is revised to clear this confusion.
- Model performance results are shown for both training and testing for the model development and finalization steps (Table1, Figure 5, Figure S3). Results are shown for both training and testing datasets to show the similarity or differences in model performance between the two to provide further insight into the model. Our results shown consistent behavior of the model suggesting an optimized model.
- After the model was optimized, the remaining analysis are shown using the combined dataset to ensure large sample size. Results are presented for daily mean and other error dependencies and shown in Figures 6-11.
My second concern is that the authors should not use random data selection to split the data into 75% training/25% validation or to split the data into folds for cross-validation. The assumption that observations are independent and identically distributed does not apply to grouped data or to time series data. The authors’ data are both grouped and time-series. The data are grouped by PurpleAir sensor: the authors have multiple PurpleAir sensors and multiple observations from each of those sensors. Data from each individual PurpleAir sensor might be dependent upon properties of that sensor. Time-series data are autocorrelated: each observation is related to some number of observations recorded immediately before and immediately after that observation. One approach for time-series data is to split up the data such that the training and validation sets do not overlap in time. If the training and validation data sets overlap in time, then the metrics reported for the validation set represent an overly-optimistic picture of the model’s performance.
References:
- https://scikit-learn.org/stable/modules/cross_validation.html#group-cv
- https://scikit-learn.org/stable/modules/cross_validation.html#timeseries-cv
Response
- We appreciate the reviewer’s comments and suggestions. We agree that it would have been useful to have the data divided into non-overlapping time series. If this collocation had been conducted for one year (which is logistically very difficult for 50-100 sensors), we would have been able to split data in more than one way and assess the model performance over time windows. Unfortunately, given the nature of this calibration exercise and the limited number of days available for calibration collocation (2 to 4 weeks) within a single season, dividing the data into time periods will make datasets very small and not produce enough samples to generate useful statistics. In this exercise, we aim to calibrate PA sensors with those measured using regulatory grade monitors. As discussed in sections 2 & 5, the PA sensor performance depends on its own properties, meteorological conditions (RH & T), and PM2.5 loading and characteristics. In our input variables, we have considered all these aspects. The data are 10-fold randomly selected, and the model’s consistent performance (Fig S3) clearly demonstrates that the trained model is optimized.
- As we mention in the manuscript, we are currently conducting a year-long collocation of 1 to 2 sensors against the FEM. Once data become available from this longer collocation, we will be able to assess the suggested approach as part of future work. (A brief sketch of the group-aware and time-ordered splits suggested by the reviewer follows this response.)
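- For completeness, the sketch below illustrates the group-aware and time-ordered splitting strategies the reviewer points to, using scikit-learn's GroupKFold and TimeSeriesSplit on synthetic data with hypothetical column names (sensor_id, timestamp, pa_pm25); it only demonstrates the alternatives and does not reflect results from this study.

```python
# Sketch of group-aware and time-ordered splits (synthetic data, hypothetical
# column names); shown only to illustrate the alternatives discussed above.
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupKFold, TimeSeriesSplit

rng = np.random.default_rng(0)
n = 600
df = pd.DataFrame({
    "timestamp": pd.date_range("2019-12-01", periods=n, freq="h"),
    "sensor_id": rng.integers(0, 5, n),   # which PurpleAir unit took the reading
    "pa_pm25": rng.gamma(2.0, 20.0, n),
})

X = df[["pa_pm25"]]

# Group k-fold: all observations from a given sensor stay in the same fold,
# so the sensors used for validation are never seen during training.
for train_idx, val_idx in GroupKFold(n_splits=5).split(X, groups=df["sensor_id"]):
    assert set(df.loc[train_idx, "sensor_id"]).isdisjoint(df.loc[val_idx, "sensor_id"])

# Time-series split: validation data always come after the training data,
# so the two sets never overlap in time.
for train_idx, val_idx in TimeSeriesSplit(n_splits=5).split(X):
    assert df.loc[train_idx, "timestamp"].max() < df.loc[val_idx, "timestamp"].min()

print("both splitting schemes produced non-overlapping folds")
```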
- Section 3.2: Only 5 of the PurpleAir sensors were collocated with the FEM in Research Triangle Park for two weeks. The other 84 PurpleAir sensors collected data at a residential location in Apex, NC. Were data from the 84 PurpleAir sensors that were never collocated with the FEM also used to train and validate the model? Or did the authors only use data from the 5 sensors that were collocated with the FEM to train and validate the model? Are data from the 84 sensors that were not collocated with the FEM included in any of the results presented in the manuscript (outside of Figure S4)? If data from those 84 sensors were used to train and validate the model, how were those data treated? Were those 84 sensors assumed to measure the same pollution as the FEM? How far from the FEM was the location in Apex? Please clarify the methods in the manuscript.
Response
- The remaining 84 sensors were never used for ML model training. The Raleigh model was trained using the 5 collocated sensors only. The trained model was then applied to the remaining 84 sensors, and the calibrated data were compared with those from the 5 collocated sensors (Figure S4). This comparison demonstrates the robustness of the Raleigh model and also serves as an independent validation on a dataset that was not part of the model development. Section 5.3 has been clarified to explain this.
- Page 3, Lines 34-35: “The PMS 5003 OPS is a nephelometer that measures particle loading through light scattering (wavelength~650 nm) (Hagan and Kroll, 2020a).”: Hagan and Kroll (2020a) is a good reference for this statement, but Ouimette et al. (https://doi.org/10.5194/amt-15-655-2022) provides a more comprehensive description of PMS 5003 operation.
Response
- Thanks for providing the reference. We have included it in the revised draft.
- Page 3, Line 42 through Page 4, Line 1: “The primary sensor-reported data include PM1, PM2.5, and PM10 concentrations with a factory-specified correction factor for ambient measurements (CF=ATM), concentrations with CF=1 factor recommended by the manufacturer for use in indoor environments…” I don’t think it’s sufficient or appropriate to simply state the manufacturer-provided specifications here given that: a) there doesn’t seem to be any scientifically-valid reason to use the CF=ATM stream instead of the CF=1 stream for ambient monitoring and b) more detailed information on the PMS 5003 outputs is available from other sources. For sensor-reported concentrations below 30 μg m-3, the PM2.5 CF=ATM stream is equal to the PM2.5 CF=1 stream. For sensor-reported concentrations above 30 μg m-3, the PM2.5 CF=ATM stream has a nonlinear response and reports lower concentrations than the PM2.5 CF=1 stream.
See the information provided under the heading “Comparison between Std. Particle and Std. Atmosphere” on the aqicn.org page for the PMS 5003 (https://aqicn.org/sensor/pms5003-7003/).
Also see Section 2.2.1 on page 4619 of Barkjohn et al. (https://doi.org/10.5194/amt-14-4617-2021): “The two data columns have a [cf_atm] / [cf_1] = 1 relationship below roughly 25 μg m-3 (as reported by the sensor) and then transition to a two-thirds ratio at higher concentration ([cf_1] concentrations are higher).” Additionally, Barkjohn et al. found that the PM2.5 CF=1 stream was more strongly correlated with 24-hour average ambient PM2.5 concentrations measured using FRM and FEM instruments than the PM2.5 CF=ATM stream.
Did the authors evaluate whether their model performed any better or worse if trained with the PM2.5 CF=1 stream instead of the PM2.5 CF=ATM stream?
Response
- We understand that both CF=1 and CF=ATM data streams are being used in the research community. We have chosen CF=ATM as recommended by the sensor provider (“According to the PMS5003 manual, CF = 1 values should be used for indoor monitoring and ATM values should be used for atmospheric monitoring”). One of the main reasons for that is that the CF=ATM setting is what is used and visualized by many sensor users, as it is the default reported on the PurpleAir website. Therefore, we wanted to develop and test a calibration approach for the commonly used and reported PA data (CF=ATM).
- Further, from our perspective, both are derived parameters and not raw measurements. We chose to use the CF=ATM field for the reason noted above.
- There are several studies that have used CF=ATM to develop calibration coefficients. For example:
- Walker (2018) - https://www.sciencedirect.com/science/article/pii/S135223102100251X#bib25
- Delp and Singer (2020) - W. W. Delp and B. C. Singer, Wildfire smoke adjustment factors for low-cost and professional PM2.5 monitors with optical sensors, Sensors, 20 (2020), 3683, https://doi.org/10.3390/s20133683.
- In fact, there is a third data stream, “particle number counts”, which has also been used in the literature. Wallace et al., 2021 (https://www.sciencedirect.com/science/article/pii/S135223102100251X#bib33) provide an excellent literature review on PA calibration (see their Table 1). We therefore reference this important paper in our introduction (Page 2, lines 1-5) and do not repeat all the details here to avoid redundancy.
- We have not used the CF=1 data stream in our analysis and thus cannot comment on its performance with the MLA. However, given that the CF=1 and CF=ATM data streams are usually well correlated, and since the ML model learns the patterns in the relationship between the input and output variables, we speculate, based on the literature, that the calibration ML model would not differ significantly if the CF=1 data stream were used instead. (A minimal sketch of how such a check could be run follows this response.)
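- For illustration, the sketch below shows how such a check could be run by retraining the same Random Forest calibration with a CF=1-like stream in place of CF=ATM and comparing validation errors. The data are synthetic, the column names (pm25_cf_atm, pm25_cf_1) are hypothetical, and the piecewise CF=ATM shape is only a rough assumption based on the relationship the reviewer describes, not measured data.

```python
# Sketch of the check described above: retrain the same RF calibration with the
# CF=1 stream in place of CF=ATM and compare validation errors. Synthetic data,
# hypothetical column names; the piecewise CF=ATM shape is an assumption.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 3000
cf1 = rng.gamma(2.0, 30.0, n)                                  # PM2.5, CF=1 stream
cf_atm = np.where(cf1 <= 30, cf1, 30 + (2 / 3) * (cf1 - 30))   # assumed CF=ATM shape
df = pd.DataFrame({
    "pm25_cf_1": cf1,
    "pm25_cf_atm": cf_atm,
    "rh": rng.uniform(20, 95, n),
    "temp": rng.uniform(0, 35, n),
})
df["fem_pm25"] = 0.6 * cf1 + 0.05 * df["rh"] + rng.normal(0, 3, n)  # synthetic target

print("CF=ATM vs CF=1 correlation:",
      round(df["pm25_cf_atm"].corr(df["pm25_cf_1"]), 3))

# Train the same model once with each data stream and compare validation MAE.
for stream in ["pm25_cf_atm", "pm25_cf_1"]:
    X = df[[stream, "rh", "temp"]]
    X_tr, X_va, y_tr, y_va = train_test_split(
        X, df["fem_pm25"], test_size=0.25, random_state=42)
    rf = RandomForestRegressor(n_estimators=200, random_state=42).fit(X_tr, y_tr)
    print(stream, "validation MAE:",
          round(mean_absolute_error(y_va, rf.predict(X_va)), 2))
```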
- Page 10, Line 10: Is there a US EPA AQS site identification number for this location?
Response
- The US EPA AQS site ID number has been added.
- Page 10, Line 25: “…suggesting an overall underestimation by PA in clean conditions (PM2.5 < 10 µg m-3).” I don’t think this generalization is appropriate. Other U.S.-based studies (Barkjohn et al.: https://doi.org/10.5194/amt-14-4617-2021; Tryner et al.: https://doi.org/10.1016/j.atmosenv.2019.117067) found that PurpleAir sensors overestimated ambient PM2.5 concentrations, so while it is true that the mean bias was < 0 for the authors’ dataset, similar results are not always observed at low concentrations.
Response
- We have edited the text to make it clearer that it is for our study. We agree that it is difficult to generalize this.
- Barkjohn et al. (2021) performed all of their analysis on 24-hour means, whereas our results are based on an hourly timescale. Therefore, the comparison may not be consistent, given the potential diurnal variability in sensor performance within a region.
- Sayahi et al. (2019) show similar results: T. Sayahi, A. Butterfield, and K. E. Kelly, Long-term field evaluation of the Plantower PMS low-cost particulate matter sensors, Environ. Pollut., 245 (2019), 932-940, https://doi.org/10.1016/j.envpol.2018.11.065. (A short sketch contrasting hourly and daily bias metrics follows this response.)
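- As an aside, the short sketch below (synthetic data, hypothetical variable names) shows how an hourly and a 24-hour percentage mean absolute bias can differ for the same record, which is why hourly and daily comparisons are not directly interchangeable.

```python
# Sketch showing how hourly and 24-hour percentage mean absolute bias can differ
# for the same record (synthetic data, hypothetical variable names).
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
idx = pd.date_range("2019-12-01", periods=24 * 14, freq="h")   # two weeks, hourly
fem = pd.Series(rng.gamma(2.0, 10.0, len(idx)), index=idx)     # reference PM2.5
pa = fem * 0.9 + rng.normal(0, 4, len(idx))                    # calibrated PA + noise

def pct_mean_abs_bias(obs, pred):
    """Percentage mean absolute bias relative to the mean of the reference."""
    return 100 * np.mean(np.abs(pred - obs)) / np.mean(obs)

hourly = pct_mean_abs_bias(fem, pa)
daily = pct_mean_abs_bias(fem.resample("D").mean(), pa.resample("D").mean())
print(f"hourly % MAB: {hourly:.1f}, daily % MAB: {daily:.1f}")  # daily is smaller
```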
- Page 10, Lines 31-32: “It is also important to note that the chemical composition of particles in Delhi and Raleigh is expected to be different.” This statement is true: There are likely to be differences in the sources of particles in Delhi and Raleigh and, as a result, the chemical composition of the particulate matter is likely to vary between these two locations, but I think differences in particle size distribution are likely to affect PurpleAir accuracy more than differences in optical properties and particle density. See Hagan and Kroll (https://doi.org/10.5194/amt-13-6343-2020) and Ouimette et al. (https://doi.org/10.5194/amt-15-655-2022).
Response
- We have revised the text for clarity and rephrased it to refer to particle characteristics in general.
- Changes in chemical composition and size distribution both ultimately affect the optical properties of the particles, and thus the scattering measurements from the PA unit, which are later converted into a mass concentration.
- Page 10, Lines 32–33: Hagan et al. (https://doi.org/10.1021/acs.estlett.9b00393) also report data on the composition of ambient PM in Delhi in the winter.
Response
- Thanks for pointing this out; we agree that Delhi has been studied extensively, and several publications report its chemical composition. To avoid excessive citations, we chose one of them. Nevertheless, we have added this reference.
- Page 13, Line 8: “We also evaluated the importance of each input parameter…” Please include the methods used for this analysis in the manuscript.
Response
- We have revised the text and added details on the method. (An illustrative sketch of common input-importance diagnostics for an RF model follows this response.)
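- As an illustration (not necessarily the exact method described in the revised text), the sketch below shows two common ways to assess input importance for a trained Random Forest calibration model with scikit-learn, using synthetic data and hypothetical column names.

```python
# Two common input-importance diagnostics for a trained RF calibration model
# (impurity-based and permutation importance); illustration only, on synthetic
# data with hypothetical column names.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2000
X = pd.DataFrame({
    "pa_pm25": rng.gamma(2.0, 20.0, n),
    "rh": rng.uniform(20, 95, n),
    "temp": rng.uniform(0, 35, n),
})
y = 0.6 * X["pa_pm25"] + 0.05 * X["rh"] + rng.normal(0, 3, n)

X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.25, random_state=42)
rf = RandomForestRegressor(n_estimators=200, random_state=42).fit(X_tr, y_tr)

# Impurity-based importance (built into the fitted forest).
print(dict(zip(X.columns, rf.feature_importances_.round(3))))

# Permutation importance evaluated on held-out data.
perm = permutation_importance(rf, X_va, y_va, n_repeats=10, random_state=42)
print(dict(zip(X.columns, perm.importances_mean.round(3))))
```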
- Page 21, Lines 6-7: “The performance of the model for other seasons (and thus for other weather conditions) will need to be evaluated as part of future work.” Differences in temperature and humidity are not the only reason why the model might perform differently in other seasons. I suggest revising this text to reflect the fact that differences in model performance across seasons are likely to also be influenced by differences in pollution sources. Several prior publications have described how pollution sources and PurpleAir performance vary across seasons in locations around the world:
- Sayahi et al., 2019, Salt Lake City, Utah, USA: https://doi.org/10.1016/j.envpol.2018.11.065
- McFarlane et al., 2021, Accra, Ghana: https://doi.org/10.1021/acsearthspacechem.1c00217
- Raheja et al., 2022, Lomé, Togo: https://doi.org/10.1021/acsearthspacechem.1c00391
Response
- Thanks for catching this. It was an oversight at our end. We have revised the text to reflect factors other than T & RH.
Technical Corrections
- Page 2, Lines 22-23: Barkjohn et al., 2021 (https://doi.org/10.5194/amt-14-4617-2021) seems to be missing from this list of citations.
Response
- It is listed on page 23, line 10, in the original manuscript.
- Figures 2, 3, 4, 5, 6, 7, 9, 10, and 11: The two sites are referred to as “Delhi” and “Raleigh” throughout the text, but as “India” and “NC” in the figures. Please update the plot labels to also refer to the two sites as Delhi and Raleigh.
Response
- Thanks for catching it. We have revised the figures.
- Figures 2, 3, 4, 5, 6, 8, 9, 10, and 11: All x- and y-axes should have descriptive labels with units (not just variable names from the authors’ code).
Response
- Thanks for catching it. We have revised the figures.
- I assume the units are μg m-3 on all x- and y-axes in Figures 2, 3, 4, 5, and 6?
Response
- Yes, revised.
- The “PA (ML CALIBRATED – Daily)” labels on the y-axes in Figure 6 make it look like one value has been subtracted from another, but I don’t think that’s what the authors are showing in those plots. How about “24-h average ML-calibrated PA PM2.5 (μg m-3)”?
Response
- It is the daily mean PA value; the label has been revised for clarity.
- I assume the units are μg m-3 on the y-axes in Figures 8 and 9?
Response
- Yes, revised.
- I assume that all PM2.5 concentrations shown in Figures 10 and 11 are in μg m-3?
Response
- Yes, revised.
- Figures 4, 5, and 6: “The color scale represents the density of data points.” I do not understand what this means.
Response
- The figure caption is revised to provide more information.
- Scatter density plots are standard practice when a large number of data points are presented in a single plot. (A minimal Python example follows this response.)
- Please see https://rdrr.io/cran/aqfig/man/scatterplot.density.html for more information.
- “The plotting region of the scatterplot is divided into bins. The number of data points falling within each bin is summed and then plotted using the image function. This is particularly useful when there are so many points that each point cannot be distinctly identified.”
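- For readers working in Python rather than R, an equivalent scatter-density figure can be produced with a 2-D histogram; a minimal matplotlib sketch with synthetic data is shown below.

```python
# Minimal scatter-density sketch in Python: the colour scale is the number of
# data points per bin. Synthetic data, for illustration only.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
fem = rng.gamma(2.0, 10.0, 5000)             # reference PM2.5 (ug m-3)
pa = fem * 0.9 + rng.normal(0, 4, 5000)      # sensor PM2.5 (ug m-3)

fig, ax = plt.subplots()
counts, xedges, yedges, image = ax.hist2d(fem, pa, bins=60, cmin=1, cmap="viridis")
fig.colorbar(image, ax=ax, label="Number of data points per bin")
ax.set_xlabel("FEM PM2.5 (ug m-3)")
ax.set_ylabel("PA PM2.5 (ug m-3)")
plt.show()
```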
- Figure S1: What do the red lines represent? +/- 1 standard deviation? Please explain in the figure caption.
Response
- Yes, the red lines represent one standard deviation; the caption has been revised accordingly.
- Figure S3.1: This image is not legible. Please provide a higher-resolution image.
Response
- Revised; the figure has been updated at a higher resolution.
- Figure S4: Please use more informative axis labels. Why does the caption say “ADD THIS FIGURE”?
Response
- This was an oversight. The figure caption is revised with more details.
Citation: https://doi.org/10.5194/amt-2022-140-AC1
RC2: 'Comment on amt-2022-140', Anonymous Referee #2, 10 Aug 2022
General Comments
This manuscript presents the results of a performance evaluation of low-cost PurpleAir (PA-II) sensors against regulatory-grade PM2.5 monitors and develops machine-learning-based sensor calibration algorithms. The study developed two machine learning algorithms, for Raleigh, NC, USA and Delhi, India, using PM2.5 concentrations, temperatures, and relative humidity values reported by PA-II sensors. The ten-fold cross-validation technique is used to validate the developed algorithms. The manuscript is very well organized and uses comprehensive statistical techniques.
Specific Comments
Page 7, Line 7: “We used Scikit-learn (sklearn) machine learning library in Python (https://scikit-learn.org).”
The authors may like to specify the version of library used, if applicable.
Page 7, Line 15-18: “In order to develop the final sensor calibration algorithm, we used the following steps: 1) quality control the PA data; 2) randomly divide the data into training (75%) and validation (25%) datasets; 3) train the algorithm using the training dataset and validate using the validation dataset; and 4) repeat steps 2 and 3 ten times (i.e., 10-fold cross-validation) using random data selection.”
As per general understanding, ten-fold cross-validation randomly divides the data into a 90% training dataset and a 10% validation dataset, and the procedure is repeated ten times. Why does step 2 in the above text mention dividing the data into 75% and 25%?
Technical Corrections
Fig. 2 to 11: The plot labels are given as India and NC throughout the manuscript, whereas in the text the sites are referred to as Delhi and Raleigh. This discrepancy needs to be addressed.
Fig. 2 to 11: Units should be specified in the axis labels for clarity.
Citation: https://doi.org/10.5194/amt-2022-140-RC2
AC2: 'Reply on RC2', Pawan Gupta, 28 Sep 2022
General Comments
This manuscript presents the results of a performance evaluation of low-cost PurpleAir (PA-II) sensors against regulatory-grade PM2.5 monitors and develops machine-learning-based sensor calibration algorithms. The study developed two machine learning algorithms, for Raleigh, NC, USA and Delhi, India, using PM2.5 concentrations, temperatures, and relative humidity values reported by PA-II sensors. The ten-fold cross-validation technique is used to validate the developed algorithms. The manuscript is very well organized and uses comprehensive statistical techniques.
Response
- We thank the reviewer for the feedback. Our responses to the specific comments are provided below.
Specific Comments
Page 7, Line 7: “We used Scikit-learn (sklearn) machine learning library in Python (https://scikit-learn.org).”
The authors may like to specify the version of library used, if applicable.
Response
- The version has been added to the text.
Page 7, Line 15-18: “In order to develop the final sensor calibration algorithm, we used the following steps: 1) quality control the PA data; 2) randomly divide the data into training (75%) and validation (25%) datasets; 3) train the algorithm using the training dataset and validate using the validation dataset; and 4) repeat steps 2 and 3 ten times (i.e., 10-fold cross-validation) using random data selection.”
As per general understanding, ten-fold cross-validation randomly divides the data into a 90% training dataset and a 10% validation dataset, and the procedure is repeated ten times. Why does step 2 in the above text mention dividing the data into 75% and 25%?
Response
- Thanks for bringing this to our attention. We have revised the text in sections 4.2 and 5.3 to clarify the approach. As described in section 4.2, the model development is performed in two steps. The first step tests several different models to identify the best-performing MLA. In this exercise, the data are split into 75% training and 25% validation (Table 1, Figure 5) to select the best-performing MLA for further analysis. Figure 5 shows the training and validation results for the Random Forest (RF) MLA summarized in Table 1. The second step trains the chosen MLA (i.e., Random Forest) on the sensor data for final model development. In this second step, we used a 10-fold cross-validation approach (90% training, 10% validation) to train the final ML model (Figure S3), and an ensemble of the 10 resulting models was used as the final model. The text has been revised to clarify this.
Technical Corrections
Fig. 2 to 11: The plot labels are given as India and NC throughout the manuscript, whereas in the text the sites are referred to as Delhi and Raleigh. This discrepancy needs to be addressed.
Response
- All figures have been revised to make the labels consistent with the text.
Fig. 2 to 11: Units should be specified in the axis labels for clarity.
Response
- Units have been added to the figure axis labels or provided in the captions.
Citation: https://doi.org/10.5194/amt-2022-140-AC2