This work is distributed under the Creative Commons Attribution 4.0 License.
Research of Low-cost Air Quality Monitoring Models with Different Machine Learning Algorithms
Gang Wang
Chunlai Yu
Kai Guo
Haisong Guo
Yibo Wang
Abstract. To improve the prediction of future air quality trends, the demand for low-cost sensor-based air quality grid monitoring is growing. In this study, a low-cost multi-parameter air quality monitoring system (LCS) based on different machine learning algorithms is proposed. The LCS measures particulate matter (PM2.5 and PM10) and gaseous pollutants (SO2, NO2, CO and O3) simultaneously. A multi-dimensional multi-response prediction model is developed from the original sensor signals, ambient temperature (T) and relative humidity (RH), and the measurements of the reference instruments. The performance of the different algorithms (RF, MLR, KNN, BP, GA-BP) is compared and discussed in terms of the coefficient of determination (R2) and the root mean square error (RMSE). With these methods, the R2 of the algorithms (RF, MLR, KNN, BP, GA-BP) for PM is in the range 0.68–0.99; the mean RMSE values for PM2.5 and PM10 are within 3.96–16.16 μg m-3 and 7.37–28.90 μg m-3, respectively. The R2 of the algorithms for the gaseous pollutants (O3, CO and NO2) is within 0.70–0.99; the mean RMSE values for these pollutants are 4.06–16.07 μg m-3, 0.04–0.15 mg m-3 and 3.25–13.90 μg m-3, respectively. The R2 of the algorithms (RF, KNN, BP, GA-BP, excluding MLR) for SO2 is within 0.27–0.97, and the mean RMSE value is in the range 1.05–3.22 μg m-3. These results meet the requirements of China's national environmental protection standards, and the LCS based on the machine learning algorithms can be used to predict the concentrations of PM and gaseous pollutants.
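For illustration only, the following is a minimal sketch of how such a model comparison could be set up with scikit-learn. It is not the authors' code: the data are random placeholders, the feature list and hyperparameters are assumptions, and GA-BP is omitted because scikit-learn provides no genetic-algorithm weight optimization.

```python
# Minimal sketch (not the authors' code): score candidate calibration models
# with R^2 and RMSE on a held-out split, as described in the abstract.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import r2_score, mean_squared_error

rng = np.random.default_rng(0)
X = rng.random((1000, 5))                                        # placeholder for [x0.5, x1.0, x2.5, T, RH]
y = X @ np.array([30, 20, 10, 2, 1]) + rng.normal(0, 2, 1000)    # placeholder reference PM2.5

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
models = {
    "MLR": LinearRegression(),
    "RF": RandomForestRegressor(n_estimators=200, random_state=0),
    "KNN": KNeighborsRegressor(n_neighbors=5),
    "BP": MLPRegressor(hidden_layer_sizes=(16,), max_iter=2000, random_state=0),
}
for name, model in models.items():
    pred = model.fit(X_train, y_train).predict(X_test)
    rmse = mean_squared_error(y_test, pred) ** 0.5               # RMSE in the units of y
    print(f"{name}: R2 = {r2_score(y_test, pred):.3f}, RMSE = {rmse:.2f}")
```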
Status: closed
- RC1: 'Comment on amt-2023-163', Anonymous Referee #2, 02 Nov 2023
This is a fine measurement calibration paper. My understanding of measurement techniques is very thin, so I will only comment on data processing:
(1) From my point of view, the authors applied different algorithms to fit the low-cost sensor data and compared their performance, but they do not discuss how the results can be used to predict PM and the other pollutants. Specifically, is the model used to predict unobserved locations or to forecast the short-term future?
(2) Also, RF seems to outperform the other methods; is the RF-calibrated output considered to be the final product, or should the raw low-cost sensor data still be the reference? Do the authors consider that the better performance of RF could be due to overfitting, and that other methods with lower predictive performance may be more explainable?
(3) The formula in Equation (1) does not seem valid for MLR, because all those terms are correlated. Additional treatments or justifications are needed.
Minor comments:
p1,l10: a typo "algorithms"
p2,l5: additional parentheses
Table 4: a typo II, III for O3
Citation: https://doi.org/10.5194/amt-2023-163-RC1
AC1: 'Reply on RC1', Gang Wang, 16 Nov 2023
RC1:
This is a fine measurement calibration paper. My understanding of measurement techniques is very thin, so I will only comment on data processing:
(1) From my point of view, the authors applied different algorithms to fit the low-cost sensor data and compared their performance, but they do not discuss how the results can be used to predict PM and the other pollutants. Specifically, is the model used to predict unobserved locations or to forecast the short-term future?
Thanks for your question. You are right that we applied different algorithms to fit the low-cost sensor data and compared their performance. "Prediction" in this paper means that the model fitted against the reference measurements is used to convert the current raw sensor data at the same location into more accurate concentration estimates. Using the model to predict unobserved locations or to forecast the short-term future will be discussed in future research.
(2) Also, RF seems to outperform the other methods; is the RF-calibrated output considered to be the final product, or should the raw low-cost sensor data still be the reference?
Thanks for your question. RF does outperform the other methods. The calibrated RF output in this paper is accurate enough to be the final product; the current raw sensor data are still used as the input of the model to obtain the calibrated values.
(3) Do the authors consider that the better performance of RF could be due to overfitting, and that other methods with lower predictive performance may be more explainable?
Thanks for your question. To avoid overfitting in the five models, the data are randomly split into a training set (80%) and a test set (20%). To ensure the robustness of the model evaluation, 5-fold cross-validation is also conducted: the dataset is divided into five mutually exclusive subsets of equal size; in each round, four subsets are used for training and the remaining subset for validation; this is repeated until every subset has served once as the validation set (at most five rounds), and the loss function is then used to select the optimal model and parameters (Mahesh et al., 2023; Zimmerman et al., 2018). A minimal sketch of this procedure is given below.
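For illustration only, a minimal sketch of the 80/20 split plus 5-fold cross-validation (not the authors' code; the data are random placeholders and the choice of RF as the model here is arbitrary):

```python
# Minimal sketch: 80/20 hold-out split plus 5-fold cross-validation on the
# training portion; X and y are random placeholders for the preprocessed
# sensor inputs and the reference concentrations.
import numpy as np
from sklearn.model_selection import train_test_split, KFold
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.random((1000, 5))                          # placeholder features
y = X.sum(axis=1) + rng.normal(0, 0.1, 1000)       # placeholder target

# 80% / 20% random split used to check against overfitting
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# 5-fold cross-validation on the training data
kf = KFold(n_splits=5, shuffle=True, random_state=0)
fold_losses = []
for train_idx, val_idx in kf.split(X_train):
    model = RandomForestRegressor(n_estimators=200, random_state=0)
    model.fit(X_train[train_idx], y_train[train_idx])
    pred = model.predict(X_train[val_idx])
    fold_losses.append(mean_squared_error(y_train[val_idx], pred))   # per-fold loss
print("mean cross-validation loss:", np.mean(fold_losses))
```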
(4) The formula in Equation (1) does not seem valid for MLR, because all those terms are correlated. Additional treatments or justifications are needed.
Thanks for your question. After the data are collected by the LCS, the raw data are preprocessed. The PM3006 particulate matter sensor outputs six cumulative particle channels (>0.3 μm, >0.5 μm, >1.0 μm, >2.5 μm, >5.0 μm and >10 μm). By subtracting these six values in turn, the individual particle counts are obtained, expressed as x0.5, x1.0, x2.5, x5.0 and x10.0 and listed in Table 1; the measured particle number concentrations are then converted to PM mass concentrations in the PM2.5 and PM10 size fractions. A minimal sketch of this differencing step is given below.
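For illustration only, a sketch of the differencing step (not the authors' code; the channel values are placeholders, and labelling each bin by its upper cut-point is our assumption):

```python
# Minimal sketch: turn the sensor's six cumulative channels (counts of
# particles larger than 0.3, 0.5, 1.0, 2.5, 5.0 and 10 um) into the five
# individual bin counts x0.5 ... x10.0 by subtracting adjacent channels.
import numpy as np

cumulative = np.array([1200.0, 800.0, 300.0, 90.0, 20.0, 5.0])   # placeholder reading, >0.3 ... >10 um

# bin counts, labelled here by the upper cut-point of each size interval
x05, x10, x25, x50, x100 = cumulative[:-1] - cumulative[1:]
print(x05, x10, x25, x50, x100)   # 400.0 500.0 210.0 70.0 15.0
```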
After this pretreatment the particle counter terms are independent of each other. The multi-input one-response preprocessing and prediction model can be written as Eq. (1) to obtain the concentration Ypm2.5:
Ypm2.5 = Wpm2.5 · Xpm2.5 + bpm2.5        (1)
where Wpm2.5 = [w1_pm2.5, w2_pm2.5, w3_pm2.5, w4_pm2.5, w5_pm2.5] contains the corresponding weight coefficients, Xpm2.5 = [x0.5, x1.0, x2.5, T, RH] contains the individual particle counts and the temperature and humidity readings, and bpm2.5 is the intercept of the model.
To obtain the concentration Ypm10, the multi-input one-response preprocessing and prediction model can be written as Eq. (2):
Ypm10 = Wpm10 · Xpm10 + bpm10        (2)
where Wpm10 = [w1_pm10, w2_pm10, w3_pm10, w4_pm10, w5_pm10, w6_pm10, w7_pm10] contains the corresponding weight coefficients, Xpm10 = [x0.5, x1.0, x2.5, x5.0, x10.0, T, RH] contains the individual particle counts and the temperature and humidity readings, and bpm10 is the intercept of the model.
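For illustration only, a sketch of Eqs. (1) and (2) as ordinary least-squares fits, reading off the weight vectors W and intercepts b (not the authors' code; the inputs and reference values are random placeholders):

```python
# Minimal sketch: fit the two multi-input one-response linear models of
# Eqs. (1) and (2) and recover their weight coefficients and intercepts.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 500                                              # placeholder number of samples
X_pm25 = rng.random((n, 5))                          # [x0.5, x1.0, x2.5, T, RH]
X_pm10 = rng.random((n, 7))                          # [x0.5, x1.0, x2.5, x5.0, x10.0, T, RH]
y_ref_pm25 = X_pm25 @ rng.random(5) + 1.0            # stand-in reference PM2.5
y_ref_pm10 = X_pm10 @ rng.random(7) + 2.0            # stand-in reference PM10

mlr25 = LinearRegression().fit(X_pm25, y_ref_pm25)
mlr10 = LinearRegression().fit(X_pm10, y_ref_pm10)
print("Wpm2.5 =", mlr25.coef_, "bpm2.5 =", mlr25.intercept_)   # Eq. (1)
print("Wpm10  =", mlr10.coef_, "bpm10  =", mlr10.intercept_)   # Eq. (2)
```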
(5) Minor comments:
p1,l10: a typo "algorithms"
p2,l5: additional parentheses
Table 4: a typo II, III for O3
Thanks for your suggestion. The errors are corrected in the new version.
Citation: https://doi.org/10.5194/amt-2023-163-AC1
- AC2: 'Reply on RC1', Gang Wang, 16 Nov 2023
- RC2: 'Comment on amt-2023-163', Alice Cavaliere, 02 Nov 2023
This study presents a low-cost multi-parameter air quality monitoring system (LCS) that incorporates diverse machine learning algorithms. While the utilization of GA-BP techniques is emphasized, the paper falls short in clearly elucidating its novelty, thereby preventing a comprehensive understanding of its unique contribution. Additionally, the presence of several structural weaknesses within the paper necessitates significant revisions to enhance its coherence and overall quality.
Specific Comments:
Introduction
The introduction would benefit from a more comprehensive review of recent literature. Additionally, the presence of unclear terms, such as "multi-dimensional multi-response" on line 16, requires clarification to ensure a precise and unambiguous understanding. In addition, the rationale behind the selection of the five specific algorithms used in the study remains unclear. Providing a clear justification for the choice of these algorithms would enhance the understanding of the research methodology and its relevance to the study's objectives.
Measurement setup
This section would greatly benefit from expansion to ensure a comprehensive understanding of the study. Specifically, there is a need for more clarity regarding the data collection process, including details on the quantity of data collected for each pollutant and any procedures employed for outlier removal. Additionally, in Section 2.1, the inclusion of a map illustrating the data collection site would provide crucial contextual information. In Section 2.2, it is essential to specify the precise names of the sensors used or provide access to datasheets, particularly for the Alphasense sensors. Clarifying whether the PM sensor used is named PM300S, for instance, would enhance the transparency of the study. Moreover, the paragraph discussing laboratory tests requires expansion. Given the apparent linearity of the sensor response to concentrations, it is necessary to explicate the rationale behind testing non-linear methods. Exploring concentration curves at various temperatures and humidity levels would contribute to a more thorough analysis. Lastly, directly citing the manufacturer of the reference monitors mentioned on line 23 of page 5, as well as providing information on the methodology employed for the weekly calibrations, would significantly strengthen the study's transparency.
Calibration method
Equation (2) seems unclear; there might be a typographical error, with X' instead of X. Furthermore, in Section 3.2, it would be beneficial to include additional statistics such as Mean Squared Error (MSE) or Mean Absolute Error (MAE) to provide a more comprehensive evaluation of the model's performance.
Results and discussion
Paragraph 4.1 presents intriguing insights; however, it could benefit from a clearer presentation. For instance, the method of determining the number of trees in the random forest is not explicitly elucidated. Additionally, while it is evident that a sub-period was chosen for testing, the rationale behind this selection remains unexplained. Clarifying these aspects would enhance the overall coherence and understanding of the paragraph.
In paragraph 4.2, including the size of each segment, as well as the reference temperature, humidity, and concentration range, would enhance the comprehensiveness of the experimental setup and contribute to a more detailed understanding of the study. In Figure 9 (b), including a normalized version of the Root Mean Square Error (RMSE) would be beneficial to enable an accurate comparison among the three periods. The same principle applies to Table 3; including a normalized version of the Root Mean Square Error (RMSE) would facilitate an accurate comparison among the different parameters.
Furthermore, the text mentions a division into train and test sets. It would be valuable to clarify whether cross-validation was also conducted to ensure the robustness of the model evaluation. These considerations are also valid for the results of the gas measurements. Moreover, it would be insightful to include a discussion of the varying results obtained for each segment. For instance, a more detailed analysis of why the performance for SO2 is consistently good for period II but considerably poorer for the other periods would enrich the understanding of the data and provide valuable insights into the underlying factors influencing the results.
Finally, there is a typo on line 6 of page 14 ("dada" instead of "data").
Conclusion
The conclusion paragraph would benefit from a more explicit discussion on the presence of a recommended algorithm for calibration and a thorough examination of its potential limitations. By addressing the challenges associated with generalizing black box models, notably random forests, the conclusion could provide a more nuanced understanding of the practical implications and constraints that may arise from the study's findings.
Citation: https://doi.org/10.5194/amt-2023-163-RC2
- AC3: 'Reply on RC2', Gang Wang, 16 Nov 2023
- AC4: 'Reply on RC2', Gang Wang, 17 Nov 2023
Viewed
| HTML | PDF | XML | Total | BibTeX | EndNote |
|---|---|---|---|---|---|
| 216 | 47 | 11 | 274 | 3 | 6 |