In this revised manuscript, the authors addressed most of my previous concerns. Specifically, the revised manuscript now includes more details about the random forest model as well as the relevant measures to mitigate overtraining; the authors also discussed in greater detail the initial VOCs. I appreciate the authors' efforts and it is my opinion that the quality of the manuscript is greatly improved. I recommend the manuscript for publication after the following minor comments are addressed, which are intended to further improve the clarity and the flow:
Line 43: … exhibiting
Line 41-53: consider combining this paragraph with the next for a smoother flow.
Line 99-101: This sentence touches on a common concern, namely that some machine learning models may be less transparent/interpretable than conventional techniques, which gives the impression that the authors will discuss this concern in greater detail in this paragraph. Yet the rest of the paragraph drifts away from it, and the next paragraph opens with yet another statement on this “black box” concern. Please rewrite these two paragraphs to improve the logic and flow.
Line 235-237: This is a good start, but the outcome of this 12-fold cross validation is missing. Please include a figure or table, perhaps in the SI, to document the consistency of the model performance across all 12 folds. Whether splitting the dataset randomly is a good strategy for cross validation remains a subject of debate, but it is important to document all key details.
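To illustrate the kind of per-fold record I have in mind, here is a minimal Python sketch. It is not the authors' in-house MATLAB implementation: the one-variable least-squares fit is only a placeholder for the random forest, the data are synthetic, and all function names (`kfold_indices`, `fit_line`, `cross_validate`) are hypothetical. The point is simply that each of the 12 folds yields its own held-out RMSE and R², which can be tabulated directly.

```python
import math
import random


def kfold_indices(n_samples, k=12, seed=0):
    """Randomly split range(n_samples) into k nearly equal folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def fit_line(x, y):
    """Ordinary least squares for y = a + b*x (placeholder for the real model)."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    return my - b * mx, b


def cross_validate(x, y, k=12, seed=0):
    """Return one record per fold with the held-out RMSE and R^2."""
    records = []
    for fold_id, test_idx in enumerate(kfold_indices(len(x), k, seed), start=1):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(x)) if i not in held_out]
        # Fit on the training folds only, then score on the held-out fold.
        a, b = fit_line([x[i] for i in train_idx], [y[i] for i in train_idx])
        obs = [y[i] for i in test_idx]
        pred = [a + b * x[i] for i in test_idx]
        ss_res = sum((o - p) ** 2 for o, p in zip(obs, pred))
        mean_obs = sum(obs) / len(obs)
        ss_tot = sum((o - mean_obs) ** 2 for o in obs)
        records.append({
            "fold": fold_id,
            "n_test": len(obs),
            "rmse": math.sqrt(ss_res / len(obs)),
            "r2": 1.0 - ss_res / ss_tot,
        })
    return records


if __name__ == "__main__":
    # Synthetic data, for illustration only.
    rng = random.Random(1)
    x = [rng.uniform(0, 100) for _ in range(240)]
    y = [2.0 + 0.5 * xi + rng.gauss(0, 3) for xi in x]
    print("fold n_test     rmse       r2")
    for rec in cross_validate(x, y):
        print(f"{rec['fold']:>4} {rec['n_test']:>6} {rec['rmse']:8.3f} {rec['r2']:8.3f}")
```

A table of this form (one row per fold), or the equivalent box plot, would be sufficient. If the authors also try a non-random split, the same record format would make the two strategies directly comparable.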
Line 286-291: This is interesting. Is the decrease in RH usually accompanied by changes in other parameters/conditions, for example a change in weather system or cloud cover (i.e. a change in radiation)? Keep in mind that the features are not always independent variables (certainly not the case in this work, which is perfectly fine). If two correlated features A and B are equally important, the algorithm may assign high importance to one of them and very low importance to the other, thus giving the wrong impression that only one of the two matters. I would be a little surprised if the negative response of ozone to RH is really driven by NO2 uptake under high RH; after all, NO2 is only moderately soluble.
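One simple check that could accompany the feature importance discussion is to list the strongly correlated feature pairs, so that readers know which importances should be interpreted jointly. The Python sketch below is only an illustration under my own assumptions: the feature names, the synthetic values, and the 0.7 threshold are hypothetical and are not taken from the manuscript.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))
    va = sum((ai - ma) ** 2 for ai in a)
    vb = sum((bi - mb) ** 2 for bi in b)
    return cov / math.sqrt(va * vb)


def correlated_pairs(features, threshold=0.7):
    """Return (name_a, name_b, r) for every feature pair with |r| >= threshold.

    `features` maps a feature name to its list of values.
    The result is sorted by |r|, strongest first.
    """
    names = list(features)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(features[a], features[b])
            if abs(r) >= threshold:
                pairs.append((a, b, r))
    return sorted(pairs, key=lambda p: -abs(p[2]))


if __name__ == "__main__":
    # Synthetic, hypothetical features: radiation falls as RH rises,
    # while wind speed varies independently of both.
    n = 50
    features = {
        "rh": [float(i) for i in range(n)],
        "radiation": [100.0 - 2.0 * i + (1.0 if i % 2 else -1.0) for i in range(n)],
        "wind": [float((i * 7) % 5) for i in range(n)],
    }
    for a, b, r in correlated_pairs(features):
        print(f"{a} vs {b}: r = {r:+.3f}")
```

For any pair flagged this way, the individual importances are not separable, and the sum of the two (or the importance obtained when both are permuted together) is the more defensible quantity to discuss.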
Code and data availability: The authors mentioned in the response that the random forest model used in this work is not based on widely used software packages/platforms (such as python/scikit-learn) but was developed in-house in MATLAB (this is impressive, if implemented properly). The vast majority of the dataset used in this work is also not publicly accessible as of now. Please refer to the journal policy/guidelines on code and data availability. Given the absolutely critical role the random forest model plays in this work, I strongly recommend that the authors deposit the random forest model code in a reliable public repository aligned with the FAIR (Findable, Accessible, Interoperable, and Reusable) principles.