Comment on amt-2020-489

This manuscript provides a seven-step methodology for the calibration and quality assurance of low-cost air quality sensors. Thanks to the generalised nature of this method, it can be applied to a wide range of sensors and could potentially be used as a standard calibration procedure. The data processing script has been made publicly available, which maximises the applicability of the method and the impact of this research.

The authors have pointed out current challenges in the use of low-cost sensors, including the lack (or incomparability) of calibration procedures in many low-cost sensor application studies. They stress the need for a reliable and reproducible data calibration and postprocessing method. This manuscript is an important step towards this aim and therefore a valuable contribution to the literature in this field, as it has the potential to improve the data quality in future applications of low-cost sensors. The manuscript is well structured and clearly written.

My main suggestions to further improve the scientific quality of the manuscript are:
- Discuss the limitations of this method in more detail (Point 1)
- Add physical explanations of the observations (Point 5)

Specific comments

1. Please discuss the limitations of your calibration method in more detail (Point 2.1)

Application range of calibrated sensors (indoor vs outdoor vs mobile)
You stressed the importance of calibrating the sensors under conditions that are similar to those under which they will be (or have been) operated during the experimental application. This needs to be considered when defining the application range of the sensors. Thanks to their silent operating conditions and small size, low-cost sensors are suited for indoor as well as mobile applications (e.g. wearable sensors for personal exposure assessment). However, if the calibration is conducted outdoors, the sensors might not be suited for such applications as the environmental conditions may differ significantly in these environments. Furthermore, mobile deployments would require further data cleaning and validation steps as rapidly changing environments may have an impact on the sensor performance (e.g. Alphasense Ltd., 2013).

Sensor systems
As you have pointed out, low-cost sensors are often temperature and RH dependent as well as cross-sensitive to other pollutants. Therefore, it should be recommended to apply the presented calibration method to sensor systems (with additional sensors for T, RH and cross-sensitive gases) rather than individual sensors.

Data cleaning (Point 2.2.2)
In this step, point outliers are removed based on the assumption of a slowly changing airfield where peak exposures over a few seconds do not occur. However, such short-term (< 10 s) emissions may occur in certain settings (e.g. traffic emissions of nearby passing vehicles, cigarette emissions of passengers etc.). One advantage of the high spatial and temporal resolution of low-cost sensors is that such peak exposures may be captured. The proposed method, however, excludes such events. Please include this argument when defining the application range of the sensors (Point 2.1).
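
To illustrate this concern, here is a minimal sketch, assuming a generic Hampel-style (median-based) point-outlier filter. This is a stand-in and not necessarily the filter used in the manuscript; the function name, window length, threshold and the synthetic 1 Hz series are all hypothetical. It shows how a genuine three-second plume on a flat background would be flagged in the same way as a sensor glitch.

```python
from statistics import median


def flag_point_outliers(values, half_window=5, n_sigmas=3.0):
    """Flag samples that lie far from the median of their local window.

    Generic Hampel-style filter used for illustration only; it is an
    assumption, not the cleaning step of the manuscript.
    """
    flags = []
    for i, v in enumerate(values):
        lo = max(0, i - half_window)
        hi = min(len(values), i + half_window + 1)
        window = values[lo:hi]
        med = median(window)
        # Median absolute deviation, scaled to approximate a standard deviation
        mad = 1.4826 * median(abs(x - med) for x in window)
        flags.append(abs(v - med) > n_sigmas * max(mad, 1e-9))
    return flags


# Hypothetical 1 Hz series: near-constant background plus a real 3 s plume
series = [20.0 + 0.1 * (i % 3) for i in range(60)]
for i in (30, 31, 32):
    series[i] = 80.0  # e.g. exhaust of a passing vehicle

flags = flag_point_outliers(series)
removed = [i for i, flagged in enumerate(flags) if flagged]
print("indices flagged as outliers:", removed)
```

In this synthetic case the filter cannot distinguish the short real event from an artefact, which is why the application range should state whether such peaks are of interest.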

Line 115:
You state that, while demonstrated here with MOS, the proposed calibration method can equally be applied to electrochemical sensors. To strengthen this argument, please add a brief physical explanation, a reference, or experimental proof.

Line 221, line 240:
Please explain how you determined the splitting ratio between the training and validation periods. How much do your results differ when using other ratios?

4. Table 6: Please explain why you are using the medians and not the means of your statistical parameters (whereas in Line 221 you refer to the average RMSE).

5. While the manuscript nicely discusses the implications of its findings, it sometimes does not offer physical explanations for them:
Line 245: "If the graphs showed instability across the various folds, Step 4 was repeated and a new model was selected for validation". What causes this instability, and how can you ensure that the model stays stable under field conditions?
Section 3.4.4 (model selection): Different relationships between the input variables were found for different models, e.g. an inverse temperature dependence for NO2 was found for the best-fitting MLR, but no temperature dependence was found for the best-fitting RF. How can you explain this, and what type of physical relationship (e.g. temperature dependence) would you expect?
The model performance was found to be higher when using the ambient environmental conditions (T and RH) as parameters (e.g. Tables 6 and 7). However, you pointed out in the discussion (Line 619) that the internal conditions are more representative of the operating conditions of the sensor. What are possible explanations for this observation?

6. Line 292: Please specify "decent" and "good" agreement (e.g. with mean R2 and RMSE).

7. Line 327: You deployed (at least) two low-cost sensors. Have you quantified the agreement between the two sensors? If so, add a short sentence here, as it may be a strong argument for why it is sufficient to look only at the data of one representative sensor. Perhaps summarise the performance of the second sensor briefly in the main text. How can you explain the non-linear response of sensor s72 (Figure S8)?

Figure 2 (optional): Adding a timeline with (rough) dates would help the reader to comprehend the paragraph above more quickly.
Figures 4 c, d; 6; 7 a, b; 10 etc.: Make sure that all axes have units (even if only arbitrary units).
Tables 8 and 9: Add the R2 and RMSE values to the graphs to provide a comprehensive overview.

Line 503:
"the reference instruments did not impact the predictive accuracy of the models and can therefore [in this case] be ignored as a potential interference" - can this be generalised for all sensors? If not, add "in this case".

23. Line 508: "The uncertainty between RF models and MLR models was fairly similar" - replace "between" with "of".
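
Regarding the comments on Lines 221 and 240 and on Table 6, a sensitivity check of the splitting ratio could be sketched as follows. This is only an illustration under stated assumptions: the co-location data are synthetic, a single-predictor linear calibration stands in for the MLR and RF models of the manuscript, and all function names and train fractions are hypothetical. Reporting both the mean and the median of the resulting RMSE values would show whether the choice of summary statistic matters.

```python
import random
from statistics import mean, median


def fit_line(x, y):
    """Ordinary least squares for y = a + b * x (single predictor)."""
    mx, my = mean(x), mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sxy / sxx
    return my - b * mx, b


def rmse(a, b, x, y):
    """Root-mean-square error of the line a + b * x against y."""
    return (sum((a + b * xi - yi) ** 2 for xi, yi in zip(x, y)) / len(x)) ** 0.5


def split_sensitivity(x, y, train_fractions=(0.5, 0.6, 0.7, 0.8)):
    """Validation RMSE for several chronological train/validation splits."""
    scores = {}
    for frac in train_fractions:
        n_train = int(len(x) * frac)
        a, b = fit_line(x[:n_train], y[:n_train])
        scores[frac] = rmse(a, b, x[n_train:], y[n_train:])
    return scores


# Synthetic co-location data (hypothetical): raw sensor signal vs reference
rng = random.Random(0)
raw = [rng.uniform(0.0, 100.0) for _ in range(500)]
ref = [5.0 + 0.8 * r + rng.gauss(0.0, 2.0) for r in raw]

scores = split_sensitivity(raw, ref)
values = list(scores.values())
for frac, score in scores.items():
    print(f"train fraction {frac:.1f}: validation RMSE = {score:.2f}")
print(f"mean RMSE = {mean(values):.2f}, median RMSE = {median(values):.2f}")
```

A table of this kind for the actual data, with the models used in the manuscript, would answer both questions directly.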