This is an improved manuscript – in particular, the additional details about the training and test data sets have substantially clarified the authors' approaches to sensor calibration. Most of my concerns have been addressed; remaining comments are listed below.
If the training and test sets are not the same (p. 6, line 1), it's hard to understand the utility (or even the meaning) of Table 4 – pooling all test and training data into combined ranges isn't terribly meaningful, since there is substantial overlap between the two. Moreover, combining them risks misleading the reader as to the true ranges of the two sets for individual sensors – the ranges for an individual sensor might be quite different from those shown in the table. I would recommend restructuring this table substantially to make clear the differences between the training and test sets for individual sensors, and discussing these differences in the table caption. If there were a way to visualize the differences (maybe some example histograms, along the lines sketched below?), that would also be helpful.
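To make the histogram suggestion concrete, something like the following would suffice (a minimal matplotlib sketch; the per-sensor data structure and variable names here are hypothetical, not taken from the paper):

```python
import matplotlib.pyplot as plt

# Hypothetical structure: {sensor_id: (train_values, test_values)}, where each
# array holds the pollutant concentrations that sensor saw in each data set.
def plot_train_test_histograms(sensor_data, bins=30):
    fig, axes = plt.subplots(1, len(sensor_data),
                             figsize=(4 * len(sensor_data), 3), squeeze=False)
    for ax, (sensor_id, (train, test)) in zip(axes.ravel(), sensor_data.items()):
        # Overlaid, normalized histograms make range differences visible even
        # when the training and test sets differ in size.
        ax.hist(train, bins=bins, alpha=0.5, density=True, label="training")
        ax.hist(test, bins=bins, alpha=0.5, density=True, label="test")
        ax.set_title(f"Sensor {sensor_id}")
        ax.set_xlabel("concentration")
        ax.legend()
    fig.tight_layout()
    return fig
```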
Results: in my original review I suggested that the random forest and hybrid approaches should not differ, since the training and test sets appeared to be identical. But since the ranges given in Table 4 turn out to be combined ranges, and not the ranges covered by each individual sensor, this may be incorrect – there may be sensors for which the ranges of the training and test sets differ substantially. (Whether this is actually the case is hard to evaluate based on the information in the paper and SI.) In such cases the two models may be expected to give different results, and could then be discussed individually.
Regardless, if the hybrid model is to be retained, the authors still need to provide information on the number of "crossings" between the RF and LR models, and on the fraction of time evaluated by RF versus the fraction evaluated by LR.
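These quantities should be easy to extract from the model-selection record. As a sketch (assuming a boolean time series marking which sub-model was used at each time step – the names here are illustrative):

```python
import numpy as np

def hybrid_usage_stats(used_rf):
    """used_rf: boolean array, True where the hybrid model evaluated the
    point with the RF sub-model, False where it used the LR sub-model."""
    used_rf = np.asarray(used_rf, dtype=bool)
    # A "crossing" is any switch between sub-models at consecutive time steps.
    crossings = int(np.sum(used_rf[1:] != used_rf[:-1]))
    frac_rf = used_rf.mean()     # fraction of time evaluated by RF
    frac_lr = 1.0 - frac_rf      # fraction of time evaluated by LR
    return crossings, frac_rf, frac_lr
```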
P. 16, lines 9-11: it should also be mentioned that this step would likely be improved by the use of kNN, rather than the k-means clustering that is currently used.
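For clarity on the distinction (a minimal scikit-learn sketch with placeholder data; this is not meant to reproduce the authors' implementation): k-means is unsupervised and assigns new points to learned centroids, whereas kNN assigns a new point the majority label of its nearest labeled training points, making direct use of the known groupings.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 3))          # placeholder sensor features
labels = (X_train[:, 0] > 0).astype(int)     # placeholder group labels

# k-means (current approach): unsupervised; assigns each point to the nearest
# of k learned centroids, ignoring any known group labels.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_train)

# kNN (suggested alternative): supervised; assigns a new point the majority
# label among its k nearest training neighbors, using the labels directly.
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, labels)

X_new = rng.normal(size=(10, 3))
print(km.predict(X_new))   # centroid-based cluster assignments
print(knn.predict(X_new))  # neighbor-based label assignments
```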
P. 16, lines 13-15: if the authors are going to continue to make this suggestion based on the current work (even with the new caveat added), they need to back it up with much more than a citation to another paper. Specifically, they need to show some evidence that the difference in performance results from the RAMP circuitry, and not from differences in the training/test sets used. I'm not sure how one would do this, but as written the sentence is purely speculative and not backed by any substantive evidence.
P. 18, line 14: it is stated that a new model should be developed "each year", but this is probably more specific than is warranted by the work. My takeaway is that models stay reasonably robust over timescales of several months, but should be periodically evaluated and updated when used over longer timescales (on the order of every ~6-18 months). I would recommend changing the wording to reflect this. This recommendation also appears in the abstract, and so should be changed there as well. (As a minor side note, I feel that including it in the abstract risks detracting from the more fundamental results of this work, related to generalized models, so I might recommend removing or shortening the sentence in lines 20-22 of the abstract.)
SI: in the Response to Reviews the authors state that "the randomized nature of the training approach for some models (such as the random forest models) will lead to slightly different results if these models are re-built." I don't understand this: if the algorithm uses a fixed random seed to generate pseudo-random numbers, the same pseudo-random numbers will be generated on every run, so the results should be exactly replicable. (More generally, if the results really do differ when the model is re-run, this represents a potentially major problem, as it calls into question the robustness of the reported results.)
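For example, in scikit-learn (assuming that is the toolkit used – this sketch and its data are placeholders, not the authors' code), fixing random_state makes random forest training fully deterministic:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Placeholder data standing in for the calibration features and targets.
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 4))
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=500)

# With a fixed random_state, two independently trained forests are identical,
# so re-building the model reproduces the same predictions exactly.
rf1 = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
rf2 = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
assert np.array_equal(rf1.predict(X), rf2.predict(X))
```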