I find that, overall, this paper has been improved substantially since its initial submission and the authors are to be commended for the detail and care that they applied to each of my previous comments. The error analysis especially is much more rigorous and robust and produces quantifiable values that are of much greater use to the research community than what was present in the original draft. I am excited about taking my own personal infrared thermometer (which I normally use in my kitchen to tell me when my skillet will give my steak a good sear) and pointing it at the sky to retrieve the PWV. I find that there are still outstanding issues that should be addressed before this paper reaches final approval, though I feel it is on the path to that destination.
Major points
The present work examines two years of handheld infrared thermometer observations and uses PWV calculated from a weighted average of the two nearest radiosonde sites. Remotely sensed PWV observations are available from instruments situated much closer to the IR thermometer observing sites. The authors make use of these observations in a secondary role to illustrate the overall annual evolution of PWV over the domain. These results are displayed in Figure B1, a welcome addition to the discussion. However, this figure also raises a significant question: are the radiosonde observations (especially EPZ) biasing the analysis? It seems to me that the most representative observation for the IR thermometer is going to be from the AERONET station: it is not the closest observation point, but it is the most similar in altitude, it is substantially closer than the radiosonde sites, and its observations are much closer in time to the IR observations than the sondes are. Figure B1 indicates that the sondes tend to be more moist than the AERONET, with EPZ consistently more moist than the other observations. How is it known, therefore, that the spatially weighted average of the ABQ and EPZ sondes is the most representative observation and therefore the one around which this work should be built?
The authors do not make primary use of the remotely sensed observations for two reasons: representativeness (due to the altitude difference between Suominet and the IR thermometer) and missing data (due to a year-long gap in the AERONET and Suominet observations). The first issue is a valid one, but I feel the second deserves more inspection. After all, the AERONET observations are still available for approximately half of the observing period. It may be that two years of weighted-average radiosonde data that are 100s of km and approximately 6 h removed from the IR thermometer observations are better than one year of AERONET observations that are both spatially and temporally closer to the target. However, that point needs to be explicitly argued.
In the end, this work relies on observations that are frequently 6 h old (or 6 h early), at least 100 km away, and much more moist than more local observations. Using the radiosonde observations may be the appropriate course of action, but it needs to be demonstrated that this set of decisions is the correct one.
Finally, as I read through this work again, I’m left with one very fundamental question: how good is it? An analysis that shows the relationship of the IR PWV product to some kind of truth (be it the merged sondes, AERONET, etc.) seems to be lacking. The figures shown in the present work, such as the relationship between the sky temperature and PWV, are important, but the relationship between the new product and the truth is critical. This could take the form of a scatterplot, a histogram of the differences, a box-and-whisker plot of the differences in various PWV bins, etc., but something should be in there. Crucially, I do not have a sense of how well the product performs as a function of the PWV value itself.
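To make that request concrete, here is a minimal sketch of the kind of comparison I have in mind; the arrays are hypothetical stand-ins for the matched IR retrievals and the chosen reference:

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical matched samples: IR-derived PWV and the chosen "truth"
# (merged sondes, AERONET, etc.), both in cm.
rng = np.random.default_rng(0)
pwv_ref = rng.uniform(0.2, 4.0, 500)            # placeholder reference values
pwv_ir = pwv_ref + rng.normal(0.0, 0.35, 500)   # placeholder retrievals

diff = pwv_ir - pwv_ref
print(f"bias = {diff.mean():.2f} cm, RMSE = {np.sqrt((diff ** 2).mean()):.2f} cm")

fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(12, 4))

# (1) Scatter of retrieval vs. reference with the 1:1 line.
ax1.scatter(pwv_ref, pwv_ir, s=5)
ax1.plot([0, 4.5], [0, 4.5], "k--")
ax1.set_xlabel("Reference PWV (cm)")
ax1.set_ylabel("IR-derived PWV (cm)")

# (2) Histogram of the differences.
ax2.hist(diff, bins=30)
ax2.set_xlabel("IR minus reference PWV (cm)")

# (3) Box-and-whisker of the differences in PWV bins, which shows how the
# error behaves as a function of the PWV value itself.
edges = np.linspace(0.0, 4.0, 9)
groups = [diff[(pwv_ref >= lo) & (pwv_ref < hi)] for lo, hi in zip(edges[:-1], edges[1:])]
ax3.boxplot(groups, labels=[f"{lo:.1f}-{hi:.1f}" for lo, hi in zip(edges[:-1], edges[1:])])
ax3.set_xlabel("Reference PWV bin (cm)")
ax3.set_ylabel("IR minus reference PWV (cm)")
plt.tight_layout()
plt.show()
```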
Minor Points
98. For the observations taken at 2300 UTC, are they matched to temporally averaged radiosonde observations or are they just matched to the nearest sonde time?
112. It is important to emphasize that the determination of clear or cloudy skies is a subjective observation by a human observer.
115. The lack of brightness temperature observations below a given temperature threshold (resulting in NaN values) means that low PWV values cannot be observed with this method. What is the minimum PWV value that can be observed, and what is the seasonal distribution of missing data? This seems like an important issue that end users ought to be aware of. I assume that this is a more frequent occurrence in the high deserts of New Mexico than it is in the environment observed by Mims, and that wintertime values are more likely to be missing, but these points should be made explicit in the text.
Figure 1: This is an extremely minor point, and you can address or ignore as you see fit, but I find figures easier to interpret when grid lines are present.
142. When you say ground temperature, do you specifically mean skin temperature as measured by the IR thermometer?
165. If the Suominet and AERONET observations are going to be part of this analysis (even if only in the appendix), their locations should be noted on Fig. 2.
165. Sometimes the text refers to Figure N, and other times it refers to Fig. N. This may be a stylistic choice, as it appears that the word is spelled out at the start of a sentence but not elsewhere, so I don’t know how much consistency you are going for here.
174-175: I’m not seeing where your product appears in Appendix B (unless you only mean the merged sondes). This goes back to the point I made in the major comments above about not really getting a sense of the skill or utility of the product.
203. This seems like a counterintuitive way to approach the exceedance thresholding, as though the most important thing was to preserve 90% of the dataset instead of crafting a representative dataset. If the data are unrepresentative, they should not be used regardless of how many event dates must be removed. At a minimum, it is important to know how many standard deviations that 55% difference represents. (Also note: in the response to the reviewers, the authors stated this was a 75% threshold, so they should verify which value is the correct one.) It is easier to scientifically justify a standard-deviation-based filter than a filter designed to preserve a certain fraction of the total dataset, even if in the end you tune one filter to match the other.
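For concreteness, a standard-deviation-based filter could be as simple as the sketch below; the arrays, the relative-difference metric, and the 2-sigma choice are all purely illustrative:

```python
import numpy as np

# Hypothetical per-date PWV values (cm) from the two radiosonde sites.
rng = np.random.default_rng(1)
pwv_abq = rng.uniform(0.3, 3.5, 300)
pwv_epz = pwv_abq + rng.normal(0.2, 0.4, 300)

# Relative disagreement between the two sites on each observation date.
rel_diff = np.abs(pwv_abq - pwv_epz) / ((pwv_abq + pwv_epz) / 2.0)

# Keep only dates whose disagreement lies within n_sigma of the mean,
# rather than tuning the threshold to retain 90% of the dataset.
n_sigma = 2.0
cutoff = rel_diff.mean() + n_sigma * rel_diff.std()
keep = rel_diff <= cutoff
print(f"kept {keep.sum()} of {keep.size} dates (cutoff = {cutoff:.0%} relative difference)")
```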
226. This is close to what I was suggesting when I suggested a Monte Carlo simulation. My thinking was that you could take an IR-observed temperature, randomly perturb it by some value drawn from a Gaussian, and plug that into your tool to obtain a new PWV. Do that a few thousand times, and you’ll have an estimate of how the instrument uncertainties contribute to the uncertainties in PWV. This doesn’t take into account the uncertainties in the model, however, which your approach seems to do.
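A minimal sketch of that procedure, assuming (purely for illustration) an exponential sky-temperature-to-PWV relationship and a 1 C instrument uncertainty:

```python
import numpy as np

# Stand-in for the paper's fitted T_sky -> PWV relationship; the exponential
# form and the coefficients here are illustrative, not the authors' values.
def pwv_from_tsky(t_sky_c, a=0.045, b=0.9):
    return b * np.exp(a * t_sky_c)

t_obs = -20.0        # one observed zenith sky temperature (deg C)
sigma_instr = 1.0    # assumed instrument uncertainty (deg C)

rng = np.random.default_rng(42)
n_trials = 5000
t_perturbed = t_obs + rng.normal(0.0, sigma_instr, n_trials)
pwv_samples = pwv_from_tsky(t_perturbed)

print(f"PWV = {pwv_samples.mean():.2f} cm, "
      f"1-sigma spread from instrument noise alone = {pwv_samples.std():.2f} cm")
```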
230. Does this RMSE vary with the magnitude of the signal? Looking at Figure B1, an RMSE of 0.35 cm is very close to the observed value for the winter months. Do you expect that the error bars are very similar throughout the year, or do the larger PWV values in the summer have greater uncertainties associated with them?
285. This seems to imply that it may be possible to derive the appropriate relationships between PWV and the IR temp without needing to take two years of manual observations to generate a testing dataset. Is this true? Or, rather, are there ways to arrive at the needed coefficients using existing data? (Earlier, when I said that I wanted to point my IR thermometer at the sky to get PWV, I meant it.) In all seriousness, you have done a good job demonstrating that the system needs to be trained to specific locations due to the large climatological variability in water vapor content. But are there ways to achieve acceptable results using a priori data? I think this is an important point for the issues raised in the conclusions, as substantial datasets will need to be collected by citizen scientists and school groups just to train the relationships. If an initial model can be implemented immediately from prior observations, NWP, etc., the adoption of such a program will likely increase.
Figure B1. I keep coming back to this figure throughout reading and reviewing this paper. At times I wonder if this figure is important enough that it deserves promotion to the main body of the paper.
General Comments
Overall, I applaud the novel approach that the authors are taking to produce low-cost observations of an atmospheric parameter that has many applications, from climate studies to model assessment. While they haven’t developed the technique themselves, they are evaluating several different commercially available products to assess how well-suited they are to produce these observations and how a different location can impact the relationships used to obtain their targeted variable.
However, there are some fairly significant issues with the work that preclude publication at this time. I believe that this will require the team to redo almost all of their analysis. However, I do believe that ultimately the fundamental work should be published, and therefore I am recommending major revisions.
Specific Comments
The first thing that struck me while reading this paper is that this is not a method to observe total precipitable water (TPW), but really a method to observe precipitable water vapor (PWV) in clear-sky conditions. While one can argue that in clear skies the TPW is functionally equivalent to the PWV since there is no liquid or ice water present, this distinction is a valuable one: there are more sources of PWV data than TPW since measuring cloud characteristics is so challenging. There are several additional ways of measuring PWV that the authors do not address in the manuscript. These include a direct retrieval from ground-based hyperspectral IR observations (Turner 2005, https://doi.org/10.1175/JAM2208.1), PWV calculated from thermodynamic profiles retrieved from hyperspectral IR observations (Turner and Blumberg 2018, https://doi.org/10.1109/JSTARS.2018.2874968), Raman lidar, aircraft, etc.
This leads into the most significant concern that I have about the present work: the training and validation dataset has significant drawbacks and better choices may be available. It may be true that in the desert southwest the temporal and spatial variability is not large, but it remains that the data being used is, at a minimum, located 110 km and 6 h away from the desired quantity. I am surprised that the authors did not utilize the Suominet observations of PWV from the Socorro area, especially since one of the authors is the contact for that particular observing site. This may be due to thinking that the present work describes a TPW product and not a PWV product. It is true that the observation site is located on a mountain while the IR observations are presumably taken at a lower altitude. This criticism is tempered somewhat by the fact that the two radiosonde sites used for validation differ in elevation by ~400 m and so altitude differences are going to be an issue regardless of the validation set used. That being said, a quick glance at a 14 day time series at Albuquerque (http://www.atmo.arizona.edu/products/gps/P034_14day.gif) and Socorro (http://www.atmo.arizona.edu/products/gps/SC01_14day.gif) doesn’t really show a huge impact of the altitude (at least at the time of the writing of this review). Suominet has the advantage of a substantially better temporal resolution allowing a more direct comparison to the IR observations, and in fact, offering enough observations that it would be possible to average to reduce noise in the signal.
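As a sketch of the kind of averaging I mean (assuming, hypothetically, a 30-min Suominet series and one IR observation time), all values within an hour of the IR observation could simply be averaged:

```python
import pandas as pd

# Hypothetical 30-min Suominet PWV series (cm) and one IR observation time.
suominet = pd.Series(
    [1.20, 1.30, 1.25, 1.40, 1.35, 1.30],
    index=pd.date_range("2019-07-01 21:00", periods=6, freq="30min", tz="UTC"),
)
ir_times = pd.DatetimeIndex(["2019-07-01 23:00"], tz="UTC")

# Average all Suominet values within +/- 1 h of each IR observation, which
# both reduces noise and yields a nearly coincident comparison value.
window = pd.Timedelta("1h")
matched = [
    suominet[(suominet.index >= t - window) & (suominet.index <= t + window)].mean()
    for t in ir_times
]
print(matched)
```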
Even if they choose not to use the Suominet observations, there are ways that the radiosonde dataset can be leveraged to create a more representative data sample. Rather than using every single IR observation, it may be better to exclude from the analysis the cases in which there is a substantial difference between the two sites, and/or between the 0000 and 1200 UTC launches. By focusing on cases in which the spatiotemporal variability is small, the authors can have greater confidence in the retrieved product. This will reduce the number of data points, but I feel it will produce a stronger product overall.
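Something along the following lines (with illustrative column names, values, and a 0.3 cm threshold) would capture the spirit of that screening:

```python
import pandas as pd

# Hypothetical per-date radiosonde PWV (cm) at the two sites and launch times.
df = pd.DataFrame({
    "abq_00z": [1.1, 2.3, 0.8, 3.0],
    "abq_12z": [1.2, 1.6, 0.9, 2.9],
    "epz_00z": [1.3, 2.8, 1.5, 3.1],
    "epz_12z": [1.4, 2.0, 1.6, 3.2],
})

# Spatial spread: largest site-to-site disagreement at either launch time.
spatial = pd.concat([(df.abq_00z - df.epz_00z).abs(),
                     (df.abq_12z - df.epz_12z).abs()], axis=1).max(axis=1)
# Temporal spread: largest 0000-vs-1200 UTC change at either site.
temporal = pd.concat([(df.abq_00z - df.abq_12z).abs(),
                      (df.epz_00z - df.epz_12z).abs()], axis=1).max(axis=1)

# Keep only the cases where both spreads are small (0.3 cm is illustrative).
well_behaved = df[(spatial < 0.3) & (temporal < 0.3)]
print(well_behaved)
```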
The error analysis also seems to be somewhat lacking, as it tends to focus on the uncertainty of the regression while not addressing the influence of the uncertainty of the instrument or the measurement technique. A Monte Carlo approach may prove useful here: randomly perturbing the input brightness temperatures by a value drawn from a Gaussian distribution with a standard deviation equal to the instrument uncertainty, and repeating that over a set number of trials, may provide a more realistic assessment of how the instrument itself contributes to the error bars of the retrieved value. This doesn’t include the uncertainty induced by the way the instrument is held, which may also expand the uncertainty of the retrieved value.
Finally, I’d like to see a greater exploration of the differences between Mims et al. (2011) and the present work. What is the RMSE of the current dataset, and how does that compare to the RMSE obtained if you applied the Mims relationship to your data? In other words, how much are you improving the technique by tuning it for your specific location? Such an analysis would help increase the novelty of this paper.
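Schematically, that comparison is just two RMSE values computed on the same matched data; the exponential forms and coefficients below are placeholders, not the authors' or Mims' published values:

```python
import numpy as np

# Hypothetical matched observations: zenith sky temperature (deg C) and
# reference PWV (cm) from the merged sondes.
t_sky = np.array([-35.0, -25.0, -15.0, -5.0, 5.0])
pwv_ref = np.array([0.5, 0.9, 1.5, 2.3, 3.2])

def rmse(pred, ref):
    return np.sqrt(np.mean((pred - ref) ** 2))

# Locally tuned relationship (placeholder coefficients).
pwv_local = 2.0 * np.exp(0.04 * t_sky)
# The Mims relationship applied unchanged to the same temperatures
# (again, placeholder coefficients).
pwv_mims = 1.8 * np.exp(0.05 * t_sky)

print(f"RMSE, locally tuned fit: {rmse(pwv_local, pwv_ref):.2f} cm")
print(f"RMSE, Mims relationship: {rmse(pwv_mims, pwv_ref):.2f} cm")
```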
Technical Comments
Line 50. Consider how PWV (not TPW) is also being measured by various systems, based on the discussion above.
Line 75. How are the observations actually being taken? Is a human pointing a hand-held system towards the sky and writing down the observed temperature, or is a more robust method being used? Many IR thermometers have adjustable emissivities, and the default isn’t necessarily a blackbody. Were the emissivities set to the same value across all systems?
Line 77. Does the manufacturer note the wavelengths at which this instrument operates?
Line 99. This analysis of how to hand-hold a thermometer within 5 deg of zenith, and the fact that it results in less than 1 C uncertainty, is interesting, and the discussion of both points should be expanded.
Line 104. How are you screening for clouds? Observer judgement? Airport ceilometer? Satellite? IR thermometer threshold?
Line 111. I find it surprising that there is little dust in the middle of the high deserts of New Mexico. Why is the dust so low?
Fig 1. This figure is very confusing to me, and I apologize if there is something obvious that I’m missing. There are four categories: clear, cloudy, clear NaN, cloudy NaN. It seems like two separate things are going on. There is an instrument assessment to determine if the sky is clear or not (more detail on that is needed). But in the case of the NaNs, an external assessment of the clear or cloudy state has to be used because the instrument is not reporting anything. This is all coupled with the fact that the manuscript says that clouds were filtered out. Ultimately, I’m not sure what the figure is trying to tell me. A better approach may be a contingency table for each instrument that compares the external/instrument assessments in terms of clear/clear, clear/cloudy, cloudy/clear, and cloudy/cloudy, with special notes of the number of NaNs in each category.
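If it helps, such a contingency table is a one-liner with pandas; the handful of records below is hypothetical:

```python
import pandas as pd

# Hypothetical per-observation records: the external (human) sky assessment
# and the instrument-based assessment, with "NaN" marking no report.
obs = pd.DataFrame({
    "external":   ["clear", "clear", "cloudy", "clear", "cloudy"],
    "instrument": ["clear", "NaN",   "cloudy", "clear", "clear"],
})

# One such table per instrument: rows are the external assessment, columns
# the instrument assessment, with NaN reports kept as their own column.
table = pd.crosstab(obs["external"], obs["instrument"], margins=True)
print(table)
```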
Figure 2. By starting out the caption with (a,c) it is somewhat confusing to the reader (who may be more accustomed to going from a to b). It may be better to say something like “Comparisons between the AMES 1 and the FLIR i3 (left column) and the AMES 2 (right column) for clear sky (top row) and ground (bottom row).”
Line 140. This section would be greatly improved with a map showing the location of ABQ, EPZ, and Socorro, with elevation as the background color.
Line 156. The amount of data that is used in the analysis fits better in the methodology than in the results. I found myself using the values reported in Fig 1 to calculate the approximate number of datapoints for context before I got to this part of the paper.
Line 186. Is this R^2 for a linear correlation? If so, you may actually have a better fit than your numbers report, since the fit has an obvious non-linear shape.
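For example (with synthetic data and an exponential form chosen only for illustration), the R^2 of a nonlinear fit can be compared directly against the linear one:

```python
import numpy as np
from scipy.optimize import curve_fit

# Synthetic (T_sky, PWV) pairs with a deliberately curved relationship.
rng = np.random.default_rng(0)
t_sky = np.linspace(-40.0, 10.0, 50)
pwv = 1.5 * np.exp(0.05 * t_sky) + rng.normal(0.0, 0.05, 50)

def r_squared(obs, pred):
    ss_res = np.sum((obs - pred) ** 2)
    ss_tot = np.sum((obs - obs.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Linear fit.
m, b = np.polyfit(t_sky, pwv, 1)
r2_linear = r_squared(pwv, m * t_sky + b)

# Exponential fit (one plausible nonlinear form; the authors may prefer another).
popt, _ = curve_fit(lambda t, a, c: a * np.exp(c * t), t_sky, pwv, p0=(1.0, 0.05))
r2_exp = r_squared(pwv, popt[0] * np.exp(popt[1] * t_sky))

print(f"R^2 linear: {r2_linear:.3f}, R^2 exponential: {r2_exp:.3f}")
```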
Line 220: It doesn’t appear this way from the observations in Figure 4, but do the model studies show any evidence that the signal gets saturated (that is, is there a point where PWV is so high that any additional PWV can’t be detected from the brightness temperature observations)?
Line 257. This cost info is very important and should appear in the intro.