Global ensemble of temperatures over 1850–2018: quantification of uncertainties in observations, coverage, and spatial modeling (GETQUOCS)

Ilyas, Maryam; Nychka, Douglas; Brierley, Chris; Guillas, Serge

doi:https://doi.org/10.5194/amt-14-7103-2021

Articles | Volume 14, issue 11

https://doi.org/10.5194/amt-14-7103-2021

© Author(s) 2021. This work is distributed under
the Creative Commons Attribution 4.0 License.

https://doi.org/10.5194/amt-14-7103-2021

© Author(s) 2021. This work is distributed under
the Creative Commons Attribution 4.0 License.

Articles | Volume 14, issue 11

Research article

|

12 Nov 2021

Research article |

| 12 Nov 2021

Global ensemble of temperatures over 1850–2018: quantification of uncertainties in observations, coverage, and spatial modeling (GETQUOCS)

Maryam Ilyas, Douglas Nychka, Chris Brierley, and Serge Guillas

Download

Final revised paper (published on 12 Nov 2021)
Preprint (discussion started on 04 Jan 2021)

Interactive discussion

Status: closed

RC1:
'Comment on amt-2020-454', Anonymous Referee #2, 16 Feb 2021
Global Ensemble of Temperature over 1850-2018: Quantification of Uncertainties in Observations, Coverage and Spatial modelling (GETQUOCS)

Maryam Ilyas, Douglas Nychka, Chris Brierley, and Serge Guillas

An interesting paper with a few issues to be resolved. In their analysis of global temperature data using the multiresolution lattice kriging method, the authors extend the work of Ilyas et al 2017 by exploring the hyperparameter estimation uncertainty using a Monte Carlo sampling method. It is interesting to see the impacts of this hyperparameter uncertainty assessment. These are a potentially important source of uncertainty in assessments of observed global temperature change that have not previously been investigated in other studies. There are some issues with the paper structure, including a lack of concluding remarks. Additional discussion of the effects controlled by the sampled parameters and illustration of the impacts of their sampling throughout the temperature record is needed.

Main points:

1. While it is great that the paper includes estimates of hyperparameter uncertainty, I’m am left uncertain on the extent to which hyperparameter uncertainty translates into an appreciable uncertainty in the temperature fields and how this varies through the temperature record. The paper only provides examples of hyperparameter estimates and resulting fields for a single month. Is this representative of other months? Interpretation of differences between the ABC based and profile likelihood based analyses would be aided by showing how the profile likelihood hyperparameter estimates compare to those from ABC, for example in Figure 2. Similarly, temperature/uncertainty fields are only shown for a single month in Figures 3/4.

These points seem important to understand the benefits of sampling the hyperparameter uncertainty, also given that the method appears to be computationally expensive.

2. The review of prior literature is frequently a few years out of date and needs updating. Some cited studies are inaccurately or incorrectly described. See detailed comments for details.

3. There appear to be a few simplifications in the statistical model/uncertainty model used that are not discussed:

Hyperparameter estimates are global, estimated independently for each field, with no regional estimates, essentially modelling temperature anomaly variability as being identical at all locations over land and sea.

As a space only model (not space-time) there appears to be no accounting for persistence of temperatures used to aid reconstruction or accounted for in uncertainty estimates.

My understanding of MRLK is that it models observational error as identically distributed for each observed location. The analysis described makes no mention of the additional uncertainty terms for HadCRUT4 (in addition to the ensemble) that describe differences in observational error distributions for each grid cell and correlations in errors between grid cells, arising from the movement of marine measurement platforms. It appears that this information, that is not encoded into the HadCRUT4 ensemble members, has not been used and they are not described in Section 3.2. These uncertainties were found to be important in Morice et al 2012. Some comment on not including these, or how they are approximated by the additive uncorrelated error term in MRLK, would be appropriate.

4. The reader needs to refer to Nychka et al 2015 to understand the meaning of the lambda hyperparameter. The “aw” hyperparameter (please rename to use a single letter unless it is a product of two variables) does not appear to be described in Nychka et al 2015. I do not understand how to interpret the function of this “autoregressive weights” hyperparameter and how it might affect the resulting temperature fields.

5. Discussion of Figure 5 suggests that uncertainty estimates for global average temperature anomalies are wider for ABC that those of Ilyas et al 2015, but this is not particularly evident in Figure 5. Uncertainty ranges appear to be roughly comparable, of similar magnitude to the quantisation in the plot of roughly 0.01 °C, but slightly skewed one way or another. It’s not clear that the ABC sampling of the hyperparameter uncertainty would necessarily lead to wider uncertainty estimates in the global mean than the profile likelihood estimates. For example, the lambda parameter is defined in Nychka et al 2015 as lambda =noise variance / process variance. Sampling into high values of lambda would give a process with low variance and large measurement noise, which would lead to smaller uncertainties arising from sampling limitations. The lower variance of the ABC analysis field in Figure 3 suggests that this might be the case. It would be an alternative explanation to the changes in LatticeKrig 6.4 that are alluded to in the first paragraph on page 10.

6. The paper ends rather abruptly with a discussion of a sampling method (which arguably should be moved earlier in the paper). It would benefit from a conclusions section. Are there any deficiencies in the approach that we should be aware of? What's missing or could further developed? It could comment on developments while this paper was being worked on that are not included, e.g. for HadSST4 and talk more broadly about where this study fits alongside other research in the subject area. It would be an appropriate place to place a link to the data.

Detailed points.

Page 1, Abstract, line 1: Needs the word global in there to indicate that we’re talking about global temperature records?

Page 1, Abstract, Line 7: It’s not clear in the abstract what the “variation” in parameters is referring to. Hyperparameter estimates vary from month to month but not spatially. Or is this referring to the uncertainty in hyperparameter values, which otherwise isn’t stated in the abstract and is the key addition in the paper?

Page 1, line 15 – Good et al 2016 is a satellite-based skin temperature record, not air temperature?

Page 1, line 19 – It’s not exactly so simple as obtaining data from the WMO/GCOS. Modern messages are transmitted via these means but much work is required to compile observations from individual nation states and from research institutions to compile the historical records.

Page 2, line 11 – The most recent version of the NOAA record is now called NOAAGlobalTemp with the following reference:

Zhang, Huai-Min, Jay H. Lawrimore, Boyin Huang, Matthew J. Menne, Xungang Yin, Ahira Sánchez-Lugo, Byron E. Gleason, Russell Vose, Derek Arndt, J. Jared Rennie, Claude N. Williams, 2019: Updated temperature data give a sharper view of climate trends, Eos, 100, doi.org/10.1029/2019EO128229. Published on 19 July 2019.

A newer version is also just been published:

Vose, R. S., Huang, B., Yin, X., Arndt, D., Easterling, D. R., Lawrimore, J. H., et al. (2021). Implementing full spatial coverage in noaa's global temperature analysis. Geophysical Research Letters, 48, e2020GL090873. https://doi.org/10.1029/2020GL090873

Page 2, line 14 – Ishii et al., 2005 describes only the marine portion of the JMA temperature record.

Page 2, line 15 – The latest version of HadCRUT, HadCRUT5, has the following reference. Note the date for the final published version as 2021, not 2020:

Morice, C. P., Kennedy, J. J., Rayner, N. A., Winn, J. P., Hogan, E., Killick, R. E., et al. (2021). An updated assessment of nearâsurface temperature change from 1850: the HadCRUT5 data set. Journal of Geophysical Research: Atmospheres, 126, e2019JD032361. https://doi.org/10.1029/2019JD032361

Page 2, line 15 – The 2013 paper for Berkeley Earth only described the land data. The recent paper describing the merged land-ocean can be cited as:

Rohde, R. A. and Hausfather, Z.: The Berkeley Earth Land/Ocean Temperature Record, Earth Syst. Sci. Data, 12, 3469–3479, https://doi.org/10.5194/essd-12-3469-2020, 2020.

Page 2, line 19 – The GISS data set has long only used satellite nightlight data for bias adjustment of urban areas. It does not use satellite derived temperature information. The current version of MLOST does not use satellite data. The statements here likely refer to the ERSST3b sea surface temperature data set, which used satellite data and was once the marine data source for these data sets. The current version, ERSST5, does not use satellite data.

Page 2, line 26 – HadCRUT4 is not interpolated but the recently publish HadCRUT5 is. The JMA data set’s oceans are interpolated.

Page 2, line 30 – MLOST is not based on linear interpolation. It’s a combination of “low frequency” spatial running average and a “high frequency” reduced space analysis using a method called Empirical Orthogonal Teleconnections.

Page 2 line 31 – GISS uses linear distance weighting, not inverse linear distance weighting. No inverse involved (see Equation 1 of Lenssen et al., 2019 for the equation, or section 2 of Hansen et al., 2010 for a description). The linear distance weighting is correctly stated in the following sentence on line 32.

Page 2, line 34 – A reference is needed here. Does this use of kriging refer to the JMA COBE SST data set’s use of optimum interpolation?

Page 3, line 1 – Cowtan and Robert (2014) should be Cowtan and Way (2014) (i.e. the author’s name is Robert Way). Repeated again in other references to this paper (e.g. on line 3).

Page 3, line 4 – I would argue that these methods do not ignore variations at multiple length scales. For kriging/Gaussian process regression, the ability to represent multiple length scales is dependent on the covariance function used (which can be extremely flexible if constructed to be). The reconstruction method in NOAAGlobalTemp also represents multiple length scales in its own way though a reduced space decomposition. I assume that this comment on multiple length scales is alluding to MLRK, which explicitly represents multiple length scales as a sum of covariance functions. The distinction here is perhaps in MRLK flexibly to fit covariance structures with multiple scales without necessarily defining those structures in advance?

Page 3, line 5 – The new HadCRUT5 data set include a conditional simulation step to encode analysis uncertainties into an ensemble. Other data sets provided uncertainty estimates in their interpolation by other means but not through simulation.

Paragraph at page 3, line 15 - It is great to see the hyperparameter uncertainties and conditional simulation included. It does not appear that all components of the HadCRUT4 uncertainty model have been used though. In particular, those associated with biased observations for individual ships encoded into the HadCRUT4 error covariance matrices and per-grid-cell uncertainty estimates, not included in the ensemble, are not used. Instead the model seems to assume i.i.d. errors for each observed grid cell, with a stationary measurement error variance across all locations, estimated each month.

Page 3, line 27 – Should this comment on sparsity in covariance matrices refer instead to sparsity in inverse covariance matrices?

Page 4, line 19 – I know that I can refer to Nycha et al. (2015) to understand the function of the lambda parameter. The average reader will not know this. A reference to Nycha et al. (2015) would be appropriate here. Is “aw” one term or two? I don’t understand what this parameter does and I can’t find it in Nycha et al.

Page 4, line 21 – Typo? “The smoothness parameter lambda influence throughout the calculation”. What does is it influence?

Page 5, line 12 – Are these semivariances defined at the observation locations. Is there any binning etc. to compensate for biases sampling of e.g. short ranges when computing the empirical semivariogram?

Page 5, line 4 – d and rho are not defined in this paper.

Page 5, line 10 – version 4.5.0.0 but it’s appropriate for comparisons with Ilyas et al. (2017) if it is the same version as used there.

Page 7, line 6 – This Section 3.2 is essentially a recap of Morice et al (2012), with a heavy focus on the land data. Only the large-scale bias terms are discussed here and not the measurement and grid sampling uncertainty components. These are particularly important for marine regions as ship/buoy movement leads to spatially correlated error, which should be mentioned here. HadCRUT4 does not include these in the ensemble, but instead as additional spatial error covariance matrices. It seems that these have not been used in this paper.

Page 7, line 3 – The sentence should not begin with “So”.

Page 7, line 24 – It could be noted that the error model here represents the effects of potential residual biases when using station records that have been screened for urbanisation.

Page 7, line 28 - Sampling distributions for the HadCRUT4 ensemble are described in Morice et al. (2012). It would be sufficient here to refer to that study for the ensemble sampling methods rather than repeating it here and elsewhere in Section 3.2. The output of Morice et al. (2012) is used in this paper rather than reimplementation/modification of their methods so these methodological details are not core to this study.

Page 8, line 4 – Again, sampling distributions used in the construction of the input datasets used could be replaced with a reference to Morice et al. (2012) as they are not critical to the new work undertaken in this study.

Page 8, line 20 – Estimation of hyperparameters for each individual choice is an important design choice. There is no discussion of variation of the parameter estimates seasonally or in time later in the paper. It would be interesting to see this.

Page 8, line 24 – this is 10 hyperparameter sample draws for each month, yes?

Page 28, line 28 – do these marginal variances have interpretable units? Are these °C^2? It’s interesting if the marginal variance has little affect on uncertainty as it controls the variance of the process in interpolated regions and the relative importance of each spatial scale. Or is the uncertainty in process variances somehow pushed into lambda in the model’s parameterisation?

The next sentence says that this parameter is computed from a single field. Is there no seasonal variability in marginal spatial variance? Perhaps some comment on how the parameter should be interpreted would be helpful to explain why February 1988 is representative of the whole data set.

Page 9, line 2 – Use of the field with minimum coverage seems a strange choice rather than using a well sampled period. Is this because of computational cost limitations?

Figure 2 - It would be interesting to see the likelihood-based estimate here too. This would help to understand what's happening in Figure 3 in comparisons between ABC and likelihood-based fields. This figure would benefit from some accompanying discussion of the effects of the parameters on the fitted fields and how the parameter estimates differ from the likelihood-based parameters.

For example, Nychka et al (2015) indicates that lambda =noise variance / process variance. High values of lambda would give a process with low variance and large measurement noise. Would this result in e.g. lower variance fields than a likelihood based estimate of a lower lambda?

The aw parameter is interesting here. There's a lot of weight right at the edge of the prior distribution. Is the distribution being truncated by the choice of prior?

Page 9, line 11 – Some discussion of how parameters compare between ABC and likelihood methods would again be useful here. What is it about the sampled parameter values that leads to the differences?

Page 10, line 2 – The lower uncertainty in unobserved grid locations could explain why This could be the reason that global average statistics do not appear to be much affected. It seems like the ABC ensemble is leaning towards a model with greater observational noise variance and lesser process variance. This would explain the smoother fields in the figure 2 (a)-(b) comparison. Is this correct?

Figure 3 – Are these fields the ensemble means/medians? This is not stated. Are the results for this month representative of other months in terms of parameter estimates and uncertainty estimates? It would be useful to see how they compare for better sampled periods or other times of year. The plot needs units (°C?).

Figure 4 – How is uncertainty defined here? Is this the full ensemble spread? What is the statistic being shown? What are the units?

Page 12, line 5 – I think that this says that samples are drawn from the conditional normal distribution, with each HadCRUT4 ensemble member having 10 hyperparameter samples, and each of those having 100 samples from the conditional normal. However, the wording of “namely the variogram based ABC posteriors of autoregressive weights and smoothing parameter” does not include the conditional normal sampling.

Figure 5 – Is there a grey line plotted for the ensemble median of the old ensemble? If so then I can't see it. It would be helpful to include to show any differences (or lack thereof) in the mean/median with the hyperparameter sampling.

Figure needs units (°C?).

Page 14, line 5 – It’s not clear that the uncertainties are always larger for ABC. They seem comparable or slightly skewed relative to Ilyas et al 2017. Differences often appear to be around the scale of the apparent rounding resolution in the plot. It looks like the uncertainty range is often narrower for the new ensemble.

Is it guaranteed that the uncertainty estimates would be wider using ABC? Could lower uncertainties be possible if the hyperparameter distribution samples a region of the parameter space that leads to smaller process variance, and hence smaller coverage uncertainty estimates?

Page 14, paragraphs at lines 13 and 24 – This discussion of Latin hypercube sampling is a strange and abrupt way to end the paper. It would better be placed earlier in the method/results. This sampling is good to see though as a 100,000-member gridded dataset would be rather unwieldy to use.

Page 15, line 1 - No conclusions section? See main point 6.

Figure 6 – The axes need labels/units. It could be useful to see the resulting sampling for other locations too.
Citation: https://doi.org/10.5194/amt-2020-454-RC1
- AC1: 'Reply on RC1', Maryam Ilyas, 30 Mar 2021
  
  Comment: An interesting paper with a few issues to be resolved. In their analysis of global temperature data using the multiresolution lattice kriging method, the authors extend the work of Ilyas et al 2017 by exploring the hyperparameter estimation uncertainty using a Monte Carlo sampling method. It is interesting to see the impacts of this hyperparameter uncertainty assessment. These are a potentially important source of uncertainty in assessments of observed global temperature change that have not previously been investigated in other studies. There are some issues with the paper structure, including a lack of concluding remarks. Additional discussion of the effects controlled by the sampled parameters and illustration of the impacts of their sampling throughout the temperature record is needed.
  Response: We are grateful for the encouraging comments. The section-6 discussion will be replaced by conclusion and discussion where concluding remarks will be added. The impact of infusing uncertainties in the parameters will be discussed more in sections 4 and 5.
  Comment-1: While it is great that the paper includes estimates of hyperparameter uncertainty, I’m am left uncertain on the extent to which hyperparameter uncertainty translates into an appreciable uncertainty in the temperature fields and how this varies through the temperature record. The paper only provides examples of hyperparameter estimates and resulting fields for a single month. Is this representative of other months? Interpretation of differences between the ABC based and profile likelihood based analyses would be aided by showing how the profile likelihood hyperparameter estimates compare to those from ABC, for example in Figure Similarly, temperature/uncertainty fields are only shown for a single month in Figures 3/4.
  These points seem important to understand the benefits of sampling the hyperparameter uncertainty, also given that the method appears to be computationally expensive.
  Response: Uncertainties in the spatial field pretty much depend upon the spatial coverage. This single field was chosen to highlight the maximum impact of parametric uncertainties as it has the least spatial coverage. This point would be discussed in the paper. Additionally, comparisons of ABC and profile likelihood estimation procedures will be added.
  Comment-2: The review of prior literature is frequently a few years out of date and needs updating. Some cited studies are inaccurately or incorrectly described. See detailed comments for details.
  Response: Literature of more recent studies will be added. References will be corrected (apologies).
  Comment-3: There appear to be a few simplifications in the statistical model/uncertainty model used that are not discussed:
  Ø Hyperparameter estimates are global, estimated independently for each field, with no regional estimates, essentially modelling temperature anomaly variability as being identical at all locations over land and sea.
  Ø As a space only model (not space-time) there appears to be no accounting for persistence of temperatures used to aid reconstruction or accounted for in uncertainty estimates.
  Ø My understanding of MRLK is that it models observational error as identically distributed for each observed location. The analysis described makes no mention of the additional uncertainty terms for HadCRUT4 (in addition to the ensemble) that describe differences in observational error distributions for each grid cell and correlations in errors between grid cells, arising from the movement of marine measurement platforms. It appears that this information, that is not encoded into the HadCRUT4 ensemble members, has not been used and they are not described in Section 3.2. These uncertainties were found to be important in Morice et al 2012. Some comment on not including these, or how they are approximated by the additive uncorrelated error term in MRLK, would be appropriate.
  Response: We apologize for our failure to sufficiently describe the statistical/ uncertainty model. We agree on the first two. For the third one, additional uncertainty estimates that are not blended in the ensemble members are not considered in constructing this data product. All these details would be mentioned in the paper.
  Comment-4: The reader needs to refer to Nychka et al 2015 to understand the meaning of the lambda hyperparameter. The “aw” hyperparameter (please rename to use a single letter unless it is a product of two variables) does not appear to be described in Nychka et al 2015. I do not understand how to interpret the function of this “autoregressive weights” hyperparameter and how it might affect the resulting temperature fields.
  Response: Apologies for this confusion. We will replace “aw” with a single letter. Autoregressive weight is a sort of range parameter that is used in the precision matrix of the Gaussian Markov random field. The definitions of model parameters will be added.
  Comment-5: Discussion of Figure 5 suggests that uncertainty estimates for global average temperature anomalies are wider for ABC that those of Ilyas et al 2015, but this is not particularly evident in Figure 5. Uncertainty ranges appear to be roughly comparable, of similar magnitude to the quantisation in the plot of roughly 0.01 °C, but slightly skewed one way or another. It’s not clear that the ABC sampling of the hyperparameter uncertainty would necessarily lead to wider uncertainty estimates in the global mean than the profile likelihood estimates. For example, the lambda parameter is defined in Nychka et al 2015 as lambda =noise variance / process variance. Sampling into high values of lambda would give a process with low variance and large measurement noise, which would lead to smaller uncertainties arising from sampling limitations. The lower variance of the ABC analysis field in Figure 3 suggests that this might be the case. It would be an alternative explanation to the changes in LatticeKrig 6.4 that are alluded to in the first paragraph on page 10.
  Response: Thanks for the alternative explanation. Explanation in this context will be added in the paper.
  Comment-6: The paper ends rather abruptly with a discussion of a sampling method (which arguably should be moved earlier in the paper). It would benefit from a conclusions section. Are there any deficiencies in the approach that we should be aware of? What's missing or could further developed? It could comment on developments while this paper was being worked on that are not included, e.g. for HadSST4 and talk more broadly about where this study fits alongside other research in the subject area. It would be an appropriate place to place a link to the data.
  Response: As described earlier, the conclusion and discussion section will be improved. Latin hypercube details can be moved earlier as well.
  Detailed points.
  Comment: Page 1, Abstract, line 1: Needs the word global in there to indicate that we’re talking about global temperature records?
  Response: This will be changed.
  Comment: Page 1, Abstract, Line 7: It’s not clear in the abstract what the “variation” in parameters is referring to. Hyperparameter estimates vary from month to month but not spatially. Or is this referring to the uncertainty in hyperparameter values, which otherwise isn’t stated in the abstract and is the key addition in the paper?
  Response: These are uncertainties in the parameters. This will be clarified in the paper by modifying line-7.
  Comment: Page 1, line 15 – Good et al 2016 is a satellite-based skin temperature record, not air temperature?
  Response: Apologies. We will correct it.
  Comment: Page 1, line 19 – It’s not exactly so simple as obtaining data from the WMO/GCOS. Modern messages are transmitted via these means but much work is required to compile observations from individual nation states and from research institutions to compile the historical records.
  Response: We will improve this line.
  Comment: Page 2, line 11 – The most recent version of the NOAA record is now called NOAAGlobalTemp with the following reference: …..
  Response: Thanks. This line will be modified and the new references mentioned will be added.
  Comment: Page 2, line 14 – Ishii et al., 2005 describes only the marine portion of the JMA temperature record.
  Response: Noted thanks. This reference will be corrected.
  Comment: Page 2, line 15 – The latest version of HadCRUT, HadCRUT5, has the following reference. Note the date for the final published version as 2021, not 2020:
  Response: Thanks for highlighting. This will be corrected.
  Comment: Page 2, line 15 – The 2013 paper for Berkeley Earth only described the land data. The recent paper describing the merged land-ocean can be cited as:
  Response: Noted thanks. This reference will be corrected.
  Comment: Page 2, line 19 – The GISS data set has long only used satellite nightlight data for bias adjustment of urban areas. It does not use satellite derived temperature information. The current version of MLOST does not use satellite data. The statements here likely refer to the ERSST3b sea surface temperature data set, which used satellite data and was once the marine data source for these data sets. The current version, ERSST5, does not use satellite data.
  Response: Noted thanks. This line will be modified accordingly.
  Comment: Page 2, line 26 – HadCRUT4 is not interpolated but the recently publish HadCRUT5 is. The JMA data set’s oceans are interpolated.
  Response: We will clarify this line.
  Comment: Page 2, line 30 – MLOST is not based on linear interpolation. It’s a combination of “low frequency” spatial running average and a “high frequency” reduced space analysis using a method called Empirical Orthogonal Teleconnections.
  Response: Noted thanks. This line will be clarified.
  Comment: Page 2 line 31 – GISS uses linear distance weighting, not inverse linear distance weighting. No inverse involved (see Equation 1 of Lenssen et al., 2019 for the equation, or section 2 of Hansen et al., 2010 for a description). The linear distance weighting is correctly stated in the following sentence on line 32.
  Response: We will rephrase this. By linear inverse distance weighting we mean that the weight of each sample point decreases linearly from unity to zero. It does not explicitly include inverse. This interpolation scheme computes estimates by weighting the sample points closer to the prediction location greater than those farther away without considering the degree of autocorrelation for those distances.
  Comment: Page 2, line 34 – A reference is needed here. Does this use of kriging refer to the JMA COBE SST data set’s use of optimum interpolation?
  Response: Yes. We will add the reference as well.
  Comment: Page 3, line 1 – Cowtan and Robert (2014) should be Cowtan and Way (2014) (i.e. the author’s name is Robert Way). Repeated again in other references to this paper (e.g. on line 3).
  Response: Thanks. This will be corrected throughout the paper.
  Comment: Page 3, line 4 – I would argue that these methods do not ignore variations at multiple length scales. For kriging/Gaussian process regression, the ability to represent multiple length scales is dependent on the covariance function used (which can be extremely flexible if constructed to be). The reconstruction method in NOAAGlobalTemp also represents multiple length scales in its own way though a reduced space decomposition. I assume that this comment on multiple length scales is alluding to MLRK, which explicitly represents multiple length scales as a sum of covariance functions. The distinction here is perhaps in MRLK flexibly to fit covariance structures with multiple scales without necessarily defining those structures in advance?
  Response: We will modify and tone down this statement.
  Comment: Page 3, line 5 – The new HadCRUT5 data set include a conditional simulation step to encode analysis uncertainties into an ensemble. Other data sets provided uncertainty estimates in their interpolation by other means but not through simulation.
  Response: Noted thanks we will incorporate this.
  Comment: Paragraph at page 3, line 15 - It is great to see the hyperparameter uncertainties and conditional simulation included. It does not appear that all components of the HadCRUT4 uncertainty model have been used though. In particular, those associated with biased observations for individual ships encoded into the HadCRUT4 error covariance matrices and per-grid-cell uncertainty estimates, not included in the ensemble, are not used. Instead the model seems to assume i.i.d. errors for each observed grid cell, with a stationary measurement error variance across all locations, estimated each month.
  Response: We will explicitly highlight that only the uncertainties encoded in HadCRUT4 ensemble members are included in the study. We will hopefully consider gridded error estimates in our future work.
  Comment: Page 3, line 27 – Should this comment on sparsity in covariance matrices refer instead to sparsity in inverse covariance matrices?
  Response: Yes. Calculating the inverse is easier for sparse matrices as compared to those of dense matrices.
  Comment: Page 4, line 19 – I know that I can refer to Nycha et al. (2015) to understand the function of the lambda parameter. The average reader will not know this. A reference to Nycha et al. (2015) would be appropriate here. Is “aw” one term or two? I don’t understand what this parameter does and I can’t find it in Nycha et al.
  Response: We will add reference and clarify “aw” which is one term.
  Comment: Page 4, line 21 – Typo? “The smoothness parameter lambda influence throughout the calculation”. What does is it influence?
  Response: We will modify this line.
  Comment: Page 5, line 12 – Are these semivariances defined at the observation locations. Is there any binning etc. to compensate for biases sampling of e.g. short ranges when computing the empirical semivariogram?
  Response: These are empirical semi variances that are computed for observed spatial locations. All the standard rules are observed while computing the semivariogram. For example, the number of point pairs at each lag distance (or in each bin) is at least 30. Semivariogram is computed up to half of the maximum distance between the points over the whole spatial domain.
  Comment: Page 5, line 4 – d and rho are not defined in this paper.
  Response: We will define these terms in the paper.
  Comment: Page 5, line 10 – version 4.5.0.0 but it’s appropriate for comparisons with Ilyas et al. (2017) if it is the same version as used there.
  Response: This paper and Ilyas et al. (2017) are based on the same version 4.5.0.0.
  Comment: Page 7, line 6 – This Section 3.2 is essentially a recap of Morice et al (2012), with a heavy focus on the land data. Only the large-scale bias terms are discussed here and not the measurement and grid sampling uncertainty components. These are particularly important for marine regions as ship/buoy movement leads to spatially correlated error, which should be mentioned here. HadCRUT4 does not include these in the ensemble, but instead as additional spatial error covariance matrices. It seems that these have not been used in this paper.
  Response: We will discuss measurement and gridded sampling error estimates as well in this section. However, it is important to note that these are not being used in this paper.
  Comment: Page 7, line 3 – The sentence should not begin with “So”.
  Response: This line will be modified.
  Comment: Page 7, line 24 – It could be noted that the error model here represents the effects of potential residual biases when using station records that have been screened for urbanisation.
  Response: This will be added.
  Comment: Page 7, line 28 - Sampling distributions for the HadCRUT4 ensemble are described in Morice et al. (2012). It would be sufficient here to refer to that study for the ensemble sampling methods rather than repeating it here and elsewhere in Section 3.2. The output of Morice et al. (2012) is used in this paper rather than reimplementation/modification of their methods so these methodological details are not core to this study.
  Response: Sure, we will exclude excessive sampling details and cite Morice et al. (2012) for details.
  Comment: Page 8, line 4 – Again, sampling distributions used in the construction of the input datasets used could be replaced with a reference to Morice et al. (2012) as they are not critical to the new work undertaken in this study.
  Response: Noted. We will exclude sampling details from here too.
  Comment: Page 8, line 20 – Estimation of hyperparameters for each individual choice is an important design choice. There is no discussion of variation of the parameter estimates seasonally or in time later in the paper. It would be interesting to see this.
  Response: We will discuss how the parameter estimates vary across time.
  Comment: Page 8, line 24 – this is 10 hyperparameter sample draws for each month, yes?
  Response: That’s true.
  Comment: Page 28, line 28 – do these marginal variances have interpretable units? Are these °C^2? It’s interesting if the marginal variance has little affect on uncertainty as it controls the variance of the process in interpolated regions and the relative importance of each spatial scale. Or is the uncertainty in process variances somehow pushed into lambda in the model’s parameterisation?
  The next sentence says that this parameter is computed from a single field. Is there no seasonal variability in marginal spatial variance? Perhaps some comment on how the parameter should be interpreted would be helpful to explain why February 1988 is representative of the whole data set.
  Response: We estimated the marginal spatial variance for the field that has the maximum information. This slightly varies across time. These details will be added.
  Comment: Page 9, line 2 – Use of the field with minimum coverage seems a strange choice rather than using a well sampled period. Is this because of computational cost limitations?
  Response: Not really. We intend to show the coverage error estimates of the least sampled spatial field. Therefore, the posteriors are being shown for the corresponding spatial field to develop a link.
  Comment: Figure 2 - It would be interesting to see the likelihood-based estimate here too. This would help to understand what's happening in Figure 3 in comparisons between ABC and likelihood-based fields. This figure would benefit from some accompanying discussion of the effects of the parameters on the fitted fields and how the parameter estimates differ from the likelihood-based parameters.
  For example, Nychka et al (2015) indicates that lambda =noise variance / process variance. High values of lambda would give a process with low variance and large measurement noise. Would this result in e.g. lower variance fields than a likelihood based estimate of a lower lambda?
  Response: We will mention maximum likelihood estimates and discuss them with ABC posteriors.
  Comment: The aw parameter is interesting here. There's a lot of weight right at the edge of the prior distribution. Is the distribution being truncated by the choice of prior?
  Response: Not really. The choice of prior was guided by Nychka et al. (2015).
  Comment: Page 9, line 11 – Some discussion of how parameters compare between ABC and likelihood methods would again be useful here. What is it about the sampled parameter values that leads to the differences?
  Response: Sure, we will add discussion on ABC and maximum likelihood estimates.
  Comment: Page 10, line 2 – The lower uncertainty in unobserved grid locations could explain why This could be the reason that global average statistics do not appear to be much affected. It seems like the ABC ensemble is leaning towards a model with greater observational noise variance and lesser process variance. This would explain the smoother fields in the figure 2 (a)-(b) comparison. Is this correct?
  Response: This can be a reason.
  Comment: Figure 3 – Are these fields the ensemble means/medians? This is not stated. Are the results for this month representative of other months in terms of parameter estimates and uncertainty estimates? It would be useful to see how they compare for better sampled periods or other times of year. The plot needs units (°C?).
  Response: The field is ensemble median. The field represents a worst-case scenario (i.e. least spatial coverage). Parameter estimates and uncertainties vary across months. We will mention these details and add units.
  Comment: Figure 4 – How is uncertainty defined here? Is this the full ensemble spread? What is the statistic being shown? What are the units?
  Response: Figure 4a represents the gridded standard error in °C associated with the gridded predictions made in Figure 3a. The predictions are for the median ensemble member.
  Comment: Page 12, line 5 – I think that this says that samples are drawn from the conditional normal distribution, with each HadCRUT4 ensemble member having 10 hyperparameter samples, and each of those having 100 samples from the conditional normal. However, the wording of “namely the variogram based ABC posteriors of autoregressive weights and smoothing parameter” does not include the conditional normal sampling.
  Response: That’s true.
  Comment: Figure 5 – Is there a grey line plotted for the ensemble median of the old ensemble? If so then I can't see it. It would be helpful to include to show any differences (or lack thereof) in the mean/median with the hyperparameter sampling. Figure needs units (°C?).
  Response: Noted. We will try to improve this figure.
  Comment: Page 14, line 5 – It’s not clear that the uncertainties are always larger for ABC. They seem comparable or slightly skewed relative to Ilyas et al 2017. Differences often appear to be around the scale of the apparent rounding resolution in the plot. It looks like the uncertainty range is often narrower for the new ensemble.
  Is it guaranteed that the uncertainty estimates would be wider using ABC? Could lower uncertainties be possible if the hyperparameter distribution samples a region of the parameter space that leads to smaller process variance, and hence smaller coverage uncertainty estimates?
  Response: It is not guaranteed. We will improve this line and add details.
  Comment: Page 14, paragraphs at lines 13 and 24 – This discussion of Latin hypercube sampling is a strange and abrupt way to end the paper. It would better be placed earlier in the method/results. This sampling is good to see though as a 100,000-member gridded dataset would be rather unwieldy to use.
  Response: Noted. We will discuss Latin hypercube sampling earlier in the paper.
  Comment: Page 15, line 1 - No conclusions section? See main point 6.
  Response: We will add a conclusion section.
  Comment: Figure 6 – The axes need labels/units. It could be useful to see the resulting sampling for other locations too.
  Response: We will improve this figure and add more locations in the appendix maybe.
  
  Citation: https://doi.org/10.5194/amt-2020-454-AC1
RC2:
'Comment on amt-2020-454', Anonymous Referee #3, 01 Mar 2021

Summary of paper: In recent years, there have been a number of new reconstructions of the global temperature dataset, including at least three major government agencies (NOAA and NASA in the US, the Meteorological Office in the UK, maybe I should count Japan as a fourth) and some private efforts (the Berkeley Earth group and the Cowtan-Way-Robert papers). In an earlier paper published in 2017, three of the present authors sought to improve previously established methods by using the "LatticeKrig" approach to do spatial interpolation. LatticeKrig is a very sophisticated form of spatial statistics modeling that uses multi-resolution basis function expansions combined with Markov random fields for the coefficients. It's a much more advanced method than simple kriging, but the latter is still used by many authors (including some of the competing global temperature groups) for this kind of work. The present paper further extends this by using the approximate Bayesian computing (ABC) approach to construct a Bayesian version of the same algorithm. The advantage of the Bayesian approach is that it incorporates uncertainty in some of the spatial model parameters and thereby, in principle, should result in more precise assessments of uncertainty in the final product. The project required an enormous amount of computing (estimated at 62 months computer time, reduced to 8 months by parallelization) but the end result is a new data product that can realistically be compared with the other constructions that have been made.

Assessment: In the end, it seems to me that the authors come out with an incremental improvement on their own previous construction and it doesn't seem to be the "game changer" that the authors were possibly hoping for when they set out along this path. For example, the uncertainty bounds in Figure 5 are only very slightly different from those in their own previous construction, and the extended example on pages 9-11 seems to have been specially constructed as a worst-case scenario (as is explained in the paper, May 1861 was the month with the least spatial coverage of the entire dataset, therefore, presumably, the one for which the advantages of this kind of approach should be most clearly seen). Nevertheless, the work seems to be been competently carried out and it is always useful to have another dataset for comparison - I fully support publication.

My main concern about the paper is that the manuscript itself seems to have been put together in some haste. Prior to reading this, I was familiar with the LatticeKrig approach but not with the intimate details. Some of those details are important, e.g. the definitions of the lambda and aw parameters (page 4). I believe, throughout the paper, there is a need to give more explicit detail about the method. Given the 62 months of computer time, it seems unlikely that anyone else would want to exactly reconstruct this dataset, but nevertheless, I still feel that this should be a requirement of publication, that the method should be described in sufficient detail that anyone who wants to reconstruct the result has all the information required to do so.

In my initial correspondence with the editor, I queried why the authors had not produced a "supplemental materials" document - as I understand, such a document would be supported by the journal and I might suggest that the authors take advantage of this when revising the paper. Specifically, the kind of detail that is probably not relevant to the casual reader of the paper, but would be needed by anyone actually intending to try to reproduce the results, could very well go into an online appendix.

One further "general" comment - there is a passing reference to HadCRUT5 (Morice et al. 2020), which is the latest version of the U.K. Met. Office model. I wonder if maybe the authors should say a little more about this. I presume the production of HadCRUT5 overlapped the present effort but would the authors like to comment on how HadCRUT5 improves on HadCRUT4 and specifically how it compares with the present work?

Specific comments:

p. 1 line 16 - "phenomena" (plural of phenomenon)

p. 2 l. 15 Here the author Rohde is misspelled Rhode. This error also occurs in p. 2 l. 20, p. 3 l. 1 and p. 18 l. 24

p. 2 l. 34 Reference to JMA (?) where the question mark is a standard latex warning for a missing reference. Were the authors referring to the paper Ishii et al, mentioned on p. 2 l. 14?

p. 4, l. 19. Here the authors mention two parameters from Nychka et al 2015, called lambda and aw, but they never define these two parameters. The implication is that one can look up Nychka 2015 to find these definitions but I tried doing that and I think we need more assistance.

Nychka (2015) defines a parameter lambda=sigma^2/rho but they don't call it a smoothing parameter - that was my first confusion. I do note that the LatticeKrig R manual also defines lambda and does call it a smoothing parameter - the most recent version of this that I downloaded for preparing this review was version 8.4 dated November 2019 (the authors of this paper refer to version 6.4 as the one they used for the bulk of their computations). If the present authors want to call it "the smoothing parameter" without further explanation, they need to be precise about where this is defined, and the answer appears to be the LatticeKrig manual (which I'll subsequently refer to as LK), not Nychka (2015) (henceforth N15).

Both N15 and LK say that lambda and rho are computed my maximum likelihood and I understand that one of the objectives of the present paper is to extend that by using the ABC approach to approximate lambda, but what happened to rho? This isn't explained here, but later (p. 6 l. 4) they say, "both d and rho are still estimated using the maximum likelihood approach".

In fact d and rho are respectively an overall mean and a variance (scaling) parameter of a multivariate normal distribution and it is well known and trivial to implement that these parameters can be integrated out analytically, so as to focus attention on the spatial correlation parameters - in my first reading of the paper I assumed this was what they had done. But the comment on p. 6 l. 4 makes me wonder about this point. In summary, we need clarification of what the authors actually did. (If it really was a Bayesian approach, we also need to discuss the prior distribution, since rho in particular may require a proper prior.)

Now let me turn to the other parameter, aw, referred to as "autoregressive weights". This is based on the fact that at each level of the multiresolution process that defines the spatial model, the coefficients of the radial basis functions have the structure of a lattice process that is assumed to be of conditional autoregressive (CAR) form. However, here is no fixed structure for this and no single parameter called the autoregressive weight (or weights - it's not clear whether the authors actually meant to use a plural form here). p. 584 of N15 refers to a weight matrix B where the off-diagonal entries are -1 and the diagonal entries are of form B_{j,j}=4+kappa^2. In this case the lower bound 4 arises because the sum of entries in B is required to be positive. So is kappa the autoregressive weight? LK actually use a different notation, where they define a variable a.wght (this is the nearest I can find to any variable actually called "aw") and they comment (p. 23), "in the simplest case a.wght is the central value, and should be greater than 4". So is a.wght the same as 4+kappa^2 in the N15 notation? If so, which parameterization do the present authors actually use? Later (p. 8 l. 21) the authors give aw a prior density that is uniform on the interval (1,4), but now I'm really confused about how that particular range was determined ...

Another potential wrinkle is that N15 p. 585 explicitly mentions the possibility that the autoregressive structure may be different at different levels of the multiresolution process, but I'm reading between the lines that they didn't consider that extension in this paper.

I don't actually think any of these questions are complicated. I understand very well that there are certain model choices that you just have to make. The authors simply need to be explicit about what these choices were and how exactly the various parameters are defined.

p. 5. The flow diagram illustrating the ABC approach is clear and should be easy for the reader to follow, but again, some specific details are missing. How do they determine the number H of variogram sampling points and the specific vector of distances represented by h? I'm assuming that when you write gamma(h) you don't literally mean h as the distance (h is an index going from 1 to H) but gamma(d_h) for some distance d_h, but then the same question, what values did you actually use and how were they chosen?

p. 6 equations (5) and (6). I don't think we need you to define every symbol here but please give an exact source for these equations, which I assume are somewhere in N15?

p. 6 lines 16,17 - reference to a new dataset HadCRUT5 (which I wasn't aware of myself until reading this paper). I presume HadCRUT5 came out while this paper was being developed. I think this sentence should be moved to the discussion section and the authors should discuss how the two approaches compare and contrast with one another - are there any features in which HadCRUT5 improves on the present approach?

p. 7, l. 22 - "since the last few decades" - slightly awkward English construction here, maybe "during the last few decades" would be better. I am aware that the word "since" would be used in several other languages, for example "depuis" in French.

p. 7, l. 28. Unclear why you set any negative value to 0.0. While I'm well aware that we all talk about global warming and not global cooling, I don't think the possibility of cooling is excluded by basic atmospheric physics - if a stochastic model occasionally produces a negative value, why not include it in the analysis? From a political point of view, the authors should take care to avoid any implication that their approach was predetermined to result in a warming outcome.

p. 8 l. 28. This review is already getting rather lengthy and by this point I was definitely suffering from reviewer fatigue, but if I'm not mistaken, this is the first time in the paper there is any parameter called alpha. Please, either define it, or give an explicit prior reference where it is defined.

pp. 9-11. The authors are quite explicit that May 1861 was chosen for this illustration because it was the month with the poorest spatial coverage, and therefore presumably the one that best illustrates the advantages of using a more refined spatial approach, but I think it would be helpful to have at least some comparisons with other months. Are these kinds of plots typical of what we would expect if we just chose a month at random?

p. 10 lines 3-4: so part of the reason for the difference is that the LKrig function was improved between the two versions of LatticeKrig that were used for the 2017 paper and this one? Could you expand on that a bit - was that a major factor?

p. 12 l. 24 and Figure 5b: what exactly is the "median time series"? I'm inferring that each section of the time series was centered about its median value but what time scale was used for calculating the medians?

p. 14, this is an additional feature that is only introduced later in the paper and somewhat complicated to evaluate. It seems that the authors do not intend to publish their full 100,000-member ensemble but only a subset selected by a conditional latin hypercube sampling (CLHS) approach? I'm sure there are good reasons for doing that but at least from the appearance of Fig. 6, there appear to be some nontrivial differences between the two approaches, or am I misinterpreting this figure? Once again, the fact that they have shown this figure only for May 1861 may in some sense be a worst case scenario, but it would be helpful to clarify that point.

Citation: https://doi.org/10.5194/amt-2020-454-RC2
- AC2:
  'Reply on RC2', Maryam Ilyas, 30 Mar 2021
  Assessment: In the end, it seems to me that the authors come out with an incremental improvement on their own previous construction and it doesn't seem to be the "game changer" that the authors were possibly hoping for when they set out along this path. For example, the uncertainty bounds in Figure 5 are only very slightly different from those in their own previous construction, and the extended example on pages 9-11 seems to have been specially constructed as a worst-case scenario (as is explained in the paper, May 1861 was the month with the least spatial coverage of the entire dataset, therefore, presumably, the one for which the advantages of this kind of approach should be most clearly seen). Nevertheless, the work seems to be been competently carried out and it is always useful to have another dataset for comparison - I fully support publication.
  Comment: My main concern about the paper is that the manuscript itself seems to have been put together in some haste. Prior to reading this, I was familiar with the LatticeKrig approach but not with the intimate details. Some of those details are important, e.g. the definitions of the lambda and aw parameters (page 4). I believe, throughout the paper, there is a need to give more explicit detail about the method. Given the 62 months of computer time, it seems unlikely that anyone else would want to exactly reconstruct this dataset, but nevertheless, I still feel that this should be a requirement of publication, that the method should be described in sufficient detail that anyone who wants to reconstruct the result has all the information required to do so.
  In my initial correspondence with the editor, I queried why the authors had not produced a "supplemental materials" document - as I understand, such a document would be supported by the journal and I might suggest that the authors take advantage of this when revising the paper. Specifically, the kind of detail that is probably not relevant to the casual reader of the paper, but would be needed by anyone actually intending to try to reproduce the results, could very well go into an online appendix.
  Response: Apologies for not properly defining the model parameter definitions. These details will be added to the paper.
  Comment: One further "general" comment - there is a passing reference to HadCRUT5 (Morice et al. 2020), which is the latest version of the U.K. Met. Office model. I wonder if maybe the authors should say a little more about this. I presume the production of HadCRUT5 overlapped the present effort but would the authors like to comment on how HadCRUT5 improves on HadCRUT4 and specifically how it compares with the present work?
  Response: Sure, this discussion will be added.
  Specific comments:
  Comments: p. 1 line 16 - "phenomena" (plural of phenomenon)
  2 l. 15 Here the author Rohde is misspelled Rhode. This error also occurs in p. 2 l. 20, p. 3 l. 1 and p. 18 l. 24
  
  2 l. 34 Reference to JMA (?) where the question mark is a standard latex warning for a missing reference. Were the authors referring to the paper Ishii et al, mentioned on p. 2 l.14?
  
  Response: Apologies. These will be corrected.
  Comments: p. 4, l. 19. Here the authors mention two parameters from Nychka et al 2015, called lambda and aw, but they never define these two parameters. The implication is that one can look up Nychka 2015 to find these definitions but I tried doing that and I think we need more assistance.
  Nychka (2015) defines a parameter lambda=sigma^2/rho but they don't call it a smoothing parameter - that was my first confusion. I do note that the LatticeKrig R manual also defines lambda and does call it a smoothing parameter - the most recent version of this that I downloaded for preparing this review was version 8.4 dated November 2019 (the authors of this paper refer to version 6.4 as the one they used for the bulk of their computations). If the present authors want to call it "the smoothing parameter" without further explanation, they need to be precise about where this is defined, and the answer appears to be the LatticeKrig manual (which I'll subsequently refer to as LK), not Nychka (2015) (henceforth N15).
  Response: We will precisely define both the parameters and cite the LatticeKrig manual.
  Comments: Both N15 and LK say that lambda and rho are computed my maximum likelihood and I understand that one of the objectives of the present paper is to extend that by using the ABC approach to approximate lambda, but what happened to rho? This isn't explained here, but later (p. 6 l. 4) they say, "both d and rho are still estimated using the maximum likelihood approach".
  In fact d and rho are respectively an overall mean and a variance (scaling) parameter of a multivariate normal distribution and it is well known and trivial to implement that these parameters can be integrated out analytically, so as to focus attention on the spatial correlation parameters - in my first reading of the paper I assumed this was what they had done. But the comment on p. 6 l. 4 makes me wonder about this point. In summary, we need clarification of what the authors actually did. (If it really was a Bayesian approach, we also need to discuss the prior distribution, since rho in particular may require a proper prior.)
  Response: Apologies for this confusion. The proposed model is not fully Bayesian. Only ABC posteriors of lambda and autoregressive weights were being determined. Other parameters (e.g. d and rho) were estimated by maximum likelihood.
  Comments: Now let me turn to the other parameter, aw, referred to as "autoregressive weights". This is based on the fact that at each level of the multiresolution process that defines the spatial model, the coefficients of the radial basis functions have the structure of a lattice process that is assumed to be of conditional autoregressive (CAR) form. However, here is no fixed structure for this and no single parameter called the autoregressive weight (or weights - it's not clear whether the authors actually meant to use a plural form here). p.584 of N15 refers to a weight matrix B where the off-diagonal entries are -1 and the diagonal entries are of form B_{j,j}=4+kappa^2. In this case the lower bound 4 arises because the sum of entries in B is required to be positive. So is kappa the autoregressive weight? LK actually use a different notation, where they define a variable a.wght (this is the nearest I can find to any variable actually called "aw") and they comment (p. 23), "in the simplest case a.wght is the central value, and should be greater than 4". So is a.wght the same as 4+kappa^2 in the N15 notation? If so, which parameterization do the present authors actually use? Later (p. 8 l. 21) the authors give aw a prior density that is uniform on the interval (1,4), but now I'm really confused about how that particular range was determined …
  Another potential wrinkle is that N15 p. 585 explicitly mentions the possibility that the autoregressive structure may be different at different levels of the multiresolution process, but I'm reading between the lines that they didn't consider that extension in this paper.
  I don't actually think any of these questions are complicated. I understand very well that there are certain model choices that you just have to make. The authors simply need to be explicit about what these choices were and how exactly the various parameters are defined.
  Response: The reviewer is well justified in asking for clarification about the exact form for the spatial autoregression (SAR) parameter that controls the spatial dependence among lattice points. Although there is the freedom to choose a different parameter for each level of multiresolution level in this work it is fixed at a single value. This choice is consistent with more traditional covariance models such as the Matern family in spatial statistics. As the reviewer correctly pinpoints, the a.wght parameter built into the LatticeKrig code for rectangular lattice points is the autoregressive parameter and is the same as 4+kappa^2 in 2 dimensions, and in this form, the 4 nearest neighbors get a weight of -1 in the SAR formulation. Moreover, kappa can be loosely interpreted as a correlation range parameter for the spatial field. The particular application to a SAR model on the sphere and the icosohedral based lattice, however, led us to a slightly different parameterization. This is due to the fact that the lattice points can have different numbers of nearest neighbors. Twelve points will have 5 neighbors and the remaining points will have 6. Moreover, these points will not be exactly equally spaced. Given these irregularities, we found it easier to give a weight of 1 + exp(2*omega) to the center lattice point and constrain the sum of the nearest neighbor weights to sum to the negative one. Thus, in this spherical version we take a.wght= = 1 + (exp(omega)^2 = 1 + exp(2*omega) and parametrize this value through omega. exp(omega) can be interpreted as an approximate correlation range for the SAR and omega itself is useful in representing the log of this scale parameter. In particular, priors put on a.wght are done through the assumption that omega follows a uniform distribution over the interval [ -4.5, .55]. This choice covers a useful span of spatial correlations when omega is translated back into the a.wght parameter and subsequently into the dependence of the field at the lattice points.
  Comments: p. 5. The flow diagram illustrating the ABC approach is clear and should be easy for the reader to follow, but again, some specific details are missing. How do they determine the number H of variogram sampling points and the specific vector of distances represented by h? I'm assuming that when you write gamma(h) you don't literally mean h as the distance (h is an index going from 1 to H) but gamma(d_h) for some distance d_h, but then the same question, what values did you actually use and how were they chosen?
  Response: All the standard rules are observed while computing the semivariogram. Semivariogram is computed up to half of the maximum distance between the points over the whole spatial domain. The distances are Great circle distances in kilometers. The number of bins (H) is 12 that is somehow standard. “h” is the center of each of the 12 lag classes that are defined over half of the maximum distance. These details will be added to the paper.
  Comments: p. 6 equations (5) and (6). I don't think we need you to define every symbol here but please give an exact source for these equations, which I assume are somewhere in N15?
  Response: These equations are slightly different.
  Comments: p. 6 lines 16,17 - reference to a new dataset HadCRUT5 (which I wasn't aware of myself until reading this paper). I presume HadCRUT5 came out while this paper was being developed. I think this sentence should be moved to the discussion section and the authors should discuss how the two approaches compare and contrast with one another - are there any features in which HadCRUT5 improves on the present approach?
  Response: We will add these details in the discussion section.
  Comments: p. 7, l. 22 - "since the last few decades" - slightly awkward English construction here, maybe "during the last few decades" would be better. I am aware that the word "since" would be used in several other languages, for example "depuis" in French.
  Response: These sentences will be rephrased.
  Comment: p. 7, l. 28. Unclear why you set any negative value to 0.0. While I'm well aware that we all talk about global warming and not global cooling, I don't think the possibility of cooling is excluded by basic atmospheric physics - if a stochastic model occasionally produces a negative value, why not include it in the analysis? From a political point of view, the authors should take care to avoid any implication that their approach was predetermined to result in a warming outcome.
  Response: These details are for Morice et al. (2012). We will cite Morice et al. (2012) and remove these sampling details from this paper.
  Comment: p. 8 l. 28. This review is already getting rather lengthy and by this point, I was definitely suffering from reviewer fatigue, but if I'm not mistaken, this is the first time in the paper there is any parameter called alpha. Please, either define it or give an explicit prior reference where it is defined.
  Response: Sure, we will cite references and additional details for this parameter.
  Comment: pp. 9-11. The authors are quite explicit that May 1861 was chosen for this illustration because it was the month with the poorest spatial coverage, and therefore presumably the one that best illustrates the advantages of using a more refined spatial approach, but I think it would be helpful to have at least some comparisons with other months. Are these kinds of plots typical of what we would expect if we just chose a month at random?
  Response: More plots in the appendix can be added for comparison purposes.
  Comment:p. 10 lines 3-4: so part of the reason for the difference is that the LKrig function was improved between the two versions of LatticeKrig that were used for the 2017 paper and this one? Could you expand on that a bit - was that a major factor?
  Response: More explanation will be added.
  Comment: p. 12 l. 24 and Figure 5b: what exactly is the "median time series"? I'm inferring that each section of the time series was centered about its median value but what time scale was used for calculating the medians?
  Response: That's true. Also, for calculating the medians annual time series were used.
  Comment: p. 14, this is an additional feature that is only introduced later in the paper and somewhat complicated to evaluate. It seems that the authors do not intend to publish their full 100,000-member ensemble but only a subset selected by a conditional latin hypercube sampling (CLHS) approach? I'm sure there are good reasons for doing that but at least from the appearance of Fig. 6, there appear to be some nontrivial differences between the two approaches, or am I misinterpreting this figure? Once again, the fact that they have shown this figure only for May 1861 may in some sense be a worst case scenario, but it would be helpful to clarify that point.
  Response: We would be happy to publish the full ensemble. We are looking at the options. Also, in the appendix, we can add figures for other months.
  
  Citation: https://doi.org/10.5194/amt-2020-454-AC2

Peer review completion

AR: Author's response | RR: Referee report | ED: Editor decision | EF: Editorial file upload

AR by Maryam Ilyas on behalf of the Authors (01 Aug 2021) Author's response Author's tracked changes Manuscript

ED: Referee Nomination & Report Request started (04 Aug 2021) by Francis Pope

RR by Anonymous Referee #2 (02 Sep 2021)

Suggestions for revision or reasons for rejection

Ilyas et al - Global Ensemble of Temperatures over 1850-2018: Quantification of Uncertainties in Observations, Coverage, and Spatial modelling (GETQUOCS) (second review)

This is an interesting paper that expands on the range of uncertainty sources affecting assessments of observed global temperature changes. Notably it investigates uncertainty in the parameters of the spatial statistical model used in its assessment – a previously little explored source of uncertainty in such assessments – and results in a new data set that will be available for future studies. Hence it is opinion that this is a valuable contribution.

I thank the authors for their detailed response to reviewers and revisions to the manuscript. The structure of the paper is improved by now including a conclusion section and adding a separate section to describe the latin hypercube sampling used. Additional methodological details are included that aid understanding and reproducibility of the work.

However, some typos, grammatical errors and figure references should be addressed before publication. My recommendation is therefore that the paper should be published after a few minor revisions.

Major comments

1. The additional discussion in the conclusions section helps to put this work into context of the current state of understanding in the field. The paper’s conclusions are at times perhaps overly negative about the value of the study. That the inclusion of the parameter ensemble does not have a substantial impact on uncertainty in global average temperature estimates is a useful result. The paper’s results indicate that parameter uncertainty estimates may be important at regional scales. The true impact of this will not be clear until the data set is used in other studies.

2. There appears to be no reference to Figures A1 and A2 in the text. Figures C1 and C2 appear to be referenced in the incorrect location (where it seems that A1 and A2 would be relevant).

Some additional comment on what we see in the figures in Appendix A and Appendix B would also be beneficial as they appear to show that the parameter estimates differ for differing months. This appears to have an impact on uncertainty estimates in less well-measured regions.

3. There are several typos and grammatical errors to be corrected (see detailed comments).

Detailed comments (line numbers corresponds to the version with tracked changes):

Page 1 Line14 – I suggest removal of the word natural from this sentence. The cited references assess observed variability with no such assessment on whether this variability is natural or non-natural (i.e. anthropogenically forced).

Page 1 line 15 –. A slightly odd choices of references with limited relevance to this study. The use of model here is rather broad, covering two assessments of NWP model output and an application of an integrated assessment model. Milton and Earnshaw (2007) compare NWP temperature output to individual observation sites and not global gridded data sets (although they do use spatially gridded data for precipitations). Edwards et al (2011) assess NWP temperatures against observations from only a single observing site (no spatial estimation or fields used). Glanemann et al. (2020) does use global temperature data from NASA GISTEMP.

Page 1, line 19 – To say this data is “mostly derived” from these sources may not be accurate. While much modern data is indeed shared under WMO and GCOS umbrellas, the historical station data sets depend greatly on data compilation by research and data rescue efforts. This includes work by groups at NOAA, the Climatic Research Unit, the International Surface Temperature Initiative, and others such as the ACRE initiative.

Page 2, line 29 – typo: does not.

Page 3 – line 10 – Could mention HadCRUT5 here, as this supersedes HadCRUT4 and includes spatial interpolation through kriging/gaussian process regression.

Page 3, line 26 – It has yet been mentioned that the paper uses HadCRUT4 so this line appears abruptly. Should this sentence appear later in the paper, in the first paragraph of Section 3 or at the end of section 3?

Page 4, line 27 – This additional description of the functions of the autoregressive weight and lambda parameters, with reference to Nychka et al. (2019) is appreciated.

Page 5, line 2 – There are a few grammatical errors/typos in this added text: “*The* empirical semivariogram”, “the average distance *between* each bin”, “*The* semivariance”, “The semivariogram”.

Page 7 line 20 – typos: “Therefore, it was was converted to an ensemble data *set* by Morice et al. (2012) using *the* Brohan et al. (2006) uncertainty model.”

Page 8, line 25-28 – this is now clear that only the ensemble is used in this study and not any additional uncertainty information provided in HadCRUT4.

Page 9, line 13 – should omega be a symbol here?

Page 9, line 25- the additional detail on why 1988 was chosen is useful. The additional technical details in this section improve the reproducibility of the study. This will also be of aid to readers that are familiar with the LatticeKrig method.

Page 9, line 32 – sentence construction: please change “Other two” to “Two other”.

Page 10, line 1 – It is helpful to see the modelling assumptions now clearly stated here. The additional figures and discussion in section 4.2 are welcome additions that aid interpretation of the differences between the new results and those of Ilyas et al (2017).

Page 14, line 15 – thank you for adding this discussion of the effects of lambda for different sampled values.

Figure 5 – It is rather hard to see the lines in this figure. Please consider increasing the contrast between the lines and shaded regions.

Page 17, line 1 – This should mention the reason that the data set it is rather unwieldy – the large ensemble.

Page 17 lines 11 – I recommend adding the word average here to be clear that this statement refers to global average temperature estimates. The case for regional temperature is not clear as it has not been investigated in detail.

Page 17, line 14 – Inconsistent use of capitals: HADCRUT4/HadCRUT4 (HadCRUT4 used elsewhere).

Page 17, line 15 onwards – That the inclusion of the parameter ensemble does not have a substantial impact on uncertainty in global average temperature estimates is a useful result. It indicates that the use of point value estimates in existing assessments does not have a substantial impact on total uncertainty estimates. The comment on use in studies of climate variability at line 15 is apt. While not investigated in detail in this paper, it is quite possible that the interpolation parameter uncertainty may be more important for regional climate studies than for the global average.

Page 17 line 17 – This statement here may be over critical of the value of the work. Spatial model parameter uncertainty estimates may be important for regional assessments, which is not ruled out by this work.

Appendices – These figures may be better suited to a supporting information section rather than appendices, dependent on the Journal’s style guidance.

Figures A1 and A2 - Figures A1 and A2 do not appear to be cited in the text. Should the references to Figures C1 and C2 at page 9 line 32 refer to these two figures? These two figures with Figure 2 suggest that the model parameters do change for different fields/times. The corresponding uncertainty fields in Figures 3, B1 and B2 suggest an impact of uncertainty in spatial fields, particularly in unobserved regions. This does not appear to be commented on in section 4.2.

Figures C1 and C2 – These appear to be referenced in the incorrect location in the text (see previous comment). They should be referred to in Section 6, around page 16 line 4.

Hide

ED: Publish subject to minor revisions (review by editor) (22 Sep 2021) by Francis Pope

AR by Maryam Ilyas on behalf of the Authors (27 Sep 2021) Author's response Author's tracked changes Manuscript

ED: Publish as is (13 Oct 2021) by Francis Pope

AR by Maryam Ilyas on behalf of the Authors (16 Oct 2021) Manuscript

Short summary

Instrumental temperature records are fundamental to climate science. There are spatial gaps in the distribution of these measurements across the globe. This lack of spatial coverage introduces coverage error. In this research, a methodology is developed and used to quantify the coverage errors. It results in a data product that, for the first time, provides a full description of both the spatial coverage uncertainties along with the uncertainties in the modeling of these spatial gaps.