A hybrid optimal estimation and machine learning approach to predict atmospheric composition

Werner, Frank; Bowman, Kevin W.; Lee, Seungwon; Laughner, Joshua L.; Payne, Vivienne H.; McDuffie, James L.

doi:10.5194/amt-19-3095-2026

Articles | Volume 19, issue 9

https://doi.org/10.5194/amt-19-3095-2026

© Author(s) 2026. This work is distributed under
the Creative Commons Attribution 4.0 License.

Research article | 11 May 2026

Final revised paper (published on 11 May 2026)
Preprint (discussion started on 07 Oct 2025)

Interactive discussion

Status: closed

RC1:
'Comment on egusphere-2025-4864', Daniel Miller, 22 Jan 2026
Review AMT - egusphere-2025-4864
Title: A hybrid optimal estimation and machine learning approach to predict atmospheric composition
First author: Frank Werner
Summary
This paper describes the development and data product of the HYbrid REtrieval Framework (HYREF), which predicts sub-column carbon monoxide (CO) concentrations from Cross-track Infrared Sounder (CrIS) observations. The model was trained on the lower spatial resolution optimal estimation (OE) retrievals from TRopospheric Ozone and its Precursors from Earth System Sounding (TROPESS). The resulting machine learning (ML) data product for CrIS combines high spatial resolution with a characterization of degrees of freedom and retrieval errors, both of which are critical for comparing the dataset with other observations and models and for its use in data assimilation.

Overall Feedback
I think that this paper is great and worthy of publication with only minor revisions. Most of the feedback below consists of recommendations about different analytical techniques that may help improve the paper.

One specific point of feedback is a rather general critique of the value of simple linear regression analysis for statistically comparing similar datasets. In the limit that correlations approach 1, linear regression plots start to provide limited visual evaluation ability – or, to put it more bluntly, everyone has seen a good-looking regression, and they often look too similar to be useful. There is a more robust approach to constructing comparisons like this, known as the “Bland-Altman plot”, which also helps to incorporate additional statistical information about the data and better poses the fundamental question: “could variable B statistically replace variable A”. Figure 2 in your paper currently does a reasonable job of displaying the other dimensions that can matter for such a regression, given that its panels display retrievals, errors, and degrees of freedom.

I would recommend at least looking at Bland-Altman plots in response to this review and potentially including such analysis in the paper itself. In particular, Bland-Altman offers a better visual framework for handling data comparison tasks if the data are not linearly distributed or if the uncertainty varies in either of the compared datasets. This is particularly true for variables with non-Gaussian variability (e.g., logarithmically distributed variables such as optical thickness) or heteroscedastic uncertainty/variability. It looks to me as though these considerations might matter for the datasets in panels a, b, and c of Figure 2 – whereas panel d appears clearly normally distributed at all scales. One further relevant concern here is that neural network architectures such as yours are largely tuned toward Gaussian process prediction and can struggle (without adequate consideration) to handle heteroscedastic variability in datasets because of the common isotropic noise assumption [Stirn et al., 2022].

An example of a situation where analysis with Bland-Altman can significantly improve your analytical toolkit can be found in Knobelspiesse et al. (2019). This paper explores an instrument intercomparison for radiometric polarimeters – which exhibit non-Gaussian distributions in observed radiances as well as heteroscedastic variability in the degree of linear polarization (DoLP) uncertainty. The example therein is discussed in section 3.C and summarized visually in Figure 8 and Figure 9. The links below summarize the methodology and include a Python notebook demonstrating examples.
https://github.com/knobelsp/BlandAltman?tab=readme-ov-file
https://colab.research.google.com/github/knobelsp/BlandAltman/blob/main/BlandAltman.ipynb

Furthermore, as an ML retrieval example of how heteroscedasticity can cause issues with the application of machine learning methods, the cloud microphysics retrievals in Miller et al. (2020) struggle to handle retrievals across the whole range of variability of the retrieval datasets. This is because the statistical distributions of radiances and DoLP have rather heteroscedastic dependencies on the geophysical variables being retrieved.
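
As an illustration of the Bland-Altman comparison recommended above, the following is a minimal sketch and not the analysis from the manuscript or from the linked notebook. The arrays `oe_like` and `ml_like` are synthetic stand-ins invented for this example (log-normally distributed values with noise proportional to the signal), and the function name `bland_altman_stats` is hypothetical. It computes the quantities that a Bland-Altman plot displays: the per-sample mean, the per-sample difference, the bias, and the 95 % limits of agreement.

```python
import numpy as np


def bland_altman_stats(reference, candidate):
    """Return per-sample means, differences, bias, and 95 % limits of agreement."""
    reference = np.asarray(reference, dtype=float)
    candidate = np.asarray(candidate, dtype=float)
    mean = 0.5 * (reference + candidate)   # x-axis of a Bland-Altman plot
    diff = candidate - reference           # y-axis of a Bland-Altman plot
    bias = diff.mean()
    spread = diff.std(ddof=1)
    limits = (bias - 1.96 * spread, bias + 1.96 * spread)
    return mean, diff, bias, limits


# Synthetic stand-ins (NOT real retrievals): a log-normal "reference" and a
# "candidate" whose noise grows with the signal, i.e. heteroscedastic noise.
rng = np.random.default_rng(0)
oe_like = rng.lognormal(mean=0.0, sigma=0.5, size=5000)
ml_like = oe_like * (1.0 + 0.05 * rng.standard_normal(oe_like.size))

mean, diff, bias, (lower, upper) = bland_altman_stats(oe_like, ml_like)

# Crude heteroscedasticity check: compare the spread of the differences
# below and above the median of the per-sample means.
median_mean = np.median(mean)
spread_low = diff[mean <= median_mean].std(ddof=1)
spread_high = diff[mean > median_mean].std(ddof=1)

print(f"bias = {bias:+.4f}, limits of agreement = [{lower:+.4f}, {upper:+.4f}]")
print(f"spread of differences: low half = {spread_low:.4f}, high half = {spread_high:.4f}")
```

Plotting `diff` against `mean`, with horizontal lines at `bias`, `lower`, and `upper`, gives the plot itself. A spread that widens with the mean, as in this synthetic case, is the kind of heteroscedastic behaviour that a plain regression panel tends to hide.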

Specific Feedback
Fig 3e,f – could you match the number of significant figures in each of these colorbar labels? That might also help you keep the formatting of the numbers from one bar from nearly overlapping with the other as they are now.

I would recommend a clearer and earlier statement of why you want to look at the power spectrum for scale breaks in section 4.2. This will help keep the reader’s attention on the importance of the relevant result when you finally introduce the figures. I understand that this is a good way to test and demonstrate the value of the ML product beyond a simple sub-grid interpolator, but I think you could reorganize this section to make that clear. For example, you mention scale breaks on line 161, but it takes until line 180 to explain why spatial variability and the power spectrum are of interest.

Line 179: Typo – “cahnges”

Citations
Stirn, A., Wessels, H.-H., Schertzer, M., Pereira, L., Sanjana, N. E., and Knowles, D. A.: Faithful Heteroscedastic Regression with Neural Networks, arXiv [preprint], https://doi.org/10.48550/arXiv.2212.09184, 2022.
Knobelspiesse, K., Tan, Q., Bruegge, C., Cairns, B., Chowdhary, J., van Diedenhoven, B., Diner, D., Ferrare, R., van Harten, G., Jovanovic, V., Ottaviani, M., Redemann, J., Seidel, F., and Sinclair, K.: Intercomparison of airborne multi-angle polarimeter observations from the Polarimeter Definition Experiment, Appl. Opt., 58, 650–669, https://doi.org/10.1364/AO.58.000650, 2019.
Miller, D. J., Segal-Rozenhaimer, M., Knobelspiesse, K., Redemann, J., Cairns, B., Alexandrov, M., van Diedenhoven, B., and Wasilewski, A.: Low-level liquid cloud properties during ORACLES retrieved using airborne polarimetric measurements and a neural network algorithm, Atmos. Meas. Tech., 13, 3447–3470, https://doi.org/10.5194/amt-13-3447-2020, 2020.
Citation: https://doi.org/10.5194/egusphere-2025-4864-RC1
- AC1: 'Reply on RC1', Frank Werner, 26 Mar 2026
  
  The comment was uploaded in the form of a supplement: https://egusphere.copernicus.org/preprints/2025/egusphere-2025-4864/egusphere-2025-4864-AC1-supplement.pdf
  
  Citation: https://doi.org/10.5194/egusphere-2025-4864-AC1
RC2:
'Comment on egusphere-2025-4864', Anonymous Referee #2, 05 Feb 2026

Review of A hybrid optimal estimation and machine learning approach to predict atmospheric composition
Werner et al. present the development of a machine-learning-based retrieval framework (TROPESS-HYREF) that predicts subcolumn CO concentrations from CrIS observations, trained using TROPESS optimal estimation (OE) retrievals. Importantly, in addition to CO concentrations, the system also predicts retrieval diagnostics such as averaging kernels, degrees of freedom (DoF), and retrieval errors, which are critical for model-observation comparison, validation, and data assimilation. The paper is generally well written, and the proposed framework represents a potentially important development for improving the spatial coverage and computational efficiency of OE-based retrieval systems, which are often limited by computational bottlenecks. However, I do have several concerns regarding the model’s generalizability, and the interpretation of some of the reported results, which I believe should be addressed to strengthen the manuscript.
Major comments:
1. Generalizability and data-splitting strategy
The use of a 98%/1%/1% split for training, validation, and testing raises concerns regarding the generalizability of the ML model. While the absolute number of samples in the validation and test sets is large, the strong spatial and temporal correlations inherent in satellite observations mean that a random split does not guarantee independence. For example, the 10 June 2023 wildfire case is drawn from the 04/2023-01/2025 period used for training. Given that 98% of the data are included in the training process, the test set likely contains many samples that are spatially and temporally adjacent to training samples. Under this split strategy, the reported test-set performance may largely reflect re-prediction of patterns already seen during training rather than true out-of-sample generalization.
A more robust evaluation would involve temporally or regionally independent splits (e.g., holding out entire months, seasons, or geographic regions), or comparison with fully independent third-party observations such as in situ or ground-based measurements. As currently implemented, the 98%/1%/1% split limits the interpretability of the reported test results.
2. Evaluation of predicted diagnostics
A key claimed advantage of the TROPESS-HYREF framework is its ability to predict retrieval diagnostics such as averaging kernels, DoF, and retrieval errors. However, the evaluation presented in the paper focuses largely on CO column concentrations. Additional assessment of the predicted diagnostics would strengthen the paper. For example, how accurate and stable are the ML-predicted averaging kernels relative to OE? Are the predicted errors statistically consistent with OE-derived uncertainties? How suitable are these diagnostics for downstream applications such as data assimilation?
3. Claims regarding performance relative to OE
Some statements suggesting that the ML system may "outperform" the OE retrieval are concerning. Given that the ML model is trained to reproduce OE results, it is unclear how it could outperform the OE retrieval in a physical sense. Clarifying that the ML system improves coverage and computational efficiency, rather than retrieval accuracy relative to OE, would help avoid overinterpretation.
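To make the temporally independent split suggested in major comment 1 concrete, here is a minimal sketch under stated assumptions. The function name `split_by_month`, the daily synthetic time stamps, and the choice of held-out months are all hypothetical; the only values taken from the review are the 04/2023-01/2025 training period and the June 2023 wildfire case. Whole calendar months are held out, so that no test sample shares a month with any training sample.

```python
import numpy as np


def split_by_month(times, val_months, test_months):
    """Return boolean train/val/test masks that hold out whole calendar months."""
    # Month index = number of months since 1970-01, as plain integers.
    month_idx = np.asarray(times).astype("datetime64[M]").astype(np.int64)
    val_idx = np.array(val_months, dtype="datetime64[M]").astype(np.int64)
    test_idx = np.array(test_months, dtype="datetime64[M]").astype(np.int64)
    val_mask = np.isin(month_idx, val_idx)
    test_mask = np.isin(month_idx, test_idx)
    train_mask = ~(val_mask | test_mask)
    return train_mask, val_mask, test_mask


# Synthetic time stamps (one per day, NOT real observation times) covering
# the period quoted in the review, April 2023 through January 2025.
times = np.arange("2023-04-01", "2025-02-01", dtype="datetime64[D]")

# Hypothetical choice: hold out the wildfire month for testing and one other
# month for validation.
train_mask, val_mask, test_mask = split_by_month(
    times, val_months=["2024-06"], test_months=["2023-06"]
)

print("train / val / test samples:", train_mask.sum(), val_mask.sum(), test_mask.sum())
```

The same masking idea extends to regional hold-outs by replacing the month index with a latitude/longitude box index. Note that samples near a month boundary can still be temporally close to training data, so a buffer of a few days around each held-out block may be worth considering.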
Minor comments:
L96-97: The description of the forward and backward processes may be confusing for general readers, as it does not explicitly mention the backpropagation algorithm. A brief clarification would improve readability.
L101: The time range (04/2023–01/2025) is critical information and should be mentioned in the Data section.
Section 3: Feature preprocessing is not described: were different input features (radiances, latitude/longitude, UTC time, a priori values) normalized or scaled prior to training?
L106: Is it necessary to use the full spectrum, or would a reduced set of CO-sensitive channels suffice?
L188-189: The mesoscale processes associated with the identified spectral break should be discussed more explicitly.
Figure 4: A direct comparison of power spectral densities between ML-predicted CO and interpolated OE CO would be informative and could further clarify the added value of the ML approach.
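The comparison requested for Figure 4 can be illustrated with a minimal one-dimensional sketch. Everything here is an assumption made for illustration: the synthetic `fine` field (a large-scale wave plus small-scale noise) stands in for a high-resolution product, `interpolated` is the same field subsampled by a hypothetical factor of 4 and linearly interpolated back, and the helper `periodogram` uses a simple normalization that only matters up to a constant factor. It is not the spectral method used in the manuscript.

```python
import numpy as np


def periodogram(signal):
    """Return positive frequencies and a simple periodogram of a 1-D signal."""
    signal = np.asarray(signal, dtype=float)
    signal = signal - signal.mean()                # remove the mean (zero frequency)
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(signal.size, d=1.0)    # cycles per grid step
    return freqs[1:], np.abs(spectrum[1:]) ** 2 / signal.size


rng = np.random.default_rng(1)
n, step = 4096, 4                                  # step = hypothetical coarsening factor
x = np.arange(n)

# Synthetic field (NOT real CO data): large-scale wave plus small-scale noise.
fine = np.sin(2 * np.pi * x / 512.0) + 0.3 * rng.standard_normal(n)

# Coarse sampling followed by linear interpolation back to the fine grid.
coarse_x = x[::step]
interpolated = np.interp(x, coarse_x, fine[::step])

freqs, psd_fine = periodogram(fine)
_, psd_interp = periodogram(interpolated)

# Power at scales finer than the coarse grid can resolve.
coarse_nyquist = 0.5 / step
small_scales = freqs > coarse_nyquist
power_fine = psd_fine[small_scales].sum()
power_interp = psd_interp[small_scales].sum()

print(f"small-scale power: fine = {power_fine:.1f}, interpolated = {power_interp:.1f}")
```

In this synthetic case the interpolated field carries much less power beyond the coarse-grid Nyquist frequency than the fine field. Whether ML-predicted CO retains genuinely more small-scale power than interpolated OE CO is the question a direct comparison on the real data would answer.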

Citation: https://doi.org/10.5194/egusphere-2025-4864-RC2
- AC2: 'Reply on RC2', Frank Werner, 26 Mar 2026
  
  The comment was uploaded in the form of a supplement: https://egusphere.copernicus.org/preprints/2025/egusphere-2025-4864/egusphere-2025-4864-AC2-supplement.pdf
  
  Citation: https://doi.org/10.5194/egusphere-2025-4864-AC2

Peer review completion

AR – Author's response | RR – Referee report | ED – Editor decision | EF – Editorial file upload

AR by Frank Werner on behalf of the Authors (26 Mar 2026)

ED: Referee Nomination & Report Request started (27 Mar 2026) by Zhao-Cheng Zeng

RR by Anonymous Referee #2 (20 Apr 2026)

ED: Publish as is (20 Apr 2026) by Zhao-Cheng Zeng

AR by Frank Werner on behalf of the Authors (20 Apr 2026)

Short summary

We developed a hybrid machine learning-optimal estimation retrieval system that efficiently and accurately mimics operational retrieval results. Crucially, this algorithm also predicts critical diagnostic variables including observation operators needed for comparison with independent data and ingestion into downstream chemical data assimilation models.