28 Mar 2022
Status: a revised version of this preprint is currently under review for the journal AMT.

Ch3MS-RF: A Random Forest Model for Chemical Characterization and Improved Quantification of Unidentified Atmospheric Organics Detected by Chromatography-Mass Spectrometry Techniques

Emily B. Franklin1, Lindsay D. Yee2, Bernard Aumont3, Robert J. Weber2, Paul Grigas4, and Allen H. Goldstein1,2
  • 1Department of Civil and Environmental Engineering, University of California Berkeley, Berkeley, 94720, USA
  • 2Department of Environmental Science, Policy and Management, University of California Berkeley, Berkeley, 94720, USA
  • 3Univ Paris Est Creteil and Université de Paris, CNRS, LISA, F-94010 Créteil, France
  • 4Department of Industrial Engineering and Operations Research, University of California Berkeley, Berkeley, 94720, USA

Abstract. The chemical composition of ambient organic aerosols plays a critical role in driving their climate- and health-relevant properties and holds important clues to the sources and formation mechanisms of secondary aerosol material. In most ambient atmospheric environments, this composition remains incompletely characterized: identifiable species are consistently outnumbered by those that have no mass spectral matches in the literature or in the NIST/NIH/EPA mass spectral databases, making them nearly impossible to identify definitively. This makes it difficult to use the full analytical capabilities of techniques that separate and generate spectra for complex environmental samples. In this work, we develop the use of machine learning techniques to quantify and characterize novel, or unidentifiable, organic material. This work introduces Ch3MS-RF (Chemical Characterization by Chromatography-Mass Spec Random Forest Modelling), an open-source R-based software tool for efficient, machine-learning-enabled characterization of compounds that are separated in chromatography-mass spec applications but cannot be identified by comparison to mass spectral databases. A random forest model is trained and tested on a known, representative 130-component external standard to predict the response factors of novel environmental organics based on position in volatility-polarity space and mass spectrum, enabling reproducible, efficient, and optimized quantification of novel environmental species. Quantification accuracy on a reserved 20 % test set randomly split from the external standard compound list indicates that random forest modelling significantly outperforms the commonly used methods in both precision and accuracy, with a median response factor % error of -2 % for modelled response factors compared to > 15 % for typically used proxy assignment-based methods.
Chemical properties modelling, evaluated on the same reserved 20 % test set as well as an extrapolation set of species identified in ambient organic aerosol samples collected in the Amazon rainforest, also demonstrates robust performance. Extrapolation set property prediction mean average errors for carbon number, oxygen to carbon ratio (O : C), average carbon oxidation state, and vapor pressure are 1.8, 0.15, 0.25, and 1.0 (log(atm)), respectively. Extrapolation set out-of-sample R² values for all properties modelled are above 0.75, with the exception of vapor pressure. While predictive performance for vapor pressure is less robust than for the other chemical properties modelled, random forest-based modelling was significantly more accurate than other commonly used methods of vapor pressure prediction, decreasing mean average vapor pressure prediction error to 0.24 (log(atm)) from 0.55 (log(atm)) (chromatography-based vapor pressure prediction) and 1.2 (log(atm)) (chemical formula-based vapor pressure prediction). The random forest model significantly advances untargeted analysis of the full scope of chemical speciation yielded by GCxGC-MS techniques and can be applied to GC-MS as well. It enables accurate estimation of key chemical properties commonly used in the atmospheric chemistry community, which may in turn be used to identify important tracers for further individual analysis more efficiently and to characterize compound populations uniquely formed under specific ambient conditions.
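The error metrics quoted in the abstract can be illustrated with a minimal Python sketch. It is not taken from Ch3MS-RF (which is R-based), and it rests on several assumptions: that "% error" means signed relative error, 100 × (predicted − actual) / actual; that "mean average error" denotes mean absolute error; and that out-of-sample R² uses the training-set mean as its baseline. The function names and the numbers in the example are hypothetical and chosen only for illustration.

```python
from statistics import mean, median


def median_percent_error(predicted, actual):
    """Median of signed percent errors (assumed definition of '% error')."""
    return median(100.0 * (p - a) / a for p, a in zip(predicted, actual))


def mean_absolute_error(predicted, actual):
    """Mean of absolute prediction errors (assumed meaning of 'mean average error')."""
    return mean(abs(p - a) for p, a in zip(predicted, actual))


def out_of_sample_r2(predicted, actual, train_mean):
    """1 - SSE/SST, with SST measured against the training-set mean (assumed baseline)."""
    sse = sum((a - p) ** 2 for p, a in zip(predicted, actual))
    sst = sum((a - train_mean) ** 2 for a in actual)
    return 1.0 - sse / sst


# Hypothetical held-out values, for illustration only (not data from the paper).
actual = [10.0, 12.0, 14.0, 16.0]
predicted = [11.0, 12.0, 13.0, 18.0]
train_mean = 13.0  # hypothetical mean of the training-set targets

mpe = median_percent_error(predicted, actual)
mae = mean_absolute_error(predicted, actual)
osr2 = out_of_sample_r2(predicted, actual, train_mean)
print(mpe, mae, osr2)
```

A signed median percent error close to zero indicates little systematic bias, whereas the mean absolute error reflects the typical size of individual errors. This is why the two statistics are worth reading side by side.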

Status: final response (author comments only)

Comment types: AC – author | RC – referee | CC – community | EC – editor | CEC – chief editor
  • RC1: 'Comment on amt-2022-99', Anonymous Referee #1, 17 Apr 2022
    • AC1: 'Reply on RC1', Emily B. Franklin, 18 May 2022
  • RC2: 'Comment on amt-2022-99', Anonymous Referee #2, 22 Apr 2022
    • AC2: 'Reply on RC2', Emily B. Franklin, 18 May 2022


Total article views: 346 (including HTML, PDF, and XML)
  • HTML: 255
  • PDF: 78
  • XML: 13
  • Total: 346
  • Supplement: 21
  • BibTeX: 3
  • EndNote: 3

Latest update: 24 May 2022
Short summary
The composition of atmospheric aerosols is extremely complex, containing an estimated hundreds of thousands of individual compounds. The majority of these compounds have never been catalogued in widely used databases, making them extremely difficult for atmospheric chemists to identify and analyze. In this work, we present Ch3MS-RF, a machine learning-based model that enables characterization of complex mixtures and prediction of structure-specific properties of unidentifiable organic compounds.