11 Feb 2022
Status: this preprint is currently under review for the journal AMT.

Comprehensive detection of analytes in large chromatographic datasets by coupling factor analysis with a decision tree

Sungwoo Kim1, Brian M. Lerner2, Donna T. Sueper2, and Gabriel Isaacman-VanWertz1
  • 1Charles E. Via Jr. Department of Civil and Environmental Engineering, Virginia Tech, Blacksburg, VA, 24061, USA
  • 2Aerodyne Research, Inc., Billerica, MA, 01821, USA

Abstract. Environmental samples typically contain hundreds or thousands of unique organic compounds, and even minor components may provide valuable insight into their sources and transformations. To understand atmospheric processes, individual components are frequently identified and quantified using gas chromatography/mass spectrometry. However, due to the complexity and frequently variable nature of such data, data reduction is a significant bottleneck in analysis. Consequently, only a subset of known analytes is often reported for a dataset, and a large amount of potentially useful data is discarded. We present here an automated approach to cataloging and potentially identifying all analytes in a large chromatographic dataset and demonstrate the utility of our approach in an analysis of ambient aerosols. We use a coupled factor analysis/decision tree approach to deconvolve peaks and comprehensively catalog nearly all analytes in a dataset. Positive Matrix Factorization (PMF) is applied to small sub-sections of multiple chromatograms to extract factors that represent the chromatographic profiles and mass spectra of potential analytes, and peaks are then detected in these factors. A decision tree based on peak shape, noise, retention time, and mass spectrum is applied to discard erroneous peaks and combine peaks determined to represent the same analyte. With our approach, all analytes within the small section of the chromatogram are cataloged, and the process is repeated for overlapping sections across the chromatogram, generating a complete list of the retention times and estimated mass spectra of all peaks in a dataset. We validate this approach using samples of known compounds and demonstrate the separation of poorly resolved peaks with similar mass spectra and the resolution of peaks that appear in only a fraction of chromatograms. As a case study, this method is applied to a complex real-world dataset of the composition of atmospheric particles, in which more than 1100 analytes are resolved.
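
The last two stages described in the abstract (discarding erroneous peaks, then combining peaks that represent the same analyte across overlapping sections) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Peak` fields, the helper names `windows`, `cosine`, and `catalog`, and the thresholds `min_snr`, `rt_tol`, and `min_sim` are all hypothetical. The peak-shape test is omitted, and the factorization step that would produce the candidate peaks is assumed to have run already.

```python
import math
from dataclasses import dataclass


@dataclass
class Peak:
    """Candidate peak from one factor (all field names are assumptions)."""
    rt: float        # retention time of the peak apex
    height: float    # apex height of the factor's chromatographic profile
    noise: float     # baseline noise estimate for that profile
    spectrum: dict   # m/z -> relative intensity of the factor's mass spectrum


def windows(n_points, width, overlap):
    """Return (start, end) index pairs for overlapping chromatogram sections."""
    step = width - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than width")
    out, start = [], 0
    while True:
        end = min(start + width, n_points)
        out.append((start, end))
        if end == n_points:
            return out
        start += step


def cosine(a, b):
    """Cosine similarity between two sparse mass spectra."""
    dot = sum(v * b.get(mz, 0.0) for mz, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def catalog(peaks, min_snr=3.0, rt_tol=1.0, min_sim=0.9):
    """Apply a small chain of checks to build a de-duplicated peak list."""
    kept = []
    for p in sorted(peaks, key=lambda pk: pk.rt):
        # Check 1 (noise): drop peaks that do not rise clearly above baseline.
        if p.noise > 0 and p.height / p.noise < min_snr:
            continue
        # Check 2 (retention time + spectrum): same analyte as a kept peak?
        for i, q in enumerate(kept):
            close = abs(p.rt - q.rt) <= rt_tol
            if close and cosine(p.spectrum, q.spectrum) >= min_sim:
                if p.height > q.height:
                    kept[i] = p  # keep the stronger of the two duplicates
                break
        else:
            kept.append(p)  # no match found: treat as a new analyte
    return kept
```

In this sketch, two peaks are merged only when they agree in both retention time and mass spectrum, so co-eluting compounds with different spectra stay separate, while the same compound found in two overlapping sections is counted once.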

Status: final response (author comments only)

Comment types: AC – author | RC – referee | CC – community | EC – editor | CEC – chief editor
  • RC1: 'Comment on amt-2022-16', Anonymous Referee #1, 03 Mar 2022
    • AC1: 'Reply on RC1', Sungwoo Kim, 23 May 2022
  • RC2: 'Comment on amt-2022-16', Alexander Vogel, 23 Mar 2022
    • AC2: 'Reply on RC2', Sungwoo Kim, 23 May 2022

Total article views: 256 (including HTML, PDF, and XML)
  • HTML: 160
  • PDF: 83
  • XML: 13
  • Total: 256
  • Supplement: 27
  • BibTeX: 4
  • EndNote: 3
Views and downloads (calculated since 11 Feb 2022)

Latest update: 25 May 2022
Short summary
Atmospheric samples can be complex, and current analysis methods often require substantial human interaction and discard potentially important information. To improve the accuracy and reduce the computational cost of analyzing these large datasets, we developed an automated analysis algorithm that couples a factor analysis approach with a decision tree. We demonstrate that this algorithm catalogs approximately ten times more analytes than a manual analysis, in a quarter of the analysis time.