General Comments:
The revision has produced a much stronger and more integrated manuscript that describes a particular machine-learning approach for separating single-particle mass spectra by identity. The authors have provided more detailed information about the conceptual framework for their classification scheme, have carried out a rudimentary comparison to an alternative method, and have done a thorough job of exploring, presenting, and explaining the results from training and "blind" tests of their proposed method. Although the issue is mentioned briefly in the manuscript, the authors have not seriously engaged with assessing the utility of this method for analysis of ambient particle spectra, where presumably it would need to function to be useful. Are there situations in which this method could be used to essentially "pick out" the particles that match one of the training sets, while not attempting to differentiate "other" particles not included in training? If so, how different would the particle spectra need to be for this to succeed?
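To make this question concrete, one possible test (purely a sketch of what such an assessment could look like, not something the manuscript implements; the function and the `min_vote` threshold below are hypothetical) would be to threshold the forest's class-membership vote fractions and route low-confidence spectra to an "other" bin:

```python
import numpy as np

# Hypothetical sketch (not the manuscript's implementation): label a
# spectrum only when the trained random forest is confident, and route
# everything else to an "other" bin. min_vote is an assumed threshold.
def pick_out_known(forest, spectra, class_names, min_vote=0.8):
    probs = forest.predict_proba(spectra)          # vote fraction per class
    best = np.argmax(probs, axis=1)                # most-voted class index
    top_votes = probs[np.arange(len(spectra)), best]
    return [class_names[i] if v >= min_vote else "other"
            for i, v in zip(best, top_votes)]
```

An assessment along these lines would let the authors quantify how spectrally distinct "other" particles must be before they stop contaminating the trained classes.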
Specific Comments:
The authors state, p. 3, lines 11 – 12, that “interpretability is more limited with methods such as cluster analysis and neural networks” without justification. Such statements should include explanations and/or citations, or be removed if they represent opinions.
On p. 5, line 12, the authors state that "algorithms are known to struggle with chemically-similar aerosols…" but again provide neither a definition of "struggle" nor any discussion of how similar is too similar. Furthermore, the discussion on this page, lines 19 – 23, should mention that (as with all of the algorithms discussed in this paper) each method includes user-defined settings, and the choice of those settings significantly influences the outcome. Generalizations about performance are therefore difficult when little information about the settings is provided. An alternative approach the authors could explore is to reference specific articles in which particular methods/algorithms were used, and to comment on the successes and challenges illustrated by the specific results those studies obtained.
On p. 6, lines 4 - 5, the authors mention "measurement uncertainty" without specifying the quantity to which that uncertainty applies. Is it the identification, the peak areas, or something else?
In Section 2.3, the authors discuss binary decision trees without mentioning random forests, even though the term has already been introduced. It would be helpful to contextualize the binary trees within the random forest framework at the beginning of this discussion, accompanied by a short comment that the random forest approach will be described more thoroughly below.
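The point to make explicit is simple: a random forest is an ensemble of such binary trees, each trained on a bootstrap sample and combined by majority vote. A minimal illustration (scikit-learn is used here purely for notation; the parameter values are placeholders, not the authors' settings):

```python
# Illustration only: a random forest is an ensemble of binary decision
# trees combined by majority vote. Parameter values are placeholders,
# not the authors' settings.
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

single_tree = DecisionTreeClassifier(max_leaf_nodes=32)  # one binary tree
forest = RandomForestClassifier(
    n_estimators=100,     # number of trees
    max_leaf_nodes=32,    # nodes per tree
    max_features="sqrt",  # variables considered at each split
)
```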
In the methods section, parameters such as the number of nodes per tree (p. 11, line 10), the number of trees (p. 10, line 11), and the number of variables per split (p. 11, line 11) are stated, but the methodology for choosing these values is not explained in sufficient detail (or at all, in the case of the number of nodes). The criterion used to select the best settings is described as the "values that produce the lowest test error" – is this error simply the rate of incorrect identification?
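If "lowest test error" simply means the misclassification rate on held-out spectra, the selection procedure could be documented compactly along the following lines (a hypothetical grid search; the parameter ranges are placeholders, not the authors' values):

```python
# Hypothetical grid search documenting a "lowest test error" selection;
# the parameter ranges are placeholders, not the authors' values.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    "n_estimators": [50, 100, 200],    # number of trees
    "max_leaf_nodes": [16, 32, 64],    # nodes per tree
    "max_features": [10, 20, 40],      # variables per split
}
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring="accuracy",  # test error = 1 - accuracy
    cv=5,
)
# search.fit(spectra, labels) would then report the settings with the
# lowest cross-validated misclassification rate.
```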
On p. 11, line 18, the noun "asymptote" is used as a verb. The sentence should be rewritten.
On pp. 20 - 21, the authors illustrate the advantages of their method by noting that an unexpected contaminant was detected based on the results. The implication is that such detection is possible with their method but not with others; however, a distance-metric-based algorithm would likely also be able to identify this contaminant, since it contained additional peaks. The authors should clarify how this example specifically illustrates the strength of their method (if it does).
Figure 4 illustrates a comparison of results using the random forest and a distance classifier. However, no information is provided about the (user-defined) parameters used to define the clusters in the distance-metric example, which makes the comparison difficult to interpret; if those parameters were changed even slightly, the results would likely differ. Also, the labels a) and b) should be removed from the figure caption; "top row" and "bottom row" are sufficient. The figure would be more useful if the algorithm type were included in the labels of the individual matrices, so that one need not rely on the figure caption to identify what each matrix represents. For example, "Aerosol Confusion Matrix (Positive)" could be replaced with "Random Forest (Positive)" or "Euclidean Distance (Positive)" for clarity.
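To illustrate the sensitivity being raised here: a nearest-reference Euclidean classifier typically involves at least a user-defined match threshold, and shifting that threshold moves particles between classes. A schematic version (hypothetical; not the classifier actually used in the manuscript):

```python
import numpy as np

# Schematic nearest-reference Euclidean classifier (hypothetical, not
# the one used in the manuscript). The match threshold is user-defined,
# and small changes to it move particles between classes.
def classify_by_distance(spectrum, references, labels, threshold=0.5):
    distances = np.linalg.norm(references - spectrum, axis=1)
    nearest = int(np.argmin(distances))
    return labels[nearest] if distances[nearest] < threshold else "unmatched"
```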
Figure 5 is still confusing, in that it shows the ~1/3 of particles (soot) that are introduced into the AIDA chamber but that the PALMS instrument cannot detect. The figure caption suggests that the instrument transmission efficiency is discussed in the text, but that discussion (p. 18, lines 18 – 21) is very brief and is mostly directed at explaining the significant under-counting of the larger particles. This discussion should be expanded, and ideally the data presented in the figure should be corrected for the inlet transmission. As it stands, the pie charts only illustrate that the match between the input and detected concentrations is poor; it is also not specified whether the input aerosol in the chamber is given as a number concentration or a mass concentration, although presumably the PALMS results are number concentrations. The specificity with which the different particle types can be identified differs enough between positive and negative ion spectra to warrant more discussion than is given. Overall, the data presented in this figure cannot give readers confidence that the picture of the aerosol composition obtained in these experiments accurately represents what is actually present.
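For reference, the correction being requested is arithmetically simple: divide each size bin's detected counts by the size-dependent inlet transmission efficiency before forming the pie-chart fractions (the values below are placeholders, not PALMS calibration data):

```python
import numpy as np

# Illustration of the requested inlet-transmission correction; the
# efficiency values are placeholders, not PALMS calibration data.
detected_counts = np.array([120.0, 300.0, 80.0, 10.0])  # counts per size bin
transmission_eff = np.array([0.90, 0.70, 0.30, 0.05])   # fraction transmitted
corrected_counts = detected_counts / transmission_eff   # estimate at the inlet
corrected_fractions = corrected_counts / corrected_counts.sum()
```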