the Creative Commons Attribution 4.0 License.
A novel probabilistic source apportionment approach: Bayesian auto-correlated matrix factorization
Anton Björklund
Manousos I. Manousakas
Jianhui Jiang
Markku T. Kulmala
Kai Puolamäki
Kaspar R. Daellenbach
- Final revised paper (published on 22 Feb 2024)
- Preprint (discussion started on 12 May 2023)
Interactive discussion
Status: closed
- RC1: 'Comment on amt-2023-70', Anonymous Referee #1, 26 May 2023
General comments
The manuscript by Anton Rusanen et al. constructed and evaluated a source apportionment (SA) methodology comprehensively and systematically. The Bayesian Matrix Factorisation (BAMF) model assumes the autocorrelation of PM sources, hence, it is compared to models in which autocorrelation is not assumed, such as Positive Matrix Factorisation (PMF), which is the most widely used. Also, it presents an a priori constraint effectivity assessment. Because of the worth of such a thorough exploration of a novel and advantageous SA mindset, I recommend publishing the manuscript subject to the revisions listed below.
My major concern is the lack of clarity in the manuscript due to the omission of details and discussion. Although the article is well structured, it misses information which would help the reader to understand the reasons for the decision-making. Moreover, the discussion of results is incomplete, and the conclusions are rather succinct and weak despite the strong results found. The manuscript needs to be thoroughly revised to show extensively the beneficial points of the BAMF and outline its limitations through a deeper explanation of the findings, model and factor-wise. The specific comments below list some points in which the explanations are weak or nonexistent.
Specific Comments
MAJOR COMMENTS
- One of my major concerns is the unreferenced statement that PM sources are autocorrelated; no reference supports it anywhere in the text. The claim is not unlikely, but it needs compelling supporting arguments.
- Line 44: This statement is too strong to be only justified by a time series of a certain site and period. You should back up this with references or more proof which highlights the need of considering auto-correlation in PM measurements. Also, further explanation on the cyclicity expected in 24h and its multiples and maybe intra-day correlation should be provided. Which sources are more (and less) expected to present such autocorrelations?
- Since BAMF is a newly developed SA method, the reader might need more information than Section 2.2 provides to fully understand its mechanism. Could you please provide further information on how the algorithm works and which steps it follows? Maybe you can add a workflow diagram to clarify the process.
- Line 60: “… measurement uncertainty determined as standard deviations of the error terms for each data point.” Why did you not calculate the uncertainty matrix as widely calculated, in the way described in the protocol of Ulbrich et al. (2009)? A different uncertainty matrix calculation could significantly affect model performance.
- The autocorrelation term implies that, from the Cauchy distribution, for high time lags (dt) the autocorrelation for the i, and i+1 time series is a uniform distribution. This might be true for certain sources, however, the traffic source, for instance, is expected to correlate better for dt = 24h rather than for dt=16h. Did you consider this drawback? How detrimental do you think the cyclicity disregard can be?
- The discussions of the plots lack an account of which factors are more accurate in each method and why one model might achieve a better description than the other. For example:
- Figure 4. BAMF does a better job overall, but maybe PMF does better for CCOA; can you hypothesise why? Relate this to the Table 1 results. Also, highlight that the factors with marked cyclical autocorrelation are those in which BAMF significantly outperforms BAMF-0 (e.g. COA).
- Figure 5: Why do you think the CCOA is the hardest factor to resolve even if the BBOA is well-described? It is agreed that it might be mixed with the BBOA, but what do you hypothesise makes BAMF-0 > BAMF?
- Figure 6: Discuss why the POA is better acknowledged by the PMF and SOA by the BAMF. Even if the final values shown in Table 2 support BAMF, don’t you think the POA bias is significantly high? Also, discuss in terms of difference depending on the lags factor-wise.
- Figure 9: Highlight the lag differences to truth of all models factor-wise.
- Please justify the intermediate approach for constraining profiles used in section 4.4. Why would you constrain half spectra? Why did you select m/z60 as the threshold?
- Discuss how representative these two datasets are of real-world measurements and how reproducible these tests are for other datasets with different locations, source mixes, temporal variability, meteorological influence… Is there any case in which you would rather apply PMF?
- Discuss how suitable BAMF would be for instruments other than the ToF-ACSM (Q-ACSM, AMS, offline filters, PTR-MS, X-ACT…).
- You mentioned in Line 5 how BAMF provided error estimation. Please, discuss the uncertainties provided by the model for each factor.
- Discuss if you believe the positive effect of considering factor autocorrelation (BAMF) is higher than the time-dependency of profiles effect (accounted by rolling PMF).
- Conclusions are short and weak despite the great number of tests performed and the valuable results obtained. Please elaborate on your findings:
- Improvements of BAMF over PMF in factorisation performance and its limitations in reconstruction performance.
- Results on the anchoring. Show which approach showed better results in BAMF-C.
- What are your intentions and further steps for BAMF? Is it prone to substitute PMF in the near term? Which aspects of the BAMF need future research?
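The lag dependence raised in the major comment on the autocorrelation term can be illustrated with a minimal sketch. It assumes the step prior has the form given in the comment on Lines 66-67, a Cauchy distribution centred on Gik with scale αa[k]Δti + αb[k]. The numerical values of alpha_a and alpha_b, and the function names, are hypothetical and chosen only for illustration; they are not taken from the manuscript.

```python
import math

def cauchy_pdf(x, x0, gamma):
    """Density of a Cauchy distribution with location x0 and scale gamma."""
    z = (x - x0) / gamma
    return 1.0 / (math.pi * gamma * (1.0 + z * z))

def step_scale(dt_hours, alpha_a, alpha_b):
    """Scale of the step prior, assumed linear in the time lag."""
    return alpha_a * dt_hours + alpha_b

# Hypothetical parameter values, for illustration only.
alpha_a, alpha_b = 0.05, 0.1
g_now = 1.0  # current factor concentration G[i, k]

# Prior density of "no change" (G[i+1, k] == G[i, k]) at several lags.
peaks = {}
for dt in (1, 16, 24):
    gamma = step_scale(dt, alpha_a, alpha_b)
    peaks[dt] = cauchy_pdf(g_now, g_now, gamma)
    print(f"dt = {dt:2d} h  scale = {gamma:.2f}  density at G[i,k] = {peaks[dt]:.3f}")
```

Under these assumptions the density at the previous value falls monotonically with the lag, so a 24 h lag is treated as less informative than a 16 h lag. A scale that is linear in the lag cannot express the diel periodicity expected for sources such as traffic.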
MINOR COMMENTS
- Line 2: “… are temporarily auto-correlated”. I believe this sentence is very strong if unreferenced. Maybe it could be better to point out how this has never been accounted for when designing source apportionment methods.
- Line 6: “…better than PMF…”. “Better” here sounds arbitrary unless you present some proof showing how and why this model is better. Maybe you could present the Pearson correlation coefficient for Gs comparing them with PMF, or something similar referring to the factorisation performance.
- Line 7: “highly cross-correlated components”: since this has not been properly explained yet, I would advise giving further information about which components these are, or substituting the expression with something like “not auto-correlated” or “susceptible to mixing”. Also, give a reason why these factors are challenging to resolve.
- Line 16: SOA is not only a result of gas-to-particle reactions, there are other formation mechanisms such as the coating of pre-existing particles. Hence, this sentence has to be rephrased or completed.
- Line 22: Please include the number of citations of the PMF paper, statistics on articles using this method, or reference to some paper which shows the predominance of this methodology.
- Line 26: The paper from Canonaco et al. (2013) does not “optimise the source profiles”, it demonstrates the high variability of profiles year-wise and suggests a method for accounting for certain capture of these evolving profiles (seasonal PMF). Please rephrase this sentence.
- Line 27: “point solution with arbitrary rotations”. PMF accounts for rotational ambiguity and statistical error assessment, hence, it does not provide a “point solution”. Maybe I did not understand what “point solution” refers to since this concept is not explained here nor before. Please add some explanation on its meaning and why PMF is providing “arbitrary rotations” even if the nxm space is explored with rotational tools (e.g. a-value, random seed, DISP etc.)
- Line 33: Add the citation of Heikkinen et al. (2020) which struggles with SA of the SMEAR II site low concentrations but managed to obtain main sources using machine learning techniques.
- Line 34: “…constrain POA sources chemical composition is usually…”. Here, it should be mentioned that some sources were better captured using constraints applied in time series, as in Chazeau et al. (2022).
- Line 41-42: “…the commonly used optimization goals do not include any temporal terms of the resolved components (Wang and Zhang, 2012; Paatero and Tapper, 1994), and thus any time information is ignored.”. The reader might find it hard to understand what this refers to. Do you mean that the optimisation parameter Q in PMF is not related to X temporal features? Please rephrase.
- Line 54: “Simply put… time-dependent concentration”. Here you could indicate that this implies the staticity of the profiles throughout the data, and how PMF has overcome this issue with the rolling PMF, achieving time-dependent profiles.
- Lines 66-67: The reader might need more information on why these distributions are selected and what their parameters are. Firstly, the parameters of the distributions could be shown explicitly for greater clarity, e.g. Cauchy (x0 = Gik, γ = αa[k]Δti +αb[k]). Also, the fact that Fi is a matrix of i dimensions should be mentioned. Does F no longer represent the factors' chemical composition? More information than that given in lines 71-75 is needed.
- Line 77: “The term … determines…”. Please explain explicitly the mathematical and environmental meaning of this term. Does it mean that the time lag makes Gi+1,k less likely to be equal to Gi,k? Also, add information on the configuration and meaning of the alpha vectors.
- Line 90: Equation (5). I don’t think this can be understood with the information provided. Do you mean that you constrain each j and l ∀ j, l? Also, if I understand the intention of this equation, the ratio between two ions' intensities should be written as “F[i,j]/F[i,l]”. Otherwise, you are expressing that you fix a ratio between m/zs for whichever times i, k, which would make no sense. However, it does make sense to constrain two ion intensities for all time units i. Moreover, in this fashion, you avoid using the k index for time, since for us PMF users it is inherently related to the number of factors.
- Lines 90-99: You should specify which ions and from which reference profiles you are using here in the methods section. Please, refer to a table for indicating those.
- Line 99: The SoFi PMF has a criteria selection to filter out bad values, hence environmental criteria to accept/discard solutions affects the final a-value, whose presented value will be the mean value of all the accepted runs. Hence, the appreciation you make of “all solutions within the boundary to be of equal quality” is not fair since only those solutions which make environmental sense are kept. Please, disregard this comment if I did not understand what you were intending.
- Lines 109-112: The reader might not understand what you refer to with “MAP point solution”, please describe a bit more this part.
- Lines 116-119: Why do you normalise only to de-normalise again afterwards? Can you please explain the motivation for this step?
- Line 132: You have to explain how you apply the Hungarian algorithm and Manhattan distances here, so that the reader understands (at least) how they work specifically for sorting factors.
- Lines 145-149: Why use this instead of scaled residuals? I know the meaning is analogous, but it surprised me that you computed those metrics instead of the commonly-used scaled residuals.
- Lines 208-211: Please explain a bit more about the construction of G: why do you model it through random walks? How did you tweak the diel patterns through this kind of modelling? Moreover, I don’t understand the sentence “the added diurnal peaks can be seen as the periodicity of the tail”. Also, why does BBOA use a Gaussian while the other factors use Cauchy?
- Table 1: Please justify why you used r for G and ρ for profiles.
- Line 284: I would move these two graphs (Figs. 7 and 8) to supplementary information or annexe. These are not your final solutions but tests, which can conflict with the clarity of the solution presentation.
- Line 289: Please describe the composition and time features of this “unnecessary factor”. Discuss if it is more advisable in this respect to extract a “noise/unidentified” profile as the BAMF does or to split two factors as the BAMF-0 and the PMF do.
- Lines 317-318: “Partially constrained… “. Discuss which recommendation would you make in a ranking fashion: first the constrained PMF, but afterwards, the constrained PMF or the partially-constrained BAMF?
- Figure 9 or 10: I would like to see a comparison of unconstrained BAMF vs. constrained PMF to assess the power of BAMF unconstrained vs. the best version of PMF, which is constraining it.
- Line 324: Again, these two first sentences are very strong and unsupported. Please add citations or reformulate the argument.
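The factor-sorting step queried in the comment on Line 132 can be sketched as an assignment problem: each estimated profile is paired with one reference profile so that the summed Manhattan (L1) distance is minimal. The sketch below is an illustration under stated assumptions, not the authors' implementation. It finds the optimal assignment by brute force over permutations, which gives the same result as the Hungarian algorithm for a handful of factors; `scipy.optimize.linear_sum_assignment` solves the same problem in polynomial time. The profiles and the function names `manhattan` and `match_factors` are hypothetical.

```python
from itertools import permutations

def manhattan(a, b):
    """Manhattan (L1) distance between two profiles of equal length."""
    return sum(abs(x - y) for x, y in zip(a, b))

def match_factors(reference, estimated):
    """Return order such that estimated[order[i]] is paired with reference[i].

    Minimises the total Manhattan distance over all one-to-one pairings.
    Brute force: practical only for a small number of factors.
    """
    n = len(reference)
    cost = [[manhattan(r, e) for e in estimated] for r in reference]
    best = min(
        permutations(range(n)),
        key=lambda p: sum(cost[i][p[i]] for i in range(n)),
    )
    return list(best)

# Hypothetical three-channel profiles; the estimated factors come out shuffled.
reference = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.2, 0.2, 0.6]]
estimated = [[0.15, 0.25, 0.6], [0.65, 0.25, 0.1], [0.1, 0.75, 0.15]]

order = match_factors(reference, estimated)
sorted_estimates = [estimated[j] for j in order]
print(order)
```

Once the factors are sorted this way, per-factor metrics such as the correlations in Table 1 compare like with like across models.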
Technical corrections
- Line 41: “Particularly relevant for this study is that…”. Please rephrase to something similar: “XX is particularly relevant for this study…”.
- Line 49: Write the definition of j in the same way as the i definition, otherwise seems as if they were not analogous. It should be written as j ∈ [m]={1, …, m}.
- Line 52-53: Why do you write Fi with a point? What does it mean? Is it completely necessary for the definition?
- Line 59: I believe the notation in PMF for the matrix error is called “uncertainty” rather than “error”. Error would refer to the error matrix. Please, modify.
- Line 61: Please, define what you imply with the term “latent variables”.
- Line 83: “… without the lag-1 auto-correlation…”. Could you rephrase? Do you mean that the Gi+1,k is not computed using the Cauchy distribution or that something in expression (4) is different for the BAMF-0?
- Line 88: Indicate the nature of this penalty and how is it applied.
- Line 98: This is not a scenario but a practice or a methodology. Please rewrite.
- Line 101: Define the acronym STAN. Also, if not stated above, describe slightly the workflow of the algorithm/process.
- Beware that “reconstruction performance” (Line 140) is not in italics but these tests are in italics in line 56. Please homogenise.
- Line 169: Please describe what you mean by “sub-optimal”.
- In lines 186 and 207 you should be explicit about which anchor you used for every factor.
- Lines 205-206: “the modelled sources are traffic exhaust: HOA, cooking: COA, biomass burning: BBOA, coal combustion: CCOA, and secondary OA: OOA”. The punctuation here is not used properly. Please, rephrase this sentence.
- Line 235: I believe this section should be called “Results and Discussion”. The labelling “Experiments” could be implying a methodological explanation of them, which should not be the case.
- All graphs including profiles (Figs. 2, 4, 6, 7, 8, 9, 10) present an arbitrary x-axis ending in two, which does not look standardised. Please, include 12, but afterwards, use tenths’ ticks.
- Tables 1, 2: Captions are to be placed above the table not below.
- Line 310: State explicitly which anchors you used in each factor to constrain the ion ratios.
References
Chazeau, B., El Haddad, I., Canonaco, F., Temime-Roussel, B., d'Anna, B., Gille, G., ... & Marchand, N. (2022). Organic aerosol source apportionment by using rolling positive matrix factorization: Application to a Mediterranean coastal city. Atmospheric environment: X, 14, 100176.
Heikkinen, L., Äijälä, M., Daellenbach, K. R., Chen, G., Garmash, O., Aliaga, D., ... & Ehn, M. (2021). Eight years of sub-micrometre organic aerosol composition data from the boreal forest characterized using a machine-learning approach. Atmospheric Chemistry and Physics, 21(13), 10081-10109.
Ulbrich, I. M., Canagaratna, M. R., Zhang, Q., Worsnop, D. R., & Jimenez, J. L. (2008). Interpretation of organic components from positive matrix factorization of aerosol mass spectrometric data. Atmospheric Chemistry and Physics Discussions, 8(2), 6729-6791.
Citation: https://doi.org/10.5194/amt-2023-70-RC1
- AC1: 'Reply on RC1', Anton Rusanen, 31 Oct 2023
- RC2: 'Comment on amt-2023-70', Anonymous Referee #2, 28 Jul 2023
The manuscript addresses source apportionment using a Bayesian statistical approach. This is a departure from standard source apportionment techniques and the approach has some conceptual merit in that it reduces reliance on measurement uncertainty matrix inputs and therefore facilitates the inclusion of additional parameters. The generation of a test dataset and the analysis using established positive matrix factorization and the novel Bayesian approach is helpful in moving the science forward in this area.
For the wider application of this approach, it would be useful to understand what the computational speed is when compared to PMF.
For PMF, alternative factor solutions below and above that chosen should be reported and discussed, at least in the SI.
Fig 3 and associated analysis and discussion - it would be useful to have a statistical analysis of these comparisons (t-test, Kruskal-Wallis) to show whether the differences are statistically significant.
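The kind of test requested for Fig 3 can be sketched as follows. This is a minimal pure-Python Kruskal-Wallis test for exactly three groups, where the chi-square survival function with two degrees of freedom reduces to exp(-H/2). The per-run scores for the three models are hypothetical placeholders, not values from the manuscript, and the tie correction is omitted; `scipy.stats.kruskal` provides the general case.

```python
import math

def average_ranks(values):
    """1-based ranks of values, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def kruskal_wallis_3(a, b, c):
    """Kruskal-Wallis H statistic and approximate p-value for three groups.

    No tie correction is applied; the p-value uses the chi-square
    approximation with 2 degrees of freedom, which is rough for small samples.
    """
    groups = [a, b, c]
    pooled = [v for g in groups for v in g]
    n = len(pooled)
    ranks = average_ranks(pooled)
    total, start = 0.0, 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        total += rank_sum * rank_sum / len(g)
        start += len(g)
    h = 12.0 / (n * (n + 1)) * total - 3 * (n + 1)
    return h, math.exp(-h / 2)

# Hypothetical per-run scores for three models (placeholders only).
pmf = [0.71, 0.74, 0.69, 0.73]
bamf0 = [0.80, 0.78, 0.82, 0.79]
bamf = [0.90, 0.88, 0.91, 0.89]

h_stat, p_value = kruskal_wallis_3(pmf, bamf0, bamf)
print(f"H = {h_stat:.3f}, p = {p_value:.4f}")
```

A rank-based test is a reasonable default here because correlation-type scores are bounded and unlikely to be normally distributed; a significant result would still need pairwise follow-up tests to say which models differ.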
Fig 4 – there is a large over-estimation of HOA compared to the other approaches. It is not obvious where this mass is allocated in comparison and is worthy of some discussion.
Fig 5 – please keep the fig sub title (a,b,c) in the same location. The reason for the large variability in CCOA for BAMF in b needs to be discussed in more detail.
First sentence of conclusion (326) is not true.
other comments:
Line 20 – formalism – what is this? Could an alternative word be used?
Line 66 - Dirilecht is spelt incorrectly I believe
Line 166 – SoFi only finds local optima for unconstrained PMF; where an a-value is used, this is not the case.
Citation: https://doi.org/10.5194/amt-2023-70-RC2
- AC2: 'Reply on RC2', Anton Rusanen, 31 Oct 2023