The concentrations of atmospheric particulate matter and many of its constituents are temporally auto-correlated. However, this information has not been utilized in source apportionment methods. Here, we present a Bayesian matrix factorization model (BAMF) that considers the temporal auto-correlation of the components (sources) and provides a direct error estimation. The performance of BAMF is compared with positive matrix factorization (PMF) using synthetic Time-of-Flight Aerosol Chemical Speciation Monitor data, representing different urban environments from typical European towns to megacities. We find that BAMF resolves sources with overall higher factorization performance (temporal behavior and bias) than PMF on all datasets with temporally auto-correlated components. Highly correlated components continue to be challenging and ancillary information is still required to reach good factorizations. However, we demonstrate that adding even partial prior information about the chemical composition of the components to BAMF improves the factorization. Overall, BAMF-type models are promising tools for source apportionment and merit further research.

Air pollution in the form of particulate matter (PM) has a substantial impact on the earth's climate

Multiple methods for weighted non-negative matrix factorization exist

Previous studies have revealed that chemical data from the Aerosol Mass Spectrometer family (Aerodyne Aerosol Mass Spectrometer,

While developments related to source apportionment in atmospheric science have focused on different ways to pre- and post-process data

The commonly used optimization goal

The auto-correlations of the hourly means of several aerosol constituents measured at 19 different sites in Europe. The auto-correlation is the Pearson correlation coefficient between the original and the delayed time series. Auto-correlations at a lag of 1 and 2 h are very high in most cases. These data show that particulate matter constituents exhibit strong lag-1 auto-correlation, consistent with earlier reports in the literature (e.g.,
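As an illustrative sketch (not part of the analysis above), the lag-1 auto-correlation used here can be computed as the Pearson correlation between a series and its shifted copy; the function name and the toy series below are our own.

```python
import numpy as np

def lag_autocorrelation(x, lag=1):
    """Pearson correlation between a series and its lagged copy."""
    x = np.asarray(x, dtype=float)
    return np.corrcoef(x[:-lag], x[lag:])[0, 1]

# A slowly varying series (like hourly means of a PM constituent) has
# high lag-1 auto-correlation; white noise does not.
rng = np.random.default_rng(0)
smooth = np.cumsum(rng.normal(size=1000))  # random walk: strongly auto-correlated
noise = rng.normal(size=1000)              # white noise: near-zero auto-correlation
print(lag_autocorrelation(smooth, 1))      # close to 1
print(lag_autocorrelation(noise, 1))       # close to 0
```

The same function with `lag=2` reproduces the 2 h case mentioned above.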

In this section we discuss the methods used in the paper, starting with notation in Sect.

In this paper, we describe the data by

We define a Bayesian probabilistic model that captures our prior assumptions of the process that generated the measurements. The only observed variables in our model are the data matrix

We chose the Cauchy distribution for the auto-correlation term because the long tails make large jumps between the
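The effect of the heavy Cauchy tails can be seen in a small numerical comparison (our own illustration, not the BAMF implementation): most Cauchy-distributed innovations are as small as Gaussian ones, but occasional draws are orders of magnitude larger, which is what permits abrupt jumps in the component time series.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000
scale = 1.0

gauss_steps = rng.normal(0.0, scale, n)
cauchy_steps = rng.standard_cauchy(n) * scale

# Both distributions concentrate most mass near zero...
print(np.median(np.abs(gauss_steps)))   # ~0.67 for a standard normal
print(np.median(np.abs(cauchy_steps)))  # ~1.0 for a standard Cauchy
# ...but the Cauchy's heavy tails produce occasional very large steps
# that a Gaussian prior would heavily penalize.
print(np.abs(gauss_steps).max())        # typically a few scale units
print(np.abs(cauchy_steps).max())       # typically hundreds of scale units or more
```

A Gaussian auto-correlation term would force the time series to be uniformly smooth; the Cauchy term keeps typical steps small while leaving genuine jumps cheap.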

For comparison, we created a version of the BAMF model without the lag-1 auto-correlation terms of Eq. (

In source apportionment analyses, it is common to utilize reference spectra as boundary conditions for the factor analysis – to find components with, e.g., previously observed chemical compositions. We include this scenario in another model, called “BAMF-C”, by adding peak intensity ratios to the BAMF model,

We use Stan

The standard way to initialize the model in Stan is by randomly sampling from the prior distributions. However, our model has many parameters with fairly strict prior distributions. Consequently, we found this starting point to be poor, sometimes causing Stan to slow down markedly or even fail. Hence, we initialize the model with a point solution. We utilize Stan's capability to find a single maximum a posteriori (MAP) point solution for the parameters, which we use as the initialization. Note, however, that the posterior typically has several local optima, in which case the point solution is only one such local optimum.

Stan

The samples are drawn in proportion to the posterior probability of each sample. Obtaining samples from a multidimensional posterior distribution is a non-trivial task. For effective sampling, we use Hamiltonian Monte Carlo (HMC), a method where the gradient of the distribution and an ancillary variable called momentum are used to direct the chain to explore the typical set
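For intuition, a minimal one-dimensional HMC transition (leapfrog integration of the Hamiltonian dynamics followed by a Metropolis accept/reject step) can be sketched as follows. This is a textbook illustration on a standard normal target, not Stan's adaptive implementation; all names are our own.

```python
import numpy as np

def hmc_step(q, grad_log_prob, log_prob, step_size=0.2, n_leapfrog=10, rng=None):
    """One Hamiltonian Monte Carlo transition (illustrative 1-D sketch)."""
    if rng is None:
        rng = np.random.default_rng()
    p = rng.normal()                # resample the auxiliary momentum variable
    q_new, p_new = q, p
    # Leapfrog integration: half momentum step, alternating full steps,
    # closing half momentum step.
    p_new = p_new + 0.5 * step_size * grad_log_prob(q_new)
    for _ in range(n_leapfrog - 1):
        q_new = q_new + step_size * p_new
        p_new = p_new + step_size * grad_log_prob(q_new)
    q_new = q_new + step_size * p_new
    p_new = p_new + 0.5 * step_size * grad_log_prob(q_new)
    # Metropolis correction on the joint (position, momentum) energy.
    current_h = -log_prob(q) + 0.5 * p ** 2
    proposed_h = -log_prob(q_new) + 0.5 * p_new ** 2
    if np.log(rng.uniform()) < current_h - proposed_h:
        return q_new
    return q

# Target: standard normal, log p(q) = -q^2 / 2 (up to a constant).
log_prob = lambda q: -0.5 * q ** 2
grad = lambda q: -q

rng = np.random.default_rng(2)
q, samples = 0.0, []
for _ in range(5000):
    q = hmc_step(q, grad, log_prob, rng=rng)
    samples.append(q)
samples = np.array(samples)
# The empirical mean and standard deviation approach 0 and 1.
```

The gradient term is what lets HMC traverse the typical set efficiently in high dimensions, where random-walk proposals become prohibitively slow.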

The MAP estimate is used as a starting point for the sampling. It is found with an optimization method on the same probability distribution used by the sampling. We use the LBFGS
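The idea of using an L-BFGS-type optimizer to obtain a MAP point for initializing the sampler can be sketched on a toy posterior (our own example using SciPy, standing in for Stan's built-in optimizer):

```python
import numpy as np
from scipy.optimize import minimize

# Toy posterior: data y ~ Normal(mu, 1) with prior mu ~ Normal(0, 10).
rng = np.random.default_rng(3)
y = rng.normal(4.0, 1.0, size=50)

def neg_log_posterior(theta):
    mu = theta[0]
    log_lik = -0.5 * np.sum((y - mu) ** 2)      # Gaussian likelihood, sigma = 1
    log_prior = -0.5 * (mu / 10.0) ** 2         # weak Gaussian prior
    return -(log_lik + log_prior)

# L-BFGS finds a (possibly local) mode of the posterior; this point
# solution then serves as the initialization for the MCMC chains
# instead of a random draw from the prior.
result = minimize(neg_log_posterior, x0=[0.0], method="L-BFGS-B")
map_mu = result.x[0]   # close to the sample mean of y
```

With many local optima, as noted above, this yields only one reasonable starting point, not a unique solution.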

Before running our model, we normalize the data such that the mean of the data (

Stan outputs posterior samples from the two matrices

The order of the components in

To select the ordering of the components, we take a small number of representative samples, usually the last five, and compute the optimal permutation using the Hungarian algorithm
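The matching step can be illustrated with SciPy's implementation of the Hungarian algorithm (`scipy.optimize.linear_sum_assignment`); the cost matrix below (one minus the Pearson correlation between component time series) is our own illustrative choice.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_components(reference, sample):
    """Find the permutation of `sample`'s rows that best matches `reference`.

    Cost is one minus the Pearson correlation between component time
    series, so the assignment maximizes the total correlation.
    """
    k = reference.shape[0]
    corr = np.corrcoef(reference, sample)[:k, k:]  # reference-vs-sample block
    row, col = linear_sum_assignment(1.0 - corr)
    return col

# Two "posterior samples" of 3 component time series; the second one is
# a shuffled, slightly noisy copy of the first.
rng = np.random.default_rng(4)
ref = rng.random((3, 100))
shuffled = ref[[2, 0, 1]] + 0.01 * rng.normal(size=(3, 100))

order = match_components(ref, shuffled)
# shuffled[order] realigns the components with ref's ordering.
```

Applying the same permutation to every posterior sample makes component-wise medians and percentiles meaningful.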

We use this approach for sorting the outputs of all models (BAMF-0/C, BAMF, PMF) to ensure the most direct comparability of the results. Finally, the median and the 25th and 75th percentiles are computed from the sorted samples. In the comparisons, we use medians for all models, but in some figures, we also show the 25th and 75th percentiles. The median, or any other central estimate, is not guaranteed to be the "best" solution under any particular metric (posterior probability or root mean square of the residuals). Still, we use it to represent a reasonable solution inferred from the samples.

The first metric to check is if the model explains the data well (

Even at moderate data sizes, assessing whether the original data fall within the model confidence bounds for every variable individually is not practical. Therefore, we summarize this information by computing the model residual (the difference between the model input and output, averaged over all samples) normalized by the uncertainty of the model output (the standard deviation over all samples); essentially, we check whether the original data lie within one sample standard deviation of the model output.
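This normalized residual can be written compactly; the sketch below (our own naming) takes the stack of reconstructed data matrices from the posterior samples and returns the residual in units of the posterior standard deviation.

```python
import numpy as np

def normalized_residual(x, reconstructions):
    """Residual of the posterior-mean reconstruction, normalized by the
    posterior standard deviation of the reconstruction.

    `reconstructions` has shape (n_samples, *x.shape): one reconstructed
    data matrix per posterior sample.
    """
    mean = reconstructions.mean(axis=0)
    std = reconstructions.std(axis=0)
    return (x - mean) / std

# Values within roughly +/-1 indicate that a data point lies inside one
# posterior standard deviation of the model output.
rng = np.random.default_rng(5)
x = rng.random((20, 10))                                  # toy "data matrix"
recon = x + rng.normal(0.0, 0.05, size=(100, 20, 10))     # toy posterior samples
z = normalized_residual(x, recon)
print(np.mean(np.abs(z) < 1.0))  # fraction of data inside one standard deviation
```

A histogram of these values across the whole matrix summarizes reconstruction quality in a single plot.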

For synthetic data – with a known ground truth – it is possible to assess how well the methods resolve the actual components in addition to the reconstruction performance. We call this evaluation

We use PMF, specifically the multilinear engine 2 (ME-2) controlled by the user interface SoFi

A priori information in the form of known rows of factor profiles or known columns of factor time series can be added to the model to reduce the rotational ambiguity. By adding this external information, the user restricts the space in which PMF searches for the optimized solution. Running PMF with external data is usually referred to as "constraining the solution", and the external information serves as the constraints. Here we used two approaches: (a) entirely unconstrained PMF runs and (b) constrained runs using external source profiles. We rely on the commonly used

We generated synthetic datasets mimicking the OA sources in different urban environments. These synthetic datasets mimic mass spectral OA analyses of a Time-of-Flight Aerosol Chemical Speciation Monitor (ToF-ACSM,

The ToF-ACSM alternates between measuring particles and air together, called “open signal” (

First, we generated a synthetic ToF-ACSM OA mass spectral dataset mimicking a polluted megacity environment affected by multiple OA sources. The synthetic datasets used here are based on observations from Beijing, as it is a relatively well-studied environment. In our case, the modeled sources are traffic exhaust (HOA); cooking (COA); biomass burning (BBOA); coal combustion (CCOA); and secondary OA (OOA). In addition, we constructed simpler datasets with fewer factors (two factors: HOA
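The general recipe for such synthetic data can be sketched as follows. This is a hypothetical, simplified generator (all parameter values and distributional choices are our own illustration, not the actual dataset construction): fixed non-negative source profiles are combined with positive, temporally auto-correlated source time series, and measurement noise is added.

```python
import numpy as np

def make_synthetic_dataset(n_time=500, n_mz=70, n_sources=5, seed=6):
    """Hypothetical sketch: X = G @ F + noise, where the rows of F are
    normalized source mass spectra and the columns of G are positive,
    temporally auto-correlated source time series."""
    rng = np.random.default_rng(seed)
    # Source profiles: non-negative spectra, each normalized to unit sum.
    F = rng.random((n_sources, n_mz))
    F /= F.sum(axis=1, keepdims=True)
    # Time series: exponentiated AR(1) processes -> positive and smooth,
    # with strong lag-1 auto-correlation (AR coefficient 0.95).
    G = np.empty((n_time, n_sources))
    ar = rng.normal(size=n_sources)
    for t in range(n_time):
        ar = 0.95 * ar + rng.normal(0.0, 0.3, n_sources)
        G[t] = np.exp(ar)
    # Additive noise stands in for measurement uncertainty; like real
    # data, X may contain small negative values.
    X = G @ F + rng.normal(0.0, 0.01, (n_time, n_mz))
    return X, G, F

X, G, F = make_synthetic_dataset()
```

Because the ground-truth G and F are retained, factorization performance can later be evaluated directly against them.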

Characteristics of the synthetic ToF-ACSM OA datasets. Panels

As another test, we used chemical transport model data from

This dataset differs from those in Sect.

In this section we compare the factorizations from BAMF and PMF on simulated megacity data, in Sect.

In the first experiment, we assess the performance of BAMF, BAMF-0, and PMF on synthetic data mimicking the conditions in a polluted megacity described in Sect.

Figure

Reconstruction metrics for BAMF, BAMF-0, and PMF for synthetic megacity data. Panel

Data reconstruction is essential to get within the error limits. However, source apportionment aims to accurately and precisely resolve the actual components in

Illustration of

Reconstruction and factorization performance of all three models for the synthetic megacity ToF-ACSM OA dataset in Fig.

All models slightly underestimate OOA, which results in overestimating the other components (Table

Reconstruction and factorization performance for the synthetic European city ToF-ACSM OA dataset.

When considering all 10 synthetic datasets with five components mimicking a polluted megacity, BAMF consistently produces factors closer in magnitude to the truth and which correlate better with the actual factors than the other models (Fig.

Summary of factorization performance of the three models for all synthetic megacity ToF-ACSM OA datasets with five components (10 datasets): Panel

In a second exercise, we assessed the performance of BAMF, BAMF-0, and PMF on a synthetic dataset mimicking the conditions in a typical European city (Sect.

Factorization performance of all three models for the synthetic European city ToF-ACSM OA dataset. The shaded area is the interquartile range (0.25–0.75 quantile). Panel

All models show signs of mixing between the components, most likely due to the correlation of the true

For real-world source apportionment analyses, the true number of components, i.e., sources, to be resolved via matrix factorization is unknown yet crucial. Despite the importance, accurately determining and specifying the correct number of modeled components is not trivial (see, e.g.,

Factorization performance of underspecified (three components) models for the synthetic European city ToF-ACSM OA dataset. Panel

Factorization performance of overspecified (five components) models for the synthetic European city ToF-ACSM OA dataset. Panel

For the overspecified models (five instead of four components), the results differ (Fig.

As highlighted in Sect.

We tested the performance of the models when using a priori information on

The reconstruction and factorization performance of the different models are compared in Table

Factorization performance of models using a priori information on

Factorization performance of models using a priori information on

We present a Bayesian matrix factorization model that accounts for temporal auto-correlation of the components (BAMF) and provides direct error estimation. BAMF is built on top of Stan, a freely available, robust, actively developed, open-source framework for statistical modeling with the ability of full Bayesian statistical inference with MCMC sampling. Here, we characterize the BAMF performance on synthetic Time-of-Flight Aerosol Chemical Speciation Monitor mass spectral OA data compared to PMF. This approach allows us to assess the model performance based on input data reconstruction and the ability to accurately model the chemical composition and concentration time series of the components.

All models reconstructed the data well regardless of their factorization performance, indicating that reconstructing the data is insufficient for judging the quality of the extracted factors. Without strongly correlated components, BAMF resolves temporally auto-correlated components well (synthetic megacity dataset), while PMF performs considerably worse. Both BAMF and PMF are challenged by strongly correlated components (European data).

Further, we show that using a priori information on the chemical composition of the components improves BAMF factorization performance such that all components are well represented. Even adding a priori information for a few peaks significantly reduced component bias, and partially specifying the profile (for 56 % of the peaks) produced comparable results to fully constraining the profile with PMF. This opens up possibilities for using incomplete chemical composition information to improve factorizations.

While we tested BAMF on synthetic OA ToF-ACSM data in this paper, source apportionment analyses of other chemical PM data (e.g., trace elements from either Xact or offline filter analysis) could also profit from accounting for the auto-correlation of components, if the components are auto-correlated. Further testing is especially needed for datasets with temporally sparse sources, i.e., pollution sources occurring only during specific events, which are also challenging for PMF.

Overall, we believe BAMF-type models are promising tools for source apportionment and deserve further research, e.g., improving the separation of the chemical composition of components or the computational speed of BAMF. These models can also be used as a complement to current source apportionment methods owing to their different emphases and advantages. One such research topic would be introducing rolling-window methods, as has been done with PMF, to allow the source profiles to change over time and to act as a basis for real-time source apportionment. Other possible topics are using BAMF with other time-series instruments and with real-world data. Another area of development is computational speed: for the dataset sizes discussed here, running BAMF takes a few hours on a modern computer (Intel Xeon Silver 4110), but the run time increases with data size.

The error in

The bias of CCOA and BBOA

The bias of CCOA time series and composition in the five-component megacity datasets.

Comparing the true components of the datasets shows that the European dataset components are much more correlated both in

Pearson correlation and standard deviation of the

Spearman correlation of the

Pearson correlation of the

Spearman correlation of the

Figure

Concentration and uncertainty time series at selected

Figure

Results from overspecified BAMF-C model for the synthetic European city ToF-ACSM OA dataset with five modeled components instead of four and HOA and BBOA fully constrained.

Figure

Workflow of running the BAMF model in this study. Pre- and post-processing steps are technically optional but help with the convergence and interpretation of the results. With PMF, the pre-processing and denormalization steps are skipped and the modeling box is simply PMF, but the outputs are still sorted in the same way.

Table

The source profiles used to construct the datasets. The profiles were restricted to

Figure

Unconstrained BAMF and constrained PMF on the European dataset with four components.

The error bars and the shaded areas in the time series are based on the interquartile ranges (IQRs) of the empirical distribution given by the MCMC sampler. This gives an indication of how tightly the modeled concentrations and compositions are constrained. In these results the error estimation is somewhat optimistic, since it does not always cover the true solution. The underestimation is possibly due to the narrowness of the IQR and the fact that it does not account for the model-choice error. Figure

IQR compared to the median on the base case of the European dataset

The datasets are available at

AR, AB, KRD, and KP participated in the model, dataset, and experiment development. AR ran and analyzed the experiments. KRD and MIM contributed to the PMF model solutions. JJ did the transport model runs. KRD, KP and MTK supervised the work. All authors contributed to the writing of the manuscript.

The contact author has declared that none of the authors has any competing interests.

Publisher’s note: Copernicus Publications remains neutral with regard to jurisdictional claims made in the text, published maps, institutional affiliations, or any other geographical representation in this paper. While Copernicus Publications makes every effort to include appropriate place names, the final responsibility lies with the authors.

We thank the Research Council of Finland for its support. Kaspar R. Daellenbach acknowledges support by SNSF Ambizione. Jianhui Jiang acknowledges support by Science and Technology Commission of Shanghai Municipality, China.

This research has been supported by the Research Council of Finland (decision nos. 337549, 345704, and 346376). Kaspar R. Daellenbach has been supported by SNSF Ambizione (grant no. PZPGP2_201992). Jianhui Jiang has been supported by Science and Technology Commission of Shanghai Municipality, China (Shanghai Pujiang Program, grant no. 21PJ1402800). Open-access funding was provided by the Helsinki University Library.

This paper was edited by Eric C. Apel and reviewed by two anonymous referees.