Development and Application of a Supervised Pattern Recognition Algorithm for Identification of Fuel-Specific Emissions Profiles

Wildfires have increased in frequency and intensity in the western United States (U.S.) over the past decades. These trends are projected to continue, with negative consequences for air quality across the U.S. Wildfires emit large quantities of particles and gases that serve as air pollutants and their precursors, and can lead to severe air quality conditions over large spatial and long temporal scales. Characterization of the chemical constituents in smoke as a function of combustion conditions, fuel type and fuel component is an important step towards improving the prediction of air quality effects from fires and evaluating mitigation strategies. Building on the comprehensive characterization of gaseous non-methane organic compounds (NMOCs) identified in laboratory and field studies, a supervised pattern recognition algorithm was developed that successfully identified unique chemical speciation profiles among similar fuel types common in western coniferous forests. The algorithm was developed using laboratory data from single fuel species and tested on simplified synthetic fuel mixtures. The fuel types in the synthetic mixtures were differentiated, but as the relative mixing proportions became more similar, the differentiation became poorer. Using the results from the pattern recognition algorithm, a classification model based on linear discriminant analysis was trained to differentiate smoke samples based on the contribution(s) of dominant fuel type(s). The classification model was applied to field data and, despite the complexity of contributing fuels and the presence of fuels "unknown" to the classifier, the dominant sources/fuel types were identified. The pattern recognition and classification algorithms are a promising approach for identifying the types of fuels contributing to smoke samples and facilitating selection of appropriate chemical speciation profiles for predictive air quality modeling, using a highly reduced suite of measured NMOCs.
Utility and performance of the pattern recognition and classification algorithms can be improved by expanding the training and test sets to include data from a broader range of single and mixed fuel types.


Introduction
Research has shown that the western United States (U.S.) has seen an increase in the frequency and intensity of wildfires over the last three decades (Jaffe et al. (2020), Miller et al. (2009) and Dennison et al. (2014)), which is projected to continue (Westerling et al. (2006), Miller et al. (2009) and Dennison et al. (2014)). One of the consequences of wildfires is extremely poor air quality (McMeeking et al. (2005), McKenzie et al. (2006), Park et al. (2006), and Hu et al. (2018)). Emissions from wildfires include carbon monoxide (CO), carbon dioxide (CO2), and methane (CH4); several hundreds of gas-phase non-methane organic compounds (NMOCs); and particulate matter (PM). While CO2 and CH4 are important greenhouse gases, NMOCs are of particular importance in the context of air quality because they serve as precursors to secondary air pollutants including photochemical ozone (O3) and secondary organic aerosol (SOA) (Andreae et al. (1988), Ward and Hardy (1991), Alvarado and Prinn (2009)). The latter, SOA, is a major constituent of atmospheric PM (Zhang et al. (2007)). In order to predict the air quality impacts of wildfires, differences in emissions and their effects on chemistry and pollutant formation must be represented in models (Kochanski et al. (2015), Pavlovic et al. (2016), Chen et al. (2019), Prichard et al. (2019), Jaffe et al. (2020)). Wildfire emissions are dependent on a number of factors such as combustion conditions (e.g., flaming vs. smoldering), fuel conditions (e.g., moisture content) and fuel type (e.g., species and component) (Goode et al. (2000), Urbanski (2013), Liu et al. (2017), Stockwell et al. (2014), Stockwell et al. (2015), Koss et al. (2018), Sekimoto et al. (2018), Hatch et al. (2019), Prichard et al. (2020)). Differences in these factors can affect the total amount of emissions as well as the profile of emissions, i.e., the identities and quantities of individual chemical species.
Permar et al. (2021) recently reported that combustion conditions, specifically modified combustion efficiency (MCE), explained approximately 70% of the variability in observed trace gas emissions from wildfires. Consistent with some existing modeling approaches, they suggested total NMOCs could be predicted using MCE, and the contribution of individual compounds determined using speciation profiles. Success of that approach requires knowledge of the relevant speciation profiles, and therefore contributing fuel types. NMOC speciation profiles have been developed from both field and laboratory studies (Urbanski et al. (2008), Simpson et al. (2011), Urbanski (2014), Holder et al. (2017), Andreae (2019), and Prichard et al. (2020)). Laboratory studies offer some advantages over field studies in the context of controlling fuel species and fuel components; other variables, such as combustion conditions and fuel moisture, can be harder to control and can lead to differences in the identities and quantities of NMOCs emitted between laboratory and field studies (Yokelson et al. (2013), Stockwell et al. (2014), Liu et al. (2017), Sekimoto et al. (2018)). Yokelson et al. (2013) presented an inter-comparison of laboratory- and field-based emission factors (EFs), and approaches for using laboratory data to enhance the fundamental understanding of fire emissions coupled with field data to evaluate the representativeness of laboratory-based measurements. At that time, they noted that up to 70% of NMOCs remained unidentified for certain fuel types. More recently, due to the application of advanced instrumental techniques, there have been significant improvements in the identification and quantification of NMOCs emitted from fires, particularly in laboratory studies (Stockwell et al. (2014), Stockwell et al. (2015), Hatch et al. (2017), Koss et al. (2018)). For example, Stockwell et al.
(2015) detected approximately 80-96% of the total emitted NMOC mass in experiments during the 2012 fourth Fire Lab at Missoula Experiment (FLAME-4); and Hatch et al. (2019) identified more than 500 individual NMOCs during FLAME-4. The relatively rapid expansion in available NMOC data provides opportunities for developing more detailed speciation profiles (in which a higher fraction of the detected mass is assigned to unique compounds or formulas) and for applying statistical data analysis methods, facilitating the identification of unique sets of compounds that allow differentiation of fuel type(s) and estimation of their contributions to smoke samples.
Existing approaches for identifying the contribution of fuel types to smoke include land cover databases or fuel loading models coupled with fuel consumption models (e.g., FOFEM, Keane and Lutes (2018); and CONSUME, Ottmar (2009)), and the use of marker compounds. One of the limitations of land cover databases or fuel loading models is that they are difficult to update frequently enough to reflect changes in ecosystems (Reeves et al. (2009), Vogelmann et al. (2011), Nelson et al. (2013) and Lindaas et al. (2021)). Marker compounds are emitted in relatively high abundances and can be used to differentiate fuels by component or fuel layer and in some cases by species. For example, Wan et al. (2019) showed that p-hydroxybenzoic acid was emitted from combustion of herbaceous plants, while vanillic acid was emitted from combustion of softwoods and hardwoods. It has also been shown that syringic acid is associated with hardwood combustion (Simoneit (2002) and Zangrando et al. (2013)), and dehydroabietic acid with conifers (Fu et al. (2009)). Zhang et al. (2021) found that the benzene to toluene ratio in smoke from sugarcane leaves was different from the ratio in smoke from sesame stalk, demonstrating differences among agricultural fuels. In measurements of western forests and shrublands, Jen et al. (2018) showed that hydroquinone was a good marker for manzanita combustion. One of the limitations of using marker compounds to identify fuel types is the lack of specificity, i.e., marker compounds have not been identified that enable identification of a large number of fuel species or closely related fuel species.
In this work a method is presented for identifying fuel types from measured NMOCs in smoke samples. To overcome some of the existing limitations in identifying the contribution of specific fuel types to smoke, pattern recognition (PR) and classification algorithms were developed using data obtained during two laboratory campaigns in 2012 and 2016, and applied to data obtained during a field study in 2017. Machine learning techniques have been applied for source identification in other disciplines. For example, Welke et al. (2013) and Ziółkowska et al. (2016) used principal component analysis (PCA) and linear discriminant analysis (LDA) to differentiate and classify wine varietals based on specific compounds present in wine samples. Johnson and Synovec (2002) used PCA and analysis of variance to select marker compounds in gasoline fuel blends and PR to differentiate the blends. In this work, the large data sets generated during FLAME-4 and the Fire Influence on Regional to Global Environments Experiment (FIREX) 2016 Fire Lab campaigns were leveraged to develop a source identification method using fuel-specific NMOC profiles. The PR algorithm performs an automated selection of compounds that differentiate sources (in this case, fuels) based on measured NMOCs. The classification algorithm then uses the source profiles to identify source contributions to specific samples. The data used to train and test the algorithm are introduced in section 2. The algorithm development, implementation and testing are presented in sections 2 and 3. The application to field data is presented in section 3, and general conclusions and implications are presented in section 4.

Data and Methods

Data
The NMOC data used in this study were acquired from a variety of fuel types burned in laboratory and field settings during three campaigns: 1) FLAME-4 laboratory campaign in 2012 (FLAME-4 FL12), 2) FIREX laboratory campaign in 2016 (FIREX FL16), and 3) Blodgett Forest Research Station (BFRS) prescribed burns in 2017; both laboratory campaigns took place at the U.S. Forest Service Fire Science Laboratory (FSL). Details of the facilities, sample collection and data analysis have been discussed in previous publications (Stockwell et al. (2014), Hatch et al. (2015), Hatch et al. (2019), Selimovic et al. (2018)).
Briefly, during FLAME-4 FL12 and FIREX FL16, a broad variety of biomass fuels were burned (Stockwell et al. (2014), Selimovic et al. (2018)), including conifers and shrubs (Table 1); 80 samples were collected from both room and stack burns as described in Stockwell et al. (2014) and Selimovic et al. (2018). During the BFRS study, a total of 28 samples (Hatch et al. (2019)) were collected from a utility task vehicle parked downwind from three separate prescribed burn plots that had different fuel distributions (see Table 1 and Supplementary Information (SI) Figs. S2-S4 in Hatch et al. (2019)). All NMOC samples were collected using dual bed stainless steel sorbent tubes and were analyzed using an automated thermal desorption unit coupled to a two dimensional gas chromatograph with a time-of-flight mass spectrometer (GC × GC-TOFMS). The raw chromatograms were processed using the commercially available software Chromatof (Leco Corp., St. Joseph, MI). The measured mixing ratios were used to calculate normalized excess mixing ratios (NEMRs) versus CO, ∆X/∆CO (Yokelson et al. (1999)), in which ∆ represents excess over background. The calculated NEMRs of monoterpenoids (C10H16 and C10H16O) were used as the starting point for this analysis based on Hatch et al. (2018) and Hatch et al. (2019), and the emission profile analysis presented in the SI section 1. Hatch et al. (2019) demonstrated that the variability in NMOC composition could not be attributed entirely to MCE, and that chemical speciation was highly correlated among some fuel types across a range of MCE values, particularly conifers; within conifers, clear differences in monoterpenoid emissions were observed as a function of fuel species.
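The NEMR definition above (∆X/∆CO, with ∆ denoting excess over background) can be illustrated with a minimal sketch. The function name `nemr` and all numerical values are hypothetical and chosen only for illustration; they are not measurements from any of the campaigns.

```python
import numpy as np

def nemr(x_sample, x_background, co_sample, co_background):
    """Normalized excess mixing ratio of compound X versus CO.

    Excess = sample mixing ratio minus background mixing ratio.
    """
    delta_x = np.asarray(x_sample, dtype=float) - x_background
    delta_co = np.asarray(co_sample, dtype=float) - co_background
    return delta_x / delta_co

# Hypothetical mixing ratios (ppb) of one compound and CO in three smoke samples
example = nemr([2.5, 4.0, 1.0], 0.5, [1200.0, 2100.0, 600.0], 100.0)
```

Because both the numerator and denominator are background-corrected, a sample at background levels yields an NEMR near zero (or a negative value, which the preprocessing step treats as missing).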

Pattern recognition algorithm
A four-step PR algorithm (Fig. 1) was developed to select a subset of compounds that captured the variance between fuel types and then use the selected compounds to differentiate fuel types based on NMOC speciation profiles; the algorithm steps are: 1) data preprocessing, 2) analysis of variance (ANOVA), 3) principal component analysis (PCA) and 4) k-means clustering.

The algorithm was implemented using the Python package scikit-learn (Pedregosa et al. (2011)). The algorithm components are explained in sections 2.2.1-2.2.4. Implementation of the algorithm is explained in section 2.3.

Preprocessing and analysis of variance (ANOVA)
Data preprocessing (step 1) is performed to handle any missing values in the samples. Approaches for handling missing values are specific to the type(s) of data and the reason(s) for missing values (Dong and Peng (2013); McNeish (2017)). In this data set, missing values largely were a result of compounds being below the detection limit or having negative values after background correction. During preprocessing, for every feature (i.e., compound) the percentage by number of missing values across all samples was calculated. For any given compound, if the percentage of missing values was less than 30% then the missing values were replaced with zeros. If the percentage was more than 30% then the compound was removed from the data set.
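The compound-level filter described above can be sketched as follows. This is a minimal illustration, assuming the NEMRs are held in a pandas table with one row per sample and one column per compound, and that missing values are already encoded as NaN; the function name `preprocess`, the column names and the values are hypothetical. The text does not say how a compound with exactly 30% missing values is handled, so removing it here is an assumption.

```python
import numpy as np
import pandas as pd

def preprocess(nemr_table, max_missing=0.30):
    """Drop compounds with too many missing values; zero-fill the rest."""
    # Fraction of samples in which each compound (column) is missing
    missing_fraction = nemr_table.isna().mean(axis=0)
    # Keep compounds below the threshold (exactly 30% is removed: an assumption)
    kept = nemr_table.loc[:, missing_fraction < max_missing]
    # Replace the remaining missing values with zeros
    return kept.fillna(0.0)

# Hypothetical NEMR table: four samples, three compounds
table = pd.DataFrame({
    "compound_a": [0.8, np.nan, 1.1, 0.9],     # 25% missing: kept, gap set to 0
    "compound_b": [np.nan, np.nan, 0.2, 0.3],  # 50% missing: removed
    "compound_c": [2.0, 2.1, 1.9, 2.2],        # complete: kept unchanged
})
clean = preprocess(table)
```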
The 30% threshold is supported by published statistical methods guides, including Dong and Peng (2013) and Jakobsen et al. (2017), that suggested a threshold range of 10%-40% prior to replacement. Samples also were evaluated and filtered for missingness, using two criteria. For criterion one, only fuel types that had more than 30% (by number) of the 93 monoterpenoids above background levels were selected. Since the PR algorithm was based on monoterpenoids, samples with few to no detected monoterpenoids would reduce the ability of the algorithm to differentiate between fuel types and therefore reduce the overall efficiency. For criterion two, only fuel types that had three or more samples were retained.

ANOVA (step 2) was used to further filter the compounds retained in step 1. In this application, each detected compound was treated as an ANOVA-type problem with N samples in k classes (fuel types) to determine whether or not a compound could separate the different fuel types. For each compound, the ratio of class-to-class variance to within-class variance was calculated using Eq. 1, also known as the Fisher ratio (F-ratio):

$$F = \frac{V_b}{V_w} \qquad (1)$$

where the numerator ($V_b$) corresponds to the between-class sum of squares and the denominator ($V_w$) to the within-class sum of squares. The magnitude of the F-ratio is an indication of class separation. Following the F-ratio calculation the compounds were ranked in ascending order based on their F-ratio values. Further details regarding the F-ratio calculation can be found in SI section 2.

Principal component analysis (PCA)

PCA (step 3), as described in Abdi and Williams (2010), is a dimensionality reduction technique that is used to project high dimensional data into a lower dimensional space along the direction(s) of maximum variance in the data.
In this study PCA was used to compress the information carried by the selected compounds into a lower dimensional space. In PCA the original variables are transformed into a new coordinate system. The new coordinate system is formed using the calculated principal components (PCs), which serve as the new directions/axes. Each PC is a linear combination of the original variables and weights (also called loadings), as shown in Eq. 2:

$$PC_j = w_{j1} x_1 + w_{j2} x_2 + \dots + w_{jp} x_p \qquad (2)$$

where $x_1, \dots, x_p$ are the original variables and $w_{j1}, \dots, w_{jp}$ are the loadings of $PC_j$. PCA is sensitive to differences in the scale of the original variables (… (2017)). For this reason standardization of the data might be necessary prior to PCA, in which the data are mean centered and each variable is divided by the standard deviation of that variable. While standardization can help alleviate the scaling problem, it should not be a default practice as it can magnify the effect of outliers in the data (Gewers et al. (2021)). If the variability of a feature is a consequence of intrinsic variability in the analytical method (e.g., experimental error or noise in the data), then standardization may erroneously emphasize that in the PCA results. In such cases, either the noise should be reduced by some means or standardization should be avoided. In this study, the selection of PCA for dimensionality reduction was based on three previous studies that included PCA in pattern recognition analysis of chromatographic data (Welke et al. (2013), Johnson and Synovec (2002), Ziółkowska et al. (2016)).

k-means clustering

k-means (step 4) is a popular clustering algorithm that finds clusters in an n-dimensional space (Jolliffe (2002)). Given a set of observations $(x_1, x_2, x_3, \dots, x_n)$ where each observation is a d-dimensional real vector, k-means tries to partition the n observations into k ≤ n sets $S = \{S_1, S_2, S_3, \dots, S_k\}$. Mathematically, k-means clustering minimizes within-cluster variances, or squared Euclidean distances (Jain (2010)), as shown in Eq. 3:

$$\underset{S}{\arg\min} \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^2 \qquad (3)$$

where $\mu_i$ is the mean of points in $S_i$. In this study Elkan's algorithm (Elkan (2003)) was used to solve Eq. 3. k-means clustering was used to find formed clusters after the application of PCA. The inputs for the clustering analysis were the retained PCs. k-means was chosen over other clustering algorithms because of its simplicity and the absence of highly anisotropically distributed clusters (irregular shapes); unequal variance among the clusters and unevenly sized clusters can cause problems with k-means (Jain (2010)).

[…] the maximum number of usable components. The default method used in this study was the explained variance but all three approaches were evaluated and provided similar results.

k-means requires that the target number of clusters is provided in advance. This is challenging when the number of clusters is not known. In this study, the elbow plot method was used to determine the number of clusters (Fig. 5). The elbow plot method requires running k-means multiple times using a different number of clusters each time. For each run the total within sum of squares (TWSS) was calculated according to Eq. 4:

$$\mathrm{TWSS} = \sum_{j=1}^{k_{max}} \sum_{i \in S_j} \lVert c_i - q_j \rVert^2 \qquad (4)$$

where $q_j$ and $c_i$ are multidimensional vectors with the coordinates for centroid j and sample i, respectively; $S_j$ is the set of samples assigned to cluster j (as in Eq. 3); n is the total number of samples, each of which belongs to exactly one $S_j$; and $k_{max}$ is the total number of pre-selected clusters. TWSS is a measure of variability for the observations within a cluster. The smaller the TWSS for a given number of clusters, the better the clustering. TWSS is plotted against the number of clusters. As with the scree test, the inflection point in the plot signifies the optimum number of clusters.

Running the pattern recognition algorithm

Implementation of the PR algorithm proceeds through five steps. In step one, the compounds in the data set are preprocessed to replace missing values and discard samples that might be problematic (section 2.2.1). In step two, for each retained compound the F-ratio is calculated.
In step three, in an iterative fashion, PCA and k-means clustering are performed on the m highest ranked compounds and the separation of the classes in the samples (fuel types) is evaluated as a function of the number of compounds and the minimum number of PCs that achieve the 80% variance threshold. In step four, if the separation is not adequate then the number of compounds is increased or decreased and the run is repeated. In step five, once the class separation no longer improves or starts degrading with the addition or removal of compounds (step 4), more PCs are retained and different combinations of PCs are tested. While increasing the number of components will lead to better separation of the fuel types, it can also lead to overfitting. In this study the algorithm was optimized for sample separation as a function of the compounds selected and pairing of the PCs. The effects of overfitting were not considered in the optimization due to limitations in the sample size that prevented the use of known evaluation methods.
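Step two of the procedure above (F-ratio calculation and ranking) can be sketched as follows. This is an illustration under stated assumptions rather than the implementation used in the study: the F-ratio is taken as the plain ratio of the between-class to the within-class sum of squares, without degrees-of-freedom scaling (the exact form is given in SI section 2), compounds are sorted from highest to lowest F-ratio so that the first m entries are the "m highest ranked" compounds, and the function names, compound names and data are hypothetical.

```python
import numpy as np

def f_ratios(X, labels):
    """Between-class over within-class sum of squares for each compound (column)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    grand_mean = X.mean(axis=0)
    v_between = np.zeros(X.shape[1])
    v_within = np.zeros(X.shape[1])
    for cls in np.unique(labels):
        members = X[labels == cls]
        class_mean = members.mean(axis=0)
        v_between += len(members) * (class_mean - grand_mean) ** 2
        v_within += ((members - class_mean) ** 2).sum(axis=0)
    # Note: a compound with zero within-class scatter would divide by zero here
    return v_between / v_within

def top_compounds(X, labels, names, m):
    """Names of the m compounds with the largest F-ratios."""
    order = np.argsort(f_ratios(X, labels))[::-1][:m]
    return [names[i] for i in order]

# Hypothetical data: nine samples, three fuel types, two compounds
X = np.array([[1.0, 5.0], [1.2, 1.0], [0.8, 3.0],
              [5.0, 2.0], [5.2, 4.0], [4.8, 3.0],
              [9.0, 3.0], [9.2, 5.0], [8.8, 1.0]])
labels = ["fir"] * 3 + ["pine"] * 3 + ["spruce"] * 3
scores = f_ratios(X, labels)
selected = top_compounds(X, labels, ["compound_a", "compound_b"], m=1)
```

In this toy data set the first compound differs strongly between fuel types and little within them, so it receives a large F-ratio, whereas the second has identical class means and an F-ratio of zero. Scaling both sums of squares by their degrees of freedom would multiply every F-ratio by the same constant when all compounds share the same samples, so the ranking would be unchanged.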

Classification
To test the applicability of the PR algorithm results for field samples, a classification algorithm, LDA (Hastie et al. (2009)), was applied (see SI section 5 for implementation details). LDA is a supervised learning method that is similar to PCA. Both LDA and PCA are linear transformation techniques; LDA is supervised, whereas PCA is unsupervised and ignores class labels.
While PCA tries to find a subspace of features in order to maximize variance among samples, LDA attempts to find a feature subspace that maximizes class separability. The inputs for the LDA training were the selected PCs from the PR analysis (independent variables) and the fuel types (response variable/class) (see section 4 in SI). The output of LDA is a probability score for every sample that is being tested for its likelihood to belong to a particular fuel type, calculated as follows (Eq. 5):

$$\log P(y = k \mid x) = -\frac{1}{2}(x - \mu_k)^{T}\,\Sigma^{-1}(x - \mu_k) + \log P(y = k) + C_{st} \qquad (5)$$

where k is the class of sample x, $\mu_k$ is the vector of the means for class k based on the selected features, $\Sigma$ is the common covariance matrix for the three classes in the training set, $P(y = k)$ is the prior probability of class k, and $C_{st}$ is a term that contains constants from the multivariate Gaussian distribution. LDA was chosen as the classification method in this study because of its closed-form solution that does not require any hyperparameter tuning. Methods that require hyperparameter tuning (e.g., k-Nearest Neighbours) are not appropriate for this application because: 1) the training data set included laboratory data only while the test set included field data and 2) the field data were not labeled.

Results and Discussion

Sample and fuel type selection for pattern recognition and classification

The PR algorithm was applied to the FIREX FL16 data set to identify a group of marker compounds that could be used to differentiate fuel types. Classification was then performed using the FIREX FL16 data as the training set, and BFRS data as the testing set. The selection of the training and testing sets was based on the size of each data set; the FIREX FL16 data […] to test the response of the classification algorithm to fuel types that were not included in the training set. The use of each data set in the PR and classification algorithms is summarized in Table 2.
Two selection criteria were applied to the training set to ensure that standard deviations and averages could be computed, which are central features of the PR algorithm: fuel types were required to have more than 30% (by number) of the 93 monoterpenoids above background levels, and three or more samples (section 2.2.1). Before applying the PR algorithm the data were processed as described in section 2.2.1. Application of the preprocessing criteria reduced the number of samples from a total of 74 to 39 and the number of fuel species from 18 to five: pines (ponderosa pine and lodgepole pine), firs (Douglas fir and subalpine fir) and spruce (Engelmann spruce). […]

Feature selection was performed and evaluated using two approaches: 1) manual selection, where the compounds were filtered based on a single criterion, whether a compound was present in more than three fuel species; and 2) automated selection using the PR algorithm (section 2.3), based on Fisher ratios (F-ratios) calculated for every compound in the data set using Eq. 1. The percentage of variance explained using the first two PCs was used as the metric to evaluate the quality of feature selection using the two approaches (see scree plot method section 3.2.2) following application of PCA (Fig. 2). The selection approaches were compared using the metrics introduced in section 2.2.4 and by visualizing the emission profiles of the selected compounds.
The manual approach resulted in the selection of the following nine (out of 93) compounds: a-pinene, limonene, 3-carene, b-myrcene, camphene, p-cymene, bornyl acetate, b-phellandrene and tricyclene. The automated approach resulted in selection of the following five compounds: tricyclene, camphene, b-pinene, 3-carene and bornyl acetate. Figure 2 shows the improved performance of automated feature selection over manual feature selection, based on the single highest explained variance across PCs 1-5. The PR algorithm was run again such that the number of selected compounds (nine) was the same between the manual and automated selection methods. The cumulative explained variance plots for manual and automated selection of compounds are shown in Fig. 2. Given the 80% threshold, three components were required with manual and automated selection of nine compounds, and two components with automated selection of five compounds.

To make the feature selection results more intuitive, the normalized emission ratio profiles (ratio of the compound ER to the sum of ERs for the selected compounds) as a function of fuel species are shown for manual selection (Fig. 3), and automated selection of five (Fig. 4) and nine (Fig. S5) compounds. Emerging patterns can be seen in the resulting profiles between and within the fuel types. The emission profiles from the automated selection, for both five and nine compounds, provide more distinct profiles between fuel types, and more consistent profiles within types, than the profiles from manual selection. For example, with manual selection (Fig. 3) the normalized emission ratio profiles show that the relative abundances of alpha-pinene (black) and d-limonene (dark brown) are more similar between ponderosa pine and Engelmann spruce than they are for the two pines, and the two pine species have dissimilar relative abundances of 3-carene (yellow) and b-phellandrene (tan).
However, these differences within pines and similarities between ponderosa pine and spruce disappear with the automated selection, particularly with five compounds. The consistency of profiles within the fuel types is important because it allows for fuel separation and classification when new samples are provided. Given the more effective dimensionality reduction of the automated feature selection, as well as the more consistent emission profiles, the manual feature selection will not be further discussed. The automated selection with five compounds results in more distinct and consistent profiles for each fuel family, which translates to a higher potential for separation (greater explained variance) in the PCA space. The five compounds selected with the automated approach were thus used for the PR analysis.

PCA and k-means clustering
Following data preprocessing and feature selection, PCA was performed on the reduced data set (five compounds). To determine the number of PCs to be retained, a scree test using a modified version of the Kaiser criterion (Jolliffe (2002)) was performed. In the […] where $I_j$ corresponds to the eigenvalue for $PC_j$. Figure 2 shows that with automated feature selection two PCs (PC1 and PC2) were adequate to explain 92% of the variance in the data set. The scores from these retained PCs (PC1 and PC2) were then used as input for k-means clustering. The number of clusters was determined using the elbow plot method (section 2.2.4), in which the TWSS (Eq. 4) is plotted against the number of clusters; the values for each sample are the respective scores from the selected PCs. The optimum number of clusters is found after a steep decrease in the TWSS, followed by trivial changes, with increasing number of clusters. A steep decrease signifies good clustering performance (dense clusters), while a trivial change in the TWSS shows that the cluster centroids do not change substantially from their previous positions. The elbow plot (Fig. 5) shows that the algorithm identified four clusters as the optimum number.

In Fig. 5 […] firs and pines; but poorer separation for spruce, for which four of ten samples overlapped with another fuel family. Adding four more compounds reduced the variance explained by PC1 and PC2 from 92% to less than 70% and resulted in only minor improvement in cluster separation (Fig. S6). The difficulty that the algorithm encounters separating spruce effectively can be explained using the elbow plot (Fig. 5). The k-means algorithm identified four clusters as the optimum number, but the steep decrease in the TWSS actually occurs between one and two total clusters. The TWSS decreases further between two and four clusters but to a lesser extent (shallower slope). The lesser decrease between two and four clusters indicates that the clustering algorithm had difficulty identifying clusters in the PCA space, which is then apparent in Fig. 6. From the normalized emission ratio profiles (Fig. 4), it can be seen that the spruce and fir samples have similar normalized emission ratios for tricyclene, camphene and b-pinene. This limits the ability to fully separate spruce and firs in the PCA space.
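The PCA, explained-variance and elbow-plot steps discussed in this section can be sketched with scikit-learn as follows. This is a minimal illustration on synthetic data, not the study's data or code: the helper names `n_pcs_for_threshold` and `elbow_curve`, the cluster centers, the noise level and the random seeds are all assumptions. The fitted `inertia_` attribute of `KMeans` is the sum of squared distances of samples to their closest centroid, which is used here as the TWSS, and `algorithm="elkan"` selects Elkan's variant.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

def n_pcs_for_threshold(X, threshold=0.80):
    """Smallest number of PCs whose cumulative explained variance reaches the threshold."""
    cumulative = np.cumsum(PCA().fit(X).explained_variance_ratio_)
    return int(np.searchsorted(cumulative, threshold) + 1)

def elbow_curve(scores, k_values):
    """TWSS (k-means inertia) for each candidate number of clusters."""
    twss = {}
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=0, algorithm="elkan")
        twss[k] = model.fit(scores).inertia_
    return twss

# Synthetic stand-in for 36 samples x 5 selected compounds in three groups
rng = np.random.default_rng(0)
centers = np.array([[0, 0, 0, 0, 0], [4, 0, 1, 0, 0], [0, 4, 0, 1, 0]], dtype=float)
X = np.vstack([c + 0.3 * rng.standard_normal((12, 5)) for c in centers])

n_pcs = n_pcs_for_threshold(X)                 # PCs needed for the 80% threshold
scores = PCA(n_components=2).fit_transform(X)  # PCA mean-centers; no standardization
twss = elbow_curve(scores, range(1, 7))        # plot twss against k to find the elbow
```

Plotting `twss` against the number of clusters gives the elbow plot; for well-separated groups the curve drops steeply up to the true number of groups and flattens afterwards, whereas a gradual decline (as between two and four clusters in Fig. 5) points to overlapping groups.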

Beyond principal components one and two
Thus far, only the first two PCs (from a total of five) were used in the analysis since they explained 92% of the variance in the data set. Another 8% was shared between PCs three and four, which could potentially provide better separation for the spruce samples. After testing PCs one with three and one with four, the combination of one and four resulted in better performance of the PR algorithm. Though the optimal number of clusters for the PC1 and PC4 pair (Fig. 7) was the same as with the PC1 and PC2 pair, it was found at a lower TWSS, which is indicative of more dense clustering. Figure 8 shows the results for the PC1 and PC4 pair. Cluster one included 13 out of 16 pine samples and one overlapping fir sample.
Cluster two included 11 out of 13 fir samples and only one overlapping spruce sample. Clusters three and four included nine out of 10 spruce samples, one overlapping fir sample and three overlapping pine samples. With the PC1 and PC4 pair, spruce samples had 30% less overlap with firs (Fig. 9), with only moderate losses in the separation between spruce and pines. These […]

Mixed samples
The PR algorithm selected compounds that separated single-fuel smoke samples by the contribution of fuels categorized into three types (firs, pines and spruce). Before testing the algorithm on complex smoke samples, the algorithm was tested on the synthetic fuel mixtures described in section 3.1. Of the three 60/40 samples, only the fir/spruce synthetic mixture was clustered with the dominant fuel family (fir). The pine/spruce and pine/fir synthetic mixtures were clustered with spruce clusters one and three, respectively (Fig. 10). The clustering of the pine/spruce synthetic mixture as spruce was marginal in the PCA space and was due to the scatter of the spruce samples rather than the similarity of the synthetic mixture with spruce. The clustering of the pine/fir mixture with spruce is more intuitive after comparing the normalized ER profiles (Fig. S7), which show the similarities in the ER profiles for the pine/fir mixture and spruce. Figure 11 shows the PR results including the 90/10 synthetic mixtures.
Both samples were correctly clustered with their respective dominant fuel family. The synthetic mixture results suggest that the algorithm can select marker compounds that can differentiate fuel types even when they are mixed in relatively even proportions (i.e., 60/40 pine/spruce and fir/spruce mixtures), but for some mixtures the differentiation might be poor (i.e., 60/40 pine/fir). Including more mixed fuel samples in the training and test sets can likely improve the separation of complex mixtures achieved using the PR algorithm.

Classification
For the classification algorithm, the scores of the selected PCs were used as input for the LDA training. PC1 and PC4 were selected since they provided better separation across the three fuel types (see section 3.2.3), while explaining 82% of the variability in the data set. After the pattern recognition, all 39 samples from the FIREX FL16 data set and the selected compounds, in the form of the retained components (PC1 and PC4), were provided to the LDA algorithm for training. As described in section 2.4, LDA provides a class membership probability (Eq. 5), i.e., the probability that a sample belongs to one of the main clusters determined by k-means (see section 3.2.3). In the LDA formulation, k is the class of sample x, µ is the vector of the means for each class based on the selected features, and Σ is the common covariance matrix for the three classes in the training set. In this application the probability score is related to the proximity of a sample to a class of samples (cluster) in the PCA space (Figs. 10 and 11), which is linked to its similarity with the emission profiles for the three fuel types (Fig. 4). The assignment of a sample to a class is based on the class with the highest probability, even if only marginally higher. For example, a sample with a pine probability score of 70% or more will most likely be inside the pine cluster. Generally, samples with probability scores of 60% and higher are most likely in the cluster space of a fuel family. Samples with a probability score of 60% and lower are more likely to be adjacent to more than one fuel family in the PCA space.

Figure 10. PCA coupled with k-means clustering results for the PC1 and PC4 pair including the 60%/40% synthetic mixtures.

Figure 11. PCA coupled with k-means clustering results for the PC1 and PC4 pair including the 90%/10% synthetic mixtures.

The classification algorithm was tested using the synthetic mixtures and FLAME-4 FL12 samples before testing using the BFRS field data. The classification results for the synthetic mixtures are shown in Fig. 12. Two of the three 60/40 synthetic mixtures, pine/spruce and fir/spruce, were classified correctly, with classification probabilities of 70% and higher for the dominant fuel family. The 60/40 pine/fir synthetic mixture was classified as spruce. Its classification is a result of its clustering in the PCA space with spruce (Fig. 10), which is directly connected to its similarity with the spruce emission profile (Fig. S7). The two 90/10 synthetic mixtures, pine/spruce and fir/spruce, were correctly classified with classification probabilities over 80% for the dominant fuel family.

The classification results for the FLAME-4 FL12 samples are shown in Fig. 12.
This data set included six fuel species (ponderosa pine, black spruce, Indonesian peat, rice straw, wiregrass and sawgrass), only one of which, ponderosa pine, was in the training set. Both ponderosa pine and black spruce samples were classified correctly (Fig. 12), with classification probabilities over 90%. The Indonesian peat, rice straw, wiregrass and sawgrass samples were classified as firs or pines (Fig. 12), with classification probabilities over 70%. The classification algorithm evaluated partial similarity against only three options (pines, firs or spruce), none of which represents the fuel types of these four fuel species. Figure 13 shows the average normalized emission ratio profiles for pines and firs, as well as Indonesian peat, rice straw, wiregrass and sawgrass. It can be seen that camphene is the only one of the five compounds that is present in the sawgrass and Indonesian peat samples, and thus these fuels were classified as firs, which also have a high relative abundance of camphene. Wiregrass and rice straw samples also include camphene, but have higher relative abundances of β-pinene and 3-carene, and thus were classified as pines, which also have higher relative abundances of these two compounds (Fig. 13). As illustrated by the application to the synthetic fuel mixtures, the performance of the classification algorithm can be improved in future work by expanding the range of fuel types and mixtures included in the training and test sets.

Figure 13. Normalized emission ratio profiles for FIREX FL16 samples: pines and firs; and FLAME-4 FL12 samples: sawgrass, wiregrass, rice straw and Indonesian peat. The relative abundances of camphene and β-pinene in sawgrass, Indonesian peat and rice straw were ≥ 0.6, but for figure clarity, the axis limits were not changed.

Fig. S2-S4). Seven different fuel species were identified in the three burned plots: white fir, incense cedar, tanoak, sugar pine, ponderosa pine, Douglas fir and California black oak.
Due to the heterogeneity of the fuels, and the influence of meteorology and sampling location, it was not possible to determine the relative contribution of each fuel species to each sample. Instead, the average overstory composition (Figs. S8-S11) was used to determine likely influences from dominant sources close to each sampling location. For plot 60, sites one and two (Fig. S8), the main influence was from firs (47%), followed by similar amounts of pines and incense cedar (25% and 27%), with no contribution from tanoak or California black oak. For plot 60, site three (Fig. S9), the main influence was from incense cedar (43%), followed by firs (34%), pines (26%) and California black oak (10%). For plot 340 (Fig. S10), the main influence was from firs (63%), followed by pines (21%), incense cedar (12%), and tanoak and California black oak (2%). Finally, for plot 400 (Fig. S11), the main influence was from firs (55%), followed by pines (26%) and incense cedar (18%). The classification algorithm classified all samples from plots 60 (Fig. 14) and 340 (Fig. 15) as fir dominant. Nine out of ten samples from plot 400 (Fig. 16) were classified as fir dominant and one as pine dominant. While spruce was absent in the burned plots, all samples (with the exception of the pine dominant sample in plot 400) had a higher classification probability for spruce than for pines.
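The class membership probabilities used throughout this section can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' script: it uses the textbook LDA form with class means µ_k and a pooled (common) covariance Σ, scores each class with xᵀΣ⁻¹µ_k − ½µ_kᵀΣ⁻¹µ_k + ln π_k, and normalizes the exponentiated scores. The class priors π_k (taken as class frequencies), the function names `fit_lda`, `lda_posterior` and `assign`, and the two-component scores are all hypothetical.

```python
import math


def fit_lda(samples, labels):
    """Estimate class means, class priors and the pooled 2x2 covariance."""
    classes = sorted(set(labels))
    n = len(samples)
    means, priors = {}, {}
    for k in classes:
        members = [s for s, lab in zip(samples, labels) if lab == k]
        means[k] = tuple(sum(c) / len(members) for c in zip(*members))
        priors[k] = len(members) / n  # assumption: priors = class frequencies
    sxx = sxy = syy = 0.0
    for (x, y), lab in zip(samples, labels):
        dx, dy = x - means[lab][0], y - means[lab][1]
        sxx += dx * dx
        sxy += dx * dy
        syy += dy * dy
    dof = n - len(classes)
    return means, priors, (sxx / dof, sxy / dof, syy / dof)


def lda_posterior(x, means, priors, cov):
    """Posterior probability of each class for a two-component sample x."""
    a, b, d = cov
    det = a * d - b * b
    inv = (d / det, -b / det, a / det)  # inverse of [[a, b], [b, d]]
    scores = {}
    for k, mu in means.items():
        w = (inv[0] * mu[0] + inv[1] * mu[1], inv[1] * mu[0] + inv[2] * mu[1])
        scores[k] = (
            x[0] * w[0] + x[1] * w[1]
            - 0.5 * (mu[0] * w[0] + mu[1] * w[1])
            + math.log(priors[k])
        )
    top = max(scores.values())  # subtract the maximum for numerical stability
    expo = {k: math.exp(s - top) for k, s in scores.items()}
    total = sum(expo.values())
    return {k: e / total for k, e in expo.items()}


def assign(posterior, threshold=0.6):
    """Highest-probability class, plus whether it reaches the 60% guideline."""
    label = max(posterior, key=posterior.get)
    return label, posterior[label] >= threshold


# Made-up scores on two retained components (illustrative only).
samples = [
    (2.0, 0.0), (2.5, 0.5), (3.0, 0.0), (2.5, -0.5),
    (-2.0, 0.0), (-2.5, 0.5), (-3.0, 0.0), (-2.5, -0.5),
    (0.0, 2.0), (0.5, 2.5), (0.0, 3.0), (-0.5, 2.5),
]
labels = ["pine"] * 4 + ["fir"] * 4 + ["spruce"] * 4

means, priors, cov = fit_lda(samples, labels)
print(assign(lda_posterior((2.4, 0.2), means, priors, cov)))
print(assign(lda_posterior((0.0, 0.0), means, priors, cov)))
```

In this toy setup a sample close to one class mean receives a probability near one for that class, whereas a sample equidistant from all three means receives roughly one third each. The latter is still assigned to a class, which mirrors the behavior seen for fuels that were not represented in the training set.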

Blodgett samples
For plot 60, a total of 11 samples were collected: five samples in sites one and two, which are fir dominant (Fig. S8), and six samples in site three, which is incense cedar dominant (Fig. S9). For sites one and two the classifier results were reasonable based on the overstory composition, but for site three the classification results were inconclusive since no emission profiles were available for cedars. It is likely that incense cedar, or the mixture of incense cedar with firs, most closely resembles the fir emission profiles of the selected compounds and thus was classified as firs. For plots 340 and 400, the classification results (probability of 63% in plot 340 and 55% in plot 400) are reasonable, with 16 out of 17 samples classified as fir dominant. The one sample that was classified as pines in plot 400 was most likely affected by ponderosa pine emissions during sampling.
The composition plots in Hatch et al. (2019) show that one of the inventoried plots next to plot 400 (sites one and two) had an average fractional overstory composition of more than 50% ponderosa pine. The elevated probability for spruce, despite its absence from all burned plots, was likely an artifact of mixed smoke from pines and firs, which was also shown in the synthetic mixed sample PR and classification results. Among the three plots, pines and firs together account for more than 70% of the overstory composition on average. Thus the contribution from both firs and pines could lead to smoke mixtures that resemble the spruce emission profiles. Tanoak and California black oak (Figs. S8-S11) account for 2%-10% of the total contributions among the three plots. Due to insufficient emission data, their contributions to the smoke samples could not be evaluated, but given their low overstory contribution it is likely that they did not influence the collected samples substantially.

The results for the BFRS data showed that the laboratory-based emission profiles selected by the PR algorithm can be applied to smoke samples collected in the field and can be used to identify dominant fuel sources even in mixed smoke samples. While the algorithm has been tested and trained on only three fuel types, widespread application can be achieved with further training and testing using a more diverse set of compounds and broader range of fuel types.
Figure 14. Classification probability by fuel class for plot 60.

Pattern recognition and classification
A supervised pattern recognition (PR) algorithm was developed and applied in this study to: 1) differentiate sources/fuel types using NMOCs measured in smoke samples and selected with an ANOVA-based feature selection method; and 2) train a classification algorithm to identify dominant sources/fuel types in smoke samples based on the unique speciation profiles identified by the PR algorithm. The PR algorithm was able to group five fuel species (Douglas and subalpine fir, ponderosa and loblolly pine, and Engelmann spruce) into three fuel types (pines, firs and spruce) with minimal overlap; only five of 39 total samples were grouped with types that were not representative of the fuel species. The separation was achieved using five monoterpenoids that the algorithm selected out of a pool of 93. Future work should include exploring how departures from normality and homoskedasticity, which are underlying assumptions for ANOVA, may be affecting separation. This can be achieved by using non-parametric tests that do not make assumptions about the underlying data distribution and are more robust than ANOVA in the presence of heteroskedasticity.

The PR algorithm was tested with five synthetic mixtures, of which three (60/40 fir/spruce, 90/10 pine/spruce, and 90/10 fir/spruce) were successfully separated and clustered with their dominant family. The same synthetic mixtures were also tested using the classification algorithm, where four of five were classified correctly (60/40 pine/spruce, 60/40 fir/spruce, 90/10 pine/spruce, and 90/10 fir/spruce). The application of the classification algorithm to the synthetic mixtures demonstrated that dominant source contributions could be identified in fuel mixtures. For the FLAME-4 FL12 samples the classification algorithm correctly classified two of six samples (ponderosa pine and black spruce); these two samples were the only fuels represented by the three fuel types.
For the BFRS field samples, based on the fractional overstory composition, the classification results were reasonable, with 27 out of 28 samples being classified as fir dominant and one sample as pine dominant. The incorrect classifications that occurred with the synthetic fuel mixture (60/40 pine/fir) and the FLAME-4 FL12 samples (Indonesian peat, rice straw, wiregrass and sawgrass) were due to the similarity or partial similarity of their emission profiles with the fuels used to train the classification model. This can be resolved in future applications by including more compounds and a broader range of fuel types, including in mixtures. This will also facilitate the use of this approach for identifying contributing fuels outside of western coniferous forests.
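As an illustration of the non-parametric option mentioned for future work, the sketch below computes a Kruskal-Wallis H statistic for one compound across three fuel types. The choice of Kruskal-Wallis is an assumption of this sketch, since no specific test is named here. The emission ratio values are made up, and no tie correction is applied, so the statistic is exact only when all values are distinct.

```python
def average_ranks(values):
    """1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of ranks i+1 .. j+1
        for t in range(i, j + 1):
            ranks[order[t]] = shared
        i = j + 1
    return ranks


def kruskal_wallis_h(groups):
    """H = 12 / (N (N + 1)) * sum(R_j^2 / n_j) - 3 (N + 1), no tie correction."""
    pooled = [v for group in groups for v in group]
    ranks = average_ranks(pooled)
    n_total = len(pooled)
    weighted = 0.0
    start = 0
    for group in groups:
        rank_sum = sum(ranks[start:start + len(group)])
        weighted += rank_sum ** 2 / len(group)
        start += len(group)
    return 12.0 / (n_total * (n_total + 1)) * weighted - 3 * (n_total + 1)


# Hypothetical normalized emission ratios of one compound in three fuel types.
well_separated = [[0.10, 0.12, 0.15], [0.30, 0.33, 0.35], [0.60, 0.62, 0.70]]
interleaved = [[0.10, 0.33, 0.60], [0.12, 0.35, 0.62], [0.15, 0.30, 0.70]]

h_separated = kruskal_wallis_h(well_separated)
h_interleaved = kruskal_wallis_h(interleaved)
print(round(h_separated, 3), round(h_interleaved, 3))
```

A larger H indicates that the compound's values are ordered by fuel type rather than interleaved, so ranking compounds by H could play the same role as the ANOVA-based selection without relying on normality or equal variances.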
Code and data availability. The data and implementation scripts for the pattern recognition algorithm and the classification model are available through a GitHub repository https://github.com/christos-stamatis/supervised_pattern_recognition but are not yet final.

Author contributions. CS and KCB contributed equally to the manuscript; CS led the data analysis, data interpretation and manuscript preparation efforts; KCB led the conceptual design and manuscript editing.