Peak fitting (PF) and partial least
squares (PLS) regression have been independently developed for estimation of
functional groups (FGs) from Fourier transform infrared (FTIR) spectra of
ambient aerosol collected on Teflon filters. PF is a model that quantifies
the functional group composition of the ambient samples by fitting individual
Gaussian line shapes to the aerosol spectra. PLS is a data-driven,
statistical model calibrated to laboratory standards of relevant compounds
and then extrapolated to ambient spectra. In this work, we compare the FG
quantification using the most widely used implementations of PF and PLS,
including their model parameters, and also perform a comparison when the
underlying laboratory standards and spectral processing are harmonized. We
evaluate the quantification of organic FGs (alcohol

Atmospheric aerosol, also called particulate matter (PM), is made up of
organic compounds, inorganic salts, trace elements, black carbon, and water,
among other substances. Accounting for its total mass in terms of its
speciated composition is desirable for regulatory and epidemiological
reasons, and this goal poses a substantial challenge for environmental
analytical measurement. Organic compounds in particular can contribute
20 %–80 % of the atmospheric aerosol mass

Some of the earliest work on organic aerosol composition, in both Los
Angeles smog and synthetic smog generated in the laboratory, relied on
infrared spectroscopy

The essential principle of the technique is to record chemically specific
absorption bands resulting from dipole moment transitions induced by
interaction of molecular vibrations with mid-IR radiation

To address these challenges, methods for quantification of bonds from FTIR
spectra fall into two broad categories

Additionally, when scattering interferences are present in the sample,
spectral preprocessing can extend applicability of the Beer–Lambert law

Past evaluations of these algorithms have been performed against functional
group composition of laboratory-generated samples with known composition, or
aggregated metrics such as OM or organic carbon (OC) in ambient samples where
reference measurements of organic functional groups have not been available.
Evaluations of FG abundance in laboratory-generated samples have fared well

Given the possible decisions that can be made regarding calibration sample
selection, spectra manipulation, and fitting algorithm for calibration model
development, a critical need remains to evaluate the sensitivity of estimated
FG abundances to the calibration model used. Therefore, in this work we use
the same 794 ambient sample spectra from the IMPROVE network used by

The basis for quantitative spectroscopy can be described by the
Bouguer–Lambert–Beer law

In the following sections, we first describe laboratory and ambient
measurements used for calibration and prediction
(Sect.

Summary of models evaluated.

We use 794 IMPROVE network

Ambient samples were collected every third day from midnight to midnight
(local time) for 24 h. FTIR spectra are obtained for PM collected on
25

The baseline correction method of

Laboratory and ambient sample spectra (raw and baseline corrected). Black lines denote mean absorbances, and dashed gray areas denote 95 % confidence intervals.

Spectra similarities in baseline-corrected ambient sample spectra are used to
group samples into clusters, as originally presented by

The method of PF constructs a physically based representation of
absorbances based on Eq. (

Typically, calibration parameters are obtained from single-compound standards
where attribution of absorption to individual functional groups is the least
ambiguous. Prediction in more complex mixtures is enabled by the concurrent
fitting of multiple absorption peaks and invocation of mixing rules to arrive
at a representative absorption coefficient. In this work, we retain the
algorithm for apportioning the absorbance spectrum to various functional
groups

Multivariate calibration is an alternative approach that is typically
formulated as a linear regression problem, with the analyte concentration as
the regressand (response variable) and absorbances used as regressors. In
scalar notation, this relationship is written as

A series of candidate models which satisfy Eqs. (

Though the interpretation of PLS models is less straightforward than that of PF,
it is possible to examine how the models weight spectral
variables (wavenumbers) and calibration samples when making predictions. The
regression coefficients are difficult to interpret directly, as their
magnitudes must be considered in combination with absorbances. In addition,
the regression coefficients can be either positive or
negative; the latter are associated with interfering species

The constituent molar abundance

While the carboxylic group comprises two molecular bonds, the abundances of
carboxylic hydroxyl and carbonyl bonds are conventionally quantified
separately with calibration models developed for their respective absorption
bands. The carbonyl quantified in this way can include contributions from
ketonic and aldehydic carbonyl because of their proximity in absorption bands
that are difficult to resolve in environmental samples; the carboxylic
hydroxyl cCOH and total carbonyl tCO are re-apportioned to estimate abundance
of carboxylic COOH groups along with non-acid (ketonic, aldehyde, and ester)
carbonyl CO (written as naCO). Stoichiometrically,

One strategy to avoid the apportionment of tCO to COOH and naCO is to build
an alternative PLS regression model to predict naCO directly, rather than
tCO. The known concentrations in laboratory standards are transformed
according to Eq. (S1) and provided as the response vectors to
Eq. (

Metrics such as mean error, mean bias, RMSE, and many others exist for
intercomparison among measured and estimated values. In this work, to
quantify overall bias we use total least squares slope (obtained via major
axis regression), which accounts for uncertainties in both quantities being
compared
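The total least squares slope mentioned above can be sketched in a few lines. The following is a minimal illustration, assuming equal error variances in both quantities (the standard major axis case); the function name `major_axis_slope` and the sample values are hypothetical and are not taken from this work.

```python
import math


def major_axis_slope(x, y):
    """Slope of the major axis (total least squares) line through paired data.

    Assumes equal error variance in x and y; the slope is the direction of the
    first principal axis of the 2x2 scatter matrix of the centered data.
    """
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Centered sums of squares and cross-products
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    if sxy == 0:
        raise ValueError("slope is undefined when the covariance is zero")
    diff = syy - sxx
    return (diff + math.sqrt(diff * diff + 4.0 * sxy * sxy)) / (2.0 * sxy)


# Illustrative (made-up) values: two estimates of the same quantity
estimate_a = [0.0, 1.0, 2.0, 3.0]
estimate_b = [0.1, 0.9, 2.2, 2.8]
slope_ab = major_axis_slope(estimate_a, estimate_b)
slope_ba = major_axis_slope(estimate_b, estimate_a)
```

Unlike ordinary least squares, this slope is symmetric in the two quantities: swapping them yields the reciprocal slope, which is convenient when neither estimate can be treated as an error-free reference.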

We first report on differences among estimated absorption coefficients
(Sect.

Calibration curves and predicted concentrations according to the PF
strategy outlined in Sect.

Top row: integrated absorption as a function of known molar abundance used to derive molar absorption coefficients. Bottom row: evaluation of derived absorption coefficients on predicted concentrations for test set compounds not used in the fitting.

Recalibrated absorption coefficients and fit statistics for each FG and compound. Italicized texts denote the compounds not used in the computation of cCOH and aCH absorption coefficients averages.

In Fig.

Summary of molar absorption coefficients reported in the literature.
The single star for 1-docosanol aCOH indicates that there are only three
points – one of which is an influential point – so this is effectively a
single-point estimate. The single star for ammonium sulfate indicates that it
is based on a single value. The double star is used to indicate that the
absorption coefficient for malonic acid cCOH is estimated for a concentration
range an order of magnitude lower than the rest. Previous studies are summarized
by

Absorption coefficients for aCH vary by a factor of 3.2 (between 1.0 and 3.2, over 10 compounds), for aCOH by a factor of 1.9 (between 19.8 and 37.7, over seven compounds), for cCOH by a factor of 1.6 (between 32.8 and 51.6, over three compounds), and for tCO by a factor of 1.6 (between 10.0 and 16.1, over seven standards; Fig. S11). Without informed strategies for parameter selection, the range of valid possibilities for these absorption coefficients imparts uncertainty in FG calibration.
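The factors quoted above follow directly from the reported minimum and maximum coefficients. The short sketch below reproduces them as a consistency check; the dictionary layout and the names `coefficient_ranges` and `spread_factor` are illustrative choices, and the values are used in whatever units the text reports them.

```python
# Minimum and maximum absorption coefficients per FG, as quoted in the text
coefficient_ranges = {
    "aCH": (1.0, 3.2),
    "aCOH": (19.8, 37.7),
    "cCOH": (32.8, 51.6),
    "tCO": (10.0, 16.1),
}


def spread_factor(low, high):
    """Ratio of the largest to the smallest coefficient for one FG."""
    if low <= 0:
        raise ValueError("coefficients must be positive")
    return high / low


# Factors rounded to one decimal place, as in the text
factors = {
    fg: round(spread_factor(low, high), 1)
    for fg, (low, high) in coefficient_ranges.items()
}
```

If abundance is taken to scale inversely with the absorption coefficient, the same factor bounds how much an estimated abundance can change with the choice of coefficient alone.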

Shown on the right side of the Fig.

We first compare quantities for which we have an independent estimate (TOR OC
and ammonium) to place our predictions in context. Individual contributions
of FGs used to estimate OC are discussed in Sect.

Comparison of estimated OC (FG OC) against OC measured by the TOR method (TOR OC). PFo refers to peak fitting using the original parameters. PLSr refers to partial least squares using raw spectra. PFr refers to peak fitting using the recalibrated absorption coefficients. PLSbc refers to partial least squares using baseline-corrected spectra.

Figure

The difference between PFo and PFr is due to the systematic increase in absorption coefficients used by PFr compared with PFo, which decreases the molar abundance of FGs and, consequently, the FG OC. This difference is particularly pronounced for the aCH absorption coefficient (1.73 against 1.31), as the aCH mole fraction is over 60 % regardless of the estimation method used.
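The effect of the aCH coefficient can be made concrete with a small calculation. This is a minimal sketch, assuming that molar abundance is the integrated absorbance divided by the absorption coefficient (a Beer-Lambert-type proportionality); the function name `molar_abundance` and the absorbance value are hypothetical, and only the two coefficients come from the text.

```python
def molar_abundance(integrated_absorbance, absorption_coefficient):
    """Abundance under an assumed linear (Beer-Lambert-type) relationship."""
    if absorption_coefficient <= 0:
        raise ValueError("absorption coefficient must be positive")
    return integrated_absorbance / absorption_coefficient


# aCH absorption coefficients quoted in the text
ACH_COEFF_PFO = 1.31
ACH_COEFF_PFR = 1.73

# Arbitrary integrated absorbance; it cancels out in the ratio below
absorbance = 0.5

abundance_pfo = molar_abundance(absorbance, ACH_COEFF_PFO)
abundance_pfr = molar_abundance(absorbance, ACH_COEFF_PFR)

# Ratio of PFr to PFo abundance for the same spectrum
ratio = abundance_pfr / abundance_pfo
```

Under this assumption the ratio equals 1.31/1.73, so for the same fitted peak area the PFr aCH abundance is roughly a quarter lower than the PFo value.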

The differences between PLSr and PLSbc are more difficult to understand, but
some interpretations can be made. First, systematic differences can occur in
the way that laboratory standard and ambient sample spectra are baseline
corrected as the absorbance regions are different. Also, the baseline correction
used in PLSbc does not include frequencies lower than 1500

Comparison of estimated OC (FG OC) against OC measured by the TOR method (TOR OC). PLSbc* refers to partial least squares using baseline-corrected spectra and a heuristic choice for the number of aCH LVs (13) based on agreement between FG OC and TOR OC (Fig. S16).

The solutions in the neighborhood (

PLSbc* is only one of many possible models that show improved agreement with
TOR OC to be explored in future work; for this paper we restrict our
evaluation of results primarily to those obtained by the protocols described
in Sect.

Figure

Comparison of estimated ammonium (FG ammonium) against ammonium measured using ion chromatography (IC ammonium). PFo refers to peak fitting using the original parameters. PLSr refers to partial least squares using raw spectra. PFr refers to peak fitting using the recalibrated absorption coefficients. PLSbc refers to partial least squares using baseline-corrected spectra.

While ammonium quantification by FTIR has been the focus of past researchers

Figure

FG comparison summary. Pearson correlation coefficient PFo refers to peak fitting using the original parameters. PLSr refers to partial least squares using raw spectra. PFr refers to peak fitting using the recalibrated absorption coefficients. PLSbc refers to partial least squares using baseline-corrected spectra.

First, we focus on the comparison between PFo and PLSr. In urban samples, aCH
presents the highest correlation (

Correlation coefficients in the PFr–PLSr comparison for any organic FG are
similar to the PFo–PLSr comparison; the only notable difference is the larger
regression slope for cCOH (1.94 and 3.25 for urban and rural samples against
1.53 and 2.57 respectively), due to the lower absorption coefficient applied
to PFr than PFo (Table

PFo and PFr predictions agree closely with PLSbc, likely because they use the same portion of the spectra. The organic FG correlations vary between 0.7 (aCOH – rural samples) and 0.99 (tCO – rural samples). Because PFr and PLSbc use the same laboratory standard compounds, PFr predictions are closer to PLSbc than PFo predictions are, with the exception of cCOH.

The correlation for iNH is greater than 0.97 with slope close to one in the case of PFr and greater than 1.8 in the case of PFo, which indicates a systematic bias due to the different absorption coefficients used (14.84 and 8.89 for PFr and PFo respectively).

The correlations in tCO between PF and PLS are high (

Within the broader scope of assessing uncertainty for each FG, we can
consider that the estimated slopes can vary according to the selection of
absorption coefficient value for PF and the number of LVs for PLS. The range
of absorption coefficients is given in Sect.

Figure

Comparison of quantified abundance of tCO and cCOH. PFo refers to peak fitting using the original parameters. PLSr refers to partial least squares using raw spectra. PFr refers to peak fitting using the recalibrated absorption coefficients. PLSbc refers to partial least squares using baseline-corrected spectra.

There is noticeably more scatter in the relationship between tCO and cCOH
from the PF predictions, with the presence of naCO difficult to identify in
these samples. Furthermore, tCO abundance is systematically lower than cCOH
for many samples (discernable beyond the extent of scatter).
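The consequence of tCO falling below cCOH can be illustrated with a simple apportionment sketch. It assumes a one-to-one stoichiometry between carboxylic hydroxyl and carboxylic carbonyl, so that COOH equals cCOH and naCO is the residual tCO minus cCOH; the function name `apportion_carbonyl`, the returned field names, and the sample values are hypothetical and are not the authors' implementation.

```python
from collections import namedtuple

Apportionment = namedtuple("Apportionment", ["cooh", "naco", "is_physical"])


def apportion_carbonyl(ccoh, tco):
    """Split total carbonyl into carboxylic and non-acid parts.

    Assumes each carboxylic group contributes exactly one hydroxyl and one
    carbonyl bond, so the non-acid carbonyl is the residual tCO - cCOH.
    A negative residual is flagged rather than silently clipped.
    """
    if ccoh < 0 or tco < 0:
        raise ValueError("abundances must be non-negative")
    naco = tco - ccoh
    return Apportionment(cooh=ccoh, naco=naco, is_physical=naco >= 0)


# Illustrative (made-up) molar abundances in arbitrary units
sample_with_excess_carbonyl = apportion_carbonyl(ccoh=1.0, tco=1.5)
sample_with_deficit = apportion_carbonyl(ccoh=2.0, tco=1.5)
```

In the second case the residual is negative, which is the situation described above: when tCO is lower than cCOH, the difference cannot be interpreted as a non-acid carbonyl abundance without further assumptions.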

Figure

Comparison of estimated CO according to canonical calibration (as the difference between the molar abundances of cCOH and tCO) and alternate calibration (direct calibration to non-acid CO). PLSr refers to partial least squares using raw spectra. PLSbc refers to partial least squares using baseline-corrected spectra.

High abundances of naCO have been reported in biomass burning and biogenic
secondary OM in past studies (using PF), either due to ketones present in
photochemical reaction products

Figure

The absolute magnitudes, however, require further consideration. The mean

To improve these carbon-normalized metrics, the undetected carbon moieties
can be corrected by incorporating an assumed carbon mass recovery fraction

Figure

When we heuristically adjust the PLSbc aCH model parameters to match TOR OC
concentrations within 10 % on average (PLSbc* introduced in
Sect.

We examine and summarize in Fig.

Left column shows scaled baseline-corrected spectra between 4000 and
1500

Cluster 16 (second row in Fig.

The atypical predictions for clusters 19 and 20 (third and fourth row in
Fig.

In this work, we explore the diversity in FG predictions that can result from
calibration models built with mid-IR spectra. In particular, we compare two
prominent methods for estimation of functional groups (FGs) from mid-IR
spectra used in atmospheric PM analysis: peak fitting (PF) and partial least
squares (PLS) regression. PF is an approach using physically based absorption
profiles to model spectral signals, and PLS is a statistical approach which
is trained on relevant features from reference spectra. Using PF, we
evaluated FG estimates using molar absorbance coefficients (model parameters)
taken from previous studies (PFo) and recalibrated (PFr) using 238 laboratory
standards from

PFo and PFr require some assumptions: (i) the structure of the PTFE signal;
(ii) the values of the molar absorbance coefficients; and (iii) the rule used to
apportion carbonyl to carboxylic and non-acid contributions. Underestimation
of OC in comparison to TOR (by as much as 50 %) and surprisingly high
values of

PLSr requires the least prior knowledge – e.g., how to model the baseline – and
therefore offers an appealing approach to calibration. However, scattering
contributions from larger ambient particles can lead to overprediction of
organic FGs and ammonium in ambient samples with considerable dust impacts.
As reported in previous studies

Both PLSbc and PLSr can quantify carboxylic acid and non-acid carbonyl groups directly by setting COOH and naCO as the target variables, with the models trained on the wavenumbers and LVs relevant for the two species. From this analysis, we conclude that almost all of the carbonyl in the 2011 samples from the seven IMPROVE sites is associated with carboxylic groups rather than ketone or ester CO. In principle, it is also possible to define a fixed relationship between carboxylic cCOH and carbonyl CO such that COOH and the residual naCO can be determined in PF, but this requires additional assumptions to implement.

In summary, models built with laboratory standards and algorithms are able to
extract relevant information from ambient FTIR sample spectra. Evaluation
against external reference values (TOR OC and ammonium estimated from anion
chromatography analysis) suggests moderately strong to strong correlation for
this IMPROVE monitoring data set, and it is generally consistent with past studies that
have also found high correlation with collocated measurements of TOR OC and
AMS OM

Reducing uncertainty in predictions derived from FTIR spectra can be
envisioned by two means: by further advancing our study of laboratory standards
that mimic ambient samples more closely and by exploring mathematical
solutions possible within a stricter set of constraints. Regarding the first
point,

Given that some differences will remain between key features in laboratory
standards and ambient samples, the second point on algorithmic improvements
can be formulated in several ways. One strategy is to explore the subset of
solutions that are consistent with available external measurements (e.g., TOR
OC, AMS OM, and other chemical information) to revise model selection
criteria. While an example varying the number of LVs for aCH is shown in this
work, a more formal approach to multi-parameter optimization is preferable
for approaching this task. It is also possible to establish
relationships between spectra and FGs other than those considered in this work.
For instance, the full range of available calibration samples or
absorption coefficients is likely not the most appropriate for every sample.
Diversity in sample composition – e.g., between urban and rural samples –
can be incorporated in a multilevel modeling approach, whereby different
models or model parameters can be used based on spectral shape or identified
source contributions (e.g., using positive matrix factorization;

Finally, anticipating the mass fraction of OC that can be explained by FGs
will continue to play an important role in estimating the overall OM,
particularly for the

FTIR spectroscopy remains a promising analytical technique to provide
independent estimates of OM,

Baseline correction

The IMPROVE network spectra will be made publicly
available. TOR OC and PM

Table

List of abbreviations and their definitions.

The supplement related to this article is available online at:

MR and ST have performed the analysis, written the manuscript, and prepared the artwork. AD has provided the sample spectra and contributed in writing and revising the manuscript.

The authors declare that they have no conflict of interest.

The authors acknowledge funding from EPFL and the IMPROVE program with support from the US Environmental Protection Agency (National Park Service cooperative agreement Pl8AC01222). We also thank Christophe Delval for assistance in identification of the Christiansen peak species.

This paper was edited by Charles Brock and reviewed by two anonymous referees.