Various vibrational modes present in molecular mixtures of laboratory and atmospheric aerosols give rise to complex Fourier transform infrared (FT-IR) absorption spectra. Such spectra can be chemically informative, but they often require sophisticated algorithms for quantitative characterization of aerosol composition. Naïve statistical calibration models developed for quantification employ the full suite of wavenumbers available from a set of spectra, leading to loss of mechanistic interpretation between chemical composition and the resulting changes in absorption patterns that underpin their predictive capability. Using sparse representations of the same set of spectra, alternative calibration models can be built in which only a select group of absorption bands are used to make quantitative prediction of various aerosol properties. Such models are desirable as they allow us to relate predicted properties to their underlying molecular structure. In this work, we present an evaluation of four algorithms for achieving sparsity in FT-IR spectroscopy calibration models. Sparse calibration models exclude unnecessary wavenumbers from infrared spectra during the model building process, permitting identification and evaluation of the most relevant vibrational modes of molecules in complex aerosol mixtures required to make quantitative predictions of various measures of aerosol composition. We study two types of models: one which predicts alcohol COH, carboxylic COH, alkane CH, and carbonyl CO functional group (FG) abundances in ambient samples based on laboratory calibration standards and another which predicts thermal optical reflectance (TOR) organic carbon (OC) and elemental carbon (EC) mass in new ambient samples by direct calibration of infrared spectra to a set of ambient samples reserved for calibration. We describe the development and selection of each calibration model and evaluate the effect of sparsity on prediction performance. 
Finally, we ascribe interpretation to absorption bands used in quantitative prediction of FGs and TOR OC and EC concentrations.

Atmospheric aerosols or particulate matter (PM) can range in size from a few
nanometers to tens of micrometers and exist as complex mixtures of organic
compounds, black carbon, sea salt and other inorganic salts, mineral dust,
trace elements, and water

Fourier transform infrared (FT-IR) spectroscopy

These calibration models can take the form of a multivariate linear equation,
in which suitable coefficients are found to combine the effect of absorbances
at various wavenumbers of the infrared spectra to reproduce the concentration
of a target analyte. Problems of this form are commonly solved by ordinary
least squares (OLS) regression, but OLS performs poorly when the system is
underdetermined (i.e., there are many thousands of wavenumbers and only several
hundred samples), and serial correlation exists among predictor variables
(i.e., absorbances among adjacent wavenumbers are not independent of one
another)
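
As a hedged illustration of the underdetermined case, the following sketch uses synthetic data rather than the spectra of this work; the array sizes, the five "relevant" columns, and all variable names are assumptions made for the example. With more predictors than samples, least squares reproduces the calibration data essentially exactly while predicting new samples poorly.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_wavenumbers = 20, 200           # hypothetical sizes with p >> n
X = rng.normal(size=(n_samples, n_wavenumbers))
b_true = np.zeros(n_wavenumbers)
b_true[:5] = 1.0                             # assume only five columns matter
y = X @ b_true + 0.1 * rng.normal(size=n_samples)

# Minimum-norm least squares solution: one of infinitely many exact fits
b_ols, _, rank, _ = np.linalg.lstsq(X, y, rcond=None)
train_rmse = np.sqrt(np.mean((X @ b_ols - y) ** 2))

# Error on new synthetic samples drawn from the same (assumed) model
X_new = rng.normal(size=(1000, n_wavenumbers))
test_rmse = np.sqrt(np.mean((X_new @ b_ols - X_new @ b_true) ** 2))

print(f"rank = {rank}, train RMSE = {train_rmse:.2e}, test RMSE = {test_rmse:.2f}")
```

The rank of the predictor matrix is limited by the number of samples, so the coefficients are not uniquely determined by the data; this is the situation that motivates latent variable and penalized approaches.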

One approach to facilitate interpretation is variable selection, in which
models are reduced to only the relevant wavenumbers required for prediction

We revisit calibration models for four FGs developed using laboratory
standards

In this section, we first summarize the experimental protocol detailed by

For this work, we use 794 pairs of ambient PM

The PTFE filters are scanned (without pretreatment) using a Tensor 27 FT-IR
spectrometer (Bruker Optics) with liquid-nitrogen-cooled mercury cadmium
telluride detector in transmission mode. Each spectrum is acquired over
mid-infrared wavenumbers of 4000 to 420

Two different versions of the spectra described above are used in building
calibration models with each described in Sect.

In multivariate calibration, we seek to solve the linear equation for
coefficients

Notation for matrices and vectors is provided in
Appendix

There are two types of variables in our application of PLS regression: the
physical variables, or spectral features, corresponding to wavenumbers at
which absorbances are measured (columns of

We consider four methods for obtaining models that require fewer wavenumbers,
which are summarized below and described in more detail in
Appendix

We distinguish between samples used for training and validation of
calibration models (the “calibration set”) and samples reserved for
evaluation (the “test set”, which has no influence on model development or
validation). The calibration and test sets are constructed identically to the
most accurate class of full wavenumber models described previously

The root mean squared error (RMSE) between the observed (

Selection of the full wavenumber model is described according to the
procedure by

Models are evaluated on sparseness and comparison to reference measurements.
Reference measurements include FG abundances in laboratory standards and TOR
OC and EC concentrations in ambient samples in the test set. For evaluation
of FGs in ambient samples, we sum the carbon mass estimated from FG
abundances (which we designate as “FG-OC”) and compare against TOR OC.
FG-OC is estimated from moles

Examining sparse regression coefficients is informative for identifying
important absorption bands, but interpretation can still be complicated by
the compensation of interfering bands

For TOR OC and EC, we further provide an additional level of qualitative
interpretation by associating absorption bands of vibrational modes to FGs
that contribute to our capability for predicting TOR OC and EC. For this
purpose, we examine regression coefficients alongside VIP scores and consider
that negative coefficients (a) compensate for positive artifacts in other
absorption regions or (b) are themselves artifacts of oscillations that occur
in regression coefficients when the number of LVs in the model is large
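
To make the joint reading of VIP scores and coefficient signs concrete, the sketch below computes both for a single-response PLS model fitted by NIPALS on synthetic data. It uses the commonly cited definition of VIP (squared normalized weights averaged over LVs, weighted by the response variance each LV explains); the exact formulation used in this work is not reproduced in this section, so the formula, the data, and the function names `pls1_nipals` and `vip_scores` should be read as assumptions.

```python
import numpy as np

def pls1_nipals(X, y, n_lv):
    """PLS1 by NIPALS on mean-centered data.

    Returns weights W, scores T, y-loadings q, and regression coefficients b.
    """
    Xk = X - X.mean(axis=0)
    yk = y - y.mean()
    n, p = Xk.shape
    W, P = np.zeros((p, n_lv)), np.zeros((p, n_lv))
    T, q = np.zeros((n, n_lv)), np.zeros(n_lv)
    for a in range(n_lv):
        w = Xk.T @ yk
        w /= np.linalg.norm(w)               # unit-norm weight vector
        t = Xk @ w                           # score vector for this LV
        tt = t @ t
        P[:, a] = Xk.T @ t / tt              # X loadings
        q[a] = yk @ t / tt                   # y loading
        W[:, a], T[:, a] = w, t
        Xk = Xk - np.outer(t, P[:, a])       # deflate X
        yk = yk - q[a] * t                   # deflate y
    b = W @ np.linalg.solve(P.T @ W, q)      # coefficients in original variables
    return W, T, q, b

def vip_scores(W, T, q):
    """VIP per variable; the mean of squared VIP scores equals 1 by construction."""
    p = W.shape[0]
    ss = q ** 2 * np.sum(T ** 2, axis=0)     # response variance explained per LV
    w2 = (W / np.linalg.norm(W, axis=0)) ** 2
    return np.sqrt(p * (w2 @ ss) / ss.sum())

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 50))               # synthetic "spectra"
y = X[:, 10:15].sum(axis=1) + 0.1 * rng.normal(size=200)

W, T, q, b = pls1_nipals(X, y, n_lv=3)
vip = vip_scores(W, T, q)
considered = vip > 0.5                       # threshold used for interpretation
sign = np.sign(b)                            # positive or negative coefficient
```

In this synthetic case the five informative columns receive the largest VIP scores and positive coefficients; with real spectra, overlapping bands and interferences make the signs less straightforward to interpret, as discussed above.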

Estimated RMSECVs for a range of models using different subsets of
wavenumbers are shown by shaded regions. Results are shown for models using
raw spectra. For panels in the first two columns (SPLSa and SPLSb), the
shaded regions extend from the minimum RMSECV to pRMSECV solutions for each
sparsity penalization parameter. For the last column of panels (EN/EN–PLS),
the shaded region extends from the minimum RMSECV to 1 standard error above
for each value of the penalization parameter in EN estimates. Circles
correspond to models selected for this work. For EN/EN–PLS panels, red
circles correspond to the EN solution, and blue circles correspond to the
selected solutions for EN–PLS. The RMSECVs for EN–PLS are underestimated
(Appendix

Same information as Fig.

Number of wavenumbers and LVs selected for final models.

In this section, we first describe the range of sparse models that are
generated by different algorithms and tuning parameters and present the
models selected based on validation within the calibration set
(Sect.

The sensitivity of RMSECV to models formulated with different NZVs is shown
in Figs.

We observe from a comparison of methods that the range of RMSECVs estimated
by each model algorithm depends on the response variable (FGs or TOR OC and
EC) and spectra type, and none consistently outperforms the rest. EN–PLS is
able to achieve lower apparent RMSECVs than EN in many cases even while using
the same wavenumbers, though this difference may partially be due to
underestimation of the final RMSECV resulting from awareness of validation
samples in the PLS stage of model evaluation (Appendix

Predictions vs. reference for ambient samples (OC from sum of FG).
“Anomalous” samples are those identified by

The reduced wavenumber models using raw spectra resulted in fewer NZVs than
those using baseline corrected spectra for 14 of the 24 cases examined,
indicating that it is possible for sparse methods to effectively remove the
PTFE interference and achieve suitable performance. One potential explanation
may be that isolated regions of PTFE interference can be used efficiently to
correct for the remaining interferences from the analyte regions. On average,
NZVs for both spectra types are reduced to approximately 20 % of the full set
for the solutions chosen. However, the retained fraction can be as low as
1–9 % for any substance, mostly achieved by EN (with the exception of CO for
the baseline corrected case, where the fewest NZVs are achieved by the SPLSa
algorithm).
The highest percent reduction is achieved for aCH (99 %; corresponding to 40
NZVs) using raw spectra with EN, which is a surprising result given the
richness of features in the aCH region

Forty NZVs correspond to the fewest in the set evaluated, tied with CO of
baseline corrected spectra.

Prediction of FG concentrations in laboratory samples shows good agreement
with laboratory samples with

Comparison statistics of full and reduced wavenumber solutions show sensitivity of model predictions to sparsity. OC and EC correspond to predictions from direct calibration to TOR measurements rather than summing FGs.

It is worth noting that all sparse models predict higher concentrations of
aCOH in both urban and rural samples than full models on average by a factor
of 3–8, when raw spectra are used for calibration. For this spectra type,
urban cCOH samples are also on average over-predicted by a factor of 4
(except by SPLSb). Predictions for CO generally exhibit less variation.
However, as the aCOH and cCOH are not as large contributors to the OM as aCH
(60–70 % of OM mass according to the full model;

In comparison to the FG-OC predictions which are extrapolated from laboratory
standards to the composition domain of atmospheric OM, predictions for TOR OC
and EC made by direct calibration to ambient samples show remarkable
consistency with the full model solution (Fig.

We have only compared one possible solution from each method, but other
solutions can be generated for a given algorithm by changing the model
parameters; there may be possible solutions which are better suited. However,
as concluded previously (somewhat obviously) in PLS applications to aerosol
FT-IR spectra, predictions are most robust when samples in the evaluation set
are similar to those in the calibration set

Predictions vs. reference for ambient samples (direct calibration).

VIP scores and the sign of regression coefficients at each wavenumber are
shown in Figs.

VIP for solutions using raw spectra. Dark gray lines in the “full”
solution panels correspond to the first loading weights, and the vertical
bars in every panel extend from 0 to VIP scores. Red points accompanying
vertical bars indicate wavenumbers for which regression coefficients are
positive and blue points indicate wavenumbers for which coefficients are
negative. Regions up to VIP scores of 0.5 are shaded to indicate VIP scores
not considered for our interpretation
(Sect.

VIP for solutions using baseline corrected spectra. Lines and colors
are as indicated for Fig.

We first describe our interpretation of the most parsimonious EN–PLS solution
for each FG and extend our interpretation to the SPLSa, SPLSb, and full
spectrum solutions. For aCH, CO, and cCOH FGs, the same vibrational modes are
used for both baseline corrected and raw spectrum solutions. In the case of
aCH, the C–H stretching mode (near 2900

Example group of baseline corrected spectra shown with VIP scores
above 0.5 overlaid. The VIP is derived from all calibration spectra
consisting of laboratory standards (same as those shown in
Fig.

SPLSa, SPLSb, and full spectrum solutions use the same vibrational modes as the EN–PLS solution for the quantification of cCOH, aCH, and CO. The additional peaks in the raw spectrum solutions are interpreted as being associated with background correction. For aCOH, the less parsimonious methods use the O–H stretching in both the baseline corrected and the raw spectrum solutions, in contrast to the EN–PLS solution, which uses two different vibrational modes for different spectra types.

Similarity in predicted abundances is not necessarily anticipated by the
number of wavenumbers used. An illustration is provided in Fig.

Summary of FG and associated vibrational modes with positive regression coefficients used in the prediction of TOR OC and TOR EC by EN–PLS method. The FGs that are common to every solution (both raw and baseline corrected solution for TOR OC and EC) are reported first, followed by non-oxidized FGs, oxygenated FGs, and nitrogenated FGs; the order is not indicative of abundance.

For the prediction of both TOR OC and EC, different sets of wavenumbers
(i.e., absorption bands) are used between the raw (Fig.

Even after excluding wavenumbers with extremely small VIP scores and PTFE
contributions, many interpretations for contributing FGs still exist on
account of the large number of overlapping absorption bands at each
wavenumber used by the two spectra types. In this first analysis of relevant
wavenumbers for TOR OC and EC calibration, we present our interpretation
through a “common FG hypothesis” in which we assume it most likely that
predictions by the raw and baseline corrected spectra models are primarily
enabled by a common set of FGs. This framework leads to the possibility that
the same FG may be used by the two solutions by means of different
vibrational modes (at different wavenumbers), and the inference that these
comprise essential FGs necessary for prediction. We cannot exclude the
possibility that there exists a suite of FGs with approximately similar
capability to provide, in some combination, quantitative prediction of TOR OC
and EC, leading to two models that require less than maximal overlap in FGs.
If we consider an extreme case, which we denote as the “divergent FG
hypothesis,” a pair of models may use a minimally redundant set. This
approach to band assignment can lead to an intractable number of
possibilities and is also considered less plausible given the similar level
of accuracy and robustness attained by the two models. Table

We preface our interpretations by stating that, at this time, we make no claim regarding
the relative contributions of each FG to TOR OC or EC mass, as our VIP
analysis considers the importance of spectral absorbances (and not FG
abundance) to the mass concentrations. Knowledge regarding molar absorption
coefficients and the relationship of FG to carbon abundance

The wavenumbers used by the TOR OC calibration models correspond to
vibrational modes associated with major FGs of organic PM. Carbonyls
associated with carboxylic acids, ketones, aldehydes, and esters are used by
both models through the C=O stretch (1700–1750

The baseline corrected TOR EC calibration model appears to rely on similar
absorption bands and FGs as TOR OC; this can be partly explained by the
restricted range of wavenumbers used. However, as the regression coefficients
are different we note that the bands are weighted differently in arriving at
their respective predictions, possibly indicating their use for OC artifacts
in quantification of TOR EC. Many similarities in the structure of VIP scores
with the raw solutions of TOR OC can be attributed to the PTFE corrections
previously described, though the main spectral features used by the raw
spectra TOR EC model appear to be vibrational modes in the molecular
fingerprint region that overlap with the absorbance from the C–F stretching
of PTFE (

Both TOR EC models may use four FGs used by the TOR OC solutions: aromatic
and ring structures, amines and amides, and esters. The C–C ring stretch is
present in the baseline corrected solution at

The band assignments for TOR OC are perhaps not surprising given that many of
the FGs have been used previously for quantification of organic PM, but it is
worth considering our interpretation for TOR EC in context as FT-IR is not
commonly employed for the study of elemental carbon or similar substances.
“Elemental carbon” strictly refers to sp

Direct measurements of elemental carbon and associated substances by infrared
spectroscopy are not numerous, but

The absorption band around 1600

We explicitly remark that while the absorption bands discussed in relation to
VIP scores are all observed in the overall infrared spectra used for building
calibration models, they are associated with the PM mixture and do not
correspond to direct observations of bands in physically or chemically
isolated specimens of TOR EC. The bands are selected mathematically, based on
strength of covariance (in combination with other bands) with TOR EC for
selection in quantitative prediction. Similar approaches based on covariance
analysis methods have reported acetyl, aromatic, and phenol structures

There are additional bands which appear to be relevant for one spectra type
but not the other for both TOR OC and EC (and also other analytes discussed
in Sect.

SPLSa and SPLSb solutions for TOR OC and EC calibration models developed with
baseline corrected spectra use wavenumbers similar to EN–PLS, which are
described above. In both TOR OC and EC models, additional wavenumbers in the
range 2200–2500

The full spectrum solutions include features that have been previously
described in the EN–PLS solutions, though specific interpretation is
more difficult for lack of sparsity. In the baseline corrected solution we
can see that for both TOR OC and EC high VIP scores correspond to the regions
around 1700, 3000, and 3400

We evaluated four sparse methods in the construction of calibration models for four organic FGs and TOR OC and EC. Since the full wavenumber models already performed well in prediction, the best of the sparse models generally did not improve model performance but provided interpretation regarding the most relevant absorption bands required for prediction. In formulating sparse models, the direct 1-norm penalty on regression coefficients by EN permitted better control of sparsity – i.e., stronger correlations between penalty and sparsity were observed, and more sparse solutions were ultimately obtained – than imposing penalties on individual weight or direction vectors as formulated by SPLSa and SPLSb.

SPLS methods were less robust than EN and EN–PLS in extrapolating calibration models developed with laboratory standards for use in estimating FG abundances in ambient samples. For example, FG-OC estimated by the full wavenumber PLS model using raw spectra had a slope and correlation of 0.97 and 0.93 in comparison to TOR OC, while the performance dropped as low as 1.69 (slope) and 0.77 (correlation) with the SPLS methods. The additional dimensionality reduction applied by EN–PLS led to better performance than EN using the same wavenumbers, and similar performance to the full wavenumber models was achieved while using only 1–6 % of original wavenumbers for each FG. As some samples are more sensitive to model formulation and sparsity, such methods can possibly be used to identify cases in which laboratory standards do not reflect the types of bonds in ambient samples. When sparse methods were used to build calibration models using ambient samples, TOR OC and EC prediction metrics were insensitive to sparsity. All PLS-based models predicted reference values (not included during calibration) with less than 10 % bias and correlation coefficients higher than 0.9. In examining sparse calibration models for aCOH, cCOH, aCH, and CO, selected wavenumbers for FGs are consistent with known absorption bands of their constituent bonds. FGs contributing to our capability for prediction of TOR OC are those which are commonly associated with organic PM, while for TOR EC, the main bond found in common between the raw and baseline corrected spectra is the C–C stretch in ring-structured compounds. Wavenumbers used by raw and baseline corrected spectra appear to vary significantly, but they can be (and have been) interpreted through different vibrational modes associated with a common set of FGs.

This first evaluation of sparse calibration methods using FT-IR spectra shows
promise in conferring interpretation of associated molecular bonds to TOR OC
and EC measurements. Sparse calibration models “localized” (in the
statistical sense) by spectral features or external variables can be used
to identify key FGs used for prediction of TOR measurements at various sites

The IMPROVE network data will be made publicly available.

Tables

Dimensions and indexing variables.

Arrays and dimensions.

The derivation, properties, and implementation of sparse methods used in this
paper are described in detail by their respective authors: SPLSa

A search for LVs can be framed as an optimization problem to maximize
covariance between response and explanatory variables under a set of
transformations
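
For a single response and no sparsity penalty, the covariance-maximizing weight vector of unit norm has a closed form proportional to the product of the transposed predictor matrix and the response. The sketch below checks this numerically on synthetic, mean-centered data; the data, dimensions, and names are assumptions for illustration, and the sparse variants discussed in this appendix modify this objective with additional penalties that are not shown here.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(60, 8))
X -= X.mean(axis=0)                           # mean-center predictors
y = X[:, 0] - 0.5 * X[:, 3] + 0.1 * rng.normal(size=60)
y -= y.mean()                                 # mean-center response

def covariance(w):
    """Sample covariance between the latent variable X w and the response y."""
    return (X @ w) @ y / (len(y) - 1)

# Closed-form maximizer under the unit-norm constraint
w_star = X.T @ y
w_star /= np.linalg.norm(w_star)

# Compare against many random unit-norm directions
trials = rng.normal(size=(5000, X.shape[1]))
trials /= np.linalg.norm(trials, axis=1, keepdims=True)
best_random = max(covariance(w) for w in trials)

print(f"closed form: {covariance(w_star):.4f}, best random: {best_random:.4f}")
```

No random direction exceeds the closed-form solution, which follows from the Cauchy–Schwarz inequality; subsequent LVs are obtained by repeating the search after deflation.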

Summary of sparse methods and parameters.

To solve for the underlying weights and direction vectors which satisfy these
equations, we use the NIPALS algorithm implemented in the

EN regularization is not a variant of PLS but solves for regression
coefficients in Eq. (

The first penalty corresponds to that used by LASSO regression
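
The way a 1-norm penalty sets coefficients exactly to zero can be sketched with a small coordinate descent solver. The objective assumed below is (1/2n)·||y − Xb||² + λ[α·||b||₁ + ((1 − α)/2)·||b||²], a widely used parameterization that may differ in scaling from the equation in this appendix; the function names and synthetic data are likewise hypothetical, and this is not the R implementation used in this work.

```python
import numpy as np

def soft_threshold(z, gamma):
    """Shrink z toward zero by gamma; values within [-gamma, gamma] become 0."""
    return np.sign(z) * np.maximum(np.abs(z) - gamma, 0.0)

def elastic_net(X, y, lam, alpha, n_iter=200):
    """Cyclic coordinate descent for the assumed elastic net objective."""
    n, p = X.shape
    b = np.zeros(p)
    col_ss = np.sum(X ** 2, axis=0) / n
    resid = y.astype(float).copy()            # residual for b = 0
    for _ in range(n_iter):
        for j in range(p):
            resid += X[:, j] * b[j]           # remove contribution of variable j
            rho = X[:, j] @ resid / n
            b[j] = soft_threshold(rho, lam * alpha) / (col_ss[j] + lam * (1 - alpha))
            resid -= X[:, j] * b[j]           # restore contribution with new b[j]
    return b

rng = np.random.default_rng(7)
X = rng.normal(size=(80, 30))
X -= X.mean(axis=0)
y = 2.0 * X[:, 2] - 1.5 * X[:, 7] + 0.1 * rng.normal(size=80)
y -= y.mean()

b_weak = elastic_net(X, y, lam=0.01, alpha=0.9)
b_strong = elastic_net(X, y, lam=0.2, alpha=0.9)
nz_weak, nz_strong = np.count_nonzero(b_weak), np.count_nonzero(b_strong)
print(nz_weak, nz_strong)
```

Increasing the penalization parameter removes more variables from the model, which is the mechanism by which the number of NZVs is controlled; the 2-norm term only shrinks the surviving coefficients.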

We perform EN regression using the

A sparse PLS formulation by

This algorithm is implemented in the

Another sparse PLS algorithm, introduced by

We use the implementation in the

We have implemented this method in R in combination with the

The weighting of the bias and variance can be formulated in various
functional forms with tuning parameters, and the final model chosen by
consensus scoring

Variable selection and model parameter estimation are performed entirely
within the calibration set for all methods. The two objectives are combined
in EN and SPLSa by their respective soft thresholding penalties, but are
separated into two steps by SPLSb and EN–PLS. SPLSb correctly applies both
variable selection and parameter estimation prior to estimation of error
against the validation (sub)set, while our current implementation of the EN–PLS
method performs variable selection (via EN) prior to the CV and RMSECV
estimation procedure of the second step. In the latter scenario, the reported
RMSECV can be underestimated
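
The direction of this bias can be demonstrated with a small synthetic experiment in which the response is pure noise, so that no predictor is truly informative. The sketch below replaces EN and PLS with a simple correlation filter followed by least squares; that substitution, the dimensions, and all names are assumptions chosen for brevity rather than the procedure used in this work.

```python
import numpy as np

rng = np.random.default_rng(42)
n, p, k, n_folds = 50, 1000, 10, 5
X = rng.normal(size=(n, p))
y = rng.normal(size=n)                        # pure noise response

def select(Xs, ys, k):
    """Indices of the k columns most correlated (in absolute value) with ys."""
    Xc = Xs - Xs.mean(axis=0)
    yc = ys - ys.mean()
    score = np.abs(Xc.T @ yc) / np.linalg.norm(Xc, axis=0)
    return np.argsort(score)[-k:]

def fit_predict(Xtr, ytr, Xte):
    """Least squares with intercept, fitted on training rows only."""
    A = np.column_stack([np.ones(len(Xtr)), Xtr])
    coef, *_ = np.linalg.lstsq(A, ytr, rcond=None)
    return np.column_stack([np.ones(len(Xte)), Xte]) @ coef

folds = np.array_split(rng.permutation(n), n_folds)

def rmsecv(select_inside):
    cols_all = select(X, y, k)                # selection that has seen every sample
    sq_err = []
    for test_idx in folds:
        train_idx = np.setdiff1d(np.arange(n), test_idx)
        if select_inside:
            cols = select(X[train_idx], y[train_idx], k)
        else:
            cols = cols_all
        pred = fit_predict(X[train_idx][:, cols], y[train_idx], X[test_idx][:, cols])
        sq_err.append((pred - y[test_idx]) ** 2)
    return np.sqrt(np.mean(np.concatenate(sq_err)))

rmsecv_outside = rmsecv(select_inside=False)  # selection before CV (optimistic)
rmsecv_inside = rmsecv(select_inside=True)    # selection repeated in each fold
print(f"outside: {rmsecv_outside:.2f}, inside: {rmsecv_inside:.2f}")
```

When selection precedes cross-validation, the held-out samples have already influenced which variables are retained, so the estimated error is lower than when selection is repeated within each fold. An independent test set is unaffected by this issue.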

The authors acknowledge funding from the Swiss National Science Foundation (200021_143298) and the IMPROVE program (National Park Service cooperative agreement P11AC91045). We also thank A. Weakley for helpful discussions.

Edited by: A. Sayer
Reviewed by: G. Lebron and one anonymous referee
