Introduction
Measurement and quantification of atmospheric aerosol composition and
abundance provide a basis from which we can monitor regional air quality,
predict potential impacts on health and climate, and deduce
formation mechanisms to reduce uncertainties in climate models for simulating
alternative scenarios relevant to climate change adaptation or policy
decision-making. Atmospheric aerosols, or particulate matter (PM), occur as
complex mixtures of inorganic salts, crustal elements, sea spray, organic
compounds, black carbon, and water, and a combination of
analytical techniques is required to resolve their physical and chemical
characteristics. A useful and relatively inexpensive
strategy is to collect atmospheric aerosol particles onto a substrate for
offline analysis in the laboratory. Amongst different substrates,
polytetrafluoroethylene (PTFE) filters have been extensively used in both
measurement campaigns and routine monitoring networks, such as the
IMPROVE network in pristine and rural areas or the Chemical Speciation
Network/Speciation Trends Network in urban and suburban areas in the United
States. Advantages of PTFE substrates include their stability,
hydrophobicity, and negligible carbon gas adsorption. As such, they are amenable to gravimetric
mass measurement, elemental analysis, and detailed chemical speciation analysis.
Carbonaceous particulate matter (PM) composition collected on PTFE filters is
characterized by Fourier transform infrared (FT-IR) spectroscopy. Organic
functional groups in PM absorb mid-infrared (IR) radiation in specific
segments of the spectrum. The amount of light absorbed is proportional to the
moles of the functional group (Beer–Lambert law). The absorption at the
characteristic frequency of a particular type of bond is measured directly
through PTFE filters. The high-frequency
region (> 1500 cm-1) contains stretching and bending modes of important
functional groups, such as alkane (consisting of saturated aliphatic C-CH
bonds found in hydrocarbon chains), carboxylic acid (COH and C=O found in
carboxylic acids and diacids), carbonyl (C=O found in ketones, aldehydes, and
esters), hydroxyl (COH found in straight chain alcohols), and amine (C-NH2
found in primary amines). The fingerprint region (< 1500 cm-1) contains absorption bands of organonitrate (CONO2) and
organosulfate (COSO3) compounds but is
outside the scope of our study.
A growing number of recent papers introduce
and apply statistical methods for atmospheric aerosol
characterization from infrared spectra. One application is
unsupervised clustering of spectra into discrete categories to quantify source
contributions, such as fossil fuels, biomass vegetation, or biomass burning,
to the total organic PM mass. Spectral clustering has been used in several
atmospheric aerosol measurement campaigns and data analysis studies.
Cluster analysis
of spectra has compared favorably with source class interpretation from
factor analysis,
which attributes variations in the spectra matrix to varying contributions
from an underlying set of components, and multiple linear regression with
predetermined factor sources. Another approach, which
has a long record of use for quantification of functional group composition
and source apportionment in atmospheric aerosol samples,
is fitting individual Gaussian line shapes to
quantify alcohol COH, carboxylic COH, alkane CH, carbonyl CO, and amine NH
functional groups. Finally, functional groups
and organic and elemental carbon content
equivalent to that of thermal optical reflectance (TOR) have been estimated from partial least squares (PLS)
calibration applied to infrared spectra. However, all applications but PLS
regression require baseline-corrected infrared spectra (without PTFE
interferences) to apply the Beer–Lambert law-type analysis and account for
variations in analyte (aerosol) absorbance only. Aside from these statistical
applications, removing PTFE interferences is a necessary step for visual
inspection and comparison of similarity of aerosol composition in FT-IR
spectra.
The problem of background removal is ubiquitous in nearly all spectroscopies
(e.g., FT-IR, nuclear magnetic resonance (NMR), and Raman spectroscopies) and
their respective applications that quantify chemical quantities based on the
shape and distribution of spectral peaks.
A general formulation of the problem is to partition an observed
spectroscopic signal into two components: one that varies smoothly (baseline)
and one that is zero except in specific, localized regions (analyte). However,
background correction is an ill-posed problem: we do not know the
exact proportions of the baseline and analyte in the observed signal. As a
result, a realistic approach is to implement a baseline model representation
capable of capturing underlying physical phenomena causing the baseline
specific to the spectroscopy type. While many such investigations have been
made in FT-IR biospectroscopy,
single-compound gas-phase FT-IR,
NMR, and Raman spectroscopies,
background removal in
FT-IR atmospheric aerosol samples remains a far less-studied topic.
Therefore, we evaluate existing classes of background correction methods to
identify the most promising one based on ambient aerosol spectral
characteristics.
Existing techniques include frequency decomposition (via Fourier transform,
wavelets, or digital filters) to separate the baseline component from the
analyte absorption in the frequency domain. While
frequency decomposition techniques have been shown to successfully correct
biological or single-compound FT-IR spectra,
they do not apply in the PM context, where
spectral features are not well separated due to broad analyte absorption
regions in condensed-phase aerosol samples. Another existing class, numerical
differentiation (e.g., first or second differentiation, or Savitzky–Golay
derivation), leads to noise amplification and
requires additional smoothing that is sensitive to the signal-to-noise ratio
for a specific set of samples. Furthermore, as a result of negative values
from the derivative transformation, transformed spectra are difficult
for spectroscopists to interpret visually. The interpolation approach
uses sample-specific PTFE signals in background regions where
analyte absorption is not expected, and interpolates through analyte regions
to identify their relative contributions at each wavelength.
Widely used interpolation methods for aerosol analysis on PTFE filters
are not scalable to projects
with a large number (hundreds of thousands) of aerosol samples. The most
modern implementation of these methods addresses the
challenges described above by prescribing a set of four default background
regions and a polynomial model for the variation. Background regions can be
adjusted for each sample to improve accuracy; yet, this leads to additional
costs in labor and variability across users. For example, users with
extensive FT-IR baseline correction experience may feel comfortable using
visual inspection to identify the background and analyte regions in Fig. 1.
Others, on the other hand, may prefer to look at past examples or conduct a
brief literature search on the presence and locations of absorbing functional
groups. Alternatively, for a fixed background region, a non-negativity
constraint can be imposed to alleviate issues of unrealistic spectral
features that can arise from incorrect specification of the background. However,
this adaptation to handle negative analyte absorbance can lead to positive
bias in certain regions of ambient sample spectra, or overall in blank sample
spectra for which the mean absorbance should be zero (as an average of
positive and negative values). Finally, as the predefined polynomial forms
are unable to account for all PTFE interferences, the method requires
subtraction of blank filter spectra to remove some PTFE features a priori.
Therefore, the baseline correction is performed on the residual spectra
rather than the original. However, because blank filters themselves exhibit
variability, there is no perfect PTFE reference and the subtraction procedure
may impart additional bias. Additionally, collecting blank PTFE filters
increases FT-IR analysis costs and time.
Figure 1. 794 FT-IR atmospheric aerosol spectra collected on PTFE filters. Each spectrum is color-differentiated.
However, a separation of atmospheric aerosol absorbance bands from the PTFE
baseline via interpolation can be very complex, and therefore difficult to
quantify precisely and reason about. We break down the issue into two
separate problems. The first problem is determining sample-specific bounds
for analyte and background subregions in the atmospheric aerosol spectra.
Atmospheric PM mixtures are thought to comprise 10^4–10^5 atmospheric
organic species, leading to
broad, overlapping IR absorption bands (features on the order of 10–10^2 cm-1) of different functional groups that absorb within similar
wavenumber regions. In Fig. 1 we show
794 atmospheric PM samples collected on PTFE filters, each differentiated by
color. The overlapping absorbance bands can be seen as smoothly varying
features in regions at ∼ 3700–2200 and ∼ 1820–1500 cm-1,
superimposed on a sloping baseline. The range is only indicative; the
wavenumber specificity is further limited by variability in ambient PM
mixture composition. As composition varies as a function of PM source and
date, several of these functional groups may be absent in the sample at hand.
Due to the absence of structurally distinguishable features to indicate the
onset of analyte contributions, it is challenging to pinpoint the exact
locations of analyte absorption. The second problem is reproducing the
structure of the PTFE baseline. PTFE scattering represents the largest source
of variation of the FT-IR signal when particles are collected.
The slope and shape of the baseline
vary substantially among individual samples (Fig. 1). Baseline
variations due to PTFE fiber stretching are unique to each sample and do not
follow a prescribed or universal pattern, rendering standardized baseline
preprocessing methods, for example pre-scan subtraction, standard normal
variate, and multiplicative scattering correction,
insufficient. Because the underlying PTFE
signal lacks structural specificity, we need a sample-adaptive model.
Notation for variables

| Category | Symbol | Description |
|---|---|---|
| Smoothing splines model formalization (Sect. 2.1) | $w$ | weight |
| | $y$ | observed absorbance |
| | $\hat{y}$ | fitted absorbance |
| | $x$ | wavenumber |
| | $j$ | an index to denote the number of wavenumbers |
| | $\lambda$ | smoothing penalty |
| Spectral signal decomposition (Sect. 2.1) | $B$ | background component in observed absorbances |
| | $A$ | analyte component in observed absorbances |
| | $W_A$ | set of wavenumbers with absorbances |
| | $W_B$ | set of wavenumbers without absorbances |
| | $W_1$–$W_4$ | specific wavenumbers to denote boundaries between analyte and background components |
| Smoothing splines parameter selection (Sects. 2.1, 2.3.2, 3.1) | $\mathrm{EDF}_T$ | desired (target) EDF parameter defined by a user prior to applying the model |
| | $\mathrm{EDF}_A$ | actual EDF parameter computed by the algorithm to match the user-specified $\mathrm{EDF}_T$ |
| | $\mathrm{EDF}^*$ | optimal EDF parameter selected from a range of $\mathrm{EDF}_T$ |
Naturally, this problem raises the question of how to develop an automated
method for baseline correcting hundreds or thousands of ambient aerosol FT-IR
spectra given the variability in environmental mixture composition and PTFE
baselines. This study approaches the question by detailing the statistical
protocol, which allows for the precise definition of analyte and background
subregions, applies nonparametric smoothing splines to model
sample-specific PTFE variations, and integrates performance metrics from PM
and blank samples alike in the smoothing parameter selection. Referencing an
extensive set of atmospheric aerosol samples, in Sect. we
start by identifying key FT-IR signal characteristics (such as non-negative
absorbance or analyte segment transformation), which reduce signal variations
to fundamental features to capture sample-specific transitions between
background and analyte. To reproduce sample-specific variations in PTFE
background and analyte structures, we develop a nonparametric, adaptive
model: interpolation based on smoothing splines regulated by the roughness
parameter. While referring to qualitative properties of the baseline (such as
smoothness), the goal is to learn the baseline structure in the background
region to predict the baseline structure in the analyte region. In Sect. we evaluate the model both at the physical and application
layers. We establish the initial model feasibility by using near-zero blank
absorbance and non-negative analyte absorbance as our physical criteria.
Further, by comparing smoothing splines baseline (SSB)-corrected spectra with
polynomial baseline (PB)-corrected spectra via three different applications,
(1) visual and clustering analysis, (2) functional group quantification, and (3) organic
and elemental carbon prediction, we are able to discern which
variations in quantities obtained from SSB-corrected spectra are due to
inherent variations already present and which are added due to the new
baseline approximation. We close with a summary of the baseline correction
procedure extendible to the fingerprint region or spectra acquired on other
substrates in Sect. .
Methods
Section 2.1 introduces smoothing splines in the context of FT-IR signal.
Sections 2.2 and 2.3 detail the modeling protocol, including formalizing
bounds for analyte and background regions and selecting smoothing parameters.
Sections 2.4 and 2.5 describe the data set and applications we used for
smoothing splines model evaluation.
Smoothing splines model description
For the sake of clarity, Table summarizes notation for
commonly used variables pertaining to specific categories in implementing the
smoothing splines model. The proposed interpolation method uses smoothing
splines, a popular nonparametric regression technique, which has been
applied in different steps in spectral signal analysis: data exploration,
model building, testing parametric models, and diagnosis. Their
expression is obtained by minimizing the following two-part objective
function:
$$\underset{\hat{y}}{\mathrm{minimize}} \; \sum_{j=1}^{n} w_j \left(y_j - \hat{y}_j\right)^2 + \lambda \int_a^b \left(\hat{y}''(x)\right)^2 \, dx,$$
where $w_j$ is the weight at wavenumber $j$, $y_j$ and $\hat{y}_j$ are the observed and
fitted absorbances at wavenumber $j$, and $\lambda$ is a smoothing penalty.
Minimizing this criterion over the entire spectrum leads to a unique
solution, which is a natural cubic spline with knots at the unique values of
the wavenumbers $x_j$ for $j = 1, 2, \ldots, n$. The
explicit solution in form of the natural spline eliminates the knot selection
problem without leading to over-parameterization due to the smoothing penalty
constraint. The advantage of smoothing splines is their capacity to operate
both locally, through w representations for each wavenumber j, and
globally, through a single λ representation over the entire
wavenumber domain.
The first, least squares term, $\sum_{j=1}^{n} w_j (y_j - \hat{y}_j)^2$,
represents the similarity measure consisting of the squared distance between
observed absorbance values and interpolating function values. The advantage
of locally moderated weights lies in allowing us to choose whether absorbance
at a particular wavenumber j should be included in determining the fitted
baseline. We define weights as follows. Let us decompose the original
spectral signal into a two-component mixture:
$$y_j = \begin{cases} B_j + A_j & \text{if } j \in W_A \text{ (analyte region)} \\ B_j & \text{if } j \in W_B \text{ (background region).} \end{cases}$$
Here $B_j$ denotes the background component comprising baseline, noise,
and, if present, any remaining local, high-frequency interference.
$A_j$ denotes the analyte component, $W_A$
denotes the set of wavenumbers with analyte absorbance, and $W_B$
denotes the set of wavenumbers without analyte absorbance. We then wish to
select observations that represent solely the background component and
exclude those that contain the analyte contribution:
$$w_j = \begin{cases} 0 & \text{if } j \in W_A \\ 1 & \text{if } j \in W_B. \end{cases}$$
Other conceptually analogous variants for determining weights exist. Some
researchers define weights as posterior probabilities from mixture models.
Others use curve fitting with asymmetric weights.
While differing in the
requirement of a priori knowledge on the assignment of observations to
different components, all frameworks, including ours, propose that greater
weight is given to those observations representing the background only, and
smaller or no weight is given to those containing contribution from analyte
peaks. Therefore, the aim of the least squares term is to extract the
structural information from the neighboring background regions to infer the
baseline structure in the analyte region.
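The binary weighting above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `background_weights`, the coarse 20 cm-1 grid, and the choice to treat the boundary wavenumbers themselves as background are all assumptions made for the example.

```python
def background_weights(wavenumbers, analyte_bounds):
    """Return w_j = 0 strictly inside the analyte region and w_j = 1 elsewhere.

    Assumption: the two boundary wavenumbers count as background points.
    """
    upper, lower = max(analyte_bounds), min(analyte_bounds)
    return [0 if lower < nu < upper else 1 for nu in wavenumbers]


# Segment 1 (4000-1820 cm-1) on a hypothetical coarse grid,
# with example bounds W1 = 3360 cm-1 and W2 = 2220 cm-1.
grid = list(range(4000, 1800, -20))
weights = background_weights(grid, (3360, 2220))
```

Only the observations with weight 1 then contribute to the least squares term, so the fitted curve inside the analyte region is determined by the neighboring background and the smoothness penalty.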
The second term of the objective criterion, $\lambda \int_a^b (\hat{y}''(x))^2 \, dx$, is a regularization term. It constrains $\hat{y}$
to vary smoothly on a global level. Overall, the objective function trades
off fit to the spectral data against smoothness via the tuning parameter
$\lambda$. For smaller values of $\lambda$ more weight is given to fitting
the squared error term of the criterion. When $\lambda = 0$ the unique
minimizer is a natural cubic spline, which interpolates the original
response, $y_j$. Conversely, for greater values of $\lambda$ more weight is
given to keeping the curvature small. When $\lambda \to \infty$,
the unique minimizer is a straight line (the weighted least squares linear fit), which has zero curvature and two free parameters. A spectrum of $\lambda$
values ranging from 0 to $\infty$ generates a family of models, from
interpolation to the parametric linear model.
When faced with a problem of how much smoothing should be applied to fit the
spectral data on hand, effective degrees of freedom (EDF) represents a more
physically interpretable metric to parameterize the regularization of the
smoothing spline than $\lambda$. Consider writing the
$n$-vector of fitted values, $\hat{y}$, as
$$\hat{y} = N \left(N^T N + \lambda \Omega_N\right)^{-1} N^T y = S_\lambda y.$$
Here $N$ denotes an $n \times n$ design matrix of the cubic spline
basis functions evaluated at the observed values $x_j$ and
$\{\Omega_N\}_{lm} = \int N_l''(x) N_m''(x) \, dx$. A linear
operator referred to as a smoother matrix, $S_\lambda$ differentially
shrinks the components of $y$ according to their alignment with the corresponding
basis functions. Consequently, the EDF of a smoothing spline is defined as
the trace of $S_\lambda$, which equals the sum of its eigenvalues:
$$\mathrm{EDF}_\lambda = \sum_{j=1}^{n} \{S_\lambda\}_{jj}.$$
EDF is bounded between 2 and $n$. If $\lambda = 0$, $S_\lambda$ becomes
the $n \times n$ identity matrix, and $\mathrm{EDF}_\lambda = n$. Conversely, if
$\lambda = \infty$, $S_\lambda$ becomes the projection matrix from
linear regression on $x$, and $\mathrm{EDF}_\lambda = 2$. The advantage of
reformulating the smoothing parameter in EDF over λ is that its span
is bounded and defined with respect to the number of wavenumbers in the
region we want to baseline-correct.
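The lower bound of 2 can be checked numerically. The sketch below assumes unit weights and uses the standard closed form for the diagonal of the projection (hat) matrix of a straight-line fit, $1/n + (x_j - \bar{x})^2 / \sum_k (x_k - \bar{x})^2$; the function name and the example grid are hypothetical.

```python
def linear_fit_edf(x):
    """Trace of the hat matrix of simple linear regression on x (unit weights)."""
    n = len(x)
    mean_x = sum(x) / n
    sxx = sum((xj - mean_x) ** 2 for xj in x)
    # Diagonal elements of the projection matrix for intercept + slope
    leverages = [1.0 / n + (xj - mean_x) ** 2 / sxx for xj in x]
    return sum(leverages)


# Hypothetical wavenumber grid; the trace does not depend on its spacing or length.
wavenumbers = [4000.0 - 1.3 * j for j in range(200)]
edf_limit = linear_fit_edf(wavenumbers)
```

The first terms sum to 1 and the second terms sum to 1, so the trace equals 2 regardless of how many wavenumbers the segment contains.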
If a desired (target) EDF is defined by a user, smoothing splines models are
usually fitted via the backfitting algorithm to search for the actual EDF
closest to the target. At convergence, the solution can be formulated as
$$\mathrm{EDF}_A = \underset{\lambda}{\arg\min} \left(\mathrm{EDF}_T - \sum_{j=1}^{n} \{S_\lambda\}_{jj}\right)^2,$$
where $\mathrm{EDF}_A$ represents the actual EDF determined from $\sum_{j=1}^{n} \{S_\lambda\}_{jj}$ (Eq. ), which minimizes the
departure from the target EDF, $\mathrm{EDF}_T$. The backfitting procedure is
implemented in the smooth.spline function of the R statistical package,
which we used to develop our baseline correction
model. Thus, the user-defined $\mathrm{EDF}_T$ will form a basis for model parameter
solutions from which the optimal parameter, $\mathrm{EDF}^*$, will be chosen
(Sect. 2.3).
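To make the search for $\lambda$ concrete, the following self-contained Python sketch matches a target EDF by bisection on $\log_{10}\lambda$. It is not the smooth.spline algorithm: as a stated simplification it swaps the cubic spline smoother for a discrete analog with a second-difference penalty on an evenly spaced grid and unit weights, $S_\lambda = (I + \lambda D^T D)^{-1}$, whose trace also runs from $n$ down to 2. All function names are hypothetical.

```python
def second_difference_penalty(n):
    """Return D^T D for the (n-2) x n second-difference operator D."""
    p = [[0.0] * n for _ in range(n)]
    for r in range(n - 2):
        cols, coef = (r, r + 1, r + 2), (1.0, -2.0, 1.0)
        for ci, vi in zip(cols, coef):
            for cj, vj in zip(cols, coef):
                p[ci][cj] += vi * vj
    return p


def trace_of_smoother(lam, penalty):
    """Trace of S = (I + lam * D^T D)^{-1}, via Gauss-Jordan elimination."""
    n = len(penalty)
    # Augmented matrix [I + lam * P | I]; the right half becomes the inverse.
    a = [[lam * penalty[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
         + [1.0 if i == j else 0.0 for j in range(n)]
         for i in range(n)]
    for col in range(n):
        pivot = a[col][col]  # positive because the matrix is symmetric positive definite
        a[col] = [v / pivot for v in a[col]]
        for row in range(n):
            if row != col and a[row][col] != 0.0:
                factor = a[row][col]
                a[row] = [rv - factor * cv for rv, cv in zip(a[row], a[col])]
    return sum(a[i][n + i] for i in range(n))


def match_target_edf(edf_target, n, tol=1e-6):
    """Bisect on log10(lambda) until the trace of the smoother matches edf_target."""
    penalty = second_difference_penalty(n)
    lo, hi = -10.0, 10.0  # assumed search interval for log10(lambda)
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        edf = trace_of_smoother(10.0 ** mid, penalty)
        if abs(edf - edf_target) < tol:
            break
        if edf > edf_target:  # fit too flexible: increase the penalty
            lo = mid
        else:                 # fit too stiff: decrease the penalty
            hi = mid
    return 10.0 ** mid, edf


lam, edf_actual = match_target_edf(edf_target=4.0, n=30)
```

Bisection works here because the trace decreases monotonically as $\lambda$ grows, so each target EDF between 2 and $n$ corresponds to a single penalty value.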
Comparison of key background modeling characteristics pertaining to the proposed and current models

| Characteristics | Proposed method | Current method |
|---|---|---|
| Functional form | Smoothing splines | Polynomial |
| Type | Nonparametric | Parametric |
| Representations | Global ($\mathrm{EDF}_T$) and local ($w_j$) | Global (nth degree of a polynomial) |
| Requires pre-scans? | No | Yes |
| Requires user's input? | No | For every scan |
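The relationship between FT-IR spectrum features and smoothing splines parameters to model those features

Spectrum characteristics: Segment, Region type, Wavenumber range. Model parameters: Type of modeled baseline, Weights, EDF.

| Segment | Region type | Wavenumber range (cm-1) | Type of modeled baseline | Weights | EDF |
|---|---|---|---|---|---|
| 1 | Background upper | [4000, $W_1$] | Fitted | $w_j = 1$ | $\mathrm{EDF}^*$ |
| | Analyte | [$W_1$, $W_2$] | Predicted | $w_j = 0$ | |
| | Background lower | [$W_2$, 1820] | Fitted | $w_j = 1$ | |
| 2 | Background upper | [2000, $W_3$] | Fitted | $w_j = 1$ | $\mathrm{EDF}^*$ |
| | Analyte | [$W_3$, $W_4$] | Predicted | $w_j = 0$ | |
| | Background lower | [$W_4$, $W_4 - 1(\Delta\tilde{\nu})$] (a) | Fitted | $w_j = 1$ | |

(a) The lower background region consists of a single wavenumber adjacent to $W_4$ (Sect. 2.2.1).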
Summarizing in Table , we argue that smoothing splines offer
a more adaptive and realistic basis for modeling PTFE variations than the
current method by combining local and global representations. We apply
smoothing splines to specific segments where each analyte region is
sandwiched by neighboring background regions containing a smoothly varying
baseline. As a result, each segment then contains an accurate basis for
baseline prediction in the analyte region using an optimal smoothing
parameter, EDF*.
FT-IR baseline correction protocol
Using the smoothing splines theory described above, we formalize the baseline
correction protocol in Table . The weights wj
from Eq. (), i.e., wj=0 in the analyte region
WA and wj=1 in the background region WB, are
determined by sample-specific bounds for analyte and background regions,
W1 to W4. Fig. illustrates a road map for our
protocol. In Step 1, we divide a raw spectrum into two segments.
Segment 1 includes the domain from 4000 to 1820 cm-1, to capture the maximum
extent of the background regions surrounding the first analyte region.
Segment 2 includes the domain from 2000 to 1500 cm-1 and captures a sufficient
extent of background regions surrounding the second analyte region.
We set W2 to 2220 cm-1, which universally marks the start of the carbon
dioxide (CO2) absorbance band.
(1) Uncorrected spectrum partitioned into two segments: Segment 1,
4000–1820 cm-1 and Segment 2: 2000–1500 cm-1. (2) Transformed
segments with zero first and last absorbance values. (3) Upper panel: initial
baseline (gray), final baseline estimated iteratively via a non-negativity
constraint (red). Red vertical lines delineate background and analyte
regions: W1= 3360 cm-1 and W2= 2220 cm-1. Lower panel: final
baseline (blue). Blue vertical lines delineate background and analyte
regions: W3= 1820 cm-1 and W4= 1520 cm-1. (4) Resultant
corrected spectrum.
In Step 2, we perform a geometric transformation, which will be used to
determine and verify some of the bounds for analyte and background regions: W1 in
Segment 1 and W3 to W4 in Segment 2. As a linear operation, this geometric transformation
preserves the actual absorbance magnitudes. Let $a$ denote a vector of raw absorbances
corresponding to a segment selected in Step 1 illustrated in Fig. . First
we rotate $a_j$ about the point $a_1$ such that $a_1 = a_1^R = a_N^R$, where $a_j^R$ denotes
the rotated vector element and $R$ denotes the corresponding rotation matrix:
$$a^R = R a, \quad \text{where} \quad R = \begin{pmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{pmatrix} \quad \text{and} \quad \theta = \arctan\frac{\nu_N - \nu_1}{a_N - a_1}.$$
Second, we translate $a_j^R$ such that $a_1^* = a_N^* = 0$, where
$a_j^*$ denotes the resulting translated vector:
$$a^* = a^R - a_1^R.$$
Projecting raw absorbances on the local platform axis (a1=aN=0)
offers a valuable means of numerically representing a raw spectrum, without
appealing to underlying PTFE structural specification. The geometric
transformation is a key component in our protocol. First, it allows us to
analytically separate background from the analyte in W4 by determining a
local minimum. Second, it provides visually recognizable verification
valuable for further method developments, if need be (e.g., precise W1,
W3, and W4 are difficult to recognize in raw data in Fig. 1). For
instance, the concept is extendible to application developments for baseline
correction in the fingerprint region, which is outside the
scope of our current study.
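A minimal Python sketch of the rotate-then-translate idea follows. It is an illustration under stated assumptions rather than a reproduction of the equations above: the rotation angle is taken here as the inclination of the chord joining the first and last points relative to the wavenumber axis, only the rotated absorbance coordinate is returned, and the function name `level_segment` and the synthetic spectrum are hypothetical.

```python
import math


def level_segment(wavenumbers, absorbances):
    """Rotate about the first point so both endpoints share one level, then shift to zero."""
    nu1, a1 = wavenumbers[0], absorbances[0]
    slope = (absorbances[-1] - a1) / (wavenumbers[-1] - nu1)
    phi = math.atan(slope)  # assumed convention: inclination of the end-to-end chord
    rotated = [-math.sin(phi) * (nu - nu1) + math.cos(phi) * (a - a1) + a1
               for nu, a in zip(wavenumbers, absorbances)]
    # Translation: subtract the (common) endpoint level so a*_1 = a*_N = 0
    return [ar - rotated[0] for ar in rotated]


# Synthetic Segment 2: sloping line plus one absorption band centered at 1720 cm-1
nus = [2000.0 - 10.0 * j for j in range(51)]
raw = [0.05 + 2e-5 * (2000.0 - nu) + 0.02 * math.exp(-((nu - 1720.0) / 30.0) ** 2)
       for nu in nus]
leveled = level_segment(nus, raw)
```

After the transformation the two endpoints sit at zero, and a purely linear segment maps to zero everywhere, so local minima such as the one used for W4 become easier to locate.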
In Step 3, we determine specific bounds, W1 to W4, for analyte
and background regions, WA and WB. The benefits of
determining sample-specific W1 and W4 are twofold. First, certain
analytes may be absent from a complex aerosol mixture at hand, thereby
increasing WB. Second, higher loadings may lead to broader tails
of certain absorption profiles, thereby decreasing WB. Section 2.3.1 details a method to determine these bounds.
In Step 4, we subtract final baselines from transformed segments and
stitch the baseline-corrected segments together. In the overlapping region
between 2000 and 1820 cm-1, we use the mean absorbance in the final
result. The absorbance between the rightmost background region down to 1500 cm-1 is set to
zero.
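The stitching in Step 4 can be sketched as follows, assuming each corrected segment is stored as a mapping from wavenumber to absorbance on a shared grid. The function name, the `lowest_background` argument (the wavenumber of the rightmost background point), and the toy values are hypothetical.

```python
def stitch_segments(segment1, segment2, lowest_background):
    """Merge two baseline-corrected segments given as {wavenumber: absorbance} dicts."""
    merged = {}
    for nu in sorted(set(segment1) | set(segment2), reverse=True):
        values = [seg[nu] for seg in (segment1, segment2) if nu in seg]
        merged[nu] = sum(values) / len(values)  # mean where the segments overlap
        if nu < lowest_background:
            merged[nu] = 0.0  # below the rightmost background point: set to zero
    return merged


# Toy corrected segments that overlap at 2000, 1900, and 1820 cm-1
seg1 = {2000: 0.010, 1900: 0.004, 1820: 0.000}
seg2 = {2000: 0.000, 1900: 0.002, 1820: 0.000, 1700: 0.030, 1520: 0.001, 1500: 0.007}
spectrum = stitch_segments(seg1, seg2, lowest_background=1520)
```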
Selection of model parameters
The problem of selecting model parameters, W1–W4 and EDF, carries
key implications for the quality of fitted baselines. Our goal is to select
model parameters to reproduce the structure of sample-specific PTFE
variations while minimizing physically unrealistic FT-IR features, such as
negative absorbance from PM spectra or absorbance from blank spectra.
Referencing an extensive set of baseline-corrected ambient and blank samples
(described in Sect. ), we identify two common physical
expectations, to which generated baseline should conform: (1) non-negative
analyte absorbance and (2) near-zero blank absorbance.
Determining bounds for analyte and background regions
We determine W1 iteratively for each value of the smoothing parameter to
satisfy a non-negativity constraint near the boundaries. An initial
(conservative) estimate of W1= 3720 cm-1 is congruent with our
understanding of the absence of absorption bands over the subdomain between
4000 and 3720 cm-1; yet, smaller contributions from
certain functional groups, such as alcohol OH, increase the likelihood of
negative background absorbance if W1 remains underspecified. Therefore, we
begin with the initial estimate (gray baseline in Fig.
Step 3) and iteratively decrease W1 until the non-negativity constraint is
satisfied or until W1 reaches W2. We set W2 to 2220 cm-1, which
universally marks the start of the CO2 absorbance band.
Similarly, we set W3 to 1820 cm-1, which universally marks the start
of the carbonyl absorbance band observed in all PM samples.
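The iterative search for W1 might look like the loop below. It is a sketch under several assumptions: the baseline fit is passed in as a callable (`chord_baseline` is a toy stand-in for the smoothing spline, not the model itself), the non-negativity check is applied to every point of the analyte region, and the step size and small negative tolerance are hypothetical choices.

```python
def chord_baseline(wavenumbers, absorbances, weights):
    """Toy stand-in for the spline fit: a straight chord across the zero-weight region."""
    idx = [i for i, w in enumerate(weights) if w == 0]
    baseline = list(absorbances)
    if not idx:
        return baseline
    i0, i1 = idx[0] - 1, idx[-1] + 1  # background points bracketing the analyte region
    for i in idx:
        t = (wavenumbers[i] - wavenumbers[i0]) / (wavenumbers[i1] - wavenumbers[i0])
        baseline[i] = absorbances[i0] + t * (absorbances[i1] - absorbances[i0])
    return baseline


def find_w1(wavenumbers, absorbances, fit_baseline, w1_start=3720.0, w2=2220.0,
            step=20.0, tolerance=1e-6):
    """Decrease W1 until the corrected analyte region is non-negative or W1 nears W2."""
    w1 = w1_start
    while True:
        weights = [0 if w2 < nu < w1 else 1 for nu in wavenumbers]
        baseline = fit_baseline(wavenumbers, absorbances, weights)
        corrected = [a - b for nu, a, b in zip(wavenumbers, absorbances, baseline)
                     if w2 < nu < w1]
        if all(v >= -tolerance for v in corrected) or w1 - step <= w2:
            return w1, baseline
        w1 -= step


# Synthetic spectrum: baseline flat above 3400 cm-1 and sloping below it,
# plus an analyte band confined to roughly 2450-3350 cm-1.
nus = [4000.0 - 20.0 * j for j in range(110)]
raw = [0.10 + 5e-5 * max(0.0, 3400.0 - nu)
       + 0.05 * max(0.0, 1.0 - ((nu - 2900.0) / 450.0) ** 2) for nu in nus]
w1, baseline = find_w1(nus, raw, chord_baseline)
```

With the initial bound the chord overshoots the flat part of the baseline and produces negative corrected values, so the loop keeps extending the upper background region until the constraint holds.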
To accommodate the specifications of individual samples, $W_4$ is determined
as the wavenumber $\tilde{\nu}$ for which $a_j^*$ attains its minimum over the set of candidate
values between 1520 and 1600 cm-1:
$$W_4 = \underset{j}{\arg\min} \left\{ a_j^* : \tilde{\nu}_j \in [1520, 1600] \right\},$$
where $a_j^*$ are transformed absorbances from Step 2. To minimize the
interference from the neighboring alkane peak, which starts to absorb around 1510 cm-1, we limit the lower background region to a single
wavenumber adjacent to $W_4$, $W_4 - 1(\Delta\tilde{\nu})$.
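The selection of W4 reduces to a constrained minimum search, sketched below in Python. The function name and the toy values are hypothetical; ties, if any, are broken here in favor of the lower wavenumber, which is an arbitrary choice.

```python
def find_w4(wavenumbers, transformed, lower=1520.0, upper=1600.0):
    """Wavenumber at which the transformed absorbance is smallest within [lower, upper]."""
    candidates = [(a, nu) for nu, a in zip(wavenumbers, transformed)
                  if lower <= nu <= upper]
    return min(candidates)[1]


# Toy transformed absorbances; the value at 1500 cm-1 is lower still
# but lies outside the candidate window and is therefore ignored.
nus = [1620.0, 1600.0, 1580.0, 1560.0, 1540.0, 1520.0, 1500.0]
a_star = [0.020, 0.012, 0.004, 0.001, 0.003, 0.006, -0.010]
w4 = find_w4(nus, a_star)
```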
Selection of EDF
To parameterize the influence of EDF on the quality of fitted baselines via
the two expectations, we derive two EDF-optimizing metrics: (1) a negative
absorbance fraction for ambient samples and (2) total normalized absolute
blank absorbance for blank filters. We summarize the metrics in Table .
Relationship between fitted baseline characteristics as
a result of varying EDF and EDF-optimizing metrics to represent these characteristics

| Segment | Physical criterion | Sample type | Wavenumber range (cm-1) | Representation |
|---|---|---|---|---|
| 1 | (1) Near-zero blank absorbance | Blank | [4000, 2500], [2200, $W_2$] | Total normalized absolute blank absorbance, $\|a_B\|_1^*$ |
| | (2) Non-negative analyte absorbance | Ambient | [$W_1$, 2500] | Negative absorbance fraction, NAF |
| 2 | (1) Near-zero absorbance | Blank | [2000, 1500] | Total normalized absolute blank absorbance, $\|a_B\|_1^*$ |
| | (2) Non-negative analyte absorbance | Ambient | [$W_3$, $W_4$] | Negative absorbance fraction, NAF |
The negative absorbance fraction (NAF) represents the contribution of negative
analyte absorbance, $\|a_A^-\|_1$, to the total analyte
absorbance, $\|a_A\|_1$:
$$\mathrm{NAF} = \frac{\|a_A^-\|_1}{\|a_A\|_1} \times 100\,\%,$$
where $\|\cdot\|_1$ denotes the 1-norm magnitude of a vector
(summation of all absolute values of vector elements). NAF is calculated
across the entire wavenumber range in the analyte part of a given segment,
excluding the CO2 absorbance band.
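As a worked illustration of the NAF definition, the short Python function below takes the baseline-corrected absorbances of one analyte region (with the CO2 band assumed to be already removed by the caller). The function name and the example values are hypothetical.

```python
def negative_absorbance_fraction(analyte_absorbance):
    """NAF in percent: 1-norm of the negative part over the 1-norm of the whole vector."""
    total = sum(abs(v) for v in analyte_absorbance)
    negative = sum(-v for v in analyte_absorbance if v < 0)
    return 100.0 * negative / total if total > 0 else 0.0


# Negative part sums to 0.01 and the full 1-norm to 0.08, i.e., NAF of about 12.5 %
naf = negative_absorbance_fraction([0.04, 0.01, -0.005, 0.02, -0.005])
```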
Total normalized absolute blank absorbance, $\|a_B\|_1^*$, quantifies the model's departure from the true
result, zero absorbance, per wavenumber in a given segment. It is calculated
as a 1-norm magnitude of blank absorbances, $\|a_B\|_1$,
normalized by the number of wavenumbers in the corresponding wavenumber range
(Table ), $n_{\tilde{\nu}}$:
$$\|a_B\|_1^* = \frac{\|a_B\|_1}{n_{\tilde{\nu}}}.$$
$\|a_B\|_1^*$ is calculated across the entire
wavenumber range in a particular segment excluding the CO2 absorbance
band. We select $\mathrm{EDF}^*$ from a range of $\mathrm{EDF}_T$ by evaluating minima
from both $\|a_B\|_1^*$ and NAF. To that end,
Figs. and in
Sect. present a qualitative and quantitative
evaluation for varying EDFT together with EDF* selection.
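The blank metric can be sketched in the same way. The example below assumes the CO2 band spans 2220–2500 cm-1 and is excluded by wavenumber; the function name, the default band limits as arguments, and the toy blank spectrum are hypothetical.

```python
def normalized_blank_absorbance(wavenumbers, absorbances, excluded=(2220.0, 2500.0)):
    """Mean absolute absorbance of a corrected blank, skipping the excluded band."""
    kept = [abs(a) for nu, a in zip(wavenumbers, absorbances)
            if not (excluded[0] <= nu <= excluded[1])]
    return sum(kept) / len(kept)


# Toy corrected blank: the two large values fall inside the excluded CO2 band
nus = [2600.0, 2550.0, 2450.0, 2300.0, 2150.0, 2100.0]
blank = [0.002, -0.001, 0.050, 0.040, 0.000, -0.003]
metric = normalized_blank_absorbance(nus, blank)
```

A value near zero indicates that the fitted baseline reproduces the blank filter signal closely over the retained wavenumbers.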
54 randomly selected ambient samples (left) and 54 blank samples
(right) corrected by varying EDFT. Each spectrum is color-differentiated.
The CO2 absorption band between 2500 and 2220 cm-1 not associated with PM composition is shaded in color. The x axis ranges from 4000
to 1500 cm-1 in both left and right panels.
Median NAF in Segment 1 (a) and Segment 2 (b) calculated from 794
ambient samples (black points). Lower and upper bounds of shaded areas denote
3rd and 97th percentiles. Mean ‖aB‖1* for 2≤EDF≤12 in Segment 1 (c) and Segment 2 (d), calculated from 54 IMPROVE
2011 laboratory blank samples (black points). Shaded areas denote 3
standard deviations from the mean. In all panels, the black line is drawn to capture the
overall trend. While we select the interval 2≤ EDFT ≤12
specifically to highlight each metric's minima, we present results from the
entire interval 2≤ EDFT ≤n for completeness in Fig. S2.
Experimental data
We apply smoothing splines baseline correction to 794 particulate matter
(≤2.5 µm in diameter, PM2.5) samples collected on
PTFE filters and 54 blank PTFE filters. The particulate matter samples were
collected at IMPROVE sites on every third day in 2011. IMPROVE absorption
spectra have been used in previous studies, which detail the mechanics of FT-IR spectra
collection. More important for this study is the level of spectral
preparation applied prior to the background correction. Following the
practice established in we use unmodified
spectra in which values interpolated during the zero-filling process were
removed. Prior to applying the smoothing splines baseline, we truncate the
original wavenumber domain between 4000 and 420 cm-1 to capture the
subdomain between 4000 and 1500 cm-1 (1944 wavenumbers). As a reference,
the same subdomain is used in the polynomial method. In
contrast to , we do not apply smoothing to remove water
vapor and carbon dioxide interference, which keeps the number of preprocessing
steps to a minimum.
Applications for model evaluation
Cluster analysis
Cluster analysis with FT-IR measurements generates natural categories for PM
samples based on spectral similarity. These categories can represent mixture
classes of chemically complex aerosols, and their association with
meteorological and collocated measurements has been shown to provide
complementary information for source apportionment. For this purpose, each spectrum is SSB-corrected to isolate
the analyte contribution to the IR absorbance, normalized by its 2-norm
magnitude to emphasize variation in relative composition rather than absolute
concentration, and grouped according to the hierarchical clustering algorithm
of . There are inherent differences in the vapor artifacts
between the PB-corrected and SSB-corrected spectra that are not critical for
the algorithms used for quantification of functional groups, or TOR organic
and elemental carbon but influence clusters formed from the naïve
clustering approach described above. As the PB-corrected signal requires
differencing the IR spectrum of the PTFE before and after sample collection,
water vapor and CO2 signals remaining in the PB-corrected spectra
represent differences in concentrations present in the chamber during both
scans, whereas SSB-corrected spectra only contain the amount present in the
latter. Therefore, regions where these artifacts are present (ν̃>3600 cm-1 and ν̃<2400 cm-1 in Segment 1) are excluded from the
normalization and clustering, though some water vapor artifact overlapping
with analyte absorption remains in Segment 2. In addition, seven samples with
specific features or low signal-to-noise ratios are removed from the set
prior to clustering, as they are either not well discriminated by the algorithm
or influence the grouping of the rest of the spectra.
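The preprocessing applied to each spectrum before clustering can be sketched as follows. This is a minimal illustration, assuming a spectrum is stored as parallel lists of wavenumbers and SSB-corrected absorbances; the function names and toy values are hypothetical, and the default bounds simply restate the Segment 1 artifact regions given above (points above 3600 cm-1 and below 2400 cm-1 are dropped).

```python
import math

def mask_artifact_regions(wavenumbers, absorbances, lower=2400.0, upper=3600.0):
    """Keep only absorbances whose wavenumber lies in [lower, upper] (cm-1)."""
    return [a for w, a in zip(wavenumbers, absorbances) if lower <= w <= upper]

def normalize_2norm(values):
    """Scale a vector to unit Euclidean (2-norm) length."""
    norm = math.sqrt(sum(v * v for v in values))
    if norm == 0.0:
        raise ValueError("cannot normalize an all-zero spectrum")
    return [v / norm for v in values]

# Hypothetical toy spectrum (illustrative values only, not measured data)
wavenumbers = [3700.0, 3400.0, 2920.0, 2850.0, 2350.0]
absorbances = [0.010, 0.003, 0.004, 0.000, 0.020]

kept = mask_artifact_regions(wavenumbers, absorbances)
unit_spectrum = normalize_2norm(kept)
```

The unit-length vectors emphasize relative composition, so two samples with the same spectral shape but different loadings map to the same point before they are passed to the clustering step.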
Peak fitting
We apply the peak-fitting algorithm based on parameter
constraints described by to both SSB- and PB-corrected
spectra and evaluate the differences between two baseline correction methods
by comparing peak areas. Peak areas correspond to integrated absorbances from
line shapes fitted for alcohol COH, carboxylic COH, alkane CH, carbonyl CO,
and amine NH. We examine the comparability and implications of replacing the
PB correction approach with SSB correction in future analyses of this type.
Prediction of TOR organic carbon (OC) and elemental carbon (EC)
recently demonstrated that collocated PTFE
samples analyzed by FT-IR and quartz fiber filters analyzed by TOR can be
used to build calibration models that predict TOR-equivalent OC and EC
concentrations from new FT-IR spectra. One of several calibration models with
accuracy and precision on a par with TOR precision can be constructed when
the concentration range and composition of carbonaceous samples in the
calibration set approximately resemble those in the test (challenge) set. For
this work, we use an identical procedure as described by for building calibration and test sets from 794 IMPROVE 2011
samples chronologically stratified within each site. The spectra are SSB-corrected
and calibration and test samples are drawn to contain two-thirds
and one-third of the entire set, respectively. Only TOR OC and EC predictions
necessitate dividing the data set into calibration and test subsets; the
previous two applications, clustering and peak fitting, are applied to the
entire data set.
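One plausible reading of the chronologically stratified split is sketched below: within each site, samples are ordered by date and every third one is assigned to the test set, which gives roughly a two-thirds/one-third division. The exact procedure is defined in the cited work, so the rule used here, the tuple layout, and the function name are assumptions for illustration only.

```python
from collections import defaultdict

def split_calibration_test(samples):
    """Split (site, date, sample_id) tuples into calibration and test id lists.

    Assumed rule: within each site, sort by date and send every third
    sample to the test set; the rest go to the calibration set.
    """
    by_site = defaultdict(list)
    for site, date, sample_id in samples:
        by_site[site].append((date, sample_id))

    calibration, test = [], []
    for site in sorted(by_site):
        ordered = sorted(by_site[site])  # ISO dates sort chronologically
        for i, (_, sample_id) in enumerate(ordered):
            if i % 3 == 2:
                test.append(sample_id)
            else:
                calibration.append(sample_id)
    return calibration, test
```

Stratifying within each site keeps the seasonal and site-specific range of compositions similar in both subsets, which is the condition the calibration approach relies on.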
Results
At the physical level, we evaluate the feasibility of our model by selecting the
optimal smoothing parameters in Sect. 3.1 and by presenting the
sample-specific bounds for analyte and background regions in Sect. 3.2. At
the application level, we begin our evaluation of smoothing splines baseline-corrected spectra with visual and cluster analysis in Sect. 3.3, followed
by functional group quantification analysis in Sect. 3.4, and predicted TOR
OC and EC analysis in Sect. 3.5.
EDF selection
Qualitatively, in Fig. we compare the behavior
of PM (left panel) and blank samples (right panel) using varying EDFT (4,
5, 7, and 200, from top to bottom). In this analysis we used all 54 blank
samples and randomly sampled 54 out of 794 PM samples to keep the counts
equal and allow for representative cross-comparison. The trend from top to
bottom shows both PM and blank samples exhibit increasing sensitivity to the
amount of smoothing applied. With increasing EDFT, baseline-corrected
ambient spectra begin to exhibit negative analyte absorbance (left column).
Simultaneously, baseline-corrected blanks in the region at 3700–2500 and
1820–1600 cm-1 begin to depart from our target, zero absorbance (right
column).
Quantitatively, in Fig. we evaluate the
impact of EDFT on negative absorbance fraction metric, NAF, (top panel)
and total normalized absolute blank absorbance metric, ‖aB‖1*, (bottom panel) in segments 1 and 2 (left and
right panel). Horizontal panels share the same x axis and vertical panels
share the same y axis to allow for representative cross-comparison.
Therefore, each plot in the matrix in Fig.
corresponds to a unique condition in terms of a metric and segment. Starting
from Fig. a (top left), we find that any
EDFT between 2 and 4 minimizes median NAF and its variance simultaneously:
median NAF ≈ 0.0 % and variance = 0.44 %. Moving down to
Fig. c (bottom left), we look at the effect of
EDFT on blank absorbance in Segment 1. We find that any EDFT between
2 and 4 generates very low ‖aB‖1*: mean
‖aB‖1*= 3.42 ×10-4 and 3σ
(the extent of shaded areas) = 2.79 ×10-4. Technically, the minimum
variance in ‖aB‖1* occurs for EDFT=5
but the difference is less than 1.5 %. Of the two metrics, we prefer to
minimize NAF over ‖aB‖1* as NAF represents a
more robust metric (the sample size is an order of magnitude greater and in
future applications the choice of EDFT will likely affect
disproportionally more PM samples than blank samples). To finalize the choice
of EDFT from 2≤EDFT≤4, we now consider how these
EDFT values compare to EDFA obtained by the smoothing splines algorithm from Eq. (). We plot the distributions of EDFA given EDFT in
Segment 1 using all 794 PM samples and 54 blank samples in Fig. a and b.
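The two evaluation metrics discussed above can be sketched in a few lines. The formal definitions are given where the metrics are introduced; the formulas below are assumptions made for illustration: NAF is taken as the share of total absolute analyte absorbance contributed by negative values, and the normalized blank metric is taken as the 1-norm of the baseline-corrected blank divided by the number of wavenumbers. Function names are hypothetical.

```python
def negative_absorbance_fraction(analyte_absorbance):
    """Percent of total absolute absorbance that is negative (assumed definition)."""
    total = sum(abs(a) for a in analyte_absorbance)
    if total == 0.0:
        return 0.0
    negative = sum(-a for a in analyte_absorbance if a < 0)
    return 100.0 * negative / total

def normalized_blank_l1(blank_absorbance):
    """1-norm of a baseline-corrected blank per wavenumber (assumed normalization)."""
    return sum(abs(a) for a in blank_absorbance) / len(blank_absorbance)
```

Both quantities are zero for an ideal baseline: a PM spectrum with no negative analyte absorbance and a blank that is flat at zero after correction.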
Box-and-whisker plots representing distributions of EDFA for a
given EDFT used in Segment 1 (a, b) and Segment 2 (c, d) in both PM (n=794) and blank samples (n=54). Median and whiskers in each box-and-whisker
plot are highlighted in red.
The extensive number of knots to form bases for fitting splines (that is,
wavenumbers in observed absorbances used for fitting: xj for which wj≠ 0 from Eq. ) creates limitations on minimum
achievable EDF. This is particularly acute when EDFT is low (< 7 in
Segment 1 and < 3 in Segment 2). For instance, if we apply baselines with
EDFT= 4 in Segment 1 (Fig. a and b), the
distribution of EDFA will span between 4.9 and 6.1 depending on the
number of basis-forming knots (Fig. S3). However, applying baselines with
target EDF < 4 will lead to identical EDFA results, confirming that the
set of EDFA between 4.9 and 6.1 is indeed the minimum achievable EDF in
the search domain. Therefore, out of EDFT candidates for EDF* we
choose 4 as it represents the actual, true parameters most accurately; given
EDF*= 4 we obtain EDFA ∈ [4.9, 6.1] for PM samples and [4.9,
4.9] for blank samples.
In Fig. b (top right) we start by limiting
the evaluation in Segment 2 to EDFT for which NAF variance is less than
0.22 % (roughly a half of the value from the best-fit model in Segment 1).
This leaves us with 4≤EDFT≤7. Out of this subset, we find
selecting 4 as EDFT minimizes ‖aB‖1* in
Fig. d; mean ‖aB‖1*= 1.71×10-4, 3σ= 1.06×10-4. Additionally, and importantly, 4 represents the most parsimonious solution
that does not visually distort the blank baseline and shape of the PM peaks
(Fig. ). By selecting EDF*= EDFT= 4, the actual EDF parameters now match the target EDF parameter (Fig. c and d).
W1 and W4
Figure presents empirical cumulative distribution functions
of W1 and W4 from PM and blank samples. The distribution of W1 in PM
samples spans values between 3300 and 3710 cm-1, with 50 % of samples
having W1 > 3700 cm-1, reflecting sample-specific PM mixture composition
(illustration of spectra in Fig. a). W1 in
blank samples was determined to be 3710 cm-1 (Fig. b). Distribution of W4 in PM samples spans
values between 1520 and 1600 cm-1, reflecting sample-specific ammonium
absorbance width (Fig. a). W4 in blank
samples was determined to be 1600 cm-1, which is consistent with our
physical expectation about zero amine absorbance (Fig. b).
Empirical cumulative distribution functions representing
distributions of W1 and W4 in PM samples (n=794) in red and blank
samples (n=54) in blue.
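For readers who wish to reproduce this type of summary, an empirical cumulative distribution function for a set of sample-specific bounds can be computed as sketched below. The function names are hypothetical and any input values used with them are illustrative, not the measured W1 or W4 values.

```python
def ecdf(values):
    """Return sorted values and the cumulative fraction of samples at or below each."""
    xs = sorted(values)
    n = len(xs)
    return xs, [(i + 1) / n for i in range(n)]

def fraction_above(values, threshold):
    """Fraction of samples whose value strictly exceeds the threshold."""
    return sum(1 for v in values if v > threshold) / len(values)
```

A statement such as "50 % of samples have W1 above 3700 cm-1" corresponds to `fraction_above(w1_values, 3700.0)` returning 0.5, where `w1_values` is a hypothetical list of per-sample W1 bounds.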
Cluster analysis
The number of samples from SSB-corrected spectra not sharing the same
relative labeling as those from PB-corrected spectra varies with the total
number of clusters used to partition the spectra set. Figure S1 in the Supplement shows that
the discrepancy for 787 samples increases as the set is partitioned into a
larger number of clusters. The difference in sample labeling varies between
5 % for two clusters and 11 % for five clusters; the increase is observed for
a larger number of clusters because spectra are grouped according to finer
variations in their features. Feature (wavenumber) selection and advanced
algorithms can lead to more robust clustering that is less sensitive to small
variations in spectra , but visual comparisons of spectra
in the present form of aggregation can provide useful interpretations as
discussed below. The inter-cluster differences will further depend on the
number of clusters and the type of clustering algorithm. Since there is no
absolute reference for baseline-corrected spectra, these discrepancies speak
to the differences between two candidate methods.
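The discrepancy rate between two partitions can be illustrated with the sketch below. It assumes that "relative labeling" means agreement up to a renaming of clusters, so the labels of one partition are matched to the other by trying every permutation and keeping the best agreement; how labels were actually matched is not stated here, and the function name is hypothetical. Exhaustive search is practical only for the small cluster counts considered (two to five).

```python
from itertools import permutations

def label_discrepancy(labels_a, labels_b):
    """Smallest fraction of samples labeled differently, over all relabelings of labels_b."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both partitions must cover the same samples")
    names_a = sorted(set(labels_a))
    names_b = sorted(set(labels_b))
    if len(names_a) != len(names_b):
        raise ValueError("partitions must have the same number of clusters")

    best_matches = 0
    for perm in permutations(names_a):
        mapping = dict(zip(names_b, perm))  # candidate renaming of partition b
        matches = sum(1 for a, b in zip(labels_a, labels_b) if mapping[b] == a)
        best_matches = max(best_matches, matches)
    return 1.0 - best_matches / len(labels_a)
```

With this convention, two partitions that differ only in the names given to their clusters have a discrepancy of zero.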
Cluster membership for polynomial and smoothing splines methods. The
region between 2500 and 2200 cm-1 is masked to indicate the region of
CO2 absorption not associated with aerosol composition.
Figure shows spectra from the two baseline
correction algorithms grouped into categories using the approach described in
Sect. . Type I spectra are selected manually, and Types
II–V are determined by a four-cluster solution by hierarchical clustering
(with a discrepancy rate between PB and SSB of 10 %). Type I spectra
display low absorbance in the alcohol COH region, visible methylene paired
peaks (2920 and 2850 cm-1) from CH2 bonds present in vegetative
detritus , and the largest absorbance in the carbonyl CO
region (centered near 1700 cm-1) compared to the rest of the sample
spectra. This spectra type indicates a dominant contribution from biomass
burning aerosol spectra . These two samples
were collected in St. Marks, FL, during January and February; fire burning is
prescribed near this location during January through May of each year. Type
II spectra also contain sharp methylene peaks, but with stronger absorption
above 3100 cm-1 associated with alcohol COH and less pronounced carbonyl CO
absorption. Sixty percent of the 132 SSB-corrected spectra are found in
Phoenix, AZ, so this is interpreted to be associated with urban aerosol (we
note that Phoenix samples may be overrepresented in this spectra set as two
sampling sites out of the seven analyzed in this work are located in this
city). Similar features have been found in spectra from the urban environment
of Mexico City .
Type V contains spectra for which peaks near 3200–3100 cm-1 are most prominent,
indicating the significant presence of ammonium. These features have commonly
been reported in fossil fuel burning samples or factor analysis components
that have been
assigned by correlation with combustion tracers (e.g., V, Cr, Ni, Zn, As) and
back trajectory analyses. These aerosols presumably arise from a combination
of aged background aerosol and aerosols produced locally in the presence of
high oxidant concentrations of polluted environments .
However, 87 % of the 322 SSB-corrected Type V samples are found in the five
non-urban sites, suggesting that in this data set this spectroscopic
signature is more indicative of aged secondary aerosol. Ammonium
concentrations are often temporally correlated with oxidized organic aerosol
e.g., which increases in abundance toward
rural areas . Types III and IV share some combination of
features with types I, II, and V, with the ammonium peak near 3200 cm-1
more visible in type IV and larger contributions from methylene peaks visible
in type III. The peak near 3700 cm-1 present in several type IV spectra is
suggestive of phenolic compounds also present in biogenic aerosol
.
Integrated peak area corresponding to different functional groups
(a–e) from polynomial baseline and smoothing splines baseline-corrected
spectra. Slope magnitudes represent the slope of the regressed line. The silver
line represents a one to one line.
Predicted FT-IR OC vs. measured TOR OC using smoothing splines-corrected spectra for (a) calibration set (n=517) and (b) test set (n=268).
Predicted FT-IR EC vs. measured TOR EC using smoothing splines-corrected
spectra for (c) the calibration set (n=501) and (d) the test set (n=268).
This analysis demonstrates that the new SSB correction method can generate
spectra similar in profile to PB-corrected spectra used in past studies,
providing a basis for further mixture analysis.
MDL and precision for FT-IR OC and TOR OC.

| Carbon type | Metric | TOR | FT-IR, raw spectra d | FT-IR, PB-corrected spectra d | FT-IR, SSB-corrected spectra |
|---|---|---|---|---|---|
| OC | MDL (µg m-3) b, c | 0.05 | 0.14, [0.11, 0.28] | 0.11, [0.08, 0.17] | 0.06, [0.04, 0.09] |
| OC | % below MDL | 1.5 | 2.6 | 0.7 | 0.0 |
| OC | Precision (µg m-3) b | 0.14 | 0.12 | 0.21 | 0.06 |
| OC | Mean blank (µg) | NR e | 0.1 ± 1.5 | 1.9 ± 1.2 | 0.1 ± 0.6 |
| EC | MDL (µg m-3) b, c | 0.01 | 0.02, [0.01, 0.02] | 0.01, [0.00, 0.01] | 0.01, [0.01, 0.02] |
| EC | % below MDL | 3 | 1 | 2 | 1 |
| EC | Precision (µg m-3) b | 0.11 | 0.04 | 0.06 | 0.06 |
| EC | Mean blank (µg) | NR e | 0.06 ± 0.17 | 0.08 ± 0.15 | 0.01 ± 0.12 |

b Concentration units of µg m-3 for MDL and precision are based on the IMPROVE volume of 32.8 m3.
c Numbers inside the interval denote 95 % confidence intervals on the estimate.
d .
e Not reported.
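The relation between the blank statistics and the MDL values in the table can be checked with a short calculation. The sketch assumes that the ± values on the mean blank are standard deviations and that the MDL is three times the blank standard deviation divided by the sampled volume; this is a common convention, but it is not stated in this section, so both the factor of 3 and the function name are assumptions. The volume of 32.8 m3 is taken from footnote b.

```python
IMPROVE_VOLUME_M3 = 32.8  # sampled air volume from table footnote b

def mdl_from_blank_sd(blank_sd_ug, volume_m3=IMPROVE_VOLUME_M3, k=3.0):
    """MDL in µg m-3 as k times the blank standard deviation (µg) per unit volume.

    k = 3 is an assumed convention, not a value given in the text.
    """
    return k * blank_sd_ug / volume_m3

# Blank standard deviations (µg) as read from the OC rows of the table
for label, sd in [("raw", 1.5), ("PB", 1.2), ("SSB", 0.6)]:
    print(label, round(mdl_from_blank_sd(sd), 3))
```

Under these assumptions the raw and PB-corrected OC values reproduce the tabulated MDLs after rounding, and the SSB-corrected value agrees to within the rounding of the reported standard deviation.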
Peak fitting analysis
Figure presents integrated absorbances for alcohol
COH, carboxylic COH, alkane CH, carbonyl CO, and amine NH quantified from PB
and SSB-corrected spectra. For all functional groups but carboxylic COH the
discrepancy between the two methods is < 10 % (the slope of the regressed line
is within 1 ± 0.1). The difference is on the same order of magnitude as the cluster
discrepancy rate. The bias in carboxylic COH fitting is likely due to the
fact that its line shape was fixed specifically to the PB-corrected spectra
, and is more sensitive to the absorption profile to
which it is fitted than the Gaussian peaks with adjustable parameters used
for fitting other functional groups. The bias may be alleviated by
rederiving the carboxylic COH line shape for the smoothing splines method, or
applying an adjusted molar absorption coefficient. The bias of 13 % is on the
order of variation in absorption coefficients of carboxylic COH estimated for
different organic acid compounds, and also within uncertainty for an
absorption coefficient estimated from the mean of these values
.
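The comparison of peak areas by regression slope can be sketched as follows. This is a minimal ordinary least-squares fit with an intercept; whether the published regression includes an intercept is not stated here, so that choice and the function name are assumptions.

```python
def ols_slope_intercept(x, y):
    """Ordinary least-squares slope and intercept of y regressed on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0.0:
        raise ValueError("x values must not all be identical")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def relative_discrepancy(slope):
    """Departure of the slope from the one-to-one line, in percent."""
    return 100.0 * abs(slope - 1.0)
```

With PB-derived areas as `x` and SSB-derived areas as `y`, a slope within 1 ± 0.1 corresponds to a discrepancy below 10 % between the two baseline correction methods.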
Prediction of TOR organic and elemental carbon
Figure presents performance metrics from TOR OC and
TOR EC predictions obtained from SSB-corrected spectra. All fits are
characterized by high coefficients of determination (R2 ≥ 0.94) and
near-zero bias (≤ 0.01 µg m-3), demonstrating
accurate predictions. With respect to predicted TOR OC, performance metrics
from the test set (Fig. b) are on a par with those
obtained from raw spectra and PB-corrected spectra. Specifically, error
(0.09 µg m-3) and normalized error (10 %) are on the same order
as those obtained from raw spectra (error of 0.08 µg m-3,
normalized error of 11 %) and PB-corrected spectra (error of 0.08 µg m-3, normalized error of 12 %) .
In Table we show that applying SSB leads to a
lower minimum detection limit (MDL) of 0.06 µg m-3, which leaves
no samples below MDL. This is statistically different from the no baseline
case, where MDL is 0.14 µg m-3. Precision (0.06 µg m-3) obtained from
SSB-corrected spectra is on the same order as that obtained from raw
(0.12 µg m-3) or PB-corrected spectra (0.21 µg m-3).
Likewise, TOR EC performance metrics from the test set (Fig. d) are on a par with those obtained from raw spectra
and PB-corrected spectra. Specifically, error (0.04 µg m-3) and normalized error (27 %) are on the same order as those obtained
from raw spectra (error of 0.02 µg m-3, normalized error of 21 %) and PB-corrected spectra (error of 0.04 µg m-3,
normalized error of 24 %) . Table
shows that MDL (0.01 µg m-3) obtained from SSB-corrected spectra is similar to MDL obtained from
raw or PB-corrected spectra (all ≤ 0.02 µg m-3).
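The performance metrics quoted above can be computed from paired predicted and measured concentrations as sketched below. The definitions are assumptions, since they are not restated in this section: bias is taken as the median difference, error as the median absolute difference, normalized error as the median of the absolute difference relative to the measured value, and R2 as the squared Pearson correlation. The function name is hypothetical.

```python
from statistics import median

def prediction_metrics(measured, predicted):
    """Return bias, error, normalized error (%) and R2 under the assumed definitions."""
    diffs = [p - m for m, p in zip(measured, predicted)]
    bias = median(diffs)
    error = median(abs(d) for d in diffs)
    normalized_error = 100.0 * median(abs(d) / m for d, m in zip(diffs, measured))

    n = len(measured)
    mean_m = sum(measured) / n
    mean_p = sum(predicted) / n
    cov = sum((m - mean_m) * (p - mean_p) for m, p in zip(measured, predicted))
    var_m = sum((m - mean_m) ** 2 for m in measured)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    r2 = cov * cov / (var_m * var_p)  # squared Pearson correlation

    return {"bias": bias, "error": error,
            "normalized_error": normalized_error, "r2": r2}
```

Median-based statistics are less sensitive to a few poorly predicted samples than mean-based ones, which is one reason they are a natural choice for comparing baseline treatments on the same test set.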
In summary, SSB-corrected spectra OC and EC predictions from blank and
ambient samples are as accurate and precise as those from raw or PB-corrected
spectra. No additional bias is introduced as a result of SSB correction
implementation. At the same time, the reduced complexity of baseline correction
makes the method amenable to scaling up to a large number of samples. To some extent, PLS
is a robust regression method and is able to effectively remove contributions
to the signal which are not related to the target analyte. While individual
predictions vary, we show in Fig. S4 that the quality of TOR OC and EC
predictions is not statistically affected by the choice of EDF between 2 and
30.
Conclusions
Within the past few years the guided polynomial baseline correction algorithm
has been applied to characterize ambient FT-IR spectra by classifying
mixtures ,
quantifying organic functional groups , and predicting
TOR OC and EC . Here our results demonstrate
that similar estimates (cluster discrepancy rate of 10 %, functional group
difference ≤ 13 %, and R2 ≥ 0.94, bias ≤ 0.01 µg m-3,
error ≤ 0.04 µg m-3 in TOR OC and EC
predictions) can be obtained using a new, automated baseline correction
protocol. Contrasting with the polynomial method, this paper detailed the
statistical framework, which applies nonparametric smoothing splines to
model sample-specific PTFE variations, reduces the number of free parameters
from four to one, and selects the parameter by minimizing two evaluation metrics:
negative analyte absorbance and blank absolute absorbance. The proposed
protocol unifies and simplifies many of the steps in existing techniques while
eliminating the need for expert intervention in manually adjusting background
regions specific to each sample. More importantly, the automated solution
allows us and future users to evaluate its analytical reproducibility while
minimizing reducible bias due to current default background regions or
variability in human judgement in adjusting these regions. The solution was
developed as a direct response to the growing body of research on statistical
applications for characterization of FT-IR atmospheric aerosol samples
collected on PTFE filters and a rising interest in analyzing FT-IR samples
collected by air quality monitoring networks. As a result, we anticipate that the
model will enable FT-IR researchers and data analysts to quickly and reliably
analyze a large amount of data. Although the exact reduction in user time may
be difficult to generalize due to high variability across different users, we
reason that the following approximation applies. Qualitatively, if N values are
considered for each free parameter in each method, then the amount of time
for expert examination of each model solution scales up with N^4 for the
polynomial method (due to four boundary points as free parameters) and N for
the smoothing splines method (due to 1 EDF parameter). Additionally, and importantly, the
evaluation metrics, which we established in this manuscript, have been shown
to sufficiently simplify the parameter selection process for users of any
level of experience.
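The scaling argument can be made concrete with a small calculation. The value N = 10 is hypothetical and chosen only for illustration; the counts of four and one free parameters are those stated above for the polynomial and smoothing splines methods.

```python
def candidate_solutions(n_values, n_free_parameters):
    """Number of parameter combinations on a full grid of candidate values."""
    return n_values ** n_free_parameters

n = 10  # hypothetical number of candidate values per free parameter
polynomial = candidate_solutions(n, 4)  # four boundary points
splines = candidate_solutions(n, 1)     # one EDF parameter
print(polynomial, splines, polynomial // splines)
```

Under this approximation, the number of candidate solutions to examine per sample differs by a factor of N^3 between the two methods.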
One of the important avenues for future research includes implementing
sample-specific EDF when the parameter choice affects model performance
significantly across samples. As Fig. 3 demonstrates, the individual
differences between EDFT 4 and 7 in Segment 1 are negligible; on the
whole these parameters do a very similar job in minimizing the undesirable
quantities (negative analyte absorbance and blank absorbance) in Fig. 4.
However, we anticipate that we and other FT-IR analysts may benefit from
sample-specific EDF when analyzing data sets collected under different
conditions, be it a different sampling flow rate or filter type. Another line
of future work may include extending this approach to the remaining part of
the mid-IR absorbance spectrum (1500–420 cm-1). The fingerprint region
contains important functional groups , such as organonitrates,
which can benefit from an adaptive baseline correction algorithm. As
demonstrated in this paper, the general strategy of (1) segmenting
baseline regions of interest such that they contain a smoothly varying (or
uniformly sloping) baseline, (2) using conservative estimates for
background regions, and (3) using FT-IR physical criteria (such as minimal
blank absorbance, non-negative analyte/background absorbance, and no baseline
discontinuities) for parameter selection can provide a good starting point
for these tasks.
The automated smoothing splines baseline correction method has been
implemented in the R package APRLssb and can be accessed at this repository:
https://bitbucket.org/stakahama/aprlssb by contacting the corresponding
author.