Introduction

AMT

Atmospheric Measurement Techniques

Atmos. Meas. Tech.

1867-8548

Copernicus GmbH

Göttingen, Germany

10.5194/amt-8-4979-2015

Evaluation of hierarchical agglomerative cluster analysis methods for discrimination of primary biological aerosol

Crawford

i.crawford@manchester.ac.uk

https://orcid.org/0000-0003-4433-7310

Ruske

https://orcid.org/0000-0002-2761-0832

Topping

D. O.

https://orcid.org/0000-0001-8247-9649

Gallagher

M. W.

https://orcid.org/0000-0002-4968-6088

1 Centre for Atmospheric Science, SEAES, University of Manchester, Manchester, UK

2 NCAS, National Centre for Atmospheric Science, University of Manchester, Manchester, UK

I. Crawford (i.crawford@manchester.ac.uk)

27 November 2015

8 (11), 4979–4991; 13 May 2015; 16 July 2015; 6 November 2015; 10 November 2015

This work is licensed under a Creative Commons Attribution 3.0 Unported License. To view a copy of this license, visit http://creativecommons.org/licenses/by/3.0/

This article is available from https://amt.copernicus.org/articles/8/4979/2015/amt-8-4979-2015.html

The full text article is available as a PDF file from https://amt.copernicus.org/articles/8/4979/2015/amt-8-4979-2015.pdf

In this paper we present improved methods for discriminating and quantifying primary biological aerosol particles (PBAPs) by applying hierarchical agglomerative cluster analysis to multi-parameter ultraviolet-light-induced fluorescence (UV-LIF) spectrometer data. The methods employed in this study can be applied to data sets in excess of 1 × 10⁶ points on a desktop computer, allowing for each fluorescent particle in a data set to be explicitly clustered. This reduces the potential for misattribution found in subsampling and comparative attribution methods used in previous approaches, improving our capacity to discriminate and quantify PBAP meta-classes. We evaluate the performance of several hierarchical agglomerative cluster analysis linkages and data normalisation methods using laboratory samples of known particle types and an ambient data set.

Fluorescent and non-fluorescent polystyrene latex spheres were sampled with a Wideband Integrated Bioaerosol Spectrometer (WIBS-4) where the optical size, asymmetry factor and fluorescent measurements were used as inputs to the analysis package. It was found that the Ward linkage with range or z-score normalisation performed best, correctly attributing 98.0 and 98.1 % of the data points respectively. The best-performing methods were applied to the BEACHON-RoMBAS (Bio–hydro–atmosphere interactions of Energy, Aerosols, Carbon, H2O, Organics and Nitrogen–Rocky Mountain Biogenic Aerosol Study) ambient data set, where it was found that the z-score and range normalisation methods yield similar results, with each method producing clusters representative of fungal spores and bacterial aerosol, consistent with previous results. The z-score result was compared to clusters generated with previous approaches (WIBS AnalysiS Program, WASP) where we observe that the subsampling and comparative attribution method employed by WASP results in the overestimation of the fungal spore concentration by a factor of 1.5 and the underestimation of bacterial aerosol concentration by a factor of 5. We suggest that this is likely due to errors arising from misattribution caused by poor centroid definition and from failure to assign particles to a cluster, both consequences of the subsampling and comparative attribution method employed by WASP. The methods used here allow for the entire fluorescent population of particles to be analysed, yielding an explicit cluster attribution for each particle and improving cluster centroid definition and our capacity to discriminate and quantify PBAP meta-classes compared to previous approaches.

Introduction

Microorganisms influence climate through their physical and chemical interactions with the atmosphere. Recently there has been renewed interest in how primary biological aerosol particles (PBAPs) interact with and modify clouds. It has been shown that bacterial aerosol such as Pseudomonas syringae can act as ice nuclei (IN) at relatively warm temperatures, which even in low concentrations can cause rapid cloud glaciation via the Hallett–Mossop process, leading to premature precipitation.

It has been hypothesised that a feedback cycle exists where PBAPs associated with plants influence the formation and modification of clouds through ice formation to induce precipitation, creating an environment which is beneficial for plant and microbial growth and thus stimulating further PBAP emission – this is known as the bioprecipitation hypothesis, and potential links between long-term regional climatology and PBAP emissions have recently been suggested. One of the key drivers for new research into bioprecipitation is a need for more accurate quantification of cloud evolution and precipitation in weather and climate models given its potential impact.

Bioaerosols are now being included as important components in global climate models . Recently bioaerosol emission models were tested on European regional scales using real-time Wideband Integrated Bioaerosol Spectrometer (WIBS-4) data collected at rural and semi-rural sites in Germany and Finland . Validation of these models is reliant on a very limited number of studies, and the authors highlight the difficulty of applying such models to e.g. urban environments and cite the general paucity of high-resolution atmospheric PBAP data to constrain model results. Providing such data is paramount to improving model predictions and accurately assessing the impact of PBAP emissions on environment and health. Retrieving such data is reliant upon the applicability of detection methods described in the following section.

The focus of this study is to evaluate hierarchical agglomerative cluster analysis methods applied to WIBS ultraviolet-light-induced fluorescence (UV-LIF) data sets for the discrimination of primary biological aerosol. In this paper we describe the detection method and data preparation procedures before evaluating the performance of several common hierarchical agglomerative cluster analysis linkages and data normalisation methods using laboratory and ambient data sets.

Detection methods

The detection, classification and quantification of PBAPs remain a significant multidisciplinary technical challenge. Conventional techniques can be split into culturing and non-culturing techniques, both of which require the collection of particles onto a medium for offline analysis. Culturing techniques collect particles of interest onto a growth medium which is incubated for hours to days. The grown colonies are then counted microscopically, providing species identification but not quantification of their atmospheric concentration, making the technique unsuitable for estimating PBAP emissions . Non-culturing techniques collect particles onto filters or in a liquid suspension, which is more suitable for estimating atmospheric concentrations but is not typically used for classification . The major limiting factors of non-culturing techniques are that they are labour intensive, require long sampling periods and suffer from impactor sampling artefacts (e.g. particle fragmentation, obscuration), leading to erroneous enumeration. This makes it difficult to study emissions at the process level as some PBAPs, such as fungal spores and bacteria, display large diurnal variations with significant short-term episodic emissions, which would require an impractical number of samples to capture reliably. PBAPs including bacteria can undergo substantial instantaneous spikes in emissions compared to their baseline state in response to rainfall . These rapid emissions are important not only to capture peak concentrations but also to derive emission factors accurately and understand the underlying mechanisms.

UV-LIF spectrometers have become available which show early promise of classifying and quantifying bioaerosols by broad taxonomic class on a single-particle basis. The WIBS instrument used here is based on technology developed by the University of Hertfordshire Centre for Atmospheric and Instrumentation Research (CAIR). A full technical description of the WIBS instrument is given later in this manuscript. UV-LIF spectrometers work on the principle that PBAPs contain biofluorophores such as NAD(P)H, riboflavin and tryptophan, which auto-fluoresce when excited with UV radiation; the excitation and detection bands of the WIBS are optimised to detect these common biofluorophores. The single-particle, online nature of the technique yields far superior time resolution to the offline techniques discussed earlier, making it ideally suited to measuring PBAPs in a rapidly changing environment. The time resolution is limited by the counting statistics, with integration periods of typically 1–5 min providing adequate sensitivity depending on ambient concentrations. This allows for better measurements of PBAP fluxes, which would be difficult using traditional offline methods.

Whilst UV-LIF spectrometers offer many advantages over traditional methods, discriminating between different bioaerosol classes and possible, non-biological fluorescent interferents remains an ongoing area of research . At present, UV-LIF spectrometers lack a common absolute reference standard, making comparison of measurements made between instruments difficult. Furthermore the lack of a calibration standard has impeded attempts to characterise PBAPs of interest which would greatly simplify classification by the utilisation of supervised learning techniques. In lieu of an absolute calibration method other techniques must be used to segregate particles by type when interpreting uncalibrated data sets.

WIBS UV-LIF instrumentation

A full technical description of the original WIBS measurement principles and its development is given by , , and . In the versions of the instrument used here ambient air is sampled at 2.38 L min⁻¹, with 10 % of the total as aerosol flow drawn through a 1.2 mm (inner diameter) tube to generate a single in-line aerosol beam intersecting a well-defined optical sensing region. The remainder of the flow is filtered and used as a sheath flow to stabilise the aerosol beam and minimise possible detrainment contamination of the optical surfaces within the scattering chamber. Single particles passing through the sensing region intercept a 635 nm diode laser beam, and the elastically scattered forward and sideways intensity is measured. A lookup table based on a standard Mie scattering model is used to convert the forward-scatter / side-scatter intensity ratio to optical diameter based on the instrument's response to NIST calibration polystyrene latex (PSL) spheres. The WIBS utilises a quadrant detector to measure the scattered intensity. The signal from each component quadrant is used to calculate an “average” optical diameter over the four scattering solid angles. In addition the standard deviation between the four signal intensities is used to provide a particle asymmetry factor (AF) as a proxy of particle morphology. AF is reported in arbitrary units (a.u.) and is based on measurements with calibration particles with different aspect ratios; corn starch flour was used to represent irregular particles, and ellipsoidal haematite particles were used as an analogue for rod-like bacterial particles as described in . AF ranges from 8 to 10 for nearly spherical particles and from 20 to 100 for rod- or fibre-like particles. The detectable particle “average optical diameter” range for WIBS-4 is 0.5 < Do < 20 µm, with 50 % detection at Dp50 = 0.8 µm. The WIBS size range is optimised to sample most airborne bacteria and fungal spores, but only very small pollen.
Following initial particle detection and sizing, two optically filtered xenon flash lamps are sequentially triggered, providing excitation wavelengths centred at 280 ± 10 and 370 ± 20 nm. The fluorescence emission is collected by two spherical mirrors and split into two channels using a dichroic filter at 410 nm before being measured by two photomultiplier tubes (PMTs).

Both PMTs record fluorescence during the 280 nm excitation phase because no detection bands overlap the excitation band; however only the 410–650 nm PMT detector is active during the 370 nm excitation. In subsequent discussions herein the three fluorescent channels will be referred to as FL1 (fluorescence between 300 and 400 nm, following excitation at 280 nm), FL2 (fluorescence between 410 and 650 nm, following 280 nm excitation) and FL3 (fluorescence between 410 and 650 nm, following excitation at 370 nm). The autofluorescence arising from the 280 nm excitation in biological material is influenced heavily by proteins and the bio-molecule tryptophan, whereas fluorescence from 370 nm excitation is influenced by riboflavin and co-enzyme NAD(P)H . However, fluorescence emission spectra are inherently broad, and interrogating complex microorganisms and micron-sized particles results in a complex mixture of fluorescence emission peaks from many fluorophores that can be difficult to interpret unambiguously .

Hierarchical cluster methods

Hierarchical agglomerative cluster analysis (HCA) has been demonstrated to be a powerful tool to classify particles; however, the available analysis toolkits are limited by heavy computational burdens, making the analysis of large data sets problematic. In HCA each data point is initially in its own single-membered cluster. The clusters are sequentially combined into larger multi-membered clusters until all data points are in one large cluster at the end of the process. At each step the two clusters separated by the shortest distance are combined, where the inter-cluster distance is determined by the linkage algorithm. In this study we trialled several common linkages, which are now described:

Single: the distance between two clusters is defined as the minimum distance between any single data point in the first cluster and any single point in the second cluster.

Complete: the distance between two clusters is defined as the maximum distance between any single data point in the first cluster and any single point in the second cluster.

Average (unweighted average distance): the distance between two clusters is defined as the average distance between all data points in the first cluster and all data points in the second cluster. The weight of each cluster is proportional to the cluster size.

Weighted (weighted average distance): similar to average, but the weight of each cluster is identical irrespective of size.

Ward: this linkage is a special case where the clusters to be merged are determined by finding the pair of clusters which yields the minimum increase in total within-cluster variance after merging, rather than by minimum distance between clusters.

Centroid: the distance between clusters is defined as the distance between the centres (mean vectors) of clusters.

Median: the distance between two clusters is iteratively defined as the distance between the cluster midpoints. Here the midpoint is defined as the point itself in a singleton cluster or the average of the midpoints of the clusters to be merged.

A full mathematical description of these linkages is provided in .
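The linkage definitions above can be illustrated with a short, naive Python sketch. It is not the fastcluster implementation: the function names (`single`, `complete`, `average`, `centroid`, `ward_cost`, `agglomerate`) and the toy data are assumptions introduced here for illustration, the Ward criterion is written as the increase in within-cluster sum of squares, and the brute-force search over all cluster pairs is only practical for very small examples.

```python
import math
from itertools import combinations


def mean_vector(cluster):
    """Centre (mean vector) of a cluster given as a list of tuples."""
    n = len(cluster)
    return tuple(sum(p[i] for p in cluster) / n for i in range(len(cluster[0])))


def single(c1, c2):
    """Minimum distance between any point in c1 and any point in c2."""
    return min(math.dist(a, b) for a in c1 for b in c2)


def complete(c1, c2):
    """Maximum distance between any point in c1 and any point in c2."""
    return max(math.dist(a, b) for a in c1 for b in c2)


def average(c1, c2):
    """Mean distance over all pairs of points drawn from c1 and c2."""
    return sum(math.dist(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))


def centroid(c1, c2):
    """Distance between the mean vectors of c1 and c2."""
    return math.dist(mean_vector(c1), mean_vector(c2))


def ward_cost(c1, c2):
    """Increase in within-cluster sum of squares if c1 and c2 are merged."""
    n1, n2 = len(c1), len(c2)
    return n1 * n2 / (n1 + n2) * math.dist(mean_vector(c1), mean_vector(c2)) ** 2


def agglomerate(points, linkage, n_clusters):
    """Merge the closest pair of clusters until n_clusters remain."""
    clusters = [[p] for p in points]  # every point starts as its own cluster
    while len(clusters) > n_clusters:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]  # i < j, so deleting j is safe
        del clusters[j]
    return clusters


# Hypothetical toy data: two tight pairs and one isolated point.
toy = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (10.0, 0.0)]
print(agglomerate(toy, ward_cost, 3))
```

Swapping the `linkage` argument changes only how the distance between two clusters is scored; the merge loop itself is identical for every linkage.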

WASP

The WIBS AnalysiS Program (WASP; ) uses the average linkage clustering algorithm and is written in Igor Pro (WaveMetrics Inc., OR, USA). WASP performs HCA on a random subset of the data limited to a maximum of ≈ 1 × 10⁴ data points, with analysis taking around 4 h on a high-powered desktop computer (3.4 GHz quad core, eight-thread processor, 16 GB RAM, 64 bit OS). The number of clusters to retain is selected manually by inspection of several metrics, and the remaining data are attributed to the chosen clusters by comparison to the cluster centroids using a distance-based similarity method as described in . The authors noted that this comparative method can lead to systematic misattribution when less populous clusters form poorly defined centroids which do not reflect the true spread of the variables. They also noted that particles outside of a specified distance from a cluster centroid are left unclassified, potentially leading to an underprediction of cluster concentrations.

Fastcluster

In this manuscript we use open-source HCA methods which can analyse data sets in excess of 1 × 10⁶ points on a desktop computer. Subsampling and comparative attribution are not required as each data point is explicitly clustered. We also test the feasibility of using an automated method for determining the optimum number of clusters to retain.

In this study we have used the open-source Python package fastcluster, which features several common linkages. Of the included linkages the Ward, centroid and median linkages do not require the distances between data points to be stored in memory, allowing for memory-saving modes to be used, greatly increasing the maximum number of data points that can be analysed from approximately 7 × 10⁴ to in excess of 1 × 10⁶ points using the test computer described earlier. In order to take advantage of the memory-saving algorithms, the Euclidean distance metric must be used. The performance of the memory-saving linkages is assessed using laboratory-sampled particles of known type and ambient data previously analysed with WASP. In a future publication we will assess computational requirements in more detail, presenting results pertinent to “big-data” analysis depending on the amount of data retrieved during any given campaign.
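To see why avoiding stored pairwise distances matters at this scale, the memory needed for a full set of distances can be estimated directly. The sketch below assumes that one 8-byte double is stored for each of the N(N − 1)/2 point pairs; this is an assumption made here for illustration rather than a figure taken from the fastcluster documentation, and `condensed_matrix_bytes` is a hypothetical helper name.

```python
def condensed_matrix_bytes(n_points, bytes_per_value=8):
    """Bytes needed to hold one distance for each of the N(N-1)/2 point pairs."""
    return n_points * (n_points - 1) // 2 * bytes_per_value


for n in (10_000, 100_000, 1_000_000):
    gib = condensed_matrix_bytes(n) / 2**30
    print(f"N = {n:>9,d}: {gib:10.1f} GiB")
```

Under these assumptions the storage grows quadratically with the number of particles, whereas a method that works from the data array alone grows only linearly, which is consistent with the large increase in the feasible data set size described above.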

Schematic of procedure followed to generate cluster products from raw data. Relevant sections for each sub-procedure are labelled where appropriate.

Overview of analysis procedure

In this section we provide an overview of the procedure followed when applying hierarchical agglomerative cluster analysis to WIBS data (summarised in Fig. ):

1. load and quality assure data;

2. filter data:

   a. remove particles Dp < 0.8 µm;

   b. remove non-fluorescent particles;

   c. remove saturating particles;

3. normalise data;

4. cluster data;

5. validate cluster solutions;

6. generate cluster products.

These procedures are now discussed.

Data preparation

Prior to analysis it is necessary to prepare the single-particle data to ensure that they are physically meaningful, so that artefacts do not bias the cluster solutions or otherwise affect the performance of the cluster analysis.

The particle collection efficiency of the WIBS drops below 50 % at ∼ 0.8 µm. We have chosen to integrate number concentrations of particles > 0.8 µm rather than apply a correction factor to the concentrations below this size.

The baseline fluorescence of the instrument is measured during so-called forced trigger (FT) sampling periods where the instrument triggers the flash lamps and records the resultant fluorescence in the absence of aerosol in the sample volume. The WIBS-4 instrument automatically makes such measurements if measured concentrations are lower than 2 counts s⁻¹ for a sustained period of time, on the basis that the probability of a forced trigger measurement coinciding with a particle in the measurement region is then small. The mean fluorescence in an FT period is treated as the baseline fluorescence of the optical chamber during the sample period. For a particle to be considered fluorescent it must exhibit a fluorescence greater than a threshold value, defined as the baseline fluorescence plus 3 SDs (standard deviations). The analysis software subtracts this threshold value from the measured fluorescence of each sample, with all values greater than 0 being considered significantly fluorescent compared to the instrument baseline. Fluorescence measurements below the threshold (i.e. less than 0 after threshold subtraction) are not considered physically meaningful and are clipped at 0. This simplifies the segregation between fluorescent (FL) and non-fluorescent (non-FL) particles.

Sufficiently fluorescent particles (such as pollens) will saturate the PMT, and as such it is not possible to accurately measure their true fluorescence. Data from saturating particles are not physically meaningful and are excluded from analysis.
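The preparation steps described in this section can be sketched in a few lines of Python. This is a minimal illustration and not the analysis software used in the study: the record layout (dictionaries with `size`, `fl1`, `fl2` and `fl3` keys), the function names, the use of the sample standard deviation, the `saturation` value supplied by the caller and the rule that a particle is fluorescent if any channel exceeds its threshold are all assumptions made here.

```python
from statistics import mean, stdev

CHANNELS = ("fl1", "fl2", "fl3")


def fluorescence_threshold(forced_trigger_values, n_sd=3):
    """Baseline (forced-trigger mean) plus n_sd sample standard deviations."""
    return mean(forced_trigger_values) + n_sd * stdev(forced_trigger_values)


def prepare(particles, thresholds, saturation, min_size=0.8):
    """Return threshold-subtracted records that pass the three filters."""
    kept = []
    for p in particles:
        if p["size"] < min_size:
            continue  # below the 50 % detection size
        if any(p[ch] >= saturation for ch in CHANNELS):
            continue  # detector saturated, true fluorescence unknown
        # Subtract the per-channel threshold and clip negative values at 0.
        corrected = {ch: max(p[ch] - thresholds[ch], 0.0) for ch in CHANNELS}
        if all(v == 0.0 for v in corrected.values()):
            continue  # non-fluorescent in every channel
        kept.append({**p, **corrected})
    return kept
```

A caller would compute one threshold per channel from the forced-trigger data and pass the resulting dictionary as `thresholds`.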

Cluster validation indices

In order to remove the subjective nature of the method employed by to determine the optimum number of clusters to retain, we have chosen to use the Calinski–Harabasz criterion (CH), which is defined as being the ratio of the overall between-cluster variance to the overall within-cluster variance . We calculate the CH index for the first 20 cluster solutions and select the solution with the highest CH value as being the optimal solution.
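For reference, the CH index can be computed as sketched below, using the commonly quoted form CH = [B / (k − 1)] / [W / (N − k)], where B is the between-cluster sum of squares, W the within-cluster sum of squares, k the number of clusters and N the number of points. The function names are hypothetical and the code is an illustration of the criterion, not the implementation used in this study.

```python
import math


def _mean(points):
    """Mean vector of a list of equal-length tuples."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]


def calinski_harabasz(points, labels):
    """CH index of one cluster solution (undefined if W is zero)."""
    n = len(points)
    groups = {}
    for p, lab in zip(points, labels):
        groups.setdefault(lab, []).append(p)
    k = len(groups)
    if k < 2 or k >= n:
        raise ValueError("CH index needs 2 <= k < N clusters")
    overall = _mean(points)
    between = sum(
        len(g) * math.dist(_mean(g), overall) ** 2 for g in groups.values()
    )
    within = sum(
        math.dist(p, _mean(g)) ** 2 for g in groups.values() for p in g
    )
    return (between / (k - 1)) / (within / (n - k))


def best_solution(points, candidate_labelings):
    """Return the labelling with the highest CH index."""
    return max(candidate_labelings, key=lambda lab: calinski_harabasz(points, lab))
```

In practice `candidate_labelings` would hold the cluster assignments obtained by cutting the hierarchy at each candidate number of clusters.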

Data normalisation

In the study the prepared data were z-score-normalised prior to analysis. This was performed to minimise the effect of the different ranges of scale of each parameter biasing the clustering; i.e. the fluorescent intensities are of the scale 0–2092, size 0.8–20 and AF 0–100. We investigate the effect of normalisation on clustering performance using the following standard procedures:

No normalisation.

Subtract mean; divide by standard deviation (z-score). The mean value of the normalised distribution is 0, where the z-score value of a data point is the number of standard deviations from the mean. This can be positive or negative.

Standardise by range. Subtract minimum value; divide by the range (maximum minus minimum). Normalises data to new range of 0–1.

Divide by sum. Divide each of the variables by its sum. The sum of the normalised distribution is 1. Since our original data are positive, the normalised values will also be positive.

Rank. Replace each data point by its rank. The data under this normalisation will be integers from 1 to N, where N is the number of data points.

These are the procedures considered in but excluding procedures which produce identical results for the Euclidean metric. They concluded the range normalisation to be the best performing. We also considered the procedures proposed by , which were put forward as better-performing alternatives to those above. However, it seems unlikely that these procedures will scale in terms of performance to large data sets.
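The four non-trivial normalisations listed above can be written compactly for a single input variable, as in the following sketch. The function names are hypothetical, the z-score uses the population standard deviation, the range normalisation divides by the maximum minus the minimum so that the result spans 0–1, and tied values receive arbitrary consecutive ranks; these choices are assumptions made here, since the text does not specify them.

```python
from statistics import mean, pstdev


def z_score(x):
    """Subtract the mean and divide by the (population) standard deviation."""
    m, s = mean(x), pstdev(x)
    return [(v - m) / s for v in x]


def range_norm(x):
    """Subtract the minimum and divide by (max - min); result lies in 0-1."""
    lo, hi = min(x), max(x)
    return [(v - lo) / (hi - lo) for v in x]


def sum_norm(x):
    """Divide each value by the sum of the variable."""
    total = sum(x)
    return [v / total for v in x]


def rank_norm(x):
    """Replace each value by its rank, from 1 (smallest) to N (largest)."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    ranks = [0] * len(x)
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank
    return ranks
```

Each function would be applied separately to every input variable (FL1–3, size and AF) before the distances between particles are calculated.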

Properties of the polystyrene latex spheres sampled.

PSL   Size [µm]   Doping   Sample size
1     4.17        None     8927
2     3.1         Green    7976
3     2.2         Red      8942
4     2.1         Blue     8796
5     1           Green    5055

Data sets and data preparation

To assess the performance and suitability of the available clustering linkages, we first look at a laboratory data set of known particle types so that the cluster solutions can be compared to the known result. We then trial the best-performing methods on ambient data from the BEACHON-RoMBAS (Bio–hydro–atmosphere interactions of Energy, Aerosols, Carbon, H2O, Organics and Nitrogen–Rocky Mountain Biogenic Aerosol Study) experiment, which has been studied previously using similar methods . These data sets are now described in detail.

Fluorescent polystyrene latex spheres

To test the applicability and performance of the memory-efficient hierarchical agglomerative clustering linkages available in the Python package fastcluster, five different PSL spheres (manufactured by Polysciences Inc., PA, USA, and Duke Scientific Corp., CA, USA) were sampled using the WIBS-4. They were of different sizes, and four of them had been doped with a fluorescent coating. The properties of the tested PSLs are summarised in Table .

The three fluorescence measurements (FL1–3), size and asymmetry factor were chosen as inputs. The PSLs exhibit strong fluorescence, with some saturating the PMTs in multiple channels; as such we have chosen to keep saturating particles in the analysis to maximise sample size. Non-fluorescent particles were excluded, as were particles smaller than 0.8 µm owing to low collection efficiency. AF and size are typically log-normally distributed. In keeping with the analysis performed in and we convert these variables to log space prior to analysis. Because the memory-saving mode is used, analysis is limited to the Euclidean distance metric.

Performance of the different linkages and normalisation procedures for the full data set in terms of the percentage of data points placed into the same cluster as the known clustering. The best-performing normalisations for each linkage are marked with an asterisk.

Linkage    None      z-score   Range     Sum       Rank
Single     48.065*   24.384    48.065*   47.996    42.160
Complete   87.996    96.039*   87.531    85.126    82.390
Average    87.432    97.791*   87.406    65.772    96.990
Weighted   85.439    89.675*   64.843    82.798    65.056
Ward       72.606    98.136*   98.036    97.726    98.011
Centroid   87.423    97.264*   87.446    65.772    96.805
Median     82.361    80.575    82.974    84.912*   65.501

The regional BEACHON-RoMBAS experiment

The WIBS was deployed at the Manitou Experimental Forest Observatory (MEFO), located 35 km northwest of Colorado Springs, Colorado, USA, as part of the Rocky Mountain Biogenic Aerosol Study project (BEACHON-RoMBAS) during summer 2011. Details of the experiment and sampling arrangement are given in . In the study HCA was performed on a subset of the WIBS data (≈ 1 × 10⁴ particles) using the average linkage, with the remaining particles attributed to a cluster by comparison to the cluster centroid. Details of the attribution method and the process of selecting the number of clusters to retain are provided in . This analysis yielded clusters which were behaviourally consistent with fungal spores and bacteria. We analyse this data set using the methods described in this manuscript and compare the outcome to the earlier results.

Results

Fluorescent polystyrene latex spheres

The fastcluster package was run with the seven available linkages, each with the different normalisation procedures. Note that only the single, Ward, centroid and median linkages are available when the memory-efficient version of the fastcluster package is used.

We use the CH index to identify the “optimal” number of clusters and attempt to construct a best match between the desired clusters and proposed clusters. Then, to evaluate the performance of the algorithm, we calculate the proportion of the data points placed into the same cluster for both the desired and proposed clustering. The results are given in Table .
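One way to construct such a best match is sketched below. The text does not state how the matching was carried out, so the brute-force search over label permutations is an assumption made here (it is feasible for the small number of clusters involved), and `matched_accuracy` is a hypothetical function name.

```python
from collections import Counter
from itertools import permutations


def matched_accuracy(known, proposed):
    """Percentage of points in agreement under the best one-to-one matching."""
    counts = Counter(zip(known, proposed))  # contingency table of label pairs
    known_ids = sorted(set(known))
    proposed_ids = sorted(set(proposed))
    # Pad with None so that a known cluster may be left unmatched when the
    # proposed solution contains fewer clusters than the known one.
    padding = [None] * max(0, len(known_ids) - len(proposed_ids))
    candidates = proposed_ids + padding
    best = max(
        sum(counts[(k, p)] for k, p in zip(known_ids, perm))
        for perm in permutations(candidates, len(known_ids))
    )
    return 100.0 * best / len(known)


# Hypothetical labels: sample identity versus cluster number.
known = ["a", "a", "a", "b", "b", "c"]
proposed = [2, 2, 1, 1, 1, 0]
print(matched_accuracy(known, proposed))
```

Each known sample is paired with at most one proposed cluster, so a cluster that absorbs two samples can only be credited for one of them.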

For the full data set we can see that the z-score is the best-performing normalisation for all but the single and median linkages, where the performance is poorer across all normalisations.

However in Table we repeat the tests for varying sample size, where we see that as sample size decreases the range normalisation starts to outperform the z-score.

It appears that when using the full data set the z-score normalisation with either the Ward linkage or average linkage is the preferred option. When subsampling, however, we see that range normalisation may be better.

An explanation for this behaviour could be that range normalisation is sensitive to outliers, which we are much more likely to encounter in large samples, so we would expect it to perform better for smaller samples. By contrast, the z-score relies on estimates of the mean and standard deviation, which become more accurate with larger samples.

Performance of the Ward linkage for varying sample size.

Sample size   500      1000     5000     10 000   20 000
z-score       79.330   85.696   94.746   97.543   97.132
Range         95.664   97.671   98.041   98.065   98.074

Top panel: average FL1–3 detector intensities (blue, green and brown bars, left axis), size (diamond, right axis) and asymmetry factor (cross, right axis) for the five PSL samples. Middle and bottom panels: the same as for the top panel but for the Ward linkage solution centroids using range (middle) and z-score (bottom) normalisation.

Time series of PSL samples with data points coloured by cluster assignment for Ward linkage and range normalisation. Bottom panel shows the fraction of each cluster assigned to each sample with the most populated cluster annotated above.

Same as Fig. but for Ward linkage and z-score normalisation.

Same as Fig. but for BEACHON-RoMBAS ambient data.

Figure shows the cluster centroids for the Ward linkage with range and z-score normalisation. It can be seen that both methods yield similar clusters to the known solution; e.g. the average values of the 4.17 µm sample are accurately captured by the fifth cluster using range normalisation and the third cluster using z-score normalisation. Similarly the 3.1 µm green PSL sample is captured by the fourth range-normalised and first z-score-normalised clusters. Figures and show a time series of the FL1–3 and size input parameters (AF omitted from figure), which are colour-coded by the cluster assignments in Fig. . The bottom panel of each figure shows the fraction of each cluster assigned to each sample, where it can be seen that both normalisation methods achieve a high level of attribution accuracy, with a minimum of 96 % of data points being correctly attributed with no significant misattribution. The results of this experiment suggest that both range and z-score normalisation are appropriate when clustering WIBS data using the Ward linkage, with each yielding an optimal five-cluster solution correctly attributing 98 and 98.1 % of the data points respectively. The centroid linkage with z-score normalisation also performed well, correctly attributing 97.3 % of the particles into five significant clusters.

Time series of BEACHON-RoMBAS cluster concentrations using Ward linkage with range (top panel) and z-score (middle panel) normalisation as compared to the solutions obtained using WASP (bottom panel) for the period 00:00 MST (Mountain Standard Time) on 26 July 2011 to 12:00 MST on 28 July 2011. Clusters with similar centroids have been combined. See text for details.

Size distribution of BEACHON z-score-normalised clusters produced using the Ward linkage for the period 00:00 to 06:00 MST 27 July 2011.

Left panel: comparison of Ward linkage cluster concentrations using range and z-score normalisation for BEACHON-RoMBAS. Right panel: comparison of Ward linkage cluster concentrations (z-score normalisation) to WASP cluster concentrations.

BEACHON-RoMBAS

Data from the BEACHON-RoMBAS experiment (≈ 8.2 × 10⁵ fluorescent data points) were analysed using the Ward linkage with both range and z-score normalisations and also the centroid linkage with z-score normalisation. The centroid linkage yielded a solution with only one significantly populated cluster, suggesting that it is inappropriate for analysing ambient data. Figure shows the cluster centroids of each Ward normalisation, where the range normalisation yields a five-cluster solution and the z-score a four-cluster solution. It can be seen that the solutions of each are broadly similar; range cluster 4 (hereafter denoted R4) is similar to z-score cluster 1 (hereafter denoted Z1); R2 is similar to Z4. Additionally R1, R3 and R5 are similar to R2, suggesting that they are of similar origin, with the difference in fluorescence being due to size, morphology or particle age. This is also observed in the z-score result in clusters Z2, Z3 and Z4. A time series (not shown) of cluster concentrations shows these internally similar clusters to respond in a similar fashion to meteorological events such as rainfall. For ease of interpretation the concentrations of similar clusters have been combined. Figure shows a time series of the combined cluster concentrations for each method and also the cluster concentrations obtained using WASP. It can be seen that the concentrations of clusters R1 + R3 + R5, R2, Z2 + Z3 and Z4 all behave in a similar fashion to the WASP cluster C3, which was determined to be representative of bacteria owing to its strong positive response to rainfall. The response of clusters R4 and Z1 is similar to the WASP cluster B3, which was determined to be representative of fungal spores owing to its diurnal response to relative humidity. The size distributions for each of the z-score clusters (Fig. ) show the bacterial clusters to be small with sub-micron modes for clusters Z2 and Z3, which is consistent with the bacteria observed at the site, while the fungal cluster (Z1) mode is approximately 1.5–2 µm as might be expected. Caution must be taken when interpreting or assigning a bioaerosol meta-class to a cluster to avoid conflation of different particle types; e.g. emissions of some fungal spore species are positively correlated with rainfall, and such spores could be conflated into the bacterial meta-class in this case. Supporting measurements are needed to determine which species are present so that possible conflations can be identified and caveated appropriately.

Top panels: hourly average diurnal cycle of fungal cluster concentration for z-score normalisation (left panels) and WASP (right panels) over the period 27 July 2011–7 August 2011. Bottom panels: same as for top panels but for the bacterial clusters. Whiskers denote 5th and 95th percentiles. Mean value indicated by x marker. Note change in scale for bacterial panels.

Figure compares the concentrations of the similar clusters for each normalisation method. Comparison of R4 to Z1 (left panel, blue circles, representative of fungal spores) shows that each method yields similar concentrations. Comparison of the individual bacterial cluster concentrations yields poor correlation between methods when comparing the traces in Fig. (left panel, black diamonds and red squares); however, when the concentrations of all clusters representative of bacteria are combined (left panel, magenta crosses) the correlation is excellent ($N_{\mathrm{zscore}} = 1.00 \times N_{\mathrm{range}} - 1.42$, $R^{2} = 1$). This suggests that the major difference between the two normalisation methods is how particles of similar types are partitioned between the clusters.
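The partitioning argument can be illustrated with a short numerical sketch. The concentrations below are synthetic and the split fractions are hypothetical; the example only shows that if two methods divide the same particle population differently between clusters, individual clusters correlate poorly while the summed concentrations agree almost perfectly.

```python
import numpy as np


def linear_fit_r2(x, y):
    """Least-squares line y = slope * x + intercept, with its R^2."""
    slope, intercept = np.polyfit(x, y, 1)
    residuals = y - (slope * x + intercept)
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return slope, intercept, 1.0 - ss_res / ss_tot


rng = np.random.default_rng(1)

# Hypothetical total concentration of one particle type at 48 time steps.
total = rng.uniform(20.0, 200.0, size=48)

# Each method splits the same total between its own clusters using
# independent random fractions (four clusters for A, three for B).
split_a = rng.dirichlet(np.ones(4), size=48)
split_b = rng.dirichlet(np.ones(3), size=48)
conc_a = total[:, None] * split_a
conc_b = total[:, None] * split_b

# One cluster from each method: same origin, different partitioning.
_, _, r2_single = linear_fit_r2(conc_a[:, 0], conc_b[:, 0])

# Summed over all clusters of the same type.
slope, intercept, r2_sum = linear_fit_r2(conc_a.sum(axis=1), conc_b.sum(axis=1))

print(f"single cluster R^2 = {r2_single:.2f}")
print(f"summed clusters: slope = {slope:.2f}, R^2 = {r2_sum:.2f}")
```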

The right panel of Fig. compares the z-score concentrations to those obtained using WASP. The WASP fungal concentration is overestimated by a factor of approximately 1.5 compared to the z-score result (Z1 and B3, blue circles), and the WASP bacterial concentration is underestimated by a factor of approximately 5. Figure shows the hourly average diurnal cycles of the fungal (top panels) and bacterial (bottom panels) cluster concentrations for the z-score result (left panels) and WASP (right panels) over the period 27 July 2011–7 August 2011. Each method displays a similar trend, with the fungal clusters exhibiting a minimum during the day owing to the diurnal response of fungal spores to relative humidity and the bacterial clusters responding to the frequent afternoon rain storms. Again it can be seen that WASP overestimates the fungal concentration by a factor of approximately 1.5–2 and underestimates the bacterial concentration by a factor of 5–6 compared to the z-score result. The most likely explanation for the observed discrepancies between the WASP and z-score concentrations is the introduction of artefacts caused by the subsampling and comparative attribution methods used in WASP. In the fungal spore case, misattribution due to a poorly defined centroid can lead to an overestimation when compared to the new method, as observed here. WASP yields only one cluster representative of bacteria, while the z-score method yields three and the range method four. This results in WASP failing to attribute data points potentially representative of bacteria to its single bacterial cluster, leading to the observed underestimation when compared to the new method. WASP does not return diagnostic information about the cluster attribution; however, the summed concentration of WASP clusters B3, C3 and D3 accounts for only approximately 24 % of the fluorescent aerosol concentration, suggesting that many particles are left unattributed by WASP.
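A diurnal comparison of this kind (hourly mean with 5th and 95th percentiles, followed by a ratio between methods) could be computed along the following lines. The sketch uses synthetic hourly concentrations for a 12-day period. The factor of 1.5 between the two hypothetical methods is built into the test data for illustration and is not a reproduction of the measured result.

```python
import numpy as np


def diurnal_cycle(hour_of_day, concentration):
    """Return a (24, 3) array of mean, 5th and 95th percentile per hour."""
    stats = np.full((24, 3), np.nan)
    for hour in range(24):
        values = concentration[hour_of_day == hour]
        if values.size:
            stats[hour] = [
                values.mean(),
                np.percentile(values, 5),
                np.percentile(values, 95),
            ]
    return stats


# Twelve days of hourly samples; hour_of_day repeats 0..23.
hour_of_day = np.tile(np.arange(24), 12)
rng = np.random.default_rng(2)

# Synthetic concentration with a daytime minimum plus 10 % random scatter.
baseline = 10.0 + 5.0 * np.cos(2.0 * np.pi * hour_of_day / 24.0)
method_a = baseline * rng.uniform(0.9, 1.1, size=hour_of_day.size)

# Hypothetical second method that reports 1.5 times the first.
method_b = 1.5 * method_a

cycle_a = diurnal_cycle(hour_of_day, method_a)
cycle_b = diurnal_cycle(hour_of_day, method_b)

# Ratio of the hourly means, one value per hour of day.
ratio = cycle_b[:, 0] / cycle_a[:, 0]
print(ratio.round(2))
```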

Conclusions

Several hierarchical agglomerative cluster analysis linkages and normalisation methods were trialled using several laboratory samples of known particle type and a previously published ambient data set that had been analysed using similar methods. The Ward linkage with range and z-score normalisation was found to successfully resolve the five test PSL samples with a high level of accuracy, correctly attributing 98 and 98.1 % of the data points respectively. Analysis of the BEACHON-RoMBAS WIBS-3 data yielded similar results using the Ward linkage with the range and z-score normalisation methods. Each method produced one cluster representative of fungal spores and several clusters representative of bacterial aerosol, and both the fungal concentrations and the sums of the bacterial aerosol concentrations agreed well between methods. The BEACHON-RoMBAS results were compared to the WASP results for the same data set, where it was found that WASP overestimated the fungal spore concentration by a factor of 1.5 and underestimated the bacterial aerosol concentration by a factor of 5 compared to the methods trialled here. This is likely due to errors arising from misattribution due to poor centroid definition and failure to assign particles to a cluster as a result of the subsampling and comparative attribution method employed by WASP. The methods used here allow for the entire fluorescent population of particles to be analysed, yielding an explicit cluster attribution for each particle. This improves cluster centroid definition (e.g. allowing for several bacterial clusters compared to just one in WASP) and removes the potential for underestimation caused by failing to attribute a particle to a cluster.

In this paper we have demonstrated that WIBS single-particle UV-LIF spectrometer data can be successfully segregated using the Ward hierarchical agglomerative cluster analysis linkage with z-score and range data normalisation. The explicit clustering method employed in this study can be applied to large data sets, removing potential clustering artefacts associated with the subsampling and attribution method used in previous approaches and improving our capacity to discriminate and quantify PBAP meta-classes. These improved techniques will be of importance for interpreting data from future multi-parameter UV-LIF instruments with improved fluorescence resolution and for extending the measurement technique to real-time quantification for ambient monitoring networks.

The Supplement related to this article is available online at doi:10.5194/amt-8-4979-2015-supplement.

Acknowledgements

The authors wish to express their gratitude to R. Sarda Esteve (CEA) and J. A. Huffman (University of Denver) for use of their fluorescent calibration particles as part of the BIODETECT experiment, without which the fundamentals of this work could not have been performed. We would also like to acknowledge USFS and NCAR for providing invaluable logistical support and access to the Manitou Experimental Forest field site. We also acknowledge P. Kaye and W. R. Stanley (University of Hertfordshire) for their continued support. This work was funded by the Natural Environment Research Council INUPIAQ programme, grant number NE/K006002/1.

Edited by: F. Pope

References

Benson, R., Meyer, R., Zaruba, M., and McKhann, G.: Cellular autofluorescence – is it due to flavins?, J. Histochem. Cytochem., 27, 44–48, 1979.

Billinton, N. and Knight, A. W.: Seeing the wood through the trees: a review of techniques for distinguishing green fluorescent protein from endogenous autofluorescence, Anal. Biochem., 291, 175–197, 10.1006/abio.2000.5006, 2001.

Calinski, T. and Harabasz, J.: A dendrite method for cluster analysis, Commun. Stat.-Theor. M., 3, 1–27, 10.1080/03610927408827101, 1974.

Crawford, I., Bower, K. N., Choularton, T. W., Dearden, C., Crosier, J., Westbrook, C., Capes, G., Coe, H., Connolly, P. J., Dorsey, J. R., Gallagher, M. W., Williams, P., Trembath, J., Cui, Z., and Blyth, A.: Ice formation and development in aged, wintertime cumulus over the UK: observations and modelling, Atmos. Chem. Phys., 12, 4963–4985, 10.5194/acp-12-4963-2012, 2012.

Crawford, I., Robinson, N. H., Flynn, M. J., Foot, V. E., Gallagher, M. W., Huffman, J. A., Stanley, W. R., and Kaye, P. H.: Characterisation of bioaerosol emissions from a Colorado pine forest: results from the BEACHON-RoMBAS experiment, Atmos. Chem. Phys., 14, 8559–8578, 10.5194/acp-14-8559-2014, 2014.

Douwes, J., Thorne, P., Pearce, N., and Heederik, D.: Bioaerosol Health Effects and Exposure Assessment: Progress and Prospects, Ann. Occup. Hyg., 47, 187–200, 10.1093/annhyg/meg032, 2003.

Foot, V. E., Kaye, P. H., Stanley, W. R., Barrington, S. J., Gallagher, M., and Gabey, A.: Low-cost real-time multiparameter bio-aerosol sensors, in: Optically Based Biological and Chemical Detection for Defence, 71160I–71160I-12, 15 September 2008, Cardiff, Wales, UK, 10.1117/12.800226, 2008.

Gabey, A. M.: Laboratory and field characterisation of fluorescent and primary biological aerosol particles, PhD thesis, University of Manchester, Manchester, UK, 2011.

Gabey, A. M., Stanley, W. R., Gallagher, M. W., and Kaye, P. H.: The fluorescence properties of aerosol larger than 0.8 µm in urban and tropical rainforest locations, Atmos. Chem. Phys., 11, 5491–5504, 10.5194/acp-11-5491-2011, 2011.

Gabey, A. M., Vaitilingom, M., Freney, E., Boulon, J., Sellegri, K., Gallagher, M. W., Crawford, I. P., Robinson, N. H., Stanley, W. R., and Kaye, P. H.: Observations of fluorescent and biological aerosol at a high-altitude site in central France, Atmos. Chem. Phys., 13, 7415–7428, 10.5194/acp-13-7415-2013, 2013.

Gnanadesikan, R., Kettenring, J., and Maloor, S.: Better alternatives to current methods of scaling and weighting data for cluster analysis, J. Stat. Plan. Infer., 137, 3483–3496, 10.1016/j.jspi.2007.03.026, 2007.

Heald, C. L. and Spracklen, D. V.: Atmospheric budget of primary biological aerosol particles from fungal spores, Geophys. Res. Lett., 36, L09806, 10.1029/2009GL037493, 2009.

Hummel, M., Hoose, C., Gallagher, M., Healy, D. A., Huffman, J. A., O'Connor, D., Pöschl, U., Pöhlker, C., Robinson, N. H., Schnaiter, M., Sodeau, J. R., Stengel, M., Toprak, E., and Vogel, H.: Regional-scale simulations of fungal spore aerosols using an emission parameterization adapted to local measurements of fluorescent biological aerosol particles, Atmos. Chem. Phys., 15, 6127–6146, 10.5194/acp-15-6127-2015, 2015.

Jacobson, M. Z. and Streets, D. G.: Influence of future anthropogenic emissions on climate, natural emissions, and air quality, J. Geophys. Res., 114, D08118, 10.1029/2008JD011476, 2009.

Kaye, P. H., Stanley, W. R., Hirst, E., Foot, E. V., Baxter, K. L., and Barrington, S. J.: Single particle multichannel bio-aerosol fluorescence sensor, Opt. Express, 13, 3583, 10.1364/OPEX.13.003583, 2005.

Kaye, P. H., Aptowicz, K., Chang, R. K., Foot, V., and Videen, G.: Angularly Resolved Elastic Scattering from Airborne Particles, Opt. Biol. Part., 238, 31–61, 2007.

Kim, S., Karl, T., Guenther, A., Tyndall, G., Orlando, J., Harley, P., Rasmussen, R., and Apel, E.: Emissions and ambient distributions of Biogenic Volatile Organic Compounds (BVOC) in a ponderosa pine ecosystem: interpretation of PTR-MS mass spectra, Atmos. Chem. Phys., 10, 1759–1771, 10.5194/acp-10-1759-2010, 2010.

Li, J. K. and Humphrey, A. E.: Use of fluorometry for monitoring and control of a bioreactor, Biotechnol. Bioeng., 37, 1043–1049, 10.1002/bit.260371109, 1991.

Milligan, G. W. and Cooper, M. C.: A study of standardization of variables in cluster analysis, J. Classif., 5, 181–204, 10.1007/BF01897163, 1988.

Möhler, O., DeMott, P. J., Vali, G., and Levin, Z.: Microbiology and atmospheric processes: the role of biological particles in cloud physics, Biogeosciences, 4, 1059–1071, 10.5194/bg-4-1059-2007, 2007.

Morris, C. E., Conen, F., Alex Huffman, J., Phillips, V., Pöschl, U., and Sands, D. C.: Bioprecipitation: a feedback cycle linking earth history, ecosystem dynamics and land use through biological ice nucleators in the atmosphere, Global Change Biol., 20, 341–351, 10.1111/gcb.12447, 2014.

Müllner, D.: fastcluster: fast hierarchical, agglomerative clustering routines for R and Python, J. Stat. Softw., 53, 1–18, 10.18637/jss.v053.i09, 2013.

Ortega, J., Turnipseed, A., Guenther, A. B., Karl, T. G., Day, D. A., Gochis, D., Huffman, J. A., Prenni, A. J., Levin, E. J. T., Kreidenweis, S. M., DeMott, P. J., Tobo, Y., Patton, E. G., Hodzic, A., Cui, Y. Y., Harley, P. C., Hornbrook, R. S., Apel, E. C., Monson, R. K., Eller, A. S. D., Greenberg, J. P., Barth, M. C., Campuzano-Jost, P., Palm, B. B., Jimenez, J. L., Aiken, A. C., Dubey, M. K., Geron, C., Offenberg, J., Ryan, M. G., Fornwalt, P. J., Pryor, S. C., Keutsch, F. N., DiGangi, J. P., Chan, A. W. H., Goldstein, A. H., Wolfe, G. M., Kim, S., Kaser, L., Schnitzhofer, R., Hansel, A., Cantrell, C. A., Mauldin, R. L., and Smith, J. N.: Overview of the Manitou Experimental Forest Observatory: site description and selected science results from 2008 to 2013, Atmos. Chem. Phys., 14, 6345–6367, 10.5194/acp-14-6345-2014, 2014.

Pöhlker, C., Wiedemann, K. T., Sinha, B., Shiraiwa, M., Gunthe, S. S., Smith, M., Su, H., Artaxo, P., Chen, Q., Cheng, Y., Elbert, W., Gilles, M. K., Kilcoyne, A. L. D., Moffet, R. C., Weigand, M., Martin, S. T., Pöschl, U., and Andreae, M. O.: Biogenic potassium salt particles as seeds for secondary organic aerosol in the Amazon, Science, 337, 1075–1078, 10.1126/science.1223264, 2012.

Robinson, N. H., Allan, J. D., Huffman, J. A., Kaye, P. H., Foot, V. E., and Gallagher, M.: Cluster analysis of WIBS single-particle bioaerosol data, Atmos. Meas. Tech., 6, 337–347, 10.5194/amt-6-337-2013, 2013.

Sands, D., Langhans, V., Scharen, A., and de Smet, G.: The association between bacteria and rain and possible resultant meteorological implications, J. Hungar. Meteorol. Serv., 86, 148–152, 1982.

Schumacher, C. J., Pöhlker, C., Aalto, P., Hiltunen, V., Petäjä, T., Kulmala, M., Pöschl, U., and Huffman, J. A.: Seasonal cycles of fluorescent biological aerosol particles in boreal and semi-arid forests of Finland and Colorado, Atmos. Chem. Phys., 13, 11987–12001, 10.5194/acp-13-11987-2013, 2013.

Stanley, W. R., Kaye, P. H., Foot, V. E., Barrington, S. J., Gallagher, M., and Gabey, A.: Continuous bioaerosol monitoring in a tropical environment using a UV fluorescence particle spectrometer, Atmos. Sci. Lett., 12, 195–199, 10.1002/asl.310, 2011.

Toprak, E. and Schnaiter, M.: Fluorescent biological aerosol particles measured with the Waveband Integrated Bioaerosol Sensor WIBS-4: laboratory tests combined with a one year field study, Atmos. Chem. Phys., 13, 225–243, 10.5194/acp-13-225-2013, 2013.
