Improved real-time bio-aerosol classification using Artificial Neural Networks

1. PCO S.A., ul. Jana Nowaka-Jeziorańskiego 28, 03-982 Warsaw, Poland
2. Institute of Optoelectronics, Military University of Technology, ul. Gen. Witolda Urbanowicza 2, 00-908 Warsaw, Poland

*Corresponding author: miron.kaliszewski@wat.edu.pl

Abstract. Air pollution has had an increasingly powerful impact on the everyday life of humans. More and more people are aware of the health problems that may result from inhaling air which contains dust, bacteria, pollens or fungi. There is a need for real-time information about ambient particulate matter. Devices currently available on the market can detect some particles in the air but cannot classify them according to health threats. Fortunately, a new type of technology is emerging as a promising solution.
Laser-based bio-detectors mark a new era in aerosol research. They are capable of characterizing a great number of individual particles in seconds by analyzing optical scattering and fluorescence characteristics. In this study we demonstrate the application of artificial neural networks (ANNs) to real-time analysis of single-particle fluorescence fingerprints acquired using BARDet (a Bio-AeRosol Detector). A total of 48 different aerosols including pollens, bacteria, fungi, spores, and nonbiological substances were characterized. An entirely new approach to data analysis using a decision tree comprising 22 independent neural networks is discussed. By applying confusion matrices and receiver operating characteristics (ROC) analysis, the best sets of ANNs for each group of similar aerosols were determined. As a result, a very high accuracy of aerosol classification in real time was achieved. It was found that for some substances that have characteristic spectra, almost every particle can be properly classified. Aerosols with similar spectral characteristics can be classified as specific clouds with high probability. In both cases the system recognized the aerosol type with no mistakes.
In the future, the performance of the system is to be determined under real environmental conditions, involving characterization of fluorescent and nonfluorescent particles.

Introduction
Ambient air contains a variety of particles such as dust, bacteria, pollens, fungi and other particles of biological and nonbiological origin (Pöhlker et al., 2013;Górny, 2004).Aerosols are involved in various atmospheric processes such as ice nuclei formation, precipitation and global climate effects (Deguillaume et al., 2008;Fröhlich-Nowoisky et al., 2016;Gabey et al., 2010;Pósfai and Buseck, 2010;Fuzzi et al., 2015).They also greatly influence human health (Davidson et al., 2005;Pope and Dockery, 2006;Michaels, 2017;Shiraiwa et al., 2012).Therefore, the characterization of ambient air is important for estimating potential health hazards and environmental impact (Mauderly and Chow, 2008;Lim et al., 2005).Standard methods of aerosol composition assessment usually include microscopic inspection or molecular analysis of filters (Miaskiewicz-Peska and Lebkowska, 2012), tape or liquid trapped particles.Nevertheless, they suffer from low time resolution due to periodical and relatively long analytical procedures.They are also ineffective for the detection of non-culturable microorganisms (Blais-Lecours et al., 2015;Trafny et al., 2014).
The detection and classification of biological particles is possible using fluorescence techniques due to the presence of proteins, NADH, and some vitamins that emit light when excited with UV light (Lakowicz, 2006). This feature is utilized in single-particle fluorescence detectors. In the flowing air each particle is characterized for size/shape using light scattering as well as fluorescence properties. This approach ensures continuous measurement and immediate response. Thus the analysis process can be facilitated and accelerated compared with other commonly used analytical procedures (Hill et al., 1999; Choi et al., 2014; Taketani et al., 2013; Feugnet et al., 2008). Despite advantages such as reagentless and real-time particle characterization, laser-based methods do not provide information on the chemical composition of the aerosol.
Several studies using single-particle fluorescence detectors have demonstrated that fluctuations of aerosol concentration and variations in its fluorescence properties are highly dependent on the season, day, time, location and place occupancy (Gabey et al., 2011;Huffman et al., 2010;Pinnick et al., 2004;Bhangar et al., 2014;Fennelly et al., 2017).Each single particle passing the instrument is labeled with a time stamp, scattering properties (size and/or shape) and fluorescence characteristics.It is obvious that continuous singleparticle measurements bring a new potential and quality to environmental research.However, particles of the same type and batch display slightly different spectral characteristics due to variations in biochemical composition, size, age of population (Agranovski et al., 2003), degradation (Hernandez et al., 2016) or stress level (Lee et al., 2010) and the particle position within the instrument's interrogation point (Pan et al., 2011).Simpler statistical analyses, such as data averaging and graphical spectra representation, are not sufficient.Therefore, the huge amount of data and occurring spectral variations require more advanced algorithms supporting automatic data classification.Various analytical methods of particle discrimination and classification have been applied.It has been shown that principal component analysis (PCA), linear discriminant analysis (LDA) and hierarchical cluster analysis (HCA) of fluorescence spectra greatly increase the discrimination of particles compared with methods based on spectra averaging or fluorescence threshold (Leśkiewicz et al., 2016;Kaliszewski et al., 2013;Pan et al., 2012;Savage et al., 2017;Crawford et al., 2015).Artificial neural networks (ANNs) comprise an emerging analytical approach that is becoming more widely and successfully applied in various life domains such as chemical analysis (Borecki et al., 2008), image recognition (Antowiak and Chałasińska-Macukow, 2003), data mining and weather 
forecasting (Purnomo et al., 2017). It has been shown that ANNs can be applied in bio-aerosol classification (Kohlus and Bottlinger, 1993). However, an ANN usually requires more user input than other analytical procedures (Ruske et al., 2017).
This paper focuses on the application of ANNs for real-time discrimination of bio-aerosols based on single-particle fluorescence characteristics. We demonstrate a new approach to data analysis using ANNs which allows automation of data preparation procedures and minimum user involvement.
Detailed information concerning the construction and parameters of the instrument used for the experiments was presented in our previous work (Kaliszewski et al., 2016). In general, the ambient air is continuously drawn through the nozzle, where it is focused with a sheath flow of filtered air. Particles in the focused air pass through the BARDet's chamber where they are interrogated by a 16 mW CW laser beam generated by a diode laser operating at 375 nm wavelength (CUBE, Coherent). The backward and forward scattered signals are detected with two PMTs (photomultiplier tubes; H6780, Hamamatsu) mounted at 35° and 145° angles to the laser beam axis.
The fluorescence of particles is measured at a 90° angle to the laser beam with a 32-channel PMT (A10766, Hamamatsu). A longpass filter with an edge at 400 nm (Edmund Optics) separates the fluorescence signal from scattered light. The multichannel PMT measures fluorescence in 18 active channels in a range of 415.4-643.5 nm; the remaining channels are not used. The active channels are grouped in seven bands. Such a solution extends the dynamic range of measured spectra, assures a high S/N (signal-to-noise) ratio and also reduces the possibility of signal saturation. The band configuration is presented in Table 1.

Aerosols
For the tests, dry powders of harmless substances were used since they did not need a specialized aerosol protection chamber. In order to achieve a reliable aerosol classification, the ANNs need to be trained using as large a number of measurement data as possible. Therefore, various particle types that can be easily aerosolized were tested. Samples such as pollens, fungi, bacteria, spores and plant debris naturally occur in the atmosphere. Biofluorophores such as riboflavin, cellulose, amino acids and proteins were also characterized since they are present in biological materials. The group of bacterial growth media was investigated due to their powerful influence on bacteria fluorescence, especially if the bacteria are not sufficiently washed. This can occur in the case of intentionally released bacterial aerosols. Due to technical limitations, samples other than of a pharmaceutical type could not be aerosolized in this study. The aerosols of flours as well as fluorescent nonbiological substances such as paper dust, AC fine test dust and talc were analyzed since they can occur especially in indoor and public places. Nonfluorescent particles were not the subject of research since they can be automatically discarded as nonbiological by applying given fluorescence thresholds.
The samples used for this study are listed in Table 2. To perform numerous experiments, disposable vials were used, one for each aerosol sample. This prevented cross-contamination between measured samples. The aerosols were generated from modified 50 ml Falcon tubes placed on a vortex mixer. The lower part of each vial contained two connectors for silicone tubes. Each tube contained about 50 mg of the dry powder sample. Vortexed particles were entrained and formed an aerosol cloud inside the Falcon tube. The aerosolized particles were aspirated from the vial to BARDet's aerosol inlet. During aerosol generation, filtered air was supplied into the vial to compensate for the BARDet's flow. The concentration of the aerosols was adjusted with the vibration frequency of the vortex mixer. The measurement started after the aerosol reached a homogeneous concentration. The experimental setup is shown in Fig. 1.

Aerosol microscopy
For microscopy analysis the aerosols were generated as described above and collected by impaction on a glass microscopic slide. The visualization of the samples was performed using a Nikon Eclipse Ti-U microscope with a 10× objective. The images were recorded with a 5 megapixel DS-Fi1 camera. The aerosol equivalent diameters and circularity were analyzed automatically using NIS-Elements 64 bit 3.22.10 software. The threshold of particle outline was corrected manually to obtain the visually best fit.

Data acquisition method and preprocessing
The fluorescence of each particle was recorded in seven bands. This creates a time series of the signals which has to be preprocessed before further analysis. There are two steps in gathering data. The first one is performed by the internal software of BARDet which is responsible for controlling the instrument and the acquisition of raw signals. Then data are forwarded to a preprocessing module in the analysis software. Its first task is to extract valuable signals from the noise (three sigma rule). After that a normalization procedure is required. It is performed first by subtracting the average value of the signal and then normalizing it to its standard deviation.
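The two preprocessing steps can be illustrated with a minimal sketch. It is not the BARDet software: the function names, the use of a separately estimated noise mean and standard deviation, the decision to test the strongest band against the three sigma limit, and the use of the population standard deviation are all assumptions made for this example, and the raw readings are invented.

```python
from statistics import mean, pstdev


def exceeds_three_sigma(bands, noise_mean, noise_sd):
    """Return True if the strongest band rises above noise_mean + 3 * noise_sd.

    Assumption: the noise level is estimated elsewhere (e.g. from particle-free air).
    """
    return max(bands) > noise_mean + 3.0 * noise_sd


def normalize_spectrum(bands):
    """Subtract the mean of the band values, then divide by their standard deviation."""
    mu = mean(bands)
    sd = pstdev(bands)  # population SD; the paper does not state which variant is used
    if sd == 0.0:
        raise ValueError("a flat spectrum cannot be normalized")
    return [(b - mu) / sd for b in bands]


# Hypothetical raw readings of one particle in the seven bands (arbitrary units).
raw = [12.0, 30.0, 55.0, 41.0, 22.0, 15.0, 11.0]
if exceeds_three_sigma(raw, noise_mean=10.0, noise_sd=1.5):
    spectrum = normalize_spectrum(raw)
```

After this step every spectrum has zero mean and unit standard deviation, so only its shape, not its absolute intensity, reaches the classifier.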
The main goal was to analyze the shape of the emission spectrum (not signal strength).An example visualization of input data is shown in Fig. 2.
The data acquisition process started after the stabilization of the aerosol generation rate which was measured by the device. It was important not to exceed one particle per 2 ms of data integration time in a 20 µs measurement window. Finally, a total of 114 779 spectral characteristics of 48 aerosols were gathered, which gives on average 2391 (SD 437) fluorescence characteristics per substance. From the recorded data, 80 % were used as a training data set and 20 % as a test data set.

Basics
There are many types of artificial neural networks (ANNs), but in this paper only the backpropagation algorithm is demonstrated because it is one of the most practical ones. The main concept of this algorithm is based on a model of the neuron that has two tasks. It aggregates signals (Eq. 1) and then processes them by an activation function (Eq. 2), which, in this research, is a sigmoid. The result of such single processing is a new signal $z_j$ propagated to other neurons (Fig. 3).
$$a_j = \sum_i w_{ji} z_i \quad (1)$$
where $a_j$ is the aggregated signal, $w_{ji}$ is the weight that connects neuron $i$ with $j$ and $z_i$ is the signal (input).
$$z_j = g(a_j) = \frac{1}{1 + e^{-\beta a_j}} \quad (2)$$
where $g(a_j)$ is the sigmoidal function and $\beta$ is the parameter (steepness) of the sigmoid curve.
The structure of a neural network is formed by layers of neurons: input, hidden and output. In this research input neurons constitute a fluorescence spectrum and output neurons represent substances. Most computations are carried out in the hidden layers (no more than two layers were examined). The schematic representation of neuron layers is presented in Fig. 4.
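The aggregation and activation of a neuron, and their repetition layer by layer, can be sketched as follows. This is an illustrative forward pass only: the layer sizes (seven inputs, five hidden neurons, three outputs), the random weights and the example spectrum are hypothetical, and bias terms are omitted because the description above does not mention them.

```python
import math
import random


def sigmoid(a, beta=1.0):
    """Activation g(a) = 1 / (1 + exp(-beta * a)); beta sets the steepness."""
    return 1.0 / (1.0 + math.exp(-beta * a))


def layer_forward(weights, inputs, beta=1.0):
    """weights[j][i] connects input i to neuron j; returns the outputs z_j of the layer."""
    outputs = []
    for w_j in weights:
        a_j = sum(w * z for w, z in zip(w_j, inputs))  # aggregation of incoming signals
        outputs.append(sigmoid(a_j, beta))             # activation
    return outputs


def network_forward(layers, spectrum, beta=1.0):
    """Propagate a normalized spectrum through all layers in order."""
    signal = spectrum
    for weights in layers:
        signal = layer_forward(weights, signal, beta)
    return signal


rng = random.Random(0)


def random_layer(n_out, n_in):
    """Small random initial weights (an arbitrary range chosen for the example)."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]


# Hypothetical topology: 7 band inputs -> 5 hidden neurons -> 3 output neurons.
layers = [random_layer(5, 7), random_layer(3, 5)]
scores = network_forward(layers, [0.1, 0.9, 1.4, 0.3, -0.6, -1.0, -1.1])
```

Each entry of `scores` lies between 0 and 1 and can be read as the response of one output neuron, i.e. of one substance or group.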
The described algorithm constitutes the supervised learning method that requires training data for a teaching process. This allows one to calculate an error between the target shown and the ANN response. Every problem is related to minimizing the output error, which is calculated as the mean squared error (Eq. 3).
$$E = \frac{1}{c} \sum_{k=1}^{c} (t_k - y_k)^2 \quad (3)$$
where $E$ is the mean squared error, $t_k$ is the observed value (target), $y_k$ is the calculated response, $k$ is the output neuron and $c$ is the number of output neurons.
The gradient descent method is used to find a minimum of the error function. The error is dependent on the network weights $w_{ji}$, which might be adjusted (Eq. 4). In order to update the weights correctly, one first needs to propagate the error backwards by calculating the partial derivatives $\delta_j$ (Eq. 5) (Fig. 5). All mathematical details are well described by Christopher M. Bishop (Bishop, 1995).
$$\Delta w_{ji}(t) = -\eta \frac{\partial E}{\partial w_{ji}} + m \, \Delta w_{ji}(t-1) \quad (4)$$
where $\eta$ is the learning rate, $m$ is the momentum and $t$ is the iteration.
$$\frac{\partial E}{\partial w_{ji}} = \frac{\partial E}{\partial a_j} \frac{\partial a_j}{\partial w_{ji}} = \delta_j z_i \quad (5)$$
The learning rate factor determines the size of the steps, while the momentum parameter enables the local minimum to be omitted by adding a fraction of the weight correction from the last step.
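The role of the learning rate and the momentum in the weight correction can be shown on a toy problem. The sketch below assumes the common form of the update, in which the new correction is the negative gradient scaled by the learning rate plus a fraction of the previous correction; the quadratic error surface, the starting point and the parameter values are invented for illustration and stand in for the real network error.

```python
def momentum_step(weights, gradients, previous_deltas, eta, m):
    """One update: delta(t) = -eta * dE/dw + m * delta(t-1), then w <- w + delta(t)."""
    deltas = [-eta * g + m * d for g, d in zip(gradients, previous_deltas)]
    new_weights = [w + d for w, d in zip(weights, deltas)]
    return new_weights, deltas


def gradient(w):
    """Gradient of the toy error E(w) = (w0 - 1)^2 + (w1 + 2)^2, minimum at (1, -2)."""
    return [2.0 * (w[0] - 1.0), 2.0 * (w[1] + 2.0)]


w = [0.0, 0.0]
deltas = [0.0, 0.0]
for _ in range(200):
    w, deltas = momentum_step(w, gradient(w), deltas, eta=0.1, m=0.5)
```

In a real network the gradient would come from the backpropagated terms of Eq. (5) rather than from a closed-form expression.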
After the correction of all weights of the ANN, the output error is examined, and the procedure starts again unless the error level is low enough and there is no overfitting. All data are divided into three different sets: training, test and validation. For calculations during the learning process, only the first two are used. In order to determine whether it is time to stop the teaching process, one has to observe the error in the test set. There will be a moment when this error becomes constant or starts increasing due to the overfitting of training data (Fig. 6). The validation data set may be useful for comparing different models or just to verify the current model with a completely separate set of data.
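The stopping criterion described above can be sketched as a scan of the test error recorded after each epoch. The `patience` parameter (how many epochs without improvement are tolerated) is an assumption introduced for this example, and the error curve is invented.

```python
def early_stopping_epoch(test_errors, patience=3):
    """Return the epoch with the lowest test error seen before training is stopped.

    Scanning ends once the error has not improved for `patience` consecutive epochs.
    """
    best_epoch = 0
    best_error = float("inf")
    for epoch, err in enumerate(test_errors):
        if err < best_error:
            best_error, best_epoch = err, epoch
        elif epoch - best_epoch >= patience:
            break  # the test error stopped improving: likely overfitting
    return best_epoch


# Hypothetical test error per epoch: it falls, then rises as overfitting sets in.
test_curve = [0.40, 0.31, 0.25, 0.22, 0.21, 0.23, 0.24, 0.26, 0.20]
stop_at = early_stopping_epoch(test_curve, patience=3)
```

The weights stored at the returned epoch would be kept as the final model; later epochs are never evaluated once the stop condition is met.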

Implementation of ANNs for BARDet
There are statistical commercial software packages available that provide ANN modules as one of the methods to analyze the data. It is worth noting that customized software was developed for this research. This approach helped us to understand ANNs in depth and led to the development of software that is not only responsible for data preprocessing and network training, but also (mainly) for solving a real-time classification problem.
Ruske et al. (2017) compared various algorithms to analyze single-particle data and noted that an ANN requires much more user input. However, we present a method to overcome this inconvenience by automating the process and implementing procedures which simplify and improve the analysis.
The main disadvantage of an ANN is the fact that it is a parametrized algorithm. How well it works depends strictly on a proper choice of the best possible factors, which may be different for each problem. There are two types of factors that influence the ANN outcome. The first one corresponds to the architecture of the ANN, which comprises the number of layers, the number of neurons and the activation function parameter. The second one determines the learning process: momentum and learning rate. The latter can be tuned during the learning process to make it much faster. The "bold driver" procedure was chosen for that purpose. It continuously increases the learning rate as long as the error does not become higher than it was before the change. If it does, the algorithm radically decreases the learning rate and restores the weights from the last step. Teaching an ANN is a stochastic process initiated by using randomly chosen initial weights. It was found that the best procedure for this investigation would be to conduct all optimization processes that way. Therefore, the parameters of the ANN, responsible both for structure and the learning process, are randomly selected until the desired result is reached. In fact, the calculations are carried out automatically and simultaneously for several models by means of multicore-oriented software. The benefits of this approach are time savings and high levels of efficiency and effectiveness in finding the best model. The latter is especially important because the goal is to create a model that produces the best results, which does not necessarily mean creating a more complicated network (more neurons or layers).
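The "bold driver" adaptation of the learning rate can be sketched on a toy error surface. The growth factor (1.05), the reduction factor (0.5), the starting rate and the quadratic error function are assumptions chosen for illustration; the values used in the authors' software are not given in the text, and momentum is left out to keep the example short.

```python
def bold_driver(weights, error_fn, gradient_fn, eta=0.01, grow=1.05, shrink=0.5, steps=100):
    """Gradient descent whose learning rate grows after a good step and is cut after a bad one."""
    error = error_fn(weights)
    for _ in range(steps):
        grad = gradient_fn(weights)
        candidate = [w - eta * g for w, g in zip(weights, grad)]
        new_error = error_fn(candidate)
        if new_error > error:
            eta *= shrink  # reject the step: keep the previous weights, cut the rate
        else:
            weights, error = candidate, new_error
            eta *= grow    # accept the step: raise the rate slightly
    return weights, eta, error


def error(w):
    """Toy error E(w) = (w0 - 1)^2 + (w1 + 2)^2 standing in for the network error."""
    return (w[0] - 1.0) ** 2 + (w[1] + 2.0) ** 2


def gradient(w):
    return [2.0 * (w[0] - 1.0), 2.0 * (w[1] + 2.0)]


weights, eta, final_error = bold_driver([0.0, 0.0], error, gradient)
```

Because a step that raises the error is always discarded, the error can never increase from one accepted step to the next, while the rate is free to grow well beyond its cautious starting value.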

Model evaluation
The main goal of the analysis described in this paper is to find a solution to the bio-aerosol classification problem. When a training process ends, a final model is created: a network with a unique structure and a set of weights. One can create many such models and compare them only by their final error. This is not the best solution because the goal is to distinguish patterns in data consistently, not to produce a network with a minimal error. That is why there is a need to conduct a final analysis of the results and evaluate the model in accordance with the best classification performance.
The standard method for visualization of results is a confusion matrix, which is also necessary for receiver operating characteristics (ROC) analysis (Fawcett, 2006). It simply shows what fraction of the population for each class is predicted correctly or not. Each element from the data set is assigned to one of the following cells of the confusion matrix: true positive (TP), true negative (TN), false negative (FN) and false positive (FP). If it belongs to TP or TN, it was classified correctly.
The ROC graphs are very simple but useful tools for discovering whether a classifier is worth using or if it makes a random classification. They are based on two rates from the confusion matrix: hit rate (Eq. 6) and false alarm rate (Eq. 7).
$$\text{hit rate} = \frac{TP}{TP + FN} \quad (6)$$
$$\text{false alarm rate} = \frac{FP}{FP + TN} \quad (7)$$
Each discrete classifier has a threshold level that assigns an element to a positive or negative class. The points on the ROC graph (Fig. 7) represent the classifier for many thresholds. The most desirable curve will be obtained when the true positive rate is high, and the false positive rate is low (convex line). The random classifier, in turn, has a hit rate equal to a false alarm rate despite threshold variation (diagonal line). To summarize an ROC analysis with one coefficient, the area under the curve (AUC) may be used. A higher value of AUC indicates better performance (0.5 means random, and 1 means excellent).
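The construction of an ROC curve and its AUC can be illustrated with a short sketch. The scores and labels below are invented; the functions follow the standard threshold sweep described by Fawcett (2006) and are not taken from the authors' software.

```python
def roc_points(scores, labels):
    """Return (false alarm rate, hit rate) pairs, one per threshold; labels use 1 = positive."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical output-neuron scores for eight particles and their true classes.
scores = [0.95, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10]
labels = [1, 1, 0, 1, 1, 0, 0, 0]
curve = roc_points(scores, labels)
area = auc(curve)
```

For this example the area is 0.875, which equals the fraction of positive-negative pairs in which the positive particle receives the higher score (14 of 16).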
The confusion matrix and ROC analysis described above were defined for two-class problems (positive, negative). There is a straightforward way to extend them to multi-class problems: one needs to take a desired class versus all other classes. Then it is possible to compare how good the classifier is for specific classes within one model.
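The one-versus-all reduction can be sketched by counting the four confusion matrix cells for a chosen class. The abbreviations FM, RIB, NT and LCB are those used in this paper, but the true and predicted labels listed here are invented for illustration.

```python
def one_vs_rest_counts(true_labels, predicted_labels, target):
    """Count TP, FP, FN and TN when `target` is the positive class and all others are negative."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(true_labels, predicted_labels):
        if truth == target and pred == target:
            tp += 1
        elif truth != target and pred == target:
            fp += 1
        elif truth == target:
            fn += 1
        else:
            tn += 1
    return {"TP": tp, "FP": fp, "FN": fn, "TN": tn}


# Hypothetical per-particle results for four substances.
truth = ["FM", "FM", "RIB", "NT", "NT", "LCB", "RIB", "FM"]
pred = ["FM", "FM", "RIB", "LCB", "NT", "LCB", "NT", "FM"]
nt = one_vs_rest_counts(truth, pred, "NT")
hit_rate = nt["TP"] / (nt["TP"] + nt["FN"])
false_alarm_rate = nt["FP"] / (nt["FP"] + nt["TN"])
```

Repeating the count for every class yields one hit rate and one false alarm rate per substance, which is what allows classes to be compared within a single model.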

ANN performance
First attempts were made to distinguish all substances using only one neural network model.The tests revealed that it is impossible due to the huge number of samples (48 aerosols) and only a few of them presented significantly different fluorescence spectra which allow accurate characterization.The remaining substances are then misclassified.Therefore, we decided to use a more practical approach to this problem, which would be to create several groups (considering information about aerosols), but we did not want to create any classes a priori.Although the ANN type demonstrated needs training, which requires a set of known classes, further tests showed that there is a possibility of finding similarities between substances through the analysis of confusion matrices.It was achieved after many trials of matching substances, which were not well separated, into new groups and checking if they are good enough on ROC graphs.Consequently, this procedure was also applied to those new groups.
All examples demonstrated below were calculated on the test data sets, not training data. In the first ANN presented (Fig. 8), which tries to classify all of the 48 substances (Group 0), four aerosols reached a very high accuracy of separation (AUC > 0.9). The best separation was achieved for fluorescent microspheres (FM). In this case 98.5 % of all FM particles were correctly classified. Similarly, an efficient separation was achieved for riboflavin (RIB), talc (NT) and Lactobacillus bulgaricus (LCB). The remaining aerosols were divided into three separate groups that gather the most similar substances (Groups 1-3) (Table 3). The subsequent groups up to 21 represent individual ANNs leading to the final classification of the aerosol. In practice separation is done not by one confusion matrix (ANN) but by all of them in sequence (22 ANNs combined in a decision tree). For example, if an ANN assigns an unknown substance to one of the groups, the decision process has not ended; from that moment the ANN responsible for that group classifies the substance further. However, each new ANN is trained using only a subset of the data, excluding the data from other groups.
Table 4 and Fig. 9 show results achieved for two substances that have a very similar spectrum, and the AUCs calculated are not much higher than in a random classifier. This example clearly shows why we are not always able to classify every single particle of aerosol with 100 % accuracy. However, just a representative number (several dozen) of measured particles (a cloud) allows the proper prediction of aerosol types within a few seconds. This is easy to observe during real-time detection because counts allocated in a confusion matrix tend to reach a stable state quite quickly.
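One simple way to turn many uncertain single-particle decisions into a decision about the cloud is a majority vote. The sketch below is an assumption about how such a rule could look, not a description of the authors' software; the minimum of 30 particles is a hypothetical reading of "several dozen", and the labels "A" and "B" stand for two substances with similar spectra.

```python
from collections import Counter


def classify_cloud(particle_labels, min_particles=30):
    """Majority vote over per-particle predictions.

    Returns (label, share of particles with that label), or None if too few particles were seen.
    """
    if len(particle_labels) < min_particles:
        return None
    label, count = Counter(particle_labels).most_common(1)[0]
    return label, count / len(particle_labels)


# Hypothetical cloud: the per-particle classifier is right only 58 % of the time.
predictions = ["A"] * 29 + ["B"] * 21
result = classify_cloud(predictions)
```

Even a per-particle accuracy only modestly above chance yields a stable majority once enough particles have been counted, which matches the observation that the counts settle quickly during real-time detection.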

Classification tree
Finally, to achieve the best possible classification, a decision tree was created (Fig. 10). It comprises not 1, but 22 models. The process of creating them is not replicable in terms of the exact factors used for ANN generation. However, this is not essential because the decision tree is based on ANN results (classification ability), which should be the highest possible. Therefore, the final result will be the same. It is difficult to present confusion matrices and ROC graphs for all neural networks in this paper. Therefore, only the most interesting ones have been discussed. Here, each node represents a network that classifies a group of aerosols. The aerosols on the left side of the diagram show the most distinct differences; thus they are easy to classify (Level 0). On the right side (Levels 1-5), this task is much more demanding due to similar spectra, and the separation of single particles is less probable, although it is still very useful from a practical point of view for aerosol cloud discrimination. At first glance one can see that FM and RIB are very well recognized, but that was expected because these are standards of fluorescence. Surprisingly, NT and LCB aerosols were also separated from the others (Level 0 network). Further analysis of the tree structure identifies a correlation between samples and their real categories. It is especially noticeable for pollens, which are allocated to a separate branch of that tree, all stemming from Group 1. Most of them were classified on the third level. Interestingly, all grass pollens (AAP, ATP, BGP, PPP) belong to the same group, Group 6. Similarly, both Lycopodium pollens from different regions of the world show a close correlation, although Abies alba, which is a tree, was classified in the same group. Flours, smut spores and papers are dispersed between different levels, but particular groups belong to the same branch of the tree. However, some of the samples are scattered over the whole tree area and do not correspond to any group.
It should be noted that the result is a system of 22 ANNs that work simultaneously. In comparison to the training process, which is rather time-consuming and has to be empirically optimized, this cluster of learned ANNs delivers high performance. Input data are processed by a single ANN in milliseconds. This performance makes the neural network a great tool as a splitting node in the classification tree. Compared to our previous results, for which principal component analysis was applied to analyze data from BARDet (Kaliszewski et al., 2016), the ANNs allowed much better discrimination between various bio-aerosols.
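The use of a classifier as a splitting node can be sketched as a routing loop: each node returns either the name of another group, in which case the classifier of that group takes over, or a final substance label. In the sketch below the node names, the placeholder labels `substance_X`, `substance_Y` and `substance_Z`, and the threshold rules that stand in for trained ANNs are all hypothetical.

```python
def classify_with_tree(spectrum, nodes, root="group0"):
    """Follow group labels from node to node until a label without its own classifier is reached."""
    current = root
    path = []
    while current in nodes:
        path.append(current)
        current = nodes[current](spectrum)
    return current, path


# Stand-in classifiers; in practice each entry would wrap one trained ANN.
nodes = {
    "group0": lambda s: "FM" if s[0] > 2.0 else "group1",
    "group1": lambda s: "group6" if s[1] > 0.0 else "substance_X",
    "group6": lambda s: "substance_Y" if s[2] > 0.0 else "substance_Z",
}

easy = classify_with_tree([3.0, 0.0, 0.0], nodes)
hard = classify_with_tree([0.0, 1.0, -1.0], nodes)
```

A distinctive particle leaves the tree at the root node, whereas a particle with a less characteristic spectrum passes through several nodes, each evaluated in turn.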
In this paper the possibility of applying an artificial neural network (ANN) for the real-time classification of biological aerosols was investigated.The spectral characteristics of bioaerosols were collected using the BARDet instrument.The database consisted of 48 substances.Finally, 22 neural networks were trained and combined into a decision tree.This allowed aerosols to be characterized in real time.Tests revealed that only certain substances have such characteristic fluorescence spectra that allow correct classification of almost each particle.However, in all other cases the system was able to recognize a particular aerosol accurately with no mistakes, but a representative number of several dozens of particles in a cloud was necessary.Further approximation was based on decision tree analysis in which each node corresponded to a separate learned ANN.The best sets of ANNs for each group of similar aerosols were discovered utilizing confusion matrices and ROC analysis.Our intention was to make a complete system which detects and classifies substances without creating groups a priori.This attitude helped us to create a powerful analytical tool that works automatically, and the results of classification are immediately available on the operator's screen.
This study proved that it is possible to create a tool for a highly effective analysis of bio-aerosols using multiple ANNs combined into a decision tree.Our approach allowed us to automate and speed up the analysis, which reduced time and the amount of computing power needed.In a future study the database will be extended to obtain potentially a vast variety of samples including atmospherically relevant bacteria and fungi.In the next step, the actual performance of the system will be determined under real environmental conditions, which will be most challenging due to the presence of unknown fluorescent and nonfluorescent particles.
Data availability. The experimental aerosol data can be provided upon request. The software for automatic data analysis cannot be publicly provided at this moment since it is the subject of negotiations with a company.
Author contributions.ML developed the code, performed aerosol experiments, performed data analysis for ANNs, wrote the section concerning data analysis and contributed to the discussion.He also elaborated on the research conception, produced graphs concerning the data analysis and elaborated on the aerosol generation setup.MK elaborated on the research conception, provided input in writing the Introduction, aerosol experimental and Summary sections, produced some graphs, supervised the manuscript preparation, elaborated on the aerosol generation method, collected most of the samples for analysis, performed aerosol experiments and participated in and elaborated on the microscopic analysis.MW maintained and manipulated the optical elements of BARDet, contributed to scientific discussions, edited the manuscript and graphs, wrote the aerosol microscopy section and performed most of the microscopic measurements.JM contributed to scientific discussions, supported and improved the electronic module of the BARDet and contributed to the writing of the Introduction.ZM reviewed the manuscript and provided constructive discussion as well as language corrections.KK reviewed the manuscript, provided constructive discussion and supervised the project.

Figure 1. Setup of aerosol generation, data recording and analysis.

Figure 2. Example of 50 normalized subsequent fluorescence characteristics of NT (a), FM (c) and LCB (e) and corresponding averaged normalized intensities of NT (b), FM (d) and LCB (f). Error bars represent standard deviation of measurements.

Figure 3. Mathematical model of a single neuron cell.

Figure 4. Typical topology of an artificial neural network.

Figure 5. Model of backward error propagation.

Figure 6. Example of error minimizing during the training process.

Figure 7. ROC graph with an example of classifier (blue).

Figure 8. (a) ROC and (b) error progress of an ANN that classifies all samples.

Figure 9. (a) ROC and (b) error progress of an ANN that classifies two very similar samples.

Figure 10. The decision tree consists of 22 ANNs separating 48 substances.

Table 1. Configuration of bands in the multichannel PMT.

Table 2. List of all substances used in the experiment.

Table 3. Exemplary confusion matrix of all aerosols classified by the first ANN. Bold numbers denote the percentage of particles of a given substance that were classified correctly.

Table 4. Confusion matrix of two substances that have very similar spectra.