Quantification of toxic metals using machine learning techniques and spark emission spectroscopy

The United States Environmental Protection Agency (US EPA) list of hazardous air pollutants (HAPs) includes toxic metals suspected of or associated with the development of cancer. Traditional techniques for detecting and quantifying toxic metals in the atmosphere are either not real time, hindering identification of sources, or limited by instrument costs. Spark emission spectroscopy is a promising and cost-effective technique that can be used for analyzing toxic metals in real time. Here, we have developed a cost-effective spark emission spectroscopy system to quantify the concentration of toxic metals targeted by the US EPA. Specifically, Cr, Cu, Ni, and Pb solutions were diluted and deposited on the ground electrode of the spark emission system. The least absolute shrinkage and selection operator (LASSO) was optimized and employed to detect useful features from the spark-generated plasma emissions. The optimized model was able to detect atomic emission lines along with other features to build a regression model that predicts the concentration of toxic metals from the observed spectra. The limits of detection (LODs) were estimated using the detected features and compared to those of the traditional single-feature approach. LASSO is capable of detecting highly sensitive features in the input spectrum; however, for some toxic metals the single-feature LOD marginally outperforms the LASSO LOD. The combination of low-cost instruments with advanced machine learning techniques for data analysis could pave the path forward for data-driven solutions to costly measurements.


Introduction
The United States Environmental Protection Agency (US EPA) includes a number of metals in its list of hazardous air pollutants (HAPs). These metals are known or suspected to cause cancer or other serious health effects (Buzea et al., 2007; Pope et al., 2002). Table 1 lists the metals in the US EPA's HAP list. Table 2 lists other metals that are not on the US EPA's HAP list but have been implicated in a range of adverse health effects and so are of concern to the California Air Resources Board (CARB). The presence of these metals has been associated with various health concerns such as diabetes (Zanobetti et al., 2009), cardiovascular disease (Brook et al., 2004), and asthma (Gent et al., 2009). Therefore, it is necessary to monitor and quantify their ambient concentrations.
Various techniques have been developed over the years to measure metal particles. X-ray fluorescence (XRF) (Vincze et al., 2002; Van Meel et al., 2007) and inductively coupled plasma mass spectrometry (ICP-MS) (Rovelli et al., 2018; Venecek et al., 2016) have been used traditionally to quantify metals in atmospheric particles. XRF is excellent for measuring lighter elements and metals on filter substrates, but for field application it is expensive, has a high limit of detection (LOD) for heavier elements, and poses a radiation risk. ICP-MS requires collection of aerosol on a substrate, such as a filter or impactor foil, extraction of the metals or elements from the substrate using harsh acidic chemicals, and then analysis in the ICP-MS along with standards that help the instrument quantitate. Moreover, ICP-MS is most suitable for heavier elements and metals, so it has a high LOD for lighter toxic metals, and it is not available for field-deployed, real-time applications. Additionally, both instruments are expensive, so their use is limited by cost as well as complexity.
Recently, machine learning and deep learning techniques have been applied in many fields, including emission spectroscopy. These techniques in general learn patterns that can be used to distinguish different labels. Boucher et al. (2015) employed various linear and nonlinear machine learning techniques on laser-induced breakdown spectroscopy (LIBS) spectra obtained from geological samples and concluded that a combination of models yields a lower total error of prediction. Chengxu et al. (2018) used convolutional neural networks to detect potassium in LIBS spectra and improved the linearity of their prediction model by incorporating deep convolutional layers. Zheng et al. (2018a) employed spark emission spectroscopy on metals and used partial least squares regression to analyze their spectra set. They compared their multivariate models to univariate models and showed that the two groups have similar performance.
While LIBS and spark-induced breakdown spectroscopy (SIBS) address issues regarding field measurement and instrument complexity, they are still considered expensive. Current interest in low-cost sensors and their ability to characterize local air pollution concentrations motivated development of a low-cost system. We employed two complementary approaches: (1) decreasing the cost of the electronics associated with SIBS and (2) incorporating advanced data analysis techniques to improve quantification and the limit of detection. In recent years, numerous studies have used artificial neural networks (Ferreira et al., 2008), partial least squares regression, and the least absolute shrinkage and selection operator (LASSO) (Dyar et al., 2012) on emission spectra to improve the quantification and limit of detection of spectroscopic systems. In this study, we have developed a low-cost spark emission spectroscopy system to quantify toxic metals. To reduce the overall cost, inexpensive replacements for necessary components, such as the spark generator and delay generator, have been developed in the lab. To improve system performance, advanced machine learning tools such as k-means clustering and LASSO have been employed. The resulting instrument was evaluated against four toxic metals listed by the US EPA.
Instrument development

Spark generation system
Setting up a spark emission spectroscopy system requires expensive components. However, depending on the application, some of the components can be replaced. Components such as the spark generator and delay generator can cost up to USD 10 000 and USD 5000, respectively. For our application and needs, we developed these components for less than USD 600 and USD 50, respectively. The first of these costly components is the spark generation system. Numerous papers have studied the fundamental principles of spark emission spectroscopy (Walters, 1977; Sacks and Walters, 1970; Walters, 1969). The key idea is to discharge a capacitor as quickly as possible to increase the power dissipated in the spark gap. Figure 1 illustrates the schematic of the spark generation system. The overall goal is to charge a capacitor at high voltage and, once it has been charged sufficiently, discharge the capacitor through the spark gap. An Arduino board controls the timing between charging and discharging the capacitor. A boost converter steps 24 V DC up to 5000 V DC and is connected to a mechanical relay with two switching states controlled by the Arduino board. In the charge state, the mechanical relay provides the conduction path between the boost converter and the capacitor. In this configuration, the capacitor reaches full charge in 5 s. Once the capacitor is fully charged, the Arduino board sends a signal to turn off the boost converter and sends another signal to the mechanical relay to flip to the discharge state. In the discharge state, the mechanical relay provides a conduction path between the capacitor and the spark gap. Shepherd et al. (2000) showed that the discharge process could be controlled by a resistor after the spark gap. For low resistor values, the spark current exhibited a periodic behavior as the capacitor discharged, which can be associated with an under-damped discharge.
On the other hand, increasing the resistor value damped the discharge process and dissipated a large portion of the capacitor energy through the resistor instead of the spark gap. We found that a 10 Ω resistor maximizes the power dissipation in the spark gap while minimizing oscillations. Figure 2 illustrates the evolution of the generated spark as a function of time. The voltage shows a sudden increase followed by an exponential decrease, fully discharging in less than 5 µs and thus delivering sufficient energy to the arc and deposited analyte.
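The trade-off between an oscillating and an over-damped discharge can be illustrated with a simple series RLC model of the discharge loop. The sketch below is only an approximation under stated assumptions: the capacitance and loop inductance are hypothetical placeholders (the text does not report them), and the spark gap itself is not modeled. Only the 5000 V charging voltage is taken from the text.

```python
import math

def damping_ratio(r_ohm, l_henry, c_farad):
    """Damping ratio of a series RLC loop: zeta = (R / 2) * sqrt(C / L)."""
    return 0.5 * r_ohm * math.sqrt(c_farad / l_henry)

def critical_resistance(l_henry, c_farad):
    """Resistance at which the loop is critically damped (zeta = 1)."""
    return 2.0 * math.sqrt(l_henry / c_farad)

def classify(zeta, rel_tol=1e-9):
    """Label the discharge regime for a given damping ratio."""
    if math.isclose(zeta, 1.0, rel_tol=rel_tol):
        return "critically damped"
    return "under-damped" if zeta < 1.0 else "over-damped"

def stored_energy_j(c_farad, volts):
    """Energy stored in the charged capacitor: E = C * V^2 / 2."""
    return 0.5 * c_farad * volts ** 2

# Hypothetical component values, chosen only for illustration.
C_FARAD = 10e-9    # assumed storage capacitance
L_HENRY = 1e-6     # assumed stray inductance of the discharge loop
V_CHARGE = 5000.0  # charging voltage stated in the text

for r in (1.0, 10.0, 100.0):
    zeta = damping_ratio(r, L_HENRY, C_FARAD)
    print(f"R = {r:6.1f} ohm  zeta = {zeta:5.2f}  {classify(zeta)}")

print(f"critical R = {critical_resistance(L_HENRY, C_FARAD):.1f} ohm")
print(f"stored energy = {stored_energy_j(C_FARAD, V_CHARGE):.3f} J")
```

In this simplified picture, small resistances leave the loop under-damped (oscillating current), while large resistances over-damp it and move the dissipation from the gap to the resistor. The resistance that works best in practice depends on the real capacitance, inductance, and gap behavior, which is why it was determined experimentally.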

Delay generator
The delay generator is another costly component typically used in time-resolved spectroscopy. Advances in electronics have paved the way for developing a cost-effective delay generator. The delay generator suppresses initial noise in the emission spectrum, so it needs to cover a range between 1 and 20 µs with a resolution finer than 0.2 µs. We designed a custom-built delay generator in order to lower the overall cost of the instrument. Figure 3 illustrates the schematic of the circuit. Upon generation of the spark-induced plasma, a pair of lenses collects and focuses the plasma emission into a photodiode. The pulse generated by the photodiode is passed into a voltage comparator (LM 311-N) to generate a transistor-transistor logic (TTL) signal. The output TTL signal from the comparator is sent to a pulse width modulator (PWM) controller (LTC6992), which adds delay to the TTL signal. An Arduino board adjusts a digital resistor (AD5241), which in turn determines the delay value. Figure 4 shows the delay generator performance. The y axis shows the delay values requested of the delay generator, while the x axis shows the measured values. The red dashed line shows the desired 1 : 1 line, while the circles show the measured performance. The performance is linear over the relevant delay range with only a slight deviation from the 1 : 1 line. Considering the spark-generated plasma's short lifetime, our measurements require short delay values (< 5 µs), where the custom-built delay generator shows excellent performance and accuracy.

Spectra collection
Four toxic metals at different concentrations were used to test the performance of the developed spark emission spectrometer system. Cr, Cu, Ni, and Pb (1000 µg mL⁻¹) were purchased from AccuStandard and diluted to specific concentrations. For each concentration, more than 10 spectra were collected and used for model development. A micropipette was used to deposit diluted solutions on a 1 mm diameter tungsten ground electrode of the spark system for emission analysis. The total mass can be calculated from the deposited volume and solution concentration. Upon evaporation of the droplets, the capacitor was discharged to ablate the deposited material and obtain spectra. A pair of lenses (75 mm focal length and 25.4 mm diameter; Thorlabs) focused the emission into an optical fiber connected to a spectrometer (Ocean Optics).
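As a worked example of the mass calculation, the sketch below converts between solution concentration, droplet volume, and deposited mass. The 1 µL droplet volume is a hypothetical value chosen for illustration (the deposited volume is not stated here); the 1000 µg mL⁻¹ stock concentration is the one quoted above, and the target masses of 0.1 to 100 ng are those used in the results.

```python
STOCK_UG_PER_ML = 1000.0  # stock concentration quoted in the text

def deposited_mass_ng(conc_ug_per_ml, volume_ul):
    """Deposited mass in ng; note that 1 ug/mL equals 1 ng/uL."""
    return conc_ug_per_ml * volume_ul

def required_concentration(mass_ng, volume_ul):
    """Concentration (ug/mL) needed to deposit mass_ng in one droplet."""
    return mass_ng / volume_ul

def dilution_factor(stock_ug_per_ml, target_ug_per_ml):
    """How many times the stock must be diluted to reach the target."""
    return stock_ug_per_ml / target_ug_per_ml

DROPLET_UL = 1.0  # hypothetical droplet volume, for illustration only

for mass in (0.1, 1.0, 10.0, 100.0):
    conc = required_concentration(mass, DROPLET_UL)
    factor = dilution_factor(STOCK_UG_PER_ML, conc)
    print(f"{mass:6.1f} ng -> {conc:6.1f} ug/mL (dilute stock {factor:.0f}x)")
```

With a different droplet volume, the required concentrations scale inversely with that volume.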

Results and discussions
To address shot-to-shot variations in the spark-generated plasma and to compensate for possible faults caused by the low-cost components, an unsupervised learning technique, k-means clustering, was used to classify the collected spectra. Following this procedure, it is possible to identify and remove outliers and hence improve the accuracy of the analysis. Figure 5 illustrates the elbow plot that is used to optimize the number of spectral classes. The standard approach is to set the optimum number of clusters to the value where the within-cluster sum of squares (WCSS) error plateaus. The WCSS error plateaus once we have two or more centroids, and, therefore, the number of centroids is set to two. Figure 6 illustrates the performance of the model for 300 spectra obtained from the background (tungsten ground electrode ablation). The results clearly show two clusters with different emission responses. The lower left cluster, containing < 10 % of the spectra, represents low-signal outliers, which were eliminated from further analysis. For each toxic metal, 0.1, 1, 10, and 100 ng of mass were deposited on the ground electrode. For each concentration, 10 spectra were collected using a 2 µs delay between the observed and recorded emissions. After ablating the deposited mass and recording the spectrum, feature scaling was used as a preprocessing step to improve the optimization process for our machine learning model. Plasma temperature can be obtained as follows. Under local thermodynamic equilibrium, the intensity of an emission line and the population of its upper state take the standard Boltzmann-plot forms
$$I_{ki} = \frac{hc}{4\pi\lambda_{ki}} A_{ki} N_k \tag{1}$$
$$N_k = N \frac{g_k}{U(T)} \exp\left(-\frac{E_k}{k_B T}\right) \tag{2}$$
Combining Eqs. (1) and (2) and taking the natural logarithm of both sides gives
$$\ln\left(\frac{I_{ki}\lambda_{ki}}{g_k A_{ki}}\right) = -\frac{E_k}{k_B T} + \ln\left(\frac{hcN}{4\pi U(T)}\right) \tag{3}$$
where I_ki is the measured line intensity, h is the Planck constant, c is the speed of light, N is the total number density of the emitting species, U(T) is its partition function, k_B is the Boltzmann constant, A_ki is the transition probability between two energy states (i) and (k), and N_k is the population density at energy state k (E_k). λ_ki indicates the wavelength associated with the transition, and g_k represents the degeneracy of energy state k. The slope of Eq. (3) with respect to E_k, −1/(k_B T), is used to estimate the plasma temperature based on a series of tungsten lines for the recorded cleaned spectra set at 2 µs. Figure 7 illustrates the Boltzmann plot (Hahn and Omenetto, 2010, 2012) constructed from tungsten lines. Based on the slope of the fit, the plasma temperature is estimated as 4013 ± 579 K. Upon identifying and removing the outlier spectra, the cleaned spectra set is normalized using the tungsten peak at W I (400.87 nm) and fed into the LASSO algorithm for model development and prediction.
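The outlier-removal step described at the start of this section can be sketched as follows. This is a minimal illustration, not the code used in the study: the spectra are synthetic two-feature points (276 normal shots and 24 low-signal shots, both hypothetical), k-means is a plain Lloyd iteration with a deterministic farthest-point initialization, and the low-signal cluster is identified as the one whose centroid has the lowest summed intensity.

```python
import numpy as np

def assign(X, centroids):
    """Index of the nearest centroid for every row of X."""
    dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return dist.argmin(axis=1)

def kmeans(X, k, n_iter=100):
    """Lloyd's algorithm; returns labels, centroids, and the WCSS."""
    # Deterministic start: lowest-signal point, then farthest points.
    centroids = [X[X.sum(axis=1).argmin()]]
    while len(centroids) < k:
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centroids], axis=0)
        centroids.append(X[d.argmax()])
    centroids = np.array(centroids, dtype=float)
    for _ in range(n_iter):
        labels = assign(X, centroids)
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    labels = assign(X, centroids)
    wcss = float(((X - centroids[labels]) ** 2).sum())
    return labels, centroids, wcss

# Synthetic stand-in for normalized intensities at two wavelengths.
rng = np.random.default_rng(0)
normal_shots = rng.normal(1.0, 0.05, size=(276, 2))
weak_shots = rng.normal(0.2, 0.05, size=(24, 2))
X = np.vstack([normal_shots, weak_shots])

# Elbow plot data: WCSS as a function of the number of clusters.
wcss = [kmeans(X, k)[2] for k in (1, 2, 3, 4)]
print("WCSS for k = 1..4:", [round(w, 2) for w in wcss])

# Two clusters; drop the one with the lower-intensity centroid.
labels, centroids, _ = kmeans(X, 2)
outlier_cluster = centroids.sum(axis=1).argmin()
keep = labels != outlier_cluster
cleaned = X[keep]
print("kept", int(keep.sum()), "of", len(X), "spectra")
```

With real data, each row of X would be a full spectrum (or a scaled subset of its channels) rather than two synthetic intensities.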

LASSO
Figure 6. K-means clustering for detecting outliers before passing the spectra set to the LASSO model. Two clusters were plotted for the normalized intensities of two arbitrary wavelengths at λ1 (208.365 nm) and λ2 (208.759 nm).
The cleaned scaled spectra set has been used to detect and quantify concentrations of the toxic metals. Simple linear regression obtains the slope and intercept of a line by minimizing the mean squared error between the predictions and known values. LASSO detects and employs more features to perform predictions by optimizing the following loss function:
$$J(\theta) = \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 + c\sum_{j=1}^{k}\left|\theta_j\right| \tag{4}$$
where x^(i) ∈ R^2048 and h_θ(x^(i)) represent the normalized spectrum and the LASSO concentration prediction based on spectrum (i), respectively, and where y^(i) is the known concentration corresponding to spectrum (i). m refers to the total number of spectra, and the LASSO coefficients are indicated by θ_j. k indicates the total number of features (spectral lines) used to build the model. The first term in Eq. (4) is the mean squared error and is common with simple linear regression, while the second term is a regularization term that penalizes the magnitude of θ_j. The L1 norm essentially sets the coefficients of most of the features in the spectrum to zero and maintains only a few features to build the linear model and perform predictions. The regularization constant (c) determines the number of features to be used in the model, and therefore the model loss needs to be optimized with respect to the regularization constant. To obtain the optimized regularization constant, we plotted the loss values for the Ni spectra training and testing sets as a function of the number of features for various c values based on leave-one-out cross validation (Fig. 8). As expected, the training loss monotonically decreases as the number of features increases, while the loss for the test set initially decreases and then starts increasing.
This implies that after incorporating a certain number of features into the model, the model starts memorizing rather than generalizing, which is known as overfitting. Therefore, we set the regularization constant to the value that minimizes the loss for the test set. Figure 9 illustrates the optimized LASSO model predictions obtained by cross validation. For each concentration, the cross validation predictions were averaged and plotted along with the standard deviations. The predicted values vary linearly with the actual values. Figure 10 shows the wavelengths chosen by LASSO and the mean spectrum for 10 ng. LASSO chose a few Ni emission peaks along with other features to build the model. The same optimization process was applied to the other toxic metals, specifically Cr, Cu, and Pb. Figure 11 illustrates the resulting predictions and demonstrates the value of LASSO for predicting deposited mass from the spectra. To obtain the LOD, the following function of the LASSO coefficients θ_j was used:
$$\mathrm{LOD} = 3\,\sigma_B \lVert\theta\rVert_2 \tag{5}$$
where σ_B is the standard deviation of the background and ‖θ‖_2 is the Euclidean norm of the LASSO coefficients. Table 3 reports the LODs of the studied toxic metals.
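The selection of the regularization constant by leave-one-out cross validation can be sketched as follows. This is an illustrative implementation under stated assumptions, not the code used in the study: the loss is taken to be the mean squared error plus c times the L1 norm of the coefficients, it is minimized by cyclic coordinate descent, the grid of c values is hypothetical, and the 50-channel spectra are synthetic, with only channels 5 and 20 responding to the deposited mass.

```python
import numpy as np

def soft_threshold(rho, t):
    """Shrink rho toward zero by t (the proximal step of the L1 penalty)."""
    return np.sign(rho) * max(abs(rho) - t, 0.0)

def lasso_fit(X, y, c, n_pass=200, tol=1e-8):
    """Minimize (1/m) * sum((X @ theta + b - y)**2) + c * sum(|theta|)."""
    m, n = X.shape
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean      # centering absorbs the intercept
    z = (Xc ** 2).sum(axis=0) / m        # per-feature curvature
    theta = np.zeros(n)
    resid = yc.copy()                    # residual yc - Xc @ theta
    for _ in range(n_pass):
        max_step = 0.0
        for j in range(n):
            if z[j] == 0.0:
                continue
            rho = Xc[:, j] @ resid / m + z[j] * theta[j]
            new = soft_threshold(rho, c / 2.0) / z[j]
            step = new - theta[j]
            if step != 0.0:
                resid -= Xc[:, j] * step
                theta[j] = new
                max_step = max(max_step, abs(step))
        if max_step < tol:
            break
    return theta, y_mean - x_mean @ theta

def loo_mse(X, y, c):
    """Leave-one-out test loss for a given regularization constant."""
    errs = []
    for i in range(len(y)):
        train = np.arange(len(y)) != i
        theta, b = lasso_fit(X[train], y[train], c)
        errs.append((X[i] @ theta + b - y[i]) ** 2)
    return float(np.mean(errs))

# Synthetic spectra: noise everywhere, signal in channels 5 and 20 only.
rng = np.random.default_rng(0)
mass = np.repeat([1.0, 2.0, 5.0, 10.0], 10)
X = rng.normal(0.0, 0.01, size=(mass.size, 50))
X[:, 5] += 0.5 * mass
X[:, 20] += 0.2 * mass

cv = {c: loo_mse(X, mass, c) for c in (0.1, 1.0, 10.0)}  # hypothetical grid
best_c = min(cv, key=cv.get)
theta, b = lasso_fit(X, mass, best_c)
selected = np.flatnonzero(theta)
print("LOO loss per c:", cv)
print("best c:", best_c, "selected channels:", selected)
```

On real spectra, the grid would typically be finer and would extend to small c values, where the test loss starts to rise again as noise channels enter the model.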
Multivariate regression models such as LASSO might be more powerful for detection and quantification than univariate models; however, there is no guarantee that multivariate models outperform simple linear regression (Castro and Pereira-Filho, 2016; Braga et al., 2010). To evaluate LASSO performance, we compared LASSO with univariate methods by calculating the LODs using simple univariate linear regression based on the features selected by LASSO. Figure 12 illustrates the LODs obtained using this univariate technique (circles) compared to the LASSO LOD (dashed line) for Ni. Considering only the sensitivity (LOD) is necessary but not sufficient for evaluating model performance since low R² values are also problematic. Therefore, in order to incorporate both R² and LOD in the model assessment, we defined a score that combines the two. Based on this definition, a model that has a low LOD and a high R² is desirable. The LASSO score outperforms single-feature linear regression for Pb, but the two methods were comparable for Cu, Ni, and Cr (Fig. 13). Other studies have reported that univariate techniques performed better than multivariate ones (Castro and Pereira-Filho, 2016; Braga et al., 2010). In LASSO, this may be related to the cost function defined for the regression (Eq. 4). LASSO is a special case of the elastic net family, in which both L1 and L2 norms are combined in the cost function. Considering the cost function in Eq. (4), the goal of the model is to minimize the prediction error and the coefficient values (the L1 norm). This does not necessarily optimize the LOD; that is, cost function minimization does not correspond to LOD minimization. Considering Fig. 12, using features selected by LASSO in a univariate model may yield a better LOD than that obtained by LASSO alone. This might be an advantageous approach if the physical interpretation of the features is not as important as the detection of toxic metals.
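The single-feature comparison can be illustrated with a short calibration sketch. It assumes the common 3σ convention for the univariate LOD (three times the background standard deviation divided by the calibration slope), which is an assumption rather than a formula stated in the text. The feature names, intensities, and background readings are hypothetical, and the combined score is not reproduced here.

```python
import statistics

def linear_fit(x, y):
    """Least-squares line through (x, y); returns slope, intercept, R^2."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    ss_res = sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - my) ** 2 for yi in y)
    return slope, intercept, 1.0 - ss_res / ss_tot

def lod_3sigma(sigma_background, slope):
    """Assumed univariate LOD convention: 3 * sigma_B / |slope|."""
    return 3.0 * sigma_background / abs(slope)

mass_ng = [0.1, 1.0, 10.0, 100.0]  # deposited masses used in the study

# Hypothetical normalized intensities for two LASSO-selected features.
features = {
    "feature_A": [0.105, 0.150, 0.600, 5.100],
    "feature_B": [0.110, 0.120, 0.350, 2.000],
}
# Hypothetical background readings at the same feature.
background = [0.098, 0.102, 0.100, 0.101, 0.099]
sigma_b = statistics.stdev(background)

results = {}
for name, intensity in features.items():
    slope, intercept, r2 = linear_fit(mass_ng, intensity)
    results[name] = (lod_3sigma(sigma_b, slope), r2)
    print(f"{name}: LOD = {results[name][0]:.3f} ng, R^2 = {r2:.4f}")

best = min(results, key=lambda name: results[name][0])
print("lowest single-feature LOD:", best)
```

Repeating this fit for every feature selected by LASSO gives the set of single-feature LODs that can be compared against the multivariate LOD.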

Conclusion
A cost-effective spark emission spectroscopy instrument was designed and developed to quantify toxic metals targeted by the US EPA and the California Air Resources Board. Costly components such as the spark generation system and delay generator were developed in the lab to lower the overall cost. An unsupervised learning technique was employed to detect outlier spectra. The cleaned spectra set was fed into LASSO to predict the concentration of samples deposited on the ground electrode of the spark system from the spectra of the resulting plasma. A combination of LASSO feature detection with univariate regression might improve the detection limits. Our results illustrate the promise of cost-effective sensors combined with advanced machine learning techniques to provide data-driven solutions to traditionally challenging problems.
Data availability. Please contact the corresponding author for the data.
Author contributions. ASW conceived the idea and advised SAD on this work. SAD carried out the experiments. SAD also conceived of the machine learning approach to analyzing the spectra.
Competing interests. The authors declare that they have no conflict of interest.
Financial support. This research has been supported by the California Air Resources Board (CARB).
Review statement. This paper was edited by Francis Pope and reviewed by two anonymous referees.