A novel Mie lidar gradient cluster analysis method of nocturnal boundary layer detection during air pollution episodes

. The observation of the nocturnal boundary layer height (NBLH) plays an important role in air pollution and monitoring. Through 39 days heavily polluted observation experiments over Beijing (China) and exhaustive evaluation of gradient method (GM) , wavelet covariance transform method (WCT) and cubic roots gradient method (CRGM), a novel algorithm based on cluster analysis of gradient method (CA-GM) of lidar signal is developed to capture the multilayer structure and achieve stability in the nighttime . The CA-GM highlights its performance in comparison with radiosonde data, 10 the best correlation (0.85), the weakest root mean square error (203 m), and the improved 25 % correlation coefficient by the GM. In comparison with long-term experiments with other algorithms, a reasonable parameter selection can distinguish layers with different properties, such as the cloud layer, elevated aerosol layers, and random noise. Consequently, the CA-GM can automatically deal with the uncertainty of the multiple structures and obtain a stable NBLH with a high time resolution, which expected to contribute to air pollution monitoring and climatologies, as well as model verification.


Introduction
Air pollution has an important impact on human health, climatic patterns and the ecological environment (Shi et al., 2019).
The primary anthropogenic emission source is the particulate matters (PMs) which are the major source of severe haze in Beijing (Ma et al., 2019). Many passive and active remote sensing instruments have been combined to observe the aerosol optical and microphysical properties (Ji et al., 2018b;Wang et al., 2019), the relationship of PMs and meteorology (Li et al., 20 2019;Zhang et al., 2015), and the aerosol-atmospheric boundary layer height (ABLH) interaction (Dong et al., 2017;Su et al., 2020b). With the development in star and moon photometry, the continues day-to-night detection improved the columned-integrated aerosol properties at night. (Benavent-Oltra et al., 2019;Pérez-Ramírez et al., 2008). Nevertheless, the observations of the variation of nocturnal boundary layer (NBL) are few. Because the complexity of weak wind forcing, significant stratification, and intermittent turbulence (Weil, 2011), resulting in the accumulation of fine particles 25 continuously near-surface. The turbulence mixing process is a companion with the strong physio-chemical effect, which favors the formation of new-particles and induce a worse polluted episode (Hao et al., 2018;Wang et al., 2018). Therefore, acquiring the nocturnal boundary layer height (NBLH) in pollution cases accurately, especially during nighttime, is of great significance for air pollution. Multiple approaches have been developed to determine the ABLH on observation, including radiosounding, remote 30 sensing as well as parameterization based on a laboratory experiment (Su et al., 2020a). The lidar uses an aerosol as a tracer for mixing processes with high space and time resolutions. According to the research of Martucci et al (2007), the correlation coefficient is lower under stable conditions due to a complex aerosol structure that increases the difficulty of NBLH retrieval (Emeis and Schäfer, 2006;Martucci et al., 2007).During the nighttime, the NBLH determined by elastic lidar is either the top of the residual layer (RL) or the top of the surface mechanical driven mixing layer (Dang et al., 2019a). In the absence of 35 any external forces, aerosols in the atmosphere get stratified and result in single or multiple layers (elevated or advent aerosols), depending on the location and type of the atmospheric aerosols (Dudeja, 2019). The more complex vertical backscatter signal profile can also form in the specific environmental conditions such as cloud contamination and the local signal noise affection (Dang et al., 2019a;Stull, 1988).
The classical methodologies of lidar-retrieved algorithms are difficult in the identification of multilayer structure in night 40 polluted cases. The gradient-based method, such as first-order gradient method (GM) (Hayden et al., 1997), inflection point method (Menut et al., 1999), logarithm gradient method , and cubic root gradient method (CRGM) (Yang et al., 2017), are sensitive to noisy data unless averaging signal or lose some instantaneous useful information. The threshold method is too subjective to set a universal threshold on different weather and terrain (Frioud et al., 2003). The variance method (Hooper and Eloranta, 1986) is easily get infected by lofted aerosol layers and reduce the temporal 45 resolution because of calculating the profile of variance. The haar wavelet covariance transform (WCT) method (Davis et al., 2000) and the idealized backscatter method (Steyn et al., 1999) is more robust to noise, but it still can be infected by the lowlevel cloud and aloft aerosols (Caicedo et al., 2017). The START-2D (Haeffelin et al., 2012) performs well under multiplelayer situations, but the wide information to the thermodynamic state of atmosphere should be added to the best measure of the transition of the boundary layer. Some advanced methods like extend Kalman filter (Banks et al., 2014) and the 50 PathfinderTURB (Poltera et al., 2017) are proposed due to track the ABLH over time coherently and may jump between different atmospheric layers inadvertently. The layer categorization still remains one of the major sources of uncertainty related to the ABLH retrieval.
Departing from these previous efforts to estimate the ABLH, in this paper, we present a new approach on cluster analysis of gradient method (CA-GM) to overcome multilayer structure and remove the fluctuation of NBLH with raw data 55 resolution. This study proposes a reasonable parameter to reduce the interference of the cloud layer, elevated aerosol layers, and the local noise over the air pollution in the megacity region. The results evaluated by comparing with the nearby radiosonde site and confirming the long-term variations with the traditional methods in different atmospheric layers. N,116.32° E). The L-band radiosonde provided fine-resolution profile of temperature, pressure, relative humidity, wind speed and direction twice a day at 08:00 and 20:00 local standard time (LST) (Guo et al., 2016). Previous studies (Hennemuth and Lammert, 2006;Seidel et al., 2012) have used the radiosonde as a reference in detecting ABLHs for daily and annual changes in lidar measurement. We resampled the radiosonde data using linear interpolation to achieve a same 70 vertical resolution of lidar and compared to the 1-h average NBLH centred around the radiosonde launch times. As a result of the complexity of the transition during the morning and the early night, we combined relative humidity gradient (RHG) as a secondary reference which should have good consistency with the potential temperature (PTG) method (Seibert et al., 2000).

the BIT-lidar system 75
A single-wavelength Raman-Mie lidar is operated in the campus of Beijing institute of Technology, providing aerosol, cloud, ABLHs and temperature measurements. This lidar system has been continuously improved to capture the loading of aerosols (Chen et al., 2014;Ji et al., 2018a). The standardized range-corrected signal (RCS) is subjected to correlation lidar factor correction (correction of electronic noise error, background noise error, overlap factor) and distance correction. The backscatter coefficient can be calculated as the Fernald method (Fernald, 1984) and assumed lidar ratio as 70 sr (Rosati et al.,80 2016) due to the polluted continental aerosol particles. The detailed parameters of the BIT-lidar are listed at Table 1.

Weighted k-means cluster
The CA-GM is based on the assumption that the values of turbulence and aerosol concentration are significantly higher in the NBL than in the free atmosphere (FA). Due to the influence of the multilayer structure, the peaks of the minimum gradient of RCS is the potential location of the NBL. And assembling these distinguish peaks with height through the time into groups can be considered as space and time-averaged aerosol concentration. Therefore, it can solve the inadvertently 90 jump between different atmospheric layers. The theoretical schematic of the k-means clustering principle are shown in Where & ( ( ) is the value of the gradient of RCS, and ℎ & (ℎ ( ) is the height of the peak. 95 After that, we apply the k-means cluster algorithm to classify the datasets with those notable peaks. The cluster number should be given in prior, and the k-means method build cluster iteratively by moving centroid until the target function sum of squared errors (SSE) towards local minimum (Toledo et al., 2014), the SSE calculated following Eq. (2).
Where is a number of clusters, & and is the cluster centroid and all observation in the cluster ,respectively. In order to 100 obtain the accurate in compact and well-separated clusters, the criteria in the cluster validation are necessary. Davies-Bouldin index (Davies and Bouldin 1979) is conducted to the cluster validation analysis and defined as Eq. (3).
Where ? is the averaged intra-distance between of observations and their cluster centriod, ( & , ( ) is the distance between cluster centriod & and ( . The minimum DB index is considered as an optimal cluster classification. 105 The standard k-means must be normalized in the case that the variable is quite a difference, data normalization is based on min-max normalization (Virmani et al., 2015). The normalized k-means clustering is "isotropic" in all directions of space 110 and tends to capture a spherical-like shape. Nevertheless, in this paper, we proposed to put weight on height and leaving variances greater along with height. Therefore, the assembling groups of the distribution will tend to be separated along variables with greater variance, which conducive to set the top limiter altitude to classify different atmospheric layers vertically (Figure 2 b). The weighted is calculated by the difference between maximum altitude ℎ PQ< and the minimum altitude ℎ P&R , the weighted height ℎ S is rescaling by the normalized height data ℎ S as Eq.(4). 115

Multiple layers classification
As the presence of the strong gradient signature of the backscatter profile, we perform to collect 3 minima peaks of GM distribution within an hour and classify the different layers on their characteristic (Figure 2 a). The cloud layer (CL) has a lager gradient magnitude of extinction and backscattering coefficient than the aerosol layers (Palm et al., 2012). Additionally, the typical nocturnal clouds are shallow cumulus (Cu), stratocumulus (Sc), and stratus (St) (Kotthaus and Grimmond, 2018). 125 They have shallow vertical dimensions and denser than aerosols at the same altitude, hence, it can be distinguished from the aerosol layer (Wang and Sassen, 2001). Meanwhile, the noise from the GM could be affected by background and electronic noise, it has non-regular distribution and appeared at higher altitude where has lower signal-to-noise ratio. The noise layer is lack of starfield structure but have a similar GM value with the lower height. Thus, we calculate the range of vertical extension of different layers, indicating the cluster significance of noise and other layers. As for elevated aerosols (EALs), 130 the presence of EALs above the NBL represent a difficulty when retrieving the NBL top height, in particular when EALs layers are close to it. Both of the two aerosol layers have a similar characteristic of gradient variance and range of height, we discover by seeking the empirical threshold value of the EALs in backscatter coefficient (Hänel et al., 2012). The typical backscatter threshold for 532 nm wavelength lidar is defined as XY = 1.786 × 10 ab km a6 a6 , which is calculated by Ångstöm parameter as 1.2 in the urban-industrial and mixed condition (Dubovik et al., 2002). The gaps between NBL and 135 EALs in the multilayer structure are determined to XY = 100 (Peng et al., 2017).

The implement of CA-GM algorithm
The CA-GM method based on k-means cluster analysis of different types of atmospheric layers is generally used to retrieve multiple layers in polluted cases. The specific ideas are shown in the flowchart in Figure 3, and the specific steps are as follows, detailed results are presented as a case study. 140 The algorithm is divided into three parts as the pre-processing, the layer attribution classification, and the NBL inspection.
The CA-GM algorithm is implemented if the data collection exceeds 30 minutes within an hour period. Firstly, The standardized lidar RCS is applied Savitzky-Golay filter to preliminary denoising. The profile of the backscatter coefficient ( ) is calculated, and the reference height (ℎ Vfg ) is limited by the Fernald method as the theoretically height-limiter. The * is a dataset of three gradient minima of RCS. Set cluster as two in prior, implement 10 times k-means cluster to seek for the 145 minimum DB index as the optimal grouping and . Second, there is a parameter &RXVQ defined as the minimum distance of inter-cluster, which can measure the cluster stratified significance to classifying the cloud and noise mixed in the * .
If &RXVQ exceeds the threshold ijk , it can distinguish noise from other layers. The >& and >( is as a quality control function to noise layer attribution. As for the cloud layer, the vertical extension Y nJ ( Y nL ) of the cloud is shallower than aerosol layers, therefore, we defined an empirical constant C for this study. In addition, the vertical uniformity parameter ( ) 150 works as quality control tools to features of clouds and other layers as well. If cloud layer or noise exists, the original * are removed from the top limiter as ℎ MpUqr or ℎ RU&sf , respectively. After the elimination of cloud and noise interference, the elevated aerosol layers (EALs) can be judged from the typical aerosol layer tu and the gaps distance of XY . Finally, the newly dataset * which has been removed from the different attributed layer goes to the final step with the cluster as and (or v and v ). Due to the assumption of NBL distribution, the largest deviation is indicating the location of the NBLH 155 (ℎ Rwp ).
In summary, through 1-2 times k-means cluster analysis of the vertical-temporal gradient of the GM within an hour, the multilayer NBL structure can be separated according to their physical characteristics of different layers. The CA-GM method is an objective and robust method for judging the attribution of different layers (NBL, EALs, CL) and noise.

Evaluate with radiosonde data
The L-band radiosonde provided accurate thermodynamic profiles, the radiosonde-determined NBLHs were used to evaluate the accuracy of the lidar-retrieved NBLHs. Comparing with two-moment radiosonde with all algorithms, it highlighted a 170 good agreement that the correlation coefficients (R) were range from 0.68-0.85. The CA-GM had the highest consistencies among classical methods, with the highest correlation coefficient (0.85), the weakest root mean square error (RMSE) (203 m), the smaller mean bias (28 m), and the minimum mean relative absolute difference (PRD) (17 %) ( Table 2).

Figure 4. Comparison between the radiosonde-determined and lidar-retrieved measurement of NBLH in gradient method (GM) (hollow circle) and cluster analysis of gradient method (CA-GM) (filled circle). The correlation coefficients is represented by R.
The black solid line is the 1:1 line.
The NBLH retrieved by GM and CA-GM ( Figure 4) had a good correlation of the radiosonde and the latter method enhanced 180 25% consistency of the correlation coefficient. With the implementation of CA-GM, the data were concentrated and reduced the RMSE from 292 m to 203 m ( Table 2). The means bias of GM is greater than the CA-GM, corresponding the decease of PRD from GM to CA-GM. Additionally, comparing with the WCT and CRGM, the WCT underestimated NBLH about 13 m, and the CRGM shown to overestimate the altitude as 186 m. The RMSE of CA-GM is smaller than WCT and CRGM, which was similar to the result of PRD. Therefore, the CA-GM had a good correlation with radiosonde and had the smallest 185 fluctuation and highest consistency with NBLH retrieval.

Compared with other classical methods
To enrich our analysis, a comparison of CA-GM with GM, WCT, and CRGM in the 39-day nighttime period was applied in the supplement weakness of the rare time resolution of the radiosonde. The results were shown in Figure 5 as follows. The valid CA-GM data was implemented for a total of 256 hours, the data were analyzed to compare with other retrieval 200 algorithms. Under the condition without the infatuations of multiple layers, the CA-GM had a good correlation as 0.90, 0.70 and 0.82, for GM, WCT, and CRGM, respectively, and the RMSE was the smallest compared with other situations.
Consequently, the CA-GM was more similar to the other three methods in the case of without multilayer structure, which proved reasonable of CA-GM with classical boundary layer retrieved method.
Moreover, the extensive results showed that the WCT method was more accurate than GM during nighttime (Caicedo et 205 al., 2017) and less affected by the low signal-to-noise ratio condition (Brooks, 2003). The dilation and the threshold of WCT were selected carefully in this study (Mao et al., 2013), thus, the performance of WCT can ensuring the certify the identification of the cloud layer and noise. Notably, the improvement of the correlation coefficient from 0.70 to 0.84 in cloud contamination and from 0.70 to 0.72 in noise affection, which was conducted the better consistency to remove those attributed layers. Whereas the fluctuations in noise and cloud layers are relatively large, the CA-GM performed outstanding 210 ability for cloud removal the eliminate noise. As for EALs, because of the ambiguous cluster of the EALs and NBL, all the methods had poor correlation coefficient with CA-GM. The observation of EALs is the most challenging part for the multiple layer structure, more active remote sensing instruments (such as multi-wavelength lidar, polarized lidar) and methods are needed to prove the accurate layout of the EALs. Table 3 shown the criteria parameters in the CA-GM. The cluster significant parameter &RXVQ for noise was 20.75±14.62 215 m, which is significantly smaller among other conditions. The typical altitude of NBL, EALs, cloud, and noise in severe haze pollution is 590.49±202.84 m, 1024.69±166.36 m, 1252.52±303.28 m, and 1100.66±253.04 m, respectively. The vertical extension of the cloud layer was indeed shallower than the other layer, with the typical extension of 128.6±82.13 m. The backscatter coefficient of EALs is 1.12±0.76×10 -3 km -1 sr -1 , which was an evidence of choosing the suitable empirical value β tu . The cloud had the smallest value of the vertical uniformity, which indicated denser peak distribution than other layers. 220

Effects of cloud contamination
On 23 Dec 2016, there was a cloud layer at 1.3 km above ground level (AGL) between 18:00-23:00 LST ( Figure. 6-1), 225 which was presented as a light blue region. Below the cloud base, there was a distinct surface aerosol layer and a strong signal negative gradient, indicating as the WCT method capture. The cloud seriously influences the GM and CRGM determination and capture the upper edge of the cloud. After 21:00 LST, the cloudiness decrease, the lidar can capture the NBL signal. After defining the minimum in the upper-cluster ( & ) as the top limiter altitude, the CA-GM was able to capture the slowly increased NBLH as shown in Figure 6-1. Figure 6-2 was shown the significant two-layer structure distribution 230 hourly on the first k-means cluster distribution. The centroid of two-layers was indicated the approximate location at 839 m and 1428 m ( Figure 6-2 b), and the cloud located at the upper layer which had a shallow vertical extension and relative dense distribution. The radiosonde had a good correlation with lidar-retrieved NBLH in Figure 6-3, the PTG was shown the steepest slope at 1.37 km, but correspond the height at the location of cloud. Therefore, we selected NBLH using the RHG method as 0.78 km, which was smaller than CA-GM retrieved height at 20:00 LST. 235

Noise affection 245
On 6 Apr 2017, the noise distribution was prone to appear when the low-load aerosol for the GM. The gradient-based methods were effected by noise and with a wide range of fluctuations (Figure 7-1). On the contrary, the WCT captured well the edge of the concentration of aerosols. From the distribution of GM with height distribution, the Figure 7-2 was seen the obvious mixing without starfield layer structure. Therefore, the noise was mixed in the upper layer of the centriod at height at 1479 m, 1452 m and 1173 m at 19:00, 20:00 and 21:00, respectively, which should set the top limiter and recalculate the 250 NBLH hourly. The radiosonde data was calculated by the rapid change of PTG method as 0.79 km (Figure 7-3), corresponding to the height retrieval by with CA-GM as 0.74 km.

Nocturnal aloft aerosol layer
On 2 Jan 2017, the EALs frequently appeared in the lower troposphere. There was a distinct aerosol layer between 0.7-1.2 260 km AGL between 17:00-22:00 LST ( Figure. 8-1). Without any limitation, the GM, CRGM, and WCT captured the height of EALs when the signal negative gradient at the EALs is stronger than the NBL，corresponding to the aloft aerosol structure from 17:00-22:00 LST. As the Figure 8-2 was shown, the distinct two peaks of the cluster were the two aerosol layers, deviation of the upper was lager at first and two layers gradually almost the same gradient magnitude with the time transition.
The centroid of upper-cluster upward is another evidence of the NBLH with a topped EALs. After using the CA-GM method 265 to limit the base of the lofting aerosol layers, the effect of EALs in the polluted causes can be separated successfully.
Similarly, in Figure 8-3, the first gradient maxima above the surface inversion layer (SIL) is the NBLH, and both PTG and RHG had a good consistency and the NBLH is at 0.44 km. The other peak with PTG and RHG corresponding to the height of EALs.

Discussion and conclusion
The elastic lidars are superiority instruments with which to determine nighttime boundary layer height with high space and 280 vertical resolution. Multiple layer structures in severely polluted cases impeding buoyancy forces and influence pollutants dispersed and diluted. In this paper, the novel algorithm of CA-GM was automatics developed to capture the multilayer structure and achieve stability in the nighttime with high time resolution. A 39 days heavily polluted observation experiment over Beijing (China) thoroughly evaluate the limitation of the current methods for boundary layer height determination and developed an algorithm suitable to pollution conditions. 285 Overall, the CA-GM highlights its high performance in comparison with radiosonde data, the best correlation (0.85), and the weakest root mean square error (203 m) and improved 25 % correlation coefficient of the GM. The possible deviations are due to the different definitions of thermodynamic NBLs from radiosondes and the aerosol NBLs. The sounding data are multi-layered as well due to the affection of aerosol layers and cloud layers, the radiosonde-retrieved NBLH should combine the PTG and RHG method to discuss the uncertainty of NBLs in the pollution period. The calculation of three minimum 290 gradients can obverse the potential starfield layer structure, which provides a worst-case scenario for estimating surface concentrations of pollutants released into the NBL. In comparison with long-term performance with other algorithms, a reasonable parameter selection can distinguish different atmospheric layers such as cloud layer, elevated (or advected) aerosol, and random noise. The &RXVQ , Y n , ,and provide a novel idea of classification multiple layers on their physical characteristic, which is more objective to automatic clustering in a complex condition. The correlation coefficient with CA-295 GM and WCT was elevated correlation coefficient from 0.7 to 0.84, 0.7 to 0.72 in cloud and noise affection, which proved the ability of CA-GM can ensure the upper edge of low-level cloud and remove the random noise. The EALs are often located at the top of NBLs with a similar characterise of the NBLs, thus, using the empirical threshold on a single wavelength Mie-lidar is a good way to classify EALs in polluted cases. Consequently, the CA-GM can deal with the uncertainty of the multi-layered structure and obtain a stable NBLH with a high time resolution, which expected to 300 contribute to long-term observation of the single wavelength lidar system and micro pulse lidar monitoring in air pollution.
Code/Data availability. Contact csy@bit.edu.cn for data requests.
Author contribution. Y.Z and S.C analyzed the experimental data and co-wrote the paper. H.C and P.G supported the 305 experiment and for maintenance of BIT-lidar. SY.C designed and lead the study. All co-authors discussed the results and commented on the manuscript.
Competing interests. The authors declare that they have no conflict of interest.