Deriving boundary layer height from aerosol lidar using machine learning: KABL and ADABL algorithms

The atmospheric boundary layer height (BLH) is a key parameter for many meteorological applications, including air quality forecasts. Several algorithms have been proposed to automatically estimate BLH from lidar backscatter profiles. However, recent advances in computing have enabled new approaches using machine learning that are seemingly well suited to this problem. Machine learning can handle complex classification problems and can be trained by a human expert. This paper describes and compares two machine-learning methods, the K-means unsupervised algorithm and the AdaBoost supervised algorithm, to derive BLH from lidar backscatter profiles. The K-means for Atmospheric Boundary Layer (KABL) and AdaBoost for Atmospheric Boundary Layer (ADABL) algorithm codes used in this study are free and open source. Both methods were compared to reference BLHs derived from colocated radiosonde data over a 2-year period (2017–2018) at two Météo-France operational network sites (Trappes and Brest). A large discrepancy in the root-mean-square error (RMSE) and in the correlation with radiosondes was observed between the two sites. At the Trappes site, KABL and ADABL outperformed the manufacturer's algorithm, while the performance was clearly reversed at the Brest site. We conclude that ADABL is a promising algorithm (RMSE of 550 m at Trappes, 800 m for the manufacturer) but has training issues that need to be resolved; KABL has a lower performance (RMSE of 800 m at Trappes) than ADABL but is much more versatile.

and is pursued here using the K-means for Atmospheric Boundary Layer (KABL) algorithm. KABL has been extensively tested and is shared via an open-source code. In addition, we test an alternative adaptive boosting (AdaBoost) machine-learning algorithm, the AdaBoost for Atmospheric Boundary Layer (ADABL) algorithm. Both algorithms classify whether the measurement points are inside or outside of the boundary layer; however, ADABL learns the characteristics of both groups from a training set. The training set consists of atmospheric boundary layer identifications made by human experts, which is acknowledged as being more reliable than the available automatic methods (Seibert et al., 2000). Algorithms classifying from a reference dataset (e.g., ADABL) are called supervised algorithms, while algorithms classifying without a reference dataset (e.g., KABL) are called unsupervised algorithms. Supervised algorithms make it possible to automatically reproduce human expertise in boundary layer identification. To our knowledge, this is the first time that a supervised algorithm has been applied to this problem. This study is of practical interest because it includes the publication of the source code, which only uses free software.
In Sect. 2, we describe the data used in this study, i.e., the lidar data in the algorithm inputs, the reference radiosonde data, and the ancillary data used to sort the meteorological conditions. In Sect. 3, we describe the two machine-learning algorithms (KABL and ADABL) and the procedures used to evaluate them. In Sect. 4, we present the results of our study, which consist of a sensitivity analysis of the KABL algorithm, a comparison of the methods with the radiosonde data over a two-year period, and a case study. In Sect. 5, we discuss the results, limitations, and prospects of our study. The final section is dedicated to the conclusions that can be drawn from our study.

2 Material
Our study used data from the Météo-France operational network. We used colocated radiosonde and aerosol lidar data at two sites: Brest (a coastal city in north-western France) and Trappes (an inland suburban area near Paris). The dataset spanned two years: 2017 and 2018. A case study was conducted on April 19, 2017, at the Trappes site.

Data processing
The MiniMPL instrument acquires profiles of atmospheric backscattering at high frequency (2500 Hz) using a low-energy pulse (3.5 µJ) emitted by a Nd:YAG laser at 532 nm. The profiles are acquired in photon-counting mode and, in our present configuration, averaged over 5 min in 30-m vertical resolution bins. The instrument uses a monostatic coaxial design in which the laser beam and the receiver optics share the same axis. Because of geometrical limitations, only a fraction of the signal can be recovered in the near field. Therefore, in our system, the first usable data are available at 120 m above ground level.
The instrument has polarization capability, collecting backscattered photons in two channels, with the measured raw signals in the "copolarized" and "cross-polarized" channels suffixed co and cr, respectively (the instrument uses both circular and linear depolarization; see Flynn et al. (2007) for more details). These raw signals are processed to obtain the quantity of interest, i.e., the range-corrected signal (RCS), which is also called the normalized relative backscatter. This processing consists of several procedures, including background, overlap, afterpulse, and dead-time corrections. A comprehensive description of the processing is given in Campbell et al. (2002). The "copolarized" and "cross-polarized" range-corrected signals, RCS_co and RCS_cr, respectively, as delivered by the manufacturer's software, are used as predictors for the machine-learning algorithms described in Sect. 3.

The raw data type and format depend on the instrumental device used. To make the algorithms usable on other devices, we converted the files to a harmonised format using the raw2l1 software; we then used these files as the algorithm input.

Radiosonde data
The algorithms were evaluated with respect to estimates derived from radiosonde (RS) profiles. Météo-France operates several RS sites for the World Meteorological Organization Global Observing System. Two RS sites are colocated with the lidars at Brest and Trappes. These sites are equipped with Meteomodem robotsondes and typically launch a Meteomodem M10 sonde at 11:15 UTC and 23:15 UTC every day.
Many methods exist to estimate BLH from RS data, several of which have been used in the literature. Some of these methods are listed below.
-Parcel method: BLH is the height at which the profile of the potential temperature θ reaches its ground value.

-Humidity gradient method: BLH is the height at which the gradient of the relative humidity is strongly negative.
-Bulk Richardson number method: BLH is the height at which the bulk Richardson number exceeds 0.25 (this threshold varies among studies).
-Surface-based inversion: BLH is the height at which the gradient of the temperature profile reaches zero.
-Stable layer inversion: BLH is the height at which the gradient of the potential temperature profile reaches zero.

Following the recommendations in Figure 10 of Seibert et al. (2000), we chose to compute BLH using the parcel method for the 11:15 UTC sounding and the bulk Richardson number method for the 23:15 UTC sounding; we refer to this estimate as BLH-RS from now on.
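For concreteness, the two retained methods can be sketched as follows, assuming the RS profile is available as NumPy arrays (height z in m, virtual potential temperature theta_v in K, wind components u and v in m/s); the function names and the interpolation-free index search are illustrative, not the operational Météo-France processing.

```python
import numpy as np

G = 9.81  # gravitational acceleration (m s-2)

def blh_parcel(z, theta):
    """Parcel method: first height where the potential temperature
    exceeds its surface value (no interpolation refinement here)."""
    above = np.nonzero(theta[1:] > theta[0])[0]
    return z[above[0] + 1] if above.size else np.nan

def blh_bulk_richardson(z, theta_v, u, v, threshold=0.25):
    """Bulk Richardson number method: first height where Ri_b exceeds
    the threshold (0.25 here; this value varies among studies)."""
    with np.errstate(divide="ignore", invalid="ignore"):
        ri_b = (G * (z - z[0]) * (theta_v - theta_v[0])
                / (theta_v[0] * (u**2 + v**2)))
    above = np.nonzero(ri_b > threshold)[0]
    return z[above[0]] if above.size else np.nan
```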

Ancillary data
Ancillary data were used to describe the meteorological situation at the observation sites. These data were not used by the machine-learning algorithms. All the instruments were colocated with the lidar and radiosonde launches.
-Rain gauges were used to detect rain events.
-Vaisala Ceilometer CL31 instruments were used to detect the cloud base height and distinguish cases with clouds on top of, or inside, the boundary layer. Even though MiniMPL is capable of detecting clouds, we relied on the CL31 cloud detection because the MiniMPL algorithm was found to report non-existent clouds.

-Scatterometers were used to estimate the horizontal visibility and detect the occurrence of fog.
3 Machine-learning methods

Supervised learning method
Supervised methods learn from a reference. Such methods are divided into two families: classification, which aims to find the frontiers between groups, and regression, which aims to approximate a function. In this study, we treat the BLH estimation as a classification problem in which we classify the lidar measurement at each range gate as belonging to either the 'boundary layer' or the 'free atmosphere'. The highest point of the 'boundary layer' class then gives the BLH estimate. Several supervised algorithms were compared to maximize accuracy (see Sect. 3.1.3); here, we describe AdaBoost, the algorithm selected for this study. Boosting algorithms are a very powerful family of algorithms that were developed for classification but can also be used for regression (Hastie et al., 2009). In particular, the AdaBoost algorithm is designed for binary classification (Freund and Schapire, 1997) and is therefore well suited to our problem.

AdaBoost algorithm
Let us consider the following problem. We have N vectors x_i ∈ R^p (here, the number of predictors is p = 4: seconds since midnight, height above ground, copolarized channel, and cross-polarized channel), and for each vector, we have a binary indicator y_i ∈ {−1, 1} (−1 for 'boundary layer', 1 for 'free atmosphere'). From the sample (x_i, y_i), i ∈ [[1, N]], where [[1, N]] is the set of integers from 1 to N, we want to predict the output indicator y_new of any new vector x_new. To do so, we must find a rule based on the coordinate values of x_new (the features) to cast it into the appropriate class. Decision tree classifiers (Breiman et al., 1984) perform this casting one feature at a time. For example, in Figure 2, there are black and white points in a two-dimensional space. The black points are mostly located where X_1 is low, hence the rule "if X_1 < t_1, then the point is black." However, in the other region, where X_1 > t_1, there are still some black points, all with low X_2. Therefore, we add the rule "if X_2 < t_2, then the point is black, else it is white." Decision trees are classifiers made up of such "if" statements with various depths and thresholds. The deeper the tree, the more accurate the border, but the more complex the decision and the longer the training. Deep trees are strongly subject to overfitting and are less efficient than other methods. However, shallow decision trees are valuable because of their simplicity and speed, even though their performance is quite limited (Hastie et al., 2009). They are often used as weak learners, that is, very simple classifiers with poor (but better than random) performance (Freund and Schapire, 1997). In this study, the weak learners in AdaBoost are trees with a maximum depth of five (a maximum of five forks between the root and the leaves).

AdaBoost is based on decision tree classifiers: it aggregates these classifiers to determine the most accurate border. The concept behind AdaBoost is illustrated in Figure 3. First, a shallow decision tree is fitted to the entire dataset using the Classification and Regression Tree (CART) algorithm (Hastie et al., 2009). All points have the same weight in this first step. Some points in the dataset are misclassified, and the error of the classifier is the weighted average of the misclassified points. Another shallow decision tree is then fitted on a resampled dataset where the previously misclassified points are over-represented. This new tree has new misclassified points that will be over-represented in the training of the next tree, and so on, up to the specified number of trees (M = 200 in our case). The detailed algorithm is described in Hastie et al. (2009), algorithm 10.1, and in Schapire (2013). Such an algorithm needs to be trained using a trustworthy reference. On days when the boundary layer is easily visible to a human expert, the top of the boundary layer can be drawn by hand; all points below this limit are in the class 'boundary layer' and all points above this limit are in the class 'free atmosphere'.
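With scikit-learn, such a classifier can be assembled in a few lines; this is a minimal sketch of the configuration described above (depth-5 weak learners, M = 200 boosting rounds), not necessarily the exact ADABL source code.

```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# Weak learners: shallow decision trees with a maximum depth of five.
weak_learner = DecisionTreeClassifier(max_depth=5)

# AdaBoost aggregates M = 200 such trees; at each boosting round, the
# points misclassified by the previous tree are given more weight.
adabl = AdaBoostClassifier(estimator=weak_learner,  # 'base_estimator' in older scikit-learn
                           n_estimators=200)

# X: (N, 4) predictor matrix (time, altitude, RCS_co, RCS_cr);
# y: labels in {-1, 1} (-1 for 'boundary layer', 1 for 'free atmosphere').
# adabl.fit(X, y); y_new = adabl.predict(X_new)
```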
In this study, two days were classified by hand. These two days were chosen because the boundary layers were easily visible, and they come from different sites in different seasons. The first hand-classified day was a clear summer day in Trappes, shown in Figure 4 (left): a stable boundary layer is present near the ground during the night, topped by a residual layer and a few clouds between 02:00 UTC and 04:00 UTC. A mixed layer started to develop at 09:00 UTC and remained at approximately 2000 m for the rest of the day. At approximately 22:00 UTC, a new stable layer appeared to develop near the ground; however, it is not very clear where this layer started or what its extent was. The second hand-classified day was a clear winter day in Brest, shown in Figure 4 (right): a stable boundary layer was present near the ground during the night, topped by a residual layer, which was shallower than the layer observed at the Trappes site. The mixed layer started to develop at 08:00 UTC and remained at approximately 1000 m, with the height of the layer gradually decreasing throughout the day. At approximately 17:00 UTC, aerosols appeared to accumulate in a thin layer close to the ground; therefore, we chose to select the top of this thin layer as BLH. The coordinates of the points on the hand-drawn BLHs were obtained using the Visual Geometry Group Image Annotator software. The output curves were then interpolated with a cubic spline to match the lidar temporal resolution. Given the resolution of the lidar, this labeling method yields N = 86,400 individuals in total (the product of 288 profiles per day, 150 vertical range bins, and two labeled days).
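A sketch of how such annotations can be turned into training labels, assuming the hand-drawn (time, BLH) points are available as arrays; the cubic-spline interpolation and the two-class convention follow the text, while the function itself is ours.

```python
import numpy as np
from scipy.interpolate import CubicSpline

def labels_from_hand_drawn(t_points, blh_points, t_lidar, z_lidar):
    """Interpolate the hand-drawn BLH onto the lidar time grid with a
    cubic spline, then label each (time, altitude) gate: -1 below the
    curve ('boundary layer'), +1 above it ('free atmosphere')."""
    blh = CubicSpline(t_points, blh_points)(t_lidar)  # one BLH per profile
    below = z_lidar[None, :] <= blh[:, None]          # (N_t, N_z) mask
    return np.where(below, -1, 1).ravel()             # N = N_t * N_z labels
```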

Retained configuration
Four predictors were used: the two lidar channels, time (number of seconds since midnight), and altitude (meters above ground level). The retained ADABL configuration was: weak learner, decision tree of depth five; number of weak learners, 200; predictors, time, altitude, RCS_co, and RCS_cr.

This configuration was chosen because more complex classifiers do not necessarily improve the performance. The computation time of the algorithm remains reasonable: training took 23 s on the full dataset and predicting BLH for a full day took 3.7 s on a modern laptop. AdaBoost was chosen after a comparison of multiple classification algorithms, namely random forest, nearest neighbors, decision trees, and label spreading (study not shown here). The benchmark score was the accuracy, measured as the percentage of individuals that were correctly classified. The accuracy was estimated by group K-fold cross-validation, where the labeled dataset was grouped into chunks of three consecutive hours; one group was used as the testing set and all the others as the training set, and this operation was repeated until each group had been used as the testing set. The resulting accuracy was 96%.
However, this figure overestimates the generalization ability of AdaBoost. A more correct estimation would be obtained with an independent validation set (e.g., a new hand-classified day). An independent validation set was not used here because the cross-validation accuracy was only used to discriminate between the classification algorithms.
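The grouped cross-validation can be reproduced with scikit-learn's GroupKFold, as sketched below; building the group labels from the time predictor and a day index is our construction, not necessarily the ADABL code.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GroupKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Assumed available: X (N, 4) with seconds since midnight in column 0,
# y in {-1, 1}, and day_index (0 or 1 here) for the two labeled days.
chunk = (X[:, 0] // (3 * 3600)).astype(int)  # 3 h chunk within each day
groups = day_index * 8 + chunk               # unique group id across days

clf = AdaBoostClassifier(DecisionTreeClassifier(max_depth=5), n_estimators=200)
cv = GroupKFold(n_splits=len(np.unique(groups)))  # each group tested once
accuracy = cross_val_score(clf, X, y, cv=cv, groups=groups,
                           scoring="accuracy")
print(accuracy.mean())  # the text reports 96 %
```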

It is possible to quantify the relative importance of the predictors (Breiman et al., 1984; Hastie et al., 2009). After training, the relative importance of the time, RCS_co, RCS_cr, and altitude predictors was 30.3%, 28.4%, 26.5%, and 14.8%, respectively.
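With scikit-learn, this diagnostic is directly available on the fitted classifier; a sketch reusing the clf, X, and y objects from the cross-validation sketch above, with the column ordering assumed there.

```python
clf.fit(X, y)  # fit on the full labeled dataset
for name, importance in zip(["time", "altitude", "RCS_co", "RCS_cr"],
                            clf.feature_importances_):
    print(f"{name}: {100 * importance:.1f} %")
```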

Unsupervised learning methods
Unsupervised methods aim to identify groups in the data. In our case, we want to identify the group 'boundary layer'. The BLH estimate is then the upper boundary of this group. Two unsupervised learning algorithms were tested: K-means and expectation-maximization (EM).

K-means algorithm
The K-means algorithm is a well-proven and commonly used algorithm for data segmentation (Jain et al., 1999; Pollard et al., 1981) and consists of three steps, where K is the number of clusters specified by the user.
1. Initialization: K centroids m_1, ..., m_K are initialized at random locations inside the feature space.
2. Assignment: each point is assigned to the cluster of its nearest centroid.
3. Update: each centroid is moved to the mean of the points assigned to its cluster.
Steps 2 and 3 are repeated until the centroids stop moving. It has been shown that this algorithm converges to a local minimum of the intra-cluster variance (Selim and Ismail, 1984). Figure 5 (left) illustrates this algorithm.

EM algorithm
The EM algorithm assumes that each group k ∈ [[1, K]] is generated by a Gaussian distribution N(µ_k, Σ_k). The algorithm iteratively estimates the parameters µ_k and Σ_k and the responsibilities γ_ik, where the responsibility is the probability of the point x_i being generated by the k-th Gaussian. Points are then attributed to the group with the highest responsibility. Figure 5 (right) illustrates this algorithm. The K-means and EM algorithms are very similar: if we assume that all Gaussian distributions have the same fixed variance and that this variance tends to zero, the EM algorithm reduces to K-means. However, K-means does not rely on a Gaussian assumption.
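Both algorithms are available in scikit-learn; below is a minimal sketch on a placeholder predictor matrix (the real X is built as described in Sect. 3.3).

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

K = 3  # number of clusters
X = np.random.default_rng(0).normal(size=(450, 2))  # placeholder predictors

# K-means: hard assignment of each point to the nearest centroid.
labels_km = KMeans(n_clusters=K, n_init=10).fit_predict(X)

# EM on a Gaussian mixture: soft responsibilities gamma_ik, then each
# point is attributed to the component with the highest responsibility.
gmm = GaussianMixture(n_components=K).fit(X)
responsibilities = gmm.predict_proba(X)      # (N, K) matrix of gamma_ik
labels_em = responsibilities.argmax(axis=1)  # same as gmm.predict(X)
```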

Flowchart and description of KABL parameters
A simplified flowchart of KABL and ADABL is shown in Figure 6. This section focuses on the KABL parameters to introduce the sensitivity analysis made in Sect. 4.1. The parameters of the KABL software are detailed here.
-algo: The applied machine-learning algorithm. Possible values are:
  -'gmm' for the EM algorithm (Gaussian mixture model); and
  -'kmeans' for the K-means algorithm.

-classif_score: The internal score used to automatically choose the number of clusters (only used when n_clusters = 'auto'). See Sect. 3.4 and Table 1 for a description of the internal scores.
-init: The initialization strategy for both algorithms. Three choices are available:
  -'random': randomly pick an individual as the starting point (both K-means and EM);
  -'advanced': use a more sophisticated initialization (kmeans++ for K-means (Arthur and Vassilvitskii, 2007) and the output of a K-means pass for EM); and
  -'given': start at explicitly selected point coordinates.
-max_height: The height (meters above ground level) at which the profiles are cut.
-n_clusters: The number of clusters to be formed (between two and six). This is either explicitly given or determined automatically to optimize the score given in classif_score.

-n_inits: The number of repetitions of the algorithm. When this number is larger, the algorithm is more likely to find the global optimum but requires more time.
-n_profiles: The number of profiles concatenated prior to the application of the algorithm. For example, if n_profiles = 1, only the current profile is used. If n_profiles = 3, the current profile and the two previous profiles are concatenated and input into the algorithm.

-predictors: The list of variables used in the classification. These variables can be different at night and during the day. For both time periods, the variables can be chosen from:
  -RCS_co: the copolarized range-corrected backscatter signal; and
  -RCS_cr: the cross-polarized range-corrected backscatter signal.
The parameters of the KABL software are highlighted in bold in the following explanation of the KABL algorithm. A netCDF file generated by the raw2l1 software needs to be provided as input data to KABL. The data, namely, the altitude vector z (size N_z), the time vector t (size N_t), and the range-corrected signals RCS_co and RCS_cr (N_t × N_z matrices), are extracted from this file. These data are prepared to fulfil the machine-learning algorithm requirements. For each time, the n_profiles last profiles are extracted. Then, the data they contain are normalized (by removing the mean and dividing by the standard deviation); this provides a matrix X (N × p, where N = n_profiles · N_z and p = |predictors| is the number of elements in the list). The matrix X is the usual input for a machine-learning algorithm; it has one row for each individual observation and one column for each variable (or predictor) observed. For the BLH retrieval, the preparation also provides a vector Z (size N) containing the altitude of each individual observation. The algorithm (either K-means or EM, as specified by algo) is applied to the matrix X, with the parameters n_clusters, init, and n_inits. This results in a vector of labels (size N) that contains the cluster attribution of each individual. Finally, since by definition the boundary layer is the layer directly influenced by the ground, we look for the first change in the cluster attribution, starting from the ground level. This gives us the value of BLH for this profile. These operations are repeated until reaching the end of the netCDF file.
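A condensed sketch of this retrieval chain for a single profile (n_profiles = 1), paraphrasing the description above rather than reproducing the exact KABL source.

```python
import numpy as np
from sklearn.cluster import KMeans

def blh_one_profile(rcs_co, rcs_cr, z, n_clusters=3):
    """Normalize the predictors, cluster them, and return the altitude
    of the first change in cluster attribution starting from the ground."""
    X = np.column_stack([rcs_co, rcs_cr]).astype(float)
    X = (X - X.mean(axis=0)) / X.std(axis=0)         # normalization
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(X)
    change = np.nonzero(labels != labels[0])[0]      # first label change
    return z[change[0]] if change.size else np.nan   # NaN: single cluster
```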

Performance metrics
Two types of metrics were used.
-External scores: These metrics compare the result to a trustworthy reference. They have the advantage of providing a meaningful evaluation of the performance but depend strongly on the quality of the reference (i.e., its accuracy and availability).
-Internal scores: These metrics rate how well the classification performs based only on the distances between points. They have the advantage of being always computable but are not linked to any physical property and therefore are not always meaningful.

None of these metrics are perfect; however, the information they provide allows a broader understanding of the algorithm performance.

External scores
External scores use a reference to assess the quality of the result. In our case, the references are BLH-RS (RS-derived BLH) and, when available, the human-expert hand-classified BLH. Two external scores are used in this study. If we denote Ẑ as the estimated BLH (by any of the previously introduced algorithms) and Z_ref as the reference, the external scores are as follows:
-The RMSE, where lower values are better:

  E_2 = sqrt( E[ (Ẑ − Z_ref)² ] )   (1)

-The Pearson correlation, where higher values are better:

  ρ = E[ (Ẑ − E[Ẑ])(Z_ref − E[Z_ref]) ] / ( σ(Ẑ) σ(Z_ref) )   (2)

Here, Ẑ and Z_ref are random variables, E[·] denotes the mathematical expectation, and σ(·) denotes the standard deviation.
When these scores are estimated, the random variables are replaced by a sample vector and the expectation and standard deviation are replaced by their usual estimators.
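Estimated on a sample, both scores reduce to a few NumPy calls (a sketch):

```python
import numpy as np

def rmse(z_hat, z_ref):
    """E_2: root-mean-square error, lower is better."""
    return np.sqrt(np.mean((z_hat - z_ref) ** 2))

def pearson(z_hat, z_ref):
    """rho: Pearson correlation, higher is better."""
    return np.corrcoef(z_hat, z_ref)[0, 1]
```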

Internal scores
The quality of a classification can be quantified using scores that are based only on the labels and the distances between points. Such scores estimate how trustworthy an estimation is without any external input. Many such scores exist, with different formulations and different strengths and weaknesses (Desgraupes, 2013). In this study, three internal scores were used:
-The silhouette score (Rousseeuw, 1987), averaged over all points:

  S_sil = (b − a) / max(a, b)

where a is the average distance of a point to the points of its own group and b is the average distance to the points of the neighboring group. S_sil = 1 is the best classification, S_sil = 0 is neutral, and S_sil = −1 is the worst classification.
-The Calinski-Harabasz index (Caliński and Harabasz, 1974):

  S_ch = [ B / (K − 1) ] / [ (Σ_k W_k) / (N − K) ]

where N is the number of points, K is the number of clusters, B is the between-cluster dispersion, and W_k is the within-cluster dispersion of the cluster k. Higher S_ch indicates a better classification.

-The Davies-Bouldin index (Davies and Bouldin, 1979):

  S_db = (1/K) Σ_k max_{k' ≠ k} (δ_k + δ_k') / d(µ_k, µ_k')

where k and k' are two cluster numbers, δ_k is the average distance between the points of cluster k and their cluster center, and d(µ_k, µ_k') is the distance between the cluster centers µ_k and µ_k'. Lower S_db indicates a better classification.
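All three internal scores are implemented in scikit-learn and can be computed directly from the predictor matrix X and the cluster labels (a sketch; X and labels are assumed to exist):

```python
from sklearn.metrics import (calinski_harabasz_score,
                             davies_bouldin_score, silhouette_score)

# X: (N, p) predictor matrix; labels: cluster attribution of each point.
s_sil = silhouette_score(X, labels)        # in [-1, 1], higher is better
s_ch = calinski_harabasz_score(X, labels)  # higher is better
s_db = davies_bouldin_score(X, labels)     # lower is better
```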

Other metrics
In addition to the internal and external scores, the computation time and the number of invalid values (NaN or Inf) were recorded. BLH estimates of NaN or Inf can occur when all the points of the profile are assigned to the same cluster; this reflects a faulty configuration of the algorithm. Even though these metrics do not measure how well a program is performing, they are useful to the user.

All the metrics used to measure the performance of KABL are summarized in Table 1.

Sensitivity analysis of the KABL algorithm
A sensitivity analysis was performed on the KABL code to identify the "best" configuration. Various KABL configurations were extensively tested on a single day, August 2, 2018, at the Trappes site, for which we have a hand-classified reference (Figure 4, left). The most relevant configurations were retained and tested on the two-year lidar dataset.
There are eight parameters in the KABL code (see Sect. 3.3 for their descriptions). To assess the sensitivity of KABL to these parameters, the performance metrics (given in Sect. 3.4) were calculated using the hand-classified BLH as Z_ref and the output of KABL as Ẑ for different combinations of input parameters. The output metrics given in Table 1 were tested using the input values given in Table 2. We refer to a set of values for the KABL parameters as a configuration. Screening all the possible values listed in Table 2 requires 3240 different configurations.
To obtain an overview of these 3240 configurations, we started by estimating the influence of the parameters (listed in Table 2) on the different metrics (listed in Table 1). The influence of the parameters was quantified using first-order Sobol indices (Sobol, 2001; Iooss and Lemaître, 2015; Rieutord, 2017), that is, the ratio of the variance of the metric when the parameter is fixed to the total variance of the metric. If we denote Y as the metric and X = (X_1, ..., X_8) as the vector of parameters, the first-order Sobol index of the parameter X_i is

  S_i = Var( E[Y | X_i] ) / Var(Y)

Figure 7 shows the resulting matrix of Sobol indices, in which one can see that some parameters are more influential than others (e.g., classif_score is much less influential than n_clusters). This matrix highlights the main effects of changing a parameter and, therefore, how to set each parameter appropriately. For each parameter, we examined the metrics that it influences and determined the preferred configuration.
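On a full factorial design such as the 3240 configurations screened here, the first-order index can be estimated by grouping the runs by parameter value; a sketch with pandas, assuming a results table with one row per configuration (the column names are ours):

```python
import pandas as pd

def first_order_sobol(results: pd.DataFrame, param: str, metric: str) -> float:
    """S_i = Var(E[Y | X_i]) / Var(Y), estimated on a balanced grid as the
    variance of the per-value means over the total variance of the metric."""
    conditional_means = results.groupby(param)[metric].mean()
    return conditional_means.var() / results[metric].var()

# Example: influence of n_clusters on the RMSE
# s_nc = first_order_sobol(results, "n_clusters", "rmse")
```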
Critical parameters are indicated in Figure 7 by the darkest blue columns, namely, n_clusters, algo, predictors, and init.

For each parameter, Figure 8 shows the distribution of the relevant output given the parameter value (violin plots are explained in Hintze and Nelson (1998)). For example, Figure 8a has the value of algo on its x-axis and the computing time on its y-axis; the 3240 different configurations were divided into two groups, those with algo='kmeans' and those with algo='gmm'. The parameter values were chosen to give the optimal values for the metrics they influence; the optimal values are indicated by a yellow star in each plot. To set algo, we examined the computing time (Figure 8a) and the Davies-Bouldin index (Figure 8b); these indicate that 'kmeans' is the best choice for both metrics (resulting in a lower computing time and a lower Davies-Bouldin index). To set init, we examined the correlation (Figure 8c) and the computing time (Figure 8d); in this case, 'given' appears to be the best choice. To set n_clusters, we examined the RMSE (Figure 8e) and the silhouette score (Figure 8f); they indicate that the best numbers of clusters are three and 'auto', respectively. We chose to give priority to the RMSE because the silhouette score has very high values for two clusters, which is suspicious given the presence of a cloud and a residual layer on this day. To set predictors, we examined the silhouette score (Figure 8g) and the Calinski-Harabasz index (Figure 8h); here, 'co' appears to be the best choice. Following this methodology, we identified a few configurations worth trying.

These configurations were tested on the two-year dataset. The configuration used to generate the results in Sect. 4.2.1 is given in Table 3. This configuration was chosen to maximize the correlation between KABL and RS at the Trappes site.

Two-year comparison
BLH estimates from the three methods (KABL, ADABL, and the manufacturer's algorithm) were compared to BLH-RS over a two-year period. Cases were filtered using the ancillary data described in Sect. 2; this selection rejects a large part of the dataset but ensures that only well-defined cases are retained for the comparison. In total, 178 RS measurements from Trappes and 101 RS measurements from Brest were used for the overall comparison. For each method, we report the RMSE E_2 (lower is better) and the correlation ρ (higher is better) with respect to BLH-RS. The scores are lower than those usually reported in the literature, which is likely due to the larger extent of our dataset (178 RS at Trappes and 101 at Brest, spanning a two-year period) and the low maturity of the algorithms. ADABL has better correlation and RMSE values than KABL at both sites. The manufacturer's algorithm performs well without any specific tuning on our part. It uses a wavelet covariance transform, as described in Brooks (2003). This result is not surprising because the wavelet method has been shown to be robust in numerous studies, especially by Caicedo et al. (2017), who included a cluster analysis method in their comparison and concluded that the wavelet method is preferred.

Seasonal and diurnal cycles
To quantify the ability of the algorithms to provide a consistent BLH estimate, Figure 10 shows the seasonal cycles (Figures 10a and 10c) and the diurnal cycles (Figures 10b and 10d) of the BLH estimates at the two sites. As at the Brest site (Figure 10a), KABL and ADABL overestimate BLH at the Trappes site (Figure 10c) in comparison to BLH-RS, while the manufacturer's estimate is close. The seasonal cycle is more visible at Trappes than at Brest, and all BLH estimates are higher in summer than in winter. The most pronounced cycle is given by KABL, while the least pronounced cycle is given by BLH-RS.
The inter-quartile ranges are also very large, especially in summer, reflecting the variation of BLH between day and night. Figures 10b and 10d show the diurnal cycles, where all values within the same six-minute period of the day were averaged.
Because the radiosondes are only launched twice a day, at 11:15 UTC and 23:15 UTC, an equivalent BLH-RS diurnal cycle cannot be drawn. However, we used the average and quartile values at these times as checkpoints for the other estimates.

The manufacturer's and KABL estimates both have very smooth diurnal cycles, with lower BLH at night and maximum BLH around 15:00 UTC at the Trappes site and around 13:00 UTC at the Brest site. The KABL average is always higher than that calculated by the manufacturer's algorithm. The ADABL estimate has a very different diurnal cycle, similar to the conceptual image we have of the boundary layer. Indeed, ADABL was trained using hand-classified BLHs that reflect this conceptual image. Therefore, it is not surprising that ADABL reproduces this image well; however, it may fail to adapt to special cases. It appears that the "time" predictor (the number of seconds since midnight) has a large influence that is not balanced by the other predictors. This is likely because ADABL was trained on only two days, resulting in an unbalanced importance of the sunrise and sunset times of these particular days at these locations. To balance this importance, the AdaBoost algorithm needs to be trained on more days and at more sites with a representative selection of cases.

Case study
The chosen case study was April 19, 2017, at the Trappes site. The boundary layer was clearly visible and had nearly all the features of the conceptual image. The case study day was not included in the ADABL training set, so as not to bias the comparison in favour of ADABL. At the beginning of the day, there is a thick residual layer containing some plumes. Both KABL and the manufacturer's algorithm include these plumes in the boundary layer. Conversely, ADABL gives a very low estimate where there is no visible frontier. In the morning (from 08:00 UTC to 12:00 UTC), all the algorithms capture the morning transition reasonably well. However, KABL includes more irrelevant estimates (selecting remnants of the surface layer) than the other methods, and ADABL gives an estimate that is too high for no apparent reason around 12:00 UTC. During the day, ADABL sticks to the top of the boundary layer, the manufacturer's algorithm sticks to the surface layer (which is very visible), and KABL oscillates between the two. The evening transition is blurry; the signal from the surface layer slowly increases as the mixed layer decays into a residual layer. KABL locates this transition very early (around 17:00 UTC), when it stops oscillating and sticks to the surface layer. ADABL makes the transition more smoothly, from 19:00 UTC to 22:00 UTC. The manufacturer's algorithm is the last to make the transition, at around 23:00 UTC, and the transition then occurs very sharply. We can conclude from this case study that none of the algorithms perfectly capture the boundary layer. Some of the limitations are physical, e.g., the evening transition is ill defined, resulting in disagreement between the algorithms. BLH-RS at 23:15 UTC is close to the lower boundary of the lidar range. This highlights the fact that BLHs below 120 m are not rare and will not be detected by lidar if the BLH is in the lidar blind zone. Some of the other limitations are algorithmic; KABL has an unfortunate tendency to oscillate between several candidates for the top of the boundary layer (surface layer or clouds), and ADABL too closely reproduces the features of the days it has been trained on (e.g., night estimates and morning transitions).

Algorithm maturity
Both algorithms examined here are not yet mature when applied in this context. K-means algorithms have already been used to detect BLHs in previous studies (Toledo et al., 2014; Caicedo et al., 2017; Toledo et al., 2017; Rieutord et al., 2014); therefore, it is a more mature method. This is visible in this paper via the level of investigation, which was much higher for KABL than for ADABL. Concerning boosting, this is the first time, to our knowledge, that such an algorithm has been tested on this type of problem; therefore, ADABL is a completely new algorithm. Yet, it outperforms KABL and competes favorably with the manufacturer's algorithm despite raising training issues.

Time and altitude continuity
The oscillations observed in Figure 11 are unrealistic and need to be avoided. They occur with KABL because clusters do not always have vertical persistence, as shown in Figure 12, which displays the cluster labels on a time-altitude grid for the same day as in Figure 11. When the initialization is random (Figure 12a, init = 'random', default settings in K-means), the labels are also random; only the transition of the labels along a profile is meaningful. When the initialization is given (Figure 12b, init = 'given', retained settings in KABL), the labels can be identified: the blue cluster starts from a very high attenuated backscatter coefficient (it detects clouds and the shallow morning boundary layer); the red cluster starts from a high attenuated backscatter coefficient (it detects the mixed layer or residual layer); and the green cluster starts from a low attenuated backscatter coefficient (it detects the free atmosphere). Oscillations occur when some points are identified as free atmosphere in the middle of the boundary layer.
In the case study presented here (Figures 11 and 12b), this happens in the afternoon, when the blue cluster (starting from a very high attenuated backscatter coefficient) gathers a few points near the first range gates and in the overhead artifacts, because no genuinely very high attenuated backscatter is present under these conditions.
Several prospects exist to enforce the vertical persistence of the clusters in KABL; we list four examples here. First, the problem identified in Figure 12b could be addressed by choosing the number of clusters automatically. However, this option was tested (case where n_clusters = 'auto' in Sect. 3.3) and did not solve the issue: the sensitivity analysis showed that n_clusters = 'auto' led to higher discrepancies with respect to BLH-RS. In fact, this setting usually gives low estimates of BLH because the automatically chosen number of clusters is usually high. A more advanced strategy to automatically choose the number of clusters (e.g., Tibshirani et al., 2001) might get around this issue. Second, the vertical persistence of clusters could be enforced by adding altitude to the KABL predictors. Third, continuity could be imposed by post-processing, for example with a moving average or a maximum BLH growth rate (Poltera et al., 2017); a sketch of the latter is given below. Fourth, the distance used in K-means could be modified to incorporate these constraints, for example by adding penalty terms.
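As an illustration of the post-processing prospect, a maximum-growth-rate filter in the spirit of Poltera et al. (2017) could look as follows; the rate threshold is illustrative, and this is not part of the released KABL code.

```python
import numpy as np

def limit_growth_rate(blh, dt=300.0, max_rate=0.5):
    """Clip a BLH time series so that it never rises or falls faster than
    max_rate (m s-1); dt is the time step in seconds (5 min here)."""
    out = np.asarray(blh, dtype=float).copy()
    for i in range(1, out.size):
        step = np.clip(out[i] - out[i - 1], -max_rate * dt, max_rate * dt)
        out[i] = out[i - 1] + step
    return out
```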
In ADABL, time and altitude continuity are ensured because both are among the predictors. However, ADABL yields BLH estimates that are too similar to the BLHs in the training set. Removing time and/or altitude from the predictors should be considered to force the algorithm to rely more on the measurements. Furthermore, the sensitivity analysis presented here for KABL needs to be performed for ADABL.

Real-time estimation
Even though it was not necessary for this study, all of the algorithms studied here can be used in real time: as soon as a lidar profile is available, a BLH estimate can be produced almost instantaneously. However, KABL suffers from undesirable oscillations from one profile to the next. A method to filter these oscillations is needed but would disable the "real-time" feature of the algorithm. In addition, for real-time use, the hour of the day needs to be encoded with a periodic function; this was not done here because we worked only on 24-hour time periods.
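A standard way to do this is to map the time of day onto the unit circle with a sine/cosine pair, e.g. (illustrative):

```python
import numpy as np

def encode_time_of_day(seconds_since_midnight):
    """Periodic encoding: 23:59 and 00:00 become close in feature space."""
    angle = 2 * np.pi * np.asarray(seconds_since_midnight) / 86400.0
    return np.sin(angle), np.cos(angle)
```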

Quality of the evaluation
Even though we made an effort to sort the meteorological conditions using ancillary data, the two-year comparison still mixes heterogeneous conditions. In addition, the results are clearly different at the two sites studied here, emphasizing the importance of local conditions. A more precise sorting of the meteorological conditions using atmospheric stability indices or large-scale insights would lead to a better understanding of the strengths and weaknesses of the algorithms. The influence of the sites needs to be investigated by extending the study to a larger number of sites with different environments. A more careful examination of cloudy days also needs to be performed. Cases where the cloud bases are below 3 km were filtered from our study. However, cases where clouds reside inside the inversion should be detected by KABL as an extra cluster; further studies are required to confirm this behavior. In addition, ADABL was not specifically trained to deal with cloudy situations; further studies are required to determine how ADABL behaves without such training and how it could be appropriately trained.
Radiosonde profiles are usually regarded as the best reference for BLH. However, the derivation of BLH from such measurements is contentious because several methods exist and some strongly disagree. Moreover, RS measurements cannot be used to assess the full diurnal cycle of BLH. This is a clear limitation of this study because the RS measurements cannot determine whether the difference between the diurnal cycle of ADABL and those of the other methods represents an improvement. Therefore, a very interesting project would be to use a dedicated field experiment with high-frequency tethersondes or other continuously running instruments as a reference. For example, microwave radiometers are good candidates because they provide information that is not based on aerosols and the derivation of BLH from these instruments is routine (Cimini et al., 2013).

ADABL already shows good performance when trained on only two days. Most of its bad estimates result from the small size of its training set. Therefore, a short-term project would be to label more days with various meteorological conditions.
However, the dependence of ADABL on training makes it sensitive to instrumentation settings and calibrations. Even though the effect of a calibration or of a change of instrumental device has not been studied, it is likely that training needs to be repeated after each calibration or change of device. Therefore, two strategies are possible for training ADABL: remove the influence of calibration prior to training (this would require knowing the instrumental constants for all of the devices) or train it to deal with the differences (this would require including as many different devices as possible in the training set, which would then become very large). In either case, the main limitation will be the need to label the entire training dataset (a priori by human experts).

KABL appeared to perform the least well in this study; however, there are interesting prospects to improve its performance.
KABL does not require any training; therefore, it is less dependent on instrumentation settings and calibrations. Because it is not strongly dependent on the instrumental device, it can be used on backscatter profiles from other instruments (e.g., ceilometers). Moreover, profiles other than the backscatter intensity can be added as additional predictors for unsupervised learning after normalization. Therefore, the concept of KABL can be advanced further to create a synergy between multiple remote sensing instruments. Microwave radiometers are good candidates because they have a time resolution comparable to lidar and provide independent information concerning the thermal stratification of the boundary layer. Cloud radars also have a comparable time resolution and provide additional independent information.

Quality flags
Currently, no quality flags for the estimation are provided. One approach would be to use the internal scores (i.e., silhouette, Davies-Bouldin, and Calinski-Harabasz, defined in Sect. 3.4) as quality flags; however, further study is required to determine whether these metrics can serve as reliable quality flags.
Conclusions

This paper described two machine-learning algorithms to estimate the boundary layer height from aerosol lidar measurements. The first, KABL, is based on the K-means algorithm. The second, ADABL, is based on the AdaBoost algorithm.

Both algorithms take the same input file, one day of data generated by the raw2l1 routine, and produce similar output, a BLH time series for the input day. KABL is an unsupervised algorithm that looks for a natural separation in the backscatter signals between the boundary layer and the free atmosphere. ADABL is a supervised algorithm that fits a large number of decision trees to a labeled dataset and aggregates them to provide a good prediction. KABL, ADABL, and the lidar manufacturer's algorithm were tested on a two-year dataset taken from the Météo-France operational lidar network. The Trappes and Brest sites were chosen because of their different climates and the availability of regular RS measurements, which were used as a reference.
A large discrepancy in the RMSE and in the correlation with the radiosondes was observed between the two sites. At the Trappes site, KABL and ADABL outperformed the manufacturer's algorithm, while the opposite occurred at the Brest site. At both sites, ADABL performed better than KABL (higher correlation and lower error) and the manufacturer's algorithm performed well. By analyzing the seasonal and diurnal cycles, we determined that the KABL and manufacturer's estimates behave similarly; however, the KABL estimates are always higher, by approximately 200 m. ADABL generates the most pronounced diurnal cycle, with a pattern that is very similar to the expected diurnal cycle; however, its results depend greatly on the days it has been trained on. In particular, the sunset and sunrise times of these days over-influenced the ADABL estimate. In the case study, we saw that both algorithms perform well overall; however, we identified several algorithmic limitations, e.g., KABL tended to oscillate between several candidates for the top of the boundary layer (surface layers or clouds) and ADABL was overly constrained by the days it was trained on (e.g., the night estimate and morning transition). In summary, ADABL is promising but has training issues that need to be resolved; KABL has a lower performance but is much more versatile; and the manufacturer's algorithm, using a wavelet covariance transform, performs well with little tuning but is not open source. A wide range of future developments is available for ADABL and KABL, the most immediate being that the training set of ADABL can be enhanced, time and altitude continuity can be enforced in the KABL estimation, and both can be compared to high-temporal-resolution RS measurements.