Wind lidars present advantages over meteorological masts, including simultaneous multipoint observations, flexibility in measuring geometry, and
reduced installation cost. But wind lidars come with the “cost” of increased complexity in terms of data quality and analysis. Carrier-to-noise
ratio (CNR) has been the metric most commonly used to recover reliable observations from lidar measurements but with severely reduced data
recovery. In this work we apply a clustering technique to identify unreliable measurements from pulsed lidars scanning a horizontal plane, taking
advantage of all data available from the lidars – not only CNR but also line-of-sight wind speed (

Long-range scanning wind lidars are useful tools, and their adoption has grown rapidly in recent years in wind energy applications

Both approaches miss important and complementary information, either neglecting the strength of the signal backscattering (quantified by CNR) or the
spatial distribution and smoothness of the wind field. Moreover, in both approaches the position of observations is not taken into account,
information that can shed light on areas permanently showing anomalous values of

Data self-similarity – over any scale in the case of fractals or a range of them in real situations

The Density Based Spatial Clustering for Applications with Noise algorithm, or

This paper is organized as follows: Sect.

The filtering techniques presented here were tested on lidar measurements made at the Test Centre Østerild located in northern Jutland, Denmark (see Fig.

Characteristics of the balconies experiment, from

The experiment consists of two measuring phases (see Table

This data set is well suited to test different data filtering techniques. A large measurement area will be affected by local terrain and atmospheric
conditions, like clouds or large hard targets. Moreover, at this scale lidars reach their measuring limitations since the backscattering from
aerosols decreases rapidly with distance

Assessing and comparing the performance of filters is challenging when no reference is available to verify whether rejected or accepted observations
are in fact unreliable or reliable. This is especially difficult for long-range scanning lidars since their measurements cover large areas and, due to
spatial variability, a valid reference would need several secondary anemometers scattered over the scanning area. Testing filters on a controlled and
synthetic data set contaminated with well-defined noise presents an option to deal with this problem. In this study, the filters presented in
Sect.

Synthetic PPI scans are sampled by a lidar simulator from synthetic wind fields generated via the Mann model

Synthetic wind field characteristics and parameters.

Lidar simulators have been presented previously by

The characteristics of the lidar simulator and real long-range lidar

The simulator receives scanning pattern characteristics as input (beam range, range gate step, azimuth angle range, and azimuth angle steps) to
generate a primary mesh with the sampling positions on top of background wind field. Following the measuring principle of the lidar, the

The azimuth averaging is the arithmetic mean of the
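The sampling idea can be illustrated with a minimal sketch: the horizontal wind vector is projected onto the beam direction and the projections are averaged arithmetically over an azimuth sector. The function names, the azimuth convention (measured counterclockwise from the x axis), and the number of sub-beams are assumptions made for illustration; they are not the simulator described here, which also samples along the range direction.

```python
import math


def line_of_sight_speed(u, v, azimuth_deg):
    """Project horizontal wind components (u, v) onto a horizontal beam.

    Assumption: azimuth is measured counterclockwise from the x axis.
    """
    az = math.radians(azimuth_deg)
    return u * math.cos(az) + v * math.sin(az)


def sector_mean_vlos(u, v, az_start_deg, az_end_deg, n_sub):
    """Arithmetic mean of the projection over n_sub equally spaced sub-beams."""
    step = (az_end_deg - az_start_deg) / (n_sub - 1) if n_sub > 1 else 0.0
    samples = [line_of_sight_speed(u, v, az_start_deg + k * step)
               for k in range(n_sub)]
    return sum(samples) / n_sub


# A uniform 10 m/s wind along x, seen through a 2 degree sector centred on x.
mean_vlos = sector_mean_vlos(10.0, 0.0, -1.0, 1.0, 3)
```

With a uniform wind, the sector mean is slightly lower than the wind speed because the off-axis sub-beams see a smaller projection.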

The simplest noise that can be used to contaminate synthetic scans consists of sparse, uniformly distributed outliers. This noise, also known as salt-and-pepper noise, is easily detected and eliminated by median-like filters when extreme, discrete steps affect the smoothness of an
image
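As an illustration, salt-and-pepper contamination of a scan can be sketched as follows. The function name, the contaminated fraction, and the outlier range are hypothetical choices for the example, not values from this study.

```python
import random


def add_salt_and_pepper(scan, fraction, low, high, seed=0):
    """Replace a fraction of cells with uniformly distributed outliers.

    scan is a list of rows (one row per beam). Returns the contaminated copy
    and a boolean mask that is True where a cell was replaced.
    """
    rng = random.Random(seed)
    n_rows, n_cols = len(scan), len(scan[0])
    n_noisy = round(fraction * n_rows * n_cols)
    # Sparse positions drawn uniformly without replacement.
    cells = rng.sample(range(n_rows * n_cols), n_noisy)
    noisy = [row[:] for row in scan]
    mask = [[False] * n_cols for _ in range(n_rows)]
    for c in cells:
        i, j = divmod(c, n_cols)
        noisy[i][j] = rng.uniform(low, high)
        mask[i][j] = True
    return noisy, mask


clean = [[5.0] * 10 for _ in range(10)]
noisy, mask = add_salt_and_pepper(clean, 0.1, -35.0, 35.0)
```

Keeping the mask of contaminated cells is what makes a controlled test possible, since the filter output can be compared against known noise positions.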

A two-dimensional grid of

For each grid point

Finally, the noise function is the sum of dot products,
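A minimal sketch of this kind of gradient (Perlin-type) noise is given below, assuming unit gradient vectors with random orientation at each grid node and a quintic fade function as interpolation weight. These choices, and all names in the code, are assumptions for illustration and may differ from the implementation used for the synthetic scans.

```python
import math
import random


def make_gradients(nx, ny, seed=0):
    """Random unit gradient vectors on an (nx + 1) by (ny + 1) grid of nodes."""
    rng = random.Random(seed)
    grads = []
    for _ in range(ny + 1):
        row = []
        for _ in range(nx + 1):
            angle = rng.uniform(0.0, 2.0 * math.pi)
            row.append((math.cos(angle), math.sin(angle)))
        grads.append(row)
    return grads


def fade(t):
    """Quintic smoothstep 6t^5 - 15t^4 + 10t^3, with fade(0) = 0 and fade(1) = 1."""
    return t * t * t * (t * (6.0 * t - 15.0) + 10.0)


def perlin(x, y, grads):
    """Noise at (x, y), with 0 <= x <= nx and 0 <= y <= ny in grid units."""
    ny = len(grads) - 1
    nx = len(grads[0]) - 1
    i = min(int(math.floor(x)), nx - 1)
    j = min(int(math.floor(y)), ny - 1)
    total = 0.0
    # Weighted sum of dot products over the four nodes of the enclosing cell.
    for dj in (0, 1):
        for di in (0, 1):
            gx, gy = grads[j + dj][i + di]
            dx, dy = x - (i + di), y - (j + dj)
            weight = fade(1.0 - abs(dx)) * fade(1.0 - abs(dy))
            total += weight * (gx * dx + gy * dy)
    return total


grads = make_gradients(4, 4)
field = [[perlin(0.1 * a, 0.1 * b, grads) for a in range(41)] for b in range(41)]
```

The resulting field is zero at the grid nodes and varies smoothly in between, so thresholding it gives spatially coherent patches rather than isolated outliers.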

Procedural noise on synthetic scans.

CNR thresholds are well known, and lidar manufacturers usually recommend values for rejection of signals with poor backscattering or hitting hard
targets
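A CNR threshold filter reduces to a simple mask, as in the sketch below. The function names are hypothetical, and the limits used in the example are illustrative placeholders, not the values recommended by any manufacturer or used in this study.

```python
def cnr_threshold_mask(cnr_db, lower_db, upper_db):
    """True where CNR (dB) lies inside [lower_db, upper_db].

    The lower limit rejects weak backscatter; the upper limit rejects
    returns that are likely hard targets.
    """
    return [[lower_db <= c <= upper_db for c in row] for row in cnr_db]


def recovery_fraction(mask):
    """Fraction of observations accepted by a boolean mask."""
    kept = sum(sum(row) for row in mask)
    total = sum(len(row) for row in mask)
    return kept / total


# Illustrative placeholder limits, in dB.
cnr = [[-30.0, -20.0], [-10.0, -5.0]]
mask = cnr_threshold_mask(cnr, -24.0, -8.0)
fraction = recovery_fraction(mask)
```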

The median filter arises as a viable option for detecting erroneous measurements since it is well known that this type of nonlinear filter is suited
to detect and filter noise that presents distributions with large tails. Here we use an adaptation of the traditional median filter used in the
image-processing community, closely related to the three-stage filtering technique described in

The input parameters of this filter will be the size (or number of elements) of the moving windows in the radial and azimuth directions,
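The basic idea can be sketched as follows: each observation is compared with the median of a moving window spanning a few range gates and beams, and it is rejected when the difference exceeds a threshold. This is a minimal single-stage sketch; the function name, the rejection threshold, and the clipping of windows at the scan edges are assumptions, and it does not reproduce the adapted filter used here.

```python
from statistics import median


def median_filter_mask(vlos, n_r, n_phi, threshold):
    """True where an observation is kept.

    vlos[i][j] is the radial speed at beam (azimuth) i and range gate j.
    n_r and n_phi are the window sizes in the radial and azimuth directions.
    """
    n_beams, n_gates = len(vlos), len(vlos[0])
    half_phi, half_r = n_phi // 2, n_r // 2
    keep = []
    for i in range(n_beams):
        row = []
        for j in range(n_gates):
            # Window clipped at the edges of the scan.
            window = [
                vlos[a][b]
                for a in range(max(0, i - half_phi), min(n_beams, i + half_phi + 1))
                for b in range(max(0, j - half_r), min(n_gates, j + half_r + 1))
            ]
            row.append(abs(vlos[i][j] - median(window)) <= threshold)
        keep.append(row)
    return keep


scan = [[5.0] * 7 for _ in range(7)]
scan[3][3] = 30.0  # isolated spike
keep = median_filter_mask(scan, n_r=3, n_phi=3, threshold=2.0)
```

The isolated spike is rejected while its neighbours are kept, because a single outlier does not shift the median of the window.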

If we represent lidar observations as

Core point: points

Direct density-reachable point: points

Density-reachable point: points

The input parameters
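The definitions above can be sketched as a compact, brute-force implementation. The parameter names `eps` (neighbourhood radius) and `min_pts` (minimum number of neighbours, counting the point itself) are naming assumptions, and this sketch is for illustration only, not the implementation used in this work.

```python
import math


def region_query(points, idx, eps):
    """Indices of all points within eps of points[idx], itself included."""
    return [k for k, q in enumerate(points) if math.dist(points[idx], q) <= eps]


def dbscan(points, eps, min_pts):
    """Cluster label per point: 0, 1, ... for clusters and -1 for noise."""
    noise, unvisited = -1, None
    labels = [unvisited] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not unvisited:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = noise  # may later become a border point
            continue
        cluster += 1  # points[i] is a core point: start a new cluster
        labels[i] = cluster
        queue = [k for k in neighbors if k != i]
        while queue:
            k = queue.pop()
            if labels[k] == noise:
                labels[k] = cluster  # density-reachable border point
            if labels[k] is not unvisited:
                continue
            labels[k] = cluster
            k_neighbors = region_query(points, k, eps)
            if len(k_neighbors) >= min_pts:
                queue.extend(k_neighbors)  # expand only through core points
    return labels


points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
```

The two compact groups form separate clusters and the isolated point is labelled as noise. Production implementations replace the brute-force neighbour search with a spatial index.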

When scans are very noisy, the selection of a proper value of

The set of features considered when filtering synthetic data does not include CNR because it is not available from the lidar simulator described
in Sect.

Since we consider features that vary significantly in magnitude (CNR and range gate distance for instance), we normalize the data before the application of
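One common way to do this is to standardize each feature to zero mean and unit standard deviation, as in the sketch below. The scaling choice and the feature values are assumptions for illustration, not necessarily the normalization applied here.

```python
from statistics import mean, pstdev


def standardize(rows):
    """Scale each column of rows to zero mean and unit standard deviation."""
    columns = list(zip(*rows))
    mus = [mean(c) for c in columns]
    # Constant features get a unit scale to avoid division by zero.
    sigmas = [pstdev(c) or 1.0 for c in columns]
    return [[(x - m) / s for x, m, s in zip(row, mus, sigmas)] for row in rows]


# Hypothetical feature rows: CNR (dB) and range gate distance (m).
features = [[-30.0, 100.0], [-20.0, 4000.0], [-10.0, 7900.0]]
scaled = standardize(features)
```

After scaling, a distance-based neighbourhood is no longer dominated by the feature with the largest numerical range.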

The clustering filter is implemented to be a nonsupervised classifier and does not need more input parameters than the different features and the
number of scans put together as a batch before filtering. The latter is set to three in this case to speed up calculations and avoid creating
clusters from noisy regions. From this point of view, this filter is also dynamic, similar to

Equations (

In the absence of reference measurements, the quality of the data retrieved after filtering is assessed by comparing the distribution of radial wind
speeds for very reliable observations (with CNR values within the range of

Probability density function of reliable observations of

Another metric is the similarity between the PDFs of reliable and nonreliable data after filtering. The distance between the two probability density functions
can be quantified with similarity metrics like the Kolmogorov–Smirnov test

The null hypothesis here is that two realizations are from the same distribution if the K–S statistic is such that its two-tailed
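For reference, the two-sample K–S statistic is the largest absolute difference between the empirical cumulative distribution functions of the two samples. A minimal sketch, with a hypothetical function name, is given below; it computes only the statistic, and the associated p value would normally come from a statistics library such as SciPy (`scipy.stats.ks_2samp`).

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Maximum distance between the empirical CDFs of two samples."""
    a, b = sorted(sample_a), sorted(sample_b)
    d = 0.0
    # The supremum is reached at one of the observed values.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d


d_same = ks_statistic([1, 2, 3, 4], [1, 2, 3, 4])
d_shifted = ks_statistic([1, 2, 3, 4], [3, 4, 5, 6])
d_disjoint = ks_statistic([1, 2, 3], [10, 11, 12])
```

A statistic near zero indicates that the two samples follow similar distributions, and a value of one means that they do not overlap at all.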

The KL divergence is a measure of similarity or overlapping of two distributions,
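For two histograms on the same bins, the discrete form is D(P‖Q) = Σ p_i ln(p_i / q_i). The sketch below uses it with a small floor on q_i to avoid division by zero in empty bins; the floor value and the function name are assumptions for illustration.

```python
import math


def kl_divergence(p_counts, q_counts, floor=1e-12):
    """Discrete KL divergence D(P || Q) from two histograms on the same bins."""
    p_total, q_total = sum(p_counts), sum(q_counts)
    d = 0.0
    for pc, qc in zip(p_counts, q_counts):
        p = pc / p_total
        q = max(qc / q_total, floor)  # floor avoids log of zero in empty bins
        if p > 0.0:  # bins with p = 0 contribute nothing
            d += p * math.log(p / q)
    return d


d_equal = kl_divergence([2, 5, 3], [2, 5, 3])
d_pq = kl_divergence([1, 1], [1, 3])
d_qp = kl_divergence([1, 3], [1, 1])
```

The divergence is zero for identical histograms and is not symmetric, so the order of the two distributions has to be stated when reporting it.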

Both metrics will be used to estimate how the distribution of nonreliable observations of

Both performance metrics, the recovery rate of abnormal measurements in the tails of the PDF of reliable observations and its statistical distance to
the PDF of filtered nonreliable observations, will be assessed for the median-like filter, the clustering filter, and also for data filtered with a
CNR threshold of

In Fig.

Histograms of the three performance indexes for the total number of synthetic scans.

Both filters perform well when evaluated in terms of

The data set from the balconies experiment presents advantages for the clustering filter since the CNR value can be included as a feature in
describing the data. Nevertheless, as mentioned already in Sect.

Distribution of recovery fraction per wind speed bin for phase 1 of the experiment of

Distribution of recovery fraction per wind speed bin for phase 1 of the experiment of

Results of PDF similarity test of reliable and nonreliable data after filtering. The CNR

Total recovery fraction for phase 1 of the experiment. The noisy and far regions of the scans show a high recovery, above 80 %, for

Detail of the recovery rate at the site of the turbines for the

The total recovery fraction of observations for phase 2 of the experiment. The noisy and far regions of the scans show a high recovery, above 70 %, for (

Figures

Additional data recovered, relative to the number of observations in the reliable range of CNR, and fraction of data recovered with values beyond the 3

Table

The metrics introduced in Sect.

The synthetic wind fields used here do not consider the presence of hard targets. These anomalies in the wind field are observed by lidars as points
with high CNR values and abnormal

The data set analyzed from the balconies experiment corresponds to horizontal scans at 50 and 200

Regarding feature selection, the clustering filter could consider the spectral spreading of the heterodyne signal,

Using the statistical distances

The selection of features and the number of scans put together per filtering step or iteration could also be automated using feature selection
methods. Nevertheless, this would make the clustering filter more complex in its implementation and more computationally expensive, which is the main
disadvantage of this approach compared to the median-like filter. Very efficient median filters can achieve a computational complexity up to

CNR threshold filtering has been the common approach to retrieving reliable observations from lidar measurements. In this work we compared this approach against two alternative techniques: a median-like filter, based on the assumption that the wind field, and hence the radial wind speed observed by a wind lidar, is smooth; and a clustering filter, based on the assumption that the observations captured by the wind lidar are self-similar and can be clustered into groups of good data and noise. A controlled test was carried out on the last two approaches using a simple lidar simulator that sampled scans from synthetic wind fields later contaminated with procedural noise. The results indicate that the clustering filter detects more of the added noise than the median-like filter while maintaining a good recovery rate of noncontaminated data. When the three filters are tested on real data, the clustering approach performs better at identifying abnormal observations, increasing data availability by between 22 % and 38 % and reducing the recovery of abnormal measurements by between 70 % and 80 % compared to a CNR threshold. This is an important result because it increases the spatial coverage of the data, which can be used later for wind field reconstruction and wind data analysis, especially in the far region of the scan, which covers the largest measured area.

Even though the median-like filter is computationally efficient, its input parameters must be tuned, and their optimal values depend on the turbulence characteristics of the wind field. The clustering filter is more robust in this sense because it automatically adapts its parameters to the structure of the data. This is a step toward more robust and automated processing of lidar data, which ideally should be independent of the turbulence characteristics of the measured wind field and of the scanning pattern used.

Figure

Contours of performance metrics for

The synthetic data and code are available at

LA implemented the analysis on synthetic and real data and wrote all sections of this paper.

The author declares that there is no conflict of interest.

I would like to thank Ioanna Karagali for comments and guidance through the data from the Østerild balconies experiments; Robert Menke for the first version of the median-like filter; and Ebba Dellwik, Nikola Vasiljevic, Gunner Larsen, Mark Kelly, and Jakob Mann for the very useful and clarifying feedback during the analysis and writing process.

This paper was edited by Murray Hamilton and reviewed by two anonymous referees.