Retrieving the atmospheric number size distribution from lidar data

We consider the problem of reconstructing the number size distribution (or particle size distribution) of atmospheric aerosols from lidar measurements of the extinction and backscattering coefficients. We assume that the number size distribution can be modelled as a superposition of log-normal distributions, each one defined by three parameters: mode, width and height. We use a Bayesian model and a Monte Carlo algorithm to estimate these parameters. We test the method on synthetic data, generated from distributions containing one or two modes and perturbed by Gaussian noise, and on three real datasets obtained from AERONET. We show that the proposed algorithm provides satisfactory results even when the assumed number of modes differs from the true number of modes, and excellent results when the right number of modes is selected. In general, an over-estimate of the number of modes provides better results than an under-estimate. In all cases, the PM1, PM2.5 and PM10 concentrations are reconstructed with tolerable deviations.

1 Introduction

LIDAR (LIght Detection And Ranging) is a remote sensing technique, similar to RADAR (RAdio Detection And Ranging), which uses light in the form of short laser pulses to illuminate a target and obtain, through elastic and inelastic scattering processes, information on the target properties as a function of the distance from the lidar system.
In atmospheric applications, LIDAR systems can be used to obtain spatially resolved information about the optical properties of atmospheric aerosols (desert dust, volcanic ash, smog and many other types of substances) (Giannakaki et al., 2010; Lee and Wong, 2018; Stelitano et al., 2019; Chazette, 2020), over a distance of several kilometers and with high spatial and temporal resolutions. Advanced, multi-wavelength lidar systems are capable of giving information on the spectral dependence of the aerosol optical properties, and can therefore be used to obtain information on the aerosol microphysical properties (Granados-Muñoz et al., 2014; Giannakaki et al., 2016; Chemyakin et al., 2016; Molero et al., 2020). However, information on the microphysical properties of atmospheric aerosols is seldom obtained using the LIDAR signal alone. This information, which is essential for a complete aerosol characterization and for understanding the aerosol effect on climate, is instead frequently obtained through the synergistic use of in-situ instruments; incidentally, these measurements also allow a validation of the lidar retrievals, but only for the values closest to the ground and for a particular aerosol typology (Saharan dust, biomass burning aerosol, ...); alternatively, validation can be done using synthetic data (Alados-Arboledas et al., 2011; Di Girolamo et al., 2012; Veselovskii et al., 2013; Osterloh et al., 2013; Samaras et al., 2015; Whiteman et al., 2018).
In order to retrieve the microphysical properties of the aerosol from lidar measurements, two inverse problems must be solved in sequence: in the first inverse problem, one uses the measured backscattered power to obtain an estimate of the aerosol optical parameters; in the second problem, one uses the estimated optical parameters (at different wavelengths), derived from lidar observations, to obtain an estimate of the number size distribution, i.e. the density of particles as a function of the particle size. This latter problem is particularly challenging because of the limited amount of data, due to the practical difficulty of measuring at many different wavelengths. Moreover, from a mathematical point of view, the microphysical parameters are generally derived from the optical ones through integral equations that cannot be solved analytically and whose numerical solution leads to a so-called ill-posed problem. The latter is characterized by a strong sensitivity of the solution to the input data uncertainties and by the non-uniqueness of the solution. Remarkably, from a mathematical viewpoint there can be several ways to overcome ill-posedness; however, not all of them actually reflect realistic physical conditions. In addition, numerical studies showed that a poor selection of the constraints can affect the quality of the solution and compromise the microphysical retrieval regardless of the strength of the regularization algorithm. Therefore, in order to obtain stable and physically acceptable solutions, mathematical and physical constraints, variously combined with regularization methods, are applied (Müller et al., 1999; Böckmann, 2001; Veselovskii et al., 2002; Böckmann et al., 2005; Kolgotin and Müller, 2008).
Other studies have compared the "3β+2α" lidar retrievals to the AERONET (AErosol RObotic NETwork) retrievals (Holben et al., 1998; Veselovskii et al., 2009; Sawamura et al., 2014). Nevertheless, the AERONET retrieval of columnar volume concentrations, which assumes the aerosol to be uniformly mixed throughout the boundary layer, is not directly comparable to vertically resolved lidar results. Intercomparison studies between "3β+2α" lidar retrievals and AERONET retrievals or ground-based in-situ measurements are therefore not suitable to properly evaluate the performance of the lidar microphysical retrievals obtained at different altitudes.

Despite these difficulties, the possibility of characterizing the atmospheric particulate using only the lidar instrument would be very advantageous, and for these reasons it is currently a much studied topic (Di et al., 2018a; Kolgotin et al., 2016; De Rosa et al., 2020).
Following the state of the art, we retrieve the particle size distribution from "3β+2α" lidar system parameters. In order to mitigate the ill-posedness, we adopt a parametric model for the number size distribution. Several authors have modelled the size distribution through triangular basis functions, such as in Veselovskii et al. (2004), defined on an equidistant or logarithmically equidistant grid for the effective radius of the particles. In agreement with Di et al. (2018b) and Sun et al. (2013), and with the standard AERONET inversion procedure (Dubovik et al., 2006), we find it more viable to work with log-normal distributions. In particular, we reconstruct the particle size distribution as a superposition of a small number of log-normal distributions, on a logarithmic interval of particle radii. Then, because of the non-linearity of the problem, we adopt a Markov Chain Monte Carlo algorithm (Metropolis-Hastings) to approximate the posterior distribution of the parameters of interest. Monte Carlo methods have long been considered of little practical use, due to their high computational cost; however, the steady growth of available computational power over the last decades has made them widely usable in applications.
This article is organized as follows: in the Methods Section, we provide the mathematical formulation of the problem and a description of the Monte Carlo algorithm; in the following Section, we analyze the results obtained on synthetic data, on five exemplar cases; in the final Section we briefly summarize our conclusions.

Methods
Lidar instruments measure the backscattered light power at wavelength λ, coming from distance z, given by the following equation:

P(λ, z) = C β(λ, z) / z² exp( −2 ∫₀ᶻ α(λ, ζ) dζ ),   (1)

where α(λ, z) and β(λ, z) are the extinction and backscattering coefficients, respectively, and C is a constant that depends on the instrument characteristics. Equation (1) can be looked at as an inverse problem, where one aims at recovering the extinction and backscattering coefficients from LIDAR data. By solving the inverse problem (see Ansmann et al. (1990, 1992); Shcherbakov (2007); Pornsawad et al. (2008); Garbarino et al. (2016)), one obtains extinction and backscattering coefficients at every altitude z, for a usually small set of λ. The present article is concerned with the subsequent problem of estimating the number size distribution from these indirect measures of α(λ, z) and β(λ, z). In the following, we omit the dependence on z because the problem can be solved independently at different altitudes.

Definition of the problem
The extinction and backscattering coefficients carry information on the number size distribution through Mie scattering theory. Specifically, let n(r) be the number size distribution; then, under the approximation of spherical homogeneous particles, the extinction and backscattering coefficients are given by

α(λ) = ∫ from r_a to r_b of k_α(r, λ, m) n(r) dr,   (2)
β(λ) = ∫ from r_a to r_b of k_β(r, λ, m) n(r) dr,   (3)

where r_a and r_b are the lower and upper bounds for the particles' size; m is the Complex Refractive Index (CRI) of the target atmosphere; k_{α/β}(r, λ, m) are integral kernels defined as follows:

k_α(r, λ, m) = (λ²/2π) Σ_{n=1}^∞ (2n+1) Re(a_n + b_n),   (4)
k_β(r, λ, m) = (λ²/16π²) | Σ_{n=1}^∞ (2n+1) (−1)ⁿ (a_n − b_n) |²,   (5)

where a_n and b_n are the Mie coefficients, defined as follows (with x = 2πr/λ the size parameter):

a_n = [m ψ_n(mx) ψ′_n(x) − ψ_n(x) ψ′_n(mx)] / [m ψ_n(mx) ξ′_n(x) − ξ_n(x) ψ′_n(mx)],   (6)
b_n = [ψ_n(mx) ψ′_n(x) − m ψ_n(x) ψ′_n(mx)] / [ψ_n(mx) ξ′_n(x) − m ξ_n(x) ψ′_n(mx)],   (7)

ψ_n and ξ_n being the Riccati-Bessel functions.
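The Mie efficiencies entering the kernels can be evaluated with the standard downward-recurrence scheme for the logarithmic derivative and upward recurrences for the Riccati-Bessel functions. The sketch below follows the classical Bohren-Huffman recipe; the function name is ours, and the kernel prefactors (in particular the per-steradian convention for backscatter) are conventions that vary across the literature.

```python
import numpy as np

def mie_efficiencies(x, m):
    """Extinction and backscattering efficiencies Q_ext, Q_bsc for a sphere
    of size parameter x = 2*pi*r/lambda and complex refractive index m,
    via downward recurrence for the logarithmic derivative D_n and upward
    recurrences for the Riccati-Bessel functions (Bohren-Huffman scheme)."""
    nmax = int(x + 4.0 * x ** (1.0 / 3.0) + 2.0)
    nmx = max(nmax, int(abs(m * x))) + 16
    # Logarithmic derivative D_n(mx), stable by downward recurrence.
    D = np.zeros(nmx + 1, dtype=complex)
    for n in range(nmx, 0, -1):
        D[n - 1] = n / (m * x) - 1.0 / (D[n] + n / (m * x))
    # psi_{-1}, psi_0 and chi_{-1}, chi_0 initialize the upward recurrences.
    psi0, psi1 = np.cos(x), np.sin(x)
    chi0, chi1 = -np.sin(x), np.cos(x)
    qext = 0.0
    sback = 0.0 + 0.0j
    for n in range(1, nmax + 1):
        psi = (2 * n - 1) / x * psi1 - psi0          # psi_n(x)
        chi = (2 * n - 1) / x * chi1 - chi0          # chi_n(x)
        xi1 = complex(psi1, -chi1)                   # xi_{n-1}(x)
        xi = complex(psi, -chi)                      # xi_n(x)
        an = ((D[n] / m + n / x) * psi - psi1) / ((D[n] / m + n / x) * xi - xi1)
        bn = ((D[n] * m + n / x) * psi - psi1) / ((D[n] * m + n / x) * xi - xi1)
        qext += (2 * n + 1) * (an + bn).real
        sback += (2 * n + 1) * (-1) ** n * (an - bn)
        psi0, psi1 = psi1, psi
        chi0, chi1 = chi1, chi
    return 2.0 / x ** 2 * qext, abs(sback) ** 2 / x ** 2
```

Under the per-steradian convention for β, the kernels would then read k_α = πr²Q_ext and k_β = πr²Q_bsc/(4π); this mapping is our assumption and should be checked against the convention adopted in the actual retrieval code.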
The problem we want to solve consists in retrieving the number size distribution n(r) from a set of measurements {α(λ_i)}_{i=1,...,I} and {β(λ_j)}_{j=1,...,J}. By defining a data array y = (α(λ₁), . . . , α(λ_I), β(λ₁), . . . , β(λ_J)), and discretizing the possible values of r, we get an inverse problem of the form

y = K n + ε,   (8)

where n is a vector such that n_i = n(r_i), K is the discretization of the integral kernels, and ε is the noise affecting the data.
Solving the linear inverse problem defined by (8) with a reasonable discretization of the r variable (say, at least 200 points) entails recovering a large number of unknowns from a very small data set. One viable option is to reduce the number of unknowns by using a variable support function, as in Osterloh et al. (2011). Here we take a different approach, and assume that the number size distribution has a pre-defined shape that depends on a small number of parameters (Osterloh et al., 2013). Specifically, we assume that the number size distribution is the superposition of a small number N of log-normal distributions.
Each log-normal distribution is completely defined by 3 parameters: its mode µ, its width σ and its height h:

n(r) = Σ_{i=1}^{N} h_i exp( −(ln r − ln µ_i)² / (2 σ_i²) ).   (9)

The total number of parameters to be estimated is therefore 3N, which is substantially smaller than the number of parameters for solving the unconstrained, linear case. However, the problem is now non-linear and cannot be solved with standard Tikhonov regularization. Indeed, we can re-write equation (8) as

y = Σ_{i=1}^{N} h_i K_m(µ_i, σ_i) + ε,   (10)

where K_m(·, ·) are non-linear functions of the mode and width of each component, and can be derived by inserting (9) into (2)-(3). We observe that the dependence on the CRI m is indicated here by the subscript, because we will consider m to be fixed and known throughout this study.
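The log-normal superposition of eq. (9) is straightforward to evaluate on a discrete radius grid; a minimal sketch (the function name and example parameters are ours) is:

```python
import numpy as np

def lognormal_mixture(r, modes):
    """Evaluate the N-mode log-normal superposition of eq. (9):
    n(r) = sum_i h_i * exp(-(ln r - ln mu_i)^2 / (2 sigma_i^2)).
    `modes` is a list of (mu, sigma, h) triples."""
    r = np.asarray(r, dtype=float)
    n = np.zeros_like(r)
    for mu, sigma, h in modes:
        n += h * np.exp(-((np.log(r) - np.log(mu)) ** 2) / (2.0 * sigma ** 2))
    return n

# A bimodal example on a logarithmically spaced grid, as used in the paper.
r_grid = np.logspace(np.log10(0.01), np.log10(20.0), 500)   # radii in micron
n_grid = lognormal_mixture(r_grid, [(0.1, 0.4, 1.0), (1.0, 0.6, 0.2)])
```

Note that with this parametrization each mode peaks at r = µ_i with value h_i, which matches the "mode, width, height" description of the parameters.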

In this work, we set up a Bayesian model and apply a Monte Carlo technique to find the best parameters of unimodal (N = 1) and bimodal (N = 2) distributions, given a priori information and the data.

Bayesian approach
In the Bayesian framework, probability distributions are used to encode our degree of knowledge of the values of unknown/unobservable quantities: perfect knowledge is represented by a probability distribution which is non-zero only at the correct value, and partial knowledge is represented by a probability distribution which assigns high probability to likely values and low probability to unlikely values. The Bayesian framework is useful to combine a priori information, i.e. information available before the data are collected, with the information content of the data: a priori information is encoded in the so-called prior distribution, while the information content of the data is conveyed by the likelihood function.
Let us define the vector x = (µ_i, σ_i, h_i)_{i=1,...,N} containing all the unknown parameters. We denote its prior density as p(x); in this work, we assume that the individual parameters are independent of each other, so that their joint density is the product of the single densities; we also assume that no additional prior information is available, and use a uniform prior distribution for each of them, in a suitable interval. This leads us to define the prior as

p(x) = Π_{i=1}^{N} U(µ_i) U(σ_i) U(h_i),

where U() denotes the uniform distribution on the corresponding interval.

As far as the likelihood is concerned, we assume that the data are affected by Gaussian noise, which leads us to define the likelihood as

p(y|x) = N( y; Σ_{i=1}^{N} h_i K_m(µ_i, σ_i), σ_ε ),

where N(y; a, b) denotes the Gaussian distribution for the y variable, with mean a and standard deviation b. We can finally derive the posterior distribution as given by Bayes theorem:

p(x|y) = p(y|x) p(x) / p(y),   (11)

where p(y) is a normalization constant, whose value is unknown, but can be neglected as it does not depend on x. The posterior distribution represents all the available information on the values of the unknown set of parameters x, given the data y. As such, it represents the solution to the inverse problem. However, it is a probability distribution on a relatively high-dimensional space R^{3N}, and is therefore difficult to visualize. In the next section we illustrate a computational algorithm that allows us to obtain samples from this distribution and, in particular, to find the set of parameters that maximizes the posterior itself.
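Since only the unnormalized posterior is needed for sampling, it is convenient to work with its logarithm: a box (uniform) log-prior plus a Gaussian log-likelihood. A minimal sketch, assuming a generic forward operator `forward(x)` standing in for K_m (all names are ours):

```python
import numpy as np

def log_posterior(x, y, forward, noise_std, lower, upper):
    """Unnormalized log of eq. (11): Gaussian log-likelihood plus a uniform
    (box) log-prior. `forward` maps the parameter vector x to the predicted
    optical coefficients; `lower`/`upper` are the prior bounds."""
    x = np.asarray(x, dtype=float)
    if np.any(x < lower) or np.any(x > upper):
        return -np.inf                      # outside the uniform prior support
    residual = y - forward(x)
    return -0.5 * np.sum((residual / noise_std) ** 2)

# Toy usage: a small linear forward model standing in for K_m.
K = np.array([[1.0, 0.5], [0.2, 2.0], [1.5, 0.1]])
fwd = lambda x: K @ x
y_obs = fwd(np.array([1.0, 2.0]))
lp_true = log_posterior(np.array([1.0, 2.0]), y_obs, fwd, 0.05,
                        lower=np.zeros(2), upper=10 * np.ones(2))
```

The constant terms of the Gaussian and of p(y) are dropped, since the Metropolis-Hastings acceptance ratio below only depends on posterior ratios.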
Finally, let us observe that in this work we assume that N is known; while this is, in general, not true, it makes the problem more tractable. In practice, N can be selected by the user, and solutions with different values can be calculated and compared.
Future work will be devoted to considering the general case of unknown N .

Monte Carlo algorithm
The posterior distribution defined in (11) is a complicated function on a relatively high-dimensional space: the characterization of such a distribution requires a computational tool that is able to deal with narrow peaks and local modes. In this work we produce a numerical approximation of the posterior distribution using a Markov Chain Monte Carlo (MCMC) approach.
The general idea of Monte Carlo methods is to sample the target distribution p(x|y), i.e. to obtain a set of realizations x^(1), . . . , x^(n) of random variables X^(1), . . . , X^(n) distributed according to the target distribution p(x|y); indeed, the law of large numbers then guarantees that, for any suitable function f(X), the following approximation holds:

∫_Ω f(x) p(x|y) dx ≈ (1/n) Σ_{k=1}^{n} f(x^(k)),

where Ω is a given domain; the possibility of approximating the integral of any function f() implies that one has a full characterization of the target distribution.
In MCMC methods, the samples are obtained by simulating a Markov Chain whose invariant distribution is the target distribution, i.e. the transition kernel P(x′|x) of the Markov Chain satisfies

π(x′|y) = ∫ P(x′|x) π(x|y) dx.   (12)

The implications of this choice are obvious: if X^(k) is distributed according to π(x|y), then each X^(n) has the same distribution for n ≥ k. Importantly, if a set of technical conditions is met, after an initial "wandering" time, referred to as burn-in, the Markov Chain will hit the target distribution independently of the initialization. Therefore, once a transition kernel satisfying (12) has been constructed, it is enough to simulate from it, starting with a random initialization.
In this work we use the Metropolis-Hastings construction of the kernel, which has the following form:

P(x′|x) = α(x, x′) Q(x′|x) + (1 − a(x)) δ_x(x′),  with  a(x) = ∫ α(x, x′) Q(x′|x) dx′,   (13)

and has to be interpreted as follows: given the current state x, the Markov Chain will evolve with probability α(x, x′) according to a proposal distribution Q(x′|x), and with probability 1 − α(x, x′) it will remain in x. A sufficient condition that guarantees that the transition kernel defined in (13) is invariant with respect to π(x|y) is that the acceptance probability α(x, x′) is set as follows:

α(x, x′) = min( 1, [π(x′|y) Q(x|x′)] / [π(x|y) Q(x′|x)] ).   (14)

In the numerical simulations below we will be using Gaussian proposal distributions; Gaussian distributions have the pleasant property Q(x′|x) = Q(x|x′), so that the acceptance ratio simplifies, as detailed below.

In summary, the MCMC algorithm works as follows.
We start from an initializing value x^(1), drawn from the prior distribution. Then, for each k > 1 and for each parameter within x^(k), a new value is proposed by drawing a Gaussian perturbation around the current value; this way, a new state x* is proposed, which is identical to x^(k) for all but one value; e.g., assuming N = 1,

x* = (µ*, σ^(k), h^(k)),

where µ* is drawn from a Gaussian distribution of mean µ^(k).
We then compute the acceptance ratio α; given our previous choices of uniform prior and symmetric proposal distribution, α reduces to the ratio of the likelihoods, and the proposed value is accepted with probability min(1, α): if α ≥ 1, we set x^(k+1) = x*; otherwise, we draw a uniform number r ∈ [0, 1] and accept the proposed value (x^(k+1) = x*) if r < α; otherwise, we set x^(k+1) = x^(k), i.e. the chain does not move.
The algorithm proceeds for a fairly large number of samples; in the simulations below, we use 5,000 samples.
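The single-site Metropolis-Hastings scheme described above can be sketched as follows. For self-containedness we use a toy forward model (a few power-law moments of a single log-normal mode, standing in for the 3β+2α Mie kernels); the forward model, the bounds and the step sizes are illustrative assumptions, while the sampler itself is the generic scheme of the text.

```python
import numpy as np

rng = np.random.default_rng(0)

def trapz(y, x):
    # simple trapezoidal rule (avoids NumPy version differences)
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(x)))

# Toy forward model: power-law moments of one log-normal mode (an
# illustrative stand-in for the Mie kernels of the paper).
r = np.logspace(-2, np.log10(20.0), 500)
powers = (0.5, 1.0, 1.5, 2.0, 2.5)

def forward(params):
    mu, sigma, h = params
    n = h * np.exp(-((np.log(r) - np.log(mu)) ** 2) / (2.0 * sigma ** 2))
    return np.array([trapz(r ** p * n, r) for p in powers])

lower = np.array([0.05, 0.1, 0.1])      # uniform prior bounds (assumed)
upper = np.array([5.0, 1.0, 5.0])
truth = np.array([0.5, 0.4, 1.0])       # (mu, sigma, h) generating the data
y = forward(truth) * (1.0 + 0.05 * rng.standard_normal(len(powers)))
noise = 0.05 * np.abs(y)                # 5% error on the optical parameters

def log_post(params):
    # uniform prior + Gaussian likelihood, up to an additive constant
    if np.any(params < lower) or np.any(params > upper):
        return -np.inf
    return -0.5 * float(np.sum(((y - forward(params)) / noise) ** 2))

# Single-site random-walk Metropolis: symmetric Gaussian proposals, so the
# acceptance ratio reduces to the likelihood ratio.
x = rng.uniform(lower, upper)           # x^(1) drawn from the prior
lp = log_post(x)
chain, lps = [x.copy()], [lp]
for k in range(5000):
    for i in range(x.size):
        prop = x.copy()
        prop[i] += 0.05 * (upper[i] - lower[i]) * rng.standard_normal()
        lp_prop = log_post(prop)
        if np.log(rng.uniform()) < lp_prop - lp:    # accept w.p. min(1, alpha)
            x, lp = prop, lp_prop
    chain.append(x.copy())
    lps.append(lp)
best = chain[int(np.argmax(lps))]       # maximum-a-posteriori sample
```

Proposals falling outside the prior bounds get log-posterior −∞ and are automatically rejected, which implements the uniform prior without extra bookkeeping.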
Results

We show several examples of application of the Monte Carlo method to synthetic and real data. We first consider unimodal distributions with variable modal radius from "fine" to "coarse", and then bimodal distributions with components of variable amplitude. The synthetic distributions are generated so as to represent typical atmospheric aerosol distributions; for this reason, we performed an analysis of the distributions derived from the observations of the AERONET network. We then analyze three real datasets taken from the AERONET network. As already mentioned, throughout this study the Complex Refractive Index m is considered to be known: the possibility of estimating the CRI as well will be the subject of future developments. In order to perform a quantitative analysis of the reconstruction accuracy, we use two different methods.

Performance evaluation
The first method is based on the deviation between the size distribution (SD) reconstructed by the inversion algorithm and the simulated exact SD. Indeed, the algorithm determines the SD that reproduces the set of measured parameters within some tolerance, to account for the presence of noise; it is therefore first necessary to define a method for quantitatively measuring the distance between the synthetic SD and the reconstructed SD. This can be done using the deviation defined as:

deviation = [ ∫ from r_min to r_max of (SD_R(r) − SD_T(r))² dr / ∫ from r_min to r_max of SD_T(r)² dr ]^{1/2},

where SD_T is the true (simulated) size distribution; SD_R is the size distribution reconstructed by the algorithm; r_min and r_max are the minimum and maximum values of the radius, determined on the basis of the condition that within the interval the SD_T values are higher than 10⁻⁵ times the maximum of SD_T. This last condition allows only the significant parts of the distribution to be taken into account.
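On a discrete radius grid the deviation can be computed by quadrature; the sketch below assumes a relative L2 form for the deviation (our assumption) and applies the 10⁻⁵-of-maximum masking described above.

```python
import numpy as np

def trapz(y, x):
    # trapezoidal rule (avoids NumPy version differences)
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(x)))

def deviation(r, sd_true, sd_rec, threshold=1e-5):
    """Relative L2 distance between true and reconstructed SDs (an assumed
    form of the paper's deviation), restricted to the radii where the true
    SD exceeds `threshold` times its maximum."""
    mask = sd_true > threshold * sd_true.max()
    num = trapz((sd_rec[mask] - sd_true[mask]) ** 2, r[mask])
    den = trapz(sd_true[mask] ** 2, r[mask])
    return float(np.sqrt(num / den))

# Illustrative true SD: one log-normal mode at 0.5 micron.
r = np.logspace(-2, np.log10(20.0), 500)
sd_t = np.exp(-((np.log(r) - np.log(0.5)) ** 2) / (2 * 0.4 ** 2))
```

With this form, a reconstruction that over-estimates the true SD by a uniform 10% yields a deviation of exactly 0.1, which gives the metric a direct interpretation as a relative error.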

The second method of evaluating the accuracy of the solutions is based on the calculation of integral properties of the size distributions. Since our algorithm allows us to determine the volume size distribution, expressed as

v(r) = (4/3) π r³ n(r),

it is possible to derive the volume concentration; the Particulate Matter parameters PM1, PM2.5 and PM10, defined as the amount of particulate matter with size smaller than 1, 2.5 and 10 micron, respectively, can be computed as

PM_R = ρ ∫ from r_min to R of (4/3) π r³ n(r) dr,

where ρ represents the particulate matter density, which is here assumed constant and equal to 1, and R assumes the values 1, 2.5 or 10 µm for PM1, PM2.5 and PM10, respectively.

In addition, we also take into consideration the effective radius r_eff and the mean volume radius r_v, defined as:

r_eff = ∫ from r_min to r_max of r³ n(r) dr / ∫ from r_min to r_max of r² n(r) dr,
r_v = ∫ from r_min to r_max of r⁴ n(r) dr / ∫ from r_min to r_max of r³ n(r) dr,

where r_min and r_max assume in our case the values 0.01 µm and 20 µm, respectively.
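The integral properties above reduce to simple quadratures of moments of n(r); a minimal sketch (the function name is ours, and the PM definition follows the radius-bound reading given in the text) is:

```python
import numpy as np

def trapz(y, x):
    # trapezoidal rule (avoids NumPy version differences)
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(x)))

def integral_properties(r, n, rho=1.0):
    """PM concentrations and characteristic radii of a number SD n(r).
    Radii in micron; PM_R integrates the volume distribution
    v(r) = (4/3) pi r^3 n(r) up to radius R, with R = 1, 2.5 and 10,
    following the definition given in the text."""
    v = (4.0 / 3.0) * np.pi * r ** 3 * n
    pm = {R: rho * trapz(np.where(r <= R, v, 0.0), r) for R in (1.0, 2.5, 10.0)}
    r_eff = trapz(r ** 3 * n, r) / trapz(r ** 2 * n, r)   # effective radius
    r_v = trapz(r ** 4 * n, r) / trapz(r ** 3 * n, r)     # mean volume radius
    return pm, r_eff, r_v

# A narrow mode centred at 1 micron: r_eff and r_v should both be close to 1.
r = np.logspace(-2, np.log10(20.0), 2000)
n = np.exp(-((np.log(r) - np.log(1.0)) ** 2) / (2 * 0.05 ** 2))
pm, r_eff, r_v = integral_properties(r, n)
```

By construction PM1 ≤ PM2.5 ≤ PM10, and r_eff ≤ r_v for any non-degenerate distribution, which provides a quick sanity check of an implementation.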

Numerical validation
We first proceed to show that the deviation measure is a good indicator of performance, in the sense that its value gives a quantitative evaluation of the distance between the true and estimated size distributions. To this aim, in Simulation 1 we simulated a unimodal distribution with the parameters given in Table 1 (middle column). We then ran the reconstruction algorithm multiple times, from different random initializations and with different numbers of iterations ranging between 100 and 5,000, so as to reach a diverse population of final estimates. In the inversion we assumed that the optical input parameters (3β + 2α) are determined with a 5% error, and imposed the constraints on the parameters given in Table 1. The discretization of the r variable used 500 values logarithmically spaced between 0.01 and 20 µm. Having ascertained that the deviation is a good measure of the "closeness" of the reconstruction to the real distribution, it is natural to expect that a reconstruction with a small value of the deviation must correspond to a low value of the discrepancy between measured and predicted optical coefficients; however, the inverse statement is not true. In fact, since the discrepancy only measures the distance of the reconstructed optical parameters with respect to the input ones, small differences in the optical parameters (low values of the discrepancy) can correspond to very large differences in the size distribution parameters.
The statistical nature of the Monte Carlo method entails an intrinsic instability, in the sense that a repetition of the calculation with the same set of input optical parameters, even without error, leads to different reconstructions. The dispersion of the reconstructions obtained with the same initial conditions is a measure of the stability of the method.

Another issue is the effect of noise on the optical input parameters. This random perturbation causes a further increase in the instability of the reconstruction, which may also prevail over the intrinsic instability of the method.
In order to have a quantitative evaluation of the influence of the instability of the method with respect to the noise on the input parameters, we performed a statistical analysis of the discrepancy. A bimodal distribution was simulated with the parameter values given in Table 1. The values of the extinction coefficients at the wavelengths of 355 and 532 nm, and of the backscattering coefficients at the wavelengths of 355, 532 and 1064 nm, were then determined with Mie theory, considering homogeneous spherical particles.
The values of the optical coefficients were then used as input data for the reconstruction. The reconstruction was repeated 30 times, each time perturbing the set of input parameters in order to simulate a 5% error on each of them. We then calculated the discrepancy of each of the 30 sets of perturbed optical parameters with respect to the unperturbed set (input discrepancy). The distribution of the input discrepancy is shown in Figure 2, left plot, and the distribution of the values of the discrepancy of the corresponding 30 reconstructions (output discrepancy) is shown in the right plot of the same Figure. It should be kept in mind that the input discrepancy represents the "distance" between each set of perturbed parameters and the theoretical set, while the output discrepancy represents the "distance" between the set of parameters corresponding to each reconstruction and the set of input parameters of the same reconstruction. From Figure 2 we see that the uncertainty of 5% in the optical input parameters corresponds to a much greater fluctuation of the solution than the one due to the intrinsic instability of the method. The resulting uncertainty obviously takes into account both the instability of the method and the errors on the input parameters. In the tests conducted so far, the distribution of the optical parameters has been considered Gaussian; however, in view of the simple logical structure of the algorithm, it is in principle possible to introduce arbitrary distributions to take into account, for example, the contribution of systematic errors in the input optical parameters (consider, for example, that the error in the backscattering coefficient at 1064 nm can be dominated by the uncertainty on the Lidar Ratio, whose value should be fixed a priori in a more or less arbitrary manner).
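The input-discrepancy experiment can be sketched as follows; the illustrative coefficient values and the relative-RMS form of the discrepancy are our assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(1)

# Unperturbed optical parameters: arbitrary illustrative values for
# 2 extinction + 3 backscattering coefficients.
y0 = np.array([1.2e-4, 9.0e-5, 2.0e-6, 1.5e-6, 8.0e-7])

def discrepancy(y, y_ref):
    """Relative RMS distance between two sets of optical parameters
    (an assumed form of the paper's discrepancy)."""
    return float(np.sqrt(np.mean(((y - y_ref) / y_ref) ** 2)))

# 30 perturbed replicas, each with independent 5% Gaussian noise on every
# coefficient, and the resulting input-discrepancy distribution.
input_disc = np.array([
    discrepancy(y0 * (1.0 + 0.05 * rng.standard_normal(y0.size)), y0)
    for _ in range(30)
])
```

With 5% multiplicative Gaussian noise the input discrepancy concentrates around 5%, which is the baseline against which the much larger spread of the reconstructed solutions can be compared.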

Results with synthetic data
In the following we show the results of different tests for simulated SDs representing realistic cases. The reconstructions have been obtained by setting the number of iterations to 5,000 and by running the algorithm 30 times, with noise equal to 5%. For each run, we collect the best solution, and we provide uncertainty quantification, shown as a shaded area in the pictures below, using the standard deviation of the best solution across these runs.

Table 3. Performance metrics obtained by running the inversion algorithm with a unimodal distribution, when the data are generated by a synthetic unimodal distribution (cf. Fig. 3).
The a) and b) cases simulate a unimodal SD centered on the coarse mode of a realistic bimodal distribution; the c) and d) cases simulate a unimodal SD which approximates a fine mode of a realistic bimodal distribution.
In Table 3 we report the deviation of each of the reconstructions in Fig. 3 and the percentage deviation of the parameters PM1, PM2.5, PM10, PM, r_eff and r_v with respect to the simulated SD. Figures 5 and 6 show the reconstructions of the same distributions simulated in Figs. 3 and 4, obtained by using a bimodal distribution with the parameters given in Table 2. Tables 3, 4, 5 and 6 report the deviations of each reconstruction of Figs. 3, 4, 5 and 6, and the percentage deviations of the parameters PM1, PM2.5, PM10, PM, r_eff and r_v with respect to the simulated SD.

Results with real data
We finally validate our proposed method on three datasets that have been obtained from experimental data recorded by AERONET, by applying the direct Mie calculation to the size distributions reported in the AERONET database. For all three datasets, we attempt reconstruction with a unimodal and a bimodal distribution: constraints for the parameter values are reported in Table 7. All reconstructions were obtained with 5,000 iterations, 20 repetitions and 5% noise on the optical parameters. In Table 8 we report details of the datasets, including the registration site and date, as well as the performance evaluation of our reconstructions. We observe that reconstructions obtained with bimodal distributions provide consistently better results than those obtained with unimodal distributions, and that the error in assessing the PM concentrations remains at more than acceptable levels, particularly with bimodal distributions.

Discussion
The comparisons between the reconstructed and the simulated distributions, shown in Figures 3-6 with 5% noise on the optical parameters, allow us to draw some conclusions.

Figure 9. Results obtained by applying our proposed method, with a unimodal (left) and with a bimodal (right) distribution, to the experimental dataset recorded in Gozo.
As shown in Fig. 3 and Table 3, the reconstructions of simulated unimodal distributions with unimodal distributions give excellent results. The distributions with modal radius between 0.2 µm and 2.5 µm are reconstructed with high accuracy, both for the shape of the distribution and for the integral properties (PM1, PM2.5, PM10, PM). More critical is the reconstruction of the effective radius and the mean volume radius; the worst case happens when the size distribution is simulated with modal radius equal to 0.2 µm, in which case the difference with respect to the values of the simulated distribution is equal to 11% and 25% for r_eff and r_v, respectively. Note that the above results are obtained with very broad, unrealistic constraints on the parameters of the reconstructed distribution.
The reconstruction of bimodal distributions, with realistic values of the parameters, by using unimodal distributions (see Fig. 4 and Table 4) gives solutions which show very important deviations with respect to the simulated distributions; nevertheless, the values of the integral parameters have deviations which are less than 35%. In these cases, r_eff and r_v are always underestimated. These deviations compare favourably with those routinely obtained in the inversion of lidar data, which typically shows errors around 50% (Di et al., 2018a).
Figures 5 and 6 and the respective Tables 5 and 6 refer to reconstructions with bimodal SDs of a simulated unimodal distribution and a simulated bimodal distribution, respectively. In this case, the reconstructions are done by imposing stricter constraints than in the cases of unimodal distributions. The constraints are the product of a statistical analysis of the parameters of the size distributions from the AERONET network. In these cases the reconstruction of unimodal distributions gives solutions which have deviations of less than 5%, except for the case of distributions with modal radius equal to 0.2 µm, in which case PM10 and PM have deviations around 8% and 21%, respectively; such deviations (see Fig. 5d) are due to the fact that the algorithm often introduces some contribution of particles with radius around 5 µm. In all the cases, r_eff is reconstructed with an accuracy around 5%, while the error on r_v is less than 9%, except in the case of the distribution simulated with modal radius equal to 0.2 µm, where r_v is underestimated by 22%.

In the bimodal SD reconstruction (Fig. 6 and Table 6), the "fine" mode is always reconstructed better than the "coarse" mode; this effect is connected to the wavelengths used for determining the optical parameters of the lidar measurements. Overall, the integral properties are reconstructed with an error of less than 6% for PM1 and PM2.5, and between 6% and 44% for PM10 and PM; r_eff and r_v are recovered with an error between 7% and 37% and between 2% and 57%, respectively.
The analysis of real data confirmed that the inversion algorithm, particularly in the bimodal setting, is capable of recovering realistically shaped SDs.
Our analysis also highlights a few limitations of the proposed technique. First, the technique presents an inherent subjectivity as regards the choice of unimodal versus bimodal distributions; while, on average, the bimodal setting performs better, it can also produce some spurious modes, such as those in Fig. 5(d). In addition, we observe that the uncertainty quantification currently implemented seems to provide, at times, optimistic results, to the extent that the true distribution is not always included in the confidence bands. We reckon that these limitations can be overcome by using more complex but more powerful Monte Carlo sampling techniques, such as those described in (Sorrentino et al., 2014; Luria et al., 2019; Sciacchitano et al., 2019; Viani et al., 2021); this will be the topic of future studies.

Conclusions
The preliminary results presented in this paper indicate that the proposed method can effectively retrieve uni- and bi-modal distributions from extinction coefficients measured at two wavelengths and backscattering coefficients measured at three wavelengths. The reconstruction of three-modal distributions is more challenging, and more constraints might be necessary to obtain reliable and stable solutions. The extension of the method to three-modal distributions and to a variable refractive index, together with better uncertainty quantification and automatic model selection, will be the subject of future studies.