Neural network modelling to estimate particle size distribution 
based on other particle sections and meteorological parameters

Abstract. In air quality research, particle mass concentrations are often the only indicators of aerosol particles considered. However, mass concentrations do not convey the full picture of the size-fractionated distribution: particles of different sizes deposit differently in the respiratory system and cause different kinds of harm. Aerosol size distribution measurements rely on a variety of techniques to classify the aerosol by size and measure the size distribution. The ambient size distribution is determined from the raw data using a suite of inversion algorithms. However, the inversion problem is often ill-posed and challenging to solve. Because of these instrumental and inversion limitations, models of the fractionated particle size distribution are of great value for filling gaps and replacing negative values. This study uses a merged particle size distribution from a scanning mobility particle sizer (NanoSMPS) and an optical particle sizer (OPS), covering 0.01 to 0.42 μm (electrical mobility equivalent size) and 0.3 to 10 μm (optical equivalent size), together with meteorological parameters collected at an urban background site in Amman, Jordan, from 1 August 2016 to 31 July 2017. We develop and evaluate feed-forward neural network (FFNN) models to estimate the number concentration at a particular size bin with (1) meteorological parameters, (2) number concentrations at other size bins, and (3) both of the above as input variables. Two layers with 10–15 neurons are found to be the optimal configuration. Lower model performance is observed at the lower edge (0.01 



Measurement sites and Instruments
In this study, we used a dataset collected during a measurement campaign in Amman, the capital city of Jordan, between 1 August 2016 and 31 July 2017. The city represents Middle Eastern urban conditions within the Middle East and North Africa (MENA) region. The region combines a variety of aerosol particle sources, including natural dust, anthropogenic pollution (e.g., from the petrochemical industry and urbanisation) and new particle formation.

The database includes particle size distributions and meteorological parameters, as mentioned in the first step in Figure 1. The aerosol measurements were carried out at the aerosol laboratory located on the third floor of the Department of Physics, University of Jordan (32°00′ N, 35°52′ E), in the neighbourhood of Al Jubeiha. The campus is situated in an urban background region in northern Amman. The campaign measured the particle number size distribution using a scanning mobility particle sizer (NanoScan SMPS 3910, TSI, MN, USA), which reports particle size distributions as electrical mobility equivalent diameter over 0.01–0.42 μm (13 channels). The size range of the SMPS system can be extended to coarse particles with an additional compact instrument: an optical particle sizer (OPS 3330, TSI, MN, USA). The OPS measures optical diameter over 0.3–10 μm (13 channels). This optical sizing method reports an optical particle diameter, which often differs from the electrical mobility diameter measured by the SMPS technique. The measurements were combined to provide a particle size distribution over the wider diameter range 0.01–10 μm, as described further in Section 2.2. The SMPS inlet flow rate was 0.75 lpm (±20%) while the sample flow rate was 0.25 lpm (±10%). The flow rate of the OPS was about 1 lpm.
The aerosol transport efficiency through the aerosol inlet assembly was determined experimentally: ambient aerosol was sampled alternately with and without the sampling inlet, and the aerosol data were corrected accordingly. The penetration efficiency was ~47% for 0.01 μm, ~93% for 0.3 μm and ~40% for 10 μm (Hussein et al., 2020). This measurement deficiency at the upper and lower edges is broadly consistent with the literature. Particle size measured by the nanoSMPS (Tritscher et al., 2013) tended to be underestimated by up to 34% for spherical particles larger than 0.2 μm (Fonseca et al., 2016). Liu et al. (2014) showed that, for other versions of SMPS, the detection limit for particles smaller than 0.03 μm is about 80–500 cm⁻³, which is up to 10 times larger than that for coarser particles. Stolzenburg and McMurry (2018) explained that discrepancies could result from DMAs with transfer functions that were degraded (i.e., broadened) by flow distortions caused by particle deposition within the classifier tube, sizing errors due to errors in flowmeter calibrations or leaks, CPC concentration errors due to improper pulse counting, and continuity failure in the DMA high voltage connection.

Data pre-processing
The next step in Figure 1 is data pre-processing. Since the sampling time resolution of the SMPS and the OPS was 1 min and 5 min, respectively, we synchronised the data into 5-min averages. Because the size ranges of the two instruments partly overlap, the last two SMPS size bins and the first OPS size bin were discarded. Finally, we merged the electrical mobility diameter range 0.01–0.25 μm from the SMPS with the optical diameter range 0.32–10 μm from the OPS, and obtained a wider particle size distribution covering the diameter range 0.01–10 μm. Merging electrical mobility diameter and optical diameter can be a challenge, and the overlapping region is often calculated with high uncertainty (DeCarlo et al., 2004; Tritscher et al., 2015). The challenge arises because the measured optical diameter depends on the refractive index of the particles, which in turn depends on their chemical composition. The sizing will therefore vary over time. The SMPS sizing also has a very slight dependency on the shape of the particles.

We also calculated the particle number concentration in four particle diameter modes (size-fractionated number concentration): nucleation (0.01–0.025 μm), Aitken (0.025–0.1 μm), accumulation (0.1–1 μm) and coarse mode (1–10 μm). The size-fractionated number concentrations were obtained by summing the measured particle number size distribution over the specified particle diameter range. The total number concentration was then obtained as the sum of all these fractions.
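The summation into size-fractionated number concentrations can be illustrated with a minimal sketch. It assumes that each size bin is represented by a single diameter and already carries a number concentration in cm⁻³, and that bins are assigned to a mode by a lower-inclusive, upper-exclusive rule. The function name `fractionate` and the example values are hypothetical and do not come from the campaign data.

```python
# Mode boundaries in micrometres, as listed in the text.
MODES = {
    "nucleation": (0.01, 0.025),
    "aitken": (0.025, 0.1),
    "accumulation": (0.1, 1.0),
    "coarse": (1.0, 10.0),
}


def fractionate(bin_diameters_um, bin_counts_cm3):
    """Sum per-bin number concentrations (cm^-3) into the four modes."""
    if len(bin_diameters_um) != len(bin_counts_cm3):
        raise ValueError("diameters and counts must have the same length")
    totals = {name: 0.0 for name in MODES}
    for d, n in zip(bin_diameters_um, bin_counts_cm3):
        for name, (lo, hi) in MODES.items():
            # Assumption: lower edge inclusive, upper edge exclusive,
            # except that a bin sitting exactly at 10 um counts as coarse.
            if lo <= d < hi or (name == "coarse" and d == hi):
                totals[name] += n
                break
    # Total number concentration is the sum of the four fractions.
    totals["total"] = sum(totals.values())
    return totals


# Illustrative values only, not measurements from the campaign.
diameters = [0.015, 0.05, 0.2, 0.5, 2.0, 10.0]
counts = [4000.0, 9000.0, 1000.0, 200.0, 1.5, 0.5]
result = fractionate(diameters, counts)
```

If the instrument reports dN/dlogDp rather than per-bin concentrations, each value would first need to be multiplied by the logarithmic bin width before summing.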

To perform neural network modelling, aerosol and meteorological data were first linearly interpolated across short periods of missing data. For longer periods of missing data, the whole rows were eliminated. The shorter gaps are due to technical faults, while the longer gaps are attributed to instrument maintenance (Zaidan et al., 2020). Only 71.8% of the total data in the measurement period was retained for modelling. Since the data came from different measured variables with various physical units and magnitudes, it was crucial to normalise them. The appropriate scaling depends on the chosen activation function. Here, each variable was scaled to a mean of 0 and a standard deviation of 1 to bring it into the working range of the activation function. The standardised data were then separated by month to account for the seasonal variation in atmospheric conditions. The data were further divided into a training set (70%) and a testing set (30%). The processed data were also converted to hourly and daily averages for reporting purposes.
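The pre-processing chain described above (gap filling, row elimination, standardisation and a 70/30 split) can be sketched as follows. This is only an illustration under stated assumptions: the threshold `max_gap` that separates short from long gaps, the use of the sample standard deviation and the chronological split are all hypothetical choices that the text does not specify.

```python
import statistics


def interpolate_short_gaps(values, max_gap=3):
    """Linearly fill runs of None no longer than max_gap (assumed threshold)."""
    out = list(values)
    i, n = 0, len(out)
    while i < n:
        if out[i] is None:
            start = i
            while i < n and out[i] is None:
                i += 1
            gap = i - start
            # Fill only if valid neighbours exist on both sides of the gap.
            if start > 0 and i < n and gap <= max_gap:
                left, right = out[start - 1], out[i]
                for k in range(gap):
                    out[start + k] = left + (right - left) * (k + 1) / (gap + 1)
        else:
            i += 1
    return out


def drop_incomplete_rows(rows):
    """Eliminate whole rows that still contain a missing value."""
    return [row for row in rows if all(v is not None for v in row)]


def standardise(column):
    """Scale one variable to zero mean and unit (sample) standard deviation."""
    mean = statistics.fmean(column)
    std = statistics.stdev(column)
    if std == 0:
        raise ValueError("cannot standardise a constant column")
    return [(v - mean) / std for v in column]


def split_train_test(rows, train_fraction=0.7):
    """Assumption: chronological split, first 70% for training."""
    cut = int(round(len(rows) * train_fraction))
    return rows[:cut], rows[cut:]


# Illustrative values only.
filled = interpolate_short_gaps([1.0, None, None, 4.0])
scaled = standardise([1.0, 2.0, 3.0])
train, test = split_train_test(list(range(10)))
```

A random split would be an equally plausible reading of the text; only the 70/30 proportion is taken from the source.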

Modelling
After data collection and data pre-processing procedures, the next step is model optimisation (Figure 1 [...] a series of layers. The first layer has a connection from the network input. Each subsequent layer has a connection from the previous layer. The final layer produces the network's output. A neuron can be thought of as a combination of two parts. The first part computes a weighted sum of its inputs:

$$ z_j^{(L)} = \sum_i w_{ji}^{(L)} x_i + b_j^{(L)} \qquad (1) $$

where $z_j^{(L)}$ and $b_j^{(L)}$ are the intermediate output and the bias term for the $j$th neuron at the $L$th layer, respectively, and $w_{ji}^{(L)}$ is the weight that the $j$th neuron at the $L$th layer applies to the data point $x_i$. The second part applies the activation function (the sigmoid function in this study) to $z_j^{(L)}$ to give the output of the neuron:

$$ \sigma\left(z_j^{(L)}\right) = \frac{1}{1 + e^{-z_j^{(L)}}} \qquad (2) $$

The FFNN model was created, trained and simulated with MATLAB (version: 8.3.0.532), using the Neural Network Toolbox. We initialised the weights randomly and updated them with the Levenberg-Marquardt algorithm, which was the fastest available back-propagation training function (Chaloulakou et al., 2003). We performed several iterations within a cycle to minimise the training loss with Bayesian regularisation. These steps were repeated until we found the combination of the number of hidden layers and the corresponding number of neurons that gave the minimum error. According to the review paper by [...] suitable model configuration, the model estimates number concentration using testing data. Finally, the selected performance metrics, described in Section 2.4, can be calculated, and we evaluate which approach is the most suitable for size distribution estimation.
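The two-part view of a neuron (weighted sum plus bias, followed by a sigmoid activation) can be made concrete with a short forward-pass sketch. It is written in Python rather than MATLAB and does not reproduce the toolbox implementation or the Levenberg-Marquardt training. The linear output layer, the uniform weight initialisation and the number of inputs (5) are assumptions made for illustration, and all function names are hypothetical.

```python
import math
import random


def sigmoid(z):
    """Sigmoid activation applied to the intermediate output z."""
    return 1.0 / (1.0 + math.exp(-z))


def layer_forward(inputs, weights, biases, activation=sigmoid):
    """Compute z_j = sum_i w_ji * x_i + b_j, then apply the activation."""
    outputs = []
    for w_j, b_j in zip(weights, biases):
        z_j = sum(w_ji * x_i for w_ji, x_i in zip(w_j, inputs)) + b_j
        outputs.append(activation(z_j))
    return outputs


def init_layer(n_in, n_out, rng):
    """Random initial weights and biases (uniform range is an assumption)."""
    weights = [[rng.uniform(-1.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]
    biases = [rng.uniform(-1.0, 1.0) for _ in range(n_out)]
    return weights, biases


def ffnn_forward(x, layers):
    """Sigmoid hidden layers; the final layer is left linear (assumption)."""
    a = x
    for idx, (weights, biases) in enumerate(layers):
        is_last = idx == len(layers) - 1
        act = (lambda z: z) if is_last else sigmoid
        a = layer_forward(a, weights, biases, activation=act)
    return a


# Two hidden layers of 10 neurons and one output; 5 inputs is hypothetical.
rng = random.Random(0)
sizes = [5, 10, 10, 1]
layers = [init_layer(n_in, n_out, rng) for n_in, n_out in zip(sizes, sizes[1:])]
estimate = ffnn_forward([0.1, -0.3, 0.5, 0.0, 1.2], layers)
```

The weights here are untrained, so `estimate` is meaningless as a prediction; the sketch only shows how an input vector propagates through the layers.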

Performance metrics
We choose the optimal combination of the number of hidden layers and the corresponding number of neurons by its mean absolute error (MAE), a simple measure of the residuals of the values estimated by the model. To identify which size bins are predicted best, two metrics are used: the coefficient of determination (R²) and the normalised root-mean-square error (NRMSE). R² measures how well the observed outcomes are replicated by the model, based on the proportion of the total variation of outcomes explained by the model. NRMSE represents the standard deviation of the estimation errors relative to the mean of the measurements. NRMSE is used rather than the more common RMSE because the number concentrations in different size ranges are of different magnitudes; without normalisation by the mean, the size ranges could not be compared on an equal footing.

$$ \mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right| $$

$$ R^2 = 1 - \frac{\sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}{\sum_{i=1}^{n} \left( y_i - \bar{y} \right)^2} $$

$$ \mathrm{NRMSE} = \frac{1}{\bar{y}} \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2} $$

where $y_i$, $\hat{y}_i$ and $\bar{y}$ represent the $i$th measured value, the $i$th value estimated by the model and the mean of all the measurement data, respectively, and $n$ denotes the total number of valid measurement data.

[...] Figure 3a). RH varied widely, from 10% to 100%, with an hourly median of 52.3%, and did not seem to follow a seasonal pattern (Figure 3b). In summer months, the wind appeared to be stronger but the wind direction was more stable, mostly from the northwest (270°–360°). In cold months, the average wind speed was lower but the wind blew from fluctuating directions. During the whole measurement period, wind speed ranged between 0 and 6 m s⁻¹ with a median of 1.39 m s⁻¹ (Figure 3c [...] is mainly because of road dust resuspension and might also be attributed to dust storms via long-range transport (Hussein et al., 2019). In this study, we further explore how wind direction influences the particle number concentration (Figure 4).
Wind from the northwest (225°–325°) was generally stronger but was associated with lower particle number concentrations, because the measurement site lies on the outskirts of downtown. Wind from the east and south (45°–225°) was weaker but was associated with higher hourly particle number concentrations. The urban centre, where all kinds of industrial activities take place, lies in that direction. When only coarse particles are considered, relatively high number concentrations are found when the south-westerly wind is strong. This further suggests that the coarse particles in that region might come mostly from long-range transported sea salt from the Dead Sea or dust particles from nearby deserts.
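The dependence of number concentration on wind direction can be explored by grouping hourly observations into direction sectors and comparing a summary statistic per sector. The sketch below is one possible way to do this and is not the procedure behind Figure 4; the 45° sector width, the use of the median and the function names are assumptions, and the example values are invented.

```python
import statistics


def sector_of(direction_deg, width_deg=45):
    """Return the lower edge (degrees) of the sector containing a direction."""
    return int(direction_deg % 360 // width_deg) * width_deg


def median_by_sector(directions_deg, concentrations, width_deg=45):
    """Median number concentration for each wind-direction sector."""
    groups = {}
    for d, c in zip(directions_deg, concentrations):
        groups.setdefault(sector_of(d, width_deg), []).append(c)
    return {s: statistics.median(v) for s, v in sorted(groups.items())}


# Illustrative hourly values only (direction in degrees, concentration in cm^-3).
directions = [300, 310, 100, 120, 130]
concentrations = [8.0e3, 1.0e4, 2.0e4, 3.0e4, 4.0e4]
by_sector = median_by_sector(directions, concentrations)
```

The same grouping could be applied to the coarse-mode concentration alone, or combined with wind-speed classes, to separate the effects of direction and speed.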

General pattern of particle size distribution
Hourly total number concentration ranged from 1.90×10³ cm⁻³ to 1.52×10⁵ cm⁻³, with a median of 1.36×10⁴ cm⁻³. Figure 5a shows a moderate seasonal pattern in general: lower in summer months and higher in colder months. Hussein et al. (2019) also characterised the modal structure of the particle number size distribution for the same site.

[...] Table 1, the total number concentration of all particle sizes (1.70±1.26×10⁴ cm⁻³) is mostly accounted for by the Aitken mode (45–80%, average: 1.09±1.01×10⁴ cm⁻³), followed by the nucleation mode (10–50%, average: 0.48±0.32×10⁴ cm⁻³). The accumulation mode (0–15%, average: 0.13±0.08×10⁴ cm⁻³) comes third, and coarse particles make up less than 0.5% of the total particle number concentration, with an average of 2.13±2.80 cm⁻³ (Figure 5b [...]

No single combination entirely outperforms the others in all size bins. We summed the MAE over all size bins and chose 2 layers and 10 neurons, which gave the lowest overall residuals (Table 2) [...]

[...] replacing negative values in the raw data from particle sizers. While some instrument manufacturers build in algorithms that replace them with artificial non-negative numbers, most end-users simply remove the seemingly impossible negative values from the dataset. The ideal solution is a parallel instrument that overlaps with the affected particle size range. However, in many cases this is not possible because of financial constraints. Therefore, we shall rely on the mutual [...]

[...] (Laakso et al., 2003); therefore, the mapping of the relationships between long-range transported accumulation mode particles and covariates is presumably not well understood. However, the relative prediction ability in this study is not lower, given that local meteorological variables were used as input variables. A possible reason is that this mode falls exactly in the instrumental overlapping region, which leads to lower predictability.
The locally produced Aitken mode particles (0.03 < Dp < 0.1 μm) are less effectively removed from the atmosphere by transformation processes (e.g., evaporation and coagulation) than nucleation mode particles (0.01 < Dp < 0.03 μm), allowing the prediction models to better capture their relationships with the input variables, which is in line with Al-Dabbous et al. (2017).

Temporal pattern
Figure 9 shows the diurnal discrepancies during workdays and weekends. The relative particle number concentration is defined as the ratio of the modelled concentration to the measured concentration. Values above 1 indicate overestimation, while values below 1 indicate underestimation. For approach 1, except for the overlapping size bin, which is underestimated by more than 50% at all hours, the difference between modelled and measured hourly number concentration is within 50% during both workdays and weekends. On workdays, overestimation is found in the early morning before 3 a.m. for all size bins, especially for UFP. Following this overestimation, at about 6 a.m. the modelled number concentration is underestimated by up to 40%, especially at size bins below 0.1 μm. During the day, the modelling uncertainties are rather small until the evening, from 6 p.m. to 11 p.m., when the modelled UFP number concentration is again moderately overestimated. This reveals that the model with only meteorological parameters as inputs fails to capture the diurnal pattern from 6 p.m. to 7 a.m., in particular for UFP. The pattern of the performance on weekends is not as distinctive as on workdays. It shows overestimation not only for UFP in the early morning at about 3 a.m., but also at the upper edge, above 5 μm, from 3 a.m. to 4 p.m. From 7 p.m. onwards until noon, underestimation is found at all size bins. For approach 2, except for the overlapping size bin, which is significantly overestimated from 6 p.m. to 7 a.m., most size bins show an uncertainty of only about 10% during both workdays and weekends. The model performance over weekends shows relatively larger uncertainties. The smallest bin at 0.01 μm is slightly underestimated at all hours of the day. Apart from these cases, models with the full spectrum of the size distribution as inputs capture the diurnal pattern fairly well for all size bins.

Figure 10 further shows the monthly deviation in modelling performance. For approach 1, higher R² is found in November, February and April in the range of the SMPS. Other than that, there is no observable variation in R² for approach 1. For approach 2, except in January, when all the rows were eliminated because of the lack of wind information, performance in the other months is steady for most size ranges. At 0.21 μm, model performance varies across months. [...] can be thought of as replacing 'negative' values in the raw data from particle sizers, including the SMPS used in this paper. Instead of being eliminated, the negative values can be estimated from other size bins with high accuracy in order to keep the symmetry of the data error distribution (Viskari et al., 2012).
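The diurnal and monthly comparisons in this section rest on the performance metrics (MAE, R², NRMSE) and on the ratio of modelled to measured concentration. A minimal sketch of these quantities follows, assuming the standard definitions: R² as one minus the ratio of residual to total sum of squares, and NRMSE as RMSE divided by the mean of the measurements. The grouping helper `r_squared_by_group` is hypothetical and only illustrates how a per-month R² could be obtained.

```python
import math


def mae(y, y_hat):
    """Mean absolute error between measured y and modelled y_hat."""
    return sum(abs(a - b) for a, b in zip(y, y_hat)) / len(y)


def r_squared(y, y_hat):
    """Coefficient of determination, assumed as 1 - SS_res / SS_tot."""
    mean_y = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, y_hat))
    ss_tot = sum((a - mean_y) ** 2 for a in y)
    return 1.0 - ss_res / ss_tot


def nrmse(y, y_hat):
    """RMSE normalised by the mean of the measurements."""
    mean_y = sum(y) / len(y)
    rmse = math.sqrt(sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y))
    return rmse / mean_y


def relative_concentration(modelled, measured):
    """Above 1 means overestimation, below 1 means underestimation."""
    return modelled / measured


def r_squared_by_group(labels, y, y_hat):
    """R^2 computed separately for each label, e.g. each month."""
    groups = {}
    for g, a, b in zip(labels, y, y_hat):
        ys, hs = groups.setdefault(g, ([], []))
        ys.append(a)
        hs.append(b)
    return {g: r_squared(ys, hs) for g, (ys, hs) in groups.items()}


# Illustrative values only.
measured = [1.0, 2.0, 3.0]
modelled = [1.0, 2.0, 4.0]
scores = (mae(measured, modelled), r_squared(measured, modelled), nrmse(measured, modelled))
```

Because NRMSE divides by the mean of the measurements, it stays comparable between the nucleation mode (thousands of particles per cm³) and the coarse mode (a few particles per cm³), which is the motivation given for preferring it over RMSE.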

Conclusion
This paper presents the evaluation of feed-forward neural network (FFNN) models for estimating particle number concentration at various particle size bins. Input predictors include a merged particle size distribution, measured by a scanning mobility particle sizer (NanoSMPS) and an optical particle sizer (OPS), which covers the size range from 0.01 to 10 μm, and