Estimation of PM2.5 Concentration in China Using Linear Hybrid Machine Learning Model

The satellite remote-sensing aerosol optical depth (AOD) and meteorological elements were employed to invert PM2.5 in order to control air pollution more effectively. This paper proposes a restricted gradient-descent linear hybrid machine learning model (RGD-LHMLM) by integrating a random forest (RF), a gradient boosting regression tree (GBRT), and a deep neural network (DNN) to estimate the concentration of PM2.5 in China in 2019. The research data included Himawari-8 AOD with high spatiotemporal resolution, ERA-5 meteorological data, and geographic information. The results showed that, in the hybrid model developed by linear fitting, the DNN accounted for the largest proportion, with a weight coefficient of 0.62. The R² values of RF, GBRT, and DNN were 0.79, 0.81, and 0.8, respectively. The generalization ability of the hybrid model was better than that of each sub-model: its R² reached 0.84, and its RMSE and MAE were 12.92 µg/m³ and 8.01 µg/m³, respectively. For the RGD-LHMLM, R² was above 0.7 in more than 70% of the sites, whereas RMSE and MAE were below 20 µg/m³ and 15 µg/m³, respectively, in more than 70% of the sites. Because the correlation coefficient between the meteorological factors and PM2.5 differed between seasons, the hybrid model performed best in winter (mean R² of 0.84) and worst in summer (mean R² of 0.71). The spatiotemporal distribution characteristics of PM2.5 in China were then estimated and analyzed. According to the results


Background
In recent years, pollutant emissions in China have increased and air pollution has worsened as a result of rapid urbanization and industrialization (Wang et al., 2019a).
Fine particulate matter (PM2.5), with a diameter below 2.5 μm, is the main component of air pollutants and has considerable impacts on human health, atmospheric visibility, and climate change (Gao et al., 2015; Pan et al., 2018; Pun et al., 2017). Global concern about PM2.5 has increased significantly since it was listed as a top carcinogen (Apte et al., 2015; Lim et al., 2020). Currently, ground monitoring is the most efficient method of measuring PM2.5 (Yang et al., 2018). However, monitoring stations are not evenly distributed due to terrain and construction costs; therefore, it is difficult to obtain accurate PM2.5 concentration data over a wide area (Han et al., 2015). To solve this problem, methods of estimating PM2.5 with satellite remote sensing were developed. Satellite remote sensing is characterized by wide coverage and high resolution (Hoff and Christopher, 2009; Xu et al., 2021). There is also a high correlation between PM2.5 and AOD obtained from satellite remote sensing inversion; therefore, AOD is an effective means of monitoring the spatiotemporal concentration characteristics of PM2.5.
After Engel-Cox et al. (2004) proposed using satellite AOD to estimate PM2.5 concentration, several studies followed this approach. Based on a regression model, Liu et al. (2005) introduced AOD, boundary layer height, relative humidity, and geographical parameters as the main controlling factors to estimate PM2.5 in the eastern part of the United States, and the verification coefficient R² obtained was 0.46. Tian and Chen (2010) used AOD, PM2.5, and meteorological parameters in Southern Ontario, Canada, to establish a semi-empirical model to predict hourly PM2.5 concentration, and the verification coefficient R² obtained in rural and urban areas was 0.7 and 0.64, respectively. Hu et al. (2013) proposed a geographically weighted regression model to estimate the surface PM2.5 concentration in the southeastern United States by combining AOD, meteorological parameters, and land use information. Their model's average R² was 0.6. Lee et al. (2012)
The Himawari-8 satellite, commonly used in the Asia-Pacific region, is a geostationary satellite launched by the Japan Meteorological Agency in 2014. Its observation frequency is 10 minutes, and its observations can characterize aerosols and provide AOD data with a resolution of 5 km (Bessho et al., 2016; Yumimoto et al., 2016). Due to its excellent performance, some scholars have used Himawari-8 data to estimate ground PM2.5. Wang et al. (2017) proposed an improved linear model that introduced AOD, meteorological parameters, and geographic information to estimate PM2.5 in the Beijing-Tianjin-Hebei region, and the verification coefficient R² was 0.86. Zhang et al. (2019b)
A large number of existing studies have examined the estimation of ground PM2.5 concentrations using satellite remote sensing AOD. However, the performance of the PM2.5 estimation models established in these studies varies greatly and is not stable across seasons and regions. To overcome this limitation, in this paper, a linear hybrid machine learning model (RGD-LHMLM) based on random forest (RF), gradient boosting regression tree (GBRT),

Satellite AOD Data
The Advanced Himawari Imager (AHI) on the Himawari-8 satellite launched by the Japan Meteorological Agency is a highly improved multi-wavelength imager. It adopts the full-disk observation method and has 16 visible and infrared channels. It features fast imaging and flexible observation areas and times. The Level-3 hourly AOD product, released by the Japan Aerospace Exploration Agency (JAXA), provides 500 nm AOD data with a spatial resolution of 5 km during the day. In previous studies (Zang et al., 2018)

Meteorological Data
ERA-5 is an hourly reanalysis dataset of atmospheric and land-surface meteorological elements since 1979, produced by the European Centre for Medium-Range Weather Forecasts (ECMWF) by reanalysing archived observations with its prediction model and data assimilation system. Data used in this paper include surface relative humidity (RH, in %), air temperature at a height of 2 m (TM, in K), wind speed at 10 m above the ground (U10, V10, in m/s), surface pressure (SP, in Pa), boundary layer height (BLH, in m), and cumulative precipitation (RAIN, in m). A series of studies has indicated that these parameters can affect the concentration of PM2.5 (Fang et al., 2016; Guo et al., 2017; Li et al., 2017b; Wang et al., 2019b).

Auxiliary Data
The auxiliary data used in this study include the high and low vegetation indices (LH, LL), ground elevation data (DEM), and population density data (PD). The high and low vegetation indices are derived from ERA5 reanalysis data and represent half of the total green leaf area per unit horizontal ground area for the high and low vegetation types, respectively. The ground elevation data are derived from SRTM-3 measurements jointly conducted by NASA and the National Imagery and Mapping Agency (NIMA) of the US Department of Defense, with a spatial resolution of 90 m. The population data are the 2015 UN-adjusted population density data provided by NASA's Socioeconomic Data and Applications Center (SEDAC), which are based on national censuses and adjusted for relative spatial distribution.

Random Forest
Random forest (RF) combines the Bagging algorithm with decision trees and is an extended variant of the parallel ensemble learning method (Stafoggia et al., 2019). To construct a large number of decision trees, the random forest model draws multiple bootstrap samples from the sample data. In each decision tree, nodes are split into sub-nodes using the best feature among a randomly selected subset, until all the training samples of a node belong to the same class. Finally, all the decision trees are merged to form the random forest. This method has proved to be effective in regression and classification problems and is one of the most well-known machine learning algorithms, used in many different fields (Yesilkanat, 2020).
https://doi.org/10.5194/amt-2021-64 Preprint. Discussion started: 30 March 2021 © Author(s) 2021. CC BY 4.0 License.
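The bagging idea described above can be illustrated with a minimal Python sketch. It is a simplification under stated assumptions, not the configuration used in this study: each "tree" is a depth-1 regression stump, one feature is drawn at random per tree, and the function names (`fit_stump`, `fit_forest`, `predict_forest`) and the toy data are hypothetical.

```python
import random

def fit_stump(rows, targets, feature):
    """Fit a depth-1 regression tree on one feature by minimising squared error."""
    values = sorted({row[feature] for row in rows})
    best = None
    for lo, hi in zip(values, values[1:]):
        thr = (lo + hi) / 2.0
        left = [t for row, t in zip(rows, targets) if row[feature] <= thr]
        right = [t for row, t in zip(rows, targets) if row[feature] > thr]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((t - left_mean) ** 2 for t in left)
               + sum((t - right_mean) ** 2 for t in right))
        if best is None or sse < best[0]:
            best = (sse, thr, left_mean, right_mean)
    if best is None:  # the feature is constant in this bootstrap sample
        mean = sum(targets) / len(targets)
        return feature, float("inf"), mean, mean
    return feature, best[1], best[2], best[3]

def fit_forest(rows, targets, n_trees=50, seed=0):
    """Bagging: every stump sees its own bootstrap sample and random feature."""
    rng = random.Random(seed)
    n_samples, n_features = len(rows), len(rows[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n_samples) for _ in range(n_samples)]  # sample with replacement
        feature = rng.randrange(n_features)                          # random feature choice
        forest.append(fit_stump([rows[i] for i in idx],
                                [targets[i] for i in idx], feature))
    return forest

def predict_forest(forest, row):
    """Average the predictions of all stumps (regression forest)."""
    preds = [left if row[f] <= thr else right for f, thr, left, right in forest]
    return sum(preds) / len(preds)

# Toy data (made up): feature 0 drives the target, feature 1 is uninformative.
rows = [[x, (x * 7) % 5] for x in range(20)]
targets = [10.0 if x < 10 else 30.0 for x in range(20)]
forest = fit_forest(rows, targets)
low = predict_forest(forest, [2, 0])
high = predict_forest(forest, [17, 0])
```

Averaging many weak, decorrelated learners is what reduces variance; a practical random forest would grow deeper trees and re-draw the candidate feature subset at every split.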

Gradient Boosted Regression Trees
Unlike the random forest, the gradient boosting regression tree (GBRT) is based on the Boosting algorithm and decision trees. The basic principle of GBRT is to construct M different basic learners through multiple iterations and continually increase the weight of the learners with a small error, to eventually generate a strong learner (Johnson et al., 2018). The core of this method is that, at each iteration, a new learner is built in the direction of residual reduction (the gradient direction) so that the residual decreases along the gradient (Schonlau, 2005). The basic learner of GBRT is a regression tree. During prediction, a predicted value is calculated according to the model obtained. The minimum squared error is used to select the optimal feature to split the dataset, and the average value of each child node is then taken as its predicted value.
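The residual-fitting loop at the heart of this method can be sketched in a few lines of Python. This is a minimal one-feature illustration with squared-error loss, where the residual equals the negative gradient; the stump learner, the learning rate of 0.1, and the toy data are assumptions for illustration and not the settings used in this study.

```python
def fit_stump(xs, residuals):
    """Return (threshold, left_mean, right_mean) minimising squared error."""
    values = sorted(set(xs))
    best = None
    for lo, hi in zip(values, values[1:]):
        thr = (lo + hi) / 2.0
        left = [r for x, r in zip(xs, residuals) if x <= thr]
        right = [r for x, r in zip(xs, residuals) if x > thr]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, thr, left_mean, right_mean)
    return best[1:]

def fit_gbrt(xs, ys, n_rounds=100, learning_rate=0.1):
    """Each round fits a stump to the current residuals and adds it, shrunken."""
    base = sum(ys) / len(ys)            # initial constant prediction
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]  # negative gradient of squared loss
        thr, left_mean, right_mean = fit_stump(xs, residuals)
        stumps.append((thr, left_mean, right_mean))
        preds = [p + learning_rate * (left_mean if x <= thr else right_mean)
                 for x, p in zip(xs, preds)]
    return base, learning_rate, stumps

def predict_gbrt(model, x):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(l if x <= thr else r for thr, l, r in stumps)

def mse(model, xs, ys):
    return sum((y - predict_gbrt(model, x)) ** 2 for x, y in zip(xs, ys)) / len(ys)

# Toy data (made up): a smooth nonlinear target.
xs = [float(x) for x in range(10)]
ys = [x ** 2 for x in xs]
baseline = (sum(ys) / len(ys), 0.1, [])     # constant model, no boosting rounds
model_10 = fit_gbrt(xs, ys, n_rounds=10)
model_100 = fit_gbrt(xs, ys, n_rounds=100)
```

On the training data the error shrinks as rounds are added; in practice the number of rounds and the learning rate are tuned on held-out data to avoid overfitting.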

Deep Neural Networks
A deep neural network (DNN) is a supervised learning model trained with the backpropagation algorithm to minimize a loss function. It adjusts its parameters through an optimizer and has high computational power, making it well suited to classification and regression problems (Wang and Sun, 2019). The structure of a DNN includes an input layer, an output layer, and several hidden layers. Each layer takes the output of all nodes of the previous layer as its input, and this process requires activation functions. Compared with other activation functions, the rectified linear unit (ReLU) has the advantages of simple derivation, faster convergence, and higher efficiency. Among the adaptive learning rate optimizers, the Adamax optimizer performs the best. It not only has the advantages of Adam in determining the learning rate range and having stable parameters in each iteration but also simplifies the definition of the upper limit of the learning rate and improves the iteration efficiency (Kingma and Ba, 2015). Therefore, in this paper, we selected the Adamax optimizer and the ReLU activation function to train the DNN.
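For illustration, the ReLU activation and the Adamax update rule as published by Kingma and Ba can be written out for a single scalar parameter. The sketch below is not the network trained in this study: the quadratic objective, the learning rate of 0.1, the step count, and the small `eps` safeguard against division by zero are assumptions chosen for a compact demonstration.

```python
def relu(x):
    """Rectified linear unit: passes positive inputs, zeroes the rest."""
    return x if x > 0.0 else 0.0

def adamax_minimize(grad, theta, lr=0.1, beta1=0.9, beta2=0.999,
                    steps=500, eps=1e-8):
    """Minimise a scalar function with Adamax, given its gradient function."""
    m = 0.0  # exponential moving average of the gradient (first moment)
    u = 0.0  # exponentially weighted infinity norm of past gradients
    for t in range(1, steps + 1):
        g = grad(theta)
        m = beta1 * m + (1.0 - beta1) * g
        u = max(beta2 * u, abs(g))
        # Only the first moment needs bias correction in Adamax.
        theta -= (lr / (1.0 - beta1 ** t)) * m / (u + eps)
    return theta

# Hypothetical objective f(theta) = (theta - 3)^2 with gradient 2 * (theta - 3).
theta_star = adamax_minimize(lambda th: 2.0 * (th - 3.0), theta=0.0)
```

Because the step is scaled by the running maximum of gradient magnitudes, each update is bounded by roughly the learning rate, which is the simplified upper bound referred to above.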

Model Establishment and Verification
After data processing, RF, GBRT, and DNN are used for modeling. To prevent the model parameters from being dominated by features with large or small value ranges and to speed up the convergence of the model, the data must be normalized before training. Finally, the three optimal sub-models are linearly combined to obtain the final hybrid model. To verify the model performance, this paper uses the 10-fold cross-validation method (Adams et al., 2020). In this method, the data are split into 10 folds, 9 for training and 1 for verification; this process is repeated 10 times, and the average of the 10 results is taken as the final result. The predicted values and the measured values are then fitted linearly. Several indicators are used to evaluate the model: the mean absolute error (MAE) and the root mean square error (RMSE), both of which equal 0 for a perfect model, in which the predicted and true values agree exactly, and increase as the error grows; the slope of the fitting equation; and the determination coefficient R² (the greater the value, the better the model fit).
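The fold splitting and the error indicators described above can be sketched in plain Python. The function names and sample values are hypothetical, and R² is computed here as 1 - SS_res/SS_tot, which is one common definition and is assumed rather than taken from this study (the squared correlation of the linear fit is another).

```python
import math

def kfold_indices(n_samples, k=10):
    """Yield (train_idx, test_idx) pairs; every sample is held out exactly once."""
    folds = [list(range(i, n_samples, k)) for i in range(k)]  # interleaved folds
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx

def mae(y_true, y_pred):
    """Mean absolute error; 0 for a perfect model."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean square error; 0 for a perfect model."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def r2(y_true, y_pred):
    """Coefficient of determination, assumed here to be 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Made-up measured and predicted concentrations (µg/m³).
y_true = [10.0, 20.0, 30.0, 40.0]
y_pred = [12.0, 18.0, 33.0, 39.0]
splits = list(kfold_indices(25, k=10))
```

In practice the sample order would be shuffled before splitting, and the indicators would be computed on the held-out fold of each split and then averaged.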


Performance Analysis of Monitoring Station Model
The spatial performance of the model was analyzed by measuring R², RMSE, and MAE at the monitoring stations. According to Figure 4, there are regional differences in the inversion performance of RGD-LHMLM. Across all monitoring stations, the average R² was 0.74, and R² was above 0.7 at more than 70% of the stations, especially in the densely populated and industrially developed areas.
The correlation coefficients between the monthly mean values of important meteorological parameters (AOD, BLH, TM, and RH) and R² were also analyzed. According to the results, the correlation coefficients between the meteorological parameters and PM2.5 were lower in summer. Furthermore, summer has many rainy days and extensive cloud cover, which hinders satellite observation and decreases the accuracy of AOD data. Therefore, the summer model performance is poor. There was a strong correlation between meteorological parameters and PM2.5 in autumn. The correlations in spring and winter were similar; however, the winter model performed better. The reasons can be interpreted as follows. In winter, the temperature and boundary layer height are low and the atmosphere is stable, which is not conducive to the diffusion of pollutants. Moreover, during the winter heating period, pollutant emissions rise sharply, resulting in a sharp rise in the concentration of PM2.5.
The increased pollution in winter ensures the quality and quantity of data, thereby improving the model performance effectively.
(1) In the RGD-LHMLM obtained from linear fitting, the DNN accounted for the largest proportion, with a weight coefficient of 0.62. The R² of RGD-LHMLM was 0.84, and its generalization ability was significantly better than that of a single model (DNN: 0.
(2) The RGD-LHMLM was spatially stable, with R² > 0.7 in more than 70% of sites and with RMSE < 20 μg/m³ and MAE < 15 μg/m³ in more than 95% of sites. These sites are mainly located in densely populated and industrially developed areas. Differences between seasons in the correlation between the inversion factors and PM2.5 lead to seasonal variations in the model performance. The performance was worst in summer, with an average R² of 0.71, and best in winter, with an average R² of 0.84.
(3) The spatiotemporal characteristics of the PM2.5 concentration in China varied markedly. North China and East China had the highest concentration of PM2.5, with an average annual concentration of 82.86 μg/m³, whereas Inner Mongolia, Qinghai, Tibet, and other regions had low pollution levels, with an average annual concentration of PM2.5 below 40 μg/m³. In winter, the concentration of PM2.5 was higher, with an average of 62.10 μg/m³, whereas the pollution was lighter in summer, with an average PM2.5 concentration of 47.39 μg/m³.
In conclusion, the RGD-LHMLM can accurately estimate the concentration of PM2.5 and capture the seasonal evolution of pollutants. These results can help control local pollution. This study also indicated that integrating multiple machine learning models effectively improved the accuracy of the fitting results. For more accurate pollutant data, such models can be employed in the future to fit PM2.5 with more parameters closely related to PM2.5. However, there are some missing values in the results of this study, and no data are available for some areas.
believed that satellite remote sensing AOD data are subject to interference from clouds, snow, and ice, which makes the reliability of the data questionable. They proposed a mixed model based on AOD calibration to predict the ground PM2.5 concentration in New England, USA, and achieved good results (R² = 0.83). Combining MODIS AOD and ground observation data, Lv et al. (2017) estimated the daily surface PM2.5 concentration in the Beijing-Tianjin-Hebei region and improved the data resolution to 4 km. The data used in these early studies are AOD products obtained from polar-orbit satellite sensors. Their daily observation frequency is limited and, due to the influence of cloud and ground reflection, the dynamic changes of PM2.5 cannot be captured. Geostationary satellite observations can be used to overcome this problem of low temporal resolution in estimating surface PM2.5 (Emili et al., 2010).
used the Himawari-8 hourly AOD product to estimate ground PM2.5 in China's four major urban agglomerations. The results showed significant diurnal, seasonal, and spatial changes and improved the temporal resolution of PM2.5 concentration estimates to the hourly level. As research into ground-based PM2.5 estimation deepens, traditional linear or nonlinear models cannot meet the requirements of large-scale estimation and are gradually being replaced by machine learning algorithms with strong nonlinear fitting ability. Liu et al. (2018) combined Kriging interpolation and the random forest algorithm to obtain high-resolution ground PM2.5 concentrations in the United States. To demonstrate the accuracy and superiority of the proposed method, the results were compared with the PM2.5 concentrations at ground measurement stations. Chen et al. (2019) stacked a variety of machine learning algorithms to predict PM2.5 concentration, discussed the influence of meteorological factors on PM2.5, and achieved an R² of 0.85. Li et al. (2017a) established a GRNN model for the whole of China to estimate PM2.5 concentration, and the results demonstrated that the performance of the deep learning model was better than that of the traditional linear model.
and deep neural network (DNN) is proposed to estimate ground PM2.5 concentration. The model performance is evaluated in time and space, and the causes of its variation are analyzed. Finally, the spatiotemporal distribution of PM2.5 concentration in China in 2019 is obtained.
PM2.5 data for 2019 used in this study are available from the China Environmental Monitoring Center's Air Quality Real-Time Publication System. The system provides hourly mean PM2.5 data. By the end of 2019, China had 1641 monitoring stations built and in operation.
Figure 1 shows the spatial distribution of monitoring stations in China.

Figure 1 Distribution of environmental monitoring stations in China (2019)
Himawari-8 AOD was compared with the AOD data of AERONET (Aerosol Robotic Network) in China and achieved good performance. The AOD data used in this study are the Himawari-8 Level-3 hourly AOD data for 2019, obtained from the Himawari Monitor website of the Japan Meteorological Agency.

Figure 2 Schematic diagram of the model
Figure 3. The RGD-LHMLM has the smallest degree of data dispersion, and the slope of its fitting line reaches 0.84, higher than those of the three sub-models, indicating the closest agreement between predicted and measured values.

Figure 3 Accuracy of model fitting and validation (A: RF, B: GBRT, C: DNN, D: RGD-LHMLM)
Model prediction accuracy was low (R² < 0.6) in Xinjiang, Tibet, Qinghai, Western Sichuan, and a few areas of Northeast China. The mean values of RMSE and MAE were 11.4 μg/m³ and 8.01 μg/m³, respectively. RMSE and MAE were below 20 μg/m³ and 15 μg/m³, respectively, at more than 95% of stations, indicating a low estimation error.

Figure 5 Scatter plots of monthly model performance in 2019

Figure 7 Variation trend of monthly averages of meteorological parameters (AOD, BLH, TM, RH) and R²

Temporal and Spatial Distribution Characteristics of PM2.5 Concentration in China

Figure 8 Monthly distribution of PM2.5 concentration in China in 2019

Table 1 Comparison of model accuracy
The PM2.5 inversion results of the single machine learning models show that DNN has the best inversion performance, followed by GBRT, and RF has the worst performance. The expression of the hybrid model obtained after linear combination is as follows:
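How the weights of such a linear combination could be fitted by a constrained gradient descent is sketched below in Python. This is an illustration only and does not reproduce the fitted expression of this study. It assumes that the "restriction" means non-negative weights that sum to 1, enforced by Euclidean projection onto the probability simplex; the exact constraint used here is not stated, and the function names, learning rate, and synthetic data are hypothetical.

```python
def project_to_simplex(w):
    """Euclidean projection onto {w : w_i >= 0, sum(w) = 1} (sort-based method)."""
    u = sorted(w, reverse=True)
    cumulative, theta = 0.0, 0.0
    for j, uj in enumerate(u, start=1):
        cumulative += uj
        t = (cumulative - 1.0) / j
        if uj - t > 0.0:          # holds for a prefix of j; keep the last such t
            theta = t
    return [max(wi - theta, 0.0) for wi in w]

def fit_blend_weights(sub_preds, y, lr=0.1, steps=2000):
    """Projected gradient descent on the mean squared error of the blend."""
    k, n = len(sub_preds), len(y)
    w = [1.0 / k] * k             # start from equal weights
    for _ in range(steps):
        blend = [sum(w[j] * sub_preds[j][i] for j in range(k)) for i in range(n)]
        grad = [2.0 / n * sum((blend[i] - y[i]) * sub_preds[j][i] for i in range(n))
                for j in range(k)]
        w = project_to_simplex([w[j] - lr * grad[j] for j in range(k)])
    return w

# Synthetic, normalised sub-model outputs (stand-ins for RF, GBRT, DNN predictions).
sub_a = [0.0, 1.0, 0.0, 1.0, 0.5, 0.2]
sub_b = [1.0, 0.0, 0.0, 0.5, 1.0, 0.3]
sub_c = [0.0, 0.0, 1.0, 0.5, 0.2, 1.0]
true_weights = [0.2, 0.2, 0.6]    # made-up weights used to generate the target
y = [sum(wt * p for wt, p in zip(true_weights, preds))
     for preds in zip(sub_a, sub_b, sub_c)]
weights = fit_blend_weights([sub_a, sub_b, sub_c], y)
```

On this noise-free toy problem the procedure recovers the generating weights. With real sub-model predictions the weights would be fitted on validation data, and the learning rate would need to suit the scale of the normalised inputs.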