Continuous mapping of fine particulate matter (PM 2.5 ) air quality in East Asia at daily 6x6 km 2 resolution by application of a random forest algorithm to 2011-2019 GOCI geostationary satellite data

. We use 2011-2019 aerosol optical depth (AOD) observations from the Geostationary Ocean Color Imager (GOCI) instrument over East Asia to infer 24-h daily surface fine particulate matter (PM 2.5 ) concentrations at continuous 6x6 km 2 resolution over eastern China, South Korea, and Japan. This is done with a random forest (RF) algorithm applied to the gap-filled GOCI AODs and other data and trained with PM 2.5 observations from the three national networks. The predicted 24-h PM 2.5 15 concentrations for sites entirely withheld from training in a ten-fold crossvalidation procedure correlate highly with network observations (R 2 = 0.89) with single-value precision of 26-32% depending on country. Prediction of annual mean values has R 2 = 0.96 and single-value precision of 12%. The RF algorithm is only moderately successful for diagnosing local exceedances of the National Ambient Air Quality Standard (NAAQS) because these exceedances are typically within the single-value precisions 20 of the RF, and also because of RF smoothing of extreme PM 2.5 concentrations. The area-weighted and population-weighted trends of RF PM 2.5 concentrations for eastern China, South Korea, and Japan show steady 2015-2019 declines consistent with surface networks, but the surface networks in eastern China and South Korea underestimate population exposure. Further examination of RF PM 2.5 fields for South Korea identifies hotspots where surface network sites were initially lacking and shows 2015-2019 PM 2.5 25 decreases across the country except for flat concentrations in the Seoul metropolitan area. Inspection of monthly PM 2.5 time series in Beijing, Seoul, and Tokyo shows that the RF algorithm successfully captures observed seasonal variations of PM 2.5 even though AOD and PM 2.5 often have opposite seasonalities. Application of the RF algorithm to urban pollution episodes in Seoul and Beijing demonstrates high skill in reproducing the observed day-to-day variations in air quality as well as 30 spatial patterns on the 6 km scale. Comparison to a CMAQ simulation for the Korean peninsula demonstrates the value of the continuous RF PM 2.5 fields for testing air quality models, including over North Korea where they offer a unique resource. suggesting that the gap-filling procedure does not bias the results. The algorithm has only moderate success at predicting NAAQS exceedance events because most of these events are within the single-value precision, and also because of some smoothing of the extreme high tail of the PM 2.5 frequency distribution.


Introduction
Exposure to outdoor fine particulate matter (PM2.5) is a global public health issue, accounting for 8.9 35 million deaths in 2015 [Burnett et. al., 2018]. Beyond mortality, short-term exposure to elevated PM2.5 levels is associated with numerous adverse health outcomes including increased hospital admissions for respiratory and cardiovascular issues [Dominici et. al., 2006;Wei et. al., 2019]. Long-term exposure is associated with neurodegenerative diseases such as dementia, Alzheimer's disease, and Parkinson's disease [Kioumourtzoglou et. al., 2016]. High spatio-temporal monitoring of PM2.5 concentrations to 40 inform population exposure is important for both air quality regulation and epidemiological studies. Ground monitors can provide highly accurate measurements but have limited spatial coverage. Here we show how geostationary satellite observations of aerosol optical depth (AOD) over East Asia from the Geostationary Ocean Color Imager (GOCI) can be used with a random forest (RF) machine learning (ML) algorithm to provide continuous long-term reliable mapping of 24-h PM2.5 at 6x6 km 2 spatial 45 resolution.
The potential of satellites for high-resolution monitoring of PM2.5 has long been recognized in the public health community [Liu et al., 2004;van Donkelaar et. al., 2006]. Satellites retrieve AOD by backscatter of solar radiation. The MODIS sensors launched in 1999 on the NASA Terra and Aqua satellites have been the main source of AOD data, with global coverage twice a day at up to 1 km 50 resolution [Remer et. al., 2005[Remer et. al., , 2013Lyapustin et. al., 2018]. Early approaches to relate AOD observations to surface PM2.5 used chemical transport models (CTMs) to estimate local PM2.5/AOD ratios [Liu et al., 2004;van Donkelaar et. al., 2006], with more recent studies adding ancillary satellite data on the vertical distribution of aerosol extinction van Donkelaar et. al., 2016;van Donkelaar et. al., 2019]. Other approaches have used PM2.5 network data to infer PM2.5/AOD ratios 55 [Wang and Christopher, 2003], with statistical models based on meteorological and land-use predictor variables to enable spatial extrapolation [Gupta and Christopher, 2009;Liu et. al., 2009;Kloog et. al., 2012;2014]. More recently, non-parametric machine learning models have been developed to predict PM2.5 from satellite AOD observations including neural networks [Li et. al., 2017;Zang et. al., 2019] and RFs [Hu et. al., 2017;Brokamp et. al., 2018]. 60 Geostationary satellites are now dramatically increasing the capability for mapping of PM2.5 from space. The GOCI instrument launched in 2010 by the Korea Aerospace Research Institute (KARI) observes AOD eight times daily at 0.5x0.5 km 2 pixel resolution over eastern China, the Korean peninsula, and Japan . The fine-pixel hourly information is intrinsically valuable and also facilitates cloud clearing [Remer et al., 2012]. GOCI AOD data aggregated to 6x6 km 2 resolution 65 have been used to estimate PM2.5 in regional studies for eastern China and South Korea [Xu et al., 2015;Park et al., 2019;She et. al., 2020]. AODs cannot be observed under cloudy conditions, and AOD retrievals can also fail for other reasons including snow surfaces. Different methods have been used to fill the gaps and produce continuous data sets. Some studies use CTM AODs when satellite data are missing [Hu et. al., 2017;70 Stafoggia et. al., 2019]. Others use a statistical interpolation algorithm such as Kianian et. al. [2021], who combined a RF with the lattice kriging method to infer missing AOD over the US. Yet others first estimate PM2.5 using available AOD observations, then infer missing PM2.5 estimates using a separate gap-filling model [Kloog et. al., 2014;She et. al., 2020]. , and Japan, using large data symbols for visibility. Zoomed inset for South Korea shows the surface network observations with symbols corresponding to the 6x6 km 2 grid of the GOCI data. Log scale used for colorbar.

120
Meteorological and geographical predictor variables. We use hourly meteorological data from the ERA5 global reanalysis, with resolution of 30x30 km 2 [Hersbach et. al., 2020], as input predictor variables for the RF algorithm. For this purpose we aggregate the data to 24-h averages and allocate them to 6x6 km 2 GOCI grid cells by bilinear interpolation. We consider boundary layer height, 2-m air temperature and relative humidity (RH), 10-m meridional and zonal winds, and sea level pressure as 125 potential meteorological predictor variables. We also include as geographical predictor variables latitude, year, day of year (1-366), and nation category (eastern China, South Korea, or Japan). We also considered 2015 population density [CIESIN, 2018] as a potential predictor variable but find that it is not useful as discussed in section 3.2. 130 Figure 1 shows the mean distributions of GOCI AOD and surface network PM2.5 for 2011-2019 or for the more limited durations of their records (2014-2019 for eastern China PM2.5, 2015-2019 for South Korea PM2.5). The PM2.5 networks are extensive but coverage is nevertheless sparse and often limited to large urban areas, as illustrated by the zoomed inset for South Korea. We find that only 1.0% of GOCI 6x6 km 2 grid cells have PM2.5 observations in eastern China,7.4% in South Korea,and 7.9% 135 in Japan. This geographic limitation in the PM2.5 networks emphasizes the value of continuous coverage from the AOD data.  year-round statistics while the right panel shows winter months (DJF) only. Figure 2 shows the percentage of days with at least one successful hourly GOCI AOD retrieval on the 6x6 km 2 retrieval grid. There are substantial gaps in the record, mostly reflecting clouds and also snow cover in winter . We seek to fill in these gaps to produce a continuous daily data set while accounting for the associated errors. We fuse two strategies according to the availability 145 of nearby AOD retrievals: an inverse distance weighted (IDW) interpolation AODIDW of nearby retrievals [Shepard, 1968] and a bias-corrected monthly AODGC from the GEOS-Chem CTM:

AOD gap-filling
where α is a weighting factor that depends on the distance from nearest retrievals. GEOS-Chem is a 150 widely used CTM for inferring PM2.5 from satellite AOD data [Liu et al., 2004;van Donkelaar et. al., 2006;2019;. Here we use scaled monthly mean GEOS-Chem AODs from a simulation by Zhai et al. [2021] for 2016 in East Asia with 0.5 o x 0.625 o resolution. That simulation reported a low mean bias relative to AERONET; we correct this for each year in the study period by using annual mean GOCI AODs on the 6x6 km 2 grid. In this way we obtain a spatial distribution of 155 monthly mean AODGC values for 2011-2019 for use in equation (1). We calculate the weighting factors used in Equation (1) via the Gaspari-Cohn function, a fifthorder piecewise polynomial with a radial argument [Gaspari and Cohn, 1999]. The Gaspari-Cohn function resembles a Gaussian distribution but with compact support, taking on a maximum value of 1 for = 0 and a minimum value of 0 for ≥ 2. We define = l/c for a given 6x6 km 2 grid cell and day 160 to be the distance l from the midpoint of the grid cell to that of the nearest observed grid cell, normalized by a spatial correlation length scale determined from available AOD observations in and around that grid cell. We find that the value of c ranges from 110 km to 170 km over our domain. an ensemble machine learning method where many individual decision trees are fit to the training data and vote on an output value, with the average value taken as best estimate [Breiman, 2001]. 170 b 8-hr average 550 nm AODs on the 6x6 km 2 grid retrieved with the YAER v2 algorithm  c ECMWF ERA5 fields [Hersbach et. al., 2020] at 30x30 km 2 spatial resolution and hourly temporal resolution, interpolated bilinearly to the GOCI grid and averaged over 24 hours. d Estimated from temperature and dewpoint using the August-Roche-Magnus approximation [Alduchov and Eskridge, 1996]. Decision trees are fit recursively to the predictor variable. Suppose we have a collection of N data elements i ∈ [1, N], denoted & , each composed of m predictor variables ( & ∈ ℝ ' ), and a corresponding list of N labels yi that we would like to learn. In our case yi denotes the observed PM2.5 concentrations 180 from the surface networks averaged on the 6x6 km 2 grid, and N denotes the number of these observations. The algorithm works by splitting the data into left and right subsets L and R at an optimum split point determined from the predictor variables in & [Pedregosa et. al., 2011]. The optimum split point is defined as the one that minimizes the impurity G,

Random forest algorithm
where represents the fraction of data in the subset L and MSE represents the mean squared error of 185 each of the subsets, where C is the mean of the target labels within a given subset and n is the number of elements in that subset. From there the same algorithm is recursively applied to the left and right subsets L and R until the tree is grown. We follow the advice of Hastie et. al. [2009] and grow trees until the data are fully classified (each leaf contains only one value). 190 Due to the recursive training structure, decision trees are sensitive to the data on which they are trained, because a change in one split point changes the composition of all its child nodes. Individual decision trees thus have high error variance but no inherent bias. It follows that averaging many individual and uncorrelated trees should yield a low variance, low bias prediction. We construct 200 trees in parallel and reduce correlation between them through a bagging procedure: for each of the 200 195 decision trees in the RF, sample the input data with replacement to form a new dataset of the same dimensions and then grow a decision tree from this bootstrapped data [Breiman, 2001]. Because of the high input sensitivity, a wide variety of decorrelated trees are grown. The predictions of each individual tree are averaged to yield the prediction of the RF. We fit our RF using the RandomForestRegression class in the Python module Scikit-learn [Pedregosa et. al., 2011]. We attempted to further decorrelate 200 trees by following Breiman [2001] and calculating split points of each individual tree using only a random subset of the m predictor variables; however, a sensitivity test we performed showed only minor differences with the base case and therefore we follow Guerts et. al. [2006] in considering all predictor variables in the training process. We evaluate how the RF generalizes to predictions for the full 6x6 km 2 domain via a 10-fold 205 crossvalidation. For each fold of the crossvalidation, we leave out 10% of PM2.5 network sites (averaged on the 6x6 km 2 grid if needed) from each country. These 10% represent the test set; because we perform the validation ten times, each grid cell is in the test set exactly once. We compare predicted PM2.5 to withheld observed PM2.5 using four metrics: root mean square error (RMSE); the RMSE divided by mean observed PM2.5 (relative RMSE, or RRMSE); the coefficient of variation (R 2 ); and the mean bias 210 computed by averaging the difference between predicted and observed PM2.5 (MB). An outcome of interest is the ability of our predictions to capture exceedances of National Ambient Air Quality Standards (NAAQS). We categorize each prediction within the test sets into one of four classes: true positives (TP) where both predicted and observed PM2.5 exceed the NAAQS threshold; true negatives (TN) where neither exceed the threshold; false positives (FP) where an 215 exceedance is predicted but not observed; and false negatives (FN) where an exceedance is observed but not predicted [Brasseur and Jacob, 2017;Cusworth et. al., 2018]. We use these classes to compute three overall prediction grades. The first, percent of detection (POD), gives the fraction of observed exceedances that were successfully predicted: 220 The second, false alarm ratio (FAR), gives the fraction of predicted exceedances that did not occur: The third, equitable threat score (ETS), compares how well the prediction does relative to random 225 chance: where β is the number of true positives obtained by random chance, 230 ETS is 1 for perfect prediction skill and 0 for no better or worse than chance.
Predictor variable selection is an important task in implementing a RF, as the addition of non-235 informative variables can decrease performance. Unlike linear regression which can naturally ignore unhelpful predictors, irrelevant data can by chance aid in minimizing impurity G at some stage in the optimization process making all subsequent splits suboptimal. The six meteorological variables given in Table 1 are standard in AOD/PM2.5 prediction [e.g. Kloog et. al., 2014;Li et. al., 2017], while the four spatio-temporal variables (location dummies, latitude, year, and day of year) and the retrieval gapfilling 240 parameter proved to be informative in sensitivity tests. In addition to the predictor variables in Table  1, we considered as additional variables the population density, the GOCI fine mode fraction (FMF), and the GOCI multiple prognostic expected error (MPEE), but we found that they worsened accuracy of the fit and so we did not retain them. Because population density worsened the fit we did not include other spatially varying but temporally fixed land-use variables such as road data. We also compared 245 RFs trained on GOCI AOD and on GOCI-AHI fused AOD and found no significant difference in the fitting of PM2.5. We therefore use the GOCI AOD product because of its longer record.
3 Results and discussion Figure 3 shows scatterplots, color-coded by count, comparing surface observations of 24-h and annual 250 mean PM2.5 to the predicted values in grid cells whose records are entirely withheld from training in the crossvalidation procedure. Predicted values for the annual mean are obtained by averaging the 24-h predictions. Table 2 gives comprehensive statistics for East Asia and for each country. The 24-h predictions for East Asia have a negligible mean bias of 0.23 μg m -3 (annual, 0.22 μg m -3 ), though the RF underpredicts PM2.5 at the high tail of the distribution; we will return to that issue later in the context 255 of NAAQS exceedances. Root mean square error (RMSE) between observed and predicted 24-h PM2.5 is 8.8 μg m -3 s (annual, 3.3 μg m -3 ) corresponding to a relative RMSE (RRMSE) of 37% (annual, 14%), as defined in section 2.3. The prediction captures 89% of the observed 24-h variance (R 2 = 0.89) and 96% of annual (R 2 = 0.96). These results compare favorably to previously reported inferences of 24-h and annual PM2.5 at 1-10 km resolution from satellite data over China [Hu et. al., 2019;Xue et. al., 260 2019]. R 2 for annual mean PM2.5 in South Korea is relatively low (0.41), which can be explained by the weak dynamic range of observed annual PM2.5 in the country (Figure 1), as will be discussed later in this section. Our gap-filling strategy does not introduce bias for days without GOCI observations (and with AOD inferred instead from equation (1)), as the evaluation statistics for those days are similar to the whole population. 265

275
completely withheld from the RF training process in the crossvalidation procedure. Statistics shown are for root-mean-square error (RMSE), relative RMSE (RRMSE), coefficient of variation (R 2 ), and mean bias (MB).
One potential application of PM2.5 monitoring from space would be to diagnose exceedances of national ambient air quality standards (NAAQS) at locations without network sites. Table 3 shows the NAAQS for 24-h and annual PM2.5 for the three countries and the ability of the RF algorithm to 280 diagnose NAAQS exceedances in grid cells excluded from the training process in the crossvalidation procedure. 24-h exceedances correspond to the high tails of the distributions but annual exceedances are much more widespread. The POD column shows percent of true positives successfully detected, while the FAR shows the rate of false positives (defined in section 2.3). POD for 24-h exceedances ranges from 47%-78% by country (FAR: 16%-21%). PODs are higher for annual exceedances but that reflects 285 the higher observed frequency of these exceedances. The ETS values ranging from 0.43-0.63 indicate that the model captures exceedances with much better skill than random guessing.

295
f Equitable threat score (ETS) defined as the ability of the RF to predict exceedances beyond random chance.
The main difficulty for the RF algorithm to predict NAAQS exceedances is that many of those exceedances fall within the precision of individual predictions. This is illustrated in Figure 4 with the cumulative probability density function (pdf) of the 24-h and annual mean PM2.5 concentrations in 300 eastern China, South Korea, and Japan, representing the same withheld data from the crossvalidation as in Tables 2 and 3. The 24-h RRMSE of 26-32% depending on country ( Table 2) is shown as the grey envelope and is relatively flat across the distribution. Prediction of NAAQS exceedances within that uncertainty envelope is limited by the precision of the algorithm. All of the 24-h exceedances in Japan are within that envelope, as are most of the exceedances in eastern China and Korea. China has the 305 largest fraction of exceedances beyond the RRMSE of the RF algorithm and therefore the best prediction success. An additional though smaller cause of bias is that the RF algorithm underestimates the high tail of the pdf, as is apparent in Figure 4, which explains in particular why we achieve a better FAR than POD for 24-h PM2.5 in South Korea and Japan. Our worst NAAQS prediction performance is for annual PM2.5 in South Korea for the old 25 µg m -3 standard, because most of the distribution is 310 within the RRMSE envelope. Additionally, the already small dynamic range of observed annual PM2.5 (black dots) is underestimated by the RF (blue dots). These culminate in an RF estimate with good RMSE but low R 2 . Japan. Observations (black) are compared to RF predictions (colored) taken from the crossvalidation. The grey envelope represents the relative root mean square error (RRMSE) of the RF algorithm as given in Table 2 We experimented with several modifications to the RF algorithm to improve prediction of 320 NAAQS exceedances but with no success. These tests included training separate RFs for each of the three countries; training annual PM2.5 predictions on annual (rather than 24-h) PM2.5 data; directly predicting NAAQS exceedances by setting the learned label to be true if a day (year) is above the 24-h (annual) NAAQS for a given country; and applying different weights to the data so that the high tail is oversampled in the training process. None of these tests yielded significant improvements. Smoothing  325 of the tails in RFs is a well-recognized problem [Zhang and Lu, 2012]. Following Zhang and Lu [2012] we attempted to train RFs to predict and correct the residuals but found this to be ineffective. Part of this tail smoothing could also result from the underlying GOCI AOD land product, which has a negative bias (-0.02) for high AODs and a positive bias (+0.02) for low AODs .

PM2.5 temporal trends and spatial distributions
330 Figure 5 shows long-term trends of annual PM2.5 for each country, as measured by the PM2.5 network and as inferred from our RF algorithm for both areal and population-weighted means. We do not include RF PM2.5 for years before the networks became available (and hence when the RF could be trained) because of concern over extrapolation bias. The PM2.5 networks show decreasing trends in all three countries and these trends are consistent with the RF PM2.5 for both areal and population-weighted 335 means, demonstrating that the trends reported by the PM2.5 networks are representative of the countries. However, the PM2.5 networks in eastern China and South Korea underestimate the population-weighted means. Trends in South Korea and eastern China become flat between 2018 and 2019 (with a slight population-weighted increase in South Korea). This could possibly reflect interannual meteorological variability [Zhai et al., 2019;Koo et. al., 2020], but also an increase in oxidants producing secondary 340 aerosol [Huang et. al., 2021].  Figure 6 shows the changes in annual mean PM2.5 concentrations over South Korea between 2015 and 2019, as observed from the national network and as predicted by the RF. We focus on South Korea for discussion because it shows the advantages of satellite-based PM2.5 in a region that already has good coverage. Continuous mapping from the RF algorithm enabled by the GOCI AODs adds 355 enormous coverage to the sparse surface observations, including detection of PM2.5 hotspots missing from the network such as the Iksan region on the west coast in 2015 that was subsequently added to the network by 2019.  AOD and PM2.5 in East Asia tend to have opposite seasonalities driven by boundary layer depth and RH [Zhai et al., 2021]. Figure 8 compares predicted and observed monthly mean PM2.5 in the 380 Beijing, Seoul, and Tokyo metropolitan areas, with predictions coming from withheld data in the 10fold crossvalidation. Correspondence between modelled and observed PM2.5 may be tighter than the nationwide annual means plotted in Figure 5 because these urban areas are well-observed. We see that the RF algorithm fully captures the observed seasonality in PM2.5, although some observed monthly spikes are underestimated. The Figure illustrates the lack of trend in the Seoul Metropolitan Area over 385 2015-2019 but also shows that winter and summer PM2.5 in the region have opposite and roughly equal trends, with winter growing more polluted while summers become cleaner.

Urban-scale pollution events
We examine here the ability of the RF algorithm to capture the spatial and temporal variability of PM2.5 pollution events on urban scales. Figure 9 shows a predicted map of PM2.5 -produced by a RF trained on all the data, with observed PM2.5 overlaid -across the Seoul metropolitan area on May 395 24-29, 2016 corresponding to a severe pollution event sampled during the KORUS-AQ field campaign [Crawford et. al., 2021]. The dense PM2.5 network for Seoul shows large variability at the sub 6x6 km 2 scale that the AOD data and thus this RF PM2.5 product cannot resolve. However, the RF algorithm capture most of the variability in observed 24-h PM2.5 concentrations aggregated on the 6x6 km 2 grid (R 2 = 0.74). The RF also captures successfully the day-to-day variability during the event. 400

Regional air quality model evaluation
Regional air quality model predictions of PM2.5 are typically evaluated with observations from 415 surface network sites, but the spatially continuous RF PM2.5 fields offer more extensive coverage for model evaluation. We demonstrate this capability here with Community Multiscale Air Quality Modeling System (CMAQ version 4.7.1) simulations for the Korean peninsula including both South and North Korea at 9-km resolution Bae et al., 2021]. There are no surface PM2.5 data in North Korea to train the RF so we use the South Korea categorical variable to generate the RF PM2.5 420 fields there. The simulation for South Korea was conducted for 2015-2019 using emissions from the Clean Air Policy Support System (CAPSS) 2016  for South Korea and KORUSv5 [Woo et al., n.d] for outside South Korea. The simulation for North Korea was conducted for 2016 using emissions from the Comprehensive Regional Emissions Inventory for Atmospheric Transport 425 Experiment (CREATE) 2015 [Woo et al., 2020] and CAPSS 2013. To prepare the boundary conditions, a coarse domain at 27-km horizontal grid resolution covering Northeast Asia was used.

Conclusions
We used 2011-2019 geostationary aerosol optical depth (AOD) observations from the GOCI satellite instrument, in combination with a random forest (RF) machine learning algorithm trained on air quality network data, to produce a continuous 24-h PM2.5 data set for eastern China, South Korea, and Japan at 460 6x6 km 2 resolution. The resulting gap-free product complements the air quality networks that cover only 1% of 6x6 km 2 grid cells in eastern China, 7% in South Korea, and 8% in Japan. It provides a general dataset for PM2.5 mapping to serve regional pollution analysis, air quality monitoring, and public health applications. We trained the RF algorithm on gap-filled AODs from the GOCI instrument and a suite of 465 twelve meteorological, geographical, and temporal predictor variables. Gap filling of AODs was done by a weighted combination of nearest-neighbor data and chemical transport model fields, with the weight serving as an additional predictor variable. Testing of the RF algorithm by prediction of withheld network sites shows single-value precisions in each country of 26-32% for 24-h PM2.5 and 12% for annual mean PM2.5, with negligible mean bias. Accuracy statistics for PM2.5 inferred on grid 470 cells with no AOD retrieval (i.e. estimated using equation (1)) show similar accuracy statistics as the entire population, suggesting that the gap-filling procedure does not bias the results. The algorithm has only moderate success at predicting NAAQS exceedance events because most of these events are within the single-value precision, and also because of some smoothing of the extreme high tail of the PM2.5 frequency distribution. 475 We compared the continuous 24-h RF PM2.5 fields to spatial and temporal patterns observed at the national network sites. National trends of PM2.5 inferred from the RF and weighted by area or population are consistent with those observed at network sites (2015-2019 in eastern China and South Korea, 2011-2019 in Japan), confirming that the trends observed at these sites are representative. However, the network sites in eastern China and South Korea underestimate population exposure. The 480 RF PM2.5 fields over South Korea show PM2.5 hotspots missing in the early AirKorea network (2015) that are confirmed by subsequent addition of sites to the network (2019). The spatial distribution of RF PM2.5 trends in South Korea shows decreases everywhere except in the Seoul metropolitan area where trends are flat. We show with time series in the capital cities (Beijing, Seoul, Tokyo) that the RF successfully captures the seasonality of PM2.5 even though AOD and PM2.5 have different and often 485 opposite seasonalities. We examined the ability of the RF algorithm to map air quality on urban scales by analysis of two multi-day pollution episodes in Seoul and Beijing. The algorithm captures the day-to-day temporal variability observed by the surface networks as well the spatial variability on the 6x6 km 2 scale. The highest PM2.5 concentrations are underpredicted, which reflects the smoothing of the high tail of the 490 distribution.
The continuous spatial coverage of PM2.5 provided by the RF fields enables improved evaluation of the air quality models used in support of emission control policies. Comparison to a CMAQ simulation for South Korea in 2015-2019 reveals a large model underestimate in coastal environments undersampled by the AirKorea network. Comparison to a CMAQ simulation for North Korea in 2016, 495 where the RF provides the only PM2.5 data for model evaluation, shows drastically different patterns with the RF featuring high PM2.5 throughout North Korea. The RF results in North Korea could be affected by errors due to lack of training data but they appear consistent with the PM2.5 network observations at Chinese border sites. 500 Data availability 24-h 6x6 km 2 resolution daily PM2.5 derived from the RF are made freely available on DataVerse at https://doi.org/10.7910/DVN/0L3IP7.