Filling the gaps of in-situ hourly PM2.5 concentration data with the aid of empirical orthogonal function constrained by diurnal cycles

1Key Laboratory of Geographic Information Science (Ministry of Education), East China Normal University, Shanghai 200241, China 5 2School of Geographic Sciences, East China Normal University, Shanghai 200241, China 3State Key Laboratory of Severe Weather, Chinese Academy of Meteorological Sciences, Beijing 100081, China 4School of Atmospheric Physics, Nanjing University of Information Science & Technology, Nanjing, China 5Institute of Environment, Energy and Sustainability, The Chinese University of Hong Kong, Hong Kong, China 6Department of Civil, Environmental, and Construction Engineering, University of Central Florida, Orlando, FL 10 32816, USA

Abstract. Data gaps in surface air quality measurements significantly impairs the data quality and the exploration of these valuable data sources. In this study, a novel yet practical method called diurnal cycle constrained empirical orthogonal function (DCCEOF) was developed to fill in data gaps present in data records with evident temporal variability. The hourly PM2.5 concentration data retrieved from the national 55 ambient air quality monitoring network in China was used as a demonstration. The DCCEOF method aims at reconstructing the diurnal cycle of PM2.5 concentration from its discrete neighborhood field in space and time firstly and then predicting the missing values by calibrating the reconstructed diurnal cycle to the level of valid PM2.5 concentrations observed at adjacent times. The statistical results indicate a high frequency of data gaps in our retrieved hourly PM2.5 concentration record, with PM2.5 concentration 60 measured on about 40% of days suffering from data gaps. Further sensitivity analysis results reveal that data gaps in hourly PM2.5 concentration record may introduce significant bias to their daily averages, especially during clean episodes at which PM2.5 daily averages are observed to subject to larger uncertainties compared to the polluted days (even in the presence of same number of missingness). The cross-validation results indicate that our suggested DCCEOF method has a good prediction accuracy, 65 particularly in predicting daily peaks and/or minima that cannot be restored by conventional interpolation approaches, thus confirming the effectiveness of the consideration of local diurnal variation pattern in gap filling. By applying the DCCEOF method to the hourly PM2.5 concentration record measured in China during 2014 to 2019, the data completeness ratio was substantially improved while the frequency of days with gapped PM2.5 records reduced from 42.6% to 5.7%. In general, our suggested DCCEOF method 70 provides a practical yet effective approach to handle data gaps in time series of geophysical parameters with significant diurnal variability, and this method is also transferable to other data sets with similar barriers because of its self-consistent capability.
Deleted: Data gaps frequently emerge in…in surface air quality measurements significantly impairs the data quality and the exploration of these valuable data sourcesour retrieved in-situ hourly air quality data records… In this study, we propose … novel yet 130 practical gap filling …ethod called the …iurnal cycle constrained empirical orthogonal function (DCCEOF) was developed to fill in data gaps for the improvement of data completeness…resent in data records with evident temporal variability. The hourly PM2.5 concentration data retrieved from the China…national ambient air 135 quality monitoring network in China is …as used here …s a demonstration. The DCCEOF method aims at reconstructing the diurnal cycle of PM2.5 concentration from its discrete neighborhood field in space and time firstly and then predicting Generally, the DCCEOF method …he missing values by works in a principle of 140 calibrating the diurnal cycle of PM2.5 concentration …he reconstructed diurnal cycle that is reconstructed from discrete PM2.5 neighborhood fields in space and time …o the level of valid PM2.5 concentrations observed at adjacent times. Prior to gap filling, the data completeness and the impact of data gaps in hourly PM2.5 145 concentration record on daily averages were examined. …he statistical analysis …esults indicates…a high frequency of data gaps our retrieved hourly PM2.5 concentration record…ecord, with PM2.5 concentration measured on about 40% of days subject to…uffering from data gaps. On the other hand,…urther sensitivity analysis results 150 reveal that these ... [1]

Formatted: Subscript
Deleted: could …ay introduce significant bias to their daily averages of PM2.5 concentration… especially during clean episodes as at which PM2.5 daily averages are observed to subject to larger biases biases …ncertainties would be introduced to PM2.5 daily averages 155 during clean days than…ompared to the polluted days (even in the presence of same number of missingness). The cross-validation results indicate that the …ur suggested DCCEOF method …as a good …ood prediction accuracy,,…particularly in predicting daily peaks and/or minima that cannot be restored by the …onventional 160 spline …nterpolation approaches, given the consideration of local diurnal variation pattern of PM2.5 in our method… thus confirming the effectiveness of the consideration of local diurnal variation pattern in gap filling.

Introduction
A large variety of ground-based monitoring networks have been established worldwide to provide accurate measurements on various aspects of the atmospheric environment (Lolli and Di Girolamo, 2015).
Many of these in-situ measurements, however, suffer from data losses due to various unexpected reasons (e.g., instrumental malfunction, interruption of power supply, and internet outage), thus resulting in 180 salient data gaps in the archived data records. Undoubtedly, these gaps significantly impair the data qualities and the exploitation of these valuable data sources. Therefore, filling data gaps present in such datasets is essential to the further exploitation of these in-situ measurements.
Due to the high frequency and severity of haze pollution events, China started to establish the national ambient air quality monitoring network since 2012 by extending the range of the previous sparsely 185 distributed monitoring network to cover most major Chinese cities. To date, more than 1,600 statecontrolled monitoring stations are routinely operated to measure concentrations of six primary air pollutants (i.e., PM10, PM2.5, O3, NO2, SO2, CO) on an hourly basis (Guo et al., 2017;Li et al., 2017a).
These in-situ measurements are publicly released online via the China National Environment Monitoring Centre (CNEMC) in near real-time as of 2013 but without providing any direct data download interface. 190 Consequently, users oftentimes apply an automated software program (also known as a "web crawler") to retrieve these valuable data sources from the CNEMC website. Such an endeavour empowers us to acquire hourly air quality data more efficiently.
As a critical air quality indicator, PM2.5 mass concentration data have been widely used in many haze related studies Miao et al., 2018;Bai et al., 2019aBai et al., , 2019bZhang et al., 2019). 195 Nevertheless, how data gaps were treated in the data exploration process (e.g., data integration and data transformation), especially for those using daily or monthly averaged PM2.5 data set (e.g., Guo et al., 2009;Miao et al., 2018;Ye et al., 2018;Yang et al., 2019a), is oftentimes unclear. Since ignoring missing values would undoubtedly introduce biases into the final results (Bondon, 2005;Larose et al., 2019), some studies attempted to perform data analysis on a relatively long time scale by integrating 200 hourly records into monthly resolution so as to mitigate the impacts of data gaps (e.g., Bai et al., 2019b;(e.g., van Donkelaar et al., 2016;Li et al., 2017;Huang et al., 2018;Manning et al., 2018;Shen et al., 2018;Bai et al., 2019a;Zhang et al., 2019). Such a treatment on data gaps (e.g., ignoring missing values 230 or excluding records on days with missingness) would either introduce new bias to the aggregated data record or make the original PM2.5 time series temporally discontinuous, however.
Since a non-gap PM2.5 record is essential to PM2.5 related haze control and environmental health risk assessment, filling data gaps in hourly PM2.5 concentration record is thus of great value. Although there exists versatile gap filling methods in literature (e.g., Beckers and Rixen, 2003;Taylor et al., 2013;Chang 235 et al., 2015;Dray and Josse, 2015;Gerber et al., 2018), most of them fail to properly restore missingness present in data record with high temporal resolution (e.g., hourly) and evident diurnal variability (e.g.,

PM2
.5 concentration), in particular the daily extrema. For instance, PM2.5 concentrations vary significantly in space and time due to heterogeneous local emissions and atmospheric conditions (Guo et al., 2017;Lennartson et al., 2018;Shi et al., 2018). A similar barrier also applies for many other datasets which are 240 sampled at high temporal resolution. Therefore, a priori knowledge of the diurnal variation pattern of the analysed data is thus essential to the restoration of values for missingness.
In this study, a novel yet practical gap filling method called DCCEOF (that is, the diurnal cycle constrained empirical orthogonal function) was developed to better handle data gaps present in time series with marked variability in space and time, by taking the diurnal variation pattern as a critical constraint 245 in missing value prediction. To our knowledge, none of the existing gap filling methods have accounted for the diurnal variation pattern of the given data in their missing value restoration schemes, and hence the predicted values from such methods would subject to large bias. As an illustration, the hourly PM2.5 concentration record retrieved from CNEMC during the time period of 2014 to 2019 was applied to demonstrate the efficacy and accuracy of the suggested DCCEOF method. Science questions to be 250 answered by this study include: (1) how about the data completeness of the Chinese in situ PM2.5 record?
(2) how much uncertainties can be introduced to PM2.5 daily averages by missing values? (3) is it feasible to reconstruct the local diurnal variation pattern of PM2.5 from discrete observations in the neighborhood?
Since each method is initially proposed to deal with missingness in one specific data set, adopting one 295 method to another data set is often a challenge due to distinct features of missingness (e.g., missing at random versus missing not at random), in particular for data sets with significant spatiotemporal heterogeneity such as air pollutants time series (Junger and Ponce de Leon, 2015). PM2.5 concentration often exhibits evident diurnal variation patterns, which are primarily governed by local air pollutants emissions and regional meteorological conditions such as boundary layer height (Guo et al., 2017;Li et 300 al., 2017;Huang et al., 2018;Liu et al., 2018;Miao et al., 2018;Yang et al., 2018Yang et al., , 2019b. Consequently, conventional approaches like those listed in Table 1 may partially fail in accurately predicting missing values in hourly PM2.5 time series. In general, most available gap filling methods in Table 1 suffer from at least one of the following drawbacks: 1) partially fail for data sets with prominent gaps; 2) not self-consistent due to the requirement 305 of supplementary data sets;3) computationally intensive (e.g., neural networks), and, most critically; 4) unable to fairly predict daily peaks and/or minima due to the lack of essential prior knowledge of diurnal variation cycle of monitoring targets. Given the significant heterogeneity of PM2.5 concentration in space and time (Guo et al., 2017;Manning et al., 2018), ignoring the diurnal phases of PM2.5 would result in large bias to the gap filled PM2.5 data set.  (2015) Interpolation Linear regression, Spline, NAR, ARIMA, ARCH Stauch and Jarvis (2006)  to the target station while was set to 7 (temporal window size) in our case. The spatial and temporal window sizes used here are based on our recent results in which an optimal window size of 50 km and 3-355 day was found to attain a good autocorrelation of PM2.5 concentration in space and time, respectively (Bai et al., 2019c). To have adequate samples for the construction of , , , here we enlarged the both window sizes by simply doubling the values found in our previous study. These two window sizes would have little effect on the performance of the subsequent gap filling once they are large enough (at least greater than the identified optimal window sizes) to cover most similar observations nearby. This is because a 360 sorting scheme is further applied to the neighborhood field to pinpoint observations having similar diurnal variation pattern with the target station. In other words, the two window sizes used here is simply to include adequate samples while avoiding incorporating all available data for the subsequent data reconstruction, in particular those even distant away.
2) Construct a compact PM2.5 neighborhood field: Since the initial PM2.5 neighborhood field , , 365 might include many irrelevant observations with distinct diurnal variation patterns due to large spatial and temporal window sizes, a compact neighborhood field needs to be constructed by only retaining observations that are highly related to the target PM2.5 time series with respect to the diurnal variation pattern. Therefore, the covariance rather than correlation between the target time series and every candidate PM2.5 time series in , , wass first calculated (weighted by the number of valid data pairs 370 within 24-h). Subsequently, the candidate PM2.5 time series were sorted in terms of the magnitudes of covariances in a descending order. Finally, the first time series were retained to construct the optimized PM2.5 neighborhood field < by complying with the criterion that there were at least five valid observations at each specific time from 00:00 to 23:00. The aim was to avoid large bias in the subsequent diurnal cycle reconstruction using empirical orthogonal function (EOF), since large outliers might emerge 400 at times without any valid observation. Mathematically, the process to construct < can be formulated as follows: where C denotes the 24-h time series of candidate PM2.5 in , , and COV is the covariance function.

405
3) Reconstruct the local diurnal cycle of PM2.5: The diurnal cycle of PM2.5 at site on date (denoted as ) was then reconstructed from the optimized PM2.5 neighborhood field < using EOF in an iterative process similar to the DINEOF method (Beckers and Rixen, 2003). In our suggested DCCEOF method, the target PM2.5 time series at site on date (denoted as ) were also included to constrain the reconstruction of , and the whole field can be denoted as Q : In general, the EOF-based gap filling process can be outlined as follows: a) 20% of valid PM2.5 observations in Q were first held out for cross validation and then these data values were treated as gaps by replacing with nulls (i.e., missing value); b) given that a small amount of missing values would not significantly influence the leading EOF mode for the original data set, we assigned a first guess (here we 415 used the mean value of valid data on each specific date) to the data points where missing values were identified to initialize the EOF analysis; c) EOF analysis was performed on the previously generated background field (that is, Q with gaps are filled with daily mean and denoted as < R >) in a form of singular value decomposition (SVD) and then data values at value-missing points were replaced by the reconstructed values using the first EOF mode at the corresponding locations. These processes can be 420 expressed as: where < R > denotes the initial matrix in which the missing values were filled with daily means. U, S, and V are three matrices derived from SVD while 0 , 0 , and 0 denote the SVD components in the first EOF mode. C is the reconstructed matrix using the first EOF mode; e) iteratively decompose and reconstruct the matrix while updating data values at the value-missing points using the first EOF mode till the convergence was confirmed by the mean square error at each iteration; f) repeat the above iterative 435 processes by including the following EOF modes till the reach of the final convergence (i.e., mean square error started to increase when a new EOF mode was included). The diurnal cycle was finally derived by standardizing the identified leading EOF modes in the last round iteration. In short, our suggested DCCEOF method is a univariate and self-consistent gap filling method since no additional data record is required for the restoration of missing values. Rather, the method works relying primarily on the local diurnal cycle of PM2.5 that can be reconstructed from discrete PM2.5 neighborhood 445 fields in space and time. In contrast to conventional gap filling methods that work on a purely statistical basis (e.g., spline interpolation), the unique feature and novelty of the proposed DCCEOF method lies in the accounting for the local diurnal variation pattern of the input data in their missing value predictions, thus making the predicted values physically meaningful and with high accuracy.

535
Given the common application of daily-averaged PM2.5 concentration data in many studies, the possible impacts of data gaps on PM2.5 daily averages were thus assessed here to examine how well the estimated PM2.5 daily averages can be trusted in the presence of data gaps, especially during different pollution episodes. Toward such a goal, gap-free observations of hourly PM2.5 concentration within 24-h were first extracted. To make the computational workload manageable, we randomly sampled 1,000 days 540 observations rather than using observations from all gap-free days. Moreover, days with PM2.5 daily  percentiles, we may deduce that missing values would result in larger bias to PM2.5 daily averages on clean days than in polluted conditions given larger MRDs during clean scenarios (Figures 4c-d). This effect is in line with expectations since PM2.5 concentration often exhibits relatively larger diurnal 580 variations on cleaner days than during polluted episodes due to the possible boundary layer height effect (Li et al., 2017;Miao et al., 2018). Moreover, six missing values in 24-h observations would result in as large as approximately 5% of deviations (10% for 12 missing values) to PM2.5 daily averages during clean days (Figures 4c-d).
In addition to the total number of missing values, possible impacts of diurnal phases of missing values on 585 PM2.5 daily averages were also examined. It shows that different diurnal phases were observed for MRDs associated with missingness at different pollution levels ( Figure 5). Specifically, missing values in the afternoon and evening would be more likely to overestimate PM2.5 daily averages, whereas an opposite effect (underestimations) was observed for missingness in the morning and night. Moreover, the Deleted: result in overestimations to missingness in the afternoon during clean days has a larger potential to overestimate PM2.5 daily averages than during other times. This effect could be largely associated with the diurnal phases of PM2.5 as daily peaks are oftentimes observed in the early morning (Wang and Christopher, 2003), though such a diurnal variation pattern may differ by regions (Lennartson et al., 2018). Also, the diurnal phases of PM2.5 are 610 largely dominated by the diurnal variation of regional emissions and boundary layer processes (Guo et al., 2016;Lennartson et al., 2018;Miao et al., 2018;Yang et al., 2019b). In contrast, the diurnal phases of MRDs are not evident during polluted days. Above findings collectively suggest the need to fill data gaps in hourly PM2.5 observations, especially for those measured during clean days, since missing values would result in larger biases to PM2.5 daily averages than those during polluted episodes.

Performance of the DCCEOF method
Cross-validation experiments were first conducted at two monitoring stations to evaluate the efficacy of 635 the suggested DCCEOF method in reconstructing the diurnal cycle of PM2.5 from the discrete neighborhood field. Specifically, gap-free PM2.5 records on three distinct days with different pollution levels were first extract randomly at each station and then six valid observations in each 24-h record were held out. Subsequently, the DCCEOF method was applied to reconstruct the diurnal cycle of PM2.5 for each specific case. Figure 6 compares the reconstructed diurnal cycles of PM2.5 with their actual PM2.5 concentrations. The results indicate that the reconstructed diurnal cycles of PM2.5 have a good fit with their actual observations, thus confirming the effectiveness of the DCCEOF method in reconstructing the 650 diurnal variation pattern of PM2.5 from the discrete neighborhood field. In particular, the DCCEOF method also succeeded to restore the missing PM2.5 information even at the inflection times, e.g., the peak value in Figure 6c and the minimum value in Figure 6e, which are hardly to be recovered by statistical interpolation approaches. Nonetheless, compared with actual PM2.5 observations, the reconstructed diurnal cycle of PM2.5 is still unable to totally restore all types of local variations (e.g., PM2.5 observations 655 between 0700 and 1100 shown in Figure 6f). This is consistent with our initial understanding that PM2.5 concentrations vary substantially in space and time whereas the reconstructed diurnal cycle of PM2.5 is derived from a limited number of leading EOF modes and hence it only captures the dominant variation pattern of PM2.5 in the neighborhood field while some local variations could be thus ignored. In spite of this potential defect, the suggested DCCEOF method still exhibits promising accuracy in restoring the 660 local diurnal cycle of PM2.5 even from a discrete neighborhood field. Both the DCCEOF method and a spline interpolation approach were used to practically restore the retained PM2.5 observations. The comparison results shown in Figure 7 indicate higher accuracy of the 680 DCCEOF method than the spline interpolation approach in restoring those retained PM2.5 observations, especially for those at the inflection times at which spline interpolation failed to predict with good accuracy (e.g., peak values on August 3). However, both methods failed in predicting the minimum values on August 2. After checking the original data record, we found that the local variation of PM2.5 at this station differed largely from all neighboring stations at that time. For such situation, the proposed 685 DCCEOF method also fails to properly predict the missing values given large difference in diurnal variation patterns in space and time.   Moreover, the diurnal cycle reconstructed from the neighborhood field in space is more accurate than 810 using PM2.5 observations from near-term days, which is evidenced by smaller correlation values with limited neighboring stations. Such effect is also in line with our recent results when comparing the beneficial effects of spatial and temporal neighboring terms in advancing the gridded PM2.5 concentration mapping (Bai et al., 2019c). well as the reduction of gap frequency. After applying the DCCEOF method, the data completeness ratio of hourly PM2.5 concentration records in China has been improved by approximately 5% on average nationwide, with the overall data completeness ratio increasing from 89.2% to 94.3% (Figure 10a).
Despite the small magnitude of data completeness improvement ratio, the occurrence frequency of 830 missingness has been significantly reduced, with the averaged frequency of days with missingness declined from 42.6% to 5.7% nationwide (Figure 10b). In general, the gap-filled PM2.5 record is temporally more complete given fewer data gaps and this data set can thus be used as a promising data source for PM2.5-related studies in the future.

840
Compared with conventional interpolation approaches, the suggested DCCEOF method has better accuracy in predicting missing values for the emerged data gaps in hourly PM2.5 time series given the principle of accounting for the local diurnal variation pattern of PM2.5 concentration. Specifically, site specific diurnal cycle of PM2.5 was reconstructed from the discrete spatial and temporal neighborhood using EOF and was then used as a reference to predict missing values. Such a scheme enabled DCCEOF 845 to capture the local variation pattern of PM2.5 with good accuracy in regions with dense neighboring stations (e.g., eastern China) and less temporal dynamics of PM2.5. In contrast, relatively poor accuracy could be attained in the western part of the country where stations are sparsely distributed given the lack of adequate neighboring information. In such context, the performance of DCCEOF could be further 855 improved using a general diurnal variation pattern of PM2.5 that is firstly determined though a typical classification. However, this endeavor needs us to have a clear prior information of diurnal variability of PM2.5 in space and time. On the other hand, the diurnal variation pattern of other relevant factors that are highly related to PM2.5 variations, e.g., meteorological factors such as mixing layer height, might be also applied to advance the reconstruction of the local diurnal cycle of PM2.5.

860
Although the DCCEOF method has a promising accuracy in filling data gaps present in hourly PM2.5 concentration time series, the current method only works for days with at least several valid observations.
In other words, the DCCEOF method is incapable of restoring values for days with all 24-h data missed.
This is because the remnant valid observations within 24-h are used as a critical constraint not only to convolve with other neighboring observations in space and time to identify similar observations but also 865 to determine the data magnitude of predicted values for missingness. Moreover, the severity of data gaps in the initial neighborhood field is also associated with the final prediction accuracy because significant data gaps in the neighborhood field could introduce large bias to the reconstructed local diurnal variation pattern. In such context, aforementioned proxy information such as the diurnal variation pattern of meteorological conditions could be applied as a good complementary.

Conclusions
A novel yet practical gap filling method termed DCCEOF was introduced in this study to cope with annoying data gaps in geophysical time series, particularly for those with significant diurnal variability.
Compared with the conventional interpolation methods, the suggested DCCEOF method is selfconsistent, physically meaningful, and more accurate, given the accounting for the local diurnal variation 875 pattern of the monitoring factor in missing value restorations. Such an endeavor enables the DCCEOF method to predict missing values even at inflection times, like daily peaks or minima that conventional methods always fail to predict properly, with promising accuracy. In addition, we also assessed the severity of data gaps in our retrieved China in-situ hourly PM2.5 records.
In general, the missingness ratio was less than 10% over most stations across China while data gaps occurred more frequently at 06:00 and 12:00 BJT than during other times. After gap filling, the data completeness ratio of China in-situ hourly PM2.5 concentration record was improved to 94.3% while the 910 frequency of days with missingness was markedly reduced from 42.6% to 5.7%. The gap-filled hourly PM2.5 concentration record can thus be used as a promising data source for better PM2.5 concentration mapping and exposure assessment.
Overall, the suggested DCCEOF method provides a realistic and promising way to deal with missingness emerged in hourly PM2.5 concentration record which oftentimes exhibits significant diurnal variation 915 patterns. Given the self-consistent nature, the suggested DCCEOF method can be easily applied to PM2.5 datasets measured in other regions and other geophysical records with similar barriers. A more general comparison of this method with many other conventional gap filling methods will be conducted in the future to further examine the performance and accuracy of the DCCEOF method in handling various types of data gaps.