Ozone formation sensitivity study using machine learning coupled with the reactivity of volatile organic compound species

The formation of ground-level ozone (O3) is dependent on both atmospheric chemical processes and meteorological factors. In this study, a random forest (RF) model coupled with the reactivity of volatile organic compound (VOC) species was used to investigate the O3 formation sensitivity in Beijing, China, from 2014 to 2016, and to evaluate the relative importance (RI) of chemical and meteorological factors to O3 formation. The results showed that the O3 prediction performance using concentrations of measured/initial VOC species (R2 = 0.82/0.81) was better than that using total VOC (TVOC) concentrations (R2 = 0.77). Meanwhile, the RIs of initial VOC species correlated well with their O3 formation potentials (OFPs), which indicates that the model results can be partially explained by the maximum incremental reactivity (MIR) method. O3 formation presented a negative response to nitrogen oxides (NOx) and relative humidity (RH), and a positive response to temperature (T), solar radiation (SR), and VOCs. The O3 isopleths calculated by the RF model were generally comparable with those calculated by the box model. O3 formation shifted from a VOC-limited regime to a transition regime from 2014 to 2016. This study demonstrates that the RF model coupled with the initial concentrations of VOC species could provide an accurate, flexible, and computationally efficient approach for O3 sensitivity analysis.


Introduction
Ground-level ozone (O3) pollution, which can cause adverse human health effects such as cardiovascular and respiratory diseases, has received increasing attention in recent decades (Cohen et al., 2017). Oxidation of volatile organic compounds (VOCs) produces peroxyl radicals (RO2) and hydroperoxyl radicals (HO2). RO2 and HO2 accelerate the conversion of NO to NO2, and O3 is subsequently formed by the photolysis of NO2 in the presence of O2. The production and loss of RO2 and HO2 are highly dependent on the concentration ratio of VOCs to NOx in the atmosphere. Hence, atmospheric O3 concentrations or production rates show a nonlinear relationship with VOCs and NOx. Moreover, the O3-VOC-NOx sensitivity is readily influenced by VOC species (Tan et al., 2018), meteorological parameters (H. Liu and Wang, 2020), and even atmospheric particulate matter (Li et al., 2019), and thus exhibits high temporal and spatial variability. Therefore, an accurate and highly efficient method is urgently needed for timely assessment of the sensitivity regime of O3 production and for evaluating the effectiveness of potential O3 pollution control measures. The sensitivity of O3 formation is usually analyzed using observed indicators, such as ozone production efficiency (OPE, O3/NOz) (Lin et al., 2011), HCHO/NOy (Martin et al., 2004), and H2O2/NOz (or H2O2/HNO3) (Sillman, 1995; Hammer et al., 2002); observation-based models (OBMs) (Vélez-Pereira et al., 2021); and chemical transport models, including the Community Multiscale Air Quality (CMAQ) model (Djalalova et al., 2015) and the Weather Research and Forecasting with Chemistry (WRF-Chem) model (P. Wang et al., 2020).
The observed indicators can be used to quickly diagnose the sensitivity regime of O3 production. However, their accuracy is sensitive to the precision of tracer measurements. OBMs combine in situ field observations, remote sensing measurements, and chemical box models, which are built on widely used chemistry mechanisms (e.g., MCM, Carbon Bond, RACM, or SAPRC) and applied to the observed atmospheric conditions to simulate the in situ O3 production rate (Mo et al., 2018). The sensitivity of O3 production to various O3 precursors, including NOx and VOCs, can be diagnosed based on the empirical kinetic modeling approach (EKMA) or quantitatively assessed with the relative incremental reactivity (RIR). Chemical transport models, which are driven by meteorological dynamics and incorporate pollutant emissions and complex atmospheric chemical mechanisms, provide a powerful tool for simulating various atmospheric processes, including the spatial distribution, regional transport vs. local formation, source apportionment, and production rates of pollutants (Sayeed et al., 2021). At present, OBMs are widely used to investigate O3 formation sensitivity in China. Previous studies indicated that O3 formation in urban areas of China is in a VOC-limited or a transition regime and varies with time and location (Ou et al., 2016; Zhan et al., 2021). Although both OBMs and chemical transport models can assess the sensitivity of O3 production and predict the O3 pollution level under a control scenario, their accuracy is affected by the uncertainty of input parameters (Tang et al., 2011; L. Yang et al., 2021). Thus, they are mostly applied to sampling cases with a short time span (days or weeks) (Xue et al., 2014; Ou et al., 2016).
Compared to traditional methods, machine learning (ML) is able to capture the main factors affecting atmospheric O3 formation in a timely manner with great flexibility (without the constraints of time and space) and high computational efficiency (Y. Wang et al., 2020b; Grange et al., 2021). Although attention should be paid to the robustness of ML because it depends on the input dataset (observations or outputs of chemical transport models), previous studies have demonstrated that cross-validation and data normalization can reduce the dependence of the model on input data and improve its robustness (Y. Wang et al., 2016; Liu et al., 2021; R. Ma et al., 2021). Thus, ML is a promising alternative for accounting for the effects of meteorology on air pollutants and has been used intensively in atmospheric studies (H. Hou et al., 2022).
Recently, ML models based on convolutional neural networks (CNNs), random forests (RFs), and artificial neural networks (ANNs) have been applied to simulate atmospheric O3 and have shown good performance in O3 prediction (Xing et al., 2020). For example, O3 concentrations in the Beijing-Tianjin-Hebei (BTH) region from 2010 to 2017 were simulated using an RF model that considered meteorological variables and output variables from chemical transport models, and the coefficient of determination (R2) between the observed and modeled O3 concentrations was greater than 0.8. Liu et al. (2021) also reported a high accuracy (80.4 %) for classifying pollution levels of O3 and fine particulate matter with aerodynamic diameters less than 2.5 µm (PM2.5) at 1464 monitoring sites in China using an RF model. Thus, the RF model has shown good performance in terms of prediction accuracy and computational efficiency (Y. Wang et al., 2016). Although ML is widely used to understand air pollution, many ML studies have used total VOCs (TVOCs) to simulate O3 formation and have rarely considered the effect of VOC species on O3 formation sensitivity (Feng et al., 2019; Liu et al., 2021; R. Ma et al., 2021). Thus, they were unable to identify the contribution of the chemical reactivity of individual species to O3 formation, which may lead to underestimation or even misunderstanding of the role of VOCs in O3 formation because the same concentration of TVOCs with different compositions may lead to different OPEs. In addition, VOCs react with OH radicals during atmospheric transport, which is the most important sink of VOCs (di Carlo et al., 2004). Makar et al. (1999) reported that isoprene emissions were underestimated by up to 40 % if OH oxidation is not considered.
Other studies indicated that the initial concentrations of VOCs, which account for the photochemical loss of VOCs during transport, were more representative of pollution levels in the sampling area than the observed VOCs (Yuan et al., 2013; Zhan et al., 2021). However, whether an ML model can identify the connection between the reactivity of VOC species and O3 formation sensitivity has not been clarified.
It should be noted that the physical interpretability of the results is an important question when ML models are applied in atmospheric studies (Hou et al., 2022). However, explanations of ML results (e.g., RI) are somewhat vague because ML is a "black-box" model from the point of view of chemical mechanisms (Hou et al., 2022; Taoufik et al., 2022). In this study, we used the RF model to evaluate the prediction performance of atmospheric O3 using the TVOCs, measured VOC species, and photochemical initial concentrations (PICs) of VOC species, which were calculated based on the photochemical-age approach (Shao et al., 2011). We compared the relative importance (RI) of the precursors (VOC species, NOx, PM2.5, CO) and the meteorological parameters (temperature, solar radiation, relative humidity, wind speed, and wind direction) for O3 formation in summer in Beijing from 2014 to 2016. We also discussed the possibility of connecting the RIs of VOCs with their O3 formation potentials (OFPs) and the changes in O3-VOC-NOx sensitivity based on the RF model from 2014 to 2016. Our study indicates that the RF model combined with initial concentrations of VOC species can simulate O3 concentrations well and provides a flexible and efficient tool for O3 modeling in near-real time.

Sampling site and data
The sampling site (40.04° N, 116.42° E) is located on the campus of the Chinese Research Academy of Environmental Sciences and was described in our previous work. Briefly, the station is located 2 km from the north 4th Ring Road and is surrounded by a mixed residential and commercial area. The concentrations of VOCs, NOx, CO, O3, and PM2.5 were measured at 8 m above ground level at this location. Meteorological parameters, including temperature (T), relative humidity (RH), wind speed and direction (WS&WD), and solar radiation (SR), were monitored at 15 m above ground level. VOCs were measured by an online commercial instrument (GC-866, Chromatotec, France), which consisted of two independent analyzers for detecting C2-C6 and C6-C12 hydrocarbon components. More details about the observations can be found in the Supplement (Sect. S1). The calculation of initial VOCs and sensitivity tests can be found in Sect. S2.
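The photochemical-age correction used for the initial VOCs is detailed in Sect. S2. Purely as an illustration, a commonly used form of this approach can be sketched as follows. The sketch assumes that OH exposure is inferred from the change in the ratio of two co-emitted hydrocarbons with different OH rate constants, and that each species decays by first-order reaction with OH. The function names, the tracer pair, and all numerical values are hypothetical placeholders and are not taken from this study.

```python
import math


def oh_exposure(ratio_initial, ratio_observed, k_a, k_b):
    """Estimate OH exposure E = [OH] * dt (molecule cm-3 s) from a tracer ratio A/B.

    Assumes A_t = A_0 * exp(-k_a * E) and B_t = B_0 * exp(-k_b * E), which gives
    E = (ln(A_0 / B_0) - ln(A_t / B_t)) / (k_a - k_b).
    """
    if k_a == k_b:
        raise ValueError("the two tracers must have different OH rate constants")
    return (math.log(ratio_initial) - math.log(ratio_observed)) / (k_a - k_b)


def initial_concentration(observed, k_oh, exposure):
    """Photochemical initial concentration: C_0 = C_t * exp(k_OH * E)."""
    return observed * math.exp(k_oh * exposure)


# Illustrative placeholder values (NOT from this study).
K_A = 1.9e-11  # cm3 molecule-1 s-1, faster-reacting tracer (assumed)
K_B = 7.0e-12  # cm3 molecule-1 s-1, slower-reacting tracer (assumed)

exposure = oh_exposure(ratio_initial=2.0, ratio_observed=1.2, k_a=K_A, k_b=K_B)
pic = initial_concentration(observed=1.5, k_oh=2.6e-11, exposure=exposure)
print(f"OH exposure = {exposure:.3e} molecule cm-3 s, initial conc. = {pic:.2f}")
```

Because the correction is exponential in the rate constant, highly reactive species are adjusted far more than slowly reacting ones, which is why observed and initial concentrations can rank species differently.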

Random forest model
The random forest (RF) is a type of ensemble decision tree method that can be used for classification and regression (Breiman, 2001). In this work, we performed the O3 and RI calculations using the RF method in MATLAB's Statistics and Machine Learning Toolbox. During the training process, the model creates a large number of decision trees from different sample sets and then averages the results of all decision trees as its final result (Breiman, 2001). To avoid over-fitting, we trained the RF model using cross-validation on the normalized data, which can improve the robustness of the model. Briefly, we randomly divided the normalized data into 12 subsets and then alternately took one subset as testing data and the rest as training data. In this way, every data point has an equal chance of being used for training and testing. The input data for 2014, 2015, and 2016 comprised 1190, 1062, and 872 rows, respectively, in which different types of VOCs, NOx, CO, PM2.5, and meteorological parameters (including temperature, relative humidity, solar radiation, and wind speed and direction) were used as input variables and O3 as the output variable. The mean values (± standard deviation) of the input/output parameters are shown in Table S1 in the Supplement. Approximately one-third of the samples are left out when each decision tree is built, and these out-of-bag (OOB) samples are used to calculate the OOB error. Hence, RF can evaluate the RI of variables via the changes in the OOB error (Svetnik et al., 2003):

$$\mathrm{RI}_i = \frac{1}{N} \sum_{n=1}^{N} \left( \mathrm{errOOB2}_n - \mathrm{errOOB1}_n \right)$$

where N represents the number of decision trees, and errOOB1 and errOOB2 represent the OOB error of feature i before and after randomly permuting the observations, respectively. RI_i is used to evaluate the importance and sensitivity of feature i to O3 formation in this study. More details about the workflow of the RF model and the hyperparameter tuning can be found in Sect. S3. The optimized parameters are shown in Table S2.
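The two bookkeeping steps described above, the 12-fold split and the averaging of per-tree OOB error changes, can be sketched in a few lines. This is a minimal illustration in Python rather than the MATLAB workflow used in the study; the function names are hypothetical, the per-tree errors are made-up numbers, and the RI is assumed to take the standard permutation form (mean over N trees of the OOB error after permutation minus the OOB error before).

```python
import random


def k_fold_splits(n_rows, k=12, seed=0):
    """Return k (train_idx, test_idx) pairs; every row is tested exactly once."""
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k random subsets of nearly equal size
    splits = []
    for i, test_idx in enumerate(folds):
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        splits.append((train_idx, test_idx))
    return splits


def relative_importance(err_oob_before, err_oob_after):
    """Mean increase in OOB error over all trees after permuting one feature."""
    if len(err_oob_before) != len(err_oob_after) or not err_oob_before:
        raise ValueError("need one before/after error pair per tree")
    n_trees = len(err_oob_before)
    return sum(a - b for b, a in zip(err_oob_before, err_oob_after)) / n_trees


# Example with the 2014 row count from the text and made-up per-tree errors.
splits = k_fold_splits(1190, k=12)
train_idx, test_idx = splits[0]
print(len(splits), len(train_idx), len(test_idx))
print(relative_importance([0.20, 0.25, 0.22], [0.35, 0.31, 0.30]))
```

A feature whose permutation barely changes the OOB error receives an RI near zero, while a feature the trees rely on heavily produces a large positive value.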
To verify the stability of the model, we performed a significance test on the model results. The results showed no significant difference among the different tests (P > 0.05, R2 > 0.98). To plot the O3 formation sensitivity curves, we constructed a virtual input matrix by varying the concentrations of NOx and VOCs from 0.9 to 1.1 times their mean values (in steps of 0.01) while keeping all other inputs fixed at their mean values. This new matrix was then used as testing data, while all the measured data were used as training data. Thus, the testing data represent the mean sensitivity regime of O3 in Beijing, while the training data cover all the sensitivity regimes of O3 formation, which guarantees sufficient coverage of the NOx-limited regime for the RF model simulations. The EKMA curves were plotted using the daily maximum 8 h average (MDA8) O3. More details can be found in the Supplement.
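The construction of the virtual input matrix can be illustrated with a short sketch. It assumes that NOx and the VOC inputs are scaled on a full two-dimensional grid (every NOx factor paired with every VOC factor) and that all VOC species share one scaling factor; the text does not state either detail, so both are assumptions. The variable names and mean values below are hypothetical.

```python
def scaling_factors(low=0.9, high=1.1, step=0.01):
    """Scaling factors from low to high (inclusive) in increments of step."""
    n_steps = round((high - low) / step)
    return [round(low + i * step, 10) for i in range(n_steps + 1)]


def virtual_input_matrix(mean_inputs, nox_key, voc_keys, factors):
    """Build one row per (NOx factor, VOC factor) pair; other inputs stay at their means."""
    rows = []
    for f_nox in factors:
        for f_voc in factors:
            row = dict(mean_inputs)  # copy so the means are not modified
            row[nox_key] = mean_inputs[nox_key] * f_nox
            for key in voc_keys:
                row[key] = mean_inputs[key] * f_voc
            rows.append(row)
    return rows


# Hypothetical mean values, for illustration only.
means = {"NOx": 30.0, "isoprene": 0.5, "propene": 1.2, "T": 24.0, "RH": 55.0}
factors = scaling_factors()
grid = virtual_input_matrix(means, "NOx", ["isoprene", "propene"], factors)
print(len(factors), len(grid))
```

Each row of the grid would then be passed to the trained model, and the predicted O3 values contoured against the NOx and VOC scaling factors to obtain an isopleth.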

Results and discussion

Overview of air pollutants and meteorological conditions
Figure 1 shows the time series of air pollutants and meteorological parameters during the observations from 2014 to 2016. In 2014, 2015, and 2016, the wind direction was dominated by northwest winds (Fig. S1 in the Supplement), with mean wind speeds of 3.1 ± 2.7, 2.3 ± 2.2, and 1.3 ± 1.2 m s−1, respectively, and the mean daytime temperatures were 22.3 ± 5.8, 23.9 ± 5.0, and 24.0 ± 4.4 °C, respectively. The average value of SR decreased from 162.9 to 150.8 W m−2 during the observation period. As shown in Fig. 1f-g, the concentrations of four types of VOCs (alkanes, alkenes, alkynes, and aromatics) differed significantly from 2014 to 2016 due to variations in emission sources. In addition to VOC species, the variations in other parameters, such as meteorological conditions and PM2.5, should have a complex influence on O3-VOC-NOx sensitivity (Li et al., 2019).

Prediction performance of the model
To build a robust model, we evaluated the prediction performance of the RF model for the ambient O3 simulation. The optimized parameters are shown in Table S2. Figure 2a-c show the time series of the measured and modeled O3 concentrations, which were simulated using the TVOCs, measured VOC species, and initial VOC species as part of the input variables along with the same set of other parameters. The coefficients of determination (R2) of the training data were 0.77, 0.82, and 0.81 for the TVOCs, measured VOC species, and initial VOC species, respectively. The corresponding root mean square errors (RMSEs) for the predicted O3 concentrations were 17.4, 12.6, and 13.9. Figure 2d-f show the prediction performance of the testing dataset under these three circumstances. When the TVOCs were split into measured or initial VOC species, the R2 increased markedly as the number of data features increased. Therefore, the VOC composition has a significant influence on O3 prediction using the RF model. In previous studies using TVOCs, the influence of VOC composition was neglected (R. Ma et al., 2021). Our results indicate that the RF model can accurately predict O3 concentrations when the concentrations of measured/initial VOC species are considered. It should be pointed out that if the training dataset does not have sufficient coverage in the NOx-limited regime, then the trained algorithm essentially attempts to extrapolate in that regime, which is prone to overtraining. To avoid such overtraining, we applied a 12-fold cross-validation: the observation data in each day were randomly divided into 12 subsets, and 1 subset was alternately taken as testing data with the rest as training data, which ensures that each data point has an equal chance of being trained and tested. The curves of the predicted O3 concentrations in Fig. 2 were spliced from the testing datasets in all runs. Thus, our results cover all the sensitivity regimes of O3 formation, which indicates that the model is robust.

Relative importance of major factors
Figure 3a shows the RIs of different ambient factors, including chemical and meteorological variables, for O3 formation. The RIs obtained using the TVOCs and the VOC species as inputs are also compared. Chemical factors (including VOC species, NOx, PM2.5, and CO) accounted for 79.1 % of the contribution to O3 production in the summer of 2016. Meanwhile, VOC species accounted for approximately 63.4 % of O3 production, while the RI obtained using TVOC concentrations accounted for only 2.1 %. A previous study analyzed the contribution of meteorological conditions and chemical factors to O3 formation on the North China Plain (NCP) using the CMAQ model in combination with process analysis and found that chemical factors dominate O3 formation in summer. Using probability theory, Ueno and Tsunematsu (2019) also found that VOCs/NOx dominate O3 production compared to meteorological variables. Thus, our results are similar to those of previous studies based on chemical models (Ueno and Tsunematsu, 2019), which demonstrates that the RF model can reflect the contribution of VOC species to O3 production even if the observed VOC species are used.
Here, we compared the RIs of VOCs calculated using the initial VOC species and the observed VOC species with the OFPs. The OFPs were calculated by the maximum incremental reactivity (MIR) method (Carter, 2010). As shown in Fig. S5, the RIs showed good correlations with the OFPs. Interestingly, using the initial concentrations of VOC species improved the correlation coefficients between the RIs and OFPs. Furthermore, we calculated the RIs and OFPs of different species using the data observed during the campaign in Daxing District in the summer of 2019 (Zhan et al., 2021), and a stronger correlation was observed between the RIs of the initial VOC species and the OFPs (Fig. S6). These results indicate that the RIs of the initial VOC species in the ML model should partially reflect the chemical reactivity of VOCs to produce O3 in the atmosphere.
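The comparison between RIs and OFPs can be illustrated with a small sketch. In the MIR method, the OFP of a species is its concentration multiplied by its MIR coefficient, and the agreement with the RIs can then be summarized by a Pearson correlation coefficient. The species list, concentrations, MIR coefficients, and RI values below are round-number placeholders chosen for illustration; they are neither Carter's tabulated coefficients nor results from this study, and the function names are hypothetical.

```python
import math


def ozone_formation_potential(conc, mir):
    """OFP_i = [VOC_i] * MIR_i for every species present in both mappings."""
    return {s: conc[s] * mir[s] for s in conc if s in mir}


def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Placeholder inputs (assumed): concentrations in ug m-3, MIR in g O3 per g VOC.
conc = {"isoprene": 1.0, "propene": 2.0, "benzene": 3.0, "propane": 6.0}
mir = {"isoprene": 10.0, "propene": 11.0, "benzene": 0.7, "propane": 0.5}
ri = {"isoprene": 0.9, "propene": 1.6, "benzene": 0.3, "propane": 0.2}

ofp = ozone_formation_potential(conc, mir)
species = sorted(ofp)
r = pearson_r([ri[s] for s in species], [ofp[s] for s in species])
print({s: round(ofp[s], 2) for s in species}, round(r, 3))
```

Running the same calculation once with observed and once with initial concentrations would show how the photochemical-loss correction changes the RI-OFP correlation.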
Although the RIs calculated using the initial VOC species slightly changed compared to those calculated using the observed VOCs (Table S3) [...] and VOCs will lead to a decrease in O3 concentration. Although O3 formation is highly related to the photolysis of NO2, a previous study demonstrated that it is VOC-limited in summer in Beijing (Zhan et al., 2021). This finding is consistent with the observed negative response of O3 to NOx in this work. High RH usually coincides with low surface O3 concentrations in field observations, which can be ascribed to the inhibition of O3 formation by the transfer of NO2/ONO2-containing products into the particle phase and the promotion of dry deposition of O3 on the surface (Kavassalis and Murphy, 2017; Yu, 2019). In addition, it has been shown that RH is negatively related to the rate constant of HONO formation (Hu et al., 2011). Thus, RH might also affect O3 formation by influencing atmospheric OH radicals from the photolysis of HONO. It should be noted that the negative response of ozone to RH might also have resulted from the dependence of RH on other parameters/conditions, such as SR. However, RH and SR were only weakly correlated (r < 0.1). We further tested whether the RIs of RH and SR changed when the counterpart was included or excluded as an input. The stable RI values (Table S4) indicate that RH and SR are independent of each other. The previous studies cited above can explain the observed negative response of O3 to RH in Fig. 3b-d. Previous studies have observed a positive correlation between the O3 concentration and T or SR (Steiner et al., 2010; Paraschiv et al., 2020; Li et al., 2021). Temperature can directly affect the chemical reaction rate of O3 formation (Fu et al., 2015), and SR can promote the photolysis of NO2 (Hu et al., 2017; Y. Wang et al., 2020a), thus accelerating O3 formation. As mentioned above, O3 formation is VOC-limited in Beijing; thus, a positive response of O3 concentration to VOCs is observed in Fig. 3b.
Interestingly, the RIs of isoprene showed an increasing trend from 2014 to 2016 because of the obvious reduction in anthropogenic VOCs (Fig. S7). In the context of global warming, studies should focus on the factors that affect O3 formation, including biogenic emissions, T, and SR. Thus, additional efforts will be required to reduce anthropogenic pollutants in the future.

Ozone formation sensitivity
To further analyze the sensitivity of O3 to VOCs and NOx from 2014 to 2016, we plotted sensitivity curves for O3 formation using the RF model; the results are shown in Fig. 4a-c. EKMA curves for 2015 were also obtained using the OBM (Fig. 4d). As shown in Fig. 4a-c, O3 formation was sensitive to VOCs in summer in Beijing during our observations, which is consistent with previous studies that used box models (Li et al., 2020b) and chemical transport models. This result is also consistent with the RIs of VOCs and NOx for O3 formation (Fig. 3b-d). Interestingly, the sensitivity of O3 formation to VOCs decreased, i.e., the observed point gradually shifted toward the transition regime from 2014 to 2016 (Fig. 4a-c), which is similar to the trend reported by Zhang et al. (2021). These phenomena can be ascribed to the increased relative importance of meteorological factors, such as T, SR, and RH, for O3 formation and the variation in anthropogenic VOC emissions (Steiner et al., 2010).
We compared the simulated MDA8 O3 calculated using the RF model and the OBM in 2015, as shown in Fig. S8. The mean relative error of the simulated MDA8 O3 between the RF model and the box model was 15.6 %. Hence, a combination of the RF model and initial VOC species can accurately depict the sensitivity regime of O3 formation, while the calculated RIs correlate well with the OFPs.
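For readers who wish to reproduce this kind of comparison, the two quantities involved can be sketched as follows. The sketch assumes that MDA8 is the maximum of the 8 h running means computed from 24 hourly values with all windows lying within the same day, and that the relative error is the absolute difference between the RF and box-model values divided by the box-model value. Neither definition is spelled out in the text, so both are assumptions, and the function names are hypothetical.

```python
def mda8(hourly):
    """Daily maximum 8 h average from 24 hourly values (windows within the day)."""
    if len(hourly) != 24:
        raise ValueError("expected 24 hourly values")
    return max(sum(hourly[start:start + 8]) / 8 for start in range(17))


def mean_relative_error(model, reference):
    """Mean of |model - reference| / reference over all days, in percent."""
    if len(model) != len(reference) or not model:
        raise ValueError("inputs must be non-empty and equally long")
    total = sum(abs(m - r) / r for m, r in zip(model, reference))
    return 100.0 * total / len(model)


# Made-up daily MDA8 values for two models (illustration only).
rf_mda8 = [82.0, 95.0, 110.0]
box_mda8 = [90.0, 100.0, 100.0]
print(round(mean_relative_error(rf_mda8, box_mda8), 1))
```

Swapping the reference to the RF model, or using a signed rather than absolute difference, would give a different number, so the definition should be stated alongside any reported value.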

Conclusions
In summary, this work investigated O3 formation sensitivity in summer from 2014 to 2016 in Beijing using the RF model coupled with the reactivity of VOC species. The results show that the O3 prediction performance of the RF model was significantly improved when measured/initial VOC species were considered instead of TVOCs. Furthermore, after the photochemical loss of VOC species during transport was corrected, the RIs of the VOC species were well correlated with the OFPs of VOC species calculated using the MIR method, indicating that the RIs in the ML model reflect the chemical reactivity of VOCs. Meanwhile, both NOx and highly reactive species (such as isoprene, propene, and benzene) played an important role in O3 formation. An increased contribution of temperature to O3 production was observed, which implies the importance of temperature to O3 pollution in the context of global warming. Both the RF model and the box model results showed that O3 formation was sensitive to VOCs in Beijing, although the sensitivity regime shifted from a VOC-limited regime to a transition regime from 2014 to 2016. Owing to the high computational efficiency of ML, the O3 formation sensitivity plotted by the RF model coupled with the reactivity of VOC species can provide an accurate, flexible, and efficient approach for analyzing O3 sensitivity in near-real time.
Author contributions. JZ designed the idea and wrote the manuscript; YL and HL provided useful advice and revised the manuscript; WM performed box model simulations; and XZ, XW, FB, YZ, and ZW conducted the campaign and compiled the data. All authors contributed to the discussion of the results and writing of the manuscript.
Competing interests. The contact author has declared that neither they nor their co-authors have any competing interests.
Disclaimer. Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Acknowledgements. This research was financially supported by the Ministry of Science and Technology of the People's Republic of China (grant no. 2019YFC0214701), the National Natural Science Foundation of China (grant nos. 41877306 and 92044301), and the programs from Beijing Municipal Science & Technology Commission (grant no. Z181100005418015). We thank Yizhen Chen for providing the meteorological parameter data for campaign studies.
Review statement. This paper was edited by Glenn Wolfe and reviewed by two anonymous referees.