Assessing Measurements of Pollution in the Troposphere (MOPITT) carbon monoxide retrievals over urban versus non-urban regions

The Measurements of Pollution in the Troposphere (MOPITT) retrievals over urban regions have not been validated systematically, even though MOPITT observations are widely used to study CO over urban regions. Here we compare MOPITT products over urban and non-urban regions with aircraft measurements from the Deriving Information on Surface conditions from Column and Vertically Resolved Observations Relevant to Air Quality (DISCOVER-AQ – 2011–2014), Studies of Emissions and Atmospheric Composition, Clouds, and Climate Coupling by Regional Surveys (SEAC4RS – 2013), Air Chemistry Research In Asia (ARIAs – 2016), A-FORCE (2009, 2013), and Korea United States Air Quality (KORUS-AQ – 2016) campaigns. In general, MOPITT agrees reasonably well with the in situ profiles over both urban and non-urban regions. Version 8 multispectral product (V8J) biases vary from −0.7 % to 0.0 %, and version 8 thermal-infrared product (V8T) biases vary from 2.0 % to 3.5 %. The evaluation statistics of MOPITT V8J and V8T over non-urban regions are better than those over urban regions, with smaller biases and higher correlation coefficients. We find that the agreement of MOPITT V8J and V8T with aircraft measurements at high CO concentrations is not as good as that at low CO concentrations, although CO variability may tend to exaggerate retrieval biases in heavily polluted scenes. We test the sensitivity of the agreement between MOPITT and in situ profiles to the assumptions and data filters applied in the comparisons. The results at the surface layer are insensitive to the model-based profile extension (required due to aircraft altitude limitations), whereas the results at levels with limited aircraft observations (e.g., the 600 hPa layer) are more sensitive to the model-based profile extension. 
The results are insensitive to the maximum allowed time difference criterion for co-location (12, 6, 3, and 1 h) and are generally insensitive to the radius for co-location, except when the radius is small (25 km) and hence few MOPITT retrievals are included in the comparison. Daytime MOPITT products have smaller overall biases than nighttime MOPITT products when both daytime and nighttime retrievals are compared to the daytime aircraft observations. However, it would be premature to draw conclusions on the performance of MOPITT nighttime retrievals without nighttime aircraft observations. Applying signal-to-noise ratio (SNR) filters does not necessarily improve the overall agreement between MOPITT retrievals and in situ profiles, likely due to the reduced number of MOPITT retrievals for comparison. Comparisons of MOPITT retrievals and in situ profiles over complex urban or polluted regimes are inherently challenging due to spatial and temporal variabilities of CO within MOPITT retrieval pixels (i.e., footprints). With these sensitivity tests, we demonstrate that some of the errors are due to CO representativeness, but further quantification of representativeness errors due to CO variability within the MOPITT footprint will require future work.

W. Tang et al.: Assessing MOPITT carbon monoxide retrievals. Published by Copernicus Publications on behalf of the European Geosciences Union.


Introduction
Observations from the Measurements of Pollution in the Troposphere (MOPITT) instrument onboard the NASA Terra satellite have been used for retrieving total column amounts and volume mixing ratio (VMR) profiles of carbon monoxide (CO) using both thermal-infrared (TIR) and near-infrared (NIR) measurements since March 2000. Besides the TIR-only and NIR-only products, the MOPITT multispectral TIR-NIR product is also provided, which has enhanced sensitivity to near-surface CO (Deeter et al., 2013; Worden et al., 2010). Since the start of the mission, the MOPITT CO retrieval algorithm has been improved and enhanced continuously. For example, the Version 6 product improvements included the reduction in both a geolocation bias and a significant latitude-dependent retrieval bias in the upper troposphere. In the Version 7 products, a new strategy for radiance bias correction and an improved method for calibrating MOPITT's NIR radiances were included. For the most recently released MOPITT Version 8 products, enhancements include a new radiance bias correction method (Deeter et al., 2019). Meanwhile, the MOPITT products have been extensively evaluated and validated with in situ measurements, though this has been done primarily over non-urban areas (Deeter et al., 2012, 2013, 2016, 2019; Emmons et al., 2004, 2007, 2009). In addition, MOPITT products have also been compared with ground-based spectrometric column retrievals (e.g., Buchholz et al., 2017; Hedelius et al., 2019). 
For the past two decades, MOPITT CO products have been widely used for various applications, including understanding atmospheric composition, evaluating atmospheric chemistry models, and constraining inverse analyses of CO emissions (e.g., Arellano et al., 2004, 2006, 2007; Chen et al., 2009; Edwards et al., 2006; Fortems-Cheiney et al., 2011; Gaubert et al., 2016; Heald et al., 2004; Jiang et al., 2018; Kopacz et al., 2009, 2010; Kumar et al., 2012; Lamarque et al., 2012; Tang et al., 2018; Yurganov et al., 2005).
MOPITT products are particularly useful for monitoring and analyzing air pollution over urban regions because of the enhanced retrieval sensitivity to near-surface CO and the long-term record (e.g., Clerbaux et al., 2008; Girach and Nair, 2014; Jiang et al., 2015, 2018; Kar et al., 2010; Tang et al., 2019; Worden et al., 2010; Li and Liu, 2011; He et al., 2013; Aliyu and Botai, 2018; Kanakidou et al., 2011). However, the performance of MOPITT retrievals over urban regions has not yet been validated systematically. Furthermore, in situ observations of CO profiles over urban areas are limited, especially in Asia. Indeed, along with the non-urban validation exercises mentioned above, development and validation of the MOPITT retrieval algorithm rely heavily on in situ measurements over remote regions, such as measurements from the HIAPER (High-Performance Instrumented Airborne Platform for Environmental Research) Pole-to-Pole Observations (HIPPO) and the Atmospheric Tomography Mission (ATom) campaigns (e.g., Deeter et al., 2013, 2017, 2019). Comparisons of MOPITT products with aircraft profiles measured during the Korea United States Air Quality (KORUS-AQ) campaign over South Korea have only recently been made in Deeter et al. (2019), but without explicitly analyzing MOPITT performance over urban regions.
In this study, we compare MOPITT Version 8 and 7 products with aircraft profiles made over urban regions (as well as non-urban regions) from campaigns including Deriving Information on Surface conditions from Column and Vertically Resolved Observations Relevant to Air Quality (DISCOVER-AQ); the Studies of Emissions and Atmospheric Composition, Clouds, and Climate Coupling by Regional Surveys (SEAC4RS); the Air Chemistry Research In Asia (ARIAs); the Aerosol Radiative Forcing in East Asia (A-FORCE); and KORUS-AQ. These campaigns are described in Sect. 2, along with a brief description of the MOPITT products and the methodology used. We present the comparisons of MOPITT products to aircraft profiles and discuss the impacts of key factors in the retrieval process on the retrieval results in Sect. 3. In Sect. 4, we discuss the sensitivities of results to the assumptions and data filters made for aircraft-satellite comparisons not only in this study but also in previous evaluation studies of MOPITT and other satellite products. Section 5 gives the conclusions of the study.

MOPITT retrievals and products
MOPITT is a nadir-sounding satellite instrument flying on the NASA Terra satellite. It uses a gas filter correlation radiometer and measures radiance in both the TIR band near 4.7 µm and the NIR band near 2.3 µm. These observations have a spatial resolution of about 22 km × 22 km, with satellite overpass times of approximately 10:30 and 22:30 (local time).
To determine a unique CO concentration profile from the MOPITT measured radiances, an optimal estimation-based retrieval algorithm and a fast radiative transfer model are used (Deeter et al., 2003; Edwards et al., 1999). The retrieved state vector (x_rtv) for optimal estimation-based retrievals can be expressed as

$$ x_{\mathrm{rtv}} = x_{\mathrm{a}} + A\,(x_{\mathrm{true}} - x_{\mathrm{a}}) + \epsilon, \qquad (1) $$

where x_a and x_true are the a priori state vector and the true state vector, respectively. A (which has a size of 10 × 10) is the retrieval averaging kernel matrix (AK) that represents the sensitivity of retrieved profiles to actual profiles, and ε is the random error vector. Note that CO quantities in the state vector are retrieved as log10(VMR). We focus on validating the recently released Version 8 of the MOPITT TIR, NIR, and multispectral TIR-NIR products. We also include comparisons with the MOPITT Version 7 TIR, NIR, and multispectral TIR-NIR products in Sect. 3.1 for reference. These two versions of MOPITT products were introduced in detail in Deeter et al. (2017, 2019).

Aircraft measurements used for comparisons
Aircraft-sampled profiles of CO concentrations during the DISCOVER-AQ, SEAC4RS, ARIAs, A-FORCE, and KORUS-AQ campaigns are used for comparisons with MOPITT-retrieved profiles. DISCOVER-AQ and SEAC4RS were conducted over the US, while ARIAs, A-FORCE, and KORUS-AQ were conducted over East Asia. Locations of the aircraft profiles from these campaigns are compared with the MODIS (Moderate Resolution Imaging Spectroradiometer) Terra and Aqua Land Cover Type Climate Modeling Grid Yearly Level 3 Version 6 0.05° × 0.05° Global product (MCD12C1 v006) (Friedl and Sulla-Menashe, 2015) to determine if a profile was sampled over an urban or non-urban region. Specifically, for each aircraft profile, a 0.5° × 0.5° box centered over the location of the aircraft profile (determined by the averaged latitude and longitude of aircraft observations in the profile) is selected. If the urban and built-up fraction in the box is larger than 10 %, the profile is considered to be an urban profile. Overall, for each campaign, the averaged aircraft profile over urban regions has higher CO concentrations compared to that over non-urban regions, especially near the surface (see Fig. S1 in the Supplement). Profiles during ARIAs, which are sampled over Hebei province in China, are exceptional, as the averaged profile over non-urban regions has higher CO concentrations, especially near the surface, indicating high CO levels in the entire study region. We note that Hebei is one of the most heavily industrialized and polluted regions, and the difference in CO profiles is driven less by urban versus rural than by synoptic and mesoscale meteorology. In addition, Hebei is an arid region and subject to strong nocturnal inversions, so the surface CO can be very high. For aircraft profiles sampled during KORUS-AQ, the CO profiles over urban and non-urban regions are similar, even though the averaged profile over urban regions has slightly higher CO concentration near the surface. 
This is largely because many of the non-urban aircraft profiles are sampled over the Taehwa forest site, which is impacted by CO transported from the nearby Seoul urban region. The urban regions often have different surface parameters (e.g., surface temperature and emissivity) and usually, but not always, have higher CO concentrations than non-urban regions. However, the surface parameters are unlikely to impact the ultimate quality of MOPITT retrieval products (Pan et al., 1998; Ho et al., 2005). The goal of this study is to understand if MOPITT retrievals are able to represent conditions over urban regions given sampling and cloud cover. In addition, the relatively large spatial and temporal variability of CO concentrations over urban regions makes the validation even more complex. Because of the complexity of urban regions and their connection with non-urban regions nearby, we also provide analysis at high CO concentrations regardless of land cover type. Note that the comparisons include the 600 hPa layer (usually in the free troposphere). It is possible that CO at this layer is transported from other regions and is therefore not representative of urban regions. Even so, MOPITT retrievals at the 600 hPa layer are still impacted by the CO concentrations at other layers, including the surface layer (Eq. 1). Therefore, the comparisons at 600 hPa are necessary.
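The urban versus non-urban classification described above can be illustrated with a minimal Python sketch. It is not the code used in this study: the function name `classify_profile`, the input layout (a 2-D array of urban and built-up percentage on a regular latitude-longitude grid with 1-D arrays of cell-center coordinates), and the use of an unweighted mean over the cells in the box are all assumptions made for illustration.

```python
import numpy as np

def classify_profile(obs_lat, obs_lon, grid_lat, grid_lon, urban_percent,
                     box_deg=0.5, threshold_percent=10.0):
    """Classify one aircraft profile as urban or non-urban.

    obs_lat, obs_lon : coordinates of the aircraft observations in the profile
    grid_lat, grid_lon : 1-D cell-center coordinates of the land cover grid
    urban_percent : 2-D array (lat, lon) of urban and built-up percentage
    Returns (is_urban, mean urban percentage in the box).
    """
    # Profile location = averaged latitude and longitude of its observations.
    lat0 = float(np.mean(obs_lat))
    lon0 = float(np.mean(obs_lon))
    half = box_deg / 2.0
    # Select the grid cells whose centers fall inside the box (assumption:
    # cell centers, not cell edges, decide membership).
    rows = np.abs(np.asarray(grid_lat) - lat0) <= half
    cols = np.abs(np.asarray(grid_lon) - lon0) <= half
    box = np.asarray(urban_percent)[np.ix_(rows, cols)]
    # Assumption: unweighted mean over cells (no area weighting).
    mean_urban = float(box.mean())
    return mean_urban > threshold_percent, mean_urban

# Synthetic 0.05-degree grid with a small fully urban patch (made-up data).
grid_lat = 38.025 + 0.05 * np.arange(20)
grid_lon = -77.975 + 0.05 * np.arange(20)
urban_percent = np.zeros((20, 20))
urban_percent[5:9, 5:10] = 100.0

is_urban, mean_urban = classify_profile([38.4, 38.6], [-77.6, -77.4],
                                        grid_lat, grid_lon, urban_percent)
print(is_urban, mean_urban)
```

With a 0.05° grid, the 0.5° box covers roughly 10 × 10 cells, so a handful of fully urban cells is enough to exceed the 10 % threshold.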
The campaigns and profiles are summarized in Table 1 and Fig. 1. During DISCOVER-AQ, SEAC4RS, and KORUS-AQ, CO concentrations were measured by the NASA Differential Absorption Carbon monOxide Measurement (DACOM), whereas during ARIAs and A-FORCE, CO concentrations were measured by Picarro G2401-m and Aero-Laser GmbH AL5002, respectively. Note that the primary goal of DISCOVER-AQ was to provide aircraft observation methodologies for satellite validation (e.g., Lamsal et al., 2014). There are 121 profiles over four urban regions from DISCOVER-AQ, making it particularly useful for the goal of this study. Because of this, our results are heavily driven by aircraft profiles from DISCOVER-AQ. Even though there are only two profiles sampled over urban regions, the A-FORCE campaign obtained 45 profiles in total sampled over East Asia during spring 2009, winter 2013, and summer 2013. The seasonal and spatial coverage of the dataset makes it representative of the region. The ARIAs campaign provides 19 profiles, and three of these were sampled over Chinese urban regions. Few previous studies have validated MOPITT products over China (e.g., Hedelius et al., 2019), so aircraft profiles from ARIAs have also been included in this study.

Method for comparing MOPITT profiles to aircraft measurements
We generally follow the method that has been used in previous MOPITT evaluation and validation studies (Deeter et al., 2012, 2013, 2016, 2019; Emmons et al., 2004, 2007, 2009). There are four main steps in aircraft versus MOPITT comparisons.
1. Because of aircraft altitude limitations, in situ data from field campaigns do not typically reach the highest altitudes at which MOPITT radiances are sensitive. Therefore, to obtain a complete vertical profile as required for comparison with MOPITT retrievals, each in situ profile is extended vertically using the following steps: (i) the aircraft measurements are interpolated to the 35-level vertical grid used in MOPITT forward model calculations (0.2-1060 hPa); (ii) the levels from the surface to the lowest-altitude aircraft measurement are filled with the value of the in situ measurement at the lowest-altitude aircraft measurement; (iii) for levels above a certain pressure level P_interp (higher altitude), model or reanalysis data are used directly; (iv) for levels between the highest-altitude aircraft measurement and the altitude of P_interp, values are linearly interpolated. Unlike the previous MOPITT evaluation studies that used monthly model results from MOZART (Model for OZone And Related chemical Tracers) or CAM-chem (Community Atmosphere Model with chemistry) (Lamarque et al., 2012), here we use the 3-hourly Copernicus Atmosphere Monitoring Service (CAMS) reanalysis of CO produced by the European Centre for Medium-Range Weather Forecasts (ECMWF). The CAMS CO reanalysis has a horizontal resolution of 80 km × 80 km and 60 vertical levels (from the surface to 0.1 hPa). Satellite retrievals of atmospheric composition, including MOPITT TIR Version 6 total column CO retrievals, are assimilated in the CAMS reanalysis (Inness et al., 2019; https://confluence.ecmwf.int/pages/viewpage.action?pageId=83396018, last access: 18 March 2020). We note that as we do not compare with these higher levels later, the use of CAMS reanalysis is expected to have a minimal impact on the lower levels we use in the comparison (e.g., the surface layer, the 800 hPa layer, and the 600 hPa layer). 
The final CO profile at the 35-level vertical grid is then regridded onto a coarser 10-level grid (for consistency with the actual MOPITT retrieval grid) by taking an unweighted average of the fine-grid VMR values in the layers immediately above the corresponding levels in the retrieval grid. We investigate the sensitivity of the results to P_interp in Sect. 4.1.
2. For a given in situ profile, MOPITT profiles are considered co-located with the aircraft profile and are selected for comparison only if their center points are within the radius of 100 km and within 12 h of the acquisition of the aircraft profile. Sensitivities of the results to the radius and time criteria for co-location selection are further investigated in Sect. 4.2.
3. For each pair of co-located MOPITT-retrieved and in situ profiles, we apply the MOPITT a priori profile and averaging kernel to the in situ profile as in Eq. (1). Thus, after converting the in situ and a priori CO concentration profiles to log10(VMR) profiles (x_in situ and x_a), we calculate

$$ x_{\mathrm{transformed}} = x_{\mathrm{a}} + A\,(x_{\mathrm{in\,situ}} - x_{\mathrm{a}}), \qquad (2) $$

so that the log10(VMR)-based transformed in situ profile (x_transformed) has the same degree of smoothing and a priori dependence as the MOPITT-retrieved log10(VMR) profile (x_rtv).
4. For each in situ profile, there are likely to be multiple MOPITT retrievals that meet the above co-location criteria. If fewer than five MOPITT retrievals are co-located with an in situ profile, the in situ profile is not used in the following analysis. If an in situ profile is co-located with five or more MOPITT retrievals (assume the number to be N_retrieval), then the following steps are used in the comparison with MOPITT: (a) the averaging kernel and a priori profile of each co-located MOPITT retrieval are applied to the in situ profile (through Eq. 2) to obtain N_retrieval instances of x_transformed (note that applying these N_retrieval sets of MOPITT a priori profiles and averaging kernels to the same in situ profile results in differently transformed in situ profiles); (b) the N_retrieval instances of x_transformed are averaged in log10(VMR) space; and (c) the N_retrieval MOPITT retrievals x_rtv are also averaged.

Figure 2 shows an example of profile comparisons (the original aircraft profile, the aircraft profile extended with CAMS reanalysis data and regridded to the 35-level grid, x_in situ, x_a, x_transformed, and x_rtv) in VMR for an aircraft profile sampled on 22 July 2011 during DISCOVER-AQ in Maryland (MD). Figure 2 also demonstrates what to expect within a MOPITT retrieval pixel and vertical level. The MOPITT retrievals have a spatial resolution of about 22 km × 22 km, and each MOPITT retrieval level corresponds to a layer immediately above that level. The standard deviation of the original aircraft CO observations in each MOPITT layer, which is due to horizontal and vertical variability in CO, is also shown. Taking the 800 hPa layer as an example, the standard deviation of the original aircraft CO observations in the layer is 21.4 ppb, which is larger than the difference between x_transformed and x_rtv at that level (12.4 ppb). We also show the relative scale of the aircraft profile (3 km × 5 km) and a MOPITT pixel (22 km × 22 km) in Fig. 2. 
We expect the variability of CO within a MOPITT pixel to be even larger than the CO variability within the scale of 3 km × 5 km. The variability within a satellite pixel and the representativeness error in the satellite retrieval and aircraft profile comparisons make it challenging to compare satellite retrievals to aircraft observations. This is one of the major reasons that MOPITT has yet to be compared with in situ aircraft observations over urban regions. The representativeness error has been discussed in previous studies (Fishman et al., 2011; Follette-Cook et al., 2015; Judd et al., 2019). Follette-Cook et al. (2015) quantified spatial and temporal variability of column-integrated air pollutants, including CO, during DISCOVER-AQ MD from a modeling perspective (using the Weather Research and Forecasting model coupled with Chemistry, WRF-Chem). They found that during the July 2011 DISCOVER-AQ campaign, the mean CO difference at a distance of 20-24 km is ∼ 30 ppb (derived from the aircraft observations) and ∼ 40 ppb (derived from co-located WRF-Chem output), based on structure function analyses. In this study, we demonstrate this challenge with an example in Fig. 2. We also show a sensitivity analysis in Sect. 4 to provide perspectives on how the spatial and temporal representativeness may change the results. Further quantification of the variability within MOPITT pixels would be very challenging (partially due to limited coverage of the observational data), and we will elaborate more on this issue in Sect. 5.
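Steps 2 to 4 above (co-location, application of the a priori profile and averaging kernel, and averaging over the co-located retrievals) can be sketched in Python as follows. This is an illustrative sketch rather than the processing code of this study. The function names, the use of the haversine formula with a spherical Earth of radius 6371 km for the co-location distance, and the representation of time as hours on a common axis are assumptions.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # assumption: spherical Earth

def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine distance in km; inputs in degrees, arrays broadcast."""
    p1, p2 = np.radians(lat1), np.radians(lat2)
    dlam = np.radians(lon2) - np.radians(lon1)
    a = np.sin((p2 - p1) / 2) ** 2 + np.cos(p1) * np.cos(p2) * np.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

def colocated_mask(prof_lat, prof_lon, prof_time_h, ret_lat, ret_lon, ret_time_h,
                   radius_km=100.0, max_dt_h=12.0):
    """Step 2: retrievals within radius_km and max_dt_h of the profile."""
    dist = great_circle_km(prof_lat, prof_lon, np.asarray(ret_lat), np.asarray(ret_lon))
    dt = np.abs(np.asarray(ret_time_h) - prof_time_h)
    return (dist <= radius_km) & (dt <= max_dt_h)

def transform_in_situ(x_in_situ_vmr, x_a_log, ak):
    """Step 3, Eq. (2): x_transformed = x_a + A (x_in_situ - x_a) in log10(VMR)."""
    x_in_situ_log = np.log10(x_in_situ_vmr)
    return x_a_log + ak @ (x_in_situ_log - x_a_log)

def compare_profile(x_in_situ_vmr, x_a_logs, aks, x_rtv_logs, min_retrievals=5):
    """Step 4: average transformed in situ and retrieved profiles in log10 space.

    Returns None when fewer than min_retrievals retrievals are co-located.
    """
    if len(x_rtv_logs) < min_retrievals:
        return None
    transformed = np.array([transform_in_situ(x_in_situ_vmr, xa, ak)
                            for xa, ak in zip(x_a_logs, aks)])
    return transformed.mean(axis=0), np.asarray(x_rtv_logs).mean(axis=0)
```

Two limiting cases help to read Eq. (2): with an identity averaging kernel the transformed profile equals the in situ profile, and with a zero averaging kernel it collapses onto the a priori profile.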

MOPITT comparisons with aircraft profiles over urban and non-urban regions
In this section, the results for MOPITT comparisons with aircraft profiles are provided for only daytime retrievals (i.e., solar zenith angle < 80° in the retrieval) because (1) MOPITT retrievals generally contain more CO profile information in the daytime, which is reflected in AKs and degrees of freedom for signal (DFS) in Fig. 3, and (2) most aircraft profiles are sampled during the daytime. In Sect. 4.3, we discuss the sensitivity to the inclusion of MOPITT nighttime retrievals in MOPITT comparisons with aircraft profiles. In addition, many aircraft profiles, especially those from DISCOVER-AQ, lack observations above 600 hPa. Even though we extended the aircraft profiles vertically with reanalysis data (as discussed in Sect. 2.3), this still prevents the use of these profiles for validating MOPITT retrievals at upper levels against in situ observations. In this paper, we only focus on comparing MOPITT retrievals below the altitude of 600 hPa to aircraft profiles. Nevertheless, since the CO retrievals below 600 hPa are still weakly impacted by CO fields in the upper levels (as shown by the AKs in Fig. 3), in Sect. 4.1 we perform sensitivity tests on how augmenting the aircraft profiles with reanalysis fields affects the comparison results.
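The vertical extension of the aircraft profiles referred to above (steps i to iv of the comparison method) and the subsequent regridding to retrieval levels can be sketched in Python as follows. The sketch is illustrative only: the function names are hypothetical, interpolation is assumed to be linear in pressure, the model or reanalysis profile is assumed to be already sampled on the fine grid, and the layer boundaries used for the unweighted average (each layer runs from a retrieval level up to, but not including, the next level above) are an assumption.

```python
import numpy as np

def extend_profile(p_air, co_air, p_grid, co_model, p_interp):
    """Extend an aircraft CO profile onto a fine pressure grid (hPa).

    p_air, co_air : aircraft pressures and CO values
    p_grid        : fine vertical grid
    co_model      : model or reanalysis CO on p_grid
    p_interp      : model values are used directly where p_grid <= p_interp
    """
    p_air = np.asarray(p_air, float)
    co_air = np.asarray(co_air, float)
    order = np.argsort(p_air)            # ascending pressure: highest altitude first
    p_air, co_air = p_air[order], co_air[order]
    p_grid = np.asarray(p_grid, float)
    co_model = np.asarray(co_model, float)
    p_top = p_air[0]                     # highest-altitude aircraft sample

    # (i) interpolate to the fine grid; (ii) np.interp holds the end value,
    # so levels below the lowest-altitude sample take that sample's value.
    out = np.interp(p_grid, p_air, co_air)

    # (iv) linear bridge between the model value at p_interp and the
    # highest-altitude aircraft value.
    g = np.argsort(p_grid)
    co_at_p_interp = np.interp(p_interp, p_grid[g], co_model[g])
    gap = (p_grid > p_interp) & (p_grid < p_top)
    if gap.any():
        out[gap] = np.interp(p_grid[gap], [p_interp, p_top],
                             [co_at_p_interp, co_air[0]])

    # (iii) model or reanalysis values used directly above p_interp.
    use_model = p_grid <= p_interp
    out[use_model] = co_model[use_model]
    return out

def regrid_to_levels(p_fine, co_fine, p_levels):
    """Unweighted mean of fine-grid values in the layer above each level."""
    p_fine = np.asarray(p_fine, float)
    co_fine = np.asarray(co_fine, float)
    p_levels = np.sort(np.asarray(p_levels, float))[::-1]   # surface first
    tops = np.append(p_levels[1:], 0.0)                     # top layer is open-ended
    return np.array([co_fine[(p_fine <= bottom) & (p_fine > top)].mean()
                     for bottom, top in zip(p_levels, tops)])

# Made-up example on a coarse 10-level "fine" grid.
p_grid = np.array([1000, 900, 800, 700, 600, 500, 400, 300, 200, 100], float)
co_model = np.full(p_grid.size, 80.0)
extended = extend_profile([950, 850, 750], [200, 150, 120], p_grid, co_model, 500.0)
coarse = regrid_to_levels(p_grid, extended, [1000, 800, 600])
print(extended)
print(coarse)
```

In this example the aircraft samples stop at 750 hPa, so the values at 700 and 600 hPa come from the linear bridge and everything at 500 hPa and above comes from the model profile, which is why the 600 hPa layer is the one most exposed to the choice of P_interp.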

Overall statistics
The overall comparison results are presented in Table 2. Following Deeter et al. (2017), retrieval biases and standard deviation (SD) are calculated based on mean x_rtv and x_transformed for each in situ profile and converted from log10(VMR) to percent. The correlation coefficient (r) is quantified based on (x_rtv − x_a) and the corresponding (x_transformed − x_a) to avoid correlations which mainly result from the variability of the a priori. x_rtv, x_transformed, and x_a are in log10(VMR) space in order to apply the AKs, which are computed for x_rtv in log10(VMR). These comparisons for MOPITT Version 8 TIR-only (V8T) and Version 8 TIR-NIR (V8J) are shown in Figs. 4 (for all profiles) and 5 (for urban profiles). Overall biases for V8J products (averaged over all campaigns in Table 1) vary from −0.7 % to 0.0 %, which are lower than biases for V8T (from 2.0 % to 3.5 %).
Overall biases for V8J products are also smaller than biases for V7J (from −0.5 % to −5.4 %). For V8J and V7J, biases over urban regions vary from −0.8 % to −2 % and from −1.4 % to −8.9 %, respectively, which are generally larger than biases over non-urban regions (−0.3 % to 1.1 % and −3.3 % to 0.1 %). Correlation coefficients over non-urban regions are higher than those over urban regions for all six products (V7T, V8T, V7N, V8N, V7J, V8J) at all three levels in Table 2 (the surface layer, the 800 hPa layer, and the 600 hPa layer). We also notice that for TIR-NIR and TIR-only products, V8 products have higher correlation coefficients with in situ measurements than V7 products over non-urban regions, whereas over urban regions, V8 products have lower correlation coefficients than V7 products (except for the 600 hPa layer). Overall, MOPITT products (especially V8J) perform reasonably well over both urban and non-urban regions. Performance over non-urban regions is better than that over urban regions in terms of higher correlation coefficients and smaller biases for V8J and V7J.
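The statistics described in this subsection can be sketched in Python as follows. The sketch is illustrative only. In particular, the conversion from log10(VMR) differences to percent as 100 × (10^d − 1) is an assumption (the text states only that values are converted to percent), as are the function name and the use of the sample standard deviation.

```python
import numpy as np

def log10_to_percent(d):
    """Convert a log10(VMR) difference to percent (assumed convention)."""
    return 100.0 * (10.0 ** d - 1.0)

def validation_stats(x_rtv, x_transformed, x_a):
    """Bias (%), SD (%), and r at one level.

    Each input is a 1-D array in log10(VMR) with one element per in situ
    profile (already averaged over the co-located retrievals).
    """
    x_rtv = np.asarray(x_rtv, float)
    x_transformed = np.asarray(x_transformed, float)
    x_a = np.asarray(x_a, float)
    diff = x_rtv - x_transformed
    bias = log10_to_percent(diff.mean())
    sd = log10_to_percent(diff.std(ddof=1))
    # Correlate departures from the a priori so that shared a priori
    # variability does not inflate r.
    r = np.corrcoef(x_rtv - x_a, x_transformed - x_a)[0, 1]
    return bias, sd, r
```

For small differences, 100 × (10^d − 1) is close to 230 × d, so a mean log10(VMR) difference of 0.01 corresponds to a bias of roughly 2.3 %.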

Discussions of individual campaigns
We also evaluate MOPITT V8J retrievals during individual field campaigns with results in Fig. 6. The corresponding results for MOPITT V8T are summarized in Fig. S2. The patterns of biases are very similar for MOPITT V8J and V8T. Thus, in this subsection, we focus on V8J unless stated otherwise. Overall, except for comparisons with A-FORCE and ARIAs, biases over urban regions and non-urban regions do not differ significantly. Nor do biases determined for campaigns over the US and East Asia differ significantly. When compared to DISCOVER-AQ CA (California), MOPITT CO values are generally higher than in situ profiles at the 600 hPa layer (i.e., the 100 hPa uniform layer immediately above 600 hPa) but not at the surface layer (i.e., the uniform layer immediately above the surface). This is likely related to the fact that the DISCOVER-AQ CA aircraft profiles are mostly below 600 hPa, and hence CO values of these in situ profiles at 600 hPa and above are filled with CAMS reanalysis data. In addition, DISCOVER-AQ CA was conducted in the winter when boundary layer height is at lower altitudes, which could also explain the difference, in particular since most of the other campaigns are during times with greater vertical mixing. The lack of aircraft observations at 600 hPa and above also has a smaller impact on the biases at the 800 hPa layer through applying AK (see Fig. 3).
During the A-FORCE campaign, only 2 in situ profiles out of 45 were sampled over urban regions. The locations of the two profiles are close to each other, and they are both sampled on or near the coast of South Korea (Fig. 1). MOPITT has large negative biases (−30 % to −40 %) when compared to these two profiles. The averaged x_in situ, x_a, x_transformed, and x_rtv over non-urban regions during A-FORCE and the x_in situ, x_a, x_transformed, and x_rtv of the two profiles over urban regions are shown in Fig. S3. Compared to the averaged x_in situ over non-urban regions, the x_in situ for the two profiles over the urban regions have large enhancements near the surface and between 600 and 800 hPa. Even though the x_a and x_rtv for the two profiles have higher CO concentrations (∼ 400 ppb at the surface layer) than the averaged x_a and x_rtv (∼ 200 ppb at the surface layer), they are still lower than the x_transformed.
As for KORUS-AQ, MOPITT also has a negative bias (though smaller) when compared to the profiles over urban regions. Most of these KORUS-AQ profiles were located near the two profiles from A-FORCE but farther from the coast. The negative bias is not seen over non-urban regions during KORUS-AQ at the surface layer.
When compared to the in situ profiles from ARIAs, MOPITT has a large positive bias, especially over urban regions (20 %-30 %). During ARIAs, in situ profiles over urban regions have lower CO values (∼ 200 ppb at the surface layer) than those in situ profiles over non-urban regions (∼ 400 ppb at the surface layer; Fig. S4). We note there are only a small number of in situ profiles over urban regions in East Asia used in this study, compared to what is provided by DISCOVER-AQ in the US. The large negative biases against A-FORCE and large positive biases against ARIAs point to the need for more in situ observations over East Asia.

MOPITT comparisons with aircraft profiles at high CO concentrations
Figure. MOPITT V8J and V8T validation results over both urban and non-urban regions at the 600 hPa layer, the 800 hPa layer, and the surface layer in terms of Δlog10(VMR). Δlog10(VMR) is defined as x_rtv − x_a for MOPITT profiles and x_transformed − x_a for the in situ profiles. The use of Δlog10(VMR) allows us to remove the impact of the a priori in the comparisons. The variability of the MOPITT data used to calculate each of the plotted mean values is represented by the vertical error bars. The dashed lines are one-to-one lines.
Urban regions are often associated with high CO concentrations, but this is not always the case (e.g., Fig. S4). Here we separate the in situ profiles at the surface layer, the 800 hPa layer, and the 600 hPa layer into lower 50 % CO values and higher 50 % CO values based on CO values at each level to demonstrate the impact of CO concentrations on the MOPITT product validation (Fig. 7). For V8J, MOPITT has smaller biases at higher 50 % CO concentrations for all three levels, whereas for V8T, MOPITT has larger biases at the surface layer and the 600 hPa layer at higher 50 % CO concentrations. For the higher 50 % of measured mixing ratios, both V8J and V8T have larger SDs and lower correlation coefficients at the surface layer, the 800 hPa layer, and the 600 hPa layer, suggesting that the agreement between MOPITT and the in situ profiles at higher CO concentrations is not as good as that at lower CO concentrations. In contrast, Deeter et al. (2016) found that the retrieval biases do not visibly increase in the upper range of CO concentrations when compared to aircraft measurements over the Amazon Basin. The vertical error bars in Fig. 7 (which arise because multiple MOPITT profiles are co-located with each in situ profile) represent the variability (standard deviation) of the MOPITT data used to calculate each of the plotted mean values. For an in situ profile, the variability of the MOPITT data located within its radius of 100 km and within 12 h is larger when the in situ profile has higher CO values, as indicated by larger error bars at higher 50 % CO concentrations. At higher 50 % CO concentrations, the averaged retrieval uncertainties for the 600 hPa, 800 hPa, and surface layers are 28 %, 28 %, and 29 %, respectively. These are no larger than the averaged retrieval uncertainties at lower 50 % CO concentrations (28 %, 29 %, and 30 % for the 600 hPa, 800 hPa, and surface layers, respectively). We therefore conclude that the larger apparent biases at high CO concentrations are related to greater CO variability and representativeness error of the in situ profile within the co-location radius used for analyzing the MOPITT data rather than indicating larger retrieval uncertainties. Theoretically, MOPITT retrievals perform better with higher CO concentrations. The larger biases at high CO concentrations in Fig. 7 imply that the relatively greater CO variability may outweigh the benefit of high CO concentrations. Addressing representativeness error and spatial variability in the comparisons between satellite and in situ profiles is challenging and will be discussed further in Sect. 5. We will discuss the sensitivity of radius and time difference for the selection of co-located data in Sect. 4. The difference in the variability at different CO concentrations was not found in Deeter et al. (2016). This could be partially due to the fact that the aircraft profiles over the Amazon Basin used in Deeter et al. (2016) were sampled under more geographically homogeneous conditions, whereas the profiles used in this study are from different campaigns, and high CO concentrations over and near urban regions might be associated with more complex and inhomogeneous conditions.
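The split into lower and higher 50 % CO values can be illustrated with a minimal sketch. It assumes that paired in situ and MOPITT values at one pressure level are already available as arrays, and that bias is expressed as a relative difference in percent; the function name `split_stats` and the median-based split are illustrative choices, not the processing code used in this study.

```python
import numpy as np

def split_stats(insitu, mopitt):
    """Bias (%), SD (%), and correlation for the lower and higher 50 % of in situ CO.

    Assumption: `insitu` and `mopitt` are paired CO values (same units) at one level.
    """
    insitu = np.asarray(insitu, dtype=float)
    mopitt = np.asarray(mopitt, dtype=float)
    median = np.median(insitu)
    groups = {"lower": insitu <= median, "higher": insitu > median}

    stats = {}
    for name, mask in groups.items():
        # Relative difference of each MOPITT value from its in situ counterpart.
        rel_diff = 100.0 * (mopitt[mask] - insitu[mask]) / insitu[mask]
        stats[name] = {
            "n": int(mask.sum()),
            "bias_pct": float(rel_diff.mean()),
            "sd_pct": float(rel_diff.std(ddof=1)),
            "r": float(np.corrcoef(insitu[mask], mopitt[mask])[0, 1]),
        }
    return stats
```

Comparing `stats["higher"]` with `stats["lower"]` then gives the kind of contrast in bias, SD, and correlation coefficient discussed for Fig. 7.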

Sensitivities to assumptions made for aircraft-satellite comparisons
In Sect. 3, we compared profiles over urban and non-urban regions separately to MOPITT V8T, V8N, V8J, V7T, V7N, and V7J. In this section, we compare only the MOPITT V8J product to all the in situ profiles (both over urban and non-urban regions) described in Table 1 to test the sensitivity of results to the assumptions made during the comparisons.

Sensitivity to the in situ profile extension
Figure 6. Box plot (with medians represented by middle bars, interquartile ranges between 25th and 75th percentiles represented by boxes, and the most extreme data points not considered outliers represented by whiskers) for biases (%) for the profiles over both urban and non-urban regions (yellow), profiles over urban regions (green), and profiles over non-urban regions (red) at the 600 hPa layer (a), the 800 hPa layer (b), and the surface layer (c). An outlier is a value that is more than 1.5 times the interquartile range away from the top or bottom of the box.
As discussed in Sect. 2.3, the in situ profiles must be vertically extrapolated or extended to compare with MOPITT products due to aircraft altitude limits. Thus, model or reanalysis data must be merged with the in situ data to generate a complete CO profile for comparisons with MOPITT satellite retrievals. The use of model or reanalysis data may introduce uncertainties in the comparison results as they are not measured directly. The parameter P_interp controls the impact of the model-based profile extension on the shape and value of in situ profiles (see Fig. S5). Here we test the sensitivity of validation results to various P_interp values (100, 200, 300, 400, 500 hPa) to demonstrate the potential impact of the profile extension. Note that the model-based profile extension and the value of P_interp impact the validation results through changing the augmented observational profile, which is different from the other sensitivity tests in this study that change the selection of MOPITT data. The agreements between the values of MOPITT and in situ profiles at the surface layer are insensitive to the selection of P_interp (Fig. 8). The overall agreements between the values of MOPITT and in situ profiles at the 800 hPa layer are also not sensitive to P_interp, except for the results against DISCOVER-AQ CA, which have slightly larger biases when P_interp is 200 hPa or 100 hPa since the DISCOVER-AQ CA aircraft profiles at 600 hPa and above are mostly extended using reanalysis data. Therefore, the comparisons with DISCOVER-AQ CA are more likely to be affected by P_interp compared to other campaigns which typically obtained higher maximum aircraft altitudes. At the 600 hPa layer, the agreements between the values of MOPITT and in situ profiles are affected more by P_interp compared to those at the surface layer and the 800 hPa layer for comparisons with all the campaigns. The overall validation results using 100 hPa as P_interp have larger biases than using other values of P_interp.
At the 400 hPa and 200 hPa layers, the comparisons are even more sensitive to P_interp for all the campaigns (Fig. S6). The CAMS 3-hourly reanalysis data are constrained by observations, but their use may still introduce uncertainties in the validation results, especially at upper pressure levels (e.g., 200 and 400 hPa). Previous MOPITT evaluation results may be subject to larger uncertainties by using CAM-chem monthly CO fields that are not constrained by observations (e.g., Deeter et al., 2012, 2016).
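To make the role of P_interp concrete, the following is a minimal sketch of one way an aircraft profile could be merged with a model or reanalysis profile. It rests on assumptions that are not confirmed by this section: aircraft data are used up to the highest sampled level, values between that level and P_interp are linearly interpolated in pressure toward the model value at P_interp, and model values are used directly above P_interp. The function `extend_profile` and its arguments are hypothetical; the actual procedure is the one described in Sect. 2.3.

```python
import numpy as np

def extend_profile(p_grid, co_model, p_air, co_air, p_interp=200.0):
    """Merge an aircraft CO profile with a model profile on a pressure grid (hPa).

    Assumed scheme (illustrative only): aircraft data below the flight ceiling,
    linear blend between the ceiling and p_interp, model values above p_interp.
    """
    p_grid = np.asarray(p_grid, dtype=float)
    co_model = np.asarray(co_model, dtype=float)
    p_air = np.asarray(p_air, dtype=float)
    co_air = np.asarray(co_air, dtype=float)

    # Sort aircraft samples by ascending pressure; the first is the flight ceiling.
    order = np.argsort(p_air)
    p_air, co_air = p_air[order], co_air[order]
    p_top = p_air[0]

    # The model takes over above this pressure level (i.e., at lower pressures).
    p_switch = min(p_interp, p_top)
    grid_order = np.argsort(p_grid)
    model_at_switch = np.interp(p_switch, p_grid[grid_order], co_model[grid_order])

    if p_switch < p_top:
        # Add the model value at p_interp as an anchor for the linear blend.
        xp = np.concatenate(([p_switch], p_air))
        fp = np.concatenate(([model_at_switch], co_air))
    else:
        xp, fp = p_air, co_air

    # np.interp holds the end values constant, so the lowest aircraft sample
    # is extended down to the surface levels of the grid.
    merged = np.interp(p_grid, xp, fp)
    return np.where(p_grid < p_switch, co_model, merged)
```

Under this scheme, a smaller P_interp (e.g., 100 hPa) stretches the interpolated segment over more of the column, while a larger value (e.g., 500 hPa) hands the profile to the model sooner, which is one way to see why levels near or above the flight ceiling respond most strongly to this parameter.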

Sensitivity to the radius and allowed maximum time difference as criteria for co-location
Figure 8. Sensitivity to P_interp. Biases (%) using 100 hPa (blue), 200 hPa (gray), 300 hPa (yellow), 400 hPa (green), and 500 hPa (red) as P_interp at the 600 hPa layer (a), the 800 hPa layer (b), and the surface layer (c) are shown by box plot (with medians represented by middle bars, interquartile ranges between 25th and 75th percentiles represented by boxes, and the most extreme data points not considered outliers represented by whiskers). The biases are calculated against all (both urban and non-urban) in situ profiles listed in Table 1. The "200 hPa" values (gray bars) in this figure are the same as yellow bars (for all data) in Fig. 6. See the caption of Fig. 6 for the definition of outliers.
The criteria for co-location in this study (within a radius of 100 km and within 12 h of the acquisition of the aircraft profile) generally follow previous MOPITT validation studies (e.g., Deeter et al., 2016, 2019) and are chosen empirically. They are selected based on a trade-off between uncertainties generated from CO spatial and/or temporal variability and the number of included MOPITT retrievals, which affects the statistical robustness. Here we test the sensitivity of the results to the two criteria for co-location. The box plots of biases calculated with different radii (200, 100, 50, and 25 km) at the surface layer, the 800 hPa layer, and the 600 hPa layer are shown in Fig. 9. Overall, the biases calculated with a radius of 200, 100, and 50 km are similar, whereas the biases calculated with the radius of 25 km differ from the others. The comparisons of MOPITT to in situ profiles using the radius of 25 km generally have larger biases and SDs, due to including fewer MOPITT retrievals. In some cases, there are no matched MOPITT retrievals within the radius of 25 km of the aircraft profile (e.g., DISCOVER-AQ CA and ARIAs). In addition, representativeness errors would be expected to go up if there are only a few retrievals over a more polluted and perhaps heterogeneous area. We note that the use of the largest radius (200 km) in this paper does not appear to degrade the overall results, even though representativeness errors generated from CO spatial and/or temporal variability are expected to increase. However, the use of the smallest radius (25 km) degrades the overall results by reducing the number of included MOPITT retrievals. The box plots of biases calculated with four values of the allowed maximum time difference (12, 6, 3, and 1 h) are shown in Fig. 10. The overall results are not sensitive to the selection of the allowed maximum time difference. One exception is the comparisons to the SEAC4RS campaign at the 600 hPa layer, due to a smaller number of MOPITT retrievals in the shorter time window. We note that when comparing to the ARIAs campaign, using 1 h as the allowed maximum time difference decreases the biases at the surface layer, the 800 hPa layer, and the 600 hPa layer, compared to the cases using a longer allowed maximum time difference (i.e., 3, 6, and 12 h). This implies that the temporal variability is relatively large in the region. The improvement observed for ARIAs with the shortest time window also points to the possibility that short-term emission sources might be responsible for the large biases there. On the other hand, when the allowed maximum time difference equals 1 h, there are only six aircraft profiles that match MOPITT retrievals.
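The two co-location criteria (a radius and an allowed maximum time difference) amount to a simple selection mask over MOPITT retrievals. The sketch below assumes great-circle (haversine) distances on a spherical Earth and times expressed as decimal hours on a common clock; the function names and the Earth radius constant are illustrative assumptions, not the code used for this study.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, spherical approximation

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between points given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2.0) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2.0) ** 2)
    return 2.0 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

def colocation_mask(prof_lat, prof_lon, prof_time_h,
                    ret_lat, ret_lon, ret_time_h,
                    radius_km=100.0, max_dt_h=12.0):
    """Boolean mask of retrievals co-located with one aircraft profile.

    Times are decimal hours on a common clock (an assumption of this sketch).
    """
    ret_lat = np.asarray(ret_lat, dtype=float)
    ret_lon = np.asarray(ret_lon, dtype=float)
    ret_time_h = np.asarray(ret_time_h, dtype=float)
    dist = haversine_km(prof_lat, prof_lon, ret_lat, ret_lon)
    dt = np.abs(ret_time_h - prof_time_h)
    return (dist <= radius_km) & (dt <= max_dt_h)
```

Calling `colocation_mask` with `radius_km` set to 200, 100, 50, or 25 and `max_dt_h` set to 12, 6, 3, or 1 shows directly how quickly the number of matched retrievals shrinks as the criteria tighten.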

Sensitivity to the inclusion of MOPITT nighttime retrievals
Figure 9. Sensitivity to the radius as a criterion for co-location. Biases (%) using 200 km (blue), 100 km (gray), 50 km (green), and 25 km (pink) as the radius for co-location at the 600 hPa layer (a), the 800 hPa layer (b), and the surface layer (c) are shown by box plot (with medians represented by middle bars, interquartile ranges between 25th and 75th percentiles represented by boxes, and the most extreme data points not considered outliers represented by whiskers). The numbers in (c) correspond to the number of in situ profiles qualified for validation within the given radius. The biases are calculated against all (both urban and non-urban) in situ profiles listed in Table 1. The "100 km" values (gray bars) are the same as yellow bars (for all data) in Fig. 6. See the caption of Fig. 6 for the definition of outliers.
Previous MOPITT validation studies have only included MOPITT daytime observations. Over land, MOPITT retrievals for daytime and nighttime overpasses are characterized by significantly different averaging kernels (Fig. 3) and may be subject to different types of retrieval error (Deeter et al., 2007). CO has a long enough lifetime (approximately 1 month; Gamnitzer et al., 2006) in the free troposphere that nighttime observations could be potentially comparable, in general, to the daytime flights for remote sites. However, for urban regions where the spatiotemporal variability of the emissions and the evolution of the planetary boundary layer drive large changes in the measured CO, comparisons of MOPITT nighttime observations to aircraft profiles sampled during the daytime may introduce representativeness uncertainties, especially for areas that are subject to strong nocturnal inversions, where the surface CO can be enhanced. It is difficult to disentangle the effects of the MOPITT daytime or nighttime performance and the uncertainty from the temporal representativeness, based on the comparison of the MOPITT daytime or nighttime retrievals with daytime aircraft profiles. Therefore, we only include the results in Fig. S7 and briefly describe the results here without drawing any further conclusions. Overall, MOPITT nighttime retrievals have larger biases than daytime retrievals, which could be expected since most of the aircraft profiles are sampled during the daytime. Flight campaigns with nighttime observations are needed to validate MOPITT nighttime retrievals.

Sensitivity to the signal-to-noise ratio (SNR) filters
Figure 10. Sensitivity to the allowed maximum time difference as a criterion for co-location. Biases (%) using 12 h (gray), 6 h (blue), 3 h (green), and 1 h (pink) as the allowed maximum time difference for co-location at the 600 hPa layer (a), the 800 hPa layer (b), and the surface layer (c) are shown by box plot (with medians represented by middle bars, interquartile ranges between 25th and 75th percentiles represented by boxes, and the most extreme data points not considered outliers represented by whiskers). The numbers in (c) correspond to the number of in situ profiles qualified for validation within the given allowed maximum time difference. The biases are calculated against all (both urban and non-urban) in situ profiles listed in Table 1. The "12 h" values (gray bars) are the same as yellow bars (for all data) in Fig. 6. See the caption of Fig. 6 for the definition of outliers.
The MOPITT Level 3 data are generated from Level 2 data and are available as gridded (1° × 1°) daily mean and monthly mean files. Pixel filtering and SNR thresholds for Channel 5 and 6 average radiances are used when averaging Level 2 data into Level 3 data, and this increases overall mean DFS values (details can be found in the MOPITT Version 8 Product User's Guide, 2018). Taking the MOPITT V8J daytime product as an example, the Level 3 data product excludes all observations from Pixel 3 (one of the four elements of MOPITT's linear detector array that has highly variable Channel 7 SNR values) or observations where both the Channel 5 average radiance SNR < 1000 and the Channel 6 average radiance SNR < 400. In Fig. 11, we test the impact of applying the aforementioned SNR filters on the agreement between MOPITT and in situ profiles. Note that we are not suggesting comparisons between the MOPITT Level 3 product and aircraft measurements, because the MOPITT Level 3 product is gridded and represents the average value in a 1° × 1° grid cell. Comparing the grid-average value to an aircraft profile within it may be subject to large representativeness errors. Here we only show the sensitivity of the agreement between MOPITT Level 2 data and aircraft profiles to the application of SNR filters. We find that applying the SNR filters does not significantly change the overall agreement between MOPITT retrievals and the in situ profiles used in this study. This is mostly because applying the SNR filters reduces the number of MOPITT retrievals included in the comparisons. This effect is particularly important if there are not many MOPITT retrievals to begin with (such as our comparisons with in situ profiles in this study). Even though applying the SNR filters when generating Level 3 data does not significantly change the agreement with the in situ profiles used in this study, excluding low-SNR observations from the Level 3 cell-averaged values raises overall mean DFS values (MOPITT Algorithm Development Team, 2018). In addition, the Level 3 product typically is less affected by random retrieval errors (e.g., due to instrument noise or geophysical noise).
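The Level 3 style filtering rule for the V8J daytime product described in this subsection (drop Pixel 3, and drop observations where both the Channel 5 SNR is below 1000 and the Channel 6 SNR is below 400) can be written as a boolean mask over Level 2 retrievals. The sketch below is illustrative only: the function name `snr_keep_mask`, the argument names, and the 1-based pixel numbering are assumptions, and reading these quantities from the Level 2 files is not shown.

```python
import numpy as np

def snr_keep_mask(pixel, snr_ch5, snr_ch6,
                  ch5_min=1000.0, ch6_min=400.0, excluded_pixel=3):
    """Return True for retrievals that pass the pixel and SNR filters.

    Assumptions: `pixel` holds 1-based detector element numbers and the SNR
    arrays hold Channel 5 and Channel 6 average-radiance SNR values.
    """
    pixel = np.asarray(pixel)
    snr_ch5 = np.asarray(snr_ch5, dtype=float)
    snr_ch6 = np.asarray(snr_ch6, dtype=float)

    # A retrieval is rejected for low SNR only if BOTH channels fall below
    # their thresholds, following the rule quoted in the text.
    low_snr = (snr_ch5 < ch5_min) & (snr_ch6 < ch6_min)
    return (pixel != excluded_pixel) & ~low_snr
```

Comparing `mask.sum()` with `mask.size` for the retrievals matched to a given aircraft profile gives a quick view of how many retrievals the filters remove, which is the effect this subsection identifies as limiting any improvement in agreement.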

Discussion and conclusions
MOPITT products are widely used for monitoring and analyzing CO over urban regions. However, systematic validation against observations over urban regions has been lacking. In this study, we compared MOPITT products over urban regions to aircraft measurements from the DISCOVER-AQ, SEAC4RS, ARIAs, A-FORCE, and KORUS-AQ campaigns. The DISCOVER-AQ campaign was designed primarily with satellite validation in mind, and its deployments over MD, CA, Texas (TX), and Colorado (CO) together contribute 64.8 % (232 out of 358) of the aircraft profiles and 91.0 % (121 out of 133) of the aircraft profiles over the urban regions in this study (Table 1). Therefore, the DISCOVER-AQ campaign largely determines the results and the statistics in this study. We found that MOPITT mean biases are well within the 10 % required accuracy (Drummond and Mand, 1996) for both urban and non-urban regions (mean biases for V8J and V8T vary from −0.7 % to 0.0 % and from 2.0 % to 3.5 % for different levels). The performance over non-urban regions is better than that over urban regions in terms of correlation coefficients for the six products in Table 2 and biases of V8J and V7J. However, the in situ profiles over East Asia used in this study are limited, especially over urban regions (only 11 profiles). The large biases against aircraft profiles from the A-FORCE and ARIAs campaigns point to the need for more in situ observations over East Asia. We also studied the impact of CO concentrations on the agreement between MOPITT products and in situ profiles by dividing the aircraft profiles of CO into two groups of high CO (upper 50 %) and low CO (lower 50 %). We found that MOPITT retrievals at high CO concentrations have higher biases and lower correlations compared with low CO concentrations, although CO variability may tend to exaggerate retrieval biases in heavily polluted scenes.
Figure 11. Sensitivity to the signal-to-noise ratio (SNR) filters. Biases (%) for MOPITT retrievals without SNR filters (gray) and MOPITT retrievals with SNR filters (green) at the 600 hPa layer (a), the 800 hPa layer (b), and the surface layer (c) are shown by box plot (with medians represented by middle bars, interquartile ranges between 25th and 75th percentiles represented by boxes, and the most extreme data points not considered outliers represented by whiskers). The numbers in (c) correspond to the number of in situ profiles qualified for validation without or with SNR filters. The biases are calculated against all (both urban and non-urban) in situ profiles listed in Table 1. The "without SNR filter" values (gray bars) in this figure are the same as yellow bars (for all data) in Fig. 6. See the caption of Fig. 6 for the definition of outliers.
The statistics are often very similar between different versions and products over urban and non-urban regions, and in general, MOPITT agrees reasonably well with the in situ profiles in both cases. There is not, therefore, any reason to recommend the continued use of MOPITT versions earlier than V8 based on urban or non-urban region considerations. In general, MOPITT V8 is recommended (Deeter et al., 2019) as it uses a new parameterized radiance bias correction method to minimize retrieval biases and has updated spectroscopic data for water vapor and nitrogen.
In addition, the assumptions and data filters made during aircraft-satellite comparisons may impact the validation results. We tested the sensitivities of the results to assumptions and data filters, including the model-based extension to the in situ profile, the radius and allowed maximum time difference used as criteria for the selection of co-located data, the inclusion of nighttime MOPITT data, and the SNR filters. The agreements between the values of MOPITT and in situ profiles at the surface layer are insensitive to the model-based profile extension, whereas the results at upper levels (e.g., 400 and 200 hPa) are more sensitive to the profile extension, as there are very limited aircraft observations. The results are insensitive to the allowed maximum time difference as a co-location criterion and are generally insensitive to the radius for co-location except for the case with a radius of 25 km, where only a small number of MOPITT retrievals are included in the comparisons. Overall, daytime MOPITT products have smaller biases than nighttime MOPITT products. However, conclusions regarding the performance of MOPITT daytime and nighttime retrievals cannot be drawn because most of the aircraft profiles are sampled during the daytime. As we mentioned earlier, MOPITT daytime and nighttime retrievals may be subject to different retrieval errors. In addition, previous studies suggest pollutants themselves may have different characteristics during the daytime and nighttime (e.g., Yan et al., 2018). Therefore, validation of MOPITT nighttime retrievals, with a sufficient number of nighttime airborne profiles, is needed in order to study nighttime CO characteristics and trends. Applying SNR filters does not significantly change the overall agreement between MOPITT retrievals and the in situ profiles used in this study, and this may be partially caused by the smaller number of MOPITT retrievals in the comparisons after the SNR filters are applied.
We note that comparisons to ARIAs are exceptional in a few sensitivity tests due to a rather limited number of aircraft measurements. Given the large biases against aircraft profiles from the ARIAs campaign, more in situ observations over East Asia, especially China, are needed in order to validate MOPITT products in the region.
Validation and evaluation of satellite retrievals with aircraft observations are very challenging, and assumptions have to be made for the comparisons. As discussed in Sect. 2, the CO spatial variability within MOPITT retrieval pixels and the representativeness error of aircraft profiles when compared to MOPITT retrievals may introduce uncertainties in the validation results. This issue is difficult to address and quantify due to the limited spatial coverage of dense aircraft observations. One possible way is to study NO2 data retrieved from the Geostationary Trace Gas and Aerosol Sensor Optimization (GeoTASO) at very high resolution (250 m × 250 m), to provide an upper estimate on CO variability. Moreover, the variability of Tropospheric Monitoring Instrument (TROPOMI) CO retrievals (resolution: 7 km × 7 km; Landgraf et al., 2016) might also provide information on MOPITT sub-pixel variability. Further research on trace gas spatial variability within satellite retrieval pixels and quantification of the representativeness error incurred by comparing individual aircraft profiles to satellite products are needed and will be the subject of a follow-up study.
Author contributions. WT, HMW, and MND designed the study. WT analyzed the data with help from MND, SMA, and LKE. GSD provided CO measurements during DISCOVER-AQ, SEAC4RS, and KORUS-AQ. RRD, XR, and HH provided CO measurements during ARIAs. YK provided CO measurements during A-FORCE. HMW, MND, DPE, LKE, BG, RRB, and XR offered valuable discussions and comments in improving the study. WT prepared the paper with improvements from all the other coauthors.
Competing interests. The authors declare that they have no conflict of interest.
Acknowledgements. This material is based upon work supported by the National Center for Atmospheric Research, which is a major facility sponsored by the National Science Foundation under Cooperative Agreement No. 1852977. Wenfu Tang is supported by an NCAR Advanced Study Program Postdoctoral Fellowship. The NCAR MOPITT project is supported by the National Aeronautics and Space Administration (NASA) Earth Observing System (EOS) Program. The authors thank the DISCOVER-AQ, SEAC4RS, ARIAs, A-FORCE, and KORUS-AQ Science Teams for the valuable in situ observations. We thank Naga Oshima and Makoto Koike for the A-FORCE data. ARIAs was supported by NSF (grant no. 1558259) and the National Institute of Standards and Technology (NIST, grant no. 70NANB14H332). The authors thank Frank Flocke for helpful comments on the paper. Wenfu Tang thanks Cenlin He for helpful discussions.
Financial support. This material is based upon work supported by the National Center for Atmospheric Research, which is a major facility sponsored by the National Science Foundation under Cooperative Agreement no. 1852977. Wenfu Tang is supported by an NCAR Advanced Study Program Postdoctoral Fellowship. The NCAR MOPITT project is supported by the National Aeronautics and Space Administration (NASA) Earth Observing System (EOS) Program. ARIAs was supported by NSF (grant no. 1558259) and the National Institute of Standards and Technology (NIST, grant no. 70NANB14H332).
Review statement. This paper was edited by Andre Butz and reviewed by three anonymous referees.