Interactive comment on “Evaluation of MOPITT version 7 joint TIR-NIR XCO retrievals with TCCON”

The manuscript entitled ’Evaluation of MOPITT version 7 joint TIR-NIR XCO retrievals with TCCON’ by Hedelius et al. presents results from an intercomparison between MOPITT TIR-NIR data products and CO measurements from the ground-based TCCON network. Covering various technical aspects of intercomparison strategies between the two remote sensing instruments, the manuscript fits well into the scope of AMT. Apart from this detailed validation, it is demonstrated how the different methods of bias correction affect CO emission estimates based on assimilation of MOPITT observations. The paper is well written and the methods are clearly described, with only minor technical and typographical errors. I recommend publication in AMT after minor revisions based on my comments below.


Introduction
Carbon monoxide (CO) is an important atmospheric trace gas. It is a tracer of pollution and atmospheric transport and plays an important role in the atmospheric hydroxyl (OH) budget. About 2800 Tg CO yr^-1 is emitted globally, with about 45 % of the emissions coming from oxidation of volatile organic compounds (VOCs, predominately methane and isoprene), about 25 % from biomass burning, 25 % from fossil-fuel and domestic-fuel burning, and the rest from vegetation, oceans, and geological activity (Seinfeld and Pandis, 2006). It acts as an indirect greenhouse gas (GHG) both as a minor source of CO2 and by affecting OH concentrations, which in turn affects the lifetime of methane. Its 100-year global warming potential per mass is 1.9 (Forster et al., 2007). The ultimate fate for 90 % of CO is oxidation by OH to form carbon dioxide and HO2. CO has an average global lifetime of about 1-3 months, with a shorter lifetime in the tropics and a longer lifetime in the Southern Hemisphere extratropics (Lelieveld et al., 2016). The moderate lifetime of CO makes it a good tracer for both emissions and transport of pollution.
The Measurements Of Pollution In The Troposphere (MOPITT) is a Canadian instrument aboard the Terra Earth-observing satellite, launched in December 1999. Drummond et al. (2010) describe the instrument in more detail, but briefly, it is a gas correlation radiometer with near-infrared (NIR) and thermal-infrared (TIR) channels. The primary MOPITT mission goal is to quantify CO in the Earth's atmosphere. Space-based observations of CO can provide greater spatial coverage than a few surface observations. However, space-based observations that rely on reflected (e.g., NIR) sunlight can be influenced by surface properties, airglow, and clouds and are more strongly affected by aerosol scattering than solar-viewing instruments. For MOPITT the TIR sensitivity depends on the strength of the temperature contrast between the surface and atmosphere, which is variable across the globe. Due to the physical limitations of passive Earth nadir-viewing remote sensing, satellite instruments often have lower information content per observation than ground-based instruments (e.g., Deeter et al., 2015), especially compared to the Total Carbon Column Observing Network (TCCON), which measures atmospheric absorption of the Sun's radiance. Ground-based spectrometers often have higher spectral resolution and/or coverage as well as temporal resolution at an individual location. These differences between observing systems make intercomparisons useful in checking for and reducing biases.
While MOPITT data are the longest satellite record of total column CO, there are several other satellite instruments that measure column CO, and we mention a few of them here. SCIAMACHY (Scanning Imaging Absorption Spectrometer for Atmospheric CHartographY) aboard Envisat (Environmental Satellite) launched in March 2002 was first compared with ground-based observations in 2005 (Sussmann and Buchwitz, 2005) and later compared with the larger TCCON and found to be biased about 10 ppb lower (Hochstaffl et al., 2018). TROPOMI (TROPOspheric Monitoring Instrument) aboard the Sentinel-5 Precursor was launched in October 2017 and was found to be biased 6 ppb higher than TCCON, with the difference depending on location. GOSAT-2 (Greenhouse gases Observing SATellite-2) was recently launched in October 2018, and TCCON will be used for its validation.
Most intercomparisons with MOPITT have used aircraft data (e.g., Deeter et al., 2014, 2017, 2019). The first systematic validation of MOPITT CO with ground-based column measurements was by Buchholz et al. (2017), who used the Network for the Detection of Atmospheric Composition Change (NDACC) mid-infrared retrievals. There have been some studies to compare observations from MOPITT with data from a few (three to six) TCCON sites (e.g., Mu et al., 2011; Té et al., 2016), but this is the first to use observations from all the sites in an intercomparison with MOPITT. Continual comparisons of MOPITT observations with other systems ensure data quality and can be used to determine areas of improvement. This intercomparison exercise uses the MOPITT Version 7 joint (V7J) product and ground-based NIR observations from the TCCON.
The rest of this paper is summarized as follows: Sect. 2 describes the different instruments, systems, and datasets used in this study. Section 3 describes our effort to derive filters for MOPITT data and to improve the single-sounding accuracy and precision using bias corrections. Section 4 describes the MOPITT and TCCON comparisons, including sensitivity tests and a comparison of averaging kernels and information content. Section 5 describes data assimilation tests where the GEOS-Chem model is used to estimate how filtering and bias correcting MOPITT data affects global fluxes. Finally we conclude in Sect. 6 with a summary of practical considerations in this study along with suggestions on how MOPITT retrievals might be improved in future iterations, and we summarize our work in Sect. 7.

MOPITT
The MOPITT instrument aboard the Terra satellite launched in December 1999 has been described elsewhere. Briefly, it is a gas-filter correlation radiometer with eight optical channels, of which three have been used since August 2001 for CO observations, two in the TIR band (channels no. 5 and no. 7; 4.617 ± 0.055 µm), and one in the NIR band (channel no. 6; 2.334 ± 0.011 µm). Each channel produces an "average" (A) and "difference" (D) radiance measurement. A linear detector array in each channel allows MOPITT to make observations at four different sounding locations simultaneously. The ground field of view is approximately 22 km × 22 km for each sounding. Retrievals from among these four "footprints" or pixels were previously shown to have a bias compared to ground-based column measurements from the NDACC Infrared Working Group (IRWG) (Buchholz et al., 2017). A moving mirror scans cross track for 29 "stares" in each direction for a swath that is approximately 650 km wide, and one back-and-forth sweep takes approximately 26 s. Terra is in a daytime-descending (nighttime-ascending) Sun-synchronous orbit at an altitude of about 700 km, with a local Equator crossing time at around 10:30 LT (22:30 nighttime) and an inclination angle of 98.4°. Terra makes 14-15 orbits daily, with an exact repeat time of 16 d. However, with its wide swath width, MOPITT is able to achieve near-global coverage every 3-4 d. The redundancies built into the MOPITT mission allowed for continued measurements after a cooler failure in May 2001 eliminated one of the two optical boards and the usefulness of channels 1-4, leaving channels 5-8. The impact of other early anomalies is minor. No abrupt changes since 2001 are expected to impact the retrievals, with the possible exception of annual hot calibrations, the latest of which was in March 2019, and a separate temporary cooler malfunction in July 2009.
Due to the different instrument configuration from the early record, we only include MOPITT data from 2002 to 2017 (inclusive) in this study.
There are different retrieval products corresponding to TIR-only (T) retrievals, NIR-only (N) retrievals, and TIR-NIR (J, joint) retrievals. We chose to make comparisons with the level 2 Version 7 joint (L2, V7J) product because it should theoretically contain the most information. Deeter et al. (2014) noted that the V6 TIR-NIR product has the greatest vertical resolution but has large retrieval errors and bias drift. The TIR-only product has the highest stability, and the NIR-only product is best at total column CO retrievals. The MOPITT retrievals are performed on a logarithmic scale due to the large variability in CO in the atmosphere (∼ 1 order of magnitude). The state vector includes up to 10 vertical layers of log10(VMR_CO) (dry volume-mixing ratios), surface temperature, and surface emissivity. Retrievals are performed on a grid of 100 hPa spaced layers up to 100 hPa (e.g., surface-900, 900-800 hPa; Deeter, 2017). The top layer retrieved is 100-50 hPa, and above that the prior VMR_CO from the model is used due to low sensitivity. The 50-0 hPa layer represents 1.2 ± 0.4 % (1σ) of the a priori CO column (1.3 % in SH and 1.0 % in NH). Fractions of CO in this layer compared to the total column are shown in Fig. S1 in the Supplement. The a priori value is from climatological output from the Community Atmosphere Model with Chemistry (CAM-chem; Lamarque et al., 2012) and is described by Deeter et al. (2014). The a priori covariance matrix is described by Deeter et al. (2010). A total column is obtained by a weighted average of the layers, and this can be converted to a column-average dry-air mole fraction (denoted XCO) by dividing by the model total column of dry air included in the MOPITT V7 product that takes into account surface pressure and water content. We focus on only daytime soundings, which are defined as those with a solar zenith angle (SZA) less than 80° in the retrieval.
In the V7J data product the 100-0 hPa layer is an average of the 100-50 and 50-0 hPa layers, and we use the 100-0 hPa values for our 100-50 hPa layer but use values that are 48 % of this amount for 50-0 hPa based on recommendations of the MOPITT V5 user's guide.
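The treatment of the top layer described above can be sketched in a few lines of Python. This is a minimal illustration rather than the MOPITT processing code: the function names, the 1000 hPa surface pressure, and the use of pressure thickness as the layer weight (which ignores water vapor) are assumptions made for the example.

```python
def split_top_layer(layer_vmr_ppb, top_fraction=0.48):
    """Split the reported 100-0 hPa value into 100-50 and 50-0 hPa values.

    The 100-50 hPa layer keeps the reported value; the 50-0 hPa layer is
    set to 48 % of it, as described in the text.
    """
    *lower, top = layer_vmr_ppb
    return lower + [top, top_fraction * top]


def pressure_weighted_xco(layer_vmr_ppb, layer_edges_hpa):
    """Average layer VMRs weighted by pressure thickness (assumed weighting)."""
    thickness = [hi - lo for hi, lo in zip(layer_edges_hpa[:-1], layer_edges_hpa[1:])]
    return sum(v * t for v, t in zip(layer_vmr_ppb, thickness)) / sum(thickness)


# Hypothetical profile: 100 ppb in every reported layer, surface at 1000 hPa.
reported = [100.0] * 10  # nine 100 hPa layers plus the 100-0 hPa layer
edges = [1000, 900, 800, 700, 600, 500, 400, 300, 200, 100, 50, 0]
layers = split_top_layer(reported)
xco = pressure_weighted_xco(layers, edges)
```

With this uniform hypothetical profile, the reduced 50-0 hPa value lowers the column average only slightly, which is in line with that layer holding roughly 1 % of the column.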
There are a number of previous studies that have compared MOPITT with different observing systems. Because the algorithm has been improved several times since the start of the mission, here we only list validation studies on Versions 6 (released in 2013) and 7 (released in 2016 and used in this study). Recently Version 8 was released (December 2018). Versions prior to 6 are no longer available (https://www2.acom.ucar.edu/mopitt/products, last access: 2 August 2018). Deeter et al. (2014) noticed a bias between the MOPITT V6J column retrievals and aircraft observations of +4.3 ppb (assuming an average total air column density of 2.1 × 10^25 molec. cm^-2, roughly 5 % for a global 80 ppb average). They noted a correlation of r = 0.89 between the systems and a drift of only 0.15 ± 0.1 ppb yr^-1 (∼ 0.18 ± 0.12 %). The V6J retrievals had an overall positive bias at the surface and 800 hPa layers, a negative bias at the 600 and 400 hPa layers, and a positive bias again at the 200 hPa layer. The bias, drift, and correlation all depended on which data products were compared. Later, the V6J profiles were compared with aircraft measurements over the Amazon Basin (Deeter et al., 2016). Limited maximum aircraft altitudes precluded column retrieval comparisons, but Deeter et al. (2016) noted maximum biases at the 800 hPa layer of −27 %.
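The unit conversions quoted for the V6J bias and drift can be checked with simple arithmetic. The sketch below uses the air column density and the 80 ppb global average stated in the text; the function and variable names are illustrative only.

```python
AIR_COLUMN = 2.1e25  # molec. cm^-2, average total air column assumed in the text


def ppb_to_column(xco_ppb, air_column=AIR_COLUMN):
    """Convert a column-average mole fraction in ppb to molec. cm^-2."""
    return xco_ppb * 1e-9 * air_column


def percent_of_background(value_ppb, background_ppb=80.0):
    """Express a ppb value as a percentage of an assumed global average."""
    return 100.0 * value_ppb / background_ppb


bias_column = ppb_to_column(4.3)             # about 9.0e16 molec. cm^-2
bias_percent = percent_of_background(4.3)    # 5.375 %, "roughly 5 %" in the text
drift_percent = percent_of_background(0.15)  # 0.1875 % per year
```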
Three studies compared ground-based remote-sensing observations with those from MOPITT (Rakitin et al., 2015; Té et al., 2016; Buchholz et al., 2017). Rakitin et al. (2015) made comparisons between MOPITT V6J L3 and various ground-based remote-sensing sites in Eurasia. There is significant variability in the unadjusted comparisons for different sites in their study, which could be from the influence of averaging kernels (Rodgers and Connor, 2003), but in general MOPITT observations were larger than ground-based observations. Té et al. (2016) compared MOPITT V6J and IASI (Infrared Atmospheric Sounding Interferometer) satellite observations with ground-based observations in an urban site (Paris), a high-altitude site (Jungfraujoch), and a Southern Hemisphere site (Wollongong). They noted good agreement between space and ground-based observations with slopes of 0.91-0.99, with satellite observations being slightly lower. Recently, Buchholz et al. (2017) compared MOPITT V6 observations with those from 14 different ground-based NDACC sites between 78° S and 80° N and used data from August 2001 to February 2012 for comparisons with V6T, V6N, and V6J. We focus on their V6J comparison results. They found MOPITT to be generally biased high relative to the NDACC, and 11 sites have a bias less than 10 % over land. The all-station mean bias is 5.1 %, and the average correlation is r = 0.78. They noted that the surface type (land or water) had little effect on validation statistics. However, they did note that validation results differed among pixels, and pixel 1 has the lowest correlation while pixel 3 has the highest correlation. Deeter et al. (2017) is the only systematic global validation study of the MOPITT V7 algorithm. They use aircraft measurements from the HIAPER Pole-to-Pole Observations (HIPPO) campaign and National Oceanic and Atmospheric Administration (NOAA) aircraft flask samples primarily over North America for their validation dataset.
They describe the improvements included to create the V7 algorithm. They find that the V7J column observations have a smaller bias and larger r (1.4 ppb and 0.93, respectively) than the V6J product (3.8 ppb and 0.89).
While L1 includes radiance bias corrections, there are no empirical bias corrections to the physics-based retrieval in the L2 V7 MOPITT products. There are retrieval anomaly diagnostics included in the L2 product, but users need to define filters to use for their particular application. For L3, V7J daytime observations where both the signal-to-noise ratio (SNR) of channel no. 5A < 1000 and the SNR of channel no. 6A < 400 are excluded (Deeter, 2017). All observations from pixel 3 are also excluded due to excessive and unstable noise from NIR measurements from that pixel (Deeter et al., 2015). In this study suggested filters are developed along with a bias correction.

TCCON
The TCCON is a global network of independently operated solar-viewing Fourier-transform spectrometers (SV-FTS) that follow a common set of standards. From measurements taken by these spectrometers, retrieved estimates of XCO are made (Wunch et al., 2011a, 2015). Because profiles are not a part of the TCCON data product, we focus on validating the MOPITT total columns rather than profiles. Data are quality screened by both individual site operators and a centralized team. From sensitivity tests perturbing the algorithm to each known source of uncertainty (e.g., a priori values of VMRs and temperature and surface pressure), GGG2014 XCO systematic errors for TCCON are below 4 % (Wunch et al., 2015). The uncertainty in the scaling slope is 6 % (2σ).
One of the primary uses of the TCCON data has been satellite validation (e.g., Inoue et al., 2016; Kulawik et al., 2016; Wunch et al., 2017b). There are several reasons why TCCON data are considered more accurate than satellite observations and hence a good validation source. (1) Observations are pointed directly at the Sun, which increases the SNR and makes them insensitive to the effects of surface properties, airglow, and aerosol scattering (e.g., Zhang et al., 2015). (2) Instruments are operated at a resolution of at least 0.02 cm^-1, which provides more information for spectral fitting than most satellite measurements. (3) The network was established in 2004 with contributions from many different institutions. This international collaboration has led to many discoveries on how to reduce errors in Xgas retrievals (e.g., Kiel et al., 2016).
Despite these advantages, there are known sources of uncertainty that could bias the measurements. For example, to tie this to the World Meteorological Organization (WMO) in situ scale, there is a 7 % scaling factor in GGG2014 for XCO (Wunch et al., 2015). This factor is considered large compared to the current uncertainty in spectroscopy, and there is an ongoing effort to determine if this factor is appropriate. In this study we use both the official TCCON XCO product and a derived product without the empirical scaling factor applied. For a discussion and current comparison of unscaled TCCON data to the WMO scale, see Sect. S2 in the Supplement.

AirCore
AirCore measurements are a novel way to vertically sample the atmosphere to obtain profiles of various gases and have been described elsewhere (Karion et al., 2010; Membrive et al., 2017). Briefly, a coiled tube on the order of 100-300 m long, with an inner diameter on the order of 2-5 mm, is taken to altitude. One end of the tube is sealed, so during ascent it is evacuated and on descent the tube slowly fills with ambient air. Because diffusion is slow over the length of the tube but fast across the 2-5 mm diameter of the tube, air from different altitudes does not mix significantly. Upon landing the vertical profile of the gas is preserved along the length of the tube, with high altitudes near the closed end and low altitudes near the open end. On the ground, the AirCore is analyzed within a few hours, which minimizes molecular diffusion. By pulling the air through and measuring concentrations with a calibrated trace-gas analyzer, a vertical profile can be obtained. AirCore CO is still a developmental product with a sample measurement precision typically less than 5 ppb. However, stratospheric AirCore CO profile comparisons have shown differences as large as 20 ppb, which could be a result of diffusion in stratospheric AirCore samples, AirCore surface effects, or incorrect AirCore sample end-member assumptions. Accuracy is dependent on the quality of calibration and standards (see Sect. S2).
Often AirCores are flown on balloons that can reach a ceiling of around 30 km (∼ 10 hPa), depending on the type of balloon. Once altitude is reached, the payload is cut away from the balloon. Higher-altitude data (during rapid descent) often need to be discarded; hence 22 km (∼ 40 hPa) is the median highest altitude in this dataset. The vertical resolution depends on AirCore tubing dimensions, measurement altitude, recovery time, and temperature but is on the order of 200-1000 m. From 2012 to 2017 there are 36 AirCore profiles available. AirCore profiles are used among other profile measurements to tie TCCON retrievals to the WMO scale (Wunch et al., 2015). Here we use them for sensitivity tests when an approximation of the true atmospheric profile is needed.

Quality control filters and bias correction
Typically a retrieved state vector $\hat{x}$ (e.g., an atmospheric profile) is described as a linearization about the a priori state vector $x_a$ (Rodgers, 2000), i.e.,
$$\hat{x} = x_a + A(x - x_a) + \epsilon_x(b, c).$$
In this equation, $A$ is the averaging kernel, a matrix in this case, with elements $A_{ij} = \partial \hat{x}_i / \partial x_j$, and $x$ is the true state vector. The term $\epsilon_x$ is a catch-all for any remaining systematic or random uncertainties from instrument calibration or the retrieval. This term is a function of forward-model parameters not perfectly known ($b$), such as pressure, temperature, pointing, spectroscopy, and modeling of instrument response (e.g., the instrument line shape). The vector $c$ contains other values in the retrieval not used in the forward model, such as convergence criteria. Changes in $\hat{x}$ may thus be related to changes in $b$ and $c$. Biases in $b$ and $c$ may be approximated as having a linear effect on $\hat{x}$ (Rodgers, 1990). However, these effects may not be accounted for in models, so measurement teams may reduce the effects of these spurious variations by filtering data empirically.
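To make the role of the averaging kernel concrete, the following Python sketch applies the standard linearization of Rodgers (2000), x_hat = x_a + A (x - x_a) + error, to a two-level profile. The function name, the toy kernels, and the profile values are hypothetical and chosen only for illustration.

```python
def simulate_retrieval(x_true, x_apriori, avg_kernel, error=None):
    """Return x_hat = x_a + A (x - x_a) + error for profiles given as lists."""
    n = len(x_true)
    if error is None:
        error = [0.0] * n  # no calibration or retrieval error term
    diff = [xt - xa for xt, xa in zip(x_true, x_apriori)]
    x_hat = []
    for i in range(n):
        smoothed = sum(avg_kernel[i][j] * diff[j] for j in range(n))
        x_hat.append(x_apriori[i] + smoothed + error[i])
    return x_hat


x_true = [120.0, 90.0]
x_apriori = [100.0, 100.0]

# Identity kernel: the retrieval reproduces the truth.
perfect = simulate_retrieval(x_true, x_apriori, [[1.0, 0.0], [0.0, 1.0]])
# Zero kernel: no sensitivity, so the retrieval falls back to the prior.
insensitive = simulate_retrieval(x_true, x_apriori, [[0.0, 0.0], [0.0, 0.0]])
# Partial sensitivity: the retrieval lies between the prior and the truth.
partial = simulate_retrieval(x_true, x_apriori, [[0.5, 0.0], [0.0, 0.5]])
```

The three cases bracket the behavior of a real retrieval, which sits somewhere between reproducing the truth and returning the prior, before any error term is added.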
For example, empirical corrections are employed for various gases in the final TCCON products after the physics-based retrievals to improve accuracy to about 0.1 %; without them, accuracy would currently be limited to about 2 %-3 % due to spectroscopic uncertainties, especially in O2 (Wunch et al., 2011b, 2015). As a second example, empirical corrections to CO2 measurements from the Orbiting Carbon Observatory-2 (OCO-2) satellite (launched in 2014) did not always improve data at all scales but did reveal areas where the algorithm could be improved (O'Dell et al., 2018; Kiel et al., 2019). Though their studies were for CO2, we apply many of the same methods for CO, including similar truth proxies.
By comparing retrieved data with a truth proxy, some data may stand out as being possibly biased due to the $\epsilon_x(b, c)$ term. These may be filtered out, deweighted, or bias corrected to improve the final product. It is challenging to define a truth proxy because if the true state of the atmosphere were known a priori, the measurement would not be needed in the first place. Rather than using proxies that work for each measurement, we aggregate many measurements to empirically identify artifacts and outliers. We use TCCON and a small-region approximation (SRA; also known as small-area approximation or variation in other studies) as truth proxies. For the SRA we assume that over a sufficiently small region (e.g., ∼ 100 km × 100 km) that is far from point sources the atmosphere is approximately homogeneous and outliers are due to inadequacies in the retrieval.
Filter selection and biases are interdependent; thus our quality-control (QC) and bias-correction process was iterative.
Buchholz et al. (2017) observed biases among the four MOPITT pixels. This bias significantly affects our SRA (Sect. 3.2), as a biased value may be chosen as the median. We spatially grid the data in 2° × 2° bins and average for each pixel separately over monthly timescales to evaluate variability in the bias. Here and throughout, data are averaged as described in Appendix A. We analyze multiple months but here show results from April and November 2016 in Fig. 1 for the difference between pixels 2 and 4. We choose these two pixels because the instrumental noise is larger for pixel 3 (Deeter et al., 2015) and pixel 1 has a known large global bias (Buchholz et al., 2017), and we would therefore expect the difference between pixels 2 and 4 to be a lower bound on pixel-to-pixel bias. We see large pixel-to-pixel bias polewards of 60°. Comparing with scenes flagged as snowy or icy by retrievals from MODIS (Moderate Resolution Imaging Spectroradiometer; also aboard Terra), we see that there is some correlation between the bias and the snow or ice scenes. This bias can be positive or negative. For example, we see that pixel 2 is lower than pixel 4 towards the North Pole and is biased positively over land in Antarctica. Over sea ice around Antarctica, pixel 2 is lower than pixel 4. We also compare pixels 1 and 3 to the weighted mean and find that pixel 3 is biased low over land snow or ice and pixel 1 is biased high over both land and water snow or ice. These biases likely arise from the effects snow or ice have on the thermal contrast of the surface and hence affect the TIR channels. For the rest of our analysis, we filter for daytime scenes and remove soundings where the MODIS diagnostics indicate the presence of any snow or ice.

Pixel-to-pixel bias
We examine temporal trends in MOPITT pixel bias compared to the weighted mean from all pixels (Fig. 2). Data are averaged globally for each pixel and surface type separately for 15 d bins. This analysis relies on the assumption that on average each pixel samples the same area. We see that the absolute bias of pixel 1 is largest. However, in contrast with Buchholz et al. (2017), we observe a negative rather than positive bias between pixel 1 and the mean in the TIR-NIR retrievals, which may be because their study was of V6 data. Pixel 3 has a smaller absolute bias that is positive. In 2002, the spread of the biases is larger than in 2017. On average, the land and water biases are similar (within 0.4 ppb); however, there is a larger seasonal cycle (∼ 1.5 ppb) in the bias for the land that may be an artifact of the sampling and averaging in global 15 d bins.
One consideration for bias corrections is whether accounting for differences in averaging kernels can account for the bias. Buchholz et al. (2017) noticed a large absolute bias for pixel 1 compared with NDACC observations even after accounting for averaging kernels. To examine the effects of averaging kernels, we find MOPITT soundings within an ellipse (±1° latitude, ±1.5° longitude) around the center location of AirCore flights on the same day. There are 20 flights with coincident observations and 1933 total corresponding MOPITT soundings. We apply averaging kernels to create simulated MOPITT column retrievals from AirCore profile measurements:
$$\hat{c} = c_a + \sum_i h_i a_i (x_i - x_{a,i}),$$
where $\hat{c}$ is the simulated XCO, $c_a$ is the a priori column XCO, and $x_a$ is the a priori VMR profile. $x$ is the dry VMR profile (from AirCore) and should not be confused with the state vector, which is log10(VMR) for MOPITT. The MOPITT column-averaging kernel $a_i$ for a pressure level $i$ used in this study is defined in Appendix B. The pressure weighting function $h$ has been described by Connor et al. (2008) and Wunch et al. (2010). We find that the maximum bias for the retrieved columns is between pixels 1 and 4 and is about 8 times larger for the retrieved (2.6 ppb) than for the simulated columns (0.3 ppb; Table 2). For these soundings MOPITT is also biased high compared to the AirCore simulated columns by 3.3 ppb, which is greater than the bias of 0.5-1.4 ppb compared to other aircraft profiles. We make a preliminary pixel bias correction by adjusting soundings over land and water for each pixel separately based on a linear fit to the overall time series shown in Fig. 2. This fit is later improved after filtering (Sect. 3.4). After this adjustment we noticed some residual bias among the histograms, so we also apply a year-to-year pixel bias correction of up to 0.4 ppb that is the same for water and land.
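The coincidence test and the column smoothing step can be illustrated with a short Python sketch. It assumes the commonly used form c_hat = c_a + sum_i h_i a_i (x_i - x_a,i) (e.g., Wunch et al., 2010) with a column-averaging kernel a that is already known; the kernel definition in Appendix B is not reproduced here. All names and numbers are hypothetical, and longitude wraparound at ±180° is ignored.

```python
def in_coincidence_ellipse(lat, lon, center_lat, center_lon,
                           half_lat=1.0, half_lon=1.5):
    """True if (lat, lon) lies inside the ellipse around the AirCore location."""
    dlat = (lat - center_lat) / half_lat
    dlon = (lon - center_lon) / half_lon
    return dlat * dlat + dlon * dlon <= 1.0


def simulated_column(c_apriori, h, column_ak, x_true, x_apriori):
    """c_hat = c_a + sum_i h_i * a_i * (x_i - x_a,i), with all VMRs in ppb."""
    return c_apriori + sum(
        h_i * a_i * (x_i - xa_i)
        for h_i, a_i, x_i, xa_i in zip(h, column_ak, x_true, x_apriori)
    )


# Hypothetical two-layer case with equal pressure weights.
c_hat = simulated_column(
    c_apriori=80.0,
    h=[0.5, 0.5],
    column_ak=[1.0, 0.5],
    x_true=[100.0, 90.0],      # AirCore profile interpolated to the layers
    x_apriori=[80.0, 80.0],
)

inside = in_coincidence_ellipse(40.5, -105.0, 40.0, -105.0)
corner = in_coincidence_ellipse(41.0, -103.5, 40.0, -105.0)
```

In this toy case the upper layer has half the sensitivity of the lower layer, so only half of its 10 ppb departure from the prior reaches the simulated column. The second ellipse test shows that the corner of the bounding box falls outside the ellipse.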

Small-region approximation
We perform an SRA on the dataset with the preliminary filter for daytime and snow- or ice-free scenes and preliminary pixel bias correction. In an SRA, data within a specified area and time frame are assumed to be homogeneous, and variation within that area is assumed to be non-physical. There is always some real variation in the atmosphere; however, statistically, for a large sample size these variations are expected to average out. If the area is too small then there will be too few points for an unbiased median. If the area is too large then true atmospheric variability will be significant. A disadvantage of using this method as a "truth" proxy is that it is insensitive to bias on larger scales related to, for example, latitude and surface albedo (e.g., O'Dell et al., 2018, for OCO-2 and CO2).
Figure 1. On the left (a, b) are average differences in XCO between pixels 2 and 4. On the right (c, d) is the corresponding MODIS snow or ice flag, where 0 indicates all snow or ice and 1 indicates that the scenes were clear of snow or ice. Some correlation is observed between bins with a large pixel-to-pixel bias and snow or ice cover. Here and throughout we use an Eckert IV equal-area projection.
Figure 2. Pixel (pxl) biases compared to the weighted mean with time. Data are averaged into 15 d bins separated for land and water soundings. The mean of pixels 1, 2, and 4 is also shown because pixel-3 data are not included in the L3 product. The small gap in 2009 is from a temporary cooler malfunction on 28 July.
We use small regions that are approximately 89 km × 133 km (0.8° × 1.2°, latitude × longitude, at the Equator). Region size is a trade-off between having sufficient points per region and keeping regions small enough that real variations in XCO are small. The effects of different region sizes are described in Sect. S3. To calculate anomalies, we subtract the median from all the points within that region. If the median point does not have at least one degree of freedom (DOF) for signal then the entire region is discarded. We also require at least 10 points in each region, which retains about 50 % of the SRA bins.
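The anomaly calculation can be sketched as follows. This is a simplified illustration, not the processing code used in the study: it bins soundings on a fixed 0.8° × 1.2° grid, applies the 10-point minimum, and subtracts the regional median. The DOF test on the median sounding and any time binning are omitted, and the function name and input layout are assumptions.

```python
import math
from collections import defaultdict
from statistics import median


def sra_anomalies(soundings, dlat=0.8, dlon=1.2, min_points=10):
    """Return (lat, lon, anomaly) for soundings given as (lat, lon, xco_ppb)."""
    regions = defaultdict(list)
    for lat, lon, xco in soundings:
        key = (math.floor(lat / dlat), math.floor(lon / dlon))
        regions[key].append((lat, lon, xco))

    anomalies = []
    for members in regions.values():
        if len(members) < min_points:
            continue  # too few points for a reliable median
        region_median = median(x for _, _, x in members)
        anomalies.extend((lat, lon, x - region_median) for lat, lon, x in members)
    return anomalies


# Hypothetical input: 11 soundings in one region, 3 in another.
dense = [(0.1 + 0.01 * k, 0.1, 80.0 + k) for k in range(11)]
sparse = [(10.0, 10.0, 95.0)] * 3
result = sra_anomalies(dense + sparse)
```

Only the densely sampled region survives the 10-point requirement, and its anomalies are centered on zero by construction.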

Quality control filters
Using the SRA "truth" proxy, we can look for correlations of differences to the local median (i.e., anomalies) with various parameters that are or may be related to the retrieval. Table 3 lists parameters we consider for filtering and bias corrections. We make plots similar to those by Wunch et al. (2011b) and O'Dell et al. (2018) (though their studies were of XCO2) of anomalies versus one of the various parameters to aid in determining filter cutoffs (e.g., Fig. 3). Such plots may reveal empirical relationships with features. Similar plots with additional parameters, including some we decided were inappropriate to use as filters, are available in the Supplement (Sect. S4). Several features can be examined in these plots to decide on where to set the filter limits, including the underlying histograms, systematic biases from zero in the mean including spikes, the spread among pixels (which indicates pixel-to-pixel bias), and the root-mean-square (RMS) from the SRA (which includes systematic and random deviations from the truth proxy). We define filters based usually on one of the following criteria: (1) absolute mean bias is greater than 2 ppb, (2) the RMS is greater than 6 ppb, or (3) spread of pixel-to-pixel bias is greater than 5 ppb. These criteria are not strict, and we change thresholds if too few data are in a bin (due to possible sampling bias), if too many data are removed, or if the overall trend in the mean seems like it could be corrected by a bias correction.
Several features are apparent in the SRA diagrams (Figs. 3 and S4-S9) that indicate that data may be less reliable. For example, there is a step change in the bias for soundings over land going from day to night. The RMS is much smaller over snow- or ice-free scenes (flag of 1). We also note large anomalies for low channel 5A SNR which, in agreement with the L3 product filters, suggest it to be a good parameter to filter on. However, the bias is small for low channel 6A SNR soundings; so unlike the L3 product, we do not use it as a filter criterion. We also find that the sum of the retrieval anomaly diagnostics is a better indicator for suspicious data over land than over water. These particular tests also do not support excluding all pixel-3 soundings, though on average pixel 3 does have a lower and more variable DOF (Deeter et al., 2015). Maps of where data are filtered are available in Figs. S10 and S11. Using these filters reduces the number of daytime soundings to 3.50 × 10^8 (of 5.40 × 10^8) and reduces the RMS from 3.84 to 2.55 ppb. By comparison, applying the L3 filters reduces the data to 3.27 × 10^8 daytime soundings with an RMS of 3.02 ppb.
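The three nominal criteria can be written as a small diagnostic applied to the SRA anomalies falling in one bin of a candidate filter parameter. The sketch below is illustrative only: the function name and the dictionary layout (pixel number mapped to a list of anomalies in ppb) are assumptions, and the text notes that the thresholds were adjusted case by case rather than applied strictly.

```python
import math


def bin_flags(anomalies_by_pixel, max_bias=2.0, max_rms=6.0, max_spread=5.0):
    """Flag one parameter bin against the three nominal thresholds (ppb)."""
    all_values = [v for values in anomalies_by_pixel.values() for v in values]
    mean_bias = sum(all_values) / len(all_values)
    rms = math.sqrt(sum(v * v for v in all_values) / len(all_values))
    pixel_means = [sum(v) / len(v) for v in anomalies_by_pixel.values()]
    spread = max(pixel_means) - min(pixel_means)
    return {
        "mean_bias": abs(mean_bias) > max_bias,
        "rms": rms > max_rms,
        "pixel_spread": spread > max_spread,
    }


# Hypothetical bins: one well behaved, one with a large offset in pixel 1.
clean = bin_flags({1: [1.0, -1.0], 2: [0.5, -0.5]})
suspect = bin_flags({1: [8.0, 8.0], 2: [0.0, 0.0]})
```

In the second bin the mean bias (4 ppb) and the pixel spread (8 ppb) exceed their thresholds, while the RMS (about 5.7 ppb) does not, which shows why the criteria are considered separately.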

Bias correction
We observe trends in the mean bias with various parameters (e.g., Figs. 3 and S4-S9). To reduce the likelihood of overfitting, O'Dell et al. (2018) used linear fits as bias corrections only if they removed at least 5 % of the variance for X CO 2 from OCO-2. For X CO from MOPITT, the ratio between the scatter (indicated by RMS) and bias is larger than for X CO 2 from OCO-2; however, over our period of analysis there are about 400 times more data over water and about 100 times more over land. Fitting concerns here primarily relate to how representative the SRA is as a truth proxy and how much the biases would already be accounted for by adjusting individual soundings using averaging kernels. Even with a criterion of only a 3 % reduction in the overall RMS, the only parameter to meet this is the maximum difference between adjacent levels over land (see Fig. S5j). This feature is larger for strong gradients between levels, which can appear when there are strong surface fluxes or when the retrieval is unstable and oscillates. This instability may be caused by bias related to, for example, spectroscopic errors. Following O'Dell et al. (2018) we make piecewise linear fits to the overall mean over two regimes, split at 100 ppb for a bias correction. The Multivariate Adaptive Regression Splines (MARS) algorithm could also be used to make a piecewise linear fit over a multidimensional dataset. However, it is more likely to overfit the data. When we applied it to the top three most variable fields the RMS for land soundings was not significantly reduced compared with our piecewise fit, so we did not use those results.
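A two-regime linear bias correction of this kind can be sketched with ordinary least squares. The example below is a simplified stand-in, not the fit used in the study: it fits each regime independently (so the two segments are not forced to meet at the split), and the function names and synthetic data are hypothetical.

```python
def linear_fit(xs, ys):
    """Ordinary least-squares slope and intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def fit_two_regimes(feature, anomaly, split=100.0):
    """Fit anomaly versus feature separately below and above the split."""
    low = [(f, a) for f, a in zip(feature, anomaly) if f < split]
    high = [(f, a) for f, a in zip(feature, anomaly) if f >= split]
    return linear_fit(*zip(*low)), linear_fit(*zip(*high))


def bias_correct(xco, feature_value, fits, split=100.0):
    """Subtract the fitted bias for the regime the feature value falls in."""
    slope, intercept = fits[0] if feature_value < split else fits[1]
    return xco - (slope * feature_value + intercept)


# Synthetic anomalies (ppb) versus maximum adjacent-level difference (ppb).
feature = [0.0, 20.0, 40.0, 60.0, 80.0, 100.0, 120.0, 140.0, 160.0]
anomaly = [0.02 * f if f < 100.0 else 2.0 + 0.05 * (f - 100.0) for f in feature]
fits = fit_two_regimes(feature, anomaly)
corrected = bias_correct(90.0, 50.0, fits)
```

Keeping the model to two straight segments limits the number of free parameters, which is the same concern about overfitting that the text raises for MARS.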
In addition to the single "feature" bias correction above, we apply a pixel-to-pixel bias correction after the filtering, described in Sect. 3.3. We perform a second SRA on the filtered data without a pixel bias correction. SRA data are binned separately for each pixel and land or water surface type and averaged over 10 d. On 28 July 2009 one of the coolers on MOPITT malfunctioned, which caused a 2-month instrument shutdown. We separate the period before and after this event and make 16 different linear fits of the bias relative to the all-pixel mean with time (2 for land and water, 4 for pixels, and 2 for time), following the method of York et al. (2004). These linear fits are used to define the pixel-to-pixel bias.

Hochstaffl et al. (2018) found it to be necessary to deweight observations that were further away in time and space from points of comparison. This is likely much less of an issue for this study due to differences in retrieval errors and coincidence scales. For MOPITT the median retrieval error is about 3.5 ppb versus 24.8 ppb for SCIAMACHY. For SCIAMACHY temporal averaging was on the order of a month, whereas in this study we only use TCCON observations within ±30 min. We apply spatial averaging to the MOPITT data typically over areas of 2° × 4° (with exceptions noted below). Spatial weighting is not as much of a concern here as for Hochstaffl et al. (2018) with SCIAMACHY because they used coincidence criteria of 500-2000 km radii, which are significantly larger in terms of area (about 8-100 times). However, despite using smaller areas, heterogeneities in CO sources that MOPITT averages over may occasionally introduce bias for real reasons (e.g., Lindenmaier et al., 2014).
We make exceptions to the ±1° latitude, ±2° longitude spatial coincidence criteria for several sites. For sites poleward of 60° (eu, sp, and so) we expand the area to 4° × 8° because the atmosphere is expected to be well mixed and retrievals are more sparse. For sites in the Los Angeles Basin (ci, jc, and jf), we limit the area to 33.4-34.3° N, 116.7-118.8° W, because we expect XCO within the basin to be much larger than the surrounding area due to urban emissions. We set the minimum latitude to 34.5° N for the AFRC site to avoid the polluted Los Angeles Basin. We average soundings over land and water separately.
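The spatial criteria above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the code used here: the function names and the dictionary layout of a sounding (`lat`, `lon`, `surface`) are hypothetical, longitude wrapping at the date line is ignored, and site-specific exceptions are passed in as an explicit override box.

```python
# Fixed box for the Los Angeles Basin sites: (lat_min, lat_max, lon_min, lon_max).
LA_BASIN_BOX = (33.4, 34.3, -118.8, -116.7)


def coincidence_box(site_lat, site_lon, override=None):
    """Return (lat_min, lat_max, lon_min, lon_max) for a site.

    Default is +/-1 deg latitude and +/-2 deg longitude; sites poleward of
    60 deg use a 4 deg x 8 deg box. An override box replaces both rules.
    """
    if override is not None:
        return override
    half_lat, half_lon = 1.0, 2.0
    if abs(site_lat) > 60.0:
        half_lat, half_lon = 2.0, 4.0
    return (site_lat - half_lat, site_lat + half_lat,
            site_lon - half_lon, site_lon + half_lon)


def split_by_surface(soundings, box):
    """Keep soundings inside the box, grouped into land and water."""
    lat_min, lat_max, lon_min, lon_max = box
    groups = {"land": [], "water": []}
    for s in soundings:
        if lat_min <= s["lat"] <= lat_max and lon_min <= s["lon"] <= lon_max:
            groups[s["surface"]].append(s)
    return groups
```

An exception such as the AFRC minimum latitude could be expressed the same way, by passing a box whose `lat_min` is 34.5.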
Because of the long (13+ year) comparison between MOPITT and TCCON, random representation error is much less important than systematic error. Té et al. (2016) and Buchholz et al. (2017) noted that systematic biases can arise from comparing total column observations (in molec. cm^-2) from MOPITT and NDACC when the surface altitudes differ significantly. This effect will be diminished in column averages (XCO) in locations away from strong local surface fluxes; however, different surface altitudes can lead to biases because CO profiles are not completely uniform. Between two TCCON sites only ∼ 10 km apart in an urban region, Hedelius et al. (2017) noted an XCO2 difference of nearly 1 ppm. They attributed part of this to the different site altitudes. We estimate the ratio between observations at the surface pressure of TCCON versus the surface pressure of MOPITT soundings. The total column-average dry mole fraction is

$$\hat{c} = \mathbf{h}^{T}\mathbf{x} = \sum_j h_j x_j. \tag{4}$$

The vector x here can be either the retrieved profile or the a priori VMR profile. We use the MOPITT profiles because they are likely more representative of the true atmosphere than TCCON a priori profiles and apply Eq. (4) to find the retrieved and prior MOPITT XCO at the MOPITT sounding surface pressure. We then recalculate h based on the daily average TCCON site surface pressure. When TCCON altitude is lower, the MOPITT surface level is uniformly extended. For higher-altitude sites, the lowest-altitude MOPITT levels are either unused (h_j = 0) or deweighted. We then calculate XCO based on TCCON surface pressure. Figure 4 shows the ratios between the MOPITT retrieved XCO using the TCCON surface pressure compared to the MOPITT sounding surface pressure for 10° × 10° areas. Larger areas are used to get a larger variety in surface pressures. We see that for the high-altitude Zugspitze site, this scaling is particularly large (around 15 %). Over these areas the overall scaling for all sites is 0.996 ± 0.023 (1σ).
A scaling factor less than unity is usually due to larger CO mixing ratios near the surface than the rest of the column and lower TCCON site pressure (Hedelius et al., 2017). In this intercomparison, we implicitly account for differences in surface pressure using the h vector. This can make a difference for individual sites by as much as −10.5 ± 4.1 ppb (1σ ) (for Zugspitze). However, we have found in practice that accounting for differences in surface pressure makes little difference here on the overall comparison (compare Fig. S12c and f). In aggregate the difference is only −0.2 ± 1.5 ppb (1σ ).
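The effect of changing the surface pressure on the column average can be illustrated with a simplified pressure weighting. This is a sketch under loudly stated assumptions, not the h vector used in this study: weights are taken as proportional to layer pressure thickness (water vapor and gravity variations are ignored), the profile is treated as layer-mean values, and the function names and pressure grid are hypothetical.

```python
def layer_weights(edges, psurf=None):
    """Normalized pressure-thickness weights for layers between `edges`.

    `edges` are pressures in hPa in descending order, with edges[0] the
    sounding surface. If `psurf` is larger than edges[0] the bottom layer is
    extended; if it is smaller, layers below `psurf` get zero weight and the
    layer containing `psurf` is truncated (deweighted).
    """
    if psurf is None:
        psurf = edges[0]
    thickness = []
    for j in range(len(edges) - 1):
        bottom = psurf if j == 0 else min(edges[j], psurf)
        top = min(edges[j + 1], psurf)
        thickness.append(max(bottom - top, 0.0))
    total = sum(thickness)
    return [t / total for t in thickness]


def column_average(vmr, weights):
    """Column average as the weighted sum of layer values."""
    return sum(h * x for h, x in zip(weights, vmr))
```

For a profile with enhanced CO near the surface, recomputing the weights for a lower surface pressure (a higher site) removes the most polluted layers and gives a ratio below unity, which is the qualitative behavior described above.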

Overall global scaling
MOPITT and TCCON use different a priori VMR profiles and have different averaging kernels (AKs; Sect. 4.4), and these differences in sensitivity need to be taken into account when comparing retrievals from the different instruments. Here we account for differences in AKs and a priori profiles following the methods of Wunch et al. (2011b), which are formally described as method II in Sect. S6.1. Retrievals are also on different vertical grids, and regridding is described in Appendix C. Figure 5 shows the comparison for all sites. We find that MOPITT observations are higher than TCCON by about 6.4 %. This is similar to the 5.1 % positive bias between MOPITT V6J and NDACC total column observations (Buchholz et al., 2017).
We perform a variety of sensitivity tests on the overall global comparison. There are different approaches to account for different a priori VMR profiles and AKs, such as the choice of comparison ensemble (Sect. S6). Figure S12 shows the comparison for a variety of tests when AKs are applied differently or not at all. Generally all comparisons show MOPITT to be about 6 %-9 % higher than TCCON, with some exceptions. For method III, where AKs are applied in a manner opposite to method II, the bias can be as high as 15 % but is more typically ∼ 10 % or less. Figure S13 is a series of bar charts of how the different methods compare for each site. We also examine how the scaling changes for different colocation criteria in Figs. S12d and e by halving and doubling the coincidence areas. We find that MOPITT is biased higher than TCCON in these tests by 5 %-7.4 %. Doubling the area decreases R^2 for the global comparison.
Next we test sensitivity to pressure scaling. Our vertical regridding (Appendix C) accounts for differences in surface pressure, so we use a basic comparison without AK corrections (method 0) for this test. Between these two, the overall offset is not significantly different (9 %-10 %).

Figure 5. One-to-one plot comparing MOPITT and TCCON, following method II (similar to Wunch et al., 2011b, Sect. S6). MOPITT data were adjusted to the TCCON a priori profile (ĉ_M), and MOPITT averaging kernels were applied to TCCON data (ĉ_M←T). Error bars represent standard deviations of the weighted averages. Triangles represent soundings over water, and other shapes are over land. Text is number of points or days n; coefficient of determination for ordinary least-squares regression R^2; and bias (in %) at 50, 75, 100, and 150 ppb using the shown fit and equation for the shown fit using the methods of York et al. (2004).
Finally we test whether filtering and bias corrections affect the comparison (Fig. S12g-i). The pixel and feature bias correction have little effect on the overall global comparison (∼ 5.8 %-6.8 %). Without filtering, the scatter in the comparison increases, leading to a smaller R^2. Due to a large intercept, the percent difference spans about 3 %-8 %. Figure S12g shows the comparison for a derived TCCON product without empirical corrections for airmass and without correction to the WMO scale (Wunch et al., 2015). In this comparison the TCCON data do not have the standard scaling to aircraft. Due to uncertainties in the TCCON WMO scaling (Sect. S2), some comparisons are made without it. Here the bias between the datasets is significantly different and is less than 0.5 %. When MOPITT V7J data were compared directly with NOAA flask measurements from aircraft, Deeter et al. (2017) found a positive bias of less than about 1 %. Figure 6 shows boxplots of the MOPITT to TCCON differences (using method II) for each site for land-only and water-only soundings. We do not note an overall bias between land and water. For all sites the MOPITT−TCCON bias is positive and usually on the order of about 3-10 ppb, with a few exceptions. For example, MOPITT observations compared to the AFRC (df) TCCON are particularly high (∼14 ppb). This could be related to the high albedo or high surface temperatures of this desert site.

Figure 6. Boxplots of the MOPITT-TCCON percent difference at the TCCON sites (using method II), ordered by latitude (degrees north in parentheses). Blue boxes are MOPITT soundings over water, and brown boxes are those over land. Whiskers represent the inner 95 % of data. Notches are 95 % confidence intervals of the median. Box heights represent the relative number of observations. The solid horizontal line is the Equator, dashed lines are ±45°, and the dotted line is 60° N.

Systematic biases
A seasonal variation in bias may be indicative of differences in sensitivities between the instruments to some feature, such as airmass or water content, that varies seasonally. Figure 7 shows the time series of the difference averaged in 1-month, 5° latitudinal bands. Though there is significant scatter among individual comparisons, we find a long-term trend of −0.06 ± 0.06 % yr^-1 in the MOPITT-TCCON difference using the Theil-Sen estimator. Deeter et al. (2017) reported a bias drift of −0.04 ± 0.10 % yr^-1 for V7J, though bias drifts for individual layers were larger. Including a correction trend to the L1 radiances significantly reduced the bias drift for the layers (Deeter et al., 2019). Seasonalities of the difference for each site are in Fig. S14. There does not appear to be a persistent seasonal trend for all sites, though there is some seasonal variability for individual sites. For Lamont and AFRC the bias is larger in July-October, while for Białystok the bias is larger in April-June. At Ascension the bias is largest in January-February, while for Réunion it is largest in September-November. We do not make a seasonal bias correction.
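For readers unfamiliar with it, the Theil-Sen estimator takes the median of the slopes between all pairs of points, which makes the trend robust to outliers. The following is a minimal sketch with hypothetical function and variable names; it returns only the slope and a median-based intercept and does not reproduce the uncertainty estimate quoted above.

```python
from itertools import combinations
from statistics import median


def theil_sen(t, y):
    """Return (slope, intercept) from the median of pairwise slopes."""
    slopes = [
        (y[j] - y[i]) / (t[j] - t[i])
        for i, j in combinations(range(len(t)), 2)
        if t[j] != t[i]  # skip pairs with identical times
    ]
    slope = median(slopes)
    # Intercept as the median residual offset, one common convention.
    intercept = median(yi - slope * ti for ti, yi in zip(t, y))
    return slope, intercept
```

With `t` in years and `y` the MOPITT-TCCON difference in percent, the returned slope has units of % per year. A single strongly deviant month changes only a few of the pairwise slopes and therefore barely moves the median.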
There appears to be some latitudinally dependent bias, with a larger bias in the Northern Hemisphere. Part of this could be related to stratospheric CO (Sect. S1). Deeter et al. (2017) also showed some latitudinal variation in MOPITT retrievals compared with aircraft. They suggested that part of the variability could arise from interfering species such as N2O, which has spectral lines that overlap with the TIR channels. Before V7 a constant value of N2O was assumed, which was determined to cause biases on the order of a few parts per billion. In V7 a global average is used based on a linear fit to monthly in situ observations. Figure S14 shows the bias as a function of column N2O measured by the TCCON. There is a slope of −0.40 ppb XCO per ppb XN2O, though the overall correlation is small (R^2 = 0.08). There also appears to be a small dependence on column H2O, which was likely reduced in V8 (Deeter et al., 2019). We do not make bias corrections for any of these systematic features.

Averaging kernels, covariance matrices, and information content
According to Rodgers and Connor (2003, Sect. 2 therein), an intercomparison of two observing systems should also include a comparison of (1) averaging kernels, (2) retrieval noise covariance, (3) degrees of freedom, and (4) the Shannon information content. In conjunction with the comparison of averaging kernels, we think that it is also helpful to compare a priori profiles, which is done in Appendix D. Because the MOPITT retrievals are of logarithmic profiles and the TCCON uses a linear scaling retrieval, some aspects of the comparison are inherently different. Example AKs for MOPITT and TCCON are shown in Fig. 8. Because the MOPITT retrieval is on a log scale, we make an assumption that the a priori VMRs represent the true profile to obtain unitless AKs (Appendix B). We find that the TCCON AKs are more sensitive than MOPITT. Shaded regions in Fig. 8a show a wide variability in MOPITT column AKs. In addition, the typical state significantly affects the MOPITT AKs (e.g., compare Pasadena and Lauder). TCCON CO column AKs are most sensitive to the stratosphere and are assumed to be consistent at all sites. We make a sensitivity test where the AKs were explicitly calculated in GGG2014 for days with a wide range of XCO at the East Trout Lake site. In general the difference from the standard AKs is small, on the order of 5 % at most.
A priori profiles and MOPITT retrieved profiles along with their differences for select sites are shown in Appendix D. We compare MOPITT and TCCON a priori profiles. In general, MOPITT a priori profiles are influenced more by localized emissions, as they are based on 1° simulated monthly climatologies from the CAM-chem model (Deeter et al., 2014). This can be seen especially at Pasadena and to a lesser extent at Lamont and Tsukuba. Ascension Island shows a special case where enhanced CO in the lower free troposphere is seen coming from biomass burning and rainforest VOC emissions in Africa. At sites far removed from local emissions (e.g., Ny-Ålesund and Lauder) the MOPITT and TCCON a priori profiles are in better agreement with each other (see e.g., Pollard et al., 2017).
We take differences in a priori profiles and averaging kernels into account following method II, described in Sect. S6.1. Corrections are applied to each MOPITT retrieval and to daily averages of TCCON retrievals within coincidence criteria. We find in practice that corrections change the comparison by about 3 %. TCCON data are adjusted by 0.7 ± 1.8 ppb (1σ ), and MOPITT data are adjusted by −1.0 ± 3.1 ppb (1σ ).
Rather than comparing the retrieval noise covariance, we compare reported errors and measures of precision and accuracy. Histograms of total reported retrieval error for MOPITT are shown in Figs. S4o and S5o. With our prescribed filtering, global mean uncertainty values are 2.60 ± 1.27 ppb (1σ ) for smoothing, 2.68 ± 1.40 ppb (1σ ) for measurement, and 3.86 ± 1.63 ppb (1σ ) for the total error. The average of the errors reported in the TCCON files is 0.62 ± 0.50 ppb (1σ ). However, these errors are more a measure of repeatability than the total error or the accuracy. The 2σ uncertainty for TCCON (GGG2009) was reported as 4 ppb (Wunch et al., 2010), and the uncertainty budget from a range of sensitivity tests is less than 4 % (Wunch et al., 2015).
Histograms of the MOPITT DOF for the signal for water and land are shown in Figs. S4d and S5k. The DOF for the signal (d_s) can be determined from

$$d_s = E\left\{(\hat{\mathbf{x}} - \mathbf{x}_a)^{T}\,\mathbf{S}_a^{-1}\,(\hat{\mathbf{x}} - \mathbf{x}_a)\right\}, \tag{5}$$

where E is the expected value operator (Rodgers, 2000, Eq. 2.46 therein), x̂ is the retrieved state, x_a is the a priori state, and S_a is the a priori covariance matrix. However, d_s is usually determined from the trace of the averaging-kernel matrix (Rodgers, 2000, Eq. 2.80 therein), which is equivalent to Eq. (5) for profile retrievals. Because GGG2014 is a scaling retrieval, we treat TCCON measurements as having d_s = 1. With a profile retrieval we would expect d_s > 1, as was the case for CO2 (Connor et al., 2016). The DOF gives an indication of how many independent parameters can be improved compared with the a priori profile. MOPITT DOFs are between 1 and 2, which indicates that total column measurements may be reasonable, but individual layer measurements may not always be accurate. Finally, the information content H_s is a measure of how accurately a value is measured relative to how well it is known a priori. Rodgers (2000) expresses it on a natural log scale (Eqs. 2.73 and 2.80 therein):

$$H_s = -\tfrac{1}{2}\ln\left|\mathbf{I}_n - \mathbf{A}\right|, \tag{6}$$

where I_n is the identity matrix and A is the averaging-kernel matrix. Here we express H_s on a log_2 scale instead. Histograms of H_s for MOPITT profile retrievals are shown in Figs

We assimilate MOPITT observations using the GEOS-Chem model to show how filtering and the bias correction affect estimated emissions inferred from inversion analyses. We conducted three experiments in which we assimilated the following datasets: (1) the original MOPITT data, (2) the filtered and bias-corrected data with scaling down by about 6 % to match the standard TCCON data (Fig. 5; referred to as Assim. 2), and (3) the filtered and bias-corrected data with a scaling of less than 0.5 % to the TCCON-based data not tied to the WMO scale and without the empirical airmass correction (Fig. S12g; referred to as Assim. 3).
The assimilation is performed using the GEOS-Chem four-dimensional variational (4D-Var) data assimilation system, employing Version 35J of the adjoint model at a horizontal resolution of 4° × 5°. The GEOS-Chem 4D-Var system has been used in previous studies for assimilation of MOPITT data (e.g., Kopacz et al., 2010; Jiang et al., 2013, 2015, 2017). We assimilate the MOPITT data to optimize monthly average CO emissions. We assimilate daytime observations for the periods of October-December 2009 and May-July 2011 to coincide with flights from the HIPPO campaign. The a posteriori CO fluxes are compared with the a priori fluxes, and the a posteriori CO concentrations are validated against CO measurements from the HIPPO 10 s merged data (Santoni et al., 2014).
The assimilation uses the offline CO simulation in GEOS-Chem with prescribed monthly mean OH fields from TransCom (Patra et al., 2011) to compute the sink of CO. The prior anthropogenic CO emissions are from the EDGAR v4.2 inventory, which are overwritten regionally with the following inventories: the Streets 2006 emissions over China and southeastern Asia from Zhang et al. (2009), the annual Canadian anthropogenic emissions from the Criteria Air Contaminants (CAC inventory), the National Emissions Inventory 2005 (NEI2005) from the United States Environmental Protection Agency (EPA), the "co-operative programme for monitoring and evaluation of the long-range transmission of air pollutants in Europe" (EMEP) inventory, and the Big Bend Regional Aerosol and Visibility Observational (BRAVO) inventory in Mexico. The Global Fire Emissions Database, Version 3 (GFED3), provides the biomass emissions. The biofuel emissions are the Yevich and Logan (2003) inventory. The initial condition of CO states is generated by spinning up the GEOS-Chem model from January 2009. The initial CO concentrations are not optimized in the assimilation. The prior emissions are scaled by a factor of 1.5, and the emission error is purposely set to be 500 % so that the posterior CO source estimates will be less influenced by the a priori emissions and more strongly reflect the information from the filtered MOPITT observations.
Using HIPPO-2 and HIPPO-4 measurements for comparison, the simulation using only a priori fluxes produces mole fractions that are low by approximately 5 % (Tables 4 and 5). On the other hand, the original MOPITT assimilation and the assimilation which is not tied to WMO (Assimilation 3) tend to agree with each other and are biased high relative to HIPPO measurements. Assimilation 2 mole fractions are lower relative to HIPPO than those of the other assimilations. This suggests that scaling down MOPITT observations to match TCCON translates to less CO in the assimilation, as expected. However, the comparison with HIPPO shows mixed results for each simulation, depending on which latitudinal band is considered.
To validate the quality of the filtered and bias-corrected MOPITT observations, the prior CO fluxes are compared with the a posteriori fluxes (Fig. 9). Again we find that assimilations 1 and 3 are in general agreement and Assimilation 2 produces lower fluxes. Fluxes using assimilated data are nearly always smaller than fluxes using the prior fluxes scaled up by 50 %. Assimilation 2, which includes scaling MOPITT to the standard TCCON product, produces fluxes that are in between the unscaled (lower) and scaled (higher) prior fluxes. Though fluxes from Assimilation 2 are closest to the unscaled prior fluxes, they are higher by about 30 % and 15 % during HIPPO-2 and HIPPO-4, respectively.
These results are inconclusive as to which of the assimilations is best. Comparisons with HIPPO mole fractions are mixed, and uncertainties in the assimilated prior fluxes prevent us from drawing definitive conclusions from the flux comparison. It is unclear if the filtering and bias corrections improved fluxes in these experiments.

Practical considerations in intercomparisons of remote sounding retrievals
In addition to the formal aspects of intercomparing retrievals from different remote sounding instruments, there are a variety of practical aspects to consider. For several of these aspects, an entire study could be devoted to each of them for a given intercomparison. We summarize our comparison methodology in Table 6 and give examples of other studies that provide additional details or alternative methods. Though it is impractical to test all combinations of different considerations, we test some as described in Sect. 4.2, such as coincidence criteria, filtering, bias corrections, and applications of averaging kernels.

Considerations for future MOPITT data use
Several lessons learned in this study may be useful for future versions of MOPITT data products or users assimilating the data. Additional fields used in the retrieval, such as the a priori mixing ratio from 50 to 0 hPa and the water vapor profile, would be useful outputs when converting mixing ratios from whole air to dry air. Though the prior covariance matrix is fixed (Deeter et al., 2010), a single matrix per daily file may be helpful. The retrieved surface emissivity over land is on average about 0.007, or about 0.75 %, larger than the prior emissivity, and the retrieved surface temperature is on average about 6 K larger than the prior temperatures (see histograms in Fig. S5f and S5m). This suggests that prior surface emissivity and temperature values should perhaps be reconsidered, as they may be biased low over land. Further updates to prior values of CO, N2O, and H2O are expected to further improve the retrievals. For example, the retrieved column CO is slightly larger than the a priori column globally (Fig. S16), but the difference depends on the level and could be related to uncertainties in model transport, sinks, and sources.
Filtering can reduce spurious values. MOPITT files include parameters that could be used in filtering, such as a retrieval anomaly diagnostic, various cloud indicators, and the DOF. Data users should consider creating a QC flag for their analyses, or a binary flag could be included in future versions, e.g., based on parameters in Table 3 or based on the recommendations of the MOPITT team (e.g., the L3 filters). Often highly deviant retrieved surface temperatures show up around coastlines (especially western coastlines; Fig. S10) that did not pass our quality screening. These may be related to sounding definitions of surface type. SNR 6A is used to filter MOPITT data when creating the TIR-NIR L3 product to maintain consistency with the NIR product and increase stability of the DOF, but we do not find sufficient evidence to use it as a TIR-NIR filter criterion based on XCO stability alone. When biases are found in the MOPITT L2 data, the strategy is to correct the L1 radiances or the retrieval algorithm (e.g., Deeter et al., 2019). MOPITT data users of L2 XCO may consider implementing a bias correction before analysis or model assimilation. In terms of XCO, pixel-3 data agree with pixels 2 and 4; however, this agreement may not necessarily hold for retrieved profiles, and pixel-3 data are excluded in the L3 product due to excessive NIR noise and in order to increase stability in the DOF (e.g., Deeter et al., 2015). A bias correction should be considered when assimilating pixel 1. There is a bias in the SRA for large retrieval errors on XCO above about 8 ppb (Figs. S4o, S5o). This bias suggests that perhaps these data should be excluded or deweighted further, which we did not do here. A bias adjustment field could also be included as a field in future MOPITT files. Such an adjustment could account for empirical biases noted with various parameters; pixel-to-pixel biases (Sect. 3.4); and an overall bias compared with NDACC (Buchholz et al., 2017), aircraft flights, and/or TCCON.

Conclusions
In this study quality-filtered and bias-corrected MOPITT data are compared with TCCON data. We first derive filters using only the MOPITT data, assuming homogeneity over small regions. These filters have the largest effect over snow or ice scenes and over high terrain. They reduce the overall RMS from 3.84 to 2.55 ppb. We find and correct a bias among the four pixels, which we confirmed exists using AirCore. We also find and correct a feature bias.
After the filtering and bias correction, we compare with TCCON data. Using a method (method II; see Sect. S6.1) similar to Wunch et al. (2011b) to account for differences in a priori profiles and AKs, we find MOPITT data to be biased high by about 6 % compared with TCCON, but it is not clear whether MOPITT or TCCON is biased. We also test different methods, which all lead to a bias of about 6 %-10 %. There is a trend of −0.06±0.06 % yr −1 in the MOPITT-TCCON difference. The bias also appears to depend on site and latitude, but the scatter is not consistent enough to derive a correction. We also compared AKs and information content from the different retrievals. TCCON AKs are more sensitive to changes in the stratosphere. MOPITT AKs peak in the midtroposphere and can vary significantly among locations.
After applying filtering and an overall scaling to match TCCON, we assimilate the data into GEOS-Chem. Filtering and bias correction are uniform enough to not make a large difference among regional fluxes. When the data are also scaled down to match TCCON before assimilation into GEOS-Chem, fluxes are lower in all regions. However, because of bottom-up uncertainties in global CO fluxes, these experiments were inconclusive. Additional work is needed to understand the relatively large (∼ 6 %) difference between MOPITT and TCCON.
Data availability. MOPITT data were obtained from the NASA Langley server (ftp://l5ftl01.larc.nasa.gov/MOPITT/, last access: 12 December 2018). TCCON data were obtained through the TCCON data archive hosted by CaltechDATA (TCCON, 2018). See Table 1 for data references for each site. TCCON data without the scaling to the WMO scale were obtained from the site PIs. AirCore data were obtained from Colm Sweeney (v20170918).

Appendix A: Calculation of X CO and weighted averaging
The MOPITT V7 data product contains fields for retrieved total column CO (in molec. cm^-2). Unlike the TCCON, MOPITT does not retrieve a dry-air column. However, a model dry-air column is provided. We obtain a dry-air mole fraction from

$$X_{\mathrm{CO}}\ [\mathrm{ppb}] = \frac{\text{retrieved CO column}}{\text{model dry-air column}} \times 10^{9}. \tag{A1}$$
The retrieval error (in molec. cm^-2) can be converted to parts per billion in the same way. When averaging n soundings together, we use a weighted average using the inverse squared retrieval errors as weights. The average retrieved value ŷ is

$$\hat{y} = \frac{\sum_{i=1}^{n} \hat{y}_i\,\hat{y}_{i,\mathrm{err}}^{-2}}{\sum_{i=1}^{n} \hat{y}_{i,\mathrm{err}}^{-2}}, \tag{A2}$$

where ŷ_i denotes an individual measurement in the bin, and ŷ_i,err is the corresponding error. When an average weighted error is needed, we calculate a weighted standard error of the mean (SEM) using SEM = n (n − 1) … (A3). In the case of uniform weights ŷ_i,err^-2, this reduces to the typical SEM equation. We also test a bootstrap analysis (Efron and Gong, 1983) on binned data for one of the parameters (DEsfc) in the bias correction analysis (Sect. 3.4) to evaluate Eq. (A3). Data are placed into 146 bins, with at least 2000 points in each. The bootstrap is run 500 times per bin. We find, in agreement with Gatz and Smith (1995), that Eq. (A3) is a reasonable approximation to the SEM determined from the bootstrap method, with an offset of only +0.2 ± 3.1 % (1σ).
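The conversion and averaging steps in this appendix can be sketched as follows. This is a minimal illustration, not the code used in this study, and the function names are hypothetical. For the weighted SEM we assume the ratio-variance approximation discussed by Gatz and Smith (1995), written in a compact algebraically equivalent form; that this matches the exact layout of Eq. (A3) is an assumption.

```python
import math


def xco_ppb(co_column, dry_air_column):
    """Dry-air mole fraction in ppb from columns in molec. cm^-2."""
    return co_column / dry_air_column * 1e9


def weighted_mean(values, errors):
    """Average with inverse squared errors as weights."""
    weights = [e ** -2 for e in errors]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


def weighted_sem(values, errors):
    """Weighted standard error of the mean (assumed ratio-variance form).

    SEM^2 = n / (n - 1) * sum(w_i^2 * (y_i - ybar)^2) / (sum(w_i))^2,
    which reduces to the usual s / sqrt(n) when all weights are equal.
    """
    n = len(values)
    weights = [e ** -2 for e in errors]
    ybar = weighted_mean(values, errors)
    spread = sum(w ** 2 * (v - ybar) ** 2 for w, v in zip(weights, values))
    return math.sqrt(n / (n - 1) * spread) / sum(weights)
```

A bootstrap check of the kind described above would resample the (value, error) pairs in a bin with replacement, recompute `weighted_mean` for each resample, and compare the standard deviation of those means with `weighted_sem`.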

Appendix B: MOPITT column-averaging kernel
We derive our own MOPITT column-averaging-kernel (AK) vector based on the full averaging-kernel matrix. To fulfill Eq. (2) (and using Eq. 4), MOPITT AK elements a_j are

$$a_j = \frac{\partial \hat{c}}{\partial \log_{10} x_j} = \frac{\partial \sum_i h_i x_i}{\partial \log_{10} x_j}.$$
Figure D1. Profiles and profile differences between MOPITT and TCCON for six select sites and different days in 2013. For clarity, only one MOPITT profile within the coincidence criteria is selected per day. The rows are the TCCON a priori profiles (T_a), the MOPITT a priori profiles (M_a), the MOPITT retrieved profiles (M_r), the difference between TCCON and MOPITT a priori profiles, and the difference between MOPITT a priori and retrieved profiles.
Author contributions. JKH and DW were involved in the overall conceptualization, investigation, and methodology development. DW secured funding and computational resources and provided supervision. TLH wrote the original Sect. 5 draft. JKH did the formal analysis and visualization and wrote the remainder of the original draft. TLH and DBAJ created methodology for and performed the GEOS-Chem assimilations and comparisons with HIPPO data. CS and BCB provided the AirCore data, and MDM, NMD, MKD, DGF, DWTG, FH, LTI, PJ, MK, RK, CL, IM, JN, YSO, HO, DFP, MR, SR, CMR, MS, K Shiomi, K Strong, RS, YT, OU, VAV, WW, TW, POW, and DW provided the TCCON data, which involves independent funding acquisition, site management, data acquisition and processing, QA/QC (quality assurance/quality control), and delivery. RRB and HMW provided guidance on MOPITT data and insight into the MOPITT instrument, algorithm, and previous validation results. JKH, DW, DBAJ, RRB, NMD, FH, MK, IM, JN, RS, POW, HMW, CS, and BCB reviewed the paper. JKH and DW implemented edits to the paper.
Competing interests. The authors declare that they have no conflict of interest.