A comparison of carbon monoxide retrievals between the MOPITT satellite and Canadian High-Arctic ground-based NDACC and TCCON FTIR measurements

Measurements of Pollution in the Troposphere (MOPITT) is an instrument on NASA’s Terra satellite that has measured tropospheric carbon monoxide (CO) from early 2000 to the present day. Validation of data from satellite instruments like MOPITT is often conducted using ground-based measurements to ensure the continued accuracy of the space-based instrument’s measurements and its scientific results. Previous MOPITT validation studies generally found a larger bias in the MOPITT data poleward of 60° N. In this study, we use data from 2006 to 2019 from the Bruker IFS 125HR Fourier Transform Infrared spectrometer (FTIR) located at the Polar Environment Atmospheric Research Laboratory (PEARL) in Eureka, Nunavut, Canada, to validate the MOPITT Version 8 retrievals. These comparisons utilize mid- and near-infrared FTIR measurements made as part of the Network for the Detection of Atmospheric Composition Change (NDACC) and the Total Carbon Column Observing Network (TCCON), respectively. All MOPITT Version 8 retrievals within a 1° radius of the PEARL Ridge Laboratory and within a 24-hour time interval are used in this validation study. MOPITT retrieval products include those from the near-infrared (NIR) channel, the thermal infrared (TIR) channel, and a joint product from the thermal and near-infrared [...] find filtering criteria. We then apply the filters to the MOPITT dataset to minimize the MOPITT pixel bias and the number of outliers in the dataset. The sensitivity of each MOPITT pixel and each product is examined over the Canadian high Arctic. We then follow the methodologies recommended by NDACC and TCCON for the comparison between the FTIR and satellite total column retrievals. MOPITT averaging kernels are used to weight the NDACC and TCCON retrievals and to take into account the different vertical sensitivities of the satellite and PEARL FTIR measurements.
We use a modified Taylor diagram to present the comparison results from each pixel for each product over land and water with NDACC and TCCON measurements. Our results show overall consistency between MOPITT and the NDACC and TCCON measurements. When compared to the FTIR, the NIR MOPITT retrievals have a positive bias of 3-10%, depending on the pixel. The bias values are negative for the TIR product, with values between −5% and 0%. The joint TIR-NIR products show differences of −4% to 7%. The drift in MOPITT biases (in units of % year⁻¹) relative to NDACC and TCCON varies by MOPITT data product. In the NIR, drifts versus TCCON are smaller than those versus NDACC; however, this is reversed for the MOPITT TIR and joint TIR-NIR products.

Before running the comparisons, we applied a filter to the MOPITT data to reduce the effects of outliers. In order to simplify and visualize the comparisons between different combinations of pixels and land types, we used modified Taylor diagrams as well as plots of bias and drift to evaluate the biases, uncertainties, and correlation coefficients between MOPITT V8 and the Eureka FTIR measurements.
Our results show that there is good consistency between the MOPITT-NDACC and MOPITT-TCCON CO comparisons. The comparisons of the MOPITT V8 measurements with NDACC and TCCON show that the bias values are generally positive for the NIR and negative for the TIR and joint TIR-NIR products. However, the biases and drifts versus NDACC for the TIR and joint TIR-NIR products are smaller than those versus TCCON. Pixel 1 has the largest bias in the NIR and joint TIR-NIR products in both the TCCON and NDACC comparisons; however, pixel 1 shows good performance in the TIR product comparisons for both NDACC and TCCON. We recommend using only the TIR measurements from pixel 1 in the high Arctic. For the TIR product, the bias and drift values are larger over water than over land when compared to both NDACC and TCCON. However, the bias values are generally less than 5%. The TIR drift values versus TCCON are twice as large as those versus NDACC. In the joint TIR-NIR products, the biases for all pixels are positive over land and negative over water for both the NDACC and TCCON comparisons. Finally, we compared our results with other studies for the three latest versions of MOPITT data, both globally and regionally. There is good consistency between our total column bias comparison for MOPITT V8 vs. NDACC with the MOPITT V6

describe the MOPITT instrument and its data products, investigate pixel-to-pixel biases, and apply filters to remove outliers for the comparisons. In Sects. 3 and 4, we describe the FTIR measurements and the NDACC and TCCON datasets and discuss the vertical sensitivity of each retrieval and its averaging kernels. The validation methodology, including the coincidence criteria and the comparison approach, is explained in Sect. 5. The results of the validation comparisons for MOPITT with NDACC and TCCON are shown in Sect. 6, including comparisons with previous results. Finally, we summarize the results and draw conclusions in Sect. 7. Buchholz et al. (2017) and Hedelius et al. (2019) are referenced several times in this paper; we therefore refer to them as "Buchholz2017" and "Hedelius2019", respectively.

MOPITT satellite instrument
MOPITT is on board NASA's Terra satellite, which was launched in December 1999 (Drummond et al., 2010). The Terra satellite is in a sun-synchronous, near-polar orbit with an inclination of 98.4° at ~705 km altitude, with an equator overpass time of 10:30 AM (descending node). MOPITT is a nadir-viewing multi-channel TIR and NIR instrument with a horizontal spatial resolution of 22 × 22 km and a swath width of ~640 km, which is achieved by cross-track scanning (Drummond and Mand, 1996; Drummond et al., 2010). This provides near-global measurement coverage from 82° N to 82° S in ~3 days.
MOPITT uses a correlation spectroscopy technique, employing pressure- and length-modulated gas cells, to measure CO concentrations. Although the instrument originally comprised eight channels, only three channels have been used to retrieve CO since August 2001 due to a failure in the cooling system. Of these channels, two are in the TIR band (4.6 µm) and one is in the NIR band (2.3 µm). The TIR channels have the most sensitivity to middle and upper tropospheric layers and show significant sensitivity to CO variation, thus providing profile information, while the reflected solar (NIR) channels are sensitive to the total CO column. There is significant measurement sensitivity in the lower troposphere if the temperature contrast between the surface and the atmosphere is large (Drummond et al., 2010).

The MOPITT retrieval process utilizes an iterative optimal estimation method in log(volume mixing ratio (VMR)) to combine measured radiances and a priori information (Deeter et al., 2003). Each channel's detector consists of a four-pixel linear array, where 1 and 4 are the outer pixels and 2 and 3 are the inner pixels of the array. For each pixel, the retrieved profiles are provided on a 10-level fixed-pressure grid as the average VMR within each layer, where these levels correspond to the pressure at the bottom of each layer (Deeter et al., 2013). These are also integrated to provide MOPITT total column CO values. In addition, for each pixel, the type of surface is catalogued as water, land, or mixed (coastline). We use V8 of the MOPITT level 2 data in this study, including the TIR-only, NIR-only, and joint TIR-NIR products (Deeter et al., 2019). The joint TIR-NIR retrievals use radiances from both channels and provide profiles with the largest degrees of freedom for signal (DOFS), the best vertical resolution, and the highest sensitivity in the lower troposphere (Deeter et al., 2015).
We compared with the TIR and joint TIR-NIR products from MOPITT over both land and water. MOPITT NIR (solar reflectance) retrievals provide information only over land.
Each MOPITT version provides improvements over the previous version. As we will compare our results with some results from MOPITT V6 and V7, it is useful to briefly mention the improvements from V6 to V7 and then from V7 to V8 (Deeter et al., 2019). The first improvement in V7 is that the radiative transfer model accounts for the steady growth in atmospheric N2O concentrations over time, rather than using constant concentrations for this interfering species. Using constant concentrations could produce a time-dependent bias in the calculated radiances and possible retrieval drift.

https://doi.org/10.5194/amt-2022-68. Preprint. Discussion started: 16 May 2022. © Author(s) 2022. CC BY 4.0 License.
The second improvement is changing the source of the meteorological fields used (such as water vapour and temperature profiles and surface temperature) from the NASA MERRA (Modern-Era Retrospective Analysis for Research and Applications) reanalysis products for V6 to MERRA-2 for V7. The MOPITT retrieval algorithm uses only clear-sky observations as input, which are identified from MOPITT's thermal channel radiances and the MODIS (Moderate Resolution Imaging Spectroradiometer; also aboard Terra) cloud mask. For this cloud detection, MODIS Collection 6 was used for the V6 retrievals after March 2016 and for the entire MOPITT V7 dataset. This change mostly affects the number of clear-sky scenes over the tropics, specifically during nighttime. In the MOPITT retrieval process, the simulated radiances calculated by the operational radiative transfer model are compared to the actual calibrated level 1 radiances and the bias between them is corrected. Radiance-bias correction factors compensate for different bias sources such as forward model errors due to assumed spectroscopic data, geophysical errors, and errors in instrumental specifications (Deeter et al., 2014). For V7, a new strategy was used to derive radiance-bias correction factors by minimizing observed retrieval biases at 400 and 800 hPa using in situ CO profiles.

The most recent release, MOPITT V8, has several enhancements over V7. In V8, a new water vapour model and collisionally induced nitrogen absorption have been implemented (Deeter et al., 2019). The second change is in the radiance-bias correction.
The new parameterization includes the date and geographical location of the MOPITT observation and the water vapour total column at the observation time. This method decreases both the retrieval drift and the geographical variability of the biases. Another improvement is in the cloud detection, where the MODIS Collection 6.1 cloud mask is applied and the radiance threshold ratio used to identify cloudiness is increased.

MOPITT pixel-to-pixel biases
There have been several studies that investigated pixel-to-pixel biases among the four MOPITT pixels. Deeter et al. (2015) found that the MOPITT V6 retrieval performance varies based on instrumental and geophysical effects and discussed how filtering could be used to reduce the impact of variations in instrumental noise between pixels. Globally for MOPITT V7, Hedelius2019 investigated the pixel-to-pixel biases and their trends for snow- and ice-free pixels. They observed that pixel 1 had the largest negative bias and found that for all pixels the biases grow increasingly larger moving polewards, with pixel 2 having a smaller bias than pixel 4 at high latitudes. Also, Buchholz2017 examined how validation comparisons for V6 differed between pixels. They found that the correlations were poorer for pixel 1 than for the other pixels for all data products and found the best correlations across data products for pixel 3.

To examine the pixel-to-pixel biases for MOPITT V8 over the Canadian high Arctic, we calculated 30-day means of MOPITT total column CO measurements from the joint TIR-NIR retrieval within a radius of 110 km around Eureka, Nunavut, for each pixel over land or water, and then compared these to the weighted mean of the measurements from all pixels for the same time period. This internal comparison of MOPITT data quality is based on the assumption that each pixel samples the same area. These pixel-to-pixel bias results are plotted in Fig. 1, with the average over the whole period plotted as a corresponding line. Pixel 1 over land has the largest negative bias, which is consistent with Hedelius2019. The pixel 1 bias is larger over land than over water. Pixel 3 has a large variability in bias over land, although it oscillates around zero; therefore the average is small and positive. Pixel 2 has the smallest bias compared to the other pixels. The variability in bias for pixel 4 is smaller than for pixel 1.
However, the average bias for pixel 4 over land is relatively large and positive. Overall, the variability in bias is more consistent for all pixels over water and the average biases are smaller than those over land. Hedelius2019 found a significant trend in the pixel-to-pixel bias over time; therefore, they applied a bias correction before validation. In our case, the variability of the pixel-to-pixel bias is large (due to our smaller sample size) and the average biases all fall within each other's standard deviations. Therefore, we did not make a bias correction.

The outliers and pixel biases in Fig. 1 appear to have some periodicity. To examine the oscillations for each pixel and investigate seasonal effects, the monthly average of the pixel bias for each year within the 110 km radius around Eureka is plotted in Fig. 2. The monthly snow and ice background percentage taken from MODIS (provided in the MOPITT data files), as well as solar zenith angles, are plotted in Fig. 3. Pixel 1 has a large negative bias over land in the spring and summer months. The depth of snow over the Eureka region is at a maximum in spring (see Fig. 3 of Howell et al., 2016), and this could be the reason for the larger biases over land. Over land, pixel 2 has almost no bias. Pixel 3 over land has little bias except during April and May, when the bias is positive. Pixel 4 has a positive bias all year except during the summer, when the snow and ice background is minimal. We observed large biases for all pixels over water during July and August, which is correlated with the minimum amount of ice during those months and with minimum solar zenith angles. Most of the pixel biases are seen in July-August, when there is a mixture of ice and water over the ocean and the snow/ice background percentage over the ocean is at a minimum.
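The pixel-to-pixel comparison described in this section can be sketched in a few lines. This is only an illustration under stated assumptions: the record layout `(day, pixel, column, sigma)` and the function name `pixel_bias_percent` are hypothetical, the all-pixel reference is assumed to be an inverse-variance weighted mean, and the bias is expressed in percent of that reference; the text does not specify these details for this step.

```python
from collections import defaultdict

def pixel_bias_percent(records, bin_days=30):
    """Percent bias of each pixel's mean against the all-pixel weighted mean.

    records: iterable of (day, pixel, column, sigma) tuples, where `day` is a
    day count from an arbitrary start and `sigma` is the retrieval uncertainty.
    Returns {(bin_index, pixel): bias_percent}.
    """
    bins = defaultdict(list)
    for day, pixel, column, sigma in records:
        bins[int(day // bin_days)].append((pixel, column, sigma))

    biases = {}
    for bin_index, rows in bins.items():
        # Reference value: inverse-variance weighted mean over all pixels
        # (the weighting scheme is an assumption of this sketch).
        weight_sum = sum(1.0 / s ** 2 for _, _, s in rows)
        reference = sum(c / s ** 2 for _, c, s in rows) / weight_sum

        per_pixel = defaultdict(list)
        for pixel, column, _ in rows:
            per_pixel[pixel].append(column)
        for pixel, columns in per_pixel.items():
            pixel_mean = sum(columns) / len(columns)
            biases[(bin_index, pixel)] = 100.0 * (pixel_mean - reference) / reference
    return biases

# Invented columns (arbitrary units) for two pixels in the first 30-day bin.
demo = [(1, 1, 90.0, 1.0), (5, 1, 90.0, 1.0), (2, 2, 110.0, 1.0), (9, 2, 110.0, 1.0)]
print(pixel_bias_percent(demo))
```

With equal uncertainties the reference reduces to a plain mean, so the two invented pixels come out roughly 10% below and above it.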

MOPITT filtering
We adopted the filtering method of Hedelius2019 for our study. This uses the "small-area approximation" or "small-region approximation" (known as SRA) to identify outliers in the dataset, based on the assumption that over a small enough area (1° radius), on a single orbit track, atmospheric properties are almost homogeneous (e.g., Mandrake et al., 2015; Wunch et al. [...] parameter, systematic biases from zero in the mean pixels, pixel bias for each pixel, and the root-mean-square (RMS) residual from the SRA for each parameter. These plots were used to determine the parameters to focus on and the filtering criteria to be used by examining the variation and spread in the pixel-to-pixel biases. Table 1 summarizes the filter parameters determined and the limits we apply to the MOPITT data to minimize the outliers, separated for land and water. The percentage of MOPITT data that passes the filters is also reported for each parameter. Most of the data removed are filtered by the solar zenith angle (SZA) parameter (~40% pass percentage), which was chosen to limit the data to daytime only, when the comparison FTIR is measuring, and to be consistent with previous studies. Of the remaining filter parameters, those that have the greatest impact are the MODIS snow/ice background and the signal chi-squared (χ²), which represents the goodness of the retrieval's fit (~90-93% and ~83-90% pass percentage, respectively). Because low-DOFS retrievals are associated with low CO concentrations and removing them would induce a positive bias in our comparisons, we did not use DOFS as a filter parameter, following the recommendation of Deeter et al. (2015). Also, based on the MOPITT data product recommendations, we did not use the surface temperature as a filter parameter, because it is a physical parameter.

Ground-based high-spectral-resolution FTIRs operating in transmission mode are widely used to measure atmospheric trace gases, including CO. The atmospheric absorption spectra produced by these instruments are used to retrieve total and partial column densities by exploiting atmospheric pressure broadening (Pougatchev et al., 1995). There are two global networks spanning from the Arctic to the Antarctic that utilize these instruments to study the Earth's atmosphere, namely NDACC and TCCON. From here onward, the Eureka FTIR measurements used in this study will be referred to by the network name (e.g., NDACC or TCCON).

NDACC
The NDACC FTIR spectral coverage is obtained using two detectors (InSb and HgCdTe), which cover the mid-infrared (MIR) from 600 to 4800 cm⁻¹. The instrument is operated at a spectral resolution of 0.004 cm⁻¹ (unapodized). VMR profiles are retrieved from the FTIR spectra, and total and partial column densities are determined by converting VMR to density using temperature and pressure profiles (Batchelor et al., 2009). SFIT4, a profile retrieval algorithm based on the optimal estimation method (Rodgers, 2000), is used with a combination of a priori information and information in the recorded spectral measurements to perform the spectral fitting. In the optimal estimation method, the VMR profile is iteratively updated until the difference between the measured and calculated spectra is minimized. The mean outputs from the Whole Atmosphere Community Climate Model (WACCM) version 4 for 1980-2020 are used for the a priori VMR profiles (Marsh et al., 2013) and daily temperature and pressure

TCCON
The TCCON FTIR spectra are measured in the NIR from 3800 to 11000 cm⁻¹ at a spectral resolution of 0.02 cm⁻¹ using an InGaAs detector. Estimates of column-averaged dry-air mole fractions (XCO) are retrieved from the measurements; therefore we use TCCON measurements to compare with the CO total column MOPITT values. The GFIT spectral fitting algorithm is used to retrieve trace gas amounts. It uses a nonlinear least squares spectral fitting algorithm that scales the a priori profile to produce a calculated spectrum that best matches the measured spectrum (Wunch et al., 2011a). The algorithm integrates the scaled profile to calculate the column abundance, and the dry-air mole fractions are then calculated by dividing the column abundance by the column of dry air obtained from the simultaneous O2 column abundance measurement. The TCCON CO a priori profiles are based on an empirical model, and the temperature, pressure, and humidity a priori profiles are based on NCEP/National Center for Atmospheric Research reanalyses (Wunch et al., 2011a). TCCON XCO is reported in units of ppb.

The TCCON data used for this study are version GGG2020 (Laughner and the TCCON team, 2020) from March 2010 to October 2019.

Vertical sensitivity of instruments
In order to compare the MOPITT CO measurements with those from NDACC and TCCON, the vertical sensitivity of each instrument must be taken into account. [...] However, the difference between the pixels is not large. The total column TIR-only AK for all pixels over water is approximately twice as large as that over land, and the difference between them is noticeable. This could be due to the larger geophysical noise over land than over water. The geophysical noise is affected by surface height and emissivity (Deeter et al., 2011). The variation of elevation over land around Eureka (Fig. 4(g) and Fig. 5(g)) is large, ranging from sea level up to approximately 1,500 m. Also, the surface emissivity over land (Fig. 4(d)) is much larger than over water (Fig. 5(d)). TIR averaging kernels also depend on the temperature difference between the surface and the air above it (thermal contrast); the effect of changes in thermal contrast on the averaging kernels also has a seasonal component. Over the Arctic, the thermal contrast over water is smaller than over land (Fig. 3(a) and (b)), as the snow and ice background percentage is larger over water than over land. Therefore, the averaging kernels over water are more sensitive to the free troposphere, while the averaging kernels over land show some sensitivity to the lower troposphere. For these measurements, pixel 1 has the maximum sensitivity and pixel 3 has the minimum. The differences between land and water measurements decrease for the total column joint TIR-NIR products, and their AKs are more similar than for the TIR-only retrievals. Overall, pixel 1 has the highest sensitivity and pixel 3 shows the lowest sensitivity in the joint TIR-NIR products. The contribution of NIR measurements to the joint retrievals improves the sensitivity of the AKs in the lower troposphere. Calculated from the trace of the AK matrix, the DOFS represents the information content of the retrievals.
The monthly average DOFS for each MOPITT product by pixel and surface are presented in Fig. 7(a-c). The variation of the DOFS over the year and between pixels can be seen in each plot. MOPITT AKs vary with season, which is reflected in the seasonal variability of the DOFS. The DOFS for the joint TIR-NIR product are higher than those for the TIR-only and NIR-only products, the land DOFS are larger than those over water, and pixels 1 and 3 have the largest and smallest DOFS, respectively (Fig. 7(a-c)). Figure 7(d) shows the monthly average DOFS for the NDACC CO retrievals by year. The DOFS vary over the year, typically with SZA, with lower DOFS in summer when the Sun is highest in the sky. The DOFS for the NDACC retrievals are roughly twice those of MOPITT: the average DOFS for the NDACC measurements is around 2 and that for the MOPITT joint TIR-NIR product is around 1. [...] shown up to 100 hPa to be comparable with MOPITT AKs. The TCCON AK varies weakly with SZA, with a maximum spread of around 0.1 at the surface. The TCCON total column AKs are less than unity below 400 hPa and above unity from 400 hPa to higher altitudes. This indicates that there is more sensitivity to the upper troposphere and above. In the case of NDACC, the total column AK being close to one indicates that all altitudes contribute to the total column equally. Both the NDACC and TCCON total column AKs are closer to unity than the MOPITT total column AK, indicating a larger contribution of the measurements to the retrieval, rather than of a priori information, for these Eureka ground-based measurements. Because of these differences, the NDACC and TCCON retrievals will be smoothed by the MOPITT AKs for our comparisons.
Rodgers and Connor (2003) presented a general method for comparing measurements from two instruments with different averaging kernels, by smoothing the retrievals of the instrument with higher DOFS (higher resolution) with the averaging kernels of the lower-resolution instrument. The details of the intercomparison methodology in this study are described in the next section.
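As a small illustration of the DOFS used throughout this section, the trace of an averaging kernel matrix can be computed as sketched below. The function name `dofs` and the 3-level matrix are invented for illustration and do not correspond to any MOPITT or NDACC retrieval.

```python
def dofs(ak_matrix):
    """Degrees of freedom for signal: the trace of a square AK matrix."""
    n = len(ak_matrix)
    if any(len(row) != n for row in ak_matrix):
        raise ValueError("averaging kernel matrix must be square")
    return sum(ak_matrix[i][i] for i in range(n))

# Hypothetical 3-level averaging kernel matrix (values invented for illustration).
example_ak = [
    [0.50, 0.20, 0.05],
    [0.15, 0.40, 0.10],
    [0.05, 0.10, 0.30],
]
print(dofs(example_ak))
```

Following the Rodgers and Connor (2003) approach described above, the retrieval with the larger trace would be the one smoothed by the other instrument's averaging kernels.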

Coincidence criteria
The coincidence criteria used in this work are consistent with the previous study by Buchholz2017 using NDACC measurements. MOPITT measurements are limited to be within the same day (24 h) as each FTIR measurement, and only daytime measurements (SZA < 90°) are used. The MOPITT measurements must be within a 110 km radius of the PEARL Ridge Laboratory. This is a tighter spatial criterion than that used by Hedelius2019 in their TCCON comparisons, who used an area of 2° × 4° as the coincidence criterion globally but an area of 4° × 8° for stations above 60° N. Figure 8(a) shows the location of PEARL at Eureka in the Canadian high Arctic. The topography of the area around Eureka is displayed in Fig. 8 [...] large variation in the topography over a small area (~200 km radius) with a mixture of water and land in the vicinity of Eureka.
The QGIS 3.1 software is used to plot the data in Fig. 8.
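The coincidence criteria above can be expressed as a simple predicate. The sketch below is an assumption-laden illustration rather than the selection code used in the study: the site coordinates are approximate, the dictionary keys `lat`, `lon`, `time`, and `sza` are hypothetical, distances use a spherical-Earth haversine formula, and the same-day criterion is applied as ±12 hours.

```python
import math
from datetime import datetime, timedelta

EARTH_RADIUS_KM = 6371.0            # mean Earth radius
SITE_LAT, SITE_LON = 80.05, -86.42  # approximate PEARL Ridge Laboratory position (assumed)

def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def is_coincident(sounding, ftir_time, max_km=110.0, max_hours=12.0, max_sza=90.0):
    """True if a satellite sounding meets the distance, time, and daytime criteria.

    sounding: dict with hypothetical keys 'lat', 'lon', 'time' (datetime), 'sza' (degrees).
    """
    distance = great_circle_km(SITE_LAT, SITE_LON, sounding["lat"], sounding["lon"])
    close = distance <= max_km
    same_day = abs(sounding["time"] - ftir_time) <= timedelta(hours=max_hours)
    daytime = sounding["sza"] < max_sza
    return close and same_day and daytime

# Invented soundings: one about 50 km from the site, one well outside the radius.
ftir_time = datetime(2015, 7, 1, 18, 0)
near = {"lat": 80.5, "lon": -86.42, "time": datetime(2015, 7, 1, 20, 0), "sza": 60.0}
far = {"lat": 82.5, "lon": -86.42, "time": datetime(2015, 7, 1, 20, 0), "sza": 60.0}
print(is_coincident(near, ftir_time), is_coincident(far, ftir_time))
```

A fixed-distance radius is used here rather than a latitude-longitude box because meridians converge strongly at 80° N, so a box of fixed angular size would cover a much narrower east-west distance than at lower latitudes.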

Methodology
Several steps were taken to prepare the MOPITT retrievals and FTIR measurements for the validation comparisons. First, the MOPITT retrievals were filtered based on the criteria in Table 1. Then, for each NDACC or TCCON measurement, we selected all of the co-located MOPITT measurements within ±12 hours of the FTIR measurement. From that subset, we further separated the MOPITT data for each retrieval by pixel number and surface type and then took a weighted average of each pixel and land type subset of the MOPITT measurements to compare with the single FTIR measurement. The weighted average is calculated based on Eq. 1 using the inverse-squared retrieval standard deviations as weights.
$$\bar{X} = \frac{\sum_i X_i / X_{i,\sigma}^2}{\sum_i 1 / X_{i,\sigma}^2} \qquad (1)$$

where $X_i$ is each MOPITT measurement with a corresponding standard deviation of $X_{i,\sigma}$. This is done because the variability in Eureka's geography can influence retrievals. Some retrievals have large uncertainties, and the weighted average reduces the effect of these outlier retrievals. We also calculated the weighted average of the MOPITT AKs and a priori profiles in this same manner. The MOPITT VMR values are reported at the bottom of each layer, whereas the FTIR VMR values are assigned to the middle of each FTIR layer. In addition, the FTIR retrieval grids are finer than the MOPITT retrieval grid. Therefore, it is necessary to re-grid the FTIR measurements. To do this, we used an approximation method similar to that presented in Buchholz2017, interpolating the FTIR profiles on a log-pressure grid to an ultrafine grid of 500 grid points per MOPITT layer (rather than the 100 grid points used in Buchholz2017). Then the FTIR profiles were averaged over the same pressure range as the MOPITT retrieval levels. Next, we examined the surface pressure difference between the FTIR and MOPITT measurements arising from topography. Buchholz2017 noted that the significant surface altitude/pressure differences found between measurements from MOPITT and those from NDACC stations at high altitudes or with highly variable terrain, like Eureka, can create additional biases in the total column comparison. Therefore, it is necessary to consider the difference in surface pressure in order to compare total column values over the same air mass range. We adjusted the surface pressure using the method explained in Buchholz2017, which is based on the method of Kerzenmacher et al. (2012).
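The two preparation steps described above, the inverse-variance weighted mean and the re-gridding of a fine FTIR profile onto coarser layers, might be sketched as follows. All function names are hypothetical. The sketch assumes linear interpolation in log-pressure, a simple unweighted mean of the fine-grid values within each layer, and holds the end values constant outside the FTIR pressure range; the study's exact averaging may differ.

```python
import math

def weighted_mean(values, sigmas):
    """Inverse-variance weighted mean (weights = 1 / sigma**2)."""
    weights = [1.0 / s ** 2 for s in sigmas]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)

def interp_log_pressure(p, p_nodes, v_nodes):
    """Interpolate linearly in log(pressure); p_nodes must decrease (surface first).

    Outside the node range the nearest end value is returned.
    """
    if p >= p_nodes[0]:
        return v_nodes[0]
    if p <= p_nodes[-1]:
        return v_nodes[-1]
    for k in range(len(p_nodes) - 1):
        if p_nodes[k] >= p >= p_nodes[k + 1]:
            span = math.log(p_nodes[k]) - math.log(p_nodes[k + 1])
            t = (math.log(p_nodes[k]) - math.log(p)) / span
            return v_nodes[k] + t * (v_nodes[k + 1] - v_nodes[k])

def regrid_to_layers(p_ftir, vmr_ftir, layer_edges, n_fine=500):
    """Average an FTIR profile over coarse layers bounded by layer_edges (hPa, decreasing)."""
    layer_means = []
    for p_bottom, p_top in zip(layer_edges[:-1], layer_edges[1:]):
        log_b, log_t = math.log(p_bottom), math.log(p_top)
        step = (log_t - log_b) / n_fine
        # Midpoints of n_fine equal sub-intervals in log-pressure.
        fine = [math.exp(log_b + (k + 0.5) * step) for k in range(n_fine)]
        total = sum(interp_log_pressure(p, p_ftir, vmr_ftir) for p in fine)
        layer_means.append(total / n_fine)
    return layer_means

# Invented values: the retrieval with the larger uncertainty pulls the mean less.
print(weighted_mean([100.0, 200.0], [1.0, 3.0]))
# Invented FTIR profile (hPa, ppb) averaged over two coarse layers.
print(regrid_to_layers([1000.0, 500.0, 250.0, 100.0], [120.0, 100.0, 80.0, 60.0],
                       [900.0, 700.0, 300.0]))
```

In the weighted-mean example the second value carries one ninth of the weight of the first, which is how retrievals with large uncertainties are down-weighted.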
Two scenarios are possible in these surface adjustments. First, the surface pressure at the FTIR site is smaller than that of MOPITT; in our case this is the most likely scenario because of the altitude of the PEARL Ridge Laboratory. In this case, the gap between the FTIR surface and the MOPITT surface was filled with the FTIR a priori profile. If the difference between the MOPITT and FTIR surface pressures was greater than a critical value of 80 hPa, we eliminated the MOPITT profile from the comparison.
Buchholz2017 used 50 hPa as the critical value, and we found that this limited the number of MOPITT profiles in the comparisons.
The second scenario is when the surface pressure at the FTIR site is larger than that of MOPITT. The fine-grid layers below the MOPITT surface pressure level are then eliminated.
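The two surface-pressure scenarios can be summarized in a short sketch. This is an illustrative reading of the description above, not the procedure of Buchholz2017 or Kerzenmacher et al. (2012): the function name `match_surface`, the list-of-tuples profile layout, and the way a priori values are supplied below the FTIR surface are all assumptions.

```python
def match_surface(ftir_levels, apriori_below, p_surface_mopitt, max_gap_hpa=80.0):
    """Align a fine-grid FTIR profile with the MOPITT surface pressure.

    ftir_levels: list of (pressure_hPa, vmr), ordered from the FTIR surface upward.
    apriori_below: list of (pressure_hPa, vmr) FTIR a priori values at pressures
        higher than the FTIR surface pressure (hypothetical input layout).
    Returns the adjusted profile, or None if the comparison should be discarded.
    """
    p_surface_ftir = ftir_levels[0][0]

    if p_surface_ftir < p_surface_mopitt:
        # Scenario 1: the FTIR site sits above the MOPITT surface.
        if p_surface_mopitt - p_surface_ftir > max_gap_hpa:
            return None  # gap exceeds the critical value
        fill = [(p, v) for p, v in apriori_below
                if p_surface_ftir < p <= p_surface_mopitt]
        fill.sort(reverse=True)  # highest pressure (lowest altitude) first
        return fill + list(ftir_levels)

    # Scenario 2: drop fine-grid levels below the MOPITT surface.
    return [(p, v) for p, v in ftir_levels if p <= p_surface_mopitt]

# Invented profile and a priori values (hPa, ppb).
ftir = [(950.0, 100.0), (900.0, 95.0), (800.0, 90.0)]
apriori = [(1010.0, 120.0), (990.0, 115.0), (970.0, 110.0)]
print(match_surface(ftir, apriori, 1000.0))
```

Raising `max_gap_hpa` from 50 to 80 hPa, as done in this study, trades a larger a priori contribution near the surface for more retained MOPITT profiles.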
The next step was to smooth the FTIR retrievals with the MOPITT AKs since the FTIR retrievals have larger DOFS.
Buchholz2017 (NDACC data) and Hedelius2019 (TCCON data) used different techniques to compare their data with the MOPITT data. In order to maintain consistency when comparing our results, we used similar techniques, which are briefly presented here. V8 MOPITT retrievals provide both the total column averaging kernel, based on the method of Rodgers (2000), and the AK matrix, which shows the sensitivity of the retrieved total column to perturbations at each level of the MOPITT CO profile (Deeter et al., 2019).
To compare with the NDACC measurements, we used the MOPITT total column averaging kernel vector ($a_M$) to smooth the FTIR NDACC profiles ($x_{NDACC}$) using Eq. 2:

$$\hat{C}_{NDACC} = C_{Ma} + a_M \left(\log x_{NDACC} - \log x_{Ma}\right) \qquad (2)$$
where $C_{Ma}$ is the a priori total column value corresponding to the MOPITT a priori profile ($x_{Ma}$). For V8, as for other versions, the MOPITT retrieval is performed in log space, and the averaging kernel matrix and $a_M$, which is the vector of derivatives of CO partial column values with respect to perturbations in log(VMR), must be applied to the log of the MOPITT profiles. The general relation for the dry-air mole fraction, called $X_{CO}$, is based on the ratio of the CO total column ($C_{CO}$) to the total column of dry air ($C_{dryair}$) (Kiel et al., 2016). The MOPITT total column CO retrievals are in units of molec cm⁻² and the MOPITT data product contains a model dry-air column. Using these, we can calculate the dry-air mole fraction in units of parts per billion (ppb) using Eq. 4:

$$X_{CO} = \frac{C_{CO}}{C_{dryair}} \times 10^{9} \qquad (4)$$

To compare the NDACC results with MOPITT in units of ppb, the result of Eq. 2 must be converted to ppb. For this, $C_{dryair}$ for the NDACC data can be calculated using parameters provided in the NDACC data files, such as the surface pressure ($P_0$), the gravitational acceleration ($g$), and the total column of water vapour ($C_{H_2O}$), with Eq. 5:

$$C_{dryair} = \frac{P_0}{g\, m_{dryair}} - C_{H_2O}\,\frac{m_{H_2O}}{m_{dryair}} \qquad (5)$$

where $m_{dryair} = 28.964 \times 10^{-3}/N_A$ kg/molecule is the molecular mass of dry air, $m_{H_2O} = 18.02 \times 10^{-3}/N_A$ kg/molecule is the molecular mass of water vapour, and $N_A$ is Avogadro's constant.
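The chain from an FTIR profile to a smoothed XCO value in ppb can be sketched as below. Several points are assumptions of this sketch rather than statements from the text: the logarithm is taken as base 10, the surface pressure is in Pa, columns are in molec cm⁻², and the function names and example numbers are invented. The dry-air column uses the standard form built from the quantities defined above (surface pressure, gravitational acceleration, water vapour column, and the two molecular masses).

```python
import math

AVOGADRO = 6.02214076e23          # molecules per mole
M_DRYAIR = 28.964e-3 / AVOGADRO   # kg per molecule of dry air
M_H2O = 18.02e-3 / AVOGADRO       # kg per molecule of water vapour

def smooth_column(c_apriori, ak_column, x_ftir, x_apriori):
    """Smooth a regridded FTIR profile with a total column averaging kernel.

    The kernel acts on differences in log10(VMR); base 10 is an assumption here.
    """
    delta = sum(a * (math.log10(xf) - math.log10(xa))
                for a, xf, xa in zip(ak_column, x_ftir, x_apriori))
    return c_apriori + delta

def dry_air_column(p_surface_pa, g, c_h2o):
    """Dry-air column in molec cm^-2 from surface pressure (Pa), g (m s^-2),
    and the water vapour column (molec cm^-2)."""
    total_air = p_surface_pa / (g * M_DRYAIR) / 1.0e4  # molec m^-2 -> molec cm^-2
    return total_air - c_h2o * M_H2O / M_DRYAIR

def xco_ppb(c_co, c_dryair):
    """Column-averaged dry-air mole fraction in ppb."""
    return 1.0e9 * c_co / c_dryair

# Invented two-level example: FTIR profile equal to the a priori profile,
# so the smoothed column equals the a priori column.
column = smooth_column(2.0e18, [1.0e17, 2.0e17], [100.0, 50.0], [100.0, 50.0])
dry = dry_air_column(101325.0, 9.81, 1.0e22)
print(column, dry, xco_ppb(column, dry))
```

A quick sanity check on the dry-air column is that a sea-level surface pressure should give a value of order 2 × 10²⁵ molec cm⁻².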
Hedelius2019 described different methods for comparing MOPITT and TCCON measurements. We used their recommended method II for our comparison with TCCON data, which is also the method presented in Wunch et al. (2011b). Method IV in Hedelius2019 is similar to the method we used for the NDACC data as described above. According to Wunch et al. (2011a), $x_T$ is the a priori profile ($x_{aT}$) scaled by a scaling factor ($\gamma$), which is calculated as

$$\gamma = \frac{C_T}{C_{aT}}$$

where $C_T$ and $C_{aT}$ are the total column dry-air mole fraction and its a priori value, respectively. For the MOPITT validation with TCCON measurements, we compare $C_T$ smoothed in Eq. 7 with $C_M$ in Eq. 8.
where $C_M$ is the MOPITT total column CO value corresponding to $x_M$. TCCON retrieves the total column of O2 ($C_{O_2}$), and $C_{dryair}$ can be calculated through Eq. 9 assuming the dry-air mole fraction of O2 is 0.2095:

$$C_{dryair} = \frac{C_{O_2}}{0.2095} \qquad (9)$$

In the next section, the MOPITT column-averaged dry-air mole fraction XCO for each of the four MOPITT pixels and each product, separated over land and water, is compared with the NDACC and TCCON total column values. There are 27 com- [...] The same number of comparisons is conducted between the MOPITT and TCCON measurements. The 27 combinations comprise the four pixels over land and water (eight total), all pixel measurements combined ("Pixel all") over land and water separately (two total), and all pixel measurements combined over land and water together (one total), each repeated for the three products (NIR, TIR, and joint TIR-NIR). The NIR channel does not measure over water; therefore six comparisons are not made using the NIR channel, which reduces the number of comparisons from 33 to 27.
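The count of 27 comparisons can be checked by enumerating the combinations described above. The labels and the function name `comparison_cases` are hypothetical; only the counting logic comes from the text.

```python
from itertools import product

PRODUCTS = ("NIR", "TIR", "Joint TIR-NIR")
PIXELS = ("1", "2", "3", "4", "all")
SURFACES = ("land", "water")

def comparison_cases():
    """List every (product, pixel, surface) comparison that is actually made."""
    cases = []
    for prod in PRODUCTS:
        # Four pixels plus "all", each over land and over water separately.
        for pixel, surface in product(PIXELS, SURFACES):
            cases.append((prod, pixel, surface))
        # All pixels combined over land and water together.
        cases.append((prod, "all", "land+water"))
    # The NIR channel has no retrievals over water, so drop every NIR case
    # that involves water (five water-only cases plus the land+water case).
    return [c for c in cases if not (c[0] == "NIR" and "water" in c[2])]

cases = comparison_cases()
print(len(cases))
```

Each product starts with 11 cases (33 in total), and removing the six NIR cases involving water leaves 27.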

Comparison approach
As described in Sect. 5.2, we compare 27 combinations for each FTIR measurement. To help visualize the results, we have used a Taylor diagram (Rochford, 2020) to summarize the comparisons between the MOPITT measurements and the FTIR measurements. The Taylor diagram is useful for evaluating and assessing multiple properties of the comparisons for each MOPITT pixel, and it has been used in different fields including atmospheric science (e.g., Hegglin et al., 2010; Sharma et al., 2017). In the Taylor diagram, MOPITT measurements are normalized to the FTIR measurements to show how well each MOPITT measurement agrees with the NDACC or TCCON measurements. The Taylor diagram quantifies the relationship between each MOPITT and FTIR dataset in terms of the Pearson correlation coefficient ($R$), the centered root-mean-square difference (CRMSE), and their standard deviations. Taylor (2001) found that there is a geometric connection between these parameters. The CRMSE is the mean-removed RMS difference and is calculated using Eq. 10 (in units of ppb$^2$), where $M_i$ and $F_i$ represent the MOPITT and FTIR measurements, respectively, $N$ is the total number of measurements, and $\bar{M}$ and $\bar{F}$ are the means of each dataset. These means are used to calculate the percent difference bias (Eq. 11). The relationship between the CRMSE, the standard deviations of the MOPITT ($\sigma_M$) and FTIR ($\sigma_F$) data, and the correlation coefficient is given in Eq. 12 and shown geometrically in Figure 1 of Taylor (2001), based on the law of cosines. Each point in a Taylor diagram can be characterized by a phase and an amplitude, which need to be determined. The correlation coefficient and CRMSE are the quantities that measure how well measurements from each MOPITT pixel agree with the FTIR measurements in phase and amplitude, respectively.
The correlation coefficient provides complementary statistical information quantifying the correspondence between the measurements associated with each MOPITT pixel and the FTIR measurements. These various quantities can be plotted on a polar graph, where the radial distance from the origin ($r$) is the standard deviation of the MOPITT measurements. The azimuthal angle on the polar graph is set by the correlation between the MOPITT measurements and the FTIR measurements, $\theta = \arccos(R)$ (Kärnä and Baptista, 2016). The CRMSE is then the distance from the position of a pixel data point that matches the FTIR measurements exactly ($r = \sigma_F$, $\theta = 0$). As suggested in Taylor (2001), the statistics can be normalized by the standard deviation of the FTIR measurements. Equation 10 becomes dimensionless if we divide both sides of the equation by $\sigma_F^2$. This new graph is called the modified Taylor diagram, and in the normalized graph the perfect MOPITT measurement would be positioned at ($r = 1$, $\theta = 0$). The advantage of this modified Taylor diagram is that we can compare different pixels with different standard deviations; its disadvantage is that it is based on centered measurements ($M_i - \bar{M}$) and therefore does not show any pixel bias. There are a couple of approaches, such as those of Elvidge et al. (2014) and Kärnä and Baptista (2016), which take this issue into account. We follow the approach suggested by the latter paper, in which the root-mean-square error (RMSE, Eq. 13) is normalized by $\sigma_F^2$ and called the normalized RMSE (Eq. 14). The normalized RMSE is always non-negative, and smaller values indicate better agreement between the MOPITT dataset for a given pixel and the comparison dataset. Based on Eq. 13, the normalized RMSE will be zero for a pixel with measurements identical to the FTIR measurements, and will be 1 for a pixel whose measurements are all equal to the mean of the FTIR measurements ($M_i = \bar{F}$).
Normalized RMSE is sensitive to outliers and values greater than 1 represent poor agreement between the MOPITT data and the FTIR data. The R (Pearson correlation coefficient) values in the Taylor diagrams were calculated using ordinary least squares.
One point that should be considered is that $\sigma_F$ is the standard deviation of the FTIR measurements and is not calculated from the random or systematic uncertainties reported by each instrument. The FTIR standard deviations ($\sigma_F$) for the NDACC and TCCON measurements are calculated from the results of Eq. 2 and Eq. 7, respectively. In the comparison with NDACC measurements, we calculate $\sigma_M$ using the standard deviation of the MOPITT measurements. In the comparison with TCCON measurements, we use the standard deviation of the result of Eq. 8 to calculate $\sigma_M$.
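The statistics behind the modified Taylor diagram can be sketched in a few lines of Python with NumPy. This is a minimal illustration, not the SkillMetrics implementation used for the figures. Three choices are assumptions made here: population standard deviations (so that the law-of-cosines relation holds exactly), a percent bias taken relative to the FTIR mean, and an RMSE normalized by $\sigma_F$ so that the result is dimensionless and equals 1 when every $M_i$ equals $\bar{F}$.

```python
import numpy as np


def normalized_rmse(mopitt, ftir):
    """RMSE divided by the FTIR standard deviation (assumed normalization)."""
    m = np.asarray(mopitt, dtype=float)
    f = np.asarray(ftir, dtype=float)
    return np.sqrt(np.mean((m - f) ** 2)) / f.std()


def taylor_statistics(mopitt, ftir):
    """Normalized Taylor-diagram statistics for paired measurements."""
    m = np.asarray(mopitt, dtype=float)
    f = np.asarray(ftir, dtype=float)
    sigma_m, sigma_f = m.std(), f.std()      # population standard deviations
    r = np.corrcoef(m, f)[0, 1]              # Pearson correlation coefficient
    # Centered (mean-removed) root-mean-square difference.
    crmse = np.sqrt(np.mean(((m - m.mean()) - (f - f.mean())) ** 2))
    return {
        "R": r,
        "theta": np.arccos(np.clip(r, -1.0, 1.0)),   # azimuthal angle (rad)
        "NSTD": sigma_m / sigma_f,                   # radial coordinate
        "NCRMSE": crmse / sigma_f,                   # distance to (1, 0)
        "NRMSE": normalized_rmse(m, f),
        # Percent bias relative to the FTIR mean (assumed definition).
        "bias_percent": 100.0 * (m.mean() - f.mean()) / f.mean(),
    }


if __name__ == "__main__":
    ftir = [80.0, 90.0, 100.0, 110.0, 120.0]      # made-up values in ppb
    mopitt = [85.0, 92.0, 108.0, 111.0, 130.0]    # made-up values in ppb
    for name, value in taylor_statistics(mopitt, ftir).items():
        print(f"{name}: {value:.3f}")
```

With these definitions the normalized quantities satisfy NCRMSE$^2$ = NSTD$^2$ + 1 − 2·NSTD·$R$, which is the geometric relation that places each pixel on the diagram.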
The drift of the MOPITT-NDACC and MOPITT-TCCON biases for each pixel was calculated by conducting a linear fit with respect to time. A bi-square weighted robust fitting method was used to perform the linear fit (Holland and Welsch, 1977) and the significance of the drifts was computed using a Student's t-test. The advantage of this fitting method over ordinary least squares fitting is that it is less sensitive to data gaps and outliers, and it has been used in other studies such as Adams et al.
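A bi-square weighted robust linear fit can be sketched as iteratively reweighted least squares. The version below is a simplified illustration, not the code used in this study: the tuning constant of 4.685 and the scale estimate from the median absolute deviation are conventional choices assumed here, the leverage correction used by some implementations is omitted, and the Student's t-test for significance is not included. The function name and the sample data are hypothetical.

```python
import numpy as np


def bisquare_linear_fit(t, y, tune=4.685, n_iter=50, tol=1e-10):
    """Fit y = b0 + b1 * t with Tukey bi-square weights; returns (b0, b1)."""
    t = np.asarray(t, dtype=float)
    y = np.asarray(y, dtype=float)
    X = np.column_stack([np.ones_like(t), t])
    # Start from the ordinary least-squares solution.
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    for _ in range(n_iter):
        resid = y - X @ beta
        # Robust scale from the median absolute deviation of the residuals.
        scale = np.median(np.abs(resid - np.median(resid))) / 0.6745
        if scale <= np.finfo(float).eps * max(1.0, np.abs(y).max()):
            break                      # residuals are essentially zero
        u = resid / (tune * scale)
        # Bi-square weights: points with |u| >= 1 receive zero weight.
        w = np.where(np.abs(u) < 1.0, (1.0 - u ** 2) ** 2, 0.0)
        sw = np.sqrt(w)
        new_beta = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)[0]
        converged = np.allclose(new_beta, beta, rtol=0.0, atol=tol)
        beta = new_beta
        if converged:
            break
    return beta[0], beta[1]


if __name__ == "__main__":
    years = np.arange(14.0)            # e.g. years since the first comparison
    bias = 2.0 - 0.5 * years           # made-up bias series in percent
    bias[5] += 50.0                    # one gross outlier
    b0, drift = bisquare_linear_fit(years, bias)
    print(f"robust drift: {drift:.3f} % per year")
    print(f"OLS drift:    {np.polyfit(years, bias, 1)[0]:.3f} % per year")
```

In this made-up series the outlier pulls the ordinary least-squares slope away from the underlying trend, while the bi-square weights suppress it.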

Comparison with NDACC
The results of the comparisons between the MOPITT and NDACC column CO measurements are separated into measurements over land, water, or both, and are shown in Fig. 11. In Fig. 11, column (a) shows the results for the MOPITT NIR product, the reference point (within 0.4 on CRMSE radial axes) versus pixels over water with values closer to 0.6. However, pixel 2 over land is the exception and its CRMSE and correlation values are worse than those of the other pixels over land. It is noted that the correlation coefficients and CRMSE for the TIR and joint TIR-NIR products for all pixels combined (both land and water, shown with the pink star) are almost identical to those for all pixels combined over land (shown as the purple square). The results for the joint TIR-NIR product illustrate that the result for all pixels combined over water is close to that for all pixels combined over land. The best pixels for correlation coefficients for the joint product are the combined pixels over water and land (pink star) as well as pixel 3 over land (green circle). The overall NSTDs of the joint TIR-NIR pixels are larger than the NSTDs of the TIR measurements and they have slightly smaller correlation coefficients. Pixel 2 over water (pink triangle) has the smallest correlation coefficient in both the TIR and joint TIR-NIR products. Generally, the correlation coefficients found for pixels over land are higher than those for pixels over water, which could be because of the higher thermal contrast over land than over water. Also, the correlation coefficients of the TIR products are larger than those of the joint TIR-NIR products, and both are larger than those of the NIR products. Similarly, the CRMSE values for the pixels over land are smaller than those for pixels over water and they are closer to the reference point. In addition, the NSTD values of pixels over land are smaller than those over water, with a few exceptions such as pixel 1 over land in the joint TIR-NIR products.
The second row of Fig. 11 illustrates the average bias versus RMSE for each pixel in Fig. 11. The error bars are the standard deviations of the bias values. The bias for the NIR shows that, on average, pixel 3 measurements over land have the smallest bias among all pixels over land; however, pixel 2's average bias is close to that of pixel 3. Pixels 1 and 4 have the largest bias and their normalized RMSE values are larger than 1. Also, all the pixels have a positive bias; therefore, the MOPITT NIR measurements are generally larger than the NDACC measurements, on average by roughly 5%. All pixel biases for the TIR product are negative. The measurements for pixels over land have a lower bias and normalized RMSE than the pixel measurements over water. Pixels 1 and 2 over land have smaller biases than the others, followed by the combined pixel measurements over land and the combined pixel measurements over land and water. The joint TIR-NIR product biases are split, with generally positive biases for pixels over land and negative biases over water, with the exception of pixel 1 over water, which is close to zero and positive. The smallest bias and normalized RMSE are found for pixel 3 over land, followed by pixel 1 over water. The bias of pixel 1 over land is much larger than that of all the other pixels. Overall, the average pixel biases for all products agree within their standard deviations. Broadly, pixel biases over land are smaller than over water for the TIR products, and they are comparable in the joint TIR-NIR products except for pixel 1 over land. Pixels 1 and 4 show large biases for the NIR products and their normalized RMSE is above 1.
The same overall pattern as found for the bias can be seen in the drift, in the third row of Fig. 11. Pixels with drifts significant at the 95% level are labeled with an asterisk (*) in Fig. 11. The drift uncertainties of pixels over water are greater than those for the pixels over land.
Overall, the MOPITT NIR products show poorer performance than the TIR and joint TIR-NIR products. Pixel 1 over land shows a larger bias than the other pixels for the NIR and joint TIR-NIR products.

Figure 12 shows the comparison between MOPITT and TCCON measurements and is identical in format to Fig. 11. Here, we investigate each row of panels in the same order as above. In the modified Taylor diagrams, the correlation coefficient for all pixels for all products is between 0.8 and 0.95. The NSTD values for the NIR product comparisons are around 1.6, except for pixel 1, which is around 1.8. The NSTD values are between 1.5 and 1.8 for the TIR product and these increase to between 1.8 and 2.3 for the joint TIR-NIR product, with a higher NSTD value of 2.5 for pixel 1 over land.

The normalized CRMSE for all pixels in the NIR product is around the 0.8 contour, with the largest value of 1.1 for pixel 1 over land. For the TIR product, the normalized CRMSE increases to values between 0.8 and 1, with the smallest value of 0.7 for pixel 4 over land. The values increase significantly for the joint TIR-NIR product to between 1.1 and 1.5, with the largest value of 1.7 for pixel 1 over land. The normalized CRMSE values for the joint TIR-NIR products are greater than those for the NIR and TIR products, which have a similar performance. In the Taylor diagram, results for almost all of the pixels tend to cluster in the same area, except for pixel 1 over land for the NIR and joint TIR-NIR products and for pixel 4 over land for the TIR product.
In the NIR, pixel 3 has the smallest average pixel bias and pixel 1 has the largest. The normalized RMSE is around 1 for all pixels except for pixels 1 and 2, which have a larger bias than the others, with values close to 1.8 and 1.4, respectively. The TIR bias in the middle row shows that all pixel measurements over land cluster around zero percent bias and all pixels over water cluster around −3%. The normalized RMSE of all pixels over water is around 1, whereas the values for the pixels over land are less than 1. The joint TIR-NIR bias illustrates that all pixels have a normalized RMSE above 1. Pixel 1 has the highest bias, at around 9%, in the joint TIR-NIR product. For all comparisons, the pixel biases fall within each other's standard deviations due to the large scatter in the biases.
The drift values for the pixels over land for the NIR product are between −1.3 and −1.9 % year$^{-1}$. Overall, the magnitude of the drift for all pixels for the NIR product is smaller than for the other products; however, the TIR drift values for pixels over land are similar to those for the NIR product. For the TIR product, most of the pixels' drifts over land and water vary between −1.0 and −2.0 % year$^{-1}$, except for pixel 1 over water with −2.8 % year$^{-1}$. For the joint TIR-NIR products, the drift tends to be worse than the drifts of the NIR and TIR products. The drifts for the joint TIR-NIR products are approximately twice as large, spanning −2.5 to −4.5 % year$^{-1}$, except for pixel 3 over water, which has a smaller value than the others (−1.5 % year$^{-1}$).

Figure 11. The normalized Taylor diagram (top row), the normalized RMSE versus average percent bias (middle row), and the drift (bottom row) for all MOPITT pixel measurements compared to NDACC. In the modified Taylor diagram, the normalized standard deviation (NSTD) is on the radial axis, the correlation coefficient is on the angular coordinate, and the black dashed lines show the normalized CRMSE with respect to NDACC as the reference point. Column (a) shows the results for the NIR product, column (b) for the TIR product, and column (c) for the joint TIR-NIR product. In row 2, the horizontal bars represent the 1-sigma standard deviation of the biases, and in row 3, the vertical bars are the drift fit uncertainties (1-sigma). In row 3, the asterisks (*) on the x-axis labels indicate drifts with significance levels of 95% or greater. The MATLAB SkillMetrics toolbox (https://github.com/PeterRochford/SkillMetricsToolbox, last retrieved September 1, 2019) was used to create the Taylor diagrams.

Comparison between NDACC and TCCON
The results in Fig. 11 and Fig. 12 show that the correlation coefficients between the MOPITT and TCCON measurements (0.8–0.95) are larger than those found between the MOPITT and NDACC measurements (0.7–0.8) for the NIR product. Similar results with respect to NDACC and TCCON are seen for the TIR product (0.8–0.95) and for the joint TIR-NIR product (0.8–0.9).
The NSTD in the comparison with NDACC measurements is generally between 1.0 and 1.2 for all pixels and all products.
However, the NSTD values in the comparisons with TCCON measurements increase from the NIR product (around 1.6) to the TIR product (between 1.6 and 1.8) and to the joint TIR-NIR product (between 1.8 and 2.5).

The pixels' bias and drift results reveal more information in the comparison between Fig. 11 and Fig. 12 (2019). In this section, their results are compared with this study. Buchholz et al. (2017) compared MOPITT V6 data from all three retrieval products with 14 ground-based NDACC FTIR sites from around the globe. They found that, overall, pixel 1 has the largest bias and the smallest correlation coefficient among the pixels, and pixel 3 has the largest R value for all products over land. In our study, pixel 1 has the lowest correlation coefficient in the NIR and joint TIR-NIR products (see Fig. 11) as well as high bias values. However, we find that pixel 4 has correlation and bias values in the NIR that are comparable to those of pixel 1. Unlike in the NIR and joint TIR-NIR products, pixel 1 has a large correlation coefficient with low bias in the TIR. We also find that pixel 3 has the largest correlation coefficient and smallest bias only in the joint TIR-NIR product. We should consider that the overall results presented in Buchholz et al. (2017) are the average for all sites. The results for the Eureka FTIR and MOPITT   Table 3. Almost all of the NOAA measurements are over land; therefore, for this comparison we used only MOPITT pixels over land.
As shown in Table 3, the bias values for each of the MOPITT products compared with all NOAA measurement sites are smaller for V8 than for V7 (although with the opposite sign for the NIR). For the NIR and joint TIR-NIR V8 products, the biases increase poleward when considering all NOAA sites, the two northern sites, and our Eureka results. The TIR bias has a different pattern from the other two products. For the MOPITT V8 TIR product, the bias relative to all NOAA sites is equal to that for the two northern NOAA sites. However, for Eureka, the TIR bias for comparisons with NDACC is negative and its magnitude is larger than that versus all NOAA sites. In contrast, the bias from the TIR-TCCON comparisons is smaller than that for all NOAA sites and has the same sign. However, the standard deviations of all these biases are large and they agree within their combined uncertainties.

This study and others have investigated the MOPITT pixel biases. Deeter et al. (2015), using MOPITT V6 data, found that pixel 3 has the largest instrumental noise. Buchholz et al. (2017) showed that pixel 1 has the largest positive bias globally in the MOPITT V6 data. Finally, Hedelius et al. (2019) showed that pixel 1 has the largest negative bias and that biases increase poleward in MOPITT V7. Our results for MOPITT V8 show that pixel 1 has the largest bias among all four pixels over the Arctic, which agrees with Hedelius et al. (2019). Our monthly pixel bias investigation (Fig. 1) reveals that there is a bias in the summer months in all pixels. Figure 2 illustrates that the bias in those months is likely due to the mixture of ice and water over the ocean and patchy snow over the land. Another result of the monthly pixel bias investigation is that pixel 1 measurements over land have a large systematic bias that could induce bias in multi-pixel averages for V8. The pixel 1 bias over water is similar to the biases of the other pixels. We can conclude that there is a systematic bias in pixel 1 over land.
Pixel 3 also has a systematic positive bias over land in the spring months.

We compared the CO profile and total column averaging kernels for the MOPITT and FTIR retrievals as they have different vertical resolutions. We also analyzed the DOFS for the three MOPITT products and the NDACC measurements. The MOPITT column AKs over water are greater in the mid-troposphere than those over land, especially for the TIR products. This is because the thermal contrast is smaller over water and there is no contribution from the lower troposphere. The TCCON and NDACC AKs showed more sensitivity to changes in the upper troposphere and above; however, MOPITT retrievals are typically more sensitive to the mid-troposphere. The MOPITT TIR product is more sensitive to the mid-troposphere and the joint TIR-NIR product is more sensitive to the mid- and lower troposphere.
After accounting for the difference in averaging kernels between the instruments, we compared the MOPITT CO measurements to the NDACC and TCCON retrievals by separating the MOPITT results by pixel, land type, and data product. Before running the comparisons, we applied a filter to the MOPITT data to reduce the effects of outliers. In order to simplify and visualize the comparisons between different combinations of pixels and land types, we used modified Taylor diagrams as well as plots of bias and drift to evaluate the biases, uncertainties, and correlation coefficients between MOPITT V8 and the Eureka FTIR measurements.
Our results show that there is good consistency between the MOPITT-NDACC and MOPITT-TCCON CO comparisons. The comparisons of the MOPITT V8 measurements with NDACC and TCCON show that the bias values are generally positive for the NIR and negative for the TIR and joint TIR-NIR products. However, the biases and drifts versus NDACC for the TIR and joint TIR-NIR products are smaller than those versus TCCON. Pixel 1 has the largest bias in the NIR and joint TIR-NIR products in both the TCCON and NDACC comparisons; however, pixel 1 shows good performance in the TIR product comparisons for both NDACC and TCCON. We recommend using only TIR measurements from pixel 1 in the high Arctic. For the TIR product, the bias and drift values are larger over water than over land when compared to both NDACC and TCCON.

However, the bias values are generally less than 5%. The TIR drift values versus TCCON are twice as large as those versus NDACC. In the joint TIR-NIR products, all pixels' biases are positive over land and negative over water for both the NDACC and TCCON comparisons.
Finally, we compared our results with other studies for the three latest versions of MOPITT data both globally and regionally.
There is good consistency between our total column bias comparison for MOPITT V8 vs. NDACC and the MOPITT V6 biases from Buchholz et al. (2017) (Table 2) for the NIR and joint TIR-NIR products. However, this consistency is not seen for the TIR products. There is low thermal contrast in the Arctic region and the DOFS are generally low (Fig. 7). The average TIR DOFS is less than 1 for the Eureka station. Therefore, the contribution of a priori information is high in the retrievals. The difference in our biases in the TIR product comparisons is due to the improvements applied to the MOPITT V8 retrieval relative to V6.
Our MOPITT vs. TCCON comparison is generally consistent with Hedelius et al. (2019). The MOPITT joint TIR-NIR products are greater than TCCON by around 6–8 % globally based on Hedelius et al. (2019), and by 4–9 % (depending on pixel) for the Arctic based on this study. A similar comparison with the total column results of Deeter et al. (2019) revealed that the total column biases are correlated with latitude; larger biases were observed at higher latitudes. We also found consistent bias results between this study and that of Deeter et al. (2019) (Table 3) for the two northern sites around 60° N.
Generally, the DOFS of MOPITT measurements in the Arctic is small (average around 1) because of the low thermal contrast.

Compared to MOPITT V6 and V7, our comparisons in the Canadian high Arctic show that there are significant improvements in MOPITT V8. In addition to the enhancements in the V8 retrievals, using a filter to reduce the effect of outliers in the Arctic region improved our comparisons with ground-based FTIR measurements from Eureka, Nunavut.

Author contributions. AJ gathered the datasets and conducted comparisons between the datasets, created the plots and tables, and wrote the manuscript. KAW advised and supervised the work and provided comments, advice, and editing. RB and DW provided guidance regarding previous validation results and methodologies. MD provided advice and the results in Table 3 for the northern sites. KS, EL, and TW provided the PEARL NDACC retrievals. KS and SR provided the PEARL TCCON data. PF contributed to collecting TCCON and NDACC measurements at PEARL. JRM supported the operations of PEARL and MOPITT. HMW supported the MOPITT data and operation. All coauthors provided comments and contributed to editing the manuscript. EM provided assistance with the TCCON data preparation.
Competing interests. One co-author is a member of the editorial board of this journal. The authors declare that they have no other competing interests.