Validation of tropospheric NO2 column measurements of GOME-2A and OMI using MAX-DOAS and direct sun network observations

Multi-axis differential optical absorption spectroscopy (MAX-DOAS) and direct sun NO2 vertical column network data are used to investigate the accuracy of tropospheric NO2 column measurements of the GOME-2 instrument on the MetOp-A satellite platform and the OMI instrument on Aura. The study is based on 23 MAX-DOAS and 16 direct sun instruments at stations distributed worldwide. A method to quantify and correct for horizontal dilution effects in heterogeneous NO2 field conditions is proposed. After systematic application of this correction to urban sites, satellite measurements are found to present smaller biases compared to ground-based reference data in almost all cases. We investigate the seasonal dependence of the validation results as well as the impact of using different approaches to

cycle (typically one overpass per day for mid-latitudes) and coarse spatial resolution (from a few to several hundreds of kilometers). The accuracy of the different satellite datasets is also of concern, e.g. for trend analysis or diurnal variation studies. Validation activities, which are an essential part of any satellite programme, aim at independently deriving a set of indicators characterizing the quality of the data product. They encompass the monitoring of instrumental stability as well as the inter-sensor consistency needed to ensure continuity between different satellite missions. Satellite validation also contributes to the improvement of retrieval algorithms through investigation of the accuracy of the data products and their sensitivity to retrieval parameter choices. Tropospheric satellite data products depend on various sources of ancillary data, e.g. a-priori vertical distribution of the absorbing and scattering species, surface albedo, information on clouds and aerosols (Boersma et al., 2004; Lin et al., 2015; Lorente et al., 2017; Liu et al., 2019a). In the case of NO2, separation between stratospheric and tropospheric contributions is an additional source of complexity in the retrieval, and there is considerable debate on the importance of the role of free tropospheric (background) NO2 in the retrieval process (Jiang et al., 2018; Silvern et al., 2019). As discussed by Richter et al. (2013), the validation of tropospheric reactive gases (such as NO2, HCHO and SO2) is also challenging because short atmospheric lifetimes, local emission sources and transport can lead to a large variability of their concentrations in time and space (both vertically and horizontally). Active photochemistry and transport processes lead to important diurnal variations (cycles) that need to be considered for validation studies.
MAX-DOAS and direct sun remote-sensing techniques have large potential capacities for the validation of satellite trace gas observations, as they measure all day long and provide accurate measurements of integrated column amounts (i.e. a quantity close to that measured by space-borne instruments).
MAX-DOAS and direct sun measurements also match the horizontal resolution of satellite observations better than e.g. surface in-situ monitoring networks. The spatial averaging of MAX-DOAS measurements has been quantified and shown to range from a few km to tens of km depending on aerosol content and measurement wavelength (Irie et al., 2011, 2012; Wagner et al., 2011; Wang et al., 2014; Gomez et al., 2014; Ortega et al., 2015).

In the last decade, several studies compared different SCIAMACHY, GOME-2 and OMI NO2 data products (generated by both operational and scientific prototype processors) to MAX-DOAS measurements at various stations (e.g., Brinksma et al., 2008; Hains et al., 2010; Vlemmix et al., 2010; Irie et al., 2008; Ma et al., 2013; Lin et al., 2014; Wang et al., 2017b; Drosoglou et al., 2017, 2018; Liu et al., 2019a, b, c). JAMSTEC data from the MADRAS network have been used in Kanaya et al. (2014) for the validation of the OMI DOMINO and NASA tropospheric NO2 data. BIRA-IASB MAX-DOAS stations have been regularly used for the validation of GOME-al. (2012) also reported low OMI NO2 column values over China in summer, when the spatial distribution of NO2 was likely homogeneous.
In the present study, we validate GOME-2A and OMI tropospheric NO2 column measurements using data from a large number of MAX-DOAS and direct sun instruments operating in Europe, Asia, North America and Africa under a wide variety of atmospheric conditions and pollution patterns. Some of these datasets have already been used in the past for tropospheric NO2 validation of different satellites and products. In the present study we combine them in a coordinated way allowing for a global approach to satellite validation, sampling different NO2 levels in various locations around the globe. In addition, the smearing (or dilution) of the NO2 field due to the limited horizontal resolution of satellite measurements is investigated. A method for the quantification and correction of the dilution effect is proposed, and its impact on validation results is quantitatively evaluated. Our validation approach is applied to operational OMI DOMINO and AC SAF GOME-2A products, as well as to climate data record OMI and GOME-2A NO2 data products generated within the EU QA4ECV project.
The paper is structured as follows: Sections 2 and 3 describe the OMI and GOME-2A sensors and data sets as well as the reference ground-based measurements. Section 4 presents the comparison methodology and comparison results are discussed in Section 5. In Section 6, we concentrate on the quantification of horizontal dilution effects in satellite measurements performed around the measurement sites, and we show how these effects impact the validation results in urban conditions. Section 7 addresses seasonal effects and the impact of satellite data selection on the comparison results. Section 8 presents a summary of the validation results, and conclusions are detailed in Section 9.

Satellite tropospheric NO2 datasets
Tropospheric NO2 data products from space-borne sensors are generally retrieved in three main steps: first, a DOAS spectral analysis yields the total column amount of NO2 along the slant optical path; second, the stratospheric NO2 column is estimated and subtracted from the total column to derive the tropospheric contribution (the so-called residual technique); finally, slant column densities (SCD) are converted to vertical column densities (VCD).
The last step is based on air-mass factor (AMF) calculations, which require a-priori knowledge of the NO2 vertical distribution, pressure and temperature, surface albedo, aerosols and information on (effective) cloud cover and height (Boersma et al., 2004). The retrieval of tropospheric NO2 is given by:

$$\mathrm{VCD_{tropo}} = \frac{\mathrm{SCD_{total}} - \mathrm{SCD_{strato}}}{\mathrm{AMF_{tropo}}}$$

2004) which can be separated into implementation differences (when different groups use identical ancillary data for the calculation of tropospheric NO2 AMFs) of about 6%, and structural differences, due to ancillary data selection, which can reach 31-42% (Lorente et al., 2017). The uncertainty in separating the stratospheric and tropospheric columns is about 0.5×10¹⁵ molecules/cm² (Dirksen et al., 2011; Lorente et al., 2017).
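The three retrieval steps described above can be illustrated with a minimal Python sketch. It assumes the stratospheric contribution is removed as a slant column before dividing by the tropospheric AMF; the function name `tropospheric_vcd` and the numerical values are hypothetical and are not taken from any of the satellite products discussed here.

```python
def tropospheric_vcd(scd_total, scd_strato, amf_tropo):
    """Residual technique (sketch): subtract the stratospheric slant column
    from the total slant column, then convert to a vertical column."""
    if amf_tropo <= 0:
        raise ValueError("tropospheric AMF must be positive")
    return (scd_total - scd_strato) / amf_tropo


# Illustrative numbers only (molec/cm^2); not taken from a product file.
example_vcd = tropospheric_vcd(scd_total=1.2e16, scd_strato=6.0e15, amf_tropo=1.5)
print(f"Tropospheric VCD: {example_vcd:.2e} molec/cm^2")
```

Because the tropospheric AMF appears in the denominator, a relative error in the AMF translates into a relative error of similar size in the tropospheric column, which is why the AMF uncertainties quoted above matter so much.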

In the present study, we focus on the ground-based validation of the mid-morning GOME-2A and the early afternoon OMI data. The validation method and step-by-step results are illustrated throughout the manuscript for the GOME-2A GDP (GOME Data Processor) 4.8 NO2 operational data product (Valks et al., 2011) and the OMI DOMINO v2.0 data product (Boersma et al., 2011), while the final validation results and discussion also include results for the GOME-2A and OMI QA4ECV products (Zara et al., 2018). All products are briefly presented in Table 1 and in the following sub-sections.

GOME-2 products
The second Global Ozone Monitoring Instrument (GOME-2) is a nadir-looking UV-visible spectrometer measuring the solar radiation backscattered by the atmosphere and reflected by the Earth and clouds in the 240- which presents the longest data record. The default swath width of the GOME-2A across-track scan is 1920 km, allowing global Earth coverage within 1.5-3 days at the Equator, with a nominal ground pixel size of 80×40 km².
Since 15 July 2013, GOME-2A has been measuring in a reduced swath mode of 960 km, with a ground pixel size of 40×40 km².

Operational products are retrieved from GOME-2 measurements in the framework of the Atmospheric  et al., 2016). Total, tropospheric and stratospheric NO2 columns are operationally retrieved with the GOME Data Processor (GDP) and a description of this algorithm can be found in Valks et al. (2011) and Liu et al. (2019b).
Within the QA4ECV (Quality Assurance for Essential Climate Variables) project, a coherent offline NO2 dataset has been created for GOME, SCIAMACHY, GOME-2A and OMI (Zara et al., 2018; Lorente et al., 2017) and comparisons with this dataset are also included at the end of this study.  Zara et al. (2018), Lorente et al. (2018) and Liu et al. (2019b). Previous GOME-2 validation highlighted the effect of GOME-2 large pixels, and the aerosol shielding effect, leading e.g. to differences of 5% to 25% over China (Ma et al., 2013; Wu et al., 2013; Drosoglou et al., 2018). Liu et al. (2019b) showed possible improvements of the GDP 4.8 product, leading to reduced discrepancies of the satellite-to-ground-based biases of the order of 10% to 25% for several MAX-DOAS stations.

OMI products
OMI (Ozone Monitoring Instrument) is a nadir-viewing imaging spectrometer with a spectral resolution of about 0.5 nm FWHM (Levelt et al., 2006). The light entering the telescope is depolarized using a scrambler and split into two spectral bands: a UV channel (wavelength range 270-380 nm) and a visible channel (wavelength range 350-500 nm). The 114° viewing angle of the telescope corresponds to a 2600 km wide swath on the Earth's surface distributed over 60 cross-track positions, which enables quasi-global coverage in one day. In the nominal global operation mode, the OMI ground pixel size varies from 13×24 km² at true nadir to 28×150 km² at the edges of the swath. OMI is onboard the EOS-Aura satellite, which was launched in July 2004 into a sun-synchronous polar orbit crossing the Equator around 13:30 LT (in ascending node). The radiometric stability of the OMI instrument is exceptionally good (Schenkeveld et al., 2017). However, since June 2007, several rows of the detector have been affected by a signal reduction, the so-called "row anomaly" (http://www.knmi.nl/omi/research/product/rowanomaly-background.php), reducing the usable swath coverage (see .

The DOMINO (Derivation of OMI tropospheric NO2) product is distributed in NRT via the TEMIS (Tropospheric Emission Monitoring Internet Service, http://www.temis.nl) project (Boersma et al., 2011). The offline OMI QA4ECV v1.1 product is very similar to the GOME-2 product, as can be seen in Table 1.
For OMI, the stratospheric separation is performed using a data assimilation scheme based on the TM4 or TM5-MP chemistry transport models. Its uncertainty is estimated to be about 0.2-0.3×10¹⁵ molec/cm² (Boersma et al., 2004; Dirksen et al., 2011). Stratospheric NO2 vertical columns used in our study are derived from assimilated stratospheric slant columns divided by a geometrical air-mass factor, as described in Hendrick et al. (2012). For the OMI QA4ECV dataset, two estimates of the stratospheric column are reported (data assimilation and STREAM), and

OMI DOMINO v2.0 has been widely used in the past, and several validation exercises (Brinksma et al., 2008; Hains et al., 2010; Vlemmix et al., 2010; Irie et al., 2008, 2012; Lin et al., 2014; Wang et al., 2017b; Drosoglou et al., 2017, 2018; Liu et al., 2019a) found underestimation of the OMI tropospheric NO2 columns in urban conditions and a better agreement in background locations (Celarier et al., 2008; Halla et al., 2011; Kanaya et al., 2014). Kanaya et al. (2014) showed close correlations with MAX-DOAS observations at 7 stations, but found low biases up to ~50%. Regarding the OMI QA4ECV product,  reported a first validation at the Tai'an station (China) in one summer month finding good agreement (bias of -2%) with respect to MAX-DOAS NO2 columns (better than the agreement found for DOMINO v2, with a bias of -11%). Liu et al. (2019a) investigated the impact of correcting for aerosol vertical profiles in the OMI data, and compared four OMI datasets (POMINO and POMINO v1.1, DOMINO v2.0 and QA4ECV) with respect to data of three Chinese stations. Results suggested a significant improvement of the OMI NO2 retrieval when correcting for aerosol profiles, in general and for hazy days. This is consistent with the previous finding that the accuracy of DOMINO v2.0 is reduced for polluted, aerosol-loaded scenes (Boersma et al., 2011; Kanaya et al., 2014; Lin et al., 2014; Chimot et al., 2016 molec/cm²) with respect to 10 MAX-DOAS instruments, a feature also found for the OMI OMNO2 standard data product.
They also found that the tropospheric VCD discrepancies between satellite and ground-based data exceed the combined measurement uncertainties and that, depending on the site, this discrepancy could be attributed to a combination of comparison errors (horizontal smoothing difference error, error related to clouds and aerosols and differences due to a priori profile assumptions).

MAX-DOAS technique
A MAX-DOAS instrument measures the scattered sunlight under a sequence of viewing elevation angles extending from the horizon to the zenith (Fig. 1a). At low elevation angles, the observed sunlight travels a long path in the lower troposphere (under aerosol-free conditions, the lower the elevation angle, the longer the path) while all observations have approximately the same light path in the stratosphere, independently of viewing elevation. By taking the difference in SCD between off-axis observations and a (nearly) simultaneously acquired zenith reference spectrum (the differential slant column), the stratospheric contribution can therefore be eliminated. Tropospheric absorbers can be measured throughout the day, generally up to a solar zenith angle (SZA) of approximately 85° (Hönninger et al., 2004; Sinreich et al., 2005). Radiance spectra acquired at different elevation angles are analyzed using the DOAS method (Platt and Stutz, 2008), which gives integrated trace gas concentrations along the atmospheric absorption path. The resulting differential slant columns (dSCDs) can be converted to vertical columns and/or vertical profiles using methods of different levels of complexity. Table 2 presents details about the retrieval strategy adopted by the different teams.
They generally belong to one of the following categories:
• Geometrical Approximation (GA): the vertical column is determined under the assumption that a single-scattering approximation can be made for moderately high elevation angles α (typically 30°), so that a simple geometrical air-mass factor (AMFα ≡ SCD/VCD = 1/sin(α)) can be used (Hönninger et al., 2004; Brinksma et al., 2008; Ma et al., 2013).
• QA4ECV datasets: the vertical column is calculated using tropospheric AMFs based on climatological profiles and aerosol situations as developed during the QA4ECV project (http://uvvis.aeronomie.be/groundbased/QA4ECV_MAXDOAS/QA4ECV_MAXDOAS_readme_website.pdf). These data are less sensitive to relative azimuth angle than the purely geometric approximation presented above.
• Vertical profile algorithms based on parameterized profile shape functions: these make use of analytical expressions to represent the trace gas profile using a limited number of parameters (Irie et al., 2008; Li et al., 2010; Vlemmix et al., 2010; Wagner et al., 2011; Beirle et al., 2019).
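As an illustration of the geometrical approximation listed above, the following Python sketch converts an off-axis differential slant column into a vertical column. It assumes that the dSCD is taken relative to a zenith reference, whose geometric AMF is 1, so that VCD = dSCD / (AMFα − 1); this zenith-reference step and the function names are assumptions of the sketch, not details given in the text.

```python
import math


def geometric_amf(elevation_deg):
    """Geometric air-mass factor AMF = 1/sin(alpha) for elevation angle alpha."""
    if not 0.0 < elevation_deg <= 90.0:
        raise ValueError("elevation angle must be in (0, 90] degrees")
    return 1.0 / math.sin(math.radians(elevation_deg))


def vcd_from_dscd(dscd, elevation_deg):
    """VCD from a differential SCD measured relative to zenith (AMF_zenith = 1).

    dSCD = VCD * (AMF_alpha - AMF_zenith)  =>  VCD = dSCD / (AMF_alpha - 1).
    """
    return dscd / (geometric_amf(elevation_deg) - 1.0)


# At 30 degrees elevation the geometric AMF is 2, so the VCD equals the dSCD.
amf_30 = geometric_amf(30.0)
vcd_30 = vcd_from_dscd(1.0e16, 30.0)
```

The choice of 30° is convenient precisely because the differential AMF is then unity; at lower elevation angles the single-scattering assumption becomes less reliable.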
MAX-DOAS profile inversion algorithms use a two-step approach: in the first step, aerosol extinction profiles are retrieved from the measured absorption of the oxygen dimer O4 (Wagner et al., 2004; Friess et al., 2006). In a second step, trace gas profiles are retrieved from the measured trace gas absorptions, taking into account the aerosol extinction profiles retrieved in the first step. Both OEM and parameterized profiling approaches provide vertical profiles of aerosols and NO2 with a sensitivity typically in the 0-4 km altitude range, with generally between 1.5 and 3 independent pieces of information in the vertical dimension (Vlemmix et al., 2015; Friess et al., 2016; Friess et al., 2019). This complementary information on the vertical distribution of gases and aerosols in the atmosphere has been used in some studies to test some key assumptions made in the satellite data retrieval, in particular the a-priori NO2 profile and aerosol content, therefore providing more insight into the quality of the satellite data (e.g., Wang et al., 2017b; Liu et al., 2019b, c; Compernolle et al., 2020). Recent intercomparison studies (Vlemmix et al., 2015; Friess et al., 2019; Tirpitz et al., 2020) show that both OEM and parameterized inversion approaches lead to consistent results in terms of tropospheric vertical column but larger differences in terms of profiles. In this study, every data provider submitted data retrieved with their own tools and formats, without any harmonization.
Our study focuses therefore only on the vertical column, which is the more robust and reliable retrieved quantity.
The time coverage of the different data sets used in this study is presented in Fig. S1.

The accuracy of the MAX-DOAS technique depends on the SCD retrieval noise, the uncertainty of the NO2 absorption cross-sections and most importantly the uncertainty of the tropospheric AMF calculation. The estimated total error on NO2 VCD is of the order of 7-17% in polluted conditions. This includes both random (around 3 to 10% depending on the instruments) and systematic (11 to 14%) contributions (e.g. Irie et al., 2008, 2011, 2012; Wagner et al., 2011; Hendrick et al., 2014; Kanaya et al., 2014). In extreme cases, the error can however reach ~30% depending on geometry and aerosols.

Direct sun technique
Equipped with a sun-tracking device, direct sun instruments measure spectra of direct sunlight having traversed the whole atmosphere. Such instruments are sensitive to both troposphere and stratosphere (Figure 1b) and they provide accurate total column measurements with a minimum of a-priori assumptions.
Direct sun observations are routinely available from Pandora instruments. A standardized Pandora network has been set up by NASA (Herman et al., 2009; Tzortziou et al., 2014) and extended by ESA and LuftBlick via the Pandonia project (www.pandonia.net) to form the PGN (Pandonia Global Network, https://www.pandonia-global-network.org/). Pandora data used in this study originate mostly from the original NASA network, which includes more than 60 different sites covering different time periods (mostly campaign-based). In total, 15 Pandora direct sun instruments delivering at least 3 months of data have been considered here. They are listed in Table 3  generally operated in polluted areas (urban or sub-urban), however the network also contains a few background/remote sites located in Europe, Asia and the US. Data were kept when the normalized root-mean square of weighted spectral fitting residuals (wRMS) was less than 0.005 and the uncertainty in the NO2 retrieval was less than 0.05 DU (A. Cede, personal communication).

Recent detailed studies at US and Korean sites during DISCOVER-AQ have shown good agreement of Pandora instruments with aircraft in-situ measurements, within 20% on average, although larger differences are observed for individual sites (Choi et al., 2019), the largest discrepancies being found in Texas (Nowlan et al., 2018). Good agreement of a few percent between Pandora and GeoTASO has been reported by Judd et al. (2019), while differences increase when resampling the comparisons for larger simulated pixel sizes, up to about 40% bias for 18×18 km², similar to the bias found with OMI (50%).
The Pandora spectrometers provide NO2 total vertical column observations with a random uncertainty of about 2.7×10¹⁴ molec/cm² and a systematic uncertainty of 2.7×10¹⁵ molec/cm² (Herman et al., 2009). These account for DOAS fit systematic errors, random noise, and uncertainties related to the estimation of the residual gas amount in the reference spectra. In the present study, direct sun tropospheric VCDs are derived from the measured total NO2 content after subtraction of the stratospheric part estimated using satellite data (alone or within an assimilation scheme, see Sect. 2), interpolated to the geolocation of the Pandora spectrometer:

$$\mathrm{VCD_{tropo}^{DS}} = \mathrm{VCD_{total}^{Pandora}} - \mathrm{VCD_{strato}^{SAT}}$$

Summing the Pandora uncertainty and the uncertainty on the stratospheric column in quadrature, this approach leads to an uncertainty of about 2.75×10¹⁵ molec/cm² on the tropospheric column from direct sun data. It should be noted that this approach leads to retrieval of the total tropospheric column from the direct sun data, while the tropospheric column from MAX-DOAS represents mainly the boundary layer.
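The quadrature sum quoted above can be checked with a few lines of Python. The sketch assumes the Pandora systematic uncertainty of 2.7×10¹⁵ molec/cm² and the stratospheric separation uncertainty of 0.5×10¹⁵ molec/cm² mentioned in Sect. 2; treating the two as independent is an assumption of the sketch, and the function names are illustrative.

```python
import math


def direct_sun_tropospheric_vcd(vcd_total_pandora, vcd_strato_sat):
    """Tropospheric column as total (direct sun) minus stratospheric (satellite)."""
    return vcd_total_pandora - vcd_strato_sat


def quadrature_sum(*sigmas):
    """Combine independent uncertainties as the root of the sum of squares."""
    return math.sqrt(sum(s * s for s in sigmas))


SIGMA_PANDORA = 2.7e15  # molec/cm^2, systematic uncertainty (Herman et al., 2009)
SIGMA_STRATO = 0.5e15   # molec/cm^2, assumed from the value quoted in Sect. 2

sigma_tropo = quadrature_sum(SIGMA_PANDORA, SIGMA_STRATO)
print(f"Combined uncertainty: {sigma_tropo:.3e} molec/cm^2")
```

The result is dominated by the Pandora systematic term, so the choice of stratospheric uncertainty (0.2 to 0.5×10¹⁵ molec/cm²) changes the combined value only marginally.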

For the comparison, GOME-2A and OMI data were extracted within a radius of 50 km around the 36 stations listed in Table 2 and Table 3, with only pixels having a cloud radiance fraction <50% and an AMF ratio (AMFtropo/AMFgeom) > 0.2 being selected. In the case of OMI, pixels affected by the row anomaly were filtered out. The closest pixels and the mean value within the extraction radius were calculated for each day. To reduce the differences in spatial resolution of the satellite measurements (GOME-2A: 40×80 km², OMI: 13×24 km² at best) compared to the ground-based sensitivity (horizontal length of the probed air mass up to ~20 km), the largest pixels from each instrument dataset were removed: only pixels with an across-track width smaller than 100 km for GOME-2A and smaller than 40 km for OMI were kept in the comparisons. Previous studies have investigated the use of stricter coincidence criteria as a way to overcome spatial resolution differences. E.g. Irie et al. (2008) showed differences up to 25% in satellite VCD between pixels measurements with a relative uncertainty ≥10% (Vlemmix et al., 2010). In addition to our baseline, tests on those parameters are performed in Sect. 7.3.
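The baseline pixel selection described above can be sketched in Python as follows. The thresholds (50 km radius, cloud radiance fraction below 50%, AMF ratio above 0.2, across-track width below 100 km for GOME-2A or 40 km for OMI) come from the text; the `Pixel` container, its field names, the use of pixel-centre distance and the haversine formula are assumptions made for illustration, and the OMI row-anomaly filter is not included.

```python
import math
from dataclasses import dataclass


@dataclass
class Pixel:  # hypothetical container; field names are illustrative
    lat: float
    lon: float
    cloud_radiance_fraction: float  # 0..1
    amf_tropo: float
    amf_geom: float
    across_track_width_km: float
    vcd_tropo: float  # molec/cm^2


def distance_km(lat1, lon1, lat2, lon2):
    """Great-circle distance (haversine), mean Earth radius 6371 km."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dlat, dlon = p2 - p1, math.radians(lon2 - lon1)
    a = math.sin(dlat / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlon / 2) ** 2
    return 2.0 * 6371.0 * math.asin(math.sqrt(a))


def select_pixels(pixels, st_lat, st_lon, max_width_km, radius_km=50.0):
    """Keep pixels passing the distance, cloud, AMF-ratio and size criteria."""
    return [
        p for p in pixels
        if distance_km(p.lat, p.lon, st_lat, st_lon) <= radius_km
        and p.cloud_radiance_fraction < 0.5
        and p.amf_tropo / p.amf_geom > 0.2
        and p.across_track_width_km < max_width_km
    ]


def daily_summary(pixels, st_lat, st_lon):
    """Return (VCD of closest pixel, mean VCD) for one day of selected pixels."""
    if not pixels:
        return None
    closest = min(pixels, key=lambda p: distance_km(p.lat, p.lon, st_lat, st_lon))
    mean_vcd = sum(p.vcd_tropo for p in pixels) / len(pixels)
    return closest.vcd_tropo, mean_vcd
```

For example, `select_pixels(day_pixels, 50.0, 4.0, max_width_km=40.0)` would apply the OMI settings and `max_width_km=100.0` the GOME-2A ones, where `day_pixels` is a hypothetical list of `Pixel` objects for one overpass day.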
Ground-based MAX-DOAS data were interpolated to the satellite overpass time and a verification of the presence of data within ±1 h was performed in order to avoid large interpolation errors. Pandora direct sun measurements have a much higher acquisition rate (approximately 30 acquisitions/hour compared to typically 1 to 4 MAX-DOAS measurements) with sometimes strong NO2 variations not perfectly removed by the data filtering, so Pandora measurements within a 1-hour window (±30 min) around the satellite overpass time were averaged. On this basis, daily comparisons were performed at each station, and corresponding monthly averages were also calculated.  Table S1 for GOME-2A and OMI, with daily and monthly statistics for correlation coefficient R, slope S and intercept I of a linear regression and mean and median monthly absolute and relative biases. Depending on the length of the ground-based time-series, the number of daily comparison points can vary significantly, from     overestimation of the stratospheric columns derived from satellite. This discrepancy is under investigation and will be the subject of a future study.
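The temporal collocation described above can be sketched as follows. The ±1 h check for MAX-DOAS and the ±30 min averaging window for Pandora are taken from the text; the use of linear interpolation between the two bracketing measurements, the refusal to extrapolate, and the function names are assumptions of this sketch.

```python
from bisect import bisect_left
from statistics import fmean


def interpolate_to_overpass(times_h, values, overpass_h, max_gap_h=1.0):
    """Linearly interpolate a time-sorted series to the overpass time.

    Returns None if no measurement lies within max_gap_h of the overpass,
    or if the overpass is not bracketed by data (no extrapolation).
    """
    if not times_h or min(abs(t - overpass_h) for t in times_h) > max_gap_h:
        return None
    i = bisect_left(times_h, overpass_h)
    if i < len(times_h) and times_h[i] == overpass_h:
        return values[i]
    if i == 0 or i == len(times_h):
        return None
    t0, t1 = times_h[i - 1], times_h[i]
    w = (overpass_h - t0) / (t1 - t0)
    return values[i - 1] + w * (values[i] - values[i - 1])


def average_around_overpass(times_h, values, overpass_h, half_window_h=0.5):
    """Average all measurements within +/- half_window_h of the overpass time."""
    inside = [v for t, v in zip(times_h, values) if abs(t - overpass_h) <= half_window_h]
    return fmean(inside) if inside else None


# MAX-DOAS style: sparse series interpolated to a 09:30 overpass (hours, local time).
maxdoas_value = interpolate_to_overpass([9.0, 10.0], [2.0, 4.0], 9.5)
# Pandora style: dense series averaged within +/-30 min of a 13:30 overpass.
pandora_value = average_around_overpass([13.0, 13.4, 13.6, 14.5], [1.0, 2.0, 3.0, 10.0], 13.5)
```

Averaging the dense direct sun series rather than interpolating it damps the short-term NO2 variability mentioned above, at the cost of some temporal smoothing.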

Overview of the ground-based datasets
Due to different deployment strategies, the direct sun instruments (mostly of Pandora type) tend to be located closer to strong NO2 emission sources than MAX-DOAS instruments, which sample both polluted and background sites.   they present differences of about 1 and 6×10¹⁵ molec/cm² at OMI and GOME-2A overpass times, respectively (5 to 15%). Another example is Chiba and Yokosuka. Both of these sites are situated on the urban Tokyo Bay but at about 53 km distance from each other. Their median differences from OMI and GOME-2 are 5.7 and 14.2×10¹⁵ molec/cm² respectively (69 to 82%).

Comparison of ground-based and satellite datasets
The comparison methodology illustrated in Fig. 2 Tables 2 and 3), based on their location with respect to known pollution sources. This classification is not based on NO2 levels but reflects the influence of the surrounding areas. E.g. Xianghe station is in a polluted background with high NO2 levels (see Fig. 3), but it is located at a relatively large distance from surrounding urban areas, and is thus classified as sub-urban.  Table 4. It is clear that smaller slopes and larger biases are found at urban locations compared to background/sub-urban ones. Note also that for the case of the comparisons against MAX-DOAS data in background/sub-urban sites, smaller biases are obtained for OMI than for GOME-2 (about -10% compared to about -44%), while in urban sites the differences among the two satellites are smaller (about -36.6% and -38%). For direct sun sites both satellite sensors seem to behave similarly, with biases of about -20% in background/sub-urban cases and about -25% in

The median relative biases (SAT-GB)/GB at each site are presented as a color-coded map in Figure 6. Satellite data display a negative bias against ground-based reference data at all stations, except Reunion Island and UHMT-Houston, which are both coastal sites, highly heterogeneous in nature (Tzortziou et al., 2014; Loughner et al., 2014; Martins et al., 2016). Negative biases of about -80% are observed in Bujumbura and Nairobi, which can be related to the small NO2 signal and the localized nature of the sources at these sites, combined with a complex orography (Gielen et al., 2017; Compernolle et al., 2020). Systematic uncertainties in the estimation of the stratospheric column in satellite datasets could also contribute to the observed underestimation, considering the overall small tropospheric NO2 signals at these locations. E.g. Valks et al. (2011) have shown that small-scale variations visible in the IFS-MOZART stratospheric NO2 field could not be captured by the GOME-2A stratosphere-troposphere separation algorithm, due to limitations of the spatial filtering approach. In particular this might be the case at the Izaña and Mauna Loa stations (see Fig. 3a), where the satellite stratospheric column is found to exceed the total column NO2 derived from ground-based direct sun measurements. Finally, issues related to the use of inadequate ancillary datasets might also affect the accuracy of the satellite NO2 columns. This can be due to the coarse spatial resolution of models used as a priori (from 1.875° to 3° here, see Table 1) or their temporal sampling (monthly values from 1997 or daily profiles, see Table 1

Looking at the details of the comparison results at each station (Fig. 6 and values in Table S1), we find that GOME-
Significant differences between ground-based MAX-DOAS and both OMI QA4ECV and OMI NASA were also reported by Compernolle et al. (2020) in OHP, Bujumbura, Nairobi and Mainz.
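The comparison statistics used in this section (median relative bias (SAT-GB)/GB, and the slope, intercept and correlation coefficient of a linear regression) can be sketched in Python as below. The sketch assumes an ordinary least-squares regression with the ground-based data as the independent variable, which the text does not specify; the function names and sample values are illustrative only.

```python
import math
import statistics


def median_relative_bias_percent(sat, gb):
    """Median of (SAT - GB) / GB over paired values, in percent."""
    return 100.0 * statistics.median((s - g) / g for s, g in zip(sat, gb))


def linear_regression(x, y):
    """Ordinary least squares y = slope * x + intercept, plus Pearson R."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)
    return slope, intercept, r


# Synthetic example in units of 1e15 molec/cm^2: satellite reads 20% low everywhere.
gb_vcd = [2.0, 4.0, 6.0, 8.0]
sat_vcd = [1.6, 3.2, 4.8, 6.4]
bias_pct = median_relative_bias_percent(sat_vcd, gb_vcd)
slope, intercept, r = linear_regression(gb_vcd, sat_vcd)
```

Using the median rather than the mean of the relative differences limits the influence of days with very small ground-based columns, for which the ratio can become very large.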

https://doi.org/10.5194/amt-2020-76 Preprint. Discussion started: 30 March 2020 © Author(s) 2020. CC BY 4.0 License.
When considering the results as a whole, the most prominent feature is the systematic underestimation of ground-based data by both satellite datasets for most of the sites. This underestimation is most prominent at urban sites close to the sources, but it is also found at background/sub-urban sites and cannot be fully explained by the satellite uncertainties (see Section 2). The differences observed between OMI and GOME-2A can be related to instrumental characteristics (e.g. differences in pixel size) but also to details of the applied retrieval methods (see Table 1 and Sect. 2). Several studies have discussed in detail the impact of algorithmic differences on the NO2 column uncertainty, which can reach 42%, mainly due to tropospheric AMF uncertainties (Lorente et al., 2017). The

One way to understand these results is to consider the impact of the spatial resolution of the satellite measurements.
For the case of rural sites, coincident satellite pixels can include areas with higher NO2 columns, leading to positive biases in the comparisons. In contrast, at urban locations characterized by strong NO2 sources, coincident pixels generally tend to include surrounding (sub-urban) areas. This effect is especially significant for satellite instruments measuring at coarse spatial resolution, such as GOME-2A. It can be attenuated in validation studies making use of long time periods and many stations; however, large localized NO2 concentrations will always tend to be underestimated. This is particularly true for satellite instruments characterized by a horizontal resolution much coarser than the size of typical urban agglomerations (see Table 1). Note that the effect can be somewhat mitigated

In the next section, we present an attempt to quantify the smearing effect around urban sites and use it to study its impact on validation results.

Horizontal dilution effects
In order to investigate the horizontal variability of the NO2 field at the 36 different stations, one full year (2005) of the OMI NO2 QA4ECV dataset v1.1 was extracted to map the average NO2 column distribution on a grid of 0.025°×0.025° in latitude-longitude. Such highly-resolved gridded maps were obtained using a realistic representation of the OMI point spread function allowing to subsample the native OMI pixels the New-York area using airborne NO2 mapping data from the GeoTASO instrument. In our approach, the variation of the tropospheric NO2 VCD is sampled in concentric circles of different radii around each of the stations. Figure 7 illustrates the method for the Beijing (urban, Fig. 7a) and Xianghe (sub-urban, Fig. 7c) sites, which both present strongly inhomogeneous NO2 fields. Figures 7b and 7d show the NO2 VCD variation in concentric circles around the stations. In Beijing, the ground-based instrument is located close to the urban NO2 hotspot, so that the NO2 level decreases rapidly outwards. In contrast, a different behavior is found at the Xianghe station, which is located about 60 km to the east of the Beijing city center. In this case, due to the influence of the surrounding emission sources, the mean NO2 column tends to slightly increase when moving away from the site in the direction of Beijing. For background sites, one expects the NO2 content to remain roughly constant around the station value. Horizontal variability effects have been documented in previous studies dealing with ozone and water vapor (Lambert et al., 2012; Verhoelst et al., 2015), as well as with tropospheric NO2 (Irie et al., 2012; Duncan et al., 2016 and, mostly to illustrate the impact of collocation mismatch errors on validation results. In our study, we propose a correction method applied to satellite data, which aims at reducing the impact of the smearing effect on comparisons.

Figure 7 (caption fragment): ... × 0.025° longitude. The black dot indicates the station location, the two circles denote 50 and 100 km radii around the station and the red box represents the extension of the GOME-2 pixels when the center of the pixels is within the 50 km radius. Right panels display the mean (black) and median (red) NO2 values at increasing colocation radii (expressed in km), with the variability (one standard deviation) given as an error bar around the mean.

Dilution correction method
Similarly to the studies of Chen et al. (2009) and Ma et al. (2013), a correction factor is calculated to quantify the change in NO2 between the ground-based site and the satellite pixel location. In our approach, the dilution factor (Fdil) is obtained from the OMI gridded files by taking the ratio between the average (mean or median) NO2 VCD at increasing distances from the site and the VCD value at the site. A second-order polynomial is then fitted to these ratio values, as illustrated in Fig. 7 (panels (b) and (d)). Accordingly, Fdil is calculated using the following equation, where R represents the distance from the site: [...] In practice, Fdil is calculated as the median values of the gridded NO2 field for values of R from zero to 50 km.

[...] Island (LePort station). This ensemble is referred to as UIPP (Urban, Island and Power Plant) in the rest of the paper.
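
As an illustration of the fitting step, the sketch below computes the ratios between the average VCD at each radius and the VCD at the site, and fits a second-order polynomial to them by ordinary least squares. It is a hedged example only: the polynomial form c0 + c1*R + c2*R^2, the unweighted fit, and the function names are assumptions made for this sketch, and the way the resulting factor is applied to the satellite columns is not shown.

```python
def fit_quadratic(x, y):
    """Least-squares fit y ~ c0 + c1*x + c2*x**2; returns [c0, c1, c2]."""
    # Sums of powers of x and of y*x**k for the 3x3 normal equations.
    s = [sum(xi ** k for xi in x) for k in range(5)]
    t = [sum(yi * xi ** k for xi, yi in zip(x, y)) for k in range(3)]
    # Augmented matrix [A | b] with A[i][j] = sum(x**(i+j)).
    a = [[s[i + j] for j in range(3)] + [t[i]] for i in range(3)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(3):
            if r != col:
                f = a[r][col] / a[col][col]
                a[r] = [ar - f * ac for ar, ac in zip(a[r], a[col])]
    return [a[i][3] / a[i][i] for i in range(3)]


def dilution_curve(radii_km, vcd_at_radius, vcd_site):
    """Return a function R -> fitted ratio VCD(R) / VCD(site)."""
    ratios = [v / vcd_site for v in vcd_at_radius]
    c0, c1, c2 = fit_quadratic(radii_km, ratios)
    return lambda r: c0 + c1 * r + c2 * r * r
```

With radii between 0 and 50 km, a fitted ratio below one at a given distance indicates that the average column around the station is lower than at the station itself, as expected for an urban hotspot.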

Impact of the dilution correction
The regression analyses discussed in Section 5 have been repeated after application of the dilution correction to the satellite data sets. Results obtained with and without the correction are detailed in Table 4 for MAX-DOAS and direct sun data sets. We also distinguish between urban and background/sub-urban cases. Since the background sites are generally not (or weakly) influenced by dilution effects, results do not change much for this category, as expected. In contrast, when considering urban sites, statistical parameters are significantly modified. Regression slopes tend to be larger and closer to unity while biases are systematically reduced by about 10 to 20%.

The improvement brought by the dilution correction is further illustrated in Fig. 8, where the slopes of the linear regressions from daily scatter plots are presented for each station separately with and without dilution correction.
In order to limit the impact of outliers (especially the large columns that strongly affect the regression analysis), [...] as the different sites have varying numbers of points. After application of the dilution correction, regression slopes improve (and come closer to unity) for all cases except De Bilt. However, for some sites, there seems to be an over-correction effect (Athens/GOME-2, UHMT/GOME-2, Beijing (both sites)/OMI and Reunion/OMI), while negative slopes are obtained at a few other sites (e.g. Mauna Loa/GOME-2 and Reunion/GOME-2). As already discussed in Section 5.1, for direct sun stations this could be related to issues with the determination of stratospheric columns in the satellite algorithm. UHMT is a peculiar site, where several studies performed during the DISCOVER-AQ 2013 Texas campaign (Nowlan et al., 2018; Choi et al., 2019) suggested that the Pandora NO2 measurements tend to be too low. Finally, some sites (e.g. Nairobi, Bujumbura, Thessaloniki, Izaña) display very small slopes, probably because these sites are characterized by very local sources or by non-symmetric NO2 distributions. This is clearly the case for isolated islands where the NO2 can be locally trapped due to orography (see Figs. S19, S22 and S24 in the supplement).

[...] 0.67 to 1.1 for direct sun data. The impact of excluding the largest columns from the regression analysis can be judged by comparing the grey and black lines, obtained without and with filtering, respectively. As can be seen, direct sun data are more affected by this filtering (slope increase from 0.38 to 0.67) than MAX-DOAS data (slope increase from 0.49 to 0.52). This is likely related to the fact that, as already mentioned, direct sun instruments (of Pandora type) tend to be located closer to strong NO2 emission sources than MAX-DOAS instruments.
Table 5 lists the statistical parameters from regression analyses performed with and without the dilution correction for all the UIPP stations and the different satellite products. Generally speaking, validation results obtained using both MAX-DOAS and direct sun systems appear to be consistent, although direct sun observations tend to agree slightly better with the satellite data. In the case of direct sun data, however, we note that the dilution correction tends to over-correct satellite measurements (see also Fig. 9). It is also interesting to note in Table 5 that the intercepts are [...]

Considering all the stations together, Figure 10 presents an overview of the differences between satellite and ground-based data sets, for the original comparisons (in black) and after dilution correction (in red). We distinguish between two approaches for the selection of the coincident pixels: the closest cloud-free (cloud radiance fraction < 50%) pixel, and the mean value of all cloud-free pixels within a radius of 50 km. Results are also given separately for MAX-DOAS sites (upper plot) and direct sun sites (lower plot).
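
The two coincidence selections can be illustrated with the short sketch below. It is a simplified example under stated assumptions: each satellite pixel is represented by a hypothetical dictionary holding its distance to the station (`dist_km`), its cloud radiance fraction (`crf`, between 0 and 1) and its tropospheric column (`vcd`), and the 50 km limit is applied to both selections.

```python
import statistics


def cloud_free_within(pixels, max_crf=0.5, radius_km=50.0):
    """Keep pixels with cloud radiance fraction below max_crf and within radius_km."""
    return [p for p in pixels if p["crf"] < max_crf and p["dist_km"] <= radius_km]


def closest_pixel_vcd(pixels, max_crf=0.5, radius_km=50.0):
    """VCD of the closest cloud-free pixel, or None if no pixel qualifies."""
    ok = cloud_free_within(pixels, max_crf, radius_km)
    return min(ok, key=lambda p: p["dist_km"])["vcd"] if ok else None


def mean_pixel_vcd(pixels, max_crf=0.5, radius_km=50.0):
    """Mean VCD of all cloud-free pixels within the radius, or None if none qualifies."""
    ok = cloud_free_within(pixels, max_crf, radius_km)
    return statistics.fmean([p["vcd"] for p in ok]) if ok else None
```

Near a strong source, the mean over all pixels within 50 km mixes in cleaner surroundings, so it would typically be lower than the value of the closest pixel.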
As can be seen, the overall agreement between satellite and ground-based data sets is better for OMI comparisons and, after dilution correction, it is slightly better for direct sun than for MAX-DOAS sites. Again, this is likely related to the fact that direct sun instruments (of Pandora type) tend to be located closer to strong NO2 emission sources. Moreover, as also discussed previously, MAX-DOAS sites report measurements under a larger variability of conditions (both clear-sky and cloudy), leading to an increased spread of the comparisons. Generally speaking, the dilution correction pushes biases closer to zero and often reduces the spread of the differences. The best results are obtained with OMI, when comparing direct sun tropospheric columns to the closest pixel of the satellite. In this case, the median bias of -1.16 x 10^15 molec/cm² is reduced to -0.23 x 10^15 molec/cm² after application of the dilution correction. A similar improvement is found for the MAX-DOAS comparisons (from -0.95 to -0.47 x 10^15 molec/cm²). We find that the selection of the daily closest pixel leads to smaller biases and spreads and a better agreement between median and mean values for both OMI and GOME-2A comparisons. Therefore, in the rest of the study, comparison results are exclusively based on coincidences determined using daily closest pixels. Although the dilution correction improves the agreement between the ground-based and satellite measurements, significant negative biases persist at some of the validation sites (see Fig. 8). This could be related to satellite retrieval issues but also to shortcomings in our correction approach, which relies on average NO2 fields derived using one year (2005) of OMI data. These average fields are not necessarily representative of the actual day-to-day variability at all sites. This certainly contributes to the scatter of the comparisons, but should have a small (systematic) effect on regression slopes. Seasonal behavior differences, not taken into account here, could also play a role. Moreover, the use of the OMI QA4ECV dataset as the source for the correction implies that the afternoon NO2 field is representative of the morning GOME-2 overpass, which is not entirely true. Another issue is the limited spatial resolution of OMI data and of its a-priori profile assumptions. High-resolution models (Drosoglou et al., 2017) [...] Finally, ground-based instruments are assumed to provide point source measurements, while in reality the horizontal sensitivity area of MAX-DOAS measurements can be as large as several tens of km (Irie et al., 2011).

The provision of this information for all ground-based measurements would thus be very valuable to further improve the comparison method. Note that in urban areas, the representativeness of MAX-DOAS observations for comparison with satellite data could be improved by making use of measurements in different azimuth directions (Ortega et al., 2015;Gratsea et al., 2016;Schreier et al., 2019;Dimitropoulou et al., 2020).

Seasonal variations and internal consistency of validation results
In this section, we discuss in more detail the seasonal dependence of the results (Sect. 7.1) and the consistency of the validation results under further changes of the pixel selection (Sect. 7.2). Results for megacities such as Seoul and Beijing and for all UIPP stations are investigated in detail.

Seasonal dependence
Several sites submitted data for time periods longer than one year (see Tables 2 and 3 for details), which allows the seasonal dependence of the comparisons to be investigated. In Fig. 11 [...]

Impact of the satellite pixel selection
To further reduce the impact of the horizontal NO2 variability on our comparisons, an alternative approach was tested and compared to previous results. For these tests, the selection was restricted to OMI pixels covering the stations. The impact of constraining the coincidences in this way is presented in Table S1 for each station, without dilution correction. The following conclusions can be drawn:
Thus, at most stations, using a stricter colocation criterion results in a reduction of the bias (by up to ~20%). In order to better understand the impact of changing the pixel selection criteria, additional tests were performed for two megacities characterized by extremely high NO2 levels (see Fig. 3). Figure 12 illustrates, for Beijing, Beijing-CMA, Xianghe and Seoul, the impact of making different choices on the OMI pixel size and location. For the strictest selection criterion (OMI pixels smaller than 40 km and located above the stations), we see a significant improvement of the bias and spread of the comparison in Seoul for direct sun data and only a slight improvement in the median bias for the Beijing/Beijing-CMA data. For Xianghe, the impact appears to be moderate or even negligible, as expected due to the sub-urban nature of this site (Fig. 7).
Differences in the results for the two Beijing sites are to be considered in the light of the different measurement times (Table 1) [...] The peculiar behaviors of Beijing and Seoul cancel out when grouping results from all the UIPP stations. Figure 13 summarizes the change in biases for the UIPP ensemble, for the three pixel selection cases presented before.
As can be seen, restricting the comparison to small pixel sizes (from 100 to 40 km) only slightly improves the median bias, but it reduces the comparison spread. When focusing on pixels in strict overpass with the stations, the bias is also reduced but, for the MAX-DOAS ensemble, not as much as when a horizontal dilution correction is applied.
[...] (e.g. UHMT, Seoul), the dilution correction seems to over-correct the satellite NO2 columns, especially for OMI data. This is less clear for GOME-2A, indicating that the correction approach might be slightly too aggressive for the OMI case. It can also be seen that, except for a few cases, both satellite data products behave similarly at the different stations. Once corrected for the dilution effect, satellite measurements agree with ground-based data to within 25% (black dotted lines). The blue lines represent the median bias of satellite measurements against all station data, when including the dilution correction and for ground-based VCDtropo > 2 x 10^15 molec/cm². The latter filtering is applied to remove outliers leading to unphysical mean percent values. Resulting median residual biases are -23.5% for GOME-2A and -18% for OMI.

For the sake of completeness, the same analysis was also performed on QA4ECV v1.1 OMI and GOME-2A datasets, using the same selection criteria. Corresponding figures can be found in the supplement (Figs. S4 and S5). Similar results are found although the QA4ECV products tend to display [...]

[...] where n is the number of comparisons of each case (which can be different) and MAD is the median absolute deviation (see Huber, 1981), a robust indicator: [...]

Table 6 summarizes the median biases for all the cases. As already stated, the dilution correction improves the validation results for both sensors, by about 10 to 13% in total over the station ensemble. The impact of considering only pixels over the stations is to reduce the bias by 2 to 6% for OMI (in comparison to the usual daily closest pixels selection), but it has a negligible effect on GOME-2A, probably due to the large size of the GOME-2A pixels (40 x 80 km²).
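
For reference, the median relative bias and the median absolute deviation mentioned above can be computed as in the following sketch. It uses the common unscaled definition MAD = median(|x_i - median(x)|); whether the study applies an additional scaling constant or normalization by n is not shown here, so this should be read as an assumption. The function name and the threshold argument are hypothetical, with columns expressed in units of 10^15 molec/cm².

```python
import statistics


def median_relative_bias(sat_vcd, gb_vcd, min_gb=2.0):
    """Return (median, MAD) of the percent differences 100 * (sat - gb) / gb.

    Only pairs with a ground-based column above min_gb are kept, which avoids
    very large percent values caused by near-zero denominators.
    """
    diffs = [100.0 * (s - g) / g for s, g in zip(sat_vcd, gb_vcd) if g > min_gb]
    if not diffs:
        return None
    med = statistics.median(diffs)
    # Unscaled median absolute deviation around the median.
    mad = statistics.median([abs(d - med) for d in diffs])
    return med, mad
```

A negative median indicates that the satellite columns are lower than the ground-based ones, which is the sign convention used for the biases quoted in this section.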
When considering the best comparison conditions including dilution correction (last column of [...] It should be noted that, in addition to this relative bias, the previously found positive intercepts and slopes smaller than one (see Table 5) could point to a twofold effect, involving a multiplicative error source (e.g. the AMF) and an additive error source (e.g. the stratosphere-to-troposphere separation). This question should be further [...]

Conclusions
Tropospheric NO2 column data from 39 ground-based remote-sensing instruments worldwide were used to [...] conditions. Typically, sub-urban polluted stations (e.g., Xianghe) provide the best conditions for the validation of satellite NO2 owing to their good representativeness of the spatial extent of the OMI or GOME-2A pixels.
Validation at more remote stations can be challenging due to usually low levels of tropospheric NO2, leading to difficulties in the stratosphere-to-troposphere separation step in the satellite retrieval. Other challenging cases are cities and islands surrounded by a pristine atmosphere, such as Izaña, Reunion Island, Nairobi or Bujumbura, leading to large biases (up to ~80%) due to smearing of the local tropospheric NO2 content in otherwise clean surroundings.
Comparisons at urban sites or close to strong NOx sources may suffer from smoothing difference errors due to the horizontal dilution of the measured NO2 field. Therefore, a quantitative correction for the dilution effect has been developed based on the spatial distribution of tropospheric NO2 columns probed by OMI and averaged over one year. This dilution correction generally improves the comparison, reducing biases due to the spatial mismatch between ground-based and satellite observations. Lower biases and less scatter are obtained when considering the closest satellite pixels within a radius of 50 km from the stations. Generally, OMI DOMINO v2 data agree better with ground-based data than GOME-2A GDP 4.8, especially for comparisons with MAX-DOAS data. [...] correction improves the station-per-station comparisons with a few exceptions, generally at remote sites with local emissions surrounded by clean areas.
Restricting the comparison to satellite pixels covering the stations further reduces the bias and spread at urban locations, and the comparison spread at sub-urban sites for OMI data. However, the largest reduction of the bias is obtained when applying the dilution correction. In terms of validation results, MAX-DOAS and direct sun measurements are found to be highly consistent, and therefore they have been used as an ensemble to assess the accuracy of GOME-2A and OMI data. Results based on this ensemble indicate that, even after correction for the horizontal dilution effect, satellite tropospheric NO2 columns are systematically biased low in comparison to ground-based measurements by 22% to 36% for GOME-2A and 11% to 21% for OMI, depending on the selected satellite product. A summary of the validation results is given in Table 6.
The dilution correction developed here is parameterized according to the distance from the station and is based on one year of OMI NO2 measurements (2005). This approach has several identified limitations, such as the assumptions made on the radial nature of the NO2 distribution around the sites and the overall applicability of the NO2 field derived in 2005. Not considered here but also potentially important is the different intra-pixel dilution expected for the OMI and GOME-2A measurements. Despite its simplicity and shortcomings, our dilution correction was shown to significantly improve validation results and we anticipate that future developments will lead to further improvements. For example, possibilities exist to use estimates of the horizontal extent of MAX-DOAS measurements to improve the colocation with satellite data. MAX-DOAS instruments can also be operated in multiple azimuthal scan mode, which could be used to further refine the colocation with satellite pixels (Brinksma et al., 2008; Gratsea et al., 2016; Ortega et al., 2015; Schreier et al., 2019; Dimitropoulou et al., 2020). Finally, imaging MAX-DOAS systems such as the IMPACT instrument (Peters et al., 2019), which provides fast sampling of the full (360°) azimuthal range, may lead to significant improvements in tropospheric NO2 validation close to source regions.

To further improve validation studies, information on the vertical distribution of NO2 and aerosols is also needed to test the impact of a-priori assumptions in satellite data retrieval. To some extent, this can be provided by MAX-DOAS instruments making use of vertical profiling techniques for the inversion of tropospheric profiles of NO2 and aerosols.

Finally, improving and further extending existing networks are essential requirements for future operational air quality satellite validation (Veihelmann et al., 2019). In this context, important steps include:
- The further development of the PGN network of Pandora instruments, to better cover source regions in all continents and in the measurement areas of all current and future satellites.

Code/Data availability. The datasets generated and analyzed in the present work are available from the corresponding author on request and data per station can be requested from the individual PIs.
Author contribution. GP and MVR planned this study. GP performed the validation and the associated investigations, and wrote the manuscript. MVR and FH contributed to the scientific discussions and to the manuscript writing. NT participated in the creation of the OMI gridded maps. JG keeps the GOME-2 GDP station overpasses database up to date. All other co-authors provided ground-based data for the station(s) they are responsible for, or support on the satellite data or the validation method. All co-authors were involved in the discussion of the results.