Validation of tropospheric NO2 column measurements of GOME-2A and OMI using MAX-DOAS and direct sun network observations

Multi-axis differential optical absorption spectroscopy (MAX-DOAS) and direct sun NO2 vertical column network data are used to investigate the accuracy of tropospheric NO2 column measurements of the GOME-2 instrument on the MetOp-A satellite platform and the OMI instrument on Aura. The study is based on 23 MAX-DOAS and 16 direct sun instruments at stations distributed worldwide. A method to quantify and correct for horizontal dilution effects in heterogeneous NO2 field conditions is proposed. After systematic application of this correction to urban sites, satellite measurements are found to present smaller biases compared to ground-based reference data in almost all cases. We investigate the seasonal dependence of the validation results as well as the impact of using different approaches to select satellite ground pixels in coincidence with ground-based data. In optimal comparison conditions (satellite pixels containing the station) the median bias between satellite tropospheric NO2 column measurements and the ensemble of MAX-DOAS and direct sun measurements is found to be significant and equal to −34 % for GOME-2A and −24 % for OMI. These biases are further reduced to −24 % and −18 % respectively, after application of the dilution correction. Comparisons with the QA4ECV satellite product for both GOME-2A and OMI are also performed, showing less scatter but also a slightly larger median tropospheric NO2 column bias with respect to the ensemble of MAX-DOAS and direct sun measurements.

G. Pinardi et al.: GOME-2A and OMI tropospheric NO2 validation. Published by Copernicus Publications on behalf of the European Geosciences Union.


Introduction
Nitrogen dioxide (NO2) is a key species for atmospheric chemistry, present both in the stratosphere and in the troposphere. In the troposphere, nitrogen oxides (NOx = NO + NO2) together with volatile organic compounds are key ingredients for ozone and photochemical smog formation in polluted regions. By reaction with the hydroxyl radical (OH), NO2 forms nitric acid (HNO3), which leads to acid rain and consequently acidifies soils and water bodies, with negative impacts on the environment. In addition to its important role in air quality (human health and environmental acidification), NO2 is also relevant for climate processes at high concentrations, contributing to direct radiative forcing and extending the atmospheric lifetimes of gases such as CH4. The main sources of NOx include anthropogenic and natural emissions, such as fossil fuel combustion, biomass burning, lightning and microbial soil emissions. There is a need for accurate NO2 measurements to assess and forecast its impact on air quality.
NO2 can be measured by several methods, such as in situ sampling and active or passive remote sensing. The differential optical absorption spectroscopy (DOAS) technique (Platt and Stutz, 2008) is widely used to retrieve NO2 in the atmosphere from measurements taken from satellites, from balloons and from the ground. Since the mid-nineties, NO2 has been measured from space by mid-morning low earth orbit (LEO) nadir satellite instruments, such as GOME on ERS-2 (1996-2003; Burrows et al., 1999), SCIAMACHY on ENVISAT (launched 2002; Bovensmann et al., 1999) and GOME-2 on MetOp A, B and C (launched in 2006, 2012 and November 2018; Munro et al., 2016). From 2004 onwards, NO2 measurements in the early afternoon have also been performed from the OMI imaging spectrometer on the EOS-Aura platform and since the end of 2017 from the Sentinel-5P TROPOMI instrument (Veefkind et al., 2012). In the last 15 years, ground-based MAX-DOAS (multi-axis differential optical absorption spectroscopy) instruments have been developed to measure tropospheric trace gases (Hönninger and Platt, 2002; Hönninger et al., 2004; Sinreich et al., 2005). Combined with profiling algorithms, this technique has been successfully applied to retrieve tropospheric columns and information on the vertical distribution of NO2, HCHO, SO2, BrO, IO, HONO, CHOCHO and aerosols (e.g., Bobrowski et al., 2003; Wittrock et al., 2004; Wagner et al., 2004; Heckel et al., 2005; Frieß et al., 2006, 2016; Sinreich et al., 2007; Theys et al., 2007; Irie et al., 2008b, 2009; Clémer et al., 2010; Galle et al., 2010; Hendrick et al., 2014). Direct sun observations in the UV-visible, which provide total column measurements (Wenig et al., 2008; Herman et al., 2009; Wang et al., 2010), are also used for monitoring atmospheric NO2.
In particular, the recently developed Pandora instrument (SciGlob, http://www.sciglob.com/, last access: 29 October 2020) operationally provides direct sun measurements of O3 and NO2, and of SO2 and HCHO in a scientific mode (Herman et al., 2009, 2019; Wang et al., 2010; Tzortziou et al., 2015; Fioletov et al., 2016; Spinei et al., 2018), at a growing number of sites.
One of the strengths of LEO nadir satellite instruments with wide swath width, like OMI and GOME-2, is their daily global coverage. Their main drawbacks are their limited revisit frequency and associated sampling of the diurnal cycle (typically one overpass per day for midlatitudes) and their coarse spatial resolution (from a few to several hundreds of kilometers). The accuracy of the different satellite datasets is also of concern, e.g., for trend analysis or diurnal variation studies. Validation activities, which are an essential part of any satellite program, aim at deriving independently a set of indicators characterizing the quality of the data product. They encompass the monitoring of instrumental stability as well as the inter-sensor consistency needed to ensure continuity between different satellite missions. Satellite validation also contributes to the improvement of retrieval algorithms through investigation of the accuracy of the data products and their sensitivity to retrieval parameter choices. Tropospheric satellite data products depend on various sources of ancillary data, e.g., a priori vertical distribution of the absorbing and scattering species, surface albedo and information on clouds and aerosols (Boersma et al., 2004; Lin et al., 2015; Lorente et al., 2017; Liu et al., 2019a). In the case of NO2, separation between stratospheric and tropospheric contributions is an additional source of complexity in the retrieval, and there is considerable debate on the importance of the role of free tropospheric (background) NO2 in the retrieval process (Jiang et al., 2018; Silvern et al., 2019). As discussed by Richter et al. (2013), the validation of tropospheric reactive gases (such as NO2, HCHO and SO2) is also challenging because short atmospheric lifetimes, local emission sources and transport can lead to a large variability of their concentrations in time and space (both vertically and horizontally).
Active photochemistry and transport processes lead to important diurnal variations (Boersma et al., 2008) that need to be considered in validation studies. MAX-DOAS and direct sun remote-sensing techniques have large potential for the validation of satellite trace gas observations, as they measure all day long and provide accurate measurements of integrated column amounts (i.e., a quantity close to that measured by spaceborne instruments). Remote sensing measurements also match the horizontal resolution of satellite observations better than, e.g., surface in situ monitoring networks. The spatial averaging of MAX-DOAS measurements has been quantified and shown to range from a few kilometers to tens of kilometers depending on aerosol content and measurement wavelength (Irie et al., 2011; Wagner et al., 2011; Wang et al., 2014; Gomez et al., 2014; Ortega et al., 2015).
In the last decade, several studies have compared different SCIAMACHY, GOME-2 and OMI NO2 data products (generated by both operational and scientific prototype processors) to MAX-DOAS measurements at various stations (e.g., Brinksma et al., 2008; Hains et al., 2010; Vlemmix et al., 2010; Irie et al., 2008a; Ma et al., 2013; Lin et al., 2014; Wang et al., 2017; Drosoglou et al., 2017, 2018; Liu et al., 2019a). JAMSTEC data from the MADRAS network have been used in Kanaya et al. (2014) for the validation of the OMI DOMINO and NASA tropospheric NO2 data. BIRA-IASB MAX-DOAS stations have been regularly used for the validation of GOME-2 GDP (GOME Data Processor) products from MetOp-A and MetOp-B (Pinardi et al., 2014, 2015; Liu et al., 2019b) as part of the AC SAF activities (Hassinen et al., 2016; see also https://cdop.aeronomie.be/validation/valid-results, last access: 29 October 2020). Pandora datasets have also been used in satellite validation of total and tropospheric NO2 columns (Herman et al., 2009; Tzortziou et al., 2014, 2015; Judd et al., 2019), and a recent study by Herman et al. (2019) presented an overview at 14 Pandora sites showing that NASA OMI NO2 overpass data consistently underestimate the Pandora-derived NO2 amounts. A general conclusion of these exercises is a low bias of the satellite tropospheric NO2 columns in urban conditions and, in contrast, a better agreement with ground-based data in background and pristine locations (Halla et al., 2011; Kanaya et al., 2014). However, Irie et al. (2012) also reported low OMI NO2 column values over China in summer, when the spatial distribution of NO2 was likely homogeneous.
In the present study, we validate GOME-2A and OMI tropospheric NO2 column measurements using data from a large number of MAX-DOAS and direct sun instruments operating in Europe, Asia, North America and Africa under a wide variety of atmospheric conditions and pollution patterns. Some of these datasets have already been used in the past for tropospheric NO2 validation of different satellites and products and participated in the CINDI-1 and/or 2 intercomparison campaigns (Piters et al., 2012; Kreher et al., 2020). In the present study we combine them in a coordinated way, allowing for a global approach to satellite validation, sampling different NO2 levels in various locations around the globe. In addition, the smearing (or dilution) of the NO2 field due to the limited horizontal resolution of satellite measurements is investigated. A method for the quantification and correction of the dilution effect is proposed, and its impact on validation results is quantitatively evaluated. Our validation approach is applied to operational OMI DOMINO and AC SAF GOME-2A products as well as to climate data record OMI and GOME-2A NO2 data products generated within the EU QA4ECV project.
The paper is structured as follows: Sects. 2 and 3 describe the OMI and GOME-2A sensors and datasets as well as the reference ground-based measurements. Section 4 presents the comparison methodology, and comparison results are discussed in Sect. 5. In Sect. 6, we concentrate on the quantification of horizontal dilution effects in satellite measurements performed around the measurement sites, and we show how these effects impact the validation results in urban conditions. Section 7 presents a summary of the validation results, and conclusions are detailed in Sect. 8.

Satellite tropospheric NO2 datasets
Tropospheric NO2 data products from spaceborne sensors are generally retrieved in three main steps: first, a DOAS spectral analysis, yielding the total column amount of NO2 along the slant optical path; second, an estimation of the stratospheric NO2 column, to be subtracted from the total column to derive the tropospheric contribution (so-called "residual" technique); and finally, a conversion from slant (SCD) to vertical (VCD) column densities. The last step is based on air mass factor (AMF) calculations, which require a priori knowledge of the NO2 vertical distribution, pressure and temperature, surface albedo and aerosols, as well as information on (effective) cloud cover and height (Boersma et al., 2004). The retrieval of tropospheric NO2 is given by

$$ \mathrm{VCD}_{\mathrm{tropo}} = \frac{\mathrm{SCD} - \mathrm{SCD}_{\mathrm{strato}}}{\mathrm{AMF}_{\mathrm{tropo}}}, \qquad (1) $$

where SCD is the total slant column from the DOAS analysis, SCD_strato is the stratospheric slant column and AMF_tropo is the tropospheric air mass factor.

Different data products have been generated for each satellite instrument, using different assumptions for each of the three aforementioned steps (see Boersma et al., 2004; Richter et al., 2011; Lin et al., 2014; Bucsela et al., 2013; Lamsal et al., 2014; van Geffen et al., 2015; Krotkov et al., 2016; Lorente et al., 2017; Liu et al., 2019a). In addition to instrument-specific differences, structural uncertainties arising from the application of different retrieval methodologies to the same satellite observations (sometimes also called forward model uncertainties) can introduce differences in the retrieved tropospheric NO2 columns (VCD_tropo) of 10 %-50 % (e.g., van Noije et al., 2006; Lorente et al., 2017; Zara et al., 2018). SCD structural uncertainties generally do not exceed 1 × 10^15 molecules cm^-2, while the AMF calculation leads to more significant uncertainties (Boersma et al., 2004), which can be separated into implementation differences (when different groups use identical ancillary data for the calculation of tropospheric NO2 AMFs) of about 6 % and structural differences, due to ancillary data selection, which can reach 31 %-42 % (Lorente et al., 2017). The uncertainty in separating the stratospheric and tropospheric columns is about 0.5 × 10^15 molecules cm^-2 (Dirksen et al., 2011; Lorente et al., 2017).
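The three retrieval steps described in this section can be illustrated with a minimal numerical sketch. It assumes that the stratospheric contribution is already expressed as a slant column; the function name and the input values are hypothetical and are not taken from any of the processors discussed here.

```python
def tropospheric_vcd(scd_total, scd_strato, amf_tropo):
    """Residual technique: subtract the stratospheric slant column from the
    total slant column, then convert the remainder to a vertical column by
    dividing by the tropospheric air mass factor."""
    if amf_tropo <= 0:
        raise ValueError("tropospheric AMF must be positive")
    return (scd_total - scd_strato) / amf_tropo


# Hypothetical inputs, all in molec cm^-2 except the dimensionless AMF.
scd_total = 12.0e15   # total slant column from the DOAS fit
scd_strato = 6.0e15   # estimated stratospheric slant column
amf_tropo = 1.2       # tropospheric air mass factor

vcd_tropo = tropospheric_vcd(scd_total, scd_strato, amf_tropo)
print(f"VCD_tropo = {vcd_tropo:.2e} molec cm^-2")
```

Because the AMF appears in the denominator, a relative error in AMF_tropo translates into a relative error of similar size in the tropospheric column, which is consistent with the AMF being identified above as the dominant uncertainty source.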
In the present study, we focus on the ground-based validation of the mid-morning GOME-2A and the early afternoon OMI data. The validation method and step-by-step results are illustrated throughout the paper for the GOME-2A GDP (GOME Data Processor) 4.8 NO2 operational data product and the OMI DOMINO v2.0 data product, while the final validation results and discussion also include the GOME-2A and OMI QA4ECV products (Zara et al., 2018). All products are briefly presented in Table 1 and in the following subsections.

GOME-2 products
The second Global Ozone Monitoring Instrument (GOME-2) is a nadir-looking UV-visible spectrometer measuring the solar radiation backscattered by the atmosphere and reflected by the Earth and clouds in the 240-790 nm wavelength interval, with a spectral resolution of 0.2-0.5 nm full width at half maximum (FWHM; Munro et al., 2016). Three GOME-2 instruments fly in sun-synchronous polar orbits on board the Meteorological Operational satellites (MetOp-A, MetOp-B and MetOp-C, launched respectively in October 2006, September 2012 and November 2018). They have an Equator crossing time of 09:00-09:30 local time in the descending node. In this study we concentrate on the GOME-2A instrument (on MetOp-A), which provides the longest data record. The default swath width of the GOME-2A across-track scan is 1920 km, allowing global Earth coverage within 1.5-3 d at the Equator, with a nominal ground pixel size of 80 km × 40 km. Since 15 July 2013, GOME-2A has been measuring in a reduced swath mode of 960 km, with a ground pixel size of 40 km × 40 km.
Operational products are retrieved from GOME-2 measurements in the framework of the Atmospheric Composition Satellite Application Facility AC SAF (https://acsaf.org/, last access: 29 October 2020; formerly O3M SAF; see also Hassinen et al., 2016). Total, tropospheric and stratospheric NO2 columns are operationally retrieved with the GOME Data Processor (GDP), and a description of this algorithm can be found in Valks et al. (2011) and Liu et al. (2019b). Within the QA4ECV (Quality Assurance for Essential Climate Variables) project, a coherent offline NO2 dataset has been created for GOME, SCIAMACHY, GOME-2A and OMI (Zara et al., 2018; Lorente et al., 2017), and comparisons with this dataset are also included at the end of this study. Table 1 summarizes the main retrieval steps for the various tropospheric NO2 products considered here. The main differences are related to the methods used to obtain the stratospheric NO2 column, the cloud parameters and the a priori information used to calculate the tropospheric air mass factor. In the QA4ECV case, stratospheric columns are derived using two different approaches (assimilation in TM4 and STREAM). The stratospheric separation method has an estimated uncertainty in the 0.15-0.3 × 10^15 molec cm^-2 range. The typical overall uncertainty for individual retrievals of tropospheric NO2 vertical column densities is estimated to be 1.0 × 10^15 molecules cm^-2 (±25 %) in rural environments and from 40 % to 80 % under polluted conditions.
Previous validation of GOME-2A GDP 4.8 data can be found in Valks et al. (2011), Hassinen et al. (2016) and Liu et al. (2019b) for a few MAX-DOAS stations, and results of regular validation exercises can be found at http://cdop.aeronomie.be/validation/valid-results (last access: 29 October 2020). Satellite-to-satellite comparisons of the GOME-2A QA4ECV data have been performed by Zara et al. (2018), Lorente et al. (2017) and Liu et al. (2019b). Previous GOME-2 validation highlighted the effect of GOME-2 large pixels and the aerosol shielding effect, leading, e.g., to differences of 5 % to 25 % over China (Ma et al., 2013; Wu et al., 2013; Wang et al., 2017; Drosoglou et al., 2018). Liu et al. (2019b) showed possible improvements of the GDP 4.8 product, leading to reduced discrepancies of the satellite-to-ground-based biases of the order of 10 % to 25 % for several MAX-DOAS stations.

OMI products
OMI (Ozone Monitoring Instrument) is a nadir-viewing imaging spectrometer with a spectral resolution of about 0.5 nm FWHM. The light entering the telescope is depolarized using a scrambler and split into two spectral bands: a UV channel (wavelength range 270-380 nm) and a visible channel (wavelength range 350-500 nm). The 114° viewing angle of the telescope corresponds to a 2600 km wide swath on the Earth's surface distributed over 60 cross-track positions, which enables quasi-global coverage in 1 d. In the nominal global operation mode, the OMI ground pixel size varies from 13 km × 24 km at true nadir to 28 km × 150 km on the edges of the swath. OMI is on board the EOS-Aura satellite that was launched in July 2004, in a sun-synchronous polar orbit crossing the Equator around 13:45 LT (in ascending node). The radiometric stability of the OMI instrument is exceptionally good (Schenkeveld et al., 2017); however, since June 2007, several rows of the detector have been affected by a signal reduction, the so-called "row anomaly" (http://www.knmi.nl/omi/research/product/rowanomaly-background.php, last access: 29 October 2020), reducing the usable swath coverage (see Boersma et al., 2018). The DOMINO (Derivation of OMI tropospheric NO2) product is distributed in NRT via the TEMIS (Tropospheric Emission Monitoring Internet Service; http://www.temis.nl, last access: 29 October 2020) project. The offline OMI QA4ECV v1.1 product is very similar to the GOME-2A product, as can be seen in Table 1. For OMI, the stratospheric separation is performed using a data assimilation scheme based on the TM4 or TM5-MP chemistry transport models. Its uncertainty is estimated to be about 0.2-0.3 × 10^15 molec cm^-2 (Boersma et al., 2004; Dirksen et al., 2011). Stratospheric NO2 vertical columns used in our study are derived from assimilated stratospheric slant columns divided by a geometrical air mass factor, as described in Hendrick et al. (2012).
For the OMI QA4ECV dataset, two estimates of the stratospheric column are reported (data assimilation and STREAM), and Boersma et al. (2018) illustrated the differences between both approaches, which reach up to 1 × 10^15 molec cm^-2. Compernolle et al. (2020) showed best agreement with ZSL-DOAS NDACC measurements for the STREAM stratospheric dataset, with mean differences between the two datasets of the order of 0.2 × 10^15 molec cm^-2.
OMI DOMINO v2.0 has been widely used in the past, and several validation exercises (Hains et al., 2010; Vlemmix et al., 2010; Irie et al., 2008a, 2012; Lin et al., 2014; Wang et al., 2017; Drosoglou et al., 2017, 2018; Liu et al., 2019a) found an underestimation of the OMI tropospheric NO2 columns in urban conditions and a better agreement in background locations (Kanaya et al., 2014). Kanaya et al. (2014) showed close correlations with MAX-DOAS observations at seven stations but found low biases of up to ∼ 50 %. Regarding the OMI QA4ECV product, Boersma et al. (2018) reported a first validation at the Tai'an station (China) during one summer month, finding good agreement (bias of −2 %) with respect to MAX-DOAS NO2 columns (better than the −11 % bias found for DOMINO v2). Liu et al. (2019a) investigated the impact of correcting for aerosol vertical profiles in the OMI data and compared four OMI datasets (POMINO and POMINO v1.1, DOMINO v2.0 and QA4ECV) with respect to data of three Chinese stations. Results suggested a significant improvement of the OMI NO2 retrieval when correcting for aerosol profiles, both in general and on hazy days. This is consistent with the previous finding that the accuracy of DOMINO v2.0 is reduced for polluted, aerosol-loaded scenes (Kanaya et al., 2014; Lin et al., 2014; Chimot et al., 2016). Liu et al. (2019a) also established discrepancies in DOMINO v2.0 for very high NO2 values (> 70 × 10^15 molec cm^-2). For 18 cloud-free days, they found smaller differences between the four products with respect to MAX-DOAS, with the QA4ECV dataset having the highest R^2 (0.63) and the lowest bias (−5, 8 %). An extended validation of the QA4ECV OMI product is reported in the recent Compernolle et al. (2020) study, showing a negative bias (from −1 to −4 × 10^15 molec cm^-2) with respect to 10 MAX-DOAS instruments, a feature also found for the OMI OMNO2 standard data product.
They also found that the tropospheric VCD discrepancies between satellite and ground-based data exceed the combined measurement uncertainties and that, depending on the site, this discrepancy could be attributed to a combination of comparison errors (horizontal smoothing difference error, error related to clouds and aerosols and differences due to a priori profile assumptions).

Ground-based datasets: MAX-DOAS and direct sun measurements

MAX-DOAS technique
A MAX-DOAS instrument measures the scattered sunlight under a sequence of viewing elevation angles extending from the horizon to the zenith (Fig. 1a). At low elevation angles, the observed sunlight travels a long path in the lower troposphere (under aerosol-free conditions, the lower the elevation angle, the longer the path), while all observations have approximately the same light path in the stratosphere, independently of viewing elevation. By taking the difference in SCD between off-axis observations and a (nearly) simultaneously acquired zenith reference spectrum (the differential slant column), the stratospheric contribution can therefore be eliminated. Tropospheric absorbers can be measured throughout the day, generally up to a solar zenith angle (SZA) of approximately 85° (Hönninger et al., 2004; Sinreich et al., 2005). Radiance spectra acquired at different elevation angles are analyzed using the DOAS method (Platt and Stutz, 2008), which gives integrated trace gas concentrations along the atmospheric absorption path. The resulting differential slant columns (dSCDs) can be converted to vertical columns and/or vertical profiles using methods of different levels of complexity. Table 2 presents details about the retrieval strategy adopted by different teams. These strategies generally belong to one of the following categories:
- Geometrical approximation (GA). The vertical column is determined under the assumption that a single-scattering approximation can be made for moderately high elevation angles α (typically 30°) so that a simple geometrical air mass factor, AMF_α ≡ SCD/VCD = 1/sin(α) (Hönninger et al., 2004; Brinksma et al., 2008; Ma et al., 2013), can be used.
- QA4ECV datasets. The vertical column is calculated using tropospheric AMFs based on climatological profiles and aerosol situations as developed during the QA4ECV project (http://uv-vis.aeronomie.be/groundbased/QA4ECV_MAXDOAS/QA4ECV_MAXDOAS_readme_website.pdf, last access: 29 October 2020). These data are less sensitive to the relative azimuth angle than the purely geometric approximation presented above.
- Vertical profile algorithms based on parameterized profile shape functions. These make use of analytical expressions to represent the trace gas profile using a limited number of parameters (Irie et al., 2008a, b, 2011; Li et al., 2010; Vlemmix et al., 2010; Wagner et al., 2011; Beirle et al., 2019).
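As an illustration of the geometrical approximation listed above, the following minimal sketch converts a differential slant column into a tropospheric vertical column. The text only gives AMF_α = 1/sin(α); subtracting the zenith AMF (equal to 1) to obtain a differential AMF is an assumption made here because the measured quantity is differential with respect to a zenith reference. Function names and input values are hypothetical.

```python
import math


def geometric_amf(elevation_deg):
    """Single-scattering geometrical AMF, 1/sin(alpha)."""
    return 1.0 / math.sin(math.radians(elevation_deg))


def vcd_from_dscd(dscd, elevation_deg):
    """Tropospheric VCD from a dSCD measured relative to a zenith reference.

    Assumption (not stated in the text): the differential AMF is
    AMF(alpha) - AMF(90 deg) = 1/sin(alpha) - 1.
    """
    damf = geometric_amf(elevation_deg) - geometric_amf(90.0)
    if damf <= 0:
        raise ValueError("elevation angle must be below 90 degrees")
    return dscd / damf


# Hypothetical dSCD of 8e15 molec cm^-2 measured at 30 degrees elevation.
vcd = vcd_from_dscd(8.0e15, 30.0)
print(f"AMF(30 deg) = {geometric_amf(30.0):.3f}, VCD = {vcd:.2e} molec cm^-2")
```

At 30° elevation the geometrical AMF is 2, so the differential AMF is 1 and the vertical column equals the differential slant column, which is one practical reason why this angle is a common choice.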
MAX-DOAS profile inversion algorithms use a two-step approach: in the first step, aerosol extinction profiles are retrieved from the measured absorption of the oxygen dimer O4 (Frieß et al., 2006). In a second step, trace gas profiles are retrieved from the measured trace gas absorptions, taking into account the aerosol extinction profiles retrieved in the first step. Both OEM and parameterized profiling approaches provide vertical profiles of aerosols and NO2 with a sensitivity typically in the 0-4 km altitude range, with generally between 1.5 and 3 independent pieces of information in the vertical dimension (Frieß et al., 2016, 2019; Tirpitz et al., 2020). This complementary information on the vertical distribution of gases and aerosols in the atmosphere has been used in some studies to test some key assumptions made in the satellite data retrieval, in particular the a priori NO2 profile and aerosol content, therefore providing more insight into the quality of the satellite data (e.g., Wang et al., 2017b; Liu et al., 2019b; Compernolle et al., 2020). Recent intercomparison studies (Frieß et al., 2019) show that both OEM and parameterized inversion approaches lead to consistent results in terms of tropospheric vertical column but larger differences in terms of profiles. In this study, every data provider submitted data retrieved with their own tools and formats, without any harmonization. Our study therefore focuses only on the vertical column, which is the more robust and reliable retrieved quantity. The time coverage of the different datasets used in this study is presented in Fig. S1 in the Supplement.
The accuracy of the MAX-DOAS technique depends on the SCD retrieval noise, the uncertainty of the NO2 absorption cross sections and, most importantly, the uncertainty of the tropospheric AMF calculation. The estimated total error on NO2 VCD is of the order of 7 %-17 % in polluted conditions. This includes both random (around 3 % to 10 % depending on the instruments) and systematic (11 % to 14 %) contributions (e.g., Irie et al., 2008, 2011, 2012; Hendrick et al., 2014; Kanaya et al., 2014). In extreme cases, however, the error can reach ∼ 30 % depending on geometry and aerosols.

Direct sun technique
Equipped with a 2-axis positioner, direct-sun-capable DOAS instruments measure non-scattered photons. Such instruments are equally sensitive to both tropospheric and stratospheric absorptions (Fig. 1b). They have a very small uncertainty in AMF and can provide accurate total column measurements with a minimum of a priori assumptions.
Direct sun (DS) observations are routinely available from Pandora spectrometer instruments. A standardized Pandora network has been set up by NASA (Herman et al., 2009; Tzortziou et al., 2014; Pandora project: http://pandora.gsfc.nasa.gov, last access: 29 October 2020) and extended by ESA and LuftBlick to form the PGN (Pandonia Global Network; http://www.pandonia-global-network.org/, last access: 29 October 2020). Pandora data used in this study originate mostly from the original NASA network, which includes more than 60 different sites covering different time periods (mostly campaign-based). In total, 15 Pandora direct sun instruments delivering at least 3 months of data have been considered here. They are listed in Table 3 with an indication of their location, ownership, availability (see also Fig. S2 in the Supplement) and references. Pandora instruments are generally operated in polluted areas (urban or suburban); however, the network also contains a few background/remote sites located in Europe, Asia and the United States. Valid data were selected as those with a normalized root-mean square of weighted spectral fitting residuals (WRMS) of less than 0.005 and an uncertainty in the NO2 retrieval of less than 0.05 DU (Alexander Cede, personal communication, 2015).
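The data selection described above can be sketched as a simple mask on the two quality indicators. The array names and the sample values are hypothetical; only the two thresholds come from the text. The conversion factor 1 DU = 2.6867 × 10^16 molec cm^-2 is the standard definition of the Dobson unit.

```python
import numpy as np

DU_TO_MOLEC_CM2 = 2.6867e16  # 1 Dobson unit expressed in molec cm^-2


def select_valid_pandora(vcd_du, unc_du, wrms, max_wrms=0.005, max_unc_du=0.05):
    """Keep measurements with WRMS < max_wrms and uncertainty < max_unc_du.

    Returns the retained columns converted to molec cm^-2 and the boolean mask.
    """
    vcd_du = np.asarray(vcd_du, dtype=float)
    unc_du = np.asarray(unc_du, dtype=float)
    wrms = np.asarray(wrms, dtype=float)
    keep = (wrms < max_wrms) & (unc_du < max_unc_du)
    return vcd_du[keep] * DU_TO_MOLEC_CM2, keep


# Hypothetical sample: the second point fails the WRMS test,
# the third fails the uncertainty test.
vcd_valid, mask = select_valid_pandora(
    vcd_du=[0.40, 0.55, 0.30, 0.25],
    unc_du=[0.01, 0.02, 0.08, 0.03],
    wrms=[0.002, 0.009, 0.001, 0.004],
)
print(mask, vcd_valid)
```

For reference, the 0.05 DU uncertainty threshold corresponds to about 1.3 × 10^15 molec cm^-2.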
Recent detailed studies in US and South Korean sites during DISCOVER-AQ have shown good agreement of Pandora instruments with aircraft in situ measurements, within 20 % on average, although larger differences are observed for individual sites (Choi et al., 2020), the largest discrepancies being found in Texas (Nowlan et al., 2018). Good agreement of a few percent between Pandora and GeoTASO has been reported by Judd et al. (2019), while differences increase when resampling the comparisons for larger simulated pixel sizes, up to about 40 % bias for 18 km × 18 km, similar to the bias found with OMI (50 %).
The Pandora spectrometers provide NO2 total vertical column observations, with a random uncertainty of about 2.7 × 10^14 molec cm^-2 and a systematic uncertainty of 2.7 × 10^15 molec cm^-2 (Herman et al., 2009). These account for DOAS fit systematic errors, random noise and uncertainties related to the estimation of the residual gas amount in the reference spectra. In the present study, direct sun tropospheric VCDs are derived from the measured total NO2 content after subtraction of the stratospheric part estimated using satellite data (SAT) (alone or within assimilation scheme; see Sect. 2), interpolated to the geolocation of the Pandora spectrometer:

$$ \mathrm{VCD}_{\mathrm{tropo}}^{\mathrm{DS}} = \mathrm{VCD}_{\mathrm{total}}^{\mathrm{DS}} - \mathrm{VCD}_{\mathrm{strato}}^{\mathrm{SAT}}. \qquad (2) $$

Summing the Pandora uncertainty and the uncertainty on the stratospheric column in quadrature, this approach leads to an uncertainty of about 2.75 × 10^15 molec cm^-2 on the tropospheric column from direct sun data. It should be noted that this approach leads to retrieval of the total tropospheric column from the direct sun, while the tropospheric column from MAX-DOAS represents mainly the boundary layer.

Table 2. MAX-DOAS tropospheric NO2 datasets included in this study (23 stations, 15 with profiles). GA stands for geometrical approximation, OEM for optimal estimation method and PP for parameterized profiling.
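The subtraction of the stratospheric column and the quadrature sum of uncertainties described above can be checked with a short sketch. The function names and column values are hypothetical. The stratospheric uncertainty of 0.5 × 10^15 molec cm^-2 is an assumption taken from the stratosphere-troposphere separation uncertainty quoted in Sect. 2; it reproduces the stated total of about 2.75 × 10^15 molec cm^-2.

```python
import math


def direct_sun_tropo_vcd(vcd_total_ds, vcd_strato_sat):
    """Tropospheric column from a direct sun total column and a
    satellite-estimated stratospheric column (same units)."""
    return vcd_total_ds - vcd_strato_sat


def quadrature_sum(*uncertainties):
    """Combine independent uncertainties as the root of the sum of squares."""
    return math.sqrt(sum(u * u for u in uncertainties))


# Hypothetical columns in molec cm^-2.
vcd_tropo_ds = direct_sun_tropo_vcd(15.0e15, 3.5e15)

sigma_pandora = 2.7e15  # systematic Pandora uncertainty quoted in the text
sigma_strato = 0.5e15   # assumed stratospheric column uncertainty (Sect. 2)
sigma_tropo = quadrature_sum(sigma_pandora, sigma_strato)

print(f"VCD_tropo,DS = {vcd_tropo_ds:.2e} molec cm^-2")
print(f"sigma_tropo  = {sigma_tropo:.3e} molec cm^-2")
```

The combined value is dominated by the Pandora systematic term, so the result is only weakly sensitive to the stratospheric uncertainty adopted.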

Comparison method
For the comparison, GOME-2A and OMI data were extracted within a radius of 50 km around the 36 stations listed in Tables 2 and 3, with only pixels having a cloud radiance fraction < 50 % and an AMF ratio (AMF_tropo/AMF_geom) > 0.2 being selected. In the case of OMI, pixels affected by the row anomaly were filtered out. As the pointing direction and horizontal sensitivity length are not reported for all ground-based instruments, our baseline approach is to consider only pixels encompassing the station location. However, a sensitivity test has been performed at the Xianghe station (where both parameters are provided in the data files) by selecting all pixels crossing the MAX-DOAS line of sight. Comparison results were found to be close to those from the baseline case, with only 10 additional coincident days.
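The extraction and filtering criteria above can be sketched as follows. This covers only the 50 km radius, the cloud radiance fraction and the AMF ratio tests; the baseline test that a pixel polygon contains the station, and the OMI row anomaly flag, are not implemented. The dictionary keys, the helper names and the coordinates are hypothetical and are not taken from the product files or the station tables.

```python
import math

EARTH_RADIUS_KM = 6371.0


def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine distance between two points given in degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2.0 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def keep_pixel(pixel, station_lat, station_lon,
               max_dist_km=50.0, max_cloud_rad_frac=0.5, min_amf_ratio=0.2):
    """Apply the distance, cloud radiance fraction and AMF ratio criteria."""
    dist = great_circle_km(pixel["lat"], pixel["lon"], station_lat, station_lon)
    amf_ratio = pixel["amf_tropo"] / pixel["amf_geom"]
    return (dist <= max_dist_km
            and pixel["cloud_radiance_fraction"] < max_cloud_rad_frac
            and amf_ratio > min_amf_ratio)


# Illustrative station position and pixels (hypothetical values).
station = (39.75, 116.96)
near_clear = {"lat": 39.80, "lon": 117.00, "cloud_radiance_fraction": 0.1,
              "amf_tropo": 1.1, "amf_geom": 2.5}
far_clear = dict(near_clear, lat=40.50)
near_cloudy = dict(near_clear, cloud_radiance_fraction=0.6)

for name, px in [("near_clear", near_clear), ("far_clear", far_clear),
                 ("near_cloudy", near_cloudy)]:
    print(name, keep_pixel(px, *station))
```

Keeping the thresholds as keyword arguments makes it straightforward to repeat the comparison with stricter coincidence criteria.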
To reduce the differences in spatial resolution of the satellite measurements (GOME-2A: 40 km × 80 km; OMI: 13 km × 24 km at best) compared to the ground-based sensitivity (horizontal length of the probed air mass up to ∼ 20 km), the largest pixels from each instrument dataset were removed: only pixels with an across-track width smaller than 100 km for GOME-2A and smaller than 40 km for OMI were kept in the comparisons. Previous studies have investigated the use of stricter coincidence criteria as a way to overcome spatial resolution differences. For example, Irie et al. (2008a) showed differences of up to 25 % in satellite VCD between pixels located 5 to 50 km away from the site, and only OMI pixels centered within 0.1° × 0.1° of the MAX-DOAS stations were considered in the validation. Other approaches have averaged MAX-DOAS VCDs made in several azimuth directions (Celarier et al., 2008; Ortega et al., 2015) or have excluded MAX-DOAS measurements with a relative uncertainty ≥ 10 % (Vlemmix et al., 2010).
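The pixel filters described above (cloud radiance fraction, AMF ratio, across-track width and distance to the station) can be sketched as below. This is only an illustration under stated assumptions: the `Pixel` fields and function names are hypothetical, the station coordinates are invented, and the distance test uses pixel centers on a spherical Earth, whereas the baseline selection of pixels containing the station would require the pixel corner coordinates.

```python
import math
from dataclasses import dataclass


@dataclass
class Pixel:
    lat: float                      # pixel center latitude (degrees)
    lon: float                      # pixel center longitude (degrees)
    cloud_radiance_fraction: float  # 0..1
    amf_tropo: float
    amf_geom: float
    across_track_km: float


# Across-track width limits quoted in the text.
MAX_WIDTH_KM = {"GOME-2A": 100.0, "OMI": 40.0}


def distance_km(lat1, lon1, lat2, lon2):
    """Great-circle (haversine) distance on a sphere of radius 6371 km."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))


def select_pixels(pixels, station_lat, station_lon, instrument, radius_km=50.0):
    """Keep pixels passing the cloud, AMF-ratio, width and distance filters."""
    return [
        p for p in pixels
        if p.cloud_radiance_fraction < 0.5
        and p.amf_tropo / p.amf_geom > 0.2
        and p.across_track_km < MAX_WIDTH_KM[instrument]
        and distance_km(p.lat, p.lon, station_lat, station_lon) <= radius_km
    ]


# Invented example: one acceptable pixel and one cloudy pixel near a station at 40N, 117E.
example = [Pixel(40.1, 117.0, 0.2, 1.0, 2.5, 30.0), Pixel(40.1, 117.0, 0.7, 1.0, 2.5, 30.0)]
print(len(select_pixels(example, 40.0, 117.0, "OMI")))
```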
Ground-based (GB) MAX-DOAS data were interpolated to the satellite overpass time, and the presence of data within ±1 h was verified in order to avoid large interpolation errors. Pandora direct sun measurements have a much higher acquisition rate (approximately 30 acquisitions per hour, compared to typically one to four MAX-DOAS measurements), and the data filtering does not always remove strong NO2 variations, so Pandora measurements within a 1 h window (±30 min) around the satellite overpass time were averaged. On this basis, in addition to the daily comparisons at each station, corresponding monthly averages were also compared.
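The two temporal matching rules can be sketched as follows. This is a hedged illustration rather than the study's code: the function names are hypothetical, times are expressed in decimal hours and assumed to be sorted, and the ±1 h check is interpreted here as requiring at least one MAX-DOAS measurement within 1 h of the overpass.

```python
from bisect import bisect_left
from statistics import mean


def maxdoas_at_overpass(times_h, vcds, overpass_h, max_gap_h=1.0):
    """Linearly interpolate MAX-DOAS VCDs to the overpass time.

    Returns None when no measurement lies within max_gap_h of the overpass.
    times_h must be sorted in strictly increasing order.
    """
    if not times_h or min(abs(t - overpass_h) for t in times_h) > max_gap_h:
        return None
    i = bisect_left(times_h, overpass_h)
    if i == 0:                 # overpass before the first measurement
        return vcds[0]
    if i == len(times_h):      # overpass after the last measurement
        return vcds[-1]
    t0, t1 = times_h[i - 1], times_h[i]
    w = (overpass_h - t0) / (t1 - t0)
    return vcds[i - 1] + w * (vcds[i] - vcds[i - 1])


def pandora_at_overpass(times_h, vcds, overpass_h, half_window_h=0.5):
    """Average all Pandora VCDs within +/- half_window_h of the overpass time."""
    selected = [v for t, v in zip(times_h, vcds) if abs(t - overpass_h) <= half_window_h]
    return mean(selected) if selected else None


# Invented example values (VCDs in 1e15 molec cm^-2).
print(maxdoas_at_overpass([9.0, 10.0], [10.0, 20.0], 9.5))
print(pandora_at_overpass([12.8, 13.4, 13.6, 14.5], [1.0, 2.0, 4.0, 100.0], 13.5))
```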
As an example, Fig. 2 shows the results of our analysis for the Xianghe MAX-DOAS site. Pollution episodes are well captured by both GOME-2A and OMI, as are seasonal variations characterized by high NO2 VCDs in winter and low values in summer. Quantitatively, the agreement over the whole time series is good, with correlation coefficients R of 0.88 and 0.94 and linear regression slopes of about 0.79 and 0.93 for the monthly GOME-2A and OMI data respectively. VCDtropo differences (SAT − GB, in ×10¹⁵ molec cm⁻²) and percent relative differences (100 · (SAT − GB)/GB, in %) were calculated for each site. For Xianghe the median bias is about −2 × 10¹⁵ molec cm⁻² (−8 %) and 0.7 × 10¹⁵ molec cm⁻² (−4.4 %) for GOME-2A and OMI data respectively. Values for each site are reported in Table S1 in the Supplement for GOME-2A and OMI, with daily and monthly statistics for the correlation coefficient R, slope S and intercept I of a linear regression, and mean and median monthly absolute and relative biases. Depending on the length of the ground-based time series, the number of daily comparison points can vary significantly, from at least 3 months of data to several years of continuous measurements. Figure 3 presents an overview of the tropospheric and stratospheric NO2 columns measured at each station, as obtained from the satellite-to-ground-based coincidences. The tropospheric columns correspond to the ground-based data as selected in Sect. 4 (including, for the direct sun case, the subtraction of the satellite-estimated stratospheric content; see Sect. 3), while the stratospheric columns are the satellite estimations. As can be seen from the box-and-whisker plot, the tropospheric content varies strongly among the stations, the observed median columns ranging from 1 × 10¹⁵ molec cm⁻² in rural places (Hohenpeissenberg, Réunion, Cape Hedo, Mauna Loa, Izaña) to about 30 to 40 × 10¹⁵ molec cm⁻² in highly polluted sites (Beijing, Seoul, Beijing-CMA). 
As can also be seen, tropospheric columns selected at GOME-2A overpass times (i.e., in the morning) are usually larger than those selected at OMI overpass time (13:30±0:90), which is explained by lower OH levels and somewhat higher NOx emissions, leading to slower NO2 chemical loss midmorning (09:30) compared to noon (13:30) (Boersma et al., 2008; Kim et al., 2009). Note that the median tropospheric column is negative at the mountaintop stations of Izaña and Mauna Loa. This is caused either by a slight underestimation of the Pandora total columns or by a slight overestimation of the stratospheric columns derived from satellite data. This discrepancy is under investigation and will be the subject of a future study.
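The comparison statistics used throughout (absolute difference SAT − GB, relative difference 100 · (SAT − GB)/GB, median bias, and the regression parameters R, S and I) can be sketched as below. This is a minimal illustration with invented numbers; the function names are hypothetical and ordinary least squares is assumed, since the regression type is not specified in this section.

```python
import math
from statistics import median


def bias_stats(sat, gb):
    """Median absolute difference (SAT - GB) and median relative difference (%)."""
    diffs = [s - g for s, g in zip(sat, gb)]
    rel = [100.0 * (s - g) / g for s, g in zip(sat, gb)]
    return median(diffs), median(rel)


def linear_fit(x, y):
    """Ordinary least squares y = S*x + I; returns (S, I, R) with R the Pearson coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    r = sxy / math.sqrt(sxx * syy)
    return slope, intercept, r


# Invented monthly means in 1e15 molec cm^-2: the satellite is 20 % low everywhere.
gb = [10.0, 20.0, 30.0, 40.0]
sat = [8.0, 16.0, 24.0, 32.0]
print(bias_stats(sat, gb))
print(linear_fit(gb, sat))
```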

Overview of the ground-based datasets
Due to different deployment strategies, the direct-sun-measuring instruments (especially Pandora instruments) were located closer to strong NO2 emission sources than MAX-DOAS instruments, which sample both polluted and background sites. Most stations measured NO2 tropospheric columns in the 2 to 20 × 10¹⁵ molec cm⁻² range (about 18 MAX-DOAS stations and 10 direct sun stations). Moreover, being able to also measure under partially cloudy conditions, MAX-DOAS sites tend to sample the full variability of the NO2 field at measurement sites, while direct sun data preferentially sample clear-sky conditions. As a result, MAX-DOAS sites tend to display a larger variability, as can be judged from the larger boxes (25 % to 75 %) and lines (9 % to 91 %) in the box-and-whisker plots of Fig. 3a. Figure 3b presents the stratospheric columns derived from the two satellites. Values typically range between 2 × 10¹⁵ and 3.5 × 10¹⁵ molec cm⁻². The difference of about 0.6 (up to 1) × 10¹⁵ molec cm⁻² between the GOME-2A and OMI data is consistent with the known diurnal variation of the stratospheric NO2, which results from the NO/NO2 equilibrium and the progressive photodissociation of N2O5 during the day (Dirksen et al., 2011; Belmonte Rivas et al., 2014; van Geffen et al., 2015). Minimum values of the stratospheric column are obtained over the equatorial sites (Nairobi, Bujumbura and Mauna Loa).
The validity of the tropospheric estimation approach applied to the direct sun data (see Sect. 3.2 and Eq. 2) was verified at stations where both MAX-DOAS and direct sun measurements are performed. This is the case for three sites: Beijing, Xianghe and Thessaloniki. Combining these three datasets, Fig. 4 displays a scatterplot of the tropospheric NO2 columns measured by both techniques. Results are shown separately for GOME-2A and OMI overpass times. In both cases, a high level of correlation is obtained (linear correlation coefficient > 0.95). The corresponding linear regression slopes are 1.09 ± 0.02 and 1.06 ± 0.01 for OMI and GOME-2A overpasses respectively, with intercepts of −3.5 × 10¹⁵ and −0.6 × 10¹⁵ molec cm⁻². These results suggest that MAX-DOAS and direct sun data show a small relative bias of about 10 %–15 %. Part of this bias, which could change depending on pollution levels, may arise from the satellite-based stratospheric correction applied to direct sun data. However, it should be noted that MAX-DOAS and direct sun measurements are not synchronized, with typical differences in measuring time of about half an hour for these stations. The NO2 variability (which can be large in polluted sites) therefore probably contributes to the observed scatter and apparent bias. Furthermore, MAX-DOAS and direct sun instruments observe different air masses, which might lead to differences in the presence of horizontally inhomogeneous air masses. Another approach to verify the consistency of the ground-based dataset is to investigate the coherence between measurements at sites that are geographically close to each other. For example, NASA-HQ and GSFC are very close to each other, but measurements were performed by different Pandora instruments and during different time periods. 
Their median VCDtropo differences for the overlapping days are about 4.4 and 7.8 × 10¹⁴ molec cm⁻² at the OMI and GOME-2A overpasses respectively, in line with the expected uncertainty/variability of these ground-based data. Beijing and Beijing-CMA sites are interesting to compare since both are located inside the city, at a mutual distance of about 6 km. The first instrument has been measuring on the roof of the Institute of Atmospheric Physics (IAP) (Clémer et al., 2010), the second at the China Meteorological Administration (Ma et al., 2013). Both instruments have already been compared in Hendrick et al. (2014), showing good agreement (differences of about −2 % in winter and 3 % to 4 % for the rest of the period). When comparing their columns at the satellite colocations, they present differences of about 1.7 and 6 × 10¹⁵ molec cm⁻² at OMI and GOME-2A overpass times, respectively (12 % to 15 %). Another example is Chiba and Yokosuka. Both of these sites are situated in the urban area of Tokyo Bay but at about 53 km distance from each other. Their median differences from OMI and GOME-2A are 5.7 and 14.2 × 10¹⁵ molec cm⁻² respectively (69 % to 82 %).

Comparison of ground-based and satellite datasets
The comparison methodology illustrated in Fig. 2 has been extended to the 23 MAX-DOAS and 16 direct sun stations gathered in this study. As expected, results show a clear dependence on the location of the comparison site. The best agreement is obtained in background/remote conditions, while comparisons are more challenging close to the sources, where the NO2 field is more heterogeneous (Chen et al., 2009; Irie et al., 2012; Ma et al., 2013; Pinardi et al., 2014). To illustrate this point, the different stations have been qualitatively classified by the station PIs into urban, suburban and background sites (see Tables 2 and 3), based on their location with respect to known pollution sources. This classification is not based on NO2 levels but reflects the influence of the surrounding areas. For example, Xianghe station is in a polluted background with high NO2 levels (see Fig. 3), but it is located at a relatively large distance from surrounding urban areas and is thus classified as suburban. Figure 5 presents monthly mean scatterplots of the GOME-2A GDP 4.8 data against ground-based measurements at the different stations. Different sites are plotted in different colors, and results are grouped separately for MAX-DOAS and direct sun data as well as for urban and background/suburban stations. As can be seen, satellite and ground-based data generally correlate well, with correlation coefficients ranging between 0.75 and 0.96 and linear regression slopes between 0.37 and 0.83. For more details on the statistical analysis of the regressions, see Table 4. It is clear that smaller slopes, larger biases and larger root mean square (rms) values are found at urban locations compared to background/suburban ones. Note also that smaller biases are obtained for OMI than for GOME-2A in all cases except for the comparisons against direct sun data in background/suburban sites, where the differences between the two satellites are small (about −19.6 % and −21.3 %).
Figure caption fragment: Linear regression values are given as an inset for each case (correlation coefficient R, slope S and intercept I), and the number of months for each station is given in brackets in the legend. Pixel selection: GOME-2A pixel size < 100 km (i.e., removing backscans) over the stations.

The median relative biases (SAT − GB)/GB at each site are presented as a color-coded map in Fig. 6. Satellite data display a negative bias against ground-based reference data at all stations, except UHMT-Houston, which is a coastal site, highly heterogeneous in nature (Loughner et al., 2014; Martins et al., 2016). Negative biases of about −80 % are observed in Bujumbura and Nairobi, which can be related to the small NO2 signal and the localized nature of the sources at these sites, combined with a complex orography (Gielen et al., 2017; Compernolle et al., 2020). Systematic uncertainties in the estimation of the stratospheric column in satellite datasets could also contribute to the observed underestimation, considering the overall small tropospheric NO2 signals at these locations. For example, Valks et al. (2011) have shown that small-scale variations visible in the IFS-MOZART stratospheric NO2 field could not be captured by the GOME-2A stratosphere–troposphere separation algorithm, due to limitations of the spatial filtering approach. In particular this might be the case at the Izaña and Mauna Loa stations (see Fig. 3a), where the satellite stratospheric column is found to exceed the total column NO2 derived from ground-based direct sun measurements. Finally, issues related to the use of inadequate ancillary datasets might also affect the accuracy of the satellite NO2 columns. 
This can be due to the coarse spatial resolution of models used as a priori information (from 1.875 to 3° here; see Table 1) or their temporal sampling (monthly values from 1997 or daily profiles; see Table 1). It has also been shown that known uncertainties in albedo climatologies result in NO2 column uncertainties of 3 %–6 %, while errors in model input are responsible for up to 20 % of error on the retrieved NO2 columns. Looking at the details of the comparison results at each station (Fig. 6 and values in Table S1 in the Supplement), we find that GOME-2A and OMI present a similar behavior at a significant number of stations. Biases, however, tend to be slightly larger for GOME-2A. For example, in the megacity of Beijing, the median monthly mean bias is −32 % for OMI and −42 % for GOME-2A when considering direct sun cases, −24 % and −45 % for the Beijing MAX-DOAS case and −33 % and −49 % for the Beijing-CMA MAX-DOAS case. In Xianghe, which is a suburban site, the biases are smaller (−4 % and −8 % for MAX-DOAS), as expected. Table S1 provides a complete overview of the monthly bias results obtained when comparing OMI and GOME-2A to MAX-DOAS and direct sun instruments. Aside from the stations showing coherent validation results for OMI and GOME-2A (about 9 out of 16 direct sun sites and 8 out of 23 MAX-DOAS sites have OMI and GOME-2A biases that differ by less than 15 %), others are characterized by much larger differences, especially in remote sites such as OHP, Réunion, Cape Hedo, Fukue, Tsukuba and Bujumbura. A few mountaintop or high-altitude sites present very large relative biases, such as Nairobi (about −80 %), Mauna Loa (about −60 %) and Izaña (−200 % to −210 %). At Réunion and Bujumbura, only GOME-2A results display large biases (−76 % compared to 5 % for Réunion, and −84 % compared to −46 % for Bujumbura). Significant differences between ground-based MAX-DOAS and both OMI QA4ECV and OMI NASA were also reported by Compernolle et al. (2020) in OHP, Bujumbura, Nairobi and Mainz.
However, for some of these stations, these results only rely on a very small subset of comparison points (5 d for OMI comparisons at Mauna Loa, 14 d for Thessaloniki direct sun, 3 d for Nairobi, 11 d for Réunion, 12 d for Hohenpeissenberg), and in the next section we test the impact of relaxing the comparison criteria, to select the closest pixel per day, within the maximum radius of 50 km.

Impact of the satellite pixel selection
As expected, for a large number of stations, selecting pixels that do not contain the stations increases the comparison statistics but also changes the comparison results. This is especially the case for OMI. The change in coincidence selection is presented in Table S1 for each station. The following conclusions can be drawn for OMI.
- MAX-DOAS measurements: for 15 sites out of 23 there is a significant (more than 5 %) difference between results obtained using all the pixels and only those intersecting the stations. At most stations, the stricter colocation criterion results in smaller biases (by up to ∼ 20 %). In order to better understand the impact of changing the pixel selection criteria, additional tests were performed for two megacities characterized by extremely high NO2 levels (see Fig. 3). Figure 7 illustrates, for Beijing, Beijing-CMA, Xianghe and Seoul, the impact of making different choices on the OMI pixel size and location. For the strictest selection criterion (OMI pixels smaller than 40 km and located above the stations), we see a significantly smaller bias and spread of the comparison in Seoul for direct sun data and only a slight difference in the median bias for the Beijing/Beijing-CMA data. For Xianghe, the impact appears to be moderate or even negligible, as expected due to the suburban nature of this site. Differences in the results for the two Beijing sites are to be considered in light of the different measurement times (Table 1) and NO2 levels (Fig. 3): measurements in Beijing (median NO2 of about 20 × 10¹⁵ molec cm⁻²) were performed in 2008-2009 during the Olympic Games, while measurements at the CMA building (median of 35 × 10¹⁵ molec cm⁻²) covered the period from 2009 to 2011. For Seoul, where measurements were performed in 2012-2015 (median NO2 of 35 × 10¹⁵ molec cm⁻²), the metropolitan area extends over more than 11700 km². In this case, as can be seen in Fig. S23 in the Supplement, the NO2 signal is inhomogeneously spread over the city, and the instrument is not centered at the maximum of the satellite NO2 observations. As a result, the selection of pixels in strict overpass with the site has a larger impact than for Beijing, where the MAX-DOAS instrument is located in the center of the city (Fig. 7). 
This is in line with the findings of Duncan et al. (2016). Analyzing OMI data over the period from 2005 to 2014, they found a complex spatial distribution of the NO2 trends characterized by a decrease in the Seoul metropolitan area and an increase outside of the city center. Because emission changes are heterogeneous, the calculated trend depends strongly on the location within the city (from about −30 % to +10 %). For the Beijing case, Duncan et al. (2016) also showed a reduction of the tropospheric NO2 (by about −10.3 % from 2005 to 2014), with a minimum in 2008 at the time of the Olympic Games. Figure 8 summarizes the change in biases for the station ensemble, for the three pixel selection cases presented for OMI. As can be seen, restricting the comparison to small pixel sizes (from 100 to 40 km) improves the median bias and reduces the comparison spread. Further restricting the comparison to pixels in strict overpass with the stations also reduces the spread, but it changes the median bias only slightly and comes at the expense of a large number of comparison days.
For GOME-2A (not shown), both of these effects are much smaller, as the pixel side is always about 80 km long, so that when the pixel center is within the 50 km radius, part of the pixel usually covers the station.
When considering the results as a whole, the most prominent feature is the systematic underestimation of ground-based data by both satellite datasets for most of the sites. This underestimation is most prominent at urban sites close to the sources, but it is also found at background/suburban sites and cannot be fully explained by the satellite uncertainties (see Sect. 2). The differences observed between OMI and GOME-2A can be related to instrumental characteristics (e.g., differences in pixel size) but also to details of the applied retrieval methods (see Table 1 and Sect. 2). Several studies have discussed in detail the impact of algorithmic differences on the NO2 column uncertainty, which can reach 42 %, mainly due to tropospheric AMF uncertainties (Lorente et al., 2017). The underestimation of the NO2 satellite products identified here at a large number of stations confirms what was obtained in previous validation exercises using fewer sites and different satellite products (Brinksma et al., 2008; Vlemmix et al., 2010; Irie et al., 2008a, 2012; Lin et al., 2014; Halla et al., 2011; Shaiganfar et al., 2011; Ma et al., 2013; Kanaya et al., 2014; Wang et al., 2017b; Mendolia et al., 2013; Tzortziou et al., 2014; Lamsal et al., 2014; Drosoglou et al., 2017; Herman et al., 2019; Judd et al., 2019; Compernolle et al., 2020). These studies generally reported small negative or positive biases over rural (unpolluted) measurement sites and stronger (systematic) negative biases over urban polluted sites.
One way to understand these results is to consider the impact of the spatial resolution of the satellite measurements. For the case of rural sites, coincident satellite pixels can include areas with higher NO2 columns, leading to positive biases in the comparisons. In contrast, at urban locations characterized by strong NO2 sources, coincident pixels generally tend to include surrounding (suburban) areas. This effect is especially significant for satellite instruments measuring at coarse spatial resolution, such as GOME-2A. It can be attenuated in validation studies making use of long time periods and many stations; however, large localized NO2 concentrations will always tend to be underestimated. This is particularly true for satellite instruments characterized by horizontal resolution much coarser than the size of typical urban agglomerations (see Table 1). Note that the effect can be somewhat mitigated in the case of satellite retrievals using a priori profiles specified at high temporal and spatial resolution (Huijnen et al., 2010; Russell et al., 2011; Heckel et al., 2011; Lin et al., 2014; McLinden et al., 2014; Kuhlmann et al., 2015; Laughner et al., 2019; Goldberg et al., 2017). In the next section, we present an attempt to quantify the smearing effect around urban sites and use it to extend the validation pixel selection method, in order to increase the comparison statistic.

Figure 7. Impact of the OMI pixel size (pixels smaller than 100 and 40 km in grey and black respectively) and of filtering on pixels only above the station (blue) on the distribution of the differences between satellite and ground-based data at a few stations: Xianghe, Beijing, Beijing-CMA and Seoul. The number of comparison points is indicated on top with the corresponding colors. The box-and-whisker plots are defined as follows: crosses for the mean values, horizontal lines for the median, boxes for the 25th and 75th percentiles and vertical lines for the 9th and 91st percentiles.

Horizontal dilution effects
In order to investigate the horizontal variability of the NO2 field at the 36 different stations, 1 full year (2005) of the OMI NO2 QA4ECV dataset v1.1 was extracted to map the average NO2 column distribution on a grid of 0.025° × 0.025° in latitude-longitude. Such highly resolved gridded maps were obtained using a realistic representation of the OMI point spread function allowing the native OMI pixels to be subsampled (Sihler et al., 2017). Only the smallest OMI pixels (rows 11 to 49) were retained for this analysis. The corresponding high-resolution grids were used to quantify the systematic change in tropospheric NO2 between the position of the satellite pixels and the location of the stations, hereafter referred to as the "dilution effect". The approach used here is an extension of a similar method introduced by Chen et al. (2009) and Ma et al. (2013) based on high-resolution city night light maps used as a proxy for NO2 sources. Judd et al. (2019) also accurately quantified this effect in the New York area using airborne NO2 mapping data from the GeoTASO instrument. In our approach, the variation of the tropospheric NO2 VCD is sampled in concentric circles of different radii around each of the stations. Figure 9 illustrates the method for the Beijing (urban, Fig. 9a) and Xianghe (suburban, Fig. 9c) sites, which both present strongly inhomogeneous NO2 fields. Figure 9b and d show the NO2 VCD variation in concentric circles around the stations. In Beijing, the ground-based instrument is located close to the urban NO2 hotspot, so that the NO2 level decreases rapidly outwards. In contrast, a different behavior is found at the Xianghe station, which is located about 60 km to the east of the city center of Beijing. In this case, due to the influence of the surrounding emission sources, the mean NO2 column tends to slightly increase when moving away from the site in the direction of Beijing. 
For background sites, one expects the NO2 content to remain roughly constant around the station value. Horizontal variability effects have been documented in previous studies dealing with ozone and water vapor (Verhoelst et al., 2015), as well as with tropospheric NO2 (Duncan et al., 2016; Kim et al., 2016; Boersma et al., 2018), mostly to illustrate the impact of collocation mismatch errors on validation results. In our study, we propose a correction method applied to satellite data, which aims at reducing the impact of the smearing effect on comparisons.
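The sampling of the gridded NO2 field in concentric circles can be sketched as below. This is an illustration under stated assumptions, not the study's code: the function names are hypothetical, the statistics are taken over all grid cells within each radius, and the field is a synthetic Gaussian hotspot over a uniform background (in units of 10¹⁵ molec cm⁻²) standing in for the gridded OMI map.

```python
import math
from statistics import mean, median


def distance_km(lat1, lon1, lat2, lon2):
    """Great-circle (haversine) distance on a sphere of radius 6371 km."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))


def radial_profile(grid, site_lat, site_lon, radii_km):
    """Mean and median VCD of all grid cells within each radius around the site.

    grid is an iterable of (lat, lon, vcd); returns {radius: (mean, median)}.
    """
    dist_vcd = [(distance_km(lat, lon, site_lat, site_lon), v) for lat, lon, v in grid]
    profile = {}
    for r in radii_km:
        inside = [v for d, v in dist_vcd if d <= r]
        if inside:
            profile[r] = (mean(inside), median(inside))
    return profile


def synthetic_grid(site_lat, site_lon, step=0.05, half_width=1.0):
    """SYNTHETIC field: background of 5 plus a hotspot of 25 centered on the site."""
    n = int(round(half_width / step))
    cells = []
    for i in range(-n, n + 1):
        for j in range(-n, n + 1):
            lat, lon = site_lat + i * step, site_lon + j * step
            d = distance_km(lat, lon, site_lat, site_lon)
            cells.append((lat, lon, 5.0 + 25.0 * math.exp(-(d / 30.0) ** 2)))
    return cells


radii = [10, 20, 30, 40, 50]
profile = radial_profile(synthetic_grid(40.0, 116.4), 40.0, 116.4, radii)
for r in radii:
    print(r, round(profile[r][0], 2), round(profile[r][1], 2))
```

For a hotspot centered on the station, the mean decreases with increasing radius, which is the urban behavior described for Beijing; a station located beside the hotspot would instead show an increase, as described for Xianghe.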

Dilution correction method
Similarly to the studies of Chen et al. (2009) and Ma et al. (2013), a correction factor is calculated to quantify the change in NO2 between the ground-based site and the satellite pixel location. In our approach, the dilution factor (F_dil) is obtained from the OMI gridded files by taking the ratio between the average (mean or median) NO2 VCD at increasing distances from the site and the VCD value at the site. A second-order polynomial is then fitted to these ratio values as illustrated in Fig. 9 (panels b and d). Accordingly, F_dil is calculated using the following equation, where R represents the distance from the site and a, b and c are the fitted polynomial coefficients:

$$F_{\mathrm{dil}}(R) = a R^{2} + b R + c$$

In practice, F_dil is calculated from the median values of the gridded NO2 field for values of R from 0 to 50 km. For sites showing a negative slope in the dilution factor (i.e., a clear decrease of NO2 away from the site), a correction is applied to individual satellite measurements according to their respective distances. Typically, it is applied to large urban sites, stations isolated on small islands such as Réunion Island (Fig. S18 in the Supplement) and Izaña, and sites close to power plants. This ensemble is referred to as UIPP (urban, island and power plant) in the rest of the paper.
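The polynomial fit of the dilution factor can be sketched with NumPy as follows. This is a hedged illustration only: the function names and the ratio values are hypothetical, and the form of the correction itself (dividing the satellite column by F_dil evaluated at the pixel distance) is an assumption made for this sketch, since the corresponding expression is not given in this section.

```python
import numpy as np


def fit_dilution_factor(radii_km, ratios):
    """Fit a second-order polynomial F_dil(R) to VCD(R)/VCD(site) ratio values."""
    return np.poly1d(np.polyfit(radii_km, ratios, deg=2))


def correct_vcd(sat_vcd, pixel_distance_km, f_dil):
    """ASSUMPTION: the corrected column is the satellite VCD divided by F_dil(R)."""
    return sat_vcd / f_dil(pixel_distance_km)


# Invented ratio values for an urban-like site: NO2 decreases away from the station.
radii_km = np.arange(0.0, 51.0, 5.0)                  # 0, 5, ..., 50 km
ratios = 1.0 - 0.01 * radii_km + 5.0e-5 * radii_km**2

f_dil = fit_dilution_factor(radii_km, ratios)
print(f_dil(30.0))                                    # dilution factor 30 km from the site
print(correct_vcd(10.0, 30.0, f_dil))                 # corrected column (1e15 molec cm^-2)
```

With a dilution factor below 1, the assumed correction scales the satellite column up, which is the direction needed to reduce a negative urban bias.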

Impact of the dilution correction
The improvement brought by the dilution correction is illustrated in Fig. 10, where the slopes of the linear regressions from daily scatterplots are presented for each station separately with and without dilution correction. In order to limit the impact of outliers (especially the large columns that strongly affect the regression analysis), daily comparison points are filtered for values larger than the 75th percentile of the ground-based values of each station. This selection excludes large local values that cannot be captured by satellite measurements and allows for a more robust statistical regression analysis. In each panel, the case denoted "all" corresponds to a combined analysis including the data from all stations together. This is different from averaging the stations' slopes, as the different sites have a varying number of points. After application of the dilution correction, regression slopes improve (and come closer to unity) for all cases except De Bilt. However, for some sites, there seems to be an overcorrection effect (Athens/GOME-2A, UHMT/GOME-2A, Beijing (both sites)/OMI and Réunion/OMI), while a negative slope is obtained at a few other sites (e.g., Mauna Loa/GOME-2A and Réunion/GOME-2A). As already discussed in Sect. 5.1, for direct sun stations this could be related to issues with the determination of stratospheric columns in the satellite algorithm. UHMT is a peculiar site, where several studies performed during the DISCOVER-AQ 2013 Texas campaign (Nowlan et al., 2018; Choi et al., 2020) suggested that the Pandora NO2 measurements tend to be too low. Finally, some sites (e.g., Nairobi, Bujumbura, Thessaloniki, Izaña) display very small slopes, probably due to the fact that these sites are characterized by very local sources or by nonsymmetric NO2 distributions. This is clearly the case for isolated islands where the NO2 can be locally trapped due to orography (see Figs. S19, S22, S24 in the Supplement). 
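The percentile filter applied before the regression can be sketched as below. This is a minimal illustration with invented data: the function names are hypothetical, the 75th percentile is computed with the inclusive method of the Python standard library (the percentile definition used in the study is not stated), and an ordinary least-squares slope is assumed.

```python
from statistics import quantiles


def filter_below_p75(sat, gb):
    """Keep (sat, gb) pairs whose ground-based value is at most the 75th percentile of gb."""
    p75 = quantiles(gb, n=4, method="inclusive")[2]
    return [(s, g) for s, g in zip(sat, gb) if g <= p75]


def ols_slope(pairs):
    """Least-squares slope of satellite values regressed on ground-based values."""
    n = len(pairs)
    mean_s = sum(s for s, _ in pairs) / n
    mean_g = sum(g for _, g in pairs) / n
    sxy = sum((g - mean_g) * (s - mean_s) for s, g in pairs)
    sxx = sum((g - mean_g) ** 2 for _, g in pairs)
    return sxy / sxx


# Invented daily values (1e15 molec cm^-2): the last day is a local pollution peak
# seen by the ground-based instrument but strongly diluted in the satellite pixel.
gb = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 40.0]
sat = [1.6, 3.2, 4.8, 6.4, 8.0, 9.6, 11.2, 16.0]

kept = filter_below_p75(sat, gb)
slope_all = ols_slope(list(zip(sat, gb)))
slope_filtered = ols_slope(kept)
print(len(kept), round(slope_all, 3), round(slope_filtered, 3))
```

In this synthetic case a single large ground-based column pulls the unfiltered slope well below the slope obtained from the remaining days, which illustrates why the largest columns are excluded from the regression.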
Figure 9. Dilution effect illustration for a typical urban (Beijing, a, b) and suburban (Xianghe, c, d) case. Panels (a, c) represent the 2005 yearly mean tropospheric NO2 gridded from OMI QA4ECV data at the resolution of 0.025° latitude × 0.025° longitude. The black dot indicates the station location, the two circles denote 50 and 100 km radii around the station and the red box represents the outer extent of any 80 km × 40 km GOME-2A pixels whose centers are within the 50 km radius. Panels (b, d) display the mean (black) and median (red) NO2 values at increasing colocation radii (expressed in kilometers), with the variability (1 standard deviation) given as an error bar around the mean.

An alternative dilution correction approach taking into account the geographical extent of the satellite pixel and its localization in the NO2 field has been tested. In order to estimate an uncertainty on our correction method, we applied this modified scheme to two extreme urban cases (Beijing and UHMT) and two moderate cases (Xianghe and Uccle). Differences amounting to about half the value of the current dilution correction are obtained. Figure 11 displays monthly scatterplots of GOME-2A and ground-based data for all the UIPP stations, i.e., those at which a dilution correction was applied. Data points corresponding to values larger than the 75th percentile are represented as grey points. The two upper plots show results without correction for MAX-DOAS (left) and direct sun (right) datasets, while corrected data are represented similarly in the lower plots. Again, the impact of the dilution correction is clearly apparent. The regression slope increases from 0.52 to 0.76 for MAX-DOAS and from 0.67 to 1.1 for direct sun data. The impact of excluding the largest columns from the regression analysis can be judged by comparing the grey and black lines, respectively obtained without and with filtering. 
As can be seen, direct sun data are more affected by this filtering (slope increase from 0.38 to 0.67) than MAX-DOAS ones (slope increase from 0.49 to 0.52). This is likely related to the fact that, as already mentioned, direct sun instruments (especially Pandora instruments) tend to be located closer to strong NO2 emission sources than MAX-DOAS instruments. Other potential reasons are (1) the higher uncertainty in determining the true NO2 column amount in the reference spectrum and (2) the more spatially localized direct sun measurements, especially at high sun. Moreover, the Pandora DOAS analysis is performed with the NO2 absorption cross section at an effective temperature of 254 K, while MAX-DOAS is typically analyzed for a temperature of 298 K. Spinei et al. (2014) showed that at polluted sites during hot summer months this could result in a 5 %–10 % underestimation of the NO2 total column derived from the direct sun data compared to the retrieval results at the true effective temperature. Table 5 lists the statistical parameters from regression analyses performed with and without the dilution correction for all the UIPP stations and the different satellite products. Generally speaking, validation results obtained using both MAX-DOAS and direct sun systems appear to be consistent, although direct sun observations tend to agree slightly better with the satellite data. In the case of direct sun data, however, we note that the dilution correction tends to overcorrect satellite measurements (see also Fig. 11), also resulting in slightly larger rms values for the dilution-corrected cases. It is also interesting to note in Table 5 that the intercepts are always positive, which could point to a systematic additive bias, possibly coming from an underestimation of the stratospheric (slant) columns. A bias of about −0.2 × 10¹⁵ molec cm⁻² has been reported by Compernolle et al. (2020) when comparing the OMI QA4ECV assimilated stratospheric columns (based on an approach similar to the one used in the OMI DOMINO algorithm) to ground-based zenith-sky data. This bias was reduced to about −0.01 × 10¹⁵ molec cm⁻² when using the STREAM approach. Investigation of the impact of the smoother STREAM stratosphere on the tropospheric validation results is out of the scope of this study but would be interesting, as the small stratospheric errors can be amplified by the AMFs. Considering all the stations together, Fig. 12 presents an overview of the differences between satellite and ground-based datasets, for the original comparisons (in black) and after dilution correction (in red). We make the distinction between two different approaches for the selection of the coincident pixels: closest cloud-free (cloud radiance fraction < 50 %) pixel and mean value of all cloud-free pixels within a radius of 50 km. Results are also given separately for MAX-DOAS sites (upper plot) and direct sun sites (lower plot).
As can be seen, the overall agreement between satellite and ground-based datasets is better for OMI comparisons, and, after dilution correction, it is slightly better for direct sun than for MAX-DOAS sites. Again, this is likely related to the fact that direct sun instruments (of Pandora type) tend to be located closer to strong NO 2 emission sources. Moreover, as also discussed previously, MAX-DOAS sites report measurements under a larger variability of conditions (both clear-sky and cloudy), leading to an increased spread of the comparisons. Generally speaking the dilution correction pushes biases closer to zero and often reduces the spread of the differences. The best results are obtained with OMI, when comparing direct sun tropospheric columns to the closest pixel of the satellite. In this case, the median bias of −1.16 × 10 15 molec cm −2 obtained is reduced to −0.23 × 10 15 molec cm −2 after application of the dilution correction. A similar improvement is found for the MAX-DOAS comparisons, from −0.95 to −0.47 × 10 15 molec cm −2 . We find that the selection of the daily closest pixel leads to smaller biases and spreads and a better agreement between median and mean values for both OMI and GOME-2A comparisons. Therefore, in the rest of the study, comparison results are exclusively based on coincidences determined using daily closest pixels.
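The two coincidence approaches compared above (closest cloud-free pixel versus mean of all cloud-free pixels within 50 km) can be illustrated with a minimal sketch. The pixel dictionary keys (`lat`, `lon`, `crf`, `vcd`), the function names and the use of pixel-centre great-circle distances are assumptions made for illustration only; they are not taken from the processing chain used in this study.

```python
import math

EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def select_pixels(pixels, station_lat, station_lon, max_crf=0.5, radius_km=50.0):
    """Return (closest cloud-free VCD, mean cloud-free VCD within radius_km).

    pixels: list of dicts with hypothetical keys 'lat', 'lon' (pixel centre),
    'crf' (cloud radiance fraction, 0-1) and 'vcd' (tropospheric column).
    Either element is None when no pixel qualifies.
    """
    # Keep cloud-free pixels only, paired with their distance to the station.
    clear = [
        (haversine_km(station_lat, station_lon, p["lat"], p["lon"]), p["vcd"])
        for p in pixels
        if p["crf"] < max_crf
    ]
    if not clear:
        return None, None
    closest_vcd = min(clear, key=lambda t: t[0])[1]
    inside = [vcd for dist, vcd in clear if dist <= radius_km]
    mean_vcd = sum(inside) / len(inside) if inside else None
    return closest_vcd, mean_vcd

# Example day over a station at (50.80 N, 4.35 E); values are invented.
day = [
    {"lat": 50.80, "lon": 4.36, "crf": 0.2, "vcd": 8.0},   # closest, clear
    {"lat": 51.00, "lon": 4.35, "crf": 0.3, "vcd": 4.0},   # about 22 km away, clear
    {"lat": 50.80, "lon": 4.351, "crf": 0.8, "vcd": 20.0}, # cloudy, rejected
    {"lat": 52.00, "lon": 4.35, "crf": 0.1, "vcd": 1.0},   # clear but beyond 50 km
]
closest, averaged = select_pixels(day, 50.80, 4.35)
```

In this invented example the averaged value is lower than the closest-pixel value because it mixes in a pixel farther from the source, which is the qualitative behaviour expected near localized emissions.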
Several sites submitted data for time periods longer than 1 year (see Tables 2 and 3 for details), allowing the seasonal dependence of the comparisons to be investigated. In Fig. 13, seasonally sorted bias values of GOME-2A and OMI against MAX-DOAS measurements are presented for six selected stations (Uccle, OHP, Beijing, Xianghe, Bujumbura and La Réunion). A dilution correction was applied to satellite datasets at three of these sites (La Réunion, Uccle and Beijing). Although comparison results are roughly consistent for all seasons, smaller biases seem to be observed in summer time at several stations of the Northern Hemisphere. This might be related to the shorter lifetime of NO 2 in the warm season and the associated reduced variability of its concentration. As already discussed in Sect. 5, for Bujumbura and Réunion Island, one observes larger negative biases for GOME-2A than for OMI, despite the dilution correction applied at both sites. Note that a large underestimation of QA4ECV OMI NO 2 VCDs was also reported by Compernolle et al. (2020) in Bujumbura. Our validation results do not point to major seasonal effects; however, it is generally good practice to base validation studies on complete annual cycles in order to properly sample all observational conditions.

Figure 11. Scatterplot of monthly mean GOME-2A GDP 4.8 NO 2 columns versus UIPP ground-based stations (MAX-DOAS instruments in (a, c) and direct sun instruments in (b, d)). Panels (a, b) present the original comparisons, and panels (c, d) those after applying the dilution correction. Calculations of the monthly mean values are performed after removal of the daily ground-based points larger than the 75th percentile of each station dataset. The monthly means without the filtering are presented in grey to illustrate the impact, and the number of remaining months for each station is given in brackets in the legend. Linear regression values are shown on each plot.
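The filtering used for the monthly means of Fig. 11 (removal of daily ground-based points above the 75th percentile of each station dataset before averaging) can be sketched as follows. The input layout, the function name and the percentile interpolation method are assumptions made for illustration; the exact percentile definition used in the study is not stated.

```python
from collections import defaultdict
from statistics import mean, quantiles

def filtered_monthly_means(daily):
    """Monthly mean (ground-based, satellite) pairs after percentile filtering.

    daily: list of (month_key, gb_vcd, sat_vcd) tuples for ONE station,
    e.g. ("2010-01", 8.2, 5.1). The layout is hypothetical.
    Days whose ground-based value exceeds the station's 75th percentile
    are dropped before the monthly averaging.
    """
    gb_values = [gb for _, gb, _ in daily]
    # Third quartile; the 'inclusive' method interpolates linearly
    # between data points (an assumption, see lead-in).
    p75 = quantiles(gb_values, n=4, method="inclusive")[2]

    by_month = defaultdict(list)
    for month, gb, sat in daily:
        if gb <= p75:
            by_month[month].append((gb, sat))

    return {
        month: (mean(gb for gb, _ in pairs), mean(sat for _, sat in pairs))
        for month, pairs in sorted(by_month.items())
    }

# Invented daily values (units of 1e15 molec cm-2) with one extreme day.
station_days = [
    ("2010-01", 1.0, 0.5),
    ("2010-01", 2.0, 1.0),
    ("2010-01", 3.0, 1.5),
    ("2010-02", 4.0, 2.0),
    ("2010-02", 100.0, 10.0),  # above the 75th percentile, removed
]
monthly = filtered_monthly_means(station_days)
```

The resulting monthly pairs could then be passed to any linear regression routine to obtain slopes and intercepts comparable in spirit to those quoted in the text.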
Although the dilution correction improves the agreement between the ground-based and satellite measurements, significant negative biases persist at some of the validation sites (see Fig. 10). This could be related to satellite retrieval issues but also to shortcomings in our correction approach, which relies on average NO 2 fields derived using 1 year (2005) of OMI data. These average fields are not necessarily representative of the actual day-to-day variability at all sites. This certainly contributes to the scatter of the comparisons but should have relatively little systematic effect on regression slopes. Seasonal behavior differences, not taken into account here, could also play a role. Moreover the OMI QA4ECV dataset, which has been selected as a source for estimating the correction factors, might have its own limitations. Trends in the last decades in NO 2 values worldwide (Georgoulias et al., 2019) can be a limiting factor for some of the stations. Using OMI for the correction also implies that the afternoon NO 2 is representative of the morning GOME-2A overpass, which is not entirely true. Another issue is the limited spatial resolution of OMI data and of its a priori profiles' assumption. High-resolution models (Drosoglou et al., 2017) or airborne imaging DOAS measurements (Judd et al., 2019) could provide a better source of information to correct the NO 2 distributions around the stations, but such data are currently not available at the global scale. Finally, ground-based instruments are assumed to provide point source measurements, while in reality the horizontal sensitivity area of MAX-DOAS measurements can be as large as several tens of kilometers (Irie et al., 2011). The provision of this information for all ground-based measurements would thus be very valuable to further improve the comparison method.
Note that in urban areas, the representativeness of MAX-DOAS observations for comparison with satellite data could be improved by making use of measurements in different azimuth directions (Ortega et al., 2015; Gratsea et al., 2016; Schreier et al., 2019; Dimitropoulou et al., 2020).

Overall validation results
Figures 14 and 15 present an overview of the absolute deviations and relative differences between OMI and GOME-2A tropospheric NO 2 column measurements and the reference ground-based MAX-DOAS and direct sun measurements considered in our study. For each sensor, deviations obtained without dilution correction are presented in panel (a), while biases and relative differences after application of the dilution correction are given in panels (b) and (c). For panels (a) and (b), the total median instrumental errors (satellite and ground-based errors summed in quadrature) are also given as grey bars. When comparing the deviations in (a) and (b), the improvement brought by the dilution correction is clear. One can also see that results obtained using MAX-DOAS and direct sun stations are consistent within the comparison uncertainties. Note that for a few urban sites (e.g., UHMT, Seoul), the dilution correction seems to overcorrect the satellite NO 2 columns, especially for OMI data. This is less clear for GOME-2A, indicating that the correction approach might be slightly too aggressive for the OMI case. It can also be seen that, except for a few cases, both satellite data products behave similarly at the different stations. Once corrected for the dilution effect, satellite measurements agree with ground-based data to within 25 % (black dotted lines). The blue lines represent the median bias of satellite measurements against all station data, when including the dilution correction and for ground-based VCDtropo > 2 × 10 15 molec cm −2 . The latter filtering is applied to remove outliers that would otherwise lead to unphysical mean percent values. Resulting median residual biases are −23.5 % for GOME-2A and −18 % for OMI. For the sake of completeness, the same analysis was also performed on QA4ECV v1.1 OMI and GOME-2A datasets, using the same selection criteria. Corresponding figures can be found in the Supplement (Figs. S4 and S5).
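As an illustration of the quantities discussed above, the following minimal sketch computes a comparison error by summing satellite and ground-based errors in quadrature, and a median relative bias restricted to ground-based columns above 2 × 10 15 molec cm −2 . The function names and the definition of the relative difference as (satellite − ground-based)/ground-based are assumptions made here for illustration.

```python
import math
from statistics import median

def comparison_error(sat_err, gb_err):
    """Satellite and ground-based errors summed in quadrature."""
    return math.hypot(sat_err, gb_err)

def median_relative_bias(pairs, gb_min=2.0):
    """Median of 100 * (sat - gb) / gb, in percent.

    pairs: list of (sat_vcd, gb_vcd) in units of 1e15 molec cm-2.
    Pairs with gb_vcd <= gb_min are skipped, since very small
    denominators produce unphysically large percentages.
    Returns None when no pair passes the threshold.
    """
    rel = [100.0 * (sat - gb) / gb for sat, gb in pairs if gb > gb_min]
    return median(rel) if rel else None

# Invented values: the last pair would give +400 % without the threshold.
pairs = [(8.0, 10.0), (3.0, 4.0), (6.0, 8.0), (0.5, 0.1)]
bias_pct = median_relative_bias(pairs)
err = comparison_error(3.0, 4.0)
```

The example shows why the threshold matters: a single near-zero ground-based column can dominate a mean of percentage differences, whereas the filtered median stays representative of the polluted cases.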
Similar results are found, although the QA4ECV products tend to display slightly larger residual bias values, both for the original comparisons and after dilution correction.

Figure 16 presents the overall GOME-2A and OMI biases for the different GDP, DOMINO and QA4ECV data products, for satellite pixels in strict coincidence with the stations. In the SAT-GB panel, grey bars present the estimated error on the median bias for each comparison case, estimated as

$$\varepsilon_{\mathrm{median}} = \frac{\mathrm{MAD}}{\sqrt{n}},$$

where n is the number of comparisons of each case (which can be different), and MAD is the median absolute deviation (see Huber, 1981), a robust indicator:

$$\mathrm{MAD} = k \cdot \mathrm{median}\left(\left|x_i - \mathrm{median}(x)\right|\right),$$

where x_i are the individual differences of the comparison case and k = 1.4826, for a correspondence of MAD with the 1σ SD in case of normal distribution without outliers. We note that the errors on the median values are significantly smaller (around 2 × 10 14 molec cm −2 ) than the median values themselves (a few 10 15 molec cm −2 ), indicating that the derived residual biases are significant. Table 6 summarizes the median biases for all the cases. As already stated, the dilution correction improves the validation results for both sensors, by about 10 % to 13 % in total over the station ensemble, with an overall uncertainty due to the method estimated at about 5 %. The impact of relaxing the comparison criteria from only pixels over the stations to the daily closest pixels selection is to increase the bias by 4 % to 6 % for OMI, but it has a negligible effect on GOME-2A (about 2 %), probably due to the large size of the GOME-2A pixels (40 km × 80 km). When considering the best comparison conditions including dilution correction (last column of Table 6), we come to the conclusion that satellite tropospheric NO 2 measurements tend to underestimate ground-based reference data by the following:
- 23 % for GOME-2A GDP 4.8
- 39 % for GOME-2A QA4ECV
- 18 % for OMI DOMINO
- 27 % for OMI QA4ECV.
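The robust error estimate on the median bias described above can be sketched in a few lines. This is a minimal illustration that assumes the error on the median is taken as the scaled MAD divided by the square root of the number of comparisons; the function names are hypothetical.

```python
import math
from statistics import median

K_NORMAL = 1.4826  # scales the MAD to the 1-sigma SD of a normal distribution

def scaled_mad(values, k=K_NORMAL):
    """Median absolute deviation around the median, scaled by k."""
    med = median(values)
    return k * median(abs(v - med) for v in values)

def median_with_error(values):
    """Return (median, error on the median), with error = MAD / sqrt(n).

    The MAD / sqrt(n) form is an assumption of this sketch.
    """
    n = len(values)
    return median(values), scaled_mad(values) / math.sqrt(n)

# Invented differences with one strong outlier: the outlier barely
# affects the median or the MAD, unlike a mean and standard deviation.
diffs = [1.0, 2.0, 3.0, 4.0, 100.0]
med, med_err = median_with_error(diffs)
```

With these values the absolute deviations from the median are 2, 1, 0, 1 and 97, so the unscaled MAD is 1 despite the outlier, which is the robustness property that motivates the choice of this indicator.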
It should be noted that in addition to this relative bias, the previously found positive intercepts and slopes smaller than 1 (see Table 5) could point to a twofold effect, involving a multiplicative error source (e.g., the AMF) and an additive error source (e.g., the stratosphere-troposphere separation). This question should be further investigated in future studies using more extended validation data, in particular of the stratospheric NO 2 column (see, e.g., Compernolle et al., 2020).

Figures 14 and 15 (caption excerpt). In panels (a, b), grey bars are the ± comparison error, calculated by adding in quadrature the satellite and ground-based VCDtropo errors.
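The twofold effect mentioned above (a regression slope below 1 combined with a positive intercept) can be made explicit with a short worked relation. The notation is introduced here for illustration only and does not appear in the paper: V_sat and V_GB denote the satellite and ground-based tropospheric columns, a the regression slope and b the intercept.

```latex
% Illustrative notation (assumption): V_sat, V_GB, slope a, intercept b
\[
V_{\mathrm{sat}} = a\,V_{\mathrm{GB}} + b
\quad\Longrightarrow\quad
\frac{V_{\mathrm{sat}} - V_{\mathrm{GB}}}{V_{\mathrm{GB}}}
  = (a - 1) + \frac{b}{V_{\mathrm{GB}}}
\]
```

Under these assumptions, with 0 < a < 1 and b > 0, the relative bias vanishes at V_GB = b/(1 − a), is positive for smaller columns and tends to a − 1 for strongly polluted conditions. A single relative bias value therefore mixes the multiplicative and additive contributions, which is why separate information on the stratospheric column would help to disentangle them.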

Conclusions
Tropospheric NO 2 column data from 39 ground-based remote-sensing instruments worldwide were used to validate results from GOME-2A GDP 4.8 and QA4ECV v1.1 and OMI DOMINO v2 and QA4ECV v1.1 data products. Although the ground-based retrievals are not yet fully harmonized at network level, the ground-based datasets are treated coherently for the different stations, and the study illustrates the potential capacity of the MAX-DOAS and direct sun networks for tropospheric NO 2 validation. The value of such a network lies in the large number of stations sampling different pollution levels and scenarios, corresponding to remote, suburban and urban conditions. Typically, suburban polluted stations (e.g., Xianghe) provide the best conditions for the validation of satellite NO 2 , owing to their good representativeness of the OMI or GOME-2A pixel spatial extent. Validation at more remote stations can be challenging due to usually low levels of tropospheric NO 2 , leading to difficulties in the stratosphere-troposphere separation step in the satellite retrieval. Other challenging cases are cities and islands surrounded by a pristine atmosphere, such as Izaña, Réunion Island, Nairobi or Bujumbura, leading to large biases (up to ∼ 80 %) due to smearing of the local tropospheric NO 2 content within otherwise clean surroundings.

Figure 16. Box-and-whisker plot of the daily satellite biases for all stations together, in absolute and relative values. The box-and-whisker plots are defined as in Fig. 7. Red is used for the dilution-corrected data, while black is used for the previously presented products (OMI DOMINO and GOME-2A GDP), and grey is used for the QA4ECV products.
The baseline comparison, which keeps only satellite pixels covering the stations, yields the smallest bias and spread at urban locations and, for OMI data, the smallest comparison spread at suburban sites. Relaxing the collocation criteria increases the statistics but at the expense of larger biases and spread. Comparisons at urban sites or close to strong NO x sources may suffer from smoothing difference errors due to the horizontal dilution of the measured NO 2 field. Therefore, a quantitative correction for the dilution effect has been developed based on the spatial distribution of tropospheric NO 2 columns probed by OMI and averaged over 1 year. This dilution correction generally improves the comparison, reducing biases due to the spatial mismatch between ground-based and satellite observations. Generally OMI DOMINO v2 data agree better with ground-based data than GOME-2A GDP 4.8, especially for comparisons with MAX-DOAS data. The dilution correction improves the station-per-station comparisons with a few exceptions, generally at remote sites with local emissions surrounded by clean areas.
A large reduction of the bias is obtained when applying the dilution correction. In terms of validation results, MAX-DOAS and direct sun measurements are found to be highly consistent, and therefore they have been used as an ensemble to assess the accuracy of GOME-2A and OMI data. Results based on this ensemble indicate that, even after correction for the horizontal dilution effect, satellite tropospheric NO 2 columns are systematically biased low in comparison to ground-based measurements by 23 % to 39 % for GOME-2A and 18 % to 27 % for OMI, depending on the selected satellite product. A summary of the validation results is given in Table 6.
The dilution correction developed here is parameterized according to the distance from the station and is based on 1 year of OMI NO 2 measurements (2005). This approach has several identified limitations, such as assumptions made on the radial nature of the NO 2 distribution around the sites and the overall applicability of the NO 2 field derived in 2005. Another limitation is the different intra-pixel dilution expected for the OMI and GOME-2A measurements. This has been tested on a few extreme cases by taking into account the pixels' corner positions, showing improvement in the comparisons and elimination of the overestimation. Despite its simplicity and shortcomings, our dilution correction was shown to significantly improve validation results, and we anticipate that future developments will lead to further improvements. For example, possibilities exist to use estimates of the horizontal extent of MAX-DOAS measurements to improve the collocation with satellite data. MAX-DOAS instruments can also be operated in multiple azimuthal scan mode, which could be used to further refine the collocation with satellite pixels (Gratsea et al., 2016; Ortega et al., 2015; Schreier et al., 2019; Dimitropoulou et al., 2020). Finally, imaging MAX-DOAS systems such as the IMPACT instrument (Peters et al., 2019), which provides fast sampling of the full (360°) azimuthal range, may lead to significant improvements in tropospheric NO 2 validation close to source regions.
To further improve validation studies, information on the vertical distribution of NO 2 and aerosols is also needed to test the impact of a priori assumptions in satellite data retrieval. To some extent, this can be provided by MAX-DOAS instruments, making use of vertical profiling techniques for the inversion of tropospheric profiles of NO 2 and aerosols.
Finally, improving and further extending existing networks are essential requirements for future operational air quality satellite validation (Veihelmann et al., 2019). In this context, important steps include the following:
- the systematic adoption of harmonized uncertainty characterization and reporting and of harmonized data reporting formats, another crucial point for data usage.
On this basis, it is anticipated that significant progress will be achieved in the near future towards the development of harmonized and quality-controlled global networks of UV-VIS MAX-DOAS and direct sun instruments. The development of such networks is an essential element for the validation and cross-mission consistency of the atmospheric composition satellite constellation bridging low-Earth (LEO) and geostationary (GEO) orbits, in particular the ESA/EUMETSAT Copernicus Sentinel-4 (GEO) and -5 (LEO) series (planned for launch from 2023 to 2036), the NOAA/NASA LEO Suomi-NPP/JPSS OMPS series (started in 2011, with JPSS launches planned to 2031), the CNSA LEO GaoFen-5 Environment Monitoring Instrument (2018) and the geostationary missions GEMS (2020) and TEMPO (2022) developed by South Korea and the United States, respectively.
Code and data availability. The datasets generated and analyzed in the present work are available from the corresponding author on request, and data per station can be requested from the individual PIs.
Author contributions. GP and MVR planned this study. GP performed the validation and the associated investigations and wrote the manuscript. MVR and FH contributed to the scientific discussions and to the manuscript writing. NT participated in the creation of the OMI gridded maps. JG keeps the GOME-2 GDP station overpass database up to date. All other co-authors provided ground-based data for the station(s) they are responsible for or support for the satellite data or the validation method. All co-authors were involved in the discussion of the results.
Competing interests. The authors declare that they have no conflict of interest.