Evaluation of version 3.0B of the BEHR OMI NO2 product

Version 3.0B of the Berkeley High Resolution (BEHR) Ozone Monitoring Instrument (OMI) NO2 product is designed to accurately retrieve daily variation in the high-spatial-resolution mapping of tropospheric column NO2 over continental North America between 25 and 50° N. To assess the product, we compare against in situ aircraft profiles and Pandora vertical column densities (VCDs). We also compare the WRF-Chem simulation used to generate the a priori NO2 profiles against observations. We find that using daily NO2 profiles improves the VCDs retrieved in urban areas relative to low-resolution or monthly a priori NO2 profiles by amounts that are large compared to current uncertainties in NOx emissions and chemistry (of the order of 10 % to 30 %). Based on this analysis, we offer suggestions to consider when designing retrieval algorithms and validation procedures for upcoming geostationary satellites.

1 Introduction
NOx (≡ NO + NO2) is an atmospheric trace gas emitted by anthropogenic activity (predominantly combustion, e.g., motor vehicles and power plants), lightning, biomass burning, and soil microbes. It plays an important role in air quality, as a major controlling factor in ozone and aerosol production, as well as being toxic itself.
Satellite observations of NO2 relate absorption of light in the ∼ 400-460 nm range of reflected earthshine radiances to a total column measurement of NO2 using differential optical absorption spectroscopy (DOAS, Boersma et al., 2001; Richter and Wagner, 2011) or a similar technique (e.g., van Geffen et al., 2015). Most applications of satellite NO2 observations to constrain emissions or otherwise study air quality are focused on the tropospheric contribution to the total column; therefore the stratospheric column must be removed. Several methods have been implemented to do so (e.g., Boersma et al., 2007; Bucsela et al., 2013). The tropospheric slant column density (SCD) is then converted to a vertical column density (VCD) through the use of an air mass factor (AMF, McKenzie et al., 1991; Slusser et al., 1996; Burrows et al., 1999; Palmer et al., 2001) that accounts for the effect of path length, surface reflectivity and elevation, NO2 vertical distribution, clouds, and aerosols.
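As a minimal illustration of the final step, the conversion is a division of the tropospheric slant column by the air mass factor. The function name and the numbers below are illustrative assumptions, not values taken from any retrieval.

```python
def scd_to_vcd(scd_trop, amf_trop):
    """Convert a tropospheric slant column to a vertical column: VCD = SCD / AMF."""
    if amf_trop <= 0:
        raise ValueError("the AMF is a multiplicative factor and must be positive")
    return scd_trop / amf_trop


# Illustrative values only (molec. cm^-2 for the slant column, unitless AMF).
example_vcd = scd_to_vcd(6.0e15, 1.2)
```

A smaller AMF, for example from an a priori profile that places more NO2 near the surface, yields a larger VCD for the same slant column.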
There have been numerous studies evaluating OMI NO2 products against in situ aircraft profiles and ground-based column measurements. The following is not meant to be an exhaustive list, but rather a summary of the results of evaluations of existing standard OMI NO2 products.
Published by Copernicus Publications on behalf of the European Geosciences Union.

The first-generation NASA standard product (SP) and KNMI DOMINO products were evaluated by Bucsela et al. (2008) and Hains et al. (2010) using aircraft profiles from multiple campaigns and Russell et al. (2011) using an extrapolation method with ARCTAS-CA aircraft data. These studies all identified a high bias in the DOMINO VCDs; by comparing the DOMINO a priori profiles to aircraft and lidar profiles Hains et al. (2010) found evidence that this was caused by insufficient vertical mixing in the DOMINO a priori profiles, which was corrected in DOMINO v2. Lamsal et al. (2014) undertook a detailed evaluation of NASA SP v2, primarily focusing on data from the Deriving Information on Surface Conditions from COlumn and VERtically Resolved Observations Relevant to Air Quality (DISCOVER-AQ) campaign in Baltimore, MD, USA. This work combined evaluation of the a priori profile against aircraft measurements along with validation of OMI VCDs with aircraft and ground-based VCDs. They found that the NASA SP v2 VCDs were generally biased low in urban areas and high in rural or suburban areas. This is consistent with the effect of coarse a priori profiles (Russell et al., 2011); in a large urban area like the Baltimore-Washington D.C. urban corridor, a coarse profile can capture the average urban characteristic profile, but on the edge, a coarse profile cannot capture the transition from urban to rural. Krotkov et al. (2017) and Goldberg et al. (2017) both evaluated the NASA SP v3, primarily using ground-based VCD observations. They found it to be biased low by ∼ 50 % in the Baltimore area (Goldberg et al., 2017) and 50 % or more in Hong Kong (Krotkov et al., 2017), but it was better than SP v2 in remote areas, due to the improved total column fitting implemented in version 3. Ialongo et al. (2016) also compared versions 2 and 3 of the NASA SP and version 2 of DOMINO against ground-based column measurements in Helsinki, one of only a few studies at high latitudes (> 60°). They found that SP v3 was biased 30 % low, while the version 2 products were not. They attributed this to cancellation of errors in the version 2 products, namely the high bias in the total OMI columns corrected by van Geffen et al. (2015) and the representativeness mismatch between OMI pixels and Pandora measurements.
Russell et al. (2011) evaluated the original BEHR algorithm over California using data from the Arctic Research of the Composition of the Troposphere from Aircraft and Satellites (ARCTAS-CA) field campaign. As the ARCTAS-CA campaign did not include a large number of tropospheric profiles, Russell et al. (2011) computed aircraft-derived NO2 VCDs from times when the aircraft was flying in the boundary layer (BL). Assuming a well-mixed BL, Russell et al. (2011) extrapolated the measurements within the BL to the surface and, combined with measurements in the free troposphere (FT) from the remainder of the ARCTAS-CA campaign, were able to estimate tropospheric NO2 VCDs from aircraft measurements for a larger number of coincident OMI pixels than would have been possible with traditional aircraft profiles, at the expense of increased uncertainty in the aircraft-derived VCDs. Russell et al. (2011) found that the original BEHR product had agreement similar to that of the NASA SP v1 product with the aircraft data (both with slopes near 1), but BEHR had better correlation (R2 of 0.83 vs. 0.72).
Since then, the plethora of aircraft campaigns and expansion of the Pandora ground-based spectrometer network across the US have provided better datasets to evaluate the BEHR product in a variety of locations.
Here we present an evaluation of version 3.0B of the Berkeley High Resolution (BEHR) OMI NO2 retrieval. Version 3.0B implements several changes over v2.1C: (1) daily profiles for selected years; (2) updated 12 km WRF-Chem NO2 profiles with a more complete chemical mechanism (Zare et al., 2018), updated anthropogenic emissions, and lightning NOx emissions added; (3) use of v3.0 NASA standard product (SP) tropospheric SCDs; (4) directional surface reflectance; (5) variable tropopause height; and (6) surface pressure combining a high-resolution terrain database with WRF-simulated surface pressure.
The motivation for this upgrade stems from ideas developed in Laughner et al. (2016), in which we showed that daily high-resolution a priori profiles are necessary for a retrieval to simultaneously retrieve NOx VCDs and lifetime to accuracies better than 30 %. As our goal is to study the relationship between changes in NOx VCDs and emissions and NOx lifetime across the US, and resolving open questions requires higher relative precision and accuracy than prior retrievals provide, we have developed a new product with daily 12 km a priori profiles. Therefore, in this work, we first evaluate the simulated WRF-Chem profiles against aircraft measurements and OMI SCDs to demonstrate that the daily profiles accurately represent the real atmosphere. We then directly evaluate the retrieved VCDs using both aircraft and Pandora observations and show that v3.0 is generally superior to v2.1C and that using daily profiles improves the overall quality of the retrieval.
2 Methods: models and observations

BEHR
The BEHR OMI NO2 retrieval is described in detail in Laughner et al. (2018f). Briefly, the BEHR retrieval calculates a tropospheric AMF using high-resolution a priori input data for surface reflectance, surface elevation, and NO2 vertical profiles; the NO2 profiles are simulated with WRF-Chem (Sect. 2.2). To capture the day-to-day variation in NO2 profiles, daily profiles are used. Currently, 2005, 2007-2009, and 2012-2014 are available. Other years will be posted as processing is completed. A second subproduct uses monthly average profiles (simulated for 2012) to retrieve all years of the OMI data record.
The BEHR AMF is used to convert the tropospheric SCDs available in the NASA OMI NO2 standard product to tropospheric VCDs. For full details of the AMF calculation, see Laughner et al. (2018f). The BEHR product is available for download as Hierarchical Data Format (HDF) version 5 files at http://behr.cchem.berkeley.edu/ (last access: 3 January 2019) or through the four DASH repositories (Laughner et al., 2018a, b, c, d).

WRF-Chem
The WRF-Chem model version used to simulate the a priori NO2 profiles for BEHR v3.0B is v3.5.1 (Grell et al., 2005). The model domain is 405 (east-west) by 254 (north-south) 12 km grid cells centered on 39° N, 97° W with 29 vertical levels. Meteorological initial, boundary, and nudging conditions are taken from the North American Regional Reanalysis (NARR) product; boundary conditions and four-dimensional data analysis (FDDA) nudging (Liu et al., 2006) are applied every 3 h. Temperature, water vapor, and U/V winds are nudged with nudging coefficients of 0.0003 s^−1.
The chemical mechanism used is described in Zare et al. (2018), which has a very detailed description of alkyl nitrate and nighttime chemistry. Methyl peroxynitrate (MPN) chemistry was added (Browne et al., 2011) to improve upper-tropospheric (UT) chemistry. Anthropogenic emissions are from the National Emissions Inventory, 2011, scaled by EPA annual total emissions (EPA, 2016) to the model year. Biogenic emissions are from the Model for Emissions of Gases and Aerosols from Nature (Guenther et al., 2006). Lightning emissions are parameterized following Laughner and Cohen (2017) for a simulation with FDDA active (500 mol NO flash^−1, 2× base flash rate).

Pandora ground-based columns
Evaluation of satellite NO2 VCDs usually uses one of two methods. First, total satellite columns can be directly compared to a ground-based column measurement, such as a Pandora spectrometer (Herman et al., 2009) or multi-axis DOAS (MAX-DOAS) instrument (Hönninger et al., 2004). In the case of a direct-sun measurement, such as a Pandora spectrometer, the AMF required is only a geometric AMF to account for the path length difference between the slant and vertical column since the multiple scattering that necessitates the use of a more complex AMF in the satellite retrieval is a much smaller signal than the direct-sun signal (Herman et al., 2009).

Table 1. Criteria that OMI pixels must meet to be used in any comparison.

Data field            Condition
XTrackQualityFlags    Must be 0
VcdQualityFlags       Must be an even number
CloudFraction         Must be ≤ 0.2
BEHRAMFTrop           Must be a non-fill value > 10^−6
We compare against Pandora ground-based column measurements taken during the four DISCOVER-AQ campaigns.For each OMI overpass, pixels are matched with Pandora sites that lie within the pixel boundaries defined by the FoV75 corners in the OMPIXCOR product (Kurosu and Celarier, 2010).Only pixels meeting the criteria in Table 1 are used.If multiple valid pixels from the same overpass encompass the Pandora site, their VCDs are averaged.As in Goldberg et al. (2017), the stratospheric VCD from the NASA standard product is added to the tropospheric VCD to obtain a total column since the Pandora columns do not separate stratospheric and tropospheric contributions.
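The pixel screening in Table 1 can be sketched as a boolean mask. The field names follow Table 1; the function names are hypothetical, and the sketch assumes fill values have already been converted to NaN when the data were read.

```python
import numpy as np


def valid_pixel_mask(xtrack_flags, vcd_flags, cloud_fraction, amf_trop):
    """Boolean mask of pixels meeting the Table 1 criteria.

    Assumption: fill values in CloudFraction and BEHRAMFTrop are NaN.
    """
    xtrack_flags = np.asarray(xtrack_flags)
    vcd_flags = np.asarray(vcd_flags)
    cloud_fraction = np.asarray(cloud_fraction, dtype=float)
    amf_trop = np.asarray(amf_trop, dtype=float)
    return (
        (xtrack_flags == 0)            # XTrackQualityFlags must be 0
        & (vcd_flags % 2 == 0)         # VcdQualityFlags must be even
        & (cloud_fraction <= 0.2)      # CloudFraction must be <= 0.2
        & np.isfinite(amf_trop)        # BEHRAMFTrop must be non-fill ...
        & (amf_trop > 1e-6)            # ... and greater than 10^-6
    )


def mean_valid_vcd(vcd, mask):
    """Average the VCDs of all valid pixels that contain the site (NaN if none)."""
    vcd = np.asarray(vcd, dtype=float)
    return float(vcd[mask].mean()) if mask.any() else float("nan")
```

Matching pixels to a site by testing whether it lies inside the FoV75 corners is a separate geometric step that is not shown here.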
Pandora observations are matched in time to the OMI observations using the exact time of observation for each pixel given in the OMI data files.As in Goldberg et al. (2017), Pandora observations ±1 h from the OMI observation are averaged.
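A minimal sketch of this temporal matching, assuming times are expressed in decimal hours on a common clock; the function and argument names are hypothetical.

```python
import numpy as np


def pandora_window_mean(pandora_times_h, pandora_vcds, omi_time_h, half_window_h=1.0):
    """Mean of Pandora columns observed within +/- half_window_h of the OMI pixel time."""
    t = np.asarray(pandora_times_h, dtype=float)
    v = np.asarray(pandora_vcds, dtype=float)
    selected = (np.abs(t - omi_time_h) <= half_window_h) & np.isfinite(v)
    return float(v[selected].mean()) if selected.any() else float("nan")
```

Passing `half_window_h=0.5` would reproduce a shorter averaging window for sensitivity checks.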

In situ aircraft profiles
The other common method of evaluating satellite VCDs is to use in situ measurements of NO 2 by an instrumented aircraft that flies a vertical profile to calculate a VCD by integrating the NO 2 concentrations vertically.Ideally, the aircraft should fly a spiral path that provides a complete vertical sampling of the troposphere over a ground footprint similar in scale to the satellite pixel; the DISCOVER-AQ campaigns held in Maryland, California, Texas, and Colorado between 2011 and 2014 were designed to provide this sampling over the lower troposphere.In other cases, the VCD calculated from integrating the aircraft profiles is often matched to satellite pixels in which the BL is sampled (e.g., Bucsela et al., 2008;Hains et al., 2010), on the assumption that UT sampling from adjacent pixels is sufficient.
We draw on methodology from several papers (Bucsela et al., 2008;Hains et al., 2010;Lamsal et al., 2014) for our approach.Similar to Hains et al. (2010), only profiles with a minimum radar altitude < 500 m and at least 20 measurements below 3 km above ground level (a.g.l.) are used.In the DISCOVER-AQ campaigns, individual profiles are demarcated in the data by a profile number.In the SENEX and SEAC4RS data, profiles were identified manually as periods when the aircraft was consistently ascending or descending.The profile measurements are binned to the same pressure levels used in the BEHR algorithm and the final profile uses the median of each bin.
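The profile screening and binning steps can be sketched as follows. The function names are hypothetical, radar altitude is taken as the height above ground level, and the bin edges around the BEHR pressure levels are left to the caller because the text does not specify them.

```python
import numpy as np


def profile_is_usable(radar_alt_m):
    """True if the profile dips below 500 m and has >= 20 points below 3 km a.g.l."""
    alt = np.asarray(radar_alt_m, dtype=float)
    alt = alt[np.isfinite(alt)]
    return bool(alt.size > 0 and alt.min() < 500.0 and np.count_nonzero(alt < 3000.0) >= 20)


def bin_profile_median(pressures, no2, level_edges):
    """Median NO2 in each pressure bin (ordered by increasing pressure); NaN if empty."""
    pressures = np.asarray(pressures, dtype=float)
    no2 = np.asarray(no2, dtype=float)
    edges = np.sort(np.asarray(level_edges, dtype=float))
    idx = np.digitize(pressures, edges) - 1  # bin i covers edges[i] <= p < edges[i+1]
    out = np.full(edges.size - 1, np.nan)
    for i in range(edges.size - 1):
        vals = no2[(idx == i) & np.isfinite(no2)]
        if vals.size:
            out[i] = np.median(vals)
    return out
```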
Profiles are spatially matched to OMI pixels if any of the 1 s measurements in the bottom 3 km a.g.l. lie within the FoV75 pixel boundaries. As with Pandora data, OMI pixels must meet the criteria in Table 1 to be included; all VCDs from valid pixels intersecting the profile are averaged to yield a single VCD to compare against the profile. Only profiles with a mean observation time of all points in the bottom 3 km a.g.l. within 1.5 h of the mean OMI observation time for the orbit are used.
To calculate a VCD from the in situ measurements, the aircraft profiles are integrated from the average surface pressure to the average tropopause pressure of the matched pixels.The surface and tropopause pressure are used from the product being evaluated, i.e., aircraft profiles are integrated between BEHR surface and tropopause pressure for comparison with BEHR VCDs and NASA surface and tropopause pressures for comparison with NASA VCDs.For BEHR v2.1C comparisons, 200 hPa is used as the fixed tropopause pressure.Aircraft profiles that do not span the necessary vertical extent are extended similarly to in Lamsal et al. (2014).The aircraft profile is extended to the surface by using the ratio of modeled concentrations at each of the missing levels to the lowest level with aircraft data to scale the bottom bin with aircraft data.Missing profile levels above the top of the aircraft profile are replaced with model data.We use modeled NO 2 profiles from the "updated +33 %" GEOS-Chem simulation described in Nault et al. (2017) (v9.02 of the GEOS-Chem global chemical transport model (Bey et al., 2001) at 2.5 • × 2 • resolution, with updated HNO 3 , HO 2 NO 2 , and N 2 O 5 chemistry and lightning emission rates).The NO 2 profiles are monthly averages of model output from 2012 sampled between 12:00 and 14:00 local standard time.We avoid using the a priori WRF-Chem profiles for this so that the aircraft VCDs are independent of the retrieved VCDs.
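The vertical integration can be sketched under the hydrostatic assumption, in which the column equals the pressure integral of the mixing ratio multiplied by N_A / (g M_air). This is an illustration, not the integration scheme used for the product: the function name is hypothetical, and for brevity the profile is held constant beyond its end points instead of being extended with model data as described above.

```python
import numpy as np

AVOGADRO = 6.02214076e23  # molec. mol^-1
GRAVITY = 9.80665         # m s^-2
M_AIR = 0.0289644         # kg mol^-1, dry air


def integrate_vcd(pressure_hpa, no2_pptv, p_surf_hpa, p_trop_hpa):
    """Integrate an NO2 mixing ratio profile between two pressures; molec. cm^-2."""
    p = np.asarray(pressure_hpa, dtype=float)
    x = np.asarray(no2_pptv, dtype=float) * 1e-12  # pptv -> mol/mol
    order = np.argsort(p)                          # np.interp needs increasing pressure
    p, x = p[order], x[order]
    inside = (p > p_trop_hpa) & (p < p_surf_hpa)
    grid = np.concatenate(([p_trop_hpa], p[inside], [p_surf_hpa]))
    x_grid = np.interp(grid, p, x)                 # end values held constant (simplification)
    # Trapezoidal rule in pressure, converted from hPa to Pa.
    integral_pa = np.sum(0.5 * (x_grid[1:] + x_grid[:-1]) * np.diff(grid)) * 100.0
    return integral_pa * AVOGADRO / (GRAVITY * M_AIR) / 1e4  # m^-2 -> cm^-2
```

For a uniform 1000 pptv profile between 1000 and 200 hPa this gives roughly 1.7 × 10^16 molec. cm^−2, which is a convenient sanity check on the unit conversions.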
We also used the extrapolation method from Hains et al. (2010), in which the medians of the top 10 and bottom 10 points are extrapolated to the tropopause and surface pressures, respectively.The median of the top 10 points must be < 100 pptv.As in Hains et al. (2010), a detection limit of 3 pptv is assumed, and if the median to be extrapolated is less than 3 pptv, it is set to one-half of the detection limit, 1.5 pptv.
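A sketch of the values used in this extrapolation, assuming the measurements are ordered from the bottom of the profile to the top and that the detection-limit rule applies to both ends; the function names are hypothetical.

```python
import numpy as np

DETECTION_LIMIT_PPTV = 3.0


def _apply_detection_limit(value):
    """Replace medians below the detection limit with one-half of the limit."""
    return 0.5 * DETECTION_LIMIT_PPTV if value < DETECTION_LIMIT_PPTV else value


def extrapolation_values(no2_pptv_bottom_to_top):
    """Return (surface_value, tropopause_value) in pptv, or None if the profile is rejected.

    The profile is rejected when the median of the top 10 points is >= 100 pptv.
    """
    x = np.asarray(no2_pptv_bottom_to_top, dtype=float)
    x = x[np.isfinite(x)]
    if x.size < 10:
        raise ValueError("need at least 10 valid measurements")
    bottom = float(np.median(x[:10]))
    top = float(np.median(x[-10:]))
    if top >= 100.0:
        return None
    return _apply_detection_limit(bottom), _apply_detection_limit(top)
```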
In addition, we directly compare the a priori profiles to the in situ aircraft profiles.This is performed as in Laughner and Cohen (2017); for each 1 s data point in the aircraft data, the nearest WRF-Chem output time is selected, and the model grid cell containing the aircraft location is sampled.This effectively samples the model output as if the aircraft were flying through the model world.
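The sampling can be sketched as a nearest-time, containing-cell lookup. To keep the example short it assumes a regular latitude-longitude grid described by cell edges and omits the vertical dimension; the actual WRF-Chem grid is a projected grid, so this is a simplification, and all names are hypothetical.

```python
import numpy as np


def sample_model_along_track(model_times, lat_edges, lon_edges, model_field,
                             ac_times, ac_lats, ac_lons):
    """Sample model_field[time, lat, lon] at each aircraft point; NaN outside the domain."""
    model_times = np.asarray(model_times, dtype=float)
    ac_times = np.asarray(ac_times, dtype=float)
    field = np.asarray(model_field, dtype=float)
    # Index of the nearest model output time for every aircraft point.
    it = np.abs(model_times[None, :] - ac_times[:, None]).argmin(axis=1)
    # Index of the grid cell containing each aircraft point.
    iy = np.digitize(ac_lats, lat_edges) - 1
    ix = np.digitize(ac_lons, lon_edges) - 1
    ok = (iy >= 0) & (iy < len(lat_edges) - 1) & (ix >= 0) & (ix < len(lon_edges) - 1)
    out = np.full(ac_times.size, np.nan)
    out[ok] = field[it[ok], iy[ok], ix[ok]]
    return out
```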
We use a similar set of aircraft campaigns here as for the VCD evaluation (Sect.2.4); the only difference being that we use the Deep Convective Clouds and Chemistry (DC3) (Barth et al., 2015) instead of SENEX.The DC3 campaign focused on outflow from convective systems (i.e., thunderstorms) and so is used to evaluate the lightning NO x parameterization.The DC3 campaign had better UT sampling but far fewer profiles than SENEX.The DISCOVER-AQ campaigns focused on satellite validation, flying repeated spirals over six to eight sites during each campaign; however, for the average comparison, we use all data, not just those taken during the spirals.

Comparison with in situ aircraft profiles
Figure 1 shows campaign-averaged profiles matched with WRF profiles from the four DISCOVER-AQ campaigns, the DC3 campaign, and the SEAC4RS campaign.We compare the monthly average NO 2 profiles from BEHR v2.1C and v3.0B for all campaigns, as well as the daily v3.0B profiles.The plots shown only use data between 12:00 and 15:00 local standard time since the v3.0 monthly average profiles are calculated as a weighted average that only includes contributions from ±1 h from OMI overpass; this way all profiles get a fair comparison to the observations.
In general, the v3.0 profiles show better agreement with observed profiles than the v2.1 profiles, except during the California DISCOVER-AQ campaign. The most dramatic example is the Maryland DISCOVER-AQ campaign, in which the factor of ∼ 2 reduction in NO2 concentration (likely due to updating emissions from 2005 to 2012, Fig. S10) brings the modeled profiles into substantially better agreement with the observed profiles. In the California DISCOVER-AQ campaign, the v2.1 profiles managed to capture an elevated layer of NO2 that the v3.0 profiles did not, though we note that transport in California's Central Valley is notoriously difficult to model (Hu et al., 2010, and references therein). In Texas, the v3.0 profiles and v2.1 profiles lie on opposite sides of the observed profiles, possibly suggesting that emissions in Houston did not in fact decrease as much as in the NEI inventory driving the v3.0 WRF simulations. In Colorado, both the v3.0 and v2.1 profiles match observations reasonably well. The daily profiles do a better job capturing the decrease in NO2 between 750 and 600 hPa than the v3.0 monthly or v2.1 profiles; this may be due to day-to-day variability in recirculation from the upslope-downslope winds (e.g., Sullivan et al., 2016).
We evaluate the agreement quantitatively by calculating the mean absolute bias between the average WRF and aircraft profiles (Table 2).We divide the profiles into BL and FT, as different processes (e.g., anthropogenic vs. lightning emissions) govern them.As the SEAC4RS campaign has an obvious error in the FT (which will be discussed below), we calculate these values with and without the SEAC4RS campaign.In the BL, the version 3 profiles have one-half to two-thirds the bias of the version 2 profiles (depending on whether SEAC4RS is excluded).In the FT, there is little difference in the mean bias among profile types, unless SEAC4RS is included, in which case the daily profiles have a 33 % greater bias.
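One way to compute such a statistic is sketched below, assuming it is the mean of the absolute model-minus-aircraft differences over the binned levels. The pressure separating BL from FT is a caller-supplied assumption, since the text does not state it, and the function name is hypothetical.

```python
import numpy as np


def mean_abs_bias(model, obs, pressure_hpa, p_split_hpa):
    """Return (BL, FT) mean absolute model-observation difference.

    Levels with pressure >= p_split_hpa are treated as boundary layer.
    """
    model, obs, p = (np.asarray(a, dtype=float) for a in (model, obs, pressure_hpa))
    diff = np.abs(model - obs)
    in_bl = p >= p_split_hpa
    return float(np.nanmean(diff[in_bl])), float(np.nanmean(diff[~in_bl]))
```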
We include the SEAC4RS and DC3 campaigns to check the simulation of lightning NO x in the profiles.The daily profiles show agreement with the DC3 observations similar to that in Laughner and Cohen (2017).Restricting the DC3 data to 12:00-15:00 local standard time as we have done here reduces the strength of the lightning signal since the strongest lightning occurs after OMI overpass (Lay et al., 2007;Williams et al., 2000).Compared to Laughner and Cohen (2017), the discrepancy between modeled and observed profiles decreased around 500 hPa, increased around 400 hPa, and is similarly small around 200 hPa.Surprisingly, the difference between the v2.1 and v3.0 profile around 200 hPa is not as significant as the difference between the lightning and no-lightning cases in Laughner and Cohen (2017).This is unexpected as the v2.1 profiles did not include lightning NO x emission.It is possible that convection of greater surface NO x concentrations is driving the v2.1 UT concentration.
The SEAC4RS campaign covers the southeast US, which has very active lightning (Hudman et al., 2007;Travis et al., 2016).The daily profiles demonstrate a substantial overestimate in UT NO 2 (between 600 and 200 hPa).This is centered in the southeast US; model-measurement discrepancies between 600 and 200 hPa in the rest of the country are < 500 pptv (not shown).As discussed in Laughner et al. (2018f), the southeast US exhibits greater NO 2 VCDs (and therefore smaller AMFs) when using daily profiles; that is opposite to the profiles seen here, as greater NO 2 at higher altitudes results in larger AMFs.Laughner et al. (2018f) showed that the 3-month average daily shape factor over the southeast US had less contribution from UT NO 2 than the monthly profiles; this indicates that on average pixels in the southeast US are not influenced by lightning, but that the SEAC4RS sampling tended to select for convective outflow.However, this does indicate that the simulation of the UT in the southeast US is biased high.
To investigate the cause of this bias, we compare the WRF lightning flash density to that measured by the Earth Networks Total Lightning Network (ENTLN).ENTLN is a ground-based lightning observation network with more than 900 sensors deployed in the contiguous US.The sensors record lightning-produced strokes as well as accurate time and location.Strokes are then clustered into a flash if they are within 700 ms and 10 km of each other.The detection coefficient is larger than 70 % across the southern contiguous US (Rudlosky, 2015).
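The stroke-to-flash grouping can be sketched with a simple greedy rule. The 700 ms and 10 km thresholds are from the text; comparing each stroke with the first stroke of an existing flash is an assumption made for this illustration and may differ from the network's operational algorithm. Function names are hypothetical.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km on a sphere of radius 6371 km."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))


def cluster_strokes(strokes, max_dt_s=0.7, max_dist_km=10.0):
    """Group (time_s, lat, lon) strokes into flashes; returns a list of stroke lists."""
    flashes = []
    for stroke in sorted(strokes):  # process in time order
        for flash in flashes:
            t0, lat0, lon0 = flash[0]  # compare with the flash's first stroke (assumption)
            close_in_time = stroke[0] - t0 <= max_dt_s
            close_in_space = haversine_km(lat0, lon0, stroke[1], stroke[2]) <= max_dist_km
            if close_in_time and close_in_space:
                flash.append(stroke)
                break
        else:
            flashes.append([stroke])
    return flashes
```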
For the comparison, the WRF-Chem simulation is that described in Sect.2.2.ENTLN and WRF-Chem are sampled from 13 May to 23 June 2012 over the middle and east US domain, where active lightning events are detected.Both observed and simulated lightning flashes are converted to flash density by dividing flash counts by corresponding grid areas and time range.
Figure 2a and b show the spatial distribution of flash density in number per square kilometer per day observed by ENTLN and simulated by WRF-Chem.The largest biases are located over the southeast US (outlined by red on the map).In this region, WRF-Chem substantially overestimates flash density in general and a detection coefficient of 70 % for ENTLN cannot account for the discrepancy.The simulated flash density is the highest primarily along the coast, which is not detected by ENTLN.
The scatter plot of daily flash density over the southeast US from two datasets in Fig. 2c demonstrates that the WRF-Chem consistently overestimates flashes in the southeast US over the study period.However, outside of the southeast US, the agreement improves.The simulation captures the spatial pattern over the regional scale (Fig. 2a and b) and the simulated flash densities are consistent with the observed flash densities and the correlation improves as well (Fig. 2d).
Currently, the cause of the discrepancies between the flash density from WRF-Chem simulation and ENTLN observation is unknown.However, it is clear that the flash density, rather than the per-flash production rate of NO, is the cause of the disagreement in the UT between the daily profiles and SEAC4RS data.Further research is required to optimize the lightning parameterizations and improve flash density simulations in the southeast US for our model simulation.

Evaluation of variability in daily profiles
As demonstrated in Laughner et al. (2016), simulating the day-to-day variability in the a priori NO2 profiles can have a significant impact on the retrieved NO2 VCDs due primarily to the day-to-day variation in wind speed and direction driving outflow from emissions sources, e.g., cities and power plants. To examine how well WRF-Chem captures the day-to-day variability in NO2 profiles, we compare aircraft data from three DISCOVER-AQ campaigns and the matched WRF-Chem data (Sect. 2.4). For each profile in the DISCOVER data, we binned the NO2 concentrations by pressure and calculated the correlation between WRF-Chem and aircraft NO2 concentrations (one data point per profile per pressure bin). The results are shown in Fig. 3.
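The per-bin correlation can be sketched as follows, assuming the binned data are arranged with one row per profile and one column per pressure bin. The use of the Pearson coefficient and the minimum of three points per bin are assumptions of this illustration, and the function name is hypothetical.

```python
import numpy as np


def binwise_correlation(model, aircraft, min_points=3):
    """Pearson r between model and aircraft NO2 for each pressure bin (column)."""
    model = np.asarray(model, dtype=float)
    aircraft = np.asarray(aircraft, dtype=float)
    r = np.full(model.shape[1], np.nan)
    for j in range(model.shape[1]):
        ok = np.isfinite(model[:, j]) & np.isfinite(aircraft[:, j])
        if ok.sum() >= min_points:
            r[j] = np.corrcoef(model[ok, j], aircraft[ok, j])[0, 1]
    return r
```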
In California (Fig. 3a), the monthly average profiles correlate better with the aircraft data.However, as mentioned before, the Californian Central Valley is known to be difficult to model accurately (Hu et al., 2010).In Colorado (Fig. 3c), the daily profiles do a slightly better job overall, simulating the variability at the surface and in an elevated layer more accurately than the monthly average profiles.The difference in Texas is quite dramatic (Fig. 3b), with the daily modeled profiles performing substantially better.This suggests that daily profiles are able to capture variability caused by small, concentrated urban plumes much more effectively than monthly average profiles.
As a second check, we also compare WRF-Chem tropospheric VCDs to OMI SCDs to evaluate the general accuracy of wind direction and speed in the daily model profiles.The OMI SCDs do not depend on modeled vertical profiles and so constitute an independent check on the plume direction.In order to have strong isolated NO x sources, we use Atlanta, Chicago, Las Vegas, Los Angeles, New York, and the Four Corners power plant for this study.For each of these sites, five days from 2007 are randomly chosen.If insufficient OMI SCDs are available for any day (> 10 % of OMI pixels are cloud covered or in the row anomaly), another day is randomly chosen.
For each day, the agreement between the relative spatial distribution of WRF-Chem VCDs and OMI SCDs is manually evaluated, focusing on whether the model plume is advected in the same direction as the OMI SCDs indicate.Each day's agreement is evaluated qualitatively as good or bad.This, whether the WRF-Chem daily VCDs are significantly different from the monthly average WRF-Chem VCDs, and the confidence in the comparison are recorded for each comparison.Because of the number of factors that affect the absolute magnitude of SCDs, we look for qualitative, rather than statistically quantitative, agreement between the modeled VCDs and OMI SCDs.This is relevant since Laughner et al. (2016) noted that it is primarily the plume shape that drives the day-to-day variability in AMFs; therefore a direct qualitative evaluation of the plume shape is desirable.
Figures 4 and 5 show two example comparisons, one good (Fig. 4) and one poor (Fig. 5).By studying randomly chosen days for six large NO x sources, we find that about 67 %-73 % of days with sufficient data to be evaluated show good agreement between the OMI SCDs and WRF-Chem daily VCDs.(The range is due to different levels of confidence filtering.) This indicates that the WRF-Chem-simulated NO 2 profiles are adequately capturing the day-to-day variability due to wind speed and direction.While we recognize that this conclusion is highly qualitative, the specific character of agreement that is important for these profiles (overall plume size and direction, rather than exact agreement between modeled and real concentrations or column densities) is rather difficult to evaluate quantitatively.We recognize that developing such methods is necessary and offer several possible approaches in Sect. 5.
Both comparisons (vs. OMI SCDs and aircraft measurements) show that daily WRF-Chem profiles do, on average, a better job than monthly average profiles capturing the day-to-day variation in profile shape. Therefore, the core improvement in BEHR v3.0, the transition to daily high-resolution a priori profiles, is fundamentally sound. Daily profiles are especially important for applications that focus on upwind-downwind differences in NO2 columns around a NOx source (Laughner et al., 2016) and, as we will see in Sect. 4, generally improve the retrieval in dense urban areas.

Column density evaluation
For the DISCOVER campaigns, we compare BEHR against aircraft-derived and Pandora VCDs together, calculating a single regression line for the combined dataset.These two measurements have unique strengths and weaknesses for comparison against satellite VCDs: Pandora spectrometers give precise column measurements and can be deployed for long time periods but have very small footprints (leading to possible representativeness errors) and provide total, not tropospheric, columns.Aircraft profiles have a footprint more similar to an OMI pixel size but introduce uncertainty due to missing parts of the profile (near the surface and in the UT in the DISCOVER campaigns) and cannot be deployed for long-term routine observations.
In order to take advantage of each method's strengths, we use two comparisons; in one, only Pandora data that have a coincident aircraft profile are included ("matched"), and in the other, all cloud-free Pandora data are used ("all"). We do so because, when including all Pandora data, the number of Pandora comparisons available will overwhelm the number of available aircraft profiles in the regression. Therefore the regressions using all Pandora data are representative of longer time periods but weighted strongly towards the Pandora data, and the regressions using only the coincident data represent shorter time periods but give more weight to the aircraft data. As stated in Sect. 2.3, we average all data within 1 h of OMI overpass (i.e., 13:30 local time ±1 h) to be consistent with Goldberg et al. (2017). A shorter averaging window (±0.5 h) was tested; the maximum effect on the slope was ∼ 8 % with most of the matched data slopes showing differences of ≤ 5 % and the all data slopes changing by ≤ 3.5 % in all but one case.
The results for the DISCOVER campaigns are shown in Table 3. To evaluate the southeast US, we use the SENEX and SEAC4RS campaigns, which only have aircraft data. These results are shown in Table 4. More details (slope, intercepts, R2 values) can be found in Tables S1, S2, and S3. Figure 6 shows scatter plots of the BEHR vs. aircraft and all Pandora data for the four DISCOVER-AQ campaigns. Scatter plots showing aircraft and Pandora data separately are available in Sect. S1 of the Supplement.
For the DISCOVER-CO aircraft comparison, negative VCDs were removed. Negative VCDs occur when the estimated stratospheric NO2 column is greater than the total NO2 column; thus V_trop = V_total − V_strat < 0. They cannot be introduced by the AMF correction of the tropospheric SCD to VCD as the AMF is a multiplicative factor and always > 0. Since all versions of BEHR use the same stratospheric NO2 column as their respective NASA SP products, an error in stratospheric subtraction will be present in all products, and it cannot be corrected in the BEHR retrieval. Aircraft VCDs, by their nature, cannot be negative, so for these comparisons we remove the negative VCDs so as to avoid increasing the regression slopes by trying to fit these erroneous points. (However, we do note that this is a special case in which individual pixels or small groups of pixels are being compared against other VCDs. Most applications of BEHR data should retain the negative VCDs to avoid transforming the essentially Gaussian random stratospheric error into a systematic error by removing part of the bell curve.) Since the stratospheric VCDs are added back to the BEHR or NASA SP tropospheric VCDs for comparison with the Pandora VCDs, negative VCDs are not an issue with Pandora comparisons.
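The parenthetical point about retaining negative VCDs can be illustrated with synthetic numbers. The true column and the noise amplitude below are arbitrary assumptions chosen only to make the effect visible; they are not estimates of the actual stratospheric error.

```python
import numpy as np

rng = np.random.default_rng(0)

true_trop_vcd = 1.0e15                                # molec. cm^-2, arbitrary
noise = rng.normal(0.0, 1.0e15, size=100_000)         # zero-mean Gaussian error
retrieved = true_trop_vcd + noise                     # some values fall below zero

mean_all = retrieved.mean()                           # close to the true value
mean_positive_only = retrieved[retrieved >= 0].mean() # biased high after truncation
```

Averaging all values recovers the true column to within sampling noise, whereas discarding the negative values shifts the mean upward, turning a random error into a systematic one.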
In the following sections, we will evaluate the new BEHR v3.0 VCDs from three perspectives: performance compared to the current NASA SP, performance compared to the previous version of BEHR, and performance using daily a priori profiles compared to using monthly a priori profiles.Throughout, BEHR v3.0 (M) refers to BEHR using monthly NO 2 profiles; likewise, BEHR v3.0 (D) refers to the product using daily NO 2 profiles.We will focus on the regression slopes here; intercepts and R 2 values are given in Table S3 in the Supplement; however we note that there is not a clear pattern of any one product having a consistently better R 2 value than the others.3; for intercepts and R 2 values, see Table S3.4. Slopes and 1σ uncertainties for reduced major axis (RMA) regression of satellite VCDs against in situ calculated VCDs.Both methods of extending the profiles (using GEOS-Chem modeled profiles or extrapolating the top and bottom 10 points) are included.Outliers are removed before calculating these parameters.
Campaign   Product   Slope (GEOS-Chem)   Slope (extrap.)
For all the DISCOVER campaigns, BEHR v3.0 shows better agreement with both aircraft and Pandora measurements than the NASA SP v3.0 (slopes closer to 1). This is expected since these campaigns generally centered on one or more cities, and a key feature of the BEHR retrieval is the ∼12 km a priori profiles (∼10× higher resolution than the NASA SP v3.0 profiles), which better capture the urban profile shape.
In the SENEX and SEAC4RS campaigns, BEHR's performance is more mixed. These campaigns include the southeast US, where we found that the WRF-Chem simulation that generated the a priori profiles overestimated the lightning flash density (Sect. 3.1). In SEAC4RS, whether BEHR v3.0 (M) performs better or worse than the NASA SP v3.0 depends on the method used to extend the profile (Sect. 2.4). This indicates that uncertainty in the measurement is greater than the difference between these two products. BEHR v3.0 (D) performs poorly in the SENEX campaign; this will be explored in Sect. 4.3. Overall, BEHR v3.0 (M) is not significantly affected by the overestimated lightning flash density in the southeast US, as the monthly average profiles smooth out the overlarge UT lightning NO2 signal.
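The slopes compared in this section come from reduced major axis (RMA) regression. As a minimal sketch of that estimator (not the authors' code, and without the outlier removal or 1σ uncertainty estimates they report), the RMA slope is the ratio of the standard deviations of the two variables, signed by their correlation. The data arrays below are hypothetical.

```python
import numpy as np

def rma_regression(x, y):
    """Reduced major axis fit of y against x; returns (slope, intercept)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    r = np.corrcoef(x, y)[0, 1]
    # RMA slope: sign of the correlation times the ratio of standard deviations.
    slope = np.sign(r) * np.std(y, ddof=1) / np.std(x, ddof=1)
    # The RMA line passes through the centroid of the data.
    intercept = y.mean() - slope * x.mean()
    return slope, intercept

# Hypothetical VCDs in units of 1e15 molec. cm^-2 (illustration only).
aircraft_vcd = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
satellite_vcd = 0.8 * aircraft_vcd + 0.1

slope, intercept = rma_regression(aircraft_vcd, satellite_vcd)
print(f"slope = {slope:.3f}, intercept = {intercept:.3f}")
```

Unlike ordinary least squares, RMA treats both variables as uncertain, which suits comparisons in which neither the satellite nor the reference VCDs are error free.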

Comparison vs. BEHR v2.1
Using aircraft data plus only the Pandora data coincident with aircraft spirals, v2.1 performs better in all DISCOVER campaigns except MD. However, using aircraft data plus all Pandora data, v3.0 (D) performs better than or similar to v2.1 in all DISCOVER campaigns in which daily profiles are available. The Pandora spectrometers provide more observations than the aircraft profiles, and, due to their small footprint, are more sensitive to narrow, highly concentrated NO2 plumes. The v2.1 profiles used 2005 emissions; as seen in Fig. 1, this led to too much NO2 being placed at the surface, which will increase the retrieved VCD. This suggests that the better performance of v2.1 in some cases is due to cancellation of errors; overestimated surface NO2 is canceling out the lack of temporal variation in the profiles. That is, the higher average surface concentration in the v2.1 profiles may be similar to the in-plume concentrations resolved by the daily v3.0 profiles.
In v3.0, when daily profiles are available, the agreement is similar to or better than v2.1 if aircraft and all Pandora data are used. Therefore, daily profiles are able to capture at least some enhancements in surface NO2 where and when they occur, without overestimating the average profile. This is not evident using aircraft and just the coincident Pandora data because of the smaller number of comparisons. As the comparison expands (using all Pandora data), the improvement becomes evident. The better performance of daily profiles suggests that even though Laughner et al. (2016) did not see large effects in a multi-month average using daily instead of monthly profiles, daily profiles will provide a more accurate representation of urban VCDs over longer averaging periods.
BEHR v3.0 performs better in the SENEX and SEAC4RS comparisons than v2.1 (excluding v3.0 (D) in SENEX). The v2.1 profiles did not include lightning emissions, as it was a limitation of WRF-Chem at that time (Laughner et al., 2018f). This indicates that, even though the contribution of lightning to the southeast US profiles is too large, the inclusion of lightning NO2 in the profiles did improve the representation of the southeast US. Laughner et al. (2018f) also showed that implementing a variable tropopause pressure decreased VCDs in the southeast US during summer; this also would help reduce the high bias compared to SENEX and SEAC4RS seen in BEHR v2.1.
In the SENEX campaign, v3.0 (D) performs significantly worse than v3.0 (M). From Fig. 1 we know that the daily a priori profiles overestimate the UT NO2, and from Fig. 2 we know that this is due to a significant overestimate of the flash density in our WRF simulation. The comparison in Table 4 would seem to indicate that this overestimate has a severe impact on the retrieved VCDs, but we must also consider the uncertainty in the SENEX-derived VCDs.
Figure 7a shows the ensemble of profiles from SENEX used to calculate VCDs. The circles mark levels that had to be calculated using model data for > 50 % of the profiles. In SENEX, that is all levels above ∼700 hPa, which means that the SENEX aircraft data provide very little constraint on the UT. The lightning contribution to the SENEX columns must come from the GEOS-Chem averages or extrapolation from a lower altitude, which means the spatial and temporal variation is lost.
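To make concrete how an aircraft profile with unsampled upper levels becomes a column, the following sketch fills the missing levels from a model profile and integrates the result over pressure. It is a simplified illustration and not the procedure of Sect. 2.4: the function names and profile values are hypothetical, the integration is a plain trapezoid rule under hydrostatic balance, and the distinction between dry and moist air is ignored.

```python
import numpy as np

N_A = 6.02214076e23   # Avogadro constant, mol^-1
G = 9.80665           # standard gravity, m s^-2
M_AIR = 0.0289644     # molar mass of dry air, kg mol^-1

def fill_with_model(aircraft_ppt, model_ppt):
    """Replace unsampled (NaN) aircraft levels with model values."""
    aircraft_ppt = np.asarray(aircraft_ppt, dtype=float)
    model_ppt = np.asarray(model_ppt, dtype=float)
    from_model = np.isnan(aircraft_ppt)
    return np.where(from_model, model_ppt, aircraft_ppt), from_model

def column_from_profile(pressure_hpa, mixing_ratio_ppt):
    """Integrate a mixing ratio profile over pressure; returns molec. cm^-2."""
    p_pa = np.asarray(pressure_hpa, dtype=float) * 100.0
    x = np.asarray(mixing_ratio_ppt, dtype=float) * 1e-12
    layer_mean = 0.5 * (x[:-1] + x[1:])   # mean mixing ratio in each layer
    dp = np.abs(np.diff(p_pa))            # layer thickness in Pa
    molec_per_m2 = np.sum(layer_mean * dp) * N_A / (G * M_AIR)
    return molec_per_m2 * 1e-4            # m^-2 -> cm^-2

# Hypothetical binned profiles (pptv); the aircraft did not sample the upper levels.
pressure_hpa = np.array([1000.0, 900.0, 800.0, 700.0, 500.0, 300.0, 200.0])
aircraft_ppt = np.array([2000.0, 1500.0, 800.0, 300.0, np.nan, np.nan, np.nan])
model_ppt = np.array([1800.0, 1400.0, 700.0, 250.0, 100.0, 80.0, 60.0])

filled, from_model = fill_with_model(aircraft_ppt, model_ppt)
vcd = column_from_profile(pressure_hpa, filled)
print(f"levels taken from the model: {from_model.sum()} of {from_model.size}")
print(f"aircraft-derived VCD: {vcd:.2e} molec. cm^-2")
```

Every level flagged in `from_model` carries the model's value rather than a measurement, so any day-to-day variation in upper tropospheric NO2 that the model average lacks is also absent from the resulting column.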
Figure 7c shows the effect of using the WRF-Chem a priori profiles instead of the GEOS-Chem profiles to extend the SENEX profiles. The WRF-Chem profiles do include spatial and temporal variation in the UT, but using them reinforces the AMF errors, moving all points away from the 1 : 1 line. Without either in situ measurements of the UT in the southeast US or Pandora total column observations, we cannot separate the errors in AMF caused by the overestimated UT NO2 in the a priori profiles from the error caused by the lack of spatiotemporal variation in the extended aircraft profiles. For example, the error in the cluster of points below the 1 : 1 line in Fig. 7c could be corrected if either the UT NO2 in the a priori profile was reduced, decreasing the AMFs and so increasing the BEHR VCDs, or if the aircraft profile had less NO2, thus moving the points left onto the 1 : 1 line. (In this case, there would still be a discrepancy between the BEHR VCD and the VCD derived from combining aircraft and WRF-Chem profiles, suggesting that the WRF-Chem UT NO2 is still too great.) Other campaigns do have better sampling of the UT, e.g., SEAC4RS (Fig. 7b, d, f), but do not have as many profiles in the southeast US (Fig. 7e, f). Therefore, we must currently assign an uncertainty of ±100 % to VCDs retrieved with daily profiles in the southeast US (east of 95° W and south of 37.5° N). This is almost certainly overly conservative, as Laughner et al. (2018f) showed that the frequency distribution of UT NO2 in the southeast a priori profiles was skewed to lower values in the daily profiles, and a 3-month average using daily a priori profiles resulted in greater VCDs than using monthly a priori profiles, which would not be the case if the daily profiles always overestimated the UT. This suggests that days with little or no lightning in both the real world and WRF-Chem simulations are more numerous than days with a significant lightning contribution, and so a multi-month average using daily profiles would in fact accurately capture this. However, without long-term independent column measurements in the southeast, we cannot confirm this hypothesis. Future work will focus on improving the simulation of lightning in the southeast US. If successful, improved WRF-Chem profiles for the southeast can be implemented.
In the DISCOVER campaigns, BEHR v3.0 (D) using daily profiles has regression slopes similar to or closer to 1 than BEHR v3.0 (M) using monthly profiles, except in the DISCOVER-CA aircraft comparisons. There is a clear improvement in DISCOVER-TX using daily profiles. This suggests that the daily profiles are capturing small concentrated plumes in the urban area (Fig. 3b), which is improving the retrieval overall in an urban area with many highly concentrated industrial NOx sources. Therefore, we argue that daily profiles improve the retrieval in many ways, not only for applications that select for upwind-downwind pixels as shown in Laughner et al. (2016), but also for multi-month averages in dense urban areas.
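The AMF argument made above for the SENEX comparison (less UT NO2 in the a priori profile lowers the AMF and therefore raises the retrieved VCD) can be checked with a toy calculation. The sketch below uses the common formulation in which the AMF is the a priori-weighted mean of altitude-dependent scattering weights. The scattering weights, partial columns, and slant column are invented for illustration and are not BEHR values.

```python
import numpy as np

def amf(scattering_weights, apriori_partial_columns):
    """AMF as the a priori-weighted average of the scattering weights."""
    w = np.asarray(scattering_weights, dtype=float)
    x = np.asarray(apriori_partial_columns, dtype=float)
    return np.sum(w * x) / np.sum(x)

# Five layers from the surface to the upper troposphere (illustrative values).
# Sensitivity to NO2 increases with altitude, so the weights grow upward.
weights = np.array([0.4, 0.6, 1.0, 1.6, 2.0])

# A priori partial columns (arbitrary units): same lower troposphere,
# but the second profile has a much larger lightning contribution aloft.
profile_ref = np.array([5.0, 2.0, 0.5, 0.3, 0.3])
profile_high_ut = np.array([5.0, 2.0, 0.5, 1.0, 1.5])

scd = 4.0e15  # tropospheric slant column, molec. cm^-2 (assumed)

amf_ref = amf(weights, profile_ref)
amf_high_ut = amf(weights, profile_high_ut)
vcd_ref = scd / amf_ref
vcd_high_ut = scd / amf_high_ut

print(f"AMF (reference a priori): {amf_ref:.2f} -> VCD {vcd_ref:.2e}")
print(f"AMF (high-UT a priori):   {amf_high_ut:.2f} -> VCD {vcd_high_ut:.2e}")
```

For the same slant column, the a priori profile with excess UT NO2 gives the larger AMF and the smaller VCD, consistent with the low bias of v3.0 (D) relative to the SENEX-derived columns.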
5 Discussion: future efforts to validate daily profiles
Using space-based SCDs to evaluate the spatial distribution of NO2 in a chemical transport model (CTM) is powerful (Sect. 3.2) because both the SCDs and CTM provide a spatially continuous field of NO2 columns. As we have shown here, this makes a qualitative evaluation straightforward and illustrative. However, a quantitative metric is more challenging to devise, as the direct correlation of model and satellite columns is less important than the more abstract agreement between the overall plume direction and extent. Because daily high-resolution profiles provide important benefits to an NO2 retrieval, development of more quantitative methods to evaluate model performance in this manner should be a priority. There are several possibilities. First, an algorithm that identifies the plume and computes the direction and length of its major axis could be used. This would allow a more direct comparison of the direction and extent of the plumes. Such an algorithm would not be trivial to develop; in cases such as the one shown in Fig. 5a, c, it would likely have difficulty distinguishing the plume direction accurately.
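As one hypothetical starting point for the first option, the direction and length of a plume's major axis can be estimated from the weighted second moments of the pixels that exceed a background threshold. The sketch below is not an algorithm from this work: the function name, the threshold, and the synthetic plume are assumptions, and it handles only a single isolated plume, which is exactly the limitation noted above for scenes like Fig. 5.

```python
import numpy as np

def plume_major_axis(x, y, column, threshold):
    """Return (angle in degrees from the x axis, approximate length) of a plume."""
    mask = column > threshold
    if mask.sum() < 2:
        raise ValueError("fewer than two pixels exceed the threshold")
    w = column[mask] - threshold           # enhancement above background
    xs, ys = x[mask], y[mask]
    cx = np.average(xs, weights=w)         # weighted plume centroid
    cy = np.average(ys, weights=w)
    dx, dy = xs - cx, ys - cy
    cxy = np.average(dx * dy, weights=w)
    cov = np.array([[np.average(dx * dx, weights=w), cxy],
                    [cxy, np.average(dy * dy, weights=w)]])
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    major = eigvecs[:, -1]                  # direction of largest variance
    angle = np.degrees(np.arctan2(major[1], major[0])) % 180.0
    length = 4.0 * np.sqrt(eigvals[-1])     # roughly +/- 2 sigma along the axis
    return angle, length

# Synthetic plume on a 1 km grid, elongated 30 degrees from the x axis.
xg, yg = np.meshgrid(np.linspace(-100, 100, 201), np.linspace(-100, 100, 201))
theta = np.radians(30.0)
along = xg * np.cos(theta) + yg * np.sin(theta)
across = -xg * np.sin(theta) + yg * np.cos(theta)
column = np.exp(-0.5 * ((along / 40.0) ** 2 + (across / 10.0) ** 2))

angle, length = plume_major_axis(xg, yg, column, threshold=0.1)
print(f"plume direction: {angle:.1f} degrees, length: {length:.0f} km")
```

Applying the same function to the satellite SCDs and to the modeled VCDs for a given day would give two directions and two lengths that can be compared directly. Overlapping plumes from several sources would require a segmentation step first.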
Second, this problem could be treated as an image recognition problem. A neural network could be trained on modeled VCDs and SCDs. A training set of good and bad days could be constructed from the WRF-Chem simulations used in BEHR v3.0 (D). Development of this approach is beyond the scope of this paper.
Third, dense sensor networks (e.g., Shusterman et al., 2016; Kim et al., 2018) may also be useful to evaluate daily profiles by permitting a simpler correlation test between modeled and observed surface concentrations than is possible between modeled VCDs and observed SCDs. Development of these networks is a topic of active research. This method may be necessary for future retrievals, especially over the US and European domains, where decreasing NOx emissions mean that the contrast between plumes and background in SCDs is much weaker now than in 2007.

Conclusions
We have evaluated version 3.0B of the BEHR OMI NO2 product against multiple datasets. We find that the WRF simulation used to generate the a priori NO2 profiles generally agrees well with the available aircraft data. However, the number of lightning flashes is significantly overestimated in the southeast US, leading to an overestimate of the UT NO2 in that region, although the simulated flash density is broadly consistent with ENTLN observations elsewhere. When compared against aircraft-derived and Pandora VCDs, BEHR v3.0B performs better than SP v3.0, with regionally varying low biases of 0 %–51 % compared to in situ and Pandora measurements. Using daily profiles yields better results than monthly profiles, except in the southeast US.
The lessons learned here are applicable to geostationary satellites scheduled to launch in the near future. Because the BEHR retrieval focuses on the continental US, it serves as a useful prototype for future NO2 retrievals from geostationary satellites such as GEMS (Bak et al., 2013; Choi and Ho, 2015), Sentinel-4 (Ingmann et al., 2012), and TEMPO (Chance et al., 2013), which will also be inherently restricted to regional areas. This offers the opportunity to use higher-resolution a priori data than global retrievals.
Here, the results from the SENEX and SEAC4RS campaigns demonstrate that verifying the chemical transport model's reproduction of the day-to-day variability in lightning flashes is vital to obtain reliable results in such regions. With the sub-daily temporal resolution available to geostationary satellites, this will only become more important. Therefore, geostationary retrievals should evaluate the diurnal variation in lightning flashes in their a priori models using ground- and space-based lightning detectors (e.g., NLDN, ENTLN, or the GOES-R lightning mapper), and plans should be made to validate retrieved VCDs in multiple regions that have strong, but different, lightning influence. Such validations must include measurement of the UT NO2 profile and/or total column observations in order to reliably separate errors in the a priori profiles from errors in the observations used for evaluation.
Evaluating the day-to-day performance of the a priori profiles in future geostationary retrievals is crucial. Daily profiles have been shown to significantly affect retrieved NO2, especially in applications that systematically focus on NO2 VCDs downwind of a source (Laughner et al., 2016), and we have shown here that daily profiles also improve performance in urban areas. With the WRF-Chem model configuration used here, urban NO2 plumes are simulated with the correct spatial pattern ∼70 % of the time. Planned campaigns to evaluate geostationary satellite retrievals should be designed with an eye towards also evaluating the day-to-day accuracy of the a priori profiles.

Figure 1. Comparison of average WRF-Chem and aircraft NO2 profiles from the (a) SEAC4RS, (b) DC3, and DISCOVER-AQ campaigns, the latter in (c) Maryland, (d) California, (e) Texas, and (f) Colorado. Aircraft profiles are shown in black, BEHR v2.1 profiles in green, BEHR v3.0 monthly profiles in red, and (where available) BEHR v3.0 daily profiles in blue. The WRF and aircraft data are matched as described in Sect. 2.4 and binned by pressure. Uncertainties are 1 standard deviation of all profiles averaged. Note that for SEAC4RS the v2 profile reaches a maximum of ∼8000 pptv, off the plot axes.

Figure 2. Comparison between observed and simulated flash density from 13 May to 23 June 2012. Panels (a) and (b) show the mean flash density averaged over the study period from ENTLN and WRF-Chem, respectively. Both are gridded at 12 km grid spacing. Panels (c) and (d) show the correlation between total flash density per day between WRF and ENTLN in (c) the southeast US (denoted by the red box in a and b) and (d) elsewhere in the contiguous US.

Figure 3. R² values for correlation between aircraft data and spatiotemporally matched WRF-Chem data for the (a) DISCOVER-CA, (b) DISCOVER-TX, and (c) DISCOVER-CO campaigns, binned by pressure. Left column shows absolute R² values for each bin. Right column shows the difference in R² values using monthly average and daily profiles for each bin.

Figure 4. A comparison of OMI SCDs (a) and WRF monthly average (b) and daily (c) VCDs. The star marks the location of the Four Corners power plant. Data are from 4 March 2007.

Figure 5. A comparison of OMI SCDs (a) and WRF monthly average (b) and daily (c) VCDs. The star marks the location of New York, NY, USA. Data are from 29 September 2007.

Figure 7. (a, b) The profiles used to calculate the aircraft VCDs extended using WRF-Chem or GEOS-Chem profiles; the solid line is the median of all profiles, and the shading represents the 10th and 90th percentiles for each binned level. Circles indicate levels that were derived from the models in at least 50 % of the profiles. (c, d) Comparison of BEHR v3.0 (D) VCDs vs. aircraft-derived VCDs using GEOS-Chem and WRF-Chem profiles to extend the profile to the surface and tropopause. The black lines connect corresponding comparisons between the two methods, and the red dashed line represents the 1 : 1 agreement. (e, f) Difference between aircraft VCDs extended with WRF-Chem and GEOS-Chem profiles. Panels (a, c, e) are for the SENEX campaign, and (b, d, f) are for SEAC4RS.

Table 2. Mean absolute bias between each of the types of simulated NO2 profiles and the aircraft profiles shown in Fig. 1. Values are given for the boundary layer (BL) and free troposphere (FT), with the divide at 775 hPa (∼2 km). All values are in parts per trillion by volume (pptv).

Table 3. Slopes and 1σ uncertainties of BEHR vs. combined aircraft (extended with GEOS-Chem profiles) and Pandora VCDs. "Matched" slopes use only Pandora data approximately coincident with aircraft profiles to obtain similar sampling; "all" slopes use all valid Pandora data. Outliers and negative VCDs are removed before computing slopes.