An overview of measurement comparisons from the INTEX-B/MILAGRO airborne field campaign

Abstract. As part of NASA's INTEX-B mission, the NASA DC-8 and NSF C-130 conducted three wing-tip to wing-tip comparison flights. The intercomparison flights sampled a variety of atmospheric conditions (polluted urban, non-polluted, marine boundary layer, and clean and polluted free troposphere). These comparisons form a basis for establishing data consistency, but should also be viewed as a continuation of efforts to better understand and reduce measurement differences identified in earlier field intercomparison exercises. This paper provides a comprehensive overview of 140 intercomparisons of the data collected, as well as a record of the measurement consistency demonstrated during INTEX-B. Its primary goal is to provide the information necessary for future research to determine whether observations from different INTEX-B platforms/instruments are consistent within the PI-reported uncertainties and can be used in integrated analyses. This paper may also contribute to the formulation of strategies for future instrument development. For interpretation and most effective use of these results, the reader is strongly urged to consult with the instrument principal investigators.


Introduction
The Intercontinental Chemical Transport Experiment-B (INTEX-B) was the second major airborne field mission of the NASA-led INTEX-NA (North America) program, conducted in the spring of 2006 to investigate the transport and transformation of pollution over the North American continent. INTEX-B operated in coordination with two larger programs, the MILAGRO (Mega-city Initiative: Local and Global Research Observations) and IMPEX (Intercontinental and Mega-city Pollution Experiment) missions. INTEX-B comprised two phases. Phase one occurred from 1-21 March to maximize overlap with the MILAGRO campaign. During this phase, observations were made primarily over Mexico and the Gulf of Mexico. The second phase lasted from 15 April to 15 May and focused on Asian pollution transported across the Pacific Ocean. Five specific goals were identified for INTEX-B: (1) to investigate the extent and persistence of the outflow of pollution from Mexico; (2) to understand the transport and evolution of Asian pollution, the related air quality, and climate implications in western North America; (3) to relate atmospheric composition to chemical sources and sinks; (4) to characterize the effects of aerosols on radiation; and (5) to validate satellite observations of tropospheric composition (Singh et al., 2009). For a complete mission overview, the reader is referred to Singh et al. (2009).
The INTEX-B field mission involved two comparably equipped aircraft, the NASA DC-8 and NSF C-130. The sampling strategy often required coordination of both aircraft while making measurements in different regions or at different times. This naturally led to the pre-planning and execution of a series of comprehensive measurement comparisons of species/parameters measured on both platforms. The overarching goal was to generate a program-wide unified data set from all available resources to better address the science objectives. These comparisons form a basis for establishing data consistency. The INTEX-B measurement comparison exercise should also be viewed as a continuation of efforts to better understand and reduce measurement differences identified in earlier field intercomparison exercises (e.g. NASA TRACE-P; Eisele et al., 2003). NASA has a long history of conducting instrument intercomparisons, beginning with ground-based intercomparisons in July 1983 (Hoell et al., 1984, 1985a; Gregory et al., 1985) prior to the commencement of airborne field studies in October 1983 with the Chemical Instrumentation Test and Evaluation (CITE) missions (Beck et al., 1987; Hoell et al., 1990, 1993; Gregory et al., 1993a,b,c). These early instrument intercomparisons were conducted on a common aircraft platform and played an important role in understanding the sensitivity of different techniques and evaluating them to find the best possible field instrument. The early intercomparison effort stimulated the development of atmospheric measurement techniques/instruments, benefitting airborne field programs to this day. Since the early 2000s, integrated field campaigns have made use of the same measurement technique on separate aircraft platforms, or of different measurement techniques on the same or separate aircraft platforms. To understand the differences seen in the data and to better utilize the data from various instruments, a careful and thorough intercomparison is needed. The first NASA two-aircraft intercomparison was conducted during the 2001 TRACE-P (Transport and Chemical Evolution over the Pacific) field campaign (Eisele et al., 2003). During TRACE-P the NASA DC-8 and P-3B flew wing-tip to wing-tip within 1 km of each other on three occasions lasting between 30 and 90 min. A significant finding of this exercise was that an intercomparison between two aircraft can reveal important insight into instrument performance. It also verified that two aircraft can be flown in a manner such that both sample the same airmass and experience the same high- and low-frequency fluctuations necessary to evaluate common measurements. In general the best agreement was achieved for the most abundant species (CO2 and CH4), with mixed results for less abundant species and those with shorter lifetimes (Eisele et al., 2003). The TRACE-P comparison of fast (1 s) measurements of CO and O3 provided valuable information for defining bulk airmass properties, which was useful in interpreting the comparison results for short-lived species. The effect of small-scale spatial variation should not have a significant impact on the assessment of systematic differences, especially when the range of the comparison is sufficiently larger than these variations.
Following TRACE-P, another major coordinated intercomparison occurred in 2004 during the International Consortium for Atmospheric Research on Transport and Transformation (ICARTT) airborne missions (INTEX-A, NEAQS-ITCT 2004, and ITOP). Five wing-tip to wing-tip intercomparison flights were conducted, allowing comparisons between four aircraft. Although not formally published, these intercomparisons and additional mission information can be found under the Measurement Comparisons: ICARTT/INTEX-A link at http://www-air.larc.nasa.gov/missions/intexna/intexna.htm. The purpose of this paper is to provide a straightforward and comprehensive overview and record of the measurement consistency as characterized through analysis of the INTEX-B intercomparison data. This paper is not intended as a review of instrument operation but rather as a means to highlight the demonstrated instrument performance during the intercomparison periods. Intercomparison results are intended to identify measurements where an investment in improving measurement capability would be of great benefit. Results are also crucial to ensuring that analysis and modeling activities based on multi-platform observations reach conclusions that can be supported within the assessed data uncertainties. For parties interested in making use of the data presented here, further consultation with the relevant measurement investigators is strongly recommended. The remainder of this paper presents the details of the INTEX-B intercomparison.
Section 3 describes the intercomparison approach and implementation, including a description of the types of comparisons performed. Data processing procedures and the statistical assessment are presented in Sect. 4. Section 5 contains the results, and the summary is contained in Sect. 6.

Approach/implementation
During the INTEX-B/MILAGRO/IMPEX field campaigns, three formal measurement comparisons were carried out on 19 March, 17 April, and 15 May 2006. These segments were well integrated into science flights to achieve the overall science goals while aiming to compare instruments/measurements under a wide variety of conditions, as summarized in Table 1. During the intercomparison portions of the flights, aircraft separation was less than 300 m in the horizontal and less than 100 m in the vertical. The intercomparison period for the 19 March flight lasted 41 min (Fig. 1a), covered altitudes from 0.3 to 3.4 km, and encountered Mexico City pollution as well as marine boundary layer air off the coast of Mexico. The wide range of chemical conditions is evident in the CO levels observed during the intercomparison period, which ranged from 103 to 223 ppbv. The 17 April intercomparison (Fig. 1b) ranged from polluted conditions over California to clean conditions at 6 km over southern Oregon. Again, the range in chemical conditions can be inferred from the CO levels encountered (99 to 163 ppbv). The last intercomparison flight, on 15 May (Fig. 1c), was the longest, lasting approximately 1 h. This intercomparison began in the clean free troposphere (about 5.5 km) off the northern Oregon coast and ended in the marine boundary layer (near 0.3 km) off the northern California coast. As with the two previous intercomparisons, a variety of chemical conditions existed. For these comparisons, data from all three flights were combined for analysis, and only data with values greater than the limit of detection were used. The comparisons cover short-lived to long-lived gas-phase species as well as particulate microphysical, optical, and chemical properties. Table 2 provides a detailed list of the species/parameters included in the intercomparison along with measurement techniques, aircraft platform, principal investigators (PIs), measurement uncertainties, and confidence levels. All of the above information was taken from the PI file headers except for confidence level. For an explanation of "Technique", the reader is referred to the individual PI files located on the INTEX-B website (http://www-air.larc.nasa.gov/missions/intex-b/intexb.html) under the Current Archive Status link. The reported analysis was based on data submissions prior to 1 January 2010. The online plots may change to reflect later data updates.
In addition to the uncertainty information provided in the PI file headers, a special effort was made to obtain measurement uncertainties that were not originally provided in the file headers, as well as confidence levels. This information is necessary to determine whether measurements are consistent and constitutes important metadata for future analysis. Some reported total uncertainties were given at the 1 or 2 sigma confidence level, while in other cases confidence levels were not specified. The confidence level is typically associated with precision or precision-dominated uncertainties. In some cases, both precision and accuracy are explicitly given in Table 2, while in many other cases only total uncertainties are provided by the PI without a clear association to a confidence level. The concept of a confidence level may be ill-defined for cases where accuracy is the dominant component of the total uncertainty. In these cases, readers are directed to the measurement PIs for proper application of the uncertainty information.
It is imperative that both aircraft sample the same airmass during the intercomparison period. In practice, this is accomplished by keeping the aircraft in close proximity while maintaining a safe separation. Analysis of the fastest measurements is an effective way to verify that the same airmass was sampled by both aircraft. If the same airmass is sampled, we expect the large-scale features to be captured by both instruments. This is illustrated in the time series plots for both ozone (19 March) and water (15 May), where the major features are well represented by both instruments in each comparison (Figs. 2a and 3a). While the most prominent features are apparent in the data from each instrument, there is less agreement in the relatively small-scale changes that occur when O3 remains consistently low (at low altitude in the marine boundary layer) and also at higher altitudes and higher O3 levels (polluted Mexico City airmass). The time series for water displays similar behavior: the large-scale features are well matched, while there is less agreement in the finer features at both high (clean free troposphere) and low altitudes (marine boundary layer). The correlation plots (Figs. 2b and 3b), with associated regressions and coefficients of correlation (R^2), offer an additional method for evaluating the likelihood that the instruments sampled the same airmass. R^2 is defined as

R^2 = \frac{\left[\sum_{i}(x_i-\bar{x})(y_i-\bar{y})\right]^2}{\sum_{i}(x_i-\bar{x})^2 \, \sum_{i}(y_i-\bar{y})^2},

where \bar{x} is the average of the x values and \bar{y} is the average of the y values. Both ozone and water show that the measurements are strongly correlated, as evidenced by the high R^2 values. Although it is not easy to discern in the time series for water, there is a slight time lag in one of the water measurements. This is evident in Fig. 3b, where data points depart from the tighter cluster in curved lines. In general the spread in the data appears larger for water than for ozone; however, this may be due in part to the smaller range in the x and y scales for water. The high R^2 values for both ozone and water nevertheless indicate that the two aircraft were most likely sampling the same airmass.

Intercomparison analysis was conducted during each stage of data submission: (1) comparison of field data (blind), (2) comparison of preliminary data (not blind), and (3) comparison of final data (not blind). These analyses and the distribution of results were carried out by the Measurement Comparison Working Group (MCWG). The primary responsibilities of the MCWG included providing for secure field data submission to facilitate the "blind" comparison, analyzing data at each stage of data submission, and disseminating the results within the science team and to the atmospheric community at large. In stage one, the blind comparison of field data, PIs submitted data within 24 h to a few days after the flight to an ftp site that was "blind" to the science team until both members of a comparison pair had been submitted. For example, the CO data were not available to the science team until both the NSF C-130 and NASA DC-8 PIs had submitted their CO data for the intercomparison flight. The MCWG then assessed the consistency between the paired DC-8 and C-130 measurements/instruments and released the comparison results and the data to the science team. In the preliminary data stage, data were compared again after allowing the PIs to apply post-mission calibration and additional processing/correction procedures to their data. The MCWG presented these results to the science team at the post-mission data workshop.
In the comparison of final data (not blind), PIs submitted final data with uncertainty estimates. These results are archived online (http://www-air.larc.nasa.gov/missions/intex-b/intexb-meas-comparison.htm) and summarized here. In addition to the inter-platform comparisons, intra-platform comparisons were made whenever possible. Since both instruments of such a pair were located on the same aircraft, these comparisons were not limited to the three intercomparison periods discussed previously; rather, they could span the entire mission.
As previously stated, the primary goal of this paper is to present a comprehensive overview of the INTEX-B/MILAGRO/IMPEX intercomparison results and provide a record of the measurement consistency. The level of agreement between the measurements may depend on a number of factors, including calibration, instrument time response, and measurement technique. For the comparison of the aerosol measurements, the particle size range of the measurements is a critical consideration. The information summarized in Table 2 and Tables 3-5 is critical for determining whether observations made from different platforms/instruments are consistent within the PI-reported uncertainties. This is necessary when deciding if multiple data sets should be used in integrated analyses. At the same time, users are cautioned that differences between measurements can still be significant even though they are technically consistent within the combined uncertainties quoted by the PIs. In addition, this overview paper does not attempt to describe the complexities of the various measurement techniques. Any interpretation of the results of these studies should be done in consultation with the individual instrument PIs (provided in Table 2).

Data processing procedures and statistical assessment
The quantitative assessment of measurement/instrument consistency was based on statistical analysis of the intercomparison data. This required merging the data to a common timeline. Merging was easiest when measurements were conducted with the same timing and integration period; however, it is not unusual for instruments based on different techniques to require different integration times to measure the same species/parameter, or for instruments on different platforms to be poorly synchronized. For cases where instruments had the same integration period but were not synchronized, the data were merged so as to ensure at least 50% sampling time overlap. For paired measurements with different integration time intervals, the shorter integration time measurements were merged into the longer time interval when the measurements at the shorter time interval overlapped at least 50% of the longer time interval (a sketch of this rule is given below). These merged data pairs were used to quantitatively assess measurement consistency through linear regression analysis, when applicable, or through descriptive statistics based on the ratio (DC-8/C-130) of the paired data points. The linear regression slopes and intercepts can be used to describe the level of measurement agreement when a high enough level of correlation exists.
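To make the merging rule concrete, the following is a minimal sketch, not taken from the paper: it assumes the data are simple numpy arrays of interval start times and values, and it adopts one plausible reading of the 50% criterion (a long interval is kept only when the shorter samples cover at least half of it). The function name merge_to_longer and all argument names are our own.

```python
import numpy as np

def merge_to_longer(t_short, v_short, dt_short, t_long, dt_long, min_cover=0.5):
    """Average shorter-integration samples into each longer interval.

    t_short, t_long : interval start times (s); dt_short, dt_long :
    integration periods (s). A long interval is retained only when the
    overlapping short samples cover at least min_cover of it; otherwise
    NaN is reported for that interval.
    """
    merged = []
    for t0 in t_long:
        t1 = t0 + dt_long
        vals, covered = [], 0.0
        for ts, vs in zip(t_short, v_short):
            # temporal overlap of [ts, ts + dt_short] with [t0, t1]
            overlap = min(ts + dt_short, t1) - max(ts, t0)
            if overlap > 0:
                vals.append(vs)
                covered += overlap
        merged.append(np.mean(vals) if covered >= min_cover * dt_long else np.nan)
    return np.array(merged)
```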
For the regression analysis, this correlation criterion has been defined here as an R^2 value of 0.75. Lower R^2 values are typically encountered when the range of variation is limited in comparison to the uncertainties of the measurements and/or when other instrument issues exist. When R^2 is below the threshold of 0.75, the median and percentile values of the DC-8/C-130 ratio have been used to express the level of consistency between the paired data. In addition, the absolute (or arithmetic) difference between paired data may be used in some cases (with combined uncertainties) to gain additional insight. Statistical comparisons presented here have been based on Orthogonal Distance Regression (ODR). Orthogonal distance regression is a regression technique similar to an ordinary least squares (OLS) fit, with the stipulation that both x and y are treated as variables with errors. ODR minimizes the sum of the squares of the orthogonal distances rather than the vertical distances (as in OLS). ODR is generally equivalent to

\min_{\beta,\,\delta,\,\epsilon} \sum_{i=1}^{n} \left( w_{\epsilon_i} \epsilon_i^2 + w_{\delta_i} \delta_i^2 \right) \quad \text{subject to} \quad y_i = f(x_i + \delta_i; \beta) - \epsilon_i,

where ε_i is the error in y, δ_i the error in x, w_ε_i and w_δ_i weighting factors, and β a vector of parameters to be determined (slope and intercept in this case) (Zwolak et al., 2007). Note that a weighted ODR (w_ε_i and w_δ_i ≠ 1) is necessary when the observations x_i and y_i are heteroscedastic (variance changes with i) (Boggs et al., 1988). It has been shown that ODR performs at least as well as, and in many cases significantly better than, OLS, especially when the ratio of the error standard deviations σ_ε/σ_δ is less than 2 (Boggs et al., 1988). Boggs et al. (1988) have shown that ODR results in smaller bias, variance, and mean square error (mse) than OLS, except possibly when significant outliers are present in the data. For the bias of the parameter, β, and function estimates, f(x_i; β), OLS is statistically better only 2% of the time, while ODR is significantly better 50% of the time. Results for the variance and mse of the parameter and function estimates were similar; ODR variance and mse were smaller than those from OLS about 25% of the time, while OLS results were significantly better than ODR only 2% of the time (Boggs et al., 1988). While ODR allows for the possibility of assigning specific uncertainties to each data point, an accurate estimate of measurement uncertainty is often not available on a point-by-point basis. Even when available, this can be complicated when merging measurements of differing integration times. Therefore, in the interest of treating all the intercomparisons uniformly, we use w_ε_i = w_δ_i = 1. The coefficient of determination, R^2, is used to indicate the quality of the linear relationship between the paired measurements.
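For readers who wish to apply the same style of assessment to the archived data, here is a minimal sketch, assuming two merged numpy arrays containing only values above the detection limits (as in the paper). It uses scipy.odr with unit weights and falls back to DC-8/C-130 ratio percentiles when R^2 < 0.75. The helper compare_pair, its return format, and the choice of axes are illustrative assumptions, not the MCWG's actual code.

```python
import numpy as np
from scipy import odr

def linear(beta, x):
    # y = slope * x + intercept
    return beta[0] * x + beta[1]

def compare_pair(dc8, c130, r2_threshold=0.75):
    """Summarize a merged DC-8/C-130 measurement pair in the paper's style."""
    x, y = np.asarray(c130), np.asarray(dc8)   # axis assignment is arbitrary here
    r2 = np.corrcoef(x, y)[0, 1] ** 2
    if r2 >= r2_threshold:
        # unweighted ODR (w_eps_i = w_delta_i = 1); an OLS fit seeds the solver
        beta0 = np.polyfit(x, y, 1)
        fit = odr.ODR(odr.RealData(x, y), odr.Model(linear), beta0=beta0).run()
        return {"R2": r2, "slope": fit.beta[0], "intercept": fit.beta[1]}
    # low correlation: report percentile statistics of the DC-8/C-130 ratio
    ratio = y / x
    p10, p50, p90 = np.percentile(ratio, [10, 50, 90])
    return {"R2": r2, "median_ratio": p50, "p10": p10, "p90": p90}
```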

INTEX-B intercomparison
Three types of comparisons were conducted and are presented below: DC-8 to C-130 (Table 3), DC-8 to DC-8 (Table 4), and C-130 to C-130 (Table 5). One hundred and forty parameters were grouped according to chemical similarities and compared. The chemical groups for intercomparison purposes are photochemical precursors and gas-phase tracers, photochemical products, photochemical radicals, oxygenated volatile organic carbons (OVOCs), non-methane hydrocarbons (NMHCs) along with halocarbons, alkyl nitrates, and organic sulfur compounds, photolysis frequencies, particle microphysical properties, particle chemical composition, and particle scattering and absorption. As stated previously, when R^2 is greater than or equal to 0.75, the slope and intercept of the regression are given to represent the level of measurement consistency. It is noted here that the intercept should not simply be interpreted as the offset between the instruments. When R^2 is less than 0.75, percentile statistics are given based on the ratio of the data (DC-8/C-130). The resulting statistics are given in Tables 3a through 3i for the DC-8 to C-130 comparison. All analyses are based on the archived final data combined from all three intercomparison flights. No statistical analyses are provided when there is an insufficient number of data points to adequately represent the entire intercomparison periods. Finally, the range (minimum and maximum) is provided as additional information for the reader. In addition to the comparisons listed in Tables 3, 4, and 5, the uncertainties for each instrument can be found in Table 2. The uncertainties were provided in the final data file archive (Current Archive Status link). For cases where uncertainties were available on a point-by-point basis, the uncertainty was calculated as a percentage of the measurement; the minimum and maximum percentages are given in parentheses and the median is listed outside the parentheses (a small illustration follows below). We present these comparisons and uncertainties without rating the level of agreement. This is a highly subjective task, and we leave it to the reader to make that judgment with appropriate consultation with the respective PIs.
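As a small illustration of the "median (minimum, maximum)" percentage convention just described, assuming point-by-point uncertainties archived alongside the measurements (the helper below is hypothetical):

```python
import numpy as np

def uncertainty_summary(values, uncertainties):
    """Express point-by-point uncertainties as percentages of the measurement,
    formatted as 'median% (min%, max%)' in the style of Table 2."""
    pct = 100.0 * np.asarray(uncertainties) / np.asarray(values)
    return f"{np.median(pct):.0f}% ({pct.min():.0f}%, {pct.max():.0f}%)"
```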
For an explanation of "Technique", the reader is referred to the individual PI files located on the INTEX-B website (http://www-air.larc.nasa.gov/missions/intex-b/intexb.html) under the Current Archive Status link. All intercomparison correlation plots can be found online under the Measurement Comparisons: MILAGRO/INTEX-B/IMPEX link at http://www-air.larc.nasa.gov/missions/intex-b/intexb.html. The correlation plots of the data combined from all three flights are in the summary section. Individual time series and correlation plots are also available for each intercomparison on 19 March, 17 April, and 15 May 2006.
As described earlier, intra-platform comparisons were also conducted on both the DC-8 and C-130 aircraft for any overlapping measurements. See Tables 4a through 4c for a complete list of the species, techniques used, and a statistical summary for the DC-8 to DC-8 comparisons. Tables 5a-e provide the statistical summary for the C-130 to C-130 comparisons. Since the instruments were located on the same platform, comparison data were not limited to the intercomparison portions of the flights; data from the entire mission could be included.

Comparison with ICARTT data
In addition to the intercomparisons made during INTEX-B, we wish to examine the cases where the same comparisons could be made with data from the ICARTT mission and to highlight instances where those intercomparisons show significant change. The ICARTT mission was conducted in 2004, a portion of which was INTEX-A (the predecessor to INTEX-B). For a complete description of INTEX-A see Singh et al. (2006). A full listing of the INTEX-A intercomparisons can be found at http://www-air.larc.nasa.gov/missions/intexna/meas-comparison.htm. There are three cases where significant change is observed between INTEX-A and INTEX-B: H2O2, PAN, and total PANs. For H2O2 the comparison was a DC-8 intra-platform comparison between CIT CIMS and URI EFD during INTEX-A (Fig. 4a), while for INTEX-B, CIT CIMS was on the C-130 and URI EFD on the DC-8 (Fig. 4b). For interpretation and most effective use of these results, the reader is strongly urged to consult with the instrument PIs. We leave it to the reader to determine the level of consistency between the instruments compared. This should be done not only with the statistical analyses provided in Tables 3, 4, and 5, but also in consideration of the uncertainties in Table 2, keeping in mind that even when measurements are technically consistent within the PI-reported uncertainties, significant differences between the measurements can still exist if the uncertainties are large. In addition, future instrument work may benefit from this assessment.