Revision of the WMO/GAW CO2 Calibration Scale

The NOAA Global Monitoring Laboratory serves as the World Meteorological Organization Global Atmosphere Watch (WMO/GAW) Central Calibration Laboratory (CCL) for CO2 and is responsible for maintaining the WMO/GAW mole fraction scale used as a reference within the WMO/GAW program. The current WMO-CO2-X2007 scale is embodied by 15 aluminum cylinders containing modified natural air, with CO2 mole fractions determined using the NOAA manometer from 1995 to 2006. We have made two minor corrections to historical manometric records: fixing an error in the applied second virial coefficient of CO2, and accounting for loss of a small amount of CO2 to materials in the manometer during the measurement process. By incorporating these corrections, extending the measurement records of the original 15 primary standards through 2015, and adding four new primary standards to the suite, we define a new scale, identified as WMO-CO2-X2019. The new scale is 0.18 μmol mol⁻¹ (ppm) greater than the previous scale at 400 ppm CO2. While this difference is small in relative terms (0.045%), it is significant in terms of atmospheric monitoring. All measurements of tertiary-level standards will be reprocessed to WMO-CO2-X2019. The new scale is more internally consistent than WMO-CO2-X2007 owing to revisions in propagation, and should result in an overall improvement in atmospheric data records traceable to the


Introduction
Measurements of the atmospheric distribution of carbon dioxide (CO2) are essential to understanding sources and sinks of this powerful greenhouse gas. We need well-calibrated measurements to track the history of the global abundance of CO2 because it is the main driving force of man-made climate change. Small differences in the relative abundances of CO2 and other trace gases observed at different locations, combined with information on atmospheric transport and mechanisms for land-atmosphere-ocean exchange, can provide constraints on estimates of the sources and sinks of CO2. Measurements are made at numerous sites around the globe in conjunction with the WMO Global Atmosphere Watch program and through regionally coordinated programs (e.g. Integrated Carbon Observing System, ICOS).

[...] to expand the range of the WMO/GAW scale to 800 ppm to better constrain instrument response and also provide support for measurements obtained closer to emission sources, such as urban areas; and 4) we have recently developed a new measurement system used to transfer the scale to reference gases (Tans et al., 2017), which now allows us to harmonize the primary standards and define the scale with higher precision than what can be done with a single standard (see section 6.0).
Here we introduce a revision of the WMO/GAW CO2 scale, with the new scale identified as WMO-CO2-X2019 (hereafter referred to as X2019), and describe its implementation. This article is organized as follows. We first provide some background on the manometric method. We then describe two corrections to previous manometric results: a correction for a calculation error related to the second virial coefficient of CO2, and a correction for CO2 absorption/adsorption to manometer surfaces (most likely O-rings) that occurs during the measurement process. The magnitude of the overall correction is small (~0.18 ppm at 400 ppm), but significant in terms of network compatibility goals (WMO, 2020). We have applied these corrections to 23 years of manometric measurements. By reassigning CO2 mole fractions to previous and newly introduced primary standards, we define the X2019 scale and explore differences between X2019 and X2007. We provide an estimate of the uncertainty associated with CO2 reference gases, updating the work of Zhao and Tans (2006). Finally, we propagate the X2019 scale to all reference gases analysed by the CCL and discuss the implementation of the X2019 scale.

The NOAA manometer
The manometric procedure is described in Zhao et al. (1997) and Zhao and Tans (2006). Briefly, the manometer consists of two glass volumes housed in a temperature-controlled oven, two glass traps for cryogenically extracting CO2 from air and purifying the CO2, and devices to measure pressure and temperature (Fig. 1). During a measurement experiment, the manometer is evacuated to ~5 mtorr and then gas from a cylinder is loaded into the larger of the two volumes (large volume, ~6 L). The large volume is flushed for 10 min at 200 mL min⁻¹ and the exit gas stream is monitored by NDIR to ensure a stable CO2 signal. Failure to observe a stable CO2 signal (<0.1 ppm) can result in the run being aborted. The large volume is then sealed off and allowed to equilibrate for five minutes, and the large volume temperature and pressure are recorded. The air sample is then pumped across the glass traps, which are held at liquid nitrogen temperature, to cryogenically extract the CO2 from the air sample. The CO2 is then purified (to remove H2O) by alternately freezing at liquid nitrogen temperature (~-197 °C at 84 kPa) and warming to ~-67 °C. Finally, the purified CO2 is cryogenically trapped into the smaller of the two volumes (~7 mL) and allowed to sublimate. The pressure and temperature of CO2 in the small volume are recorded at ~30 s intervals as the CO2 warms and equilibrates to the oven temperature.

The mole fraction of CO2 is determined from measurements of pressure, temperature, and the ratio of the two volumes. The volume ratio is determined by a gas expansion method using two additional volumes, also housed in the oven. A gas, usually air or nitrogen, is expanded into successive volumes, with P and T measured at each stage, to bridge the difference between small and large volumes (Zhao et al., 1997). The mole fraction of CO2, XCO2, is calculated using Eq. (1), where T and P are the temperatures and pressures of air in the large volume (air) and nearly pure CO2 in the small volume (CO2), βair and βCO2 are second virial coefficients, R is the gas constant, Φ is the volume ratio (large/small), and XN2O is the mole fraction of N2O in the air sample (measured separately by gas chromatography with electron capture detection) (Hall et al., 2011). Equation (1) is an alternate form of eq. 8 from Zhao et al. (1997).

Figure 1: [...] (1), shown as blue lines. After the temperature and pressure of the air in the large volume are recorded, the air is drawn from the large volume, in the direction of arrow (2) (red lines), and through traps 1 and 2 to cryogenically trap the CO2. The CO2 is cryogenically purified in glass traps 1 and 2, and then transferred to the small volume where its pressure and temperature are determined. Auxiliary volumes ("AV") are used in separate experiments to determine the ratio of large and small volumes (volume ratio). The dashed line depicts a temperature-controlled oven housing the glass volumes and pressure gauge.
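To make the roles of the variables defined for Eq. (1) concrete, the following is a minimal sketch of a manometric mole fraction calculation. It assumes a first-order virial equation of state, Z = 1 + βP/(RT), for both gases and subtracts XN2O from the trapped fraction; this form follows from the definitions given in the text but is not claimed to be identical to the published Eq. (1). All numerical inputs (pressures, volume ratio, βair, XN2O) are hypothetical values chosen only to land near 400 ppm.

```python
# Molar gas constant (CODATA 2018), J mol^-1 K^-1
R = 8.314462618


def compressibility(beta, pressure, temperature):
    """First-order virial compressibility factor Z = 1 + beta*P/(R*T).

    beta in m^3 mol^-1, pressure in Pa, temperature in K.
    """
    return 1.0 + beta * pressure / (R * temperature)


def co2_mole_fraction(p_air, t_air, p_co2, t_co2, volume_ratio,
                      beta_air, beta_co2, x_n2o=0.0):
    """Return X_CO2 (mol/mol) from manometric pressures and temperatures."""
    # Moles per unit volume are proportional to P / (Z*T); R cancels in the ratio.
    n_co2 = p_co2 / (compressibility(beta_co2, p_co2, t_co2) * t_co2)
    n_air = p_air / (compressibility(beta_air, p_air, t_air) * t_air)
    # volume_ratio is Phi = V_large / V_small.
    x_trapped = n_co2 / (n_air * volume_ratio)
    # N2O is co-trapped with CO2, so its mole fraction is subtracted (assumption).
    return x_trapped - x_n2o


# Hypothetical inputs, for illustration only.
example = dict(p_air=84_000.0, t_air=310.0, p_co2=28_800.0, t_co2=310.0,
               volume_ratio=857.0, beta_air=-5.0e-6, x_n2o=0.33e-6)
x_correct = co2_mole_fraction(beta_co2=-112.8e-6, **example)   # beta at 310 K
x_original = co2_mole_fraction(beta_co2=-104.8e-6, **example)  # beta at 320 K
print(f"X_CO2 = {x_correct * 1e6:.2f} ppm, "
      f"shift from beta error = {(x_correct - x_original) * 1e6:.3f} ppm")
```

With these illustrative inputs, evaluating βCO2 at 320 K rather than 310 K lowers the computed XCO2 by a few hundredths of a ppm, the same order as the underestimate reported in the text.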

Reprocessing historical manometer data
Manometer data were obtained using software designed to read and store temperature and pressure data during a manometer run, and calculate the CO2 mole fraction. Prior to each manometric episode, temperature and pressure were referenced to national standards (and to the SI) through calibration at accredited laboratories. Volume ratio experiments were performed prior to and during each episode (e.g. SM, Fig. S6). Pressure and temperature calibration coefficients needed to convert measured variables to P and T, as well as the volume ratio, were hard-coded in this software. During the final P and T measurement, XCO2 was calculated periodically as the gas in the small volume warmed and equilibrated to oven temperature. An example of CO2 mole fraction calculated as a function of time is shown in Fig. 2.
Mole fractions of CO2 were previously determined as the maximum XCO2 calculated during the final stage (Fig. 2), adjusted for XN2O. There are two minor issues associated with this method that we correct with the implementation of the X2019 scale. First, we recently discovered an error in the software used to calculate XCO2. The second virial coefficient for CO2 (βCO2) (Sengers et al., 1971) was calculated for a temperature that was 10 K higher than the actual TCO2 (320 K instead of 310 K) due to an interpolation error. Temperature was recorded correctly, but βCO2 was calculated incorrectly. Consequently, XCO2 was underestimated by ~0.03 ppm at 400 ppm. Second, we recognize that the pressure in the small volume decreases slowly with time after the temperature of the small volume stabilizes (Fig. 2). For the 380 ppm sample shown in Fig. 2, the rate of change in pressure is -1 × 10⁻⁵ kPa s⁻¹, or -0.036 kPa h⁻¹. We suspect that CO2 is absorbed by Viton O-rings and possibly adsorbs to surfaces of the small volume (Fig. 3). Separate tests conducted with pure CO2 and Viton O-rings in a test tube revealed CO2 loss rates comparable to what is observed in the manometer small volume (unpublished data). Essential to the development of the X2019 scale was revisiting previous data and making corrections for the incorrect βCO2 and the loss of CO2 that occurred prior to the maximum measured XCO2.

The results from all manometric determinations are stored in a database. Historical manometer results were adjusted by applying two correction terms, Xvirial_correction and Xloss_correction, described in the following sections.

Figure 2: [...] of time (upper panel), and temperature measured at three locations within the oven (lower panel). Historical manometric records are time-stamped with "measurement cycle", which is shown on the upper x-axis. Here, each measurement cycle corresponds to ~30 s. Temperature probe T3 is adjacent to the small volume and is cooled to liquid nitrogen temperature during extraction.

Correcting for βCO2
For Xvirial_correction, we first updated the data reduction software to calculate βCO2 by correctly interpolating between the same βCO2 coefficients used to define X2007 (-112.8 cm³ mol⁻¹ at 310 K, and -104.8 cm³ mol⁻¹ at 320 K (Zhao et al., 1997)). We then used the correct βCO2 to calculate XCO2 from pressure and temperature recorded in manometric data files, and compared the result to XCO2 calculated using the original (incorrect) values for βCO2. Fig. 4 shows differences between the updated results (βCO2 correct) and the original XCO2 (βCO2 incorrect). There are three representative periods that correspond to three nominally different volume ratios. The data show compact relationships with CO2 mole fraction, as expected, since the mole fraction determined is largely a function of the pressure of CO2 collected in the small volume. During each manometric determination, several temperatures were recorded. Since there are periods for which we do not know specifically which temperature records were used or the exact volume ratio used in the original calculation, we used three polynomial functions to estimate Xvirial_correction corresponding to three time periods: 1996-1999, 1999-2003, and 2004. The uncertainty associated with the estimated Xvirial_correction is less than 0.01 ppm.
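As an illustration of the interpolation step, the sketch below interpolates βCO2 linearly between the two coefficients quoted above. Linear interpolation and the function names are assumptions made for this example; the text states only that the software interpolates between the two tabulated values.

```python
# beta_CO2 values quoted in the text (Zhao et al., 1997), in cm^3 mol^-1.
BETA_TABLE = ((310.0, -112.8), (320.0, -104.8))


def beta_co2(temperature_k):
    """Linearly interpolate beta_CO2 (cm^3 mol^-1) between 310 K and 320 K."""
    (t_lo, b_lo), (t_hi, b_hi) = BETA_TABLE
    if not t_lo <= temperature_k <= t_hi:
        raise ValueError("temperature outside the tabulated 310-320 K range")
    fraction = (temperature_k - t_lo) / (t_hi - t_lo)
    return b_lo + fraction * (b_hi - b_lo)


def beta_offset_from_temperature_error(true_t, error_k=10.0):
    """Difference in beta caused by evaluating at true_t + error_k."""
    return beta_co2(true_t + error_k) - beta_co2(true_t)


print(beta_co2(315.0))
print(beta_offset_from_temperature_error(310.0))
```

Evaluating βCO2 10 K too high makes the coefficient less negative, which raises the computed compressibility of CO2 and therefore lowers XCO2, consistent with the direction of the bias described in the text.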

Correcting for CO2 loss
To correct for CO2 loss, we assume that loss of CO2 to materials in the small volume begins soon after CO2 sublimes and occurs at a constant rate. By extending the manometer run time out several hours, we can see that the loss rate decreases with time (see SM, section 1.2). However, the loss rate is sufficiently linear over the short term that a linear correction is a reasonable approach.
We derive loss rates by fitting a linear function to the calculated XCO2, beginning ~3 minutes after the maximum CO2 and fitting 10-12 minutes of data (Fig. 5). This period corresponds to near-constant temperature and a steady decrease in pressure. After obtaining a loss rate from each data file, we correct the existing CO2 record using the loss rate and elapsed time (expressed in terms of a measurement cycle, each approx. 30 s in duration).
In this correction, a is the slope calculated from a record of CO2 vs. time as in Fig. 5 (ppm time⁻¹), t is the time corresponding to the CO2 maximum, and t0 is the time at the start of the record, where we expect CO2 loss to begin. Since a < 0, Xloss_correction is positive.
As an example, the maximum CO2 shown in Fig. 5 occurs at cycle 35 in the data file, ~1050 seconds after the liquid nitrogen was removed from the small volume. The slope (a) is -0.0074 ± 0.0002 ppm min⁻¹. If the loss of CO2 begins at time t0 = 0, the correction required would be 0.13 ppm. After the liquid nitrogen is removed from the small volume, we estimate that the purified CO2 reaches a temperature of 273 K within 1 minute, and 300 K within 3 minutes. Adsorption of CO2 probably begins about 1 minute after the liquid nitrogen is removed. For many data records, we know that there was a software delay of three minutes between the time the small volume was sealed off (and the liquid N2 removed) and the first data record. While this cannot be confirmed for all records, we include a two-minute delay: t_max_CO2 + 2 minutes (t0 = 2 min). An error of 2 minutes in elapsed time would correspond to 0.015 ppm for a typical 400 ppm sample. Using an elapsed time of t_max_CO2 + 2 min (17.5 + 2 min, or 39 measurement cycles) in the above example, the loss correction is 0.14 ppm.

[...] (1998 and 2004), and loss rates were determined from data after the second maximum. Forty-eight manometric runs were processed this way (8.9% of the total).
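The worked example can be reproduced with a short sketch. It assumes the loss correction is the negative of the fitted slope multiplied by the elapsed time (time of the CO2 maximum plus the assumed delay), and that the two corrections are added to the originally reported maximum; both forms are inferred from the description in the text rather than copied from the published equations. The synthetic record and function names are hypothetical.

```python
def fit_slope(times_min, x_ppm):
    """Ordinary least-squares slope of x_ppm versus times_min (ppm per minute)."""
    n = len(times_min)
    t_mean = sum(times_min) / n
    x_mean = sum(x_ppm) / n
    s_tt = sum((t - t_mean) ** 2 for t in times_min)
    s_tx = sum((t - t_mean) * (x - x_mean) for t, x in zip(times_min, x_ppm))
    return s_tx / s_tt


def loss_correction(slope_ppm_per_min, t_max_min, delay_min=2.0):
    """Loss correction (ppm): -slope * elapsed time up to the CO2 maximum."""
    elapsed_min = t_max_min + delay_min
    return -slope_ppm_per_min * elapsed_min


def corrected_mole_fraction(x_max_ppm, virial_correction_ppm, loss_correction_ppm):
    """Apply both corrections additively (assumed form of the adjustment)."""
    return x_max_ppm + virial_correction_ppm + loss_correction_ppm


# Synthetic record: 12 minutes of data starting 3 minutes after a maximum
# at 17.5 min, declining at the slope quoted in the text.
times = [20.5 + 0.5 * i for i in range(24)]
record = [380.0 - 0.0074 * t for t in times]
slope = fit_slope(times, record)

print(f"{loss_correction(slope, 17.5, delay_min=0.0):.2f} ppm")  # no delay
print(f"{loss_correction(slope, 17.5, delay_min=2.0):.2f} ppm")  # two-minute delay
```

The two printed values correspond to the 0.13 ppm and 0.14 ppm corrections quoted in the example.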

Summary of manometer results
The X2007 scale was derived by averaging results from seven manometric episodes (1996, 1998, 2000, 2001, 2003, 2004, 2006) (Table 1). In developing X2019, we examined data files back to 1996, and applied the corrections discussed previously (Fig. 7). There is not a 1:1 correspondence between original and reprocessed results. In a few cases, the original data appeared abnormal and were flagged when developing X2019. In other cases, we were either unable to find raw files corresponding to results in the database, or the records were not sufficient to calculate a CO2 loss rate (data not stored for sufficient time). In all, we were able to recover and apply corrections to 93% of the original data records.

Higher variability in 1998 could be related to higher water vapor in samples extracted during that period. Manometric records from 1998 often did not show the characteristic single CO2 maximum. Instead, those records show an initial "CO2" peak, followed by a short decline, and then a secondary peak followed by the normal decline (see SM, Fig. S1). This secondary peak could be related to H2O desorbing from surfaces in the small volume. We have seen this pattern recently when the manometer has not been run for several weeks and tends to show characteristics of residual moisture (longer pump-down times and higher than normal XCO2 results). For most of the records from 1998 and some records from 2004, Xloss_correction was determined from the time associated with the first peak in CO2, and the loss rate determined after the second peak in CO2. We used the later loss rates because it appears that the initial slopes (loss rates) are impacted by evolution of H2O, and the loss rates calculated after the second peak in CO2 are more consistent with loss rates determined during other episodes. Although this introduces additional uncertainty, results from 1998 are generally consistent with those from other years (Fig. 7).
Comparing 1998 results to other years, it would appear any potential impact of additional water vapor as an impurity is less than 0.1 ppm. Further, if we used the time associated with the second peak instead of that associated with the first peak, manometer results from 1998 and 2004 would be slightly greater, but this would translate into an increase of only 0.01 ppm in the average manometric values for primary standards in the 250-520 ppm range.
It is also important to note that in May of 2014 we damaged the small volume during routine maintenance. New glassware and a new air-actuated valve (Glass Expansion, Pocasset, MA) were installed in August 2014. This meant that the volume ratio, which had been essentially constant since 2004, needed to be re-established. After establishing traceability for temperature and pressure, we performed a number of volume ratio experiments and obtained a new volume ratio that was 2% larger than the previous one. Results from the 2015 episode, with the new small volume and volume ratio, agree well with those from previous episodes. The mean difference between the 2012 episode and 2015 episode, for all primary standards in the 250-520 ppm range, is only 0.03 ppm.

Table 1: Primary standard CO2 mole fractions (ppm) determined using the NOAA manometer. A lower case "x" is used here to indicate that these are mean values determined from manometric measurement, and have not yet been harmonized into a calibration scale. For x2007 we report the average manometric results from seven episodes (as the mean of the episode averages). For x2019 we averaged all valid recoverable data from 1996-2017 after correcting for βCO2 and CO2 loss. Note that primary ND17440 was put into service in 2010 to replace a standard that was thought to be drifting upward. ND17440 was not part of the original X2007 scale. CC71605 includes data from 2020.


Drift assessment
The mole fraction of CO2 (in air) in aluminum cylinders can increase with use (Langenfelds et al., 2005; Leuenberger et al., 2015; Schibig et al., 2018). Our experience suggests that XCO2 is relatively stable over the useful life of a cylinder when used sparingly at flow rates of ~0.3 L min⁻¹ or lower, but can increase as the pressure drops below about 15% of the fill pressure.
However, it is worth noting that detecting small drift rates over decades is very difficult because it requires a stable reference with comparably low uncertainties. At the end of the 2015 measurement episode, all 15 primary standards contained at least one third of the original gas, with pressures of at least 4.4 MPa (600 psi), and most contained more than 6 MPa.

Drift in the X2007 scale was assessed through repeated manometric measurement. Only AL47-103 (no longer in use) was found to be drifting. With the update to X2019, we applied corrections to the primary standards that were a function of both mole fraction and time. We therefore need to reassess the possibility of drift in the primary standards. We performed a weighted least squares linear fit to the mean mole fraction determined during each episode. Uncertainties were estimated by combining the manometer repeatability during each episode (σi/√Ni), where σi is the standard deviation of results within episode "i", and Ni is the number of measurements during that episode, with the relative uncertainty in the volume ratio and the average uncertainty associated with Xvirial_correction and Xloss_correction for each episode (0.02-0.04 ppm). We lack sufficient information to fully evaluate the uncertainty in the volume ratio dating back to the earliest periods, so we assume that our current uncertainty assessment is valid for the entire record. We consider each episode independent since traceability to national standards for temperature and pressure was established prior to each episode, and do not include uncertainty components common to all episodes (which include components of the volume ratio uncertainty related to temperature gradients in the oven, and differences in volume ratio obtained using different gases (N2, air, and argon)). We estimate the total uncertainty in the volume ratio to be 0.014% (see SM, section 2.3.4).
Excluding components common to all episodes, we use 0.013% for uncertainty on the volume ratio in the drift assessment.

Drift rates, in ppm per decade, are summarized in Fig. 8 (see also Table S1). For primary standards with XCO2 > 530 ppm, the manometric histories are too short to adequately assess drift. For those with XCO2 in the range 250-520 ppm, all but three show positive drift, although none is significant at the 95% confidence level. While some calculated drift rates are on the order of 0.05 ppm decade⁻¹, we are unable to detect drift rates less than ~0.08 ppm decade⁻¹ owing mostly to the uncertainties associated with the volume ratio and reproducibility of the manometric measurements. The average drift rate among standards in the 350-450 ppm range is 0.02 ppm decade⁻¹, which would have only a minor impact on the heart of the X2019 scale if drift rates shown in Fig. 8 were incorporated, except when making comparisons across decades. Thus, while relative drift among cylinders can be observed over short time periods, as in Leuenberger et al. (2015) and Schibig et al. (2018), detecting long-term drift on an absolute basis is difficult. Still, drift in cylinders is typically small compared to the growth rate of atmospheric CO2 (~2 ppm yr⁻¹).
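The drift assessment can be sketched as a weighted least-squares line fit to episode means. The example below assumes that the three uncertainty contributions named in the text are combined in quadrature and that weights are the inverse squared uncertainties; it also uses a simple normal-approximation test for significance. The episode means are hypothetical values, not data from this work.

```python
import math


def episode_uncertainty(sd_ppm, n_runs, x_ppm,
                        rel_u_volume_ratio=0.00013, u_corrections_ppm=0.03):
    """Combine repeatability, volume-ratio and correction terms in quadrature."""
    repeatability = sd_ppm / math.sqrt(n_runs)
    u_volume_ratio = rel_u_volume_ratio * x_ppm  # 0.013% of the mole fraction
    return math.sqrt(repeatability ** 2 + u_volume_ratio ** 2
                     + u_corrections_ppm ** 2)


def weighted_linear_fit(x, y, sigma):
    """Weighted least-squares line; returns (slope, intercept, slope_sigma)."""
    weights = [1.0 / s ** 2 for s in sigma]
    w_sum = sum(weights)
    x_bar = sum(w * xi for w, xi in zip(weights, x)) / w_sum
    y_bar = sum(w * yi for w, yi in zip(weights, y)) / w_sum
    # Centering x on its weighted mean avoids cancellation for calendar years.
    s_xx = sum(w * (xi - x_bar) ** 2 for w, xi in zip(weights, x))
    s_xy = sum(w * (xi - x_bar) * (yi - y_bar)
               for w, xi, yi in zip(weights, x, y))
    slope = s_xy / s_xx
    return slope, y_bar - slope * x_bar, math.sqrt(1.0 / s_xx)


# Hypothetical episode means (ppm) for one cylinder near 400 ppm.
years = [1996, 1998, 2000, 2001, 2003, 2004, 2006, 2012, 2015]
means = [400.02, 400.10, 399.98, 400.05, 400.01, 400.07, 400.03, 400.08, 400.06]
sigmas = [episode_uncertainty(0.10, 5, m) for m in means]

slope, intercept, slope_sigma = weighted_linear_fit(years, means, sigmas)
significant = abs(slope) > 1.96 * slope_sigma  # normal approximation
print(f"drift = {10 * slope:.3f} +/- {10 * slope_sigma:.3f} ppm per decade, "
      f"significant at ~95%: {significant}")
```

With only a handful of episodes, a Student's t critical value would be a more conservative choice than 1.96; the normal approximation is used here only to keep the sketch short.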

Defining the X2019 WMO CO2 mole fraction scale
Primary standards were analyzed using the laser-spectroscopy system described in Tans et al. (2017). These data were then used to harmonize the standards and define a scale. Each primary standard was analyzed six times relative to a ~400 ppm reference cylinder. On this analysis system we treat the three major isotopologues of CO2 separately to eliminate subtle biases due to variations in isotopic composition among the standards and between samples and reference cylinders. We harmonized the primary standards using only the major ( [...]

The uncertainties on flask measurements at INSTAAR listed above are determined for ambient atmospheric samples (~-7.5 to -9 ‰). Several of the primary standards are depleted relative to the atmosphere (see Table 2 [...]

[...] Table 1) or the square of the inverse standard error. All four variations give essentially the same result (within 0.01 ppm near 400 ppm). Therefore, the X2019 scale is defined from an orthogonal distance linear regression using the average manometric result and standard deviation (using 1/s² as weighting factors) for each cylinder ("Avg. (x2019)" and "s.d. (x2019)" in Table 1). Fig. 9 shows the residuals from six analysis periods over three years associated with harmonization. There is good agreement among the different analysis periods, indicating that variability seen in the residuals relates to the manometer average values.

For each primary standard, we corrected the CO2 mole fraction by the mean residual from the linear fit (Table 2). The X2019 scale is defined as the average residual-corrected mole fraction, determined over six analysis periods, for each primary standard. In this way, the scale is defined over a range, with better consistency and smaller uncertainty compared to individual primary standards. For X2019, we include the 15 primary standards used to define the X2007 scale, plus four additional primary standards with XCO2 > 530 ppm.
Additional primary standards in the upper range help to constrain the fit and reduce end-effects. Many residuals are less than 0.05 ppm, but the newer standards in the upper CO2 range show larger residuals.
Some of this may be due to their short measurement history compared to standards in the 250-520 ppm range. Finally, while harmonization is not strictly necessary if all primary standards are to be analyzed at the same time when propagating the scale, it provides some insurance against the potential loss of a primary standard. By assigning mole fractions consistent with the best-fit response, loss of one or two standards from the suite of 19, especially in the middle of the XCO2 range, would not be catastrophic.
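The residual-correction step can be illustrated with a short sketch. It assumes that each analysis period yields one residual per cylinder, defined here as the manometric average minus the value predicted by the fitted line, and that the assigned value is the manometric average minus the mean residual. The sign convention, cylinder labels, and numbers are hypothetical; the regression that produces the residuals is not reproduced here.

```python
def harmonized_values(manometric_ppm, residuals_by_period):
    """Assign each cylinder its manometric mean minus its mean fit residual.

    manometric_ppm: {cylinder_id: average manometric mole fraction (ppm)}
    residuals_by_period: list of {cylinder_id: residual (ppm)}, one per period,
        with residual = manometric average - fitted value (assumed convention).
    """
    harmonized = {}
    for cylinder, x_manometric in manometric_ppm.items():
        residuals = [period[cylinder] for period in residuals_by_period
                     if cylinder in period]
        mean_residual = sum(residuals) / len(residuals)
        harmonized[cylinder] = x_manometric - mean_residual
    return harmonized


# Hypothetical cylinders and residuals from three analysis periods.
manometric = {"CYL-A": 380.12, "CYL-B": 405.47}
periods = [
    {"CYL-A": 0.03, "CYL-B": -0.02},
    {"CYL-A": 0.05, "CYL-B": -0.04},
    {"CYL-A": 0.04, "CYL-B": -0.03},
]
for cylinder, value in harmonized_values(manometric, periods).items():
    print(f"{cylinder}: {value:.2f} ppm")
```

Under this convention, a cylinder whose manometric average sits above the fitted line is assigned a slightly lower value, so that the assigned values are mutually consistent with the measured analyzer response.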

Independent Assessment
Revision of the X2007 scale relies on the assumption that the loss of CO2 to Viton O-rings in the small volume of the manometer can be adequately addressed by linear extrapolation (Fig. 5). We lack knowledge of CO2 losses that occur before representative pressure and temperature measurements become available (while the small volume is warming).
Experiments in which pure CO2 was loaded into the small volume by overpressure (not transfer by cryogenic extraction) suggest that the loss process is initially non-linear and approaches a linear rate after about 10 minutes. If this is true, then the correction we apply is too small (by ~0.2 ppm) (see SM, section 1.2). However, these experiments were not carried out under the same conditions used to extract CO2 from air, so we cannot be sure that they are representative. Therefore, we explored an independent method to provide insight into potential bias in the X2007 scale and our attempt to correct for that bias.

Comparison to in-house, gravimetrically-prepared standards
We prepared CO2 primary standards using a gravimetric method (Hall et al., 2019). Briefly, known masses of highly pure CO2 were introduced into 29.5-L aluminum cylinders and diluted with known masses of CO2-free air. Uncertainties were reduced by preparing standards in one step and by accounting for CO2 likely to be adsorbed to cylinder walls at high pressure (Schibig et al., 2018). These standards were analyzed by laser spectroscopy and assigned XCO2 values on the X2019 scale (Table 3). The X2019 assignments are consistent with the gravimetrically-prepared values, with an average difference of 0.03 ppm, and an average ratio of 1.00008 (Table 3). If the gravimetric standards were used to define a calibration scale, it would, on average, be 0.045% greater than the X2007 scale (avg ratio 1.00045, std. dev. 0.00017) (Hall et al., 2019). This is very close to the average ratio of 1.00040 derived by correcting historical manometric data (Table 2).

[...] would also likely have agreed with the reference values within uncertainties at 380 ppm and 480 ppm during K120a, better agreement was achieved with X2017p, and hence also with X2019.

Uncertainty Analysis
Here, we estimate the total uncertainty associated with a CO2 determination on the X2019 scale. We extend the work of Zhao and Tans (2006), following accepted methods for uncertainty propagation (JCGM, 2008). To arrive at an uncertainty estimate, we use equation (4), which is a modified version of equation (1), and propagate uncertainties over a range of CO2 mole fractions. We include the terms Xvirial_correction and Xloss_correction since the X2019 scale was derived based on these corrections.
Future manometric analysis will not include the term Xvirial_correction since βCO2 is now correctly determined. We also include the term XH2O and its estimated uncertainty even though we do not correct for water vapor in the final sample (XH2O = 0). We establish traceability of manometric measurements to national temperature and pressure standards. Prior to a measurement episode, three platinum resistance thermometers, one thermistor, and a piston gauge are typically sent to an accredited laboratory for calibration (National Voluntary Laboratory Accreditation Program, NVLAP). We estimate the uncertainties associated with measurement of temperature and pressure from uncertainties reported by the calibration laboratories, repeatability, and experience. Uncertainty components are described in the SM, and are similar to those estimated by Zhao and Tans (2006) except for the uncertainty associated with the volume ratio. We calculate a larger uncertainty for Φ, in part, because we observed small temperature gradients in the oven, and hence our ability to measure the gas temperature at each stage of the expansion sequence with existing equipment is probably less certain than previously estimated (Zhao and Tans, 2006).

Purity Assessment

The primary function of the separation steps is to remove H2O from the extracted CO2. The purified CO2 introduced into the small volume contains N2O and trace amounts of other gases. Considering the major constituents in air and their boiling points under the conditions at which CO2 is trapped, Zhao et al. (1997) concluded that a correction for N2O is sufficient to account for impurities in the purified CO2. We go one step further here by verifying, through analysis, that the purified CO2 is, indeed, highly pure and that additional purity corrections are not needed.

We used the manometer to extract CO2 from a 380 ppm air sample.

At the end of a normal manometer run, we transferred the purified CO2, first to a stainless steel tube (5 mL volume with stainless steel metal bellows valve) and then to a 2.3 L stainless steel flask with a stainless steel metal bellows valve. We then added ~0.24 MPa (35 psia) UHP-grade nitrogen to create a mixture with XCO2 at approximately 380 ppm, the same as that of the original air. We analyzed this mixture by GC-MS, GC-FID (Dlugokencky et al., 2005) and GC-ECD (Hall et al., 2011). We confirmed that gases likely to be trapped in the extraction step (nitrous oxide, ethane, propane, some chlorofluorocarbons) are present in the purified CO2 sample. The combined mole fraction of all gases measured in the flask, excluding CO2, N2O, and H2O, was 6 parts per billion (ppb). Of this 6 ppb, we found ~3.4 ppb Xe, 0.8 ppb ethane, 0.5 ppb CCl2F2 (CFC-12), 0.2 ppb CFCl3 (CFC-11) and trace amounts of other halogenated gases. Had Xe been quantitatively trapped and retained, we would have found ~87 ppb, which would then require a correction. CH4 was not detected in the purified CO2 sample, confirming that CH4 is not trapped during the extraction process.

We did not attempt to measure H2O, krypton, argon, and oxygen since these would either not be trapped at the pressure of the traps (~4 kPa), or would likely be present at very low levels and would be difficult to measure. The water vapor content of our primary standards is <2 ppm, and after two cryogenic separation steps we expect H2O to be <0.03 ppm. While we do not make a correction for water vapor that might remain in the final sample, we do include an estimate in the uncertainty budget. With traps 1 and 2 at -67 °C, we estimate that 98% of the water vapor in the sample would be removed in trap 1, and 50% of the remaining water vapor would be removed in trap 2. This would correspond to 0.02 ppm in the measured CO2. Because the trap temperatures vary from run to run (-65 °C to -70 °C), we include an uncertainty of 0.03 ppm in the uncertainty propagation.
From equation (5), the expanded uncertainty at 400 ppm is 0.17 ppm, or 0.043%. This estimate is only slightly larger than that estimated by Zhao and Tans (2006) (2 × 0.069 = 0.14 ppm). We acknowledge that the uncertainty could be larger, owing to non-linear loss processes in the early stages of the final pressure and temperature measurements. However, the magnitude of this potential bias could not be quantified experimentally under conditions consistent with manometric experiments.

We include in our uncertainty estimate the scale transfer uncertainty, which is particularly relevant for users comparing data traceable to the same scale. From repeated measurements of multiple cylinders we estimate the scale transfer uncertainty based on laser spectroscopy to be 0.01 ppm (1-sigma), similar to what was reported by Tans et al. (2017). For cylinders value-assigned by NDIR (~1995 to 2016) we estimate the scale transfer uncertainty at 0.03 ppm (1-sigma) (see SM, section 2.5).
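The arithmetic behind these figures can be sketched as follows. The example assumes a coverage factor of k = 2 for expanded uncertainties and root-sum-of-squares combination of independent standard uncertainties (JCGM, 2008). The last calculation, the uncertainty of a difference between two cylinders that each carry an independent scale transfer uncertainty, is an illustration added here and is not a result from the text.

```python
import math


def combine_in_quadrature(*standard_uncertainties):
    """Root-sum-of-squares of independent standard uncertainties."""
    return math.sqrt(sum(u * u for u in standard_uncertainties))


def expanded(standard_uncertainty, k=2.0):
    """Expanded uncertainty for coverage factor k."""
    return k * standard_uncertainty


def relative_percent(u_ppm, x_ppm):
    """Uncertainty expressed as a percentage of the mole fraction."""
    return 100.0 * u_ppm / x_ppm


# Zhao and Tans (2006) standard uncertainty, expanded with k = 2.
print(f"{expanded(0.069):.3f} ppm")
# Expanded uncertainty of 0.17 ppm at 400 ppm, in relative terms.
print(f"{relative_percent(0.17, 400.0):.4f} %")
# Illustrative: difference between two cylinders, laser vs. NDIR transfer.
print(f"{combine_in_quadrature(0.01, 0.01):.3f} ppm, "
      f"{combine_in_quadrature(0.03, 0.03):.3f} ppm")
```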

Scale Implementation
As discussed above, the implementation of the scale involves the harmonization of primary standard manometric results through analysis, with assigned mole fractions derived using a linear response function based on spectroscopic analysis. These assigned mole fractions are then used to define the X2019 scale, and transfer that scale to lower-order standards.

In the hierarchy of value-assignment, standards used to support NOAA atmospheric measurements and those distributed by the CCL are known as "tertiary standards". Recalculating tertiary standard values on the X2019 scale involves three steps: 1) updating primary standards to X2019, 2) re-assigning secondary standards based on primary-secondary comparisons (note that some secondaries were re-assigned based on additional data not available upon initial assignment), and 3) re-assigning tertiary standards based on updated daily response functions, relative to secondaries. Here we present the impact of the X2019 scale update on tertiary value assignments dating back to 1995. In a subsequent section we present the implications of the scale update on NOAA atmospheric measurements.
Tertiary standards are value-assigned based on analysis vs secondary standards (Zhao and Tans, 2006). From 1995 to October 2016, value assignment was performed by NDIR (Siemens Ultramat-3, -6F; LiCor Li-6251, Li-6252, or Li-7000), and from November 2016 by laser spectroscopy (Picarro G2301; Los Gatos Research CCIA-46-EP; Aerodyne Research Inc. QC-TILDAS-CS). There was an approximately 12-month overlap period during which tertiary standards were run on both systems. The NDIR response to CO2 is typically non-linear. For analysis on a given day, a quadratic response function was determined based on four secondary standards which were previously value-assigned based on similar mole fraction dependent subsets of the suite of primary standards. Secondary standards were selected such that XCO2 spanned the range of tertiary standards to be analyzed.
For analysis by laser-spectroscopy, 16 secondary standards over the range 250-800 ppm (prior to April 2020, 14 secondary standards covering 250-600 ppm) are used to define response curves for the three major isotopologues of CO2 (¹⁶O¹²C¹⁶O, ¹⁶O¹³C¹⁶O, and ¹⁶O¹²C¹⁸O). The mole fraction of each of the three major isotopologues is measured and then converted into total CO2, δ¹³C, and δ¹⁸O, accounting for the unmeasured minor isotopologues as described in Tans et al. (2017).
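The daily NDIR calibration described above, a quadratic response function fitted to four secondary standards, can be sketched as follows. This is a minimal illustration under stated assumptions, not the CCL's processing code: the function `fit_quadratic`, the instrument responses, and the assigned values are all hypothetical, and the synthetic data are generated from an arbitrary quadratic so the fit can be checked.

```python
def fit_quadratic(x, y):
    """Least-squares fit of y = a + b*u + c*u**2, with u = x - mean(x).

    Returns a function mapping an instrument response to a mole fraction.
    """
    x0 = sum(x) / len(x)
    u = [xi - x0 for xi in x]  # centering improves numerical conditioning
    # Normal equations for the basis [1, u, u^2], stored as an augmented matrix
    s = [sum(ui ** k for ui in u) for k in range(5)]
    t = [sum(yi * ui ** k for ui, yi in zip(u, y)) for k in range(3)]
    m = [[s[i + j] for j in range(3)] + [t[i]] for i in range(3)]
    # Gaussian elimination with partial pivoting
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(col + 1, 3):
            f = m[row][col] / m[col][col]
            for j in range(col, 4):
                m[row][j] -= f * m[col][j]
    # Back substitution
    p = [0.0, 0.0, 0.0]
    for i in (2, 1, 0):
        p[i] = (m[i][3] - sum(m[i][j] * p[j] for j in range(i + 1, 3))) / m[i][i]
    a, b, c = p
    return lambda resp: a + b * (resp - x0) + c * (resp - x0) ** 2

# Synthetic example: four secondary standards (hypothetical responses and values)
responses = [1.00, 1.10, 1.20, 1.30]  # instrument response, arbitrary units
assigned_ppm = [200.0 + 150.0 * r + 10.0 * r ** 2 for r in responses]

to_ppm = fit_quadratic(responses, assigned_ppm)
tertiary_ppm = to_ppm(1.15)  # response of a tertiary standard bracketed by the secondaries
print(f"tertiary standard: {tertiary_ppm:.3f} ppm")
```

Because only four points constrain three coefficients, an error in any one secondary assignment is tracked closely by the fit, which is consistent with the sensitivity to individual secondary standards discussed in the text.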
Upon revision to X2019, all secondary standards used as far back as 1979 were re-evaluated. Secondary standards were compared to primary standards multiple times during their use. A statistical test and expert judgement were employed to evaluate drift in secondary standards. The statistical test was occasionally overruled in cases where we suspect a step change due to change in instrumentation was the underlying driver rather than drift in the secondary standard. If drift was suspected, a weighted linear or polynomial function was fit to the data (weighted by instrument reproducibility, see SM section 2.5) and a time-dependent mole fraction used. Note that it is easier to detect drift in secondary standards compared to primary standards because we evaluate secondary standards relative to the scale defined by many standards. Thus, the limiting factor is measurement reproducibility and not the absolute uncertainty of the scale. During this re-evaluation, the drift status of some secondary standards was updated, with more data being available compared to when drift rates were first assigned. Thus, some standards that had previously assigned time-dependent values are now held constant, and vice-versa. Generally, the X2019 scale is more consistent across mole fraction and time, and therefore the new evaluations for secondary standard drift are considered more reliable. After updating secondary standard value assignments to X2019, XCO2 for all tertiary standards dating to 1979 were re-assigned from raw data. We focus here mainly on the period from 1995 onward because our role as a WMO/GAW CCL began in 1995. Fig. 11 shows differences between tertiary standard assignments on X2019 and X2007, from 1995 through February 2020.
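The weighted linear drift fit mentioned above can be illustrated with a short sketch. It assumes weights of 1/σ², where σ is the instrument reproducibility of each comparison; the function name, the comparison dates, and the values are hypothetical, and the statistical test used to decide whether drift is present is not reproduced here.

```python
def weighted_linear_fit(t, x, sigma):
    """Weighted least-squares line x = x_ref + rate * (t - t_ref).

    Weights are 1/sigma**2. Returns (t_ref, x_ref, rate).
    """
    w = [1.0 / s ** 2 for s in sigma]
    w_sum = sum(w)
    # Weighted means serve as the reference point of the fitted line
    t_ref = sum(wi * ti for wi, ti in zip(w, t)) / w_sum
    x_ref = sum(wi * xi for wi, xi in zip(w, x)) / w_sum
    num = sum(wi * (ti - t_ref) * (xi - x_ref) for wi, ti, xi in zip(w, t, x))
    den = sum(wi * (ti - t_ref) ** 2 for wi, ti in zip(w, t))
    return t_ref, x_ref, num / den

# Synthetic comparison history of one secondary standard against the primaries
dates = [2014.0, 2015.0, 2016.0, 2017.0, 2018.0]     # decimal years
values = [390.00, 390.05, 390.10, 390.15, 390.20]    # ppm, hypothetical
sigmas = [0.03, 0.03, 0.03, 0.01, 0.01]              # ppm, NDIR then laser spectroscopy

t_ref, x_ref, rate = weighted_linear_fit(dates, values, sigmas)

def assigned_value(t):
    """Time-dependent mole fraction used in place of a constant assignment."""
    return x_ref + rate * (t - t_ref)

print(f"drift rate: {rate:.3f} ppm/yr")
print(f"assigned value at 2016.5: {assigned_value(2016.5):.3f} ppm")
```

Weighting by reproducibility lets the more precise laser-spectroscopic comparisons dominate the fit without discarding the earlier NDIR comparisons.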
The overall scale difference is clearly a function of mole fraction, with the difference approximately 0.18 ppm at 400 ppm. It is immediately obvious that differences are not a perfect linear function of mole fraction. Differences that are consistent over several months can be seen as coherent traces in Fig. 11. The coherent differences are due to secondaries being exhausted and replaced by others at slightly different mole fractions. Even though tertiary standards were bracketed by secondaries during analysis, limitations in the ability to value-assign any particular secondary standard, coupled with the limitations associated with fitting a quadratic response function to three or four secondaries, contribute to variability. Even so, most of the year-to-year variability at a particular mole fraction is less than 0.02 ppm (1-sigma). Outliers, such as those corresponding to analysis performed in the mid-1990s above 400 ppm (red and purple symbols), are the result of extrapolation beyond the range of the secondaries. Prior to 1997, the highest secondary standard in regular use was 390 ppm.
Figure 11: Differences between X2019 and X2007 assignments to tertiary standards from 1995 to 2020. Each data point represents one analysis record (over 25,000 records shown), and a full calibration of a tertiary standard involves multiple analysis records.

The more prominent variations evident in Fig. 11 stem from re-assignment of primary and secondary standards, the non-linear response of NDIR instruments, and the nature of the value-assignment process. Scale differences appear significantly larger during 2008-2009 over the 360-390 ppm range (light green symbols). These value assignments, which involved around 600 analysis records (less than 3% of the total number), are inconsistent with most other data due to a revision of XCO2 assigned to a particular secondary standard (CA01982) in use at the time. This particular secondary was assigned a value of 391.87 ppm on the X2007 scale in 2008 when compared to primary standards. However, after incorporating subsequent analysis of this cylinder against primary standards, it was evident that the cylinder was drifting upward rapidly. This secondary standard drifted ~0.2 ppm in two years (not common), but that drift was not accounted for in the X2007 value assignment, which caused the value used for data reduction to be too low. The drift is accounted for in the X2019 value assignment, leading to larger X2019-X2007 differences for tertiary standards measured against this secondary standard.
The more recent data based on analysis by laser spectroscopy are represented as dark purple and maroon colors in Fig. 11. These show a more linear relationship without the wavy structure, as expected for an instrument with a linear response calibrated over the entire scale range. The fact that the laser spectroscopic results do not agree with the NDIR data in the upper XCO2 range (> 420 ppm) is due to the use of secondary standards on this system that were not well-characterized. Value-assignments for these secondary standards were determined on the NDIR system and thus incorporate the biases associated with that system on X2007. They were not well characterized when they went into service, especially at the upper end of the range where we effectively expanded the calibration range in anticipation of the X2019 revision. We now have more information on these secondary standards, including analysis vs the primary standards on the laser spectroscopic system, and can better define them on X2019.
It is important to note that differences in value-assignment between the NDIR and laser spectroscopic system (Fig. 11) are only present on the X2007 scale. The X2019 revision resolves the underlying cause of the offsets. Fig. 12 shows the results from the ~12 month overlap during which tertiary standards were analyzed on both systems. There is a clear mole fraction dependence to the offset on the X2007 scale. Tans et al. (2017) attributed this to the assigned values of the primary standards coupled with the method used for scale transfer using the NDIR but were not able to rule out other potential issues such as gas handling on the NDIR based system. The X2007 primary standard assignments (Table 2), based on harmonization by NDIR analysis, were not as robust as we thought. The X2007 scale was based on relatively few NDIR analysis runs, and as such the residuals were not as well defined as they are for X2019 (Fig. 9). By using small subsets of standards to calibrate the NDIR, the data reduction of the NDIR system tracked errors in the assigned values rather than averaging those errors over the entire range of the scale. By normalizing the primary standards on a linear system, using the full suite of primary standards multiple times over several years (as was done for X2019), we can better define the assigned values of the primary standards. After converting to X2019, the NDIR system is still subject to end effects and errors in value assignments of the primary standards, but these errors are much smaller compared to X2007, and the comparison data show much better agreement between the two systems (lower panel in Fig. 12). The good agreement between the two systems on X2019 leads us to believe that the mole fraction dependence in the offsets on X2007 (upper panel in Fig. 12) is due to the assigned values of the primary standards and not to some other issue related to gas handling. This also indicates that the agreement is probably relatively stable in time and there is likely no mole fraction dependent bias in the NDIR results prior to the comparison period.
Figure 12: Differences between NDIR and laser spectroscopic systems used for tertiary value-assignment on X2007 (upper panel) and X2019 (lower panel) during a 12-month overlap period. Open symbols denote tertiary standards with significantly lower ¹³C-CO2 isotopic ratios compared to the others (δ¹³C < -20‰), and thus subject to bias in the NDIR measurement. Dashed lines are the expected reproducibility of the NDIR system (±0.03 ppm).

Approximating X2019 using a linear scale conversion
For users of standards obtained from the CCL, the best way to update to the X2019 scale is to implement the X2019 reassignments and propagate through to atmospheric data. A database management system allows for efficient propagation of scale changes to atmospheric data. However, for datasets in which a full reprocessing is not possible or practical, a linear scale conversion is an option.

Revision of NOAA atmospheric data
We have reprocessed NOAA atmospheric data back to ~1979 for internal evaluation. This involved re-assigning XCO2 values for working (tertiary level) standards to X2019 by reprocessing the original tertiary-secondary comparisons. For data prior to 1995, this also involved converting from a Scripps Institution of Oceanography (SIO) scale to X2019. Complete detail of the conversion from the SIO scale to X2019 is beyond the scope of this paper, and will be addressed in a separate publication. After fully converting to X2019, NOAA data prior to ~1979 will still be traceable to the SIO scale in use at the time of measurement.
We include examples of atmospheric data here to provide a comparison of two methods used for propagating the X2019 scale: full reprocessing using updated tertiary standard values and response functions, and a simple linear scale conversion applied to atmospheric records. Actual bias introduced into atmospheric records by implementing the linear conversion will depend on the calibration procedures used in a particular laboratory, and the range and calibration history of standards. For example, if a particular set of standards used by a laboratory was analyzed multiple times by the CCL over several years, the impact of the 2008-2009 secondary standard mis-assignment would be reduced.
The lower panel in Fig. 15 shows the difference between the linear scale conversion and full reprocessing applied to in situ CO2 at Mauna Loa, HI (MLO). Generally, the linear scale conversion is fairly close to the fully reprocessed data but has a negative bias which is larger during 2007-2009 due to the 2008 secondary mis-assignment issue. There are time periods of larger differences, such as in late 2014, due to a reassessment of drift in the working standards. In the case of the 2014 period, one of the working standards had a relatively large drift correction (0.2 ppm yr⁻¹, which is not common), but the drift correction was implemented on X2007 in a way that exaggerated the effect (this only applies to relatively few cylinders in 2014). Without fully reprocessing, this error would be preserved in the data set.

In addition to MLO, we reprocessed in situ data from the other NOAA baseline observatories (Barrow, AK; American Samoa; South Pole) and flask samples from marine boundary layer (MBL) sites using both the linear scale conversion and full reprocessing methods. Biases in the linear scale conversion were binned by year to get a sense of how well the linear scale conversion approximates the scale difference over time. Again, differences due to reassessment of drift in the working standards are included in these binned bias terms. Fig. 16 shows the average annual bias in each of these data records that would be included if the records were converted to X2019 using the linear function rather than fully reprocessed.
We also conducted a numerical experiment to examine scale conversion bias without the added complications from a reassessment of drift in working standards. We randomly selected sets of three and five individual tertiary standards measured within a calendar year. Each set required a standard within ±10 ppm of the global average from a particular year (https://www.esrl.noaa.gov/gmd/ccgg/trends/global.html). The other standards were required to be at least 10 ppm but less than 30 ppm apart and cover mole fractions above and below the initial selected standard. Quadratic fits to the actual X2019-X2007 differences vs. the X2007 assignments were made. The point on this curve corresponding to the calendar year global average (on the X2007 scale) was compared to the global average converted to X2019 using the linear scale conversion. The experiment was run 50 times for each year. In essence, this lets us approximate the bias due to the use of the linear scale conversion on a hypothetical sample equal to the global average for 50 different sets of standards. The average biases due to the use of the linear scale conversion for 3-standard and 5-standard suites are shown in Fig. 16 expressed as 3-year running means. The results show good agreement with the bias seen in the in situ and flask MBL records. It is important to note that both the results of the numerical experiment and these particular atmospheric records are tightly tied to the CO2 scale transfer system in time. Atmospheric data from 2007-2009 measured by external programs would not be as sensitive to the 2008 bias if their standards were not calibrated by the CCL during that time. Conversely, measurements at other times tied to standards that were only measured during the 2007-2008 period (without subsequent re-analysis) would be more sensitive.
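The structure of the numerical experiment can be sketched as below for the three-standard case. This is a simplified illustration under loud assumptions: the standards are drawn synthetically rather than selected from the calibration archive, the "actual" X2019-X2007 differences are simulated as a hypothetical linear trend plus random noise, and the coefficients in `linear_conversion` are placeholders chosen only to give 0.18 ppm at 400 ppm (they are not the published conversion). With three standards the quadratic passes exactly through the points, so Lagrange interpolation stands in for the fit; a five-standard suite would need a least-squares fit instead.

```python
import random

def linear_conversion(x2007, slope=0.0008, offset_at_400=0.18):
    """Hypothetical linear X2007 -> X2019 conversion (placeholder coefficients)."""
    return x2007 + offset_at_400 + slope * (x2007 - 400.0)

def quadratic_through(points, x):
    """Evaluate at x the quadratic through three (x, y) points (Lagrange form)."""
    (x0, y0), (x1, y1), (x2, y2) = points
    return (y0 * (x - x1) * (x - x2) / ((x0 - x1) * (x0 - x2))
            + y1 * (x - x0) * (x - x2) / ((x1 - x0) * (x1 - x2))
            + y2 * (x - x0) * (x - x1) / ((x2 - x0) * (x2 - x1)))

def one_trial(global_avg, rng, noise_ppm=0.01):
    """Bias of the linear conversion for one synthetic three-standard suite."""
    # One standard within ±10 ppm of the global average ...
    center = global_avg + rng.uniform(-10.0, 10.0)
    # ... and one below and one above it, 10 to 30 ppm away
    low = center - rng.uniform(10.0, 30.0)
    high = center + rng.uniform(10.0, 30.0)
    # Simulated "actual" X2019 - X2007 differences for the three standards
    points = [(x, linear_conversion(x) - x + rng.gauss(0.0, noise_ppm))
              for x in (low, center, high)]
    x2019_from_fit = global_avg + quadratic_through(points, global_avg)
    return linear_conversion(global_avg) - x2019_from_fit

rng = random.Random(2019)  # fixed seed so the sketch is repeatable
biases = [one_trial(400.0, rng) for _ in range(50)]
mean_bias = sum(biases) / len(biases)
print(f"mean bias over {len(biases)} trials: {mean_bias:+.4f} ppm")
```

In this synthetic setup the mean bias is close to zero by construction; in the real experiment, the differences come from the archived tertiary assignments, so structure such as the 2008 secondary mis-assignment appears as a non-zero bias in the affected years.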

Historical Scales
The impact of the revision from X2007 to X2019 is well understood and the linear conversion agrees with full reprocessing within 0.03 ppm for nearly 80% of standards value-assigned since 1995 over the range 320-460 ppm (Fig. 14). However, data traceable to NOAA scales prior to the release of X2007 that cannot be fully reprocessed are an additional concern. The implementation of NOAA scales prior to X2007 was not rigorously documented. Prior to 2001, NOAA scales were partially based on SIO value assignments of the NOAA primary standards and thus were sensitive to revisions of the SIO scale. The incorporation of SIO revisions over time at NOAA and how these translated into distributed scales is not well documented, and therefore it is difficult to determine relationships between X2019 and historical scales prior to the full conversion to X2007. (Note that the CCL has taken multiple steps since then to ensure these lapses do not occur again and that the evolution of the scale is transparent and fully documented.)
To assess the magnitude of potential bias relative to X2007 that could exist in archived data sets still traceable to historical NOAA scales, we examined records from CSIRO (Australia), NIWA (New Zealand), and Environment Canada, who provided records of tertiary standard value-assignments prior to the formal adoption of the X2007 scale. Fig. 17 shows the difference between the original reported value (assigned by NOAA at that time) and the value re-assigned on scale X2007 upon its release.
NOAA primary standards were initially value-assigned by SIO from 1992 to 1995. From 1996-2000, we used a mixture of NOAA and SIO manometric results, and from 2001 onward we used only NOAA manometric results. Scales propagated by NOAA from 1993-2000 were effectively a mixture of the SIO scale in use at the time (now obsolete) and the NOAA manometric data up to that time. Bias is largest and shows more scatter prior to 1994 because the NOAA scale was based on relatively few SIO measurements of the NOAA primary standards (Keeling et al., 2012). Primary assignments improved over time as the number of measurements increased. Data traceable to these unnamed NOAA scales are biased relative to X2007 (Fig. 17). However, any potential bias in atmospheric records would be related to the date the standards were value-assigned, not necessarily the date the atmosphere was measured. The potential bias in historical data sets relative to X2019 would increase due to the X2019 to X2007 relationship. The linear conversion (equation 6) is not strictly applicable to data not traceable to X2007, but would be a close approximation for data traceable to scales in use between 2001 and 2006. These limitations should be considered with regard to the uncertainty of historical data.

Conclusions
We have applied two corrections to manometric data used to define the WMO/GAW CO2 scale and included four additional standards to define a new scale, identified as WMO-CO2-X2019. The net result of the scale update is two-fold: 1) The X2019 scale is more accurate and internally consistent than the previous X2007 scale. 2) Tertiary assignments on X2019 are more consistent across time because, with additional manometric analysis of primary standards and additional information on secondary assignments, scale propagation has been improved. While the scale difference at the tertiary standard level (~0.18 ppm at 400 ppm) is small in relative terms (0.045%), it is significant in terms of atmospheric monitoring. Measurement laboratories will need to update to the X2019 scale to avoid misinterpretation of scale-induced (artificial) atmospheric gradients as real signals.
For users of standards obtained from the CCL, the best way to update to the X2019 scale is to implement the X2019 reassignments and propagate through to atmospheric data. However, for datasets in which a full reprocessing is not possible or practical, a linear scale conversion is an option. The linear conversion will result in bias compared to full reprocessing, but that bias is relatively small in many cases, and is less than 0.03 ppm for nearly 80% of standards value-assigned since 1995 over the range 320-460 ppm.