https://doi.org/10.5194/amt-18-3747-2025
Research article | 11 Aug 2025

Hourly surface nitrogen dioxide retrieval from GEMS tropospheric vertical column densities: benefit of using time-contiguous input features for machine learning models

Janek Gödeke, Andreas Richter, Kezia Lange, Peter Maaß, Hyunkee Hong, Hanlim Lee, and Junsung Park
Abstract

Launched in 2020, the Korean Geostationary Environmental Monitoring Spectrometer (GEMS) is the first geostationary satellite mission for observing trace gas concentrations in the Earth's atmosphere. Observations are made over Asia. Geostationary orbits allow for hourly measurements, which lead to a much higher temporal resolution compared to daily measurements taken from low-Earth orbits, such as by the TROPOspheric Monitoring Instrument (TROPOMI) or the Ozone Monitoring Instrument (OMI). This work estimates the hourly concentration of surface nitrogen dioxide (NO2) from GEMS tropospheric NO2 vertical column densities (VCDs) and additional meteorological features, which serve as inputs for random forests and linear regression models. With several measurements per day, machine learning models can use not only current observations but also those from previous hours as inputs. We demonstrate that using these time-contiguous inputs leads to reliable improvements regarding all considered performance measures, such as Pearson correlation or mean square error. For random forests, the average performance gains are between 4.5 % and 7.5 %, depending on the performance measure. For linear regression models, average performance gains are between 7 % and 15 %. For performance evaluation, spatial cross-validation with surface in situ measurements is used to measure how well the trained models perform at locations where they have not received any training data. In other words, we inspect the models' ability to generalize to unseen locations. Additionally, we investigate the influence of tropospheric NO2 VCDs on the performance. The region of our study is South Korea.

1 Introduction

The concentration of nitrogen dioxide (NO2) near the Earth's surface is of significant interest for several reasons. NO2 is not only a precursor of ozone, itself an air pollutant and health hazard, but also a direct threat to human health. Moreover, it is linked to environmental issues such as acid rain; see, e.g., Jacob (2000).

At present, surface NO2 is measured by networks of ground-based in situ monitoring stations. However, due to the limited number of such stations, they cannot provide global information about the surface NO2 concentration. This limitation is one of the reasons why satellite remote sensing has become popular for deriving global estimates of surface NO2. Satellites detect the fingerprint of NO2 within the backscattered solar radiation due to its strong absorption of light in the wavelength range of 350–500 nm. One of the first studies on deriving surface NO2 from remote sensing observations was conducted by Lamsal et al. (2008) across the USA and Canada. In their study, surface NO2 was estimated by applying an assumed NO2 vertical distribution calculated with a chemical transport model to tropospheric NO2 vertical column densities (VCDs), where the tropospheric NO2 VCDs were obtained from the Ozone Monitoring Instrument (OMI; Levelt et al.2006). Numerous further studies followed, also utilizing chemical transport models and observations from satellites in low-Earth orbits. For example, we refer to the studies of Lamsal et al. (2010, 2013), Bechle et al. (2013), Wang and Chen (2013), Kharol et al. (2015), Geddes et al. (2016), Gu et al. (2017), and Cooper et al. (2020, 2022). Both OMI data and other observations have been considered, e.g., from the Global Ozone Monitoring Experiment (GOME; Burrows et al.1999), the Scanning Imaging Absorption Spectrometer for Atmospheric Chartography (SCIAMACHY; Bovensmann et al.1999), and the TROPOspheric Monitoring Instrument (TROPOMI; Veefkind et al.2012).

During the last 10 years, machine learning approaches have received increasing attention in determining surface NO2 from satellite remote sensing observations. One advantage is the shorter computation time once the model has been trained. Diverse machine learning models have been used for this task, exploiting not only tropospheric NO2 VCDs as input, but also additional input features to improve the model's performance, such as meteorological parameters, traffic density, or population information. Studies that consider observations from satellites in low-Earth orbits have been conducted by, for example, Kim et al. (2017), Jiang and Christakos (2018), de Hoogh et al. (2019), Chen et al. (2019), Di et al. (2020), Qin et al. (2020), Kim et al. (2021), Chan et al. (2021), Dou et al. (2021), Ghahremanloo et al. (2021), Li et al. (2022), Wei et al. (2022), Huang et al. (2023), and Shetty et al. (2024). For a detailed review on the methods used, the input features included, the regions of consideration, and the achieved performance, we refer to the work of Siddique et al. (2024).

Instruments on satellites in low-Earth orbits, such as OMI and TROPOMI, pass over the same region in middle and low latitudes once a day, which means they can provide at best one measurement per day and location. If the area is cloud-covered during the time of observation, the measurement of lower-tropospheric gases is not accurate, which makes the data coverage even more limited. Since satellites in low-Earth orbits provide observations at most once a day, most studies either predicted surface NO2 at this specific satellite observation time (e.g., Kim et al.2017) or estimated daily (e.g., Di et al.2020), monthly, or annual averages of surface NO2. Nevertheless, it should be mentioned that there are a few studies that have estimated hourly NO2. As an example, Kim et al. (2021) linearly interpolated daily tropospheric NO2 VCDs to an hourly resolution, from which they estimated hourly surface NO2 concentrations over Switzerland and northern Italy.

In contrast, geostationary satellites permanently observe – more or less – the same region, leading to more data points for a given location that can be used for a prediction algorithm of surface NO2. In particular, these larger datasets make machine learning approaches even more attractive. The first geostationary satellite instrument for observing trace gas concentrations in the Earth's atmosphere is the Geostationary Environmental Monitoring Spectrometer (GEMS; Kim et al.2020), which was launched in February 2020 by the Republic of Korea. It provides hourly measurements of radiances over 20 countries in Asia, including South Korea. Alongside GEMS, two further geostationary missions now monitor trace gases: NASA's TEMPO instrument, launched in April 2023, observes North America, and ESA's Sentinel-4 mission, launched in 2025, monitors Europe.

Until now, only a few studies have been conducted on hourly surface NO2 retrieval from geostationary observations: Zhang et al. (2023) presented a scientific GEMS NO2 product (POMINO-GEMS), which empirically corrects for overestimation and stripe artifacts in the operational GEMS NO2 product. They then converted their tropospheric NO2 VCDs of 2021 over China to hourly surface NO2 using a chemical transport model. Further studies that exploit machine learning approaches have been conducted over China. Yang et al. (2023b) used a random forest regressor to predict hourly surface NO2 over China from GEMS radiance data at six wavelengths from the UV and visible bands, as well as some additional meteorological, temporal, and spatial features. Furthermore, a multi-output random forest was used to simultaneously predict five more air pollutants, such as ozone. Although prediction accuracy achieved by the multi-output model was slightly worse regarding surface NO2, the overall training time for predicting all six pollutant concentrations was smaller. Ahmad et al. (2024) combined two machine learning models. First, a random forest was used to predict NO2 mixing heights from meteorological input features. These were then fed into an extreme gradient boosting regressor, together with tropospheric NO2 VCDs from GEMS, temporal variables, and meteorological variables. The study demonstrates the benefit of using NO2 mixing height as input.

Hourly surface NO2 has also been predicted from GEMS observations over South Korea, the region considered in this study. In the work of Lee et al. (2024), predictions were made for the whole year of 2022. Therein, the total amount of VCDs instead of tropospheric NO2 VCDs was used as the only input of a (linear) mixed-effect model to predict surface NO2. Their model is a piecewise-defined function whose output depends not only on the total column of NO2, but also on the day and hour at which and region in which the prediction is to be made. For this, South Korea was divided into nine regions, which presumably leads to a more direct region-wise relationship between surface NO2 and column densities of NO2. In other words, implicitly, spatial and detailed temporal information is also exploited in their approach. This makes their model specialized to South Korea and the year 2022.

Another study that predicted surface NO2 over South Korea was conducted by Tang et al. (2024). Therein, daily surface NO2 concentrations instead of hourly surface NO2 were predicted. Further, they did not use NO2 column densities as input for a machine learning model. Instead, they inspected the influence of aerosol optical depth, which is part of the GEMS data products. Aerosol optical depth, together with surface NO2 predictions from a chemical transport model and other features such as meteorological parameters, served as inputs for a random forest to estimate surface NO2.

In order to train and evaluate machine learning models of surface NO2, in situ NO2 observations from ground-based networks are used. Within the literature, there are two frequently used strategies to evaluate the performance of a machine learning model in predicting surface NO2. First, standard k-fold cross-validation is considered; see, for example, the works of Ghahremanloo et al. (2021), Chan et al. (2021), Yang et al. (2023b), and Ahmad et al. (2024). This means that the whole dataset is randomly split into k equally sized subsets. One of them serves as the test set, whereas the other k − 1 subsets are used to train the model. Training and testing are repeated k times, until each subset has served once as a test set. The average test performance (e.g., Pearson correlation) is calculated and represents the final evaluation of the model. For standard k-fold cross-validation, data from all available in situ stations are contained in both the training and the test datasets (with large probability). However, what if the trained model should afterwards predict surface NO2 at a new location which has not contributed data to the training set? With the result from standard cross-validation, it would be impossible to say how reliably the model can generalize to this unseen location. It may have overfitted to the locations that it has dealt with during training. Therefore, if global charts covering large areas like the entirety of South Korea are desired, it would be more appropriate to evaluate the model's performance via so-called spatial k-fold cross-validation. This means the set of available in situ stations is divided into training and test stations, the model is trained with data from training stations only, and – finally – its performance in predicting surface NO2 at the test stations is evaluated. Unsurprisingly, performance measured with spatial cross-validation is indeed worse compared to standard cross-validation, which has been observed, e.g., within the studies of Ghahremanloo et al. (2021), Chan et al. (2021), Yang et al. (2023b), and Tang et al. (2024). In our work we focus on spatial k-fold cross-validation, as we wish to inspect how well a model can generalize to unseen locations.

1.1 Goals of this study

Due to the hourly measurements GEMS provides over the same region, it is natural to ask whether one can benefit directly from the time resolution itself and not only from the resulting larger size of the dataset. Hence, we propose training a machine learning model φ that predicts surface NO2 at a given location z and time t not only from the corresponding tropospheric NO2 VCD and meteorological data at time t, but also from k − 1 ∈ ℕ_0 previous hours (ℕ_0 denotes the set of natural numbers including zero). This means the model is a mapping φ: ℝ^{pk} → ℝ, where p is the number of different features:

input(z,t) := ( tropospheric NO2 VCD(z,t), …, tropospheric NO2 VCD(z, t−k+1), meteorological features(z,t), …, meteorological features(z, t−k+1) ),

φ(input(z,t)) ≈ surface NO2(z,t).

Here t − j refers to the time j hours before t, where j ∈ {0, 1, …, k−1}. In all that follows, k is also referred to as the time contiguity of the input features, as it determines how many times each input feature is included in the whole input vector. Note that k = 1 stands for the case in which only input features at the current time t are included. Of course, one could also use features at later times t + j, but for simplicity and better readability, we focus on making predictions based on previous-time features in this work.

Our main aim is to inspect whether the performance of the model in predicting surface NO2 at unseen locations will increase by using inputs with higher time contiguity k. Unseen locations are locations from which the model has not seen any training data. As it turns out, it is indeed beneficial to use larger time contiguity k>1 for the machine learning models considered, namely random forests and linear regressors. To the best of our knowledge, this observation has not been made in the literature yet. Regarding work on non-geostationary satellite data, the usage of time-contiguous tropospheric NO2 VCDs is simply impossible, as only single measurements per day are available. We further carefully design experiments that are suitable for answering our main research question about the benefit of time-contiguous inputs. Last but not least, we inspect the influence of tropospheric NO2 VCDs on the models' ability to predict surface NO2 and their influence on the benefit of using time-contiguous inputs. This is of interest as it addresses the question of how useful and necessary satellite observations of NO2 are for the prediction of surface NO2 concentrations.

1.2 Outline

In Sect. 2 we describe the different sources of data included in our study. Furthermore, we describe the construction of the datasets used for training machine learning models in our study and give a mathematical description of these datasets. Afterwards, in Sect. 3.2 we describe the experiments that provide clear insights into the research questions, e.g., whether time-contiguous inputs can enhance the quality of surface NO2 predictions. We also discuss different loss functions for measuring the performance of trained models on the test dataset. Section 4 serves as a quick recap of the machine learning models used in this study. Finally, we present and discuss the results of our experiments in Sect. 5.

2 Data

In our study, we exploit two data sources for the prediction of surface NO2. The first source is tropospheric NO2 VCDs derived from GEMS measurements, and the second is meteorological data from the ERA5 dataset (Hersbach et al.2023). Further, measurements of surface NO2 at in situ stations from the air quality network of South Korea serve as the ground truth in this study. This section begins with a brief description of these data sources, followed by a description of the data preprocessing steps. In particular, we explain how the VCDs were paired with ERA5 and in situ data and how time-contiguous datasets were constructed. For clarity, we provide mathematical definitions of these time-contiguous datasets.

2.1 Data sources

2.1.1 GEMS tropospheric NO2 vertical column densities

GEMS is a UV–visible imaging spectrometer on board the geostationary satellite GK2B. At its launch on 18 February 2020, GEMS was the first geostationary air quality monitoring mission. GEMS is located over the Equator at a longitude of 128.2° E and covers a large part of Asia (5° S–45° N and 75–145° E) on an hourly basis. With four different scan modes, which all include South Korea, the field of regard (FOR) shifts westward with the Sun. During daytime, GEMS provides up to 10 observations over a given location, depending on season and location, with a spatial resolution of 3.5 km × 8 km at Seoul. The GEMS irradiance and radiance measurements in the UV–visible spectral range can be used to derive column amounts of, for example, ozone (O3), sulfur dioxide (SO2), and NO2, but also cloud and aerosol information (Kim et al.2020). For this study, we use the tropospheric NO2 VCD product.

During the time of this study, the operational GEMS L2 tropospheric NO2 VCD product was available in v2. This version was evaluated by, e.g., Oak et al. (2024) and Lange et al. (2024), showing that it is high biased compared to the TROPOMI tropospheric NO2 VCD product and ground-based tropospheric NO2 VCD datasets. Additionally, the v2 product showed enhanced scatter. In preparation for the European geostationary instrument on Sentinel-4, the Institute of Environmental Physics at the University of Bremen (IUP-UB) has developed a scientific GEMS NO2 product. The GEMS IUP-UB tropospheric NO2 VCD v1.0 product was evaluated by Lange et al. (2024), showing good agreement with the operational TROPOMI NO2 data and ground-based observations. Here, an earlier version (v0.9) of the same data product was used. Briefly, the retrieval is based on a differential optical absorption spectroscopy fit in the 405–485 nm spectral window, using daily GEMS irradiances as background spectra. The stratospheric correction is based on a variant of the STREAM algorithm of Beirle et al. (2016), and tropospheric vertical columns are computed using air mass factors by applying the tropospheric NO2 profiles from the TM5 model run performed for the operational TROPOMI product (Williams et al.2017). The TM5 model has an hourly temporal resolution with a spatial resolution of 1° × 1°. As the model a priori is interpolated in space and time, no obvious structures from the coarse model resolution are visible in the data, but the lack of detail may still impact the results. Cloud screening is based on the operational GEMS cloud product v2 and a threshold of 50 % cloud radiance fraction, but no additional cloud correction is performed. Each pixel has a quality indicator (qa value) based on fitting residuals, cloud fraction, and surface properties. Here, only data with the highest qa value (good fits, cloud radiance fraction below 50 %, no snow or ice detected) are used.

Further, the GEMS IUP-UB product does not yet have full error propagation. The tropospheric NO2 VCD error is therefore estimated to be 25 %. The main uncertainty results from the assumptions used in the calculation of air mass factors, in particular for surface reflectivity, the NO2 vertical profile, and aerosol loading. Uncertainties are expected to be larger in the morning when the boundary layer is shallow and smaller around noon and in the evening. Uncertainties introduced by the stratospheric correction can be important over clean regions but can be neglected over pollution hotspots.

2.1.2 Meteorological data

In order to predict surface NO2, it would not be sufficient to use tropospheric NO2 VCDs as the only source of information. This is because VCDs represent integrals over the entire troposphere, capturing contributions from NO2 at various altitudes, not just near the surface. A common strategy is to incorporate additional meteorological features into the prediction of surface NO2; see for example the works of Di et al. (2020), Qin et al. (2020), Ghahremanloo et al. (2021), Chan et al. (2021), Li et al. (2022), and Yang et al. (2023b). In our study, we utilize meteorological features from the ERA5 dataset, the fifth-generation reanalysis by the European Centre for Medium-Range Weather Forecasts (ECMWF), which provides comprehensive global climate and weather data for the past 8 decades (Hersbach et al.2023).

Our selection of meteorological features is partially inspired by the choices made in the aforementioned studies, including variables such as boundary layer height, wind components, surface temperature, or pressure. The 18 features from ERA5 that are considered in this study are listed in Table B1, where we use the same nomenclature as in the description of the ERA5 dataset; see again Hersbach et al. (2023). In the geographical reference system, the resolution of all meteorological features is 0.25° × 0.25°, which corresponds to approximately 28 km × 22 km over South Korea. Consequently, ERA5 data are approximately 8 times coarser in latitude and 3 times coarser in longitude than the GEMS tropospheric NO2 VCDs.

2.1.3 In situ measurements of surface NO2

In this study, we use in situ surface NO2 measurements from the air quality network AirKorea as the ground truth, provided by the Korean Ministry of Environment (National Institute of Environmental Research (NIER)2025). There is a large number of in situ stations in South Korea that, among other air-pollution-related species, measure surface NO2. We used data from 637 stations, which are depicted in Fig. 1a. The instruments utilize the chemiluminescence method, as described by Kley and McFarland (1980). Our in situ dataset includes measurements from January 2021 until the end of November 2022, and we received the data in December 2022.


Figure 1. (a) Map with the 637 in situ stations from the air quality network of South Korea used in this study. (b) An exemplary split into 90 % training stations and 10 % test stations, considered during multiple 10-fold spatial cross-validations.

2.2 Pairing of data sources and data preprocessing

In the following, we explain the spatial and temporal pairing of the data sources. Tropospheric NO2 VCDs and meteorological data have the spatial resolutions described in the previous section. Consequently, each data point covers an area (pixel) on the Earth's surface, rather than a single point. Here, we associated the location of an in situ station with the VCD or meteorological pixel whose center is nearest to the station's location (longitude, latitude). Note that the center of a VCD pixel coincides with the respective center of the GEMS satellite pixel, since no regridding is applied.

Tropospheric NO2 VCDs are based on GEMS observations that have been collected within 30 min starting at a quarter to the respective hour, e.g., from 01:45 to 02:15 UTC. In situ measurements of surface NO2 are available as hourly averages, starting on the hour. Temporally, we matched them with the VCDs using this timestamp and found that these data pairs showed the highest Pearson correlation. For example, VCDs between 01:45 and 02:15 UTC were matched with in situ measurements with a timestamp of 01:00 UTC. Unfortunately, at the end of our project, we learned that this was a misinterpretation of the in situ measuring times by 1 h, as the hourly averages actually start at the hour before the given timestamp instead of at the hour of the timestamp, as we had assumed. This means that the VCDs and surface NO2 were not optimally matched within our experiments. However, the abovementioned correlation tests give us confidence that the conclusions of this study are not affected by this mistake, in particular with respect to the improvements in performance when adding data from other measurement times. To maintain consistency in notation, we continue to use the originally interpreted in situ measuring times, but they should be regarded as occurring 1 h earlier. Most meteorological features are given on the hour, which means at a specific point in time. There is one exception, namely evaporation, which is available as an hourly average starting on the hour, similar to in situ measurements. Since the averages of these data sources are taken over different periods of time, there is not a unique way to pair them temporally. Our approach is the following.

Due to the hourly resolution of all data sources, time t is expressed by t = YYYY/MM/DD/HH throughout this work. For example, t = 2021/01/23/01 refers to 23 January 2021 at 01:00 UTC. We associate the in situ measurements of surface NO2, which started at time t and went on for 1 h, with t. In the example, time t = 2021/01/23/01 refers to surface NO2 that has been averaged from 01:00 UTC until 02:00 UTC. Regarding tropospheric NO2 VCDs, the same t refers to measurements that started 45 min later. Hence, t = 2021/01/23/01 describes the VCDs at a time between 01:45 and 02:15 UTC. Finally, for the meteorological features that are given instantaneously on the hour, t stands for the feature's value 1 h later at t + 1. Thereby, it is closest to the corresponding VCD time frame. For example, t = 2021/01/23/01 is associated with the meteorological feature at 02:00 UTC.
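As an illustration only, the following minimal Python sketch (with hypothetical function and variable names, not code from this study) returns, for a reference time t, the time window or instant from which each data source contributes under the convention described above.

```python
from datetime import datetime, timedelta

def paired_timestamps(t: datetime) -> dict:
    """Illustrative pairing of the data sources for a reference time t (UTC),
    following the convention described above (including the originally
    interpreted in situ timestamps, which are in fact shifted by 1 h)."""
    return {
        # in situ surface NO2: hourly average starting at t (as interpreted here)
        "in_situ_window": (t, t + timedelta(hours=1)),
        # GEMS tropospheric NO2 VCD: scan collected from t + 45 min to t + 75 min
        "vcd_window": (t + timedelta(minutes=45), t + timedelta(minutes=75)),
        # ERA5 instantaneous variables: value at t + 1 h, closest to the VCD scan
        "era5_instant": t + timedelta(hours=1),
    }

# Example for t = 2021/01/23/01
print(paired_timestamps(datetime(2021, 1, 23, 1)))
```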

To sum up, given a location z of an in situ station and a time t= YYYY/MM/DD/HH, we specified a single data point (f(z,t),s(z,t)) that stores surface NO2 s(z,t) combined with the vector of input features f(z,t), which consists of tropospheric NO2 VCDs and meteorological features. As a data preprocessing step, we exclude data points that violate any of the following conditions:

  1. All features are available at location z and time t (tropospheric NO2 VCDs and surface NO2 might be missing for a given z, t, for example, due to clouds).

  2. Tropospheric NO2 VCDs are non-negative. Negative VCDs can occur as a result of measurement noise in the satellite data or uncertainties in the stratospheric correction. We excluded them in an effort to improve the quality of the dataset. However, toward the end of the project, we tested the effect of this filter on a subset of the dataset and found only very small changes. This is probably because applying this filter reduces the dataset by less than 0.5 %. Since negative VCDs are usually found over regions with low tropospheric NO2 VCDs, the filter leads to a loss of the input variable and thus a loss of predictions for these regions. In retrospect, we can conclude that the implementation of this filter was not necessary, as it had only little influence on our dataset and can thus be neglected in future work. Regarding the random forests used in this study, which are trained on non-negative VCDs only, they are still able to make reasonable but potentially biased predictions over clean regions with negative VCDs as inputs. In this case, the random forests would treat negative VCDs as being zero. In contrast to the VCDs, the in situ measurements of surface NO2 are never negative.

  3. The GEMS qa value is equal to 1. As a consequence, the trained models presumably cannot make reliable predictions for scenarios where the qa value is smaller than 1. It would be an interesting future direction to examine the effects of lowering the threshold for the qa value. This would result in a larger but more complex dataset.

Data points (f(z,t),s(z,t)) that fulfill these conditions are collected within the so-called data basis. A data point in the data basis is not time contiguous, as it only provides information at a single time t and not at previous hours. The construction of time-contiguous datasets is described in the next section.
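For illustration, the three conditions above could be applied to a tabular dataset as in the following sketch; the column names `vcd`, `qa_value`, and `surface_no2` are placeholders, not the names used in the actual GEMS or AirKorea files.

```python
import pandas as pd

def build_data_basis(df: pd.DataFrame, feature_cols: list[str]) -> pd.DataFrame:
    """Keep only data points (z, t) that satisfy the three conditions above."""
    # 1. all features and the target must be available
    df = df.dropna(subset=feature_cols + ["surface_no2"])
    # 2. tropospheric NO2 VCDs must be non-negative
    df = df[df["vcd"] >= 0]
    # 3. only pixels with the highest GEMS quality flag are used
    df = df[df["qa_value"] == 1]
    return df
```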

2.3 Description of time-contiguous datasets

In the Introduction, we motivated the use of time-contiguous inputs for machine learning models in order to predict surface NO2. For better clarity, we now introduce the necessary notation and definitions in mathematical form.

2.3.1 Spatial and temporal coordinates

Z is the set of positions (longitude, latitude) on the Earth's surface; hence, it can be identified with the Cartesian product [−180, 180) × [−90, 90). In this study, we deal with in situ stations in South Korea, which are located within [124, 131) × [33, 39); see Fig. 1a. In what follows, these stations are simply identified with their location z ∈ Z.

T is the set of all measuring times YYYY/MM/DD/HH between January 2021 and November 2022. For example, 2021/01/23/01 refers to 23 January 2021 at 01:00 UTC. Note that for a given t ∈ T, the expression t − j for j ∈ ℕ stands for the time j hours before t. For example, for t = 2021/01/23/01 and j = 3, it is t − j = 2021/01/22/22.

2.3.2 Surface NO2 and input features

Recall from the previous section that surface NO2 measured at time t ∈ T and at in situ station z ∈ Z is denoted by s(z,t). As already mentioned, surface NO2 is to be predicted from the tropospheric NO2 VCD and meteorological variables such as the boundary layer height. These input features at z ∈ Z and t ∈ T are denoted by f_1(z,t), …, f_p(z,t), where p is the number of considered features (determined by some feature selection procedure; see Sect. 3.1). At this point, it is only important that f_1 denotes the VCDs. For simplicity, we just write f(z,t) ∈ ℝ^p for the vector of all features at location z and time t.

2.3.3 Data preprocessing

We review the data preprocessing described in the previous section in light of the mathematical notation. A measurement f_1(z,t) of a tropospheric NO2 VCD is valid if it exists (measurements may be missing at some times t ∈ T), if f_1(z,t) ≥ 0, and further if the GEMS qa value is equal to 1. For all other features f_2(z,t), …, f_p(z,t) and surface NO2 s(z,t), it suffices that the measurement exists in order to be categorized as valid. Note again that in situ measurements of surface NO2 are always non-negative in the present dataset.

In the following, we collect all locations and times (z,t) at which we have access to valid measurements. Namely, the domain of valid measurements Ω is defined as

(1) Ω = { (z, t) ∈ Z × T : s(z,t), f_1(z,t), …, f_p(z,t) are valid }.

2.3.4 Time-contiguous datasets

In order to consider time-contiguous measurements, we define for N ∈ ℕ the set

(2) Ω_N = { (z, t) ∈ Ω : (z, t − j) ∈ Ω for j = 1, …, N − 1 }.

In other words, Ω_N collects locations and times (z, t) at which valid measurements also exist for at least N − 1 previous hours. Note that Ω_N ⊆ Ω_{N−1} ⊆ … ⊆ Ω for all N, and Ω_1 coincides with Ω, the domain of valid measurements. Given (z, t) ∈ Ω_N and k ∈ {1, …, N}, this definition allows us to build a valid time-contiguous feature vector:

(3) ( f(z,t), f(z, t−1), …, f(z, t−k+1) )^T ∈ ℝ^{pk},

which can serve as input for a machine learning model φ_θ: ℝ^{pk} → ℝ to predict surface NO2 s(z,t).

Hence, Ω_N parameterizes the datasets occurring in our study. In fact, Ω_N parameterizes N different datasets of feature vectors paired with surface NO2. They differ only in the time contiguity k ∈ {1, …, N} of the feature vectors, that is, in how many previous hours (namely k − 1) are considered for each feature (at most N − 1). Mathematically, these N datasets can be understood as functions D_{N,k}: Ω_N → ℝ^{pk} × ℝ mapping (z, t) ∈ Ω_N to the feature vector in Eq. (3) paired with surface NO2 at location z and measuring time t. Further, D_{1,1} just describes the data basis mentioned in the previous section.

The number of elements in Ω_N – that is, the size of all datasets D_{N,k} – is listed in Table 1 for N = 1, …, 5. Hence, if a model is to be trained with time-contiguous inputs (k > 1), this comes at the price of a smaller number of data points. For example, time-contiguous models cannot be used to make predictions during the first hours of a day. It should be mentioned that among all features described in the previous section, ERA5's soil type and high vegetation cover are the only features that do not depend on time t. This is why, in practice, we never included them k times but rather a single time only when building the time-contiguous feature vector in Eq. (3) at (z, t). However, for the sake of simplicity, we neglect this fact within the notation.
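A minimal sketch of how D_{N,k} could be assembled from the data basis is given below. It is illustrative only and assumes the data basis is stored as a pandas DataFrame with one row per station and hour and hypothetical columns `station`, `time`, and `surface_no2`.

```python
import numpy as np
import pandas as pd

def build_time_contiguous_dataset(basis: pd.DataFrame, feature_cols, N: int, k: int):
    """Assemble D_{N,k}: keep (z, t) only if the N - 1 previous hours are also
    valid, and stack the features of the k most recent hours into one input
    vector as in Eq. (3)."""
    X, y = [], []
    for _, g in basis.groupby("station"):
        g = g.set_index("time").sort_index()
        for t in g.index:
            hours = [t - pd.Timedelta(hours=j) for j in range(N)]
            if all(h in g.index for h in hours):          # (z, t) lies in Omega_N
                stacked = np.concatenate(
                    [g.loc[h, feature_cols].to_numpy() for h in hours[:k]]
                )
                X.append(stacked)
                y.append(g.loc[t, "surface_no2"])
    return np.asarray(X), np.asarray(y)
```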

Table 1. Size of time-contiguous datasets D_{N,k}, which consist of data points for which valid measurements also exist for at least N − 1 previous hours, but only k values are used for constructing the time-contiguous feature vector in Eq. (3). Note that the size is independent of the time contiguity k. The overall considered time period covers January 2021 until November 2022.


2.3.5 Normalization of input features

For any given split into training and test data, the input features are normalized before being fed into the machine learning models to improve the stability of their performance. More precisely, each feature undergoes an affine transformation A such that its mean on the training data becomes 0 and its standard deviation becomes 1. Let x̄_train and σ_train be the mean and standard deviation of a feature in the training data, respectively. Then the transformation, applied to both training and test data points, is given by

(4) A(x) = (x − x̄_train) / σ_train.
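In code, this corresponds to fitting the normalization on the training data only and reusing it for the test data, for example with scikit-learn's StandardScaler; the arrays below are placeholders.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_train = rng.random((1000, 17))     # placeholder training features
X_test = rng.random((200, 17))       # placeholder test features

scaler = StandardScaler()                     # implements Eq. (4) feature-wise
X_train_norm = scaler.fit_transform(X_train)  # statistics from training data only
X_test_norm = scaler.transform(X_test)        # same affine map applied to test data
```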

A compact overview of the spatial and temporal resolutions of the data sources used is shown in Table 2. In addition, for each data source, the applied data preprocessing steps are listed. Moreover, the overall workflow for all data-processing steps is illustrated in the flowchart in Fig. 2.

Table 2. Overview of spatial and temporal resolutions of the data sources used. Applied preprocessing steps are also listed for each data source.

* Exception: ERA5 evaporation is available as hourly averages.



Figure 2. A flowchart of all data processing steps. The left column shows the construction of the time-contiguous datasets D_{N,k}. For preprocessing, the data are filtered according to the criteria in Sect. 2.2; see also Table 2. Evaluating the performance of models on D_{N,k} is done via spatial cross-validation; see Sect. 3.2. This pipeline is outlined in the right column.


3 Experimental setup

In Sect. 3.2, we describe and discuss the experiments conducted to inspect our main research questions. Before that, we explain how features were selected for these experiments. Afterwards, we discuss different performance measures and loss functions used to evaluate the quality of the models' prediction of surface NO2 on test data points.

3.1 Feature selection

In this study, we considered 23 different features, from which we selected 17 to build the feature vectors in Eq. (3) that serve as inputs for the machine learning models. The selected and excluded features are listed in Table B1; the selected features are used in Experiment 1 and Experiment 2 (see Sect. 3.2). For the feature selection, we proceeded as follows: on the data basis D_{1,1}, we considered 200 different splits into 90 % training and 10 % test stations. For the training data of each split, we calculated the Pearson correlation (see Sect. 3.3 for a definition) between in situ measurements of surface NO2 and the respective feature. We selected features whose absolute mean correlation was larger than 0.1. It is worth mentioning that for all 17 of the selected features, the absolute correlation was in fact larger than 0.1 in 98 % of the splits, whereas this was never the case for the remaining six features. More complex feature selection strategies could be applied in the future. However, in this study we focus on the benefits of time-contiguous inputs and not on the optimal choice of input features.
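The selection criterion can be sketched as follows; variable names are illustrative, and the 200 station splits are assumed to be provided as a list of (X_train, y_train) pairs.

```python
import numpy as np

def select_features(splits, feature_names, threshold=0.1):
    """Select features whose mean Pearson correlation with surface NO2,
    averaged over all training splits, exceeds the threshold in absolute value.

    `splits` is a list of (X_train, y_train) pairs, one per station split.
    """
    corr = np.array(
        [[np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])]
         for X, y in splits]
    )
    mean_corr = corr.mean(axis=0)           # mean correlation per feature
    return [name for name, c in zip(feature_names, mean_corr) if abs(c) > threshold]
```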

3.2 Experiments

Recall from Sect. 2.3 that Ω_N is the set of locations and measuring times (z, t) at which all measurements are also available at the N − 1 previous hours. Note that Ω_N does not parameterize a single dataset but N different datasets D_{N,k}: Ω_N → ℝ^{pk} × ℝ via

D_{N,k}: (z, t) ↦ ( ( f(z,t), f(z, t−1), …, f(z, t−k+1) )^T, s(z,t) ),

which only differ in the time contiguity k ∈ {1, 2, …, N} of the time-contiguous feature vector ( f(z,t), …, f(z, t−k+1) )^T defined in Eq. (3).

As mentioned in the Introduction, we wish to inspect how well a machine learning model is able to make predictions of surface NO2 at locations from which it has not seen training data. This is why we use multiple (namely six) 10-fold spatial cross-validations in all experiments. This involves splitting the dataset 60 times randomly into 90 % training and 10 % test data based on the locations of the in situ stations; see Fig. 1b for a visualization of a single split. Performance is measured on all the different test datasets and averaged. Due to the limited number of available in situ stations, significant variance in the model's performance is expected across different splits. Therefore, multiple 10-fold spatial cross-validations provide a more reliable estimate of the model's performance compared to a single 10-fold spatial cross-validation. In all that follows, whenever it is mentioned that a machine learning model is trained or tested on D_{N,k}, it implies that the model is trained or tested solely on those data points in D_{N,k} corresponding to the designated training or test stations. Note that for fixed N, the surface NO2 that is to be predicted in D_{N,k} is exactly the same for all the different k. Furthermore, for all models, the same 60 splits into training and test stations are considered for spatial cross-validation, which ensures full comparability. For a basic outline of the cross-validation pipeline, see Fig. 2.
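Station-wise splitting of this kind can be sketched as follows. This is an illustrative implementation, not the code used in the study; `stations` holds one station identifier per data point, and every station falls into exactly one test fold per repetition.

```python
import numpy as np

def repeated_station_kfold(stations, n_repeats=6, n_folds=10, seed=0):
    """Yield (train_idx, test_idx) index pairs for repeated station-wise k-fold
    splitting, i.e., a six-times 10-fold spatial cross-validation by default."""
    rng = np.random.default_rng(seed)
    unique_stations = np.unique(stations)
    for _ in range(n_repeats):
        shuffled = rng.permutation(unique_stations)
        for fold in np.array_split(shuffled, n_folds):
            test_mask = np.isin(stations, fold)
            yield np.where(~test_mask)[0], np.where(test_mask)[0]

# usage sketch:
# for train_idx, test_idx in repeated_station_kfold(stations):
#     fit the model on (X[train_idx], y[train_idx]) and score it on the test indices
```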

Let us recall from Sect. 1.1 that our main research question is whether time-contiguous inputs for machine learning models enable higher accuracy for predicting surface NO2. We propose two experiments to gain insight into this question.

  •  

    Experiment 1. Do time-contiguous input features provide additional information?

    For fixed N, consider the datasets D_{N,k} for different time contiguities k = 1, …, N. The chosen machine learning model, such as a random forest regressor, is trained and tested on D_{N,k} for all 60 splits from spatial cross-validation. A comparison is made with respect to different k. Fixing N ensures that, regardless of k, the same ground truth (surface NO2) is predicted for computing the cross-validation scores on the test sets. Additionally, all models are trained with the same number of training data points, eliminating any advantage or disadvantage due to differing dataset sizes. Thus, this experiment provides pure insights into the information gain provided by time-contiguous inputs. We conduct this experiment for all N ∈ {2, 3, 4, 5}.

  •  

    Experiment 2. Are time-contiguous input features beneficial in spite of a smaller available dataset?

    In the first experiment, the models were trained on the same amount of training data, with the time contiguity k being the only variable. However, for smaller k there are many more data points available that can be used for training the respective models; see Table 1. Therefore, we extend the first experiment as follows: we still test performance on D_{N,k} for a fixed N. But for different k, we train models on D_{M,k} for all M ∈ {k, k+1, …, 5}, i.e., with a different amount of training data. Note that in Experiment 1, M has always been set to N. These additional investigations are crucial to evaluate whether time-contiguous inputs are beneficial for predicting surface NO2. Even if time-contiguous inputs provide additional information (as seen in the first experiment), why should one use them if training with less or even no time contiguity on larger datasets yielded better results? Again, we conduct this experiment for all N ∈ {2, 3, 4, 5}, where N determines the test datasets.

In a third experiment, we analyze the influence of some features on the performance of the machine learning models. Since testing all the different combinations of input features for all 15 different training and test cases in Experiment 2 would be beyond the scope of this study, we focus only on the influence of the tropospheric NO2 VCDs, surface height, and latitude. Note that longitude has not been included during feature selection due to a low correlation with surface NO2. Tropospheric NO2 VCDs are the main consideration within this third experiment since they represent the feature which shows, among all considered input features, by far the highest Pearson correlation with surface measurements of NO2, namely around 0.626; see also Table B1. Although latitude only has a small variation over South Korea and hence a presumably small impact on predicting surface NO2, we considered it (and also longitude) during feature selection to check whether it provides some helpful information. Other studies have also used spatial coordinates to predict surface NO2, mainly over large regions (Ghahremanloo et al.2021; Li et al.2022; Qin et al.2020) but also over smaller regions, such as over Switzerland (de Hoogh et al.2019). Using spatial coordinates as inputs for a model, however, carries the risk of spatial overfitting, which could make it more difficult to predict surface NO2 outside of South Korea with the same model. This is why we inspect whether the models perform equally well over South Korea without having latitude and surface height as inputs.

  •  

    Experiment 3. What is the influence of tropospheric NO2 VCDs, latitude, and surface height on the performance?

    We compare four different settings of input features:

    •  

      Setting 1. All features selected in Sect. 3.1 are included, which is exactly the same setup as for Experiments 1 and 2.

    •  

      Setting 2. VCDs are excluded as an input feature.

    •  

      Setting 3. Latitude and surface height are excluded.

    •  

      Setting 4. VCDs, latitude, and surface height are excluded.

    We also conduct Experiment 2 for Settings 2, 3, and 4 and draw a comparison between these settings regarding different performance measures. Further, within these four settings, we inspect the models' ability and reliability in achieving performance gains when including time-contiguous input features.

3.3 Performance measures

Throughout this section, x ∈ ℝ^n is a vector consisting of n in situ observations of surface NO2, where each coefficient x_i = s(z_i, t_i) corresponds to a measurement that has been taken at a given time t_i and location (longitude, latitude) z_i of a given in situ station. For the sake of simpler notation, we just write x_i, neglecting the dependence on t_i and z_i within the notation. Similarly, x̂ ∈ ℝ^n denotes the predictions for x made by a machine learning model, such as linear regression or random forests. In the following, we discuss different performance measures that quantify the gap between the model's prediction x̂ and x, the observed surface concentration of NO2.

As pointed out in the Introduction, spatial cross-validation is considered within this research; i.e., data are split into training and test data station-wise. Since the overall number of in situ stations is relatively small, namely 637, the statistical properties of surface NO2 for different test sets are very likely to differ. In particular, the mean or standard deviation of surface NO2 of different test sets will vary. Hence, in order to compare the quality of surface NO2 predictions on different test sets, it is reasonable to use error measures that are more robust or even insensitive to different data distributions.

In order to ensure better comparability of performances of a model on different test sets, one should not use absolute performance measures such as the mean absolute error or root mean square error, since they depend on the scale of the different test sets.

At first glance, it seems reasonable to consider the mean percentage error:

MPE(x, x̂) = (1/n) Σ_{i=1}^{n} |x_i − x̂_i| / |x_i|.

The reason why the mean percentage error enables us to compare performances on different test sets is the following property: for every c ∈ ℝ^n with c_i ≠ 0 it holds that

MPE(c ⊙ x, c ⊙ x̂) = MPE(x, x̂),

where c ⊙ x denotes pointwise multiplication. However, since many in situ measurements x_i are very close to or equal to zero, the mean percentage error becomes unstable. As a trade-off, we consider performance measures E(x, x̂) that are scale-insensitive; i.e., for every λ ∈ ℝ∖{0} it holds that

E(λx, λx̂) = E(x, x̂).

The normalized mean absolute error (NMAE) can be written as

NMAE(x, x̂) = Σ_{i=1}^{n} |x_i − x̂_i| / Σ_{i=1}^{n} |x_i|,

so the NMAE is just the mean absolute error divided by the mean absolute value of the ground truth x. If normalization by the standard deviation of x instead of its mean were considered, this would lead to a measure similar to the coefficient of determination R2; see Appendix A. Note that in contrast to the mean absolute error, the NMAE is scale-insensitive. Similarly, we define the normalized mean square error (NMSE) as

NMSE(x, x̂) = Σ_{i=1}^{n} |x_i − x̂_i|^2 / Σ_{i=1}^{n} |x_i|^2.

Whenever we talk about the correlation between x and x̂, we mean the Pearson correlation coefficient (C), which is defined as

C(x, x̂) = cov(x, x̂) / ( σ(x) σ(x̂) ),

where cov(x, x̂) denotes the covariance between x and x̂, and σ(x) and σ(x̂) are the standard deviations of x and x̂, respectively. It should be noted that this is not a performance measure in the sense that x = x̂ if and only if C(x, x̂) = 1. Nevertheless, it quantifies the linear relationship between x and x̂. Furthermore, it is frequently used in the literature, which is the reason why we consider it in our work, too.

We considered two further scale-insensitive performance measures, the coefficient of determination (R2) and the index of agreement (IOA), which are defined in Appendix A.
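The scale-insensitive measures defined above can be computed directly, as in the following minimal sketch, where `x` holds the observations and `x_hat` the model predictions (values are illustrative).

```python
import numpy as np

def nmae(x, x_hat):
    """Normalized mean absolute error: MAE divided by the mean absolute ground truth."""
    return np.sum(np.abs(x - x_hat)) / np.sum(np.abs(x))

def nmse(x, x_hat):
    """Normalized mean square error: MSE divided by the mean squared ground truth."""
    return np.sum((x - x_hat) ** 2) / np.sum(x ** 2)

def pearson_correlation(x, x_hat):
    """Pearson correlation coefficient between observations and predictions."""
    return np.corrcoef(x, x_hat)[0, 1]

x = np.array([10.0, 20.0, 30.0])        # illustrative in situ values
x_hat = np.array([12.0, 18.0, 33.0])    # illustrative predictions
print(nmae(x, x_hat), nmse(x, x_hat), pearson_correlation(x, x_hat))
```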

4 Machine learning models considered

As mentioned in the Introduction, numerous machine learning models have been considered for predicting surface NO2 in the literature. Examining the benefit of time-contiguous input features for all the different models is beyond the scope of this research. This is because fair comparisons require individual hyperparameter tuning for the models, with different time contiguities of the input features. Therefore, we restrict our attention to one approach that, on the one hand, has performed well in the literature and, on the other hand, does not have many hyperparameters to tune. If there were many hyperparameters to be tuned and the models' performance were very sensitive to the choice of these hyperparameters, there would be a risk that better performance was achieved only due to better hyperparameter tuning. In this study, we use a random forest regressor, which we describe in Sect. 4.2, and present the selected hyperparameters. As a reference, we consider a simple linear regression approach, which we recap first in the next section. At the outset of this study, we also experimented with neural networks (NNs) to estimate surface NO2. While we observed similar results to those obtained with random forests, the training time for NNs was considerably longer. Therefore, and due to the large number of hyperparameters and architectural design choices for NNs, conducting as many experiments with NNs as we did with random forests would have been outside the scope of our study. This is why we chose to focus on random forests, but we expect similar performance gains for neural networks as well.

4.1 Linear regression

Although it has already been shown, e.g., by Ghahremanloo et al. (2021), that linear regression models are not the best for predicting surface NO2, we consider an ordinary least squares regressor as a reference in our study, mainly because it has no tunable hyperparameters, such as regularization parameters, or architecture parameters like those in neural networks (e.g., number of layers, width of layers, activation functions, skip connections). Thus, it provides a clear view of the question of whether time-contiguous inputs are beneficial for this linear regression model. During this study, we used the ordinary least squares regression model provided by the Python scikit-learn package (version 1.2.2, Pedregosa et al.2011). In our case of predicting surface NO2 from time-contiguous inputs, the linear regression model is a parameterized function

φ_θ: ℝ^{pk} → ℝ,  y ↦ A y + b,

where y = ( f(z,t), …, f(z, t−k+1) )^T is a (time-contiguous) feature vector as defined in Eq. (3), A is a 1 × pk matrix, and b ∈ ℝ is a bias term. Let (y_n, s_n), n = 1, …, N, be training data, where y_n is a feature vector at location z_n and time t_n and s_n the corresponding in situ measurement of surface NO2 at time t_n. Training φ_θ then means searching for a parameter θ = (A, b) that solves the following minimization problem:

min_θ Σ_n | φ_θ(y_n) − s_n |^2.

We choose to minimize the squared error since the computation time is much shorter compared to that of other losses such as the absolute error.
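A minimal example of this reference model, using the same scikit-learn class as in this study but with placeholder data arrays and dimensions:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X_train = rng.random((1000, 68))   # placeholder time-contiguous inputs (p*k columns)
y_train = rng.random(1000)         # placeholder surface NO2 at the training stations
X_test = rng.random((100, 68))

model = LinearRegression()         # ordinary least squares, no tunable hyperparameters
model.fit(X_train, y_train)        # solves the least-squares problem above
y_pred = model.predict(X_test)     # predicted surface NO2 at the test stations
```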

4.2 Random forests

There are two main reasons why random forests, a machine learning model originally proposed by Breiman (2001), are considered within this research. First, they have already proven to be powerful for predicting surface NO2 in various studies; see, for example, Di et al. (2020), Ghahremanloo et al. (2021), Li et al. (2022), and Huang et al. (2023) on OMI and TROPOMI data and Yang et al. (2023b) on GEMS data. Second, the studies of Probst et al. (2018, 2019) suggest that random forests are less tunable compared to other machine learning approaches. “Tunable” is defined as the extent to which the performance of a random forest with typical default hyperparameters can be enhanced by adjusting (tuning) those hyperparameters. As discussed before, this reduces the risk of drawing incorrect conclusions about the benefit of using time-contiguous inputs.

In fact, according to Probst et al. (2018), there are mainly four hyperparameters that empirically determine the performance of a random forest:

  • The first hyperparameter is the number of randomly drawn features considered at every split of a tree. In the Python scikit-learn software package (version 1.2.2, Pedregosa et al.2011) that we use for this study, it is called max_features. However, in several other software packages, it is denoted as mtry.

  • The second hyperparameter is the number of trees that make up the random forest. In scikit-learn it is called n_estimators. Strictly speaking, it is not a hyperparameter to be tuned, since using more trees is in general advantageous; see, e.g., Genuer et al. (2008) or Scornet (2017).

  • The third hyperparameter is the maximal number of (randomly drawn) data samples from the training set that are used for the construction of an individual tree, denoted as max_samples in scikit-learn.

  • The fourth hyperparameter is the minimal number of observations that must land in a leaf node during the training process. In scikit-learn it is called min_samples_leaf.

In their experiments, Probst et al. (2018) observed that max_features had the biggest influence on the performance, whereas the influence of max_samples and min_samples_leaf was smaller. This is why, during hyperparameter tuning, we mainly focus on max_features but also consider different values for max_samples. Regarding max_samples, we consider values between 50 % and 100 % of the size of the training dataset. On the other hand, for max_features, values between 1 and (pk)/3 are considered, where pk is the number of inputs for the model, i.e., the dimension of the time-contiguous feature vector in Eq. (3). The value (pk)/3 is the default value of scikit-learn. Genuer et al. (2008) suggested √(pk) for problems in which the number of data points is much larger than the number of input features pk, which is clearly the case in our study (hundreds of thousands of data points versus fewer than 90 input features). As pk ≥ 17, the value √(pk) is always within the considered interval during optimization. In fact, √(pk) turns out to be quite close to the optimal choice in our hyperparameter study. Regarding min_samples_leaf, we inspect two typical default values, namely 1 and 5. Following the rule "the more, the better" for the number of trees (n_estimators) in the forest, we use 8000 trees while tuning the other hyperparameters. Hyperparameter selection is made according to the spatially cross-validated (10 splits) NMSE, leading to max_features = 2, 3, 3, 3, 4 for time contiguity k = 1, 2, 3, 4, 5, min_samples_leaf = 5, and max_samples equal to 100 % of the size of the training data. All remaining hyperparameters are always set to their default values within scikit-learn.

With 8000 trees, we chose a very high value for the number of trees, which may require an explanation. The good news is given first: comparable results can be obtained with far fewer trees in the forest. However, for hyperparameter tuning and to gain a clearer insight into the benefit of time-contiguous features, it is reasonable to choose a large number of trees, which we illustrate in the following: the random forest algorithm in scikit-learn is not deterministic, meaning that if the model is trained on the same training data multiple times, the trained forests will differ from each other, also causing the performance on the respective test dataset to vary. However, we observe that with a higher number of trees in the forest, the variance in the performance decreases for all considered performance measures. In Fig. C1 in Appendix C, we illustrate this effect using a single split into training and test stations. Two random forests, one with 30 trees and the other with 8000 trees, are each trained and tested 20 times on the same data, similar to Experiment 2, but with 20 repetitions of the same split instead of 60 different splits. We observe that with 30 trees the scores on the test data, such as Pearson correlation, NMSE, or NMAE, exhibit some variance. In contrast, there is barely any variance in the case of 8000 trees. This has the advantage that for each split into training and test stations, the random forest only needs to be trained once to get an interpretable result. Thereby, it also reduces the risk of choosing non-optimal hyperparameters. Therefore, during all experiments, we set the number of trees to a very large number (n_estimators = 8000) to stabilize the non-deterministic behavior of training a random forest. Note that stability can probably be achieved with far fewer than 8000 trees. However, in order to reduce the bias from the observation above for a single split and single choice of hyperparameters, we choose a very large number that is still manageable regarding storage and computation time.
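For reference, the configuration described above corresponds roughly to the following scikit-learn setup. This is a sketch with placeholder data; `max_features=3` is the value selected for time contiguity k = 4, and the other settings follow the tuning result described in the text.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X_train = rng.random((1000, 68))   # placeholder time-contiguous inputs for k = 4
y_train = rng.random(1000)         # placeholder surface NO2 values

model = RandomForestRegressor(
    n_estimators=8000,     # very large forest to stabilize the random training
    max_features=3,        # tuned value for k = 4 (2, 3, 3, 3, 4 for k = 1, ..., 5)
    min_samples_leaf=5,    # chosen from the two default candidates {1, 5}
    max_samples=None,      # None corresponds to 100 % of the training data per tree
    n_jobs=-1,
    random_state=0,
)
model.fit(X_train, y_train)
```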

5 Results

Before presenting the results and starting the discussion, it is important to recall that for a given spatial split into training and test in situ stations, training or testing a machine learning model on the dataset D_{N,k} means that only the data points corresponding to the training or test station locations are used, respectively. Furthermore, for fixed N, the in situ measurements s(z,t) of surface NO2 (ground truth) that are to be predicted in D_{N,k} are exactly the same for all the different k. Further, recall that D_{N,k} can be thought of as the set of data points for which measurements at all N − 1 previous hours are also guaranteed to be available, but only k − 1 of them are added to the time-contiguous feature vector in Eq. (3).

In the following discussion of the experiments, introduced in Sect. 3.2, we focus exclusively on the results when D_{4,k} is used for constructing test datasets, i.e., for N = 4 only. This is because we observe a similar benefit from a larger time contiguity k when evaluating the machine learning models' performance on D_{N,k} for N ∈ {2, 3, 5}. As a further example, we provide detailed results for N = 2 in Figs. C2 and C3 in Appendix C.

5.1 Experiment 1: time-contiguous inputs provide additional information

In Experiment 1, we train linear regression models and random forests on D_{4,k} for different time contiguities k ∈ {1, …, 4} of the input features. The test performances of these models are evaluated via six-times spatial 10-fold cross-validation and are illustrated in Figs. 3b and 4b, respectively. Specifically, we show the average Pearson correlation, NMSE, and NMAE over all 60 splits into training and test stations. We observe that, on average, both linear regression and random forests benefit from a larger time contiguity k regarding all considered performance measures. For example, the average correlation strictly increases from 0.702 for k = 1 to 0.737 for k = 4 in the case of linear regression, and for random forests, it increases from 0.802 to 0.817. Further, the average NMSE decreases from 0.196 to 0.171 for linear regression and from 0.139 to 0.129 for random forests. Therefore, both models benefit from larger time contiguity, but linear regression shows greater improvement, which is expected as it cannot model non-linear effects. Furthermore, we observe that the larger k, the smaller the improvement compared to the case k − 1, which is to be expected since input features at time t − k + 1 presumably have a decreasing impact on surface NO2 at time t for larger k.

Figure 3Linear regression models have been trained and tested on datasets D4,k for 60 different splits into training and test stations, with different time contiguity k of the input features. In panel (a), performances on test sets are shown for five exemplary station splits with respect to three performance measures. Panel (b) shows the average performance over all 60 splits, with error bars illustrating the standard deviation. Panel (c) shows the average performance gain relative to the case k= 1; see Eq. (5) for the definition of performance gain. Across each row, the same performance measure is considered. The exact values in panel (b) can be found in Table B2, columns D4,1 to D4,4.


Figure 4Same as Fig. 3 but for random forests trained and tested on datasets D4,k for 60 different splits into training and test stations, with different time contiguity k of the input features. In panel (a), performances on test sets are shown for five exemplary station splits with respect to three performance measures. Panel (b) shows the average performance over all 60 splits, with error bars illustrating the standard deviation. Panel (c) shows the average performance gain relative to the case k= 1; see Eq. (5) for the definition of performance gain. Across each row, the same performance measure is considered. The exact values in panel (b) can be found in Table B3.

Although the visualization of average performances suggests an overall trend, it does not clearly indicate whether larger time contiguities (k> 1) consistently improve performance across all 60 station splits during cross-validation compared to k= 1. However, we found that this improvement holds true for all 60 station splits. The performance curves for individual splits are more or less parallel to the average curve. In Figs. 3a and 4a, we illustrate this for exemplary station splits, where only five splits are shown for better visibility. To quantify the gain in performance for individual splits between using time contiguity k= 1 and larger time contiguities k>1, we proceed as follows: for a given test dataset, let Ek be the test performance (e.g., correlation) achieved by the model using time contiguity k for its inputs. We define the performance gain of this model over the case with no time contiguity k= 1 in Experiment 1 as

(5)\quad \frac{E_1 - E_k}{E_1 - E_\mathrm{opt}},

where Eopt is the optimal value of the respective performance measure; e.g., Eopt= 1 for the Pearson correlation or Eopt= 0 for NMSE and NMAE. The average performance gains for the cases k ∈ {2, 3, 4} compared to k= 1 are depicted in Figs. 3c and 4c for linear regression and random forests, respectively. In both cases and for all performance measures, the highest average performance gain is achieved with k= 4. Specifically, linear regression models achieve average performance gains of 15.2 % in correlation, 13.0 % in NMSE, and 7.7 % in NMAE, whereas random forests achieve gains of around 7.8 %, 7.0 %, and 4.7 %, respectively. It is noteworthy that, for linear regression, the performance gain across all 60 splits is at least about 12.0 % in correlation, 10.0 % in NMSE, and 6.1 % in NMAE. On the other hand, random forests achieve performance gains of at least 4.6 %, 4.0 %, and 3.1 %, respectively. Therefore, utilizing a larger time contiguity consistently provided beneficial additional information for both linear regression and random forest models.
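Equation (5) can be read as the fraction of the remaining gap to the optimal score that is closed by the time-contiguous model. A small helper (illustrative only) makes this explicit; note that applying it to the average correlations quoted above (0.702 to 0.737, with Eopt = 1) gives about 11.7 %, whereas the 15.2 % reported in the text is the mean of the per-split gains, which need not coincide with the gain computed from mean scores.

```python
def performance_gain(E_1, E_k, E_opt):
    """Eq. (5): share of the gap between the k = 1 score and the optimal score
    that is closed by the model with time contiguity k."""
    return (E_1 - E_k) / (E_1 - E_opt)

# Example with the average Pearson correlations for linear regression quoted above.
print(performance_gain(0.702, 0.737, E_opt=1.0))  # ~0.117
```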

Additionally, for k= 1 and the best time contiguity k= 4, we examine for each split the orthogonal regression curve between the models' predictions and ground truth measurements of surface NO2 on the corresponding test dataset. For a fixed split, this is illustrated as a two-dimensional histogram in the first row of Fig. 5 for linear regression and in Fig. 6 for random forests. Although the histograms are restricted to surface NO2 and predictions between 0 and 40 µg m−3 for better visibility, all data points are taken into account to determine the orthogonal regression curve. It becomes evident that both the slope and the bias of the orthogonal regression curve improve for k= 4 (panel b) compared to k= 1 (panel a), where improvement means that the slope becomes closer to 1 and the bias closer to 0. In the second row of these figures, we plot the mean orthogonal regression curve, which represents the mean slope and mean bias of all 60 orthogonal regression curves. An upper bound for all these curves is represented by the line with the maximal slope and bias across all splits (note that maximal slope and bias might not occur for the same split). Similarly, a lower bound is obtained, and both bounds are shown within the same plots. Both the mean orthogonal regression curve and the upper and lower bounds improved for k= 4 for both linear regression and random forests. However, the improvement is larger for the linear regression models, which is consistent with the previous discussion on performance measures, such as NMSE.
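Orthogonal regression treats predictions and ground truth symmetrically by minimizing perpendicular distances to the fitted line. A compact sketch (not the study's code) obtains slope and bias from the leading eigenvector of the 2 × 2 covariance matrix of the paired values.

```python
import numpy as np

def orthogonal_regression(y_true, y_pred):
    """Slope and bias of the orthogonal (total least squares) regression line of
    predictions versus ground truth, via the first principal component."""
    data = np.column_stack([y_true, y_pred])
    mean = data.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov((data - mean).T))
    v = eigvecs[:, np.argmax(eigvals)]        # direction of largest variance
    slope = v[1] / v[0]
    bias = mean[1] - slope * mean[0]
    return slope, bias
```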

Figure 5Linear regression models trained on D4,k with time contiguities (a) k= 1 and (b) k= 4. First row: for a fixed split (number 42) into training and test stations, the models' predictions on the corresponding test set D4,k are compared with in situ measurements of surface NO2 (ground truth) in a two-dimensional histogram. Second row: for all 60 station splits, orthogonal regression is considered between predicted and ground truth surface NO2. Mean orthogonal regression refers to the line of average slope and bias over all 60 regression lines (blue line). The regression line for the example in the first row is also shown (red line).


Figure 6Same as Fig. 5 but for random forests trained on D4,k with time contiguities (a) k= 1 and (b) k= 4. First row: for a fixed split (number 42) into training and test stations, the models' predictions on the corresponding test set D4,k are compared with in situ measurements of surface NO2 (ground truth) in a two-dimensional histogram. Second row: for all 60 station splits, orthogonal regression is considered between predicted and ground truth surface NO2. Mean orthogonal regression refers to the line of average slope and bias over all 60 regression lines (blue line). The regression line for the example in the first row is also shown (red line).

We want to stress another observation: looking at the upper and lower bounds of the orthogonal regression curves, we see that all slopes are smaller than 1, whereas all biases are positive. Further, there is a noticeable gap towards the identity line. Regarding the latter, one possible explanation could be that spatially splitting the dataset into training and test sets causes a large difference in the statistical properties of the training and test sets. This may simply be because there are overall just 637 different in situ stations available, so the law of large numbers may not yet apply well when sampling 10 % of test stations. However, this does not explain why the slopes and biases are not more symmetrically distributed around slope 1 and bias 0. Studying the impact of the number of available in situ stations and their locations on the slopes and biases of these orthogonal regression curves will be an interesting task for future work.

5.2 Experiment 2: time-contiguous inputs are beneficial in spite of a smaller dataset

In Experiment 1, the models were trained and tested on DN,k for fixed N but with a different time contiguity k ∈ {1, …, N} of their input features. This means that for a fixed station split, the number of training data points was the same for all the different k, since the size of DN,k only depends on N (see Table 1). However, for M ∈ {k, …, N−1}, there would be significantly more data points available in DM,k than in DN,k, which could be used during training. To draw a fair conclusion about whether a larger time contiguity (k> 1) in the models' input is more beneficial than time contiguity k= 1, we need to consider that for k= 1, one can also train on these larger datasets. It should be noted that we have also considered training on smaller datasets, i.e., on DM,k with M>N. However, non-competitive results were obtained for random forests in these cases. For linear regression, performances were also worse, with some exceptions regarding the NMAE; see Fig. C2 in Appendix C. This is why we restrict the following discussion to training on larger datasets (M ≤ N) only.

Focusing again on the test case N= 4, we compare the performance on test sets in D4,k of models trained on the larger datasets DM,k for all M ∈ {k, …, 4} and all k ∈ {1, …, 4}. Note that for M= 4, this is just the setting of Experiment 1. Altogether, 10 different linear regression models and 10 random forest models are used to make predictions of the same ground truths in the split-dependent test sets DN,k.

Average performance measures from spatial cross-validation are shown in Fig. 7a for linear regression and in Fig. 8a for random forests. We observe that when training with time contiguity k= 1, i.e., on DM,1, the best results are obtained for M= 4. In other words, there is no improvement on the test set D4,1 if training is done on the larger datasets (M ∈ {1, 2, 3}). There is one exception for random forests with the Pearson correlation, where training on D3,1 yields slightly better results on average compared to training on D4,1. However, this difference is quite small, as shown in Fig. 8a. Moreover, for all performance measures, the best performance across all 10 different training cases is achieved by the models trained on D4,4 with time contiguity k= 4. Note that this is one of the training settings already considered in Experiment 1.

Figure 7Linear regression models trained on DM,k for M ≤ 4 with different time contiguities k. Performance on D4,k has been evaluated through six-times 10-fold spatial cross-validation. Panel (a) shows the average performance over all 60 station splits for three performance measures. Panel (b) shows the average performance gain relative to the best case of k= 1; see Eq. (6) for the definition of performance gain. Error bars illustrate the standard deviation. Panel (c) shows the minimal performance gain. Across each row the same performance measure is considered. The exact values in panels (a) and (b) can be found in Table B2.


Figure 8Same as Fig. 7 but for random forests trained on DM,k for M ≤ 4 with different time contiguities k. Performance on D4,k has been evaluated through six-times 10-fold spatial cross-validation. Panel (a) shows the average performance over all 60 station splits for three performance measures. Panel (b) shows the average performance gain relative to the best case of k= 1; see Eq. (6) for the definition of performance gain. Error bars illustrate the standard deviation. Panel (c) shows the minimal performance gain. Across each row the same performance measure is considered. The exact values in panels (a) and (b) can be found in Table B3.

For individual splits, we consider the performance gains that models with time contiguity k> 1 achieve compared to models with no time contiguity (k= 1). Since, in contrast to Experiment 1, we are now dealing with four different training cases for k= 1, we slightly adapt the definition of performance gains from Eq. (5): for a given split into training and test stations and fixed N, let EM,k be the test performance (e.g., correlation) on DN,k achieved by a model trained on DM,k. We define the performance gain achieved by this model in Experiment 2 as

(6)\quad \min\left\{ \frac{E_{P,1} - E_{M,k}}{E_{P,1} - E_\mathrm{opt}} : P \in \{1, \ldots, 5\} \right\}.

In other words, for each split, the performance gain is always computed with respect to the best model trained without time contiguity (k= 1).
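As a small illustration (hypothetical helper, not the study's code), Eq. (6) amounts to evaluating the gain of Eq. (5) against every k = 1 baseline of the given split and keeping the smallest value, i.e., the gain over the best such baseline.

```python
def performance_gain_vs_best_baseline(E_baselines_k1, E_Mk, E_opt):
    """Eq. (6): gain of a model trained on D_{M,k} relative to the best model
    trained without time contiguity (k = 1), for one station split."""
    return min((E_P1 - E_Mk) / (E_P1 - E_opt) for E_P1 in E_baselines_k1)
```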

Average performance gains are depicted in Figs. 7b and 8b, which differ only slightly from those in Experiment 1, as models trained on D4,1 are better, on average, than models trained on DM,1. Linear regression models trained with k= 4 still achieve performance gains of 15.0 % in correlation, 12.8 % in NMSE, and 6.6 % in NMAE, whereas random forests achieve average gains of around 7.3 %, 6.6 %, and 4.7 %, respectively. Again, we observe that improvements over k= 1 hold not only on average, but also for each individual split: Figs. 7c and 8c show the minimal performance gains over all 60 splits. They show that linear regression models for k= 4 always achieve an improvement of at least 11.7 % in correlation, 9.1 % in NMSE, and 4.4 % in NMAE. Random forests achieve gains of at least 2.5 %, 3.0 %, and 3.1 %, respectively. Hence, models with a larger time contiguity k> 1 provide reliable and statistically significant improvements (with respect to the performance measures) compared to models with no time contiguity (k= 1). Similar observations are made for the coefficient of determination and the index of agreement, two further performance measures. Definitions can be found in Appendix A and achieved performances in Tables B2 and B3 in Appendix B.

So far, we have discussed the test case N= 4 in detail. In the remainder of this section, we briefly summarize our similar observations for general N ∈ {2, 3, 4, 5}: for all N, we observed that the best test performances on DN,k are achieved when training on DN,N, i.e., with time contiguity k=N. If N= 5, we observe that there is barely any difference between training on D5,5 and training on D4,4, which implies that it is not required to use a larger time contiguity than k= 4. Also, for the general test case N, models trained with time contiguity k> 1 achieve reliable performance gains over models trained with k= 1. Results for the test case D2,k are illustrated in Figs. C2 and C3 in Appendix C.

Altogether, our findings demonstrate that it is indeed reliably beneficial to use time-contiguous input features for predicting surface NO2, in spite of a smaller available training dataset, which answers our main research question. As a rule of thumb, consider the case where surface NO2 is to be predicted at a given location and time for which input features are also available at j ≥ 1 previous hours. Then use j′ = min{3, j} previous hours, in addition to the features at the current time, as input for a random forest that has been trained with time contiguity k = j′ + 1 on a dataset Dk,k. If features are not available at previous hours, use the random forest that has been trained without time contiguity. We have demonstrated within this experiment that time-contiguous models provide valuable support whenever they are applicable. An interesting future task would be to inspect whether a similar rule can be observed for other machine learning approaches.
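The rule of thumb can be written down in a few lines; the dictionary of pre-trained forests below is a hypothetical placeholder for models trained on D1,1 to D4,4.

```python
def select_model(models_by_k, j_available):
    """Rule of thumb: with input features available at j >= 1 previous hours, use
    the forest trained with time contiguity k = min(j, 3) + 1 on D_{k,k};
    with no previous hours available, fall back to the k = 1 model."""
    k = min(j_available, 3) + 1 if j_available >= 1 else 1
    return k, models_by_k[k]
```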

Within this section, we analyzed the difference between time-contiguous models in terms of prediction accuracy. However, we did not systematically assess other potential differences that may arise when switching between models trained with different time-contiguous features. For practical applications, when combining these models to create surface NO2 concentration maps, it remains an interesting avenue for future work to investigate whether the ensemble of such models yields consistent combined spatial patterns in predicted surface NO2.

5.3 Experiment 3: influence of tropospheric NO2 VCDs, latitude, and surface height

In Experiment 3, we compare the outcomes of Experiment 2 in four different settings regarding the input of the models, as described in Sect. 3.2:

  •  Setting 1. All features selected in Sect. 3.1 are included as input features, which was the setting in Experiments 1 and 2.

  •  Setting 2. VCDs are excluded as an input feature.

  •  Setting 3. Latitude and surface height are excluded.

  •  Setting 4. VCDs, latitude, and surface height are excluded.

In this section, we focus exclusively on random forests and discuss the test results on D4,k for the four different settings above.

Setting 1 is discussed in the previous section, where the results are illustrated in Fig. 8. Equally detailed illustrations for the remaining three settings are provided in Appendix D. A direct comparison between the four settings is made in Fig. 9: panel (a) shows the average Pearson correlation, NMSE, and NMAE achieved by random forests within these four settings, while panel (b) displays the corresponding average performance gains. For clarity, we only include the results for the models trained on D4,k for different time contiguities k ∈ {1, …, 4}, excluding the models trained on larger datasets DM,k (similar to Experiment 1).
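In practice, the four settings only differ in which columns are passed to the models. The following sketch uses hypothetical column names (the actual feature list is given in Sect. 3.1 and Table B1) to indicate how the settings could be defined.

```python
# Hypothetical column names; the real feature list is specified in Sect. 3.1 / Table B1.
ALL_FEATURES = ["no2_vcd", "latitude", "surface_height",
                "temperature_2m", "wind_u_10m", "wind_v_10m", "boundary_layer_height"]

SETTINGS = {
    1: ALL_FEATURES,                                                           # all features
    2: [f for f in ALL_FEATURES if f != "no2_vcd"],                            # no VCDs
    3: [f for f in ALL_FEATURES if f not in ("latitude", "surface_height")],   # no coordinates
    4: [f for f in ALL_FEATURES
        if f not in ("no2_vcd", "latitude", "surface_height")],                # neither
}
```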

Figure 9In the four settings of Experiment 3 (named in the legends of the plots), random forests are trained and tested on D4,k for different time contiguities k. Performance is evaluated through six-times 10-fold spatial cross-validation. Panel (a) shows the average performance over all 60 station splits achieved within these four settings. Three performance measures are considered, one for each row. Error bars illustrate the standard deviation. Panel (b) shows the average performance gain relative to the best case of k= 1; see Eq. (6) for the definition of performance gain.

In Setting 3, where latitude and surface height are excluded, the models achieve results similar to those in the original Setting 1. Results are even slightly better without using these coordinates if k> 1. Moreover, the benefit of using time-contiguous input features is larger in Setting 3: average performance gains, calculated with Eq. (6), achieved when training on D4,k are 9.3 % in Pearson correlation, 8.3 % in NMSE, and 5.7 % in NMAE. The minimum gains across all 60 station splits are 5.4 %, 3.7 %, and 3.8 % in correlation, NMSE, and NMAE, respectively (see Fig. D1). This implies that, similar to Setting 1, including time-contiguous features also provides a reliable improvement in Setting 3. The observation that coordinates are not required as inputs to make good predictions is promising, since it presumably increases the models' chances of also performing well outside of South Korea. Nevertheless, this hypothesis remains to be investigated in further research.

When the tropospheric NO2 VCDs are excluded (Setting 2), all performance measures decline, which is expected because, among all input features, the VCDs have the highest correlation with the surface NO2 measurements. Despite this, the performances remain acceptable. For instance, with time contiguity k= 1, the average Pearson correlation in Setting 2 is 0.78, whereas it is about 0.8 in Settings 1 and 3, where VCDs are included. Interestingly, without VCDs in Setting 2, the average performance gains achieved with larger k are significantly lower: in Setting 2, the average performance gain is around 2 %, whereas in Settings 1 and 3, it is 3.5 and 4.5 times larger, respectively. Consequently, for time contiguity k= 4, the difference in performance is larger: models in Setting 2 achieve an average correlation of 0.786, while those in Settings 1 and 3 reach almost 0.82. When tropospheric NO2 VCDs, latitude, and surface height are all excluded (Setting 4), not only do performances weaken further, but the performance gains also drop below 1 %. In Setting 4, the average correlation is below 0.765 for all k. Similar trends are observed for NMSE and NMAE. This indicates that spatial coordinates play a more critical role when VCDs are excluded, which presumably leads to models that are less capable of generalizing to locations outside of South Korea. Inspecting the connection between including VCDs and the models' ability to generalize to locations outside of South Korea remains an interesting task for the future.

Furthermore, when tropospheric NO2 VCDs are excluded, in both Setting 2 and Setting 4, the use of time-contiguous inputs no longer provides a reliable improvement. Across the 60 station splits, the performance gain is not always positive, which can be seen in Fig. 9b. Due to this observation that improvements by time-contiguous inputs are only reliable when including VCDs, the following question arises: how is performance affected if VCDs are treated as the only time-contiguous input feature? The experiments covering this case are illustrated in Fig. D4 in Appendix D. We observe that the average performances and average performance gains are higher if the other features are also considered time contiguous. Therefore, one future task could be to find the optimal choice of time contiguity k for each input feature individually.

At the end of this section, we show in Fig. 10 an example of how predictions of surface NO2 appear on a map for the four investigated settings. We consider the region between 32 and 39° N and between 124 and 132° E. GEMS tropospheric NO2 VCDs on 7 April 2021 from 01:45 to 02:15 UTC are shown in panel (a). We chose this day and time because of the small cloud cover in the area and thus only few missing satellite observations. Predictions of surface NO2 from 01:00 to 02:00 UTC made by random forests are shown in panel (b) for Settings 1 and 3, whereas panel (c) covers the settings with tropospheric NO2 VCDs excluded. All models have been trained with time contiguity k= 4 on D4,4.

Figure 10Predictions of surface NO2 by random forests on 7 April 2021 from 01:00 to 02:00 UTC, for Settings 1–4 of Experiment 3. Panel (a) shows tropospheric NO2 VCDs from 01:45 to 02:15 UTC. Panel (b) shows predicted surface NO2 in Settings 1 and 3, when VCDs are included as input. Panel (c) shows predictions in Settings 2 and 4, when VCDs are excluded. In the second row of panels (b) and (c), latitude and surface height are excluded. The black mask indicates missing data, e.g., due to clouds. All models have been trained with time contiguity k= 4 on D4,4 for the same choice of training stations.

We observe that there is a high similarity between predictions made in Settings 1 and 3, when tropospheric NO2 VCDs are included as input features. This is in agreement with our findings from Fig. 9 that in both settings similar results are achieved regarding all considered performance measures. This observation is promising, as excluding latitude and surface height reduces the spatial bias of the model, which is to be tested in future studies. Therefore, presumably, the model's chance of making suitable predictions in different parts of the world increases. In Settings 1 and 3, the impact of the tropospheric NO2 VCDs on the prediction of surface NO2 is directly visible, since the hotspots of the VCDs and predictions of surface NO2 are depicted at the same locations. On the other hand, when VCDs are excluded in Settings 2 and 4, these hotspots are less recognizable due to a smaller contrast to their neighborhood; see Fig. 10c. In Settings 2 and 4, the predicted surface NO2 has a coarser resolution, which is to be expected considering that the resolution of meteorological inputs is 8 times coarser compared to the VCDs. In all four settings, the contrast between the hotspots and the background of predicted surface NO2 is less pronounced compared to the contrast observed in the tropospheric NO2 VCDs shown in panel (a). This effect is even more evident in another example from 27 February 2022, shown in Fig. 11. Notably, the predicted concentrations of surface NO2 over water are only slightly smaller compared to those over land within all settings, even in regions far from the coast, such as the southeastern parts of the maps. However, emissions over water are not expected, aside from maritime traffic. Furthermore, at some distance from the coast, no contribution from land-based emissions is expected due to the short atmospheric lifetime of NO2. Consequently, both the tropospheric NO2 VCDs and the surface NO2 concentrations should be low in these areas. Given the predicted surface concentrations of approximately 7 µg m−3, it appears that the models have likely overestimated surface NO2 concentrations in these areas over water. This aligns with the observation from Fig. 6, which shows that the models tend to overestimate low surface NO2 values. A possible explanation for this could be that the models were trained only on data from stations located on land or islands.

Figure 11Same as Fig. 10 but on 27 February 2022. Panel (a) shows the VCDs from 06:45 to 07:15 UTC. Panels (b) and (c) show predicted surface NO2 from 06:00 to 07:00 UTC, for the four settings of Experiment 3.

5.4 Seasonal and diurnal error distribution

In the previous sections, the performance of machine learning models is evaluated using whole-year data, spanning January 2021 to November 2022. In this section, we inspect how prediction quality varies across different seasons and throughout the day. Some variation is expected, as the accuracy of GEMS observations also fluctuates. For example, accuracy tends to be lower in the morning due to the shallow boundary layer (Yang et al., 2023a). For the remainder of this section, we focus on the best-performing models identified in our earlier analysis. Specifically, we reconsider the random forest models from Setting 3 in Sect. 5.3, which do not incorporate spatial coordinates as input features. These models were trained on all respective training datasets DN,k, but for this section, their performance is spatially cross-validated on the test datasets for different seasons and times of the day individually. For simplicity, we restrict our attention to models that were trained on the dataset D4,k. Furthermore, we inspect whether benefits from time-contiguous inputs depend on the season or time of the day.

First, we compare the test performance across different seasons. Each season in South Korea is typically defined as a 3-month period: spring (March–May), summer (June–August), autumn (September–November), and winter (December–February). Table 3 shows the percentage of data points in D4,k belonging to each season. Notably, summer has the fewest valid data points due to the applied filter for the qa value during data preprocessing. In addition, the Pearson correlation between surface NO2, measured at the in situ stations, and VCDs is the lowest in summer (see Table 3). These factors likely contribute to the significantly lower performance of the random forest models in summer compared to other seasons (see Fig. 12). In contrast, the model performance is the highest in winter across all performance measures, i.e., for Pearson correlation, NMSE, and NMAE. Moreover, we observe that within each season, incorporating time-contiguous inputs improves prediction quality. The performance gains, calculated using Eq. (5), are also shown in Fig. 12. Notably, the largest gains from time-contiguous inputs occur in winter, exceeding 12 % in Pearson correlation for time contiguity k= 4. The smallest gains are observed in summer, with an improvement of only 5 % in Pearson correlation.
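The seasonal evaluation simply restricts the test predictions of the spatially cross-validated models to the months of each season before computing the scores, e.g., as in the following sketch (placeholder arrays; only the Pearson correlation is shown).

```python
import numpy as np
from scipy.stats import pearsonr

SEASON_MONTHS = {"winter": (12, 1, 2), "spring": (3, 4, 5),
                 "summer": (6, 7, 8), "autumn": (9, 10, 11)}

def seasonal_correlations(y_true, y_pred, months):
    """Pearson correlation between predictions and ground truth per season;
    months is an integer array (1-12) aligned with the test data points."""
    y_true, y_pred, months = map(np.asarray, (y_true, y_pred, months))
    return {season: pearsonr(y_true[np.isin(months, m)], y_pred[np.isin(months, m)])[0]
            for season, m in SEASON_MONTHS.items()}
```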

Table 3Statistics for seasonal segments of the dataset D4,k.


Figure 12Random forests trained on D4,k for different time contiguities k, without spatial coordinates as input features. Test performance is evaluated in different seasons (winter, spring, summer, and autumn) through six-times 10-fold spatial cross-validation. Panel (a) shows the average performance over all 60 station splits achieved in different seasons and for different k. Three performance measures are considered, one for each row. Error bars illustrate the standard deviation. Panel (b) shows the average performance gain relative to the case of k= 1; see Eq. (5) for the definition of performance gain.

Finally, the performance across different times of the day is illustrated in Fig. 13. Since we focus on training and testing on D4,k, the earliest time window with available data is 10:00–11:00 Korean standard time (KST). The best performance is achieved around midday, while the performance declines in the morning and afternoon. The worst results occur between 16:00 and 17:00 KST, possibly due to the fact that surface NO2 has the weakest correlation with VCDs at that time (see Table 4). Moreover, it should be noted that for datasets DN,k with N ≤ 3, in which data points at times earlier than 10:00 KST occur, the performance is expected to decrease further compared to the later morning hours.

Figure 13Same random forests as in Fig. 12 but the test performance is cross-validated at different times of the day. The time windows are chosen in line with the in situ dataset. Korean standard time (KST) is used.

Table 4Statistics for different hourly segments of the dataset D4,k.

Furthermore, at all times, time-contiguous models consistently outperform models with no time contiguity k= 1, demonstrating a clear benefit from using time-contiguous input features.

6 Conclusions and outlook

For the first time, hourly tropospheric NO2 VCDs are available thanks to the geostationary GEMS instrument. To predict surface NO2 levels at a given time and location, we proposed also including VCDs and meteorological features from previous hours as inputs to the machine learning models.

Our main research question was whether the considered machine learning models, random forests and linear regression, benefit from hourly time-contiguous input features for the prediction of surface NO2. We observed that using time-contiguous input features led to reliable enhancements with respect to all considered performance measures, as long as tropospheric NO2 VCDs were included. For random forests, average performance gains were between 4.5 % and 7.5 % depending on the performance measure. For linear regression models, average performance gains were larger, namely between 7 % and 15 %. This is to be expected since the non-linear structure of random forests allows for the extraction of more information from non-time-contiguous inputs, generally also leading to better predictions compared to linear regression models. These improvements were reliable in the sense that positive performance gains were achieved not only on average, but also across all 60 splits into training and test in situ stations during spatial cross-validation. Moreover, we were able to demonstrate that performance gains were observed despite having much fewer data points available for training models with a larger time contiguity of their inputs. As a rule of thumb, for the case where tropospheric NO2 VCDs are used as an input feature, we suggest the following: whenever surface NO2 is to be predicted at a given location and time for which input features are available at j previous hours, feed those features, together with those at the current time, into a random forest that has been trained with time contiguity k=min{j+1,4} on a given training dataset Dk,k, specified in Sect. 2.3. If features are not available at previous hours, one cannot use a time-contiguous model to make a prediction for these data points, so one has to use the random forest that has been trained without time contiguity. Therefore, time-contiguous models should be understood as a supportive tool that should be applied whenever possible. Whether the rule of thumb above still applies to other machine learning models, such as neural networks or extreme gradient boosting, would be an interesting aspect for future studies.

Furthermore, when tropospheric NO2 VCDs were included as input in the models, we observed that latitude and surface height were not required for achieving similar performances and benefits from time-contiguous inputs. Presumably, this increases the chance that the models will also provide good predictions beyond South Korea, which will be an interesting investigation for future work. If validated, this would enhance the model's flexibility and broader applicability without the requirement of more training data, and hence longer training time, from different regions. Another task would be to determine the optimal time contiguity for every input feature individually, which would reduce redundancy among input features and hence lead to better performances.

When tropospheric NO2 VCDs were excluded as input features, performance worsened but remained within an acceptable range. Additionally, we observed that the benefit of time-contiguous features was significantly reduced, and the performance gain was no longer reliable. Specifically, across all 60 splits during spatial cross-validation, benefits were not consistently observed. When both VCDs and spatial coordinates were excluded, performance decreased further. This indicates that spatial coordinates play a more critical role when VCDs are not included, which presumably leads to models that are less capable of generalizing to locations outside of South Korea. Again, this motivates further research on the connection between including VCDs and the models' ability to generalize to locations outside of South Korea.

Last but not least, we would like to address the time coverage of the data, which spans January 2021 to November 2022. Although data from December 2022 are missing, Sect. 5.4 shows that random forests performed best on winter data. It would be interesting to investigate whether models perform even better for a specific season when trained exclusively on data from that season. We leave this for future investigation. Furthermore, the COVID-19 pandemic occurred during the considered data time window, resulting in emissions that differ from those observed in non-pandemic conditions. This bias should be considered when applying models trained on COVID-19-era data to pandemic-free settings.

Appendix A: Further performance measures

In the following, we describe two further scale-insensitive performance measures that quantify the mismatch between surface NO2 measurements x ∈ ℝⁿ and predictions x̂ made by a machine learning model.

Coefficient of determination (R2).

R^2(x,\hat{x}) = 1 - \frac{\sum_{i=1}^{n} |x_i - \hat{x}_i|}{\sum_{i=1}^{n} |x_i - \bar{x}|}, \quad \text{where } \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i

Note that R2 is similar to NMAE, but normalization is done by the mean absolute deviation of x instead of its mean. Further, within the literature, the expression R2 sometimes stands for the square of the correlation coefficient. However, in general, these definitions are not equivalent.

Index of agreement (IOA).

\mathrm{IOA}(x,\hat{x}) = 1 - \frac{\sum_{i=1}^{n} |x_i - \hat{x}_i|^2}{\sum_{i=1}^{n} \big( |\hat{x}_i - \bar{x}| + |x_i - \bar{x}| \big)^2}

Here x̄ denotes the mean of all xi.
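For completeness, a direct implementation of the two measures as defined above could look as follows (x is the vector of in situ measurements, x_hat the vector of predictions; illustrative only).

```python
import numpy as np

def r2(x, x_hat):
    """Coefficient of determination as defined above: one minus the summed absolute
    errors, normalized by the summed absolute deviations of x from its mean."""
    x, x_hat = np.asarray(x, float), np.asarray(x_hat, float)
    return 1.0 - np.abs(x - x_hat).sum() / np.abs(x - x.mean()).sum()

def ioa(x, x_hat):
    """Index of agreement as defined above."""
    x, x_hat = np.asarray(x, float), np.asarray(x_hat, float)
    num = ((x - x_hat) ** 2).sum()
    den = ((np.abs(x_hat - x.mean()) + np.abs(x - x.mean())) ** 2).sum()
    return 1.0 - num / den
```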

Appendix B: Tables

Table B1Features considered during feature selection in Sect. 3.1. For 200 splits into training and test stations, the Pearson correlation with surface NO2 was computed on the training set for each available feature. Average correlations are shown in the last column.

Table B2Linear regression models have been trained on DN,k for N ≤ 4 with different time contiguities k and input features selected in Sect. 3.1. Performance on D4,k has been evaluated through six-times 10-fold spatial cross-validation. Five different performance measures are considered, defined in Sect. 3.3 and Appendix A. The best results are marked in bold.

Table B3Random forests have been trained on DN,k for N ≤ 4 with different time contiguities k and input features selected in Sect. 3.1. Performance on D4,k has been evaluated through six-times 10-fold spatial cross-validation. Five different performance measures are considered, defined in Sect. 3.3 and Appendix A. The best results are marked in bold.

Appendix C: Additional figures for Experiment 2

Figure C1Random forests with 30 and 8000 trees (n_estimators) are considered in panels (a) and (b), respectively. Training and testing have been performed 20 times for the same split into training and test stations. Testing was on the corresponding dataset D4,k, and training was on different DM,k for M ≤ 4. Results for 20 individual repetitions are shown with respect to three performance measures.


Figure C2Linear regression models have been trained on DM,k for M ≤ 3 with different time contiguities k and input features selected in Sect. 3.1. Performance on D2,k has been evaluated through six-times 10-fold spatial cross-validation. Panel (a) shows the average performance over all 60 station splits for three performance measures. Panel (b) shows the average performance gain (Eq. 6) relative to the best case of k= 1. Error bars illustrate the standard deviation. Panel (c) shows the minimal performance gain. Across each row the same performance measure is considered.


Figure C3Same as Fig. C2 but for random forests trained on DM,k for M ≤ 3 with different time contiguities k and input features selected in Sect. 3.1. Performance on D2,k has been evaluated through six-times 10-fold spatial cross-validation. Panel (a) shows the average performance over all 60 station splits for three performance measures. Panel (b) shows the average performance gain (Eq. 6) relative to the best case of k= 1. Error bars illustrate the standard deviation. Panel (c) shows the minimal performance gain. Across each row the same performance measure is considered.

Appendix D: Additional figures for Experiment 3

Figure D1Latitude and surface height excluded from the input features (Setting 3 of Experiment 3): random forests have been trained on DM,k for M ≤ 4 with different time contiguities k. Performance on D4,k has been evaluated through six-times 10-fold spatial cross-validation. Panel (a) shows the average performance over all 60 station splits for three performance measures. Panel (b) shows the average performance gain relative to the best case of k= 1; see Eq. (6) for the definition of performance gain. Error bars illustrate the standard deviation. Panel (c) shows the minimal performance gain. Across each row the same performance measure is considered.


Figure D2Same as Fig. D1 but excluding tropospheric NO2 VCDs from input features (Setting 2 of Experiment 3).


Figure D3Same as Fig. D1 but excluding tropospheric NO2 VCDs, latitude, and surface height from input features (Setting 4 of Experiment 3).


Figure D4Random forests: the selection of input features is the same as in Setting 3 of Experiment 3; i.e., latitude and surface height are excluded. Comparison of two cases. First, only the time contiguity of tropospheric NO2 VCDs is exploited. Second, the time contiguity of all (time-dependent) input features is exploited, which is exactly the same as Setting 3 of Experiment 3. Models have been trained and tested on D4,k for different time contiguities k. Panel (a) shows the average performance from six-times 10-fold spatial cross-validation and panel (b) shows the average performance gain (Eq. 6).

Code and data availability

The GEMS datasets and codes are available upon request from the corresponding author (janek-goedeke@uni-bremen.de). The ERA5 data are available from the Copernicus Climate Change Service at https://doi.org/10.24381/cds.adbb2d47 (Hersbach et al., 2023). AirKorea surface network data are available at https://airkorea.or.kr/web/detailViewDown?pMENU_NO=125 (National Institute of Environmental Research (NIER), 2025).

Author contributions

JG is the main author of this study and planned and conducted the experiments. AR and KL provided GEMS data. PM, AR, and KL contributed to the design of the study and the discussion of the results. HH, HL, and JP provided in situ data and expertise on GEMS measurements. All authors contributed to the paper.

Competing interests

At least one of the (co-)authors is a member of the editorial board of Atmospheric Measurement Techniques. The peer-review process was guided by an independent editor, and the authors also have no other competing interests to declare.

Disclaimer

The results contain modified Copernicus Climate Change Service information from 2020. Neither the European Commission nor ECMWF is responsible for any use that may be made of the Copernicus information or data it contains.

Publisher’s note: Copernicus Publications remains neutral with regard to jurisdictional claims made in the text, published maps, institutional affiliations, or any other geographical representation in this paper. While Copernicus Publications makes every effort to include appropriate place names, the final responsibility lies with the authors.

Acknowledgements

We thank the National Institute of Environmental Research (NIER) of South Korea for providing GEMS lv1 data and financial support (NIER-2022-04-02-037). The ERA5 data (Hersbach et al., 2023) were downloaded from the Copernicus Climate Change Service (2023). We thank the Korean Ministry of Environment and NIER for providing the in situ measurements of surface NO2. Janek Gödeke and Kezia Lange acknowledge funding by the Deutsches Zentrum für Luft- und Raumfahrt (grant no. 50 EE 2204). Further, we thank Pascal Fernsel from the University of Bremen for fruitful discussions and feedback.

Financial support

This research has been supported by the Deutsches Zentrum für Luft- und Raumfahrt (grant no. 50 EE 2204) and the National Institute of Environmental Research of South Korea (grant no. NIER-2022-04-02-037).

The article processing charges for this open-access publication were covered by the University of Bremen.

Review statement

This paper was edited by Diego Loyola and reviewed by two anonymous referees.

References

Ahmad, N., Lin, C., Lau, A. K. H., Kim, J., Zhang, T., Yu, F., Li, C., Li, Y., Fung, J. C. H., and Lao, X. Q.: Estimation of ground-level NO2 and its spatiotemporal variations in China using GEMS measurements and a nested machine learning model, Atmos. Chem. Phys., 24, 9645–9665, https://doi.org/10.5194/acp-24-9645-2024, 2024. a, b

Bechle, M. J., Millet, D. B., and Marshall, J. D.: Remote sensing of exposure to NO2: Satellite versus ground-based measurement in a large urban area, Atmos. Environ., 69, 345–353, https://doi.org/10.1016/j.atmosenv.2012.11.046, 2013. a

Beirle, S., Hörmann, C., Jöckel, P., Liu, S., Penning de Vries, M., Pozzer, A., Sihler, H., Valks, P., and Wagner, T.: The STRatospheric Estimation Algorithm from Mainz (STREAM): estimating stratospheric NO2 from nadir-viewing satellites by weighted convolution, Atmos. Meas. Tech., 9, 2753–2779, https://doi.org/10.5194/amt-9-2753-2016, 2016. a

Bovensmann, H., Burrows, J., Buchwitz, M., Frerick, J., Noel, S., Rozanov, V., Chance, K., and Goede, A.: SCIAMACHY: mission objectives and measurement modes, J. Atmos. Sci., 56, 127–150, https://doi.org/10.1175/1520-0469(1999)056<0127:SMOAMM>2.0.CO;2, 1999. a

Breiman, L.: Random Forests, Machine Learning, 45, 5–32, https://doi.org/10.1023/A:1010933404324, 2001. a

Burrows, J. P., Weber, M., Buchwitz, M., Rozanov, V., Ladstätter-Weißenmayer, A., Richter, A., DeBeek, R., Hoogen, R., Bramstedt, K., Eichmann, K.-U., Eisinger, M., and Perner, D.: The Global Ozone Monitoring Experiment (GOME): Mission Concept and First Scientific Results, J. Atmos. Sci., 56, 151–175, https://doi.org/10.1175/1520-0469(1999)056<0151:TGOMEG>2.0.CO;2, 1999. a

Chan, K. L., Khorsandi, E., Liu, S., Baier, F., and Valks, P.: Estimation of Surface NO2 Concentrations over Germany from TROPOMI Satellite Observations Using a Machine Learning Method, Remote Sensing, 13, 969, https://doi.org/10.3390/rs13050969, 2021. a, b, c, d

Chen, Z.-Y., Zhang, R., Zhang, T.-H., Ou, C.-Q., and Guo, Y.: A kriging-calibrated machine learning method for estimating daily ground-level NO2 in mainland China, Sci. Total Environ., 690, 556–564, https://doi.org/10.1016/j.scitotenv.2019.06.349, 2019. a

Cooper, M., Martin, R., Hammer, M., Levelt, P. F., Veefkind, P., Lamsal, L. N., Krotkov, N. A., Brook, J. R., and McLinden, C. A.: Global fine-scale changes in ambient NO2 during COVID-19 lockdowns, Nature, 601, 380–387, https://doi.org/10.1038/s41586-021-04229-0, 2022. a

Cooper, M. J., Martin, R. V., McLinden, C. A., and Brook, J. R.: Inferring ground-level nitrogen dioxide concentrations at fine spatial resolution applied to the TROPOMI satellite instrument, Environ. Res. Lett., 15, 104013, https://doi.org/10.1088/1748-9326/aba3a5, 2020. a

Copernicus Climate Change Service: ERA5 hourly data on single levels from 1940 to present, Copernicus Climate Change Service (C3S) Climate Data Store (CDS) [data set], https://doi.org/10.24381/cds.adbb2d47, 2023. a

de Hoogh, K., Saucy, A., Shtein, A., Schwartz, J., West, E. A., Strassmann, A., Puhan, M., Röösli, M., Stafoggia, M., and Kloog, I.: Predicting Fine-Scale Daily NO2 for 2005–2016 Incorporating OMI Satellite Data Across Switzerland, Environ. Sci. Technol., 53, 10279–10287, https://doi.org/10.1021/acs.est.9b03107, 2019. a, b

Di, Q., Amini, H., Shi, L., Kloog, I., Silvern, R., Kelly, J., Sabath, M. B., Choirat, C., Koutrakis, P., Lyapustin, A., Wang, Y., Mickley, L. J., and Schwartz, J.: Assessing NO2 Concentration and Model Uncertainty with High Spatiotemporal Resolution across the Contiguous United States Using Ensemble Model Averaging, Environ. Sci. Technol., 54, 1372–1384, https://doi.org/10.1021/acs.est.9b03358, 2020. a, b, c, d

Dou, X., Liao, C., Wang, H., Huang, Y., Tu, Y., Huang, X., Peng, Y., Zhu, B., Tan, J., Deng, Z., Wu, N., Sun, T., Ke, P., and Liu, Z.: Estimates of daily ground-level NO2 concentrations in China based on Random Forest model integrated K-means, Advances in Applied Energy, 2, 100017, https://doi.org/10.1016/j.adapen.2021.100017, 2021. a

Geddes, J. A., Martin, R. V., Boys, B. L., and van Donkelaar, A.: Long-Term Trends Worldwide in Ambient NO2 Concentrations Inferred from Satellite Observations, Environ. Health Persp., 124, 281–289, https://doi.org/10.1289/ehp.1409567, 2016. a

Genuer, R., Poggi, J.-M., and Tuleau, C.: Random Forests: some methodological insights, arXiv [preprint], https://doi.org/10.48550/arXiv.0811.3619, 21 November 2008. a, b

Ghahremanloo, M., Lops, Y., Choi, Y., and Yeganeh, B.: Deep Learning Estimation of Daily Ground-Level NO2 Concentrations From Remote Sensing Data, J. Geophys. Res.-Atmos., 126, e2021JD034925, https://doi.org/10.1029/2021JD034925, 2021. a, b, c, d, e, f, g

Gu, J., Chen, L., Yu, C., Li, S., Tao, J., Fan, M., Xiong, X., Wang, Z., Shang, H., and Su, L.: Ground-Level NO2 Concentrations over China Inferred from the Satellite OMI and CMAQ Model Simulations, Remote Sensing, 9, 519, https://doi.org/10.3390/rs9060519, 2017. a

Hersbach, H., Bell, B., Berrisford, P., Biavati, G., Horányi, A., Muñoz Sabater, J., Nicolas, J., Peubey, C., Radu, R., Rozum, I., Schepers, D., Simmons, A., Soci, C., Dee, D., and Thépaut, J.-N.: ERA5 hourly data on single levels from 1940 to present, Copernicus Climate Change Service (C3S) Climate Data Store (CDS) [data set], https://doi.org/10.24381/cds.adbb2d47, 2023. a, b, c, d, e

Huang, K., Zhu, Q., Lu, X., Gu, D., and Liu, Y.: Satellite-Based Long-Term Spatiotemporal Trends in Ambient NO2 Concentrations and Attributable Health Burdens in China From 2005 to 2020, GeoHealth, 7, e2023GH000798, https://doi.org/10.1029/2023GH000798, 2023. a, b

Jacob, D. J.: Introduction to Atmospheric Chemistry, Princeton University Press, Princeton, https://doi.org/10.1515/9781400841547, ISBN 9781400841547, 2000. a

Jiang, Q. and Christakos, G.: Space-time mapping of ground-level PM2.5 and NO2 concentrations in heavily polluted northern China during winter using the Bayesian maximum entropy technique with satellite data, Air Qual. Atmos. Hlth., 11, 23–33, https://doi.org/10.1007/s11869-017-0514-8, 2018. a

Kharol, S., Martin, R., Philip, S., Boys, B., Lamsal, L., Jerrett, M., Brauer, M., Crouse, D., McLinden, C., and Burnett, R.: Assessment of the magnitude and recent trends in satellite-derived ground-level nitrogen dioxide over North America, Atmos. Environ., 118, 236–245, https://doi.org/10.1016/j.atmosenv.2015.08.011, 2015. a

Kim, D., Lee, H., Hong, H., Choi, W., Lee, Y. G., and Park, J.: Estimation of Surface NO2 Volume Mixing Ratio in Four Metropolitan Cities in Korea Using Multiple Regression Models with OMI and AIRS Data, Remote Sensing, 9, 627, https://doi.org/10.3390/rs9060627, 2017. a, b

Kim, J., Jeong, U., Ahn, M.-H., Kim, J. H., Park, R. J., Lee, H., Song, C. H., Choi, Y.-S., Lee, K.-H., Yoo, J.-M., Jeong, M.-J., Park, S. K., Lee, K.-M., Song, C.-K., Kim, S.-W., Kim, Y. J., Kim, S.-W., Kim, M., Go, S., Liu, X., Chance, K., Miller, C. C., Al-Saadi, J., Veihelmann, B., Bhartia, P. K., Torres, O., Abad, G. G., Haffner, D. P., Ko, D. H., Lee, S. H., Woo, J.-H., Chong, H., Park, S. S., Nicks, D., Choi, W. J., Moon, K.-J., Cho, A., Yoon, J., Kim, S.-k., Hong, H., Lee, K., Lee, H., Lee, S., Choi, M., Veefkind, P., Levelt, P. F., Edwards, D. P., Kang, M., Eo, M., Bak, J., Baek, K., Kwon, H.-A., Yang, J., Park, J., Han, K. M., Kim, B.-R., Shin, H.-W., Choi, H., Lee, E., Chong, J., Cha, Y., Koo, J.-H., Irie, H., Hayashida, S., Kasai, Y., Kanaya, Y., Liu, C., Lin, J., Crawford, J. H., Carmichael, G. R., Newchurch, M. J., Lefer, B. L., Herman, J. R., Swap, R. J., Lau, A. K. H., Kurosu, T. P., Jaross, G., Ahlers, B., Dobber, M., McElroy, C. T., and Choi, Y.: New Era of Air Quality Monitoring from Space: Geostationary Environment Monitoring Spectrometer (GEMS), B. Am. Meteorol. Soc., 101, E1–E22, https://doi.org/10.1175/BAMS-D-18-0013.1, 2020. a, b

Kim, M., Brunner, D., and Kuhlmann, G.: Importance of satellite observations for high-resolution mapping of near-surface NO2 by machine learning, Remote Sens. Environ., 264, 112573, https://doi.org/10.1016/j.rse.2021.112573, 2021. a, b

Kley, D. and McFarland, M.: Chemiluminescence detector for NO and NO2, Atmos. Technol. (United States), 12, https://www.osti.gov/biblio/6457230, 1980. a

Lamsal, L. N., Martin, R. V., van Donkelaar, A., Steinbacher, M., Celarier, E. A., Bucsela, E., Dunlea, E. J., and Pinto, J. P.: Ground-level nitrogen dioxide concentrations inferred from the satellite-borne Ozone Monitoring Instrument, J. Geophys. Res.-Atmos., 113, D16308, https://doi.org/10.1029/2007JD009235, 2008. a

Lamsal, L. N., Martin, R. V., van Donkelaar, A., Celarier, E. A., Bucsela, E. J., Boersma, K. F., Dirksen, R., Luo, C., and Wang, Y.: Indirect validation of tropospheric nitrogen dioxide retrieved from the OMI satellite instrument: Insight into the seasonal variation of nitrogen oxides at northern midlatitudes, J. Geophys. Res.-Atmos., 115, D05302, https://doi.org/10.1029/2009JD013351, 2010. a

Lamsal, L. N., Martin, R. V., Parrish, D. D., and Krotkov, N. A.: Scaling Relationship for NO2 Pollution and Urban Population Size: A Satellite Perspective, Environ. Sci. Technol., 47, 7855–7861, https://doi.org/10.1021/es400744g, 2013. a

Lange, K., Richter, A., Bösch, T., Zilker, B., Latsch, M., Behrens, L. K., Okafor, C. M., Bösch, H., Burrows, J. P., Merlaud, A., Pinardi, G., Fayt, C., Friedrich, M. M., Dimitropoulou, E., Van Roozendael, M., Ziegler, S., Ripperger-Lukosiunaite, S., Kuhn, L., Lauster, B., Wagner, T., Hong, H., Kim, D., Chang, L.-S., Bae, K., Song, C.-K., Park, J.-U., and Lee, H.: Validation of GEMS tropospheric NO2 columns and their diurnal variation with ground-based DOAS measurements, Atmos. Meas. Tech., 17, 6315–6344, https://doi.org/10.5194/amt-17-6315-2024, 2024. a, b

Lee, H. J., Kim, N. R., and Shin, M. Y.: Capabilities of satellite Geostationary Environment Monitoring Spectrometer (GEMS) NO2 data for hourly ambient NO2 exposure modeling, Environ. Res., 261, 119633, https://doi.org/10.1016/j.envres.2024.119633, 2024. a

Levelt, P., van den Oord, G., Dobber, M., Malkki, A., Visser, H., de Vries, J., Stammes, P., Lundell, J., and Saari, H.: The ozone monitoring instrument, IEEE T. Geosci. Remote, 44, 1093–1101, https://doi.org/10.1109/TGRS.2006.872333, 2006. a

Li, M., Wu, Y., Bao, Y., Liu, B., and Petropoulos, G. P.: Near-Surface NO2 Concentration Estimation by Random Forest Modeling and Sentinel-5P and Ancillary Data, Remote Sensing, 14, 3612, https://doi.org/10.3390/rs14153612, 2022. a, b, c, d

National Institute of Environmental Research (NIER): AirKorea Annual Report, NIER [data set], https://airkorea.or.kr/web/detailViewDown?pMENU_NO=125, last access: 30 July 2025 (in Korean). a, b

Oak, Y. J., Jacob, D. J., Balasus, N., Yang, L. H., Chong, H., Park, J., Lee, H., Lee, G. T., Ha, E. S., Park, R. J., Kwon, H.-A., and Kim, J.: A bias-corrected GEMS geostationary satellite product for nitrogen dioxide using machine learning to enforce consistency with the TROPOMI satellite instrument, Atmos. Meas. Tech., 17, 5147–5159, https://doi.org/10.5194/amt-17-5147-2024, 2024. a

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, E.: Scikit-learn: Machine Learning in Python, J. Mach. Learn. Res., 12, 2825–2830, 2011. a, b

Probst, P., Bischl, B., and Boulesteix, A.-L.: Tunability: Importance of Hyperparameters of Machine Learning Algorithms, arXiv [preprint], https://doi.org/10.48550/arXiv.1802.09596, 22 October 2018. a, b, c

Probst, P., Wright, M. N., and Boulesteix, A.-L.: Hyperparameters and tuning strategies for random forest, WIREs Data Mining and Knowledge Discovery, 9, e1301, https://doi.org/10.1002/widm.1301, 2019. a

Qin, K., Han, X., Li, D., Xu, J., Loyola, D., Xue, Y., Zhou, X., Li, D., Zhang, K., and Yuan, L.: Satellite-based estimation of surface NO2 concentrations over east-central China: A comparison of POMINO and OMNO2d data, Atmos. Environ., 224, 117322, https://doi.org/10.1016/j.atmosenv.2020.117322, 2020. a, b, c

Scornet, E.: Tuning parameters in random forests, ESAIM: Procs., 60, 144–162, https://doi.org/10.1051/proc/201760144, 2017. a

Shetty, S., Schneider, P., Stebel, K., David Hamer, P., Kylling, A., and Koren Berntsen, T.: Estimating surface NO2 concentrations over Europe using Sentinel-5P TROPOMI observations and Machine Learning, Remote Sens. Environ., 312, 114321, https://doi.org/10.1016/j.rse.2024.114321, 2024. a

Siddique, M. A., Naseer, E., Usama, M., and Basit, A.: Estimation of Surface-Level NO2 Using Satellite Remote Sensing and Machine Learning: A review, IEEE Geoscience and Remote Sensing Magazine, 12, 2–28, https://doi.org/10.1109/MGRS.2024.3398434, 2024.  a

Tang, B., Stanier, C. O., Carmichael, G. R., and Gao, M.: Ozone, nitrogen dioxide, and PM2.5 estimation from observation-model machine learning fusion over S. Korea: Influence of observation density, chemical transport model resolution, and geostationary remotely sensed AOD, Atmos. Environ., 331, 120603, https://doi.org/10.1016/j.atmosenv.2024.120603, 2024. a, b

Veefkind, J., Aben, I., McMullan, K., Förster, H., de Vries, J., Otter, G., Claas, J., Eskes, H., de Haan, J., Kleipool, Q., van Weele, M., Hasekamp, O., Hoogeveen, R., Landgraf, J., Snel, R., Tol, P., Ingmann, P., Voors, R., Kruizinga, B., Vink, R., Visser, H., and Levelt, P.: TROPOMI on the ESA Sentinel-5 Precursor: A GMES mission for global observations of the atmospheric composition for climate, air quality and ozone layer applications, Remote Sens. Environ., 120, 70–83, https://doi.org/10.1016/j.rse.2011.09.027, 2012. a

Wang, B. and Chen, Z.: An intercomparison of satellite-derived ground-level NO2 concentrations with GMSMB modeling results and in-situ measurements – A North American study, Environ. Pollut., 181, 172–181, https://doi.org/10.1016/j.envpol.2013.06.037, 2013. a

Wei, J., Liu, S., Li, Z., Liu, C., Qin, K., Liu, X., Pinker, R. T., Dickerson, R. R., Lin, J., Boersma, K. F., Sun, L., Li, R., Xue, W., Cui, Y., Zhang, C., and Wang, J.: Ground-Level NO2 Surveillance from Space Across China for High Resolution Using Interpretable Spatiotemporally Weighted Artificial Intelligence, Environ. Sci. Technol., 56, 9988–9998, https://doi.org/10.1021/acs.est.2c03834, 2022. a

Williams, J. E., Boersma, K. F., Le Sager, P., and Verstraeten, W. W.: The high-resolution version of TM5-MP for optimized satellite retrievals: description and validation, Geosci. Model Dev., 10, 721–750, https://doi.org/10.5194/gmd-10-721-2017, 2017. a

Yang, L. H., Jacob, D. J., Colombi, N. K., Zhai, S., Bates, K. H., Shah, V., Beaudry, E., Yantosca, R. M., Lin, H., Brewer, J. F., Chong, H., Travis, K. R., Crawford, J. H., Lamsal, L. N., Koo, J.-H., and Kim, J.: Tropospheric NO2 vertical profiles over South Korea and their relation to oxidant chemistry: implications for geostationary satellite retrievals and the observation of NO2 diurnal variation from space, Atmos. Chem. Phys., 23, 2465–2481, https://doi.org/10.5194/acp-23-2465-2023, 2023a. a

Yang, Q., Kim, J., Cho, Y., Lee, W.-J., Lee, D.-W., Yuan, Q., Wang, F., Zhou, C., Zhang, X., Xiao, X., Guo, M., Guo, Y., Carmichael, G. R., and Gao, M.: A synchronized estimation of hourly surface concentrations of six criteria air pollutants with GEMS data, npj Clim. Atmos. Sci., 6, 94, https://doi.org/10.1038/s41612-023-00407-1, 2023b. a, b, c, d, e

Zhang, Y., Lin, J., Kim, J., Lee, H., Park, J., Hong, H., Van Roozendael, M., Hendrick, F., Wang, T., Wang, P., He, Q., Qin, K., Choi, Y., Kanaya, Y., Xu, J., Xie, P., Tian, X., Zhang, S., Wang, S., Cheng, S., Cheng, X., Ma, J., Wagner, T., Spurr, R., Chen, L., Kong, H., and Liu, M.: A research product for tropospheric NO2 columns from Geostationary Environment Monitoring Spectrometer based on Peking University OMI NO2 algorithm, Atmos. Meas. Tech., 16, 4643–4665, https://doi.org/10.5194/amt-16-4643-2023, 2023. a

Short summary
The Korean Geostationary Environmental Monitoring Spectrometer (GEMS) monitors trace gases over Asia, e.g., NO2. GEMS provides hourly data, improving the time resolution compared to the daily overpasses by other satellites. For the prediction of hourly surface NO2 over South Korea from GEMS observations and meteorological data, this study shows that machine learning models benefit from this higher time resolution. This is achieved by using observations from previous hours as additional inputs.