Machine learning data fusion for high spatio-temporal resolution PM<sub>2.5</sub>

Porcheddu, Andrea; Kolehmainen, Ville; Lähivaara, Timo; Lipponen, Antti

doi:https://doi.org/10.5194/amt-18-4771-2025

Articles | Volume 18, issue 18

© Author(s) 2025. This work is distributed under
the Creative Commons Attribution 4.0 License.


Research article | 25 Sep 2025


Final revised paper (published on 25 Sep 2025)
Preprint (discussion started on 14 Feb 2025)

Interactive discussion

Status: closed

RC1: 'Comment on egusphere-2024-4056', Anonymous Referee #1, 31 Mar 2025

This study integrates multi-source data, including satellite and ground-based station data, to construct a deep learning model for estimating 24-hour high-resolution PM2.5 data. High spatiotemporal resolution PM2.5 mapping is of significant importance for pollution control and decision-making, and this study represents a useful attempt in this field. However, the following issues need to be addressed:

The study aims to estimate 24-hourly PM2.5 maps at 100 m resolution in urban areas. However, as shown in Table A1, most of the input data have resolutions coarser than 100 m, except for OpenStreetMap roads and DEM data, which are not directly related to PM2.5. How do the authors justify that the estimated PM2.5 resolution truly reaches 100 m?

The paper presents a deep learning-based estimation approach, but the description of the methodology remains unclear. First, Lines 148–149 mention that "The output is a 3-dimensional array containing 24 hourly PM2.5 maps," but Lines 159–160 state that "the output layer is a 3D 1x1x1 convolution," which appears contradictory and should be clarified. Second, the construction of the loss function is confusing—it should ideally be constrained by PM2.5 measurements from ground stations and NOODLESALAD PM2.5, but its current formulation appears overly complex and difficult to understand.

The study aims to estimate 24-hour, 100 m resolution PM2.5 data, but most of the results presented are seasonal or monthly averages. We would like to see 24-hour PM2.5 mapping results. Additionally, the comparison with MERRA2 focuses mainly on accuracy. Could the authors also better illustrate PM2.5’s spatial distribution and gradient variations, or even capture specific pollution emissions?

The study applies explainable AI techniques to explore the importance of different features, showing that SHAP values identify 2-meter air temperature as the most important feature. However, this analysis could be further improved. First, the underlying reasons for why certain variables are important (or not) are not sufficiently explored. Second, a broader perspective could be considered—how much of the variability in PM2.5 can be explained by meteorological variables overall?

The description of NOODLESALAD PM2.5 and its role in this study is unclear. The authors should provide a more detailed explanation rather than merely citing previous studies.

The results and analysis section could be further improved. First, it is recommended to structure the results into separate subsections rather than mixing everything together. Second, the quality of Figures 3–6 should be improved—currently, the font size is too small, and the figure titles could be removed (since the descriptions are already included in the captions). Lastly, additional results, such as 24-hour high-resolution PM2.5 maps, could enhance the persuasiveness of the study.

The references in the paper are somewhat outdated, with few studies from the last three years included. It is recommended to update and supplement them.

Some minor issues:
(1) Figure 1: Does the figure represent the road network? Please clarify.
(2) Line 134: "3D PM2.5 maps" could be misinterpreted as three-dimensional spatial maps (including altitude). Is this the correct terminology?
(3) Figure 2: The representation is somewhat abstract. It would be better if the inputs and outputs were explicitly illustrated.
(4) Line 279: "consistent with prior findings" should be supported with references.

Citation: https://doi.org/10.5194/egusphere-2024-4056-RC1
- AC1: 'Reply on RC1', Andrea Porcheddu, 23 Jun 2025
  
  Thank you for your comments. Our reply can be found in the pdf attached.
  
  Citation: https://doi.org/10.5194/egusphere-2024-4056-AC1
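
Referee #1's second point turns on whether an output layer described as a "3D 1x1x1 convolution" can produce 24 hourly maps. The following is a minimal sketch, assuming the last hidden tensor is laid out as channels x hours x rows x columns; the actual layout and framework used in the manuscript are not stated in this discussion. The function name `conv1x1x1`, the weights, and the toy sizes are hypothetical. It illustrates that a 1x1x1 kernel only mixes channels at each voxel and leaves the hour and spatial axes untouched.

```python
# Illustrative only: a 1x1x1 3D convolution written in plain Python.
# Assumed layout (not from the manuscript): features[channel][hour][row][col].


def conv1x1x1(features, weights, bias=0.0):
    """Collapse the channel axis with one weight per channel.

    A 1x1x1 kernel has no extent in time or space, so each output voxel
    depends only on the channel values at the same (hour, row, col).
    """
    n_hours = len(features[0])
    n_rows = len(features[0][0])
    n_cols = len(features[0][0][0])
    return [
        [
            [
                bias + sum(w * features[c][t][y][x] for c, w in enumerate(weights))
                for x in range(n_cols)
            ]
            for y in range(n_rows)
        ]
        for t in range(n_hours)
    ]


# Toy input: 3 channels, 24 hours, a 4 x 5 pixel grid; each value is c + t.
n_channels, n_hours, n_rows, n_cols = 3, 24, 4, 5
features = [
    [
        [[float(c + t) for _ in range(n_cols)] for _ in range(n_rows)]
        for t in range(n_hours)
    ]
    for c in range(n_channels)
]

maps = conv1x1x1(features, weights=[1.0, 0.5, 0.25], bias=0.0)
out_shape = (len(maps), len(maps[0]), len(maps[0][0]))  # (hours, rows, cols)
```

Under that assumed layout the two quoted statements would be compatible: the layer reduces the channel axis to a single value per voxel, so the result is a stack of 24 two-dimensional maps rather than a single number.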
RC2: 'Comment on egusphere-2024-4056', Anonymous Referee #2, 26 May 2025

The authors aim to map hourly PM2.5 concentrations at 100-m resolution in Paris using multimodal data from MERRA-2 reanalysis, Sentinel-3 observations, and ground-based measurements via a deep learning model. While the topic is worth investigating, the proposed method, to a large extent, contributes little to the community. The reasons are as follows. First, the proposed model relies mainly on downscaling MERRA-2 aerosol diagnostics to generate 100-m PM2.5 estimates. Similar studies have been conducted extensively, differing in spatial resolution, and satellite-based PM2.5 estimates play a very weak role.
The manuscript suffers from the following flaws, which should be addressed before further consideration.
1. The data accuracy of NOODLESALAD PM2.5 should be described in Section 2.1. Moreover, the essential role of this product in the proposed deep learning framework needs to be clarified.
2. Since the authors used only 11 stations for reference, is this adequate to depict PM2.5 variability across space in the study area?
3. MERRA-2 PM2.5 estimates: since no nitrates are provided in the MERRA-2 aerosol diagnostics, the corresponding PM2.5 estimates are prone to large uncertainty. The data accuracy of this PM2.5 product should be validated as well.
4. The authors used a set of geographic variables with varying spatial resolutions. How were these collocated in the deep learning framework? No such description is given.
5. A flow chart depicting the deep learning architecture, particularly the data flow, is essential for understanding and reproducibility.
6. Equations should be numbered.
7. Methodology: the authors mention that both satellite- and ground-based PM2.5 data were used as the learning target. Since these datasets differ in accuracy, would this undermine the learning capacity of the deep learning model?
8. Lines 207-209: this would result in imbalanced training sets at different hours, which could also influence the learning accuracy, as the learned model is more likely to predict PM2.5 well during the satellite overpasses.
9. An intercomparison of spatial distribution of predicted PM2.5 estimates from MERRA-2 with satellite-derived PM2.5 at 100-m from Sentinel observations should be provided to assess the reliability of the proposed model in resolving PM2.5 distributions in Paris.

Citation: https://doi.org/10.5194/egusphere-2024-4056-RC2
- AC2: 'Reply on RC2', Andrea Porcheddu, 23 Jun 2025
  
  Thank you for your comments. Our reply can be found in the pdf attached.
  
  Citation: https://doi.org/10.5194/egusphere-2024-4056-AC2
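
Referee #2's third point concerns how surface PM2.5 is reconstructed from MERRA-2 aerosol diagnostics. The sketch below uses the reconstruction commonly cited for MERRA-2: dust, sea salt, black carbon, organic carbon, and sulfate scaled to ammonium sulfate by the molar-mass ratio 132.14/96.06. Whether the manuscript uses exactly this form is an assumption. The organic-carbon multiplier `oc_factor` is exposed as a parameter because published studies use different values. The function name and the sample values are hypothetical.

```python
# Illustrative only: a commonly cited PM2.5 reconstruction from MERRA-2
# surface mass diagnostics. Not confirmed to be the manuscript's formula.

# Molar-mass ratio of ammonium sulfate to the sulfate ion.
SO4_TO_AMMONIUM_SULFATE = 132.14 / 96.06


def merra2_pm25(dusmass25, sssmass25, bcsmass, ocsmass, so4smass, oc_factor=1.0):
    """Return surface PM2.5 in ug m-3 from diagnostics given in kg m-3.

    Arguments follow the MERRA-2 variable names: fine dust, fine sea salt,
    black carbon, organic carbon, and sulfate surface mass concentrations.
    Note that the sum contains no nitrate term.
    """
    total_kg_m3 = (
        dusmass25
        + sssmass25
        + bcsmass
        + oc_factor * ocsmass
        + SO4_TO_AMMONIUM_SULFATE * so4smass
    )
    return total_kg_m3 * 1e9  # kg m-3 -> ug m-3


# Hypothetical sample values in kg m-3, chosen only to exercise the function.
pm25 = merra2_pm25(
    dusmass25=2.0e-9,
    sssmass25=1.0e-9,
    bcsmass=0.5e-9,
    ocsmass=3.0e-9,
    so4smass=2.0e-9,
)
```

The absence of a nitrate term in this sum is what the referee's comment refers to: in regions where nitrate is a significant share of fine particulate mass, a reconstruction of this kind would be expected to underestimate PM2.5.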

Peer review completion

AR: Author's response | RR: Referee report | ED: Editor decision | EF: Editorial file upload

AR by Andrea Porcheddu on behalf of the Authors (04 Jul 2025): Author's response, Author's tracked changes, Manuscript

ED: Referee Nomination & Report Request started (07 Jul 2025) by Sandip Dhomse

RR by Anonymous Referee #1 (18 Jul 2025)

ED: Publish subject to minor revisions (review by editor) (18 Jul 2025) by Sandip Dhomse

AR by Andrea Porcheddu on behalf of the Authors (28 Jul 2025): Author's response, Author's tracked changes, Manuscript

ED: Publish as is (30 Jul 2025) by Sandip Dhomse

AR by Andrea Porcheddu on behalf of the Authors (08 Aug 2025) Manuscript

Short summary

This study proposes a novel machine learning method to estimate fine-scale pollution levels (PM2.5) in urban areas. Our model generates hourly PM2.5 maps with high spatial resolution by combining satellite data, ground measurements, geophysical model data, and different geographical indicators. The model properly accounts for the spatial and temporal variability of urban pollution levels and can be highly beneficial for air quality monitoring and health protection.
