Articles | Volume 19, issue 12
https://doi.org/10.5194/amt-19-4219-2026
© Author(s) 2026. This work is distributed under the Creative Commons Attribution 4.0 License.
Improving imputation of missing PM2.5 speciation data using PMF-informed source-receptor relationships
Download
- Final revised paper (published on 26 Jun 2026)
- Supplement to the final revised paper
- Preprint (discussion started on 05 Mar 2026)
- Supplement to the preprint
Interactive discussion
Status: closed
Comment types: AC – author | RC – referee | CC – community | EC – editor | CEC – chief editor
- RC1: 'Comment on egusphere-2026-474', Anonymous Referee #1, 21 Mar 2026
  - AC1: 'Reply on RC1', Qili Dai, 29 May 2026
- RC2: 'Comment on egusphere-2026-474', Anonymous Referee #2, 27 Mar 2026
  - AC2: 'Reply on RC2', Qili Dai, 29 May 2026
- RC3: 'Comment on egusphere-2026-474', Anonymous Referee #3, 29 Mar 2026
  - AC3: 'Reply on RC3', Qili Dai, 29 May 2026
Peer review completion
AR – Author's response | RR – Referee report | ED – Editor decision | EF – Editorial file upload
AR by Qili Dai on behalf of the Authors (30 May 2026)
ED: Referee Nomination & Report Request started (01 Jun 2026) by Haichao Wang
RR by Anonymous Referee #2 (01 Jun 2026)
RR by Anonymous Referee #1 (10 Jun 2026)
ED: Publish as is (10 Jun 2026) by Haichao Wang
AR by Qili Dai on behalf of the Authors (13 Jun 2026)
Missing data in PM2.5 speciation monitoring, due to instrumental drift, calibration, and maintenance, poses challenges for source apportionment and health risk assessments. Conventional imputation methods, including statistical techniques and deep learning, depend on mathematical correlations and often lack physical interpretability. This study presents a novel Positive Matrix Factorization-based reconstruction method (PMFr) that integrates source profile characteristics into the imputation process. Unlike traditional models that rely solely on data covariance, this approach uses "low-entropy structures" to reconstruct latent information, ensuring chemical consistency and physical interpretability. Given its potential to improve data quality in atmospheric research, the reviewer recommends this work for publication with some revisions and clarifications.
Major Comments:
1) The manuscript introduces a novel framework for data imputation based on low-entropy structures, but it gives no practical guidance on when an individual timestamp can or cannot be imputed. It does not define the conditions under which the method may fail because of insufficient observational constraints. The methodology assumes that the source contribution vector (G) can be uniquely resolved from the observed species, which requires at least one observed key tracer species for each source factor. However, the manuscript does not address the scenario in which all characteristic species for a specific source are missing. In that case the system is under-constrained and the imputation is unreliable.
The authors should include a section on practical principles for validity checks. It must state that before imputation, users should ensure each identified source factor has at least one non-missing key tracer. If any time point lacks all diagnostic tracers for a source, that data point should be flagged as un-imputable or handled with caution.
To operationalize this principle, the authors should add a table listing the "Non-Missable Key Tracers" for each pollution source. The table should map each source factor to its essential diagnostic species, giving practitioners a reference for assessing data quality and imputation feasibility before applying the model.
Addressing these points is essential to prevent the misapplication of the method and to clarify the boundary conditions under which the proposed imputation remains scientifically valid.
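The validity check requested in comment 1 could be sketched as below. This is only an illustration, not the authors' procedure: the factor names and tracer species in `KEY_TRACERS` are hypothetical placeholders for the table the reviewer asks for, and the function names are invented here.

```python
import math

# Hypothetical factor-to-tracer map (placeholder values, not from the manuscript).
KEY_TRACERS = {
    "coal_combustion": ["As", "Se"],
    "traffic": ["EC", "Cu"],
    "dust": ["Si", "Ca"],
}


def is_missing(value):
    """Treat None and NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def unconstrained_factors(sample, key_tracers=KEY_TRACERS):
    """Return the factors whose key tracers are ALL missing in one sample."""
    return [
        factor
        for factor, tracers in key_tracers.items()
        if all(is_missing(sample.get(species)) for species in tracers)
    ]


def flag_samples(samples, key_tracers=KEY_TRACERS):
    """Map each timestamp to its unconstrained factors (empty list = imputable)."""
    return {ts: unconstrained_factors(row, key_tracers) for ts, row in samples.items()}


# Made-up concentrations, for illustration only.
samples = {
    "t1": {"As": 1.2, "Se": float("nan"), "EC": 3.0, "Cu": 0.1, "Si": 2.0, "Ca": 1.1},
    "t2": {"As": None, "Se": float("nan"), "EC": 2.5, "Cu": None, "Si": 1.8, "Ca": None},
}
flags = flag_samples(samples)
```

Here `t1` keeps at least one tracer per factor, so its list is empty, while `t2` has lost both assumed coal tracers and would be flagged as un-imputable or treated with caution.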
2) Cases 4–8 mix two missing-data patterns (MCMS and MCMI). MCMS is inherently much more challenging than MCMI because it removes the identifiability of the source, whereas MCMI only removes temporal continuity. A model might perform well under MCMI but fail catastrophically under MCMS. Combining the two patterns into a single performance metric for each case obscures where the error comes from. The reviewer therefore suggests reporting the results for pure MCMS and pure MCMI scenarios separately.
3) The PMFr method relies on the assumption that source chemical profiles remain stable over time. However, real-world source signatures are dynamic and can vary significantly with season, fuel composition, and combustion conditions. If the profiles used for reconstruction differ from the true ones, the reconstructed data will be biased. This study uses a short two-month dataset, so profile stability becomes a concern when the method is applied over longer periods (e.g., multi-year datasets). The manuscript currently lacks guidance on determining the temporal window over which profiles can be treated as stable. The reviewer advises the authors to provide clear, quantitative guidelines for assessing this assumption, including metrics or statistical tests (such as rolling-window analysis or change-point detection) to identify when profiles need recalibration or updating.
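One simple form of the rolling-window check mentioned in comment 3 is to compare the profile of a given factor, re-estimated in each window, against a reference profile. The sketch below assumes the per-window profiles have already been obtained from separate PMF runs. The cosine-similarity metric, the 0.9 threshold, and the example profiles are illustrative assumptions, not values from the manuscript.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two source profiles (species fractions)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def drifting_windows(profiles, reference, threshold=0.9):
    """Return indices of windows whose profile similarity falls below threshold."""
    return [
        i for i, profile in enumerate(profiles)
        if cosine_similarity(profile, reference) < threshold
    ]


# Made-up profiles for one factor across three windows (four species each).
reference = [0.50, 0.30, 0.15, 0.05]
window_profiles = [
    [0.48, 0.32, 0.15, 0.05],  # close to the reference
    [0.50, 0.28, 0.17, 0.05],  # close to the reference
    [0.20, 0.30, 0.10, 0.40],  # composition has shifted
]
flagged = drifting_windows(window_profiles, reference)
```

With these numbers only the third window (index 2) falls below the assumed threshold, which would suggest re-deriving the profiles before imputing data from that period.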
4) The PMFr framework relies on a linear mixing model (C = G × F), assuming observed concentrations are linear combinations of primary emissions. However, secondary components such as sulfates, nitrates, and secondary organic carbon (SOC) arise from complex, non-linear photochemical reactions, which the linear assumption may fail to capture accurately, particularly during heavy pollution or specific weather conditions. The manuscript does not sufficiently address the uncertainty that this assumption introduces when reconstructing secondary species. It is recommended that the authors discuss the limitations of the linear model for secondary aerosol formation and consider conducting a sensitivity analysis to quantify the uncertainty.
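For reference, the linear mixing model named in comment 4 can be written element-wise in the standard PMF form. The index notation is an assumption here (sample i, species j, factor k, with p factors), as is the residual term e_ij, which the reviewer's shorthand C = G × F omits.

```latex
c_{ij} = \sum_{k=1}^{p} g_{ik}\, f_{kj} + e_{ij},
\qquad g_{ik} \ge 0,\quad f_{kj} \ge 0
```

Under this form, a missing value would be reconstructed as \hat{c}_{ij} = \sum_k \hat{g}_{ik} f_{kj}, so any non-linear dependence of a secondary species on precursors or meteorology can only appear in the residual e_ij. That residual is the quantity a sensitivity analysis for secondary species would need to examine.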
Minor comments:
1) Line 82. MCMS and MCMI should be defined in the first paragraph of section 2.2.
2) Figure 1b illustrates model performance through a scatter plot of MAPE (y-axis) against IOA (x-axis), with R² values annotated. However, it does not show the standard deviation (σ) of the modeled data relative to the observations. A model may show high IOA and low MAPE but still misrepresent variability ("amplitude bias"), which matters for accurate source contribution estimates. The authors should include a Taylor diagram as a supplementary figure so that variance and correlation relative to the observations can be assessed together.
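The statistics summarized by a Taylor diagram can be computed directly, as in the minimal sketch below. The function name and the example series are assumptions made for illustration. The modeled series is deliberately built as a damped copy of the observations, so that it is perfectly correlated but has half the variability.

```python
import math


def taylor_stats(obs, mod):
    """Standard deviations, correlation, and centered RMS difference."""
    n = len(obs)
    mean_o, mean_m = sum(obs) / n, sum(mod) / n
    std_o = math.sqrt(sum((o - mean_o) ** 2 for o in obs) / n)
    std_m = math.sqrt(sum((m - mean_m) ** 2 for m in mod) / n)
    cov = sum((o - mean_o) * (m - mean_m) for o, m in zip(obs, mod)) / n
    crmsd = math.sqrt(
        sum(((m - mean_m) - (o - mean_o)) ** 2 for o, m in zip(obs, mod)) / n
    )
    return {
        "std_obs": std_o,
        "std_mod": std_m,
        "std_ratio": std_m / std_o,
        "r": cov / (std_o * std_m),
        "crmsd": crmsd,
    }


# Made-up series: the model tracks the observations but damps the amplitude.
obs = [2.0, 4.0, 6.0, 8.0, 10.0]
mean_obs = sum(obs) / len(obs)
mod = [mean_obs + 0.5 * (o - mean_obs) for o in obs]
stats = taylor_stats(obs, mod)
```

In this constructed case the correlation is 1 while the standard-deviation ratio is 0.5, which is the kind of amplitude bias that correlation-type metrics alone would not reveal. The three quantities satisfy the law-of-cosines relation crmsd² = σ_mod² + σ_obs² − 2·σ_mod·σ_obs·r, which is what lets a Taylor diagram show them on a single plot.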