Volume 17, issue 9
https://doi.org/10.5194/amt-17-2625-2024
© Author(s) 2024. This work is distributed under the Creative Commons Attribution 4.0 License.

U-Plume: automated algorithm for plume detection and source quantification by satellite point-source imagers
Download
- Final revised paper (published on 06 May 2024)
- Preprint (discussion started on 25 Aug 2023)
Interactive discussion
Status: closed
Comment types: AC – author | RC – referee | CC – community | EC – editor | CEC – chief editor
- RC1: 'Comment on egusphere-2023-1343', Anonymous Referee #1, 08 Sep 2023
  - AC2: 'Reply on RC1', Jack Bruno, 20 Feb 2024
- RC2: 'Comment on egusphere-2023-1343', Anonymous Referee #2, 23 Jan 2024
  - AC3: 'Reply on RC2', Jack Bruno, 20 Feb 2024
- RC3: 'Comment on egusphere-2023-1343', Anonymous Referee #3, 24 Jan 2024
  - AC1: 'Reply on RC3', Jack Bruno, 20 Feb 2024
Peer review completion
AR: Author's response | RR: Referee report | ED: Editor decision | EF: Editorial file upload
AR by Jack Bruno on behalf of the Authors (20 Feb 2024)
  - Author's response
  - Author's tracked changes
  - Manuscript
ED: Referee Nomination & Report Request started (26 Feb 2024) by Natalya Kramarova
RR by Anonymous Referee #2 (13 Mar 2024)

RR by Anonymous Referee #3 (14 Mar 2024)
RR by Anonymous Referee #1 (18 Mar 2024)

ED: Publish subject to technical corrections (19 Mar 2024) by Natalya Kramarova

AR by Jack Bruno on behalf of the Authors (19 Mar 2024)
  - Manuscript
The manuscript presents a novel method to infer CH4 point-source rates, using the U-Net architecture for image segmentation followed by either a convolutional neural network (CNN) or integrated mass enhancement (IME) to estimate the point-source rate. The authors find the approach to be successful across a range of source rates and background noise levels and propose a general functional relationship for point-source observability based on source rate, wind speed, pixel size, and background noise. However, the manuscript lacks some important details about the methodology, specifically data normalization, model training, and model evaluation, which inhibits assessment and reproducibility of the work. Additionally, some claims are broader than the presented results support; these claims should either be rephrased with more balanced language, or further work should be conducted to better substantiate them. Detailed comments are provided below.
1. A total of 6870 scenes were used for training. For ML, this is considered a very small data set, and small data sets often lead to shortcomings in the trained models when compared to identical models trained on larger data sets drawn from the same distribution. How is model performance affected when the data set size is adjusted by a factor of 2 in each direction (i.e., halved or doubled)?
2. It is stated that 90% of the images were used for training and 10% for testing. In ML applications, a validation set is used to monitor for overfitting during training. Was a validation set used during training? If so, please state that along with associated information (e.g., what fraction of the data set was used for validation, what type of cross validation was used, and any early stopping criteria contingent on the validation loss). If not, please state that and justify why it was not used.
If instead what the authors refer to as the testing set is actually the validation set, please correct the language accordingly, and please introduce a testing set to statistically evaluate the generalization of the model beyond the training and validation data.
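For reference, the kind of three-way split being asked about could be reported in a form as simple as the following sketch (the 80/10/10 fractions and the NumPy-based indexing are illustrative assumptions, not the authors' setup):

```python
import numpy as np

# Illustrative train/validation/test split of the 6870 synthetic scenes;
# the 80/10/10 fractions are assumptions, not the authors' actual choice.
rng = np.random.default_rng(seed=0)
indices = rng.permutation(6870)

n_train, n_val = int(0.8 * 6870), int(0.1 * 6870)
train_idx = indices[:n_train]                # fits the model weights
val_idx = indices[n_train:n_train + n_val]   # monitors overfitting / early stopping
test_idx = indices[n_train + n_val:]         # held out for final evaluation only
```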
3. Related to the above, what loss functions are minimized over which data set (training vs. validation), and what learning rate policies and stopping criteria (if any) were used when training the ML models? How many epochs were used to train each model?
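For concreteness, the requested training details amount to roughly the information captured in a sketch like the one below (PyTorch, with a placeholder model, dummy data, and arbitrary hyperparameters; this is not the configuration used in the manuscript):

```python
import torch
from torch import nn

# Placeholder model and data; only the structure of the training details matters here.
model = nn.Sequential(nn.Flatten(), nn.Linear(32 * 32, 1))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.5, patience=5)
loss_fn = nn.MSELoss()

x_train, y_train = torch.randn(64, 1, 32, 32), torch.randn(64, 1)  # dummy data
x_val, y_val = torch.randn(16, 1, 32, 32), torch.randn(16, 1)

best_val, patience, wait = float("inf"), 10, 0
for epoch in range(200):
    model.train()
    optimizer.zero_grad()
    loss = loss_fn(model(x_train), y_train)   # training loss minimized over the training set
    loss.backward()
    optimizer.step()

    model.eval()
    with torch.no_grad():
        val_loss = loss_fn(model(x_val), y_val).item()  # monitored on the validation set
    scheduler.step(val_loss)                  # learning-rate policy: reduce LR on plateau
    if val_loss < best_val - 1e-6:
        best_val, wait = val_loss, 0
    else:
        wait += 1
        if wait >= patience:                  # early stopping criterion
            break
```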
4. Please add details on data normalization for the inputs/outputs of the U-Net and CNN models used in this investigation. This information is necessary to include in the manuscript as it is crucial for both assessment and reproducibility.
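As an example of the kind of statement being requested, one common (here purely hypothetical) normalization choice would be per-scene standardization of the enhancement images and log-scaling of the source rates:

```python
import numpy as np

# One common normalization choice, shown only to illustrate what the manuscript
# should specify; the scheme actually used by the authors is unknown.
def normalize_scene(xch4, rate_kg_h):
    """Standardize a methane-enhancement image and log-scale the source rate."""
    x = (xch4 - xch4.mean()) / (xch4.std() + 1e-8)   # per-scene zero mean, unit variance
    y = np.log10(rate_kg_h)                          # compress the wide range of source rates
    return x, y

scene = np.random.rand(64, 64) * 50.0   # dummy XCH4 enhancement field
x, y = normalize_scene(scene, rate_kg_h=1500.0)
```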
5. Please explicitly define the neural network architectures used in this work. The U-Net architecture is clearly defined in the source reference, so it is only necessary to state any deviations from that architecture. The CNN architecture is poorly described here, with no details on the number of layers, the number of convolutional feature maps, kernel sizes, pooling sizes, the number of nodes in the fully connected layers, or the activation functions. This information must be included in the manuscript for completeness.
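To illustrate the level of detail expected, a complete specification is roughly equivalent to the following sketch (the architecture shown is hypothetical, not the one used in the manuscript):

```python
import torch
from torch import nn

# Hypothetical CNN regressor, included only to illustrate the requested level of
# architectural detail (feature maps, kernel/pool sizes, FC widths, activations).
class SourceRateCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.regressor = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 8 * 8, 128), nn.ReLU(),
            nn.Linear(128, 1),          # predicted (normalized) source rate
        )

    def forward(self, x):
        return self.regressor(self.features(x))

out = SourceRateCNN()(torch.randn(4, 1, 64, 64))   # expects 64x64 single-channel scenes
```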
6. Related to the above, how was the CNN architecture determined for this problem? Simple grid search, Bayesian optimization, or something else? Please add details about this in the manuscript.
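Even a simple grid search, sketched below with a hypothetical evaluate() placeholder, would be sufficient to report:

```python
from itertools import product

# Minimal grid-search sketch over architecture hyperparameters; evaluate() is a
# hypothetical placeholder that would train a candidate CNN and return its
# validation loss.
def evaluate(n_filters, kernel_size, n_fc):
    return 0.0  # placeholder

grid = product([16, 32, 64],      # first-layer feature maps
               [3, 5],            # convolution kernel size
               [64, 128, 256])    # fully connected layer width
best = min(grid, key=lambda cfg: evaluate(*cfg))
print(best)
```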
7. Lines 157-158: Which specific Intel Core i7 CPU was used for this quoted benchmark? The clock speed of i7 CPUs spans a factor of ~4 depending on architecture and TDP, ranging from under 1.1 GHz to over 4 GHz, not to mention other variables such as cache sizes. Additionally, please specify whether file I/O was included in this benchmark, and whether the data were all loaded into RAM at once or batches were loaded on the fly, as there may be significant performance differences between these scenarios.
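A benchmark report of this kind would ideally separate the contributions explicitly, e.g. (load_scene() and predict() are hypothetical placeholders, and the file name is illustrative):

```python
import platform
import time

# Sketch of an unambiguous benchmark report: record the exact CPU and time the
# inference step separately from file I/O.
def load_scene(path):
    return None        # placeholder: read one scene from disk

def predict(scene):
    return None        # placeholder: run segmentation + source-rate model on one scene

print("CPU:", platform.processor() or platform.machine())

t0 = time.perf_counter()
scene = load_scene("scene_0001.nc")          # hypothetical file name
t_io = time.perf_counter() - t0

t0 = time.perf_counter()
predict(scene)
t_infer = time.perf_counter() - t0
print(f"I/O: {t_io * 1e3:.1f} ms, inference: {t_infer * 1e3:.1f} ms")
```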
8. This is more a general comment on the testing of the presented models. The test set solely consists of synthetic images produced in the same manner as the training data; that is, the test set does not include any real images where the predicted source rate could be compared with an existing data product. For completeness, the authors should perform some comparisons using real GHGSat-C1 scenes that contain an obvious CH4 plume whose source rate has been estimated by one or more existing methods, and statistically summarize the differences between U-Plume and those methods. Ideally, many such cases should be included, if feasible, to better illustrate any trends in biases/deviations between this U-Plume approach and more traditional methods.
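The requested statistical summary could be as simple as paired bias/RMSE statistics over the matched scenes, e.g. (the values below are made-up placeholders):

```python
import numpy as np

# Paired source-rate estimates from U-Plume and an existing method for the same
# real scenes; the numbers are illustrative placeholders, not actual results.
uplume = np.array([1200.0, 850.0, 2400.0])      # kg/h, hypothetical
reference = np.array([1100.0, 900.0, 2600.0])   # kg/h, hypothetical

diff = uplume - reference
print(f"mean bias: {diff.mean():.0f} kg/h")
print(f"RMSE: {np.sqrt((diff ** 2).mean()):.0f} kg/h")
print(f"mean relative difference: {100 * (diff / reference).mean():.1f} %")
```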
9. Lines 230-235: There are contradictory statements. deltaB < 5% is described as both "high background noise" and "low background noise", but only one of these can be true. Please correct this typo so that readers may understand what the authors consider to be low/high background noise.
10. Lines 239-242: Can you comment more on these false positives? How are they distributed with respect to the variables considered in this study? Were any other approaches considered to address them? Are the estimated source rates from these false positives generally small and could be filtered that way rather than based on number of pixels in the mask?
Applying the 5-pixel mask filter loses 1.6% of the true positive detections; can you comment more on this? Are these generally small source rates at low wind speeds or cases with high background noise, or are they more uniformly distributed throughout the domain of interest?
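For clarity, the two filtering options being contrasted are sketched below with illustrative thresholds and placeholder values:

```python
import numpy as np

# Two filtering options for candidate detections: discard those with small
# segmentation masks, or discard those with small estimated source rates.
# Thresholds and arrays are illustrative placeholders.
mask_pixels = np.array([3, 12, 40, 2, 75])                  # pixels per detected plume mask
est_rate = np.array([150.0, 900.0, 2500.0, 80.0, 4000.0])   # kg/h

keep_by_mask = mask_pixels >= 5      # mask-size filter (as in the manuscript)
keep_by_rate = est_rate >= 200.0     # alternative: filter on estimated source rate
print(keep_by_mask, keep_by_rate)
```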
11. Line 262: It says that "Detection probability is 10% for O_ps = 0.2". In Figure 4, O_ps has a minimum value of 0.5, at which it has a detection probability of 0%. O_ps = 2 looks to be closer to the 10% detection probability mentioned. Please correct this typo.
12. Lines 291-298: It is mentioned that the CNN method is biased towards the mean, that is, it overestimates small sources and underestimates large sources. What is the distribution of source rates in the training set used in this investigation? If that distribution is concentrated around its mean, that could also explain the CNN's reported behavior. If that is the case, the authors should either address this imbalance to improve model performance at the extrema (whether in the data set itself or in the loss function used to train the model) or use more balanced language when discussing this limitation.
This bias towards the mean can also be a consequence of how the CNN was trained (though the manuscript lacks sufficient detail on how these models were trained to determine the likelihood of this being the case; see comments above).
While the CNN is likely to still perform poorly when extrapolating regardless of the data set, the authors have not conclusively ruled out bias in the training data set or the particular training methodology as the reason for the CNN's worse performance relative to the IME method over the domain on which the CNN was trained. Furthermore, it should be mentioned that expanding the training data set down to 100 kg/h source rates would likely enable the CNN to more accurately recover those scenarios.
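One possible way to address such an imbalance in the loss function, offered only as an illustration of the suggestion above, is to upweight samples far from the mean of the target variable:

```python
import torch

# Sketch of a weighted MSE loss that emphasizes rare, extreme source rates;
# the targets, predictions, and weighting scheme are illustrative placeholders.
def weighted_mse(pred, target, weights):
    return (weights * (pred - target) ** 2).mean()

target = torch.tensor([[0.2], [1.5], [4.0]])   # normalized source rates (dummy)
pred = torch.tensor([[0.8], [1.4], [3.1]])
weights = 1.0 + (target - target.mean()).abs() # upweight samples far from the mean
print(weighted_mse(pred, target, weights).item())
```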
13. Lines 380-381: What CPU was used for this quoted benchmark? Please be specific. (See also comment #7 above.)
14. Lines 385-387: It states that "Evaluation with an independent dataset ...", but based on the methodology described earlier in the manuscript, it is misleading to describe the test set as "an independent dataset" given that it is drawn from the same distribution as the training data. It is only independent in the sense that it was not part of the training process, but the manuscript has not conclusively demonstrated that the model generalizes to truly independent data (real measurements). Please use more balanced language here, or perform the tests suggested above in comment #8 to better substantiate this claim.
15. Figure 7: The orange line looks to be biased by the outliers at O_ps ~ 2, as the line is above the vast majority of the data for O_ps < 10. Given that O_ps > 30 is omitted from the fit due to non-linearity (the error bottoms out around 10%, as mentioned), the authors may wish to consider also omitting O_ps < 3 from the fit due to non-linearity.
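Restricting the fit domain is straightforward, e.g. (the data and the log-log linear form below are placeholders, not the actual Figure 7 fit):

```python
import numpy as np

# Fit only within 3 <= O_ps <= 30, excluding both non-linear ends; the data
# points and the fitted functional form are illustrative placeholders.
o_ps = np.array([0.6, 1.2, 2.0, 4.0, 8.0, 15.0, 28.0, 45.0, 90.0])
rel_error = np.array([2.5, 1.8, 1.1, 0.6, 0.35, 0.2, 0.13, 0.1, 0.1])

sel = (o_ps >= 3) & (o_ps <= 30)
slope, intercept = np.polyfit(np.log10(o_ps[sel]), np.log10(rel_error[sel]), deg=1)
print(slope, intercept)
```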
16. Figure 8: In the left and middle panels, some of the plotted lines are covered by the legend. Please relocate the legend in these panels to avoid this behavior. In the left panel, it looks like it could be placed in the center-left, while the middle panel could relocate the legend to the top-left or bottom-right of the plot.
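For reference, relocating the legends is a one-line change per panel in matplotlib (the panels and data below are placeholders):

```python
import matplotlib.pyplot as plt

# Minimal illustration of moving a legend so it does not cover plotted lines.
fig, (ax_left, ax_mid) = plt.subplots(1, 2)
ax_left.plot([0, 1], [0, 1], label="example")
ax_mid.plot([0, 1], [1, 0], label="example")
ax_left.legend(loc="center left")
ax_mid.legend(loc="upper left")   # or loc="lower right"
plt.close(fig)
```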