The Authors have made an effort to revise their manuscript according to the suggestions from the reviewers. Personally, I feel that the paper would benefit from another round of revisions, as I still have a few questions about the Authors’ response to my comments. I will mark this paper for minor revisions, as my perception of this work is generally positive, but I still advise the Authors to pay attention to a few points that, in my opinion, would benefit from additional revision. In general, I feel that some parts of the paper may give readers the impression that issues still open in the retrievals are weaknesses inherent to “the neural network approach” in general, when in reality they may well be caused by particular design choices. Once again, I am referring to the use of relatively few training data with a very large network, and to some assumptions in the training set that may limit its comprehensiveness. This is an impression I would like the paper to avoid giving. Once the discussion is made a bit more balanced, I think this work can be published, because the results shown in this paper are certainly of interest to the scientific community.
REPLIES TO SPECIFIC POINTS
5) Can you add this consideration to the paper, and possibly add some references to back this statement? Furthermore, in your ER-2 NN you also use a fixed flying altitude (20 km). Therefore, unless I am missing something, the impact of cloud height variations (or of the cloud-aircraft altitude difference) on the top-of-atmosphere signal is not accounted for at all, and I think that this limitation needs more emphasis in the text. In your second NN you vary the flight altitude between 5 and 7 km and keep the cloud height fixed at 1 km. I guess that low-level clouds typically have altitudes of, let's say, 1 to 5 km (also depending on what you consider to be a "low-level" cloud). Since the atmospheric density decays exponentially with height, qualitatively I would expect a 1 km change in cloud height between 3 and 5 km to affect the shielding of the Rayleigh scattering underneath more than a 1 km change in flight altitude does (a rough back-of-the-envelope estimate supporting this intuition is sketched after the references below). Of course these are only qualitative considerations, but I wonder if there are references supporting your line of reasoning. On top of that, I still think that in absolute terms your training set is rather small compared to other existing studies, which use training sets containing millions to tens of millions of data points (Kox et al., 2014; Strandgren et al., 2017; Di Noia et al., 2019). I suggest at least emphasizing in the text that this may be a limitation of your design.
References
Kox et al. (2014), “Retrieval of cirrus cloud optical thickness and top altitude from geostationary remote sensing”, Atmos. Meas. Tech., 7, 3233–3246, doi: 10.5194/amt-7-3233-2014
Strandgren et al. (2017), “Cirrus cloud retrieval with MSG/SEVIRI using artificial neural networks”, Atmos. Meas. Tech., 10, 3547–3573, doi: 10.5194/amt-10-3547-2017
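To make my qualitative argument about Rayleigh shielding a little more concrete, here is a rough back-of-the-envelope estimate. It assumes a simple exponential atmosphere with an 8 km scale height, so the numbers are only indicative and are not meant to replace a proper radiative transfer calculation:

    import numpy as np

    H = 8.0  # assumed atmospheric scale height in km (indicative value only)

    def rayleigh_fraction_between(z_low, z_high):
        """Fraction of the total Rayleigh optical depth located between two
        altitudes, for an exponential density profile with scale height H."""
        return np.exp(-z_low / H) - np.exp(-z_high / H)

    # Rayleigh optical depth between cloud top and aircraft, as a fraction of the total column
    for z_cloud, z_aircraft in [(1.0, 6.0), (3.0, 6.0), (5.0, 6.0), (1.0, 5.0), (1.0, 7.0)]:
        frac = rayleigh_fraction_between(z_cloud, z_aircraft)
        print(f"cloud at {z_cloud} km, aircraft at {z_aircraft} km: "
              f"{frac:.2f} of the Rayleigh column lies in between")

In this crude picture, moving the cloud top from 1 to 5 km (aircraft at 6 km) changes the intervening Rayleigh fraction from about 0.41 to 0.06, whereas moving the aircraft from 5 to 7 km (cloud at 1 km) only changes it from about 0.35 to 0.47, which is the effect I had in mind.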
9) LeCun et al. (1998) indeed seem to suggest that this approach may work, even though they do not explain why in mathematical terms. I may be overscrupulous here, but I would still suggest testing whether your approach worked by looking at the statistics of the derivatives of your NN output with respect to each input, normalized by the standard deviation of the corresponding input so that they can be compared across inputs. This test should be simple to implement, and it will tell you for sure whether or not your NN is really less sensitive to reflectance than it is to DoLP, as you wish to achieve.
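To be explicit about the test I have in mind, a minimal sketch could look like the following. Here nn_model, the step size and the array names are placeholders, and I estimate the response of the output to a one-standard-deviation perturbation of each input, which I believe is the quantity one wants to compare across inputs with different physical units:

    import numpy as np

    def normalized_sensitivities(nn_model, X, input_std, eps=1e-3):
        """Mean absolute derivative of the NN output with respect to each input,
        scaled by the training-set standard deviation of that input.

        nn_model  : callable mapping an input vector to a scalar output
                    (e.g., one retrieved parameter)
        X         : (n_samples, n_inputs) array of representative inputs
        input_std : (n_inputs,) standard deviation of each input in the training set
        """
        n_samples, n_inputs = X.shape
        sens = np.zeros(n_inputs)
        for j in range(n_inputs):
            dX = np.zeros(n_inputs)
            dX[j] = eps
            # central finite difference of the output with respect to input j
            d_out = [(nn_model(x + dX) - nn_model(x - dX)) / (2 * eps) for x in X]
            sens[j] = np.mean(np.abs(d_out)) * input_std[j]
        return sens

Comparing, for instance, the summed sensitivity over the reflectance inputs with that over the DoLP inputs would show directly on which group of inputs the NN actually relies.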
11) I agree that the choice of the number of hidden layers is a trial-and-error procedure. However, there are certainly limits (also theoretical; see, e.g., Baum and Haussler, 1989; Haykin, 1999) on the ratio between the number of training samples and the number of free parameters required to obtain reasonable generalization. Usually one wants at least an order of magnitude more training data than free parameters, whereas for your NN the opposite is the case (see the small illustration after the references below). Now, I guess you determined the number of neurons on a subset of synthetic data not used during the training phase, and you may have found that your large NN architecture is the one that achieved the lowest RMS error. Is this the case? The real question, however, is: how confident are you that your choice, determined through this procedure, is robust enough to also be valid when applied to real data, which probably follow a statistical distribution that is very different from that of your training set? This question becomes especially important considering that your training set appears to contain a number of simplifications that may limit its comprehensiveness.
References:
Baum and Haussler (1989), “What Size Net Gives Valid Generalization?”, Neural Comp., 1, 151-160, doi: 10.1162/neco.1989.1.1.151
Haykin (1999), “Neural Networks: A Comprehensive Foundation”, Prentice Hall
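To illustrate the order-of-magnitude argument above, the following small calculation counts the free parameters of a fully connected network; the layer sizes are purely illustrative and are not taken from your manuscript:

    def mlp_free_parameters(layer_sizes):
        """Number of weights and biases in a fully connected feed-forward network,
        with layer_sizes = [n_inputs, n_hidden_1, ..., n_outputs]."""
        return sum(n_in * n_out + n_out
                   for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))

    # Purely illustrative numbers: 100 inputs, two hidden layers of 500 neurons, 3 outputs
    print(mlp_free_parameters([100, 500, 500, 3]))  # 302503 free parameters

By the rule of thumb discussed above, a network of that size would call for a training set on the order of millions of samples rather than tens of thousands.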
14) Di Noia et al. (2019) do not perform a single-variable retrieval of effective variance. They retrieve effective variance, effective radius and cloud height together; separate retrievals are only reported for cloud optical thickness, as COT cannot be retrieved from polarized radiance alone. I still suggest acknowledging in your text that your effective variance retrievals look less accurate than results already published in the literature, both NN (Di Noia et al., 2019) and non-NN (Shang et al., 2019) retrievals. Furthermore, could it be that your statistics for effective variance improve if you compute them on a subset of the test data, e.g., only data in the principal plane? A sketch of what I mean follows the reference below.
Reference: Shang et al. (2019), "An improved algorithm of cloud droplet size distribution from POLDER polarized measurements", Remote Sens. Environ., 228, 61-74, doi: 10.1016/j.rse.2019.04.01
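Regarding the last question, what I have in mind is simply recomputing the error statistics on a geometry-restricted subset of the test data, along the lines of the sketch below; the variable names, the azimuth convention and the 10-degree tolerance are placeholders:

    import numpy as np

    def principal_plane_stats(y_true, y_retrieved, rel_azimuth_deg, tol_deg=10.0):
        """RMSE and bias of the retrieval restricted to near-principal-plane
        geometries, i.e. relative azimuth close to 0 or 180 degrees."""
        raz = np.mod(rel_azimuth_deg, 360.0)
        in_plane = (np.minimum(raz, 360.0 - raz) < tol_deg) | (np.abs(raz - 180.0) < tol_deg)
        err = y_retrieved[in_plane] - y_true[in_plane]
        return np.sqrt(np.mean(err ** 2)), np.mean(err)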
16) I have to admit that I do not see much reframing of this sentence in the text. I still see the sentence "in the framework of NN it is difficult to diagnose this error". However, you still do not explain WHY, in your opinion, this difficulty is typical of the NN framework as opposed to other retrieval schemes, considering that NN retrievals, too, can be fed back into radiative transfer models, that derivatives of NN outputs with respect to inputs can be computed analytically, that the response of the NN to specific inputs can be tested, and so on. Therefore, I would still like to know which "error diagnosis techniques" can be applied to other methods but not to a NN retrieval. Even if carrying out the analysis I recommended perhaps goes beyond the scope of your paper, I would still advise you to at least avoid generic statements such as "in the NN framework it is difficult to diagnose this kind of error", unless you can explain what exactly is difficult and why. A simple example of what I mean by feeding NN retrievals back into a radiative transfer model is sketched below.
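A closure check along the following lines would already be informative; forward_model and nn_retrieve are of course placeholders for your own forward model and trained network:

    import numpy as np

    def closure_residuals(forward_model, nn_retrieve, measurements):
        """Closure check: feed each NN retrieval back into the radiative transfer
        forward model and return the measurement-space residuals.

        forward_model : callable, maps a retrieved state vector to simulated measurements
        nn_retrieve   : callable, maps a measurement vector to a retrieved state vector
        measurements  : (n_obs, n_channels) array of observed reflectances / DoLP
        """
        residuals = np.array([forward_model(nn_retrieve(y)) - y for y in measurements])
        # Large or structured residuals flag observations for which the NN output is
        # inconsistent with the measurements, just as a poor fit would in an
        # iterative retrieval.
        return residuals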
19) You say that “it could be that … the polarized reflectance information is still not weighted strongly enough”. But doesn’t this actually call for an analysis of the sensitivity of the NN to its various inputs by looking at the network Jacobians, as I suggested earlier?
Furthermore, you say: “Without this shared dependence the network is treating … as independent pieces of information”. I am not convinced that this is the case. Whether or not the NN treats them as independent pieces of information depends, I think, on how the training set has been generated. I would guess that the training set reflects the joint relationship between the spectral and angular dependence through the joint probability density function of the simulated quantities. Now, if you interpret the NN retrievals as an approximation to the posterior conditional expectation of the state vector given the measurements, as I suggested in my previous review, and approximate the joint pdf of the simulated data as a Gaussian, the covariance of this Gaussian definitely enters the expression for the posterior conditional expectation (see the expression below). I would thus expect the NN to be able to capture this dependence, if the training has been successful.
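For clarity, the standard expression I am referring to is the following: if the state vector x and the measurement vector y are jointly Gaussian with means \mu_x, \mu_y and covariance blocks \Sigma_{xy} (state-measurement cross-covariance) and \Sigma_{yy} (measurement covariance), then

    E[x | y] = \mu_x + \Sigma_{xy} \Sigma_{yy}^{-1} (y - \mu_y),

so the spectral and angular correlations present in the simulated training data enter the retrieval directly through \Sigma_{xy} and \Sigma_{yy}.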
20) If you confirm that the SWIR bands saturate at lower COTs than the VNIR bands, then I would suggest mentioning this as a likely cause of the behaviour of your scatter plots, as attempting to invert a relationship that saturates typically produces this kind of effect (a purely schematic illustration is given below).
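The exponential saturation law and all numbers below are invented for the example and are not your actual SWIR reflectance model; the sketch only shows how small measurement noise fans out in the retrieved quantity once the forward relationship flattens:

    import numpy as np

    tau0 = 5.0                                        # arbitrary saturation scale
    forward = lambda tau: 1.0 - np.exp(-tau / tau0)   # schematic saturating reflectance
    inverse = lambda r: -tau0 * np.log(1.0 - r)

    rng = np.random.default_rng(0)
    for tau in (2.0, 10.0, 30.0):
        r = forward(tau)
        r_noisy = np.clip(r + rng.normal(0.0, 0.005, 10000), 0.0, 0.999)  # small measurement noise
        tau_hat = inverse(r_noisy)
        # the scatter of the retrieved COT grows sharply as the reflectance saturates
        print(f"true COT {tau}: retrieved COT spread {np.std(tau_hat):.2f}")

The same small noise in the measured reflectance maps onto an increasingly large spread in the retrieved COT as the curve flattens, which is the kind of fanning-out visible in scatter plots at large COT.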
23) I do not perceive a big difference from the previous version of the paper. I still see a reasoning based on the concept of "clear traceability", which still looks somewhat vague to me. To be clear, it is not my intention to dismiss your point, but I would like to see a more focused discussion of what you actually mean by it.
A possible way to revise your discussion is to convey the concept that NN retrievals, as opposed to curve fitting, LUTs, or iterative (regularized) least-squares retrievals, are not designed to obtain the best fit between simulations and observations for each individual observation. In my opinion, this would be a more valid point than generically saying that NNs are "less rigorous" or "less traceable" than other retrieval methods.
MINOR REVISIONS
• P2, L30. Cite Werdell et al. (2019), “The Plankton, Aerosol, Cloud, ocean Ecosystem (PACE) mission: Status, science, advances”, Bull. Am. Meteor. Soc., 100, 1775-1794, doi: 10.1175/BAMS-D-18-0056.1
• P13, L18. I guess the second “the former” in the sentence should be replaced with “the latter”.