DeepPrecip: A deep neural network for precipitation retrievals

Abstract. Remotely sensed precipitation retrievals are critical for advancing our understanding of global energy and hydrologic cycles in remote regions. Radar reflectivity profiles of the lower atmosphere are commonly linked to precipitation through empirical power laws, but these relationships are tightly coupled to particle microphysical assumptions that do not generalize well to different regional climates. Here, we develop a robust, highly generalized precipitation retrieval algorithm from a deep convolutional neural network (DeepPrecip) to estimate 20-minute average surface precipitation accumulation using near-surface radar data inputs. DeepPrecip displays high retrieval skill and can accurately model total precipitation accumulation, with a mean squared error (MSE) 160% lower, on average, than current methods. DeepPrecip also outperforms a less complex machine learning retrieval algorithm, demonstrating the value of deep learning when applied to precipitation retrievals. Predictor importance analyses suggest that a combination of both near-surface (below 1 km) and higher-altitude (1.5-2 km) radar measurements is the primary contributor to retrieval accuracy. Further, DeepPrecip closely captures total precipitation accumulation magnitudes and variability across nine distinct locations without requiring any explicit descriptions of particle microphysics or geospatial covariates. This research reveals the important role for deep learning in extracting relevant information about precipitation from atmospheric radar retrievals.

Additionally, the size and availability of both vertically pointing and space-borne remote sensing datasets have expanded greatly in recent decades as a result of technological instrument improvements and new satellite missions (Quirita et al., 2017).
These radar-based retrievals are powerful tools for filling current observational gaps and have been applied to great effect in previous literature (Levizzani et al., 2011; Hiley et al., 2010). However, these relationships demonstrate an inability to generalize well to unseen validation data as a consequence of the microphysical particle assumptions (e.g. shape, diameter, particle size distribution (PSD), terminal fall velocity and mass) used in each relationship's unique derivation (Jameson and Kostinski, 2002).
Recent machine learning (ML) approaches have demonstrated improvements in estimating surface precipitation from remotely sensed data compared to traditional nowcasting methods (Shi et al., 2017; Kim and Bae, 2017). Deep learning models have benefited greatly from the increased observational sample provided by remote sensing missions and have shown skill in learning complex spatiotemporal characteristics of the underlying datasets (Chen et al., 2020b). However, to our knowledge, a deep learning convolutional surface precipitation retrieval using vertical column radar data with no spatiotemporal covariates has yet to be developed. Previous ML studies have typically focused on passive microwave and infrared datasets, which lack a detailed analysis of the vertical column structure, or suffer from a limited sample for model training across multiple, distinct regional climates (Xiao et al., 1998; Adhikari et al., 2020; Ehsani et al., 2021).
In this work, we evaluate the abilities of a novel deep learning precipitation retrieval algorithm trained on vertically pointing radar (up to 3 km above the surface). The regression model we present (DeepPrecip) is a hybrid deep learning neural network consisting of a feature extraction convolutional neural network (CNN) front-end and a regression feedforward multilayer perceptron (MLP) back-end. The combination of these two architectures allows DeepPrecip to recognize and learn the nonlinear relationships between different layers in the vertical column of radar observations and produce an accurate surface precipitation estimate. Through an analysis of feature input combinations, DeepPrecip performance is examined to identify regions within the vertical column that contain the most important contributions to retrieval accuracy (Lundberg and Lee, 2017). The relationships that exist between different layers of the vertical profile (and each atmospheric covariate) can be used to help inform current and future active radar retrievals of surface precipitation.

b indicate periods
where non-zero surface precipitation was recorded. Study sites were selected based on the required presence of a micro rain radar (MRR) and collocated Pluvio2 weighing precipitation gauge. Rain, snow and mixed-phase precipitation were recorded, with each site's precipitation phase and intensity distribution of observations differing based on the regional climate. For instance, Marquette experienced strong lake-effect snowfall while Cold Lake received mostly light, shallow snowfall. Further, due to the warmer temperatures recorded at OLYMPEx, these sites were classified as primarily experiencing liquid precipitation, while ICE-POP received only solid precipitation.

Pluvio2 precipitation weighing gauge
Reference surface precipitation observations were collected by OTT Pluvio2 weighing gauges at each site. The Pluvio2 gauge records the precipitation accumulation from falling hydrometeors with a minimum time resolution of 1 minute (Colli et al., 2014). It includes a 200 cm² heated surface orifice (400 cm² at Ny-Ålesund) to prevent snow and ice buildup, along with site-specific wind shielding implemented as described in Table 1. These shielding setups include the Double Fence Intercomparison Reference (DFIR) shield, a large double-fenced wooden structure that significantly reduces the impact of wind on surface precipitation measurements (Rasmussen et al., 2012; Kochendorfer et al., 2022). The Alter shield system consists of multiple freely hanging, spaced metal slats around the gauge top opening, which also helps mitigate undercatch during strong winds (Colli et al., 2014). Sensitivity analyses of different rolling temporal windows indicated an optimal temporal resolution of 20-minute non-real-time accumulation (results are output 5 minutes after precipitation accumulation), with a minimum observational threshold of at least 0.2 mm over the course of an hour from the Pluvio2 gauge.

Micro rain radar
Vertically pointing MRRs (developed by METEK) were located near the Pluvio2 gauges at each site to record complementary atmospheric observations. The MRR is a K-band (24 GHz) continuous wave Doppler radar which provides information related to hydrometeor particle activity up to 3.1 km above the surface (or 1 km for Ny-Ålesund) as a function of spectral power backscatter intensity. The MRR provides 29 vertical bins (of size 100 m) spanning 300 m to 3100 m above the surface, as shown for each site in Fig. 2.a. Raw radar measurements were preprocessed using Maahn's improved MRR processing tool (IMProToo) for noise removal, dealiasing and for extending the minimum detectable dBZ to -14, which allows for improved measurements of solid precipitation. These data were then temporally averaged to align to the same 20-minute windows generated for the Pluvio2 observations and used as a model input (Maahn and Kollias, 2012).

ERA5
European Centre for Medium-Range Weather Forecasts Reanalysis version 5 (ERA5) hourly temperature (TMP) and vertical wind velocity (WVL) on pressure levels from 0 to 3 km were also included as additional input covariates to DeepPrecip (Hersbach et al., 2020). These inputs allow the model to more accurately recognize different precipitation event structures, large-scale atmospheric dynamics and hydrometeor phases during training. Note that WVL units (Pa/s) are defined using the ECMWF Integrated Forecasting System (IFS), which adopts a pressure-based vertical coordinate system (i.e. negative values indicate upwards air motion, since pressure decreases with height). Each of these variables was linearly interpolated to align with the MRR data over 20-minute intervals and at 100 m vertical resolution.
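The interpolation described above can be sketched as follows. This is a minimal illustration, not the processing code used in the study: the function names, the sample values, and the assumption that pressure levels have already been converted to heights above the surface are all hypothetical. Note that `numpy.interp` holds the end values constant outside the input range.

```python
import numpy as np

def interp_hourly_to_20min(hours, values):
    """Linearly interpolate an hourly series onto a 20-minute time axis.

    hours  : 1-D increasing array of time stamps in hours (hypothetical layout)
    values : 1-D array of the hourly variable (e.g. ERA5 TMP at one level)
    """
    hours = np.asarray(hours, dtype=float)
    values = np.asarray(values, dtype=float)
    # Three 20-minute steps per hour, endpoints included.
    n_steps = int(round((hours[-1] - hours[0]) * 3)) + 1
    target = hours[0] + np.arange(n_steps) / 3.0
    return target, np.interp(target, hours, values)

def interp_levels_to_bins(level_heights_m, level_values, bin_heights_m):
    """Linearly interpolate a vertical profile onto the MRR range-bin heights."""
    heights = np.asarray(level_heights_m, dtype=float)
    values = np.asarray(level_values, dtype=float)
    order = np.argsort(heights)  # np.interp requires increasing x values
    return np.interp(bin_heights_m, heights[order], values[order])

# Illustrative values only.
t, tmp = interp_hourly_to_20min([0, 1, 2], [270.0, 273.0, 272.0])
mrr_bins = np.arange(300, 3101, 100)  # 29 bins, 300 m to 3100 m
```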

Surface meteorology
Collocated surface temperature (°C) and 10-meter wind speed (m/s) meteorological observations were also collected from instruments installed at each site and temporally aligned to the Pluvio2 and MRR datasets. Surface wind data acts as an additional observational constraint for mitigating the effects of undercatch on unshielded measurement gauges (Rasmussen et al., 2012). Undercatch occurs when wind causes falling hydrometeors to pass over the gauge top orifice. This effect has been shown to bias reported precipitation quantities by up to 10% (Ehsani and Behrangi, 2022). We therefore limit the available training dataset to periods when surface wind speeds are < 5 m/s, which restricts the analysis to low-to-medium wind speed events at each location and maintains a high gauge catch efficiency (Yang, 2014). This preprocessing step reduces the average size of our total observational pool by 16% across all stations; however, we note that maximum intensity precipitation events are not removed using this technique.
Surface meteorological station temperature data is used for precipitation-phase partitioning at 5 °C to allow for Z_e-S/R comparisons with DeepPrecip. Additional dry surface air temperature thresholds of 0, 1 and 2 °C were also examined, but Z_e-S/R performance for both rain and snow appeared optimal when classified using a 5 °C threshold (where temperatures < 5 °C are considered as solid precipitation and temperatures ≥ 5 °C are considered as rainfall). This simple temperature threshold is an additional source of uncertainty in our comparisons with the Z_e-S/R relationships due to the influence of mixed-phase precipitation on power law accuracy, along with uncertainties in the location of the active melting layer (Jennings et al., 2018). A more sophisticated phase partitioning system (e.g. using wet-bulb temperature as described in Sims and Liu (2015)) could also be linked to DeepPrecip as an additional predictor to further improve classification of mixed-phase precipitation in future work.
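The wind screening (< 5 m/s) and the 5 °C phase partitioning described in this section amount to two simple rules. A minimal sketch is given here, assuming each sample is a dictionary with hypothetical keys `wind_ms` and `temp_c`; the record layout and function name are illustrative and not taken from the study.

```python
def screen_and_partition(records, max_wind=5.0, phase_threshold_c=5.0):
    """Drop high-wind samples, then label the rest as 'snow' or 'rain'.

    records : iterable of dicts with hypothetical keys
              'wind_ms' (10 m wind speed) and 'temp_c' (surface temperature).
    """
    kept = []
    for rec in records:
        if rec["wind_ms"] >= max_wind:
            continue  # excluded: only wind speeds below 5 m/s are retained
        # Below the threshold counts as solid precipitation, otherwise rain.
        phase = "snow" if rec["temp_c"] < phase_threshold_c else "rain"
        kept.append({**rec, "phase": phase})
    return kept

# Illustrative samples only.
samples = [
    {"wind_ms": 2.0, "temp_c": -3.0},
    {"wind_ms": 6.5, "temp_c": 1.0},
    {"wind_ms": 4.9, "temp_c": 5.0},
]
labelled = screen_and_partition(samples)
```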

Radar-precipitation power laws
Relating radar reflectivity observations to surface accumulation has been done extensively in past surface and spaceborne radar missions through Z_e-S/R power law relationships (Skofronick-Jackson et al., 2017; Liu, 2008). These power law relationships are empirically defined by relating reflectivity values in a near-surface bin to observed surface accumulation under a set of assumed particle microphysics (e.g. size, shape, density and fallspeed) (Matrosov et al., 2008). While these techniques have been used to great success in previous studies by Schoger et al. (2021) and Levizzani et al. (2011), the assumptions about snowfall and rainfall particle microphysics make the generalization of these power laws less robust, which contributes to high uncertainty when applied across large areas with unique regional climates (Jameson and Kostinski, 2002).
We examine an ensemble of 12 Ka- and K-band Z_e-S/R relationships in this work to compare with model output from DeepPrecip (Table 2). As a consequence of the short temporal period (20 minutes) used in this analysis, MSE values are typically small (< 0.1 mm²). Each Z_e-S/R relationship was applied to a near-surface bin in the reflectivity profile (bin 5 for DP_full and DP_near, and bin 11 for DP_far) to derive a corresponding surface precipitation estimate. These bins were selected based on a sensitivity analysis where we examined the performance of multiple near-surface high-importance regions of the vertical column (not shown). The best performing regions were identified as the above bins (5 and 11) based on the respective region of the vertical column being considered (near or far). More information regarding the derivation of each Z_e-S/R relationship can be found in Table 2.
To further evaluate the performance of DeepPrecip, we also include model comparisons to a set of six site-derived Z_e-P (reflectivity-precipitation) power law relations. Each Z_e-P relationship is empirically derived from the collocated MRR and Pluvio2 data at each observational site examined in this work (excluding Cold Lake and Ny-Ålesund due to the limited available sample and vertical extent of each site, respectively). Each Z_e-P relation is fit via a non-linear least-squares approach for finding optimal a and b coefficients in Eq. 1 using SciPy's curve_fit optimization algorithm (Virtanen et al., 2020). Each Z_e-P relationship was then applied to bin 5 reflectivities at each site (i.e. the same process as is used for Z_e-S/R relationships) and compared with in situ observations to assess their general accuracy.
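A minimal sketch of such a fit is shown here. It assumes that Eq. 1 has the usual power law form Z_e = a P^b with Z_e in linear units (mm⁶ m⁻³); that form, the starting guess `p0`, the helper names and the synthetic coefficients (56 and 1.2) are illustrative assumptions and are not values from Table 2.

```python
import numpy as np
from scipy.optimize import curve_fit

def power_law(p, a, b):
    """Assumed form of Eq. 1: Z_e = a * P**b, with Z_e in linear units."""
    return a * np.power(p, b)

def fit_ze_p(precip, ze_linear, p0=(100.0, 1.5)):
    """Non-linear least-squares estimate of the a and b coefficients."""
    popt, _ = curve_fit(power_law, precip, ze_linear, p0=p0)
    return popt

def invert_ze_p(ze_dbz, a, b):
    """Apply a fitted relation: convert dBZ to linear units, then solve for P."""
    ze_linear = 10.0 ** (np.asarray(ze_dbz, dtype=float) / 10.0)
    return (ze_linear / a) ** (1.0 / b)

# Synthetic, noise-free data generated from arbitrary coefficients.
precip = np.linspace(0.1, 5.0, 50)
ze = power_law(precip, 56.0, 1.2)
a_hat, b_hat = fit_ze_p(precip, ze)
```

With real observations the fit would be made per site on collocated samples, and the inverse relation would then be applied to the chosen reflectivity bin.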

Neural network architecture
DeepPrecip is a feedforward convolutional neural network that takes as input a vector of 115 atmospheric covariates (Table 3), performs a feature extraction of the vertical column and outputs a single surface precipitation estimate using a fully connected multilayer perceptron. While the structure of this final version of DeepPrecip is complex, the retrieval evolved from a much simpler initial state based on a multiple linear regression (MLR) model. Due to clear nonlinearities between observed reflectivity data and surface precipitation accumulation, the MLR model was unable to capture in situ variability and provided estimates near the mean accumulation value. Similar radar-based precipitation retrieval studies by Chen et al. (2020a) and Choubin et al. (2016) have demonstrated much better performance using an ML-based approach, which led to the development of a random forest (RF) model, an MLP and finally the CNN.

The 1D convolutional layers perform a feature extraction of the vertical column of inputs to reduce the total number of parameters being fed into DeepPrecip's fully connected dense layers. This 1D-CNN structure can identify relationships within the vertical column, save on memory and lower computational training time requirements. To perform a 1D feature extraction, the forward propagation step from the previous convolutional layer (l − 1) to the input neurons of the current layer (l) is expressed in Eq. 2 (Abdeljaber et al., 2017), where k and l refer to the kth neuron of layer l, with x as the resulting input and b as the scalar bias. The s and w terms represent the neuron output and kernel weight matrix, respectively, from the ith neuron of layer l − 1 (and to the kth neuron of layer l for w). The function f() represents the activation function used to transform the weighted sum into an output to be used in the following network layer.
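Eq. 2 itself is not reproduced in this text. Based on the symbol definitions given here and on the 1D CNN formulation commonly attributed to Abdeljaber et al. (2017), it presumably takes the following form. The upper summation limit N_{l-1} (the number of neurons in layer l − 1), the conv1D operator notation and the output symbol y are assumptions introduced for this reconstruction.

$$
x_k^l = b_k^l + \sum_{i=1}^{N_{l-1}} \mathrm{conv1D}\left(w_{ik}^{l-1},\, s_i^{l-1}\right), \qquad y_k^l = f\left(x_k^l\right)
$$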

The RF model tested in this study was based on previous work from King et al. (2022), where an RF was used to retrieve surface snow accumulation from a collocated X-band and Pluvio2 instrument at a single experiment site (GCPEx). The RF developed in said study demonstrated good skill in estimating surface accumulation, and so we incorporate the same model here (retrained on the MRR and ERA5 data from this study) as a baseline comparison to other ML retrieval methods (i.e. DeepPrecip).

The final DeepPrecip model structure is outlined in Fig. 2.b. It includes two 1D convolutional layers, a 1D max pooling layer, a dropout layer and a flattening layer, and concludes with a dense MLP regressor with 3 hidden layers. The total number of trainable model parameters in DeepPrecip is 3,937,793. Model training and testing were performed using a 90/10 (non-shuffled) split on each site to generate training and testing datasets for each location. As an additional preprocessing step, we standardize all input covariates by removing the mean and scaling to unit variance. The non-shuffled nature of this splitting process allows for DeepPrecip estimates to be validated against unseen data and prevents overfitting from training on temporally autocorrelated vertical column inputs. Additionally, this stratified selection process guarantees that an equal percentage of data is included from each site during training.
Retrieval accuracy is primarily assessed using a mean squared error (MSE) skill metric calculated between each model's estimated surface accumulation values and the total Pluvio2 non-real-time reference accumulation observations over 20 minutes. Performance statistics are reported from the average skill of the test portion of a non-shuffled 90/10 train/test CV split (i.e. DeepPrecip trained and tested 10 times on different contiguous portions of the full available sample). Note that each split is stratified to include 10% of each station's sample in every test split. Uncertainty estimates are calculated from running each CV split 50 times using dropout to gain additional insight into model variability (resulting in 500 total model instances).
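The non-shuffled, site-stratified splitting and the MSE metric can be sketched as follows. This is a simplified illustration under stated assumptions: samples are taken to be stored in time order within each site, and the scaling statistics are computed on the training portion only, which the text does not specify. All function names are hypothetical.

```python
import numpy as np

def contiguous_folds(site_ids, n_folds=10):
    """Yield (train_idx, test_idx) pairs without shuffling.

    Each site's samples are cut, in their original (time) order, into
    n_folds contiguous blocks; fold j tests on block j of every site.
    """
    site_ids = np.asarray(site_ids)
    blocks = {s: np.array_split(np.flatnonzero(site_ids == s), n_folds)
              for s in np.unique(site_ids)}
    for j in range(n_folds):
        test = np.concatenate([blocks[s][j] for s in blocks])
        train = np.setdiff1d(np.arange(site_ids.size), test)
        yield train, test

def standardize(train_x, test_x):
    """Scale to zero mean and unit variance using training statistics only
    (an assumption; the text does not say where the statistics are computed)."""
    mean = train_x.mean(axis=0)
    std = train_x.std(axis=0)
    std[std == 0] = 1.0  # guard against constant columns
    return (train_x - mean) / std, (test_x - mean) / std

def mse(y_true, y_pred):
    """Mean squared error between reference and estimated accumulation."""
    return float(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2))

# Two hypothetical sites with 100 and 50 samples.
sites = np.array(["A"] * 100 + ["B"] * 50)
folds = list(contiguous_folds(sites))
```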
The dropout layers simulate training a large number of models with differing architectures in a highly parallelized manner by randomly deactivating (or dropping) a certain fraction of nodes within the network to provide a distribution of retrieval estimates.
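One way to read this procedure is as Monte Carlo dropout, in which nodes are randomly dropped on repeated forward passes and the spread of the resulting estimates is used as an uncertainty measure. The sketch below illustrates the idea with a toy one-hidden-layer network in NumPy; the network, its random weights, the dropout rate and the seeds are all illustrative assumptions and do not represent the DeepPrecip architecture.

```python
import numpy as np

def mc_dropout_predict(x, w_hidden, w_out, rate=0.2, n_runs=50, seed=0):
    """Return the mean and standard deviation of n_runs stochastic forward passes.

    A toy one-hidden-layer ReLU network stands in for the trained model;
    the weights, dropout rate and seed are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    hidden = np.maximum(x @ w_hidden, 0.0)        # ReLU activations
    preds = np.empty((n_runs, x.shape[0]))
    for r in range(n_runs):
        keep = rng.random(hidden.shape) >= rate   # randomly deactivate nodes
        dropped = hidden * keep / (1.0 - rate)    # inverted-dropout rescaling
        preds[r] = dropped @ w_out
    return preds.mean(axis=0), preds.std(axis=0)

# Random inputs and weights, for illustration only.
rng = np.random.default_rng(1)
x = rng.normal(size=(4, 6))
w_hidden = rng.normal(size=(6, 16))
w_out = rng.normal(size=16)
mean, spread = mc_dropout_predict(x, w_hidden, w_out)
```

The standard deviation across runs plays the role of the shaded 1 standard deviation envelope; with a dropout rate of zero every run is identical and the spread collapses to zero.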

Hyperparameter optimization
DeepPrecip was developed, trained and optimized on Graphcore MK2 Classic IPU-POD4 intelligence processing units (IPUs) (Louw and McIntosh-Smith, 2021), which reduced training time by a factor of 6.5 compared to a state-of-the-art NVIDIA Tesla V100 GPU. Additional training throughput comparisons are included in [...]
Hyperparameters do not change value during training (in contrast to model parameters like internal node weights), but they play a critical role in the neural network learning process to map input features to an output. Selecting optimal hyperparameter values is an important part of constructing a model which minimizes loss, improves model efficiency and quality, and mitigates overfitting. Multiple steps were taken to address concerns of model overfitting. In addition to the use of non-shuffled training, we employ multiple regularization methods including early stopping, dropout, the application of layer weight constraints and L2 regularization (details in Table 5). L2 regularization (or ridge regression) adds a penalty term to the MSE loss function, which helps to create less complex models when dealing with many input features and so improves model generalization.
To select the optimal values for the aforementioned hyperparameters, and to optimize DeepPrecip's general structure, we use a form of hyperparameter tuning known as Hyperband optimization (Li et al., 2017). Hyperband is a bandit-based extension of random search: it samples candidate hyperparameter configurations, trains each for a small number of epochs, and repeatedly discards the worst performers while increasing the number of epochs given to the survivors, so that the training budget is concentrated on the most promising configurations. DeepPrecip hyperparameters were derived by running a 10-fold CV Hyperband optimization continuously on a single Graphcore IPU for approximately two weeks.
The final hyperparameter values (and their respective parameter search spaces) can be found in Table 5.
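Hyperband (Li et al., 2017) is built around successive halving, in which many configurations are trained briefly and only the best fraction advances to a larger epoch budget. The sketch below shows a single successive-halving bracket in pure Python; it is a simplified illustration and not the tuning code used for DeepPrecip. The `toy_loss` function, the learning-rate search range and the `eta` value are hypothetical stand-ins.

```python
import random

def successive_halving(configs, evaluate, min_epochs=1, eta=3):
    """Run one successive-halving bracket, the core routine inside Hyperband.

    configs  : list of candidate hyperparameter settings
    evaluate : callable(config, epochs) -> validation loss (lower is better)
    eta      : keep the best 1/eta of the candidates after each round
    """
    survivors = list(configs)
    epochs = min_epochs
    while len(survivors) > 1:
        ranked = sorted(survivors, key=lambda c: evaluate(c, epochs))
        survivors = ranked[:max(1, len(ranked) // eta)]
        epochs *= eta  # survivors earn a larger training budget next round
    return survivors[0]

def toy_loss(config, epochs):
    # Hypothetical stand-in for model training: the loss shrinks with more
    # epochs and is smallest for a learning rate near 1e-3.
    return abs(config["lr"] - 1e-3) + 1.0 / epochs

random.seed(0)
candidates = [{"lr": 10 ** random.uniform(-5, -1)} for _ in range(27)]
best = successive_halving(candidates, toy_loss)
```

Starting from 27 candidates with `eta=3`, the bracket keeps 9, then 3, then 1 configuration, tripling the epoch budget at each round. Full Hyperband repeats such brackets with different trade-offs between the number of candidates and the starting budget.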

Unsupervised classification layer
An unsupervised k-means clustering preprocessing step is also applied using MRR reflectivity profiles as input to provide DeepPrecip with insights into distinct profile group (PG) vertical column structures (Fig. 2.b). The number of clusters (k = 4 PGs) was selected by applying the elbow criterion to the within-cluster sum of squared errors between the vertical column radar profiles (Fig. 3). The elbow method is a clustering heuristic which allows for an optimal number of clusters to be selected as a function of diminishing returns of explained variation (i.e. finding the elbow or "knee of the curve"). K-means clustering was applied using Python's scikit-learn package on all input reflectivity data to generate four profile clusters, which were included as additional input parameters to DeepPrecip. These clusters are useful for partitioning the precipitation data into groups based on different precipitation intensity classes (trace, low, medium and high intensity) to identify where DeepPrecip finds the most important contributors to high retrieval accuracy for each category of storm intensity.
Derived cluster groups are useful for interpreting feature importances from model output (Section 4.2).
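The elbow criterion is usually applied by eye, but it can be automated. The sketch below computes the within-cluster sum of squares (WCSS) for a given clustering and picks the k with the largest second difference of the WCSS curve. The second-difference rule is an assumption made here for illustration, as the text does not state how the elbow was located, and the WCSS values are made up. In practice the WCSS for each k would come from the fitted clustering (for example, the `inertia_` attribute of scikit-learn's `KMeans`).

```python
def wcss(points, labels, centroids):
    """Within-cluster sum of squared distances for a given clustering."""
    total = 0.0
    for p, lab in zip(points, labels):
        total += sum((pi - ci) ** 2 for pi, ci in zip(p, centroids[lab]))
    return total

def elbow_k(wcss_by_k):
    """Pick k at the sharpest bend of the WCSS curve.

    wcss_by_k : dict {k: WCSS} for consecutive k values.
    The bend is located by the largest second difference, one common way
    (assumed here) to automate the visual elbow criterion.
    """
    ks = sorted(wcss_by_k)
    best_k, best_bend = None, float("-inf")
    for prev_k, k, next_k in zip(ks, ks[1:], ks[2:]):
        bend = wcss_by_k[prev_k] - 2 * wcss_by_k[k] + wcss_by_k[next_k]
        if bend > best_bend:
            best_k, best_bend = k, bend
    return best_k

# Illustrative (made-up) WCSS values for k = 1..6.
curve = {1: 1000.0, 2: 600.0, 3: 350.0, 4: 150.0, 5: 130.0, 6: 120.0}
chosen = elbow_k(curve)
```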

DeepPrecip retrieval performance
We first examine the differences in performance between DeepPrecip and an RF that has demonstrated good performance in our previous work (not shown) to assess the capabilities of a less sophisticated ML-based approach relative to a CNN. DeepPrecip demonstrates improved skill in capturing most of the peaks and troughs in observed precipitation variability (Fig. 4.a). These differences are most clearly demonstrated in Fig. 4.a at OLYMPEx and JOYCE, where DeepPrecip (DP) more accurately predicts Pluvio2 precipitation extremes compared to the RF. Both models appear to struggle in capturing accumulation intensities during periods of mixed-phase precipitation when temperatures are near 0 °C (i.e. Marquette, JOYCE and the tail end of OLYMPEx 1) due to a lack of training data with similar climate conditions and the complex nature of such events. DP does demonstrate improved skill at capturing light intensity precipitation at the beginning of the JOYCE period (compared to the RF); however, this comes with some uncertainty, as noted by the wider shaded region (1 standard deviation). Performance statistics (Fig. 4. [...] threshold is imposed where retrievals recorded during periods with temperatures below 5 °C are classified as snow and periods equal to or warmer than 5 °C as rain. DeepPrecip more accurately captures surface precipitation quantities when compared to the Z_e-S/R estimates, with a total accumulation curve similar in shape to that of the in situ observations, indicating that DeepPrecip more closely captures the observed precipitation variability and magnitude. Log-scale MSE statistics are calculated between each model and in situ records in Fig. 4.d and indicate that DeepPrecip consistently outperforms traditional Z_e-S/R power law methods by 200% on average. Because DeepPrecip is a general precipitation retrieval algorithm, we do not explicitly train separate DeepPrecip snow and DeepPrecip rain models for different precipitation phases with unique regional atmospheric microphysical conditions.
While the Z_e-S/R models shown in Fig. 4.c/d are bespoke for rain or snow, DeepPrecip is trained on all data with no a priori knowledge of the underlying physical precipitating particle state. DeepPrecip estimates of accumulated rain display a lower MSE than those of snow (Fig. 4.d). We attribute this difference to two factors: 1) the larger sample of rainfall events in the training data (3 times that of snowfall); and 2) the more complex nature of snow particle microphysics. Unlike the relatively uniform properties of a raindrop, the shape, size and fallspeed of solid precipitation are much more dynamic and challenging to model (Wood et al., 2013). Continued issues with interference from wind may have also impacted the accuracy of in situ measurements of snow accumulation, leading to higher uncertainty and error (further discussions on these uncertainties in Sect. 5) (Kochendorfer et al., 2017). To visualize the range in uncertainty from the CNN model estimates, we display confidence intervals showing 1 standard deviation in Fig. 4.b/d from 50 DeepPrecip model realizations using dropout. Both ML-based models exhibit the highest uncertainty during periods of mixed-phase precipitation at GCPEx and Marquette, along with high intensity precipitation at OLYMPEx.
To further evaluate DeepPrecip's retrieval skill over traditional methods, we compare model performance to a set of six custom Z_e-P site-derived power laws (derivation details in Sect. 3). While Z_e-P relationships typically perform well in the regional climate under which they were derived, they do not generalize well outside of said climate. This lack of robustness is visible in the differences between in situ and Z_e-P estimates of accumulation in Fig. 5.a, where each Z_e-P (light gray line) displays consistent positive or negative biases and no single power law captures the high variability in accumulation across multiple sites. For instance, OLYMPEx 1 and OLYMPEx 3-derived relationships produce a strong positive bias at JOYCE, and the JOYCE-derived Z_e-P power law is quite negatively biased when applied at OLYMPEx. The mean of all six custom power laws is shown in bold gray, and while it closely captures total mean accumulation across all sites, it is unable to model the high variability in precipitation intensity.
The resulting MSE from the application of each custom Z_e-P relationship to each site (along with DeepPrecip) further demonstrates DeepPrecip's improved robustness (Fig. 5.b). In all other cases, DeepPrecip either outperforms all Z_e-P power laws or is only slightly worse than the power law derived for the site at which it is being tested. On average, DeepPrecip retrievals result in 160% lower MSE values than all Z_e-P site-derived power law estimates when applied to the testing data across the full spatiotemporal domain (Table 6). Figure 5.b also displays a model intercomparison of each Z_e-P relation, where Z_e-P relations like those derived at OLYMPEx 1 and 3 are clearly unable to capture the vastly different snowfall regimes at sites like ICE-POP, GCPEx and JOYCE, as shown by their much larger MSE values for these sites.
The robustness of DeepPrecip was further evaluated using a leave-one-out cross validation (CV) for each site of training observations. This approach tests the skill of DeepPrecip at predicting precipitation for a location that was not included in the training data, which is a strong indicator of the generalizability of the model. Log-scale MSE results of this test for each site are shown in Fig. 6 for each precipitation-phase subset, along with the corresponding average Z_e-P/S/R estimate when applied at that site. These findings demonstrate similar performance to the baseline DeepPrecip model skill, which continues to outperform all traditional power law techniques on average. The large range in skill in the power law relationships at most sites (wide error bars) further demonstrates the relative lack of generalizability of Z_e-P/S/R relationships to different regional climates. Further, the site-derived power law fits (gray dots) perform worse on average than DeepPrecip for locations that are close in proximity (i.e. the OLYMPEx sites).
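The leave-one-site-out procedure can be sketched as a generator of index splits, where each site in turn is withheld entirely from training and used only for testing. The function name and the short list of site labels are illustrative; the labels reuse site names from this study purely as an example.

```python
def leave_one_site_out(site_ids):
    """Yield (held_out_site, train_idx, test_idx) for each distinct site."""
    for held_out in sorted(set(site_ids)):
        train = [i for i, s in enumerate(site_ids) if s != held_out]
        test = [i for i, s in enumerate(site_ids) if s == held_out]
        yield held_out, train, test

# One site label per sample (toy example).
site_ids = ["JOYCE", "JOYCE", "GCPEx", "Marquette", "GCPEx"]
splits = list(leave_one_site_out(site_ids))
```

A model would be retrained on `train` and scored on `test` for every split, so that each reported score reflects a location the model never saw during training.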
Predictably, DeepPrecip performance degrades compared to the baseline model when the testing site is left out, since the model is no longer trained using data representing the regional climate of the site being tested. This difference in performance is most notable at the set of OLYMPEx sites, and while DeepPrecip performance is still improved over the Z_e-S/R relationships, we note a substantial percentage increase in MSE (375% on average) at these locations. OLYMPEx measurements were the only observational datasets without any gauge shielding, which is a likely source of uncertainty further contributing to this increase in error when the site is removed from the training set (Kochendorfer et al., 2022).

Quantifying sources of retrieval accuracy
Identifying regions within the vertical column that are the most important contributors towards retrieval accuracy is critical for informing future satellite-based radar precipitation retrievals. The ground-based radar instruments used in this work do not suffer from the same ground clutter contamination issues typical of satellite-based radar observations and we are therefore able to quantify the contributions to model skill arising from the included boundary layer reflectivity measurements in DeepPrecip.
Separating the training data into three subsets based on vertical extent and generating new models with this data allows us to examine changes in performance as a function of information availability. These subsets include: DP_full (all 29 vertical bins, i.e. the baseline model), DP_near (the lowest 1 km; 8 bins), and DP_far (1-3 km; 21 bins). DeepPrecip MSE results (Table 6) for each subset suggest that the information provided by a combination of both near-surface and far-profile data results in the highest accuracy.
Since Ny-Ålesund MRR observations were recorded with a maximal vertical extent of 1 km, they are only included in DP_near. Model skill when including/excluding Ny-Ålesund training data (19,000 samples) was examined to determine whether it was confounding comparisons between the aforementioned vertical profile subset models. The results of these tests suggested that the impact on overall performance is negligible across both precipitation phases when Ny-Ålesund is included or excluded in the training set.
Distributions of surface precipitation anomalies appear distinct for rain and snow (Fig. 7), with the full column model more closely capturing accumulation recorded by in situ gauges. Anomaly frequencies are derived by removing the mean accumulation estimate for each phase at each site. We attribute the structural differences between the anomaly distributions of snow and rain to the more complex particle size distributions (PSDs) of snowfall coupled with the more variable particle water content of snow compared to that of rain (Yu et al., 2020). Additional uncertainties in the surface Pluvio2 gauge observational records of snowfall due to gauge undercatch are another likely contributor to increased error (Kochendorfer et al., 2022). In Fig. 7.a, both DP_far and DP_near exhibit higher anomaly values with a flattened curve top and heavy tails.
Using a combination of information from both near and far bins reduces these biases and tightens each accumulation anomaly distribution around zero. A similar trend is also present for rain in Fig. 7.b, where we again most closely capture the in situ anomaly distribution using DP_full.
A major challenge in deep learning is interpreting model output. SHapley Additive exPlanations (SHAP) (Lundberg and Lee, 2017) is a game theory approach to artificial intelligence model interpretability based on Shapley values that has previously been used to great effect in the geosciences (Maxwell and Shobe, 2022; Li et al., 2022). Shapley values quantify the contributions from all permutations of input features on retrieval accuracy to identify which are the most meaningful. While computationally expensive (with exponential time complexity), this process provides local interpretability within the model by examining how each possible combination of all input features impacts model accuracy (Jia et al., 2020). Here, the calculated Shapley values give insight into the regions of the vertical column that are contributing the most useful radar information in the precipitation retrieval.
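To make the exponential cost concrete, the sketch below computes exact Shapley values by brute force for a small toy model. This is not the SHAP library, which relies on approximations for large models, and the toy model with its coefficients is a hypothetical stand-in for a trained network. Features outside a coalition are replaced by a baseline value, which is one common convention assumed here.

```python
from itertools import combinations
from math import factorial

def shapley_values(model, x, baseline):
    """Exact Shapley values of model(x) relative to model(baseline).

    Features outside a coalition are set to their baseline value. Every
    subset of the remaining features is visited, so the cost grows as 2**n.
    """
    n = len(x)

    def value(coalition):
        mixed = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return model(mixed)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                members = set(subset)
                total += weight * (value(members | {i}) - value(members))
        phi.append(total)
    return phi

def toy_model(features):
    # Hypothetical response with one interaction term (rfl * spw).
    rfl, dov, spw = features
    return 0.5 * rfl + 0.2 * dov + 0.1 * rfl * spw

phi = shapley_values(toy_model, x=[2.0, 1.0, 4.0], baseline=[0.0, 0.0, 0.0])
```

The values sum to the difference between the model output at `x` and at the baseline, and the interaction term is shared equally between the two features involved.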
Shapley values for the entire dataset used in DP_full indicate that the most important model predictors comprise a combination of both near-surface and far profile bins (Fig. 8). Reanalysis variable model inputs are generally the least influential, except for the trace precipitation case where low-mid level TMP and WVL bins appear highly important (Fig. 8). In all cases, TMP [...]
[Figure caption fragment, likely Fig. 8: ... for different subsets of vertical column reflectivities separated into all profiles, trace intensity, low intensity, medium intensity, and high intensity precipitation events based on a k-means clustering of input data (more in Sect. 3.2). Areas of dark color indicate a high feature importance at that location within the vertical column.]

Discussion and Conclusions
DeepPrecip not only demonstrates considerable retrieval accuracy without the need for physical assumptions about hydrometeors or spatio-temporal information, but also provides insight into the regions of the vertical column which are most important for improving predictive accuracy. The results from Sect. 4.2 suggest that while the exact altitudes providing predictive information from the vertical column may shift up or down under different precipitation intensities, there exists a consistent combination of both near-surface and far profile bins that always appears as a highly important contributor to model skill. Furthermore, while RFL is typically considered the most important predictor in radar-based precipitation retrievals (Stephens et al., 2008; Skofronick-Jackson et al., 2015), we find that contributions from RFL, DOV and SPW provide a near-equal level of importance, with respective average percent contributions to model output of 30%, 31% and 28%, while ERA5 TMP and WVL variables have a total combined importance of 10%.
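As an illustration of how per-variable percentages like those above can be obtained from bin-level attributions, the sketch below averages absolute Shapley values over samples, sums them over vertical bins, and normalizes to 100%. This aggregation is an assumption made for illustration; the exact procedure used in this study is not specified here. The function `percent_contribution` is hypothetical and the input values are synthetic random numbers, not DeepPrecip output.

```python
import random

VARIABLES = ["RFL", "DOV", "SPW", "TMP", "WVL"]


def percent_contribution(shap_values):
    """Percent share of total attribution per variable.

    shap_values is indexed as shap_values[sample][variable][bin].
    Absolute values are averaged over samples and summed over bins.
    """
    n_samples = len(shap_values)
    n_variables = len(shap_values[0])
    totals = [0.0] * n_variables
    for sample in shap_values:
        for v, bins in enumerate(sample):
            totals[v] += sum(abs(x) for x in bins) / n_samples
    grand_total = sum(totals)
    return [100.0 * t / grand_total for t in totals]


# Synthetic stand-in data: 200 samples, 5 variables, 10 vertical bins
random.seed(0)
fake_shap = [
    [[random.gauss(0.0, 1.0) for _ in range(10)] for _ in VARIABLES]
    for _ in range(200)
]

pct = percent_contribution(fake_shap)
for name, share in zip(VARIABLES, pct):
    print(f"{name}: {share:.1f}%")
```

Because absolute values are used, positive and negative attributions do not cancel, so the shares reflect the magnitude of influence rather than its direction.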
The combined insights from DeepPrecip's multi-model vertical extent evaluations and feature importance analyses demonstrate a potential to influence current and future remote sensing precipitation retrievals using deep learning. Instruments like CloudSat's Cloud Profiling Radar (CPR) or the Global Precipitation Measurement (GPM) mission's Dual-frequency Precipitation Radar (DPR) also use active radar systems to perform similar precipitation retrievals based on vertical column reflectivities (Stephens et al., 2008). While CPR- and GPM-derived products use a more sophisticated Bayesian retrieval than the Z_e − S/R relationships evaluated here, the resulting precipitation estimates are still tightly coupled to a priori physical assumptions of particle shape, size and fallspeed, which are a substantial source of uncertainty (Hiley et al., 2010; Wood et al., 2013). Additionally, the results of this study further support prior inferences that the near-surface (< 1 km) region of the vertical column, which is associated with shallow cumuliform precipitation, strongly influences retrieval accuracy. This is an area that is typically masked in satellite-based products (i.e. the radar "blind-zone") due to surface clutter contamination, and has been shown in previous work to likely be a major source of underestimation from missing shallow cumuliform precipitation (Maahn et al., 2014; Bennartz et al., 2019). This work motivates continued research towards obtaining high-quality, non-cluttered near-surface radar data to use as additional model inputs in future space-based retrievals of precipitation.
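For reference, the Z_e − S/R relationships mentioned above are power laws of the form Z = a R^b, which are inverted to estimate a precipitation rate from reflectivity. The sketch below uses the classic Marshall-Palmer rain coefficients (a = 200, b = 1.6, with Z in mm^6 m^-3 and R in mm/h) purely as an example; these are not necessarily the coefficients of the relationships evaluated in this work, and the function names are hypothetical.

```python
import math

# Classic Marshall-Palmer coefficients for rain (assumed for illustration)
MP_A = 200.0
MP_B = 1.6


def dbz_to_linear(dbz):
    """Convert reflectivity from dBZ to linear units (mm^6 m^-3)."""
    return 10.0 ** (dbz / 10.0)


def rate_from_dbz(dbz, a=MP_A, b=MP_B):
    """Invert Z = a * R**b to get the rate R in mm/h."""
    return (dbz_to_linear(dbz) / a) ** (1.0 / b)


def dbz_from_rate(rate, a=MP_A, b=MP_B):
    """Forward relation: reflectivity in dBZ for a rate in mm/h."""
    return 10.0 * math.log10(a * rate ** b)


for dbz in (10.0, 20.0, 30.0, 40.0):
    print(f"{dbz:4.0f} dBZ -> {rate_from_dbz(dbz):6.2f} mm/h")
```

The retrieved rate depends entirely on the fixed pair (a, b), which encodes the assumed particle properties; this is the coupling to microphysical assumptions that a learned retrieval aims to avoid.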
DeepPrecip is not without uncertainty and error, which will reduce its accuracy when tested against new data. Uncertainties present in the training data (stemming from the MRR, ERA5 or Pluvio2 observations) will propagate through the model and bias the output estimates (Kochendorfer et al., 2022; Jakubovitz et al., 2019). We have taken steps to mitigate the impact of these uncertainties through multiple data alignment and preprocessing decisions (details in Sect. 3); however, precipitation gauge undercatch, wind shielding configurations, MRR attenuation and differences in site-specific vertical extent cannot be eliminated as contributors to retrieval error. While 60% of the power laws examined in this work were MRR-derived K-band relationships, the remaining 40% were either Ka-band or the Marshall-Palmer (MP) Rayleigh relationship. Although the K and Ka bands are similar in frequency, the differences between the two can bias the resulting precipitation estimate when a Ka-derived power law is applied to K-band data (especially during periods of intense precipitation). Furthermore, while the collection of data from multiple sites provides us with a robust training set under multiple regional climates, calibration biases between study locations, arising from the unique experimental setup at each site, may further reduce DeepPrecip's skill when applied to new data.
As the MRR instrument has a limited 3 km maximum vertical range, we also miss possible precipitation events occurring outside of this region, which may contribute to further surface precipitation underestimation. Internal CNN model uncertainty is likely driven, in part, by a combination of the high variability that is typical of precipitation and the limited sample from nine measurement sites over 8 years, which does not fully capture all different forms of possible precipitation structure and occurrence.
Code and data availability.