The University of Washington Ice-Liquid Discriminator (UWILD) improves single particle phase classifications of hydrometeors within Southern Ocean clouds using machine learning

Mixed-phase Southern Ocean clouds are challenging to simulate and their representation in climate models is an important control on climate sensitivity. In particular, the amount of supercooled liquid and frozen mass that they contain in the present climate is a predictor of their planetary feedback in a warming climate. The recent Southern Ocean Clouds, Radiation and Aerosol Transport Experimental Study (SOCRATES) vastly increased the amount of in-situ data available from mixed-phase Southern Ocean clouds useful for model evaluation. Bulk measurements distinguishing liquid and ice water content are not available from SOCRATES so single particle phase classifications from the Two-Dimensional Stereo (2D-S) probe are invaluable for quantifying mixed-phase cloud properties. Motivated by the presence of large biases in existing phase discrimination algorithms, we develop a novel technique for single particle phase classification of binary 2D-S images using a random forest algorithm, which we refer to as the University of Washington Ice-Liquid Discriminator (UWILD). UWILD uses 14 parameters computed from binary image data, as well as particle inter-arrival time, to predict phase. We use liquid-only and ice-dominated time periods within the SOCRATES dataset as training and testing data. This novel approach to model training avoids major pitfalls associated with using manually labelled data, including reduced model generalizability and high labor costs. We find that UWILD is well calibrated and has an overall accuracy of 95% compared to 72% and 78% for two existing phase classification algorithms that we compare it with. UWILD improves classifications of small ice crystals and large liquid drops in particular and has more flexibility than the other algorithms to identify both liquid-dominated and ice-dominated regions within the SOCRATES dataset. UWILD misclassifies a small percentage of large liquid drops as ice. Such misclassified particles are typically associated with model confidence below 75% and can easily be filtered out of the dataset. UWILD phase classifications show that particles with area-equivalent diameter (Deq) < 0.17 mm are mostly liquid at all temperatures sampled, down to -40 °C. Larger particles (Deq > 0.17 mm) are predominantly frozen at all temperatures below 0 °C. Between 0 °C and 5 °C, there are roughly equal numbers of frozen and liquid mid-size particles (0.17 < Deq < 0.33 mm) and larger particles (Deq > 0.33 mm) are mostly frozen. We also use UWILD's phase classifications to estimate sub 1-Hz phase heterogeneity and we show examples of meter-scale cloud phase heterogeneity in the SOCRATES dataset.

https://doi.org/10.5194/amt-2021-123 Preprint. Discussion started: 21 May 2021. © Author(s) 2021. CC BY 4.0 License.

Mixed-phase processes within Southern Ocean clouds moderate cloud radiative effects (Bodas-Salcedo et al., 2016; McCoy et al., 2014a) and cloud-climate feedbacks associated with the stormy region (McCoy et al., 2014b). The presence of small amounts of ice within liquid-dominated mixed-phase clouds can substantially increase precipitation as compared with warm clouds with similar thickness, due to efficient cold precipitation formation (Bergeron, 1928; Field and Heymsfield, 2015).
Increased precipitation can reduce cloud lifetime (Albrecht, 1989) and increase aerosol scavenging (Radke et al., 1980).

particle image classification have, would limit the scope of our TTV set to particles large enough to identify by eye that have an unambiguous phase. This would result in our TTV set being substantially different from a set of randomly sampled particles from SOCRATES, and reduce the generalizability of our machine learning model for the whole SOCRATES dataset. Instead, we use in-situ flight data, including temperature from the HARCO heated total air temperature sensor (EOL, 2019), water vapor mixing ratio from the Vertical Cavity Surface Emitting Laser (VCSEL) hygrometer (Zondlo et al., 2010; Diao, 2020) and voltage from the Rosemount Icing Detector (RICE) (EOL, 2019), to identify flight periods where the hydrometeors are most likely to be all or mostly the same phase. Equations of saturation pressure with respect to liquid and ice (Murphy and Koop, 2005) were used to calculate relative humidity with respect to liquid (RH) and ice (RHi), respectively. Uncertainties of RH and RHi can be derived based on the uncertainties associated with temperature and water vapor. Uncertainties range from 6.4% to 6.8% for RH, and from 6.5% to 6.9% for RHi from 0 °C to -40 °C, respectively. Identifying liquid phase regions of cloud is simple because frozen hydrometeors rarely persist at temperatures above 5 °C (Yuter et al., 2006; Oraltay and Hallett, 2005).
Thus, we select a five-minute flight period where the temperature varies between 6 °C and 12 °C as a liquid-only period. We show time series of 1-Hz temperature, RH, particle count, and liquid fraction from UWILD for this region in Figure 3a. RH data are available only towards the end of the period, where the RH is close to 100%. UWILD classifies most particles as liquid throughout the flight period, but its accuracy decreases towards the end of the period. We explain how UWILD classifies particles in Section 3 and we quantify UWILD's performance and identify biases in its classifications in Section 4. A histogram of temperature for the liquid-only period is shown in Figure 2b and a normalized histogram of area-equivalent diameter (Deq) is shown in Figure 4. Small particles (< 100 pixels or Deq < 0.1 mm) dominate the liquid-only dataset.

Supercooled water can persist at all temperatures above the homogeneous nucleation threshold of -40 °C (Korolev et al., 2017) and below 0 °C. Since there are no multi-second periods of in-cloud data from SOCRATES with temperatures below -40 °C (Figure 2b), we cannot be certain that all particles within any given SOCRATES flight period are frozen. However, we can use the aforementioned atmospheric parameters and particle probe images to identify periods where we have very high confidence that over 99% of the particles are frozen. We refer to these as ice-dominated periods. We use temperature, RH with respect to ice and liquid water, voltage from the RICE, and particle images from the PHIPS and the 2D-S to identify two ice-dominated periods, which we show in Figures 3b and 3c. The ice-dominated periods are defined as having no RICE response while being sub-saturated with respect to liquid and supersaturated with respect to ice. A RICE response, which consists of the 1-Hz voltage oscillating over a 20-second period, is expected when the supercooled liquid water content exceeds 0.01 g m⁻³ (Biter et al., 1987). RF04 is the source of 70% of particles in the combined ice-dominated dataset and RF01 is the source of 30%. A histogram of temperature for the combined ice-dominated dataset is shown in Figure 2b and a normalized histogram of Deq for the same dataset is shown in Figure 4. The ice-dominated dataset is composed primarily of medium-size and large particles (≥ 100 pixels or Deq ≥ 0.1 mm).
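The humidity criteria used to flag ice-dominated periods can be illustrated with a minimal sketch. It evaluates the Murphy and Koop (2005) saturation vapor pressure parameterizations and applies the three stated conditions (no RICE response, sub-saturated with respect to liquid, supersaturated with respect to ice). The function names, the use of vapor pressure in Pa as the input, and the boolean `rice_response` flag are assumptions made for illustration; the study's actual processing chain (for example, the conversion from VCSEL mixing ratio) is not reproduced here.

```python
import math

def e_sat_liquid(t_k):
    """Saturation vapor pressure over liquid water in Pa (Murphy and Koop, 2005)."""
    return math.exp(
        54.842763 - 6763.22 / t_k - 4.210 * math.log(t_k) + 0.000367 * t_k
        + math.tanh(0.0415 * (t_k - 218.8))
        * (53.878 - 1331.22 / t_k - 9.44523 * math.log(t_k) + 0.014025 * t_k)
    )

def e_sat_ice(t_k):
    """Saturation vapor pressure over ice in Pa (Murphy and Koop, 2005)."""
    return math.exp(
        9.550426 - 5723.265 / t_k + 3.53068 * math.log(t_k) - 0.00728332 * t_k
    )

def relative_humidities(vapor_pressure_pa, t_k):
    """Return (RH, RHi) in percent for a given vapor pressure and temperature."""
    rh = 100.0 * vapor_pressure_pa / e_sat_liquid(t_k)
    rhi = 100.0 * vapor_pressure_pa / e_sat_ice(t_k)
    return rh, rhi

def is_ice_dominated_candidate(vapor_pressure_pa, t_k, rice_response):
    """Apply the three criteria: no RICE response, RH < 100%, RHi > 100%.

    `rice_response` is a hypothetical boolean summarizing whether the RICE
    voltage oscillated during the period.
    """
    rh, rhi = relative_humidities(vapor_pressure_pa, t_k)
    return (not rice_response) and rh < 100.0 and rhi > 100.0
```

Because RH and RHi carry uncertainties of roughly 6-7% in this dataset, a flag of this kind would only be a first screening step; the manual image inspection described in the text remains necessary.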
We manually inspected 500 2D-S particle images from the combined ice-dominated dataset with Deq ≥ 0.2 mm and found that all of them are unambiguously frozen. We also manually inspected 500 randomly sampled images with Deq < 0.2 mm, which account for 13% of the particles in the ice-dominated dataset, and found that up to 1% of the particles are unambiguously liquid and an additional 1% could be liquid but are unidentifiable by eye. Thus, as many as 0.3% of the particles in the ice-dominated regions may be liquid. If UWILD classified all particles in the ice-dominated region correctly, it would have a slightly higher performance than what is reported here (Section 4) because we compute performance metrics assuming that all particles in the ice-dominated region are frozen.

We also use phase classifications from the PHIPS to evaluate our liquid-only and ice-dominated periods. The PHIPS dataset includes both automated classifications using the scattering phase function and manual classifications using particle images. Because of the greater maximum scattering data acquisition rate compared to the maximum imaging rate described in the beginning of this section, there are more automated classifications than manual classifications. The PHIPS automated classification algorithm identified 132 particles as liquid and 45 particles as frozen during our liquid-only period, but manual classifications of 106 images are universally liquid. This bias in the PHIPS algorithm arises from the fact that large liquid drops are typically aspherical because they are distorted due to pressure differences in the instrument's inlet and are thus misclassified as ice (Fritz Waitz, personal communication). While the PHIPS was not available for RF01, the PHIPS algorithm automatically classified 3905 particles as frozen and only 4 particles as liquid during our ice-dominated period from RF04. Manual classifications of 320 images are universally frozen.

We compare the liquid-only period and two ice-dominated periods with two examples of mixed-phase periods in Figure 3. The first mixed-phase period (Figure 3d) samples a stratocumulus cloud within the boundary layer. The RICE voltage oscillates throughout the period, indicating the presence of supercooled water exceeding 0.01 g m⁻³, and the liquid fraction is greater than 75% most of the time. The cloud is saturated with respect to liquid. The second mixed-phase period (Figure 3e) sampled near the top of an altostratus cloud. This period is colder and, on average, sub-saturated with respect to liquid water and saturated with respect to ice. The RH is sub-saturated even though the cloud is liquid-dominated in the sampled region because the aircraft is skirting a horizontally variable cloud top. The liquid fraction is close to 1.0 and the RICE voltage oscillates until 00:23:00 UTC (∼ 0.4 UTC), when the liquid fraction decreases abruptly and the RICE voltage stabilizes. The change in phase occurs because the aircraft is initially sampling the cloud top and transitions to sampling below the cloud top.
The liquid-only period includes 90,000 particles that pass our size threshold, while the ice-dominated periods include 55,000 particles. All particles drawn from the liquid-only period are labelled liquid and all particles drawn from the ice-dominated regions are labelled ice. These labels are taken as truth for the purposes of model training and evaluation (Section 4). We partition the particles into three size categories: small (corresponding to 25-99 pixels or Deq of approximately 0.056-0.1 mm), medium (100-699 pixels or 0.1-0.3 mm), and large (> 700 pixels or > 0.3 mm). In the remainder of this study, all references to particle size will be in terms of Deq. We then randomly subsample the liquid particles so that the total numbers of ice and liquid particles are equal, preserving the ratio between the three size categories. These particles are then partitioned into training (60%), test (20%), and validation (20%) sets, again preserving the original ratios between the three size bins for each phase separately, as well as the balance of particles from each phase. We refer to the combined training, test, and validation sets as the TTV set.
We explicitly preserve these rough size distributions to ensure that the test set has a reasonable number of small ice crystals and large liquid drops for evaluation, as these particles are rare enough in the full TTV dataset that a completely random partition risks having them undersampled in the test set. Preserving a balance between liquid and ice particles simplifies interpretation of model performance summary statistics; this is further discussed at the beginning of Section 4. The composition of the TTV set is broken down in Table 2. We show histograms of particle properties from the TTV set and from the whole SOCRATES dataset (14 flights) in the appendix, and discuss out-of-sample particles.
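The size binning and the stratified 60/20/20 partition described above can be sketched as follows. This is a minimal illustration rather than the authors' code: the 10 µm pixel size, the treatment of exactly 700 pixels as "large", the `(phase, n_pixels)` tuple layout, and all function names are assumptions. The balancing step that subsamples liquid particles is omitted for brevity.

```python
import math
import random
from collections import defaultdict

PIXEL_MM = 0.010  # assumed 2D-S pixel size of 10 micrometres, in mm

def deq_mm(n_pixels, pixel_mm=PIXEL_MM):
    """Area-equivalent diameter in mm of a particle shadowing n_pixels diodes."""
    area_mm2 = n_pixels * pixel_mm ** 2
    return 2.0 * math.sqrt(area_mm2 / math.pi)

def size_class(n_pixels):
    """Small: < 100 pixels, medium: 100-699 pixels, large: >= 700 (assumed)."""
    if n_pixels < 100:
        return "small"
    if n_pixels < 700:
        return "medium"
    return "large"

def stratified_ttv_split(particles, fractions=(0.6, 0.2, 0.2), seed=0):
    """Split (phase, n_pixels) tuples into train/test/validation lists.

    Each (phase, size class) stratum is shuffled and split separately, so
    rare strata such as small ice keep their share in every subset.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for particle in particles:
        phase, n_pixels = particle
        strata[(phase, size_class(n_pixels))].append(particle)

    split = {"train": [], "test": [], "validation": []}
    for key in sorted(strata):
        members = strata[key]
        rng.shuffle(members)
        n_train = round(fractions[0] * len(members))
        n_test = round(fractions[1] * len(members))
        split["train"] += members[:n_train]
        split["test"] += members[n_train:n_train + n_test]
        split["validation"] += members[n_train + n_test:]  # remainder
    return split
```

Under the assumed pixel size, `deq_mm(25)` is about 0.056 mm and `deq_mm(700)` is about 0.30 mm, consistent with the size ranges quoted in the text.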

A key consideration for all machine learning applications is the choice of machine learning model. One approach to analyzing particle probe images is to apply deep learning directly to the captured image (e.g. Touloupas et al., 2020; Xiao et al., 2019; Wu et al., 2020; Korolev et al., 2020). Here, we take a simpler approach and employ a random forest model (Breiman, 2001; Pedregosa et al., 2011), which requires a preprocessing step to extract relevant image features (e.g., particle area or max diameter; full list in Table 1). Classification is then carried out using these features as inputs (Lindqvist et al., 2012; Nurzynska et al., 2012, 2013; Praz et al., 2018, 2017; O'Shea et al., 2016). An advantage of this approach is that it allows for the inclusion of features not directly related to particle appearance; in particular, we show that inter-arrival time is a valuable discriminator of liquid and ice particles. This variable would be more difficult to incorporate in an image-based deep learning model. Random forests can also provide more interpretable results, as the trained model can be analyzed to investigate relative feature importance. Another advantage (shared by many machine learning approaches) is the determination of classification confidence, which can be useful in filtering particles depending on application, or estimating uncertainties in calculated bulk properties such as liquid water content.
For a decision tree trained using a supervised learning approach, the training set is split by thresholding features (e.g. whether particle area ratio is more or less than 0.8); precisely which feature and which value is determined by whatever 'best' splits the dataset into distinct categories (for UWILD the max Gini impurity reduction criterion is used). This process is repeated on each data subset until the data are entirely partitioned into distinct categories. In a random forest, multiple such trees (100 for UWILD) are trained using random subsets of data features. Randomness is introduced here to reduce overfitting to the training set and improve model generalizability. For a given test datapoint, each tree provides a classification, and the plurality vote of all trees is the overall category assigned to the datapoint, with the proportion of trees voting for that category as the model confidence. We include a simple schematic of UWILD in Figure 1c.
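Two ingredients of this description, the Gini impurity reduction used to choose a threshold and the plurality vote that yields a confidence, can be sketched in a few lines. This is an illustrative toy with hypothetical function names, not UWILD itself (which uses a full random forest implementation); it handles a single feature and a single split only.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0 for a pure node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def gini_reduction(values, labels, threshold):
    """Impurity decrease obtained by splitting on values <= threshold."""
    left = [lab for val, lab in zip(values, labels) if val <= threshold]
    right = [lab for val, lab in zip(values, labels) if val > threshold]
    n = len(labels)
    weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(labels) - weighted

def best_threshold(values, labels):
    """Midpoint between sorted unique values that maximizes Gini reduction."""
    unique = sorted(set(values))
    midpoints = [(a + b) / 2 for a, b in zip(unique, unique[1:])]
    return max(midpoints, key=lambda t: gini_reduction(values, labels, t))

def forest_vote(tree_votes):
    """Plurality label across trees and the fraction of trees voting for it."""
    label, count = Counter(tree_votes).most_common(1)[0]
    return label, count / len(tree_votes)
```

For example, a forest of 100 trees in which 75 vote "ice" would return the label "ice" with a confidence of 0.75.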

A model is well-calibrated if its model confidence (internal prediction probability) accurately reflects its performance. Figure 5 shows this relationship between model confidence (from the random forest votes) and model accuracy (how likely the model was to correctly classify particles), evaluated on the test set. A one-to-one relationship is ideal because it indicates that we can directly use model confidence as an estimate of prediction uncertainty. For example, a particle classified as ice with a model confidence of 75% should be seen as 75% likely to be ice, and 25% likely to be liquid. Figure 5 also shows that UWILD has a confidence of 95% or higher for 73% of the particles in the TTV set.
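A calibration check of this kind can be sketched by binning particles by confidence and comparing the mean confidence in each bin with the fraction classified correctly. The bin count, the 0.5-1.0 confidence range (appropriate for a two-class plurality vote), and the function name are assumptions for illustration; the binning actually used for Figure 5 is not stated here.

```python
def reliability_table(confidences, correct, n_bins=5, lo=0.5, hi=1.0):
    """Return (mean confidence, accuracy, count) for each non-empty bin.

    confidences: model confidence of the predicted class for each particle.
    correct: booleans, True where the prediction matched the label.
    """
    width = (hi - lo) / n_bins
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Clamp so that conf == hi falls in the last bin.
        idx = max(0, min(int((conf - lo) / width), n_bins - 1))
        bins[idx].append((conf, ok))

    table = []
    for members in bins:
        if not members:
            continue
        mean_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        table.append((mean_conf, accuracy, len(members)))
    return table
```

A well-calibrated model would give rows in which the first two entries are approximately equal.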
To better understand how the UWILD classifier determines particle phase, we quantify how much it relies on each of the fifteen different features (listed in Table 1) using permutation feature importance analysis. This technique measures how much a model relies on the information encoded in a particular feature by calculating model accuracy on a test set, randomly shuffling a given feature, and measuring how much the accuracy decreases. The random shuffling of a feature renders that feature useless to the model classification and the model accuracy will decrease substantially for a very significant feature. This analysis can be rapidly performed multiple times for each feature. Another advantage of permutation feature importance is that it is a function of the dataset being used for evaluation as well as the model, and so this metric can be calculated separately for different subsets of the data, or for entirely new test datasets. Other measures of feature importance (such as impurity-based feature importance) are functions only of the model and do not share this advantage. A relevant drawback to all measures of feature importance is that they are affected by correlations between features. As correlated features share information, model performance may not decrease as much when a particular feature is shuffled, as a different, (previously) correlated feature may still encode the relevant information. However, decorrelating variables prior to use, which would address this issue, complicates model interpretation (while not significantly affecting model performance), and so we chose to preserve original features, and caution against too minute a dissection of the permutation feature importance.

Figure 6 shows the permutation feature importance of the top ten features, split by particle size and evaluated on the model test set. Overall, we see that width, area ratio, and log(inter-arrival time) are the most important. The next two features (max dimension and length) both encode size and correlate well with width, while the remaining features have low impact on model accuracy. Put another way, the model primarily relies on these first three features for classification. Considering differences between size classes, we note that width (and in fact all size-related features) is most important for medium particles, which is to be expected as larger particles are predominantly ice and smaller particles predominantly liquid, with medium particles varying the most. For small particles, log(inter-arrival time) is most important, and for large particles, area ratio is most important (likely because small and medium particles are mostly quasi-spherical irrespective of phase). Regarding correlated features, the results in Figure 6 should not be taken to mean that particle width in particular is a key discriminator as opposed to length or max dimension, but rather that the width is a good estimator of particle size, which is the particle characteristic that matters in determining its phase. If particle width were removed from the feature set, then another size-encoding feature would appear more important.
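The permutation procedure described above can be sketched generically. The sketch assumes a hypothetical `predict` callable that maps a feature dictionary to a phase label; the row layout, repeat count, and function names are illustrative choices, not the configuration used for Figure 6.

```python
import random

def accuracy(predict, rows, labels):
    """Fraction of rows for which predict(row) matches the label."""
    return sum(predict(row) == lab for row, lab in zip(rows, labels)) / len(labels)

def permutation_importance(predict, rows, labels, feature, n_repeats=10, seed=0):
    """Mean drop in accuracy after shuffling one feature across rows.

    rows: list of dicts mapping feature name -> value.
    predict: callable taking one such dict and returning a label.
    """
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, labels)
    drops = []
    for _ in range(n_repeats):
        column = [row[feature] for row in rows]
        rng.shuffle(column)
        # Copy each row with only the chosen feature replaced.
        shuffled = [dict(row, **{feature: val}) for row, val in zip(rows, column)]
        drops.append(baseline - accuracy(predict, shuffled, labels))
    return sum(drops) / n_repeats
```

Because the function takes the evaluation rows as an argument, it can be applied separately to small, medium, and large particles, which is the property of permutation importance that the text highlights.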

Comparison between phase classification schemes
For quantitative evaluation of a classification model, an intuitive summary metric is model accuracy (the ratio of correct classifications to total classifications). The overall accuracy of UWILD, Holroyd, and Area Ratio on our test set is 94.9%, 78.5% and 71.8%, respectively, indicating that UWILD is performing quite well. However, accuracy is most suitable for balanced classification problems (i.e., when data are spread evenly across categories). In the case of highly unbalanced problems, high accuracy can be achieved by systematically erring in favor of the dominant category. For example, small particles in the test set are overwhelmingly liquid (Figure 4) so high accuracy can be achieved by predicting that all small particles are liquid at the expense of correctly classifying small ice particles.

Model performance, especially for unbalanced classification problems, can be better measured by calculating precision (the ratio of all particles correctly classified as liquid to all particles classified as liquid) and recall (the ratio of all particles correctly classified as liquid to all true liquid particles). Both scores range from 0-1, and they penalize false positives and false negatives, respectively, for a particular category. These scores are unified in the F1 score, which is their harmonic mean:

$$F_1 = 2\,\frac{\mathrm{precision} \times \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$$

The F1 score is a conservative measure of model performance because the lesser of recall and precision will dominate the harmonic mean, and it can be calculated for various data subsets. We show the F1 scores for Holroyd, Area Ratio and UWILD in Figure 7 as a function of phase and size class. This analysis is performed on our test set. UWILD outperforms Holroyd and Area Ratio. Although UWILD performs less well when classifying small ice and large liquid, it nevertheless has a particularly large performance advantage over Holroyd and Area Ratio for those categories. Holroyd outperforms Area Ratio for medium and large ice particles, whereas Area Ratio outperforms Holroyd for small ice particles and liquid particles of all sizes. In the rest of this section, we identify differences between the algorithms and biases within each algorithm to explain these discrepancies in their performances.

We use the whole SOCRATES dataset (14 flights), which includes 5.76 million classified 2D-S images, for our analysis for the rest of the paper. Table 3 shows how many particles each algorithm classified as liquid and ice for each size class, and in total. In general, Holroyd classifies the most particles as ice and Area Ratio classifies the most particles as liquid. UWILD and Area Ratio both classify over 90% of the small particles as liquid whereas Holroyd classifies only 70% of them as liquid. Area Ratio classifies three quarters of the medium particles and half of the large particles as liquid. In contrast, the other two algorithms classify about 40% of the medium particles and 0% (Holroyd) to 2.7% (UWILD) of the large particles as liquid.
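The precision, recall, and F1 definitions used earlier in this section can be written out as a short sketch. The function name and label strings are illustrative; the `positive` argument selects the category being scored so that the same function serves for liquid and for ice.

```python
def precision_recall_f1(true_labels, predicted_labels, positive="liquid"):
    """Precision, recall and F1 for one category, from paired label lists."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    # Harmonic mean of precision and recall.
    f1 = 2.0 * precision * recall / (precision + recall)
    return precision, recall, f1
```

As a usage note, a classifier that labels every particle "liquid" on a set of nine liquid and one ice particle reaches 90% accuracy but an F1 score of 0 for ice, which is the failure mode that motivates reporting F1 by phase and size class.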
In Figure 8, we show the fraction of particles classified as liquid, from the three phase discrimination algorithms, in the phase spaces of temperature vs particle size (left column) and RH vs particle size (right column). In the second row, we show a 2D histogram of the confidence from UWILD, and in the third, fourth and fifth rows, we show 2D histograms of the fraction of particles classified as liquid by the three phase discrimination algorithms. At temperatures greater than -20 °C, UWILD confidence is lowest in areas where UWILD transitions between having a high liquid fraction and a low liquid fraction (Figure 8b). UWILD confidence is also low for small particles at temperatures below -20 °C, which can have high or low liquid fractions. All three algorithms show a decrease in liquid fraction for small particles at temperatures between -20 °C and -30 °C and an increase in the liquid fraction at temperatures below -30 °C (Figure 8c-e). This behavior is a consequence of small sample size, as the liquid-dominated data below -30 °C come from just one flight that sampled the top of an altostratus cloud, whereas the ice-dominated data at higher temperatures come from several flights that sampled the middle of altostratus clouds.
Since temperature and RH are not inputs to any of the algorithms, we can use them to gauge whether the particle classifications make physical sense. In other words, we can use these atmospheric parameters to make broad predictions of hydrometeor phase and determine which algorithm is most consistent with these predictions. We expect that small particles will be entirely liquid above 0 °C and that large particles will be primarily liquid above 0 °C and entirely liquid above 5 °C due to having longer melting timescales (Oraltay and Hallett, 2005).
Ice and liquid precipitation formation mechanisms have been observed to operate simultaneously at temperatures as low as Huffman and Norman, 1988; Cober et al., 2001b; Kajikawa et al., 2000; Korolev et al.; Silber et al., 2019) so we cannot use temperature alone to make a prediction for the liquid fraction of particles at temperatures below 0 °C. However, we do expect to see a size dependence in the liquid fraction. To our knowledge, the largest liquid particle associated with supercooled drizzle formation (as opposed to melting frozen hydrometeors) that has been noted in the literature has a maximum dimension of 0.625 mm (Cober et al., 2001b). Most SOCRATES data were collected in conditions that could not support the re-lofting of melted frozen hydrometeors. Furthermore, melted frozen hydrometeors are rarely lofted to temperatures below -5 °C in environments that do support re-lofting (Oraltay and Hallett, 2005). Thus, we expect that medium-sized and large droplets at temperatures below -5 °C are primarily formed via supercooled drizzle formation and will not be present at the largest sizes (0.625-1 mm).
Holroyd and UWILD classify too many medium-sized (0.1 mm < Deq < 0.3 mm) and large (Deq > 0.3 mm) particles as ice at warm temperatures and the bias is more pronounced for Holroyd. Liquid fractions for Holroyd sharply decrease to near 0.0 for particles with Deq between 0.2 and 0.3 mm at all temperatures (Figure 8e). This strong size dependence arises from the fact that Holroyd only considers the presence or absence of a Poisson spot for particles with maximum dimensions less than 0.3 mm (Figure 1b). Particles exceeding that maximum dimension threshold must be nearly spherical in shape to be classified as liquid because the presence of a Poisson spot is not factored into the phase classification. UWILD's liquid fraction for particles with Deq between 0.2 and 0.5 mm at temperatures between 0 °C and 5 °C is approximately 0.5 (Figure 8c). This is unrealistically low particularly for the warmer end of this temperature range, where most frozen particles would have melted.
Holroyd also classifies many small particles (Deq < 0.1 mm) as ice (Figure 8e) at all temperatures. Its liquid fraction never

Area Ratio classifies too many large particles (Deq > 0.3 mm) as liquid at cold temperatures (Figure 8d). Area Ratio's liquid fraction rarely drops below 0.8 for particles with Deq > 0.2 mm and temperatures below -5 °C, whereas Holroyd and UWILD have liquid fractions near 0.0 (Figure 8c,e). While a liquid fraction between 0.5 and 0.8 for these particles is not physically impossible, the fact that there is no decrease in the liquid fraction with increasing particle size for particles with Deq between 0.5 mm and 1 mm, where particle sizes and temperatures are inconsistent with supercooled drizzle formation, implies that Area Ratio's higher liquid fractions may be unrealistic.
There are also clear differences between the three algorithms' classifications in RH vs particle size space (Figure 8h-j). Holroyd has a liquid fraction closer to 0.75 for the same region. Uncertainty in RH is around 7% so while high liquid fractions are most common at liquid saturation they occur at a wide range of RH values. Additionally, fluctuations in RH from dry air entrainment and in-cloud circulation can lead to deviations from liquid saturation at 1-Hz resolution. UWILD classifies fewer particles as liquid in subsaturated air than either Area Ratio or Holroyd. In the midsize particle range (0.1 < Deq < 0.2 mm), which includes drizzle, the liquid fraction is near 1.0 when the RH is close to 100% and it drops down to 0.2 when RH decreases to 50%.
Liquid particles can persist in subsaturated air at or below the cloud base, and these regions were purposefully sampled within the boundary layer during SOCRATES. Drizzle drops falling below liquid clouds evaporate in the subsaturated environment, reducing their size. Subsaturated air can also be associated with ice-dominated clouds as RHi is higher than RH throughout the mixed-phase temperature range. In ice-dominated clouds, cloud droplets are produced at the turbulent cloud top and tend to freeze before forming drizzle drops. For both of these reasons, we expect a decrease in the average size of liquid particles as the RH decreases below liquid saturation. UWILD is the only algorithm for which the 50% liquid fraction shifts to smaller sizes as the RH falls below 100%. Thus, UWILD's lower liquid fractions in regions with RH < 100%, for particles in the midsize particle range (0.1 < Deq < 0.2 mm), are more realistic than Holroyd's and Area Ratio's higher liquid fractions.
UWILD is the only algorithm of the three that can achieve liquid fractions near 0.0 and near 1.0, in both temperature vs particle size space (Figure 8c) and RH vs particle size space (Figure 8h). Thus, it has the flexibility to represent both the liquid-only regions that we expect at the warmest temperatures and near liquid saturation, and the ice-dominated regions that we expect at the coldest temperatures and the largest particle sizes, and in subsaturated regions.
The dashed boxes labelled A-D on the 2D histograms in the left column of Figure 8 highlight areas of disagreement between the models, whereas box E highlights agreement between the models regarding the presence of supercooled water at -35 °C.
Figure 9 shows randomly sampled images from each of the five regions within the dashed boxes. Each particle image has the UWILD confidence printed above the particle and the phase classifications from all three algorithms printed below the particle.
Since we have chosen to primarily focus on areas of disagreement between the algorithms, there are more misclassifications in these regions than in the dataset as a whole.
Box A highlights a region where Area Ratio has a liquid fraction near 1.0 across all size categories, Holroyd has a liquid fraction of about 0.75 for small particles and 0.0 for large particles, and UWILD has a liquid fraction of 1.0 for the warmest temperatures and 0.5 for temperatures near 0 °C. Since temperature ranges from 0 °C to 10 °C here, we expect that the particles are mainly liquid although large ice particles can persist at temperatures warmer than 0 °C. Furthermore, quasi-spherical frozen particles can have a close resemblance to large liquid drops and the two cannot necessarily be distinguished by eye from 2D-S images.
Randomly sampled particles from this region appear to be predominantly liquid due to the absence of rough edges along the perimeter or non-spherical habits (Figure 9a). Out of 50 randomly sampled images, Area Ratio classifies one particle as ice, Holroyd classifies 21 particles as ice, and UWILD classifies 12 particles as ice. Of the sampled particles classified by UWILD with a confidence ≥ 75%, all but one of them appear to be properly identified as liquid. The only exception is one particle with a confidence of 81%, which is likely misclassified due to the particle being truncated at the end of the image buffer and thus yielding a lower area ratio. A greater proportion of particles with a confidence below 75% are classified as ice by UWILD but are likely liquid, comprising about 55% of the sampled particles for these lower confidences. The misclassifications can be removed from UWILD, if desired, by filtering out particles that have a confidence of less than 75% and/or are touching the edge of the image buffer. Holroyd misclassifies 9 more liquid particles as ice than UWILD, but does not provide a measure of confidence that can be used to assess the likelihood of misclassification. Of the particles that UWILD likely misclassifies as ice, many have particularly large Poisson spots. Particle area is computed from shadowed diodes exclusively so Poisson spots are not included. Additionally, particles with Poisson spots are resized following Korolev (2007). Both of these factors affect the calculation of area ratio and, thus, the phase classification of the particle.
Box B highlights a temperature and size range where UWILD and Area Ratio are in agreement that the liquid fraction is near 1.0 but Holroyd has a lower liquid fraction of about 0.75. The 50 randomly sampled images shown appear to be mostly liquid with some irregular small ice crystals also present. UWILD performs better for these temperatures and particle sizes (2020) and, thus, are more likely to have quasi-spherical heavily rimed habits with larger area ratios. The different particle habits explain the discrepancy in Area Ratio's performance in the two different regions.

In both boxes C and D, UWILD and Holroyd likely misclassify several large particles as ice and those particles are largely associated with low confidence in UWILD. It is difficult to quantify this bias because there are several particles that could be either quasi-spherical frozen particles or large droplets and cannot be distinguished by eye. This does not mean that UWILD cannot classify those particles because it uses inter-arrival time in addition to image-derived parameters to make classifications.
Box E highlights a region where all three phase discrimination algorithms have liquid fractions greater than 0.5 despite sampling very cold temperatures (-33 °C to -36 °C). Randomly sampled images with high confidence in UWILD have spherical habits and most have Poisson spots as well, suggesting that the particles in this region are primarily liquid. UWILD and Area Ratio classify all 20 randomly sampled images as liquid whereas Holroyd classifies 4 particles in the sample as ice. These particles are particularly small and lack Poisson spots so we cannot identify their phase by eye. Nevertheless, it is clear that all three algorithms correctly identify this region as liquid-dominated. These particles were sampled during a period from RF03 that is plotted in the second half of Figure 3e, where the aircraft skirted the top of an altostratus layer.

Particle size distributions
We use UWILD's classifications and confidences to compute median 1-Hz liquid and frozen particle size distributions (PSDs) and uncertainties for all SOCRATES data (14 flights). We show average PSDs for five different temperature ranges, and for the whole dataset, in Figure 10. The x-axis is D_eq (consistent with other figures) and the y-axis is the particle concentration normalized by the log of the bin width. Both axes are plotted on a log scale. The dashed lines are deterministic distributions, which means they are generated using the UWILD classifications without taking the model confidence into account. All classified particles are used for this analysis regardless of model confidence. The solid lines and shaded areas around them are the median and interquartile range of 30 bootstrapped samples, which are generated using the model confidence. For example, if a particle is classified as ice with 75% confidence then it is considered an ice particle for the deterministic distribution but, on average, it will be considered an ice (liquid) particle in 75% (25%) of the bootstrapped samples. Note that due to the log scale on the y-axis, the effect of bootstrapping is mainly noticeable where concentrations are small, and is strongest where the differences between ice and liquid concentrations span orders of magnitude. Note also that the bootstrapped distributions fall entirely between the deterministic distributions. This can be understood by considering, for example, the smallest particles just below 0 °C, where there are approximately 100 times as many liquid particles as ice. If, when taking into consideration model confidence, 2% of the liquid particles are reclassified as ice, this is barely noticeable in the liquid particle concentration, but results in a tripling of the ice particle concentration (Figure 10c).
As the bootstrapped PSDs are better representations of the true PSDs, considering model confidence is most essential in the areas in Figure 10 where there are large discrepancies between the deterministic and bootstrapped distributions (e.g. estimating sub-mm ice particle concentrations around 0 °C).
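The difference between the deterministic and bootstrapped counts can be illustrated with a short sketch for a single size bin. It assumes that each bootstrapped sample keeps a particle's predicted phase with probability equal to the model confidence and assigns the opposite phase otherwise. The function names and the example population are hypothetical, and the normalization to concentration per log bin width is omitted.

```python
import random
import statistics


def deterministic_ice_count(particles):
    """Count ice particles using the predicted label only."""
    return sum(1 for label, _ in particles if label == "ice")


def bootstrapped_ice_counts(particles, n_samples=30, seed=0):
    """Resample each particle's phase from its confidence, n_samples times."""
    rng = random.Random(seed)
    counts = []
    for _ in range(n_samples):
        n_ice = 0
        for label, confidence in particles:
            # Keep the predicted label with probability equal to the
            # confidence, otherwise assign the opposite phase.
            keep = rng.random() < confidence
            if (label == "ice") == keep:
                n_ice += 1
        counts.append(n_ice)
    return counts


# Hypothetical size bin: 1000 liquid particles at 98% confidence and
# 10 ice particles at 100% confidence (a 100:1 liquid-to-ice ratio).
particles = [("liquid", 0.98)] * 1000 + [("ice", 1.0)] * 10
det_ice = deterministic_ice_count(particles)
boot = bootstrapped_ice_counts(particles)
median_ice = statistics.median(boot)
q1, _, q3 = statistics.quantiles(boot, n=4)  # bounds of the interquartile range
```

In this example about 2% of the liquid particles are reassigned to ice in each sample, so the bootstrapped ice count is expected to be roughly three times the deterministic count, while the liquid count changes by only about 2%.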

Within the SOCRATES dataset, small particles (D_eq < 0.1 mm) are more likely to be liquid at all temperature ranges but they have much higher concentrations and are more liquid-dominated above -20 °C (Figure 10c,d,e). The concentrations of medium-sized (0.1 mm < D_eq < 0.3 mm) and large (D_eq > 0.3 mm) particles decrease as temperature increases. Medium-sized particles are ice-dominated between -40 °C and -20 °C (Figure 10a) and liquid-dominated above -5 °C (Figure 10d In Section 4, we showed that UWILD misclassifies some large liquid particles as ice. We examined 200 randomly sampled 2D-S images of particles with D_eq > 0.16 mm from temperatures between 0 °C and 5 °C, and found that 16% of particles classified as ice are actually liquid (not shown). These misclassified particles universally lack Poisson spots. Thus, large particles within that temperature range are indeed ice-dominated but to a lesser extent than what the PSDs suggest. We also examined 200 randomly sampled 2D-S images of particles with D_eq > 1 mm from temperatures between 5 °C and 40 °C and found that all particles are classified as ice but are actually liquid (not shown). These particles have an elongated shape due to being distorted by the instrument inlet, lack a Poisson spot, and often touch the edge of the image buffer so that they are partially cut off. We note that classification skill decreases in general for particles approaching 1 mm because particles of this size are less likely to be fully imaged by the instrument. Thus, we caution that phase classifications for the largest particles may be less certain and that the 5 °C to 40 °C temperature range is particularly affected by misclassifications of large drops as ice. We note that there are so few of these large misclassified liquid particles at warm temperatures that they do not show up in Figure 8c, which uses a threshold of 100 particles for each 2D histogram bin.

Cloud phase heterogeneity
We also use UWILD's classifications and confidences to compute a 1-Hz estimate of sub 1-Hz cloud phase heterogeneity.
Cloud phase heterogeneity, or the degree to which liquid and ice particles are evenly mixed within mixed-phase clouds, can influence cloud radiative (Sun and Shine, 1994) and thermodynamic properties (Korolev et al., 2017). It may also modulate the rates of certain mixed-phase processes such as the Wegener-Bergeron-Findeisen process (Tan and Storelvmo, 2016), and has implications for how those processes should be parameterized in microphysics models. Most studies of cloud phase heterogeneity from in-situ observations have focused on 1-Hz data (Korolev et al., 2003; D'Alessandro et al., 2019; Field et al., 2004; Cober et al., 2001a). Field et al. (2004) used PSDs from a Small Ice Detector in combination with other in-situ measurements to identify cloud phase and found that segments as short as 100 m could contain both liquid and ice. They used 1-Hz data but could investigate relatively small length scales due to low aircraft speeds (100-120 m s⁻¹). Here, we use single particle phase classifications to investigate sub 1-Hz cloud phase heterogeneity and we identify mixed-phase periods on the meter-scale.
From the single particle classification data, we derive an estimate of sub 1-Hz heterogeneity by considering whether adjacent particles in the 2D-S image buffer are of the same phase or of different phases, which we denote a phase 'flip' (from I-L or from L-I). If, for a 1-second period, there are many phase flips given the number of particles, that sample is more heterogeneous than one where there are few (or no) phase flips for a population of particles. We leverage the fact that our classifications are probabilistic in determining phase flips and create a probabilistic phase flip prediction as well. Given two adjacent particles p_1 and p_2: We estimate the most likely number of flips over all particles within a given sample by adding these probabilities together.
Thus a hypothetical sample containing 100 particles may have between 0 flips (completely homogeneous sample with 100% classification confidence on all particles) and 99 flips (particles are alternating 100% likely ice and 100% likely liquid), although both of these extremes are unlikely with our probabilistic estimate. A limitation of this heterogeneity estimate is that it implicitly assumes that phase flip probabilities are independent. An advantage of this metric is that it avoids using particle mass to compute phase heterogeneity. Ice particle mass estimated from 2D-S images can vary over an order of magnitude depending on the assumed mass-dimensional relationship (Wu et al., 2019) and more reliable measurements from a Nevzorov instrument with a deep cone (Korolev et al., 2003) were not available during SOCRATES.
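The probabilistic flip count can be sketched as follows. The pairwise expression used here is an assumption: it takes the flip probability for two adjacent particles to be the probability that they differ in phase when their classifications are treated as independent. The function names are hypothetical, and each particle is represented by its probability of being ice.

```python
def flip_probability(p_ice_1, p_ice_2):
    """Probability that two adjacent particles differ in phase, assuming
    their classifications are independent (ice-liquid or liquid-ice)."""
    return p_ice_1 * (1.0 - p_ice_2) + (1.0 - p_ice_1) * p_ice_2


def expected_flips(p_ice):
    """Sum the flip probabilities over all adjacent pairs in a sample."""
    return sum(flip_probability(a, b) for a, b in zip(p_ice, p_ice[1:]))


# Limiting cases for a hypothetical sample of 100 particles.
homogeneous = expected_flips([1.0] * 100)     # every particle certainly ice
alternating = expected_flips([1.0, 0.0] * 50)  # certain ice, certain liquid, ...
uncertain = expected_flips([0.5] * 100)        # no phase information at all
```

Under this assumption the homogeneous and alternating samples recover the limits of 0 and 99 flips, while a sample of completely uncertain classifications falls halfway between them.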
We create a 1-Hz heterogeneity measure, which we refer to as the phase flip fraction, by dividing the number of probabilistic phase flips described above by the total number of particles imaged by the 2D-S within one second. We implicitly assume that all unclassified particles, which are mainly particles with fewer than 25 pixels and D_eq < 0.1 mm, are liquid, which is a reasonable extrapolation from our PSDs (Figure 10f) as only 1% of the smallest particles are classified as ice. However, this
Code and data availability. The 2D-S (https://data.eol.ucar.edu/dataset/552.009), VCSEL (Diao, 2020), PHIPS (Schnaiter, 2018a, b) and aircraft (EOL, 2019) data used in this study are found on the NCAR EOL data archive. The software packages used to process the OAP data (McFarquhar et al., 2018) and to run the random forest model (Mohrmann et al., 2021) are publicly available as GitHub repositories. The data containing 1-Hz phase-partitioned PSDs and phase flip fraction estimates will be made publicly available on the NCAR EOL data archive upon submission.
Appendix A

A1
We show histograms of 14 particle parameters for the TTV set and the 14 SOCRATES flights analyzed in this study in Figure A1. We do not include the parameter touching_edge here because it is binary. The histograms are plotted on a log scale so that the tails of their distributions are visible. About 0.6% of particle images from the SOCRATES dataset that we analyzed here are out of sample. This means that at least one particle parameter has a value that is outside of the range of the TTV set. Such particles are usually out of sample because they are larger than all of the particles in the TTV set. We examined 45 randomly sampled images from the SOCRATES dataset that have an area-equivalent diameter (D_eq) greater than all particles in the TTV set. These particles are mainly heavily rimed aggregates. All of them are clearly frozen and UWILD classifies them as such. Thus, we do not believe that the small percentage of out of sample particles present in the SOCRATES dataset reduces the performance of UWILD.
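The out of sample criterion described above can be sketched as a range check against the TTV set. This is an illustration only: the parameter names d_eq_mm and area_ratio and the example values are hypothetical, and the functions are not part of the UWILD software.

```python
def feature_ranges(ttv_rows):
    """Minimum and maximum of each parameter over the TTV set."""
    names = ttv_rows[0].keys()
    return {
        name: (min(row[name] for row in ttv_rows),
               max(row[name] for row in ttv_rows))
        for name in names
    }


def is_out_of_sample(row, ranges):
    """True if at least one parameter lies outside the TTV range."""
    return any(not (lo <= row[name] <= hi) for name, (lo, hi) in ranges.items())


# Hypothetical TTV set with two parameters per particle.
ttv = [
    {"d_eq_mm": 0.05, "area_ratio": 0.9},
    {"d_eq_mm": 0.40, "area_ratio": 0.3},
    {"d_eq_mm": 0.90, "area_ratio": 0.6},
]
ranges = feature_ranges(ttv)
flights = [
    {"d_eq_mm": 0.20, "area_ratio": 0.5},  # inside the TTV range
    {"d_eq_mm": 1.40, "area_ratio": 0.4},  # larger than any TTV particle
]
flags = [is_out_of_sample(row, ranges) for row in flights]
out_of_sample_fraction = sum(flags) / len(flags)
```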
Area ratio is greater than 1.0 about 25% of the time. This is because a correction is applied to the calculation of the maximum dimension following Korolev (2007) when the diode at the center of the minimum circle enclosing the particle is unshadowed, which is the case for particles featuring Poisson spots. In these cases, the area of a circle with a diameter equal to the corrected maximum dimension can be smaller than the projected area of the particle.
Author contributions. RA prepared the manuscript with contributions from all co-authors, and led the analysis of the phase classifications from the three different algorithms for the whole SOCRATES dataset (Section 4). JM led the development of UWILD, with contributions from JL and IH, and computed size distributions and phase heterogeneity metrics from UWILD's classifications and model confidences (Section 5). JF processed the 2D-S data to obtain particle features used in the machine learning model and contributed expertise on the 2D-S instrument and the SOCRATES campaign. JL and IH trained and tuned UWILD and evaluated its performance on the test set (Section 4). In particular, JL computed F1 scores for different particle categories and IH computed permutation feature importance. RW provided continuous feedback and guidance during the study. MD re-calibrated the water vapor data from the VCSEL instrument and provided comments for the manuscript.