Rain time series records are generally studied using rainfall rate or accumulation parameters, which are estimated for a fixed duration (typically 1 min, 1 h or 1 day). In this study we use the concept of “rain events”. The aim of the first part of this paper is to establish a parsimonious characterization of rain events, using a minimal set of variables selected among those normally used for the characterization of these events. A methodology is proposed, based on the combined use of a genetic algorithm (GA) and self-organizing maps (SOMs). An SOM is advantageous here because it maps a high-dimensional data space onto a two-dimensional space while preserving, in an unsupervised manner, most of the information contained in the initial space topology. The 2-D maps obtained in this way allow the relationships between variables to be determined and redundant variables to be removed, thus leading to a minimal subset of variables. We verify that such 2-D maps make it possible to determine the characteristics of all events, on the basis of only five features (the event duration, the peak rain rate, the rain event depth, the standard deviation of the event rain rate and the absolute rain rate variation of order 0.5). From this minimal subset of variables, hierarchical cluster analyses were carried out. We show that clustering into two classes allows the conventional convective and stratiform classes to be determined, whereas classification into five classes allows this convective–stratiform classification to be further refined. Finally, our study reveals some specific relationships between these five classes and the microphysics of their associated rain events.

The analysis of “precipitation events” or “rain events” can be used to obtain information concerning the characteristics of precipitation at a particular location and for a specific application. This is a convenient way to summarize precipitation time series in the form of a small number of characteristics that make sense for particular applications.

The concept of a precipitation event is not new and has been used for many years (Eagleson, 1970; Brown et al., 1984). A wide variety of definitions, varying according to the context of each study, have been reported in the literature (Larsen and Teves, 2015). Moreover, when a rain rate time series (generally based on rain gauge records) is broken down into individual rainfall events, a wide variety of their characteristics, such as average rainfall rate, rain event duration and rainfall event distribution (known as hydrological information), can be computed for each event. Our analysis of the literature has led to the identification of 17 features used to characterize rainfall, which makes it quite difficult to compare different studies. The first goal of the present study is to select a reduced set of features characterizing rainfall events, through the use of a data-driven approach, without taking a priori knowledge of the field of application into account, thereby characterizing rainfall events in the most parsimonious and efficient manner.

The second goal is to assess, without using any a priori criteria, whether the rain
events are still correctly clustered by the most relevant observed features.
Indeed, atmospheric process specialists distinguish between stratiform and
convective events, arguing that the physical processes involved in their
evolution are different. The goal here is to check that a small sample of
variables, derived from spot measurements to describe rain events, can allow
this distinction to be made and ultimately be used to refine it.
Hydrological (hereafter referred to as “macrophysical”) information makes
use of rain gauge measurements to characterize rain events. This information
is defined in order to characterize the features of global events but not
to provide any information concerning the raindrop microphysics of the
event. Nevertheless, in many applications such as remote sensing, knowledge
of the microphysics is essential. One key parameter in remote sensing is the
raindrop size distribution, noted as

In the present study, we use a data-driven approach to study the relationships between different rain properties. As disdrometers provide drop size distributions, they allow 1-minute (or shorter) rain rates to be estimated, and in the present study these can be used to derive the hydrological information of interest, which is coherent with the data that would be provided by standard rain gauges. Through the combined interpretation of microphysical and hydrological information, we are also able to analyze the microphysical properties of the rain event clusters provided by our algorithm. This makes it possible to retrieve (unobservable) microphysical information from rain gauge measurements.

From a single rain rate time series, observed with a 1-minute time
resolution, we seek to answer the following questions:

Among the large number of hydrological variables described in the literature, which are the most significant?

Does the resulting description of rain events allow different types of rain event to be discriminated?

What (unobserved) microphysical properties of an event, or type of rain event, can be inferred from its macrophysical description?

This research relies on the analysis of raindrop measurements obtained with
a dual-beam spectropluviometer (DBS) disdrometer, first described by
Delahaye et al. (2006). This instrument allows the arrival time, diameter
and fall velocity of incoming drops to be recorded. As the capture area of
the sensor is 100 cm

In everyday life, it is common knowledge that rain starts at a certain moment and stops some time later. However, due to its discrete nature, rain (which generally consists of a very large number of raindrops) is not an easy concept to define. Indeed, the exact definition of a rain event will depend on the sensor's characteristics (capture area, detection threshold, instrumental noise) as well as the spatial or temporal resolution chosen for the study. This definition may also depend on the purpose of the study and thus on the scientific community behind it. There is thus a wide range of criteria used to break down precipitation records into rain events. For this reason, it is important to define and apply an unambiguous definition of a “rain event”.

In this study, the pattern produced by the 1-minute rain rate time series
RR

Observation periods, availability of DBS observations and numbers of rain events for the learning and test data sets.

Dunkerley (2008a, b) analyzed the inter-event time (IET) in order to assess its influence on the definition of rainfall events and on the average rainfall rate. As emphasized in this study, when determining a value for the minimum inter-event time (MIT), it is crucial to find an appropriate compromise between the independence of rain events and the intra-event variability of rain rates. The choice of MIT thus has a direct impact on the macrophysical characteristics that are ultimately determined by the analysis. Other researchers have proposed MIT values of 20 min, 1 h or even 1 day (see Dunkerley, 2008a, for a detailed list). In the present study the MIT was set to 30 min, in agreement with the value used by Coutinho et al. (2014), Haile et al. (2011), Dunkerley (2008a, b), Balme et al. (2006) and Cosgrove and Garstang (1995).
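The role of the MIT can be illustrated with a minimal sketch, which is not the authors' implementation. It assumes that a minute is "rainy" when its rain rate exceeds a threshold, and that a dry gap of at least `mit_minutes` closes an event. The function names, the threshold and the exact gap criterion are hypothetical choices made for illustration only.

```python
from typing import List, Tuple


def split_rain_events(rain_rate: List[float], mit_minutes: int = 30,
                      threshold: float = 0.0) -> List[Tuple[int, int]]:
    """Return inclusive (start, end) indices of events in a 1-minute series.

    Assumption: two rainy minutes belong to the same event when fewer than
    mit_minutes dry minutes separate them.
    """
    events = []
    start = None      # first rainy minute of the current event
    last_wet = None   # most recent rainy minute seen so far
    for i, rr in enumerate(rain_rate):
        if rr <= threshold:
            continue  # dry minute: nothing to update
        if start is None:
            start = last_wet = i
        elif i - last_wet - 1 >= mit_minutes:
            # The dry gap is long enough: close the event, open a new one.
            events.append((start, last_wet))
            start = last_wet = i
        else:
            last_wet = i
    if start is not None:
        events.append((start, last_wet))
    return events


def event_features(rain_rate: List[float], start: int, end: int):
    """Duration (min), depth (mm) and peak rate (mm/h) of one event."""
    rates = rain_rate[start:end + 1]
    duration_min = end - start + 1
    depth_mm = sum(rates) / 60.0   # rates in mm/h sampled every minute
    return duration_min, depth_mm, max(rates)


# Two rainy spells separated by exactly 30 dry minutes.
series = [0.0, 1.0, 2.0, 0.0, 0.0, 3.0] + [0.0] * 30 + [4.0, 0.0]
events = split_rain_events(series)
```

With a larger MIT the two spells above would merge into a single, longer event with a lower mean rain rate, which is the compromise discussed by Dunkerley (2008a).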

When applied to our data set, this choice leads to the identification of 545 rain events, which can be divided up into two subsets, i.e., one for learning and the other for testing (Table 1). The learning data set is composed of observations collected over a 2-year period between 2013 and 2014, with an availability of 96.4 %, whereas the test data set collected during the 2008–2012 period contains periods with missing data due to a malfunction of the recording device.

The 23 variables identified in the literature, used for the characterization of rain events.

Rain events contain a wealth of information, which generally needs to be
condensed into a limited set of well-chosen features. However, there is no
conventional or commonly accepted list or specific set of macrophysical
features that can be used to accurately describe and summarize an event. In
the present study it was thus decided to consider a large number of
features, allowing the macrophysical rain event information described in the
literature to be correctly represented. Seventeen characteristics were
selected and identified (Llasat, 2001; Moussa and Bocquillon, 1991) and are
listed in Table 2. Some of these are parameter dependent, such as

Among the 23 indicators (hereafter referred to as variables) corresponding
to the previously defined features, some are very well known. These include
the event duration (

PCA on the learning data set based on the 23 variables described
in Table 2.

It is important to note that very few of these 23 variables are compatible with the probabilistic assumptions generally associated with exploratory statistical methods. They are often highly variable, with highly skewed distributions, and are therefore not normally distributed. Applying standard statistical methods directly to these data is thus more difficult and may lead to misleading interpretations (Daumas, 1982). It is therefore necessary to introduce an additional step in order to transform the original distributions into quasi-normal distributions. The most suitable type of normalizing transformation for each of these variables was selected empirically by testing seven different possible transformations (Table 3). For each variable, the retained transformation is that leading to a distribution with the strongest similarity to a normal distribution, i.e., with a kurtosis close to 3 and a skewness close to 0. For each indicator, the selected transformation is provided in the last column of Table 2.
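The selection criterion (kurtosis close to 3, skewness close to 0) can be sketched as follows. This is an illustration under stated assumptions: the candidate transformations listed here are hypothetical and are not the seven transformations of Table 3, and combining the two criteria as a sum of absolute deviations is an assumed scoring rule. The input values are assumed to be strictly positive.

```python
import math
from statistics import NormalDist


def skewness(x):
    """Sample skewness (third standardized moment)."""
    n = len(x)
    m = sum(x) / n
    m2 = sum((v - m) ** 2 for v in x) / n
    m3 = sum((v - m) ** 3 for v in x) / n
    return m3 / m2 ** 1.5


def kurtosis(x):
    """Pearson kurtosis (fourth standardized moment; 3 for a normal law)."""
    n = len(x)
    m = sum(x) / n
    m2 = sum((v - m) ** 2 for v in x) / n
    m4 = sum((v - m) ** 4 for v in x) / n
    return m4 / m2 ** 2


# Hypothetical candidates, for illustration only.
CANDIDATES = {
    "identity": lambda v: v,
    "sqrt": math.sqrt,
    "cube_root": lambda v: v ** (1.0 / 3.0),
    "log": math.log,
}


def best_transformation(values):
    """Name of the candidate whose output looks most normal."""
    def score(name):
        t = [CANDIDATES[name](v) for v in values]
        return abs(skewness(t)) + abs(kurtosis(t) - 3.0)
    return min(CANDIDATES, key=score)


# Deterministic, strongly right-skewed sample (log-normal quantiles).
sample = [math.exp(NormalDist().inv_cdf((i + 0.5) / 200)) for i in range(200)]
chosen = best_transformation(sample)
```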

Following the normalization step, PCA was
carried out on the learning data set (see end of Sect. 2.1 and Table 1). The
first two principal axes contain 73 % of the total information,
whereas the first five principal axes are needed to represent 90 % of the
total information. The IET

Transformations used to normalize the variables listed in Table 2.

Finally, the PCA clearly shows that, within each PCA group, many variables are highly intercorrelated, i.e., linearly dependent on each other. This means that several variables could be removed with no substantial loss of information. This leads to the following question: which variables can be removed in order to retain the most parsimonious subset of variables representative of the full data set? PCA extracts summary variables, which are linear combinations of the original variables, but it does not select variables. To answer this question, we propose a method for the global selection of variables that seeks to identify the relevant variables in a data set. As it appears intuitively more advantageous to select variables that have a physical meaning, rather than to use dimension reduction methods (e.g., PCA, which is better suited to the detection of linear relationships), the proposed method is based on the use of a GA. The following section provides a brief introduction to the concept of GAs and shows how they can be advantageously used for the selection of variables in the context of the present study.

Computer-assisted variable selection is important for several reasons. The selection of a subset of variables in a high-dimensional space can improve the performance of a model or its statistical properties, and it also yields more robust and less complex models. In practice, it is generally not possible to try all potential combinations of variables and to select the best of these, because of the enormous computational cost of such an approach. Among the many different variable-selection techniques described in the literature (Guyon and Elisseeff, 2003), we chose to develop a model based on the use of GAs to search for an optimal subset of variables. GAs (Holland, 1975) are stochastic optimization algorithms inspired by the mechanics of Darwinian natural selection and by genetics. In our study, a chromosome is defined as a subset of our 23 variables. A first generation composed of a population of 60 potential chromosomes is arbitrarily chosen. The performance of each of these 60 chromosomes (i.e., of each corresponding subset of variables) is evaluated through a fitness function f. This fitness function is defined in such a way that the higher its value, the greater the ability of the corresponding subset to represent the full data set (of dimension 23), using the smallest possible number of variables. On the basis of the performance of these 60 chromosomes, we create a new generation of 60 potential-solution chromosomes, using classical evolutionary operators: selection, crossover and mutation. The performance of this new generation is then evaluated. This cycle is repeated until a predefined stop criterion is satisfied. The best chromosome from the current generation then provides the optimal subset of variables.
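The generational cycle described above can be sketched as follows. This is a toy illustration and not the authors' algorithm: `toy_fitness` is a hypothetical stand-in for the SOM-based fitness function, the mutation probability `p_mut`, the one-point crossover, the fixed number of generations used as stop criterion and the tracking of the best chromosome seen so far are all assumptions. Only the chromosome length (23) and the population size (60) come from the text.

```python
import random

N_VARS = 23     # candidate variables (from the text)
POP_SIZE = 60   # chromosomes per generation (from the text)


def toy_fitness(chrom, useful=frozenset({0, 3, 7, 11, 19})):
    """Hypothetical fitness: reward 'useful' variables, penalize subset size.

    The offset keeps the value strictly positive so that it can be used
    directly as a sampling weight.
    """
    hits = sum(1 for i in useful if chrom[i])
    return 3.0 + hits - 0.1 * sum(chrom)


def select(pop, scores, rng):
    """Sample a new population with replacement, proportionally to fitness."""
    return [list(c) for c in rng.choices(pop, weights=scores, k=len(pop))]


def crossover(a, b, rng):
    """One-point crossover (assumed operator)."""
    cut = rng.randrange(1, N_VARS)
    return a[:cut] + b[cut:], b[:cut] + a[cut:]


def mutate(chrom, rng, p_mut=0.01):
    """Flip each bit with a small probability (p_mut is an assumed value)."""
    return [1 - g if rng.random() < p_mut else g for g in chrom]


def run_ga(fitness, generations=50, seed=0):
    rng = random.Random(seed)
    # Initialization: 60 random binary chromosomes of length 23.
    pop = [[rng.randint(0, 1) for _ in range(N_VARS)] for _ in range(POP_SIZE)]
    best, best_score = None, float("-inf")
    for _ in range(generations):
        # Evaluation, then bookkeeping of the best chromosome seen so far.
        scores = [fitness(c) for c in pop]
        top = max(range(POP_SIZE), key=scores.__getitem__)
        if scores[top] > best_score:
            best, best_score = list(pop[top]), scores[top]
        # Selection followed by reproduction (crossover, then mutation).
        parents = select(pop, scores, rng)
        pop = []
        for a, b in zip(parents[0::2], parents[1::2]):
            c1, c2 = crossover(a, b, rng)
            pop += [mutate(c1, rng), mutate(c2, rng)]
    return best, best_score


best, score = run_ga(toy_fitness, generations=30, seed=1)
selected = [i for i, g in enumerate(best) if g]   # indices of kept variables
```

A bit set to 1 at position i means that variable i is kept. In the study itself, evaluating a chromosome would involve the SOM-based fitness rather than this toy score.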

We define by

As previously stated, the fitness function allows a measure to be provided
of how well a minimal subset of variables can represent the entire data
space (in dimension 23). The fitness function

As the aim of this approach is to minimize the number of selected variables

Diagram for the selection of variables based on a genetic algorithm associated with Kohonen maps.

The genetic algorithm is based on the following five steps (Fig. 2):

In the first step, initialization, a (initial) population

In the second step, evaluation, for each of the

In the third step, the best chromosome

In the fourth step, selection, a new population of 60 chromosomes is created from the current population by
randomly sampling with replacement chromosomes based on their probabilities,
determined using the formula

In the fifth step, reproduction, mutation and crossover operators are applied to the new population.
Mutation consists in modifying (or not) certain
components of the chromosomes. The probability of mutation is in general
very low and is commonly set to

The GA is applied to our data sets in order to obtain an
optimal subset of variables forming a subspace, which can (in a certain
sense) provide relatively accurate information concerning the global space,
whilst having the particularity of containing non-redundant information. At
the 187th generation the algorithm produces a subspace comprising five
variables, namely, event duration

The three variables (

An SOM is a topological map composed of neurons. In the present case, a
neuron is a vector of dimension 23 containing the 23 variables defined in
Table 2. Each neuron has six neighboring neurons. SOM is an unsupervised neural
network trained by a competitive learning strategy that performs two tasks:
vector quantization and vector projection. The SOM, which is different to

In the present study we used the toolbox developed by the SOM Toolbox
Team, which is available at the following site:

Distance matrix for the

Projection of the

After learning with the GA described in the previous section, the
resulting map

Figure 3 shows the distance matrix. For each neuron, the color indicates the
mean distance between a neuron and its neighbors. The value at the center of
each neuron represents the number of rain events of the learning data set,
captured by the corresponding neuron. All neurons capture rain events and
slightly more than half of these capture between three and five rain events, which
is close to the value that would be obtained (

The five variables

The map can be related to the previously implemented PCA (Fig. 1). As can be
seen in Fig. 4, the variables

Those events which contribute the greatest quantities of water (Fig. 4,
brown neuron at the bottom right of

Other events which contribute large amounts of water (but less than
previously; Fig. 4, red neuron at the bottom left of

Concerning the variable

The variable

Several other relationships, which are not described in detail here, can be
observed. These include the correlation between the normalized absolute rain
rate variation (

Coefficient of determination obtained on the learning and test data sets. The values in bold correspond to the five selected variables.

The variable

In an effort to provide additional information for validation of the map, we
compared each of the 23 variables with their corresponding value given by
the SOM, for the learning data set and the test data set. For each of the 311
events of the test data set, the best matching unit of the SOM, i.e., the
neuron that is the closest to the event, is determined with respect to the
five selected variables. As an example, for each event Fig. 5 shows the
current value of the unlearned variable

Dendrogram obtained from the hierarchical cluster analysis of the 64 neurons in the SOM. The horizontal dashed line represents the threshold between the two classes.

We have shown that the distance matrix (Fig. 3) confirms the successful deployment of the map. Based on the distance between neurons, it appears that neurons can be grouped to obtain a limited number of classes, each with its own characteristics. In order to group the 64 neurons into a small number of classes, a hierarchical cluster analysis was carried out (Everitt, 1974). Only the five selected variables were used for the classification, and a Euclidean distance was selected for the hierarchical algorithm. Figure 6 shows the resulting dendrogram for the 64 neurons.
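The grouping of neurons can be illustrated with a small agglomerative clustering sketch that uses the Euclidean distance. The linkage criterion is not specified in this section, so average linkage is an assumption, and the function names are hypothetical. In practice a dedicated library routine would normally be used; the pure-Python version below only makes the merging logic explicit.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two reference vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def agglomerate(vectors, n_clusters):
    """Average-linkage agglomerative clustering (assumed linkage).

    Returns a list of clusters, each a list of indices into vectors.
    """
    clusters = [[i] for i in range(len(vectors))]

    def linkage(c1, c2):
        # Mean pairwise distance between the members of two clusters.
        total = sum(euclidean(vectors[i], vectors[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        # Find the closest pair of clusters and merge them.
        pairs = [(linkage(clusters[p], clusters[q]), p, q)
                 for p in range(len(clusters))
                 for q in range(p + 1, len(clusters))]
        _, p, q = min(pairs)
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]
    return clusters


# Toy reference vectors: two well-separated groups.
neurons = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0], [0.5, 0.5]]
two_classes = agglomerate(neurons, 2)
```

Cutting the merge sequence at two clusters corresponds to the dashed threshold line of the dendrogram, and stopping later in the sequence yields a finer partition.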

Depending on the physical processes involved, experts tend to separate rain
events into two different classes: stratiform and convective events.
Although this classification is relatively crude, since stratiform and
convective events can sometimes exist inside the same rain event, it is very
commonly used. Concerning the time series, most authors use a very simple
scheme to distinguish between stratiform and convective rain types. For
reasons of simplicity, rain classification is sometimes defined using the
instantaneous rain rate and the standard deviation estimated over
consecutive samples. As an example, Bringi et al. (2003) defined stratiform
rain samples when the standard deviation of the rain rate, taken over five
consecutive 2 min samples, is less than 1.5 mm h

Firstly, we separate the dendrogram into two classes. The first class
contains 51 neurons and 79 % of the observations, whereas the second class
contains 13 neurons and 21 % of the observations. The solid black line in
Fig. 3 corresponds to the dividing line between these two classes. The first
class, containing the greatest number of neurons, is in most cases
characterized by relatively low rain rates. This can be seen by examining
the structure of the map, according to the mean rain rate variable
(

Representation of the neurons in the

The second group is characterized by a smaller number of neurons. This
corresponds to the higher values of the mean rain rates (

Our analysis of the structure of the variables

Figure 7a and b show the neurons in the

The hypothesis that the two categories of precipitation events corresponding to different dynamic regimes can be identified solely on the basis of hydrometeorological variables is in agreement with the findings of Molini et al. (2011). These authors have shown that there is a strong agreement between the hydro-meteorological classification (based on the duration and extent of events from rain gauge network data) and dynamic classifications (the convective adjustment timescale identified to distinguish between equilibrium and non-equilibrium convection derived from ECMWF analysis). We conclude that this unsupervised automatic clustering, based on the five selected variables, makes it possible to correctly implement a classification with these two well-known classes (stratiform and convective). It should be noted that, unlike other classifications described in the literature, this was established without making use of a priori information, since it is produced by an unsupervised process.

From the stratiform and convective classification described above, it is
interesting to refine the two classes into a set of subclasses. The synoptic
rainfall associated with midlatitude depressions provides an example of
stratiform precipitation, which forms in depressions in the vicinity of warm
and cold fronts. The very light type of rainfall (drizzle) associated with
stratus or stratocumulus is included in the class of stratiform
precipitation. This can occur under anticyclonic conditions, or in the warm
region of a depression. The associated rain depths (

An important step in hierarchical clustering is the selection of an optimal
number of partitions

Summary of the rain event subclasses computed with the learning data set.

Hierarchical clustering of the map into five subclasses. The colors represent the subclass numbers: subclass 1 is dark blue, subclass 2 is blue, subclass 3 is green, subclass 4 is orange, and subclass 5 is red.

Of these five subclasses, two belong to the stratiform class and the other
three belong to the convective class. In the learning data set, subclasses 1,
2, 3, 4 and 5 represent 12, 68, 1.2, 6.8 and 12 % of all events, respectively.
The characteristics of these five subclasses are summarized below and in
Table 5. The five selected variables differ markedly between subclasses,
which indicates their relevance for clustering:

Subclass 1 (drizzle and very light rain): the main feature of this class
is the very low mean value (

Subclass 2 (“normal” events): this is a relatively broad class
containing 68 % of all events, with a mean event rain rate (

The three remaining subclasses correspond to convective classes of events,
which are characterized by a strong temporal heterogeneity and significant
intensities. Depending on the depth of rain events, this convective class is
subdivided into three subclasses.

Subclass 3 contains relatively long events (

Subclass 4 contains relatively short events (

Subclass 5 contains events that are characterized by relatively
low values for the rain event depth (

To conclude this section, this new classification allows the conventional distinction between stratiform and convective events to be refined. It can be subdivided into five different subclasses (two stratiform and three convective), each of which is homogeneous. This classification is obtained for midlatitude climates. As the data set used in this study is representative of only one specific region and topography (i.e., the temperate climate encountered in the Île-de-France region, France), its analysis cannot reveal information related to different processes, i.e., those which are not sampled in the data set. Such processes could lead to the identification of additional specific clusters of events. In particular, there are no orographic rainfall events or oceanic observations. The final step in this study involves assessing whether the homogeneous character of each class is preserved at the microphysics scale and attempting to identify any relationships between the information present at the scale of both the microphysics and the macrophysics of these events (hydrological information).

Our study of the microphysical properties of rain is based on a
comprehensive analysis of its drop size distribution

A general expression for the drop size distribution defined by Testud et al. (2001) is commonly used in the literature. This allows a distinction to be
made between the stable shape function

Microphysical variable

Projections of the learned map, according to

Many authors, including Atlas et al. (1999), Bringi et al. (2003), Marzuki et
al. (2013) and Suh et al. (2016), have endeavored to associate specific
microphysical properties with each type of precipitation (convective or
stratiform). In view of the maps shown in Fig. 4 and the
convective–stratiform classification developed in Sect. 4.3, we are able
to confirm that precipitation events classified as stratiform express small
values for

In order to improve our analysis of the microphysical information embedded in the data set, we analyzed the relationship between the two microphysical parameters using the reference vectors (neurons) from the map, which include information related to the original rain events.

Figure 9 shows the variable

In the case of the stratiform subclasses (1 and 2) a clear relationship can
be observed between the two variables. The microphysics characteristics of
these two subclasses are clearly distinct. Indeed, subclass 1 (drizzle and
light rain) has the smallest

For the convective events (subclasses 3, 4, 5), small differences can be
noticed with respect to

Following our classification, Fig. 9 indicates that there are real
relationships between the macrophysical and microphysical variables.
Nevertheless, knowledge of the variables (

Researchers who study microphysical features and their association with
specific types of precipitation use simple schemes, based on rain rate
estimations over a fixed period of integration (a few minutes), in order to
separate stratiform and convective rain types. They also use these simple
schemes to label

Many previous authors have observed that the drop size distribution is
closely related to processes controlling rainfall development mechanisms. In
the case of stratiform rainfall, the residence time of the drops is
relatively long and the raindrops grow by the accretion mechanism. In
convective rainfall, raindrops grow by the collision–coalescence mechanism,
associated with relatively strong vertical wind speeds. Numerous studies
have been published concerning the variability of

Marzuki et al. (2013) noted that during convective rain the increase in
value of

When applying our algorithm to the various macroscopic properties of each rain
event, we also take into account the variability of rain within an
individual rain event. Figure 9 clearly shows that the spread of parameters

Suh et al. (2016) also compare

Suh et al. (2016) show in Fig. 4c of their study that the pdf for convective
rainfall is higher than that corresponding to stratiform rainfall, when

In view of the generally satisfactory retrieval of microphysical information from macrophysical parameters, we are of the opinion that the topological map successfully restores some of the information implicitly embedded in the data set. It is thus interesting to note that the macrophysical parameters of rainfall are related to its microphysical properties. Firstly, the map collects similar events, whilst ensuring, through the minimization of topological errors, that the unfolding of the map is correct. A neuron is thus closer to its neighbors than to any other neuron on the map. This criterion ensures that the data space is optimally partitioned into connected subparts, such that the neurons on the map can be related to the underlying processes governing rainfall.

Although the definition of a rain event is relatively subjective, this study underlines the advantages of using event analysis rather than sample analysis. This data-driven analysis of events shows that rain events exhibit coherent features. As a consequence of the discrete and intermittent nature of rainfall, some of the features commonly used to describe rain processes are inadequate, in particular when they are defined for a fixed duration. Excessively long integration times (hours or days) can lead to the mixing of observations that correspond to distinct physical processes as well as to the mixing of rainy and clear air periods, within the same sample. An excessively short integration time (seconds, minutes) leads to noisy data, which are sensitive to the sensor's characteristics (sensor area, detection threshold and noise). By analyzing entire rain events, rather than short individual samples of fixed duration, it is possible to clearly identify certain relationships between the different features of rain events, in particular the influence of the microphysical properties of rain on its macrophysical characteristics. This approach allows the intra-event variability caused by measurement uncertainties to be reduced, thus improving the accuracy with which physical processes can be identified.

Once an event has been clearly identified, it is possible to choose a small number of variables to describe it. We present a new data-driven approach, which can be used to select the most relevant variables for this characterization. This approach has generic properties and can be adapted to many multivariate applications. A GA, when combined with SOM clustering, can allow the unsupervised selection of an optimal subset of five macrophysical variables. This is achieved by minimizing a score function, which depends on the topology error of the SOM and the number of variables. This score provides a parsimonious description of the event, whilst preserving as much as possible the topology of the initial space.

Numerous variables derived mainly from rain rate recordings are used to describe precipitation in the context of rain time series studies and a wide variety of topics of interest, including hydrology, meteorology, climate and weather forecasting. The algorithm proposed in this study produces a subspace formed by only 5 of the 23 rain features described in the literature. We show that these five features can be selected by the algorithm in an unsupervised manner and, from the macrophysical point of view, can provide an adequate description of the main characteristics of rainfall events. These characteristics are the event duration, the peak rain rate, the rain event depth, the standard deviation of the event rain rate and the absolute rain rate variation of order 0.5.

In order to confirm the relevance of the five selected features, we analyze
the corresponding SOM and are able to clearly reveal the presence of
relationships between these features. This approach also reveals the
independence of the inter-event time (IET

As this research was based on the analysis of observations made in midlatitude plains in France, the relevance of this classification remains to be confirmed through the analysis of data sets recorded in different climatic zones and under different meteorological conditions, such as those encountered in mountainous or coastal areas. If the SOM described in the present study were trained on a more exhaustive data set, a larger map would be produced, and this could reveal new types of rainfall behavior that remained undetected in the current data set. This point will be addressed in future studies.

The data-driven analysis of entire rain events (rather than the analysis of fixed-length samples) is relevant to the study of interactions between the macrophysical (based on the rain rate) and microphysical (based on raindrops) properties of rain. In the present study, several strong relationships were identified between these microphysical and macrophysical characteristics, and we show that some of the five subclasses identified in this analysis have specific microphysical characteristics. When a relationship between the microphysical and macrophysical properties of rain is identified, this can have many practical implications, especially for remote sensing. In the context of weather radar applications, the microphysical properties of rain are needed in order to estimate rain rates through the use of Z–R relationships. The estimation of microphysical rain characteristics, based on easily observable rain gauge measurements, could play a significant role in the development of quantitative precipitation estimation (QPE).
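The sensitivity of radar rain rate estimates to the Z–R relationship can be illustrated with the usual power law Z = a R^b. The sketch below is not taken from this study: the default coefficients (a = 200, b = 1.6) are the classical Marshall–Palmer values, and the second pair (a = 300, b = 1.4) is a commonly used convective relationship. Both are given only as examples, and the function name is hypothetical.

```python
import math


def reflectivity_to_rain_rate(dbz, a=200.0, b=1.6):
    """Invert Z = a * R**b.

    dbz is the reflectivity in dBZ; Z is in mm^6 m^-3 and R in mm/h.
    """
    z_linear = 10.0 ** (dbz / 10.0)   # dBZ -> linear reflectivity
    return (z_linear / a) ** (1.0 / b)


# The same reflectivity gives different rain rates for different (a, b),
# which is why drop size distribution information matters for QPE.
r_marshall_palmer = reflectivity_to_rain_rate(40.0)
r_convective = reflectivity_to_rain_rate(40.0, a=300.0, b=1.4)
```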

The dataset is available on request from the authors.

The authors declare that they have no conflict of interest.

The authors wish to thank the team of SIRTA (Site Instrumental de Recherche par Télédétection Atmosphérique) as well as ACTRIS-FR for financial support.

Edited by: G. Vulpiani
Reviewed by: D. Dunkerley, A. Parodi, and three anonymous referees