Prediction of Alpine Foehn from time series of GNSS troposphere products using machine learning

Remote sensing of water vapor using the Global Navigation Satellite System (GNSS) is a well-established technique and reliable data source for Numerical Weather Prediction (NWP). One of the phenomena rarely studied using GNSS are foehn winds. Since foehn winds are associated with significant humidity gradients between lee/luv side of a mountain range, tropospheric estimates from GNSS are also affected by their occurrence. Time series reveal characteristic features like distinctive minima/maxima and significant decrease in correlation between the stations. However, detecting such signals becomes 5 increasingly difficult for large data sets. Therefore, we suggest the application of machine learning algorithms for detection and prediction of foehn events from GNSS troposphere products. The present study uses long-term time series of high-quality GNSS troposphere products from the Automated GNSS Network Switzerland (AGNES) as well as records of operational foehn index to investigate the performance of several different classification algorithms based on appropriate statistical metrics. The two best-performing algorithms are fine-tuned and employed on two years of test data. The results show very promising re10 sults, especially when reprocessed GNSS products are utilized. Detectionand alarm-based measures reach levels of 70-85% for both tested algorithms and thus are comparable to those from studies using data from meteorological stations and NWP. For operational prediction, some limitations due to the availability and quality of GNSS products in near-real time (NRT) exist. However, they might be mitigated to a significant extend by provision of additional NRT products and improved data processing in the future. 15


Introduction
Global Navigation Satellite Systems (GNSS) are used extensively for positioning and navigation applications worldwide. Additionally, they enable users to retrieve information about the state of the Earth's atmosphere, in particular the distribution of water vapor. This technique, commonly referred to as GNSS Meteorology, was first proposed three decades ago (Bevis et al., 1992) and is still gaining increasing interest from the scientific community as well as meteorological institutions. The retrieval 20 of atmospheric information from GNSS is based on the fact that electromagnetic signals (such as GNSS signals) are delayed when traveling through specific layers of the atmosphere. The delay experienced by a GNSS signal in the lowest part of the atmosphere (troposphere) is proportional to the water vapor content along the signal path. This fact is typically exploited in (K), as well as the water vapor partial pressure e (hPa) (Rüeger, 2002): N = (n − 1) × 10 6 = 77.6890 · P T + 6.3938 · e T + 3.75463 × 10 5 · e

90
The term can be split into a hydrostatic (addressing the first term) and a wet portion (addressing the last two terms of Equation 1). The total tropospheric delay experienced by a GNSS signal observed at an elevation e and azimuth direction a is referred to as the Slant Total Delay (STD) STD(a, e) = ZHD · mf h (e) + ZWD · mf w (e) + mf g (e) · [GN · cos(a) + GE · sin(a)], where ZHD (Zenith Hydrostatic Delay) represents hydrostatic part, and ZWD the wet part of the signal delay in the zenith direction. In addition, horizontal gradients GN (north-south direction) and GE (east-west direction), accounting for the asymmetry of the atmospheric layers passed by the signal, can be estimated in GNSS processing. In order to map the delays and gradients estimated for the zenith direction to the correct elevation, mapping functions for both parts of the delay (mf h (e), mf w (e)) and the gradients (mf g (e)) are used. 100 The total delay in the zenith direction, i.e. the Zenith Total Dealy (ZTD), is the sum of the hydrostatic and wet part ZTD = ZHD + ZWD.
ZHD accounts for the major part of the total delay and is largely determined by the atmospheric pressure. It can be modelled with sufficient accuracy from surface pressure observations using, e.g., the formula of Saastamoinen (Saastamoinen, 1972): ZHD = 0.0022767 · p s 1 − 0.00266 · cos(2θ) − 0.00028 · H (4) 105 where p s is the surface pressure, θ the station latitude, and H is the station height above the geoid.
ZWD represents the main signal of interest for meteorological purposes, as it is directly related to the water vapor content in the air column above the GNSS receiver, and therefore to IWV, via where κ denotes a semi-empirical function depending on the integrated mean temperature T m . Thus, it shows the same high 110 temporal and spatial variability as water vapor, making precise modelling from meteorological surface observations practically impossible. As a consequence, ZWD is commonly estimated as an unknown in GNSS parameter estimation alongside of station coordinates and the receiver clock error.

115
All investigations presented in this study are based on GNSS troposphere products from the Automated GNSS Network Switzerland (AGNES). The AGNES network was established in 2001 and is maintained by the Swiss Federal Office of To-pography (swisstopo) (Brockmann et al., 2002). Currently it consists of 31 GNSS stations, which are visualized in Figure 1.
The capabilities of the network were extended to multi-GNSS in 2015 (Brockmann, 2016). Reprocessed, long-term time series Figure 1. Overview of the AGNES GNSS station network, the meteorological station at Altdorf (ALT) is marked with a star. Picture modified, original taken from: http://pnac.swisstopo.admin.ch/pages/en/agnes.html of hourly tropospheric delays and gradients are available for the period 1999-2020. A description of the data set as well as 120 details on the reprocessing of GNSS data can be found in e.g. Brockmann (2015). Parts of this reprocessing were carried out in the framework of the second EUREF (International Association of Geodesy Reference Frame Sub-Commission for Europe) Permanent Network (EPN) reprocessing campaign in 2014, where GNSS data from a large number of European stations were reprocessed (Pacione et al., 2017).

Foehn observations at Altdorf
In order to train a specific ML algorithm and evaluate its performance, a reference data set of foehn observations is needed as the target variable. This study uses time series of 10-minute estimates of foehn index (FI) calculated at the station Altdorf, following the approach presented by Dürr (2008). The FI data is provided by the National Meteorological Ground-level Monitoring Network (SwissMetNet, SMN) operated by MeteoSwiss. Currently, data from about ten sites, frequently experiencing foehn winds, are available on an operational level. The index is designed for operational nowcasting and relies on typical foehn predictors such as wind speed and direction, pressure and temperature gradients, and humidity observations at the respective measurement site and surrounding stations. It can return three different integer values: 0 (no foehn), 1 (foehn-mixed air) and 2 (foehn), which are distinguished based on the predictors mentioned above. In an extensive validation against classifications 135 by human experts, the index showed good performance for indices re-calculated back to 1981 (Gutermann et al., 2012). For a detailed description of the calculation algorithm we refer to Dürr (2008) and Gutermann et al. (2012). As we aim for a binary classification (no foehn or foehn), the cases of FI = 1 are treated as nonfoehn events and therefore mapped to value 0.
Furthermore, we map the cases of foehn (FI = 2) to the value 1 for the sake of simplicity in all results shown in the following.
Then, each hour in the whole data set where at least one 10-minute value indicates foehn is treated as an hour of foehn 140 appearance, and thus a foehn event.

Machine Learning Algorithms
In the course of this study, several different ML algorithms are tested in order to investigate their usability for this specific problem and to compare their performance relative to each other. The following algorithms are tested:

155
The selection of features from GNSS troposphere time series is based on previous investigations on visual detection from time series of different parameters as well as on obvious choices which are expected to be impacted most by foehn conditions. Two obvious choices are visualized for December 2019 in Figure 2, namely ZWD (12-hour moving-average) at stations north (KALT, shown in blue) and south of the Alpine ridge (LOMO, shown in orange) and its difference (bottom section, shown in black). In addition, foehn events at Altdorf are shown as color-coded periods (orange). Strong correlation between the 160 contrary trends in ZWD at the two stations and the onset of foehn in Altdorf can be observed. Furthermore, the difference in ZWD between the stations reaches a (negative) maximum in the two extended foehn periods observed (∼ 15.-18. and 19.-21.12.2019). These time series give a first impression how (and from which parameters) foehn events can be detected using GNSS data sets. As this becomes a very demanding task for longer periods (both visually and analytically), ML-techniques are a promising tool to extend and automate such a detection process, with the additional benefit of possibly providing the ability 165 to also predict upcoming events.

Data preparation
One of the main challenges for ML-based classification algorithms are imbalanced data sets. This imbalance is also strongly present in data sets of foehn observations, since foehn is a rather rare meteorological phenomenon. For the utilized FI data set, the average foehn probability over the 11-year period (2010-2020) amounts to only ∼ 4%. Thus, the ratio of under-170 representation of the minority class (foehn event) compared to the majority class (no foehn event) is as large as 1:25.

Oversampling
A common approach to overcome problems originating from high imbalance in a data set is to oversample the minority class for the training data set. One possible approach to achieve this is the Synthetic Minority Over-sampling Technique (SMOTE) (Chawla et al., 2002), which we use in this study. The technique creates new (synthetic) instances of the minority class within 175 the training data. For this study, an oversampling of observed foehn hours in the training data set by 25% was conducted using SMOTE, which improves the performance of the applied algorithms by about 20%. The value of 25% oversampling was chosen to achieve a reasonable balance between the advantage of having more usable training events (larger percentage of oversampling) and the fact that foehn is still a rather rare phenomenon (therefore also rare in possible test data sets). All results shown in the following sections are based on pre-processing using this approach.

Shifting of FI time series
In order to assess the suitability of the GNSS troposphere products for operational prediction, a time shift of one hour is applied to the target vector (i.e. FI time series at Altdorf). As operational usage is considered a main goal of the proposed method, the shift is applied for all test cases investigated in this study (also those using post-processed GNSS products). Therefore, each prediction of a foehn event is based on GNSS observations collected one hour before a possible onset of foehn at Altdorf.

Performance metrics
As already outlined in the last section, the imbalance in data sets of foehn observations is a major obstacle for the application of ML algorithms and the assessment of their performance. For highly imbalanced data, performance metrics typically used in ML studies might not be representative and therefore other options have to be explored. In the case of the present data set, a typical performance measure such as precision alone would not be suitable as it simply compares detected/predicted foehn 190 hours with the observed data for all time steps. Thus, it might happen that an algorithm with optimal precision predicts (almost) no foehn events at all, since this will still result in an optimal performance with regards to precision. In order to overcome these issues we adapt the following performance metrics, as introduced by Barnes et al. (2007) and used in Sprenger et al. (2017).
These can be formulated as conditional probabilities P(|) and separated into detection-based: -Probability of Detection (POD) = P(predicted | observed)
As already visible from the formulations above, POD and CAR are directly connected to each other via the Bayes Theorem.
This also implies that there is always a trade-off between those two parameters and therefore only one of them can be optimized, while decreasing the respective other one. Which metric should be optimized strongly depends on the actual application, as already outlined by Sprenger et al. (2017) who argued that alarm-based measures might be more relevant from a forecasters 205 perspective.
In addition, we adopt two measures which represent a kind of mean performance in terms of both CAR and POD for describing the results presented in the next sections. The first one is just the simple average of those two parameters combined, therefore referred to as COMB in the following: The second adopted metric is based on the F-beta score F β (Baeza-Yates and Ribeiro-Neto, 1999), which can be formulated using the confusion matrix. The matrix reports the number of false negatives (FNs), false positives (FPs), true negatives (TNs), and true positives (TPs) and thus allows for the calculation of common performance measures in ML such as precision, recall and F-beta score: The classical F-beta score (F 1 , using β = 1) represents the weighted harmonic mean of precision and recall, with a range 220 between 0 (worst case) and 1 (optimal value). As already discussed above, a precision measure might not be representative for results of this study and therefore we use the F 2 score instead of F 1 , putting more emphasis on the recall, i.e. on the detection of all foehn events: This section gives an overview on the process of algorithm selection using a cross-validation approach applied for the whole training data set. Details on this approach are given in Section 5.1. Furthermore the default feature setup (i.e. used GNSS troposphere products) utilized for the algorithm selection as well as for the first case study is outlined in Table 1. It includes ZWD (absolute values and all possible differences between stations) and gradient products (GN and GE) from all available AGNES stations as well as a selection of four ZHD differences which are representative differences between north-south 230 stations in the network. Tests have also been conducted using all possible differences in ZHD (as for ZWD), but no improvement was found using this setup. This might be explained by the fact that ZHD is largely depending on pressure, which typically does not show such small-scale variations as water vapor (and thus ZWD). Therefore, a small number of ZHD differences (i.e. pressure differences) across the Alpine ridge might be sufficient for the ML-based prediction. For the second case study, which uses NRT GNSS products for the prediction, only tropospheric delay products (ZWD, ZHD) are used since gradients products 235 are not available for this data set.

Choosing optimal ML algorithms
Before carrying out case studies using a specific ML algorithm, the most promising one(s) have to be identified from the list of algorithms given in Section 4.1. Therefore, a cross-validation over the training data set (2010 -2018) was performed and evaluated using the performance metrics introduced in Section 4.4. For the cross-validation, single years of data are iteratively 240 taken out of the training data set, serving as validation data in order to assess the performance of the outlined algorithms. This is repeated until every year serves once as validation data set. The actual implementation is carried out using the Python package scikit-learn (Pedregosa et al., 2011) and the default settings from the algorithm routines are used in this first evaluation.
Results of the cross-validation are visualized in Figures 3 and 4 and summarized in Table 2. Statistical measures show a strong dependency on the actual foehn probability observed in a specific year, as correlations between black lines (foehn probability) 245 and actual measures for different algorithms (colored lines) reveal. Figure 3 and  combined measures (COMB and F 2 score). For detection-based measures (POD), the GB algorithm achieves the highest value on average as well as for most single years. The same holds for the RF algorithm in terms of alarm-based measures (CAR), but its detection-based performance is significantly degraded compared to e.g. GB and SVC. Thus we decided to use the GB and 250 SVC algorithms for evaluation in the test case studies, as those are the only ones provided average combined measures of over 70% (see Table 2).

Hyperparameter tuning
Based on the results of the cross-validation, the GB and SVC algorithm are chosen for in-depth tuning of their hyperparameters.
Therefore, a grid-search procedure is conducted, which is an exhaustive search over a subset of manually selected values. The  ing data set (2010-2018) is randomly divided into three folds, where two thirds are used for training while the last third serves for validation. This procedure is repeated three times until each third is used once for validation. All tested hyperparameter values, as well as the best performing value combinations, are summarized in Table 3.
6 Results: Case studies for test period 2019-2020

260
As the major performance test of the proposed method, two case studies are performed for a dedicated test period, covering the years of 2019 and 2020. Within these case studies, different setups regarding utilized GNSS stations and tropospheric parameters in the feature matrix are investigated. In a final step, we assess the ability of the proposed methods to predict foehn events using near real-time (NRT) troposphere products from previous hours. For all investigations/results shown in the following, the chosen GB and SVC algorithms are trained using nine years (2010 -2018) of hourly troposphere products and 265 evaluated for the test period 2019-2020. Specific settings concerning station and feature setup as well as type of used GNSS products are introduced in the respective sections.

Case study 1: Reprocessed products
Representing the main results of this study, the first case study includes all available stations and most parameters available from GNSS processing. The exact setup is following the settings obtained from the cross-validation, given in Table 1. Results

270
of the predictions of the test data from both algorithms are visualized in Figure 5 and the respective performance statistics are given in Table 4. The results indicate good performance of the tested algorithms, both providing POD/F 2 values over/at 80% and low POFD/MAR values (3%). The GB prediction achieves a slightly higher CAR (71%) than SVC (67%) and therefore  Overall, these findings hold for results from both tested algorithms, although SVC typically produces more false alarms (higher POD but therefore also higher POFD/lower CAR) than GB.
As the GB algorithm gives the opportunity to assess the importance of the used features for the prediction result, we additionally show the 30 most important features in Figure 8. By far the best predictor is the ZWD difference between the stations HABG 280 and SANB, which might outline the importance of high-altitude stations (such as SANB, 1702 m altitude) at the alpine crest.
14 https://doi.org/10.5194/amt-2022-33 Preprint. Discussion started: 4 February 2022 c Author(s) 2022. CC BY 4.0 License. Interestingly, also features from stations far away from Altdorf have significant impact, most prominently ZWD and also gradient products (even for east-west direction) in the Valais area (HOHT and HOH2 stations). This is reasonable due to the fact that typical wind trajectory in the Rhone valley is east-west oriented.
6.1.1 Thinning of feature space 285 As visible from Table 1, the dimension of the feature space (i.e. the number of columns of the feature matrix) becomes large when the full AGNES network is utilized for the prediction (at least for ZWD and gradient products), which significantly increases the computational effort for algorithm training. In this section, we thus investigate if comparable results can still be obtained when significantly reducing the number of features. Therefore we thinned out the feature space to the top 30 predictors   Table 4. Performance metrics for the proposed models using post-processed troposphere products and the full feature setup (as shown in from the results shown in the prior section (Figure 8). Results obtained from this reduced data set are given in Table 5. While   Table 5. Performance metrics for the proposed models using post-processed troposphere products but only the top 30 features (as shown in Figure 8 from the original setup Table 1). 6.2 Case study 2: Prediction using NRT troposphere products The last major question behind this study is to what degree the proposed method can be used or incorporated into operational 295 forecasting of foehn events. Therefore, the investigations presented in the prior sections are extended by using NRT troposphere products, in order to investigate the suitability of the proposed ML algorithms for operational prediction. These data are currently available in form of NRT tropospheric delays (ZHD, ZWD and ZTD), which are typically provided with a latency of approximately 30-40 minutes after the full hour. Unfortunately, no atmospheric gradients are currently delivered in NRT mode, but an extension is possible and aimed for in the near future. The missing gradient information makes it necessary to train the 300 GB and SVC algorithms again for the dedicated period (2010-2018), but this time only using features related to tropospheric delays (ZWD, ZWD differences, ZHD differences). Results of the prediction using NRT products can be found in Figures 9 (for GB) and 10 (for SVC) as well as in Table 6. The comparison of ML-based predictions with the FI time series shows a Figure 9. GB predictions for default (GB) and adjusted (GB_adj) run using NRT products. Shown are predicted foehn events (orange, yellow), observed FI (red) and matches (to adjusted prediction, green).
significant decrease in prediction accuracy, especially in terms of CAR for both algorithms. This indicates the importance of gradient parameters for the proposed method, despite the fact that the feature importances are dominated by ZWD differences 305 (see Figure 8). Furthermore, lower quality of ZWD estimates must be taken into consideration as well for the NRT solution, since lower-quality orbit and clock products have to be used for GNSS processing. Nevertheless, combined measures (F 2 and Figure 10. SVC predictions for default (SVC) and adjusted (SVC_adj) run using NRT products. Shown are predicted foehn events (blue, darkblue), observed FI (red) and matches (to adjusted prediction, green). COMB) are still reaching values between 60-66%. If a more equal performance level for both detection-and alarm-based measures is desired, the initial value of the decision threshold (0.5) used by an algorithm for the binary classification can be adopted. In this case, we used a threshold value of 0.75 to gain more equal statistics for both algorithms (results marked with 310 _adj in Figures 9 and 10), while conserving the combined measures. This higher threshold is able to increase the CAR by ∼20%, given the fact that it erases a considerable amount of false alarms.
In the present study, the feasibility of using GNSS troposphere products in combination with ML-based classification algorithms for the detection and operational prediction of alpine foehn was analyzed. For this purpose, we made use of a long-term 315 (eleven years) data set of GNSS tropospheric parameters derived for stations of the Swiss AGNES GNSS network as well as FI observations at the SMN-station Altdorf. In the course of an extensive cross-validation over the training data set (2010)(2011)(2012)(2013)(2014)(2015)(2016)(2017)(2018), seven different classification algorithms were tested. The two best-performing algorithms were then trained using the nine-year training data set already utilized for cross-validation earlier. In a first case study, we evaluated results of foehn classifications/predictions from those two algorithms (GB and SVC) over a two-year test period (2019-2020) at the SMN-station 320 Altdorf. The second case study investigated the usability of NRT GNSS products for foehn prediction, in order to assess the feasibility of these low-latency (∼ 30-40 minutes) data for operational forecasting.
The following main conclusions can be drawn from the presented results: -ML-based foehn prediction/classification using GNSS troposphere products works well and achieves good performance in terms of both detection-based (POD = 83%, POFD = 4% ) and alarm-based (CAR = 67-71%, MAR = 3% ) metrics.

325
Results of both algorithms are almost equivalent and comparable to those obtained by Sprenger et al. (2017), who used NWP data as well as observations from the validating measurement site (Altdorf). Thus, the obtained "GNSS-only" results are unexpectedly promising. This fact once more outlines the great potential of GNSS products for meteorology -not only for precipitation-related phenomena -as well as the benefits of using them in ML-based approaches.
-Out of the seven algorithms tested in the nine-year cross-validation, GB and SVC provided the best average performance 330 in terms of performance metrics mentioned before. Therefore they were selected for further hyperparameter tuning and usage in the case studies carried out.
-Most promising results can be obtained if the full station network can be utilized. A reduction of the feature space to the 30 most important features from the full-network run shows similar POD values, but results in a degradation of CAR values by ∼10%. Still, a careful thinning in the feature space (much higher number than 30 features remaining) might 335 be possible without losing much critical information. Most important predictors include ZWD differences from GNSS stations nearby Altdorf as well as gradient products from selected stations, which include e.g. stations in the Valais area.
-Similar problems (feature reduction) as outlined above are present when using NRT data for operational prediction, as currently only tropospheric delays are delivered in NRT products. Furthermore, the quality of the prediction results also varies with the quality of the troposphere products, which is significantly lower for NRT data. This is visible from the 340 comparison between post-processed delay-only and NRT runs. Similarly as the for the dedicated feature thinning, main degradation of the results is experienced for the alarm-based measures. Still, the degradation can be mitigated to some extent by a dedicated tuning of the decision threshold in the classification algorithm (increase/decrease of the default 0.5 value). Finally, we except that the NRT prediction will also benefit from gradient products as soon as they become available in an operational manner. -Choosing the optimal performance metrics and appropriate pre-processing denotes a key task in ML-based prediction algorithms, especially when working with such a highly imbalanced data set as in this study. The actual choice for the most important metric(s) strongly depends on the actual application of the prediction method, deciding whether detection-or alarm-based measures should be preferred. Within this study we tried to tune the algorithms for an optimal balance between both metric-types and leave a possible decision to the potential users. However, as already outlined 350 before, there exists a trade-off between POD and CAR and therefore an optimization towards one metric will always result in a shortcoming towards the other.
Overall, these initial results are very promising and therefore the developed method might already aid the meteorological community as an additional tool for foehn classification and/or prediction. Nevertheless, a number of enhancements can still be achieved through more detailed investigations in future studies. Possible improvements of the method we aim to investigate 355 in the future would be: -Enhance the nowcasting capabilities of the proposed method by including gradient products in the NRT prediction.
-Use troposphere products from relevant stations of GNSS networks from neighbouring countries (Italy, Austria, Germany...) and the small-scale COGEAR network in the Valais area.
-Perform spatial interpolation of tropospheric parameters from neighbouring stations to the location of the SMN-station 360 Altdorf (using e.g. collocation methods) and evaluate their benefit for the prediction.
-Performing a more extensive grid-search for hyperparamter tuning of the used algorithms or try new (possibly more sophisticated deep learning) algorithms.
Code availability. Source code is available upon request from the corresponding author.
Author contributions. MAR developed the research idea, software implementation and prepared the manuscript (including formal analysis, 365 visualization, and writing). EB produced and provided the long-term time series of GNSS troposphere products used for this research. LC prepared selected visualizations for the manuscript. All co-authors contributed to the discussion of results as well as reviewing and editing the manuscript.
Competing interests. The authors declare that no competing interests are present.
Acknowledgements. The authors would like to thank swisstopo for providing GNSS troposphere products from the AGNES network and