Inferring surface energy fluxes using drone data assimilation in large eddy simulations

Spatially representative estimates of surface energy exchange from field measurements are required for improving and validating Earth system models as well as satellite remote sensing algorithms. The scarcity of flux measurements can limit understanding of ecohydrological responses to climate warming, especially in remote regions with limited infrastructure. Direct field measurements often apply the eddy covariance method on stationary towers, but recently drone-based measurements of temperature, humidity, and wind speed have been suggested as a viable alternative to quantify the turbulent fluxes of sensible (H) and latent heat (LE). A data assimilation framework to infer uncertainty-aware surface flux estimates from sparse and noisy drone-based observations is developed and tested using a turbulence-resolving large eddy simulation (LES) as a forward model to connect surface fluxes to drone observations. The proposed framework explicitly represents the sequential collection of drone data, accounts for sensor noise, includes uncertainty in boundary and initial conditions, and jointly estimates the posterior distribution of a multivariate parameter space. Assuming typical flight times and observational errors of light-weight, multi-rotor drone systems, we first evaluate the information gain and performance of different ensemble-based data assimilation schemes in experiments with synthetically generated observations. It is shown that an iterative ensemble smoother outperforms both the non-iterative ensemble smoother and the particle batch smoother in the given problem, yielding well-calibrated posterior uncertainty with continuous ranked probability scores of 12 W m−2 for both H and LE with

It is emphasized that the drone data assimilation framework is not confined to these assumptions and can thus readily be extended to more complex cases and other scalar fluxes, such as trace gases, in future studies.

Introduction
The significance of sensible (H) and latent (LE) heat fluxes between an underlying surface and the atmosphere aloft is not in dispute, given the plethora of problems spanning atmospheric, oceanographic, cryospheric, soil, and vegetation dynamics, in which these turbulent exchange processes arise. Direct measurements of these surface fluxes enable robust methods to evaluate and tune parametrizations used in climate models, and to develop algorithms for indirect flux retrieval using satellite remote sensing. Traditionally, flux measurements are collected on meteorological towers using the so-called eddy covariance (EC) technique (or other micro-meteorological approaches such as the Bowen ratio method). While these stationary tower measurements are often considered the best available technique for surface flux estimation, they are known to have limited spatial representativeness (Chu et al., 2021). Moreover, the link between measured turbulent heat flux at the tower and sources or sinks at the surface becomes problematic when these sources and sinks are spatially variable. An indirect manifestation of this problem is a failure to close the surface energy balance, with underestimates in excess of 20% (on average) being reported across the sites of the FluxNet network (Stoy et al., 2013). This problem is by no means confined to surface heterogeneity. A number of studies (Steinfeld et al., 2007; De Roo et al., 2018) showed that organized eddies can bias flux estimates even in homogeneous environments, which explains part of the typically observed lack of closure of the measured surface energy balance. One approach to ameliorate these issues is to include spatially distributed measurements, which frames the scope of the work herein. Airborne measurements from aircraft have a long tradition in atmospheric sciences and are used to complement flux towers (Desjardins et al., 1989; Mahrt, 1998).
Over the past decade, developments in miniaturized sensors and small unoccupied aircraft systems (hereafter referred to as drones) have been opening possibilities for studies of land-atmosphere interactions in ways not attempted before.
Drones measuring air temperature, humidity, and wind speed are a promising tool for spatially distributed measurements in meteorological studies (Lee et al., 2018; Barbieri et al., 2019). Fixed-wing drones are typically equipped with air speed sensors that allow for wind speed estimation (Elston et al., 2015). Multi-rotor drones introduce some distortions to the turbulence field around them, but they can still estimate horizontal wind speed based on their attitude derived from inertial measurement unit (IMU) data (Neumann and Bartholmai, 2015; Palomaki et al., 2017). Drones can also 'hover' in place at a point of interest or can sample along a pre-programmed flight path, making them a sort of intermediate between tethered balloon soundings and helicopter platforms. Drone data have already been used for flux estimation in several studies (Bonin et al., 2013; Hoffmann et al., 2016; Kim and Kwon, 2019; Båserud et al., 2020), typically using Monin-Obukhov similarity theory (MOST) (Monin and Obukhov, 1954; Foken, 2006) with flux-profile (Högström, 1988) or flux-variance (Katul and Hsieh, 1999) relationships, as well as vertically-integrated heating/drying rates, to infer surface fluxes. Due to the stochastic nature of turbulent transport, measurements are usually aggregated over longer time periods or large spatial distances to reduce the statistical variability. Given the limited flight time of drones, new trade-offs in the spatio-temporal sampling strategy could be developed to optimize flux estimates. Recently, multi-platform systems or drone swarms, carrying a mobile sensor network, have been shown to have capabilities for estimating emissions from gas point sources (Hutchinson et al., 2017; Ristic et al., 2020).
While the potential for drone-based flux measurements as a relatively low-cost and mobile complement to EC is promising, there are many open questions regarding the uncertainties of the resulting flux estimates, the optimal flight strategy, the required turbulent transport model, and the data-model fusion algorithms.
Returning to the issue of spatial variability and scales, the surface layer of the atmosphere constitutes a non-linear system where variability exists across all scales (Wyngaard, 2010). To explicitly represent intermittent and inhomogeneous turbulent transport as well as coherent structures requires high-resolution models that are computationally much more expensive than the flux-related expressions encoded in MOST. In particular, turbulence-resolving large eddy simulations (LES) are widely accepted tools to simulate boundary layer dynamics, as they explicitly resolve the energy-containing range of large eddies while they parametrize the effect of sub-grid scales on the resolved scales (see e.g., Stull, 1988). While MOST describes planar-homogeneous and stationary turbulence statistics in the absence of subsidence, LES allows for the analysis of turbulence time series at high temporal resolution, so as to realistically represent turbulence statistics collected at time scales of seconds to minutes. Some studies have already paved the way (Sühring et al., 2019) by performing idealized LES studies with known initial and boundary conditions, and with virtual airborne measurements to show the feasibility of airborne flux estimation techniques, even above heterogeneous surfaces (but disregarding sensor noise). This indicates that drone observations can be combined with LES to estimate surface fluxes. That is, the LES may be viewed as a mathematical operator that takes surface boundary conditions and key large-scale meteorological forcing and provides statistics such as turbulent fluxes and meteorological states at all points in the domain of interest over a period of time. These statistics can then be compared with 'noisy' data obtained from drones. Surface fluxes that optimally match the noisy measurements can then be inferred.
This view implies that a mathematically optimal technique for consistent data-model fusion can be formulated as a kind of Bayesian inference problem (MacKay, 2003; Jaynes, 2003; Särkkä, 2013; Gelman et al., 2013), which is typically referred to as data assimilation (DA) or inverse modeling in the geosciences (Carrassi et al., 2018; Evensen et al., 2022b). Herein, we adopt a broad Bayesian definition of the field of DA in line with Evensen et al. (2022a). In addition to the conventional DA problem of state estimation, this definition also encompasses the problem of parameter estimation. The latter is often referred to as an inverse problem (Stuart, 2010) rather than a DA problem. Since the flux estimation problem at hand is precisely such a parameter estimation or inverse problem, we are also leaning on developments in this field (Iglesias et al., 2013; Schillings and Stuart, 2017). In this study, we do not make any distinction between DA and inversion and take a unifying approach. Such techniques have, for example, been used in atmospheric inverse modeling to assess regional emissions of CO2 (Tans et al., 1989) and methane (Thompson et al., 2018). Due to computational limitations, studies using DA with LES models have only recently become possible (Lunderman et al., 2020), and an evaluation with drone observations together with independent flux measurements remains lagging (or lacking altogether).

In many practical applications, one typically omits stochastic terms in the model and assumes it to be a perfect representation of reality (so-called strong-constraint data assimilation) (Evensen, 2019). Even so, different DA techniques will excel depending on model complexity and the number of parameters in the problem. Variational DA combines the model and the data through the optimization of a cost function, but requires taking derivatives of the forward model with respect to its parameters (Bannister, 2017), which is difficult or impossible for most LES codes. Particle-based methods (van Leeuwen et al., 2019), such as the particle batch smoother scheme (Margulis et al., 2015), are conceptually well suited for drone data assimilation given their limited assumptions, but they are known to suffer from degeneracy for problems in higher dimensional parameter spaces (Snyder et al., 2008). Ensemble Kalman-based methods, such as the ensemble smoother (van Leeuwen and Evensen, 1996) and its iterative variants (Emerick and Reynolds, 2013), on the other hand, have been shown to overcome some of these limitations for very large parameter spaces. However, these approaches invoke Gaussian linear assumptions at the analysis phase when data and models are combined. These assumptions can be problematic given that Gaussian random variables do not respect physical bounds and many forward models in the geosciences are non-linear. This issue motivated studies that sought to describe how the iterative ensemble Kalman smoother can be used to improve urban air pollution estimation by assimilating both mean wind and gas concentrations with a Reynolds-Averaged Navier-Stokes (RANS) model (Defforge et al., 2021). Considering the potential and limitations of the different DA schemes, one may hypothesize that either ensemble Kalman-based or particle-based approaches (or a combination thereof) could be ideal for drone data assimilation in LES.
The aim of the present study is to first perform observing system simulation experiments (Masutani et al., 2010) to evaluate which DA scheme is most suited for the problem of flux estimation from drone observations, and to demonstrate what flux results can be expected from typical light-weight drone systems. We then apply the drone data assimilation technique to real-world measurements from drones and compare its results to concurrent eddy covariance flux estimates to demonstrate the feasibility of the method. To be clear, given the differences in footprint and underlying assumptions, we do not argue that this comparison offers a validation per se, only a plausibility check of the estimated order of magnitude of fluxes and their relative variability.

Data assimilation framework
The aim here is to infer surface fluxes of sensible and latent heat using sparse and uncertain drone measurements of meteorological variables in the atmospheric boundary layer. Solving this inverse problem requires a forward (or data generating) model that maps the parameters, namely the surface fluxes of interest and other uncertain boundary conditions, to the drone observations through

y = G(x) + ε,     (1)
where y ∈ R^d is the observation vector, G(·) is the forward model, x ∈ R^m is the target parameter vector, and ε ∈ R^d is the observation error. In practice, G(·) is a composition of multiple operations (cf. Evensen et al., 2022b). The inner operation, T(·), is a transformation step that maps the parameters from an unbounded space to a bounded physical space. This step helps satisfy the Gaussian assumption of the ensemble Kalman methods while avoiding unphysical values. By employing a turbulence-resolving LES as opposed to a RANS model for the dynamics M(·) in our forward model G(·), we are able to generate the surface flux to drone observation mapping since the LES is run at an appropriate level of spatio-temporal detail.
Even in the absence of observation error, the inversion of Equation (1) will typically be an ill-posed problem in the sense that a solution for the parameters x may not exist or be unique (Stuart, 2010). As such, it is more instructive to abandon the quest for a single optimal solution, which does not necessarily exist in a well-defined way, and rather approach this problem in a probabilistic manner using Bayesian inference (Jaynes, 2003; MacKay, 2003; Särkkä, 2013). We do this following a classical Bayesian approach in geosciences known as data assimilation (DA), reviewed elsewhere (Wikle and Berliner, 2007; Evensen et al., 2022b), where we use a prior distribution p(x) to represent our knowledge concerning possible values for the model parameters x before taking the observed drone data y into account. We combine this with a second distribution, the likelihood p(y|x), which describes the probability of generating the data for a given set of parameters of the LES model.

To help construct this likelihood, conventional DA assumptions are followed (e.g. Carrassi et al., 2018) by using an additive Gaussian observation error ε ∼ N(0, R) with zero mean and observation error covariance matrix R. Bayes' theorem then yields a posterior distribution of the parameters p(x|y) by taking the product of prior and likelihood, i.e.

p(x|y) = p(y|x) p(x) / p(y),     (3)
which represents our knowledge of the parameters and their uncertainties in view of our uncertain prior knowledge as well as the data and their assumed error distribution. The so-called model evidence p(y) in the denominator of Equation (3) simply plays the role of a normalizing constant in this context. To solve this probabilistic inverse problem in practice, various derivative-free ensemble-based DA schemes can be used to estimate the posterior numerically by adopting particle and/or Gaussian approximations.
This problem formulation is implicitly conditioned on the strong constraint (see Evensen et al., 2022b) that the forward model G is a perfect representation of reality. As George Box humorously notes, even though all models are wrong, what matters is the extent to which they are useful (Box, 1976). From this perspective, synthetic experiments (described below) are useful because the models are perfect by construction and thus useful for testing and comparing the DA algorithms. In the real experiments, where we compare with independent EC data, some of the mismatch between the EC estimates and drone-based inferences will undoubtedly be due to the strong assumptions made in the respective approaches. Given the level of realism in LES, these structural model errors introduced when moving the algorithm application from synthetic to field data are likely dominated by simplifications of topography and spatio-temporal flux variability. The Bayesian approach to inference also offers a way to compare the relative usefulness of different models using the model evidence (MacKay, 2003), although this will not be pursued here.

The turbulence-resolving Parallelized Large-Eddy Simulation Model (PALM) (Raasch and Schröter, 2001; Maronga et al., 2015), version 6.0, is used as the forward model. PALM solves the filtered Navier-Stokes equations and the first law of thermodynamics with the Boussinesq approximation to explicitly resolve turbulent motions in the atmospheric boundary layer.
The effect of sub-grid scale motions on the flow is parameterized using the kinetic energy scheme of Deardorff (1980) as the sub-grid model. PALM is widely used in the boundary layer community to simulate neutral, stable, and unstable boundary layers (Steinfeld et al., 2007; Couvreux et al., 2020) as well as scalar transport (Ardeshiri et al., 2020).
The number of grid points in the simulations is set to 256 by 256 longitudinally (along x) and laterally (along y). The initial potential temperature and specific humidity profiles are constant up to a height of 800 m, above which a capping inversion with a vertical gradient of 1 K per 100 m for potential temperature and −0.5 g kg−1 per 100 m for specific humidity is used. To facilitate comparison, we use the same simplifying assumptions as EC, namely homogeneity and stationarity of surface fluxes, and flat terrain.
Boundary and initial conditions for H, LE, aerodynamic roughness length (z_0), initial potential temperature (θ_init), initial specific humidity (q_init), and geostrophic wind speed at the surface (u_g) are varied in the LES ensemble simulations according to prior distributions for each parameter (described below). Of these six parameters, the primary interest is in H and LE, while the remaining four parameters can be regarded as 'nuisance' parameters (Bretthorst, 1988; Jaynes, 2003; Gelman et al., 2013). The nuisance parameters are still inferred from the data, but are then implicitly 'integrated out' as we primarily focus on the marginal posterior distributions of H and LE. Due to the planar-homogeneous surface, there is no need to use an extra parameter for the second lateral component of the geostrophic wind speed at the surface.
Each simulation starts with a spin-up period during which turbulence generation is triggered by adding artificial random perturbations until turbulence starts to develop freely. The time series of the maximum vertical wind velocity and the resolved-scale turbulence kinetic energy shown in Figure S1 (see Supplementary material) indicate that 4680 s typically suffices to achieve stationary turbulence statistics in most simulations (corresponding to about 10 eddy turnover times). Some ensemble members will represent parameter combinations that hardly allow for a turbulent flow regime, e.g. during strongly stable conditions with very negative sensible heat fluxes and low geostrophic wind speeds, and will therefore not develop stationary turbulence. Some of the prior parameter combinations might in reality also be physically improbable and would therefore yield extremely unlikely model predictions. Consequently, the inferred posterior probability will be low for such cases given that the drone data was generated under different regimes.

Prior distributions
The prior distributions for H and LE are set to be normal (i.e. Gaussian) distributions centered at 0 with standard deviations of 150 W m−2 each. For u_g and z_0, log-normal prior distributions were specified (to ensure strictly positive support) with means (of the underlying normal distribution) of 0.7 and −1.2, and standard deviations of 0.7 and 0.5, respectively. The priors for θ_init and q_init are set to be normal distributions with mean values that are determined separately for each experiment to account for the large differences in mean temperature and humidity between our experiments. For these variables only, we follow the empirical Bayesian approach to constructing priors (Murphy, 2022) and determine these mean values from the drone observations themselves based on the observed temperature and humidity range. For the synthetic experiments, we chose priors for θ_init and q_init that include the true values, but are not centered on them (approximately 0.5 standard deviations offset). This bias of the priors for these nuisance parameters makes subsequent inference more challenging and realistic. The standard deviations of the priors of θ_init and q_init are 0.3 K and 0.1 g kg−1, respectively. These values relate to the observation errors described below. To test the sensitivity to the uncertainty in initial conditions, we also conducted synthetic experiments with narrower prior distributions for θ_init and q_init (0.06 K and 0.03 g kg−1), labeled 'narrow init' below.
(Displaced figure caption: the color scale is independent for each ensemble member and omitted for clarity to emphasize relative differences in the spatial structure of the turbulent fields; in all plots, the white box indicates the domain for possible drone measurements.)

Note that we effectively use a so-called weakly informative prior (Banner et al., 2020) to limit the need for strong background information about the parameters. Moreover, we have adopted priors that are (or can readily be transformed to) Gaussian
distributions both for simplicity and because of the assumptions in the ensemble Kalman-based methods (see e.g., Carrassi et al., 2018) that we use. In theory, these methods assume a Gaussian prior and are thus closer to optimal when this assumption is satisfied. In practice, we can satisfy this assumption by using Gaussian anamorphosis techniques to transform bounded physical variables to an unbounded Gaussian space (Bertino et al., 2003; Aalstad et al., 2018). We have also included the possibility to add further information to these priors through correlations between individual parameters, because we know that in reality some conditions can make others more probable. For example, based on experience from EC flux measurements, we see that a large sensible heat flux makes a large latent heat flux more likely (as necessitated by the energy balance), with typical correlations of 0.5 to 0.8 in data from our sites. To accommodate such prior knowledge, our framework allows us to draw correlated samples for the parameters from a joint prior distribution with a specified covariance matrix. For the analysis of the real-world drone measurements (described later), we prescribe a prior correlation between H and LE of 0.5. All other prior correlations are left at 0. For ease of interpretation in the synthetic experiments, we kept the prior correlations between H and LE at 0.
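The correlated prior sampling described above can be sketched as follows. This is a minimal illustration in Python/NumPy under stated assumptions: the parameter ordering, variable names, and the restriction to four of the six parameters are hypothetical, not taken from the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(42)
n_ens = 100  # ensemble size N_e

# Means and standard deviations in the (transformed) Gaussian space:
# H, LE in W m-2; ln(u_g) and ln(z_0) are the underlying normals of the log-normals.
mu = np.array([0.0, 0.0, 0.7, -1.2])   # ordering: H, LE, ln(u_g), ln(z_0)
sd = np.array([150.0, 150.0, 0.7, 0.5])

# Prior correlation matrix: H-LE correlation of 0.5, all others 0.
corr = np.eye(4)
corr[0, 1] = corr[1, 0] = 0.5
cov = np.outer(sd, sd) * corr          # joint prior covariance matrix

# Draw correlated samples from the joint Gaussian prior.
samples = rng.multivariate_normal(mu, cov, size=n_ens)

# Transformation T(.): map unbounded Gaussian variables to the bounded
# physical space (exponential for the strictly positive parameters).
H, LE = samples[:, 0], samples[:, 1]
u_g, z_0 = np.exp(samples[:, 2]), np.exp(samples[:, 3])
```

The exponential plays the role of the Gaussian anamorphosis for the log-normal parameters, so the ensemble Kalman analysis can operate in the unbounded space while the forward model receives physical values.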

Drone measurements, observations and errors
Throughout this study, small multi-rotor drones equipped with light-weight sensors for air temperature and relative humidity are used along with a flight controller that estimates the drone's tilt angle for an indirect measurement of the horizontal wind speed U. This drone system is used for the real-world measurements and emulated for the synthetic experiments. A thin type-K thermocouple connected to a high accuracy thermocouple amplifier (MCP9600, Microchip Technology, USA) is employed to measure air temperature. A BME280 (Bosch, Germany) capacitive relative humidity sensor is used, which also measures barometric pressure (and thus elevation). These sensors sample every 10 s, which is slower than the actual response times of these sensors as well as the time steps of our LES runs. In practice, these sensors have slightly different response functions with time constants of a few seconds, but for simplicity we consider the samples to represent near-instantaneous values (at least for computing the mean observations). The data is converted to potential temperature θ and specific humidity q (for which the measurements of air pressure are used). Instead of mounting an anemometer on the drone to measure wind, we follow a common force-balance method using the drone's tilt angle during hovering to infer the wind speed (Neumann and Bartholmai, 2015; Palomaki et al., 2017). The tilt angle is estimated by the drone's state estimator (an Extended Kalman Filter implemented in the PX4 flight stack) based on the flight controller's (Pixhawk 4, Holybro, China) IMU sensors.
Using the quadratic drag law, the drag on the drone's body during hover can be balanced against the horizontal component of the thrust, yielding

m g tan(α) = (1/2) ρ C_D A v²,

where m is the mass of the drone (1.9 kg), g the gravitational acceleration (9.81 m s−2), C_D the drone's drag coefficient (estimated as 2.8 using wind tunnel experiments; Neumann and Bartholmai, 2015), α the tilt angle and ρ the air density (both estimated by the flight controller), A the drone's exposed area (estimated as 150 cm² from all directions), and v the relative horizontal wind speed. When the drone hovers at a fixed position, the horizontal wind speed U can be assumed to be equal to v. This method does not explicitly account for drag forces from rotor movements, which introduces additional uncertainty in the wind speed estimation. We used an X500 kit (Holybro) as drone platform, which typically provides a total flight time of 15 min with the battery and payload that we employed (see photos in Figure 2).
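This wind retrieval can be sketched as below, assuming the hover force balance m g tan(α) = ½ ρ C_D A v² solved for v. The function name and the default air density are illustrative choices, not part of the paper's code.

```python
import numpy as np

def wind_speed_from_tilt(alpha_deg, rho=1.2, m=1.9, g=9.81, C_D=2.8, A=150e-4):
    """Estimate the horizontal wind speed (m/s) from the hover tilt angle
    via the quadratic drag law: m*g*tan(alpha) = 0.5*rho*C_D*A*v**2.

    alpha_deg : tilt angle in degrees (from the flight controller's EKF)
    rho       : air density (kg/m^3), estimated by the flight controller
    A         : exposed area in m^2 (150 cm^2 = 150e-4 m^2)
    """
    alpha = np.deg2rad(alpha_deg)
    return np.sqrt(2.0 * m * g * np.tan(alpha) / (rho * C_D * A))
```

Larger tilt angles map to larger wind speeds; in hover, the relative wind speed v equals the horizontal wind speed U.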
The 10 s measurements of θ, q and U are aggregated to mean values of all measurements taken while the drone hovers at a fixed position (denoted as θ̄, q̄ and Ū). We additionally compute the differences between subsequent mean values to add local mean gradients to our observations (denoted as ∆θ̄, ∆q̄ and ∆Ū). This is done in a cyclic manner through the measurement locations, so that the local gradient at the first position is calculated as the difference to the last location.
The assumed error statistics of these observations are based on noise in the measurements caused by sensor imperfections and the mismatch between the scale of the observation and the scale of the model (representativeness error, see van Leeuwen, 2015), which is typical in meteorological data (Gandin, 1988). The related spatio-temporal representativeness errors are affected by the rotor wash from the drone, which mixes the air around the drone and makes its measurements more representative of spatial scales similar to the LES grid spacing. We first estimate the measurement error of the 10 s samples and then calculate the corresponding observation error by scaling the standard deviation of the (near) instantaneous measurement error with the inverse square root of the number of samples that are temporally aggregated to an observation. Based on the central limit theorem (e.g. Chopin and Papaspiliopoulos, 2020), the error model assumes this observation error to be Gaussian, independent, and uncorrelated between different variables, which corresponds to a diagonal observation error covariance matrix. Systematic errors, which occur when error distributions are asymmetric with respect to zero, are assumed to be negligible.
This leads to the following definition of the diagonal observation error covariance matrix R ∈ R^(d×d) employed in this study:

R = diag(τ ⊙ σ ⊙ σ),

where diag(·) is the diagonal operator that converts a vector to a diagonal matrix, τ ∈ R^d is a scaling vector, ⊙ denotes the element-wise product, and σ ∈ R^d contains the measurement error standard deviation for each observation. The elements of the scaling vector are defined as

τ_j = 1/S for mean observations and τ_j = 2/S for local mean-gradient observations,

where S is the number of measurement samples that are averaged to form an observation. As elaborated in Section 2.2, we test two types of flight plans. The first type involves step-wise vertical profiles where the drones hover in place for a 2 minute averaging period with a 10 s sampling interval, such that S = 12. The second type involves random exploration where no averaging is performed, such that S = 1. In summary, following independent Gaussian error propagation, this observation error covariance matrix implies that observation errors are uncorrelated, decrease with the number of samples S in an averaging period, and are larger for local mean gradients than for means.
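The construction of R can be sketched as follows, under the assumption (consistent with independent Gaussian error propagation) that a gradient observation, being a difference of two independent means, carries twice the error variance of a mean. The helper `build_R` and the example observation vector are hypothetical.

```python
import numpy as np

def build_R(sigma, is_gradient, S):
    """Diagonal observation error covariance R = diag(tau * sigma**2).

    sigma       : per-observation measurement error std (instantaneous samples)
    is_gradient : boolean array, True for local mean-gradient observations
    S           : number of samples averaged into each observation
    """
    sigma = np.asarray(sigma, dtype=float)
    tau = np.where(np.asarray(is_gradient), 2.0 / S, 1.0 / S)
    return np.diag(tau * sigma**2)

# Example: mean and gradient observations of theta (0.3 K) and U (2.0 m/s)
# for a hovering flight plan with S = 12 samples per 2-minute averaging period.
sigma = np.array([0.3, 2.0, 0.3, 2.0])
is_grad = np.array([False, False, True, True])
R = build_R(sigma, is_grad, S=12)
```

For the random-exploration flight plan one would simply call `build_R(sigma, is_grad, S=1)`, leaving the instantaneous error variances unscaled.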
The elements of σ are determined by the measurement error standard deviation of the respective sensors. For temperature measurements on drones, observational errors stem from radiative and adiabatic heating (due to air pressure fluctuations around the drone), and typical absolute root-mean-square errors are in the range 0.2 to 0.3 K (Wildmann et al., 2013). Here, we assume a standard deviation of 0.3 K for the measurement error. The standard deviation of the measurement error for q is estimated as 3% relative humidity (corresponding to about 0.1 g kg−1 specific humidity at typical air temperatures), based on the stated accuracy of the capacitive humidity sensor (BME280). For the horizontal wind speed U, the standard deviation of the measurement error is conservatively estimated to be 2.0 m s−1. Other studies using inertial measurement unit data of multicopter drones for wind estimation report measurement uncertainties of less than 0.5 m s−1 (Palomaki et al., 2017), but since we did not evaluate this uncertainty for our drones, we decided to use a somewhat larger value to avoid underestimating this uncertainty.

Data assimilation schemes
We implement four data assimilation schemes and assess their performance for the problem at hand (i.e. inference of H and LE). We use N_e = 100 model realizations (referred to as 'ensemble members' or 'particles' in data assimilation), each with a different set of parameter values, to represent the prior probability distribution.
For the first scheme, the particle batch smoother (PBS) introduced by Margulis et al. (2015) is used. The PBS is a batch-smoother version of the particle filter that is widely used in the snow data assimilation community (Fiddes et al., 2019; Alonso-González et al., 2021). It is effectively a particle filter without resampling, tantamount to basic sequential importance sampling.
This scheme is obtained by using a particle representation, i.e. a mixture of Dirac delta functions δ(·), of the prior, which serves as the proposal distribution to perform importance sampling-based Bayesian inference as outlined in Appendix A. The resulting posterior is effectively a weighted sum of particles

p(x|y) = Σ_{i=1}^{N_e} w_i δ(x − x_i),

where the weights are given by the likelihood ratio

w_i = exp(−(1/2) ε_i^T R^{−1} ε_i) / Σ_{j=1}^{N_e} exp(−(1/2) ε_j^T R^{−1} ε_j),

in which (·)^T denotes the transpose and ε_i = y − y_i are the predicted observation errors, where y_i = G(x_i) are the predicted observations from the forward model based on parameters that have been drawn from the prior, x_i ∼ p(x). It has been shown that the direct application of basic particle methods (i.e., importance sampling using the prior as the proposal) such as this often does not work well in high-dimensional systems (Snyder et al., 2008).

For the second scheme, we use the ensemble smoother (ES) (van Leeuwen and Evensen, 1996), a batch-smoother version of the ensemble Kalman filter (Evensen, 1994) that assimilates all observations simultaneously in a batch rather than sequentially. Such batch assimilation is more suitable for the inverse problem pursued herein (Evensen, 2018). Ensemble Kalman filtering (EnKF) methods are successfully used in data assimilation applications in meteorology and oceanography with tens of millions of dimensions (Carrassi et al., 2018). While the EnKF assumes that the forward model is linear and that all distributions are Gaussian, it turns out that the EnKF is robust to deviations from these assumptions in many applications (Katzfuss et al., 2016). These methods and the underlying equations are described in Appendix B.
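The PBS weight computation can be sketched as below, evaluating the Gaussian log-likelihood in log space for numerical stability. The function name and array layout (ensemble members along rows) are assumptions for this illustration.

```python
import numpy as np

def pbs_weights(Y_pred, y_obs, R):
    """Particle batch smoother weights: w_i proportional to
    exp(-0.5 * eps_i^T R^-1 eps_i), with eps_i = y_obs - y_i, normalized
    to sum to one.

    Y_pred : (N_e, d) predicted observations G(x_i) for each particle
    y_obs  : (d,) observation vector
    R      : (d, d) observation error covariance matrix
    """
    eps = y_obs[None, :] - Y_pred                    # innovations eps_i
    Rinv = np.linalg.inv(R)
    log_w = -0.5 * np.einsum('id,de,ie->i', eps, Rinv, eps)
    log_w -= log_w.max()                             # guard against underflow
    w = np.exp(log_w)
    return w / w.sum()
```

Particles whose predicted observations lie closer to the data (relative to R) receive larger weights; in high-dimensional problems most weights collapse toward zero, which is the degeneracy discussed above.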
For the third scheme, we use the Ensemble Smoother with Multiple Data Assimilation (ES-MDA) (Emerick and Reynolds, 2013). The ES-MDA is an iterative ensemble smoother that has been suggested as a more viable alternative to the non-iterative ES for highly non-linear forward models. In this iterative scheme, the same data is assimilated multiple times with an inflated observation error covariance matrix to better handle the non-linear mapping between the parameters of interest and the observations. In particular, the gradual updating reduces the impact of the linear assumption in the ES update step. Despite using the data more than once, this iterative scheme remains consistent in a Bayesian sense since inflation is performed in such a way that the iterations are equivalent to assimilating the data once with a linear model. At the root of these iterative schemes we find the idea of tempered transitions, a technique that is widely used in challenging Bayesian inference tasks (Neal, 1996; Stordal and Elsheikh, 2015; Iglesias and Yang, 2021). This tempering, in combination with their derivative-free implementation, has placed iterative ensemble Kalman methods at the frontier of ongoing research in Bayesian inverse problems (Stuart, 2010; Iglesias et al., 2013; Schillings and Stuart, 2017), which is helping to formalize, improve, and generalize this family of methods (Garbuno-Inigo et al., 2020; Iglesias and Yang, 2021; Cleary et al., 2021; Dunbar et al., 2022a). The equations and workflow for the ES-MDA scheme used herein are presented in Appendix B.
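A single ES-MDA analysis step can be sketched as below, using the standard stochastic ensemble Kalman update with the observation error covariance inflated by α, where the inflation factors across iterations satisfy Σ 1/α_l = 1 (e.g. α_l = n_iter for every iteration). The toy linear forward model in the demo is purely illustrative and not related to the LES.

```python
import numpy as np

def esmda_update(X, Y, y_obs, R, alpha, rng):
    """One ES-MDA analysis step: stochastic ensemble Kalman update of the
    parameter ensemble X (N_e, m) given predicted observations Y (N_e, d),
    with the observation error covariance R inflated by alpha."""
    n_ens = X.shape[0]
    Xa = X - X.mean(axis=0)                      # parameter anomalies
    Ya = Y - Y.mean(axis=0)                      # predicted-observation anomalies
    C_xy = Xa.T @ Ya / (n_ens - 1)               # cross-covariance (m, d)
    C_yy = Ya.T @ Ya / (n_ens - 1)               # observation-space covariance (d, d)
    K = C_xy @ np.linalg.inv(C_yy + alpha * R)   # Kalman gain (m, d)
    # Perturb the data with the inflated observation error for each member.
    y_pert = y_obs + rng.multivariate_normal(np.zeros(y_obs.size), alpha * R, size=n_ens)
    return X + (y_pert - Y) @ K.T

# Toy demo: linear forward model y = 2x, prior x ~ N(0, 1), observation y = 4.
rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(500, 1))
y_obs, R, n_iter = np.array([4.0]), np.array([[0.01]]), 4
for _ in range(n_iter):                          # alpha_l = n_iter, so sum(1/alpha_l) = 1
    X = esmda_update(X, 2.0 * X, y_obs, R, alpha=n_iter, rng=rng)
```

For this linear-Gaussian toy problem the ensemble mean converges toward the analytical posterior mean near x = 2; with a non-linear forward model, the gradual tempered updates reduce the impact of the linearity assumption in each step.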
For the fourth scheme, we propose a new particle iterative ensemble smoother (PIES) that uses the estimate from an iterative ensemble smoother, in the form of a Gaussian distribution, as the proposal distribution in importance sampling as outlined in Appendix A. Herein, we use the estimated Gaussian distribution from the penultimate iteration of the ES-MDA scheme as the proposal distribution. This new PIES scheme is an adaptation of the weighted EnKF described elsewhere (Papadakis et al., 2010) and the iterative ensemble smoothers. As with the PBS, the resulting posterior is effectively a weighted sum of particles p(x|y) ≈ Σ_{i=1}^{N_e} w_i δ(x − x_i), with weights given by w_i ∝ p(y|x_i) p(x_i)/q(x_i), where q(·) is the Gaussian proposal density. The motivation for this scheme is that the ES-MDA produces a biased approximation of the posterior for non-linear forward models (Stordal and Elsheikh, 2015). Although this bias is typically less severe than that of non-iterative ensemble Kalman methods (Emerick and Reynolds, 2013), it would nonetheless be advantageous to find efficient methods to reduce it. PIES is a straightforward translation of the scheme of Papadakis et al. (2010) to iterative ensemble smoothers such as the ES-MDA. As such, PIES can be viewed as a simple extension of the ES-MDA that does not necessarily impose any noticeable computational burden and might improve performance. As with all particle methods, the effective sample size can be used to diagnose degeneracy in the ensemble of particles (Chopin and Papaspiliopoulos, 2020). An effective sample size that is low relative to N_e indicates degeneracy, due to the fact that the proposal is too far from the target posterior.
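A minimal sketch of the PIES weighting and the effective sample size diagnostic, assuming the log-densities of the prior, likelihood, and Gaussian proposal have already been evaluated at each particle (function and variable names are hypothetical, not from the paper):

```python
import numpy as np

def pies_weights(logprior, loglik, logprop):
    """Auto-normalized importance weights w_i ∝ p(y|x_i) p(x_i) / q(x_i)
    for particles x_i drawn from the Gaussian ES-MDA proposal q."""
    logw = loglik + logprior - logprop
    logw -= logw.max()                    # log-sum-exp stabilization
    w = np.exp(logw)
    return w / w.sum()

def effective_sample_size(w):
    """ESS = 1 / sum(w_i^2); values far below Ne flag degeneracy."""
    return 1.0 / np.sum(w**2)

# Toy usage with Ne = 100 particles and synthetic log-densities
rng = np.random.default_rng(1)
w = pies_weights(np.zeros(100), rng.normal(size=100), np.zeros(100))
ess = effective_sample_size(w)
```

The ESS always lies between 1 (complete degeneracy, one particle carries all the mass) and N_e (uniform weights), which is what makes it a convenient degeneracy diagnostic.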

Synthetic experiments
To compare the performance of different DA schemes and observation strategies, a set of so-called synthetic (or twin) experiments was performed, in which observations are generated from a simulation with known 'true' parameters (θ_init = 294.1 K, q_init = 5.55 g kg⁻¹, u_g = 1.5 m s⁻¹). Experiments with higher wind speed using u_g = 6.0 m s⁻¹ were also performed to test how increased mixing from mechanical turbulence (as opposed to buoyancy-driven turbulence) and the correspondingly reduced spatial gradients affect the drone data assimilation flux estimates.
To evaluate the performance of the DA schemes, we use standard point metrics such as the root-mean-square error (RMSE) and bias (mean error) of the ensemble medians with respect to the true values. To also measure the quality of the entire ensemble, we employ the continuous ranked probability score (CRPS; Hersbach, 2000), which is a widely used score for ensemble verification in numerical weather prediction that generalizes the mean absolute error to an ensemble. It measures the distance between the entire ensemble and a deterministic reference value, in our case the truth, with 0 being the best possible score, which only occurs for a degenerate ensemble centered on the truth. To quantify the overall information gain in an experiment, we also calculate a Kullback-Leibler divergence (KLD; see e.g. Murphy, 2022) that measures the distance between the posterior and prior distributions (Perez-Cruz, 2008). We use nats as a unit for information content, where 1 nat corresponds to the information content of an event whose probability of occurrence is 1/e. These four metrics quantify different aspects of the fit and information gain of parameter distributions and can hence give a more holistic evaluation of a synthetic experiment.
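For a finite ensemble and a scalar truth, the CRPS can be computed in its sample (energy) form. The sketch below (our own helper, not code from the paper) illustrates the property stated above, namely that a degenerate ensemble centered on the truth attains the optimal score of 0:

```python
import numpy as np

def crps_ensemble(ens, truth):
    """Sample CRPS in energy form: E|X - truth| - 0.5 E|X - X'|,
    where X, X' are independent draws from the ensemble."""
    ens = np.asarray(ens, dtype=float)
    term1 = np.mean(np.abs(ens - truth))                       # accuracy term
    term2 = 0.5 * np.mean(np.abs(ens[:, None] - ens[None, :])) # spread term
    return term1 - term2
```

For a single-member (deterministic) ensemble the spread term vanishes and the CRPS reduces to the absolute error, which is the sense in which it generalizes the mean absolute error.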
Two different types of flight plans were used to generate the observations, both adhering to most countries' legal constraints that drones must not fly above altitudes of 120 m and that they must stay within visual range (a lateral domain estimated to be 500 m by 500 m). Based on MOST, we expect mean vertical gradients in measurements to increase towards the surface. For the first type of flight plan, we thus used a step-wise vertical profile with step sizes that increased with altitude. In particular, given the limited flight time of small multi-rotor drones, we used six vertical levels (at 10, 20, 30, 50, 70, and 100 m) with the drone hovering in place at each level for 2 minutes. We also performed synthetic experiments with flight times of 24 minutes, flying this step profile twice. The second type of flight plan tested explores a larger spatial domain instead of hovering at a fixed position for 2 min. We implemented this approach using a random walk with biased directionality that is based on movement models used for biological systems (Codling et al., 2008). Here, every 10 s the drone can stay at its position or move 20 m laterally (x or y dimension) and/or 10 m vertically (z dimension), i.e. moving two LES grid cells. These moves are random, but to explore a larger space, the probability to continue moving (or staying) in the same direction is increased for each spatial dimension. We also include the possibility of using multi-drone observations to test the performance of a mobile sensor network on a drone swarm. For this purpose, we assume individual drones to be identical in sensor specifications and flight time, corresponding to a so-called homogeneous swarm (see e.g. Ferreira-Filho and Pimenta, 2019).
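The directed random walk can be sketched as follows. The persistence probability, starting position, and exact domain bounds are illustrative assumptions rather than values from the paper; the moves of 20 m laterally and 10 m vertically every 10 s follow the description above:

```python
import numpy as np

def random_walk_plan(n_steps=72, p_persist=0.8, seed=0):
    """Directed random walk flight plan: every 10 s the drone may stay or
    move 20 m laterally and/or 10 m vertically. Per dimension, the previous
    move is kept with probability p_persist (hypothetical value) to bias
    the walk towards exploration; positions are clipped to a 500 m x 500 m
    lateral domain and 10-100 m altitude range."""
    rng = np.random.default_rng(seed)
    steps = np.array([20.0, 20.0, 10.0])            # move size per dimension (m)
    lo = np.array([0.0, 0.0, 10.0])                 # domain lower bounds (m)
    hi = np.array([500.0, 500.0, 100.0])            # domain upper bounds (m)
    pos = np.array([250.0, 250.0, 10.0])            # assumed start: center, 10 m a.g.l.
    move = rng.choice([-1, 0, 1], size=3)           # initial move per dimension
    track = [pos.copy()]
    for _ in range(n_steps):
        # keep the previous per-dimension move with prob. p_persist,
        # otherwise redraw it uniformly from {stay, +step, -step}
        redraw = rng.random(3) >= p_persist
        move = np.where(redraw, rng.choice([-1, 0, 1], size=3), move)
        pos = np.clip(pos + move * steps, lo, hi)
        track.append(pos.copy())
    return np.array(track)

track = random_walk_plan()          # 72 moves at 10 s each = 12 min flight
```

With n_steps = 72 the track corresponds to the 12-minute flight time used in the synthetic experiments.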

Field experiments
Field campaigns at two ecohydrological research sites with different climatic conditions were conducted: a boreal wetland in south-eastern Norway (Hisåsen, 61.11° N, 12.26° E, elevation 680 m a.s.l., mean annual air temperature 2.7 °C at the closest weather station) and a palsa mire in the discontinuous permafrost zone in northern Norway (Iškoras, 69.34° N, 25.30° E, elevation 355 m a.s.l., mean annual air temperature −1.6 °C at the closest weather station). Figure 2 shows photos of the two sites to give an impression of the settings.
These sites feature independent flux measurements from EC systems installed at 2.8 m a.g.l. at both sites. A CSAT3 sonic anemometer (Campbell Scientific) was used at Iškoras and an HS50 (Gill) at Hisåsen. Both sites use an LI-7200 infra-red gas analyzer (Li-Cor) for H2O mixing ratios. Raw data from these instruments are sampled at 20 Hz and processed to 30-min flux data following the FluxNet convention implemented in EddyPro version 6.2.0 (Li-Cor). We use block-average Reynolds decomposition to extract turbulent fluctuations, an anemometer tilt correction by double rotation, a constant time lag compensation, and a high- and low-pass filter correction (Moncrieff et al., 1997, 2005). For quality control, the flagging system proposed in FluxNet (Foken and Wichura, 1996) was applied.

We used the vertical step profile flight plan in all these flights. As these drone measurements are taken at altitudes up to 100 m a.g.l., the resulting flux estimates represent a larger surface footprint area (kilometer scale) compared to the EC method (tens to hundreds of meters). At both field sites, the footprint of the EC tower is dominated by wetlands, while the larger-scale surroundings feature forested areas with potentially different turbulent heat flux characteristics.

The slightly wider posterior spread for LE is expected due to the relatively large observational error for specific humidity, which contains most information about LE. We see that the drone observations do not include much information to constrain the roughness length z_0, as this nuisance parameter appears to be hardly updated. This lack of adjustment of z_0 can be explained by the relatively large observational error associated with the wind speed estimates. Hence, our prior belief strongly governs the distribution of this parameter.
It is nonetheless important to account for uncertainty in this nuisance parameter so as to avoid over-confident and possibly wrong inferences about the fluxes of interest. External information, for example from remotely sensed land cover data, may help constrain this parameter (see Section 4.2). The two parameters for the initial conditions update slightly towards the true values of this synthetic experiment. We see a noteworthy equifinality issue (Beven, 2006) between the initial temperature and the sensible heat flux: the negative correlation suggests that an ensemble member with low initial temperature and large sensible heat flux gives similar temperature predictions as an ensemble member with high initial temperature and small sensible heat flux. While this may not be a surprising result, at least a posteriori (or with hindsight), it emphasizes the importance of initial conditions for drone-based surface flux estimations. Despite the relatively large observational error associated with the wind speed estimates, we see that u_g updates considerably towards the truth for all DA schemes.

Varying the sampling strategies (step profile vs random exploration), flight time (12 vs 24 minutes), number of drones (1 vs 5), uncertainty in initial conditions (narrow vs broad), and the geostrophic wind speed (1.5 vs 6.0 m s⁻¹) led to a total of 16 synthetic experiments covering both wind speeds. Recall that the experiment with u_g = 1.5 m s⁻¹ corresponds to the experiment shown in Figure 3. As expected, increasing the flight time from 12 to 24 min, i.e. flying two consecutive step profiles, narrows the posterior distributions (but by less than a factor of two). Using observations from five drones flying step profiles, the posterior distributions become even narrower while still encompassing the 'truth'. A similar reduction in flux uncertainty is achieved by the narrower priors for the initial conditions θ_init and q_init. Using the biologically inspired random exploration flight strategy, we find that several of the posterior distributions can become narrower than their step-profile counterparts, but do not always include the true flux value, which is a symptom of the ES-MDA assimilation results being overconfident. This effect can be related to random exploration containing more observations (but without aggregation of multiple measurements) compared to step profiles. Additional results are shown in Figure S2 in the Supplementary material.

Figure 6 shows the surface flux comparisons with EC from all the field experiments. For the sensible heat flux, good agreement between the two methods is noted at both sites, with a high correlation coefficient (R = 0.90). The drone data assimilation flux estimates tend to yield higher fluxes than the EC method, which might be a real effect given the different footprints of the two methods (i.e. wetlands dominating the EC footprint have lower H and higher LE than their surroundings). For the latent heat flux estimates, the drone data assimilation typically yields larger flux uncertainty, and the correlation with EC fluxes is only 0.40.
There is a particularly large deviation between the methods in the three flights with the largest LE flux estimates from EC. Again, this deviation could be due to real flux differences between the wetland-dominated vicinity of the EC tower and the surrounding forest. We also emphasize that the EC fluxes do not necessarily represent the 'truth' in this comparison, even though we only used EC estimates with the highest quality flags. For the sum H + LE, which is constrained by the available energy at the surface and may therefore be more homogeneous, the close agreement (RMSE = 58 W m⁻²) and high correlation coefficient (R = 0.83) is noteworthy. As the drone-DA uncertainty estimates are largely a result of our experimental design, this deviation could be due to the larger footprint of the drone data assimilation method covering more of the wetland-forest mosaic at our sites. This hypothesis could be tested in a future study with additional measurements using scintillometers, which yield fluxes that are representative of larger areas than EC. The new method operates over a large range of heat fluxes and wind speeds, but it remains to be tested how strongly stable conditions (e.g. during nighttime and/or winter) affect the technique.

Field experiments
The basis for these results is that drone measurements and LES capture variability at approximately the same spatial and temporal scales. During hovering, the drone position is kept fixed to within about 1 m based on GPS and barometer measurements. The drone's rotor wash, however, creates local mixing of air, so that the temperature and humidity measurements represent a volume average around the drone, with a scale similar to the LES grid resolution. Incidentally, the LES time steps are at approximately the same temporal scale as the sensor response times. Any remaining representativeness errors can, in principle, also be accounted for formally in the data assimilation framework in future studies by using an appropriate observation operator (van Leeuwen, 2015). However, a remaining limitation is that the LES output at grid levels close to the surface is affected by the subgrid scheme and the coupling to the surface, which is typically conducted using MOST.
Using typical sensor configurations and flight times of small drones, we find a relatively large posterior spread of the surface fluxes compared to the uncertainty estimates of the EC technique, which solely account for sampling errors arising from the small number of large eddies captured in the 30-min flux interval. While this may be expected given the non-linear and chaotic nature of the turbulent transport process, the uncertainty estimate here is based on explicitly stated error distributions of the observations and epistemic uncertainties in the dynamical system related to initial and boundary conditions as well as parameters. It is therefore feasible to study and reduce flux uncertainties in our framework. For example, further observing system simulation experiments can be carried out to test the impact of sensor quality, observation strategies, and large-scale meteorological forcing (see Section 4.2).
Drone flux estimation provides a relatively low-cost and mobile complement to traditional methods like EC. Most applications of drones are currently still restricted to manual flights with a human pilot in charge of the system (see, e.g., Bassi (2020) for a discussion of European airspace regulations). For fully automated and continuous drone flux measurements, a number of engineering and legal limitations need to be overcome. Nonetheless, even under the current constraints, the resulting flux temporal snapshots cover a much larger area than the typical EC footprints and can therefore be more suitable for large-scale aggregate flux measurements. These snapshots may thus assist Earth system model improvement through targeted testing of different process formulations and parameter settings, e.g. for new plant functional types (Dagon et al., 2020) or snow schemes (Aalstad et al., 2018).
Furthermore, it is worth highlighting that trace gas fluxes can also be estimated with this drone data assimilation technique, which would be particularly relevant for CO2 and methane emissions from heterogeneous permafrost environments (Pirk et al., 2017). Gas concentration measurements from drones are already emerging as a cost-efficient tool for the petroleum industry (Asadzadeh et al., 2022) and the agriculture sector (Daube et al., 2019), where greenhouse gas emissions can contribute to climate warming. Drone data assimilation could thus be a valuable tool to help monitor such hidden, and sometimes avoidable, emission sources.

Possible improvements
Improved surface flux inferences can be achieved in a number of different ways, including technical improvements during data collection, such as using higher quality sensors, more drones, and better quantifying initial and boundary conditions, as well as modifying the data assimilation framework by using a larger ensemble size to improve the Monte Carlo approximation, more assimilation cycles to better account for nonlinearity, emulators to increase the smoothness of the likelihood function (Cleary et al., 2021), and a higher spatial resolution of the LES model to reduce structural model errors. The present study shows a choice of settings that we intuitively considered reasonable or practically possible, but more work should go into systematically exploring the many orthogonal dimensions involved in the optimal experimental design of the drone data assimilation framework.
More effort could also go into formulating the priors, especially because some parts of the covered parameter space might in reality be physically improbable (e.g. through inclusion of the energy balance). Our framework allows adding further information to the priors through correlations between individual parameters, which we only used for H and LE in our field experiments.

The effect of these prior parameter correlations was mostly a slightly more effective exploration of the parameter space, but future studies could investigate how this feature can be used to reduce the computational costs with expensive models like LES. Other parts of the parameter space could be constrained based on independent information, for example, by using downscaled reanalysis datasets that combine other Earth observation data (e.g. Fiddes et al., 2019; Alonso-González et al., 2021). A complementary approach could also be to directly incorporate land cover information, e.g. from satellite retrievals (Aalstad et al., 2020), into the design of flux maps in the turbulence simulation, as was done in van der Valk et al. (2022). The boundary layer height, or the height of the prescribed capping inversion, has been assumed to be known herein and was thus not included as a nuisance parameter in the DA scheme. Future studies should further test the validity of this assumption and explore the sensitivity of drone flux measurements to entrainment fluxes at the top of the boundary layer, which are known to affect turbulent quantities in the surface layer (van de Boer et al., 2014).

An extension of the list of uncertain parameters should ideally also be accompanied by including more observational constraints. In the current study, we assimilated observations of the mean values over short periods of time, and their local mean gradients. Higher-order moments (such as variances and covariances), atmospheric structure functions (Arenas and Chorin, 2006), or other features could also be used to capture more information from the drone measurements in future studies.
The flight strategy used for the collection of observations could also be optimized. The determination of the optimal sampling strategy for mobile sensor networks (giving sparse and noisy data) can more generally be regarded as an example of the exploration-exploitation dilemma (Box and Youle, 1955; Friston et al., 2015). In practice, the choice or trade-off between fewer, high-quality observations and more, lower-quality observations becomes an active choice of the investigator. We have only tested two simple strategies: (i) what we called an intuitive approach that involved flying a vertical step profile with averaging times of two minutes, and (ii) a more exploratory approach based on directed random walks without averaging. The results indicate that both methods can constrain the surface fluxes, but random exploration without averaging multiple measurements for an observation can give biased flux results. These biases are likely due to shortcomings of the assimilation schemes used when dealing with strongly non-linear forward models rather than the sampling strategy itself, and so could be alleviated by improving the assimilation algorithms. Future studies could moreover formalize and optimize the trade-off between exploration and exploitation more specifically using a calibrate-emulate-sample framework (Cleary et al., 2021; Dunbar et al., 2022b) to determine the optimal strategy for a given flux mapping task.
While we only applied drone data assimilation to cases with homogeneous and stationary surface fluxes and flat terrain (as a logical first step to facilitate comparability with EC), it is clear that the method can also explicitly account for more realistic representations of these effects. Complex terrain and flux heterogeneity can be explicitly included in the LES steering data, and some field sites might even require a more detailed LES setup to account for energy flux heterogeneity (Ramtvedt and Pirk, 2022). Representing the mesoscale meteorological setting more realistically could also be achieved through a newly developed mesoscale nesting of PALM (Lin et al., 2021). Finally, even non-stationary fluxes could be included by going from a smoothing to a filtering data assimilation framework. This might in the future help to complement EC measurements, and maybe even improve on them by identifying assumption violations that are causing the energy balance closure problem of the EC method (Stoy et al., 2013).

Data assimilation schemes for turbulent transport
Spatio-temporal data assimilation with LES is relatively complex mathematically and not commonly studied. In this case the forward model for the data assimilation becomes highly non-linear, which violates the assumptions of commonly used methods for higher-dimensional problems, such as ensemble Kalman filters and smoothers. To avoid biased results, the degree of violation must be reduced, which can be achieved by the iterative approach implemented in the ES-MDA scheme. Our results, particularly Figure 3 and Table 1, show that the ES-MDA scheme can markedly outperform both the ES and the particle-based methods tested herein. These findings are in close agreement with earlier snow data assimilation experiments (Aalstad et al., 2018; Alonso-González et al., 2022) that compared these schemes with similarly (i.e. medium) sized parameter spaces, albeit with considerably less complex forward models. For relatively small numbers of iterations it was suggested that non-uniform error inflation for the sequence of assimilation cycles (leading to non-uniform update steps) could be beneficial for the results of the ES-MDA scheme (Evensen, 2018). We tested this idea in a small number of synthetic experiments (not shown), but did not find a striking improvement of the results. Still, we are of the opinion that related ideas to improve the tempering should be tested more extensively in future studies, for example by following the adaptive approach outlined by Iglesias and Yang (2021).
The PBS scheme is less affected by this problem, but a six-dimensional parameter space is already so large that the method cannot effectively represent the posterior distribution due to the curse of dimensionality that plagues importance sampling-based methods (Snyder et al., 2008). The PIES scheme presented here aims to overcome this issue by combining the ES-MDA scheme with the PBS scheme to take advantage of their individual strengths. In particular, the PIES scheme is (unlike the PBS) not confined to simply weighting the initial samples drawn from the prior. Instead, a proposal distribution based on the ES-MDA is used to guide the attention of the importance sampling to areas of higher posterior probability mass. As such, we see that the PIES scheme markedly outperforms the PBS in terms of RMSE (cf. Table 1), but nonetheless still suffers from degeneracy. To further improve the PIES scheme, and potentially avoid degeneracy, future studies could explore the possibility of using more iterations in the ES-MDA that is used for the proposal distribution. An alternative path would be to explore iterative particle methods (Chopin and Papaspiliopoulos, 2020).
We have not used the gold standard technique for Bayesian inference, namely Markov chain Monte Carlo methods (e.g. Gelman et al., 2013), because our likelihood evaluations are so expensive that the sequential exploration of the parameter space with tens or even hundreds of thousands of steps would not be possible in a realistic time frame. New data assimilation schemes, some of which are specifically designed to handle problems with expensive likelihoods (Garbuno-Inigo et al., 2020), could open new possibilities for drone flux measurements. Among these schemes, a particularly promising route could be to explore the adoption of machine learning-based emulators (Cleary et al., 2021) and active learning (e.g. Murphy, 2022) as steps to further improve the posterior estimates without considerably increasing the computational cost. The majority of this computational burden stems not primarily from the update steps themselves, but rather from the need to iteratively run an ensemble of LES. The cost of running a single LES with PALM given our experimental setup is on average in the order of 50 CPU hours. The cost of running PALM with a particular parameter combination varies considerably given the adaptive time step in PALM, but this average cost gives an indication of the considerable computational effort involved. As such, the computational cost of the DA schemes can be measured directly in terms of the number of LES runs (N_r) required to infer the posterior flux estimates. Herein, these fluxes are parameters rather than states, so we do not strictly need to run posterior predictions, thus lowering the computational costs. Still, the ES-MDA with N_a = 2 iterations and N_e = 100 ensemble members requires N_r = N_a × N_e = 200 LES. The PIES scheme requires exactly the same number of LES as the ES-MDA.

The non-iterative ES and PBS schemes, on the other hand, have a lower cost of N_r = N_e = 100 LES. Performing these DA schemes together in the same experiment, i.e. with the same prior ensemble, has a lower cost than running them separately. In particular, while running the ES-MDA, all the other schemes can effectively be run for free as benchmarks without the need for any additional LES. The total number of LES undertaken in this study was nonetheless considerable given that we performed 16 synthetic experiments and 18 real experiments, each with N_r = 200, amounting to a total of around 6800 LES. It is worth noting that this is still considerably less than the cost of a single Markov chain Monte Carlo experiment, which typically requires in the order of 10⁵ model evaluations. Nonetheless, the cost of these simulations placed a considerable constraint on the number of experiments we could perform to explore an otherwise vast space of design choices that should be investigated in future studies.

Conclusion
To facilitate the development of drone flux measurements, a data assimilation framework for the estimation of turbulent heat fluxes at the surface from sparse and noisy drone-based observations is presented, using LES as a forward model. This framework allows explicit representation of the sequential collection of drone data, accounts for sensor noise, and includes uncertainty in boundary and initial conditions during the flux estimation. Different data assimilation schemes have been shown to markedly constrain the surface fluxes by using drone observations, with the ES-MDA scheme outperforming the three other tested schemes. Both synthetic and field experiments show promising results for homogeneous and stationary fluxes, which are in reasonable agreement with independent EC flux estimates. Increasing flight times, using observations from multiple drones, and narrowing the prior distributions of the initial conditions are viable methods to further improve flux results. Sampling strategies prioritizing spatial exploration instead of temporal averaging at fixed positions enhance the non-linearities in the forward problem and can lead to biased flux results.
While the comparison here uses the simplifying assumptions of flux homogeneity, stationarity, and flat terrain, we emphasize that the drone data assimilation framework is not confined to these assumptions (as long as they can be accommodated in the forward model) and can thus readily be extended to more complex cases in future studies. Future effort could aim to apply this framework to estimate gas fluxes of e.g. CO2 and methane, which would be another valuable contribution to Earth system science.

Appendix A: Importance sampling
Importance sampling lies at the core of particle (or sequential Monte Carlo) methods such as PIES and PBS. Rather than directly sampling from a target distribution of interest, this sampling method estimates expectations with respect to a target distribution through indirect Monte Carlo integration by drawing from a proposal (also known as importance) distribution that is easier to sample from (MacKay, 2003). In DA, and Bayesian inference more generally, the posterior is the target distribution of interest, and the expectation of some function g(x) with respect to the posterior is defined as (Särkkä, 2013)

E[g(x)|y] = ∫ g(x) p(x|y) dx,   (A1)

where, for example, the expectation of g(x) = x yields the posterior mean. Given N_e independent samples from the posterior, x_i ∼ p(x|y), we could approximate the expectation in (A1) numerically using direct Monte Carlo integration through E[g(x)|y] ≈ (1/N_e) Σ_{i=1}^{N_e} g(x_i). Due to the law of large numbers and the central limit theorem, this approximation will converge almost surely to the true expectation as N_e → ∞, with a standard error inversely proportional to √N_e (Chopin and Papaspiliopoulos, 2020). In practice, it is rarely possible to generate independent samples directly from the posterior.
In importance sampling, we adopt a tractable proposal distribution q(x) with (at least) the same support as the posterior.
Multiplying the integrand by 1 = q(x)/q(x), (A1) can be expressed as

E[g(x)|y] = ∫ g(x) [p(x|y)/q(x)] q(x) dx ≈ (1/N_e) Σ_{i=1}^{N_e} w_i g(x_i),   (A2)

where the scheme draws x_i ∼ q(x), so that the normalized weights can be defined as w_i = p(x_i|y)/q(x_i). An obstacle remains in that we can only directly evaluate the un-normalized posterior f(x) = p(y|x)p(x) and not the evidence p(y) in the denominator of (3), because the evidence is the integral of f(x). Nonetheless, we can also approximate the evidence with importance sampling through p(y) = ∫ f(x) dx ≈ (1/N_e) Σ_{i=1}^{N_e} w̃_i, where w̃_i = f(x_i)/q(x_i) are the un-normalized weights. With this evidence approximation, we can now approximate (A1) as

E[g(x)|y] ≈ Σ_{i=1}^{N_e} w_i g(x_i),   (A3)

where the (auto-normalized) weights are given by w_i = w̃_i (Σ_{k=1}^{N_e} w̃_k)⁻¹, with w̃_i = f(x_i)/q(x_i) the un-normalized weights, and the property that Σ_{i=1}^{N_e} w_i = 1. To ensure numerical stability, we use the log-sum-exp transformation when computing these weights (Murphy, 2022). The PBS scheme is obtained by using the prior as the proposal, i.e. q(x) = p(x), where (7) is for the particular case of a Gaussian prior and likelihood used herein. Similarly, the novel PIES scheme in Equation (8) is obtained by using the Gaussian distribution N(μ̃, C̃) from the penultimate ES-MDA iteration as the proposal distribution.

As a final point, we emphasize that importance sampling results in a particle representation of the posterior through a sum of weighted Dirac delta functions centered on the sampled states x_i ∼ q(x), of the form p(x|y) ≈ Σ_{i=1}^{N_e} w_i δ(x − x_i) (Särkkä, 2013). This point can be appreciated by recalling that the Dirac delta has the properties ∫ δ(x − x_i) dx = 1 and ∫ g(x) δ(x − x_i) dx = g(x_i), so that if we insert the particle representation in (A1) we obtain

E[g(x)|y] ≈ ∫ g(x) Σ_{i=1}^{N_e} w_i δ(x − x_i) dx = Σ_{i=1}^{N_e} w_i g(x_i),

which is the same as the result in (A3). Under this particle representation we can conceptualize the posterior distribution as a set of particles (or points) in parameter space whose probability masses are given by their weights.
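The auto-normalized weighting above can be exercised on a conjugate Gaussian toy problem where the exact posterior is known: for a prior x ∼ N(0, 1), a likelihood y|x ∼ N(x, 1), and an observed y = 1, the posterior is N(0.5, 0.5). This sketch is our own illustration of the Appendix A workflow, using the prior as the proposal (as in the PBS):

```python
import numpy as np

rng = np.random.default_rng(42)

Ne, y = 200_000, 1.0
x = rng.normal(0.0, 1.0, Ne)            # draws from the proposal q = prior
logw = -0.5 * (y - x)**2                # log-likelihood (prior/proposal terms cancel)
logw -= logw.max()                      # log-sum-exp stabilization
w = np.exp(logw)
w /= w.sum()                            # auto-normalized weights, sum to 1

post_mean = np.sum(w * x)               # E[g(x)|y] with g(x) = x
post_var = np.sum(w * x**2) - post_mean**2
```

Both weighted moments converge to the analytical posterior values (0.5 and 0.5) as N_e grows, illustrating the particle representation of the posterior.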

Appendix B: Ensemble Kalman methods
The ensemble Kalman filter (EnKF; Evensen, 1994) is a Monte Carlo version of the Kalman filter (Jazwinski, 1970; Särkkä, 2013). These schemes both make a Gaussian linear assumption, on top of the usual filtering assumptions of Markovian dynamics and conditionally independent observations. When these assumptions are satisfied, the exact filtering distribution is a Gaussian that is available analytically in closed form through the Kalman filtering equations. Unlike the original Kalman filter, the EnKF can still be used when these assumptions are violated. In fact, it is remarkably robust to such violations, which explains why it is a widely used method for the typically nonlinear and high-dimensional problems that arise in geoscientific data assimilation (Carrassi et al., 2018). These ensemble Kalman methods can also be applied to solving more general smoothing problems in which asynchronous observations are assimilated (Cosme et al., 2012). The ensemble smoother (ES; van Leeuwen and Evensen, 1996), a batch smoother version of the EnKF, and its iterative variants such as the ES-MDA (Emerick and Reynolds, 2013) have been shown to be particularly useful in the context of estimating static parameters in inverse problems (e.g. Evensen, 2018; Aalstad et al., 2018; Evensen, 2019; Garbuno-Inigo et al., 2020; Cleary et al., 2021; Alonso-González et al., 2022).

Here, the equations for both the stochastic ES and the ES-MDA are presented, while noting that a full derivation of the ensemble Kalman analysis equations can be found elsewhere (e.g. Evensen et al., 2022b). Let N_a denote the number of assimilation cycles; for the ES we set N_a = 1, while for the ES-MDA N_a > 1, typically with N_a = 4. The superscript (ℓ) indexes these iterations. Let X^(ℓ) = [x_1^(ℓ), …, x_{N_e}^(ℓ)] denote the m × N_e matrix whose columns are the ensemble members. These stochastic ensemble Kalman methods proceed by initially drawing the parameters from the prior, x_i^(0) ∼ p(x), and then iterating for ℓ = 0 : (N_a − 1):

X^(ℓ+1) = X^(ℓ) + K^(ℓ) [Y + E_α^(ℓ) − G(X^(ℓ))],

where Y is a d × N_e matrix with N_e copies of the observation vector y, while the observation error term is E_α^(ℓ) = √α^(ℓ) R^{1/2} ζ^(ℓ), in which ζ^(ℓ) is a d × N_e matrix containing draws from a standard Gaussian N(0, 1) and α^(ℓ) = N_a is the observation error inflation coefficient. The so-called (ensemble) Kalman gain K^(ℓ) is the m × d matrix

K^(ℓ) = C_XY^(ℓ) (C_YY^(ℓ) + α^(ℓ) R)⁻¹,

where C_XY^(ℓ) is the empirical cross-covariance between the parameter ensemble and the predicted observations G(X^(ℓ)), and C_YY^(ℓ) is the empirical covariance of the predicted observations.
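One stochastic ES-MDA cycle with empirical ensemble covariances can be sketched compactly in NumPy. The function name and the toy forward model used below are our own; in the paper the forward model G(·) is the LES:

```python
import numpy as np

def esmda_update(X, Y_pred, y, R, alpha, rng):
    """One stochastic ES-MDA cycle (Emerick and Reynolds, 2013).

    X      : (m, Ne) parameter ensemble
    Y_pred : (d, Ne) predicted observations G(X)
    y      : (d,)    observation vector
    R      : (d, d)  observation error covariance
    alpha  : inflation coefficient for this cycle; the sum of 1/alpha
             over all cycles must equal 1 (e.g. alpha = Na every cycle)
    """
    m, Ne = X.shape
    d = y.size
    A = X - X.mean(axis=1, keepdims=True)            # parameter anomalies
    B = Y_pred - Y_pred.mean(axis=1, keepdims=True)  # predicted obs anomalies
    Cxy = A @ B.T / (Ne - 1)                         # (m, d) cross-covariance
    Cyy = B @ B.T / (Ne - 1)                         # (d, d) predicted obs covariance
    K = Cxy @ np.linalg.inv(Cyy + alpha * R)         # ensemble Kalman gain (m, d)
    # perturbed observations: sqrt(alpha) * R^{1/2} * standard normal draws
    E = np.sqrt(alpha) * np.linalg.cholesky(R) @ rng.standard_normal((d, Ne))
    Ymat = np.tile(y[:, None], (1, Ne))              # Ne copies of the observations
    return X + K @ (Ymat + E - Y_pred)

# Toy usage: identity-like forward model standing in for the LES
rng = np.random.default_rng(0)
X = rng.normal(size=(2, 100))
Y_pred = X + 0.1 * rng.normal(size=(2, 100))
Xp = esmda_update(X, Y_pred, np.zeros(2), np.eye(2), alpha=4.0, rng=rng)
```

Running this update N_a times, with Y_pred recomputed from the forward model after each cycle, yields the full ES-MDA workflow; setting alpha = 1 and N_a = 1 recovers the non-iterative stochastic ES.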