A Software Package to Simplify Tikhonov Regularization with Examples for Matrix-Based Inversion of SMPS and HTDMA Data

Tikhonov regularization is a tool for reducing noise amplification during data inversion. This work introduces RegularizationTools.jl, a general-purpose software package to apply Tikhonov regularization to data. The package implements well-established numerical algorithms and is suitable for systems of up to ~1000 equations. Included is an abstraction to systematically categorize specific inversion configurations and their associated hyperparameters. A generic interface translates arbitrary linear forward models defined by a computer function into the corresponding design matrix. This obviates the need 5 to explicitly write out and discretize the Fredholm integral equation, thus facilitating fast prototyping of new regularization schemes associated with measurement techniques. Example applications include the inversion involving data from scanning mobility particle sizers (SMPS) and humidified tandem differential mobility analyzers (HTDMA). Inversion of SMPS size distributions reported in this work builds upon the freely-available software DifferentialMobilityAnalyzers.jl. The speed of inversion is improved by a factor of ~200, now requiring between 2 and 5 ms per SMPS scan when using 120 size bins. 10 Previously reported occasional failure to converge to a valid solution is reduced by switching from the L-curve method to generalized cross-validation as the metric to search for the optimal regularization parameter. Higher-order inversions resulting in smooth, denoised reconstructions of size distributions are now included in DifferentialMobilityAnalyzers.jl. This work also demonstrates that an SMPS-style matrix-based inversion can be applied to find the growth factor frequency distribution from raw HTDMA data, while also accounting for multiply-charged particles. The outcome of the aerosol-related inversion methods 15 is showcased by inverting multi-week SMPS and HTDMA datasets from ground-based observations, including SMPS data obtained at Bodega Bay Marine Laboratory during the Calwater 2/ACAPEX campaign, and co-located SMPS and HTDMA data collected at the U.S. Department of Energy observatory located at the Southern Great Plains site in Oklahoma, U.S.A. Results show that the proposed approaches are suitable for unsupervised, nonparametric inversion of large-scale datasets as well as inversion in real-time during data acquisition on low-cost reduced-instruction-set architectures used in single-board computers. 20 The included software implementation of Tikhonov regularization is freely-available, general, and domain-independent, and thus can be applied to many other inverse problems arising in atmospheric measurement techniques and beyond.

This manuscript addresses the important issue of automating the processing of tandem DMA data. The idea of inverting data with regularization is sound. However, there are problems with the forward model of calculating system response from a known input distribution. If these issues can be properly addressed, the resulting software package should prove of great utility.

Major Comments
There appears to be a problem with proper accounting of diffusional losses and broadening of the transfer function, Ω, for DMA 2. Eq. (10) in the form of A characterizes the transfer through DMA 1 while the equation for O (line 230) characterizes transfer through DMA 2. As noted in Petters (2018), these two expressions are analogous except for the inclusion of Tc in the former and the limitation of the summation to k=1 in the latter. In the DMA, a particle is sized according to its apparent mobility diameter whereas diffusional losses as well as broadening of the transfer function are dependent on the true mobility diameter via particle diffusivity. Given one of these diameters, the particle charge is required to calculate the other and ultimately , size T   . Thus, it is important to sum over all charge states individually to calculate the diffusing transfer through a DMA. As this is not done for the second DMA, the given expression cannot be properly accounting for transfer of multiply charged particles.
The interpretation of Eq. (11) and its components would be greatly facilitated by an explicit indication of the independent parameters of distribution for the input size distribution n cn .
Also, the precise form of n cn (e.g. dN/dDp, dN/dlnDp, or dN/dlogDp) is important. The most obvious set of independent particle parameters would be (true) mobility diameter, D1, and charge, k. However, it appears that n cn is distributed according to apparent mobility diameter, Dk, and k in order to have the balance of the equation work out. The apparent mobility diameter is then pre-multiplied by the effective, or apparent, growth factor, gfk(z s ,gf0), and then by the ratio of true to apparent mobility diameters, D1/Dk. However, this ratio is being evaluated at the DMA 1 centroid mobility, z s , but applied to the Z grid after growth. Since this ratio is a function of size, this does not work out. Also, this means that the input distribution to the O operator characterizing DMA 2 transfer is in terms of true mobility diameter, in contrast to the n cn input to DMA 1 and A. All of this switching back and forth between true and apparent mobility diameter seems overly complicated.

Minor Comments and Corrections
line 158: Insert a space between "as" and "x". lines 216-217: "The size distribution after passage through the DMA is given by r = An + , where r is the response function … ." The size distribution exiting the DMA and the response of the detector are not the same thing. The former is usually given as dN/dlogDp while the latter, as in the case of a CPC, is given by NCPC, a simple number concentration. Also, there is the matter of the detector efficiency as well as the transport efficiency between the DMA and the detector, unless the latter has been subsumed into the DMA transport efficiency. As A is to later serve as the operator corresponding to transfer through DMA 1 in a tandem DMA setup, An must represent a size distribution, not a response function.
Eq. (11) and following: The double character notation for growth factor as "gf" is atypical as far as normal mathematical notation is concerned. It is too easily interpreted as g times f, rather than as a single parameter. And in this draft of the manuscript there is actually extra space between the two letters, increasing the likelihood of the wrong interpretation. However, it is seen that this space is eliminated in Petters (2018) so presumably it can and will be eliminated in the final typeset form. If not for this preexisting work and a strong preference to remain consistent with that, it would be better to change this to a single character form such as simply "g". Also, the reason for the choice of the cn superscript on n for the input distribution is quite obscure. Does that stand for something?
line 230: Given the length and complexity of the expression for the operator O, it would be better placed on a line by itself and numbered.
Discretization: As noted (lines 461-462), the forward model for the TDMA represents a triple integral. The parameters of integration may be denoted as Di and Do, the mobility diameters before and after growth, and gf0, the size-independent growth factor. Though the discretization of these parameters is automated in the software, some discussion of the constraints on this discretization should be included here. For instance, is there a restriction between the number of particle diameter bins and the number of measurement bins? Eq. (15) and the statement (line 248) "The size of A2 is n 2 , …" would imply that the number of gf bins must be equal to the number of measurement bins. Is this a necessary condition and, if so, why?
Figs. 1-4: Frequency vs. Growth Factor: Growth factor gf and its frequency distribution Pgf are naturally continuous functions, though the former is (artificially) discretized for the purposes of inversion. Just as the size distribution, n, is explicitly written as dN/dDp or dN/dlnDp with total integral N, the growth factor frequency distribution is also a derivative, dF/dgf or dF/dlngf, with total integral F=1. However, in the indicated plots, the frequency is plotted as for a parameter with truly discrete values such that the sum of the heights, rather than the areas, of the bars is equal to 1. That is, the height of each bar is given by (dF/dlngf)•lngf where lngf is the width of the bar. If the growth factor is discretized such that lngf is constant, then what is plotted is simply a uniformly scaled version of the more traditional dF/dlngf plot, though this would normally be versus lngf. As plotted, the area under these curves is not equal to 1.
Number Concentration vs. Apparent Growth Factor: In these plots, the Apparent Growth Factor is evidently given by ( ) ( ) The "Concentration" parameter is apparently the first-order inverted number distribution function given by where 2 = Qaerosol/Qsheath for DMA 2. This is also seen to be a scaled version of the apparent growth factor frequency distribution as dNapp/dlnDp2 = Nt,2•(dFapp/dlngfapp) where Nt,2 is the total concentration exiting DMA 2. If this is to be compared to the Frequency vs. Growth Factor plot, this would need to be multiplied by lngfapp = lnD1 2 () s z .
For the two plots to be directly comparable, lnD1 2 () s z would have to be a constant.
line 315-316: "… the residual is high is if the true input is a broad growth factor frequency distribution …" lines 332-333: "Errors from scans with low non-zero concentration at the edge of the size distribution propagate back into the inversion at other dry sizes." line 345: "… a cylindrical DMA column (TSI 3080)." Model "3080" does not specify the actual DMA column. Assuming it is the TSI long DMA, this should be specified as either "TSI 3080L" for the whole system or "TSI 3081" for just the column.
lines 386-387: "… with the timestamp closest to the a scan …" Eliminate "a". Lines 505-513: "The inverted dataset … closure (Mahish et al., 2018)." This is a very long run-on sentence. It needs to be broken up into several sentences.
"Best fit" vs "good fit": Though regularization produces what might be considered a best fit solution to the inversion problem, this does not necessarily imply it is a good fit. It would be best to calculate a fit parameter such as the chi square of the normalized residuals over the degrees of freedom. For a good fit, this should be near 1. That is, the residuals are on the order of what is predicted by Poisson statistics. Values an order of magnitude or more greater than that would suggest some sort of problem either with the dataset or the model.