Atmospheric Measurement Techniques An interactive open-access journal of the European Geosciences Union
Volume 7, issue 12
Atmos. Meas. Tech., 7, 4387–4399, 2014
https://doi.org/10.5194/amt-7-4387-2014
© Author(s) 2014. This work is distributed under
the Creative Commons Attribution 3.0 License.
Research article 11 Dec 2014

Regression models tolerant to massively missing data: a case study in solar-radiation nowcasting

I. Žliobaitė1,2, J. Hollmén1,2, and H. Junninen3
  • 1Aalto University, Department of Information and Computer Science, Espoo, Finland
  • 2Helsinki Institute for Information Technology (HIIT), Helsinki, Finland
  • 3Department of Physics, University of Helsinki, Helsinki, Finland

Abstract. Statistical models for environmental monitoring rely strongly on automatic data acquisition systems that use various physical sensors. Sensor readings are often missing for extended periods of time, while model outputs need to be continuously available in real time. With a case study in solar-radiation nowcasting, we investigate how to deal with massively missing data (some data are unavailable around 50% of the time) in such situations. Our goal is to analyze the characteristics of missing data and to recommend a strategy for deploying regression models that are robust to massively missing data. We are after one model that performs well at all times, with and without data gaps. Because outputs must be provided instantaneously and with minimum energy consumption for computing in the data streaming setting, we dismiss computationally demanding data imputation methods and resort to mean replacement, accompanied by a robust regression model. We use an established strategy for assessing different regression models and for determining how many missing sensor readings can be tolerated before model outputs become obsolete. We experimentally analyze the accuracy and robustness to missing data of seven linear regression models. We recommend using regularized PCA regression together with our established guideline for training regression models that are themselves robust to missing data.
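
The combination of mean replacement and a regularized PCA regression described in the abstract can be illustrated with a minimal sketch. It assumes that "regularized PCA regression" means ridge regression on the leading principal components, which may differ from the exact formulation in the paper. The function names `fit_pca_ridge` and `predict` and the parameters `n_components` and `alpha` are hypothetical and chosen for illustration only.

```python
import numpy as np

def fit_pca_ridge(X, y, n_components, alpha):
    """Fit ridge regression on the leading principal components of X.

    Missing readings in X are encoded as NaN and replaced by the
    per-sensor training mean (mean replacement). Illustrative sketch,
    not the authors' implementation.
    """
    mean = np.nanmean(X, axis=0)              # per-sensor mean, ignoring gaps
    std = np.nanstd(X, axis=0)
    std[std == 0] = 1.0                       # guard against constant sensors
    X_filled = np.where(np.isnan(X), mean, X) # mean replacement
    Z = (X_filled - mean) / std               # standardized inputs

    # Principal directions are the right singular vectors of Z.
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    W = Vt[:n_components].T                   # shape (n_sensors, n_components)
    T = Z @ W                                 # component scores

    # Ridge solution on the scores: (T'T + alpha I)^-1 T'(y - mean(y)).
    y_mean = y.mean()
    A = T.T @ T + alpha * np.eye(n_components)
    beta = np.linalg.solve(A, T.T @ (y - y_mean))
    return {"mean": mean, "std": std, "W": W, "beta": beta, "y_mean": y_mean}

def predict(model, X):
    """Predict for rows of X; NaN readings are replaced by training means."""
    X_filled = np.where(np.isnan(X), model["mean"], X)
    Z = (X_filled - model["mean"]) / model["std"]
    return model["y_mean"] + Z @ model["W"] @ model["beta"]
```

In this sketch a mean-replaced sensor contributes zero after standardization, so a row in which every sensor is missing falls back to the training mean of the target. Prediction costs one subtraction, one division, and two small matrix products per row, which fits the requirement for instantaneous, low-cost outputs.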

Short summary
We present a case study in solar-radiation nowcasting using environmental sensor measurements as inputs. While some sensor readings may often be missing, predictions need to be output continuously in near real time. We are after linear regression models that are robust to missing data, i.e., that perform well with or without data gaps. We recommend using a regularized PCA regression with our established guidelines for building robust regression models.
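
One way to check whether a model performs well with or without data gaps is to measure its error as more sensors are replaced by their means. The sketch below is an assumption about how such a check could look, not the assessment strategy used in the paper. The names `error_vs_missing` and `tolerable_missing`, and the use of a naive baseline error as the cut-off, are hypothetical.

```python
import numpy as np

def error_vs_missing(predict_fn, X, y, train_mean, n_trials=20, seed=0):
    """Mean absolute error when k randomly chosen sensors are mean-replaced.

    Returns a list of errors for k = 0, 1, ..., n_sensors. X must be
    complete (no NaN); gaps are simulated by overwriting columns.
    """
    rng = np.random.default_rng(seed)
    n_sensors = X.shape[1]
    errors = []
    for k in range(n_sensors + 1):
        trial_errors = []
        for _ in range(n_trials):
            missing = rng.choice(n_sensors, size=k, replace=False)
            X_gap = X.copy()
            X_gap[:, missing] = train_mean[missing]  # simulate missing sensors
            trial_errors.append(np.mean(np.abs(predict_fn(X_gap) - y)))
        errors.append(float(np.mean(trial_errors)))
    return errors

def tolerable_missing(errors, baseline_error):
    """Largest k whose error, and all errors before it, beat the baseline.

    Returns -1 if even the model without gaps does not beat the baseline.
    """
    k_ok = -1
    for k, err in enumerate(errors):
        if err >= baseline_error:
            break
        k_ok = k
    return k_ok
```

Here `baseline_error` could be, for example, the error of always predicting the training mean of the target; once the model is no better than that, its outputs carry no information from the sensors.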