<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing with OASIS Tables v3.0 20080202//EN" "journalpub-oasis3.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:oasis="http://docs.oasis-open.org/ns/oasis-exchange/table" dtd-version="3.0"><?xmltex \makeatother\@nolinetrue\makeatletter?>
  <front>
    <journal-meta>
<journal-id journal-id-type="publisher">AMT</journal-id>
<journal-title-group>
<journal-title>Atmospheric Measurement Techniques</journal-title>
<abbrev-journal-title abbrev-type="publisher">AMT</abbrev-journal-title>
<abbrev-journal-title abbrev-type="nlm-ta">Atmos. Meas. Tech.</abbrev-journal-title>
</journal-title-group>
<issn pub-type="epub">1867-8548</issn>
<publisher><publisher-name>Copernicus GmbH</publisher-name>
<publisher-loc>Göttingen, Germany</publisher-loc>
</publisher>
</journal-meta>

    <article-meta>
      <article-id pub-id-type="doi">10.5194/amt-7-4387-2014</article-id><title-group><article-title>Regression models tolerant to massively missing data: a case study in solar-radiation nowcasting</article-title>
      </title-group><?xmltex \runningtitle{Regression models tolerant to massively missing data}?><?xmltex \runningauthor{I.~\v{Z}liobait\.{e} et~al.}?>
      <contrib-group>
        <contrib contrib-type="author" corresp="yes" rid="aff1 aff2">
          <name><surname>Žliobaitė</surname><given-names>I.</given-names></name>
          <email>indre.zliobaite@aalto.fi</email>
        </contrib>
        <contrib contrib-type="author" corresp="no" rid="aff1 aff2">
          <name><surname>Hollmén</surname><given-names>J.</given-names></name>
          
        </contrib>
        <contrib contrib-type="author" corresp="no" rid="aff3">
          <name><surname>Junninen</surname><given-names>H.</given-names></name>
          
        <ext-link ext-link-type="uri" xlink:href="https://orcid.org/0000-0001-7178-9430">https://orcid.org/0000-0001-7178-9430</ext-link></contrib>
        <aff id="aff1"><label>1</label><institution>Aalto University, Department of Information and Computer Science, Espoo, Finland</institution>
        </aff>
        <aff id="aff2"><label>2</label><institution>Helsinki Institute for Information Technology (HIIT), Helsinki, Finland</institution>
        </aff>
        <aff id="aff3"><label>3</label><institution>Department of Physics, University of Helsinki, Helsinki, Finland</institution>
        </aff>
      </contrib-group>
      <author-notes><corresp id="corr1">I. Žliobaitė (indre.zliobaite@aalto.fi)</corresp></author-notes><pub-date><day>11</day><month>December</month><year>2014</year></pub-date>
      
      <volume>7</volume>
      <issue>12</issue>
      <fpage>4387</fpage><lpage>4399</lpage>
      <history>
        <date date-type="received"><day>14</day><month>April</month><year>2014</year></date>
           <date date-type="rev-request"><day>16</day><month>July</month><year>2014</year></date>
           <date date-type="rev-recd"><day>6</day><month>November</month><year>2014</year></date>
           <date date-type="accepted"><day>14</day><month>November</month><year>2014</year></date>
           
      </history>
      <permissions>
<license license-type="open-access">
<license-p>This work is licensed under a Creative Commons Attribution 3.0 Unported License. To view a copy of this license, visit <ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/licenses/by/3.0/">http://creativecommons.org/licenses/by/3.0/</ext-link></license-p>
</license>
</permissions>

      <self-uri xlink:href="https://www.atmos-meas-tech.net/7/4387/2014/amt-7-4387-2014.html">This article is available from https://www.atmos-meas-tech.net/7/4387/2014/amt-7-4387-2014.html</self-uri>
<self-uri xlink:href="https://www.atmos-meas-tech.net/7/4387/2014/amt-7-4387-2014.pdf">The full text article is available as a PDF file from https://www.atmos-meas-tech.net/7/4387/2014/amt-7-4387-2014.pdf</self-uri>
<abstract>
    <p>Statistical models for environmental monitoring strongly rely on automatic
data acquisition systems that use various physical sensors. Often, sensor
readings are missing for extended periods of time, while model outputs need
to be continuously available in real time. With a case study in solar-radiation nowcasting, we investigate how to deal with massively missing data
(around 50 % of the time some data are unavailable) in such situations.
Our goal is to analyze characteristics of missing data and recommend
a strategy for deploying regression models which would be robust to missing
data in situations where data are massively missing. We seek one model
that performs well at all times, with and without data gaps. Due to the need
to provide instantaneous outputs with minimal energy consumption for
computing in the data-streaming setting, we dismiss computationally demanding
data-imputation methods and resort to mean replacement, accompanied by a
robust regression model. We use an established strategy for assessing
different regression models and for determining how many missing sensor readings
can be tolerated before model outputs become obsolete. We experimentally
analyze the accuracy and robustness to missing data of seven linear regression
models. We recommend using regularized PCA regression together with our
established guideline for training regression models that are themselves
robust to missing data.</p>
  </abstract>
    </article-meta>
  </front>
<body>
      

      <?xmltex \hack{\newpage}?>
<sec id="Ch1.S1" sec-type="intro">
  <title>Introduction</title>
      <p>Environmental monitoring strongly relies on automatic data acquisition
systems, using various physical sensors. For instance, SMEAR (Station for Measuring Ecosystem–Atmosphere Relations)
stations<fn id="Ch1.Footn1"><p><uri>http://www.atm.helsinki.fi/SMEAR/</uri></p></fn> measure the
relationship between the atmosphere and the forest in the boreal climate zone
<xref ref-type="bibr" rid="bib1.bibx9" id="paren.1"/>. The stations are equipped with an extensive range of
measurement instruments: atmospheric and flux measurements, irradiation and
flux measurements, tree physiology measurements, soil and soil-water
measurements, and solar irradiance. Due to the continuous flux of
measurements, the setup can be analyzed in the context of streaming data
<xref ref-type="bibr" rid="bib1.bibx3 bib1.bibx1" id="paren.2"/>. Streaming-data analysis is different from
the traditional retrospective data analysis, where data are first collected,
cleaned, preprocessed, and then analyzed. Streaming data arrive
continuously and need to be analyzed in real time. Statistical models built
on such streaming data (see, e.g., <xref ref-type="bibr" rid="bib1.bibx19" id="altparen.3"/>, <xref ref-type="bibr" rid="bib1.bibx12" id="altparen.4"/>, and
<xref ref-type="bibr" rid="bib1.bibx21" id="altparen.5"/>) need to operate continuously and provide outputs in real
time.</p>
      <p>Physical sensors are exposed to various risks due to severe environmental
conditions, exposure to physical damage, or battery drainage. Under such
circumstances it is very common to encounter time intervals when readings
from some of the sensors are missing from the database. Many advanced
missing-value imputation schemes have been developed
<xref ref-type="bibr" rid="bib1.bibx14 bib1.bibx2" id="paren.6"/>, primarily targeting offline exploratory data
analysis, where computational resources are practically unlimited, while it
is critical to reconstruct data as accurately as possible. A simple mean
replacement remains popular in regression modeling <xref ref-type="bibr" rid="bib1.bibx16" id="paren.7"/> in
situations where real-time outputs are needed and computational resources and
time are limited, and the input data do not need to be reconstructed
perfectly, as long as model outputs remain correct.</p>
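The mean-replacement strategy described above can be sketched as follows. This is an illustrative outline in Python, not the authors' implementation; all function and variable names are hypothetical, and `None` marks a missing sensor reading.

```python
def update_means(means, counts, reading):
    """Update per-sensor running means with one vector of readings.

    Missing values (None) are simply skipped, so the mean reflects
    only the readings that actually arrived.
    """
    for i, x in enumerate(reading):
        if x is not None:
            counts[i] += 1
            means[i] += (x - means[i]) / counts[i]

def impute(reading, means):
    """Replace missing values with the current running means."""
    return [means[i] if x is None else x for i, x in enumerate(reading)]

def predict(weights, bias, reading):
    """Output of a pre-trained linear regression on an imputed input."""
    return bias + sum(w * x for w, x in zip(weights, reading))
```

In a streaming deployment, `update_means` runs on each arriving record and `impute` fills the gaps before the regression model is applied, so an output is produced at every time step regardless of which sensors are down.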
      <p>The goal of this study is to experimentally analyze the performance and
robustness of linear regression models with regard to massively missing data for
operation in resource-aware settings. We consider situations where data
are massively missing, which means that around 50 % of the time at least
one sensor does not deliver readings and there is no single sensor that
dominates the missing data; data from any sensor can be missing. In such
a situation, readings from input sensors may be missing for extended periods
of time; nevertheless, model outputs need to be produced continuously and
delivered in real time; not producing model outputs when some
data are missing is not an option. We aim at building one regression model
that is robust in performance; i.e., the expected performance is stable, no
matter how many sensor readings are missing.</p>
      <p>We present a case study in solar-radiation nowcasting using meteorological
sensor data as inputs, where multiple sensor failures happen frequently due
to environmental and operational reasons. We analyze the performance of seven
linear regression <?xmltex \hack{\mbox\bgroup}?>models<?xmltex \hack{\egroup}?> coupled with the mean replacement of missing
values and provide recommendations for robust and accurate modeling in such
circumstances. Nowcasting refers to predicting the <italic>current</italic> values
from other measurements and is different from forecasting, which aims at
predicting future values from past values.</p>
      <p>The paper presents a case study in which our earlier published results
<xref ref-type="bibr" rid="bib1.bibx27" id="paren.8"/> are put into practice for solving a solar-radiation
nowcasting task in the context of a SMEAR measurement station
<xref ref-type="bibr" rid="bib1.bibx9" id="paren.9"/>. A reader interested in the theoretical underpinnings of
our approach and a follow-up is advised to refer to studies by
<xref ref-type="bibr" rid="bib1.bibx27 bib1.bibx28" id="text.10"/>; the current paper focuses on practical
implications of the results and demonstrates how regression problems with
lots of missing data can be successfully solved with our recommended scheme.
The results apply to the case of linear regression coupled with the mean
replacement of missing values. We assume that the uncertainty of the sensor
measurements is stable over time when the measurements are available.</p>
      <p>Research attention to solar-radiation nowcasting and short-term forecasting
using statistical-data-driven models is increasing due to the growing popularity
of solar-energy power plants that need solar-radiation estimates for planning.
Research studies mostly focus on searching for a suitable statistical
modeling technique: artificial neural networks <xref ref-type="bibr" rid="bib1.bibx20" id="paren.11"/>,
autoregressive time series models <xref ref-type="bibr" rid="bib1.bibx4" id="paren.12"/>, Markov models
<xref ref-type="bibr" rid="bib1.bibx5" id="paren.13"/>, or optimally integrating different data sources, such
as meteorological variables, ground and remote sensing observations, or
satellite images <xref ref-type="bibr" rid="bib1.bibx8 bib1.bibx25" id="paren.14"/>. We are not aware of any
research work addressing the problem of massively missing values in solar-radiation nowcasting.</p>
      <p>The rest of the paper is organized as follows. Section <xref ref-type="sec" rid="Ch1.S2"/>
describes the SMEAR data used in the case study, the methodology of the
modeling, and the experimental protocol. Section <xref ref-type="sec" rid="Ch1.S3"/> presents
and discusses the results of the case study. Section <xref ref-type="sec" rid="Ch1.S4"/>
summarizes the contributions and concludes the study.</p>
</sec>
<sec id="Ch1.S2">
  <title>Materials and methods</title>
<sec id="Ch1.S2.SS1">
  <title>Data</title>
      <p>We use a data stream recorded at SMEAR II station in Hyytiälä,
Finland <xref ref-type="bibr" rid="bib1.bibx15" id="paren.15"/> (<inline-formula><mml:math display="inline"><mml:mrow><mml:msup><mml:mn>61</mml:mn><mml:mo>∘</mml:mo></mml:msup><mml:msup><mml:mn>50</mml:mn><mml:mo>′</mml:mo></mml:msup><mml:msup><mml:mn>51</mml:mn><mml:mrow><mml:mo>′</mml:mo><mml:mo>′</mml:mo></mml:mrow></mml:msup></mml:mrow></mml:math></inline-formula> N,
<inline-formula><mml:math display="inline"><mml:mrow><mml:msup><mml:mn>24</mml:mn><mml:mo>∘</mml:mo></mml:msup><mml:msup><mml:mn>17</mml:mn><mml:mo>′</mml:mo></mml:msup><mml:msup><mml:mn>41</mml:mn><mml:mrow><mml:mo>′</mml:mo><mml:mo>′</mml:mo></mml:mrow></mml:msup></mml:mrow></mml:math></inline-formula> E; 181 <inline-formula><mml:math display="inline"><mml:mrow><mml:mi mathvariant="normal">m</mml:mi><mml:mspace linebreak="nobreak" width="0.125em"/><mml:mi mathvariant="normal">a</mml:mi><mml:mo>.</mml:mo><mml:mi mathvariant="normal">s</mml:mi><mml:mo>.</mml:mo><mml:mi mathvariant="normal">l</mml:mi><mml:mo>.</mml:mo></mml:mrow></mml:math></inline-formula>),
measuring relationships between the forest ecosystem and atmosphere. We use
data covering a period of 8 years (April 2005–April 2013), recorded
every 30 min from 37 observation sensors. The raw data coming from the
station have on average <inline-formula><mml:math display="inline"><mml:mn mathvariant="normal">7</mml:mn></mml:math></inline-formula> % of missing values. Missing values may occur
due to the occasional failure of measuring sensors, wear and tear, or variations
in electricity power supply. Some data are missing up to 50 % of the
time. No single sensor provides uninterrupted readings
over those 8 years; for any given sensor, from <inline-formula><mml:math display="inline"><mml:mn mathvariant="normal">1</mml:mn></mml:math></inline-formula> % (about 4 days per year)
up to <inline-formula><mml:math display="inline"><mml:mn>25</mml:mn></mml:math></inline-formula> % (3 months per year) of the values are missing.</p>
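Missing-data statistics of this kind can be computed directly from the 30 min record stream. The following is a minimal sketch (names hypothetical), where each record is a vector of sensor readings and `None` marks a missing value:

```python
def missing_stats(rows):
    """Summarize missingness in a list of equal-length reading vectors.

    Returns a pair:
      - fraction of records with at least one missing sensor reading,
      - per-sensor fraction of records in which that sensor is missing.
    """
    n = len(rows)
    k = len(rows[0])
    any_missing = sum(1 for r in rows if any(x is None for x in r)) / n
    per_sensor = [sum(1 for r in rows if r[j] is None) / n
                  for j in range(k)]
    return any_missing, per_sensor
```

The first quantity corresponds to the "around 50 % of the time some data are unavailable" figure; the second corresponds to the per-sensor percentages listed in Table 1.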
      <p>The task is to nowcast the current level of solar radiation from the
meteorological sensor data, given in Table <xref ref-type="table" rid="Ch1.T1"/>. The incoming
radiation to Earth is, to the accuracy we require, constant for a given day
and hour of the year. The only unknown is the absorption by the atmosphere and,
more importantly, by clouds and anthropogenic pollution plumes. Hence,
an interesting variable to infer is the cloudiness or, in other words, the
deviation of the measured radiation from the theoretical maximum. In this
scheme, other meteorological parameters could be used to estimate the
cloudiness, which in turn could be used to calculate the actual radiation, but
the primary variable to nowcast is the difference between the theoretical and
actual radiation.</p>

<?xmltex \floatpos{t}?><table-wrap id="Ch1.T1"><caption><p>Sensors for the case study:
SWS – surface wetness sensor; <inline-formula><mml:math display="inline"><mml:mi>P</mml:mi></mml:math></inline-formula> – pressure; <inline-formula><mml:math display="inline"><mml:mi>T</mml:mi></mml:math></inline-formula> – temperature; WS – wind speed; WD – wind
direction; RH – relative humidity; RH Td – relative humidity calculated using dew point; PTG – potential temperature gradient; Vis – visibility.</p></caption><oasis:table frame="topbot"><oasis:tgroup cols="4">
     <oasis:colspec colnum="1" colname="col1" align="center"/>
     <oasis:colspec colnum="2" colname="col2" align="left"/>
     <oasis:colspec colnum="3" colname="col3" align="right"/>
     <oasis:colspec colnum="4" colname="col4" align="right"/>
     <oasis:thead>
       <oasis:row rowsep="1">  
         <oasis:entry colname="col1">Index</oasis:entry>  
         <oasis:entry colname="col2">Measurement</oasis:entry>  
         <oasis:entry colname="col3">Height</oasis:entry>  
         <oasis:entry colname="col4">Missing values</oasis:entry>
       </oasis:row>
     </oasis:thead>
     <oasis:tbody>
       <oasis:row>  
         <oasis:entry colname="col1">1</oasis:entry>  
         <oasis:entry colname="col2">Rain</oasis:entry>  
         <oasis:entry colname="col3">18.0 <inline-formula><mml:math display="inline"><mml:mi mathvariant="normal">m</mml:mi></mml:math></inline-formula></oasis:entry>  
         <oasis:entry colname="col4">1 %</oasis:entry>
       </oasis:row>
       <oasis:row>  
         <oasis:entry colname="col1">2</oasis:entry>  
         <oasis:entry colname="col2">SWS</oasis:entry>  
         <oasis:entry colname="col3">18.0 <inline-formula><mml:math display="inline"><mml:mi mathvariant="normal">m</mml:mi></mml:math></inline-formula></oasis:entry>  
         <oasis:entry colname="col4">1 %</oasis:entry>
       </oasis:row>
       <oasis:row>  
         <oasis:entry colname="col1">3</oasis:entry>  
         <oasis:entry colname="col2">Dew point</oasis:entry>  
         <oasis:entry colname="col3">18.0 <inline-formula><mml:math display="inline"><mml:mi mathvariant="normal">m</mml:mi></mml:math></inline-formula></oasis:entry>  
         <oasis:entry colname="col4">18 %</oasis:entry>
       </oasis:row>
       <oasis:row>  
         <oasis:entry colname="col1">4</oasis:entry>  
         <oasis:entry colname="col2"><inline-formula><mml:math display="inline"><mml:mi>P</mml:mi></mml:math></inline-formula></oasis:entry>  
         <oasis:entry colname="col3">0.0 <inline-formula><mml:math display="inline"><mml:mi mathvariant="normal">m</mml:mi></mml:math></inline-formula></oasis:entry>  
         <oasis:entry colname="col4">2 %</oasis:entry>
       </oasis:row>
       <oasis:row>  
         <oasis:entry colname="col1">5</oasis:entry>  
         <oasis:entry colname="col2"><inline-formula><mml:math display="inline"><mml:mi>T</mml:mi></mml:math></inline-formula></oasis:entry>  
         <oasis:entry colname="col3">4.2 <inline-formula><mml:math display="inline"><mml:mi mathvariant="normal">m</mml:mi></mml:math></inline-formula></oasis:entry>  
         <oasis:entry colname="col4">16 %</oasis:entry>
       </oasis:row>
       <oasis:row>  
         <oasis:entry colname="col1">6</oasis:entry>  
         <oasis:entry colname="col2"><inline-formula><mml:math display="inline"><mml:mi>T</mml:mi></mml:math></inline-formula></oasis:entry>  
         <oasis:entry colname="col3">8.4 <inline-formula><mml:math display="inline"><mml:mi mathvariant="normal">m</mml:mi></mml:math></inline-formula></oasis:entry>  
         <oasis:entry colname="col4">3 %</oasis:entry>
       </oasis:row>
       <oasis:row>  
         <oasis:entry colname="col1">7</oasis:entry>  
         <oasis:entry colname="col2"><inline-formula><mml:math display="inline"><mml:mi>T</mml:mi></mml:math></inline-formula></oasis:entry>  
         <oasis:entry colname="col3">16.8 <inline-formula><mml:math display="inline"><mml:mi mathvariant="normal">m</mml:mi></mml:math></inline-formula></oasis:entry>  
         <oasis:entry colname="col4">2 %</oasis:entry>
       </oasis:row>
       <oasis:row>  
         <oasis:entry colname="col1">8</oasis:entry>  
         <oasis:entry colname="col2"><inline-formula><mml:math display="inline"><mml:mi>T</mml:mi></mml:math></inline-formula></oasis:entry>  
         <oasis:entry colname="col3">33.6 <inline-formula><mml:math display="inline"><mml:mi mathvariant="normal">m</mml:mi></mml:math></inline-formula></oasis:entry>  
         <oasis:entry colname="col4">2 %</oasis:entry>
       </oasis:row>
       <oasis:row>  
         <oasis:entry colname="col1">9</oasis:entry>  
         <oasis:entry colname="col2"><inline-formula><mml:math display="inline"><mml:mi>T</mml:mi></mml:math></inline-formula></oasis:entry>  
         <oasis:entry colname="col3">50.4 <inline-formula><mml:math display="inline"><mml:mi mathvariant="normal">m</mml:mi></mml:math></inline-formula></oasis:entry>  
         <oasis:entry colname="col4">2 %</oasis:entry>
       </oasis:row>
       <oasis:row>  
         <oasis:entry colname="col1">10</oasis:entry>  
         <oasis:entry colname="col2"><inline-formula><mml:math display="inline"><mml:mi>T</mml:mi></mml:math></inline-formula></oasis:entry>  
         <oasis:entry colname="col3">67.2 <inline-formula><mml:math display="inline"><mml:mi mathvariant="normal">m</mml:mi></mml:math></inline-formula></oasis:entry>  
         <oasis:entry colname="col4">2 %</oasis:entry>
       </oasis:row>
       <oasis:row>  
         <oasis:entry colname="col1">11</oasis:entry>  
         <oasis:entry colname="col2">WS</oasis:entry>  
         <oasis:entry colname="col3">33.6 <inline-formula><mml:math display="inline"><mml:mi mathvariant="normal">m</mml:mi></mml:math></inline-formula></oasis:entry>  
         <oasis:entry colname="col4">9 %</oasis:entry>
       </oasis:row>
       <oasis:row>  
         <oasis:entry colname="col1">12</oasis:entry>  
         <oasis:entry colname="col2">WS</oasis:entry>  
         <oasis:entry colname="col3">8.4 <inline-formula><mml:math display="inline"><mml:mi mathvariant="normal">m</mml:mi></mml:math></inline-formula></oasis:entry>  
         <oasis:entry colname="col4">5 %</oasis:entry>
       </oasis:row>
       <oasis:row>  
         <oasis:entry colname="col1">13</oasis:entry>  
         <oasis:entry colname="col2">WS</oasis:entry>  
         <oasis:entry colname="col3">16.8 <inline-formula><mml:math display="inline"><mml:mi mathvariant="normal">m</mml:mi></mml:math></inline-formula></oasis:entry>  
         <oasis:entry colname="col4">3 %</oasis:entry>
       </oasis:row>
       <oasis:row>  
         <oasis:entry colname="col1">14</oasis:entry>  
         <oasis:entry colname="col2">WS</oasis:entry>  
         <oasis:entry colname="col3">33.6 <inline-formula><mml:math display="inline"><mml:mi mathvariant="normal">m</mml:mi></mml:math></inline-formula></oasis:entry>  
         <oasis:entry colname="col4">9 %</oasis:entry>
       </oasis:row>
       <oasis:row>  
         <oasis:entry colname="col1">15</oasis:entry>  
         <oasis:entry colname="col2">WS</oasis:entry>  
         <oasis:entry colname="col3">74.0 <inline-formula><mml:math display="inline"><mml:mi mathvariant="normal">m</mml:mi></mml:math></inline-formula></oasis:entry>  
         <oasis:entry colname="col4">25 %</oasis:entry>
       </oasis:row>
       <oasis:row>  
         <oasis:entry colname="col1">16</oasis:entry>  
         <oasis:entry colname="col2">WD avr</oasis:entry>  
         <oasis:entry colname="col3"/>  
         <oasis:entry colname="col4">2 %</oasis:entry>
       </oasis:row>
       <oasis:row>  
         <oasis:entry colname="col1">17</oasis:entry>  
         <oasis:entry colname="col2">WD ultrasonic</oasis:entry>  
         <oasis:entry colname="col3">8.4 <inline-formula><mml:math display="inline"><mml:mi mathvariant="normal">m</mml:mi></mml:math></inline-formula></oasis:entry>  
         <oasis:entry colname="col4">7 %</oasis:entry>
       </oasis:row>
       <oasis:row>  
         <oasis:entry colname="col1">18</oasis:entry>  
         <oasis:entry colname="col2">WD ultrasonic</oasis:entry>  
         <oasis:entry colname="col3">16.8 <inline-formula><mml:math display="inline"><mml:mi mathvariant="normal">m</mml:mi></mml:math></inline-formula></oasis:entry>  
         <oasis:entry colname="col4">4 %</oasis:entry>
       </oasis:row>
       <oasis:row>  
         <oasis:entry colname="col1">19</oasis:entry>  
         <oasis:entry colname="col2">WD ultrasonic</oasis:entry>  
         <oasis:entry colname="col3">33.6 <inline-formula><mml:math display="inline"><mml:mi mathvariant="normal">m</mml:mi></mml:math></inline-formula></oasis:entry>  
         <oasis:entry colname="col4">9 %</oasis:entry>
       </oasis:row>
       <oasis:row>  
         <oasis:entry colname="col1">20</oasis:entry>  
         <oasis:entry colname="col2">WD ultrasonic</oasis:entry>  
         <oasis:entry colname="col3">74.0 <inline-formula><mml:math display="inline"><mml:mi mathvariant="normal">m</mml:mi></mml:math></inline-formula></oasis:entry>  
         <oasis:entry colname="col4">23 %</oasis:entry>
       </oasis:row>
       <oasis:row>  
         <oasis:entry colname="col1">21</oasis:entry>  
         <oasis:entry colname="col2">RH</oasis:entry>  
         <oasis:entry colname="col3">4.2 <inline-formula><mml:math display="inline"><mml:mi mathvariant="normal">m</mml:mi></mml:math></inline-formula></oasis:entry>  
         <oasis:entry colname="col4">21 %</oasis:entry>
       </oasis:row>
       <oasis:row>  
         <oasis:entry colname="col1">22</oasis:entry>  
         <oasis:entry colname="col2">RH</oasis:entry>  
         <oasis:entry colname="col3">8.4 <inline-formula><mml:math display="inline"><mml:mi mathvariant="normal">m</mml:mi></mml:math></inline-formula></oasis:entry>  
         <oasis:entry colname="col4">9 %</oasis:entry>
       </oasis:row>
       <oasis:row>  
         <oasis:entry colname="col1">23</oasis:entry>  
         <oasis:entry colname="col2">RH</oasis:entry>  
         <oasis:entry colname="col3">16.8 <inline-formula><mml:math display="inline"><mml:mi mathvariant="normal">m</mml:mi></mml:math></inline-formula></oasis:entry>  
         <oasis:entry colname="col4">7 %</oasis:entry>
       </oasis:row>
       <oasis:row>  
         <oasis:entry colname="col1">24</oasis:entry>  
         <oasis:entry colname="col2">RH</oasis:entry>  
         <oasis:entry colname="col3">33.6 <inline-formula><mml:math display="inline"><mml:mi mathvariant="normal">m</mml:mi></mml:math></inline-formula></oasis:entry>  
         <oasis:entry colname="col4">7 %</oasis:entry>
       </oasis:row>
       <oasis:row>  
         <oasis:entry colname="col1">25</oasis:entry>  
         <oasis:entry colname="col2">RH</oasis:entry>  
         <oasis:entry colname="col3">50.4 <inline-formula><mml:math display="inline"><mml:mi mathvariant="normal">m</mml:mi></mml:math></inline-formula></oasis:entry>  
         <oasis:entry colname="col4">9 %</oasis:entry>
       </oasis:row>
       <oasis:row>  
         <oasis:entry colname="col1">26</oasis:entry>  
         <oasis:entry colname="col2">RH</oasis:entry>  
         <oasis:entry colname="col3">67.2 <inline-formula><mml:math display="inline"><mml:mi mathvariant="normal">m</mml:mi></mml:math></inline-formula></oasis:entry>  
         <oasis:entry colname="col4">6 %</oasis:entry>
       </oasis:row>
       <oasis:row>  
         <oasis:entry colname="col1">27</oasis:entry>  
         <oasis:entry colname="col2">RH Td</oasis:entry>  
         <oasis:entry colname="col3">18.0 <inline-formula><mml:math display="inline"><mml:mi mathvariant="normal">m</mml:mi></mml:math></inline-formula></oasis:entry>  
         <oasis:entry colname="col4">20 %</oasis:entry>
       </oasis:row>
       <oasis:row>  
         <oasis:entry colname="col1">28</oasis:entry>  
         <oasis:entry colname="col2">PTG</oasis:entry>  
         <oasis:entry colname="col3"/>  
         <oasis:entry colname="col4">5 %</oasis:entry>
       </oasis:row>
       <oasis:row>  
         <oasis:entry colname="col1">29</oasis:entry>  
         <oasis:entry colname="col2">Visibility</oasis:entry>  
         <oasis:entry colname="col3">18.0 <inline-formula><mml:math display="inline"><mml:mi mathvariant="normal">m</mml:mi></mml:math></inline-formula></oasis:entry>  
         <oasis:entry colname="col4">1 %</oasis:entry>
       </oasis:row>
       <oasis:row>  
         <oasis:entry colname="col1">30</oasis:entry>  
         <oasis:entry colname="col2">Vis-min</oasis:entry>  
         <oasis:entry colname="col3">18.0 <inline-formula><mml:math display="inline"><mml:mi mathvariant="normal">m</mml:mi></mml:math></inline-formula></oasis:entry>  
         <oasis:entry colname="col4">1 %</oasis:entry>
       </oasis:row>
       <oasis:row>  
         <oasis:entry colname="col1">31</oasis:entry>  
         <oasis:entry colname="col2">Vis-max</oasis:entry>  
         <oasis:entry colname="col3">18.0 <inline-formula><mml:math display="inline"><mml:mi mathvariant="normal">m</mml:mi></mml:math></inline-formula></oasis:entry>  
         <oasis:entry colname="col4">1 %</oasis:entry>
       </oasis:row>
       <oasis:row>  
         <oasis:entry colname="col1">32</oasis:entry>  
         <oasis:entry colname="col2">Precipitation intensity</oasis:entry>  
         <oasis:entry colname="col3">18.0 <inline-formula><mml:math display="inline"><mml:mi mathvariant="normal">m</mml:mi></mml:math></inline-formula></oasis:entry>  
         <oasis:entry colname="col4">1 %</oasis:entry>
       </oasis:row>
       <oasis:row>  
         <oasis:entry colname="col1">33</oasis:entry>  
         <oasis:entry colname="col2">Preci-min</oasis:entry>  
         <oasis:entry colname="col3">18.0 <inline-formula><mml:math display="inline"><mml:mi mathvariant="normal">m</mml:mi></mml:math></inline-formula></oasis:entry>  
         <oasis:entry colname="col4">1 %</oasis:entry>
       </oasis:row>
       <oasis:row>  
         <oasis:entry colname="col1">34</oasis:entry>  
         <oasis:entry colname="col2">Preci-max</oasis:entry>  
         <oasis:entry colname="col3">18.0 <inline-formula><mml:math display="inline"><mml:mi mathvariant="normal">m</mml:mi></mml:math></inline-formula></oasis:entry>  
         <oasis:entry colname="col4">1 %</oasis:entry>
       </oasis:row>
       <oasis:row>  
         <oasis:entry colname="col1">35</oasis:entry>  
         <oasis:entry colname="col2">Precipitation</oasis:entry>  
         <oasis:entry colname="col3">18.0 <inline-formula><mml:math display="inline"><mml:mi mathvariant="normal">m</mml:mi></mml:math></inline-formula></oasis:entry>  
         <oasis:entry colname="col4">1 %</oasis:entry>
       </oasis:row>
       <oasis:row>  
         <oasis:entry colname="col1">36</oasis:entry>  
         <oasis:entry colname="col2">Snowfall</oasis:entry>  
         <oasis:entry colname="col3">18.0 <inline-formula><mml:math display="inline"><mml:mi mathvariant="normal">m</mml:mi></mml:math></inline-formula></oasis:entry>  
         <oasis:entry colname="col4">1 %</oasis:entry>
       </oasis:row>
       <oasis:row>  
         <oasis:entry colname="col1">37</oasis:entry>  
         <oasis:entry colname="col2">Global RADIATION</oasis:entry>  
         <oasis:entry colname="col3">18.0 <inline-formula><mml:math display="inline"><mml:mi mathvariant="normal">m</mml:mi></mml:math></inline-formula></oasis:entry>  
         <oasis:entry colname="col4">1 %</oasis:entry>
       </oasis:row>
     </oasis:tbody>
   </oasis:tgroup></oasis:table></table-wrap>

      <p>This nowcasting task is relevant to stations where no radiation
measurements are available. The SMEAR II station, from which the input data
originate, does measure solar radiation; hence, the true values are
available to us for evaluation purposes. However, instrumentation for
measuring solar radiation is not always present. Small meteorological
observation stations may not be able to measure solar radiation, but it
may still be of interest to nowcast radiation from the meteorological data that
are available anyway.</p>
      <p>In this study our target variable is defined as the ratio of the actual
radiation to the theoretical maximum radiation. This gives a value
between <inline-formula><mml:math display="inline"><mml:mn mathvariant="normal">0</mml:mn></mml:math></inline-formula> and <inline-formula><mml:math display="inline"><mml:mn>100</mml:mn></mml:math></inline-formula> %, where <inline-formula><mml:math display="inline"><mml:mn>100</mml:mn></mml:math></inline-formula> % indicates that all the
theoretically possible radiation is actually incoming. The sensor Global
RADIATION (Table <xref ref-type="table" rid="Ch1.T1"/>) is not used as an input into the nowcasting
model; it is only used for evaluating the nowcasting accuracy. It indicates
the actual radiation and is used in forming the target variable.</p>
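The target variable defined above can be sketched as a simple ratio. This hypothetical helper assumes non-negative radiation values and clips the result to the stated 0–100 % range; at night, when the theoretical maximum is zero, the ratio is taken as zero:

```python
def radiation_ratio(measured, theoretical_max):
    """Target variable: measured global radiation as a percentage of
    the theoretical maximum radiation, clipped to [0, 100]."""
    if theoretical_max <= 0.0:   # night-time: no theoretical radiation
        return 0.0
    return max(0.0, min(100.0, 100.0 * measured / theoretical_max))
```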
      <p>The theoretical maximum radiation is calculated using MIDC (Measurement and Instrumentation Data Center)
SOLPOS (Solar Position and Intensity)
Calculator<fn id="Ch1.Footn2"><p><uri>http://www.nrel.gov/midc/solpos/solpos.html</uri></p></fn>.
SOLPOS is a computational tool that calculates the apparent solar position
and intensity (theoretical maximum solar energy) based on the date, time, and
location on Earth. The tool is developed and maintained by the National
Renewable Energy Laboratory, which is operated for the US Department of
Energy by the Alliance for Sustainable Energy. The calculations are based on
established models for solar position, reported in <xref ref-type="bibr" rid="bib1.bibx22" id="text.16"/> and
other sources.</p>
      <p>The following input parameters were used: lat – <inline-formula><mml:math display="inline"><mml:mn>61.8475</mml:mn></mml:math></inline-formula>; long – <inline-formula><mml:math display="inline"><mml:mn>24.29472</mml:mn></mml:math></inline-formula>;
time zone – <inline-formula><mml:math display="inline"><mml:mn mathvariant="normal">2</mml:mn></mml:math></inline-formula> (location parameters); surface pressure – 990 <inline-formula><mml:math display="inline"><mml:mi mathvariant="normal">mbar</mml:mi></mml:math></inline-formula>;
ambient dry-bulb temperature – 3 <inline-formula><mml:math display="inline"><mml:mrow><mml:msup><mml:mi/><mml:mo>∘</mml:mo></mml:msup><mml:mi mathvariant="normal">C</mml:mi></mml:mrow></mml:math></inline-formula>; azimuth of panel surface – 180<inline-formula><mml:math display="inline"><mml:msup><mml:mi/><mml:mo>∘</mml:mo></mml:msup></mml:math></inline-formula>; degrees of tilt from horizontal of panel – <inline-formula><mml:math display="inline"><mml:mn mathvariant="normal">0</mml:mn></mml:math></inline-formula>; solar irradiance
constant – 1360.8 <inline-formula><mml:math display="inline"><mml:mrow><mml:mi mathvariant="normal">W</mml:mi><mml:mspace width="0.125em" linebreak="nobreak"/><mml:msup><mml:mi mathvariant="normal">m</mml:mi><mml:mrow><mml:mo>-</mml:mo><mml:mn mathvariant="normal">2</mml:mn></mml:mrow></mml:msup></mml:mrow></mml:math></inline-formula> <xref ref-type="bibr" rid="bib1.bibx17" id="paren.17"/>; shadow-band width –
<inline-formula><mml:math display="inline"><mml:mn>7.6</mml:mn></mml:math></inline-formula> <inline-formula><mml:math display="inline"><mml:mi mathvariant="normal">cm</mml:mi></mml:math></inline-formula>; shadow-band radius – <inline-formula><mml:math display="inline"><mml:mn>31.7</mml:mn></mml:math></inline-formula> <inline-formula><mml:math display="inline"><mml:mi mathvariant="normal">cm</mml:mi></mml:math></inline-formula>; shadow-band sky
factor – <inline-formula><mml:math display="inline"><mml:mn>0.04</mml:mn></mml:math></inline-formula>; interval of a measurement period – <inline-formula><mml:math display="inline"><mml:mn mathvariant="normal">0</mml:mn></mml:math></inline-formula> <inline-formula><mml:math display="inline"><mml:mi mathvariant="normal">s</mml:mi></mml:math></inline-formula>.</p>
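For readers without access to SOLPOS, the order of magnitude of the theoretical maximum can be reproduced with textbook solar-geometry formulas (Cooper's declination approximation and the hour angle). This is only a rough illustrative stand-in: SOLPOS implements considerably more refined position and intensity models, and the sketch below ignores atmospheric refraction and the Earth–Sun distance correction.

```python
import math

S0 = 1360.8  # solar irradiance constant used above, in W m^-2

def max_horizontal_irradiance(day_of_year, solar_time_h, lat_deg=61.8475):
    """Rough theoretical maximum on a horizontal surface: S0 * cos(zenith)."""
    # Cooper's approximation of the solar declination (radians)
    decl = math.radians(23.45) * math.sin(2.0 * math.pi * (284 + day_of_year) / 365.0)
    hour_angle = math.radians(15.0 * (solar_time_h - 12.0))  # 15 deg per hour
    lat = math.radians(lat_deg)
    cos_zenith = (math.sin(lat) * math.sin(decl)
                  + math.cos(lat) * math.cos(decl) * math.cos(hour_angle))
    return S0 * max(cos_zenith, 0.0)  # zero when the Sun is below the horizon
```

At the SMEAR II latitude this yields on the order of 1000 W m<sup>-2</sup> at midsummer noon and zero throughout a midwinter night.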
      <p>Often sensor readings are correlated with each other.
Figure <xref ref-type="fig" rid="Ch1.F1"/> visualizes the pairwise correlations
computed over non-missing data.  We see distinct blocks of positive
and negative correlations. For instance, relative humidity (RH) is
negatively correlated with temperature (<inline-formula><mml:math display="inline"><mml:mi>T</mml:mi></mml:math></inline-formula>).</p>

      <?xmltex \floatpos{t}?><fig id="Ch1.F1"><caption><p>Correlations between input sensors.</p></caption>
          <?xmltex \igopts{width=236.157874pt}?><graphic xlink:href="https://www.atmos-meas-tech.net/7/4387/2014/amt-7-4387-2014-f01.pdf"/>

        </fig>

</sec>
<sec id="Ch1.S2.SS2">
  <title>Prerequisites</title>
<sec id="Ch1.S2.SS2.SSS1">
  <title>Setting</title>
      <p>Suppose we have <inline-formula><mml:math display="inline"><mml:mi>r</mml:mi></mml:math></inline-formula> sources generating streaming data (e.g., weather
observation sensors). Data are recorded in multidimensional vectors <inline-formula><mml:math display="inline"><mml:mrow><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo>∈</mml:mo><mml:msup><mml:mi mathvariant="double-struck">R</mml:mi><mml:mi>r</mml:mi></mml:msup></mml:mrow></mml:math></inline-formula>. Our task is to nowcast the target variable <inline-formula><mml:math display="inline"><mml:mrow><mml:mi>y</mml:mi><mml:mo>∈</mml:mo><mml:msup><mml:mi mathvariant="double-struck">R</mml:mi><mml:mn mathvariant="normal">1</mml:mn></mml:msup></mml:mrow></mml:math></inline-formula> (e.g., solar radiation) using these sensor readings as inputs.
The regression model is then <inline-formula><mml:math display="inline"><mml:mrow><mml:mi>y</mml:mi><mml:mo>=</mml:mo><mml:mi>f</mml:mi><mml:mo>(</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo>)</mml:mo><mml:mo>=</mml:mo><mml:mi>f</mml:mi><mml:mo>(</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mn mathvariant="normal">1</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:mi mathvariant="normal">…</mml:mi><mml:mo>,</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mi>r</mml:mi></mml:msub><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula>, and
the corresponding learning task is to approximate function <inline-formula><mml:math display="inline"><mml:mi>f</mml:mi></mml:math></inline-formula> from the
available input–output data. It is important to note that we do not make use
of temporal information of the variables; that is, we predict the value of
the output <inline-formula><mml:math display="inline"><mml:mi>y</mml:mi></mml:math></inline-formula> at time <inline-formula><mml:math display="inline"><mml:mi>t</mml:mi></mml:math></inline-formula>, with the sensor readings available at the same
time point <inline-formula><mml:math display="inline"><mml:mi>t</mml:mi></mml:math></inline-formula>; hence, the task is referred to as nowcasting. With the time
index in place, the regression model is <inline-formula><mml:math display="inline"><mml:mrow><mml:msup><mml:mi>y</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mi>t</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:mi>f</mml:mi><mml:mo>(</mml:mo><mml:msup><mml:msub><mml:mi>x</mml:mi><mml:mn mathvariant="normal">1</mml:mn></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:mi>t</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:msup><mml:mo>,</mml:mo><mml:mi mathvariant="normal">…</mml:mi><mml:mo>,</mml:mo><mml:msup><mml:msub><mml:mi>x</mml:mi><mml:mi>r</mml:mi></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:mi>t</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:msup><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula>. In the rest of the paper, we omit the time index <inline-formula><mml:math display="inline"><mml:mi>t</mml:mi></mml:math></inline-formula>. For
the identities of the sensors used in the case study (<inline-formula><mml:math display="inline"><mml:mrow><mml:mi>r</mml:mi><mml:mo>=</mml:mo><mml:mn>36</mml:mn></mml:mrow></mml:math></inline-formula>), see
Table <xref ref-type="table" rid="Ch1.T1"/>.</p>
      <p>Data arrive in real time, and nowcasting outputs need to be delivered as soon
as possible, in nearly real time. The nowcasting performance should be stable
in the sense that the expected loss in accuracy due to possible missing values
should be minimal. Bearing in mind that environmental monitoring sensors often operate on batteries or autonomous power sources,
the computational resources consumed for data processing, including missing-value
imputation, should be minimal. We are after one model that performs well at
all times, with and without data gaps.</p>
</sec>
<sec id="Ch1.S2.SS2.SSS2">
  <title>Imputation of missing data</title>
      <p>We assume that, when a sensor fails, missing values are automatically replaced
with the <italic>mean</italic> values, which remains a popular approach in
practice due to its simplicity and low user cost
<xref ref-type="bibr" rid="bib1.bibx6 bib1.bibx16 bib1.bibx7" id="paren.18"/>. To keep the focus of the paper on regression models tolerant to
massively missing data, we also assume that there is no need to implement any data-driven missing-value detectors; the system
knows when a value is missing.</p>
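Mean substitution as assumed here can be sketched as follows (a minimal illustration; the names are ours). Note that if the inputs have already been standardized to zero mean, substituting the stored means amounts to inserting zeros.

```python
import numpy as np

def fit_column_means(X_train):
    """Estimate per-sensor means once, from training data (NaN-aware)."""
    return np.nanmean(X_train, axis=0)

def impute_with_means(x, means):
    """Replace missing sensor readings (encoded as NaN) by the stored means;
    the system is assumed to know which values are missing."""
    x = np.array(x, dtype=float)
    missing = np.isnan(x)
    x[missing] = means[missing]
    return x
```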
      <p>In this study, we do not explore alternative imputation methods, for two
reasons. Firstly, our main goal is to investigate the robustness of
regression models to missing data rather than to select the best imputation
scheme. Secondly, advanced model-based imputation methods such as linear
interpolation, nearest neighbor imputation, and self-organizing map or multilayer
perceptron methods <xref ref-type="bibr" rid="bib1.bibx14" id="paren.19"/> typically are more accurate when the
amount of missing data is small, but they lose their advantage when long
missing-data gaps are expected. Multiple imputation methods <xref ref-type="bibr" rid="bib1.bibx14" id="paren.20"/>
bear relatively high computational costs; while they are viable for one-off
imputation operations, they are not very suitable for continuous online
operation and imputation in real time. More importantly, such methods
implicitly or explicitly assume that data are missing at random; i.e.,
a sensor value being missing is independent both of the observable variables and
of the unobservable parameters of interest. In reality this assumption may often be violated, for instance by sensors switching themselves off at low temperatures.</p>
      <p>Bayesian approaches <xref ref-type="bibr" rid="bib1.bibx18 bib1.bibx23" id="paren.21"/> present an interesting
alternative for learning from incomplete data, but the goals and the task are
somewhat different from what we are solving. In our setting, training data
are abundant, and an initial model can be built from a subset that has no
missing values. Bayesian nets can inherently learn from data with missing
values, but once a model is ready, it does not seem to have any special
mechanism for making predictions from incomplete data. In this case a
Bayesian net would require an extra missing-value imputation approach, just
like a linear regression.</p>
      <p>One could create imputation models using knowledge about physical
relationships between variables. However, when a lot of data is missing, such
an approach would encounter a combinatorial explosion. One model would need
to be available for each combination of missing variables, which requires
building and maintaining <inline-formula><mml:math display="inline"><mml:mrow><mml:msup><mml:mn mathvariant="normal">2</mml:mn><mml:mi>r</mml:mi></mml:msup></mml:mrow></mml:math></inline-formula> models, where <inline-formula><mml:math display="inline"><mml:mi>r</mml:mi></mml:math></inline-formula> is the number of input
features.</p>
</sec>
<sec id="Ch1.S2.SS2.SSS3">
  <title>Performance indicators</title>
      <p>We use the nowcasting error as the main measure of performance, which is
computed on a subset of data that was not used for parameter estimation
<xref ref-type="bibr" rid="bib1.bibx10" id="paren.22"/>. The mean squared error (MSE) is a popular measure to
quantify the discrepancy between the true target value <inline-formula><mml:math display="inline"><mml:mi>y</mml:mi></mml:math></inline-formula> and the value
output by the model, <inline-formula><mml:math display="inline"><mml:mover accent="true"><mml:mi>y</mml:mi><mml:mo stretchy="false" mathvariant="normal">^</mml:mo></mml:mover></mml:math></inline-formula>. MSE punishes large deviations from the true
values; this is relevant for environmental monitoring applications, where
large errors are to be avoided. For practical interpretability, RMSE is often
used, which is the square root of MSE. RMSE reports the error in the same
units as the target variable. For a test data set of size <inline-formula><mml:math display="inline"><mml:mi>n</mml:mi></mml:math></inline-formula>, MSE and RMSE
are computed as

                  <disp-formula id="Ch1.E1" content-type="numbered"><mml:math display="block"><mml:mrow><mml:mtext>MSE</mml:mtext><mml:mo>=</mml:mo><mml:mfrac><mml:mn mathvariant="normal">1</mml:mn><mml:mi>n</mml:mi></mml:mfrac><mml:munderover><mml:mo movablelimits="false">∑</mml:mo><mml:mrow><mml:mi>l</mml:mi><mml:mo>=</mml:mo><mml:mn mathvariant="normal">1</mml:mn></mml:mrow><mml:mi>n</mml:mi></mml:munderover><mml:mo>(</mml:mo><mml:msup><mml:mover accent="true"><mml:mi>y</mml:mi><mml:mo mathvariant="normal" stretchy="false">^</mml:mo></mml:mover><mml:mrow><mml:mo>(</mml:mo><mml:mi>l</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:msup><mml:mo>-</mml:mo><mml:msup><mml:mi>y</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mi>l</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:msup><mml:msup><mml:mo>)</mml:mo><mml:mn mathvariant="normal">2</mml:mn></mml:msup><mml:mo>,</mml:mo><mml:mtext>RMSE</mml:mtext><mml:mo>=</mml:mo><mml:msqrt><mml:mtext>MSE</mml:mtext></mml:msqrt><mml:mo>,</mml:mo></mml:mrow></mml:math></disp-formula>

            where <inline-formula><mml:math display="inline"><mml:mrow><mml:msup><mml:mi>y</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mi>l</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:msup></mml:mrow></mml:math></inline-formula> is the true target value of the <inline-formula><mml:math display="inline"><mml:mi>l</mml:mi></mml:math></inline-formula>th sample and
<inline-formula><mml:math display="inline"><mml:mrow><mml:msup><mml:mover accent="true"><mml:mi>y</mml:mi><mml:mo stretchy="false" mathvariant="normal">^</mml:mo></mml:mover><mml:mrow><mml:mo>(</mml:mo><mml:mi>l</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:msup></mml:mrow></mml:math></inline-formula> is the corresponding model output. In the experiments we
report RMSE, which can be interpreted as an average deviation of model
outputs from the true target values.</p>
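Equation (1) is a direct transcription into code:

```python
import numpy as np

def mse(y_true, y_hat):
    """Mean squared error over a test set of size n (Eq. 1)."""
    y_true = np.asarray(y_true, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    return float(np.mean((y_hat - y_true) ** 2))

def rmse(y_true, y_hat):
    """Root mean squared error: the error in the units of the target."""
    return float(np.sqrt(mse(y_true, y_hat)))
```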
</sec>
</sec>
<sec id="Ch1.S2.SS3">
  <title>Computational methods</title>
<sec id="Ch1.S2.SS3.SSSx1" specific-use="unnumbered">
  <title>Linear regression model</title>
      <p>For nowcasting we adopt linear regression models, which assume that the
relationship between <inline-formula><mml:math display="inline"><mml:mi>r</mml:mi></mml:math></inline-formula> input variables <inline-formula><mml:math display="inline"><mml:mrow><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mo>=</mml:mo><mml:mo>(</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mn mathvariant="normal">1</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:mi mathvariant="normal">…</mml:mi><mml:mo>,</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mi>r</mml:mi></mml:msub><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula> and
the target variable <inline-formula><mml:math display="inline"><mml:mi>y</mml:mi></mml:math></inline-formula> is linear. Without loss of generality we assume that
the input data are standardized before modeling to have zero mean and unit
standard deviation<fn id="Ch1.Footn3"><p>For standardization we need to estimate the data
mean <inline-formula><mml:math display="inline"><mml:mi>m</mml:mi></mml:math></inline-formula> and the standard deviation <inline-formula><mml:math display="inline"><mml:mi>s</mml:mi></mml:math></inline-formula> from a sample data set; then
<inline-formula><mml:math display="inline"><mml:mrow><mml:msub><mml:mi>x</mml:mi><mml:mtext>standardized</mml:mtext></mml:msub><mml:mo>=</mml:mo><mml:mo>(</mml:mo><mml:mi>x</mml:mi><mml:mo>-</mml:mo><mml:mi>m</mml:mi><mml:mo>)</mml:mo><mml:mo>/</mml:mo><mml:mi>s</mml:mi></mml:mrow></mml:math></inline-formula>. For every variable we need to store
the values <inline-formula><mml:math display="inline"><mml:mi>m</mml:mi></mml:math></inline-formula> and <inline-formula><mml:math display="inline"><mml:mi>s</mml:mi></mml:math></inline-formula> and apply the same procedure to all new incoming data
before nowcasting.</p></fn>. The regression model takes the form

                  <disp-formula id="Ch1.E2" content-type="numbered"><mml:math display="block"><mml:mrow><mml:mi>y</mml:mi><mml:mo>=</mml:mo><mml:msub><mml:mi>b</mml:mi><mml:mn mathvariant="normal">1</mml:mn></mml:msub><mml:msub><mml:mi>x</mml:mi><mml:mn mathvariant="normal">1</mml:mn></mml:msub><mml:mo>+</mml:mo><mml:msub><mml:mi>b</mml:mi><mml:mn mathvariant="normal">2</mml:mn></mml:msub><mml:msub><mml:mi>x</mml:mi><mml:mn mathvariant="normal">2</mml:mn></mml:msub><mml:mo>+</mml:mo><mml:mi mathvariant="normal">…</mml:mi><mml:mo>+</mml:mo><mml:msub><mml:mi>b</mml:mi><mml:mi>r</mml:mi></mml:msub><mml:msub><mml:mi>x</mml:mi><mml:mi>r</mml:mi></mml:msub><mml:mo>+</mml:mo><mml:mi mathvariant="italic">ϵ</mml:mi><mml:mo>=</mml:mo><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mi mathvariant="bold-italic">β</mml:mi><mml:mo>+</mml:mo><mml:mi mathvariant="italic">ϵ</mml:mi><mml:mo>,</mml:mo></mml:mrow></mml:math></disp-formula>

            where <inline-formula><mml:math display="inline"><mml:mi mathvariant="italic">ϵ</mml:mi></mml:math></inline-formula> is the error variable and the vector <inline-formula><mml:math display="inline"><mml:mrow><mml:mi mathvariant="bold-italic">β</mml:mi><mml:mo>=</mml:mo><mml:mo>(</mml:mo><mml:msub><mml:mi>b</mml:mi><mml:mn mathvariant="normal">1</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>b</mml:mi><mml:mn mathvariant="normal">2</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:mi mathvariant="normal">…</mml:mi><mml:mo>,</mml:mo><mml:msub><mml:mi>b</mml:mi><mml:mi>r</mml:mi></mml:msub><mml:msup><mml:mo>)</mml:mo><mml:mi>T</mml:mi></mml:msup></mml:mrow></mml:math></inline-formula> contains the parameters of the linear model
(regression coefficients). Since the data are assumed to have been
standardized, there is no bias term in the model. In matrix form, the model is
<inline-formula><mml:math display="inline"><mml:mrow><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>=</mml:mo><mml:mi mathvariant="bold">X</mml:mi><mml:mi mathvariant="bold-italic">β</mml:mi><mml:mo>+</mml:mo><mml:mi mathvariant="italic">ϵ</mml:mi></mml:mrow></mml:math></inline-formula>, where <inline-formula><mml:math display="inline"><mml:mrow><mml:msub><mml:mi mathvariant="bold">X</mml:mi><mml:mrow><mml:mi>n</mml:mi><mml:mo>×</mml:mo><mml:mi>r</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:math></inline-formula> is a sample data matrix containing <inline-formula><mml:math display="inline"><mml:mi>n</mml:mi></mml:math></inline-formula> records from <inline-formula><mml:math display="inline"><mml:mi>r</mml:mi></mml:math></inline-formula> sensors and
<inline-formula><mml:math display="inline"><mml:mrow><mml:msub><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mrow><mml:mi>n</mml:mi><mml:mo>×</mml:mo><mml:mn mathvariant="normal">1</mml:mn></mml:mrow></mml:msub></mml:mrow></mml:math></inline-formula> is a vector of the corresponding <inline-formula><mml:math display="inline"><mml:mi>n</mml:mi></mml:math></inline-formula> target values.</p>
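The standardization described in the footnote can be kept as a small stateful object (our naming), so that the stored <inline-formula><mml:math display="inline"><mml:mi>m</mml:mi></mml:math></inline-formula> and <inline-formula><mml:math display="inline"><mml:mi>s</mml:mi></mml:math></inline-formula> are applied unchanged to all new incoming data before nowcasting:

```python
import numpy as np

class Standardizer:
    """Store the per-variable mean m and standard deviation s from a sample
    data set; apply (x - m) / s to every new observation (footnote 3)."""

    def fit(self, X):
        self.m = X.mean(axis=0)
        self.s = X.std(axis=0)
        return self

    def transform(self, X):
        return (X - self.m) / self.s
```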
</sec>
</sec>
<sec id="Ch1.S2.SS4">
  <title>Ordinary least squares</title>
      <p>There are different ways to estimate the regression parameters
<xref ref-type="bibr" rid="bib1.bibx10" id="paren.23"/>. Ordinary least squares (OLS) is a simple and probably
the most common estimator. It minimizes the sum of squared residuals, giving
the following solution:

                <disp-formula id="Ch1.E3" content-type="numbered"><mml:math display="block"><mml:mrow><mml:msub><mml:mover accent="true"><mml:mi mathvariant="bold-italic">β</mml:mi><mml:mo stretchy="false" mathvariant="normal">^</mml:mo></mml:mover><mml:mtext>OLS</mml:mtext></mml:msub><mml:mo>=</mml:mo><mml:mi>arg⁡</mml:mi><mml:msub><mml:mo>min⁡</mml:mo><mml:mi mathvariant="bold-italic">β</mml:mi></mml:msub><mml:mfenced open="(" close=")"><mml:mo>(</mml:mo><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>-</mml:mo><mml:mi mathvariant="bold">X</mml:mi><mml:mi mathvariant="bold-italic">β</mml:mi><mml:msup><mml:mo>)</mml:mo><mml:mi>T</mml:mi></mml:msup><mml:mo>(</mml:mo><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>-</mml:mo><mml:mi mathvariant="bold">X</mml:mi><mml:mi mathvariant="bold-italic">β</mml:mi><mml:mo>)</mml:mo></mml:mfenced><mml:mo>=</mml:mo><mml:mo>(</mml:mo><mml:msup><mml:mi mathvariant="bold">X</mml:mi><mml:mi>T</mml:mi></mml:msup><mml:mi mathvariant="bold">X</mml:mi><mml:msup><mml:mo>)</mml:mo><mml:mrow><mml:mo>-</mml:mo><mml:mn mathvariant="normal">1</mml:mn></mml:mrow></mml:msup><mml:msup><mml:mi mathvariant="bold">X</mml:mi><mml:mi>T</mml:mi></mml:msup><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>.</mml:mo></mml:mrow></mml:math></disp-formula>

          Having estimated a regression model <inline-formula><mml:math display="inline"><mml:mover accent="true"><mml:mi mathvariant="bold-italic">β</mml:mi><mml:mo mathvariant="normal" stretchy="false">^</mml:mo></mml:mover></mml:math></inline-formula>, nowcasting on new
data <inline-formula><mml:math display="inline"><mml:mrow><mml:msub><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mtext>new</mml:mtext></mml:msub></mml:mrow></mml:math></inline-formula> can be made as

                <disp-formula id="Ch1.E4" content-type="numbered"><mml:math display="block"><mml:mrow><mml:mover accent="true"><mml:mi>y</mml:mi><mml:mo mathvariant="normal" stretchy="false">^</mml:mo></mml:mover><mml:mo>=</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mtext>new</mml:mtext></mml:msub><mml:mover accent="true"><mml:mi mathvariant="bold-italic">β</mml:mi><mml:mo stretchy="false" mathvariant="normal">^</mml:mo></mml:mover><mml:mo>.</mml:mo></mml:mrow></mml:math></disp-formula></p>
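Equations (3)–(4) in NumPy (a sketch; `lstsq` solves the same least-squares problem more stably than forming the inverse explicitly):

```python
import numpy as np

def ols_fit(X, y):
    """OLS estimate of beta (Eq. 3); equivalent to (X^T X)^{-1} X^T y."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

def nowcast(x_new, beta):
    """Nowcast on a new (standardized) observation (Eq. 4)."""
    return x_new @ beta
```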
<sec id="Ch1.S2.SS4.SSS1">
  <title>Regularization</title>
      <p>If the input variables are correlated with each other, the optimization
problem is ill-conditioned and may yield poor parameter estimates. In such situations,
regularization is often used for estimating the regression parameters.
Ridge regression (RR) <xref ref-type="bibr" rid="bib1.bibx11 bib1.bibx10" id="paren.24"/> regularizes the
regression coefficients by imposing a penalty on their magnitude. The RR solution
minimizes the cost function

                  <disp-formula specific-use="align"><mml:math display="block"><mml:mtable displaystyle="true"><mml:mtr><mml:mtd><mml:mrow><mml:msub><mml:mover accent="true"><mml:mi mathvariant="bold-italic">β</mml:mi><mml:mo mathvariant="normal" stretchy="false">^</mml:mo></mml:mover><mml:mtext>RR</mml:mtext></mml:msub></mml:mrow></mml:mtd><mml:mtd><mml:mrow><mml:mo>=</mml:mo><mml:mi>arg⁡</mml:mi><mml:msub><mml:mo>min⁡</mml:mo><mml:mi mathvariant="bold-italic">β</mml:mi></mml:msub><mml:mfenced open="(" close=")"><mml:mo>(</mml:mo><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>-</mml:mo><mml:mi mathvariant="bold">X</mml:mi><mml:mi mathvariant="bold-italic">β</mml:mi><mml:msup><mml:mo>)</mml:mo><mml:mi>T</mml:mi></mml:msup><mml:mo>(</mml:mo><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>-</mml:mo><mml:mi mathvariant="bold">X</mml:mi><mml:mi mathvariant="bold-italic">β</mml:mi><mml:mo>)</mml:mo><mml:mo>+</mml:mo><mml:mi mathvariant="italic">λ</mml:mi><mml:msup><mml:mi mathvariant="bold-italic">β</mml:mi><mml:mi>T</mml:mi></mml:msup><mml:mi mathvariant="bold-italic">β</mml:mi></mml:mfenced></mml:mrow></mml:mtd></mml:mtr><mml:mtr><mml:mtd/><mml:mtd><mml:mrow><mml:mo>=</mml:mo><mml:mo>(</mml:mo><mml:msup><mml:mi mathvariant="bold">X</mml:mi><mml:mi>T</mml:mi></mml:msup><mml:mi mathvariant="bold">X</mml:mi><mml:mo>+</mml:mo><mml:mi mathvariant="italic">λ</mml:mi><mml:mi mathvariant="bold">I</mml:mi><mml:msup><mml:mo>)</mml:mo><mml:mrow><mml:mo>-</mml:mo><mml:mn mathvariant="normal">1</mml:mn></mml:mrow></mml:msup><mml:msup><mml:mi mathvariant="bold">X</mml:mi><mml:mi>T</mml:mi></mml:msup><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>,</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>

              where <inline-formula><mml:math display="inline"><mml:mrow><mml:mi mathvariant="italic">λ</mml:mi><mml:mo>&gt;</mml:mo><mml:mn mathvariant="normal">0</mml:mn></mml:mrow></mml:math></inline-formula> controls the amount of shrinkage: the larger the
value of <inline-formula><mml:math display="inline"><mml:mi mathvariant="italic">λ</mml:mi></mml:math></inline-formula>, the greater the amount of shrinkage. <inline-formula><mml:math display="inline"><mml:mi mathvariant="bold">X</mml:mi></mml:math></inline-formula>
denotes the <inline-formula><mml:math display="inline"><mml:mrow><mml:mi>n</mml:mi><mml:mo>×</mml:mo><mml:mi>r</mml:mi></mml:mrow></mml:math></inline-formula> training data set, and <inline-formula><mml:math display="inline"><mml:mi mathvariant="bold-italic">y</mml:mi></mml:math></inline-formula> is the <inline-formula><mml:math display="inline"><mml:mrow><mml:mi>n</mml:mi><mml:mo>×</mml:mo><mml:mn mathvariant="normal">1</mml:mn></mml:mrow></mml:math></inline-formula> vector of the true target values; <inline-formula><mml:math display="inline"><mml:mi mathvariant="bold">I</mml:mi></mml:math></inline-formula> is the <inline-formula><mml:math display="inline"><mml:mrow><mml:mi>r</mml:mi><mml:mo>×</mml:mo><mml:mi>r</mml:mi></mml:mrow></mml:math></inline-formula> identity matrix. Nowcasting outputs on new data
<inline-formula><mml:math display="inline"><mml:mrow><mml:msub><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mtext>new</mml:mtext></mml:msub></mml:mrow></mml:math></inline-formula> can be produced as

                  <disp-formula id="Ch1.E5" content-type="numbered"><mml:math display="block"><mml:mrow><mml:mover accent="true"><mml:mi>y</mml:mi><mml:mo mathvariant="normal" stretchy="false">^</mml:mo></mml:mover><mml:mo>=</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mtext>new</mml:mtext></mml:msub><mml:msub><mml:mover accent="true"><mml:mi mathvariant="bold-italic">β</mml:mi><mml:mo stretchy="false" mathvariant="normal">^</mml:mo></mml:mover><mml:mtext>RR</mml:mtext></mml:msub><mml:mo>.</mml:mo></mml:mrow></mml:math></disp-formula></p>
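The ridge estimate has the same closed form with the λI term added (our sketch; the value of λ is chosen by the user or by validation):

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Ridge estimate: (X^T X + lambda I)^{-1} X^T y, with lambda > 0."""
    r = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(r), X.T @ y)
```

As expected from the shrinkage interpretation, the norm of the estimated coefficient vector decreases as λ grows.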
</sec>
<sec id="Ch1.S2.SS4.SSS2">
  <title>Principal component regression</title>
      <p>Principal component analysis (PCA) regression <xref ref-type="bibr" rid="bib1.bibx13" id="paren.25"/> first transforms the
input data by rotating them towards their principal components and then
estimates the regression coefficients on the transformed data.</p>
      <p>Let <inline-formula><mml:math display="inline"><mml:mrow><mml:msub><mml:mi mathvariant="bold">X</mml:mi><mml:mrow><mml:mi>n</mml:mi><mml:mo>×</mml:mo><mml:mi>r</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:math></inline-formula> be the training data matrix, and let <inline-formula><mml:math display="inline"><mml:mrow><mml:msub><mml:mi mathvariant="bold">R</mml:mi><mml:mrow><mml:mi>r</mml:mi><mml:mo>×</mml:mo><mml:mi>k</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:math></inline-formula> be the matrix of the <inline-formula><mml:math display="inline"><mml:mi>k</mml:mi></mml:math></inline-formula> principal components corresponding to the
largest eigenvalues. Here, <inline-formula><mml:math display="inline"><mml:mi>k</mml:mi></mml:math></inline-formula> is a user-defined parameter such that <inline-formula><mml:math display="inline"><mml:mrow><mml:mn mathvariant="normal">1</mml:mn><mml:mo>≤</mml:mo><mml:mi>k</mml:mi><mml:mo>≤</mml:mo><mml:mi>r</mml:mi></mml:mrow></mml:math></inline-formula>; if <inline-formula><mml:math display="inline"><mml:mrow><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mi>r</mml:mi></mml:mrow></mml:math></inline-formula>, then PCA regression becomes the ordinary regression. Then
OLS gives the following solution on the transformed input data:

                  <disp-formula id="Ch1.E6" content-type="numbered"><mml:math display="block"><mml:mrow><mml:msubsup><mml:mover accent="true"><mml:mi mathvariant="bold-italic">β</mml:mi><mml:mo stretchy="false" mathvariant="normal">^</mml:mo></mml:mover><mml:mtext>PCA</mml:mtext><mml:mo>∗</mml:mo></mml:msubsup><mml:mo>=</mml:mo><mml:mi>arg⁡</mml:mi><mml:msub><mml:mo>min⁡</mml:mo><mml:mi mathvariant="bold-italic">β</mml:mi></mml:msub><mml:mfenced open="(" close=")"><mml:mo>(</mml:mo><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>-</mml:mo><mml:mi mathvariant="bold">XR</mml:mi><mml:msup><mml:mi mathvariant="bold-italic">β</mml:mi><mml:mo>∗</mml:mo></mml:msup><mml:msup><mml:mo>)</mml:mo><mml:mi>T</mml:mi></mml:msup><mml:mo>(</mml:mo><mml:mi mathvariant="bold-italic">y</mml:mi><mml:mo>-</mml:mo><mml:mi mathvariant="bold">XR</mml:mi><mml:msup><mml:mi mathvariant="bold-italic">β</mml:mi><mml:mo>∗</mml:mo></mml:msup><mml:mo>)</mml:mo></mml:mfenced><mml:mo>,</mml:mo></mml:mrow></mml:math></disp-formula>

            and in the original data space the solution is
<inline-formula><mml:math display="inline"><mml:mrow><mml:msub><mml:mover accent="true"><mml:mi mathvariant="bold-italic">β</mml:mi><mml:mo mathvariant="normal" stretchy="false">^</mml:mo></mml:mover><mml:mtext>PCA</mml:mtext></mml:msub><mml:mo>=</mml:mo><mml:mi mathvariant="bold">R</mml:mi><mml:msubsup><mml:mover accent="true"><mml:mi mathvariant="bold-italic">β</mml:mi><mml:mo mathvariant="normal" stretchy="false">^</mml:mo></mml:mover><mml:mtext>PCA</mml:mtext><mml:mo>∗</mml:mo></mml:msubsup></mml:mrow></mml:math></inline-formula>. Nowcasting on new data
<inline-formula><mml:math display="inline"><mml:mrow><mml:msub><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mtext>new</mml:mtext></mml:msub></mml:mrow></mml:math></inline-formula> can be made as

                  <disp-formula id="Ch1.E7" content-type="numbered"><mml:math display="block"><mml:mrow><mml:mover accent="true"><mml:mi>y</mml:mi><mml:mo stretchy="false" mathvariant="normal">^</mml:mo></mml:mover><mml:mo>=</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mtext>new</mml:mtext></mml:msub><mml:mi mathvariant="bold">R</mml:mi><mml:msubsup><mml:mover accent="true"><mml:mi mathvariant="bold-italic">β</mml:mi><mml:mo stretchy="false" mathvariant="normal">^</mml:mo></mml:mover><mml:mtext>PCA</mml:mtext><mml:mo>∗</mml:mo></mml:msubsup><mml:mo>=</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mtext>new</mml:mtext></mml:msub><mml:msub><mml:mover accent="true"><mml:mi mathvariant="bold-italic">β</mml:mi><mml:mo stretchy="false" mathvariant="normal">^</mml:mo></mml:mover><mml:mtext>PCA</mml:mtext></mml:msub><mml:mo>.</mml:mo></mml:mrow></mml:math></disp-formula></p>
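Equations (6)–(7) can be sketched as follows (our implementation; the eigenvectors of the input covariance matrix, sorted by decreasing eigenvalue, play the role of <inline-formula><mml:math display="inline"><mml:mi mathvariant="bold">R</mml:mi></mml:math></inline-formula>):

```python
import numpy as np

def pca_regression_fit(X, y, k):
    """Regress y on the projections onto the top-k principal components
    (Eq. 6), then map the coefficients back to the original space."""
    eigval, eigvec = np.linalg.eigh(np.cov(X, rowvar=False))  # ascending order
    R = eigvec[:, ::-1][:, :k]           # r x k matrix of top-k components
    Z = X @ R                            # transformed input data
    beta_star, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return R @ beta_star                 # beta_PCA = R beta*_PCA
```

With k = r the rotation is invertible, so this coincides with the OLS solution, as stated above.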
</sec>
<sec id="Ch1.S2.SS4.SSS3">
  <title>Partial least squares regression</title>
      <p>Partial least squares (PLS) regression is very popular in chemometrics
<xref ref-type="bibr" rid="bib1.bibx26" id="paren.26"/>. Similarly to PCA, the input data are transformed, but instead
of maximizing the variance of the input data (as in PCA), this transformation
maximizes the covariance between the input variables and the target. There is no
convenient analytical solution; instead, an iterative optimization
procedure is employed for parameter estimation. The procedure is presented
in Algorithm <xref ref-type="fig" rid="Ch1.F2"/>. Here, <inline-formula><mml:math display="inline"><mml:mi>k</mml:mi></mml:math></inline-formula> is a user-defined parameter such that <inline-formula><mml:math display="inline"><mml:mrow><mml:mn mathvariant="normal">1</mml:mn><mml:mo>≤</mml:mo><mml:mi>k</mml:mi><mml:mo>≤</mml:mo><mml:mi>r</mml:mi></mml:mrow></mml:math></inline-formula>; if <inline-formula><mml:math display="inline"><mml:mrow><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mi>r</mml:mi></mml:mrow></mml:math></inline-formula>, then PLS regression becomes the ordinary regression.</p>

      <?xmltex \floatpos{t}?><fig id="Ch1.F2"><caption><p>PLS
regression.</p></caption>
            <?xmltex \igopts{width=236.157874pt}?><graphic xlink:href="https://www.atmos-meas-tech.net/7/4387/2014/amt-7-4387-2014-algorithm.pdf"/>
            <?xmltex \hack{\def\figurename{Algorithm}\setcounter{figure}{0}}?>

          </fig>

      <p>Nowcasting on new data <inline-formula><mml:math display="inline"><mml:mrow><mml:msub><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mtext>new</mml:mtext></mml:msub></mml:mrow></mml:math></inline-formula> can be made as

                  <disp-formula id="Ch1.E8" content-type="numbered"><mml:math display="block"><mml:mrow><mml:mover accent="true"><mml:mi>y</mml:mi><mml:mo stretchy="false" mathvariant="normal">^</mml:mo></mml:mover><mml:mo>=</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">x</mml:mi><mml:mtext>new</mml:mtext></mml:msub><mml:msub><mml:mover accent="true"><mml:mi mathvariant="bold-italic">β</mml:mi><mml:mo stretchy="false" mathvariant="normal">^</mml:mo></mml:mover><mml:mtext>PLS</mml:mtext></mml:msub><mml:mo>.</mml:mo></mml:mrow></mml:math></disp-formula></p>
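Since Algorithm 1 is reproduced only as a figure here, the following is a standard NIPALS-style PLS1 sketch for a univariate target (our formulation; details such as normalization conventions may differ from the paper's algorithm):

```python
import numpy as np

def pls1_fit(X, y, k):
    """NIPALS-style PLS1 with k components; returns the regression
    coefficients expressed in the original input space."""
    X = np.array(X, dtype=float)   # copies, since both are deflated in place
    y = np.array(y, dtype=float)
    r = X.shape[1]
    W = np.zeros((r, k))           # weight vectors
    P = np.zeros((r, k))           # X loadings
    q = np.zeros(k)                # y loadings
    for j in range(k):
        w = X.T @ y
        w /= np.linalg.norm(w)
        t = X @ w                          # component scores
        tt = t @ t
        P[:, j] = X.T @ t / tt
        q[j] = (y @ t) / tt
        X = X - np.outer(t, P[:, j])       # deflate X
        y = y - q[j] * t                   # deflate y
        W[:, j] = w
    # beta_PLS = W (P^T W)^{-1} q
    return W @ np.linalg.solve(P.T @ W, q)
```

With k = r and full-rank inputs this again coincides with ordinary least squares.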
</sec>
</sec>
<sec id="Ch1.S2.SS5">
  <?xmltex \opttitle{Estimating the robustness of linear regression\hack{\\} models to missing data}?><title>Estimating the robustness of linear regression<?xmltex \hack{\newline}?> models to missing data</title>
      <p>For a linear regression model, it is possible to determine theoretically how
many missing inputs can be tolerated before the model outputs become unreliable. We
can estimate the robustness of a linear regression model to potentially missing
input data by using the deterioration index <xref ref-type="bibr" rid="bib1.bibx27" id="paren.27"/>, which is defined as

                <disp-formula id="Ch1.E9" content-type="numbered"><mml:math display="block"><mml:mrow><mml:mi>d</mml:mi><mml:mo>=</mml:mo><mml:mo>-</mml:mo><mml:msup><mml:mi mathvariant="bold-italic">β</mml:mi><mml:mi>T</mml:mi></mml:msup><mml:mo>(</mml:mo><mml:mi mathvariant="bold">Σ</mml:mi><mml:mo>-</mml:mo><mml:mi mathvariant="bold">I</mml:mi><mml:mo>)</mml:mo><mml:mi mathvariant="bold-italic">β</mml:mi><mml:mo>,</mml:mo></mml:mrow></mml:math></disp-formula>

          where <inline-formula><mml:math display="inline"><mml:mi mathvariant="bold-italic">β</mml:mi></mml:math></inline-formula> is a vector of the regression coefficients, assuming that
the input variables have been standardized to zero mean and unit standard
deviation; <inline-formula><mml:math display="inline"><mml:mi mathvariant="bold">Σ</mml:mi></mml:math></inline-formula> is the covariance matrix of the input data;
and <inline-formula><mml:math display="inline"><mml:mi mathvariant="bold">I</mml:mi></mml:math></inline-formula> is the identity matrix. High values of the index <inline-formula><mml:math display="inline"><mml:mi>d</mml:mi></mml:math></inline-formula>
indicate low tolerance of the model to missing data. The prediction errors
will increase quickly with the number of missing inputs. The smaller <inline-formula><mml:math display="inline"><mml:mi>d</mml:mi></mml:math></inline-formula>, the
more robust to missing data the model is. <inline-formula><mml:math display="inline"><mml:mi>d</mml:mi></mml:math></inline-formula> may be negative; negative values are the
most desirable.</p>
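<p>The deterioration index is straightforward to evaluate once a model has been fitted. The sketch below uses a hypothetical two-input covariance matrix to illustrate how strongly correlated inputs combined with large, opposing coefficients inflate the index, whereas modest coefficients on the same inputs yield a negative (robust) value.</p>

```python
import numpy as np

def deterioration_index(beta, Sigma):
    """Deterioration index d = -beta' (Sigma - I) beta.
    beta: coefficients of a model fitted on standardized inputs;
    Sigma: covariance matrix of the standardized input data."""
    r = len(beta)
    return float(-beta @ (Sigma - np.eye(r)) @ beta)

# toy illustration: two strongly correlated inputs
Sigma = np.array([[1.0, 0.9],
                  [0.9, 1.0]])
# model A spreads modest weight over the pair, model B uses a large difference
d_A = deterioration_index(np.array([0.5, 0.5]), Sigma)   # negative: robust
d_B = deterioration_index(np.array([5.0, -5.0]), Sigma)  # large: fragile
```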
      <p>Low <inline-formula><mml:math display="inline"><mml:mi>d</mml:mi></mml:math></inline-formula> guarantees robustness to missing data, but the models with low
<inline-formula><mml:math display="inline"><mml:mi>d</mml:mi></mml:math></inline-formula> do not necessarily give good predictions when all the values are
available. Hence, a tradeoff between accuracy and robustness needs to be
found, and the following method can help to find it.</p>
      <p>Suppose we get two models A and B, and we would like to select one for
deployment. We can measure their prediction errors on a training data set
using cross-validation: <inline-formula><mml:math display="inline"><mml:mrow><mml:msup><mml:mtext>RMSE</mml:mtext><mml:mrow><mml:mo>(</mml:mo><mml:mi mathvariant="normal">A</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:msup></mml:mrow></mml:math></inline-formula> and
<inline-formula><mml:math display="inline"><mml:mrow><mml:msup><mml:mtext>RMSE</mml:mtext><mml:mrow><mml:mo>(</mml:mo><mml:mi mathvariant="normal">B</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:msup></mml:mrow></mml:math></inline-formula>. We can also compute deterioration
indices <inline-formula><mml:math display="inline"><mml:mrow><mml:msup><mml:mi>d</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mi mathvariant="normal">A</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:msup></mml:mrow></mml:math></inline-formula> and <inline-formula><mml:math display="inline"><mml:mrow><mml:msup><mml:mi>d</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mi mathvariant="normal">B</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:msup></mml:mrow></mml:math></inline-formula>. Without loss of
generality, assume that <inline-formula><mml:math display="inline"><mml:mrow><mml:msup><mml:mtext>RMSE</mml:mtext><mml:mrow><mml:mo>(</mml:mo><mml:mi mathvariant="normal">A</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:msup><mml:mo>≥</mml:mo><mml:msup><mml:mtext>RMSE</mml:mtext><mml:mrow><mml:mo>(</mml:mo><mml:mi mathvariant="normal">B</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:msup></mml:mrow></mml:math></inline-formula>; i.e., model B shows a better prediction accuracy
when no data are missing. If <inline-formula><mml:math display="inline"><mml:mrow><mml:msup><mml:mi>d</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mi mathvariant="normal">A</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:msup><mml:mo>≥</mml:mo><mml:msup><mml:mi>d</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mi mathvariant="normal">B</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:msup></mml:mrow></mml:math></inline-formula>,
then model B is also more robust. In such a case, model B is better (or at
least as good as A) with regard to both characteristics, and hence B is preferred over A.</p>
      <p>If, however, <inline-formula><mml:math display="inline"><mml:mrow><mml:msup><mml:mi>d</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mi mathvariant="normal">A</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:msup><mml:mo>&lt;</mml:mo><mml:msup><mml:mi>d</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mi mathvariant="normal">B</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:msup></mml:mrow></mml:math></inline-formula>, then we can find out how
many input readings can go missing before A becomes better than B. The number
<inline-formula><mml:math display="inline"><mml:mrow><mml:msup><mml:mi>m</mml:mi><mml:mo>∗</mml:mo></mml:msup></mml:mrow></mml:math></inline-formula> can be computed as <xref ref-type="bibr" rid="bib1.bibx27" id="paren.28"/>

                <disp-formula id="Ch1.E10" content-type="numbered"><mml:math display="block"><mml:mrow><mml:msup><mml:mi>m</mml:mi><mml:mo>∗</mml:mo></mml:msup><mml:mo>=</mml:mo><mml:mo>(</mml:mo><mml:mi>r</mml:mi><mml:mo>-</mml:mo><mml:mn mathvariant="normal">1</mml:mn><mml:mo>)</mml:mo><mml:mfrac><mml:mrow><mml:mo>[</mml:mo><mml:msup><mml:mtext>RMSE</mml:mtext><mml:mrow><mml:mo>(</mml:mo><mml:mi mathvariant="normal">A</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:msup><mml:msup><mml:mo>]</mml:mo><mml:mn mathvariant="normal">2</mml:mn></mml:msup><mml:mo>-</mml:mo><mml:mo>[</mml:mo><mml:msup><mml:mtext>RMSE</mml:mtext><mml:mrow><mml:mo>(</mml:mo><mml:mi mathvariant="normal">B</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:msup><mml:msup><mml:mo>]</mml:mo><mml:mn mathvariant="normal">2</mml:mn></mml:msup></mml:mrow><mml:mrow><mml:msup><mml:mi>d</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mi mathvariant="normal">B</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:msup><mml:mo>-</mml:mo><mml:msup><mml:mi>d</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mi mathvariant="normal">A</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:msup></mml:mrow></mml:mfrac><mml:mo>,</mml:mo></mml:mrow></mml:math></disp-formula>

          where <inline-formula><mml:math display="inline"><mml:mi>r</mml:mi></mml:math></inline-formula> is the number of input sensors.</p>
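<p>The break-even point can be evaluated directly from the cross-validation errors and deterioration indices of the two candidate models. The sketch below uses hypothetical values, not the ones reported later in this paper.</p>

```python
def breakeven_missing(rmse_a, rmse_b, d_a, d_b, r):
    """Number of missing inputs m* at which model A (worse RMSE but
    lower deterioration index) starts to outperform model B."""
    return (r - 1) * (rmse_a**2 - rmse_b**2) / (d_b - d_a)

# hypothetical models: B is more accurate but far less robust
m_star = breakeven_missing(rmse_a=21.0, rmse_b=19.0,
                           d_a=-100.0, d_b=10000.0, r=36)
# if, on average, more than m_star inputs are missing per observation,
# model A is the better choice for deployment
```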
</sec>
<sec id="Ch1.S2.SS6">
  <title>Experimental protocol</title>
<sec id="Ch1.S2.SS6.SSS1">
  <title>Data preparation and preprocessing</title>
      <p>Solar-radiation readings (target variable) are available 99 % of the
time. We eliminate from the experiment the samples where no target value is
available, since such samples can be used neither for model training nor for
measuring model accuracy.</p>
      <p>The following preprocessing of the target values is performed. If the
measured solar radiation is negative, it is set to 0. If the measured
solar radiation exceeds the theoretical (maximum) radiation, the measurement
is corrected to be equal to the theoretical radiation. In practice, such
observations can arise if, during a cloudy day, the sky is clear where the sun
is shining but there is cloud cover elsewhere. The clouds then reflect back
more radiation than the clear blue sky would. For simplicity, we
do not consider this effect in our modeling at this stage.</p>
      <p>Exploratory analysis of missing data is performed on all 7 years of data.
For the analysis of the model accuracies, we use the first 3 years of data as
a training set and the remaining 4 years as the testing set. We assume
the scenario where an analyst is currently at the end of year three, and all
the previous 3 years of data are available for model calibration. After
modeling and calibration are done, an online operation scenario is assumed,
where the testing data (4 years) arrive in sequential order.</p>
      <p>From the training set we eliminate all the observations that contain any
missing values (<inline-formula><mml:math display="inline"><mml:mn>34</mml:mn></mml:math></inline-formula> % of the training data). The testing set contains all
samples, regardless of whether any values in the input data are missing. In
addition, we eliminate from the training and testing sets all the
observations where the value of the theoretical radiation is <inline-formula><mml:math display="inline"><mml:mn mathvariant="normal">0</mml:mn></mml:math></inline-formula> (the dark
periods), since the value of the target variable is then also <inline-formula><mml:math display="inline"><mml:mn mathvariant="normal">0</mml:mn></mml:math></inline-formula>, which can be
nowcasted with <inline-formula><mml:math display="inline"><mml:mn>100</mml:mn></mml:math></inline-formula> % accuracy, whereas in the experimental
comparison of models we are interested in the accuracy of nontrivial
nowcasting tasks.</p>
      <p>The training data are standardized to have zero mean and unit standard
deviation.  The testing data are preprocessed by subtracting the mean
and dividing by the <?xmltex \hack{\mbox\bgroup}?>standard<?xmltex \hack{\egroup}?> deviation calculated on the training set.
After standardization we replace all the missing values in the testing
set by zeros and test the performance of the regression models.</p>
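<p>A minimal sketch of this preprocessing step is given below, assuming missing values are encoded as NaN (the toy data are illustrative). After standardization, a missing value replaced by 0 corresponds to the training-set mean of that sensor.</p>

```python
import numpy as np

rng = np.random.default_rng(1)
X_train = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
X_test = rng.normal(loc=5.0, scale=2.0, size=(10, 3))
X_test[0, 1] = np.nan                      # a missing sensor reading

# standardize with statistics estimated on the training set only
mu = X_train.mean(axis=0)
sigma = X_train.std(axis=0)
Z_train = (X_train - mu) / sigma
Z_test = (X_test - mu) / sigma
# missing values -> 0, i.e. the training mean in the standardized scale
Z_test = np.nan_to_num(Z_test, nan=0.0)
```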
</sec>
<sec id="Ch1.S2.SS6.SSS2">
  <title>Regression models used in the experiments</title>
      <p>We experimentally analyze seven regression models, summarized in
Table <xref ref-type="table" rid="Ch1.T2"/>.</p>

<?xmltex \floatpos{t}?><table-wrap id="Ch1.T2"><caption><p>Summary of regression models: OLS – ordinary least squares; RR –
Ridge regression.</p></caption><oasis:table frame="topbot"><oasis:tgroup cols="4">
     <oasis:colspec colnum="1" colname="col1" align="center"/>
     <oasis:colspec colnum="2" colname="col2" align="center"/>
     <oasis:colspec colnum="3" colname="col3" align="center"/>
     <oasis:colspec colnum="4" colname="col4" align="center"/>
     <oasis:thead>
       <oasis:row>  
         <oasis:entry colname="col1"/>  
         <oasis:entry colname="col2"/>  
         <oasis:entry namest="col3" nameend="col4">Optimization </oasis:entry>
       </oasis:row>
       <oasis:row rowsep="1">  
         <oasis:entry colname="col1"/>  
         <oasis:entry colname="col2"/>  
         <oasis:entry colname="col3">OLS</oasis:entry>  
         <oasis:entry colname="col4">RR</oasis:entry>
       </oasis:row>
     </oasis:thead>
     <oasis:tbody>
       <oasis:row>  
         <oasis:entry colname="col1">Inputs</oasis:entry>  
         <oasis:entry colname="col2">all <inline-formula><mml:math display="inline"><mml:mi>r</mml:mi></mml:math></inline-formula></oasis:entry>  
         <oasis:entry colname="col3">ALL</oasis:entry>  
         <oasis:entry colname="col4">rALL</oasis:entry>
       </oasis:row>
       <oasis:row>  
         <oasis:entry colname="col1"/>  
         <oasis:entry colname="col2">Selected <inline-formula><mml:math display="inline"><mml:mi>k</mml:mi></mml:math></inline-formula></oasis:entry>  
         <oasis:entry colname="col3">SEL</oasis:entry>  
         <oasis:entry colname="col4">rSEL</oasis:entry>
       </oasis:row>
       <oasis:row>  
         <oasis:entry colname="col1"/>  
         <oasis:entry colname="col2">PCA <inline-formula><mml:math display="inline"><mml:mi>k</mml:mi></mml:math></inline-formula></oasis:entry>  
         <oasis:entry colname="col3">PCA</oasis:entry>  
         <oasis:entry colname="col4">rPCA</oasis:entry>
       </oasis:row>
       <oasis:row>  
         <oasis:entry colname="col1"/>  
         <oasis:entry colname="col2">PLS <inline-formula><mml:math display="inline"><mml:mi>k</mml:mi></mml:math></inline-formula></oasis:entry>  
         <oasis:entry colname="col3">PLS</oasis:entry>  
         <oasis:entry colname="col4"/>
       </oasis:row>
     </oasis:tbody>
   </oasis:tgroup></oasis:table></table-wrap>

      <p>ALL uses all <inline-formula><mml:math display="inline"><mml:mi>r</mml:mi></mml:math></inline-formula> sensors as inputs. SEL selects <inline-formula><mml:math display="inline"><mml:mi>k</mml:mi></mml:math></inline-formula> sensors that have the
largest absolute correlation with the target variable (correlation is
measured on the training data) and builds a regression model on those <inline-formula><mml:math display="inline"><mml:mi>k</mml:mi></mml:math></inline-formula>
sensors. PCA rotates the input data using principal component analysis; the <inline-formula><mml:math display="inline"><mml:mi>k</mml:mi></mml:math></inline-formula>
features corresponding to the largest eigenvalues are retained, and a
regression model is built on those <inline-formula><mml:math display="inline"><mml:mi>k</mml:mi></mml:math></inline-formula> new features. PLS rotates the input data
to maximize the covariance between the inputs and the target. We keep <inline-formula><mml:math display="inline"><mml:mi>k</mml:mi></mml:math></inline-formula> new
features.</p>
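<p>The SEL and PCA input constructions described above can be sketched as follows (toy data; in the case study the correlations and principal components are computed on the training set only).</p>

```python
import numpy as np

def select_top_k(X, y, k):
    """SEL: indices of the k inputs with the largest absolute
    correlation with the target."""
    corr = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
    return np.argsort(-np.abs(corr))[:k]

def pca_features(X, k):
    """PCA: project (standardized) inputs onto the k leading principal
    directions; returns the scores and the rotation matrix."""
    cov = np.cov(X, rowvar=False)
    eigval, eigvec = np.linalg.eigh(cov)
    order = np.argsort(eigval)[::-1][:k]   # largest eigenvalues first
    V = eigvec[:, order]
    return X @ V, V

rng = np.random.default_rng(2)
X = rng.standard_normal((300, 5))
y = 3.0 * X[:, 2] + 0.1 * rng.standard_normal(300)
top = select_top_k(X, y, k=2)   # input 2 dominates the target, so it ranks first
T, V = pca_features(X, k=2)     # 2 new features replace the 5 original inputs
```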

      <?xmltex \floatpos{t}?><fig id="Ch1.F3" specific-use="star"><caption><p>Analysis of missing-data patterns: <bold>(a)</bold> distribution
of number of missing sensors in observations (<inline-formula><mml:math display="inline"><mml:mrow><mml:mn>10</mml:mn><mml:mo>+</mml:mo></mml:mrow></mml:math></inline-formula> means that <inline-formula><mml:math display="inline"><mml:mn>10</mml:mn></mml:math></inline-formula>
to <inline-formula><mml:math display="inline"><mml:mn>36</mml:mn></mml:math></inline-formula> sensors are missing); <bold>(b)</bold> effects of removing the sensors with
the most missing values; <bold>(c)</bold> the relation of individual sensors
with the target variable (each dot represents one sensor).</p></caption>
            <?xmltex \igopts{width=312.980315pt}?><graphic xlink:href="https://www.atmos-meas-tech.net/7/4387/2014/amt-7-4387-2014-f02.pdf"/>

          </fig>

      <p>ALL, SEL, and PCA use the ordinary least squares optimization procedure (OLS)
for parameter estimation. In addition, we test the same approaches but using
the regularized Ridge regression (RR); these models are denoted as rALL,
rSEL, and rPCA. PLS uses its own iterative optimization procedure, which is not
regularized.</p>
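<p>For reference, the two optimization procedures differ only in a regularization term: OLS solves the normal equations directly, while RR adds a penalty on the coefficient magnitudes. A minimal sketch follows (the penalty strength and the toy data are illustrative); note that shrinking the coefficients also tends to lower the deterioration index of the resulting model.</p>

```python
import numpy as np

def ols(X, y):
    """Ordinary least squares: solve (X'X) beta = X'y."""
    return np.linalg.solve(X.T @ X, X.T @ y)

def ridge(X, y, lam):
    """Ridge regression: solve (X'X + lam * I) beta = X'y."""
    r = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(r), X.T @ y)

rng = np.random.default_rng(3)
X = rng.standard_normal((200, 4))
y = X @ np.array([2.0, -1.0, 0.0, 0.5])
b_ols = ols(X, y)
b_ridge = ridge(X, y, lam=10.0)   # shrunk towards zero relative to OLS
```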
      <p>In addition, we compare the performance to a naive baseline NAI, which
produces a constant output, predicting that the radiation will equal
the mean radiation in the training data.</p><?xmltex \hack{\newpage}?>
</sec>
<sec id="Ch1.S2.SS6.SSS3">
  <title>Software and hardware</title>
      <p>The experiments are performed in MATLAB 2012b, using in-house produced code
(no extra packages are required) on a commodity laptop computer (Processor
<inline-formula><mml:math display="inline"><mml:mn>2.5</mml:mn></mml:math></inline-formula> <inline-formula><mml:math display="inline"><mml:mi mathvariant="normal">GHz</mml:mi></mml:math></inline-formula> Intel Core i5; Memory 8 GB 1600 <inline-formula><mml:math display="inline"><mml:mi mathvariant="normal">MHz</mml:mi></mml:math></inline-formula> DDR3). The
data set used in this study and the code for the experiments are made
available<fn id="Ch1.Footn4"><p><uri>http://users.ics.aalto.fi/indre/smear.zip</uri></p></fn> for
research purposes.</p>
</sec>
</sec>
</sec>
<sec id="Ch1.S3">
  <title>Results and discussion</title>
<sec id="Ch1.S3.SS1">
  <title>Analysis of missing-data characteristics</title>
      <p>Firstly, we analyze in what way missing values occur in the case study
data set. Figure <xref ref-type="fig" rid="Ch1.F3"/>a presents the distribution of missing
sensors. We see that about half of the time nothing is missing and half of
the time observation vectors are incomplete. Over 35 % of the time,
2–4 sensors are missing. The mean number of missing sensors over all the
data set is <inline-formula><mml:math display="inline"><mml:mn>2.4</mml:mn></mml:math></inline-formula>. We observe from the data that up to <inline-formula><mml:math display="inline"><mml:mn>36</mml:mn></mml:math></inline-formula> sensors (all the
input sensors) may be missing at a time. From this analysis we conclude that
the amount of missing data is massive in both scale and scope, and missing
values need to be taken into consideration when building nowcasting models
on these data. The amount and frequency of missing data also indicate that a
case deletion approach would not be suitable because there would be
predictions missing continuously.</p>
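<p>The distribution described above is obtained by counting the missing sensors in each observation vector. A sketch with synthetic data (the missingness rate below is hypothetical, not the rate of the case study):</p>

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.standard_normal((1000, 36))
mask = rng.random((1000, 36)) < 0.07   # hypothetical missingness rate
X[mask] = np.nan

n_missing = np.isnan(X).sum(axis=1)    # missing sensors per observation
complete = np.mean(n_missing == 0)     # share of complete observations
mean_missing = n_missing.mean()        # mean number of missing sensors
```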

      <?xmltex \floatpos{t}?><fig id="Ch1.F4"><caption><p>Correlation of missing-value patterns. High correlation
(indicated by darker values) means that the values are often missing
together.</p></caption>
          <?xmltex \igopts{width=236.157874pt}?><graphic xlink:href="https://www.atmos-meas-tech.net/7/4387/2014/amt-7-4387-2014-f03.pdf"/>

        </fig>

      <p>One may consider that removing from the data set one or two sensors with the
largest amount of missing values could solve the problem. This could help if
mostly the same sensors were missing all the time. In the following
experiment we analyze in what way individual sensors are missing. First, we
remove the sensor with the most missing values from the data set; this way the
observation vectors at each 30 min time stamp become shorter, and they now
include <inline-formula><mml:math display="inline"><mml:mn>35</mml:mn></mml:math></inline-formula> sensors instead of <inline-formula><mml:math display="inline"><mml:mn>36</mml:mn></mml:math></inline-formula>. Given the updated observation vectors, we recalculate how many of those vectors contain at least one missing value.
Then we remove the sensor with the next-largest number of missing values and repeat the computation.
Figure <xref ref-type="fig" rid="Ch1.F3"/>b presents the results. We see that removing
a couple of the most frequently missing sensors does not make the remaining observations
complete. We would need to remove about half of the sensors in order to reach
the stage where at least 95 % of the data are complete. The problem with
such an approach is that the removed sensors may carry important information
about the target, which then would be lost. To investigate this effect,
Fig. <xref ref-type="fig" rid="Ch1.F3"/>c presents the relation between the missing-data rate
in each sensor and the information about the target contained in it, measured
as the absolute linear correlation with the target variable. We have removed
the periods where the value of the target is equal to 0 (the dark periods
when there is no solar radiation) from this analysis. We see that some
sensors in the far right corner and upper center have a high missing-value
rate but also high correlation with the target variable. This means that
excluding sensors with high missing-value rates would lead to losses of
valuable information about the target that would be useful for nowcasting.</p>
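<p>The removal experiment described above can be sketched as follows: repeatedly drop the sensor with the most missing values and record the share of fully complete observation vectors that remains. A toy missingness matrix is used for illustration.</p>

```python
import numpy as np

def completeness_after_removal(miss):
    """For a boolean missingness matrix (rows = observations, columns =
    sensors), repeatedly drop the sensor with the most missing values and
    record the share of complete observations before each removal."""
    miss = miss.copy()
    shares = []
    while miss.shape[1] > 0:
        shares.append(np.mean(~miss.any(axis=1)))
        worst = np.argmax(miss.sum(axis=0))   # sensor missing most often
        miss = np.delete(miss, worst, axis=1)
    return shares

# tiny illustration: one sensor accounts for most of the missingness
miss = np.array([[True,  False, False],
                 [True,  False, False],
                 [False, True,  False],
                 [False, False, False]])
shares = completeness_after_removal(miss)
```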
      <p><?xmltex \hack{\newpage}?>One more issue with the data is that sensors do not produce missing values independently of each other. For example, if one temperature value is
missing, then it is likely that the other temperature values are missing as
well. It may be the case that sensors are missing together due to some common
external reasons, for instance, electric power outages. This observation is
illustrated by Fig. <xref ref-type="fig" rid="Ch1.F4"/>, which plots pairwise correlations
between missing values for different sensors. Sensors that are often missing
together are encoded in black (dark). We see that, in particular, temperatures
(<inline-formula><mml:math display="inline"><mml:mi>T</mml:mi></mml:math></inline-formula>), relative humidity (RH), visibility, and precipitation readings are
often missing together. This means that we cannot rely on the redundancy of the
sensors such that, if, say, a temperature reading is missing at 33 <inline-formula><mml:math display="inline"><mml:mi mathvariant="normal">m</mml:mi></mml:math></inline-formula>, we
can use the reading at 50 <inline-formula><mml:math display="inline"><mml:mi mathvariant="normal">m</mml:mi></mml:math></inline-formula>. Both readings would often be missing
together.</p>
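<p>This pattern can be quantified by correlating the binary missingness indicators of the sensors. In the synthetic sketch below, two indicators share a common "outage" cause and are therefore strongly correlated, while a third is independent; all rates are hypothetical.</p>

```python
import numpy as np

rng = np.random.default_rng(5)
outage = rng.random(500) < 0.1            # common external cause
m1 = outage | (rng.random(500) < 0.05)    # e.g. temperature at one height
m2 = outage | (rng.random(500) < 0.05)    # e.g. temperature at another height
m3 = rng.random(500) < 0.1                # independently missing sensor

M = np.column_stack([m1, m2, m3]).astype(float)
C = np.corrcoef(M, rowvar=False)          # 3 x 3 missingness correlation
```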
      <p>Finally, for many sensors missing values last, on average, for several
hours at a time. Figure <xref ref-type="fig" rid="Ch1.F5"/> presents the average duration of
missing values in the case study data set for each sensor. Since values may be
missing for extended periods, simply discarding observations with missing
values is, from this perspective too, not an option: we would often be left
without model outputs for hours on end.</p>
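<p>The average durations reported here are the mean lengths of consecutive runs of missing values. A sketch of the computation for a single sensor's half-hourly missingness sequence (the sequence is illustrative):</p>

```python
import numpy as np

def mean_gap_length(missing):
    """Average length (in time steps) of consecutive runs of missing
    values in one sensor's boolean missingness sequence."""
    runs, current = [], 0
    for m in missing:
        if m:
            current += 1
        elif current:
            runs.append(current)
            current = 0
    if current:
        runs.append(current)
    return float(np.mean(runs)) if runs else 0.0

# two gaps of 3 and 1 half-hour steps: mean gap of 2 steps (1 h)
seq = [False, True, True, True, False, False, True, False]
avg = mean_gap_length(seq)
```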

      <?xmltex \floatpos{t}?><fig id="Ch1.F5"><caption><p>Average duration of missing readings.</p></caption>
          <?xmltex \igopts{width=236.157874pt}?><graphic xlink:href="https://www.atmos-meas-tech.net/7/4387/2014/amt-7-4387-2014-f04.pdf"/>

        </fig>

      <p>In summary, the amount of missing data is very large, and at this level data
with missing sensors cannot be discarded without losing valuable information.
Missing values are strongly correlated with each other; this makes it
difficult and, in many cases, impossible to make use of sensor redundancy or
impute missing data based on non-missing data. Removing the sensors with the most
missing data is also not feasible: missing values are not concentrated
in a few sensors but are distributed across all of them, and the
sensors with many missing values still carry relatively strong
information about the target at the times when their values are present. Hence,
the most appropriate solution to the problem of missing values in this
setting appears to be building models that are robust to missing data. This
approach is free from any assumptions about the missing data and allows
nowcasting even when all or nearly all the sensors are missing.</p><?xmltex \hack{\newpage}?>
</sec>
<sec id="Ch1.S3.SS2">
  <title>Prediction accuracy</title>
      <p>Next we experimentally analyze accuracies of several linear regression models
and their robustness to missing values. The first experiment demonstrates how
we can select the best model for deployment. The second experiment presents
evidence about the performance on unseen data.</p>
      <p>Table <xref ref-type="table" rid="Ch1.T3"/> presents the errors of the regression models ALL,
rALL, SEL, rSEL, PCA, rPCA, and PLS, measured on the training set using a tenfold
cross-validation and deterioration index estimated based on the training set. For
PCA, rPCA, and PLS the number of components was fixed to <inline-formula><mml:math display="inline"><mml:mrow><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mn>18</mml:mn></mml:mrow></mml:math></inline-formula>, which is
half of the original number of input sensors and explains <inline-formula><mml:math display="inline"><mml:mn>99</mml:mn></mml:math></inline-formula> % of
the variance. The cumulative percent variance method was used for selecting
<inline-formula><mml:math display="inline"><mml:mi>k</mml:mi></mml:math></inline-formula>, which is recommended as one of the most reliable methods in the
literature <xref ref-type="bibr" rid="bib1.bibx24" id="paren.29"/>. Figure <xref ref-type="fig" rid="App1.Ch1.F1"/> in the Appendix
provides complementary information about the variance explained by PCA
components. Later in this section we will present a sensitivity analysis to
different values of <inline-formula><mml:math display="inline"><mml:mi>k</mml:mi></mml:math></inline-formula>.</p>

<?xmltex \floatpos{t}?><table-wrap id="Ch1.T3"><caption><p>Tenfold cross-validation errors (RMSE) measured based on the training data set and deterioration index (<inline-formula><mml:math display="inline"><mml:mi>d</mml:mi></mml:math></inline-formula>).</p></caption><oasis:table frame="topbot"><?xmltex \begin{scaleboxenv}{.83}[.83]?><oasis:tgroup cols="9">
     <oasis:colspec colnum="1" colname="col1" align="left"/>
     <oasis:colspec colnum="2" colname="col2" align="right"/>
     <oasis:colspec colnum="3" colname="col3" align="right"/>
     <oasis:colspec colnum="4" colname="col4" align="right"/>
     <oasis:colspec colnum="5" colname="col5" align="right"/>
     <oasis:colspec colnum="6" colname="col6" align="right"/>
     <oasis:colspec colnum="7" colname="col7" align="right"/>
     <oasis:colspec colnum="8" colname="col8" align="right"/>
     <oasis:colspec colnum="9" colname="col9" align="left"/>
     <oasis:thead>
       <oasis:row rowsep="1">  
         <oasis:entry colname="col1"/>  
         <oasis:entry colname="col2">ALL</oasis:entry>  
         <oasis:entry colname="col3">rALL</oasis:entry>  
         <oasis:entry colname="col4">SEL</oasis:entry>  
         <oasis:entry colname="col5">rSEL</oasis:entry>  
         <oasis:entry colname="col6">PCA</oasis:entry>  
         <oasis:entry colname="col7">rPCA</oasis:entry>  
         <oasis:entry colname="col8">PLS</oasis:entry>  
         <oasis:entry colname="col9"/>
       </oasis:row>
     </oasis:thead>
     <oasis:tbody>
       <oasis:row>  
         <oasis:entry colname="col1">RMSE</oasis:entry>  
         <oasis:entry colname="col2">19.0</oasis:entry>  
         <oasis:entry colname="col3">21.6</oasis:entry>  
         <oasis:entry colname="col4">20.5</oasis:entry>  
         <oasis:entry colname="col5">21.9</oasis:entry>  
         <oasis:entry colname="col6">21.7</oasis:entry>  
         <oasis:entry colname="col7">21.8</oasis:entry>  
         <oasis:entry colname="col8">20.8</oasis:entry>  
         <oasis:entry colname="col9"/>
       </oasis:row>
       <oasis:row>  
         <oasis:entry colname="col1"><inline-formula><mml:math display="inline"><mml:mi>d</mml:mi></mml:math></inline-formula></oasis:entry>  
         <oasis:entry colname="col2">1 122 710</oasis:entry>  
         <oasis:entry colname="col3">537</oasis:entry>  
         <oasis:entry colname="col4">362 708</oasis:entry>  
         <oasis:entry colname="col5">451</oasis:entry>  
         <oasis:entry colname="col6"><inline-formula><mml:math display="inline"><mml:mo>-</mml:mo></mml:math></inline-formula>109</oasis:entry>  
         <oasis:entry colname="col7"><inline-formula><mml:math display="inline"><mml:mo>-</mml:mo></mml:math></inline-formula>117</oasis:entry>  
         <oasis:entry colname="col8">6121</oasis:entry>  
         <oasis:entry colname="col9"/>
       </oasis:row>
     </oasis:tbody>
   </oasis:tgroup><?xmltex \end{scaleboxenv}?></oasis:table></table-wrap>

      <p>This analysis is performed from the perspective of an analyst, making
a decision on which model to deploy. Cross-validation is used to avoid
potential overfitting of the model parameters to the training data.
Complementary information on the goodness of fit is presented in
Appendix <xref ref-type="sec" rid="App1.Ch1.S2"/>.</p>
      <p>If the analyst based the decision only on the offline analysis of
validation errors, he or she would select ALL for deployment, since it gives the
lowest error, while PCA and rPCA show nearly the highest errors. However, the
deterioration indices computed for these models suggest the opposite: rPCA
shows the best deterioration index value, while ALL shows the worst. The
analyst can now theoretically compare the robustness of two models, for
instance ALL and rPCA, using the criterion from Eq. (<xref ref-type="disp-formula" rid="Ch1.E10"/>), which gives

                <disp-formula specific-use="align"><mml:math display="block"><mml:mtable displaystyle="true"><mml:mtr><mml:mtd><mml:mrow><mml:msup><mml:mi>m</mml:mi><mml:mo>∗</mml:mo></mml:msup></mml:mrow></mml:mtd><mml:mtd><mml:mrow><mml:mo>=</mml:mo><mml:mo>(</mml:mo><mml:mi>r</mml:mi><mml:mo>-</mml:mo><mml:mn mathvariant="normal">1</mml:mn><mml:mo>)</mml:mo><mml:mfrac><mml:mrow><mml:mo>[</mml:mo><mml:msup><mml:mtext>RMSE</mml:mtext><mml:mrow><mml:mo>(</mml:mo><mml:mtext>rPCA</mml:mtext><mml:mo>)</mml:mo></mml:mrow></mml:msup><mml:msup><mml:mo>]</mml:mo><mml:mn mathvariant="normal">2</mml:mn></mml:msup><mml:mo>-</mml:mo><mml:mo>[</mml:mo><mml:msup><mml:mtext>RMSE</mml:mtext><mml:mrow><mml:mo>(</mml:mo><mml:mtext>ALL</mml:mtext><mml:mo>)</mml:mo></mml:mrow></mml:msup><mml:msup><mml:mo>]</mml:mo><mml:mn mathvariant="normal">2</mml:mn></mml:msup></mml:mrow><mml:mrow><mml:msup><mml:mi>d</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mtext>ALL</mml:mtext><mml:mo>)</mml:mo></mml:mrow></mml:msup><mml:mo>-</mml:mo><mml:msup><mml:mi>d</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mtext>rPCA</mml:mtext><mml:mo>)</mml:mo></mml:mrow></mml:msup></mml:mrow></mml:mfrac></mml:mrow></mml:mtd></mml:mtr><mml:mtr><mml:mtd/><mml:mtd><mml:mrow><mml:mo>=</mml:mo><mml:mo>(</mml:mo><mml:mn>36</mml:mn><mml:mo>-</mml:mo><mml:mn mathvariant="normal">1</mml:mn><mml:mo>)</mml:mo><mml:mfrac><mml:mrow><mml:mo>(</mml:mo><mml:mn>21.8</mml:mn><mml:msup><mml:mo>)</mml:mo><mml:mn mathvariant="normal">2</mml:mn></mml:msup><mml:mo>-</mml:mo><mml:mo>(</mml:mo><mml:mn>19.0</mml:mn><mml:msup><mml:mo>)</mml:mo><mml:mn mathvariant="normal">2</mml:mn></mml:msup></mml:mrow><mml:mrow><mml:mn>1 122 710</mml:mn><mml:mo>-</mml:mo><mml:mo>(</mml:mo><mml:mo>-</mml:mo><mml:mn>117</mml:mn><mml:mo>)</mml:mo></mml:mrow></mml:mfrac><mml:mo>≈</mml:mo><mml:mn>3.6</mml:mn><mml:mo>×</mml:mo><mml:msup><mml:mn>10</mml:mn><mml:mrow><mml:mo>-</mml:mo><mml:mn mathvariant="normal">3</mml:mn></mml:mrow></mml:msup><mml:mo>.</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>

            The result <inline-formula><mml:math display="inline"><mml:mrow><mml:msup><mml:mi>m</mml:mi><mml:mo>∗</mml:mo></mml:msup><mml:mo>≈</mml:mo><mml:mn>3.6</mml:mn><mml:mo>×</mml:mo><mml:msup><mml:mn>10</mml:mn><mml:mrow><mml:mo>-</mml:mo><mml:mn mathvariant="normal">3</mml:mn></mml:mrow></mml:msup></mml:mrow></mml:math></inline-formula> means that, if we expect at least one sensor
reading to be missing in roughly every 300 observations, it is better to deploy
rPCA than ALL. Recall that in the data about <inline-formula><mml:math display="inline"><mml:mn>2.4</mml:mn></mml:math></inline-formula> sensors are missing on
average in every observation. Hence, in this situation it is clearly worth
deploying rPCA instead of a standard linear regression, even though the
ordinary regression may be more accurate when no data are missing.</p>
      <p>Let us consider the regularized version of ordinary regression rALL and rPCA.
rALL shows better cross-validation accuracy on the training data than rPCA,
and a far less severe deterioration index than ALL. For rALL and rPCA, <inline-formula><mml:math display="inline"><mml:mrow><mml:msup><mml:mi>m</mml:mi><mml:mo>∗</mml:mo></mml:msup><mml:mo>=</mml:mo><mml:mn>0.4</mml:mn></mml:mrow></mml:math></inline-formula>, which means that it is still worth deploying rPCA.</p>
      <p>The performance of PCA and rPCA seems very similar. For PCA and
rPCA, <inline-formula><mml:math display="inline"><mml:mrow><mml:msup><mml:mi>m</mml:mi><mml:mo>∗</mml:mo></mml:msup><mml:mo>=</mml:mo><mml:mn>13.1</mml:mn></mml:mrow></mml:math></inline-formula>, which means that rPCA is expected to be more accurate
than PCA if more than <inline-formula><mml:math display="inline"><mml:mn>13</mml:mn></mml:math></inline-formula> sensors are missing; this would be quite pessimistic
for our case study data, where the mean number of missing sensors is <inline-formula><mml:math display="inline"><mml:mn>2.4</mml:mn></mml:math></inline-formula>.
Hence, the analysis suggests choosing PCA for deployment.</p>
      <p>The following analysis simulates online operation after deployment.
Regression models are trained on the training set, and then sequentially
tested on the test set. Table <xref ref-type="table" rid="Ch1.T4"/> reports the testing
results of the regression models ALL, rALL, SEL, rSEL, PCA, rPCA, and PLS
(<inline-formula><mml:math display="inline"><mml:mrow><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mn>18</mml:mn></mml:mrow></mml:math></inline-formula>).</p>
      <p>The regularized principal component regression rPCA demonstrates the
best performance on the test data (<inline-formula><mml:math display="inline"><mml:mrow><mml:mtext>RMSE</mml:mtext><mml:mo>=</mml:mo><mml:mn>19.49</mml:mn></mml:mrow></mml:math></inline-formula>), closely
followed by PCA without regularization (<inline-formula><mml:math display="inline"><mml:mrow><mml:mtext>RMSE</mml:mtext><mml:mo>=</mml:mo><mml:mn>19.52</mml:mn></mml:mrow></mml:math></inline-formula>).
The other regularized
approaches, rSEL and rALL, perform notably worse (<inline-formula><mml:math display="inline"><mml:mrow><mml:mtext>RMSE</mml:mtext><mml:mo>=</mml:mo><mml:mn>20.43</mml:mn></mml:mrow></mml:math></inline-formula> and <inline-formula><mml:math display="inline"><mml:mn>20.28</mml:mn></mml:math></inline-formula>) but they
still outperform the naive baseline NAI (<inline-formula><mml:math display="inline"><mml:mrow><mml:mtext>RMSE</mml:mtext><mml:mo>=</mml:mo><mml:mn>22.88</mml:mn></mml:mrow></mml:math></inline-formula>).  The unregularized
approaches PLS, SEL, and ALL perform much worse than the baseline and
illustrate well the dangers presented by massively missing values.</p>

<?xmltex \floatpos{t}?><table-wrap id="Ch1.T4" specific-use="star"><caption><p>Nowcasting errors (RMSE) on the testing data set.</p></caption><oasis:table frame="topbot"><oasis:tgroup cols="9">
     <oasis:colspec colnum="1" colname="col1" align="left"/>
     <oasis:colspec colnum="2" colname="col2" align="right"/>
     <oasis:colspec colnum="3" colname="col3" align="right"/>
     <oasis:colspec colnum="4" colname="col4" align="right"/>
     <oasis:colspec colnum="5" colname="col5" align="right"/>
     <oasis:colspec colnum="6" colname="col6" align="right"/>
     <oasis:colspec colnum="7" colname="col7" align="right"/>
     <oasis:colspec colnum="8" colname="col8" align="right"/>
     <oasis:colspec colnum="9" colname="col9" align="right"/>
     <oasis:thead>
       <oasis:row rowsep="1">  
         <oasis:entry colname="col1"/>  
         <oasis:entry colname="col2">ALL</oasis:entry>  
         <oasis:entry colname="col3">rALL</oasis:entry>  
         <oasis:entry colname="col4">SEL</oasis:entry>  
         <oasis:entry colname="col5">rSEL</oasis:entry>  
         <oasis:entry colname="col6">PCA</oasis:entry>  
         <oasis:entry colname="col7">rPCA</oasis:entry>  
         <oasis:entry colname="col8">PLS</oasis:entry>  
         <oasis:entry colname="col9">NAI</oasis:entry>
       </oasis:row>
     </oasis:thead>
     <oasis:tbody>
       <oasis:row rowsep="1">  
         <oasis:entry colname="col1">Full set</oasis:entry>  
         <oasis:entry colname="col2">175.8</oasis:entry>  
         <oasis:entry colname="col3">20.4</oasis:entry>  
         <oasis:entry colname="col4">127.8</oasis:entry>  
         <oasis:entry colname="col5">20.3</oasis:entry>  
         <oasis:entry colname="col6">19.5</oasis:entry>  
         <oasis:entry colname="col7">19.5</oasis:entry>  
         <oasis:entry colname="col8">25.7</oasis:entry>  
         <oasis:entry colname="col9">22.9</oasis:entry>
       </oasis:row>
       <oasis:row>  
         <oasis:entry colname="col1">Non-missing</oasis:entry>  
         <oasis:entry colname="col2">17.9</oasis:entry>  
         <oasis:entry colname="col3">19.2</oasis:entry>  
         <oasis:entry colname="col4">18.8</oasis:entry>  
         <oasis:entry colname="col5">19.7</oasis:entry>  
         <oasis:entry colname="col6">19.6</oasis:entry>  
         <oasis:entry colname="col7">19.6</oasis:entry>  
         <oasis:entry colname="col8">19.3</oasis:entry>  
         <oasis:entry colname="col9">22.9</oasis:entry>
       </oasis:row>
       <oasis:row>  
         <oasis:entry colname="col1">Missing</oasis:entry>  
         <oasis:entry colname="col2">233.4</oasis:entry>  
         <oasis:entry colname="col3">21.3</oasis:entry>  
         <oasis:entry colname="col4">169.2</oasis:entry>  
         <oasis:entry colname="col5">20.7</oasis:entry>  
         <oasis:entry colname="col6">19.4</oasis:entry>  
         <oasis:entry colname="col7">19.4</oasis:entry>  
         <oasis:entry colname="col8">29.7</oasis:entry>  
         <oasis:entry colname="col9">22.9</oasis:entry>
       </oasis:row>
     </oasis:tbody>
   </oasis:tgroup>

</oasis:table><?xmltex \hack{\vspace*{4mm}}?></table-wrap>

      <?xmltex \floatpos{t}?><fig id="Ch1.F6"><caption><p>Analysis of residuals.</p></caption>
          <?xmltex \igopts{width=184.942913pt}?><graphic xlink:href="https://www.atmos-meas-tech.net/7/4387/2014/amt-7-4387-2014-f05.pdf"/>

          <?xmltex \hack{\vspace*{-3mm}}?>
        </fig>

      <p>It is interesting to note that the analyzed strategy combining linear
regression with mean replacement <xref ref-type="bibr" rid="bib1.bibx27" id="paren.30"/> theoretically
approaches NAI performance as more values go missing. If all the input
values are missing, then the predictor turns into NAI automatically.</p>
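This degeneration is easy to verify numerically. The sketch below (synthetic data, not the SMEAR set) fits ordinary least squares on mean-centred inputs; replacing every input by its training mean zeroes the centred features, so the prediction collapses to the intercept, which for centred inputs equals the target mean, i.e. the NAI output.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))                       # synthetic sensor readings
y = 50.0 + 2.0 * X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=500)

# Ordinary least squares with an intercept on mean-centred inputs.
mu = X.mean(axis=0)
design = np.c_[np.ones(len(y)), X - mu]
w, *_ = np.linalg.lstsq(design, y, rcond=None)
intercept, coefs = w[0], w[1:]

# An observation with ALL readings missing, each replaced by its mean:
pred = intercept + coefs @ (mu - mu)                # centred inputs are zero
assert np.isclose(pred, intercept)
assert np.isclose(intercept, y.mean())              # i.e. the NAI prediction
```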
      <p>To analyze the performance further, we divide the test data into non-missing
(<inline-formula><mml:math display="inline"><mml:mn>44</mml:mn></mml:math></inline-formula> %) and missing observations (<inline-formula><mml:math display="inline"><mml:mn>56</mml:mn></mml:math></inline-formula> %) and inspect the errors on
these subsets separately. We see that the performance of all the models is
similar when there is no missing data. The ordinary regression ALL has an
advantage in accuracy, since it does not discard any information from the
input data. However, the non-regularized models (ALL, SEL, and PLS) fail badly
when there is missing data, while the regularized rSEL and rALL lose some
accuracy but still remain competitive. Both non-regularized and regularized
PCA remain nearly unaffected by missing data.</p>
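The subset evaluation amounts to masking the test set by missingness and computing RMSE on each part separately. A minimal sketch with synthetic stand-in data (only the 56&#x2009;% missingness rate is taken from our test set; everything else is illustrative):

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root-mean-square error between targets and predictions."""
    return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))

rng = np.random.default_rng(1)
y_true = rng.uniform(0.0, 100.0, size=1000)
y_pred = y_true + rng.normal(scale=5.0, size=1000)
has_missing = rng.random(1000) < 0.56     # observations with >= 1 missing sensor

errors = {
    "full set":    rmse(y_true, y_pred),
    "non-missing": rmse(y_true[~has_missing], y_pred[~has_missing]),
    "missing":     rmse(y_true[has_missing], y_pred[has_missing]),
}
print(errors)
```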
      <p>Figure <xref ref-type="fig" rid="Ch1.F6"/> plots the distribution of absolute residuals for
each approach. We can see that most of the errors (residuals) are
concentrated around <inline-formula><mml:math display="inline"><mml:mn>10</mml:mn></mml:math></inline-formula>, which is a reasonably good result keeping in mind
that the range of the target variable is from 0 to 100. It means that most of
the predictions do not deviate too much from the true values. We can also see
that NAI has less probability mass on the left-hand side, where the most
accurate predictions are. As expected, intelligent predictors do better than
NAI. Only the unregularized approaches ALL and SEL have any probability mass
on the far right, which means that they occasionally produce predictions that
may exceed the maximum of the true target. We can conclude from this
investigation that predictions by most of the approaches are reasonably
stable, and outliers in predictions do not pose any major threats.</p>
      <p>Next, we analyze the sensitivity of the predictive performance to different
parameter settings. So far we used a fixed number of components (<inline-formula><mml:math display="inline"><mml:mrow><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mn>18</mml:mn></mml:mrow></mml:math></inline-formula>) for
PCA, rPCA, and PLS and the same number of selected features for SEL and rSEL.
Figure <xref ref-type="fig" rid="Ch1.F7"/> shows the testing errors (RMSE) as a function of
<inline-formula><mml:math display="inline"><mml:mi>k</mml:mi></mml:math></inline-formula>.</p>
      <p>An important observation can be made from this plot. The regularized
approaches rSEL and rPCA perform reasonably well at all variants of the
parameter <inline-formula><mml:math display="inline"><mml:mi>k</mml:mi></mml:math></inline-formula>, while the <?xmltex \hack{\mbox\bgroup}?>non-regularized<?xmltex \hack{\egroup}?> models SEL, PCA, and PLS
perform poorly when a large number of components is retained. In that case
the resulting models remain close to ALL, which uses all the available
information. ALL, rALL, and NAI do not depend on the parameter <inline-formula><mml:math display="inline"><mml:mi>k</mml:mi></mml:math></inline-formula> but are
also included for comparison. We also observe that PLS becomes very effective
at low <inline-formula><mml:math display="inline"><mml:mi>k</mml:mi></mml:math></inline-formula>, but there is a risk of setting <inline-formula><mml:math display="inline"><mml:mi>k</mml:mi></mml:math></inline-formula> incorrectly (e.g., around 25),
in which case PLS gives the worst results. Therefore, we instead recommend
using rPCA, which gives stable and accurate results even if <inline-formula><mml:math display="inline"><mml:mi>k</mml:mi></mml:math></inline-formula> is
suboptimal.</p>
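The principal-component regression behind this sweep can be sketched as follows (a minimal NumPy illustration on synthetic data; <code>ridge</code> &gt; 0 gives an rPCA-style regularized variant — this shows the technique, not the paper's exact code):

```python
import numpy as np

def fit_pcr(X, y, k, ridge=0.0):
    """Principal-component regression with k components;
    ridge > 0 adds an rPCA-style penalty on the score weights."""
    mu, sigma = X.mean(axis=0), X.std(axis=0)
    Z = (X - mu) / sigma                      # standardized inputs
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    P = Vt[:k].T                              # top-k principal directions
    S = Z @ P                                 # component scores (zero mean)
    w = np.linalg.solve(S.T @ S + ridge * np.eye(k), S.T @ (y - y.mean()))
    return mu, sigma, P, w, y.mean()

def predict_pcr(model, X):
    mu, sigma, P, w, b = model
    return ((X - mu) / sigma) @ P @ w + b

rng = np.random.default_rng(2)
X = rng.normal(size=(400, 20))
y = 50.0 + 10.0 * X[:, :3].sum(axis=1) + rng.normal(size=400)

for k in (2, 5, 18):                          # sweep the number of components
    model = fit_pcr(X, y, k, ridge=1.0)
    err = np.sqrt(np.mean((predict_pcr(model, X) - y) ** 2))
    print(f"k={k:2d}  train RMSE={err:.2f}")
```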

      <?xmltex \floatpos{t}?><fig id="Ch1.F7"><caption><p>Nowcasting error as a function of components retained (top
plot – all models in log scale; bottom plot – best models zoomed
in).</p></caption>
          <?xmltex \igopts{width=184.942913pt}?><graphic xlink:href="https://www.atmos-meas-tech.net/7/4387/2014/amt-7-4387-2014-f06.pdf"/>

        </fig>

      <p>Finally, we visually analyze model outputs produced by the baseline approach
ALL and a robust approach rPCA (<inline-formula><mml:math display="inline"><mml:mrow><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mn>18</mml:mn></mml:mrow></mml:math></inline-formula>). Figure <xref ref-type="fig" rid="Ch1.F8"/> plots
four 3-day snapshots from the year 2012: 1–3 January, 1–3 April, 1–3 July, and
1–3 October. It is important to emphasize that, here, the plot shows raw
outputs of the models in order to better illustrate the effects of
regularization, whereas, when calculating numerical errors, we postprocess
all the model outputs to fall into the same interval as the original target
(<inline-formula><mml:math display="inline"><mml:mrow><mml:mo>[</mml:mo><mml:mn mathvariant="normal">0</mml:mn><mml:mo>,</mml:mo><mml:mn>100</mml:mn><mml:mo>]</mml:mo></mml:mrow></mml:math></inline-formula>, where <inline-formula><mml:math display="inline"><mml:mn mathvariant="normal">0</mml:mn></mml:math></inline-formula> means no irradiance is observed, and <inline-formula><mml:math display="inline"><mml:mn>100</mml:mn></mml:math></inline-formula> (%) means
all the theoretically possible irradiance is observed). That is, if the
prediction is less than <inline-formula><mml:math display="inline"><mml:mn mathvariant="normal">0</mml:mn></mml:math></inline-formula>, we correct it to <inline-formula><mml:math display="inline"><mml:mn mathvariant="normal">0</mml:mn></mml:math></inline-formula>, and if the prediction is
larger than <inline-formula><mml:math display="inline"><mml:mn>100</mml:mn></mml:math></inline-formula>, we correct it to <inline-formula><mml:math display="inline"><mml:mn>100</mml:mn></mml:math></inline-formula>. This postprocessing makes the
baseline models more competitive (and hence is a more prudent way of
quantitative evaluation). We see from the figure that the baseline ALL
sometimes fails very badly (particularly in the January and April plots), while
the outputs of the regularized approach rPCA remain stable. In July there are
only a couple of situations in which ALL performs very poorly (we see a green
inclination on day 2 and a green peak on day 3). In October both
approaches perform similarly. Unlike ALL, rPCA performs in a stable manner and does not
exhibit extreme failures.</p>
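The postprocessing step described above is a simple clip of the raw outputs to the target range (a one-line sketch; the values below are hypothetical raw outputs, not taken from the figure):

```python
import numpy as np

raw = np.array([-7.3, 12.0, 55.4, 101.9, 240.0])   # hypothetical raw model outputs
clipped = np.clip(raw, 0.0, 100.0)                 # force into [0, 100]
print(clipped)                                     # [  0.   12.   55.4 100.  100. ]
```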

      <?xmltex \floatpos{t}?><fig id="Ch1.F8"><caption><p>Visualization of nowcasting (each plot shows 3 days).</p></caption>
          <?xmltex \igopts{width=236.157874pt}?><graphic xlink:href="https://www.atmos-meas-tech.net/7/4387/2014/amt-7-4387-2014-f07.pdf"/>

        </fig>

</sec>
</sec>
<sec id="Ch1.S4" sec-type="conclusions">
  <title>Summary and conclusions</title>
      <p>In environmental monitoring, continuous and comprehensive measurement of the
environment leads to a data streaming setting. Nowcasting in such settings is
a demanding task. We performed a case study in modeling solar radiation based
on a SMEAR <?xmltex \hack{\mbox\bgroup}?>measurement<?xmltex \hack{\egroup}?> data set, where model outputs are expected to be
available continuously in spite of often missing sensor readings. We also
experimentally analyzed missing-data patterns in our data set.</p>
      <p>We aimed at nowcasting the amount of global radiation, relative to the
theoretical maximum, with the help of measured meteorological variables.
Because outputs must be provided instantaneously in the data-streaming setting,
and because computing power is limited, especially when operating on autonomous
power sources, we dismiss sophisticated data-imputation methods,
which are computationally more demanding. We experimentally analyzed the accuracy and the robustness to missing data of seven linear-regression models
and recommend using the regularized PCA regression. The results apply to
linear-regression models coupled with the replacement of missing values by a constant
(mean).</p>
      <p>The strategy that we consider does not require any sophisticated missing-value imputation but just the replacement of the values with predefined constants.
Linear regression is also very light computationally; it only requires <inline-formula><mml:math display="inline"><mml:mi>r</mml:mi></mml:math></inline-formula>
multiplications, where <inline-formula><mml:math display="inline"><mml:mi>r</mml:mi></mml:math></inline-formula> is the number of input variables, and one
summation. Once the model is trained, it can be stored and operated with
minimal energy consumption. A computationally heavy imputation procedure, such as the expectation
maximization algorithm, would in contrast require computing power several orders of magnitude greater and would become the dominant computing operation.</p>
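A sketch of the deployed predictor under this strategy (illustrative names and values): replacing missing readings by stored training means and evaluating the dot product costs just <inline-formula><mml:math display="inline"><mml:mi>r</mml:mi></mml:math></inline-formula> multiplications and one summation.

```python
import numpy as np

def nowcast(x, means, weights, bias):
    """Replace missing readings (NaN) by stored training means,
    then evaluate the linear model: r multiplications, one summation."""
    x = np.where(np.isnan(x), means, x)
    return float(weights @ x + bias)

means   = np.array([3.0, 10.0, 0.5])               # stored replacement constants
weights = np.array([2.0, -1.0, 4.0])
x = np.array([4.0, np.nan, 1.0])                   # the second sensor is missing
print(nowcast(x, means, weights, bias=1.0))        # 2*4 - 1*10 + 4*1 + 1 = 3.0
```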
      <p><?xmltex \hack{\newpage}?>A linear regression, supplied with the right input, is a powerful model,
particularly considering that, if desired, one could apply nonlinear
transformations to the input features, which would then make the resulting
predictions nonlinear with respect to the inputs. More importantly, linear
models are theoretically well understood and can provide guarantees with
respect to performance when there is a lot of missing data. We would argue
that in such situations robustness of the model may be more important than
flexibility. A flexible model may on average be more accurate, but the
outputs may be extremely wrong at times. On the other hand, a robust model
may not be the most accurate on average, but its performance would at all times be stable
and the errors not too large. We chose linear models since they
offer theoretical guarantees of robustness. Hence, we recommend following the
guidelines established here for training regression models that are
robust to missing data.</p>
      <p><?xmltex \hack{\newpage}?>Considering variable uncertainties of sensor measurements over time would
make an interesting extension of the  current work
if we had some way of quantifying how uncertainties vary. The strength of
uncertainty could be measured from 0 to 1, where 0 would mean perfect
certainty, 1 would mean a missing value, and everything in between would mean
a noisy measurement. In such a case, a missing value could be considered as a
special case of uncertainty.</p><?xmltex \hack{\clearpage}?>
</sec>

      
      </body>
    <back><app-group><app id="App1.Ch1.S1">
  <title>Parameter selection</title>
      <p>Figure <xref ref-type="fig" rid="App1.Ch1.F1"/> presents information on the variance explained by
PCA components.</p>

      <?xmltex \floatpos{h!}?><fig id="App1.Ch1.F1" position="anchor"><caption><p>Cumulative variance explained by PCA components on the training data
without missing values.</p></caption>
        <?xmltex \igopts{width=236.157874pt}?><graphic xlink:href="https://www.atmos-meas-tech.net/7/4387/2014/amt-7-4387-2014-f08.pdf"/>

      </fig>

<?xmltex \hack{\newpage}?>
</app>

<app id="App1.Ch1.S2">
  <title>Goodness of fit</title>
      <p>Table <xref ref-type="table" rid="App1.Ch1.T1"/> presents fitness statistics of the regression models to
the training data. The coefficient of determination, <inline-formula><mml:math display="inline"><mml:mrow><mml:msup><mml:mi>R</mml:mi><mml:mn mathvariant="normal">2</mml:mn></mml:msup></mml:mrow></mml:math></inline-formula>, indicates the
amount of total variability explained by the regression model. The
coefficient is computed as

              <disp-formula id="App1.Ch1.Ex1"><mml:math display="block"><mml:mrow><mml:msup><mml:mi>R</mml:mi><mml:mn mathvariant="normal">2</mml:mn></mml:msup><mml:mo>=</mml:mo><mml:mn mathvariant="normal">1</mml:mn><mml:mo>-</mml:mo><mml:mfrac><mml:mrow><mml:munderover><mml:mo movablelimits="false">∑</mml:mo><mml:mrow><mml:mi>l</mml:mi><mml:mo>=</mml:mo><mml:mn mathvariant="normal">1</mml:mn></mml:mrow><mml:mi>n</mml:mi></mml:munderover><mml:mo>(</mml:mo><mml:msup><mml:mover accent="true"><mml:mi>y</mml:mi><mml:mo mathvariant="normal" stretchy="false">^</mml:mo></mml:mover><mml:mrow><mml:mo>(</mml:mo><mml:mi>l</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:msup><mml:mo>-</mml:mo><mml:msup><mml:mi>y</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mi>l</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:msup><mml:msup><mml:mo>)</mml:mo><mml:mn mathvariant="normal">2</mml:mn></mml:msup></mml:mrow><mml:mrow><mml:munderover><mml:mo movablelimits="false">∑</mml:mo><mml:mrow><mml:mi>l</mml:mi><mml:mo>=</mml:mo><mml:mn mathvariant="normal">1</mml:mn></mml:mrow><mml:mi>n</mml:mi></mml:munderover><mml:mo>(</mml:mo><mml:msup><mml:mi>y</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mi>l</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:msup><mml:mo>-</mml:mo><mml:mover accent="true"><mml:mi>y</mml:mi><mml:mo mathvariant="normal">‾</mml:mo></mml:mover><mml:msup><mml:mo>)</mml:mo><mml:mn mathvariant="normal">2</mml:mn></mml:msup></mml:mrow></mml:mfrac><mml:mo>,</mml:mo></mml:mrow></mml:math></disp-formula>

        where <inline-formula><mml:math display="inline"><mml:mrow><mml:msup><mml:mi>y</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mi>l</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:msup></mml:mrow></mml:math></inline-formula> is the true target value of the <inline-formula><mml:math display="inline"><mml:mi>l</mml:mi></mml:math></inline-formula>th sample and
<inline-formula><mml:math display="inline"><mml:mrow><mml:msup><mml:mover accent="true"><mml:mi>y</mml:mi><mml:mo mathvariant="normal" stretchy="false">^</mml:mo></mml:mover><mml:mrow><mml:mo>(</mml:mo><mml:mi>l</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:msup></mml:mrow></mml:math></inline-formula> is the corresponding model output, <inline-formula><mml:math display="inline"><mml:mover accent="true"><mml:mi>y</mml:mi><mml:mo mathvariant="normal">‾</mml:mo></mml:mover></mml:math></inline-formula> is the mean
of the true target values, and <inline-formula><mml:math display="inline"><mml:mi>n</mml:mi></mml:math></inline-formula> is the number of samples in the train set.
We see that the best-fit model is ALL. Recalling the experimental analysis in
Sect. <xref ref-type="sec" rid="Ch1.S3"/>, we can see that good fitness to the training data
does not guarantee good generalization performance when a lot of missing
values start to appear.</p>
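The coefficient above translates directly into code (a minimal NumPy sketch):

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: R^2 = 1 - SS_res / SS_tot."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    ss_res = np.sum((y_pred - y_true) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

y = np.array([1.0, 2.0, 3.0, 4.0])
print(r_squared(y, y))                       # 1.0  (perfect fit)
print(r_squared(y, np.full(4, y.mean())))    # 0.0  (mean predictor)
```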

<?xmltex \floatpos{h!}?><table-wrap id="App1.Ch1.T1" position="anchor"><caption><p>Fitness statistics of the models on the training data.</p></caption><oasis:table frame="topbot"><?xmltex \begin{scaleboxenv}{.92}[.92]?><oasis:tgroup cols="8">
     <oasis:colspec colnum="1" colname="col1" align="left"/>
     <oasis:colspec colnum="2" colname="col2" align="right"/>
     <oasis:colspec colnum="3" colname="col3" align="right"/>
     <oasis:colspec colnum="4" colname="col4" align="right"/>
     <oasis:colspec colnum="5" colname="col5" align="right"/>
     <oasis:colspec colnum="6" colname="col6" align="right"/>
     <oasis:colspec colnum="7" colname="col7" align="right"/>
     <oasis:colspec colnum="8" colname="col8" align="right"/>
     <oasis:thead>
       <oasis:row rowsep="1">  
         <oasis:entry colname="col1"/>  
         <oasis:entry colname="col2">ALL</oasis:entry>  
         <oasis:entry colname="col3">rALL</oasis:entry>  
         <oasis:entry colname="col4">SEL</oasis:entry>  
         <oasis:entry colname="col5">rSEL</oasis:entry>  
         <oasis:entry colname="col6">PCA</oasis:entry>  
         <oasis:entry colname="col7">rPCA</oasis:entry>  
         <oasis:entry colname="col8">PLS</oasis:entry>
       </oasis:row>
     </oasis:thead>
     <oasis:tbody>
       <oasis:row>  
         <oasis:entry colname="col1">RMSE</oasis:entry>  
         <oasis:entry colname="col2">19.2</oasis:entry>  
         <oasis:entry colname="col3">21.2</oasis:entry>  
         <oasis:entry colname="col4">20.5</oasis:entry>  
         <oasis:entry colname="col5">21.6</oasis:entry>  
         <oasis:entry colname="col6">21.8</oasis:entry>  
         <oasis:entry colname="col7">21.8</oasis:entry>  
         <oasis:entry colname="col8">20.9</oasis:entry>
       </oasis:row>
       <oasis:row>  
         <oasis:entry colname="col1"><inline-formula><mml:math display="inline"><mml:mrow><mml:msup><mml:mi>R</mml:mi><mml:mn mathvariant="normal">2</mml:mn></mml:msup></mml:mrow></mml:math></inline-formula></oasis:entry>  
         <oasis:entry colname="col2">0.501</oasis:entry>  
         <oasis:entry colname="col3">0.393</oasis:entry>  
         <oasis:entry colname="col4">0.436</oasis:entry>  
         <oasis:entry colname="col5">0.373</oasis:entry>  
         <oasis:entry colname="col6">0.361</oasis:entry>  
         <oasis:entry colname="col7">0.361</oasis:entry>  
         <oasis:entry colname="col8">0.410</oasis:entry>
       </oasis:row>
     </oasis:tbody>
   </oasis:tgroup><?xmltex \end{scaleboxenv}?></oasis:table></table-wrap>

<?xmltex \hack{\clearpage}?>
</app>
  </app-group><ack><title>Acknowledgements</title><p>This work has been supported by the Academy of Finland grant 118653
(ALGODAN) and grant 258568 (MultiTree).<?xmltex \hack{\newline}?><?xmltex \hack{\newline}?>
Edited by: M. Weber</p></ack><ref-list>
    <title>References</title>

      <ref id="bib1.bibx1"><label>Aggarwal(2007)</label><mixed-citation>
Aggarwal, Ch. (Ed.): Data Streams – Models and Algorithms, Springer, 2007.</mixed-citation></ref>
      <ref id="bib1.bibx2"><label>Allison(2001)</label><mixed-citation>
Allison, P.: Missing Data, Sage Publications, 2001.</mixed-citation></ref>
      <ref id="bib1.bibx3"><label>Babcock et al.(2002)Babcock, Babu, Datar, Motwani, and Widom</label><mixed-citation>
Babcock, B., Babu, S., Datar, M., Motwani, R., and Widom, J.: Models and
Issues in Data Stream Systems, in: Proc. of the 21st ACM SIGMOD-SIGACT-SIGART
Symposium on Principles of Database Systems, PODS, 1–16, 2002.</mixed-citation></ref>
      <ref id="bib1.bibx4"><label>Bacher et al.(2009)Bacher, Madsen, and Nielsen</label><mixed-citation>
Bacher, P., Madsen, H., and Nielsen, H. A.: Online short-term solar power
forecasting, Sol. Energy, 83, 1772–1783, 2009.</mixed-citation></ref>
      <ref id="bib1.bibx5"><label>Bhardwaj et al.(2013)Bhardwaj, Sharma, Srivastava, Sastry,
Bandyopadhyay, Chandel, and Gupta</label><mixed-citation>
Bhardwaj, S., Sharma, V., Srivastava, S., Sastry, O., Bandyopadhyay, B.,
Chandel, S., and Gupta, J.: Estimation of solar radiation using a combination
of Hidden Markov model and generalized Fuzzy model, Sol. Energy, 93, 43–54,
2013.</mixed-citation></ref>
      <ref id="bib1.bibx6"><label>Black et al.(2007)Black, Broadstock, Colin, and
Hunt</label><mixed-citation>
Black, C., Broadstock, D., Colin, A., and Hunt, L. C.: Filling in the gaps in
transport studies: a practical guide to developments in data imputation
methods, Traffic Eng. Control, 48, 358–363, 2007.</mixed-citation></ref>
      <ref id="bib1.bibx7"><label>Enders(2010)</label><mixed-citation> Enders, C. K.: Applied Missing Data
Analysis, Guilford Press, 2010.</mixed-citation></ref>
      <ref id="bib1.bibx8"><label>Hammer et al.(1999)Hammer, Heinemann, Lorenz, and
Lückehe</label><mixed-citation> Hammer, A., Heinemann, D., Lorenz, E., and
Lückehe, B.: Short-term forecasting of solar radiation: a
statistical approach using satellite data, Sol. Energy, 67,
139–150, 1999.</mixed-citation></ref>
      <ref id="bib1.bibx9"><label>Hari and Kulmala(2005)</label><mixed-citation>
Hari, P. and Kulmala, M.: Station for Measuring Ecosystem-Atmosphere
Relations (SMEAR II), Boreal Environ. Res., 10, 315–322,
2005.</mixed-citation></ref>
      <ref id="bib1.bibx10"><label>Hastie et al.(2001)Hastie, Tibshirani, and
Friedman</label><mixed-citation>
Hastie, T., Tibshirani, R., and Friedman, J.: The Elements of Statistical
Learning: Data Mining, Inference, and Prediction, Springer-Verlag, 2001.</mixed-citation></ref>
      <ref id="bib1.bibx11"><label>Hoerl and Kennard(1970)</label><mixed-citation> Hoerl, A. E. and
Kennard, R. W.: Ridge regression: biased estimation for
nonorthogonal problems, Technometrics, 12, 55–67, 1970.</mixed-citation></ref>
      <ref id="bib1.bibx12"><label>Hrust et al.(2009)Hrust, Klaic, Krizana, Antonic, and
Hercog</label><mixed-citation>
Hrust, L., Klaic, Z. B., Krizana, J.,
Antonic, O., and Hercog, P.: Neural network forecasting of air
pollutants hourly concentrations using optimised temporal averages
of meteorological variables and pollutant concentrations,
Atmos. Environ., 43, 5588–5596, 2009.</mixed-citation></ref>
      <ref id="bib1.bibx13"><label>Jolliffe(2002)</label><mixed-citation> Jolliffe, I. T.: Principal
Component Analysis, 2nd Edn., Springer, 2002.</mixed-citation></ref>
      <ref id="bib1.bibx14"><label>Junninen et al.(2004)Junninen, Niska, Tuppurainen,
Ruuskanen, and Kolehmainen</label><mixed-citation>Junninen, H., Niska, H.,
Tuppurainen, K., Ruuskanen, J., and Kolehmainen, M.: Methods for
imputation of missing values in air quality data sets,
Atmos. Environ., 38, 2895–2907, 2004.
 </mixed-citation></ref><?xmltex \hack{\newpage}?>
      <ref id="bib1.bibx15"><label>Junninen et al.(2009)Junninen, Lauri, Keronen, Aalto,
Hiltunen, Hari, and Kulmala</label><mixed-citation>
Junninen, H., Lauri, A., Keronen, P., Aalto, P., Hiltunen, V., Hari, P., and
Kulmala, M.: Smart-SMEAR: on-line data exploration and visualization tool
for SMEAR stations, Boreal Environ. Res., 14, 447–457, 2009.</mixed-citation></ref>
      <ref id="bib1.bibx16"><label>Kadlec et al.(2009)Kadlec, Gabrys, and Strandt</label><mixed-citation>
Kadlec, P., Gabrys, B., and Strandt, S.: Data-driven soft sensors in
the process industry, Comput. Chem. Eng., 33, 795–814,
2009.</mixed-citation></ref>
      <ref id="bib1.bibx17"><label>Kopp and Lean(2011)</label><mixed-citation>Kopp, G. and Lean, J. L.:
A new, lower value of total solar irradiance: Evidence and climate
significance, Geophys. Res.  Lett., 38, L01706,
<ext-link xlink:href="http://dx.doi.org/10.1029/2010GL045777" ext-link-type="DOI">10.1029/2010GL045777</ext-link>, 2011.</mixed-citation></ref>
      <ref id="bib1.bibx18"><label>Lerner et al.(2002)Lerner, Moses, Maricia, McIlraith, and Koller</label><mixed-citation>
Lerner, U., Moses, B., Maricia, S., McIlraith, Sh. A., and Koller, D.:
Monitoring a Complex Physical System using a Hybrid Dynamic Bayes Net, in:
Proc. of the 18th Conference on Uncertainty in Artificial Intelligence,
UAI, 301–310, 2002.</mixed-citation></ref>
      <ref id="bib1.bibx19"><label>Lu et al.(2006)Lu, Hsieh, and Chang</label><mixed-citation> Lu, H.,
Hsieh, J., and Chang, T.: Prediction of daily maximum ozone
concentrations from meteorological conditions using a two-stage
neural network, Atmos. Res., 81, 124–139, 2006.</mixed-citation></ref>
      <ref id="bib1.bibx20"><label>Marquez and Coimbra(2011)</label><mixed-citation> Marquez, R. and
Coimbra, C. F.: Forecasting of global and direct solar irradiance
using stochastic learning methods, ground experiments and the NWS
database, Sol. Energy, 85, 746–756, 2011.</mixed-citation></ref>
      <ref id="bib1.bibx21"><label>Menut and Bessagnet(2010)</label><mixed-citation>Menut, L. and
Bessagnet, B.: Atmospheric composition forecasting in Europe,
Ann. Geophys., 28, 61–74, <ext-link xlink:href="http://dx.doi.org/10.5194/angeo-28-61-2010" ext-link-type="DOI">10.5194/angeo-28-61-2010</ext-link>,
2010.</mixed-citation></ref>
      <ref id="bib1.bibx22"><label>Michalsky(1988)</label><mixed-citation> Michalsky, J.: The
Astronomical Almanac's algorithm for approximate solar position
(1950–2050), Sol. Energy, 40, 227–235, 1988.</mixed-citation></ref>
      <ref id="bib1.bibx23"><label>Ramoni and Sebastiani(2001)</label><mixed-citation> Ramoni, M. and
Sebastiani, P.: Robust Learning with Missing Data, Machine Learning, 45, 147–170, 2001.</mixed-citation></ref>
      <ref id="bib1.bibx24"><label>Valle et al.(1999)Valle, Li, and Qin</label><mixed-citation> Valle, S.,
Li, W., and Qin, S. J.: Selection of the number of principal
components: the variance of the reconstruction error criterion with
a comparison to other methods, Ind. Eng. Chem. Res., 38,
4389–4401, 1999.</mixed-citation></ref>
      <ref id="bib1.bibx25"><label>Vuilleumier et al.(2011)Vuilleumier, Calpini, Cattin,
Roulet, Stauch, Stöckli, and Giunta</label><mixed-citation>Vuilleumier, L., Calpini, B., Cattin, R., Roulet, Y.-A., Stauch, V.,
Stöckli, R., and Giunta, I.: Solar radiation now-casting: the
need for multiple data source integration, in: Proc. of the COST
Action ES1002 “WIRE” State of the Art Workshop, available
at:
<uri>http://www.wire1002.ch/fileadmin/user_upload/Major_events/WS_Nice_2011/Spec._presentations/Vuillemier.pdf</uri>
(last access: 14 July 2014),
2011.</mixed-citation></ref>
      <ref id="bib1.bibx26"><label>Wold et al.(2001)Wold, Sjostroma, and Eriksson</label><mixed-citation>
Wold, S., Sjostroma, M., and Eriksson, L.: PLS-regression: a basic
tool of chemometrics, Chemometr. Intell.  Lab., 58, 109–130,
2001.</mixed-citation></ref>
      <ref id="bib1.bibx27"><label>Žliobaitė and Hollmén(2013)</label><mixed-citation>
Žliobaitė, I. and Hollmén, J.: Fault tolerant
regression for sensor data, in: Proc. of the European Conference on
Machine Learning and Principles and Practice of Knowledge Discovery
in Databases, ECMLPKDD, 449–464, 2013.</mixed-citation></ref>
      <ref id="bib1.bibx28"><label>Žliobaitė and Hollmén(2014)</label><mixed-citation>Žliobaitė, I. and Hollmén, J.: Optimizing regression
models for data streams with missing values, Mach. Learn., 1–27,
<ext-link xlink:href="http://dx.doi.org/10.1007/s10994-014-5450-3" ext-link-type="DOI">10.1007/s10994-014-5450-3</ext-link>,
2014.</mixed-citation></ref>

  </ref-list><app-group content-type="float"><app><title/>

    </app></app-group></back>
    </article>
