Finding the number and best locations of fixed air quality monitoring stations at street level is challenging because of the complexity of the urban environment and the large number of factors affecting pollutant concentrations. Data sets of urban parameters such as land use, building morphology and street geometry in high-resolution grid cells, in combination with direct measurements of airborne pollutants at high frequency (1–10 s) along a reasonable number of streets, can be used to interpolate pollutant concentrations over a whole gridded domain and to determine the optimum number of monitoring sites and the best locations for a network of fixed monitors at ground level. In this context, a data-driven modeling methodology is developed based on the application of a Self-Organizing Map (SOM) to approximate the nonlinear relations between urban parameters (80 in this work) and aerosol pollution data, such as mass and number concentrations measured along streets of a commercial/residential neighborhood of Singapore. Cross-validations between measured and predicted aerosol concentrations based on the urban parameters at each individual grid cell showed satisfying results. This proof-of-concept study showed that the selected urban parameters are an appropriate indirect measure of aerosol concentrations within the studied area. The potential locations for fixed air quality monitors are identified through clustering of areas (i.e., groups of cells) with similar urban patterns. The typological center of each cluster corresponds to the most representative cell for all other cells in the cluster. In the studied neighborhood four different clusters were identified, and for each cluster potential sites for air quality monitoring at ground level were determined.

Air quality monitoring is needed to guide regulations for public health
protection (Craig et al., 2008). At city scale it is performed through
networks of monitoring stations covering large geographic areas (i.e., 2–25 km

Time series of PM

Despite the stark difference between ambient and ground-level pollution concentrations, monitoring networks include only a few ground-level monitoring stations, whose purpose is to characterize traffic emissions rather than to advise policy. This is usually the case everywhere, not only in Singapore. The deployment of comprehensive monitoring networks at ground level is hampered by the large number of monitors and the associated costs (i.e., equipment, operation and maintenance) needed to represent the urban heterogeneity in terms of land use, building morphology and distribution of emission sources. To overcome this limitation and expand existing air quality monitoring networks, a new method is proposed to determine the minimum number of stations at ground level and their best potential locations.

Modeling techniques, such as Computational Fluid Dynamics (CFD) and Large-Eddy Simulation (LES) have been used to investigate the dispersion and distribution of pollutants under the urban canopy (e.g., Li et al., 2006; Tominaga and Stathopoulos, 2013). However, the complexity of the urban structure has limited their application to simplified geometries (i.e., urban morphologies), idealized atmospheric conditions and particular distributions of emission sources (e.g., Li et al., 2012; Tominaga and Stathopoulos, 2013).

With recent advancements in computational and sensing technologies, data-driven approaches, also known as inverse or empirical modeling, are an alternative for modeling complex systems (Kolehmainen et al., 2001; Voukantsis et al., 2010), such as those imposed by the urban heterogeneity on the distribution of air pollutants at street level. The basic idea behind these models is that if there are underlying rules controlling a system, they can be found from a set of data by means of statistical and probabilistic methods. Therefore, with a statistically reasonable amount of air pollution observations and data on urban parameters, a data-driven mathematical model can be constructed to interpolate pollutant concentrations over a whole gridded domain with an acceptable level of accuracy, without requiring a descriptive theory of the real phenomena in advance.

Considering the number of potential urban parameters controlling the pollution distribution at ground level, the modeling challenge turns into the identification of the nonlinear functional relations between the urban parameters and concentration of atmospheric pollutants. This view inverts the problem of modeling from a deductive and theory-grounded approach to an inductive and data-driven approach as it is similarly described in Inverse Problem Theory (Tarantola, 2005).

A Self-Organizing Map (SOM) is used here as a data-driven modeling approach to find the association between particulate matter concentrations at ground level and the urban parameters in their vicinity. The model (trained SOM) is then applied to approximate the concentration of pollutants in a whole gridded domain based on the urban parameters of each particular cell. The resulting maps showing the spatial distribution of pollutant concentrations are expected to provide valuable information for epidemiological and risk assessments, as well as to identify hot spots of pollution.

The trained SOM is also used in combination with a clustering algorithm to determine the number of similar domains in the area, representing the optimum number of monitoring stations to cover the different urban patterns within the studied domain. The center of each cluster is the best potential location in terms of representativeness of the urban parameters.

The proposed data-driven model is tested using a data set of over 80 urban
parameters and high frequency (1 or 10 s) measurements of aerosol pollution
along a reasonable number of streets in a heterogeneous
residential/commercial neighborhood of Singapore, selected as a case study.
Fine-grained urban parameters spatially distributed in grid cells of 100

It is important to point out that the study presented here is a proof of concept with the aim of testing SOM. The proposed methodology is not a receptor model. It does not determine any source apportionment. Receptor models utilize chemical measurements to calculate the relative contributions from major sources at specific locations (e.g., Viana et al., 2008).

The article first describes the main features of SOM methodology and its capabilities for multidimensional data visualization, nonlinear function approximation, and data clustering. Then the urban parameters and aerosol pollution measurements are introduced. The application of SOM to our case is presented in three sections. The first section describes the application of SOM as a nonlinear function approximation method between urban parameters and measured aerosol concentrations. The efficiency of the approximation functions is evaluated through cross-validations between predicted and observed data. The second section explains the application of SOM to interpolate the measured pollution data from selected grid cells to the complete gridded domain. The third section describes the combination of SOM and a clustering algorithm to determine the optimum number of monitoring sites and their best locations in terms of representativeness and information gain. Maps of spatially interpolated aerosol concentrations present the results of the approach based on the SOM proposed here. The candidate locations for monitoring stations for each one of the identified types of urban patterns (i.e., clusters) are indicated in a final map showing also the representativeness of each grid cell within its respective cluster.

This section starts with a brief description of SOM as a data-driven modeling approach. The following section describes the selected neighborhood of Singapore as a study area and provides details of the urban parameters used for the model evaluation. Then, the aerosol pollution measurements are introduced.

Self Organizing Map is a data-driven modeling method introduced by Kohonen (1982). From a mathematical point of view, SOM acts as a nonlinear data
transformation in which data from a high-dimensional space are transformed to
a low-dimensional space (usually a space of two or three dimensions), while
the topology of the original high dimensional space is preserved. Topology
preservation means that if two data points are similar (i.e., close) in the
high-dimensional space, they are necessarily close in the new
low-dimensional space. This low-dimensional space, which is normally
represented by a planar grid with a fixed number of points, is called a map.
Each node of this map has its own coordinates

In comparison with other data transformation methods, SOM has the advantage of delivering two-dimensional maps visualizing smoothly changing patterns of data from the original high-dimensional space. In addition, SOM can also be used to predict values of parameters or dimensions using data of each other parameter through nonlinear approximation functions (Barreto and Souza, 2006). In the field of environmental modeling, data-driven methods, such as neural networks (e.g., Multi-Layer Perceptron Learning), Support Vector Machines (SVM) and time series forecasting methods such as the Autoregressive Integrated Moving Average (ARIMA) modeling technique, have been previously applied based on the availability of massive measured data (e.g., Kolehmainen, 2004; Kolehmainen et al., 2001). In a recent study Nguyen et al. (2014) used low-resolution satellite images in combination with SVM to estimate aerosol concentration at ground level from urban surfaces with no need for in situ measurements. However, they were not able to identify the urban parameters' influence on the aerosol concentration. Similarly, Hirtl et al. (2014) used satellite images, ground-based measurements and the support vector regression method to improve air quality forecasts at regional scale.

In summary, SOM is a generic, robust and powerful method that has been employed in several application domains (Kohonen, 2013). It can be used for visualization of high-dimensional data and data exploration (Kolehmainen, 2004), state space modeling and clustering (Bieringer et al., 2013) and, most importantly, as a nonlinear function approximation method without reducing the complexity of the system (Barreto and Souza, 2006).
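To make the topology-preserving transformation described above more concrete, the following is a minimal sketch of SOM training on a rectangular map, written in plain Python. It is an illustration only and not the implementation used in this work: the function names `train_som` and `best_matching_unit`, the linearly decaying learning rate, the Gaussian neighborhood and the rectangular (rather than hexagonal) lattice are all assumptions made for the example.

```python
import math
import random


def best_matching_unit(weights, x):
    """Return the (row, col) of the node whose weight vector is closest to x."""
    best, best_d = (0, 0), float("inf")
    for r, row in enumerate(weights):
        for c, w in enumerate(row):
            d = sum((a - b) ** 2 for a, b in zip(w, x))
            if d < best_d:
                best, best_d = (r, c), d
    return best


def train_som(data, rows, cols, epochs=50, lr0=0.5, sigma0=None, seed=0):
    """Train a small rectangular SOM (illustrative, assumed hyperparameters).

    data: list of equally long feature vectors (e.g., normalized urban parameters).
    Returns weights[row][col] -> weight vector of that node.
    """
    rng = random.Random(seed)
    dim = len(data[0])
    sigma0 = sigma0 or max(rows, cols) / 2.0
    # Initialize every node with a copy of a randomly chosen training vector.
    weights = [[list(rng.choice(data)) for _ in range(cols)] for _ in range(rows)]
    total_steps = epochs * len(data)
    step = 0
    for _ in range(epochs):
        order = data[:]
        rng.shuffle(order)
        for x in order:
            frac = step / total_steps
            lr = lr0 * (1.0 - frac)                  # learning rate decays to zero
            sigma = max(sigma0 * (1.0 - frac), 0.5)  # neighborhood radius shrinks
            br, bc = best_matching_unit(weights, x)
            for r in range(rows):
                for c in range(cols):
                    # Gaussian neighborhood measured on the map lattice.
                    d2 = (r - br) ** 2 + (c - bc) ** 2
                    h = math.exp(-d2 / (2.0 * sigma ** 2))
                    w = weights[r][c]
                    for k in range(dim):
                        w[k] += lr * h * (x[k] - w[k])
            step += 1
    return weights
```

After training, calling `best_matching_unit(weights, x)` projects any vector `x` onto the map; vectors that are close in the original space are expected to land on the same or neighboring nodes.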

The availability of parameters such as urban topology, land use, vehicular
traffic, road dimensions, etc. at fine spatial resolution makes Singapore a
well-suited place to investigate the influence of those parameters on air
quality at ground level. For the selected domain of 35.1 km

Figure 2 shows the urban area selected to test the data-driven method proposed here. This area encompasses the district of Rochor, which meets the heterogeneity requirements to investigate the nonlinear correlations between urban parameters and air pollution at ground level. The district of Rochor covers the historic neighborhood of Little India, which is characterized by two types of building typologies: shop-houses and residential towers. Shop-houses are multifunctional row houses of 3–5 stories, while the residential towers are up to 30 stories and can be built on a multi-story base with retail function. Rochor contains multiple urban land uses that range from residential to small-scale industrial workshops. The urban parameters used to train SOM were those listed in the land use section of the Singapore Master Plan 2008 within each grid cell. Land use is derived as the number of square meters for each category.

Location of the commercial/residential district of Rochor,
Singapore, selected as an urban domain to test the data-driven method
proposed here.

The studied area is formed by different street layouts. Some roads are eight-lane transit streets, others shopping streets or back lanes with service functionality (e.g., garbage collection). To identify the individual street typology, different graph measures (Hillier et al., 1976) were applied to a street graph encompassing the entire city-state of Singapore with different distance ranges to identify the major and minor roads.

To evaluate the data-driven method proposed here, we measured a number of variables that characterize aerosol pollution at ground level. Particles were chosen among the air pollutants typically monitored in cities because they are responsible for driving the worst air quality conditions in Singapore, as well as in many other cities (Velasco and Roth, 2012).

Main categories of urban parameters with influence on the aerosol concentration at ground level in the residential/commercial neighborhood of Rochor, Singapore, used as a case study.

The aerosol pollution data were collected at ground level along streets, alleys and public areas of Rochor and at a site above the urban canopy (a balcony on a 28th floor), hereafter called the background site. The purpose of this site was to measure particle concentrations at ambient level, as typical monitoring stations do. The route followed during the ground measurements and the location of the background site are shown in Fig. 2. The ground-level route was designed to cover as many as possible of the different land uses and urban topologies of the selected neighborhood.

Seven parameters of aerosol pollution were measured in situ using portable and
battery-operated sensors. The set of sensors included two DustTrak Aerosol
Monitors (TSI 8534) to measure size-segregated mass-fraction concentrations
(PM

The measurements were limited to the evening period from 18:00 to 20:00 h on weekdays. Using commuter data from the subway station of Farrer Park, located in the middle of the neighborhood of Rochor (see Fig. 2), we found that this is the period of major influx of people, and therefore of major interest from a health risk point of view. The ground-level route of 3.5 km was covered 20 times over 10 days in July 2013. None of the measurement days were affected by rain or smoke-haze from wildfires in neighboring islands (e.g., Sumatra and Kalimantan). Constant meteorological conditions, as well as constant intensity of anthropogenic activities (i.e., aerosol emissions), were assumed during the 2 h of measurements.

Using the location of each measurement obtained by the GPS readings properly synchronized with the particle sensors, an identification flag was assigned to each measurement point using as reference the closest grid cell and its corresponding urban parameters.
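The assignment of measurement points to grid cells can be sketched as a nearest-center lookup followed by averaging per cell. This is a minimal illustration under stated assumptions: the coordinates are assumed to be already projected to a local metric system (the GPS readings themselves are latitude and longitude), and the names `nearest_cell` and `average_by_cell` as well as the cell identifiers are hypothetical.

```python
import math
from collections import defaultdict
from statistics import mean


def nearest_cell(point, cell_centers):
    """Return the id of the grid cell whose center is closest to `point`.

    point: (x, y) in metres; cell_centers: dict cell_id -> (x, y) in metres.
    """
    px, py = point
    return min(
        cell_centers,
        key=lambda cid: math.hypot(cell_centers[cid][0] - px,
                                   cell_centers[cid][1] - py),
    )


def average_by_cell(samples, cell_centers):
    """Flag each (x, y, value) sample with its closest cell and average per cell."""
    grouped = defaultdict(list)
    for x, y, value in samples:
        grouped[nearest_cell((x, y), cell_centers)].append(value)
    return {cid: mean(values) for cid, values in grouped.items()}
```

For example, with two cell centers `{"A": (50, 50), "B": (150, 50)}`, the samples `(40, 60, 10.0)` and `(60, 40, 20.0)` are both flagged with cell `"A"` and averaged to 15.0.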

The measurements at the background site were used to verify that the ambient
concentrations were constant during the 2-h measurement periods, and the
ground-level measurements were used to train SOM. Two reasons explain this:
(1) the differences between the concentrations measured at ground level and
the background site showed a small variability and had therefore an
insignificant influence on training SOM. Statistically, the combination of a
random function

This section describes step by step the application of SOM as a data-driven
method to approximate the nonlinear functions between the urban parameters
and aerosol pollution data measured at ground level. The model is validated
by cross-validations between the values predicted by SOM and the measured
values. The procedure involves the following three steps:

Step 1: Data transformation from a high to a low dimensional space;

Step 2: Modeling the nonlinear functions between urban parameters and aerosol variables;

Step 3: Validation and hypothesis testing.

A Self Organizing Map is capable of delivering two-dimensional maps in which
smoothly changing patterns of the original high-dimensional space can be
visualized. Figure 3 shows the interrelations between the different aerosol
variables after training an SOM by simply using the averages of the
measurements for each grid cell. Two patterns are observed, one linearly
correlated for PM

The two different patterns are not surprising. In areas influenced by
traffic emissions, such as the district of Rochor, ultrafine particles (UFP

In addition to the smooth pattern created by the SOM, the probabilistic distribution of the original data set (i.e., measurement vectors of each grid cell) can be also obtained from the trained map, as shown in Fig. 4. In this diagram, called a “hit-map”, each hexagonal unit is a node of the SOM, where the size of the black points within each unit is relative to the number of similar observation points placed in that unit during the training phase. Hence, the data points and nodes are similar to each other in the same area of the map. This creates a smooth probabilistic pattern on top of the SOM, in which the frequency of observed patterns (proportional to the size of the black points) can be used for resampling and simulation of the observed patterns. For a detailed description of this idea one can refer to Bieringer et al. (2013).

Visualization created by an SOM of the patterns in two-dimension maps (component planes) for the different particle variables measured at street level.

Distribution of training data in the trained SOM (hit-map) based
on their similarity in urban parameters. Each hexagon is a node in the SOM
and the size of the black points within each hexagon is proportional to the
number of training data placed in that node during the training phase. The
gray shaded square indicates the projection of a new grid cell on the
trained SOM (only on urban parameters) within the region with

The already trained SOM in combination with algorithms such as the

Train an SOM based only on urban parameters (with normalized values for each parameter) covering the whole domain and including grid cells with and without direct aerosol measurements.

For each grid cell

Project the grid cell

Within the selected region of nodes (red contour in Fig. 4) find the
grid cells with aerosol measurements (

Calculate the normalized similarity between the selected cells and
those with measurements (i.e.,

Based on the following two recommendations calculate the aerosol
concentrations for cell

Calculate the weighted average concentrations from the selected cells with
measurements. The weight is based on the normalized similarity of the urban
parameters between cell

If weights are close to each other with no dominant weights from a few of the selected cells, use the measurement median instead of the mean to prevent bias from extreme values when calculating the average concentrations. Geometric means are also an option.
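The last two recommendations above can be sketched as follows. This is a hedged illustration rather than the exact procedure: the similarity is assumed here to be an inverse Euclidean distance normalized to sum to one, and the `dominance` threshold used to decide whether any weight is dominant is a hypothetical parameter. The selection of the measured cells within the SOM region is assumed to have been done beforehand.

```python
import math
from statistics import median


def predict_concentration(target, neighbours, dominance=1.5):
    """Estimate the concentration for a cell without measurements.

    target: urban-parameter vector of the cell to predict.
    neighbours: list of (urban_vector, measured_concentration) for the
        measured cells found in the same SOM region.
    dominance: assumed threshold; a weight counts as dominant when it
        exceeds `dominance` times the uniform weight 1/n.
    """
    if not neighbours:
        return None  # no similar measured cell: no prediction is made
    # Assumed similarity: inverse Euclidean distance, normalized to sum to 1.
    sims = [1.0 / (1.0 + math.dist(target, vec)) for vec, _ in neighbours]
    total = sum(sims)
    weights = [s / total for s in sims]
    uniform = 1.0 / len(weights)
    if max(weights) < dominance * uniform:
        # Weights are close to each other: use the median to limit the
        # influence of extreme values.
        return median(conc for _, conc in neighbours)
    # Otherwise use the similarity-weighted average.
    return sum(w * conc for w, (_, conc) in zip(weights, neighbours))
```

With a target identical to one measured cell and far from another, the first cell dominates and the weighted average is returned; with three equally similar cells, the median of their concentrations is returned instead.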

With the assumption of existing relationships between urban parameters and aerosol concentrations, the SOM creates a smooth map of emergent urban patterns. It is worth mentioning that in the trained SOM map, the grid cells representing the spatial surface of the neighborhood in the physical space are not necessarily placed in the same region if they do not have similar urban patterns.

The number of nodes in the SOM, defined as the width and height of the
trained map, is important to optimize the SOM training procedure. In our
experiment we selected a map of 20

Before applying the SOM to predict the aerosol concentrations in the entire domain, the nonlinear relationships between the different urban parameters and aerosol concentrations approximated by SOM must be tested. We performed cross-validations between the values predicted by SOM and the measured values to validate the proposed data-driven modeling approach.

Because of the limited number of grid cells with measurements, the cross-validation was performed using 10 % of the samples in 20 iterations. This means that we removed randomly 10 % of the cells with measurements for every iteration and predicted their aerosol concentrations based on the remaining cells with measurements. The statistical metrics of the cross-validations shown in Table 2 demonstrate the ability of SOM to preserve the nonlinear relations between the urban parameters and aerosol concentrations. Figure 5 shows that the relative errors of the predicted aerosol concentrations are tightly distributed around zero with a relatively longer left tail for all of the particles, indicating a tendency to underestimate real values.
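The repeated random hold-out described above can be sketched as follows. The sketch is generic: the `predict` callback stands in for the SOM-based prediction and is an assumption of this example, as are the function name `repeated_holdout` and the data layout. Relative errors are computed as (predicted − observed)/observed, so negative values correspond to underestimation.

```python
import random


def repeated_holdout(cells, predict, fraction=0.10, iterations=20, seed=0):
    """Repeated random hold-out validation.

    cells: dict cell_id -> (urban_vector, measured_concentration).
    predict: callable(train_cells, urban_vector) -> float or None
        (hypothetical stand-in for the SOM-based predictor).
    Returns the list of relative errors over all held-out cells.
    """
    rng = random.Random(seed)
    ids = sorted(cells)
    n_test = max(1, round(fraction * len(ids)))
    rel_errors = []
    for _ in range(iterations):
        # Randomly remove a fraction of the measured cells for this iteration.
        held_out = set(rng.sample(ids, n_test))
        train = {i: cells[i] for i in ids if i not in held_out}
        for i in sorted(held_out):
            vec, observed = cells[i]
            predicted = predict(train, vec)
            if predicted is not None and observed != 0:
                rel_errors.append((predicted - observed) / observed)
    return rel_errors
```

With 30 measured cells, each of the 20 iterations holds out three cells, so up to 60 relative errors are collected for a histogram such as the one in Fig. 5.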

Relative errors distribution of the predicted aerosol concentrations based on the randomly selected validation data.

Distribution of the probabilistic confidence levels of the predicted aerosol concentrations using the nonlinear approximation function of SOM, overlaid on a map of the studied neighborhood of Rochor, Singapore.

Once the cross-validation has demonstrated satisfying results, we can
proceed to interpolate the aerosol concentrations in the complete gridded
domain. The interpolation methodology is essentially the same as the
methodology used in the previous section for the cross-validation. The only
difference is the addition of a confidence measure for the predicted
concentrations. This confidence measure is based on the similarity between
the urban parameters and grid cells with measurements. If no similar grid
cell with direct measurements is available for a particular set of grid
cells with a similar urban pattern, a null confidence value will be obtained
and no concentration will be predicted. In our study case, this situation
occurred for regions with no urban similarity with the region where the
measurements were performed. The confidence value for each cell was computed
following the next steps:

The grid cells from the complete domain are divided into cells with
measured data (

The

The median of the calculated distances for

Efficiency of SOM to approximate nonlinear relationships between urban parameters and aerosol concentrations. The cross-validation was performed using 10 % of the samples in 20 iterations as explained in the text.

As in the previous sections, given the good cross-validation results between the values predicted by the nonlinear approximations and the measured aerosol pollution data, we can apply a clustering algorithm based on the urban parameters to determine the optimum number of fixed monitoring sites and their best locations (grid cells) in terms of representativeness and maximum information gain over the whole domain.

Clustering algorithms have the task of finding the optimum number of groups in a given data set. The members (i.e., grid cells) of a group must be as similar as possible to each other and dissimilar to members of other groups. Each cluster must represent an individual group with a specific set of parameters. The clustering algorithm must also be capable of identifying the most informative (i.e., representative) members within each cluster.

Application of the heuristic elbow method to identify the optimal number of clusters in which the grid cells of the studied domain of Rochor can be grouped. The drastic decrease of the clustering index with four clusters suggests that additional clusters will not significantly improve the clustering quality.

The

In practice, this number is determined by heuristic methods, such as the elbow method (Tibshirani et al., 2001). For our case study, the elbow method suggests that four clusters are enough for the whole gridded domain. It means that four main types of urban settings define the neighborhood of Rochor. As shown in Fig. 7, the clustering index (metric to evaluate the clusters compactness and separation) decreases drastically with four clusters. More clusters do not represent any major improvement.
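The elbow heuristic can be illustrated with a short sketch. Two assumptions are made loudly here: a plain k-means with deterministic farthest-first initialization is used as a stand-in clustering algorithm, and the within-cluster sum of squared distances is used as a stand-in for the clustering index; neither is claimed to be the algorithm or index used in this work. All function names are hypothetical.

```python
import math


def farthest_first(points, k):
    """Deterministic initialization: start at points[0], then repeatedly
    add the point farthest from the centers chosen so far."""
    centres = [list(points[0])]
    while len(centres) < k:
        nxt = max(points, key=lambda p: min(math.dist(p, c) for c in centres))
        centres.append(list(nxt))
    return centres


def assign(points, centres):
    """Label each point with the index of its nearest center."""
    return [min(range(len(centres)), key=lambda j: math.dist(p, centres[j]))
            for p in points]


def kmeans(points, k, iterations=50):
    """Plain Lloyd iterations (assumed stand-in clustering algorithm)."""
    centres = farthest_first(points, k)
    for _ in range(iterations):
        labels = assign(points, centres)
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster becomes empty
                centres[j] = [sum(col) / len(members) for col in zip(*members)]
    return centres, assign(points, centres)


def inertia(points, centres, labels):
    """Within-cluster sum of squared distances (assumed clustering index)."""
    return sum(math.dist(p, centres[lab]) ** 2 for p, lab in zip(points, labels))


def elbow_curve(points, k_values):
    """Return {k: index}; the elbow is where the index stops dropping sharply."""
    curve = {}
    for k in k_values:
        centres, labels = kmeans(points, k)
        curve[k] = inertia(points, centres, labels)
    return curve
```

Plotting the returned values against the number of clusters gives a curve like the one in Fig. 7; the number of clusters is read off at the point beyond which adding clusters no longer produces a substantial decrease.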

Figure 8a shows the distribution of grid cells within the trained map by
SOM. The size of the black spot is proportional to the number of training
data in each node. After applying the

The methodology based on the data-driven model of SOM developed in this work offers two outcomes of potential relevance for the air quality management in cities. Maps showing the spatial distribution of aerosol concentrations at ground level within the whole gridded domain represent the first outcome. The second outcome and main goal of this work is the finding of the optimum number of fixed monitoring stations and their potential best locations to cover the different types of urban settings (i.e., clusters) of the studied neighborhood.

Figure 9 shows the spatial distribution of PM

Spatial distribution of PM

Statistics of the aerosol concentrations predicted by SOM for the
complete gridded domain of the neighborhood of Rochor on weekdays during the
evening rush hour (18–20 h). The analysis considers only grid cells with
measured or extrapolated data with confidence values

Distribution of the grid cells in the real two-dimensional space grouped in the four different clusters (i.e., urban settings) that form the neighborhood of Rochor, Singapore. The color intensity indicates the representativeness degree. Grid cells colored more intensely represent better the urban parameters of their corresponding clusters. The most representative grid cells of each cluster (highest value(s)) are marked in red as the candidate locations for fixed air quality stations. The gray areas correspond to buildings.

Although the measurements were conducted during the period of major influx of people, and therefore of major interest from a health risk point of view, the nonlinear relations between urban parameters and aerosol pollution data obtained by SOM cannot be representative for the whole diurnal course. They are only representative of the rush-hour period monitored. The nonlinear relationships will vary throughout the day as a consequence of the variability in the emissions' intensity within the studied neighborhood. However, the aerosol measurements during the evening rush-hour had the unique purpose of testing SOM. The proposed methodology is expected to be used in the design of future long-term studies. Maps showing the spatial distribution of pollutants concentration at ground level in fine-grained domains will provide valuable information for epidemiological and risk assessments. We already discussed the poor ability of typical ambient monitoring stations to represent the pollution levels at the height where urban dwellers carry out the majority of their activities. This is of particular concern in the ubiquitous environments affected by vehicular emissions. Many epidemiological studies have found significant health effects due to exposure to vehicular traffic (e.g., Lipfert and Wyzga, 2008). Although these studies have investigated various exposure criteria, including traffic intensity and proximity, control strategies have generally not yet been proposed on a widespread basis, in part due to the lack of long term air pollution monitoring at street level, as well as of a methodology to understand the relationships between pollutant concentrations and urban parameters. In a following article we will discuss the features and roles of those parameters in the air quality of the studied neighborhood of Rochor. 
The understanding of these relationships might also be useful for urban planning, in particular when designing strategies to improve urban mobility by promoting walking and cycling as a means to cover the so-called first and last miles (distances that commuters must cover in getting to and from public transportation).

Figure 10 summarizes the application of SOM as a data-driven method to find the optimum number of monitoring stations and their potential locations in terms of representativeness. The top candidate grid cell(s) for each one of the four different urban settings that form the residential/commercial neighborhood of Rochor are marked over the individual maps of each urban setting. The grid cells were brought back to the real two-dimensional space using as reference their latitude and longitude data. The next step in the selection of sites is to visit the candidate locations and verify that enough space, adequate security conditions and continuous access to power are available for a fixed monitoring station. The location must fulfill the criteria to assure the proper performance of the air quality monitors. For further guidance the reader may refer to handbooks for air quality monitoring (e.g., US-EPA, 2013). If we select more than one candidate location for each cluster (say three top candidates, each selected independently), another optimization step would be necessary to find the best four monitoring stations out of 12 candidate points (three locations for each of the four clusters) in a way that minimizes the overlap between stations of different clusters and maximizes the total physical coverage of the monitoring stations.
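The additional optimization step mentioned above is small enough to be solved by exhaustive search: with three candidates for each of four clusters there are only 3^4 = 81 combinations. The sketch below picks one candidate per cluster so that the smallest distance between any two selected stations is as large as possible. This max-min distance criterion is an assumed proxy for low overlap and wide coverage, not a criterion defined in this work, and the function name `pick_stations` is hypothetical.

```python
import math
from itertools import combinations, product


def pick_stations(candidates_by_cluster):
    """Choose one candidate site per cluster by exhaustive search.

    candidates_by_cluster: dict cluster_id -> list of (x, y) candidate sites.
    Returns (dict cluster_id -> chosen site, score), where the score is the
    smallest pairwise distance among the chosen sites (assumed criterion).
    """
    cluster_ids = sorted(candidates_by_cluster)
    best_choice, best_score = None, -1.0
    for choice in product(*(candidates_by_cluster[c] for c in cluster_ids)):
        # Spread of this combination: distance between its two closest sites.
        score = min((math.dist(a, b) for a, b in combinations(choice, 2)),
                    default=0.0)
        if score > best_score:
            best_choice, best_score = choice, score
    return dict(zip(cluster_ids, best_choice)), best_score
```

Other criteria, such as the total area covered within a given radius of each station, could replace the score inside the loop without changing the search itself.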

The capability of SOM as a data-driven modeling method to approximate nonlinear relationships between multiple urban parameters and air pollution data at ground level was demonstrated using a database of urban parameters spatially distributed in high-resolution grid cells, created for purposes other than air quality monitoring (e.g., urban planning), and aerosol pollution data collected during a short field study. The good agreement between measured and predicted aerosol concentrations showed that the group of urban parameters used in this work provides a good indirect measure of aerosol pollution at ground level within the studied neighborhood. The same methodology can also be used for any gaseous pollutant. Every pollutant, depending on its origin and its physical and chemical characteristics, will present different nonlinear relationships.

The satisfying results of SOM to approximate nonlinear relationships from multidimensional data gave the opportunity to apply SOM as a method to interpolate aerosol pollution data in a complete gridded domain, including grid cells with no direct measurements. In the same context, SOM in combination with a clustering algorithm was used to determine the optimum number of locations for monitoring sites to cover the different urban settings or clusters forming the studied neighborhood, as well as to find their best location in terms of representativeness of urban patterns within their clusters.

The data-driven modeling methodology developed in this work as a proof of
concept should be relatively easy to apply to other urban domains if such
urban parameters as street networks, land-use patterns, demographics,
transportation data, and building and street topology are available in
databases of high spatial resolution. The aerosol pollution measurements
should not represent a major cost if portable and battery-operated sensors
are used, as in this work. We evaluated seven different aerosol parameters,
but measurements only of black carbon or particle number concentration in
addition to PM

This research was supported by the National Research Foundation Singapore (NRFS), through the Singapore MIT Alliance for Research and Technology's CENSAM research program and the Singapore-ETH Centre for Global Environmental Sustainability (SEC) co-funded by NRFS and ETH Zurich.

Edited by: P. Xie