=Paper=
{{Paper
|id=Vol-3194/paper53
|storemode=property
|title=SiMBA: Systematic Clustering-Based Methodology to Support Built Environment Analysis
|pdfUrl=https://ceur-ws.org/Vol-3194/paper53.pdf
|volume=Vol-3194
|authors=Carlo Andrea Biraghi,Emilia Lenzi,Maristella Matera,Massimo Tadi,Letizia Tanca
|dblpUrl=https://dblp.org/rec/conf/sebd/BiraghiLMTT22
|wikidataid=Q117344909
}}
==SiMBA: Systematic Clustering-Based Methodology to Support Built Environment Analysis==
<pdf width="1500px">https://ceur-ws.org/Vol-3194/paper53.pdf</pdf>
<pre>
SiMBA: Systematic Clustering-Based Methodology to
Support Built Environment Analysis
(Discussion Paper)

Carlo Andrea Biraghi1, Emilia Lenzi2, Maristella Matera2, Massimo Tadi1 and
Letizia Tanca2
1 Politecnico di Milano - Dipartimento di Architettura, Ingegneria delle costruzioni e Ambiente Costruito
2 Politecnico di Milano - Dipartimento di Elettronica, Informazione e Bioingegneria


Abstract
The general interest in sustainable development models has grown enormously over the last 50 years,
and architecture and urban planning are certainly two areas in which research on the topic is most
advanced. At the same time, the contribution of computer science to a systematic analysis of the
territory, both from a morphological point of view and as regards performance, seems to have been
underestimated in today’s research. In this context, our research aims to join the two - until now
separate - worlds of computer science, and architecture and urban planning. In particular, in this
work we present SIMBA: Systematic clusterIng-based Methodology to support Built environment Analysis.
SIMBA aims to enhance a consolidated analysis methodology, the Integrated Modification Methodology
(IMM), through the integration of advanced analysis methods for the extraction of relevant patterns
from built environment data. Using the city of Milan as a case study, we will demonstrate the
possibility for SIMBA to be generalised to the analysis of any built environment.

Keywords
Clustering, Built environment, Methodology, Data mining, Multidisciplinary, Sustainability


1. Introduction
One of the biggest problems of our century is the impact that the behaviour of human beings
is having on the environment. For the same level of efficiency in resource management, this
impact is obviously greater in more densely populated areas. According to [1], approximately
80% of the global primary energy is currently consumed in urban areas, and cities are responsible
for emitting more than 70% of the total world’s greenhouse gases and for consuming 60% of
disposable water. Nonetheless, cities are the economic engine of the world. This is why the
problem of sustainability in built environments - defined as every human-made space - has been
broadly addressed in the literature. However, many issues are still open, and different approaches
have deficiencies in several respects, e.g., the definition of the scale at which the analysis is
conducted, the introduction of the context in the analysis, and the usage of huge amounts of data
and features to describe the built environment.

SEBD 2022: The 30th Italian Symposium on Advanced Database Systems, June 19-22, 2022, Tirrenia (PI), Italy
carloandrea.biraghi@polimi.it (C. A. Biraghi); emilia.lenzi@polimi.it (E. Lenzi); maristella.matera@polimi.it
(M. Matera); massimo.tadi@polimi.it (M. Tadi); letizia.tanca@polimi.it (L. Tanca)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License
Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073

In this paper we present SIMBA: a systematic clustering-based methodology to support built
environment analysis. SIMBA has been conceived as a framework extending and enhancing
IMM (Integrated Modification Methodology) [2], an innovative design methodology for the
evaluation and the improvement of the environmental performance of a city (or part of it). As
described in [2], IMM has already proved its effectiveness in the field of sustainability evaluation;
however, many challenges are still open, and SIMBA proposes to support this methodology in
the phase of built-environment analysis. Our case study is the city of Milan which, with its
88 NILs (Nuclei di Identità Locale), has already been the subject of many applications of IMM.
In particular, the aims of the conducted experiments were (i) to enrich the phase of built
environment analysis through the use of data clustering techniques, and (ii) to systematise the
basic steps of the analysis process by capitalizing on the support gained from data analysis. In
particular, SIMBA aims to: (i) produce a methodology to select a reasonable, yet representative,
number of features when investigating the built environment; (ii) find experimental evidence of
corresponding patterns between the structure of the city and its performance; (iii) produce a
systematic method to measure the distance between elements, needed when comparing different
built units; (iv) promote a human-in-the-loop paradigm [3] where domain experts can intervene
and refine the analysis process. With these objectives in mind, we have chosen to base the
methodology on clustering techniques, which support the identification of patterns in the
built environment data, which are unlabeled, and highlight which of the input elements are
most similar to each other. At the same time, such algorithms have been useful to single out
similarities and differences between the different elements, as well as the conceptual distance
between them.


2. IMM Methodology and data retrieved
The IMM methodology aims at improving the environmental performance of the city by modifying
its structural characteristics. The main elements of the methodology related to the structures
are Attributes (e.g., the height of a building) and Metrics (e.g., Building Density (BD), equal to
the total number of buildings in an area divided by the total sample area). Attributes represent
immediately measurable characteristics, while Metrics result from calculations. Moreover, from
Metrics, IMM defines the Key Categories, used to represent the products of the synergy between
elementary parts of the city. At the moment, the IMM group has defined 7 Key Categories but,
for the sake of simplicity, in our experiments we will consider only Permeability (the spatial
relationship between urban built-ups and voids) and Porosity (the relationship between the
street network and the spatial components that influence the overall connectivity). The Key
Categories are used to express morphological characteristics, while the tools for performance
evaluation are called Indicators (e.g., Public Transportation Stop Density), organized into Design
Ordering Principle (DOP) families, i.e., families of actions that designers can perform to improve
the current system behaviour [2]. According to these definitions, we analyze the data related to
(i) Indicators, (ii) Metrics, and (iii) Attributes, and we also perform the same experiments using
directly some raw data coming from the Municipality of Milan, which are mostly related to
air pollution, building characteristics, population, transport, and services for each NIL. The
structure of each dataset is summarised in Table 1.

Table 1
Datasets

Dataset      Samples available   Number of features
Indicators   88                  25
Metrics      86                  59 (52 Porosity + 7 Permeability)
Attributes   86                  52
Milan        88                  31

+ | |||
+ | |||
+ | |||
+ | 1.3 1.2 1.1 | ||
+ | 1. BUILT ENVIRONMENT | ||
+ | DATABASES DIMENSION GRANULARITY | ||
+ | DECOMPOSITION | ||
+ | CREATION CHOICE CHOICE | ||
+ | |||
+ | |||
+ | |||
+ | |||
+ | 2.2 2.3 CLUSTERS | ||
+ | AUTOMATED FOR EACH | ||
+ | FEATURE DIMENSION | ||
+ | CLUSTERING | ||
+ | SELECTION | ||
+ | 2.1 2.4 | ||
+ | 2. FIRST LEVEL | ||
+ | CLUSTERING CLUSTERING | ||
+ | PRE-PROCESSING | ||
+ | (ONE FOR EACH EVALUATION DISTANCES | ||
+ | DATASET) MANUAL | ||
+ | FIRST | ||
+ | FEATURE FORMULATION | ||
+ | CLUSTERING | ||
+ | SELECTION | ||
+ | |||
+ | |||
+ | CLUSTERS ON | ||
+ | IMPORTANT | ||
+ | DIMENSIONS | ||
+ | |||
+ | 3.1 3.2 | ||
+ | 3. SECOND LEVEL IMPORTANT | ||
+ | CLUSTERING ON | ||
+ | CLUSTERING DIMENSION | ||
+ | SELECTED COMBINATION DISTANCES | ||
+ | CHOICE | ||
+ | SECOND | ||
+ | FORMULATION | ||
+ | |||
+ | |||
+ | |||
+ | |||
+ | Figure 1: The SIMBA workflow. | ||
+ | |||
+ | |||
3. Methodology and results
In this section we present the SIMBA methodology together with the experiments we carried
out. Fig. 1 represents the methodological workflow. As the figure shows, the process takes as
input a built environment of any kind - a town, a city or even a country - since it can be
generalized to any input, as long as it is possible to identify comparable units in it. The case
study discussed in this paper focuses on Milan. The methodology is composed of three main
phases: (i) Built Environment Decomposition (BED phase); (ii) First-Level Clustering (FLC phase,
one for each dataset); (iii) Second-Level Clustering (SLC phase). The following sections describe
each phase together with its application to our case study.

3.1. BED phase setting
The Built Environment Decomposition phase is useful to make the analysis manageable through
Data Mining algorithms, since it aims at reducing the number of features in each single analysis
by splitting all the available data into distinct datasets according to different Dimensions.
In this phase, we first need to define the granularity at which the analysis has to be conducted
(step 1.1). In other words, we need to choose which samples we want to cluster, and in our case
study these samples are the 88 NILs. Although the definition of NIL is specific to the city of
Milan, this does not affect the generalization potential of the methodology, since similar
divisions (possibly at different scales) can be found in other cities, and therefore step 1.1 can
be carried out. After the granularity, we need to define the Dimensions (step 1.2) and retrieve
the data according to them (step 1.3). The Dimensions represent the different aspects we want to
analyse. They can be of any type and category, e.g., environmental performance, morphological
characteristics, demographic data and so on, but they remain abstract concepts for us: they work
as simple guidelines to be used while looking for data to retrieve. In fact, it is in the datasets
that the Dimensions are represented. As far as our experiments are concerned, we chose to focus
on all the Dimensions at our disposal, as our four datasets include both performance data and
structural characteristics. Consequently, the datasets we create for step 1.3 correspond to the
ones already defined in Section 2. Note that in some experiments we have considered all Metrics
together, in others Permeability and Porosity separately.
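
A minimal sketch of how steps 1.1-1.3 could be implemented in practice, assuming the collected
data sit in a single table indexed by NIL; the column names, prefixes and Dimension mapping
below are hypothetical, not taken from the paper:

import pandas as pd

# Step 1.1: granularity choice - one row per comparable unit (here, NILs).
# Hypothetical combined table with one column per available feature.
data = pd.DataFrame(
    {"ind_pt_stop_density": [3.1, 1.2], "met_porosity_1": [0.4, 0.7],
     "att_building_height": [18.0, 9.5], "raw_pm10": [31.0, 27.0]},
    index=pd.Index(["NIL_01", "NIL_02"], name="nil_id"))

# Step 1.2: Dimensions as guidelines, expressed here as a mapping from Dimension
# name to the columns that describe it (the prefixes are purely illustrative).
dimensions = {
    "indicators": [c for c in data.columns if c.startswith("ind_")],
    "metrics":    [c for c in data.columns if c.startswith("met_")],
    "attributes": [c for c in data.columns if c.startswith("att_")],
    "milan":      [c for c in data.columns if c.startswith("raw_")],
}

# Step 1.3: one distinct dataset per Dimension, each clustered separately in the FLC phase.
datasets = {name: data[cols].copy() for name, cols in dimensions.items()}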
+ | |||
+ | 3.2. FLC phase setting | ||
+ | Once we have our datasets ready, we perform First-Level Clustering for each of them. This | ||
+ | phase is needed to:(i) prepare the datasets for clustering; (ii) select the important features; (iii) | ||
+ | perform clustering; (iiii) evaluate each obtained cluster. Outputs of this phase are:(i) different | ||
+ | clusterizations for each dataset; (ii) distances between the NILs for each dataset. | ||
+ | These outputs are useful to investigate patterns in each dataset and so for each Dimension. For | ||
+ | what concerns the setting for this phase, note that the final choice of the techniques used for pre- | ||
+ | processig, the algorithm chosen for feature selection, and the clustering algorithm itself, were | ||
+ | dictated by the nature of the data at our disposal. On the other hand, what is also important to | ||
+ | underline is that, once the process has been formalised, even if the specific setting is dependent | ||
+ | on the input data, the procedure remains unchanged. For what concerns the pre-processing | ||
+ | phase, we mostly had to manage missing values and perform some feature engineering. We | ||
+ | strictly collaborated with the experts in this step (2.1) to preserve the data semantics as much | ||
+ | as possible. As far as step 2.2 is concerned, for the manual feature selection we relied totally on | ||
+ | the experts, asking them to select no more than 8 features for each dataset; for the automated | ||
+ | case instead, the selection was conducted using the entropy-based algorithm presented in [4]. | ||
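
The entropy measure of [4] is, roughly, a filter that prefers features under which the data look
well clustered. The following is a simplified, non-authoritative sketch of that idea (standardized
numeric features assumed; function names are ours and details differ from the original algorithm):

import numpy as np
from scipy.spatial.distance import pdist

def similarity_entropy(X: np.ndarray) -> float:
    # Pairwise similarities S_ij = exp(-alpha * d_ij), with alpha tied to the mean distance;
    # the entropy of these similarities is low when the data form clear clusters.
    d = pdist(X)
    alpha = -np.log(0.5) / d.mean()
    s = np.exp(-alpha * d)
    eps = 1e-12
    return float(-np.sum(s * np.log2(s + eps) + (1 - s) * np.log2(1 - s + eps)))

def rank_features_by_entropy(X: np.ndarray, feature_names: list) -> list:
    # Remove one feature at a time: the higher the entropy without a feature,
    # the more that feature contributed to the cluster structure.
    scores = {name: similarity_entropy(np.delete(X, k, axis=1))
              for k, name in enumerate(feature_names)}
    return sorted(scores, key=scores.get, reverse=True)
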
For clustering (step 2.3), we used the Agglomerative Hierarchical Clustering algorithm [5]. The
choice was dictated firstly by the few samples at our disposal (only 88 NILs), and secondly by
the need to choose a criterion for selecting the number of clusters in each experiment. For this
parameter, in fact, no indications were given to us by the experts, as this type of experiment is
new to them and is mainly exploratory in nature. For this reason, to determine the value of the
parameter n_clusters we relied on the dendrograms and the Knee-Elbow graphs we obtained
for each dataset [6].
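
A minimal sketch of this clustering step, using standard scikit-learn/SciPy calls (not the
authors' code); the random matrix stands in for one pre-processed dataset, and n_clusters = 4
is an arbitrary illustrative choice:

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.cluster import AgglomerativeClustering
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
features = rng.normal(size=(88, 8))   # stand-in for 88 NILs x at most 8 selected features
X = StandardScaler().fit_transform(features)

# Dendrogram to support the choice of n_clusters (knee/elbow inspection).
plt.figure()
dendrogram(linkage(X, method="ward"))
plt.show()

# Once n_clusters has been chosen for the dataset, fit the hierarchical clustering.
labels = AgglomerativeClustering(n_clusters=4, linkage="ward").fit_predict(X)
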
The last step of this phase is dedicated to the evaluation of the clusters, which is still a tricky
task in this field and most of the time is strongly application-dependent. Firstly, we tried to
obtain an absolute evaluation of each clustering result using internal metrics such as the Dunn
and Davies-Bouldin indexes [6], but the problem with these metrics is that, having a small number
of samples, even a few distant samples in a cluster would produce a decrease in the score. For this
reason, we decided to evaluate clustering results referring mostly to the comparison between
the results obtained with the manual and the automated choice of the features. In this way, we
use the results obtained from the manual feature selection as a proxy of ground truth, even if
they derive from a semiautomatic process.
To do so, we first defined the comparison_matrix, a matrix used to store the number of
common elements among the clusters obtained by selecting different sets of features. For each
experiment (dataset) we computed the comparison_matrix between the two approaches (the
manual and the automated one). The pseudocode for the computation is shown in Algorithm 1
below.
Algorithm 1: Comparison_matrix computation
  Input:  cluster_results
  Output: comparison_matrix
  idx_nil = cluster_results.columns[0]
  for i = 0; i < len(idx_nil); i++ do
      for j = 0; j < len(idx_nil); j++ do
          comparison_matrix[i][j] = (cluster_results[i] == cluster_results[j]).sum()
  return comparison_matrix

In Algorithm 1, the comparison_matrix CM is a matrix of dimension M x M, where M is the number
of NILs in the dataset. For two NILs i and j, CM[i][j] = 2 if the two methods put the NILs i and
j in the same cluster, CM[i][j] = 1 if only one method puts the two NILs in the same cluster,
and CM[i][j] = 0 otherwise. This means that the positive cases correspond to the values 0 and
2, since they mean that, in the two approaches, the clustering has produced the same result for
that pair of NILs.
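
A runnable sketch of Algorithm 1, assuming cluster_results is a DataFrame with one row per NIL
and one column of cluster labels per approach (the column names "manual" and "automated" are
illustrative):

import numpy as np
import pandas as pd

def comparison_matrix(cluster_results: pd.DataFrame) -> np.ndarray:
    # cluster_results: one row per NIL, one column of labels per approach
    # (e.g. "manual" and "automated").
    labels = cluster_results.to_numpy()
    m = len(labels)
    cm = np.zeros((m, m), dtype=int)
    for i in range(m):
        for j in range(m):
            # Number of approaches that put NILs i and j in the same cluster: 0, 1 or 2.
            cm[i, j] = int((labels[i] == labels[j]).sum())
    return cm
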
Finally, to evaluate each experiment, we defined the measure score, whose computation is
shown in Algorithm 2.
Algorithm 2: Score computation
  Input:  comparison_matrix, max_val, number_of_nils
  Output: score
  for i = 0; i < comparison_matrix.index; i++ do
      for j = 0; j < comparison_matrix.columns; j++ do
          if comparison_matrix[i][j] == 0 or comparison_matrix[i][j] == max_val then
              good = good + 1
  return score = good / number_of_nils

In a given experiment, score counts how many values 0 or 2 occur in the corresponding
comparison_matrix, i.e., how many pairs of NILs are treated consistently by the two approaches,
and normalizes this number w.r.t. the number of NILs present in that experiment (88 or 86).
Assuming that a good result for our experiments is that the two clustering runs group all the
NILs in the same way in the Manual and in the Automated case, score can be seen as an accuracy
measure for the procedure. In addition, this measure provides a direct and simple comparison
between the Manual and Automated cases.
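
A corresponding sketch of Algorithm 2 (max_val is 2 when two approaches are compared; the
function name is ours):

import numpy as np

def score(cm: np.ndarray, max_val: int, number_of_nils: int) -> float:
    # Count the matrix cells where the two approaches fully agree on a pair of NILs
    # (value 0: never together; value max_val: together in both approaches),
    # then normalize by the number of NILs, as in Algorithm 2.
    good = int(np.sum((cm == 0) | (cm == max_val)))
    return good / number_of_nils

# e.g. score(comparison_matrix(cluster_results), max_val=2, number_of_nils=len(cluster_results))
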
Therefore, the score provides a numerical evaluation of the clusters obtained and allows
the procedure to be repeated in any proposed scenario. However, given the nature of the
experiments and the absence of any real ground truth, a qualitative, although less formal,
evaluation by the experts cannot be ignored. For this reason, in Figure 2 we report the most
relevant results directly on the map of Milan. Each map is divided into NILs, coloured according
to the cluster they belong to.

[Figure 2 - maps omitted. Panels: (a) Indicators Manual, n_clusters = 3; (b) Indicators Automated,
 n_clusters = 4; (c) Metrics Manual, n_clusters = 4; (d) Metrics Automated, n_clusters = 5;
 (e) Porosity Manual, n_clusters = 2; (f) Porosity Automated, n_clusters = 2;
 (g) Permeability all, n_clusters = 4.]

Figure 2: Some results for the FLC phase.

Table 2
Scores

Dataset      Score
Indicators   76.34
Metrics      49.90
Porosity     49.72
Attributes   46.67
Milan        73.86

The colours have also been chosen to trace which clusters are closest to each other; the outliers
are always coloured red, therefore, in the figure, they appear as a single cluster, but they
contribute individually to increase the n_clusters parameter which, as we note, varies among the
different experiments. In addition, black is used to highlight the NILs that are missing in some
of the datasets. In Table 2 we also report the scores obtained for all the datasets.
Figures 2a and 2b represent the clusters obtained by applying the Manual and the Automated
approaches to the Indicators dataset. What emerges from these maps is that the two approaches
provide almost the same results. Indeed, one might think that indicators only highlight macro-
differences among NILs, but the results have been judged by the experts to be perfectly coherent
with the way DOP families describe NILs. Moreover, as shown in Table 2, the score is rather close
to the number of NILs (88).
Looking at the results related to the Metrics dataset (Figures 2c and 2d) we note that, in this
case, the two approaches (Manual and Automated) perform really differently.

[Figure 3 - maps omitted. Panels: (a) Manual, n_clusters = 3; (b) Automated, n_clusters = 9.]

Figure 3: Results for indicators and metrics.

What emerges is that the features selected manually can express finer differences, while the
automated algorithm only separates the central NILs from the more peripheral ones, with some
exceptions in the very peripheral areas. This is probably because, in the Manual case, the experts
selected both Porosity and Permeability metrics to balance their effects, while the entropy-based
algorithm mostly selected Porosity-related ones, likely owing to the imbalance between the number
of metrics related to Porosity (52) and those related to Permeability (7), which affected the
automated selection of the features.
This is confirmed by the remaining maps in Figure 2, where we note that, using only
Porosity-related metrics, we obtain a sharp separation between centre and periphery NILs, while,
using the metrics related to Permeability, it is possible to identify some clusters also in the
peripheral areas. This gives an idea of how SIMBA can be useful to study the effects of the
different Key Categories separately or jointly, which is of great importance for the IMM group.

3.3. SLC phase setting
Second-Level Clustering is the third and last phase of SIMBA. It takes two inputs: (i) the
First-Level Clustering results, to identify on which dataset the clustering performed better;
(ii) the IMM experts’ indications about the Dimensions to focus on to continue the analysis
(step 3.1); and it produces two outputs: (i) the clustering results on the dataset created by
combining the important Dimensions; (ii) a formalization of distances between NILs considering
only selected features of the important Dimensions.
In the previous sections we showed the results obtained by clustering NILs keeping the
datasets separated. This was an important step, since it allowed us to understand which IMM
elements are important to focus on, both in an automated way and in a more guided one.
As we have seen, both looking at the score values and at the produced maps, the Indicators
dataset was the one where the procedure performed best. This is probably due to the fact that
the indicators themselves are well separated into DOP families and the entropy-based feature-
selection algorithm can extract at least one indicator for each family. Other interesting results,
as we have shown, are obtained on the Metrics dataset. This time the score is considerably
worse, but the meaning of the resulting clusters in terms of morphology cannot be ignored. This
is actually why we decided to add this last phase to the methodology, applying clustering to a
dataset composed of both indicators and metrics (step 3.2). The results are reported in Figure 3.
By comparing the maps, some important things emerge: (i) most of the NILs of the city centre
are always in the same cluster; (ii) the manual case shows practically the same results as the
Indicators case, while the automated one also contains some of the clusters identified in the
Metrics automated case. This might mean that there are some indicators that guide the whole
creation of clusters, while some of the metrics selected in the automated case end up smoothing
their effect.
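
A minimal sketch of this second-level step, assuming the per-NIL tables have already been
restricted to the features selected in the FLC phase; the function and parameter names are ours:

import pandas as pd
from sklearn.cluster import AgglomerativeClustering
from sklearn.preprocessing import StandardScaler

def second_level_clustering(indicators: pd.DataFrame, metrics: pd.DataFrame,
                            n_clusters: int) -> pd.Series:
    # indicators / metrics: per-NIL tables restricted to the features selected
    # for the important Dimensions in the FLC phase.
    combined = indicators.join(metrics, how="inner")   # keep NILs present in both datasets
    X = StandardScaler().fit_transform(combined)
    labels = AgglomerativeClustering(n_clusters=n_clusters, linkage="ward").fit_predict(X)
    return pd.Series(labels, index=combined.index, name="slc_cluster")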
+ | |||
+ | |||
+ | 4. Conclusion and future works | ||
+ | In this work, we presented SIMBA, a systematic clustering-based methodology to support | ||
+ | built-environment analysis, and we proved its potential as a support tool for architects and | ||
+ | urban planners in understanding and analysing urban environments. The biggest limitation of | ||
+ | the work is the nature of the data. On the one hand, the scarcity of the samples (only 88 NILs) | ||
+ | makes the analysis sensitive to outliers and difficult to formally generalise; on the other hand, | ||
+ | the excessive number of characteristics, which are often redundant and not very informative, | ||
+ | affects the quality of the results produced. Two aspects on which future work will therefore | ||
+ | focus are the definition of finer granularity and the selection of features. Additional efforts | ||
+ | will be devoted to refining the human-in-the-loop approach that SIMBA wants to support, by | ||
+ | means of explanations that can guide the domain experts (and eventually, different experts) | ||
+ | in the selection of relevant analysis Dimensions in the SLC phase. In addition, to assess the | ||
+ | actual possibility for SIMBA to be generalised to the analysis of any built environment, different | ||
+ | datasets with different characteristics should be analyzed. | ||
+ | |||
+ | |||
+ | 5. Acknowledgments | ||
+ | We would like to thank Dott. Hadi Mohammad Zadeh who has contributed to this work and its | ||
+ | future development. | ||
+ | |||
+ | |||
References
[1] World Bank, Cities and climate change: an urgent agenda, 2010.
[2] M. Tadi, M. H. M. Zadeh, O. Ogut, Measuring the influence of functional proximity on
    environmental urban performance via integrated modification methodology: Four study
    cases in Milan, International Journal of Urban and Civil Engineering (2020).
[3] R. Guidotti, A. Monreale, S. Ruggieri, F. Turini, F. Giannotti, D. Pedreschi, A survey of
    methods for explaining black box models, ACM Comput. Surv. 51 (2019) 93:1–93:42.
[4] M. Dash, H. Liu, Feature selection for clustering, in: Pacific-Asia Conference on Knowledge
    Discovery and Data Mining (PAKDD), 2000.
[5] R. Xu, D. Wunsch, Clustering, IEEE Press Series on Computational Intelligence, Wiley, 2008.
[6] M. J. Zaki, W. Meira Jr., Data Mining and Machine Learning: Fundamental Concepts
    and Algorithms, Cambridge University Press, 2020.
</pre>