Chapter 5
Clustering
5.1 Introduction
Chapters 3 and 4 describe how samples may be classified if a training set is available to
use in the design of a classifier. However, there are many situations where the classes
themselves are initially undefined. Given a set of feature vectors sampled from some
population, we would like to know if the data set consists of a number of relatively
distinct subsets. If it does and we can determine these subsets, we can define them to
be classes. This is sometimes called class discovery. The techniques from Chapters
3 and 4 can then be used to further analyze or model the data or to classify new data
if desired. Clustering refers to the process of grouping samples so that the samples
are similar within each group. The groups are called clusters.
In some applications, the main goal may be to discover the subgroups rather than
to model them statistically. For example, the marketing director of a firm that supplies
business services may want to know if the businesses in a particular community fall
into any natural groupings of similar companies so that specific service packages and
marketing plans can be designed for each of these subgroups. Reading the public
data on these companies might give an idea of what some of these subgroups could
be, but the process would be difficult and unreliable, particularly if the number of
features or companies is large. Fortunately, clustering techniques allow the division
into subgroups to be done automatically, without any preconceptions about what kinds
of groupings should be found in the community being analyzed. Cluster analysis has
been applied in many fields. For example, in 1971, Paykel used cluster analysis to group
165 depressed patients into four clusters which were then called "anxious," "hostile,"
"retarded psychotic," and "young depressive." In image analysis, clustering can be
used to find groups of pixels with similar gray levels, colors, or local textures, in order
to discover the various regions in the image.
In cases where there are only two features, clusters can be found through visual inspection by looking for dense regions in a scatterplot of the data if the subgroups or classes are well separated in the feature space. If, for example, there are two bivariate normally distributed classes and their means are separated by more than two standard deviations, two distinct peaks form if there is enough data. In Figure 4.20 at least one of the three classes forms a distinct cluster, which could be found even if the classes were unknown. However, distinct clusters may exist in a high-dimensional feature space and still not be apparent in any of the projections of the data onto a plane defined by a pair of the feature axes. One general way to find candidates for the centers of clusters is to form an n-dimensional histogram of the data and find the peaks in the histogram. However, if the number of features is large, the histogram may have to be very coarse to have a significant number of samples in any cell, and the locations of the boundaries between these cells are specified arbitrarily in advance, rather than depending on the data.

5.2 Hierarchical Clustering

A hierarchy can be represented by a tree structure such as the simple one shown in Figure 5.1. The patients in an animal hospital are composed of two main groups, dogs and cats, each of which is composed of subgroups. Each subgroup is, in turn, composed of subgroups, and so on. Each of the individual animals, 1 through 5, is represented at the lowest level of the tree. Hierarchical clustering refers to a clustering process that organizes the data into large groups, which contain smaller groups, and so on. A hierarchical clustering may be drawn as a tree or dendrogram. The finest grouping is at the bottom of the dendrogram; each sample by itself forms a cluster. The coarsest grouping is at the top of the dendrogram, where all samples are grouped into one cluster. In between, there are various numbers of clusters. For example, in the hierarchical clustering of Figure 5.1, at level 0 the clusters are

{1}, {2}, {3}, {4}, {5},

each consisting of an individual sample. At level 1, the clusters are

{1, 2}, {3}, {4}, {5}.

At level 2, the clusters are

{1, 2}, {3}, {4, 5}.

At level 3, the clusters are

{1, 2, 3}, {4, 5}.

At level 4, the single cluster

{1, 2, 3, 4, 5}

consists of all the samples.

Figure 5.1: A hierarchical clustering. (Dendrogram of five individual animals, 1 through 5, grouped under Animals, with subgroups labeled Cats, Long Hair, Short Hair, St. Bernard, and Labrador.)

In a hierarchical clustering, if at some level two samples belong to a cluster, they belong to the same cluster at all higher levels. For example, in Figure 5.1, at level 2 samples 4 and 5 belong to the same cluster; samples 4 and 5 also belong to the same cluster at levels 3 and 4.

Hierarchical clustering algorithms are called agglomerative if they build the dendrogram from the bottom up and divisive if they build the dendrogram from the top down.

The general agglomerative clustering algorithm is straightforward to describe. The total number of samples will be denoted by n.

Agglomerative Clustering Algorithm

1. Begin with n clusters, each consisting of one sample.

2. Repeat step 3 a total of n - 1 times.

3. Find the most similar clusters Ci and Cj and merge Ci and Cj into one cluster. If there is a tie, merge the first pair found.

Different hierarchical clustering algorithms are obtained by using different methods to determine the similarity of clusters. One way to measure the similarity between clusters is to define a function that measures the distance between clusters. This distance function typically is induced by an underlying function that measures the distance between pairs of samples. In cluster analysis, as in nearest neighbor techniques (Section 4.2), the most popular distance measures are Euclidean distance and city block distance.
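The general agglomerative procedure is easy to express in code. The following Python sketch is illustrative and not part of the original text; it represents clusters as lists of sample indices and leaves the cluster-distance function as a parameter, so that the single-, complete-, and average-linkage rules described in the subsections below can be plugged in.

```python
from itertools import combinations

def agglomerative(n, cluster_distance):
    """General agglomerative clustering sketch.

    n                : total number of samples
    cluster_distance : function giving the distance between two clusters,
                       each represented as a list of sample indices
    Returns the clusterings at every level of the dendrogram.
    """
    clusters = [[i] for i in range(n)]                     # step 1: n singleton clusters
    levels = [[list(c) for c in clusters]]
    for _ in range(n - 1):                                 # step 2: repeat n - 1 times
        # Step 3: find the most similar (closest) pair of clusters and merge it.
        # min returns the first minimal pair, so ties merge the first pair found.
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda p: cluster_distance(clusters[p[0]], clusters[p[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
        levels.append([list(c) for c in clusters])
    return levels
```

Supplying one of the linkage rules defined below for cluster_distance yields the corresponding hierarchical algorithm.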
The Single-Linkage Algorithm

The single-linkage algorithm is also known as the minimum method and the nearest neighbor method. The latter title underscores its close relation to the nearest neighbor classification method. The single-linkage algorithm is obtained by defining the distance between two clusters to be the smallest distance between two points such that one point is in each cluster. Formally, if Ci and Cj are clusters, the distance between them is defined as

    D_{SL}(C_i, C_j) = \min_{a \in C_i,\, b \in C_j} d(a, b),

where d(a, b) denotes the distance between the samples a and b. For the single-sample clusters {a} and {b}, D_{SL}({a}, {b}) = d(a, b).

Example 5.1 Hierarchical clustering using the single-linkage algorithm.

Perform a hierarchical clustering of five samples using the single-linkage algorithm and two features, x and y. A scatterplot of the data is shown in Figure 5.2. Use Euclidean distance for the distance between samples. The following tables give the feature values for each sample and the distance d between each pair of samples:

    Sample    x    y
       1      4    4
       2      8    4
       3     15    8
       4     24    4
       5     24   12

           1      2      3      4      5
    1      -    4.0   11.7   20.0   21.5
    2    4.0      -    8.1   16.0   17.9
    3   11.7    8.1      -    9.8    9.8
    4   20.0   16.0    9.8      -    8.0
    5   21.5   17.9    9.8    8.0      -        (5.1)

Figure 5.2: Samples for clustering.

The algorithm begins with five clusters, each consisting of one sample. The two nearest clusters are then merged. The smallest number in (5.1) is 4.0, which is the distance between samples 1 and 2, so the clusters {1} and {2} are merged. At this point there are four clusters:

{1, 2}, {3}, {4}, {5}.

Next obtain the matrix that gives the distances between these clusters:

           {1,2}     3      4      5
    {1,2}     -    8.1   16.0   17.9
    3       8.1      -    9.8    9.8
    4      16.0    9.8      -    8.0
    5      17.9    9.8    8.0      -

The value 8.1 in row {1,2} and column 3 gives the distance between the clusters {1,2} and {3} and is computed in the following way. Matrix (5.1) shows that d(1,3) = 11.7 and d(2,3) = 8.1. In the single-linkage algorithm, the distance between clusters is the minimum of these values, 8.1. The other values in the first row are computed in a similar way. The values in other than the first row or first column are simply copied from the previous table (5.1). Since the minimum value in this matrix is 8.0, the clusters {4} and {5} are merged. At this point there are three clusters:

{1, 2}, {3}, {4, 5}.

Next obtain the matrix that gives the distance between these clusters:

            {1,2}     3   {4,5}
    {1,2}      -    8.1   16.0
    3        8.1      -    9.8
    {4,5}   16.0    9.8      -

Since the minimum value in this matrix is 8.1, the clusters {1,2} and {3} are merged. At this point there are two clusters:

{1, 2, 3}, {4, 5}.

The next step will merge the two remaining clusters at a distance of 9.8. The hierarchical clustering is complete. The dendrogram is shown in Figure 5.3.

Figure 5.3: Hierarchical clustering using the single-linkage algorithm. The distance D_{SL} between clusters that merge is shown on the vertical axis.
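As a quick check of the numbers in (5.1), the distance matrix and the single-linkage distance between two clusters can be computed with a few lines of Python. This sketch is illustrative only and not part of the original text; the sample coordinates are those listed above.

```python
import math

samples = {1: (4, 4), 2: (8, 4), 3: (15, 8), 4: (24, 4), 5: (24, 12)}

def d(a, b):
    """Euclidean distance between two samples, identified by their numbers."""
    return math.dist(samples[a], samples[b])

def d_single_linkage(ci, cj):
    """Smallest pairwise distance between members of two clusters."""
    return min(d(a, b) for a in ci for b in cj)

# Reproduce matrix (5.1), rounded to one decimal place.
for i in samples:
    print([round(d(i, j), 1) if i != j else 0.0 for j in samples])

# Single-linkage distance between {1,2} and {3}: min(11.7, 8.1) = 8.1.
print(round(d_single_linkage({1, 2}, {3}), 1))
```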
The Complete-Linkage Algorithm

The complete-linkage algorithm is also called the maximum method or the farthest neighbor method. It is obtained by defining the distance between two clusters to be the largest distance between a sample in one cluster and a sample in the other cluster. If Ci and Cj are clusters, we define

    D_{CL}(C_i, C_j) = \max_{a \in C_i,\, b \in C_j} d(a, b).

Example 5.2 Hierarchical clustering using the complete-linkage algorithm.

Perform a hierarchical clustering using the complete-linkage algorithm on the data shown in Figure 5.2. Use Euclidean distance (5.1) for the distance between samples.

As before, the algorithm begins with five clusters, each consisting of one sample. The nearest clusters {1} and {2} are then merged to produce the clusters

{1, 2}, {3}, {4}, {5}.

Next obtain the matrix that gives the distances between these clusters:

           {1,2}     3      4      5
    {1,2}     -   11.7   20.0   21.5
    3      11.7      -    9.8    9.8
    4      20.0    9.8      -    8.0
    5      21.5    9.8    8.0      -

The value 11.7 in row {1,2} and column 3 gives the distance between the clusters {1,2} and {3} and is computed in the following way. Matrix (5.1) shows that d(1,3) = 11.7 and d(2,3) = 8.1. In the complete-linkage algorithm, the distance between clusters is the maximum of these values, 11.7. The other values in the first row are computed in a similar way. The values in other than the first row or first column are simply copied from (5.1). Since the minimum value in this matrix is 8.0, the clusters {4} and {5} are merged. At this point the clusters are

{1, 2}, {3}, {4, 5}.

Next obtain the matrix that gives the distance between these clusters:

            {1,2}     3   {4,5}
    {1,2}      -   11.7   21.5
    3       11.7      -    9.8
    {4,5}   21.5    9.8      -

Since the minimum value in this matrix is 9.8, the clusters {3} and {4,5} are merged. At this point the clusters are

{1, 2}, {3, 4, 5}.

Notice that these clusters are different from those obtained at the corresponding point of the single-linkage algorithm.

At the next step, the two remaining clusters will be merged. The hierarchical clustering is complete. The dendrogram is shown in Figure 5.4.

Figure 5.4: Hierarchical clustering using the complete-linkage algorithm.

A cluster, by definition, contains similar samples. The single-linkage algorithm and the complete-linkage algorithm differ in how they determine when the samples in two clusters are similar enough for the clusters to be merged. The single-linkage algorithm says that two clusters Ci and Cj are similar if there are any elements a in Ci and b in Cj that are similar, in the sense that the distance between a and b is small. In other words, in the single-linkage algorithm, it takes only a single similar pair a, b with a in Ci and b in Cj to merge Ci and Cj. (Readers familiar with graph theory will recognize this procedure as that used by Kruskal's algorithm to find a minimum spanning tree.) On the other hand, the complete-linkage algorithm says that two clusters Ci and Cj are similar if the maximum of d(a, b) over all a in Ci and b in Cj is small. In other words, in the complete-linkage algorithm all pairs a, b with a in Ci and b in Cj must be similar in order to merge Ci and Cj.
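The two merge criteria just compared can be stated as two one-line cluster-distance functions. The sketch below is illustrative only; it reuses the sample coordinates of Figure 5.2 and checks the values computed in Examples 5.1 and 5.2.

```python
import math

samples = {1: (4, 4), 2: (8, 4), 3: (15, 8), 4: (24, 4), 5: (24, 12)}

def d(a, b):
    return math.dist(samples[a], samples[b])

def d_single(ci, cj):
    # Single linkage: one close pair is enough to make the clusters close.
    return min(d(a, b) for a in ci for b in cj)

def d_complete(ci, cj):
    # Complete linkage: every pair must be close for the clusters to be close.
    return max(d(a, b) for a in ci for b in cj)

print(round(d_single({1, 2}, {3}), 1))    # 8.1, as in Example 5.1
print(round(d_complete({1, 2}, {3}), 1))  # 11.7, as in Example 5.2
```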
The Average-Linkage Algorithm

The single-linkage algorithm allows clusters to grow long and thin, whereas the complete-linkage algorithm produces more compact clusters. Both clusterings are susceptible to distortion by outliers or deviant observations. The average-linkage algorithm is an attempt to compromise between the extremes of the single- and complete-linkage algorithms.

The average-linkage clustering algorithm, also known as the unweighted pair-group method using arithmetic averages (UPGMA), is one of the most widely used hierarchical clustering algorithms. The average-linkage algorithm is obtained by defining the distance between two clusters to be the average distance between a point in one cluster and a point in the other cluster. Formally, if Ci is a cluster with ni members and Cj is a cluster with nj members, the distance between the clusters is

    D_{AL}(C_i, C_j) = \frac{1}{n_i n_j} \sum_{a \in C_i,\, b \in C_j} d(a, b).

Example 5.3 Hierarchical clustering using the average-linkage algorithm.

Perform a hierarchical clustering using the average-linkage algorithm on the data shown in Figure 5.2. Use Euclidean distance (5.1) for the distance between samples.

The algorithm begins with five clusters, each consisting of one sample. The nearest clusters {1} and {2} are then merged to form the clusters

{1, 2}, {3}, {4}, {5}.

Next obtain the matrix that gives the distance between these clusters:

           {1,2}     3      4      5
    {1,2}     -    9.9   18.0   19.7
    3       9.9      -    9.8    9.8
    4      18.0    9.8      -    8.0
    5      19.7    9.8    8.0      -

The value 9.9 in row {1,2} and column 3 gives the distance between the clusters {1,2} and {3} and is computed in the following way. Matrix (5.1) shows that d(1,3) = 11.7 and d(2,3) = 8.1. In the average-linkage algorithm, the distance between clusters is the average of these values, 9.9. The other values in the first row are computed in a similar way. The values in other than the first row or first column are simply copied from (5.1). Since the minimum value in this matrix is 8.0, the clusters {4} and {5} are merged. At this point the clusters are

{1, 2}, {3}, {4, 5}.

Next obtain the matrix that gives the distance between these clusters:

            {1,2}     3   {4,5}
    {1,2}      -    9.9   18.9
    3        9.9      -    9.8
    {4,5}   18.9    9.8      -

Since the minimum value in this matrix is 9.8, the clusters {3} and {4,5} are merged. At this point the clusters are

{1, 2}, {3, 4, 5}.

At the next step, the two remaining clusters are merged and the hierarchical clustering is complete.

An example of the application of the average-linkage algorithm to a larger data set using the SAS statistical analysis software package is presented in Appendix B.4.
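Readers without access to SAS can carry out the same kind of average-linkage analysis with freely available software. The sketch below uses the SciPy library (its scipy.cluster.hierarchy module) on the small data set of Figure 5.2; it is an illustrative substitute for, not a reproduction of, the SAS procedure referred to above.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# The five samples of Figure 5.2.
X = np.array([[4, 4], [8, 4], [15, 8], [24, 4], [24, 12]], dtype=float)

# Build the average-linkage (UPGMA) hierarchy; 'single', 'complete',
# and 'ward' select the other methods described in this section.
Z = linkage(X, method='average', metric='euclidean')
print(Z)  # each row: indices of the two merged clusters, their distance, new cluster size

# Cut the dendrogram to obtain, say, two clusters.
labels = fcluster(Z, t=2, criterion='maxclust')
print(labels)  # cluster label for each of the five samples
```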
Ward's Method

Ward's method is also called the minimum-variance method. Like the other algorithms, Ward's method begins with one cluster for each individual sample. At each iteration, among all pairs of clusters, it merges the pair that produces the smallest squared error for the resulting set of clusters. The squared error for each cluster is defined as follows. If a cluster contains m samples x_1, ..., x_m, where x_i is the feature vector (x_{i1}, ..., x_{id}), the squared error for sample x_i, which is the squared Euclidean distance from the mean, is

    \sum_{j=1}^{d} (x_{ij} - \mu_j)^2,

where \mu_j is the mean value of feature j for the samples in the cluster:

    \mu_j = \frac{1}{m} \sum_{i=1}^{m} x_{ij}.

The squared error E for the entire cluster is the sum of the squared errors of the samples:

    E = \sum_{i=1}^{m} \sum_{j=1}^{d} (x_{ij} - \mu_j)^2 = m \sigma^2.

The vector composed of the means of each feature, \mu = (\mu_1, ..., \mu_d), is called the mean vector or centroid of the cluster. The squared error for a cluster is the sum of the squared distances in each feature from the cluster members to their mean. The squared error is thus equal to the total variance of the cluster \sigma^2 times the number of samples in the cluster m, where the total variance is defined to be \sigma^2 = \sigma_1^2 + ... + \sigma_d^2, the sum of the variances for each feature. The squared error for a set of clusters is defined to be the sum of the squared errors for the individual clusters.
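The squared error E and the equivalent form m times the total variance are easy to compute directly. The following Python sketch is illustrative only; it evaluates E for the cluster {1, 2} of Figure 5.2, the value 8 worked out in Example 5.4 below.

```python
def centroid(cluster):
    m = len(cluster)
    return [sum(x[j] for x in cluster) / m for j in range(len(cluster[0]))]

def squared_error(cluster):
    """Sum of squared Euclidean distances from the members to the cluster centroid."""
    mu = centroid(cluster)
    return sum((x[j] - mu[j]) ** 2 for x in cluster for j in range(len(mu)))

c = [(4, 4), (8, 4)]            # samples 1 and 2 of Figure 5.2
print(squared_error(c))         # 8.0

# Equivalent form E = m * sigma^2, with sigma^2 the sum of per-feature variances.
m, mu = len(c), centroid(c)
total_variance = sum(sum((x[j] - mu[j]) ** 2 for x in c) / m for j in range(len(mu)))
print(m * total_variance)       # also 8.0
```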
Example 5.4 Hierarchical clustering using Ward's method.

Perform a hierarchical clustering using Ward's method on the data shown in Figure 5.2. The algorithm begins with five clusters, each consisting of one sample. At this point, the squared error is zero. There are 10 possible ways to merge a pair of clusters: merge {1} and {2}, merge {1} and {3}, and so on. Figure 5.5 shows the squared error for each possibility. For example, consider merging {1} and {2}. Since sample 1 has the feature vector (4,4) and sample 2 has the feature vector (8,4), the feature means are 6 and 4. The squared error for cluster {1,2} is

(4 - 6)^2 + (8 - 6)^2 + (4 - 4)^2 + (4 - 4)^2 = 8.

The squared error for each of the other clusters {3}, {4}, and {5} is 0. Thus the total squared error for the clusters {1,2}, {3}, {4}, {5} is

8 + 0 + 0 + 0 = 8.

Since the smallest squared error in Figure 5.5 is 8, the clusters {1} and {2} are merged to give the clusters

{1, 2}, {3}, {4}, {5}.
    Clusters                Squared Error, E
    {1,2},{3},{4},{5}              8.0
    {1,3},{2},{4},{5}             68.5
    {1,4},{2},{3},{5}            200.0
    {1,5},{2},{3},{4}            232.0
    {2,3},{1},{4},{5}             32.5
    {2,4},{1},{3},{5}            128.0
    {2,5},{1},{3},{4}            160.0
    {3,4},{1},{2},{5}             48.5
    {3,5},{1},{2},{4}             48.5
    {4,5},{1},{2},{3}             32.0

Figure 5.5: Squared errors for each way of creating four clusters.

    Clusters                Squared Error, E
    {1,2,3},{4},{5}               72.7
    {1,2,4},{3},{5}              224.0
    {1,2,5},{3},{4}              266.7
    {1,2},{3,4},{5}               56.5
    {1,2},{3,5},{4}               56.5
    {1,2},{4,5},{3}               40.0

Figure 5.6: Squared errors for three clusters.

Figure 5.6 shows the squared error for all possible sets of clusters that result from merging two of {1,2}, {3}, {4}, {5}. Since the smallest squared error in Figure 5.6 is 40, the clusters {4} and {5} are merged to form the clusters

{1, 2}, {3}, {4, 5}.

Figure 5.7 shows the squared error for all possible sets of clusters that result from merging two of {1,2}, {3}, {4,5}. Since the smallest squared error in Figure 5.7 is 94, the clusters {3} and {4,5} are merged to give the clusters

{1, 2}, {3, 4, 5}.
    Clusters                Squared Error, E
    {1,2,3},{4,5}                104.7
    {1,2,4,5},{3}                380.0
    {1,2},{3,4,5}                 94.0

Figure 5.7: Squared errors for two clusters.

At the next step, the two remaining clusters are merged and the hierarchical clustering is complete. The resulting dendrogram is shown in Figure 5.8.

Figure 5.8: Dendrogram for Ward's method.

5.3 Partitional Clustering

Agglomerative clustering (Section 5.2) creates a series of nested clusters. This contrasts with partitional clustering, in which the goal is usually to create one set of clusters that partitions the data into similar groups. Samples close to one another are assumed to be similar, and the goal of the partitional clustering algorithms is to group data that are close together. In many of the partitional algorithms, the number of clusters to be constructed is specified in advance.

If a partitional algorithm is used to divide the data set into two groups, and then each of these groups is divided into two parts, and so on, a hierarchical dendrogram could be produced from the top down. The hierarchy produced by this divisive technique is more general than the bottom-up hierarchies produced by agglomerative techniques because the groups can be divided into more than two subgroups in one step. (The only way this could happen for an agglomerative technique would be for two distances to tie, which would be extremely unlikely even if allowed by the algorithm.) Another advantage of partitional techniques is that only the top part of the tree, which shows the main groups and possibly their subgroups, may be required, and there may be no need to complete the dendrogram. All of the examples in this section assume that Euclidean distances are used, but the techniques could use any distance measure.

Forgy's Algorithm

One of the simplest partitional clustering algorithms is Forgy's algorithm [Forgy]. Besides the data, input to the algorithm consists of k, the number of clusters to be constructed, and k samples called seed points. The seed points could be chosen randomly, or some knowledge of the desired cluster structure could be used to guide their selection.

Forgy's Algorithm

1. Initialize the cluster centroids to the seed points.

2. For each sample, find the cluster centroid nearest it. Put the sample in the cluster identified with this nearest cluster centroid.

3. If no samples changed clusters in step 2, stop.

4. Compute the centroids of the resulting clusters and go to step 2.
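A direct transcription of these four steps into Python might look as follows. This sketch is illustrative rather than taken from the text; it assumes Euclidean distance and represents each sample as a tuple of feature values.

```python
import math

def forgy(samples, seeds):
    """Forgy's algorithm: repeatedly assign samples to the nearest centroid and
    recompute centroids until no sample changes clusters."""
    centroids = [list(s) for s in seeds]                 # step 1
    assignment = [None] * len(samples)
    while True:
        changed = False
        for i, x in enumerate(samples):                  # step 2
            nearest = min(range(len(centroids)),
                          key=lambda c: math.dist(x, centroids[c]))
            if assignment[i] != nearest:
                assignment[i] = nearest
                changed = True
        if not changed:                                  # step 3
            return centroids, assignment
        for c in range(len(centroids)):                  # step 4
            members = [x for i, x in enumerate(samples) if assignment[i] == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]

# Example 5.5 data: k = 2 with the first two samples as seed points.
data = [(4, 4), (8, 4), (15, 8), (24, 4), (24, 12)]
print(forgy(data, seeds=data[:2]))
```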
Example 5.5 Partitional clustering using Forgy's algorithm.

Perform a partitional clustering using Forgy's algorithm on the data shown in Figure 5.2. Set k = 2, which will produce two clusters, and use the first two samples (4,4) and (8,4) in the list as seed points. In this algorithm, the samples will be denoted by their feature vectors rather than their sample numbers to aid in the computation.

For step 2, find the nearest cluster centroid for each sample. Figure 5.9 shows the results. The clusters {(4,4)} and {(8,4), (15,8), (24,4), (24,12)} are produced.

    Sample      Nearest Cluster Centroid
    (4,4)       (4,4)
    (8,4)       (8,4)
    (15,8)      (8,4)
    (24,4)      (8,4)
    (24,12)     (8,4)

Figure 5.9: First iteration of Forgy's algorithm.

For step 4, compute the centroids of the clusters. The centroid of the first cluster is (4,4). The centroid of the second cluster is (17.75, 7) since

(8 + 15 + 24 + 24)/4 = 17.75 and (4 + 8 + 4 + 12)/4 = 7.

Since some samples changed clusters (there were initially no clusters), return to step 2.

Find the cluster centroid nearest each sample. Figure 5.10 shows the results. The clusters {(4,4), (8,4)} and {(15,8), (24,4), (24,12)} are produced.

    Sample      Nearest Cluster Centroid
    (4,4)       (4,4)
    (8,4)       (4,4)
    (15,8)      (17.75,7)
    (24,4)      (17.75,7)
    (24,12)     (17.75,7)

Figure 5.10: Second iteration of Forgy's algorithm.

For step 4, compute the centroids (6,4) and (21,8) of the clusters. Since the sample (8,4) changed clusters, return to step 2.

Find the cluster centroid nearest each sample. Figure 5.11 shows the results. The clusters {(4,4), (8,4)} and {(15,8), (24,4), (24,12)} are obtained.

    Sample      Nearest Cluster Centroid
    (4,4)       (6,4)
    (8,4)       (6,4)
    (15,8)      (21,8)
    (24,4)      (21,8)
    (24,12)     (21,8)

Figure 5.11: Third iteration of Forgy's algorithm.

For step 4, compute the centroids (6,4) and (21,8) of the clusters. Since no sample will change clusters, the algorithm terminates.

In this version of Forgy's algorithm, the seed points are chosen arbitrarily as the first two samples; however, other possibilities have been suggested. One alternative is to begin with k clusters generated by one of the hierarchical clustering algorithms and use their centroids as initial seed points.

It has been proved [Selim] that Forgy's algorithm terminates; that is, eventually no samples change clusters. However, if the number of samples is large, it may take the algorithm considerable time to produce stable clusters. For this reason, some versions of Forgy's algorithm allow the user to restrict the number of iterations. Other versions of Forgy's algorithm [Dubes] permit the user to supply parameters that allow new clusters to be created and to establish a minimum cluster size.

The k-means Algorithm

An algorithm similar to Forgy's algorithm is known as the k-means algorithm. Besides the data, input to the algorithm consists of k, the number of clusters to be constructed. The k-means algorithm differs from Forgy's algorithm in that the centroids of the clusters are recomputed as soon as a sample joins a cluster. Also, unlike Forgy's algorithm, which is iterative, the k-means algorithm makes only two passes through the data set.

k-means Algorithm

1. Begin with k clusters, each consisting of one of the first k samples. For each of the remaining n - k samples, find the centroid nearest it. Put the sample in the cluster identified with this nearest centroid. After each sample is assigned, recompute the centroid of the altered cluster.

2. Go through the data a second time. For each sample, find the centroid nearest it. Put the sample in the cluster identified with this nearest centroid. (During this step, do not recompute any centroid.)
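The two passes of this version of k-means can be written out directly. The following Python sketch is illustrative only; it uses Euclidean distance and takes the first k samples as the initial clusters, as in step 1.

```python
import math

def k_means_two_pass(samples, k):
    """Two-pass k-means as described above: incremental centroid updates in
    the first pass, a final reassignment (without updates) in the second."""
    clusters = [[list(samples[i])] for i in range(k)]
    centroids = [list(samples[i]) for i in range(k)]

    # Pass 1: assign each remaining sample and update that cluster's centroid.
    for x in samples[k:]:
        c = min(range(k), key=lambda j: math.dist(x, centroids[j]))
        clusters[c].append(list(x))
        centroids[c] = [sum(col) / len(clusters[c]) for col in zip(*clusters[c])]

    # Pass 2: reassign every sample to its nearest centroid, centroids fixed.
    labels = [min(range(k), key=lambda j: math.dist(x, centroids[j])) for x in samples]
    return centroids, labels

data = [(8, 4), (24, 4), (15, 8), (4, 4), (24, 12)]   # ordering assumed in Example 5.6
print(k_means_two_pass(data, k=2))
```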
Example 5.6 Partitional clustering using the k-means algorithm.

Perform a partitional clustering using the k-means algorithm on the data in Figure 5.2. Set k = 2 and assume that the data are ordered so that the first two samples are (8,4) and (24,4).

For step 1, begin with two clusters {(8,4)} and {(24,4)}, which have centroids at (8,4) and (24,4). For each of the remaining three samples, find the centroid nearest it, put the sample in this cluster, and recompute the centroid of this cluster.

The next sample (15,8) is nearest the centroid (8,4), so it joins cluster {(8,4)}. At this point, the clusters are {(8,4), (15,8)} and {(24,4)}. The centroid of the first cluster is updated to (11.5,6) since

(8 + 15)/2 = 11.5 and (4 + 8)/2 = 6.

The next sample (4,4) is nearest the centroid (11.5,6), so it joins cluster {(8,4), (15,8)}. At this point, the clusters are {(8,4), (15,8), (4,4)} and {(24,4)}. The centroid of the first cluster is updated to (9,5.3).

The next sample (24,12) is nearest the centroid (24,4), so it joins cluster {(24,4)}. At this point, the clusters are {(8,4), (15,8), (4,4)} and {(24,12), (24,4)}. The centroid of the second cluster is updated to (24,8). At this point, step 1 of the algorithm is complete.

For step 2, examine the samples one by one and put each one in the cluster identified with the nearest centroid. As Figure 5.12 shows, in this case no sample changes clusters. The resulting clusters are

{(8,4), (15,8), (4,4)} and {(24,12), (24,4)}.

    Sample      Distance to Centroid (9,5.3)    Distance to Centroid (24,8)
    (8,4)                  1.6                           16.5
    (24,4)                15.1                            4.0
    (15,8)                 6.6                            9.0
    (4,4)                  5.2                           20.4
    (24,12)               16.4                            4.0

Figure 5.12: Distances for use by step 2 of the k-means algorithm.

An alternative version of the k-means algorithm iterates step 2. Specifically, step 2 is replaced by the following steps 2 through 4:

2. For each sample, find the centroid nearest it. Put the sample in the cluster identified with this nearest centroid.

3. If no samples changed clusters, stop.

4. Recompute the centroids of the altered clusters and go to step 2.

The goal of Forgy's algorithm and the k-means algorithm is to minimize the squared error (as defined in Section 5.2) for a fixed number of clusters. These algorithms assign samples to clusters so as to reduce the squared error and, in the iterative versions, they stop when no further reduction occurs. However, to achieve reasonable computation time, they do not consider all possible clusterings. For this reason, they sometimes terminate with a clustering that achieves a local minimum squared error. Furthermore, in general, the clusterings that these algorithms generate depend on the choice of the seed points. For example, if Forgy's algorithm is applied to the data in Figure 5.2 using (8,4) and (24,12) as seed points, the algorithm terminates with the clusters

{(4,4), (8,4), (15,8)}, {(24,4), (24,12)}.        (5.2)

This is different from the clustering produced in Example 5.5. The clustering (5.2) has a squared error of 104.7, whereas the clustering of Example 5.5 has a squared error of 94. The clustering (5.2) produces a local minimum; the clustering of Example 5.5 can be shown to produce a global minimum. For a given set of seed points, the resulting clusters may also depend on the order in which the points are checked.

The Isodata Algorithm

The isodata algorithm can be considered to be an enhancement of the approach taken by Forgy's algorithm and the k-means algorithm. Like those algorithms, it tries to minimize the squared error by assigning samples to the nearest centroid. Unlike those algorithms, it does not deal with a fixed number of clusters; rather, it deals with k clusters, where k is allowed to range over an interval that includes the number of clusters requested by the user. It discards clusters with too few elements. Clusters are merged if the number of clusters grows too large or if clusters are too close together. A cluster is split if the number of clusters is too small or if the cluster contains very dissimilar samples. The specific details follow.

Besides the data and seed points, the following parameters are required by the isodata algorithm:

no_clusters: the desired number of clusters, which is also the number of seed points

min_elements: the minimum number of samples permitted per cluster

min_dist: the minimum distance permitted between cluster centroids without merging them

split_size: a parameter that controls splitting
iter_start: the maximum number of iterations in the first part of the algorithm

max_merge: the maximum number of cluster merges per iteration

iter_body: the maximum number of iterations within the main part of the algorithm

These parameters are further explained in the algorithm.

Isodata Algorithm

1. (Steps 1 through 4 are like Forgy's algorithm.) Initialize the cluster centroids to the seed points.

2. For each sample, find the cluster centroid nearest it. Put the sample into the cluster identified with this nearest cluster centroid.

3. Compute the centroids of the resulting clusters.

4. If at least one sample changed clusters and the number of iterations is less than iter_start, go to step 2.

5. Discard clusters with fewer than min_elements samples. Also discard the samples they contain.

6. If the number of clusters is greater than or equal to 2 * no_clusters or the number of this iteration is even, execute step 7 (merging operation); otherwise, go to step 8.

7. If the distance between two centroids is less than min_dist, merge these clusters and update the centroid; otherwise, go to step 8. Repeat this step up to max_merge times and then go to step 8.

8. If the number of clusters is less than or equal to no_clusters/2 or the number of this iteration is odd, execute step 9 (splitting operation); otherwise, go to step 10.

9. Find a cluster that has a standard deviation for some variable, say x, which exceeds split_size * σx, where σx is the standard deviation of x in the complete original set of samples. If none, go to step 10. Compute the mean of x within the cluster. Separate the samples in this cluster into two sets, one consisting of those in which x is greater than or equal to the mean, and the other consisting of those in which x is less than the mean. Compute the centroids of these two clusters. If the distance between these centroids is greater than or equal to 1.1 * min_dist, replace the original cluster by these two clusters; otherwise, do not split the cluster.

10. If step 10 has been executed iter_body times or no changes occurred in the clusters since the last time step 10 was executed, stop; otherwise, take the centroids of the clusters as new seed points and go to step 2.

Example 5.7 Partitional clustering using the isodata algorithm.

Use the isodata algorithm to cluster the data shown in Figure 5.13 when the parameters are

no_clusters = 3
min_elements = 2
min_dist = 3
split_size = 0.2
iter_start = 5
max_merge = 1
iter_body = 5

and the seed points are samples 1, 3, and 13. The first iteration is numbered 0.

    Number    x      y        Number    x      y
       1     0.0    0.0          8     6.0   0.75
       2     0.0    1.0          9     6.0   1.00
       3     0.0    3.0         10     6.0   2.00
       4     0.5    0.5         11     6.0   2.10
       5     0.5    3.5         12     6.2   0.80
       6     1.0    1.0         13     6.2   2.05
       7     1.0    3.0         14     8.0  12.00

Figure 5.13: Samples for the isodata clustering.

One pass through the data is required for convergence of the Forgy part of the algorithm (steps 1 through 4). At this point, there are three clusters:

{1, 2, 4, 6}, {3, 5, 7}, {8, 9, 10, 11, 12, 13, 14}.

For step 5, since no cluster has fewer than min_elements members, none is discarded.
For step 6, the number of clusters is not greater than or equal to 2 * no_clusters and the number (0) of this iteration is even, so the merging operation (step 7) is executed.

Since the distance between the centroids of the clusters {1, 2, 4, 6} and {3, 5, 7} is less than min_dist, these clusters are merged. At this point, there are two clusters:

{1, 2, 3, 4, 5, 6, 7}, {8, 9, 10, 11, 12, 13, 14}.

The merge step is not repeated (since max_merge = 1), so proceed to step 8. (In this case, the remaining clusters could not be merged anyway since the distance between their centroids is greater than min_dist.)

For step 8, the number of clusters (2) is greater than no_clusters/2 = 1.5 and the number of this iteration is not odd, so go to step 10.

For step 10, since the number of iterations is less than the requested number (5) and the clusters changed, proceed to step 2.

This time the Forgy part of the algorithm (steps 1 through 4) does not change the clusters.

For step 5, since no cluster has fewer than min_elements members, none is discarded.

For step 6, the number of clusters is not greater than or equal to 2 * no_clusters and the number (1) of this iteration is odd, so proceed to step 8.

For step 8, the number of clusters (2) is greater than no_clusters/2 and the number of this iteration is odd, so the splitting operation (step 9) is executed.

For step 9, there is a cluster, {8, 9, 10, 11, 12, 13, 14}, that has a standard deviation for the variable y exceeding split_size * σy. The samples are then divided into two sets that have y values less than, or greater than, the mean value of y in the cluster:

{8, 9, 10, 11, 12, 13}, {14}.

Since the distance between their centroids is greater than or equal to 1.1 * min_dist, the cluster remains split. At this point there are three clusters:

{1, 2, 3, 4, 5, 6, 7}, {8, 9, 10, 11, 12, 13}, {14}.

For step 10, since the number of iterations is less than the requested number and the clusters changed, proceed to step 2.

Again, the Forgy part of the algorithm (steps 1 through 4) does not change the clusters.

For step 5, cluster {14} is discarded since it has fewer than min_elements members. At this point there are two clusters:

{1, 2, 3, 4, 5, 6, 7}, {8, 9, 10, 11, 12, 13}.

For step 6, the number of clusters is less than 2 * no_clusters and the number (2) of this iteration is even, so the merging operation (step 7) is executed.

Since the distance between the centroids of the clusters

{1, 2, 3, 4, 5, 6, 7} and {8, 9, 10, 11, 12, 13}

is not less than min_dist, these clusters are not merged. Proceed to step 8.

For step 8, the number of clusters (2) is greater than no_clusters/2 and the number of this iteration is even, so go to step 10.

For step 10, since the number of iterations is less than the requested number and the clusters changed, proceed to step 2.

Again the Forgy part of the algorithm (steps 1 through 4) does not change the clusters.

For step 5, since no cluster has fewer than min_elements members, none is discarded.

For step 6, the number of clusters is less than 2 * no_clusters and the number (3) of this iteration is odd, so proceed to step 8.

For step 8, the number of clusters (2) is greater than no_clusters/2 and the number of this iteration is odd, so the splitting operation (step 9) is executed.

For step 9, no cluster has a variable whose standard deviation exceeds split_size * σ, so proceed to step 10.

For step 10, the number of iterations is less than the requested number, but no clusters changed, so the algorithm terminates.

The isodata algorithm has been used in many engineering and scientific applications. Example 5.7 shows that the algorithm requires extensive computation, even for a small data set. For large data sets, it may be too expensive to run on conventional single-processor computers. [Tilton] reports that the isodata algorithm required seven hours of computing time on a VAX-11/780 to cluster the gray levels in a 512 x 512 image into 16 clusters. Parallel systems can do much better; the same problem ran in 20 seconds on a system containing an array of 16,384 processors. Whether large amounts of time or massively parallel computers are necessary is debatable. One study [Mezzich] showed that simple k-means algorithms often outperform the isodata algorithm.
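The splitting test of step 9 is the least familiar part of the isodata algorithm, so a small sketch may help. The Python fragment below is illustrative only; it applies the test to the cluster {8, ..., 14} of Example 5.7 along the y feature, using the split_size and min_dist values given there.

```python
import math
import statistics

# Feature values of Figure 5.13 (sample number -> (x, y)).
points = {1: (0.0, 0.0), 2: (0.0, 1.0), 3: (0.0, 3.0), 4: (0.5, 0.5),
          5: (0.5, 3.5), 6: (1.0, 1.0), 7: (1.0, 3.0), 8: (6.0, 0.75),
          9: (6.0, 1.0), 10: (6.0, 2.0), 11: (6.0, 2.1), 12: (6.2, 0.8),
          13: (6.2, 2.05), 14: (8.0, 12.0)}

split_size, min_dist = 0.2, 3
cluster = [8, 9, 10, 11, 12, 13, 14]
j = 1                                        # test the y feature

def centroid(ids):
    return [statistics.mean(points[i][k] for i in ids) for k in (0, 1)]

# Standard deviation of y over all samples, and within the candidate cluster.
sigma_all = statistics.pstdev(p[j] for p in points.values())
sigma_cluster = statistics.pstdev(points[i][j] for i in cluster)

if sigma_cluster > split_size * sigma_all:
    mean_y = statistics.mean(points[i][j] for i in cluster)
    high = [i for i in cluster if points[i][j] >= mean_y]
    low = [i for i in cluster if points[i][j] < mean_y]
    if math.dist(centroid(high), centroid(low)) >= 1.1 * min_dist:
        print("split into", low, "and", high)    # [8,...,13] and [14], as in Example 5.7
```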
Example 5.8 Applying the isodata algorithm to the 8OX data set.

In this example, developed by Daniel Kusswurm, the isodata algorithm is applied to a set of real data.

A standard data set used in many pattern recognition studies consists of 24 x 24 pixel digitized binary images of handwritten FORTRAN characters written by several persons. Eight features were obtained from each sample by counting the number of consecutive noncharacter pixels on a line to the center beginning at each of eight boundary positions (see Figure 5.14). The isodata algorithm was applied to a data set consisting of 15 images of each of the characters 8, O, and X. Figure 5.15 gives the feature values and classes for this data set. (The class information is not used by the clustering algorithm.) The parameters chosen were

no_clusters = 4
min_elements = 8
min_dist = 2.5
split_size = 0.5
iter_start = 5
max_merge = 1
iter_body = 3

and the seed points were the first four samples.

Figure 5.14: A digitized binary image of a handwritten X that has a feature vector of (4, 10, 10, 4, 10, 9, 11, 9).

Figure 5.15: 8OX data. Each record lists the eight feature values followed by the class (8, O, or X); two records are shown per line.

    0  5 11  6  9 11  5  9  X     7  7  7  7  5  6  5  5  O
    0  8  7  6  6  7  2  4  O     7  7  7  7  6  6  3  2  O
    5  5  5 12 10  7  2  3  8     7  7  8  7  5  6  4  5  O
    5  6  7  7  9  9  9  7  X     7  8  4  4  7  6  2  3  8
    5  7  8  7  8  9  9  8  X     7  8  6  5  4  6  2  4  O
    5 13  6  4  6 13  3 13  8     7 10  5  5  8  7  1 20  8
    6  6  5  5  4  3  4  5  O     7 10  6  6  6  8  3  3  8
    6  6  7  7  8  8  3  2  8     7 12  8  6  9 11  9  1  X
    6  7  7  7  9  8  4  5  8     7 13  5  5  6 13  2  3  8
    6  7 11  6  8 11  7 10  X     8  7  5  4  6 10  1  0  8
    6  8  7  5  5  6  2  2  O     8  7  6  6  8  7  2  0  8
    6 10  5  2  6  8  1  2  8     8  7  7  6  6  5  4  5  O
    6 10  6 10  8  8 13  4  X     8  8  7  6  5  7  5  5  O
    6 10  7  8  8  9  4  4  8     8  8  7  6  7  6  4  4  O
    6 12 10  4  8 11  8  3  X     8 10  7 10  9  8 11  4  X
    7  6  6  5  3  2  5  4  O     9  5  6  7 10  9  7  5  X
    7  6  7  5  5  5  4  3  O     9  7  7  6  8  6  4  3  O
    7  6  7  6  3  4  6  5  O     9 10  6  6  8 10  2  3  8
    7  7  5  6  3  3  4  6  O    10  4  4  4  8  8  3 10  X
    7  7  6  5 10 10 10  8  X    10  5  6  4  9  9  6 11  X
    7  7  6  6  8  7  2  3  8    10  7  4  4  9  9  3  9  X
    7  7  6  7  7  7  1  1  8    10  7  6  6  9  9  7 10  X
                                 11  8  7 10 11 10  6  9  X

In the first iteration of the isodata algorithm, after the maximum number (iter_start = 5) of iterations of the Forgy part of the algorithm (steps 1 through 4), four clusters were obtained, which contained the following numbers of samples from the three classes:

    Class    Cluster 1    Cluster 2    Cluster 3    Cluster 4
      8          2            0           13            0
      O          0           14            1            0
      X          1            0            0           14
For step 5, since cluster 1 has only three samples, it is discarded. At this point, the clusters are

    Class    Cluster 1    Cluster 2    Cluster 3    Cluster 4
      8          0            0           13            0
      O          0           14            1            0
      X          0            0            0           14

For step 6, the number of clusters is not greater than or equal to 2 * no_clusters and the number (0) of this iteration is even, so the merging operation (step 7) is executed. However, no clusters are merged.

For step 8, the number of clusters is greater than no_clusters/2 = 2 and the number of this iteration is not odd, so go to step 10.

For step 10, the number of iterations is less than the requested number (iter_body = 3) and the clusters changed, so proceed to step 2.

In the second iteration of the isodata algorithm, the Forgy part of the algorithm does not change the clusters.

For step 5, since no cluster has fewer than min_elements members, none is discarded.

For step 6, the number of clusters is not greater than or equal to 2 * no_clusters and the number (1) of this iteration is odd, so proceed to step 8.

For step 8, the number of clusters is greater than no_clusters/2 and the number of this iteration is odd, so execute the splitting operation (step 9).

For step 9, a split occurs, producing the clusters

    Class    Cluster 1    Cluster 2    Cluster 3    Cluster 4
      8          0            0           13            0
      O         11            3            1            0
      X          0            0            0           14

For step 10, since the number of iterations is less than the requested number and the clusters changed, proceed to step 2.

In the third and final iteration of the isodata algorithm, the Forgy part of the algorithm converges after two iterations to produce the clusters

    Class    Cluster 1    Cluster 2    Cluster 3    Cluster 4
      8          0            0           13            0
      O         11            2            2            0
      X          0            0            0           14
For step 5, since cluster 2 has only two samples, it is discarded. At this point, the clusters are

    Class    Cluster 1    Cluster 2    Cluster 3    Cluster 4
      8          0            0           13            0
      O         11            0            2            0
      X          0            0            0           14

For step 6, the number of clusters is not greater than or equal to 2 * no_clusters and the number (2) of this iteration is even, so the merging operation (step 7) is executed. However, no clusters are merged.

For step 8, the number of clusters is greater than no_clusters/2 and the number of this iteration is not odd, so go to step 10. Since the number of iterations equals the requested number, the algorithm terminates.

For the particular values chosen for the parameters, the isodata algorithm obtained three clusters: one consisting of 11 Os, one consisting of 13 8s and two Os, and one consisting of 14 Xs. In addition, it discarded five samples. The algorithm did a good job of finding clusters corresponding to the three classes.

If only the eight features had been known and not the classes, it would have been very difficult to infer the existence of these classes by examining the data in Figure 5.15. They would also probably not be obvious in a scatterplot of any pair of the features. Plots for three arbitrarily chosen pairs are shown in Figures 5.16a, b, and c. However, now that the means of three tentative clusters are known, a plane can be defined by these three points in the 8-dimensional space and all the data points can be projected onto this plane. This has been done to produce Figure 5.16d. The classes overlap much less in this plane than in the other plots, but it is still not obvious that the data would cluster into three groups in the 8-dimensional space.

Figure 5.16: Projections of the 8OX data. The pairs of features are (a) the upper left and side diagonals, (b) the two side lengths, (c) the top and right lengths. In (d), x and y are two orthogonal axes lying in the plane that passes through the centers of the three clusters.
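The projection used for Figure 5.16d can be reproduced with a little linear algebra: the three cluster centroids define a plane, and each data point is projected onto an orthonormal basis of that plane. The NumPy sketch below is illustrative only; the array names and the placeholder data are assumptions, not values from the text.

```python
import numpy as np

def plane_coordinates(X, c1, c2, c3):
    """Project rows of X onto the plane through centroids c1, c2, c3.

    Returns two coordinates per sample: the components of (x - c1) along an
    orthonormal basis (u, v) of the plane, obtained by Gram-Schmidt from
    the directions c2 - c1 and c3 - c1.
    """
    u = c2 - c1
    u = u / np.linalg.norm(u)
    w = c3 - c1
    v = w - np.dot(w, u) * u          # remove the component along u
    v = v / np.linalg.norm(v)
    D = X - c1
    return np.column_stack((D @ u, D @ v))

# Hypothetical usage: X would hold the 45 eight-dimensional 8OX feature vectors
# and c1, c2, c3 the centroids of the three clusters found by the algorithm.
X = np.random.rand(45, 8)             # placeholder data, not the real 8OX values
c1, c2, c3 = X[:15].mean(axis=0), X[15:30].mean(axis=0), X[30:].mean(axis=0)
print(plane_coordinates(X, c1, c2, c3)[:3])
```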
5.4 Problems

5.1. Give five examples, different from those in this chapter, of real situations in which clustering might be useful.

5.2. A cluster contains three samples at (0,1), (0,2), and (0,3). Another cluster contains samples at (1,7), (1,8), and (1,9).

(a) What is the single-linkage distance between the clusters if city block distance is used? [Ans: 5]

(b) What is the single-linkage distance between the clusters if Euclidean distance is used?

(c) What is the complete-linkage distance between the clusters if city block distance is used?

(d) What is the complete-linkage distance between the clusters if Euclidean distance is used? [Ans: √65]

    Sample    x     y
       1     0.0   0.0
       2     0.5   0.0
       3     0.0   2.0
       4     2.0   2.0
       5     2.5   8.0
       6     6.0   3.0
       7     7.0   3.0

Figure 5.17: Samples for problems.

5.3. Perform a hierarchical clustering of the data in Figure 5.17 using the single-linkage algorithm and Euclidean distance. Show the distance matrices and the dendrogram with the cluster distance as the vertical axis.

5.4. Compute the average-linkage distance between the two clusters {(3, 4), (5, 6)} and {(1, 1), (2, 2)},

(a) using city block distance between points. [Ans: 6]

(b) using Euclidean distance between points.

5.5. Perform a hierarchical clustering of the data in Figure 5.17 using the single-linkage algorithm and city block distance. Show the distance matrices and the dendrogram.

5.6. Perform a hierarchical clustering of the data in Figure 5.17 using the complete-linkage algorithm and Euclidean distance. Show the distance matrices and the dendrogram.

5.7. Perform a hierarchical clustering of the data in Figure 5.17 using the complete-linkage algorithm and city block distance. Show the distance matrices and the dendrogram.
5.8. Perform a hierarchical clustering of the data in Figure 5.17 using the average-
linkage algorithm and Euclidean distance. Show the distance matrices and the
dendrogram.
5.9. Perform a hierarchical clustering of the data in Figure 5.17 using the average-
linkage algorithm and city block distance. Show the distance matrices and the
dendrogram.
5.10. Perform a hierarchical clustering of the data in Figure 5.17 using Ward's algo-
rithm. Show the values of the squared errors that are computed.
5.11. Perform a partitional clustering of the data in Figure 5.17 using Forgy's algo-
rithm. Set k = 2 and use the first two samples in the list as seed points. Show
the values of the centroids and the nearest seed points.
5.12. Perform a partitional clustering of the data in Figure 5.17 using Forgy's algo-
rithm. Set k = 3 and use the first three samples in the list as seed points. Show
the values of the centroids and the nearest seed points.
5.13. Perform a partitional clustering of the data in Figure 5.17 using the k-means
algorithm. Set k = 2 and use the first two samples in the list as the initial
samples. Show the values of the centroids and the distances from the samples to
the centroids.
5.14. Perform a partitional clustering of the data in Figure 5.17 using the k-means
algorithm. Set k = 3 and use the first three samples in the list as the initial
samples. Show the values of the centroids and the distances from the samples to
the centroids.
5.15. Perform a partitional clustering of the data in Figure 5.17 using the isodata
algorithm. Use reasonable values of your own choosing for the parameters. Set
k = 3 and use the first three samples in the list as the initial samples.
