Documentos de Académico
Documentos de Profesional
Documentos de Cultura
(1 4)
2
+(3 1)
2
(IJACSA) International Journal of Advanced Computer Science and Applications,
Vol. 2, No. 11, 2011
89 | P a g e
www.ijacsa.thesai.org
coincides with the idea of clustering [19]. The simplest type of
cluster is the spherical Gaussian [19]. Clusters that manifest
non-spherical or arbitrary shapes, such as correlation and non-
linear correlation clusters, are not addressed by the BIRCH
and CluSandra algorithms. However, the CluSandra
framework does not preclude the deployment of algorithms
that address the non-spherical cluster types.
BIRCH mitigates the I/O costs associated with the
clustering of very large multi-dimensional and persistent
datasets. It is a batch algorithm that relies on multiple
sequential phases of operation and is, therefore, not well-
suited for data stream environments where time is severely
constrained. However, BIRCH introduces concepts and a
synopsis data structure that help address the severe time and
space constraints associated with the clustering of data
streams. BIRCH can typically find a good clustering with a
single pass of the dataset [10], which is an absolute
requirement when having to process data streams. It also
introduces two structures: cluster feature (CF) and cluster
feature tree. The CluSandra algorithm utilizes an extended
version of the CF, which is a type of synopsis structure. The
CF contains enough statistical summary information to allow
for the exploration and discovery of clusters within the data
stream. The information contained in the CF is used to derive
these three spatial measures: centroid, radius, and diameter.
All three are an integral part of the BIRCH and CluSandra
algorithms. Given N n-dimensional data records (vectors or
points) in a cluster where i = {1,2,3, , N}, the centroid ,
radius R, and diameter D of the cluster are defined as
(3)
(4)
(5)
where and are the i
th
and j
th
data records in the cluster
and N is the total number of data records in the cluster. The
centroid is the clusters mean, the radius is the average
distance from the objects within the cluster to their centroid,
and the diameter is the average pair-wise distance within the
cluster. The radius and diameter measure the tightness or
density of the objects around their clusters centroid. The
radius can also be referred to as the root mean squared
deviation or RMSD with respect to the centroid. Typically,
the first criterion for assigning a new object to a particular
cluster is the new objects Euclidean distance to a clusters
centroid. That is, the new object is assigned to its closest
cluster. Note, however, that equations 3, 4 and 5 require
multiple passes of the data set. Like BIRCH, the CluSandra
algorithm derives the radius and centroid while adhering to the
single pass constraint. To accomplish this, the algorithm
maintains statistical summary data in the CF. This data is in
the form of the total number of data records (N), linear sum,
and sum of the squares with respect to the elements of the n-
dimensional data records. This allows the algorithm to
calculate the radius, as follows:
(6)
For example, given these three data records: (0,1,1),
(0,5,1), and (0,9,1), the linear sum is (0,15,3), the sum of the
squares is (0,107,3) and N is 3. The CF, as described in
BIRCH, is a 3-tuple or triplet structure that contains the
aforementioned statistical summary data. The CF represents a
cluster and records summary information for that cluster. It is
formally defined as
(7)
where N is the total number of objects in the cluster (i.e., the
number of data records absorbed by the cluster), LS is the
linear sum of the cluster and SS is the clusters sum of the
squares:
(8)
(9)
It is interesting to note that the CF has both the additive
and subtractive properties. For example, if you have two
clusters with their respective CFs, CF
1
and CF
2
, the CF that is
formed by merging the two clusters is simply CF
1
+ CF
2
.
B. CluStream
CluStream [5] is a clustering algorithm that is specifically
designed for evolving data streams and is based on and extends
the BIRCH algorithm. This section provides a brief overview
of CluStream and describes the problems and/or constraints
that it addresses. This section also describes how the
CluSandra algorithm builds upon and, at the same time,
deviates from CluStream.
One of CluStreams goals is to address the temporal
aspects of the data streams single-pass constraint. For
example, the results of applying a single-pass clustering
algorithm, like BIRCH, to a data stream whose lifespan is 1 or
2 years would be dominated by outdated data. CluStream
allows end-users to explore the data stream over different time
horizons, which provides a better understanding of how the
data stream evolves over time. CluStream divides the
clustering process into two phases of operation that are meant
to operate simultaneously. The first, which is referred to as the
online phase, efficiently computes and stores data stream
summary statistics in a structure called the microcluster. A
microcluster is an extension of the BIRCH CF structure
x
0
x
0
x
i
i1
N
N
R
x
i
x
0
( )
2
i1
N
N
D
x
i
x
j
( )
2
j1
N
i1
N
N N 1
( )
x
i
x
j
x
i
2
x
i ( )
2
/ N
( )
1/ N 1
( )
CF N, LS, SS
LS x
i
i1
N
SS x
i
2
i1
N