* Assistant Professor, Department of CSE, Vasavi College of Engineering, Hyderabad-500 031, INDIA
+ Associate Professor, Department of CSE, Murthy Institute of Tech. & Science, Ranga Reddy-501301, INDIA
hanu.abc@gmail.com, guggillanarender@gmail.com, balaji_075@yahoo.co.in, anitha_yella@yahoo.co.in
Abstract— Clustering is the grouping of the most similar objects, i.e., objects with minimum dissimilarity; thus the more similar objects are grouped together. In this paper we present the proposed method, which is experimented with well-known data sets from the UCI data repository, taking the soybean dataset as an example. The data set consists of 47 records, and each record contains 35 attributes describing the features of plants with four classes of disease. There are three phases in this method. The dissimilarity matrix, the neighbor matrix and the initial clusters are formed in the first phase. Clusters are merged in the second phase by relocating objects using the neighborhood concept. In the third phase, the mode of the attributes of each cluster is computed, and Phase I and Phase II are applied to the tuples formed from these representatives.

Keywords— Data Mining, Clustering, Categorical data, Dissimilarity, Mode.

I. INTRODUCTION

Data mining [1] involves the use of sophisticated data analysis tools to discover previously unknown, valid patterns and relationships in large data sets [6]. Data mining is a technique to extract valid, novel, potentially useful patterns and information from complex and huge data sets [2]. The two fundamental goals of data mining are prediction and description, otherwise known as the verification model and the discovery model. Some important data mining tasks are association rules, classification rules and clustering.

There are two major types of prediction: one can either try to predict some unavailable data values or pending trends, or predict a class label for some data. Prediction, however, more often refers to the forecast of missing numerical values, or of increase/decrease trends in time-related data; the major idea is to use a large number of past values to estimate probable future values. Description is given in terms of (human-interpretable) patterns. Association rule mining discovers relations between variables in large databases [7]. Data classification is the categorization of data for its most effective and efficient use [8]. A data set (or dataset) is a collection of data [13]; each value is known as a datum, and the data set may comprise data for one or more members, corresponding to the number of rows.

Data objects are clustered based on a similarity measurement. The most common similarity measurement is the distance function: the similarity between two objects Oi and Oj is measured using the Euclidean or Manhattan distance. Clustering categorical data is very different from clustering numerical data in terms of the definition of the similarity measure, and distance-based metrics cannot be used to cluster categorical data. Numerical clustering methods can be applied to categorical data through data preprocessing [4], but these preprocessing techniques do not always produce quality clusters, so it is widely accepted to apply clustering to raw categorical data. Here we use the similarity concept as a measurement: to maximize the intra-cluster similarity, the minimum-dissimilarity concept is used.

II. OVERVIEW OF CLUSTERING

A cluster is a collection of data objects that are more similar to one another within the same cluster and dissimilar to the objects in other clusters. Unsupervised learning deals with instances which have not been pre-classified in any way and so do not have a class attribute associated with them. Clustering is a process of grouping data into groups based on a similarity measure, and it is used as a preprocessing step for other algorithms such as characterization and classification.

Clustering in data mining is a discovery process that groups a set of data such that the intra-cluster similarity is maximized and the inter-cluster similarity is minimized. In general, clustering is divided into two broad categories, viz. hierarchical and partitional. Partitional clustering techniques partition the database into a pre-defined number of clusters based on some criterion. Hierarchical clustering techniques are divided into agglomerative and divisive: agglomerative clustering [3] follows the bottom-up strategy, whereas the divisive approach follows the top-down strategy. The basic principle of clustering hinges on the concept of a distance metric. Since the data are invariably real numbers for statistical applications and pattern recognition, a large class of metrics exists, and one can define one's own metric according to the specific requirements.

Data mining primarily works with large databases. The objects in the database contain attributes of various data types; these values may be of either numeric or non-numeric type. Clustering can be performed for both numerical and categorical data. For clustering numerical data, geometric properties such as a distance function are used as the criterion. As clustering is mostly applied to real-time or transactional data sets, the attributes are of both numerical and categorical type.
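The contrast between numeric distance and categorical dissimilarity described above can be sketched as follows; the function names and sample values are illustrative, not taken from the paper. For categorical objects, a simple-matching dissimilarity (the count of attributes on which two objects disagree) replaces the geometric distance:

```python
import math

def euclidean(a, b):
    # Geometric distance between two numeric vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def simple_matching_dissimilarity(a, b):
    # Number of attributes on which two categorical objects disagree.
    return sum(1 for x, y in zip(a, b) if x != y)

print(euclidean([0, 0], [3, 4]))  # 5.0
print(simple_matching_dissimilarity(["red", "round", "small"],
                                    ["red", "oval", "large"]))  # 2
```

Minimizing this dissimilarity is one way to realize the minimum-dissimilarity concept used in this paper.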
III. RELATED WORKS

The basic idea of clustering is grouping together similar objects [15]. A few existing categorical clustering algorithms are discussed here. The k-means problem [14] is based on a simple iterative scheme for finding a locally minimal solution. In the k-means problem, a set of N points X(i) in M dimensions is given; the goal is to arrange these points into K clusters, with each cluster having a representative point Z(j), usually chosen as the centroid of the points in the cluster. This algorithm is often called the k-means algorithm [9]. The k-prototypes algorithm is based on k-means, extending it to mixed numeric and categorical data.

K-means is one of the simplest unsupervised learning algorithms [12]. K-means clustering is a method of cluster analysis which aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean. The procedure follows a simple and easy way to classify a given data set through a certain number of clusters (assume k clusters) fixed a priori. The following are the steps of the k-means algorithm [10]:

Step 1: Place K points into the space represented by the objects that are being clustered; these points represent the initial group centroids. Equivalently, start with a random partition into K clusters.

Step 2: Generate a new partition by assigning each pattern to its closest cluster center.

Step 3: Compute the new cluster centers as the centroids of the clusters.

Step 4: Repeat Steps 2 and 3 until there is no change in the membership (i.e., the cluster centers remain the same).
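The four steps above can be sketched in a minimal k-means implementation; this is an illustrative sketch with squared Euclidean distance and random initialization, not the exact implementation used by the cited works:

```python
import random

def kmeans(points, k, iters=100, seed=0):
    # Step 1: pick K initial centroids from the data objects.
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        # Step 2: assign each point to its closest cluster center.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda i: sum((a - b) ** 2
                                      for a, b in zip(p, centroids[i])))
            clusters[j].append(p)
        # Step 3: recompute each centroid as the mean of its cluster.
        new_centroids = [
            tuple(sum(xs) / len(xs) for xs in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        # Step 4: stop when the centroids no longer change.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 8.0)]
centroids, clusters = kmeans(points, 2)
```

On this toy data the two nearby pairs end up in separate clusters, regardless of which initial centroids are sampled.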
The k-modes variants for clustering categorical data [11] iterate in the same spirit, with cluster modes in place of centroids. Their relocation and convergence steps are:

Step 3: Re-test the similarity of all data vectors from cluster to cluster against each mode vector in the following way: if a vector from Xi(t) is found to be strictly nearer to Cj(t) than to the current Ci(t), reallocate that vector to the cluster Xj(t+1) to obtain a new partition X = X1(t+1) ∪ X2(t+1) ∪ … ∪ Xk(t+1). Notice that ties here are biased so that the mode of a data vector's current cluster is preferred.

Step 4: Go back to Step 2 and repeat the iteration until no object has changed its cluster assignment after a full cycle through the whole data set, that is, Xi(t+1) = Xi(t) for all i = 1, …, k.

A. Proposed Representative Based Algorithm

Let the object list = [1, 2, 3, …, n], where
N — number of tuples/records
M — number of attributes

Phase I

The steps involved in this phase are detailed below:

Step 1: Construct a dissimilarity matrix 'd' using the dissimilarity measurement.

Step 2: Compute the threshold value and the minimum dissimilarity of each object.

Step 3: Construct a neighbor matrix 'neigh'.

Step 4: Select the first member of the object list and form a new cluster with this object as a member. Group the neighbors of the object based on the criteria, and remove the clustered objects from the object list.

Step 5: Repeat the above step until the object list becomes empty.

Phase II

The steps involved in merging the clusters are detailed below:

Step 1: Select the cluster with the least number of objects.

Step 2: Relocate the objects in the selected cluster based on the cluster merging criteria.

Step 3: Repeat the above steps until no more merging is possible.

Phase III

Compute the mode of each column (attribute) over all objects in each cluster to form the cluster representatives; Phase I and Phase II are then applied to the tuples formed from these representatives.

TABLE-2 DISSIMILARITY MATRIX

OBJECT  1  2  3  4  5  6  7
1       -  1  3  3  2  1  1
2       1  -  3  3  1  2  2
3       3  3  -  1  3  3  3
4       3  3  1  -  3  3  3
5       2  1  3  3  -  2  2
6       1  2  3  3  2  -  1
7       1  2  3  3  2  1  -

TABLE-3 NEIGHBOR MATRIX

OBJECT  NEIGHBORS
1       2,6,7
2       1,5
3       4
4       3
5       2
6       1,7
7       1,6
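The Phase I steps above can be sketched as follows. This is a minimal sketch under stated assumptions: the sample records are hypothetical (not the actual Table-1 values), the dissimilarity is the simple-matching count, and the neighbor criterion shown is "objects at an object's minimum dissimilarity"; the paper's threshold-based criteria may differ in detail:

```python
def dissimilarity(a, b):
    # Number of attributes on which two records differ.
    return sum(1 for x, y in zip(a, b) if x != y)

def phase1(records):
    n = len(records)
    # Step 1: dissimilarity matrix d.
    d = [[dissimilarity(records[i], records[j]) for j in range(n)]
         for i in range(n)]
    # Step 2: minimum dissimilarity of each object (ignoring itself).
    min_d = [min(d[i][j] for j in range(n) if j != i) for i in range(n)]
    # Step 3: neighbor matrix - objects lying at the minimum dissimilarity.
    neigh = [[j for j in range(n) if j != i and d[i][j] == min_d[i]]
             for i in range(n)]
    # Steps 4-5: greedily form initial clusters from the object list.
    clusters, unassigned = [], set(range(n))
    while unassigned:
        seed = min(unassigned)
        cluster = {seed} | (set(neigh[seed]) & unassigned)
        clusters.append(sorted(cluster))
        unassigned -= cluster
    return d, neigh, clusters

records = [("a", "x", "p"), ("a", "x", "q"), ("b", "y", "q"), ("b", "y", "r")]
d, neigh, clusters = phase1(records)  # clusters: [[0, 1], [2, 3]]
```

On this toy data the two record pairs that differ in only one attribute form the two initial clusters, mirroring how Table-3 is derived from Table-2.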
The proposed algorithm is illustrated with the sample dataset given in Table-1. The sample dataset consists of 7 tuples/objects with 4 attributes, so here n = 7 and m = 4. The dissimilarity matrix is depicted in Table-2 and the neighbor matrix in Table-3.

TABLE-4 RESULTANT CLUSTERS AFTER PHASE-2

CLUSTER NUMBER  OBJECT
1               1,2,6,7

For this sample data set, after Phase III we also obtain the same results.

C. Experimental Results

The proposed method is experimented with a real data set, Soybean small. The Soybean small dataset consists of 47 records, and each record contains 35 attributes describing the features of plants with four classes of diseases. The result obtained is 10 clusters after Phase 2 and 4 clusters after Phase 3.

TABLE-5 NEIGHBOR MATRIX OF SOYBEAN DATASET

OBJECT  NEIGHBORS
1       10
2       6,7
3       8
4       10
5       1,10
6       2,4,9
7       2,3,4
8       3
9       10
10      1
11      15,18
12      16
13      18,20
14      19,20
15      11
16      12
17      13,15,19
18      11,13
19      14
20      13,14
21      25
22      27
23      27
30      24
31      34
32      37
33      39,43
34      31,41
35      43
36      43
37      47

D. Measuring the Purity of Clusters

A cluster is called a pure cluster if all its objects belong to a single class. To measure the efficiency of the proposed method, we use the clustering accuracy measure. The clustering accuracy r is defined as

r = (1/n) Σ_{l=1..k} a_l

where a_l is the number of data objects that occur in both cluster C_l and its corresponding labeled class, and n is the number of objects in the data set.

In our proposed algorithm the number of clusters K is not given as input. During merging using representatives the number of clusters is reduced; it is evident that when the number of clusters is larger, the purity is also higher. So it is proposed here to select the clustering with high purity.
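The accuracy measure r above can be sketched as follows, interpreting a_l as the count of the majority class inside cluster C_l; the labels and cluster assignments in the example are hypothetical:

```python
from collections import Counter

def clustering_accuracy(clusters, labels):
    # r = (1/n) * sum of a_l over clusters, where a_l is the number of
    # objects in cluster C_l belonging to its majority (corresponding) class.
    n = sum(len(c) for c in clusters)
    matched = sum(Counter(labels[i] for i in c).most_common(1)[0][1]
                  for c in clusters)
    return matched / n

# Hypothetical example: 6 objects with true classes A/B,
# one object misplaced in each cluster.
labels = ["A", "A", "A", "B", "B", "B"]
clusters = [[0, 1, 3], [2, 4, 5]]
print(clustering_accuracy(clusters, labels))  # 4/6 ≈ 0.667
```

A perfectly pure clustering yields r = 1.0, matching the definition of a pure cluster above.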
TABLE-6 RESULTANT CLUSTERS OF SOYBEAN DATASET AFTER PHASE-2

CLUSTER NUMBER  OBJECT
1               1,5,10
2               2,3,4,6,7,8,9
11              11,13,14,15,17,18,19,20
12              12,16
21              21,24,25,28,30
22              22,23,27
26              26,29
31              31,32,34,37,47
33              33,36,39,40,43
42              44,45,46,47

V. CONCLUSION

REFERENCES

[6] http://www.fas.org/irp/crs/RL31798.pdf
[7] http://en.wikipedia.org/wiki/Association_rule_learning
[8] http://searchdatamanagement.techtarget.com/definition/data-classification
[9] Tapas Kanungo et al., "An Efficient k-Means Clustering Algorithm: Analysis and Implementation", IEEE Transactions on Pattern Analysis and Machine Intelligence.
[10] Johan Everts, "Clustering Algorithms", Kunstmatige Intelligentie / RuG.
[11] N. Orlowski, D. Schlorff, J. Blevins, D. Cañas, M. T. Chu, R. E. Funderlic, "The Effects of Ties on Convergence in K-Modes Variants for Clustering Categorical Data", N.C. State University, Department of Computer Science.
[12] http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/kmeans.html
[13] http://en.wikipedia.org/wiki/Data_set
[14] http://people.sc.fsu.edu/~jburkardt/m_src/kmeans/kmeans.html
[15] http://www.astro.princeton.edu/~gk/A542/matt.pdf