I. INTRODUCTION
Data mining is used to analyse huge datasets, to find relationships within them, and to summarize the results in a form that is useful and understandable to the user. Today, large datasets arise in many areas due to the use of distributed information systems [14]. The sheer amount of data stored in the world today is commonly known as big data. The process of extracting useful patterns of knowledge from a database is called data mining, and the extracted information is visualized in the form of charts, graphs, tables and other graphical forms. Data mining is also known as KDD (Knowledge Discovery in Databases). The data present in a database is in structured format, whereas a data warehouse may also contain unstructured data, and static data is comparatively easier to handle than dynamically varying data [16]. Reliability and scalability are two major challenges in data mining: effective, efficient and scalable mining should be achieved by building incremental mining algorithms for large datasets and streaming data [14]. In this review paper our main objective is to carry out a comparative study of clustering algorithms and to identify the challenges associated with them.
II. CLUSTERING IN DATA MINING
Clustering means putting objects having similar properties into one group and objects with dissimilar properties into another; for example, given a threshold value, objects with values above and below the threshold are placed into different clusters [14]. A cluster is a group of objects which possess common characteristics, and the main objective of clustering is to find the inherent grouping in a set of unlabeled data [16]. Clustering is referred to as unsupervised learning, since these groups are discovered without predefined class labels.
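As a toy illustration of the threshold-based grouping just described (the data values and the threshold below are invented for the example), a one-dimensional dataset can be split into two clusters as follows:

```python
# Toy threshold-based grouping: objects with attribute values above and
# below a user-chosen threshold fall into different clusters.
# Both the data and the threshold are made-up illustration values.
values = [1.2, 0.4, 3.1, 2.8, 0.9, 3.5]
threshold = 2.0

cluster_low = [v for v in values if v < threshold]    # [1.2, 0.4, 0.9]
cluster_high = [v for v in values if v >= threshold]  # [3.1, 2.8, 3.5]

print("cluster 1:", cluster_low)
print("cluster 2:", cluster_high)
```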
Better accuracy is provided by the cluster-based outlier detection technique as compared to the distance-based approach.
D. Large Computational Time
As compared to traditional clustering algorithms like K-means, hierarchical clustering algorithms have many advantages, but they may suffer from high computational cost [14]. Density-based outlier detection algorithms also suffer from the problem of large computation time; although they have a number of advantages, high computation time is a major barrier for them, and such algorithms have a less obvious parallel structure [15]. To resolve this problem of time and cost, several algorithms have been proposed by different researchers. William Hendrix et al. [24] presented SHRINK, a shared-memory algorithm for single-linkage hierarchical clustering which merges overlapping clusters; it provides a substantial speedup on synthetic and real datasets of up to 250,000 points. Parallel algorithms have also been proposed for clustering large datasets in bioinformatics, where solving the cluster identification problem (separating dense clusters from a noisy background) is highly time-consuming on large datasets, and such approaches can greatly reduce the computational time. Spectral clustering algorithms can easily recognize non-convex distributions and are used in image segmentation and many other fields, but they often incur a high computation time when dealing with large images. To solve this problem, Kai Li et al. [7] proposed an algorithm based on spectral clustering which performs image segmentation in less computational time.
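As a point of reference for the cost being discussed, a plain sequential single-linkage clustering can be run with SciPy as sketched below. This is only a baseline illustration of the kind of computation that SHRINK [24] parallelises, not the authors' implementation; the two-blob dataset is synthetic.

```python
# Baseline single-linkage hierarchical clustering with SciPy.
# SHRINK [24] parallelises this kind of computation on shared-memory
# machines; the sequential version below is only a reference point.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (100, 2)),    # cluster around the origin
               rng.normal(5, 0.5, (100, 2))])   # cluster around (5, 5)

Z = linkage(X, method="single")                  # pairwise merging of closest clusters
labels = fcluster(Z, t=2, criterion="maxclust")  # cut the dendrogram at 2 clusters
print(np.bincount(labels))                       # cluster sizes (index 0 unused)
```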
Seung Kim et al. [15] used a method for reducing the computational time of the density-based algorithm known as Local Outlier Factor (LOF). Their approach incorporates two techniques, a k-nearest neighbors search algorithm (ANN) and kd-tree indexing, and it detects local outliers efficiently in less computational time.
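A minimal sketch of LOF-based local outlier detection is given below, using scikit-learn's implementation, which likewise accelerates the k-nearest-neighbour search with kd-tree indexing. The data is synthetic, and this illustrates the technique in general rather than the specific method of [15].

```python
# Sketch: density-based outlier detection with Local Outlier Factor (LOF),
# using a kd-tree index for the k-nearest-neighbour search.
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                    # dense background cluster
X = np.vstack([X, [[6.0, 6.0], [-5.0, 7.0]]])    # two obvious outliers

lof = LocalOutlierFactor(n_neighbors=20, algorithm="kd_tree")
labels = lof.fit_predict(X)                      # -1 marks outliers, 1 marks inliers

print("points flagged as outliers:", np.where(labels == -1)[0])
print("their LOF scores:", lof.negative_outlier_factor_[labels == -1])
```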
E. Efficient Initial Seed Selection
The K-means algorithm is a crucial clustering algorithm used for mining data. Its centers are generated randomly or assumed to be already available. In seed-based integration, a small set of labeled data (called seeds) is integrated, which improves performance and overcomes the problem of poor initial seed centers [20]. Viet-Vu Vu et al. [8] performed active seed selection using an efficient approach based on a min-max criterion that covers the entire dataset; after a few queries, every cluster contains at least one seed point, and the number of iterations is also reduced. Kiran Agrawal et al. [10] gave an efficient K-means based algorithm which solves the problem of initial seed selection and also determines the number of clusters to be formed; it gives satisfactory results and works efficiently and accurately. Iurie Chiosa et al. [11] proposed a novel clustering algorithm called Variational Multilevel Mesh Clustering (VMLC), which combines the benefits of variational (Lloyd) algorithms and hierarchical clustering algorithms. Since the seeds to be selected initially are not predefined, a multilevel clustering is built which resolves the problems present in variational algorithms and performs the initial seed selection; the further problem of clusters with non-optimal shapes is addressed by exploiting the greedy nature of hierarchical approaches. Tu Linli et al. [12] gave a new K-means technique for clustering which considers double attributes: the high-density set generates a dissimilarity degree matrix, a Huffman tree is constructed on the basis of this matrix, and the initial cluster seeds are then selected from the Huffman tree, which helps to overcome the problem of initial seed selection. Md Anisur Rahman et al. [13] use the ModEx and Seed-Detective approaches, which help in performing high-quality clustering by generating good initial seeds. The former is a modified version of the Ex-Detective technique and also addresses some of its limitations; the latter combines ModEx with simple K-means. Using ModEx, Seed-Detective produces high-quality initial seeds which are given as input to K-means, leading to better cluster formation. Jeyhun Karimov et al. [20] proposed a hybrid evolutionary model for K-means clustering (HE k-means), which selects good initial centroids for K-means using meta-heuristics; clustering quality improves by 30% with this approach in comparison to random seed selection.
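The effect of initial seed selection can be sketched with scikit-learn by comparing random seeding against the k-means++ heuristic. k-means++ is a standard seeding strategy used here purely for illustration; it is not one of the surveyed methods [8], [10]-[13], [20], and the dataset is synthetic.

```python
# Sketch of why initial seed selection matters for K-means: compare
# random seeding against the k-means++ seeding heuristic.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=1000, centers=5, random_state=42)

for init in ("random", "k-means++"):
    km = KMeans(n_clusters=5, init=init, n_init=1, random_state=0).fit(X)
    # Lower inertia (within-cluster sum of squares) means tighter clusters;
    # fewer iterations means faster convergence from the chosen seeds.
    print(f"{init:10s} inertia={km.inertia_:.1f} iterations={km.n_iter_}")
```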
F. Identification of Different Distance and Similarity Measures
For measuring distance between numerical attributes, standard equations such as the Euclidean, Manhattan and maximum (Chebyshev) distances are used; these three are special cases of the Minkowski distance. Euclidean distance (ED) is the measure usually used for evaluating the similarity between two points. It is a very simple metric, but it also has disadvantages: it is not suitable for time-series application fields and is highly susceptible to outliers and noise [2]. Usue Mori et al. [2] proposed a multi-label classification framework which selects a reliable distance measure for clustering time-series databases; the appropriate distance measure is selected automatically. The classifier is based on characteristics describing important features of the time-series database and can predict and discriminate between different sets of measures. Duc Thang Nguyen [3] discusses two clustering methods, explicit and implicit, for finding similarity between objects on the basis of viewpoints: the traditional technique uses a single viewpoint, whereas the other similarity-measure techniques use multiple viewpoints for measuring similarity.
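The relationship between these measures can be verified directly with SciPy: the Euclidean, Manhattan and maximum (Chebyshev) distances correspond to the Minkowski distance with p = 2, p = 1 and p tending to infinity, respectively.

```python
# Euclidean (p=2), Manhattan (p=1) and Chebyshev (max, p -> infinity)
# distances as special cases of the Minkowski distance.
from scipy.spatial.distance import minkowski, euclidean, cityblock, chebyshev

u, v = [0.0, 0.0], [3.0, 4.0]

print(minkowski(u, v, p=2), euclidean(u, v))    # 5.0 5.0
print(minkowski(u, v, p=1), cityblock(u, v))    # 7.0 7.0
print(minkowski(u, v, p=100), chebyshev(u, v))  # ~4.0 4.0 (large p approximates max)
```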
Table 1. Comparison of clustering techniques, their benefits and limitations.

| Author (Year) | Clustering Technique | Benefits | Limitations |
|---|---|---|---|
| — | Algorithm for mining clusters with arbitrary shapes (CLASP) | Less computational cost | Efficiency reduces while clustering large datasets |
| Zhensong Chen (2015) | DP clustering algorithm | Defines correct cluster centres | User needs to define some parameters like the threshold value |
| Mark Junjie Li (2008) | Extension to fuzzy K-means clustering algorithm | Automatically determines cluster number | More computational time |
| — | Self-organized maps | Removes noise | High time complexity and less speed |
| — | Clustering large datasets (CLARA) | Efficient for large datasets; produces small, distinct and symmetric clusters | Overlapping of clusters |
| — | Cluster-based and distance-based outlier detection | Efficient in outlier detection | Poor feature selection |
| Kai Li (2012) | Image segmentation algorithm based on spectral clustering | Recognizes non-convex distributions in images | High computation time |
| Kiran Agrawal (2009) | Efficient K-means | Efficient initial seed selection; produces tight clusters | More complexity and overhead |
| Iurie Chiosa (2008) | Variational multilevel mesh clustering (VMLC) | Produces clusters having optimal shape | Not efficient for large datasets |
| William Hendrix (2012) | Shared-memory algorithm for single-linkage hierarchical clustering (SHRINK) | Less computational cost | Clusters may overlap in large datasets |
V. CONCLUSION
This paper presents a comparative study of clustering techniques such as CLARA, K-means, CLASP and SHRINK, which are used by researchers in different application areas. The clustering algorithms are compared at different levels of perception, and the paper highlights the issues and challenges present in each of them; often, an issue arising in one approach is resolved by another. Fuzzy logic is good at handling uncertainty, and neural networks, owing to their parallel nature, are good at handling real-time applications, so hybridizing neural networks with fuzzy techniques can yield efficient results in outlier detection. We conclude that algorithms like CLARA cluster large datasets efficiently, whereas some asymmetric clustering algorithms like CLASP cluster simple datasets efficiently but do not give the expected outputs on mixed and tightly coupled datasets, and are less accurate and efficient for clustering large datasets. Therefore, techniques based on neural networks should be investigated to enhance the clustering efficiency of the asymmetric algorithms.
REFERENCES
[1] Chih-Ping Wei, Yen-Hsien Lee, and Che-Ming Hsu, "Empirical Comparison of Fast Clustering Algorithms for Large Data Sets," Proceedings of the 33rd Hawaii International Conference on System Sciences, 2000.
[2] Usue Mori, Alexander Mendiburu, and Jose A. Lozano, "Similarity Measure Selection for Clustering Time Series Databases," IEEE Transactions on Knowledge and Data Engineering, 2015.
[3] Duc Thang Nguyen, "Clustering with Multiviewpoint-Based Similarity Measure," IEEE Transactions on Knowledge and Data Engineering, vol. 24, no. 6, pp. 988-1001, June 2012.
[4] D. S. Yeung and X. Z. Wang, "Improving Performance of Similarity-Based Clustering by Feature Weight Learning," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 4, pp. 556-561, April 2002.
[5] Magnus Rattray, "A Model-Based Distance for Clustering," Proceedings of the IEEE-INNS-ENNS International Joint Conference on Neural Networks (IJCNN'00), vol. 4, p. 4013, 2000.
[6] R. Vidal, A. Ravichandran, B. Afsari, and R. Chaudhry, "Group action induced distances for averaging and clustering Linear Dynamical Systems with applications to the analysis of dynamic scenes," 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2208-2215, 2012.
[7] Kai Li and Xinxin Song, "A Fast Large Size Image Segmentation Algorithm Based on Spectral Clustering," 2012 Fourth International Conference on Computational and Information Sciences, pp. 345-348, 2012.
[8] Viet-Vu Vu, Nicolas Labroche, and Bernadette Bouchon-Meunier, "Active Learning for Semi-Supervised K-Means Clustering," 2010 22nd IEEE International Conference on Tools with Artificial Intelligence (ICTAI), 2010.