ABSTRACT
Learning is the process of generating useful information from a huge volume of data. Learning can be classified as supervised learning and unsupervised learning; clustering is a kind of unsupervised learning and is also one of the data mining methods. In all clustering algorithms, the goal is to minimize intra-cluster distances and to maximize inter-cluster distances: the better a clustering algorithm achieves this goal, the better its performance [2]. Nowadays, although much research has been done in the field of clustering algorithms, these algorithms still face challenges such as processing time, scalability, and accuracy. Comparing the various clustering methods, the contributions of recent research have focused on solving the clustering challenges of the partition method [3]. In this paper, the partitioning clustering method is introduced, the procedure of the clustering algorithms is described, and finally the new improved methods and the proposed solutions to these challenges are explained [4]. Clustering algorithms are categorized according to different criteria. A variety of algorithms have recently appeared and have been effectively applied to real-life data mining problems. This survey mainly focuses on partition-based clustering algorithms, namely k-Means, k-Medoids and Fuzzy c-Means; in particular, they are applied mostly to medical data sets. The aim of the survey is to explore their various applications in different domains [5].

Keywords: Clustering, Supervised Learning, Unsupervised Learning, Data Mining

1. INTRODUCTION
Since the 1990s, the notion of data mining, usually seen as the process of "mining" the data, has emerged in many environments, from the academic field to business and medical activities in particular. As a research area with not such a long history, and thus not yet past the stage of adolescence, data mining is still disputed by some scientific fields [1]. In this sense, data mining is defined in various references as follows:
• Data mining is the process of discovering interesting knowledge from large amounts of data stored in databases, data warehouses, or other information repositories [2].
• Data mining is a process that uses algorithms to discover predictive patterns in data sets. "Automated data analysis" applies models to data to predict behavior, assess risk, determine associations, or do other types of analysis [3].

In practice, when data mining methods are used to solve concrete problems, a typology of these methods emerges, which can be synthetically summarized in two broad categories, predictive methods and descriptive methods, already referred to as the objectives of data mining. Clustering is a type of descriptive method [1].

Clustering is a process of grouping objects with similar properties. Any cluster should exhibit two main properties: low inter-class similarity and high intra-class similarity. Clustering is unsupervised learning, i.e. it learns by observation rather than by examples: no predefined class labels exist.
@ IJTSRD | Available Online @ www.ijtsrd.com | Volume – 2 | Issue – 5 | Jul-Aug 2018 Page: 506
International Journal of Trend in Scientific Research and Development (IJTSRD) ISSN: 2456-6470
The advantages of density-based clustering include the ability to discover clusters of arbitrary shape and to handle noise. The algorithm requires just one scan through the database; however, density parameters are needed for the termination condition [9]. A density-based algorithm continues to grow a given cluster as long as the density in its neighborhood exceeds a certain threshold [1]. This algorithm is suitable for handling noise in the dataset.

The following points are enumerated as the features of this algorithm:
1. Handles clusters of arbitrary shape.
2. Handles noise.
3. Needs only one scan of the input dataset.
4. Needs density parameters to be initialized.
DBSCAN, DENCLUE and OPTICS [1] are examples of this kind of algorithm.

Spectral Clustering
Spectral clustering refers to a class of techniques which rely on the eigenstructure of a similarity matrix. Clusters are formed by partitioning the data points using the similarity matrix. Any spectral clustering algorithm has three main stages [4]:
1. Preprocessing: deals with the construction of the similarity matrix.
2. Spectral Mapping: deals with the construction of the eigenvectors of the similarity matrix.
3. Post Processing: deals with the grouping of the data points.

The advantages of the spectral clustering algorithm are:
1. Strong assumptions on cluster shape are not made.
2. Simple to implement.
3. The objective does not suffer from local optima.
4. Statistically consistent.
5. Works faster.

Partition Clustering
A partition clustering algorithm splits the data points into k partitions, where each partition represents a cluster [4]. The partitioning is done based on a certain objective function. One such criterion function is the square error criterion, computed as

E = Σi=1..k Σp∈Ci || p − mi ||²

where p is a point in cluster Ci and mi is the mean of that cluster. The clusters should exhibit two properties: (1) each group must contain at least one object, and (2) each object must belong to exactly one group. The main drawback of this algorithm [3] is that whenever a point is close to the center of another cluster it gives a poor result, due to overlapping of data points.

Partition clustering algorithms
K-means:
It starts with a random initial partition and keeps reassigning the patterns to clusters based on the similarity between each pattern and the cluster centers until a convergence criterion is met [4]. The method is relatively scalable and efficient in processing large data sets [2]; its time and space complexity is relatively small, and it is an order-independent algorithm [8]. However, the method often terminates at a local optimum, and it is not suitable for discovering clusters with non-convex shapes or clusters of very different sizes [2]. Moreover, there is ambiguity about the best choices for the initial partition, updating the partition, adjusting the number of clusters, and the stopping criterion [8]. A major problem with this algorithm is that it is sensitive to noise and outliers [9].

K-medoid/PAM:
PAM was one of the first k-medoids algorithms introduced [2]. The algorithm uses the most centrally located object in a cluster, the medoid, instead of the mean. PAM starts from an initial set of medoids and iteratively replaces one of the medoids by one of the non-medoids if doing so improves the total distance of the resulting clustering [9]. This algorithm works effectively for small data sets, but does not scale well to large datasets [2].

CLARA:
Instead of taking the whole set of data into consideration, a small portion of the actual data is chosen as a representative of the data, and medoids are then chosen from this sample using PAM. CLARA draws multiple samples of the data set, applies PAM on each sample, and returns its best clustering as the output. As expected, CLARA can deal with larger data sets than PAM [2].

CLARANS:
It draws a sample with some randomness in each step of the search. Conceptually, the clustering process can be viewed as a search through a graph: at each step, PAM examines all of the neighbors of the current node in its search for a minimum-cost solution, and the current node is then replaced by the neighbor with the largest descent in cost [2]. The algorithm also enables the detection of outliers [9]. The main advantage of the approach is its fast processing time [2].
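The k-means loop described above, a random start followed by reassignment to the nearest center and recomputation of the means until convergence, can be sketched in a few lines. The two-blob data and the NumPy implementation below are illustrative only (not from the paper); the final value is the square error criterion E given earlier:

```python
import numpy as np

rng = np.random.default_rng(0)
# Illustrative data: two well-separated blobs of 50 points each (made up).
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])

def kmeans(X, k, iters=100):
    # Random initialization: pick k distinct data points as starting centers.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assignment step: each point joins its nearest center.
        labels = np.argmin(np.linalg.norm(X[:, None] - centers[None], axis=2), axis=1)
        # Update step: recompute each center as the mean of its cluster.
        new = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new, centers):  # convergence criterion
            break
        centers = new
    # Square error criterion E: sum over clusters of ||p - mi||^2
    sse = sum(np.sum((X[labels == j] - centers[j]) ** 2) for j in range(k))
    return labels, centers, sse

labels, centers, sse = kmeans(X, k=2)
```

As the text notes, the result depends on the random initial centers; rerunning with a different seed can converge to a different local optimum.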
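Likewise, the PAM swap step, replacing a medoid by a non-medoid whenever the total distance of the clustering improves, can be sketched with a brute-force swap search. The toy data below are illustrative only, and this naive search is what makes PAM expensive on large datasets:

```python
import numpy as np

rng = np.random.default_rng(2)
# Illustrative data: two small blobs of 15 points each (made up).
X = np.vstack([rng.normal(0, 0.4, (15, 2)), rng.normal(4, 0.4, (15, 2))])

def total_cost(X, medoid_idx):
    """Total distance of every point to its nearest medoid."""
    d = np.linalg.norm(X[:, None] - X[medoid_idx][None], axis=2)
    return d.min(axis=1).sum()

def pam(X, k=2):
    medoids = list(range(k))           # arbitrary initial medoids
    best = total_cost(X, medoids)
    improved = True
    while improved:                    # swap phase
        improved = False
        for m in range(k):
            for h in range(len(X)):    # try replacing medoid m by non-medoid h
                if h in medoids:
                    continue
                trial = medoids.copy()
                trial[m] = h
                c = total_cost(X, trial)
                if c < best:           # keep the swap only if it lowers the cost
                    best, medoids, improved = c, trial, True
    return medoids, best

medoids, cost = pam(X)
```

Because each medoid is an actual data point rather than a mean, the result is less sensitive to outliers than k-means; CLARA applies this same procedure to samples of the data instead of the whole set.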
IMPROVED PARTITION ALGORITHMS
Multi-center Fuzzy C-means algorithm based on Transitive Closure and Spectral Clustering (MFCMTCSC)
It uses a multi-center initialization method to solve the initialization-sensitivity problem of the FCM algorithm, and applies to non-traditional curved clusters. To ensure the extraction of spectral features, the Floyd algorithm provides a similarity matrix used in block-symmetric form [8]. On the other hand, the problem of clustering samples is changed into a problem of merging sub-clusters; thus the computational load is low, and the method has strong robustness [9].

Robust clustering approach
It is based on the maximum likelihood principle, and focuses on maximizing the objective function. The approach also extends the Least Trimmed Squares
approach to fuzzy clustering toward a more general methodology [3]. Moreover, it discards a fixed fraction of the data: the fixed trimming level controls the number of observations to be discarded, in a different way from other methods, which are based on fixing a noise distance [6]. This approach also considers an eigenvalue-ratio constraint that makes it a mathematically well-defined problem and serves to control the allowed differences among the cluster scatters.

The credibility concept is utilized to integrate degrees of membership and non-membership [8].

COMPARISON OF IMPROVED PARTITION ALGORITHMS
For the challenges discussed above, each of these algorithms proposes solutions; however, the improved methods have advantages and disadvantages, briefly summarized in Table 2.
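The trimming idea, discarding a fixed fraction of the observations farthest from the fitted model rather than fixing a noise distance, can be illustrated on a single-cluster toy example. This sketch is a simplified analogue of the trimmed approach, not the paper's full trimmed fuzzy clustering method, and the data are made up:

```python
import numpy as np

rng = np.random.default_rng(1)
# Illustrative data: one tight blob plus two far-away outliers (made up).
X = np.vstack([rng.normal(0, 0.3, (40, 2)), [[10.0, 10.0], [12.0, -9.0]]])

def trimmed_mean_center(X, trim_fraction=0.1, iters=20):
    """Toy version of trimming: at each step, discard the fixed fraction of
    points farthest from the current center, then re-estimate the center
    from the retained points only."""
    center = X.mean(axis=0)
    n_keep = int(len(X) * (1 - trim_fraction))
    for _ in range(iters):
        d = np.linalg.norm(X - center, axis=1)
        keep = np.argsort(d)[:n_keep]      # retain only the closest points
        new = X[keep].mean(axis=0)
        if np.allclose(new, center):
            break
        center = new
    return center

center = trimmed_mean_center(X)
```

Because the trimming level is a fixed fraction, the two outliers are discarded regardless of how far away they lie, and the estimated center stays near the true blob center; an untrimmed mean would be pulled toward the outliers.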
APPLICATIONS OF CLUSTERING ALGORITHMS
Clustering plays a vital role in data analysis. The k-means algorithm is a simple algorithm that has been adapted to many problem domains, and it is a natural candidate for extension to work with fuzzy feature vectors [4]. A large number of clustering algorithms remain to be developed in a variety of domains for different types of applications, and none of these algorithms is suitable for all types of applications. This survey work was carried out to analyze the applications of partition-based clustering algorithms by various researchers in different domains [5]. The clustering algorithms have been applied in many applications, including improving the efficiency of information retrieval systems and the simulation of special medical cluster applications, among other areas. A new framework to improve "web sessions" was proposed in the paper "Refinement of Web usage Data Clustering from K-means with Genetic Algorithm". In that work, initially a modified k-means algorithm is used to cluster the user sessions [7]; the refined initial starting condition allows the iterative algorithm to converge to a "better" local minimum. A GA (Genetic Algorithm) based refinement algorithm is then proposed to improve the cluster quality [6].

Other reported medical applications include discovering interesting patterns (uterus disease diagnosis, with the required level of instructions in complete form) in the initial cluster, discovering overlapping clusters by reassignment of the medical dataset under disease level [2], and updating the score cache level of the information finding and retrieval phase of medical applications with different types of datasets. This article describes the properties of the k-means algorithm among the other algorithms; they are summarized below for quick absorption by researchers and developers:
1. Dependent on the starting point and the number of clusters; can require very few iterations.
2. Limits of clusters are well defined, without overlapping.
3. The results of k-Means are reliant on the starting centroid locations.
4. Frequently, different optimal results can be produced using different starting centroids.
5. Creates large clusters early in the process; the clusters formed depend on the order of the input data.
6. Execution efficiency depends upon how many clusters are chosen.
7. Adaptable for isolating spherical or poorly separated clusters.

CONCLUSION
Data mining includes techniques from various fields to analyze data, and many algorithms apply different analyses to data in this field. In this paper, after reviewing the clustering methods of data mining, a number of these algorithms were presented as a whole and independently of one another, and their differences were studied. The discussed methods were an introduction to the concepts and to the research that has produced the available algorithms with different functions in various fields [3]. The new improved algorithms and the proposed solutions to the challenges of the partition algorithms were then described. Finally, no clustering algorithm is generally considered the best for solving all problems, and algorithms designed under certain assumptions are usually assigned to special applications. Considering the importance of partition clustering in data mining, and its wide use in recent years, clustering algorithms have become a field of active and dynamic research [7].

Therefore, improving partition clustering algorithms such as K-means and FCM could be an interesting issue for future research. The paper describes different methodologies and parameters associated with partition clustering algorithms. The drawback of the k-means algorithm is finding the optimal k value and the initial centroid for each cluster [9]; this is overcome by applying concepts such as genetic algorithms, simulated annealing, harmony search techniques and ant colony optimization. The choice of clustering algorithm depends both on the type of data available and on the particular purpose and chosen application []. The partition algorithms work well for finding spherical-shaped clusters in a given input such as a medical dataset. This article discusses the various application areas of partition-based clustering algorithms like k-Means, k-Medoids and Fuzzy C-Means [10]. The k-Means algorithm is very consistent when compared and analyzed with the other two algorithms; further, it stamps its superiority in terms of its lower execution time. From this survey, the applications of innovative and special approaches of clustering algorithms, principally for the medical domain, were identified [4]. From the various applications by several researchers, in particular, the performance of the k-Means algorithm is well suited. Most of the researchers are