5 vistas

Título original: 8clst

Cargado por Mukul Verma

- Geometric Morphometrics--protocols for Error Testing
- Anomaly Detection RapidMiner
- Metáforas UP y DOWN
- Data Mining Intro IEP
- 3. Comp Sci - Ijcse -An Optimized Approach to Analyze Stock Market
- Automatic Tag to Region
- MachineLearning-Lecture12
- Lecture 07
- Implementing Cluster based Recommendation algorithm through Split Inversions
- IJAIEM-2013-04-09-016
- SOM U C Clustering September 2011
- Function Guide
- ewsn08 (1)
- t Cc 2014030266
- Exploring Application-Level Semantics for Data Compression Abstract by Coreieeeprojects
- r05411201 Information Retrieval Systems
- Cluster 1
- An efficient method for active semi-supervised density based clustering
- cluster4.pdf
- openSAP_ds1_Week_3_All_Slides

Está en la página 1de 51

Cluster Analysis

Partitioning Methods

Hierarchical Methods

Density-Based Methods

Grid-Based Methods

Outlier Analysis

Summary

Pattern Recognition

Spatial Data Analysis

create thematic maps in GIS by clustering

feature spaces

detect spatial clusters and explain them in

spatial data mining

Image Processing

Economic Science (especially market research)

WWW

Document classification

Cluster Weblog data to discover groups of similar

access patterns

Examples of Clustering

Applications

Marketing: Help marketers discover distinct groups in

their customer bases, and then use this knowledge to

develop targeted marketing programs

an earth observation database

policy holders with a high average claim cost

to their house type, value, and geographical location

should be clustered along continent faults

clusters with

similarity measure used by the method and its

implementation.

by its ability to discover some or all of the hidden

patterns.

Mining

Scalability

determine input parameters

High dimensionality

Partitioning Methods

Hierarchical Methods

Density-Based Methods

Grid-Based Methods

Outlier Analysis

Summary

Data Structures

Data matrix

(two modes)

x11

...

x

i1

...

x

n1

...

x1f

...

x1p

...

...

...

...

xif

...

...

xip

...

...

... xnf

...

...

...

xnp

d(2,1)

0

Dissimilarity matrix

d(3,1) d ( 3,2) 0

(one mode)

:

:

:

... 0

8

Clustering

in terms of a distance function, which is typically

metric: d(i, j)

There is a separate quality function that measures

the goodness of a cluster.

The definitions of distance functions are usually very

different for interval-scaled, boolean, categorical,

ordinal and ratio variables.

Weights should be associated with different variables

based on applications and data semantics.

It is hard to define similar enough or good enough

Interval-scaled variables:

Binary variables:

10

Interval-valued variables

Standardize data

s f 1n (| x1 f m f | | x2 f m f | ... | xnf m f |)

where

...

xnf )

m f 1n (x1 f x2 f

xif m f

zif

sf

using standard deviation

11

Between Objects

similarity or dissimilarity between two data

objects

q

q

Some popular

d (i, j) q (| ones

x x |include:

| x x Minkowski

| q ... | x x |distance:

)

i1

j1

i2

j2

ip

jp

are two p-dimensional data objects, and q is a

positive integer

If q = 1, d isd (iManhattan

, j) | x x | |distance

x x | ... | x x |

i1

j1

i2

j2

ip

jp

12

Between Objects (Cont.)

If q = 2, d is Euclidean distance:

d (i, j) (| x x | 2 | x x | 2 ... | x x |2 )

i1

j1

i2

j2

ip

jp

Properties

d(i,j) 0

d(i,i) = 0

d(i,j) = d(j,i)

Pearson product moment correlation, or other

disimilarity measures.

13

Binary Variables

Object j

Object i

1

0

1

a

c

0

b

d

sum a c b d

sum

a b

cd

p

bc

a bc d

Jaccard coefficient (noninvariant if the binary

d (i, j)

variable is asymmetric):

d (i, j)

May 23, 2016

bc

a bc

14

states are equally valuable, that is, there is

no preference on which outcome should be

coded as 1.

A binary variable is asymmetric if the

outcome of the states are not equally

important, such as positive or negative

outcomes of a disease test.

Similarity that is based on symmetric binary

variables is called invariant similarity.

15

Variables

Example

Name

Jack

Mary

Jim

Gender

M

F

M

Fever

Y

Y

Y

Cough

N

N

P

Test-1

P

P

N

Test-2

N

N

N

Test-3

N

P

N

Test-4

N

N

N

the remaining attributes are asymmetric binary

let the values Y and P be set to 1, and the value N be

setdto

0 , mary ) 0 1 0.33

( jack

2 01

11

d ( jack , jim )

0.67

111

1 2

d ( jim , mary )

0.75

11 2

16

Nominal Variables

take more than 2 states, e.g., red, yellow, blue, green

m

d (i, j) p

p

nominal states

17

Ordinal Variables

rif {1,..., M f }

replacing x

if by their rank

replacing i-th object in the f-th variable by

rif 1

zif

M f 1

interval-scaled variables

18

Ratio-Scaled Variables

nonlinear scale, approximately at exponential

scale,

such as AeBt or Ae-Bt

Methods:

yif = log(xif)

rank as interval-scaled.

19

Variables of Mixed

Types

symmetric binary, asymmetric binary, nominal,

ordinal, interval and ratio.

One may use a weighted formula to combine their

effects.

p

(f)

(f)

f 1 ij dij

d (i, j )

pf 1 ij( f )

f is binary or nominal:

dij(f) = 0 if xif = xjf , or dij(f) = 1 o.w.

f is ordinal or ratio-scaled

compute ranks r and

if

r 1

zif

if

M

20

Partitioning Methods

Hierarchical Methods

Density-Based Methods

Grid-Based Methods

Outlier Analysis

Summary

21

then evaluate them by some criterion

of the set of data (or objects) using some criterion

functions

clusters and the idea is to find the best fit of that model to

each other

22

Partitioning Methods

Hierarchical Methods

Density-Based Methods

Grid-Based Methods

Outlier Analysis

Summary

23

Concept

D of n objects into a set of k clusters

chosen partitioning criterion

by the center of the cluster

(Kaufman & Rousseeuw87): Each cluster is

represented by one of the objects in the cluster

24

in 4 steps:

Partition objects into k nonempty subsets

Compute seed points as the centroids of the

clusters of the current partition. The

centroid is the center (mean point) of the

cluster.

Assign each object to the cluster with the

nearest seed point.

Go back to Step 2, stop when no more new

assignment.

25

Example

10

10

0

0

10

10

10

10

0

0

10

10

26

Strength

Relatively efficient: O(tkn), where n is # objects, k is

# clusters, and t is # iterations. Normally, k, t << n.

Often terminates at a local optimum.

Weakness

Applicable only when mean is defined, then what

about categorical data?

Need to specify k, the number of clusters, in

advance

Unable to handle noisy data and outliers

Not suitable to discover clusters with non-convex

shapes

27

Selection of the initial k means

Dissimilarity calculations

Strategies to calculate cluster means

Handling categorical data: k-modes (Huang98)

Replacing means of clusters with modes

Using new dissimilarity measures to deal with

categorical objects

Using a frequency-based method to update modes

of clusters

A mixture of categorical and numerical data: kprototype method

28

replaces one of the medoids by one of the nonmedoids if it improves the total distance of the

resulting clustering

not scale well for large data sets

29

(1987)

PAM (Kaufman and Rousseeuw, 1987), built in Splus

Use real object to represent the cluster

object i, calculate the total swapping cost TCih

Then assign each non-selected object to the

most similar representative object

30

TCih=jCjih

10

10

8

7

6

5

4

3

6

5

h

i

4

3

0

0

10

Cjih = 0

0

10

10

10

7

6

5

7

6

4

3

0

10

0

10

31

(1990)

each sample, and gives the best clustering as the output

Weakness:

necessarily represent a good clustering of the whole

data set if the sample is biased

32

(1994)

Search) (Ng and Han94)

graph where every node is a potential solution, that is, a

set of k medoids

randomly selected node in search for a new local optimum

further improve its performance (Ester et al.95)

33

Partitioning Methods

Hierarchical Methods

Density-Based Methods

Grid-Based Methods

Outlier Analysis

Summary

34

Hierarchical Clustering

method does not require the number of clusters k

as an input, but needs a termination condition

Step 0

a

b

Step 1

ab

abcde

cde

de

e

Step 4

May 23, 2016

agglomerative

(AGNES)

Step 3

Data Mining: Concepts and Techniques

divisive

(DIANA)

35

Splus

matrix.

Go on in a non-descending fashion

10

10

10

0

0

10

0

0

10

10

36

Clusters are Merged Hierarchically

Decompose data objects into a several levels of nested

partitioning (tree of clusters), called a dendrogram.

A clustering of the data objects is obtained by cutting the

dendrogram at the desired level, then each connected

component forms a cluster.

37

Splus

10

10

10

0

0

10

0

0

10

10

38

Agglomerative Clustering

Algorithm

1. Compute the proximity matrix

2. Let each data point be a cluster

3. Repeat

4. Merge the two closest clusters

5. Update the proximity matrix

6. Until only a single cluster remains

39

Methods

do not scale well: time complexity of at least O(n2),

where n is the number of total objects

can never undo what was done previously

Integration of hierarchical with distance-based

clustering

BIRCH (1996): uses CF-tree and incrementally

adjusts the quality of sub-clusters

CURE (1998): selects well-scattered points from the

cluster and then shrinks them towards the center of

the cluster by a specified fraction

CHAMELEON (1999): hierarchical clustering using

dynamic modeling

40

Partitioning Methods

Hierarchical Methods

Density-Based Methods

Grid-Based Methods

Outlier Analysis

Summary

41

Density-Based Clustering

Methods

Clustering

based on density (local cluster criterion),

such as density-connected points

Major features:

Discover clusters of arbitrary shape

Handle noise

One scan

Need density parameters as termination

condition

Several interesting studies:

DBSCAN: Ester, et al. (KDD96)

OPTICS: Ankerst, et al (SIGMOD99).

DENCLUE: Hinneburg & D. Keim (KDD98)

CLIQUE: Agrawal, et al. (SIGMOD98)

42

Density-Based Clustering:

Background

Two parameters:

density-reachable from a point q wrt. Eps, MinPts if

1) p belongs to NEps(q)

p

q

MinPts = 5

Eps = 1 cm

May 23, 2016

43

Density-Based Clustering:

Background (II)

Density-reachable:

Density-connected

A point p is density-reachable

from a point q wrt. Eps, MinPts if

there is a chain of points p1, ,

pn, p1 = q, pn = p such that pi+1 is

directly density-reachable from pi

A point p is density-connected to

a point q wrt. Eps, MinPts if there

is a point o such that both, p and

q are density-reachable from o

wrt. Eps and MinPts.

p1

q

o

44

Clustering of Applications with

Noise

cluster is defined as a maximal set of densityconnected points

Discovers clusters of arbitrary shape in spatial

databases with noise

Outlier

Border

Eps = 1cm

Core

MinPts = 5

45

Eps and MinPts.

If p is a border point, no points are densityreachable from p and DBSCAN visits the next

point of the database.

been processed.

46

(1999)

Structure

Ankerst, Breunig, Kriegel, and Sander (SIGMOD99)

Produces a special order of the database wrt its

density-based clustering structure

This cluster-ordering contains info equiv to the

density-based clusterings corresponding to a broad

range of parameter settings

Good for both automatic and interactive cluster

analysis, including finding intrinsic clustering

structure

Can be represented graphically or using

visualization techniques

47

DBSCAN

Index-based:

k = number of dimensions

N = 20

p = 75%

M = N(1-p) = 5

Complexity: O(kN2)

Core Distance

Reachability Distancep2

r(p1, o) = 2.8cm. r(p2,o) = 4cm

May 23, 2016

p1

o

o

MinPts = 5

= 3 cm

48

Reachability

-distance

undefined

Cluster-order

of the objects

49

functions

(KDD98)

Major features

arbitrarily shaped clusters in high-dimensional data

sets

than DBSCAN by a factor of up to 45)

50

The set of objects are considerably dissimilar

from the remainder of the data

Example: Sports: Michael Jordon, Wayne

Gretzky, ...

Problem

Find top n outlier points

Applications:

Credit card fraud detection

Telecom fraud detection

Customer segmentation

Medical analysis

51

Outlier Discovery:

Statistical

Approaches

generates data set (e.g. normal distribution)

Use discordancy tests depending on

data distribution

distribution parameter (e.g., mean, variance)

number of expected outliers

Drawbacks

most tests are for single attribute

In many cases, data distribution may not be known

52

- Geometric Morphometrics--protocols for Error TestingCargado porTyB1990
- Anomaly Detection RapidMinerCargado porAnwar Zahroni
- Metáforas UP y DOWNCargado porMaría Cristina Martínez
- Data Mining Intro IEPCargado porsukhpreet11
- 3. Comp Sci - Ijcse -An Optimized Approach to Analyze Stock MarketCargado poriaset123
- Automatic Tag to RegionCargado porTick Tack
- MachineLearning-Lecture12Cargado porJahangir Alam Mithu
- Lecture 07Cargado pormachinelearner
- Implementing Cluster based Recommendation algorithm through Split InversionsCargado porIJSRP ORG
- IJAIEM-2013-04-09-016Cargado porAnonymous vQrJlEN
- SOM U C Clustering September 2011Cargado porVahid Moosavi
- Function GuideCargado porabulfaiziqbal
- ewsn08 (1)Cargado porpsgmail
- t Cc 2014030266Cargado porGanesh Nk
- Exploring Application-Level Semantics for Data Compression Abstract by CoreieeeprojectsCargado porMadhu Sudan U
- r05411201 Information Retrieval SystemsCargado porgeddam06108825
- An efficient method for active semi-supervised density based clusteringCargado poreditor3854
- Cluster 1Cargado porKarina
- cluster4.pdfCargado porArka Bose
- openSAP_ds1_Week_3_All_SlidesCargado porqwerty_qwerty_2009
- SEMI-AUTOMATIC LEAF DISEASE DETECTION AND CLASSIFICATION SYSTEM FOR SOYBEAN CULTURECargado porDinesh S
- Final Fyp Report!!!!Cargado porSadia Farooq
- DataMining-Syllabus-AU13Cargado porLatoya Hodges
- IEEEXplorer-1Cargado porgaythree
- Riege.aier.EA Contingency.2009Cargado porcgarciahe5427
- IJCTT-V3I5P101Cargado porsurendiran123
- Chapter 2 v3Cargado porJoy Bautista
- implenetation 2Cargado porDipak Kumar
- DP-FIM FOR DATA LINKAGE USING ONE CLASS CLUSTERING TREE FOR MOVIE RECOMMENDATIONCargado porIJIERT-International Journal of Innovations in Engineering Research and Technology
- firefly for SCPCargado poraman3012

- bayesnetCargado porKhurram Ch
- WEEK 8.docxCargado porLavender Man
- FinalCargado pornidduzzi
- Pneumatic Positiioner FisherCargado porAhmad Adel El Tantawy
- IJE-324Cargado porPayalSi
- Literature Survey Microsoft SurfaceCargado porB_Chetan
- Specification 705 - Optic Fibre Installations.RCN-D14^23409264Cargado porDerrick Senyo
- Meteor Burst Communications- article on commercial history, developments, key international projects- Don Sytsma interviewCargado porSkybridge Spectrum Foundation
- 3b. Deloitte Global Human Capital Trends-Future of work 2017.pdfCargado porChristian Cherry Tea
- 58180 MITSMR Deloitte Digital Business 2016Cargado porRafael Uribe
- Mix DesignCargado porPankaj Munjal
- FINAL PAPERCargado porMichaelister Ordoñez Monteron
- Pdc Course FileCargado porsaftainpatel
- The Calypso Post Vol 4Cargado porXaxi Yayi
- F2 1st Term Maths Teaching Material (1st Edition)Cargado porapi-3854339
- Newman Teddiman: Handbook of Research on Discourse Behavior and Digital Communication: Language Structures and Social InteractionCargado porKenneth Plasa
- Ph.d Thesis Iupware by LiuyongboCargado porWayaya2009
- The Role Law Plays in Modern SocietyCargado porSana
- FisherRoss&10-Interviewing Witnesses and VictimsCargado porMichelle Gomes
- M39f AudioCargado porLucian Florin
- Lecture 12 Secondary tillage.pdfCargado porsujan paudel
- Sample EssaysCargado porAmirul Arif
- CFMS DocumentationCargado porFelix Nana Sarpong
- Propagation of EM Signals in Underground Metal/Non-Metal MinesCargado porJHelf
- The Resurgence of C ProgrammingCargado porSorin Sorinos
- accb1MMS 403Cargado porAbhishek_mhptr
- qqqCargado porGede Sthandila
- Design and Fabrication of Automated Bed Control System for PatientsCargado porInternational Journal for Scientific Research and Development - IJSRD
- BLOCK 1Cargado porPrasanta Debnath
- A2 Magazine QuestionnaireCargado porHarrygeopreston