
Brief introduction to lectures

Lecture 1: Overview of KDD
Lecture 2: Preparing data
Lecture 3: Decision tree induction
Lecture 4: Mining association rules
Lecture 5: Automatic cluster detection
Lecture 6: Artificial neural networks
Lecture 7: Evaluation of discovered knowledge
1

Lecture 5: Automatic Cluster Detection

One of the most widely used KDD classification techniques for unsupervised data.

Content of the lecture
1. Introduction
2. Partitioning Clustering
3. Hierarchical Clustering
4. Software and case-studies

Prerequisite: Nothing special

2

Partitioning Clustering
A partition of a set of n objects $X = \{x_1, x_2, \ldots, x_n\}$ is a collection of $K$ disjoint non-empty subsets $P_1, P_2, \ldots, P_K$ of $X$ ($K \le n$), often called clusters, satisfying the following conditions:

(1) they are pairwise disjoint: $P_i \cap P_j = \emptyset$ for all $P_i$ and $P_j$, $i \ne j$;
(2) their union is $X$: $P_1 \cup P_2 \cup \ldots \cup P_K = X$.

Denote the partition $P = \{P_1, P_2, \ldots, P_K\}$; the $P_i$ are called the components of $P$.

Each cluster must contain at least one object; each object must belong to exactly one group.
3
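To make the two conditions concrete, here is a minimal Python sketch (the function and variable names are my own, not from the lecture) that checks whether a collection of subsets forms a valid partition of X:

```python
def is_partition(X, P):
    """Check the partition conditions for a collection of clusters P over X."""
    # Each cluster must contain at least one object (non-empty subsets).
    if any(len(p) == 0 for p in P):
        return False
    # (1) Pairwise disjoint: no object may appear in two different clusters.
    # (2) Union is X: every object must belong to some cluster.
    # Together: each object belongs to exactly one cluster.
    members = [x for p in P for x in p]
    return len(members) == len(set(members)) and set(members) == set(X)

X = {1, 2, 3, 4, 5, 6}
print(is_partition(X, [{1, 4}, {2, 3}, {5, 6}]))     # True
print(is_partition(X, [{1, 4}, {2, 3, 4}, {5, 6}]))  # False: 4 is in two clusters
```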

Partitioning Clustering
What is a good partitioning clustering?

Key idea: objects in the same group are similar, and objects in different groups are dissimilar.

$P = \{\{x_1, x_4, x_7, x_9, x_{10}\}, \{x_2, x_8\}, \{x_3, x_5, x_6\}\}$, with components $P_1$, $P_2$, $P_3$.

Minimize the within-group distance and maximize the between-group distance.

Notice: there are many ways to define the within-group distance (the average distance to the group's center, the average distance between all pairs of objects, etc.), and likewise the between-group distance. Because the number of possible partitions grows combinatorially with n, it is in general infeasible to find the optimal clustering.
4
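Since the slide leaves the distance definitions open, here is one illustrative choice in Python, assuming Euclidean distance, average distance to the group's center for within-group, and center-to-center distance for between-group (all function names are hypothetical):

```python
import math  # math.dist requires Python 3.8+

def centroid(group):
    # Component-wise mean of the points in the group.
    return tuple(sum(c) / len(group) for c in zip(*group))

def within_group(group):
    # One common choice: average distance of the objects to the group's center.
    center = centroid(group)
    return sum(math.dist(x, center) for x in group) / len(group)

def between_groups(g1, g2):
    # One common choice: distance between the two group centers.
    return math.dist(centroid(g1), centroid(g2))

P = [[(0, 0), (1, 0)], [(5, 5), (6, 5), (5, 6)]]   # a partition of 2-D points
print([round(within_group(g), 3) for g in P])      # small within-group distances
print(round(between_groups(P[0], P[1]), 3))        # larger between-group distance
```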

Hierarchical Clustering
Partition Q is nested into partition P if every component of Q is a subset of a component of P.

$P = \{\{x_1, x_4, x_7, x_9, x_{10}\}, \{x_2, x_8\}, \{x_3, x_5, x_6\}\}$

$Q = \{\{x_1, x_4, x_9\}, \{x_7, x_{10}\}, \{x_2, x_8\}, \{x_5\}, \{x_3, x_6\}\}$

A hierarchical clustering is a sequence of partitions in which each partition is nested into the next partition in the sequence.
(This definition is for bottom-up hierarchical clustering. In the case of top-down hierarchical clustering, "next" becomes "previous".)
5
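A minimal sketch of the nestedness test, using the P and Q from this slide with objects written as their indices (the helper name is hypothetical):

```python
def is_nested(Q, P):
    """Q is nested into P if every component of Q is a subset of some component of P."""
    return all(any(q <= p for p in P) for q in Q)  # q <= p is the subset test for sets

P = [{1, 4, 7, 9, 10}, {2, 8}, {3, 5, 6}]
Q = [{1, 4, 9}, {7, 10}, {2, 8}, {5}, {3, 6}]
print(is_nested(Q, P))  # True: each component of Q lies inside a component of P
print(is_nested(P, Q))  # False: {1, 4, 7, 9, 10} fits in no component of Q
```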

Bottom-up Hierarchical Clustering


$\{x_1\},\{x_2\},\{x_3\},\{x_4\},\{x_5\},\{x_6\}$
$\{x_1\},\{x_2, x_3\},\{x_4\},\{x_5\},\{x_6\}$
$\{x_1, x_2, x_3\},\{x_4\},\{x_5, x_6\}$
$\{x_1, x_2, x_3, x_4\},\{x_5, x_6\}$
$\{x_1, x_2, x_3, x_4, x_5, x_6\}$

[Dendrogram over $x_1, \ldots, x_6$ illustrating the merge sequence.]
6
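A minimal bottom-up sketch, assuming one-dimensional points and single-linkage (closest-pair) distance between clusters; both assumptions are illustrative choices, not prescribed by the lecture:

```python
def single_link(c1, c2):
    # Single-linkage: cluster distance = distance of the closest pair of members.
    return min(abs(a - b) for a in c1 for b in c2)

def bottom_up(points):
    """Start from singletons and repeatedly merge the two closest clusters,
    printing each partition in the nested sequence."""
    clusters = [[p] for p in points]
    print(clusters)
    while len(clusters) > 1:
        pairs = [(i, j) for i in range(len(clusters))
                        for j in range(i + 1, len(clusters))]
        i, j = min(pairs, key=lambda ij: single_link(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] += clusters.pop(j)   # merge cluster j into cluster i
        print(clusters)

bottom_up([1.0, 2.0, 2.1, 4.0, 8.0, 8.3])
```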

Top-Down Hierarchical Clustering


$\{x_1, x_2, x_3, x_4, x_5, x_6\}$
$\{x_1, x_2, x_3, x_4\},\{x_5, x_6\}$
$\{x_1, x_2, x_3\},\{x_4\},\{x_5, x_6\}$
$\{x_1\},\{x_2, x_3\},\{x_4\},\{x_5\},\{x_6\}$
$\{x_1\},\{x_2\},\{x_3\},\{x_4\},\{x_5\},\{x_6\}$

[Dendrogram over $x_1, \ldots, x_6$ illustrating the split sequence.]
7

OSHAM: Hybrid Model


[Screenshot of the OSHAM concept hierarchy discovered from the Wisconsin Breast Cancer Data: the discovered concepts, including multiple-inheritance concepts, with brief descriptions and their attributes.]
8


Lecture 6: Neural networks

One of the most widely used KDD classification techniques.

Content of the lecture
1. Neural network representation
2. Feed-forward neural networks
3. Using the back-propagation algorithm
4. Case-studies

Prerequisite: Nothing special

10
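As a taste of items 2 and 3, a minimal feed-forward network with one hidden layer trained by back-propagation on XOR data (NumPy; the architecture, learning rate, and iteration count are illustrative assumptions, not the lecture's settings):

```python
import numpy as np

rng = np.random.default_rng(0)
# XOR data, with a constant 1 appended to each input to act as a bias term.
X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

W1 = rng.normal(size=(3, 4))  # input (+bias) -> 4 hidden units
W2 = rng.normal(size=(5, 1))  # hidden (+bias) -> output

for _ in range(10000):
    # Forward pass through the feed-forward network.
    h = np.hstack([sigmoid(X @ W1), np.ones((4, 1))])  # hidden layer + bias unit
    out = sigmoid(h @ W2)
    # Back-propagation: push the squared-error gradient back through the layers.
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2[:4].T) * h[:, :4] * (1 - h[:, :4])
    # Gradient-descent weight updates (learning rate 0.5).
    W2 -= 0.5 * h.T @ d_out
    W1 -= 0.5 * X.T @ d_h

print(out.round(2))  # typically close to the XOR targets [[0], [1], [1], [0]]
```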


Lecture 7: Evaluation of discovered knowledge

How to evaluate the knowledge discovered by KDD classification techniques.

Content of the lecture
1. Cross validation
2. Bootstrapping
3. Case-studies

Prerequisite: Nothing special


12
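Item 2, bootstrapping, gets no dedicated slide in this deck; here is a minimal sketch of the out-of-bag bootstrap error estimate, assuming a hypothetical interface where induce(train) returns a model and error(model, test) returns an error rate:

```python
import random

def bootstrap_error(data, induce, error, runs=30, seed=0):
    """Out-of-bag bootstrap: train on a sample drawn with replacement,
    test on the rows the sample happened to miss."""
    rng = random.Random(seed)
    run_errors = []
    for _ in range(runs):
        sample = [rng.choice(data) for _ in data]          # same size, with replacement
        out_of_bag = [row for row in data if row not in sample]
        if out_of_bag:                                     # skip the rare all-covered draw
            run_errors.append(error(induce(sample), out_of_bag))
    return sum(run_errors) / len(run_errors)               # average over the runs
```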

Out-of-sample testing
[Diagram: Historical Data (warehouse) → sampling method → sample data; a second sampling step splits the sample into training data (2/3) and testing data (1/3); the induction method builds a model from the training data, and error estimation on the testing data yields the error.]

The quality of the test-sample estimate depends on the number of test cases and on the validity of the independence assumption.
13
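A minimal sketch of the 2/3-train, 1/3-test scheme in the diagram. The induce/error functions are hypothetical stand-ins for a real induction method and error measure; the toy "induction method" below simply predicts the majority class:

```python
import random

def out_of_sample_error(data, induce, error, seed=0):
    """Estimate error by training on 2/3 of the data and testing on 1/3."""
    rows = data[:]                        # copy so the caller's list is untouched
    random.Random(seed).shuffle(rows)
    cut = 2 * len(rows) // 3              # 2/3 training data
    train, test = rows[:cut], rows[cut:]  # 1/3 testing data
    model = induce(train)                 # induction method builds the model
    return error(model, test)             # error estimated on unseen test cases

# Toy run: "induce" a majority-class predictor and measure its error rate.
data = [(x, x > 5) for x in range(12)]
induce = lambda rows: max({lab for _, lab in rows},
                          key=[lab for _, lab in rows].count)
error = lambda model, rows: sum(lab != model for _, lab in rows) / len(rows)
print(out_of_sample_error(data, induce, error))
```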

Cross Validation
[Diagram: Historical Data (warehouse) → sampling method → sample data → sampling method splits the data into Sample 1, Sample 2, ..., Sample n; the induction method iterates over the samples, building a model and estimating the error on each run, and the per-run errors are combined into the overall error estimation.]

10-fold cross validation appears adequate (n = 10).


14

Evaluation: k-fold cross validation (k = 3)

Given a data set and a method to be evaluated:
- Randomly split the data set into 3 subsets of equal size (mutually exclusive, equal size).
- For each fold, run the method on the other 2 subsets as training data to find knowledge.
- Test on the remaining subset as testing data to evaluate the accuracy.
- Average the accuracies as the final evaluation.

15
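The same procedure generalized to k folds, reusing the hypothetical induce/error interface from the out-of-sample sketch above:

```python
def k_fold_error(data, induce, error, k=3):
    """k-fold cross validation: train on k-1 folds, test on the held-out fold."""
    # Slice the (pre-shuffled) data into k mutually exclusive, near-equal folds.
    folds = [data[i::k] for i in range(k)]
    run_errors = []
    for i in range(k):
        test = folds[i]
        train = [row for j in range(k) if j != i for row in folds[j]]
        run_errors.append(error(induce(train), test))
    return sum(run_errors) / k  # average the per-fold error estimates

# Reusing the toy data, induce, and error from the out-of-sample sketch:
# print(k_fold_error(data, induce, error, k=3))
```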

Outline of the presentation

- Objectives, Prerequisite and Content
- Introduction to Lectures
- Brief Discussion and Conclusion

This presentation summarizes the content and organization of the lectures in the module Knowledge Discovery and Data Mining.
16
