
Brief introduction to lectures

Lecture 1: Overview of KDD
Lecture 2: Preparing data
Lecture 3: Decision tree induction
Lecture 4: Mining association rules
Lecture 5: Automatic cluster detection
Lecture 6: Artificial neural networks
Lecture 7: Evaluation of discovered knowledge
1

Lecture 5: Automatic Cluster Detection

One of the most widely used KDD classification techniques for unsupervised data.

Content of the lecture
1. Introduction
2. Partitioning Clustering
3. Hierarchical Clustering
4. Software and case-studies

Prerequisite: Nothing special

2

Partitioning Clustering
A partition of a set of n objects $X = \{x_1, x_2, \ldots, x_n\}$ is a collection of $K$ disjoint non-empty subsets $P_1, P_2, \ldots, P_K$ of $X$ ($K \le n$), often called clusters, satisfying the following conditions:

(1) they are pairwise disjoint: $P_i \cap P_j = \emptyset$ for all $P_i$ and $P_j$, $i \ne j$;
(2) their union is $X$: $P_1 \cup P_2 \cup \ldots \cup P_K = X$.

Denote the partition $P = \{P_1, P_2, \ldots, P_K\}$; the $P_i$ are called the components of $P$.

Each cluster must contain at least one object; each object must belong to exactly one group.
3
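To make the two conditions concrete, here is a minimal Python sketch (the function and variable names are my own, not from the lecture) that checks whether a collection of subsets forms a valid partition of X:

```python
def is_partition(X, P):
    """Check the partition conditions for a collection of clusters P over X."""
    # Each cluster must contain at least one object (non-empty subsets).
    if any(len(p) == 0 for p in P):
        return False
    # (1) Pairwise disjoint: no object may appear in two different clusters.
    # (2) Union is X: every object must belong to some cluster.
    # Together: each object belongs to exactly one cluster.
    members = [x for p in P for x in p]
    return len(members) == len(set(members)) and set(members) == set(X)

X = {1, 2, 3, 4, 5, 6}
print(is_partition(X, [{1, 4}, {2, 3}, {5, 6}]))     # True
print(is_partition(X, [{1, 4}, {2, 3, 4}, {5, 6}]))  # False: 4 is in two clusters
```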

Partitioning Clustering
What is a good partitioning clustering?

Key idea: objects in the same group are similar, and objects in different groups are dissimilar.

$P = \{\{x_1, x_4, x_7, x_9, x_{10}\}, \{x_2, x_8\}, \{x_3, x_5, x_6\}\}$, with components $P_1$, $P_2$, $P_3$.

Minimize the within-group distance and maximize the between-group distance.

Notice: there are many ways to define the within-group distance (the average distance to the group's center, the average distance between all pairs of objects, etc.), and likewise the between-group distance. Because the number of possible partitions grows combinatorially with n, it is in general infeasible to find the optimal clustering.
4
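Since the slide leaves the distance definitions open, here is one illustrative choice in Python, assuming Euclidean distance, average distance to the group's center for within-group, and center-to-center distance for between-group (all function names are hypothetical):

```python
import math  # math.dist requires Python 3.8+

def centroid(group):
    # Component-wise mean of the points in the group.
    return tuple(sum(c) / len(group) for c in zip(*group))

def within_group(group):
    # One common choice: average distance of the objects to the group's center.
    center = centroid(group)
    return sum(math.dist(x, center) for x in group) / len(group)

def between_groups(g1, g2):
    # One common choice: distance between the two group centers.
    return math.dist(centroid(g1), centroid(g2))

P = [[(0, 0), (1, 0)], [(5, 5), (6, 5), (5, 6)]]   # a partition of 2-D points
print([round(within_group(g), 3) for g in P])      # small within-group distances
print(round(between_groups(P[0], P[1]), 3))        # larger between-group distance
```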

Hierarchical Clustering
Partition Q is nested into partition P if every component of Q is a subset of a component of P.

$P = \{\{x_1, x_4, x_7, x_9, x_{10}\}, \{x_2, x_8\}, \{x_3, x_5, x_6\}\}$

$Q = \{\{x_1, x_4, x_9\}, \{x_7, x_{10}\}, \{x_2, x_8\}, \{x_5\}, \{x_3, x_6\}\}$

A hierarchical clustering is a sequence of partitions in which each partition is nested into the next partition in the sequence.
(This definition is for bottom-up hierarchical clustering. In the case of top-down hierarchical clustering, "next" becomes "previous".)
5
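A minimal sketch of the nestedness test, using the P and Q from this slide with objects written as their indices (the helper name is hypothetical):

```python
def is_nested(Q, P):
    """Q is nested into P if every component of Q is a subset of some component of P."""
    return all(any(q <= p for p in P) for q in Q)  # q <= p is the subset test for sets

P = [{1, 4, 7, 9, 10}, {2, 8}, {3, 5, 6}]
Q = [{1, 4, 9}, {7, 10}, {2, 8}, {5}, {3, 6}]
print(is_nested(Q, P))  # True: each component of Q lies inside a component of P
print(is_nested(P, Q))  # False: {1, 4, 7, 9, 10} fits in no component of Q
```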

Bottom-up Hierarchical Clustering


$\{x_1\},\{x_2\},\{x_3\},\{x_4\},\{x_5\},\{x_6\}$
$\{x_1\},\{x_2, x_3\},\{x_4\},\{x_5\},\{x_6\}$
$\{x_1, x_2, x_3\},\{x_4\},\{x_5, x_6\}$
$\{x_1, x_2, x_3, x_4\},\{x_5, x_6\}$
$\{x_1, x_2, x_3, x_4, x_5, x_6\}$

[Dendrogram over $x_1, \ldots, x_6$ illustrating the merge sequence.]
6
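A minimal bottom-up sketch, assuming one-dimensional points and single-linkage (closest-pair) distance between clusters; both assumptions are illustrative choices, not prescribed by the lecture:

```python
def single_link(c1, c2):
    # Single-linkage: cluster distance = distance of the closest pair of members.
    return min(abs(a - b) for a in c1 for b in c2)

def bottom_up(points):
    """Start from singletons and repeatedly merge the two closest clusters,
    printing each partition in the nested sequence."""
    clusters = [[p] for p in points]
    print(clusters)
    while len(clusters) > 1:
        pairs = [(i, j) for i in range(len(clusters))
                        for j in range(i + 1, len(clusters))]
        i, j = min(pairs, key=lambda ij: single_link(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] += clusters.pop(j)   # merge cluster j into cluster i
        print(clusters)

bottom_up([1.0, 2.0, 2.1, 4.0, 8.0, 8.3])
```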

Top-Down Hierarchical Clustering


$\{x_1, x_2, x_3, x_4, x_5, x_6\}$
$\{x_1, x_2, x_3, x_4\},\{x_5, x_6\}$
$\{x_1, x_2, x_3\},\{x_4\},\{x_5, x_6\}$
$\{x_1\},\{x_2, x_3\},\{x_4\},\{x_5\},\{x_6\}$
$\{x_1\},\{x_2\},\{x_3\},\{x_4\},\{x_5\},\{x_6\}$

[Dendrogram over $x_1, \ldots, x_6$ illustrating the split sequence.]
7

OSHAM: Hybrid Model


[Screenshot of the OSHAM concept hierarchy discovered from the Wisconsin Breast Cancer Data: the discovered concepts, including multiple-inheritance concepts, with brief descriptions and their attributes.]
8


Lecture 6: Neural networks

One of the most widely used KDD classification techniques.

Content of the lecture
1. Neural network representation
2. Feed-forward neural networks
3. Using the back-propagation algorithm
4. Case-studies

Prerequisite: Nothing special

10
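As a taste of items 2 and 3, a minimal feed-forward network with one hidden layer trained by back-propagation on XOR data (NumPy; the architecture, learning rate, and iteration count are illustrative assumptions, not the lecture's settings):

```python
import numpy as np

rng = np.random.default_rng(0)
# XOR data, with a constant 1 appended to each input to act as a bias term.
X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

W1 = rng.normal(size=(3, 4))  # input (+bias) -> 4 hidden units
W2 = rng.normal(size=(5, 1))  # hidden (+bias) -> output

for _ in range(10000):
    # Forward pass through the feed-forward network.
    h = np.hstack([sigmoid(X @ W1), np.ones((4, 1))])  # hidden layer + bias unit
    out = sigmoid(h @ W2)
    # Back-propagation: push the squared-error gradient back through the layers.
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2[:4].T) * h[:, :4] * (1 - h[:, :4])
    # Gradient-descent weight updates (learning rate 0.5).
    W2 -= 0.5 * h.T @ d_out
    W1 -= 0.5 * X.T @ d_h

print(out.round(2))  # typically close to the XOR targets [[0], [1], [1], [0]]
```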


Lecture 7: Evaluation of discovered knowledge

How to evaluate the knowledge discovered by KDD classification techniques.

Content of the lecture
1. Cross validation
2. Bootstrapping
3. Case-studies

Prerequisite: Nothing special


12
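Item 2, bootstrapping, gets no dedicated slide in this deck; here is a minimal sketch of the out-of-bag bootstrap error estimate, assuming a hypothetical interface where induce(train) returns a model and error(model, test) returns an error rate:

```python
import random

def bootstrap_error(data, induce, error, runs=30, seed=0):
    """Out-of-bag bootstrap: train on a sample drawn with replacement,
    test on the rows the sample happened to miss."""
    rng = random.Random(seed)
    run_errors = []
    for _ in range(runs):
        sample = [rng.choice(data) for _ in data]          # same size, with replacement
        out_of_bag = [row for row in data if row not in sample]
        if out_of_bag:                                     # skip the rare all-covered draw
            run_errors.append(error(induce(sample), out_of_bag))
    return sum(run_errors) / len(run_errors)               # average over the runs
```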

Out-of-sample testing
[Diagram: Historical Data (warehouse) → sampling method → sample data; a second sampling step splits the sample into training data (2/3) and testing data (1/3); the induction method builds a model from the training data, and error estimation on the testing data yields the error.]

The quality of the test-sample estimate depends on the number of test cases and on the validity of the independence assumption.
13
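A minimal sketch of the 2/3-train, 1/3-test scheme in the diagram. The induce/error functions are hypothetical stand-ins for a real induction method and error measure; the toy "induction method" below simply predicts the majority class:

```python
import random

def out_of_sample_error(data, induce, error, seed=0):
    """Estimate error by training on 2/3 of the data and testing on 1/3."""
    rows = data[:]                        # copy so the caller's list is untouched
    random.Random(seed).shuffle(rows)
    cut = 2 * len(rows) // 3              # 2/3 training data
    train, test = rows[:cut], rows[cut:]  # 1/3 testing data
    model = induce(train)                 # induction method builds the model
    return error(model, test)             # error estimated on unseen test cases

# Toy run: "induce" a majority-class predictor and measure its error rate.
data = [(x, x > 5) for x in range(12)]
induce = lambda rows: max({lab for _, lab in rows},
                          key=[lab for _, lab in rows].count)
error = lambda model, rows: sum(lab != model for _, lab in rows) / len(rows)
print(out_of_sample_error(data, induce, error))
```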

Cross Validation
[Diagram: Historical Data (warehouse) → sampling method → sample data → sampling method splits the data into Sample 1, Sample 2, ..., Sample n; the induction method iterates over the samples, building a model and estimating the error on each run, and the per-run errors are combined into the overall error estimation.]

10-fold cross validation appears adequate (n = 10).


14

Evaluation: k-fold cross validation (k = 3)

Given a data set and a method to be evaluated:
- Randomly split the data set into 3 subsets of equal size (mutually exclusive, equal size).
- For each fold, run the method on the other 2 subsets as training data to find knowledge.
- Test on the remaining subset as testing data to evaluate the accuracy.
- Average the accuracies as the final evaluation.

15
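The same procedure generalized to k folds, reusing the hypothetical induce/error interface from the out-of-sample sketch above:

```python
def k_fold_error(data, induce, error, k=3):
    """k-fold cross validation: train on k-1 folds, test on the held-out fold."""
    # Slice the (pre-shuffled) data into k mutually exclusive, near-equal folds.
    folds = [data[i::k] for i in range(k)]
    run_errors = []
    for i in range(k):
        test = folds[i]
        train = [row for j in range(k) if j != i for row in folds[j]]
        run_errors.append(error(induce(train), test))
    return sum(run_errors) / k  # average the per-fold error estimates

# Reusing the toy data, induce, and error from the out-of-sample sketch:
# print(k_fold_error(data, induce, error, k=3))
```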

Outline of the presentation

- Objectives, Prerequisite and Content
- Introduction to Lectures
- Brief Discussion and Conclusion

This presentation summarizes the content and organization of the lectures in the module Knowledge Discovery and Data Mining.
16
