Tutorial Notes
Andrzej Skowron
Warsaw University
skowron@mimuw.edu.pl
Ning Zhong
Maebashi Institute of Technology
zhong@maebashi-it.ac.jp
Copyright 2000 by A. Skowron & N. Zhong
Contents
Introduction
Basic Concepts of Rough Sets
A Rough Set Based KDD process
Rough Sets in ILP and GrC
Concluding Remarks
(Summary, Advanced Topics, References
and Further Readings).
Introduction
Rough set theory was developed by
Zdzislaw Pawlak in the early 1980s.
Representative Publications:
Z. Pawlak, Rough Sets, International Journal
of Computer and Information Sciences, Vol.11,
341-356 (1982).
Z. Pawlak, Rough Sets: Theoretical Aspects of
Reasoning about Data, Kluwer Academic
Publishers (1991).
Introduction (2)
The main goal of rough set analysis is the
induction of approximations of concepts.
Rough set theory constitutes a sound basis for
KDD. It offers mathematical tools to
discover patterns hidden in data.
It can be used for feature selection, feature
extraction, data reduction, decision rule
generation, and pattern extraction
(templates, association rules), etc.
It identifies partial or total dependencies in data.
Introduction (3)
Recent extensions of rough set theory (rough
mereology) have developed new methods for
decomposition of large data sets, data mining
in distributed and multi-agent systems, and
granular computing.
This presentation shows how several aspects of
the above problems are solved by the (classic)
rough set approach, discusses some advanced
topics, and gives further research directions.
Information Systems/Tables
U    Age     LEMS
x1   16-30   50
x2   16-30   0
x3   31-45   1-25
x4   31-45   1-25
x5   46-60   26-49
x6   16-30   26-49
x7   46-60   26-49

An information system (IS) is a pair (U, A), where:
U is a non-empty finite set of objects;
A is a non-empty finite set of attributes such that a : U → Va for every a ∈ A;
Va is called the value set of a.
Decision Systems/Tables
U    Age     LEMS    Walk
x1   16-30   50      yes
x2   16-30   0       no
x3   31-45   1-25    no
x4   31-45   1-25    yes
x5   46-60   26-49   no
x6   16-30   26-49   yes
x7   46-60   26-49   no

A decision system (DS) is a table T = (U, A ∪ {d}), where:
d ∉ A is the decision attribute (instead of one decision attribute we can consider more);
the elements of A are called the condition attributes.
Indiscernibility
The equivalence relation
A binary relation R ⊆ X × X which is
reflexive (xRx for any object x),
symmetric (if xRy then yRx), and
transitive (if xRy and yRz then xRz).
Indiscernibility (2)
Let IS = (U, A) be an information system. With any B ⊆ A there is an
associated equivalence relation:

IND_IS(B) = {(x, x') ∈ U × U | ∀a ∈ B, a(x) = a(x')}
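As a small illustration, the partition U/IND(B) can be computed by grouping objects on their B-value signatures. The following Python sketch (helper and variable names are illustrative, not from the tutorial) uses the Age/LEMS table from the previous slides.

from collections import defaultdict

def partition(objects, attributes):
    """Group objects into IND(B)-equivalence classes.

    objects: dict mapping object id -> dict of attribute -> value
    attributes: the subset B of attribute names
    """
    classes = defaultdict(set)
    for obj, values in objects.items():
        key = tuple(values[a] for a in attributes)   # B-signature of the object
        classes[key].add(obj)
    return list(classes.values())

# The Walk example table from the slides
table = {
    "x1": {"Age": "16-30", "LEMS": "50"},
    "x2": {"Age": "16-30", "LEMS": "0"},
    "x3": {"Age": "31-45", "LEMS": "1-25"},
    "x4": {"Age": "31-45", "LEMS": "1-25"},
    "x5": {"Age": "46-60", "LEMS": "26-49"},
    "x6": {"Age": "16-30", "LEMS": "26-49"},
    "x7": {"Age": "46-60", "LEMS": "26-49"},
}
print(partition(table, ["Age"]))          # e.g. {x1, x2, x6}, {x3, x4}, {x5, x7}
print(partition(table, ["Age", "LEMS"]))  # e.g. {x1}, {x2}, {x3, x4}, {x5, x7}, {x6}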
An Example of Indiscernibility
U    Age     LEMS    Walk
x1   16-30   50      yes
x2   16-30   0       no
x3   31-45   1-25    no
x4   31-45   1-25    yes
x5   46-60   26-49   no
x6   16-30   26-49   yes
x7   46-60   26-49   no

IND({Age}) = {{x1, x2, x6}, {x3, x4}, {x5, x7}}
IND({LEMS}) = {{x1}, {x2}, {x3, x4}, {x5, x6, x7}}
IND({Age, LEMS}) = {{x1}, {x2}, {x3, x4}, {x5, x7}, {x6}}
Observations
An equivalence relation induces a partitioning
of the universe.
The partitions can be used to build new subsets
of the universe.
Subsets that are most often of interest have the
same value of the decision attribute.
It may happen, however, that a concept such as
Walk cannot be defined in a crisp manner.
Set Approximation
Let T = (U, A), B ⊆ A and X ⊆ U.
We can approximate X using only the
information contained in B by constructing
the B-lower and B-upper approximations of
X, denoted B̲X and B̄X respectively, where

B̲X = {x | [x]_B ⊆ X},
B̄X = {x | [x]_B ∩ X ≠ ∅}.
An Example of Set Approximation

U    Age     LEMS    Walk
x1   16-30   50      yes
x2   16-30   0       no
x3   31-45   1-25    no
x4   31-45   1-25    yes
x5   46-60   26-49   no
x6   16-30   26-49   yes
x7   46-60   26-49   no

Let W = {x | Walk(x) = yes}. Then

A̲W = {x1, x6},
ĀW = {x1, x3, x4, x6},
BN_A(W) = {x3, x4},
U - ĀW = {x2, x5, x7}.
The decision class, Walk, is
rough since the boundary
region is not empty.
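A minimal Python sketch of the B-lower and B-upper approximations; the classes list below is the partition U/IND({Age, LEMS}) of the Walk example, and the function names are illustrative.

def lower_approx(classes, X):
    """Union of the IND(B)-classes fully contained in X."""
    return {x for c in classes for x in c if c <= X}

def upper_approx(classes, X):
    """Union of the IND(B)-classes that intersect X."""
    return {x for c in classes for x in c if c & X}

# Walk example: A = {Age, LEMS}, W = {x | Walk(x) = yes}
classes = [{"x1"}, {"x2"}, {"x3", "x4"}, {"x5", "x7"}, {"x6"}]
W = {"x1", "x4", "x6"}
print(lower_approx(classes, W))                            # {'x1', 'x6'}
print(upper_approx(classes, W))                            # {'x1', 'x3', 'x4', 'x6'}
print(upper_approx(classes, W) - lower_approx(classes, W)) # boundary {'x3', 'x4'}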
An Example of Set Approximation (2)

[Figure: the partition of U for the Walk example; the lower approximation (yes) contains the classes {x1} and {x6}, the boundary region (yes/no) contains {x3, x4}, and the outside region (no) contains {x2} and {x5, x7}.]

For a subset R of attributes and the partition U/R:

Upper approximation:  R̄X = ∪{Y ∈ U/R : Y ∩ X ≠ ∅}
Lower approximation:  R̲X = ∪{Y ∈ U/R : Y ⊆ X}
A second example, with condition attributes Headache and Temp. and decision Flu:

U    Headache   Temp.       Flu
u1   Yes        Normal      No
u2   Yes        High        Yes
u3   Yes        Very-high   Yes
u4   No         Normal      No
u5   No         High        No
u6   No         Very-high   Yes
u7   No         High        Yes
u8   No         Very-high   No

For R = {Headache, Temp.}:

X1 = {u | Flu(u) = yes} = {u2, u3, u6, u7}
R̲X1 = {u2, u3}
R̄X1 = {u2, u3, u6, u7, u8, u5}

X2 = {u | Flu(u) = no} = {u1, u4, u5, u8}
R̲X2 = {u1, u4}
R̄X2 = {u1, u4, u5, u8, u7, u6}

[Figure: the universe with the lower approximations of X1 and X2 and the boundary region {u5, u6, u7, u8}.]
Properties of Approximations
B̲(X) ⊆ X ⊆ B̄(X)
B̲(∅) = B̄(∅) = ∅,   B̲(U) = B̄(U) = U
B̄(X ∪ Y) = B̄(X) ∪ B̄(Y)
B̲(X ∩ Y) = B̲(X) ∩ B̲(Y)
X ⊆ Y implies B̲(X) ⊆ B̲(Y) and B̄(X) ⊆ B̄(Y)
Accuracy of Approximation
α_B(X) = |B̲(X)| / |B̄(X)|
Reducts
Keep only those attributes that preserve the
indiscernibility relation and, consequently,
set approximation.
There are usually several such subsets of
attributes and those which are minimal are
called reducts.
POS_C(D) = ∪{ C̲X : X ∈ U/D }   (the C-positive region of D)
Independent
T = (U, C, D) is independent if all c ∈ C are indispensable in T.
CORE(C) = ∩ RED(C),
where RED(C) is the set of all reducts of C.
Example:

U    Headache   Muscle pain   Temp.       Flu
U1   Yes        Yes           Normal      No
U2   Yes        Yes           High        Yes
U3   Yes        Yes           Very-high   Yes
U4   No         Yes           Normal      No
U5   No         No            High        No
U6   No         Yes           Very-high   Yes

CORE = {Headache, Temp.} ∩ {Muscle pain, Temp.} = {Temp.}
Reduct1 = {Muscle pain, Temp.}:

U        Muscle pain   Temp.       Flu
U1,U4    Yes           Normal      No
U2       Yes           High        Yes
U3,U6    Yes           Very-high   Yes
U5       No            High        No

Reduct2 = {Headache, Temp.}:

U    Headache   Temp.       Flu
U1   Yes        Normal      No
U2   Yes        High        Yes
U3   Yes        Very-high   Yes
U4   No         Normal      No
U5   No         High        No
U6   No         Very-high   Yes
Discernibility Matrix
(relative to positive region)
Let T = (U, C, D) be a decision table, with
U = {u1 , u2 ,..., u n }.
By a discernibility matrix of T, denoted M(T),
we will mean the n × n matrix defined as:

m_ij = {c ∈ C : c(u_i) ≠ c(u_j)}   if ∃d ∈ D such that d(u_i) ≠ d(u_j)
m_ij = ∅                           if ∀d ∈ D, d(u_i) = d(u_j)
Discernibility Matrix
(relative to positive region) (2)
The discernibility function relative to the positive region is defined in the
same way, but the conjunction is taken over all non-empty entries of M(T)
corresponding to the indices i, j such that
u_i or u_j belongs to the C-positive region of D.
m_ij = ∅ denotes that this pair does not need to be considered; hence it is
interpreted as logical truth.
All disjuncts of the minimal disjunctive form of this
function define the reducts of T (relative to the positive region).
Discernibility Function
(relative to objects)
For any u_i ∈ U:

fT(u_i) = ∧{ ∨ m_ij : 1 ≤ j ≤ n, m_ij ≠ ∅ },

where ∨ m_ij is the disjunction of the Boolean variables corresponding to the
attributes in m_ij; the prime implicants of fT(u_i) determine the reducts
relative to the object u_i.
An example: a decision table with condition attributes C = {a, b, c} and
decision D = {d} over four objects (the values of attribute a are not shown):

U    b    c    d
x1   b1   c1   y
x2   b1   c0   n
x3   b2   c1   n
x4   b1   c1   y

Its discernibility function simplifies to
fT = (a ∨ c) ∧ b ∧ c ∧ (a ∨ b) = b ∧ c,
so Reduct = {b, c}.

A second example: a decision table with condition attributes a, b, c, d and
decision e:

U    a    b    c    d    e
u1   1    0    2    1    1
u2   1    0    2    0    1
u3   1    2    0    0    2
u4   1    2    2    1    0
u5   2    1    0    0    2
u6   2    1    1    0    2
u7   2    1    2    1    1

Core = {b},   Reduct1 = {b, c},   Reduct2 = {b, d}

Its discernibility matrix (non-empty entries only):

      u1         u2       u3         u4     u5     u6
u3    b,c,d      b,c
u4    b          b,d      c,d
u5    a,b,c,d    a,b,c               a,b,c,d
u6    a,b,c,d    a,b,c               a,b,c,d
u7                        a,b,c,d    a,b    c,d    c,d
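The reduct computation sketched above can be prototyped directly from the discernibility matrix; the following Python sketch (brute force, suitable only for small tables, names are illustrative) reproduces the core and the two reducts of the second example.

from itertools import combinations

def discernibility_matrix(table, cond, dec):
    """Non-empty entries m[i,j] = attributes distinguishing u_i and u_j
    for object pairs with different decision values."""
    m = {}
    for (i, ri), (j, rj) in combinations(table.items(), 2):
        if ri[dec] != rj[dec]:
            entry = frozenset(a for a in cond if ri[a] != rj[a])
            if entry:
                m[(i, j)] = entry
    return m

def core_and_reducts(table, cond, dec):
    """Core = union of singleton entries; reducts = minimal hitting sets
    of the discernibility matrix (brute force)."""
    entries = list(discernibility_matrix(table, cond, dec).values())
    core = {a for e in entries for a in e if len(e) == 1}
    reducts = []
    for k in range(1, len(cond) + 1):
        for subset in combinations(cond, k):
            s = set(subset)
            if all(s & e for e in entries) and not any(r <= s for r in reducts):
                reducts.append(s)
    return core, reducts

# The 7-object example table from the slides
data = [(1,0,2,1,1), (1,0,2,0,1), (1,2,0,0,2), (1,2,2,1,0),
        (2,1,0,0,2), (2,1,1,0,2), (2,1,2,1,1)]
table = {f"u{i+1}": dict(zip("abcde", v)) for i, v in enumerate(data)}
print(core_and_reducts(table, ["a", "b", "c", "d"], "e"))
# expected: core {'b'}, reducts [{'b', 'c'}, {'b', 'd'}]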
Rough Membership
The rough membership function quantifies
the degree of relative overlap between the
set X and the equivalence class [x]_B to
which x belongs:

μ_X^B : U → [0, 1],    μ_X^B(x) = |[x]_B ∩ X| / |[x]_B|

The approximations can be expressed in terms of the rough membership function:

B̲X = {x | μ_X^B(x) = 1},
B̄X = {x | μ_X^B(x) > 0}.
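A one-function Python sketch of the rough membership value, using the partition of the Walk example (names are illustrative):

def rough_membership(x, classes, X):
    """mu_X^B(x) = |[x]_B ∩ X| / |[x]_B| for the class of x in U/IND(B)."""
    eq_class = next(c for c in classes if x in c)
    return len(eq_class & X) / len(eq_class)

# Walk example, B = {Age, LEMS}: x3 lies in the boundary class {x3, x4}
classes = [{"x1"}, {"x2"}, {"x3", "x4"}, {"x5", "x7"}, {"x6"}]
W = {"x1", "x4", "x6"}
print(rough_membership("x3", classes, W))   # 0.5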
Dependency of Attributes
Discovering dependencies between attributes
is an important issue in KDD.
A set of attributes D depends totally on a set
of attributes C, denoted C ⇒ D, if all values
of attributes from D are uniquely determined
by values of attributes from C.
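The degree of such a dependency is usually measured as γ(C, D) = |POS_C(D)| / |U|, the fraction of objects that can be classified into the decision classes using only C. A short Python sketch (names are illustrative) over the Walk example:

def positive_region(classes_C, classes_D):
    """Objects whose C-class is contained in some D-class."""
    return {x for c in classes_C for x in c
            if any(c <= d for d in classes_D)}

def dependency_degree(classes_C, classes_D, universe):
    return len(positive_region(classes_C, classes_D)) / len(universe)

U = {"x1", "x2", "x3", "x4", "x5", "x6", "x7"}
classes_C = [{"x1"}, {"x2"}, {"x3", "x4"}, {"x5", "x7"}, {"x6"}]   # U/IND({Age, LEMS})
classes_D = [{"x1", "x4", "x6"}, {"x2", "x3", "x5", "x7"}]         # U/{Walk}
print(dependency_degree(classes_C, classes_D, U))                  # 5/7 ≈ 0.71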
[Figure: possible methods for real-world issues; the diagram relates Logic, Set theory and Probability, the reasoning modes Deduction, Induction and Abduction, and techniques such as Rough Sets, Fuzzy Sets, Stochastic Processes, Belief Networks, Connectionist Networks and GDT, with the Dblearn system.]
A Hybrid Model
[Figure: a hybrid model combining GDT, RS (rough sets), RS&ILP, GrC and TM, linked through deduction, induction and abduction.]
Observations
A real-world data set always contains mixed
types of data, such as continuous-valued and
symbolic attributes.
Before attributes with real values can be analyzed,
they must undergo a process called
discretization, which divides the attribute
values into intervals.
There is no unified approach to discretization
so far, and the choice of method depends
heavily on the data considered.
Example: a table with real-valued attributes a, b (left) and its discretization
by the set of cuts P (right).

U    a     b     d        U    a^P   b^P   d
x1   0.8   2     1        x1   0     2     1
x2   1     0.5   0        x2   1     0     0
x3   1.3   3     0        x3   1     2     0
x4   1.4   1     1        x4   1     1     1
x5   1.4   2     0        x5   1     2     0
x6   1.6   3     1        x6   2     2     1
x7   1.3   1     1        x7   1     1     1

P = {(a, 0.9), (a, 1.5), (b, 0.75), (b, 1.5)}
A Geometrical Representation of Data
[Figure: the objects x1-x7 plotted in the (a, b) plane; a takes the values 0.8, 1, 1.3, 1.4, 1.6 and b the values 0.5, 1, 2, 3.]
A Geometrical Representation of Data and Cuts
[Figure: the same plot with the cuts of P drawn as lines orthogonal to the a and b axes; Vb = [0, 4).]
A Discretization Process

Step 1: define a set of Boolean variables
BV(U) = {p1^a, p2^a, p3^a, p4^a, p1^b, p2^b, p3^b},
where p_k^a corresponds to the interval [v_k^a, v_{k+1}^a) between two
consecutive values of attribute a, for k ∈ {1, ..., n_a - 1} and a ∈ A.
For the example, the candidate cuts on a lie in the intervals bounded by
0.8, 1.0, 1.3, 1.4 and 1.6, and those on b in the intervals bounded by
0.5, 1, 2 and 3.

Step 2: build a Boolean table whose rows correspond to the pairs of objects
that have to be discerned,
{(i, j) : d(x_i) ≠ d(x_j)},
and in which the entry for a variable p_k^a is 1 iff the corresponding cut
separates a(x_i) from a(x_j).

Step 3: the conjunction, over all such pairs, of the disjunctions of the
variables set to 1 is transformed into its minimal form; for the example it
reduces to a product of sums such as
(p2^a ∨ p4^a ∨ p2^b)(p2^a ∨ p3^a ∨ p2^b ∨ p3^b)...,
and a prime implicant of this function (here p2^a ∧ p4^a ∧ p2^b) yields a
minimal set of cuts.

[Figure: the objects in the (a, b) plane with the candidate cuts marked.]
A Result

U    a     b     d        U    a^P   b^P   d
x1   0.8   2     1        x1   0     1     1
x2   1     0.5   0        x2   0     0     0
x3   1.3   3     0        x3   1     1     0
x4   1.4   1     1        x4   1     0     1
x5   1.4   2     0        x5   1     1     0
x6   1.6   3     1        x6   2     1     1
x7   1.3   1     1        x7   1     0     1

P = {(a, 1.2), (a, 1.5), (b, 1.5)}
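Applying a set of cuts such as P is straightforward; a minimal Python sketch (names are illustrative, cut values taken from the result above; the Boolean-reasoning search for an optimal cut set is not shown):

import bisect

def discretize(value, cuts):
    """Return the index of the interval the value falls into, given sorted cuts."""
    return bisect.bisect_right(cuts, value)

cuts = {"a": [1.2, 1.5], "b": [1.5]}
row = {"a": 1.6, "b": 3.0}                                  # object x6 from the example
print({k: discretize(v, cuts[k]) for k, v in row.items()})  # {'a': 2, 'b': 1}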
Observations
A database always contains many attributes
that are redundant and not necessary for rule
discovery.
If these redundant attributes are not removed,
not only does the time complexity of rule discovery
increase, but the quality of the discovered
rules may also degrade significantly.
Attribute Selection

U    Headache   Muscle-pain   Temp.       Flu
U1   Yes        Yes           Normal      No
U2   Yes        Yes           High        Yes
U3   Yes        Yes           Very-high   Yes
U4   No         Yes           Normal      No
U5   No         No            High        No
U6   No         Yes           Very-high   Yes

Selecting {Muscle-pain, Temp.}:

U    Muscle-pain   Temp.       Flu
U1   Yes           Normal      No
U2   Yes           High        Yes
U3   Yes           Very-high   Yes
U4   Yes           Normal      No
U5   No            High        No
U6   Yes           Very-high   Yes

Selecting {Headache, Temp.}:

U    Headache   Temp.       Flu
U1   Yes        Normal      No
U2   Yes        High        Yes
U3   Yes        Very-high   Yes
U4   No         Normal      No
U5   No         High        No
U6   No         Very-high   Yes
Filter approach:
Advantage: fast.
Disadvantage: it ignores the performance effects of the induction algorithm.
Wrapper approach:
Advantage: it takes the performance of the induction algorithm into account.
Disadvantage: the time complexity is high.
Basic Ideas:
Attribute Selection using RSH
Take the attributes in CORE as the initial
subset.
Select one attribute each time using the rule
evaluation criterion in our rule discovery
system, GDT-RS.
Stop when the subset of selected attributes
is a reduct.
Why Heuristics ?
The number of possible reducts can be 2^(N-1),
where N is the number of attributes.
Selecting the optimal reduct among all
possible reducts is computationally expensive,
so heuristics must be used.
An Example of Attribute Selection

U    a    b    c    d    e
u1   1    0    2    1    1
u2   1    0    2    0    1
u3   1    2    0    0    2
u4   1    2    2    1    0
u5   2    1    0    0    2
u6   2    1    1    0    2
u7   2    1    2    1    1

Condition attributes:
a: Va = {1, 2}
b: Vb = {0, 1, 2}
c: Vc = {0, 1, 2}
d: Vd = {0, 1}
Decision attribute:
e: Ve = {0, 1, 2}
Searching for CORE
Removing attribute a:

U    b    c    d    e
u1   0    2    1    1
u2   0    2    0    1
u3   2    0    0    2
u4   2    2    1    0
u5   1    0    0    2
u6   1    1    0    2
u7   1    2    1    1

Removing attribute a does not cause inconsistency, so a does not belong to the CORE.
Searching for CORE (2)
Removing attribute b:

U    a    c    d    e
u1   1    2    1    1
u2   1    2    0    1
u3   1    0    0    2
u4   1    2    1    0
u5   2    0    0    2
u6   2    1    0    2
u7   2    2    1    1

Removing attribute b causes inconsistency:
u1: a1, c2, d1 → e1
u4: a1, c2, d1 → e0
Hence, b belongs to the CORE.
Searching for CORE (3)
Removing attribute c:

U    a    b    d    e
u1   1    0    1    1
u2   1    0    0    1
u3   1    2    0    2
u4   1    2    1    0
u5   2    1    0    2
u6   2    1    0    2
u7   2    1    1    1

Removing attribute c does not cause inconsistency. Hence, c is not used as CORE.
Searching for CORE (4)
Removing attribute d:

U    a    b    c    e
u1   1    0    2    1
u2   1    0    2    1
u3   1    2    0    2
u4   1    2    2    0
u5   2    1    0    2
u6   2    1    1    2
u7   2    1    2    1

Removing attribute d does not cause inconsistency. Hence, d is not used as CORE.
CORE(C) = {b}
Initial subset R = {b}

U    a    b    c    d    e
u1   1    0    2    1    1
u2   1    0    2    0    1
u3   1    2    0    0    2
u4   1    2    2    1    0
u5   2    1    0    0    2
u6   2    1    1    0    2
u7   2    1    2    1    1

Since b0 → e1 holds consistently (u1, u2), the instances containing b0 will not be considered further.
1. Selecting {a}: R = {a, b}

U    a    b    e
u3   1    2    2
u4   1    2    0
u5   2    1    2
u6   2    1    2
u7   2    1    1

a1 b2 → e2 (u3) conflicts with a1 b2 → e0 (u4);
a2 b1 → e2 (u5, u6) conflicts with a2 b1 → e1 (u7).

U/{e} = {{u3, u5, u6}, {u4}, {u7}}
∪_{X ∈ U/{e}} POS_{a,b}(X) = ∅
2. Selecting {c}: R = {b, c}

U    b    c    e
u3   2    0    2
u4   2    2    0
u5   1    0    2
u6   1    1    2
u7   1    2    1

b2 c0 → e2,  b2 c2 → e0,  b1 c0 → e2,  b1 c1 → e2,  b1 c2 → e1   (all consistent)

U/{e} = {{u3, u5, u6}, {u4}, {u7}}
∪_{X ∈ U/{e}} POS_{b,c}(X) = {u3, u4, u5, u6, u7}
3. Selecting {d}: R = {b, d}

U    b    d    e
u3   2    0    2
u4   2    1    0
u5   1    0    2
u6   1    0    2
u7   1    1    1

b2 d0 → e2,  b2 d1 → e0,  b1 d0 → e2,  b1 d1 → e1   (all consistent)

U/{e} = {{u3, u5, u6}, {u4}, {u7}}
∪_{X ∈ U/{e}} POS_{b,d}(X) = {u3, u4, u5, u6, u7}

Hence both {b, c} and {b, d} make the remaining instances consistent.
A Heuristic Algorithm
for Attribute Selection
Let R be the set of selected attributes, P the
set of unselected condition attributes, U the
set of all instances, X the set of contradictory
instances, and EXPECT the threshold of
accuracy.
In the initial state,
R = CORE(C),  P = C - CORE(C),  X = U - POS_R(D),  k = 0.
A Heuristic Algorithm
for Attribute Selection (2)
Step 1. If k >= EXPECT, finish; otherwise
calculate the dependency degree
k = |POS_R(D)| / |U|.
A Heuristic Algorithm
for Attribute Selection (3)
Step 3. Choose the best attribute p, i.e. the one with the
largest v_p · m_p, and let
R = R ∪ {p},
P = P - {p}.
Step 4. Remove all consistent instances u ∈ POS_R(D) from X.
Step 5. Go back to Step 1.
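A rough Python sketch of this selection loop; since the v_p · m_p criterion of GDT-RS is not spelled out here, the sketch greedily picks the attribute that maximizes the dependency degree instead, so it is only an approximation of the algorithm (all names are illustrative).

from collections import defaultdict

def partition(objects, attributes):
    """U/IND(attributes): group objects by their attribute-value signature."""
    groups = defaultdict(set)
    for obj, vals in objects.items():
        groups[tuple(vals[a] for a in attributes)].add(obj)
    return list(groups.values())

def select_attributes(table, cond, dec, core, expect=1.0):
    """Greedy attribute selection starting from the core (illustrative criterion)."""
    def gamma(attrs):
        classes_c = partition(table, attrs)
        classes_d = partition(table, [dec])
        pos = {x for c in classes_c for x in c if any(c <= d for d in classes_d)}
        return len(pos) / len(table)

    R, P = list(core), [a for a in cond if a not in core]
    while gamma(R) < expect and P:
        best = max(P, key=lambda p: gamma(R + [p]))  # stand-in for the v_p * m_p criterion
        R.append(best)
        P.remove(best)
    return R

table = {f"u{i+1}": dict(zip("abcde", v)) for i, v in enumerate(
    [(1,0,2,1,1), (1,0,2,0,1), (1,2,0,0,2), (1,2,2,1,0),
     (2,1,0,0,2), (2,1,1,0,2), (2,1,2,1,1)])}
print(select_attributes(table, ["a", "b", "c", "d"], "e", core=["b"]))
# one possible result: ['b', 'c']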
Experimental Results
[Table: the data sets tested (Monk1, Monk3, Mushroom, Breast cancer, Earthquake, Meningitis, Bacterial examination, Slope-collapse, Gastric cancer) together with, for each data set, the number of instances, the number of condition attributes and the number of attributes selected.]
A Sample DB

U    a    b    c    d
u1   a0   b0   c1   y
u2   a0   b1   c1   y
u3   a0   b0   c1   y
u4   a1   b1   c0   n
u5   a0   b0   c1   n
u6   a0   b2   c1   n
u7   a1   b1   c1   y

Condition attributes: a, b, c with Va = {a0, a1}, Vb = {b0, b1, b2}, Vc = {c0, c1}
Decision attribute: d with Vd = {y, n}
A Sample GDT

[Table: a fragment of the generalization distribution table; the columns F(x) are the possible instances (a0b0c0, a0b0c1, ...), the rows G(x) are the possible generalizations (*b0c0, *b0c1, *b1c0, *b1c1, *b2c0, *b2c1, a0*c0, ..., a1b1*, a1b2*, **c0, ..., a0**, a1**), and each entry is p(PI | PG): 1/2 for generalizations with one position generalized to '*' over a two-valued attribute (e.g. *b0c0 or a1b1*), 1/3 for generalizations of the form ai*ck, and 1/6 for generalizations with two positions generalized (e.g. **c0 or a0**).]
Probabilistic Relationship
Between PIs and PGs
p(PI_j | PG_i) = 1 / N_PGi   if PI_j ∈ PG_i
p(PI_j | PG_i) = 0           otherwise

where N_PGi = Π_{k ∈ {l | PG_i[l] = *}} n_k is the number of possible instances
satisfying PG_i, and n_k is the number of values of attribute k.

Example: a0*c0 covers the possible instances a0b0c0, a0b1c0 and a0b2c0, each
with p = 1/3, since N_{a0*c0} = n_b = 3.
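A small Python sketch of p(PI | PG) for generalizations written with '*' wildcards, using the attribute value counts of the sample DB (names are illustrative):

from math import prod

n_values = {"a": 2, "b": 3, "c": 2}           # |Va|, |Vb|, |Vc|
attrs = ["a", "b", "c"]

def p_pi_given_pg(pi, pg):
    """pi, pg: dicts attr -> value, with '*' as the wildcard in pg."""
    if any(pg[k] != "*" and pg[k] != pi[k] for k in attrs):
        return 0.0                            # PI does not satisfy PG
    n_pg = prod(n_values[k] for k in attrs if pg[k] == "*")
    return 1.0 / n_pg

pg = {"a": "a0", "b": "*", "c": "c0"}         # the generalization a0*c0
pi = {"a": "a0", "b": "b1", "c": "c0"}        # the possible instance a0b1c0
print(p_pi_given_pg(pi, pg))                  # 1/3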
Unseen Instances

U    Headache   Muscle-pain   Temp.       Flu
U1   Yes        Yes           Normal      No
U2   Yes        Yes           High        Yes
U3   Yes        Yes           Very-high   Yes
U4   No         Yes           Normal      No
U5   No         No            High        No
U6   No         Yes           Very-high   Yes

Possible unseen instances (Headache, Muscle-pain, Temp.):
(yes, no, normal), (yes, no, high), (yes, no, very-high),
(no, yes, high), (no, no, normal), (no, no, very-high)

Closed world: only the observed instances are considered.
Open world: the possible unseen instances must be considered as well.
Rule Representation
X → Y with S

X denotes the conjunction of the conditions that a concept must satisfy;
Y denotes a concept that the rule describes;
S is a measure of the strength with which the rule holds.

s(X) = s(PG_k) = Σ_l p(PI_l | PG_k) = N_insrel(PG_k) / N_PGk
(when background knowledge is used, p(PI_l | PG_k) is replaced by BKF(PI_l | PG_k) / N_PGk)

r(X → Y) = ( N_insrel(X) - N_insclass(X, Y) ) / N_insrel(X)

where N_insrel(PG_k) is the number of observed instances satisfying the
generalization PG_k, N_insclass(X, Y) is the number of those belonging to
class Y, and the strength of the rule is S(X → Y) = s(X) × (1 - r(X → Y)).
Consider again the sample DB:

U    a    b    c    d
u1   a0   b0   c1   y
u2   a0   b1   c1   y
u3   a0   b0   c1   y
u4   a1   b1   c0   n
u5   a0   b0   c1   n
u6   a0   b2   c1   n
u7   a1   b1   c1   y

Condition attributes: a (Va = {a0, a1}), b (Vb = {b0, b1, b2}), c (Vc = {c0, c1})
Class: d (Vd = {y, n})

u1, u3 and u5 have the same condition values a0 b0 c1 and are merged into the
compound instance u1':

U     a    b    c    d
u1'   a0   b0   c1   y, y, n
u2    a0   b1   c1   y
u4    a1   b1   c0   n
u6    a0   b2   c1   n
u7    a1   b1   c1   y

r_{y}(u1') = 1 - 2/3 = 0.33
r_{n}(u1') = 1 - 1/3 = 0.67
Let Tnoise = 0. Since r_{y}(u1') > Tnoise and r_{n}(u1') > Tnoise, no class can
be assigned to u1'; the compound instance is treated as contradictory and is
not used for rule generation.
Creating a discernibility vector for u2 (a0, b1, c1, y):

      u1'   u2   u4    u6   u7
u2    b          a,c   b

m_{2,1} = {b},  m_{2,2} = ∅,  m_{2,4} = {a, c},  m_{2,6} = {b},  m_{2,7} = ∅

fT(u2) = (b) ∧ (a ∨ c) = (a ∧ b) ∨ (b ∧ c)
From the disjunct a ∧ b: the generalization {a0b1} (a0 b1 *) covers the
possible instances a0b1c0 and a0b1c1 (u2), so
s({a0b1}) = 0.5,  r({a0b1} → y) = 0.

From the disjunct b ∧ c: the generalization {b1c1} (* b1 c1) covers
a0b1c1 (u2) and a1b1c1 (u7), so
s({b1c1}) = 1,  r({b1c1} → y) = 0.
The rules generated for u2:

{a0b1} → y with S = (1 × 1/2) × (1 - 0) = 0.5
{b1c1} → y with S = (2 × 1/2) × (1 - 0) = 1
Creating a discernibility vector for u4 (a1, b1, c0, n):

      u1'      u2     u4   u6   u7
u4    a,b,c    a,c              c

m_{4,1} = {a, b, c},  m_{4,2} = {a, c},  m_{4,4} = ∅,  m_{4,6} = ∅,  m_{4,7} = {c}

fT(u4) = (a ∨ b ∨ c) ∧ (a ∨ c) ∧ (c) = (c)
From fT(u4) = (c): the generalization {c0} (* * c0) covers the six possible
instances a0b0c0, a0b1c0, a0b2c0, a1b0c0, a1b1c0 (u4) and a1b2c0, of which
only u4 has been observed, so
s({c0}) = 1/6,  r({c0} → n) = 0.

The rule generated for u4:

{c0} → n with S = (1 × 1/6) × (1 - 0) = 0.167
The candidate rules obtained for each instance:

u2:  {a0b1} → y, S = 0.5    and    {b1c1} → y, S = 1
u4:  {c0} → n, S = 0.167
u6:  {b2} → n, S = 0.25
u7:  {a1c1} → y, S = 0.5    and    {b1c1} → y, S = 1
Generalization Belonging to Class y

          a0b1c1 (u2)   a1b1c1 (u7)
*b1c1     1/2           1/2            {b1c1} → y with S = 1     (covers u2, u7)
a1*c1                   1/3            {a1c1} → y with S = 1/2   (covers u7)
a0b1*     1/2                          {a0b1} → y with S = 1/2   (covers u2)
Generalization Belonging to Class n

         a1b1c0 (u4)   a0b2c1 (u6)
**c0     1/6                           {c0} → n with S = 1/6   (covers u4)
*b2*                   1/4             {b2} → n with S = 1/4   (covers u6)
Regarding Instances (Noise Rate > 0)

U     a    b    c    d
u1'   a0   b0   c1   y, y, n
u2    a0   b1   c1   y
u4    a1   b1   c0   n
u6    a0   b2   c1   n
u7    a1   b1   c1   y

r_{y}(u1') = 1 - 2/3 = 0.33
r_{n}(u1') = 1 - 1/3 = 0.67

If the noise threshold is set above 0.33, the class with the smaller noise
rate is chosen for the compound instance, so d(u1') = y:

U     a    b    c    d
u1'   a0   b0   c1   y
u2    a0   b1   c1   y
u4    a1   b1   c0   n
u6    a0   b2   c1   n
u7    a1   b1   c1   y
The candidate rules are then:

u2:  {a0b1} → y, S = 0.5    and    {b1c1} → y, S = 1
u4:  {c0} → n, S = 0.167
u6:  {b2} → n, S = 0.25
u7:  {a1c1} → y, S = 0.5    and    {b1c1} → y, S = 1
Example of Using BK

Background knowledge: a0 ⇒ c1 (100%).

[Table: the rows of the GDT for the generalizations a0b0*, a0b1*, a0*c1 and a0**, restricted to the possible instances a0b0c0 ... a0b2c1. Without BK, a0b0* assigns 1/2 to a0b0c0 and 1/2 to a0b0c1, and a0** assigns 1/6 to every instance. With the BK "a0 ⇒ c1, 100%", the instances containing c0 receive probability 0; e.g. a0b0* assigns 0 to a0b0c0 and 1 to a0b0c1, and a0** assigns 1/3 to each instance containing c1.]
Changing Strength of Generalization by BK

U     a    b    c    d
u1'   a0   b0   c1   y
u2    a0   b1   c1   y
u4    a1   b1   c0   n
u6    a0   b2   c1   n
u7    a1   b1   c1   y

fT(u2) = (a ∧ b) ∨ (b ∧ c)

Without BK, the generalization {a0b1} distributes 1/2 to a0b1c0 and 1/2 to
a0b1c1 (u2), so
s({a0b1}) = 0.5,  r({a0b1} → y) = 0.

With the BK "a0 ⇒ c1, 100%", {a0b1} assigns 0% to a0b1c0 and 100% to
a0b1c1 (u2), so
s({a0b1}) = 1,  r({a0b1} → y) = 0.
Algorithm 1
Optimal Set of Rules
Step 1. Consider the instances with the same
condition attribute values as one instance,
called a compound instance.
Step 2. Calculate the noise rate r for each
compound instance.
Step 3. Select one instance u from U and
create a discernibility vector for u.
Step 4. Calculate all reducts for the instance u
by using the discernibility function.
Algorithm 1
Optimal Set of Rules (2)
Step 5. Acquire the rules from the reducts for
the instance u, and revise the strength of
generalization of each rule.
Step 6. Select better rules from the rules (for u)
acquired in Step 5, by using the heuristics for
rule selection.
Step 7. U = U - {u}. If U ≠ ∅, then go back to
Step 3. Otherwise go to Step 8.
Algorithm 1
Optimal Set of Rules (3)
Step 8. Finish if the number of rules
selected in Step 6 for each instance is 1.
Otherwise find a minimal set of rules
which covers all of the instances in the
decision table.
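Step 8 is essentially a set-cover problem; a greedy Python sketch (the rule representation is illustrative, not the exact GDT-RS procedure) that repeatedly picks the rule covering the most still-uncovered instances:

def minimal_rule_set(rules, instances):
    """Greedy cover: rules is a dict rule_id -> set of covered instance ids."""
    uncovered, chosen = set(instances), []
    while uncovered:
        best = max(rules, key=lambda r: len(rules[r] & uncovered))
        if not rules[best] & uncovered:
            break                      # remaining instances cannot be covered
        chosen.append(best)
        uncovered -= rules[best]
    return chosen

rules = {"{a0b1}->y": {"u2"}, "{b1c1}->y": {"u2", "u7"},
         "{c0}->n": {"u4"}, "{b2}->n": {"u6"}, "{a1c1}->y": {"u7"}}
print(minimal_rule_set(rules, ["u2", "u4", "u6", "u7"]))
# e.g. ['{b1c1}->y', '{c0}->n', '{b2}->n']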
Algorithm 2
Sub-Optimal Solution
Step 1: Set R = {}, COVERED = {}, and
SS = {all instance IDs}.
For each class D_c, divide the decision table
T into two parts: the current class T+ and the
other classes T-.
Step 2: From the attribute values v_ij of the
instances I_k (where v_ij means the jth value
of attribute i, I_k ∈ T+, I_k ∈ SS),
Algorithm 2
Sub-Optimal Solution (2)
choose a value v with the maximal number of
occurrences within the instances contained in
T+ and the minimal number of occurrences
within the instances contained in T-.
Step 3: Insert v into R.
Step 4: Delete the instance ID from SS if the
instance does not contain v.
Algorithm 2
Sub-Optimal Solution (3)
Step 5: Go back to Step 2 until the noise rate
is less than the threshold value.
Step 6: Find a minimal subset R' of R
according to their strengths. Insert (R' → D_c)
into RS. Set R = {}, copy the instance IDs
in SS to COVERED, and
set SS = {all instance IDs} - COVERED.
Algorithm 2
Sub-Optimal Solution (4)
Step 8: Go back to Step 2 until all instances
of T+ are in COVERED.
Step 9: Go back to Step 1 until all classes are
handled.
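A compact Python sketch of the greedy attribute-value selection in Steps 2-4; the noise-rate test and the strength-based minimization of Step 6 are omitted, so this is illustrative only (all names are assumptions):

def greedy_rule_for_class(t_plus, t_minus, attrs, max_len=3):
    """Pick attribute values frequent in T+ and rare in T-, one at a time."""
    chosen, covered_plus, covered_minus = [], list(t_plus), list(t_minus)
    while covered_minus and covered_plus and len(chosen) < max_len:
        candidates = {(a, r[a]) for r in covered_plus for a in attrs}
        best = max(candidates, key=lambda av:
                   sum(r[av[0]] == av[1] for r in covered_plus) -
                   sum(r[av[0]] == av[1] for r in covered_minus))
        chosen.append(best)
        covered_plus = [r for r in covered_plus if r[best[0]] == best[1]]
        covered_minus = [r for r in covered_minus if r[best[0]] == best[1]]
    return chosen, covered_plus

# class y (T+) versus class n (T-) in the sample DB
t_plus = [{"a": "a0", "b": "b1", "c": "c1"}, {"a": "a1", "b": "b1", "c": "c1"}]
t_minus = [{"a": "a1", "b": "b1", "c": "c0"}, {"a": "a0", "b": "b2", "c": "c1"}]
print(greedy_rule_for_class(t_plus, t_minus, ["a", "b", "c"]))
# e.g. ([('b', 'b1'), ('c', 'c1')], [covered positive instances])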
The time complexity is O(m · n · N(GT)),
where n is the number of instances in a DB,
m is the number of attributes,
and N(GT) is the number of possible generalizations,
which is less than O(2^(m-1)).
Experiments
DBs that have been tested:
meningitis, bacterial examination, cancer, mushroom,
slope-in-collapse, earthquake, contents-sell, ...
Experimental methods:
Comparing GDT-RS with C4.5
Using background knowledge or not
Selecting different allowed noise rates as the
threshold values
Auto-discretization or BK-based discretization.
Experiment 1
(meningitis data)
C4.5:
CellPoly(> 220) → Bacteria
CTFind(abnormal) ∧ CellMono(≤ 12) → Bacteria
CellPoly(≤ 220) ∧ CellMono(> 12) → Virus
CTFind(normal) ∧ CellPoly(≤ 220) → Virus
Experiment 1
(meningitis data) (2)
GDT-RS (auto-discretization):
cellPoly(≥ 221) → bacteria
ctFind(abnormal) ∧ cellMono(< 15) → bacteria
sex(m) ∧ CRP(≥ 4) → bacteria
onset(acute) ∧ CCourse(negative) ∧ riskGroup(p) → bacteria
seizure(0) ∧ onset(acute) ∧ riskGroup(p) → bacteria
culture(strepto) → bacteria
Experiment 1
(meningitis data) (3)
GDT-RS (auto-discretization):
CRP(< 4) ∧ cellPoly(< 221) ∧ cellMono(≥ 15) → virus
CRP(< 4) ∧ ctFind(normal) ∧ cellPoly(< 221) → virus
cellPoly(< 221) ∧ cellMono(≥ 15) ∧ risk(n) → virus
Explanation of BK
If the brain wave (EEGwave) is normal, the
focus of brain wave (EEGfocus) is never
abnormal.
If the number of white blood cells (WBC) is
high, the inflammation protein (CRP) is also
high.
rule2:
EEGfocus(+) ∧ LOC ∈ [4, 7) → EEGabnormal;
S = (10/E)^4.
Experiment 2
(bacterial examination data)
Number of instances: 20,000
Number of condition attributes: 60
Goals:
analyzing the relationship between the bacterium-detected
attribute and the other attributes;
analyzing which attribute values are related to the
sensitivity of antibiotics when the value of
bacterium-detected is (+).
Attribute Selection
(bacterial examination data)
Class 1: bacterium-detected,
with 11 condition attributes.
Class 2: antibiotic-sensitivity
(resistant (R), sensitive (S)),
with 21 condition attributes.
Some Results
(bacterial examination data)
Some of the rules discovered by GDT-RS are
the same as those found by C4.5, e.g.,
lactamese(3+) → bacterium-detected(+)
urine quantity(< 10^3) → bacterium-detected(-)
Some rules can only be discovered by
GDT-RS, e.g.,
disease1(pneumonia) → bacterium-detected(-).
Experiment 3
(gastric cancer data)
Number of instances: 7,520
Condition attributes: 38
Classes:
cause of death (especially, the direct death)
post-operative complication
Goals:
analyzing the relationship between the direct
death and other attributes
analyzing the relationship between the post-operative complication and other attributes.
Experiment 4
(slope-collapse data)
Number of instances: 3,436
(430 places were collapsed, and 3,006 were not)
Condition attributes: 32
Continuous attributes in condition attributes: 6
extension of collapsed steep slope, gradient, altitude,
thickness of surface of soil, No. of active fault,
distance between slope and active fault.
Results of LVW
Rule induction system: C4.5
Executing times: 10
Class: direct death
Number of selected attributes for each time
20, 19, 21, 26, 22, 31, 21, 19, 31, 28
Result-2 (19 attributes are selected)
multilesions, sex, location_lon3, location_cir4,
liver_meta, lymphnode_diss, proximal_surg, resection_meth,
combined_rese2, reconstruction, pre_oper_comp1,
post_oper_com2, post_oper_com3, spec_histologi, cellular_atyp,
depth, eval_of_treat, ln_metastasis, othertherapypre
Result of WSFG
Rule induction system:
C4.5
Results:
the most relevant attribute first
Result of WSBG
Rule induction system:
C4.5
Result:
the least relevant attribute first
Result of LVI
(gastric cancer data)
Allowed inconsistent   Executing   Inconsistent   Selected
instances              times       instances      attributes
80                     1           79             19
80                     2           68             16
80                     3           49             20
80                     4           61             18
80                     5           66             20
20                     1           7              49
20                     2           19             26
20                     3           19             28
20                     4           20             23
20                     5           18             26
Some Rules
Related to Direct Death
peritoneal_meta(2) ∧ pre_oper_comp1(.) ∧ post_oper_com1(L) ∧ chemotherapypos(.) → direct death,  S = 3*(7200/E)
location_lon1(M) ∧ post_oper_com1(L) ∧ ln_metastasis(3) ∧ chemotherapypos(.) → direct death,  S = 3*(2880/E)
sex(F) ∧ location_cir2(.) ∧ post_oper_com1(L) ∧ growth_pattern(2) ∧ chemotherapypos(.) → direct death,  S = 3*(7200/E)
location_cir1(L) ∧ location_cir2(.) ∧ post_oper_com1(L) ∧ ln_metastasis(2) ∧ chemotherapypos(.) → direct death,  S = 3*(25920/E)
pre_oper_comp1(.) ∧ post_oper_com1(L) ∧ histological(MUC) ∧ growth_pattern(3) ∧ chemotherapypos(.) → direct death,  S = 3*(64800/E)
sex(M) ∧ location_lon1(M) ∧ reconstruction(B2) ∧ pre_oper_comp1(.) ∧ structural_atyp(3) ∧ lymphatic_inva(3) ∧ vascular_inva(0) ∧ ln_metastasis(2) → direct death,  S = 3*(345600/E)
sex(F) ∧ location_lon2(M) ∧ location_cir2(.) ∧ pre_oper_comp1(A) ∧ depth(S2) ∧ chemotherapypos(.) → direct death,  S = 3*(46080/E)
[Table: limitations of other approaches compared with GDT-RS. One approach relies on algebraic expressions, is difficult to apply to data with multiple classes, makes it difficult to use BK, and cannot explain the stability and uncertainty of a rule clearly. Another requires symbolic data to be quantized, makes it difficult to use BK, and again cannot explain the stability and uncertainty of a rule clearly. A third does not consider unseen instances and does not consider whether the discovered rules form a minimal set covering all instances.]
Advantages of ILP
(Compared with Attribute-Value Learning)
It can learn knowledge that is more
expressive because it is expressed in predicate logic
It can utilize background knowledge more
naturally and effectively because in ILP the
examples, the background knowledge, as
well as the learned knowledge are all
expressed within the same logic framework.
Goal
Applying Granular Computing (GrC) and a
special form of GrC: Rough Sets to ILP to
deal with some kinds of imperfect data
which occur in large real-world applications.
Normal problem setting for ILP: given E+, E- and background knowledge B,
find a hypothesis H such that H ∧ B is correct with respect to E+ and E-, i.e.

1. H ∧ B is complete with respect to E+ (i.e. for every e ∈ E+, H ∧ B ⊨ e).
   We also say that H ∧ B covers all positive examples.
2. H ∧ B is consistent with respect to E- (i.e. for every e ∈ E-, H ∧ B ⊭ e).
   We also say that H ∧ B rejects any negative examples.

A prior condition: B ∧ E+ is consistent with respect to E-
(otherwise there will be no solution).
Issues
In large, real-world empirical learning,
uncertainty, incompleteness, vagueness,
impreciseness, etc. are frequently observed
in training examples, in background
knowledge, as well as in the induced
hypothesis.
A bias that is too strong may cause useful
solutions to be missed, or leave no solution at all.
Noisy data
Erroneous argument values in examples.
Erroneous classification of examples as
belonging to E+ or E-.
Missing data
Missing values: some arguments of some
examples have unknown values.
Missing predicates: BK lacks essential predicates
(or essential clauses of some predicates) so that
no non-trivial H can be induced.
Observations
The requirement that H should be correct with respect to E+
and E- needs to be relaxed; otherwise
there will be no (meaningful) solutions to
the ILP problem.
When it is impossible to differentiate
distinct objects, we may consider granules:
sets of objects drawn together by similarity,
indistinguishability, or functionality.
Observations (2)
Even when precise solutions in terms of
individual objects can be obtained, we may still
prefer granules in order to have an efficient
and practical solution.
When we use granules instead of individual
objects, we are actually relaxing the strict
requirements in the standard normal problem
setting for ILP, so that rough but useful
hypotheses can be induced from imperfect data.
Solution
Granular Computing (GrC) can play an
important role in dealing with imperfect data
and/or a too strong bias in ILP.
GrC is a superset of various theories (such as
rough sets, fuzzy sets, interval computation)
used to handle incompleteness, uncertainty,
vagueness, etc. in information systems
(Zadeh, 1997).
Why GrC?
A Practical Point of View
With incomplete, uncertain, or vague
information, it may be difficult to
differentiate some elements and one is forced
to consider granules.
It may be sufficient to use granules in order
to have an efficient and practical solution.
The acquisition of precise information is too
costly, and coarse-grained information
reduces cost.
Solution (2)
Granular Computing (GrC) may be regarded as
a label of theories, methodologies, techniques,
and tools that make use of granules, i.e., groups,
classes, or clusters of a universe, in the process
of problem solving.
We use a special form of GrC: rough sets to
provide a rough solution.
Rough Sets
Approximation space A = (U, R)
U is a set (called the universe);
R is an equivalence relation on U (called an
indiscernibility relation).

Apr_A(X) = ∪{[x]_R : [x]_R ⊆ X} = {x ∈ U | [x]_R ⊆ X}           (lower approximation)
Āpr_A(X) = ∪{[x]_R : [x]_R ∩ X ≠ ∅} = {x ∈ U | [x]_R ∩ X ≠ ∅}   (upper approximation)

where [x]_R denotes the equivalence class containing x.
An Illustrating Example

Given:
The target predicate: customer(Name, Age, Sex, Income)

The positive examples:            The negative examples:
customer(a, 30, female, 1).       customer(c, 50, female, 2).
customer(b, 53, female, 100).     customer(g, 20, male, 2).
customer(d, 50, female, 2).
customer(e, 32, male, 10).
customer(f, 55, male, 10).

Background knowledge B defining married_to(H, W) by:
married_to(e, a).  married_to(f, d).

Here H ∧ B can be made correct with respect to E+ and E-; the normal problem
setting is perfectly suitable for this problem, and an ILP system can induce
the following hypothesis H defining customer/4:

customer(N, A, S, I) :- I >= 10.
customer(N, A, S, I) :- married_to(N', N), customer(N', A', S', I').
Rough problem setting: let E = E+ ∪ E-. When, because of missing predicates or
noisy data, some positive and negative examples become indiscernible with
respect to a relation R, the set E+ can only be approximated; for example,

Ē+ = {e ∈ E | there exists e' ∈ E+ such that e R e'}

collects all examples indiscernible from some positive example. We then look
for a hypothesis H+ which is correct with respect to Ē+, i.e.

1. H+ ∧ B covers all examples of Ē+;
2. H+ ∧ B rejects any examples of E - Ē+.
Example Revisited
Married_to/2 is missing in B. Let R be defined by
customer(N, A, S, I) R customer(N', A, S, I),
i.e. two examples are indiscernible when they agree on Age, Sex and Income.
With the Rough Problem Setting 1, we may induce H+ as:
customer(N, A, S, I) :- I >= 10.
customer(N, A, S, I) :- S = female.
which covers all positive examples and the negative
example customer(c, 50, female, 2), and rejects the other
negative examples.
Summary
Rough sets offers mathematical tools and
constitutes a sound basis for KDD.
We introduced the basic concepts of
(classical) rough set theory.
We described a rough set based KDD process.
We discussed the problem of imperfect data
handling in ILP using some ideas, concepts
and methods of GrC (or a particular form of
GrC: Rough Sets).
Advanced Topics
(to deal with real world problems)
Recent extensions of rough set theory
(rough mereology: approximate synthesis
of objects) have developed new methods for
decomposition of large datasets, data
mining in distributed and multi-agent
systems, and fusion of information granules
to induce complex information granule
approximation.
Thank You!