Igor Kononenko
University of Ljubljana, Faculty of Electrical Engineering and Computer Science
Trzaska 25, SI-61001 Ljubljana
Slovenia
Abstract
We analyse the biases of eleven measures for estimating the quality of multi-valued attributes. The values of information gain, J-measure, gini-index, and relevance tend to increase linearly with the number of values of an attribute. The values of gain ratio, distance measure, Relief, and the weight of evidence decrease for informative attributes and increase for irrelevant attributes. The bias of the statistical tests based on the chi-square distribution is similar, but these functions are not able to discriminate among attributes of different quality. We also introduce a new function based on the MDL principle whose value decreases only slightly with the increasing number of attribute values.
1 Introduction
2 Selection measures
2.1 Information-based measures

Let $p_{ij}$ denote the probability that an example has the $i$-th class and the $j$-th value of the given attribute, $p_{i.}$ the probability of the $i$-th class, and $p_{.j}$ the probability of the $j$-th attribute value. The entropies are:

$$H_C = -\sum_i p_{i.} \log p_{i.}, \qquad H_A = -\sum_j p_{.j} \log p_{.j},$$

$$H_{CA} = -\sum_i \sum_j p_{ij} \log p_{ij}, \qquad H_{C|A} = H_{CA} - H_A$$

where all the logarithms are of base two. The information gain is defined as the information about an object's class transmitted by a given attribute:

$$Gain = H_C + H_A - H_{CA} = H_C - H_{C|A} \qquad (1)$$

The gain ratio normalizes the information gain with the attribute's entropy:

$$GainR = \frac{Gain}{H_A} \qquad (2)$$

and the distance measure normalizes it with the joint entropy:

$$1 - D = \frac{Gain}{H_{CA}} \qquad (3)$$

The J-measure is defined as:

$$J = \sum_j p_{.j} \sum_i p_{i|j} \log \frac{p_{i|j}}{p_{i.}} \qquad (4)$$

and the weight of evidence as:

$$WE = \sum_i p_{i.} \sum_j p_{.j} \left| \log \frac{odds_{i|j}}{odds_{i.}} \right| \qquad (5)$$

where $odds_{i|j} = p_{i|j}/(1 - p_{i|j})$ and $odds_{i.} = p_{i.}/(1 - p_{i.})$.
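To make the notation concrete, here is a minimal sketch (not from the paper; the contingency table is an invented toy example) that computes Gain, GainR, and 1-D from a class-by-value contingency table:

```python
import numpy as np

def entropy(p):
    """Entropy in bits; zero probabilities contribute nothing."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def info_measures(n):
    """n[i, j] = number of examples with class i and attribute value j."""
    p_ij = n / n.sum()                     # joint probabilities p_ij
    p_i = p_ij.sum(axis=1)                 # class probabilities p_i.
    p_j = p_ij.sum(axis=0)                 # value probabilities p_.j
    H_C, H_A, H_CA = entropy(p_i), entropy(p_j), entropy(p_ij.ravel())
    gain = H_C + H_A - H_CA                # eq. (1)
    return gain, gain / H_A, gain / H_CA   # Gain (1), GainR (2), 1-D (3)

# toy contingency table: two classes, three attribute values
n = np.array([[30.0, 10.0, 5.0],
              [10.0, 30.0, 15.0]])
print(info_measures(n))
```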
2.2 Gini-index and Relief

Breiman et al. [1984] use the gini-index as the (non-negative) attribute selection measure:

$$Gini = \sum_j p_{.j} \sum_i p_{i|j}^2 - \sum_i p_{i.}^2 \qquad (6)$$

The quality of an attribute as estimated by Relief,

$$Relief' = \frac{\sum_j p_{.j}^2 \; Gini'}{\sum_i p_{i.}^2 \left(1 - \sum_i p_{i.}^2\right)} \qquad (7)$$

where

$$Gini' = \sum_j \frac{p_{.j}^2}{\sum_j p_{.j}^2} \sum_i p_{i|j}^2 - \sum_i p_{i.}^2 \qquad (8)$$

is highly correlated with the gini-index. The only difference of $Gini'$ to equation (6) is that, instead of the factor $p_{.j}^2 / \sum_j p_{.j}^2$, $Gini$ uses $p_{.j} = p_{.j} / \sum_j p_{.j}$. In our experiments, besides $Gini$ and $Relief'$, we evaluated $Gini'$ as well in order to verify whether this difference to equation (6) is significant or not.
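A companion sketch of the gini-index family follows; the normalisations in eqs. (7) and (8) follow the reconstruction above and should be treated as my reading of the source rather than a definitive implementation:

```python
import numpy as np

def gini_family(n):
    """Return Gini (6), Gini' (8), and Relief' (7) for a class-by-value table."""
    p_ij = n / n.sum()
    p_i = p_ij.sum(axis=1)                     # p_i.
    p_j = p_ij.sum(axis=0)                     # p_.j
    purity = ((p_ij / p_j) ** 2).sum(axis=0)   # sum_i p_i|j^2, one entry per j
    base = (p_i ** 2).sum()                    # sum_i p_i.^2
    gini = (p_j * purity).sum() - base                               # eq. (6)
    gini_p = ((p_j ** 2) / (p_j ** 2).sum() * purity).sum() - base   # eq. (8)
    relief_p = (p_j ** 2).sum() * gini_p / (base * (1.0 - base))     # eq. (7)
    return gini, gini_p, relief_p

n = np.array([[30.0, 10.0, 5.0],
              [10.0, 30.0, 15.0]])
print(gini_family(n))
```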
2.3 Relevance

Baim [1988] introduced the relevance of an attribute, Relev (9), as a measure for attribute selection; its behavior is analysed in Section 4.
2.4 χ² and G statistics

The statistical tests estimate the probability that, under the null hypothesis of an irrelevant attribute, the value of the statistic is smaller than the value obtained for a given attribute:

$$P = \int_0^{X_0} p(x)_D \, dx \qquad (10)$$
where $p(x)_D$ is the chi-square distribution with $D$ degrees of freedom and $X_0$ is the value of the statistic for a given attribute. Press et al. [1988] give the algorithms for evaluating the above formula. We have two statistics that are well approximated by the chi-square distribution with $(V-1)(C-1)$ degrees of freedom [White and Liu, 1994], χ² and G:

$$\chi^2 = \sum_i \sum_j \frac{(e_{ij} - n_{ij})^2}{e_{ij}}, \qquad e_{ij} = \frac{n_{.j} \, n_{i.}}{n_{..}} \qquad (11)$$

and

$$G = 2 \, n_{..} \, Gain \, \log_e 2 \qquad (12)$$

Here $n_{ij}$ is the number of training examples from the $i$-th class with the $j$-th attribute value, a dot denotes summation over the corresponding index (so $n_{..}$ is the number of all training examples), and $\log_e$ denotes the logarithm with the natural base $e = 2.7182\ldots$
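A sketch of both statistics and the probability (10); it assumes the reading of eq. (10) as the chi-square cumulative distribution (scipy.stats.chi2.cdf), so that values near 1.0 indicate an informative attribute:

```python
import numpy as np
from scipy.stats import chi2

def chi2_and_G(n):
    """P(chi^2) and P(G) (eq. 10) for a class-by-value contingency table n."""
    C, V = n.shape
    total = n.sum()                                        # n..
    e = np.outer(n.sum(axis=1), n.sum(axis=0)) / total     # e_ij = n_i. n_.j / n..
    chi_sq = ((e - n) ** 2 / e).sum()                      # eq. (11)
    p = n / total
    indep = np.outer(p.sum(axis=1), p.sum(axis=0))         # p_i. * p_.j
    mask = p > 0
    gain = (p[mask] * np.log2(p[mask] / indep[mask])).sum()  # Gain, eq. (1)
    G = 2.0 * total * gain * np.log(2.0)                   # eq. (12)
    dof = (V - 1) * (C - 1)
    return chi2.cdf(chi_sq, dof), chi2.cdf(G, dof)         # eq. (10)

n = np.array([[30.0, 10.0, 5.0],
              [10.0, 30.0, 15.0]])
print(chi2_and_G(n))
```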
2.5 MDL
The approximate number of bits needed to encode the classes of all $n_{..}$ training examples is:

$$Prior\_MDL' = n_{..} H_C + \log \binom{n_{..} + C - 1}{C - 1}$$

and the approximation of the number of bits to encode the classes of examples in all subsets corresponding to all values of the selected attribute is:

$$Post\_MDL' = \sum_j n_{.j} H_{C|j} + \sum_j \log \binom{n_{.j} + C - 1}{C - 1} + \log A$$

The last term ($\log A$) is needed to encode the selection of an attribute among $A$ attributes. However, this term is constant for a given selection problem and can be ignored. The first term is equal to $n_{..} H_{C|A}$. Therefore, the MDL measure that evaluates the average compression (per instance) of the message by an attribute is defined with the difference $Prior\_MDL' - Post\_MDL'$ normalized with $n_{..}$:

$$MDL' = Gain + \frac{1}{n_{..}} \left( \log \binom{n_{..} + C - 1}{C - 1} - \sum_j \log \binom{n_{.j} + C - 1}{C - 1} \right) \qquad (13)$$

The optimal coding of the classes of all training examples uses the exact multinomial coefficient instead of the entropy approximation:

$$Prior\_MDL = \log \binom{n_{..}}{n_{1.}, \ldots, n_{C.}} + \log \binom{n_{..} + C - 1}{C - 1}$$

This gives a better definition of MDL:

$$MDL = \frac{1}{n_{..}} \left( \log \binom{n_{..}}{n_{1.}, \ldots, n_{C.}} - \sum_j \log \binom{n_{.j}}{n_{1j}, \ldots, n_{Cj}} + \log \binom{n_{..} + C - 1}{C - 1} - \sum_j \log \binom{n_{.j} + C - 1}{C - 1} \right) \qquad (14)$$
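A sketch of eq. (14) using exact log binomial and multinomial coefficients; the toy table is invented, and math.comb / math.lgamma do the combinatorics:

```python
import math

def log2_binom(n, k):
    """log2 of the binomial coefficient C(n, k)."""
    return math.log2(math.comb(n, k))

def log2_multinom(counts):
    """log2 of the multinomial coefficient (sum counts)! / prod(counts!)."""
    total = sum(counts)
    lg = math.lgamma(total + 1) - sum(math.lgamma(c + 1) for c in counts)
    return lg / math.log(2)

def mdl(n):
    """MDL (eq. 14) for a table n[i][j] of class-i / value-j counts."""
    C, V = len(n), len(n[0])
    n_j = [sum(n[i][j] for i in range(C)) for j in range(V)]   # n_.j
    n_i = [sum(n[i]) for i in range(C)]                        # n_i.
    n_all = sum(n_i)                                           # n..
    prior = log2_multinom(n_i) + log2_binom(n_all + C - 1, C - 1)
    post = sum(log2_multinom([n[i][j] for i in range(C)]) for j in range(V))
    post += sum(log2_binom(nj + C - 1, C - 1) for nj in n_j)
    return (prior - post) / n_all

print(mdl([[30, 10, 5],
           [10, 30, 15]]))
```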
3 Experimental scenario
[Equation (15): the class probability distribution of the artificially generated examples, parameterized by k through the term i + kC.]
We tried various values of k (k = 0, 1, 2), which determines how informative the attribute is. For example, for 2 possible classes with uniform distribution, the information gain of a two-valued attribute is 0.13 bits if k = 1 and 0.32 bits if k = 2; for 10 possible classes it is 0.66 bits if k = 1 and 0.77 bits if k = 2. However, the biases of all measures were not very sensitive to the exact value of k.
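Equation (15) is only partially recoverable here, but the irrelevant-attribute case needs no special construction: attribute values drawn independently of the class suffice. The sketch below (the sample size and number of trials are illustrative assumptions, not the paper's protocol) reproduces the qualitative effect that the average Gain of an irrelevant attribute grows with V:

```python
import numpy as np

def gain(n):
    """Information gain (eq. 1) of a class-by-value contingency table."""
    p = n / n.sum()
    indep = np.outer(p.sum(axis=1), p.sum(axis=0))
    mask = p > 0
    return float((p[mask] * np.log2(p[mask] / indep[mask])).sum())

def mean_gain_irrelevant(C, V, n_examples=1000, trials=200, seed=0):
    """Average Gain of an attribute whose values are independent of the class."""
    rng = np.random.default_rng(seed)
    acc = 0.0
    for _ in range(trials):
        table = np.zeros((C, V))
        np.add.at(table, (rng.integers(C, size=n_examples),
                          rng.integers(V, size=n_examples)), 1.0)
        acc += gain(table)
    return acc / trials

# the estimated quality of an irrelevant attribute grows with V
for V in (2, 5, 10, 20, 40):
    print(V, round(mean_gain_irrelevant(C=2, V=V), 4))
```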
4 Results
4.1 Information gain, J-measure, gini-index, and relevance
The estimates of Gain (1), J (4), Gini (6), and Relev (9) tend to increase linearly with the number of attribute values and to decrease with the increasing number of classes. This is shown on Figure 1 (b), where the values of Gini for the higher numbers of classes, where attributes are even more informative (see eq. (15)), are lower than the values for the lower numbers of classes. This becomes even more obvious for more informative attributes (e.g. for k = 2 in eq. (15)), where the graph becomes similar to that on Figure 2, except that for Gini all curves are straight lines with a higher slope.

Relev (9) behaves similarly to the gini-index, except that its estimates increase less than linearly with the number of attribute values. Figure 2 shows this for informative attributes. Note that relevance tends to decrease with the increasing number of classes even though the attributes for problems with more classes are more informative.

Figure 1: (a) Gain for informative attributes; (b) Gini for informative attributes (estimates vs. the number of attribute values V, for C = 2, 5, and 10).
Figure 2: Relevance for informative attributes (C = 2, 5, and 10).
4.2 Gain ratio, distance measure, and Relief
The estimates of informative and highly informative attributes decrease exponentially with the number of attribute values for GainR (2), 1-D (3), and Relief (7). However, for irrelevant attributes, all three measures exhibit a bias in favor of the multivalued attributes. The estimates of Relief increase logarithmically with the number of values, while for 1-D and GainR the estimates increase linearly. Figure 3 shows both biases for Relief, and Figure 4 shows the bias of GainR for irrelevant attributes. The performance of 1-D is very similar to that of GainR. Note the different scales for irrelevant and informative attributes. This shows that the bias in favor of the irrelevant multivalued attributes is not very strong and is the lowest for Relief.

Figure 3: (a) Relief for uniform distribution of attribute values; (b) Relief for informative attributes.
Figure 4: GainR for uniform distribution of attribute values.
4.3 MDL and weight of evidence

MDL' (13) exhibits a bias against the multivalued attributes. MDL' decreases almost linearly with the number of attribute values in all scenarios, as shown on Figures 5 (a) and (b). As expected from the definition in eq. (13), the bias (the slope of the curve) is higher for problems with a higher number of classes. Namely, the number of classes influences the number of bits needed to encode the decoders.

MDL' is always negative for irrelevant attributes; therefore, the irrelevant attributes are always considered non-compressive. For informative attributes the compression decreases with the number of attribute values.

MDL (14) behaves similarly to MDL' for irrelevant attributes; however, the slope of the curves is lower and all informative attributes are considered compressive. This is shown on Figure 6.

The behavior of WE (5) is not stable. The reason may be the use of a non-differentiable function (the absolute values). The behavior seems to be somewhere between that of Relev (for irrelevant attributes) and that of MDL (for informative attributes). Like MDL, WE decreases faster for problems with more classes.

Figure 5: (a) MDL' for uniform distribution of attribute values; (b) MDL' for informative attributes.
Figure 6: MDL for informative attributes.
4.4 χ² and G statistics

Figure 7: (a) P(χ²) for uniform distribution of attribute values; (b) P(χ²) for slightly informative attributes, k = 0 in eq. (15).
Figure 8: P(G) for uniform distribution of attribute values.
For informative attributes, the values of P(χ²) are all equal to 1.0, due to the limited precision of the evaluation of eq. (10). The fact is that on most computers, without special algorithms, this measure will not be able to distinguish between attributes of different quality, which makes the measure impractical. One may argue that in the cases when P(χ²) = 1.0, the value of χ² itself could be used. This can be done only when comparing attributes with exactly the same number of values (the same number of degrees of freedom), which is not very useful.

Figure 8 shows that P(G), defined with equations (10) and (12), overestimates the multivalued attributes, which is not in agreement with the conclusions of White and Liu [1994]. Their conclusions seem valid if the figure is limited to C = 2 and 5 and V = 2, 5, and 10, which was their original scenario. Besides, P(G) has the same problems as P(χ²) with informative attributes: its values for informative attributes are all equal to 1.0.

We verified this for P(χ²) and P(G) by varying the parameter k in eq. (15): k = 0.0, 0.25, 0.50, ... As soon as k > 0, all attributes are too informative and both statistics get the value 1.0. The results in Figure 7 (b) for k = 0 show the bias of P(χ²) against the multivalued attributes for slightly informative attributes. The figure for P(G) is similar.
5 Conclusions
Acknowledgments
I thank Matevž Kovačič, Uroš Pompe, and Marko Robnik for discussions on the MDL principle and for comments on earlier drafts of the paper.
References
[Baim, 1988] P.W. Baim. A method for attribute selection in inductive learning systems. IEEE Trans. on PAMI, 10:888-896, 1988.