Rashmi Raj
(rashmi@cs.stanford.edu)
Protein
Source - http://www.mun.ca/biology/desmid/brian/BIOL2060/BIOL2060-13/1319.jpg
Problem
Science
Business
Web
Government
etc.
Market Basket Analysis
[Figure: decision tree with branches sunny, overcast, and rain; nodes Humidity and Windy; leaves Yes and No]
Choosing the Splitting Attribute
► At each node, available attributes are evaluated on the basis of separating the classes of the training examples. A goodness function is used for this purpose.
► Typical goodness functions:
information gain (ID3/C4.5)
information gain ratio
gini index
[Figure: branches Sunny, Overcast, Raining]
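As a rough illustration of one of the goodness functions listed above, the sketch below computes the Gini index of a node and the size-weighted Gini index of a candidate split. The function names `gini_index` and `weighted_gini` and the toy labels are assumptions made for this example; they do not come from the slides.

```python
from collections import Counter


def gini_index(labels):
    """Gini impurity of a list of class labels: 1 - sum(p_i ** 2)."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((n / total) ** 2 for n in counts.values())


def weighted_gini(subsets):
    """Average Gini impurity of the subsets, weighted by subset size."""
    total = sum(len(s) for s in subsets)
    return sum(len(s) / total * gini_index(s) for s in subsets)


# Toy labels (hypothetical): a perfectly mixed node and a perfect split of it.
mixed_node = ["yes", "no", "yes", "no"]
perfect_split = [["yes", "yes"], ["no", "no"]]
print(gini_index(mixed_node))
print(weighted_gini(perfect_split))
```

A lower weighted value means purer child nodes, so under this criterion the attribute with the lowest weighted Gini index would be preferred.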
Which Attributes to Select?
A Criterion for Attribute Selection
► Which is the best attribute?
The one which will result in the smallest tree
Heuristic: choose the attribute that produces the “purest” nodes
[Figure: Outlook node; the overcast branch leads to Yes]
Information Gain
► Popular impurity criterion:
information gain
Information gain increases
with the average purity of
the subsets that an
attribute produces
► Strategy: choose attribute
that results in greatest
information gain
Computing Information
► Information is measured in bits
Given a probability distribution, the info
required to predict an event is the distribution’s
entropy
► Formula for computing the entropy:
$\mathrm{entropy}(p_1, p_2, \ldots, p_n) = -\sum_{i=1}^{n} p_i \log_2 p_i$
Computing Information Gain
► Information gain:
(information before split) – (information after split)
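A minimal sketch of this computation, assuming entropy is measured in bits (log base 2) and that the information after a split is the size-weighted average entropy of the resulting subsets. The function names and the class counts (9 yes / 5 no, split three ways) are illustrative assumptions, not data from the slides.

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy in bits of the class distribution in `labels`."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def information_gain(labels, subsets):
    """(information before split) - (information after split)."""
    total = len(labels)
    after = sum(len(s) / total * entropy(s) for s in subsets)
    return entropy(labels) - after


# Assumed toy data: 14 examples split into three branches by one attribute.
sunny = ["yes"] * 2 + ["no"] * 3
overcast = ["yes"] * 4
rain = ["yes"] * 3 + ["no"] * 2
all_labels = sunny + overcast + rain

print(round(entropy(all_labels), 3))
print(round(information_gain(all_labels, [sunny, overcast, rain]), 3))
```

The pure `overcast` subset contributes zero entropy, which is why a split that isolates it scores well.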
PROSITE
Database of protein families and domains
Positive Examples
Negative Examples
Positive Example
► Query-1: Post-synaptic AND !toxin
All Species – To maximize the number of
examples in the data to be mined
!toxin – Several entities in UniProt/SwissProt
refer to the toxin alpha-latrotoxin
Negative Examples
► Query-2: Heart !(result of query 1)
► Query-3: Cardiac !(result of queries 1,2)
► Query-4: Liver !(result of queries 1,2,3)
► Query-5: Hepatic !(result of queries 1,2,3,4)
► Query-6: Kidney !(result of queries 1,2,3,4,5)
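The successive exclusion in the queries above can be sketched with set differences: each query keeps only the proteins that no earlier query returned. The helper `build_example_sets` and the protein identifiers are hypothetical; a real run would use the actual UniProt/SwissProt query results.

```python
def build_example_sets(query_results):
    """Given an ordered list of (query_name, protein_ids), drop from each
    result every protein already returned by an earlier query."""
    seen = set()
    exclusive = []
    for name, ids in query_results:
        fresh = set(ids) - seen      # !(result of all previous queries)
        exclusive.append((name, fresh))
        seen |= fresh
    return exclusive


# Hypothetical identifiers, for illustration only.
results = [
    ("post-synaptic", {"P1", "P2"}),
    ("heart", {"P2", "P3"}),
    ("cardiac", {"P3", "P4"}),
]
for name, ids in build_example_sets(results):
    print(name, sorted(ids))
```

Under this scheme the first set would serve as the positive examples and the later ones as negatives, so no protein ends up in both classes.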
Second Phase
Generating the predictor attributes: use the link from UniProt to the Prosite database to create the predictor attributes
Predictor Attributes
Must have a good predictive power
Facilitate easy interpretation by biologists
Generating The Predictor Attribute
[Flowchart: for each Prosite entry, two YES/NO checks: is it a Pattern (NOT a Profile), and has it occurred in at least two proteins]
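One possible reading of this selection step, as a sketch: keep a Prosite entry only if it is a pattern (not a profile) and occurs in at least two proteins, then turn each kept entry into a yes/no attribute per protein. The binary encoding, the function name, the `entry_kind` mapping, and all identifiers below are assumptions for illustration; the slides do not specify the data layout.

```python
from collections import Counter


def build_predictor_attributes(protein_to_entries, entry_kind, min_proteins=2):
    """Select pattern entries seen in >= min_proteins proteins and build
    a presence/absence table: protein -> {entry: True/False}."""
    counts = Counter(
        entry
        for entries in protein_to_entries.values()
        for entry in set(entries)              # count each protein once
        if entry_kind.get(entry) == "pattern"  # skip profiles
    )
    selected = sorted(e for e, n in counts.items() if n >= min_proteins)
    table = {
        protein: {e: (e in entries) for e in selected}
        for protein, entries in protein_to_entries.items()
    }
    return selected, table


# Hypothetical proteins and entries.
proteins = {"P1": ["E1", "E2"], "P2": ["E1", "E3"], "P3": ["E3", "E2"]}
kinds = {"E1": "pattern", "E2": "profile", "E3": "pattern"}
selected, table = build_predictor_attributes(proteins, kinds)
print(selected)
print(table["P1"])
```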
Generates easily interpretable classification rules of the form: IF (condition) THEN (class)
[Diagram labels: THEN, ELSE]
This kind of rule has the
intuitive meaning
Classification Rules
Predictive Accuracy
► Well-known 10-fold cross-validation procedure (Witten and Frank 2000)
Divide the dataset into 10 partitions with approximately the same number of examples (proteins)
In the i-th iteration, i=1,2,…,10, use the i-th partition as the test set and the other 9 partitions as the training set
[Figure: partitions 1 to 10, repeated for each iteration]
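The partitioning described above can be sketched in a few lines. Shuffling before splitting and the fixed `seed` are assumptions added here for reproducibility; the slides only state that the 10 partitions have approximately the same size.

```python
import random


def ten_fold_splits(examples, k=10, seed=0):
    """Yield (training_set, test_set) pairs, one per fold."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)       # assumed: random assignment
    folds = [shuffled[i::k] for i in range(k)]  # sizes differ by at most 1
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test


# Hypothetical dataset of 25 protein indices.
for train, test in ten_fold_splits(range(25)):
    print(len(train), len(test))
```

Each example appears in exactly one test set, so every protein is used for testing once and for training nine times.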
Results
► Sensitivity
TPR = TP/(TP+FN)
Average value (over the 10 iterations of the cross-validation procedure) = 0.85
► Specificity
TNR = TN/(TN+FP)
Average value (over the 10 iterations of the cross-validation procedure) = 0.98
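For reference, the two measures and their per-fold averaging can be computed as in the sketch below. The confusion-matrix counts are made-up placeholders chosen for easy arithmetic; they are not the counts behind the averages reported above.

```python
def sensitivity(tp, fn):
    """True positive rate: TPR = TP / (TP + FN)."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """True negative rate: TNR = TN / (TN + FP)."""
    return tn / (tn + fp)


def average(values):
    """Arithmetic mean, e.g. over the cross-validation iterations."""
    return sum(values) / len(values)


# Placeholder counts (tp, fn, tn, fp) for two hypothetical folds.
folds = [(8, 2, 45, 5), (9, 1, 40, 10)]
print(average([sensitivity(tp, fn) for tp, fn, _, _ in folds]))
print(average([specificity(tn, fp) for _, _, tn, fp in folds]))
```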
Questions?