
Classification Trees and Random Forest

Huaxin You
Department of Statistics and Actuarial Science, University of Central Florida

Outline

Classification Problems
Classification Trees
Bagging
Boosting
Random Forest

Classification Problems

Given a training data set with known class membership, a classification problem involves finding a classification rule that has high predictive accuracy on future examples from the same classes. Some popular problems include the following:
Recognizing handwritten digits and characters.
Identifying terror suspects in video footage of thousands of people.
Classifying land type (forest, field, desert, urban) from satellite images.
Diagnosing a patient's disease from collected physiological data.
Detecting credit card fraud or network intruders.

An Example: Credit Risk Analysis

Data:


Customer 103, observed at times t0, t1, ..., tn:

Attribute                 | t0     | t1     | tn
--------------------------|--------|--------|-------
Years of credit           | 9      | 9      | 9
Loan balance              | $2,400 | $3,250 | $4,500
Income                    | $52k   | ?      | ?
Own house                 | Yes    | Yes    | Yes
Other delinquent accts    | 2      | 2      | 3
Max billing cycles late   | 3      | 4      | 6
Profitable customer?      | ?      | ?      | No

Rules learned from synthesized data:


If Other-Delinquent-Accounts > 2 and Number-Delinquent-Billing-Cycles > 1,
then Profitable-Customer? = No.  [Deny credit card application]

If Other-Delinquent-Accounts = 0 and (Income > $30k or Years-of-Credit > 3),
then Profitable-Customer? = Yes.  [Accept credit card application]

An Example of Classification Trees

Should a tennis match be played?


Outlook?
  Sunny    -> Humidity?
                High   -> No
                Normal -> Yes
  Overcast -> Yes
  Rain     -> Wind?
                Strong -> No
                Weak   -> Yes

Classification Trees

There are many algorithms for generating classification trees, for example ID3 and C4.5 (R. Quinlan (1993)) and CART (L. Breiman et al. (1984)). They all construct a tree-like classification rule from a given training set by partitioning the space recursively, so that the impurity is reduced gradually. The impurity can be measured by the entropy
$$\phi = -\sum_{i=1}^{k} p_i \log p_i,$$
where $p_i$ denotes the proportion of examples from the $i$th class. Note that if all examples in a subspace come from the same class, $\phi = 0$, whereas $\phi$ is maximized when the examples are spread evenly over all classes.
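As a small illustration (not from the original slides), the entropy impurity of a node can be computed directly from the class proportions. The helper below is a minimal NumPy sketch; the function name and the base-2 logarithm are my own choices.

```python
import numpy as np

def entropy(labels):
    """Entropy impurity of a node, given the class labels of the examples it contains."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()               # class proportions p_i
    return float(-np.sum(p * np.log2(p)))   # 0 for a pure node, maximal when classes are balanced

# A pure node has zero impurity; an evenly mixed two-class node has entropy 1 (base 2).
print(entropy(["yes"] * 10))                 # 0.0
print(entropy(["yes"] * 5 + ["no"] * 5))     # 1.0
```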

Top-Down Induction of Decision Trees

Main loop:
1. Find the best decision attribute A for the next node, i.e., the attribute that reduces the impurity the most.
2. Assign A as the decision attribute for the node.
3. For each value or partition of A, create a new descendant of the node.
4. Sort the training examples down to the leaf nodes.
5. If the training examples are perfectly classified, then STOP; else iterate over the new leaf nodes.

Which attribute is best?
Consider a node with [29+, 35-] examples and two candidate attributes:
  A1 = ? : true branch [21+, 5-], false branch [8+, 30-]
  A2 = ? : true branch [18+, 33-], false branch [11+, 2-]
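To make "which attribute is best?" concrete, one can compare the reduction in entropy (the information gain) of each candidate split. The sketch below is illustrative only: the helper names are mine, and the counts are the ones from the example above.

```python
import numpy as np

def entropy_from_counts(pos, neg):
    """Entropy of a node containing `pos` positive and `neg` negative examples."""
    p = np.array([pos, neg], dtype=float)
    p = p[p > 0] / (pos + neg)
    return float(-np.sum(p * np.log2(p)))

def information_gain(parent, children):
    """Entropy reduction when `parent` = (pos, neg) is split into `children` = [(pos, neg), ...]."""
    n = sum(p + q for p, q in children)
    weighted = sum((p + q) / n * entropy_from_counts(p, q) for p, q in children)
    return entropy_from_counts(*parent) - weighted

# Candidate splits of the [29+, 35-] node above.
print(information_gain((29, 35), [(21, 5), (8, 30)]))   # A1: gain is roughly 0.27
print(information_gain((29, 35), [(18, 33), (11, 2)]))  # A2: gain is roughly 0.12
```

With these counts, A1 reduces the impurity more than A2, so the main loop would split on A1.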

Bagging Trees

Bagging: train classification trees on bootstrapped samples (L. Breiman (1998)).
Given a training set of size n (the big bag), create m different training sets (small bags) by sampling from the original data with replacement.
Build m classification trees by running the classification-tree algorithm on these m training sets.
Aggregate the predictions by a simple majority vote.
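A minimal sketch of this procedure, assuming scikit-learn's DecisionTreeClassifier as the base tree learner and a synthetic data set; the number of bags m and all other settings are arbitrary illustrative choices.

```python
import numpy as np
from sklearn.datasets import make_classification   # synthetic two-class data, for illustration
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

m = 25                                              # number of small bags
trees = []
for _ in range(m):
    idx = rng.integers(0, len(X), size=len(X))      # sample n examples with replacement
    trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

votes = np.stack([t.predict(X) for t in trees])     # shape (m, n): one row of votes per tree
majority = (votes.mean(axis=0) > 0.5).astype(int)   # simple majority vote for two classes
print("bagged training accuracy:", (majority == y).mean())
```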

Boosting Trees

Boosting: train classification trees on sequentially reweighted versions of the training data set (R. Schapire et al. (1997) and J. Friedman et al. (1998)).
Train the first classification tree.
Give the data points different weights, and train a new classification tree that focuses on the data points the previous classification tree got wrong.
During testing, each classification tree gets a weighted vote proportional to its accuracy on the training data.
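The reweighting idea can be sketched in AdaBoost style. This is not the exact algorithm of the cited papers, just a hedged illustration where the stump depth, the number of rounds, and the synthetic data set are arbitrary choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
y_pm = np.where(y == 1, 1, -1)                       # recode labels as {-1, +1}

w = np.full(len(X), 1.0 / len(X))                    # start with uniform weights
stumps, alphas = [], []
for _ in range(20):
    stump = DecisionTreeClassifier(max_depth=1).fit(X, y_pm, sample_weight=w)
    pred = stump.predict(X)
    err = np.sum(w[pred != y_pm])                    # weighted training error of this tree
    alpha = 0.5 * np.log((1 - err) / max(err, 1e-12))  # its vote: larger when more accurate
    w *= np.exp(-alpha * y_pm * pred)                # upweight the examples it got wrong
    w /= w.sum()
    stumps.append(stump)
    alphas.append(alpha)

score = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
print("boosted training accuracy:", (np.sign(score) == y_pm).mean())
```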

Research Results

Compared to a single tree, Bagging consistently provides a modest gain. Boosting generally provides a larger improvement than Bagging. Both Bagging and Boosting increase the training cost.

For three trees t1, t2, t3 in the ensemble:
A = {portion of the population misclassified by t1},
B = {portion of the population misclassified by t2},
C = {portion of the population misclassified by t3}.

Bagging Trees and Variance Reduction

Classification trees are very sensitive to small changes in the examples: two similar samples taken from the same population can result in two very different classification trees. In statistical terminology, the trees have high variance. Voting over many trees constructed from many small bags of examples therefore reduces the variance and stabilizes the performance of the trees. Note that Bagging mainly reduces the variance of a classification method; if the method is intrinsically flawed (i.e., has a large bias), Bagging won't help.

Boosting Trees and Bayes Rule

Ideally, a random forest $F$ tries to minimize the generalization error
$$P(\mathrm{Error}) = E\, I[\,y(x) \neq F(x)\,],$$
where $y(x)$ and $F(x)$ are the actual class and the predicted class of $x$, respectively. Given only a training set $S = \{(x_i, y_i)\}_{i=1}^{n}$, the empirical version is
$$P(\mathrm{Error} \mid S) = \frac{1}{n} \sum_{i=1}^{n} I[\,y(x_i) \neq F(x_i)\,].$$
In practice, $P(\mathrm{Error} \mid S)$ is not continuous and is hard to minimize. What Boosting does is maximize the expected value of the margins $M_i$. The margin function $M(\cdot)$ is usually continuous and easier to maximize, and $M(\cdot)$ and $P(\mathrm{Error})$ have the same population optimizer, the Bayes rule: assign $x$ to class $i$ if
$$p(i \mid x) > p(j \mid x) \quad \text{for all } j \neq i.$$
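To make these quantities concrete, the sketch below computes the empirical error P(Error | S) and the voting margins M_i for a small two-class ensemble. The vote matrix is a made-up stand-in for the trees' predictions, and the margin definition used (fraction of correct votes minus fraction of incorrect votes) is one common convention.

```python
import numpy as np

# Hypothetical votes of 5 trees on 4 examples, with classes coded as {-1, +1}.
votes = np.array([[ 1,  1, -1,  1],
                  [ 1, -1, -1,  1],
                  [ 1,  1, -1, -1],
                  [-1,  1, -1,  1],
                  [ 1,  1,  1,  1]])
y = np.array([1, 1, -1, -1])                     # true classes y(x_i)

F = np.sign(votes.sum(axis=0))                   # majority-vote prediction F(x_i)
emp_error = np.mean(F != y)                      # empirical P(Error | S)
margins = (votes == y).mean(axis=0) - (votes != y).mean(axis=0)   # M_i in [-1, 1]
print(emp_error, margins)                        # boosting tries to push the margins up
```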

Summary: Random Forests

What is a random forest? A huge ensemble of trees generated in some fashion, whose decision rule is produced by voting.
Computational cost concerns: training many trees is very expensive for large or high-dimensional data sets, and storing many trees can also be an issue.
Remedies (see the sketch below):
When splitting a node, instead of choosing the optimal split among all variables, choose the optimal split among a randomly selected subset of the variables.
Construct short trees; avoid deep trees that require exponentially more searching.
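Both remedies correspond to standard random-forest parameters; a minimal sketch using scikit-learn's RandomForestClassifier, with the data set and all parameter values chosen only for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

forest = RandomForestClassifier(
    n_estimators=100,      # an ensemble of many trees
    max_features="sqrt",   # split on the best of a random subset of the variables
    max_depth=8,           # keep the individual trees short
    random_state=0,
).fit(X, y)
print("training accuracy:", forest.score(X, y))
```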

Avoid Overfitting

Overfitting occurs when the training error is very small whereas the generalization error becomes very large.
How to avoid overfitting when growing a random forest?
Stop growing statistically insignificant branches.
Prune the decision tree using a validation set of examples (see the sketch below).
It is shown in L. Breiman (2001) that the generalization error converges for a properly grown random forest, so the problem of overfitting is avoided.
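One way to read "prune using a validation set" in code is to grow a tree, compute candidate cost-complexity pruning levels, and keep the level that scores best on held-out data. The sketch below uses scikit-learn; the particular split and data set are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

# Candidate pruning levels from minimal cost-complexity pruning on the training set.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_tr, y_tr)

best_alpha, best_score = 0.0, 0.0
for alpha in path.ccp_alphas:
    tree = DecisionTreeClassifier(ccp_alpha=alpha, random_state=0).fit(X_tr, y_tr)
    score = tree.score(X_val, y_val)             # validation accuracy, not training accuracy
    if score > best_score:
        best_alpha, best_score = alpha, score

print("chosen pruning level:", best_alpha, "validation accuracy:", best_score)
```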

Possible Final Project

How to construct trees with multi-way splits instead of only binary splits? See Wei-Yin Loh (2001) for more discussion.
How to merge a random forest into a single tree efficiently? If a forest has k trees, each with m terminal nodes, a straightforward merge can generate a tree of size O(m^k). Is there a way to reduce it to O(mk)?
How to handle missing information in the training set?
How to reduce the effect of noise variables and noise examples?
How to determine a good stopping criterion?
How to apply the same techniques to other classification methods such as neural networks, support vector machines, etc.?

Major Contributors

Leo Breiman: UC Berkeley.
Jerry Friedman, Trevor Hastie, and Robert Tibshirani: Stanford University.
Robert Schapire and Yoav Freund: AT&T Labs.
Ross Quinlan: the University of New South Wales.
Sites for software download:
http://stat-www.berkeley.edu/users/breiman/
http://www-stat.stanford.edu/~jhf/
http://www.research.att.com/~schapire/
http://www.cse.unsw.edu.au/~quinlan/
