
Predicting Categorical Variables.


Data Analysis Using R 1 / 17

Prediction algorithms used so far: linear regression, multiple linear
regression (OLS), nonlinear regression, logistic regression.
Classification algorithms: nearest neighbour, k-nearest neighbour,
decision trees, random forests.


k Nearest Neighbour

non-parametric classification algorithm (it makes no assumption about the structure of the data)
uses information from neighbouring points to predict the class
one of the introductory supervised classifiers
proposed by Fix and Hodges in 1951 for pattern classification tasks
addresses pattern recognition problems


Example I

We consider a basket of fruits containing apples, bananas, grapes and cherries.
The task is to arrange them into groups.
Example II

We have the characteristics of the fruits (X variables): color (red/green), size (big/small) and weight (numeric).
For example: apple (red, big), banana (green, big), grapes (green, small), cherries (red, small).
Let Y (the fruit name) be the response variable, i.e. the group/label.
Using kNN and a training set (60% of the data set), we want to accurately predict the label (fruit name) of any new fruit.


Algorithm I

simplest version: predicts the class of a point as the class of its nearest neighbour
the closest point is identified using a distance measure (Euclidean distance, Manhattan distance, etc.)
kNN algorithm:
Let $(X_i, C_i)$, $i = 1, \dots, n$, be the data points (there may be several predictor variables)
$X_i$ denotes the feature values and $C_i$ the label of point $i$
$c$ is the number of classes, $C_i \in \{1, 2, 3, \dots, c\}$ for all $i$
Let $x$ be a point whose label/group/class is not known
We would like to find its class using k-nearest neighbours
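The two distance measures named above can be sketched in base R as follows. The function names and the two points are illustrative assumptions, not part of the slides.

```r
# Two common distance measures between numeric feature vectors.
# (Function names are hypothetical helpers for this illustration.)
euclidean <- function(a, b) sqrt(sum((a - b)^2))
manhattan <- function(a, b) sum(abs(a - b))

# Made-up points, chosen only for illustration.
p <- c(1, 2)
q <- c(4, 6)

euclidean(p, q)  # sqrt(3^2 + 4^2) = 5
manhattan(p, q)  # |3| + |4| = 7
```

The choice of measure changes which points count as "nearest", so it can change the predicted class.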


kNN Algorithm

1. Calculate $d(x, X_i)$, $i = 1, \dots, n$, where $d$ denotes the Euclidean distance between the points.
2. Arrange the $n$ calculated distances in non-decreasing order.
3. Let $k$ be a positive integer; take the first $k$ distances from this sorted list.
4. Find the $k$ points corresponding to these $k$ distances.
5. Let $k_i$ denote the number of these $k$ points belonging to the $i$-th class.
6. If $k_p > k_i$ for all $i \neq p$, then put $x$ in class $p$.
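The six steps can be sketched in base R roughly as follows. This is a minimal illustration under stated assumptions: the function name `knn_scratch`, its arguments and the toy data are all hypothetical, and ties are resolved by taking the first maximum rather than by any rule from the slides.

```r
# Minimal kNN classifier following the six steps above.
# train_x : numeric matrix, one row per training point
# train_cl: vector of class labels, one per row of train_x
# x       : numeric vector, the point to classify
knn_scratch <- function(train_x, train_cl, x, k = 3) {
  train_x <- as.matrix(train_x)
  # Step 1: Euclidean distance from x to every training point
  d <- sqrt(rowSums(sweep(train_x, 2, x)^2))
  # Steps 2-4: sort the distances and keep the k nearest points
  nearest <- order(d)[seq_len(k)]
  # Step 5: count how many of the k neighbours fall in each class
  votes <- table(train_cl[nearest])
  # Step 6: return the class with the most votes
  # (a tie is resolved by taking the first maximum, a simplification)
  names(votes)[which.max(votes)]
}

# Toy data (made up): two well separated groups in the plane
train_x <- rbind(c(1, 1), c(1, 2), c(2, 1), c(6, 6), c(7, 6), c(6, 7))
train_cl <- c("a", "a", "a", "b", "b", "b")

knn_scratch(train_x, train_cl, c(1.5, 1.5), k = 3)  # "a"
knn_scratch(train_x, train_cl, c(6.5, 6.5), k = 3)  # "b"
```

Computing all n distances for every new point is what makes kNN simple to write but slow to apply to large training sets.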


If k is even (or there are more than two classes), there might be ties.
To make ties less likely, weights are usually given to the observations, so that
nearer observations are more influential in determining which class the
data point belongs to.
An example of such a scheme is giving a weight of 1/d to each of the
observations, where d is the distance to the data point.
If there is still a tie, then the class is chosen randomly.
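A distance-weighted vote with weights 1/d could look like the sketch below. The function name `knn_weighted`, the small constant `eps` (added so that a zero distance does not cause division by zero) and the toy data are assumptions made for this illustration.

```r
# Distance-weighted kNN vote: each of the k neighbours contributes 1/d.
knn_weighted <- function(train_x, train_cl, x, k = 3, eps = 1e-8) {
  train_x <- as.matrix(train_x)
  d <- sqrt(rowSums(sweep(train_x, 2, x)^2))
  nearest <- order(d)[seq_len(k)]
  # Weight 1/d per neighbour; eps (an assumption) guards against d = 0
  w <- 1 / (d[nearest] + eps)
  # Sum the weights of the neighbours within each class
  score <- tapply(w, as.character(train_cl[nearest]), sum)
  best <- names(score)[score == max(score)]
  # If the weighted scores are still tied, pick one class at random
  if (length(best) > 1) sample(best, 1) else best
}

# Toy data (made up): with k = 4 an unweighted vote would be tied 2-2
train_x <- rbind(c(0, 0), c(0, 1), c(3, 0), c(3, 1))
train_cl <- c("a", "a", "b", "b")

knn_weighted(train_x, train_cl, c(1, 0.5), k = 4)  # "a": the two "a" points are closer
```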


kNN Algorithm Example I


kNN Algorithm Example II

Let's consider the image above, where we have two different target
classes: white and orange.
We have a total of 26 training samples.
Now we would like to predict the target class for the blue circle.
Taking k = 3, we compute the distance from the blue circle to each
training sample using a measure such as the Euclidean distance.
In the image, the distances have been calculated and the circles
closest to the blue circle are enclosed in the big circle.
What will be the predicted class?


How to choose the value of k?

selecting the value of k is the most critical problem

small k ⇒ noise has a higher influence on the result (overfitting
is very probable)
large k ⇒ computationally expensive; defeats the idea behind kNN (nearby
points have similar classes)


Implementation in R:

- from scratch
- using functions already implemented in libraries
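As an illustration of the library route, the sketch below calls knn() from the class package on a small made-up data set that encodes the fruit example (color and size only). The data frame, its 0/1 encoding and the new fruit are assumptions for illustration.

```r
library(class)  # recommended package, shipped with standard R installations

# Made-up fruit data: red = 1 (red) / 0 (green), big = 1 (big) / 0 (small)
fruits <- data.frame(
  red  = c(1, 1, 0, 0, 0, 0, 1, 1),
  big  = c(1, 1, 1, 1, 0, 0, 0, 0),
  name = factor(c("apple", "apple", "banana", "banana",
                  "grapes", "grapes", "cherries", "cherries"))
)

# A new fruit that is red and small
new_fruit <- data.frame(red = 1, big = 0)

pred <- knn(train = fruits[, c("red", "big")],
            test  = new_fruit,
            cl    = fruits$name,
            k     = 1)
pred  # cherries: the only exact matches in the training data are cherries
```

knn() takes the training predictors, the test predictors and the training labels as separate arguments and returns a factor of predicted classes, one per test row.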



To optimize the results, we can use cross-validation (one of the
fundamental methods in machine learning for assessing methods and
picking parameters in a prediction task).
Using cross-validation, we can test the kNN algorithm with
different values of k.
The value of k that gives the highest cross-validated accuracy can be
considered the optimal choice.
To find the accuracy, you can compute the confusion matrix.
Often the most practical approach is to run through each candidate
value of k and compare the results.
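One way to run through several values of k is sketched below. It uses knn.cv() from the class package, which performs leave-one-out cross-validation; this is a simpler stand-in chosen for illustration, not necessarily the tool used in the course. The simulated data, the seed and the grid of k values are assumptions.

```r
library(class)

# Simulated two-class data (made up): two overlapping Gaussian clouds
set.seed(42)  # arbitrary seed, for reproducibility only
n  <- 100
x  <- rbind(matrix(rnorm(2 * n, mean = 0), ncol = 2),
            matrix(rnorm(2 * n, mean = 2), ncol = 2))
cl <- factor(rep(c("A", "B"), each = n))

# Candidate values of k (odd values avoid most ties with two classes)
ks <- seq(1, 21, by = 2)

# Leave-one-out accuracy for each k: every point is predicted
# from all the other points
acc <- sapply(ks, function(k) mean(knn.cv(train = x, cl = cl, k = k) == cl))

data.frame(k = ks, accuracy = acc)
best_k <- ks[which.max(acc)]
best_k
```

Plotting accuracy against k usually shows the pattern described earlier: very small k is noisy, while very large k blurs the class boundary.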


Example I

The dataset: PimaIndiansDiabetes2 from the mlbench package

This dataset is part of the data collected from one of the numerous
diabetes studies on the Pima Indians, a group of indigenous Americans
who have among the highest prevalence of Type II diabetes in the
world, probably due to a combination of genetic factors and their
relatively recent introduction to a heavily processed Western diet.
768 observations, 9 variables: skin fold thickness, BMI, and so on,
plus a binary variable indicating whether the patient had diabetes.
Purpose: to train a classifier to predict whether a patient has diabetes
or not.


Example II

many observations available; a good number of predictor variables
available; interesting problem; good mixture of both class outcomes
(35% diabetes-positive observations)
Severely imbalanced datasets can cause problems with some
classifiers and distort our accuracy estimates.
Steps of the kNN method:
split the data 80/20 randomly
visualize the effectiveness of kNN for different values of k using
cross-validation (knnEval() function from the chemometrics package)
perform kNN for a suitable value of k, using the knn() function from the
class package
compute the accuracy of the method for the chosen k and determine
the confusion matrix
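The split, fit and evaluate steps can be sketched as a small helper. Everything here is an assumption made for illustration: the helper name `knn_holdout`, the choice k = 11, the seed, the removal of rows with missing values (PimaIndiansDiabetes2 contains NAs, which knn() does not accept) and the standardisation of the predictors, which the slides do not mention. The cross-validation step with knnEval() is left out.

```r
library(class)

# Hypothetical helper: 80/20 hold-out evaluation of kNN on a data frame
# whose predictors are all numeric.
knn_holdout <- function(data, label_col, k, train_frac = 0.8, seed = 1) {
  data <- na.omit(data)            # drop incomplete rows (assumption)
  set.seed(seed)                   # arbitrary seed, for reproducibility
  idx <- sample(nrow(data), size = floor(train_frac * nrow(data)))
  x <- data[, setdiff(names(data), label_col), drop = FALSE]
  y <- data[[label_col]]
  # Standardise with training-set means and sds (added step, an assumption)
  train <- scale(x[idx, , drop = FALSE])
  test  <- scale(x[-idx, , drop = FALSE],
                 center = attr(train, "scaled:center"),
                 scale  = attr(train, "scaled:scale"))
  pred <- knn(train = train, test = test, cl = y[idx], k = k)
  confusion <- table(predicted = pred, actual = y[-idx])
  list(confusion = confusion,
       accuracy  = sum(diag(confusion)) / sum(confusion))
}

# Runs only if the mlbench package is installed
if (requireNamespace("mlbench", quietly = TRUE)) {
  data("PimaIndiansDiabetes2", package = "mlbench")
  res <- knn_holdout(PimaIndiansDiabetes2, "diabetes", k = 11)  # k is arbitrary
  print(res$confusion)
  print(res$accuracy)
}
```

The diagonal of the confusion matrix holds the correctly classified test observations, so accuracy is the diagonal sum divided by the total.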


Exercises I

1. Iris
Consider the "iris" data set in R. Determine which variable is the response.
Load the data and split it into 2 parts (80%, 20%) that will be the training
data and the test data. Train the kNN model for k = 1, 2, 3 and determine
which one is the best (i.e. most accurate) using the test data.
Use the functions: knn() in the package "class", CrossTable() in the package
"gmodels", confusionMatrix() in the package "caret".
2. Breast Cancer
To diagnose breast cancer, a doctor relies on experience, analyzing
details provided by a) the patient's past medical history and b) reports of all
the tests performed. At times, it becomes difficult to diagnose cancer even
for experienced doctors, since the information provided by the patient might
be unclear and insufficient. The breast cancer database was obtained from the
University of Wisconsin Hospitals, Madison, from Dr. William H. Wolberg.


Exercises II

It contains 699 samples with 10 attributes. The main objective is to predict
whether a sample is benign or malignant. Use the kNN algorithm to do that and explain
your choice of value for k.
The data can be downloaded at
