
An Introduction to Support Vector Machine: Supervised Learning Approaches

Tej Bahadur Shahi, CDCSIT, TU, Second Year, Thesis Semester, tejshahi198@gmail.com

Abstract
Support vector machines (SVMs) appeared in the early nineties as optimal margin classifiers in the context of Vapnik's statistical learning theory. Since then SVMs have been successfully applied to real-world data analysis problems, often providing improved results compared with other techniques. SVMs operate within the framework of regularization theory by minimizing an empirical risk in a well-posed and consistent way. A clear advantage of the support vector approach is that sparse solutions to classification and regression problems are usually obtained: only a few samples are involved in the determination of the classification or regression functions. This fact facilitates the application of SVMs to problems that involve a large amount of data, such as text processing and bioinformatics tasks. In this article, the basic concepts of SVMs are introduced to help the reader understand SVMs at an introductory level.

Keywords: Support Vector Machine, Kernel function, constrained optimization

1. Introduction
1.1 Machine Learning
Machine learning means gaining knowledge, understanding or skill by study, instruction or experience [1]. ML usually refers to changes in a system that performs tasks associated with artificial intelligence; such tasks involve recognition, diagnosis, planning, robot control, prediction, etc. There are two major types of learning. In supervised learning we know the value of a function f for the m samples in the training set, and we want to approximate that function in order to use it for unseen data.
Supervised learning is the machine learning task of inferring a function from supervised training data. The training data consist of a set of training examples. In supervised learning, each example is a pair consisting of an input object (typically a vector) and a desired output value (also called the supervisory signal). A supervised learning algorithm analyzes the training data and produces an inferred function, which is called a classifier if the output is discrete or a regression function if the output is continuous [2].

In this article we will focus on this type of learning.

Unsupervised learning refers to trying to find hidden patterns in unlabelled data. It is closely related to density estimation in statistics. Speed-up learning is a special case, that of changing an existing function into an equivalent one that is computationally more efficient. In supervised classification learning, the goal is to learn a function which assigns (discrete) labels to arbitrary objects, given a set of already assigned independent instances. This framework has a vast number of applications, ranging from part-of-speech tagging to optical character recognition [3].

1.2 Supervised Learning Process
The supervised learning process consists of two steps, learning and testing, as shown in the figure below.
Learning (training): In this step the learning algorithm is used to learn the parameters of a model using the training data. Examples of learning algorithms are the back-propagation algorithm for neural networks, and the decomposition method and sequential minimal optimization for SVMs.
Testing: Test the model using unseen test data to assess its accuracy; the learned model is used to classify the test data and the accuracy of the learning process is measured.

Figure 1: Supervised Learning Process
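The two-step process above can be illustrated with a short sketch. This is a minimal illustration only, assuming scikit-learn as the toolkit and a synthetic data set; neither choice is prescribed by this article.

# A minimal sketch of the learning (training) and testing steps,
# assuming scikit-learn and a synthetic data set (illustrative only).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Generate labelled examples (input vectors x, labels y).
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

# Split into training data (for learning) and unseen test data (for testing).
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Learning: fit the model parameters on the training data.
model = SVC(kernel="linear")
model.fit(X_train, y_train)

# Testing: classify the unseen test data and measure accuracy.
print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))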

2. Support Vector Machines (SVM)


Support vector machines were invented by V. Vapnik and his co-workers in the 1970s in Russia and became known to the West in 1992. SVMs are linear classifiers that find a hyperplane to separate two classes of data, positive and negative. The aim of support vector classification is to devise a computationally efficient way of learning good separating hyperplanes in a high-dimensional feature space, where by good hyperplanes we mean ones optimizing the generalization bounds, and by computationally efficient we mean algorithms able to deal with sample sizes of the order of 100,000 instances. Generalization theory gives clear guidance about how to control capacity, and hence prevent overfitting, by controlling the hyperplane margin measures, while optimization theory provides the mathematical techniques necessary to find hyperplanes optimizing these measures [4]. SVM not only has a rigorous theoretical foundation, but also performs classification more accurately than most other methods in applications, especially for high-dimensional data. It is perhaps the best classifier for text classification.

2.1 Hyperplane
In learning a classifier for binary classification we are usually given a set of training examples (x1, y1), ..., (xn, yn). The input features xi ∈ R^d are d-dimensional vectors describing the properties of the input example, and the labels yi ∈ {+1, -1} are the response variables/outputs we want to predict. In binary classification, a hyperplane that separates the two groups of objects is called a decision surface. This is the case of inductive learning, in which we construct a linear decision surface for a particular training data set and then use this decision surface to classify other points in the data universe.

Figure 2: Decision surface with an offset separating two classes [5].
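To make the idea of a decision surface concrete, the following is a minimal sketch that classifies points by which side of a fixed hyperplane they fall on. The weight vector, offset and points are made up for illustration, and the hyperplane form w · x - b = 0 is the one made precise later in this article.

import numpy as np

# Hypothetical hyperplane parameters: normal vector w and offset b (illustrative only).
w = np.array([2.0, -1.0])
b = 0.5

# Points to classify; the sign of w . x - b decides the class (+1 or -1).
points = np.array([[1.0, 0.0], [0.0, 2.0], [3.0, 1.0]])
labels = np.sign(points @ w - b)
print(labels)  # [ 1. -1.  1.]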


Many hyperplanes may exist that correctly separate the training data, and which one is best for the training as well as the test data is the most important question. SVM tries to answer this question with the help of the structural risk minimization principle and the maximum margin classifier.

2.2 Structural Risk Minimization
Structural risk minimization (SRM) is an inductive principle of use in machine learning. Commonly in machine learning, a generalized model must be selected from a finite data set, with the consequent problem of overfitting: the model becomes too strongly tailored to the particularities of the training set and generalizes poorly to new data. The SRM principle addresses this problem by balancing the model's complexity against its success at fitting the training data [2].

2.3 Maximum Margin Classifier (Linear SVM)
SVM looks for the separating hyperplane with the largest margin. Machine learning theory says this hyperplane minimizes the error bound.

Here the points lying on the supporting planes H1 and H2 are called support vectors.

Figure 3: Optimal separating plane with its two supporting planes

Definition 1: A hyperplane supports a class if it is parallel to a (linear) decision surface and all points of its respective class are either above or below. We call such a hyperplane a supporting hyperplane [5].
Definition 2: In a binary classification problem, the distance between the two supporting hyperplanes is called a margin [5].
Assume the data are linearly separable. Then the general equation of a plane in n dimensions is

w · x - b = 0

where x is an n-by-1 vector, w is the normal to the hyperplane and b is a (scalar) constant. Of all the points on the plane, one has the minimum distance dmin to the origin.

Given a set of training examples (x1, y1), ..., (xn, yn), the input vectors xi ∈ R^d are d-dimensional vectors describing the properties of the input example, and the labels are yi ∈ {+1, -1}. If the data are linearly separable then there exist a d-dimensional vector w and a scalar b such that

w · xi - b ≥ +1 for yi = +1
w · xi - b ≤ -1 for yi = -1

In compact form we may combine these two equations into

yi (w · xi - b) ≥ 1 for all i,

or equivalently yi (w · xi - b) - 1 ≥ 0. Here (w, b) define the hyperplane separating the two classes of data. The equation of the hyperplane is

w · x - b = 0

where w is normal to the plane and b is the offset of the plane from the origin. In order to make each decision surface (w, b) unique, we normalize the perpendicular distance from the origin to the separating hyperplane by dividing it by ||w||, giving the distance as b / ||w||.

As depicted in Figure 3, the perpendicular distance from the origin to hyperplane H1 (w · x - b = +1) is |b + 1| / ||w||, and the perpendicular distance from the origin to hyperplane H2 (w · x - b = -1) is |b - 1| / ||w||. The support vectors are defined as the training points lying on H1 and H2. Removing any points not on those two planes would not change the classification result, but removing the support vectors would. The margin, the distance between the two hyperplanes H1 and H2, is

2 / ||w||

The margin determines the capacity of the learning machine, which in turn determines the bound on the actual risk, the expected test error. The wider the margin, the smaller is h, the VC dimension of the classifier. Therefore our goal is to maximize the margin, or equivalently to minimize ||w||.

Therefore the optimization problem can be formulated as follows:

Minimize f(w) = ||w||^2 / 2
Subject to the constraints yi (w · xi - b) ≥ 1, i = 1, ..., n

This problem can be solved by using standard quadratic programming techniques.
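As an illustration of this optimization, the following is a minimal sketch that fits a linear maximum-margin classifier on a toy linearly separable data set and reads off w, b and the margin 2 / ||w||. It assumes scikit-learn as the quadratic-programming back end and uses a large C to approximate the hard-margin formulation above; the data are made up for illustration.

import numpy as np
from sklearn.svm import SVC

# Toy linearly separable data (illustrative only).
X = np.array([[2.0, 2.0], [2.5, 3.0], [3.0, 2.5],   # class +1
              [0.0, 0.0], [0.5, 1.0], [1.0, 0.5]])  # class -1
y = np.array([1, 1, 1, -1, -1, -1])

# A very large C approximates the hard-margin problem: minimize ||w||^2 / 2
# subject to y_i (w . x_i + b) >= 1.
clf = SVC(kernel="linear", C=1e6)
clf.fit(X, y)

w = clf.coef_[0]          # normal vector of the separating hyperplane
b = clf.intercept_[0]     # offset (note: scikit-learn uses the w . x + b = 0 convention)
margin = 2.0 / np.linalg.norm(w)

print("w =", w, "b =", b)
print("margin =", margin)
print("support vectors:", clf.support_vectors_)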

2.4 Nonlinear SVM


The SVM formulations require linear separation, but real-life data sets may need nonlinear separation. To deal with nonlinear separation, the same formulation and techniques as for the linear case are still used; we only transform the input data into another space (usually of a much higher dimension) so that a linear decision boundary can separate positive and negative examples in the transformed space. The transformed space is called the feature space; the original data space is called the input space. The basic idea is to map the data in the input space X to a feature space F via a nonlinear mapping φ : X → F, x ↦ φ(x). After the mapping, the original training data set {(x1, y1), (x2, y2), ..., (xr, yr)} becomes {(φ(x1), y1), (φ(x2), y2), ..., (φ(xr), yr)}, and linear separation is then performed in this feature space. The geometric interpretation is shown in the figure below [4].

Figure 4: A feature map can simplify the classification task

In this example, the transformed space is also 2-D, but usually the number of dimensions in the feature space is much higher than that in the input space. There is an optimal separating hyperplane in the higher dimension, which corresponds to a nonlinear separating surface in the input space.

The potential problem with this explicit data transformation followed by application of the linear SVM is that it may suffer from the curse of dimensionality. The number of dimensions in the feature space can be huge for some useful transformations, even with reasonable numbers of attributes in the input space, which makes it computationally infeasible to handle. Fortunately, explicit transformation is not needed: in SVM this is done through the use of kernel functions, denoted by K, with

K(x, z) = φ(x) · φ(z)
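The practical consequence is that a nonlinear SVM is trained by supplying a kernel rather than by materializing φ(x). The following is a minimal sketch of this, assuming scikit-learn and a synthetic, nonlinearly separable data set; both choices are illustrative and not part of the original article.

from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two concentric circles: not separable by any hyperplane in the input space.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A kernel (here a degree-2 polynomial) lets the SVM separate the classes
# linearly in an implicit feature space, without computing phi(x) explicitly.
clf = SVC(kernel="poly", degree=2)
clf.fit(X_train, y_train)
print("Test accuracy:", clf.score(X_test, y_test))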

For example, let us take the polynomial kernel K(x, z) = (x · z)^d. Let us compute the kernel with degree d = 2 in a 2-dimensional space: x = (x1, x2) and z = (z1, z2).

(x · z)^2 = (x1 z1 + x2 z2)^2
          = x1^2 z1^2 + 2 x1 z1 x2 z2 + x2^2 z2^2
          = (x1^2, x2^2, √2 x1 x2) · (z1^2, z2^2, √2 z1 z2)
          = φ(x) · φ(z),

where φ(x) = (x1^2, x2^2, √2 x1 x2). This shows that the kernel (x · z)^2 is a dot product in a transformed feature space.
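This identity is easy to check numerically. The sketch below compares the degree-2 polynomial kernel value with the dot product of the explicitly mapped vectors for a pair of points; the particular points are made up for illustration.

import numpy as np

def poly_kernel(x, z):
    # Degree-2 polynomial kernel K(x, z) = (x . z)^2.
    return float(np.dot(x, z)) ** 2

def phi(x):
    # Explicit feature map phi(x) = (x1^2, x2^2, sqrt(2) x1 x2).
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

print(poly_kernel(x, z))               # (1*3 + 2*(-1))^2 = 1.0
print(float(np.dot(phi(x), phi(z))))   # same value, computed in the feature space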
3. Some issues related to SVM

SVM works only in a real-valued space; for a categorical attribute, we need to convert its categorical values to numeric values. SVM performs only two-class classification; for multi-class problems, some strategies can be applied, e.g., one-against-rest and error-correcting output coding.
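Both points can be illustrated briefly. The sketch below one-hot encodes a categorical attribute and then trains a one-against-rest ensemble of binary SVMs for a three-class problem; it assumes scikit-learn and a small made-up data set, neither of which is specified by this article.

import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

# Made-up data: one numeric attribute and one categorical attribute, three classes.
numeric = np.array([[0.5], [1.2], [3.1], [2.8], [5.0], [4.7]])
categorical = np.array([["red"], ["red"], ["blue"], ["blue"], ["green"], ["green"]])
y = np.array([0, 0, 1, 1, 2, 2])

# Convert categorical values to numeric values via one-hot encoding.
encoded = OneHotEncoder().fit_transform(categorical).toarray()
X = np.hstack([numeric, encoded])

# One-against-rest: one binary SVM per class, each separating that class from the rest.
clf = OneVsRestClassifier(SVC(kernel="linear"))
clf.fit(X, y)
print(clf.predict(X))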

4. References and Bibliography


[1] Nils J. Nilsson, Introduction to Machine Learning, draft book, Robotics Laboratory, Stanford University, November 3, 1998.
[2] http://en.wikipedia.org/wiki/Supervised_learning
[3] Simon Lacoste-Julien, Combining SVM with Graphical Models for Supervised Classification: An Introduction to Max-Margin Markov Networks, Department of EECS, University of California, Berkeley, December 1, 2003.
[4] Nello Cristianini and John Shawe-Taylor, An Introduction to Support Vector Machines and Other Kernel-based Learning Methods, Cambridge University Press, 2002.
[5] Lutz Hamel, Knowledge Discovery with Support Vector Machines, John Wiley & Sons, Inc., 2009.