
US007562060B2

(12) United States Patent — Sindhwani et al.
(10) Patent No.: US 7,562,060 B2
(45) Date of Patent: Jul. 14, 2009

(54) LARGE SCALE SEMI-SUPERVISED LINEAR SUPPORT VECTOR MACHINES

(75) Inventors: Vikas Sindhwani, Chicago, IL (US); Sathiya Keerthi Selvaraj, South Pasadena, CA (US)

(73) Assignee: Yahoo! Inc., Sunnyvale, CA (US)

(*) Notice: Subject to any disclaimer, the term of this patent is extended or adjusted under 35 U.S.C. 154(b) by 320 days.

(21) Appl. No.: 11/394,744

(22) Filed: Mar. 31, 2006

(65) Prior Publication Data: US 2007/0239642 A1, Oct. 11, 2007

(51) Int. Cl.: G06N 3/08 (2006.01); G06E 1/00 (2006.01)

(52) U.S. Cl.: 706/25; 706/17; 706/20

(58) Field of Classification Search: 706/25, 17, 20; 708/441, 440. See application file for complete search history.

(56) References Cited

U.S. PATENT DOCUMENTS
2003/0093004 A1* 5/2003 Sosa et al. 600/544

OTHER PUBLICATIONS
Fung, Glenn, et al., "Semi-Supervised Support Vector Machines for Unlabeled Data Classification", 2001.*
Fung, Glenn, et al., "Finite Newton for Lagrangian Support Vector Machine Classification", 2003.*
Sindhwani, Vikas, et al., "A Co-Regularization Approach to Semi-Supervised Learning with Multiple Views", 2005.*
Sindhwani, Vikas, et al., "Linear Manifold Regularization for Large Scale Semi-supervised Learning", 2005.*
Keerthi, Sathiya, et al., "A Modified Finite Newton Method for Fast Solution of Large Scale Linear SVMs", 2005.*
Burges, Christopher J. C., "A Tutorial on Support Vector Machines for Pattern Recognition", Data Mining and Knowledge Discovery 2:121-167, 1998, pp. 1-43.
* cited by examiner

Primary Examiner: David R. Vincent
Assistant Examiner: Adrian L. Kennedy
(74) Attorney, Agent, or Firm: Seth H. Ostrow; Ostrow Kaufman & Frankl LLP

(57) ABSTRACT

A computerized system and method for large scale semi-supervised learning is provided. The training set comprises a mix of labeled and unlabeled examples. Linear classifiers based on support vector machine principles are built using these examples. One embodiment uses a fast design of a linear transductive support vector machine using multiple switching. In another embodiment, mean field annealing is used to form a very effective semi-supervised support vector machine. For both these embodiments the finite Newton method is used as the base method for achieving fast training.

20 Claims, 13 Drawing Sheets

[Representative drawing (front page): FIG. 3 flow diagram — Provide training set (300); Label sub-set (302); Find classifier to classify data using semi-supervised SVM (304); New data to be classified is input into system (308); Use the semi-supervised SVM to classify new data (310).]

[Drawing sheets 1-13 (Jul. 14, 2009), US 7,562,060 B2 — image content not reproducible from the extraction; recoverable captions only:
FIG. 1 (Sheet 1): Block diagram of search engine components, including crawler, database, indexer, search node cluster, and dispatcher.
FIG. 2 (Sheet 2): Example of a categorized news web page.
FIG. 3 (Sheet 3): Flow diagram of steps 300-310 (provide training set; label sub-set; find classifier using semi-supervised SVM; input new data; classify new data).
FIG. 4 (Sheet 4): Table listing the CGLS (conjugate gradient for least squares) iterations — problem statement, inputs X, Y, C, λ, β, o, ε, cgitermax, initialization, iteration loop, and outputs β, o, optimality.
FIG. 5 (Sheet 5): Table outlining the full exact line search routine over the sorted break points δ_i, with the running quantities L and R.
FIG. 6 (Sheet 6): Table with abridged pseudo-code for L2-SVM-MFN (outer iterations of CGLS calls and line search).
FIG. 7 (Sheet 7): Graph of the L2 loss function over unlabeled examples for the transductive SVM, plotted against the classifier output.
FIG. 8 (Sheet 8): Table outlining the transductive SVM algorithm with multiple label switching (Loop 1 over λ', Loop 2 over label switches, Re-training 1 and Re-training 2 calls to L2-SVM-MFN).
FIG. 9 (Sheet 9): Graph of the loss term over unlabeled examples for various temperatures in the mean field annealing formulation.
FIG. 10 (Sheet 10): Table of steps for optimizing the weight vector w with p fixed (CGLS call, optimality check, line search).
FIG. 11 (Sheet 11): Graph of the root function B(v) used when optimizing p.
FIG. 12 (Sheet 12): Table of steps for optimizing the belief probabilities p with w fixed, using hybrid Newton-Raphson/bisection root finding.
FIG. 13 (Sheet 13): Table of steps for the mean field annealing method (temperature schedule, alternating optimization, KL-divergence and entropy stopping criteria, returning the minimizer of the transductive cost).]
LARGE SCALE SEMI-SUPERVISED LINEAR SUPPORT VECTOR MACHINES

FIELD OF THE INVENTION

The invention is a method for large scale semi-supervised linear support vector machines. Specifically, the invention provides a system and method that uses a modified finite Newton method for fast training of linear SVMs using a mix of labeled and unlabeled examples.

BACKGROUND OF THE INVENTION

Many problems in information processing involve the selection or classification of items in a large data set. For example, web-based companies such as Yahoo! have to frequently classify web pages as belonging to one group or the other, e.g., as commercial or non-commercial.

Currently, large amounts of data can be cheaply and automatically collected. However, labeling of the data typically involves expensive and fallible human participation. For example, a single web-crawl by a search engine, such as Yahoo! or Google, indexes billions of web pages. Only a very small fraction of these web pages can be hand-labeled by human editorial teams and assembled into topic directories. The remaining web pages form a massive collection of unlabeled documents.

The modified finite Newton algorithm, described in co-pending application Ser. No. 10/949,821, entitled "A Method And Apparatus For Efficient Training Of Support Vector Machines", filed Sep. 24, 2004, the entirety of which is incorporated herein by reference, describes a method for training linear support vector machines (SVMs) on sparse datasets with a potentially very large number of examples and features. Such datasets are generated frequently in domains like document classification. However, the system and method described in that application incorporates only labeled data in a finite Newton algorithm (abbreviated L2-SVM-MFN). Large scale learning is often realistic only in a semi-supervised setting where a small set of labeled examples is available together with a large collection of unlabeled data.

A system and method that provides for extension of linear SVMs to semi-supervised classification problems involving large, and possibly very high-dimensional but sparse, partially labeled datasets is desirable. In many information retrieval and data mining applications, linear methods are strongly preferred because of their ease of implementation, interpretability and empirical performance. The preferred embodiments of the system and method described herein clearly address this and other needs.

BRIEF SUMMARY OF THE INVENTION

In one preferred embodiment, a transductive support vector machine (TSVM) system and method for linear semi-supervised classification on large, sparse datasets is provided. The system and method exploits data sparsity and linearity to provide superior scalability. According to another preferred embodiment, a multiple switching heuristic further improves TSVM training significantly for large scale applications.

In another preferred embodiment, a system and method for semi-supervised SVMs is disclosed that uses global optimization using mean field methods. The method generates a family of objective functions with non-convexity that is controlled by an annealing parameter. The global minimizer is tracked with respect to this parameter. This method alleviates the problem of local minima in the TSVM optimization procedure, resulting in significantly better operation in many applications, while being computationally judicious.

According to another preferred embodiment, a computerized system and method for semi-supervised learning is provided. A training set of example elements is received. Elements of the training set that are determined to fall within a classification group are labeled. The training set thereby has labeled elements and unlabeled elements. The system uses selected labeled elements and unlabeled elements as examples in a large scale semi-supervised support vector machine to select a linear classifier. The system receives unclassified data items. Using the selected linear classifier, the received unclassified data elements are classified.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram illustrating components of a search engine in which one embodiment operates;
FIG. 2 is an example of a news web page that can be categorized using one embodiment;
FIG. 3 is a flow diagram illustrating steps performed by the system according to one embodiment associated with search engine relevance ranking;
FIG. 4 is a table listing conjugate gradient scheme iterations according to one embodiment;
FIG. 5 is a table outlining a full line search routine method performed by one embodiment;
FIG. 6 is a table having an abridged pseudo-code listing for a method performed by one embodiment;
FIG. 7 is a graph illustrating an L2 loss function over unlabelled examples for transductive SVM;
FIG. 8 is a table outlining a transductive SVM algorithm performed by one embodiment;
FIG. 9 is a graph illustrating an L2 loss function over unlabelled examples for transductive SVM;
FIG. 10 is a table illustrating steps for optimizing a weight vector w according to one embodiment;
FIG. 11 is a graph of a root function used by an embodiment;
FIG. 12 is a table illustrating steps for optimizing a belief probability p according to one embodiment; and
FIG. 13 is a table illustrating steps for a method of mean field annealing according to one embodiment.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

A preferred embodiment of a large scale semi-supervised linear support vector machine, constructed in accordance with the claimed invention, is directed towards a modified finite Newton method for fast training of linear SVMs. According to one embodiment, the system uses two efficient and scalable methods for training semi-supervised SVMs. The first method provides a fast implementation of a variant of a linear transductive SVM (TSVM). The second method is based on a mean field annealing (MFA) approach, designed to alleviate the problem of local minima in the TSVM optimization procedure.

In one embodiment, as an example, and not by way of limitation, an improvement in Internet search engine labeling of web pages is provided. The World Wide Web is a distributed database comprising billions of data records accessible through the Internet. Search engines are commonly used to search the information available on computer networks, such as the World Wide Web, to enable users to locate data records of interest. A search engine system 100 is shown in FIG. 1. Web pages, hypertext documents, and other data records from
a source 101, accessible via the Internet or other network, are collected by a crawler 102. The crawler 102 collects data records from the source 101. For example, in one embodiment, the crawler 102 follows hyperlinks in a collected hypertext document to collect other data records. The data records retrieved by crawler 102 are stored in a database 108. Thereafter, these data records are indexed by an indexer 104. Indexer 104 builds a searchable index of the documents in database 108. Common prior art methods for indexing may include inverted files, vector spaces, suffix structures, and hybrids thereof. For example, each web page may be broken down into words and respective locations of each word on the page. The pages are then indexed by the words and their respective locations. A primary index of the whole database 108 is then broken down into a plurality of sub-indices and each sub-index is sent to a search node in a search node cluster 106.

To use search engine 100, a user 112 typically enters one or more search terms or keywords, which are sent to a dispatcher 110. Dispatcher 110 compiles a list of search nodes in cluster 106 to execute the query and forwards the query to those selected search nodes. The search nodes in search node cluster 106 search respective parts of the primary index produced by indexer 104 and return sorted search results, along with a document identifier and a score, to dispatcher 110. Dispatcher 110 merges the received results to produce a final result set displayed to user 112, sorted by relevance scores.

As a part of the indexing process, or for other reasons, most search engine companies have a frequent need to classify web pages as belonging to one "group" or another. For example, a search engine company may find it useful to determine if a web page is of a commercial nature (selling products or services), or not. As another example, it may be helpful to determine if a web page contains a news article about finance or another subject, or whether a web page is spam related or not. Such web page classification problems are binary classification problems (x versus not x). To develop a classifier that can do such distinguishing usually takes a large sample of web pages labeled by editorial teams for use as a training set for designing a classifier.

Referring to FIG. 2, there is shown an example of a web page that has been categorized. In this example, the web page is categorized as a "Business" related web page, as indicated by the topic indicator 225 at the top of the page. Other category indicators 225 are shown. Thus, if a user had searched for business categorized web pages, then the web page of FIG. 2 would be listed, having been categorized as such.

With reference to FIG. 3, a flow diagram illustrates the steps performed according to one embodiment. In step 300, a training set of documents is provided. In one embodiment, this training set includes randomly selected unlabelled web pages. In step 302, a sub-set of the training set is labeled as meeting the criteria for inclusion in or exclusion from a group (classification group). For example, the group may include sports web pages in the case of the Internet web-page example discussed above. This step may involve, by way of example, and not by way of limitation, human review of sample web pages for labeling of web pages that are determined to fall within the group (for example, sports web pages) and also possibly labeling some pages as outside the group (for example, non-sports pages), in the labeled training set. Unlike previous systems and methods, it is not strictly necessary to label examples of data that do not fall within the criteria (for example, non-sports pages). In one embodiment, non-labeled examples (e.g. unlabelled sports and non-sports pages) can be used by the system, but it is not necessary. Once the sub-set is labeled, the whole training set of labeled and unlabelled data is available to the system.

In step 304, a modified finite Newton method for fast training of large scale semi-supervised linear SVMs is used to find a classifier. Even though the labeled training set can be any set of examples from within and outside the group, it is preferable for the method to retain knowledge of the fraction of the group in the entire distribution. For example, in the case of the group consisting of sports web pages, if this group forms 1% of the entire collection of web pages, then this number is preferably conveyed as an input to the method. In the description of the method below, this number, written as a fraction (for example, 0.01), is called r.

In step 308, new data, such as a web page or a document to be classified, is input into the system. In step 310, the semi-supervised SVM obtained in step 304 is used to classify the new data.

A modified version of the L2-SVM-MFN algorithm is now discussed to suit the description of the two semi-supervised extensions discussed thereafter. The first extension provides an efficient implementation of a variant of linear transductive SVM (TSVM). The second extension is based on a mean field annealing (MFA) algorithm designed to track the global optimum of the TSVM objective function.

Modified Finite Newton Linear L2-SVM

The modified finite Newton L2-SVM method (L2-SVM-MFN) for linear SVMs is ideally suited to sparse datasets with a large number of examples and possibly a large number of features. In a typical application, such as document classification, many training documents are collected and processed into a format that is convenient for mathematical manipulations. For example, each document may be represented as a collection of d features associated with a vocabulary of d words. These features may simply indicate the presence or absence of a word (binary features), or measure the frequency of a word suitably normalized by its importance (term frequency inverse document frequency (TFIDF) features). Even though the vocabulary might be large, only a very small number of words appear in any document relative to the vocabulary size. Thus, each document is sparsely represented as a bag of words. A label is then manually assigned to each document identifying a particular category to which it belongs (e.g. "commercial" or not). The task of a classification algorithm (e.g. SVM) is to produce a classifier that can reliably identify the category of new documents based on training documents.

Given a binary classification problem with l labeled examples {x_i, y_i}_{i=1}^{l}, where the input patterns x_i \in \mathbb{R}^d (e.g. a document) and the labels y_i \in \{+1, -1\}, L2-SVM-MFN provides an efficient primal solution to the following SVM optimization problem:

    w^* = \arg\min_{w \in \mathbb{R}^d} \; \frac{1}{2} \sum_{i=1}^{l} \max\left(0,\, 1 - y_i (w^\top x_i)\right)^2 + \frac{\lambda}{2} \|w\|^2    (1)

Here, \lambda is a real-valued regularization parameter and sign(w^{*\top} x) is the final classifier.

This objective function differs from the standard SVM problem in some respects. First, instead of using the hinge loss as the data fitting term, the square of the hinge loss (or the so-called quadratic soft margin loss function) is used. This makes the objective function continuously differentiable,
allowing easier applicability of gradient techniques. Secondly, the bias term ("b") is also regularized. In the problem formulation of equation 1, it is assumed that an additional component in the weight vector and a constant feature in the example vectors have been added to indirectly incorporate the bias. This formulation combines the simplicity of a least squares aspect with algorithmic advantages associated with SVMs. It should also be noted that the methods discussed herein can be applied to other loss functions.

In one embodiment, a version of L2-SVM-MFN is used in which a weighted quadratic soft margin loss function is employed:

    w^* = \arg\min_{w \in \mathbb{R}^d} f(w) = \arg\min_{w \in \mathbb{R}^d} \; \frac{1}{2} \sum_{i \in j(w)} c_i\, d_i(w)^2 + \frac{\lambda}{2} \|w\|^2    (2)

In equation 2, equation 1 is re-written in terms of a partial summation of d_i(w) = w^\top x_i - y_i over an index set j(w) = \{i : y_i (w^\top x_i) < 1\}. Additionally, the loss associated with the i-th example has a cost c_i. f(w) refers to the objective function being minimized, evaluated at a candidate solution w. It is noted that if the index set j(w) were independent of w and ran over all data points, this would simply be the objective function for weighted linear regularized least squares (RLS).

In one embodiment, f is a strictly convex, piecewise quadratic, continuously differentiable function having a unique minimizer. The gradient of f at w is given by:

    \nabla f(w) = \lambda w + X_{j(w)}^\top C_{j(w)} \left[ X_{j(w)} w - Y_{j(w)} \right]

where X_{j(w)} is a matrix whose rows are the feature vectors of training points corresponding to the index set j(w), Y_{j(w)} is a column vector containing the labels for these points, and C_{j(w)} is a diagonal matrix that contains the costs c_i for these points along its diagonal.
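For concreteness, the weighted objective of equation 2, its active index set j(w), and the gradient above can be evaluated as in the following minimal NumPy sketch. The function name, argument names, and the use of dense arrays are illustrative assumptions only; the patented method is designed around sparse matrices.

    import numpy as np

    def objective_and_gradient(w, X, Y, c, lam):
        """Evaluate f(w) of equation 2 and its gradient over the active set j(w)."""
        o = X @ w                                  # outputs w^T x_i
        active = Y * o < 1                         # index set j(w) = {i : y_i (w^T x_i) < 1}
        d = o[active] - Y[active]                  # d_i(w) = w^T x_i - y_i on the active set
        f = 0.5 * np.sum(c[active] * d**2) + 0.5 * lam * (w @ w)
        grad = lam * w + X[active].T @ (c[active] * d)
        return f, grad, active

Only the rows in the active set contribute, which is what lets the later linear systems be solved over a restricted subset of the data.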
In one embodiment, L2-SVM-MFN is a primal algorithm that uses Newton's method for unconstrained minimization of a convex function. The classical Newton's method is based on a second order approximation of the objective function, and involves updates of the following kind:

    w^{(k+1)} = w^{(k)} + \delta^{(k)} n^{(k)}    (3)

where the step size \delta^{(k)} \in \mathbb{R}, and the Newton direction n^{(k)} \in \mathbb{R}^d is given by:

    n^{(k)} = -\left[ \nabla^2 f(w^{(k)}) \right]^{-1} \nabla f(w^{(k)})

\nabla f(w^{(k)}) is the gradient vector and \nabla^2 f(w^{(k)}) is the Hessian matrix of f at w^{(k)}. However, the Hessian does not exist everywhere, since f is not twice differentiable at those weight vectors w where w^\top x_i = y_i for some index i. For this reason, a finite Newton method works around this issue through a generalized definition of the Hessian matrix. The modified finite Newton procedure proceeds as follows. The step \bar{w}^{(k)} = w^{(k)} + n^{(k)} in the Newton direction can be seen to be given by solving the following linear system, associated with a weighted linear regularized least squares problem over the data subset defined by the indices j(w^{(k)}):

    \left[ \lambda I + X_{j(w^{(k)})}^\top C_{j(w^{(k)})} X_{j(w^{(k)})} \right] \bar{w}^{(k)} = X_{j(w^{(k)})}^\top C_{j(w^{(k)})} Y_{j(w^{(k)})}    (4)

where I is the identity matrix. Once \bar{w}^{(k)} is obtained, w^{(k+1)} is obtained from equation 3 by setting w^{(k+1)} = w^{(k)} + \delta^{(k)} (\bar{w}^{(k)} - w^{(k)}) after performing an exact line search for \delta^{(k)}, i.e. by exactly solving the one-dimensional minimization problem:

    \delta^{(k)} = \arg\min_{\delta \ge 0} f\left( w^{(k)} + \delta (\bar{w}^{(k)} - w^{(k)}) \right)

The modified finite Newton procedure has the property of finite convergence to the optimal solution. The key features that bring scalability and numerical robustness to L2-SVM-MFN are: (a) solving the regularized least squares system of equation 4 by a numerically well-behaved conjugate gradient scheme, referred to as the conjugate gradient for least squares method (CGLS), which is designed for large, sparse data matrices X; and (b) due to the one-sided nature of margin loss functions, these systems are solved over restricted index sets j(w), which can be much smaller than the whole dataset. This also allows additional heuristics to be developed, such as terminating CGLS early when working with a crude starting guess such as 0, and allowing the line search step to yield a point where the index set j(w) is small. Subsequent optimization steps then work on smaller subsets of the data.

CGLS

The CGLS procedure solves large, sparse, weighted regularized least squares problems of the following form:

    \left[ \lambda I + X^\top C X \right] \beta = X^\top C Y    (5)

The key computational issue in equation 5 is to avoid the construction of the large and dense matrix X^\top C X, and to work only with the sparse matrix X and the diagonal cost matrix (stored as a vector) C.

Starting with a guess solution \beta^{(0)}, conjugate gradient performs iterations of the form:

    \beta^{(j+1)} = \beta^{(j)} + \gamma^{(j)} p^{(j)}

where p^{(j)} is a search direction and \gamma^{(j)} \in \mathbb{R} gives the step in that direction. The residual vector (the difference vector between the LHS and RHS of equation 5 for a candidate \beta, which is also the gradient of the associated quadratic form evaluated at \beta) is therefore updated as:

    r^{(j+1)} = X^\top z^{(j+1)} - \lambda \beta^{(j+1)}

The following intermediate vectors are introduced:

    z^{(j+1)} = C\left( Y - X \beta^{(j+1)} \right) = C\left( Y - X \beta^{(j)} \right) - \gamma^{(j)} C q^{(j)} = z^{(j)} - \gamma^{(j)} C q^{(j)}, \quad \text{where } q^{(j)} = X p^{(j)}

The optimal choice of \gamma^{(j)} is given by:

    \gamma^{(j)} = \frac{\|r^{(j)}\|^2}{\lambda \|p^{(j)}\|^2 + q^{(j)\top} C q^{(j)}}

Finally, the search directions are updated as:

    p^{(j+1)} = r^{(j+1)} + \omega^{(j)} p^{(j)}
where

    \omega^{(j)} = \frac{\|r^{(j+1)}\|^2}{\|r^{(j)}\|^2}

The CGLS iterations are terminated when the norm of the gradient r^{(j+1)} becomes small enough relative to the norm of the iterate z^{(j+1)}, or if the number of iterations exceeds a certain maximum allowable number.

The CGLS iterations are listed in the table of FIG. 4. The data matrix X is only involved in the computations through matrix-vector multiplication for computing the iterates q^{(j)} and r^{(j)}. This forms the dominant expense in each iteration (the product with C simply scales each element of a vector). If there are n_0 non-zero elements in the data matrix, this has O(n_0) cost. As a subroutine of L2-SVM-MFN, CGLS is typically called on a small subset of the full data set. The total cost of CGLS is O(t_{cgls} n_0), where t_{cgls} is the number of iterations, which depends on the practical rank of X and is typically found to be very small relative to the dimensions of X (number of examples and features). The memory requirements are also minimal: only five vectors need to be maintained, including the outputs over the currently active set of data points.

Finally, another feature of CGLS is worth noting. Suppose the solution \beta of a regularized least squares problem is available, i.e. the linear system in equation 5 has been solved using CGLS. If there is a need to solve a perturbed linear system, it is greatly advantageous in many settings to start the CG iterations for the new system with \beta as the initial guess. This is often called seeding. If the starting residual is small, CGLS can converge much faster than with a guess of the 0 vector. The utility of this feature depends on the nature and degree of the perturbation. In L2-SVM-MFN, the candidate solution w^{(k)} obtained after the line search in iteration k is seeded for the CGLS computation of \bar{w}^{(k)}. Also, in tuning \lambda over a range of values, it is computationally valuable to seed the solution for a particular \lambda onto the next value. For the transductive SVM implementation with L2-SVM-MFN, solutions are seeded across linear systems with slightly perturbed label vectors and data matrices.
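The CGLS recursion above can be stated compactly in code. The following is a minimal NumPy sketch of the weighted regularized least squares solver of equation 5, written with a dense matrix for readability (the patented method assumes sparse matrices); the function name cgls and its argument names are illustrative, not part of the patent.

    import numpy as np

    def cgls(X, Y, c, lam, beta0, eps=1e-6, max_iter=100):
        """Approximately solve (lam*I + X.T @ diag(c) @ X) beta = X.T @ diag(c) @ Y
        without forming X.T @ diag(c) @ X explicitly."""
        beta = beta0.copy()
        z = c * (Y - X @ beta)                 # z = C (Y - X beta)
        r = X.T @ z - lam * beta               # residual of the normal equations
        p = r.copy()
        omega = r @ r
        for _ in range(max_iter):
            q = X @ p
            gamma = omega / (lam * (p @ p) + q @ (c * q))
            beta = beta + gamma * p
            z = z - gamma * (c * q)
            r = X.T @ z - lam * beta
            omega_new = r @ r
            if omega_new < eps**2 * (z @ z):   # gradient small relative to the iterate z
                return beta, True
            p = r + (omega_new / omega) * p
            omega = omega_new
        return beta, False

With X of shape (n, d), Y of length n, and c the per-example costs, the routine returns an approximate solution and an optimality flag, mirroring the outputs listed in FIG. 4.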
Line Search

Given the vectors w^{(k)}, \bar{w}^{(k)} in some iteration of L2-SVM-MFN, the line search step includes solving:

    \delta^{(k)} = \arg\min_{\delta \ge 0} \phi(\delta) = f(w_\delta)

where w_\delta = w^{(k)} + \delta (\bar{w}^{(k)} - w^{(k)}).

The one-dimensional function \phi(\delta) is the restriction of the objective function f to the ray from w^{(k)} onto \bar{w}^{(k)}. Hence, like f, \phi(\delta) is also a continuously differentiable, strictly convex, piecewise quadratic function with a unique minimizer \delta^* given by \phi'(\delta^*) = 0. Thus, one needs to find the root of the piecewise linear function

    \phi'(\delta) = \lambda w_\delta^\top (\bar{w}^{(k)} - w^{(k)}) + \sum_{i \in j(w_\delta)} c_i\, d_i(w_\delta)\, (\bar{o}_i - o_i)    (6)

where o_i = w^{(k)\top} x_i and \bar{o}_i = \bar{w}^{(k)\top} x_i. The linear pieces of \phi' are defined over those intervals where j(w_\delta) remains constant. Thus, the break points occur at a certain set of values \delta_i where w_\delta^\top x_i = y_i, i.e.

    \delta_i = \frac{y_i - o_i}{\bar{o}_i - o_i}

for some data point indexed by i. Among these values, only those indices i with \delta_i \ge 0 are considered, i.e. either i \in j(w^{(k)}) (then y_i o_i < 1), so that y_i (\bar{o}_i - o_i) > 0, or i \notin j(w^{(k)}) (then y_i o_i > 1), so that y_i (\bar{o}_i - o_i) < 0. When \delta is increased past a \delta_i, in the former case the index i leaves j(w_\delta) and in the latter case it enters j(w_\delta). Reordering the indices so that the \delta_i are sorted in non-decreasing order as \delta_{i_1}, \delta_{i_2}, \ldots, the root is then easily checked in each interval (\delta_{i_k}, \delta_{i_{k+1}}), k = 1, 2, \ldots, by keeping track of the slope of the linear piece in that interval. The slope is constant for each interval and non-decreasing as the search progresses through these ordered intervals. The interval in which the slope becomes non-negative for the first time brackets the root. Defining the extension of the linear piece in the interval (\delta_{i_k}, \delta_{i_{k+1}}) as

    \phi_k(\delta) = \lambda w_\delta^\top (\bar{w}^{(k)} - w^{(k)}) + \sum_{i \in j(w_{\delta_{i_k}})} c_i\, d_i(w_\delta)\, (\bar{o}_i - o_i),

the slope and the root computations are conveniently done by keeping track of

    L = \phi_k(0) = \lambda w^{(k)\top} (\bar{w}^{(k)} - w^{(k)}) + \sum_{i \in j(w_{\delta_{i_k}})} c_i\, (o_i - y_i)\, (\bar{o}_i - o_i)  and
    R = \phi_k(1) = \lambda \bar{w}^{(k)\top} (\bar{w}^{(k)} - w^{(k)}) + \sum_{i \in j(w_{\delta_{i_k}})} c_i\, (\bar{o}_i - y_i)\, (\bar{o}_i - o_i).

The full line search routine is outlined in the table of FIG. 5. The table of FIG. 6 provides an abridged pseudo-code for L2-SVM-MFN. Its computational complexity therefore is O(t_{mfn} \bar{t}_{cgls} n_0), where t_{mfn} is the number of outer iterations of CGLS calls and line search, and \bar{t}_{cgls} is the average number of CGLS iterations. These depend on the data set and the tolerance desired in the stopping criterion, but are typically very small. Therefore the complexity is found to be linear in the number of entries in the data matrix.
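As an illustration of the break-point search just described, the following is a minimal NumPy sketch of the exact line search for the weighted squared-hinge objective. It evaluates \phi'(\delta) directly at the sorted break points rather than maintaining the running L and R quantities of FIG. 5, so it is a simplified (and slower) stand-in for the patented routine; all names are illustrative and \bar{w}^{(k)} is assumed to be a descent direction, so \phi'(0) < 0.

    import numpy as np

    def exact_line_search(w, w_bar, X, Y, c, lam):
        """Minimize f(w + delta*(w_bar - w)) over delta >= 0 for the objective of eq. 2."""
        o, o_bar = X @ w, X @ w_bar
        d = o_bar - o
        u = w_bar - w
        def phi_prime(delta):
            out = o + delta * d
            active = Y * out < 1                      # index set j(w_delta)
            return lam * ((w + delta * u) @ u) + np.sum(c[active] * (out[active] - Y[active]) * d[active])
        # break points where an example enters or leaves the active set
        with np.errstate(divide='ignore', invalid='ignore'):
            deltas = (Y - o) / d
        deltas = np.sort(deltas[np.isfinite(deltas) & (deltas > 0)])
        lo, hi = 0.0, None
        for b in deltas:
            if phi_prime(b) >= 0:                     # slope turned non-negative: root bracketed
                hi = b
                break
            lo = b
        if hi is None:                                # still negative past the last break point
            hi = lo + 1.0
            while phi_prime(hi) < 0:
                hi *= 2.0
        plo, phi_hi = phi_prime(lo), phi_prime(hi)    # phi' is linear on [lo, hi]
        return lo if phi_hi == plo else lo - plo * (hi - lo) / (phi_hi - plo)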
Semi-Supervised Linear SVMs

It is now assumed that there are l labeled examples {x_i, y_i}_{i=1}^{l} and u unlabeled examples {x'_j}_{j=1}^{u}, with x_i, x'_j \in \mathbb{R}^d and y_i \in \{-1, +1\}. The goal is to construct a linear classifier sign(w^\top x) that utilizes unlabeled data, typically in situations where l << u. Semi-supervised algorithms provide L2-SVM-MFN the capability of dealing with unlabeled data.

Transductive SVM

Transductive SVM (TSVM) appends an additional term to the SVM objective function whose role is to drive the classification hyperplane towards low data density regions. The following optimization problem is set up for standard TSVM:
    w^* = \arg\min_{w \in \mathbb{R}^d,\; \{y'_j \in \{-1,+1\}\}} \; \frac{\lambda}{2} \|w\|^2 + \frac{1}{2l} \sum_{i=1}^{l} \max\left(0,\, 1 - y_i (w^\top x_i)\right) + \frac{\lambda'}{2u} \sum_{j=1}^{u} \max\left(0,\, 1 - y'_j (w^\top x'_j)\right)

    subject to: \; \frac{1}{u} \sum_{j=1}^{u} \max\left(0,\, \mathrm{sign}(w^\top x'_j)\right) = r

The labels on the unlabeled data, y'_1 \ldots y'_u, are {+1, -1}-valued variables in the optimization problem. In other words, TSVM seeks a hyperplane w and a labeling of the unlabeled examples, so that the SVM objective function is minimized, subject to the constraint that a fraction r of the unlabeled data be classified positive. SVM margin maximization in the presence of unlabeled examples can be interpreted as an implementation of the cluster assumption. In the optimization problem above, \lambda' is a user-provided parameter that provides control over the influence of unlabeled data. If there is enough labeled data, \lambda, \lambda' and r can be tuned by cross-validation. An initial estimate of r can be made from the fraction of labeled examples that belong to the positive class, and subsequent fine tuning can be done based on performance on a validation set.

In one method, this optimization is implemented by first using an inductive SVM to label the unlabeled data and then iteratively switching labels and retraining SVMs to improve the objective function. The TSVM algorithm is a wrapper around an SVM training procedure. In one existing software-implemented system, the training of SVMs in the inner loops of TSVM uses dual decomposition techniques. In sparse, linear settings significant speed improvements can be obtained with L2-SVM-MFN over this software system. Thus, by implementing TSVM with L2-SVM-MFN, improvements are made for semi-supervised learning on large, sparse datasets. The L2-SVM-MFN retraining steps in the inner loop of TSVM are typically executed extremely fast by using seeding techniques. Additionally, in one embodiment of TSVM, more than one pair of labels may be switched in each iteration.

To develop the TSVM implementation with L2-SVM-MFN, the objective function above is considered in the form of equation 7 below, with the L2 loss function:

    w^* = \arg\min_{w \in \mathbb{R}^d,\; \{y'_j \in \{-1,+1\}\}} \; \frac{\lambda}{2} \|w\|^2 + \frac{1}{2l} \sum_{i=1}^{l} \max\left(0,\, 1 - y_i (w^\top x_i)\right)^2 + \frac{\lambda'}{2u} \sum_{j=1}^{u} \max\left(0,\, 1 - y'_j (w^\top x'_j)\right)^2    (7)

    subject to: \; \frac{1}{u} \sum_{j=1}^{u} \max\left(0,\, \mathrm{sign}(w^\top x'_j)\right) = r

Note that this objective function can also be equivalently written in terms of the following loss over each unlabeled example x:

    \min_{y \in \{-1,+1\}} \max\left(0,\, 1 - y (w^\top x)\right)^2 = \max\left(0,\, 1 - |w^\top x|\right)^2

A value of the label variable y is selected that minimizes the loss on the unlabeled example x, and the loss is rewritten in terms of the absolute value of the output of the classifier on x. This loss function is shown in FIG. 7. It should be noted that the L1 and L2 loss terms over unlabeled examples are very similar on the interval [-1, +1]. The non-convexity of this loss function implies that the TSVM training procedure is susceptible to local optima issues. Next, a mean field annealing procedure is outlined that can overcome this problem.

The TSVM algorithm with L2-SVM-MFN is outlined in the table of FIG. 8. A classifier is obtained by first running L2-SVM-MFN on just the labeled examples. Temporary labels are assigned to the unlabeled data by thresholding the soft outputs of this classifier so that the fraction of the total number of unlabeled examples that are temporarily labeled positive equals the parameter r.

Then, starting from a small value of \lambda', the unlabeled data is gradually brought in by increasing \lambda' by a factor of R, where R is set to 2, in the outer loop. This gradual increase of the influence of the unlabeled data is a way to protect TSVM from being immediately trapped in a local minimum. An inner loop identifies up to S pairs of unlabeled examples with positive and negative temporary labels such that switching these labels would decrease the objective function. L2-SVM-MFN is then retrained with the switched labels.

Transductive L2-SVM-MFN with multiple-pair switching converges in a finite number of steps. Previous transductive support vector machine implementations used single switching (S=1) of labels. However, in the implementation of the present embodiment, larger values of S (S>1, i.e., multiple switching) are used, which leads to a significant improvement in efficiency without any loss in performance.
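To make the switching loop of FIG. 8 concrete, the following is a minimal NumPy sketch. The trainer is passed in as a callable train(X, Y, c, w0) -> w (for example, an L2-SVM-MFN built from the CGLS and line search sketches above); the greedy pairing heuristic, thresholding, and constants used here are deliberately simplified illustrations of the multiple-switching idea, not the exact patented routine.

    import numpy as np

    def tsvm_multiswitch(Xl, Yl, Xu, lam_u_max, r, train, S=50, lam_u0=1e-5):
        """Transductive L2-SVM with multiple label switching (sketch).
        r is the fraction of unlabeled examples to be labeled positive."""
        l, u = len(Yl), len(Xu)
        w = train(Xl, Yl, np.full(l, 1.0 / l), None)          # classifier from labeled data only
        o = Xu @ w
        Yu = np.where(o >= np.quantile(o, 1.0 - r), 1.0, -1.0)  # temporary labels, fraction r positive
        X = np.vstack([Xl, Xu])
        lam_u = lam_u0
        while lam_u < lam_u_max:                              # Loop 1: anneal the unlabeled weight
            c = np.concatenate([np.full(l, 1.0 / l), np.full(u, lam_u / u)])
            while True:                                       # Loop 2: switch label pairs
                w = train(X, np.concatenate([Yl, Yu]), c, w)  # retrain, seeded with previous w
                ou = Xu @ w
                pos = np.where((Yu > 0) & (ou < 1))[0]        # positives inside the margin
                neg = np.where((Yu < 0) & (ou > -1))[0]       # negatives inside the margin
                pos = pos[np.argsort(ou[pos])]                # worst positives first
                neg = neg[np.argsort(-ou[neg])]               # worst negatives first
                k = min(len(pos), len(neg), S)
                good = ou[pos[:k]] < ou[neg[:k]]              # such a switch strictly lowers the cost
                if not np.any(good):
                    break
                Yu[pos[:k][good]] = -1.0
                Yu[neg[:k][good]] = 1.0
            lam_u *= 2.0                                      # R = 2 in the outer loop
        return w, Yu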
L2-SVM-MFN on large sparse datasets, combined with the efficiency gained from seeding w in the re-training steps (after switching labels or after increasing \lambda'), is very effective. Consider an iteration in Loop 2 of TSVM where a new pair of labels has been switched, and the solution w from the last retraining of L2-SVM-MFN (marked as Re-training 2 in FIG. 8) is available for seeding. When the last L2-SVM-MFN converged, its solution w was given by the linear system:

    \left[ \lambda I + X_{j(w)}^\top C_{j(w)} X_{j(w)} \right] w = X_{j(w)}^\top C_{j(w)} Y_{j(w)}

where Y is the current label vector. When labels Y_i, Y_j are switched, back at the top of Loop 2, the label vector is updated as:

    Y = Y + 2 e_{ij}

where e_{ij} is a vector whose elements are zero everywhere except in the i-th and j-th positions, which are +1 and -1, or -1 and +1, respectively. Note also that if i, j \in j(w), the re-training of L2-SVM-MFN with w as the starting guess immediately encounters a call to CGLS to solve the following perturbed system:

    \left[ \lambda I + X_{j(w)}^\top C_{j(w)} X_{j(w)} \right] \bar{w} = X_{j(w)}^\top C_{j(w)} \left( Y + 2 e_{ij} \right)

The starting residual vector r^{(0)} is given by:

    r^{(0)} = X_{j(w)}^\top C_{j(w)} \left( Y + 2 e_{ij} \right) - \left[ \lambda I + X_{j(w)}^\top C_{j(w)} X_{j(w)} \right] w = r(w) + 2 X_{j(w)}^\top C_{j(w)} e_{ij}

where r(w) in the second step is the final residual of w, which fell below \epsilon at the convergence of the last re-training. In applications such as text categorization, TFIDF feature
vectors are often length normalized and have positive entries. Therefore, \|x'_i - x'_j\| \le \sqrt{2}. This gives the following bound on the starting residual:

    \|r^{(0)}\| \le \epsilon + \frac{2\sqrt{2}\,\lambda'}{u}

which is much smaller than the corresponding bound for a zero starting vector. Seeding is quite effective for Loop 1 as well, where \lambda' is changed. With the two additional loops, the complexity of transductive L2-TSVM-MFN becomes O(n_{switches} \bar{t}_{mfn} \bar{t}_{cgls} n_0), where n_{switches} is the number of label switches. The outer loop executes a fixed number of times; the inner loop calls L2-TSVM-MFN n_{switches} times. Typically, n_{switches} is expected to strongly depend on the data set and also on the number of labeled examples. Since it is difficult to a priori estimate the number of switches, this is an issue that is best understood from empirical observations.
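As a usage illustration of this seeding idea (assuming a CGLS routine with the signature of the earlier sketch is in scope), switching one pair of labels changes only the right-hand side of the system, so re-training can be warm-started from the previous solution; the helper name and arguments are hypothetical.

    import numpy as np

    def retrain_after_switch(X_act, Y_act, c_act, lam, w_prev, i, j):
        """Re-solve the active-set linear system after switching the temporary labels of
        unlabeled examples i and j (both assumed active), seeding CGLS with w_prev.
        cgls refers to the sketch given earlier in this description."""
        Y_new = Y_act.copy()
        Y_new[i], Y_new[j] = -Y_act[i], -Y_act[j]      # Y <- Y + 2*e_ij
        # the system matrix is unchanged, so w_prev leaves only a small starting residual
        return cgls(X_act, Y_new, c_act, lam, beta0=w_prev)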
Mean Field Annealing

The transductive SVM loss function over the unlabeled examples can be seen from FIG. 7 to be non-convex. This makes the TSVM optimization procedure susceptible to local minimum issues, causing a loss in its performance in many situations. A new algorithm that is based on mean field annealing can be used to overcome this problem while also being computationally very attractive for large scale applications.

Mean field annealing (MFA) is an established tool for combinatorial optimization that approaches the problem from information theoretic principles. The discrete variables in the optimization problem are relaxed to continuous probability variables, and a non-negative temperature parameter T is used to track the global optimum.

First, the TSVM objective function is re-written as follows:

    w^* = \arg\min_{w \in \mathbb{R}^d,\; \{\mu_j \in \{0,1\}\}} \; \frac{\lambda}{2} \|w\|^2 + \frac{1}{2l} \sum_{i=1}^{l} \max\left(0,\, 1 - y_i (w^\top x_i)\right)^2 + \frac{\lambda'}{2u} \sum_{j=1}^{u} \left( \mu_j \max\left(0,\, 1 - (w^\top x'_j)\right)^2 + (1 - \mu_j) \max\left(0,\, 1 + (w^\top x'_j)\right)^2 \right)

Binary valued variables \mu_j = (1 + y'_j)/2 are introduced. Let p_j \in [0, 1] denote the probability that the unlabeled example belongs to the positive class. The Ising model of mean field annealing motivates the following objective function, where the binary variables \mu_j are relaxed to probability variables p_j, and entropy terms for the distributions defined by p_j are included:

    w^*_T = \arg\min_{w \in \mathbb{R}^d,\; \{p_j \in [0,1]\}} \; \frac{\lambda}{2} \|w\|^2 + \frac{1}{2l} \sum_{i=1}^{l} \max\left(0,\, 1 - y_i (w^\top x_i)\right)^2 + \frac{\lambda'}{2u} \sum_{j=1}^{u} \left( p_j \max\left(0,\, 1 - (w^\top x'_j)\right)^2 + (1 - p_j) \max\left(0,\, 1 + (w^\top x'_j)\right)^2 \right) + \frac{T}{2u} \sum_{j=1}^{u} \left( p_j \log p_j + (1 - p_j) \log (1 - p_j) \right)    (8)

The "temperature" T parameterizes a family of objective functions. The objective function for a fixed T is minimized under the following class balancing constraints:

    \frac{1}{u} \sum_{j=1}^{u} p_j = r    (9)

where r is the fraction of the number of unlabeled examples belonging to the positive class. As in TSVM, r is treated as a user-provided parameter. It may also be estimated from the labeled examples.

The solution to the optimization problem above is tracked as the temperature parameter T is lowered to 0. The final solution is given as:

    w^* = \lim_{T \to 0} w^*_T    (10)

In practice, the system (indexer 104) monitors the value of the objective function along the optimization path and returns the solution corresponding to the minimum value achieved.

To develop an intuition for this method, the loss term in the objective function associated with an unlabeled example is considered as a function of the output of the classifier. This loss term is based on calculations to be described below. FIG. 9 plots this loss term for various values of T. As the temperature is decreased, the loss function deforms from a squared-loss shape, where a global optimum is easier to achieve, to the TSVM loss function of FIG. 7. At high temperatures a global optimum is easier to obtain. The global minimizer is then slowly tracked as the temperature is lowered towards zero.

The optimization is done in stages, starting with high values of T and then gradually decreasing T towards 0. For each T, the problem in equations 8 and 9 is optimized by alternating the minimization over w and p = [p_1 ... p_u], respectively. Fixing p, the optimization over w is done by L2-SVM-MFN. Fixing w, the optimization over p can also be done easily, as described below. Both these problems involve convex optimization and can be done exactly and efficiently. Details of these optimization steps follow.
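The alternating minimization just described can be organized as a simple outer loop over temperatures. The following is a minimal NumPy-style sketch of that loop. The two convex sub-problems and the transductive cost are passed in as callables (optimize_w(p) -> w, optimize_p(w, T) -> p, tsvm_cost(w) -> float) standing in for the routines detailed in the next sections; the schedule constants and the simple inner-loop stopping test are illustrative defaults, while the patented method uses the KL-divergence and entropy criteria of equations 16-17.

    import numpy as np

    def mean_field_annealing(u, r, optimize_w, optimize_p, tsvm_cost,
                             T0=10.0, R=1.5, T_min=1e-4, max_inner=100):
        """Track the minimizer of the temperature-parameterized objective (equation 8)
        as T is lowered, returning the weight vector with the lowest transductive cost seen."""
        p = np.full(u, r)                      # initialize beliefs at the class-ratio prior
        w = optimize_w(p)
        best_w, best_cost = w, tsvm_cost(w)
        T = T0
        while T > T_min:
            for _ in range(max_inner):
                p_old = p
                p = optimize_p(w, T)           # exact update: equation 15 plus the balance constraint
                w = optimize_w(p)              # L2-SVM-MFN with p held fixed
                cost = tsvm_cost(w)
                if cost < best_cost:
                    best_cost, best_w = cost, w
                if np.max(np.abs(p - p_old)) < 1e-6:
                    break
            T /= R                             # lower the temperature
        return best_w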
Optimizing w

Described below are the steps to efficiently implement the L2-SVM-MFN loop for optimizing w keeping p fixed. The call to L2-SVM-MFN is made on the data \hat{X} = [X^\top\; X'^\top\; X'^\top]^\top, whose first l rows are formed by the labeled examples and whose next 2u rows are formed by the unlabeled examples appearing as two repeated blocks. The associated label vector and cost matrix are given by:

    \hat{Y} = [y_1, y_2 \ldots y_l,\; \underbrace{1, 1, \ldots, 1}_{u},\; \underbrace{-1, -1, \ldots, -1}_{u}]
    \hat{C} = \mathrm{diag}\left[ \underbrace{\tfrac{1}{l}, \ldots, \tfrac{1}{l}}_{l},\; \tfrac{\lambda' p_1}{u}, \ldots, \tfrac{\lambda' p_u}{u},\; \tfrac{\lambda' (1 - p_1)}{u}, \ldots, \tfrac{\lambda' (1 - p_u)}{u} \right]    (11)

Even though each unlabeled example contributes two terms to the objective function, effectively only one term contributes to the complexity. This is because matrix-vector products, which form the dominant expense in L2-SVM-MFN, are performed only on unique rows of a matrix. The output may be duplicated for duplicate rows.

In fact, the CGLS calls in L2-SVM-MFN can be re-written so that the unlabeled examples appear only once in the data
matrix. The CGLS call at some iteration, where the active index set is j(\bar{w}) for some current candidate weight vector \bar{w}, is as follows:

    \left[ \lambda I + \hat{X}_{j(\bar{w})}^\top \hat{C}_{j(\bar{w})} \hat{X}_{j(\bar{w})} \right] \bar{w} = \hat{X}_{j(\bar{w})}^\top \hat{C}_{j(\bar{w})} \hat{Y}_{j(\bar{w})}    (12)

Note that if |w'^\top x'_j| \ge 1, the unlabeled example x'_j appears as one row in the data matrix, with label given by -\mathrm{sign}(w'^\top x'_j). If |w'^\top x'_j| < 1, the unlabeled example x'_j appears as two identical rows x'_j, with both labels. Let j_l \subseteq \{1 \ldots l\} be the indices of the labeled examples in the active set, j'_1 \subseteq \{1 \ldots u\} be the indices of unlabeled examples with |w'^\top x'_j| \ge 1, and j'_2 \subseteq \{1 \ldots u\} be the indices of unlabeled examples with |w'^\top x'_j| < 1. Note that the index of every unlabeled example appears in one of these sets, i.e., j'_1 \cup j'_2 = \{1 \ldots u\}. Equation 12 may be re-written as:

    \left[ \lambda I + \frac{1}{l} \sum_{i \in j_l} x_i x_i^\top + \frac{\lambda'}{u} \sum_{j \in j'_1} \bar{c}_j\, x'_j x'^\top_j + \frac{\lambda'}{u} \sum_{j \in j'_2} x'_j x'^\top_j \right] \bar{w} = \frac{1}{l} \sum_{i \in j_l} y_i x_i - \frac{\lambda'}{u} \sum_{j \in j'_1} \bar{c}_j\, \mathrm{sign}(w'^\top x'_j)\, x'_j + \frac{\lambda'}{u} \sum_{j \in j'_2} (2 p_j - 1)\, x'_j

where \bar{c}_j = p_j if \mathrm{sign}(w'^\top x'_j) = -1 and \bar{c}_j = 1 - p_j if \mathrm{sign}(w'^\top x'_j) = 1. Rewriting in matrix notation, an equivalent linear system is obtained that can be solved by CGLS:

    \left[ \lambda I + \tilde{X}^\top \tilde{C} \tilde{X} \right] \bar{w} = \tilde{X}^\top \tilde{C} \tilde{Y}    (13)

where \tilde{X} = [X^\top\; X'^\top]^\top, \tilde{C} is a diagonal matrix and \tilde{Y} is the vector of effectively active labels. Each of these data objects has l + u rows. These are given by:

    \tilde{C}_{ii} = \frac{1}{l}, \quad \tilde{Y}_i = y_i \quad \text{if } i \in \{1 \ldots l\},\ i \in j_l
    \tilde{C}_{l+j,\,l+j} = \frac{\lambda' p_j}{u}, \quad \tilde{Y}_{l+j} = 1 \quad \text{if } j \in \{1 \ldots u\},\ j \in j'_1,\ \mathrm{sign}(w'^\top x'_j) = -1
    \tilde{C}_{l+j,\,l+j} = \frac{\lambda' (1 - p_j)}{u}, \quad \tilde{Y}_{l+j} = -1 \quad \text{if } j \in \{1 \ldots u\},\ j \in j'_1,\ \mathrm{sign}(w'^\top x'_j) = 1
    \tilde{C}_{l+j,\,l+j} = \frac{\lambda'}{u}, \quad \tilde{Y}_{l+j} = 2 p_j - 1 \quad \text{if } j \in \{1 \ldots u\},\ j \in j'_2    (14)

Thus, CGLS needs to operate only on data matrices with one instance of each unlabeled example, using a suitably modified cost matrix and label vector.
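The reduction of equations 13-14 amounts to assembling one effective cost and one effective label per example. A minimal NumPy sketch follows; the function and argument names are illustrative, o holds the current outputs on the labeled examples and o_u the outputs on the unlabeled examples, and inactive labeled rows are given zero cost as a stand-in for restricting the system to the active set.

    import numpy as np

    def effective_targets(Y, o, o_u, p, lam_u):
        """Build the cost vector C~ and label vector Y~ of equations 13-14."""
        l, u = len(Y), len(o_u)
        C = np.empty(l + u)
        Yt = np.empty(l + u)
        active_l = Y * o < 1                            # labeled examples in j(w)
        C[:l] = np.where(active_l, 1.0 / l, 0.0)        # zero cost removes a row from the system
        Yt[:l] = Y
        inside = np.abs(o_u) < 1                        # j'_2: both labels effectively active
        s = np.sign(o_u)
        # j'_1 (|o'| >= 1): one row with label -sign(o') and cost lam_u*p/u or lam_u*(1-p)/u
        C[l:] = np.where(inside, lam_u / u,
                         np.where(s < 0, lam_u * p / u, lam_u * (1.0 - p) / u))
        Yt[l:] = np.where(inside, 2.0 * p - 1.0, -s)
        return C, Yt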
After the CGLS step, the optimality conditions are checked. The optimality conditions can be re-written as:

    \forall i \in j_l: \; y_i o_i \le 1 + \epsilon
    \forall i \notin j_l: \; y_i o_i \ge 1 - \epsilon
    \forall j \in j'_1: \; |o'_j| \ge 1 - \epsilon
    \forall j \in j'_2: \; |o'_j| \le 1 + \epsilon

For the subsequent line search step, appropriate output and label vectors are reassembled to call the routine in the table of FIG. 5. The steps for optimizing w are outlined in the table of FIG. 10.

Optimizing p

For the latter problem of optimizing p for a fixed w, the Lagrangian is constructed:

    \mathcal{L} = \frac{\lambda'}{2u} \sum_{j=1}^{u} \left( p_j \max\left(0,\, 1 - (w^\top x'_j)\right)^2 + (1 - p_j) \max\left(0,\, 1 + (w^\top x'_j)\right)^2 \right) + \frac{T}{2u} \sum_{j=1}^{u} \left( p_j \log p_j + (1 - p_j) \log (1 - p_j) \right) - \nu \left[ \frac{1}{u} \sum_{j=1}^{u} p_j - r \right]

Differentiating the Lagrangian with respect to p_j, the following is obtained:

    \frac{\partial \mathcal{L}}{\partial p_j} = \frac{\lambda'}{2u} \left( \max\left(0,\, 1 - (w^\top x'_j)\right)^2 - \max\left(0,\, 1 + (w^\top x'_j)\right)^2 \right) + \frac{T}{2u} \log \frac{p_j}{1 - p_j} - \frac{\nu}{u} = 0

Defining

    g_j = \lambda' \left( \max\left(0,\, 1 - (w^\top x'_j)\right)^2 - \max\left(0,\, 1 + (w^\top x'_j)\right)^2 \right),

the expression for p_j is given by:

    p_j = \frac{1}{1 + e^{(g_j - 2\nu)/T}}    (15)

Substituting this expression into the balance constraint of equation 9, a one-dimensional non-linear equation in 2\nu is obtained:

    \frac{1}{u} \sum_{j=1}^{u} \frac{1}{1 + e^{(g_j - 2\nu)/T}} = r

The value of 2\nu is computed exactly by using a hybrid combination of Newton-Raphson iterations and the bisection method to find the root of the function

    B(v) = \frac{1}{u} \sum_{j=1}^{u} \frac{1}{1 + e^{(g_j - v)/T}} - r, \quad \text{where } v = 2\nu.

This method is rapid due to the quadratic convergence properties of Newton-Raphson iterations and fail-safe due to the bisection steps. Note that the root exists and is unique, since B(v) \to 1 - r > 0 as v \to \infty, B(v) \to -r < 0 as v \to -\infty, and B(v) is a continuous, monotonically increasing function, as shown in FIG. 11. The root finding begins by bracketing the root in the interval [v_-, v_+], so that B(v_-) < 0 and B(v_+) \ge 0, where v_-, v_+ are given by:

    v_- = \min(g_1, \ldots, g_u) - T \log \frac{1 - r}{r}
    v_+ = \max(g_1, \ldots, g_u) - T \log \frac{1 - r}{r}
The hybrid root finding algorithm performs Newton-Raphson iterations with a starting guess of

    v = \frac{v_- + v_+}{2}

and invokes the bisection method whenever an iterate goes outside the brackets. The steps for optimizing p are outlined in the table of FIG. 12.
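A concrete version of this hybrid root finder is sketched below in NumPy; it solves the one-dimensional balance equation for v (which plays the role of 2ν above) and then returns the probabilities of equation 15. The function name, tolerances, and the simple safeguarding logic are illustrative rather than the exact routine of FIG. 12.

    import numpy as np

    def optimize_p(g, T, r, tol=1e-10, max_iter=500):
        """Find v such that mean(1/(1+exp((g-v)/T))) = r, then return p (equation 15)."""
        def probs(v):
            return 1.0 / (1.0 + np.exp(np.clip((g - v) / T, -500.0, 500.0)))
        def B(v):
            return probs(v).mean() - r
        lo = g.min() - T * np.log((1.0 - r) / r)      # B(lo) <= 0
        hi = g.max() - T * np.log((1.0 - r) / r)      # B(hi) >= 0
        v = 0.5 * (lo + hi)                           # starting guess, as in the text
        for _ in range(max_iter):
            b = B(v)
            if abs(b) < tol:
                break
            p = probs(v)
            slope = np.mean(p * (1.0 - p)) / T        # B'(v) > 0 (monotone increasing)
            v_new = v - b / slope if slope > 0 else 0.5 * (lo + hi)   # Newton-Raphson step
            if not (lo <= v_new <= hi):               # step left the bracket: bisect instead
                v_new = 0.5 * (lo + hi)
            if B(v_new) < 0:
                lo = v_new
            else:
                hi = v_new
            v = v_new
        return probs(v)

Because B is monotone and the bracket is maintained at every step, the bisection fallback guarantees convergence even when a Newton step overshoots.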
Stopping Criteria

For a fixed T, this alternate minimization proceeds until some stopping criterion is satisfied. A natural criterion is the mean Kullback-Leibler divergence (relative entropy) KL(p, q) between the current values of p_j and the values, say q_j, at the end of the last iteration. Thus the stopping criterion for fixed T is:

    KL(p, q) = \sum_{j=1}^{u} \left( p_j \log \frac{p_j}{q_j} + (1 - p_j) \log \frac{1 - p_j}{1 - q_j} \right) < u \epsilon    (16)

A good value for \epsilon is 10^{-6}. The temperature may be decreased in the outer loop until the total entropy falls below a threshold, which is also taken to be 10^{-6}:

    H(p) = -\sum_{j=1}^{u} \left( p_j \log p_j + (1 - p_j) \log (1 - p_j) \right) < u \epsilon    (17)
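The two stopping tests of equations 16 and 17 are straightforward to compute. A small NumPy sketch follows (p is clipped away from 0 and 1 to keep the logarithms finite); names and the clipping constant are illustrative.

    import numpy as np

    def kl_divergence(p, q, tiny=1e-12):
        """Inner-loop stopping test of equation 16: relative entropy between successive p's."""
        p = np.clip(p, tiny, 1.0 - tiny)
        q = np.clip(q, tiny, 1.0 - tiny)
        return np.sum(p * np.log(p / q) + (1.0 - p) * np.log((1.0 - p) / (1.0 - q)))

    def total_entropy(p, tiny=1e-12):
        """Outer-loop stopping test of equation 17: entropy of the belief probabilities."""
        p = np.clip(p, tiny, 1.0 - tiny)
        return -np.sum(p * np.log(p) + (1.0 - p) * np.log(1.0 - p))

    # the alternate optimization for fixed T stops when kl_divergence(p, q) < u * 1e-6,
    # and the temperature loop stops when total_entropy(p) < u * 1e-6.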
The TSVM objective function is monitored as the optimization proceeds:

    J(w) = \frac{\lambda}{2} \|w\|^2 + \frac{1}{2l} \sum_{i=1}^{l} \max\left(0,\, 1 - y_i (w^\top x_i)\right)^2 + \frac{\lambda'}{2u} \sum_{j=1}^{u} \max\left(0,\, 1 - |w^\top x'_j|\right)^2    (18)

The weight vector corresponding to the minimum transductive cost is returned as the solution. The steps of mean field annealing with L2-SVM-MFN are outlined in the table of FIG. 13.

Application to One-Class Problems

The transductive SVM and mean field annealing algorithms described above can be deployed in a variety of settings involving different amounts of labeled and unlabeled data. Many settings present the task of identifying members of a certain class as opposed to distinguishing between well-specified classes. For example, in order to identify web documents concerning sports, it is much easier to label sports documents than to label the diverse and ill-characterized set of non-sports documents. In such problems, labeled examples come from a single class, very often with large amounts of unlabeled data containing some instances of the class of interest and many instances of the "others" class.

Being a special case of semi-supervised learning, the problem of one-class learning with unlabeled data can be addressed by the algorithms developed herein. These algorithms implement the cluster assumption under constraints on class ratios. For one-class problems, unlabeled data is expected to be helpful in biasing the classification hyperplane to pass through a low density region, keeping clusters of the relevant class on one side and instances of the "others" class on the other side. The fraction r can first be crudely estimated using a small random sample and then finely adjusted based on validation performance of the trained classifier.

The various embodiments described above are provided by way of illustration only and should not be construed to limit the invention. Those skilled in the art will readily recognize various modifications and changes that may be made to the claimed invention without following the example embodiments and applications illustrated and described herein, and without departing from the true spirit and scope of the claimed invention, which is set forth in the following claims.

What is claimed is:

1. A computerized method for semi-supervised learning for web page classification, comprising:
receiving a set of web pages as training elements;
labeling some of the elements of the set of training elements that are determined to fall within a classification group, the set of training elements thereby having labeled elements and unlabeled elements;
using selected labeled elements and unlabeled elements as examples in a semi-supervised support vector machine implemented using a mean field annealing method, constructing a continuous loss function from a non-continuous loss function, to train a linear classifier;
receiving unclassified web pages; and
classifying the unclassified web pages using the trained linear classifier.

2. The method of claim 1, wherein the number of unlabeled elements in the set of training elements that fall within the classification group is specified as an estimate.

3. The method of claim 1, wherein the semi-supervised support vector machine uses a finite Newton method for fast training.

4. The method of claim 3, wherein a step \bar{w}^{(k)} = w^{(k)} + n^{(k)} in the finite Newton method for determining a weight vector w in a k-th iteration, with n^{(k)} being a Newton step, is given by solving the following linear system associated with a weighted linear regularized least squares problem over a data subset defined by a set of active indices j(w^{(k)}):

    \left[ \lambda I + X_{j(w^{(k)})}^\top C_{j(w^{(k)})} X_{j(w^{(k)})} \right] \bar{w}^{(k)} = X_{j(w^{(k)})}^\top C_{j(w^{(k)})} Y_{j(w^{(k)})}

where I is the identity matrix, and wherein once \bar{w}^{(k)} is obtained, w^{(k+1)} is obtained from w^{(k+1)} = w^{(k)} + \delta^{(k)} n^{(k)} by setting w^{(k+1)} = w^{(k)} + \delta^{(k)} (\bar{w}^{(k)} - w^{(k)}) after performing an exact line search for \delta^{(k)}, which is by exactly solving a one-dimensional minimization problem:

    \delta^{(k)} = \arg\min_{\delta \ge 0} f\left( w^{(k)} + \delta (\bar{w}^{(k)} - w^{(k)}) \right).

5. The method of claim 4, wherein the modified Newton method uses a conjugate gradient for least squares method to solve a large, sparse, weighted regularized least squares problem.

6. The method of claim 1, wherein the semi-supervised support vector machine is implemented using a modified finite Newton method with one or multiple switching of labels of unlabeled example elements from the training set.

7. The method of claim 6, wherein the objective function is as follows:

    \arg\min_{w \in \mathbb{R}^d,\; \{y'_j \in \{-1,+1\}\}} \; \frac{\lambda}{2} \|w\|^2 + \frac{1}{2l} \sum_{i=1}^{l} \max\left(0,\, 1 - y_i (w^\top x_i)\right)^2 + \frac{\lambda'}{2u} \sum_{j=1}^{u} \max\left(0,\, 1 - y'_j (w^\top x'_j)\right)^2

    subject to: \; \frac{1}{u} \sum_{j=1}^{u} \max\left(0,\, \mathrm{sign}(w^\top x'_j)\right) = r.

8. The method of claim 7, comprising optimizing w for fixed y'_j, using the finite Newton method.

9. The method of claim 8, wherein a step \bar{w}^{(k)} = w^{(k)} + n^{(k)} in the finite Newton method for determining a weight vector w in a k-th iteration, with n^{(k)} being a Newton step, is given by solving the following linear system associated with a weighted linear regularized least squares problem over a data subset defined by a set of active indices j(w^{(k)}):

    \left[ \lambda I + X_{j(w^{(k)})}^\top C_{j(w^{(k)})} X_{j(w^{(k)})} \right] \bar{w}^{(k)} = X_{j(w^{(k)})}^\top C_{j(w^{(k)})} Y_{j(w^{(k)})}

where I is the identity matrix, and wherein once \bar{w}^{(k)} is obtained, w^{(k+1)} is obtained from w^{(k+1)} = w^{(k)} + \delta^{(k)} n^{(k)} by setting w^{(k+1)} = w^{(k)} + \delta^{(k)} (\bar{w}^{(k)} - w^{(k)}) after performing an exact line search for \delta^{(k)}, which is by exactly solving a one-dimensional minimization problem:

    \delta^{(k)} = \arg\min_{\delta \ge 0} f\left( w^{(k)} + \delta (\bar{w}^{(k)} - w^{(k)}) \right).

10. The method of claim 9, wherein the modified Newton method uses a conjugate gradient for least squares method to solve a large, sparse, weighted regularized least squares problem.

11. The method of claim 7, comprising optimizing y'_j for a fixed w by switching one or more pairs of labels.

12. The method of claim 1, wherein a step \bar{w}^{(k)} = w^{(k)} + n^{(k)} in the finite Newton method for determining a weight vector w in a k-th iteration, with n^{(k)} being a Newton step, is given by solving the following linear system associated with a weighted linear regularized least squares problem over a data subset defined by a set of active indices j(w^{(k)}):

    \left[ \lambda I + X_{j(w^{(k)})}^\top C_{j(w^{(k)})} X_{j(w^{(k)})} \right] \bar{w}^{(k)} = X_{j(w^{(k)})}^\top C_{j(w^{(k)})} Y_{j(w^{(k)})}

where I is the identity matrix, and wherein once \bar{w}^{(k)} is obtained, w^{(k+1)} is obtained from w^{(k+1)} = w^{(k)} + \delta^{(k)} n^{(k)} by setting w^{(k+1)} = w^{(k)} + \delta^{(k)} (\bar{w}^{(k)} - w^{(k)}) after performing an exact line search for \delta^{(k)}, which is by exactly solving a one-dimensional minimization problem:

    \delta^{(k)} = \arg\min_{\delta \ge 0} f\left( w^{(k)} + \delta (\bar{w}^{(k)} - w^{(k)}) \right).

13. The method of claim 12, wherein the modified Newton method uses a conjugate gradient for least squares method to solve a large, sparse, weighted regularized least squares problem.

14. The method of claim 1, comprising relaxing discrete variables to continuous probability variables in the mean field annealing method, wherein a non-negative temperature parameter T is used to track a global optimum.

15. The method of claim 14, comprising writing a semi-supervised support vector objective as follows, wherein binary valued variables \mu_j = (1 + y'_j)/2 are introduced, p_j \in (0, 1) denotes the belief probability that the unlabeled example belongs to the positive class, l is the number of labeled example elements, u is the number of unlabeled example elements, wherein \lambda and \lambda' are regularization parameters, and the Ising model of mean field annealing motivates the following objective function, where the binary variables \mu_j are relaxed to probability variables p_j, and entropy terms for the distributions defined by p_j are included:

    w^*_T = \arg\min_{w \in \mathbb{R}^d,\; \{p_j \in (0,1)\}} \; \frac{\lambda}{2} \|w\|^2 + \frac{1}{2l} \sum_{i=1}^{l} \max\left(0,\, 1 - y_i (w^\top x_i)\right)^2 + \frac{\lambda'}{2u} \sum_{j=1}^{u} \left( p_j \max\left(0,\, 1 - (w^\top x'_j)\right)^2 + (1 - p_j) \max\left(0,\, 1 + (w^\top x'_j)\right)^2 \right) + \frac{T}{2u} \sum_{j=1}^{u} \left( p_j \log p_j + (1 - p_j) \log (1 - p_j) \right)    (8)

wherein the temperature T parameterizes a family of objective functions, and wherein the objective function for a fixed T is minimized under the following class balancing constraints:

    \frac{1}{u} \sum_{j=1}^{u} p_j = r

where r is the fraction of the number of unlabeled examples belonging to the positive class.

16. The method of claim 15, wherein the solution to the optimization problem is tracked as the temperature parameter T is lowered to 0, wherein the final solution is given as:

    w^* = \lim_{T \to 0} w^*_T.

17. The method of claim 16, comprising optimizing w keeping p fixed, wherein the call to the finite Newton method is made on data \hat{X} = [X^\top\; X'^\top\; X'^\top]^\top, whose first l rows are formed by the labeled examples and whose next 2u rows are formed by the unlabeled examples appearing as two repeated blocks, wherein the associated label vector and cost matrix are given by:

    \hat{Y} = [y_1, y_2 \ldots y_l,\; \underbrace{1, 1, \ldots, 1}_{u},\; \underbrace{-1, -1, \ldots, -1}_{u}]
    \hat{C} = \mathrm{diag}\left[ \underbrace{\tfrac{1}{l}, \ldots, \tfrac{1}{l}}_{l},\; \tfrac{\lambda' p_1}{u}, \ldots, \tfrac{\lambda' p_u}{u},\; \tfrac{\lambda' (1 - p_1)}{u}, \ldots, \tfrac{\lambda' (1 - p_u)}{u} \right]
where y_1, y_2 \ldots y_l are labels of the labeled example elements.

18. The method of claim 15, comprising optimizing p for a fixed w, wherein a Lagrangian is constructed using a Lagrangian multiplier \nu as follows:

    \mathcal{L} = \frac{\lambda'}{2u} \sum_{j=1}^{u} \left( p_j \max\left(0,\, 1 - (w^\top x'_j)\right)^2 + (1 - p_j) \max\left(0,\, 1 + (w^\top x'_j)\right)^2 \right) + \frac{T}{2u} \sum_{j=1}^{u} \left( p_j \log p_j + (1 - p_j) \log (1 - p_j) \right) - \nu \left[ \frac{1}{u} \sum_{j=1}^{u} p_j - r \right]

wherein, differentiating the Lagrangian with respect to p_j produces:

    \frac{\partial \mathcal{L}}{\partial p_j} = \frac{\lambda'}{2u} \left( \max\left(0,\, 1 - (w^\top x'_j)\right)^2 - \max\left(0,\, 1 + (w^\top x'_j)\right)^2 \right) + \frac{T}{2u} \log \frac{p_j}{1 - p_j} - \frac{\nu}{u} = 0

defining

    g_j = \lambda' \left( \max\left(0,\, 1 - (w^\top x'_j)\right)^2 - \max\left(0,\, 1 + (w^\top x'_j)\right)^2 \right),

wherein the expression for p_j is given by:

    p_j = \frac{1}{1 + e^{(g_j - 2\nu)/T}}

wherein, substituting the expression for p_j in the balance constraint, a one-dimensional non-linear equation in \nu is as follows:

    \frac{1}{u} \sum_{j=1}^{u} \frac{1}{1 + e^{(g_j - 2\nu)/T}} = r

wherein the value of \nu is computed exactly by using a hybrid combination of Newton-Raphson iterations and the bisection method to find the root of the function

    B(\nu) = \frac{1}{u} \sum_{j=1}^{u} \frac{1}{1 + e^{(g_j - 2\nu)/T}} - r.

19. A system for semi-supervised learning for web page classification, comprising:
an input device operative to receive a set of web pages as training elements;
a processor operative to label some of the elements of the set of training elements that are determined to fall within a classification group, the set of training elements thereby having labeled elements and unlabeled elements;
the processor further operative to use selected labeled elements and unlabeled elements as examples in a semi-supervised support vector machine implemented using a mean field annealing method, constructing a continuous loss function from a non-continuous loss function, to train a linear classifier;
the input device further operative to receive unclassified web pages; and
the processor further operative to classify the received unclassified web pages using the trained linear classifier.

20. A computer program product stored on a computer readable medium having instructions for performing a semi-supervised learning method for web page classification, the method comprising:
receiving a set of web pages as training elements;
labeling some of the elements of the set of training elements that are determined to fall within a classification group, the set of training elements thereby having labeled elements and unlabeled elements;
using selected labeled elements and unlabeled elements as examples in a semi-supervised support vector machine implemented using a mean field annealing method, constructing a continuous loss function from a non-continuous loss function, to train a linear classifier;
receiving unclassified web pages; and
classifying the unclassified web pages using the trained linear classifier.