[FIGS. 2 and 3 (drawing sheets): the only legible text is the label "LABEL SUB-SET N" (302).]
Inputs: X, Y, C, λ, β, o = Xβ, ε, cgitermax
Initialization: z = C(Y − o), r = X^T z − λβ, p = r, ω = ‖r‖²
cgiter = 0, optimality = 0
Iterate while (cgiter < cgitermax)
    cgiter = cgiter + 1
    q = Xp
    γ = ω / (λ‖p‖² + q^T C q)
    β = β + γp
    o = o + γq
    z = z − γCq
    ω̄ = ω
    r = X^T z − λβ
    ω = ‖r‖²
    if (ω < ε² ‖z‖²)
        Set optimality = 1 and exit while loop.
    end if
    δ = ω / ω̄
    p = r + δp
end while
Outputs: β, o
Fig. 4
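For concreteness, here is a minimal NumPy sketch of the FIG. 4 recursion for the system (λI + X^T C X)β = X^T C Y. The dense-array representation, function name, and default tolerances are illustrative assumptions, not the patent's listing (which operates on sparse data):

```python
import numpy as np

def cgls(X, Y, c, lam, beta, eps=1e-6, cgitermax=200):
    """CGLS recursion of FIG. 4 for (lam*I + X^T C X) beta = X^T C Y.

    X: (n, d) data matrix, Y: (n,) targets, c: (n,) diagonal of C,
    beta: (d,) initial guess (seed). Returns (beta, o) with o = X beta.
    """
    o = X @ beta                      # outputs on the data
    z = c * (Y - o)                   # weighted residual C(Y - o)
    r = X.T @ z - lam * beta          # residual of the normal system
    p = r.copy()                      # search direction
    omega = r @ r
    for _ in range(cgitermax):
        q = X @ p
        gamma = omega / (lam * (p @ p) + q @ (c * q))
        beta = beta + gamma * p
        o = o + gamma * q
        z = z - gamma * (c * q)
        omega_old = omega
        r = X.T @ z - lam * beta
        omega = r @ r
        if omega < eps**2 * (z @ z):  # ||r||^2 small relative to ||z||^2
            break
        p = r + (omega / omega_old) * p
    return beta, o
```

Because the routine accepts an initial guess β, the solution of one system can be used to start a slightly perturbed one — the "seeding" idea discussed later in the text — which typically cuts the iteration count substantially compared with starting from the zero vector.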
Inputs: w, w̄, o = Xw, ō = Xw̄, Y, C, λ
Initialize: j(w) = {i : y_i o_i < 1}
L = λ w^T (w̄ − w) + Σ_{i∈j(w)} c_i (o_i − y_i)(ō_i − o_i)
R = λ w̄^T (w̄ − w) + Σ_{i∈j(w)} c_i (ō_i − y_i)(ō_i − o_i)
Define δ_i = (y_i − o_i) / (ō_i − o_i) for all i
Δ_1 = {δ_i : i ∈ j(w), y_i (ō_i − o_i) > 0}
Δ_2 = {δ_i : i ∉ j(w), y_i (ō_i − o_i) < 0}
Δ = Δ_1 ∪ Δ_2
Reorder the indices so that the δ_i ∈ Δ are sorted in non-decreasing
order as δ_{i_1}, δ_{i_2}, ...
Iterate for j = 1, 2, ...
    If the root δ = L/(L − R) of the current linear piece lies in the
        current interval, exit with δ* = δ.
    Set s = −1 if δ_{i_j} ∈ Δ_1 or s = 1 if δ_{i_j} ∈ Δ_2
    L = L + s c_{i_j} (o_{i_j} − y_{i_j})(ō_{i_j} − o_{i_j})
    R = R + s c_{i_j} (ō_{i_j} − y_{i_j})(ō_{i_j} − o_{i_j})
end for
Output: δ*
Fig. 5
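The following NumPy sketch mirrors the FIG. 5 routine; the bracketing and interpolation at the end follow the prose description in the specification, and the names are illustrative:

```python
import numpy as np

def exact_line_search(w, w_bar, o, o_bar, y, c, lam):
    """Exact line search of FIG. 5: minimize phi(delta) = f(w + delta*(w_bar - w))
    by finding the root of the piecewise-linear derivative phi'."""
    d = o_bar - o
    active = y * o < 1                            # j(w)
    wd = w_bar - w
    # L = phi'_k(0), R = phi'_k(1) of the current linear piece (eq. 6)
    L = lam * (w @ wd) + np.sum(c[active] * (o[active] - y[active]) * d[active])
    R = lam * (w_bar @ wd) + np.sum(c[active] * (o_bar[active] - y[active]) * d[active])
    enters = ~active & (y * d < 0)                # i enters j(w_delta) at delta_i
    leaves = active & (y * d > 0)                 # i leaves j(w_delta) at delta_i
    idx = np.where(enters | leaves)[0]
    bp = (y[idx] - o[idx]) / d[idx]               # break points delta_i
    order = np.argsort(bp)                        # sort non-decreasingly
    idx, bp = idx[order], bp[order]
    lo = 0.0
    for i, hi in zip(idx, bp):
        root = L / (L - R)                        # zero of the piece's extension
        if lo <= root <= hi:                      # root bracketed in this interval
            return root
        s = -1.0 if leaves[i] else 1.0            # update L, R past the break point
        L += s * c[i] * (o[i] - y[i]) * d[i]
        R += s * c[i] * (o_bar[i] - y[i]) * d[i]
        lo = hi
    return L / (L - R)                            # root lies beyond the last break point
```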
[FIG. 6 (abridged pseudo-code for L2-SVM-MFN): only the output line "Outputs: w, o" is legible.]
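Since the FIG. 6 pseudo-code itself is illegible, the following is a minimal sketch of the modified finite Newton loop as described in the surrounding text, alternating a seeded CGLS solve over the active set (equation 4) with the exact line search. It reuses the `cgls` and `exact_line_search` sketches above; the stopping rule shown is an assumption:

```python
import numpy as np

def l2_svm_mfn(X, y, c, lam, tol=1e-6, max_outer=50):
    """Modified finite Newton method: w^(k+1) = w^k + delta*(w_bar^k - w^k),
    where w_bar^k solves the regularized least squares problem (eq. 4)
    restricted to the active set j(w^k) = {i : y_i o_i < 1}."""
    n, d = X.shape
    w = np.zeros(d)
    o = np.zeros(n)
    for _ in range(max_outer):
        active = y * o < 1                              # j(w^k)
        Xa, ya, ca = X[active], y[active], c[active]
        # seeded CGLS solve of [lam*I + Xa^T Ca Xa] w_bar = Xa^T Ca ya
        w_bar, _ = cgls(Xa, ya, ca, lam, beta=w)
        o_bar = X @ w_bar                               # outputs for line search
        delta = exact_line_search(w, w_bar, o, o_bar, y, c, lam)
        w_new = w + delta * (w_bar - w)
        o = o + delta * (o_bar - o)
        if np.linalg.norm(w_new - w) < tol * max(1.0, np.linalg.norm(w)):
            return w_new, o
        w = w_new
    return w, o
```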
[FIG. 7: plot of the loss term over an unlabeled example as a function of the classifier output.]
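The FIG. 7 plot itself was lost in extraction; this matplotlib snippet regenerates the quantity it depicts as described in the text — the effective loss over an unlabeled example, written in terms of the absolute value of the classifier output (the plotting style is a choice, not the patent's):

```python
import numpy as np
import matplotlib.pyplot as plt

o = np.linspace(-3, 3, 601)                       # classifier output w^T x'
l2_loss = np.maximum(0.0, 1.0 - np.abs(o)) ** 2   # min over y' of max(0, 1 - y'o)^2
l1_loss = np.maximum(0.0, 1.0 - np.abs(o))        # L1 counterpart for comparison

plt.plot(o, l2_loss, label="L2 loss")
plt.plot(o, l1_loss, "--", label="L1 loss")       # similar to L2 on [-1, +1]
plt.xlabel("output on unlabeled example")
plt.ylabel("loss")
plt.legend()
plt.show()
```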
Problem: Given l labeled examples {x_i, y_i}, i = 1 ... l, where x_i ∈ R^d and y_i ∈ {−1, +1},
and u unlabeled examples {x′_j}, j = 1 ... u, solve the problem in Eqn. 7.
Inputs: X, Y, X′, λ, λ′, r, S (maximum number of label pairs to switch, default S = 1)
Initialization: C ∈ R^{l×l}: a diagonal matrix with C_ii = 1/l
    w⁰ = L2-SVM-MFN(X, Y, C)
    Compute o′ = X′ w⁰. Assign positive and negative labels to the
    unlabeled data in the ratio r : (1 − r) respectively by thresholding o′.
    Put these labels in a vector Y′.
Fig. 8
[The remainder of FIG. 8 — the outer loop over λ′, the inner label-switching loop, and the step marked Re-training 2 — is not legible.]
[FIG. 9: plot of the loss term over an unlabeled example for various values of the temperature T.]
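A small sketch of the temporary-labeling step in FIG. 8's initialization: thresholding the soft outputs so that a fraction r of the unlabeled examples is labeled positive (NumPy; names illustrative):

```python
import numpy as np

def threshold_labels(o_prime, r):
    """Assign +1 to the fraction r of unlabeled examples with the largest
    soft outputs o' = X'w, and -1 to the rest (ratio r : 1 - r)."""
    u = len(o_prime)
    n_pos = int(round(r * u))
    y_prime = -np.ones(u)
    y_prime[np.argsort(-o_prime)[:n_pos]] = 1.0   # top outputs get +1
    return y_prime
```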
[FIG. on Sheet 12: hybrid bisection/Newton-Raphson root finder for ν]
Inputs: o′, λ′, T, r
Compute g_j, j = 1 ... u, where g_j = λ′ (max(0, 1 − o′_j)² − max(0, 1 + o′_j)²)
Initialize: ε = 10⁻¹⁰, iter = 0, maxiter = 500
ν₋ = (min(g_1 ... g_u) − T log((1 − r)/r)) / 2
ν₊ = (max(g_1 ... g_u) − T log((1 − r)/r)) / 2
ν = (ν₋ + ν₊)/2 (initial guess)
s_j = e^{(g_j − 2ν)/T}
B(ν) = (1/u) Σ_j 1/(1 + s_j) − r
B′(ν) = (2/(uT)) Σ_j s_j/(1 + s_j)² (if s_j = ∞, i.e. larger than some upper
    limit, set the corresponding term to 0)
while (|B(ν)| > ε) AND (iter < maxiter)
    iter = iter + 1
    if B′(ν) > 0
        ν̄ = ν − B(ν)/B′(ν) (Newton-Raphson step)
    end if
    if (ν̄ < ν₋) OR (ν̄ > ν₊) OR (B′(ν) = 0)
        ν = (ν₋ + ν₊)/2 (bisection)
    else
        ν = ν̄ (Newton-Raphson)
    end if
    Update s_j = e^{(g_j − 2ν)/T}
    B(ν) = (1/u) Σ_j 1/(1 + s_j) − r
    B′(ν) = (2/(uT)) Σ_j s_j/(1 + s_j)² (if s_j = ∞, set the corresponding term to 0)
    if B(ν) < 0 set ν₋ = ν, else ν₊ = ν
    if ν₊ − ν₋ < ε, exit while loop
end while
Outputs: p_j = 1/(1 + s_j), j = 1 ... u
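A NumPy rendering of this safeguarded Newton/bisection root finder for the balance equation B(ν) = 0; vectorized clipping replaces the "set the term to 0 on overflow" rule, and that substitution plus the function names are assumptions:

```python
import numpy as np

def solve_nu(o_prime, lam_p, T, r, eps=1e-10, maxiter=500):
    """Solve B(nu) = (1/u) sum_j 1/(1 + exp((g_j - 2 nu)/T)) - r = 0
    by Newton-Raphson safeguarded with bisection, then return the
    class probabilities p_j = 1/(1 + exp((g_j - 2 nu)/T)) and nu."""
    g = lam_p * (np.maximum(0, 1 - o_prime)**2 - np.maximum(0, 1 + o_prime)**2)
    u = len(g)
    shift = T * np.log((1 - r) / r)
    nu_lo = (g.min() - shift) / 2                 # B(nu_lo) <= 0
    nu_hi = (g.max() - shift) / 2                 # B(nu_hi) >= 0
    nu = (nu_lo + nu_hi) / 2

    def B_and_dB(nu):
        s = np.exp(np.clip((g - 2 * nu) / T, -500, 500))  # avoid overflow
        p = 1 / (1 + s)
        # B'(nu) = (2/(uT)) sum s/(1+s)^2, computed stably as p*(1-p)
        return p.mean() - r, (2 / (u * T)) * np.sum(p * (1 - p))

    B, dB = B_and_dB(nu)
    for _ in range(maxiter):
        if abs(B) <= eps or nu_hi - nu_lo < eps:
            break
        nu_new = nu - B / dB if dB > 0 else np.nan
        if not (nu_lo <= nu_new <= nu_hi):        # Newton left the bracket
            nu_new = (nu_lo + nu_hi) / 2          # fall back to bisection
        nu = nu_new
        B, dB = B_and_dB(nu)
        if B < 0:
            nu_lo = nu                            # root is to the right
        else:
            nu_hi = nu
    s = np.exp(np.clip((g - 2 * nu) / T, -500, 500))
    return 1 / (1 + s), nu
```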
where the step size δ^k ∈ R, and the Newton direction n^k ∈ R^d is given by:

n^k = −(∇²f(w^k))⁻¹ ∇f(w^k)

∇f(w^k) is the gradient vector and ∇²f(w^k) is the Hessian matrix of f at w^k. However, the Hessian does not exist everywhere, since f is not twice differentiable at those weight vectors w where w^T x_i = y_i for some index i. For this reason, the finite Newton method works around this issue through a generalized definition of the Hessian matrix. The modified finite Newton procedure proceeds as follows. The step w̄^k = w^k + n^k in the Newton direction can be seen to be given by solving the following linear system associated with a weighted linear regularized least squares problem over the data subset defined by the indices j(w^k):

[λI + X_{j(w^k)}^T C_{j(w^k)} X_{j(w^k)}] w̄^k = X_{j(w^k)}^T C_{j(w^k)} Y_{j(w^k)}    (4)

where I is the identity matrix. Once w̄^k is obtained, w^(k+1) is obtained from equation 3 by setting w^(k+1) = w^k + δ^k (w̄^k − w^k) after an exact line search for δ^k.

Within CGLS, the iterates are updated as follows. The vector z^(j) = C(Y − Xβ^(j)) is updated as:

z^(j+1) = C(Y − Xβ^(j+1)) = C(Y − Xβ^(j)) − γ_j C q^(j), where q^(j) = X p^(j)

The optimal choice of γ_j is given by:

γ_j = ‖r^(j)‖² / (λ‖p^(j)‖² + q^(j)T C q^(j))

Finally, the search directions are updated as:

p^(j+1) = r^(j+1) + ω_j p^(j), where ω_j = ‖r^(j+1)‖² / ‖r^(j)‖²

The CGLS iterations are terminated when the norm of the gradient r^(j+1) becomes small enough relative to the norm of the iterate z^(j+1), or if the number of iterations exceeds a certain maximum allowable number.

The CGLS iterations are listed in the table of FIG. 4. The data matrix X is only involved in the computations through matrix-vector multiplication for computing the iterates q^(j) and r^(j). This forms the dominant expense in each iteration (the product with C simply scales each element of a vector). If there are n₀ non-zero elements in the data matrix, this has O(n₀) cost. As a subroutine of L2-SVM-MFN, CGLS is typically called on a small subset of the full data set. The total cost of CGLS is O(t_cgls n₀), where t_cgls is the number of iterations, which depends on the practical rank of X and is typically found to be very small relative to the dimensions of X (number of examples and features). The memory requirements are also minimal: only five vectors need to be maintained, including the outputs over the currently active set of data points.

Finally, another feature of CGLS is worth noting. Suppose the solution β of a regularized least squares problem is available, i.e., the linear system in equation 5 has been solved using CGLS. If there is a need to solve a perturbed linear system, it is greatly advantageous in many settings to start the CG iterations for the new system with β as the initial guess. This is often called seeding. If the starting residual is small, CGLS can converge much faster than with a guess of the 0 vector. The utility of this feature depends on the nature and degree of the perturbation. In L2-SVM-MFN, the candidate solution obtained after line search in iteration k is seeded for the CGLS computation in the next iteration. Also, in tuning λ over a range of values, it is computationally valuable to seed the solution for a particular λ onto the next value. For the transductive SVM implementation with L2-SVM-MFN, solutions are seeded across linear systems with slightly perturbed label vectors and data matrices.

Line Search

Given the vectors w^k, w̄^k in some iteration of L2-SVM-MFN, the line search step includes solving:

δ* = argmin_{δ≥0} φ(δ) = f(w_δ)

where w_δ = w^k + δ(w̄^k − w^k). The one-dimensional function φ(δ) is the restriction of the objective function f on the ray from w^k onto w̄^k. Hence, like f, φ(δ) is also a continuously differentiable, strictly convex, piecewise quadratic function with a unique minimizer δ* given by φ′(δ*) = 0. Thus, one needs to find the root of the piecewise linear function

φ′(δ) = λ w_δ^T (w̄^k − w^k) + Σ_{i∈j(w_δ)} c_i (o_i^δ − y_i)(ō_i − o_i)    (6)

where o_i = w^kT x_i, ō_i = w̄^kT x_i, and o_i^δ = o_i + δ(ō_i − o_i).

The linear pieces of φ′ are defined over those intervals where j(w_δ) remains constant. Thus, the break points occur at a certain set of values δ_i where

w_{δ_i}^T x_i = y_i

for some data point indexed by i, i.e.,

δ_i = (y_i − o_i) / (ō_i − o_i)

Among these values, only those indices i where δ_i ≥ 0 are considered, i.e., if i ∈ j(w^k) (then y_i o_i < 1, so y_i(ō_i − o_i) > 0) or if i ∉ j(w^k) (then y_i o_i > 1, so y_i(ō_i − o_i) < 0). When δ is increased past a δ_i, in the former case the index i leaves j(w_δ) and in the latter case it enters j(w_δ). Reordering the indices so that the δ_i are sorted in non-decreasing order as δ_{i_1}, δ_{i_2}, ..., the root is then easily checked in each interval (δ_{i_k}, δ_{i_{k+1}}), k = 1, 2, ..., by keeping track of the slope of the linear piece in that interval. The slope is constant for each interval and non-decreasing as the search progresses through these ordered intervals. The interval in which the slope becomes non-negative for the first time brackets the root. Defining the extension of the linear piece in the interval (δ_{i_k}, δ_{i_{k+1}}) as

φ′_k(δ) = λ w_δ^T (w̄^k − w^k) + Σ_{i∈j(w_{δ_{i_k}})} c_i (o_i^δ − y_i)(ō_i − o_i),

the slope and the root computations are conveniently done by keeping track of

L = φ′_k(0) = λ w^kT (w̄^k − w^k) + Σ_{i∈j(w_{δ_{i_k}})} c_i (o_i − y_i)(ō_i − o_i) and

R = φ′_k(1) = λ w̄^kT (w̄^k − w^k) + Σ_{i∈j(w_{δ_{i_k}})} c_i (ō_i − y_i)(ō_i − o_i).

The full line search routine is outlined in the table of FIG. 5. The table of FIG. 6 provides an abridged pseudo-code for L2-SVM-MFN. Its computational complexity therefore is O(t_mfn t̄_cgls n₀), where t_mfn is the number of outer iterations of CGLS calls and line search, and t̄_cgls is the average number of CGLS iterations. These depend on the data set and the tolerance desired in the stopping criterion, but are typically very small. Therefore the complexity is found to be linear in the number of entries in the data matrix.

Semi-Supervised Linear SVMs

It is now assumed that there are l labeled examples {x_i, y_i}, i = 1 ... l, and u unlabeled examples {x′_j}, j = 1 ... u, with x_i, x′_j ∈ R^d and y_i ∈ {−1, +1}. The goal is to construct a linear classifier sign(w^T x) that utilizes unlabeled data, typically in situations where l ≪ u. Semi-supervised algorithms provide L2-SVM-MFN the capability of dealing with unlabeled data.

Transductive SVM

Transductive SVM (TSVM) appends an additional term in the SVM objective function whose role is to drive the classification hyperplane towards low data density regions. The following optimization problem is set up for standard TSVM:
w* = argmin_{w ∈ R^d, {y′_j ∈ {−1,+1}}} (λ/2)‖w‖² + (1/2l) Σ_{i=1}^l max(0, 1 − y_i(w^T x_i)) + (λ′/2u) Σ_{j=1}^u max(0, 1 − y′_j(w^T x′_j))

subject to: (1/u) Σ_{j=1}^u max(0, sign(w^T x′_j)) = r

The labels on the unlabeled data, y′_1 ... y′_u, are {+1, −1}-valued variables in the optimization problem. In other words, TSVM seeks a hyperplane w and a labeling of the unlabeled examples, so that the SVM objective function is minimized, subject to the constraint that a fraction r of the unlabeled data be classified positive. SVM margin maximization in the presence of unlabeled examples can be interpreted as an implementation of the cluster assumption. In the optimization problem above, λ′ is a user-provided parameter that provides control over the influence of unlabeled data. If there is enough labeled data, λ, λ′, and r can be tuned by cross-validation. An initial estimate of r can be made from the fraction of labeled examples that belong to the positive class, and subsequent fine tuning can be done based on performance on a validation set.

In one method, this optimization is implemented by first using an inductive SVM to label the unlabeled data and then iteratively switching labels and retraining SVMs to improve the objective function. The TSVM algorithm is a wrapper around an SVM training procedure. In one existing software implemented system, the training of SVMs in the inner loops of TSVM uses dual decomposition techniques. In sparse, linear settings significant speed improvements can be obtained with L2-SVM-MFN over this software system. Thus, by implementing TSVM with L2-SVM-MFN, improvements are made for semi-supervised learning on large, sparse datasets. The L2-SVM-MFN retraining steps in the inner loop of TSVM are typically executed extremely fast by using seeding techniques. Additionally, in one embodiment of TSVM, more than one pair of labels may be switched in each iteration.

To develop the TSVM implementation with L2-SVM-MFN, the objective function is considered corresponding to equation 7 below, but with the L2 loss function:

w* = argmin_{w ∈ R^d, {y′_j ∈ {−1,+1}}} (λ/2)‖w‖² + (1/2l) Σ_{i=1}^l max(0, 1 − y_i(w^T x_i))² + (λ′/2u) Σ_{j=1}^u max(0, 1 − y′_j(w^T x′_j))²    (7)

subject to: (1/u) Σ_{j=1}^u max(0, sign(w^T x′_j)) = r

A value of the label variable y′_j is selected that minimizes the loss on the unlabeled example x′_j, and the loss is rewritten in terms of the absolute value of the output of the classifier on x′_j. This loss function is shown in FIG. 7. It should be noted that the L1 and L2 loss terms over unlabeled examples are very similar on the interval [−1, +1]. The non-convexity of this loss function implies that the TSVM training procedure is susceptible to local optima issues. Next, a mean field annealing procedure is outlined that can overcome this problem.

The TSVM algorithm with L2-SVM-MFN is outlined in the table of FIG. 8. A classifier is obtained by first running L2-SVM-MFN on just the labeled examples. Temporary labels are assigned to the unlabeled data by thresholding the soft outputs of this classifier so that the fraction of the total number of unlabeled examples that are temporarily labeled positive equals the parameter r. Then, starting from a small value of λ′, the unlabeled data is gradually brought in by increasing λ′ by a factor of R, where R is set to 2, in the outer loop. This gradual increase of the influence of the unlabeled data is a way to protect TSVM from being immediately trapped in a local minimum. An inner loop identifies up to S pairs of unlabeled examples with positive and negative temporary labels such that switching these labels would decrease the objective function. L2-SVM-MFN is then retrained with the switched labels.

Transductive L2-SVM-MFN with multiple-pair switching converges in a finite number of steps. Previous transductive support vector machine implementations used single switching (S = 1) of labels. However, in the implementation of the present embodiment, larger values for S (S > 1, i.e., multiple switching) are used, which leads to a significant improvement in efficiency without any loss in performance.

L2-SVM-MFN on large sparse datasets, combined with the efficiency gained from seeding w in the re-training steps (after switching labels or after increasing λ′), is very effective. Consider an iteration in Loop 2 of TSVM where a new pair of labels has been switched, and the solution w from the last retraining of L2-SVM-MFN (marked as Re-training 2 in FIG. 8) is available for seeding. When the last L2-SVM-MFN converged, its solution w is given by the linear system:

[λI + X_{j(w)}^T C_{j(w)} X_{j(w)}] w = X_{j(w)}^T C_{j(w)} Y_{j(w)}

where Y is the current label vector. When labels Y_i, Y_j are switched, back at the top of Loop 2, the label vector is updated as:

Y = Y + 2 e_ij

where e_ij is a vector whose elements are zero everywhere except in the i-th and the j-th positions, which are +1 and −1 or −1 and +1, respectively. Note also that if i, j ∈ j(w), the re-training of L2-SVM-MFN with w as the starting guess immediately encounters a call to CGLS to solve the following perturbed system:

[λI + X_{j(w)}^T C_{j(w)} X_{j(w)}] w̄ = X_{j(w)}^T C_{j(w)} (Y + 2 e_ij)_{j(w)}

The starting residual vector r⁰ is given by:

r⁰ = X_{j(w)}^T C_{j(w)} (Y + 2 e_ij)_{j(w)} − [λI + X_{j(w)}^T C_{j(w)} X_{j(w)}] w = 2 X_{j(w)}^T C_{j(w)} (e_ij)_{j(w)} + r(w)

where r(w) in the second step is the final residual of w, which fell below ε at the convergence of the last re-training. In applications such as text categorization, TFIDF feature vectors are often length normalized and have positive entries. Therefore, ‖x_i − x_j‖ ≤ √2. This gives a bound on the starting residual that is much smaller than the corresponding bound with a zero starting vector. Seeding is quite effective for Loop 1 as well, where λ′ is changed. With the two additional loops, the complexity of transductive L2-SVM-MFN becomes O(n_switches t_mfn t̄_cgls n₀), where n_switches is the number of label switches.

The outer loop executes a fixed number of times; the inner loop calls L2-SVM-MFN n_switches times. Typically, n_switches is expected to strongly depend on the data set and also on the number of labeled examples. Since it is difficult to a priori estimate the number of switches, this is an issue that is best understood from empirical observations.
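A sketch of the inner-loop pair selection under the L2 loss: pick up to S positive/negative pairs whose label swap decreases the unlabeled loss terms. The greedy pairing heuristic shown is my construction of the rule the text describes, not the patent's listing:

```python
import numpy as np

def select_switch_pairs(o_prime, y_prime, S):
    """Identify up to S (positive, negative) pairs of unlabeled examples
    whose label swap decreases the sum of L2 hinge losses.

    o_prime: soft outputs on unlabeled data; y_prime: temporary labels (+/-1).
    Greedy: pair the positives and negatives with the largest flip gains.
    """
    def loss(y, o):
        return np.maximum(0.0, 1.0 - y * o) ** 2

    pos = np.where(y_prime > 0)[0]
    neg = np.where(y_prime < 0)[0]
    gain_pos = loss(+1, o_prime[pos]) - loss(-1, o_prime[pos])  # gain of flipping +1 -> -1
    gain_neg = loss(-1, o_prime[neg]) - loss(+1, o_prime[neg])  # gain of flipping -1 -> +1
    pos = pos[np.argsort(-gain_pos)]
    neg = neg[np.argsort(-gain_neg)]
    pairs = []
    for i, j in zip(pos[:S], neg[:S]):
        # switch only if the pair jointly decreases the objective
        if loss(+1, o_prime[i]) + loss(-1, o_prime[j]) > \
           loss(-1, o_prime[i]) + loss(+1, o_prime[j]):
            pairs.append((i, j))
    return pairs
```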
Mean Field Annealing

The transductive SVM loss function over the unlabeled examples can be seen from FIG. 7 to be non-convex. This makes the TSVM optimization procedure susceptible to local minimum issues, causing a loss in its performance in many situations. A new algorithm that is based on mean field annealing can be used to overcome this problem while also being computationally very attractive for large scale applications.

Mean field annealing (MFA) is an established tool for combinatorial optimization that approaches the problem from information theoretic principles. The discrete variables in the optimization problem are relaxed to continuous probability variables, and a non-negative temperature parameter T is used to track the global optimum.

First, the TSVM objective function is re-written as follows:

w* = argmin_{w ∈ R^d, {μ_j ∈ {0,1}}} (λ/2)‖w‖² + (1/2l) Σ_{i=1}^l max(0, 1 − y_i(w^T x_i))² + (λ′/2u) Σ_{j=1}^u (μ_j max(0, 1 − (w^T x′_j))² + (1 − μ_j) max(0, 1 + (w^T x′_j))²)

subject to: (1/u) Σ_{j=1}^u max(0, sign(w^T x′_j)) = r

where the binary valued variables μ_j = (1 + y′_j)/2 are introduced. Let p_j ∈ (0,1) denote the probability that the unlabeled example x′_j belongs to the positive class. The Ising model of mean field annealing motivates the following objective function, where the binary variables μ_j are relaxed to probability variables p_j, and entropy terms for the distributions defined by p_j are included:

w*_T = argmin_{w ∈ R^d, {p_j ∈ (0,1)}} (λ/2)‖w‖² + (1/2l) Σ_{i=1}^l max(0, 1 − y_i(w^T x_i))² + (λ′/2u) Σ_{j=1}^u (p_j max(0, 1 − (w^T x′_j))² + (1 − p_j) max(0, 1 + (w^T x′_j))²) + (T/2u) Σ_{j=1}^u (p_j log p_j + (1 − p_j) log(1 − p_j))    (8)

The temperature T parameterizes a family of objective functions. The objective function for a fixed T is minimized under the following class balancing constraint:

(1/u) Σ_{j=1}^u p_j = r    (9)

where r is the fraction of the number of unlabeled examples belonging to the positive class. As in TSVM, r is treated as a user-provided parameter. It may also be estimated from the labeled examples.

The solution to the optimization problem above is tracked as the temperature parameter T is lowered to 0. The final solution is given as:

w* = lim_{T→0} w*_T    (10)

In practice, the system (indexer 104) monitors the value of the objective function in the optimization path and returns the solution corresponding to the minimum value achieved.

To develop an intuition for this method, the loss term in the objective function associated with an unlabeled example is considered as a function of the output of the classifier. This loss term is based on calculations to be described below. FIG. 9 plots this loss term for various values of T. As the temperature is decreased, the loss function deforms from a squared-loss shape, where a global optimum is easier to achieve, to the TSVM loss function in FIG. 7. At high temperatures a global optimum is easier to obtain. The global minimizer is then slowly tracked as the temperature is lowered towards zero.

The optimization is done in stages, starting with high values of T and then gradually decreasing T towards 0. For each T, the problem in equations 8, 9 is optimized by alternating the minimization over w and p = p_1 ... p_u, respectively. Fixing p, the optimization over w is done by L2-SVM-MFN. Fixing w, the optimization over p can also be done easily, as described below. Both these problems involve convex optimization and can be done exactly and efficiently. Details of these optimization steps follow.

Optimizing w

Described are the steps to efficiently implement the L2-SVM-MFN loop for optimizing w keeping p fixed. The call to L2-SVM-MFN is made on the data X̂ = [X^T X′^T X′^T]^T, whose first l rows are formed by the labeled examples, and whose next 2u rows are formed by the unlabeled examples appearing as two repeated blocks. The associated label vector and cost matrix are given by:

Ŷ = [y_1, y_2, ..., y_l, 1, 1, ..., 1, −1, −1, ..., −1],
Ĉ = diag(1/l, ..., 1/l, λ′p_1/u, ..., λ′p_u/u, λ′(1 − p_1)/u, ..., λ′(1 − p_u)/u)    (11)

Rewriting in matrix notation, an equivalent linear system is obtained that can be solved by CGLS:

[λI + X̄^T C̄ X̄] w = X̄^T C̄ Ȳ    (13)

where X̄ = [X^T X′^T]^T, C̄ is a diagonal matrix, and Ȳ is the vector of effectively active labels. Each of these data objects has l + u rows. These are given by:

C̄_jj = 1/l, Ȳ_j = y_j, j ∈ 1 ... l    (14)

and, over the unlabeled rows, C̄_jj = λ′ c_j / u, where c_j = p_j if sign(w^T x′_j) = −1 and c_j = 1 − p_j if sign(w^T x′_j) = 1.

Optimizing p

For the optimization of p with w fixed, a Lagrangian is constructed using the Lagrangian multiplier ν for the class balancing constraint in equation 9. Setting its derivative with respect to p_j to zero, the expression for p_j is given by:

p_j = 1 / (1 + e^{(g_j − 2ν)/T})    (15)

where g_j = λ′ (max(0, 1 − w^T x′_j)² − max(0, 1 + w^T x′_j)²). Substituting this expression in the balance constraint in equation 9, a one-dimensional non-linear equation in ν is obtained:

(1/u) Σ_{j=1}^u 1 / (1 + e^{(g_j − 2ν)/T}) = r

The root of this equation is found by the hybrid bisection/Newton-Raphson procedure outlined in the drawing-sheet table above.
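Pulling the pieces together, the following sketch assembles the MFA outer loop as described: alternate the w-update (L2-SVM-MFN on the augmented data of equation 11) with the p-update (root finding for ν), while lowering T. The temperature schedule, inner iteration count, and stopping rule are illustrative assumptions; `l2_svm_mfn` and `solve_nu` are the sketches given earlier:

```python
import numpy as np

def mfa_objective(w, p, X, y, X_p, lam, lam_p, T):
    """Objective of eq. 8 (entropy clipped away from 0/1 for stability)."""
    o, op = X @ w, X_p @ w
    q = np.clip(p, 1e-12, 1 - 1e-12)
    ent = q * np.log(q) + (1 - q) * np.log(1 - q)
    return (lam / 2 * (w @ w)
            + np.mean(np.maximum(0, 1 - y * o) ** 2) / 2
            + lam_p / (2 * len(op)) * np.sum(p * np.maximum(0, 1 - op) ** 2
                                             + (1 - p) * np.maximum(0, 1 + op) ** 2)
            + T / (2 * len(op)) * np.sum(ent))

def mfa_train(X, y, X_prime, lam, lam_p, r, T0=10.0, T_min=1e-3, cool=0.5):
    """Mean field annealing for semi-supervised linear SVMs (eqs. 8-10)."""
    l, d = X.shape
    u = X_prime.shape[0]
    p = np.full(u, r)                       # initial class probabilities
    w = np.zeros(d)
    best_w, best_obj = w, np.inf
    T = T0
    while T > T_min:
        for _ in range(5):                  # alternating minimization at fixed T
            # optimizing w: labeled block + two repeated unlabeled blocks (eq. 11)
            X_hat = np.vstack([X, X_prime, X_prime])
            Y_hat = np.concatenate([y, np.ones(u), -np.ones(u)])
            C_hat = np.concatenate([np.full(l, 1 / l),
                                    lam_p * p / u, lam_p * (1 - p) / u])
            w, _ = l2_svm_mfn(X_hat, Y_hat, C_hat, lam)  # (seeding omitted for brevity)
            # optimizing p: balance-constrained update via the root finder
            p, _ = solve_nu(X_prime @ w, lam_p, T, r)
        obj = mfa_objective(w, p, X, y, X_prime, lam, lam_p, T)
        if obj < best_obj:                  # return the best point on the path
            best_w, best_obj = w, obj
        T *= cool                           # lower the temperature toward 0
    return best_w
```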
8. The method of claim 7, comprising optimizing w for fixed y′, using the finite Newton method.

9. The method of claim 8, wherein a step w̄^(k) = w^(k) + n^(k) in the finite Newton method for determining a weight vector w^(k) in a k-th iteration, with n^(k) being a Newton step, is given by solving the following linear system associated with a weighted linear regularized least squares problem over a data subset defined by a set of active indices j(w^(k)):

[λI + X_{j(w^(k))}^T C_{j(w^(k))} X_{j(w^(k))}] w̄^(k) = X_{j(w^(k))}^T C_{j(w^(k))} Y_{j(w^(k))}

where I is the identity matrix, and wherein once w̄^(k) is obtained, w^(k+1) is obtained from w^(k+1) = w^(k) + δ^(k) n^(k) by setting w^(k+1) = w^(k) + δ*(w̄^(k) − w^(k)) after performing an exact line search for δ^(k), which is by exactly solving a one-dimensional minimization problem:

δ* = argmin_{δ≥0} f(w^(k) + δ(w̄^(k) − w^(k)))

10. The method of claim 9, wherein the modified Newton method uses a conjugate gradient for least squares method to solve a large, sparse, weighted regularized least squares problem.

11. The method of claim 7, comprising optimizing y′ for a fixed w by switching one or more pairs of labels.

12. The method of claim 1, wherein a step w̄^(k) = w^(k) + n^(k) in the finite Newton method for determining a weight vector w^(k) in a k-th iteration, with n^(k) being a Newton step, is given by solving the following linear system associated with a weighted linear regularized least squares problem over a data subset defined by a set of active indices j(w^(k)):

[λI + X_{j(w^(k))}^T C_{j(w^(k))} X_{j(w^(k))}] w̄^(k) = X_{j(w^(k))}^T C_{j(w^(k))} Y_{j(w^(k))}

where I is the identity matrix, and wherein once w̄^(k) is obtained, w^(k+1) is obtained from w^(k+1) = w^(k) + δ^(k) n^(k) by setting w^(k+1) = w^(k) + δ*(w̄^(k) − w^(k)) after performing an exact line search for δ^(k), which is by exactly solving a one-dimensional minimization problem:

δ* = argmin_{δ≥0} f(w^(k) + δ(w̄^(k) − w^(k)))

13. The method of claim 12, wherein the modified Newton method uses a conjugate gradient for least squares method to solve a large, sparse, weighted regularized least squares problem.

14. The method of claim 1, comprising relaxing discrete variables to continuous probability variables in a mean field annealing method, wherein a non-negative temperature parameter T is used to track a global optimum.

15. The method of claim 14, comprising writing a semi-supervised support vector objective as follows, wherein binary valued variables μ_j = (1 + y′_j)/2 are introduced, p_j ∈ (0,1) denote the belief probability that the unlabeled example belongs to the positive class, l is the number of labeled example elements, u is the number of unlabeled example elements, wherein λ and λ′ are regularization parameters, and the Ising model of mean field annealing motivates the following objective function, where the binary variables μ_j are relaxed to probability variables p_j, and entropy terms for the distributions defined by p_j are included:

w*_T = argmin_{w ∈ R^d, {p_j ∈ (0,1)}} (λ/2)‖w‖² + (1/2l) Σ_{i=1}^l max(0, 1 − y_i(w^T x_i))² + (λ′/2u) Σ_{j=1}^u (p_j max(0, 1 − (w^T x′_j))² + (1 − p_j) max(0, 1 + (w^T x′_j))²) + (T/2u) Σ_{j=1}^u (p_j log p_j + (1 − p_j) log(1 − p_j))    (8)

wherein the temperature T parameterizes a family of objective functions, wherein the objective function for a fixed T is minimized under the following class balancing constraint:

(1/u) Σ_{j=1}^u p_j = r

where r is the fraction of the number of unlabeled examples belonging to the positive class.

16. The method of claim 15, wherein the solution to the optimization problem is tracked as the temperature parameter T is lowered to 0, wherein the final solution is given as:

w* = lim_{T→0} w*_T

17. The method of claim 16, comprising optimizing w keeping p fixed, wherein the call to the finite Newton method is made on data X̂ = [X^T X′^T X′^T]^T, whose first l rows are formed by labeled examples, and the next 2u rows are formed by unlabeled examples appearing as two repeated blocks, wherein the associated label vector and cost matrix are given by:

Ŷ = [y_1, y_2, ..., y_l, 1, 1, ..., 1, −1, −1, ..., −1],
Ĉ = diag(1/l, ..., 1/l, λ′p_1/u, ..., λ′p_u/u, λ′(1 − p_1)/u, ..., λ′(1 − p_u)/u)

where y_1, y_2, ..., y_l are labels of the labeled example elements.

18. The method of claim 15, comprising optimizing p for a fixed w, wherein a Lagrangian is constructed using Lagrangian multiplier ν as follows:

B(ν) = (1/u) Σ_{i=1}^u 1/(1 + e^{(g_i − 2ν)/T}) − r