Ada Boost

ESCUELA SUPERIOR POLITÉCNICA DEL LITORAL
Classification Using AdaBoost
Andrés G. Abad, Ph.D.
Andrés G. Abad, Ph.D., agabad@espol.edu.ec 1 / 24

Agenda
Introduction to Boosting
The AdaBoost Algorithm
References

Introduction to Boosting I
- Kearns and Valiant [1989] pose the question of whether two complexity classes, weak learners and strong learners, are equal.
- Schapire [1990] answers that question in the affirmative, and his proof is constructive: boosting.

Introduction to Boosting II
Basic idea: An adaptive combination of poor learners with sufficient diversity leads to an excellent (complex) classifier.
Base class: $C$, a base class of simple classifiers (e.g., perceptrons, decision stumps, axis-parallel splits).
Output classifier: $\hat{c}_B(x) = \operatorname{sgn}\left(\sum_{b=1}^{B} \alpha_b c_b(x)\right)$, with $c_b \in C$.
Idea outline: Train a sequence of simple classifiers on modified data distributions, and form a weighted average.
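The output classifier above is a weighted vote, which can be sketched in a few lines of Python. This is a minimal illustration, not the author's code: the helper names `stump` and `ensemble_predict`, the three stumps, and the weights are all hypothetical, and breaking ties toward +1 is an assumption.

```python
import numpy as np

def stump(feature, threshold, polarity=1):
    """Return a decision stump c(x) in {-1, +1} that thresholds one feature."""
    def c(X):
        return polarity * np.where(X[:, feature] > threshold, 1, -1)
    return c

def ensemble_predict(classifiers, alphas, X):
    """Weighted vote: sgn(sum_b alpha_b * c_b(x))."""
    scores = sum(a * c(X) for a, c in zip(alphas, classifiers))
    # Ties (score exactly 0) are sent to +1; this convention is an assumption.
    return np.where(scores >= 0, 1, -1)

# Hypothetical stumps and weights, chosen only for illustration.
classifiers = [stump(0, 0.5), stump(1, 0.5), stump(0, 1.5, polarity=-1)]
alphas = [0.4, 0.6, 0.9]
X = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]])
predictions = ensemble_predict(classifiers, alphas, X)
print(predictions)  # [-1  1 -1]
```

On the first point the two low-weight stumps vote -1 and are nearly, but not quite, outvoted by the single high-weight stump, which shows how the weights $\alpha_b$ decide the outcome.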

Introduction to Boosting III
Suppose that $h_1, \ldots, h_T$ are weak classifiers used to approximate a function $f : \mathbb{R}^k \to \{-1, +1\}$, such that each $h$ satisfies
$$\epsilon = P[h(x) \neq f(x)] = 0.5 - \gamma \quad \text{for } x \in X,\ \gamma > 0.$$
Weak classifiers [Viola and Jones, 2001]

Introduction to Boosting IV
[Figure: ROC curve for a classifier with 200 features (Viola and Jones).]
[Figure: output of the face detector on a number of test images from the MIT + CMU test set (Viola and Jones).]
Introduction to Boosting V


Introduction to AdaBoost I
[Figure: each weak learner receives $Z$ and one of the distributions $P_1, \ldots, P_B$ and produces a classifier $c_1, \ldots, c_B$; the classifiers are weighted by $\alpha_1, \ldots, \alpha_B$ and summed to give $\hat{c}_B(x)$.]
- Building on Schapire [1990], Freund and Schapire [1996] introduce the AdaBoost (ADAptive BOOSTing) algorithm.
- Freund and Schapire [1997] give the first extension of AdaBoost to the regression problem.

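Error Reduction in AdaBoost I
Let $\epsilon_t = \frac{1}{2} - \gamma_t$ be the training error of $h_t$. Then it can be shown that the training error $\epsilon_H$ of the combined classifier $H$ satisfies
$$\epsilon_H \le \prod_t \left[ 2\sqrt{\epsilon_t(1-\epsilon_t)} \right] = \prod_t \sqrt{1 - 4\gamma_t^2} \le \exp\left(-2\sum_t \gamma_t^2\right).$$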
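The bound can be checked numerically. The sketch below evaluates both sides for an assumed list of weak-learner errors; the function name `bound_terms` and the error values are hypothetical and serve only to illustrate that the product never exceeds the exponential.

```python
import math

def bound_terms(errors):
    """Return (prod_t 2*sqrt(e_t*(1-e_t)), exp(-2*sum_t gamma_t^2))."""
    product = math.prod(2.0 * math.sqrt(e * (1.0 - e)) for e in errors)
    gammas = [0.5 - e for e in errors]  # edge of each weak learner over chance
    exponential = math.exp(-2.0 * sum(g * g for g in gammas))
    return product, exponential

# Hypothetical weighted errors of four weak classifiers.
errors = [0.30, 0.25, 0.20, 0.35]
product, exponential = bound_terms(errors)
print(product, exponential)

# With a constant edge gamma = 0.1, the exponential bound shrinks as T grows.
for T in (10, 50, 100):
    print(T, bound_terms([0.4] * T)[1])
```

The inequality holds term by term because $\sqrt{1 - 4\gamma_t^2} \le e^{-2\gamma_t^2}$, which follows from $1 - x \le e^{-x}$.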

Error Reduction in AdaBoost II
The superiority of AdaBoost has been demonstrated empirically [Freund and Schapire, 1996; Bauer and Kohavi, 1999; Dietterich, 2000b].
[Figure: comparison of test error between C4.5 and boosting decision stumps, and between C4.5 and boosting C4.5, respectively [Freund and Schapire, 1999].]

The AdaBoost Algorithm I
Initialize: $D_1(i) = 1/m$ for $i = 1, \ldots, m$.
For $t = 1, \ldots, T$:
1. Train the weak hypothesis $h_t : X \to \{-1, +1\}$ using the distribution $D_t$.
2. Evaluate the weighted error: $\epsilon_t = \Pr_{i \sim D_t}[h_t(x_i) \neq y_i]$.
3. Choose $\alpha_t = \frac{1}{2}\ln\left(\frac{1-\epsilon_t}{\epsilon_t}\right)$.
4. Update, for $i = 1, \ldots, m$: $D_{t+1}(i) = \frac{D_t(i)\exp(-\alpha_t y_i h_t(x_i))}{Z_t}$, where $Z_t$ is the normalization factor.
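The four steps can be sketched in NumPy with decision stumps as the weak hypotheses. This is an illustrative sketch under several assumptions that are not in the slides: the weak learner is an exhaustive stump search, the error is clipped to avoid a division by zero, the final prediction is the sign of $\sum_t \alpha_t h_t(x)$ as in the appendix, and all function names and the toy data are hypothetical.

```python
import numpy as np

def stump_predict(X, j, thr, pol):
    """Decision stump: pol * (+1 if X[:, j] > thr else -1)."""
    return pol * np.where(X[:, j] > thr, 1, -1)

def fit_stump(X, y, D):
    """Exhaustive search for the stump with the lowest D-weighted error."""
    best = None
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for pol in (1, -1):
                err = D[stump_predict(X, j, thr, pol) != y].sum()
                if best is None or err < best[0]:
                    best = (err, j, thr, pol)
    return best

def adaboost(X, y, T):
    m = len(y)
    D = np.full(m, 1.0 / m)                        # D_1(i) = 1/m
    model = []
    for _ in range(T):
        err, j, thr, pol = fit_stump(X, y, D)      # steps 1 and 2
        err = min(max(err, 1e-12), 1.0 - 1e-12)    # numerical guard (assumption)
        alpha = 0.5 * np.log((1.0 - err) / err)    # step 3
        h = stump_predict(X, j, thr, pol)
        D = D * np.exp(-alpha * y * h)             # step 4, numerator
        D = D / D.sum()                            # division by Z_t
        model.append((alpha, j, thr, pol))
    return model

def predict(model, X):
    """Sign of the weighted sum of the stumps."""
    score = sum(a * stump_predict(X, j, thr, pol) for a, j, thr, pol in model)
    return np.where(score >= 0, 1, -1)

# Hypothetical 1-D data that no single stump can separate.
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([1, 1, -1, -1, 1, 1])
model = adaboost(X, y, T=3)
train_error = np.mean(predict(model, X) != y)
print(train_error)  # 0.0
```

Each single stump misclassifies at least two of the six points here, yet three rounds of reweighting are enough for the weighted vote to fit the training set.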

The AdaBoost Algorithm II
$\alpha_1 = 0.42$, $\alpha_2 = 0.65$, $\alpha_3 = 0.92$
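The slide gives only the three weights. As a rough consistency check, the rule $\alpha_t = \frac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_t}$ can be inverted to recover the weighted error each weight implies. The function name `implied_error` is hypothetical, and the recovered errors are derived from the rounded weights, not taken from the source.

```python
import math

def implied_error(alpha):
    """Invert alpha = 0.5*ln((1-eps)/eps), giving eps = 1/(1+exp(2*alpha))."""
    return 1.0 / (1.0 + math.exp(2.0 * alpha))

alphas = [0.42, 0.65, 0.92]
implied = [implied_error(a) for a in alphas]
for a, e in zip(alphas, implied):
    print(f"alpha = {a:.2f}  ->  eps = {e:.3f}")
```

The implied errors come out at roughly 0.30, 0.21 and 0.14: a larger weight corresponds to a weak classifier with a smaller weighted error.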

References I
Bauer, E. and Kohavi, R. (1999). An Empirical Comparison of Voting Classification Algorithms:
Bagging, Boosting, and Variants. Machine Learning, 36(1-2):105–139.
Breiman, L. (1996). Bagging predictors. Machine Learning, 24(2):123–140.
Dietterich, T. G. (2000a). Ensemble Methods in Machine Learning. In Multiple Classifier Systems,
number 1857 in Lecture Notes in Computer Science, pages 1–15. Springer Berlin Heidelberg.
Dietterich, T. G. (2000b). An Experimental Comparison of Three Methods for Constructing
Ensembles of Decision Trees: Bagging, Boosting, and Randomization. Machine Learning,
40(2):139–157.
Freund, Y. and Schapire, R. (1996). Experiments with a New Boosting Algorithm. pages 148–156.
Freund, Y. and Schapire, R. (1999). A short introduction to boosting. Japanese Society for Artificial
Intelligence, 14(5):771–780.
Freund, Y. and Schapire, R. E. (1997). A Decision-Theoretic Generalization of on-Line Learning and
an Application to Boosting.
Kearns, M. and Valiant, L. (1989). Cryptographic Limitations on Learning Boolean Formulae and
Finite Automata.
Krogh, A. and Vedelsby, J. (1995). Neural Network Ensembles, Cross Validation, and Active
Learning. In Advances in Neural Information Processing Systems, pages 231–238. MIT Press.
References II
Quinlan, J. R. (1986). Induction of Decision Trees. Machine Learning, 1(1):81–106.

Quinlan, J. R. (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers Inc.,
San Francisco, CA, USA.
Schapire, R. E. (1990). The strength of weak learnability. Machine Learning, 5(2):197–227.
Ueda, N. and Nakano, R. (1996). Generalization error of ensemble estimators. In IEEE
International Conference on Neural Networks, 1996, volume 1, pages 90–95.
Viola, P. and Jones, M. (2001). Rapid object detection using a boosted cascade of simple features.
In Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern
Recognition, 2001. CVPR 2001, volume 1, pages I–511–I–518 vol.1.

Appendix
General Description I
AdaBoost is a form of gradient-based optimization in hypothesis space whose goal is to minimize the exponential loss function
$$\ell_{\exp}(f, H \mid D) = E_{x \sim D}\left[e^{-f(x)H(x)}\right]$$
for
$$H(x) = \sum_{t=1}^{T} \alpha_t h_t(x).$$

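General Description II
Minimizing the exponential loss function $\ell_{\exp}(f, H \mid D)$ pointwise in $H(x)$, and taking the expectation over $f(x)$ given $x$ in the second step, we have
$$\frac{\partial e^{-f(x)H(x)}}{\partial H(x)} = -f(x)e^{-f(x)H(x)} = -e^{-H(x)}P(f(x) = +1 \mid x) + e^{H(x)}P(f(x) = -1 \mid x) = 0.$$
Solving,
$$H(x) = \frac{1}{2}\ln\frac{P(f(x) = +1 \mid x)}{P(f(x) = -1 \mid x)}.$$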
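The closed-form minimizer can be compared against a brute-force search. The sketch below fixes a hypothetical value $p = P(f(x) = +1 \mid x) = 0.8$ at a single point $x$ and minimizes the pointwise expected loss $p\,e^{-H} + (1-p)\,e^{H}$ over a grid; the names and the grid are illustrative choices.

```python
import numpy as np

def pointwise_loss(H, p):
    """Expected exponential loss at a point x where P(f(x) = +1 | x) = p."""
    return p * np.exp(-H) + (1.0 - p) * np.exp(H)

p = 0.8  # hypothetical conditional probability of the positive class
H_closed = 0.5 * np.log(p / (1.0 - p))             # half the log-odds

grid = np.linspace(-3.0, 3.0, 60001)               # step of 1e-4
H_grid = grid[np.argmin(pointwise_loss(grid, p))]  # brute-force minimizer

print(H_closed, H_grid)
```

Both values agree up to the grid resolution, and the minimizer is positive exactly when $p > 1/2$, which is what makes $\operatorname{sign}(H(x))$ pick the more probable class.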

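General Description III
Since
$$\begin{aligned}
\operatorname{sign}(H(x)) &= \operatorname{sign}\left(\frac{1}{2}\ln\frac{P(f(x) = +1 \mid x)}{P(f(x) = -1 \mid x)}\right) \\
&= \begin{cases} 1 & \text{if } P(f(x) = +1 \mid x) > P(f(x) = -1 \mid x), \\ -1 & \text{if } P(f(x) = +1 \mid x) < P(f(x) = -1 \mid x) \end{cases} \\
&= \arg\max_{y \in \{-1,+1\}} P(f(x) = y \mid x),
\end{aligned}$$
it follows that $\operatorname{sign}(H(x))$ attains the Bayes error rate.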

General Description IV
For $t = 1, \ldots, T$:
1. Train the weak hypothesis $h_t : X \to \{-1, +1\}$ using the distribution $D_t$.
Obtain $H(x) = \sum_{i=1}^{T} \alpha_i h_i(x)$.
To fully define AdaBoost, we need to specify:
- How to determine the distributions $D_t$
- How to determine the weights $\alpha_t$

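General Description V
The classifier $h_t$ that corrects the errors of $H_{t-1}$ must minimize the exponential loss function
$$\begin{aligned}
\ell_{\exp}(H_{t-1} + h_t \mid D) &= E_{x \sim D}\left[e^{-f(x)(H_{t-1}(x) + h_t(x))}\right] \\
&\approx E_{x \sim D}\left[e^{-f(x)H_{t-1}(x)}\left(1 - f(x)h_t(x) + \frac{f(x)^2 h_t(x)^2}{2}\right)\right] \\
&= E_{x \sim D}\left[e^{-f(x)H_{t-1}(x)}\left(1 - f(x)h_t(x) + \frac{1}{2}\right)\right],
\end{aligned}$$
where the approximation is the second-order Taylor expansion of $e^{-f(x)h_t(x)}$ and the last step uses $f(x)^2 = h_t(x)^2 = 1$.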

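General Description VI
The ideal classifier $h_t$ is such that
$$\begin{aligned}
h_t(x) &= \arg\min_h \ell_{\exp}(H_{t-1} + h \mid D) \\
&\approx \arg\min_h E_{x \sim D}\left[e^{-f(x)H_{t-1}(x)}\left(1 - f(x)h(x) + \frac{f(x)^2 h(x)^2}{2}\right)\right] \\
&= \arg\max_h E_{x \sim D}\left[e^{-f(x)H_{t-1}(x)} f(x)h(x)\right] \\
&= \arg\max_h E_{x \sim D}\left[\frac{e^{-f(x)H_{t-1}(x)}}{E_{x \sim D}\left[e^{-f(x)H_{t-1}(x)}\right]} f(x)h(x)\right] \\
&= \arg\max_h E_{x \sim D_t}\left[f(x)h(x)\right] \\
&= \arg\min_h E_{x \sim D_t}\left[I(f(x) \neq h(x))\right]
\end{aligned}$$
for
$$D_t(x) = \frac{D(x)e^{-f(x)H_{t-1}(x)}}{E_{x \sim D}\left[e^{-f(x)H_{t-1}(x)}\right]}.$$
The last step uses $f(x)h(x) = 1 - 2I(f(x) \neq h(x))$.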
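The closed form for $D_t$ agrees with the step-by-step update $D_{t+1}(i) = D_t(i)\exp(-\alpha_t y_i h_t(x_i))/Z_t$ used in the algorithm, because $H_t = H_{t-1} + \alpha_t h_t$ and the exponentials multiply. The sketch below checks this numerically on random data; the labels, weak-classifier outputs and weights are all hypothetical draws, and $D$ is assumed uniform.

```python
import numpy as np

rng = np.random.default_rng(0)
m, T = 8, 4
f = rng.choice([-1, 1], size=m)          # labels f(x_i)
h = rng.choice([-1, 1], size=(T, m))     # hypothetical outputs h_t(x_i)
alphas = rng.uniform(0.1, 1.0, size=T)   # hypothetical weights alpha_t

D = np.full(m, 1.0 / m)                  # base distribution D (uniform, assumed)

# Recursive form: reweight and renormalize after every round.
D_rec = D.copy()
for t in range(T):
    D_rec = D_rec * np.exp(-alphas[t] * f * h[t])
    D_rec = D_rec / D_rec.sum()          # division by Z_t

# Closed form: D(x) * exp(-f(x) * H(x)), normalized once at the end.
H = alphas @ h                           # H(x_i) = sum_t alpha_t * h_t(x_i)
D_closed = D * np.exp(-f * H)
D_closed = D_closed / D_closed.sum()

print(np.allclose(D_rec, D_closed))  # True
```

Normalizing once at the end or once per round gives the same distribution, since each $Z_t$ is a constant that does not depend on $i$.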

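General Description VII
Under a distribution $D_t$, the weight $\alpha_t$ is chosen by minimizing the exponential loss function
$$\begin{aligned}
\ell_{\exp}(f, \alpha_t h_t \mid D_t) &= E_{x \sim D_t}\left[e^{-f(x)\alpha_t h_t(x)}\right] \\
&= E_{x \sim D_t}\left[e^{-\alpha_t} I(f(x) = h_t(x)) + e^{\alpha_t} I(f(x) \neq h_t(x))\right] \\
&= e^{-\alpha_t} P_{x \sim D_t}(f(x) = h_t(x)) + e^{\alpha_t} P_{x \sim D_t}(f(x) \neq h_t(x)) \\
&= e^{-\alpha_t}(1 - \epsilon_t) + e^{\alpha_t}\epsilon_t,
\end{aligned}$$
where $\epsilon_t = P_{x \sim D_t}(f(x) \neq h_t(x))$.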

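General Description VIII
To obtain the optimal $\alpha_t$, we set
$$\frac{\partial \ell_{\exp}(f, \alpha_t h_t \mid D_t)}{\partial \alpha_t} = -e^{-\alpha_t}(1 - \epsilon_t) + e^{\alpha_t}\epsilon_t = 0,$$
whose solution is
$$\alpha_t = \frac{1}{2}\ln\left(\frac{1 - \epsilon_t}{\epsilon_t}\right).$$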
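A quick numerical check of this solution: for an assumed weighted error $\epsilon_t = 0.25$, the loss $e^{-\alpha}(1-\epsilon_t) + e^{\alpha}\epsilon_t$ evaluated at the closed-form $\alpha_t$ should be lower than at nearby values. The function name and the chosen error are hypothetical.

```python
import math

def weighted_exp_loss(alpha, eps):
    """e^{-alpha} * (1 - eps) + e^{alpha} * eps."""
    return math.exp(-alpha) * (1.0 - eps) + math.exp(alpha) * eps

eps = 0.25  # hypothetical weighted error of the weak classifier
alpha_star = 0.5 * math.log((1.0 - eps) / eps)
loss_star = weighted_exp_loss(alpha_star, eps)

# Perturbing alpha in either direction should only increase the loss.
neighbours = [weighted_exp_loss(alpha_star + d, eps) for d in (-0.1, -0.01, 0.01, 0.1)]
print(alpha_star, loss_star, min(neighbours))
```

At the optimum the loss equals $2\sqrt{\epsilon_t(1-\epsilon_t)}$, which is the per-round factor that appears in the training-error bound on the slide "Error Reduction in AdaBoost I".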
