
Deep learning and weak supervision

for image classification

Matthieu Cord
Joint work with Thibaut Durand, Nicolas Thome

Sorbonne Universités - Université Pierre et Marie Curie (UPMC)


Laboratoire d'Informatique de Paris 6 (LIP6) - MLIA Team
UMR CNRS

June 09, 2016

1/35
Outline

Context: Visual classification


1. MANTRA: a latent variable model to boost classification performance
2. WELDON: extension to deep CNNs

2/35
Motivations

Working on datasets with complex scenes (large and cluttered background), objects that are not centered, variable object sizes, ...

VOC07/12 MIT67 15 Scene COCO VOC12 Action

Select relevant regions ⇒ better prediction


ImageNet: centered objects
▶ Efficient transfer: needs bounding boxes [Oquab, CVPR14]

Full annotations expensive ⇒ training with weak supervision

3/35
Motivations
How to learn without bounding boxes?
Multiple-Instance Learning/Latent variables for missing
information [Felzenszwalb, PAMI10]
Latent SVM and extensions => MANTRA
How to learn deep without bounding boxes?
Learning invariance with input image transformations
▶ Spatial Transformer Networks [Jaderberg, NIPS15]

Attention models: to select relevant regions


▶ Stacked Attention Networks for Image Question Answering [Yang, CVPR16]
Parts model
▶ Automatic discovery and optimization of parts for image classification [Parizi, ICLR15]


Deep MIL
▶ Is object localization for free? [Oquab, CVPR15]
▶ Deep extension of MANTRA: WELDON

4/35
Notations

Variable Notation Space Train Test Example


Input x X observed observed image
Output y Y observed unobserved label
Latent h H unobserved unobserved region
Model missing information with latent variables h
Most popular approach in Computer Vision: Latent SVM
[Felzenszwalb, PAMI10] [Yu, ICML09]
5/35
Latent Structural SVM [Yu, ICML09]

Prediction function:

$$(\hat{y}, \hat{h}) = \underset{(y,h)\,\in\,\mathcal{Y}\times\mathcal{H}}{\arg\max} \; \langle w, \Psi(x, y, h)\rangle \qquad (1)$$

▶ $\Psi(x, y, h)$: joint feature map
▶ Joint inference in the $(\mathcal{Y} \times \mathcal{H})$ space
Training: a set of N labeled training pairs $(x_i, y_i)$
▶ Objective function: upper bound of $\Delta(y_i, \hat{y}_i)$

$$\frac{1}{2}\|w\|^2 + \frac{C}{N}\sum_{i=1}^{N}\Big[\max_{(y,h)\in\mathcal{Y}\times\mathcal{H}}\big[\Delta(y_i, y) + \langle w, \Psi(x_i, y, h)\rangle\big] \;-\; \max_{h\in\mathcal{H}}\langle w, \Psi(x_i, y_i, h)\rangle\Big]$$

▶ Difference of convex functions, solved with CCCP
▶ Loss-Augmented Inference (LAI): $\max_{(y,h)\in\mathcal{Y}\times\mathcal{H}} \big[\Delta(y_i, y) + \langle w, \Psi(x_i, y, h)\rangle\big]$
▶ Challenge exacerbated in the latent case: $(\mathcal{Y} \times \mathcal{H})$ space

6/35
MANTRA: Minimum Maximum Latent Structural SVM

Classifying only with the max-scoring latent value is not always relevant

MANTRA model:

Pair of latent variables $(h^+_{i,y}, h^-_{i,y})$
▶ max-scoring latent value: $h^+_{i,y} = \arg\max_{h \in \mathcal{H}} \langle w, \Psi(x_i, y, h)\rangle$
▶ min-scoring latent value: $h^-_{i,y} = \arg\min_{h \in \mathcal{H}} \langle w, \Psi(x_i, y, h)\rangle$
hH
New scoring function:

$$D_w(x_i, y) = \langle w, \Psi(x_i, y, h^+_{i,y})\rangle + \langle w, \Psi(x_i, y, h^-_{i,y})\rangle \qquad (2)$$

Prediction function: find the output with the maximum score

$$\hat{y} = \arg\max_{y \in \mathcal{Y}} D_w(x_i, y) \qquad (3)$$
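For concreteness, a minimal NumPy sketch of the max+min scoring and prediction (my own illustration, not the authors' code); `region_scores` is a hypothetical matrix holding $\langle w, \Psi(x, y, h)\rangle$ for every class $y$ and region $h$:

```python
import numpy as np

def mantra_score(region_scores):
    """D_w(x, y) per class: score of the max-scoring region h+ plus the
    score of the min-scoring region h- (Eq. 2).

    region_scores: array of shape (n_classes, n_regions), entry [y, h]
    standing for <w, Psi(x, y, h)>.
    """
    return region_scores.max(axis=1) + region_scores.min(axis=1)

def mantra_predict(region_scores):
    """Prediction (Eq. 3): the class with the highest max+min score."""
    return int(np.argmax(mantra_score(region_scores)))
```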
7/35
MANTRA: Model & Training Rationale
Intuition of the max+min prediction function
x image, h image region, y image class
$\langle w, \Psi(x, y, h)\rangle$: score of region $h$ for class $y$
$D_w(x, y) = \langle w, \Psi(x, y, h^+_y)\rangle + \langle w, \Psi(x, y, h^-_y)\rangle$
▶ $h^+_y$: presence of class $y$ ⇒ large for $y_i$
▶ $h^-_y$: localized evidence of the absence of class $y$
  ▶ Not too low for $y_i$ ⇒ latent space regularization
  ▶ Low for $y \neq y_i$ ⇒ tracking negative evidence [Parizi, ICLR15]

Example (street image $x$): $D_w(x, \text{street}) = 2$, $D_w(x, \text{highway}) = 0.7$, $D_w(x, \text{coast}) = 1.5$
8/35
MANTRA: Model Training

Learning formulation
Loss function (see the sketch below): $\ell_w(x_i, y_i) = \max_{y \in \mathcal{Y}} \big[\Delta(y_i, y) + D_w(x_i, y)\big] - D_w(x_i, y_i)$

▶ Margin rescaling: upper bound of $\Delta(y_i, \hat{y})$, with constraints:

$$\forall y \neq y_i, \quad \underbrace{D_w(x_i, y_i)}_{\text{score for ground-truth output}} \;\geq\; \underbrace{\Delta(y_i, y)}_{\text{margin}} + \underbrace{D_w(x_i, y)}_{\text{score for other output}}$$

Non-convex optimization problem:

$$\min_w \; \frac{1}{2}\|w\|^2 + \frac{C}{N}\sum_{i=1}^{N} \ell_w(x_i, y_i) \qquad (4)$$

Solver: non-convex one-slack cutting plane [Do, JMLR12]
▶ Fast convergence
▶ Direct optimization, ≠ CCCP for LSSVM
▶ Still needs to solve LAI: $\max_{y} \big[\Delta(y_i, y) + D_w(x_i, y)\big]$
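A minimal sketch of this margin-rescaled loss for the multi-class case with a 0/1 task loss, reusing the hypothetical `mantra_score` helper from the MANTRA slide above (again an illustration, not the released solver):

```python
import numpy as np

def mantra_loss(region_scores, y_true):
    """l_w(x_i, y_i) = max_y [Delta(y_i, y) + D_w(x_i, y)] - D_w(x_i, y_i),
    with Delta the 0/1 loss.

    region_scores: (n_classes, n_regions) matrix of <w, Psi(x_i, y, h)>.
    """
    scores = mantra_score(region_scores)   # D_w(x_i, y) for every class y
    delta = np.ones_like(scores)           # Delta(y_i, y) = 1 for y != y_i
    delta[y_true] = 0.0                    # and 0 for y = y_i
    lai = np.max(delta + scores)           # loss-augmented inference (LAI)
    return lai - scores[y_true]
```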

9/35
MANTRA: Optimization

MANTRA instantiation: define the joint feature map $\Psi(x, y, h)$ and the loss $\Delta(y_i, y)$


Instantiations: binary & multi-class classification, AP ranking
Binary: x = bag (set of regions), y = ±1, h = instance (region),
  Ψ(x, y, h) = y · φ(x, h), Δ(y_i, y) = 0/1 loss, LAI: exhaustive
Multi-class: x = bag (set of regions), y ∈ {1, ..., K}, h = region,
  Ψ(x, y, h) = [1(y=1) φ(x, h), ..., 1(y=K) φ(x, h)], Δ(y_i, y) = 0/1 loss, LAI: exhaustive
AP Ranking: x = set of bags (of regions), y = ranking matrix, h = regions,
  Ψ(x, y, h) = joint latent ranking feature map, Δ(y_i, y) = AP loss, LAI: exact and efficient

Solve inference $\max_y D_w(x_i, y)$ and LAI $\max_y \big[\Delta(y_i, y) + D_w(x_i, y)\big]$
▶ Exhaustive for binary/multi-class classification
▶ Exact and efficient solutions for ranking

10/35
WELDON
Weakly supErvised Learning of Deep cOnvolutional Nets
MANTRA extension for training deep CNNs
Learning $\Psi(x, y, h)$: end-to-end learning of deep CNNs with structured prediction and latent variables
▶ Incorporating multiple positive & negative evidence
▶ Training deep CNNs with structured loss

11/35
Standard deep CNN architecture: VGG16

Simonyan et al. Very deep convolutional networks for large-scale image recognition.
ICLR 2015
12/35
MANTRA adaptation for deep CNN
Problem
Fixed-size image as input

Adapt architecture to weakly supervised learning


1. Fully connected layers → convolution layers
  ▶ sliding window approach (see the sketch below)
2. Spatial aggregation
  ▶ Perform object localization ⇒ prediction
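The talk's implementation is in Torch7; below is a PyTorch-style sketch of the fully-convolutional conversion, assuming a recent torchvision (the function name, the 448×448 example size, and the `weights` argument are my own choices):

```python
import torch.nn as nn
import torchvision

def fully_convolutional_vgg16(n_classes):
    """Turn VGG16's fc layers into convolutions so that an image of any
    size yields a spatial map of class scores (sliding-window behaviour)."""
    vgg = torchvision.models.vgg16(weights="IMAGENET1K_V1")
    features = vgg.features                      # conv1_1 ... pool5
    # fc6 (4096 x 512*7*7) becomes a 7x7 convolution, fc7 a 1x1 convolution.
    fc6 = nn.Conv2d(512, 4096, kernel_size=7)
    fc7 = nn.Conv2d(4096, 4096, kernel_size=1)
    fc6.weight.data = vgg.classifier[0].weight.data.view(4096, 512, 7, 7)
    fc6.bias.data = vgg.classifier[0].bias.data
    fc7.weight.data = vgg.classifier[3].weight.data.view(4096, 4096, 1, 1)
    fc7.bias.data = vgg.classifier[3].bias.data
    # New 1x1 "transfer" layer: one score map per target class.
    scores = nn.Conv2d(4096, n_classes, kernel_size=1)
    return nn.Sequential(features, fc6, nn.ReLU(inplace=True),
                         fc7, nn.ReLU(inplace=True), scores)

# Example: a 448x448 input now produces an (n_classes x 8 x 8) map of
# region scores instead of a single vector of class scores.
```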

13/35
WELDON: deep architecture

C : number of classes
14/35
Aggregation function

[Oquab, 2015]
Region aggregation = max
Select the highest-scoring window

Figure: original image, motorbike feature map, max prediction


Oquab, Bottou, Laptev, Sivic. Is object localization for free? Weakly-supervised learning with convolutional neural networks. CVPR 2015
15/35
WELDON: region aggregation
Aggregation strategy:
max+min pooling (MANTRA prediction function)
k-instances (see the sketch below)
▶ Single region → multiple high-scoring regions:

$$\max \;\rightarrow\; \frac{1}{k}\sum_{i=1}^{k} (i\text{-th max}) \qquad\qquad \min \;\rightarrow\; \frac{1}{k}\sum_{i=1}^{k} (i\text{-th min})$$

▶ More robust region selection [Vasconcelos, CVPR15]

Figure: max / max + min / 3-max + 3-min
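A sketch of this aggregation in PyTorch (my code, not the official release); `score_maps` stands for the (classes × H × W) output of the transfer layer for one image:

```python
import torch

def top_k_plus_low_k_pool(score_maps, k=3):
    """WELDON-style spatial aggregation: average of the k highest-scoring
    and the k lowest-scoring regions, per class.

    score_maps: tensor of shape (n_classes, H, W) of region scores.
    """
    flat = score_maps.flatten(start_dim=1)            # (n_classes, H*W)
    top_k = flat.topk(k, dim=1, largest=True).values
    low_k = flat.topk(k, dim=1, largest=False).values
    return top_k.mean(dim=1) + low_k.mean(dim=1)      # (n_classes,)
```

Since `torch.topk` routes gradients only to the selected entries, the same function can be used unchanged during training (next slides).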


16/35
WELDON: architecture

17/35
WELDON: learning
Objective function for the multi-class task and k = 1:

$$\min_w \; R(w) + \frac{1}{N}\sum_{i=1}^{N} \ell\big(f_w(x_i), y_i^{gt}\big)$$

$$f_w(x_i) = \arg\max_{y} \Big[ \max_{h} L^{w}_{\text{conv}}(x_i, y, h) + \min_{h'} L^{w}_{\text{conv}}(x_i, y, h') \Big]$$

How to learn the deep architecture?

Stochastic gradient descent training.
Back-propagation of the error through the selected windows (see the sketch below).
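A hypothetical training step tying the pieces together; for simplicity it uses a plain cross-entropy on the aggregated scores instead of the structured / ranking losses discussed earlier, reuses the `top_k_plus_low_k_pool` sketch from the aggregation slide, and assumes `model` returns (batch × classes × H × W) score maps:

```python
import torch
import torch.nn.functional as F

def train_step(model, optimizer, images, labels, k=3):
    optimizer.zero_grad()
    maps = model(images)                                   # (B, C, H, W)
    scores = torch.stack([top_k_plus_low_k_pool(m, k) for m in maps])
    # Illustrative surrogate loss on the aggregated scores (placeholder
    # for the structured losses used in the talk); labels: (B,) class ids.
    loss = F.cross_entropy(scores, labels)
    loss.backward()   # gradients reach only the selected (max/min) windows
    optimizer.step()
    return loss.item()
```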
18/35
WELDON: learning

Class is present
Increase the scores of the selected windows.

Figure: Car map

19/35
WELDON: learning

Class is absent
Decrease the scores of the selected windows.

Figure: Boat map

20/35
Experiments

VGG16 pre-trained on ImageNet


Torch7 implementation

Datasets
Object recognition: Pascal VOC 2007, Pascal VOC 2012
Scene recognition: MIT67, 15 Scene
Visual recognition, where context plays an important role:
COCO, Pascal VOC 2012 Action

VOC07/12 MIT67 15 Scene COCO VOC12 Action


21/35
Experiments

Dataset         Train    Test     Classes   Classification

VOC07           5,000    5,000    20        multi-label
VOC12           5,700    5,800    20        multi-label
15 Scene        1,500    2,985    15        multi-class
MIT67           5,360    1,340    67        multi-class
VOC12 Action    2,000    2,000    10        multi-label
COCO            80,000   40,000   80        multi-label

22/35
Experiments

Multi-scale: 8 scales (combination with Object Bank strategy)
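A simplified sketch of multi-scale evaluation (plain averaging of per-scale scores; the actual 8-scale / Object Bank combination in the talk may differ, and the scale values and names here are mine). It again reuses the hypothetical `top_k_plus_low_k_pool` helper:

```python
import torch
import torch.nn.functional as F

def multi_scale_scores(model, image, scales=(0.7, 1.0, 1.4), k=3):
    """Evaluate the fully convolutional model at several image scales and
    average the aggregated class scores. image: tensor of shape (3, H, W)."""
    per_scale = []
    for s in scales:
        size = [int(image.shape[-2] * s), int(image.shape[-1] * s)]
        resized = F.interpolate(image.unsqueeze(0), size=size,
                                mode="bilinear", align_corners=False)
        maps = model(resized)[0]                      # (n_classes, H', W')
        per_scale.append(top_k_plus_low_k_pool(maps, k))
    return torch.stack(per_scale).mean(dim=0)         # (n_classes,)
```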

23/35
Object recognition

                           VOC 2007   VOC 2012
VGG16 (online code) [1]    84.5       82.8
SPP net [2]                82.4       –
Deep WSL MIL [3]           –          81.8
WELDON                     90.2       88.5

Table: mAP results on object recognition datasets.

[1] Simonyan et al. Very deep convolutional networks. ICLR 2015


[2] He et al. Spatial pyramid pooling in deep convolutional networks. ECCV 2014
[3] Oquab et al. Is object localization for free? CVPR 2015
24/35
Scene recognition

                           15 Scene   MIT67
VGG16 (online code) [1]    91.2       69.9
MOP CNN [2]                –          68.9
Negative parts [3]         –          77.1
WELDON                     94.3       78.0

Table: multi-class accuracy results on scene categorization datasets.

[1] Simonyan et al. Very deep convolutional networks. ICLR 2015


[2] Gong et al. Multi-scale Orderless Pooling of Deep Convolutional Activation
Features. ECCV 2014
[3] Parizi et al. Automatic discovery and optimization of parts. ICLR 2015
25/35
Context datasets

                             VOC 2012 Action   COCO
VGG16 (online code) [1]      67.1              59.7
Deep WSL MIL [2]             –                 62.8
WELDON (our WSL deep CNN)    75.0              68.8

Table: mAP results on context datasets.

[1] Simonyan et al. Very deep convolutional networks. ICLR 2015


[2] Oquab et al. Is object localization for free? CVPR 2015

26/35
Visual results

Aeroplane model (1.8) Bus model (-0.4)

27/35
Visual results

Motorbike model (1.1) Sofa model (-0.8)

28/35
Visual results

Sofa model (1.2) Horse model (-0.6)

29/35
Visual results (failing examples)

Buffet Restaurant kitchen

30/35
Visual results (failing examples)

Kindergarden Classroom

31/35
Analysis
Impact of the different improvements
a) max b) +k=3 c) +min d) +AP VOC07 VOC12 action
X 83.6 53.5
X X 86.3 62.6
X X 87.5 68.4
X X X 88.4 71.7
X X X 87.8 69.8
X X X X 88.9 72.6

WSL detection results on VOC 2012 Action

       max (a) [Oquab, 2015]   WELDON

IoU    25.6                    30.4

32/35
Analysis
Impact of the number of regions k

Figure: k = 1 / k = 3

33/35
Connections to others Latent Variables Models
Hidden CRF (HCRF) [Quattoni, PAMI07]

$$\frac{1}{2}\|w\|^2 + \frac{C}{N}\sum_{i=1}^{N}\Big[\log\!\!\sum_{(y,h)\in\mathcal{Y}\times\mathcal{H}}\!\!\exp\langle w, \Psi(x_i, y, h)\rangle \;-\; \log\sum_{h\in\mathcal{H}}\exp\langle w, \Psi(x_i, y_i, h)\rangle\Big]$$

Latent Structural SVM (LSSVM) [Yu, ICML09]

$$\frac{1}{2}\|w\|^2 + \frac{C}{N}\sum_{i=1}^{N}\Big[\max_{(y,h)\in\mathcal{Y}\times\mathcal{H}}\big\{\Delta(y_i, y) + \langle w, \Psi(x_i, y, h)\rangle\big\} \;-\; \max_{h\in\mathcal{H}}\langle w, \Psi(x_i, y_i, h)\rangle\Big]$$

Marginal Structural SVM (MSSVM) [Ping, ICML14]

$$\frac{1}{2}\|w\|^2 + \frac{C}{N}\sum_{i=1}^{N}\Big[\max_{y}\Big\{\Delta(y_i, y) + \log\sum_{h\in\mathcal{H}}\exp\langle w, \Psi(x_i, y, h)\rangle\Big\} \;-\; \log\sum_{h\in\mathcal{H}}\exp\langle w, \Psi(x_i, y_i, h)\rangle\Big]$$

WELDON

$$\frac{1}{2}\|w\|^2 + \frac{C}{N}\sum_{i=1}^{N}\Big[\max_{y}\big\{\Delta(y_i, y) + D_w(x_i, y)\big\} \;-\; D_w(x_i, y_i)\Big] \quad \text{with } D_w \text{ as in Eq. (2)}$$
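To make the comparison concrete, a small illustration (my own, not from the talk) of how each family aggregates the latent scores $\langle w, \Psi(x, y, h)\rangle$ of one class into a single value; WELDON is shown with k = 1, i.e. max + min as in Eq. (2):

```python
import numpy as np

def aggregate(latent_scores, model="weldon"):
    """Aggregate the region scores <w, Psi(x, y, h)> of one class y."""
    s = np.asarray(latent_scores, dtype=float)
    if model == "lssvm":            # hard max over latent variables
        return s.max()
    if model in ("hcrf", "mssvm"):  # soft-max (log-sum-exp) over h
        m = s.max()
        return m + np.log(np.exp(s - m).sum())
    if model == "weldon":           # top instance + negative evidence
        return s.max() + s.min()
    raise ValueError(f"unknown model: {model}")
```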
34/35
Thibaut Durand Nicolas Thome Matthieu Cord

MLIA Team (Patrick Gallinari)


Sorbonne Universités - UPMC Paris 6 - LIP6

MANTRA project page


http://webia.lip6.fr/~durandt/project/mantra.html

Thibaut Durand, Nicolas Thome, and Matthieu Cord.


MANTRA: Minimum Maximum LSSVM for Image Classification and Ranking.
In IEEE International Conference on Computer Vision (ICCV), 2015.
Thibaut Durand, Nicolas Thome, and Matthieu Cord.
WELDON: Weakly Supervised Learning of Deep Convolutional Neural Networks.
In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
35/35
