Machine Learning
ABSTRACT
We describe a practical impediment to the application of
deep neural network models when large training datasets
are unavailable. Encouragingly however, we show that recent
machine learning advances make it possible to obtain the
benefits of deep neural networks by making more efficient use
of training data that most practitioners do have.

To learn about Quadrant, please email us at info@quadrant.ai
(a) When the model has degree 2 (with 3 parameters) we find excellent agreement with the true model even when training on only 20 points. (b) A more flexible model of degree 8 (with 9 parameters), when fit to the same 20 data points, overfits and inappropriately flattens for x > 0.5. (c) However, when trained with sufficient data (200 training examples), the more flexible degree-8 model closely approximates the true model.
Figure 1: Learning on a small data set of 20 (a and b) or 200 (c) example (x, y) pairs, where x is the
observed variable and y is the unobserved variable to be estimated. The true data is generated
according to a parabolic model with added noise.
Discriminative models are trained using data that includes both x and y . If there
are D such examples, we denote this training set as D = {(x_d, y_d)}_{d=1}^{D}. We call this a
labeled data set because every x has been labeled with its associated y value. It
is difficult to acquire large labeled data sets because the y labels often have to be
generated by humans. Large labeled data sets are typically costly to obtain, but —
until now — absolutely necessary to prevent overfitting.
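The overfitting behavior of Fig. 1 is easy to reproduce. The sketch below (assuming a hypothetical true model y = x² with Gaussian noise, and using numpy's polynomial fitting in place of a learned model) compares degree-2 and degree-8 fits on 20 and 200 training points:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    # True model: a parabola with added Gaussian noise (assumed for illustration)
    x = rng.uniform(-1, 1, n)
    y = x**2 + rng.normal(0, 0.1, n)
    return x, y

x_small, y_small = make_data(20)    # small labeled dataset
x_big, y_big = make_data(200)       # larger labeled dataset
x_test = np.linspace(-1, 1, 100)    # held-out points for evaluation
y_true = x_test**2                  # noiseless ground truth

for n_train, (x, y) in [(20, (x_small, y_small)), (200, (x_big, y_big))]:
    for degree in (2, 8):
        coeffs = np.polyfit(x, y, degree)
        mse = np.mean((np.polyval(coeffs, x_test) - y_true) ** 2)
        print(f"n={n_train:3d} degree={degree}: test MSE = {mse:.4f}")
```

With few points, the flexible degree-8 model tends to track the noise, while the degree-2 model stays close to the truth; given enough data, both recover the parabola.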
Models for P (x) (inherent in generative algorithms) discover hidden structure (e.g.
gender in the previous example) in the observed data. For example, there may be
clusters or categories or hierarchical relationships among observed variables. The
structure in observed data can be leveraged for semi-supervised learning, as the following example illustrates.
Figure 2: Labeled data (red and green) along with unlabeled examples (black). We seek the class label of the point centered within the open blue circle.

Figure 3: (a) The density of sampled data points lying along two spirals, which we learn using a DVAE. The DVAE utilizes two hidden variables z_1 and z_2: z_1 is binary-valued and z_2 is continuous. The DVAE learns a representation where z_1 identifies the two spirals and z_2 measures arc length along the spiral. The neural network learns to map the arc length into rectilinear coordinates x_1 and x_2. (b) The autoencoding map, where each data point is mapped into the latent space via P(z | x) and then back via P(x | z). (c) The DVAE has also learned to accurately interpolate into the gaps in the training data along the spirals.
In Fig. 3(a) we show the operation of the DVAE without any labeled data
whatsoever. Fig. 3(b) shows the samples that result from the autoencoding
steps of mapping each point x into the latent space z using q(z | x), and then
mapping the resulting z back to the data space. A binary latent label z_1 is
indicated by the color (red or green). The autoencoding process brings each
initial x back very close to its original position and annotates each data point
with its z 1 label. The DVAE uses the discrete latent variable z 1 to label the two
spirals. In this way, learning of the appropriate class labels only requires one labeled
training example from each of the two spirals. These two labeled examples are
sufficient to associate the inferred latent z 1 with the proper class label. Fig. 3(c)
demonstrates that the generative mode of the DVAE, obtained by sampling from
P (z) and mapping these latent variables into x values through P (x | z) , recovers the
complete spirals and accurately interpolates into the gaps along each spiral that
were purposefully left within the training data.
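The label-propagation step described above can be isolated in a short sketch. Everything here is a hypothetical stand-in: instead of running a trained DVAE, we simulate the inferred binary latent z_1 directly, and show that a single labeled example per spiral suffices to attach class names to both latent values:

```python
import numpy as np

rng = np.random.default_rng(1)

# Suppose a trained DVAE has inferred a binary latent z1 for each of N points
# via q(z | x). Here we simulate those assignments directly (stand-in for
# actual DVAE inference).
N = 400
z1 = rng.integers(0, 2, N)

# One labeled training example per spiral: (index into the data, class name).
# The class names are illustrative placeholders.
labeled = [(int(np.flatnonzero(z1 == 0)[0]), "spiral_A"),
           (int(np.flatnonzero(z1 == 1)[0]), "spiral_B")]

# Associate each latent value with the class of its labeled representative
latent_to_class = {z1[i]: name for i, name in labeled}

# Propagate class labels to every unlabeled point through the discrete latent
predicted = [latent_to_class[z] for z in z1]
```

The heavy lifting (separating the spirals) is done unsupervised by the latent variable; the two labeled examples only name the clusters.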
This example demonstrates the principles behind the DVAE. The same algorithm
has been scaled up to much larger and more complex datasets. The DVAE
is currently one of the world-leading methods on the MNIST, OmniGlot, and
Caltech-101 Silhouettes benchmark datasets [Rol16].
Quadrant algorithms for robust learning with noisy data exploit both noisy
and clean data sources. Let D_N = {(x_d^(n), y_d^(n))}_{d=1}^{D_N} and D_C = {(x_d^(c), ỹ_d^(c), y_d^(c))}_{d=1}^{D_C} denote
the datasets of noisy and clean data, respectively. Within the clean dataset, y_d^(c) denotes the d-th
noisy label and ỹ_d^(c) denotes the cleaned version of that label. D_C is
used to learn correlations between clean and noisy labels. We train a conditional
random field (CRF) P(y, ỹ, h | x) by maximizing the likelihood of both datasets
D_N and D_C. The CRF relies on two components. First, a neural network learns
to map each input x to its expected clean and noisy labels. Second, and
importantly, the model accounts for correlations (either positive or negative)
between noisy and clean labels. These correlations adjust the predictions of
the neural network. The addition of this robust CRF layer to a neural network
improves the network’s predictive accuracy leading to state-of-the-art image
classification results. Further details may be found in [Vah17].
This robust learning method can be applied as a simple add-on to any existing
neural-network-based classification algorithm. In this way, the predictive
performance of any existing model can potentially be improved when supplied
with additional noisy data. Typical improvements are shown in Fig. 4 for two
different neural network architectures. Beyond improving prediction on unseen
instances, this robust learning algorithm can also clean noisy datasets. This
capability is demonstrated in Fig. 5.
Figure 4: Performance (accuracy) of the robust labeling algorithm on the standard image classification benchmark dataset COCO (120,000 Flickr images). The robust loss layer is added to two different neural network architectures (ResNet-50 and VGG-16). The maximal theoretical performance of these neural networks is shown in blue and is obtained by training under the assumption that clean labels for the entire training data are accessible. The red bar shows the results when the neural networks are trained in the more realistic setting where only a combination of noisy and clean data is available. Three variations of the robust algorithm are considered, with the best-performing variant in green. Performance over the baseline (red) is boosted significantly and approaches within a few percent of the theoretical limit. For VGG-16 we also show the previously best-reported result (in light blue).

Figure 5: We corrupt the labels of images from CIFAR-10 and attempt to correct them automatically. These examples are labeled as cat in the corrupted dataset, but have been amended to dog by the robust learning algorithm. In most cases the revision is accurate, and only 4 errors (in red) are seen.
5 Conclusion
We have outlined a fundamental balance between overfitting and underfitting
that must be borne in mind when applying machine learning methods. The
tendency of large deep neural networks to overfit limits their applicability.
To date, most applications have dealt with overfitting by using very large
datasets. However, users wishing to apply machine learning, but lacking large
datasets, cannot employ this strategy.
References
[Rol16] Jason Tyler Rolfe. Discrete variational autoencoders. CoRR, abs/1609.02200, 2016. Published in ICLR 2017 and available at https://arxiv.org/pdf/1609.02200.pdf
[Vah17] Arash Vahdat. Toward robustness against label noise in training deep discriminative neural networks. CoRR, abs/1706.00038, 2017. Published in NIPS 2017.