
Data-Efficient Machine Learning
ABSTRACT

We describe a practical impediment to the application of deep neural network models when large training datasets are unavailable. Encouragingly, however, we show that recent machine learning advances make it possible to obtain the benefits of deep neural networks by making more efficient use of the training data that most practitioners do have.

To learn about Quadrant, please email us at info@quadrant.ai

1 Overfitting and Underfitting


Arthur Samuel [Sam59] coined the term machine learning in 1959, describing it as the "field of study that gives computers the ability to learn without being explicitly programmed." Modern machine learning methods rely on data (and, to a much lesser extent, background domain knowledge) to learn. Learning successfully from data is a balancing act. On one hand, learning methods must be able to represent arbitrarily complex relationships that may lie hidden within data. On the other hand, learning methods should capture only those relationships that generalize to novel situations outside the training data. This fundamental tension pervades all machine learning algorithms.

Machine learning models may perform poorly as a result of either overfitting or underfitting the training data. Flexible models, capable of representing complex relationships amongst variables, are said to overfit when they inappropriately learn spurious patterns. Overfitting is particularly problematic when models are extremely flexible (as is the case with deep neural networks) and the amount of training data is small. In contrast, learning algorithms that cannot represent relationships of sufficient complexity fail when the underlying patterns within the data are too rich. A learning method that fails in this way is said to underfit.

Machine learning is being applied successfully today in many domains. Whether it is understanding images, video, language, or customers, playing games, or driving cars, the potential of machine learning is only just beginning to unfold. These successes in varied and complex domains require powerful, flexible learning models. Deep neural networks are more than up to this task, and often have hundreds of thousands or millions of parameters. However, training models of this richness without overfitting requires large amounts of training data. Before the extraordinary potential of machine learning can be realized across a range of domains, this situation must change: we must be able to learn with less data. Quadrant has developed a number of learning algorithms to do just that.

Quadrant (a D-Wave Systems Business Unit)
3033 Beta Avenue, Burnaby, BC, Canada V5G 4M9
+1 (604) 630 1428 Ext 106 | +1 (604) 630 1434
info@quadrant.ai | www.quadrant.ai
2 Discriminative and Generative Machine Learning
Most applications of machine learning are discriminative. Discriminative methods model the dependence of unobserved variables on observed variables. Every time you navigate through an automated telephone menu, you are interacting with a discriminative learning algorithm trained to recognize your responses as either yes or no. If we indicate a vector of unobserved variables as y (e.g. the correct interpretation of your response) and a vector of observed variables as x (e.g. your speech), then discriminative algorithms learn models for P(y | x) (the probability of a certain y being associated with a given x). This probabilistic relationship between x and y may seem strange because in many cases y is a deterministic function of x, but the probabilistic relationship is more general and subsumes the deterministic case.

Figure 1: Learning on a small data set of 20 (a and b) or 200 (c) example (x, y) pairs, where x is the observed variable and y is the unobserved variable to be estimated. The true data is generated according to a parabolic model with added noise. (a) When the model has degree 2 (with 3 parameters) we find excellent agreement with the true model even when training on only 20 points. (b) A more flexible model of degree 8 (with 9 parameters), when fit to the same 20 data points, overfits and inappropriately flattens for x > 0.5. (c) However, when trained with sufficient data (200 training examples), the more flexible degree-8 model closely approximates the true model.

Discriminative models are trained using data that includes both x and y. If there are D such examples, we indicate this training set as D = {(x_d, y_d)}_{d=1}^{D}. We call this a labeled data set because every x has been labeled with its associated y value. It is difficult to acquire large labeled data sets because the y labels often have to be generated by humans. Large labeled data sets are typically costly to obtain but, until now, absolutely necessary to prevent overfitting.

In contrast, generative learning methods model the joint dependence of both observed (x) and unobserved (y) variables. Mathematically, this is represented as P(x, y). Using the chain rule of probability, P(x, y) = P(y | x) P(x), we see that generative models include not only the dependence of unobserved variables on observed variables, arising again from P(y | x), but also the likelihood of the observed variables themselves through P(x). This is the origin of the generative modifier; the learning algorithm additionally models how the observed data x is generated. As an example, imagine building a model that predicts a person's weight (y) based on observations of their height (x). A generative model may cluster humans into genders and then, based on the gender, generate both height and weight. By inferring the gender, the weight prediction is based upon both height and gender. In contrast, a discriminative model, which does not learn how to generate height and weight, is comparatively impoverished; it predicts weight only as a function of height. Like discriminative modeling, learning generative models requires both observed and unobserved data D = {(x_d, y_d)}_{d=1}^{D}.
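The height and weight example can be made concrete with a small sketch. All numbers below (group priors, height distributions, linear weight models) are invented for illustration; the point is only that a generative prediction averages over the inferred hidden group:

```python
import numpy as np

# A toy generative model of weight (y) from height (x). Every number here is
# a made-up illustration, not data from the text.
groups = {
    # group: (prior, mean height in cm, std of height, weight model w = a*h + b)
    "A": (0.5, 165.0, 6.0, (0.9, -80.0)),
    "B": (0.5, 178.0, 7.0, (1.0, -75.0)),
}

def gaussian(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def predict_weight(height):
    """Generative prediction: infer the hidden group from height via Bayes'
    rule, then average each group's weight model under P(group | height)."""
    post = np.array([p * gaussian(height, mu, s) for p, mu, s, _ in groups.values()])
    post /= post.sum()
    means = np.array([a * height + b for *_, (a, b) in groups.values()])
    return float(post @ means)
```

A purely discriminative model would fit a single curve from height to weight; here the prediction shifts between the two groups' weight models as the posterior over the hidden group changes with height.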

Models for P(x) (inherent in generative algorithms) discover hidden structure (e.g. gender in the previous example) in the observed data. For example, there may be clusters, categories, or hierarchical relationships among observed variables. The structure in observed data can be leveraged for:

• Insight into the underlying relationships in, and perhaps causes of, observed data
• Better discriminative performance

Businesses have long exploited the clusters discovered by machine learning algorithms, but it is the latter, rarely leveraged benefit that can be particularly powerful. The previous example of height and weight clearly demonstrates that uncovering hidden structure (like gender) can improve predictive accuracy. The big win, though, is to realize that uncovering the structure determining P(x) can be done without labeled examples; it requires only data {x_d}, which is typically much less costly to gather. Thus, generative models can be trained more efficiently by exploiting both labeled data D_L = {(x_d, y_d)}_{d=1}^{D_L} and unlabeled data D_U = {x_d}_{d=1}^{D_U}. In this way, the need for large labeled datasets is reduced.

3 Learning Generative Models

Given the benefits of generative models, how can we learn them in a manner that takes advantage of the flexibility of neural networks, but that mitigates their tendency to overfit? One of the most effective approaches to generative learning is the variational autoencoder.

To capture the structure underlying observed data, autoencoders introduce variables z representing this structure. We model a distribution over x as a generative process whereby a latent variable z is sampled according to some distribution P(z | θ), and this latent variable is then probabilistically mapped to an observed variable x using P(x | z, θ). The parameters of the model, θ, are learned by maximizing the likelihood of the training data. Since the observed data distribution is obtained by averaging over z as P(x | θ) = Σ_z P(x, z | θ) = Σ_z P(x | z, θ) P(z | θ), the log likelihood and its gradient are given by

L(θ) = Σ_{d=1}^{D} ln P(x_d | θ) = Σ_{d=1}^{D} ln Σ_z P(x_d | z, θ) P(z | θ)   and   ∇L(θ) = Σ_{d=1}^{D} E_{P(z | x_d, θ)}[∇ ln P(x_d, z | θ)].
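This gradient identity can be checked numerically on a toy model with a discrete latent variable. The sketch below uses a two-component Gaussian mixture (z ∈ {0, 1} with P(z) fixed at 1/2, and θ the two component means), an illustrative stand-in rather than a model from this document; the posterior-weighted expectation should agree with finite differences of L(θ):

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data drawn from two well-separated components.
x = np.concatenate([rng.normal(-2.0, 1.0, 50), rng.normal(2.0, 1.0, 50)])

def log_like(mu):
    # L(theta) = sum_d ln sum_z P(x_d | z, theta) P(z), with P(z) = 1/2
    joint = 0.5 * np.exp(-0.5 * (x[:, None] - mu) ** 2) / np.sqrt(2 * np.pi)
    return float(np.log(joint.sum(axis=1)).sum())

def grad_log_like(mu):
    # grad L(theta) = sum_d E_{P(z | x_d, theta)}[grad ln P(x_d, z | theta)]
    joint = 0.5 * np.exp(-0.5 * (x[:, None] - mu) ** 2) / np.sqrt(2 * np.pi)
    post = joint / joint.sum(axis=1, keepdims=True)   # P(z | x_d, theta)
    return (post * (x[:, None] - mu)).sum(axis=0)     # d/d mu_z of ln N(x; mu_z, 1)

mu = np.array([-1.0, 1.0])
eps = 1e-6
numeric = np.array([(log_like(mu + eps * e) - log_like(mu - eps * e)) / (2 * eps)
                    for e in np.eye(2)])
```

Because the mixture has only two latent states, the posterior P(z | x_d, θ) can be computed exactly here; the difficulty the next paragraph addresses arises when it cannot.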

Determining the gradient requires evaluating the expectation with respect to P(z | x, θ), which identifies the z most likely to have given rise to a given x. In most models, this probability cannot be calculated exactly. Variational autoencoders (VAEs) approximate P(z | x, θ) within a family of simpler approximating distributions q(z | x, φ). The parameters φ are selected to best match q(z | x, φ) to P(z | x, θ). Most commonly, q is a simple distribution over z (like a Gaussian) whose parameters (mean μ and variance σ²) are determined by a neural network (which maps x into μ and σ). While previous VAEs were limited to continuous latent variables z, Quadrant scientists have developed patented algorithms that also allow for discrete latent variables, which can represent clusters and other discrete hidden structure [Rol16].
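As a sketch of this parameterization, the "encoder" below is a hypothetical one-hidden-layer network with random, untrained weights; it shows only the data flow from x to the mean and variance of a Gaussian q(z | x, φ), together with the standard reparameterized sample z = μ + σε used to train VAEs:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical encoder weights (phi): random stand-ins, not trained values.
W1, b1 = rng.normal(size=(8, 2)), np.zeros(8)
W2, b2 = 0.1 * rng.normal(size=(2, 8)), np.zeros(2)  # outputs: mean, log-variance

def encode(x):
    """Map an input x to the parameters (mu, sigma) of Gaussian q(z | x, phi)."""
    h = np.tanh(W1 @ x + b1)
    mu, log_var = W2 @ h + b2
    return mu, np.exp(0.5 * log_var)

def sample_z(x):
    # Reparameterization: z = mu + sigma * eps, with eps ~ N(0, 1), keeps the
    # sample differentiable with respect to the encoder parameters.
    mu, sigma = encode(x)
    return mu + sigma * rng.normal()
```

Emitting the log-variance rather than σ itself is a common design choice: the exponential guarantees a positive standard deviation without constraining the network's output.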

As a simple example, we illustrate how discrete variational autoencoders (DVAEs) can learn a representation of data that makes learning a discriminative model trivial. Consider Fig. 2, where we define a binary classification problem with two-dimensional inputs x. We indicate the two classes as either red or green, and are supplied with a small labeled training set of examples. However, we may have access to more plentiful unlabeled examples, shown in black. We are interested in the class label of the input centered within the open blue circle. Visual inspection suggests that the appropriate class label is red, as the encircled point lies on the spiral all of whose points are labeled red. In this example the unlabeled data makes the proper prediction clear. However, without the unlabeled data, most classifiers (e.g. nearest-neighbor, or support vector machines with standard kernels) would assign a green label based upon the neighboring training points.
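A stripped-down version of this effect can be sketched with two well-separated blobs standing in for the spirals and plain 2-means clustering standing in for the DVAE's latent structure; everything here is illustrative rather than the algorithm described in this document:

```python
import numpy as np

rng = np.random.default_rng(1)

# Plentiful unlabeled data from two well-separated clusters...
centers = np.array([[0.0, 0.0], [5.0, 5.0]])
x_unlab = np.concatenate([c + rng.normal(scale=0.5, size=(100, 2)) for c in centers])
# ...but only ONE labeled example per class ("red" = 0, "green" = 1).
x_lab = np.array([[0.1, -0.1], [5.1, 4.9]])
y_lab = np.array([0, 1])

# Step 1: discover structure in the unlabeled data alone (2-means clustering).
means = np.array([x_unlab.min(axis=0), x_unlab.max(axis=0)])  # simple deterministic init
for _ in range(20):
    assign = np.argmin(((x_unlab[:, None] - means) ** 2).sum(-1), axis=1)
    means = np.array([x_unlab[assign == k].mean(axis=0) for k in range(2)])

def cluster_of(points):
    return np.argmin(((points[:, None] - means) ** 2).sum(-1), axis=1)

# Step 2: a single labeled point per cluster is enough to name the clusters.
label_of_cluster = np.empty(2, dtype=int)
label_of_cluster[cluster_of(x_lab)] = y_lab

def classify(points):
    return label_of_cluster[cluster_of(points)]
```

The unlabeled data does the heavy lifting of carving the input space into two groups; the two labeled examples only attach class names to groups that already exist.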

Figure 2: Labeled data (red and green) along with unlabeled examples (black). We seek the class label of the point centered within the open blue circle.

Figure 3: (a) The density of sampled data points lying along two spirals, which we learn using a DVAE. The DVAE utilizes two hidden variables z_1 and z_2: z_1 is binary-valued and z_2 is continuous. The DVAE learns a representation where z_1 identifies the two spirals and z_2 measures arc length along the spiral. The neural network learns to map the arc length into rectilinear coordinates x_1 and x_2. (b) The autoencoding map, where each data point is mapped into the latent space via P(z | x) and then back via P(x | z). (c) The DVAE has also learned to accurately interpolate into the gaps in the training data along the spirals.

In Fig. 3(a) we show the operation of the DVAE without any labeled data whatsoever. Fig. 3(b) shows the samples that result from the autoencoding steps of mapping each point x into the latent space z using q(z | x, φ), and then mapping the resulting z back to the data space. A binary latent label z_1 is indicated by the color (red or green). The autoencoding process brings each initial x back very close to its original position and annotates each data point with its z_1 label. The DVAE uses the discrete latent variable z_1 to label the two spirals. In this way, learning the appropriate class labels requires only one labeled training example from each of the two spirals. These two labeled examples are sufficient to associate the inferred latent z_1 with the proper class label. Fig. 3(c) demonstrates that the generative mode of the DVAE, obtained by sampling from P(z) and mapping these latent variables into x values through P(x | z), recovers the complete spirals and accurately interpolates into the gaps along each spiral that were purposefully left in the training data.

This example demonstrates the principles behind the DVAE. The same algorithm has been scaled up to much larger and more complex datasets. The DVAE is currently one of the world-leading methods on the MNIST, OmniGlot, and Caltech-101 Silhouettes benchmark datasets [Rol16].

4 Exploiting Noisy Labeled Data


In many cases, even when high-quality labeled training data is scarce, poorer-
quality labeled data is available. For example, a company may rely on lower-quality
amateur annotation to compliment the higher quality labels annotated by experts.
Alternatively, perhaps due to expense, only a small portion of training data may
be cleaned to high standards. Quadrant has developed algorithms that exploit
correlations between poor- and high-quality labels to improve learning in these
common scenarios. The basic idea is simple: if there are systematic errors within
the noisy training data, then learning these biases and correlations can be used
to correct the noisy labelings.

Quadrant's algorithms for robust learning with noisy data exploit both noisy and clean data sources. Let D_N = {(x_d^(n), y_d^(n))}_{d=1}^{D_N} and D_C = {(x_d^(c), ŷ_d^(c), y_d^(c))}_{d=1}^{D_C} denote the datasets of noisy and clean data. Within the clean dataset, y_d^(c) denotes the noisy label of the d-th example and ŷ_d^(c) denotes the cleaned version of that label. D_C is used to learn correlations between clean and noisy labels. We train a conditional random field (CRF) P(y, ŷ, h | x, θ) by maximizing the likelihood of both datasets D_N and D_C. The CRF relies on two components. First, a neural network learns to bias inputs x toward the expected clean and noisy labels. Second, and importantly, the model accounts for correlations (either positive or negative) between noisy and clean labels. These correlations adjust the predictions of the neural network. The addition of this robust CRF layer to a neural network improves the network's predictive accuracy, leading to state-of-the-art image classification results. Further details may be found in [Vah17].
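The core intuition can be sketched with a simple confusion-matrix model: the paired labels in the clean dataset reveal systematic annotation errors, and Bayes' rule then maps a noisy label back to a distribution over clean labels. This is a deliberately simplified stand-in for the CRF model, with all numbers invented:

```python
import numpy as np

rng = np.random.default_rng(0)

K = 3  # number of classes
# Paired labels from a small "clean" dataset: each example has a cleaned
# label and the original noisy label.
clean = np.array([0] * 40 + [1] * 40 + [2] * 40)
noisy = clean.copy()
# Systematic noise: class 2 is mislabeled as class 1 about half the time.
flip = (clean == 2) & (rng.random(clean.shape) < 0.5)
noisy[flip] = 1

# Estimate P(noisy | clean) from the paired labels.
conf = np.zeros((K, K))
for c, n in zip(clean, noisy):
    conf[c, n] += 1
conf /= conf.sum(axis=1, keepdims=True)

prior = np.bincount(clean, minlength=K) / len(clean)

def correct(noisy_label):
    """P(clean | noisy) via Bayes' rule; returns the most probable clean label."""
    post = conf[:, noisy_label] * prior
    return int(np.argmax(post / post.sum()))
```

Labels that the annotators never corrupt pass through unchanged, while the posterior for a noisy label of 1 is shared between true classes 1 and 2 in proportion to the learned flip rate and the class priors.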

This robust learning method can be applied as a simple add-on to any existing
neural-network-based classification algorithm. In this way, the predictive
performance of any existing model can potentially be improved when supplied
with additional noisy data. Typical improvements are shown in Fig. 4 for two
different neural network architectures. Beyond improving prediction on unseen
instances, this robust learning algorithm can also clean noisy datasets. This
capability is demonstrated in Fig. 5.
Figure 4: Performance (accuracy) of the robust labeling algorithm on the standard image classification benchmark dataset COCO (120,000 Flickr images). The robust loss layer is added to two different neural network architectures (ResNet-50 and VGG-16). The maximal theoretical performance of these neural networks is shown in blue and is obtained by training as if clean labels for the entire training data were accessible. The red bar shows the results when the neural networks are trained in the more realistic setting where only a combination of noisy and clean data is available. Three variations of the robust algorithm are considered, with the best-performing variant in green. Performance over the baseline (red) is boosted significantly and approaches within a few percent of the theoretical limit. For VGG-16 we also show the previously best-reported result (in light blue).

Figure 5: We corrupt the labels of images from CIFAR-10 and attempt to correct them automatically. These examples are labeled as cat in the corrupted dataset, but have been amended to dog by the robust learning algorithm. In most cases the revision is accurate, and only 4 errors (in red) are seen.

5 Conclusion
We have outlined a fundamental balance between overfitting and underfitting
that must be borne in mind when applying machine learning methods. The
tendency of large deep neural networks to overfit limits their applicability.
To date, most applications have dealt with overfitting by using very large
datasets. However, users wishing to apply machine learning, but lacking large
datasets, cannot employ this strategy.

We have outlined two learning methods that resolve this situation by exploiting alternative, more easily available data sources. For each method we have demonstrated its principle of operation and quantified its benefits.

References

[Rol16] Jason Tyler Rolfe. Discrete variational autoencoders. CoRR, abs/1609.02200, 2016. Published in ICLR 2017. Available at https://arxiv.org/pdf/1609.02200.pdf

[Sam59] Arthur Samuel. Some studies in machine learning using the game of checkers. IBM Journal of Research and Development, 1959. Available at http://ieeexplore.ieee.org/document/5392560/

[Vah17] Arash Vahdat. Toward robustness against label noise in training deep discriminative neural networks. In Neural Information Processing Systems (NIPS), 2017. Available at https://arxiv.org/pdf/1706.00038.pdf
