
RESTRICTED BOLTZMANN MACHINES

​ABSTRACT

User-generated business reviews have a significant impact on consumer decision making. It
would be useful to predict whether a business is good, or to predict its rating, using only its
reviews. We propose an approach to modeling consumer reviews with a probabilistic model, the
Restricted Boltzmann Machine (RBM). A restricted Boltzmann machine consists of a layer of visible
units and a layer of hidden units with no visible-visible or hidden-hidden connections. The
restricted Boltzmann machine is the main component used in building deep belief networks
and has been studied by many researchers. We apply RBMs for non-linear feature extraction on
the Yelp data set, which consists of raw data for businesses. The results obtained with RBM
modeling outperform those obtained with traditional classifiers.

TABLE OF CONTENTS:

CHAPTER 1: INTRODUCTION
1.1: Boltzmann Machines
1.2: Restricted Boltzmann Machines

CHAPTER 2: ENERGY FUNCTION
2.1: Deriving the conditional distribution
2.2: Deriving the individual conditional distributions
2.3: RBM Gibbs Sampling

CHAPTER 3: TRAINING
3.1: Log likelihood maximization
3.2: Contrastive Divergence

CHAPTER 4: APPLICATIONS

CHAPTER 5: TRAINING ALGORITHM
5.1: Code for movie rating prediction
5.2: Results
5.3: Conclusion

CHAPTER 1: INTRODUCTION

Boltzmann machines (BMs) were introduced as bidirectionally connected networks of
stochastic processing units, which can be interpreted as neural network models. A Boltzmann
machine can be used to learn important aspects of an unknown probability distribution based on
samples from this distribution. In general, this learning process is difficult and time-consuming.
However, the learning problem can be simplified by imposing restrictions on the network
topology, which leads us to Restricted Boltzmann Machines (RBMs). Training an RBM is an
unsupervised learning task.

Unsupervised Learning:
Unsupervised machine learning is the machine learning task of inferring a function to describe
hidden structure from "unlabeled" data (a classification or categorization is not included in the
observations). Since the examples given to the learner are unlabeled, there is no direct way to
evaluate the accuracy of the structure that the algorithm outputs.

1.1: Boltzmann machines:
A Boltzmann machine is a probabilistic model represented by an undirected graph (nodes and
edges). Each node takes a binary state in {0,1}.

Boltzmann machines can also be regarded as particular graphical models, more precisely
undirected graphical models, also known as Markov random fields. The embedding of BMs into
the framework of probabilistic graphical models provides immediate access to a wealth of
theoretical results and well-developed algorithms. Therefore, we introduce RBMs from this
perspective. Computing the likelihood of an undirected model, or its gradient, is in
general computationally intensive, and this also holds for RBMs. Thus, sampling-based methods
are employed to approximate the likelihood and its gradient. Sampling from an undirected
graphical model is in general not straightforward, but for RBMs Markov chain Monte Carlo
(MCMC) methods are easily applicable in the form of Gibbs sampling.

1.2: Restricted Boltzmann machines:

Restricted Boltzmann machines are some of the most common building blocks of deep
probabilistic models. They are undirected probabilistic graphical models containing a layer of
observable variables and a single layer of latent variables.
RBMs are shallow, two-layer neural nets that constitute the building blocks of deep-belief
networks. The first layer of the RBM is called the visible, or input, layer, and the second is the
hidden layer.
Each circle in the graph of an RBM represents a neuron-like unit called a node, and nodes are
simply where calculations take place. The nodes are connected to each other across layers, but no
two nodes of the same layer are linked. That is, there is no intra-layer communication; this is the
restriction in a restricted Boltzmann machine. Each node is a locus of computation that processes
input, and begins by making stochastic decisions about whether to transmit that input or not.
(Stochastic means "randomly determined", and in this case, the coefficients that modify inputs
are randomly initialized.)
An RBM can be defined as a symmetrical bipartite graph. Symmetrical means that each visible node
is connected with each hidden node. Bipartite means it has two parts, or layers, and graph is
a mathematical term for a web of nodes.

GRAPHICAL MODELS:

Probabilistic graphical models describe probability distributions by mapping conditional
dependence and independence properties between random variables onto a graph structure (two
sets of random variables X1 and X2 are conditionally independent given a set of random
variables X3 if p(X1, X2|X3) = p(X1|X3)p(X2|X3)).
There exist graphical models associated with different kinds of graph structures, for example
factor graphs, Bayesian networks associated with directed graphs, and Markov random fields,
which are also called Markov networks or undirected graphical models.

CHAPTER 2: ENERGY FUNCTION


2.1: Restricted Boltzmann machines are energy-based models.
The energy function defines a probability distribution over the hidden and visible neurons of
the network as

\[ p(v,h) = \frac{1}{Z} \exp\{-E(v,h)\} \]

where E(v,h) is the energy function,

\[ E(v,h) = -b^T v - c^T h - v^T W h
          = -\sum_{i=1}^{d} b_i v_i - \sum_{j=1}^{n} c_j h_j - \sum_{i=1}^{d} \sum_{j=1}^{n} v_i W_{ij} h_j \]

and Z is the normalizing constant, called the partition function:

\[ Z = \sum_{v} \sum_{h} \exp\{-E(v,h)\} \]

Here i indexes the d visible units and j indexes the n hidden units, so W has one row per visible
unit and one column per hidden unit.

Configurations with low energy have high probability, so maximizing p(v,h) means maximizing the
negative energy -E(v,h), which is the same as minimizing the energy E(v,h). Reading off the terms
of -E(v,h):
If b_i is positive, v_i = 1 is favored; if b_i is negative, v_i = 0 is favored.
The same holds for c: if c_j is positive, h_j = 1 is favored; if c_j is negative, h_j = 0 is favored.
If W_ij is negative, at least one of v_i and h_j should be 0; if W_ij is positive, v_i = 1 and
h_j = 1 together are favored.

Z sums over all 2^(d+n) joint configurations of the visible and hidden units, so the partition
function is intractable for all but very small models. Therefore the joint distribution p(v,h) is
also intractable. The conditional probability p(h|v), however, is simple to compute and to
sample from.
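
To make the definitions concrete, the following is a minimal sketch that evaluates the energy, the partition function and the joint distribution by brute-force enumeration. It assumes a toy RBM with 3 visible and 2 hidden units and randomly drawn parameters; the function names (`energy`, `binary_vectors`, `partition_function`, `joint_probability`) are illustrative and not part of the original report. Enumeration is only feasible here because the model is tiny.

```python
import itertools
import numpy as np

def energy(v, h, W, b, c):
    """E(v, h) = -b^T v - c^T h - v^T W h, with W of shape (visible, hidden)."""
    return -(b @ v) - (c @ h) - (v @ W @ h)

def binary_vectors(size):
    """All vectors in {0, 1}^size as float arrays."""
    return [np.array(bits, dtype=float)
            for bits in itertools.product([0, 1], repeat=size)]

def partition_function(W, b, c):
    """Z = sum over every (v, h) pair of exp{-E(v, h)}."""
    num_visible, num_hidden = W.shape
    return sum(np.exp(-energy(v, h, W, b, c))
               for v in binary_vectors(num_visible)
               for h in binary_vectors(num_hidden))

def joint_probability(v, h, W, b, c, Z):
    """p(v, h) = exp{-E(v, h)} / Z."""
    return np.exp(-energy(v, h, W, b, c)) / Z

# Assumed toy parameters: 3 visible units, 2 hidden units.
rng = np.random.default_rng(0)
W = rng.normal(scale=0.5, size=(3, 2))
b = rng.normal(scale=0.5, size=3)
c = rng.normal(scale=0.5, size=2)

Z = partition_function(W, b, c)
total = sum(joint_probability(v, h, W, b, c, Z)
            for v in binary_vectors(3)
            for h in binary_vectors(2))
print("Z =", Z)
print("sum of p(v,h) over all 2**5 configurations =", total)
```

The number of terms in Z doubles with every added unit, which is why this direct computation does not scale beyond toy models.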

2.2: Deriving the conditional distributions from the joint distribution:

To sample from the network (without fixing a state for hidden or visible neurons) means to
sample from the above probability distribution, that is, to draw states of the network's neurons
according to p(v,h).
To sample h given v means to sample from the conditional probability distribution p(h|v).
"Physically", this means fixing the values v of the visible neurons, asking what the probabilities
are of getting a specific set of values for the hidden neurons, and sampling from this
probability distribution. This conditional probability distribution is computed as

\[
\begin{aligned}
p(h \mid v) &= \frac{p(v,h)}{p(v)}
             = \frac{\frac{1}{Z}\exp\{-E(v,h)\}}{\sum_{h'} p(v,h')} \\
            &= \frac{\frac{1}{Z}\exp\{b^T v + c^T h + v^T W h\}}
                    {\frac{1}{Z}\sum_{h'}\exp\{b^T v + c^T h' + v^T W h'\}} \\
            &= \frac{\exp\{c^T h + v^T W h\}}{\sum_{h'}\exp\{c^T h' + v^T W h'\}} \\
            &= \frac{1}{Z'}\exp\{c^T h + v^T W h\} \\
            &= \frac{1}{Z'}\exp\Big\{\sum_{j=1}^{n} c_j h_j + \sum_{j=1}^{n} v^T W_{:,j}\, h_j\Big\} \\
            &= \frac{1}{Z'}\prod_{j=1}^{n}\exp\{c_j h_j + v^T W_{:,j}\, h_j\}
\end{aligned}
\]

where Z' = \sum_{h'} \exp\{c^T h' + v^T W h'\} and W_{:,j} denotes the j-th column of W.

2.3: The distributions over the individual binary h_j:

Since we are conditioning on the visible units, we can treat them as constant with respect to the
distribution p(h|v). The factorial nature of the conditional p(h|v) follows immediately from
our ability to write the joint probability over the vector h as a product of (unnormalised)
distributions over the individual elements h_j. It is now a simple matter of normalising the
distributions over the individual binary h_j.

For a given v, the elements of h are independent:

\[
\begin{aligned}
p(h_j = 1 \mid v) &= \frac{p(h_j = 1, v)}{p(v)}
                   = \frac{p(h_j = 1, v)}{\sum_{h_j} p(h_j, v)}
                   = \frac{p(h_j = 1, v)}{p(h_j = 1, v) + p(h_j = 0, v)} \\
                  &= \frac{\exp\{c_j + v^T W_{:,j}\}}{\exp\{0\} + \exp\{c_j + v^T W_{:,j}\}}
                   = \mathrm{sigmoid}(c_j + v^T W_{:,j})
\end{aligned}
\]

so that

\[ p(h \mid v) = \prod_{j=1}^{n} p(h_j \mid v), \qquad p(h_j = 1 \mid v) = \mathrm{sigmoid}(c_j + v^T W_{:,j}) \]

Similarly,

\[ p(v \mid h) = \prod_{i=1}^{d} p(v_i \mid h), \qquad p(v_i = 1 \mid h) = \mathrm{sigmoid}(b_i + W_{i,:}\, h) \]

where W_{:,j} is the j-th column and W_{i,:} the i-th row of W.

The independence between the variables in one layer makes Gibbs sampling especially easy:
instead of sampling new values for all variables one after another, the states of all variables in
one layer can be sampled jointly.
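
The sigmoid form of the conditional can be checked numerically. The sketch below, under the assumption of a toy RBM with 4 visible and 3 hidden units and random parameters, compares the closed-form p(h_j = 1 | v) with the value obtained by enumerating every hidden configuration. The helper names are illustrative only.

```python
import itertools
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hidden_conditional_closed_form(v, W, c):
    """p(h_j = 1 | v) = sigmoid(c_j + v^T W[:, j]) for every j at once."""
    return sigmoid(c + v @ W)

def hidden_conditional_brute_force(v, W, b, c):
    """Same quantity, obtained by normalising exp{-E(v, h)} over all h."""
    num_hidden = W.shape[1]
    configs = np.array(list(itertools.product([0, 1], repeat=num_hidden)),
                       dtype=float)                  # shape (2**n, n)
    scores = np.exp(b @ v + configs @ c + configs @ (W.T @ v))
    probs = scores / scores.sum()                    # p(h | v) for each h
    return probs @ configs                           # marginal p(h_j = 1 | v)

# Assumed toy parameters: 4 visible units, 3 hidden units.
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))
b = rng.normal(size=4)
c = rng.normal(size=3)
v = np.array([1.0, 0.0, 1.0, 1.0])

closed_form = hidden_conditional_closed_form(v, W, c)
brute_force = hidden_conditional_brute_force(v, W, b, c)
print(closed_form)
print(brute_force)
```

The visible bias term b^T v appears in every score and cancels in the normalisation, which matches the step in the derivation where it drops out.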

RBM GIBBS SAMPLING:

Step 1: Sample h^(l) ~ p(h | v^(l)).
We can simultaneously and independently sample all the elements of h^(l) given v^(l).
Step 2: Sample v^(l+1) ~ p(v | h^(l)).
We can simultaneously and independently sample all the elements of v^(l+1) given h^(l).
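
The two steps can be sketched as a single function that samples a whole layer at once. This is a minimal illustration assuming binary units, W of shape (visible, hidden), and arbitrary small random weights; `gibbs_step` is a hypothetical name, not taken from the original code.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gibbs_step(v, W, b, c, rng):
    """One alternating Gibbs step: sample h from p(h | v), then v from p(v | h)."""
    p_h = sigmoid(c + v @ W)                              # p(h_j = 1 | v) for all j
    h = (rng.random(p_h.shape) < p_h).astype(float)       # all hidden units at once
    p_v = sigmoid(b + W @ h)                              # p(v_i = 1 | h) for all i
    v_next = (rng.random(p_v.shape) < p_v).astype(float)  # all visible units at once
    return h, v_next

# Assumed toy setup: 6 visible units, 2 hidden units.
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(6, 2))
b = np.zeros(6)
c = np.zeros(2)

v = np.array([1.0, 1.0, 1.0, 0.0, 0.0, 0.0])
chain = [v]
for _ in range(5):
    h, v = gibbs_step(v, W, b, c, rng)
    chain.append(v)
print(np.array(chain))
```

Successive rows of the chain are correlated, since each visible sample is generated from the previous one.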
CHAPTER 3: TRAINING
Training restricted Boltzmann machines:
We derive the general form of the gradient of the log likelihood with respect to an arbitrary
model parameter θ for training samples v^(1), ..., v^(n), and use it to obtain the gradients for
the specific model parameters in the following sections. We maximize the log likelihood rather
than the likelihood because the logarithm turns the product over training samples into a sum,
which simplifies the calculation.
The log likelihood is given by

\[ \ell(W,b,c) = \sum_{t=1}^{n} \log p(v^{(t)}) \]

Expanding it in terms of the energy function:

\[
\begin{aligned}
\ell(W,b,c) &= \sum_{t=1}^{n} \log \sum_{h} p(v^{(t)}, h) \\
            &= \sum_{t=1}^{n} \log \sum_{h} \frac{1}{Z} \exp\{-E(v^{(t)},h)\} \\
            &= \sum_{t=1}^{n} \log \Big( \frac{1}{Z} \sum_{h} \exp\{-E(v^{(t)},h)\} \Big) \\
            &= \sum_{t=1}^{n} \Big( -\log Z + \log \sum_{h} \exp\{-E(v^{(t)},h)\} \Big) \\
            &= \sum_{t=1}^{n} \log \sum_{h} \exp\{-E(v^{(t)},h)\} - n \log Z \\
            &= \sum_{t=1}^{n} \log \sum_{h} \exp\{-E(v^{(t)},h)\} - n \log \sum_{v} \sum_{h} \exp\{-E(v,h)\}
\end{aligned}
\]
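3.1: Log likelihood maximization
Maximizing the function means finding where its derivative with respect to the parameters is
zero, so we first compute that derivative. Let θ = (b, c, W):

\[ \ell(\theta) = \sum_{t=1}^{n} \log \sum_{h} \exp\{-E(v^{(t)},h)\} - n \log \sum_{v} \sum_{h} \exp\{-E(v,h)\} \]

\[
\begin{aligned}
\frac{\partial \ell(\theta)}{\partial \theta}
  &= \frac{\partial}{\partial \theta} \sum_{t=1}^{n} \log \sum_{h} \exp\{-E(v^{(t)},h)\}
     - n \frac{\partial}{\partial \theta} \log \sum_{v} \sum_{h} \exp\{-E(v,h)\} \\
  &= \sum_{t=1}^{n} \frac{\sum_{h} \exp\{-E(v^{(t)},h)\}\, \frac{\partial}{\partial \theta}\big(-E(v^{(t)},h)\big)}
                         {\sum_{h} \exp\{-E(v^{(t)},h)\}}
     - n \frac{\sum_{v,h} \exp\{-E(v,h)\}\, \frac{\partial}{\partial \theta}\big(-E(v,h)\big)}
              {\sum_{v,h} \exp\{-E(v,h)\}} \\
  &= \sum_{t=1}^{n} E_{p(h \mid v^{(t)})}\Big[\frac{\partial}{\partial \theta}\big(-E(v^{(t)},h)\big)\Big]
     - n\, E_{p(v,h)}\Big[\frac{\partial}{\partial \theta}\big(-E(v,h)\big)\Big]
\end{aligned}
\]

In the above equation, the first term is an expectation under the conditional distribution
p(h|v^(t)) and the second term is an expectation under the joint distribution p(v,h); both are
expectations of the gradient of the negative energy function.

To sample from p(v,h) we use Gibbs sampling. Both terms have to be computed in every iteration,
and the second term is very expensive to compute because it involves all joint configurations
of v and h.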

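The gradient of the negative energy function:

\[ \frac{\partial}{\partial W}\big(-E(v,h)\big) = \frac{\partial}{\partial W}\big(b^T v + c^T h + v^T W h\big) = v h^T \]

\[ \frac{\partial}{\partial b}\big(-E(v,h)\big) = \frac{\partial}{\partial b}\big(b^T v + c^T h + v^T W h\big) = v \]

\[ \frac{\partial}{\partial c}\big(-E(v,h)\big) = \frac{\partial}{\partial c}\big(b^T v + c^T h + v^T W h\big) = h \]

Substituting these into

\[ \frac{\partial \ell(\theta)}{\partial \theta}
   = \sum_{t=1}^{n} E_{p(h \mid v^{(t)})}\Big[\frac{\partial}{\partial \theta}\big(-E(v^{(t)},h)\big)\Big]
     - n\, E_{p(v,h)}\Big[\frac{\partial}{\partial \theta}\big(-E(v,h)\big)\Big] \]

gives the gradients used in gradient ascent updates of the log likelihood:

\[ \frac{\partial}{\partial W}\ell(W,b,c) = \sum_{t=1}^{n} v^{(t)} \hat{h}^{(t)T} - n\, E_{p(v,h)}[v h^T] \]

\[ \frac{\partial}{\partial b}\ell(W,b,c) = \sum_{t=1}^{n} v^{(t)} - n\, E_{p(v,h)}[v] \]

\[ \frac{\partial}{\partial c}\ell(W,b,c) = \sum_{t=1}^{n} \hat{h}^{(t)} - n\, E_{p(v,h)}[h] \]

where \hat{h}^{(t)} = E_{p(h \mid v^{(t)})}[h] is the vector with entries sigmoid(c_j + v^{(t)T} W_{:,j}).
The expectations under the joint distribution p(v,h) cannot be computed exactly for models of
realistic size. To avoid this problem, the following algorithm is used.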
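
For a model small enough to enumerate, the gradient with respect to W can be computed exactly and compared against a finite-difference estimate of the log likelihood. The sketch below assumes W of shape (visible, hidden), a toy RBM with 4 visible and 2 hidden units, and a made-up data set of three binary vectors; all function names are illustrative. It is intended only as a numerical sanity check of the gradient expression, not as a training method.

```python
import itertools
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def binary_configs(size):
    """All rows of {0, 1}^size as a float matrix."""
    return np.array(list(itertools.product([0, 1], repeat=size)), dtype=float)

def unnormalised_marginal(v, W, b, c):
    """sum_h exp{-E(v, h)}, enumerating every hidden configuration."""
    H = binary_configs(W.shape[1])
    return np.sum(np.exp(b @ v + H @ c + H @ (W.T @ v)))

def log_likelihood(data, W, b, c):
    """sum_t log sum_h exp{-E(v_t, h)} - n log Z, by enumeration."""
    V = binary_configs(W.shape[0])
    log_Z = np.log(sum(unnormalised_marginal(v, W, b, c) for v in V))
    data_term = sum(np.log(unnormalised_marginal(v, W, b, c)) for v in data)
    return data_term - len(data) * log_Z

def exact_gradient_W(data, W, b, c):
    """Data term minus n times the model expectation of v h^T."""
    V = binary_configs(W.shape[0])
    weights = np.array([unnormalised_marginal(v, W, b, c) for v in V])
    p_v = weights / weights.sum()                       # model marginal p(v)
    positive = sum(np.outer(v, sigmoid(c + v @ W)) for v in data)
    negative = sum(p * np.outer(v, sigmoid(c + v @ W)) for p, v in zip(p_v, V))
    return positive - len(data) * negative

# Assumed toy parameters and data.
rng = np.random.default_rng(1)
W = rng.normal(scale=0.5, size=(4, 2))
b = rng.normal(scale=0.5, size=4)
c = rng.normal(scale=0.5, size=2)
data = np.array([[1, 1, 0, 0], [0, 0, 1, 1], [1, 0, 1, 0]], dtype=float)

grad = exact_gradient_W(data, W, b, c)

# Central finite difference on one weight as an independent check.
eps = 1e-5
W_plus, W_minus = W.copy(), W.copy()
W_plus[0, 1] += eps
W_minus[0, 1] -= eps
numeric = (log_likelihood(data, W_plus, b, c)
           - log_likelihood(data, W_minus, b, c)) / (2 * eps)
print(grad[0, 1], numeric)
```

The model expectation here costs a sum over all 2^4 visible configurations; for realistic numbers of units that sum is what makes the exact gradient impractical.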

3.2: Contrastive Divergence:

The goal is to approximate the expectation under the joint distribution p(v,h), which runs over
both v and h.

Idea:
1. Replace the expectation by a point estimate at v'.
2. Obtain the point v' by Gibbs sampling.
3. Start the sampling chain at v^(t).

\[ E_{p(v,h)}\Big[\frac{\partial}{\partial \theta}\big(-E(v,h)\big)\Big] \approx \frac{\partial}{\partial \theta}\big(-E(v',h')\big) \]

where h' is the hidden state paired with v' (sampled from, or taken as the mean of, p(h|v')).

Obtaining unbiased estimates of the log-likelihood gradient using MCMC methods typically requires
many sampling steps. However, recently it was shown that estimates obtained after running the
chain for just a few steps can be sufficient for model training. This leads to contrastive
divergence (CD) learning, which has become a standard way to train RBMs. The idea of k-step
contrastive divergence learning (CD-k) is quite simple: instead of approximating the second term
in the log-likelihood gradient by a sample from the RBM distribution (which would require
running a Markov chain until the stationary distribution is reached), a Gibbs chain is run for
only k steps (and usually k = 1). The Gibbs chain is initialized with a training example v^(0) of
the training set and yields the sample v^(k) after k steps. Each step t consists of sampling h^(t)
from p(h|v^(t)) and subsequently sampling v^(t+1) from p(v|h^(t)).
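
A CD-k gradient estimate for a single training example might look like the following sketch. It assumes binary units and W of shape (visible, hidden), and it uses the hidden activation probabilities rather than sampled hidden states when forming the statistics, which is one common choice and not the only one. The function name `cd_k_gradients` and the toy parameters are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd_k_gradients(v0, W, b, c, rng, k=1):
    """Approximate log-likelihood gradients for one example with k Gibbs steps."""
    v = v0
    for _ in range(k):
        p_h = sigmoid(c + v @ W)
        h = (rng.random(p_h.shape) < p_h).astype(float)   # sample h ~ p(h | v)
        p_v = sigmoid(b + W @ h)
        v = (rng.random(p_v.shape) < p_v).astype(float)   # sample v ~ p(v | h)
    h0_mean = sigmoid(c + v0 @ W)    # hidden probabilities for the data vector
    hk_mean = sigmoid(c + v @ W)     # hidden probabilities for the sample v(k)
    grad_W = np.outer(v0, h0_mean) - np.outer(v, hk_mean)
    grad_b = v0 - v
    grad_c = h0_mean - hk_mean
    return grad_W, grad_b, grad_c

# Assumed toy setup: 6 visible units, 2 hidden units, learning rate 0.1.
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(6, 2))
b = np.zeros(6)
c = np.zeros(2)
v0 = np.array([1.0, 1.0, 1.0, 0.0, 0.0, 0.0])

grad_W, grad_b, grad_c = cd_k_gradients(v0, W, b, c, rng, k=1)
learning_rate = 0.1
W += learning_rate * grad_W          # gradient ascent on the log likelihood
b += learning_rate * grad_b
c += learning_rate * grad_c
print(grad_W)
```

Because the chain is stopped after k steps, the result is a biased estimate of the true gradient; larger k reduces the bias at the cost of more sampling.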

CHAPTER 4: APPLICATIONS
The restricted Boltzmann machine is useful for:
● Dimensionality reduction
● Classification
● Regression
● Collaborative filtering
● Feature learning and topic modeling

CHAPTER 5: TRAINING ALGORITHM
● Initialize the visible variables and hidden variables.
● Each training cycle has two passes: forward propagation and backward propagation.
● The weights are initially assigned randomly.
● Forward propagation: compute the hidden units from the visible units, then compute the
positive associations.
● Backward propagation: from the hidden units found in forward propagation, reconstruct the
visible units, and then do forward propagation again to find the negative associations.
● The difference between the positive and negative associations gives delta w, the value
by which we change the current weights in the RBM.

A Restricted Boltzmann Machine is a stochastic neural network (neural network meaning we
have neuron-like units whose binary activations depend on the neighbors they're connected to;
stochastic meaning these activations have a probabilistic element) consisting of:
● One layer of visible units (users' movie preferences whose states we know and set);
● One layer of hidden units (the latent factors we try to learn); and
● A bias unit (whose state is always on, and is a way of adjusting for the different inherent
popularities of each movie).

For example, suppose we have a set of six movies (Harry Potter, Avatar, LOTR 3, Gladiator,
Titanic, and Glitter) and we ask users to tell us which ones they want to watch. Suppose we want
to learn two latent units underlying movie preferences. Two natural groups in our set
of six movies appear to be SF/fantasy (containing Harry Potter, Avatar, and LOTR 3) and Oscar
winners (containing LOTR 3, Gladiator, and Titanic), so we might hope that our latent units will
correspond to these categories. Our RBM would then have six visible units (one per movie), two
hidden units, and a bias unit.

5.1: Code for movie rating prediction

from __future__ import print_function
import numpy as np


class RBM:

    def __init__(self, num_visible, num_hidden):
        self.num_hidden = num_hidden
        self.num_visible = num_visible
        self.debug_print = True

        # Initialize a weight matrix, of dimensions (num_visible x num_hidden), using
        # a uniform distribution between -sqrt(6. / (num_hidden + num_visible))
        # and sqrt(6. / (num_hidden + num_visible)). One could vary the
        # spread by multiplying the interval with an appropriate value.
        # Here the interval is scaled by 0.1, so the weights start small with mean 0.
        # Reference: Understanding the difficulty of training deep feedforward
        # neural networks by Xavier Glorot and Yoshua Bengio
        np_rng = np.random.RandomState(1234)

        self.weights = np.asarray(np_rng.uniform(
            low=-0.1 * np.sqrt(6. / (num_hidden + num_visible)),
            high=0.1 * np.sqrt(6. / (num_hidden + num_visible)),
            size=(num_visible, num_hidden)))

        # Insert weights for the bias units into the first row and first column.
        self.weights = np.insert(self.weights, 0, 0, axis=0)
        self.weights = np.insert(self.weights, 0, 0, axis=1)

    def train(self, data, max_epochs=1000, learning_rate=0.1):
        """
        Train the machine.

        Parameters
        ----------
        data: A matrix where each row is a training example consisting of the
              states of visible units.
        """
        num_examples = data.shape[0]

        # Insert bias units of 1 into the first column.
        data = np.insert(data, 0, 1, axis=1)

        for epoch in range(max_epochs):
            # Clamp to the data and sample from the hidden units.
            # (This is the "positive CD phase", aka the reality phase.)
            pos_hidden_activations = np.dot(data, self.weights)
            pos_hidden_probs = self._logistic(pos_hidden_activations)
            pos_hidden_probs[:, 0] = 1  # Fix the bias unit.
            pos_hidden_states = pos_hidden_probs > np.random.rand(
                num_examples, self.num_hidden + 1)
            # Note that we're using the activation *probabilities* of the hidden
            # states, not the hidden states themselves, when computing
            # associations. We could also use the states; see section 3 of Hinton's
            # "A Practical Guide to Training Restricted Boltzmann Machines" for more.
            pos_associations = np.dot(data.T, pos_hidden_probs)

            # Reconstruct the visible units and sample again from the hidden units.
            # (This is the "negative CD phase", aka the daydreaming phase.)
            neg_visible_activations = np.dot(pos_hidden_states, self.weights.T)
            neg_visible_probs = self._logistic(neg_visible_activations)
            neg_visible_probs[:, 0] = 1  # Fix the bias unit.
            neg_hidden_activations = np.dot(neg_visible_probs, self.weights)
            neg_hidden_probs = self._logistic(neg_hidden_activations)
            # Note, again, that we're using the activation *probabilities* when
            # computing associations, not the states themselves.
            neg_associations = np.dot(neg_visible_probs.T, neg_hidden_probs)

            # Update weights.
            self.weights += learning_rate * (
                (pos_associations - neg_associations) / num_examples)

            error = np.sum((data - neg_visible_probs) ** 2)
            if self.debug_print:
                print("Epoch %s: error is %s" % (epoch, error))

    def run_visible(self, data):
        """
        Assuming the RBM has been trained (so that weights for the network have
        been learned), run the network on a set of visible units, to get a sample
        of the hidden units.

        Parameters
        ----------
        data: A matrix where each row consists of the states of the visible units.

        Returns
        -------
        hidden_states: A matrix where each row consists of the hidden units
                       activated from the visible units in the data matrix passed in.
        """
        num_examples = data.shape[0]

        # Create a matrix, where each row is to be the hidden units (plus a bias unit)
        # sampled from a training example.
        hidden_states = np.ones((num_examples, self.num_hidden + 1))

        # Insert bias units of 1 into the first column of data.
        data = np.insert(data, 0, 1, axis=1)

        # Calculate the activations of the hidden units.
        hidden_activations = np.dot(data, self.weights)
        # Calculate the probabilities of turning the hidden units on.
        hidden_probs = self._logistic(hidden_activations)
        # Turn the hidden units on with their specified probabilities.
        hidden_states[:, :] = hidden_probs > np.random.rand(
            num_examples, self.num_hidden + 1)
        # Always fix the bias unit to 1.
        # hidden_states[:,0] = 1

        # Ignore the bias units.
        hidden_states = hidden_states[:, 1:]
        return hidden_states

    # TODO: Remove the code duplication between this method and `run_visible`?
    def run_hidden(self, data):
        """
        Assuming the RBM has been trained (so that weights for the network have
        been learned), run the network on a set of hidden units, to get a sample
        of the visible units.

        Parameters
        ----------
        data: A matrix where each row consists of the states of the hidden units.

        Returns
        -------
        visible_states: A matrix where each row consists of the visible units
                        activated from the hidden units in the data matrix passed in.
        """
        num_examples = data.shape[0]

        # Create a matrix, where each row is to be the visible units (plus a bias unit)
        # sampled from a training example.
        visible_states = np.ones((num_examples, self.num_visible + 1))

        # Insert bias units of 1 into the first column of data.
        data = np.insert(data, 0, 1, axis=1)

        # Calculate the activations of the visible units.
        visible_activations = np.dot(data, self.weights.T)
        # Calculate the probabilities of turning the visible units on.
        visible_probs = self._logistic(visible_activations)
        # Turn the visible units on with their specified probabilities.
        visible_states[:, :] = visible_probs > np.random.rand(
            num_examples, self.num_visible + 1)
        # Always fix the bias unit to 1.
        # visible_states[:,0] = 1

        # Ignore the bias units.
        visible_states = visible_states[:, 1:]
        return visible_states

    def daydream(self, num_samples):
        """
        Randomly initialize the visible units once, and start running alternating
        Gibbs sampling steps (where each step consists of updating all the hidden
        units, and then updating all of the visible units), taking a sample of the
        visible units at each step.
        Note that we only initialize the network *once*, so these samples are
        correlated.

        Returns
        -------
        samples: A matrix, where each row is a sample of the visible units produced
                 while the network was daydreaming.
        """
        # Create a matrix, where each row is to be a sample of the visible units
        # (with an extra bias unit), initialized to all ones.
        samples = np.ones((num_samples, self.num_visible + 1))

        # Take the first sample from a uniform distribution.
        samples[0, 1:] = np.random.rand(self.num_visible)

        # Start the alternating Gibbs sampling.
        # Note that we keep the hidden units binary states, but leave the
        # visible units as real probabilities. See section 3 of Hinton's
        # "A Practical Guide to Training Restricted Boltzmann Machines"
        # for more on why.
        for i in range(1, num_samples):
            visible = samples[i - 1, :]

            # Calculate the activations of the hidden units.
            hidden_activations = np.dot(visible, self.weights)
            # Calculate the probabilities of turning the hidden units on.
            hidden_probs = self._logistic(hidden_activations)
            # Turn the hidden units on with their specified probabilities.
            hidden_states = hidden_probs > np.random.rand(self.num_hidden + 1)
            # Always fix the bias unit to 1.
            hidden_states[0] = 1

            # Recalculate the probabilities that the visible units are on.
            visible_activations = np.dot(hidden_states, self.weights.T)
            visible_probs = self._logistic(visible_activations)
            visible_states = visible_probs > np.random.rand(self.num_visible + 1)
            samples[i, :] = visible_states

        # Ignore the bias units (the first column), since they're always set to 1.
        return samples[:, 1:]

    def _logistic(self, x):
        return 1.0 / (1 + np.exp(-x))


if __name__ == '__main__':
    r = RBM(num_visible=6, num_hidden=2)
    training_data = np.array([[1, 1, 1, 0, 0, 0], [1, 0, 1, 0, 0, 0],
                              [1, 1, 1, 0, 0, 0], [0, 0, 1, 1, 1, 0],
                              [0, 0, 1, 1, 0, 0], [0, 0, 1, 1, 1, 0]])
    r.train(training_data, max_epochs=5000)
    print(r.weights)
    user = np.array([[0, 0, 0, 1, 1, 0]])
    print(r.run_visible(user))

5.2: Results:

Learned weights (rows: bias unit and visible units; columns: bias unit and hidden units):

                Bias Unit      Hidden 1       Hidden 2
Bias Unit       -0.08257658    -0.19041546     1.57007782
Harry Potter    -0.82602559    -7.08986885     4.96606654
Avatar          -1.84023877    -5.18354129     2.27197472
LOTR 3           3.92321075     2.51720193     4.11061383
Gladiator        0.10316995     6.74833901    -4.00505343
Titanic         -0.97646029     3.25474524    -5.59606865
Glitter         -4.44685751    -2.81563804    -2.91540988

Hidden states returned by run_visible for the new user: [1,0]
Note that the first hidden unit seems to correspond to the Oscar winners, and the second
hidden unit seems to correspond to the SF/fantasy movies, just as we were hoping.
What happens if we give the RBM a new user, George, who has (Harry Potter = 0,
Avatar = 0, LOTR 3 = 0, Gladiator = 1, Titanic = 1, Glitter = 0) as his preferences? It turns the
Oscar winners unit on (but not the SF/fantasy unit), correctly guessing that George probably
likes movies that are Oscar winners.
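
The prediction for George can be reproduced directly from the weight table. The sketch below copies the reported weights (bias row and bias column included, in the same layout the training code uses) and computes the probability that each hidden unit turns on for George's preferences. Variable names are illustrative; the numbers are the ones listed in the results above.

```python
import numpy as np

# Rows: bias unit, Harry Potter, Avatar, LOTR 3, Gladiator, Titanic, Glitter.
# Columns: bias unit, Hidden 1, Hidden 2 (values copied from the results table).
weights = np.array([
    [-0.08257658, -0.19041546,  1.57007782],
    [-0.82602559, -7.08986885,  4.96606654],
    [-1.84023877, -5.18354129,  2.27197472],
    [ 3.92321075,  2.51720193,  4.11061383],
    [ 0.10316995,  6.74833901, -4.00505343],
    [-0.97646029,  3.25474524, -5.59606865],
    [-4.44685751, -2.81563804, -2.91540988],
])

# George: Gladiator = 1 and Titanic = 1, everything else 0.
george = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 0.0])
visible_with_bias = np.insert(george, 0, 1.0)      # bias unit is always on

activations = visible_with_bias @ weights
# Drop the first column, which belongs to the bias unit.
hidden_probs = (1.0 / (1.0 + np.exp(-activations)))[1:]
print("P(Hidden 1 on), P(Hidden 2 on) =", hidden_probs)
```

With these weights the first hidden unit is on with probability very close to 1 and the second with probability very close to 0, which agrees with the sampled state [1,0].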

5.3: Conclusion:
We showed how to create a more powerful type of hidden unit for an RBM by tying the weights
and biases of an infinite set of binary units. We then approximated these stepped sigmoid units
with noisy rectified linear units and showed that they work better than binary hidden units. We
also carried out a movie preference prediction using data from six users and two types of movies.
In the same way, RBMs can be applied to estimate other quantities, such as business activity,
political situations, and product reviews. The main purpose is prediction.