ABSTRACT
CHAPTER 1: INTRODUCTION
1.1: Boltzmann Machines
1.2: Restricted Boltzmann Machines
CHAPTER 3: TRAINING
3.1: Log-likelihood Maximization
3.2: Contrastive Divergence
CHAPTER 4: APPLICATIONS
Boltzmann machines can also be regarded as particular graphical models, more precisely
undirected graphical models also known as Markov random fields. The embedding of BMs into
the framework of probabilistic graphical models provides immediate access to a wealth of
theoretical results and well-developed algorithms. Therefore, we introduce RBMs from this
perspective. Computing the likelihood of an undirected model or its gradient for inference is in
general computationally intensive, and this also holds for RBMs. Thus, sampling-based methods
are employed to approximate the likelihood and its gradient. Sampling from an undirected
graphical model is in general not straightforward, but for RBMs Markov chain Monte Carlo
(MCMC) methods are easily applicable in the form of Gibbs sampling.
GRAPHICAL MODELS:
E(v, h) = -b^T v - c^T h - v^T W h
Z = Σ_v Σ_h exp{-E(v, h)}
Since p(v, h) ∝ exp{-E(v, h)}, the most probable configurations are those that maximize
-E(v, h), i.e. minimize the energy:
If b_i is positive, -E is maximized by v_i = 1; if b_i is negative, by v_i = 0.
The same holds for c: if c_j is positive, h_j = 1; if c_j is negative, h_j = 0.
For a weight W_ij: if W_ij is negative, -E is maximized when v_i or h_j is 0; if W_ij is
positive (say W_ij = 1), it is maximized when v_i = h_j = 1.
Note that the exponent is the negative of the energy, so maximizing probability amounts to
minimizing the energy function.
Energy function:
E(v, h) = -b^T v - c^T h - v^T W h
        = -Σ_i b_i v_i - Σ_j c_j h_j - Σ_{i,j} v_i W_ij h_j
Distribution: p(v, h) = (1/Z) exp{-E(v, h)}
Partition function: Z = Σ_v Σ_h exp{-E(v, h)}
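For a toy model, the partition function can be computed by brute force, which makes the exponential cost concrete. The following sketch uses made-up parameter values for a 3-visible, 2-hidden RBM; it is an illustration, not part of any training procedure:

```python
import itertools
import numpy as np

# Hypothetical parameters for illustration: 3 visible and 2 hidden binary units.
b = np.array([0.1, -0.2, 0.3])       # visible biases
c = np.array([0.0, 0.5])             # hidden biases
W = np.array([[0.2, -0.1],
              [0.4, 0.0],
              [-0.3, 0.6]])          # W[i, j] couples v_i and h_j

def energy(v, h):
    # E(v, h) = -b^T v - c^T h - v^T W h
    return -b @ v - c @ h - v @ W @ h

configs = [(np.array(v), np.array(h))
           for v in itertools.product([0, 1], repeat=3)
           for h in itertools.product([0, 1], repeat=2)]

# Z sums over all 2^3 * 2^2 = 32 joint configurations; the number of terms
# grows exponentially with the number of units, which is why Z is intractable.
Z = sum(np.exp(-energy(v, h)) for v, h in configs)

# With Z in hand, p(v, h) = exp{-E(v, h)} / Z is a proper distribution:
total = sum(np.exp(-energy(v, h)) / Z for v, h in configs)
print(round(total, 6))  # -> 1.0
```

Even in this tiny example Z already sums over 32 configurations; for realistic layer sizes the sum is infeasible, which motivates the sampling methods below.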
From the equations above it can be seen that the partition function Z is intractable, since it
sums over exponentially many configurations. Therefore the joint distribution p(v, h) is also
intractable. The conditional probability p(h|v), however, is simple to compute and to sample
from. Since we are conditioning on the visible units, the terms that depend only on v can be
treated as constant with respect to the distribution p(h|v). The factorial nature of the
conditional p(h|v) follows immediately from our ability to write the joint probability over the
vector h as a product of (unnormalised) distributions over the individual elements h_j; it is then
a simple matter of normalising the distributions over the individual binary h_j:
p(h_j = 1 | v) = p(h_j = 1, v) / Σ_{h_j} p(h_j, v)
             = p(h_j = 1, v) / (p(h_j = 1, v) + p(h_j = 0, v))
             = exp{c_j + v^T W_{:,j}} / (exp{0} + exp{c_j + v^T W_{:,j}})
             = sigmoid(c_j + v^T W_{:,j})
p(h | v) = Π_{j=1}^n sigmoid(c_j + v^T W_{:,j})
Similarly,
p(v | h) = Π_{i=1}^d sigmoid(b_i + W_{i,:} h)
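Both factorised conditionals can be evaluated in a few lines of NumPy. The parameter values below are invented for illustration only:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical parameters for illustration: d = 3 visible, n = 2 hidden units.
b = np.array([0.1, -0.2, 0.3])       # visible biases
c = np.array([0.0, 0.5])             # hidden biases
W = np.array([[0.2, -0.1],
              [0.4, 0.0],
              [-0.3, 0.6]])          # W[i, j] couples v_i and h_j

v = np.array([1.0, 0.0, 1.0])
# p(h_j = 1 | v) = sigmoid(c_j + v^T W[:, j]), computed for all j at once.
p_h_given_v = sigmoid(c + v @ W)

h = np.array([1.0, 0.0])
# p(v_i = 1 | h) = sigmoid(b_i + W[i, :] h), computed for all i at once.
p_v_given_h = sigmoid(b + W @ h)

print(p_h_given_v.shape, p_v_given_h.shape)  # (2,) (3,)
```

Because the conditionals factorise, each layer's probabilities come out of a single matrix-vector product, with no sum over joint configurations.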
The independence between the variables in one layer makes Gibbs sampling especially easy:
instead of sampling new values for all variables sequentially, the states of all variables in one
layer can be sampled jointly.
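A block Gibbs sampler therefore alternates between the two layers, sampling each whole layer in one vectorised step. This is a sketch with made-up parameters, not a trained model:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical parameters for illustration: d = 3 visible, n = 2 hidden units.
b = np.array([0.1, -0.2, 0.3])          # visible biases
c = np.array([0.0, 0.5])                # hidden biases
W = rng.normal(0.0, 0.1, size=(3, 2))   # W[i, j] couples v_i and h_j

v = rng.integers(0, 2, size=3).astype(float)  # arbitrary starting state

for step in range(100):
    # Sample the whole hidden layer at once: p(h_j = 1 | v) = sigmoid(c_j + v^T W[:, j]).
    h = (rng.random(2) < sigmoid(c + v @ W)).astype(float)
    # Then the whole visible layer at once: p(v_i = 1 | h) = sigmoid(b_i + W[i, :] h).
    v = (rng.random(3) < sigmoid(b + W @ h)).astype(float)

print(v, h)  # one joint sample (v, h) after 100 alternating block updates
```

Each iteration needs only two matrix-vector products, regardless of how many units a layer contains.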
l(W, b, c) = Σ_{t=1}^n log Σ_h p(v^(t), h)
           = Σ_{t=1}^n log Σ_h (1/Z) exp{-E(v^(t), h)}
           = Σ_{t=1}^n log (1/Z) Σ_h exp{-E(v^(t), h)}
           = Σ_{t=1}^n [ log Σ_h exp{-E(v^(t), h)} - log Z ]
           = Σ_{t=1}^n log Σ_h exp{-E(v^(t), h)} - n log Z
           = Σ_{t=1}^n log Σ_h exp{-E(v^(t), h)} - n log Σ_v Σ_h exp{-E(v, h)}
3.1: Log-likelihood Maximization
To maximize the log-likelihood we take its derivative with respect to the parameters and set it
to zero. Writing
θ = (b, c, W)
l(θ) = Σ_{t=1}^n log Σ_h exp{-E(v^(t), h)} - n log Σ_v Σ_h exp{-E(v, h)}
∂l(θ)/∂θ = Σ_{t=1}^n ∂/∂θ log Σ_h exp{-E(v^(t), h)} - n ∂/∂θ log Σ_v Σ_h exp{-E(v, h)}
         = -Σ_{t=1}^n E_{p(h|v^(t))}[∂E(v^(t), h)/∂θ] + n E_{p(v,h)}[∂E(v, h)/∂θ]
In the equation above, the first term is an expectation of the energy gradient under the
conditional distribution p(h|v^(t)), and the second term is an expectation under the joint
distribution p(v, h). To sample from p(v, h) we use Gibbs sampling, and in each iteration all
units have to be updated; this makes the second term very expensive to compute. For the weight
matrix the energy gradient is
∂(-E(v, h))/∂W = v h^T
Obtaining unbiased estimates of the log-likelihood gradient using MCMC methods typically requires
many sampling steps. However, it has been shown that estimates obtained after running the
chain for just a few steps can be sufficient for model training. This leads to contrastive
divergence (CD) learning, which has become a standard way to train RBMs. The idea of k-step
contrastive divergence learning (CD-k) is quite simple: instead of approximating the second term
in the log-likelihood gradient by a sample from the RBM distribution (which would require
running a Markov chain until the stationary distribution is reached), a Gibbs chain is run for
only k steps (usually k = 1). The Gibbs chain is initialized with a training example v(0) from
the training set and yields the sample v(k) after k steps. Each step t consists of sampling h(t)
from p(h|v(t)) and then sampling v(t+1) from p(v|h(t)).
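A single CD-1 update for one training example can be sketched as follows. The layer sizes, parameter values, and learning rate are made up for illustration; the update subtracts the one-step reconstruction statistics from the data-driven statistics, as described above:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

d, n, lr = 3, 2, 0.1                   # layer sizes and learning rate (made up)
b, c = np.zeros(d), np.zeros(n)        # biases
W = rng.normal(0.0, 0.1, size=(d, n))  # W[i, j] couples v_i and h_j

v0 = np.array([1.0, 0.0, 1.0])         # a training example v(0)

# Positive phase: sample h(0) from p(h | v(0)).
ph0 = sigmoid(c + v0 @ W)
h0 = (rng.random(n) < ph0).astype(float)

# One Gibbs step: sample v(1) from p(v | h(0)), then compute p(h | v(1)).
pv1 = sigmoid(b + W @ h0)
v1 = (rng.random(d) < pv1).astype(float)
ph1 = sigmoid(c + v1 @ W)

# CD-1 update: data-driven statistics minus one-step reconstruction statistics.
W += lr * (np.outer(v0, ph0) - np.outer(v1, ph1))
b += lr * (v0 - v1)
c += lr * (ph0 - ph1)
print(W.shape)  # (3, 2)
```

Using the hidden probabilities (rather than sampled states) in the association terms is a common variance-reduction choice, and it is also what the code in Chapter 5 does.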
CHAPTER 4: APPLICATIONS
1. The restricted Boltzmann machine is an algorithm useful for
Dimensionality reduction,
Classification,
Regression,
Collaborative filtering,
Feature learning and topic modeling.
5. TRAINING ALGORITHM:
● Initialize the visible and hidden variables.
● Each training cycle consists of two passes: forward propagation and backward propagation.
● Weights are initially assigned randomly.
● Forward propagation: compute the hidden units from the visible units and find the positive
associations.
● Backward propagation: from the hidden units found in forward propagation, reconstruct the
visible units, then do forward propagation again to find the negative associations.
● The difference between the positive and negative associations gives Δw, the value by which
we change the current weights of the RBM.
For example, suppose we have a set of six movies (Harry Potter, Avatar, LOTR 3, Gladiator,
Titanic, and Glitter) and we ask users to tell us which ones they want to watch. If we want to
learn two latent units underlying movie preferences -- for example, two natural groups in our set
of six movies appear to be SF/fantasy (containing Harry Potter, Avatar, and LOTR 3) and Oscar
winners (containing LOTR 3, Gladiator, and Titanic), so we might hope that our latent units will
correspond to these categories -- then our RBM would look like the following.
import numpy as np

class RBM:

    def __init__(self, num_visible, num_hidden):
        self.num_hidden = num_hidden
        self.num_visible = num_visible
        self.debug_print = True

        np_rng = np.random.RandomState(1234)
        # Initialize a (num_visible x num_hidden) weight matrix, sampled
        # uniformly with a scale chosen relative to the layer sizes.
        self.weights = np.asarray(np_rng.uniform(
            low=-0.1 * np.sqrt(6. / (num_hidden + num_visible)),
            high=0.1 * np.sqrt(6. / (num_hidden + num_visible)),
            size=(num_visible, num_hidden)))

        # Insert weights for the bias units into the first row and first column.
        self.weights = np.insert(self.weights, 0, 0, axis=0)
        self.weights = np.insert(self.weights, 0, 0, axis=1)

    def train(self, data, max_epochs=1000, learning_rate=0.1):
        """
        Train the machine.

        Parameters
        ----------
        data: A matrix where each row consists of the states of the visible units.
        """
        num_examples = data.shape[0]
        # Insert bias units of 1 into the first column.
        data = np.insert(data, 0, 1, axis=1)

        for epoch in range(max_epochs):
            # Clamp to the data and sample from the hidden units.
            # (This is the "positive CD phase", aka the reality phase.)
            pos_hidden_activations = np.dot(data, self.weights)
            pos_hidden_probs = self._logistic(pos_hidden_activations)
            pos_hidden_probs[:, 0] = 1  # Fix the bias unit.
            pos_hidden_states = pos_hidden_probs > np.random.rand(num_examples, self.num_hidden + 1)
            pos_associations = np.dot(data.T, pos_hidden_probs)

            # Reconstruct the visible units and sample again from the hidden units.
            # (This is the "negative CD phase", aka the daydreaming phase.)
            neg_visible_activations = np.dot(pos_hidden_states, self.weights.T)
            neg_visible_probs = self._logistic(neg_visible_activations)
            neg_visible_probs[:, 0] = 1  # Fix the bias unit.
            neg_hidden_activations = np.dot(neg_visible_probs, self.weights)
            neg_hidden_probs = self._logistic(neg_hidden_activations)
            # Note, again, that we're using the activation *probabilities* when
            # computing associations, not the states themselves.
            neg_associations = np.dot(neg_visible_probs.T, neg_hidden_probs)

            # Update weights.
            self.weights += learning_rate * ((pos_associations - neg_associations) / num_examples)

    def run_visible(self, data):
        """
        Assuming the RBM has been trained (so that weights for the network have been learned),
        run the network on a set of visible units, to get a sample of the hidden units.

        Parameters
        ----------
        data: A matrix where each row consists of the states of the visible units.

        Returns
        -------
        hidden_states: A matrix where each row consists of the hidden units activated from
        the visible units in the data matrix passed in.
        """
        num_examples = data.shape[0]
        # Create a matrix, where each row is to be the hidden units (plus a bias unit)
        # sampled from a training example.
        hidden_states = np.ones((num_examples, self.num_hidden + 1))
        # Insert bias units of 1 into the first column of data.
        data = np.insert(data, 0, 1, axis=1)
        hidden_activations = np.dot(data, self.weights)
        hidden_probs = self._logistic(hidden_activations)
        hidden_states[:, :] = hidden_probs > np.random.rand(num_examples, self.num_hidden + 1)
        # Ignore the bias units (the first column), since they're always set to 1.
        return hidden_states[:, 1:]

    # TODO: Remove the code duplication between this method and `run_visible`?
    def run_hidden(self, data):
        """
        Assuming the RBM has been trained (so that weights for the network have been learned),
        run the network on a set of hidden units, to get a sample of the visible units.

        Parameters
        ----------
        data: A matrix where each row consists of the states of the hidden units.

        Returns
        -------
        visible_states: A matrix where each row consists of the visible units activated from
        the hidden units in the data matrix passed in.
        """
        num_examples = data.shape[0]
        # Create a matrix, where each row is to be the visible units (plus a bias unit)
        # sampled from a training example.
        visible_states = np.ones((num_examples, self.num_visible + 1))
        data = np.insert(data, 0, 1, axis=1)
        visible_activations = np.dot(data, self.weights.T)
        visible_probs = self._logistic(visible_activations)
        visible_states[:, :] = visible_probs > np.random.rand(num_examples, self.num_visible + 1)
        # Ignore the bias units (the first column), since they're always set to 1.
        return visible_states[:, 1:]

    def _logistic(self, x):
        return 1.0 / (1 + np.exp(-x))

if __name__ == '__main__':
    r = RBM(num_visible=6, num_hidden=2)
    training_data = np.array([[1,1,1,0,0,0],[1,0,1,0,0,0],[1,1,1,0,0,0],
                              [0,0,1,1,1,0],[0,0,1,1,0,0],[0,0,1,1,1,0]])
    r.train(training_data, max_epochs=5000)
    print(r.weights)
    user = np.array([[0,0,0,1,1,0]])
    print(r.run_visible(user))
5.2: Results:
[1,0]
Note that the first hidden unit seems to correspond to the Oscar winners, and the second
hidden unit seems to correspond to the SF/fantasy movies, just as we were hoping.
What happens if we give the RBM a new user, George, who has (Harry Potter = 0,
Avatar = 0, LOTR 3 = 0, Gladiator = 1, Titanic = 1, Glitter = 0) as his preferences? It turns the
Oscar winners unit on (but not the SF/fantasy unit), correctly guessing that George probably
likes movies that are Oscar winners.
5.3: Conclusion:
We showed how to create a more powerful type of hidden unit for an RBM by tying the weights
and biases of an infinite set of binary units. We then approximated these stepped sigmoid units
with noisy rectified linear units and showed that they work better than binary hidden units. We
demonstrated movie-rating prediction using data from six users covering two types of movies.
In the same way, RBMs can be applied to estimating business activity, political situations, and
product reviews; the main purpose is prediction.