About Restricted Boltzmann machines and how they work

Apr 17, 2019


ABSTRACT

would be useful to predict if a business is good or to predict its ratings using only reviews. An approach to modeling consumer reviews with a probabilistic model, the Restricted Boltzmann Machine (RBM), is proposed. A restricted Boltzmann machine consists of a layer of visible units and a layer of hidden units with no visible-visible or hidden-hidden connections. The restricted Boltzmann machine is the main component used in building up the deep belief network and has been studied by many researchers. We apply RBMs for non-linear feature extraction on the Yelp data set, which consists of raw data for businesses. The results obtained with RBM modeling outperform the results obtained with traditional classifiers.

TABLE OF CONTENTS:

CHAPTER 1: INTRODUCTION

1.1: Boltzmann Machines

1.2: Restricted Boltzmann Machines

2.1: Deriving the conditional distribution

2.2: Deriving the individual conditional distributions

2.3: RBM Gibbs Sampling

CHAPTER 3: TRAINING

3.1: Log likelihood maximization

3.2: Contrastive Divergence

CHAPTER 4: APPLICATIONS

5.1: Code for movie rating prediction

5.2: Results

5.3: Conclusion

CHAPTER 1:

INTRODUCTION:

Boltzmann Machines are networks of stochastic processing units, which can be interpreted as neural network models. A Boltzmann Machine can be used to learn important aspects of an unknown probability distribution based on samples from this distribution. In general, this learning process is difficult and time-consuming. However, the learning problem can be simplified by imposing restrictions on the network topology, which leads us to Restricted Boltzmann Machines. An RBM is trained by unsupervised learning.

Unsupervised Learning :

Unsupervised machine learning is the task of inferring a function that describes hidden structure from "unlabeled" data (a classification or categorization is not included in the observations). Since the examples given to the learner are unlabeled, there is no direct way to evaluate the accuracy of the structure that the algorithm outputs.

1.1.Boltzmann machines:

A Boltzmann machine is a probabilistic model represented by an undirected graph (nodes and edges). Each node takes a binary state in {0,1}.

Boltzmann machines can also be regarded as particular graphical models, more precisely undirected graphical models, also known as Markov random fields. The embedding of BMs into the framework of probabilistic graphical models provides immediate access to a wealth of theoretical results and well-developed algorithms. Therefore, RBMs are introduced here from this perspective. Computing the likelihood of an undirected model or its gradient for inference is in general computationally intensive, and this also holds for RBMs. Thus, sampling-based methods are employed to approximate the likelihood and its gradient. Sampling from an undirected graphical model is in general not straightforward, but for RBMs Markov chain Monte Carlo (MCMC) methods are easily applicable in the form of Gibbs sampling.

Restricted Boltzmann machines are some of the most common building blocks of deep probabilistic models. They are undirected probabilistic graphical models containing a layer of observable variables and a single layer of latent variables.

RBMs are shallow, two-layer neural nets that constitute the building blocks of deep-belief networks. The first layer of the RBM is called the visible, or input, layer, and the second is the hidden layer.

In the usual diagram of an RBM, each circle represents a neuron-like unit called a node, and nodes are simply where calculations take place. The nodes are connected to each other across layers, but no two nodes of the same layer are linked. That is, there is no intra-layer communication; this is the restriction in a restricted Boltzmann machine. Each node is a locus of computation that processes input, and begins by making stochastic decisions about whether to transmit that input or not. (Stochastic means “randomly determined”, and in this case, the coefficients that modify inputs are randomly initialized.)

An RBM can be defined as a symmetrical bipartite graph. Symmetrical means that each visible node is connected with each hidden node. Bipartite means it has two parts, or layers, and graph is the mathematical term for a web of nodes.

GRAPHICAL MODELS:

Graphical models describe probability distributions by mapping conditional dependence and independence properties between random variables onto a graph structure (two sets of random variables $X_1$ and $X_2$ are conditionally independent given a set of random variables $X_3$ if $p(X_1, X_2 \mid X_3) = p(X_1 \mid X_3)\,p(X_2 \mid X_3)$).

There exist graphical models associated with different kinds of graph structures, for example factor graphs, Bayesian networks associated with directed graphs, and Markov random fields, which are also called Markov networks or undirected graphical models.

2.1. Restricted Boltzmann machines are energy-based models.

The energy function defines a probability distribution over the hidden and visible neurons of the network as

$$p(v,h) = \frac{1}{Z}\exp\{-E(v,h)\}$$

where $E(v,h)$ is the energy function

$$E(v,h) = -b^T v - c^T h - v^T W h$$

and $Z$ is the partition function

$$Z = \sum_{v}\sum_{h}\exp\{-E(v,h)\}$$

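Because $p(v,h)$ is proportional to $\exp\{-E(v,h)\}$, the most probable configurations are those that maximize the negative energy $-E(v,h)$, which is the same as minimizing the energy $E(v,h)$. Taking the terms of $-E(v,h)$ one at a time:

If $b_k$ is positive, $v_k \to 1$; if $b_k$ is negative, $v_k \to 0$.

The same holds for $c$: if $c_j$ is positive, $h_j \to 1$; if $c_j$ is negative, $h_j \to 0$.

For a weight $W_{kj}$, which couples the visible unit $v_k$ and the hidden unit $h_j$: if $W_{kj}$ is negative, at least one of $v_k$ and $h_j$ should be 0; if $W_{kj}$ is positive, both $v_k$ and $h_j$ should be 1.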

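Energy function:

$$E(v,h) = -b^T v - c^T h - v^T W h = -\sum_{k} b_k v_k - \sum_{j} c_j h_j - \sum_{j}\sum_{k} W_{kj}\, v_k h_j$$

Here $k$ indexes the visible units and $j$ the hidden units, so $W_{kj}$ is the weight between $v_k$ and $h_j$.

Distribution: $p(v,h) = \frac{1}{Z}\exp\{-E(v,h)\}$

Partition function: $Z = \sum_{v}\sum_{h}\exp\{-E(v,h)\}$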
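As a minimal sketch of the definitions above, the following Python snippet evaluates the energy and computes the partition function by brute-force enumeration for a tiny RBM. The sizes (3 visible, 2 hidden units), the random parameter values and the helper names `energy`, `all_states` and `joint_prob` are illustrative assumptions, not part of the original text.

```python
import itertools
import numpy as np

def energy(v, h, W, b, c):
    """E(v,h) = -b^T v - c^T h - v^T W h for binary vectors v and h."""
    return -(b @ v) - (c @ h) - (v @ W @ h)

def all_states(n):
    """All 2**n binary vectors of length n."""
    return [np.array(s, dtype=float) for s in itertools.product([0, 1], repeat=n)]

# Hypothetical toy model: 3 visible units, 2 hidden units.
rng = np.random.RandomState(0)
d, n = 3, 2
W = rng.normal(scale=0.5, size=(d, n))  # W[k, j] couples v_k and h_j
b = rng.normal(scale=0.5, size=d)       # visible biases
c = rng.normal(scale=0.5, size=n)       # hidden biases

# Partition function: sum over every joint configuration (2**(d+n) terms).
Z = sum(np.exp(-energy(v, h, W, b, c))
        for v in all_states(d) for h in all_states(n))

def joint_prob(v, h):
    """p(v,h) = exp(-E(v,h)) / Z."""
    return np.exp(-energy(v, h, W, b, c)) / Z

# Sanity check: the joint probabilities should sum to one.
total = sum(joint_prob(v, h) for v in all_states(d) for h in all_states(n))
print("Z =", Z, " sum of p(v,h) =", total)
```

The enumeration visits 2^(d+n) configurations, which is fine for five units but grows exponentially; this is the sense in which Z is intractable for realistic models.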

From the above equations it can be seen that the partition function Z is intractable, because it sums over every joint configuration of v and h. Therefore the joint distribution p(v,h) is also intractable. The conditional probability p(h|v), however, is simple to compute and to sample from.

To sample from the network (without fixing a state for hidden or visible neurons) means to sample from the above probability distribution, that is, to draw states of the network's neurons according to p(v,h).

To sample h given v means to sample from the conditional probability distribution p(h|v). "Physically" this means fixing the values v of the visible neurons, asking what the probabilities are of getting a specific set of values for the hidden neurons, and sampling from this probability distribution. This conditional probability distribution is computed as

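$$
\begin{aligned}
p(h \mid v) &= \frac{p(v,h)}{p(v)} = \frac{\frac{1}{Z}\exp\{-E(v,h)\}}{\sum_{h'} p(v,h')} \\
&= \frac{\frac{1}{Z}\exp\{b^T v + c^T h + v^T W h\}}{\frac{1}{Z}\sum_{h'}\exp\{b^T v + c^T h' + v^T W h'\}} \\
&= \frac{\exp\{c^T h + v^T W h\}}{\sum_{h'}\exp\{c^T h' + v^T W h'\}} \\
&= \frac{1}{Z'}\exp\{c^T h + v^T W h\}, \qquad \text{where } Z' = \sum_{h'}\exp\{c^T h' + v^T W h'\} \\
&= \frac{1}{Z'}\exp\Big\{\sum_{j=1}^{n} c_j h_j + \sum_{j=1}^{n} v^T W_{:,j}\, h_j\Big\} \\
&= \frac{1}{Z'}\prod_{j=1}^{n}\exp\{c_j h_j + v^T W_{:,j}\, h_j\}
\end{aligned}
$$

Here $W_{:,j}$ denotes the $j$-th column of $W$, and $Z'$ is the normaliser of the conditional (distinct from the partition function $Z$ of the joint distribution).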

Since we are conditioning on the visible units v, the term $b^T v$ is constant with respect to the distribution p(h|v) and cancels. The factorial nature of the conditional p(h|v) follows immediately from our ability to write the joint probability over the vector h as a product of (unnormalised) distributions over the individual elements $h_j$. It is now a simple matter of normalising the distributions over the individual binary $h_j$.

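$$
\begin{aligned}
p(h_j = 1 \mid v) &= \frac{\tilde{p}(h_j = 1 \mid v)}{\sum_{h_j \in \{0,1\}} \tilde{p}(h_j \mid v)} \\
&= \frac{\tilde{p}(h_j = 1 \mid v)}{\tilde{p}(h_j = 0 \mid v) + \tilde{p}(h_j = 1 \mid v)} \\
&= \frac{\exp\{c_j + v^T W_{:,j}\}}{\exp\{0\} + \exp\{c_j + v^T W_{:,j}\}} \\
&= \mathrm{sigmoid}(c_j + v^T W_{:,j})
\end{aligned}
$$

where $\tilde{p}(h_j \mid v) = \exp\{c_j h_j + v^T W_{:,j}\, h_j\}$ is the unnormalised factor for $h_j$ and $W_{:,j}$ is the $j$-th column of $W$. The full conditional is the product of these factors:

$$p(h \mid v) = \prod_{j=1}^{n} p(h_j \mid v), \qquad p(h_j = 1 \mid v) = \mathrm{sigmoid}(c_j + v^T W_{:,j})$$

Similarly, with $W_{i,:}$ the $i$-th row of $W$,

$$p(v \mid h) = \prod_{i=1}^{d} p(v_i \mid h), \qquad p(v_i = 1 \mid h) = \mathrm{sigmoid}(b_i + W_{i,:}\, h)$$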
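The closed-form conditional can be checked numerically. The sketch below compares the sigmoid expression for p(h_j = 1 | v) with a brute-force normalisation over all hidden configurations; the sizes, the random parameters and the chosen visible vector are hypothetical values picked only for illustration.

```python
import itertools
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical toy parameters: 4 visible units, 3 hidden units.
# The visible bias b cancels in p(h | v), so it is not needed here.
rng = np.random.RandomState(1)
d, n = 4, 3
W = rng.normal(size=(d, n))
c = rng.normal(size=n)
v = np.array([1.0, 0.0, 1.0, 1.0])

# Closed form: p(h_j = 1 | v) = sigmoid(c_j + v^T W[:, j]), for all j at once.
closed_form = sigmoid(c + v @ W)

# Brute force: normalise exp(c^T h + v^T W h) over all 2**n hidden vectors.
hs = [np.array(s, dtype=float) for s in itertools.product([0, 1], repeat=n)]
unnormalised = np.array([np.exp(c @ h + v @ W @ h) for h in hs])
cond = unnormalised / unnormalised.sum()              # p(h | v) for every h
brute_force = sum(p * h for p, h in zip(cond, hs))    # E[h | v] = p(h_j = 1 | v)

print(closed_form)
print(brute_force)
```

The two printed vectors should agree up to floating-point error, which is the factorisation result in numerical form.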

The independence between the variables in one layer makes Gibbs sampling especially easy: instead of sampling new values for all variables one after another, the states of all variables in one layer can be sampled jointly.

Step 1: Sample $h^{(l)} \sim p(h \mid v^{(l)})$.

We can simultaneously and independently sample all the elements of $h^{(l)}$ given $v^{(l)}$.

Step 2: Sample $v^{(l+1)} \sim p(v \mid h^{(l)})$.

We can simultaneously and independently sample all the elements of $v^{(l+1)}$ given $h^{(l)}$.
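The two steps can be sketched in a few lines of Python. This is an illustrative sketch rather than the author's implementation: the function name `gibbs_step`, the layer sizes and the random parameters are assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gibbs_step(v, W, b, c, rng):
    """One block Gibbs step: sample h ~ p(h | v), then v_new ~ p(v | h)."""
    p_h = sigmoid(c + v @ W)                        # p(h_j = 1 | v) for all j
    h = (rng.uniform(size=p_h.shape) < p_h).astype(float)
    p_v = sigmoid(b + h @ W.T)                      # p(v_i = 1 | h) for all i
    v_new = (rng.uniform(size=p_v.shape) < p_v).astype(float)
    return v_new, h

# Hypothetical model with 6 visible and 2 hidden units.
rng = np.random.RandomState(0)
d, n = 6, 2
W = rng.normal(scale=0.1, size=(d, n))
b = np.zeros(d)
c = np.zeros(n)

v = rng.randint(0, 2, size=d).astype(float)   # arbitrary starting state
for step in range(5):
    v, h = gibbs_step(v, W, b, c, rng)
print("v =", v, "h =", h)
```

Each layer is sampled with a single vectorised comparison, which is possible only because the units within a layer are conditionally independent given the other layer.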

CHAPTER 3: TRAINING

Training Restricted boltzmann machines:

We will derive the general form of the gradient of the log likelihood w.r.t. an arbitrary model parameter θ for the training samples and use it to obtain the gradients for the specific model parameters. We maximize the log likelihood rather than the likelihood itself because the logarithm turns the product over training samples into a sum, which simplifies the calculation.

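The log likelihood of the $n$ training samples $v^{(1)}, \dots, v^{(n)}$ is given by

$$l(W,b,c) = \sum_{t=1}^{n}\log p(v^{(t)})$$

Writing $p(v^{(t)})$ as the marginal of the joint distribution over $h$ and expanding:

$$
\begin{aligned}
l(W,b,c) &= \sum_{t=1}^{n}\log\sum_{h} p(v^{(t)},h) \\
&= \sum_{t=1}^{n}\log\sum_{h}\frac{1}{Z}\exp\{-E(v^{(t)},h)\} \\
&= \sum_{t=1}^{n}\log\Big(\frac{1}{Z}\sum_{h}\exp\{-E(v^{(t)},h)\}\Big) \\
&= \sum_{t=1}^{n}\Big(-\log Z + \log\sum_{h}\exp\{-E(v^{(t)},h)\}\Big) \\
&= \sum_{t=1}^{n}\log\sum_{h}\exp\{-E(v^{(t)},h)\} - n\log Z \\
&= \sum_{t=1}^{n}\log\sum_{h}\exp\{-E(v^{(t)},h)\} - n\log\sum_{v}\sum_{h}\exp\{-E(v,h)\}
\end{aligned}
$$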
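For a model small enough to enumerate, the last line of this expression can be evaluated directly. The sketch below does so for the six-user training matrix used in the movie example later in this document; the helper names and the all-zero parameters are assumptions chosen so that the result is easy to check by hand.

```python
import itertools
import numpy as np

def neg_energy(v, h, W, b, c):
    """-E(v,h) = b^T v + c^T h + v^T W h."""
    return b @ v + c @ h + v @ W @ h

def states(n):
    """All 2**n binary vectors of length n."""
    return [np.array(s, dtype=float) for s in itertools.product([0, 1], repeat=n)]

def log_likelihood(data, W, b, c):
    """sum_t log sum_h exp(-E(v_t,h)) - n log Z, computed by enumeration."""
    d, n_hidden = W.shape
    hs = states(n_hidden)
    log_Z = np.log(sum(np.exp(neg_energy(v, h, W, b, c))
                       for v in states(d) for h in hs))
    first_term = sum(np.log(sum(np.exp(neg_energy(v, h, W, b, c)) for h in hs))
                     for v in data)
    return first_term - len(data) * log_Z

data = np.array([[1, 1, 1, 0, 0, 0], [1, 0, 1, 0, 0, 0], [1, 1, 1, 0, 0, 0],
                 [0, 0, 1, 1, 1, 0], [0, 0, 1, 1, 0, 0], [0, 0, 1, 1, 1, 0]],
                dtype=float)

# With all parameters at zero every visible vector has probability 1/64,
# so the log likelihood should equal 6 * log(1/64).
W = np.zeros((6, 2))
b = np.zeros(6)
c = np.zeros(2)
ll = log_likelihood(data, W, b, c)
print(ll, 6 * np.log(1.0 / 64))
```

Training raises this number by moving probability mass towards the observed rows; the second term is the part that becomes infeasible as the number of units grows.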

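3.1: Log likelihood maximization

To maximize this function, we take its derivative with respect to the parameters and set it to zero. Writing $\theta = (b, c, W)$,

$$l(\theta) = \sum_{t=1}^{n}\log\sum_{h}\exp\{-E(v^{(t)},h)\} - n\log\sum_{v}\sum_{h}\exp\{-E(v,h)\}$$

$$
\begin{aligned}
\frac{\partial}{\partial\theta}\, l(\theta) &= \frac{\partial}{\partial\theta}\sum_{t=1}^{n}\log\sum_{h}\exp\{-E(v^{(t)},h)\} - n\,\frac{\partial}{\partial\theta}\log\sum_{v,h}\exp\{-E(v,h)\} \\
&= \sum_{t=1}^{n}\frac{\sum_{h}\exp\{-E(v^{(t)},h)\}\,\frac{\partial}{\partial\theta}\big(-E(v^{(t)},h)\big)}{\sum_{h}\exp\{-E(v^{(t)},h)\}} - n\,\frac{\sum_{v,h}\exp\{-E(v,h)\}\,\frac{\partial}{\partial\theta}\big(-E(v,h)\big)}{\sum_{v,h}\exp\{-E(v,h)\}} \\
&= \sum_{t=1}^{n} E_{p(h \mid v^{(t)})}\Big[\frac{\partial}{\partial\theta}\big(-E(v^{(t)},h)\big)\Big] - n\,E_{p(h,v)}\Big[\frac{\partial}{\partial\theta}\big(-E(v,h)\big)\Big]
\end{aligned}
$$

In the above equation, the first term is an expectation under the conditional distribution $p(h \mid v^{(t)})$ and the second term is an expectation under the joint distribution $p(h,v)$; both are expectations of the derivative of the negative energy function.

Since $-E(v,h) = b^T v + c^T h + v^T W h$, the derivatives of the negative energy are

$$\frac{\partial}{\partial W}\big(-E(v,h)\big) = v h^T, \qquad \frac{\partial}{\partial b}\big(-E(v,h)\big) = v, \qquad \frac{\partial}{\partial c}\big(-E(v,h)\big) = \frac{\partial}{\partial c}\big(b^T v + c^T h + v^T W h\big) = h$$

Substituting these into the general form gives

$$\frac{\partial}{\partial W}\, l(W,b,c) = \sum_{t=1}^{n} v^{(t)}\, E_{p(h \mid v^{(t)})}[h]^T - n\,E_{p(h,v)}[v h^T]$$

$$\frac{\partial}{\partial b}\, l(W,b,c) = \sum_{t=1}^{n} v^{(t)} - n\,E_{p(h,v)}[v]$$

$$\frac{\partial}{\partial c}\, l(W,b,c) = \sum_{t=1}^{n} E_{p(h \mid v^{(t)})}[h] - n\,E_{p(h,v)}[h]$$

To sample from p(h,v) we use Gibbs sampling. The expectation under the joint distribution has to be recomputed in every iteration, and computing this second term exactly is very expensive because it involves every configuration of v and h.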
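For a very small RBM both expectations can be computed exactly, which gives a way to check the gradient formulas against a finite-difference estimate. The following is a sketch under stated assumptions: the function names, the random toy data and the parameter values are all hypothetical, and the enumeration is only feasible because the model has five units.

```python
import itertools
import numpy as np

def states(n):
    """All 2**n binary vectors of length n, one per row."""
    return np.array(list(itertools.product([0, 1], repeat=n)), dtype=float)

def log_likelihood(data, W, b, c):
    """Exact log likelihood of the rows of data, by enumeration."""
    V, H = states(W.shape[0]), states(W.shape[1])
    log_Z = np.log(np.exp((V @ b)[:, None] + (H @ c)[None, :] + V @ W @ H.T).sum())
    per_example = np.log(
        np.exp((data @ b)[:, None] + (H @ c)[None, :] + data @ W @ H.T).sum(axis=1))
    return per_example.sum() - len(data) * log_Z

def exact_gradients(data, W, b, c):
    """Exact gradients: data-dependent expectation minus n times model expectation."""
    V, H = states(W.shape[0]), states(W.shape[1])
    # Joint distribution p(v,h) over every pair, shape (2**d, 2**n).
    joint = np.exp((V @ b)[:, None] + (H @ c)[None, :] + V @ W @ H.T)
    joint /= joint.sum()
    Ev = joint.sum(axis=1) @ V          # E_{p(h,v)}[v]
    Eh = joint.sum(axis=0) @ H          # E_{p(h,v)}[h]
    Evh = V.T @ joint @ H               # E_{p(h,v)}[v h^T]
    # E_{p(h | v_t)}[h] = sigmoid(c + v_t^T W), one row per training example.
    ph = 1.0 / (1.0 + np.exp(-(c + data @ W)))
    n = len(data)
    grad_W = data.T @ ph - n * Evh
    grad_b = data.sum(axis=0) - n * Ev
    grad_c = ph.sum(axis=0) - n * Eh
    return grad_W, grad_b, grad_c

# Hypothetical toy problem: 5 examples, 3 visible units, 2 hidden units.
rng = np.random.RandomState(0)
data = rng.randint(0, 2, size=(5, 3)).astype(float)
W = rng.normal(scale=0.3, size=(3, 2))
b = rng.normal(scale=0.3, size=3)
c = rng.normal(scale=0.3, size=2)
grad_W, grad_b, grad_c = exact_gradients(data, W, b, c)

# Central finite difference on b[0] as an independent check.
eps = 1e-6
b_plus, b_minus = b.copy(), b.copy()
b_plus[0] += eps
b_minus[0] -= eps
numeric = (log_likelihood(data, W, b_plus, c)
           - log_likelihood(data, W, b_minus, c)) / (2 * eps)
print(grad_b[0], numeric)
```

The two printed numbers should agree closely. The model expectation requires the full joint table, which is exactly the cost that makes the exact gradient impractical for larger models.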

To avoid this problem, the second term is approximated instead of being computed exactly.

Idea:

1. Replace the expectation by a point estimate evaluated at a single sample v'.

2. Obtain the point v' by Gibbs sampling.

3. Start the sampling chain at v(t).

Obtaining unbiased estimates of the log-likelihood gradient using MCMC methods typically requires many sampling steps. However, it has been shown that estimates obtained after running the chain for just a few steps can be sufficient for model training. This leads to contrastive divergence (CD) learning, which has become a standard way to train RBMs. The idea of k-step contrastive divergence learning (CD-k) is quite simple: instead of approximating the second term in the log-likelihood gradient by a sample from the RBM distribution (which would require running a Markov chain until the stationary distribution is reached), a Gibbs chain is run for only k steps (usually k = 1). The Gibbs chain is initialized with a training example v(0) of the training set and yields the sample v(k) after k steps. Each step t consists of sampling h(t) from p(h|v(t)) and then sampling v(t+1) from p(v|h(t)).
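A CD-k update can be sketched as follows. This is a minimal illustration, not the implementation used later in this document: the function name `cd_k_update`, the learning rate, the initialisation and the number of epochs are assumptions, and the training matrix is the six-user example from the code section.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd_k_update(v0, W, b, c, k=1, learning_rate=0.1, rng=None):
    """One CD-k parameter update for a batch v0 of shape (num_examples, d)."""
    if rng is None:
        rng = np.random.RandomState(0)
    ph0 = sigmoid(c + v0 @ W)               # positive phase: p(h = 1 | v(0))
    vk, phk = v0, ph0
    for _ in range(k):                      # k steps of block Gibbs sampling
        h = (rng.uniform(size=phk.shape) < phk).astype(float)
        pv = sigmoid(b + h @ W.T)
        vk = (rng.uniform(size=pv.shape) < pv).astype(float)
        phk = sigmoid(c + vk @ W)
    m = v0.shape[0]
    # Gradient estimate: statistics at the data minus statistics at v(k).
    W = W + learning_rate * (v0.T @ ph0 - vk.T @ phk) / m
    b = b + learning_rate * (v0 - vk).mean(axis=0)
    c = c + learning_rate * (ph0 - phk).mean(axis=0)
    return W, b, c

data = np.array([[1, 1, 1, 0, 0, 0], [1, 0, 1, 0, 0, 0], [1, 1, 1, 0, 0, 0],
                 [0, 0, 1, 1, 1, 0], [0, 0, 1, 1, 0, 0], [0, 0, 1, 1, 1, 0]],
                dtype=float)

rng = np.random.RandomState(1234)
W = rng.normal(scale=0.1, size=(6, 2))
b = np.zeros(6)
c = np.zeros(2)
for epoch in range(1000):
    W, b, c = cd_k_update(data, W, b, c, k=1, rng=rng)
print(b)
```

Because the last movie is never selected in the data and the third is always selected, the corresponding visible biases can only move down and up respectively under this update, which gives a quick sanity check on the sign conventions.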

CHAPTER 4: APPLICATIONS

A restricted Boltzmann machine is an algorithm useful for:

● Dimensionality reduction

● Classification

● Regression

● Collaborative filtering

● Feature learning and topic modeling

5. TRAINING ALGORITHM:

● Initialize the visible variables and hidden variables.

● Each training cycle consists of two passes: forward propagation and backward propagation.

● The weights are initially assigned randomly.

● Forward propagation: compute the hidden units from the visible units, then compute the positive associations.

● Backward propagation: from the hidden units found in forward propagation, reconstruct the visible units, then do forward propagation again to find the negative associations.

● The difference between the positive and negative associations gives us delta w, the value by which we change the current weights in the RBM.

A Restricted Boltzmann Machine is a stochastic neural network (neural network meaning we have neuron-like units whose binary activations depend on the neighbors they're connected to; stochastic meaning these activations have a probabilistic element) consisting of:

● One layer of visible units (users' movie preferences whose states we know and set);

● One layer of hidden units (the latent factors we try to learn); and

● A bias unit (whose state is always on, and is a way of adjusting for the different inherent

popularities of each movie).

For example, suppose we have a set of six movies (Harry Potter, Avatar, LOTR 3, Gladiator,

Titanic, and Glitter) and we ask users to tell us which ones they want to watch. If we want to

learn two latent units underlying movie preferences -- for example, two natural groups in our set

of six movies appear to be SF/fantasy (containing Harry Potter, Avatar, and LOTR 3) and Oscar

winners (containing LOTR 3, Gladiator, and Titanic), so we might hope that our latent units will

correspond to these categories -- then our RBM would look like the following.

from __future__ import print_function

import numpy as np


class RBM:

    def __init__(self, num_visible, num_hidden):
        self.num_hidden = num_hidden
        self.num_visible = num_visible
        self.debug_print = True

        # Initialize a weight matrix of dimensions (num_visible x num_hidden), using
        # a uniform distribution between -sqrt(6. / (num_hidden + num_visible))
        # and sqrt(6. / (num_hidden + num_visible)). One could vary the
        # spread by multiplying the interval with an appropriate value.
        # Here the interval is scaled by 0.1, so the weights are small with mean 0.
        # Reference: Understanding the difficulty of training deep feedforward
        # neural networks by Xavier Glorot and Yoshua Bengio
        np_rng = np.random.RandomState(1234)
        self.weights = np.asarray(np_rng.uniform(
            low=-0.1 * np.sqrt(6. / (num_hidden + num_visible)),
            high=0.1 * np.sqrt(6. / (num_hidden + num_visible)),
            size=(num_visible, num_hidden)))

        # Insert weights for the bias units into the first row and first column.
        self.weights = np.insert(self.weights, 0, 0, axis=0)
        self.weights = np.insert(self.weights, 0, 0, axis=1)

    # NOTE: the `def` line, the epoch loop and the error computation were lost in
    # extraction. They are reconstructed from the names used in the body; the
    # default values of max_epochs and learning_rate are assumptions.
    def train(self, data, max_epochs=1000, learning_rate=0.1):
        """
        Train the machine.

        Parameters
        ----------
        data: A matrix where each row is a training example consisting of the states of visible units.
        """
        num_examples = data.shape[0]

        # Insert bias units of 1 into the first column.
        data = np.insert(data, 0, 1, axis=1)

        for epoch in range(max_epochs):
            # Clamp to the data and sample from the hidden units.
            # (This is the "positive CD phase", aka the reality phase.)
            pos_hidden_activations = np.dot(data, self.weights)
            pos_hidden_probs = self._logistic(pos_hidden_activations)
            pos_hidden_probs[:, 0] = 1  # Fix the bias unit.
            pos_hidden_states = pos_hidden_probs > np.random.rand(num_examples, self.num_hidden + 1)
            # Note that we're using the activation *probabilities* of the hidden states, not the
            # hidden states themselves, when computing associations. We could also use the states;
            # see section 3 of Hinton's "A Practical Guide to Training Restricted Boltzmann
            # Machines" for more.
            pos_associations = np.dot(data.T, pos_hidden_probs)

            # Reconstruct the visible units and sample again from the hidden units.
            # (This is the "negative CD phase", aka the daydreaming phase.)
            neg_visible_activations = np.dot(pos_hidden_states, self.weights.T)
            neg_visible_probs = self._logistic(neg_visible_activations)
            neg_visible_probs[:, 0] = 1  # Fix the bias unit.
            neg_hidden_activations = np.dot(neg_visible_probs, self.weights)
            neg_hidden_probs = self._logistic(neg_hidden_activations)
            # Note, again, that we're using the activation *probabilities* when computing
            # associations, not the states themselves.
            neg_associations = np.dot(neg_visible_probs.T, neg_hidden_probs)

            # Update weights.
            self.weights += learning_rate * ((pos_associations - neg_associations) / num_examples)

            # Squared reconstruction error (assumed; the original line was lost in extraction).
            error = np.sum((data - neg_visible_probs) ** 2)
            if self.debug_print:
                print("Epoch %s: error is %s" % (epoch, error))

    def run_visible(self, data):
        """
        Assuming the RBM has been trained (so that weights for the network have been learned),
        run the network on a set of visible units, to get a sample of the hidden units.

        Parameters
        ----------
        data: A matrix where each row consists of the states of the visible units.

        Returns
        -------
        hidden_states: A matrix where each row consists of the hidden units activated from the
        visible units in the data matrix passed in.
        """
        num_examples = data.shape[0]

        # Create a matrix, where each row is to be the hidden units (plus a bias unit)
        # sampled from a training example.
        hidden_states = np.ones((num_examples, self.num_hidden + 1))

        # Insert bias units of 1 into the first column of data.
        data = np.insert(data, 0, 1, axis=1)

        # Calculate the activations of the hidden units.
        hidden_activations = np.dot(data, self.weights)
        # Calculate the probabilities of turning the hidden units on.
        hidden_probs = self._logistic(hidden_activations)
        # Turn the hidden units on with their specified probabilities.
        hidden_states[:, :] = hidden_probs > np.random.rand(num_examples, self.num_hidden + 1)
        # Always fix the bias unit to 1.
        # hidden_states[:,0] = 1

        # Ignore the bias units.
        hidden_states = hidden_states[:, 1:]
        return hidden_states

    # TODO: Remove the code duplication between this method and `run_visible`?
    def run_hidden(self, data):
        """
        Assuming the RBM has been trained (so that weights for the network have been learned),
        run the network on a set of hidden units, to get a sample of the visible units.

        Parameters
        ----------
        data: A matrix where each row consists of the states of the hidden units.

        Returns
        -------
        visible_states: A matrix where each row consists of the visible units activated from the
        hidden units in the data matrix passed in.
        """
        num_examples = data.shape[0]

        # Create a matrix, where each row is to be the visible units (plus a bias unit)
        # sampled from a training example.
        visible_states = np.ones((num_examples, self.num_visible + 1))

        # Insert bias units of 1 into the first column of data.
        data = np.insert(data, 0, 1, axis=1)

        # Calculate the activations of the visible units.
        visible_activations = np.dot(data, self.weights.T)
        # Calculate the probabilities of turning the visible units on.
        visible_probs = self._logistic(visible_activations)
        # Turn the visible units on with their specified probabilities.
        visible_states[:, :] = visible_probs > np.random.rand(num_examples, self.num_visible + 1)
        # Always fix the bias unit to 1.
        # visible_states[:,0] = 1

        # Ignore the bias units.
        visible_states = visible_states[:, 1:]
        return visible_states

    # NOTE: the `def` line was lost in extraction; the method name `daydream` is an
    # assumption based on the docstring, and `num_samples` is taken from the body.
    def daydream(self, num_samples):
        """
        Randomly initialize the visible units once, and start running alternating Gibbs sampling
        steps (where each step consists of updating all the hidden units, and then updating all
        of the visible units), taking a sample of the visible units at each step.
        Note that we only initialize the network *once*, so these samples are correlated.

        Returns
        -------
        samples: A matrix, where each row is a sample of the visible units produced while the
        network was daydreaming.
        """
        # Create a matrix, where each row is to be a sample of the visible units
        # (with an extra bias unit), initialized to all ones.
        samples = np.ones((num_samples, self.num_visible + 1))

        # Take the first sample from a uniform distribution.
        samples[0, 1:] = np.random.rand(self.num_visible)

        # Start the alternating Gibbs sampling.
        # Note that we keep the hidden units binary states, but leave the
        # visible units as real probabilities. See section 3 of Hinton's
        # "A Practical Guide to Training Restricted Boltzmann Machines"
        # for more on why.
        for i in range(1, num_samples):
            visible = samples[i-1, :]

            # Calculate the activations of the hidden units.
            hidden_activations = np.dot(visible, self.weights)
            # Calculate the probabilities of turning the hidden units on.
            hidden_probs = self._logistic(hidden_activations)
            # Turn the hidden units on with their specified probabilities.
            hidden_states = hidden_probs > np.random.rand(self.num_hidden + 1)
            # Always fix the bias unit to 1.
            hidden_states[0] = 1

            # Recalculate the probabilities that the visible units are on.
            visible_activations = np.dot(hidden_states, self.weights.T)
            visible_probs = self._logistic(visible_activations)
            visible_states = visible_probs > np.random.rand(self.num_visible + 1)
            samples[i, :] = visible_states

        # Ignore the bias units (the first column), since they're always set to 1.
        return samples[:, 1:]

    def _logistic(self, x):
        # Logistic (sigmoid) function, applied element-wise.
        return 1.0 / (1 + np.exp(-x))

if __name__ == '__main__':
    r = RBM(num_visible=6, num_hidden=2)
    training_data = np.array([[1,1,1,0,0,0],[1,0,1,0,0,0],[1,1,1,0,0,0],
                              [0,0,1,1,1,0],[0,0,1,1,0,0],[0,0,1,1,1,0]])
    r.train(training_data, max_epochs=5000)
    print(r.weights)
    user = np.array([[0,0,0,1,1,0]])
    print(r.run_visible(user))

5.2: Results:

                Bias Unit      Hidden 1       Hidden 2
Bias Unit      -0.08257658    -0.19041546     1.57007782
Harry Potter   -0.82602559    -7.08986885     4.96606654
Avatar         -1.84023877    -5.18354129     2.27197472
LOTR 3          3.92321075     2.51720193     4.11061383
Gladiator       0.10316995     6.74833901    -4.00505343
Titanic        -0.97646029     3.25474524    -5.59606865
Glitter        -4.44685751    -2.81563804    -2.91540988

[1,0]

Note that the first hidden unit seems to correspond to the Oscar winners, and the second

hidden unit seems to correspond to the SF/fantasy movies, just as we were hoping.

What happens if we give the RBM a new user, George, who has (Harry Potter = 0,

Avatar = 0, LOTR 3 = 0, Gladiator = 1, Titanic = 1, Glitter = 0) as his preferences? It turns the

Oscar winners unit on (but not the SF/fantasy unit), correctly guessing that George probably

likes movies that are Oscar winners.
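George's result can be reproduced by hand from the printed weights. The sketch below assumes the column order used by the code (bias unit first, then the two hidden units) and computes the hidden-unit probabilities instead of sampling from them; the variable names are illustrative.

```python
import numpy as np

# Weight table as printed in the results (rows: bias unit and the six movies;
# columns: bias unit, hidden unit 1, hidden unit 2).
weights = np.array([
    [-0.08257658, -0.19041546,  1.57007782],   # Bias Unit
    [-0.82602559, -7.08986885,  4.96606654],   # Harry Potter
    [-1.84023877, -5.18354129,  2.27197472],   # Avatar
    [ 3.92321075,  2.51720193,  4.11061383],   # LOTR 3
    [ 0.10316995,  6.74833901, -4.00505343],   # Gladiator
    [-0.97646029,  3.25474524, -5.59606865],   # Titanic
    [-4.44685751, -2.81563804, -2.91540988],   # Glitter
])

# George likes Gladiator and Titanic; the leading 1 is the bias unit.
george = np.array([1, 0, 0, 0, 1, 1, 0], dtype=float)

activations = george @ weights
probs = 1.0 / (1.0 + np.exp(-activations))
hidden_probs = probs[1:]          # drop the bias column
print(hidden_probs)
```

The activation of the first hidden unit is -0.19 + 6.75 + 3.25, about 9.81, and that of the second is 1.57 - 4.01 - 5.60, about -8.03, so the probabilities are roughly 0.9999 and 0.0003. Sampling from them almost always gives [1, 0], the Oscar winners unit on and the SF/fantasy unit off.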

5.3:Conclusion:

We showed how to create a more powerful type of hidden unit for an RBM by tying the weights and biases of an infinite set of binary units. We then approximated these stepped sigmoid units with noisy rectified linear units and showed that they work better than binary hidden units. We have also carried out a movie rating prediction using data from six users and two types of movies. Likewise, the same approach can be used to make estimates about business activities, political situations, and product reviews. The main purpose is prediction.
