
Stochastic Thermodynamics of Learning

Sebastian Goldt∗ and Udo Seifert

II. Institut für Theoretische Physik, Universität Stuttgart, 70550 Stuttgart, Germany
∗ goldt@theo2.physik.uni-stuttgart.de
(Dated: November 30, 2016)
Virtually every organism gathers information about its noisy environment and builds models from
that data, mostly using neural networks. Here, we use stochastic thermodynamics to analyse the
learning of a classification rule by a neural network. We show that the information acquired by the
network is bounded by the thermodynamic cost of learning and introduce a learning efficiency η ≤ 1.
We discuss the conditions for optimal learning and analyse Hebbian learning in the thermodynamic
limit.

arXiv:1611.09428v1 [cond-mat.stat-mech] 28 Nov 2016

PACS numbers: 05.70.Ln, 05.40.-a, 84.35.+i, 87.19.lv

Introduction. Information processing is ubiquitous in biological systems, from single cells measuring external concentration gradients to large neural networks performing complex motor control tasks. These systems are surprisingly robust, despite the fact that they are operating in noisy environments [1, 2], and they are efficient: E. coli, a bacterium, is near-perfect from a thermodynamic perspective in exploiting a given energy budget to adapt to its environment [3]. Thus it is important to keep energetic considerations in mind for the analysis of computations in living systems. Stochastic thermodynamics [4, 5] has emerged as an integrated framework to study the interplay of information processing and dissipation in interacting, fluctuating systems far from equilibrium. Encouraged by a number of intriguing results from its application to bacterial sensing [6–15] and biomolecular processes [16–20], here we consider a new problem: learning.

FIG. 1. Model of a single neuron. Given a set of inputs ξ^μ ∈ {±1}^N and their true labels σ_T^μ = ±1 (left), the neuron learns the mappings ξ^μ → σ_T^μ by adjusting its weights ω ∈ R^N. It processes an input by computing the activation A^μ = ω · ξ^μ/√N, which determines the transition rates k_±^μ of a two-state random process σ^μ = ±1 indicating the label predicted by the neuron for each sample, shown here for μ = 1.
Learning is about extracting models from sensory data. In living systems, it is implemented in neural networks where vast numbers of neurons communicate with each other via action potentials, the electric pulse used universally as the basic token of communication in neural systems [21]. Action potentials are transmitted via synapses, and their strength determines whether an incoming signal will make the receiving neuron trigger an action potential of its own. Physiologically, the adaptation of these synaptic strengths is a main mechanism for memory formation.

Learning task and model. A classic example for neurons performing associative learning are the Purkinje cells in the cerebellum [22, 23]. We model such a neuron as a single-layer neural network or perceptron [24, 25], well known from machine learning and statistical physics [26]. The neuron makes N connections to other neurons and is fully characterized by the weights or synaptic strengths ω ∈ R^N of these connections, see figure 1. The neuron must learn whether it should fire an action potential or not for a set of P fixed input patterns or samples ξ^μ = (ξ_1^μ, . . . , ξ_N^μ), μ = 1, 2, . . . , P. Each pattern describes the activity of all the other connected neurons at a point in time: if the n-th connected neuron is firing an action potential in the pattern ξ^μ, then ξ_n^μ = 1. For symmetry reasons, we set ξ_n^μ = −1 in case the n-th neuron is silent in the μ-th pattern. Every sample has a fixed true label σ_T^μ = ±1, indicating whether an action potential should be fired in response to that input or not. These labels are independent of each other and equiprobable; once chosen, they remain fixed.

We model the label predicted by a neuron for each input ξ^μ with a stochastic process σ^μ = ±1 (right panel in figure 1). Assuming a thermal environment at fixed temperature T, the transition rates k_±^μ for these processes obey the detailed balance condition

    k_+^μ / k_−^μ = exp(A^μ / k_B T)    (1)

where k_B is Boltzmann's constant and A^μ is the input-dependent activation

    A^μ ≡ ω · ξ^μ / √N    (2)

where the prefactor ensures the conventional normalisation. We interpret p(σ^μ = 1 | ω) with ω fixed as the probability that the μ-th input would trigger an action potential by the neuron.
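For readers who want to experiment with the model, the following minimal Python sketch (our illustration, not part of the original paper; all parameter values are arbitrary) evaluates the activation (2) for a random weight vector and input pattern and the resulting stationary probability that the neuron fires. Since k_+/(k_+ + k_−) = 1/(1 + e^{−A}), the neuron responds to its activation through a sigmoid.

```python
import numpy as np

rng = np.random.default_rng(0)

N = 100
xi = rng.choice([-1, 1], size=N)      # one input pattern, entries +-1
omega = rng.standard_normal(N)        # weights, here simply drawn at random

# Input-dependent activation, eq. (2): A = omega . xi / sqrt(N)
A = omega @ xi / np.sqrt(N)

# Transition rates of the predicted label, eqs. (1) and (4), with k_B = T = 1
gamma = 100.0                         # gamma >> 1: label dynamics are fast
k_plus, k_minus = gamma * np.exp(A / 2), gamma * np.exp(-A / 2)

# Stationary probability that the neuron fires, p(sigma = +1 | omega)
p_fire = k_plus / (k_plus + k_minus)  # = 1 / (1 + exp(-A))
print(f"A = {A:.3f}, p(sigma=+1|omega) = {p_fire:.3f}")
```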
The goal of learning is to adjust the weights of the network such that the predicted labels at any one time σ = (σ^1, . . . , σ^P) equal the true labels σ_T = (σ_T^1, . . . , σ_T^P) for as many inputs as possible.

Let us introduce the concept of learning efficiency by considering a network with a single weight ω learning one sample ξ = ±1 with label σ_T, i.e. N = P = 1. Here and throughout this letter, we set k_B = T = 1 to render energy and entropy dimensionless. The weight ω(t) obeys an overdamped Langevin equation [27]

    ω̇(t) = −ω(t) + f(ω(t), ξ, σ_T, t) + ζ(t).    (3)

The total force on the weight arises from a harmonic potential V(ω) = ω²/2, restricting the size of the weight [28], and an external force f introducing correlations between weight and input. The exact form of this learning force f depends on the learning algorithm we choose. The thermal noise ζ(t) is Gaussian with correlations ⟨ζ(t)ζ(t′)⟩ = 2δ(t − t′). Here and throughout, we use angled brackets to indicate averages over noise realisations, unless stated otherwise. We assume that initially at t₀ = 0, the weight is in thermal equilibrium, p(ω) ∝ exp(−ω²/2), and the labels are equiprobable, p(σ_T) = p(σ) = 1/2. Choosing symmetric rates,

    k_± = γ exp(±A/2),    (4)

the master equation [27] for the probability distribution p(σ_T, ω, σ, t) with given ξ reads

    ∂_t p(σ_T, ω, σ, t) = −∂_ω j_ω(t) + j_σ(t),    (5)

where ∂_t ≡ ∂/∂t etc. and

    j_ω(t) = [−ω + f(ω, ξ, σ_T, t) − ∂_ω] p(σ_T, ω, σ, t),    (6a)
    j_σ(t) = k_+ p(σ_T, ω, −σ, t) − k_− p(σ_T, ω, σ, t)    (6b)

are the probability currents for the weight and the predicted label, respectively. In splitting the total probability current for the system (σ_T, ω, σ) into the currents (6), we have used the bipartite property of the system, i.e. that the thermal noise in each subsystem (ω and σ) is independent of the other [29, 30]. We choose γ ≫ 1, i.e. introduce a time-scale separation between the weights and the predicted labels, since a neuron processes a single input much faster than it learns.

Efficiency of learning. The starting point to consider both the information-processing capabilities of the neuron and its non-equilibrium thermodynamics is the Shannon entropy of a random variable X with probability distribution p(x),

    S(X) ≡ −Σ_{x∈X} p(x) ln p(x),    (7)

which is a measure of the uncertainty of X [31]. This definition carries over to continuous random variables, where the sum is replaced by an integral. For dependent random variables X and Y, the conditional entropy of X given Y is given by S(X|Y) ≡ −Σ_{x,y} p(x, y) ln p(x|y) where p(x|y) = p(x, y)/p(y). The natural quantity to measure the information learnt is the mutual information

    I(σ_T : σ) ≡ S(σ_T) − S(σ_T | σ)    (8)

which measures by how much, on average, the uncertainty about σ_T is reduced by knowing σ [31]. To discuss the efficiency of learning, we need to relate this information to the thermodynamic costs of adjusting the weight during learning from t₀ = 0 up to a time t, which are given by the well-known total entropy production [4] of the weight,

    ΔS_tot ≡ ΔS(ω) + Q.    (9)

Here, Q is the heat dissipated into the medium by the dynamics of the weight and ΔS(ω) is the difference in Shannon entropy (7) of the marginalized distribution p(ω, t) = Σ_{σ_T, σ} p(σ_T, ω, σ, t) at times t₀ and t, respectively. We will show that in feedforward neural networks with Markovian dynamics (5, 6), the information learnt is bounded by the thermodynamic costs of learning,

    I(σ_T : σ) ≤ ΔS(ω) + Q    (10)

for an arbitrary learning algorithm f(ω, ξ, σ_T, t) at all times t > t₀. This inequality is our first result. We emphasise that while relations between changes in mutual information and total entropy production have appeared in the literature [29, 30, 32–34], they usually concern a single degree of freedom, say X, in contact with some other degree(s) of freedom Y, and relate the change in mutual information I(X : Y) due to the dynamics of X to the total entropy production of X. Instead, our relation connects the entropy production in the weights with the total change in mutual information between σ_T and σ, which is key for neural networks. Our derivation [35] builds on recent work by Horowitz [30] and can be generalized to N dimensions and P samples, see eq. (16) below. Equation (10) suggests to introduce an efficiency of learning

    η ≡ I(σ_T : σ) / (ΔS(ω) + Q) ≤ 1.    (11)

Toy model. As a first example, let us calculate the efficiency of Hebbian learning, a form of coincidence learning well known from biology [21, 36], for N = P = 1 in the limit t → ∞. If the neuron should fire an action potential when its input neuron fires, or if they should both stay silent, i.e. ξ = σ_T = ±1, the weight of their connection increases: "fire together, wire together". For symmetry reasons, the weight decreases if the input neuron is silent but the neuron should fire and vice versa, ξ = −σ_T. This rule yields a final weight proportional to F ≡ ξσ_T, so to minimise dissipation [37], we choose a learning force f linearly increasing with time,

    f(ω, ξ, σ_T, t) ≡ { νFt/τ   for t ≤ τ,
                      { νF     for t > τ,    (12)

where we have introduced the learning duration τ > 0 and the factor ν > 0 is conventionally referred to as the learning rate in the machine learning literature [24].
The total entropy production (9) can be computed from the distribution p(σ_T, ω, t), which is obtained by first integrating σ out of equations (5, 6) and solving the resulting Fokker-Planck equation [38]. The total heat dissipated into the medium Q is given by [4]

    Q = ∫₀^∞ dt ∫ dω j_ω(t) [−ω(t) + f(ω(t), ξ, σ_T, t)]
      = ν²F² (e^{−τ} + τ − 1) / τ².    (13)

As expected, no heat is dissipated in the limit of infinitely slow driving, lim_{τ→∞} Q = 0, while for a sudden potential switch τ → 0, lim_{τ→0} Q = ν²F²/2. The change in Shannon entropy ΔS(ω) is computed from the marginalized distribution p(ω, t) = Σ_{σ_T} p(σ_T, ω, t). Finally, the mutual information (8) can be computed from the stationary solution of (5).
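The toy model is straightforward to simulate. The sketch below (our illustration with arbitrary parameter values, not part of the original paper) integrates the Langevin equation (3) with the ramp force (12) by the Euler-Maruyama method and estimates the dissipated heat from the first law, Q = W − ΔU, for comparison with eq. (13) as reconstructed above.

```python
import numpy as np

rng = np.random.default_rng(1)

nu, F, tau = 2.0, 1.0, 5.0             # learning rate, F = xi*sigma_T, duration
dt, t_end, n_traj = 1e-2, tau + 10.0, 20000
steps = int(t_end / dt)

def f_learn(t):
    # Learning force of eq. (12): linear ramp up to tau, then constant
    return nu * F * min(t / tau, 1.0)

omega = rng.standard_normal(n_traj)    # equilibrium initial condition, p(w) ~ exp(-w^2/2)
U0 = 0.5 * omega**2 - f_learn(0.0) * omega
W = np.zeros(n_traj)

for i in range(steps):
    t = i * dt
    # work done on the system by changing the potential V(w, t) = w^2/2 - f(t) w
    W += -(f_learn(t + dt) - f_learn(t)) * omega
    # Euler-Maruyama step of eq. (3)
    omega += (-omega + f_learn(t)) * dt + np.sqrt(2 * dt) * rng.standard_normal(n_traj)

U1 = 0.5 * omega**2 - f_learn(t_end) * omega
Q_sim = (W - (U1 - U0)).mean()         # first law: dissipated heat Q = W - Delta U
Q_theory = nu**2 * F**2 * (np.exp(-tau) + tau - 1) / tau**2
print(f"Q (simulation) = {Q_sim:.3f}, Q (eq. 13) = {Q_theory:.3f}")
```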
FIG. 2. Learning efficiency of a neuron with a single weight. We plot the efficiency (11) for a neuron with a single weight learning a single sample as a function of the learning rate ν and the learning duration τ in the limit t → ∞.

A plot of the efficiency (11), fig. 2, highlights the two competing requirements for maximizing η. First, all the information from the true label, S(σ_T) = ln 2, needs to be stored in the weight by increasing the learning rate ν, which leads to ΔS(ω) → ln 2 and a strongly biased distribution p(σ|ω) such that I(σ_T : σ) → ln 2. Second, we need to minimise the dissipated heat Q, which increases with ν, by driving the weight slowly, τ ≫ 1.

More samples, higher dimensions. Moving on to a neuron with N weights learning P samples with true labels σ_T ≡ (σ_T^1, . . . , σ_T^μ, . . . , σ_T^P), we have a Langevin equation for each weight ω_n with independent thermal noise sources ζ_n(t) such that ⟨ζ_n(t)ζ_m(t′)⟩ = 2δ_{nm}δ(t − t′) for n, m = 1, . . . , N. Two learning scenarios are possible: batch learning, where the learning force is a function of all samples and their labels,

    ω̇_n(t) = −ω_n(t) + f(ω_n(t), {ξ_n^μ, σ_T^μ}, t) + ζ_n(t).    (14)

A more realistic scenario from a biological perspective is online learning, where the learning force is a function of only one sample and its label at a time,

    ω̇_n(t) = −ω_n(t) + f(ω_n(t), ξ_n^{μ(t)}, σ_T^{μ(t)}, t) + ζ_n(t).    (15)

The sample and label which enter this force are given by μ(t) ∈ {1, . . . , P}, which might be a deterministic function or a random process. Either way, the weights determine the transition rates of the P independent two-state processes for the predicted labels σ ≡ (σ^1, . . . , σ^μ, . . . , σ^P) via (1) and (2). Again, we assume that the thermal noise in each subsystem, ω_n or σ^μ, is independent of all the others, and choose initial conditions at t₀ = 0 to be p(ω) ∝ exp(−ω²/2) and p(σ_T^μ) = p(σ^μ) = 1/2. The natural quantity to measure the amount of learning after a time t in both scenarios is the sum of I(σ_T^μ : σ^μ) over all inputs. We can show [35] that this information is bounded by the total entropy production of all the weights,

    Σ_{μ=1}^P I(σ_T^μ : σ^μ) ≤ Σ_{n=1}^N [ΔS(ω_n) + Q_n] = Σ_{n=1}^N ΔS_n^tot    (16)

where Q_n is the heat dissipated into the medium by the n-th weight and ΔS(ω_n) is the change from t₀ to t in Shannon entropy (7) of the marginalized distribution p(ω_n, t). This is our main result.

Let us now compute the efficiency of online Hebbian learning in the limit t → ∞. Since a typical neuron will connect to ∼1000 other neurons [21], we take the thermodynamic limit by letting the number of samples P and the number of dimensions N both go to infinity while simultaneously keeping the ratio

    α ≡ P/N    (17)

on the order of one. The samples are drawn at random from p(ξ_n^μ = 1) = p(ξ_n^μ = −1) = 1/2 and remain fixed [39]. We choose a learning force on the n-th weight of the form (12) with F → F_n and assume that the process μ(t) is a random walk over the integers 1, . . . , P, changing on a timescale much shorter than the relaxation time of the weights. Since ⟨f²⟩ is finite, the learning force is effectively constant with

    F_n = (1/√N) Σ_{μ=1}^P ξ_n^μ σ_T^μ,    (18)

where the prefactor ensures the conventional normalisation [24]. Hence all the weights ω_n are independent of each other and statistically equivalent. Averaging first over the noise with fixed ξ, σ_T, we find that ω_n is normally distributed with mean ⟨ω_n⟩ = νF_n and variance 1 [40].
The average with respect to the quenched disorder ξ, σ_T, which we shall indicate by an overline, is taken second by noting that F_n is normally distributed by the central limit theorem with F̄_n = 0 and F̄_n² = α; hence ⟨ω_n⟩ averages to 0 and ⟨ω_n²⟩ to 1 + ν²α. The change in Shannon entropy of the marginalized distribution p(ω_n) is hence ΔS(ω_n) = ½ ln(1 + ν²α). Likewise, the heat dissipated by the n-th weight, Q̄_n, is obtained by averaging eq. (13) over F → F_n.
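These quenched-disorder statistics are easy to check numerically; the following sketch (ours, with arbitrary parameters) samples the disorder, forms F_n according to eq. (18) and compares the empirical moments with F̄_n = 0, F̄_n² = α and ⟨ω_n²⟩ = 1 + ν²α.

```python
import numpy as np

rng = np.random.default_rng(2)

N, alpha, nu = 1000, 2.0, 1.0
P = int(alpha * N)

# Quenched disorder: random +-1 samples and labels
xi = rng.choice([-1, 1], size=(P, N))
sigma_T = rng.choice([-1, 1], size=P)

# Effective Hebbian force on each weight, eq. (18)
F = (sigma_T @ xi) / np.sqrt(N)   # F_n = (1/sqrt(N)) sum_mu xi_n^mu sigma_T^mu
print(f"mean(F_n) = {F.mean():+.3f}  (expect 0)")
print(f"var(F_n)  = {F.var():.3f}  (expect alpha = {alpha})")

# Long-time weights: Gaussian around nu * F_n with unit variance [40]
omega = nu * F + rng.standard_normal(N)
print(f"var(omega_n) = {omega.var():.3f}  (expect 1 + nu^2 alpha = {1 + nu**2 * alpha})")
```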

The mutual information I(σ_T^μ : σ^μ) is a functional of the marginalized distribution p(σ_T^μ, σ^μ) which can be obtained by direct integration of p(σ_T, ω, σ) [35]. Here we will take a simpler route starting from the stability of the μ-th sample [41]

    λ^μ ≡ A^μ σ_T^μ = (1/√N) ω · ξ^μ σ_T^μ.    (19)

Its role can be appreciated by considering the limit T → 0, where it is easily verified using the detailed balance condition (1) that the neuron predicts the correct label if and only if λ^μ > 0. For T = 1, the neuron predicts the μ-th label correctly with probability

    p_C ≡ p(σ^μ = σ_T^μ) = ∫ dλ^μ p(λ^μ) e^{λ^μ} / (e^{λ^μ} + 1)    (20)

where p(λ^μ) is the distribution generated by thermal noise and quenched disorder, yielding a Gaussian with mean ν and variance 1 + ν²α [35]. The mutual information follows as

    I(σ_T^μ : σ^μ) = ln 2 − S(p_C)    (21)

with the shorthand for the entropy of a binary random variable S(p) = −p ln p − (1 − p) ln(1 − p) [31]. It is plotted in fig. 3 together with the mutual information obtained by Monte Carlo integration of p(σ_T, ω, σ) with N = 10000. For a vanishing learning rate ν → 0 or infinitely many samples α → ∞, p_C → 1/2 and hence I(σ_T^μ : σ^μ) → 0. The maximum value I(σ_T^μ : σ^μ) = ln 2 is only reached for small α and decreases rapidly with increasing α, even for values of α where it is possible to construct a weight vector that classifies all the samples correctly [25]. This is a consequence of both the thermal noise in the system and the well-known failure of Hebbian learning to use the information in the samples perfectly [24]. We note that while the integral in eq. (20) has to be evaluated numerically, p_C can be closely approximated analytically by p(λ^μ > 0) with the replacement ν → ν/2 [35] (dashed lines in fig. 3).

FIG. 3. Hebbian learning in the thermodynamic limit. We plot the mutual information between the true and predicted label of a randomly chosen sample (21) in the limit t → ∞ with N, P → ∞ as a function of α = P/N, computing p_C from (20) (solid lines) and by Monte Carlo integration of p(σ_T, ω, σ) (crosses, error bars indicate one standard deviation). The inset shows the learning efficiency (22) in the limits τ → 0 (solid) and τ → ∞ (dashed). In both plots, ν increases from bottom to top.
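Evaluating eq. (20) numerically is simple; the sketch below (ours, using scipy quadrature; parameter values arbitrary) computes p_C and the mutual information (21), compares them with the analytical approximation p(λ^μ > 0) under the replacement ν → ν/2, and, using the expressions for ΔS(ω_n) and eq. (22) as reconstructed here, evaluates the efficiency in the slow-driving limit where Q̄_n → 0.

```python
import numpy as np
from scipy import integrate
from scipy.special import erf

def p_correct(nu, alpha):
    # Eq. (20): average of e^l / (e^l + 1) over the Gaussian stability
    # distribution p(lambda) with mean nu and variance 1 + nu^2 alpha
    var = 1 + nu**2 * alpha
    integrand = lambda l: (np.exp(-(l - nu)**2 / (2 * var))
                           / np.sqrt(2 * np.pi * var) / (1 + np.exp(-l)))
    return integrate.quad(integrand, -np.inf, np.inf)[0]

def binary_entropy(p):
    return -p * np.log(p) - (1 - p) * np.log(1 - p)

nu, alpha = 2.0, 1.0
pC = p_correct(nu, alpha)
I = np.log(2) - binary_entropy(pC)                     # eq. (21)

# Analytical approximation: p(lambda > 0) with nu -> nu/2 [35]
pC_approx = 0.5 + 0.5 * erf((nu / 2) / np.sqrt(2 * (1 + nu**2 * alpha / 4)))

# Efficiency (22) in the limit tau -> infty, where the dissipated heat vanishes
eta = alpha * I / (0.5 * np.log(1 + nu**2 * alpha))

print(f"p_C = {pC:.4f}, approx = {pC_approx:.4f}")
print(f"I = {I:.4f} nats, eta(tau -> infty) = {eta:.4f}")
```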
Together, these results allow us to define the efficiency of Hebbian learning as a function of just ν and α,

    η ≡ α I(σ_T^μ : σ^μ) / (ΔS(ω_n) + Q̄_n),    (22)

where we have taken the mutual information per sample and the total entropy production per weight, multiplied by the number of samples and weights, respectively. Plotted in the inset of figure 3, this efficiency never reaches the optimal value 1, even in the limit of vanishing dissipation (solid lines in fig. 3).

Conclusion and perspectives. We have introduced neural networks as models for studying the thermodynamic efficiency of learning. For the paradigmatic case of learning arbitrary binary labels for given inputs, we showed that the information acquired is bounded by the thermodynamic cost of learning. This is true for learning an arbitrary number of samples in an arbitrary number of dimensions for any learning algorithm without feedback, for both batch and online learning.

Our framework opens up numerous avenues for further work. It will be interesting to analyse the efficiency of learning algorithms that employ feedback or use an auxiliary memory [42]. Furthermore, synaptic weight distributions are experimentally accessible [43, 44], offering the exciting possibility to test predictions on learning algorithms by looking at neural weight distributions. The inverse problem, i.e. deducing features of learning algorithms or the neural hardware that implements them by optimising some functional like the efficiency, looks like a formidable challenge, despite some encouraging progress in related fields [45, 46].

ACKNOWLEDGMENTS

We thank David Hartich for stimulating discussions and careful reading of the manuscript.
[1] S. Leibler and N. Barkai, Nature 387, 913 (1997).
[2] W. Bialek, Biophysics: Searching for Principles, Princeton University Press, 2011.
[3] G. Lan, P. Sartori, S. Neumann, V. Sourjik, and Y. Tu, Nat. Phys. 8, 422 (2012).
[4] U. Seifert, Rep. Prog. Phys. 75, 126001 (2012).
[5] J. M. R. Parrondo, J. M. Horowitz, and T. Sagawa, Nat. Phys. 11, 131 (2015).
[6] H. Qian and T. C. Reluga, Phys. Rev. Lett. 94, 028101 (2005).
[7] Y. Tu, Proc. Natl. Acad. Sci. U.S.A. 105, 11737 (2008).
[8] P. Mehta and D. J. Schwab, Proc. Natl. Acad. Sci. U.S.A. 109, 17978 (2012).
[9] G. De Palo and R. G. Endres, PLoS Comput. Biol. 9, e1003300 (2013).
[10] C. C. Govern and P. R. ten Wolde, Phys. Rev. Lett. 113, 258102 (2014).
[11] C. C. Govern and P. R. ten Wolde, Proc. Natl. Acad. Sci. U.S.A. 111, 17486 (2014).
[12] A. C. Barato, D. Hartich, and U. Seifert, New J. Phys. 16, 103024 (2014).
[13] A. H. Lang, C. K. Fisher, T. Mora, and P. Mehta, Phys. Rev. Lett. 113, 148103 (2014).
[14] P. Sartori, L. Granger, C. F. Lee, and J. M. Horowitz, PLoS Comput. Biol. 10, e1003974 (2014).
[15] S. Ito and T. Sagawa, Nat. Commun. 6, 7498 (2015).
[16] D. Andrieux and P. Gaspard, Proc. Natl. Acad. Sci. U.S.A. 105, 9516 (2008).
[17] A. Murugan, D. A. Huse, and S. Leibler, Proc. Natl. Acad. Sci. U.S.A. 109, 12034 (2012).
[18] D. Hartich, A. C. Barato, and U. Seifert, New J. Phys. 17, 055026 (2015).
[19] S. Lahiri, Y. Wang, M. Esposito, and D. Lacoste, New J. Phys. 17, 085008 (2015).
[20] A. C. Barato and U. Seifert, Phys. Rev. Lett. 114, 158101 (2015).
[21] E. R. Kandel, J. H. Schwartz, T. M. Jessell, and others, Principles of Neural Science, McGraw-Hill, New York, 2000.
[22] D. Marr, J. Physiol. 202, 437 (1969).
[23] J. S. Albus, Math. Biosci. 10, 25 (1971).
[24] A. Engel and C. Van den Broeck, Statistical Mechanics of Learning, Cambridge University Press, 2001.
[25] D. J. MacKay, Information Theory, Inference and Learning Algorithms, Cambridge University Press, 2003.
[26] Experimental justification for focusing on a single neuron comes from studies on psychophysical judgements in monkeys, which have been shown to depend on very few neurons [47].
[27] N. van Kampen, Stochastic Processes in Physics and Chemistry, Elsevier, 1992.
[28] Restricting the size of the weights reflects experimental evidence suggesting the existence of an upper bound on synaptic strength in diverse nervous systems [48].
[29] D. Hartich, A. C. Barato, and U. Seifert, J. Stat. Mech. 2014, P02016 (2014).
[30] J. M. Horowitz, J. Stat. Mech. 2015, P03006 (2015).
[31] T. M. Cover and J. A. Thomas, Elements of Information Theory, John Wiley & Sons, 2006.
[32] A. E. Allahverdyan, D. Janzing, and G. Mahler, J. Stat. Mech. 2009, P09011 (2009).
[33] T. Sagawa and M. Ueda, Phys. Rev. Lett. 104, 090602 (2010).
[34] J. M. Horowitz and M. Esposito, Phys. Rev. X 4, 031015 (2014).
[35] See Supplemental Material at ..., which includes Ref. [49], for a detailed derivation.
[36] D. O. Hebb, The Organization of Behavior: A Neuropsychological Approach, John Wiley & Sons, 1949.
[37] D. Abreu and U. Seifert, Europhys. Lett. 94, 10001 (2011).
[38] H. Risken, The Fokker-Planck Equation, Springer, 1996.
[39] In the limit of large N, only the first two moments of the distribution will matter, making this choice equivalent to sampling from the surface of a hypersphere in N dimensions in that limit.
[40] ω_n is normally distributed since the Langevin equation (15) defines an Ornstein-Uhlenbeck process ω_n which, for a Gaussian initial condition as we have chosen, remains normally distributed [27].
[41] E. Gardner, Europhys. Lett. 4, 481 (1987).
[42] D. Hartich, A. C. Barato, and U. Seifert, Phys. Rev. E 93, 022116 (2016).
[43] N. Brunel, V. Hakim, P. Isope, J.-P. Nadal, and B. Barbour, Neuron 43, 745 (2004).
[44] B. Barbour, N. Brunel, V. Hakim, and J.-P. Nadal, Trends Neurosci. 30, 622 (2007).
[45] G. Tkacik, A. M. Walczak, and W. Bialek, Phys. Rev. E 80, 031920 (2009).
[46] T. R. Sokolowski and G. Tkacik, Phys. Rev. E 91, 062710 (2015).
[47] W. T. Newsome, K. H. Britten, and J. A. Movshon, Nature 341, 52 (1989).
[48] P. Dayan and L. F. Abbott, Theoretical Neuroscience, MIT Press, 2001.
[49] J. M. Horowitz and H. Sandberg, New J. Phys. 16, 125007 (2014).
[50] If we restricted ourselves to online learning, where the learning force is a local force with only one sample and its label acting on the weights, we could consider this as an upper bound on the amount of information that the weights can acquire during learning, yielding the same result for the efficiency.
SUPPLEMENTAL MATERIAL
STOCHASTIC THERMODYNAMICS OF LEARNING

Sebastian Goldt and Udo Seifert

II. Institut für Theoretische Physik, Universität Stuttgart, 70550 Stuttgart, Germany
(Dated: November 30, 2016)

In this supplemental material, we discuss the stochastic thermodynamics of neural networks in detail in section I
and derive our main result, eq. (16) of the main text, in section II. Furthermore, we complement our discussion of
Hebbian learning in the thermodynamic limit with additional analytical calculations in section III.

I. STOCHASTIC THERMODYNAMICS OF NEURAL NETWORKS

We now give a detailed account of the stochastic thermodynamics of neural networks. For simplicity, here we will focus on batch learning; the generalisation to online learning is straightforward. For a network with N weights ω ∈ R^N learning P samples ξ^μ ∈ {±1}^N with their labels σ_T^μ = ±1, μ = 1, 2, . . . , P, we have N Langevin equations [S1]

    ω̇_n(t) = −ω_n(t) + f(ω_n(t), {ξ_n^μ, σ_T^μ}, t) + ζ_n(t).    (S1)

The Gaussian noise ζ_n(t) has correlations ⟨ζ_n(t)ζ_m(t′)⟩ = 2T δ_{nm}δ(t − t′) for n, m = 1, . . . , N, where T is the temperature of the surrounding medium and we have set Boltzmann's constant to unity to render entropy dimensionless. The weights determine the transition rates of the P independent two-state processes for the predicted labels σ^μ via

    k_+^μ / k_−^μ = exp(A^μ / T)    (S2)

where A^μ is the input-dependent activation

    A^μ ≡ ω · ξ^μ / √N.    (S3)
For the remainder of this supplemental material, we set T = 1, rendering energy dimensionless. We assume that the thermal noise in each subsystem, like ω_n or σ^μ, is independent of all the others. This multipartite assumption [S2] allows us to write the master equation for the distribution p(σ_T, ω, σ, t) with σ_T ≡ (σ_T^1, . . . , σ_T^P) and σ ≡ (σ^1, . . . , σ^P) as

    ∂_t p(σ_T, ω, σ, t) = −Σ_{n=1}^N ∂_n j_n(t) + Σ_{μ=1}^P j_σ^μ(t),    (S4)

where t /t, n / n and the probability currents for the n-th weight n and the -th predicted label
are given by
h i
(t)
jn (t) = n + f (n , (t) , T , t) n p( T , , , t), (S5a)
j (t) =k + p( T , , 1 , . . . , , . . . , P , t) k p( T , , , t). (S5b)
We choose symmetric rates k = exp(A /2) with  1. Initially, the true labels T , weights and predicted
labels are all uncorrelated with

p0 (T ) =1/2, (S6)
p0 ( ) =1/2, and (S7)
1
p0 () = exp( /2). (S8)
(2)N/2
Since the following discussion applies to the time-dependent dynamics (S4), we understand that all quantities that will be introduced in the remainder of this section have an implicit time-dependence via the distribution p(σ_T, ω, σ, t) or the currents (S5).

Our starting point for the stochastic thermodynamics of this system is the well-known total entropy production Ṡ_tot of the network, which obeys the following second-law-like inequality [S3]

    Ṡ_tot = ∂_t S(σ_T, ω, σ) + Ṡ_m ≥ 0    (S9)
with equality in equilibrium only. Here, we have the Shannon entropy [S5] of the system,

    S(σ_T, ω, σ) = −Σ_{σ_T, σ} ∫ dω p(σ_T, ω, σ) ln p(σ_T, ω, σ).    (S10)

Here, we include the variables σ_T, ω and σ as arguments of the function S in a slight abuse of notation to emphasise that we consider the Shannon entropy of the full distribution p(σ_T, ω, σ). Ṡ_m gives the rate of entropy production in the medium. For a system at constant temperature T = 1, Ṡ_m ≡ Q̇, the rate of heat dissipation into the medium [S3]. Let us first focus on the change in Shannon entropy by differentiating (S10) with respect to time,

    ∂_t S(σ_T, ω, σ) = −Σ_{σ_T, σ} ∫ dω ṗ(σ_T, ω, σ) ln p(σ_T, ω, σ),    (S11)

where we have used that p(σ_T, ω, σ) is, of course, normalised. Using the master equation (S4), we find that

    ∂_t S(σ_T, ω, σ) = Σ_{n=1}^N Ṡ_n + Σ_{μ=1}^P Ṡ_μ    (S12)

where

    Ṡ_n ≡ Σ_{σ_T, σ} ∫ dω [∂_n j_n(t)] ln p(σ_T, ω, σ),    (S13)
    Ṡ_μ ≡ −Σ_{σ_T, σ} ∫ dω j_σ^μ(t) ln p(σ_T, ω, σ)    (S14)

are the rates of change of the Shannon entropy S(σ_T, ω, σ) due to the dynamics of ω_n and σ^μ, respectively. The key point here is that multipartite dynamics, a consequence of the uncorrelated noise across subsystems, lead to a linear splitting of the probability currents and hence to a linear splitting of all quantities which are functions of the total probability current. Similarly, for the rate of heat dissipation Q̇, we can write
    Q̇ = Σ_{n=1}^N Q̇_n + Σ_{μ=1}^P Q̇_μ    (S15)

where

    Q̇_n = Σ_{σ_T, σ} ∫ dω j_n(t) F_n(σ_T, ω, σ)    (S16)

with the total force on the n-th weight F_n = −ω_n(t) + f(ω_n(t), {ξ_n^μ, σ_T^μ}, t), while

    Q̇_μ = Σ_{σ_T, σ} ∫ dω j_σ^μ(t) σ^μ A^μ / 2.    (S17)

Finally, the total entropy production Ṡ_tot can also be split,

    Ṡ_tot = Σ_{n=1}^N Ṡ_n^tot + Σ_{μ=1}^P Ṡ_μ^tot.    (S18)

It can easily be shown that each of these total entropy productions of a subsystem obeys a separate second-law-like inequality, e.g.

    Ṡ_n^tot = Ṡ_n(σ_T, ω, σ) + Q̇_n ≥ 0    (S19)

for the n-th weight.


Writing

    p(σ_T, ω, σ) = p(ω_n) p(σ_T, ω_{−n}, σ | ω_n)    (S20)

with ω_{−n} ≡ (ω_1, . . . , ω_{n−1}, ω_{n+1}, . . . , ω_N), we can split Ṡ_n(σ_T, ω, σ) into two parts: first, the change of Shannon entropy of the marginalized distribution p(ω_n),

    Ṡ_n(ω_n) = Σ_{σ_T, σ} ∫ dω [∂_n j_n(t)] ln p(ω_n) = ∂_t S(ω_n),    (S21)

where the last equality follows from the fact that an entropy change of the marginalized distribution p(ω_n) can only come from the dynamics of ω_n. The second part is called the learning rate [S4]

    l_n(ω_n; σ_T, ω_{−n}, σ) ≡ −Σ_{σ_T, σ} ∫ dω [∂_n j_n(t)] ln p(σ_T, ω_{−n}, σ | ω_n)    (S22)

or information flow [S6, S7]. We emphasise that this learning rate l_n is thermodynamic and has nothing to do with the learning rate ν that goes into the definition of the learning algorithms, see for example eq. (12) of the main text. To avoid confusion, we will refer to l_n as the thermodynamic learning rate for the remainder of this supplemental material. The second law (S19) for the n-th weight hence becomes

    Ṡ_n^tot = ∂_t S(ω_n) + Q̇_n − l_n(ω_n; σ_T, ω_{−n}, σ) ≥ 0.    (S23)

The thermodynamic learning rate is a thermodynamically consistent measure of how much the dynamics of ω_n change the mutual information I(ω_n : σ_T, ω_{−n}, σ), in particular for a system that continuously rewrites a single memory [S8]. We can further refine the second law (S23) by exploiting the causal structure of the dynamics, as was recently suggested by Horowitz [S2]. The subsystem ω_n directly interacts only with those degrees of freedom that appear in its probability current j_n(t) (S5). From inspection of the current j_n(t), we see that ω_n is directly influenced only by itself and the given labels σ_T. Keeping this in mind, we use the chain rule for mutual information [S5] to write

    I(ω_n : σ_T, ω_{−n}, σ) = I(ω_n : σ_T) + I(ω_n : ω_{−n}, σ | σ_T),    (S24)

where we use the conditional mutual information

    I(ω_n : ω_{−n}, σ | σ_T) = S(ω_n | σ_T) − S(ω_n | ω_{−n}, σ, σ_T)    (S25)
      = Σ_{σ_T, σ} ∫ dω p(σ_T, ω, σ) ln [ p(σ_T, ω, σ) p(σ_T) / ( p(ω_n, σ_T) p(ω_{−n}, σ, σ_T) ) ].    (S26)

Accordingly, we split the thermodynamic learning rate (S22) into a thermodynamic learning rate of the n-th weight with the degrees of freedom that it directly interacts with, i.e. the true labels σ_T,

    l_n(ω_n; σ_T) ≡ −Σ_{σ_T, σ} ∫ dω [∂_n j_n(t)] ln p(σ_T | ω_n),    (S27)

and a thermodynamic learning rate with the other subsystems given the true labels,

    l_n(ω_n; ω_{−n}, σ | σ_T) ≡ −Σ_{σ_T, σ} ∫ dω [∂_n j_n(t)] ln [ p(ω_n, ω_{−n}, σ | σ_T) / ( p(ω_n | σ_T) p(ω_{−n}, σ | σ_T) ) ].    (S28)

Horowitz proved [S2] the following second-law-like inequality including the refined thermodynamic learning rate (S27),

    ∂_t S(ω_n) + Q̇_n − l_n(ω_n; σ_T) ≥ 0,    (S29)

which is the basis for our proof of the main inequality, equation (16) of the main text.

II. DERIVATION OF INEQUALITY (16) OF THE MAIN TEXT

The stochastic thermodynamics of neural networks yields N inequalities of the form (S29). Integrating over time and summing over all the weights, we find

    Σ_{n=1}^N [ΔS(ω_n) + Q_n] ≥ Σ_{n=1}^N ∫₀^t dt′ l_n(ω_n; σ_T) = Σ_{n=1}^N I(ω_n : σ_T).    (S30)

The precise definitions of all the terms are discussed in the main text and in section I of this supplemental material. The crucial point for the last equality is that the labels σ_T are static, so that the mutual information I(ω_n : σ_T) changes only due to the dynamics of ω_n and hence ∂_t I(ω_n : σ_T) = l_n(ω_n; σ_T) [S50]. To make progress towards our main result, inequality (16) of the main text, we need to show that

    Σ_{n=1}^N I(ω_n : σ_T) ≥ Σ_{μ=1}^P I(σ_T^μ : σ^μ).    (S31)

First, we note that from the chain rule of mutual information [S5], we have

    I(ω : σ_T) = I(ω_1, . . . , ω_N : σ_T) = Σ_{n=1}^N I(ω_n : σ_T | ω_{n−1}, . . . , ω_1)    (S32)

with the conditional mutual information [S5]

    I(ω_n : σ_T | ω_{n−1}, . . . , ω_1) ≡ S(ω_n | ω_{n−1}, . . . , ω_1) − S(ω_n | σ_T, ω_{n−1}, . . . , ω_1).    (S33)

Due to the form of the Langevin equation for the single weight, eq. (S1), individual weights are uncorrelated, and hence the conditional mutual information simplifies to

    I(ω_n : σ_T | ω_{n−1}, . . . , ω_1) = S(ω_n | ω_{n−1}, . . . , ω_1) − S(ω_n | σ_T, ω_{n−1}, . . . , ω_1)    (S34)
      = S(ω_n) − S(ω_n | σ_T)    (S35)
      = I(ω_n : σ_T)    (S36)

such that

    Σ_{n=1}^N I(ω_n : σ_T) = I(ω : σ_T).    (S37)

Next, we show that

    I(ω : σ_T) = Σ_{μ=1}^P I(ω : σ_T^μ | σ_T^{μ−1}, . . . , σ_T^1) ≥ Σ_{μ=1}^P I(ω : σ_T^μ)    (S38)

using the independence of the given labels σ_T^μ. We first note that

    I(ω : σ_T^μ | σ_T^{μ−1}, . . . , σ_T^1) = S(σ_T^μ | σ_T^{μ−1}, . . . , σ_T^1) − S(σ_T^μ | ω, σ_T^{μ−1}, . . . , σ_T^1)    (S39)
      = S(σ_T^μ) − S(σ_T^μ | ω, σ_T^{μ−1}, . . . , σ_T^1)    (S40)

while

    I(ω : σ_T^μ) = S(σ_T^μ) − S(σ_T^μ | ω).    (S41)

Hence for I(ω : σ_T^μ | σ_T^{μ−1}, . . . , σ_T^1) ≥ I(ω : σ_T^μ), we need

    I(ω : σ_T^μ | σ_T^{μ−1}, . . . , σ_T^1) − I(ω : σ_T^μ)    (S42)
      = S(σ_T^μ | ω) − S(σ_T^μ | ω, σ_T^{μ−1}, . . . , σ_T^1)    (S43)
      = I(σ_T^μ : σ_T^{μ−1}, . . . , σ_T^1 | ω)    (S44)
      ≥ 0,    (S45)

where we first used that the σ_T^μ are independent and identically distributed. The last inequality follows since any mutual information, conditional or not, is always greater than or equal to zero [S5]. We have thus shown that I(ω : σ_T^μ | σ_T^{μ−1}, . . . , σ_T^1) ≥ I(ω : σ_T^μ) and hence (S38) is true.
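This step is easy to verify numerically. The sketch below (ours, not part of the supplement) draws a random conditional distribution for a discretised weight given two independent binary labels and checks that conditioning on the first label can only increase the mutual information between the weight and the second label, eqs. (S42)-(S45).

```python
import numpy as np

rng = np.random.default_rng(6)

# Two independent labels sT1, sT2 and a discretised weight omega
n_w = 6
p_labels = np.full((2, 2), 0.25)                   # p(sT1, sT2), independent
p_w = rng.dirichlet(np.ones(n_w), size=(2, 2))     # p(omega | sT1, sT2)
p = p_labels[:, :, None] * p_w                     # joint p(sT1, sT2, omega)

def mi(q):
    # I(X : Y) in nats for a joint distribution given as a matrix q(x, y)
    qx, qy = q.sum(1, keepdims=True), q.sum(0, keepdims=True)
    return (q * np.log(q / (qx @ qy))).sum()

I_uncond = mi(p.sum(0))                              # I(omega : sT2)
I_cond = sum(p[s1].sum() * mi(p[s1] / p[s1].sum())   # I(omega : sT2 | sT1)
             for s1 in range(2))

print(f"I(omega : sT2)       = {I_uncond:.4f}")
print(f"I(omega : sT2 | sT1) = {I_cond:.4f}  (always >= the line above)")
```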

Finally, to prove that I(ω : σ_T^μ) ≥ I(σ_T^μ : σ^μ), we consider the full probability distribution p(σ_T, ω, σ). From the master equation, eq. (S4), we can write this distribution as

    p(σ_T, ω, σ) = p(σ_T) p(ω | σ_T) [ p^{(0)}(σ | ω) + γ^{−1} p^{(1)}(σ | ω) + O(1/γ²) ]    (S46)

with γ ≫ 1 for physiological reasons as described in the text: it takes the neuron longer to learn than to generate an action potential. Hence to first order, σ_T → ω → σ is by definition a Markov chain [S5]. Integrating out all the labels, true and predicted, except for the μ-th one, we have the Markov chain σ_T^μ → ω → σ^μ. For such a Markov chain, it is easy to show the following data processing inequality [S5],

    I(σ_T^μ : σ^μ) ≤ I(σ_T^μ : ω),    (S47)

which completes our derivation.
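The data processing inequality (S47) can likewise be checked on a toy discretisation of the Markov chain σ_T → ω → σ; the following sketch (ours) builds a random such chain and compares the two mutual informations.

```python
import numpy as np

rng = np.random.default_rng(3)

def mi(q):
    # I(X : Y) in nats for a joint distribution given as a matrix q(x, y)
    qx, qy = q.sum(1, keepdims=True), q.sum(0, keepdims=True)
    return (q * np.log(q / (qx @ qy))).sum()

# Markov chain sigma_T -> omega -> sigma with a discretised weight
n_w = 8
p_sT = np.array([0.5, 0.5])
p_w_given_sT = rng.dirichlet(np.ones(n_w), size=2)   # p(omega | sigma_T)
p_s_given_w = rng.dirichlet(np.ones(2), size=n_w)    # p(sigma | omega)

p_sT_w = p_sT[:, None] * p_w_given_sT                # p(sigma_T, omega)
p_sT_s = p_sT_w @ p_s_given_w                        # p(sigma_T, sigma)

print(f"I(sigma_T : omega) = {mi(p_sT_w):.4f}")
print(f"I(sigma_T : sigma) = {mi(p_sT_s):.4f}  (<= by eq. S47)")
```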

III. HEBBIAN LEARNING IN THE THERMODYNAMIC LIMIT

In this section, we provide additional analytical calculations for Hebbian learning in the thermodynamic limit for long times t → ∞.

A. Direct integration of the full distribution p(σ_T, ω, σ)

To compute the mutual information between the true and predicted label of a given sample, I(σ_T^μ : σ^μ), we need the distribution p(σ_T^μ, σ^μ) or, since both σ_T^μ and σ^μ are symmetric binary random variables, the probability that σ_T^μ = σ^μ. Our aim in this section is to obtain this probability for Hebbian learning in the thermodynamic limit with t → ∞ by direct integration of the full distribution over the true labels, weights and predicted labels for a given set of samples {ξ^μ}, which will also give additional motivation for introducing the stability of a sample.
We start with the full probability distribution

    p(σ_T, ω, σ) = (1/2)^P [ Π_{n=1}^N e^{−(ω_n − νF_n)²/2} / √(2π) ] [ Π_{μ=1}^P e^{σ^μ ω·ξ^μ/2√N} / ( e^{ω·ξ^μ/2√N} + e^{−ω·ξ^μ/2√N} ) ],    (S48)

where ν is the learning rate and F_n is a suitably scaled average over the samples and labels,

    F_n = (1/√N) Σ_{μ=1}^P σ_T^μ ξ_n^μ.    (S49)

While the sum over the predicted labels σ^ρ = ±1, ρ ≠ μ, is trivial, we can integrate over the true labels by noting that we can rewrite the exponent as

    p(σ_T, ω, σ^μ) = (1/2)^P [ Π_{n=1}^N e^{−(ω_n − νσ_T^μ ξ_n^μ/√N − νF̃_n^μ)²/2} / √(2π) ] e^{σ^μ ω·ξ^μ/2√N} / ( e^{ω·ξ^μ/2√N} + e^{−ω·ξ^μ/2√N} ),    (S50)

where the only dependence of the weight distribution on the true labels σ_T is now confined to the sum

    F̃_n^μ ≡ (1/√N) Σ_{ρ≠μ} σ_T^ρ ξ_n^ρ.    (S51)
In the thermodynamic limit, this allows us to replace the sum over all σ_T^ρ by an integral over the stochastic variable F̃_n^μ, which is normally distributed by the central limit theorem and has mean 0 and variance α. Carrying out the integral, we find

    p(σ_T^μ, ω, σ^μ) = [ Π_{n=1}^N e^{−(ω_n − νσ_T^μ ξ_n^μ/√N)²/2(1+ν²α)} / √(2π(1+ν²α)) ] e^{σ^μ ω·ξ^μ/2√N} / ( e^{ω·ξ^μ/2√N} + e^{−ω·ξ^μ/2√N} ).    (S52)

Since both T and are binary random variables and T = 1 with equal probabilities, the mutual information
between the true and predicted label can be written as

I(T : ) = ln 2 S[p(T = )] (S53)
6


with the shorthand for the binary entropy S[p] = p ln p (1 p) ln(1 p) [S5]. With = T in the exponential
2
term of eq. (S52) and noting that (T n ) = 1 for all T , n , we then have
2
N 2
!

Y e(n T n / N ) /2(1+ ) eT /2 N
p(T = , ) = p


(S54)
n=1 2(1 + 2 ) e /2 N + e /2 N


It thus becomes clear that T is the sum of N random variables with mean / N and variance 1 + 2 . We
are then motivated to introduce the stability of a sample,
1
T = A T . (S55)
N
which, from eq. (S54), is normally distributed with mean and variance 1 + 2 . Introducing the stability allows us
to replace the integral over all the weights by an integral over the stability,
Z 2 2 Z
e( ) /2(1+ ) e e
p(T = ) = d p = d
p(
) (S56)
2(1 + 2 ) 1 + e 1 + e

which is the distribution obtained as eq. (20) of the main text.

B. Direct derivation of the distribution of stabilities

Let us quickly show how the distribution of stabilities

    λ^μ ≡ (1/√N) ω·ξ^μ σ_T^μ,    (S57)

μ = 1, . . . , P, is obtained directly from its definition. The weights are given by

    ω = (ν/√N) Σ_{μ=1}^P σ_T^μ ξ^μ + y    (S58)

with y = (y_1, y_2, . . . , y_N), where the y_n are normally distributed random variables with mean 0 and variance 1 arising from the thermal fluctuations in equilibrium. Substituting eq. (S58) into (S57), we have

    λ^μ = (ν/N) Σ_{ρ=1}^P σ_T^μ σ_T^ρ ξ^μ·ξ^ρ + (1/√N) σ_T^μ ξ^μ·y    (S59)
      = ν + (ν/N) Σ_{ρ≠μ} σ_T^μ σ_T^ρ ξ^μ·ξ^ρ + (1/√N) σ_T^μ ξ^μ·y,    (S60)

where going to the last line we have used the fact that ξ^μ·ξ^μ = N. By inspection, we see that the second term is the sum of N(P − 1) ≈ NP random numbers ±ν/N and the last term is the sum of N random numbers y_n ξ_n^μ/√N. By the central limit theorem, λ^μ is hence normally distributed with mean ⟨λ^μ⟩ = ν and variance

    ⟨(λ^μ)²⟩ − ⟨λ^μ⟩² = (ν²/N²) NP + N (1/N) = 1 + ν²α.    (S61)
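These statistics can be confirmed by direct sampling; the sketch below (ours, with arbitrary parameters) draws the quenched disorder, forms the weights of eq. (S58) and collects the stabilities (S57).

```python
import numpy as np

rng = np.random.default_rng(4)

N, alpha, nu = 1000, 1.5, 1.0
P, n_disorder = int(alpha * N), 20

lams = []
for _ in range(n_disorder):
    xi = rng.choice([-1, 1], size=(P, N))
    sigma_T = rng.choice([-1, 1], size=P)
    # Hebbian weights plus thermal fluctuations, eq. (S58)
    omega = (nu / np.sqrt(N)) * (sigma_T @ xi) + rng.standard_normal(N)
    # Stabilities of all P samples, eq. (S57)
    lams.append(sigma_T * (xi @ omega) / np.sqrt(N))

lams = np.concatenate(lams)
print(f"mean(lambda) = {lams.mean():.3f}  (expect nu = {nu})")
print(f"var(lambda)  = {lams.var():.3f}  (expect 1 + nu^2 alpha = {1 + nu**2 * alpha:.2f})")
```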

C. Analytical approximation for I(σ_T^μ : σ^μ)

We quantify the success of learning using the mutual information per sample,

    I(σ_T^μ : σ^μ) = ln 2 − S(p_C),    (S62)

where S(p) = −[p ln p + (1 − p) ln(1 − p)] is the binary Shannon entropy and p_C is defined as

    p_C ≡ p(σ^μ = σ_T^μ) = ∫ dλ^μ p(λ^μ) e^{λ^μ} / (e^{λ^μ} + 1).    (S63)
The stabilities λ^μ are normally distributed with mean ν and variance 1 + ν²α (see section III B). This integral does not have a closed-form analytical solution, but here we will demonstrate a very good analytical approximation. To that end, we first rewrite the sigmoid function in the integrand in terms of the hyperbolic tangent and exploit the similarity of the latter to the error function:

    p_C = ∫ dλ p(λ) e^{λ/2} / ( e^{λ/2} + e^{−λ/2} )    (S64)
      = 1/2 + (1/2) ∫ dλ p(λ) tanh(λ/2)    (S65)
      ≃ 1/2 + (1/2) ∫ dλ p(λ) erf(κλ/2),    (S66)

where we choose κ = 4/5 by inspection of the graphs of the two functions. Now the convolution of a normal distribution and an error function has an exact solution,

    ∫ dx (1/√(2πd²)) erf(ax + b) exp( −(x − c)²/2d² ) = erf( (b + ac) / √(1 + 2a²d²) ).    (S67)

Setting a = κ/2, b = 0, c = ν and d² = 1 + ν²α, we find that

    p_C(ν, α) ≃ 1/2 + (1/2) erf( (κν/2) / √(1 + κ²(1 + ν²α)/2) )    (S68)
      = 1/2 + (1/2) erf( (ν/2) / √(25/16 + 1/2 + ν²α/2) )    (S69)
      ≃ 1/2 + (1/2) erf( (ν/2) / √(2(1 + ν²α/4)) )    (S70)
      = p(λ^μ > 0 | ν/2, α),    (S71)

where in the last line we recognise by inspection that our result is nothing but the integral over the distribution of stabilities p(λ^μ | ν/2, α) from 0 to ∞. The probability that the neuron predicts the correct label is hence given by the probability that the neuron learned the label correctly, λ^μ > 0, with half the learning rate.
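The convolution identity (S67), on which this approximation rests, can be spot-checked numerically for random parameters, e.g. with the following sketch (ours, using scipy):

```python
import numpy as np
from scipy import integrate
from scipy.special import erf

rng = np.random.default_rng(5)

# Check eq. (S67): the Gaussian average of erf(ax + b) has a closed form
for _ in range(3):
    a, b, c = rng.uniform(-2, 2, size=3)
    d2 = rng.uniform(0.1, 4.0)
    lhs = integrate.quad(
        lambda x: erf(a * x + b)
                  * np.exp(-(x - c)**2 / (2 * d2)) / np.sqrt(2 * np.pi * d2),
        -np.inf, np.inf)[0]
    rhs = erf((b + a * c) / np.sqrt(1 + 2 * a**2 * d2))
    print(f"a={a:+.2f} b={b:+.2f} c={c:+.2f} d^2={d2:.2f}: "
          f"lhs={lhs:.6f}, rhs={rhs:.6f}")
```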

[S1] N. van Kampen, Stochastic Processes in Physics and Chemistry, Elsevier, 1992.
[S2] J. M. Horowitz, J. Stat. Mech. 2015, P03006 (2015).
[S3] U. Seifert, Rep. Prog. Phys. 75, 126001 (2012).
[S4] D. Hartich, A. C. Barato, and U. Seifert, J. Stat. Mech. 2014, P02016 (2014).
[S5] T. M. Cover and J. A. Thomas, Elements of Information Theory, John Wiley & Sons, 2006.
[S6] A. E. Allahverdyan, D. Janzing, and G. Mahler, J. Stat. Mech. 2009, P09011 (2009).
[S7] J. M. Horowitz and M. Esposito, Phys. Rev. X 4, 031015 (2014).
[S8] J. M. Horowitz and H. Sandberg, New J. Phys. 16, 125007 (2014).
