Bayesian inference, statistical physics, and the planted directed polymer problem

To cite this page
@misc{swpkim2025bayesian,
   author={P. Kim, Sun Woo},
   title={Bayesian inference, statistical physics, and the planted directed polymer problem},
   year={2024},
   howpublished={\url{https://sunwoo-kim.github.io/en/activities/talks/240409-imperial/}},
   note={Accessed: 2025-02-22}
}
Below are notes for my board talk that I gave on 2024-04-09 at the regular condensed matter theory meeting at Imperial College London.
Introduction

Recently, there has been a lot of activity in casting Bayesian inference problems as disordered stat-mech problems, then using tools from spin-glass theory to solve them; see Zdeborová 1511.02476. Here I wanted to give a quick introduction to this concept.

In Bayesian inference, we try to infer a state $x$ given some measurements/data $y$. We assume some prior distribution $p(x)$ and a measurement model/likelihood $p(y|x)$, and infer a posterior distribution using Bayes' rule, $p(x|y) = p(y|x)p(x)/p(y)$, where the 'evidence' $p(y) = \sum_x p(y|x)p(x)$ acts as the normalisation for the posterior. If the state space is large, then computing this sum is in general intractable. It is very similar to the partition function in statistical physics.
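As a minimal sketch (with made-up numbers, not from the talk), here is Bayes' rule for a small discrete state space, where the evidence sum is still tractable:

```python
import numpy as np

# Toy discrete Bayes' rule: three possible states, one observed data point.
states = np.array([0, 1, 2])
prior = np.array([0.5, 0.3, 0.2])          # p(x)
likelihood = np.array([0.1, 0.7, 0.2])     # p(y | x) for the observed y

evidence = np.sum(likelihood * prior)       # p(y) = sum_x p(y|x) p(x)
posterior = likelihood * prior / evidence   # p(x|y) = p(y|x) p(x) / p(y)

print(posterior, posterior.sum())           # normalised: sums to 1
```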

If $y$ came from a 'true' state $x^*$, then the natural question to ask is whether inference is possible. To quantify this, for continuous variables, we may look at the mean-squared error, $\mathrm{MSE} = \mathbb{E}_{x \sim p(x|y)}[(x - x^*)^2]$. Is there any way we can prove when inference is possible? And can we have phase transitions, say in the mean-squared error, as the signal-to-noise ratio is varied?

We will consider an idealised scenario covered in the above reference, called the teacher-student scenario. The teacher generates some state of a system, $x^*$, from the 'true/teacher's prior' $p_T(x^*)$. Then the teacher generates some data/measurements $y$ given the state of the system via the teacher's measurement model/likelihood, $p_T(y|x^*)$. The teacher then hands the student the measurements $y$.

The student then assumes some prior distribution $p_S(x)$ and measurement model $p_S(y|x)$ to infer a posterior distribution $p_S(x|y)$. Then we can consider the joint distribution, $p(x, y, x^*) = p_S(x|y)\, p_T(y|x^*)\, p_T(x^*) = \frac{p_S(y|x)\, p_S(x)\, p_T(y|x^*)\, p_T(x^*)}{p_S(y)}$. Note that $p(x^*) = p_T(x^*)$, since we can integrate over $p_S(x|y)$ with respect to $x$ and then over $p_T(y|x^*)$ with respect to $y$. However, $p(x) \neq p_S(x)$, since we can't perform the integration over $y$ and $x^*$ first in general.

However, we can already immediately notice one thing: if the student's model is equal to the teacher's model, that is $S = T$, then $p(x, y, x^*)$ is symmetric with respect to $x \leftrightarrow x^*$, and therefore $x$ is distributed identically to $x^*$, with $p(x) = p_T(x^* = x)$. The $S = T$ point is called the Bayes-optimal point.
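To make this symmetry concrete, here is a small simulation sketch (my own illustration, with an arbitrary binary prior and a symmetric noisy channel, not a model from the talk): sampling $(x^*, y, x)$ from the joint $p(x, y, x^*)$ with a Bayes-optimal student, the statistics of $x$ and $x^*$ should be exchangeable.

```python
import numpy as np

rng = np.random.default_rng(0)

# Teacher: x* is +/-1 with equal probability; y is x* flipped with probability eps.
eps = 0.2
n = 100_000
x_star = rng.choice([-1, 1], size=n)
y = np.where(rng.random(n) < eps, -x_star, x_star)

# Bayes-optimal student (same prior and likelihood): p(x = y | y) = 1 - eps,
# so a posterior sample is y flipped with probability eps.
x = np.where(rng.random(n) < eps, -y, y)

# Check the x <-> x* symmetry: these pairs of statistics should agree.
print("P(x = +1) ", np.mean(x == 1), "  P(x* = +1)", np.mean(x_star == 1))
print("E[x y]    ", np.mean(x * y), "  E[x* y]   ", np.mean(x_star * y))
```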

To make the connection to stat-mech clearer, let's write out the student's posterior in a suggestive way: $p_S(x|y) = \frac{e^{-\beta H(x, y)}}{Z(y)}$, where $Z(y) = \sum_x e^{-\beta H(x, y)}$ and $-\beta H(x, y) = \ln q_S(y|x) + \ln q_S(x)$. Here I write unnormalised distributions, e.g. $q_S(y|x) \propto p_S(y|x)$, as normalisation is ensured by the partition function. We see that the posterior looks like the Boltzmann probability of a disordered system (e.g. a spin glass), with the 'data' $y$ playing the role of disorder, distributed as $p_T(y) = \sum_{x^*} p_T(y|x^*)\, p_T(x^*)$.
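As a tiny numerical check (my own sketch, with arbitrary numbers), the same posterior can be computed either directly from Bayes' rule or by treating $-\beta H(x, y) = \ln q_S(y|x) + \ln q_S(x)$ as an energy and $Z(y)$ as a partition function:

```python
import numpy as np

# Unnormalised student model over four states and one observed y (numbers made up).
q_prior = np.array([2.0, 1.0, 1.0, 0.5])   # q_S(x)
q_lik = np.array([0.3, 1.2, 0.1, 0.7])     # q_S(y|x) for the observed y

# Stat-mech form: -beta*H(x, y) = ln q_S(y|x) + ln q_S(x), Z(y) = sum_x e^{-beta H}.
neg_beta_H = np.log(q_lik) + np.log(q_prior)
Z = np.sum(np.exp(neg_beta_H))
posterior_boltzmann = np.exp(neg_beta_H) / Z

# Direct Bayes' rule with normalised distributions gives the same posterior.
prior = q_prior / q_prior.sum()
posterior_bayes = q_lik * prior / np.sum(q_lik * prior)

print(np.allclose(posterior_boltzmann, posterior_bayes))   # True
```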

In the limit where the data contains no information about the true/teacher's state, e.g. when the data is 'infinitely' noisy, we see that $p_T(y|x^*) \to p_T(y)$ really becomes random disorder. This would be a situation where the student believes that there is some information in the data, but the teacher is actually supplying random noise.

When there is information about the true/teacher's state in the data, this information is said to be 'planted' in the disorder. The distribution $p_T(y) = \sum_{x^*} p_T(y|x^*)\, p_T(x^*)$ is the 'planted distribution', and $p(x, y, x^*)$ is the 'planted ensemble'.

This is all well and good, but is it actually useful? Let’s look at some examples, to see how existing disordered stat-mech problems map onto inference problems.

The planted random-bond Ising model

To make things more concrete, let's consider a particular case of a model considered in the review. We consider $L^2$ people standing in a room in a square lattice formation. To each person, we randomly hand out a card that can be $S_i = \pm 1$ with equal probability. However, we don't record this information. Then we ask each neighbouring pair to reply with $+1$ if they have the same cards or $-1$ if they have different cards. But there's a twist: for each question, the pair may lie with probability $\pi$. We record the answer that each pair gives us as $J_{ij}$.

Then, given $\{J_{ij}\}_{\langle ij \rangle}$, can we infer the 'state', i.e. the cards given to each person? As we have an equal-probability prior over the cards, we have $p(S) \propto 1$. The 'likelihood' of answer $J_{ij}$ given cards $S_i$ and $S_j$ is
$$p(J_{ij}|S_i, S_j) = \begin{cases} 1 - \pi & J_{ij} = S_i S_j, \\ \pi & J_{ij} = -S_i S_j. \end{cases}$$
As $J_{ij}$ is an Ising variable, we can massage this to look like a Boltzmann weight. If we write $p(J_{ij}|S_i, S_j) = \mathcal{N} e^{\beta J_{ij} S_i S_j}$, then from normalisation over $J_{ij} \in \{-1, +1\}$ we have $\mathcal{N} = (e^{\beta} + e^{-\beta})^{-1}$, and $\pi = \frac{e^{-\beta}}{e^{\beta} + e^{-\beta}}$. Therefore, if each pair lies half of the time, $\pi = 1/2$, then $\beta = 0$, and if they never lie, $\pi = 0$, then $\beta \to \infty$.
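A sketch of the teacher's data-generating process (my own code, with illustrative parameter choices and periodic boundaries), together with the $\pi \leftrightarrow \beta$ conversion:

```python
import numpy as np

rng = np.random.default_rng(1)

L = 32                                        # linear size of the square lattice
pi_T = 0.1                                    # probability that a pair lies
beta_T = 0.5 * np.log((1 - pi_T) / pi_T)      # inverts pi = e^{-beta} / (e^{beta} + e^{-beta})
print("beta_T =", beta_T)

# Teacher's cards (the 'planted' configuration), +/-1 with equal probability.
S_star = rng.choice([-1, 1], size=(L, L))

# Honest answers J_ij = S_i S_j on horizontal and vertical bonds,
# each flipped (a lie) independently with probability pi_T.
J_h = S_star * np.roll(S_star, -1, axis=1)    # bond to the right neighbour
J_v = S_star * np.roll(S_star, -1, axis=0)    # bond to the bottom neighbour
J_h *= np.where(rng.random((L, L)) < pi_T, -1, 1)
J_v *= np.where(rng.random((L, L)) < pi_T, -1, 1)
```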

Then the likelihood over all answers is $p(J|S) = \prod_{\langle i,j \rangle} p(J_{ij}|S_i, S_j) \propto \exp\big(\beta \sum_{\langle i,j \rangle} J_{ij} S_i S_j\big)$. Since the prior is uniform, the posterior is $p(S|J) = \frac{q(J|S)}{Z(J)} = \frac{\exp(\beta \sum_{\langle i,j \rangle} J_{ij} S_i S_j)}{Z(J)}$, where $Z(J) = \sum_S \exp\big(\beta \sum_{\langle i,j \rangle} J_{ij} S_i S_j\big)$ is the partition function for a random-bond Ising model, and the posterior is its Boltzmann probability!

Let's now consider the general teacher-student scenario, where the student's assumed rate of lying $\beta_S(\pi_S)$ can differ from the true/teacher's $\beta_T(\pi_T)$:
$$p(S, J, S^*) = p_S(S|J)\, p_T(J|S^*)\, p_T(S^*) \propto \frac{\exp\big(\beta_S \sum_{\langle ij \rangle} J_{ij} S_i S_j\big)}{Z_S(J)} \exp\Big(\beta_T \sum_{\langle ij \rangle} J_{ij} S_i^* S_j^*\Big).$$
Then consider the observable $S_i S_i^*$. This is $+1$ if they are aligned (the inferred card is the same as the true card) and $-1$ if they are anti-aligned, so it is a measure of whether inference succeeds:
$$\mathbb{E}[S_l S_l^*] \propto \sum_{S, S^*} \sum_J S_l S_l^* \frac{\exp\big(\beta_S \sum_{\langle ij \rangle} J_{ij} S_i S_j\big) \exp\big(\beta_T \sum_{\langle ij \rangle} J_{ij} S_i^* S_j^*\big)}{Z_S(J)}.$$
Since we are summing over $J$, we are free to redefine $J$ such that $J_{ij} S_i^* S_j^* = \tilde{J}_{ij}$. Similarly, we can redefine $S$ such that $S_i S_i^* = \tilde{S}_i$. Inside the partition function we can also redefine the summed-over $S$. Therefore
$$\mathbb{E}[S_l S_l^*] \propto \sum_{S^*} \sum_{\tilde{S}, \tilde{J}} \tilde{S}_l \frac{\exp\big(\beta_S \sum_{\langle ij \rangle} \tilde{J}_{ij} \tilde{S}_i \tilde{S}_j\big) \exp\big(\beta_T \sum_{\langle ij \rangle} \tilde{J}_{ij}\big)}{\sum_{\tilde{S}'} \exp\big(\beta_S \sum_{\langle ij \rangle} \tilde{J}_{ij} \tilde{S}'_i \tilde{S}'_j\big)}.$$
So now $S^*$ has disappeared, and the true distribution is the 'ferromagnetic configuration'. So
$$\mathbb{E}[S_l S_l^*] \propto \sum_{\tilde{J}} \sum_{\tilde{S}} \tilde{S}_l \frac{\exp\big(\beta_S \sum_{\langle ij \rangle} \tilde{J}_{ij} \tilde{S}_i \tilde{S}_j\big) \exp\big(\beta_T \sum_{\langle ij \rangle} \tilde{J}_{ij}\big)}{\sum_{\tilde{S}'} \exp\big(\beta_S \sum_{\langle ij \rangle} \tilde{J}_{ij} \tilde{S}'_i \tilde{S}'_j\big)},$$
which is the magnetisation of a random-bond Ising model with bond distribution $p(\tilde{J}) \propto \exp\big(\beta_T \sum_{\langle ij \rangle} \tilde{J}_{ij}\big)$, i.e. each bond is independently $+1$ with probability $1 - \pi_T$. So the paramagnetic (PM) phase corresponds to the failure of inference, and the ferromagnetic (FM) phase corresponds to the success of inference.
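As a rough illustration of the gauge-transformed picture (my own sketch, a standard Metropolis Monte Carlo with made-up parameters, not the calculation from the talk), one can draw bonds $\tilde{J}_{ij}$ that are $+1$ with probability $1 - \pi_T$, simulate the RBIM at inverse temperature $\beta_S$, and use the magnitude of the magnetisation as a finite-size proxy for the overlap $\mathbb{E}[S_l S_l^*]$:

```python
import numpy as np

rng = np.random.default_rng(2)

L, pi_T, beta_S = 16, 0.05, 0.6
n_sweeps, n_burn = 2000, 500

# Gauge-transformed disorder: each bond is +1 with probability 1 - pi_T, else -1.
J_h = np.where(rng.random((L, L)) < pi_T, -1, 1)   # bond (i,j)-(i,j+1)
J_v = np.where(rng.random((L, L)) < pi_T, -1, 1)   # bond (i,j)-(i+1,j)

S = rng.choice([-1, 1], size=(L, L))
mags = []
for sweep in range(n_sweeps):
    for _ in range(L * L):                         # one Metropolis sweep
        i, j = rng.integers(L), rng.integers(L)
        # Local field from the four neighbours (periodic boundaries).
        h = (J_h[i, j] * S[i, (j + 1) % L]
             + J_h[i, (j - 1) % L] * S[i, (j - 1) % L]
             + J_v[i, j] * S[(i + 1) % L, j]
             + J_v[(i - 1) % L, j] * S[(i - 1) % L, j])
        dE = 2 * S[i, j] * h                       # energy change of flipping S_ij
        if dE <= 0 or rng.random() < np.exp(-beta_S * dE):
            S[i, j] *= -1
    if sweep >= n_burn:
        # |m| as a finite-size proxy for the FM order, i.e. the overlap with S*.
        mags.append(abs(S.mean()))

print("overlap proxy:", np.mean(mags))
```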

By the way, in the spin-glass literature, what we've done here is called a 'gauge transformation'. And there is a special 'solvable line' in this model, called the Nishimori line, where $p(J) \propto Z(J)$. The basic idea was that $p(J)$ just becomes another replica in the problem and so simplifies the calculation, but it turns out that the point where the Nishimori line crosses the phase boundary is very interesting. This exactly corresponds to the Bayes-optimality condition, $\beta_S = \beta_T$. Usually, people show that for particular choices of $\pi$, we can gauge-transform $p(J)$ to look like $Z(J)$; here, we did it the other way around.

People have already studied the phase diagram of the RBIM, so we have it for free (for example, see Gruzberg 0007254v2).

The $p = 0$, $T_c \approx 2.27$ point corresponds to the clean 2D Ising transition. For us, $T$ corresponds to $\beta_S^{-1}$ and $p = \pi_T$. Thinking of $\beta_{S,T}$ as the 'signal strength', we can redraw the phase diagram in the $\beta_T$–$\beta_S$ plane. We know that the Nishimori line is the $45^\circ$ line in this plane.

We see that in this picture, no matter what $\beta_S$ the student chooses, we need sufficient teacher signal strength, $\beta_T > \beta_T^c$, in order for inference to be possible. On the other hand, if the student assumes that the signal strength is too low, $\beta_S < \beta_S^c$, then again inference is not possible.

The planted directed polymer problem

So now the world is our oyster. We can start cooking up planted versions of spin-glass/disordered stat-mech models and see what inference problems they map onto. We were particularly interested in a class of inference problems involving hidden Markov models. This is where the prior follows a Markov process, $p(x_{1:t}) = \big(\prod_{\tau=2}^{t} p(x_\tau|x_{\tau-1})\big)\, p(x_1)$, and measurements $y_t$ conditioned on the state are generated through some measurement model $p(y_t|x_t)$ at every timestep. As before, we could calculate a posterior for the entire trajectory, $p(x_{1:t}|y_{1:t})$, but this is usually computationally intractable. Instead, we can also consider a filtering task, where the state at the current timestep is inferred from the data up to that timestep, i.e. $p(x_t|y_{1:t})$; a minimal sketch of this recursion is given below.
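For a discrete state space, the filtering posterior can be computed with the standard forward recursion, $p(x_t|y_{1:t}) \propto p(y_t|x_t) \sum_{x_{t-1}} p(x_t|x_{t-1})\, p(x_{t-1}|y_{1:t-1})$. A minimal sketch (my own code; the array names and shapes are just illustrative conventions):

```python
import numpy as np

def forward_filter(trans, emission_probs, prior):
    """Compute the filtering marginals p(x_t | y_{1:t}) for each t.

    trans:          (K, K) transition matrix, trans[a, b] = p(x_t = b | x_{t-1} = a)
    emission_probs: (n_steps, K) array, emission_probs[t, b] = p(y_t | x_t = b)
    prior:          (K,) initial distribution p(x_1)
    """
    belief = prior * emission_probs[0]
    belief /= belief.sum()
    filt = [belief]
    for t in range(1, len(emission_probs)):
        # Predict with the Markov kernel, then reweight by the likelihood of y_t.
        belief = (belief @ trans) * emission_probs[t]
        belief /= belief.sum()
        filt.append(belief)
    return np.array(filt)
```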

Concretely, we consider the case where the states are locations in space and the Markov process is a random walk on that space. A simple example of a 'kernel' for this process is $p(x_{t+1}|x_t) = \frac{1}{2}\delta_{x_{t+1}, x_t} + \frac{1}{4}\delta_{x_{t+1}, x_t \pm 1}$. An example of such a random walk in 1D is shown below.

At every timestep $t$, we will take an 'image' of the walker, specified by pixel values at positions $x$, $\phi_{x,t} = \epsilon_T \delta_{x, x_t^*} + \psi_{x,t}$, which has a peak at the true location of the walker with 'signal strength' $\epsilon_T$, and $\psi_{x,t}$ is iid random Gaussian noise, $\psi_{x,t} \sim \mathcal{N}(0, \sigma_T^2)$. An example of the set of images from each timestep for the above random walk is shown below.
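A sketch of this data-generating (teacher) process on a ring of $L$ sites, with the lazy-random-walk kernel above and Gaussian pixel noise (my own code; the parameter values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)

L, T = 64, 100             # number of sites, number of timesteps
eps_T, sigma_T = 1.0, 1.0  # teacher's signal and noise strengths

# Lazy random walk: stay with probability 1/2, hop left/right with probability 1/4 each.
x_star = np.zeros(T, dtype=int)
x_star[0] = rng.integers(L)
for t in range(1, T):
    step = rng.choice([-1, 0, 0, 1])
    x_star[t] = (x_star[t - 1] + step) % L

# Images: a peak of height eps_T at the walker's true position plus iid Gaussian noise.
phi = sigma_T * rng.standard_normal((T, L))
phi[np.arange(T), x_star] += eps_T
```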

Then, we'd like to infer the true position of the walker at the current time, $x_t^*$, or the true trajectory $X^* := (x_\tau^*)_{\tau=1}^{T}$, given the images at all times. We call this the 'planted directed polymer problem', as it is related to the directed polymer in a random medium from stat-mech. Let's see how.

Let the whole image at time $t$ be $\phi_t := (\phi_{x,t})_x$. Then, assuming signal strength $\epsilon_S$ and noise strength $\sigma_S$, the likelihood for observing an image $\phi_t$ given that the walker's position is $x_t$ is
$$p_S(\phi_t|x_t) = \prod_{x} \frac{1}{\sqrt{2\pi\sigma_S^2}} \exp\left[-\frac{(\phi_{x,t} - \epsilon_S \delta_{x, x_t})^2}{2\sigma_S^2}\right] = \exp\big([\epsilon_S \phi_{x_t, t} - \epsilon_S^2/2]/\sigma_S^2\big)\, \pi_S(\phi_t),$$
where $\pi_S(\phi_t)$ denotes the Gaussian measure. Denoting $X := x_{1:t}$ and $\Phi := \phi_{1:t}$, the likelihood for the whole trajectory is $p_S(\Phi|X) \propto \exp\big(\frac{\epsilon_S}{\sigma_S^2} \sum_t \phi_{x_t, t}\big)\, \pi_S(\Phi)$. As the $\pi_S(\Phi)$ term is there for any value of $X$, we can ignore it, since we can just normalise the posterior again. So the student's posterior for the whole path is $p_S(X|\Phi) = \frac{q_S(X|\Phi)}{Z_S(\Phi)}$, where the unnormalised posterior is
$$q_S(X|\Phi) = \exp\left(\frac{\epsilon_S}{\sigma_S^2} \sum_{t=1}^{T} \phi_{x_t, t} + \sum_{t=2}^{T} \ln q_S(x_t|x_{t-1})\right) q_S(x_1),$$
and $Z_S(\Phi) = \sum_X q_S(X|\Phi)$.
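Putting the two sketches above together, the filtering marginal $p_S(x_t|\phi_{1:t})$ for this model is just the forward recursion with the kernel $q_S(x_t|x_{t-1})$ and site weights $\exp(\epsilon_S \phi_{x,t}/\sigma_S^2)$; equivalently, it is a (normalised) transfer-matrix evolution of the unnormalised posterior. A rough sketch, again my own code under the same assumptions (ring geometry, lazy-walk kernel):

```python
import numpy as np

def planted_polymer_filter(phi, eps_S, sigma_S):
    """Filtering marginals p_S(x_t | phi_{1:t}) on a ring, with the lazy-walk kernel."""
    T, L = phi.shape
    # Student's kernel: stay with probability 1/2, hop to either neighbour with 1/4.
    K = 0.5 * np.eye(L) + 0.25 * (np.roll(np.eye(L), 1, axis=1)
                                  + np.roll(np.eye(L), -1, axis=1))
    belief = np.ones(L) / L                            # uniform prior over x_1
    marginals = []
    for t in range(T):
        weight = np.exp(eps_S * phi[t] / sigma_S**2)   # Boltzmann weight of the 'environment'
        belief = belief * weight if t == 0 else (belief @ K) * weight
        belief /= belief.sum()                         # normalise (ratio of partition functions)
        marginals.append(belief)
    return np.array(marginals)

# Example usage with the images phi generated in the earlier sketch:
# marg = planted_polymer_filter(phi, eps_S=1.0, sigma_S=1.0)
# x_hat = marg.argmax(axis=1)   # most probable filtered position at each time
```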

Let us assume that the kernel $q_S(x_t|x_{t-1})$ only depends on the distance between $x_t$ and $x_{t-1}$, and is maximised when the distance is zero. Then we write $\ln q_S(x_t|x_{t-1}) = f_S\big(a(x_t - x_{t-1})\big)$, where $a$ is the lattice constant. Expanding, we find
$$q_S(X|\Phi) = \exp\left(\sum_{\tau=1}^{t}\left[-\frac{a}{4\nu_S}\big(x_\tau - x_{\tau-1}\big)^2 + \frac{\epsilon_S}{\sigma_S^2}\phi_{x_\tau, \tau}\right] + O(a^{3/2})\right) q_S(x_1),$$
where $\nu_S = \frac{1}{2|f_S''(0)|}$ is the 'width' of the kernel. We see that the unnormalised posterior now looks like the Boltzmann weight for the directed polymer in a random environment. For those who are unfamiliar, it is a well-studied stat-mech model for configurations of a polymer. Interpreting time as another spatial dimension, $t = y$, it is 'directed' in that the polymer cannot loop around itself, since $x$ is a single-valued function of $y$.

The first term in the sum, $(x_\tau - x_{\tau-1})^2$, corresponds to the 'elastic energy' of the polymer, where you pay a price whenever you stretch the polymer out. The second term, $\phi_{x_t, t}$, which is usually iid random noise in the directed polymer setting, is a 'random environment' that the polymer is placed in.

The interesting phenomenon in the directed polymer is that even though you need to pay energy to stretch out the polymer, it might be worth it to pass through a particular location, as long as the potential there is deep enough. This creates a 'roughening' of the polymer (you can read more about it in Bhattacharjee 0402117v1, for example).

For us, $\phi_{x_t, t}$ is not iid, but instead has information about the true location of the walker 'planted' in it. In recent work, we studied this problem in detail in 1D and also on the tree. Read more here: P. Kim 2404.07263.