Graphical inference models, Lecture 1

On Wednesday, Professor Andrea Montanari talked about the computational reducibility of four problems on graphical probability models to one another. Since I missed the first lecture as well as the beginning of Wednesday’s, I will rely partially on his lecture notes online. First we introduce the notion of graphical models, which are probability distributions on {X^V} that reflect the graphical structure on the vertex set {V}. Here {X} is thought of as the range of symbols that a function on {V} can take at each point.

Examples are in order. A Bayesian network {\mu} on a directed graph {G=(V,E)} is specified as follows: for each vertex {v} we are given the conditional probability {p_v(x_v | x_{\pi(v)})}, where {\pi(v)} is the set of parents of {v}. If {v} has no parents, we instead specify only the prior {p_v(x_v)}. So {V} is divided into two sets: {\pi(G)}, consisting of the Adams and Eves (the parentless root vertices), and its complement. A Markov chain is the simplest kind of Bayesian network, in which the graph is a line graph with each edge going from left to right. Applying Bayes' theorem repeatedly, one deduces that

\displaystyle  \mu(x) = \prod_{v \in \pi(G)} p_v(x_v) \prod_{v \in V \setminus \pi(G)} p_v(x_v | x_{\pi(v)}). \ \ \ \ \ (3)
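
To make the factorization (3) concrete, here is a minimal Python sketch (my own toy example, not from the lecture) that evaluates {\mu} on a three-node chain {A \rightarrow B \rightarrow C}; all probability tables below are invented for illustration.

```python
# Minimal sketch: evaluating the Bayesian-network factorization
# mu(x) = prod_{roots} p_v(x_v) * prod_{others} p_v(x_v | x_{parents})
# on a toy binary chain A -> B -> C.  All tables are made up.

p_A = {0: 0.6, 1: 0.4}                       # root: prior p_A(a)
p_B_given_A = {0: {0: 0.7, 1: 0.3},           # p_B(b | a)
               1: {0: 0.2, 1: 0.8}}
p_C_given_B = {0: {0: 0.9, 1: 0.1},           # p_C(c | b)
               1: {0: 0.5, 1: 0.5}}

def mu(a, b, c):
    """Joint probability of the configuration (a, b, c)."""
    return p_A[a] * p_B_given_A[a][b] * p_C_given_B[b][c]

# sanity check: the factorization defines a probability distribution
total = sum(mu(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1))
print(mu(0, 1, 1), total)                     # total should be 1.0
```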

There are three other classes of graphical models. Pairwise graphical models are defined on simple graphs, with each edge {(u,v)} contributing a factor {\psi_{uv}(x_u, x_v)} to the product defining the measure, so that {\mu(x) = Z^{-1} \prod_{(u,v) \in E} \psi_{uv}(x_u, x_v)}. A factor graph is a bipartite graph with variable nodes {V} and function nodes {F}. The probability measure associated with a factor graph model is of the form

\displaystyle  \mu(x) = Z^{-1}\prod_{v \in F} \psi_v(x_{\partial v}) \ \ \ \ \ (4)

so the factor nodes index the “factors” in the probability measure.
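
As a sanity check on (4), here is a small brute-force sketch (my own, with invented potentials) that computes the partition function {Z} and the normalized measure {\mu} for a tiny binary factor graph.

```python
# Minimal brute-force sketch: evaluating the factor-graph measure
# mu(x) = Z^{-1} prod_a psi_a(x_{scope(a)}) and its partition function Z
# on a tiny binary example.  All potentials below are invented.
from itertools import product

n = 3                                                      # variables x_0, x_1, x_2 in {0, 1}
factors = [((0, 1), lambda x0, x1: 1.0 if x0 == x1 else 0.5),
           ((1, 2), lambda x1, x2: 2.0 if x1 != x2 else 1.0),
           ((0,),   lambda x0: 1.5 if x0 == 1 else 1.0)]

def weight(x):
    """Product of all factor values at configuration x (the unnormalized measure)."""
    w = 1.0
    for scope, psi in factors:
        w *= psi(*(x[i] for i in scope))
    return w

Z = sum(weight(x) for x in product((0, 1), repeat=n))      # partition function
mu = {x: weight(x) / Z for x in product((0, 1), repeat=n)}
print(Z, mu[(0, 1, 0)], sum(mu.values()))                  # last value should be 1.0
```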

Andrea showed in the first lecture (which I missed) that the above three models are reducible to one another. The only nontrivial direction is the reduction of the factor graph model to the pairwise graph model, i.e., showing that every factor graph model can be represented as a pairwise model, because it involves augmenting the symbol set {X}: each factor node becomes an auxiliary variable ranging over the joint configurations of its neighbors, tied to the original variables by pairwise consistency potentials.
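
Here is a hedged sketch of that reduction on a toy example (my own construction, following the standard trick; the potentials are invented). Each factor node becomes an auxiliary variable over the joint states of its scope, consistency is enforced by 0/1 pairwise potentials, and summing out the auxiliary variables recovers the factor-graph weights.

```python
# Minimal sketch of the factor-graph-to-pairwise reduction.  Every factor node a
# becomes an auxiliary variable y_a ranging over the joint states of its scope;
# 0/1 "consistency" potentials tie y_a to the original variables, and psi_a is
# folded into one of those edges so only pairwise terms appear.  Both toy
# factors below have arity 2; all potentials are invented.
from itertools import product

X = (0, 1)                                           # symbol set
n = 3                                                # variables x_0, x_1, x_2
factors = [((0, 1), lambda a, b: 2.0 if a == b else 1.0),
           ((1, 2), lambda a, b: 3.0 if a != b else 1.0)]

def weight_fg(x):
    """Unnormalized factor-graph weight of configuration x."""
    w = 1.0
    for scope, psi in factors:
        w *= psi(*(x[i] for i in scope))
    return w

def weight_pairwise(x, y):
    """Pairwise-model weight on the augmented variable set (x, y)."""
    w = 1.0
    for a, (scope, psi) in enumerate(factors):
        for k, i in enumerate(scope):
            edge = 1.0 if y[a][k] == x[i] else 0.0   # consistency potential
            if k == 0:
                edge *= psi(*y[a])                   # charge psi_a once, on one edge
            w *= edge
    return w

# summing out the auxiliary variables recovers the factor-graph weights exactly
for x in product(X, repeat=n):
    total = sum(weight_pairwise(x, y)
                for y in product(product(X, repeat=2), repeat=len(factors)))
    assert abs(total - weight_fg(x)) < 1e-12
print("reduction verified on all", len(X) ** n, "configurations")
```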

One last model is the so-called Markov random field, in which each clique of the graph contributes a factor to the probability measure, i.e.,

\displaystyle  \mu(x) = Z^{-1} \prod_{C \text{ a clique }} \psi_C(x_C). \ \ \ \ \ (5)

This is clearly equivalent to the factor graph model: each clique factor {\psi_C} can be taken as a function node, and conversely a factor graph model is a Markov random field on the graph in which all variables sharing a factor are joined, so that each factor's scope becomes a clique.

All four models exhibit the domain Markov property: if one conditions on the values on a set of vertices {S \subset V} that separates two other subsets {W, Y \subset V}, then {x_W} and {x_Y} are conditionally independent.
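
A quick numerical illustration of this (my own toy chain, invented potentials): conditioning on the middle vertex of a three-vertex chain renders the endpoints independent.

```python
# Minimal sketch: checking the Markov property numerically.  In the chain
# x0 - x1 - x2 with invented edge potentials, conditioning on the separator
# {x1} makes x0 and x2 independent.
from itertools import product

X = (0, 1)
psi01 = lambda a, b: 2.0 if a == b else 1.0
psi12 = lambda b, c: 3.0 if b != c else 1.0

def mu(a, b, c):
    """Normalized pairwise-model probability of (a, b, c)."""
    Z = sum(psi01(i, j) * psi12(j, k) for i, j, k in product(X, repeat=3))
    return psi01(a, b) * psi12(b, c) / Z

for b in X:                                            # condition on x1 = b
    pb = sum(mu(i, b, k) for i, k in product(X, repeat=2))
    for a, c in product(X, repeat=2):
        joint = mu(a, b, c) / pb                       # P(x0=a, x2=c | x1=b)
        m0 = sum(mu(a, b, k) for k in X) / pb          # P(x0=a | x1=b)
        m2 = sum(mu(i, b, c) for i in X) / pb          # P(x2=c | x1=b)
        assert abs(joint - m0 * m2) < 1e-12
print("conditional independence verified")
```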

Here are the four main computational problems on graphical models that people are interested in:

  1. Computing small marginal probabilities, e.g. {\mu(x_v)} for a single vertex {v}.
  2. Computing conditional probabilities of the form {\mu(x_S | x_T)} where {S, T \subset V}.
  3. Sampling from the whole distribution {\mu}.
  4. Computing the partition function {Z}, i.e., the normalizing constant {\sum_{x \in X^V} \prod_{v \in F} \psi_v(x_{\partial v})} in (4).

It is not hard to show that all four are equivalent. Note that item 1 above only asks for small marginal probabilities, which might seem a far cry from the whole-distribution sampling task in item 3. But we are talking about computational reducibility up to a factor linear in {|V|}, so we are allowed to shrink the graph and carry out the reduction by induction: sample one vertex from its (small) marginal, condition on the sampled value to obtain a graphical model on one fewer vertex, and repeat, as in the sketch below.
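
Here is a brute-force sketch of that induction (my own toy example; the unnormalized weight function is invented and nothing here is efficient): the whole configuration is sampled one coordinate at a time, using only marginal computations.

```python
# Minimal sketch: reducing "sample from mu" to repeated small-marginal
# computations.  Sample x_1 from its marginal, condition on it (a model on a
# smaller vertex set), and recurse.  The weight function is invented.
import random
from itertools import product

X = (0, 1)
n = 4

def weight(x):
    """An invented unnormalized distribution; any positive function works here."""
    return 1.0 + sum(xi * (i + 1) for i, xi in enumerate(x))

def marginal_of_next(prefix):
    """P(x_k = s | x_1..x_{k-1} = prefix), computed by summing out the rest."""
    k = len(prefix)
    scores = []
    for s in X:
        tot = sum(weight(prefix + (s,) + rest)
                  for rest in product(X, repeat=n - k - 1))
        scores.append(tot)
    Z = sum(scores)
    return [p / Z for p in scores]

def sample():
    x = ()
    for _ in range(n):
        probs = marginal_of_next(x)
        x += (random.choices(X, weights=probs)[0],)
    return x

print(sample())
```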

Someone brought up another interesting problem at the end, namely

5. Computing the mode of the distribution.

While one could in principle compute the mode by sampling with an inverse temperature parameter sent to infinity, in the same spirit as the Laplace principle, the converse reduction is far from true: take the uniform measure on a set of combinatorial objects; then every element is a mode, but sampling essentially amounts to counting the objects, which can be #P-hard.
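
A tiny numerical illustration of the tempering direction (my own invented weights): raising the unnormalized weights to an inverse temperature {\beta} and letting {\beta} grow concentrates the tempered measure on the mode.

```python
# Minimal sketch of the zero-temperature trick: the tempered distribution
# proportional to weight(x)**beta puts almost all its mass on the mode as
# beta grows.  Weights are invented toy numbers.
from itertools import product

X = (0, 1)
n = 3

def weight(x):
    return 1.0 + 2.0 * x[0] + 1.5 * x[1] * x[2]

configs = list(product(X, repeat=n))
for beta in (1, 5, 50):
    w = [weight(x) ** beta for x in configs]
    Z = sum(w)
    best = max(zip(configs, w), key=lambda t: t[1])[0]
    print(beta, best, max(w) / Z)        # mass of the mode -> 1 as beta grows
```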

Finally Andrea discussed a paper applying graphical models to a sonar system. The idea is that we have a two-dimensional function {f}, which describes, say, the altitude of the sea floor over a square region, and we want to use sonar to recover it. But sonar only detects {f(x)} at each {x} up to an integer multiple of the wavelength {\lambda}. So essentially we are given the data {f \mod \lambda}, which can be viewed as a section of the torus bundle over {[0,1]^2}. Furthermore the detection only occurs at discrete spatial points, say {W^2 = (1/n [n])^2}, so we really have a torus bundle over {W^2}. Instead of asking for {f} itself, which is impossible since sonar cannot in principle determine the absolute altitude, one asks only for the relative altitude variation over the square. Thus it makes sense to try to recover the functions {a_1, a_2: W^2 \rightarrow {\mathbb Z}} giving the discrete partial derivatives of the overall (integer) phase of {f}; these most frequently take values in the set {\{0, \pm 1\}}. One certainly needs the mixed partials to match, hence {a_1, a_2} must satisfy the loop condition {a_1(x,y) + a_2(x+1,y) - a_1(x,y+1) - a_2(x,y) = 0}.

The idea is then to impose a Gaussian-free-field-type distribution on the set of discrete two-dimensional {{\mathbb Z}^2}-valued functions {(a_1,a_2)} (after all, Gaussian fluctuations are the most universal in nature), subject to the loop constraint. For each observation {f \mod \lambda} on {W^2}, one can then compute the posterior distribution of {(a_1,a_2)} under this model.
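
To illustrate the loop condition concretely, here is a small sketch (my own toy data, using numpy; this is not the paper's algorithm): if {a_1, a_2} really are the discrete partial derivatives of an integer field, the signed sum around every unit square vanishes.

```python
# Minimal sketch of the loop (zero-curl) condition.  Here k plays the role of
# the integer part of f / lambda, and a_1, a_2 are its discrete partial
# derivatives, so the mixed differences around every unit square must cancel.
import numpy as np

rng = np.random.default_rng(0)
n = 6
k = rng.integers(-2, 3, size=(n, n))           # toy integer phase field

a1 = k[1:, :] - k[:-1, :]                      # discrete derivative in x
a2 = k[:, 1:] - k[:, :-1]                      # discrete derivative in y

# loop condition a1(x,y) + a2(x+1,y) - a1(x,y+1) - a2(x,y) = 0 on each square
curl = a1[:, :-1] + a2[1:, :] - a1[:, 1:] - a2[:-1, :]
print(np.abs(curl).max())                      # 0: any gradient field is curl-free
```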
