
Learning overhypotheses [3* P]

Figure: A hierarchical Bayesian model. Each setting of $ (\alpha, {\bf\beta})$ is an overhypothesis: $ {\bf\beta}$ represents the color distribution across all categories, and $ \alpha$ represents the variability in color within each category.
[Image: marbles]

Apply Gibbs sampling to acquire overhypotheses about the feature variability for the bags of marbles model illustrated in Fig. 8: Suppose that $ S$ is a stack containing many bags of marbles. We empty several bags and discover that the marbles within the same bag have certain features in common: for instance, some bags may contain black marbles and others may contain white marbles, but the marbles in each bag are uniform in color. Given a new bag, bag $ n$ , and a single marble (e.g. a black marble) drawn from this bag, we are interested in the probability distribution over the colors of all other marbles within this bag. On its own, a single draw would provide little information about the contents of the new bag, but experience with previous bags may lead us to endorse certain hypotheses (e.g. that all marbles in a bag are uniform in color).

Learning overhypotheses: The term overhypothesis refers to any form of abstract knowledge that sets up a hypothesis space at a less abstract level. By this criterion, an overhypothesis sets up a space of hypotheses about the marbles in bag $ n$ : they could be uniformly black, uniformly white, and so on. Hierarchical Bayesian models capture the notion of an overhypothesis by allowing hypothesis spaces at several levels of abstraction. In this example we wish to explain how a certain kind of inference can be drawn from a given body of data: here, the data are observations of several bags, and we work with a set of two colors.

Bags of marbles model: Let $ {\bf y}^i$ denote a set of observations of the marbles in bag $ i$ . If we have drawn five marbles from bag 7 and all but one are black, then $ {\bf y}^7=[4,1]$ . We are interested in the ability to predict the color of the next marble to be drawn from bag $ n$ . The first step is to identify a kind of knowledge (level 1 knowledge) that explains the data and supports the ability of interest. In this case, level 1 knowledge is knowledge about the color distribution of each bag. Let $ {\bf\theta}^i$ denote the true color distribution for the $ i$ th bag in the stack. We assume that $ {\bf y}^i$ is drawn from a binomial distribution with parameter $ {\bf\theta}^i$ : in other words, the marbles responsible for the observations in $ {\bf y}^i$ are drawn independently at random from the $ i$ th bag, and the color of each depends on the color distribution $ {\bf\theta}^i$ for that bag. If 60% of the marbles in bag 7 are black, then $ {\bf\theta}^7 = [0.6, 0.4]$ .
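
As a quick sanity check of this likelihood, the probability of the observation $ {\bf y}^7=[4,1]$ under $ {\bf\theta}^7 = [0.6, 0.4]$ can be evaluated directly. This is a minimal sketch; it assumes MATLAB's Statistics Toolbox:

    % Probability of drawing 4 black marbles out of 5 from bag 7,
    % given a true black-marble proportion of 0.6.
    theta7 = 0.6;                  % P(black) for bag 7
    p = binopdf(4, 5, theta7)      % P(y^7 = [4,1] | theta^7), approx. 0.2592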

For the marbles scenario, level 2 knowledge is knowledge about the distribution of the $ {\bf\theta}$ variables. This knowledge is represented by two parameters, $ \alpha$ and $ {\bf\beta}$ . The vectors $ {\bf\theta}^i$ are drawn from a Beta distribution parameterized by a scalar $ \alpha$ and a vector $ {\bf\beta}=(\beta_1,\beta_2)$ with $ \beta_1 +\beta_2=1$ . The parameter $ \alpha$ determines the extent to which the colors in each bag tend to be uniform, and $ {\bf\beta}$ represents the distribution of colors across the entire collection of bags. We also need to formalize our a priori expectations about the values of these variables.
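
To build intuition for the role of $ \alpha$ , the following minimal sketch (variable names are illustrative) draws bag-level color distributions for a small and a large $ \alpha$ at fixed $ \beta_1 = 0.5$ ; a small $ \alpha$ produces near-uniform bags ($ \theta_1$ close to 0 or 1), while a large $ \alpha$ produces mixed bags ($ \theta_1$ close to $ \beta_1$ ):

    % Effect of alpha on within-bag color uniformity (beta = [0.5, 0.5])
    beta1 = 0.5;
    theta_small = betarnd(0.1*beta1, 0.1*(1-beta1), 1, 5)   % alpha = 0.1: values near 0 or 1
    theta_large = betarnd(100*beta1, 100*(1-beta1), 1, 5)   % alpha = 100: values near 0.5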

Level 2 knowledge is acquired by relying on a body of knowledge at an even higher level, level 3. We use a uniform distribution on $ \beta_1$ and an exponential distribution on $ \alpha$ , which captures a weak prior expectation that the marbles in any bag will tend to be uniform in color. The exponential distribution has rate parameter $ \lambda=1$ (and hence mean 1), i.e. $ P(\alpha) = \exp(-\alpha)$ . The parameter $ \lambda$ and the pair $ (\alpha, {\bf\beta})$ are both overhypotheses, since each sets up a hypothesis space at the next level down. Since the level 3 knowledge ($ \lambda$ ) is specified in advance, you should analyze how an overhypothesis can be learned at level 2.

The joint probability distribution for this model is therefore given by

$\displaystyle P({\bf y}^1,\ldots,{\bf y}^n,{\bf\theta}^1,\ldots,{\bf\theta}^n,\alpha,{\bf\beta}) = \left[\prod_{i=1}^{n} P({\bf y}^i\vert{\bf\theta}^i)\, P({\bf\theta}^i\vert\alpha,{\bf\beta})\right] P(\alpha\vert\lambda)\, P({\bf\beta})$     (1)

with
$\displaystyle \alpha \sim \mathrm{Exponential}(\lambda)$     (2)
$\displaystyle \beta_1 \sim \mathrm{Beta}(1,1)$     (3)
$\displaystyle {\bf\theta}^i \sim \mathrm{Beta}(\alpha\beta_1, \alpha\beta_2)$     (4)
$\displaystyle {\bf y}^i\vert n^i \sim \mathrm{Binomial}(n^i, {\bf\theta}^i)$     (5)

where $ n^i$ is the number of observations for bag $ i$ .
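
The generative process in Eqs. (2)-(5) can also be simulated forward, which is useful for testing the sampler on synthetic data. The following is a sketch only; the number of bags and draws per bag are illustrative choices, not part of the assignment:

    % Forward sampling of the hierarchical model, Eqs. (2)-(5)
    lambda = 1;                 % level 3 knowledge (fixed)
    n      = 20;                % number of bags (illustrative)
    n_i    = 10;                % marbles drawn per bag (illustrative)
    alpha  = exprnd(1/lambda);  % Eq. (2); note exprnd takes the mean 1/lambda
    beta1  = betarnd(1, 1);     % Eq. (3): uniform prior on beta_1
    theta  = betarnd(alpha*beta1, alpha*(1-beta1), n, 1);   % Eq. (4)
    y      = binornd(n_i, theta);   % Eq. (5): black-marble counts per bag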

Task: Perform Gibbs sampling for this joint distribution to estimate the marginal distributions of each individual $ {\bf\theta}^i$ , of $ \alpha$ , and of $ {\bf\beta}$ (the overhypotheses), while the observations $ {\bf y}^i$ are kept fixed, for the following scenarios:
  1. After observing 10 all-white bags, 10 all-black bags, and a single black marble in the last bag.
  2. After observing 20 mixed bags, where half of the marbles are white and half of the marbles are black, and a single black marble in the last bag.
  3. Same as in 1 but with fixed $ \alpha=1$ and $ \beta_1=\beta_2=0.5$ .
  4. Same as in 2 but with fixed $ \alpha=1$ and $ \beta_1=\beta_2=0.5$ .
Average the estimated distributions across 50 Markov chains, each run for 100,000 iterations (discard the first 10,000 samples as burn-in). Hand in plots of all estimated distributions and interpret your results.

Hints:
  1. For the resampling of $ {\bf\theta}^i$ , the factor $ P({\bf y}^i\vert{\bf\theta}^i) P({\bf\theta}^i\vert\alpha,{\bf\beta})$ is again proportional to a Beta distribution, which can be sampled directly in MATLAB with the command $ \texttt{random}$ and the argument $ \texttt{'beta'}$ .
  2. For the resampling of $ \alpha$ and $ {\bf\beta}$ , apply sampling-importance-resampling (with 100 samples drawn from a proposal distribution identical to the prior) to a distribution proportional to the factor $ \prod_i P({\bf\theta}^i\vert\alpha,{\bf\beta})\, P(\alpha\vert\lambda)\, P({\bf\beta})$ .
  3. You should adapt your own MATLAB code from the previous Gibbs sampling homework example to solve this assignment; one possible structure for a single chain is sketched below.
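
Putting the hints together, a single Gibbs chain could be structured as follows. This is a sketch, not the required solution: the function name, the $ [\mathrm{black}, \mathrm{white}]$ count layout of $ \texttt{y}$ , and the use of $ \texttt{betarnd}$ /$ \texttt{betapdf}$ in place of $ \texttt{random}$ are illustrative choices.

    % One Gibbs chain for the marbles model (sketch).
    % y: n-by-2 matrix of [black, white] counts per bag.
    function [theta_s, alpha_s, beta_s] = gibbs_marbles(y, T, burnin)
      n = size(y, 1);
      alpha = exprnd(1);  beta1 = rand;       % initialize from the priors
      theta_s = zeros(T-burnin, n);
      alpha_s = zeros(T-burnin, 1);  beta_s = zeros(T-burnin, 1);
      for t = 1:T
        % Hint 1: the full conditional of theta^i is
        % Beta(alpha*beta1 + #black, alpha*beta2 + #white)
        theta = betarnd(alpha*beta1 + y(:,1), alpha*(1-beta1) + y(:,2));
        % Hint 2: resample (alpha, beta1) by SIR with 100 proposals drawn
        % from the prior; weights are then prod_i P(theta^i | alpha, beta)
        a_prop = exprnd(1, 100, 1);  b_prop = rand(100, 1);
        logw = zeros(100, 1);
        for k = 1:100
          logw(k) = sum(log(betapdf(theta, a_prop(k)*b_prop(k), ...
                                    a_prop(k)*(1-b_prop(k)))));
        end
        w = exp(logw - max(logw));  w = w / sum(w);
        k = randsample(100, 1, true, w);      % resample one particle
        alpha = a_prop(k);  beta1 = b_prop(k);
        if t > burnin                         % store samples after burn-in
          theta_s(t-burnin, :) = theta';
          alpha_s(t-burnin) = alpha;  beta_s(t-burnin) = beta1;
        end
      end
    end

For scenarios 3 and 4, the SIR step is simply skipped and $ \alpha=1$ , $ \beta_1=0.5$ stay fixed; the full solution then averages the histograms of the stored samples over 50 independent chains.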
Submit your MATLAB code. Present your results in a clear, structured, and legible form. Document them in such a way that anybody can reproduce them effortlessly.


Haeusler Stefan 2013-01-16