Thus far in this course we have treated unknown parameters \(p\) as fixed (non-random) and used the **frequentist** inference paradigm. It is also possible to view the unknown parameter as having some distribution. We could base this distribution on prior knowledge about the parameter. This distribution is known as the prior and is typically denoted by \(\pi(p)\). Once we have a prior on \(p\) representing our belief about the parameter before having seen the data, we can use Bayes' Theorem to "update" the distribution on \(p\). This updated distribution is known as the posterior. The posterior is denoted by \(\pi(p|x)\), which by Bayes' Theorem is

\[
\pi(p|x) = \frac{f(x,p)}{f(x)} = \frac{f(x|p)\pi(p)}{\underbrace{\int f(x|p)\pi(p)dp}_{\equiv m(x)}} \propto f(x|p)\pi(p)
\] This general framework for inference is known as **Bayesian** statistics.

One can summarize the posterior distribution using the posterior mean. \[ \widehat{p} = \mathbb{E}_{p|x}[p] = \int p \pi(p|x) dp \]
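The integral defining the posterior mean can be evaluated numerically as a sanity check. The sketch below assumes an illustrative \(Beta(3,5)\) posterior, whose mean has the closed form \(3/(3+5) = 0.375\):

```
# Posterior mean by numerical integration (illustrative Beta(3, 5) posterior).
post_density <- function(p) dbeta(p, shape1 = 3, shape2 = 5)
post_mean <- integrate(function(p) p * post_density(p), lower = 0, upper = 1)$value
post_mean  # matches the closed form 3/8 = 0.375
```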

Note that such estimators are functions of the posterior. In Bayesian inference, essentially all conclusions are derived from the posterior distribution. One major difficulty in Bayesian statistics is that the posterior may not have a closed-form solution. For certain classes of models \(f(x|\theta)\) and priors \(\pi(\theta)\), the posterior is always a member of the same family as the prior. These are known as conjugate families. In this case it is fairly easy to compute the posterior and obtain posterior point estimators.

**Conjugate Model:** Let \(\mathcal{F}\) be a set of probability density functions \(f(x|\theta)\) and \(\Pi\) be a set of prior distributions. If for any \(f \in \mathcal{F}\) and \(\pi \in \Pi\), the posterior \(\pi(\theta|x) \in \Pi \, \, \forall x\), the pair \(\mathcal{F},\Pi\) is called a conjugate family.

The following lemma will be useful for finding / verifying conjugate families.

**Lemma:** Suppose \(f\) and \(g\) are pdfs with \(f(\theta) \propto g(\theta)\); then \(f(\theta) = g(\theta)\).

Let \(X \sim Binomial(n,p)\). Suppose we represent our prior knowledge about \(p\) using a \(Beta(\alpha,\beta)\) distribution. The posterior is \[ \pi(p|x) = \frac{f(x|p)\pi(p)}{\int f(x|p)\pi(p)dp} \propto {n \choose x}p^x(1-p)^{n-x}\frac{1}{B(\alpha,\beta)}p^{\alpha-1}(1-p)^{\beta-1} \propto p^{\alpha+x-1}(1-p)^{n-x+\beta-1} \] We see that the last expression on the right is proportional to a Beta distribution with parameters \(\alpha' = \alpha+x\) and \(\beta' = n-x+\beta\). Thus we have shown that \(\pi(p|x)\) is proportional to a Beta distribution. Since the posterior is a distribution itself, by the previous lemma \(\pi(p|x)\) must be a Beta distribution.
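This conjugacy calculation can be checked numerically: evaluating \(f(x|p)\pi(p)\) on a grid and normalizing should reproduce the \(Beta(\alpha+x,\, \beta+n-x)\) density. The values \(n=10\), \(x=7\), and a \(Beta(2,2)\) prior below are chosen purely for illustration:

```
# Numerical check of Beta-Binomial conjugacy.
# Illustrative values: n = 10, x = 7, prior Beta(2, 2) => posterior Beta(9, 5).
n <- 10; x <- 7; a <- 2; b <- 2
p_grid <- seq(0.001, 0.999, length.out = 999)
unnorm <- dbinom(x, size = n, prob = p_grid) * dbeta(p_grid, a, b)  # f(x|p) * prior
post_grid <- unnorm / sum(unnorm)              # normalize on the grid
conj <- dbeta(p_grid, a + x, b + n - x)        # claimed conjugate posterior
conj <- conj / sum(conj)
max(abs(post_grid - conj))                     # essentially zero
```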

We now compare the maximum likelihood estimator to the Bayesian posterior mean for the binomial model. Suppose we plan to conduct a survey of 100 voters and ask them "Do you support more gun control?" Let \(X =\) the number of voters who say yes. Then \(X \sim Binomial(n=100,p)\) where \(p\) is the unknown proportion of people in the population who support gun control. As Bayesians we choose a prior to represent our belief about \(p\) before having collected the data. Suppose we think that \(p\) is around \(0.5\), but we feel it could be higher or lower. Then we might put a \(Beta(\alpha=2,\beta=2)\) prior on \(p\). This prior density is represented by the figure below.

We collect the data. \(90\) of the \(100\) people say they support more gun control. Our posterior on \(p\) is \(Beta(92,12)\). The posterior is represented below. We can see it is much more concentrated near \(0.90\). Almost all of the prior information has been "washed out" by the sample. If we had a smaller sample size (9 out of 10 people supported gun control), then the posterior would still bear a strong resemblance to the prior.

For a point estimate we could use the posterior mean. The mean of a beta distribution with parameters \(\alpha\) and \(\beta\) is \(\alpha/(\alpha + \beta)\). Therefore our posterior mean estimate for this example is \[\begin{equation*} \frac{92}{104} \approx 0.88 \end{equation*}\] Note that the maximum likelihood estimator for this model is \(\widehat{p}_{MLE} = X/n = 0.9\).
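This arithmetic is easy to verify directly, using the survey values from above:

```
# Posterior mean vs. MLE for the survey example:
# n = 100, x = 90, prior Beta(2, 2), posterior Beta(92, 12).
a <- 2; b <- 2; n <- 100; x <- 90
post_mean <- (a + x) / (a + b + n)     # 92/104, about 0.88
mle <- x / n                            # 0.9
c(post_mean = post_mean, mle = mle)
```

The posterior mean is pulled slightly toward the prior mean of \(0.5\) relative to the MLE.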

The Dirichlet distribution is a generalization of the beta distribution to two or more probabilities. It is a useful prior for discrete probability mass functions. This is because a discrete probability mass function may be parameterized by a vector of probabilities. The Dirichlet distribution can serve as a prior for these probabilities and impose the constraint that they must sum to 1. The Dirichlet distribution has the form:

\[ \pi(p|\alpha) = \frac{1}{B(\alpha)}\prod_{j=1}^K p_j^{\alpha_j-1} \]

where \(B(\alpha)\) is a normalizing constant that does not depend on \(p\). See Wikipedia for more background on the Dirichlet distribution. The hyperparameter \(\alpha\) is chosen to reflect our prior knowledge about the probability mass function.
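Concretely, \(B(\alpha)\) is the multivariate beta function, \(B(\alpha) = \prod_{j=1}^K \Gamma(\alpha_j) \big/ \Gamma\!\left(\sum_{j=1}^K \alpha_j\right)\). A quick sanity check: for \(K=2\) the Dirichlet reduces to the beta distribution, so this constant should match R's built-in `beta()` function:

```
# The Dirichlet normalizing constant is the multivariate beta function:
# B(alpha) = prod_j Gamma(alpha_j) / Gamma(sum_j alpha_j).
dir_B <- function(alpha) prod(gamma(alpha)) / gamma(sum(alpha))
# For K = 2 the Dirichlet is a Beta distribution, so B(c(a, b)) = beta(a, b).
dir_B(c(2, 3))   # equals beta(2, 3) = 1/12
beta(2, 3)
```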

We simulate probability mass functions (pmf) from a Dirichlet prior.

```
library(MCMCpack)  # provides rdirichlet()
set.seed(1234)
K <- 5                    # number of categories in the pmf
alpha <- rep(1, K)        # flat Dirichlet prior (uniform over the simplex)
Ndraw <- 100              # number of pmfs to simulate
draws <- rdirichlet(n = Ndraw, alpha = alpha)  # Ndraw x K matrix; rows sum to 1
# Plot each simulated pmf as a translucent gray line, with the
# pointwise mean of the draws overlaid in blue.
matplot(1:K, t(draws), type = "l", col = "#00000040", lty = 1, ylim = c(0, 1),
        xlab = "category", ylab = "probability")
lines(1:K, colMeans(draws), col = "blue", lwd = 2)
```