Bayesian Statistics Overview

Thus far in this course we have treated the unknown parameter \(p\) as fixed (non-random) and used the frequentist inference paradigm. It is also possible to view the unknown parameter as having a distribution of its own, which we can base on prior knowledge about the parameter. This distribution is known as the prior and is typically denoted by \(\pi(p)\). Once we have a prior on \(p\) representing our belief about the parameter before having seen the data, we can use Bayes' Theorem to "update" the distribution on \(p\). This updated distribution is known as the posterior, denoted by \(\pi(p|x)\), which by Bayes' Theorem is

\[ \pi(p|x) = \frac{f(x,p)}{f(x)} = \frac{f(x|p)\pi(p)}{\underbrace{\int f(x|p)\pi(p)dp}_{\equiv m(x)}} \propto f(x|p)\pi(p) \] This general framework for inference is known as Bayesian statistics.

One can summarize the posterior distribution using the posterior mean. \[ \widehat{p} = \mathbb{E}_{p|x}[p] = \int p \pi(p|x) dp \]
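As a quick numerical illustration (the specific numbers here are chosen for illustration and are not from the notes), the posterior mean can be computed directly from this definition by numerical integration in R:

# illustrative only: Binomial(n = 10, p) likelihood with x = 7 successes, Beta(2,2) prior
x_obs <- 7; n_obs <- 10
unnorm_post <- function(p) dbinom(x_obs, n_obs, p) * dbeta(p, 2, 2)   # f(x|p) * pi(p)
m_x <- integrate(unnorm_post, 0, 1)$value                             # marginal m(x)
post_mean <- integrate(function(p) p * unnorm_post(p), 0, 1)$value / m_x
post_mean   # equals the conjugate answer (2 + 7) / (2 + 2 + 10) = 9/14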

Note that these estimators are functions of the posterior; in Bayesian inference, essentially all conclusions are derived from the posterior distribution. One major difficulty in Bayesian statistics is that the posterior may not have a closed form. For certain classes of models \(f(x|\theta)\) and priors \(\pi(\theta)\), however, the posterior is always a member of the same set of distributions as the prior. These are known as conjugate families. In this case it is fairly easy to compute the posterior and obtain posterior point estimators.

Conjugate Model: Let \(\mathcal{F}\) be a set of probability density functions \(f(x|\theta)\) and \(\Pi\) be a set of prior distributions. If for any \(f \in \mathcal{F}\) and \(\pi \in \Pi\), the posterior \(\pi(\theta|x) \in \Pi \, \, \forall x\), the pair \(\mathcal{F},\Pi\) is called a conjugate family.

The following lemma will be useful for finding / verifying conjugate families.

Lemma: If \(f\) and \(g\) are pdfs and \(f(\theta) \propto g(\theta)\), then \(f(\theta) = g(\theta)\). (The constant of proportionality must be 1 because both densities integrate to 1.)

Beta-Binomial Conjugate Model

Let \(X \sim Binomial(n,p)\). Suppose we represent our prior knowledge about \(p\) using a \(Beta(\alpha,\beta)\) distribution. The posterior is \[ \pi(p|x) = \frac{f(x|p)\pi(p)}{\int f(x|p)\pi(p)dp} \propto {n \choose x}p^x(1-p)^{n-x}\frac{1}{B(\alpha,\beta)}p^{\alpha-1}(1-p)^{\beta-1} \propto p^{\alpha+x-1}(1-p)^{n-x+\beta-1} \] The last expression on the right is proportional to a Beta density with parameters \(\alpha' = \alpha+x\) and \(\beta' = n-x+\beta\). Thus we have shown that \(\pi(p|x)\) is proportional to a Beta distribution. Since the posterior is a distribution itself, by the previous lemma \(\pi(p|x)\) must be the \(Beta(\alpha+x,\, n-x+\beta)\) distribution.
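In particular, since the posterior is \(Beta(\alpha+x,\, \beta+n-x)\), the posterior mean has the closed form \[ \widehat{p} = \mathbb{E}_{p|x}[p] = \frac{\alpha + x}{\alpha + \beta + n}, \] which pulls the MLE \(x/n\) toward the prior mean \(\alpha/(\alpha+\beta)\).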

Gun Control Example

We now compare the maximum likelihood estimator to the Bayesian posterior mean for the binomial model. Suppose we plan to conduct a survey of 100 voters and ask them "Do you support more gun control?" Let \(X =\) the number of voters who say yes. Then \(X \sim Binomial(n=100,p)\), where \(p\) is the unknown proportion of people in the population who support gun control. As Bayesians, we choose a prior to represent our belief about \(p\) before having collected the data. Suppose we think that \(p\) is around \(0.5\), but we feel it could be higher or lower. Then we might put a \(Beta(\alpha=2,\beta=2)\) prior on \(p\). This prior density is represented by the figure below.
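A minimal R sketch for drawing this prior density (hypothetical plotting code, not necessarily the code used to produce the original figure):

curve(dbeta(x, 2, 2), from = 0, to = 1, lwd = 2, col = "blue",
      xlab = "p", ylab = "prior density")   # Beta(2,2) prior on p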

We collect the data: \(90\) of the \(100\) people say they support more gun control. Our posterior on \(p\) is then \(Beta(92,12)\), represented below. We can see it is much more concentrated near \(0.90\); almost all of the prior information has been "washed out" by the sample. If we had a smaller sample (say 9 out of 10 people supporting gun control), the prior would retain much more influence on the posterior.
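A sketch of how one might overlay the prior and the two posteriors just discussed (hypothetical plotting code; the \(Beta(11,3)\) curve is the posterior one would obtain from the smaller 9-out-of-10 sample):

curve(dbeta(x, 92, 12), from = 0, to = 1, lwd = 2,
      xlab = "p", ylab = "density")                              # posterior after 90/100
curve(dbeta(x, 11, 3), add = TRUE, lwd = 2, col = "darkgreen")   # posterior after 9/10
curve(dbeta(x, 2, 2), add = TRUE, lwd = 2, col = "blue")         # Beta(2,2) prior
legend("topleft", c("Posterior (90/100)", "Posterior (9/10)", "Prior"),
       col = c("black", "darkgreen", "blue"), lty = 1, lwd = 2)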

For a point estimate we could use the posterior mean. The mean of a beta distribution with parameters \(\alpha\) and \(\beta\) is \(\alpha/(\alpha + \beta)\). Therefore our posterior mean estimate for this example is \[\begin{equation*} \frac{92}{104} \approx 0.88 \end{equation*}\] Note that the maximum likelihood estimator for this model is \(\widehat{p}_{MLE} = X/n = 0.9\).

Dirichlet Prior for Discrete Distributions

The Dirichlet distribution generalizes the beta distribution from two categories to \(K \ge 2\) categories. It is a useful prior for discrete probability mass functions, because a discrete probability mass function on \(K\) values can be parameterized by a vector of probabilities \(p = (p_1,\ldots,p_K)\). The Dirichlet distribution can serve as a prior for these probabilities and imposes the constraint that they sum to 1. The Dirichlet density has the form:

\[ \pi(p|\alpha) = \frac{1}{B(\alpha)}\prod_{j=1}^K p_j^{\alpha_j-1} \]

where \(B(\alpha)\) is a normalizing constant that does not depend on \(p\); explicitly, \(B(\alpha) = \prod_{j=1}^K \Gamma(\alpha_j)\big/\Gamma\big(\sum_{j=1}^K \alpha_j\big)\). See Wikipedia for more background on the Dirichlet distribution. The hyperparameter vector \(\alpha = (\alpha_1,\ldots,\alpha_K)\) is chosen to reflect our prior knowledge about the probability mass function.

We simulate probability mass functions (pmf) from a Dirichlet prior.

library(MCMCpack)   # provides rdirichlet()
set.seed(1234)
K <- 5              # number of categories
alpha <- rep(1,K)   # "flat" Dirichlet hyperparameter
Ndraw <- 100        # number of pmfs to draw from the prior

# each row of draws is one pmf (a probability vector of length K)
draws <- rdirichlet(n=Ndraw,alpha=alpha)
# plot each drawn pmf in gray, and the average of the draws in blue
matplot(1:K,t(draws),type='l',col="#00000040",lty=1,ylim=c(0,1))
points(1:K,colMeans(draws),col='blue',lwd=2,type='l')

We can convert these pmfs into survival functions, \(S(t) = P(X > t) = 1 - \sum_{j \le t} p_j\) for \(t = 0,1,\ldots,K\):

# apply(draws,1,cumsum) is K x Ndraw: column i is the cdf of draw i,
# so each column of St is a survival function S(t) for t = 0,...,K
St <- rbind(1,1-apply(draws,1,cumsum))
matplot(0:K,St,type='l',col="#00000040",lty=1,ylim=c(0,1))
points(0:K,rowMeans(St),col='blue',lwd=2,type='l')

We now consider Dirichlet priors with different values of \(\alpha\):

# four different prior hyperparameter vectors
alphas <- list(rep(1,K),rep(50,K),c(5,5,rep(1,K-2)),c(50,50,rep(10,K-2)))
Ndraw <- 100

par(mfrow=c(2,2),mar=c(5,5,1,.5))
for(ii in 1:length(alphas)){
  draws <- rdirichlet(n=Ndraw,alpha=alphas[[ii]])
  matplot(1:K,t(draws),type='l',col="#00000040",lty=1,ylim=c(0,1),
          xlab=paste0("alpha=(",paste0(alphas[[ii]],collapse=","),")"),
          ylab="prob")
  points(1:K,colMeans(draws),col='blue',lwd=2,type='l')
}

Again we could plot the corresponding survival functions:

par(mfrow=c(2,2),mar=c(5,5,1,.5))
for(ii in 1:length(alphas)){
  draws <- rdirichlet(n=Ndraw,alpha=alphas[[ii]])
  St <- rbind(1,1-apply(draws,1,cumsum))   # one survival function per column
  matplot(0:K,St,type='l',col="#00000040",lty=1,ylim=c(0,1),
          xlab=paste0("alpha=(",paste0(alphas[[ii]],collapse=","),")"),
          ylab="survival")
  points(0:K,rowMeans(St),col='blue',lwd=2,type='l')
}

Conjugacy of Dirichlet with Discrete (Multinomial) Distribution

For a discrete distribution \(f_X\) which is nonzero on \(\{1,\ldots,K\}\) with \(f_X(j) = P(X=j) = p_j\), the \(\pi(p) = Dirichlet(\alpha)\) prior with \(\alpha \in \mathbb{R}_+^K\) is conjugate. To see this, assume a sample \(X_1,\ldots,X_n \sim f_X\). We have

\[ \begin{align*} \pi(p|\vec{x}) &\propto f(\vec{x}|p)\pi(p)\\ &= \prod_{i=1}^n f(x_i|p)\frac{1}{B(\alpha)} \prod_{j=1}^K p_j^{\alpha_j-1}\\ &\propto \prod_{i=1}^n \prod_{j=1}^K p_{j}^{\mathbb{1}_{X_i=j}} \prod_{j=1}^K p_j^{\alpha_j-1}\\ &= \prod_{j=1}^K p_{j}^{\sum_{i=1}^n \mathbb{1}_{X_i=j}} \prod_{j=1}^K p_j^{\alpha_j-1}\\ &= \prod_{j=1}^K p_j^{\alpha_j + \sum_{i=1}^n \mathbb{1}_{X_i=j} -1}, \end{align*} \] which is the kernel of a \(Dirichlet\big(\vec{\alpha} + \big(\sum_{i=1}^n \mathbb{1}_{X_i=1}, \ldots, \sum_{i=1}^n \mathbb{1}_{X_i=K}\big)\big)\) distribution. In words, the posterior Dirichlet parameters are the prior parameters plus the observed counts in each category.

Simulated Example

Suppose there is a discrete distribution on \(1,2,3,4,5\) with the associated probabilities

probs <- c(.1,.1,.3,.5,0)   # true pmf on 1,...,5 (note P(X=5)=0)
K <- length(probs)
par(mfcol=c(1,2),mar=c(4,5,1,1))
plot(1:length(probs),probs,type='l',col='red',lwd=2,ylab="probability mass function",xlab="")
plot(0:K,c(1,1-cumsum(probs)),type='l',col='red',lwd=2,ylab="survival function",xlab="")

We obtain a sample from this distribution

n <- 30
# draw n values from 1:K with replacement according to probs (see ?sample)
x <- sample(1:K,size=n,replace=TRUE,prob=probs)
x
##  [1] 2 4 4 4 2 4 2 1 4 4 3 3 4 3 4 4 1 4 4 2 2 4 4 4 4 4 1 4 3 1

Frequentist Analysis

We could estimate the distribution by counting the proportion of 1's, 2's, etc. This is the maximum likelihood estimator of the pmf \((p_1,\ldots,p_K)\).

s <- rep(0,K)                   # category counts, including categories never observed
names(s) <- 1:K
s[names(table(x))] <- table(x)
ps <- s/length(x)               # MLE of the pmf: observed proportions
par(mfcol=c(1,2),mar=c(4,5,1,1))
plot(1:K,ps,xlab="",ylab="prob",type='l')
points(1:length(probs),probs,type='l',col='red',lwd=2)
legend("topleft",c("True pmf","MLE pmf"),col=c("red","black"),lty=1,lwd=2)
plot(0:K,c(1,1-cumsum(ps)),type='l',lwd=2,ylab="survival function",xlab="")
points(0:K,c(1,1-cumsum(probs)),type='l',col='red',lwd=2)
legend("topright",c("True S(t)","MLE S(t)"),col=c("red","black"),lty=1,lwd=2)

Bayesian Analysis

We use a Dirichlet prior and obtain a Dirichlet posterior. We consider four different hyperparameter vectors for the Dirichlet prior:

alphas
## [[1]]
## [1] 1 1 1 1 1
## 
## [[2]]
## [1] 50 50 50 50 50
## 
## [[3]]
## [1] 5 5 1 1 1
## 
## [[4]]
## [1] 50 50 10 10 10

Compute the posterior for each choice of prior hyperparameter:

post_alphas <- alphas
for(ii in 1:length(alphas)){
  # posterior parameters = prior parameters + observed category counts
  post_alphas[[ii]] <- alphas[[ii]] + s
}

We draw distributions from the posterior and plot:

Ndraw <- 100
par(mfrow=c(2,2),mar=c(5,5,1,.5))
for(ii in 1:length(post_alphas)){
  draws <- rdirichlet(n=Ndraw,alpha=post_alphas[[ii]])   # pmfs drawn from the posterior
  matplot(1:K,t(draws),type='l',col="#00000040",lty=1,ylim=c(0,1),
          xlab=paste0("alpha=(",paste0(post_alphas[[ii]],collapse=","),")"),
          ylab="prob")
  points(1:K,colMeans(draws),col='blue',lwd=2,type='l')
  points(1:length(probs),probs,type='l',col='red',lwd=2)
  legend("topleft",c("True pmf","Posterior Mean","Posterior Draw"),col=c("red","blue","black"),lty=1,lwd=2)
}

We convert these posteriors on the probability mass function into posteriors on the survival function:

Ndraw <- 100
par(mfrow=c(2,2),mar=c(5,5,1,.5))
for(ii in 1:length(post_alphas)){
  draws <- rdirichlet(n=Ndraw,alpha=post_alphas[[ii]])
  St <- rbind(1,1-apply(draws,1,cumsum))   # posterior draws of the survival function
  matplot(0:K,St,type='l',col="#00000040",lty=1,ylim=c(0,1),
          xlab=paste0("alpha=(",paste0(post_alphas[[ii]],collapse=","),")"),
          ylab="survival")
  points(0:K,rowMeans(St),col='blue',lwd=2,type='l')
  points(0:K,c(1,1-cumsum(probs)),type='l',col='red',lwd=2)
  legend("bottomleft",c("True Survival","Posterior Mean","Posterior Draw"),col=c("red","blue","black"),lty=1,lwd=2)
}

Upcoming

On Friday, we generalize these methods to continuous distributions and right censoring and introduce MCMC to explore posterior distributions.