# Bayesian Credibility

Bayesian credibility estimates a quantity of interest as a weighted average of estimates, with the weights directly influenced by the individual experience.

## Statistical Models

Let $X$ denote a random variable. A statistical model for $X$ is a family of probability distributions that is hypothesized to contain the true distribution of $X$. We are primarily interested in families of distributions that depend on parameters, i.e. families that admit a parametrization. The only situation relevant for the exam is when the statistical model is finite dimensional -- the statistical model can be described by a finite number of real-valued parameters:

[$] \Theta \subset \mathbb{R}^d \,. [$]

For instance, the family of all normal distributions can be parametrized by a two dimensional space since every normal distribution is characterized by its mean $\mu$ and its variance $\sigma^2$. In what follows, the parameter space $\Theta$ represents the parametrization of the statistical model; consequently, every element of the space represents a possible probability distribution for $X$:

[$] \operatorname{P}(X \in A \,;\, \theta) = p(A \, ; \, \theta)\,, \quad \theta \in \Theta\,. [$]

If the distributions in the statistical model admit a density then it is denoted by $f(x\,;\,\theta)$:

[$] p(A \, ; \, \theta) = \int_{A}f(x\,;\,\theta) \,dx\,. [$]

### The Bayesian Model

The Bayesian approach to statistical modelling is to consider $X$ as being generated through a two-step process:

1. Generate an unobservable random variable $Y$ taking values in $\Theta$.
2. Generate $X$ with distribution corresponding to the parameter $Y$.

The distribution of $Y$ is called the prior distribution, which we denote by $G(\theta)$ (with density $g(\theta)$ if it exists):

[$] \operatorname{P}(Y\leq \theta) = G(\theta). [$]

In the Bayesian context, the parametrized distributions $p(x ; \theta)$ are written as $p(x |\theta)$ to emphasize that we're dealing with conditional probability distributions (conditional on $Y$ taking on the value $\theta$).

## The Posterior Distribution

In the Bayesian model context, the posterior probability is the probability distribution of $Y$ given $X$:

[$] \operatorname{P}(Y\leq \theta \mid X) = G(\theta \mid X). [$]

It contrasts with the likelihood function, which is the probability of the evidence given the parameters.

### Calculation

The posterior probability distribution of one random variable given the value of another can be calculated with Bayes' theorem by multiplying the prior probability distribution by the likelihood function, and then dividing by the normalizing constant, as follows:

[$] G(y \mid X = x) = {\int_{\theta \leq y} \mathcal{L}(x \mid \theta)\, dG(\theta) \over{\int_{\Theta} \mathcal{L}(x \mid \theta) \, dG(\theta)}}\,. [$]

When both the prior and each distribution in the statistical model admit a probability density function, the posterior distribution also admits a density given by

[$] f(\theta \mid X = x) = {f(x \mid \theta) g(\theta) \over {\int_{\Theta} f(x \mid \theta) g(\theta) \,d\theta}} [$]

where $g(\theta)$ is the prior density of $Y$, $f(x\mid \theta)$ is the likelihood function as a function of $\theta$, and $\int_{\Theta} f(x \mid \theta) g(\theta) \,d\theta$ is the normalizing constant.
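The density form of Bayes' theorem can be checked numerically. The following sketch (all numbers hypothetical) discretizes the parameter space and normalizes the product of likelihood and prior; with a $\text{Normal}(\theta, 1)$ likelihood and a $\text{Normal}(0, 1)$ prior, conjugacy gives a $\text{Normal}(x/2, 1/2)$ posterior to compare against:

```python
import numpy as np

# Hypothetical conjugate example: X | theta ~ Normal(theta, 1),
# prior theta ~ Normal(0, 1), one observation x = 1.2.
theta = np.linspace(-8.0, 8.0, 160001)
dtheta = theta[1] - theta[0]
x = 1.2

likelihood = np.exp(-0.5 * (x - theta) ** 2) / np.sqrt(2 * np.pi)  # f(x | theta)
g = np.exp(-0.5 * theta ** 2) / np.sqrt(2 * np.pi)                 # prior density

# Posterior density: numerator divided by the normalizing constant.
numerator = likelihood * g
posterior = numerator / (numerator.sum() * dtheta)

# Conjugacy says the posterior is Normal(x/2, 1/2), so the posterior
# mean recovered from the grid should be x/2 = 0.6.
post_mean = (theta * posterior).sum() * dtheta
```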

### Updating the Prior

Consider the following situation: let

[$] X = [X_1,\ldots,X_n] [$]

denote a sequence of $n$ random variables sharing an (unknown) common distribution belonging to some statistical model that has been conveniently parametrized by $\Theta \subset \mathbb{R}^d$ as above. Following The Bayesian Model, we assume that the data have been generated as follows:

1. Generate an unobservable random variable $Y$ taking values in $\Theta$ with prior distribution $G(\theta)$.
2. Generate $X_1,\ldots,X_n$ with common distribution corresponding to parameter $Y$ and conditionally mutually independent given $Y$.

The random variables $X_1,\ldots,X_n$ are not mutually independent but conditionally independent. As we will see below, dependence can exist because information from one subset of the data affects the posterior distribution which in turn affects the distribution of the other random variables (see also The Predictive Posterior).

The posterior distribution equals

[$] G_n(y \mid X = x) = {\int_{\theta \leq y} \mathcal{L}_n(x \mid \theta)\, dG(\theta) \over{\int_{\Theta} \mathcal{L}_n(x \mid \theta) \, dG(\theta)}}\, [$]

with

[$] \mathcal{L}_n(x\mid \theta) = \prod_{i=1}^n \mathcal{L}(x_i \mid \theta) \, , x = [x_1,\ldots,x_n] \in \mathbb{R}^n [$]

and $\mathcal{L}(x_i\mid \theta)$ denoting the likelihood function associated with the random variable $X_i$ (these likelihood functions are identical since the random variables are assumed to have a common distribution).

You can think of the posterior distribution as an updated prior distribution: we start with a prior distribution $G(\theta)$ which is updated to $G_n(\theta)$ based on the information contained in the data.
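The update can be carried out numerically. Below is a minimal sketch (hypothetical two-point prior and Bernoulli data) verifying that updating the prior one observation at a time gives the same posterior as a single batch update with the product likelihood $\mathcal{L}_n$:

```python
import numpy as np

def update(prior, likelihoods):
    """One application of Bayes' theorem to a discrete prior."""
    post = prior * likelihoods
    return post / post.sum()

# Hypothetical model: Bernoulli data with unknown success probability
# theta in {0.3, 0.7} and a uniform two-point prior.
thetas = np.array([0.3, 0.7])
prior = np.array([0.5, 0.5])
data = [1, 1, 0, 1]

def bernoulli_lik(x):
    return thetas ** x * (1 - thetas) ** (1 - x)

# Update the prior one observation at a time...
seq = prior.copy()
for x in data:
    seq = update(seq, bernoulli_lik(x))

# ...and in a single batch using the product likelihood L_n.
L_n = np.prod([bernoulli_lik(x) for x in data], axis=0)
batch = update(prior, L_n)
```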

## Bayesian Prediction

In this section, we assume the generic setup (see The Posterior Distribution):

1. a random variable $Y$ taking values in $\Theta$ with prior distribution $G(\theta)$
2. $X_1,\ldots,X_n$ are random variables with common distribution corresponding to parameter $Y$ and conditionally mutually independent given $Y$.

### Predictive Prior

What is the joint distribution of the data when no data has yet been observed? Since $Y$ is unobserved, the joint distribution equals the expected joint distribution and is called the predictive prior:

[] \begin{align*} \operatorname{P}(X_i\in A_i\,, i=1,\ldots,n) &= \operatorname{E}\left[\operatorname{E}\left[\prod_{i=1}^n 1_{X_i \in A_i} \mid Y\right] \right] \\ &= \operatorname{E}\left[\prod_{i=1}^n p(A_i \mid Y) \right] \\ &= \int_{\Theta} \prod_{i=1}^n p(A_i \mid \theta) \, dG(\theta). \end{align*} []

### Predictive Posterior

What is the joint distribution of new data when old data is present? Or, how can we quantify the dependence between the data? If $X = [X_1,\ldots,X_n]$ denotes the observed data and $X_{n+1},\ldots,X_{n+k}$ denote $k$ future observations, then the predictive posterior is given by

[] \begin{align*} \operatorname{P}(X_{n+i}\in A_i \,,\, i=1,\ldots,k \mid X) &= \int_{\Theta} \prod_{i=1}^k p(A_i \mid \theta) \, dG_n(\theta). \end{align*} []
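For a discrete prior the integral reduces to a sum. A small sketch (made-up numbers) computing the predictive posterior probability of one future Bernoulli observation:

```python
import numpy as np

# Hypothetical model: X_i | theta ~ Bernoulli(theta), theta in {0.3, 0.7},
# uniform two-point prior.
thetas = np.array([0.3, 0.7])
prior = np.array([0.5, 0.5])
data = [1, 0, 1]

# Posterior G_n over the two parameter values.
lik = np.array([t ** sum(data) * (1 - t) ** (len(data) - sum(data)) for t in thetas])
post = prior * lik
post /= post.sum()

# Predictive posterior for a single future observation (k = 1):
# P(X_{n+1} = 1 | X) = sum over theta of p({1} | theta) * G_n({theta}).
pred = float(post @ thetas)
```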

### Bayesian Credibility Estimator

Given the $n$ data points $X_1,\ldots,X_n$, which can be thought of as your information, how can we estimate an unobservable random variable $Y$? One way to proceed is to try to find an estimator that minimizes the mean square error

[$] $$\label{mse} \min_Z \operatorname{E}[(Z - Y)^2]\,, \quad Z \,\,\text{a function of}\,\, X_1,\ldots,X_n \,.$$ [$]

To be mathematically precise, the minimizer must be measurable with respect to the sigma-field (information) generated by $X_1,\ldots,X_n$. It is well-known that the solution to \ref{mse}, often called the minimum mean square estimator, is unique (in a well-defined sense) and equals the conditional expectation $\operatorname{E}[Y | X_1,\ldots,X_n].$

Now suppose that the data $X_1,\ldots,X_n$ correspond to quantities of interest to insurers, such as pure premiums, claim severities, or claim frequencies, associated with an insured whose unknown risk parameter is $Y$ (the data represent the experience of the insured). The minimum mean square estimator for the expected value $\mu(Y) = \operatorname{E}[X_i \mid Y]$ is called the Bayesian credibility estimator and equals

[$] \operatorname{E}[\mu(Y) \mid X_1,\ldots,X_n] = \int_{\Theta}\mu(\theta) \, dG_n(\theta). [$]
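With a discrete prior the credibility estimator is a finite sum over the posterior. A sketch with made-up numbers, using a Poisson model so that $\mu(\theta) = \theta$:

```python
import math
import numpy as np

# Hypothetical risk classes: X_i | theta ~ Poisson(theta) with
# theta in {1.0, 2.0} and a uniform two-point prior, so mu(theta) = theta.
thetas = np.array([1.0, 2.0])
prior = np.array([0.5, 0.5])
claims = [2, 0, 1, 3]          # observed experience X_1..X_4

def poisson_pmf(x, lam):
    return lam ** x * math.exp(-lam) / math.factorial(x)

# Posterior G_n via the product likelihood.
lik = np.array([np.prod([poisson_pmf(x, t) for x in claims]) for t in thetas])
post = prior * lik
post /= post.sum()

# Bayesian credibility estimator: E[mu(Y) | X] = sum of theta * G_n({theta}).
estimator = float(post @ thetas)
```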

## Conjugate Families

If the posterior distributions are in the same family as the prior, the prior and posterior are then called conjugate distributions, and the prior is called a conjugate prior for the likelihood functions. A conjugate prior is an algebraic convenience, giving a closed-form expression for the posterior. Further, conjugate priors may give intuition, by more transparently showing how a likelihood function updates a prior distribution. All members of the exponential family have conjugate priors. The exponential families include many of the most common distributions, including the normal, exponential, gamma, beta and Poisson distributions.

### Exponential family: scalar parameter

A single-parameter exponential family is a set of probability distributions whose probability density function (or probability mass function, in the case of a discrete distribution) can be expressed in the form

[$] f(x\mid\theta) = h(x) \exp \left (\eta(\theta) \cdot T(x) -A(\theta)\right )[$]

where $T(x), h(x),\eta(\theta)$ and $A(\theta)$ are known functions. An alternative, equivalent form often given is

[$] f(x\mid\theta) = h(x) g(\theta) \exp \left ( \eta(\theta) \cdot T(x) \right )[$]

or equivalently

[$] f(x\mid\theta) = \exp \left (\eta(\theta) \cdot T(x) - A(\theta) + B(x) \right )[$]

The value $\theta$ is called the parameter of the family.

Note that $x$ is often a vector of measurements, in which case $T(x)$ may be a function from the space of possible values of $x$ to the real numbers. More generally, $\eta(\theta)$ and $T(x)$ can each be vector-valued such that $\eta(\theta) \cdot T(x)$ is real-valued.

If $\eta(\theta) = \theta$, then the exponential family is said to be in canonical form. By defining a transformed parameter $\eta = \eta(\theta)$, it is always possible to convert an exponential family to canonical form. The canonical form is non-unique, since $\eta(\theta)$ can be multiplied by any nonzero constant, provided that $T(x)$ is multiplied by that constant's reciprocal, or a constant $c$ can be added to $\eta(\theta)$ and $h(x)$ multiplied by $\exp (-c \cdot T(x))$ to offset it.

Note also that the function $A(\theta)$ or equivalently $g(\theta)$ is automatically determined once the other functions have been chosen, and assumes a form that causes the distribution to be normalized (sum or integrate to one over the entire domain). Furthermore, both of these functions can always be written as functions of $\eta$, even when $\eta(\theta)$ is not a one-to-one function, i.e. two or more different values of $\theta$ map to the same value of $\eta(\theta)$, and hence $\eta(\theta)$ cannot be inverted. In such a case, all values of $\theta$ mapping to the same $\eta(\theta)$ will also have the same value for $A(\theta)$ and $g(\theta)$.
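As a quick sanity check of the scalar form, the Poisson family fits it with $h(x) = 1/x!$, $\eta = \log\lambda$, $T(x) = x$ and $A(\eta) = e^{\eta}$; the following sketch compares the two expressions numerically:

```python
import math

def poisson_pmf(x, lam):
    """Standard Poisson pmf: lam^x e^{-lam} / x!."""
    return lam ** x * math.exp(-lam) / math.factorial(x)

def expfam_pmf(x, lam):
    """Same pmf in exponential-family form h(x) exp(eta*T(x) - A(eta))."""
    eta = math.log(lam)
    h = 1 / math.factorial(x)   # h(x) = 1/x!
    T = x                       # sufficient statistic T(x) = x
    A = math.exp(eta)           # A(eta) = e^eta = lam
    return h * math.exp(eta * T - A)
```

The two functions agree for every $(x, \lambda)$ pair, confirming the identification of $h$, $\eta$, $T$ and $A$.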

### Exponential family: vector parameter

The definition in terms of one real-number parameter can be extended to one real-vector parameter

[$]{\boldsymbol \theta} = \left(\theta_1, \theta_2, \cdots, \theta_d \right )^T.[$]

A family of distributions is said to belong to a vector exponential family if the probability density function (or probability mass function, for discrete distributions) can be written as

[$] f(x\mid\boldsymbol \theta) = h(x) \exp\left(\sum_{i=1}^s \eta_i({\boldsymbol \theta}) T_i(x) - A({\boldsymbol \theta}) \right)[$]

Or in a more compact form,

[$] f(x\mid\boldsymbol \theta) = h(x) \exp\Big(\boldsymbol\eta({\boldsymbol \theta}) \cdot \mathbf{T}(x) - A({\boldsymbol \theta}) \Big) [$]

This form writes the sum as a dot product of vector-valued functions $\boldsymbol\eta({\boldsymbol \theta})$ and $\mathbf{T}(x)$. An alternative, equivalent form often seen is

[$] f(x\mid\boldsymbol \theta) = h(x) g(\boldsymbol \theta) \exp\Big(\boldsymbol\eta({\boldsymbol \theta}) \cdot \mathbf{T}(x)\Big)[$]

As in the scalar valued case, the exponential family is said to be in canonical form if

[$]\forall i: \quad \eta_i({\boldsymbol \theta}) = \theta_i.[$]

A vector exponential family is said to be curved if the dimension of

[$]{\boldsymbol \theta} = \left (\theta_1, \theta_2, \ldots, \theta_d \right )^T[$]

is less than the dimension of the vector

[$]{\boldsymbol \eta}(\boldsymbol \theta) = \left (\eta_1(\boldsymbol \theta), \eta_2(\boldsymbol \theta), \ldots, \eta_s(\boldsymbol \theta) \right )^T.[$]

That is, if the dimension of the parameter vector is less than the number of functions of the parameter vector in the above representation of the probability density function. Note that most common distributions in the exponential family are not curved, and many algorithms designed to work with any member of the exponential family implicitly or explicitly assume that the distribution is not curved.

Note that, as in the above case of a scalar-valued parameter, the function $A(\boldsymbol \theta)$ or equivalently $g(\boldsymbol \theta)$ is automatically determined once the other functions have been chosen, so that the entire distribution is normalized. In addition, as above, both of these functions can always be written as functions of $\boldsymbol\eta$, regardless of the form of the transformation that generates $\boldsymbol\eta$ from $\boldsymbol\theta$. Hence an exponential family in its "natural form" (parametrized by its natural parameter) looks like

[$] f(x\mid\boldsymbol \eta) = h(x) \exp\Big(\boldsymbol\eta \cdot \mathbf{T}(x) - A({\boldsymbol \eta})\Big)[$]

or equivalently

[$] f(x\mid\boldsymbol \eta) = h(x) g(\boldsymbol \eta) \exp\Big(\boldsymbol\eta \cdot \mathbf{T}(x)\Big)[$]

Note that the above forms may sometimes be seen with $\boldsymbol\eta^T \mathbf{T}(x)$ in place of $\boldsymbol\eta \cdot \mathbf{T}(x)$. These are exactly equivalent formulations, merely using different notation for the dot product.
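As a sketch of the vector-parameter case, the normal distribution fits the form with $\boldsymbol\eta = (\mu/\sigma^2, -1/(2\sigma^2))$, $\mathbf{T}(x) = (x, x^2)$, $h(x) = 1$ and $A = \mu^2/(2\sigma^2) + \tfrac{1}{2}\log(2\pi\sigma^2)$:

```python
import math

def normal_pdf(x, mu, s2):
    """Standard normal density with mean mu and variance s2."""
    return math.exp(-(x - mu) ** 2 / (2 * s2)) / math.sqrt(2 * math.pi * s2)

def expfam_pdf(x, mu, s2):
    """Same density as a two-parameter exponential family."""
    eta = (mu / s2, -1.0 / (2 * s2))          # natural parameter vector
    T = (x, x * x)                            # sufficient statistic vector
    A = mu ** 2 / (2 * s2) + 0.5 * math.log(2 * math.pi * s2)
    return math.exp(eta[0] * T[0] + eta[1] * T[1] - A)   # h(x) = 1
```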

### Interpretation

In the definitions above, the functions $T(x)$, $\eta(\theta)$ and $A(\eta)$ were apparently arbitrarily defined. However, these functions play a significant role in the resulting probability distribution.

• $T(x)$ is a sufficient statistic of the distribution. For exponential families, the sufficient statistic is a function of the data that fully summarizes the data $x$ within the density function. This means that, for any data sets $x$ and $y$, the density value is the same if $T(x) = T(y)$. This is true even if $x$ and $y$ are quite different—that is, $d(x,y)\gt0$. The dimension of $T(x)$ equals the number of parameters of $\theta$ and encompasses all of the information regarding the data related to the parameter $\theta$. The sufficient statistic of a set of independent identically distributed data observations is simply the sum of individual sufficient statistics, and encapsulates all the information needed to describe the posterior distribution of the parameters, given the data (and hence to derive any desired estimate of the parameters).
• $\eta$ is called the natural parameter. The set of values of $\eta$ for which the function $f(x;\theta)$ is finite is called the natural parameter space.
• $A(\eta)$ is called the log-partition function because it is the logarithm of a normalization factor, without which $f(x;\theta)$ would not be a probability distribution ("partition function" is often used in statistics as a synonym of "normalization factor"):

[$] A(\eta) = \ln\left ( \int_x h(x) \exp (\eta \cdot T(x)) \, dx \right )[$]
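For a discrete family the integral defining $A(\eta)$ reduces to a sum. A minimal sketch for a Bernoulli variable in canonical form ($h(x) = 1$, $T(x) = x$, $x \in \{0,1\}$), where the closed form is $A(\eta) = \log(1 + e^{\eta})$:

```python
import math

def log_partition(eta):
    """A(eta) = log of the normalization factor; the 'integral' is a sum
    over the Bernoulli support {0, 1} with h(x) = 1 and T(x) = x."""
    return math.log(sum(math.exp(eta * x) for x in (0, 1)))

# Subtracting A(eta) inside the exponent normalizes the pmf.
eta = 0.3
total = sum(math.exp(eta * x - log_partition(eta)) for x in (0, 1))
```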

### Conjugate Priors for Exponential Family

A conjugate prior $\pi$ for the parameter $\boldsymbol\eta$ of an exponential family

[$] f(x|\boldsymbol\eta) = h(x) \exp \left ( {\boldsymbol\eta}^{\rm T}\mathbf{T}(x) -A(\boldsymbol\eta)\right )[$]

is given by

[$]p_\pi(\boldsymbol\eta\mid\boldsymbol\chi,\nu) = b(\boldsymbol\chi,\nu) \exp \left (\boldsymbol\eta \cdot \boldsymbol\chi - \nu A(\boldsymbol\eta) \right ),[$]

or equivalently

[$]p_\pi(\boldsymbol\eta\mid\boldsymbol\chi,\nu) = b(\boldsymbol\chi,\nu) g(\boldsymbol\eta)^\nu \exp \left (\boldsymbol\eta^{\rm T} \boldsymbol\chi \right ), \qquad \boldsymbol\chi \in \mathbb{R}^s[$]

where

• $s$ is the dimension of $\boldsymbol\eta$; $\nu \gt 0$ and $\boldsymbol\chi$ are hyperparameters (parameters controlling parameters)
• $\nu$ corresponds to the effective number of observations that the prior distribution contributes
• $\boldsymbol\chi$ corresponds to the total amount that these pseudo-observations contribute to the sufficient statistic over all observations and pseudo-observations
• $b(\boldsymbol\chi,\nu)$ is a normalization constant that is automatically determined by the remaining functions and serves to ensure that the given function is a probability density function (i.e. it is normalized)
• $A(\boldsymbol\eta)$ and equivalently $g(\boldsymbol\eta)$ are the same functions as in the definition of the distribution over which $\pi$ is the conjugate prior.

The posterior distribution after observing $n$ observations has the same form as the prior with updated parameters

[]\begin{align*} \boldsymbol\chi' &= \boldsymbol\chi + \mathbf{T}(\mathbf{X}) \\ &= \boldsymbol\chi + \sum_{i=1}^n \mathbf{T}(x_i) \\ \nu' &= \nu + n. \end{align*} []

Note in particular that the data $\mathbf{X}$ enters into this equation only in the expression

[$]\mathbf{T}(\mathbf{X}) = \sum_{i=1}^n \mathbf{T}(x_i),[$]

which is termed the sufficient statistic of the data. That is, the value of the sufficient statistic is sufficient to completely determine the posterior distribution. The actual data points themselves are not needed, and all sets of data points with the same sufficient statistic will have the same distribution. This is important because the dimension of the sufficient statistic does not grow with the data size — it has only as many components as the components of $\boldsymbol\eta$ (equivalently, the number of parameters of the distribution of a single data point).
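The update equations above can be sketched in a few lines; the helper name below is hypothetical:

```python
# Conjugate-prior update for an exponential family: the posterior
# hyperparameters (chi', nu') depend on the data only through the
# sufficient statistic sum of T(x_i) and the sample size n.
def update_hyperparameters(chi, nu, T, xs):
    return chi + sum(T(x) for x in xs), nu + len(xs)

# Two data sets with the same sufficient statistic yield the same posterior:
post1 = update_hyperparameters(1.0, 2.0, lambda x: x, [3.0, 4.0])
post2 = update_hyperparameters(1.0, 2.0, lambda x: x, [2.0, 5.0])
```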

Note also that because of the way the sufficient statistic is computed, it necessarily involves sums of components of the data (in some cases disguised as products or other forms; a product can be written in terms of a sum of logarithms). The cases where the update equations for particular distributions do not exactly match the above forms are cases where the conjugate prior has been expressed using a different parameterization than the one that produces a conjugate prior of the above form, often specifically because the above form is defined over the natural parameter $\boldsymbol\eta$ while conjugate priors are usually defined over the actual parameter $\boldsymbol\theta .$

### Beta-Binomial Model

The family of binomial distributions is a subfamily of the exponential family:

[] \begin{align*} p(k \mid \eta) &= \binom{m}{k} q^k (1-q)^{m-k} \\ &= \binom{m}{k} e^{k\eta - A(\eta)}\,,\quad \eta = \log(q/(1-q))\,,\, A(\eta) = -m\log(1-q) \,. \end{align*} []

From the discussion above, the conjugate family (written for a single trial, i.e. $m = 1$, so that $A(\eta) = -\log(1-q)$) is

[$] p_\pi(\eta \mid \chi,\nu) = b(\chi,\nu) \exp \left (\eta \chi - \nu A(\eta) \right ). [$]

To identify the family, transform the density back to $q$ via $\eta = \eta(q) = \log(q/(1-q))$, multiplying by the Jacobian $d\eta/dq = [q(1-q)]^{-1}$:

[$] $$\label{cjfam-bb} p_\pi(\eta(q) \mid \chi, \nu)\, \frac{d\eta}{dq} \propto q^{\chi}(1-q)^{\nu - \chi} \, [q(1-q)]^{-1} = q^{\chi -1 }(1-q)^{\nu - \chi - 1}.$$ [$]

We recognize \ref{cjfam-bb} as the family of beta distributions with parameters $\alpha = \chi$ and $\beta = \nu - \chi$.

Let $N_1,\ldots,N_n$ denote the number of claims related to a portfolio of insurance contracts at the different time periods $i=1, \ldots,n$. We assume the following:

• The claim frequency $N_i$ is assumed to have a binomial distribution with size parameter $m = w_i$ (known) and success parameter $q$ (unknown).
• The claims are conditionally independent (conditional on $q$).

To compute the posterior for $q$ given the claim frequency data, we use the update equations (see the discussion above), treating each binomial observation of size $w_i$ as $w_i$ Bernoulli trials, and conclude that the posterior is a beta distribution with parameters

[] \begin{align*} \alpha^{\prime} = \chi^{\prime} &= \chi + \sum_{i=1}^n N_i =\alpha + \sum_{i=1}^n N_i \\ \beta^{\prime} = \nu^{\prime} - \chi ^{\prime} &= \nu + \sum_{i=1}^n w_i - \chi^{\prime} = \beta + w - \sum_{i=1}^n N_i \end{align*} []

with $w = \sum_{i=1}^n w_i$. If we interpret the $w_i$ as exposure units, the Bayesian credibility estimator for the claim frequency over $m$ exposure units equals

[$] \operatorname{E}[mq \mid N_1,\ldots,N_n] = m\, \frac{\alpha^{\prime}}{\alpha^{\prime} + \beta^{\prime}} = \frac{\alpha^{\prime} m}{w + \alpha + \beta}. [$]
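A numerical sketch of the beta-binomial update with made-up exposures and claim counts, checked against a brute-force grid posterior:

```python
import numpy as np

# Hypothetical portfolio: N_i | q ~ Binomial(w_i, q) with known exposures
# w_i, and a Beta(alpha, beta) prior on q (all numbers made up).
alpha, beta = 2.0, 8.0
w = np.array([10, 12, 9])
N = np.array([1, 3, 0])

# Conjugate update: alpha' = alpha + sum N_i, beta' = beta + sum w_i - sum N_i.
a_post = alpha + N.sum()
b_post = beta + w.sum() - N.sum()
estimate = a_post / (a_post + b_post)     # posterior mean of q (m = 1)

# Brute-force check: posterior mean of q from Bayes' theorem on a grid.
q = np.linspace(1e-6, 1 - 1e-6, 200001)
dq = q[1] - q[0]
post = q ** (alpha - 1) * (1 - q) ** (beta - 1)        # beta prior kernel
for wi, ni in zip(w, N):
    post *= q ** ni * (1 - q) ** (wi - ni)             # binomial likelihoods
post /= post.sum() * dq
grid_mean = (q * post).sum() * dq
```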

### Poisson-Gamma Model

Proceed as in the previous example, but with the following different assumption: the claim frequencies $N_i$ have a Poisson distribution with mean $\lambda w_i$ (with only $\lambda$ unknown). If we set

[$] \boldsymbol{N} = [N_1,\ldots,N_n]\,,\quad \boldsymbol{k} = [k_1,\ldots,k_n]\,, [$]

then

[] \begin{align*} \operatorname{P}(\boldsymbol{N} = \boldsymbol{k} \mid \eta) = p(\boldsymbol{k} \mid \eta) &=\prod_{i=1}^n \frac{(\lambda w_i)^{k_i} e^{-\lambda w_i}}{k_i!} \\ &\propto \exp\left(\eta \cdot T(\boldsymbol{k}) - A(\eta) \right) \quad (\text{change of variable}\,\, \eta = \log(\lambda)) \end{align*} []

with

[$] A(\eta) = we^{\eta},\,w=\sum_{i=1}^nw_i,\, T(\boldsymbol{k}) = \sum_{i=1}^n k_i \,. [$]

The conjugate prior family is then given by

[] \begin{align*} p_\pi(\eta \mid \chi,\nu) &= b(\chi,\nu) \exp \left(\eta \chi - \nu A(\eta) \right) \\ & \propto \exp \left(\eta \chi - \nu w e^{\eta} \right) \\ & \propto \widetilde{\eta}^{\chi -1} e^{-\nu w \widetilde{\eta}} \quad (\text{change of variable}\,\, \widetilde{\eta} = e^{\eta},\ \text{with Jacobian}\,\, d\eta/d\widetilde{\eta} = \widetilde{\eta}^{\,-1}). \end{align*} []

Hence the conjugate prior family is the family of gamma distributions with shape parameter $\alpha = \chi$ and scale parameter $\beta = (\nu w)^{-1}$ with corresponding gamma posterior with parameters (use the update equations derived in #Conjugate Priors for Exponential Family with the single data point $\boldsymbol{N}$):

[] \begin{align*} \alpha^{\prime} &= \chi^{\prime} = \chi + T(\boldsymbol{N}) = \alpha + \sum_{i=1}^n N_i \\ \beta^{\prime} &= (\nu^{\prime} w)^{-1} = [(\nu + 1)w]^{-1} = (w + \beta^{-1})^{-1}. \end{align*} []

If we interpret the $w_i$ as exposure units, the Bayesian credibility estimator for the claim frequency per unit of exposure equals

[$] \operatorname{E}[\lambda | N_1,\ldots,N_n] = \alpha^{\prime}\beta^{\prime}. [$]
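A numerical sketch of the gamma update with made-up numbers (the variable names are hypothetical):

```python
import numpy as np

# Hypothetical data: N_i | lam ~ Poisson(lam * w_i) with known exposures
# w_i, and a Gamma(shape alpha, scale beta) prior on lam.
alpha, beta = 3.0, 0.01
w = np.array([100.0, 150.0, 120.0])
N = np.array([2, 1, 3])

# Conjugate update derived above: alpha' = alpha + sum N_i,
# beta' = (w + 1/beta)^{-1} with w = sum of the w_i.
a_post = alpha + N.sum()
b_post = 1.0 / (w.sum() + 1.0 / beta)

# Bayesian credibility estimator: E[lam | N] = alpha' * beta'.
estimate = a_post * b_post
```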

## Wikipedia References

• Wikipedia contributors. "Bayesian inference". Wikipedia, The Free Encyclopedia. Retrieved 21 June 2019.
• Wikipedia contributors. "Exponential family". Wikipedia, The Free Encyclopedia. Retrieved 21 June 2019.
• Wikipedia contributors. "Likelihood function". Wikipedia, The Free Encyclopedia. Retrieved 21 June 2019.