Score Based Approaches

Countless mechanisms and processes could have produced the data, so how can one even begin to choose the best model? Two of the most commonly used criteria are (i) the Akaike information criterion and (ii) the Bayesian information criterion.

Akaike Information Criterion (AIC)

The Akaike information criterion (AIC) is an estimator of in-sample prediction error and thereby of the relative quality of statistical models for a given set of data.[1] In-sample prediction error is the expected error made when a model fitted to a training sample predicts new responses observed at the same inputs. Given a collection of models for the data, AIC estimates the quality of each model relative to each of the other models. Thus, AIC provides a means for model selection.

AIC is founded on information theory. In estimating the amount of information lost by a model, AIC deals with the trade-off between the goodness of fit of the model and the simplicity of the model. In other words, AIC deals with both the risk of overfitting and the risk of underfitting.

The Akaike information criterion is named after the Japanese statistician Hirotugu Akaike.


Let [math]d[/math] equal the number of estimated parameters in the model and let [math]\hat L[/math] be the maximum value of the likelihood function for the model. Then the AIC value of the model is the following:[2]

[[math]]\mathrm{AIC} \, = \, 2d - 2\ln(\widehat L)[[/math]]

Given a set of candidate models for the data, the preferred model is the one with the minimum AIC value. Thus, AIC rewards goodness of fit (as assessed by the likelihood function), but it also includes a penalty that is an increasing function of the number of estimated parameters. The penalty discourages overfitting, which is desired because increasing the number of parameters in the model almost always improves the goodness of the fit.
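As a minimal sketch of the selection rule (the model names, parameter counts, and maximized log-likelihoods below are hypothetical), one computes the AIC for each candidate and keeps the minimum:

```python
import math

def aic(d, log_likelihood_max):
    """AIC = 2d - 2 ln(L-hat)."""
    return 2 * d - 2 * log_likelihood_max

# Hypothetical candidates: (name, # estimated parameters, max log-likelihood)
candidates = [
    ("M1", 2, -120.3),
    ("M2", 4, -118.9),
    ("M3", 7, -118.2),
]

scores = {name: aic(d, ll) for name, d, ll in candidates}
best = min(scores, key=scores.get)  # model with minimum AIC
```

Here M3 fits best in raw likelihood, but its three extra parameters over M1 buy too little improvement, so the penalty term tips the choice toward the simpler model.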

Note that AIC tells nothing about the absolute quality of a model, only the quality relative to other models. Thus, if all the candidate models fit poorly, AIC will not give any warning of that. Hence, after selecting a model via AIC, it is usually good practice to validate the absolute quality of the model.

Kullback–Leibler divergence

Suppose a family of parametrized probability distributions [math]\operatorname{Q}(\theta)[/math] which constitutes a hypothesized model for the true probability distribution [math]\operatorname{P}[/math]. For simplicity, we will assume that all distributions admit a density function: [math]d\operatorname{Q}(\theta) = f_{\theta}(x) dx [/math] and [math]d\operatorname{P} = g(x) dx [/math]. A measure of how [math]\operatorname{Q}(\theta)[/math] differs from [math]\operatorname{P}[/math] is given by the Kullback–Leibler divergence (also called relative entropy):

[[math]]D_\text{KL}(\operatorname{P} \parallel \operatorname{Q}(\theta)) = \int \log\left(\frac{g(x)}{f_{\theta}(x)}\right)\, g(x) \, dx. [[/math]]
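The divergence can be checked numerically. The sketch below (assuming, for illustration, that both P and Q are univariate Gaussians, so a closed form is available for comparison) approximates the integral with a midpoint Riemann sum:

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def kl_numeric(mu_p, sig_p, mu_q, sig_q, lo=-20.0, hi=20.0, n=100_000):
    """Midpoint-rule estimate of D_KL(P || Q) = integral of g log(g/f) dx,
    where g is the density of P and f is the density of Q."""
    h = (hi - lo) / n
    total = 0.0
    for i in range(n):
        x = lo + (i + 0.5) * h
        g = normal_pdf(x, mu_p, sig_p)
        f = normal_pdf(x, mu_q, sig_q)
        if g > 0.0 and f > 0.0:
            total += g * math.log(g / f) * h
    return total

def kl_exact(mu_p, sig_p, mu_q, sig_q):
    """Closed form for two Gaussians, used as a sanity check."""
    return (math.log(sig_q / sig_p)
            + (sig_p ** 2 + (mu_p - mu_q) ** 2) / (2 * sig_q ** 2) - 0.5)
```

Note that the divergence is not symmetric: D_KL(P ∥ Q) generally differs from D_KL(Q ∥ P), which is why the direction (true distribution first) matters in what follows.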

Let [math]\mathcal{L}(\theta \, | x) [/math] denote the likelihood function for the distribution [math]\operatorname{Q}(\theta)[/math] and let [math]\hat{\theta}_n[/math] denote the MLE given a sample size equal to [math]n[/math]. Assuming that the model contains the true probability distribution [math]\operatorname{P}[/math], the following approximation holds as [math]n [/math] tends to infinity:

[[math]] \operatorname{E}\left[\int \log \mathcal{L}(\hat{\theta}_n \, ; x )\, g(x) \, dx\right] \approx \operatorname{E}\left[-\frac{\operatorname{AIC}}{2}\right]. [[/math]]

The approximation above shows that, for large [math]n[/math], the AIC selects the model that minimizes the expected Kullback–Leibler divergence between the true probability distribution (assuming that the true distribution belongs to that particular model) and the parametric distribution corresponding to the maximum likelihood estimator for that particular model.
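This behaviour can be illustrated with a small simulation (the sample size, seed, and the two candidate models below are all invented for illustration): data are drawn from N(0.5, 1), and AIC compares a fixed N(0, 1) model, which estimates nothing, against a model whose mean is estimated by maximum likelihood.

```python
import math
import random

random.seed(0)
n = 500
data = [random.gauss(0.5, 1.0) for _ in range(n)]

# Model A: N(0, 1) with everything fixed, so d = 0 estimated parameters.
loglik_A = sum(-0.5 * x * x - 0.5 * math.log(2 * math.pi) for x in data)

# Model B: N(mu-hat, 1) with the mean estimated by MLE, so d = 1.
mu_hat = sum(data) / n  # MLE of the mean under known unit variance
loglik_B = sum(-0.5 * (x - mu_hat) ** 2 - 0.5 * math.log(2 * math.pi) for x in data)

aic_A = 2 * 0 - 2 * loglik_A
aic_B = 2 * 1 - 2 * loglik_B
```

Since the true mean is 0.5, model B is closer (in KL divergence) to the truth than model A, and with 500 observations its likelihood gain dwarfs the one-parameter penalty, so AIC selects it.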

Bayesian Information Criterion (BIC)

The Bayesian information criterion (BIC) or Schwarz information criterion (also SIC, SBC, SBIC) is a criterion for model selection among a finite set of models; the model with the lowest BIC is preferred. It is based, in part, on the likelihood function and it is closely related to the Akaike information criterion (AIC).

When fitting models, it is possible to increase the likelihood by adding parameters, but doing so may result in overfitting. Both BIC and AIC attempt to resolve this problem by introducing a penalty term for the number of parameters in the model; the penalty term is larger in BIC than in AIC.

The BIC was developed by Gideon E. Schwarz and published in a 1978 paper,[3] where he gave a Bayesian argument for adopting it.


The BIC is formally defined as[4][a]

[[math]] \mathrm{BIC} = d\ln(n) - 2\ln(\widehat L) [[/math]]

where [math]\widehat L[/math] equals the maximized value of the likelihood function of the model; [math]n[/math] is the sample size; and [math]d[/math] is the number of parameters estimated by the model.
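A small sketch of how the two penalties compare (the helper names are mine): per extra parameter, AIC charges 2 while BIC charges ln(n), so BIC's penalty is heavier whenever n > e² ≈ 7.39.

```python
import math

def aic(d, loglik):
    """AIC = 2d - 2 ln(L-hat)."""
    return 2 * d - 2 * loglik

def bic(d, n, loglik):
    """BIC = d ln(n) - 2 ln(L-hat)."""
    return d * math.log(n) - 2 * loglik

# With the same maximized log-likelihood, only the penalties differ:
# one parameter costs 2 under AIC and ln(n) under BIC.
penalty_gap = bic(1, 100, 0.0) - aic(1, 0.0)  # ln(100) - 2
```

At n = 100 the gap is ln(100) − 2 ≈ 2.61 per parameter, so for all but the smallest samples BIC favours simpler models than AIC does.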

Approximating the Bayes Factor

Suppose we are considering the [math]k[/math] models [math]\mathcal{M}_1, \ldots, \mathcal{M}_k[/math]. If [math]p_j[/math] is the prior probability that the [math]j^{\textrm{th}} [/math] model is correct, then the posterior probability that the [math]j^{\textrm{th}} [/math] model is correct equals

[[math]] \operatorname{P}(\mathcal{M}_j | X_1, \ldots, X_n ) = \frac{p_j \int \mathcal{L}(\theta_j \, ; X_1, \ldots, X_n ) g_j(\theta_j) \, d\theta_j}{\sum_i p_i \int\mathcal{L}(\theta_i \, ; X_1, \ldots, X_n ) g_i(\theta_i) \, d\theta_i} [[/math]]

with [math]\mathcal{L}(\theta_j \, ; x ) [/math] denoting the likelihood function for the [math]j^{\textrm{th}} [/math] model. Under very restrictive conditions, we have the following approximation as [math]n [/math] tends to infinity:

[[math]] \int \mathcal{L}(\theta_j) g_j(\theta_j) \, d\theta_j \approx \exp\left(\ln(\widehat L_j) - \frac{d_j}{2}\ln(n)\right) = \exp(-\operatorname{BIC}_j / 2), [[/math]]

where [math]\widehat L_j[/math] is the maximized likelihood of the [math]j^{\textrm{th}} [/math] model.

Using the approximation above, we can approximate the Bayes factor for two competing models [math]\mathcal{M}_i [/math] and [math]\mathcal{M}_j [/math]:

[[math]] \frac{\operatorname{P}(\mathcal{M}_i | X_1,\ldots, X_n)}{\operatorname{P}(\mathcal{M}_j | X_1,\ldots, X_n)} \approx \frac{p_i}{p_j} \exp[(\operatorname{BIC}_j - \operatorname{BIC}_i)/2]. [[/math]]

In other words, given certain conditions on the models, the BIC is asymptotically equivalent to the Bayesian model comparison method for model selection.
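As a sketch of this use (the two BIC values below are hypothetical, and equal prior probabilities are assumed so that the prior ratio [math]p_i/p_j[/math] drops out):

```python
import math

def bayes_factor_approx(bic_i, bic_j):
    """Approximate posterior odds of model i over model j under equal
    priors: exp[(BIC_j - BIC_i) / 2]."""
    return math.exp((bic_j - bic_i) / 2)

# A BIC difference of 6 in favour of model i translates into
# approximate posterior odds of exp(3), roughly 20 to 1.
odds = bayes_factor_approx(210.0, 216.0)
```

Because the difference enters through an exponential, even modest BIC gaps correspond to strong posterior odds, which is why BIC differences rather than raw BIC values are the quantities usually interpreted.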


  1. Hastie, Trevor (2009). The Elements of Statistical Learning. Springer. p. 203. ISBN 978-0-387-84857-0. "The Akaike information criterion is a[n] [...] estimate of Err_in when the log-likelihood loss function is used."
  2. Burnham & Anderson (2002), §2.2.
  3. Schwarz, Gideon E. (1978). "Estimating the dimension of a model". Annals of Statistics, 6 (2): 461–464. doi:10.1214/aos/1176344136. MR 0468014.
  4. Wit, Ernst (2012). "'All models are wrong...': an introduction to model uncertainty". Statistica Neerlandica, 66 (3): 217–236. doi:10.1111/j.1467-9574.2012.00530.x.
  5. Claeskens, Gerda; Hjort, Nils Lid (2008). Model Selection and Model Averaging. Cambridge University Press.


  1. The AIC, AICc and BIC defined by Claeskens and Hjort[5] are the negatives of those defined in this article and in most other standard references.

Wikipedia References