# Maximum Likelihood Estimation

In general, for a fixed set of data and underlying statistical model, the method of maximum likelihood selects the set of values of the model parameters that maximizes the likelihood function. Intuitively, this maximizes the "agreement" of the selected model with the observed data, and for discrete random variables it indeed maximizes the probability of the observed data under the resulting distribution.

## Principles

Suppose there is a sample $X_1,\ldots,X_n$ of $n$ independent and identically distributed observations, coming from a distribution with an unknown probability density function $f_0$. It is, however, surmised that $f_0$ belongs to a certain family of distributions $\{f(\cdot \mid \theta) : \theta \in \Theta \}$ (where $\theta$ is a vector of parameters for this family), so that $f_0 = f(\cdot \mid \theta_0)$. The value $\theta_0$ is unknown and is referred to as the true value of the parameter vector. The goal is to find an estimator $\hat\theta$ that is as close to the true value $\theta_0$ as possible. Either or both of the observed variables $X_i$ and the parameter $\theta$ can be vectors.

To use the method of maximum likelihood, one first specifies the joint density function for all observations. For an independent and identically distributed sample, this joint density function is

[$] \begin{equation} f(x_1,x_2,\ldots,x_n\mid\theta) = f(x_1\mid\theta)\times f(x_2\mid\theta) \times \cdots \times f(x_n\mid\theta). \end{equation} [$]

We now view this function from a different perspective: the observed values $X_1,\ldots,X_n$ are treated as fixed "parameters" of the function, whereas $\theta$ is the function's variable and is allowed to vary freely. Viewed this way, the function is called the likelihood:

[$] \mathcal{L}(\theta\,;X_1,\ldots,X_n) = f(X_1,\ldots,X_n\mid\theta) = \prod_{i=1}^n f(X_i\mid\theta). [$]

In practice it is often more convenient to work with the logarithm of the likelihood function, called the log-likelihood:

[$] \ln\mathcal{L}(\theta\,;\,X_1,\ldots,X_n) = \sum_{i=1}^n \ln f(X_i\mid\theta), [$]

or the average log-likelihood:

[$] L_n(\theta) = \frac1n \ln\mathcal{L}(\theta\,;\,X_1,\ldots,X_n). [$]
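One practical reason for preferring the logarithm can be illustrated with a short sketch. The exponential model, the rate value 2.0, the sample size and all function names below are assumptions chosen for this example only; they do not come from the text above.

```python
import math
import random

def exp_density(x, rate):
    """Density f(x | rate) of the exponential distribution for x >= 0."""
    return rate * math.exp(-rate * x)

def likelihood(rate, sample):
    """Product of the individual densities (the likelihood)."""
    return math.prod(exp_density(x, rate) for x in sample)

def log_likelihood(rate, sample):
    """Sum of log-densities, using ln f(x | rate) = ln(rate) - rate * x."""
    return sum(math.log(rate) - rate * x for x in sample)

def avg_log_likelihood(rate, sample):
    """Average log-likelihood L_n(rate)."""
    return log_likelihood(rate, sample) / len(sample)

random.seed(0)  # fixed seed so the illustration is reproducible
sample = [random.expovariate(2.0) for _ in range(5000)]

print(likelihood(2.0, sample))          # the raw product underflows in double precision
print(log_likelihood(2.0, sample))      # finite, large negative number
print(avg_log_likelihood(2.0, sample))  # stays on the scale of a single observation
```

With a few thousand observations the product of densities is too small to represent as a floating-point number, while the sum of logarithms remains well behaved.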

The method of maximum likelihood estimates $\theta_0$, the true parameter of the distribution from which the sample is drawn, by finding a value of $\theta$ that maximizes $L_n(\theta)$. This method of estimation defines a maximum-likelihood estimator (MLE) of $\theta_0$:

[$] \begin{equation} \{ \hat\theta_\mathrm{mle}\} \subseteq \{ \underset{\theta\in\Theta}{\operatorname{arg\,max}}\ L_n(\theta\,;\,x_1,\ldots,x_n) \}, \end{equation} [$]

if a maximum exists. The MLE is the same whether we maximize the likelihood or the log-likelihood, since the logarithm is a strictly increasing function.
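As a worked illustration of these steps, consider the exponential family $f(x\mid\lambda) = \lambda e^{-\lambda x}$ for $x \geq 0$ and $\lambda \gt 0$. This model is an assumption chosen for the example, not one taken from the text above. The log-likelihood and the average log-likelihood are

[$] \ln\mathcal{L}(\lambda\,;\,X_1,\ldots,X_n) = n\ln\lambda - \lambda\sum_{i=1}^n X_i, \qquad L_n(\lambda) = \ln\lambda - \lambda\bar{X}, [$]

where $\bar{X} = \frac1n\sum_{i=1}^n X_i$ is the sample mean. Setting the derivative to zero gives

[$] \frac{\partial L_n}{\partial\lambda} = \frac{1}{\lambda} - \bar{X} = 0 \quad\Longrightarrow\quad \hat\lambda_\mathrm{mle} = \frac{1}{\bar{X}}, [$]

and since $\partial^2 L_n/\partial\lambda^2 = -1/\lambda^2 \lt 0$, this stationary point is the unique maximum (provided $\bar{X} \gt 0$).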

For many models, a maximum likelihood estimator can be found as an explicit function of the observed data. For many other models, however, no closed-form solution to the maximization problem is known or available, and an MLE has to be found numerically. For some problems, there may be multiple estimates that maximize the likelihood. For other problems, no maximum likelihood estimate exists (meaning that the log-likelihood function increases without attaining the supremum value).
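The numerical route can be sketched with a simple one-dimensional optimizer. The example below is only an illustration under stated assumptions: it uses an exponential model (which does have a closed-form MLE, the reciprocal of the sample mean) so that the numerical answer can be checked, and it uses golden-section search, one of many possible optimizers. The function names and the search interval are hypothetical.

```python
import math
import random

def avg_log_likelihood(rate, sample):
    """Average log-likelihood of an exponential model with the given rate."""
    return sum(math.log(rate) - rate * x for x in sample) / len(sample)

def golden_section_max(func, lo, hi, tol=1e-8):
    """Maximize a unimodal function on [lo, hi] by golden-section search."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    c = b - inv_phi * (b - a)
    d = a + inv_phi * (b - a)
    fc, fd = func(c), func(d)
    while b - a > tol:
        if fc > fd:
            # The maximum lies in [a, d]; the old c becomes the new d.
            b, d, fd = d, c, fc
            c = b - inv_phi * (b - a)
            fc = func(c)
        else:
            # The maximum lies in [c, b]; the old d becomes the new c.
            a, c, fc = c, d, fd
            d = a + inv_phi * (b - a)
            fd = func(d)
    return (a + b) / 2.0

random.seed(0)
sample = [random.expovariate(2.0) for _ in range(1000)]

# Assumed search interval; it only needs to contain the maximizer.
rate_numeric = golden_section_max(lambda r: avg_log_likelihood(r, sample), 1e-6, 50.0)
rate_closed_form = len(sample) / sum(sample)
print(rate_numeric, rate_closed_form)
```

Golden-section search relies on the objective being unimodal on the interval, which holds here because the exponential log-likelihood is concave in the rate. For likelihoods with several local maxima, a method like this may return only a local maximizer.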

The exposition above assumes that the data are independent and identically distributed. The method can, however, be applied in a broader setting, as long as it is possible to write the joint density function $f(x_1,\ldots,x_n \mid \theta)$ and its parameter $\theta$ has a finite dimension that does not depend on the sample size $n$.

## Properties

Maximum-likelihood estimators have no optimum properties for finite samples, in the sense that (when evaluated on finite samples) other estimators may have greater concentration around the true parameter-value. However, like other estimation methods, maximum-likelihood estimation possesses a number of attractive limiting properties:

• Consistency: the sequence of MLEs converges in probability to the value being estimated.
• Asymptotic normality: as the sample size increases, the distribution of the MLE tends to the Gaussian distribution with mean $\theta_0$ and covariance matrix equal to the inverse of the Fisher information matrix.
• Efficiency: the MLE achieves the Cramér–Rao lower bound as the sample size tends to infinity. This means that no consistent estimator has lower asymptotic mean squared error than the MLE (or other estimators attaining this bound).
• Second-order efficiency after correction for bias.

Under certain conditions, the maximum likelihood estimator is consistent. Consistency means that, given a sufficiently large number of observations $n$, it is possible to find the value of $\theta_0$ with arbitrary precision. In mathematical terms, as $n$ goes to infinity the estimator $\hat\theta$ converges in probability to the true value:

[$] \hat\theta_\mathrm{mle}\ \xrightarrow{p}\ \theta_0. [$]

Under slightly stronger conditions, the estimator converges almost surely (or strongly) to the true value:

[$] \hat\theta_\mathrm{mle}\ \xrightarrow{\text{a.s.}}\ \theta_0. [$]
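Consistency can be illustrated informally by simulation. The sketch below assumes an exponential model with true rate 2.0, for which the MLE is the reciprocal of the sample mean; the model, seed and sample sizes are choices made for this example only.

```python
import random

def exponential_mle(sample):
    """Closed-form MLE of the exponential rate: 1 / sample mean."""
    return len(sample) / sum(sample)

random.seed(1)
true_rate = 2.0  # plays the role of theta_0 in this illustration

errors = {}
for n in (10, 1000, 100000):
    sample = [random.expovariate(true_rate) for _ in range(n)]
    errors[n] = abs(exponential_mle(sample) - true_rate)
    print(n, errors[n])
```

A single run like this only suggests the behaviour: convergence in probability does not guarantee that the error shrinks monotonically from one sample size to the next, only that large errors become increasingly unlikely as $n$ grows.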

### Asymptotic normality

In a wide range of situations, maximum likelihood parameter estimates exhibit asymptotic normality: they are equal to the true parameters plus a random error that is approximately normal (given sufficient data), and the error's variance decays as $1/n$:

[$] \sqrt{n}\cdot(\hat\theta_\mathrm{mle} - \theta_0)\ \xrightarrow{D}\ N (0,\mathcal{I}(\theta_0)^{-1}) [$]

where $N$ denotes the normal distribution and $\mathcal{I}(\theta_0)$ is the Fisher information matrix of a single observation.
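A rough Monte Carlo check of this limit can be sketched as follows. The example assumes an exponential model with rate $\lambda_0 = 2$, whose per-observation Fisher information is $1/\lambda^2$, so the limiting variance of the scaled error is $\lambda_0^2 = 4$. The sample size, number of replications and seed are arbitrary choices for the illustration.

```python
import math
import random
import statistics

random.seed(2)
true_rate = 2.0
n = 400              # observations per simulated sample
replications = 1000  # number of simulated samples

scaled_errors = []
for _ in range(replications):
    total = sum(random.expovariate(true_rate) for _ in range(n))
    rate_hat = n / total  # exponential MLE: 1 / sample mean
    scaled_errors.append(math.sqrt(n) * (rate_hat - true_rate))

empirical_mean = statistics.mean(scaled_errors)
empirical_var = statistics.variance(scaled_errors)
limiting_var = true_rate ** 2  # inverse Fisher information for this model

print(empirical_mean, empirical_var, limiting_var)
```

The empirical variance should be close to, but not exactly equal to, the limiting value, both because of Monte Carlo noise and because the normal approximation holds only asymptotically.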

## Fisher Information

The Fisher information is a way of measuring the amount of information that an observable random variable $X$ carries about an unknown parameter $\theta$ of a distribution that models $X$. Formally, it is the variance of the score, or the expected value of the observed information. The Fisher information matrix is used to calculate the covariance matrices associated with maximum-likelihood estimates.

### Definition

Let $f(X;\theta)$ be the probability function for $X$, that is, the probability mass (or probability density) of the random variable $X$ conditional on the value of $\theta$; viewed as a function of $\theta$, it is also the likelihood function. The partial derivative with respect to $\theta$ of the natural logarithm of the likelihood function is called the score. Under certain regularity conditions, it can be shown that the first moment of the score (its expected value) is 0.

The second moment of the score is called the Fisher information:

[$] \mathcal{I}(\theta)=\operatorname{E} \left[\left. \left(\frac{\partial}{\partial\theta} \log f(X;\theta)\right)^2\right|\theta \right] = \int \left(\frac{\partial}{\partial\theta} \log f(x;\theta)\right)^2 f(x; \theta)\; \mathrm{d}x\,, [$]

where, for any given value of $\theta$, the expression $\operatorname{E}[\,\cdots\mid\theta]$ denotes the conditional expectation over values for $X$ with respect to the probability function $f(x;\theta)$ given $\theta$. Note that $0 \leq \mathcal{I}(\theta) \lt \infty$. A random variable carrying high Fisher information implies that the absolute value of the score is often high. The Fisher information is not a function of a particular observation, as the random variable $X$ has been averaged out. Since the expectation of the score is zero, the Fisher information is also the variance of the score.

If $\log f(x;\theta)$ is twice differentiable with respect to $\theta$, then, under certain regularity conditions, the Fisher information may also be written as

[$] \mathcal{I}(\theta) = - \operatorname{E} \left[\left. \frac{\partial^2}{\partial\theta^2} \log f(X;\theta)\right|\theta \right]\,. [$]
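The agreement between the two expressions can be checked exactly in a small case. The sketch below assumes a Bernoulli model $f(x;p) = p^x(1-p)^{1-x}$, for which the Fisher information is $1/(p(1-p))$; the model, the value $p = 0.3$ and the helper names are assumptions made for this illustration.

```python
def bernoulli_pmf(x, p):
    """Probability mass f(x; p) for x in {0, 1}."""
    return p if x == 1 else 1.0 - p

def score(x, p):
    """First derivative of log f(x; p) with respect to p."""
    return x / p - (1 - x) / (1.0 - p)

def second_derivative(x, p):
    """Second derivative of log f(x; p) with respect to p."""
    return -x / p ** 2 - (1 - x) / (1.0 - p) ** 2

def expectation(func, p):
    """Exact expectation of func(X, p) over X in {0, 1}."""
    return sum(func(x, p) * bernoulli_pmf(x, p) for x in (0, 1))

p = 0.3
mean_score = expectation(score, p)                               # first moment of the score
info_from_score = expectation(lambda x, q: score(x, q) ** 2, p)  # second moment of the score
info_from_curvature = -expectation(second_derivative, p)         # minus expected second derivative
closed_form = 1.0 / (p * (1.0 - p))

print(mean_score, info_from_score, info_from_curvature, closed_form)
```

Because the expectation here is a sum over two outcomes, the comparison is exact up to floating-point rounding rather than a simulation.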

Thus, the Fisher information is the negative of the expectation of the second derivative with respect to $\theta$ of the natural logarithm of $f$. Information may be seen as a measure of the "curvature" of the support curve (the graph of the log-likelihood) near the maximum likelihood estimate of $\theta$. A "blunt" support curve (one with a shallow maximum) has an expected second derivative of small magnitude, and thus low information, while a sharp one has an expected second derivative of large magnitude, and thus high information.

Information is additive, in that the information yielded by two independent experiments is the sum of the information from each experiment separately:

[$] \mathcal{I}_{X,Y}(\theta) = \mathcal{I}_X(\theta) + \mathcal{I}_Y(\theta). [$]

This result follows because the score of the joint observation is the sum of the individual scores and, for independent random variables, the variance of a sum is the sum of the variances. In particular, the information in a random sample of size $n$ is $n$ times that in a sample of size 1, when observations are independent and identically distributed.
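The $n$-fold scaling can be verified exactly for a simple model. The sketch below assumes $n$ independent Bernoulli($p$) observations, whose joint score depends on the data only through the number of ones $k$, which follows a Binomial($n$, $p$) distribution. The function name and the values of $n$ and $p$ are hypothetical choices for the example.

```python
import math

def sample_information(n, p):
    """Fisher information of n i.i.d. Bernoulli(p) draws, by exact enumeration.

    The joint score is k / p - (n - k) / (1 - p), where k is the number of
    ones. Its mean is zero, so its second moment equals its variance.
    """
    total = 0.0
    for k in range(n + 1):
        prob = math.comb(n, k) * p ** k * (1.0 - p) ** (n - k)
        joint_score = k / p - (n - k) / (1.0 - p)
        total += prob * joint_score ** 2
    return total

p = 0.3
single = sample_information(1, p)
for n in (1, 5, 20):
    print(n, sample_information(n, p), n * single)
```

For each $n$, the two printed values should agree up to rounding, matching the statement that a sample of size $n$ carries $n$ times the information of a single observation.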

### Reparametrization

The Fisher information depends on the parametrization of the problem. If $\theta$ and $\eta$ are two scalar parametrizations of an estimation problem, and $\theta$ is a continuously differentiable function of $\eta$, then

[$] {\mathcal I}_\eta(\eta) = {\mathcal I}_\theta(\theta(\eta)) \left( \frac{{\mathrm d} \theta}{{\mathrm d} \eta} \right)^2, [$]

where ${\mathcal I}_\eta$ and ${\mathcal I}_\theta$ are the Fisher information measures of $\eta$ and $\theta$, respectively.
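As a worked illustration (the Bernoulli model and the log-odds parametrization are assumptions chosen for this example), let $\theta \in (0,1)$ be the success probability of a Bernoulli variable, with ${\mathcal I}_\theta(\theta) = 1/(\theta(1-\theta))$, and let $\eta = \ln\frac{\theta}{1-\theta}$ be the log-odds, so that

[$] \theta(\eta) = \frac{1}{1+e^{-\eta}}, \qquad \frac{{\mathrm d}\theta}{{\mathrm d}\eta} = \theta(\eta)\,(1-\theta(\eta)). [$]

The reparametrization formula then gives

[$] {\mathcal I}_\eta(\eta) = \frac{1}{\theta(\eta)(1-\theta(\eta))}\,\bigl(\theta(\eta)(1-\theta(\eta))\bigr)^2 = \theta(\eta)\,(1-\theta(\eta)). [$]

This agrees with a direct computation: in terms of $\eta$ the log-probability is $\log f(x;\eta) = x\eta - \ln(1+e^{\eta})$, the score is $x - \theta(\eta)$, and its variance is $\theta(\eta)(1-\theta(\eta))$.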