Probability Distributions

A probability distribution assigns a probability to each measurable subset of the possible outcomes of a random experiment, survey, or procedure of statistical inference. Examples are found in experiments whose sample space is non-numerical, where the distribution would be a categorical distribution; experiments whose sample space is encoded by discrete random variables, where the distribution can be specified by a probability mass function; and experiments with sample spaces encoded by continuous random variables, where the distribution can be specified by a probability density function.

A univariate distribution gives the probabilities of a single random variable taking on various alternative values. Important and commonly encountered univariate probability distributions include the binomial distribution, the hypergeometric distribution, and the normal distribution.

Introduction

To define probability distributions for the simplest cases, one needs to distinguish between discrete and continuous random variables. In the discrete case, one can easily assign a probability to each possible value: for example, when throwing a fair die, each of the six values 1 to 6 has the probability 1/6. In contrast, when a random variable takes values from a continuum then, typically, probabilities can be nonzero only if they refer to intervals: for example, in quality control one might demand that the probability of a "500 g" package containing between 490 g and 510 g should be no less than 98%.
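The two cases above can be sketched in a few lines of Python. The die part follows the text directly. The package part is only an illustration: the normal model and its parameters (mean 500 g, standard deviation 4 g) are assumptions made here, not values given in the text.

```python
from fractions import Fraction
from math import erf, sqrt

# Discrete case: a fair six-sided die assigns probability 1/6 to each face.
die_pmf = {face: Fraction(1, 6) for face in range(1, 7)}
total = sum(die_pmf.values())  # exact arithmetic


def normal_cdf(x, mu, sigma):
    """CDF of a normal distribution, written with the error function."""
    return 0.5 * (1.0 + erf((x - mu) / (sigma * sqrt(2.0))))


# Continuous case: ASSUMED model, package weight ~ Normal(mean 500 g, sd 4 g).
mu, sigma = 500.0, 4.0
p_in_spec = normal_cdf(510.0, mu, sigma) - normal_cdf(490.0, mu, sigma)

print("total probability of the die:", total)
print("P(490 g <= weight <= 510 g):", round(p_in_spec, 4))
print("meets the 98% requirement:", p_in_spec >= 0.98)
```

Under this assumed model the interval spans 2.5 standard deviations on each side of the mean, so the requirement is met; a larger standard deviation would eventually violate it.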

If the random variable is real-valued (or more generally, if a total order is defined for its possible values), the cumulative distribution function (CDF) gives the probability that the random variable is no larger than a given value; in the real-valued case, the CDF is the integral of the probability density function (pdf) provided that this function exists.

Cumulative distribution function

Because a probability distribution P on the real line is determined by the probability of a scalar random variable [math]X[/math] being in a half-open interval (-∞, [math]x[/math]], the probability distribution is completely characterized by its cumulative distribution function:

[[math]] F(x) = \operatorname{P} \left[ X \le x \right] \qquad \text{ for all } x \in \mathbb{R}.[[/math]]

Discrete probability distribution

A discrete probability distribution is a probability distribution characterized by a probability mass function. Thus, the distribution of a random variable [math]X[/math] is discrete, and [math]X[/math] is called a discrete random variable, if

[[math]]\sum_u \operatorname{P}(X=u) = 1[[/math]]

as [math]u[/math] runs through the set of all possible values of [math]X[/math]. Hence, such a random variable can assume only a finite or countably infinite number of values; that is, the random variable is a discrete variable. For the number of potential values to be countably infinite, even though their probabilities sum to 1, the probabilities have to decline to zero fast enough. For example, if [math]\operatorname{P}(X=n) = \tfrac{1}{2^n}[/math] for [math]n = 1, 2, \ldots,[/math] we have the sum of probabilities 1/2 + 1/4 + 1/8 + ... = 1.
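As a quick check of the example with [math]\operatorname{P}(X=n) = \tfrac{1}{2^n}[/math], the following sketch computes partial sums in exact rational arithmetic. The function name `partial_sum` is chosen here for illustration only.

```python
from fractions import Fraction


def partial_sum(n_terms):
    """Sum of P(X = n) = 1/2**n for n = 1, ..., n_terms, in exact arithmetic."""
    return sum(Fraction(1, 2**n) for n in range(1, n_terms + 1))


for n_terms in (1, 2, 3, 10, 50):
    gap = 1 - partial_sum(n_terms)  # probability mass not yet accounted for
    print(n_terms, float(partial_sum(n_terms)), gap)
```

The remaining gap after [math]N[/math] terms is exactly [math]1/2^N[/math], which shrinks to zero, so the probabilities sum to 1.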

Well-known discrete probability distributions used in statistical modeling include the Poisson distribution, the Bernoulli distribution, the binomial distribution, the geometric distribution, and the negative binomial distribution. Additionally, the discrete uniform distribution is commonly used in computer programs that make equal-probability random selections between a number of choices.

Probability Mass Function

Suppose that [math]X[/math] is a discrete random variable defined on a sample space [math]S[/math]. Then the probability mass function [math]f_X[/math] is defined as^[1]^[2]

[[math]]f_X(x) = \operatorname{P}(X = x) = \operatorname{P}(\{s \in S: X(s) = x\}).[[/math]]

Thinking of probability as mass helps to avoid mistakes, since physical mass is conserved, as is the total probability over all hypothetical outcomes [math]x[/math]:

[[math]]\sum_{x\in X(S)} f_X(x) = 1.[[/math]]

When there is a natural order among the outcomes [math]x[/math], it may be convenient to assign numerical values to them (or n-tuples in the case of a discrete multivariate random variable) and to consider also values not in the image of [math]X[/math]. That is, [math]f_X[/math] may be defined for all real numbers, with [math]f_X(x) = 0[/math] for all [math]x[/math] not in [math]X(S)[/math].

Since the image of [math]X[/math] is countable, the probability mass function [math]f_X(x)[/math] is zero for all but a countable number of values of [math]x[/math]. The discontinuity of probability mass functions is related to the fact that the cumulative distribution function of a discrete random variable is also discontinuous. Where the cumulative distribution function is differentiable, its derivative is zero, just as the probability mass function is zero at all such points.
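The definition [math]f_X(x) = \operatorname{P}(\{s \in S: X(s) = x\})[/math] can be illustrated with a small sketch. The example (the sum of two fair dice) and all names in the code are assumptions introduced here for illustration.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Sample space S: all ordered pairs of faces of two fair dice (36 outcomes).
S = list(product(range(1, 7), repeat=2))


def X(s):
    """Random variable: the sum of the two faces."""
    return s[0] + s[1]


# f_X(x) = P({s in S : X(s) = x}); every outcome s has probability 1/36.
counts = Counter(X(s) for s in S)
pmf = {x: Fraction(c, len(S)) for x, c in counts.items()}


def f_X(x):
    """PMF extended to all real numbers: zero outside the image X(S)."""
    return pmf.get(x, Fraction(0))


print(f_X(7), f_X(2), f_X(3.5), sum(pmf.values()))
```

Extending `f_X` by zero outside the image of [math]X[/math] mirrors the convention described above for values that [math]X[/math] cannot take.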

Cumulative Distribution

Equivalently to the above, a discrete random variable can be defined as a random variable whose cumulative distribution function (cdf) increases only by jump discontinuities—that is, its cdf increases only where it "jumps" to a higher value, and is constant between those jumps. The points where jumps occur are precisely the values which the random variable may take.
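A minimal sketch of this step behaviour, using the fair die as an assumed example: the CDF is built by summing the probability mass at or below a point, so it is constant between the faces and jumps by 1/6 at each face.

```python
from fractions import Fraction

# PMF of a fair six-sided die (illustrative example).
pmf = {face: Fraction(1, 6) for face in range(1, 7)}


def cdf(x):
    """F(x) = P(X <= x): total mass of all values not exceeding x."""
    return sum(p for value, p in pmf.items() if value <= x)


for x in (0.5, 1, 1.5, 3, 3.9, 4, 6, 10):
    print(x, cdf(x))
```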

Indicator-function representation

For a discrete random variable [math]X[/math], let [math]u_0, u_1, \ldots [/math] be the values it can take with non-zero probability. Denote

[[math]]\Omega_i=X^{-1}(u_i)= \{\omega: X(\omega)=u_i\},\, i=0, 1, 2, \dots[[/math]]

These are disjoint sets, and since the probabilities [math]\operatorname{P}(X=u_i)[/math] of a discrete random variable sum to 1,

[[math]]\operatorname{P}\left(\bigcup_i \Omega_i\right)=\sum_i \operatorname{P}(\Omega_i)=\sum_i\operatorname{P}(X=u_i)=1.[[/math]]

It follows that the probability that [math]X[/math] takes any value except for [math]u_0, u_1, \ldots [/math] is zero, and thus one can write [math]X[/math] as

[[math]]X=\sum_i u_i 1_{\Omega_i}[[/math]]

except on a set of probability zero, where [math]1_A[/math] is the indicator function of [math]A[/math]. This may serve as an alternative definition of discrete random variables.
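The representation [math]X=\sum_i u_i 1_{\Omega_i}[/math] can be checked on a small finite example. The sketch below assumes three coin tosses with [math]X[/math] counting heads; the sample space and all helper names are chosen here for illustration.

```python
from itertools import product

# Sample space: all sequences of three coin tosses (8 outcomes).
omega_space = list(product("HT", repeat=3))


def X(omega):
    """Random variable: number of heads in the sequence."""
    return omega.count("H")


# Values u_i taken by X and their preimages Omega_i = {omega : X(omega) = u_i}.
values = sorted({X(omega) for omega in omega_space})
preimages = {u: {omega for omega in omega_space if X(omega) == u} for u in values}


def indicator(A, omega):
    """Indicator function 1_A evaluated at omega."""
    return 1 if omega in A else 0


def X_rebuilt(omega):
    """X written as the sum of u_i * 1_{Omega_i}."""
    return sum(u * indicator(preimages[u], omega) for u in values)


print(all(X(omega) == X_rebuilt(omega) for omega in omega_space))
```

Because the preimages are disjoint and cover the sample space, exactly one indicator is nonzero for each outcome, which is why the sum recovers [math]X[/math].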

Continuous probability distribution

A continuous probability distribution is a probability distribution whose cumulative distribution function is continuous. Most often such a distribution is specified by a probability density function. Mathematicians call distributions with probability density functions absolutely continuous, since their cumulative distribution function is absolutely continuous with respect to the Lebesgue measure. If the distribution of [math]X[/math] is continuous, then [math]X[/math] is called a continuous random variable.

Intuitively, a continuous random variable is one that can take a continuous range of values, as opposed to a discrete random variable, for which the set of possible values is at most countable. While for a discrete distribution an event with probability zero is impossible (e.g., rolling 3.5 on a standard die is impossible, and has probability zero), this is not so in the case of a continuous random variable.

Formally, if [math]X[/math] is an (absolutely) continuous random variable, then it has a probability density function [math]f_X[/math], and therefore its probability of falling into a given interval, say [math][a,b][/math], is given by the integral

[[math]] \operatorname{P}[a\le X\le b] = \int_a^b f_X(x) \, dx. [[/math]]

In particular, the probability for [math]X[/math] to take any single value is zero, because an integral with coinciding upper and lower limits is always equal to zero.
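The integral formula can be tried numerically. The sketch below assumes an exponential density with rate 1 (an example chosen here, not taken from the text) and a simple midpoint rule; for this density the exact interval probability is [math]e^{-a} - e^{-b}[/math], which serves as a check.

```python
from math import exp


def density(x, rate=1.0):
    """ASSUMED example: exponential density, zero for negative x."""
    return rate * exp(-rate * x) if x >= 0 else 0.0


def integrate(f, a, b, n=10_000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    h = (b - a) / n
    return h * sum(f(a + (k + 0.5) * h) for k in range(n))


p = integrate(density, 0.5, 2.0)      # P[0.5 <= X <= 2]
exact = exp(-0.5) - exp(-2.0)         # closed form for this density
point = integrate(density, 1.0, 1.0)  # coinciding limits: a single value

print(p, exact, point)
```

With coinciding limits the step width is zero, so the approximation returns zero, matching the statement that a single value has probability zero.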

This definition requires a continuous probability distribution to possess a density or, equivalently, its cumulative distribution function to be absolutely continuous. This requirement is stronger than simple continuity of the cumulative distribution function, and there is a special class of distributions, singular distributions, which are neither absolutely continuous nor discrete nor a mixture of those. An example is given by the Cantor distribution. Such singular distributions, however, are rarely encountered in practice.

Note on terminology: some authors use the term "continuous distribution" to denote the distribution with continuous cumulative distribution function. Thus, their definition includes both the (absolutely) continuous and singular distributions.

By one convention, a probability distribution [math]\,\mu[/math] is called continuous if its cumulative distribution function [math]F(x)=\mu(-\infty,x][/math] is continuous and, therefore, the probability measure of singletons [math]\mu\{x\}\,=\,0[/math] for all [math]\,x[/math].

Another convention reserves the term continuous probability distribution for absolutely continuous distributions. These distributions can be characterized by a probability density function: a non-negative Lebesgue integrable function [math]\,f[/math] defined on the real numbers such that

[[math]] F(x) = \mu(-\infty,x] = \int_{-\infty}^x f(t)\,dt. [[/math]]

Discrete distributions and some continuous distributions (like the Cantor distribution) do not admit such a density.

Further details

Not every probability distribution has a density function: the distributions of discrete random variables do not; nor does the Cantor distribution, even though it has no discrete component, i.e., does not assign positive probability to any individual point.

A distribution has a density function if and only if its cumulative distribution function [math]F(x)[/math] is absolutely continuous. In this case, [math]F[/math] is almost everywhere differentiable, and its derivative can be used as probability density:

[[math]] \frac{d}{dx}F(x) = f(x). [[/math]]
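The relation between [math]F[/math] and [math]f[/math] can be checked numerically with a central difference. The exponential distribution with rate 1 is used purely as an assumed example, and the helper names are illustrative.

```python
from math import exp


def F(x, rate=1.0):
    """CDF of an exponential distribution (assumed example)."""
    return 1.0 - exp(-rate * x) if x >= 0 else 0.0


def f(x, rate=1.0):
    """Density of the same exponential distribution."""
    return rate * exp(-rate * x) if x >= 0 else 0.0


def central_difference(func, x, h=1e-5):
    """Numerical derivative (func(x + h) - func(x - h)) / (2h)."""
    return (func(x + h) - func(x - h)) / (2.0 * h)


x = 1.3
print(central_difference(F, x), f(x))
```

The two numbers agree closely at points where [math]F[/math] is differentiable; at [math]x = 0[/math], where this CDF has a kink, the derivative does not exist, which is consistent with differentiability holding only almost everywhere.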

If a probability distribution admits a density, then the probability of every one-point set {a} is zero; the same holds for finite and countable sets.

Two probability densities [math]f[/math] and [math]g[/math] represent the same probability distribution precisely if they differ only on a set of Lebesgue measure zero.

Families of densities

It is common for probability density functions (and probability mass functions) to be parametrized—that is, to be characterized by unspecified parameters. For example, the normal distribution is parametrized in terms of the mean and the variance, denoted by [math]\mu[/math] and [math]\sigma^2[/math] respectively, giving the family of densities

[[math]] f(x;\mu,\sigma^2) = \frac{1}{\sigma\sqrt{2\pi}} e^{ -\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2 }. [[/math]]

It is important to keep in mind the difference between the domain of a family of densities and the parameters of the family. Different values of the parameters describe different distributions of different random variables on the same sample space (the same set of all possible values of the variable); this sample space is the domain of the family of random variables that this family of distributions describes. A given set of parameters describes a single distribution within the family sharing the functional form of the density. From the perspective of a given distribution, the parameters are constants, and terms in a density function that contain only parameters, but not variables, are part of the normalization factor of a distribution (the multiplicative factor that ensures that the area under the density, which is the probability of something in the domain occurring, equals 1). This normalization factor is outside the kernel of the distribution.
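To illustrate the role of the normalization factor, the sketch below evaluates the normal family for a few parameter pairs and checks numerically that each member integrates to 1. The parameter values, the integration range of ten standard deviations on each side, and the function names are assumptions made for this example.

```python
from math import exp, pi, sqrt


def normal_pdf(x, mu, sigma2):
    """Density of the normal family, parametrized by mean and variance."""
    sigma = sqrt(sigma2)
    kernel = exp(-0.5 * ((x - mu) / sigma) ** 2)  # depends on the variable x
    normalization = 1.0 / (sigma * sqrt(2.0 * pi))  # depends on parameters only
    return normalization * kernel


def area(mu, sigma2, n=20_000, width=10.0):
    """Midpoint-rule area under the density over mu +/- width * sigma."""
    sigma = sqrt(sigma2)
    a, b = mu - width * sigma, mu + width * sigma
    h = (b - a) / n
    return h * sum(normal_pdf(a + (k + 0.5) * h, mu, sigma2) for k in range(n))


for mu, sigma2 in [(0.0, 1.0), (3.0, 0.25), (-2.0, 9.0)]:
    print(mu, sigma2, area(mu, sigma2))
```

Each parameter pair selects one distribution from the family; the kernel changes shape with the parameters, and the normalization factor rescales it so that the total area stays equal to 1.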

Expected Value

In probability theory, the expected value of a random variable, intuitively, is the long-run average value of repetitions of the experiment it represents. For example, within an insurance context, the expected loss can be thought of as the average loss incurred by an insurer on a very large portfolio of policies sharing a common loss distribution (similar risk profile). Less roughly, the law of large numbers states that the arithmetic mean of the values almost surely converges to the expected value as the number of repetitions approaches infinity.
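The long-run-average interpretation can be sketched with a small simulation. The fair die, the seed, and the sample sizes are assumptions chosen here; the expected value of a fair die is 3.5.

```python
import random


def sample_mean_of_die_rolls(n_rolls, seed=0):
    """Arithmetic mean of n_rolls simulated rolls of a fair six-sided die."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    total = 0
    for _ in range(n_rolls):
        total += rng.randint(1, 6)
    return total / n_rolls


for n_rolls in (10, 1_000, 100_000):
    print(n_rolls, sample_mean_of_die_rolls(n_rolls))
```

As the number of rolls grows, the sample mean tends to settle near 3.5, which is what the law of large numbers describes; any single run still shows some random fluctuation.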

The expected value does not exist for random variables having some distributions with heavy tails, such as the Cauchy distribution.^[3] For random variables such as these, the heavy tails of the distribution prevent the sum or integral from converging. That being said, most loss models encountered in insurance implicitly assume finite expected losses.

The expected value is also known as the expectation, mathematical expectation, EV, average, mean value, mean, or first moment.

Univariate discrete random variable

Let [math]X[/math] be a discrete random variable taking values [math]x_1,x_2,\ldots[/math] with probabilities [math]p_1,p_2,\ldots[/math] respectively. Then the expected value of this random variable is the infinite sum

[[math]] \operatorname{E}[X] = \sum_{i=1}^\infty x_i\, p_i,[[/math]]

provided that this series converges absolutely (that is, the sum must remain finite if we were to replace all [math]x_i[/math]s with their absolute values). If this series does not converge absolutely, we say that the expected value of [math]X[/math] does not exist.
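A minimal sketch of the sum, assuming two examples chosen here: a fair die, and the distribution with [math]p_n = 1/2^n[/math] on [math]n = 1, 2, \ldots[/math], whose expectation is approached through partial sums. The helper name `expected_value` is illustrative.

```python
from fractions import Fraction


def expected_value(values, probabilities):
    """Sum of x_i * p_i over the listed values (exact with Fractions)."""
    return sum(x * p for x, p in zip(values, probabilities))


# Fair die: six values, each with probability 1/6.
die_mean = expected_value(range(1, 7), [Fraction(1, 6)] * 6)

# P(X = n) = 1/2**n: partial sum of the series over n = 1, ..., 60.
ns = range(1, 61)
geometric_partial = expected_value(ns, [Fraction(1, 2**n) for n in ns])

print(die_mean, float(geometric_partial))
```

The partial sums for the second example increase toward 2; since all terms are non-negative, convergence and absolute convergence coincide here.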

Univariate continuous random variable

If the probability distribution of [math]X[/math] admits a probability density function [math]f(x)[/math], then the expected value can be computed as

[[math]] \operatorname{E}[X] = \int_{-\infty}^\infty x f(x)\, \mathrm{d}x . [[/math]]
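The integral can be approximated numerically. The sketch assumes an exponential density with rate 2, for which the expected value is 1/2, and truncates the upper limit at 40, where the remaining tail is negligible; both choices are made here for illustration.

```python
from math import exp


def exp_pdf(x, rate):
    """ASSUMED example density: exponential with the given rate."""
    return rate * exp(-rate * x) if x >= 0 else 0.0


def expectation(pdf, lower, upper, n=200_000):
    """Midpoint-rule approximation of the integral of x * pdf(x)."""
    h = (upper - lower) / n
    total = 0.0
    for k in range(n):
        x = lower + (k + 0.5) * h
        total += x * pdf(x)
    return h * total


rate = 2.0
approx = expectation(lambda x: exp_pdf(x, rate), 0.0, 40.0)
print(approx, 1.0 / rate)
```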

Properties

The expected value of a constant is equal to the constant itself; i.e., if [math]c[/math] is a constant, then [math]\operatorname{E}[c]=c[/math].

If [math]X[/math] and [math]Y[/math] are random variables such that [math]X \le Y[/math] almost surely, then [math]\operatorname{E}[X] \le \operatorname{E}[Y][/math].

The expected value operator (or expectation operator) [math]\operatorname{E}[\cdot][/math] is linear in the sense that

[[math]]\begin{align*} \operatorname{E}[X + c] &= \operatorname{E}[X] + c \\ \operatorname{E}[X + Y] &= \operatorname{E}[X] + \operatorname{E}[Y] \\ \operatorname{E}[aX] &= a \operatorname{E}[X] \end{align*}[[/math]]

Combining the results from the previous three equations, we can see that

[[math]]\operatorname{E}[a X + b Y + c] = a \operatorname{E}[X] + b \operatorname{E}[Y] + c\,[[/math]]

for any two random variables [math]X[/math] and [math]Y[/math] and any real numbers [math]a[/math], [math]b[/math] and [math]c[/math].
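Linearity can be verified exactly on a small example. The sketch below assumes two fair dice, with [math]X[/math] the first face and [math]Y[/math] the larger of the two faces, so that [math]X[/math] and [math]Y[/math] are dependent; the constants are arbitrary choices made here.

```python
from fractions import Fraction
from itertools import product

# Sample space: ordered pairs of faces of two fair dice, each with mass 1/36.
outcomes = list(product(range(1, 7), repeat=2))
p = Fraction(1, len(outcomes))


def E(g):
    """Expected value of the random variable g by exact enumeration."""
    return sum(g(s) * p for s in outcomes)


def X(s):
    return s[0]  # first face


def Y(s):
    return max(s)  # larger face, dependent on X


a, b, c = 3, -2, 5
lhs = E(lambda s: a * X(s) + b * Y(s) + c)
rhs = a * E(X) + b * E(Y) + c
print(lhs, rhs, lhs == rhs)
```

The equality holds even though [math]X[/math] and [math]Y[/math] are not independent, which is the point of the linearity property.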

Layer Cake Representation

When a continuous random variable [math]X[/math] takes only non-negative values, we can use the following formula for computing its expectation (even when the expectation is infinite):

[[math]] \operatorname{E}[X]=\int_0^\infty \operatorname{P}(X \ge x)\; \mathrm{d}x[[/math]]

Similarly, when a random variable takes only values in {0, 1, 2, 3, ...} we can use the following formula for computing its expectation:

[[math]] \operatorname{E}[X]=\sum\limits_{i=1}^\infty \operatorname{P}(X\geq i).[[/math]]
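The discrete version of the formula can be checked exactly. The sketch uses a fair die as an assumed example and compares the tail sum with the direct definition of the expected value.

```python
from fractions import Fraction

# PMF of a fair six-sided die (values in {1, ..., 6}).
pmf = {face: Fraction(1, 6) for face in range(1, 7)}


def tail(i):
    """P(X >= i): total mass of all values at or above i."""
    return sum(p for value, p in pmf.items() if value >= i)


direct = sum(value * p for value, p in pmf.items())
# Tail probabilities vanish beyond the largest value, so the sum is finite.
layer_cake = sum(tail(i) for i in range(1, max(pmf) + 1))

print(direct, layer_cake)
```

Each value [math]k[/math] contributes its mass to exactly [math]k[/math] of the tail probabilities, which is why the two sums agree.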

Residual Life Distribution

Suppose [math]X[/math] is a non-negative random variable which can be thought of as representing the lifetime for some entity of interest. A family of residual life distributions can be constructed by considering the conditional distribution of [math]X[/math] given that [math]X[/math] is beyond some level [math]d[/math], i.e., the distribution of lifetime given that death (failure) has not yet occurred at time [math]d[/math]:

[[math]] \begin{align} R_d(t) &= \operatorname{P}(X \leq d + t \mid X \gt d) \\ &= \frac{S(d) - S(d+t)}{S(d)} \end{align} [[/math]]

with [math]S(t)[/math] denoting the survival function for [math]X[/math], that is, the probability that [math]X[/math] is greater than [math]t[/math] (the lifetime exceeds [math]t[/math]).

Residual life distributions are relevant for insurance policies with deductibles. Since a claim is made when the loss to the insured is beyond the deductible, the loss to the insurer given that a claim was made follows precisely the residual life distribution [math]R_d(t)[/math].
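The conditional probability [math]\operatorname{P}(X \leq d + t \mid X \gt d)[/math] can be computed directly from a survival function as [math]\operatorname{P}(d \lt X \leq d + t)/\operatorname{P}(X \gt d)[/math]. The sketch below assumes an exponential loss with rate 0.5, chosen here because its residual life distribution is known in closed form; the function names are illustrative.

```python
from math import exp


def survival_exponential(t, rate=0.5):
    """ASSUMED example: S(t) = P(X > t) for an exponential lifetime."""
    return exp(-rate * t) if t >= 0 else 1.0


def residual_cdf(S, d, t):
    """P(X <= d + t | X > d) = P(d < X <= d + t) / P(X > d)."""
    return (S(d) - S(d + t)) / S(d)


for d in (0.0, 1.0, 3.0):
    print(d, residual_cdf(survival_exponential, d, 1.0))
```

For the exponential distribution the printed values do not depend on the deductible [math]d[/math] (the memoryless property); other loss models generally give residual distributions that change with [math]d[/math].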

Mean Excess Loss Function

If [math]X[/math] represents loss to the insured with an insurance policy with a deductible [math]d[/math], then the expected loss to the insurer given that a claim was made is the mean excess loss function evaluated at [math]d[/math]:

[[math]] m(d) = \operatorname{E}[X-d \mid X \gt d] = \int_{0}^{\infty}\frac{S(t + d)}{S(d)} \,dt \,. [[/math]]

This function is also called the mean residual life function when [math]X[/math] is a general non-negative random variable. When the distribution of [math]X[/math] has a density, say [math]f(x)[/math], then the mean excess loss function equals

[[math]] m(d) = \frac{\int_{d}^{\infty} (x-d) f(x) \, dx}{S(d)} \,. [[/math]]
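The two expressions for [math]m(d)[/math] can be compared numerically. The sketch assumes a loss that is uniform on [0, 100], a model chosen here only because both integrals then have finite limits and the exact answer is [math](100 - d)/2[/math]; the names `LIMIT`, `midpoint`, and the two `mean_excess_*` functions are illustrative.

```python
LIMIT = 100.0  # ASSUMED: loss is uniform on [0, LIMIT]


def S(x):
    """Survival function of the uniform loss."""
    return min(1.0, max(0.0, 1.0 - x / LIMIT))


def f(x):
    """Density of the uniform loss."""
    return 1.0 / LIMIT if 0.0 <= x <= LIMIT else 0.0


def midpoint(g, a, b, n=10_000):
    """Midpoint-rule approximation of the integral of g over [a, b]."""
    h = (b - a) / n
    return h * sum(g(a + (k + 0.5) * h) for k in range(n))


def mean_excess_from_survival(d):
    # S(t + d) vanishes for t > LIMIT - d, so the integral stops there.
    return midpoint(lambda t: S(t + d) / S(d), 0.0, LIMIT - d)


def mean_excess_from_density(d):
    return midpoint(lambda x: (x - d) * f(x), d, LIMIT) / S(d)


d = 20.0
print(mean_excess_from_survival(d), mean_excess_from_density(d), (LIMIT - d) / 2)
```

Both routes give the same value for this model, as the two formulas describe the same conditional expectation. For unbounded loss models the upper limit would have to be truncated, which introduces an additional approximation not shown here.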

Notes

[1] Kumar, Dinesh (2006). Reliability & Six Sigma. Birkhäuser. p. 22. ISBN 978-0-387-30255-3.
[2] Rao, S.S. (1996). Engineering optimization: theory and practice. John Wiley & Sons. p. 717. ISBN 978-0-471-55034-1.
[3] Richard W Hamming (1991). "Example 8.7–1 The Cauchy distribution". The art of probability for scientists and engineers. Addison-Wesley. p. 290 ff. ISBN 0-201-40686-1. Sampling from the Cauchy distribution and averaging gets you nowhere: one sample has the same distribution as the average of 1000 samples!

References

Pierre Simon de Laplace (1812). Analytical Theory of Probability.

The first major treatise blending calculus with probability theory, originally in French: Théorie Analytique des Probabilités.

Andrei Nikolajevich Kolmogorov (1950). Foundations of the Theory of Probability.

The modern measure-theoretic foundation of probability theory; the original German version (Grundbegriffe der Wahrscheinlichkeitsrechnung) appeared in 1933.

Patrick Billingsley (1979). Probability and Measure. New York, Toronto, London: John Wiley and Sons. ISBN 0-471-00710-2.

David Stirzaker (2003). Elementary Probability. ISBN 0-521-42028-8.

Chapters 7 to 9 are about continuous variables.

B. S. Everitt: The Cambridge Dictionary of Statistics, Cambridge University Press, Cambridge (3rd edition, 2006). ISBN 0-521-69027-7
Bishop: Pattern Recognition and Machine Learning, Springer, ISBN 0-387-31073-8
den Dekker A. J., Sijbers J. (2014). "Data distributions in magnetic resonance images: a review". Physica Medica.
Wikipedia contributors. "Probability density function". Wikipedia. Wikipedia. Retrieved 28 January 2022.
Wikipedia contributors. "Probability distribution". Wikipedia. Wikipedia. Retrieved 28 January 2022.
