# Probability Distributions

Given at least two random variables $X$, $Y$, ..., the joint probability distribution for $X$, $Y$, ... is a probability distribution that gives the probability that each of $X$, $Y$, ... falls in any particular range or discrete set of values specified for that variable. In the case of only two random variables, this is called a bivariate distribution, but the concept generalizes to any number of random variables, giving a multivariate distribution.

The joint probability distribution can be expressed either in terms of a joint cumulative distribution function or in terms of a joint probability density function (in the case of continuous variables) or joint probability mass function (in the case of discrete variables).

## Examples

### Coin Flips

Consider the flip of two fair coins; let $A$ and $B$ be discrete random variables associated with the outcomes of the first and second coin flips respectively. If a coin lands "heads", the associated random variable takes the value 1, and 0 otherwise. The joint probability mass function of $A$ and $B$ defines probabilities for each pair of outcomes. All possible outcomes are

[$] (A=0,B=0), (A=0,B=1), (A=1,B=0), (A=1,B=1) [$]

Since each outcome is equally likely, the joint probability mass function becomes

[$]\operatorname{P}(A,B)=1/4[$]

when $A,B\in\{0,1\}$. Since the coin flips are independent, the joint probability mass function is the product of the marginals:

[$]\operatorname{P}(A,B)=\operatorname{P}(A)\operatorname{P}(B).[$]

In general, each coin flip is a Bernoulli trial, and the number of heads in a sequence of independent flips follows a binomial distribution.
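The two-flip example can be checked numerically. A minimal Python sketch (variable names are illustrative) builds the joint pmf and verifies that it factors into the product of the marginals:

```python
from itertools import product

# Joint pmf of two fair coin flips: each of the four outcomes has probability 1/4.
joint = {(a, b): 0.25 for a, b in product([0, 1], repeat=2)}

# Marginal pmfs, obtained by summing the joint pmf over the other variable.
p_a = {a: sum(joint[(a, b)] for b in [0, 1]) for a in [0, 1]}
p_b = {b: sum(joint[(a, b)] for a in [0, 1]) for b in [0, 1]}

# Independence: the joint pmf factors as the product of the marginals.
assert all(abs(joint[(a, b)] - p_a[a] * p_b[b]) < 1e-12 for a, b in joint)
```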

### Dice Rolls

Consider the roll of a fair die and let $A$ = 1 if the number is even (i.e. 2, 4, or 6) and $A$ = 0 otherwise. Furthermore, let $B$ = 1 if the number is prime (i.e. 2, 3, or 5) and $B$ = 0 otherwise.

|     | 1 | 2 | 3 | 4 | 5 | 6 |
|-----|---|---|---|---|---|---|
| $A$ | 0 | 1 | 0 | 1 | 0 | 1 |
| $B$ | 0 | 1 | 1 | 0 | 1 | 0 |

Then, the joint distribution of $A$ and $B$, expressed as a probability mass function, is

[$] \mathrm{P}(A=0,B=0)=P\{1\}=\frac{1}{6},\; \mathrm{P}(A=1,B=0)=P\{4,6\}=\frac{2}{6}, [$]
[$] \mathrm{P}(A=0,B=1)=P\{3,5\}=\frac{2}{6},\; \mathrm{P}(A=1,B=1)=P\{2\}=\frac{1}{6}. [$]

These probabilities necessarily sum to 1, since the probability of some combination of $A$ and $B$ occurring is 1.
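The same joint pmf can be derived programmatically by labelling each face of the die. A short Python sketch (illustrative names; exact arithmetic via `fractions`):

```python
from fractions import Fraction
from collections import Counter

# Map each face of a fair die to (A, B): A = even indicator, B = prime indicator.
faces = range(1, 7)
labels = [(1 if n % 2 == 0 else 0, 1 if n in (2, 3, 5) else 0) for n in faces]

# Each face has probability 1/6; count faces per (A, B) pair to get the joint pmf.
counts = Counter(labels)
joint = {ab: Fraction(c, 6) for ab, c in counts.items()}

assert joint[(0, 0)] == Fraction(1, 6)   # face 1
assert joint[(1, 0)] == Fraction(2, 6)   # faces 4, 6
assert joint[(0, 1)] == Fraction(2, 6)   # faces 3, 5
assert joint[(1, 1)] == Fraction(1, 6)   # face 2
```

The four probabilities sum to exactly 1, as required.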

## Cumulative Distribution Function

When dealing simultaneously with more than one random variable, the joint cumulative distribution function can also be defined. For example, for a pair of random variables $X$ and $Y$, the joint CDF $F$ is given by

[$]F(x,y) = \operatorname{P}(X\leq x,Y\leq y),[$]

where the right-hand side represents the probability that the random variable $X$ takes on a value less than or equal to $x$ and that $Y$ takes on a value less than or equal to $y$.

Every multivariate CDF is:

1. Monotonically non-decreasing in each of its variables,
2. Right-continuous in each of its variables,
3. Bounded between 0 and 1: $0\leq F(x_{1},\ldots,x_{n})\leq 1$,
4. Normalized: $\lim_{x_{1},\ldots,x_{n}\rightarrow+\infty}F(x_{1},\ldots,x_{n})=1$ and $\lim_{x_{i}\rightarrow-\infty}F(x_{1},\ldots,x_{n})=0 \text{ for all } i$.
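For a discrete pair, the joint CDF is just the sum of the pmf over the lower-left quadrant. A small Python sketch using the dice example's joint pmf (illustrative names) makes the definition and the limit properties concrete:

```python
from fractions import Fraction

# Joint pmf of the dice example (A = even indicator, B = prime indicator).
pmf = {(0, 0): Fraction(1, 6), (1, 0): Fraction(2, 6),
       (0, 1): Fraction(2, 6), (1, 1): Fraction(1, 6)}

def cdf(x, y):
    """F(x, y) = P(A <= x, B <= y): sum the pmf over the lower-left quadrant."""
    return sum(p for (a, b), p in pmf.items() if a <= x and b <= y)

assert cdf(0, 0) == Fraction(1, 6)
assert cdf(1, 0) == Fraction(3, 6)   # P(B = 0)
assert cdf(0, 1) == Fraction(3, 6)   # P(A = 0)
assert cdf(1, 1) == 1                # whole sample space, F -> 1
assert cdf(-1, 1) == 0               # F -> 0 as either argument -> -infinity
```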

## Density function or mass function

### Discrete case

The joint probability mass function of a sequence of random variables $X_1,\ldots,X_n$ is the multivariate function

[$] \operatorname{P}(X_1=x_1,\dots,X_n=x_n). [$]

Since these are probabilities, we must have

[$]\sum_{i} \sum_{j} \dots \sum_{k} \mathrm{P}(X_1=x_{1i},X_2=x_{2j}, \dots, X_n=x_{nk}) = 1.\;[$]
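This normalization can be verified numerically for a small joint pmf. The sketch below builds a joint pmf of three independent Bernoulli variables with hypothetical success probabilities (the specific values are illustrative) and confirms that the triple sum over all outcomes is 1:

```python
from itertools import product
from math import prod

# Hypothetical success probabilities for three independent Bernoulli variables.
p = [0.2, 0.5, 0.7]

# Joint pmf: by independence, the product of the individual pmfs.
joint = {xs: prod(pi if x == 1 else 1 - pi for pi, x in zip(p, xs))
         for xs in product([0, 1], repeat=3)}

# Summing over every combination of outcomes gives 1.
total = sum(joint.values())
assert abs(total - 1.0) < 1e-12
```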

### Continuous case

If $X_1,\ldots,X_n$ are continuous random variables with

[$] F_{X_1,\ldots,X_n}(x_1,\ldots,x_n) = \int_{-\infty}^{x_1}\cdots \int_{-\infty}^{x_n} f_{X_1,\ldots,X_n}(z_1,\ldots,z_n) \,\, dz_1 \cdots dz_n [$]

then $f_{X_1,\ldots,X_n}$ is said to be the joint probability density function of the sequence of random variables.
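The defining integral can be approximated numerically. A minimal sketch (midpoint-rule quadrature, an illustrative choice) recovers the joint CDF of the uniform density on the unit square, for which $F(x,y)=xy$ there:

```python
# Midpoint-rule double integral of a joint density f over [0, x] x [0, y].
def joint_cdf(x, y, f, n=200):
    hx, hy = x / n, y / n
    total = 0.0
    for i in range(n):
        for j in range(n):
            total += f((i + 0.5) * hx, (j + 0.5) * hy) * hx * hy
    return total

# Uniform joint density on the unit square: f(x, y) = 1 there, 0 elsewhere.
f_uniform = lambda x, y: 1.0 if 0 <= x <= 1 and 0 <= y <= 1 else 0.0

# The numeric CDF matches the closed form F(x, y) = x * y on the square.
assert abs(joint_cdf(0.5, 0.8, f_uniform) - 0.5 * 0.8) < 1e-6
```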

## Marginal Distribution Functions

In probability theory and statistics, the marginal distribution of a subset of a collection of random variables is the probability distribution of the variables contained in the subset. It gives the probabilities of various values of the variables in the subset without reference to the values of the other variables. This contrasts with a conditional distribution, which gives the probabilities contingent upon the values of the other variables.

The term marginal variable is used to refer to those variables in the subset of variables being retained. These terms are dubbed "marginal" because they used to be found by summing values in a table along rows or columns, and writing the sum in the margins of the table. The distribution of the marginal variables (the marginal distribution) is obtained by marginalizing over the distribution of the variables being discarded, and the discarded variables are said to have been marginalized out.

The context here is that the theoretical studies being undertaken, or the data analysis being done, involves a wider set of random variables but that attention is being limited to a reduced number of those variables. In many applications an analysis may start with a given collection of random variables, then first extend the set by defining new ones (such as the sum of the original random variables) and finally reduce the number by placing interest in the marginal distribution of a subset (such as the sum). Several different analyses may be done, each treating a different subset of variables as the marginal variables.

### Two-variable case

|            | $x_1$ | $x_2$ | $x_3$ | $x_4$ | $p_Y(y)$ ↓ |
|------------|-------|-------|-------|-------|------------|
| $y_1$      | 4/32  | 2/32  | 1/32  | 1/32  | 8/32       |
| $y_2$      | 2/32  | 4/32  | 1/32  | 1/32  | 8/32       |
| $y_3$      | 2/32  | 2/32  | 2/32  | 2/32  | 8/32       |
| $y_4$      | 8/32  | 0     | 0     | 0     | 8/32       |
| $p_X(x)$ → | 16/32 | 8/32  | 4/32  | 4/32  | 32/32      |

Joint and marginal distributions of a pair of discrete random variables $X$, $Y$ having nonzero mutual information $I(X; Y)$. The values of the joint distribution are in the 4×4 square, and the values of the marginal distributions are along the right and bottom margins.

Given two random variables $X$ and $Y$ whose joint distribution is known, the marginal distribution of $X$ is simply the probability distribution of $X$ averaging over information about $Y$. It is the probability distribution of $X$ when the value of $Y$ is not known. This is typically calculated by summing or integrating the joint probability distribution over $Y$.

For discrete random variables, the marginal probability mass function can be written as $\operatorname{P}(X=x)$. This is

[$]\operatorname{P}(X=x) = \sum_{y} \operatorname{P}(X=x,Y=y)[$]

where $\operatorname{P}(X=x,Y=y)$ is the joint distribution of $X$ and $Y$. In this case, the variable $Y$ has been marginalized out.
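Marginalization is simple row/column arithmetic on the joint table. Using the joint distribution from the table above (exact arithmetic via `fractions`; names are illustrative):

```python
from fractions import Fraction

# Joint pmf from the table above: rows index y1..y4, columns x1..x4 (all over 32).
joint = [[Fraction(n, 32) for n in row] for row in
         [[4, 2, 1, 1],
          [2, 4, 1, 1],
          [2, 2, 2, 2],
          [8, 0, 0, 0]]]

# Marginalize out Y: sum each column.  Marginalize out X: sum each row.
p_x = [sum(col) for col in zip(*joint)]
p_y = [sum(row) for row in joint]

assert p_x == [Fraction(16, 32), Fraction(8, 32), Fraction(4, 32), Fraction(4, 32)]
assert p_y == [Fraction(8, 32)] * 4
```

Both marginals sum to 1, matching the bottom-right corner of the table.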

Similarly for continuous random variables, the marginal probability density function can be written as $f_X(x)$. This is

[$]f_{X}(x) = \int_y f_{X,Y}(x,y) \, \operatorname{d}\!y[$]

where $f_{X,Y}(x,y)$ gives the joint density function of $X$ and $Y$. Again, the variable $Y$ has been marginalized out.

### More than two variables

For $i=1,\ldots,n$, let $f_{X_i}(x_i)$ be the probability density function associated with variable $X_i$ alone. This is called the “marginal” density function, and can be deduced from the probability density associated with the random variables $X_1,\ldots,X_n$ by integrating over all values of the $n-1$ other variables:

[$]f_{X_i}(x_i) = \int f(x_1,\ldots,x_n)\, dx_1 \cdots dx_{i-1}\,dx_{i+1}\cdots dx_n .[$]
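The discrete analogue of this formula sums the joint pmf over every variable except the one of interest. A sketch (the joint pmf is built from independent factors purely for convenience; the `marginal` helper is illustrative):

```python
from itertools import product
from math import prod

# A joint pmf over three binary variables, built from hypothetical one-variable
# pmfs (independence is just a convenient way to get a valid joint pmf here).
pmfs = [{0: 0.3, 1: 0.7}, {0: 0.5, 1: 0.5}, {0: 0.9, 1: 0.1}]
joint = {xs: prod(pmfs[k][x] for k, x in enumerate(xs))
         for xs in product([0, 1], repeat=3)}

def marginal(joint, i):
    """Sum the joint pmf over every variable except the i-th."""
    out = {}
    for xs, p in joint.items():
        out[xs[i]] = out.get(xs[i], 0.0) + p
    return out

# Each one-variable marginal is recovered exactly from the joint pmf.
for i in range(3):
    m = marginal(joint, i)
    assert all(abs(m[x] - pmfs[i][x]) < 1e-12 for x in (0, 1))
```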

## Independence

A set of random variables is pairwise independent if and only if every pair of random variables is independent.

A set of random variables is mutually independent if and only if for any finite subset $X_1, \ldots, X_n$ and any finite sequence of numbers $a_1, \ldots, a_n$, the events

[$]\{X_1 \le a_1\}, \ldots, \{X_n \le a_n\}[$]

are mutually independent events.
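Mutual independence is strictly stronger than pairwise independence. A classic counterexample (stated here in terms of equality events on discrete variables, a convenient special case): take $X$, $Y$ independent fair bits and $Z = X \oplus Y$. The sketch below verifies pairwise independence and exhibits the failure of mutual independence:

```python
from itertools import product

# X, Y independent fair bits; Z = X XOR Y.  Four equally likely triples.
outcomes = [(x, y, x ^ y) for x, y in product([0, 1], repeat=2)]
joint = {o: 0.25 for o in outcomes}

def p(event):
    """Probability of an event, given as a predicate on a triple (x, y, z)."""
    return sum(pr for o, pr in joint.items() if event(o))

# Pairwise independence: P(Vi = a, Vj = b) = P(Vi = a) P(Vj = b) for every pair.
for i, j in [(0, 1), (0, 2), (1, 2)]:
    for a, b in product([0, 1], repeat=2):
        lhs = p(lambda o: o[i] == a and o[j] == b)
        rhs = p(lambda o: o[i] == a) * p(lambda o: o[j] == b)
        assert abs(lhs - rhs) < 1e-12

# Not mutually independent: P(X=0, Y=0, Z=0) = 1/4, but the product of the
# three one-variable probabilities is 1/8 (Z is determined by X and Y).
assert abs(p(lambda o: o == (0, 0, 0)) - 0.25) < 1e-12
triple = p(lambda o: o[0] == 0) * p(lambda o: o[1] == 0) * p(lambda o: o[2] == 0)
assert abs(triple - 0.125) < 1e-12
```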

If the joint probability density function of a vector of $n$ random variables can be factored into a product of $n$ functions of one variable

[$]f_{X_1,\ldots,X_n}(x_1,\ldots,x_n) = f_1(x_1)\cdots f_n(x_n),[$]

(where each $f_i$ is not necessarily a density), then the $n$ variables in the set are mutually independent, and the marginal probability density function of each of them is given by

[$]f_{X_i}(x_i) = \frac{f_i(x_i)}{\int f_i(x)\,dx}.[$]
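This normalization can be checked numerically for unnormalized factors on $[0,1]$. A sketch using simple midpoint-rule quadrature (the factors $f_1(x)=x^2$ and a constant $f_2$ are hypothetical choices):

```python
# Midpoint-rule integration on [a, b]; keeps the sketch self-contained.
def integrate(f, a, b, n=1000):
    h = (b - a) / n
    return sum(f(a + (i + 0.5) * h) for i in range(n)) * h

# Unnormalized factors of a joint density f(x, y) = f1(x) * f2(y) on [0, 1]^2.
f1 = lambda x: x * x     # integrates to 1/3, so it is not itself a density
f2 = lambda y: 5.0       # constant, also unnormalized

# Marginal of X1: f1 divided by its own integral, i.e. 3 x^2 on [0, 1].
z1 = integrate(f1, 0.0, 1.0)
marginal1 = lambda x: f1(x) / z1

assert abs(z1 - 1 / 3) < 1e-6
assert abs(marginal1(0.5) - 3 * 0.25) < 1e-5
assert abs(integrate(marginal1, 0.0, 1.0) - 1.0) < 1e-9   # a valid density
```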

### i.i.d. Sequences

In probability theory and statistics, a sequence or other collection of random variables is independent and identically distributed (i.i.d.) if each random variable has the same probability distribution as the others and all are mutually independent.

The abbreviation i.i.d. is particularly common in statistics (often as iid, sometimes written IID), where observations in a sample are often assumed to be effectively i.i.d. for the purposes of statistical inference. The assumption (or requirement) that observations be i.i.d. tends to simplify the underlying mathematics of many statistical methods (see mathematical statistics and statistical theory). However, in practical applications of statistical modeling the assumption may or may not be realistic. To test how realistic the assumption is on a given data set, the autocorrelation can be computed, lag plots drawn or turning point test performed. The generalization of exchangeable random variables is often sufficient and more easily met.

The assumption is important in the classical form of the central limit theorem, which states that the probability distribution of the suitably standardized sum (or average) of i.i.d. variables with finite variance approaches a normal distribution.
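The central limit theorem can be illustrated by simulation. The sketch below (seed and sample sizes are arbitrary choices) standardizes sums of i.i.d. Uniform(0, 1) draws and checks that their empirical mean and standard deviation are close to those of a standard normal:

```python
import random
from statistics import mean, stdev

random.seed(0)  # reproducible sketch

# Standardized sums of n i.i.d. Uniform(0, 1) draws (mean 1/2, variance 1/12).
n, trials = 100, 20000
mu, sigma = 0.5, (1 / 12) ** 0.5
samples = [(sum(random.random() for _ in range(n)) - n * mu) / (sigma * n ** 0.5)
           for _ in range(trials)]

# The standardized sums should be approximately standard normal.
assert abs(mean(samples)) < 0.05
assert abs(stdev(samples) - 1.0) < 0.05
```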