The bell-shaped distribution parameterised by a mean and variance (or a mean vector and covariance matrix in higher dimensions). Ubiquitous in ML because the Central Limit Theorem promises that sums of many independent random variables with finite variance tend towards Gaussian, making it a defensible default for noise. The MLE for its parameters is the sample mean and sample covariance.

Univariate Gaussian

The probability density function:

$$ \mathcal{N}(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right) $$

with $\mu$ (location) and $\sigma^2$ (spread). Sometimes the precision $\lambda = 1/\sigma^2$ is used instead, convenient for derivations because it removes the inverse.

The “68–95–99.7 rule”: about 68% of mass within $1\sigma$ of the mean, 95% within $2\sigma$, 99.7% within $3\sigma$.
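A quick numerical check of the rule (a minimal NumPy sketch; the sample size and seed are arbitrary choices, not from the note):

```python
import numpy as np

# Univariate Gaussian PDF written out directly, mirroring the formula above.
def gaussian_pdf(x, mu=0.0, sigma=1.0):
    return np.exp(-((x - mu) ** 2) / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)

# Empirical check of the 68-95-99.7 rule on one million standard-normal samples.
rng = np.random.default_rng(0)
samples = rng.normal(loc=0.0, scale=1.0, size=1_000_000)
for k in (1, 2, 3):
    frac = np.mean(np.abs(samples) <= k)
    print(f"within {k} sigma: {frac:.4f}")  # ~0.6827, 0.9545, 0.9973
```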

Multivariate Gaussian

For an $n$-dimensional vector $\mathbf{x}$:

$$ \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \frac{1}{(2\pi)^{n/2}\,|\boldsymbol{\Sigma}|^{1/2}} \exp\!\left(-\tfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^\top \boldsymbol{\Sigma}^{-1} (\mathbf{x}-\boldsymbol{\mu})\right) $$

  • $\boldsymbol{\mu}$ is the mean vector.
  • $\boldsymbol{\Sigma}$ is the covariance matrix — symmetric, positive semidefinite, with diagonal entries $\Sigma_{ii} = \mathrm{Var}(x_i)$ and off-diagonal entries $\Sigma_{ij} = \mathrm{Cov}(x_i, x_j)$.

The contour lines of constant density are ellipsoids; their axes are the eigenvectors of $\boldsymbol{\Sigma}$, scaled by the square roots of the corresponding eigenvalues $\sqrt{\lambda_i}$.
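A sketch verifying this numerically (the covariance values are made up for illustration):

```python
import numpy as np

# A correlated 2-D covariance matrix (illustrative values).
Sigma = np.array([[2.0, 1.2],
                  [1.2, 1.0]])

# Eigenvectors give the ellipse axis directions; sqrt(eigenvalues) give
# their relative scales, matching the contour description above.
eigvals, eigvecs = np.linalg.eigh(Sigma)
print("axis directions (columns):\n", eigvecs)
print("axis scales:", np.sqrt(eigvals))

# Samples drawn from this Gaussian recover Sigma empirically.
rng = np.random.default_rng(0)
X = rng.multivariate_normal(np.zeros(2), Sigma, size=100_000)
print("empirical covariance:\n", np.cov(X, rowvar=False).round(3))
```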

Special cases:

  • $\boldsymbol{\Sigma} = \sigma^2 I$ (isotropic): equal variance in every direction, circular contours.
  • $\boldsymbol{\Sigma}$ diagonal — components are uncorrelated; for Gaussians, this also implies independence.
  • $\boldsymbol{\Sigma}$ general — correlated components; ellipsoidal contours.

Independence vs Uncorrelatedness

For Gaussians (and only Gaussians among standard distributions), uncorrelated $\Rightarrow$ independent. A diagonal covariance matrix factorises the joint density into a product of univariate marginals, so the variables are independent.

This is a Gaussian-specific property. In general, uncorrelatedness only rules out linear dependence — non-Gaussian distributions can have zero correlation but strong non-linear dependence.
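A standard counterexample, sketched numerically (the choice $Y = X^2$ is the textbook one, not from the note):

```python
import numpy as np

# X is standard normal and Y = X^2. The pair (X, Y) is not jointly
# Gaussian: Cov(X, Y) = E[X^3] = 0, so they are uncorrelated,
# yet Y is a deterministic function of X (maximal dependence).
rng = np.random.default_rng(0)
x = rng.normal(size=1_000_000)
y = x**2
print("correlation:", round(np.corrcoef(x, y)[0, 1], 4))  # ~0
```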

A non-diagonal covariance matrix means the multivariate Gaussian cannot be written as a product of its univariate marginals. But you can always diagonalise $\boldsymbol{\Sigma}$ (by a linear transformation, a PCA-like rotation onto the eigenbasis) to decorrelate the variables; at that point they become independent.
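A sketch of that decorrelating rotation (same illustrative $\boldsymbol{\Sigma}$ as above):

```python
import numpy as np

# Rotate correlated Gaussian samples onto the eigenbasis of Sigma:
# the covariance becomes diagonal, and for Gaussians the rotated
# components are therefore independent.
rng = np.random.default_rng(0)
Sigma = np.array([[2.0, 1.2],
                  [1.2, 1.0]])
X = rng.multivariate_normal(np.zeros(2), Sigma, size=100_000)

_, eigvecs = np.linalg.eigh(Sigma)
Z = X @ eigvecs  # PCA-like rotation

# Off-diagonal entries vanish up to sampling noise.
print(np.cov(Z, rowvar=False).round(3))
```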

Why Gaussians Are Everywhere

The Central Limit Theorem says: the suitably normalised sum (or average) of many independent random variables with finite variance, regardless of their individual distributions, tends to a Gaussian as the number of variables grows. This is why Gaussians are the default noise model: measurement error, biological variation, and market fluctuations are all approximately sums of many small independent contributions.
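A quick simulation of the theorem (uniform summands chosen arbitrarily; any finite-variance distribution would do):

```python
import numpy as np

# Average n i.i.d. Uniform(0, 1) variables, then standardise using the
# uniform's mean 1/2 and variance 1/12. The result should look Gaussian.
rng = np.random.default_rng(0)
n, trials = 50, 100_000
avg = rng.uniform(size=(trials, n)).mean(axis=1)
z = (avg - 0.5) / np.sqrt(1.0 / 12.0 / n)

print("mean ~ 0:", z.mean().round(3), "| var ~ 1:", z.var().round(3))
print("P(|Z| <= 2):", np.mean(np.abs(z) <= 2).round(4))  # ~0.954, as a Gaussian predicts
```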

Consequence: even when you have no specific reason to assume Gaussian noise, it’s often a reasonable first approximation, and it leads to clean math.

MLE for the Gaussian

Given $N$ i.i.d. samples $\mathbf{x}_1, \dots, \mathbf{x}_N$ from $\mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$, the maximum-likelihood estimators are the sample mean and sample covariance:

$$ \hat{\boldsymbol{\mu}} = \frac{1}{N}\sum_{i=1}^{N} \mathbf{x}_i, \qquad \hat{\boldsymbol{\Sigma}} = \frac{1}{N}\sum_{i=1}^{N} (\mathbf{x}_i - \hat{\boldsymbol{\mu}})(\mathbf{x}_i - \hat{\boldsymbol{\mu}})^\top $$
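In code (a minimal sketch; the true parameters below are made up):

```python
import numpy as np

# Gaussian MLE: sample mean and the divide-by-N sample covariance,
# matching the formulas above.
def gaussian_mle(X):
    N = X.shape[0]
    mu_hat = X.mean(axis=0)
    centred = X - mu_hat
    Sigma_hat = centred.T @ centred / N  # note /N, not /(N-1)
    return mu_hat, Sigma_hat

rng = np.random.default_rng(0)
X = rng.multivariate_normal([1.0, -2.0], [[1.0, 0.3], [0.3, 0.5]], size=10_000)
mu_hat, Sigma_hat = gaussian_mle(X)
print(mu_hat.round(3))     # ~ [1, -2]
print(Sigma_hat.round(3))  # ~ [[1, 0.3], [0.3, 0.5]]
```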
Derivation sketch

The log-likelihood is:

$$ \ell(\boldsymbol{\mu}, \boldsymbol{\Sigma}) = -\frac{N}{2}\log|\boldsymbol{\Sigma}| - \frac{1}{2}\sum_{i=1}^{N} (\mathbf{x}_i - \boldsymbol{\mu})^\top \boldsymbol{\Sigma}^{-1} (\mathbf{x}_i - \boldsymbol{\mu}) + \text{const.} $$

Setting $\partial \ell / \partial \boldsymbol{\mu} = 0$ gives the sample mean directly. Setting $\partial \ell / \partial \boldsymbol{\Sigma} = 0$ (using matrix-derivative identities) gives the sample covariance. See lecture notes for the full derivation.

MLE underestimates the variance

The MLE variance estimate $\hat{\sigma}^2_{\text{MLE}} = \frac{1}{N}\sum_i (x_i - \hat{\mu})^2$ is biased: it systematically underestimates the true variance, since $\mathbb{E}[\hat{\sigma}^2_{\text{MLE}}] = \frac{N-1}{N}\sigma^2$. The unbiased estimator divides by $N-1$ instead of $N$ (Bessel’s correction). For large $N$ the difference is negligible; for small $N$ it matters. This is one of the standard “MLE has known biases” examples.
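The bias is easy to see by simulation (a sketch; $N = 5$ chosen to make the effect visible):

```python
import numpy as np

# Draw many small samples from N(0, 1) and compare the two estimators.
# ddof=0 divides by N (the MLE); ddof=1 divides by N-1 (Bessel).
rng = np.random.default_rng(0)
N, trials = 5, 200_000
samples = rng.normal(size=(trials, N))

print("E[MLE variance]      ~", samples.var(axis=1, ddof=0).mean().round(3))  # ~0.8 = (N-1)/N
print("E[unbiased variance] ~", samples.var(axis=1, ddof=1).mean().round(3))  # ~1.0
```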

The Role in Linear Regression

Linear regression’s OLS solution is justified by assuming the residuals are i.i.d. Gaussian:

$$ y = \mathbf{w}^\top \mathbf{x} + \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \sigma^2) $$

Under this assumption, the conditional distribution of $y$ given $\mathbf{x}$ is itself Gaussian:

$$ p(y \mid \mathbf{x}) = \mathcal{N}(y \mid \mathbf{w}^\top \mathbf{x}, \sigma^2) $$
and the MLE for $\mathbf{w}$ matches the OLS estimator exactly. Squared-error loss is the negative log-likelihood of a Gaussian noise model; that is the principled reason we use it.
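A sketch of that equivalence on synthetic data (the weights and noise scale are made up):

```python
import numpy as np

# With i.i.d. Gaussian residuals, the Gaussian MLE for w is exactly the
# least-squares solution, so the normal equations and lstsq agree.
rng = np.random.default_rng(0)
n, d = 500, 3
X = rng.normal(size=(n, d))
w_true = np.array([2.0, -1.0, 0.5])
y = X @ w_true + rng.normal(scale=0.1, size=n)  # Gaussian noise

w_lstsq = np.linalg.lstsq(X, y, rcond=None)[0]  # minimises squared error
w_normal = np.linalg.solve(X.T @ X, X.T @ y)    # closed-form normal equations
print(np.allclose(w_lstsq, w_normal))           # True: same estimator
```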

Active Recall

Connections

  • maximum likelihood estimation — the principle that picks $\hat{\boldsymbol{\mu}}$ and $\hat{\boldsymbol{\Sigma}}$ given data.
  • ordinary-least-squares — the regression analog: under Gaussian noise, MLE for $\mathbf{w}$ equals OLS.
  • gaussian-kernel — the same exp-of-squared-distance functional form, repurposed as a kernel.
  • bayes-law — Gaussians are conjugate to themselves (Gaussian prior + Gaussian likelihood → Gaussian posterior), making them computationally tractable in Bayesian inference.
  • cross-entropy-loss — the Bernoulli-noise analog of squared-error: same recipe (negative log-likelihood), different distribution.