The bell-shaped distribution parameterised by a mean $\mu$ and variance $\sigma^2$ (or a mean vector and covariance matrix in higher dimensions). Ubiquitous in ML because the Central Limit Theorem promises that sums of many independent random variables (with finite variance) tend towards Gaussian — making it a defensible default for noise. The MLE for its parameters is the sample mean and sample covariance.
Univariate Gaussian
The probability density function:

$$\mathcal{N}(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

with $\mu$ (location) and $\sigma^2$ (spread). Sometimes the precision $\lambda = 1/\sigma^2$ is used instead — convenient for derivations because it removes the inverse.
The “68–95–99.7 rule”: about 68% of mass within $1\sigma$ of the mean, 95% within $2\sigma$, 99.7% within $3\sigma$.
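As a quick numeric check (a standard-library sketch; `mass_within` is an illustrative helper name), the rule follows from the Gaussian CDF: the mass within $k$ standard deviations of the mean is $\operatorname{erf}(k/\sqrt{2})$, independent of $\mu$ and $\sigma$.

```python
import math

# P(|X - mu| < k*sigma) = erf(k / sqrt(2)), independent of mu and sigma.
def mass_within(k: float) -> float:
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"within {k} sigma: {mass_within(k):.4f}")
# within 1 sigma: 0.6827
# within 2 sigma: 0.9545
# within 3 sigma: 0.9973
```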
Multivariate Gaussian
For a $D$-dimensional vector $\mathbf{x}$:

$$\mathcal{N}(\mathbf{x} \mid \boldsymbol\mu, \boldsymbol\Sigma) = \frac{1}{(2\pi)^{D/2}\,|\boldsymbol\Sigma|^{1/2}} \exp\!\left(-\frac{1}{2}(\mathbf{x}-\boldsymbol\mu)^\top \boldsymbol\Sigma^{-1} (\mathbf{x}-\boldsymbol\mu)\right)$$

- $\boldsymbol\mu \in \mathbb{R}^D$ is the mean vector.
- $\boldsymbol\Sigma \in \mathbb{R}^{D \times D}$ is the covariance matrix — symmetric, positive semidefinite (strictly positive definite for the density above to be well-defined), with diagonal entries $\Sigma_{ii} = \operatorname{Var}(x_i)$ and off-diagonal entries $\Sigma_{ij} = \operatorname{Cov}(x_i, x_j)$.
The contour lines of constant density are ellipsoids; their axes are the eigenvectors of $\boldsymbol\Sigma$, scaled by the square roots of the corresponding eigenvalues, $\sqrt{\lambda_i}$.
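The density formula above translates directly into code — a NumPy sketch with an illustrative `mvn_pdf` helper. At the mean the exponent vanishes, so the density reduces to the normalising constant $1/\big((2\pi)^{D/2}|\boldsymbol\Sigma|^{1/2}\big)$.

```python
import numpy as np

def mvn_pdf(x, mu, Sigma):
    """Multivariate Gaussian density, straight from the formula."""
    D = len(mu)
    diff = x - mu
    norm = (2 * np.pi) ** (D / 2) * np.sqrt(np.linalg.det(Sigma))
    # Solve a linear system rather than explicitly inverting Sigma (better numerics).
    return float(np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff)) / norm)

mu = np.array([0.0, 0.0])
Sigma = np.array([[2.0, 0.6], [0.6, 1.0]])
print(mvn_pdf(mu, mu, Sigma))  # equals 1 / (2*pi*sqrt(det(Sigma)))
```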
Special cases:
- $\boldsymbol\Sigma = \sigma^2 \mathbf{I}$ — isotropic: equal variance in every direction, circular contours.
- $\boldsymbol\Sigma$ diagonal — components are uncorrelated; for Gaussians, this also implies independence.
- $\boldsymbol\Sigma$ general — correlated components; ellipsoidal contours.
Independence vs Uncorrelatedness
For jointly Gaussian variables (and essentially only Gaussians among standard distributions), uncorrelated $\Leftrightarrow$ independent. A diagonal covariance matrix factorises the joint density into a product of univariate marginals, so the variables are independent.
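A quick numerical sanity check of the factorisation claim (standard-library sketch; the helper names are illustrative): with a diagonal covariance, the bivariate density equals the product of the two univariate densities.

```python
import math

def pdf1(x, mu, var):
    """Univariate Gaussian density."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def pdf2_diag(x1, x2, v1, v2):
    """Zero-mean bivariate Gaussian density with covariance diag(v1, v2)."""
    quad = x1 ** 2 / v1 + x2 ** 2 / v2
    return math.exp(-0.5 * quad) / (2 * math.pi * math.sqrt(v1 * v2))

x1, x2 = 0.3, -1.2
joint = pdf2_diag(x1, x2, 2.0, 0.5)
product = pdf1(x1, 0.0, 2.0) * pdf1(x2, 0.0, 0.5)
print(joint, product)  # identical up to floating-point error
```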
This is a Gaussian-specific property. In general, uncorrelatedness only rules out linear dependence — non-Gaussian distributions can have zero correlation but strong non-linear dependence.
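For instance (a NumPy sketch, illustrative only): $Y = X^2$ is fully determined by $X$, yet their correlation vanishes, because $\operatorname{Cov}(X, X^2) = \mathbb{E}[X^3] = 0$ for a symmetric zero-mean distribution.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(1_000_000)
y = x ** 2                     # fully (non-linearly) dependent on x...
r = np.corrcoef(x, y)[0, 1]    # ...yet uncorrelated: Cov(x, x^2) = E[x^3] = 0
print(round(float(r), 3))      # ≈ 0.0
```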
A non-diagonal covariance matrix means the multivariate Gaussian cannot be written as a product of univariate Gaussians in the original coordinates. But you can always diagonalise $\boldsymbol\Sigma$ (by an orthogonal linear transformation, a PCA-like rotation into the eigenbasis) to decorrelate the variables — at which point they become independent.
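The decorrelating rotation can be sketched with NumPy's `eigh` (the specific covariance is an assumption for the demo): rotate the samples into the eigenbasis of the sample covariance and the off-diagonal covariance vanishes.

```python
import numpy as np

rng = np.random.default_rng(0)
Sigma = np.array([[2.0, 0.8], [0.8, 1.0]])
X = rng.multivariate_normal([0.0, 0.0], Sigma, size=100_000)

# Eigendecomposition of the sample covariance (PCA-like rotation).
eigvals, eigvecs = np.linalg.eigh(np.cov(X.T))
Z = X @ eigvecs               # samples expressed in the eigenbasis
C = np.cov(Z.T)               # diagonal up to floating-point error
print(np.round(C, 6))
```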
Why Gaussians Are Everywhere
The Central Limit Theorem says: the (standardised) sum or average of many independent, identically distributed random variables with finite variance, regardless of their individual distribution, tends to a Gaussian as the number of variables grows. This is why Gaussians are the default noise model — measurement error, biological variation, market fluctuations are all approximately sums of many small independent contributions.
Consequence: even when you have no specific reason to assume Gaussian noise, it’s often a reasonable first approximation, and it leads to clean math.
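A quick empirical illustration (NumPy sketch; the sample sizes are assumptions for the demo): average 50 uniform draws many times and the averages behave like $\mathcal{N}\!\big(0.5, \frac{1}{12 \cdot 50}\big)$ — e.g. roughly 68% of them land within one predicted standard deviation of 0.5.

```python
import numpy as np

rng = np.random.default_rng(2)
# 200_000 averages of 50 i.i.d. Uniform(0, 1) draws each.
means = rng.random((200_000, 50)).mean(axis=1)

mu, sigma = 0.5, (1 / (12 * 50)) ** 0.5   # CLT-predicted mean and std
within = np.mean(np.abs(means - mu) < sigma)
print(round(float(within), 3))             # close to the Gaussian 0.683
```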
MLE for the Gaussian
Given $N$ i.i.d. samples $\mathbf{x}_1, \dots, \mathbf{x}_N$ from $\mathcal{N}(\boldsymbol\mu, \boldsymbol\Sigma)$, the maximum-likelihood estimators are the sample mean and sample covariance:

$$\hat{\boldsymbol\mu} = \frac{1}{N}\sum_{n=1}^{N} \mathbf{x}_n, \qquad \hat{\boldsymbol\Sigma} = \frac{1}{N}\sum_{n=1}^{N} (\mathbf{x}_n - \hat{\boldsymbol\mu})(\mathbf{x}_n - \hat{\boldsymbol\mu})^\top$$
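In code (a NumPy sketch; the true parameters are assumptions for the demo), the estimators are just the empirical mean and the $1/N$-normalised scatter, which matches NumPy's biased covariance estimator:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.multivariate_normal([1.0, -2.0], [[1.0, 0.3], [0.3, 0.5]], size=5000)

mu_hat = X.mean(axis=0)
centred = X - mu_hat
Sigma_hat = centred.T @ centred / len(X)   # divide by N, not N-1: the MLE

print(mu_hat)      # close to the true mean [1, -2]
print(Sigma_hat)   # close to the true covariance
```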
Derivation sketch
The log-likelihood is:

$$\ell(\boldsymbol\mu, \boldsymbol\Sigma) = -\frac{N}{2}\ln|\boldsymbol\Sigma| - \frac{1}{2}\sum_{n=1}^{N} (\mathbf{x}_n - \boldsymbol\mu)^\top \boldsymbol\Sigma^{-1} (\mathbf{x}_n - \boldsymbol\mu) + \text{const}$$

Setting $\nabla_{\boldsymbol\mu}\,\ell = 0$ gives the sample mean directly. Setting $\nabla_{\boldsymbol\Sigma}\,\ell = 0$ (using matrix-derivative identities) gives the sample covariance. See lecture notes for the full derivation.
MLE underestimates $\sigma^2$
The MLE estimate $\hat\sigma^2_{\mathrm{MLE}}$ is biased — it systematically underestimates the true variance: $\mathbb{E}[\hat\sigma^2_{\mathrm{MLE}}] = \frac{N-1}{N}\sigma^2$. The unbiased estimator divides by $N-1$ instead of $N$ (Bessel’s correction). For large $N$ the difference is negligible; for small $N$ it matters. This is one of the standard “MLE has known biases” examples.
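The bias is easy to see empirically (NumPy sketch; the sample sizes are illustrative): with $N = 5$, the MLE variance averages to about $\frac{N-1}{N}\sigma^2 = 0.8$, while dividing by $N-1$ recovers the true $\sigma^2 = 1$.

```python
import numpy as np

rng = np.random.default_rng(4)
N, trials = 5, 200_000
X = rng.standard_normal((trials, N))     # true variance = 1

mle = X.var(axis=1, ddof=0).mean()       # divides by N   -> biased low (~0.8)
bessel = X.var(axis=1, ddof=1).mean()    # divides by N-1 -> unbiased (~1.0)
print(round(float(mle), 3), round(float(bessel), 3))
```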
The Role in Linear Regression
Linear regression’s OLS solution is justified by assuming the residuals are i.i.d. Gaussian:

$$y = \mathbf{w}^\top \mathbf{x} + \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \sigma^2)$$

Under this assumption, the conditional distribution of $y$ given $\mathbf{x}$ is itself Gaussian:

$$p(y \mid \mathbf{x}) = \mathcal{N}(y \mid \mathbf{w}^\top \mathbf{x}, \sigma^2)$$
and the MLE for $\mathbf{w}$ matches the OLS estimator exactly. Squared-error loss is the negative log-likelihood of a Gaussian noise model — that’s the principled reason we use it.
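To illustrate the equivalence (NumPy sketch; the data-generating numbers are assumptions): the Gaussian negative log-likelihood is squared error up to scale plus a constant, so the least-squares solution minimises it — perturbing the OLS weights can only increase the NLL.

```python
import numpy as np

rng = np.random.default_rng(5)
n, d = 200, 3
X = rng.standard_normal((n, d))
w_true = np.array([1.5, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.standard_normal(n)   # i.i.d. Gaussian noise, as assumed

w_ols = np.linalg.lstsq(X, y, rcond=None)[0]

def nll(w, sigma2=0.01):
    """Gaussian negative log-likelihood = scaled squared error + constant."""
    r = y - X @ w
    return 0.5 * n * np.log(2 * np.pi * sigma2) + r @ r / (2 * sigma2)

# Any perturbation of the OLS weights increases the Gaussian NLL.
for _ in range(100):
    assert nll(w_ols + 0.01 * rng.standard_normal(d)) >= nll(w_ols)
print("OLS minimises the Gaussian NLL")
```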
Active Recall
For a Gaussian, are uncorrelatedness and independence the same thing?
Yes — for Gaussians specifically. A diagonal covariance matrix factorises the joint density into a product of univariate Gaussians, so uncorrelated implies independent. This is not true for general distributions, where uncorrelatedness only rules out linear dependence.
Why does the Central Limit Theorem matter for ML?
It justifies the Gaussian as a default noise model: measurement errors and similar quantities are typically sums of many small independent contributions, which converge to Gaussian regardless of the individual contributions’ distributions. This is why squared-error loss (the Gaussian-noise MLE objective) is so widely used.
Can a multivariate Gaussian with non-diagonal covariance always be written as a product of univariate Gaussians?
No. A non-trivial covariance matrix couples the variables; the joint density doesn’t factorise. However, you can always diagonalise the covariance matrix by linear transformation (rotating into the eigenbasis), making the rotated variables independent.
Connections
- maximum likelihood estimation — the principle that picks $\hat{\boldsymbol\mu}$ and $\hat{\boldsymbol\Sigma}$ given data.
- ordinary-least-squares — the regression analog: under Gaussian noise, MLE for $\mathbf{w}$ equals OLS.
- gaussian-kernel — the same exp-of-squared-distance functional form, repurposed as a kernel.
- bayes-law — Gaussians are conjugate to themselves (Gaussian prior + Gaussian likelihood → Gaussian posterior), making them computationally tractable in Bayesian inference.
- cross-entropy-loss — the Bernoulli-noise analog of squared-error: same recipe (negative log-likelihood), different distribution.