The bell-shaped distribution parameterised by a mean and variance (or a mean vector and covariance matrix in higher dimensions). Ubiquitous in ML because the Central Limit Theorem promises that sums of many independent random variables with finite variance tend towards Gaussian, making it a defensible default for noise. The MLE for its parameters is the sample mean and sample covariance.

Univariate Gaussian

The probability density function:

$$ \mathcal{N}(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right) $$

with $\mu$ (location) and $\sigma^2$ (spread). Sometimes the precision $\lambda = 1/\sigma^2$ is used instead, convenient for derivations because it removes the inverse.

The “68–95–99.7 rule”: about 68% of mass within $1\sigma$ of the mean, 95% within $2\sigma$, 99.7% within $3\sigma$.
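A quick numerical check of the rule (a minimal NumPy sketch; the sample size and seed are arbitrary choices, not from the note):

```python
import numpy as np

# Univariate Gaussian PDF written out directly, mirroring the formula above.
def gaussian_pdf(x, mu=0.0, sigma=1.0):
    return np.exp(-((x - mu) ** 2) / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)

# Empirical check of the 68-95-99.7 rule on one million standard-normal samples.
rng = np.random.default_rng(0)
samples = rng.normal(loc=0.0, scale=1.0, size=1_000_000)
for k in (1, 2, 3):
    frac = np.mean(np.abs(samples) <= k)
    print(f"within {k} sigma: {frac:.4f}")  # ~0.6827, 0.9545, 0.9973
```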

Multivariate Gaussian

For an $n$-dimensional vector $\mathbf{x}$:

$$ \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \frac{1}{(2\pi)^{n/2}\,|\boldsymbol{\Sigma}|^{1/2}} \exp\!\left(-\tfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^\top \boldsymbol{\Sigma}^{-1} (\mathbf{x}-\boldsymbol{\mu})\right) $$

  • $\boldsymbol{\mu}$ is the mean vector.
  • $\boldsymbol{\Sigma}$ is the covariance matrix — symmetric, positive semidefinite, with diagonal entries $\Sigma_{ii} = \mathrm{Var}(x_i)$ and off-diagonal entries $\Sigma_{ij} = \mathrm{Cov}(x_i, x_j)$.

The contour lines of constant density are ellipsoids; their axes are the eigenvectors of $\boldsymbol{\Sigma}$, scaled by the square roots of the corresponding eigenvalues $\sqrt{\lambda_i}$.
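A sketch verifying this numerically (the covariance values are made up for illustration):

```python
import numpy as np

# A correlated 2-D covariance matrix (illustrative values).
Sigma = np.array([[2.0, 1.2],
                  [1.2, 1.0]])

# Eigenvectors give the ellipse axis directions; sqrt(eigenvalues) give
# their relative scales, matching the contour description above.
eigvals, eigvecs = np.linalg.eigh(Sigma)
print("axis directions (columns):\n", eigvecs)
print("axis scales:", np.sqrt(eigvals))

# Samples drawn from this Gaussian recover Sigma empirically.
rng = np.random.default_rng(0)
X = rng.multivariate_normal(np.zeros(2), Sigma, size=100_000)
print("empirical covariance:\n", np.cov(X, rowvar=False).round(3))
```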

Special cases:

  • $\boldsymbol{\Sigma} = \sigma^2 I$ (isotropic): equal variance in every direction, circular contours.
  • $\boldsymbol{\Sigma}$ diagonal — components are uncorrelated; for Gaussians, this also implies independence.
  • $\boldsymbol{\Sigma}$ general — correlated components; ellipsoidal contours.

Independence vs Uncorrelatedness

For Gaussians (and only Gaussians among standard distributions), uncorrelated $\Rightarrow$ independent. A diagonal covariance matrix factorises the joint density into a product of univariate marginals, so the variables are independent.

This is a Gaussian-specific property. In general, uncorrelatedness only rules out linear dependence — non-Gaussian distributions can have zero correlation but strong non-linear dependence.
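A standard counterexample, sketched numerically (the choice $Y = X^2$ is the textbook one, not from the note):

```python
import numpy as np

# X is standard normal and Y = X^2. The pair (X, Y) is not jointly
# Gaussian: Cov(X, Y) = E[X^3] = 0, so they are uncorrelated,
# yet Y is a deterministic function of X (maximal dependence).
rng = np.random.default_rng(0)
x = rng.normal(size=1_000_000)
y = x**2
print("correlation:", round(np.corrcoef(x, y)[0, 1], 4))  # ~0
```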

A non-diagonal covariance matrix means the multivariate Gaussian cannot be written as a product of its univariate marginals. But you can always diagonalise $\boldsymbol{\Sigma}$ (by a linear transformation, a PCA-like rotation onto the eigenbasis) to decorrelate the variables; at that point they become independent.
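A sketch of that decorrelating rotation (same illustrative $\boldsymbol{\Sigma}$ as above):

```python
import numpy as np

# Rotate correlated Gaussian samples onto the eigenbasis of Sigma:
# the covariance becomes diagonal, and for Gaussians the rotated
# components are therefore independent.
rng = np.random.default_rng(0)
Sigma = np.array([[2.0, 1.2],
                  [1.2, 1.0]])
X = rng.multivariate_normal(np.zeros(2), Sigma, size=100_000)

_, eigvecs = np.linalg.eigh(Sigma)
Z = X @ eigvecs  # PCA-like rotation

# Off-diagonal entries vanish up to sampling noise.
print(np.cov(Z, rowvar=False).round(3))
```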

Why Gaussians Are Everywhere

The Central Limit Theorem says: the suitably normalised sum (or average) of many independent random variables with finite variance, regardless of their individual distributions, tends to a Gaussian as the number of variables grows. This is why Gaussians are the default noise model: measurement error, biological variation, and market fluctuations are all approximately sums of many small independent contributions.
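A quick simulation of the theorem (uniform summands chosen arbitrarily; any finite-variance distribution would do):

```python
import numpy as np

# Average n i.i.d. Uniform(0, 1) variables, then standardise using the
# uniform's mean 1/2 and variance 1/12. The result should look Gaussian.
rng = np.random.default_rng(0)
n, trials = 50, 100_000
avg = rng.uniform(size=(trials, n)).mean(axis=1)
z = (avg - 0.5) / np.sqrt(1.0 / 12.0 / n)

print("mean ~ 0:", z.mean().round(3), "| var ~ 1:", z.var().round(3))
print("P(|Z| <= 2):", np.mean(np.abs(z) <= 2).round(4))  # ~0.954, as a Gaussian predicts
```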

Consequence: even when you have no specific reason to assume Gaussian noise, it’s often a reasonable first approximation, and it leads to clean math.

MLE for the Gaussian

Given $N$ i.i.d. samples $\mathbf{x}_1, \dots, \mathbf{x}_N$ from $\mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$, the maximum-likelihood estimators are the sample mean and sample covariance:

$$ \hat{\boldsymbol{\mu}} = \frac{1}{N}\sum_{i=1}^{N} \mathbf{x}_i, \qquad \hat{\boldsymbol{\Sigma}} = \frac{1}{N}\sum_{i=1}^{N} (\mathbf{x}_i - \hat{\boldsymbol{\mu}})(\mathbf{x}_i - \hat{\boldsymbol{\mu}})^\top $$
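In code (a minimal sketch; the true parameters below are made up):

```python
import numpy as np

# Gaussian MLE: sample mean and the divide-by-N sample covariance,
# matching the formulas above.
def gaussian_mle(X):
    N = X.shape[0]
    mu_hat = X.mean(axis=0)
    centred = X - mu_hat
    Sigma_hat = centred.T @ centred / N  # note /N, not /(N-1)
    return mu_hat, Sigma_hat

rng = np.random.default_rng(0)
X = rng.multivariate_normal([1.0, -2.0], [[1.0, 0.3], [0.3, 0.5]], size=10_000)
mu_hat, Sigma_hat = gaussian_mle(X)
print(mu_hat.round(3))     # ~ [1, -2]
print(Sigma_hat.round(3))  # ~ [[1, 0.3], [0.3, 0.5]]
```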
Derivation sketch

The log-likelihood is:

$$ \ell(\boldsymbol{\mu}, \boldsymbol{\Sigma}) = -\frac{N}{2}\log|\boldsymbol{\Sigma}| - \frac{1}{2}\sum_{i=1}^{N} (\mathbf{x}_i - \boldsymbol{\mu})^\top \boldsymbol{\Sigma}^{-1} (\mathbf{x}_i - \boldsymbol{\mu}) + \text{const.} $$

Setting $\partial \ell / \partial \boldsymbol{\mu} = 0$ gives the sample mean directly. Setting $\partial \ell / \partial \boldsymbol{\Sigma} = 0$ (using matrix-derivative identities) gives the sample covariance. See lecture notes for the full derivation.

MLE underestimates the variance

The MLE variance estimate $\hat{\sigma}^2_{\text{MLE}} = \frac{1}{N}\sum_i (x_i - \hat{\mu})^2$ is biased: it systematically underestimates the true variance, since $\mathbb{E}[\hat{\sigma}^2_{\text{MLE}}] = \frac{N-1}{N}\sigma^2$. The unbiased estimator divides by $N-1$ instead of $N$ (Bessel’s correction). For large $N$ the difference is negligible; for small $N$ it matters. This is one of the standard “MLE has known biases” examples.
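The bias is easy to see by simulation (a sketch; $N = 5$ chosen to make the effect visible):

```python
import numpy as np

# Draw many small samples from N(0, 1) and compare the two estimators.
# ddof=0 divides by N (the MLE); ddof=1 divides by N-1 (Bessel).
rng = np.random.default_rng(0)
N, trials = 5, 200_000
samples = rng.normal(size=(trials, N))

print("E[MLE variance]      ~", samples.var(axis=1, ddof=0).mean().round(3))  # ~0.8 = (N-1)/N
print("E[unbiased variance] ~", samples.var(axis=1, ddof=1).mean().round(3))  # ~1.0
```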

The Role in Linear Regression

Linear regression’s OLS solution is justified by assuming the residuals are i.i.d. Gaussian:

$$ y = \mathbf{w}^\top \mathbf{x} + \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \sigma^2) $$

Under this assumption, the conditional distribution of $y$ given $\mathbf{x}$ is itself Gaussian:

$$ p(y \mid \mathbf{x}) = \mathcal{N}(y \mid \mathbf{w}^\top \mathbf{x}, \sigma^2) $$
and the MLE for $\mathbf{w}$ matches the OLS estimator exactly. Squared-error loss is the negative log-likelihood of a Gaussian noise model; that is the principled reason we use it.
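A sketch of that equivalence on synthetic data (the weights and noise scale are made up):

```python
import numpy as np

# With i.i.d. Gaussian residuals, the Gaussian MLE for w is exactly the
# least-squares solution, so the normal equations and lstsq agree.
rng = np.random.default_rng(0)
n, d = 500, 3
X = rng.normal(size=(n, d))
w_true = np.array([2.0, -1.0, 0.5])
y = X @ w_true + rng.normal(scale=0.1, size=n)  # Gaussian noise

w_lstsq = np.linalg.lstsq(X, y, rcond=None)[0]  # minimises squared error
w_normal = np.linalg.solve(X.T @ X, X.T @ y)    # closed-form normal equations
print(np.allclose(w_lstsq, w_normal))           # True: same estimator
```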

Active Recall

Connections

  • maximum likelihood estimation — the principle that picks $\hat{\boldsymbol{\mu}}$ and $\hat{\boldsymbol{\Sigma}}$ given data.
  • ordinary-least-squares — the regression analog: under Gaussian noise, MLE for $\mathbf{w}$ equals OLS.
  • gaussian-kernel — the same exp-of-squared-distance functional form, repurposed as a kernel.
  • bayes-law — Gaussians are conjugate to themselves (Gaussian prior + Gaussian likelihood → Gaussian posterior), making them computationally tractable in Bayesian inference.
  • cross-entropy-loss — the Bernoulli-noise analog of squared-error: same recipe (negative log-likelihood), different distribution.