TARGET DECK NeuralComputation::Week-01
Perceptron
What is a perceptron, mathematically?
A perceptron computes $y = \operatorname{sign}(\mathbf{w}^\top \mathbf{x} + b)$, where $\mathbf{x}$ is the input vector, $\mathbf{w}$ is a learned weight vector, $b$ is a learned bias, and $\operatorname{sign}$ outputs $+1$ or $-1$. It is the simplest single-neuron classifier.
How does removing the sign function from a perceptron change its job?
Without $\operatorname{sign}$, the output $y = \mathbf{w}^\top \mathbf{x} + b$ is a continuous real number — the model is now doing linear regression instead of classification. Same weights, same bias, same architecture, just a different output type.
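A minimal NumPy sketch of the two cards above (weights, bias, and input are invented for illustration): the raw score is the regression readout, and wrapping it in sign gives the perceptron's classification.

```python
import numpy as np

w = np.array([2.0, -1.0])  # learned weight vector (invented values)
b = 0.5                    # learned bias
x = np.array([1.0, 3.0])   # one input point

score = w @ x + b          # raw linear score: the regression output
label = np.sign(score)     # hard decision: +1 or -1

print(score)  # -0.5 -> linear regression reading
print(label)  # -1.0 -> perceptron classification
```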
What does the dot product measure geometrically in a perceptron?
The signed perpendicular distance (scaled by $\lVert \mathbf{w} \rVert$) from the point $\mathbf{x}$ to the decision hyperplane: $\mathbf{w}^\top \mathbf{x} + b$ equals that distance times $\lVert \mathbf{w} \rVert$. Positive means $\mathbf{x}$ lies on the same side as $\mathbf{w}$; negative means the opposite side. The sign function turns this distance into a hard classification.
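A quick numeric check (values invented): dividing the score by $\lVert \mathbf{w} \rVert$ recovers the signed perpendicular distance.

```python
import numpy as np

w = np.array([3.0, 4.0])          # ||w|| = 5
b = -5.0
x = np.array([3.0, 4.0])

score = w @ x + b                 # 3*3 + 4*4 - 5 = 20
dist = score / np.linalg.norm(w)  # signed distance: 20 / 5 = 4
print(dist)                       # 4.0 -> x sits 4 units on w's side of the boundary
```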
In a perceptron, what controls the tilt vs the position of the decision boundary?
- Tilt (orientation): the weight vector $\mathbf{w}$ — the boundary is always perpendicular to $\mathbf{w}$.
- Position (offset from origin): the bias $b$ — the boundary sits at signed distance $-b / \lVert \mathbf{w} \rVert$ from the origin along $\mathbf{w}$ (both facts are checked numerically below).
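A sketch verifying both bullets, with invented numbers:

```python
import numpy as np

w = np.array([3.0, 4.0])    # ||w|| = 5
b = -5.0                    # boundary: 3*x1 + 4*x2 - 5 = 0

p1 = np.array([1.0, 0.5])   # two points chosen to lie on the boundary
p2 = np.array([-1.0, 2.0])
assert np.isclose(w @ p1 + b, 0) and np.isclose(w @ p2 + b, 0)

# Tilt: any direction along the boundary is perpendicular to w.
print(w @ (p2 - p1))             # 0.0

# Position: signed offset of the boundary from the origin along w.
print(-b / np.linalg.norm(w))    # 1.0
```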
Linear separability
What does it mean for a dataset to be linearly separable, and why does this matter for a single perceptron?
A dataset is linearly separable if a single hyperplane can split the classes correctly. A single perceptron can only solve linearly separable problems — XOR and concentric-circle patterns cannot be classified by any single perceptron, no matter how the weights are chosen.
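A sketch (not from the lecture) making the XOR claim concrete: the classic perceptron learning rule finds a separating hyperplane for AND but never for XOR, so we cap the number of epochs.

```python
import numpy as np

def train_perceptron(X, y, epochs=100):
    """Classic perceptron learning rule; y in {-1, +1}.
    Returns (w, b, converged)."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        errors = 0
        for x, t in zip(X, y):
            if np.sign(w @ x + b) != t:  # misclassified -> nudge the boundary
                w += t * x
                b += t
                errors += 1
        if errors == 0:
            return w, b, True            # a separating hyperplane was found
    return w, b, False

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y_and = np.array([-1, -1, -1, +1])       # linearly separable
y_xor = np.array([-1, +1, +1, -1])       # not linearly separable

print(train_perceptron(X, y_and)[2])     # True
print(train_perceptron(X, y_xor)[2])     # False -- no hyperplane exists
```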
Loss & optimisation
Why is "learning is optimisation" the central idea of week 1?
Learning means picking the parameters that make predictions as close as possible to the truth. We formalise “close” with a loss function $L(\theta)$, then solve $\theta^* = \arg\min_\theta L(\theta)$. Every later technique (gradient descent, backprop, etc.) is just machinery for solving this minimisation.
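A toy sketch of that machinery (data and learning rate invented): gradient descent on a one-parameter squared loss $L(\theta) = \sum_i (\theta x_i - y_i)^2$.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.1, 3.9, 6.2])                # roughly y = 2x, with noise

theta = 0.0                                  # initial guess
lr = 0.01                                    # learning rate
for _ in range(200):
    grad = 2 * np.sum((theta * x - y) * x)   # dL/dtheta
    theta -= lr * grad                       # step downhill

print(theta)   # ~2.0 -- the argmin of the loss
```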
What is the difference between $\min$ and $\arg\min$?
$\min_\theta L(\theta)$ returns the smallest value $L$ takes. $\arg\min_\theta L(\theta)$ returns the input $\theta$ at which that smallest value is achieved. In ML we want the parameters that minimise loss, not the loss value itself — so we always write $\theta^* = \arg\min_\theta L(\theta)$.
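NumPy mirrors the distinction directly (toy loss values invented):

```python
import numpy as np

losses = np.array([5.0, 2.0, 7.0])  # loss at theta = 0, 1, 2

print(np.min(losses))     # 2.0 -> the smallest loss value
print(np.argmin(losses))  # 1   -> the theta index achieving it
```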
MLE
What does maximum likelihood estimation (MLE) ask?
Out of all possible parameter values $\theta$, MLE picks the one that would have made the observed data most probable. The data is treated as fixed (we already saw it); $\theta$ is what we sweep over.
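A brute-force sketch of the sweep, with invented readings and $\sigma = 1$ assumed: evaluate the likelihood of the fixed data at many candidate $\theta$ and keep the argmax.

```python
import numpy as np

data = np.array([22.5, 23.0, 23.5])   # observed readings (invented), now fixed
thetas = np.linspace(15, 30, 1501)    # candidate means to sweep over

def likelihood(theta, sigma=1.0):
    # Product of independent Gaussian densities p(x_i | theta).
    return np.prod(np.exp(-(data - theta) ** 2 / (2 * sigma ** 2))
                   / np.sqrt(2 * np.pi * sigma ** 2))

lik = np.array([likelihood(t) for t in thetas])
print(thetas[np.argmax(lik)])         # 23.0 -- the sample mean
```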
What is the difference between probability and likelihood, given the same formula $p(x \mid \theta)$?
Both use the same formula $p(x \mid \theta)$ but flip what is fixed:
- Probability: fix $\theta$, vary the data — “if the true mean is 20°C, what readings might we see?”
- Likelihood: fix the data, vary $\theta$ — “given we observed 23°C, which $\theta$ best explains it?”
MLE is a likelihood problem: data observed, parameter swept.
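The same Gaussian density, read both ways; the numbers echo the temperature example above, and $\sigma = 1$ is an assumption.

```python
import numpy as np

def gauss(x, theta, sigma=1.0):
    """One formula p(x | theta); what varies depends on the question."""
    return (np.exp(-(x - theta) ** 2 / (2 * sigma ** 2))
            / np.sqrt(2 * np.pi * sigma ** 2))

# Probability: theta fixed at 20, sweep possible readings x.
readings = np.array([18.0, 20.0, 23.0])
print(gauss(readings, theta=20.0))    # densities of readings we might see

# Likelihood: reading fixed at 23, sweep candidate thetas.
candidates = np.array([20.0, 22.0, 23.0])
print(gauss(23.0, theta=candidates))  # theta = 23 explains the data best
```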
Under what assumptions does maximising the likelihood reduce to minimising the sum of squared errors?
When (1) observations are independent and (2) noise around the true value is Gaussian. The Gaussian PDF contains $\exp\!\left(-\frac{(x_i - \mu)^2}{2\sigma^2}\right)$, so taking $\log$ of the product gives a sum of $-\frac{(x_i - \mu)^2}{2\sigma^2}$ terms (plus constants). Dropping constants and flipping the sign turns $\arg\max_\mu \log p(\text{data} \mid \mu)$ into $\arg\min_\mu \sum_i (x_i - \mu)^2$.
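Spelled out as a sketch of the standard derivation ($\mu$ is the parameter, $\sigma$ assumed known):

```latex
\begin{align}
\hat{\mu} &= \arg\max_{\mu} \prod_{i=1}^{n}
   \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(x_i - \mu)^2}{2\sigma^2}\right) \\
&= \arg\max_{\mu} \sum_{i=1}^{n} \left[ -\log\sqrt{2\pi\sigma^2}
   - \frac{(x_i - \mu)^2}{2\sigma^2} \right] && \text{(take } \log\text{)} \\
&= \arg\max_{\mu} \; -\sum_{i=1}^{n} (x_i - \mu)^2 && \text{(drop constants)} \\
&= \arg\min_{\mu} \sum_{i=1}^{n} (x_i - \mu)^2 && \text{(flip sign)}
\end{align}
```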
Why is it valid to apply to the likelihood before optimising?
Because $\log$ is monotonically increasing: if $a > b$ then $\log a > \log b$. The location of the maximum doesn’t move. Three concrete benefits: it turns products into sums (easier to differentiate), cancels the Gaussian’s $\exp$, and avoids floating-point underflow when multiplying many tiny probabilities.
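The underflow benefit is easy to demonstrate (probabilities invented):

```python
import numpy as np

probs = np.full(1000, 0.01)    # 1000 tiny per-observation probabilities

print(np.prod(probs))          # 0.0 -- 1e-2000 underflows double precision
print(np.sum(np.log(probs)))   # -4605.17... -- perfectly representable
```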
In the MLE-to-SSE derivation, why is it valid to drop the $\frac{1}{\sqrt{2\pi\sigma^2}}$ and $\frac{1}{2\sigma^2}$ factors?
We are optimising over $\mu$, not $\sigma$. Those factors are constants with respect to $\mu$, so they shift or scale the loss curve uniformly — every candidate $\mu$ moves by the same amount. The location of the optimum is unchanged.
Why do we flip the sign in the final step of the MLE-to-SSE derivation?
The log-likelihood is $-\frac{1}{2\sigma^2}\sum_i (x_i - \mu)^2 + \text{const}$. MLE maximises this, but ML training conventionally minimises loss. Maximising $-\sum_i (x_i - \mu)^2$ is equivalent to minimising $\sum_i (x_i - \mu)^2$, so we negate and switch from $\arg\max$ to $\arg\min$, giving $\hat{\mu} = \arg\min_\mu \sum_i (x_i - \mu)^2$.
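A numeric check of the last two cards (data and constants invented): shifting, positively scaling, or negating the objective never moves the arg-optimum.

```python
import numpy as np

mu = np.linspace(-5, 5, 1001)        # candidate parameters (step 0.01)
data = np.array([1.0, 2.0, 3.0])     # invented observations
sse = np.sum((data[:, None] - mu) ** 2, axis=0)

# Scale by -1/(2 sigma^2) and shift by a constant: the log-likelihood's shape.
log_lik = -sse / (2 * 1.5 ** 2) - 7.0

print(mu[np.argmax(log_lik)])        # 2.0 -- maximising log-likelihood...
print(mu[np.argmin(sse)])            # 2.0 -- ...equals minimising SSE
```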
A thermometer gives several noisy readings. Under MLE with Gaussian noise, what is the best estimate of the true temperature, and why?
The sample mean $\bar{x} = \frac{1}{n}\sum_i x_i$. For Gaussian noise, the MLE is exactly the sample mean — the value that minimises $\sum_i (x_i - \mu)^2$. Setting the derivative to zero gives $\hat{\mu} = \bar{x}$.
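A sketch with hypothetical readings: a grid search confirms the sample mean minimises the sum of squared errors.

```python
import numpy as np

readings = np.array([19.0, 21.0, 23.0])  # hypothetical noisy readings
mus = np.linspace(15, 25, 1001)          # candidate true temperatures

sse = np.sum((readings[:, None] - mus) ** 2, axis=0)

print(mus[np.argmin(sse)])               # 21.0
print(readings.mean())                   # 21.0 -- the MLE is the sample mean
```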
Why this matters
Why is MLE called the "bridge between probability and training"?
MLE turns a probabilistic model of the data into a concrete loss function automatically:
- Gaussian noise on real-valued targets → MSE / SSE
- Bernoulli labels (binary) → binary cross-entropy
- Categorical labels (multi-class) → cross-entropy
We don’t pick the loss arbitrarily — MLE derives it from the assumed distribution.
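One instance of the bridge, sketched with invented labels and predictions: the negative log-likelihood of independent Bernoulli observations is literally the binary cross-entropy formula.

```python
import numpy as np

y = np.array([1.0, 0.0, 1.0, 1.0])  # binary labels (invented)
p = np.array([0.9, 0.2, 0.7, 0.6])  # model's predicted P(y=1)

# Negative log-likelihood of independent Bernoulli observations...
nll = -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# ...is exactly the (unaveraged) binary cross-entropy.
print(nll)   # ~1.196
```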