TARGET DECK NeuralComputation::Week-01
Perceptron
What is a perceptron, mathematically?
A perceptron computes $y = \operatorname{sign}(\mathbf{w}^\top \mathbf{x} + b)$, where $\mathbf{x}$ is the input vector, $\mathbf{w}$ is a learned weight vector, $b$ is a learned bias, and $\operatorname{sign}$ outputs $+1$ or $-1$. It is the simplest single-neuron classifier.
How does removing the sign function from a perceptron change its job?
Without $\operatorname{sign}$, the output $y = \mathbf{w}^\top \mathbf{x} + b$ is a continuous real number — the model is now doing linear regression instead of classification. Same weights, same bias, same architecture, just a different output type.
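A minimal NumPy sketch of the two cards above (weights, bias, and input are invented for illustration): the raw score is the regression readout, and wrapping it in sign gives the perceptron's classification.

```python
import numpy as np

w = np.array([2.0, -1.0])  # learned weight vector (invented values)
b = 0.5                    # learned bias
x = np.array([1.0, 3.0])   # one input point

score = w @ x + b          # raw linear score: the regression output
label = np.sign(score)     # hard decision: +1 or -1

print(score)  # -0.5 -> linear regression reading
print(label)  # -1.0 -> perceptron classification
```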
What does the dot product measure geometrically in a perceptron?
The signed perpendicular distance (scaled by $\lVert \mathbf{w} \rVert$) from the point $\mathbf{x}$ to the decision hyperplane: $\mathbf{w}^\top \mathbf{x} + b$ equals that distance times $\lVert \mathbf{w} \rVert$. Positive means $\mathbf{x}$ lies on the same side as $\mathbf{w}$; negative means the opposite side. The sign function turns this distance into a hard classification.
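A quick numeric check (values invented): dividing the score by $\lVert \mathbf{w} \rVert$ recovers the signed perpendicular distance.

```python
import numpy as np

w = np.array([3.0, 4.0])          # ||w|| = 5
b = -5.0
x = np.array([3.0, 4.0])

score = w @ x + b                 # 3*3 + 4*4 - 5 = 20
dist = score / np.linalg.norm(w)  # signed distance: 20 / 5 = 4
print(dist)                       # 4.0 -> x sits 4 units on w's side of the boundary
```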
In a perceptron, what controls the tilt vs the position of the decision boundary?
- Tilt (orientation): the weight vector $\mathbf{w}$ — the boundary is always perpendicular to $\mathbf{w}$.
- Position (offset from origin): the bias $b$ — the boundary sits at signed distance $-b / \lVert \mathbf{w} \rVert$ from the origin along $\mathbf{w}$ (both facts are checked numerically below).
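A sketch verifying both bullets, with invented numbers:

```python
import numpy as np

w = np.array([3.0, 4.0])    # ||w|| = 5
b = -5.0                    # boundary: 3*x1 + 4*x2 - 5 = 0

p1 = np.array([1.0, 0.5])   # two points chosen to lie on the boundary
p2 = np.array([-1.0, 2.0])
assert np.isclose(w @ p1 + b, 0) and np.isclose(w @ p2 + b, 0)

# Tilt: any direction along the boundary is perpendicular to w.
print(w @ (p2 - p1))             # 0.0

# Position: signed offset of the boundary from the origin along w.
print(-b / np.linalg.norm(w))    # 1.0
```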
Linear separability
What does it mean for a dataset to be linearly separable, and why does this matter for a single perceptron?
A dataset is linearly separable if a single hyperplane can split the classes correctly. A single perceptron can only solve linearly separable problems — XOR and concentric-circle patterns cannot be classified by any single perceptron, no matter how the weights are chosen.
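A sketch (not from the lecture) making the XOR claim concrete: the classic perceptron learning rule finds a separating hyperplane for AND but never for XOR, so we cap the number of epochs.

```python
import numpy as np

def train_perceptron(X, y, epochs=100):
    """Classic perceptron learning rule; y in {-1, +1}.
    Returns (w, b, converged)."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        errors = 0
        for x, t in zip(X, y):
            if np.sign(w @ x + b) != t:  # misclassified -> nudge the boundary
                w += t * x
                b += t
                errors += 1
        if errors == 0:
            return w, b, True            # a separating hyperplane was found
    return w, b, False

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y_and = np.array([-1, -1, -1, +1])       # linearly separable
y_xor = np.array([-1, +1, +1, -1])       # not linearly separable

print(train_perceptron(X, y_and)[2])     # True
print(train_perceptron(X, y_xor)[2])     # False -- no hyperplane exists
```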
Loss & optimisation
Why is "learning is optimisation" the central idea of week 1?
Learning means picking the parameters that make predictions as close as possible to the truth. We formalise “close” with a loss function $L(\theta)$, then solve $\theta^* = \arg\min_\theta L(\theta)$. Every later technique (gradient descent, backprop, etc.) is just machinery for solving this minimisation.
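A toy sketch of that machinery (data and learning rate invented): gradient descent on a one-parameter squared loss $L(\theta) = \sum_i (\theta x_i - y_i)^2$.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.1, 3.9, 6.2])                # roughly y = 2x, with noise

theta = 0.0                                  # initial guess
lr = 0.01                                    # learning rate
for _ in range(200):
    grad = 2 * np.sum((theta * x - y) * x)   # dL/dtheta
    theta -= lr * grad                       # step downhill

print(theta)   # ~2.0 -- the argmin of the loss
```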
What is the difference between $\min$ and $\arg\min$?
$\min_\theta L(\theta)$ returns the smallest value $L$ takes. $\arg\min_\theta L(\theta)$ returns the input $\theta$ at which that smallest value is achieved. In ML we want the parameters that minimise loss, not the loss value itself — so we always write $\theta^* = \arg\min_\theta L(\theta)$.
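NumPy mirrors the distinction directly (toy loss values invented):

```python
import numpy as np

losses = np.array([5.0, 2.0, 7.0])  # loss at theta = 0, 1, 2

print(np.min(losses))     # 2.0 -> the smallest loss value
print(np.argmin(losses))  # 1   -> the theta index achieving it
```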
MLE
What does maximum likelihood estimation (MLE) ask?
Out of all possible parameter values $\theta$, MLE picks the one that would have made the observed data most probable. The data is treated as fixed (we already saw it); $\theta$ is what we sweep over.
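A brute-force sketch of the sweep, with invented readings and $\sigma = 1$ assumed: evaluate the likelihood of the fixed data at many candidate $\theta$ and keep the argmax.

```python
import numpy as np

data = np.array([22.5, 23.0, 23.5])   # observed readings (invented), now fixed
thetas = np.linspace(15, 30, 1501)    # candidate means to sweep over

def likelihood(theta, sigma=1.0):
    # Product of independent Gaussian densities p(x_i | theta).
    return np.prod(np.exp(-(data - theta) ** 2 / (2 * sigma ** 2))
                   / np.sqrt(2 * np.pi * sigma ** 2))

lik = np.array([likelihood(t) for t in thetas])
print(thetas[np.argmax(lik)])         # 23.0 -- the sample mean
```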
What is the difference between probability and likelihood, given the same formula $p(x \mid \theta)$?
Both use the same formula $p(x \mid \theta)$ but flip what is fixed:
- Probability: fix $\theta$, vary the data — “if the true mean is 20°C, what readings might we see?”
- Likelihood: fix the data, vary $\theta$ — “given we observed 23°C, which $\theta$ best explains it?”
MLE is a likelihood problem: data observed, parameter swept.
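The same Gaussian density, read both ways; the numbers echo the temperature example above, and $\sigma = 1$ is an assumption.

```python
import numpy as np

def gauss(x, theta, sigma=1.0):
    """One formula p(x | theta); what varies depends on the question."""
    return (np.exp(-(x - theta) ** 2 / (2 * sigma ** 2))
            / np.sqrt(2 * np.pi * sigma ** 2))

# Probability: theta fixed at 20, sweep possible readings x.
readings = np.array([18.0, 20.0, 23.0])
print(gauss(readings, theta=20.0))    # densities of readings we might see

# Likelihood: reading fixed at 23, sweep candidate thetas.
candidates = np.array([20.0, 22.0, 23.0])
print(gauss(23.0, theta=candidates))  # theta = 23 explains the data best
```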
Under what assumptions does maximising the likelihood reduce to minimising the sum of squared errors?
When (1) observations are independent and (2) noise around the true value is Gaussian. The Gaussian PDF contains $\exp\!\left(-\frac{(x_i - \mu)^2}{2\sigma^2}\right)$, so taking $\log$ of the product gives a sum of $-\frac{(x_i - \mu)^2}{2\sigma^2}$ terms (plus constants). Dropping constants and flipping the sign turns $\arg\max_\mu \log p(\text{data} \mid \mu)$ into $\arg\min_\mu \sum_i (x_i - \mu)^2$.
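Spelled out as a sketch of the standard derivation ($\mu$ is the parameter, $\sigma$ assumed known):

```latex
\begin{align}
\hat{\mu} &= \arg\max_{\mu} \prod_{i=1}^{n}
   \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(x_i - \mu)^2}{2\sigma^2}\right) \\
&= \arg\max_{\mu} \sum_{i=1}^{n} \left[ -\log\sqrt{2\pi\sigma^2}
   - \frac{(x_i - \mu)^2}{2\sigma^2} \right] && \text{(take } \log\text{)} \\
&= \arg\max_{\mu} \; -\sum_{i=1}^{n} (x_i - \mu)^2 && \text{(drop constants)} \\
&= \arg\min_{\mu} \sum_{i=1}^{n} (x_i - \mu)^2 && \text{(flip sign)}
\end{align}
```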
Why is it valid to apply to the likelihood before optimising?
Because $\log$ is monotonically increasing: if $a > b$ then $\log a > \log b$. The location of the maximum doesn’t move. Three concrete benefits: it turns products into sums (easier to differentiate), cancels the Gaussian’s $\exp$, and avoids floating-point underflow when multiplying many tiny probabilities.
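The underflow benefit is easy to demonstrate (probabilities invented):

```python
import numpy as np

probs = np.full(1000, 0.01)    # 1000 tiny per-observation probabilities

print(np.prod(probs))          # 0.0 -- 1e-2000 underflows double precision
print(np.sum(np.log(probs)))   # -4605.17... -- perfectly representable
```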
In the MLE-to-SSE derivation, why is it valid to drop the $\frac{1}{\sqrt{2\pi\sigma^2}}$ and $\frac{1}{2\sigma^2}$ factors?
We are optimising over $\mu$, not $\sigma$. Those factors are constants with respect to $\mu$, so they shift or scale the loss curve uniformly — every candidate $\mu$ moves by the same amount. The location of the optimum is unchanged.
Why do we flip the sign in the final step of the MLE-to-SSE derivation?
The log-likelihood is $-\frac{1}{2\sigma^2}\sum_i (x_i - \mu)^2 + \text{const}$. MLE maximises this, but ML training conventionally minimises loss. Maximising $-\sum_i (x_i - \mu)^2$ is equivalent to minimising $\sum_i (x_i - \mu)^2$, so we negate and switch from $\arg\max$ to $\arg\min$, giving $\hat{\mu} = \arg\min_\mu \sum_i (x_i - \mu)^2$.
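A numeric check of the last two cards (data and constants invented): shifting, positively scaling, or negating the objective never moves the arg-optimum.

```python
import numpy as np

mu = np.linspace(-5, 5, 1001)        # candidate parameters (step 0.01)
data = np.array([1.0, 2.0, 3.0])     # invented observations
sse = np.sum((data[:, None] - mu) ** 2, axis=0)

# Scale by -1/(2 sigma^2) and shift by a constant: the log-likelihood's shape.
log_lik = -sse / (2 * 1.5 ** 2) - 7.0

print(mu[np.argmax(log_lik)])        # 2.0 -- maximising log-likelihood...
print(mu[np.argmin(sse)])            # 2.0 -- ...equals minimising SSE
```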
A thermometer gives several noisy readings. Under MLE with Gaussian noise, what is the best estimate of the true temperature, and why?
The sample mean $\bar{x} = \frac{1}{n}\sum_i x_i$. For Gaussian noise, the MLE is exactly the sample mean — the value that minimises $\sum_i (x_i - \mu)^2$. Setting the derivative to zero gives $\hat{\mu} = \bar{x}$.
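A sketch with hypothetical readings: a grid search confirms the sample mean minimises the sum of squared errors.

```python
import numpy as np

readings = np.array([19.0, 21.0, 23.0])  # hypothetical noisy readings
mus = np.linspace(15, 25, 1001)          # candidate true temperatures

sse = np.sum((readings[:, None] - mus) ** 2, axis=0)

print(mus[np.argmin(sse)])               # 21.0
print(readings.mean())                   # 21.0 -- the MLE is the sample mean
```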
Why this matters
Why is MLE called the "bridge between probability and training"?
MLE turns a probabilistic model of the data into a concrete loss function automatically:
- Gaussian noise on real-valued targets → MSE / SSE
- Bernoulli labels (binary) → binary cross-entropy
- Categorical labels (multi-class) → cross-entropy
We don’t pick the loss arbitrarily — MLE derives it from the assumed distribution.
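One instance of the bridge, sketched with invented labels and predictions: the negative log-likelihood of independent Bernoulli observations is literally the binary cross-entropy formula.

```python
import numpy as np

y = np.array([1.0, 0.0, 1.0, 1.0])  # binary labels (invented)
p = np.array([0.9, 0.2, 0.7, 0.6])  # model's predicted P(y=1)

# Negative log-likelihood of independent Bernoulli observations...
nll = -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# ...is exactly the (unaveraged) binary cross-entropy.
print(nll)   # ~1.196
```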