The Math Every Machine Learning Engineer Must Master

July 7, 2025

Machine learning isn’t “just statistics with GPUs.” It’s a layered stack of algebra, calculus, probability, optimization, and geometry. If you don’t really get the math, you’ll never debug your model, read a new paper, or build something groundbreaking. Here’s your unified, battle-tested summary—with formulas, classic proofs, and hands-on examples. Each section links out for depth, but the big picture is all here.


1. Linear Algebra: The Language of Models

Almost every model is a function of vectors and matrices. Neural nets? Matrix multiplications. SVMs? Dot products in high-dimensional space.

Key Concepts & Formulas

| Concept | Formula / Definition | Example |
| --- | --- | --- |
| Vector Dot Product | $\mathbf{a} \cdot \mathbf{b} = \sum_i a_i b_i$ | $\mathbf{a} = [2,3]$, $\mathbf{b} = [4,1]$: $2 \cdot 4 + 3 \cdot 1 = 11$ |
| Matrix Multiplication | $(AB)_{ij} = \sum_k A_{ik} B_{kj}$ | $A = \begin{bmatrix}1&2\\3&4\end{bmatrix}$, $B = \begin{bmatrix}0&1\\1&0\end{bmatrix}$: $AB = \begin{bmatrix}2&1\\4&3\end{bmatrix}$ |
| Norm (L2) | $\lVert \mathbf{x} \rVert_2 = \sqrt{\sum_i x_i^2}$ | $[3,4] \Rightarrow 5$ |
| Eigenvalue Equation | $A\mathbf{v} = \lambda \mathbf{v}$ | PCA directions are eigenvectors |

Proof Reference: Matrix calculus in neural nets: Stanford CS231n notes
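To make these operations concrete, here is a minimal NumPy sketch using the same vectors and matrices as the examples above:

```python
import numpy as np

a, b = np.array([2, 3]), np.array([4, 1])
print(a @ b)                   # dot product: 2*4 + 3*1 = 11

A = np.array([[1, 2], [3, 4]])
B = np.array([[0, 1], [1, 0]])
print(A @ B)                   # matrix product: [[2, 1], [4, 3]]

x = np.array([3, 4])
print(np.linalg.norm(x))       # L2 norm: 5.0

vals, vecs = np.linalg.eig(A)  # eigenvalues and eigenvectors of A
print(vals)                    # each column v of vecs satisfies A v = lambda v
```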


2. Calculus: The Engine Behind Learning

Gradient descent is the workhorse. Understanding gradients is non-negotiable.

Key Concepts & Formulas

| Concept | Formula | Example |
| --- | --- | --- |
| Derivative | $f'(x) = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h}$ | $f(x) = x^2 \rightarrow f'(x) = 2x$ |
| Gradient (Vector) | $\nabla f(\mathbf{x})$ | $f(x,y) = x^2 + y^2 \rightarrow \nabla f = [2x, 2y]$ |
| Chain Rule | $\frac{dz}{dx} = \frac{dz}{dy}\frac{dy}{dx}$ | Backprop in neural nets |
| Partial Derivative | $\frac{\partial f}{\partial x}$ | $f(x,y) = x^2 y \rightarrow \frac{\partial f}{\partial x} = 2xy$ |

Classic Proof: Backpropagation is just the chain rule, vectorized. See Goodfellow et al., Deep Learning Book, Chapter 6.
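To see the definitions in action, here is a small sketch that checks the analytic gradient of $f(x,y) = x^2 + y^2$ against a finite-difference approximation (the `numerical_grad` helper is purely illustrative):

```python
import numpy as np

def f(v):
    x, y = v
    return x**2 + y**2

def analytic_grad(v):
    x, y = v
    return np.array([2 * x, 2 * y])

def numerical_grad(f, v, h=1e-5):
    # central differences: (f(v + h*e_i) - f(v - h*e_i)) / (2h)
    g = np.zeros_like(v)
    for i in range(len(v)):
        e = np.zeros_like(v)
        e[i] = h
        g[i] = (f(v + e) - f(v - e)) / (2 * h)
    return g

v = np.array([1.0, -2.0])
print(analytic_grad(v))        # [ 2. -4.]
print(numerical_grad(f, v))    # approximately the same
```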


3. Probability: Modeling Uncertainty

You can’t build robust models if you don’t understand uncertainty.

Key Concepts & Formulas

| Concept | Formula | Example |
| --- | --- | --- |
| Expectation | $\mathbb{E}[X] = \sum_x x\,P(x)$ (discrete) | Dice roll: $\frac{1}{6}\sum_{i=1}^{6} i = 3.5$ |
| Variance | $\mathrm{Var}(X) = \mathbb{E}[(X - \mu)^2]$ | Coin flip: $\mu = 0.5$, $\mathrm{Var} = 0.25$ |
| Bayes’ Theorem | $P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$ | Spam filtering |
| Conditional Prob. | $P(A \mid B) = \frac{P(A \cap B)}{P(B)}$ | Probability of rain given clouds |
| Entropy | $H(X) = -\sum_x P(x)\log P(x)$ | Coin flip: $H = 1$ bit |

Classic Example: Naive Bayes classifier:

P(spamwords)P(wordsspam)P(spam)P(\text{spam}|\text{words}) \propto P(\text{words}|\text{spam})P(\text{spam})

See Wikipedia for proof
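The table’s examples are easy to verify numerically; this sketch reproduces the die expectation, the coin-flip variance, and the coin-flip entropy:

```python
import numpy as np

# Expectation of a fair die: sum over x of x * P(x)
faces = np.arange(1, 7)
print((faces * (1 / 6)).sum())                    # 3.5

# Variance of a fair coin, X in {0, 1} with p = 0.5
p = 0.5
mu = p * 1 + (1 - p) * 0
print(p * (1 - mu)**2 + (1 - p) * (0 - mu)**2)    # 0.25

# Entropy of a fair coin, in bits
probs = np.array([0.5, 0.5])
print(-(probs * np.log2(probs)).sum())            # 1.0
```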


4. Optimization: Finding the Best Model

All of machine learning is an optimization problem.

Key Concepts & Formulas

| Concept | Formula | Example |
| --- | --- | --- |
| Loss Function | $L(\theta) = \frac{1}{n}\sum_i \ell(y_i, f(x_i;\theta))$ | MSE: $\ell(y, \hat{y}) = (y - \hat{y})^2$ |
| Gradient Descent | $\theta \leftarrow \theta - \eta \nabla_\theta L(\theta)$ | Linear regression step |
| Convexity | $f(\alpha x + (1-\alpha)y) \leq \alpha f(x) + (1-\alpha)f(y)$ for $\alpha \in [0,1]$ | Ensures one global minimum |

Classic Proof: See Boyd & Vandenberghe, Convex Optimization, Chapter 2
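As a minimal illustration of the update rule, the sketch below runs gradient descent on the convex function $f(\theta) = (\theta - 3)^2$; the function and step size are arbitrary choices for the demo:

```python
# Gradient descent on f(theta) = (theta - 3)^2, whose gradient is 2 * (theta - 3).
# The unique minimizer is theta = 3.
theta, eta = 0.0, 0.1
for _ in range(50):
    grad = 2 * (theta - 3)
    theta = theta - eta * grad

print(theta)   # close to 3.0
```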


5. Statistics: Evaluating Models

A model that fits the training data is meaningless if you can’t measure its generalization.

Key Concepts & Formulas

| Concept | Formula | Example |
| --- | --- | --- |
| Mean | $\bar{x} = \frac{1}{n}\sum x_i$ | $[1,2,3] \to 2$ |
| Standard Deviation | $\sqrt{\frac{1}{n}\sum (x_i - \bar{x})^2}$ | $[1,2,3] \to \sqrt{2/3}$ |
| Confidence Interval | $\bar{x} \pm z^* \frac{\sigma}{\sqrt{n}}$ | 95% CI for a sample mean |
| Precision / Recall / F1 | $\mathrm{F1} = 2\,\frac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}}$ | Classification metrics |

Reference: Precision, Recall, F1 score—A hands-on explanation
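Here is a quick sketch that reproduces the table’s numbers (the 95% interval uses the normal approximation $z^* \approx 1.96$, which is loose for a sample of three but mirrors the formula):

```python
import numpy as np

x = np.array([1, 2, 3])
mean = x.mean()                              # 2.0
std = x.std()                                # population std: sqrt(2/3) ≈ 0.816
half_width = 1.96 * std / np.sqrt(len(x))
print(mean, std, (mean - half_width, mean + half_width))

precision, recall = 0.7, 0.5
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 2))                          # 0.58
```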


6. Geometry: High-Dimensional Intuition

Why does “distance” become meaningless in high dimensions?

  • Curse of Dimensionality: In high dimensions, nearly all of a cube’s volume lies near its corners, far from the center.
  • Angle Between Random Vectors: In high dimensions, most randomly chosen vectors are nearly orthogonal.

Example:

Let $d$ be the dimension and $\cos(\theta) = \frac{\mathbf{a} \cdot \mathbf{b}}{\|\mathbf{a}\|\,\|\mathbf{b}\|}$ for randomly chosen $\mathbf{a}, \mathbf{b} \in \mathbb{R}^d$. As $d \to \infty$, the expected cosine tends to 0.

See Verleysen & François, “The Curse of Dimensionality in Data Mining and Time Series Prediction.”
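A quick simulation makes the near-orthogonality claim tangible; the sketch below averages $|\cos\theta|$ over random Gaussian vector pairs in increasing dimensions:

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_abs_cosine(d, trials=2000):
    # cosine similarity between random Gaussian vector pairs in R^d
    a = rng.standard_normal((trials, d))
    b = rng.standard_normal((trials, d))
    cos = (a * b).sum(axis=1) / (np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1))
    return np.abs(cos).mean()

for d in (2, 10, 100, 1000):
    print(d, round(mean_abs_cosine(d), 3))   # shrinks toward 0 as d grows
```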


7. Information Theory: Why ML Works

  • KL Divergence: $D_{\mathrm{KL}}(P \,\|\, Q) = \sum_x P(x)\log \frac{P(x)}{Q(x)}$. Used for regularization, variational inference.

  • Cross-Entropy Loss: $L = -\sum y \log \hat{y}$. Most-used loss for classification tasks.

Reference: Elements of Information Theory, Cover & Thomas
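Both quantities are one-liners to compute; this sketch assumes strictly positive probabilities so the logarithms are defined:

```python
import numpy as np

def kl_divergence(p, q):
    # D_KL(P || Q) = sum_x P(x) * log(P(x) / Q(x)), in nats
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sum(p * np.log(p / q))

def cross_entropy(y, y_hat):
    # L = -sum_k y_k * log(y_hat_k), e.g. for a one-hot label y
    return -np.sum(np.asarray(y, dtype=float) * np.log(np.asarray(y_hat, dtype=float)))

print(kl_divergence([0.5, 0.5], [0.9, 0.1]))       # > 0; equals 0 only when P == Q
print(cross_entropy([1, 0, 0], [0.7, 0.2, 0.1]))   # -log(0.7) ≈ 0.357
```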


Classic Proof: Why Gradient Descent Works

Let $f$ be differentiable and convex. At each iteration:

$x_{k+1} = x_k - \eta \nabla f(x_k)$

As long as $\eta$ is small enough, $f(x_{k+1}) < f(x_k)$, i.e., you move “downhill.” See: Convex optimization convergence proof
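One standard way to make “small enough” precise is to additionally assume that $\nabla f$ is $L$-Lipschitz; the descent lemma then gives

$f(x_{k+1}) \le f(x_k) - \eta\left(1 - \frac{L\eta}{2}\right)\|\nabla f(x_k)\|^2,$

so any step size $\eta < 2/L$ strictly decreases $f$ whenever $\nabla f(x_k) \neq 0$.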




8. Bayes’ Theorem: The Bedrock of Probabilistic ML

Bayes’ theorem lets us reverse conditional probabilities—fundamental for all probabilistic reasoning and generative models.

Formula

$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$
  • $P(A \mid B)$: Probability of A given B (posterior)
  • $P(B \mid A)$: Probability of B given A (likelihood)
  • $P(A)$: Probability of A (prior)
  • $P(B)$: Probability of B (evidence)

Example (Medical Test)

  • Disease prevalence: $P(\text{disease}) = 0.01$
  • True positive rate: $P(\text{pos} \mid \text{disease}) = 0.9$
  • False positive rate: $P(\text{pos} \mid \neg\text{disease}) = 0.1$
  • $P(\text{pos}) = 0.9 \cdot 0.01 + 0.1 \cdot 0.99 = 0.108$

$P(\text{disease} \mid \text{pos}) = \frac{0.9 \cdot 0.01}{0.108} \approx 0.083$

Reference: Khan Academy: Bayes’ theorem
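The arithmetic above is easy to reproduce; here is a minimal sketch using the same numbers:

```python
# Medical-test example: P(disease | positive) via Bayes' theorem.
p_disease = 0.01
p_pos_given_disease = 0.9       # true positive rate
p_pos_given_healthy = 0.1       # false positive rate

p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)
posterior = p_pos_given_disease * p_disease / p_pos
print(round(p_pos, 3), round(posterior, 3))   # 0.108, 0.083
```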


9. Bayesian Classification: Naive Bayes

Naive Bayes is the “hello world” of probabilistic classifiers—fast, robust, surprisingly effective when features are (nearly) independent.

Core Formula

$P(C \mid x_1, \dots, x_n) \propto P(C) \prod_{i=1}^{n} P(x_i \mid C)$
  • $C$: class label (e.g., spam/not spam)
  • $x_i$: feature $i$ (e.g., word appears or not)

Example (Text Classification)

Suppose “free” and “win” are features:

$P(\text{spam} \mid \text{free, win}) \propto P(\text{spam})\,P(\text{free} \mid \text{spam})\,P(\text{win} \mid \text{spam})$

The class with the highest posterior wins.

Reference: Wikipedia: Naive Bayes classifier
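To show the product-of-likelihoods idea end to end, here is a toy sketch; the prior and word probabilities are invented purely for illustration:

```python
# Hypothetical class priors and per-class word probabilities.
priors = {"spam": 0.4, "ham": 0.6}
p_word = {
    "spam": {"free": 0.30, "win": 0.20},
    "ham":  {"free": 0.02, "win": 0.01},
}

def unnormalized_posterior(c, words):
    # P(C) * product_i P(x_i | C), up to the normalizing constant
    score = priors[c]
    for w in words:
        score *= p_word[c][w]
    return score

words = ["free", "win"]
scores = {c: unnormalized_posterior(c, words) for c in priors}
print(max(scores, key=scores.get), scores)   # "spam" wins for these numbers
```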


10. Central Limit Theorem: Why Averages Work

The Central Limit Theorem (CLT) is why “averages” make sense, and why statistical inference is possible in ML.

Statement

Given $n$ independent, identically distributed random variables $X_1, \dots, X_n$ with mean $\mu$ and variance $\sigma^2$:

$\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i \;\longrightarrow\; N\!\left(\mu, \frac{\sigma^2}{n}\right)$

as $n \to \infty$.

Meaning: The sum (or average) of many independent random variables approaches a normal distribution, regardless of the original variable’s distribution.

Example (Dice Roll)

  • Mean: $\mu = 3.5$ (for a fair die)
  • Simulate rolling 100 dice and take the average; repeat many times, and the averages will be approximately normally distributed around 3.5.

Reference: StatTrek: Central Limit Theorem
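A simulation along the lines of the dice example (10,000 repetitions of averaging 100 rolls, chosen arbitrarily) shows the concentration the CLT predicts:

```python
import numpy as np

rng = np.random.default_rng(0)

# Average of 100 fair-die rolls, repeated 10,000 times.
rolls = rng.integers(1, 7, size=(10_000, 100))
averages = rolls.mean(axis=1)

print(averages.mean())   # ≈ 3.5
print(averages.std())    # ≈ sigma / sqrt(n) = sqrt(35/12) / 10 ≈ 0.17
```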


11. Logistic Regression: Classification with Probabilities

Unlike linear regression, logistic regression is made for predicting class probabilities—e.g., spam or not spam, fraud or not fraud.

Model Formula

$P(y = 1 \mid x) = \sigma(w^T x + b) = \frac{1}{1 + e^{-(w^T x + b)}}$
  • $w^T x + b$ is the linear score (logit)
  • $\sigma$ is the sigmoid function

Loss Function (Cross-Entropy):

$L = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \right]$

Example

Predicting probability of a customer buying a product:

  • $x$ = age, income
  • $w$ = weights learned by fitting the data
  • Output: probability between 0 and 1

Reference: StatQuest: Logistic Regression
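Here is a from-scratch sketch on synthetic data (the features, generating weights, learning rate, and iteration count are arbitrary choices for the demo); it fits the model by gradient descent on the cross-entropy loss above:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Synthetic data: two features, label generated from a noisy linear score.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 2))
y = (X @ np.array([1.5, -1.0]) + 0.5 + 0.3 * rng.standard_normal(200) > 0).astype(float)

w, b, eta = np.zeros(2), 0.0, 0.1
for _ in range(500):
    p = sigmoid(X @ w + b)              # predicted P(y = 1 | x)
    grad_w = X.T @ (p - y) / len(y)     # gradient of the cross-entropy loss w.r.t. w
    grad_b = (p - y).mean()             # gradient w.r.t. b
    w, b = w - eta * grad_w, b - eta * grad_b

print(w, b)                             # roughly aligned with the generating weights
print(sigmoid(X[:5] @ w + b))           # predicted probabilities in (0, 1)
```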


12. Related Regression Methods

a. Linear Regression

$\hat{y} = w^T x + b$
  • Predicts a continuous output.
  • Solved by minimizing Mean Squared Error (MSE):
$L = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$

b. Ridge Regression (L2 regularization)

$L = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \lambda \|w\|_2^2$
  • Penalizes large weights; reduces overfitting.

c. Lasso Regression (L1 regularization)

$L = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \lambda \|w\|_1$
  • Shrinks some weights to zero (feature selection).

d. Polynomial Regression

$\hat{y} = w_0 + w_1 x + w_2 x^2 + \dots + w_d x^d$
  • Fits non-linear data by adding higher-degree terms.

Reference: Elements of Statistical Learning, Section 3.4
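For the linear and polynomial cases, a couple of NumPy calls are enough to fit the MSE objective in closed form (the data below is synthetic and only for illustration):

```python
import numpy as np

# Ordinary least squares for y = w*x + b via least squares (lstsq).
rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=100)
y = 2.0 * x + 1.0 + 0.2 * rng.standard_normal(100)

A = np.column_stack([x, np.ones_like(x)])        # design matrix [x, 1]
(w, b), *_ = np.linalg.lstsq(A, y, rcond=None)
print(round(w, 2), round(b, 2))                  # ≈ 2.0, 1.0

# Polynomial regression: same idea with higher-degree terms.
coeffs = np.polyfit(x, y, deg=3)                 # highest-degree coefficient first
print(np.round(coeffs, 2))                       # ≈ [0, 0, 2, 1]
```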


13. Regularization: Taming Overfitting

Regularization prevents models from fitting noise by penalizing complexity.

Formula (General)

$L_{\text{reg}} = L_{\text{orig}} + \lambda R(w)$
  • $R(w) = \|w\|_2^2$ (Ridge) or $\|w\|_1$ (Lasso)

Example

Suppose you fit a model to 100 features, but only 3 matter. Lasso regression shrinks the rest toward zero, helping interpretability and robustness.
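That scenario is easy to simulate; the sketch below assumes scikit-learn is available, and the regularization strength `alpha=0.1` is an arbitrary illustrative choice:

```python
import numpy as np
from sklearn.linear_model import Lasso

# 100 features, only the first 3 carry signal.
rng = np.random.default_rng(0)
X = rng.standard_normal((300, 100))
true_w = np.zeros(100)
true_w[:3] = [3.0, -2.0, 1.5]
y = X @ true_w + 0.1 * rng.standard_normal(300)

model = Lasso(alpha=0.1).fit(X, y)
print((np.abs(model.coef_) > 1e-6).sum())   # count of non-zero weights (typically small)
print(np.round(model.coef_[:5], 2))         # the informative weights survive shrinkage
```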


14. Multiclass & Multinomial Regression

When your target has more than two classes, you need generalizations.

a. Softmax Regression (Multinomial Logistic Regression)

$P(y = k \mid x) = \frac{e^{w_k^T x}}{\sum_{j=1}^{K} e^{w_j^T x}}$

for $k = 1, \dots, K$ classes.

b. One-vs-Rest (OvR) Strategy

Fit one binary classifier per class; pick the one with highest confidence.

Example

Classifying handwritten digits (0–9):

  • Input: pixel features
  • Output: probability for each digit
  • Prediction: class with highest probability
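The softmax itself is a few lines; the logits below are hypothetical scores $w_k^T x$ for ten classes, just to show the normalization:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

logits = np.array([1.2, 0.3, -0.5, 2.0, 0.0, -1.0, 0.7, 0.1, -0.2, 1.5])
probs = softmax(logits)
print(np.round(probs, 3))   # probabilities over the 10 classes, summing to 1
print(probs.argmax())       # the predicted class
```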

Summary Table: All-In-One Math Cheat Sheet

| Area | Essential Formula | Example (with numbers) |
| --- | --- | --- |
| Linear Algebra | $\mathbf{w}^T\mathbf{x} + b$ | $[2,3]\cdot[4,1] + 1 = 12$ |
| Calculus | $\frac{d}{dx}x^2 = 2x$ | $x = 3 \rightarrow 6$ |
| Probability | $P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$ | $0.8 \cdot 0.1 / 0.2 = 0.4$ |
| Optimization | $\theta \leftarrow \theta - \eta \nabla L$ | $0.2 - 0.01 \cdot 3 = 0.17$ |
| Statistics | $\mathrm{F1} = 2\frac{pr}{p+r}$ | $p = 0.7, r = 0.5 \to \mathrm{F1} \approx 0.58$ |
| Info Theory | $H(X) = -\sum p \log p$ | $p = 0.5, 0.5 \to 1$ bit |

Concrete Example: Linear Regression

Given data $(x_1, y_1), \dots, (x_n, y_n)$:

  • Model: $y = wx + b$
  • Loss: $L = \frac{1}{n}\sum_i (y_i - (wx_i + b))^2$
  • Gradient w.r.t. $w$: $\frac{\partial L}{\partial w} = -\frac{2}{n}\sum_i x_i\,(y_i - (wx_i + b))$
  • Gradient Descent Step: $w \leftarrow w - \eta \frac{\partial L}{\partial w}$
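Putting the four bullets together, here is a minimal sketch that runs the update rule on synthetic data (true parameters $w = 2$, $b = 0.5$ chosen for the demo):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=50)
y = 2.0 * x + 0.5 + 0.1 * rng.standard_normal(50)

w, b, eta = 0.0, 0.0, 0.1
for _ in range(1000):
    y_hat = w * x + b
    grad_w = -2.0 * np.mean(x * (y - y_hat))   # dL/dw from the formula above
    grad_b = -2.0 * np.mean(y - y_hat)         # dL/db
    w, b = w - eta * grad_w, b - eta * grad_b

print(round(w, 2), round(b, 2))                # ≈ 2.0, 0.5
```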



Mastering these fundamentals is the difference between tweaking models and inventing new ones. If you want to actually innovate—or simply outcompete the crowd—get your hands dirty with the real math.

Subscribe for deeper breakdowns, real-world machine learning case studies, and to join a network of elite builders and visionaries. Or, let’s build together—contact me.
