Applied Mathematics in Machine Learning
From First Principles to Optimization
Machine learning is fundamentally an application of mathematics—especially linear algebra, probability theory, and optimization.
In this article, we derive key concepts from first principles.
1. Problem Setup
We define a supervised learning dataset:
$$ \mathcal{D} = \{(x^{(i)}, y^{(i)})\}_{i=1}^{m} $$
where:
$$ x^{(i)} \in \mathbb{R}^n $$
$$ y^{(i)} \in \mathbb{R} $$
We want to learn a function:
$$ h_\theta : \mathbb{R}^n \rightarrow \mathbb{R} $$
2. Linear Model
We define:
$$ h_\theta(x) = \theta^T x + b $$
Expanded:
$$ h_\theta(x) = \sum_{j=1}^{n} \theta_j x_j + b $$
To keep later formulas compact, we absorb the bias into $\theta$ by appending a constant feature $x_0 = 1$, so that $h_\theta(x) = \theta^T x$.
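As a concrete sketch (the function name `h` and all values here are illustrative, not from the derivation), the linear model in NumPy:

```python
import numpy as np

def h(theta, b, x):
    """Linear hypothesis h_theta(x) = theta^T x + b."""
    return float(np.dot(theta, x) + b)

print(h(np.array([1.0, 2.0]), 0.5, np.array([3.0, 4.0])))  # 1*3 + 2*4 + 0.5 = 11.5
```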
3. Loss Function
Mean Squared Error:
$$ J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 $$
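A minimal NumPy sketch of this loss (names and data are illustrative):

```python
import numpy as np

def mse(theta, b, X, y):
    """J(theta) = (1/m) * sum_i (h_theta(x^(i)) - y^(i))^2."""
    return float(np.mean((X @ theta + b - y) ** 2))

X = np.array([[1.0], [2.0], [3.0]])
y = np.array([2.0, 4.0, 6.0])
print(mse(np.array([2.0]), 0.0, X, y))  # perfect fit: 0.0
print(mse(np.array([1.0]), 0.0, X, y))  # (1 + 4 + 9) / 3
```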
4. Optimization Objective
We solve:
$$ \theta^* = \arg\min_\theta J(\theta) $$
5. Gradient Descent
Gradient:
$$ \frac{\partial J}{\partial \theta} = \frac{2}{m} \sum_{i=1}^{m} x^{(i)} \left( \theta^T x^{(i)} - y^{(i)} \right) $$
Update rule:
$$ \theta := \theta - \alpha \nabla_\theta J(\theta) $$
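The per-example gradient and the update rule above can be sketched as follows; the dataset, learning rate, and iteration count are illustrative choices, not prescribed by the derivation:

```python
import numpy as np

def grad(theta, X, y):
    """dJ/dtheta = (2/m) * sum_i x^(i) * (theta^T x^(i) - y^(i))."""
    m = len(y)
    g = np.zeros_like(theta)
    for i in range(m):
        g += X[i] * (np.dot(theta, X[i]) - y[i])
    return (2.0 / m) * g

# Tiny synthetic problem with y = 3x, so theta should approach 3.
X = np.array([[1.0], [2.0], [3.0]])
y = np.array([3.0, 6.0, 9.0])
theta = np.zeros(1)
alpha = 0.05
for _ in range(200):
    theta = theta - alpha * grad(theta, X, y)
print(theta)  # converges to [3.]
```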
6. Vectorized Form
$$ J(\theta) = \frac{1}{m} (X\theta - y)^T (X\theta - y) $$
Gradient:
$$ \nabla_\theta J(\theta) = \frac{2}{m} X^T (X\theta - y) $$
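The vectorized gradient is easy to sanity-check against finite differences; this sketch assumes the same $J(\theta)$ as above (central differences are exact up to rounding for a quadratic loss):

```python
import numpy as np

def J(theta, X, y):
    """Vectorized loss (1/m)(X theta - y)^T (X theta - y)."""
    r = X @ theta - y
    return float(r @ r) / len(y)

def grad_vec(theta, X, y):
    """Vectorized gradient (2/m) X^T (X theta - y)."""
    return (2.0 / len(y)) * X.T @ (X @ theta - y)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
y = rng.normal(size=5)
theta = rng.normal(size=3)
g = grad_vec(theta, X, y)

# Compare each partial derivative with a central-difference estimate.
eps = 1e-6
for j in range(3):
    e = np.zeros(3); e[j] = eps
    fd = (J(theta + e, X, y) - J(theta - e, X, y)) / (2 * eps)
    assert abs(fd - g[j]) < 1e-5
print("gradient check passed")
```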
7. Normal Equation
Closed-form solution, valid when $X^T X$ is invertible:
$$ \theta^* = (X^T X)^{-1} X^T y $$
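A sketch of the closed form on noise-free synthetic data (all values illustrative). Solving the linear system $(X^T X)\theta = X^T y$ is preferred over forming the inverse explicitly, for numerical stability:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 3))
true_theta = np.array([1.0, -2.0, 0.5])
y = X @ true_theta                       # noise-free, so the fit is exact

# Solve (X^T X) theta = X^T y rather than inverting X^T X.
theta = np.linalg.solve(X.T @ X, X.T @ y)
print(theta)  # recovers [1.0, -2.0, 0.5]
```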
8. Regularization
L2 regularization:
$$ J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 + \lambda \|\theta\|_2^2 $$
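Setting the regularized gradient to zero also yields a closed form; under the $1/m$ scaling used here it is $\theta = (X^T X + m\lambda I)^{-1} X^T y$. A sketch (function name and data are illustrative):

```python
import numpy as np

def ridge(X, y, lam):
    """Minimizer of (1/m)||X theta - y||^2 + lam * ||theta||_2^2,
    i.e. theta = (X^T X + m * lam * I)^{-1} X^T y."""
    m, n = X.shape
    return np.linalg.solve(X.T @ X + m * lam * np.eye(n), X.T @ y)

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 4))
y = rng.normal(size=30)

# lam = 0 recovers ordinary least squares; lam > 0 shrinks the weights.
ols = ridge(X, y, 0.0)
shrunk = ridge(X, y, 1.0)
assert np.linalg.norm(shrunk) < np.linalg.norm(ols)
```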
9. Probabilistic View
Assume:
$$ y = \theta^T x + \epsilon $$
where:
$$ \epsilon \sim \mathcal{N}(0, \sigma^2) $$
Likelihood:
$$ p(y|x,\theta) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y - \theta^T x)^2}{2\sigma^2}\right) $$
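For an i.i.d. dataset, maximizing this likelihood over $\theta$ is equivalent to minimizing the mean squared error; taking the negative log of the product of per-example likelihoods makes this explicit:

$$ -\log \prod_{i=1}^{m} p(y^{(i)} | x^{(i)}, \theta) = \frac{m}{2} \log(2\pi\sigma^2) + \frac{1}{2\sigma^2} \sum_{i=1}^{m} \left( y^{(i)} - \theta^T x^{(i)} \right)^2 $$

The first term does not depend on $\theta$, so the maximum-likelihood estimate coincides with the minimizer of $J(\theta)$.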
10. Neural Networks Extension
$$ a^{(l)} = \sigma(W^{(l)} a^{(l-1)} + b^{(l)}) $$
Backpropagation applies the chain rule layer by layer, through the pre-activation $z^{(l)} = W^{(l)} a^{(l-1)} + b^{(l)}$:
$$ \frac{\partial J}{\partial W^{(l)}} = \frac{\partial J}{\partial a^{(l)}} \, \frac{\partial a^{(l)}}{\partial z^{(l)}} \, \frac{\partial z^{(l)}}{\partial W^{(l)}} $$
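A minimal sketch of the forward and backward passes for one hidden layer with sigmoid activations and a squared loss; all shapes, names, and values are illustrative. The backpropagated gradient is verified against a finite-difference estimate:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(3)
x = rng.normal(size=(2, 1))              # one input with n = 2 features
y = np.array([[0.7]])                    # scalar target
W1, b1 = rng.normal(size=(3, 2)), np.zeros((3, 1))
W2, b2 = rng.normal(size=(1, 3)), np.zeros((1, 1))

def forward(W1, b1, W2, b2):
    """Two-layer forward pass; returns activations and squared loss."""
    a1 = sigmoid(W1 @ x + b1)
    a2 = sigmoid(W2 @ a1 + b2)
    return a1, a2, float((a2 - y) ** 2)

a1, a2, loss = forward(W1, b1, W2, b2)

# Backward pass: dJ/dz = dJ/da * sigma'(z), with sigma'(z) = a * (1 - a),
# and dJ/dW^(l) = dJ/dz^(l) @ (a^(l-1))^T.
dz2 = 2 * (a2 - y) * a2 * (1 - a2)
dW2 = dz2 @ a1.T
dz1 = (W2.T @ dz2) * a1 * (1 - a1)
dW1 = dz1 @ x.T

# Finite-difference check of one entry of dW1.
eps = 1e-6
E = np.zeros_like(W1); E[0, 0] = eps
fd = (forward(W1 + E, b1, W2, b2)[2] - forward(W1 - E, b1, W2, b2)[2]) / (2 * eps)
assert abs(fd - dW1[0, 0]) < 1e-6
print("backprop check passed")
```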
11. General Learning Objective
$$ \min_{\theta} \; \mathbb{E}_{(x,y)\sim \mathcal{D}} \left[ \ell(h_\theta(x), y) \right] $$
12. Key Insight
$$ \text{Machine Learning} = \text{Optimization} + \text{Statistics} + \text{Linear Algebra} $$
Conclusion
We built the mathematical foundation of machine learning from:
- Linear models
- Loss functions
- Optimization
- Probability
This is the backbone of modern AI systems.

