Machine Learning — TiTilda Notes

Contents

Supervised Learning

Given training data D = \{(x, t)\} with input examples (x) and desired outputs (t), the goal is to find an approximation of an unknown function f that can generalize to new data.

Supervised learning is useful when:

To define a supervised learning problem, we need:

If f is known, the problem is function approximation. When f is unknown, we estimate it from data using a loss function that accounts for noise.

Model Evaluation

A model is evaluated based on its expected loss, which is the average loss over all possible data:

\mathbb{E}[L] = \int \int L(t, y(x)) \, p(x, t) dx dt

where:

Knowing p(x, t) would be equivalent to knowing f (the generating function). The joint distribution is approximated from training data.

The model y(x) that minimizes the expected loss is the conditional mean computed as: y^*(x) = \mathbb{E}[t|x] = \int t \, p(t|x) \, dt

Loss Functions

A common loss function is the Minkowski loss: L(t, y(x)) = |t - y(x)|^q

where q is a parameter that controls how errors are penalized.
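As a quick illustration, the Minkowski loss can be computed for different values of q (a minimal numpy sketch; the sample vectors are made up for this example):

```python
import numpy as np

def minkowski_loss(t, y, q):
    """Minkowski loss |t - y|^q, averaged over the samples."""
    return np.mean(np.abs(t - y) ** q)

t = np.array([1.0, 2.0, 4.0])  # targets (example values)
y = np.array([1.5, 2.0, 3.0])  # predictions (example values)

# q = 2 recovers the mean squared error, q = 1 the mean absolute error.
mse = minkowski_loss(t, y, q=2)  # (0.25 + 0 + 1) / 3
mae = minkowski_loss(t, y, q=1)  # (0.5 + 0 + 1) / 3
```

Larger q penalizes large errors more heavily, while q = 1 is more robust to outliers.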

Bias-Variance

The tradeoff:

More training samples reduce the variance for a given model complexity; the noise component of the error is irreducible.

Approaches to Supervised Learning

To solve supervised learning problems, there are three main approaches:

Generative Approach

The generative approach models how the data was generated by learning the joint distribution p(x, t) = p(t|x)p(x).

Once the model is learned, it’s possible to:

This approach is powerful but often requires modeling complex distributions, which can be difficult and computationally expensive.

Discriminative Approach

The discriminative approach focuses directly on modeling the relationship between inputs and outputs by learning the conditional distribution p(t|x), without modeling the input distribution p(x).

Predictions use the conditional mean: \mathbb{E}[t|x] = \int t \, p(t|x) \, dt

This approach is more efficient for prediction tasks, as it doesn’t require modeling the full data distribution.

Direct Approach

The direct approach avoids probability modeling. Instead of learning distributions, it directly minimizes a loss function on the training data to find the best mapping from inputs to targets.

For prediction tasks, you just need the mapping x \to t to be accurate.

This approach is often more straightforward and computationally efficient.

Linear Regression

The goal is to learn a function that maps input features x to target output t. Linear regression models assume this relationship is linear in the parameters (though features can be nonlinearly transformed).

The solution can be found analytically.

y(x, w) = w_0 + \sum_{j=1}^{D-1} w_j x_j = w^T \phi(x)

where:

Linear models are used:

Ordinary Least Squares (OLS)

The Ordinary Least Squares (OLS) is a direct method that finds the weights that minimize the error on the training data.

Since we cannot compute the true expected loss (the joint distribution is unknown), we approximate it with the empirical loss computed from the N training samples:

L(w) = \frac{1}{2} \sum_{n=1}^N (y(x_n, w) - t_n)^2

This is also called residual sum of squares (RSS) or sum of squared errors (SSE), and can also be written as:

RSS(w) = \|\epsilon\|_2^2 = \sum_{n=1}^N \epsilon_n^2

where:

The p-norm of the residuals generalizes this loss function: it assigns a scalar error measure to the residual vector.

Matrix Formulation:

L(w) = \frac{1}{2} \text{RSS}(w) = \frac{1}{2} (t - \Phi w)^T (t - \Phi w)

where:

Solution

The solution is found by setting the gradient of the loss to zero:

\frac{\partial L(w)}{\partial w} = - \Phi^T (t - \Phi w) = 0

The Hessian (second derivative) is: \frac{\partial^2 L(w)}{\partial w \partial w^T} = \Phi^T \Phi

This is positive definite if \Phi has full column rank, ensuring a unique global minimum.

The point where the gradient is zero corresponds to the minimum of the loss function, and in that point there is no correlation between the residuals and the features. This means that the model has captured all the linear relationships in the data, and any remaining error is due to noise.

Solving the gradient for w: \hat{w}_{OLS} = (\Phi^T \Phi)^{-1} \Phi^T t

This is the Ordinary Least Squares (OLS) solution, which gives the best linear fit to the training data in terms of minimizing the sum of squared errors.

To work properly, OLS requires:

  1. More samples than features: N \geq M
  2. No redundant features: Features must be linearly independent, otherwise the matrix is singular and cannot be inverted.
  3. Computational budget: Inversion costs O(M^3)
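The closed-form solution above can be sketched in a few lines of numpy (the toy data and the feature map are assumptions of this example; `np.linalg.solve` is used instead of an explicit inverse for numerical stability):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: t = 2 + 3x + Gaussian noise (assumed for the example)
N = 100
x = rng.uniform(-1, 1, size=N)
t = 2.0 + 3.0 * x + rng.normal(0, 0.1, size=N)

# Design matrix Phi with a bias column (identity features here)
Phi = np.column_stack([np.ones(N), x])

# OLS solution: w = (Phi^T Phi)^{-1} Phi^T t
w_ols = np.linalg.solve(Phi.T @ Phi, Phi.T @ t)
```

With enough low-noise samples, `w_ols` recovers weights close to the generating values (2, 3).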

Linear Model Evaluation

To evaluate the performance of a linear regression model, we use the Root Mean Squared Error (RMSE), which is derived from the residual sum of squares (RSS):

E_{\text{RMS}} = \sqrt{\frac{2\,\text{RSS}(\hat{w})}{N}}

Variance

To estimate the variance of the noise, we can use the residuals from the fitted model: \hat{\sigma}^2 = \frac{1}{N - M} \sum_{n=1}^N (t_n - \hat{w}^T \phi(x_n))^2

where:

This means that more samples reduce the variance of the noise estimate, while more parameters increase it.

Based on the Gauss-Markov theorem, the OLS estimator is the Best Linear Unbiased Estimator, meaning it has the lowest variance among all linear unbiased estimators.

Stochastic Gradient Descent (SGD)

This is an iterative optimization algorithm that updates the weights incrementally using one sample at a time, rather than computing the gradient over the entire dataset.

The loss function can be written as the sum of the loss function for each sample L(w) = \sum_{n=1}^N L(x_n).

w^{(k+1)} = w^{(k)} - \alpha^{(k)} \nabla L(x_n)

For squared loss: w^{(k+1)} = w^{(k)} - \alpha^{(k)} (w^{(k)T} \phi(x_n) - t_n) \phi(x_n)

where:

At each iteration, the algorithm performs the following steps:

  1. Compute prediction error: (w^T \phi(x_n) - t_n)
  2. Compute error gradient: Multiply by the feature vector
  3. Move weights in opposite direction with a step defined by the learning rate \alpha^{(k)}: w := w - \alpha^{(k)} \times (\text{error gradient})

This method is more efficient for large datasets, as it avoids the costly matrix inversion required by OLS and allows online learning. However, each update is affected by the noise of a single sample.
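The update rule above can be sketched as follows (the toy data, the decay schedule, and the number of epochs are assumptions of this example):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data: t = 2 + 3x + Gaussian noise (assumed for the example)
N = 200
x = rng.uniform(-1, 1, size=N)
t = 2.0 + 3.0 * x + rng.normal(0, 0.1, size=N)
Phi = np.column_stack([np.ones(N), x])

w = np.zeros(2)
alpha0 = 0.1
k = 0
for epoch in range(50):
    for n in rng.permutation(N):       # shuffle samples each epoch
        k += 1
        alpha = alpha0 / (1 + 0.001 * k)  # decaying learning rate
        err = w @ Phi[n] - t[n]           # prediction error on one sample
        w -= alpha * err * Phi[n]         # step against the gradient
```

After enough passes, `w` approaches the OLS weights without ever forming or inverting \Phi^T \Phi.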

SGD converges to the OLS solution only when the learning rate decays over time and satisfies the Robbins-Monro conditions:

Geometric Interpretation

The OLS solution projects the target vector t onto the feature space spanned by columns of \Phi:

\hat{t} = \Phi \hat{w}_{OLS} = \underbrace{\Phi (\Phi^T \Phi)^{-1} \Phi^T}_{\text{Projection Matrix } P} t

The projection \hat{t} should be as close as possible to t.
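This projection view can be checked numerically (a minimal numpy sketch with random data): P is idempotent, and the residual t - \hat{t} is orthogonal to the columns of \Phi.

```python
import numpy as np

rng = np.random.default_rng(2)
Phi = rng.normal(size=(50, 3))  # random full-rank design matrix
t = rng.normal(size=50)         # random target vector

# Projection matrix onto the column space of Phi
P = Phi @ np.linalg.inv(Phi.T @ Phi) @ Phi.T
t_hat = P @ t

# P is idempotent: projecting twice changes nothing
assert np.allclose(P @ P, P)
# The residual is orthogonal to every feature column
assert np.allclose(Phi.T @ (t - t_hat), 0)
```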

Maximum Likelihood Estimation (MLE)

Maximum Likelihood Estimation (MLE) is a discriminative probabilistic method that looks for the model most likely to have generated the observed data.

Assume targets are generated by a function summed with Gaussian noise: t = f(x) + \epsilon, \quad \epsilon \sim \mathcal{N}(0, \sigma^2)

This approach maximizes the likelihood, which is the probability of observing the data given the parameters:

p(t|X, w, \sigma^2) = \prod_{n=1}^N \mathcal{N}(t_n | w^T \Phi(x_n), \sigma^2)

Under the Gaussian noise assumption, maximizing this likelihood is equivalent to minimizing the sum of squared errors: the parameters that make the observed data most probable are exactly those that minimize the RSS.

To solve it, we take the logarithm of the likelihood (log-likelihood) to simplify the product into a sum: \ln p(t|X, w, \sigma^2) = \sum_{n=1}^N \ln \mathcal{N}(t_n | w^T \Phi(x_n), \sigma^2) = - \frac{N}{2} \ln (2\pi \sigma^2) - \frac{1}{2\sigma^2} \text{RSS}(w)

Setting the derivative to zero:

\nabla \ln p(t|X, w, \sigma^2) = \sum_{n=1}^N t_n \Phi(x_n)^T - w^T \sum_{n=1}^N \Phi(x_n) \Phi(x_n)^T = 0

which gives:

\hat{w}_{ML} = (\Phi^T \Phi)^{-1} \Phi^T t

Regularization

Regularization is a technique to prevent overfitting by adding a penalty term to the loss function that discourages complex models (too many parameters or weights too big).

Modified loss function: L(w) = \underbrace{L_D(w)}_{\text{error on data}} + \lambda \underbrace{L_w(w)}_{\text{model complexity}} L(w) = \underbrace{\frac{1}{2} \sum_{n=1}^N (t_n - w^T \Phi(x_n))^2}_{\text{RSS}} + \lambda L_w(w)

The regularization parameter \lambda controls the tradeoff:

Ridge Regression (L2 Regularization)

Penalize the L2 norm (sum of squares) of weights: L_w(w) = \frac{1}{2}\|w\|_2^2 = \frac{1}{2} \sum_{j=1}^M w_j^2

This encourages smaller weights, but does not set them to zero.

Full loss function: L(w) = \frac{1}{2} \sum_{n=1}^N (t_n - w^T \Phi(x_n))^2 + \frac{\lambda}{2} \|w\|_2^2

The solution is: \hat{w}_{\text{ridge}} = (\Phi^T \Phi + \lambda I)^{-1} \Phi^T t
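A minimal numpy sketch comparing the ridge solution to OLS (random toy data; \lambda = 1 is an arbitrary choice for the example):

```python
import numpy as np

rng = np.random.default_rng(3)
N, M = 30, 5
Phi = rng.normal(size=(N, M))  # random design matrix
t = rng.normal(size=N)         # random targets

lam = 1.0  # regularization strength (arbitrary here)

# Ridge: add lambda * I before solving the normal equations
w_ridge = np.linalg.solve(Phi.T @ Phi + lam * np.eye(M), Phi.T @ t)
w_ols = np.linalg.solve(Phi.T @ Phi, Phi.T @ t)

# The penalty shrinks the weight vector toward zero
assert np.linalg.norm(w_ridge) < np.linalg.norm(w_ols)
```

Note that \lambda I also makes the matrix to invert well-conditioned even when \Phi^T \Phi is nearly singular.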

Lasso Regression (L1 Regularization)

Penalize the L1 norm (sum of absolute values): L_w(w) = \|w\|_1 = \sum_{j=1}^M |w_j|

This generates sparse solutions, where some weights are exactly zero, effectively performing feature selection, removing irrelevant features from the model.

Full loss function: L(w) = \frac{1}{2} \sum_{n=1}^N (t_n - w^T \Phi(x_n))^2 + \lambda \|w\|_1
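Unlike ridge, lasso has no closed-form solution; a common solver is coordinate descent with soft-thresholding. The sketch below is a minimal illustration (the data, \lambda, and iteration count are assumptions of this example), showing that a sufficiently large \lambda drives the weight of an irrelevant feature exactly to zero:

```python
import numpy as np

def soft_threshold(z, gamma):
    """Shrink z toward zero by gamma; values within gamma become exactly 0."""
    return np.sign(z) * np.maximum(np.abs(z) - gamma, 0.0)

def lasso_cd(Phi, t, lam, n_iter=200):
    """Coordinate descent for (1/2)||t - Phi w||^2 + lam * ||w||_1."""
    N, M = Phi.shape
    w = np.zeros(M)
    col_sq = np.sum(Phi ** 2, axis=0)  # squared norm of each feature column
    for _ in range(n_iter):
        for j in range(M):
            # Residual with feature j's contribution removed
            r_j = t - Phi @ w + Phi[:, j] * w[j]
            z = Phi[:, j] @ r_j
            w[j] = soft_threshold(z, lam) / col_sq[j]
    return w

rng = np.random.default_rng(4)
N = 100
Phi = rng.normal(size=(N, 3))
# The second feature has true weight 0 (irrelevant by construction)
t = Phi @ np.array([2.0, 0.0, -1.5]) + rng.normal(0, 0.1, size=N)

w_lasso = lasso_cd(Phi, t, lam=20.0)
# The irrelevant feature's weight is set exactly to zero,
# while the relevant weights are shrunk slightly toward zero.
```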

Bayesian Linear Regression

The Bayesian Linear Regression is a probabilistic approach to linear regression that incorporates uncertainty about the model parameters.

Before seeing data, there is an assumption about the distribution of weights, called the prior distribution. A common choice is a Gaussian prior: p(w) = \mathcal{N}(w | w_0, S_0)

After observing data, Bayes’ theorem is used to update the belief about the weights, resulting in the posterior distribution: \overbrace{p(w|\mathcal{D})}^{\text{posterior}} = \frac{\overbrace{p(\mathcal{D}|w)}^{\text{likelihood}} \, \overbrace{p(w)}^{\text{prior}}}{p(\mathcal{D})}

where:

After updating, the weights still maintain a Gaussian distribution (conjugate prior): \overbrace{p(w|t, \Phi, \sigma^2)}^{\text{posterior}} \propto \overbrace{\mathcal{N}(w | w_0, S_0)}^{\text{prior}} \, \overbrace{\mathcal{N}(t | \Phi w, \sigma^2 I)}^{\text{likelihood}} = \mathcal{N}(w | w_N, S_N)

where: w_N = S_N \left( S_0^{-1} w_0 + \frac{\Phi^T t}{\sigma^2} \right), \qquad S_N^{-1} = S_0^{-1} + \frac{\Phi^T \Phi}{\sigma^2}

With a zero-mean Gaussian prior (w_0 = 0, S_0 = \tau^2 I), w_N is the MAP estimate and coincides with the ridge regression solution with \lambda = \frac{\sigma^2}{\tau^2}.

Predictive Distribution

The Bayesian approach makes it possible to compute the probability distribution of the target for a new input x, called the predictive distribution: p(t| x, \mathcal{D}, \sigma^2) = \int \mathcal{N}(t | w^T \phi(x), \sigma^2) \mathcal{N}(w | w_N, S_N) dw = \mathcal{N}(t | w_N^T \phi(x), \sigma^2_N)

where: \sigma^2_N = \underbrace{\sigma^2}_{\text{data noise}} + \underbrace{\phi(x)^T S_N \phi(x)}_{\text{parameters uncertainty}}
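The posterior update and the predictive variance can be sketched as follows (a minimal numpy example with a zero-mean isotropic prior; the toy data, the noise variance, and the prior variance are assumptions of this example):

```python
import numpy as np

rng = np.random.default_rng(5)

# Toy data: t = 2 + 3x + Gaussian noise (assumed for the example)
N = 50
x = rng.uniform(-1, 1, size=N)
t = 2.0 + 3.0 * x + rng.normal(0, 0.2, size=N)
Phi = np.column_stack([np.ones(N), x])

sigma2 = 0.2 ** 2            # noise variance, assumed known
tau2 = 10.0                  # prior variance (zero-mean isotropic prior)
S0_inv = np.eye(2) / tau2

# Posterior: S_N^{-1} = S_0^{-1} + Phi^T Phi / sigma^2
SN_inv = S0_inv + Phi.T @ Phi / sigma2
SN = np.linalg.inv(SN_inv)
wN = SN @ (Phi.T @ t / sigma2)   # w_0 = 0, so the prior mean term vanishes

# Predictive distribution at a new input x* = 0.5
phi_star = np.array([1.0, 0.5])
mean_star = wN @ phi_star
var_star = sigma2 + phi_star @ SN @ phi_star  # noise + parameter uncertainty
```

The predictive variance is always at least the noise variance, and the extra term \phi(x)^T S_N \phi(x) grows for inputs far from the training data.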

Last modified:
Written by: Andrea Lunghi