Kernel Least-Squares
- Motivations
- The kernel trick
- Nonlinear case
- Examples of kernels
- Kernels in practice
Motivations
Consider a linear auto-regressive model for time-series, where the prediction $\hat{y}_{t+1}$ is a linear function of $(y_t, y_{t-1})$:

$$\hat{y}_{t+1} = w_1 y_t + w_2 y_{t-1},$$

that is, $\hat{y}_{t+1} = w^T x_t$ with feature vectors $x_t := (y_t, y_{t-1})$.

We can introduce a non-linear version, where $\hat{y}_{t+1}$ is a quadratic function of $(y_t, y_{t-1})$:

$$\hat{y}_{t+1} = w_1 y_t + w_2 y_{t-1} + w_3 y_t^2 + w_4 y_{t-1}^2 + w_5 y_t y_{t-1},$$

that is, $\hat{y}_{t+1} = w^T \phi(x_t)$ with augmented feature vectors

$$\phi(x_t) := \left(y_t,\; y_{t-1},\; y_t^2,\; y_{t-1}^2,\; y_t y_{t-1}\right).$$
It appears that the size of the least-squares problem grows quickly with the degree of the feature vectors. How can we solve it in a computationally efficient manner?
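As a quick numerical illustration, here is a minimal sketch of the motivating example: a quadratic auto-regressive model fitted by ordinary least-squares on explicitly constructed feature vectors. The time series, feature map, and variable names are illustrative choices, not part of the notes.

```python
import numpy as np

# Sketch: fit y_{t+1} ~ w^T phi(y_t, y_{t-1}) by ordinary least-squares.
rng = np.random.default_rng(0)
T = 200
y = np.sin(0.2 * np.arange(T)) + 0.05 * rng.standard_normal(T)  # toy series

def phi(a, b):
    # augmented feature vector (a, b, a^2, b^2, a*b)
    return np.array([a, b, a**2, b**2, a * b])

# Build the regression problem from the time series.
Phi = np.array([phi(y[t], y[t - 1]) for t in range(1, T - 1)])   # (T-2) x 5
target = y[2:T]                                                  # the y_{t+1}'s
w, *_ = np.linalg.lstsq(Phi, target, rcond=None)
print("fitted weights:", w)
```

Already with two lags and degree two, the feature vector has 5 components; the count grows rapidly with the degree, which motivates the kernel trick below.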
The kernel trick
We exploit a simple fact: in the least-squares problem

$$\min_w \; \|X^T w - y\|_2^2, \qquad X := [x_1, \ldots, x_m] \in \mathbb{R}^{n \times m},$$

the optimal $w$ lies in the span of the data points $x_1, \ldots, x_m$:

$$w = X v$$

for some vector $v \in \mathbb{R}^m$. Indeed, from the fundamental theorem of linear algebra, every $w \in \mathbb{R}^n$ can be written as the sum of two orthogonal vectors:

$$w = X v + r,$$

where $X^T r = 0$ (that is, $r$ is in the nullspace $\mathcal{N}(X^T)$). Since $X^T w = X^T X v$, the component $r$ has no effect on the residual, and we may set $r = 0$ at the optimum.

Hence the least-squares problem depends only on the matrix $K := X^T X$:

$$\min_v \; \|K v - y\|_2^2.$$

The prediction rule depends only on the scalar products between a new point $x$ and the data points $x_1, \ldots, x_m$:

$$\hat{y} = w^T x = v^T X^T x = \sum_{i=1}^m v_i \, x_i^T x.$$

Once $K$ is formed (this takes $O(n m^2)$ operations), the training problem has only $m$ variables. When $n \gg m$, this leads to a dramatic reduction in problem size.
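The kernel trick translates directly into code. The following sketch (NumPy, with illustrative random data) forms $K = X^T X$, solves the $m$-variable problem in $v$, and predicts using scalar products only; it assumes, as above, that the columns of $X$ are the data points.

```python
import numpy as np

# Sketch of the kernel trick for least-squares with a linear kernel.
rng = np.random.default_rng(0)
n, m = 10_000, 50                      # n >> m: many features, few data points
X = rng.standard_normal((n, m))        # columns are the data points x_1, ..., x_m
y = rng.standard_normal(m)

K = X.T @ X                                 # m x m kernel (Gram) matrix, O(n m^2) to form
v, *_ = np.linalg.lstsq(K, y, rcond=None)   # solve min_v ||K v - y||_2^2: only m variables

# Prediction on a new point uses scalar products with the data points only.
x_new = rng.standard_normal(n)
k_new = X.T @ x_new                    # (x_1^T x_new, ..., x_m^T x_new)
y_hat = v @ k_new                      # equals w^T x_new with w = X v

assert np.isclose(y_hat, (X @ v) @ x_new)   # sanity check against the explicit w
```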
Nonlinear case
In the nonlinear case, we simply replace the feature vectors $x_i$ with some “augmented” feature vectors $\phi(x_i)$, with $\phi : \mathbb{R}^n \rightarrow \mathbb{R}^N$ a non-linear mapping.

This leads to the modified kernel matrix

$$K_{ij} = \phi(x_i)^T \phi(x_j), \quad 1 \le i, j \le m.$$

The kernel function associated with the mapping $\phi$ is

$$k(x, z) := \phi(x)^T \phi(z).$$

It provides information about the metric in the feature space, e.g.:

$$\|\phi(x) - \phi(z)\|_2^2 = k(x, x) - 2\,k(x, z) + k(z, z).$$
The computational effort involved in
- solving the training problem;
- making a prediction,
depends only on our ability to quickly evaluate such scalar products. We can't choose the kernel function $k$ arbitrarily; it has to satisfy the above for some mapping $\phi$.
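As a small sanity check of the metric identity above, the sketch below uses a hypothetical quadratic feature map on $\mathbb{R}^2$ (with $\sqrt{2}$ scalings so that $\phi(x)^T\phi(z) = (1 + x^T z)^2$, anticipating the polynomial kernel of the next section) and verifies that feature-space distances can be computed from kernel evaluations alone.

```python
import numpy as np

# Hypothetical quadratic feature map on R^2, scaled so that
# phi(x)^T phi(z) = (1 + x^T z)^2.
def phi(x):
    x1, x2 = x
    return np.array([1.0, np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1**2, x2**2, np.sqrt(2) * x1 * x2])

def k(x, z):
    return (1.0 + x @ z) ** 2

x = np.array([0.3, -1.2])
z = np.array([0.7, 0.5])

# Distance in feature space, computed directly and via kernel evaluations only.
direct = np.sum((phi(x) - phi(z)) ** 2)
via_kernel = k(x, x) - 2.0 * k(x, z) + k(z, z)
assert np.isclose(direct, via_kernel)
```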
Examples of kernels
A variety of kernels are available. Some are adapted to the structure of data, for example, text or images. Here are a few popular choices.
Polynomial kernels
Regression with quadratic functions of $x \in \mathbb{R}^2$ involves feature vectors such as

$$\phi(x) = \left(1,\; \sqrt{2}\,x_1,\; \sqrt{2}\,x_2,\; x_1^2,\; x_2^2,\; \sqrt{2}\,x_1 x_2\right).$$

In fact, given two vectors $x, z \in \mathbb{R}^2$, we have

$$\phi(x)^T \phi(z) = (1 + x^T z)^2.$$

More generally, when $\phi(x)$ is the vector formed with all the (suitably scaled) products between the components of $x \in \mathbb{R}^n$, up to degree $d$, then for any two vectors $x, z \in \mathbb{R}^n$,

$$\phi(x)^T \phi(z) = (1 + x^T z)^d.$$

The computational effort involved in evaluating the kernel grows only linearly in $n$.

This represents a dramatic speedup over the ‘‘brute force’’ approach:
- form $\phi(x)$, $\phi(z)$;
- evaluate $\phi(x)^T \phi(z)$.
In that approach, the computational effort grows as $O(n^d)$, as illustrated in the sketch below.
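To make the comparison concrete, here is a rough sketch contrasting the $O(n)$ kernel evaluation with the size of the explicit feature vector. The monomial map below is a hypothetical illustration: it lists all products of components up to degree $d$ without the scalings needed for the exact kernel identity, and is only meant to show how the dimension grows.

```python
import numpy as np
from itertools import combinations_with_replacement
from math import comb

# Hypothetical explicit feature map: all monomials of x up to degree d
# (without the scalings that make phi(x)^T phi(z) match the kernel exactly).
def monomial_features(x, d):
    feats = [1.0]
    for deg in range(1, d + 1):
        for idx in combinations_with_replacement(range(len(x)), deg):
            feats.append(float(np.prod(x[list(idx)])))
    return np.array(feats)

rng = np.random.default_rng(0)
n, d = 20, 3
x, z = rng.standard_normal(n), rng.standard_normal(n)

k_xz = (1.0 + x @ z) ** d           # kernel evaluation: O(n) work
phi_x = monomial_features(x, d)     # explicit features: comb(n + d, d) components

print("kernel value          :", k_xz)
print("explicit feature dim  :", phi_x.size, "=", comb(n + d, d))  # grows like n^d
```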
Gaussian kernels
The Gaussian kernel function is

$$k(x, z) = \exp\left(-\frac{\|x - z\|_2^2}{2\sigma^2}\right),$$

where $\sigma > 0$ is a scale parameter. This allows ignoring points that are too far apart. It corresponds to a non-linear mapping $\phi$ to an infinite-dimensional feature space.
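A minimal sketch of the Gaussian kernel and the $m \times m$ kernel matrix it induces (data and parameter values are illustrative):

```python
import numpy as np

# Gaussian kernel k(x, z) = exp(-||x - z||^2 / (2 sigma^2)) and its kernel matrix.
def gaussian_kernel(x, z, sigma):
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma**2))

rng = np.random.default_rng(0)
n, m, sigma = 3, 5, 1.0
X = rng.standard_normal((n, m))        # columns are data points

K = np.array([[gaussian_kernel(X[:, i], X[:, j], sigma) for j in range(m)]
              for i in range(m)])
print(np.round(K, 3))                  # 1 on the diagonal, near 0 for distant points
```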
Other kernels
There is a large variety (a zoo?) of other kernels, some adapted to the structure of data (text, images, etc).
Kernels in practice
- Kernels need to be chosen by the user.
- The choice is not always obvious; Gaussian or polynomial kernels are popular.
- We control over-fitting via cross-validation (to choose, say, the scale parameter of the Gaussian kernel, or the degree of the polynomial kernel), as sketched below.
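For concreteness, here is a self-contained sketch of choosing the Gaussian-kernel scale parameter by K-fold cross-validation on a kernel least-squares model; the data, fold scheme, and candidate values of $\sigma$ are illustrative assumptions, not prescriptions from the notes.

```python
import numpy as np

# Pick the Gaussian-kernel scale sigma by K-fold cross-validation
# for kernel least-squares (all names and data are illustrative).
def gaussian_kernel_matrix(A, B, sigma):
    # A: n x p, B: n x q, columns are points; returns the p x q kernel matrix.
    sq = np.sum(A**2, 0)[:, None] + np.sum(B**2, 0)[None, :] - 2.0 * A.T @ B
    return np.exp(-sq / (2.0 * sigma**2))

def cv_error(X, y, sigma, n_folds=5, seed=0):
    m = y.size
    folds = np.array_split(np.random.default_rng(seed).permutation(m), n_folds)
    errors = []
    for fold in folds:
        train = np.setdiff1d(np.arange(m), fold)
        K = gaussian_kernel_matrix(X[:, train], X[:, train], sigma)
        v, *_ = np.linalg.lstsq(K, y[train], rcond=None)
        y_hat = gaussian_kernel_matrix(X[:, fold], X[:, train], sigma) @ v
        errors.append(np.mean((y_hat - y[fold]) ** 2))
    return float(np.mean(errors))

rng = np.random.default_rng(1)
X = rng.uniform(-2, 2, size=(2, 60))                             # 60 points in R^2
y = np.sin(X[0]) * np.cos(X[1]) + 0.05 * rng.standard_normal(60)

scores = {s: cv_error(X, y, s) for s in (0.1, 0.3, 1.0, 3.0)}
print("cross-validation errors:", scores)
print("selected sigma:", min(scores, key=scores.get))
```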