
Introduction to Gaussian Process Regression

By Jan Hardtke

Gaussian processes are a robust and flexible method in machine learning, capable of leveraging prior information to make predictions. Their primary use is in regression, where they model data by fitting a distribution over functions. However, their utility goes beyond regression, extending to tasks like classification and clustering. When faced with training data, countless functions can potentially describe it. Gaussian processes tackle this challenge by assigning a likelihood to each function, with the mean of this distribution serving as the most plausible representation of the data. Moreover, their probabilistic framework naturally incorporates prediction uncertainty, providing insights into the confidence of the results.

Introductory concepts

Gaussian processes are built upon the mathematical framework of the multivariate Gaussian distribution, which extends the familiar Gaussian (or normal) distribution to multiple random variables that are jointly normally distributed. This distribution is characterized by two key components: a mean vector $\mu$ and a covariance matrix $\Sigma$, with entries $\text{Cov}(X_i, X_j)$ that describe the relationships between pairs of variables. Before diving into these concepts, we will first cover some fundamental terms from probability theory to ensure we have a shared understanding.

Generally, a multivariate Gaussian distribution is expressed as:

$$X = \begin{bmatrix} X_1 \\ X_2 \\ \vdots \\ X_n \end{bmatrix} \sim \mathcal{N}(\mu, \Sigma)$$

where $X$ is a vector of random variables.

The covariance matrix $\Sigma$ determines the shape of the distribution and, in the 2D case, can be represented as:

$$\Sigma = \begin{bmatrix} \sigma_1^2 & \text{Cov}(X_1, X_2) \\ \text{Cov}(X_2, X_1) & \sigma_2^2 \end{bmatrix}$$

Contour plots of the multivariate Gaussian distribution illustrate how $\Sigma$ affects the spread and orientation of the distribution. An example of such plots for different covariance matrices $\Sigma$ is shown in the figure below, demonstrating how the covariance influences the relationships between variables.

[Figure: contour plots of a 2D Gaussian for different covariance matrices]
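To build intuition, here is a minimal NumPy sketch (not part of the original post; the covariance values are arbitrary) that draws samples from two 2D Gaussians. Plotting the samples reproduces the qualitative behaviour of the contours above: the correlated covariance stretches the cloud along the diagonal.

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.zeros(2)

# An isotropic covariance and a correlated one (example values).
sigma_iso = np.array([[1.0, 0.0],
                      [0.0, 1.0]])
sigma_corr = np.array([[1.0, 0.8],
                       [0.8, 1.0]])

# Draw samples; the correlated samples spread along the diagonal.
samples_iso = rng.multivariate_normal(mu, sigma_iso, size=500)
samples_corr = rng.multivariate_normal(mu, sigma_corr, size=500)

print(np.cov(samples_corr.T))  # empirical covariance is close to sigma_corr
```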

Conditioning

Suppose we have a joint distribution of two vectors of Gaussian random variables $X$ and $Y$:

$$P_{X,Y} = \begin{bmatrix} X \\ Y \end{bmatrix} \sim \mathcal{N}(\mu, \Sigma) = \mathcal{N}\left( \begin{bmatrix} \mu_X \\ \mu_Y \end{bmatrix}, \begin{bmatrix} \Sigma_{XX} & \Sigma_{XY} \\ \Sigma_{YX} & \Sigma_{YY} \end{bmatrix} \right).$$

The blocks such as $\Sigma_{XX}$ are themselves covariance matrices, so the full covariance matrix has size $(|X| + |Y|) \times (|X| + |Y|)$. We can derive the conditional multivariate distributions using the following expressions:

$$P_{X \mid Y} \sim \mathcal{N}\left( \mu_X + \Sigma_{XY} \Sigma_{YY}^{-1} (Y - \mu_Y),\; \Sigma_{XX} - \Sigma_{XY} \Sigma_{YY}^{-1} \Sigma_{YX} \right)$$

or

$$P_{Y \mid X} \sim \mathcal{N}\left( \mu_Y + \Sigma_{YX} \Sigma_{XX}^{-1} (X - \mu_X),\; \Sigma_{YY} - \Sigma_{YX} \Sigma_{XX}^{-1} \Sigma_{XY} \right).$$
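As a quick sanity check, the conditioning formula can be implemented directly. Below is a minimal NumPy sketch, assuming the block matrices are already given (the helper name `condition_gaussian` and the example numbers are my own):

```python
import numpy as np

def condition_gaussian(mu_x, mu_y, sigma_xx, sigma_xy, sigma_yx, sigma_yy, y):
    """Parameters of P(X | Y = y) for a jointly Gaussian (X, Y)."""
    # Use solve() instead of explicitly inverting Sigma_YY for numerical stability.
    mean = mu_x + sigma_xy @ np.linalg.solve(sigma_yy, y - mu_y)
    cov = sigma_xx - sigma_xy @ np.linalg.solve(sigma_yy, sigma_yx)
    return mean, cov

# Tiny example: X and Y are both one-dimensional here.
mu_x, mu_y = np.array([0.0]), np.array([0.0])
sigma_xx = np.array([[1.0]])
sigma_yy = np.array([[1.0]])
sigma_xy = np.array([[0.8]])
sigma_yx = sigma_xy.T

mean, cov = condition_gaussian(mu_x, mu_y, sigma_xx, sigma_xy, sigma_yx, sigma_yy,
                               y=np.array([1.5]))
print(mean, cov)  # observing Y pulls the mean of X towards 1.5 and shrinks its variance
```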

Marginalization

Another important concept is marginalization. Marginalizing a joint distribution over one random variable means summing (or integrating) the joint density over all possible values of that variable, leaving a distribution over the remaining one.

For example, given two vectors of Gaussian random variables:

$$X \sim \mathcal{N}(\mu_X, \Sigma_{XX}), \qquad Y \sim \mathcal{N}(\mu_Y, \Sigma_{YY})$$

and their joint distribution $P_{X,Y}$, we can marginalize over $Y$ as follows:

$$p_X(x) = \int_y p_{X,Y}(x, y) \, dy = \int_y p_{X \mid Y}(x \mid y)\, p_Y(y) \, dy$$
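For jointly Gaussian variables this integral has a closed form: the marginal is obtained by simply reading off the corresponding block of the mean vector and covariance matrix,

$$p_X(x) = \mathcal{N}(x \mid \mu_X, \Sigma_{XX}).$$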

We can observe the differences between conditioning and marginalization in the figure below.

[Figure: conditioning vs. marginalization of a 2D Gaussian]

As shown, conditioning involves slicing through the distribution at a specific value, resulting in a distribution with one dimension less. In this case, it becomes a 1D distribution, i.e. a simple normal distribution. Marginalization, on the other hand, acts like projecting the whole 2D distribution onto one dimension, in this case $Y$.

Gaussian processes

Now that we have covered the basic concepts, we can discuss how to use them to formulate the ideas behind Gaussian processes and Gaussian process regression. The basic idea is as follows: given a set of sample positions $\{x_i\}_{1}^{n}$ of a function $f$, we aim to determine $X_i = f(x_i)$, where $X_i$ is the $i$-th entry of a random vector $X$. This vector follows a multivariate normal distribution, meaning:

$$X \sim \mathcal{N}(\mu_X, \Sigma_{XX}).$$

For simplicity, we often assume $\mu_X = 0$. However, calculating $\Sigma_{XX}$ involves another mathematical concept called a kernel function.

The kernel function $k$, defined as $k \colon \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}$, is a measure of similarity between its inputs. A high output value of $k$ indicates high similarity between the inputs, and vice versa.

In our case, the kernel function is used to compute the covariance matrix $\Sigma_{XX}$ of the multivariate normal distribution. To calculate each entry of $\text{Cov}(X, X)$, we assume:

$$\text{Cov}(X_i, X_j) = k(x_i, x_j),$$

meaning that the covariance between two function values $X_i$ and $X_j$ is high when their corresponding inputs $x_i$ and $x_j$ are close together.
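In code, building $\Sigma_{XX}$ therefore amounts to evaluating $k$ on every pair of sample positions. Here is a minimal sketch (my own illustration), assuming scalar inputs and a placeholder kernel `k`; concrete kernels follow in the next section:

```python
import numpy as np

def covariance_matrix(xs, k):
    """Build Sigma_XX with entries Sigma[i, j] = k(xs[i], xs[j])."""
    n = len(xs)
    sigma = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            sigma[i, j] = k(xs[i], xs[j])
    return sigma

# Placeholder kernel: nearby inputs get covariance close to 1, distant ones close to 0.
k = lambda a, b: np.exp(-0.5 * (a - b) ** 2)
print(covariance_matrix(np.array([0.0, 0.5, 2.0]), k))
```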

Common Kernel Functions

Below are some commonly used kernel functions in the context of Gaussian processes; some of them take hyperparameters such as $l$ or $\sigma$:

  1. RBF Kernel:
     $$k(t, t') = \sigma^2 \exp\left(-\frac{\|t - t'\|^2}{2l^2}\right)$$
  2. Periodic Kernel:
     $$k(t, t') = \sigma^2 \exp\left(-\frac{2 \sin^2\left(\pi \frac{|t - t'|}{p}\right)}{l^2}\right)$$
  3. Linear Kernel:
     $$k(t, t') = \sigma_b^2 + \sigma^2 (t - c)(t' - c)$$

Different values of these hyperparameters make the kernel's similarity measure more or less strict, leading to sharper or smoother sampled functions.
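For reference, here is one possible NumPy implementation of these three kernels (my own sketch; the default hyperparameter values are arbitrary):

```python
import numpy as np

def rbf_kernel(t1, t2, sigma=1.0, l=1.0):
    """Squared-exponential (RBF) kernel."""
    return sigma**2 * np.exp(-np.abs(t1 - t2) ** 2 / (2 * l**2))

def periodic_kernel(t1, t2, sigma=1.0, l=1.0, p=1.0):
    """Periodic kernel with period p."""
    return sigma**2 * np.exp(-2 * np.sin(np.pi * np.abs(t1 - t2) / p) ** 2 / l**2)

def linear_kernel(t1, t2, sigma_b=1.0, sigma=1.0, c=0.0):
    """Linear kernel with offset c."""
    return sigma_b**2 + sigma**2 * (t1 - c) * (t2 - c)
```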

Now that we have a method to compute the covariance matrix $\Sigma_{XX}$, we can sample a random function $f$ by drawing a sample $X$ from the multivariate normal distribution $\mathcal{N}(\mu_X, \Sigma_{XX})$. Each entry $X_i$ in the sampled vector corresponds to the value of $f$ at $x_i$, meaning $X_i = f(x_i)$.
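Putting the pieces together, the following sketch (assuming the RBF kernel from above and arbitrary sample positions) draws a few random functions from the prior $\mathcal{N}(0, \Sigma_{XX})$:

```python
import numpy as np

def rbf_kernel(t1, t2, sigma=1.0, l=1.0):
    return sigma**2 * np.exp(-np.abs(t1 - t2) ** 2 / (2 * l**2))

rng = np.random.default_rng(0)

# Sample positions x_i at which we evaluate the random function f.
xs = np.linspace(0.0, 10.0, 100)

# Covariance matrix Sigma_XX built from the kernel, plus a tiny jitter
# on the diagonal for numerical stability.
K = rbf_kernel(xs[:, None], xs[None, :]) + 1e-8 * np.eye(len(xs))

# Each sample is a vector X with X_i = f(x_i), i.e. one random function.
prior_samples = rng.multivariate_normal(np.zeros(len(xs)), K, size=3)
print(prior_samples.shape)  # (3, 100): three functions evaluated at 100 points
```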

Gaussian Processes for Regression

We have now covered the fundamentals of Gaussian processes. Next, we will extend our understanding to explore how Gaussian processes can be applied to solve regression tasks.

[Figure: functions sampled from the GP posterior passing through the training points, with the mean function and 95% confidence interval]

Before diving into the technical details, our objective is clear: to model a distribution of functions that pass through our training points (as shown in the image above) and are therefore able to explain the distribution of our data.

Now, let $X$ represent our test points and $Y$ our training points. We aim to model the posterior distribution $X \mid Y$, which answers the question:

"What possible values for XX are there, given the training points in YY?"

This is where conditioning comes into play. Computing the distribution $P_{X|Y}$ provides us with functions $f$ that pass precisely through the points in $Y$. Conceptually, we can think of this as slicing through a $(|X| + |Y|)$-dimensional normal distribution, fixing the values at the $|Y|$ training locations as constants. This results in an $|X|$-dimensional normal distribution.

As we know, the joint distribution $P_{X,Y}$ is easy to write down:

$$P_{X,Y} \sim \mathcal{N}\left( \begin{bmatrix} \mu_X \\ \mu_Y \end{bmatrix}, \begin{bmatrix} \Sigma_{XX} & \Sigma_{XY} \\ \Sigma_{YX} & \Sigma_{YY} \end{bmatrix} \right).$$

From our prerequisites, we know that $P_{X|Y}$ can be calculated as:

$$P_{X|Y} \sim \mathcal{N}\left( \mu_X + \Sigma_{XY} \Sigma_{YY}^{-1} (Y - \mu_Y),\; \Sigma_{XX} - \Sigma_{XY} \Sigma_{YY}^{-1} \Sigma_{YX} \right)$$

This is the final distribution from which we can sample. It provides functions that pass through all points in $Y$ and whose behavior at the test points $X$ is modeled by $P_{X|Y}$ (see the figure above). The figure also illustrates the 95% confidence interval ($2\sigma$) and the mean function. In practice, the mean function is often the most useful output for regression or prediction tasks, as it provides the expected value of the function at each test point. Obtaining the mean function as well as the $2\sigma$ confidence interval is straightforward once we have computed the $\mu$ and $\Sigma$ of $P_{X|Y}$: we simply read them off at the $i$-th test point, giving the mean $\mu_i$ and the variance $\sigma_i^2 = \Sigma_{ii}$. This results in

$$[\mu_{X|Y}]_i = [\mu_X + \Sigma_{XY} \Sigma_{YY}^{-1} (Y - \mu_Y)]_i = [\Sigma_{XY} \Sigma_{YY}^{-1} Y]_i,$$

where the second equality stems from the fact that we assume $\mu_X = 0$ and $\mu_Y = 0$. For the standard deviation we get

$$\sigma_i = \sqrt{\left[\Sigma_{XX} - \Sigma_{XY} \Sigma_{YY}^{-1} \Sigma_{YX}\right]_{ii}}.$$

Doing this for every test point $x_i$ will result in the same figure we saw above.
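To make the whole procedure concrete, here is a compact end-to-end sketch (my own illustration, assuming a noise-free GP with zero mean and an RBF kernel; the training data are made up). It computes the posterior mean and the $2\sigma$ band at the test points exactly as derived above:

```python
import numpy as np

def rbf_kernel(a, b, sigma=1.0, l=1.0):
    """RBF kernel evaluated on all pairs of the 1D inputs a and b."""
    return sigma**2 * np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * l**2))

# Training data (Y) and test locations (X); the values are made up for illustration.
x_train = np.array([-4.0, -2.0, 0.0, 1.0, 3.0])
y_train = np.sin(x_train)
x_test = np.linspace(-5.0, 5.0, 200)

# Covariance blocks of the joint distribution P_{X,Y}.
K_xx = rbf_kernel(x_test, x_test)
K_xy = rbf_kernel(x_test, x_train)
K_yy = rbf_kernel(x_train, x_train) + 1e-8 * np.eye(len(x_train))  # jitter

# Posterior P_{X|Y}: mean and covariance from the conditioning formula.
mean = K_xy @ np.linalg.solve(K_yy, y_train)
cov = K_xx - K_xy @ np.linalg.solve(K_yy, K_xy.T)

# Per-point standard deviation and 95% (2-sigma) confidence band.
std = np.sqrt(np.clip(np.diag(cov), 0.0, None))
lower, upper = mean - 2 * std, mean + 2 * std
print(mean[:3], std[:3])
```

Using `np.linalg.solve` instead of explicitly inverting $\Sigma_{YY}$ is the usual numerically stable way to apply $\Sigma_{YY}^{-1}$.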

So, in summary, Gaussian processes are a flexible framework for regression, where functions are modeled as samples from a multivariate normal distribution. A kernel function is used to compute the covariance matrix $\Sigma_{XX}$, which encodes the similarity between inputs. The posterior distribution $P_{X|Y}$ then allows us to generate functions that pass through the training points $Y$ while predicting the behavior at the test points $X$. In the end, we typically use the mean function of this posterior, because it represents the most likely prediction for the underlying function.