This writing is based on Bishop’s Pattern Recognition and Machine Learning (Bishop (2006)), and partly on the CS229 notes on the Gaussian distribution.
Gaussian Definition and Fundamental Integral
Definition 1 (Single Variable Gaussian Distribution): The Gaussian distribution is defined as
\[\begin{equation*} \newcommand{\dby}{\ \mathrm{d}}\newcommand{\bracka}[1]{\left( #1 \right)}\newcommand{\brackb}[1]{\left[ #1 \right]}\newcommand{\brackc}[1]{\left\{ #1 \right\}}\newcommand{\abs}[1]{\left|#1\right|} p(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left\{ -\frac{1}{2\sigma^2}(x-\mu)^2\right\} \end{equation*}\]
where \(\mu\) is a parameter called the mean, while \(\sigma\) is a parameter called the standard deviation, and we call \(\sigma^2\) the variance.
Finally, the standard normal distribution is the Gaussian distribution with mean \(0\) and variance \(1\).
To find the normalizing factor of Gaussian distribution, we use the following integral:
Proposition 1 (Gaussian integral): The integration of \(\exp(-x^2)\) is equal to
\[\begin{equation*} \int^\infty_{-\infty}\exp(-x^2)\dby x = \sqrt{\pi} \end{equation*}\]
We square the integral and transform it into the \(2\)D polar coordinate system, which makes the integration easier; taking the square root at the end gives the result.
\[\begin{equation*} \begin{aligned} \bracka{\int^\infty_{-\infty}\exp(-x^2)\dby x}^2 &= \bracka{\int^\infty_{-\infty}\exp(-x^2)\dby x}\bracka{\int^\infty_{-\infty}\exp(-y^2)\dby y} \\ &= \int^\infty_{-\infty}\int^\infty_{-\infty}\exp\bracka{-(x^2+y^2)}\dby x\dby y \\ &= \int^{2\pi}_{0}\int^\infty_{0}\exp\bracka{-r^2}r\dby r\dby\theta \\ &= 2\pi\int^\infty_{0}\exp\bracka{-r^2}r\dby r = \pi\int^{0}_{-\infty} \exp\bracka{u}\dby u = \pi \end{aligned} \end{equation*}\]
Proof
Corollary 1 (Normalization of Gaussian Distribution): We consider the integration
\[\begin{equation*} \int^\infty_{-\infty}\exp\left\{ -\frac{(x-\mu)^2}{2\sigma^2}\right\} \dby x = \sqrt{2\pi\sigma^2} \end{equation*}\]
Substituting \(z = x - \mu\) (a shift does not change the value of the integral) and then \(y = z/(\sqrt{2}\sigma)\):
\[\begin{equation*} \begin{aligned} \int^\infty_{-\infty}\exp\left\{ -\frac{z^2}{2\sigma^2}\right\} \dby z &= \sigma\sqrt{2} \int^\infty_{-\infty}\frac{1}{\sqrt{2}\sigma}\exp\left\{ -\frac{z^2}{2\sigma^2}\right\} \dby z \\ &= \sigma\sqrt{2} \int^\infty_{-\infty} \exp(-y^2)\dby y = \sigma\sqrt{2\pi} \end{aligned} \end{equation*}\]
Proof
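As an illustrative numerical sanity check of Proposition 1 and Corollary 1 (not part of the derivation), the integrals can be evaluated with scipy; the values of \(\mu\) and \(\sigma\) below are arbitrary examples:

```python
# Illustrative numerical check of Proposition 1 and Corollary 1 (arbitrary mu, sigma).
import numpy as np
from scipy import integrate

# Gaussian integral: int exp(-x^2) dx = sqrt(pi)
val, _ = integrate.quad(lambda x: np.exp(-x**2), -np.inf, np.inf)
print(np.isclose(val, np.sqrt(np.pi)))   # True

# Normalisation: int exp(-(x - mu)^2 / (2 sigma^2)) dx = sqrt(2 pi sigma^2)
mu, sigma = 1.5, 2.0
val, _ = integrate.quad(lambda x: np.exp(-(x - mu)**2 / (2 * sigma**2)), -np.inf, np.inf)
print(np.isclose(val, np.sqrt(2 * np.pi * sigma**2)))   # True
```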
Statistics of Single Variable Gaussian Distribution
Definition 2 (Odd Function): The function \(f(x)\) is an odd function iff \(f(-x) = -f(x)\)
Lemma 1 (Odd Function Integration): The integral of an odd function \(f(x)\) over \([-a, a]\), where \(a\in\mathbb{R}^+\), is \(0\), i.e.
\[\begin{equation*}\int^a_{-a}f(x)\dby x = 0\end{equation*}\]
\[\begin{equation*} \begin{aligned} \int^a_{-a}f(x)\dby x &= \int^0_{-a}f(x)\dby x + \int^a_{0}f(x)\dby x \\ &= \int^a_{0}f(-x)\dby x + \int^a_{0}f(x)\dby x \\ &= -\int^a_{0}f(x)\dby x + \int^a_{0}f(x)\dby x = 0 \end{aligned} \end{equation*}\]
Proof
Proposition 2 (Mean of Gaussian): The expectation \(\mathbb{E}_{x}[x]\) of a normally distributed random variable is \(\mu\)
\[\begin{equation*} \begin{aligned} \frac{1}{\sqrt{2\pi\sigma^2}}\int^\infty_{-\infty} \exp\bracka{-\frac{(x-\mu)^2}{2\sigma^2}}x\dby x \end{aligned} \end{equation*}\]
Substituting \(z = x - \mu\):
\[\begin{equation*} \begin{aligned} \frac{1}{\sqrt{2\pi\sigma^2}}&\int^\infty_{-\infty} \exp\bracka{-\frac{z^2}{2\sigma^2}}(z+\mu)\dby z \\ &= \frac{1}{\sqrt{2\pi\sigma^2}}\Bigg[ \underbrace{\int^\infty_{-\infty}\exp\bracka{-\frac{z^2}{2\sigma^2}}z\dby z}_{I_1} + \underbrace{\int^\infty_{-\infty}\exp\bracka{-\frac{z^2}{2\sigma^2}}\mu\dby z}_{I_2} \Bigg] \end{aligned} \end{equation*}\]
The integral \(I_1\) vanishes by Lemma 1, since the integrand is odd:
\[\begin{equation*} g(z) = \exp\bracka{-\frac{z^2}{2\sigma^2}}z = -\bracka{-\exp\bracka{-\frac{(-z)^2}{2\sigma^2}}z} = -g(-z) \end{equation*}\]
For \(I_2\), Corollary 1 gives:
\[\begin{equation*} \int^\infty_{-\infty}\exp\bracka{-\frac{z^2}{2\sigma^2}}\mu\dby z = \mu\int^\infty_{-\infty}\exp\bracka{-\frac{z^2}{2\sigma^2}}\dby z = \mu\sqrt{2\pi\sigma^2} \end{equation*}\]
Putting both together:
\[\begin{equation*} \begin{aligned} \frac{1}{\sqrt{2\pi\sigma^2}}\int^\infty_{-\infty} \exp\bracka{-\frac{z^2}{2\sigma^2}}(z+\mu)\dby z = \frac{1}{\sqrt{2\pi\sigma^2}}\Bigg[ 0 + \mu\sqrt{2\pi\sigma^2} \Bigg] = \mu \end{aligned} \end{equation*}\]
Proof
Lemma 2 : The variance \(\operatorname{var}(x) = \mathbb{E}[(x - \mu)^2]\) is equal to \(\mathbb{E}[x^2] - \mathbb{E}[x]^2\)
This follows by expanding the definition and using the linearity of expectation:
\[\begin{equation*} \begin{aligned} \mathbb{E}[(x - \mu)^2] &= \mathbb{E}[x^2 -2x\mu + \mu^2] \\ &= \mathbb{E}[x^2] - 2\mathbb{E}[x]\mu + \mathbb{E}[\mu^2] \\ &= \mathbb{E}[x^2] - 2\mathbb{E}[x]^2 + \mathbb{E}[x]^2 \\ &= \mathbb{E}[x^2] - \mathbb{E}[x]^2 \\ \end{aligned} \end{equation*}\]
Proof
Proposition 3: The variance of the normal distribution is \(\sigma^2\)
By Lemma 2 it suffices to compute \(\mathbb{E}[x^2]\). Substituting \(z = x - \mu\) and reusing the odd-function and normalisation results:
\[\begin{equation*} \begin{aligned} \frac{1}{\sqrt{2\pi\sigma^2}}& \int^\infty_{-\infty} \exp\bracka{-\frac{(x-\mu)^2}{2\sigma^2}}x^2\dby x \\ &= \frac{1}{\sqrt{2\pi\sigma^2}} \int^\infty_{-\infty} \exp\bracka{-\frac{z^2}{2\sigma^2}}(z+\mu)^2\dby z \\ &=\frac{1}{\sqrt{2\pi\sigma^2}}\brackb{\int^\infty_{-\infty}\exp\bracka{-\frac{z^2}{2\sigma^2}}z^2\dby z + \int^\infty_{-\infty}\exp\bracka{-\frac{z^2}{2\sigma^2}}2\mu z\dby z + \int^\infty_{-\infty}\exp\bracka{-\frac{z^2}{2\sigma^2}}\mu^2\dby z } \\ &=\frac{1}{\sqrt{2\pi\sigma^2}}\brackb{\int^\infty_{-\infty}\exp\bracka{-\frac{z^2}{2\sigma^2}}z^2\dby z + 0 + \mu^2\sqrt{2\pi\sigma^2} } \end{aligned} \end{equation*}\]
For the remaining integral, we integrate by parts, noting that
\[\begin{equation*} \frac{d}{dz} \exp\bracka{-\frac{z^2}{2\sigma^2}} = \exp\bracka{-\frac{z^2}{2\sigma^2}}\bracka{-\frac{z}{\sigma^2}} \end{equation*}\]
\[\begin{equation*} \begin{aligned} \int^\infty_{-\infty}&\exp\bracka{-\frac{z^2}{2\sigma^2}}z^2\dby z = -\sigma^2\int^\infty_{-\infty}\exp\bracka{-\frac{z^2}{2\sigma^2}}\bracka{-\frac{z}{\sigma^2}}z\dby z \\ &= -\sigma^2\brackb{\left.z\exp\bracka{-\frac{z^2}{2\sigma^2}}\right|^\infty_{-\infty} - \int^\infty_{-\infty}\exp\bracka{-\frac{z^2}{2\sigma^2}}\dby z} \\ &= -\sigma^2[0 - \sqrt{2\pi\sigma^2}] = \sigma^2\sqrt{2\pi\sigma^2} \end{aligned} \end{equation*}\]
The boundary term vanishes since (applying L'Hôpital's rule to the second limit)
\[\begin{equation*} \begin{aligned} \lim_{z\rightarrow\infty} z\exp\bracka{-\frac{z^2}{2\sigma^2}} &- \lim_{z\rightarrow-\infty} z\exp\bracka{-\frac{z^2}{2\sigma^2}} \\ &= 0 - \lim_{z\rightarrow-\infty}\exp\bracka{-\frac{z^2}{2\sigma^2}}\bracka{\frac{\sigma^2}{z}} \\ &= 0-0 = 0 \end{aligned} \end{equation*}\]
Collecting the results, \(\mathbb{E}[x^2] = \frac{1}{\sqrt{2\pi\sigma^2}}\brackb{\sigma^2\sqrt{2\pi\sigma^2} + \mu^2\sqrt{2\pi\sigma^2}} = \sigma^2 + \mu^2\), and so by Lemma 2:
\[\begin{equation*} \begin{aligned} \mathbb{E}[x^2] &- \mathbb{E}[x]^2 = \sigma^2 + \mu^2 - \mu^2 = \sigma^2 \end{aligned} \end{equation*}\]
Proof
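As an illustrative Monte Carlo sanity check of Propositions 2 and 3 (the parameter values below are arbitrary):

```python
# Illustrative Monte Carlo check of Propositions 2 and 3 (arbitrary parameters).
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = -0.7, 1.3
samples = rng.normal(loc=mu, scale=sigma, size=1_000_000)

print(samples.mean(), mu)        # sample mean     ~ mu
print(samples.var(), sigma**2)   # sample variance ~ sigma^2
```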
This second part introduces some mathematical background, such as linear algebra. Further results on linear Gaussian models and others will be presented in the next part.
Useful Backgrounds
Covariance and Covariance Matrix
Definition 3 (Covariance): Given 2 random variables \(X\) and \(Y\), the covariance is defined as:
\[\begin{equation*} \operatorname{cov}(X, Y) = \mathbb{E}\Big[ (X - \mathbb{E}[X])(Y - \mathbb{E}[Y]) \Big] \end{equation*}\]
It is clear that \(\operatorname{cov}(X, Y) = \operatorname{cov}(Y, X)\)
Definition 4 (Covariance Matrix): Now, if we have a collection of random variables in the random vector \(\boldsymbol x\) (of size \(n\)), then the covariances between its elements are collected in the covariance matrix
\[\begin{equation*} \begin{aligned} \operatorname{cov}(\boldsymbol x) &= \begin{bmatrix} \operatorname{cov}(x_1, x_1) & \operatorname{cov}(x_1, x_2) & \cdots & \operatorname{cov}(x_1, x_n) \\ \operatorname{cov}(x_2, x_1) & \operatorname{cov}(x_2, x_2) & \cdots & \operatorname{cov}(x_2, x_n) \\ \vdots & \vdots & \ddots & \vdots \\ \operatorname{cov}(x_n, x_1) & \operatorname{cov}(x_n, x_2) & \cdots & \operatorname{cov}(x_n, x_n) \\ \end{bmatrix} \\ &= \mathbb{E}\Big[ (\boldsymbol x - \mathbb{E}[\boldsymbol x])(\boldsymbol x - \mathbb{E}[\boldsymbol x])^\top \Big] \end{aligned} \end{equation*}\]
The equivalence is clear when we expand the outer product and take the expectation elementwise.
Remark 1. (Properties of Covariance Matrix): It is clear from its definition that the covariance matrix is (a small numerical check follows the list):
- Symmetric, as \(\operatorname{cov}(x_a, x_b) = \operatorname{cov}(x_b, x_a)\)
- Positive semidefinite: if we consider an arbitrary constant vector \(\boldsymbol a \in \mathbb{R}^n\), then by the linearity of expectation, we have:
\[\begin{equation*} \begin{aligned} \boldsymbol a^\top \operatorname{cov}(\boldsymbol x)\boldsymbol a &= \boldsymbol a^\top \mathbb{E}\Big[ (\boldsymbol x - \mathbb{E}[\boldsymbol x])(\boldsymbol x - \mathbb{E}[\boldsymbol x])^\top \Big]\boldsymbol a \\ &= \mathbb{E}\Big[ \boldsymbol a^\top (\boldsymbol x - \mathbb{E}[\boldsymbol x])(\boldsymbol x - \mathbb{E}[\boldsymbol x])^\top \boldsymbol a \Big] \\ &= \mathbb{E}\Big[ \big(\boldsymbol a^\top (\boldsymbol x - \mathbb{E}[\boldsymbol x])\big)^2 \Big] \geq 0 \\ \end{aligned} \end{equation*}\]
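Below is a minimal numpy sketch (illustrative only, with arbitrary random data) confirming that an empirical covariance matrix is symmetric with non-negative eigenvalues:

```python
# Illustrative check: an empirical covariance matrix is symmetric and positive semidefinite.
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(5, 10_000))            # 5 variables, 10k samples (rows = variables)
cov = np.cov(data)                             # 5 x 5 covariance matrix

print(np.allclose(cov, cov.T))                 # symmetric
print(np.linalg.eigvalsh(cov).min() >= -1e-10) # eigenvalues >= 0 (up to numerical error)
```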
Linear Algebra
Definition 5 (Eigenvalues and Eigenvectors): Given a matrix \(\boldsymbol A \in \mathbb{C}^{n\times n}\), a vector \(\boldsymbol v \in \mathbb{C}^n\) with \(\boldsymbol v \ne \boldsymbol 0\) and a number \(\lambda \in \mathbb{C}\) are called an eigenvector and eigenvalue if
\[\begin{equation*} \begin{aligned} \boldsymbol A \boldsymbol v = \lambda\boldsymbol v \end{aligned} \end{equation*}\]
Proposition 4 (Eigenvalues of a Symmetric Matrix): One can show that the eigenvalues of a symmetric (real) matrix are always real.
Let’s consider an eigenvalue/eigenvector pair \(\lambda \in \mathbb{C}\) and \(\boldsymbol v \in \mathbb{C}^n\) of a symmetric (real) matrix \(\boldsymbol A\), i.e. \(\boldsymbol A\boldsymbol v = \lambda\boldsymbol v\). Please note that \((x+yi)(x-yi) = x^2 + y^2 > 0\) for any non-zero complex number, which means that \(\bar{\boldsymbol v}^\top \boldsymbol v > 0\) since \(\boldsymbol v \ne \boldsymbol 0\) (\(\bar{\boldsymbol v}\) is the vector containing the complex conjugates of the elements of \(\boldsymbol v\)). Now, see that:
\[\begin{equation*} \begin{aligned} \bar{\boldsymbol v}^\top \boldsymbol A\boldsymbol v &= \bar{\boldsymbol v}^\top (\boldsymbol A\boldsymbol v) = \lambda\bar{\boldsymbol v}^\top \boldsymbol v \\ &= (\bar{\boldsymbol v}^\top \boldsymbol A^\top )\boldsymbol v = \bar{\lambda}\bar{\boldsymbol v}^\top \boldsymbol v \\ \end{aligned} \end{equation*}\]
Note that \(\overline{\boldsymbol A\boldsymbol v} = \boldsymbol A\bar{\boldsymbol v} = \bar{\lambda}\bar{\boldsymbol v}\) since \(\boldsymbol A\) is real. Since \(\bar{\boldsymbol v}^\top \boldsymbol v > 0\), we see that \(\bar{\lambda} = \lambda\), which implies that \(\lambda\) is real.
Proof (Following from here)
Proposition 5 (Eigenvalues of a Positive Definite Matrix): One can show that the eigenvalues of a positive definite matrix are positive.
For an eigenpair \(\boldsymbol A\boldsymbol v = \lambda\boldsymbol v\) with \(\boldsymbol v \ne \boldsymbol 0\), positive definiteness gives
\[\begin{equation*} \begin{aligned} \boldsymbol v^\top \boldsymbol A\boldsymbol v = \lambda\boldsymbol v^\top \boldsymbol v > 0 \end{aligned} \end{equation*}\]
and since \(\boldsymbol v^\top \boldsymbol v > 0\), the eigenvalue must satisfy \(\lambda > 0\).
Proof
Definition 6 (Linear Transformation): Given a function \(A: V \rightarrow W\), where \(V\) and \(W\) are vector spaces, \(A\) is a linear transformation if, for all \(\boldsymbol v, \boldsymbol w \in V\) and \(a, b \in \mathbb{R}\)
\[\begin{equation*} \begin{aligned} A(a\boldsymbol v + b\boldsymbol w) = aA(\boldsymbol v) + bA(\boldsymbol w) \end{aligned} \end{equation*}\]
Remark 2. (Matrix Multiplication and Linear Transformation): We can represent a linear transformation \(L : V \rightarrow W\) in terms of a matrix. Let’s consider a vector \(\boldsymbol v \in V\) (where \(V\) has dimension \(n\)) together with basis vectors \(\brackc{\boldsymbol b_1,\dots,\boldsymbol b_n}\) of \(V\) and basis vectors \(\brackc{\boldsymbol c_1, \dots, \boldsymbol c_m}\) of \(W\). Then we can represent the vector \(\boldsymbol v\) as:
\[\begin{equation*} \begin{aligned} \boldsymbol v = v_1\boldsymbol b_1 + \dots + v_n\boldsymbol b_n \end{aligned} \end{equation*}\]
Similarly, the image of each basis vector of \(V\) can be expanded in the basis of \(W\):
\[\begin{equation*} \begin{aligned} &L(\boldsymbol b_i) = l_{1i} \boldsymbol c_1 + l_{2i}\boldsymbol c_2 + \cdots + l_{mi}\boldsymbol c_m \\ \end{aligned} \end{equation*}\]
\[\begin{equation*} \begin{aligned} L(\boldsymbol v) &= v_1L(\boldsymbol b_1) + \dots + v_nL(\boldsymbol b_n) \\ &= \begin{aligned}[t] &v_1\Big( l_{11} \boldsymbol c_1 + l_{21}\boldsymbol c_2 + \cdots + l_{m1}\boldsymbol c_m \Big) \\ &+v_2\Big( l_{12} \boldsymbol c_1 + l_{22}\boldsymbol c_2 + \cdots + l_{m2}\boldsymbol c_m \Big) \\ &+\cdots \\ &+v_n\Big( l_{1n} \boldsymbol c_1 + l_{2n}\boldsymbol c_2 + \cdots + l_{mn}\boldsymbol c_m \Big)\\ \end{aligned} \\ &= \begin{bmatrix} \sum^n_{i=1}v_il_{1i} \\ \sum^n_{i=1}v_il_{2i} \\ \vdots \\ \sum^n_{i=1}v_il_{mi} \end{bmatrix} = \begin{bmatrix} l_{11} & l_{12} & \cdots & l_{1n} \\ l_{21} & l_{22} & \cdots & l_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ l_{m1} & l_{m2} & \cdots & l_{mn} \\ \end{bmatrix} \begin{bmatrix} v_1 \\ v_2 \\ \vdots \\ v_n \end{bmatrix} \end{aligned} \end{equation*}\]
and so we have shown that a linear transformation can be represented as a matrix multiplication (in finite-dimensional spaces), where the column vectors are coordinates with respect to the chosen bases.
Theorem 1 (Spectral Theorem, from here): Let \(n \le N\) and \(W\) be an \(n\)-dimensional subspace of \(\mathbb{R}^N\). Given a symmetric linear transformation \(A:W\rightarrow W\), there are eigenvectors \(\boldsymbol v_1,\dots,\boldsymbol v_n\in W\) of \(A\) such that \(\brackc{\boldsymbol v_1,\dots,\boldsymbol v_n}\) is an orthonormal basis for \(W\). For a symmetric matrix \(\boldsymbol A \in \mathbb{R}^{n\times n}\), we let \(n=N\) and \(W=\mathbb{R}^n\).
Remark 3. (Eigendecomposition): Let’s consider the matrix of eigenvectors \(\boldsymbol v_1,\boldsymbol v_2\dots,\boldsymbol v_n \in \mathbb{R}^n\) of symmetric matrix \(\boldsymbol A \in \mathbb{R}^{n\times n}\) together with eigenvalues \(\lambda_1,\lambda_2,\dots,\lambda_n\), then we have :
\[\begin{equation*} \begin{aligned} \boldsymbol A \begin{bmatrix} \kern.6em | & \kern.2em | \kern.2em & & | \kern.6em \\ \kern.6em\boldsymbol v_1 & \kern.2em\boldsymbol v_2\kern.2em & \kern.2em\cdots\kern.2em & \boldsymbol v_n\kern.6em \\ \kern.6em | & \kern.2em | \kern.2em & & | \kern.6em \\ \end{bmatrix} &= \begin{bmatrix} \kern.6em | & \kern.2em | \kern.2em & & | \kern.6em \\ \kern.6em\boldsymbol v_1 & \kern.2em\boldsymbol v_2\kern.2em & \kern.2em\cdots\kern.2em & \boldsymbol v_n\kern.6em \\ \kern.6em | & \kern.2em | \kern.2em & & | \kern.6em \\ \end{bmatrix}\begin{bmatrix} \lambda_1 & 0 & \cdots & 0 \\ 0 & \lambda_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \lambda_n \\ \end{bmatrix} \\ \end{aligned} \end{equation*}\]
Writing \(\boldsymbol Q\) for the matrix whose columns are the (orthonormal) eigenvectors and \(\boldsymbol \Lambda = \operatorname{diag}(\lambda_1,\dots,\lambda_n)\), this reads
\[\begin{equation*} \boldsymbol A = \boldsymbol Q\boldsymbol \Lambda\boldsymbol Q^\top = \sum^n_{i=1}\lambda_i\boldsymbol v_i\boldsymbol v_i^\top \end{equation*}\]
and, if all eigenvalues are non-zero,
\[\begin{equation*} \boldsymbol A^{-1} = \boldsymbol Q\boldsymbol \Lambda^{-1}\boldsymbol Q^\top = \sum^n_{i=1}\frac{1}{\lambda_i}\boldsymbol v_i\boldsymbol v_i^\top \end{equation*}\]
as it is clear that \(\boldsymbol A\boldsymbol A^{-1} = \boldsymbol Q\boldsymbol \Lambda\boldsymbol Q^\top \boldsymbol Q\boldsymbol \Lambda^{-1}\boldsymbol Q^\top = \boldsymbol I\), using the orthonormality \(\boldsymbol Q^\top \boldsymbol Q = \boldsymbol I\).
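A small numpy sketch (illustrative only; the matrix below is an arbitrary symmetric positive definite example) checking the eigendecomposition and the inverse formula:

```python
# Illustrative check of the eigendecomposition and its inverse (arbitrary SPD matrix).
import numpy as np

rng = np.random.default_rng(0)
B = rng.normal(size=(4, 4))
A = B @ B.T + 4 * np.eye(4)                    # symmetric positive definite, hence invertible

lam, Q = np.linalg.eigh(A)                     # eigenvalues and orthonormal eigenvectors
print(np.allclose(A, Q @ np.diag(lam) @ Q.T))                       # A = Q Lambda Q^T
print(np.allclose(np.linalg.inv(A), Q @ np.diag(1 / lam) @ Q.T))    # A^{-1} = Q Lambda^{-1} Q^T
```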
Proposition 6 (Determinant and Eigenvalues): Given a matrix \(\boldsymbol A\) together with its eigenvalues \(\lambda_1,\lambda_2,\dots,\lambda_n\), we can show that
\[\begin{equation*} \abs{\boldsymbol A} = \prod^n_{i=1}\lambda_i \end{equation*}\]
We will only consider the case where \(\boldsymbol A\) is diagonalizable.
We consider the eigendecomposition of \(\boldsymbol A\), as we have:
\[\begin{equation*} \abs{\boldsymbol A} = \abs{\boldsymbol Q\boldsymbol \Lambda\boldsymbol Q^{-1}} = \abs{\boldsymbol Q}\abs{\boldsymbol \Lambda}\abs{\boldsymbol Q^{-1}} = \frac{\abs{\boldsymbol Q}}{\abs{\boldsymbol Q}}\abs{\boldsymbol \Lambda} = \prod^n_{i=1}\lambda_i \end{equation*}\]
Proof
(Trace and Eigenvalues): Given a matrix \(\boldsymbol A\) together with its eigenvalues \(\lambda_1,\lambda_2,\dots,\lambda_n\), we can show that:
\[\begin{equation*} \operatorname{Tr}(\boldsymbol A) = \sum^n_{i=1}\lambda_i \end{equation*}\]
We will only consider the case where \(\boldsymbol A\) is diagonalizable.
By the cyclic property of the trace:
\[\begin{equation*} \operatorname{Tr}(\boldsymbol A) = \operatorname{Tr}(\boldsymbol Q\boldsymbol \Lambda\boldsymbol Q^{-1}) = \operatorname{Tr}(\boldsymbol Q^{-1}\boldsymbol Q\boldsymbol \Lambda) = \operatorname{Tr}(\boldsymbol \Lambda) = \sum^n_{i=1}\lambda_i \end{equation*}\]
Proof
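Both identities can also be checked numerically; a minimal sketch with an arbitrary (almost surely diagonalizable) random matrix:

```python
# Illustrative check: det(A) = product of eigenvalues, Tr(A) = sum of eigenvalues.
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(4, 4))                    # arbitrary (almost surely diagonalizable) matrix
lam = np.linalg.eigvals(A)                     # possibly complex eigenvalues

print(np.isclose(np.linalg.det(A), np.prod(lam).real))   # True
print(np.isclose(np.trace(A), np.sum(lam).real))         # True
```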
Miscellaneous
Proposition 7 (Change of Variable): Consider a transformation of variables \(T : \mathbb{R}^k \supset X \rightarrow \mathbb{R}^k\) (invertible and differentiable) with \(\boldsymbol y = T(\boldsymbol x)\). Then we can show that:
\[\begin{equation*} \int_{\mathbb{R}^k} f(\boldsymbol y)\dby \boldsymbol y = \int_{\mathbb{R}^k} f(\boldsymbol T(\boldsymbol x))\abs{\boldsymbol J_T(\boldsymbol x)}\dby \boldsymbol x \end{equation*}\]
where the Jacobian matrix is
\[\begin{equation*} \boldsymbol J_T(\boldsymbol x) = \begin{pmatrix} \cfrac{\partial T_1(\cdot)}{\partial x_1} & \cfrac{\partial T_1(\cdot)}{\partial x_2} & \cdots & \cfrac{\partial T_1(\cdot)}{\partial x_k} \\ \cfrac{\partial T_2(\cdot)}{\partial x_1} & \cfrac{\partial T_2(\cdot)}{\partial x_2} & \cdots & \cfrac{\partial T_2(\cdot)}{\partial x_k} \\ \vdots & \vdots & \ddots & \vdots \\ \cfrac{\partial T_k(\cdot)}{\partial x_1} & \cfrac{\partial T_k(\cdot)}{\partial x_2} & \cdots & \cfrac{\partial T_k(\cdot)}{\partial x_k} \\ \end{pmatrix} \end{equation*}\]
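As a concrete example, the polar-coordinate transformation \(T(r,\theta) = (r\cos\theta, r\sin\theta)\), which we used in Proposition 1, has
\[\begin{equation*} \boldsymbol J_T(r, \theta) = \begin{pmatrix} \cos\theta & -r\sin\theta \\ \sin\theta & r\cos\theta \end{pmatrix}, \qquad \abs{\boldsymbol J_T(r,\theta)} = r\cos^2\theta + r\sin^2\theta = r \end{equation*}\]
which is exactly the extra factor \(r\) that appeared in the polar-coordinate integration.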
Multivariate Gaussian Distribution: Introduction
Definition 7 (Multivariate Gaussian): The multivariate Gaussian distribution is defined as:
\[\begin{equation*} \mathcal{N}(\boldsymbol x | \boldsymbol \mu, \boldsymbol \Sigma) = \frac{1}{\sqrt{\abs{2\pi\boldsymbol \Sigma}}}\exp\left\{-\frac{1}{2}(\boldsymbol x - \boldsymbol \mu)^\top \boldsymbol \Sigma^{-1}(\boldsymbol x-\boldsymbol \mu)\right\} \end{equation*}\]
where we call \(\boldsymbol \mu \in \mathbb{R}^n\) the mean and \(\boldsymbol \Sigma \in \mathbb{R}^{n\times n}\) the covariance, which should be symmetric and positive definite so that \(\boldsymbol \Sigma^{-1}\) exists (recall that a covariance matrix is always symmetric and positive semidefinite).
Remark 4. (2D Independent Gaussian): Now, let’s consider the multivariate Gaussian in the case where the two variables are independent of each other with different variances, i.e. we define the parameters to be:
\[\begin{equation*} \boldsymbol \mu = \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix} \qquad \boldsymbol \Sigma = \begin{bmatrix} \sigma_1^2 & 0 \\ 0 & \sigma_2^2 \\ \end{bmatrix} \end{equation*}\]
\[\begin{equation*} \begin{aligned} \mathcal{N}(\boldsymbol x | \boldsymbol \mu, \boldsymbol \Sigma) &= \frac{1}{\sqrt{\abs{2\pi\boldsymbol \Sigma}}}\exp\left\{-\frac{1}{2}(\boldsymbol x - \boldsymbol \mu)^\top \boldsymbol \Sigma^{-1}(\boldsymbol x-\boldsymbol \mu)\right\} \\ &= \frac{1}{2\pi\sigma_1\sigma_2} \exp\brackc{-\frac{1}{2} \begin{bmatrix} x_1-\mu_1 \\ x_2-\mu_2 \end{bmatrix}^\top \begin{bmatrix} \sigma_1^{-2} & 0 \\ 0 & \sigma_2^{-2} \\ \end{bmatrix} \begin{bmatrix} x_1-\mu_1 \\ x_2-\mu_2 \end{bmatrix}} \\ &= \frac{1}{2\pi\sigma_1\sigma_2} \exp\brackc{-\frac{1}{2} \brackb{ \bracka{\frac{x_1-\mu_1}{\sigma_1}}^2 + \bracka{\frac{x_2-\mu_2}{\sigma_2}}^2 }}\\ &= \frac{1}{\sqrt{2\pi\sigma_1^2}} \exp\brackc{-\frac{1}{2} \frac{(x_1-\mu_1)^2}{\sigma_1^2}} \frac{1}{\sqrt{2\pi\sigma_2^2}} \exp\brackc{-\frac{1}{2} \frac{(x_2-\mu_2)^2}{\sigma_2^2}}\\ &= \mathcal{N}(x_1 | \mu_1, \sigma_1^2)\mathcal{N}(x_2 | \mu_2, \sigma_2^2) \end{aligned} \end{equation*}\]
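A quick scipy check of this factorisation at an arbitrary point (illustrative only):

```python
# Illustrative check that the independent 2D Gaussian factorises (arbitrary parameters/point).
import numpy as np
from scipy.stats import multivariate_normal, norm

mu = np.array([1.0, -2.0])
sigma1, sigma2 = 0.8, 1.7
Sigma = np.diag([sigma1**2, sigma2**2])

x = np.array([0.3, 0.5])
joint = multivariate_normal.pdf(x, mean=mu, cov=Sigma)
product = norm.pdf(x[0], loc=mu[0], scale=sigma1) * norm.pdf(x[1], loc=mu[1], scale=sigma2)
print(np.isclose(joint, product))   # True
```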
Remark 5. (Shape of Gaussian): Let’s consider the eigendecomposition of the inverse covariance matrix (the covariance matrix is symmetric and positive definite), where \(\lambda_i\) and \(\boldsymbol u_i\) denote the eigenvalues and orthonormal eigenvectors of \(\boldsymbol \Sigma\):
\[\begin{equation*} \begin{aligned} \boldsymbol \Sigma^{-1} = \sum^n_{i=1}\frac{1}{\lambda_i}\boldsymbol u_i\boldsymbol u_i^\top \end{aligned} \end{equation*}\]
\[\begin{equation*} \begin{aligned} (\boldsymbol x - \boldsymbol \mu)^\top \boldsymbol \Sigma^{-1}(\boldsymbol x - \boldsymbol \mu) = \sum^n_{i=1}\frac{(\boldsymbol x - \boldsymbol \mu)^\top \boldsymbol u_i\boldsymbol u_i^\top (\boldsymbol x - \boldsymbol \mu)}{\lambda_i} = \sum^n_{i=1}\frac{y_i^2}{\lambda_i} \end{aligned} \end{equation*}\]
where we have \(y_i = (\boldsymbol x - \boldsymbol \mu)^\top \boldsymbol u_i\). Consider the vector to be \(\boldsymbol y = (y_1,y_2,\dots,y_n)^\top = \boldsymbol U(\boldsymbol x - \boldsymbol \mu)\). This gives us the linear transformation over \(\boldsymbol x\), which implies the following shape of Gaussian:
- Ellipsoids centred at \(\boldsymbol \mu\)
- The axes are in the directions of the eigenvectors \(\boldsymbol u_i\)
- The scaling along each axis is \(\sqrt{\lambda_i}\), where \(\lambda_i\) is the eigenvalue associated with \(\boldsymbol u_i\)
Proposition 8 (Normalization of Multivariate Gaussian): We can show that:
\[\begin{equation*} \begin{aligned} \int \exp\left\{-\frac{1}{2}(\boldsymbol x - \boldsymbol \mu)^\top \boldsymbol \Sigma^{-1}(\boldsymbol x-\boldsymbol \mu)\right\} \dby \boldsymbol x = \sqrt{\abs{2\pi\boldsymbol \Sigma}} \end{aligned} \end{equation*}\]
Let’s consider a change of variables from \(x_i\) to \(y_i\), where \(\boldsymbol y = \boldsymbol U(\boldsymbol x - \boldsymbol \mu)\) as in Remark 5. To do this, we have to find the Jacobian of the transformation, which is:
\[\begin{equation*} \begin{aligned} J_{ij} = \frac{\partial x_i}{\partial y_j} = U_{ji} \end{aligned} \end{equation*}\]
Since \(\boldsymbol U\) is orthogonal, \(\abs{\boldsymbol J} = 1\), so the Jacobian factor drops out:
\[\begin{equation*} \begin{aligned} \int \exp\left\{-\frac{1}{2}(\boldsymbol x - \boldsymbol \mu)^\top \boldsymbol \Sigma^{-1}(\boldsymbol x-\boldsymbol \mu)\right\} \dby \boldsymbol x &= \int \exp\brackc{\sum^n_{i=1}-\frac{y_i^2}{2\lambda_i}} |\boldsymbol J| \dby \boldsymbol y \\ &= \int \prod^n_{i=1}\exp\brackc{-\frac{y_i^2}{2\lambda_i}} \dby \boldsymbol y \\ &= \prod^n_{i=1}\int \exp\brackc{-\frac{y_i^2}{2\lambda_i}} \dby y_i \\ &= \prod^n_{i=1}\sqrt{2\pi\lambda_i} = \sqrt{\abs{2\pi\boldsymbol \Sigma}} \end{aligned} \end{equation*}\]
Note that we have already shown that the determinant is the product of the eigenvalues, so \(\prod^n_{i=1}2\pi\lambda_i = \abs{2\pi\boldsymbol \Sigma}\). Thus the proof is complete.
Proof
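The final step, \(\prod_i\sqrt{2\pi\lambda_i} = \sqrt{\abs{2\pi\boldsymbol \Sigma}}\), can be checked numerically; a minimal numpy sketch with an arbitrary covariance matrix:

```python
# Illustrative check of the last step: prod_i sqrt(2 pi lambda_i) = sqrt(|2 pi Sigma|).
import numpy as np

rng = np.random.default_rng(0)
B = rng.normal(size=(3, 3))
Sigma = B @ B.T + np.eye(3)                  # arbitrary symmetric positive definite covariance

lam = np.linalg.eigvalsh(Sigma)
lhs = np.prod(np.sqrt(2 * np.pi * lam))
rhs = np.sqrt(np.linalg.det(2 * np.pi * Sigma))
print(np.isclose(lhs, rhs))                  # True
```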
More Useful Backgrounds
Proposition 9 (Inverse of a Partitioned Matrix): A block (partitioned) matrix can be inverted as:
\[\begin{equation*} \require{color} \definecolor{red}{RGB}{244, 67, 54} \definecolor{pink}{RGB}{233, 30, 99} \definecolor{purple}{RGB}{103, 58, 183} \definecolor{yellow}{RGB}{255, 193, 7} \definecolor{grey}{RGB}{96, 125, 139} \definecolor{blue}{RGB}{33, 150, 243} \definecolor{green}{RGB}{0, 150, 136} \begin{bmatrix} \boldsymbol A & \boldsymbol B \\ \boldsymbol C & \boldsymbol D \\ \end{bmatrix}^{-1} = \begin{bmatrix} \boldsymbol M & -\boldsymbol M\boldsymbol B\boldsymbol D^{-1} \\ -\boldsymbol D^{-1}\boldsymbol C\boldsymbol M & \boldsymbol D^{-1}+ \boldsymbol D^{-1}\boldsymbol C\boldsymbol M\boldsymbol B\boldsymbol D^{-1} \\ \end{bmatrix} \end{equation*}\]
where we set \(\boldsymbol M = (\boldsymbol A-\boldsymbol B\boldsymbol D^{-1}\boldsymbol C)^{-1}\)
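A minimal numpy sketch (illustrative only, with an arbitrary well-conditioned matrix) verifying the partitioned-inverse formula:

```python
# Illustrative check of the partitioned-inverse formula (arbitrary well-conditioned matrix).
import numpy as np

rng = np.random.default_rng(0)
n = 5
M_full = rng.normal(size=(n, n)) + n * np.eye(n)     # invertible in practice
A, B = M_full[:2, :2], M_full[:2, 2:]
C, D = M_full[2:, :2], M_full[2:, 2:]

Dinv = np.linalg.inv(D)
M = np.linalg.inv(A - B @ Dinv @ C)                  # M = (A - B D^{-1} C)^{-1}
inv_blocks = np.block([[M, -M @ B @ Dinv],
                       [-Dinv @ C @ M, Dinv + Dinv @ C @ M @ B @ Dinv]])
print(np.allclose(inv_blocks, np.linalg.inv(M_full)))   # True
```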
Proposition 10 (Inverse Matrix Identity): We can show that
\[\begin{equation*} (\boldsymbol P^{-1} + \boldsymbol B^\top \boldsymbol R^{-1}\boldsymbol B)^{-1}\boldsymbol B^\top \boldsymbol R^{-1} = \boldsymbol P\boldsymbol B^\top (\boldsymbol B\boldsymbol P\boldsymbol B^\top + \boldsymbol R)^{-1} \end{equation*}\]
This can be proven by right-multiplying the left-hand side by \((\boldsymbol B\boldsymbol P\boldsymbol B^\top + \boldsymbol R)\):
\[\begin{equation*} \begin{aligned} (\boldsymbol P^{-1} + &\boldsymbol B^\top \boldsymbol R^{-1}\boldsymbol B)^{-1}\boldsymbol B^\top \boldsymbol R^{-1}(\boldsymbol B\boldsymbol P\boldsymbol B^\top + \boldsymbol R) \\ &= (\boldsymbol P^{-1} + \boldsymbol B^\top \boldsymbol R^{-1}\boldsymbol B)^{-1}\boldsymbol B^\top \boldsymbol R^{-1}\boldsymbol B\boldsymbol P\boldsymbol B^\top + (\boldsymbol P^{-1} + \boldsymbol B^\top \boldsymbol R^{-1}\boldsymbol B)^{-1}\boldsymbol B^\top \boldsymbol R^{-1}\boldsymbol R \\ &= (\boldsymbol P^{-1} + \boldsymbol B^\top \boldsymbol R^{-1}\boldsymbol B)^{-1}\boldsymbol B^\top \boldsymbol R^{-1}\boldsymbol B\boldsymbol P\boldsymbol B^\top + (\boldsymbol P^{-1} + \boldsymbol B^\top \boldsymbol R^{-1}\boldsymbol B)^{-1}\boldsymbol B^\top \\ &= (\boldsymbol P^{-1} + \boldsymbol B^\top \boldsymbol R^{-1}\boldsymbol B)^{-1}\Big[\boldsymbol B^\top \boldsymbol R^{-1}\boldsymbol B\boldsymbol P + \boldsymbol P^{-1} \boldsymbol P\Big]\boldsymbol B^\top \\ &= (\boldsymbol P^{-1} + \boldsymbol B^\top \boldsymbol R^{-1}\boldsymbol B)^{-1}\Big[\boldsymbol B^\top \boldsymbol R^{-1}\boldsymbol B+ \boldsymbol P^{-1} \Big]\boldsymbol P\boldsymbol B^\top \\ &= \boldsymbol P\boldsymbol B^\top \\ \end{aligned} \end{equation*}\]
Right-multiplying both sides by \((\boldsymbol B\boldsymbol P\boldsymbol B^\top + \boldsymbol R)^{-1}\) then gives the identity.
Proof
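A small numerical check of this identity (illustrative only; \(\boldsymbol P\), \(\boldsymbol R\), \(\boldsymbol B\) below are arbitrary, with \(\boldsymbol P\) and \(\boldsymbol R\) symmetric positive definite):

```python
# Illustrative check of Proposition 10 (P is n x n, R is m x m, B is m x n; arbitrary values).
import numpy as np

rng = np.random.default_rng(0)
n, m = 3, 4
P = np.eye(n) + 0.1 * rng.normal(size=(n, n)); P = P @ P.T   # symmetric positive definite
R = np.eye(m) + 0.1 * rng.normal(size=(m, m)); R = R @ R.T
B = rng.normal(size=(m, n))

Pinv, Rinv = np.linalg.inv(P), np.linalg.inv(R)
lhs = np.linalg.inv(Pinv + B.T @ Rinv @ B) @ B.T @ Rinv
rhs = P @ B.T @ np.linalg.inv(B @ P @ B.T + R)
print(np.allclose(lhs, rhs))   # True
```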
Proposition 11 (Woodbury Identity): We can show that
\[\begin{equation*} (\boldsymbol A + \boldsymbol B\boldsymbol D^{-1}\boldsymbol C)^{-1} = \boldsymbol A^{-1} - \boldsymbol A^{-1}\boldsymbol B(\boldsymbol D + \boldsymbol C\boldsymbol A^{-1}\boldsymbol B)^{-1}\boldsymbol C\boldsymbol A^{-1} \end{equation*}\]
This can be proven by right-multiplying the right-hand side by \((\boldsymbol A + \boldsymbol B\boldsymbol D^{-1}\boldsymbol C)\) and showing that the product is the identity:
\[\begin{equation*} \begin{aligned} \Big[ \boldsymbol A^{-1} &- \boldsymbol A^{-1}\boldsymbol B(\boldsymbol D + \boldsymbol C\boldsymbol A^{-1}\boldsymbol B)^{-1}\boldsymbol C\boldsymbol A^{-1} \Big](\boldsymbol A + \boldsymbol B\boldsymbol D^{-1}\boldsymbol C) \\ &= \boldsymbol A^{-1}(\boldsymbol A + \boldsymbol B\boldsymbol D^{-1}\boldsymbol C) - \boldsymbol A^{-1}\boldsymbol B(\boldsymbol D + \boldsymbol C\boldsymbol A^{-1}\boldsymbol B)^{-1}\boldsymbol C\boldsymbol A^{-1}(\boldsymbol A + \boldsymbol B\boldsymbol D^{-1}\boldsymbol C) \\ &= \begin{aligned}[t] \boldsymbol A^{-1}\boldsymbol A + \boldsymbol A^{-1}\boldsymbol B\boldsymbol D^{-1}\boldsymbol C &- \boldsymbol A^{-1}\boldsymbol B(\boldsymbol D + \boldsymbol C\boldsymbol A^{-1}\boldsymbol B)^{-1}\boldsymbol C\boldsymbol A^{-1}\boldsymbol A \\ &- \boldsymbol A^{-1}\boldsymbol B(\boldsymbol D + \boldsymbol C\boldsymbol A^{-1}\boldsymbol B)^{-1}\boldsymbol C\boldsymbol A^{-1}\boldsymbol B\boldsymbol D^{-1}\boldsymbol C \\ \end{aligned} \\ &= \begin{aligned}[t] \boldsymbol I + \boldsymbol A^{-1}\boldsymbol B\boldsymbol D^{-1}\boldsymbol C &- \boldsymbol A^{-1}\boldsymbol B(\boldsymbol D + \boldsymbol C\boldsymbol A^{-1}\boldsymbol B)^{-1}\boldsymbol C \\ &- \boldsymbol A^{-1}\boldsymbol B(\boldsymbol D + \boldsymbol C\boldsymbol A^{-1}\boldsymbol B)^{-1}\boldsymbol C\boldsymbol A^{-1}\boldsymbol B\boldsymbol D^{-1}\boldsymbol C \\ \end{aligned} \\ &= \begin{aligned}[t] \boldsymbol I + \boldsymbol A^{-1}\boldsymbol B\boldsymbol D^{-1}\boldsymbol C &- \boldsymbol A^{-1}\boldsymbol B\Big[(\boldsymbol D + \boldsymbol C\boldsymbol A^{-1}\boldsymbol B)^{-1}\boldsymbol D + (\boldsymbol D + \boldsymbol C\boldsymbol A^{-1}\boldsymbol B)^{-1}\boldsymbol C\boldsymbol A^{-1}\boldsymbol B\Big]\boldsymbol D^{-1}\boldsymbol C \\ \end{aligned} \\ &= \begin{aligned}[t] \boldsymbol I + \boldsymbol A^{-1}\boldsymbol B\boldsymbol D^{-1}\boldsymbol C &- \boldsymbol A^{-1}\boldsymbol B\Big[(\boldsymbol D + \boldsymbol C\boldsymbol A^{-1}\boldsymbol B)^{-1}(\boldsymbol D + \boldsymbol C\boldsymbol A^{-1}\boldsymbol B)\Big]\boldsymbol D^{-1}\boldsymbol C \\ \end{aligned} \\ &= \begin{aligned}[t] \boldsymbol I + \boldsymbol A^{-1}\boldsymbol B\boldsymbol D^{-1}\boldsymbol C &- \boldsymbol A^{-1}\boldsymbol B\boldsymbol D^{-1}\boldsymbol C \\ \end{aligned} = \boldsymbol I \end{aligned} \end{equation*}\]
Proof
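A small numerical check of the Woodbury identity (illustrative only, arbitrary example matrices):

```python
# Illustrative check of the Woodbury identity (arbitrary example matrices).
import numpy as np

rng = np.random.default_rng(0)
n, k = 4, 2
A = np.eye(n) + 0.1 * rng.normal(size=(n, n)); A = A @ A.T   # invertible
D = np.eye(k) + 0.1 * rng.normal(size=(k, k)); D = D @ D.T   # invertible
B = rng.normal(size=(n, k))
C = rng.normal(size=(k, n))

Ainv, Dinv = np.linalg.inv(A), np.linalg.inv(D)
lhs = np.linalg.inv(A + B @ Dinv @ C)
rhs = Ainv - Ainv @ B @ np.linalg.inv(D + C @ Ainv @ B) @ C @ Ainv
print(np.allclose(lhs, rhs))   # True
```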
Conditional & Marginalisation
Remark 6. (Settings): We consider the setting where the random vector is partitioned as:
\[\begin{equation*} \begin{bmatrix} \boldsymbol x_a \\ \boldsymbol x_b \end{bmatrix} \sim \mathcal{N}\bracka{ \begin{bmatrix} \boldsymbol \mu_a \\ \boldsymbol \mu_b \end{bmatrix}, \begin{bmatrix} \boldsymbol \Sigma_{aa} & \boldsymbol \Sigma_{ab} \\ \boldsymbol \Sigma_{ba} & \boldsymbol \Sigma_{bb} \\ \end{bmatrix}} \end{equation*}\]
\[\begin{equation*} \boldsymbol \Sigma^{-1} = \begin{bmatrix} \boldsymbol \Sigma_{aa} & \boldsymbol \Sigma_{ab} \\ \boldsymbol \Sigma_{ba} & \boldsymbol \Sigma_{bb} \\ \end{bmatrix}^{-1} = \begin{bmatrix} \boldsymbol \Lambda_{aa} & \boldsymbol \Lambda_{ab} \\ \boldsymbol \Lambda_{ba} & \boldsymbol \Lambda_{bb} \\ \end{bmatrix} = \boldsymbol \Lambda \end{equation*}\]
where \(\boldsymbol \Lambda\) is called the precision matrix. One can use the inverse of a partitioned matrix (Proposition 9) to find the blocks of \(\boldsymbol \Lambda\) (this will be useful later); in particular, note that \(\boldsymbol \Lambda_{aa} \ne \boldsymbol \Sigma^{-1}_{aa}\) in general, and so on.
Remark 7. (Completing the Square): To find the conditional and the marginal (together with other kinds of Gaussian manipulations), we rely on a method called completing the square. Let’s consider the quadratic-form expansion:
\[\begin{equation*} \begin{aligned} -\frac{1}{2}&(\boldsymbol x - \boldsymbol \mu)^\top \boldsymbol \Sigma^{-1}(\boldsymbol x-\boldsymbol \mu) \\ &= {\color{blue}{-\frac{1}{2}\boldsymbol x^\top \boldsymbol \Sigma^{-1}\boldsymbol x + \boldsymbol \mu^\top \boldsymbol \Sigma^{-1}\boldsymbol x}} -\frac{1}{2} \boldsymbol \mu^\top \boldsymbol \Sigma^{-1}\boldsymbol \mu \\ \end{aligned} \end{equation*}\]
In the partitioned setting of Remark 6, the same quadratic form expands as:
\[ \begin{aligned} -\frac{1}{2}&(\boldsymbol x - \boldsymbol \mu)^\top \boldsymbol \Sigma^{-1}(\boldsymbol x-\boldsymbol \mu) \\ &= \begin{aligned}[t] {\color{green}{-\frac{1}{2}(\boldsymbol x_a - \boldsymbol \mu_a)^\top \boldsymbol \Lambda_{aa}(\boldsymbol x_a - \boldsymbol \mu_a)}}{\color{yellow}{ - \frac{1}{2}(\boldsymbol x_a - \boldsymbol \mu_a)^\top \boldsymbol \Lambda_{ab}(\boldsymbol x_b - \boldsymbol \mu_b)}} \\ {\color{purple}{-\frac{1}{2}(\boldsymbol x_b - \boldsymbol \mu_b)^\top \boldsymbol \Lambda_{ba}(\boldsymbol x_a - \boldsymbol \mu_a)}}{\color{grey}{ - \frac{1}{2}(\boldsymbol x_b - \boldsymbol \mu_b)^\top \boldsymbol \Lambda_{bb}(\boldsymbol x_b - \boldsymbol \mu_b)}} \\ \end{aligned} \\ &= \begin{aligned}[t] &{\color{green} -\frac{1}{2}\boldsymbol x^\top _a\boldsymbol \Lambda_{aa}\boldsymbol x_a + \boldsymbol \mu_a^\top \boldsymbol \Lambda_{aa}\boldsymbol x_a} -\frac{1}{2} \boldsymbol \mu_a^\top \boldsymbol \Lambda_{aa}\boldsymbol \mu_a {\color{yellow} -\frac{1}{2}\boldsymbol x_a^\top \boldsymbol \Lambda_{ab}\boldsymbol x_b + \frac{1}{2} \boldsymbol x_a^\top \boldsymbol \Lambda_{ab}\boldsymbol \mu_b} \\ &{\color{yellow}+\frac{1}{2}\boldsymbol \mu_a^\top \boldsymbol \Lambda_{ab}\boldsymbol x_b} - \frac{1}{2}\boldsymbol \mu_a^\top \boldsymbol \Lambda_{ab}\boldsymbol \mu_b {\color{purple} -\frac{1}{2}\boldsymbol x_b^\top \boldsymbol \Lambda_{ba}\boldsymbol x_a + \frac{1}{2}\boldsymbol \mu_b^\top \boldsymbol \Lambda_{ba}\boldsymbol x_a + \frac{1}{2}\boldsymbol x_b^\top \boldsymbol \Lambda_{ba} \boldsymbol \mu_a} \\ &- \frac{1}{2}\boldsymbol \mu_b^\top \boldsymbol \Lambda_{ba}\boldsymbol \mu_a {\color{grey} -\frac{1}{2}\boldsymbol x_b^\top \boldsymbol \Lambda_{bb}\boldsymbol x_b + \boldsymbol \mu_b^\top \boldsymbol \Lambda_{bb}\boldsymbol x_b} -\frac{1}{2}\boldsymbol \mu_b^\top \boldsymbol \Lambda_{bb}\boldsymbol \mu_b \end{aligned} \\ \end{aligned} \tag{1}\]
Completing the square means matching this pattern in our formula in order to read off the new Gaussian distribution. There are \(2\) ways to complete the square, depending on the scenario:
- When we want to find the Gaussian in a different form (but still a Gaussian), e.g. the conditional
- When we want to marginalise some variables out, or when we have to find the true form of the distribution without relying on knowing the final form (or when we are not really sure about the final form), e.g. marginalisation, posterior
Let’s just show how it works with examples.
Proposition 12 (Conditional): Consider the following Gaussian
\[\begin{equation*} \begin{bmatrix} \boldsymbol x_a \\ \boldsymbol x_b \end{bmatrix} \sim p(\boldsymbol x_a, \boldsymbol x_b) = \mathcal{N}\bracka{ \begin{bmatrix} \boldsymbol \mu_a \\ \boldsymbol \mu_b \end{bmatrix}, \begin{bmatrix} \boldsymbol \Sigma_{aa} & \boldsymbol \Sigma_{ab} \\ \boldsymbol \Sigma_{ba} & \boldsymbol \Sigma_{bb} \\ \end{bmatrix}} \end{equation*}\]
We can show that: \(p(\boldsymbol x_a | \boldsymbol x_b) = \mathcal{N}(\boldsymbol x_a | \boldsymbol \mu_{a|b}, \boldsymbol \Lambda^{-1}_{aa})\), where we have
- \(\boldsymbol \mu_{a\lvert b} = \boldsymbol \mu_a - \boldsymbol \Lambda_{aa}^{-1}\boldsymbol \Lambda_{ab}(\boldsymbol x_b - \boldsymbol \mu_b)\)
- Or, we can set \(\boldsymbol K = \boldsymbol \Sigma_{ab}\boldsymbol \Sigma_{bb}^{-1}\), where we have
\[\begin{equation*} \begin{aligned} &\boldsymbol \mu_{a|b} = \boldsymbol \mu_a + \boldsymbol K(\boldsymbol x_b - \boldsymbol \mu_b) \qquad \begin{aligned}[t] \boldsymbol \Sigma_{a|b} &= \boldsymbol \Sigma_{aa} - \boldsymbol K\boldsymbol \Sigma_{bb}\boldsymbol K^\top \\ &= \boldsymbol \Sigma_{aa} - \boldsymbol \Sigma_{ab}\boldsymbol \Sigma_{bb}^{-1}\boldsymbol \Sigma_{ba} \\ \end{aligned} \end{aligned} \end{equation*}\]
The second form follows from Proposition 9 (you can try plugging the results in).
Proof: We consider the expansion of the conditional distribution, gathering only the terms that involve \(\boldsymbol x_a\)
\[\begin{equation*} \begin{aligned} -\frac{1}{2}(\boldsymbol x_a - \boldsymbol \mu_{a|b})^\top \boldsymbol \Sigma_{a|b}^{-1}(\boldsymbol x_a - \boldsymbol \mu_{a|b}) = {\color{red}-\frac{1}{2}\boldsymbol x_a^\top \boldsymbol \Sigma_{a|b}^{-1}\boldsymbol x_a} + {\color{blue}\boldsymbol x_a^\top \boldsymbol \Sigma_{a|b}^{-1}\boldsymbol \mu_{a|b}} + \text{const} \end{aligned} \end{equation*}\]
Let’s consider the values, which should be equal to Equation 1; we can see that:
- The red term: we set \(\boldsymbol \Sigma_{a\lvert b}^{-1} = \boldsymbol \Lambda_{aa}\) (consider the first green term of Equation 1)
- The blue term: we gather the terms of the form \((\dots)^\top \boldsymbol x_a\). We have:
\[\begin{equation*} \begin{aligned} {\color{green} \boldsymbol \mu_{a}^\top \boldsymbol \Lambda_{aa}\boldsymbol x_a} &{\color{yellow} - \frac{1}{2}\boldsymbol x_a^\top \boldsymbol \Lambda_{ab}\boldsymbol x_b + \frac{1}{2} \boldsymbol x_a^\top \boldsymbol \Lambda_{ab}\boldsymbol \mu_b} {\color{purple} - \frac{1}{2}\boldsymbol x_b^\top \boldsymbol \Lambda_{ba}\boldsymbol x_a + \frac{1}{2}\boldsymbol \mu_b^\top \boldsymbol \Lambda_{ba}\boldsymbol x_a} \\ &= \boldsymbol x_a^\top \Big[ \boldsymbol \Lambda_{aa}\boldsymbol \mu_a - \boldsymbol \Lambda_{ab}\boldsymbol x_b + \boldsymbol \Lambda_{ab}\boldsymbol \mu_b \Big] = \boldsymbol x_a^\top \Big[ \boldsymbol \Lambda_{aa}\boldsymbol \mu_a - \boldsymbol \Lambda_{ab}(\boldsymbol x_b - \boldsymbol \mu_b) \Big] \\ \end{aligned} \end{equation*}\]
\[\begin{equation*} \begin{aligned} \boldsymbol x_a^\top &\boldsymbol \Lambda_{aa}\boldsymbol \mu_{a|b} = \boldsymbol x_a^\top \Big[ \boldsymbol \Lambda_{aa}\boldsymbol \mu_a - \boldsymbol \Lambda_{ab}(\boldsymbol x_b - \boldsymbol \mu_b) \Big] \\ \implies&\boldsymbol \mu_{a|b} \begin{aligned}[t] &= \boldsymbol \Lambda_{aa}^{-1}\boldsymbol \Lambda_{aa}\boldsymbol \mu_a - \boldsymbol \Lambda_{aa}^{-1}\boldsymbol \Lambda_{ab}(\boldsymbol x_b - \boldsymbol \mu_b) \\ &= \boldsymbol \mu_a - \boldsymbol \Lambda_{aa}^{-1}\boldsymbol \Lambda_{ab}(\boldsymbol x_b - \boldsymbol \mu_b) \\ \end{aligned} \end{aligned} \end{equation*}\]
Thus the proof is complete.
Proof
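As a sanity check (illustrative only), the precision-based form \((\boldsymbol \mu_{a|b}, \boldsymbol \Lambda_{aa}^{-1})\) and the covariance-based form using \(\boldsymbol K = \boldsymbol \Sigma_{ab}\boldsymbol \Sigma_{bb}^{-1}\) can be compared numerically on an arbitrary joint covariance:

```python
# Illustrative check of Proposition 12: precision form vs covariance form of the conditional.
import numpy as np

rng = np.random.default_rng(0)
n, na = 5, 2                                   # total dimension, size of the a-block
B = rng.normal(size=(n, n))
Sigma = B @ B.T + np.eye(n)                    # arbitrary symmetric positive definite covariance
mu = rng.normal(size=n)
xb = rng.normal(size=n - na)                   # an arbitrary observed value of x_b

Saa, Sab = Sigma[:na, :na], Sigma[:na, na:]
Sba, Sbb = Sigma[na:, :na], Sigma[na:, na:]
Lam = np.linalg.inv(Sigma)                     # precision matrix
Laa, Lab = Lam[:na, :na], Lam[:na, na:]

# Precision form: mu_{a|b} = mu_a - Laa^{-1} Lab (x_b - mu_b), covariance Laa^{-1}
mean_prec = mu[:na] - np.linalg.inv(Laa) @ Lab @ (xb - mu[na:])
cov_prec = np.linalg.inv(Laa)

# Covariance form with K = Sab Sbb^{-1}
K = Sab @ np.linalg.inv(Sbb)
mean_cov = mu[:na] + K @ (xb - mu[na:])
cov_cov = Saa - Sab @ np.linalg.inv(Sbb) @ Sba

print(np.allclose(mean_prec, mean_cov), np.allclose(cov_prec, cov_cov))   # True True
```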
Proposition 13 (Marginalisation): Consider the following Gaussian
\[\begin{equation*} \begin{bmatrix} \boldsymbol x_a \\ \boldsymbol x_b \end{bmatrix} \sim p(\boldsymbol x_a, \boldsymbol x_b) = \mathcal{N}\bracka{ \begin{bmatrix} \boldsymbol \mu_a \\ \boldsymbol \mu_b \end{bmatrix}, \begin{bmatrix} \boldsymbol \Sigma_{aa} & \boldsymbol \Sigma_{ab} \\ \boldsymbol \Sigma_{ba} & \boldsymbol \Sigma_{bb} \\ \end{bmatrix}} \end{equation*}\]
We can show that: \(p(\boldsymbol x_a) = \mathcal{N}(\boldsymbol x_a | \boldsymbol \mu_{a}, \boldsymbol \Sigma_{aa})\)
We collect the terms that contain \(\boldsymbol x_b\) so that we can integrate them out, as we have:
\[\begin{equation*} \begin{aligned} {\color{grey} -\frac{1}{2}\boldsymbol x_b^\top \boldsymbol \Lambda_{bb}\boldsymbol x_b} &{\color{grey} +} {\color{grey}\boldsymbol \mu_b^\top \boldsymbol \Lambda_{bb}\boldsymbol x_b} {\color{purple} -\frac{1}{2}\boldsymbol x_b^\top \boldsymbol \Lambda_{ba}\boldsymbol x_a + \frac{1}{2}\boldsymbol x_b^\top \boldsymbol \Lambda_{ba} \boldsymbol \mu_a} {\color{yellow} -\frac{1}{2}\boldsymbol x_a^\top \boldsymbol \Lambda_{ab}\boldsymbol x_b+\frac{1}{2}\boldsymbol \mu_a^\top \boldsymbol \Lambda_{ab}\boldsymbol x_b} \\ &= -\frac{1}{2}\boldsymbol x_b^\top \boldsymbol \Lambda_{bb}\boldsymbol x_b + \boldsymbol \mu_b^\top \boldsymbol \Lambda_{bb}\boldsymbol x_b -\boldsymbol x_b^\top \boldsymbol \Lambda_{ba}\boldsymbol x_a + \boldsymbol x_b^\top \boldsymbol \Lambda_{ba} \boldsymbol \mu_a \\ &= -\frac{1}{2}\Big[\boldsymbol x_b^\top \boldsymbol \Lambda_{bb}\boldsymbol x_b - 2\boldsymbol x_b^\top \boldsymbol \Lambda_{bb}\boldsymbol \Lambda_{bb}^{-1}\underbrace{\Big(\boldsymbol \Lambda_{bb}\boldsymbol \mu_b -\boldsymbol \Lambda_{ba}(\boldsymbol x_a - \boldsymbol \mu_a)\Big)}_{\boldsymbol m}\Big] \\ &= -\frac{1}{2}\Big[\boldsymbol x_b^\top \boldsymbol \Lambda_{bb}\boldsymbol x_b - 2\boldsymbol x_b^\top \boldsymbol \Lambda_{bb}\boldsymbol \Lambda_{bb}^{-1}\boldsymbol m + (\boldsymbol \Lambda_{bb}^{-1}\boldsymbol m)^\top \boldsymbol \Lambda_{bb}(\boldsymbol \Lambda_{bb}^{-1}\boldsymbol m) \Big] + \frac{1}{2}(\boldsymbol \Lambda_{bb}^{-1}\boldsymbol m)^\top \boldsymbol \Lambda_{bb}(\boldsymbol \Lambda_{bb}^{-1}\boldsymbol m) \\ &= -\frac{1}{2}(\boldsymbol x_b - \boldsymbol \Lambda_{bb}^{-1}\boldsymbol m)^\top \boldsymbol \Lambda_{bb}(\boldsymbol x_b - \boldsymbol \Lambda_{bb}^{-1}\boldsymbol m) + {\color{blue}\frac{1}{2}\boldsymbol m^\top \boldsymbol \Lambda_{bb}^{-1}\boldsymbol m} \\ \end{aligned} \end{equation*}\]
Integrating the first term over \(\boldsymbol x_b\) gives
\[\begin{equation*} \int \exp\brackc{-\frac{1}{2}(\boldsymbol x_b - \boldsymbol \Lambda_{bb}^{-1}\boldsymbol m)^\top \boldsymbol \Lambda_{bb}(\boldsymbol x_b - \boldsymbol \Lambda_{bb}^{-1}\boldsymbol m)}\dby \boldsymbol x_b \end{equation*}\]
which equals \(\sqrt{\abs{2\pi\boldsymbol \Lambda_{bb}^{-1}}}\) by Proposition 8 and so does not depend on \(\boldsymbol x_a\); it only contributes to the normalising constant. The remaining terms involving \(\boldsymbol x_a\) (including the leftover blue term) are:
\[\begin{equation*} \begin{aligned} {\color{blue}\frac{1}{2}\boldsymbol m^\top \boldsymbol \Lambda_{bb}^{-1}\boldsymbol m} &{\color{green} -}{\color{green} \frac{1}{2}\boldsymbol x^\top _a\boldsymbol \Lambda_{aa}\boldsymbol x_a + \boldsymbol \mu_a^\top \boldsymbol \Lambda_{aa}\boldsymbol x_a} {\color{yellow} + \frac{1}{2} \boldsymbol x_a^\top \boldsymbol \Lambda_{ab}\boldsymbol \mu_b}{\color{purple}+ \frac{1}{2}\boldsymbol \mu_b^\top \boldsymbol \Lambda_{ba}\boldsymbol x_a} \\ &= \begin{aligned}[t] \frac{1}{2}\Big(&\boldsymbol \Lambda_{bb}\boldsymbol \mu_b -\boldsymbol \Lambda_{ba}(\boldsymbol x_a - \boldsymbol \mu_a)\Big)^\top \boldsymbol \Lambda_{bb}^{-1}\Big(\boldsymbol \Lambda_{bb}\boldsymbol \mu_b -\boldsymbol \Lambda_{ba}(\boldsymbol x_a - \boldsymbol \mu_a)\Big) \\ &{\color{green} - \frac{1}{2}\boldsymbol x^\top _a\boldsymbol \Lambda_{aa}\boldsymbol x_a + \boldsymbol \mu_a^\top \boldsymbol \Lambda_{aa}\boldsymbol x_a} + \boldsymbol x_a^\top \boldsymbol \Lambda_{ab}\boldsymbol \mu_b \end{aligned} \\ &= \begin{aligned}[t] \frac{1}{2}\Big[ \boldsymbol \mu_b^\top \boldsymbol \Lambda_{bb}\boldsymbol \mu_b &- (\boldsymbol x_a - \boldsymbol \mu_a)^\top \boldsymbol \Lambda_{ba}^\top \boldsymbol \mu_b \\ &- \boldsymbol \mu_b^\top \boldsymbol \Lambda_{ba}(\boldsymbol x_a - \boldsymbol \mu_a) + (\boldsymbol x_a - \boldsymbol \mu_a)^\top \boldsymbol \Lambda_{ba}^\top \boldsymbol \Lambda_{bb}^{-1}\boldsymbol \Lambda_{ba}(\boldsymbol x_a - \boldsymbol \mu_a) \Big] \\ &{\color{green} - \frac{1}{2}\boldsymbol x^\top _a\boldsymbol \Lambda_{aa}\boldsymbol x_a + \boldsymbol \mu_a^\top \boldsymbol \Lambda_{aa}\boldsymbol x_a} + \boldsymbol x_a^\top \boldsymbol \Lambda_{ab}\boldsymbol \mu_b \end{aligned} \\ &= \begin{aligned}[t] \frac{1}{2}\Big[ \boldsymbol \mu_b^\top \boldsymbol \Lambda_{bb}\boldsymbol \mu_b &- \boldsymbol x_a^\top \boldsymbol \Lambda_{ba}^\top \boldsymbol \mu_b + \boldsymbol \mu_a^\top \boldsymbol \Lambda_{ba}^\top \boldsymbol \mu_b - \boldsymbol \mu_b^\top \boldsymbol \Lambda_{ba}\boldsymbol x_a + \boldsymbol \mu_b^\top \boldsymbol \Lambda_{ba}\boldsymbol \mu_a \\ &+ \boldsymbol x_a^\top \boldsymbol \Lambda_{ba}^\top \boldsymbol \Lambda_{bb}^{-1}\boldsymbol \Lambda_{ba}\boldsymbol x_a - \boldsymbol x_a^\top \boldsymbol \Lambda_{ba}^\top \boldsymbol \Lambda_{bb}^{-1}\boldsymbol \Lambda_{ba}\boldsymbol\mu_a - \boldsymbol \mu_a^\top \boldsymbol \Lambda_{ba}^\top \boldsymbol \Lambda_{bb}^{-1}\boldsymbol \Lambda_{ba}\boldsymbol x_a \\ &+ \boldsymbol \mu_a^\top \boldsymbol \Lambda_{ba}^\top \boldsymbol \Lambda_{bb}^{-1}\boldsymbol \Lambda_{ba}\boldsymbol \mu_a \Big] {\color{green} - \frac{1}{2}\boldsymbol x^\top _a\boldsymbol \Lambda_{aa}\boldsymbol x_a + \boldsymbol \mu_a^\top \boldsymbol \Lambda_{aa}\boldsymbol x_a} + \boldsymbol x_a^\top \boldsymbol \Lambda_{ab}\boldsymbol \mu_b \end{aligned} \\ &= \begin{aligned}[t] \frac{1}{2}\Big[-2\boldsymbol x_a^\top \boldsymbol \Lambda_{ba}^\top \boldsymbol \mu_b &+ \boldsymbol x_a^\top \boldsymbol \Lambda_{ba}^\top \boldsymbol \Lambda_{bb}^{-1}\boldsymbol \Lambda_{ba}\boldsymbol x_a - 2\boldsymbol x_a^\top \boldsymbol \Lambda_{ba}^\top \boldsymbol \Lambda_{bb}^{-1}\boldsymbol \Lambda_{ba}\boldsymbol\mu_a\Big] \\ &{\color{green} - \frac{1}{2}\boldsymbol x^\top _a\boldsymbol \Lambda_{aa}\boldsymbol x_a + \boldsymbol \mu_a^\top \boldsymbol \Lambda_{aa}\boldsymbol x_a} + \boldsymbol x_a^\top \boldsymbol \Lambda_{ab}\boldsymbol \mu_b + \text{const} \end{aligned} \\ &= \begin{aligned}[t] -\frac{1}{2}\boldsymbol x_a^\top \Big[\boldsymbol \Lambda_{aa} - 
\boldsymbol \Lambda_{ba}^\top \boldsymbol \Lambda_{bb}^{-1}\boldsymbol \Lambda_{ba}\Big]\boldsymbol x_a &+ \boldsymbol x_a^\top \Big[\boldsymbol \Lambda_{aa} - \boldsymbol \Lambda_{ba}^\top \boldsymbol \Lambda_{bb}^{-1}\boldsymbol \Lambda_{ba}\Big]\boldsymbol\mu_a + \text{const} \end{aligned} \\ \end{aligned} \end{equation*}\]
If we compare this form to the usual quadratic expansion of a Gaussian, we see that the mean of the marginalised Gaussian is \(\boldsymbol \mu_a\), while the covariance is \((\boldsymbol \Lambda_{aa} - \boldsymbol \Lambda_{ba}^\top \boldsymbol \Lambda_{bb}^{-1}\boldsymbol \Lambda_{ba})^{-1}\). Comparing this to the inverse of a partitioned matrix (Proposition 9), this is exactly \(\boldsymbol \Sigma_{aa}\). This completes the proof.
Proof
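The key identification in the last step, \((\boldsymbol \Lambda_{aa} - \boldsymbol \Lambda_{ba}^\top \boldsymbol \Lambda_{bb}^{-1}\boldsymbol \Lambda_{ba})^{-1} = \boldsymbol \Sigma_{aa}\), can also be checked numerically (illustrative sketch with an arbitrary covariance):

```python
# Illustrative check of the last step of Proposition 13:
# (Lambda_aa - Lambda_ba^T Lambda_bb^{-1} Lambda_ba)^{-1} = Sigma_aa.
import numpy as np

rng = np.random.default_rng(1)
n, na = 5, 2
B = rng.normal(size=(n, n))
Sigma = B @ B.T + np.eye(n)                    # arbitrary symmetric positive definite covariance
Lam = np.linalg.inv(Sigma)
Laa = Lam[:na, :na]
Lba, Lbb = Lam[na:, :na], Lam[na:, na:]

cov_marginal = np.linalg.inv(Laa - Lba.T @ np.linalg.inv(Lbb) @ Lba)
print(np.allclose(cov_marginal, Sigma[:na, :na]))   # True
```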
Proposition 14 (Linear Gaussian Model): Consider the distribution to be: \(p(\boldsymbol x) = \mathcal{N}(\boldsymbol x | {\color{yellow} \boldsymbol \mu} , {\color{blue} \boldsymbol \Lambda^{-1}})\) and \(p(\boldsymbol y | \boldsymbol x) = \mathcal{N}(\boldsymbol y | {\color{purple}\boldsymbol A}\boldsymbol x + {\color{green} \boldsymbol b}, {\color{red} \boldsymbol L^{-1}})\). We can show that the following holds:
\[\begin{equation*} \begin{aligned} p(\boldsymbol y) &= \mathcal{N}(\boldsymbol y | {\color{purple}\boldsymbol A}{\color{yellow} \boldsymbol \mu} + {\color{green} \boldsymbol b}, {\color{red} \boldsymbol L^{-1}} + {\color{purple}\boldsymbol A}{\color{blue} \boldsymbol \Lambda^{-1}}{\color{purple}\boldsymbol A^\top }) \\ p(\boldsymbol x | \boldsymbol y) &= \mathcal{N}\bracka{ \boldsymbol x | {\color{grey} \boldsymbol \Sigma}\brackc{ {\color{purple}\boldsymbol A^\top } {\color{red} \boldsymbol L}(\boldsymbol y-{\color{green} \boldsymbol b}) + {\color{blue} \boldsymbol \Lambda}{\color{yellow} \boldsymbol \mu}}, {\color{grey} \boldsymbol \Sigma} } \end{aligned} \end{equation*}\]
where we have \({\color{grey} \boldsymbol \Sigma} = ({\color{blue} \boldsymbol \Lambda} + {\color{purple}\boldsymbol A^\top }{\color{red} \boldsymbol L}{\color{purple}\boldsymbol A})^{-1}\)
We will consider the joint random variable \(\boldsymbol z = (\boldsymbol x, \boldsymbol y)^\top\). Let’s consider the joint distribution and the terms inside its exponential:
\[\begin{equation*} \begin{aligned} -\frac{1}{2}&(\boldsymbol x - \boldsymbol \mu)^\top \boldsymbol \Lambda(\boldsymbol x - \boldsymbol \mu) - \frac{1}{2}(\boldsymbol y - \boldsymbol A\boldsymbol x - \boldsymbol b)^\top \boldsymbol L(\boldsymbol y - \boldsymbol A\boldsymbol x - \boldsymbol b) + \text{const} \\ &= \begin{aligned}[t] -\frac{1}{2}\Big[ \boldsymbol x^\top \boldsymbol \Lambda\boldsymbol x &- 2\boldsymbol \mu^\top \boldsymbol \Lambda\boldsymbol x + \boldsymbol \mu^\top \boldsymbol \Lambda\boldsymbol \mu + \boldsymbol y^\top \boldsymbol L\boldsymbol y - \boldsymbol y^\top \boldsymbol L\boldsymbol A\boldsymbol x - \boldsymbol y^\top \boldsymbol L\boldsymbol b\\ &-\boldsymbol x^\top \boldsymbol A^\top \boldsymbol L\boldsymbol y + \boldsymbol x^\top \boldsymbol A^\top \boldsymbol L \boldsymbol A\boldsymbol x + \boldsymbol x^\top \boldsymbol A^\top \boldsymbol L \boldsymbol b - \boldsymbol b^\top \boldsymbol L\boldsymbol y +\boldsymbol b^\top \boldsymbol L\boldsymbol A\boldsymbol x + \boldsymbol b^\top \boldsymbol L\boldsymbol b \Big] + \text{const} \end{aligned} \\ &= \begin{aligned}[t] -\frac{1}{2}\Big[ \boldsymbol x^\top \boldsymbol \Lambda\boldsymbol x &- 2\boldsymbol \mu^\top \boldsymbol \Lambda\boldsymbol x + \boldsymbol y^\top \boldsymbol L\boldsymbol y - \boldsymbol y^\top \boldsymbol L\boldsymbol A\boldsymbol x - \boldsymbol y^\top \boldsymbol L\boldsymbol b\\ &-\boldsymbol x^\top \boldsymbol A^\top \boldsymbol L\boldsymbol y + \boldsymbol x^\top \boldsymbol A^\top \boldsymbol L \boldsymbol A\boldsymbol x + \boldsymbol x^\top \boldsymbol A^\top \boldsymbol L \boldsymbol b - \boldsymbol b^\top \boldsymbol L\boldsymbol y +\boldsymbol b^\top \boldsymbol L\boldsymbol A\boldsymbol x \Big] + \text{const} \end{aligned} \\ &= \begin{aligned}[t] -\frac{1}{2}\Big[ \boldsymbol x^\top \Big(\boldsymbol \Lambda + \boldsymbol A^\top \boldsymbol L \boldsymbol A\Big)\boldsymbol x &+ \boldsymbol y^\top \boldsymbol L\boldsymbol y - \boldsymbol y^\top \boldsymbol L\boldsymbol A\boldsymbol x - \boldsymbol x^\top \boldsymbol A^\top \boldsymbol L\boldsymbol y\\ &+ 2\boldsymbol x^\top \boldsymbol A^\top \boldsymbol L \boldsymbol b - 2\boldsymbol \mu^\top \boldsymbol \Lambda\boldsymbol x - 2\boldsymbol y^\top \boldsymbol L\boldsymbol b \Big] + \text{const} \end{aligned} \\ &= \begin{aligned}[t] -\frac{1}{2}\Big[ \boldsymbol x^\top \Big(\boldsymbol \Lambda + \boldsymbol A^\top \boldsymbol L \boldsymbol A\Big)\boldsymbol x &+ \boldsymbol y^\top \boldsymbol L\boldsymbol y - \boldsymbol y^\top \boldsymbol L\boldsymbol A\boldsymbol x - \boldsymbol x^\top \boldsymbol A^\top \boldsymbol L\boldsymbol y\Big]\\ &- \boldsymbol x^\top \boldsymbol A^\top \boldsymbol L \boldsymbol b + \boldsymbol \mu^\top \boldsymbol \Lambda\boldsymbol x + \boldsymbol y^\top \boldsymbol L\boldsymbol b + \text{const} \end{aligned} \\ &= \begin{aligned}[t] -\frac{1}{2} \begin{pmatrix} \boldsymbol x \\ \boldsymbol y \end{pmatrix}^\top \begin{pmatrix} \boldsymbol \Lambda + \boldsymbol A^\top \boldsymbol L\boldsymbol A & -\boldsymbol A^\top \boldsymbol L \\ -\boldsymbol L\boldsymbol A & \boldsymbol L \end{pmatrix} \begin{pmatrix} \boldsymbol x \\ \boldsymbol y \end{pmatrix} + \begin{pmatrix} \boldsymbol x \\ \boldsymbol y \end{pmatrix}^\top \begin{pmatrix} \boldsymbol \Lambda\boldsymbol \mu - \boldsymbol A^\top \boldsymbol L\boldsymbol b \\ \boldsymbol L\boldsymbol b \end{pmatrix} + \text{const} \end{aligned} \\ \end{aligned} \end{equation*}\]
Reading off the precision matrix of the joint and inverting it with Proposition 9 gives the covariance of \(\boldsymbol z\):
\[\begin{equation*} \begin{aligned} \begin{pmatrix} \boldsymbol \Lambda + \boldsymbol A^\top \boldsymbol L\boldsymbol A & -\boldsymbol A^\top \boldsymbol L \\ -\boldsymbol L\boldsymbol A & \boldsymbol L \end{pmatrix}^{-1} = \begin{pmatrix} \boldsymbol \Lambda^{-1} & \boldsymbol \Lambda^{-1}\boldsymbol A^\top \\ \boldsymbol A\boldsymbol \Lambda^{-1} & \boldsymbol L^{-1} + \boldsymbol A\boldsymbol \Lambda^{-1}\boldsymbol A^\top \end{pmatrix} \end{aligned} \end{equation*}\]
and the mean of the joint is the covariance multiplied by the linear coefficient:
\[\begin{equation*} \begin{aligned} \begin{pmatrix} \boldsymbol \Lambda^{-1} & \boldsymbol \Lambda^{-1}\boldsymbol A^\top \\ \boldsymbol A\boldsymbol \Lambda^{-1} & \boldsymbol L^{-1} + \boldsymbol A\boldsymbol \Lambda^{-1}\boldsymbol A^\top \end{pmatrix} \begin{pmatrix} \boldsymbol \Lambda\boldsymbol \mu - \boldsymbol A^\top \boldsymbol L\boldsymbol b \\ \boldsymbol L\boldsymbol b \end{pmatrix} = \begin{pmatrix} \boldsymbol \mu \\ \boldsymbol A\boldsymbol \mu + \boldsymbol b \end{pmatrix} \end{aligned} \end{equation*}\]
The marginal \(p(\boldsymbol y)\) now follows from Proposition 13 applied to the blocks above. For the conditional \(p(\boldsymbol x | \boldsymbol y)\), Proposition 12 gives the covariance \(\boldsymbol \Sigma = (\boldsymbol \Lambda + \boldsymbol A^\top \boldsymbol L\boldsymbol A)^{-1}\) and the mean
\[\begin{equation*} \begin{aligned} \boldsymbol \mu - &(\boldsymbol \Lambda + \boldsymbol A^\top \boldsymbol L\boldsymbol A)^{-1}(\boldsymbol A^\top \boldsymbol L)(-\boldsymbol y + \boldsymbol A\boldsymbol \mu + \boldsymbol b) \\ &= \boldsymbol \Sigma\boldsymbol A^\top \boldsymbol L(\boldsymbol y - \boldsymbol b) + \boldsymbol \mu - \boldsymbol \Sigma\boldsymbol A^\top \boldsymbol L\boldsymbol A\boldsymbol \mu \end{aligned} \end{equation*}\]
We want to show that \(\boldsymbol \mu - \boldsymbol \Sigma\boldsymbol A^\top \boldsymbol L\boldsymbol A\boldsymbol \mu = \boldsymbol \Sigma\boldsymbol \Lambda\boldsymbol \mu\).
LHS: Apply Proposition 10, considering the part highlighted in pink:
\[ \begin{aligned} \boldsymbol \mu &- \boldsymbol \Sigma\boldsymbol A^\top \boldsymbol L\boldsymbol A\boldsymbol \mu \\ &= \boldsymbol \mu - {\color{pink}(\boldsymbol \Lambda + \boldsymbol A^\top \boldsymbol L\boldsymbol A)^{-1}\boldsymbol A^\top \boldsymbol L}\boldsymbol A\boldsymbol \mu \\ &= \boldsymbol \mu - \boldsymbol \Lambda^{-1}\boldsymbol A^\top (\boldsymbol A\boldsymbol \Lambda^{-1}\boldsymbol A^\top + \boldsymbol L^{-1})^{-1}\boldsymbol A\boldsymbol \mu \end{aligned} \]
RHS: We consider the part highlighted in blue and use Proposition 11:
\[ \begin{aligned} \boldsymbol \Sigma\boldsymbol \Lambda\boldsymbol \mu &= {\color{blue}(\boldsymbol \Lambda + \boldsymbol A^\top \boldsymbol L\boldsymbol A)^{-1}}\boldsymbol \Lambda\boldsymbol \mu \\ &= \boldsymbol \mu - \boldsymbol \Lambda^{-1}\boldsymbol A^\top (\boldsymbol A\boldsymbol \Lambda^{-1}\boldsymbol A^\top + \boldsymbol L^{-1})^{-1}\boldsymbol A\boldsymbol \mu \end{aligned} \]
Now we have shown that both sides are equal, which concludes the proof.
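Finally, a hedged numerical sketch of Proposition 14 (illustrative only; the dimensions, \(\boldsymbol A\), \(\boldsymbol b\) and the precisions below are arbitrary examples): it builds the joint of \(\boldsymbol z = (\boldsymbol x, \boldsymbol y)\) as in the proof, checks the marginal \(p(\boldsymbol y)\) by Monte Carlo, and compares the conditional \(p(\boldsymbol x|\boldsymbol y)\) obtained via Proposition 12 with the closed-form expressions above:

```python
# Numerical sketch of Proposition 14 (illustrative only; all parameter values are arbitrary).
import numpy as np

rng = np.random.default_rng(0)
nx, ny = 3, 2
mu = rng.normal(size=nx)
Lam = np.eye(nx)                                   # precision of p(x)
L = 2.0 * np.eye(ny)                               # precision of p(y | x)
A = rng.normal(size=(ny, nx))
b = rng.normal(size=ny)
Lam_inv, L_inv = np.linalg.inv(Lam), np.linalg.inv(L)

# Joint covariance of z = (x, y), as derived in the proof
cov_z = np.block([[Lam_inv, Lam_inv @ A.T],
                  [A @ Lam_inv, L_inv + A @ Lam_inv @ A.T]])

# Monte Carlo check of the marginal p(y) = N(A mu + b, L^{-1} + A Lambda^{-1} A^T)
N = 200_000
xs = rng.multivariate_normal(mu, Lam_inv, size=N)
ys = xs @ A.T + b + rng.multivariate_normal(np.zeros(ny), L_inv, size=N)
print(np.allclose(ys.mean(axis=0), A @ mu + b, atol=0.02))
print(np.allclose(np.cov(ys.T), L_inv + A @ Lam_inv @ A.T, atol=0.05))

# Conditional p(x | y) at an arbitrary y: Proposition 12 applied to the joint ...
y = rng.normal(size=ny)
mu_y = A @ mu + b
K = cov_z[:nx, nx:] @ np.linalg.inv(cov_z[nx:, nx:])
mean_joint = mu + K @ (y - mu_y)
cov_joint = cov_z[:nx, :nx] - K @ cov_z[nx:, :nx]

# ... versus the closed form of Proposition 14
Sig = np.linalg.inv(Lam + A.T @ L @ A)
mean_formula = Sig @ (A.T @ L @ (y - b) + Lam @ mu)
print(np.allclose(mean_joint, mean_formula), np.allclose(cov_joint, Sig))   # True True
```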