Gaussian That I Have Known and Loved (1)

A cheat sheet/overview of the applications and useful tricks related to the Gaussian distribution within the field of machine learning.
Machine Learning
Published: November 27, 2022

This writing is based on Bishop’s Pattern Recognition and Machine Learning (Bishop (2006)), and partly on the CS229 notes on Gaussians.

Gaussian Definition and Fundamental Integral

Definition

Definition 1 (Single Variable Gaussian Distribution): The Gaussian distribution is defined as

\[\begin{equation*} \newcommand{\dby}{\ \mathrm{d}}\newcommand{\bracka}[1]{\left( #1 \right)}\newcommand{\brackb}[1]{\left[ #1 \right]}\newcommand{\brackc}[1]{\left\{ #1 \right\}}\newcommand{\abs}[1]{\left|#1\right|} p(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left\{ -\frac{1}{2\sigma^2}(x-\mu)^2\right\} \end{equation*}\]

where \(\mu\) is a parameter called the mean, while \(\sigma\) is a parameter called the standard deviation, and we call \(\sigma^2\) the variance.

Finally, the standard normal distribution is the Gaussian distribution with mean \(0\) and variance \(1\).

To find the normalizing factor of the Gaussian distribution, we use the following integral:

Propositions

Proposition 1 (Gaussian integral): The integral of \(\exp(-x^2)\) is equal to

\[\begin{equation*} \int^\infty_{-\infty}\exp(-x^2)\dby x = \sqrt{\pi} \end{equation*}\]

We will transform the problem into the \(2\)D polar coordinate system, which makes the integration easier.

\[\begin{equation*} \begin{aligned} \bracka{\int^\infty_{-\infty}\exp(-x^2)\dby x}^2 &= \bracka{\int^\infty_{-\infty}\exp(-x^2)\dby x}\bracka{\int^\infty_{-\infty}\exp(-y^2)\dby y} \\ &= \int^\infty_{-\infty}\int^\infty_{-\infty}\exp\bracka{-(x^2+y^2)}\dby x\dby y \\ &= \int^{2\pi}_{0}\int^\infty_{0}\exp\bracka{-r^2}r\dby r\dby\theta \\ &= 2\pi\int^\infty_{0}\exp\bracka{-r^2}r\dby r = \pi\int^{0}_{-\infty} \exp\bracka{u}\dby u = \pi \end{aligned} \end{equation*}\]

Please note that we use the \(u\)-substitution \(u = -r^2\) in the last step. Taking the square root of both sides gives the result.
Proof
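
As a quick numerical sanity check of Proposition 1 (a minimal sketch, not part of the proof; it assumes numpy and scipy are available):

```python
import numpy as np
from scipy.integrate import quad

# Numerically evaluate the Gaussian integral of Proposition 1 over the real line.
value, abs_err = quad(lambda x: np.exp(-x**2), -np.inf, np.inf)

print(value, np.sqrt(np.pi))              # both approximately 1.7724538509
assert np.isclose(value, np.sqrt(np.pi))  # matches the analytic result
```
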
Corollary

Corollary 1 (Normalization of Gaussian Distribution): We consider the integration

\[\begin{equation*} \int^\infty_{-\infty}\exp\left\{ -\frac{(x-\mu)^2}{2\sigma^2}\right\} \dby x = \sqrt{2\pi\sigma^2} \end{equation*}\]

Start by setting \(z=x-\mu\), which gives:

\[\begin{equation*} \begin{aligned} \int^\infty_{-\infty}\exp\left\{ -\frac{z^2}{2\sigma^2}\right\} \dby z &= \sigma\sqrt{2} \int^\infty_{-\infty}\frac{1}{\sqrt{2}\sigma}\exp\left\{ -\frac{z^2}{2\sigma^2}\right\} \dby z \\ &= \sigma\sqrt{2} \int^\infty_{-\infty} \exp(-y^2)\dby y = \sigma\sqrt{2\pi} \end{aligned} \end{equation*}\]

where we have \(y = z/(\sqrt{2}\sigma)\).
Proof

Statistics of Single Variable Gaussian Distribution

Definition

Definition 2 (Odd Function): The function \(f(x)\) is an odd function iff \(f(-x) = -f(x)\)

Lemma

Lemma 1 (Odd Function Integration): The integral of an odd function \(f(x)\) over \([-a, a]\), where \(a\in\mathbb{R}^+\), is \(0\), i.e.

\[\begin{equation*}\int^a_{-a}f(x)\dby x = 0\end{equation*}\]

We can see that:

\[\begin{equation*} \begin{aligned} \int^a_{-a}f(x)\dby x &= \int^0_{-a}f(x)\dby x + \int^a_{0}f(x)\dby x \\ &= \int^a_{0}f(-x)\dby x + \int^a_{0}f(x)\dby x \\ &= -\int^a_{0}f(x)\dby x + \int^a_{0}f(x)\dby x = 0 \end{aligned} \end{equation*}\]

Proof
Propositions

Proposition 2 (Mean of Gaussian): The expectation \(\mathbb{E}_{x}[x]\) of a normally distributed random variable \(x\) is \(\mu\)

Let’s consider the following integral

\[\begin{equation*} \begin{aligned} \frac{1}{\sqrt{2\pi\sigma^2}}\int^\infty_{-\infty} \exp\bracka{-\frac{(x-\mu)^2}{2\sigma^2}}x\dby x \end{aligned} \end{equation*}\]

We will set \(z = x - \mu\), where we have:

\[\begin{equation*} \begin{aligned} \frac{1}{\sqrt{2\pi\sigma^2}}&\int^\infty_{-\infty} \exp\bracka{-\frac{z^2}{2\sigma^2}}(z+\mu)\dby z \\ &= \frac{1}{\sqrt{2\pi\sigma^2}}\Bigg[ \underbrace{\int^\infty_{-\infty}\exp\bracka{-\frac{z^2}{2\sigma^2}}z\dby z}_{I_1} + \underbrace{\int^\infty_{-\infty}\exp\bracka{-\frac{z^2}{2\sigma^2}}\mu\dby z}_{I_2} \Bigg] \end{aligned} \end{equation*}\]

Let’s consider \(I_1\), where it is clear that

\[\begin{equation*} g(z) = \exp\bracka{-\frac{z^2}{2\sigma^2}}z = -\bracka{\exp\bracka{-\frac{(-z)^2}{2\sigma^2}}(-z)} = -g(-z) \end{equation*}\]

Thus the function \(g(z)\) is an odd function, which makes the integral \(I_1\) vanish to \(0\) (Lemma 1, taking the limit \(a \to \infty\)). Now, for the second integral, we use Corollary 1 of the Gaussian, where

\[\begin{equation*} \int^\infty_{-\infty}\exp\bracka{-\frac{z^2}{2\sigma^2}}\mu\dby z = \mu\int^\infty_{-\infty}\exp\bracka{-\frac{z^2}{2\sigma^2}}\dby z = \mu\sqrt{2\pi\sigma^2} \end{equation*}\]

Finally, we have

\[\begin{equation*} \begin{aligned} \frac{1}{\sqrt{2\pi\sigma^2}}\int^\infty_{-\infty} \exp\bracka{-\frac{z^2}{2\sigma^2}}(z+\mu)\dby z = \frac{1}{\sqrt{2\pi\sigma^2}}\Bigg[ 0 + \mu\sqrt{2\pi\sigma^2} \Bigg] = \mu \end{aligned} \end{equation*}\]

This completes the proof.
Proof
Lemma

Lemma 2 : The variance \(\operatorname{var}(x) = \mathbb{E}[(x - \mu)^2]\) is equal to \(\mathbb{E}[x^2] - \mathbb{E}[x]^2\)

This follows by expanding the definition:

\[\begin{equation*} \begin{aligned} \mathbb{E}[(x - \mu)^2] &= \mathbb{E}[x^2 -2x\mu + \mu^2] \\ &= \mathbb{E}[x^2] - 2\mathbb{E}[x]\mu + \mathbb{E}[\mu^2] \\ &= \mathbb{E}[x^2] - 2\mathbb{E}[x]^2 + \mathbb{E}[x]^2 \\ &= \mathbb{E}[x^2] - \mathbb{E}[x]^2 \\ \end{aligned} \end{equation*}\]

Proof
Propositions

Proposition 3 : The variance of the normal distribution is \(\sigma^2\)

Let’s consider the following equation, where we set \(z = x-\mu\):

\[\begin{equation*} \begin{aligned} \frac{1}{\sqrt{2\pi\sigma^2}}& \int^\infty_{-\infty} \exp\bracka{-\frac{(x-\mu)^2}{2\sigma^2}}x^2\dby x \\ &= \frac{1}{\sqrt{2\pi\sigma^2}} \int^\infty_{-\infty} \exp\bracka{-\frac{z^2}{2\sigma^2}}(z+\mu)^2\dby z \\ &=\frac{1}{\sqrt{2\pi\sigma^2}}\brackb{\int^\infty_{-\infty}\exp\bracka{-\frac{z^2}{2\sigma^2}}z^2\dby z + \int^\infty_{-\infty}\exp\bracka{-\frac{z^2}{2\sigma^2}}2\mu z\dby z + \int^\infty_{-\infty}\exp\bracka{-\frac{z^2}{2\sigma^2}}\mu^2\dby z } \\ &=\frac{1}{\sqrt{2\pi\sigma^2}}\brackb{\int^\infty_{-\infty}\exp\bracka{-\frac{z^2}{2\sigma^2}}z^2\dby z + 0 + \mu^2\sqrt{2\pi\sigma^2} } \end{aligned} \end{equation*}\]

Now let’s consider the first integral. Please note that

\[\begin{equation*} \frac{d}{dz} \exp\bracka{-\frac{z^2}{2\sigma^2}} = \exp\bracka{-\frac{z^2}{2\sigma^2}}\bracka{-\frac{z}{\sigma^2}} \end{equation*}\]

So we can perform integration by parts with \(u=z\) and \(\dby v = \exp\bracka{-\frac{z^2}{2\sigma^2}}\bracka{-\frac{z}{\sigma^2}}\dby z\):

\[\begin{equation*} \begin{aligned} \int^\infty_{-\infty}&\exp\bracka{-\frac{z^2}{2\sigma^2}}z^2\dby z = -\sigma^2\int^\infty_{-\infty}\exp\bracka{-\frac{z^2}{2\sigma^2}}\bracka{-\frac{z}{\sigma^2}}z\dby z \\ &= -\sigma^2\brackb{\left.z\exp\bracka{-\frac{z^2}{2\sigma^2}}\right|^\infty_{-\infty} - \int^\infty_{-\infty}\exp\bracka{-\frac{z^2}{2\sigma^2}}\dby z} \\ &= -\sigma^2[0 - \sqrt{2\pi\sigma^2}] = \sigma^2\sqrt{2\pi\sigma^2} \end{aligned} \end{equation*}\]

To show that the boundary term vanishes, we have

\[\begin{equation*} \begin{aligned} \lim_{z\rightarrow\infty} z\exp\bracka{-\frac{z^2}{2\sigma^2}} &- \lim_{z\rightarrow-\infty} z\exp\bracka{-\frac{z^2}{2\sigma^2}} \\ &= 0 - \lim_{z\rightarrow-\infty}1\cdot\exp\bracka{-\frac{z^2}{2\sigma^2}}\bracka{\frac{\sigma^2}{z}} \\ &= 0-0 = 0 \end{aligned} \end{equation*}\]

The first equality comes from L’Hôpital’s rule. Combining the results:

\[\begin{equation*} \begin{aligned} \mathbb{E}[x^2] &- \mathbb{E}[x]^2 = \sigma^2 + \mu^2 - \mu^2 = \sigma^2 \end{aligned} \end{equation*}\]

Proof
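
Propositions 2 and 3 can also be sanity-checked numerically. Below is a minimal sketch, again assuming numpy and scipy are available; the parameters \(\mu = 1.5\) and \(\sigma = 0.7\) are arbitrary example values:

```python
import numpy as np
from scipy.integrate import quad

mu, sigma = 1.5, 0.7   # arbitrary example parameters

def pdf(x):
    # Single-variable Gaussian density from Definition 1.
    return np.exp(-(x - mu) ** 2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)

mean = quad(lambda x: x * pdf(x), -np.inf, np.inf)[0]
second_moment = quad(lambda x: x**2 * pdf(x), -np.inf, np.inf)[0]

assert np.isclose(mean, mu)                           # Proposition 2
assert np.isclose(second_moment - mean**2, sigma**2)  # Lemma 2 + Proposition 3
```
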

This second part introduces some mathematical background, such as linear algebra. Further results on linear Gaussian models and related topics will be presented in the next part.

Useful Backgrounds

Covariance and Covariance Matrix

Definition

Definition 3 (Covariance): Given 2 random variables \(X\) and \(Y\), the covariance is defined as:

\[\begin{equation*} \operatorname{cov}(X, Y) = \mathbb{E}\Big[ (X - \mathbb{E}[X])(Y - \mathbb{E}[Y]) \Big] \end{equation*}\]

It is clear that \(\operatorname{cov}(X, Y) = \operatorname{cov}(Y, X)\)

Definition

Definition 4 (Covariance Matrix): Now, if we have a collection of random variables in the random vector \(\boldsymbol x\) (of size \(n\)), then the covariances between its elements are collected in the covariance matrix

\[\begin{equation*} \begin{aligned} \operatorname{cov}(\boldsymbol x) &= \begin{bmatrix} \operatorname{cov}(x_1, x_1) & \operatorname{cov}(x_1, x_2) & \cdots & \operatorname{cov}(x_1, x_n) \\ \operatorname{cov}(x_2, x_1) & \operatorname{cov}(x_2, x_2) & \cdots & \operatorname{cov}(x_2, x_n) \\ \vdots & \vdots & \ddots & \vdots \\ \operatorname{cov}(x_n, x_1) & \operatorname{cov}(x_n, x_2) & \cdots & \operatorname{cov}(x_n, x_n) \\ \end{bmatrix} \\ &= \mathbb{E}\Big[ (\boldsymbol x - \mathbb{E}[\boldsymbol x])(\boldsymbol x - \mathbb{E}[\boldsymbol x])^\top \Big] \end{aligned} \end{equation*}\]

The equivalence is clear when we expand the outer product element-wise.

Remark

Remark 1. (Properties of the Covariance Matrix): It is clear that the covariance matrix is (from its definition):

  • Symmetric, as \(\operatorname{cov}(x_a, x_b) = \operatorname{cov}(x_b, x_a)\)
  • Positive semidefinite: if we consider an arbitrary constant vector \(\boldsymbol a \in \mathbb{R}^n\), then by the linearity of expectation, we have (a quick numerical check follows this list):

    \[\begin{equation*} \begin{aligned} \boldsymbol a^\top \operatorname{cov}(\boldsymbol x)\boldsymbol a &= \boldsymbol a^\top \mathbb{E}\Big[ (\boldsymbol x - \mathbb{E}[\boldsymbol x])(\boldsymbol x - \mathbb{E}[\boldsymbol x])^\top \Big]\boldsymbol a \\ &= \mathbb{E}\Big[ \boldsymbol a^\top (\boldsymbol x - \mathbb{E}[\boldsymbol x])(\boldsymbol x - \mathbb{E}[\boldsymbol x])^\top \boldsymbol a \Big] \\ &= \mathbb{E}\Big[ \big(\boldsymbol a^\top (\boldsymbol x - \mathbb{E}[\boldsymbol x])\big)^2 \Big] \geq 0 \\ \end{aligned} \end{equation*}\]
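
As a quick numerical illustration of Remark 1 (a sketch assuming numpy is available; the sample size and dimensionality are arbitrary), we can estimate a covariance matrix from samples and check both properties:

```python
import numpy as np

rng = np.random.default_rng(0)
# Samples of a 3-dimensional random vector with correlated components.
samples = rng.standard_normal((10_000, 3)) @ rng.standard_normal((3, 3))

cov = np.cov(samples, rowvar=False)                # empirical covariance matrix
assert np.allclose(cov, cov.T)                     # symmetric
assert np.all(np.linalg.eigvalsh(cov) >= -1e-10)   # positive semidefinite (up to numerics)
```
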

Linear Algebra

Definition

Definition 5 (Eigenvalues and Eigenvectors): Given a matrix \(\boldsymbol A \in \mathbb{C}^{n\times n}\), the pair of a non-zero vector \(\boldsymbol v \in \mathbb{C}^n\) and a number \(\lambda \in \mathbb{C}\) are called an eigenvector and eigenvalue if

\[\begin{equation*} \begin{aligned} \boldsymbol A \boldsymbol v = \lambda\boldsymbol v \end{aligned} \end{equation*}\]

Propositions

Proposition 4 (Eigenvalues of a Symmetric Matrix): One can show that the eigenvalues of a symmetric (real) matrix are always real.

Let’s consider a pair of eigenvalue/eigenvector \(\lambda \in \mathbb{C}\) and \(\boldsymbol v \in \mathbb{C}^n\) of the symmetric (real) matrix \(\boldsymbol A\), i.e \(\boldsymbol A\boldsymbol v = \lambda\boldsymbol v\). Please note that \((x+yi)(x-yi) = x^2 + y^2 \ge 0\), with equality only when \(x = y = 0\); since the eigenvector is non-zero, this means that \(\bar{\boldsymbol v}^\top \boldsymbol v > 0\) (\(\bar{\boldsymbol v}\) is the vector that contains the complex conjugates of the elements of \(\boldsymbol v\)). Now, see that:

\[\begin{equation*} \begin{aligned} \bar{\boldsymbol v}^\top \boldsymbol A\boldsymbol v &= \bar{\boldsymbol v}^\top (\boldsymbol A\boldsymbol v) = \lambda\bar{\boldsymbol v}^\top \boldsymbol v \\ &= (\bar{\boldsymbol v}^\top \boldsymbol A^\top )\boldsymbol v = \bar{\lambda}\bar{\boldsymbol v}^\top \boldsymbol v \\ \end{aligned} \end{equation*}\]

Note that \(\overline{\boldsymbol A\boldsymbol v} = \bar{\lambda}\bar{\boldsymbol v}\) since \(\boldsymbol A\) is real. Because \(\bar{\boldsymbol v}^\top \boldsymbol v > 0\), we see that \(\bar{\lambda} = \lambda\), which implies that \(\lambda\) is real.

Proof (Following from here)
Propositions

Proposition 5 (Eigenvalues of a Positive Definite Matrix): One can show that the eigenvalues of a positive definite matrix are positive.

We consider the following equations:

\[\begin{equation*} \begin{aligned} \boldsymbol v^\top \boldsymbol A\boldsymbol v = \lambda\boldsymbol v^\top \boldsymbol v > 0 \end{aligned} \end{equation*}\]

Since \(\boldsymbol v^\top \boldsymbol v > 0\), the eigenvalue must satisfy \(\lambda > 0\).

Proof
Definition

Definition 6 (Linear Transformation): Given a function \(A: V \rightarrow W\), where \(V\) and \(W\) are vector spaces, \(A\) is a linear transformation if, for all \(\boldsymbol v, \boldsymbol w \in V\) and \(a, b \in \mathbb{R}\):

\[\begin{equation*} \begin{aligned} A(a\boldsymbol v + b\boldsymbol w) = aA(\boldsymbol v) + bA(\boldsymbol w) \end{aligned} \end{equation*}\]

Remark

Remark 2. (Matrix Multiplication and Linear Transformation): We can represent a linear transformation \(L : V \rightarrow W\) in terms of a matrix. Let’s consider a vector \(\boldsymbol v \in V\) (of dimension \(n\)) together with basis vectors \(\brackc{\boldsymbol b_1,\dots,\boldsymbol b_n}\) of \(V\) and basis vectors \(\brackc{\boldsymbol c_1, \dots, \boldsymbol c_m}\) of \(W\). Then we can represent the vector \(\boldsymbol v\) as:

\[\begin{equation*} \begin{aligned} \boldsymbol v = v_1\boldsymbol b_1 + \dots + v_n\boldsymbol b_n \end{aligned} \end{equation*}\]

This means that we can represent the vector \(\boldsymbol v\) by its coordinates \((v_1, v_2, \dots, v_n)^\top\). Furthermore, we can characterise the transformation of the basis vectors:

\[\begin{equation*} \begin{aligned} &L(\boldsymbol b_i) = l_{1i} \boldsymbol c_1 + l_{2i}\boldsymbol c_2 + \cdots + l_{mi}\boldsymbol c_m \\ \end{aligned} \end{equation*}\]

for \(i=1,\dots,n\). Then, combining the definition of a linear transformation with the transformation of the basis vectors, we have:

\[\begin{equation*} \begin{aligned} L(\boldsymbol v) &= v_1L(\boldsymbol b_1) + \dots + v_nL(\boldsymbol b_n) \\ &= \begin{aligned}[t] &v_1\Big( l_{11} \boldsymbol c_1 + l_{21}\boldsymbol c_2 + \cdots + l_{m1}\boldsymbol c_m \Big) \\ &+v_2\Big( l_{12} \boldsymbol c_1 + l_{22}\boldsymbol c_2 + \cdots + l_{m2}\boldsymbol c_m \Big) \\ &+\cdots \\ &+v_n\Big( l_{1n} \boldsymbol c_1 + l_{2n}\boldsymbol c_2 + \cdots + l_{mn}\boldsymbol c_m \Big)\\ \end{aligned} \\ &= \begin{bmatrix} \sum^n_{i=1}v_il_{1i} \\ \sum^n_{i=1}v_il_{2i} \\ \vdots \\ \sum^n_{i=1}v_il_{mi} \end{bmatrix} = \begin{bmatrix} l_{11} & l_{12} & \cdots & l_{1n} \\ l_{21} & l_{22} & \cdots & l_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ l_{m1} & l_{m2} & \cdots & l_{mn} \\ \end{bmatrix} \begin{bmatrix} v_1 \\ v_2 \\ \vdots \\ v_n \end{bmatrix} \end{aligned} \end{equation*}\]

and so we have shown that a linear transformation can be represented as a matrix multiplication (in finite-dimensional spaces), where the result is expressed in coordinates with respect to the basis \(\brackc{\boldsymbol c_1,\dots,\boldsymbol c_m}\).

Theorem

Theorem 1 (Spectral Theorem, from here): Let \(n \le N\) and \(W\) be an \(n\)-dimensional subspace of \(\mathbb{R}^N\). Given a linear transformation \(A:W\rightarrow W\) that is symmetric, there are eigenvectors \(\boldsymbol v_1,\dots,\boldsymbol v_n\in W\) of \(A\) such that \(\brackc{\boldsymbol v_1,\dots,\boldsymbol v_n}\) is an orthonormal basis for \(W\). For an ordinary symmetric matrix, we let \(n=N\) and \(W=\mathbb{R}^N\).

Remark

Remark 3. (Eigendecomposition): Let’s consider the matrix of (orthonormal) eigenvectors \(\boldsymbol v_1,\boldsymbol v_2,\dots,\boldsymbol v_n \in \mathbb{R}^n\) of a symmetric matrix \(\boldsymbol A \in \mathbb{R}^{n\times n}\) together with eigenvalues \(\lambda_1,\lambda_2,\dots,\lambda_n\); then we have:

\[\begin{equation*} \begin{aligned} \boldsymbol A \begin{bmatrix} \kern.6em | & \kern.2em | \kern.2em & & | \kern.6em \\ \kern.6em\boldsymbol v_1 & \kern.2em\boldsymbol v_2\kern.2em & \kern.2em\cdots\kern.2em & \boldsymbol v_n\kern.6em \\ \kern.6em | & \kern.2em | \kern.2em & & | \kern.6em \\ \end{bmatrix} &= \begin{bmatrix} \kern.6em | & \kern.2em | \kern.2em & & | \kern.6em \\ \kern.6em\boldsymbol v_1 & \kern.2em\boldsymbol v_2\kern.2em & \kern.2em\cdots\kern.2em & \boldsymbol v_n\kern.6em \\ \kern.6em | & \kern.2em | \kern.2em & & | \kern.6em \\ \end{bmatrix}\begin{bmatrix} \lambda_1 & 0 & \cdots & 0 \\ 0 & \lambda_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \lambda_n \\ \end{bmatrix} \\ \end{aligned} \end{equation*}\]

This is equivalent to \(\boldsymbol A\boldsymbol Q = \boldsymbol Q\boldsymbol \Lambda\), which means that if we right-multiply by \({\boldsymbol Q}^\top\) (as the eigenvectors are orthonormal, \(\boldsymbol Q^{-1} = \boldsymbol Q^\top\)), then we have (also written in summation form):

\[\begin{equation*} \boldsymbol A = \boldsymbol Q\boldsymbol \Lambda\boldsymbol Q^\top = \sum^n_{i=1}\lambda_i\boldsymbol v_i\boldsymbol v_i^\top \end{equation*}\]

Please note that \(\boldsymbol A^{-1}\) can be represented as:

\[\begin{equation*} \boldsymbol A^{-1} = \boldsymbol Q\boldsymbol \Lambda^{-1}\boldsymbol Q^\top = \sum^n_{i=1}\frac{1}{\lambda_i}\boldsymbol v_i\boldsymbol v_i^\top \end{equation*}\]

as it is clear that \(\boldsymbol A\boldsymbol A^{-1} = \boldsymbol Q\boldsymbol \Lambda\boldsymbol Q^\top \boldsymbol Q\boldsymbol \Lambda^{-1}\boldsymbol Q^\top = \boldsymbol I\).
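
Remark 3 is easy to verify numerically. The following sketch (assuming numpy is available) builds an arbitrary symmetric matrix, computes its eigendecomposition with np.linalg.eigh, and checks the reconstruction of both \(\boldsymbol A\) and \(\boldsymbol A^{-1}\):

```python
import numpy as np

rng = np.random.default_rng(1)
B = rng.standard_normal((4, 4))
A = B @ B.T + 4 * np.eye(4)      # a symmetric (and positive definite) matrix

lam, Q = np.linalg.eigh(A)       # eigenvalues and orthonormal eigenvectors (columns of Q)

# A = Q Λ Qᵀ = Σ_i λ_i v_i v_iᵀ
assert np.allclose(A, Q @ np.diag(lam) @ Q.T)
assert np.allclose(A, sum(l * np.outer(v, v) for l, v in zip(lam, Q.T)))

# A⁻¹ = Q Λ⁻¹ Qᵀ
assert np.allclose(np.linalg.inv(A), Q @ np.diag(1 / lam) @ Q.T)
```
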

Propositions

Proposition 6 (Determinant and Eigenvalues): Given a matrix \(\boldsymbol A\) together with eigenvalues \(\lambda_1,\lambda_2,\dots,\lambda_n\), we can show that

\[\begin{equation*} \abs{\boldsymbol A} = \prod^n_{i=1}\lambda_i \end{equation*}\]

We will only consider the case where \(\boldsymbol A\) is diagonalizable.

We consider the eigendecomposition of \(\boldsymbol A\), as we have:

\[\begin{equation*} \abs{\boldsymbol A} = \abs{\boldsymbol Q\boldsymbol \Lambda\boldsymbol Q^{-1}} = \abs{\boldsymbol Q}\abs{\boldsymbol \Lambda}\abs{\boldsymbol Q^{-1}} = \frac{\abs{\boldsymbol Q}}{\abs{\boldsymbol Q}}\abs{\boldsymbol \Lambda} = \prod^n_{i=1}\lambda_i \end{equation*}\]

Proof
Propositions

(Trace and Eigenvalues): Given a matrix \(\boldsymbol A\) together with eigenvalues of \(\lambda_1,\lambda_2,\dots,\lambda_n\), then we can show that:

\[\begin{equation*} \operatorname{Tr}(\boldsymbol A) = \sum^n_{i=1}\lambda_i \end{equation*}\]

We will only consider the case where \(\boldsymbol A\) is diagonalizable.

We consider the eigendecomposition of \(\boldsymbol A\). Consider the trace over it, as we have:

\[\begin{equation*} \operatorname{Tr}(\boldsymbol A) = \operatorname{Tr}(\boldsymbol Q\boldsymbol \Lambda\boldsymbol Q^{-1}) = \operatorname{Tr}(\boldsymbol \Lambda\boldsymbol Q^{-1}\boldsymbol Q) = \operatorname{Tr}(\boldsymbol \Lambda) = \sum^n_{i=1}\lambda_i \end{equation*}\]

Proof
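
Both facts are easy to check numerically for a generic random matrix (which is almost surely diagonalizable); a minimal sketch with numpy:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((5, 5))   # generic real matrix
lam = np.linalg.eigvals(A)        # eigenvalues, possibly complex

assert np.isclose(np.linalg.det(A), np.prod(lam).real)  # determinant = product of eigenvalues
assert np.isclose(np.trace(A), np.sum(lam).real)        # trace = sum of eigenvalues
```
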

Miscellaneous

Propositions

Proposition 7 (Change of Variable): Consider a (differentiable, invertible) transformation of variables \(T : \mathbb{R}^k \supset X \rightarrow \mathbb{R}^k\) with \(\boldsymbol y = T(\boldsymbol x)\). Then we can show that:

\[\begin{equation*} \int_{\mathbb{R}^k} f(\boldsymbol y)\dby \boldsymbol y = \int_{\mathbb{R}^k} f(\boldsymbol T(\boldsymbol x))\abs{\boldsymbol J_T(\boldsymbol x)}\dby \boldsymbol x \end{equation*}\]

where \(\abs{\boldsymbol J_T(\boldsymbol x)}\) is the absolute value of the determinant of the Jacobian of the transformation \(T(\cdot)\), which is defined to be:

\[\begin{equation*} \boldsymbol J_T(\boldsymbol x) = \begin{pmatrix} \cfrac{\partial T_1(\cdot)}{\partial x_1} & \cfrac{\partial T_1(\cdot)}{\partial x_2} & \cdots & \cfrac{\partial T_1(\cdot)}{\partial x_k} \\ \cfrac{\partial T_2(\cdot)}{\partial x_1} & \cfrac{\partial T_2(\cdot)}{\partial x_2} & \cdots & \cfrac{\partial T_2(\cdot)}{\partial x_k} \\ \vdots & \vdots & \ddots & \vdots \\ \cfrac{\partial T_k(\cdot)}{\partial x_1} & \cfrac{\partial T_k(\cdot)}{\partial x_2} & \cdots & \cfrac{\partial T_k(\cdot)}{\partial x_k} \\ \end{pmatrix} \end{equation*}\]

Multivariate Gaussian Distribution: Introduction

Definition

Definition 7 (Multivariate Gaussian): The multivariate Gaussian distribution is defined as:

\[\begin{equation*} \mathcal{N}(\boldsymbol x | \boldsymbol \mu, \boldsymbol \Sigma) = \frac{1}{\sqrt{\abs{2\pi\boldsymbol \Sigma}}}\exp\left\{-\frac{1}{2}(\boldsymbol x - \boldsymbol \mu)^\top \boldsymbol \Sigma^{-1}(\boldsymbol x-\boldsymbol \mu)\right\} \end{equation*}\]

where we call \(\boldsymbol \mu \in \mathbb{R}^n\) the mean and \(\boldsymbol \Sigma \in \mathbb{R}^{n\times n}\) the covariance, which should be symmetric and positive definite so that the density is well defined (recall that a covariance matrix is always symmetric and positive semidefinite).

Remark

Remark 4. (2D Independent Gaussian): Now, let’s consider a multivariate Gaussian in the case where the two variables are independent of each other, with different variances, as we define the parameters to be:

\[\begin{equation*} \boldsymbol \mu = \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix} \qquad \boldsymbol \Sigma = \begin{bmatrix} \sigma_1^2 & 0 \\ 0 & \sigma_2^2 \\ \end{bmatrix} \end{equation*}\]

Now, let’s expand the multivariate Gaussian; please note that \(\abs{\boldsymbol \Sigma} = \sigma_1^2\sigma_2^2\) and thus \(\abs{2\pi\boldsymbol \Sigma} = (2\pi)^2\sigma_1^2\sigma_2^2\):

\[\begin{equation*} \begin{aligned} \mathcal{N}(\boldsymbol x | \boldsymbol \mu, \boldsymbol \Sigma) &= \frac{1}{\sqrt{\abs{2\pi\boldsymbol \Sigma}}}\exp\left\{-\frac{1}{2}(\boldsymbol x - \boldsymbol \mu)^\top \boldsymbol \Sigma^{-1}(\boldsymbol x-\boldsymbol \mu)\right\} \\ &= \frac{1}{2\pi\sigma_1\sigma_2} \exp\brackc{-\frac{1}{2} \begin{bmatrix} x_1-\mu_1 \\ x_2-\mu_2 \end{bmatrix}^\top \begin{bmatrix} \sigma_1^{-2} & 0 \\ 0 & \sigma_2^{-2} \\ \end{bmatrix} \begin{bmatrix} x_1-\mu_1 \\ x_2-\mu_2 \end{bmatrix}} \\ &= \frac{1}{2\pi\sigma_1\sigma_2} \exp\brackc{-\frac{1}{2} \brackb{ \bracka{\frac{x_1-\mu_1}{\sigma_1}}^2 + \bracka{\frac{x_2-\mu_2}{\sigma_2}}^2 }}\\ &= \frac{1}{\sqrt{2\pi\sigma_1^2}} \exp\brackc{-\frac{1}{2} \frac{(x_1-\mu_1)^2}{\sigma_1^2}} \frac{1}{\sqrt{2\pi\sigma_2^2}} \exp\brackc{-\frac{1}{2} \frac{(x_2-\mu_2)^2}{\sigma_2^2}}\\ &= \mathcal{N}(x_1 | \mu_1, \sigma_1^2)\mathcal{N}(x_2 | \mu_2, \sigma_2^2) \end{aligned} \end{equation*}\]
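
A tiny numerical check of this factorisation with scipy.stats (the parameters and the evaluation point below are arbitrary example values):

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

mu1, mu2, s1, s2 = 0.5, -1.0, 1.2, 0.8      # arbitrary example parameters
x = np.array([0.3, 0.7])                    # arbitrary evaluation point

# Joint density with a diagonal covariance vs. the product of the marginals.
joint = multivariate_normal([mu1, mu2], np.diag([s1**2, s2**2])).pdf(x)
product = norm(mu1, s1).pdf(x[0]) * norm(mu2, s2).pdf(x[1])

assert np.isclose(joint, product)           # N(x|μ,Σ) = N(x1|μ1,σ1²) · N(x2|μ2,σ2²)
```
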

Remark

Remark 5. (Shape of Gaussian): Let’s consider the eigendecomposition of the inverse covariance matrix (which is positive semidefinite and symmetric), as we have

\[\begin{equation*} \begin{aligned} \boldsymbol \Sigma^{-1} = \sum^n_{i=1}\frac{1}{\lambda_i}\boldsymbol u_i\boldsymbol u_i^\top \end{aligned} \end{equation*}\]

where \(\lambda_1,\dots,\lambda_n\) and \(\boldsymbol u_1,\dots,\boldsymbol u_n\) are the eigenvalues and eigenvectors, respectively, of \(\boldsymbol \Sigma\). Let’s consider the term inside the exponential:

\[\begin{equation*} \begin{aligned} (\boldsymbol x - \boldsymbol \mu)^\top \boldsymbol \Sigma^{-1}(\boldsymbol x - \boldsymbol \mu) = \sum^n_{i=1}\frac{(\boldsymbol x - \boldsymbol \mu)^\top \boldsymbol u_i\boldsymbol u_i^\top (\boldsymbol x - \boldsymbol \mu)}{\lambda_i} = \sum^n_{i=1}\frac{y_i^2}{\lambda_i} \end{aligned} \end{equation*}\]

where we have \(y_i = (\boldsymbol x - \boldsymbol \mu)^\top \boldsymbol u_i\). Consider the vector \(\boldsymbol y = (y_1,y_2,\dots,y_n)^\top = \boldsymbol U(\boldsymbol x - \boldsymbol \mu)\), where \(\boldsymbol U\) is the matrix whose rows are \(\boldsymbol u_i^\top\). This gives us a linear transformation of \(\boldsymbol x\), which implies the following shape of the Gaussian:

  • Ellipsoids with center \(\boldsymbol \mu\)
  • The axes are in the directions of the eigenvectors \(\boldsymbol u_i\)
  • The scaling along each direction is \(\sqrt{\lambda_i}\), where \(\lambda_i\) is the eigenvalue associated with \(\boldsymbol u_i\)
Propositions

Proposition 8 (Normalization of Multivariate Gaussian): We can show that:

\[\begin{equation*} \begin{aligned} \int \exp\left\{-\frac{1}{2}(\boldsymbol x - \boldsymbol \mu)^\top \boldsymbol \Sigma^{-1}(\boldsymbol x-\boldsymbol \mu)\right\} \dby \boldsymbol x = \sqrt{\abs{2\pi\boldsymbol \Sigma}} \end{aligned} \end{equation*}\]

Let’s consider the change of variable from \(x_i\) to \(y_i\), where \(\boldsymbol y = \boldsymbol U(\boldsymbol x - \boldsymbol \mu)\). To do this we have to find the Jacobian of the transformation, which is:

\[\begin{equation*} \begin{aligned} J_{ij} = \frac{\partial x_i}{\partial y_j} = U_{ji} \end{aligned} \end{equation*}\]

Consider its determinant, as we have: \(\abs{\boldsymbol J}^2 = \abs{\boldsymbol U^\top }^2 = \abs{\boldsymbol U^\top }\abs{\boldsymbol U} = \abs{\boldsymbol U^\top \boldsymbol U} = \abs{I} = 1\), so \(\abs{\boldsymbol J} = 1\). Consider the integration (using the single-variable Gaussian integral):

\[\begin{equation*} \begin{aligned} \int \exp\left\{-\frac{1}{2}(\boldsymbol x - \boldsymbol \mu)^\top \boldsymbol \Sigma^{-1}(\boldsymbol x-\boldsymbol \mu)\right\} \dby \boldsymbol x &= \int \exp\brackc{\sum^n_{i=1}-\frac{y_i^2}{2\lambda_i}} |\boldsymbol J| \dby \boldsymbol y \\ &= \int \prod^n_{i=1}\exp\brackc{-\frac{y_i^2}{2\lambda_i}} \dby \boldsymbol y \\ &= \prod^n_{i=1}\int \exp\brackc{-\frac{y_i^2}{2\lambda_i}} \dby y_i \\ &= \prod^n_{i=1}\sqrt{2\pi\lambda_i} = \sqrt{\abs{2\pi\boldsymbol \Sigma}} \end{aligned} \end{equation*}\]

Note that we have shown that the determinant is the product of the eigenvalues. Thus the proof is complete.

Proof

More Useful Backgrounds

Propositions

Proposition 9 (Inverse of a Partitioned Matrix): The block matrix can be inverted as:

\[\begin{equation*} \require{color} \definecolor{red}{RGB}{244, 67, 54} \definecolor{pink}{RGB}{233, 30, 99} \definecolor{purple}{RGB}{103, 58, 183} \definecolor{yellow}{RGB}{255, 193, 7} \definecolor{grey}{RGB}{96, 125, 139} \definecolor{blue}{RGB}{33, 150, 243} \definecolor{green}{RGB}{0, 150, 136} \begin{bmatrix} \boldsymbol A & \boldsymbol B \\ \boldsymbol C & \boldsymbol D \\ \end{bmatrix}^{-1} = \begin{bmatrix} \boldsymbol M & -\boldsymbol M\boldsymbol B\boldsymbol D^{-1} \\ -\boldsymbol D^{-1}\boldsymbol C\boldsymbol M & \boldsymbol D^{-1}+ \boldsymbol D^{-1}\boldsymbol C\boldsymbol M\boldsymbol B\boldsymbol D^{-1} \\ \end{bmatrix} \end{equation*}\]

where we set \(\boldsymbol M = (\boldsymbol A-\boldsymbol B\boldsymbol D^{-1}\boldsymbol C)^{-1}\)
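
A quick numerical check of Proposition 9 with numpy; for convenience the blocks are cut out of a symmetric positive definite matrix so that \(\boldsymbol D\) and the Schur complement are guaranteed to be invertible (an assumption of this sketch, not of the identity itself):

```python
import numpy as np

rng = np.random.default_rng(3)
S = rng.standard_normal((5, 5))
full = S @ S.T + 5 * np.eye(5)          # symmetric positive definite block matrix
A, B = full[:2, :2], full[:2, 2:]
C, D = full[2:, :2], full[2:, 2:]

Dinv = np.linalg.inv(D)
M = np.linalg.inv(A - B @ Dinv @ C)     # M = (A - B D⁻¹ C)⁻¹

block_inverse = np.block([
    [M,              -M @ B @ Dinv],
    [-Dinv @ C @ M,   Dinv + Dinv @ C @ M @ B @ Dinv],
])
assert np.allclose(block_inverse, np.linalg.inv(full))
```
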

Propositions

Proposition 10 (Inverse Matrix Identity): We can show that

\[\begin{equation*} (\boldsymbol P^{-1} + \boldsymbol B^\top \boldsymbol R^{-1}\boldsymbol B)^{-1}\boldsymbol B^\top \boldsymbol R^{-1} = \boldsymbol P\boldsymbol B^\top (\boldsymbol B\boldsymbol P\boldsymbol B^\top + \boldsymbol R)^{-1} \end{equation*}\]

This can be proven by right-multiplying the left-hand side by \((\boldsymbol B\boldsymbol P\boldsymbol B^\top + \boldsymbol R)\):

\[\begin{equation*} \begin{aligned} (\boldsymbol P^{-1} + &\boldsymbol B^\top \boldsymbol R^{-1}\boldsymbol B)^{-1}\boldsymbol B^\top \boldsymbol R^{-1}(\boldsymbol B\boldsymbol P\boldsymbol B^\top + \boldsymbol R) \\ &= (\boldsymbol P^{-1} + \boldsymbol B^\top \boldsymbol R^{-1}\boldsymbol B)^{-1}\boldsymbol B^\top \boldsymbol R^{-1}\boldsymbol B\boldsymbol P\boldsymbol B^\top + (\boldsymbol P^{-1} + \boldsymbol B^\top \boldsymbol R^{-1}\boldsymbol B)^{-1}\boldsymbol B^\top \boldsymbol R^{-1}\boldsymbol R \\ &= (\boldsymbol P^{-1} + \boldsymbol B^\top \boldsymbol R^{-1}\boldsymbol B)^{-1}\boldsymbol B^\top \boldsymbol R^{-1}\boldsymbol B\boldsymbol P\boldsymbol B^\top + (\boldsymbol P^{-1} + \boldsymbol B^\top \boldsymbol R^{-1}\boldsymbol B)^{-1}\boldsymbol B^\top \\ &= (\boldsymbol P^{-1} + \boldsymbol B^\top \boldsymbol R^{-1}\boldsymbol B)^{-1}\Big[\boldsymbol B^\top \boldsymbol R^{-1}\boldsymbol B\boldsymbol P + \boldsymbol P^{-1} \boldsymbol P\Big]\boldsymbol B^\top \\ &= (\boldsymbol P^{-1} + \boldsymbol B^\top \boldsymbol R^{-1}\boldsymbol B)^{-1}\Big[\boldsymbol B^\top \boldsymbol R^{-1}\boldsymbol B+ \boldsymbol P^{-1} \Big]\boldsymbol P\boldsymbol B^\top \\ &= \boldsymbol P\boldsymbol B^\top \\ \end{aligned} \end{equation*}\]

Proof
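
A minimal numerical check of Proposition 10; here \(\boldsymbol P\) and \(\boldsymbol R\) are taken to be symmetric positive definite (as they would be if they were covariances) so that all required inverses exist:

```python
import numpy as np

def spd(k, rng):
    # Random symmetric positive definite (hence invertible) matrix.
    S = rng.standard_normal((k, k))
    return S @ S.T + k * np.eye(k)

rng = np.random.default_rng(4)
P, R = spd(3, rng), spd(2, rng)
B = rng.standard_normal((2, 3))

lhs = np.linalg.inv(np.linalg.inv(P) + B.T @ np.linalg.inv(R) @ B) @ B.T @ np.linalg.inv(R)
rhs = P @ B.T @ np.linalg.inv(B @ P @ B.T + R)
assert np.allclose(lhs, rhs)
```
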
Propositions

Proposition 11 (Woodbury Identity): We can show that

\[\begin{equation*} (\boldsymbol A + \boldsymbol B\boldsymbol D^{-1}\boldsymbol C)^{-1} = \boldsymbol A^{-1} - \boldsymbol A^{-1}\boldsymbol B(\boldsymbol D + \boldsymbol C\boldsymbol A^{-1}\boldsymbol B)^{-1}\boldsymbol C\boldsymbol A^{-1} \end{equation*}\]

This can be proven by right-multiplying the right-hand side by \((\boldsymbol A + \boldsymbol B\boldsymbol D^{-1}\boldsymbol C)\):

\[\begin{equation*} \begin{aligned} \Big[ \boldsymbol A^{-1} &- \boldsymbol A^{-1}\boldsymbol B(\boldsymbol D + \boldsymbol C\boldsymbol A^{-1}\boldsymbol B)^{-1}\boldsymbol C\boldsymbol A^{-1} \Big](\boldsymbol A + \boldsymbol B\boldsymbol D^{-1}\boldsymbol C) \\ &= \boldsymbol A^{-1}(\boldsymbol A + \boldsymbol B\boldsymbol D^{-1}\boldsymbol C) - \boldsymbol A^{-1}\boldsymbol B(\boldsymbol D + \boldsymbol C\boldsymbol A^{-1}\boldsymbol B)^{-1}\boldsymbol C\boldsymbol A^{-1}(\boldsymbol A + \boldsymbol B\boldsymbol D^{-1}\boldsymbol C) \\ &= \begin{aligned}[t] \boldsymbol A^{-1}\boldsymbol A + \boldsymbol A^{-1}\boldsymbol B\boldsymbol D^{-1}\boldsymbol C &- \boldsymbol A^{-1}\boldsymbol B(\boldsymbol D + \boldsymbol C\boldsymbol A^{-1}\boldsymbol B)^{-1}\boldsymbol C\boldsymbol A^{-1}\boldsymbol A \\ &- \boldsymbol A^{-1}\boldsymbol B(\boldsymbol D + \boldsymbol C\boldsymbol A^{-1}\boldsymbol B)^{-1}\boldsymbol C\boldsymbol A^{-1}\boldsymbol B\boldsymbol D^{-1}\boldsymbol C \\ \end{aligned} \\ &= \begin{aligned}[t] \boldsymbol I + \boldsymbol A^{-1}\boldsymbol B\boldsymbol D^{-1}\boldsymbol C &- \boldsymbol A^{-1}\boldsymbol B(\boldsymbol D + \boldsymbol C\boldsymbol A^{-1}\boldsymbol B)^{-1}\boldsymbol C \\ &- \boldsymbol A^{-1}\boldsymbol B(\boldsymbol D + \boldsymbol C\boldsymbol A^{-1}\boldsymbol B)^{-1}\boldsymbol C\boldsymbol A^{-1}\boldsymbol B\boldsymbol D^{-1}\boldsymbol C \\ \end{aligned} \\ &= \begin{aligned}[t] \boldsymbol I + \boldsymbol A^{-1}\boldsymbol B\boldsymbol D^{-1}\boldsymbol C &- \boldsymbol A^{-1}\boldsymbol B\Big[(\boldsymbol D + \boldsymbol C\boldsymbol A^{-1}\boldsymbol B)^{-1}\boldsymbol D + (\boldsymbol D + \boldsymbol C\boldsymbol A^{-1}\boldsymbol B)^{-1}\boldsymbol C\boldsymbol A^{-1}\boldsymbol B\Big]\boldsymbol D^{-1}\boldsymbol C \\ \end{aligned} \\ &= \begin{aligned}[t] \boldsymbol I + \boldsymbol A^{-1}\boldsymbol B\boldsymbol D^{-1}\boldsymbol C &- \boldsymbol A^{-1}\boldsymbol B\Big[(\boldsymbol D + \boldsymbol C\boldsymbol A^{-1}\boldsymbol B)^{-1}(\boldsymbol D + \boldsymbol C\boldsymbol A^{-1}\boldsymbol B)\Big]\boldsymbol D^{-1}\boldsymbol C \\ \end{aligned} \\ &= \begin{aligned}[t] \boldsymbol I + \boldsymbol A^{-1}\boldsymbol B\boldsymbol D^{-1}\boldsymbol C &- \boldsymbol A^{-1}\boldsymbol B\boldsymbol D^{-1}\boldsymbol C \\ \end{aligned} = \boldsymbol I \end{aligned} \end{equation*}\]

Proof
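
A small numerical check of the Woodbury identity; in this sketch \(\boldsymbol A\) and \(\boldsymbol D\) are taken symmetric positive definite and \(\boldsymbol C = \boldsymbol B^\top\), a convenient special case that keeps every inverse well defined:

```python
import numpy as np

def spd(k, rng):
    # Random symmetric positive definite (hence invertible) matrix.
    S = rng.standard_normal((k, k))
    return S @ S.T + k * np.eye(k)

rng = np.random.default_rng(5)
A, D = spd(4, rng), spd(2, rng)
B = rng.standard_normal((4, 2))
C = B.T                       # convenient choice so every inverse below exists

Ainv = np.linalg.inv(A)
lhs = np.linalg.inv(A + B @ np.linalg.inv(D) @ C)
rhs = Ainv - Ainv @ B @ np.linalg.inv(D + C @ Ainv @ B) @ C @ Ainv
assert np.allclose(lhs, rhs)
```
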

Conditional & Marginalisation

Remark

Remark 6. (Settings): We consider the setting in which the random variables are partitioned as:

\[\begin{equation*} \begin{bmatrix} \boldsymbol x_a \\ \boldsymbol x_b \end{bmatrix} \sim \mathcal{N}\bracka{ \begin{bmatrix} \boldsymbol \mu_a \\ \boldsymbol \mu_b \end{bmatrix}, \begin{bmatrix} \boldsymbol \Sigma_{aa} & \boldsymbol \Sigma_{ab} \\ \boldsymbol \Sigma_{ba} & \boldsymbol \Sigma_{bb} \\ \end{bmatrix}} \end{equation*}\]

Due to the symmetry of the covariance, we have \(\boldsymbol \Sigma_{ab} = \boldsymbol \Sigma_{ba}^\top\). Furthermore, we denote

\[\begin{equation*} \boldsymbol \Sigma^{-1} = \begin{bmatrix} \boldsymbol \Sigma_{aa} & \boldsymbol \Sigma_{ab} \\ \boldsymbol \Sigma_{ba} & \boldsymbol \Sigma_{bb} \\ \end{bmatrix}^{-1} = \begin{bmatrix} \boldsymbol \Lambda_{aa} & \boldsymbol \Lambda_{ab} \\ \boldsymbol \Lambda_{ba} & \boldsymbol \Lambda_{bb} \\ \end{bmatrix} = \boldsymbol \Lambda \end{equation*}\]

where \(\boldsymbol \Lambda\) is called the precision matrix. One can use the inverse of a partitioned matrix (Proposition 9) to express the blocks of \(\boldsymbol \Lambda\) in terms of those of \(\boldsymbol \Sigma\) (this will be useful afterward); in particular, note that \(\boldsymbol \Lambda_{aa} \ne \boldsymbol \Sigma^{-1}_{aa}\), and so on.

Remark

Remark 7. (Completing the Square): To find the conditional and the marginalisation (together with other kinds of Gaussian manipulation), we rely on a method called completing the square. Let’s consider the quadratic form expansion:

\[\begin{equation*} \begin{aligned} -\frac{1}{2}&(\boldsymbol x - \boldsymbol \mu)^\top \boldsymbol \Sigma^{-1}(\boldsymbol x-\boldsymbol \mu) \\ &= {\color{blue}{-\frac{1}{2}\boldsymbol x^\top \boldsymbol \Sigma^{-1}\boldsymbol x + \boldsymbol \mu^\top \boldsymbol \Sigma^{-1}\boldsymbol x}} -\frac{1}{2} \boldsymbol \mu^\top \boldsymbol \Sigma^{-1}\boldsymbol \mu \\ \end{aligned} \end{equation*}\]

Note that the blue terms are the terms that depend on \(\boldsymbol x\). This means that, to identify the Gaussian, we only have to find the first- and second-order terms in \(\boldsymbol x\). Alternatively, we can consider the individual elements of the partitioned random variable:

\[ \begin{aligned} -\frac{1}{2}&(\boldsymbol x - \boldsymbol \mu)^\top \boldsymbol \Sigma^{-1}(\boldsymbol x-\boldsymbol \mu) \\ &= \begin{aligned}[t] {\color{green}{-\frac{1}{2}(\boldsymbol x_a - \boldsymbol \mu_a)^\top \boldsymbol \Lambda_{aa}(\boldsymbol x_a - \boldsymbol \mu_a)}}{\color{yellow}{ - \frac{1}{2}(\boldsymbol x_a - \boldsymbol \mu_a)^\top \boldsymbol \Lambda_{ab}(\boldsymbol x_b - \boldsymbol \mu_b)}} \\ {\color{purple}{-\frac{1}{2}(\boldsymbol x_b - \boldsymbol \mu_b)^\top \boldsymbol \Lambda_{ba}(\boldsymbol x_a - \boldsymbol \mu_a)}}{\color{grey}{ - \frac{1}{2}(\boldsymbol x_b - \boldsymbol \mu_b)^\top \boldsymbol \Lambda_{bb}(\boldsymbol x_b - \boldsymbol \mu_b)}} \\ \end{aligned} \\ &= \begin{aligned}[t] &{\color{green} -\frac{1}{2}\boldsymbol x^\top _a\boldsymbol \Lambda_{aa}\boldsymbol x_a + \boldsymbol \mu_a^\top \boldsymbol \Lambda_{aa}\boldsymbol x_a} -\frac{1}{2} \boldsymbol \mu_a^\top \boldsymbol \Lambda_{aa}\boldsymbol \mu_a {\color{yellow} -\frac{1}{2}\boldsymbol x_a^\top \boldsymbol \Lambda_{ab}\boldsymbol x_b + \frac{1}{2} \boldsymbol x_a^\top \boldsymbol \Lambda_{ab}\boldsymbol \mu_b} \\ &{\color{yellow}+\frac{1}{2}\boldsymbol \mu_a^\top \boldsymbol \Lambda_{ab}\boldsymbol x_b} - \frac{1}{2}\boldsymbol \mu_a^\top \boldsymbol \Lambda_{ab}\boldsymbol \mu_b {\color{purple} -\frac{1}{2}\boldsymbol x_b^\top \boldsymbol \Lambda_{ba}\boldsymbol x_a + \frac{1}{2}\boldsymbol \mu_b^\top \boldsymbol \Lambda_{ba}\boldsymbol x_a + \frac{1}{2}\boldsymbol x_b^\top \boldsymbol \Lambda_{ba} \boldsymbol \mu_a} \\ &- \frac{1}{2}\boldsymbol \mu_b^\top \boldsymbol \Lambda_{ba}\boldsymbol \mu_a {\color{grey} -\frac{1}{2}\boldsymbol x_b^\top \boldsymbol \Lambda_{bb}\boldsymbol x_b + \boldsymbol \mu_b^\top \boldsymbol \Lambda_{bb}\boldsymbol x_b} -\frac{1}{2}\boldsymbol \mu_b^\top \boldsymbol \Lambda_{bb}\boldsymbol \mu_b \end{aligned} \\ \end{aligned} \tag{1}\]

Completing the square means matching this pattern against our formula in order to read off the new Gaussian distribution. There are \(2\) ways to complete the square, depending on the scenario:

  • When we want to find the Gaussian in a different form (but still a Gaussian), i.e. the conditional
  • When we want to marginalise some variables out, or when we have to find the true form of the distribution without relying on knowing the final form (or when we are not really sure about the final form), i.e. marginalisation and the posterior

Let’s just show how it works with examples.

Propositions

Proposition 12 (Conditional): Consider the following Gaussian

\[\begin{equation*} \begin{bmatrix} \boldsymbol x_a \\ \boldsymbol x_b \end{bmatrix} \sim p(\boldsymbol x_a, \boldsymbol x_b) = \mathcal{N}\bracka{ \begin{bmatrix} \boldsymbol \mu_a \\ \boldsymbol \mu_b \end{bmatrix}, \begin{bmatrix} \boldsymbol \Sigma_{aa} & \boldsymbol \Sigma_{ab} \\ \boldsymbol \Sigma_{ba} & \boldsymbol \Sigma_{bb} \\ \end{bmatrix}} \end{equation*}\]

We can show that: \(p(\boldsymbol x_a | \boldsymbol x_b) = \mathcal{N}(\boldsymbol x_a | \boldsymbol \mu_{a|b}, \boldsymbol \Lambda^{-1}_{aa})\), where we have

  • \(\boldsymbol \mu_{a\lvert b} = \boldsymbol \mu_a - \boldsymbol \Lambda_{aa}^{-1}\boldsymbol \Lambda_{ab}(\boldsymbol x_b - \boldsymbol \mu_b)\)
  • Or, we can set \(\boldsymbol K = \boldsymbol \Sigma_{ab}\boldsymbol \Sigma_{bb}^{-1}\), where we have

\[\begin{equation*} \begin{aligned} &\boldsymbol \mu_{a|b} = \boldsymbol \mu_a + \boldsymbol K(\boldsymbol x_b - \boldsymbol \mu_b) \qquad \begin{aligned}[t] \boldsymbol \Sigma_{a|b} &= \boldsymbol \Sigma_{aa} - \boldsymbol K\boldsymbol \Sigma_{bb}\boldsymbol K^\top \\ &= \boldsymbol \Sigma_{aa} - \boldsymbol \Sigma_{ab}\boldsymbol \Sigma_{bb}^{-1}\boldsymbol \Sigma_{ba} \\ \end{aligned} \end{aligned} \end{equation*}\]

Follows from Proposition 9 (you can try plugging the results in).

Proof: We consider the expansion of the conditional distribution, as we only gather all the terms with \(\boldsymbol x_a\)

\[\begin{equation*} \begin{aligned} -\frac{1}{2}(\boldsymbol x_a - \boldsymbol \mu_{a|b})^\top \boldsymbol \Sigma_{a|b}^{-1}(\boldsymbol x_a - \boldsymbol \mu_{a|b}) = {\color{red}-\frac{1}{2}\boldsymbol x_a^\top \boldsymbol \Sigma_{a|b}^{-1}\boldsymbol x_a} + {\color{blue}\boldsymbol x_a^\top \boldsymbol \Sigma_{a|b}^{-1}\boldsymbol \mu_{a|b}} + \text{const} \end{aligned} \end{equation*}\]

Let’s consider the values, which should be equal to those in Equation 1, as we can see that:

  • The red term: we set \(\boldsymbol \Sigma_{a\lvert b}^{-1} = \boldsymbol \Lambda_{aa}\) (consider the first green term of Equation 1)
  • The blue term: we will have to consider the terms of the form \((\dots)^\top \boldsymbol x_a\). We have:

\[\begin{equation*} \begin{aligned} {\color{green} \boldsymbol \mu_{a}^\top \boldsymbol \Lambda_{aa}\boldsymbol x_a} &{\color{yellow} - \frac{1}{2}\boldsymbol x_a^\top \boldsymbol \Lambda_{ab}\boldsymbol x_b + \frac{1}{2} \boldsymbol x_a^\top \boldsymbol \Lambda_{ab}\boldsymbol \mu_b} {\color{purple} - \frac{1}{2}\boldsymbol x_b^\top \boldsymbol \Lambda_{ba}\boldsymbol x_a + \frac{1}{2}\boldsymbol \mu_b^\top \boldsymbol \Lambda_{ba}\boldsymbol x_a} \\ &= \boldsymbol x_a^\top \Big[ \boldsymbol \Lambda_{aa}\boldsymbol \mu_a - \boldsymbol \Lambda_{ab}\boldsymbol x_b + \boldsymbol \Lambda_{ab}\boldsymbol \mu_b \Big] = \boldsymbol x_a^\top \Big[ \boldsymbol \Lambda_{aa}\boldsymbol \mu_a - \boldsymbol \Lambda_{ab}(\boldsymbol x_b - \boldsymbol \mu_b) \Big] \\ \end{aligned} \end{equation*}\]

Please note that \(\boldsymbol \Lambda_{ab}^\top = \boldsymbol \Lambda_{ba}\). Now, let’s do “pattern” matching, as we have:

\[\begin{equation*} \begin{aligned} \boldsymbol x_a^\top &\boldsymbol \Lambda_{aa}\boldsymbol \mu_{a|b} = \boldsymbol x_a^\top \Big[ \boldsymbol \Lambda_{aa}\boldsymbol \mu_a - \boldsymbol \Lambda_{ab}(\boldsymbol x_b - \boldsymbol \mu_b) \Big] \\ \implies&\boldsymbol \mu_{a|b} \begin{aligned}[t] &= \boldsymbol \Lambda_{aa}^{-1}\boldsymbol \Lambda_{aa}\boldsymbol \mu_a - \boldsymbol \Lambda_{aa}^{-1}\boldsymbol \Lambda_{ab}(\boldsymbol x_b - \boldsymbol \mu_b) \\ &= \boldsymbol \mu_a - \boldsymbol \Lambda_{aa}^{-1}\boldsymbol \Lambda_{ab}(\boldsymbol x_b - \boldsymbol \mu_b) \\ \end{aligned} \end{aligned} \end{equation*}\]

Thus the proof is complete.

Proof
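
A numerical sanity check of Proposition 12 with numpy, comparing the covariance form (via \(\boldsymbol K\)) with the precision form (via \(\boldsymbol \Lambda_{aa}\)); the joint covariance, mean, and conditioning value below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(6)
S = rng.standard_normal((4, 4))
Sigma = S @ S.T + 4 * np.eye(4)        # joint covariance (symmetric positive definite)
mu = rng.standard_normal(4)
a, b = slice(0, 2), slice(2, 4)        # partition into x_a (first 2) and x_b (last 2)
x_b = rng.standard_normal(2)           # arbitrary conditioning value

# Covariance form: K = Σ_ab Σ_bb⁻¹
K = Sigma[a, b] @ np.linalg.inv(Sigma[b, b])
mu_cond = mu[a] + K @ (x_b - mu[b])
Sigma_cond = Sigma[a, a] - K @ Sigma[b, b] @ K.T

# Precision form: Σ_{a|b} = Λ_aa⁻¹ and μ_{a|b} = μ_a - Λ_aa⁻¹ Λ_ab (x_b - μ_b)
Lam = np.linalg.inv(Sigma)
assert np.allclose(Sigma_cond, np.linalg.inv(Lam[a, a]))
assert np.allclose(mu_cond, mu[a] - np.linalg.inv(Lam[a, a]) @ Lam[a, b] @ (x_b - mu[b]))
```
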
Propositions

Proposition 13 (Marginalisation): Consider the following Gaussian

\[\begin{equation*} \begin{bmatrix} \boldsymbol x_a \\ \boldsymbol x_b \end{bmatrix} \sim p(\boldsymbol x_a, \boldsymbol x_b) = \mathcal{N}\bracka{ \begin{bmatrix} \boldsymbol \mu_a \\ \boldsymbol \mu_b \end{bmatrix}, \begin{bmatrix} \boldsymbol \Sigma_{aa} & \boldsymbol \Sigma_{ab} \\ \boldsymbol \Sigma_{ba} & \boldsymbol \Sigma_{bb} \\ \end{bmatrix}} \end{equation*}\]

We can show that: \(p(\boldsymbol x_a) = \mathcal{N}(\boldsymbol x_a | \boldsymbol \mu_{a}, \boldsymbol \Sigma_{aa})\)

We collect the terms that contain \(\boldsymbol x_b\) so that we can integrate them out, as we have:

\[\begin{equation*} \begin{aligned} {\color{grey} -\frac{1}{2}\boldsymbol x_b^\top \boldsymbol \Lambda_{bb}\boldsymbol x_b} &{\color{grey} +} {\color{grey}\boldsymbol \mu_b^\top \boldsymbol \Lambda_{bb}\boldsymbol x_b} {\color{purple} -\frac{1}{2}\boldsymbol x_b^\top \boldsymbol \Lambda_{ba}\boldsymbol x_a + \frac{1}{2}\boldsymbol x_b^\top \boldsymbol \Lambda_{ba} \boldsymbol \mu_a} {\color{yellow} -\frac{1}{2}\boldsymbol x_a^\top \boldsymbol \Lambda_{ab}\boldsymbol x_b+\frac{1}{2}\boldsymbol \mu_a^\top \boldsymbol \Lambda_{ab}\boldsymbol x_b} \\ &= -\frac{1}{2}\boldsymbol x_b^\top \boldsymbol \Lambda_{bb}\boldsymbol x_b + \boldsymbol \mu_b^\top \boldsymbol \Lambda_{bb}\boldsymbol x_b -\boldsymbol x_b^\top \boldsymbol \Lambda_{ba}\boldsymbol x_a + \boldsymbol x_b^\top \boldsymbol \Lambda_{ba} \boldsymbol \mu_a \\ &= -\frac{1}{2}\Big[\boldsymbol x_b^\top \boldsymbol \Lambda_{bb}\boldsymbol x_b - 2\boldsymbol x_b^\top \boldsymbol \Lambda_{bb}\boldsymbol \Lambda_{bb}^{-1}\underbrace{\Big(\boldsymbol \Lambda_{bb}\boldsymbol \mu_b -\boldsymbol \Lambda_{ba}(\boldsymbol x_a - \boldsymbol \mu_a)\Big)}_{\boldsymbol m}\Big] \\ &= -\frac{1}{2}\Big[\boldsymbol x_b^\top \boldsymbol \Lambda_{bb}\boldsymbol x_b - 2\boldsymbol x_b^\top \boldsymbol \Lambda_{bb}\boldsymbol \Lambda_{bb}^{-1}\boldsymbol m + (\boldsymbol \Lambda_{bb}^{-1}\boldsymbol m)^\top \boldsymbol \Lambda_{bb}(\boldsymbol \Lambda_{bb}^{-1}\boldsymbol m) \Big] + \frac{1}{2}(\boldsymbol \Lambda_{bb}^{-1}\boldsymbol m)^\top \boldsymbol \Lambda_{bb}(\boldsymbol \Lambda_{bb}^{-1}\boldsymbol m) \\ &= -\frac{1}{2}(\boldsymbol x_b - \boldsymbol \Lambda_{bb}^{-1}\boldsymbol m)^\top \boldsymbol \Lambda_{bb}(\boldsymbol x_b - \boldsymbol \Lambda_{bb}^{-1}\boldsymbol m) + {\color{blue}\frac{1}{2}\boldsymbol m^\top \boldsymbol \Lambda_{bb}^{-1}\boldsymbol m} \\ \end{aligned} \end{equation*}\]

So we can integrate out this quantity using the Gaussian integral; it contributes only a constant that does not depend on \(\boldsymbol x_a\):

\[\begin{equation*} \int \exp\brackc{-\frac{1}{2}(\boldsymbol x_b - \boldsymbol \Lambda_{bb}^{-1}\boldsymbol m)^\top \boldsymbol \Lambda_{bb}(\boldsymbol x_b - \boldsymbol \Lambda_{bb}^{-1}\boldsymbol m)}\dby \boldsymbol x_b \end{equation*}\]

Now, we consider the other terms that don’t depend on \(\boldsymbol x_b\), i.e. all terms that depend on \(\boldsymbol x_a\), together with the blue term that was left over.

\[\begin{equation*} \begin{aligned} {\color{blue}\frac{1}{2}\boldsymbol m^\top \boldsymbol \Lambda_{bb}^{-1}\boldsymbol m} &{\color{green} -}{\color{green} \frac{1}{2}\boldsymbol x^\top _a\boldsymbol \Lambda_{aa}\boldsymbol x_a + \boldsymbol \mu_a^\top \boldsymbol \Lambda_{aa}\boldsymbol x_a} {\color{yellow} + \frac{1}{2} \boldsymbol x_a^\top \boldsymbol \Lambda_{ab}\boldsymbol \mu_b}{\color{purple}+ \frac{1}{2}\boldsymbol \mu_b^\top \boldsymbol \Lambda_{ba}\boldsymbol x_a} \\ &= \begin{aligned}[t] \frac{1}{2}\Big(&\boldsymbol \Lambda_{bb}\boldsymbol \mu_b -\boldsymbol \Lambda_{ba}(\boldsymbol x_a - \boldsymbol \mu_a)\Big)^\top \boldsymbol \Lambda_{bb}^{-1}\Big(\boldsymbol \Lambda_{bb}\boldsymbol \mu_b -\boldsymbol \Lambda_{ba}(\boldsymbol x_a - \boldsymbol \mu_a)\Big) \\ &{\color{green} - \frac{1}{2}\boldsymbol x^\top _a\boldsymbol \Lambda_{aa}\boldsymbol x_a + \boldsymbol \mu_a^\top \boldsymbol \Lambda_{aa}\boldsymbol x_a} + \boldsymbol x_a^\top \boldsymbol \Lambda_{ab}\boldsymbol \mu_b \end{aligned} \\ &= \begin{aligned}[t] \frac{1}{2}\Big[ \boldsymbol \mu_b^\top \boldsymbol \Lambda_{bb}\boldsymbol \mu_b &- (\boldsymbol x_a - \boldsymbol \mu_a)^\top \boldsymbol \Lambda_{ba}^\top \boldsymbol \mu_b \\ &- \boldsymbol \mu_b^\top \boldsymbol \Lambda_{ba}(\boldsymbol x_a - \boldsymbol \mu_a) + (\boldsymbol x_a - \boldsymbol \mu_a)^\top \boldsymbol \Lambda_{ba}^\top \boldsymbol \Lambda_{bb}^{-1}\boldsymbol \Lambda_{ba}(\boldsymbol x_a - \boldsymbol \mu_a) \Big] \\ &{\color{green} - \frac{1}{2}\boldsymbol x^\top _a\boldsymbol \Lambda_{aa}\boldsymbol x_a + \boldsymbol \mu_a^\top \boldsymbol \Lambda_{aa}\boldsymbol x_a} + \boldsymbol x_a^\top \boldsymbol \Lambda_{ab}\boldsymbol \mu_b \end{aligned} \\ &= \begin{aligned}[t] \frac{1}{2}\Big[ \boldsymbol \mu_b^\top \boldsymbol \Lambda_{bb}\boldsymbol \mu_b &- \boldsymbol x_a^\top \boldsymbol \Lambda_{ba}^\top \boldsymbol \mu_b + \boldsymbol \mu_a^\top \boldsymbol \Lambda_{ba}^\top \boldsymbol \mu_b - \boldsymbol \mu_b^\top \boldsymbol \Lambda_{ba}\boldsymbol x_a + \boldsymbol \mu_b^\top \boldsymbol \Lambda_{ba}\boldsymbol \mu_a \\ &+ \boldsymbol x_a^\top \boldsymbol \Lambda_{ba}^\top \boldsymbol \Lambda_{bb}^{-1}\boldsymbol \Lambda_{ba}\boldsymbol x_a - \boldsymbol x_a^\top \boldsymbol \Lambda_{ba}^\top \boldsymbol \Lambda_{bb}^{-1}\boldsymbol \Lambda_{ba}\boldsymbol\mu_a - \boldsymbol \mu_a^\top \boldsymbol \Lambda_{ba}^\top \boldsymbol \Lambda_{bb}^{-1}\boldsymbol \Lambda_{ba}\boldsymbol x_a \\ &+ \boldsymbol \mu_a^\top \boldsymbol \Lambda_{ba}^\top \boldsymbol \Lambda_{bb}^{-1}\boldsymbol \Lambda_{ba}\boldsymbol \mu_a \Big] {\color{green} - \frac{1}{2}\boldsymbol x^\top _a\boldsymbol \Lambda_{aa}\boldsymbol x_a + \boldsymbol \mu_a^\top \boldsymbol \Lambda_{aa}\boldsymbol x_a} + \boldsymbol x_a^\top \boldsymbol \Lambda_{ab}\boldsymbol \mu_b \end{aligned} \\ &= \begin{aligned}[t] \frac{1}{2}\Big[-2\boldsymbol x_a^\top \boldsymbol \Lambda_{ba}^\top \boldsymbol \mu_b &+ \boldsymbol x_a^\top \boldsymbol \Lambda_{ba}^\top \boldsymbol \Lambda_{bb}^{-1}\boldsymbol \Lambda_{ba}\boldsymbol x_a - 2\boldsymbol x_a^\top \boldsymbol \Lambda_{ba}^\top \boldsymbol \Lambda_{bb}^{-1}\boldsymbol \Lambda_{ba}\boldsymbol\mu_a\Big] \\ &{\color{green} - \frac{1}{2}\boldsymbol x^\top _a\boldsymbol \Lambda_{aa}\boldsymbol x_a + \boldsymbol \mu_a^\top \boldsymbol \Lambda_{aa}\boldsymbol x_a} + \boldsymbol x_a^\top \boldsymbol \Lambda_{ab}\boldsymbol \mu_b + \text{const} \end{aligned} \\ &= \begin{aligned}[t] -\frac{1}{2}\boldsymbol x_a^\top \Big[\boldsymbol \Lambda_{aa} - 
\boldsymbol \Lambda_{ba}^\top \boldsymbol \Lambda_{bb}^{-1}\boldsymbol \Lambda_{ba}\Big]\boldsymbol x_a &+ \boldsymbol x_a^\top \Big[\boldsymbol \Lambda_{aa} - \boldsymbol \Lambda_{ba}^\top \boldsymbol \Lambda_{bb}^{-1}\boldsymbol \Lambda_{ba}\Big]\boldsymbol\mu_a + \text{const} \end{aligned} \\ \end{aligned} \end{equation*}\]

If we compare this form to the usual quadratic expansion of a Gaussian, we can read off that the mean of the marginalised Gaussian is \(\boldsymbol \mu_a\), while the covariance is \((\boldsymbol \Lambda_{aa} - \boldsymbol \Lambda_{ba}^\top \boldsymbol \Lambda_{bb}^{-1}\boldsymbol \Lambda_{ba})^{-1}\). If we compare this to the inverse of a partitioned matrix (Proposition 9), we can see that this is equal to \(\boldsymbol \Sigma_{aa}\). This completes the proof.

Proof
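
The key identity used at the end of this proof, \((\boldsymbol \Lambda_{aa} - \boldsymbol \Lambda_{ba}^\top \boldsymbol \Lambda_{bb}^{-1}\boldsymbol \Lambda_{ba})^{-1} = \boldsymbol \Sigma_{aa}\), can be checked numerically with a few lines of numpy (the joint covariance and partition sizes below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(7)
S = rng.standard_normal((5, 5))
Sigma = S @ S.T + 5 * np.eye(5)        # joint covariance (symmetric positive definite)
a, b = slice(0, 2), slice(2, 5)        # partition into x_a and x_b

Lam = np.linalg.inv(Sigma)             # precision matrix
# The covariance read off from the completed square is (Λ_aa - Λ_ab Λ_bb⁻¹ Λ_ba)⁻¹,
# which should coincide with the block Σ_aa of the joint covariance.
recovered = np.linalg.inv(Lam[a, a] - Lam[a, b] @ np.linalg.inv(Lam[b, b]) @ Lam[b, a])
assert np.allclose(recovered, Sigma[a, a])
```
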
Propositions

Proposition 14 (Linear Gaussian Model): Consider the distribution to be: \(p(\boldsymbol x) = \mathcal{N}(\boldsymbol x | {\color{yellow} \boldsymbol \mu} , {\color{blue} \boldsymbol \Lambda^{-1}})\) and \(p(\boldsymbol y | \boldsymbol x) = \mathcal{N}(\boldsymbol y | {\color{purple}\boldsymbol A}\boldsymbol x + {\color{green} \boldsymbol b}, {\color{red} \boldsymbol L^{-1}})\). We can show that the following holds:

\[\begin{equation*} \begin{aligned} p(\boldsymbol y) &= \mathcal{N}(\boldsymbol y | {\color{purple}\boldsymbol A}{\color{yellow} \boldsymbol \mu} + {\color{green} \boldsymbol b}, {\color{red} \boldsymbol L^{-1}} + {\color{purple}\boldsymbol A}{\color{blue} \boldsymbol \Lambda^{-1}}{\color{purple}\boldsymbol A^\top }) \\ p(\boldsymbol x | \boldsymbol y) &= \mathcal{N}\bracka{ \boldsymbol x | {\color{grey} \boldsymbol \Sigma}\brackc{ {\color{purple}\boldsymbol A^\top } {\color{red} \boldsymbol L}(\boldsymbol y-{\color{green} \boldsymbol b}) + {\color{blue} \boldsymbol \Lambda}{\color{yellow} \boldsymbol \mu}}, {\color{grey} \boldsymbol \Sigma} } \end{aligned} \end{equation*}\]

where we have \({\color{grey} \boldsymbol \Sigma} = ({\color{blue} \boldsymbol \Lambda} + {\color{purple}\boldsymbol A^\top }{\color{red} \boldsymbol L}{\color{purple}\boldsymbol A})^{-1}\)

We will consider the joint random variable \(\boldsymbol z = (\boldsymbol x, \boldsymbol y)^\top\). Let’s consider the joint distribution and the inside of the exponential:

\[\begin{equation*} \begin{aligned} -\frac{1}{2}&(\boldsymbol x - \boldsymbol \mu)^\top \boldsymbol \Lambda(\boldsymbol x - \boldsymbol \mu) - \frac{1}{2}(\boldsymbol y - \boldsymbol A\boldsymbol x - \boldsymbol b)^\top \boldsymbol L(\boldsymbol y - \boldsymbol A\boldsymbol x - \boldsymbol b) + \text{const} \\ &= \begin{aligned}[t] -\frac{1}{2}\Big[ \boldsymbol x^\top \boldsymbol \Lambda\boldsymbol x &- 2\boldsymbol \mu^\top \boldsymbol \Lambda\boldsymbol x + \boldsymbol \mu^\top \boldsymbol \Lambda\boldsymbol \mu + \boldsymbol y^\top \boldsymbol L\boldsymbol y - \boldsymbol y^\top \boldsymbol L\boldsymbol A\boldsymbol x - \boldsymbol y^\top \boldsymbol L\boldsymbol b\\ &-\boldsymbol x^\top \boldsymbol A^\top \boldsymbol L\boldsymbol y + \boldsymbol x^\top \boldsymbol A^\top \boldsymbol L \boldsymbol A\boldsymbol x + \boldsymbol x^\top \boldsymbol A^\top \boldsymbol L \boldsymbol b - \boldsymbol b^\top \boldsymbol L\boldsymbol y +\boldsymbol b^\top \boldsymbol L\boldsymbol A\boldsymbol x + \boldsymbol b^\top \boldsymbol L\boldsymbol b \Big] + \text{const} \end{aligned} \\ &= \begin{aligned}[t] -\frac{1}{2}\Big[ \boldsymbol x^\top \boldsymbol \Lambda\boldsymbol x &- 2\boldsymbol \mu^\top \boldsymbol \Lambda\boldsymbol x + \boldsymbol y^\top \boldsymbol L\boldsymbol y - \boldsymbol y^\top \boldsymbol L\boldsymbol A\boldsymbol x - \boldsymbol y^\top \boldsymbol L\boldsymbol b\\ &-\boldsymbol x^\top \boldsymbol A^\top \boldsymbol L\boldsymbol y + \boldsymbol x^\top \boldsymbol A^\top \boldsymbol L \boldsymbol A\boldsymbol x + \boldsymbol x^\top \boldsymbol A^\top \boldsymbol L \boldsymbol b - \boldsymbol b^\top \boldsymbol L\boldsymbol y +\boldsymbol b^\top \boldsymbol L\boldsymbol A\boldsymbol x \Big] + \text{const} \end{aligned} \\ &= \begin{aligned}[t] -\frac{1}{2}\Big[ \boldsymbol x^\top \Big(\boldsymbol \Lambda + \boldsymbol A^\top \boldsymbol L \boldsymbol A\Big)\boldsymbol x &+ \boldsymbol y^\top \boldsymbol L\boldsymbol y - \boldsymbol y^\top \boldsymbol L\boldsymbol A\boldsymbol x - \boldsymbol x^\top \boldsymbol A^\top \boldsymbol L\boldsymbol y\\ &+ 2\boldsymbol x^\top \boldsymbol A^\top \boldsymbol L \boldsymbol b - 2\boldsymbol \mu^\top \boldsymbol \Lambda\boldsymbol x - 2\boldsymbol y^\top \boldsymbol L\boldsymbol b \Big] + \text{const} \end{aligned} \\ &= \begin{aligned}[t] -\frac{1}{2}\Big[ \boldsymbol x^\top \Big(\boldsymbol \Lambda + \boldsymbol A^\top \boldsymbol L \boldsymbol A\Big)\boldsymbol x &+ \boldsymbol y^\top \boldsymbol L\boldsymbol y - \boldsymbol y^\top \boldsymbol L\boldsymbol A\boldsymbol x - \boldsymbol x^\top \boldsymbol A^\top \boldsymbol L\boldsymbol y\Big]\\ &- \boldsymbol x^\top \boldsymbol A^\top \boldsymbol L \boldsymbol b + \boldsymbol \mu^\top \boldsymbol \Lambda\boldsymbol x + \boldsymbol y^\top \boldsymbol L\boldsymbol b + \text{const} \end{aligned} \\ &= \begin{aligned}[t] -\frac{1}{2} \begin{pmatrix} \boldsymbol x \\ \boldsymbol y \end{pmatrix}^\top \begin{pmatrix} \boldsymbol \Lambda + \boldsymbol A^\top \boldsymbol L\boldsymbol A & -\boldsymbol A^\top \boldsymbol L \\ -\boldsymbol L\boldsymbol A & \boldsymbol L \end{pmatrix} \begin{pmatrix} \boldsymbol x \\ \boldsymbol y \end{pmatrix} + \begin{pmatrix} \boldsymbol x \\ \boldsymbol y \end{pmatrix}^\top \begin{pmatrix} \boldsymbol \Lambda\boldsymbol \mu - \boldsymbol A^\top \boldsymbol L\boldsymbol b \\ \boldsymbol L\boldsymbol b \end{pmatrix} + \text{const} \end{aligned} \\ \end{aligned} \end{equation*}\]

We can use Proposition 9 to show that:

\[\begin{equation*} \begin{aligned} \begin{pmatrix} \boldsymbol \Lambda + \boldsymbol A^\top \boldsymbol L\boldsymbol A & -\boldsymbol A^\top \boldsymbol L \\ -\boldsymbol L\boldsymbol A & \boldsymbol L \end{pmatrix}^{-1} = \begin{pmatrix} \boldsymbol \Lambda^{-1} & \boldsymbol \Lambda^{-1}\boldsymbol A^\top \\ \boldsymbol A\boldsymbol \Lambda^{-1} & \boldsymbol L^{-1} + \boldsymbol A\boldsymbol \Lambda^{-1}\boldsymbol A^\top \end{pmatrix} \end{aligned} \end{equation*}\]

Recalling the Gaussian pattern matching, we can see that the mean is equal to:

\[\begin{equation*} \begin{aligned} \begin{pmatrix} \boldsymbol \Lambda^{-1} & \boldsymbol \Lambda^{-1}\boldsymbol A^\top \\ \boldsymbol A\boldsymbol \Lambda^{-1} & \boldsymbol L^{-1} + \boldsymbol A\boldsymbol \Lambda^{-1}\boldsymbol A^\top \end{pmatrix} \begin{pmatrix} \boldsymbol \Lambda\boldsymbol \mu - \boldsymbol A^\top \boldsymbol L\boldsymbol b \\ \boldsymbol L\boldsymbol b \end{pmatrix} = \begin{pmatrix} \boldsymbol \mu \\ \boldsymbol A\boldsymbol \mu + \boldsymbol b \end{pmatrix} \end{aligned} \end{equation*}\]

Please note that, with Proposition 13 (marginalisation), it is easy to see how this leads to the result for \(p(\boldsymbol y)\). Now, for Proposition 12 (conditioning), we have the usual result that \(\boldsymbol \Sigma = \boldsymbol \Lambda_{xx}^{-1} = (\boldsymbol \Lambda + \boldsymbol A^\top \boldsymbol L\boldsymbol A)^{-1}\); for the mean:

\[\begin{equation*} \begin{aligned} \boldsymbol \mu - &(\boldsymbol \Lambda + \boldsymbol A^\top \boldsymbol L\boldsymbol A)^{-1}(\boldsymbol A^\top \boldsymbol L)(-\boldsymbol y + A\boldsymbol \mu + \boldsymbol b) \\ &= \boldsymbol \Sigma\boldsymbol A^\top \boldsymbol L(\boldsymbol y - \boldsymbol b) + \boldsymbol \mu - \boldsymbol \Sigma\boldsymbol A^\top \boldsymbol L\boldsymbol A\boldsymbol \mu \end{aligned} \end{equation*}\]

We want to show that \(\boldsymbol \mu - \boldsymbol \Sigma\boldsymbol A^\top \boldsymbol L\boldsymbol A\boldsymbol \mu = \boldsymbol \Sigma\boldsymbol \Lambda\boldsymbol \mu\).

LHS: Apply Proposition 10, where we consider the section highlighted in pink:

\[ \begin{aligned} \boldsymbol \mu &- \boldsymbol \Sigma\boldsymbol A^\top \boldsymbol L\boldsymbol A\boldsymbol \mu \\ &= \boldsymbol \mu - {\color{pink}(\boldsymbol \Lambda + \boldsymbol A^\top \boldsymbol L\boldsymbol A)^{-1}\boldsymbol A^\top \boldsymbol L}\boldsymbol A\boldsymbol \mu \\ &= \boldsymbol \mu - \boldsymbol \Lambda^{-1}\boldsymbol A^\top (\boldsymbol A\boldsymbol \Lambda^{-1}\boldsymbol A^\top + \boldsymbol L^{-1})^{-1}\boldsymbol A\boldsymbol \mu \end{aligned} \]

RHS: We consider the section highlighted in blue and use Proposition 11:

\[ \begin{aligned} \boldsymbol \Sigma\boldsymbol \Lambda\boldsymbol \mu &= {\color{blue}(\boldsymbol \Lambda + \boldsymbol A^\top \boldsymbol L\boldsymbol A)^{-1}}\boldsymbol \Lambda\boldsymbol \mu \\ &= \boldsymbol \mu - \boldsymbol \Lambda^{-1}\boldsymbol A^\top (\boldsymbol A\boldsymbol \Lambda^{-1}\boldsymbol A^\top + \boldsymbol L^{-1})^{-1}\boldsymbol A\boldsymbol \mu \end{aligned} \]

Now we have shown that both are equal, which concludes the proof.

Proof
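
As a final sanity check, the sketch below verifies numerically that the joint precision derived in this proof and the block covariance given by Proposition 9 are indeed inverses of each other (numpy assumed; dimensions, seeds, and the SPD construction are arbitrary choices of this sketch):

```python
import numpy as np

def spd(k, rng):
    # Random symmetric positive definite matrix, used here as a precision matrix.
    S = rng.standard_normal((k, k))
    return S @ S.T + k * np.eye(k)

rng = np.random.default_rng(8)
nx, ny = 3, 2
Lam, L = spd(nx, rng), spd(ny, rng)    # prior precision Λ and likelihood precision L
A = rng.standard_normal((ny, nx))

# Joint precision of z = (x, y) as derived in the proof ...
prec = np.block([[Lam + A.T @ L @ A, -A.T @ L],
                 [-L @ A,            L        ]])
# ... and the joint covariance obtained from Proposition 9.
Lami, Li = np.linalg.inv(Lam), np.linalg.inv(L)
cov = np.block([[Lami,      Lami @ A.T],
                [A @ Lami,  Li + A @ Lami @ A.T]])
assert np.allclose(np.linalg.inv(prec), cov)
# The (y, y) block is L⁻¹ + A Λ⁻¹ Aᵀ, the marginal covariance of y in Proposition 14.
```
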

References

Bishop, Christopher M. 2006. Pattern Recognition and Machine Learning. Springer.