<h1 id="deriving-canonical-correlation-analysis">Deriving canonical correlation analysis</h1>
<p>Canonical correlation analysis (CCA) is a statistical method for exploring the relationships between two sets of variables measured on the same sample. For example, suppose for a set of $n$ individuals we’ve measured $p$ variables related to health, such as blood pressure, cholesterol levels, body mass index, etc. Now suppose that for the same set of individuals we’ve also measured $q$ variables related to exercise, such as miles run per week, maximum bench press weight, etc. Given these two sets of measurements, we may want to explore how variables from the first set relate to those from the second set.</p>
<p>How might we go about this? One naive way would be to examine each possible pair of variables from the two sets of measurements. For example, we could make a scatter plot or perform linear regression on blood pressure vs. miles run, blood pressure vs. bench press, etc. However, this requires us to examine $p\cdot q$ pairs, which quickly becomes infeasible as $p$ or $q$ gets large. On the other hand, CCA summarizes these relationships into a much smaller number of statistics while preserving as much information as possible.</p>
<p>Here we’ll derive CCA mathematically as well as discuss some of its extensions.</p>
<h2 id="defining-cca">Defining CCA</h2>
<p>Suppose we have two paired datasets $X \in \mathbb{R}^{n \times p}$ and $Y \in \mathbb{R}^{n \times q}$. By paired we mean that the $i$’th rows of $X$ and $Y$ are drawn from the same sample. For simplicity, we’ll assume that all the features of our data are zero-mean.</p>
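<p>To make things concrete, here’s a minimal sketch (in NumPy, with arbitrary dimensions and a hypothetical shared latent signal) of generating and centering a pair of datasets like this; the later code snippets in this post reuse these <code>X</code> and <code>Y</code>.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, q = 500, 2, 3  # sample size and number of features in each view (arbitrary)

# A shared latent signal induces correlation between the two sets of measurements.
latent = rng.normal(size=(n, 1))
X = latent @ rng.normal(size=(1, p)) + 0.5 * rng.normal(size=(n, p))
Y = latent @ rng.normal(size=(1, q)) + 0.5 * rng.normal(size=(n, q))

# Center each column so that every feature is zero-mean, as assumed above.
X = X - X.mean(axis=0)
Y = Y - Y.mean(axis=0)
```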
<div class="figure">
<img src="/~ewein/assets/output.png" class="center" width="75%" />
<div class="caption">
<span class="caption-label">Figure 1:</span> An example scenario where we might want to apply CCA. Suppose we have two datasets $X \in \mathbb{R}^{n\times 2}$ (left) and $Y \in \mathbb{R}^{n\times 3}$ (right). We assume that the measurements of $X$ and $Y$ are paired - i.e., that each row of $X$ and $Y$ consists of measurements on the same underlying system/sample. Here we visualize these paired measurements for the case where $n = 2$, giving us the measurement pairs $(\boldsymbol{u}_x, \boldsymbol{u}_y)$ and $(\boldsymbol{v}_x, \boldsymbol{v}_y)$. The goal of CCA is to help us understand the relationships between the sets of measurements in $X$ and $Y$.
</div>
</div>
<p>Now, let $\boldsymbol{w}_x \in \mathbb{R}^{p}$ and $\boldsymbol{w}_y \in \mathbb{R}^{q}$ denote linear transformations of $X$ and $Y$ respectively. The <strong><em>canonical variables</em></strong> $\boldsymbol{z}_x \in \mathbb{R}^n$ and $\boldsymbol{z}_y \in \mathbb{R}^n$ are then defined as</p>
\[\boldsymbol{z}_x = X\boldsymbol{w}_x \quad \text{and} \quad \boldsymbol{z}_y = Y\boldsymbol{w}_y.\]
<p>The objective of CCA is to learn the transformations $\boldsymbol{w}_x$ and $\boldsymbol{w}_y$ such that $\boldsymbol{z}_x$ and $\boldsymbol{z}_y$ are maximally correlated. Letting $\boldsymbol{\Sigma}_{xy}$ denote the cross-covariance matrix of $X$ and $Y$ (and $\boldsymbol{\Sigma}_{xx}$, $\boldsymbol{\Sigma}_{yy}$ their respective covariance matrices), we can write this optimization problem as</p>
\[\begin{aligned}
\boldsymbol{w}_x^*, \boldsymbol{w}_y^* &= \underset{\boldsymbol{w}_x, \boldsymbol{w}_y}{\operatorname{argmax}} \text{corr}(\boldsymbol{z}_x, \boldsymbol{z}_y)\\
&= \underset{\boldsymbol{w}_x, \boldsymbol{w}_y}{\operatorname{argmax}} \text{corr}(X\boldsymbol{w}_x, Y\boldsymbol{w}_y) \\
&= \underset{\boldsymbol{w}_x, \boldsymbol{w}_y}{\operatorname{argmax}} \frac{\mathbb{E}[(X\boldsymbol{w}_x)^T(Y\boldsymbol{w}_y)]}{\sqrt{\mathbb{E}[(X\boldsymbol{w}_x)^{T}(X\boldsymbol{w}_x)]}\sqrt{\mathbb{E}[(Y\boldsymbol{w}_y)^T(Y\boldsymbol{w}_y)]}} \\
&= \underset{\boldsymbol{w}_x, \boldsymbol{w}_y}{\operatorname{argmax}} \frac{\boldsymbol{w}_x^T\mathbb{E}[X^TY]\boldsymbol{w}_y}{\sqrt{\boldsymbol{w}_x^T\mathbb{E}[X^{T}X]\boldsymbol{w}_x}\sqrt{\boldsymbol{w}_y^T\mathbb{E}[Y^TY]\boldsymbol{w}_y}} \\
&= \underset{\boldsymbol{w}_x, \boldsymbol{w}_y}{\operatorname{argmax}} \frac{\boldsymbol{w}_{x}^{T}\boldsymbol{\Sigma}_{xy}\boldsymbol{w}_{y}}{\sqrt{\boldsymbol{w}_x^T\boldsymbol{\Sigma}_{xx}\boldsymbol{w}_x}\sqrt{\boldsymbol{w}_y^T\boldsymbol{\Sigma}_{yy}\boldsymbol{w}_y}}
\end{aligned}\]
<p>Correlation is scale-invariant, so to make our lives easier we’ll constrain $||\boldsymbol{z}_x|| = ||\boldsymbol{z}_y|| = 1$. Our problem is thus to find $\boldsymbol{w}_x$ and $\boldsymbol{w}_y$ that satisfy</p>
\[\begin{equation}
\label{eq:cca_constrained}
\max_{\boldsymbol{w}_x, \boldsymbol{w}_y} \boldsymbol{w}_{x}^{T}\boldsymbol{\Sigma}_{xy}\boldsymbol{w}_{y}\quad
\text{s.t.}\ \ \boldsymbol{w}_x^T\boldsymbol{\Sigma}_{xx}\boldsymbol{w}_x = 1, \quad \boldsymbol{w}_y^T\boldsymbol{\Sigma}_{yy}\boldsymbol{w}_y = 1
\end{equation}\]
<div class="figure">
<img src="/~ewein/assets/angles.png" class="center" width="50%" />
<div class="caption">
<span class="caption-label">Figure 2:</span> A geometric interpretation of CCA. We seek to learn transformations of our paired measurements $X$ and $Y$ such that the embeddings $\boldsymbol{z}_x$ and $\boldsymbol{z}_y$ point in the same direction. We depict this idea for the case where $n = 2$.
</div>
</div>
<p>By adding these constraints, we can also obtain a more intuitive geometric interpretation of the CCA problem. Using our definitions of $\boldsymbol{z}_x$ and $\boldsymbol{z}_y$, we can rewrite (\ref{eq:cca_constrained}) as</p>
\[\begin{equation}
\max_{\boldsymbol{w}_x, \boldsymbol{w}_y} \mathbb{E}[\boldsymbol{z}_x^{T}\boldsymbol{z}_y]\quad
\text{s.t.}\ \ ||\boldsymbol{z}_x|| = 1, \quad ||\boldsymbol{z}_y|| = 1,
\end{equation}\]
<p>which, using the geometric interpretation of the dot product we can further rewrite as</p>
\[\begin{equation}
\max_{\boldsymbol{w}_x, \boldsymbol{w}_y} \cos{\theta} \quad
\text{s.t.}\ \ ||\boldsymbol{z}_x|| = 1, \quad ||\boldsymbol{z}_y|| = 1,
\end{equation}\]
<p>where $\theta$ is the angle between $\boldsymbol{z}_x$ and $\boldsymbol{z}_y$. As $\cos \theta$ is maximized at $\theta = 0$, we can interpret our problem as finding transformations $\boldsymbol{w}_x$ and $\boldsymbol{w}_y$ such that the resulting embeddings $\boldsymbol{z}_x$ and $\boldsymbol{z}_y$ are pointing in the same direction (Figure 2). Now let’s see how to actually find our transformations $\boldsymbol{w}_x$ and $\boldsymbol{w}_y$.</p>
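<p>As a quick sanity check of this geometric picture, the snippet below (continuing the NumPy example from above, with two arbitrary weight vectors) verifies that for zero-mean data the correlation between $\boldsymbol{z}_x$ and $\boldsymbol{z}_y$ is exactly the cosine of the angle between them.</p>

```python
# Arbitrary weight vectors; any choice works for checking corr == cos(theta).
w_x = rng.normal(size=p)
w_y = rng.normal(size=q)
z_x, z_y = X @ w_x, Y @ w_y  # zero-mean because the columns of X and Y are

corr = np.corrcoef(z_x, z_y)[0, 1]
cos_theta = (z_x @ z_y) / (np.linalg.norm(z_x) * np.linalg.norm(z_y))
assert np.isclose(corr, cos_theta)
```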
<h2 id="solving-the-problem">Solving the Problem</h2>
<p>We’ll begin by rewriting our problem one last time to make our lives easier down the line. Define</p>
\[\begin{aligned}
\boldsymbol{\Omega} &= \boldsymbol{\Sigma}_{xx}^{-1/2}\boldsymbol{\Sigma}_{xy}\boldsymbol{\Sigma}^{-1/2}_{yy} \\
\boldsymbol{c} &= \boldsymbol{\Sigma}_{xx}^{1/2}{\boldsymbol{w}_x} \\
\boldsymbol{d} &= \boldsymbol{\Sigma}_{yy}^{1/2}{\boldsymbol{w}_y}
\end{aligned}\]
<p>and then our problem becomes</p>
\[\begin{equation}
\label{eq:objective}
\max_{\boldsymbol{w}_x, \boldsymbol{w}_y} \boldsymbol{c}^{T}\boldsymbol{\Omega}\boldsymbol{d} \quad
\text{s.t.}\ \ ||\boldsymbol{c}|| = 1, \quad ||\boldsymbol{d}|| = 1
\end{equation}\]
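<p>In practice, the inverse square roots $\boldsymbol{\Sigma}_{xx}^{-1/2}$ and $\boldsymbol{\Sigma}_{yy}^{-1/2}$ can be computed from an eigendecomposition of the (symmetric, positive-definite) covariance matrices. Below is one way to form $\boldsymbol{\Omega}$ from sample covariances, continuing the running NumPy example; the small <code>eps</code> floor is just a numerical guard and not part of the derivation.</p>

```python
def inv_sqrt(S, eps=1e-10):
    """Inverse symmetric square root of a positive-definite matrix."""
    evals, evecs = np.linalg.eigh(S)
    return evecs @ np.diag(1.0 / np.sqrt(np.maximum(evals, eps))) @ evecs.T

# Sample covariance and cross-covariance matrices of the centered data.
S_xx = X.T @ X / n
S_yy = Y.T @ Y / n
S_xy = X.T @ Y / n

Omega = inv_sqrt(S_xx) @ S_xy @ inv_sqrt(S_yy)
```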
<p>We can then turn this constrained problem into an easier-to-solve unconstrained problem using <a href="https://tutorial.math.lamar.edu/classes/calciii/lagrangemultipliers.aspx">Lagrange multipliers</a>, giving us</p>
\[\mathcal{L} = \boldsymbol{c}^{T}\boldsymbol{\Omega}\boldsymbol{d} - \frac{\lambda_1}{2}(\boldsymbol{c}^{T}\boldsymbol{c} - 1) - \frac{\lambda_2}{2}(\boldsymbol{d}^{T}\boldsymbol{d} - 1).\]
<p>Now let’s take the derivatives of this equation with respect to $\boldsymbol{c}$ and $\boldsymbol{d}$. First, by applying matrix derivative rules we have</p>
\[\begin{equation}
\label{eq:partial_x}
\frac{\partial \mathcal{L}}{\partial \boldsymbol{c}} = \boldsymbol{\Omega}\boldsymbol{d} -\lambda_1 \boldsymbol{c} = 0,
\end{equation}\]
<p>and similarly</p>
\[\begin{equation}
\label{eq:partial_y}
\frac{\partial \mathcal{L}}{\partial \boldsymbol{d}} = \boldsymbol{\Omega}^{T}\boldsymbol{c} - \lambda_2\boldsymbol{d} = 0.
\end{equation}\]
<p>We’re left with two equations and four unknowns ($\lambda_1, \lambda_2, \boldsymbol{c},$ and $\boldsymbol{d}$). Now, let’s see if we can simplify things a bit. First, we’ll multiply Equation (\ref{eq:partial_x}) by $\boldsymbol{c}^T$. This gives us</p>
\[\begin{aligned}
0 &= \boldsymbol{c}^T(\boldsymbol{\Omega}\boldsymbol{d} -\lambda_1 \boldsymbol{c}) \\
&= \boldsymbol{c}^T\boldsymbol{\Omega}\boldsymbol{d} - \lambda_1\boldsymbol{c}^T\boldsymbol{c}\\
&= \boldsymbol{c}^T\boldsymbol{\Omega}\boldsymbol{d} - \lambda_1\\
\end{aligned}\]
<p>where we used our constraint $\boldsymbol{c}^T\boldsymbol{c} = 1$ to get the last equality, so $\lambda_1 = \boldsymbol{c}^T\boldsymbol{\Omega}\boldsymbol{d}$. Similarly, by multiplying Equation (\ref{eq:partial_y}) by $\boldsymbol{d}^T$, we have</p>
\[\begin{aligned}
0 &= \boldsymbol{d}^T(\boldsymbol{\Omega}^T\boldsymbol{c} - \lambda_2\boldsymbol{d})\\
&=\boldsymbol{d}^T\boldsymbol{\Omega}^T\boldsymbol{c} - \lambda_2\boldsymbol{d}^T\boldsymbol{d}\\
&= \boldsymbol{d}^T\boldsymbol{\Omega}^T\boldsymbol{c} - \lambda_2
\end{aligned}\]
<p>from which we can conclude $\lambda_2 = \boldsymbol{d}^T\boldsymbol{\Omega}^T\boldsymbol{c} = \boldsymbol{c}^T\boldsymbol{\Omega}\boldsymbol{d} = \lambda_1$, since the middle two expressions are transposes of the same scalar. Define $\lambda = \lambda_1 = \lambda_2$. Plugging this into Equation (\ref{eq:partial_x}) we find</p>
\[\begin{equation}
\label{eq:singular_value}
\boldsymbol{\Omega}\boldsymbol{d} = \lambda\boldsymbol{c}.
\end{equation}\]
<p>Together with Equation (\ref{eq:partial_y}), which now reads $\boldsymbol{\Omega}^{T}\boldsymbol{c} = \lambda\boldsymbol{d}$, this says that $\boldsymbol{d}$ and $\boldsymbol{c}$ must be a pair of right and left singular vectors of $\boldsymbol{\Omega}$ with singular value $\lambda$. Now, which specific singular vectors should we choose? Plugging Equation (\ref{eq:singular_value}) into our objective from Equation (\ref{eq:objective}), we get</p>
\[\max_{\boldsymbol{w}_x, \boldsymbol{w}_y} \boldsymbol{c}^{T}\boldsymbol{\Omega}\boldsymbol{d} = \lambda \boldsymbol{c}^{T}\boldsymbol{c} = \lambda\]
<p>Since the objective equals the singular value $\lambda$, the optimum is attained when $\boldsymbol{d}$ and $\boldsymbol{c}$ are specifically the right and left singular vectors of $\boldsymbol{\Omega}$ corresponding to its largest singular value.</p>
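<p>Putting the pieces together, here is a sketch of the full recipe on the running example: take the top singular vector pair of $\boldsymbol{\Omega}$, map it back to $\boldsymbol{w}_x$ and $\boldsymbol{w}_y$, and check that the resulting canonical variables have correlation equal to the top singular value. (This is only an illustration of the derivation; production implementations such as <code>sklearn.cross_decomposition.CCA</code> use more numerically careful algorithms.)</p>

```python
# Top singular vectors of Omega give the optimal (c, d); map them back to weights.
U, svals, Vt = np.linalg.svd(Omega)
c, d = U[:, 0], Vt[0, :]

w_x = inv_sqrt(S_xx) @ c  # since c = S_xx^{1/2} w_x
w_y = inv_sqrt(S_yy) @ d  # since d = S_yy^{1/2} w_y

z_x, z_y = X @ w_x, Y @ w_y
print(np.corrcoef(z_x, z_y)[0, 1], svals[0])  # first canonical correlation, twice
```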
<h2 id="multiple-canonical-covariates">Multiple Canonical Covariates</h2>
<p>After solving the problem described previously, we’ll be able to project our data down to a single set of canonical variables. While this is a good start, by restricting ourselves to only a single set of variables, we’ll likely lose a good chunk of information. Thus, oftentimes we want to find a set of $r$ pairs of canonical variables with corresponding transformations $W_{x} \in \mathbb{R}^{p \times r}$ and $W_{y} \in \mathbb{R}^{q \times r}$, where the $i$th column of these matrices represents a mapping to the $i$th canonical covariate. To ensure that each pair of canonical variables is capturing different phenomena, we’ll restrict them to be uncorrelated/orthogonal to each other, i.e.,</p>
\[(\boldsymbol{z}_{x}^{i})^T\boldsymbol{z}_x^{j} = 0, \quad (\boldsymbol{z}_{y}^{i})^{T}\boldsymbol{z}_y^{j} = 0 \quad \forall j \neq i : i,\ j \in \{1, \ldots, r\}\]
<p>where $(\boldsymbol{z}_{x}^{i}, \boldsymbol{z}_{y}^{i})$ is the $i$th pair of canonical variables. As it turns out, we satisfy this additional constraint without any additional work. To see this, we can define $\boldsymbol{w}_{x}^{i}$ and $\boldsymbol{w}_{y}^{i}$ such that</p>
\[\boldsymbol{z}_x^i = X\boldsymbol{w}_x^i \quad \text{and} \quad \boldsymbol{z}_y^i = Y\boldsymbol{w}_y^i\]
<p>and similarly $\boldsymbol{c}^i$ and $\boldsymbol{d}^i$ such that</p>
\[\begin{aligned}
\boldsymbol{c}^i &= \boldsymbol{\Sigma}_{xx}^{1/2}{\boldsymbol{w}_x^i} \\
\boldsymbol{d}^i &= \boldsymbol{\Sigma}_{yy}^{1/2}{\boldsymbol{w}_y^i}
\end{aligned}\]
<p>Using our previous analysis, we can conclude that any optimal $(\boldsymbol{z}_{x}^{i}, \boldsymbol{z}_{y}^{i})$ must correspond to a $\boldsymbol{c}^i$ and $\boldsymbol{d}^i$ that are the left and right singular vectors corresponding to the $i$th largest singular value of the matrix $\boldsymbol{\Omega}$ defined previously. As it turns out, for a given matrix any two distinct left (or right) singular vectors are orthogonal. That is, we must have</p>
\[(\boldsymbol{c}^{i})^T\boldsymbol{c}^j = 0, \quad (\boldsymbol{d}^{i})^T\boldsymbol{d}^j = 0.\]
<p>With a little algebra, we can show that</p>
\[(\boldsymbol{c}^{i})^T\boldsymbol{c}^j = 0 \iff (\boldsymbol{z}_x^{i})^T\boldsymbol{z}_x^j = 0, \quad (\boldsymbol{d}^{i})^T\boldsymbol{d}^j = 0 \iff (\boldsymbol{z}_y^{i})^T\boldsymbol{z}_y^j = 0,\]
<p>thus guaranteeing that we satisfy the orthogonality constraint automatically.</p>
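<p>Continuing the sketch above, we can take the top $r$ singular vector pairs of $\boldsymbol{\Omega}$ and confirm numerically that the resulting canonical variables within each view are uncorrelated with one another.</p>

```python
r = min(p, q)
W_x = inv_sqrt(S_xx) @ U[:, :r]    # i-th column maps to the i-th canonical variable
W_y = inv_sqrt(S_yy) @ Vt[:r, :].T

Z_x, Z_y = X @ W_x, Y @ W_y
# Within each view, distinct canonical variables are uncorrelated (both ~ identity):
print(np.round(Z_x.T @ Z_x / n, 6))
print(np.round(Z_y.T @ Z_y / n, 6))
```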
<hr />
<p>Finally, we’ll discuss briefly the number of canonical variables $r$. Since the solutions to our optimization problem must be singular vectors of $\boldsymbol{\Omega}$, $r$ cannot be greater than the number of singular vectors of $\boldsymbol{\Omega}$.</p>
<p>How many singular vectors does $\boldsymbol{\Omega}$ have? Recall that we defined $\boldsymbol{\Omega}$ as</p>
\[\boldsymbol{\Omega} = \boldsymbol{\Sigma}_{xx}^{-1/2}\boldsymbol{\Sigma}_{xy}\boldsymbol{\Sigma}^{-1/2}_{yy}\]
<p>Assuming that our data matrices $X \in \mathbb{R}^{n \times p}$ and $Y\in \mathbb{R}^{n \times q}$ do not have any redundant features, the cross-covariance matrix $\boldsymbol{\Sigma}_{xy}$ has rank $\min(p, q)$. Moreover, the matrices $\boldsymbol{\Sigma}_{xx}^{-1/2}$ and $\boldsymbol{\Sigma}_{yy}^{-1/2}$ are full-rank with rank $p$ and $q$ respectively. Multiplying a matrix by a full-rank matrix preserves rank, so we can conclude that $\text{rank}(\boldsymbol{\Omega}) = \min(p, q)$. Since the number of singular vectors of a matrix is equal to its rank, we must choose $r \leq \min(p, q)$.</p>
<h1 id="the-gumbel-max-trick">The Gumbel-max trick</h1>
<p>A <strong><em>categorical distribution</em></strong> is a discrete probability distribution that assigns a probability to each of $K$ classes (or categories). That is, for each class $k \in \{1, 2, \ldots, K\}$ we have some value $\pi_k$ representing the probability of drawing that class. Because we’re dealing with probabilities, each $\pi_k$ must be greater than \(0\) and we must have $\sum_{k}\pi_k = 1$ (in other words, our $\pi_k$ must lie on the <strong><em>simplex</em></strong>).</p>
<p>The obvious way we might think to parameterize such a distribution is to use the vector $\boldsymbol{\pi}$ of probabilities $\{\pi_k\}$. That is, if we have a categorical random variable $I$, we would write</p>
\[I \sim Cat(\boldsymbol{\pi}).\]
<p>However, in many machine learning problems we may instead prefer to parameterize a discrete distribution in terms of an <em>unconstrained</em> vector of numbers. That is, we may instead wish to parameterize our distribution with some vector $\boldsymbol{\theta} \in \mathbb{R}^{K}$ of values $\theta_{k}$ that can take on arbitrary values (i.e., they may be negative, don’t sum to 1, etc.). By doing so, we can use unconstrained optimization algorithms to optimize $\boldsymbol{\theta}$ rather than restricting ourselves to constrained optimization for $\boldsymbol{\pi}$. How do we get from $\boldsymbol{\theta}$ to probabilities? Typically, we’ll use the <strong><em>softmax</em></strong> transformation:</p>
\[\begin{equation}\label{eq:1} \pi_k = \frac{\exp(\theta_k)}{\sum_{k'=1}^{K}\exp(\theta_{k'})}. \end{equation}\]
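<p>As a small illustration (a sketch, not anything required by the rest of the post), here is the softmax map in NumPy; subtracting the maximum before exponentiating is a standard guard against overflow and doesn’t change the result.</p>

```python
import numpy as np

def softmax(theta):
    e = np.exp(theta - np.max(theta))  # max-subtraction for numerical stability
    return e / e.sum()

theta = np.array([1.0, -2.0, 0.3])  # arbitrary unconstrained scores
pi = softmax(theta)
print(pi, pi.sum())  # probabilities on the simplex; they sum to 1
```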
<p>After performing the transformation, we could then sample from $Cat(\boldsymbol{\pi})$. However, what if we don’t want to explicitly construct our distribution using the softmax transform? It turns out that there exists another method for achieving the same effect: the <strong><em>Gumbel-max trick</em></strong>.</p>
<hr />
<p>The <strong><em>Gumbel distribution</em></strong> is a probability distribution with location and scale parameters $\mu \in \mathbb{R}$ and $\beta \in \mathbb{R}_{>0}$, respectively. Its probability density function (PDF) is</p>
\[f(x; \mu, \beta) = \frac{1}{\beta} \exp\bigg(-(x - \mu)/\beta - \exp(-(x - \mu)/\beta)\bigg)\]
<p>and cumulative distribution function (CDF) is</p>
\[F(x; \mu, \beta) = \exp(-\exp(-(x - \mu)/\beta)).\]
<p>We can denote a Gumbel distribution with location $\mu$ and scale $\beta$ using the notation $G(\mu, \beta)$ and a random variable following this distribution as $G_{\mu, \beta}$. The notation here can look pretty intimidating at first, but luckily for the rest of this post we’ll only need to think about “standard” Gumbels with $\mu = 0$ and $\beta = 1$. To reduce clutter, for standard Gumbel random variables we’ll omit the subscripts and just write $G$.</p>
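<p>Standard Gumbel samples are easy to generate by inverting the CDF above: if $U \sim \text{Uniform}(0, 1)$ then $-\log(-\log U) \sim G(0, 1)$. The sketch below checks this against NumPy’s built-in Gumbel sampler (both sample means should be near the Euler–Mascheroni constant $\approx 0.577$, the mean of a standard Gumbel).</p>

```python
rng = np.random.default_rng(0)

def sample_gumbel(size):
    # Inverse-CDF sampling: U ~ Uniform(0, 1)  =>  -log(-log(U)) ~ G(0, 1).
    u = rng.uniform(size=size)
    return -np.log(-np.log(u))

g = sample_gumbel(100_000)
print(g.mean(), rng.gumbel(loc=0.0, scale=1.0, size=100_000).mean())
```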
<p>Now, given our definition of the Gumbel distribution we will make the following claim:</p>
<p>For a set of unnormalized log-probabilities $\{\theta_k\}$ (the $\theta_k$ from Equation \ref{eq:1}), we can draw a sample from the corresponding categorical distribution as follows: to each $\theta_k$ we add an independent sample $G^{(k)}$ from the standard Gumbel distribution, and then select the index with the maximum sum. That is,</p>
\[I = \underset{k}{\operatorname{argmax}}\{\theta_k + G^{(k)}\} \sim Cat(\boldsymbol{\pi})\]
<p>To prove this, we will show that $P(I = \omega) = \pi_{\omega}$. First, as a shorthand we’ll define</p>
\[G_{\theta_k} := \theta_{k} + G^{(k)}.\]
<p>Note that adding the constant $\theta_k$ to a standard Gumbel simply shifts its location, so $G_{\theta_k} \sim G(\theta_k, 1)$; we’ll use its CDF below. Now, starting from the definition of argmax, we know that $ I = \omega $ can only be true if $G_{\theta_k} < G_{\theta_{\omega}}$ for all $k \neq \omega$. That is (using the shorthand $M := G_{\theta_{\omega}}$),</p>
\[P(I = \omega) = \mathbb{E}_{M}\bigg[p(G_{\theta_{k}} < M \quad \forall k \neq \omega)\bigg]\]
<p>Since our Gumbel variables $\{G^{(k)}\}$ are i.i.d., we can factorize the probability above to get</p>
\[\begin{aligned}
P(I = \omega) &= \mathbb{E}_{M}\bigg[\prod_{k \neq \omega} p(G_{\theta_{k}} < M)\bigg]
\end{aligned}\]
<p>Letting $f_{\omega}(\cdot)$ denote the PDF of $G_{\theta_{\omega}}$, we then have</p>
\[\begin{aligned}
P(I = \omega) &= \int_{-\infty}^{\infty}f_{\omega}(m)\prod_{k \neq \omega}p(G_{\theta_{k}} < m)dm\\
&= \int_{-\infty}^{\infty}f_{\omega}(m)\prod_{k \neq \omega}\exp(-\exp(\theta_{k}- m))dm \\
&= \int_{-\infty}^{\infty}f_{\omega}(m)\exp\bigg(-\sum_{k \neq \omega}\exp(\theta_{k}- m)\bigg)dm \\
&= \int_{-\infty}^{\infty}\exp(\theta_{\omega} - m - \exp(\theta_{\omega} - m))\exp\bigg(-\sum_{k \neq \omega}\exp(\theta_{k}- m)\bigg)dm\\
&= \int_{-\infty}^{\infty}\exp(\theta_{\omega} - m)\exp\bigg(-\sum_{k}\exp(\theta_{k} - m)\bigg)dm \\
&= \int_{-\infty}^{\infty}\exp(\theta_{\omega})\exp(-m)\exp(-\exp(-m)\sum_{k}\exp(\theta_k))dm
\end{aligned}\]
<p>Now we define \(Z = \sum_{k}\exp(\theta_k)\). From Equation \ref{eq:1} we must have $\exp(\theta_{\omega}) = \pi_{\omega}Z$. We can then write</p>
\[P(I = \omega) = \pi_{\omega}Z\int_{-\infty}^{\infty}\exp(-m)\exp(-Z\exp(-m))dm.\]
<p>Now, using the identity (which follows from the substitution $u = \exp(-m)$)</p>
\[\int_{-\infty}^{\infty}\exp(-m)\exp(-Z\exp(-m))dm = \frac{1}{Z},\]
<p>we have $P(I = \omega) = \pi_{\omega}$ as desired $\square$.</p>
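<p>As an empirical sanity check of the claim (a sketch with arbitrary $\theta$, not part of the proof), we can compare the frequencies of $\operatorname{argmax}_k\{\theta_k + G^{(k)}\}$ against the softmax probabilities directly.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
K, n_samples = 4, 200_000
theta = rng.normal(size=K)                 # arbitrary unconstrained scores
pi = np.exp(theta) / np.exp(theta).sum()   # softmax probabilities

gumbels = -np.log(-np.log(rng.uniform(size=(n_samples, K))))
idx = np.argmax(theta + gumbels, axis=1)   # Gumbel-max samples
freq = np.bincount(idx, minlength=K) / n_samples

print(np.round(pi, 3))
print(np.round(freq, 3))  # should closely match pi
```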