Understanding Principal Component Analysis (PCA): Theory

Table of Contents

Introduction
Basic Concept
Formulation of Principal Component Analysis
Determination of Principal Components
Projection to Multidimensional Space

Introduction

Principal Component Analysis is a method that summarizes multidimensional data with correlated features into new features represented by linear combinations of the original features, while losing as little information as possible.

Data classified in machine learning often becomes high-dimensional data that greatly exceeds three dimensions. Therefore, data visualization becomes difficult and computational costs increase.

Even in such cases, by using principal component analysis, dimensionality reduction can be performed and data can be projected onto a one-dimensional line, two-dimensional plane, or three-dimensional space, allowing the data structure to be visually understood.

This article explains the basic theory of principal component analysis.

\begin{align*}
\newcommand{\mat}[1]{\begin{pmatrix} #1 \end{pmatrix}}
\newcommand{\f}[2]{\frac{#1}{#2}}
\newcommand{\pd}[2]{\frac{\partial #1}{\partial #2}}
\newcommand{\d}[2]{\frac{{\rm d}#1}{{\rm d}#2}}
\newcommand{\T}{\mathsf{T}}
\newcommand{\dis}{\displaystyle}
\newcommand{\eq}[1]{{\rm Eq}(\ref{#1})}
\newcommand{\n}{\notag\\}
\newcommand{\t}{\ \ \ \ }
\newcommand{\argmax}{\mathop{\rm arg\, max}\limits}
\newcommand{\argmin}{\mathop{\rm arg\, min}\limits}
\def\l<#1>{\left\langle #1 \right\rangle}
\def\us#1_#2{\underset{#2}{#1}}
\def\os#1^#2{\overset{#2}{#1}}
\newcommand{\case}[1]{\{ \begin{array}{ll} #1 \end{array} \right.}
\end{align*}

Basic Concept

Principal component analysis is a method that captures the concept of information contained in multidimensional data as “variance” and defines new features without losing information. For simplicity, let’s first explain principal component analysis based on two-dimensional data. Consider projecting two-dimensional data $\bm{x} = (x_1, x_2)^\T$ onto the following axis. That is, the original two-dimensional data is substituted into the following equation to become one-dimensional data.

\begin{align}
\label{eq:proj}
y = w_1x_1 + w_2x_2.
\end{align}

As an example, the figure below assumes three projection axes (a), (b), and (c), projects the observed two-dimensional data onto each axis, and shows the range over which the data spreads with black arrows <—->.

From this figure, it can be seen that projecting the data onto axis (b) results in greater data spread (variance) than projecting onto axis (a), and projecting onto axis (c) results in even greater variance than axis (b).

In this way, principal component analysis finds the projection axis where the variance of the data is maximized, and projects onto it, thereby reducing dimensions while losing as little information from the data as possible. In particular, the projection axis where variance is maximized is called the “first principal component,” and the projection axis where variance is maximized under the condition of being orthogonal to the first principal component is called the “second principal component.”

So how should we find the axis where the variance of the data is maximized when projected? Let’s formulate principal component analysis based on two-dimensional data.

Formulation of Principal Component Analysis

The $n$ observed two-dimensional data points are expressed as follows.

\begin{align}
\bm{x}^{(1)} = \mat{x_1^{(1)} \\ x_2^{(1)}},\ \bm{x}^{(2)} = \mat{x_1^{(2)} \\ x_2^{(2)}}, \cdots, \bm{x}^{(n)} = \mat{x_1^{(n)} \\ x_2^{(n)}}.
\end{align}

Then, the data when the $i$-th two-dimensional data is projected onto the projection axis: $\eq{eq:proj}$ is:

\begin{align}
\label{eq:proj2}
y^{(i)} = w_1 x_1^{(i)} + w_2 x_2^{(i)} = \bm{w}^\T \bm{x}^{(i)} \,\ \ \ \ (i=1, 2, \dots n)
\end{align}

Here, $\bm{w}$ is the coefficient vector $(w_1, w_2)^\T$ and represents the direction of the projection axis. Therefore, the purpose of principal component analysis is to find the projection axis where variance is maximized, in other words, to find such a coefficient vector $\bm{w}$.

So, let’s calculate the variance $s_y^2$ of the data on the projection axis. From the definition of variance $s_y^2 \equiv \f{1}{n}\sum_i (y^{(i)} – \bar{y})^2$, first calculating the mean $\bar{y}$ of the data on the projection axis:

\begin{align}\bar{y} &= \f{1}{n} \sum_{i=1}^n y^{(i)} \n&= \f{1}{n} \sum_{i=1}^n $ w_1 x_1^{(i)} + w_2 x_2^{(i)} $\ \ \because (\eq{eq:proj2}) \n&= w_1 \bar{x}_1 + w_2 \bar{x}_2 \ \ \because \textstyle (\bar{x} \equiv \f{1}{n} \sum_i x^{(i)})\label{eq:mean}\end{align}

Therefore, the variance $s_y^2$ is:

\begin{align}s_y^2 &= \f{1}{n} \sum_{i=1}^n $y^{(i)} – \bar{y} $^2 \n&= \f{1}{n} \sum_{i=1}^n\{ w_1 $x_1^{(i)} – \bar{x}_1$ + w_2 $x_2^{(i)} – \bar{x}_2$ \}^2 \ \ \because (\eq{eq:proj2},\eq{eq:mean}) \n&= w_1^2\underbrace{\f{1}{n} \sum_{i=1}^n $ x_1^{(i)} – \bar{x}_1 $^2}_{=s_{11}} + 2w_1w_2 \underbrace{\f{1}{n}\sum_{i=1}^n $ x_1^{(i)} – \bar{x}_1 $$ x_2^{(i)} – \bar{x}_2 $}_{=s_{12}} + w_2^2 \underbrace{\f{1}{n} \sum_{i=1}^n $ x_2^{(i)} – \bar{x}_2 $^2}_{=s_{22}} \n&= w_1^2 s_{11} + 2w_1w_2 s_{12} + w_2^2 s_{22} \n&= \bm{w}^\T S \bm{w}\label{eq:var}
\end{align}

Here, the matrix $S$ is the covariance matrix based on the two-dimensional observed data.

\begin{align*}
S = \mat{s_{11} & s_{12} \\ s_{21} & s_{22}},\ \ s_{jk} \equiv \f{1}{n}\sum_{i=1}^n $x_j^{(i)} – \bar{x}_j$$x_k^{(i)} – \bar{x}_k$,\ \ j,k=1,2.
\end{align*}

Therefore, to find the principal component, we need to find $\bm{w}$ that maximizes $s_y^2= \bm{w}^\T S \bm{w}$, but such $\bm{w}$ would diverge as $\|\bm{w}\| \to \infty$.

What we are interested in is not the magnitude of the coefficient vector $\bm{w}$, but the ratio of its components (that is, the direction of the projection axis). Therefore, we impose the constraint $\bm{w}^\T \bm{w} = 1$ on the coefficient vector. Then, the problem of finding the principal component changes to the following optimization problem.

\begin{align*}
\max_{\bm{w}} \bm{w}^\T S \bm{w}\ \ {\rm subject\ to}\ \ \bm{w}^\T\bm{w}=1.
\end{align*}

Such a maximization problem under constraints can be solved by the Lagrange multiplier method.

等式条件下におけるラグランジュの未定乗数法の解説

この記事では、数理最適化問題を解く際の強力な手法である「ラグランジュの未定乗数法」について解説します。

That is, with the Lagrange multiplier as $\lambda$, the Lagrange function:

\begin{align}
L(\bm{w}, \lambda) = \bm{w}^\T S \bm{w} + \lambda(1 – \bm{w}^\T \bm{w})\label{eq:lag}
\end{align}

The solution is obtained by finding the stationary point (the point where the differentiated function becomes 0).

Taking the partial derivative of the Lagrange function: $\eq{eq:lag}$ with respect to $\bm{w}$ gives the following equation.

\begin{align}
\pd{L}{\bm{w}} &= \pd{}{\bm{w}}(\bm{w}^\T S\bm{w}) + \lambda \pd{}{\bm{w}}(1-\bm{w}^\T \bm{w}) \n&= \pd{}{\bm{w}}(\bm{w}^\T S \bm{w}) – \lambda \pd{}{\bm{w}}(\bm{w}^\T \bm{w}) \n\n&= 2S\bm{w} -2\lambda \bm{w}.\label{eq:divlag}
\end{align}

Here, we used the vector differentiation formulas $\partial(\bm{w}^\T S\bm{w}) / \partial \bm{w} = 2S \bm{w},\ \partial(\bm{w}^\T \bm{w}) / \partial \bm{w} = 2 \bm{w}$.

「ベクトルで微分」公式導出

機械学習の理論を勉強すると「スカラーをベクトルで微分」する操作をよく目にします。この記事では「スカラーをベクトルで微分」の公式を導出します。

Then, by setting $\eq{eq:divlag}$ to $\bm{0}$, the following equation is obtained.

\begin{align*}
S\bm{w} = \lambda\bm{w}.
\end{align*}

In this way, principal component analysis can be reduced to the eigenvalue problem of the covariance matrix.

Determination of Principal Components

The problem of “finding the projection axis that maximizes variance when data is projected” has been replaced with the following eigenvalue problem.

\begin{align*}
S\bm{w} = \lambda\bm{w}.
\end{align*}

So what is the meaning of the eigenvalue $\lambda$? Let’s operate with $\bm{w}^\T$ from the left on the above equation. Then:

\begin{align}
\bm{w}^\T S \bm{w} &= \bm{w}^\T \lambda \bm{w} \n&= \lambda \bm{w}^\T \bm{w} \n&= \lambda\ \ \textstyle \because (\bm{w}^\T \bm{w}=1)\label{eq:lambda}
\end{align}

Here, we used the fact that the eigenvector is normalized ($\bm{w}^\T \bm{w}=1$). Then the left side of $\eq{eq:lambda}$ is the variance of the data on the projection axis $s_y^2 = \bm{w}^\T S \bm{w}$ from $\eq{eq:var}$. Therefore, the eigenvalue $\lambda$ means the variance of the data on the projection axis itself.

\begin{align*}
{\rm Var}[y] = \lambda
\end{align*}

From the above, for the eigenvalue problem $S\bm{w} = \lambda\bm{w}$, it can be seen that the eigenvector $\bm{w}_1$ belonging to the maximum eigenvalue $\lambda_1$ gives the direction of the first principal component.

Also, the covariance matrix $S$ is a symmetric matrix by its definition. For symmetric matrices, eigenvectors belonging to different eigenvalues are orthogonal to each other.

Therefore, for the eigenvalue problem $S\bm{w} = \lambda\bm{w}$, it can also be seen that the eigenvector $\bm{w}_2$ belonging to the second largest eigenvalue $\lambda_2$ gives the direction of the second principal component.

Projection to Multidimensional Space

Above, we have looked at the case of projection to one dimension, but let’s generally consider the case of compressing to a multidimensional subspace.
The $n$ observed $p$-dimensional data points are expressed using matrices as follows.

\begin{align*}
\us X_{[n \times p]}=\mat{x_1^{(1)} & x_2^{(1)} & \cdots & x_q^{(1)} &\cdots & x_p^{(1)} \\ x_1^{(2)} & x_2^{(2)} & \cdots & x_q^{(2)} &\cdots & x_p^{(2)} \\ \vdots & \vdots & \ddots & \vdots & \ddots & \vdots \\ x_1^{(n)} & x_2^{(n)} & \cdots & x_q^{(n)} &\cdots & x_p^{(n)}\\}=\mat{- & \bm{x}^{(1)\T} & – \\ – & \bm{x}^{(2)\T} & – \\ & \vdots & \\ – & \bm{x}^{(n)\T} & – \\}
\end{align*}

Consider using principal component analysis to reduce the above $p$-dimensional data to $q$-dimensional data $(q < p)$.

From the discussion so far, it can be seen that the $q$-dimensional space after dimensionality reduction is the space spanned by the first principal component to the $q$-th principal component. Therefore, to find the principal components, let $\us S_{[p \times p]}$ be the covariance matrix based on the $n$ observed $p$-dimensional data points, and arrange $q$ pairs of eigenvalues and eigenvectors in descending order.

\begin{align*}
&\lambda_1 &\geq \t\ \ &\lambda_2 &\geq \t\ \ &\cdots &\geq \t\ \ &\lambda_q \\&\bm{w}_1 &\ \t\ \ &\bm{w}_2 & \t\ \ & & \t\ \ &\bm{w}_q
\end{align*}

Here, $\bm{w}_j = (w_{j1}, w_{j2}, \dots, w_{jp})^\T$. Then, the first principal component $y_1$ to the $q$-th principal component $y_q$ can be expressed as follows.

\begin{align*}
1\text{st} \t y_1 &= w_{11}x_1 + w_{12}x_2 + \cdots + w_{1p}x_p = \bm{w}^\T _1 \bm{x} \\2\text{nd} \t y_2 &= w_{21}x_1 + w_{22}x_2 + \cdots + w_{2p}x_p = \bm{w}^\T _2 \bm{x} \\&\ \ \vdots \\q\text{th} \t y_q &= w_{q1}x_1 + w_{q2}x_2 + \cdots + w_{qp}x_p = \bm{w}^\T _q \bm{x}
\end{align*}

When principal component analysis is applied, the $i$-th observed $p$-dimensional data $\bm{x}^{(i)} = $ x_1^{(i)}, x_2^{(i)}, \dots, x_p^{(i)} $^\T$ is projected onto a $q$-dimensional space of lower dimension $(q < p)$ and becomes $q$-dimensional data $\bm{y}^{(i)} = $ y_1^{(i)}, y_2^{(i)}, \dots, y_q^{(i)} $^\T$. Each component is as follows.

\begin{align*}
y_j^{(i)} = w_{j1}x_1^{(i)} + w_{j2}x_2^{(i)} + \cdots + w_{jp}x_p^{(i)} &= \bm{w}_j^\T \bm{x}^{(i)} \\&=\bm{x}^{(i)\T} \bm{w}_j\\(i = 1, \dots, n), \ \ (j=1, \dots,q)
\end{align*}

This projection, when collectively described using matrices, becomes the following equation.

\begin{align*}
\underbrace{\mat{y_1^{(1)} & y_2^{(1)} & \cdots & y_q^{(1)} \\ y_1^{(2)} & y_2^{(2)} & \cdots & y_q^{(2)} \\ \vdots & \vdots & \ddots & \vdots \\ y_1^{(n)} & y_2^{(n)} & \cdots & y_q^{(n)} \\}}_{\equiv \us Y_{[n \times q]}}＝\underbrace{\mat{- & \bm{x}^{(1)\T} & – \\ – & \bm{x}^{(2)\T} & – \\ & \vdots & \\ – & \bm{x}^{(n)\T} & – \\}}_{\equiv \us X_{[n \times p]}}\underbrace{\mat{ | & | & & | \\ \bm{w}_1 & \bm{w}_2 & \cdots & \bm{w}_q \\ | & | & & |}}_{\equiv \us W_{[p \times q]}}
\end{align*}

\begin{align}
\large\therefore \us Y_{[n \times q]} = \us X_{[n \times p]} \us W_{[p \times q]}
\end{align}

In this way, projection by principal component analysis is performed by constructing a projection matrix $W$ with eigenvectors arranged in the column direction and applying it from the right to the observed data matrix $X$.

For the implementation part of principal component analysis, see here▼