# Principal Component Analysis (PCA) Theory

## Introduction

Principal component analysis is a method of summarizing the information in multidimensional data observed for features that are correlated with each other into new features expressed as a linear combination of the original features, with as little loss of information as possible.

The data to be classified by machine learning is often highly dimensional, well beyond three dimensions. This makes data visualization difficult and computationally expensive.

Even in such cases, principal component analysis can be used to compress dimensions and project the data into a 1D line, 2D plane, or 3D space to visually grasp the data structure.

\begin{align*} \newcommand{\mat}{\begin{pmatrix} #1 \end{pmatrix}} \newcommand{\f}{\frac{#1}{#2}} \newcommand{\pd}{\frac{\partial #1}{\partial #2}} \newcommand{\d}{\frac{{\rm d}#1}{{\rm d}#2}} \newcommand{\T}{\mathsf{T}} \newcommand{$}{\left(} \newcommand{$}{\right)} \newcommand{\{}{\left\{} \newcommand{\}}{\right\}} \newcommand{$}{\left[} \newcommand{$}{\right]} \newcommand{\dis}{\displaystyle} \newcommand{\eq}{{\rm Eq}(\ref{#1})} \newcommand{\n}{\notag\\} \newcommand{\t}{\ \ \ \ } \newcommand{\argmax}{\mathop{\rm arg\, max}\limits} \newcommand{\argmin}{\mathop{\rm arg\, min}\limits} \def\l<#1>{\left\langle #1 \right\rangle} \def\us#1_#2{\underset{#2}{#1}} \def\os#1^#2{\overset{#2}{#1}} \newcommand{\case}{\{ \begin{array}{ll} #1 \end{array} \right.} \definecolor{myblack}{rgb}{0.27,0.27,0.27} \definecolor{myred}{rgb}{0.78,0.24,0.18} \definecolor{myblue}{rgb}{0.0,0.443,0.737} \definecolor{myyellow}{rgb}{1.0,0.82,0.165} \definecolor{mygreen}{rgb}{0.24,0.47,0.44} \end{align*}

## Basic Concept

Principal component analysis is a method that captures the concept of information contained in multidimensional data in terms of “variance” and defines new features without losing any information. First, for simplicity, we will explain principal component analysis based on two-dimensional data. Consider the projection of 2-dimensional data $\bm{x} = (x_1, x_2)^\T$ onto the following axes That is, the original 2-dimensional data is substituted into the following equation to make it 1-dimensional data.

\begin{align} \label{eq:proj} y = w_1x_1 + w_2x_2. \end{align}

As an example, assuming three projection axes (a), (b), and (c) in the figure below, the observed two-dimensional data are projected onto each axis, and the extent to which the data are scattered is indicated by the black arrow .

This figure shows that the scatter (variance) of the data is greater when projected onto the (b) axis than when projected onto the (a) axis, and the dispersion is even greater when projected onto the (c) axis than onto the (b) axis.

In this way, principal component analysis finds the projection axis that maximizes the variance of the data and projects the data onto it, thereby reducing the dimensionality of the data without losing as much information as possible. In particular, the projection axis with the largest variance is called the “first principal component,” and the projection axis with the largest variance under conditions orthogonal to the first principal component is called the “second principal component.

So how do we find the axis that maximizes the variance of the data when projected?

## Formulation of Principal Component Analysis

The $n$ observed 2D data are represented as follows.

\begin{align}
\bm{x}^{(1)} = \mat{x_1^{(1)} \\ x_2^{(1)}},\ \bm{x}^{(2)} = \mat{x_1^{(2)} \\ x_2^{(2)}}, \cdots, \bm{x}^{(n)} = \mat{x_1^{(n)} \\ x_2^{(n)}}.
\end{align}

Then the projection of the $i$-th 2D data onto the projection axis: $\eq{eq:proj}$ is

\begin{align}
\label{eq:proj2}
y^{(i)} = w_1 x_1^{(i)} + w_2 x_2^{(i)} = \bm{w}^\T \bm{x}^{(i)} \,\ \ \ \ (i=1, 2, \dots n)
\end{align}

where $\bm{w}$ is the coefficient vector $(w_1, w_2)^\T$ and means the direction of the projection axis. Therefore, the objective of principal component analysis is to find the projection axis with the maximum variance, in other words, to find such a coefficient vector $\bm{w}$.

We will then calculate the variance of the data on the projection axis $s_y^2$. From the definition of variance $s_y^2 \equiv \f{1}{n}\sum_i (y^{(i)} – \bar{y})^2$, we first calculate the mean $\bar{y}$ of the data on the projection axis

\begin{align}\bar{y} &= \f{1}{n} \sum_{i=1}^n y^{(i)} \n&= \f{1}{n} \sum_{i=1}^n $w_1 x_1^{(i)} + w_2 x_2^{(i)}$\ \ \because (\eq{eq:proj2}) \n&= w_1 \bar{x}_1 + w_2 \bar{x}_2 \ \ \because \textstyle (\bar{x} \equiv \f{1}{n} \sum_i x^{(i)})\label{eq:mean}\end{align}

The variance $s_y^2$ is the same as

\begin{align}s_y^2 &= \f{1}{n} \sum_{i=1}^n $y^{(i)} – \bar{y}$^2 \n&= \f{1}{n} \sum_{i=1}^n\{ w_1 $x_1^{(i)} – \bar{x}_1$ + w_2 $x_2^{(i)} – \bar{x}_2$ \}^2 \ \ \because (\eq{eq:proj2},\eq{eq:mean} ) \n&= w_1^2\underbrace{\f{1}{n} \sum_{i=1}^n $x_1^{(i)} – \bar{x}_1$^2}_{=s_{11}} + 2w_1w_2 \underbrace{\f{1}{n}\sum_{i=1}^n $x_1^{(i)} – \bar{x}_1$$x_2^{(i)} – \bar{x}_2$}_{=s_{12}} + w_2^2 \underbrace{\f{1}{n} \sum_{i=1}^n $x_2^{(i)} – \bar{x}_2$^2}_{=s_{22}} \n&= w_1^2 s_{11} + 2w_1w_2 s_{12} + w_2^2 s_{22} \n&= \bm{w}^\T S \bm{w}\label{eq:var}
\end{align}

where the matrix $S$ is the variance-covariance matrix based on two-dimensional observed data.

\begin{align*}
S = \mat{s_{11} & s_{12} \\ s_{21} & s_{22}},\ \ s_{jk} \equiv \f{1}{n}\sum_{i=1}^n $x_j^{(i)} – \bar{x}_j$$x_k^{(i)} – \bar{x}_k$,\ \ j,k=1,2.
\end{align*}

Therefore, to know the principal component, we can find $S_y^2= \bm{w}^\T S \bm{w}$ that maximizes $S_y^2= \bm{w}^\T S \bm{w}$, but such $\bm{w}$ is $\|\bm{w}\| \to \infty$ and diverges.

We are now interested not in the magnitude of the coefficient vector $\bm{w}$, but in the ratio of its components (i.e., the direction of the projection axis). So, we impose the constraint $\bm{w}^\T \bm{w} = 1$ on the coefficient vector. Then, the problem of finding the principal components changes to the following optimization problem.

\begin{align*}
\max_{\bm{w}} \bm{w}^\T S \bm{w}\ \ {\rm subject\ to}\ \ \bm{w}^\T\bm{w}=1.
\end{align*}

The maximization problem under such constraints can be solved by Lagrange’s undetermined multiplier method.

Method Of Lagrange Multiplier Under Equality Conditions
This article describes Method Of Lagrange Multiplier, a powerful technique for solving mathematical optimization problems.

That is, the Lagrange multiplier is $\lambda$ and the Lagrange function

\begin{align}
L(\bm{w}, \lambda) = \bm{w}^\T S \bm{w} + \lambda(1 – \bm{w}^\T \bm{w})\label{eq:lag}
\end{align}

The solution is obtained by finding the stopping point (the point at which the differentiated function is zero)

Lagrangian function: $\eq{eq:lag}$ partial differentiation of by $\bm{w}$ yields

\begin{align}
\pd{L}{\bm{w}} &= \pd{}{\bm{w}}(\bm{w}^\T S\bm{w}) + \lambda \pd{}{\bm{w}}(1-\bm{w}^\T \bm{w}) \n&= \pd{}{\bm{w}}(\bm{w}^\T S \bm{w}) – \lambda \pd{}{\bm{w}}(\bm{w}^\T \bm{w}) \n\n&= 2S\bm{w} -2\lambda \bm{w}.\label{eq:divlag}
\end{align}

Here, the vector differential formula $\partial(\bm{w}^\T S\bm{w}) / \partial \bm{w} = 2S \bm{w},\partial(\bm{w}^\T \bm{w}) / \partial \bm{w} = 2 \bm{w}$ was used.

Derivation Of "Differentiate By Vector" Formula
When studying machine learning theory, we often see the operation of "differentiating a scalar by a vector. In this article, we derive the formula for "differentiating a scalar by a vector.

Then, by setting $\eq{eq:divlag}$ to $\bm{0}$, we obtain the following equation.

\begin{align*} S\bm{w} = \lambda\bm{w}. \end{align*}

Thus, principal component analysis can be attributed to the eigenvalue problem of the variance-covariance matrix.

## Determination of principal components

The problem of “finding the projection axis with the largest variance when projecting the data” has been replaced by the following eigenvalue problem.

\begin{align*} S\bm{w} = \lambda\bm{w}. \end{align*}

So what are the implications of the eigenvalue $\lambda$? Let $\bm{w}^\T$ act from the left in the above equation. Then, we get

\begin{align}
\bm{w}^\T S \bm{w} &= \bm{w}^\T \lambda \bm{w} \n&= \lambda \bm{w}^\T \bm{w} \n&= \lambda\ \ \textstyle \because (\bm{w}^\T \bm{w}=1)\label{eq:lambda}
\end{align}

Here, we used the fact that the eigenvectors are normalized ($\bm{w}^\T \bm{w}=1$). Then the $\eq{eq:lambda}$ left side shows that the variance of the data on the projection axis $s_y^2 = \bm{w}^\T S \bm{w}$ from $\eq{eq:var}$. Thus, the eigenvalue $\lambda$ means the variance of the data on the projective axis itself.

\begin{align*} {\rm Var}[y] = \lambda \end{align*}

From the above, we see that for the eigenvalue problem $S\bm{w} = \lambda\bm{w}$, the eigenvector $\bm{w}_1$ belonging to the largest eigenvalue $\lambda_1$ gives the direction of the first principal component.

Also, the variance-covariance matrix $S$ is by its definition a symmetric matrix. For symmetric matrices, eigenvectors belonging to different eigenvalues have the property of being orthogonal to each other.

Thus, for the eigenvalue problem $S\bm{w} = \lambda\bm{w}$, we also see that the eigenvector $\bm{w}_2$ belonging to the second largest eigenvalue $\lambda_2$ gives the direction of the second principal component.

## Projection to multiple dimensions

In the above we have seen the case of projection into one dimension, but let us consider the case of compression into a multidimensional subspace in general. Let us represent the observed $n$ $p$-dimensional data using matrices as follows.

\begin{align*}
\us X_{[n \times p]}=\mat{x_1^{(1)} & x_2^{(1)} & \cdots & x_q^{(1)} &\cdots & x_p^{(1)} \\ x_1^{(2)} & x_2^{(2)} & \cdots & x_q^{(2)} &\cdots & x_p^{(2)} \\ \vdots & \vdots & \ddots & \vdots & \ddots & \vdots \\ x_1^{(n)} & x_2^{(n)} & \cdots & x_q^{(n)} &\cdots & x_p^{(n)}\\}=\mat{- & \bm{x}^{(1)\T} & – \\ – & \bm{x}^{(2)\T} & – \\ & \vdots & \\ – & \bm{x}^{(n)\T} & – \\}
\end{align*}

Consider dimensionality reduction of the above $p$-dimensional data to $q$-dimensional data $(q From the previous discussion, it is clear that the$q$-dimensional space after dimensionality reduction is a space in which the first$q$principal components are stretched from the first principal component. In order to find the principal components, the variance-covariance matrix based on the observed$n$size$p$-dimensional data is$\us S_{[p \times p]}$, and its eigenvalues and eigenvectors are arranged in descending order by$q. \begin{align*} &\lambda_1 &\geq \t\ \ &\lambda_2 &\geq \t\ \ &\cdots &\geq \t\ \ &\lambda_q \\&\bm{w}_1 &\ \t\ \ &\bm{w}_2 & \t\ \ & & \t\ \ &\bm{w}_q \end{align*} where\bm{w}_j = (w_{j1}, w_{j2}, \dots, w_{jp})^\T$. Then the first principal component$y_1$to the$q$principal component$y_Qcan be expressed as follows. \begin{align*} {\rm 1st\ Principal\ Component} \t y_1 &= w_{11}x_1 + w_{12}x_2 + \cdots + w_{1p}x_p = \bm{w}^\T _1 \bm{x} \\{\rm 2nd\ Principal\ Component} \t y_2 &= w_{21}x_1 + w_{22}x_2 + \cdots + w_{2p}x_p = \bm{w}^\T _2 \bm{x} \\&\ \ \vdots \\ q {\rm th\ Principal\ Component} \t y_q &= w_{q1}x_1 + w_{q2}x_2 + \cdots + w_{qp}x_p = \bm{w}^\T _q \bm{x} \end{align*} Adapting principal component analysis, the observedi$-th$p$-dimensional data$\bm{x}^{(i)} = $x_1^{(i)}, x_2^{(i)}, \dots, x_p^{(i)}$^\T$is projected to the lower-dimensional$q$-dimensional space$(q

\begin{align*}
y_j^{(i)} = w_{j1}x_1^{(i)} + w_{j2}x_2^{(i)} + \cdots + w_{jp}x_p^{(i)} &= \bm{w}_j^\T \bm{x}^{(i)} \\&=\bm{x}^{(i)\T} \bm{w}_j\\(i = 1, \dots, n), \ \ (j=1, \dots,q)
\end{align*}

This projection can be described collectively using matrices as follows

\begin{align*}
\underbrace{\mat{y_1^{(1)} & y_2^{(1)} & \cdots & y_q^{(1)} \\ y_1^{(2)} & y_2^{(2)} & \cdots & y_q^{(2)} \\ \vdots & \vdots & \ddots & \vdots \\ y_1^{(n)} & y_2^{(n)} & \cdots & y_q^{(n)} \\}}_{\equiv \us Y_{[n \times q]}}＝\underbrace{\mat{- & \bm{x}^{(1)\T} & – \\ – & \bm{x}^{(2)\T} & – \\ & \vdots & \\ – & \bm{x}^{(n)\T} & – \\}}_{\equiv \us X_{[n \times p]}}\underbrace{\mat{ | & | & & | \\ \bm{w}_1 & \bm{w}_2 & \cdots & \bm{w}_q \\ | & | & & |}}_{\equiv \us W_{[p \times q]}}
\end{align*}

\begin{align}
\large\therefore \us Y_{[n \times q]} = \us X_{[n \times p]} \us W_{[p \times q]}
\end{align}

Thus, projection by principal component analysis is accomplished by constructing a projection matrix $W$ with eigenvectors arranged in columns and applying it to the observed data matrix $X$ from the right. 