# Theory of Support Vector Machines (SVM)

## Introduction

A support vector machine (SVM) is a supervised machine learning algorithm for pattern recognition.

The idea of “maximizing the margin” makes it an excellent two-class classification algorithm, with the advantages of high generalization ability and no problems with convergence to local optima.

There are two variants: the “hard margin,” which assumes the data are linearly separable, and the “soft margin,” which handles data that are not linearly separable by allowing some degree of misclassification. This article explains the theory behind the hard-margin support vector machine.


## Hard Margin SVM

The goal of the hard margin support vector machine is to find the “optimal” hyperplane $\bm{w}^\T \bm{x} + b = 0$ that completely separates the two classes of data.

where $\bm{x}$ and $\bm{w}$ are the $p$-dimensional data vector and the $p$-dimensional parameter vector, respectively.

\begin{align*} \bm{x} = \mat{x_1 \\ x_2 \\ \vdots \\ x_p},\t \bm{w} = \mat{w_1 \\ w_2 \\ \vdots \\ w_p}. \end{align*}

Although any number of hyperplanes can be drawn that completely separate the two classes of data, the “optimal” hyperplane is the separating hyperplane that maximizes the margin.

Here, the distance between the separating hyperplane and the data closest to that separating hyperplane is called the margin.

The margin $d$ is obtained from the calculation of the distance between the point and the hyperplane as follows

\begin{align*} d = \f{|\bm{w}^\T\bm{x}_{+} + b|}{\|\bm{w}\|} = \f{|\bm{w}^\T\bm{x}_{-} + b|}{\|\bm{w}\|}. \end{align*}
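As a quick numerical check, the distance formula can be evaluated directly; the vectors below are arbitrary assumed values, not data from the article.

```python
import numpy as np

# Distance from a point x to the hyperplane w^T x + b = 0: |w^T x + b| / ||w||
# (w, b, and x are arbitrary assumed values for illustration)
w = np.array([3.0, 4.0])          # ||w|| = 5
b = -5.0
x = np.array([2.0, 4.0])

d = abs(w @ x + b) / np.linalg.norm(w)
print(d)  # |3*2 + 4*4 - 5| / 5 = 3.4
```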

(The distance between a point and a hyperplane is the general form of the distance between a point and a line; its derivation is given in a separate article, “The Distance Between A Point And A Hyperplane.”)

In finding the optimal separating hyperplane $H:\hat{\bm{w}}^\T \bm{x} + \hat{b} = 0$, we use the fact that no data exist between the hyperplanes $H_{+}, H_{-}$ that pass through the data points closest to $H$ on either side.

In other words, the hard-margin support vector machine problem reduces to the problem of “finding $\bm{w}, b$ that maximize the margin $d$ under the condition that all data are at least $d$ away from the hyperplane.”

\begin{align}
\{ \hat{\bm{w}}, \hat{b} \} = \argmax_{\bm{w}, b} d(\bm{w}, b)\t {\rm subject\ to}\ \f{|\bm{w}^\T \bm{x}^{(i)} + b|}{\| \bm{w} \|} \geq d \n
i = 1, 2, \dots, n.
\label{eq:cond}
\end{align}

However, the above equation uses absolute values as constraint conditions, which is a bit unwieldy. Since the assumption is that the data are linearly separable, let’s remove the absolute value by introducing a label variable $y^{(i)}$.

\begin{align*} y^{(i)} = \case{1 & (\bm{x}^{(i)} \in G_1) \\ -1 & (\bm{x}^{(i)} \in G_2)} \end{align*}

The label variable $y^{(i)}$ takes the value $y^{(i)}=1$ when the data point $\bm{x}^{(i)}$ belongs to the group $G_1$ (blue) and $y^{(i)}=-1$ when it belongs to the group $G_2$ (red).

Then, for the group to which the data $\bm{x}^{(i)}$ belongs, the following holds

\begin{align*} \bm{x}^{(i)} \in G_1 \Leftrightarrow \bm{w}^\T \bm{x}^{(i)} + b > 0 \n \bm{x}^{(i)} \in G_2 \Leftrightarrow \bm{w}^\T \bm{x}^{(i)} + b < 0 \end{align*}

Thus, for any given data

\begin{align*}
|\bm{w}^\T\bm{x}^{(i)} + b| = y^{(i)} (\bm{w}^\T \bm{x}^{(i)} + b) > 0, \t (i=1, 2, \dots, n)
\end{align*}
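A small numeric sketch of why the label removes the absolute value: for any correctly classified point, $y^{(i)}(\bm{w}^\T\bm{x}^{(i)} + b)$ coincides with $|\bm{w}^\T\bm{x}^{(i)} + b|$. The values of $\bm{w}, b$ and the two points below are assumptions for illustration only.

```python
import numpy as np

# w, b and the two points are assumed toy values; x_pos lies on the positive
# side of the hyperplane (y = +1), x_neg on the negative side (y = -1).
w, b = np.array([1.0, 1.0]), -1.0
x_pos = np.array([2.0, 3.0])
x_neg = np.array([-1.0, 0.0])

for x, y in [(x_pos, 1.0), (x_neg, -1.0)]:
    # y * (w^T x + b) equals |w^T x + b| whenever the point is classified correctly
    print(y * (w @ x + b), abs(w @ x + b))
```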

Thus $\eq{eq:cond}$ can be transformed to the form with the absolute value removed as follows.

\begin{align}
\{ \hat{\bm{w}}, \hat{b} \} = \argmax_{\bm{w}, b} d(\bm{w}, b)\t {\rm subject\ to}\ \f{y^{(i)}(\bm{w}^\T \bm{x}^{(i)} + b)}{\| \bm{w} \|} \geq d \n
i = 1, 2, \dots, n.
\label{eq:cond2}
\end{align}

Now, let’s organize the above equation a little more.

The constraint in the above equation is unchanged under a constant rescaling $\bm{w} \to c \bm{w},\ b \to c b$ (with $c > 0$):

\begin{align*}
\f{y^{(i)}(c\bm{w}^\T \bm{x}^{(i)} + cb)}{c\| \bm{w} \|} &= \f{cy^{(i)}(\bm{w}^\T \bm{x}^{(i)} + b)}{c\| \bm{w} \|} \n
&= \f{y^{(i)}(\bm{w}^\T \bm{x}^{(i)} + b)}{\| \bm{w} \|} \n
\end{align*}

Using this scaling freedom, we transform the constraint as follows.

\begin{align*} &\f{y^{(i)}(\bm{w}^\T \bm{x}^{(i)} + b)}{\| \bm{w} \|} \geq d \n \n &\downarrow \textstyle\t \s{{\rm Divide\ both\ sides\ by\ d}} \n \n &y^{(i)} \bigg( \underbrace{\f{1}{d \| \bm{w} \|} \bm{w}^\T}_{=\bm{w}^{*\T}} \bm{x}^{(i)} + \underbrace{\f{1}{d \| \bm{w} \|} b}_{=b^*} \bigg) \geq 1 \n \n &\downarrow \textstyle\t \s{\bm{w}^{*\T} = \f{1}{d \| \bm{w} \|} \bm{w}^\T,\ b^* = \f{1}{d \| \bm{w} \|} b} \n \n &y^{(i)} \left( \bm{w}^{*\T} \bm{x}^{(i)} + b^* \right) \geq 1 \end{align*}

The data $\bm{x}^{(i)}$ for which equality holds in $y^{(i)} \left( \bm{w}^{*\T} \bm{x}^{(i)} + b^* \right) \geq 1$ are the data closest to the separating hyperplane, that is, the support vectors $\bm{x}_{+}, \bm{x}_{-}$.

Thus, the scaled margin $d^*$ is

\begin{align*}
d^* &= \f{y_{+} (\bm{w}^{*\T}\c{myblue}{\bm{x}_{+}} + b^* ) }{\| \bm{w}^* \|} = \f{y_{-} (\bm{w}^{*\T}\c{myred}{\bm{x}_{-}} + b^* ) }{\| \bm{w}^* \|} \n
&= \f{1}{\| \bm{w}^* \|}
\end{align*}

Thus, $\eq{eq:cond2}$ can be expressed as follows

\begin{align*}
\{ \hat{\bm{w}}, \hat{b} \} = \argmax_{\bm{w}^*, b^*} \f{1}{\| \bm{w}^* \|} \t {\rm subject\ to}\ y^{(i)} \left( \bm{w}^{*\T} \bm{x}^{(i)} + b^* \right) \geq 1 \n
i = 1, 2, \dots, n.
\end{align*}

The $\bm{w}^*, b^*$ that maximize $1/ \| \bm{w}^* \|$ are equal to the $\bm{w}^*, b^*$ that minimize $\| \bm{w}^* \|^2$ (the factor $\f{1}{2}$ below is included for convenience when differentiating). From now on, $\bm{w}^*, b^*$ are again denoted as $\bm{w}, b$, and the following can be said.

The problem of finding a hyperplane that maximizes the margin reduces to the problem of finding

\begin{align*}
\{ \hat{\bm{w}}, \hat{b} \} = \argmin_{\bm{w}, b} \f{1}{2}\|\bm{w}\|^2 \t {\rm subject\ to}\ y^{(i)} \left( \bm{w}^{\T} \bm{x}^{(i)} + b \right) \geq 1 \n
i = 1, 2, \dots, n.
\end{align*}
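Before moving on, this primal problem can be handed directly to a generic constrained solver as a sanity check. The sketch below uses a tiny assumed two-point dataset (the article itself proceeds via the dual); `scipy.optimize.minimize` with the SLSQP method handles the inequality constraints.

```python
import numpy as np
from scipy.optimize import minimize

# Tiny assumed dataset: one point per class, symmetric about the origin
X = np.array([[2.0, 2.0], [-2.0, -2.0]])
y = np.array([1.0, -1.0])

def objective(z):                  # z = (w1, w2, b); minimize (1/2)||w||^2
    return 0.5 * (z[0] ** 2 + z[1] ** 2)

# Constraints y_i (w^T x_i + b) - 1 >= 0, one per data point
cons = [{"type": "ineq", "fun": lambda z, i=i: y[i] * (X[i] @ z[:2] + z[2]) - 1.0}
        for i in range(len(y))]

res = minimize(objective, x0=np.zeros(3), method="SLSQP", constraints=cons)
w_hat, b_hat = res.x[:2], res.x[2]
print(w_hat, b_hat)  # close to [0.25, 0.25] and 0 for this symmetric data
```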


## Transformation using method of Lagrange multipliers

We solve this constrained optimization problem using the method of Lagrange multipliers. Since the constraints here are inequalities, the Karush-Kuhn-Tucker (KKT) conditions apply.

Since there are $n$ constraints, one per data point, we use the Lagrange multipliers $\alpha_1, \alpha_2, \dots, \alpha_n$ and define the Lagrange function $L$ as follows.

\begin{align*}
L(\bm{w}, b, \alpha_1, \dots, \alpha_n) \equiv \f{1}{2}\|\bm{w}\|^2 - \sum_{i=1}^n \alpha_i \{ y^{(i)} (\bm{w}^\T \bm{x}^{(i)} + b) - 1 \}.
\end{align*}

Then, in the optimal solution of $\bm{w}$, the following equation holds from the KKT condition.

\begin{align*}
\case{
& \dis \pd{L(\bm{w}, b, \alpha_1, \dots, \alpha_n)}{\bm{w}} = \bm{0}, & (☆1) \n
& \dis \pd{L(\bm{w}, b, \alpha_1, \dots, \alpha_n)}{b} = 0, & (☆2) \n
&\alpha_i \geq 0, & (☆3) \n
&\alpha_i = 0,\ \s{{\rm or}}\ \ y^{(i)}(\bm{w}^\T \bm{x}^{(i)} + b) - 1 = 0. & (☆4)
}
\end{align*}

From conditions (☆1) and (☆2), we obtain, respectively,

\begin{align}
\pd{L(\bm{w}, b, \alpha_1, \dots, \alpha_n)}{\bm{w}} &= \pd{}{\bm{w}} \left( \f{1}{2}\|\bm{w}\|^2 - \sum_{i=1}^n \alpha_i \{ y^{(i)} (\bm{w}^\T \bm{x}^{(i)} + b) - 1 \} \right) \n
&= \f{1}{2}\pd{}{\bm{w}}\|\bm{w}\|^2 - \sum_{i=1}^n \alpha_i \pd{}{\bm{w}} \{ y^{(i)} ( \bm{w}^\T \bm{x}^{(i)} + b) - 1 \} \n
&= \bm{w}\, - \sum_{i=1}^n \alpha_i y^{(i)} \bm{x}^{(i)} = \bm{0} \n
& \t\t\t\t \therefore \bm{w} = \sum_{i=1}^n \alpha_i y^{(i)}\bm{x}^{(i)}.
\label{eq:lag1}
\end{align}

\begin{align}
\pd{L(\bm{w}, b, \alpha_1, \dots, \alpha_n)}{b} &= \pd{}{b} \left( \f{1}{2}\|\bm{w}\|^2 - \sum_{i=1}^n \alpha_i \{ y^{(i)} (\bm{w}^\T \bm{x}^{(i)} + b) - 1 \} \right) \n
&= -\sum_{i=1}^n \alpha_i y^{(i)} = 0 \n
& \t\t\t\t \therefore \sum_{i=1}^n \alpha_i y^{(i)} = 0.
\label{eq:lag2}
\end{align}

Substituting $\eq{eq:lag1}$ and $(\ref{eq:lag2})$ into $L(\bm{w}, b, \alpha_1, \dots, \alpha_n)$ and eliminating $\bm{w}$ and $b$, we obtain

\begin{align}
L(\bm{w}, b, \alpha_1, \dots, \alpha_n) &\equiv \f{1}{2}\|\bm{w}\|^2 - \sum_{i=1}^n \alpha_i \{ y^{(i)} (\bm{w}^\T \bm{x}^{(i)} + b) - 1 \} \n
&= \f{1}{2}\|\bm{w}\|^2 - \ub{\sum_{i=1}^n \alpha_i y^{(i)} \bm{x}^{(i) \T}}{=\bm{w}^\T} \bm{w} \, - \ub{\sum_{i=1}^n\alpha_i y^{(i)}}{=0} b + \sum_{i=1}^n \alpha_i \n
&= \f{1}{2}\|\bm{w}\|^2 -\| \bm{w} \|^2 + \sum_{i=1}^n \alpha_i \n
&= \sum_{i=1}^n \alpha_i -\f{1}{2}\|\bm{w}\|^2 \n
&= \sum_{i=1}^n \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_{i} \alpha_{j} y^{(i)} y^{(j)}\boldsymbol{x}^{(i)\T}\boldsymbol{x}^{(j)} \n
&\equiv \tilde{L}(\alpha_1, \dots, \alpha_n)
\label{eq:L_tilde}
\end{align}

The Lagrange function $L$ has thus been transformed into a function $\tilde{L}(\alpha_1, \dots, \alpha_n)$ of $\bm{\alpha}$ alone. Moreover, $\tilde{L}$ is an upward-convex (concave) function of $\bm{\alpha}$, since its quadratic term is $-\f{1}{2}$ times a positive semidefinite quadratic form.

From the above discussion, instead of solving the minimization problem for $\f{1}{2} \|\bm{w} \|^2$ directly, we can obtain the original solution by finding the optimal $\hat{\bm{\alpha}}$ in the maximization problem for $\tilde{L}(\bm{\alpha})$ (this reformulation is called the dual problem).

The problem of finding a hyperplane that maximizes the margin reduces to the problem of finding

\begin{align*}
\hat{\bm{\alpha}} &= \argmax_{\bm{\alpha}} \tilde{L}(\bm{\alpha}) \n
&= \argmax_{\bm{\alpha}} \left\{ \sum_{i=1}^n \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_{i} \alpha_{j} y^{(i)} y^{(j)}\boldsymbol{x}^{(i)\T}\boldsymbol{x}^{(j)} \right\} \n
& \t {\rm subject\ to}\ \alpha_i \geq 0,\ \sum_{i=1}^n \alpha_i y^{(i)} = 0,\ (i = 1, 2, \dots, n).
\end{align*}


## Estimation of $\hat{\bm{\alpha}}$ using the steepest descent method

The steepest descent method (here, gradient ascent) is used to find the optimal solution $\hat{\bm{\alpha}}$: an initial value $\bm{\alpha}^{[0]}$ is set at random, and since this is a maximization problem, the parameters are updated in the direction of the gradient vector, as shown below.

\begin{align} \bm{\alpha}^{[t+1]} = \bm{\alpha}^{[t]} + \eta \pd{\tilde{L}(\bm{\alpha})}{\bm{\alpha}}. \label{eq:kouka} \end{align}

To perform the steepest descent method, let us compute the value of the gradient vector $\pd{\tilde{L}(\bm{\alpha})}{\bm{\alpha}}$.

To prepare for the calculation, first rewrite $\eq{eq:L_tilde}$ as a vector-based expression.

Let the $n$ observed $p$ dimensional data be represented using matrices as follows.

\begin{align*}
\us X_{[n \times p]} = \mat{x_1^{(1)} & x_2^{(1)} & \cdots & x_p^{(1)} \\
x_1^{(2)} & x_2^{(2)} & \cdots & x_p^{(2)} \\
\vdots & \vdots & \ddots & \vdots \\
x_1^{(n)} & x_2^{(n)} & \cdots & x_p^{(n)}}
=
\mat{ – & \bm{x}^{(1)\T} & – \\
– & \bm{x}^{(2)\T} & – \\
& \vdots & \\
– & \bm{x}^{(n)\T} & – \\}.
\end{align*}

Also, the $n$ label variables $y^{(i)}$ and the $n$ undetermined multipliers $\alpha_i$ are written as column vectors as follows.

\begin{align*} \us \bm{y}_{[n \times 1]} = \mat{y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(n)}}, \t \us \bm{\alpha}_{[n \times 1]} = \mat{\alpha_1 \\ \alpha_2 \\ \vdots \\ \alpha_n}. \end{align*}

We now define the matrix $H$ to be

\begin{align*} \us H_{[n \times n]} \equiv \us \bm{y}_{[n \times 1]} \us \bm{y}^{\T}_{[1 \times n]} \odot \us X_{[n \times p]} \us X^{\T}_{[p \times n]} \end{align*}

where $\odot$ is the Hadamard (elementwise) product. That is, the components of the above matrix are $(H)_{ij} = y^{(i)}y^{(j)} \bm{x}^{(i)\T} \bm{x}^{(j)}$, and $H$ is a symmetric matrix.
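In NumPy, $H$ is one line; the data below are assumed toy values. `np.outer(y, y)` gives $\bm{y}\bm{y}^\T$ and `*` is the elementwise (Hadamard) product.

```python
import numpy as np

# Assumed toy data: n = 3 points in p = 2 dimensions
X = np.array([[1.0, 2.0], [3.0, 0.0], [0.0, 1.0]])
y = np.array([1.0, -1.0, 1.0])

# H = (y y^T) ⊙ (X X^T); componentwise H_ij = y_i y_j x_i . x_j
H = np.outer(y, y) * (X @ X.T)

print(H[0, 1], y[0] * y[1] * (X[0] @ X[1]))  # both -3.0; H is symmetric
```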

Then $\eq{eq:L_tilde}$ can be written in vector form as

\begin{align*}
\tilde{L}(\bm{\alpha}) &\equiv \sum_{i=1}^n \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_{i} \alpha_{j} y^{(i)} y^{(j)}\boldsymbol{x}^{(i)\T}\boldsymbol{x}^{(j)} \n
&= \sum_{i=1}^n \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_{i} \alpha_{j} (H)_{ij} \n
&= \bm{1}^\T \bm{\alpha} -\f{1}{2} \bm{\alpha}^\T H \bm{\alpha},
\end{align*}

where $\bm{1}^\T \bm{\alpha} = \sum_{i=1}^n \alpha_i$.

Thus, the gradient vector $\pd{\tilde{L}(\bm{\alpha})}{\bm{\alpha}}$ can be calculated using the vector derivative formula as follows

\begin{align*} \pd{\tilde{L}(\bm{\alpha})}{\bm{\alpha}} &= \pd{}{\bm{\alpha}} \bm{1}^\T \bm{\alpha}\, -\f{1}{2} \pd{}{\bm{\alpha}} \bm{\alpha}^\T H \bm{\alpha} \n &= \bm{1}\, - H \bm{\alpha}. \end{align*}

where $\bm{1}$ is an $n$-dimensional column vector $\bm{1} = (1, 1, \dots, 1)^\T$ where all components are 1.
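The gradient formula can be verified numerically against finite differences of $\tilde{L}$; the matrix $H$ below is an arbitrary symmetric positive semidefinite stand-in, not derived from any particular dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 3))
H = A @ A.T                        # symmetric PSD, like y y^T ⊙ X X^T

def L_tilde(alpha):
    # L~(alpha) = 1^T alpha - (1/2) alpha^T H alpha
    return np.sum(alpha) - 0.5 * alpha @ H @ alpha

alpha = rng.normal(size=3)
analytic = np.ones(3) - H @ alpha  # the formula derived above

# Central finite differences of L~ around alpha
eps = 1e-6
numeric = np.zeros(3)
for k in range(3):
    e = np.zeros(3)
    e[k] = eps
    numeric[k] = (L_tilde(alpha + e) - L_tilde(alpha - e)) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))  # close to 0
```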

(The formula for differentiating a scalar by a vector is derived in a separate article, “Derivation Of ‘Differentiate By Vector’ Formula.”)

From the above, the update rule $\eq{eq:kouka}$ for the undetermined multiplier $\bm{\alpha}$ is

\begin{align*} \bm{\alpha}^{[t+1]} = \bm{\alpha}^{[t]} + \eta \left( \bm{1}\, - H \bm{\alpha}^{[t]} \right). \end{align*}
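As a sketch of this update rule, the loop below runs it on a tiny assumed two-point dataset that is symmetric about the origin. On such data the plain update happens to converge without explicitly enforcing the constraints $\alpha_i \geq 0$ and $\sum_i \alpha_i y^{(i)} = 0$; in general those constraints must also be handled.

```python
import numpy as np

# Assumed toy data: one point per class, mirror images of each other
X = np.array([[1.0, 1.0], [-1.0, -1.0]])
y = np.array([1.0, -1.0])
H = np.outer(y, y) * (X @ X.T)     # here H = [[2, 2], [2, 2]]

eta = 0.1                          # learning rate (must be small enough)
alpha = np.zeros(2)
for _ in range(1000):
    alpha = alpha + eta * (np.ones(2) - H @ alpha)

print(alpha)  # converges to about [0.25, 0.25]
```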

## Compute the parameters $\hat{\bm{w}}, \hat{b}$ of the separating hyperplane

The objective of hard-margin SVM is to find the separating hyperplane with the largest margin, in other words, the parameters $\hat{\bm{w}}, \hat{b}$ of the separating hyperplane.

Once the optimal solution $\hat{\bm{\alpha}}$ of the undetermined multipliers has been computed using the steepest descent method, we can find $\hat{\bm{w}}$ and $\hat{b}$ in turn.

### Find $\hat{\bm{w}}$.

Once the optimal solution $\hat{\bm{\alpha}}$ for the undetermined multipliers is obtained, $\hat{\bm{w}}$ can be calculated from $\eq{eq:lag1}$.

\begin{align} \hat{\bm{w}} = \sum_{i=1}^n \hat{\alpha}_i y^{(i)} \bm{x}^{(i)}. \label{eq:w} \end{align}

Now, let us consider the above equation a little more. From (☆4) of the KKT conditions, $\hat{\alpha}_i$ and $\bm{x}^{(i)}$ satisfy

\begin{align*} \hat{\alpha}_i = 0,\ \s{{\rm or}}\ \ y^{(i)}(\bm{w}^\T \bm{x}^{(i)} + b) - 1 = 0. \end{align*}

On the other hand, the data $\bm{x}^{(i)}$ satisfy

\begin{align*}
\case{y^{(i)}(\bm{w}^\T \bm{x}^{(i)} + b) - 1 = 0 & (\bm{x}^{(i)} {\rm \ is\ a\ support\ vector}), \\
y^{(i)}(\bm{w}^\T \bm{x}^{(i)} + b) - 1 > 0 & \rm{(otherwise)}}
\end{align*}

Therefore,

\begin{align*} \case{ \hat{\alpha}_i \neq 0 \Leftrightarrow \bm{x}^{(i)} \s{{\rm is\ a\ support\ vector}}, \\ \hat{\alpha}_i = 0 \Leftrightarrow \bm{x}^{(i)} \s{{\rm is\ not\ a\ support\ vector}}. \\ } \end{align*}

Therefore, in the sum in $\eq{eq:w}$, the terms for data that are not support vectors have $\hat{\alpha}_i = 0$ and do not contribute to the summation. In other words, only the support vectors need to be summed, which saves computation.

\begin{align*} \hat{\bm{w}} = \sum_{\bm{x}^{(i)} \in S} \hat{\alpha}_i y^{(i)} \bm{x}^{(i)}. \end{align*}

(where $S$ is the set of support vectors)

### Find $\hat{b}$.

If we find $\hat{\bm{w}}$ from the above, we can calculate $\hat{b}$ from the relation $y^{(i)}(\hat{\bm{w}}^\T \bm{x}^{(i)} + \hat{b}) - 1 = 0$, which holds when $\bm{x}^{(i)}$ is a support vector.

\begin{align*} \hat{b} &= \f{1}{y^{(i)}} - \hat{\bm{w}}^\T \bm{x}^{(i)} \n &= y^{(i)} - \hat{\bm{w}}^\T \bm{x}^{(i)} \t (\because y^{(i)} = 1, -1). \end{align*}

Thus, a single support vector is enough to compute $\hat{b}$; in practice, however, the average over all support vectors is taken to reduce numerical error, as follows.

\begin{align*} \hat{b} = \f{1}{|S|} \sum_{\bm{x}^{(i)} \in S} (y^{(i)} - \hat{\bm{w}}^\T \bm{x}^{(i)}). \end{align*}

(where $|S|$ is the number of support vectors)

From the above, we can determine the separating hyperplane by finding $\hat{\bm{w}}, \hat{b}$.
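Putting the whole procedure together, here is a minimal end-to-end sketch in NumPy on a small assumed dataset (two points per class, symmetric about the origin so that $\sum_i \alpha_i y^{(i)} = 0$ holds automatically). The clipping step `np.maximum(alpha, 0)` is an addition to the plain update above, included so that the constraint $\alpha_i \geq 0$ is respected.

```python
import numpy as np

# Assumed toy data: linearly separable, each class the mirror of the other
X = np.array([[2.0, 2.0], [3.0, 2.0], [-2.0, -2.0], [-3.0, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
n = len(y)

# H_ij = y_i y_j x_i . x_j  (Hadamard product of y y^T and X X^T)
H = np.outer(y, y) * (X @ X.T)

# Steepest ascent on L~(alpha), clipping to keep alpha_i >= 0
eta, alpha = 0.02, np.zeros(n)
for _ in range(10000):
    alpha = alpha + eta * (np.ones(n) - H @ alpha)
    alpha = np.maximum(alpha, 0.0)

# w_hat = sum over support vectors of alpha_i y_i x_i
S = alpha > 1e-6                   # indices of the support vectors
w_hat = (alpha[S] * y[S]) @ X[S]

# b_hat = average of y_i - w_hat^T x_i over the support vectors
b_hat = np.mean(y[S] - X[S] @ w_hat)

# Every training point satisfies y_i (w_hat^T x_i + b_hat) >= 1
print(w_hat, b_hat)                # about [0.25, 0.25] and 0
print(y * (X @ w_hat + b_hat))
```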

For an implementation of support vector machines, see the companion article on Yukkuri Machine Learning, “Support Vector Machine (SVM) Implementation In Python,” which implements a hard-margin SVM both from scratch and with scikit-learn.