Logistic Regression Explained: Theory

Machine Learning

Introduction

Logistic regression is a model for predicting a binary target variable $y$ (for example, $y = 0, 1$) from an explanatory variable $x$ that takes real values. As a hypothetical example, suppose we observe $n$ pairs of data in which an individual responds ($y = 1$) or does not respond ($y = 0$) to a certain numerical level $x$.

\begin{align*}
\newcommand{\mat}[1]{\begin{pmatrix} #1 \end{pmatrix}}
\newcommand{\f}[2]{\frac{#1}{#2}}
\newcommand{\pd}[2]{\frac{\partial #1}{\partial #2}}
\newcommand{\d}[2]{\frac{{\rm d}#1}{{\rm d}#2}}
\newcommand{\T}{\mathsf{T}}
\newcommand{\(}{\left(}
\newcommand{\)}{\right)}
\newcommand{\{}{\left\{}
\newcommand{\}}{\right\}}
\newcommand{\[}{\left[}
\newcommand{\]}{\right]}
\newcommand{\dis}{\displaystyle}
\newcommand{\eq}[1]{{\rm Eq}(\ref{#1})}
\newcommand{\n}{\notag\\}
\newcommand{\t}{\ \ \ \ }
\newcommand{\argmax}{\mathop{\rm arg\, max}\limits}
\newcommand{\argmin}{\mathop{\rm arg\, min}\limits}
\def\l<#1>{\left\langle #1 \right\rangle}
\def\us#1_#2{\underset{#2}{#1}}
\def\os#1^#2{\overset{#2}{#1}}
\newcommand{\case}[1]{\{ \begin{array}{ll} #1 \end{array} \right.}
\newcommand{\s}[1]{{\scriptstyle #1}}
\definecolor{myblack}{rgb}{0.27,0.27,0.27}
\definecolor{myred}{rgb}{0.78,0.24,0.18}
\definecolor{myblue}{rgb}{0.0,0.443,0.737}
\definecolor{myyellow}{rgb}{1.0,0.82,0.165}
\definecolor{mygreen}{rgb}{0.24,0.47,0.44}
\end{align*}

\begin{align*}
\(x^{(1)}, y^{(1)}\), \(x^{(2)}, y^{(2)}\), \dots, \(x^{(n)}, y^{(n)}\),\\x^{(i)} \in \mathbb{R}, \t y^{(i)} = \cases{1, & (positive). \\ 0, & (negative).}
\end{align*}

For such binary data, let us consider building a model to estimate the response probability $p$ for a numerical level $x$.

Here, by introducing a random variable $Y$ that represents whether or not a response occurred, the response probability for numerical level $x$ is expressed as $\Pr(Y=1|x) = p$, and the non-response probability is expressed as $\Pr(Y = 0|x)= 1-p$.

To build a model, we must consider the functional form of the response probability $p$. Since $p$ is a probability, its range is $0 \leq p \leq 1$. Also, the factor $x$ that causes the response is a real value: $-\infty < x < \infty$.

The “logistic model” uses the logistic function $\sigma(\cdot)$ as the function that connects $x$ and $p$.

\begin{align*}
\text{Logistic function: } \sigma(z) \equiv \f{1}{1 + \exp(-z)}.
\end{align*}

That is, in the logistic model, the response probability $p$ is expressed using parameters $w_0, w_1$ as follows.

\begin{align*}
p = \sigma(w_0 + w_1x) = \f{1}{1 + \exp\[-(w_0 + w_1x)\]}, \\\\(0 \leq p \leq 1), \t (-\infty < x < \infty). \t\t\t
\end{align*}

By estimating the values of parameters $w_0, w_1$ based on the observed $n$ pairs of data, we obtain a model that outputs the probability of responding to any numerical level $x$.
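As a minimal numerical sketch (not part of the original derivation), the single-variable model can be written in a few lines of Python with NumPy. The parameter values `w0`, `w1` below are arbitrary placeholders, not estimates.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real z into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical parameter values w0, w1 (placeholders, not estimates).
w0, w1 = -3.0, 1.5

# Response probability p = sigma(w0 + w1 * x) for a few numerical levels x.
x = np.array([0.0, 1.0, 2.0, 3.0])
print(sigmoid(w0 + w1 * x))   # every entry lies strictly between 0 and 1
```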

What remains is how to estimate these parameters; we clarify this in the next section.

Estimation of the Logistic Model

While the discussion above involved a single explanatory variable $x$, let us now consider more generally the estimation of the response probability of the target variable $y$ given $p$ explanatory variables.

Now, let the $i$-th data observed for $p$ explanatory variables $x_1, x_2, \dots, x_p$ and target variable $y$ be

\begin{align*}
\(x_1^{(i)}, x_2^{(i)}, \dots, x_p^{(i)} \), \t y^{(i)} = \cases{1, & (positive) \\ 0, & (negative)}, \\(i = 1, 2, \dots, n).
\end{align*}

Introducing the random variable $Y$ as before, the response probability and non-response probability are respectively

\begin{align*}
\Pr(Y=1|x_1, \dots, x_p) = p,\t \Pr(Y=0|x_1, \dots, x_p) = 1-p
\end{align*}

and the relationship between the explanatory variables $x_1, x_2, \dots, x_p$ and the response probability $p$ is connected by

\begin{align}
p &= \f{1}{1 + \exp\[-(w_0 + w_1x_1 + w_2x_2 + \cdots + w_px_p)\]} \n&= \f{1}{1 + \exp(-\bm{w}^\T \bm{x})}\label{eq:proj}
\end{align}

Here, $\bm{w} = (w_0, w_1, \dots, w_p)^\T$ and $\bm{x} = (1, x_1, x_2, \dots, x_p)^\T$.
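A hedged sketch of $\eq{eq:proj}$ for $p$ explanatory variables: prepending a constant $1$ to the feature vector absorbs the intercept $w_0$ into $\bm{w}$. The function and variable names here are my own.

```python
import numpy as np

def predict_proba(w, x_raw):
    """Response probability 1 / (1 + exp(-w^T x)) with x = (1, x_1, ..., x_p)^T."""
    x = np.concatenate(([1.0], x_raw))    # prepend the constant term for w0
    return 1.0 / (1.0 + np.exp(-(w @ x)))

w = np.array([-1.0, 0.8, -0.3])   # (w0, w1, w2), placeholder values
x_raw = np.array([2.0, 1.0])      # (x_1, x_2)
print(predict_proba(w, x_raw))
```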


Now let us estimate the value of the parameter vector $\bm{w}$ based on the observed $n$ pairs of data. We will consider estimating the parameters using maximum likelihood estimation.

Now, for the $i$-th data, we have a pair consisting of the random variable $Y^{(i)}$ representing whether or not a response occurred and its realized value $y^{(i)}$. Therefore, letting the true response probability for the $i$-th data be $p^{(i)}$, the response probability and non-response probability are respectively

\begin{align*}
\Pr(Y^{(i)}=1) = p^{(i)},\t \Pr(Y^{(i)}=0) = 1-p^{(i)}
\end{align*}

In other words, we can see that the probability distribution that the random variable $Y^{(i)}$ follows is the following Bernoulli distribution.

\begin{align*}
\Pr(Y^{(i)} = y^{(i)}) &= {\rm Bern}(y^{(i)}|p^{(i)}) \\&= (p^{(i)})^{y^{(i)}} (1 - p^{(i)})^{1-y^{(i)}}, \\\\(y^{(i)} = 0, 1),&\t (i=1, 2, \dots, n).
\end{align*}
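As a quick sanity check (my own illustration, not from the original article), the Bernoulli form $(p)^{y}(1-p)^{1-y}$ simply selects $p$ when $y=1$ and $1-p$ when $y=0$:

```python
def bern(y, p):
    """Bernoulli pmf p^y * (1 - p)^(1 - y) for y in {0, 1}."""
    return p**y * (1.0 - p)**(1 - y)

print(bern(1, 0.75))  # 0.75  (= p,     the response case)
print(bern(0, 0.75))  # 0.25  (= 1 - p, the non-response case)
```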

Therefore, the likelihood function based on $y^{(1)}, y^{(2)}, \dots, y^{(n)}$ is as follows.

\begin{align}
L(p^{(1)}, p^{(2)}, \dots, p^{(n)}) &= \prod_{i=1}^n {\rm Bern}(y^{(i)}|p^{(i)}) \n&= \prod_{i=1}^n (p^{(i)})^{y^{(i)}} (1 - p^{(i)})^{1-y^{(i)}}.\label{eq:likelihood}
\end{align}

By substituting $\eq{eq:proj}$ into $\eq{eq:likelihood}$, we see that the likelihood function is a function of the parameter $\bm{w}$. According to the principle of maximum likelihood estimation, the parameter value that maximizes this likelihood function is the desired estimate $\bm{w}^*$.

Therefore, let us determine the parameter $\bm{w}$ that maximizes the log-likelihood function, i.e., the $\log$ of the likelihood function.

\begin{align}
\bm{w}^* &= \us \argmax_{\bm{w}}\ \log L(p^{(1)}, p^{(2)}, \dots, p^{(n)}) \n&= \us \argmax_{\bm{w}}\ \log\{ \prod_{i=1}^n (p^{(i)})^{y^{(i)}} (1 - p^{(i)})^{1-y^{(i)}} \} \n&= \us \argmax_{\bm{w}}\ \sum_{i=1}^n \log\{ (p^{(i)})^{y^{(i)}} (1 - p^{(i)})^{1-y^{(i)}} \}\ \ \because \textstyle (\log\prod_i c_i = \sum_i \log c_i) \n&= \us \argmax_{\bm{w}}\ \sum_{i=1}^n \{ y^{(i)}\log p^{(i)} + (1-y^{(i)})\log (1 - p^{(i)}) \} \n&= \us \argmin_{\bm{w}} \ -\sum_{i=1}^n \{ y^{(i)}\log p^{(i)} + (1-y^{(i)})\log (1 - p^{(i)}) \}.\label{eq:loglike}
\end{align}

(In the final line, we attach a minus sign to transform the maximization problem into a minimization problem.)

Let us denote the expression being minimized in the final line by $J(\bm{w})$.

\begin{align*}
J(\bm{w}) \equiv -\sum_{i=1}^n \{ y^{(i)}\log p^{(i)} + (1-y^{(i)})\log (1 - p^{(i)}) \}
\end{align*}

From the above, the $\bm{w}$ that minimizes $J(\bm{w})$ is precisely the desired estimate $\bm{w}^*$.

\begin{align*}
\bm{w}^* = \argmin_{\bm{w}} J(\bm{w})
\end{align*}
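The loss $J(\bm{w})$ translates directly into code. Below is a minimal NumPy sketch (the names and the small clipping constant are my own additions; the clipping merely guards against $\log(0)$ and is not part of the derivation). The design matrix `X` is assumed to already contain a leading column of ones.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_J(w, X, y, eps=1e-12):
    """J(w) = -sum_i [ y_i log p_i + (1 - y_i) log(1 - p_i) ].

    X is assumed to already contain a leading column of ones, so p = sigma(X w).
    The clipping by eps is a numerical safeguard against log(0).
    """
    p = np.clip(sigmoid(X @ w), eps, 1.0 - eps)
    return -np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

# Tiny usage example: two observations, one feature plus the constant term.
X = np.array([[1.0, 0.5], [1.0, 2.0]])
y = np.array([0.0, 1.0])
print(loss_J(np.zeros(2), X, y))   # equals 2 * log 2 at w = 0
```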

Now, since it is difficult to express the estimate $\bm{w}^*$ analytically in closed form, we will use gradient descent to find it. In gradient descent, we set the initial value of the parameter $\bm{w}^{[0]}$ randomly and repeatedly update the parameter in the direction of the negative gradient (the descent direction), as follows.

\begin{align}
\bm{w}^{[t+1]} = \bm{w}^{[t]} - \eta\pd{J(\bm{w})}{\bm{w}}.
\end{align}
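As a purely illustrative sketch (not from this article), the update rule can already be run by approximating $\pd{J(\bm{w})}{\bm{w}}$ with central finite differences; the closed-form gradient derived below is what one would use in practice. The toy data, learning rate, and iteration count are arbitrary choices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_J(w, X, y, eps=1e-12):
    p = np.clip(sigmoid(X @ w), eps, 1.0 - eps)
    return -np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

def numerical_grad(f, w, h=1e-6):
    """Central-difference approximation of dJ/dw, component by component."""
    g = np.zeros_like(w)
    for j in range(w.size):
        e = np.zeros_like(w)
        e[j] = h
        g[j] = (f(w + e) - f(w - e)) / (2.0 * h)
    return g

# Toy data: X includes the leading column of ones for w0.
X = np.array([[1.0, 0.5], [1.0, 1.0], [1.0, 1.5], [1.0, 2.0], [1.0, 3.0]])
y = np.array([0.0, 0.0, 1.0, 1.0, 1.0])

w = np.zeros(2)              # w^[0]: initial value
eta = 0.1                    # learning rate
for _ in range(500):         # w^[t+1] = w^[t] - eta * dJ/dw
    w = w - eta * numerical_grad(lambda v: loss_J(v, X, y), w)
print(w, loss_J(w, X, y))    # J decreases as w approaches the minimizer
```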

To perform gradient descent, let us calculate the value of the partial derivative $\pd{J(\bm{w})}{\bm{w}}$.

▲Image of gradient descent. We update the value of $\bm{w}$ little by little to search for the valley of the loss function $J$

Calculation of Optimal Model Parameters

To calculate the value of the partial derivative $\pd{J(\bm{w})}{\bm{w}}$, we substitute $\eq{eq:proj}$ into $J(\bm{w})$ and simplify.

\begin{align*}
J(\bm{w}) &= \sum_{i=1}^n \[ -y^{(i)}\log p^{(i)} - (1-y^{(i)})\log (1 - p^{(i)}) \] \n
&\,\downarrow \textstyle\t \s{\text{Substitute }\eq{eq:proj}:\ p^{(i)} = \frac{1}{1+\exp(-\bm{w}^\T \bm{x}^{(i)})}} \n
&= \sum_{i=1}^n \biggl[-y^{(i)}\underbrace{\log \(\frac{1}{1+\exp(-\bm{w}^\T \bm{x}^{(i)})}\)}_{=\log 1 - \log(1+\exp(-\bm{w}^\T \bm{x}^{(i)}))} - (1-y^{(i)})\underbrace{\log \(\frac{\exp(-\bm{w}^\T \bm{x}^{(i)})}{1+\exp(-\bm{w}^\T \bm{x}^{(i)})}\)}_{=-\bm{w}^\T\bm{x}^{(i)} - \log \(1 + \exp(-\bm{w}^\T \bm{x}^{(i)})\)} \biggr] \n
&= \sum_{i=1}^n \[y^{(i)}\log\(1+\exp(-\bm{w}^\T \bm{x}^{(i)})\) + (1-y^{(i)})\{\bm{w}^\T\bm{x}^{(i)} + \log \(1 + \exp(-\bm{w}^\T \bm{x}^{(i)})\) \} \] \n
&\,\downarrow \s{\text{Expand the second term}} \n
&= \sum_{i=1}^n \[ \cancel{y^{(i)}\log\(1+\exp(-\bm{w}^\T \bm{x}^{(i)})\)} + (1-y^{(i)})\bm{w}^\T\bm{x}^{(i)} + \log \(1 + \exp(-\bm{w}^\T \bm{x}^{(i)})\) \cancel{-y^{(i)}\log \(1 + \exp(-\bm{w}^\T \bm{x}^{(i)})\)} \] \n
&= \sum_{i=1}^n \underbrace{\[ (1-y^{(i)})\bm{w}^\T\bm{x}^{(i)} + \log \(1 + \exp(-\bm{w}^\T \bm{x}^{(i)})\) \]}_{\equiv\, \varepsilon^{(i)}(\bm{w})}.
\end{align*}

We now want to take the partial derivative of the above expression with respect to $\bm{w}$, but since it is somewhat involved, we will use the chain rule for composite functions. Let $u^{(i)} = \bm{w}^\T \bm{x}^{(i)}$ and decompose the derivative as follows.

\begin{align}
\pd{J(\bm{w})}{\bm{w}} &= \pd{}{\bm{w}} \sum_{i=1}^n \varepsilon^{(i)}(\bm{w}) \n&= \sum_{i=1}^n \pd{\varepsilon^{(i)}(\bm{w})}{\bm{w}} \n&= \sum_{i=1}^n \pd{\varepsilon^{(i)}}{u^{(i)}} \pd{u^{(i)}}{\bm{w}}.\label{eq:chain}
\end{align}


1. Calculate the value of the partial derivative $\pd{\varepsilon^{(i)}}{u^{(i)}}$.
Now, since $\varepsilon^{(i)} = (1-y^{(i)})u^{(i)} + \log \(1 + \exp(-u^{(i)})\)$,

\begin{align}
\pd{\varepsilon^{(i)}}{u^{(i)}} &= \pd{}{u^{(i)}} \[ (1-y^{(i)})u^{(i)} + \log \(1 + \exp(-u^{(i)})\) \] \n&= (1-y^{(i)}) - \f{\exp(-u^{(i)})}{1 + \exp(-u^{(i)})} \n&= \(1 - \f{\exp(-u^{(i)})}{1 + \exp(-u^{(i)})}\) - y^{(i)} \n&= \f{1}{1 + \exp(-u^{(i)})} - y^{(i)} \n&\,\downarrow \textstyle\t \s{\text{Substitute }\eq{eq:proj}:\ p^{(i)} = \frac{1}{1+\exp(-u^{(i)})}} \n&= p^{(i)} - y^{(i)}.\label{eq:pd_epsilon}
\end{align}
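A quick numerical check of $\eq{eq:pd_epsilon}$ (my own, not in the original article): the analytic value $p^{(i)} - y^{(i)}$ should agree with a central difference of $\varepsilon^{(i)}$ taken with respect to $u^{(i)}$.

```python
import numpy as np

def epsilon(u, y):
    """epsilon = (1 - y) * u + log(1 + exp(-u))."""
    return (1.0 - y) * u + np.log1p(np.exp(-u))

u, y, h = 0.7, 1.0, 1e-6
numeric = (epsilon(u + h, y) - epsilon(u - h, y)) / (2.0 * h)
analytic = 1.0 / (1.0 + np.exp(-u)) - y      # p - y with p = sigma(u)
print(numeric, analytic)                      # the two values agree closely
```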

2. Calculate the value of the partial derivative $\pd{u^{(i)}}{\bm{w}}$.

\begin{align}
\pd{u^{(i)}}{\bm{w}} &= \pd{(\bm{w}^\T \bm{x}^{(i)})}{\bm{w}} \n&= \bm{x}^{(i)}\ \ \because \s{\text{Formula for vector differentiation}}.\label{eq:pd_u}
\end{align}

Derivation of the "differentiation with respect to a vector" formula
When studying machine learning theory, you often come across the operation of "differentiating a scalar with respect to a vector." That article derives the formula for differentiating a scalar with respect to a vector.

From the above, by substituting $\eq{eq:pd_epsilon}$ and $(\ref{eq:pd_u})$ into $\eq{eq:chain}$,

\begin{align}
\pd{J(\bm{w})}{\bm{w}} = \sum_{i=1}^n \(p^{(i)} - y^{(i)} \) \bm{x}^{(i)}\label{eq:grad}
\end{align}

is obtained.
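Written as code, $\eq{eq:grad}$ is just a sum over the samples. Here is a hedged sketch (the names are mine, and `X` is assumed to carry the leading column of ones):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_J(w, X, y):
    """dJ/dw = sum_i (p_i - y_i) * x_i, accumulated sample by sample."""
    g = np.zeros_like(w)
    for x_i, y_i in zip(X, y):
        p_i = sigmoid(w @ x_i)
        g += (p_i - y_i) * x_i
    return g
```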


Matrix Notation for Optimal Model Parameters

As an alternative expression of $\eq{eq:grad}$, we now show a formulation using matrices.
We collect the observed $n$ data points on the $p$ explanatory variables into the following matrix.

\begin{align*}
\us X_{[n \times (p+1)]}=\mat{1 & x_1^{(1)} & x_2^{(1)} & \cdots & x_p^{(1)} \\ 1 & x_1^{(2)} & x_2^{(2)} &\cdots & x_p^{(2)} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_1^{(n)} & x_2^{(n)} &\cdots & x_p^{(n)}}.
\end{align*}

Also, we express the $n$ target variables $y^{(i)}$ and response probabilities $p^{(i)}$ using column vectors as follows.

\begin{align*}
\us \bm{y}_{[n \times 1]} = \mat{y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(n)}},\t\us \bm{p}_{[n\times 1]} = \mat{p^{(1)} \\ p^{(2)} \\ \vdots \\ p^{(n)}}.
\end{align*}

Using the above, let us write $\eq{eq:grad}$ in matrix notation.

\begin{align*}
\pd{J(\bm{w})}{\bm{w}} = \(\pd{J}{w_0},\pd{J}{w_1},\pd{J}{w_2}, \dots, \pd{J}{w_p} \)^\T
\end{align*}

Looking at each component, we have

\begin{align*}
\pd{J}{w_j} &= \sum_{i=1}^n \(p^{(i)} - y^{(i)} \) x^{(i)}_j \\&= \sum_{i=1}^n x^{(i)}_j \(p^{(i)} - y^{(i)} \) \\&(j = 0, 1, 2, \dots, p\t \s{where}\ x^{(i)}_0=1).
\end{align*}

Therefore, written in matrix form, it becomes the following expression.

\begin{align*}
\pd{J(\bm{w})}{\bm{w}} = \us \mat{\pd{J}{w_0} \\ \pd{J}{w_1} \\ \pd{J}{w_2} \\ \vdots \\ \pd{J}{w_p}}_{[(p+1) \times1]} = \underbrace{\us \mat{1 & 1 & \cdots & 1 \\ x^{(1)}_1 & x^{(2)}_1 & \cdots & x^{(n)}_1 \\ x^{(1)}_2 & x^{(2)}_2 & \cdots & x^{(n)}_2 \\ \vdots & \vdots & \ddots & \vdots \\ x^{(1)}_p & x^{(2)}_p & \cdots & x^{(n)}_p}_{[(p+1)\times n]}}_{=X^\T}\underbrace{\us \mat{p^{(1)} - y^{(1)} \\ p^{(2)} - y^{(2)} \\ \vdots \\ p^{(n)} - y^{(n)}}_{[n\times 1]}}_{=\bm{p}-\bm{y}}
\end{align*}

\begin{align}
\therefore \pd{J(\bm{w})}{\bm{w}} =\large X^\T(\bm{p}\, -\, \bm{y}).
\end{align}
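In the matrix form, the whole gradient is the single expression `X.T @ (p - y)`, and gradient descent becomes a short loop. The following minimal sketch uses the same assumptions as before (a leading column of ones in `X`; toy data and hyperparameters are arbitrary choices of mine) and is independent of the companion implementation linked below.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, eta=0.1, n_iter=1000):
    """Gradient descent with dJ/dw = X^T (p - y); X must include a column of ones."""
    w = np.zeros(X.shape[1])            # w^[0]: initial value
    for _ in range(n_iter):
        p = sigmoid(X @ w)              # vector of response probabilities
        w = w - eta * (X.T @ (p - y))   # w^[t+1] = w^[t] - eta * X^T (p - y)
    return w

# Tiny toy example: one explanatory variable plus the constant term.
X = np.array([[1.0, 0.5], [1.0, 1.0], [1.0, 1.5], [1.0, 2.0], [1.0, 3.0]])
y = np.array([0.0, 0.0, 1.0, 1.0, 1.0])
w_hat = fit_logistic(X, y)
print(w_hat)                            # estimated (w0, w1)
print(sigmoid(X @ w_hat))               # fitted response probabilities
```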

For the implementation of logistic regression, see here▼

Logistic Regression Explained: Implementation and Examples in Python
Logistic regression is implemented both from scratch and with scikit-learn; the implementation there is written in Python.