# Explanation Of Logistic Regression Theory

## Introduction

Logistic regression is a kind of predictive model for the case where the objective variable $y$ is binary (e.g. $y = 0, 1$) and the explanatory variable $x$ is real-valued. As an example, suppose that $n$ pairs of data are observed, where $y = 1$ if an individual responds to a certain numerical level $x$ and $y = 0$ if not.


\begin{align*}
&\left(x^{(1)}, y^{(1)}\right), \left(x^{(2)}, y^{(2)}\right), \dots, \left(x^{(n)}, y^{(n)}\right), \\
&x^{(i)} \in \mathbb{R}, \quad y^{(i)} = \begin{cases} 1, & \text{(positive)} \\ 0, & \text{(negative)} \end{cases}
\end{align*}

Consider building a model to estimate the response probability $p$ to the numerical level $x$ for such binary data.

Now, introducing a random variable $Y$ to indicate whether or not a response was made, the response probability for the numerical level $x$ is expressed as $\Pr(Y=1|x) = p$, and the non-response probability is expressed as $\Pr(Y = 0|x) = 1-p$.

To construct the model, we must consider the functional form of the response probability $p$. Since $p$ is a probability, its range is $0 \leq p \leq 1$. On the other hand, the factor $x$ that causes the response is a real number, $-\infty < x < \infty$. The logistic function $\sigma(\cdot)$ is used as the function connecting $x$ and $p$; the resulting model is the "logistic model".

\begin{align*}
\text{logistic function:}\quad \sigma(z) \equiv \frac{1}{1 + \exp(-z)}.
\end{align*}

That is, in the logistic model the response probability $p$ is expressed with parameters $w_0, w_1$ as

\begin{align*}
p = \sigma(w_0 + w_1 x) = \frac{1}{1 + \exp\left(-(w_0 + w_1 x)\right)}, \\
(0 \leq p \leq 1), \quad (-\infty < x < \infty).
\end{align*}

Estimating the values of the parameters $w_0, w_1$ from the $n$ pairs of observed data yields a model that outputs the probability of a response for any given numerical level $x$. The problem is how to estimate the parameters, which is clarified in the next section.

## Logistic model estimation

While the discussion above involved a single explanatory variable $x$, we now consider the more general setting with $p$ explanatory variables and estimate the probability that governs the objective variable $y$. The $i$-th observation of the $p$ explanatory variables $x_1, x_2, \dots, x_p$ and the objective variable $y$ is

\begin{align*}
\left(x_1^{(i)}, x_2^{(i)}, \dots, x_p^{(i)}\right), \quad y^{(i)} = \begin{cases} 1, & \text{(positive)} \\ 0, & \text{(negative)} \end{cases}, \quad (i = 1, 2, \dots, n).
\end{align*}

Introducing the random variable $Y$ as before, the response and non-response probabilities are, respectively,

\begin{align*}
\Pr(Y=1|x_1, \dots, x_p) = p, \quad \Pr(Y=0|x_1, \dots, x_p) = 1-p,
\end{align*}

and the relationship between the explanatory variables $x_1, x_2, \dots, x_p$ and the response probability $p$ is

\begin{align}
p &= \frac{1}{1 + \exp\left(-(w_0 + w_1 x_1 + w_2 x_2 + \cdots + w_p x_p)\right)} \notag\\
&= \frac{1}{1 + \exp(-\bm{w}^\mathsf{T} \bm{x})}, \label{eq:proj}
\end{align}

where $\bm{w} = (w_0, w_1, \dots, w_p)^\mathsf{T}$ and $\bm{x} = (1, x_1, x_2, \dots, x_p)^\mathsf{T}$.

Now let us estimate the value of the parameter vector $\bm{w}$ from the $n$ pairs of observed data. We estimate the parameters by the maximum likelihood method.

The $i$-th data point consists of the random variable $Y^{(i)}$, which indicates whether a response occurred, and its realization $y^{(i)}$. If the true response probability for the $i$-th data point is $p^{(i)}$, the response and non-response probabilities are, respectively,

\begin{align*}
\Pr(Y^{(i)}=1) = p^{(i)}, \quad \Pr(Y^{(i)}=0) = 1-p^{(i)}.
\end{align*}

This means that the random variable $Y^{(i)}$ follows the Bernoulli distribution:

\begin{align*}
\Pr(Y^{(i)} = y^{(i)}) &= \mathrm{Bern}(y^{(i)}\,|\,p^{(i)}) \\
&= (p^{(i)})^{y^{(i)}} (1 - p^{(i)})^{1-y^{(i)}}, \\
&\quad (y^{(i)} = 0, 1), \quad (i=1, 2, \dots, n).
\end{align*}

Thus, the likelihood function based on $y^{(1)}, y^{(2)}, \dots, y^{(n)}$ is

\begin{align}
L(p^{(1)}, p^{(2)}, \dots, p^{(n)}) &= \prod_{i=1}^n \mathrm{Bern}(y^{(i)}\,|\,p^{(i)}) \notag\\
&= \prod_{i=1}^n (p^{(i)})^{y^{(i)}} (1 - p^{(i)})^{1-y^{(i)}}. \label{eq:likelihood}
\end{align}

Substituting Eq. $(\ref{eq:proj})$ into Eq. $(\ref{eq:likelihood})$ shows that the likelihood function is a function of the parameter $\bm{w}$. By the principle of maximum likelihood, the parameter that maximizes this likelihood function is the estimate $\bm{w}^*$ we wish to find.
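As a concrete illustration of the model $(\ref{eq:proj})$ and the likelihood $(\ref{eq:likelihood})$, here is a minimal NumPy sketch; the array names `X`, `w`, `y` and the generated data are illustrative assumptions, not part of the original derivation.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative data: n observations of p explanatory variables.
rng = np.random.default_rng(0)
n, p = 100, 3
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, p))])  # prepend the constant 1 (x_0 = 1)
w_true = np.array([0.5, 1.0, -1.0, 2.0])                   # hypothetical "true" parameters
y = (rng.random(n) < sigmoid(X @ w_true)).astype(float)    # binary objective variable

# Response probabilities p^(i) = sigma(w^T x^(i)) for a candidate parameter vector w.
w = np.zeros(p + 1)
prob = sigmoid(X @ w)

# Log of the likelihood function: sum_i [ y log p + (1 - y) log(1 - p) ].
log_likelihood = np.sum(y * np.log(prob) + (1 - y) * np.log(1 - prob))
print(log_likelihood)   # equals n * log(0.5) for w = 0, since every p^(i) is 0.5
```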
Let us therefore determine the parameter $\bm{w}$ by maximizing the log-likelihood function, i.e. the logarithm of the likelihood function.

\begin{align}
\bm{w}^* &= \mathop{\mathrm{arg\,max}}\limits_{\bm{w}}\ \log L(p^{(1)}, p^{(2)}, \dots, p^{(n)}) \notag\\
&= \mathop{\mathrm{arg\,max}}\limits_{\bm{w}}\ \log \left\{ \prod_{i=1}^n (p^{(i)})^{y^{(i)}} (1 - p^{(i)})^{1-y^{(i)}} \right\} \notag\\
&= \mathop{\mathrm{arg\,max}}\limits_{\bm{w}}\ \sum_{i=1}^n \log\left\{ (p^{(i)})^{y^{(i)}} (1 - p^{(i)})^{1-y^{(i)}} \right\} \quad \because \textstyle \left(\log\prod_i c_i = \sum_i \log c_i\right) \notag\\
&= \mathop{\mathrm{arg\,max}}\limits_{\bm{w}}\ \sum_{i=1}^n \left\{ y^{(i)}\log p^{(i)} + (1-y^{(i)})\log (1 - p^{(i)}) \right\} \notag\\
&= \mathop{\mathrm{arg\,min}}\limits_{\bm{w}}\ -\sum_{i=1}^n \left\{ y^{(i)}\log p^{(i)} + (1-y^{(i)})\log (1 - p^{(i)}) \right\}. \label{eq:loglike}
\end{align}

(In the last line, a minus sign is added to turn the maximization problem into a minimization problem.) We denote the term in the last line by $J(\bm{w})$:

\begin{align*}
J(\bm{w}) \equiv -\sum_{i=1}^n \left\{ y^{(i)}\log p^{(i)} + (1-y^{(i)})\log (1 - p^{(i)}) \right\}.
\end{align*}

From the above, the $\bm{w}$ that minimizes $J(\bm{w})$ is the estimate $\bm{w}^*$ we want to find:

\begin{align*}
\bm{w}^* = \mathop{\mathrm{arg\,min}}\limits_{\bm{w}} J(\bm{w}).
\end{align*}

Since it is difficult to express the estimate $\bm{w}^*$ analytically in closed form, the steepest descent method is used here to obtain it. In the steepest descent method, we set a random initial parameter value $\bm{w}^{[0]}$ and repeatedly update the parameter in the downhill direction given by the negative derivative, as shown below:

\begin{align}
\bm{w}^{[t+1]} = \bm{w}^{[t]} - \eta\frac{\partial J(\bm{w})}{\partial \bm{w}}.
\end{align}

To perform the steepest descent method, we must compute the value of the partial derivative $\frac{\partial J(\bm{w})}{\partial \bm{w}}$; a minimal code sketch of the update loop is given after the figure note below.

▲ Image of the steepest descent method: update the value of $\bm{w}$ little by little, looking for a trough of the loss function $J$.
## Calculation of optimal model parameters

To calculate the value of the partial derivative $\frac{\partial J(\bm{w})}{\partial \bm{w}}$, we substitute Eq. $(\ref{eq:proj})$ into $J(\bm{w})$ and rearrange:

\begin{align*}
J(\bm{w}) &= \sum_{i=1}^n \left[-y^{(i)}\log p^{(i)} - (1-y^{(i)})\log (1 - p^{(i)})\right] \\
&\,\downarrow \quad \textstyle \left(\text{Eq. }(\ref{eq:proj}):\ p^{(i)} = \frac{1}{1+\exp(-\bm{w}^\mathsf{T} \bm{x}^{(i)})}\right) \\
&= \sum_{i=1}^n \biggl[-y^{(i)}\underbrace{\log \left(\frac{1}{1+\exp(-\bm{w}^\mathsf{T} \bm{x}^{(i)})}\right)}_{=\log 1 - \log\left(1+\exp(-\bm{w}^\mathsf{T} \bm{x}^{(i)})\right)} - (1-y^{(i)})\underbrace{\log \left(\frac{\exp(-\bm{w}^\mathsf{T} \bm{x}^{(i)})}{1+\exp(-\bm{w}^\mathsf{T} \bm{x}^{(i)})}\right)}_{=-\bm{w}^\mathsf{T}\bm{x}^{(i)} - \log\left(1 + \exp(-\bm{w}^\mathsf{T} \bm{x}^{(i)})\right)} \biggr] \\
&= \sum_{i=1}^n \left[y^{(i)}\log\left(1+\exp(-\bm{w}^\mathsf{T} \bm{x}^{(i)})\right) + (1-y^{(i)})\left\{\bm{w}^\mathsf{T}\bm{x}^{(i)} + \log\left(1 + \exp(-\bm{w}^\mathsf{T} \bm{x}^{(i)})\right) \right\}\right] \\
&\,\downarrow \quad (\text{expand the second term}) \\
&= \sum_{i=1}^n \Bigl[\cancel{y^{(i)}\log\left(1+\exp(-\bm{w}^\mathsf{T} \bm{x}^{(i)})\right)} + (1-y^{(i)})\bm{w}^\mathsf{T}\bm{x}^{(i)} \\
&\qquad\qquad\qquad + \log\left(1 + \exp(-\bm{w}^\mathsf{T} \bm{x}^{(i)})\right) \cancel{-\,y^{(i)}\log\left(1 + \exp(-\bm{w}^\mathsf{T} \bm{x}^{(i)})\right)}\Bigr] \\
&= \sum_{i=1}^n \underbrace{\left[(1-y^{(i)})\bm{w}^\mathsf{T}\bm{x}^{(i)} + \log\left(1 + \exp(-\bm{w}^\mathsf{T} \bm{x}^{(i)})\right)\right]}_{=\varepsilon^{(i)}(\bm{w})}.
\end{align*}

We would like to differentiate the above expression with respect to $\bm{w}$, but it is somewhat complicated, so we use the derivative of a composite function (the chain rule). Setting $u^{(i)} = \bm{w}^\mathsf{T} \bm{x}^{(i)}$, we decompose the derivative as follows:

\begin{align}
\frac{\partial J(\bm{w})}{\partial \bm{w}} &= \frac{\partial}{\partial \bm{w}} \sum_{i=1}^n \varepsilon^{(i)}(\bm{w}) \notag\\
&= \sum_{i=1}^n \frac{\partial \varepsilon^{(i)}(\bm{w})}{\partial \bm{w}} \notag\\
&= \sum_{i=1}^n \frac{\partial \varepsilon^{(i)}}{\partial u^{(i)}} \frac{\partial u^{(i)}}{\partial \bm{w}}. \label{eq:chain}
\end{align}

1. Calculate the partial derivative $\frac{\partial \varepsilon^{(i)}}{\partial u^{(i)}}$. From $\varepsilon^{(i)} = (1-y^{(i)})u^{(i)} + \log\left(1 + \exp(-u^{(i)})\right)$,

\begin{align}
\frac{\partial \varepsilon^{(i)}}{\partial u^{(i)}} &= \frac{\partial}{\partial u^{(i)}} \left[(1-y^{(i)})u^{(i)} + \log\left(1 + \exp(-u^{(i)})\right)\right] \notag\\
&= (1-y^{(i)}) - \frac{\exp(-u^{(i)})}{1 + \exp(-u^{(i)})} \notag\\
&= \left(1 - \frac{\exp(-u^{(i)})}{1 + \exp(-u^{(i)})}\right) - y^{(i)} \notag\\
&= \frac{1}{1 + \exp(-u^{(i)})} - y^{(i)} \notag\\
&\,\downarrow \quad \textstyle \left(\text{Eq. }(\ref{eq:proj}):\ p^{(i)} = \frac{1}{1+\exp(-u^{(i)})}\right) \notag\\
&= p^{(i)} - y^{(i)}. \label{eq:pd_epsilon}
\end{align}

2. Calculate the partial derivative $\frac{\partial u^{(i)}}{\partial \bm{w}}$.

\begin{align}
\frac{\partial u^{(i)}}{\partial \bm{w}} &= \frac{\partial (\bm{w}^\mathsf{T} \bm{x}^{(i)})}{\partial \bm{w}} \notag\\
&= \bm{x}^{(i)} \quad \because \text{(vector differentiation formula)}. \label{eq:pd_u}
\end{align}

(The formula for differentiating a scalar by a vector is derived in a separate article.)

From the above, substituting Eqs. $(\ref{eq:pd_epsilon})$ and $(\ref{eq:pd_u})$ into Eq. $(\ref{eq:chain})$ gives

\begin{align}
\frac{\partial J(\bm{w})}{\partial \bm{w}} = \sum_{i=1}^n \left(p^{(i)} - y^{(i)}\right) \bm{x}^{(i)}. \label{eq:grad}
\end{align}

## Matrix notation of optimal model parameters

Another representation of Eq. $(\ref{eq:grad})$ can be given using matrices. Let the $n$ observed $p$-dimensional data points be represented by the following matrix:

\begin{align*}
\underset{[n \times (p+1)]}{X}=\begin{pmatrix}1 & x_1^{(1)} & x_2^{(1)} & \cdots & x_p^{(1)} \\ 1 & x_1^{(2)} & x_2^{(2)} &\cdots & x_p^{(2)} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_1^{(n)} & x_2^{(n)} &\cdots & x_p^{(n)}\end{pmatrix}.
\end{align*}

Also, the $n$ objective variables $y^{(i)}$ and response probabilities $p^{(i)}$ are represented by column vectors:

\begin{align*}
\underset{[n \times 1]}{\bm{y}} = \begin{pmatrix}y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(n)}\end{pmatrix},\quad
\underset{[n \times 1]}{\bm{p}} = \begin{pmatrix}p^{(1)} \\ p^{(2)} \\ \vdots \\ p^{(n)}\end{pmatrix}.
\end{align*}

Using the above, we write Eq. $(\ref{eq:grad})$ in matrix notation.

\begin{align*}
\frac{\partial J(\bm{w})}{\partial \bm{w}} = \left(\frac{\partial J}{\partial w_0},\ \frac{\partial J}{\partial w_1},\ \frac{\partial J}{\partial w_2},\ \dots,\ \frac{\partial J}{\partial w_p}\right)^\mathsf{T}
\end{align*}

\begin{align*}
\frac{\partial J}{\partial w_j} &= \sum_{i=1}^n \left(p^{(i)} - y^{(i)}\right) x^{(i)}_j \\
&= \sum_{i=1}^n x^{(i)}_j \left(p^{(i)} - y^{(i)}\right), \\
&\qquad (j = 0, 1, 2, \dots, p, \quad \text{where } x^{(i)}_0 = 1).
\end{align*}

In matrix form,

\begin{align*}
\frac{\partial J(\bm{w})}{\partial \bm{w}} = \underset{[(p+1) \times 1]}{\begin{pmatrix}\frac{\partial J}{\partial w_0} \\ \frac{\partial J}{\partial w_1} \\ \frac{\partial J}{\partial w_2} \\ \vdots \\ \frac{\partial J}{\partial w_p}\end{pmatrix}} = \underbrace{\underset{[(p+1)\times n]}{\begin{pmatrix}1 & 1 & \cdots & 1 \\ x^{(1)}_1 & x^{(2)}_1 & \cdots & x^{(n)}_1 \\ x^{(1)}_2 & x^{(2)}_2 & \cdots & x^{(n)}_2 \\ \vdots & \vdots & \ddots & \vdots \\ x^{(1)}_p & x^{(2)}_p & \cdots & x^{(n)}_p\end{pmatrix}}}_{=X^\mathsf{T}}\underbrace{\underset{[n\times 1]}{\begin{pmatrix}p^{(1)} - y^{(1)} \\ p^{(2)} - y^{(2)} \\ \vdots \\ p^{(n)} - y^{(n)}\end{pmatrix}}}_{=\bm{p}-\bm{y}}
\end{align*}

\begin{align}
\therefore\ \frac{\partial J(\bm{w})}{\partial \bm{w}} = X^\mathsf{T}(\bm{p} - \bm{y}).
\end{align} 
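
Putting this result together with the earlier sketches, here is a minimal sketch of fitting the logistic model by steepest descent using the matrix form $X^\mathsf{T}(\bm{p} - \bm{y})$. It reuses the illustrative `sigmoid`, `gradient_descent`, `X`, and `y` defined above; the step size and iteration count are arbitrary assumptions.

```python
def grad_J(w, X, y):
    """Gradient of the loss J(w) in matrix form: X^T (p - y)."""
    prob = sigmoid(X @ w)      # vector of response probabilities p^(i)
    return X.T @ (prob - y)

# Steepest descent with the gradient just derived.
w_star = gradient_descent(grad_J, X, y, eta=0.01, n_steps=5000)
predicted_prob = sigmoid(X @ w_star)   # estimated response probabilities for each observation
print(w_star)                          # should lie near the illustrative w_true used to generate y
```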