Theory And Implementation Of Multiple Regression Analysis

Machine Learning

Introduction

In the previous article, we discussed linear regression by the method of least squares.

There, the regression line was obtained with the mass of the weight suspended on the spring as the explanatory variable and the length of the spring as the objective variable. In other words, this was a case where there was only one explanatory variable. So how do we construct a regression model when there are multiple explanatory variables?

As a specific example, let us take apartment rent, as shown in Table 1.
The three explanatory variables are area, building age, and walking distance to the station; rent is the objective variable.
What formulation should we use to construct a model that predicts rent from the data on area, building age, and walking distance to the station?

▲Table 1: We want to build a model that predicts the objective variable from three explanatory variables
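
Concretely, the answer we will arrive at is a linear model of the form below, where the coefficients $\theta_0, \dots, \theta_3$ are estimated from the data (the general formulation and its derivation follow in the next sections):

\begin{align*}
\text{rent} = \theta_0 + \theta_1 \times \text{area} + \theta_2 \times \text{age} + \theta_3 \times \text{distance}
\end{align*}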

This article introduces multiple regression analysis, which models the relationship between multiple explanatory variables and an objective variable, and derives the normal equation for the regression coefficients.

We will also implement multiple regression analysis in Python. The code can be tried below.

Google Colab

\begin{align*}
\newcommand{\mat}[1]{\begin{pmatrix} #1 \end{pmatrix}}
\newcommand{\f}[2]{\frac{#1}{#2}}
\newcommand{\pd}[2]{\frac{\partial #1}{\partial #2}}
\newcommand{\d}[2]{\frac{{\rm d}#1}{{\rm d}#2}}
\newcommand{\T}{\mathsf{T}}
\newcommand{\(}{\left(}
\newcommand{\)}{\right)}
\newcommand{\{}{\left\{}
\newcommand{\}}{\right\}}
\newcommand{\[}{\left[}
\newcommand{\]}{\right]}
\newcommand{\dis}{\displaystyle}
\newcommand{\eq}[1]{{\rm Eq}(\ref{#1})}
\newcommand{\n}{\notag\\}
\newcommand{\t}{\ \ \ \ }
\newcommand{\tt}{\t\t\t\t}
\newcommand{\argmax}{\mathop{\rm arg\, max}\limits}
\newcommand{\argmin}{\mathop{\rm arg\, min}\limits}
\def\l<#1>{\left\langle #1 \right\rangle}
\def\us#1_#2{\underset{#2}{#1}}
\def\os#1^#2{\overset{#2}{#1}}
\newcommand{\case}[1]{\{ \begin{array}{ll} #1 \end{array} \right.}
\newcommand{\s}[1]{{\scriptstyle #1}}
\definecolor{myblack}{rgb}{0.27,0.27,0.27}
\definecolor{myred}{rgb}{0.78,0.24,0.18}
\definecolor{myblue}{rgb}{0.0,0.443,0.737}
\definecolor{myyellow}{rgb}{1.0,0.82,0.165}
\definecolor{mygreen}{rgb}{0.24,0.47,0.44}
\newcommand{\c}[2]{\textcolor{#1}{#2}}
\newcommand{\ub}[2]{\underbrace{#1}_{#2}}
\end{align*}

Capture relationships among multiple variables

In the example above we used three explanatory variables, but more generally we consider $p$ explanatory variables $x_1, x_2, \dots, x_p$ and an objective variable $y$.

Table 2 below is a generalization of Table 1 and shows $n$ observations of the data. For example, at the $i$-th data point, the objective variable $y^{(i)}$ corresponds to the explanatory variable values $x_1^{(i)}, x_2^{(i)}, \dots, x_p^{(i)}$.

▲Table 2: Correspondence table for $n$ observations of data with $p$ explanatory variables. The columns represent the type of explanatory variables and the rows represent the index of the data

The objective is to construct a model linking the objective variable $y$ with the explanatory variables $x_1, x_2, \dots, x_p$. Based on trends in the data and prior knowledge, we assume that the underlying structure of the phenomenon is linear.

That is, the relationship between the objective and explanatory variables can be expressed by a linear equation.

\begin{align*}
y = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_p x_p
\end{align*}

where $\theta_0, \dots, \theta_p$ are the regression coefficients.

Considering that the actual data are displaced from the regression equation by an error $\varepsilon$, each data point satisfies the following relationship (the data index is written as a superscript in parentheses):

\begin{align*}
y^{(1)} &= \theta_0 + \theta_1 x_1^{(1)} + \theta_2 x_2^{(1)} + \cdots + \theta_p x_p^{(1)} + \varepsilon ^{(1)}, \n
y^{(2)} &= \theta_0 + \theta_1 x_1^{(2)} + \theta_2 x_2^{(2)} + \cdots + \theta_p x_p^{(2)} + \varepsilon ^{(2)}, \n
& \ \ \vdots \n
y^{(i)} &= \theta_0 + \theta_1 x_1^{(i)} + \theta_2 x_2^{(i)} + \cdots + \theta_p x_p^{(i)} + \varepsilon ^{(i)}, \n
&\ \ \vdots \n
y^{(n)} &= \theta_0 + \theta_1 x_1^{(n)} + \theta_2 x_2^{(n)} + \cdots + \theta_p x_p^{(n)} + \varepsilon ^{(n)}.
\end{align*}

A model that captures this relationship between two or more explanatory variables and the objective variable is called a linear multiple regression model.

Now, the above system of equations can be expressed in matrix form as follows

\begin{align*}
\underbrace{\mat{y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(i)} \\ \vdots \\y^{(n)}}}_{\equiv \bm{y}}
=
\underbrace{\mat{
1 & x_1^{(1)} & x_2^{(1)} & \cdots & x_p^{(1)} \\
1 & x_1^{(2)} & x_2^{(2)} & \cdots & x_p^{(2)} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
1 & x_1^{(i)} & x_2^{(i)} & \cdots & x_p^{(i)} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
1 & x_1^{(n)} & x_2^{(n)} & \cdots & x_p^{(n)}
}}_{\equiv X}
\underbrace{\mat{\theta_0 \\ \theta_1 \\ \theta_2 \\ \vdots \\ \theta_p}}_{\equiv \bm{\theta}}
+
\underbrace{\mat{\varepsilon^{(1)} \\ \varepsilon^{(2)} \\ \vdots \\ \varepsilon^{(i)} \\ \vdots \\\varepsilon^{(n)}}}_{\equiv \bm{\varepsilon}}.
\end{align*}

Using the vectors and matrix defined above, this can be written compactly as

\begin{align}
\bm{y} = X \bm{\theta} + \bm{\varepsilon}
\label{eq:reg}
\end{align}

We will discuss how to determine this regression coefficient vector $\bm{\theta}$ in the next section.

Estimation of regression coefficients using the least squares method

As in the case of a single explanatory variable, the regression coefficient vector is estimated using the least squares method.

That is, the method finds the estimate $\hat{\bm{\theta}}$ of the regression coefficient vector that minimizes the sum of squared errors over all data points, $J \equiv (\varepsilon^{(1)})^2 + (\varepsilon^{(2)})^2 + \cdots + (\varepsilon^{(n)})^2$.

From $\eq{eq:reg}$ we have $\bm{\varepsilon} = \bm{y} - X \bm{\theta}$, so the least-squares method for linear multiple regression can be expressed as follows.

\begin{align*}
\hat{\bm{\theta}} &= \argmin_{\bm{\theta}} J(\bm{\theta}) \n
&= \argmin_{\bm{\theta}} \sum_{i=1}^n (\varepsilon^{(i)})^2 \n
&= \argmin_{\bm{\theta}}\, \bm{\varepsilon}^\T \bm{\varepsilon} \n
&= \argmin_{\bm{\theta}}\, (\bm{y} - X \bm{\theta})^\T (\bm{y} - X \bm{\theta}).
\end{align*}

Therefore, as before, the estimate is obtained by taking the partial derivative of $J(\bm{\theta})$ with respect to $\bm{\theta}$ and setting it equal to $\bm{0}$.

\begin{align*}
J(\bm{\theta}) &= (\bm{y} - X \bm{\theta})^\T (\bm{y} - X \bm{\theta}) \n
&= (\bm{y}^\T - \bm{\theta}^\T X^\T) (\bm{y} - X \bm{\theta}) \t \t (\because (AB)^\T = B^\T A^\T) \n
&= \bm{y}^\T \bm{y} - \bm{y}^\T X \bm{\theta} - \underbrace{\bm{\theta}^\T X^\T \bm{y}}_{= \bm{y}^\T X \bm{\theta}} + \bm{\theta}^\T X^\T X \bm{\theta} \n
&= \bm{y}^\T \bm{y} - 2 \bm{y}^\T X \bm{\theta} + \bm{\theta}^\T X^\T X \bm{\theta}
\end{align*}

Next, we differentiate with respect to the vector $\bm{\theta}$ to find the extremum.

\begin{align*}
\pd{J(\bm{\theta})}{\bm{\theta}} &= \pd{}{\bm{\theta}} \( \bm{y}^\T \bm{y} - 2 \bm{y}^\T X \bm{\theta} + \bm{\theta}^\T X^\T X \bm{\theta} \) \n
&= \underbrace{\pd{}{\bm{\theta}} (\bm{y}^\T \bm{y})}_{=0} - 2 \pd{}{\bm{\theta}} (\bm{y}^\T X \bm{\theta}) + \pd{}{\bm{\theta}} (\bm{\theta}^\T X^\T X \bm{\theta}) \n
&= -2 X^\T \bm{y} + 2X^\T X \bm{\theta} = \bm{0}.
\end{align*}

Here, the last line of the transformation used the vector derivative formulas.
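
For reference, the formulas in question are the following standard identities for differentiating with respect to a vector, where $\bm{a}$ is a constant vector and $A$ is a constant matrix:

\begin{align*}
\pd{}{\bm{\theta}} (\bm{a}^\T \bm{\theta}) = \bm{a}, \t\t
\pd{}{\bm{\theta}} (\bm{\theta}^\T A \bm{\theta}) = (A + A^\T) \bm{\theta}
\end{align*}

Applying them with $\bm{a} = X^\T \bm{y}$ (note that $\bm{y}^\T X \bm{\theta} = (X^\T \bm{y})^\T \bm{\theta}$) and $A = X^\T X$, which is symmetric, yields the terms $-2 X^\T \bm{y}$ and $2 X^\T X \bm{\theta}$ above.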

Thus, we obtain the following equation, called the normal equation

\begin{align*}
X^\T X \bm{\theta} = X^\T \bm{y}
\end{align*}

From the above, if the inverse of the matrix $X^\T X$ exists, the least-squares estimate is given by

\begin{align*}
\hat{\bm{\theta}} = (X^\T X)^{-1} X^\T \bm{y}\t (☆)
\end{align*}

Perform linear multiple regression analysis using Python

Now, let’s perform a linear multiple regression analysis using Python. The regression coefficient vector $\hat{\bm{\theta}}$ can be obtained by calculating equation (☆).

For ease of visualization, the example data will have only two explanatory variables: area and building age.

import numpy as np
import pandas as pd

area = [17.00, 18.00, 21.00, 24.00, 18.90, 20.66, 23.51, 25.21, 24.94, 30.22]
age = [31, 34, 36, 31, 28, 22, 19, 12, 4, 0]
rent = [35000, 36000, 39000, 42000, 46000, 50000, 59000, 68000, 73000, 86000]

df = pd.DataFrame({
    'area': area,
    'age': age,
    'rent': rent})

First, visualize the target 3D data using plotly.

import plotly.express as px
import plotly.graph_objects as go


fig = px.scatter_3d(df, x='area', y='age', z='rent', color='rent', opacity=0.7)
fig.show()

Find the regression plane that fits this data.

Now, construct the matrix $X$ and the vector $\bm{y}$ needed to compute equation (☆).

X = df[['area', 'age']].values
ones = np.ones(len(X)).reshape(-1, 1)
X = np.hstack((ones, X))
print('X')
print(X)

y = df['rent'].values
print('y')
print(y)
# Output
X
[[ 1.   17.   31.  ]
 [ 1.   18.   34.  ]
 [ 1.   21.   36.  ]
 [ 1.   24.   31.  ]
 [ 1.   18.9  28.  ]
 [ 1.   20.66 22.  ]
 [ 1.   23.51 19.  ]
 [ 1.   25.21 12.  ]
 [ 1.   24.94  4.  ]
 [ 1.   30.22  0.  ]]
y
[35000 36000 39000 42000 46000 50000 59000 68000 73000 86000]

Compute equation (☆) to obtain the regression coefficient vector $\hat{\bm{\theta}}$.

 
def multiple_regression(X, y):
    # Least-squares estimate from equation (☆): theta = (X^T X)^{-1} X^T y
    # np.linalg.pinv computes the (pseudo-)inverse of X^T X
    theta = np.linalg.pinv(X.T @ X) @ X.T @ y
    return theta

theta = multiple_regression(X, y)
theta_0 = theta[0]
theta_1 = theta[1]
theta_2 = theta[2]

print(f'theta_0: {theta_0}')
print(f'theta_1: {theta_1}')
print(f'theta_2: {theta_2}')
# Output
theta_0: 50300.79618461159
theta_1: 1154.3386202815495
theta_2: -1045.7759592710272
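
As an aside, explicitly inverting $X^\T X$ is not the only way to evaluate equation (☆). Here is a minimal sketch of two equivalent NumPy alternatives, assuming the X and y built above:

# Solve the normal equation X^T X theta = X^T y directly, without forming an inverse
theta_solve = np.linalg.solve(X.T @ X, X.T @ y)

# Least-squares solution computed from X itself
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(theta_solve)
print(theta_lstsq)

Up to numerical precision, both should agree with the result of multiple_regression above.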

Now that the regression coefficients have been obtained, let’s plot the regression plane.

mesh_size = 1
margin = 0.1
x1_min, x1_max = df['area'].min()-margin, df['area'].max()+margin
x2_min, x2_max = df['age'].min()-margin, df['age'].max()+margin
x1_range = np.arange(x1_min, x1_max, mesh_size)
x2_range = np.arange(x2_min, x2_max, mesh_size)
xx1, xx2 = np.meshgrid(x1_range, x2_range)

yy = theta_0 + theta_1 * xx1 + theta_2 * xx2  # predicted rent over the mesh grid

fig = px.scatter_3d(df, x='area', y='age', z='rent', opacity=0.8)
fig.update_traces(marker=dict(
    color='royalblue',
    size=8,
    line=dict(width=2, color='white')))

fig.add_traces(go.Surface(x=x1_range, y=x2_range, z=yy, opacity=0.7))
fig.show()

Thus, the regression plane was obtained by solving the normal equation.
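
As an optional check of how well the plane fits, here is a minimal sketch that computes the coefficient of determination $R^2$, assuming the X, y, and theta defined above are still in scope:

# Predicted rents from the fitted model
y_pred = X @ theta

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y - y_pred) ** 2)
ss_tot = np.sum((y - np.mean(y)) ** 2)
r2 = 1 - ss_res / ss_tot
print(f'R^2: {r2}')

The closer $R^2$ is to 1, the better the plane explains the variation in rent.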

Perform a linear multiple regression analysis using scikit-learn

Multiple regression analysis can be performed in scikit-learn as well.

from sklearn.linear_model import LinearRegression

area = [17.00, 18.00, 21.00, 24.00, 18.90, 20.66, 23.51, 25.21, 24.94, 30.22]
age = [31, 34, 36, 31, 28, 22, 19, 12, 4, 0]
rent = [35000, 36000, 39000, 42000, 46000, 50000, 59000, 68000, 73000, 86000]
df = pd.DataFrame({
    'area': area,
    'age': age,
    'rent': rent})

X = df[['area', 'age']].values
y = df['rent'].values

# Find the regression coefficients
model = LinearRegression()
model.fit(X, y)

theta_0 = model.intercept_
theta_1 = model.coef_[0]
theta_2 = model.coef_[1]

print(f'theta_0: {theta_0}')
print(f'theta_1: {theta_1}')
print(f'theta_2: {theta_2}')
# Output
theta_0: 50300.79618451819
theta_1: 1154.3386202847946
theta_2: -1045.7759592701218
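
The coefficients agree with those obtained from the normal equation. The fitted model can also make predictions directly; as a small illustration with made-up input values (area 20, age 10):

# Predict the rent of a hypothetical apartment (area = 20, age = 10)
x_new = np.array([[20.0, 10.0]])
print(model.predict(x_new))

Finally, plot the regression plane in the same way as before.
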
mesh_size = 1
margin = 0.1
x1_min, x1_max = df['area'].min()-margin, df['area'].max()+margin
x2_min, x2_max = df['age'].min()-margin, df['age'].max()+margin
x1_range = np.arange(x1_min, x1_max, mesh_size)
x2_range = np.arange(x2_min, x2_max, mesh_size)
xx1, xx2 = np.meshgrid(x1_range, x2_range)

yy = theta_0 + theta_1 * xx1 + theta_2 * xx2  # predicted rent over the mesh grid

fig = px.scatter_3d(df, x='area', y='age', z='rent', opacity=0.8)
fig.update_traces(marker=dict(
    color='royalblue',
    size=8,
    line=dict(width=2, color='white')))

fig.add_traces(go.Surface(x=x1_range, y=x2_range, z=yy, opacity=0.7))
fig.show()

The regression plane was obtained in scikit-learn as well.
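
As a final sanity check, assuming the vector theta obtained earlier from the normal equation is still in scope, we can confirm that the two approaches give essentially the same coefficients:

# Compare the normal-equation solution with scikit-learn's coefficients
print(np.allclose(theta, [model.intercept_, *model.coef_]))

This should print True (up to the default numerical tolerance of np.allclose).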
