Linear Regression: Multiple Regression Theory and Python Implementation (From Scratch & scikit-learn)

Machine Learning

Introduction

 In the previous article, we discussed linear regression using the least squares method.

Linear Regression: Least Squares Theory and Implementation in Python (From Scratch & scikit-learn)
An introduction to the theory of the least squares method, one of the most established techniques in machine learning, together with implementations in Python and scikit-learn.

There, we determined the regression line using the mass of a weight hung on a spring as the explanatory variable and the length of the spring as the target variable. In other words, that was a case with only one explanatory variable. However, how should we construct a regression model when there are **multiple explanatory variables**?

As a concrete example, let’s consider apartment rent. As shown in Table 1, the data consists of:
・Three explanatory variables: Area, Age of building, and Walking time to station
・Target variable: Rent
How should we formulate the problem to construct a prediction model for rent using the data for area, age, and walking time?

▲Table 1: We want to build a model that predicts the target variable using three explanatory variables

 In this article, we will cover **multiple regression analysis**, which models the relationship between multiple explanatory variables and a target variable, and derive the **normal equation** for finding the regression coefficients.

We will also implement multiple regression analysis using Python. You can try the code below.

Google Colab

\begin{align*}
\newcommand{\mat}[1]{\begin{pmatrix} #1 \end{pmatrix}}
\newcommand{\f}[2]{\frac{#1}{#2}}
\newcommand{\pd}[2]{\frac{\partial #1}{\partial #2}}
\newcommand{\d}[2]{\frac{{\rm d}#1}{{\rm d}#2}}
\newcommand{\T}{\mathsf{T}}
\newcommand{\dis}{\displaystyle}
\newcommand{\eq}[1]{{\rm Eq}(\ref{#1})}
\newcommand{\n}{\notag\\}
\newcommand{\t}{\ \ \ \ }
\newcommand{\tt}{\t\t\t\t}
\newcommand{\argmax}{\mathop{\rm arg\, max}\limits}
\newcommand{\argmin}{\mathop{\rm arg\, min}\limits}
\def\l<#1>{\left\langle #1 \right\rangle}
\def\us#1_#2{\underset{#2}{#1}}
\def\os#1^#2{\overset{#2}{#1}}
\newcommand{\case}[1]{\left\{ \begin{array}{ll} #1 \end{array} \right.}
\newcommand{\s}[1]{{\scriptstyle #1}}
\newcommand{\c}[2]{\textcolor{#1}{#2}}
\newcommand{\ub}[2]{\underbrace{#1}_{#2}}
\end{align*}

Capturing relationships between multiple variables

 In the example above, we used 3 explanatory variables, but let’s consider a more general case with $p$ explanatory variables $x_1, x_2, \dots, x_p$ and a target variable $y$.

Table 2 below is a generalization of Table 1, showing the results of measuring $n$ data points. For example, in the $i$-th data point, the target variable $y^{(i)}$ is related to the set of explanatory variables $x_1^{(i)}, x_2^{(i)}, \dots, x_p^{(i)}$.

▲Table 2: Correspondence table when observing $n$ samples with $p$ explanatory variables.
Columns represent the types of explanatory variables, and rows represent the data index.

Our goal is to construct a model that links the target variable $y$ with the explanatory variables $x_1, x_2, \dots, x_p$. Here, based on data trends or prior knowledge, we assume that the structure of this phenomenon is linear.

That is, we assume the relationship between the target variable and the explanatory variables can be expressed by the linear equation:

\begin{align*}
y = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_p x_p
\end{align*}

Here, $\theta_0, \dots, \theta_p$ are the regression coefficients, with $\theta_0$ playing the role of the intercept.
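
Applied to the rent example of Table 1, this model reads:

\begin{align*}
(\text{Rent}) = \theta_0 + \theta_1 \times (\text{Area}) + \theta_2 \times (\text{Age of building}) + \theta_3 \times (\text{Walking time to station}).
\end{align*}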

Considering that in reality the data deviates from the regression equation due to the influence of error $\varepsilon$, each data point satisfies the following relationship (the data index is written as a superscript in parentheses):

\begin{align*}
y^{(1)} &= \theta_0 + \theta_1 x_1^{(1)} + \theta_2 x_2^{(1)} + \cdots + \theta_p x_p^{(1)} + \varepsilon ^{(1)}, \n
y^{(2)} &= \theta_0 + \theta_1 x_1^{(2)} + \theta_2 x_2^{(2)} + \cdots + \theta_p x_p^{(2)} + \varepsilon ^{(2)}, \n
& \ \ \vdots \n
y^{(i)} &= \theta_0 + \theta_1 x_1^{(i)} + \theta_2 x_2^{(i)} + \cdots + \theta_p x_p^{(i)} + \varepsilon ^{(i)}, \n
&\ \ \vdots \n
y^{(n)} &= \theta_0 + \theta_1 x_1^{(n)} + \theta_2 x_2^{(n)} + \cdots + \theta_p x_p^{(n)} + \varepsilon ^{(n)}.
\end{align*}

A model that captures the relationship between a target variable and two or more explanatory variables in this way is called a **linear multiple regression model**.

 Now, if we express the above equations using matrices, they can be represented as follows:

\begin{align*}
\underbrace{\mat{y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(i)} \\ \vdots \\y^{(n)}}}_{\equiv \bm{y}}
=
\underbrace{\mat{
1 & x_1^{(1)} & x_2^{(1)} & \cdots & x_p^{(1)} \\
1 & x_1^{(2)} & x_2^{(2)} & \cdots & x_p^{(2)} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
1 & x_1^{(i)} & x_2^{(i)} & \cdots & x_p^{(i)} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
1 & x_1^{(n)} & x_2^{(n)} & \cdots & x_p^{(n)}
}}_{\equiv X}
\underbrace{\mat{\theta_0 \\ \theta_1 \\ \theta_2 \\ \vdots \\ \theta_p}}_{\equiv \bm{\theta}}
+
\underbrace{\mat{\varepsilon^{(1)} \\ \varepsilon^{(2)} \\ \vdots \\ \varepsilon^{(i)} \\ \vdots \\\varepsilon^{(n)}}}_{\equiv \bm{\varepsilon}}.
\end{align*}

Using the vectors and matrices defined above, this can be written compactly as:

\begin{align}
\bm{y} = X \bm{\theta} + \bm{\varepsilon}
\label{eq:reg}
\end{align}
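
As a quick dimension check (the column of ones in $X$ carries the intercept $\theta_0$), the shapes of these objects are:

\begin{align*}
\ub{\bm{y}}{n \times 1} = \ub{X}{n \times (p+1)} \ \ub{\bm{\theta}}{(p+1) \times 1} + \ub{\bm{\varepsilon}}{n \times 1}.
\end{align*}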

In the next section, we will discuss how to determine this regression coefficient vector $\bm{\theta}$.

Estimating Regression Coefficients using the Least Squares Method

 Just as with the case of a single explanatory variable, we estimate the regression coefficient vector using the least squares method.

That is, we seek the estimated value $\hat{\bm{\theta}}$ of the regression coefficient vector $\bm{\theta}$ that minimizes the sum of squared error terms for each data point, $J \equiv (\varepsilon^{(1)})^2 + (\varepsilon^{(2)})^2 + \cdots + (\varepsilon^{(n)})^2$.

From $\eq{eq:reg}$, since $\bm{\varepsilon} = \bm{y} - X \bm{\theta}$, the mathematical expression of the least squares method for linear multiple regression is as follows:

\begin{align*}
\hat{\bm{\theta}} &= \argmin_{\bm{\theta}} J(\bm{\theta}) \n
&= \argmin_{\bm{\theta}} \sum_{i=1}^n (\varepsilon^{(i)})^2 \n
&= \argmin_{\bm{\theta}}\, \bm{\varepsilon}^\T \bm{\varepsilon} \n
&= \argmin_{\bm{\theta}}\, (\bm{y} - X \bm{\theta})^\T (\bm{y} - X \bm{\theta}).
\end{align*}

Therefore, similar to the previous article, we find the estimated value by partially differentiating $J(\bm{\theta})$ with respect to $\bm{\theta}$ and setting it to $\bm{0}$.

\begin{align*}
J(\bm{\theta}) &= (\bm{y} - X \bm{\theta})^\T (\bm{y} - X \bm{\theta}) \n
&= (\bm{y}^\T - \bm{\theta}^\T X^\T) (\bm{y} - X \bm{\theta}) \t \t (\because (AB)^\T = B^\T A^\T) \n
&= \bm{y}^\T \bm{y} - \bm{y}^\T X \bm{\theta} - \underbrace{\bm{\theta}^\T X^\T \bm{y}}_{= \bm{y}^\T X \bm{\theta}} + \bm{\theta}^\T X^\T X \bm{\theta} \n
&= \bm{y}^\T \bm{y} - 2 \bm{y}^\T X \bm{\theta} + \bm{\theta}^\T X^\T X \bm{\theta}
\end{align*}

Thus, we use vector differentiation formulas to find the extremum.

\begin{align*}
\pd{J(\bm{\theta})}{\bm{\theta}} &= \pd{}{\bm{\theta}} \left( \bm{y}^\T \bm{y} - 2 \bm{y}^\T X \bm{\theta} + \bm{\theta}^\T X^\T X \bm{\theta} \right) \n
&= \underbrace{\pd{}{\bm{\theta}} (\bm{y}^\T \bm{y})}_{=0} - 2 \pd{}{\bm{\theta}} (\bm{y}^\T X \bm{\theta}) + \pd{}{\bm{\theta}} (\bm{\theta}^\T X^\T X \bm{\theta}) \n
&= -2 X^\T \bm{y} + 2X^\T X \bm{\theta} = \bm{0}.
\end{align*}

Here, we used vector differentiation formulas in the transformation of the last line.

Deriving the "Differentiation with Respect to a Vector" Formulas
When studying machine learning theory, you frequently encounter the operation of differentiating a scalar with respect to a vector. That article derives the formulas for doing so.
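
For reference, the two formulas used in the last step are, for a constant vector $\bm{a}$ and a symmetric matrix $A$,

\begin{align*}
\pd{}{\bm{\theta}} (\bm{a}^\T \bm{\theta}) = \bm{a}, \t\t \pd{}{\bm{\theta}} (\bm{\theta}^\T A \bm{\theta}) = 2 A \bm{\theta},
\end{align*}

applied here with $\bm{a} = X^\T \bm{y}$ (since $\bm{y}^\T X \bm{\theta} = (X^\T \bm{y})^\T \bm{\theta}$) and $A = X^\T X$, which is symmetric.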

Consequently, we obtain the following equation, called the **normal equation**.

\begin{align*}
X^\T X \bm{\theta} = X^\T \bm{y}
\end{align*}

From the above, if the $(p+1) \times (p+1)$ matrix $X^\T X$ is invertible, the least squares estimate is given by:

\begin{align*}
\hat{\bm{\theta}} = (X^\T X)^{-1} X^\T \bm{y}\t (☆)
\end{align*}

Performing Linear Multiple Regression Analysis using Python

 Now, let’s perform linear multiple regression analysis using Python. By calculating equation (☆), we can find the regression coefficient vector $\hat{\bm{\theta}}$.

For the sake of visualization, the data used in this example will have 2 explanatory variables.

import numpy as np
import pandas as pd

area = [17.00, 18.00, 21.00, 24.00, 18.90, 20.66, 23.51, 25.21, 24.94, 30.22]  # Explanatory variable 1
age = [31, 34, 36, 31, 28, 22, 19, 12, 4, 0]  # Explanatory variable 2
rent = [35000, 36000, 39000, 42000, 46000, 50000, 59000, 68000, 73000, 86000]  # Target variable

df = pd.DataFrame({
    'area': area,
    'age': age,
    'rent': rent})

First, we visualize the target 3D data using plotly.

import plotly.express as px
import plotly.graph_objects as go

fig = px.scatter_3d(df, x='area', y='age', z='rent', color='rent', opacity=0.7)
fig.show()

We will find a regression plane that fits this data.

Now, we prepare matrix $X$ and vector $\bm{y}$ to calculate equation (☆).

# Prepare matrix X and vector y

X = df[['area', 'age']].values
ones = np.ones(len(X)).reshape(-1, 1)  # Column of ones for the intercept term theta_0
X = np.hstack((ones, X))               # Design matrix: [1, area, age]

print('Matrix X')
print(X)

y = df['rent'].values
print('Vector y')
print(y)
## Outputs
Matrix X
[[ 1.   17.   31.  ]
 [ 1.   18.   34.  ]
 [ 1.   21.   36.  ]
 [ 1.   24.   31.  ]
 [ 1.   18.9  28.  ]
 [ 1.   20.66 22.  ]
 [ 1.   23.51 19.  ]
 [ 1.   25.21 12.  ]
 [ 1.   24.94  4.  ]
 [ 1.   30.22  0.  ]]
Vector y
[35000 36000 39000 42000 46000 50000 59000 68000 73000 86000]
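
As a quick sanity check on the shapes (10 samples; intercept column plus 2 explanatory variables), a minimal snippet:

print(X.shape)  # (10, 3)
print(y.shape)  # (10,)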

We calculate equation (☆) to find the regression coefficient vector $\hat{\bm{\theta}}$.

def multiple_regression(X, y):
    """Calculate the regression coefficient vector via the normal equation."""
    # pinv (Moore-Penrose pseudoinverse) is used instead of inv for numerical robustness
    theta = np.linalg.pinv(X.T @ X) @ X.T @ y
    return theta

theta = multiple_regression(X, y)
theta_0 = theta[0]
theta_1 = theta[1]
theta_2 = theta[2]

print(f'theta_0: {theta_0}')
print(f'theta_1: {theta_1}')
print(f'theta_2: {theta_2}')
## Outputs
theta_0: 50300.79618461159
theta_1: 1154.3386202815495
theta_2: -1045.7759592710272
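
As an aside, the same estimate can be obtained without explicitly inverting $X^\T X$, which is generally preferable numerically. Below is a minimal sketch of two equivalent alternatives, one solving the normal equation as a linear system and one letting NumPy solve the least squares problem directly; both should reproduce the coefficients above up to floating-point error.

# Alternative 1: solve the normal equation (X^T X) theta = X^T y as a linear system
theta_solve = np.linalg.solve(X.T @ X, X.T @ y)

# Alternative 2: solve the least squares problem min ||y - X theta||^2 directly
theta_lstsq, residuals, rank, sv = np.linalg.lstsq(X, y, rcond=None)

print('solve:', theta_solve)
print('lstsq:', theta_lstsq)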

Now that we have found the regression coefficients, let’s plot the regression plane.

# Visualization of the regression plane
mesh_size = 1
margin = 0.1
x1_min, x1_max = df['area'].min()-margin, df['area'].max()+margin
x2_min, x2_max = df['age'].min()-margin, df['age'].max()+margin
x1_range = np.arange(x1_min, x1_max, mesh_size)
x2_range = np.arange(x2_min, x2_max, mesh_size)
xx1, xx2 = np.meshgrid(x1_range, x2_range)

z_plane = theta_0 + theta_1 * xx1 + theta_2 * xx2  # Regression plane (kept separate from the data vector y)

fig = px.scatter_3d(df, x='area', y='age', z='rent', opacity=0.8)  # Plot data points
fig.update_traces(marker=dict(
    color='royalblue',  # Fixed marker color for the data points
    size=8,
    line=dict(width=2, color='white')))

fig.add_traces(go.Surface(x=x1_range, y=x2_range, z=z_plane, opacity=0.7))  # Plot the plane
fig.show()

In this way, we were able to obtain the regression plane by solving the normal equation.
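
With $\hat{\bm{\theta}}$ in hand, predictions for the training data follow from $\hat{\bm{y}} = X \hat{\bm{\theta}}$. Below is a minimal sketch that computes the fitted rents and the residuals (the variable names are just for illustration):

# Fitted values and residuals from the estimated coefficients
y_hat = X @ theta                       # y_hat = X theta_hat
residuals = df['rent'].values - y_hat   # estimated errors epsilon_hat

print('Fitted rent:', np.round(y_hat))
print('Residuals  :', np.round(residuals))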

Performing Linear Multiple Regression Analysis using scikit-learn

 We can perform multiple regression analysis in the same way using scikit-learn.

from sklearn.linear_model import LinearRegression

area = [17.00, 18.00, 21.00, 24.00, 18.90, 20.66, 23.51, 25.21, 24.94, 30.22]  # Explanatory variable 1
age = [31, 34, 36, 31, 28, 22, 19, 12, 4, 0]  # Explanatory variable 2
rent = [35000, 36000, 39000, 42000, 46000, 50000, 59000, 68000, 73000, 86000]  # Target variable
df = pd.DataFrame({
    'area': area,
    'age': age,
    'rent': rent})

X = df[['area', 'age']].values
y = df['rent'].values

# Calculate regression coefficients
model = LinearRegression()
model.fit(X, y)

theta_0 = model.intercept_
theta_1 = model.coef_[0]
theta_2 = model.coef_[1]

print(f'theta_0: {theta_0}')
print(f'theta_1: {theta_1}')
print(f'theta_2: {theta_2}')
## Outputs
theta_0: 50300.79618451819
theta_1: 1154.3386202847946
theta_2: -1045.7759592701218
The regression plane can be visualized in the same way as before.

# Visualization of the regression plane
mesh_size = 1
margin = 0.1
x1_min, x1_max = df['area'].min()-margin, df['area'].max()+margin
x2_min, x2_max = df['age'].min()-margin, df['age'].max()+margin
x1_range = np.arange(x1_min, x1_max, mesh_size)
x2_range = np.arange(x2_min, x2_max, mesh_size)
xx1, xx2 = np.meshgrid(x1_range, x2_range)

z_plane = theta_0 + theta_1 * xx1 + theta_2 * xx2  # Regression plane (kept separate from the data vector y)

fig = px.scatter_3d(df, x='area', y='age', z='rent', opacity=0.8)  # Plot data points
fig.update_traces(marker=dict(
    color='royalblue',  # Fixed marker color for the data points
    size=8,
    line=dict(width=2, color='white')))

fig.add_traces(go.Surface(x=x1_range, y=x2_range, z=z_plane, opacity=0.7))  # Plot the plane
fig.show()

We were able to obtain the regression plane similarly using scikit-learn.
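
Once fitted, the scikit-learn model can also be used for prediction with model.predict. As a minimal sketch, here we predict the rent of a hypothetical apartment with an area of 25 and a building age of 10 (made-up values, purely for illustration):

# Predict the rent of a new, hypothetical apartment (area = 25, age = 10)
new_X = np.array([[25.0, 10.0]])  # Same column order as in training: [area, age]
print(model.predict(new_X))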
