线性回归与多元回归分析：理论推导、Python从零实现与scikit-learn实战

Table of Contents

前言
捕捉多变量之间的关系
使用最小二乘法估计回归系数
使用 Python 进行线性多元回归分析
使用 scikit-learn 进行线性多元回归分析

前言

　在上一篇文章中，我们讨论了基于最小二乘法的线性回归。

線形回帰〜最小二乗法の計算/解説とPythonのフルスクラッチ、scikit-learnよる実装〜

機械学習の中でも歴史ある線形回帰について、なかでも「最小二乗法」についての理論とpython, scikit-learnによる実装を紹介します

在那篇文章中，我们将悬挂在弹簧上的重物质量作为解释变量，将弹簧的长度作为目标变量，求出了回归直线。也就是说，那是只有一个解释变量的情况。那么，在有多个解释变量的情况下，应该如何构建回归模型呢？

作为具体例子，我们以公寓的租金为例。如表1所示，这是包含：
・面积　・房龄　・距车站步行时间这3个解释变量，以及
・租金作为目标变量
的数据。要利用面积、房龄、距车站步行时间的数据来构建租金预测模型，应该如何进行公式化呢？

▲表1: 想要构建一个利用3个解释变量来预测目标变量的模型

　本文将介绍对多个解释变量与目标变量之间的关系进行建模的多元回归分析，并推导用于求回归系数的正规方程。

此外，还将使用 Python 实现多元回归分析。可以在下方尝试运行代码。

Google Colab

\begin{align*}
\newcommand{\mat}[1]{\begin{pmatrix} #1 \end{pmatrix}}
\newcommand{\f}[2]{\frac{#1}{#2}}
\newcommand{\pd}[2]{\frac{\partial #1}{\partial #2}}
\newcommand{\d}[2]{\frac{{\rm d}#1}{{\rm d}#2}}
\newcommand{\T}{\mathsf{T}}
\newcommand{\dis}{\displaystyle}
\newcommand{\eq}[1]{{\rm Eq}(\ref{#1})}
\newcommand{\n}{\notag\\}
\newcommand{\t}{\ \ \ \ }
\newcommand{\tt}{\t\t\t\t}
\newcommand{\argmax}{\mathop{\rm arg\, max}\limits}
\newcommand{\argmin}{\mathop{\rm arg\, min}\limits}
\def\l<#1>{\left\langle #1 \right\rangle}
\def\us#1_#2{\underset{#2}{#1}}
\def\os#1^#2{\overset{#2}{#1}}
\newcommand{\case}[1]{\{ \begin{array}{ll} #1 \end{array} \right.}
\newcommand{\s}[1]{{\scriptstyle #1}}
\newcommand{\c}[2]{\textcolor{#1}{#2}}
\newcommand{\ub}[2]{\underbrace{#1}_{#2}}
\end{align*}

捕捉多变量之间的关系

　上面以3个解释变量为例，这里我们将问题稍微一般化，考虑 $p$ 个解释变量 $x_1, x_2, \dots, x_p$ 和目标变量 $y$ 的情况。

下面的表2是表1的一般化形式，展示了测量的 $n$ 个数据的结果。例如，在第 $i$ 个数据点中，解释变量组 $x_1^{(i)}, x_2^{(i)}, \dots, x_p^{(i)}$ 与目标变量 $y^{(i)}$ 相关联。

▲表2: 观测 $n$ 个包含 $p$ 个解释变量的数据时的对应表。
列表示解释变量的种类，行表示数据的索引

我们的目的是构建一个将目标变量 $y$ 与解释变量 $x_1, x_2, \dots, x_p$ 联系起来的模型。在这里，根据数据趋势或先验知识，我们假设该现象的结构是线性的。

也就是说，目标变量与解释变量的关系可以用线性形式

\begin{align*}
y = \theta_0 + \theta_1 x_1 + \theta_1 x_2 + \cdots + \theta_p x_p
\end{align*}

来表示。这里，$\theta_0, \dots, \theta_p$ 是回归系数。

考虑到实际上受误差 $\varepsilon$ 的影响，数据会偏离回归方程，因此各数据点满足以下关系。（我们将数据编号的下标标记在右上角）

\begin{align*}
y^{(1)} &= \theta_0 + \theta_1 x_1^{(1)} + \theta_1 x_2^{(1)} + \cdots + \theta_p x_p^{(1)} + \varepsilon ^{(1)}, \n
y^{(2)} &= \theta_0 + \theta_1 x_1^{(2)} + \theta_1 x_2^{(2)} + \cdots + \theta_p x_p^{(2)} + \varepsilon ^{(2)}, \n
& \ \ \vdots \n
y^{(i)} &= \theta_0 + \theta_1 x_1^{(i)} + \theta_1 x_2^{(i)} + \cdots + \theta_p x_p^{(i)} + \varepsilon ^{(i)}, \n
&\ \ \vdots \n
y^{(n)} &= \theta_0 + \theta_1 x_1^{(n)} + \theta_1 x_2^{(n)} + \cdots + \theta_p x_p^{(n)} + \varepsilon ^{(n)}.
\end{align*}

这种捕捉两个或更多解释变量与目标变量之间关系的模型，被称为线性多元回归模型。

那么，上式如果用矩阵来表示，可以写成如下形式。

\begin{align*}
\underbrace{\mat{y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(i)} \\ \vdots \\y^{(n)}}}_{\equiv \bm{y}}
=
\underbrace{\mat{
1 & x_1^{(1)} & x_2^{(1)} & \cdots & x_p^{(1)} \\
1 & x_1^{(2)} & x_2^{(2)} & \cdots & x_p^{(2)} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
1 & x_1^{(i)} & x_2^{(i)} & \cdots & x_p^{(i)} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
1 & x_1^{(n)} & x_2^{(n)} & \cdots & x_p^{(n)}
}}_{\equiv X}
\underbrace{\mat{\theta_0 \\ \theta_1 \\ \theta_2 \\ \vdots \\ \theta_p}}_{\equiv \bm{\theta}}
+
\underbrace{\mat{\varepsilon^{(1)} \\ \varepsilon^{(2)} \\ \vdots \\ \varepsilon^{(i)} \\ \vdots \\\varepsilon^{(n)}}}_{\equiv \bm{\varepsilon}}.
\end{align*}

进一步地，用向量和矩阵来表示该式，

\begin{align}
\bm{y} = X \bm{\theta} + \bm{\varepsilon}
\label{eq:reg}
\end{align}

我们可以得到上式。

在下一节中，我们将讨论确定这个回归系数向量 $\bm{\theta}$ 的方法。

使用最小二乘法估计回归系数

　与解释变量为1个的情况一样，我们使用最小二乘法来估计回归系数向量。

也就是说，这是一种求回归系数向量 $\bm{\theta}$ 的估计值 $\hat{\bm{\theta}}$ 的方法，使得各数据的误差项平方和 $J \equiv (\varepsilon^{(1)})^2 + (\varepsilon^{(2)})^2 + \cdots + (\varepsilon^{(n)})^2$ 最小化。

由 $\eq{eq:reg}$ 可知 $\bm{\varepsilon} = \bm{y}\, – X \bm{\theta}$，因此线性多元回归中的最小二乘法用数学公式表示如下。

\begin{align*}
\hat{\bm{\theta}} &= \argmin_{\bm{\theta}} J(\bm{\theta}) \n
&= \argmin_{\bm{\theta}} \sum_{i=1}^n (\varepsilon^{(i)})^2 \n
&= \argmin_{\bm{\theta}}\, \bm{\varepsilon}^\T \bm{\varepsilon} \n
&= \argmin_{\bm{\theta}}\, (\bm{y}\, – X \bm{\theta})^\T (\bm{y}\, – X \bm{\theta}).
\end{align*}

因此，与上一篇一样，我们通过对 $\bm{\theta}$ 求 $J(\bm{\theta})$ 的偏导数，并令其为 $\bm{0}$ 来求出估计值。

\begin{align*}
J(\bm{\theta}) &= (\bm{y}\, – X \bm{\theta})^\T (\bm{y}\, – X \bm{\theta}) \n
&= (\bm{y}^\T – \bm{\theta}^\T X^\T) (\bm{y}\, – X \bm{\theta}) \t \t (\because (AB)^\T = B^\T A^\T) \n
&= \bm{y}^\T \bm{y}\ – \bm{y}^\T X \bm{\theta}\ – \underbrace{\bm{\theta}^\T X^\T \bm{y}}_{= \bm{y}^\T X \bm{\theta}} + \bm{\theta}^\T X^\T X \bm{\theta} \n
&= \bm{y}^\T \bm{y}\ – 2 \bm{y}^\T X \bm{\theta} + \bm{\theta}^\T X^\T X \bm{\theta}
\end{align*}

所以，我们使用向量微分公式来求极值。

\begin{align*}
\pd{J(\bm{\theta})}{\bm{\theta}} &= \pd{}{\bm{\theta}} $ \bm{y}^\T \bm{y}\ – 2 \bm{y}^\T X \bm{\theta} + \bm{\theta}^\T X^\T X \bm{\theta} $ \n
&= \underbrace{\pd{}{\bm{\theta}} (\bm{y}^\T \bm{y})}_{=0} \ – 2 \pd{}{\bm{\theta}} (\bm{y}^\T X \bm{\theta}) + \pd{}{\bm{\theta}} (\bm{\theta}^\T X^\T X \bm{\theta}) \n
&= -2 X^\T \bm{y} + 2X^\T X \bm{\theta} = \bm{0}.
\end{align*}

这里，最后一行的变形使用了向量微分公式。

「ベクトルで微分」公式導出

機械学習の理論を勉強すると「スカラーをベクトルで微分」する操作をよく目にします。この記事では「スカラーをベクトルで微分」の公式を導出します。

因此，我们得到了被称为正规方程的下式。

\begin{align*}
X^\T X \bm{\theta} = X^\T \bm{y}
\end{align*}

综上所述，如果矩阵 $X^\T X$ 的逆矩阵存在，则最小二乘估计值由下式给出。

\begin{align*}
\hat{\bm{\theta}} = (X^\T X)^{-1} X^\T \bm{y}\t (☆)
\end{align*}

使用 Python 进行线性多元回归分析

　那么，让我们使用 Python 来进行线性多元回归分析。计算式(☆)即可求出回归系数向量 $\hat{\bm{\theta}}$。

为了便于可视化，作为示例的数据使用2个解释变量。

import numpy as np
import pandas as pd

area = [17.00, 18.00, 21.00, 24.00, 18.90, 20.66, 23.51, 25.21, 24.94, 30.22]  # Explanatory variable 1
age = [31, 34, 36, 31, 28, 22, 19, 12, 4, 0]  # Explanatory variable 2
rent = [35000, 36000, 39000, 42000, 46000, 50000, 59000, 68000, 73000, 86000]  # Target variable

df = pd.DataFrame({
    'area': area,
    'age': age,
    'rent': rent})

import numpy as np
import pandas as pd

area = [17.00, 18.00, 21.00, 24.00, 18.90, 20.66, 23.51, 25.21, 24.94, 30.22]  # Explanatory variable 1
age = [31, 34, 36, 31, 28, 22, 19, 12, 4, 0]  # Explanatory variable 2
rent = [35000, 36000, 39000, 42000, 46000, 50000, 59000, 68000, 73000, 86000]  # Target variable

df = pd.DataFrame({
    'area': area,
    'age': age,
    'rent': rent})

首先，使用 plotly 对目标3维数据进行可视化。

import plotly.express as px
import plotly.graph_objects as go

fig = px.scatter_3d(df, x='area', y='age', z='rent', color='rent', opacity=0.7)
fig.show()

import plotly.express as px
import plotly.graph_objects as go

fig = px.scatter_3d(df, x='area', y='age', z='rent', color='rent', opacity=0.7)
fig.show()

我们将求出拟合该数据的回归平面。

接下来，为了计算式 (☆)，我们要准备矩阵 $X$ 和向量 $\bm{y}$。

# Prepare matrix X and vector y

X = df[['area', 'age']].values
ones = np.ones(len(X)).reshape(-1, 1)
X = np.hstack((ones, X))

print('Matrix X')
print(X)

y = df['rent'].values
print('Vector y')
print(y)

# Prepare matrix X and vector y

X = df[['area', 'age']].values
ones = np.ones(len(X)).reshape(-1, 1)
X = np.hstack((ones, X))

print('Matrix X')
print(X)

y = df['rent'].values
print('Vector y')
print(y)

## Outputs
Matrix X
[[ 1.   17.   31.  ]
 [ 1.   18.   34.  ]
 [ 1.   21.   36.  ]
 [ 1.   24.   31.  ]
 [ 1.   18.9  28.  ]
 [ 1.   20.66 22.  ]
 [ 1.   23.51 19.  ]
 [ 1.   25.21 12.  ]
 [ 1.   24.94  4.  ]
 [ 1.   30.22  0.  ]]
Vector y
[35000 36000 39000 42000 46000 50000 59000 68000 73000 86000]

## Outputs
Matrix X
[[ 1.   17.   31.  ]
 [ 1.   18.   34.  ]
 [ 1.   21.   36.  ]
 [ 1.   24.   31.  ]
 [ 1.   18.9  28.  ]
 [ 1.   20.66 22.  ]
 [ 1.   23.51 19.  ]
 [ 1.   25.21 12.  ]
 [ 1.   24.94  4.  ]
 [ 1.   30.22  0.  ]]
Vector y
[35000 36000 39000 42000 46000 50000 59000 68000 73000 86000]

计算式 (☆) 以求出回归系数向量 $\hat{\bm{\theta}}$。

def multiple_regression(X, y):
    """Calculate the regression coefficient vector."""
    theta = np.linalg.pinv(X.T @ X) @ X.T @ y
    return theta

theta = multiple_regression(X, y)
theta_0 = theta[0]
theta_1 = theta[1]
theta_2 = theta[2]

print(f'theta_0: {theta_0}')
print(f'theta_1: {theta_1}')
print(f'theta_2: {theta_2}')

def multiple_regression(X, y):
    """Calculate the regression coefficient vector."""
    theta = np.linalg.pinv(X.T @ X) @ X.T @ y
    return theta

theta = multiple_regression(X, y)
theta_0 = theta[0]
theta_1 = theta[1]
theta_2 = theta[2]

print(f'theta_0: {theta_0}')
print(f'theta_1: {theta_1}')
print(f'theta_2: {theta_2}')

## Outputs
theta_0: 50300.79618461159
theta_1: 1154.3386202815495
theta_2: -1045.7759592710272

## Outputs
theta_0: 50300.79618461159
theta_1: 1154.3386202815495
theta_2: -1045.7759592710272

既然求出了回归系数，让我们来绘制回归平面。

# Visualization of the regression plane
mesh_size = 1
margin = 0.1
x1_min, x1_max = df['area'].min()-margin, df['area'].max()+margin
x2_min, x2_max = df['age'].min()-margin, df['age'].max()+margin
x1_range = np.arange(x1_min, x1_max, mesh_size)
x2_range = np.arange(x2_min, x2_max, mesh_size)
xx1, xx2 = np.meshgrid(x1_range, x2_range)

y = (theta_0 + theta_1 * xx1 + theta_2 * xx2)  # Regression plane

fig = px.scatter_3d(df, x='area', y='age', z='rent', opacity=0.8)  # Plot data points
fig.update_traces(marker=dict(
    color=colors[0],
    size=8,
    line=dict(width=2,color='white')))

fig.add_traces(go.Surface(x=x1_range, y=x2_range, z=y, opacity=0.7))  # Plot the plane
fig.show()

# Visualization of the regression plane
mesh_size = 1
margin = 0.1
x1_min, x1_max = df['area'].min()-margin, df['area'].max()+margin
x2_min, x2_max = df['age'].min()-margin, df['age'].max()+margin
x1_range = np.arange(x1_min, x1_max, mesh_size)
x2_range = np.arange(x2_min, x2_max, mesh_size)
xx1, xx2 = np.meshgrid(x1_range, x2_range)

y = (theta_0 + theta_1 * xx1 + theta_2 * xx2)  # Regression plane

fig = px.scatter_3d(df, x='area', y='age', z='rent', opacity=0.8)  # Plot data points
fig.update_traces(marker=dict(
    color=colors[0],
    size=8,
    line=dict(width=2,color='white')))

fig.add_traces(go.Surface(x=x1_range, y=x2_range, z=y, opacity=0.7))  # Plot the plane
fig.show()

像这样，通过求解正规方程，我们得到了回归平面。

使用 scikit-learn 进行线性多元回归分析

from sklearn.linear_model import LinearRegression

area = [17.00, 18.00, 21.00, 24.00, 18.90, 20.66, 23.51, 25.21, 24.94, 30.22]  # Explanatory variable 1
age = [31, 34, 36, 31, 28, 22, 19, 12, 4, 0]  # Explanatory variable 2
rent = [35000, 36000, 39000, 42000, 46000, 50000, 59000, 68000, 73000, 86000]  # Target variable
df = pd.DataFrame({
    'area': area,
    'age': age,
    'rent': rent})

X = df[['area', 'age']].values
y = df['rent'].values

# Calculate regression coefficients
model = LinearRegression()
model.fit(X, y)

theta_0 = model.intercept_
theta_1 = model.coef_[0]
theta_2 = model.coef_[1]

print(f'theta_0: {theta_0}')
print(f'theta_1: {theta_1}')
print(f'theta_2: {theta_2}')

from sklearn.linear_model import LinearRegression

area = [17.00, 18.00, 21.00, 24.00, 18.90, 20.66, 23.51, 25.21, 24.94, 30.22]  # Explanatory variable 1
age = [31, 34, 36, 31, 28, 22, 19, 12, 4, 0]  # Explanatory variable 2
rent = [35000, 36000, 39000, 42000, 46000, 50000, 59000, 68000, 73000, 86000]  # Target variable
df = pd.DataFrame({
    'area': area,
    'age': age,
    'rent': rent})

X = df[['area', 'age']].values
y = df['rent'].values

# Calculate regression coefficients
model = LinearRegression()
model.fit(X, y)

theta_0 = model.intercept_
theta_1 = model.coef_[0]
theta_2 = model.coef_[1]

print(f'theta_0: {theta_0}')
print(f'theta_1: {theta_1}')
print(f'theta_2: {theta_2}')

## Outputs
theta_0: 50300.79618451819
theta_1: 1154.3386202847946
theta_2: -1045.7759592701218

## Outputs
theta_0: 50300.79618451819
theta_1: 1154.3386202847946
theta_2: -1045.7759592701218

# Visualization of the regression plane
mesh_size = 1
margin = 0.1
x1_min, x1_max = df['area'].min()-margin, df['area'].max()+margin
x2_min, x2_max = df['age'].min()-margin, df['age'].max()+margin
x1_range = np.arange(x1_min, x1_max, mesh_size)
x2_range = np.arange(x2_min, x2_max, mesh_size)
xx1, xx2 = np.meshgrid(x1_range, x2_range)

y = (theta_0 + theta_1 * xx1 + theta_2 * xx2)  # Regression plane

fig = px.scatter_3d(df, x='area', y='age', z='rent', opacity=0.8)  # Plot data points
fig.update_traces(marker=dict(
    color=colors[0],
    size=8,
    line=dict(width=2,color='white')))

fig.add_traces(go.Surface(x=x1_range, y=x2_range, z=y, opacity=0.7))  # Plot the plane
fig.show()

# Visualization of the regression plane
mesh_size = 1
margin = 0.1
x1_min, x1_max = df['area'].min()-margin, df['area'].max()+margin
x2_min, x2_max = df['age'].min()-margin, df['age'].max()+margin
x1_range = np.arange(x1_min, x1_max, mesh_size)
x2_range = np.arange(x2_min, x2_max, mesh_size)
xx1, xx2 = np.meshgrid(x1_range, x2_range)

y = (theta_0 + theta_1 * xx1 + theta_2 * xx2)  # Regression plane

fig = px.scatter_3d(df, x='area', y='age', z='rent', opacity=0.8)  # Plot data points
fig.update_traces(marker=dict(
    color=colors[0],
    size=8,
    line=dict(width=2,color='white')))

fig.add_traces(go.Surface(x=x1_range, y=x2_range, z=y, opacity=0.7))  # Plot the plane
fig.show()

　同样地，使用 scikit-learn 也可以进行多元回归分析。

使用 scikit-learn 也同样得到了回归平面。