Introduction
In the previous article, we discussed linear regression using the least squares method.

There, we determined the regression line using the mass of a weight hung on a spring as the explanatory variable and the length of the spring as the target variable. In other words, that was a case with only one explanatory variable. However, how should we construct a regression model when there are **multiple explanatory variables**?
As a concrete example, let’s consider apartment rent. As shown in Table 1, the data consists of:
- Three explanatory variables: Area, Age of building, and Walking time to station
- Target variable: Rent
How should we formulate the problem to construct a prediction model for rent using the data for area, age, and walking time?
In this article, we will cover **multiple regression analysis**, which models the relationship between multiple explanatory variables and a target variable, and derive the **normal equation** for finding the regression coefficients.
We will also implement multiple regression analysis using Python. You can try the code below.

\begin{align*}
\newcommand{\mat}[1]{\begin{pmatrix} #1 \end{pmatrix}}
\newcommand{\f}[2]{\frac{#1}{#2}}
\newcommand{\pd}[2]{\frac{\partial #1}{\partial #2}}
\newcommand{\d}[2]{\frac{{\rm d}#1}{{\rm d}#2}}
\newcommand{\T}{\mathsf{T}}
\newcommand{\dis}{\displaystyle}
\newcommand{\eq}[1]{{\rm Eq}(\ref{#1})}
\newcommand{\n}{\notag\\}
\newcommand{\t}{\ \ \ \ }
\newcommand{\tt}{\t\t\t\t}
\newcommand{\argmax}{\mathop{\rm arg\, max}\limits}
\newcommand{\argmin}{\mathop{\rm arg\, min}\limits}
\def\l<#1>{\left\langle #1 \right\rangle}
\def\us#1_#2{\underset{#2}{#1}}
\def\os#1^#2{\overset{#2}{#1}}
\newcommand{\case}[1]{\{ \begin{array}{ll} #1 \end{array} \right.}
\newcommand{\s}[1]{{\scriptstyle #1}}
\newcommand{\c}[2]{\textcolor{#1}{#2}}
\newcommand{\ub}[2]{\underbrace{#1}_{#2}}
\end{align*}
Capturing Relationships between Multiple Variables
In the example above, we used 3 explanatory variables, but let’s consider a more general case with $p$ explanatory variables $x_1, x_2, \dots, x_p$ and a target variable $y$.
Table 2 below is a generalization of Table 1, showing the results of measuring $n$ data points. For example, in the $i$-th data point, the target variable $y^{(i)}$ is related to the set of explanatory variables $x_1^{(i)}, x_2^{(i)}, \dots, x_p^{(i)}$.
Columns represent the types of explanatory variables, and rows represent the data index.
Our goal is to construct a model that links the target variable $y$ with the explanatory variables $x_1, x_2, \dots, x_p$. Here, based on data trends or prior knowledge, we assume that the structure of this phenomenon is linear.
That is, we assume the relationship between the target variable and the explanatory variables can be expressed by the linear equation:
\begin{align*}
y = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_p x_p
\end{align*}
Here, $\theta_0, \dots, \theta_p$ are the regression coefficients.
Considering that in reality the data deviate from the regression equation due to the influence of an error term $\varepsilon$, each data point satisfies the following relationship (we denote the data index by a superscript in parentheses):
\begin{align*}
y^{(1)} &= \theta_0 + \theta_1 x_1^{(1)} + \theta_2 x_2^{(1)} + \cdots + \theta_p x_p^{(1)} + \varepsilon^{(1)}, \n
y^{(2)} &= \theta_0 + \theta_1 x_1^{(2)} + \theta_2 x_2^{(2)} + \cdots + \theta_p x_p^{(2)} + \varepsilon^{(2)}, \n
& \ \ \vdots \n
y^{(i)} &= \theta_0 + \theta_1 x_1^{(i)} + \theta_2 x_2^{(i)} + \cdots + \theta_p x_p^{(i)} + \varepsilon^{(i)}, \n
&\ \ \vdots \n
y^{(n)} &= \theta_0 + \theta_1 x_1^{(n)} + \theta_2 x_2^{(n)} + \cdots + \theta_p x_p^{(n)} + \varepsilon^{(n)}.
\end{align*}
A model that captures the relationship between a target variable and two or more explanatory variables in this way is called a **linear multiple regression model**.
Now, if we express the above equations using matrices, they can be represented as follows:
\begin{align*}
\underbrace{\mat{y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(i)} \\ \vdots \\y^{(n)}}}_{\equiv \bm{y}}
=
\underbrace{\mat{
1 & x_1^{(1)} & x_2^{(1)} & \cdots & x_p^{(1)} \\
1 & x_1^{(2)} & x_2^{(2)} & \cdots & x_p^{(2)} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
1 & x_1^{(i)} & x_2^{(i)} & \cdots & x_p^{(i)} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
1 & x_1^{(n)} & x_2^{(n)} & \cdots & x_p^{(n)}
}}_{\equiv X}
\underbrace{\mat{\theta_0 \\ \theta_1 \\ \theta_2 \\ \vdots \\ \theta_p}}_{\equiv \bm{\theta}}
+
\underbrace{\mat{\varepsilon^{(1)} \\ \varepsilon^{(2)} \\ \vdots \\ \varepsilon^{(i)} \\ \vdots \\\varepsilon^{(n)}}}_{\equiv \bm{\varepsilon}}.
\end{align*}
Writing this more compactly in vector and matrix notation, we obtain:
\begin{align}
\bm{y} = X \bm{\theta} + \bm{\varepsilon}
\label{eq:reg}
\end{align}
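To make the dimensions concrete, here is a minimal NumPy sketch of the objects appearing in this equation (the values are arbitrary and made up purely for illustration):

```python
import numpy as np

# Shapes in y = X theta + epsilon (values are arbitrary, for illustration only)
n, p = 5, 3                                           # n data points, p explanatory variables
rng = np.random.default_rng(0)
X = np.hstack([np.ones((n, 1)), rng.random((n, p))])  # design matrix, shape (n, p+1); first column is all ones
theta = rng.random(p + 1)                             # regression coefficient vector, shape (p+1,)
epsilon = 0.1 * rng.normal(size=n)                    # error terms, shape (n,)
y = X @ theta + epsilon                               # target vector, shape (n,)
print(X.shape, theta.shape, y.shape)                  # (5, 4) (4,) (5,)
```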
In the next section, we will discuss how to determine this regression coefficient vector $\bm{\theta}$.
Estimating Regression Coefficients using the Least Squares Method
Just as with the case of a single explanatory variable, we estimate the regression coefficient vector using the least squares method.
That is, we seek the estimated value $\hat{\bm{\theta}}$ of the regression coefficient vector $\bm{\theta}$ that minimizes the sum of squared error terms for each data point, $J \equiv (\varepsilon^{(1)})^2 + (\varepsilon^{(2)})^2 + \cdots + (\varepsilon^{(n)})^2$.
From $\eq{eq:reg}$, since $\bm{\varepsilon} = \bm{y} - X \bm{\theta}$, the mathematical expression of the least squares method for linear multiple regression is as follows:
\begin{align*}
\hat{\bm{\theta}} &= \argmin_{\bm{\theta}} J(\bm{\theta}) \n
&= \argmin_{\bm{\theta}} \sum_{i=1}^n (\varepsilon^{(i)})^2 \n
&= \argmin_{\bm{\theta}}\, \bm{\varepsilon}^\T \bm{\varepsilon} \n
&= \argmin_{\bm{\theta}}\, (\bm{y} - X \bm{\theta})^\T (\bm{y} - X \bm{\theta}).
\end{align*}
Therefore, similar to the previous article, we find the estimated value by partially differentiating $J(\bm{\theta})$ with respect to $\bm{\theta}$ and setting it to $\bm{0}$.
\begin{align*}
J(\bm{\theta}) &= (\bm{y} - X \bm{\theta})^\T (\bm{y} - X \bm{\theta}) \n
&= (\bm{y}^\T - \bm{\theta}^\T X^\T) (\bm{y} - X \bm{\theta}) \t \t (\because (AB)^\T = B^\T A^\T) \n
&= \bm{y}^\T \bm{y} - \bm{y}^\T X \bm{\theta} - \underbrace{\bm{\theta}^\T X^\T \bm{y}}_{= \bm{y}^\T X \bm{\theta}} + \bm{\theta}^\T X^\T X \bm{\theta} \n
&= \bm{y}^\T \bm{y} - 2 \bm{y}^\T X \bm{\theta} + \bm{\theta}^\T X^\T X \bm{\theta}
\end{align*}
Thus, we use vector differentiation formulas to find the extremum.
\begin{align*}
\pd{J(\bm{\theta})}{\bm{\theta}} &= \pd{}{\bm{\theta}} \left( \bm{y}^\T \bm{y} - 2 \bm{y}^\T X \bm{\theta} + \bm{\theta}^\T X^\T X \bm{\theta} \right) \n
&= \underbrace{\pd{}{\bm{\theta}} (\bm{y}^\T \bm{y})}_{=0} - 2 \pd{}{\bm{\theta}} (\bm{y}^\T X \bm{\theta}) + \pd{}{\bm{\theta}} (\bm{\theta}^\T X^\T X \bm{\theta}) \n
&= -2 X^\T \bm{y} + 2X^\T X \bm{\theta} = \bm{0}.
\end{align*}
In the last line, we used the vector differentiation formulas $\pd{}{\bm{\theta}} (\bm{a}^\T \bm{\theta}) = \bm{a}$ and $\pd{}{\bm{\theta}} (\bm{\theta}^\T A \bm{\theta}) = (A + A^\T) \bm{\theta}$, together with the fact that $X^\T X$ is symmetric.
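As a quick sanity check of this gradient (not part of the derivation itself), we can compare it against central finite differences on randomly generated data; a minimal sketch, with arbitrary shapes and values:

```python
import numpy as np

# Numerical check of dJ/dtheta = -2 X^T y + 2 X^T X theta on random data
rng = np.random.default_rng(0)
n, p = 20, 3
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, p))])  # design matrix with intercept column
y = rng.normal(size=n)
theta = rng.normal(size=p + 1)

def J(theta):
    residual = y - X @ theta
    return residual @ residual  # sum of squared errors

analytic = -2 * X.T @ y + 2 * X.T @ X @ theta

# Central differences, one component of theta at a time
eps = 1e-6
numeric = np.array([(J(theta + eps * e) - J(theta - eps * e)) / (2 * eps)
                    for e in np.eye(p + 1)])

print(np.allclose(analytic, numeric))  # expected: True
```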

Consequently, we obtain the following equation, called the **normal equation**.
\begin{align*}
X^\T X \bm{\theta} = X^\T \bm{y}
\end{align*}
From the above, if the inverse of $X^\T X$ exists, the least squares estimate is given by:
\begin{align*}
\hat{\bm{\theta}} = (X^\T X)^{-1} X^\T \bm{y}\t (☆)
\end{align*}
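As a practical aside, when evaluating (☆) numerically it is usually preferable to solve the normal equation directly rather than forming the inverse explicitly. A minimal sketch, using three rows of the rent data purely for illustration:

```python
import numpy as np

# Three rows of the rent data (area, age -> rent), used only to illustrate the computation
X = np.array([[1.0, 17.00, 31.0],
              [1.0, 25.21, 12.0],
              [1.0, 30.22,  0.0]])
y = np.array([35000.0, 68000.0, 86000.0])

# Solve X^T X theta = X^T y without forming (X^T X)^{-1}
theta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# np.linalg.lstsq solves the same least squares problem and also handles
# the case where X^T X is singular or ill-conditioned
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(theta_hat)
print(theta_lstsq)
```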
Performing Linear Multiple Regression Analysis using Python
Now, let’s perform linear multiple regression analysis using Python. By calculating equation (☆), we can find the regression coefficient vector $\hat{\bm{\theta}}$.
For ease of visualization, the example below uses two explanatory variables: area and age of the building.
import numpy as np
import pandas as pd
area = [17.00, 18.00, 21.00, 24.00, 18.90, 20.66, 23.51, 25.21, 24.94, 30.22] # Explanatory variable 1
age = [31, 34, 36, 31, 28, 22, 19, 12, 4, 0] # Explanatory variable 2
rent = [35000, 36000, 39000, 42000, 46000, 50000, 59000, 68000, 73000, 86000] # Target variable
df = pd.DataFrame({
    'area': area,
    'age': age,
    'rent': rent})
First, we visualize the target 3D data using plotly.
import plotly.express as px
import plotly.graph_objects as go
fig = px.scatter_3d(df, x='area', y='age', z='rent', color='rent', opacity=0.7)
fig.show()
We will find a regression plane that fits this data.
Now, we prepare matrix $X$ and vector $\bm{y}$ to calculate equation (☆).
# Prepare matrix X and vector y
X = df[['area', 'age']].values
ones = np.ones(len(X)).reshape(-1, 1)
X = np.hstack((ones, X))
print('Matrix X')
print(X)
y = df['rent'].values
print('Vector y')
print(y)
## Outputs
Matrix X
[[ 1. 17. 31. ]
[ 1. 18. 34. ]
[ 1. 21. 36. ]
[ 1. 24. 31. ]
[ 1. 18.9 28. ]
[ 1. 20.66 22. ]
[ 1. 23.51 19. ]
[ 1. 25.21 12. ]
[ 1. 24.94 4. ]
[ 1. 30.22 0. ]]
Vector y
[35000 36000 39000 42000 46000 50000 59000 68000 73000 86000]
We calculate equation (☆) to find the regression coefficient vector $\hat{\bm{\theta}}$.
def multiple_regression(X, y):
    """Calculate the regression coefficient vector."""
    theta = np.linalg.pinv(X.T @ X) @ X.T @ y
    return theta
theta = multiple_regression(X, y)
theta_0 = theta[0]
theta_1 = theta[1]
theta_2 = theta[2]
print(f'theta_0: {theta_0}')
print(f'theta_1: {theta_1}')
print(f'theta_2: {theta_2}')
## Outputs
theta_0: 50300.79618461159
theta_1: 1154.3386202815495
theta_2: -1045.7759592710272
Now that we have found the regression coefficients, let's plot the regression plane.
# Visualization of the regression plane
mesh_size = 1
margin = 0.1
x1_min, x1_max = df['area'].min()-margin, df['area'].max()+margin
x2_min, x2_max = df['age'].min()-margin, df['age'].max()+margin
x1_range = np.arange(x1_min, x1_max, mesh_size)
x2_range = np.arange(x2_min, x2_max, mesh_size)
xx1, xx2 = np.meshgrid(x1_range, x2_range)
y_plane = theta_0 + theta_1 * xx1 + theta_2 * xx2 # Regression plane
fig = px.scatter_3d(df, x='area', y='age', z='rent', opacity=0.8) # Plot data points
colors = px.colors.qualitative.Plotly # Default plotly color sequence
fig.update_traces(marker=dict(
    color=colors[0],
    size=8,
    line=dict(width=2, color='white')))
fig.add_traces(go.Surface(x=x1_range, y=x2_range, z=y_plane, opacity=0.7)) # Plot the plane
fig.show()
In this way, we were able to obtain the regression plane by solving the normal equation.
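As an additional check (not part of the original code), we can also compute the fitted values and the coefficient of determination $R^2$; this sketch assumes that the matrix X, the vector y, and the coefficient vector theta computed above are still in scope:

```python
# Fitted values and coefficient of determination, reusing X, y, and theta from above
y_hat = X @ theta                       # predicted rent for each data point
ss_res = np.sum((y - y_hat) ** 2)       # residual sum of squares
ss_tot = np.sum((y - y.mean()) ** 2)    # total sum of squares
r2 = 1 - ss_res / ss_tot
print(f'R^2: {r2:.3f}')
```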
Performing Linear Multiple Regression Analysis using scikit-learn
We can perform multiple regression analysis in the same way using scikit-learn.
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
from sklearn.linear_model import LinearRegression
area = [17.00, 18.00, 21.00, 24.00, 18.90, 20.66, 23.51, 25.21, 24.94, 30.22] # Explanatory variable 1
age = [31, 34, 36, 31, 28, 22, 19, 12, 4, 0] # Explanatory variable 2
rent = [35000, 36000, 39000, 42000, 46000, 50000, 59000, 68000, 73000, 86000] # Target variable
df = pd.DataFrame({
    'area': area,
    'age': age,
    'rent': rent})
X = df[['area', 'age']].values
y = df['rent'].values
# Calculate regression coefficients
model = LinearRegression()
model.fit(X, y)
theta_0 = model.intercept_
theta_1 = model.coef_[0]
theta_2 = model.coef_[1]
print(f'theta_0: {theta_0}')
print(f'theta_1: {theta_1}')
print(f'theta_2: {theta_2}')
## Outputs
theta_0: 50300.79618451819
theta_1: 1154.3386202847946
theta_2: -1045.7759592701218
# Visualization of the regression plane
mesh_size = 1
margin = 0.1
x1_min, x1_max = df['area'].min()-margin, df['area'].max()+margin
x2_min, x2_max = df['age'].min()-margin, df['age'].max()+margin
x1_range = np.arange(x1_min, x1_max, mesh_size)
x2_range = np.arange(x2_min, x2_max, mesh_size)
xx1, xx2 = np.meshgrid(x1_range, x2_range)
y_plane = theta_0 + theta_1 * xx1 + theta_2 * xx2 # Regression plane
fig = px.scatter_3d(df, x='area', y='age', z='rent', opacity=0.8) # Plot data points
colors = px.colors.qualitative.Plotly # Default plotly color sequence
fig.update_traces(marker=dict(
    color=colors[0],
    size=8,
    line=dict(width=2, color='white')))
fig.add_traces(go.Surface(x=x1_range, y=x2_range, z=y_plane, opacity=0.7)) # Plot the plane
fig.show()
We were able to obtain the regression plane similarly using scikit-learn.
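Once fitted, the model can also predict the rent for new inputs; a small usage example with made-up values for area and age, reusing model and np from the code above:

```python
# Predict the rent for a hypothetical apartment: 22.0 m^2 area, 15 years old (made-up values)
X_new = np.array([[22.0, 15.0]])
print(model.predict(X_new))
```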




