# Theory And Implementation Of Least Squares Method

## Introduction

This article introduces the theory and python implementation of linear regression, which has a long history in machine learning, particularly the least squares method. The least squares method is a simple model that has many applications and is a very important concept.

For example, we often have a scatterplot and think there might be a linear relationship there. The least squares method allows us to fit an appropriate straight line. （Of course, there are times when a curvilinear relationship seems likely, but we will limit our discussion to linear relationships.

You can also try the sample code in this article at ▼

\begin{align*} \newcommand{\mat}{\begin{pmatrix} #1 \end{pmatrix}} \newcommand{\f}{\frac{#1}{#2}} \newcommand{\pd}{\frac{\partial #1}{\partial #2}} \newcommand{\d}{\frac{{\rm d}#1}{{\rm d}#2}} \newcommand{\T}{\mathsf{T}} \newcommand{$}{\left(} \newcommand{$}{\right)} \newcommand{\{}{\left\{} \newcommand{\}}{\right\}} \newcommand{$}{\left[} \newcommand{$}{\right]} \newcommand{\dis}{\displaystyle} \newcommand{\eq}{{\rm Eq}(\ref{#1})} \newcommand{\n}{\notag\\} \newcommand{\t}{\ \ \ \ } \newcommand{\tt}{\t\t\t\t} \newcommand{\argmax}{\mathop{\rm arg\, max}\limits} \newcommand{\argmin}{\mathop{\rm arg\, min}\limits} \def\l<#1>{\left\langle #1 \right\rangle} \def\us#1_#2{\underset{#2}{#1}} \def\os#1^#2{\overset{#2}{#1}} \newcommand{\case}{\{ \begin{array}{ll} #1 \end{array} \right.} \newcommand{\s}{{\scriptstyle #1}} \definecolor{myblack}{rgb}{0.27,0.27,0.27} \definecolor{myred}{rgb}{0.78,0.24,0.18} \definecolor{myblue}{rgb}{0.0,0.443,0.737} \definecolor{myyellow}{rgb}{1.0,0.82,0.165} \definecolor{mygreen}{rgb}{0.24,0.47,0.44} \newcommand{\c}{\textcolor{#1}{#2}} \newcommand{\ub}{\underbrace{#1}_{#2}} \end{align*}

## Capture the relationship between two variables

Here we take as an example the relationship between the “mass of the weight” suspended from the spring and the “length of the spring. Below are 10 sets of data for the mass of the weight and the length of the spring, and their scatter plots.

Suppose the following relationship exists between the two variables: the mass of the weight $x$ and the length of the spring $y$.

\begin{align*} y = u(x) \end{align*}

Now consider the unknown function $u(x)$ that represents the true structure of this phenomenon. We assume from data trends and a priori findings that the structure of this phenomenon is linear.

We therefore assume the following as the true structure $u(x)$.

\begin{align*} y = u(x) = \theta_0 + \theta_1 x \end{align*}

If the structure of the phenomenon is linear and there are no errors in the data in Table 1, all 10 data will lie on a straight line if the intercept: $\theta_0$ and slope: $\theta_1$ of the line are properly determined.

In reality, however, the data deviates from that straight line due to errors. Therefore, we assume that the equation $y = \theta_0 + \theta_1 x + \varepsilon$ is satisfied, taking into account the misalignment $\varepsilon$.

In general, when $n$ data are observed, the $i$-th data point $(x^{(i)}, y^{(i)})$ satisfies the following equation.

\begin{align*} y^{(i)} = \theta_0 + \theta_1 x^{(i)} + \varepsilon^{(i)}. \end{align*}

where $\theta_0, \theta_1$ are the regression coefficients and $\varepsilon^{(i)}$ is the error term, and the model expressed in the above equation is called a linear regression model. The variable $y$ corresponding to the length of the spring is called the objective variable, and the variable $x$ corresponding to the mass of the weight is called the explanatory variable.

Now, how do we determine the regression coefficients $\theta_0, \theta_1$? One method for determining the value of this parameter is the method of least squares.

*If there were a curvilinear relationship, for example $y=\theta_0 + \theta_1 x + \theta_2 x^2 + \varepsilon$, that would be the model. However, even in this case, since the objective variable is linear with respect to the parameter $\theta$, it is classified as a linear regression problem. That is, if $\bm{x} = (1, x, x^2)^\T$, $\bm{\theta} = (\theta_0, \theta_1, \theta_2)^\T$, then $y = \bm{\theta}^\T \bm{x} + \varepsilon$ and is linear. Therefore, there is no major difference from the next section in obtaining an estimate of $\bm{\theta}$.

## Estimating regression coefficients by the least squares method

In this linear regression model, the true value at the $i$th data $x^{(i)}$ is $\theta_0 + \theta_1 x^{(i)}$, to which the error $\varepsilon^{(i)}$ is added and data $y^{(i)}$ is observed.

The method of least squares is a regression coefficient $(\varepsilon^{(1)})^2 + (\varepsilon^{(2)})^2 + \cdots + (\varepsilon^{(n)})^2$ so that the sum of the squares of the error terms in each data minimizes the regression coefficient $\theta_0, This is a method to determine$ \theta_1$. Now,$\varepsilon^{(i)} = y^{(i)} – (\theta_0 + \theta_1 x^{(i)})$So, to express the least-squares method as a formula, the sum of squares is$J \equiv \sum_i (\varepsilon^{(i)})^2and Defined, it can be expressed as follows. \begin{align*} \{\theta_0, \theta_1\} &= \argmin_{\theta_0, \theta_1} J(\theta_0, \theta_1) \n &= \argmin_{\theta_0, \theta_1} \sum_{i=1}^n (\varepsilon^{(i)})^2 \n &= \argmin_{\theta_0, \theta_1} \sum_{i=1}^n $y^{(i)} – (\theta_0 + \theta_1 x^{(i)})$^2. \end{align*} Then, by viewingJ(\theta_0, \theta_1)$as a function of the regression coefficients$\theta_0, \theta_1$and setting the partial differentiated expression to 0, we obtain the value of the regression coefficient that minimizes$J. \begin{align*} J(\theta_0, \theta_1) = \sum_{i=1}^n $y^{(i)} – (\theta_0 + \theta_1 x^{(i)})$^2 \end{align*} Therefore, \begin{align*} \pd{J}{\theta_0} = 2 \sum_{i=1}^n $y^{(i)} – \theta_0 – \theta_1 x^{(i)}$ \cdot (-1) = 0 \n \therefore \sum_{i=1}^n y^{(i)} = \sum_{i=1}^n \theta_0 + \sum_{i=1}^n \theta_1 x^{(i)}, \end{align*} \begin{align*} \pd{J}{\theta_1} = 2 \sum_{i=1}^n $$y^{(i)} – \theta_0 – \theta_1 x^{(i)}$ \cdot (-x^{(i)})$ = 0 \n \therefore \sum_{i=1}^n x^{(i)}y^{(i)} = \sum_{i=1}^n \theta_0 x^{(i)} + \sum_{i=1}^n \theta_1 (x^{(i)})^2. \end{align*} From the above, the estimated regression coefficients\hat{\theta}_0,\hat{\theta}_1can be obtained by solving the following simultaneous equations. \begin{align*} \case{ \sum_i y^{(i)} = n \theta_0 + \theta_1 \sum_i x^{(i)}, \\ \sum_i x^{(i)}y^{(i)} = \theta_0 \sum_i x^{(i)} + \theta_1 \sum_i (x^{(i)})^2. } \end{align*} Using the expected value{\rm E}[x] = \f{1}{n} \sum_i x^{(i)}, we organize \begin{align*} \case{ {\rm E}[y] = \theta_0 + \theta_1 {\rm E}[x], \\ {\rm E}[xy] = \theta_0 {\rm E}[x] + \theta_1 {\rm E}[x^2]. } \end{align*} Thus, the solution of the simultaneous equations, the estimated regression coefficients\hat{\theta}_0,\hat{\theta}_1, respectively, are \begin{align*} \hat{\theta}_0 = {\rm E}[y] – \hat{\theta_1} {\rm E}[x], \t \hat{\theta}_1 &= \f{{\rm E}[xy] – {\rm E}[x] \cdot {\rm E}[y]}{{\rm E}[x^2] – {\rm E}[x]^2} \n &= \f{{\rm Cov}(x, y)}{{\rm V}[x]} \t (☆) \end{align*} where{\rm{V}}[x]$denotes the variance and${\rm Cov}(x, y)\$ denotes the covariance.

## Calculating regression coefficients using Python

I will use python to calculate the regression coefficients. As an example, we will use the relationship between the mass of a weight suspended from a spring and the length of the spring, as shown in the introduction. All we have to do is simply calculate the equation (☆).

import numpy as np
import matplotlib.pyplot as plt

weights = np.array([5, 10, 15, 20, 25, 30, 35, 40, 45, 50])
lengths = np.array([5.3, 5.7, 6.4, 6.9, 7.7, 8.2, 8.4, 9.8, 9.9, 10.7])

def least_squares(x, y):
theta_1 = np.cov(x, y) / np.var(x)
theta_0 = np.mean(y) - theta_1 * np.mean(x)
return theta_0, theta_1

theta_0, theta_1 = least_squares(weights, lengths)

print(f'slope: {theta_0}')
print(f'intercept: {theta_1}')

x = np.arange(5, 55, 0.1)
y = theta_0 + theta_1 * x

plt.scatter(weights, lengths, c='#007bc3')
plt.plot(x, y, c='#de3838')
plt.show()
# Output
slope: 4.196296296296298
intercept: 0.13468013468013465

## Calculate regression coefficients using scikit-learn

The regression coefficients can be obtained in scikit-learn as well.

from sklearn.linear_model import LinearRegression

weights = np.array([5, 10, 15, 20, 25, 30, 35, 40, 45, 50])
lengths = np.array([5.3, 5.7, 6.4, 6.9, 7.7, 8.2, 8.4, 9.8, 9.9, 10.7])

model = LinearRegression()
model.fit(weights.reshape(-1, 1), lengths)

theta_0, theta_1 = model.intercept_, model.coef_

print(f'slope: {theta_0}')
print(f'intercept: {theta_1}')

x = np.arange(5, 55, 0.1)
y = theta_0 + theta_1 * x

plt.scatter(weights, lengths, c='#007bc3')
plt.plot(x, y, c='#de3838')
plt.show()
# Output
slope: 4.566666666666668
intercept: [0.12121212] 