Derivation of Spearman’s rank correlation coefficient and example calculation using Python

Introduction

Correlation coefficients are often used as a method of summarizing relationships between data. There are different types of correlation coefficients, and one is called a rank correlation coefficient, which is used in cases where only the order of the data is known.

In this article, I focus on Spearman’s rank correlation coefficient in particular, and describe its derivation together with an example calculation in Python.

You can try the source code described in this article from Google Colab below.

Google Colaboratory

\begin{align*}
\newcommand{\mat}[1]{\begin{pmatrix} #1 \end{pmatrix}}
\newcommand{\f}[2]{\frac{#1}{#2}}
\newcommand{\pd}[2]{\frac{\partial #1}{\partial #2}}
\newcommand{\d}[2]{\frac{{\rm d}#1}{{\rm d}#2}}
\newcommand{\T}{\mathsf{T}}
\newcommand{\(}{\left(}
\newcommand{\)}{\right)}
\newcommand{\{}{\left\{}
\newcommand{\}}{\right\}}
\newcommand{\[}{\left[}
\newcommand{\]}{\right]}
\newcommand{\dis}{\displaystyle}
\newcommand{\eq}[1]{{\rm Eq}(\ref{#1})}
\newcommand{\n}{\notag\\}
\newcommand{\t}{\ \ \ \ }
\newcommand{\tt}{\t\t\t\t}
\newcommand{\argmax}{\mathop{\rm arg\, max}\limits}
\newcommand{\argmin}{\mathop{\rm arg\, min}\limits}
\def\l<#1>{\left\langle #1 \right\rangle}
\def\us#1_#2{\underset{#2}{#1}}
\def\os#1^#2{\overset{#2}{#1}}
\newcommand{\case}[1]{\{ \begin{array}{ll} #1 \end{array} \right.}
\newcommand{\s}[1]{{\scriptstyle #1}}
\definecolor{myblack}{rgb}{0.27,0.27,0.27}
\definecolor{myred}{rgb}{0.78,0.24,0.18}
\definecolor{myblue}{rgb}{0.0,0.443,0.737}
\definecolor{myyellow}{rgb}{1.0,0.82,0.165}
\definecolor{mygreen}{rgb}{0.24,0.47,0.44}
\newcommand{\c}[2]{\textcolor{#1}{#2}}
\newcommand{\ub}[2]{\underbrace{#1}_{#2}}
\end{align*}

Spearman’s rank correlation coefficient.

Spearman’s rank correlation coefficient $\rho_{xy}$ is determined only by the order (the ranks) of the data, not by the magnitudes of the data values. It can therefore be computed even when only the ranks are known and the underlying values are not.

For example, you can calculate Spearman’s rank correlation coefficient even if you only know the following academic test rankings.

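\begin{align*}
\begin{array}{cc}
\text{Math test rank} & \text{Physics test rank} \\ \hline
1 & 1 \\
3 & 4 \\
2 & 2 \\
4 & 5 \\
5 & 3 \\
6 & 6
\end{array}
\end{align*}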

In general, the ranks of the observed values $x, y$ are written as $\tilde{x}, \tilde{y}$, respectively, as shown in the following table.

\begin{align*}
\begin{array}{cc}
x \text{ rank} & y \text{ rank} \\ \hline
\tilde{x}^{(1)} & \tilde{y}^{(1)} \\
\tilde{x}^{(2)} & \tilde{y}^{(2)} \\
\vdots & \vdots \\
\tilde{x}^{(n)} & \tilde{y}^{(n)}
\end{array}
\end{align*}
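
When raw values rather than ranks are available, the ranks can be obtained programmatically. The following is a minimal sketch using scipy.stats.rankdata. The scores are hypothetical (they are not from this article) and were chosen without ties. Negating the scores makes rank 1 correspond to the highest score, as in a test ranking.

```python
import numpy as np
import scipy.stats as st

# Hypothetical raw scores, one entry per student (assumption, not article data)
scores = np.array([92, 71, 85, 64, 58, 40])

# rankdata assigns rank 1 to the smallest value, so negate the scores
# to give rank 1 to the highest score instead
ranks = st.rankdata(-scores)
print(ranks)
```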

Then, Spearman’s rank correlation coefficient $\rho_{xy}$ can be calculated by the following equation.

\begin{align*}
\rho_{xy} = 1 - \f{6}{n(n^2-1)} \sum_{i=1}^n (\tilde{x}^{(i)} - \tilde{y}^{(i)})^2.
\end{align*}
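
As a quick sanity check, the formula can be applied by hand to the six pairs of test rankings in the table above. The rank differences are $0, -1, 0, -1, 2, 0$, so the sum of squared differences is $6$, and with $n = 6$:

\begin{align*}
\rho_{xy} = 1 - \frac{6 \times 6}{6\,(6^2-1)} = 1 - \frac{36}{210} = \frac{29}{35} \approx 0.829.
\end{align*}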

Calculation of rank correlation coefficient using Python.

Let’s postpone the proof of why Spearman’s rank correlation coefficient takes the form of the above equation, and first compute the rank correlation coefficient in Python.

There is no need to implement the formula by hand: the coefficient can be computed with the spearmanr function in scipy.stats.

For example, using the example of the academic achievement test rankings mentioned earlier, the rank correlation coefficient is calculated as follows.

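import scipy.stats as st

# Math and physics test ranks from the table above
math = [1, 3, 2, 4, 5, 6]
phys = [1, 4, 2, 5, 3, 6]

# spearmanr returns the correlation coefficient and a p-value
corr, pvalue = st.spearmanr(math, phys)
print(corr)
# Output:
# 0.8285714285714287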

Thus, Spearman’s rank correlation coefficient indicates a strong positive correlation between the math and physics test rankings.
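
To see that spearmanr agrees with the formula given earlier, the formula can also be evaluated directly. The following is a minimal sketch. The helper name spearman_from_ranks is hypothetical (it is not part of scipy), and it assumes the inputs are already ranks with no ties.

```python
import numpy as np
import scipy.stats as st

def spearman_from_ranks(rank_x, rank_y):
    """Hypothetical helper: 1 - 6 * sum(d^2) / (n (n^2 - 1)), assuming tie-free ranks."""
    rank_x = np.asarray(rank_x, dtype=float)
    rank_y = np.asarray(rank_y, dtype=float)
    n = len(rank_x)
    sum_d2 = np.sum((rank_x - rank_y) ** 2)  # sum of squared rank differences
    return 1.0 - 6.0 * sum_d2 / (n * (n ** 2 - 1))

math_rank = [1, 3, 2, 4, 5, 6]
phys_rank = [1, 4, 2, 5, 3, 6]

rho_formula = spearman_from_ranks(math_rank, phys_rank)
rho_scipy, _ = st.spearmanr(math_rank, phys_rank)
print(rho_formula, rho_scipy)  # the two values should agree up to rounding
```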

Derivation of Spearman’s rank correlation coefficient.

Finally, we derive the formula for Spearman’s rank correlation coefficient stated above:

\begin{align*}
\rho_{xy} = 1 - \f{6}{n(n^2-1)} \sum_{i=1}^n (\tilde{x}^{(i)} - \tilde{y}^{(i)})^2.
\end{align*}

The starting point of the derivation is the defining formula of the (Pearson) correlation coefficient, applied to the ranks $\tilde{x}, \tilde{y}$ instead of the raw values:

\begin{align*}
\rho_{xy} = \f{{\rm cov}[\tilde{x}, \tilde{y}]}{\sigma_{\tilde{x}} \sigma_{\tilde{y}}} = \f{\sum_{i=1}^n (\tilde{x}^{(i)} - \bar{\tilde{x}}) (\tilde{y}^{(i)} - \bar{\tilde{y}}) }{\sqrt{\sum_{i=1}^n (\tilde{x}^{(i)} - \bar{\tilde{x}})^2} \sqrt{\sum_{i=1}^n (\tilde{y}^{(i)} - \bar{\tilde{y}})^2}}.
\end{align*}
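
The defining formula can be tried out numerically: rank the raw values, then take the ordinary correlation coefficient of the ranks. The following is a minimal sketch under the assumption of tie-free data. The scores are hypothetical and are not taken from this article.

```python
import numpy as np
import scipy.stats as st

# Hypothetical raw test scores without ties (assumption, not article data)
math_scores = np.array([92, 71, 85, 64, 58, 40])
phys_scores = np.array([88, 60, 79, 55, 70, 35])

# Convert values to ranks; the direction of ranking does not change the
# correlation as long as both variables are ranked the same way
math_rank = st.rankdata(math_scores)
phys_rank = st.rankdata(phys_scores)

# Ordinary (Pearson) correlation coefficient computed on the ranks
rho_pearson_on_ranks = np.corrcoef(math_rank, phys_rank)[0, 1]

# Reference value from scipy, computed from the raw scores
rho_scipy, _ = st.spearmanr(math_scores, phys_scores)
print(rho_pearson_on_ranks, rho_scipy)
```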

Proof.

\begin{align*}
\rho_{xy} = \f{\sum_{i=1}^n (\tilde{x}^{(i)} - \bar{\tilde{x}}) (\tilde{y}^{(i)} - \bar{\tilde{y}}) }{\sqrt{\sum_{i=1}^n (\tilde{x}^{(i)} - \bar{\tilde{x}})^2} \sqrt{\sum_{i=1}^n (\tilde{y}^{(i)} - \bar{\tilde{y}})^2}} \t (☆)
\end{align*}

In preparation for the transformation, note that when there are no ties, the ranks $\tilde{x}^{(i)}$ and $\tilde{y}^{(i)}$ are each a permutation of $1, \dots, n$, so the following equations hold.

\begin{align*}
\bar{\tilde{x}} &= \bar{\tilde{y}} = \f{1}{n} \sum_{i=1}^n i = \f{n+1}{2}, \n
\sum_{i=1}^n {\tilde{x}^{(i)}}^2 &= \sum_{i=1}^n {\tilde{y}^{(i)}}^2 = \sum_{i=1}^n i^2 = \f{n(n+1)(2n+1)}{6}.
\end{align*}

Then the sums of squared deviations in the denominator of (☆) become

\begin{align*}
\sum_{i=1}^n (\tilde{x}^{(i)} - \bar{\tilde{x}})^2 = \sum_{i=1}^n (\tilde{y}^{(i)} - \bar{\tilde{y}})^2 &= \sum_{i=1}^n {\tilde{x}^{(i)}}^2 - n\bar{\tilde{x}}^2 \n
&= \f{n(n+1)(2n+1)}{6} - n \( \f{n+1}{2} \)^2 \n
&= \left(\frac{n+1}{2}\right)\left(\frac{n(2 n+1)}{3}-n \frac{(n+1)}{2}\right) \n
&= \left(\frac{n+1}{2}\right)\left(\frac{n^{2}-n}{6}\right) \n
&= \frac{n\left(n^{2}-1\right)}{12}
\end{align*}

and the sum in the numerator becomes

\begin{align*}
\sum_{i=1}^n (\tilde{x}^{(i)} - \bar{\tilde{x}}) (\tilde{y}^{(i)} - \bar{\tilde{y}}) &= \sum_{i=1}^n \tilde{x}^{(i)} \tilde{y}^{(i)} - n\bar{\tilde{x}} \bar{\tilde{y}} \n
&= \sum_{i=1}^n \tilde{x}^{(i)} \tilde{y}^{(i)} - n\left(\frac{n+1}{2}\right)^{2} \n
&= \sum_{i=1}^n \tilde{x}^{(i)} \tilde{y}^{(i)} - \frac{n(n+1)^{2}}{4}.
\end{align*}

Substituting these into equation (☆) gives

\begin{align*}
\rho_{xy} &= \frac{\sum_{i=1}^{n} \tilde{x}^{(i)} \tilde{y}^{(i)} -\frac{n(n+1)^{2}}{4}}{\frac{n\left(n^{2}-1\right)}{12}} \n
&= \frac{12 \sum_{i=1}^{n} \tilde{x}^{(i)} \tilde{y}^{(i)}}{n\left(n^{2}-1\right)}-\frac{3 n(n+1)^{2}}{n\left(n^{2}-1\right)} \n
&= \frac{12 \sum_{i=1}^{n} \tilde{x}^{(i)} \tilde{y}^{(i)}}{n\left(n^{2}-1\right)}-\frac{3 n(n+1)^{2}}{n(n-1)(n+1)} \n
&= \frac{12 \sum_{i=1}^{n} \tilde{x}^{(i)} \tilde{y}^{(i)}}{n\left(n^{2}-1\right)}-\frac{3(n+1)}{n-1}
\end{align*}
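
The intermediate results so far can be spot-checked numerically. The following is a minimal sketch that draws two random tie-free rankings (the size n = 8 and the seed are arbitrary choices) and compares the closed forms with direct computation.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8
rank_x = rng.permutation(n) + 1  # tie-free ranks 1..n in random order
rank_y = rng.permutation(n) + 1

# Sum of squared deviations versus the closed form n(n^2 - 1)/12
ss_x = np.sum((rank_x - rank_x.mean()) ** 2)
closed_ss = n * (n ** 2 - 1) / 12

# Correlation coefficient of the ranks versus the intermediate expression
# 12 * sum(x y) / (n (n^2 - 1)) - 3 (n + 1) / (n - 1)
pearson = np.corrcoef(rank_x, rank_y)[0, 1]
sum_xy = np.sum(rank_x * rank_y)
intermediate = 12 * sum_xy / (n * (n ** 2 - 1)) - 3 * (n + 1) / (n - 1)

print(ss_x, closed_ss)
print(pearson, intermediate)
```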

On the other hand,

\begin{align*}
\sum_{i=1}^{n} (\tilde{x}^{(i)} - \tilde{y}^{(i)})^2 &= \sum_{i=1}^{n} {\tilde{x}^{(i)}}^{2}+\sum_{i=1}^{n} {\tilde{y}^{(i)}}^{2}-2 \sum_{i=1}^{n} \tilde{x}^{(i)} \tilde{y}^{(i)} \n
&= 2 \sum_{i=1}^{n} i^{2}-2 \sum_{i=1}^{n} \tilde{x}^{(i)} \tilde{y}^{(i)} \n
&= 2 \frac{n(n+1)(2 n+1)}{6}-2 \sum_{i=1}^{n} \tilde{x}^{(i)} \tilde{y}^{(i)}.
\end{align*}

Solving this for $\sum_{i=1}^{n} \tilde{x}^{(i)} \tilde{y}^{(i)}$ gives

\begin{align*}
\sum_{i=1}^{n} \tilde{x}^{(i)} \tilde{y}^{(i)}=\frac{n(n+1)(2 n+1)}{6}-\frac{1}{2} \sum_{i=1}^{n} (\tilde{x}^{(i)} - \tilde{y}^{(i)})^2.
\end{align*}

Using the above,

\begin{align*}
\rho_{xy} &= \frac{12 \cdot\left(\frac{n(n+1)(2 n+1)}{6}-\frac{1}{2} \sum_{i=1}^{n} (\tilde{x}^{(i)} - \tilde{y}^{(i)})^2\right)}{n\left(n^{2}-1\right)}-\frac{3(n+1)}{n-1} \n
&= \frac{2 n(n+1)(2 n+1)-6 \sum_{i=1}^{n} (\tilde{x}^{(i)} - \tilde{y}^{(i)})^2}{n(n-1)(n+1)}-\frac{3(n+1)}{n-1} \n
&= \frac{4 n^{3}+6 n^{2}+2 n-6 \sum_{i=1}^{n} (\tilde{x}^{(i)} - \tilde{y}^{(i)})^2-3 n(n+1)^{2}}{n(n-1)(n+1)} \n
&= \frac{n^{3}-n-6 \sum_{i=1}^{n} (\tilde{x}^{(i)} - \tilde{y}^{(i)})^2}{n(n-1)(n+1)} \n
&= \frac{n^{3}-n-6 \sum_{i=1}^{n} (\tilde{x}^{(i)} - \tilde{y}^{(i)})^2}{n\left(n^{2}-1\right)} \n
&= 1-\frac{6 \sum_{i=1}^{n} (\tilde{x}^{(i)} - \tilde{y}^{(i)})^2}{n\left(n^{2}-1\right)}.
\end{align*}

Therefore, the following equation holds.

\begin{align*}
\rho_{xy} = 1 - \f{6}{n(n^2-1)} \sum_{i=1}^n (\tilde{x}^{(i)} - \tilde{y}^{(i)})^2.
\end{align*}
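
The result can also be checked by brute force. The following is a minimal sketch that compares the derived formula with the correlation coefficient of the ranks for every tie-free ranking of small size. The helper names shortcut and pearson are hypothetical. The last lines try one example with tied ranks, where the derivation's assumption that the ranks are a permutation of $1, \dots, n$ does not hold.

```python
from itertools import permutations

import numpy as np
import scipy.stats as st

def shortcut(rank_x, rank_y):
    """Hypothetical helper: 1 - 6 * sum(d^2) / (n (n^2 - 1))."""
    rank_x = np.asarray(rank_x, dtype=float)
    rank_y = np.asarray(rank_y, dtype=float)
    n = len(rank_x)
    return 1.0 - 6.0 * np.sum((rank_x - rank_y) ** 2) / (n * (n ** 2 - 1))

def pearson(rank_x, rank_y):
    """Hypothetical helper: ordinary correlation coefficient of two rank vectors."""
    return np.corrcoef(rank_x, rank_y)[0, 1]

# Every tie-free ranking of y against x = 1..n, for n = 2..6
max_gap = 0.0
for n in range(2, 7):
    rank_x = np.arange(1, n + 1)
    for perm in permutations(range(1, n + 1)):
        gap = abs(shortcut(rank_x, perm) - pearson(rank_x, perm))
        max_gap = max(max_gap, gap)
print("largest gap without ties:", max_gap)

# With ties, rankdata assigns average ranks and the two values need not agree
tied_x = st.rankdata([1, 2, 2, 3])  # ranks 1, 2.5, 2.5, 4
tied_y = st.rankdata([1, 2, 3, 4])
print(shortcut(tied_x, tied_y), pearson(tied_x, tied_y))
```

Without ties the gap should be at the level of floating-point rounding. In the tied example the two numbers differ slightly, which is a reminder that the derived formula relies on tie-free ranks.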


For more information on the relationship between data, please refer to this article.

【Multivariate Data】 Scatter Plots and Correlation Coefficients
In this article, I will discuss scatter plots and scatter plot matrices as a basic way to handle multivariate data, and correlation coefficients, rank correlation coefficients, and variance-covariance matrices as a method of summarization.