The post About events, probabilities and random variables. first appeared on Yukkuri Machine Learning.

Last time, we studied how to handle univariate and multivariate data.

【Multivariate Data】 Scatter Plots and Correlation Coefficients

In this article, I will discuss scatter plots and scatter plot matrices as a basic way to handle multivariate data, and correlation coefficients, rank correlation coefficients, and variance-covariance matrices as a method of summarization.

In this article, we will discuss events, probabilities, and random variables. The descriptions are mathematical and abstract, but the ideas themselves are common sense, so it may be best to read them without overthinking.

As an example, consider a situation where you roll a die once. The possible outcomes are 1, 2, 3, 4, 5, or 6. Here,

- A possible result is called a sample point.
- A set of sample points $\Omega = \{ 1, 2, 3, 4, 5, 6 \}$ is called a sample space.
- A subset of the sample space is called an event.

\begin{align*} \newcommand{\mat}[1]{\begin{pmatrix} #1 \end{pmatrix}} \newcommand{\f}[2]{\frac{#1}{#2}} \newcommand{\pd}[2]{\frac{\partial #1}{\partial #2}} \newcommand{\d}[2]{\frac{{\rm d}#1}{{\rm d}#2}} \newcommand{\T}{\mathsf{T}} \newcommand{\(}{\left(} \newcommand{\)}{\right)} \newcommand{\{}{\left\{} \newcommand{\}}{\right\}} \newcommand{\[}{\left[} \newcommand{\]}{\right]} \newcommand{\dis}{\displaystyle} \newcommand{\eq}[1]{{\rm Eq}(\ref{#1})} \newcommand{\n}{\notag\\} \newcommand{\t}{\ \ \ \ } \newcommand{\tt}{\t\t\t\t} \newcommand{\argmax}{\mathop{\rm arg\, max}\limits} \newcommand{\argmin}{\mathop{\rm arg\, min}\limits} \def\l<#1>{\left\langle #1 \right\rangle} \def\us#1_#2{\underset{#2}{#1}} \def\os#1^#2{\overset{#2}{#1}} \newcommand{\case}[1]{\{ \begin{array}{ll} #1 \end{array} \right.} \newcommand{\s}[1]{{\scriptstyle #1}} \definecolor{myblack}{rgb}{0.27,0.27,0.27} \definecolor{myred}{rgb}{0.78,0.24,0.18} \definecolor{myblue}{rgb}{0.0,0.443,0.737} \definecolor{myyellow}{rgb}{1.0,0.82,0.165} \definecolor{mygreen}{rgb}{0.24,0.47,0.44} \newcommand{\c}[2]{\textcolor{#1}{#2}} \newcommand{\ub}[2]{\underbrace{#1}_{#2}} \end{align*}

The probability that the event $A$ occurs is written $P(A)$. Here, $P(A)$ must satisfy the following properties (the axioms of probability).

\begin{align*} &0 \leq P(A) \leq 1, \n &P(\Omega) = 1, \n &P(A \cup B) = P(A) + P(B) \t {\rm if}\ A \cap B = \emptyset. \end{align*}

A random variable is a variable that takes various values, each with a fixed probability. For example, the outcome of the die roll is a random variable. If the die is fair, each of the six outcomes occurs with the same probability, $1/6$. Writing the outcome of the die as $X$, this can be expressed as follows.

\begin{align*} P(X = x) = \f{1}{6}, \t x=1, 2, \dots, 6 \end{align*}
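As a quick sanity check, this uniform probability can be approximated by simulation. The following sketch (using numpy; the sample size of 100,000 is an arbitrary choice) estimates $P(X = x)$ from simulated rolls.

```python
import numpy as np

rng = np.random.default_rng(0)
rolls = rng.integers(1, 7, size=100_000)  # 100,000 simulated rolls of a fair die

# the empirical frequency of each face should be close to 1/6 ≈ 0.167
for x in range(1, 7):
    print(x, np.mean(rolls == x))
```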

A random variable that takes discrete values, like the outcome of a die, is called a discrete random variable.

On the other hand, if the value of a random variable is expected to change continuously, such as height or weight, it is called a continuous random variable.

The probability that the event $A$ occurs given that the event $B$ has occurred is called the conditional probability of $A$ given $B$, and is expressed as follows.

\begin{align*} P(A|B) = \f{P(A \cap B)}{P(B)} \end{align*}

Here, when the conditional probability of $A$ given $B$ is not affected by $B$, that is, when the following equation holds, the events $A$ and $B$ are said to be independent.

\begin{align*} P(A|B) = \f{P(A \cap B)}{P(B)} = P(A) \end{align*}

The independence of $A$ and $B$ can also be expressed by rewriting the above equation as follows.

\begin{align*} P(A \cap B) = P(A)P(B) \end{align*}
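This product condition can be checked by direct enumeration. As a small illustration with hypothetical events, let $A$ be "the outcome is even" and $B$ be "the outcome is at most 4" for a single die roll; these turn out to be independent.

```python
from fractions import Fraction

omega = set(range(1, 7))  # sample space of a single die roll
A = {2, 4, 6}             # event "the outcome is even"
B = {1, 2, 3, 4}          # event "the outcome is at most 4"

def P(event):
    # probability of an event under equally likely outcomes
    return Fraction(len(event & omega), len(omega))

print(P(A & B), P(A) * P(B))  # both are 1/3, so A and B are independent
```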

Multiplying both sides of the conditional probability formula by $P(B)$, and doing the same with the roles of $A$ and $B$ swapped, gives $P(A \cap B) = P(A|B)P(B)$ and $P(A \cap B) = P(B|A)P(A)$, respectively. Equating the two and dividing by $P(B)$ yields Bayes’ theorem.

\begin{align*} P(A|B) = \f{P(B|A)P(A)}{P(B)} \end{align*}

Also, when the sample space is partitioned into disjoint events $A_1, A_2, \dots, A_k$, Bayes’ theorem extends to the following form.

\begin{align*} P(A_i|B) = \f{P(B|A_i)P(A_i)}{P(B)} = \f{P(B|A_i)P(A_i)}{\sum_{j=1}^k P(B|A_j)P(A_j)} \end{align*}

The transformation in the last step uses the law of total probability.

When we additionally condition on another random variable, the Bayes formula can be written as follows.

\begin{align*} P(X|Y, Z) = \f{P(Y|X, Z)P(X|Z)}{P(Y|Z)} \end{align*}

The proof is as follows.

**Proof.**

Given the joint distribution $P(X, Y, Z)$, the definition of conditional probability gives

\begin{align*} P(X, Y, Z) &= P(X|Y, Z)P(Y, Z) \n &=P(X|Y, Z)P(Y|Z)P(Z) \end{align*}

On the other hand, \begin{align*} P(X, Y, Z) &= P(Y|X, Z)P(X, Z) \n &=P(Y|X, Z)P(X|Z)P(Z) \end{align*}

Comparing the two expressions,

\begin{align*} P(X|Y, Z)P(Y|Z)P(Z) &= P(Y|X, Z)P(X|Z)P(Z) \n \therefore P(X | Y, Z) &= \f{P(Y|X, Z) P(X|Z)}{P(Y|Z)}. \end{align*}
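This identity can be sanity-checked numerically. The sketch below builds a hypothetical joint distribution $P(X, Y, Z)$ over three binary variables and confirms that both sides agree.

```python
import numpy as np

rng = np.random.default_rng(1)
# a hypothetical joint distribution P(X, Y, Z) over three binary variables
P = rng.random((2, 2, 2))
P /= P.sum()  # normalize so all probabilities sum to 1

P_xz = P.sum(axis=1)      # P(X, Z)
P_yz = P.sum(axis=0)      # P(Y, Z)
P_z = P.sum(axis=(0, 1))  # P(Z)

# left-hand side: P(X|Y,Z) = P(X,Y,Z) / P(Y,Z)
lhs = P / P_yz[None, :, :]
# right-hand side: P(Y|X,Z) P(X|Z) / P(Y|Z)
rhs = (P / P_xz[:, None, :]) * (P_xz / P_z)[:, None, :] / (P_yz / P_z)[None, :, :]
print(np.allclose(lhs, rhs))  # True
```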

For random variables, one usually speaks of the expected value rather than the mean. The definitions of the expected value and variance differ slightly between discrete and continuous random variables.

The expected value $E[X]$ and variance $V[X]$ of a random variable $X$ are defined as follows, where $\mu$ denotes the expected value $E[X]$.

For a discrete random variable:

\begin{align*} E[X] &= \sum_{x} x P(X = x) \n V[X] &= \sum_{x} (x - \mu)^2 P(X = x) \end{align*}

For a continuous random variable with probability density function $f(x)$:

\begin{align*} E[X] &= \int_{-\infty}^{\infty} x f(x) dx \n V[X] &= \int_{-\infty}^{\infty} (x - \mu)^2 f(x) dx \end{align*}

There is also an important formula for the variance:

\begin{align*} V[X] = E[X^2] - E[X]^2 \end{align*}

This can generally be proven as follows:

**Proof.**

Hereafter, $E[X] = \mu$.

\begin{align*} V[X] &= E[(X - \mu)^2] \n &= E[X^2 - 2\mu X + \mu^2] \n &= E[X^2] -2 \mu E[X] + \mu^2 \n &= E[X^2] -2 \mu^2 + \mu^2 \n &= E[X^2] - E[X]^2. \end{align*}
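For the die example, both the defining formula for the variance and the formula $V[X] = E[X^2] - E[X]^2$ can be evaluated exactly, for instance with Python's fractions module.

```python
from fractions import Fraction

outcomes = range(1, 7)
p = Fraction(1, 6)  # fair die: P(X = x) = 1/6

E = sum(x * p for x in outcomes)             # E[X]
V = sum((x - E) ** 2 * p for x in outcomes)  # V[X] from the definition
E2 = sum(x ** 2 * p for x in outcomes)       # E[X^2]

print(E, V)              # 7/2 35/12
print(V == E2 - E ** 2)  # True: V[X] = E[X^2] - E[X]^2
```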

Next time: ▼ Transformation of random variables and the moment-generating function.


The post Derivation of Spearman’s rank correlation coefficient and example calculation using python first appeared on Yukkuri Machine Learning.

Correlation coefficients are often used as a method of summarizing relationships between data. There are different types of correlation coefficients, and one is called a rank correlation coefficient, which is used in cases where only the order of the data is known.

In this article, I will focus on Spearman’s rank correlation coefficient in particular, and describe its derivation and examples of calculations using python.

You can try the source code described in this article from Google Colab below.

rank-correlation-coefficient.ipynb

Colaboratory notebook

\begin{align*}

\newcommand{\mat}[1]{\begin{pmatrix} #1 \end{pmatrix}}

\newcommand{\f}[2]{\frac{#1}{#2}}

\newcommand{\pd}[2]{\frac{\partial #1}{\partial #2}}

\newcommand{\d}[2]{\frac{{\rm d}#1}{{\rm d}#2}}

\newcommand{\T}{\mathsf{T}}

\newcommand{\(}{\left(}

\newcommand{\)}{\right)}

\newcommand{\{}{\left\{}

\newcommand{\}}{\right\}}

\newcommand{\[}{\left[}

\newcommand{\]}{\right]}

\newcommand{\dis}{\displaystyle}

\newcommand{\eq}[1]{{\rm Eq}(\ref{#1})}

\newcommand{\n}{\notag\\}

\newcommand{\t}{\ \ \ \ }

\newcommand{\tt}{\t\t\t\t}

\newcommand{\argmax}{\mathop{\rm arg\, max}\limits}

\newcommand{\argmin}{\mathop{\rm arg\, min}\limits}

\def\l<#1>{\left\langle #1 \right\rangle}

\def\us#1_#2{\underset{#2}{#1}}

\def\os#1^#2{\overset{#2}{#1}}

\newcommand{\case}[1]{\{ \begin{array}{ll} #1 \end{array} \right.}

\newcommand{\s}[1]{{\scriptstyle #1}}

\definecolor{myblack}{rgb}{0.27,0.27,0.27}

\definecolor{myred}{rgb}{0.78,0.24,0.18}

\definecolor{myblue}{rgb}{0.0,0.443,0.737}

\definecolor{myyellow}{rgb}{1.0,0.82,0.165}

\definecolor{mygreen}{rgb}{0.24,0.47,0.44}

\newcommand{\c}[2]{\textcolor{#1}{#2}}

\newcommand{\ub}[2]{\underbrace{#1}_{#2}}

\end{align*}

Spearman’s rank correlation coefficient $\rho_{xy}$ is determined only by the order of the data, regardless of the magnitude of the values; if you only know the ranks, you do not need the values themselves.

For example, you can calculate Spearman’s rank correlation coefficient even if you only know the following academic test rankings.

Math test rankings | Physics test rankings
---|---
1 | 1
3 | 4
2 | 2
4 | 5
5 | 3
6 | 6

In general, the ranks of the observed values $x, y$ are written $\tilde{x}, \tilde{y}$, respectively, as shown in the following table.

$x$ rank | $y$ rank
---|---
$\tilde{x}^{(1)}$ | $\tilde{y}^{(1)}$
$\tilde{x}^{(2)}$ | $\tilde{y}^{(2)}$
$\vdots$ | $\vdots$
$\tilde{x}^{(n)}$ | $\tilde{y}^{(n)}$

Then, Spearman’s rank correlation coefficient $\rho_{xy}$ can be calculated by the following equation.

\begin{align*}
\rho_{xy} = 1 - \f{6}{n(n^2-1)} \sum_{i=1}^n (\tilde{x}^{(i)} - \tilde{y}^{(i)})^2.
\end{align*}

Let’s put off proving why Spearman’s rank correlation coefficient takes the form of the above equation, and first try to calculate the rank correlation coefficient in python.

No hand implementation of the above formula is needed: the coefficient can be computed easily with the spearmanr function in the scipy.stats library.

For example, using the example of the academic achievement test rankings mentioned earlier, the rank correlation coefficient is calculated as follows.

```
import scipy.stats as st
math = [1, 3, 2, 4, 5, 6]
phys = [1, 4, 2, 5, 3, 6]
corr, pvalue = st.spearmanr(math, phys)
print(corr)
```

```
# Outputs
0.8285714285714287
```

Thus, judged by Spearman’s rank correlation coefficient, there is a strong correlation between the math and physics test rankings.

Finally, we derive Spearman’s rank correlation formula, i.e. the equation below. The starting point of the derivation is the defining formula of the correlation coefficient applied to the ranks.

\begin{align*}
\rho_{xy} = 1 - \f{6}{n(n^2-1)} \sum_{i=1}^n (\tilde{x}^{(i)} - \tilde{y}^{(i)})^2
\end{align*}

\begin{align*}
\rho_{xy} = \f{{\rm cov}[\tilde{x}, \tilde{y}]}{\sigma_{\tilde{x}} \sigma_{\tilde{y}}} = \f{\sum_{i=1}^n (\tilde{x}^{(i)} - \bar{\tilde{x}}) (\tilde{y}^{(i)} - \bar{\tilde{y}}) }{\sqrt{\sum_{i=1}^n (\tilde{x}^{(i)} - \bar{\tilde{x}})^2} \sqrt{\sum_{i=1}^n (\tilde{y}^{(i)} - \bar{\tilde{y}})^2}}.
\end{align*}

**Proof.**

\begin{align*}
\rho_{xy} = \f{\sum_{i=1}^n (\tilde{x}^{(i)} - \bar{\tilde{x}}) (\tilde{y}^{(i)} - \bar{\tilde{y}}) }{\sqrt{\sum_{i=1}^n (\tilde{x}^{(i)} - \bar{\tilde{x}})^2} \sqrt{\sum_{i=1}^n (\tilde{y}^{(i)} - \bar{\tilde{y}})^2}} \t (☆)
\end{align*}

In preparation for the transformation, we use the fact that the following equations hold.

\begin{align*}
\bar{\tilde{x}} &= \bar{\tilde{y}} = \f{1}{n} \sum_{i=1}^n i = \f{n+1}{2}, \n
\sum_{i=1}^n {\tilde{x}^{(i)}}^2 &= \sum_{i=1}^n {\tilde{y}^{(i)}}^2 = \sum_{i=1}^n i^2 = \f{n(n+1)(2n+1)}{6}.
\end{align*}

then

\begin{align*}
\sum_{i=1}^n (\tilde{x}^{(i)} - \bar{\tilde{x}})^2 = \sum_{i=1}^n (\tilde{y}^{(i)} - \bar{\tilde{y}})^2 &= \sum_{i=1}^n {\tilde{x}^{(i)}}^2 - n\bar{\tilde{x}}^2 \n
&= \f{n(n+1)(2n+1)}{6} - n \( \f{n+1}{2} \)^2 \n
&= \left(\frac{n+1}{2}\right)\left(\frac{n(2 n+1)}{3}-n \frac{(n+1)}{2}\right) \n
&= \left(\frac{n+1}{2}\right)\left(\frac{n^{2}-n}{6}\right) \n
&= \frac{n\left(n^{2}-1\right)}{12}
\end{align*}

and

\begin{align*}
\sum_{i=1}^n (\tilde{x}^{(i)} - \bar{\tilde{x}}) (\tilde{y}^{(i)} - \bar{\tilde{y}}) &= \sum_{i=1}^n \tilde{x}^{(i)} \tilde{y}^{(i)} - n\bar{\tilde{x}} \bar{\tilde{y}} \n
&= \sum_{i=1}^n \tilde{x}^{(i)} \tilde{y}^{(i)} - n\left(\frac{n+1}{2}\right)^{2} \n
&= \sum_{i=1}^n \tilde{x}^{(i)} \tilde{y}^{(i)} - \frac{n(n+1)^{2}}{4}
\end{align*}

Substituting these into equation (☆) gives

\begin{align*}

\rho_{xy} &= \frac{\sum_{i=1}^{n} \tilde{x}^{(i)} \tilde{y}^{(i)} -\frac{n(n+1)^{2}}{4}}{\frac{n\left(n^{2}-1\right)}{12}} \n

&= \frac{12 \sum_{i=1}^{n} \tilde{x}^{(i)} \tilde{y}^{(i)}}{n\left(n^{2}-1\right)}-\frac{3 n(n+1)^{2}}{n\left(n^{2}-1\right)} \n

&= \frac{12 \sum_{i=1}^{n} \tilde{x}^{(i)} \tilde{y}^{(i)}}{n\left(n^{2}-1\right)}-\frac{3 n(n+1)^{2}}{n(n-1)(n+1)} \n

&= \frac{12 \sum_{i=1}^{n} \tilde{x}^{(i)} \tilde{y}^{(i)}}{n\left(n^{2}-1\right)}-\frac{3(n+1)}{n-1}

\end{align*}

On the other hand, \begin{align*} \sum_{i=1}^{n} (\tilde{x}^{(i)} - \tilde{y}^{(i)})^2 &= \sum_{i=1}^{n} {\tilde{x}^{(i)}}^{2}+\sum_{i=1}^{n} {\tilde{y}^{(i)}}^{2}-2 \sum_{i=1}^{n} \tilde{x}^{(i)} \tilde{y}^{(i)} \n &= 2 \sum_{i=1}^{n} i^{2}-2 \sum_{i=1}^{n} \tilde{x}^{(i)} \tilde{y}^{(i)} \n &= 2 \frac{n(n+1)(2 n+1)}{6}-2 \sum_{i=1}^{n} \tilde{x}^{(i)} \tilde{y}^{(i)} \end{align*}

\begin{align*}
\sum_{i=1}^{n} \tilde{x}^{(i)} \tilde{y}^{(i)}=\frac{n(n+1)(2 n+1)}{6}-\frac{1}{2} \sum_{i=1}^{n} (\tilde{x}^{(i)} - \tilde{y}^{(i)})^2
\end{align*}

Using the above,

\begin{align*}
\rho_{xy} &= \frac{12 \cdot\left(\frac{n(n+1)(2 n+1)}{6}-\frac{1}{2} \sum_{i=1}^{n} (\tilde{x}^{(i)} - \tilde{y}^{(i)})^2\right)}{n\left(n^{2}-1\right)}-\frac{3(n+1)}{n-1} \n
&= \frac{2 n(n+1)(2 n+1)-6 \sum_{i=1}^{n} (\tilde{x}^{(i)} - \tilde{y}^{(i)})^2}{n(n-1)(n+1)}-\frac{3(n+1)}{n-1} \n
&= \frac{4 n^{3}+6 n^{2}+2 n-6 \sum_{i=1}^{n} (\tilde{x}^{(i)} - \tilde{y}^{(i)})^2-3 n(n+1)^{2}}{n(n-1)(n+1)} \n
&= \frac{n^{3}-n-6 \sum_{i=1}^{n} (\tilde{x}^{(i)} - \tilde{y}^{(i)})^2}{n(n-1)(n+1)} \n
&= \frac{n^{3}-n-6 \sum_{i=1}^{n} (\tilde{x}^{(i)} - \tilde{y}^{(i)})^2}{n\left(n^{2}-1\right)} \n
&= 1-\frac{6 \sum_{i=1}^{n} (\tilde{x}^{(i)} - \tilde{y}^{(i)})^2}{n\left(n^{2}-1\right)}.
\end{align*}

Therefore, the following equation holds.

\begin{align*}
\rho_{xy} = 1 - \f{6}{n(n^2-1)} \sum_{i=1}^n (\tilde{x}^{(i)} - \tilde{y}^{(i)})^2.
\end{align*}
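As a check, the derived formula can be implemented directly; using the rank data from the earlier example, it reproduces the value returned by scipy's spearmanr.

```python
# ranks from the earlier test-score example
x = [1, 3, 2, 4, 5, 6]
y = [1, 4, 2, 5, 3, 6]

n = len(x)
d2 = sum((xi - yi) ** 2 for xi, yi in zip(x, y))  # sum of squared rank differences
rho = 1 - 6 * d2 / (n * (n ** 2 - 1))
print(rho)  # ≈ 0.8286, matching scipy.stats.spearmanr
```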

For more information on the relationship between data, please refer to this article.

【Multivariate Data】 Scatter Plots and Correlation Coefficients

In this article, I will discuss scatter plots and scatter plot matrices as a basic way to handle multivariate data, and correlation coefficients, rank correlation coefficients, and variance-covariance matrices as a method of summarization.


The post 【Multivariate Data】 Scatter Plots and Correlation Coefficients first appeared on Yukkuri Machine Learning.

In the previous issue, we discussed how to handle the most basic univariate data.

This article discusses scatter plots and scatter matrix as basic ways to handle multivariate data, and correlation coefficient, rank correlation coefficient, and variance-covariance matrix as summarization methods.

The program is written in python and can be tried in Google Colab below.

Google Colaboratory

\begin{align*} \newcommand{\mat}[1]{\begin{pmatrix} #1 \end{pmatrix}} \newcommand{\f}[2]{\frac{#1}{#2}} \newcommand{\pd}[2]{\frac{\partial #1}{\partial #2}} \newcommand{\d}[2]{\frac{{\rm d}#1}{{\rm d}#2}} \newcommand{\T}{\mathsf{T}} \newcommand{\(}{\left(} \newcommand{\)}{\right)} \newcommand{\{}{\left\{} \newcommand{\}}{\right\}} \newcommand{\[}{\left[} \newcommand{\]}{\right]} \newcommand{\dis}{\displaystyle} \newcommand{\eq}[1]{{\rm Eq}(\ref{#1})} \newcommand{\n}{\notag\\} \newcommand{\t}{\ \ \ \ } \newcommand{\tt}{\t\t\t\t} \newcommand{\argmax}{\mathop{\rm arg\, max}\limits} \newcommand{\argmin}{\mathop{\rm arg\, min}\limits} \def\l<#1>{\left\langle #1 \right\rangle} \def\us#1_#2{\underset{#2}{#1}} \def\os#1^#2{\overset{#2}{#1}} \newcommand{\case}[1]{\{ \begin{array}{ll} #1 \end{array} \right.} \newcommand{\s}[1]{{\scriptstyle #1}} \definecolor{myblack}{rgb}{0.27,0.27,0.27} \definecolor{myred}{rgb}{0.78,0.24,0.18} \definecolor{myblue}{rgb}{0.0,0.443,0.737} \definecolor{myyellow}{rgb}{1.0,0.82,0.165} \definecolor{mygreen}{rgb}{0.24,0.47,0.44} \newcommand{\c}[2]{\textcolor{#1}{#2}} \newcommand{\ub}[2]{\underbrace{#1}_{#2}} \end{align*}

We will again use the iris dataset, which consists of the petal and sepal measurements of three varieties: Versicolour, Virginica, and Setosa.

First, as bivariate data, we will restrict the variety to Setosa and deal only with sepal_length and sepal_width.

Now we will import the iris dataset in python.

```
import numpy as np
import pandas as pd
import seaborn as sns
import scipy.stats as st
import matplotlib.pyplot as plt
df_iris = sns.load_dataset('iris')
sepal_length = df_iris[df_iris['species']=='setosa']['sepal_length']
sepal_width = df_iris[df_iris['species']=='setosa']['sepal_width']
```

The most basic way to visually see the relationship between sepal_length and sepal_width is to draw a scatter plot.

To draw a scatterplot, use the scatterplot function in the seaborn library.

```
sns.scatterplot(x=sepal_length, y=sepal_width)
plt.show()
```

The scatterplot above shows the points trending up and to the right, meaning that flowers with a larger sepal_length tend to have a larger sepal_width.

The seaborn library also has a jointplot function that depicts a histogram along with a scatterplot.

```
sns.jointplot(x=sepal_length, y=sepal_width)
plt.show()
```

The correlation coefficient (Pearson’s product-moment correlation coefficient) is often used to quantify such upward or downward trends between data. The correlation coefficient between two variables $x$ and $y$ takes values from $-1$ to $1$: it is positive when $y$ tends to increase as $x$ increases, and negative when $y$ tends to decrease as $x$ increases.

There is a positive correlation between $x$ and $y$ when the correlation coefficient is close to $1$, a negative correlation when close to $-1$, and no correlation when close to $0$.

The correlation coefficient $r_{xy}$ is defined as

\begin{align*} r_{xy} = \f{\sum_{i=1}^n (x^{(i)} - \bar{x}) (y^{(i)} - \bar{y}) }{\sqrt{\sum_{i=1}^n (x^{(i)} - \bar{x})^2} \sqrt{\sum_{i=1}^n (y^{(i)} - \bar{y})^2}}. \end{align*}

Here, defining the mean deviation vectors

\begin{align*} \bm{x} &= (x^{(1)} - \bar{x}, x^{(2)} - \bar{x}, \dots, x^{(n)} - \bar{x})^\T, \n \bm{y} &= (y^{(1)} - \bar{y}, y^{(2)} - \bar{y}, \dots, y^{(n)} - \bar{y})^\T, \end{align*}

the correlation coefficient $r_{xy}$ coincides with the cosine $\cos \theta$ of the angle $\theta$ between the vectors $\bm{x}, \bm{y}$. \begin{align*} r_{xy} = \cos \theta = \f{\bm{x}^\T \bm{y}}{\|\bm{x}\| \|\bm{y}\|}. \end{align*}

From this we see that $-1 \leq r_{xy} \leq 1$.

Also, if there is a positive correlation, $\bm{x}, \bm{y}$ point in the same direction, and if there is no correlation, $\bm{x}, \bm{y}$ can be interpreted as pointing in orthogonal directions.
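This cosine interpretation can be confirmed numerically; the data below are hypothetical, chosen only for illustration.

```python
import numpy as np

# small hypothetical data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 3.2, 4.8, 5.1])

# mean deviation vectors
xc = x - x.mean()
yc = y - y.mean()

# correlation coefficient as the cosine of the angle between them
r_cos = xc @ yc / (np.linalg.norm(xc) * np.linalg.norm(yc))
print(np.isclose(r_cos, np.corrcoef(x, y)[0, 1]))  # True
```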

Dividing the numerator of the defining formula for the correlation coefficient by the number of samples $n$ gives the covariance of $x$ and $y$.

\begin{align*} \sigma_{xy} = \f{1}{n} \sum_{i=1}^n (x^{(i)} - \bar{x}) (y^{(i)} - \bar{y}) \end{align*}

Using this value and the standard deviation $\sigma_x, \sigma_y$ of $x, y$, the correlation coefficient can be expressed as

\begin{align*} r_{xy} = \f{\sigma_{xy}}{\sigma_x \sigma_y} \end{align*}

When the correlation coefficient is $r_{xy} = \pm 1 $, there is a linear relationship between $x, y$. The proof is given below.

**Proof**

In the following, we assume that the variances $\sigma_x^2, \sigma_y^2$ of $x, y$ are nonzero.

If $r_{xy}$ is $1$ or $-1$, then the cosine $\cos \theta$ of the angle $\theta$ between the mean deviation vectors $\bm{x}, \bm{y}$ is $1$ or $-1$. Therefore, there exists a constant $\gamma \neq 0$ such that

\begin{align*} \bm{y} = \gamma \bm{x} \end{align*}

($\gamma$ is positive if $r_{xy}=1$ and negative if $r_{xy}=-1$). From this, $(y^{(i)} - \bar{y}) = \gamma (x^{(i)} - \bar{x}), \ (i = 1, \dots, n)$; in other words, $x$ and $y$ have a linear relationship.

Also, taking the average of the squares of both sides in the above equation, we obtain

\begin{align*} \f{1}{n}\sum_{i=1}^n (y^{(i)} - \bar{y})^2 &= \gamma^2 \f{1}{n}\sum_{i=1}^n (x^{(i)} - \bar{x})^2 \n \therefore \sigma_y^2 &= \gamma^2 \sigma_x^2. \end{align*}

Therefore, $\gamma = \pm \sqrt{\sigma_y^2 / \sigma_x^2}$ and the following linear relationship holds for $x, y$.

\begin{align*} y = \pm \sqrt{\f{\sigma_y^2}{\sigma_x^2}} (x - \bar{x}) + \bar{y}. \end{align*}

The sign of the slope is the same as the sign of $r_{xy}$.
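As a quick numerical check of this fact (again with hypothetical data), an exact linear relation with negative slope yields a correlation coefficient of $-1$.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = -2.0 * x + 3.0  # an exact linear relationship with negative slope

r = np.corrcoef(x, y)[0, 1]
print(r)  # ≈ -1.0
```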

In python, the correlation coefficient can be calculated as follows

```
corr = np.corrcoef(sepal_length, sepal_width)[0, 1]
print(f'correlation coefficient: {corr}')
```

```
# Output
correlation coefficient: 0.7425466856651597
```

One caveat about the correlation coefficient is that it is valid when the two variables are in a linear relationship, but not otherwise. In fact, as shown in the figure below, for data with a non-linear relationship the correlation coefficient can come out near $0$, indicating “no correlation”, and so fails to work effectively.

Thus, the correlation coefficient is a quantitative expression of how close the relationship is to linear; “correlated” is not the same as “there is a relationship between the data.”

Another point to note is the phenomenon that when the data is truncated, the correlation coefficient approaches $0$ compared to the original data. As an example, the correlation coefficient between pre-enrollment grades and post-enrollment grades is inherently positive, but since we can observe post-enrollment grades only for accepted students and have no data for those who did not enroll, the correlation coefficient becomes low.

This phenomenon is called the selection effect (or truncation effect).
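The selection effect can be reproduced with simulated data. In the sketch below (the truncation point $x > 1$ is an arbitrary choice), truncating the data noticeably shrinks the correlation.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=10_000)
y = x + rng.normal(size=10_000)  # positively correlated pair

r_all = np.corrcoef(x, y)[0, 1]              # correlation on the full data
mask = x > 1.0                               # keep only the upper range of x
r_cut = np.corrcoef(x[mask], y[mask])[0, 1]  # correlation after truncation
print(r_all, r_cut)  # the truncated correlation is noticeably smaller
```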

In addition to the correlation coefficient just mentioned (Pearson’s product-moment correlation coefficient), various other correlation coefficients are known. Here we discuss Spearman’s rank correlation coefficient.

The rank correlation coefficient is valid when only the rank of the data is known. For example, suppose we only know the ranks of the following achievement tests.

Math test rankings | Physics test rankings
---|---
1 | 1
3 | 4
2 | 2
4 | 5
5 | 3
6 | 6

The rank correlation coefficient captures the correlation using only the order of such data.

If we denote the ranks of the observed values $x$ and $y$ as $\tilde{x}$ and $\tilde{y}$, respectively, as shown in the following table,

$x$ rank | $y$ rank
---|---
$\tilde{x}^{(1)}$ | $\tilde{y}^{(1)}$
$\tilde{x}^{(2)}$ | $\tilde{y}^{(2)}$
$\vdots$ | $\vdots$
$\tilde{x}^{(n)}$ | $\tilde{y}^{(n)}$

Spearman’s rank correlation coefficient $\rho_{xy}$ is calculated by:

\begin{align*} \rho_{xy} = 1 - \f{6}{n(n^2-1)} \sum_{i=1}^n (\tilde{x}^{(i)} - \tilde{y}^{(i)})^2. \end{align*}

For more information on why this formula is used, please see the following article.

Derivation of Spearman's rank correlation coefficient and example calculation using python

Correlation coefficients are often used as a method of summarizing relationships between data. There are different types of correlation coefficients, and one is called a rank correlation coefficient, which is used in cases where only the order of the data is known. In this article, I will focus on Spearman's rank correlation coefficient in particular, and describe its derivation and examples of calculations using python.

In python, you can do this:

```
math = [1, 3, 2, 4, 5, 6]
phys = [1, 4, 2, 5, 3, 6]
corr, pvalue = st.spearmanr(math, phys)
print(corr)
```

```
# Outputs
0.8285714285714287
```

When the data has three or more variables, it becomes difficult to draw a single scatter plot using all of them. Instead, we can use a scatter plot matrix, which arranges the pairwise scatter plots of the variables in a grid of panels.

Let’s draw a scatter plot matrix using the four variables from the iris data set. Python uses the pairplot function of the seaborn library.

```
df_setosa = df_iris[df_iris['species']=='setosa'] # restrict the variety to Setosa
sns.pairplot(data=df_setosa)
plt.show()
```

By looking at the scatter plot matrix in this way, we can capture the relationship between each variable at once.

The correlation coefficients can also be grouped together in matrix form. As a general setting, consider data with $m$ variables and a sample size of $n$. We define the matrix $\tilde{X}$ below.

\begin{align*} \tilde{X} = \mat{ x_1^{(1)} - \bar{x}_1 & x_2^{(1)} - \bar{x}_2 & \cdots & x_m^{(1)} - \bar{x}_m \\ x_1^{(2)} - \bar{x}_1 & x_2^{(2)} - \bar{x}_2 & \cdots & x_m^{(2)} - \bar{x}_m \\ \vdots & \vdots & \ddots & \vdots \\ x_1^{(n)} - \bar{x}_1 & x_2^{(n)} - \bar{x}_2 & \cdots & x_m^{(n)} - \bar{x}_m }. \end{align*}

Then, the matrix $\Sigma$, called the variance-covariance matrix, is given by

\begin{align*} \Sigma = \f{1}{n} \tilde{X}^\T \tilde{X}. \end{align*}

Here, the $(i, j)$ component of the variance-covariance matrix, $\sigma_{ij}$, is defined as follows.

\begin{align*} \sigma_{ij} = \f{1}{n} \sum_{k=1}^n (x^{(k)}_i - \bar{x}_i) (x^{(k)}_j - \bar{x}_j) \end{align*}

$\sigma_{ij}$ is the covariance of the $i$-th and $j$-th variables. In particular, the diagonal component $\sigma_{ii}$ is the variance of the $i$-th variable.
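The matrix expression for $\Sigma$ can be checked against numpy's built-in covariance routine; the sketch below uses randomly generated data.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))  # n = 100 samples, m = 3 variables

Xc = X - X.mean(axis=0)        # mean deviation matrix (X tilde)
Sigma = Xc.T @ Xc / len(X)     # variance-covariance matrix (1/n) X~^T X~

# agrees with numpy's biased covariance estimator
print(np.allclose(Sigma, np.cov(X, rowvar=False, bias=True)))  # True
```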

Similarly, the symmetric matrix $R$ whose $(i, j)$ component is the correlation coefficient (Pearson’s product-moment correlation coefficient) $r_{ij}$ between the $i$-th and $j$-th variables is called the correlation matrix.

\begin{align*} R = \mat{ 1 & r_{12} & \cdots & r_{1m} \\ r_{21} & 1 & \cdots & r_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ r_{m1} & r_{m2} & \cdots & 1 }. \end{align*}

In python, the correlation matrix can be computed as follows:

```
corr_mat = df_setosa.drop('species', axis=1).corr()  # use only the numeric columns
corr_mat
```

Correlation matrices are easier to understand using heatmaps.

```
cmap = sns.diverging_palette(255, 0, as_cmap=True) # define the color palette
sns.heatmap(corr_mat, annot=True, fmt='1.2f', cmap=cmap, square=True, linewidths=0.5, vmin=-1.0, vmax=1.0)
plt.show()
```

Next time: ▼Events, probabilities and random variables.


The post How To Handle Univariate Data Histograms and Box Plots first appeared on Yukkuri Machine Learning.

This course deals with “univariate data,” the foundation of statistics, which refers to data consisting of a single type of variable, such as height data or math exam scores.

This article describes how to create summary statistics such as mean and variance, histograms and box-and-whisker plots to visually capture characteristics of univariate data.

The program is written in python and can be tried in Google Colab below.

Google Colaboratory

We will use the iris dataset, which consists of petal and sepal lengths of three varieties: Versicolour, Virginica, and Setosa.

In this case, since we will treat the data as univariate data, we will limit the variety to Setosa and treat only sepal_length.

Now we will import the iris dataset in python.

```
import numpy as np
import pandas as pd
import seaborn as sns
import scipy.stats as st
import matplotlib.pyplot as plt
df_iris = sns.load_dataset('iris')
iris_data = df_iris[df_iris['species']=='setosa']['sepal_length']
print(iris_data)
```

```
# Output
[5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9, 5.4, 4.8, 4.8, 4.3, 5.8, 5.7, 5.4, 5.1, 5.7, 5.1, 5.4, 5.1, 4.6, 5.1, 4.8, 5.0, 5.0, 5.2, 5.2, 4.7, 4.8, 5.4, 5.2, 5.5, 4.9, 5.0, 5.5, 4.9, 4.4, 5.1, 5.0, 4.5, 4.4, 5.0, 5.1, 4.8, 5.1, 4.6, 5.3, 5.0]
```

Since it is difficult to grasp the characteristics of the data from a raw list of values, we draw a histogram.

A histogram is a bar graph with frequency (or relative frequency) on the vertical axis and class on the horizontal axis, and is drawn using the displot function in the seaborn library.

```
sns.displot(iris_data)
plt.show()
```

In this histogram, the horizontal axis is sepal_length (cm) in 0.25 cm increments. Each increment is called a class, the size of an increment is called the class width, and the count of increments is the number of classes. The vertical axis, called the frequency, counts the number of data points falling within each class.

The displot function can also display the vertical axis as the relative frequency, i.e. each class’s share of the total (summing to 1). To do so, pass stat='probability' as an argument.

```
sns.displot(iris_data, stat='probability')
plt.show()
```

The shape of a histogram changes with the number of classes and the class width. With too many or too few classes, the histogram will not capture the characteristics of the data well.

By default, the seaborn library’s displot function uses Sturges’ formula to determine the number of classes. It states that for a sample size $n$, the number of classes $k$ is given by

\begin{align*} k = \lceil 1 + \log_2 n \rceil \end{align*}

where $\lceil x \rceil$ denotes the real number $x$ rounded up to the nearest integer.

For example, since the data we are dealing with has a sample size of 50, applying Sturges’ formula gives

\begin{align*} k = \lceil 1 + \log_2 50 \rceil = \lceil 1 + 5.6438 \rceil = \lceil 6.6438 \rceil = 7 \end{align*}

If you check the histogram shown earlier, you will see that the number of classes is indeed 7.
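The calculation above can be reproduced in a couple of lines:

```python
import math

n = 50
k = math.ceil(1 + math.log2(n))  # Sturges' formula
print(k)  # 7
```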

To change the number of classes in the seaborn library’s displot function, pass bins={number of classes} as an argument.

```
sns.displot(iris_data, bins=10)
plt.show()
```

To summarize the data, it is useful to calculate the mean $\bar{x}$, standard deviation $\sigma$, and variance $\sigma^2$.

The mean $\bar{x}$ is calculated by

\begin{align*} \bar{x} = \frac{1}{n} \sum_{i=1}^n x^{(i)}. \end{align*}

Also, the variance $\sigma^2$ and standard deviation $\sigma$ are given below.

\begin{align*} \sigma^2 = \frac{1}{n} \sum_{i=1}^n (x^{(i)} - \bar{x})^2, \end{align*}

\begin{align*} \sigma = \sqrt{\frac{1}{n} \sum_{i=1}^n (x^{(i)} - \bar{x})^2}. \end{align*}

However, statistics usually uses unbiased variance.

\begin{align*} \tilde{\sigma}^2 = \frac{1}{n-1} \sum_{i=1}^n (x^{(i)} - \bar{x})^2 \end{align*}

As we will discuss in detail another time, the variance with $n$ in the denominator tends to underestimate the true variance, so the smaller denominator $n-1$ is used to compensate for this bias.

The mean corresponds to the center of gravity of the data, and the standard deviation (variance) expresses how scattered the data are around the mean.

A number that summarizes the characteristics of the data is called a summary statistic.

The above values can be calculated in python as follows

```
print(f'mean: {np.mean(iris_data)}')
print(f'var: {np.var(iris_data)}')
print(f'std: {np.std(iris_data)}')
print(f'unbiased var: {st.tvar(iris_data)}')
```

```
# Output
mean: 5.005999999999999
var: 0.12176399999999993
std: 0.348946987377739
unbiased var: 0.12424897959183677
```

When the data are ordered from smallest to largest, the value exactly at the halfway point is called the median.

The values exactly $1/4$ (25%) and $3/4$ (75%) of the way from the smaller end are also used as summary statistics. They are called the first quartile (Q1, the 25% point) and the third quartile (Q3, the 75% point), respectively.

In python it can be calculated as follows

```
print(f'median: {np.median(iris_data)}')
print(f'quantile: {np.quantile(iris_data, q=[0.25, 0.5, 0.75])}')
```

```
# Output
median: 5.0
quantile: [4.8 5. 5.2]
```

The difference between the 75% and 25% points, $Q3 - Q1$, is called the interquartile range (IQR), and indicates how concentrated the data are around the median.
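As an illustration, the quartiles and IQR can be computed directly; here we use the first ten sepal_length values shown earlier.

```python
import numpy as np

# the first ten sepal_length values from the Setosa data
data = [5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9]

q1, q3 = np.quantile(data, [0.25, 0.75])  # first and third quartiles
iqr = q3 - q1                             # interquartile range
print(q1, q3, iqr)
```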

This can be depicted in a box-and-whisker diagram using the boxplot function in the seaborn library.

```
sns.boxplot(y=iris_data)
plt.show()
```

When comparing multiple distributions, it is often easier to understand a box-and-whisker diagram side-by-side than to compare histograms directly.

```
sns.boxplot(data=df_iris.drop('species', axis=1))
plt.show()
```

The seaborn library also provides a boxenplot function that extends the box-and-whisker diagram, displaying more detailed information about the tails of the distribution.

```
sns.boxenplot(data=df_iris.drop('species', axis=1))
plt.show()
```


▼ Next: How to handle multivariate data


The post ROC Curve and AUR, Implementation with Python first appeared on Yukkuri Machine Learning.

In the previous issue, we introduced various evaluation metrics for machine learning classification problems, including the confusion matrix.

In this issue, we continue with the ROC curve and the AUC, which are commonly used to evaluate classifiers.

Note that the program code described in this article can be tried in the following Google Colab.

Google Colaboratory

The ROC curve uses TPR (true positive rate (= recall)) and FPR (false positive rate); let’s review these two first before we start talking about the ROC curve.

Consider classifying data as “+ (positive) or – (negative)”.

When the classifier is fed test data and allowed to make inferences, four patterns arise:

True Positive (TP): Infer + for data whose true class is +.

False Negative (FN): Infer data whose true class is + as –.

False Positive (FP): Infer data with a true class of – as +.

True Negative (TN): Infer data with true class – as –.

The results of this classification are summarized in the following table, which is called the confusion matrix.

TPR and FPR are defined by the following equations:

\begin{align*}

{\rm TPR} = \frac{{\rm TP}}{{\rm TP} + {\rm FN}}, \ \ \ \ {\rm FPR} = \frac{{\rm FP}}{{\rm TN} + {\rm FP}}

\end{align*}

TPR is a measure of how much of the total + (positive) data the classifier correctly infers as + (positive).

FPR is a measure of how much of the total – (negative) data the classifier incorrectly infers as + (positive).

As mentioned in the introduction, the ROC curve, which is the main topic of this issue, is an evaluation index for classifiers calculated based on TPR and FPR, which have the above characteristics.

Some classifiers output a probability of being + when classifying data as + or -. A typical example is logistic regression.

We will call the probability that the classifier’s output is + the “score”.

Normally, data are classified with a threshold value of 0.5, such as “+ if the score is 0.5 or higher, – if the score is less than 0.5,” and so on. Changing this threshold value will naturally change the data classification results, which in turn changes the performance of the classifier (TPR and FPR).

The ROC curve is a plot of FPR on the horizontal axis and TPR on the vertical axis when the threshold is varied.

Let’s look at the ROC curve using a specific example.

In the problem of classifying data as + or -, suppose a classifier yields the following scores

True class | score |
---|---|
+ | 0.8 |
+ | 0.6 |
+ | 0.4 |
– | 0.5 |
– | 0.3 |
– | 0.2 |

▲ true class and the score output by the classifier (probability of being +)

Here, each score output by the classifier is used in turn as the threshold $x$ (“+ if the score is $x$ or above, – if it is below $x$”), and the TPR and FPR are calculated at each threshold value.

For example, for the threshold $x = 0.8$ the confusion matrix is

| | Predict + | Predict – |
|---|---|---|
| True + | TP: 1 | FN: 2 |
| True – | FP: 0 | TN: 3 |

\begin{align*}

{\rm TPR} = \frac{1}{1+2} = 0.33\cdots, \ \ \ \ {\rm FPR} = \frac{0}{3 + 0} = 0

\end{align*}

The results of this calculation of TPR and FPR for the threshold $x \in \{ 0.8, 0.6, 0.4, 0.5, 0.3, 0.2 \}$ are as follows

True class | score | TPR | FPR |
---|---|---|---|
+ | 0.8 | 0.33 | 0 |
+ | 0.6 | 0.66 | 0 |
– | 0.5 | 0.66 | 0.33 |
+ | 0.4 | 1.0 | 0.33 |
– | 0.3 | 1.0 | 0.66 |
– | 0.2 | 1.0 | 1.0 |

Plotting this TPR and FPR produces the ROC curve.
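As a minimal sketch, the TPR/FPR pairs above can be reproduced by sweeping the threshold over the scores. The true-class labels used here (+, +, –, +, –, – for scores 0.8, 0.6, 0.5, 0.4, 0.3, 0.2) are the assignment consistent with the TPR/FPR values in the worked example:

```python
import numpy as np

y_true = np.array([1, 1, 0, 1, 0, 0])              # 1 = + (positive), 0 = - (negative)
scores = np.array([0.8, 0.6, 0.5, 0.4, 0.3, 0.2])  # classifier scores

points = []
for x in sorted(scores, reverse=True):  # use each score as the threshold
    pred = (scores >= x).astype(int)
    tp = np.sum((pred == 1) & (y_true == 1))
    fn = np.sum((pred == 0) & (y_true == 1))
    fp = np.sum((pred == 1) & (y_true == 0))
    tn = np.sum((pred == 0) & (y_true == 0))
    points.append((fp / (tn + fp), tp / (tp + fn)))  # (FPR, TPR)
print(points)
```

Plotting these `(FPR, TPR)` points traces out the ROC curve.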

Now, let’s evaluate the classifier using the ROC curve. A good classifier is one that can correctly classify the data into + and – at a certain threshold value. In other words, it is a classifier that can increase TPR without increasing FPR.

This state of increasing TPR without increasing FPR is indicated by the upper left point in the above figure. In other words, the closer the ROC curve is to the upper left, the better the classifier.

On the other hand, what happens with a “bad classifier”, i.e., one that outputs + and – at random? No matter how the threshold is chosen, + and – are predicted in the same proportion for both classes. This means that as the TPR increases, so does the FPR, and the ROC curve becomes a straight line from the origin (0.0, 0.0) to the upper right (1.0, 1.0).

AUC (Area Under the Curve) is an index that quantifies how close the ROC curve is to the upper left. It is defined as the area under the ROC curve, with a maximum value of 1.0. In other words, the closer the AUC is to 1.0, the better the classifier.

ROC curves can be easily plotted using scikit-learn’s roc_curve function. The AUC can also be calculated with the roc_auc_score function.

In this article, we will build two types of models, logistic regression and random forest, and compare their performance with ROC curves and AUCs.

- First, the data for the binary classification problem is prepared and divided into training and test data.

```
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display
from sklearn.datasets import load_breast_cancer
bc = load_breast_cancer()
df = pd.DataFrame(bc.data, columns=bc.feature_names)
df['target'] = bc.target
X = df.drop(['target'], axis=1).values
y = df['target'].values
```

```
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1, stratify=y)
```

- Create a logistic regression model and a random forest model.

```
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
model_lr = LogisticRegression(C=1, random_state=42, solver='lbfgs')
model_lr.fit(X_train, y_train)
model_rf = RandomForestClassifier(random_state=42)
model_rf.fit(X_train, y_train)
```

- The test data are used to predict probabilities, plot ROC curves, and calculate AUCs.

```
from sklearn.metrics import roc_curve, roc_auc_score

colors = sns.color_palette()  # default seaborn palette for the two curves
proba_lr = model_lr.predict_proba(X_test)[:, 1]
proba_rf = model_rf.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, proba_lr)
plt.plot(fpr, tpr, color=colors[0], label='logistic regression')
plt.fill_between(fpr, tpr, 0, color=colors[0], alpha=0.1)
fpr, tpr, thresholds = roc_curve(y_test, proba_rf)
plt.plot(fpr, tpr, color=colors[1], label='random forest')
plt.fill_between(fpr, tpr, 0, color=colors[1], alpha=0.1)
plt.xlabel('FPR')
plt.ylabel('TPR')
plt.legend()
plt.show()
print(f'Logistic regression AUC: {roc_auc_score(y_test, proba_lr):.4f}')
print(f'Random forest AUC: {roc_auc_score(y_test, proba_rf):.4f}')
```

```
# Output
Logistic regression AUC: 0.9870
Random forest AUC: 0.9885
```

The results showed that the AUC was close to 1.0 in both cases, and there was not much difference between the two.

(With values this close, it is difficult to determine which model is better in terms of AUC alone...)


【Python】Fill In Data With Intervals Between Dates And Times In Pandas

This article is a reminder of how to fill in gaps in data that is spaced at date-and-time intervals.

To fill in the interval between dates and times, use the asfreq function in pandas.

You can try the code in this article in the following Google Colab.

Google Colaboratory

As an example, we will deal with the following data (assume that the data is stored in a variable called df below).

The data is in 10-minute increments, but some rows are missing, leaving gaps in the time series.

The steps for filling in the gaps are:

- Set the time column (datetime) as the index
- Apply the asfreq function in pandas
- Reset the index

```
import pandas as pd
df_ = df.set_index('datetime')
df_ = df_.asfreq(freq='10min')
df_fill = df_.reset_index()
print(df_fill)
```

As shown above, it is possible to fill in date-and-time data that has gaps in between.
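Here is a self-contained sketch of the same steps (with made-up data, since the original example table is shown as an image); asfreq inserts rows with NaN for the missing timestamps:

```python
import pandas as pd

# Hypothetical 10-minute data with a gap between 00:10 and 00:40
df = pd.DataFrame({
    'datetime': pd.to_datetime(['2022-01-01 00:00', '2022-01-01 00:10',
                                '2022-01-01 00:40', '2022-01-01 00:50']),
    'value': [1, 2, 3, 4],
})
df_fill = df.set_index('datetime').asfreq(freq='10min').reset_index()
print(df_fill)  # rows for 00:20 and 00:30 appear with NaN in 'value'
```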

The arguments that can be specified for freq in the asfreq function are detailed in the official reference below.

Time series / date functionality — pandas 2.0.1 documentation


【Python】Creating A List Of Consecutive Dates And Times In Pandas

This article is a reminder of how to create a list of consecutive dates and times using Python.

The policy is to use the pandas date_range function.

You can try the code in this article in the following Google Colab.

Google Colaboratory

- Create a list of consecutive dates in the following way

```
from datetime import datetime
import pandas as pd
dt_list = pd.date_range(start='2022-01-01', periods=10, freq='D')
print(dt_list)
```

```
# Output
DatetimeIndex(['2022-01-01', '2022-01-02', '2022-01-03', '2022-01-04',
'2022-01-05', '2022-01-06', '2022-01-07', '2022-01-08',
'2022-01-09', '2022-01-10'],
dtype='datetime64[ns]', freq='D')
```

- It can also be created by specifying the beginning and ending dates and times as arguments.

```
dt_list = pd.date_range(start='2022-01-01', end='2022-01-10', freq='D')
print(dt_list)
```

```
# Output
DatetimeIndex(['2022-01-01', '2022-01-02', '2022-01-03', '2022-01-04',
'2022-01-05', '2022-01-06', '2022-01-07', '2022-01-08',
'2022-01-09', '2022-01-10'],
dtype='datetime64[ns]', freq='D')
```

- The time specified in the argument can be of datetime type.

```
start_dt = datetime(year=2022, month=1, day=1)
dt_list = pd.date_range(start=start_dt, periods=10, freq='D')
print(dt_list)
```

```
# Output
DatetimeIndex(['2022-01-01', '2022-01-02', '2022-01-03', '2022-01-04',
'2022-01-05', '2022-01-06', '2022-01-07', '2022-01-08',
'2022-01-09', '2022-01-10'],
dtype='datetime64[ns]', freq='D')
```

- By changing the freq argument, you can also create a list of dates and times every 10 minutes or every 30 minutes.

```
dt_list = pd.date_range(start='2022-01-01', periods=10, freq='10min')
print(dt_list)
dt_list = pd.date_range(start='2022-01-01', periods=10, freq='30min')
print(dt_list)
```

```
# Output
DatetimeIndex(['2022-01-01 00:00:00', '2022-01-01 00:10:00',
'2022-01-01 00:20:00', '2022-01-01 00:30:00',
'2022-01-01 00:40:00', '2022-01-01 00:50:00',
'2022-01-01 01:00:00', '2022-01-01 01:10:00',
'2022-01-01 01:20:00', '2022-01-01 01:30:00'],
dtype='datetime64[ns]', freq='10T')
DatetimeIndex(['2022-01-01 00:00:00', '2022-01-01 00:30:00',
'2022-01-01 01:00:00', '2022-01-01 01:30:00',
'2022-01-01 02:00:00', '2022-01-01 02:30:00',
'2022-01-01 03:00:00', '2022-01-01 03:30:00',
'2022-01-01 04:00:00', '2022-01-01 04:30:00'],
dtype='datetime64[ns]', freq='30T')
```

The arguments that can be specified for freq are detailed in the official reference below.

Time series / date functionality — pandas 2.0.1 documentation


Classification Evaluation Indicators: Accuracy, Precision, Recall, F-measure

After a model (classifier) is trained by machine learning in a classification problem, its performance needs to be evaluated.

This article discusses the following evaluation indicators:

- Accuracy
- Precision
- Recall / True Positive Rate: TPR
- False Positive Rate: FPR
- F-measure

We also describe how to calculate the above using scikit-learn.

You can try the source code for this article from the Google Colab below.

Google Colaboratory

To simplify matters, we will limit our discussion to the two classes of classification problems. Here, we consider classifying data as either + (positive) or – (negative).


Now, when the classifier is fed test data and allowed to make inferences, the following four patterns arise.

- True Positive (TP): Infer + for data whose true class is +.
- False Negative (FN): Infer data whose true class is + as –.
- False Positive (FP): Infer data with a true class of – as +.
- True Negative (TN): Infer data with true class – as –.

The results of this classification are summarized in a table as shown below, which is called the confusion matrix. The diagonal components of this table indicate the number of data for which the inference is correct, and the off-diagonal components indicate the number of data for which the inference is incorrect.

The four patterns above are used to define the various evaluation indicators for classifiers.

The accuracy is the proportion of the test data that the classifier correctly infers and is expressed by the following equation

\begin{align*} {\rm accuracy} = \frac{{\rm TP} + {\rm TN}}{{\rm TP} + {\rm FP} + {\rm FN} + {\rm TN}} \end{align*}

The correct inferences are those where + (positive) data is inferred as + and – (negative) data as –, so this expresses the proportion of ${\rm TP} + {\rm TN}$ cases out of the total number of data, ${\rm TP} + {\rm FP} + {\rm FN} + {\rm TN}$.

Now, a problem arises when evaluating classifier performance based solely on this percentage of correct answers.

As an example, let’s consider a data set of 100,000 samples, of which 99,990 are – (negative) and 10 are + (positive).

Suppose a classifier infers all data to be – (negative), as shown in the following table.

At this point, we calculate the accuracy:

\begin{align*} {\rm accuracy} &= \frac{{\rm TP} + {\rm TN}}{{\rm TP} + {\rm FP} + {\rm FN} + {\rm TN}} \\ &= \frac{0 + 99990}{0 + 0 + 10 + 99990} \\ &= 0.9999 = 99.99 \% \end{align*}

The accuracy is extremely high, so this would be judged a good classifier even though it has not detected a single + (positive) case.
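The arithmetic above can be checked in a couple of lines:

```python
# Confusion-matrix counts from the imbalanced example above
tp, fp, fn, tn = 0, 0, 10, 99990
accuracy = (tp + tn) / (tp + fp + fn + tn)
print(accuracy)  # 0.9999
```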

In other words, it is not sufficient to judge the performance of a classifier by the percentage of correct answers alone, and various indicators have been proposed as follows

Precision is a measure of how reliable a classifier is when it determines that data is + (positive).

\begin{align*} {\rm precision} = \frac{{\rm TP}}{{\rm TP} + {\rm FP}} \end{align*}

This indicator is mainly used when one wants to increase the certainty of positive predictions. However, high precision alone can be achieved by reducing the number of FPs (cases where – is incorrectly inferred as +), i.e., by a model that judges + more strictly.

Recall is a measure of how well the classifier correctly inferred + (positive) out of the total + (positive) data. It is also called the true positive rate (TPR).

\begin{align*} {\rm recall} = \frac{{\rm TP}}{{\rm TP} + {\rm FN}} \end{align*}

This indicator is used when reducing FN (the number of cases where + is incorrectly inferred as –) is important, as in cancer diagnosis. However, high recall alone can be achieved by a model that judges + loosely, or in the extreme, a model that judges all data as +.

The false positive rate (FPR) is a measure of how much of the total – (negative) data the classifier incorrectly infers as + (positive).

\begin{align*} {\rm FPR} = \frac{{\rm FP}}{{\rm TN} + {\rm FP}} \end{align*}

A small value for this indicator is desired. However, a classifier that reduces only the false positive rate can be achieved with a model that judges – for all data.

This FPR and TPR (= recall) are used in the ROC curve.

▼Click here to see the contents of the ROC curve.

There is a trade-off between precision and recall, and these indicators cannot be high at the same time. The reason for the trade-off, as mentioned earlier, is that a classifier that increases only precision is realized with a model that judges “strictly” +, while a classifier that increases only recall is realized with a model that judges “loosely” +.

Now, a model with high precision and recall means a model with low FP and FN, i.e., a high-performance classifier with low off-diagonal components of the confusion matrix = low misclassification. Therefore, we define the F-measure as the harmonic mean of precision and recall.

\begin{align*} F = \frac{2}{\frac{1}{{\rm recall}} + \frac{1}{{\rm precision}}} = 2 \cdot \frac{{\rm precision} \cdot {\rm recall}}{{\rm precision} + {\rm recall}} \end{align*}
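As a quick numerical check of this formula, take precision 0.75 and recall 0.60 (the values that also appear in the scikit-learn example later in this article):

```python
precision, recall = 0.75, 0.60
f_measure = 2 * precision * recall / (precision + recall)  # harmonic mean
print(f'{f_measure:.2f}')  # 0.67
```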

The above indicators can be easily calculated using scikit-learn.

First, import the necessary libraries and define the data to be handled. The data in this case is a simple array with 1: positive, -1: negative.

```
import numpy as np
import pandas as pd
from sklearn.metrics import confusion_matrix, classification_report
import seaborn as sns
import matplotlib.pyplot as plt
y_true = [-1, -1, -1, -1, -1, 1, 1, 1, 1, 1]
y_pred = [-1, -1, -1, 1, 1, -1, 1, 1, 1, 1]
names = ['positive', 'negative']
```

First, we generate the confusion matrix, which can be generated in scikit-learn with confusion_matrix.

```
cm = confusion_matrix(y_true, y_pred, labels=[1, -1])
print(cm)
```

```
# Output
[[4 1]
[2 3]]
```

To make the output easier to read, let’s display the confusion matrix in seaborn.

```
cm = pd.DataFrame(data=cm, index=names, columns=names)
sns.heatmap(cm, square=True, cbar=True, annot=True, cmap='Blues')
plt.xlabel('predicted value', fontsize=15)
plt.ylabel('ground truth', fontsize=15)
plt.show()
```

Next, let’s calculate the evaluation index. In scikit-learn, the evaluation indicators described so far can be calculated together using classification_report.

```
eval_dict = classification_report(y_true, y_pred, output_dict=True, target_names=names)
df = pd.DataFrame(eval_dict)
print(df)
```

```
# Output
positive negative accuracy macro avg weighted avg
precision 0.750000 0.666667 0.7 0.708333 0.708333
recall 0.600000 0.800000 0.7 0.700000 0.700000
f1-score 0.666667 0.727273 0.7 0.696970 0.696970
support 5.000000 5.000000 0.7 10.000000 10.000000
```

The first and second columns of the output results show the results of the indicators when positive and negative are used as positive examples, respectively.

Also, macro avg and weighted avg are called macro average and weighted average, respectively.

In this problem set-up, the indicators we want are as follows.

```
print(f"accuracy: {df['accuracy'][0]:.2f}")
print(f"precision: {df['positive']['precision']:.2f}")
print(f"recall: {df['positive']['recall']:.2f}")
print(f"f1-score: {df['positive']['f1-score']:.2f}")
```

```
# Output
accuracy: 0.70
precision: 0.75
recall: 0.60
f1-score: 0.67
```

You can try the above code in the following Google Colab.

Google Colaboratory


Implementation Of K-means Method, Elbow Method, Silhouette Analysis

One of the best-known clustering methods is the k-means method, which assumes that the data can be classified into $K$ clusters and assigns each data point to one of the clusters according to a certain procedure.

This article describes how the k-means method works and how it is implemented.

The k-means method requires that the number of clusters be given in advance, so the elbow method and silhouette analysis are introduced as methods for determining the optimal number of clusters.

You can try the source code for this article from the Google Colab below.

Google Colaboratory


Assuming that the number of clusters is 2, the concept of the k-means method is explained based on the figure below. The figure below is taken from PRML (Pattern Recognition and Machine Learning 2006).

(a): For each cluster, consider a center of gravity $\bm{\mu}_1, \bm{\mu}_2$. Since we cannot know in advance which cluster each data point belongs to, the initial values of the cluster centers of gravity are chosen arbitrarily. In the figure above, the red and blue crosses indicate the cluster centers of gravity.

(b): Assign each data point to the cluster whose center of gravity is closer.

(c): Calculate the new center of gravity of each cluster from its assigned data points, and update the centers of gravity.

(d)~(i): Repeat steps (b) and (c) to continue updating the cluster. Iteration continues until no more clusters are updated or until the maximum number of iterations defined by the analyst is reached.

Let us formulate the above.

The $i$th data point out of $N$ data is denoted as $\bm{x}^{(i)}$. Also, assume that this data is divided into $K$ clusters, and denote the center of gravity of the $j$th cluster as $\bm{\mu}_j$.

Then the k-means method becomes the following loss function minimization problem.

\begin{align*}

L = \sum_{i=1}^N \sum_{j=1}^K r_{ij} \| \bm{x}^{(i)} - \bm{\mu}_j \|^2

\end{align*}

where $r_{ij}$ is a value that takes $1$ if the data point $\bm{x}^{(i)}$ belongs to the $j$-th cluster and $0$ otherwise, and can be written as follows

\begin{align*}

r_{ij} = \case{1\t {\rm if}\ j=\argmin_k \| \bm{x}^{(i)} - \bm{\mu}_k \|^2 \n 0 \t {\rm otherwise.}}

\end{align*}

On the other hand, the update of the center of gravity $\bm{\mu}_j$ is

\begin{align*}

\bm{\mu}_j = \f{\sum_i r_{ij} \bm{x}^{(i)}}{\sum_i r_{ij}}

\end{align*}

It can be seen from the form of the equation that this is calculating the average value of the data vector belonging to the cluster.

The above equation is also given by solving the loss function for $\bm{\mu}_j$ with the partial derivative set to zero.

\begin{align*}

\pd{L}{\bm{\mu}_j} = 2 \sum_{i=1}^N r_{ij} (\bm{x}^{(i)} - \bm{\mu}_j) = 0

\end{align*}


Thus, the procedure for the k-means method can be formulated as follows.

(a): Give random initial values for the cluster centers of gravity $\bm{\mu}_j\ (j=1, \dots, K)$.

(b):

\begin{align*}

r_{ij} = \case{1\t {\rm if}\ j=\argmin_k \| \bm{x}^{(i)} - \bm{\mu}_k \|^2 \n 0 \t {\rm otherwise.}}

\end{align*}

Calculate the above formula and assign each data to a cluster.

(c):

\begin{align*}

\bm{\mu}_j = \f{\sum_i r_{ij} \bm{x}^{(i)}}{\sum_i r_{ij}}

\end{align*}

Calculate the above formula and update the cluster center of gravity.

(d): Repeat steps (b) and (c) to continue updating the cluster. Iteration continues until no more clusters are updated or until the maximum number of iterations defined by the analyst is reached.
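The procedure (a)–(d) can be sketched in a few lines of NumPy. This is a toy implementation for illustration only (it does not handle empty clusters), not the scikit-learn version used below:

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # (a) initialize the centroids with K randomly chosen data points
    mu = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iter):
        # (b) assign each point to the cluster with the nearest centroid
        d = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        r = d.argmin(axis=1)
        # (c) update each centroid to the mean of its assigned points
        # (toy code: assumes no cluster ends up empty)
        new_mu = np.array([X[r == j].mean(axis=0) for j in range(K)])
        # (d) stop once the centroids no longer move
        if np.allclose(new_mu, mu):
            break
        mu = new_mu
    return r, mu
```

For example, `kmeans(X, 2)` on two well-separated blobs of points recovers the two blobs and their means.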

Let’s try cluster analysis using the k-means method with scikit-learn. The data used in this project will be generated using scikit-learn’s make_blobs to generate a dataset for classification.

The first step is to create a dataset for classification.

```
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import make_blobs
X, y = make_blobs(
n_samples=150,
centers=3,
cluster_std=1.0,
shuffle=True,
random_state=42)
x1 = X[:, 0]
x2 = X[:, 1]
plt.scatter(x1, x2)
plt.xlabel('x1')
plt.ylabel('x2')
plt.show()
```

scikit-learn provides the sklearn.cluster.KMeans class to perform cluster analysis using the k-means method. The implementation is extremely simple and is shown below. Note that the number of clusters to be divided must be decided in advance.

```
from sklearn.cluster import KMeans
model = KMeans(n_clusters=3, random_state=0, init='random')
model.fit(X)
clusters = model.predict(X)
print(clusters)
df_cluster_centers = pd.DataFrame(model.cluster_centers_)
df_cluster_centers.columns = ['x1', 'x2']
print(df_cluster_centers)
```

```
# Output
[2 2 0 1 0 1 2 2 0 1 0 0 1 0 1 2 0 2 1 1 2 0 1 0 1 0 0 2 0 1 1 2 0 0 1 1 0
0 1 2 2 0 2 2 1 2 2 1 2 0 2 1 1 2 2 0 2 1 0 2 0 0 0 1 1 1 1 0 1 1 0 2 1 2
2 2 1 1 1 2 2 0 2 0 1 2 2 1 0 2 0 2 2 0 0 1 0 0 2 1 1 1 2 2 1 2 0 1 2 0 0
2 1 0 0 1 0 1 0 2 0 0 1 0 2 0 0 1 1 0 2 2 1 1 2 1 0 1 1 0 2 2 0 1 2 2 0 2
2 1]
x1 x2
0 -6.753996 -6.889449
1 4.584077 2.143144
2 -2.701466 8.902879
```

Since this alone is difficult to understand, the clustering results are illustrated in the following figure.

```
df = pd.DataFrame(X, columns=['x1', 'x2'])
df['class'] = clusters
sns.scatterplot(data=df, x='x1', y='x2', hue='class')
sns.scatterplot(data=df_cluster_centers, x='x1', y='x2', s=200, marker='*', color='gold', linewidth=0.5)
plt.show()
```

Thus, with the sklearn.cluster.KMeans class, you can easily perform cluster analysis using the k-means method.

One problem with the k-means method is that the number of clusters must be specified. This section introduces the elbow method and silhouette analysis used to determine the optimal number of clusters. Note that the data used below is the data with 3 clusters created earlier with make_blobs.

In the Elbow method, the optimal number of clusters is determined by calculating the loss function of the k-means method while varying the number of clusters and illustrating the results.

\begin{align*}

L = \sum_{i=1}^N \sum_{j=1}^K r_{ij} \| \bm{x}^{(i)} - \bm{\mu}_j \|^2

\end{align*}

The implementation of the elbow method is described below. The value of the loss function can be accessed at model.inertia_.

```
sum_of_squared_errors = []
for i in range(1, 11):
model = KMeans(n_clusters=i, random_state=0, init='random')
model.fit(X)
sum_of_squared_errors.append(model.inertia_)
plt.plot(range(1, 11), sum_of_squared_errors, marker='o')
plt.xlabel('number of clusters')
plt.ylabel('sum of squared errors')
plt.show()
```

The illustrated results show that the value of the loss function decreases until the number of clusters (the value on the horizontal axis) is 3, after which it remains almost unchanged.

The Elbow method determines the optimal number of clusters as the number of clusters for which the degree of decrease in the loss function changes rapidly. Therefore, in this case, the optimal number of clusters can be determined to be 3.

Silhouette analysis evaluates clustering performance based on the following indicators

- The denser the data points in a cluster, the better.
- The further apart each cluster is, the better.

Specifically, clustering performance is evaluated by the silhouette coefficient, which is defined by the following procedure

(1) Calculate the average distance from a data point $\bm{x}^{(i)}$ to the other data points in the cluster $C_{\rm in}$ to which it belongs; this is the within-cluster cohesion $a^{(i)}$.

\begin{align*}

a^{(i)} = \f{1}{|C_{\rm in}| - 1} \sum_{\bm{x}^{(j)} \in C_{\rm in}} \| \bm{x}^{(i)} - \bm{x}^{(j)} \|

\end{align*}

(2) Calculate the average distance from $\bm{x}^{(i)}$ to the data points belonging to the nearest other cluster $C_{\rm near}$; this is the separation from the nearest cluster, $b^{(i)}$.

\begin{align*}

b^{(i)} = \f{1}{|C_{\rm near}|} \sum_{\bm{x}^{(j)} \in C_{\rm near}} \| \bm{x}^{(i)} - \bm{x}^{(j)} \|

\end{align*}

(3) Divide $b^{(i)} - a^{(i)}$ by the larger of $a^{(i)}$ and $b^{(i)}$ to compute the silhouette coefficient $s^{(i)}$.

\begin{align*}

s^{(i)} = \f{b^{(i)} - a^{(i)}}{\max(a^{(i)}, b^{(i)})}

\end{align*}

The silhouette coefficient, by its definition, falls in the $[-1,1]$ interval. When the silhouette coefficient is calculated and averaged over all data, the closer to 1, the better the clustering performance.
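As a numerical check of steps (1)–(3), here is the silhouette coefficient of one point in a tiny made-up dataset of two clusters:

```python
import numpy as np

# Hypothetical data: cluster A = first two points, cluster B = last two
X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])

# Silhouette coefficient of the first point
a = np.linalg.norm(X[0] - X[1])             # (1) cohesion within its own cluster
b = np.mean([np.linalg.norm(X[0] - X[2]),
             np.linalg.norm(X[0] - X[3])])  # (2) separation from the nearest cluster
s = (b - a) / max(a, b)                     # (3)
print(round(s, 3))  # 0.866: close to 1, i.e. well clustered
```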

The silhouette analysis visualization follows these rules

- Sort by cluster number
- Sort by silhouette coefficient value within the same cluster

The silhouette analysis is visualized by plotting the silhouette coefficient on the horizontal axis and the cluster number on the vertical axis.

```
import matplotlib.cm as cm
from sklearn.metrics import silhouette_samples
def show_silhouette(fitted_model):
cluster_labels = np.unique(fitted_model.labels_)
num_clusters = cluster_labels.shape[0]
silhouette_vals = silhouette_samples(X, fitted_model.labels_)
y_ax_lower, y_ax_upper = 0, 0
y_ticks = []
for idx, cls in enumerate(cluster_labels):
cls_silhouette_vals = silhouette_vals[fitted_model.labels_==cls]
cls_silhouette_vals.sort()
y_ax_upper += len(cls_silhouette_vals)
cmap = cm.get_cmap("Spectral")
rgba = list(cmap(idx/num_clusters))
rgba[-1] = 0.7
plt.barh(
y=range(y_ax_lower, y_ax_upper),
width=cls_silhouette_vals,
height=1.0,
edgecolor='none',
color=rgba)
y_ticks.append((y_ax_lower + y_ax_upper) / 2.0)
y_ax_lower += len(cls_silhouette_vals)
silhouette_avg = np.mean(silhouette_vals)
plt.axvline(silhouette_avg, color='orangered', linestyle='--')
plt.xlabel('silhouette coefficient')
plt.ylabel('cluster')
plt.yticks(y_ticks, cluster_labels + 1)
plt.show()
for i in range(2, 5):
model = KMeans(n_clusters=i, random_state=0, init='random')
model.fit(X)
show_silhouette(model)
```

The silhouette diagrams are plotted for the number of clusters specified as 2, 3, and 4, respectively. The red dashed line represents the average silhouette coefficient.

If the clusters are properly separated, the “thickness” of the silhouettes in each cluster tends to be close to even.

In the figure above, the silhouette “thickness” is even when the number of clusters is 3, and the average value of the silhouette coefficient is the highest. From this we can conclude that the optimal number of clusters is 3.

reference: scikit-learn

As described above, implementing the k-means method is easy in scikit-learn.

The number of clusters to be classified must be given in advance, and the elbow method and silhouette analysis were introduced as guidelines for determining the number of clusters.

There are also the x-means and g-means methods, which allow clustering without providing the number of clusters to be classified in advance.


How to install and use labelImg

In order to perform object detection using deep learning such as YOLO, a training image dataset is required. In other words, it is necessary to prepare information on “what” is in “what part” of each image.

The tool labelImg makes it easy to create such a training image dataset.

This article describes how to install and use labelImg.

▼The dataset was actually created using labelImg, and YOLO training was performed here.

For Windows and Linux, download the latest release package from the labelImg GitHub releases page and unzip it.

Then run the executable in the expanded folder:

`./labelimg`

For macOS, clone the labelImg repository and install the necessary libraries (the commands below use Homebrew).

```
git clone https://github.com/tzutalin/labelImg.git
cd labelImg
brew install qt
brew install libxml2
pip install pyqt5 lxml
make qt5py3
```

Then launch it with:

`python labelImg.py`

This time, we will use this image to create training data.

The required labels are

- lion
- tiger

1. First, we need to specify the list of labels to be used for training.

Rewrite “data/predefined_classes.txt” in the folder where labelImg is installed with the required set of labels.

2. Run labelImg.

3. Click on “Open Directory” and specify the directory where the target image is located. The images stored in that directory will then be loaded.

4. Since we will be creating a dedicated dataset to study with YOLO, click on “PascalVOC” under the “Save” button in the sidebar and change it to “YOLO”. Then, select a rectangle and label for the image from “Create Rectangle” in the lower left corner of the sidebar.

5. Finally, save the data from the “Save” button to complete the training data.

When the save is complete, a new “classes.txt” and “lion_tiger.txt” (txt file with the same name as the image) will be created in the directory where the target image is located.

What is important is the latter file, which contains information on “what” is in “what part” of the image.

Also, the shortcut keys for labelImg are as follows (Ctrl → Command⌘ for mac).

Shortcut | Action |
---|---|
Ctrl + u | Load all images from directory |
Ctrl + r | Change default annotation target directory |
Ctrl + s | Save |
Ctrl + d | Copy current label and rect box |
Ctrl + Shift + d | Delete current image |
space | Flag the current image as verified |
w | Create a rectangular box |
d | Next image |
a | Previous image |
del | Delete selected rect boxes |
Ctrl + + | Zoom in |
Ctrl + - | Zoom out |
↑→↓← | Move selected rect box with arrow keys |

Now we can use deep learning such as YOLO to learn object detection.

▼ Click here to learn more about YOLO


