The post 【Multivariate Data】 Scatter Plots and Correlation Coefficients first appeared on Yukkuri Machine Learning.

In the previous issue, we discussed how to handle the most basic univariate data.

This article discusses scatter plots and the scatter plot matrix as basic ways to visualize multivariate data, and the correlation coefficient, rank correlation coefficient, and variance-covariance matrix as ways to summarize it.

The programs in this article are written in Python and can be tried in the Google Colab notebook below.

Google Colaboratory

\begin{align*}

\newcommand{\mat}[1]{\begin{pmatrix} #1 \end{pmatrix}}

\newcommand{\f}[2]{\frac{#1}{#2}}

\newcommand{\pd}[2]{\frac{\partial #1}{\partial #2}}

\newcommand{\d}[2]{\frac{{\rm d}#1}{{\rm d}#2}}

\newcommand{\T}{\mathsf{T}}

\newcommand{\(}{\left(}

\newcommand{\)}{\right)}

\newcommand{\{}{\left\{}

\newcommand{\}}{\right\}}

\newcommand{\[}{\left[}

\newcommand{\]}{\right]}

\newcommand{\dis}{\displaystyle}

\newcommand{\eq}[1]{{\rm Eq}(\ref{#1})}

\newcommand{\n}{\notag\\}

\newcommand{\t}{\ \ \ \ }

\newcommand{\tt}{\t\t\t\t}

\newcommand{\argmax}{\mathop{\rm arg\, max}\limits}

\newcommand{\argmin}{\mathop{\rm arg\, min}\limits}

\def\l<#1>{\left\langle #1 \right\rangle}

\def\us#1_#2{\underset{#2}{#1}}

\def\os#1^#2{\overset{#2}{#1}}

\newcommand{\case}[1]{\{ \begin{array}{ll} #1 \end{array} \right.}

\newcommand{\s}[1]{{\scriptstyle #1}}

\definecolor{myblack}{rgb}{0.27,0.27,0.27}

\definecolor{myred}{rgb}{0.78,0.24,0.18}

\definecolor{myblue}{rgb}{0.0,0.443,0.737}

\definecolor{myyellow}{rgb}{1.0,0.82,0.165}

\definecolor{mygreen}{rgb}{0.24,0.47,0.44}

\newcommand{\c}[2]{\textcolor{#1}{#2}}

\newcommand{\ub}[2]{\underbrace{#1}_{#2}}

\end{align*}

We will again use the iris dataset. The iris dataset consists of petal and sepal lengths and widths of three varieties: Versicolour, Virginica, and Setosa.

First, as bivariate data, we will restrict the variety to Setosa and deal only with sepal_length and sepal_width.

Now we will import the iris dataset in python.

```
import numpy as np
import pandas as pd
import seaborn as sns
import scipy.stats as st
import matplotlib.pyplot as plt
df_iris = sns.load_dataset('iris')
sepal_length = df_iris[df_iris['species']=='setosa']['sepal_length']
sepal_width = df_iris[df_iris['species']=='setosa']['sepal_width']
```

The most basic way to visually see the relationship between sepal_length and sepal_width is to draw a scatter plot.

To draw a scatterplot, use the scatterplot function in the seaborn library.

```
sns.scatterplot(x=sepal_length, y=sepal_width)
plt.show()
```

The scatterplot in the above figure shows an upward-sloping trend: flowers with larger sepal_length tend to have larger sepal_width.

The seaborn library also has a jointplot function that depicts a histogram along with a scatterplot.

```
sns.jointplot(x=sepal_length, y=sepal_width)
plt.show()
```

The correlation coefficient (Pearson’s product-moment correlation coefficient) is often used to quantify such upward or downward trends between two variables. The correlation coefficient between two variables $x$ and $y$ takes values from $-1$ to $1$: it is positive when $y$ tends to increase as $x$ increases, and negative when $y$ tends to decrease as $x$ increases.

There is a positive correlation between $x$ and $y$ when the correlation coefficient is close to $1$, a negative correlation when close to $-1$, and no correlation when close to $0$.

The correlation coefficient $r_{xy}$ is defined as

\begin{align*}

r_{xy} = \f{\sum_{i=1}^n (x^{(i)} - \bar{x}) (y^{(i)} - \bar{y}) }{\sqrt{\sum_{i=1}^n (x^{(i)} - \bar{x})^2} \sqrt{\sum_{i=1}^n (y^{(i)} - \bar{y})^2}}.

\end{align*}

Here, define the mean deviation vectors

\begin{align*}

\bm{x} &= (x^{(1)} - \bar{x}, x^{(2)} - \bar{x}, \dots, x^{(n)} - \bar{x})^\T, \n

\bm{y} &= (y^{(1)} - \bar{y}, y^{(2)} - \bar{y}, \dots, y^{(n)} - \bar{y})^\T.

\end{align*}

Then the correlation coefficient $r_{xy}$ coincides with the cosine $\cos \theta$ of the angle $\theta$ formed by the vectors $\bm{x}, \bm{y}$:

\begin{align*}

r_{xy} = \cos \theta = \f{\bm{x}^\T \bm{y}}{\|\bm{x}\| \|\bm{y}\|}.

\end{align*}

From this we see that $-1 \leq r_{xy} \leq 1$.

Also, a positive correlation means $\bm{x}, \bm{y}$ point in roughly the same direction, and no correlation can be interpreted as $\bm{x}, \bm{y}$ pointing in orthogonal directions.
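This identity can be checked numerically. The sketch below uses a small hypothetical sample (the values are illustrative, not the iris data) and compares the cosine of the angle between the mean deviation vectors with NumPy's Pearson correlation:

```python
import numpy as np

# Hypothetical sample values, for illustration only
x = np.array([5.1, 4.9, 4.7, 4.6, 5.0])
y = np.array([3.5, 3.0, 3.2, 3.1, 3.6])

# Mean deviation vectors
xd = x - x.mean()
yd = y - y.mean()

# r_xy as the cosine of the angle between the deviation vectors
r_cos = xd @ yd / (np.linalg.norm(xd) * np.linalg.norm(yd))

# NumPy's Pearson correlation coefficient should agree
r_np = np.corrcoef(x, y)[0, 1]
print(r_cos, r_np)
```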

The numerator of the defining formula for the correlation coefficient divided by the number of samples $n$ is called the covariance of $x, y$.

\begin{align*}

\sigma_{xy} = \f{1}{n} \sum_{i=1}^n (x^{(i)} - \bar{x}) (y^{(i)} - \bar{y})

\end{align*}

Using this value and the standard deviation $\sigma_x, \sigma_y$ of $x, y$, the correlation coefficient can be expressed as

\begin{align*}

r_{xy} = \f{\sigma_{xy}}{\sigma_x \sigma_y}.

\end{align*}
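As a quick numerical sketch of this formula (with hypothetical values; `np.std` and the covariance below both use the population denominator $n$, matching the definitions above):

```python
import numpy as np

# Hypothetical sample, for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 1.0, 4.0, 3.0, 5.0])

# Covariance with denominator n, then r = cov / (std_x * std_y)
cov_xy = np.mean((x - x.mean()) * (y - y.mean()))
r = cov_xy / (np.std(x) * np.std(y))

print(r)  # agrees with np.corrcoef(x, y)[0, 1]
```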

When the correlation coefficient is $r_{xy} = \pm 1 $, there is a linear relationship between $x, y$. The proof is given below.

**Proof**

In what follows, assume that the variances $\sigma_x^2, \sigma_y^2$ of $x, y$ are nonzero.

If $r_{xy}$ is $1$ or $-1$, then the cosine $\cos \theta$ of the angle $\theta$ formed by the mean deviation vectors $\bm{x}, \bm{y}$ is $1$ or $-1$. Therefore,

\begin{align*}

\bm{y} = \gamma \bm{x}

\end{align*}

there exists a constant $\gamma \neq 0$ satisfying the above ($\gamma$ is positive if $r_{xy}=1$ and negative if $r_{xy}=-1$). It follows that $(y^{(i)} - \bar{y}) = \gamma (x^{(i)} - \bar{x})$ for $i = 1, \dots, n$; that is, $x$ and $y$ are linearly related.

Taking the mean of the squares of both sides of the above equation,

\begin{align*}

\f{1}{n}\sum_{i=1}^n (y^{(i)} - \bar{y})^2 &= \gamma^2 \f{1}{n}\sum_{i=1}^n (x^{(i)} - \bar{x})^2 \n

\therefore \sigma_y^2 &= \gamma^2 \sigma_x^2.

\end{align*}

Therefore $\gamma = \pm \sqrt{\sigma_y^2 / \sigma_x^2}$, and the following linear relation holds between $x$ and $y$:

\begin{align*}

y = \pm \sqrt{\f{\sigma_y^2}{\sigma_x^2}} (x - \bar{x}) + \bar{y}.

\end{align*}

The sign of the slope is the same as the sign of $r_{xy}$.

In Python, the correlation coefficient can be computed as follows.

```
# Compute the correlation coefficient
corr = np.corrcoef(sepal_length, sepal_width)[0, 1]
print(f'correlation coefficient: {corr}')
```

```
# Output
correlation coefficient: 0.7425466856651597
```

One caveat: the correlation coefficient is meaningful when the two variables are linearly related, but not when the relationship is nonlinear. Indeed, for data with a relationship other than a straight line, as in the figure below, the correlation coefficient can come out near zero, wrongly suggesting "no correlation."

In this way, the correlation coefficient quantifies how close the data are to a linear relationship; "being correlated" is not the same as "the variables being related."

Another caveat is that truncating the data can push the correlation coefficient toward $0$ compared with the full data. For example, grades before and after university admission should show a positive correlation, but post-admission grades are observed only for admitted students; with no data for those who did not enroll, the computed correlation coefficient becomes low.

This phenomenon is called the **truncation effect** or **selection effect**.

Besides the correlation coefficient above (Pearson's product-moment correlation coefficient), various other correlation coefficients are known. Here we describe **Spearman's rank correlation coefficient**.

The rank correlation coefficient is useful when only the ranks of the data are known, for example when only the rankings on achievement tests are available, as below.

Math test rank | Physics test rank |
---|---|
1 | 1 |
3 | 4 |
2 | 2 |
4 | 5 |
5 | 3 |
6 | 6 |

A **rank correlation coefficient** captures correlation using only the order of such data.

Writing the ranks of the observed values $x, y$ as $\tilde{x}, \tilde{y}$, as in the following table,

Rank of $x$ | Rank of $y$ |
---|---|
$\tilde{x}^{(1)}$ | $\tilde{y}^{(1)}$ |
$\tilde{x}^{(2)}$ | $\tilde{y}^{(2)}$ |
$\vdots$ | $\vdots$ |
$\tilde{x}^{(n)}$ | $\tilde{y}^{(n)}$ |

**Spearman's rank correlation coefficient** $\rho_{xy}$ is computed by the following formula.

\begin{align*}

\rho_{xy} = 1 - \f{6}{n(n^2-1)} \sum_{i=1}^n (\tilde{x}^{(i)} - \tilde{y}^{(i)})^2.

\end{align*}
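The formula can be applied directly to the rank table above and compared with scipy (`st.spearmanr` gives the same value here because there are no tied ranks):

```python
import numpy as np
import scipy.stats as st

# Ranks from the test-score table above
math_rank = np.array([1, 3, 2, 4, 5, 6])
phys_rank = np.array([1, 4, 2, 5, 3, 6])

# Spearman's formula: rho = 1 - 6 * sum(d^2) / (n (n^2 - 1))
n = len(math_rank)
d = math_rank - phys_rank
rho = 1 - 6 * np.sum(d ** 2) / (n * (n ** 2 - 1))

rho_scipy, _ = st.spearmanr(math_rank, phys_rank)
print(rho, rho_scipy)
```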

▼ See the following article for why the formula takes this form.

In Python, it can be computed as follows.

```
# Compute Spearman's rank correlation coefficient
math = [1, 3, 2, 4, 5, 6]
phys = [1, 4, 2, 5, 3, 6]
corr, pvalue = st.spearmanr(math, phys)
print(corr)
```

```
# Output
0.8285714285714287
```

With three or more variables, it becomes difficult to draw a scatter plot using all variables at once. Instead, use a **scatter plot matrix**, which arranges the scatter plots of every pair of variables in a grid of panels.

Let's draw a scatter plot matrix using the four variables of the iris dataset. In Python, use the `pairplot` function of the `seaborn` library.

```
# Draw a scatter plot matrix
df_setosa = df_iris[df_iris['species']=='setosa'] # restrict the variety to Setosa
sns.pairplot(data=df_setosa)
plt.show()
```

Viewing the scatter plot matrix in this way lets us grasp the relationships between all pairs of variables at once.

Correlation coefficients are also collected in matrix form. For generality, consider data with $n$ samples and $m$ variables, and define the matrix $\tilde{X}$ as follows.

\begin{align*}

\tilde{X} = \mat{

x_1^{(1)} - \bar{x}_1 & x_2^{(1)} - \bar{x}_2 & \cdots & x_m^{(1)} - \bar{x}_m \\

x_1^{(2)} - \bar{x}_1 & x_2^{(2)} - \bar{x}_2 & \cdots & x_m^{(2)} - \bar{x}_m \\

\vdots & \vdots & \ddots & \vdots \\

x_1^{(n)} - \bar{x}_1 & x_2^{(n)} - \bar{x}_2 & \cdots & x_m^{(n)} - \bar{x}_m

}.

\end{align*}

Then the matrix $\Sigma$ called the **variance-covariance matrix** is given by

\begin{align*}

\Sigma = \f{1}{n} \tilde{X}^\T \tilde{X}.

\end{align*}
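As a sketch with a small hypothetical data matrix, this formula can be checked against `np.cov`, which uses the same $1/n$ denominator when `bias=True` is passed:

```python
import numpy as np

# Hypothetical data: n = 4 samples, m = 2 variables
X = np.array([[1.0, 2.0],
              [2.0, 4.0],
              [3.0, 5.0],
              [4.0, 9.0]])
n = X.shape[0]

# Center each column, then Sigma = (1/n) X~^T X~
Xc = X - X.mean(axis=0)
Sigma = Xc.T @ Xc / n

# np.cov with rowvar=False treats columns as variables; bias=True uses 1/n
Sigma_np = np.cov(X, rowvar=False, bias=True)
print(Sigma)
```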

By definition, the $(i, j)$ component $\sigma_{ij}$ of the variance-covariance matrix is

\begin{align*}

\sigma_{ij} = \f{1}{n} \sum_{k=1}^n (x^{(k)}_i - \bar{x}_i) (x^{(k)}_j - \bar{x}_j)

\end{align*}

so $\sigma_{ij}$ is the covariance of the $i$-th and $j$-th variables. In particular, the diagonal component $\sigma_{ii}$ is the variance of the $i$-th variable.

Similarly, the symmetric matrix $R$ whose $(i, j)$ component is the correlation coefficient (Pearson's product-moment correlation coefficient) $r_{ij}$ between the $i$-th and $j$-th variables is called the **correlation matrix**.

\begin{align*}

R = \mat{

1 & r_{12} & \cdots & r_{1m} \\

r_{21} & 1 & \cdots & r_{2m} \\

\vdots & \vdots & \ddots & \vdots \\

r_{m1} & r_{m2} & \cdots & 1

}.

\end{align*}
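The correlation matrix can be obtained from the variance-covariance matrix by dividing each entry $\sigma_{ij}$ by $\sigma_i \sigma_j$. A minimal sketch with a hypothetical covariance matrix:

```python
import numpy as np

# Hypothetical variance-covariance matrix of 3 variables
Sigma = np.array([[4.0, 2.0, 0.0],
                  [2.0, 9.0, 3.0],
                  [0.0, 3.0, 4.0]])

# r_ij = sigma_ij / (sigma_i * sigma_j)
s = np.sqrt(np.diag(Sigma))  # standard deviations
R = Sigma / np.outer(s, s)
print(R)
```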

In Python, the correlation matrix can be computed as follows.

```
# Compute the correlation matrix (drop the non-numeric species column first)
corr_mat = df_setosa.drop('species', axis=1).corr()
corr_mat
```

The correlation matrix is easier to read as a heatmap.

```
# Draw the correlation matrix as a heatmap
cmap = sns.diverging_palette(255, 0, as_cmap=True) # define the color palette
sns.heatmap(corr_mat, annot=True, fmt='1.2f', cmap=cmap, square=True, linewidths=0.5, vmin=-1.0, vmax=1.0)
plt.show()
```

Next: ▼ Events, Probability, and Random Variables

Reference Books



The post How To Handle Univariate Data Histograms and Box Plots first appeared on Yukkuri Machine Learning.

This course deals with “univariate data,” the foundation of statistics, which refers to data consisting of a single type of variable, such as height data or math exam scores.

This article describes summary statistics such as the mean and variance, and how to create histograms and box-and-whisker plots to visually capture the characteristics of univariate data.

The programs in this article are written in Python and can be tried in the Google Colab notebook below.

Google Colaboratory

We will use the iris dataset, which consists of petal and sepal lengths and widths of three varieties: Versicolour, Virginica, and Setosa.

In this case, since we will treat the data as univariate data, we will limit the variety to Setosa and treat only sepal_length.

Now we will import the iris dataset in python.

```
import numpy as np
import pandas as pd
import seaborn as sns
import scipy.stats as st
import matplotlib.pyplot as plt
df_iris = sns.load_dataset('iris')
iris_data = df_iris[df_iris['species']=='setosa']['sepal_length']
print(iris_data.tolist())  # print the values as a plain list
```

```
# Output
[5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9, 5.4, 4.8, 4.8, 4.3, 5.8, 5.7, 5.4, 5.1, 5.7, 5.1, 5.4, 5.1, 4.6, 5.1, 4.8, 5.0, 5.0, 5.2, 5.2, 4.7, 4.8, 5.4, 5.2, 5.5, 4.9, 5.0, 5.5, 4.9, 4.4, 5.1, 5.0, 4.5, 4.4, 5.0, 5.1, 4.8, 5.1, 4.6, 5.3, 5.0]
```

Since it is difficult to grasp the characteristics of the data by looking at a list of values, we draw a histogram.

A histogram is a bar graph with frequency (or relative frequency) on the vertical axis and class on the horizontal axis, and is drawn using the displot function in the seaborn library.

```
sns.displot(iris_data)
plt.show()
```

In this histogram, the horizontal axis is sepal_length (cm) in 0.25 cm increments. Each increment is called a class (bin), the width of the increment is the class width, and the count of increments is the number of classes. The vertical axis is called the frequency and counts the number of data points falling in each class.

The displot function can also display the vertical axis as relative frequency (proportions summing to 1). To do so, pass stat='probability' as an argument.

```
sns.displot(iris_data, stat='probability')
plt.show()
```

The shape of a histogram changes with the number and width of the classes. Too many or too few classes will not capture the characteristics of the data well.

By default, the seaborn library’s displot function uses Sturges’ formula to determine the number of classes. It states that the number of classes $k$ for a sample size $n$ is given by

\begin{align*} k = \lceil 1 + \log_2 n \rceil \end{align*}

where $\lceil x \rceil$ denotes the ceiling of the real number $x$, i.e., $x$ rounded up to the nearest integer.

For example, since the data we are dealing with has a sample size of 50, applying Sturges’ formula gives

\begin{align*} k = \lceil 1 + \log_2 50 \rceil = \lceil 1 + 5.6438 \rceil = \lceil 6.6438 \rceil = 7 \end{align*}

If you check the histogram shown earlier, you will see that the number of classes is indeed 7.
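The calculation above can be sketched as a small helper (`sturges_bins` is a hypothetical name, not a seaborn function):

```python
import math

def sturges_bins(n):
    """Number of histogram classes by Sturges' formula: ceil(1 + log2 n)."""
    return math.ceil(1 + math.log2(n))

print(sturges_bins(50))  # the setosa subset has 50 samples
```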

To change the number of classes in the seaborn library’s displot function, pass bins={number of bins} as an argument.

```
sns.displot(iris_data, bins=10)
plt.show()
```

To capture the characteristics of the data numerically, it is useful to calculate the mean $\bar{x}$, standard deviation $\sigma$, and variance $\sigma^2$.

The average $\bar{x}$ is calculated by

\begin{align*} \bar{x} = \frac{1}{n} \sum_{i=1}^n x^{(i)}. \end{align*}

Also, the variance $\sigma^2$ and standard deviation $\sigma$ are given below.

\begin{align*} \sigma^2 = \frac{1}{n} \sum_{i=1}^n (x^{(i)} - \bar{x})^2, \end{align*}

\begin{align*} \sigma = \sqrt{\frac{1}{n} \sum_{i=1}^n (x^{(i)} - \bar{x})^2}. \end{align*}

However, statistics usually uses unbiased variance.

\begin{align*} \tilde{\sigma}^2 = \frac{1}{n-1} \sum_{i=1}^n (x^{(i)} - \bar{x})^2 \end{align*}

As we will discuss in detail another time, the variance with denominator $n$ tends to underestimate the true variance, so dividing by the smaller denominator $n-1$ compensates for this bias.
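The two denominators can be compared directly with NumPy's ddof argument (a sketch with hypothetical values):

```python
import numpy as np

# Hypothetical sample
data = np.array([4.3, 4.6, 5.0, 5.1, 5.5])

var_pop = np.var(data)           # denominator n (tends to underestimate)
var_unb = np.var(data, ddof=1)   # denominator n - 1 (unbiased variance)
print(var_pop, var_unb)
```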

The mean corresponds to the center of gravity of the data, and the standard deviation (variance) expresses how scattered the data are around the mean.

A number that summarizes the characteristics of the data is called a summary statistic.

The above values can be calculated in python as follows

```
print(f'mean: {np.mean(iris_data)}')
print(f'var: {np.var(iris_data)}')
print(f'std: {np.std(iris_data)}')
print(f'unbiased var: {st.tvar(iris_data)}')
```

```
# Output
mean: 5.005999999999999
var: 0.12176399999999993
std: 0.348946987377739
unbiased var: 0.12424897959183677
```

The data are ordered from smallest to largest, and the value exactly at the halfway point is called the median.

Values exactly $1/4$ (25%) and $3/4$ (75%) of the way from the smaller end are also used as summary statistics. They are called the first quartile (Q1, the 25% point) and the third quartile (Q3, the 75% point), respectively.

In python it can be calculated as follows

```
print(f'median: {np.median(iris_data)}')
print(f'quantile: {np.quantile(iris_data, q=[0.25, 0.5, 0.75])}')
```

```
# Output
median: 5.0
quantile: [4.8 5. 5.2]
```

The difference between the 75% and 25% points, $Q3 - Q1$, is called the interquartile range (IQR) and indicates how concentrated the data are around the median.
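A minimal sketch of computing the IQR (with hypothetical values; np.quantile interpolates linearly between data points by default):

```python
import numpy as np

# Hypothetical sample
data = np.array([4.3, 4.6, 4.8, 5.0, 5.1, 5.2, 5.5, 5.8])

q1, q3 = np.quantile(data, [0.25, 0.75])
iqr = q3 - q1
print(q1, q3, iqr)
```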

This can be depicted in a box-and-whisker diagram using the boxplot function in the seaborn library.

```
sns.boxplot(y=iris_data)
plt.show()
```

When comparing multiple distributions, placing box plots side by side is often easier to read than comparing histograms directly.

```
sns.boxplot(data=df_iris.drop('species', axis=1))
plt.show()
```

The seaborn library also provides a boxenplot function that extends the box-and-whisker diagram. It displays more detailed information about the tails of the distribution.

```
sns.boxenplot(data=df_iris.drop('species', axis=1))
plt.show()
```

Reference Books


▼ Next: How to handle multivariate data


The post ROC Curve and AUR, Implementation with Python first appeared on Yukkuri Machine Learning.

In the previous issue, we introduced various evaluation metrics for machine learning classification problems, including the confusion matrix.

In this issue, we continue with the ROC curve and AUC, which are commonly used to evaluate classifiers.

Note that the program code described in this article can be tried in the following google colab.

Google Colaboratory

The ROC curve uses TPR (true positive rate (= recall)) and FPR (false positive rate); let’s review these two first before we start talking about the ROC curve.

Consider classifying data as “+ (positive) or – (negative)”.

When the classifier is fed test data and allowed to make inferences, four patterns arise:

True Positive (TP): Infer + for data whose true class is +.

False Negative (FN): Infer data whose true class is + as –.

False Positive (FP): Infer data with a true class of – as +.

True negative (TN): Infer data with true class – as –.

The results of this classification are summarized in the following table, which is called the confusion matrix.

where TPR and FPR are defined by the following equations

\begin{align*}

{\rm TPR} = \frac{{\rm TP}}{{\rm TP} + {\rm FN}}, \ \ \ \ {\rm FPR} = \frac{{\rm FP}}{{\rm TN} + {\rm FP}}

\end{align*}

TPR is a measure of how much of the total + (positive) data the classifier correctly infers as + (positive).

FPR is a measure of how much of the total – (negative) data the classifier incorrectly infers as + (positive).

As mentioned in the introduction, the ROC curve, which is the main topic of this issue, is an evaluation index for classifiers calculated based on TPR and FPR, which have the above characteristics.

Some classifiers output a probability of being + when classifying data as + or -. A typical example is logistic regression.

We will call the probability that the classifier’s output is + the “score”.

Normally, data are classified with a threshold value of 0.5, such as “+ if the score is 0.5 or higher, – if the score is less than 0.5,” and so on. Changing this threshold value will naturally change the data classification results, which in turn changes the performance of the classifier (TPR and FPR).

The ROC curve is a plot of FPR on the horizontal axis and TPR on the vertical axis when the threshold is varied.

Let’s look at the ROC curve using a specific example.

In the problem of classifying data as + or -, suppose a classifier yields the following scores

True class | score |
---|---|
+ | 0.8 |
+ | 0.6 |
+ | 0.4 |
– | 0.5 |
– | 0.3 |
– | 0.2 |

▲ true class and the score output by the classifier (probability of being +)

Here we take each score output by the classifier as a threshold $x$ (“+ if the score is $x$ or above, – if below $x$”) and compute TPR and FPR at each threshold value.

For example, for the threshold $x = 0.8$ the confusion matrix is

 | Predicted + | Predicted – |
---|---|---|
True + | TP: 1 | FN: 2 |
True – | FP: 0 | TN: 3 |

\begin{align*}

{\rm TPR} = \frac{1}{1+2} = 0.33\cdots, \ \ \ \ {\rm FPR} = \frac{0}{3 + 0} = 0

\end{align*}

The results of calculating TPR and FPR in this way for each threshold $x \in \{ 0.8, 0.6, 0.5, 0.4, 0.3, 0.2 \}$ are as follows.

True class | score | TPR | FPR |
---|---|---|---|
+ | 0.8 | 0.33 | 0 |
+ | 0.6 | 0.66 | 0 |
– | 0.5 | 0.66 | 0.33 |
+ | 0.4 | 1.0 | 0.33 |
– | 0.3 | 1.0 | 0.66 |
– | 0.2 | 1.0 | 1.0 |

Plotting this TPR and FPR produces the ROC curve.
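The (FPR, TPR) points in the table above can be reproduced with a short sketch. The true classes below follow the worked example (encoded as 1 for + and 0 for –):

```python
import numpy as np

# Scores in descending order and the corresponding true classes (+, +, -, +, -, -)
y_true = np.array([1, 1, 0, 1, 0, 0])
scores = np.array([0.8, 0.6, 0.5, 0.4, 0.3, 0.2])

points = []
for th in scores:  # use each score value as a threshold
    pred = (scores >= th).astype(int)
    tp = np.sum((y_true == 1) & (pred == 1))
    fn = np.sum((y_true == 1) & (pred == 0))
    fp = np.sum((y_true == 0) & (pred == 1))
    tn = np.sum((y_true == 0) & (pred == 0))
    points.append((fp / (tn + fp), tp / (tp + fn)))  # (FPR, TPR)

for fpr, tpr in points:
    print(f'FPR={fpr:.2f}, TPR={tpr:.2f}')
```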

Now, let’s evaluate the classifier using the ROC curve. A good classifier is one that can correctly classify the data into + and – at a certain threshold value. In other words, it is a classifier that can increase TPR without increasing FPR.

This state of increasing TPR without increasing FPR is indicated by the upper left point in the above figure. In other words, the closer the ROC curve is to the upper left, the better the classifier.

On the other hand, consider a “bad classifier” that outputs + and – randomly: whatever the threshold, + and – appear in the same proportion, so as TPR increases, FPR increases with it, and the ROC curve becomes the straight line from the origin (0.0, 0.0) to the upper right (1.0, 1.0).

AUC (Area Under the Curve) is an index that quantifies how close the ROC curve is to the upper left. It is defined as the area under the ROC curve, with a maximum value of 1.0. In other words, the closer the AUC is to 1.0, the better the classifier.

ROC curves can be easily plotted using scikit-learn’s roc_curve function. The AUC can also be calculated with the roc_auc_score function.

In this article, we will build two types of models, logistic regression and random forest, and compare their performance with ROC curves and AUC.

- First, the data for the binary classification problem is prepared and divided into training and test data.

```
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display
from sklearn.datasets import load_breast_cancer
bc = load_breast_cancer()
df = pd.DataFrame(bc.data, columns=bc.feature_names)
df['target'] = bc.target
X = df.drop(['target'], axis=1).values
y = df['target'].values
```

```
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1, stratify=y)
```

- Create a logistic regression model and a random forest model.

```
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
model_lr = LogisticRegression(C=1, random_state=42, solver='lbfgs')
model_lr.fit(X_train, y_train)
model_rf = RandomForestClassifier(random_state=42)
model_rf.fit(X_train, y_train)
```

- The test data are used to predict probabilities, draw the ROC curves, and calculate the AUC values.

```
from sklearn.metrics import roc_curve, roc_auc_score
colors = sns.color_palette()  # define a palette for the two curves
proba_lr = model_lr.predict_proba(X_test)[:, 1]
proba_rf = model_rf.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, proba_lr)
plt.plot(fpr, tpr, color=colors[0], label='logistic')
plt.fill_between(fpr, tpr, 0, color=colors[0], alpha=0.1)
fpr, tpr, thresholds = roc_curve(y_test, proba_rf)
plt.plot(fpr, tpr, color=colors[1], label='random forest')
plt.fill_between(fpr, tpr, 0, color=colors[1], alpha=0.1)
plt.xlabel('FPR')
plt.ylabel('TPR')
plt.legend()
plt.show()
print(f'Logistic regression AUC: {roc_auc_score(y_test, proba_lr):.4f}')
print(f'Random forest AUC: {roc_auc_score(y_test, proba_rf):.4f}')
```

```
# Output
Logistic regression AUC: 0.9870
Random forest AUC: 0.9885
```

The results showed that the AUC was close to 1.0 in both cases, and there was not much difference between the two.

(Given this result, it is hard to say which model is better in terms of AUC, since the two values are so close...)


The post 【Python】Fill In Data With Intervals Between Dates And Times In Pandas. first appeared on Yukkuri Machine Learning.

This article is a reminder of how to fill in missing rows in time-series data spaced by dates and times.

To fill in the interval between dates and times, use the asfreq function in pandas.

You can try the code in this article in the following google colab

Google Colaboratory

As an example, we will deal with the following data (assume that the data is stored in a variable called df below).

The data is in 10-minute increments, but some rows are missing, leaving gaps in the time series.

The policy for filling in the data is

- Specify the time column (datetime) as index
- Using asfreq function in pandas
- Undo index

```
import pandas as pd
df_ = df.set_index('datetime')
df_ = df_.asfreq(freq='10min')
df_fill = df_.reset_index()
print(df_fill)
```

As described above, the missing timestamps in date-and-time data can be filled in.
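Since the original df is not shown here, the following is a self-contained sketch with a small hypothetical frame; asfreq inserts the missing timestamps as rows whose data columns are NaN:

```python
import pandas as pd

# Hypothetical 10-minute data with one missing row (00:20 is absent)
df = pd.DataFrame({
    'datetime': pd.to_datetime(['2022-01-01 00:00', '2022-01-01 00:10',
                                '2022-01-01 00:30']),
    'value': [1.0, 2.0, 3.0],
})

df_ = df.set_index('datetime').asfreq(freq='10min')  # fills the gap with NaN
df_fill = df_.reset_index()
print(df_fill)
```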

The arguments that can be specified for freq in the asfreq function are detailed in the official reference below.

Time series / date functionality — pandas 1.5.0 documentation


The post 【Python】Creating A List Of Consecutive Dates And Times In Pandas. first appeared on Yukkuri Machine Learning.

This article is a reminder of how to create a list of consecutive dates and times using python.

The policy is to use the pandas date_range function.

You can try the code in this article in the following google colab

Google Colaboratory

- Create a list of consecutive dates in the following way

```
from datetime import datetime
import pandas as pd
dt_list = pd.date_range(start='2022-01-01', periods=10, freq='D')
print(dt_list)
```

```
# Output
DatetimeIndex(['2022-01-01', '2022-01-02', '2022-01-03', '2022-01-04',
'2022-01-05', '2022-01-06', '2022-01-07', '2022-01-08',
'2022-01-09', '2022-01-10'],
dtype='datetime64[ns]', freq='D')
```

- It can also be created by specifying the beginning and ending dates and times as arguments.

```
dt_list = pd.date_range(start='2022-01-01', end='2022-01-10', freq='D')
print(dt_list)
```

```
# Output
DatetimeIndex(['2022-01-01', '2022-01-02', '2022-01-03', '2022-01-04',
'2022-01-05', '2022-01-06', '2022-01-07', '2022-01-08',
'2022-01-09', '2022-01-10'],
dtype='datetime64[ns]', freq='D')
```

- The time specified in the argument can be of datetime type.

```
start_dt = datetime(year=2022, month=1, day=1)
dt_list = pd.date_range(start=start_dt, periods=10, freq='D')
print(dt_list)
```

```
# Output
DatetimeIndex(['2022-01-01', '2022-01-02', '2022-01-03', '2022-01-04',
'2022-01-05', '2022-01-06', '2022-01-07', '2022-01-08',
'2022-01-09', '2022-01-10'],
dtype='datetime64[ns]', freq='D')
```

- By changing the freq argument, you can also create a list of dates and times every 10 minutes or every 30 minutes.

```
dt_list = pd.date_range(start='2022-01-01', periods=10, freq='10min')
print(dt_list)
dt_list = pd.date_range(start='2022-01-01', periods=10, freq='30min')
print(dt_list)
```

```
# Output
DatetimeIndex(['2022-01-01 00:00:00', '2022-01-01 00:10:00',
'2022-01-01 00:20:00', '2022-01-01 00:30:00',
'2022-01-01 00:40:00', '2022-01-01 00:50:00',
'2022-01-01 01:00:00', '2022-01-01 01:10:00',
'2022-01-01 01:20:00', '2022-01-01 01:30:00'],
dtype='datetime64[ns]', freq='10T')
DatetimeIndex(['2022-01-01 00:00:00', '2022-01-01 00:30:00',
'2022-01-01 01:00:00', '2022-01-01 01:30:00',
'2022-01-01 02:00:00', '2022-01-01 02:30:00',
'2022-01-01 03:00:00', '2022-01-01 03:30:00',
'2022-01-01 04:00:00', '2022-01-01 04:30:00'],
dtype='datetime64[ns]', freq='30T')
```

The arguments that can be specified for freq are detailed in the official reference below.

Time series / date functionality — pandas 1.5.0 documentation


The post Classification Evaluation Indicators: Accuracy, Precision, Recall, F-measure first appeared on Yukkuri Machine Learning.

After a model (classifier) is trained by machine learning in a classification problem, its performance needs to be evaluated.

This article discusses the following evaluation indicators:

- Accuracy
- Precision
- Recall / True Positive Rate: TPR
- False Positive Rate: FPR
- F-measure

We also describe how to calculate the above using scikit-learn.

You can try the source code for this article from google colab below.

Google Colaboratory

To simplify matters, we will limit our discussion to two-class classification problems. Here, we consider classifying data as either + (positive) or – (negative).


Now, when the classifier is fed test data and allowed to make inferences, the following four patterns arise.

- True Positive (TP): Infer + for data whose true class is +.
- False Negative (FN): Infer data whose true class is + as –.
- False Positive (FP): Infer data with a true class of – as +.
- True negative (TN): Infer data with true class – as –.

The results of this classification are summarized in a table as shown below, which is called the confusion matrix. The diagonal components of this table indicate the number of data for which the inference is correct, and the off-diagonal components indicate the number of data for which the inference is incorrect.

These four patterns above define the evaluation indicators for various classifiers.

The accuracy is the proportion of the test data that the classifier correctly infers and is expressed by the following equation

\begin{align*} {\rm accuracy} = \frac{{\rm TP} + {\rm TN}}{{\rm TP} + {\rm FP} + {\rm FN} + {\rm TN}} \end{align*}

TP and TN are the cases inferred correctly, so this is the percentage of ${\rm TP} + {\rm TN}$ correct cases out of the total number of data ${\rm TP} + {\rm FP} + {\rm FN} + {\rm TN}$.

Now, a problem arises when evaluating classifier performance based solely on this percentage of correct answers.

As an example, let’s consider a data set of 100,000 data, of which 99990 are – (negative) and 10 are + (positive).

Suppose a classifier infers all data to be – (negative), as shown in the following table.

Calculating the accuracy at this point,

\begin{align*} {\rm accuracy} &= \frac{{\rm TP} + {\rm TN}}{{\rm TP} + {\rm FP} + {\rm FN} + {\rm TN}} \\ &= \frac{0 + 99990}{0 + 0 + 10 + 99990} \\ &= 0.9999 = 99.99 \% \end{align*}

The accuracy comes out very high, so this looks like a good classifier even though it has not detected a single + (positive) case.

In other words, it is not sufficient to judge the performance of a classifier by the percentage of correct answers alone, and various indicators have been proposed as follows

Precision is a measure of how reliable a classifier is when it determines that data is + (positive).

\begin{align*} {\rm precision} = \frac{{\rm TP}}{{\rm TP} + {\rm FP}} \end{align*}

This indicator is mainly used when one wants to increase predictive certainty. However, a classifier that only increases precision can be achieved by reducing the number of FPs (the number of cases where – is incorrectly inferred as +), i.e., by a model that judges + more strictly.

Recall is a measure of how well the classifier correctly inferred + (positive) out of the total + (positive) data. It is also called the true positive rate (TPR).

\begin{align*} {\rm recall} = \frac{{\rm TP}}{{\rm TP} + {\rm FN}} \end{align*}

This indicator is used when reducing FN (the number of cases where + is incorrectly inferred as –) matters, as in cancer diagnosis. However, a classifier that only increases recall can be achieved by a model that judges + loosely, or in the extreme, a model that judges all data as +.

The false positive rate (FPR) is a measure of how much of the total – (negative) data the classifier incorrectly infers as + (positive).

\begin{align*} {\rm FPR} = \frac{{\rm FP}}{{\rm TN} + {\rm FP}} \end{align*}

A small value for this indicator is desired. However, a classifier that reduces only the false positive rate can be achieved with a model that judges – for all data.

This FPR and TPR (= recall) are used in the ROC curve.

▼Click here to see the contents of the ROC curve.

There is a trade-off between precision and recall, and these indicators cannot be high at the same time. The reason for the trade-off, as mentioned earlier, is that a classifier that increases only precision is realized with a model that judges “strictly” +, while a classifier that increases only recall is realized with a model that judges “loosely” +.

Now, a model with high precision and recall means a model with low FP and FN, i.e., a high-performance classifier with low off-diagonal components of the confusion matrix = low misclassification. Therefore, we define the F-measure as the harmonic mean of precision and recall.

\begin{align*} F = \frac{2}{\frac{1}{{\rm recall}} + \frac{1}{{\rm precision}}} = 2 \cdot \frac{{\rm precision} \cdot {\rm recall}}{{\rm precision} + {\rm recall}} \end{align*}
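As a quick numerical sketch of this harmonic mean, with hypothetical precision and recall values:

```python
# Hypothetical values for illustration
precision = 0.75
recall = 0.60

# F-measure: harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)
print(f1)
```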

The above indicators can be easily calculated using scikit-learn.

First, import the necessary libraries and define the data to be handled. The data here are simple arrays with 1: positive, -1: negative.

```
import numpy as np
import pandas as pd
from sklearn.metrics import confusion_matrix, classification_report
import seaborn as sns
import matplotlib.pyplot as plt
y_true = [-1, -1, -1, -1, -1, 1, 1, 1, 1, 1]
y_pred = [-1, -1, -1, 1, 1, -1, 1, 1, 1, 1]
names = ['positive', 'negative']
```

First, we create the confusion matrix, which scikit-learn provides as confusion_matrix.

```
cm = confusion_matrix(y_true, y_pred, labels=[1, -1])
print(cm)
```

```
# Output
[[4 1]
[2 3]]
```

To make the output easier to read, let’s display the confusion matrix in seaborn.

```
cm = pd.DataFrame(data=cm, index=names, columns=names)
sns.heatmap(cm, square=True, cbar=True, annot=True, cmap='Blues')
plt.xlabel('predicted value', fontsize=15)
plt.ylabel('ground truth', fontsize=15)
plt.show()
```

Next, let’s calculate the evaluation indicators. In scikit-learn, the indicators described so far can be computed at once with classification_report. Note that when labels is omitted, classification_report pairs target_names with the class labels in sorted order ([-1, 1] here), which would silently swap the two names; we therefore pass labels=[1, -1] explicitly.

```
eval_dict = classification_report(y_true, y_pred, labels=[1, -1], target_names=names, output_dict=True)
df = pd.DataFrame(eval_dict)
print(df)
```

```
# Output
           positive  negative  accuracy  macro avg  weighted avg
precision  0.666667  0.750000       0.7   0.708333      0.708333
recall     0.800000  0.600000       0.7   0.700000      0.700000
f1-score   0.727273  0.666667       0.7   0.696970      0.696970
support    5.000000  5.000000       0.7  10.000000     10.000000
```

The first and second columns of the output show the indicators when positive and negative, respectively, are treated as the positive class.

Also, macro avg and weighted avg are called the macro average and weighted average, respectively.

In this problem setup, the indicators we want are as follows.

```
print(f"accuracy: {df['accuracy'].iloc[0]:.2f}")
print(f"precision: {df['positive']['precision']:.2f}")
print(f"recall: {df['positive']['recall']:.2f}")
print(f"f1-score: {df['positive']['f1-score']:.2f}")
```

```
# Output
accuracy: 0.70
precision: 0.67
recall: 0.80
f1-score: 0.73
```

You can try the above code in the following Google Colab.

Google Colaboratory

The post Classification Evaluation Indicators: Accuracy, Precision, Recall, F-measure first appeared on Yukkuri Machine Learning.

]]>The post Implementation Of K-means Method, Elbow Method, Silhouette Analysis first appeared on Yukkuri Machine Learning.

]]>One of the best-known clustering methods is the k-means method, which assumes the data can be divided into $K$ clusters and assigns each data point to one of them according to a fixed procedure.

This article describes how the k-means method works and how it is implemented.

The k-means method requires that the number of clusters to be classified be given in advance, but the elbow method and silhouette analysis are introduced as methods for determining the optimal number of clusters.

You can try the source code for this article in Google Colab below.

Google Colaboratory

\begin{align*}

\newcommand{\mat}[1]{\begin{pmatrix} #1 \end{pmatrix}}

\newcommand{\f}[2]{\frac{#1}{#2}}

\newcommand{\pd}[2]{\frac{\partial #1}{\partial #2}}

\newcommand{\d}[2]{\frac{{\rm d}#1}{{\rm d}#2}}

\newcommand{\T}{\mathsf{T}}

\newcommand{\(}{\left(}

\newcommand{\)}{\right)}

\newcommand{\{}{\left\{}

\newcommand{\}}{\right\}}

\newcommand{\[}{\left[}

\newcommand{\]}{\right]}

\newcommand{\dis}{\displaystyle}

\newcommand{\eq}[1]{{\rm Eq}(\ref{#1})}

\newcommand{\n}{\notag\\}

\newcommand{\t}{\ \ \ \ }

\newcommand{\tt}{\t\t\t\t}

\newcommand{\argmax}{\mathop{\rm arg\, max}\limits}

\newcommand{\argmin}{\mathop{\rm arg\, min}\limits}

\def\l<#1>{\left\langle #1 \right\rangle}

\def\us#1_#2{\underset{#2}{#1}}

\def\os#1^#2{\overset{#2}{#1}}

\newcommand{\case}[1]{\{ \begin{array}{ll} #1 \end{array} \right.}

\newcommand{\s}[1]{{\scriptstyle #1}}

\definecolor{myblack}{rgb}{0.27,0.27,0.27}

\definecolor{myred}{rgb}{0.78,0.24,0.18}

\definecolor{myblue}{rgb}{0.0,0.443,0.737}

\definecolor{myyellow}{rgb}{1.0,0.82,0.165}

\definecolor{mygreen}{rgb}{0.24,0.47,0.44}

\newcommand{\c}[2]{\textcolor{#1}{#2}}

\newcommand{\ub}[2]{\underbrace{#1}_{#2}}

\end{align*}

Assuming the number of clusters is 2, the idea of the k-means method is explained with the figure below, taken from PRML (Pattern Recognition and Machine Learning, 2006).

(a): Consider a centroid for each cluster, $\bm{\mu}_1, \bm{\mu}_2$. Since we cannot know in advance which cluster each data point belongs to, the centroids are given suitable initial values. In the figure, the red and blue crosses indicate the cluster centroids.

(b): Assign each data point to the cluster whose centroid is closest.

(c): Recompute the centroid of each cluster from its assigned data points and update it.

(d)~(i): Repeat steps (b) and (c). Iteration continues until the cluster assignments no longer change or the maximum number of iterations set by the analyst is reached.

Let us formulate the above.

The $i$th of $N$ data points is denoted $\bm{x}^{(i)}$. Assume the data are divided into $K$ clusters, and denote the centroid of the $j$th cluster by $\bm{\mu}_j$.

Then the k-means method becomes the following loss function minimization problem.

\begin{align*}

L = \sum_{i=1}^N \sum_{j=1}^K r_{ij} \| \bm{x}^{(i)} - \bm{\mu}_j \|^2

\end{align*}

where $r_{ij}$ takes the value $1$ if the data point $\bm{x}^{(i)}$ belongs to the $j$th cluster and $0$ otherwise, and can be written as follows.

\begin{align*}

r_{ij} = \case{1\t {\rm if}\ j=\argmin_k \| \bm{x}^{(i)} - \bm{\mu}_k \|^2 \n 0 \t {\rm otherwise.}}

\end{align*}

On the other hand, the centroid $\bm{\mu}_j$ is updated as

\begin{align*}

\bm{\mu}_j = \f{\sum_i r_{ij} \bm{x}^{(i)}}{\sum_i r_{ij}}

\end{align*}

The form of this equation shows that it is simply the mean of the data vectors belonging to the cluster.

The update above is also obtained by setting the partial derivative of the loss function with respect to $\bm{\mu}_j$ to zero and solving for $\bm{\mu}_j$:

\begin{align*}

\pd{L}{\bm{\mu}_j} = 2 \sum_{i=1}^N r_{ij} (\bm{x}^{(i)} - \bm{\mu}_j) = 0

\end{align*}

Thus, the procedure of the k-means method can be summarized as follows.

(a): Give random initial values for the cluster centroids $\bm{\mu}_j\ (j=1, \dots, K)$.

(b):

\begin{align*}

r_{ij} = \case{1\t {\rm if}\ j=\argmin_k \| \bm{x}^{(i)} - \bm{\mu}_k \|^2 \n 0 \t {\rm otherwise.}}

\end{align*}

Calculate the above formula and assign each data point to a cluster.

(c):

\begin{align*}

\bm{\mu}_j = \f{\sum_i r_{ij} \bm{x}^{(i)}}{\sum_i r_{ij}}

\end{align*}

Calculate the above formula and update the cluster centroids.

(d): Repeat steps (b) and (c). Iteration continues until the cluster assignments no longer change or the maximum number of iterations set by the analyst is reached.
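The procedure (a)-(d) can be sketched in a few lines of NumPy. This is a minimal illustration, not the scikit-learn implementation; the optional `init` argument is an assumption added here so the starting centroids can be fixed for reproducibility.

```python
import numpy as np

def kmeans(X, K, init=None, n_iter=100, seed=None):
    """Minimal k-means sketch. X: (N, D) array-like. Returns (labels, centroids)."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # (a) initialize centroids: given explicitly, or K randomly chosen data points
    mu = (np.asarray(init, dtype=float) if init is not None
          else X[rng.choice(len(X), size=K, replace=False)])
    for _ in range(n_iter):
        # (b) assign each point to its nearest centroid (squared Euclidean distance)
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        r = d2.argmin(axis=1)
        # (c) move each centroid to the mean of its assigned points
        new_mu = np.array([X[r == j].mean(axis=0) if np.any(r == j) else mu[j]
                           for j in range(K)])
        # (d) stop once the centroids no longer move
        if np.allclose(new_mu, mu):
            break
        mu = new_mu
    return r, mu
```

For two well-separated groups and starting centroids near each group, this converges in a couple of iterations to the group means.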

Let’s try cluster analysis with the k-means method in scikit-learn. The data will be generated with scikit-learn’s make_blobs, which creates a synthetic dataset of clustered points.

The first step is to create a dataset for classification.

```
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import make_blobs
X, y = make_blobs(
    n_samples=150,
    centers=3,
    cluster_std=1.0,
    shuffle=True,
    random_state=42)
x1 = X[:, 0]
x2 = X[:, 1]
plt.scatter(x1, x2)
plt.xlabel('x1')
plt.ylabel('x2')
plt.show()
```

scikit-learn provides the sklearn.cluster.KMeans class to perform cluster analysis using the k-means method. The implementation is extremely simple and is shown below. Note that the number of clusters to be divided must be decided in advance.

```
from sklearn.cluster import KMeans
model = KMeans(n_clusters=3, random_state=0, init='random')
model.fit(X)
clusters = model.predict(X)
print(clusters)
df_cluster_centers = pd.DataFrame(model.cluster_centers_)
df_cluster_centers.columns = ['x1', 'x2']
print(df_cluster_centers)
```

```
# Output
[2 2 0 1 0 1 2 2 0 1 0 0 1 0 1 2 0 2 1 1 2 0 1 0 1 0 0 2 0 1 1 2 0 0 1 1 0
0 1 2 2 0 2 2 1 2 2 1 2 0 2 1 1 2 2 0 2 1 0 2 0 0 0 1 1 1 1 0 1 1 0 2 1 2
2 2 1 1 1 2 2 0 2 0 1 2 2 1 0 2 0 2 2 0 0 1 0 0 2 1 1 1 2 2 1 2 0 1 2 0 0
2 1 0 0 1 0 1 0 2 0 0 1 0 2 0 0 1 1 0 2 2 1 1 2 1 0 1 1 0 2 2 0 1 2 2 0 2
2 1]
x1 x2
0 -6.753996 -6.889449
1 4.584077 2.143144
2 -2.701466 8.902879
```

Since this alone is difficult to understand, the clustering results are illustrated in the following figure.

```
df = pd.DataFrame(X, columns=['x1', 'x2'])
df['class'] = clusters
sns.scatterplot(data=df, x='x1', y='x2', hue='class')
sns.scatterplot(data=df_cluster_centers, x='x1', y='x2', s=200, marker='*', color='gold', linewidth=0.5)
plt.show()
```

Thus, with the sklearn.cluster.KMeans class, you can easily perform cluster analysis using the k-means method.

One problem with the k-means method is that the number of clusters must be specified. This section introduces the elbow method and silhouette analysis used to determine the optimal number of clusters. Note that the data used below is the data with 3 clusters created earlier with make_blobs.

In the Elbow method, the optimal number of clusters is determined by calculating the loss function of the k-means method while varying the number of clusters and illustrating the results.

\begin{align*}

L = \sum_{i=1}^N \sum_{j=1}^K r_{ij} \| \bm{x}^{(i)} - \bm{\mu}_j \|^2

\end{align*}

The implementation of the elbow method is described below. The value of the loss function can be accessed at model.inertia_.

```
sum_of_squared_errors = []
for i in range(1, 11):
    model = KMeans(n_clusters=i, random_state=0, init='random')
    model.fit(X)
    sum_of_squared_errors.append(model.inertia_)
plt.plot(range(1, 11), sum_of_squared_errors, marker='o')
plt.xlabel('number of clusters')
plt.ylabel('sum of squared errors')
plt.show()
```

The illustrated results show that the value of the loss function decreases until the number of clusters (the value on the horizontal axis) is 3, after which it remains almost unchanged.

The elbow method determines the optimal number of clusters as the point where the rate of decrease of the loss function changes sharply (the “elbow” of the curve). Therefore, in this case, the optimal number of clusters can be judged to be 3.
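One crude way to automate that visual judgement (a hypothetical heuristic of my own, not part of scikit-learn) is to pick the first number of clusters after which the marginal drop in SSE becomes small relative to the total drop:

```python
def pick_elbow(sse, ratio=0.1):
    """sse[k-1] is the sum of squared errors for k clusters (k = 1, 2, ...).
    Return the first k whose next drop in SSE is below `ratio` of the total drop."""
    total_drop = sse[0] - sse[-1]
    for k in range(1, len(sse)):
        if sse[k - 1] - sse[k] < ratio * total_drop:
            return k
    return len(sse)

print(pick_elbow([1000, 400, 100, 95, 92, 90]))  # 3
```

In practice the threshold `ratio` is arbitrary, which is why the elbow is usually judged by eye.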

Silhouette analysis evaluates clustering performance based on the following criteria:

- The denser the data points in a cluster, the better.
- The further apart each cluster is, the better.

Specifically, clustering performance is evaluated by the silhouette coefficient, defined by the following procedure:

(1) Compute the cohesion within the cluster, $a^{(i)}$: the average distance from a data point $\bm{x}^{(i)}$ to the other data points in the cluster $C_{\rm in}$ to which it belongs.

\begin{align*}

a^{(i)} = \f{1}{|C_{\rm in}| - 1} \sum_{\bm{x}^{(j)} \in C_{\rm in}} \| \bm{x}^{(i)} - \bm{x}^{(j)} \|

\end{align*}

(2) Compute the separation from the nearest other cluster, $b^{(i)}$: the average distance from $\bm{x}^{(i)}$ to the data points belonging to the nearest neighboring cluster $C_{\rm near}$.

\begin{align*}

b^{(i)} = \f{1}{|C_{\rm near}|} \sum_{\bm{x}^{(j)} \in C_{\rm near}} \| \bm{x}^{(i)} - \bm{x}^{(j)} \|

\end{align*}

(3) Divide $b^{(i)} - a^{(i)}$ by the larger of $a^{(i)}$ and $b^{(i)}$ to obtain the silhouette coefficient $s^{(i)}$.

\begin{align*}

s^{(i)} = \f{b^{(i)} - a^{(i)}}{\max(a^{(i)}, b^{(i)})}

\end{align*}

The silhouette coefficient, by its definition, falls in the $[-1,1]$ interval. When the silhouette coefficient is calculated and averaged over all data, the closer to 1, the better the clustering performance.
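Steps (1)-(3) can be written out directly as a naive $O(N^2)$ sketch (an illustration of the definition, not scikit-learn's silhouette_samples; it assumes every cluster has at least two points):

```python
import numpy as np

def silhouette_coefficients(X, labels):
    """Naive per-point silhouette coefficients (assumes each cluster has >= 2 points)."""
    X, labels = np.asarray(X, float), np.asarray(labels)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise distances
    s = np.zeros(len(X))
    for i in range(len(X)):
        own = labels == labels[i]
        a = D[i, own].sum() / (own.sum() - 1)          # (1) cohesion within own cluster
        b = min(D[i, labels == c].mean()               # (2) separation: mean distance
                for c in set(labels) - {labels[i]})    #     to the nearest other cluster
        s[i] = (b - a) / max(a, b)                     # (3) silhouette coefficient
    return s
```

For two tight, well-separated clusters every coefficient comes out close to 1, matching the interpretation above.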

The silhouette analysis visualization follows these rules:

- Sort by cluster number
- Sort by silhouette coefficient value within the same cluster

The silhouette analysis is visualized by plotting the silhouette coefficient on the horizontal axis and the cluster number on the vertical axis.

```
import matplotlib.cm as cm
from sklearn.metrics import silhouette_samples
def show_silhouette(fitted_model):
    cluster_labels = np.unique(fitted_model.labels_)
    num_clusters = cluster_labels.shape[0]
    silhouette_vals = silhouette_samples(X, fitted_model.labels_)
    y_ax_lower, y_ax_upper = 0, 0
    y_ticks = []
    for idx, cls in enumerate(cluster_labels):
        cls_silhouette_vals = silhouette_vals[fitted_model.labels_ == cls]
        cls_silhouette_vals.sort()
        y_ax_upper += len(cls_silhouette_vals)
        cmap = cm.get_cmap("Spectral")
        rgba = list(cmap(idx / num_clusters))
        rgba[-1] = 0.7
        plt.barh(
            y=range(y_ax_lower, y_ax_upper),
            width=cls_silhouette_vals,
            height=1.0,
            edgecolor='none',
            color=rgba)
        y_ticks.append((y_ax_lower + y_ax_upper) / 2.0)
        y_ax_lower += len(cls_silhouette_vals)
    silhouette_avg = np.mean(silhouette_vals)
    plt.axvline(silhouette_avg, color='orangered', linestyle='--')
    plt.xlabel('silhouette coefficient')
    plt.ylabel('cluster')
    plt.yticks(y_ticks, cluster_labels + 1)
    plt.show()

for i in range(2, 5):
    model = KMeans(n_clusters=i, random_state=0, init='random')
    model.fit(X)
    show_silhouette(model)
```

The silhouette diagrams are plotted for the number of clusters specified as 2, 3, and 4, respectively. The red dashed line represents the average silhouette coefficient.

If the clusters are properly separated, the “thickness” of the silhouettes in each cluster tends to be close to even.

In the figure above, the silhouette “thickness” is even when the number of clusters is 3, and the average value of the silhouette coefficient is the highest. From this we can conclude that the optimal number of clusters is 3.

reference: scikit-learn

As described above, implementing the k-means method is easy in scikit-learn.

The number of clusters to be classified must be given in advance, and the elbow method and silhouette analysis were introduced as guidelines for determining the number of clusters.

There are also the x-means and g-means methods, which allow clustering without providing the number of clusters to be classified in advance.

The post Implementation Of K-means Method, Elbow Method, Silhouette Analysis first appeared on Yukkuri Machine Learning.

]]>The post How to install and use labelImg first appeared on Yukkuri Machine Learning.

]]>In order to perform object detection using deep learning such as YOLO, a training image dataset is required. In other words, it is necessary to prepare information on “what” is in “what part” of the image.

The tool labelImg makes it easy to create such a training image dataset.

This article describes how to install and use labelImg.

▼The dataset was actually created using labelImg, and YOLO training was performed here.

For Windows and Linux, download the latest packaged release from the project’s download page and unzip it. Then run the executable in the unzipped folder:

`./labelimg`

On macOS (which is what the brew commands below assume), clone the labelImg repository and install the necessary libraries.

```
git clone https://github.com/tzutalin/labelImg.git
cd labelImg
brew install qt
brew install libxml2
pip install pyqt5 lxml
make qt5py3
```

Then launch labelImg with:

`python labelImg.py`

This time, we will use this image to create training data.

The required labels are

- lion
- tiger

1. First, we need to specify the list of labels to be used for training.

Rewrite “data/predefined_classes.txt” in the folder where labelImg is installed with the required set of labels.

2. Run labelImg.

3. Click on “Open Directory” and specify the directory where the target images are located. The images stored in that directory will then be loaded.

4. Since we will be creating a dedicated dataset to study with YOLO, click on “PascalVOC” under the “Save” button in the sidebar and change it to “YOLO”. Then, select a rectangle and label for the image from “Create Rectangle” in the lower left corner of the sidebar.

5. Finally, save the data from the “Save” button to complete the training data.

When the save is complete, a new “classes.txt” and “lion_tiger.txt” (txt file with the same name as the image) will be created in the directory where the target image is located.

What is important is the latter file, which contains information on “what” is in “what part” of the image.
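Each line of that YOLO-format txt file has the form `class x_center y_center width height`, with coordinates normalized to [0, 1] by the image size. A small sketch (the function name and pixel conversion are my own illustration) of converting one line back to pixel coordinates:

```python
def yolo_line_to_pixels(line, img_w, img_h):
    """Parse one YOLO label line into (class_id, x_min, y_min, width, height) in pixels."""
    cls, xc, yc, w, h = line.split()
    xc, yc, w, h = (float(v) for v in (xc, yc, w, h))
    box_w, box_h = w * img_w, h * img_h          # box size in pixels
    x_min = xc * img_w - box_w / 2               # top-left corner from the center
    y_min = yc * img_h - box_h / 2
    return int(cls), x_min, y_min, box_w, box_h

print(yolo_line_to_pixels("0 0.5 0.5 0.25 0.5", 640, 480))
# (0, 240.0, 120.0, 160.0, 240.0)
```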

Also, the shortcut keys for labelImg are as follows (Ctrl → Command⌘ for mac).

| Shortcut | Function |
| --- | --- |
| Ctrl + u | Load all images from directory |
| Ctrl + r | Change default annotation target directory |
| Ctrl + s | Save |
| Ctrl + d | Copy current label and rect box |
| Ctrl + Shift + d | Delete current image |
| space | Flag the current image as verified |
| w | Create a rectangular box |
| d | Next image |
| a | Previous image |
| del | Delete selected rect boxes |
| Ctrl + + | Zoom in |
| Ctrl + - | Zoom out |
| ↑→↓← | Move selected rect box with arrow keys |

Now we can use deep learning such as YOLO to learn object detection.

▼ Click here to learn more about YOLO


The post How to install and use labelImg first appeared on Yukkuri Machine Learning.

]]>The post 【Google Colab】How to do object detection and learning with YOLO v5 first appeared on Yukkuri Machine Learning.

]]>In this article, we will use Python to perform object detection.

Object detection is a technique for detecting what is in an image and where it is in the image.

This time, we will run “YOLO v5” on Google Colab, which makes it easy to try object detection.

In addition, you can try the code in this article at ▼

Google Colaboratory

First, clone the YOLO v5 repository by executing the following command in Google Colab.

`!git clone https://github.com/ultralytics/yolov5`

After cloning, install the libraries required for operation. To do this, use the requirements.txt file in the cloned repository.

```
%cd /content/yolov5/
!pip install -qr requirements.txt
```

Installation is now complete.

Now, let’s try object detection with YOLO v5. Five models are available for YOLO v5, but we will use the small one, YOLOv5s.

We used this image for object detection.

In this case, we will use Google Colab, which allows us to use GPUs for free. Go to “Runtime -> Change Runtime Type” and set “Hardware Accelerator” to “GPU”.


To perform object detection, go to the yolov5 directory (the cloned repository) and execute the following command:

`!python detect.py --source {PATH of images used for inference} --weights yolov5s.pt --conf 0.3 --name demo --exist-ok`

After executing the above command, the directory /runs/detect/demo/ will be created, where the images of the object detection will be output.

Here is the image after inference. You can see that it detects cars, signals, etc.

In order to successfully infer the object you want to detect from within an image, you need to prepare a dataset specifically for that purpose and “train the model”.

This time, we tried to train the model to detect penguins.

First, let’s try to infer the penguin image using Model: YOLOv5s without training the model. The results are as follows

The penguins are thus detected as bird, not penguin (well, penguins are birds, so it’s not exactly a mistake ……?). One is also detected as banana, although it is hard to see because of the overlap.

So, let’s train the model to detect it as PENGUIN.

In order to train the model, we need to create teacher data on which of the images are penguins.

We collected about 30 free images of penguins (train: 20 images, val: 6 images).

We then used labelImg to create the training data. The labelImg library makes it easy to create the txt files of rectangle coordinates marking the penguins in each image.

▼Click here to learn how to install and use labelImg.

This time, we will create a “penguins” directory and store the images and labeling data there. Under the “penguins” directory, we will also place a file named “penguin.yaml”, which will be used when training YOLOv5.

```
penguins
| - penguin.yaml
| - train
| | - img_・・・.jpg
| | - img_・・・.txt
| | - img_・・・.jpg
| | - img_・・・.txt
| | - img_・・・.jpg
| ・・・
| - val
| | - img_・・・.jpg
| | - img_・・・.txt
| ・・・
| - test
| - img_・・・.jpg
| - img_・・・.jpg
・・・
```

The penguin.yaml file should contain the paths to the train, val, and test folders, the number of classes to classify, and information about the classes, as shown below.

```
# train and val data as 1) directory: path/images/, 2) file: path/images.txt, or 3) list: [path1/images/, path2/images/]
train: /content/blog_yolo_v5/penguins/train/
val: /content/blog_yolo_v5/penguins/val/
test: /content/blog_yolo_v5/penguins/test/
# number of classes
nc: 1
# class names
names: ['penguin']
```

To train, simply execute the following command.

`!python train.py --img 640 --batch 16 --epochs 200 --data {path of penguin.yaml} --weights yolov5s.pt`

When training is finished, the trained parameters are saved in the directory /runs/train/exp/weights/. best.pt holds the weights with the highest accuracy during training, and last.pt those from the last epoch.

TensorBoard can also be used to graph the learning process.

```
# tensorboard
%load_ext tensorboard
%tensorboard --logdir runs
```

Now, let’s try object detection for penguins using the trained model.

`!python detect.py --source {path of test directory} --weights {path of best.pt} --conf 0.25 --name trained_exp --exist-ok --save-conf`

The result is as follows.

The model now recognizes them as PENGUIN.

In this case, the smaller YOLOv5s model was used. Using a larger model or increasing the amount of training data may produce an even more accurate model.


The post 【Google Colab】How to do object detection and learning with YOLO v5 first appeared on Yukkuri Machine Learning.

]]>The post How To Use TA-Lib With Google Colab first appeared on Yukkuri Machine Learning.

]]>

TA-Lib makes it easy to create indicators for technical analysis on stock prices, FX, and virtual currencies.

This article describes how to use TA-Lib in Google Colab. We will use yahoo_finance_api2 for stock price data and mplfinance for plotting.

In addition, you can try the code in this article here.

Google Colaboratory

TA-Lib is not installed via pip alone; the required package is first downloaded from https://sourceforge.net/projects/ta-lib/ and built from source. Specifically, running the following commands in Google Colab completes the TA-Lib installation.

```
!curl -L http://prdownloads.sourceforge.net/ta-lib/ta-lib-0.4.0-src.tar.gz -O && tar xzvf ta-lib-0.4.0-src.tar.gz
!cd ta-lib && ./configure --prefix=/usr && make && make install && cd - && pip install ta-lib
```

The yahoo_finance_api2 library lets you retrieve stock quotes from Yahoo Finance. It is installed via pip.

`!pip install yahoo_finance_api2`

Finally, install mplfinance, a library that makes it easy to draw candlestick charts.

`!pip install mplfinance`

This completes the installation of the required libraries.

This time, we will be doing the following.

- Get Google stock quotes using yahoo_finance_api2
- Create appropriate indicators using TA-Lib
- Plot data using mplfinance

First, import the necessary libraries.

```
import talib
from yahoo_finance_api2 import share
import pandas as pd
import mplfinance as mpf
```

Now, let’s first try to obtain stock prices using yahoo_finance_api2.

```
# Get Google's stock price
my_share = share.Share('GOOG')
ohlcv = my_share.get_historical(
    share.PERIOD_TYPE_YEAR, 1,
    share.FREQUENCY_TYPE_DAY, 1)
df = pd.DataFrame(ohlcv)
df['timestamp'] = pd.to_datetime(df['timestamp'].astype(int), unit='ms')
df.set_index("timestamp", inplace=True)
df
```

Once the stock price data is obtained, create indicators with TA-Lib. In this case, we compute the SMA (simple moving average) and EMA (exponential moving average).

A list of other indicators that can be created can be found here.

```
df['SMA'] = talib.SMA(df['close'], timeperiod=5)
df['EMA'] = talib.EMA(df['close'], timeperiod=5)
df.dropna(inplace=True)
df
```
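For intuition, here is what the two indicators compute, sketched in plain Python. Note that TA-Lib seeds its EMA with an SMA of the first `timeperiod` values, so the earliest EMA values differ from this simple recursion:

```python
def sma(values, period):
    """Simple moving average: plain mean over a sliding window."""
    return [sum(values[i - period + 1:i + 1]) / period
            for i in range(period - 1, len(values))]

def ema(values, period):
    """Exponential moving average, seeded here with the first value."""
    k = 2 / (period + 1)  # standard smoothing factor
    out = [values[0]]
    for v in values[1:]:
        out.append(k * v + (1 - k) * out[-1])
    return out

print(sma([1, 2, 3, 4, 5], 3))  # [2.0, 3.0, 4.0]
```

The EMA weights recent values more heavily, which is why it reacts to price moves faster than the SMA with the same period.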

Finally, the data is plotted using mplfinance.

```
term = -50
indicators = [
    mpf.make_addplot(df['SMA'][term:], color='skyblue', width=2.5),
    mpf.make_addplot(df['EMA'][term:], color='pink', width=2.5),
]
mpf.plot(df[term:], figratio=(12,4), type='candle', style="yahoo", volume=True, addplot=indicators)
```


The post How To Use TA-Lib With Google Colab first appeared on Yukkuri Machine Learning.

]]>