Introduction
One of the best-known clustering methods is the k-means method, which assumes that the data can be classified into $K$ clusters and assigns each data point to one of the clusters according to a certain procedure.
This article describes how the k-means method works and how it is implemented.
The k-means method requires that the number of clusters be given in advance, so the elbow method and silhouette analysis are also introduced as ways to determine the optimal number of clusters.
You can try the source code for this article on Google Colab below.
\begin{align*}
\newcommand{\mat}[1]{\begin{pmatrix} #1 \end{pmatrix}}
\newcommand{\f}[2]{\frac{#1}{#2}}
\newcommand{\pd}[2]{\frac{\partial #1}{\partial #2}}
\newcommand{\d}[2]{\frac{{\rm d}#1}{{\rm d}#2}}
\newcommand{\T}{\mathsf{T}}
\newcommand{\(}{\left(}
\newcommand{\)}{\right)}
\newcommand{\{}{\left\{}
\newcommand{\}}{\right\}}
\newcommand{\[}{\left[}
\newcommand{\]}{\right]}
\newcommand{\dis}{\displaystyle}
\newcommand{\eq}[1]{{\rm Eq}(\ref{#1})}
\newcommand{\n}{\notag\\}
\newcommand{\t}{\ \ \ \ }
\newcommand{\tt}{\t\t\t\t}
\newcommand{\argmax}{\mathop{\rm arg\, max}\limits}
\newcommand{\argmin}{\mathop{\rm arg\, min}\limits}
\def\l<#1>{\left\langle #1 \right\rangle}
\def\us#1_#2{\underset{#2}{#1}}
\def\os#1^#2{\overset{#2}{#1}}
\newcommand{\case}[1]{\{ \begin{array}{ll} #1 \end{array} \right.}
\newcommand{\s}[1]{{\scriptstyle #1}}
\definecolor{myblack}{rgb}{0.27,0.27,0.27}
\definecolor{myred}{rgb}{0.78,0.24,0.18}
\definecolor{myblue}{rgb}{0.0,0.443,0.737}
\definecolor{myyellow}{rgb}{1.0,0.82,0.165}
\definecolor{mygreen}{rgb}{0.24,0.47,0.44}
\newcommand{\c}[2]{\textcolor{#1}{#2}}
\newcommand{\ub}[2]{\underbrace{#1}_{#2}}
\end{align*}
Concept of k-means method
Assuming that the number of clusters is 2, the concept of the k-means method is explained using the figure below, which is taken from PRML (Pattern Recognition and Machine Learning, 2006).
(a): Consider a center of gravity for each cluster, $\bm{\mu}_1, \bm{\mu}_2$. At this stage there is no way to know which cluster each data point belongs to, so initial values for the cluster centers of gravity are chosen arbitrarily. In the figure, the red and blue crosses indicate the cluster centers of gravity.
(b): Assign each data point to the cluster whose center of gravity is closer.
(c): Recompute the center of gravity of each cluster from the data points assigned to it, and update the centers of gravity.
(d)–(i): Repeat steps (b) and (c) to keep updating the clusters. Iteration continues until the cluster assignments no longer change or until the maximum number of iterations set by the analyst is reached.
Let us formulate the above.
The $i$th of the $N$ data points is denoted $\bm{x}^{(i)}$. Assume the data are divided into $K$ clusters, and denote the center of gravity of the $j$th cluster by $\bm{\mu}_j$.
Then the k-means method becomes the following loss function minimization problem.
\begin{align*}
L = \sum_{i=1}^N \sum_{j=1}^K r_{ij} \| \bm{x}^{(i)} - \bm{\mu}_j \|^2
\end{align*}
where $r_{ij}$ takes the value $1$ if the data point $\bm{x}^{(i)}$ belongs to the $j$-th cluster and $0$ otherwise, and can be written as follows:
\begin{align*}
r_{ij} = \case{1\t {\rm if}\ j=\argmin_k \| \bm{x}^{(i)} - \bm{\mu}_k \|^2 \n 0 \t {\rm otherwise.}}
\end{align*}
On the other hand, the update of the center of gravity $\bm{\mu}_j$ is
\begin{align*}
\bm{\mu}_j = \f{\sum_i r_{ij} \bm{x}^{(i)}}{\sum_i r_{ij}}
\end{align*}
It can be seen from the form of the equation that this computes the average of the data vectors belonging to the cluster.
The above equation can also be obtained by setting the partial derivative of the loss function with respect to $\bm{\mu}_j$ to zero and solving.
\begin{align*}
\pd{L}{\bm{\mu}_j} = -2 \sum_{i=1}^N r_{ij} (\bm{x}^{(i)} - \bm{\mu}_j) = 0
\end{align*}
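Rearranging this condition and solving for $\bm{\mu}_j$ recovers the centroid update above:
\begin{align*}
\sum_{i=1}^N r_{ij} \bm{x}^{(i)} = \bm{\mu}_j \sum_{i=1}^N r_{ij}
\t \Rightarrow \t
\bm{\mu}_j = \f{\sum_i r_{ij} \bm{x}^{(i)}}{\sum_i r_{ij}}
\end{align*}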
Thus, the procedure for the k-means method can be formulated as follows.
(a): Give random initial values for the cluster centers of gravity $\bm{\mu}_j\ (j=1, \dots, K)$.
(b):
\begin{align*}
r_{ij} = \case{1\t {\rm if}\ j=\argmin_k \| \bm{x}^{(i)} - \bm{\mu}_k \|^2 \n 0 \t {\rm otherwise.}}
\end{align*}
Calculate the above formula and assign each data point to a cluster.
(c):
\begin{align*}
\bm{\mu}_j = \f{\sum_i r_{ij} \bm{x}^{(i)}}{\sum_i r_{ij}}
\end{align*}
Calculate the above formula and update the cluster center of gravity.
(d): Repeat steps (b) and (c) to keep updating the clusters. Iteration continues until the cluster assignments no longer change or until the maximum number of iterations set by the analyst is reached.
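As a supplement, the procedure (a)–(d) can be written directly as a short NumPy sketch. This is only a minimal illustration, not scikit-learn's implementation: the function name kmeans_fit is made up for this article, the centers of gravity are initialized with K randomly chosen data points, and empty clusters are not handled.
import numpy as np

def kmeans_fit(X, K, n_iter=100, seed=0):
    """Minimal k-means sketch: returns the centers of gravity and cluster assignments."""
    rng = np.random.default_rng(seed)
    # (a) Initialize the centers of gravity with K randomly chosen data points
    mu = X[rng.choice(len(X), size=K, replace=False)]
    labels = None
    for _ in range(n_iter):
        # (b) Assign each data point to the cluster with the nearest center of gravity
        dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)  # shape (N, K)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # (d) stop once the assignments no longer change
        labels = new_labels
        # (c) Update each center of gravity as the mean of its assigned data points
        mu = np.array([X[labels == j].mean(axis=0) for j in range(K)])
    return mu, labels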
Implementation of the k-means method
Let’s try cluster analysis using the k-means method with scikit-learn. The data used here is generated with scikit-learn’s make_blobs, which creates a toy dataset suitable for clustering.
The first step is to create the dataset.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import make_blobs
X, y = make_blobs(
    n_samples=150,
    centers=3,
    cluster_std=1.0,
    shuffle=True,
    random_state=42)
x1 = X[:, 0]
x2 = X[:, 1]
plt.scatter(x1, x2)
plt.xlabel('x1')
plt.ylabel('x2')
plt.show()
scikit-learn provides the sklearn.cluster.KMeans class to perform cluster analysis using the k-means method. The implementation is extremely simple and is shown below. Note that the number of clusters to be divided must be decided in advance.
from sklearn.cluster import KMeans
model = KMeans(n_clusters=3, random_state=0, init='random')
model.fit(X)
clusters = model.predict(X)
print(clusters)
df_cluster_centers = pd.DataFrame(model.cluster_centers_)
df_cluster_centers.columns = ['x1', 'x2']
print(df_cluster_centers)
# Output
[2 2 0 1 0 1 2 2 0 1 0 0 1 0 1 2 0 2 1 1 2 0 1 0 1 0 0 2 0 1 1 2 0 0 1 1 0
0 1 2 2 0 2 2 1 2 2 1 2 0 2 1 1 2 2 0 2 1 0 2 0 0 0 1 1 1 1 0 1 1 0 2 1 2
2 2 1 1 1 2 2 0 2 0 1 2 2 1 0 2 0 2 2 0 0 1 0 0 2 1 1 1 2 2 1 2 0 1 2 0 0
2 1 0 0 1 0 1 0 2 0 0 1 0 2 0 0 1 1 0 2 2 1 1 2 1 0 1 1 0 2 2 0 1 2 2 0 2
2 1]
x1 x2
0 -6.753996 -6.889449
1 4.584077 2.143144
2 -2.701466 8.902879
Since the raw output alone is hard to interpret, the clustering results are visualized in the following figure.
df = pd.DataFrame(X, columns=['x1', 'x2'])
df['class'] = clusters
sns.scatterplot(data=df, x='x1', y='x2', hue='class')
sns.scatterplot(data=df_cluster_centers, x='x1', y='x2', s=200, marker='*', color='gold', linewidth=0.5)
plt.show()
Thus, with the sklearn.cluster.KMeans class, you can easily perform cluster analysis using the k-means method.
Search for optimal number of clusters
One problem with the k-means method is that the number of clusters must be specified. This section introduces the elbow method and silhouette analysis used to determine the optimal number of clusters. Note that the data used below is the data with 3 clusters created earlier with make_blobs.
Elbow method
In the elbow method, the loss function of the k-means method is computed while varying the number of clusters, and the results are plotted to determine the optimal number of clusters.
\begin{align*}
L = \sum_{i=1}^N \sum_{j=1}^K r_{ij} \| \bm{x}^{(i)} - \bm{\mu}_j \|^2
\end{align*}
The implementation of the elbow method is shown below. The value of the loss function can be accessed via model.inertia_.
sum_of_squared_errors = []
for i in range(1, 11):
    model = KMeans(n_clusters=i, random_state=0, init='random')
    model.fit(X)
    sum_of_squared_errors.append(model.inertia_)
plt.plot(range(1, 11), sum_of_squared_errors, marker='o')
plt.xlabel('number of clusters')
plt.ylabel('sum of squared errors')
plt.show()
The illustrated results show that the value of the loss function decreases until the number of clusters (the value on the horizontal axis) is 3, after which it remains almost unchanged.
The Elbow method determines the optimal number of clusters as the number of clusters for which the degree of decrease in the loss function changes rapidly. Therefore, in this case, the optimal number of clusters can be determined to be 3.
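To make this "rapid change" concrete, it can also be checked numerically. The short sketch below assumes the sum_of_squared_errors list computed above and prints how much the loss decreases with each additional cluster; the decrease should shrink sharply once the number of clusters exceeds 3.
# How much the loss decreases when adding one more cluster
drops = -np.diff(sum_of_squared_errors)
for k, drop in zip(range(1, 10), drops):
    print(f'{k} -> {k + 1} clusters: loss decreases by {drop:.1f}')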
Silhouette Analysis
Silhouette analysis evaluates clustering performance based on the following indicators
- The denser the data points in a cluster, the better.
- The further apart each cluster is, the better.
Specifically, clustering performance is evaluated by the silhouette coefficient, which is defined by the following procedure
(1) Calculate the average distance from a data point $\bm{x}^{(i)}$ to the other data points in the cluster $C_{\rm in}$ to which it belongs, as the within-cluster cohesion $a^{(i)}$.
\begin{align*}
a^{(i)} = \f{1}{|C_{\rm in}| - 1} \sum_{\bm{x}^{(j)} \in C_{\rm in},\ j \neq i} \| \bm{x}^{(i)} - \bm{x}^{(j)} \|
\end{align*}
(2) Calculate the average distance from the data point $\bm{x}^{(i)}$ to the data points belonging to the nearest other cluster $C_{\rm near}$, as the separation from that cluster, $b^{(i)}$.
\begin{align*}
b^{(i)} = \f{1}{|C_{\rm near}|} \sum_{\bm{x}^{(j)} \in C_{\rm near}} \| \bm{x}^{(i)} - \bm{x}^{(j)} \|
\end{align*}
(3) Divide $b^{(i)} - a^{(i)}$ by the larger of $a^{(i)}$ and $b^{(i)}$ to obtain the silhouette coefficient $s^{(i)}$.
\begin{align*}
s^{(i)} = \f{b^{(i)} - a^{(i)}}{\max(a^{(i)}, b^{(i)})}
\end{align*}
The silhouette coefficient, by its definition, falls in the $[-1,1]$ interval. When the silhouette coefficient is calculated and averaged over all data, the closer to 1, the better the clustering performance.
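scikit-learn provides sklearn.metrics.silhouette_score, which computes this average directly. As a quick sketch reusing the X and KMeans settings from above, the average silhouette coefficient can be compared for several candidate numbers of clusters:
from sklearn.metrics import silhouette_score

for k in range(2, 5):
    model = KMeans(n_clusters=k, random_state=0, init='random')
    labels = model.fit_predict(X)
    print(f'n_clusters={k}: average silhouette coefficient = {silhouette_score(X, labels):.3f}')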
The silhouette analysis visualization follows these rules
- Sort by cluster number
- Sort by silhouette coefficient value within the same cluster
The silhouette analysis is visualized by plotting the silhouette coefficient on the horizontal axis and the cluster number on the vertical axis.
import matplotlib.cm as cm
from sklearn.metrics import silhouette_samples

def show_silhouette(fitted_model):
    cluster_labels = np.unique(fitted_model.labels_)
    num_clusters = cluster_labels.shape[0]
    # Silhouette coefficient of every data point
    silhouette_vals = silhouette_samples(X, fitted_model.labels_)
    y_ax_lower, y_ax_upper = 0, 0
    y_ticks = []
    for idx, cls in enumerate(cluster_labels):
        # Sort the coefficients within each cluster before plotting
        cls_silhouette_vals = silhouette_vals[fitted_model.labels_ == cls]
        cls_silhouette_vals.sort()
        y_ax_upper += len(cls_silhouette_vals)
        cmap = cm.get_cmap("Spectral")
        rgba = list(cmap(idx / num_clusters))
        rgba[-1] = 0.7
        plt.barh(
            y=range(y_ax_lower, y_ax_upper),
            width=cls_silhouette_vals,
            height=1.0,
            edgecolor='none',
            color=rgba)
        y_ticks.append((y_ax_lower + y_ax_upper) / 2.0)
        y_ax_lower += len(cls_silhouette_vals)
    # Dashed line at the average silhouette coefficient over all data points
    silhouette_avg = np.mean(silhouette_vals)
    plt.axvline(silhouette_avg, color='orangered', linestyle='--')
    plt.xlabel('silhouette coefficient')
    plt.ylabel('cluster')
    plt.yticks(y_ticks, cluster_labels + 1)
    plt.show()

for i in range(2, 5):
    model = KMeans(n_clusters=i, random_state=0, init='random')
    model.fit(X)
    show_silhouette(model)
The silhouette diagrams are plotted for the number of clusters specified as 2, 3, and 4, respectively. The red dashed line represents the average silhouette coefficient.
If the clusters are properly separated, the “thickness” of the silhouettes in each cluster tends to be close to even.
In the figure above, the silhouette “thickness” is even when the number of clusters is 3, and the average value of the silhouette coefficient is the highest. From this we can conclude that the optimal number of clusters is 3.
Reference: scikit-learn
As described above, implementing the k-means method is easy in scikit-learn.
The number of clusters to be classified must be given in advance, and the elbow method and silhouette analysis were introduced as guidelines for determining the number of clusters.
There are also the x-means and g-means methods, which allow clustering without providing the number of clusters to be classified in advance.