Clustering comparison

On this page the centroid-based clustering method k-Means will be compared to the density-based clustering method DBSCAN.

k-Means was introduced by James MacQueen in 1967. It aims to partition all observations in to k clusters so that the within-cluster sum of squares is minimized while the between-cluster sum of squares is maximized.

DBSCAN or Density-based spatial clustering of applications with noise was introduced in 1996 by Martin Ester, Hans-Peter Kriegel, Jörg Sander and Xiaowei Xu. It forms clusters of closely neighboring points and is able to detect outliers as noise.

Algorithms

k-Means

The k-Means Algorithm is based on the method of least squares and optimizes \[ F = \sum_{i=1}^{k}\sum_{x_{j} \in S_{i}}\left \| x_{j} - \mu_{i} \right \|^{2} \] where \(x_i\) are data points, \(\mu_i\) are Centroids or Means and \(S_i\) are corresponding clusters.
The Algorithm has three steps.

1. Initialisation
Choose \(k\) Centroids (also called Means)
\( m_1^{(0)},\cdots,m_k^{(0)} \)

2. Assign Points
Each data Point is assigned to the "nearest" cluster (cluster variance is increased the least).
\( S_i^{\left( t \right)} = \left\{ x_j : \left\| x_j - m_{i}^{\left( t \right)} \right\|^2 \leq \left\| x_j - m_{i^*}^{\left( t \right)} \right\|^2 \text{ for all } i^*=1,\cdots,k \right\} \)

3. Update Centroids
Recalculate the new cluster means.
\( m_i^{\left( t+1 \right)} = \frac{1}{\left| S_{i}^{\left( t \right)} \right|} \sum_{x_j \in S_{i}^{\left( t \right)}} x_j \)

Repeat steps 2 and 3 until the centroids don't change anymore.

DBSCAN

In DBSCAN there are three different kinds of points:

core points which are dense themselfs
density-reachable points that are reachable from core points but are not dense themselfs
outliers or noise points

The DBSCAN method has two parameters ε and minPts, where ε is the maximum neigborhood radius and minPts is the minimum number of neighbors to be a core point.

From one point another point is reachable if their distance is smaller than ε. A Point is dense (=core point) if it has at least minPts in its ε-reachable neighborhood.

Density-reachable points are reachable from core points but are not themselfs dense.

All reachable points around at least one core point form a cluster. Other points are outliers.

blue: core points, green: density-reachable points, orange: outliers

Algorithm

Start with arbitrary, not yet visited point
Get the point's ε-neighborhood
If the point is dense, start a cluster
If not, the point is labeled as noise (it can get part of a cluster later)
Get the ε-neighborhood of all unvisited points in the cluster and add them to the cluster
Again add the ε-neighborhood of the neighbors to the cluster if the neighbor is dense itself
Continue until the density-connected cluster is completely found
Start with a new unvisited point
Continue until all points are either part of a cluster or labeled as noise

Advantages of DBSCAN over k-Means

DBSCAN does not require a priori knowled about the number of classes
DBSCAN can separate arbitrarily shaped and non-linearly separable clusters
DBSCAN is robust against noise

Disadvantages of DBSCAN

One has to understand the data to choose meaningful parameters
DBSCAN has difficulties at clustering datasets with large differences in densities
DBSCAN is not entirely deterministic (Assignment of points reachable from more than one cluster is dependant on the processing order)

Hands-On

There are three tabs Data, k-Means and DBSCAN.
In the Data tab you can toggle to draw own data points by clicking in the drawing area below, you can load random and predefined datasets and clear the drawing area.
In the k-Means tab you can change the configuration of k-Means by adding own centroids, adding a number of random centroids and clearing all centroids. In addition you can change the distance measure. Then you can run the k-Means algorithm either step by step or in a loop.
In the DBSCAN tab you can change the parameters ε and minPts and choose the distance measure. Then you can perform the classification and clear the clustering again.
Each cluster is displayed in its own color. Outliers are displayed white with gray outline.

Draw Data Points

Configuration

Draw Centroids

Distance Measure

Algorithm

Configuration

Distance Measure

Clustering comparison

Algorithms

k-Means

DBSCAN

Hands-On

Configuration

Algorithm

Configuration

Algorithm

References