# Clustering comparison

On this page the centroid-based clustering method *k-Means* will be compared to the density-based clustering method *DBSCAN*.

k-Means was introduced by James MacQueen in 1967. It aims to partition all observations into k clusters such that the within-cluster sum of squares is minimized (equivalently, the between-cluster sum of squares is maximized).

DBSCAN or **D**ensity-**b**ased **s**patial **c**lustering of **a**pplications with **n**oise was introduced in 1996 by Martin Ester, Hans-Peter Kriegel, Jörg Sander and Xiaowei Xu. It forms clusters of closely neighboring points and is able to detect outliers as noise.

## Algorithms

### k-Means

The k-Means algorithm is based on the method of least squares and minimizes \[ F = \sum_{i=1}^{k}\sum_{x_{j} \in S_{i}}\left \| x_{j} - \mu_{i} \right \|^{2} \] where \(x_j\) are the data points, \(\mu_i\) the centroids (or means) and \(S_i\) the corresponding clusters.

The algorithm has three steps.

**1. Initialisation**

Choose \(k\) centroids (also called means)

\( m_1^{(0)},\cdots,m_k^{(0)} \)

**2. Assign Points**

Each data point is assigned to the "nearest" cluster, i.e. the one whose within-cluster variance increases the least.

\( S_i^{\left( t \right)} = \left\{ x_j : \left\| x_j - m_{i}^{\left( t \right)} \right\|^2 \leq \left\| x_j - m_{i^*}^{\left( t \right)} \right\|^2 \text{ for all } i^*=1,\cdots,k \right\} \)

**3. Update Centroids**

Recalculate the cluster means from the assigned points.

\( m_i^{\left( t+1 \right)} = \frac{1}{\left| S_{i}^{\left( t \right)} \right|} \sum_{x_j \in S_{i}^{\left( t \right)}} x_j \)

Repeat steps 2 and 3 until the centroids no longer change.
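The three steps above can be sketched in a few lines of Python. This is a minimal, brute-force sketch; the function name `kmeans` and the random initialisation are illustrative assumptions, not the implementation used on this page:

```python
# Minimal k-means sketch (Lloyd's algorithm), pure Python.
# Assumes points are tuples of coordinates; names are illustrative.
import math
import random

def kmeans(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    # 1. Initialisation: pick k distinct data points as initial centroids
    centroids = rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        # 2. Assign each point to the nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[i].append(p)
        # 3. Update each centroid as the mean of its cluster
        new_centroids = [
            tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
        # Repeat steps 2 and 3 until the centroids stop changing
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters
```

Note that the result depends on the random initialisation; in practice the algorithm is often restarted several times and the run with the smallest within-cluster sum of squares is kept.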

### DBSCAN

In DBSCAN there are three different kinds of points:

- **core points**, which are dense themselves
- **density-reachable points**, which are reachable from core points but are not dense themselves
- **outliers**, or noise points

The DBSCAN method has two parameters *ε* and *minPts*, where *ε* is the maximum neighborhood radius and *minPts* is the minimum number of neighbors required for a core point.

From one point, another point is reachable if their distance is smaller than *ε*. A point is dense (i.e. a core point) if it has at least *minPts* points in its *ε*-neighborhood.

Density-reachable points are reachable from core points but are not themselves dense.

All points reachable from at least one core point form a cluster; all other points are outliers.
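These definitions translate directly into code. The following Python sketch labels each point as a core point, a density-reachable (border) point, or noise; the names `neighbors` and `classify` are illustrative, and `eps`/`min_pts` correspond to *ε* and *minPts*:

```python
# Sketch: classify points by density (names are illustrative).
import math

def neighbors(points, p, eps):
    # epsilon-neighborhood of p (includes p itself)
    return [q for q in points if math.dist(p, q) <= eps]

def classify(points, eps, min_pts):
    # A point is a core point if its neighborhood holds >= min_pts points
    core = {p for p in points if len(neighbors(points, p, eps)) >= min_pts}
    labels = {}
    for p in points:
        if p in core:
            labels[p] = "core"
        elif any(q in core for q in neighbors(points, p, eps)):
            labels[p] = "density-reachable"  # border point next to a core point
        else:
            labels[p] = "noise"
    return labels
```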

**Algorithm**

- Start with an arbitrary, not yet visited point
- Get the point's *ε*-neighborhood
- If the point is dense, start a cluster
- If not, label the point as noise (it can still become part of a cluster later)
- Get the *ε*-neighborhood of all unvisited points in the cluster and add them to the cluster
- Again add the *ε*-neighborhood of a neighbor to the cluster if that neighbor is dense itself
- Continue until the density-connected cluster is completely found
- Start over with a new unvisited point
- Continue until all points are either part of a cluster or labeled as noise
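The loop above can be sketched as follows. This is a brute-force, pure-Python sketch; the function name `dbscan`, the `NOISE` constant and the label encoding are illustrative assumptions:

```python
# Sketch of the DBSCAN loop (brute-force neighborhood queries).
import math

NOISE = -1

def dbscan(points, eps, min_pts):
    labels = {}                              # point -> cluster id or NOISE
    cluster_id = 0
    for p in points:
        if p in labels:                      # already visited
            continue
        seeds = [q for q in points if math.dist(p, q) <= eps]
        if len(seeds) < min_pts:
            labels[p] = NOISE                # may join a cluster later
            continue
        labels[p] = cluster_id               # dense: start a new cluster
        queue = [q for q in seeds if q != p]
        while queue:                         # expand the cluster
            q = queue.pop()
            if labels.get(q) == NOISE:
                labels[q] = cluster_id       # former noise becomes a border point
            if q in labels:
                continue
            labels[q] = cluster_id
            q_neigh = [r for r in points if math.dist(q, r) <= eps]
            if len(q_neigh) >= min_pts:      # q is itself dense: expand further
                queue.extend(q_neigh)
        cluster_id += 1
    return labels
```

Note the source of non-determinism mentioned below: a border point reachable from two clusters is assigned to whichever cluster's expansion reaches it first, which depends on the iteration order over `points`.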

**Advantages of DBSCAN over k-Means**

- DBSCAN does not require a priori knowledge of the number of clusters
- DBSCAN can separate arbitrarily shaped and non-linearly separable clusters
- DBSCAN is robust against noise

**Disadvantages of DBSCAN**

- One has to understand the data to choose meaningful parameters
- DBSCAN has difficulties clustering datasets with large differences in density
- DBSCAN is not entirely deterministic (the assignment of points reachable from more than one cluster depends on the processing order)

## Hands-On

There are three tabs *Data*, *k-Means* and *DBSCAN*.

In the *Data* tab you can toggle drawing your own data points by clicking in the drawing area below, load random or predefined datasets, and clear the drawing area.

In the *k-Means* tab you can configure k-Means by adding your own centroids, adding a number of random centroids, or clearing all centroids. In addition you can change the distance measure. Then you can run the k-Means algorithm either step by step or in a loop.

In the *DBSCAN* tab you can change the parameters *ε* and *minPts* and choose the distance measure. Then you can run the clustering and clear it again.

Each cluster is displayed in its own color. Outliers are displayed in white with a gray outline.

## References

- Jakub Młokosiewicz: *k-means-visualization on GitHub*, April 2018
- Wikipedia: *k-means clustering*, April 2018
- James MacQueen: *Some methods for classification and analysis of multivariate observations*, 1967
- Corneliu Sugar: *jDBSCAN on GitHub*, April 2018
- Wikipedia: *DBSCAN*, April 2018
- Martin Ester, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu: *A density-based algorithm for discovering clusters in large spatial databases with noise*, 1996