Density-Based Spatial Clustering of Applications with Noise#

Density-based spatial clustering of applications with noise (DBSCAN) is a data clustering algorithm proposed in [Ester96]. It is a density-based clustering non-parametric algorithm: given a set of observations in some space, it groups together observations that are closely packed together (observations with many nearby neighbors), marking as outliers observations that lie alone in low-density regions (whose nearest neighbors are too far away).

Details#

Given the set \(X = \{x_1 = (x_{11}, \ldots, x_{1p}), \ldots, x_n = (x_{n1}, \ldots, x_{np})\}\) of \(n\) \(p\)-dimensional feature vectors (further referred as observations), a positive floating-point number epsilon and a positive integer minObservations, the problem is to get clustering assignments for each input observation, based on the definitions below [Ester96]:

core observation#

An observation \(x\) is called core observation if at least minObservations input observations (including \(x\)) are within distance epsilon from observation \(x\);

directly reachable#

An observation \(y\) is directly reachable from \(x\) if \(y\) is within distance epsilon from core observation \(x\). Observations are only said to be directly reachable from core observations.

reachable#

An observation \(y\) is reachable from an observation \(x\) if there is a path \(x_1, \ldots, x_m\) with \(x_1 = x\) and \(x_m = y\), where each \(x_{i+1}\) is directly reachable from \(x_i\). This implies that all observations on the path must be core observations, with the possible exception of \(y\).

noise observation#

Noise observations are observations that are not reachable from any other observation.

cluster#

Two observations \(x\) and \(y\) are considered to be in the same cluster if there is a core observation \(z\), and \(x\) and \(y\) are both reachable from \(z\).

Each cluster gets a unique identifier, an integer number from \(0\) to \(\text{total number of clusters } – 1\). Each observation is assigned an identifier of the cluster it belongs to, or \(-1\) if the observation considered to be a noise observation.

Computation#

The following computation modes are available:

Examples#

Batch Processing:

Distributed Processing: