Density-Based Spatial Clustering of Applications with Noise¶
Density-based spatial clustering of applications with noise (DBSCAN) is a data clustering algorithm proposed in [Ester96]. It is a density-based clustering non-parametric algorithm: given a set of observations in some space, it groups together observations that are closely packed together (observations with many nearby neighbors), marking as outliers observations that lie alone in low-density regions (whose nearest neighbors are too far away).
Details¶
Given the set \(X = \{x_1 = (x_{11}, \ldots, x_{1p}), \ldots, x_n = (x_{n1}, \ldots, x_{np})\}\)
of \(n\) \(p\)-dimensional feature vectors (further referred as observations),
a positive floating-point number epsilon
and a positive integer minObservations
,
the problem is to get clustering assignments for each input observation, based on the definitions below [Ester96]:
- core observation¶
An observation \(x\) is called core observation if at least
minObservations
input observations (including \(x\)) are within distanceepsilon
from observation \(x\);- directly reachable¶
An observation \(y\) is directly reachable from \(x\) if \(y\) is within distance
epsilon
from core observation \(x\). Observations are only said to be directly reachable from core observations.- reachable¶
An observation \(y\) is reachable from an observation \(x\) if there is a path \(x_1, \ldots, x_m\) with \(x_1 = x\) and \(x_m = y\), where each \(x_{i+1}\) is directly reachable from \(x_i\). This implies that all observations on the path must be core observations, with the possible exception of \(y\).
- noise observation¶
Noise observations are observations that are not reachable from any other observation.
- cluster¶
Two observations \(x\) and \(y\) are considered to be in the same cluster if there is a core observation \(z\), and \(x\) and \(y\) are both reachable from \(z\).
Each cluster gets a unique identifier, an integer number from \(0\) to \(\text{total number of clusters } – 1\). Each observation is assigned an identifier of the cluster it belongs to, or \(-1\) if the observation considered to be a noise observation.
Computation¶
The following computation modes are available:
Examples¶
Batch Processing:
Distributed Processing: