K-Means

The K-Means algorithm solves clustering problem by partitioning \(n\) feature vectors into \(k\) clusters minimizing some criterion. Each cluster is characterized by a representative point, called a centroid.

Operation

Computational methods

Programming Interface

Training

Lloyd’s

train(…)

train_input

train_result

Inference

Lloyd’s

infer(…)

infer_input

infer_result

Mathematical formulation

Training

Given the training set \(X = \{ x_1, \ldots, x_n \}\) of \(p\)-dimensional feature vectors and a positive integer \(k\), the problem is to find a set \(C = \{ c_1, \ldots, c_k \}\) of \(p\)-dimensional centroids that minimize the objective function

\[\Phi_{X}(C) = \sum_{i = 1}^n d^2(x_i, C),\]

where \(d^2(x_i, C)\) is the squared Euclidean distance from \(x_i\) to the closest centroid in \(C\),

\[d^2(x_i, C) = \min_{1 \leq j \leq k} \| x_i - c_j \|^2, \quad 1 \leq i \leq n.\]

Expression \(\|\cdot\|\) denotes \(L_2\) norm.

Note

In the general case, \(d\) may be an arbitrary distance function. Current version of the oneDAL spec defines only Euclidean distance case.

Training method: Lloyd’s

The Lloyd’s method [Lloyd82] consists in iterative updates of centroids by applying the alternating Assignment and Update steps, where \(t\) denotes a index of the current iteration, e.g., \(C^{(t)} = \{ c_1^{(t)}, \ldots, c_k^{(t)} \}\) is the set of centroids at the \(t\)-th iteration. The method requires the initial centroids \(C^{(1)}\) to be specified at the beginning of the algorithm (\(t = 1\)).

(1) Assignment step: Assign each feature vector \(x_i\) to the nearest centroid. \(y_i^{(t)}\) denotes the assigned label (cluster index) to the feature vector \(x_i\).

\[y_i^{(t)} = \mathrm{arg}\min_{1 \leq j \leq k} \| x_i - c_j^{(t)} \|^2, \quad 1 \leq i \leq n.\]

Each feature vector from the training set \(X\) is assigned to exactly one centroid so that \(X\) is partitioned to \(k\) disjoint sets (clusters)

\[S_j^{(t)} = \big\{ \; x_i \in X : \; y_i^{(t)} = j \; \big\}, \quad 1 \leq j \leq k.\]

(2) Update step: Recalculate centroids by averaging feature vectors assigned to each cluster.

\[c_j^{(t + 1)} = \frac{1}{|S_j^{(t)}|} \sum_{x \in S_j^{(t)}} x, \quad 1 \leq j \leq k.\]

If any of \(S_j^{(t)}\) are empty, start the empty clusters handling procedure. In this case, you can set the value of \(c_j^{(t+1)}\) as the farthest point from the previously calculated centroids for each empty cluster. This procedure makes sure that the number of clusters remains the same.

\[c_j^{(t + 1)} = \mathrm{arg}\max_{x \in X} d^2(x, C^{(t + 1)}).\]

The steps (1) and (2) are performed until the following stop condition,

\[\Phi_{X}(C^{(t)}) - \Phi_{X}(C^{(t + 1)}) < \varepsilon,\]

is satisfied or number of iterations exceeds the maximal value \(T\) defined by the user.

Inference

Given the inference set \(X' = \{ x_1', \ldots, x_m' \}\) of \(p\)-dimensional feature vectors and the set \(C = \{ c_1, \ldots, c_k \}\) of centroids produced at the training stage, the problem is to predict the index \(y_j' \in \{ 0, \ldots, k-1 \}\), \(1 \leq j \leq m\), of the centroid in accordance with a method-defined rule.

Inference method: Lloyd’s

Lloyd’s inference method computes the \(y_j'\) as an index of the centroid closest to the feature vector \(x_j'\),

\[y_j' = \mathrm{arg}\min_{1 \leq l \leq k} \| x_j' - c_l \|^2, \quad 1 \leq j \leq m.\]

Programming Interface

Refer to API Reference: K-Means.

Distributed mode

The algorithm supports distributed execution in SPMD mode (only on GPU).

Usage Example

Training

kmeans::model<> run_training(const table& data,
                           const table& initial_centroids) {
   const auto kmeans_desc = kmeans::descriptor<float>{}
      .set_cluster_count(10)
      .set_max_iteration_count(50)
      .set_accuracy_threshold(1e-4);

   const auto result = train(kmeans_desc, data, initial_centroids);

   print_table("labels", result.get_labels());
   print_table("centroids", result.get_model().get_centroids());
   print_value("objective", result.get_objective_function_value());

   return result.get_model();
}

Inference

table run_inference(const kmeans::model<>& model,
                  const table& new_data) {
   const auto kmeans_desc = kmeans::descriptor<float>{}
      .set_cluster_count(model.get_cluster_count());

   const auto result = infer(kmeans_desc, model, new_data);

   print_table("labels", result.get_labels());
}

Examples

Batch Processing: