K-Means¶
The K-Means algorithm solves clustering problem by partitioning \(n\) feature vectors into \(k\) clusters minimizing some criterion. Each cluster is characterized by a representative point, called a centroid.
Operation |
Computational methods |
Programming Interface |
||
Mathematical formulation¶
Training¶
Given the training set \(X = \{ x_1, \ldots, x_n \}\) of \(p\)-dimensional feature vectors and a positive integer \(k\), the problem is to find a set \(C = \{ c_1, \ldots, c_k \}\) of \(p\)-dimensional centroids that minimize the objective function
where \(d^2(x_i, C)\) is the squared Euclidean distance from \(x_i\) to the closest centroid in \(C\),
Expression \(\|\cdot\|\) denotes \(L_2\) norm.
Note
In the general case, \(d\) may be an arbitrary distance function. Current version of the oneDAL spec defines only Euclidean distance case.
Training method: Lloyd’s¶
The Lloyd’s method [Lloyd82] consists in iterative updates of centroids by applying the alternating Assignment and Update steps, where \(t\) denotes a index of the current iteration, e.g., \(C^{(t)} = \{ c_1^{(t)}, \ldots, c_k^{(t)} \}\) is the set of centroids at the \(t\)-th iteration. The method requires the initial centroids \(C^{(1)}\) to be specified at the beginning of the algorithm (\(t = 1\)).
(1) Assignment step: Assign each feature vector \(x_i\) to the nearest centroid. \(y_i^{(t)}\) denotes the assigned label (cluster index) to the feature vector \(x_i\).
Each feature vector from the training set \(X\) is assigned to exactly one centroid so that \(X\) is partitioned to \(k\) disjoint sets (clusters)
(2) Update step: Recalculate centroids by averaging feature vectors assigned to each cluster.
If any of \(S_j^{(t)}\) are empty, start the empty clusters handling procedure. In this case, you can set the value of \(c_j^{(t+1)}\) as the farthest point from the previously calculated centroids for each empty cluster. This procedure makes sure that the number of clusters remains the same.
The steps (1) and (2) are performed until the following stop condition,
is satisfied or number of iterations exceeds the maximal value \(T\) defined by the user.
Inference¶
Given the inference set \(X' = \{ x_1', \ldots, x_m' \}\) of \(p\)-dimensional feature vectors and the set \(C = \{ c_1, \ldots, c_k \}\) of centroids produced at the training stage, the problem is to predict the index \(y_j' \in \{ 0, \ldots, k-1 \}\), \(1 \leq j \leq m\), of the centroid in accordance with a method-defined rule.
Inference method: Lloyd’s¶
Lloyd’s inference method computes the \(y_j'\) as an index of the centroid closest to the feature vector \(x_j'\),
Programming Interface¶
Refer to API Reference: K-Means.
Distributed mode¶
The algorithm supports distributed execution in SPMD mode (only on GPU).
Usage Example¶
Training¶
kmeans::model<> run_training(const table& data,
const table& initial_centroids) {
const auto kmeans_desc = kmeans::descriptor<float>{}
.set_cluster_count(10)
.set_max_iteration_count(50)
.set_accuracy_threshold(1e-4);
const auto result = train(kmeans_desc, data, initial_centroids);
print_table("labels", result.get_labels());
print_table("centroids", result.get_model().get_centroids());
print_value("objective", result.get_objective_function_value());
return result.get_model();
}
Inference¶
table run_inference(const kmeans::model<>& model,
const table& new_data) {
const auto kmeans_desc = kmeans::descriptor<float>{}
.set_cluster_count(model.get_cluster_count());
const auto result = infer(kmeans_desc, model, new_data);
print_table("labels", result.get_labels());
}
Examples¶
Batch Processing:
Batch Processing: