# K-Means¶

The K-Means algorithm solves clustering problem by partitioning $$n$$ feature vectors into $$k$$ clusters minimizing some criterion. Each cluster is characterized by a representative point, called a centroid.

 Operation Computational methods Programming Interface Training Lloyd’s train(…) train_input train_result Inference Lloyd’s infer(…) infer_input infer_result

## Mathematical formulation¶

### Training¶

Given the training set $$X = \{ x_1, \ldots, x_n \}$$ of $$p$$-dimensional feature vectors and a positive integer $$k$$, the problem is to find a set $$C = \{ c_1, \ldots, c_k \}$$ of $$p$$-dimensional centroids that minimize the objective function

$\Phi_{X}(C) = \sum_{i = 1}^n d^2(x_i, C),$

where $$d^2(x_i, C)$$ is the squared Euclidean distance from $$x_i$$ to the closest centroid in $$C$$,

$d^2(x_i, C) = \min_{1 \leq j \leq k} \| x_i - c_j \|^2, \quad 1 \leq i \leq n.$

Expression $$\|\cdot\|$$ denotes $$L_2$$ norm.

Note

In the general case, $$d$$ may be an arbitrary distance function. Current version of the oneDAL spec defines only Euclidean distance case.

#### Training method: Lloyd’s¶

The Lloyd’s method [Lloyd82] consists in iterative updates of centroids by applying the alternating Assignment and Update steps, where $$t$$ denotes a index of the current iteration, e.g., $$C^{(t)} = \{ c_1^{(t)}, \ldots, c_k^{(t)} \}$$ is the set of centroids at the $$t$$-th iteration. The method requires the initial centroids $$C^{(1)}$$ to be specified at the beginning of the algorithm ($$t = 1$$).

(1) Assignment step: Assign each feature vector $$x_i$$ to the nearest centroid. $$y_i^{(t)}$$ denotes the assigned label (cluster index) to the feature vector $$x_i$$.

$y_i^{(t)} = \mathrm{arg}\min_{1 \leq j \leq k} \| x_i - c_j^{(t)} \|^2, \quad 1 \leq i \leq n.$

Each feature vector from the training set $$X$$ is assigned to exactly one centroid so that $$X$$ is partitioned to $$k$$ disjoint sets (clusters)

$S_j^{(t)} = \big\{ \; x_i \in X : \; y_i^{(t)} = j \; \big\}, \quad 1 \leq j \leq k.$

(2) Update step: Recalculate centroids by averaging feature vectors assigned to each cluster.

$c_j^{(t + 1)} = \frac{1}{|S_j^{(t)}|} \sum_{x \in S_j^{(t)}} x, \quad 1 \leq j \leq k.$

The steps (1) and (2) are performed until the following stop condition,

$\sum_{j=1}^k \big\| c_j^{(t)} - c_j^{(t+1)} \big\|^2 < \varepsilon,$

is satisfied or number of iterations exceeds the maximal value $$T$$ defined by the user.

### Inference¶

Given the inference set $$X' = \{ x_1', \ldots, x_m' \}$$ of $$p$$-dimensional feature vectors and the set $$C = \{ c_1, \ldots, c_k \}$$ of centroids produced at the training stage, the problem is to predict the index $$y_j' \in \{ 0, \ldots, k-1 \}$$, $$1 \leq j \leq m$$, of the centroid in accordance with a method-defined rule.

#### Inference method: Lloyd’s¶

Lloyd’s inference method computes the $$y_j'$$ as an index of the centroid closest to the feature vector $$x_j'$$,

$y_j' = \mathrm{arg}\min_{1 \leq l \leq k} \| x_j' - c_l \|^2, \quad 1 \leq j \leq m.$

## Programming Interface¶

Refer to API Reference: K-Means.

## Usage example¶

### Training¶

kmeans::model<> run_training(const table& data,
const table& initial_centroids) {
const auto kmeans_desc = kmeans::descriptor<float>{}
.set_cluster_count(10)
.set_max_iteration_count(50)
.set_accuracy_threshold(1e-4);

const auto result = train(kmeans_desc, data, initial_centroids);

print_table("labels", result.get_labels());
print_table("centroids", result.get_model().get_centroids());
print_value("objective", result.get_objective_function_value());

return result.get_model();
}


### Inference¶

table run_inference(const kmeans::model<>& model,
const table& new_data) {
const auto kmeans_desc = kmeans::descriptor<float>{}
.set_cluster_count(model.get_cluster_count());

const auto result = infer(kmeans_desc, model, new_data);

print_table("labels", result.get_labels());
}


## Examples¶

Batch Processing: