.. ******************************************************************************
.. * Copyright 2020 Intel Corporation
.. *
.. * Licensed under the Apache License, Version 2.0 (the "License");
.. * you may not use this file except in compliance with the License.
.. * You may obtain a copy of the License at
.. *
.. * http://www.apache.org/licenses/LICENSE-2.0
.. *
.. * Unless required by applicable law or agreed to in writing, software
.. * distributed under the License is distributed on an "AS IS" BASIS,
.. * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
.. * See the License for the specific language governing permissions and
.. * limitations under the License.
.. *******************************************************************************/
.. highlight:: cpp
.. default-domain:: cpp
.. _alg_kmeans:
=======
K-Means
=======
.. include:: ../../../includes/clustering/kmeans-introduction.rst
------------------------
Mathematical formulation
------------------------
.. _kmeans_t_math:
Training
--------
Given the training set :math:`X = \{ x_1, \ldots, x_n \}` of
:math:`p`-dimensional feature vectors and a positive integer :math:`k`, the
problem is to find a set :math:`C = \{ c_1, \ldots, c_k \}` of
:math:`p`-dimensional centroids that minimize the objective function
.. math::
\Phi_{X}(C) = \sum_{i = 1}^n d^2(x_i, C),
where :math:`d^2(x_i, C)` is the squared Euclidean distance from :math:`x_i` to
the closest centroid in :math:`C`,
.. math::
d^2(x_i, C) = \min_{1 \leq j \leq k} \| x_i - c_j \|^2, \quad 1 \leq i \leq n.
Expression :math:`\|\cdot\|` denotes :math:`L_2` `norm
`_.
.. note::
In the general case, :math:`d` may be an arbitrary distance function. Current
version of the oneDAL spec defines only Euclidean distance case.
.. _kmeans_t_math_lloyd:
Training method: *Lloyd's*
~~~~~~~~~~~~~~~~~~~~~~~~~~
The Lloyd's method [Lloyd82]_ consists in iterative updates of centroids by
applying the alternating *Assignment* and *Update* steps, where :math:`t`
denotes a index of the current iteration, e.g., :math:`C^{(t)} = \{ c_1^{(t)},
\ldots, c_k^{(t)} \}` is the set of centroids at the :math:`t`-th iteration. The
method requires the initial centroids :math:`C^{(1)}` to be specified at the
beginning of the algorithm (:math:`t = 1`).
**(1) Assignment step:** Assign each feature vector :math:`x_i` to the nearest
centroid. :math:`y_i^{(t)}` denotes the assigned label (cluster index) to the
feature vector :math:`x_i`.
.. math::
y_i^{(t)} = \mathrm{arg}\min_{1 \leq j \leq k} \| x_i - c_j^{(t)} \|^2, \quad 1 \leq i \leq n.
Each feature vector from the training set :math:`X` is assigned to exactly one
centroid so that :math:`X` is partitioned to :math:`k` disjoint sets (clusters)
.. math::
S_j^{(t)} = \big\{ \; x_i \in X : \; y_i^{(t)} = j \; \big\}, \quad 1 \leq j \leq k.
**(2) Update step:** Recalculate centroids by averaging feature vectors assigned
to each cluster.
.. math::
c_j^{(t + 1)} = \frac{1}{|S_j^{(t)}|} \sum_{x \in S_j^{(t)}} x, \quad 1 \leq j \leq k.
If any of :math:`S_j^{(t)}` are empty, start the empty clusters handling procedure.
In this case, you can set the value of :math:`c_j^{(t+1)}` as the farthest point from
the previously calculated centroids for each empty cluster. This procedure makes sure
that the number of clusters remains the same.
.. math::
c_j^{(t + 1)} = \mathrm{arg}\max_{x \in X} d^2(x, C^{(t + 1)}).
The steps (1) and (2) are performed until the following **stop condition**,
.. math::
\Phi_{X}(C^{(t)}) - \Phi_{X}(C^{(t + 1)}) < \varepsilon,
is satisfied or number of iterations exceeds the maximal value :math:`T` defined
by the user.
.. _kmeans_i_math:
Inference
---------
Given the inference set :math:`X' = \{ x_1', \ldots, x_m' \}` of
:math:`p`-dimensional feature vectors and the set :math:`C = \{ c_1, \ldots, c_k
\}` of centroids produced at the training stage, the problem is to predict the
index :math:`y_j' \in \{ 0, \ldots, k-1 \}`, :math:`1 \leq j \leq m`, of the
centroid in accordance with a method-defined rule.
.. _kmeans_i_math_lloyd:
Inference method: *Lloyd's*
~~~~~~~~~~~~~~~~~~~~~~~~~~~
Lloyd's inference method computes the :math:`y_j'` as an index of the centroid
closest to the feature vector :math:`x_j'`,
.. math::
y_j' = \mathrm{arg}\min_{1 \leq l \leq k} \| x_j' - c_l \|^2, \quad 1 \leq j \leq m.
---------------------
Programming Interface
---------------------
Refer to :ref:`API Reference: K-Means `.
----------------
Distributed mode
----------------
The algorithm supports distributed execution in SPMD mode (only on GPU).
-------------
Usage Example
-------------
.. include:: ../../../includes/clustering/kmeans-usage-examples.rst
--------
Examples
--------
.. include:: ../../../includes/clustering/kmeans-examples.rst