.. ******************************************************************************
.. * Copyright 2020 Intel Corporation
.. *
.. * Licensed under the Apache License, Version 2.0 (the "License");
.. * you may not use this file except in compliance with the License.
.. * You may obtain a copy of the License at
.. *
.. * http://www.apache.org/licenses/LICENSE-2.0
.. *
.. * Unless required by applicable law or agreed to in writing, software
.. * distributed under the License is distributed on an "AS IS" BASIS,
.. * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
.. * See the License for the specific language governing permissions and
.. * limitations under the License.
.. *******************************************************************************/
Distributed Processing
**********************
This mode assumes that the data set is split into ``nblocks`` blocks across computation nodes.
Algorithm Parameters
++++++++++++++++++++
The K-Means clustering algorithm in the distributed processing mode has the following parameters:
.. tabularcolumns:: |\Y{0.15}|\Y{0.15}|\Y{0.7}|
.. list-table:: Algorithm Parameters for K-Means Computation (Distributed Processing)
:header-rows: 1
:widths: 10 10 60
:align: left
:class: longtable
* - Parameter
- Default Value
- Description
* - ``computeStep``
- Not applicable
- The parameter required to initialize the algorithm. Can be:
- ``step1Local`` - the first step, performed on local nodes
- ``step2Master`` - the second step, performed on a master node
* - ``algorithmFPType``
- ``float``
- The floating-point type that the algorithm uses for intermediate computations. Can be ``float`` or ``double``.
* - ``method``
- ``defaultDense``
- Available computation methods for K-Means clustering:
- ``defaultDense`` - implementation of Lloyd's algorithm
- ``lloydCSR`` - implementation of Lloyd's algorithm for CSR numeric tables
* - ``nClusters``
- Not applicable
- The number of clusters. Required to initialize the algorithm.
* - ``gamma``
- :math:`1.0`
- The weight to be used in distance calculation for binary categorical features.
* - ``distanceType``
- ``euclidean``
- The measure of closeness between points (observations) being clustered. The only distance type supported so far is the Euclidean distance.
* - ``assignFlag``
- ``false``
- A flag that enables computation of assignments, that is, assigning cluster indices to respective observations.
To compute K-Means clustering in the distributed processing mode, use the general schema described in Algorithms as follows:
.. _kmeans_computation_step_1:
Step 1 - on Local Nodes
+++++++++++++++++++++++
.. figure:: images/kmeans-distributed-computation-step-1.png
:width: 1000
:alt:
K-Means Computation: Distributed Processing, Step 1 - on Local Nodes
In this step, the K-Means clustering algorithm accepts the input described below.
Pass the ``Input ID`` as a parameter to the methods that provide input for your algorithm.
For more details, see :ref:`algorithms`.
.. tabularcolumns:: |\Y{0.2}|\Y{0.8}|
.. list-table:: Input for K-Means Computation (Distributed Processing, Step 1)
:header-rows: 1
:widths: 10 60
:align: left
:class: longtable
* - Input ID
- Input
* - ``data``
- Pointer to the :math:`n_i \times p` numeric table that represents the :math:`i`-th data block on the local node.
The input can be an object of any class derived from ``NumericTable``.
* - ``inputCentroids``
- Pointer to the :math:`\mathrm{nClusters} \times p` numeric table with the initial cluster centroids.
This input can be an object of any class derived from NumericTable.
In this step, the K-Means clustering algorithm calculates the partial results and results described below.
Pass the ``Partial Result ID`` or ``Result ID`` as a parameter to the methods that access the results of your algorithm.
For more details, see :ref:`algorithms`.
.. tabularcolumns:: |\Y{0.2}|\Y{0.8}|
.. list-table:: Partial Results for K-Means Computation (Distributed Processing, Step 1)
:header-rows: 1
:widths: 10 60
:align: left
:class: longtable
* - Partial Result ID
- Result
* - ``nObservations``
- Pointer to the :math:`\mathrm{nClusters} \times 1` numeric table that contains
the number of observations assigned to the clusters on local node.
.. note::
By default, this result is an object of the ``HomogenNumericTable`` class,
but you can define this result as an object of any class derived from ``NumericTable`` except ``CSRNumericTable``.
* - ``partialSums``
- Pointer to the :math:`\mathrm{nClusters} \times p` numeric table with
partial sums of observations assigned to the clusters on the local node.
.. note::
By default, this result is an object of the ``HomogenNumericTable`` class,
but you can define the result as an object of any class derived from ``NumericTable``
except ``PackedTriangularMatrix``, ``PackedSymmetricMatrix``, and ``CSRNumericTable``.
* - ``partialObjectiveFunction``
- Pointer to the :math:`1 \times 1` numeric table that contains the value of the partial objective function
for observations processed on the local node.
.. note::
By default, this result is an object of the ``HomogenNumericTable`` class,
but you can define this result as an object of any class derived from ``NumericTable`` except ``CSRNumericTable``.
* - ``partialCandidatesDistances``
- Pointer to the :math:`\mathrm{nClusters} \times 1` numeric table that contains the value of the ``nClusters``
largest objective function for the observations processed on the local node and stored in descending order.
.. note::
By default, this result if an object of the ``HomogenNumericTable`` class,
but you can define this result as an object of any class derived from ``NumericTable``
except ``PackedTriangularMatrix``, ``PackedSymmetricMatrix``, ``CSRNumericTable``.
* - ``partialCandidatesCentroids``
- Pointer to the :math:`\mathrm{nClusters} \times 1` numeric table that contains the observations of the ``nClusters``
largest objective function value processed on the local node and stored in descending order of the objective function.
.. note::
By default, this result if an object of the ``HomogenNumericTable`` class,
but you can define this result as an object of any class derived from ``NumericTable``
except ``PackedTriangularMatrix``, ``PackedSymmetricMatrix``, ``CSRNumericTable``.
.. tabularcolumns:: |\Y{0.2}|\Y{0.8}|
.. list-table:: Output for K-Means Computation (Distributed Processing, Step 1)
:header-rows: 1
:widths: 10 60
:align: left
* - Result ID
- Result
* - ``assignments``
- Use when ``assignFlag`` = ``true``. Pointer to the :math:`n_i \times 1` numeric table
with 32-bit integer assignments of cluster indices to feature vectors in the input data on the local node.
.. note::
By default, this result is an object of the ``HomogenNumericTable`` class,
but you can define this result as an object of any class derived from ``NumericTable``
except ``PackedTriangularMatrix``, ``PackedSymmetricMatrix``, and ``CSRNumericTable``.
.. _kmeans_computation_step_2:
Step 2 - on Master Node
+++++++++++++++++++++++
.. figure:: images/kmeans-distributed-computation-step-2.png
:width: 1000
:alt:
K-Means Computation: Distributed Processing, Step 2 - on Master Node
In this step, the K-Means clustering algorithm accepts the input from each local node described below.
Pass the ``Input ID`` as a parameter to the methods that provide input for your algorithm.
For more details, see :ref:`algorithms`.
.. tabularcolumns:: |\Y{0.2}|\Y{0.8}|
.. list-table:: Input for K-Means Computation (Distributed Processing, Step 2)
:header-rows: 1
:widths: 10 60
:align: left
* - Input ID
- Input
* - ``partialResuts``
- A collection that contains results computed in :ref:`Step 1 ` on local nodes.
In this step, the K-Means clustering algorithm calculates the results described below.
Pass the ``Result ID`` as a parameter to the methods that access the results of your algorithm.
For more details, see :ref:`algorithms`.
.. tabularcolumns:: |\Y{0.2}|\Y{0.8}|
.. list-table:: Output for K-Means Computation (Distributed Processing, Step 2)
:header-rows: 1
:widths: 10 60
:align: left
:class: longtable
* - Result ID
- Result
* - ``centroids``
- Pointer to the :math:`\mathrm{nClusters} \times p` numeric table with centroids.
.. note::
By default, this result is an object of the ``HomogenNumericTable`` class,
but you can define the result as an object of any class derived from ``NumericTable``
except ``PackedTriangularMatrix``, ``PackedSymmetricMatrix``, and ``CSRNumericTable``.
* - ``objectiveFunction``
- Pointer to the :math:`1 \times 1` numeric table that contains the value of the objective function.
.. note::
By default, this result is an object of the ``HomogenNumericTable`` class,
but you can define this result as an object of any class derived from ``NumericTable`` except ``CSRNumericTable``.
.. important::
The algorithm computes assignments using input centroids.
Therefore, to compute assignments using final computed centroids, after the last call to ``Step2compute()`` method on the master node,
on each local node set assignFlag to true and do one additional call to ``Step1compute()`` and ``finalizeCompute()`` methods.
Always set assignFlag to true and call ``finalizeCompute()`` to obtain assignments in each step.
.. note::
To compute assignments using original ``inputCentroids`` on the given node,
you can use K-Means clustering algorithm in the batch processing mode with the subset of the data available on this node.
See :ref:`kmeans_computation_batch` for more details.