.. ******************************************************************************
.. * Copyright 2020 Intel Corporation
.. *
.. * Licensed under the Apache License, Version 2.0 (the "License");
.. * you may not use this file except in compliance with the License.
.. * You may obtain a copy of the License at
.. *
.. * http://www.apache.org/licenses/LICENSE-2.0
.. *
.. * Unless required by applicable law or agreed to in writing, software
.. * distributed under the License is distributed on an "AS IS" BASIS,
.. * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
.. * See the License for the specific language governing permissions and
.. * limitations under the License.
.. *******************************************************************************/

Distributed Processing
**********************

This mode assumes that the data set is split into ``nblocks`` blocks across computation nodes.

Parameters
++++++++++

Centroid initialization for K-Means clustering in the distributed processing mode has the following parameters:

.. tabularcolumns:: \Y{0.15}\Y{0.15}\Y{0.15}\Y{0.55}

.. list-table:: Algorithm Parameters for K-Means Initialization (Distributed Processing)
   :widths: 10 10 10 30
   :header-rows: 1
   :class: longtable

   * - Parameter
     - Method
     - Default Value
     - Description
   * - ``computeStep``
     - any
     - Not applicable
     - The parameter required to initialize the algorithm. Can be:

       - ``step1Local`` - the first step, performed on local nodes. Applicable for all methods.
       - ``step2Master`` - the second step, performed on a master node. Applicable for ``deterministic`` and ``random`` methods only.
       - ``step2Local`` - the second step, performed on local nodes. Applicable for ``plusPlus`` and ``parallelPlus`` methods only.
       - ``step3Master`` - the third step, performed on a master node. Applicable for ``plusPlus`` and ``parallelPlus`` methods only.
       - ``step4Local`` - the fourth step, performed on local nodes. Applicable for ``plusPlus`` and ``parallelPlus`` methods only.
       - ``step5Master`` - the fifth step, performed on a master node. Applicable for ``plusPlus`` and ``parallelPlus`` methods only.
   * - ``algorithmFPType``
     - any
     - ``float``
     - The floating-point type that the algorithm uses for intermediate computations. Can be ``float`` or ``double``.
   * - ``method``
     - Not applicable
     - ``defaultDense``
     - Available initialization methods for K-Means clustering:

       - ``defaultDense`` - uses the first ``nClusters`` feature vectors as initial centroids
       - ``deterministicCSR`` - uses the first ``nClusters`` feature vectors as initial centroids for data in a CSR numeric table
       - ``randomDense`` - uses random ``nClusters`` feature vectors as initial centroids
       - ``randomCSR`` - uses random ``nClusters`` feature vectors as initial centroids for data in a CSR numeric table
       - ``plusPlusDense`` - uses the K-Means++ algorithm [Arthur2007]_
       - ``plusPlusCSR`` - uses the K-Means++ algorithm for data in a CSR numeric table
       - ``parallelPlusDense`` - uses the parallel K-Means++ algorithm [Bahmani2012]_
       - ``parallelPlusCSR`` - uses the parallel K-Means++ algorithm for data in a CSR numeric table

       For more details, see the algorithm description.
   * - ``nClusters``
     - any
     - Not applicable
     - The number of centroids. Required.
   * - ``nRowsTotal``
     - any
     - :math:`0`
     - The total number of rows in all input data sets on all nodes. Required in the distributed processing mode in the first step.
   * - ``offset``
     - any
     - Not applicable
     - Offset in the total data set specifying the start of a block stored on a given local node. Required.
   * - ``oversamplingFactor``
     -
       * ``parallelPlusDense``
       * ``parallelPlusCSR``
     - :math:`0.5`
     - A fraction of ``nClusters`` in each of ``nRounds`` of parallel K-Means++.
       :math:`L = \mathrm{nClusters} * \mathrm{oversamplingFactor}` points are sampled in a round.
       For details, see [Bahmani2012]_, section 3.3.
   * - ``nRounds``
     -
       * ``parallelPlusDense``
       * ``parallelPlusCSR``
     - :math:`5`
     - The number of rounds for parallel K-Means++. :math:`L * \mathrm{nRounds}` must be greater than ``nClusters``.
       For details, see [Bahmani2012]_, section 3.3.
   * - ``firstIteration``
     -
       * ``parallelPlusDense``
       * ``parallelPlusCSR``
       * ``plusPlusDense``
       * ``plusPlusCSR``
     - ``false``
     - Set to ``true`` if ``step2Local`` is called for the first time.
   * - ``outputForStep5Required``
     -
       * ``parallelPlusDense``
       * ``parallelPlusCSR``
     - ``false``
     - Set to ``true`` if ``step4Local`` is called on the last iteration of the
       :ref:`Step 2 <kmeans_init_step_2_local>` - :ref:`Step 4 <kmeans_init_step_4>` loop.

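
The relationship between ``nClusters``, ``oversamplingFactor``, and ``nRounds`` can be checked before the algorithm runs. The following is a minimal sketch in plain Python, not a call into the library: the function names are hypothetical, and only the formula :math:`L = \mathrm{nClusters} * \mathrm{oversamplingFactor}` and the requirement :math:`L * \mathrm{nRounds} > \mathrm{nClusters}` come from the table above.

```python
def points_per_round(n_clusters, oversampling_factor=0.5):
    """Return L = nClusters * oversamplingFactor (points sampled in one round)."""
    return n_clusters * oversampling_factor


def check_parallel_plus_params(n_clusters, oversampling_factor=0.5, n_rounds=5):
    """Raise ValueError unless L * nRounds is strictly greater than nClusters.

    Hypothetical helper: the defaults mirror the documented defaults (0.5 and 5).
    """
    points = points_per_round(n_clusters, oversampling_factor)
    if not points * n_rounds > n_clusters:
        raise ValueError(
            "L * nRounds = %g must be greater than nClusters = %d"
            % (points * n_rounds, n_clusters)
        )
    return points


if __name__ == "__main__":
    # With the defaults, L * nRounds = 2.5 * nClusters, so the check passes.
    print(check_parallel_plus_params(20))
```

With the default values the condition always holds for a positive ``nClusters``; it starts to matter when ``oversamplingFactor`` or ``nRounds`` is lowered.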

Centroid initialization for K-Means clustering follows the general schema described in :ref:`algorithms`.

.. tabs::

  .. tab:: ``plusPlus`` methods

    .. figure:: images/kmeansdistributedinitplusPlusmethods.png
      :alt:

      K-Means Centroid Initialization with ``plusPlus`` methods: Distributed Processing

  .. tab:: ``parallelPlus`` methods

    .. figure:: images/kmeansdistributedinitparallelPlusmethods.png
      :alt:

      K-Means Centroid Initialization with ``parallelPlus`` methods: Distributed Processing

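
Which ``computeStep`` values a given method goes through can be summarized in a small lookup. This is an illustrative sketch only: the dictionary and function names are hypothetical, and the sketch follows the Step 5 section of this page in treating ``step5Master`` as a ``parallelPlus``-only step.

```python
# Map each documented method name to its method family.
METHOD_FAMILY = {
    "defaultDense": "deterministic",
    "deterministicCSR": "deterministic",
    "randomDense": "random",
    "randomCSR": "random",
    "plusPlusDense": "plusPlus",
    "plusPlusCSR": "plusPlus",
    "parallelPlusDense": "parallelPlus",
    "parallelPlusCSR": "parallelPlus",
}


def steps_for_method(method):
    """Return the computeStep values used by the given initialization method."""
    family = METHOD_FAMILY[method]
    if family in ("deterministic", "random"):
        return ["step1Local", "step2Master"]
    # plusPlus and parallelPlus repeat step2Local..step4Local in a loop.
    steps = ["step1Local", "step2Local", "step3Master", "step4Local"]
    if family == "parallelPlus":
        steps.append("step5Master")
    return steps


if __name__ == "__main__":
    for name in METHOD_FAMILY:
        print(name, steps_for_method(name))
```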

.. _kmeans_init_step_1:

Step 1 - on Local Nodes (``deterministic``, ``random``, ``plusPlus``, and ``parallelPlus`` methods)
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

.. tabs::

  .. tab:: ``plusPlus`` methods

    .. figure:: images/kmeansdistributedinitstep1plusPlusmethods.png
      :alt:

      K-Means Centroid Initialization with ``plusPlus`` methods: Distributed Processing, Step 1 - on Local Nodes

  .. tab:: ``parallelPlus`` methods

    .. figure:: images/kmeansdistributedinitstep1parallelPlusmethods.png
      :alt:

      K-Means Centroid Initialization with ``parallelPlus`` methods: Distributed Processing, Step 1 - on Local Nodes


In this step, centroid initialization for K-Means clustering accepts the input described below.
Pass the ``Input ID`` as a parameter to the methods that provide input for your algorithm.
For more details, see :ref:`algorithms`.

.. tabularcolumns:: \Y{0.2}\Y{0.8}

.. list-table:: Input for K-Means Initialization (Distributed Processing, Step 1)
   :header-rows: 1
   :widths: 10 60
   :align: left

   * - Input ID
     - Input
   * - ``data``
     - Pointer to the :math:`n_i \times p` numeric table that represents the :math:`i`-th data block on the local node.

       .. note::

          While the input for ``defaultDense``, ``randomDense``, ``plusPlusDense``, and ``parallelPlusDense`` methods
          can be an object of any class derived from ``NumericTable``,
          the input for ``deterministicCSR``, ``randomCSR``, ``plusPlusCSR``, and ``parallelPlusCSR`` methods
          can only be an object of the ``CSRNumericTable`` class.


In this step, centroid initialization for K-Means clustering calculates the results described below.
Pass the ``Result ID`` as a parameter to the methods that access the results of your algorithm.
For more details, see :ref:`algorithms`.

.. tabularcolumns:: \Y{0.2}\Y{0.8}

.. list-table:: Output for K-Means Initialization (Distributed Processing, Step 1)
   :header-rows: 1
   :widths: 10 60
   :align: left

   * - Result ID
     - Result
   * - ``partialCentroids``
     - Pointer to the :math:`\mathrm{nClusters} \times p` numeric table with the centroids computed on the local node.

       .. note::

          By default, this result is an object of the ``HomogenNumericTable`` class,
          but you can define the result as an object of any class derived from ``NumericTable``
          except ``PackedTriangularMatrix``, ``PackedSymmetricMatrix``, and ``CSRNumericTable``.

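
Step 1 needs the ``offset`` and ``nRowsTotal`` parameters on every local node. When the block sizes :math:`n_i` are known, both values can be derived as below. This is a sketch under stated assumptions: the helper name is hypothetical, offsets are taken to be zero-based row indices, and blocks are assumed to be laid out in node order.

```python
def block_offsets(block_sizes):
    """Return (offsets, n_rows_total) for blocks stored on consecutive nodes.

    offsets[i] is the row index in the total data set where block i starts;
    n_rows_total is the sum of all block sizes.
    """
    offsets = []
    running = 0
    for n_rows in block_sizes:
        offsets.append(running)
        running += n_rows
    return offsets, running


if __name__ == "__main__":
    # Three nodes holding 100, 250, and 50 rows respectively.
    offsets, n_rows_total = block_offsets([100, 250, 50])
    for node, offset in enumerate(offsets):
        print("node", node, "offset", offset, "nRowsTotal", n_rows_total)
```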

.. _kmeans_init_step_2_master:

Step 2 - on Master Node (``deterministic`` and ``random`` methods)
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

This step is applicable for ``deterministic`` and ``random`` methods only.
Centroid initialization for K-Means clustering accepts the input from each local node described below.
Pass the ``Input ID`` as a parameter to the methods that provide input for your algorithm.
For more details, see :ref:`algorithms`.

.. tabularcolumns:: \Y{0.2}\Y{0.8}

.. list-table:: Input for K-Means Initialization (Distributed Processing, Step 2 on Master Node)
   :header-rows: 1
   :widths: 10 60
   :align: left

   * - Input ID
     - Input
   * - ``partialResults``
     - A collection that contains results computed in :ref:`Step 1 <kmeans_init_step_1>`
       on local nodes (two numeric tables from each local node).


In this step, centroid initialization for K-Means clustering calculates the results described below.
Pass the ``Result ID`` as a parameter to the methods that access the results of your algorithm.
For more details, see :ref:`algorithms`.

.. tabularcolumns:: \Y{0.2}\Y{0.8}

.. list-table:: Output for K-Means Initialization (Distributed Processing, Step 2 on Master Node)
   :header-rows: 1
   :widths: 10 60
   :align: left

   * - Result ID
     - Result
   * - ``centroids``
     - Pointer to the :math:`\mathrm{nClusters} \times p` numeric table with centroids.

       .. note::

          By default, this result is an object of the ``HomogenNumericTable`` class,
          but you can define the result as an object of any class derived from ``NumericTable``
          except ``PackedTriangularMatrix``, ``PackedSymmetricMatrix``, and ``CSRNumericTable``.


.. _kmeans_init_step_2_local:

Step 2 - on Local Nodes (``plusPlus`` and ``parallelPlus`` methods)
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

.. tabs::

  .. tab:: ``plusPlus`` methods

    .. figure:: images/kmeansdistributedinitstep2plusPlusmethods.png
      :alt:

      K-Means Centroid Initialization with ``plusPlus`` methods: Distributed Processing, Step 2 - on Local Nodes

  .. tab:: ``parallelPlus`` methods

    .. figure:: images/kmeansdistributedinitstep2parallelPlusmethods.png
      :alt:

      K-Means Centroid Initialization with ``parallelPlus`` methods: Distributed Processing, Step 2 - on Local Nodes


This step is applicable for ``plusPlus`` and ``parallelPlus`` methods only.
Centroid initialization for K-Means clustering accepts the input from each local node described below.
Pass the ``Input ID`` as a parameter to the methods that provide input for your algorithm.
For more details, see :ref:`algorithms`.

.. tabularcolumns:: \Y{0.2}\Y{0.8}

.. list-table:: Input for K-Means Initialization (Distributed Processing, Step 2 on Local Nodes)
   :header-rows: 1
   :widths: 10 60
   :align: left
   :class: longtable

   * - Input ID
     - Input
   * - ``data``
     - Pointer to the :math:`n_i \times p` numeric table that represents the :math:`i`-th data block on the local node.

       .. note::

          While the input for ``defaultDense``, ``randomDense``, ``plusPlusDense``, and ``parallelPlusDense`` methods
          can be an object of any class derived from ``NumericTable``,
          the input for ``deterministicCSR``, ``randomCSR``, ``plusPlusCSR``, and ``parallelPlusCSR`` methods
          can only be an object of the ``CSRNumericTable`` class.
   * - ``inputOfStep2``
     - Pointer to the :math:`m \times p` numeric table with the centroids calculated in the previous steps
       (:ref:`Step 1 <kmeans_init_step_1>` or :ref:`Step 4 <kmeans_init_step_4>`).

       The value of :math:`m` is defined by the method and iteration of the algorithm:

       - ``plusPlus`` method: :math:`m = 1`
       - ``parallelPlus`` method:

         - :math:`m = 1` for the first iteration of the Step 2 - Step 4 loop
         - :math:`m = L = \mathrm{nClusters} * \mathrm{oversamplingFactor}` for other iterations

       This input can be an object of any class derived from ``NumericTable``,
       except ``CSRNumericTable``, ``PackedTriangularMatrix``, and ``PackedSymmetricMatrix``.
   * - ``internalInput``
     - Pointer to the ``DataCollection`` object with the internal data of the distributed algorithm
       used by its local nodes in :ref:`Step 2 <kmeans_init_step_2_local>` and :ref:`Step 4 <kmeans_init_step_4>`.
       The ``DataCollection`` is created in :ref:`Step 2 <kmeans_init_step_2_local>` when ``firstIteration`` is set to ``true``.
       After that, set the ``DataCollection`` from the partial result as the input for the next local steps
       (:ref:`Step 2 <kmeans_init_step_2_local>` and :ref:`Step 4 <kmeans_init_step_4>`).


In this step, centroid initialization for K-Means clustering calculates the results described below.
Pass the ``Result ID`` as a parameter to the methods that access the results of your algorithm.
For more details, see :ref:`algorithms`.

.. tabularcolumns:: \Y{0.2}\Y{0.8}

.. list-table:: Output for K-Means Initialization (Distributed Processing, Step 2 on Local Nodes)
   :header-rows: 1
   :widths: 10 60
   :align: left
   :class: longtable

   * - Result ID
     - Result
   * - ``outputOfStep2ForStep3``
     - Pointer to the :math:`1 \times 1` numeric table that contains the overall error accumulated on the node.
       For a description of the overall error, see :ref:`KMeans Clustering Details `.
   * - ``outputOfStep2ForStep5``
     - Applicable for ``parallelPlus`` methods only and calculated when ``outputForStep5Required`` is set to ``true``.
       Pointer to the :math:`1 \times m` numeric table with the ratings of centroid candidates computed in the previous steps,
       where :math:`m = \mathrm{oversamplingFactor} * \mathrm{nClusters} * \mathrm{nRounds} + 1`.
       For a description of ratings, see :ref:`KMeans Clustering Details `.

       .. note::

          By default, these results are objects of the ``HomogenNumericTable`` class,
          but you can define the result as an object of any class derived from ``NumericTable``
          except ``PackedTriangularMatrix``, ``PackedSymmetricMatrix``, and ``CSRNumericTable``.

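
The table sizes used in Step 2 follow directly from the parameters. The sketch below evaluates the two formulas for :math:`m` given above; the function names are hypothetical, and truncating a fractional product to an integer is an assumption, because this page does not state how the library rounds.

```python
def input_of_step2_rows(method_family, n_clusters, oversampling_factor, first_iteration):
    """Number of rows m in the inputOfStep2 table."""
    if method_family not in ("plusPlus", "parallelPlus"):
        raise ValueError("Step 2 on local nodes applies to plusPlus and parallelPlus only")
    if method_family == "plusPlus" or first_iteration:
        return 1
    # parallelPlus, later iterations: m = L = nClusters * oversamplingFactor
    return int(n_clusters * oversampling_factor)  # truncation is an assumption


def ratings_length(n_clusters, oversampling_factor, n_rounds):
    """Number of columns m in outputOfStep2ForStep5."""
    # m = oversamplingFactor * nClusters * nRounds + 1
    return int(oversampling_factor * n_clusters * n_rounds) + 1  # truncation is an assumption


if __name__ == "__main__":
    print(input_of_step2_rows("parallelPlus", 20, 0.5, first_iteration=False))
    print(ratings_length(20, 0.5, 5))
```

The extra one in the ratings length accounts for the single centroid that comes from Step 1, in addition to the candidates collected over all rounds.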

.. _kmeans_init_step_3:

Step 3 - on Master Node (``plusPlus`` and ``parallelPlus`` methods)
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

.. tabs::

  .. tab:: ``plusPlus`` methods

    .. figure:: images/kmeansdistributedinitstep3plusPlusmethods.png
      :alt:

      K-Means Centroid Initialization with ``plusPlus`` methods: Distributed Processing, Step 3 - on Master Node

  .. tab:: ``parallelPlus`` methods

    .. figure:: images/kmeansdistributedinitstep3parallelPlusmethods.png
      :alt:

      K-Means Centroid Initialization with ``parallelPlus`` methods: Distributed Processing, Step 3 - on Master Node


This step is applicable for ``plusPlus`` and ``parallelPlus`` methods only.
Centroid initialization for K-Means clustering accepts the input from each local node described below.
Pass the ``Input ID`` as a parameter to the methods that provide input for your algorithm.
For more details, see :ref:`algorithms`.

.. tabularcolumns:: \Y{0.2}\Y{0.8}

.. list-table:: Input for K-Means Initialization (Distributed Processing, Step 3)
   :header-rows: 1
   :widths: 10 60
   :align: left

   * - Input ID
     - Input
   * - ``inputOfStep3FromStep2``
     - A key-value data collection that maps parts of the accumulated error to the local nodes:
       the :math:`i`-th element of this collection is a numeric table that contains the overall error accumulated on the :math:`i`-th node.


In this step, centroid initialization for K-Means clustering calculates the results described below.
Pass the ``Result ID`` as a parameter to the methods that access the results of your algorithm.
For more details, see :ref:`algorithms`.

.. tabularcolumns:: \Y{0.2}\Y{0.8}

.. list-table:: Output for K-Means Initialization (Distributed Processing, Step 3)
   :header-rows: 1
   :widths: 10 60
   :align: left
   :class: longtable

   * - Result ID
     - Result
   * - ``outputOfStep3ForStep4``
     - A key-value data collection that maps the input for :ref:`Step 4 <kmeans_init_step_4>` to local nodes:
       the :math:`i`-th element of this collection is a numeric table that contains the input for
       :ref:`Step 4 <kmeans_init_step_4>` on the :math:`i`-th node.
       Note that :ref:`Step 3 <kmeans_init_step_3>` may produce no input for :ref:`Step 4 <kmeans_init_step_4>` on some local nodes,
       which means the collection may not contain the :math:`i`-th node entry.
       The single element of this numeric table is :math:`v \leq \Phi_X(C)`, where :math:`\Phi_X(C)` is the overall error calculated on the node.
       For a description of the overall error, see :ref:`KMeans Clustering Details `.
       This value defines the probability to sample a new centroid on the :math:`i`-th node.
   * - ``outputOfStep3ForStep5``
     - Applicable for ``parallelPlus`` methods only. Pointer to the service data to be used in :ref:`Step 5 <kmeans_init_step_5>`.

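
One way to picture why some nodes receive no entry in ``outputOfStep3ForStep4`` is to treat the per-node errors as consecutive segments of one line and to drop a single draw onto it. The sketch below is only an illustration of that idea, not the library's implementation: the function name, the list-based input, and the explicit draw ``r`` are all assumptions made for the example.

```python
def assign_draw_to_node(node_errors, r):
    """Map one draw r in [0, sum(node_errors)] to the node whose segment holds it.

    node_errors[i] stands in for the overall error accumulated on node i.
    Returns {node_index: v} with v <= node_errors[node_index]; nodes that
    did not receive the draw are absent from the result.
    """
    total = sum(node_errors)
    if r < 0 or r > total:
        raise ValueError("r must lie within [0, total error]")
    running = 0.0
    for node, error in enumerate(node_errors):
        if r <= running + error:
            # v is the position of the draw inside this node's segment.
            return {node: r - running}
        running += error
    # Reached only through floating-point rounding: give the draw to the last node.
    return {len(node_errors) - 1: node_errors[-1]}


if __name__ == "__main__":
    # Three nodes with accumulated errors 4, 6, and 10; the draw 7.5 lands on node 1.
    print(assign_draw_to_node([4.0, 6.0, 10.0], 7.5))
```

In this picture a node with a larger accumulated error covers a longer segment and is therefore more likely to receive the draw.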

.. _kmeans_init_step_4:

Step 4 - on Local Nodes (``plusPlus`` and ``parallelPlus`` methods)
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

.. tabs::

  .. tab:: ``plusPlus`` methods

    .. figure:: images/kmeansdistributedinitstep4plusPlusmethods.png
      :alt:

      K-Means Centroid Initialization with ``plusPlus`` methods: Distributed Processing, Step 4 - on Local Nodes

  .. tab:: ``parallelPlus`` methods

    .. figure:: images/kmeansdistributedinitstep4parallelPlusmethods.png
      :alt:

      K-Means Centroid Initialization with ``parallelPlus`` methods: Distributed Processing, Step 4 - on Local Nodes


This step is applicable for ``plusPlus`` and ``parallelPlus`` methods only.
Centroid initialization for K-Means clustering accepts the input from each local node described below.
Pass the ``Input ID`` as a parameter to the methods that provide input for your algorithm.
For more details, see :ref:`algorithms`.

.. tabularcolumns:: \Y{0.2}\Y{0.8}

.. list-table:: Input for K-Means Initialization (Distributed Processing, Step 4)
   :header-rows: 1
   :widths: 10 60
   :align: left
   :class: longtable

   * - Input ID
     - Input
   * - ``data``
     - Pointer to the :math:`n_i \times p` numeric table that represents the :math:`i`-th data block on the local node.

       .. note::

          While the input for ``defaultDense``, ``randomDense``, ``plusPlusDense``, and ``parallelPlusDense`` methods
          can be an object of any class derived from ``NumericTable``,
          the input for ``deterministicCSR``, ``randomCSR``, ``plusPlusCSR``, and ``parallelPlusCSR`` methods
          can only be an object of the ``CSRNumericTable`` class.
   * - ``inputOfStep4FromStep3``
     - Pointer to the :math:`l \times m` numeric table with the values calculated in :ref:`Step 3 <kmeans_init_step_3>`.

       The value of :math:`m` is defined by the method of the algorithm:

       - ``plusPlus`` method: :math:`m = 1`
       - ``parallelPlus`` method: :math:`m \leq L`, :math:`L = \mathrm{nClusters} * \mathrm{oversamplingFactor}`

       This input can be an object of any class derived from ``NumericTable``,
       except ``CSRNumericTable``, ``PackedTriangularMatrix``, and ``PackedSymmetricMatrix``.
   * - ``internalInput``
     - Pointer to the ``DataCollection`` object with the internal data of the distributed algorithm
       used by its local nodes in :ref:`Step 2 <kmeans_init_step_2_local>` and :ref:`Step 4 <kmeans_init_step_4>`.
       The ``DataCollection`` is created in :ref:`Step 2 <kmeans_init_step_2_local>` when ``firstIteration`` is set to ``true``.
       After that, set the ``DataCollection`` from the partial result as the input for the next local steps
       (:ref:`Step 2 <kmeans_init_step_2_local>` and :ref:`Step 4 <kmeans_init_step_4>`).


In this step, centroid initialization for K-Means clustering calculates the results described below.
Pass the ``Result ID`` as a parameter to the methods that access the results of your algorithm.
For more details, see :ref:`algorithms`.

.. tabularcolumns:: \Y{0.2}\Y{0.8}

.. list-table:: Output for K-Means Initialization (Distributed Processing, Step 4)
   :header-rows: 1
   :widths: 10 60
   :align: left

   * - Result ID
     - Result
   * - ``outputOfStep4``
     - Pointer to the :math:`m \times p` numeric table that contains centroids computed on this local node,
       where :math:`m` is the same as in ``inputOfStep4FromStep3``.

       .. note::

          By default, this result is an object of the ``HomogenNumericTable`` class,
          but you can define the result as an object of any class derived from ``NumericTable``
          except ``CSRNumericTable``, ``PackedTriangularMatrix``, and ``PackedSymmetricMatrix``.

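
A local node can turn a value :math:`v` received from Step 3 into a concrete row of its block by walking the per-row contributions to the overall error until their running sum reaches :math:`v`. The sketch below illustrates this for one value; it is an assumption about the selection rule rather than a description of the library internals, and the function name and the plain list of squared distances are hypothetical.

```python
def pick_row_for_value(min_sq_dists, v):
    """Return the index of the row selected by the value v.

    min_sq_dists[j] stands in for the squared distance from row j to its
    nearest already chosen centroid, so sum(min_sq_dists) plays the role
    of the overall error on this node and v must not exceed it.
    """
    running = 0.0
    for row, dist in enumerate(min_sq_dists):
        running += dist
        # Rows with zero distance coincide with a chosen centroid; skip them.
        if dist > 0 and v <= running:
            return row
    raise ValueError("v exceeds the overall error accumulated on this node")


if __name__ == "__main__":
    # Running sums are 1, 1, 5, 10, so v = 3.5 selects row 2.
    print(pick_row_for_value([1.0, 0.0, 4.0, 5.0], 3.5))
```

For ``parallelPlus`` methods the same walk would be repeated for each of the :math:`m` values in ``inputOfStep4FromStep3``, which matches the :math:`m \times p` shape of ``outputOfStep4``.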

.. _kmeans_init_step_5:

Step 5 - on Master Node (``parallelPlus`` methods)
++++++++++++++++++++++++++++++++++++++++++++++++++++++++

.. figure:: images/kmeansdistributedinitstep5parallelPlusmethods.png
  :width: 1000
  :alt:

  K-Means Centroid Initialization with ``parallelPlus`` methods: Distributed Processing, Step 5 - on Master Node


This step is applicable for ``parallelPlus`` methods only.
Centroid initialization for K-Means clustering accepts the input from each local node described below.
Pass the ``Input ID`` as a parameter to the methods that provide input for your algorithm.
For more details, see :ref:`algorithms`.

.. tabularcolumns:: \Y{0.2}\Y{0.8}

.. list-table:: Input for K-Means Initialization (Distributed Processing, Step 5)
   :header-rows: 1
   :widths: 10 60
   :align: left
   :class: longtable

   * - Input ID
     - Input
   * - ``inputCentroids``
     - A data collection with the centroids calculated in :ref:`Step 1 <kmeans_init_step_1>` or :ref:`Step 4 <kmeans_init_step_4>`.
       Each item in the collection is the pointer to an :math:`m \times p` numeric table,
       where the value of :math:`m` is defined by the method and the iteration of the algorithm:

       ``parallelPlus`` method:

       - :math:`m = 1` for the data added as the output of :ref:`Step 1 <kmeans_init_step_1>`
       - :math:`m \leq L`, :math:`L = \mathrm{nClusters} * \mathrm{oversamplingFactor}`
         for the data added as the output of :ref:`Step 4 <kmeans_init_step_4>`

       Each numeric table can be an object of any class derived from ``NumericTable``,
       except ``CSRNumericTable``, ``PackedTriangularMatrix``, and ``PackedSymmetricMatrix``.
   * - ``inputOfStep5FromStep2``
     - A data collection with the items calculated in :ref:`Step 2 <kmeans_init_step_2_local>` on local nodes.
       For a detailed definition, see ``outputOfStep2ForStep5`` above.
   * - ``inputOfStep5FromStep3``
     - Pointer to the service data generated as the output of :ref:`Step 3 <kmeans_init_step_3>` on the master node.
       For a detailed definition, see ``outputOfStep3ForStep5`` above.


In this step, centroid initialization for K-Means clustering calculates the results described below.
Pass the ``Result ID`` as a parameter to the methods that access the results of your algorithm.
For more details, see :ref:`algorithms`.

.. tabularcolumns:: \Y{0.2}\Y{0.8}

.. list-table:: Output for K-Means Initialization (Distributed Processing, Step 5)
   :header-rows: 1
   :widths: 10 60
   :align: left

   * - Result ID
     - Result
   * - ``centroids``
     - Pointer to the :math:`\mathrm{nClusters} \times p` numeric table with centroids.

       .. note::

          By default, this result is an object of the ``HomogenNumericTable`` class,
          but you can define the result as an object of any class derived from ``NumericTable``
          except ``PackedTriangularMatrix``, ``PackedSymmetricMatrix``, and ``CSRNumericTable``.

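
Taken together, the ``firstIteration`` and ``outputForStep5Required`` flags change across the Step 2 - Step 4 loop as sketched below. The function name is hypothetical, and the iteration counts are assumptions: ``nRounds`` passes for ``parallelPlus`` methods, and ``nClusters - 1`` passes for ``plusPlus`` methods on the premise that Step 1 supplies the first centroid and each pass adds one more.

```python
def loop_schedule(method_family, n_clusters, n_rounds=5):
    """Return the per-iteration flag values for the Step 2 - Step 4 loop."""
    if method_family == "plusPlus":
        n_iterations = n_clusters - 1  # assumption: one new centroid per pass
    elif method_family == "parallelPlus":
        n_iterations = n_rounds  # assumption: one pass per round
    else:
        raise ValueError("the loop applies to plusPlus and parallelPlus only")
    is_parallel = method_family == "parallelPlus"
    return [
        {
            # step2Local is called for the first time on the first pass only.
            "firstIteration": i == 0,
            # Ratings for Step 5 are requested on the last pass of parallelPlus.
            "outputForStep5Required": is_parallel and i == n_iterations - 1,
        }
        for i in range(n_iterations)
    ]


if __name__ == "__main__":
    for flags in loop_schedule("parallelPlus", n_clusters=20, n_rounds=3):
        print(flags)
```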