Distributed Processing

This mode assumes that the data set is split into nblocks blocks across computation nodes.

Parameters

Centroid initialization for K-Means clustering in the distributed processing mode has the following parameters:

Algorithm Parameters for K-Means Initialization (Distributed Processing)

Parameter

Method

Default Valude

Description

computeStep

any

Not applicable

The parameter required to initialize the algorithm. Can be:

  • step1Local - the first step, performed on local nodes. Applicable for all methods.

  • step2Master - the second step, performed on a master node. Applicable for deterministic and random methods only.

  • step2Local - the second step, performed on local nodes. Applicable for plusPlus and parallelPlus methods only.

  • step3Master - the third step, performed on a master node. Applicable for plusPlus and ParallelPlus methods only.

  • step4Local - the forth step, performed on local nodes. Applicable for plusPlus and parallelPlus methods only.

  • step5Master - the fifth step, performed on a master node. Applicable for plusPlus and parallelPlus methods only.

algorithmFPType

any

float

The floating-point type that the algorithm uses for intermediate computations. Can be float or double.

method

Not applicable

defaultDense

Available initialization methods for K-Means clustering:

  • defaultDense - uses first nClusters feature vectors as initial centroids

  • deterministicCSR - uses first nClusters feature vectors as initial centroids for data in a CSR numeric table

  • randomDense - uses random nClusters feature vectors as initial centroids

  • randomCSR - uses random nClusters feature vectors as initial centroids for data in a CSR numeric table

  • plusPlusDense - uses K-Means++ algorithm [Arthur2007]

  • plusPlusCSR - uses K-Means++ algorithm for data in a CSR numeric table

  • parallelPlusDense - uses parallel K-Means++ algorithm [Bahmani2012]

  • parallelPlusCSR - uses parallel K-Means++ algorithm for data in a CSR numeric table

For more details, see the algorithm description.

nClusters

any

Not applicable

The number of centroids. Required.

nRowsTotal

any

\(0\)

The total number of rows in all input data sets on all nodes. Required in the distributed processing mode in the first step.

offset

any

Not applicable

Offset in the total data set specifying the start of a block stored on a given local node. Required.

oversamplingFactor

  • parallelPlusDense

  • parallelPlusCSR

\(0.5\)

A fraction of nClusters in each of nRounds of parallel K-Means++. \(L = \mathrm{nClusters}*\mathrm{oversamplingFactor}\) points are sampled in a round. For details, see [Bahmani2012], section 3.3.

nRounds

  • parallelPlusDense

  • parallelPlusCSR

\(5\)

The number of rounds for parallel K-Means++. \(L * \mathrm{nRounds}\) must be greater than nClusters. For details, see [Bahmani2012], section 3.3.

firstIteration

  • parallelPlusDense

  • parallelPlusCSR

  • plusPlusDense

  • plusPlusCSR

false

Set to true if step2Local is called for the first time.

outputForStep5Required

  • parallelPlusDense

  • parallelPlusCSR

false

Set to true if step4Local is called on the last iteration of the Step 2 - Step 4 loop.

Centroid initialization for K-Means clustering follows the general schema described in Algorithms.

K-Means Centroid Initialization with plusPlus methods: Distributed Processing

Step 1 - on Local Nodes (deterministic, random, plusPlus, and parallelPlus methods)

K-Means Centroid Initialization with plusPlus methods: Distributed Processing, Step 1 - on Local Nodes

In this step, centroid initialization for K-Means clustering accepts the input described below. Pass the Input ID as a parameter to the methods that provide input for your algorithm. For more details, see Algorithms.

Input for K-Means Initialization (Distributed Processing, Step 1)

Input ID

Input

data

Pointer to the \(n_i \times p\) numeric table that represents the \(i\)-th data block on the local node.

Note

While the input for defaultDense, randomDense, plusPlusDense, and parallelPlusDense methods can be an object of any class derived from NumericTable, the input for deterministicCSR, randomCSR, plusPlusCSR, and parallelPlusCSR methods can only be an object of the CSRNumericTable class.

In this step, centroid initialization for K-Means clustering calculates the results described below. Pass the Result ID as a parameter to the methods that access the results of your algorithm. For more details, see Algorithms.

Output for K-Means Initialization (Distributed Processing, Step 1)

Result ID

Result

partialCentroids

Pointer to the \(\mathrm{nClusters} \times p\) numeric table with the centroids computed on the local node.

Note

By default, this result is an object of the HomogenNumericTable class, but you can define the result as an object of any class derived from NumericTable except PackedTriangularMatrix, PackedSymmetricMatrix, and CSRNumericTable.

Step 2 - on Master Node (deterministic and random methods)

This step is applicable for deterministic and random methods only. Centroid initialization for K-Means clustering accepts the input from each local node described below. Pass the Input ID as a parameter to the methods that provide input for your algorithm. For more details, see Algorithms.

Input for K-Means Initialization (Distributed Processing, Step 2 on Master Node)

Input ID

Input

partialResuts

A collection that contains results computed in Step 1 on local nodes (two numeric tables from each local node).

In this step, centroid initialization for K-Means clustering calculates the results described below. Pass the Result ID as a parameter to the methods that access the results of your algorithm. For more details, see Algorithms.

Output for K-Means Initialization (Distributed Processing, Step 2 on Master Node)

Result ID

Result

centroids

Pointer to the \(\mathrm{nClusters} \times p\) numeric table with centroids.

Note

By default, this result is an object of the HomogenNumericTable class, but you can define the result as an object of any class derived from NumericTable except PackedTriangularMatrix, PackedSymmetricMatrix, and CSRNumericTable.

Step 2 - on Local Nodes (plusPlus and parallelPlus methods)

K-Means Centroid Initialization with plusPlus methods: Distributed Processing, Step 2 - on Local Nodes

This step is applicable for plusPlus and parallelPlus methods only. Centroid initialization for K-Means clustering accepts the input from each local node described below. Pass the Input ID as a parameter to the methods that provide input for your algorithm. For more details, see Algorithms.

Input for K-Means Initialization (Distributed Processing, Step 1 on Local Nodes)

Input ID

Input

data

Pointer to the \(n_i \times p\) numeric table that represents the \(i\)-th data block on the local node.

Note

While the input for defaultDense, randomDense, plusPlusDense, and parallelPlusDense methods can be an object of any class derived from NumericTable, the input for deterministicCSR, randomCSR, plusPlusCSR, and parallelPlusCSR methods can only be an object of the CSRNumericTable class.

inputOfStep2

Pointer to the \(m \times p\) numeric table with the centroids calculated in the previous steps (Step 1 or Step 4).

The value of \(m\) is defined by the method and iteration of the algorithm:

  • plusPlus method: \(m = 1\)

  • parallelPlus method:

    • \(m = 1\) for the first iteration of the Step 2 - Step 4 loop

    • \(m = L = \mathrm{nClusters} * \mathrm{oversamplingFactor}\) for other iterations

This input can be an object of any class derived from NumericTable, except CSRNumericTable, PackedTriangularMatrix, and PackedSymmetricMatrix.

internalInput

Pointer to the DataCollection object with the internal data of the distributed algorithm used by its local nodes in Step 2 and Step 4. The DataCollection is created in Step 2 when firstIteration is set to true, and then the DataCollection should be set from the partial result as an input for next local steps (Step 2 and Step 4).

In this step, centroid initialization for K-Means clustering calculates the results described below. Pass the Result ID as a parameter to the methods that access the results of your algorithm. For more details, see Algorithms.

Output for K-Means Initialization (Distributed Processing, Step 2 on Local Nodes)

Result ID

Result

outputOfStep2ForStep3

Pointer to the \(1 \times 1\) numeric table that contains the overall error accumulated on the node. For a description of the overall error, see K-Means Clustering Details.

outputOfStep2ForStep5

Applicable for parallelPlus methods only and calculated when outputForStep5Required is set to true. Pointer to the \(1 \times m\) numeric table with the ratings of centroid candidates computed on the previous steps and \(m = \mathrm{oversamplingFactor} * \mathrm{nClusters} * \mathrm{nRounds} + 1\). For a description of ratings, see K-Means Clustering Details.

Note

By default, these results are objects of the HomogenNumericTable class, but you can define the result as an object of any class derived from NumericTable except PackedTriangularMatrix, PackedSymmetricMatrix, and CSRNumericTable.

Step 3 - on Master Node (plusPlus and parallelPlus methods)

K-Means Centroid Initialization with plusPlus methods: Distributed Processing, Step 3 - on Master Node

This step is applicable for plusPlus and parallelPlus methods only. Centroid initialization for K-Means clustering accepts the input from each local node described below. Pass the Input ID as a parameter to the methods that provide input for your algorithm. For more details, see Algorithms.

Input for K-Means Initialization (Distributed Processing, Step 3)

Input ID

Input

inputOfStep3FromStep2

A key-value data collection that maps parts of the accumulated error to the local nodes: \(i\)-th element of this collection is a numeric table that contains overall error accumulated on the \(i\)-th node.

In this step, centroid initialization for K-Means clustering calculates the results described below. Pass the Result ID as a parameter to the methods that access the results of your algorithm. For more details, see Algorithms.

Output for K-Means Initialization (Distributed Processing, Step 3)

Result ID

Result

outputOfStep3ForStep4

A key-value data collection that maps the input from Step 4 to local nodes: \(i\)-th element of this collection is a numeric table that contains the input from Step 4 on the i-th node.

Note that Step 3 may produce no input for Step 4 on some local nodes, which means the collection may not contain the \(i\)-th node entry. The single element of this numeric table \(v \leq \Phi_X(C)\), where the overall error \(\Phi_X(C)\) calculated on the node. For a description of the overall error, see K-Means Clustering Details.

This value defines the probability to sample a new centroid on the \(i\)-th node.

outputOfStep3ForStep5

Applicable for parallelPlus methods only. Pointer to the service data to be used in Step 5.

Step 4 - on Local Nodes (plusPlus and parallelPlus methods)

K-Means Centroid Initialization with plusPlus methods: Distributed Processing, Step 4 - on Local Nodes

This step is applicable for plusPlus and parallelPlus methods only. Centroid initialization for K-Means clustering accepts the input from each local node described below. Pass the Input ID as a parameter to the methods that provide input for your algorithm. For more details, see Algorithms.

Input for K-Means Initialization (Distributed Processing, Step 4)

Input ID

Input

data

Pointer to the \(n_i \times p\) numeric table that represents the \(i\)-th data block on the local node.

Note

While the input for defaultDense, randomDense, plusPlusDense, and parallelPlusDense methods can be an object of any class derived from NumericTable, the input for deterministicCSR, randomCSR, plusPlusCSR, and parallelPlusCSR methods can only be an object of the CSRNumericTable class.

inputOfStep4FromStep3

Pointer to the \(l \times m\) numeric table with the values calculated in Step 3.

The value of \(m\) is defined by the method of the algorithm:

  • plusPlus method: \(m = 1\)

  • parallelPlus method: \(m \leq L\), \(L = \mathrm{nClusters} * \mathrm{oversamplingFactor}\)

This input can be an object of any class derived from NumericTable, except CSRNumericTable, PackedTriangularMatrix, and PackedSymmetricMatrix.

internalInput

Pointer to the DataCollection object with the internal data of the distributed algorithm used by its local nodes in Step 2 and Step 4. The DataCollection is created in Step 2 when firstIteration is set to true, and then the DataCollection should be set from the partial result as the input for next local steps (Step 2 and Step 4).

In this step, centroid initialization for K-Means clustering calculates the results described below. Pass the Result ID as a parameter to the methods that access the results of your algorithm. For more details, see Algorithms.

Output for K-Means Initialization (Distributed Processing, Step 4)

Result ID

Result

outputOfStep4

Pointer to the \(m \times p\) numeric table that contains centroids computed on this local node, where \(m\) equals to the one in inputOfStep4FromStep3.

Note

By default, this result is an object of the HomogenNumericTable class, but you can define the result as an object of any class derived from NumericTable except CSRNumericTable, PackedTriangularMatrix, and PackedSymmetricMatrix.

Step 5 - on Master Node (parallelPlus methods)

K-Means Centroid Initialization with parrallelPlus methods: Distributed Processing, Step 5 - on Master Node

This step is applicable for parallelPlus methods only. Centroid initialization for K-Means clustering accepts the input from each local node described below. Pass the Input ID as a parameter to the methods that provide input for your algorithm. For more details, see Algorithms.

Input for K-Means Initialization (Distributed Processing, Step 5)

Input ID

Input

inputCentroids

A data collection with the centroids calculated in Step 1 or Step 4. Each item in the collection is the pointer to \(m \times p\) numeric table, where the value of \(m\) is defined by the method and the iteration of the algorithm:

parallelPlus method:

  • \(m = 1\) for the data added as the output of Step 1

  • \(m \leq L\), \(L = \mathrm{nClusters} * \mathrm{oversamplingFactor}\) for the data added as the output of Step 4

Each numeric table can be an object of any class derived from NumericTable, except CSRNumericTable, PackedTriangularMatrix, and PackedSymmetricMatrix.

inputOfStep5FromStep2

A data collection with the items calculated in Step 2 on local nodes. For a detailed definition, see outputOfStep2ForStep5 above.

inputOfStep5FromStep3

Pointer to the service data generated as the output of Step 3 on master node. For a detailed definition, see outputOfStep3ForStep5 above.

In this step, centroid initialization for K-Means clustering calculates the results described below. Pass the Result ID as a parameter to the methods that access the results of your algorithm. For more details, see Algorithms.

Output for K-Means Initialization (Distributed Processing, Step 5)

Result ID

Result

centroids

Pointer to the \(\mathrm{nClusters} \times p\) numeric table with centroids.

Note

By default, this result is an object of the HomogenNumericTable class, but you can define the result as an object of any class derived from NumericTable except PackedTriangularMatrix, PackedSymmetricMatrix, and CSRNumericTable.