Batch Processing

Input

Centroid initialization for K-Means clustering accepts the input described below. Pass the Input ID as a parameter to the methods that provide input for your algorithm.

Algorithm Input for K-Means Initialization (Batch Processing)

Input ID

Input

data

Pointer to the \(n \times p\) numeric table with the data to be clustered.

Note

The input can be an object of any class derived from NumericTable.

Parameters

The following table lists the parameters of centroid initialization for K-Means clustering. Which parameters apply depends on the initialization method selected through the method parameter.

Algorithm Parameters for K-Means Initialization (Batch Processing)

Parameter

method

Default Value

Description

algorithmFPType

any

float

The floating-point type that the algorithm uses for intermediate computations. Can be float or double.

method

Not applicable

defaultDense

Available initialization methods for K-Means clustering:

For CPU:

  • defaultDense - uses the first nClusters points as initial centroids

  • deterministicCSR - uses the first nClusters points as initial centroids for data in a CSR numeric table

  • randomDense - uses nClusters random points as initial centroids

  • randomCSR - uses nClusters random points as initial centroids for data in a CSR numeric table

  • plusPlusDense - uses the K-Means++ algorithm [Arthur2007]

  • plusPlusCSR - uses the K-Means++ algorithm for data in a CSR numeric table

  • parallelPlusDense - uses the parallel K-Means++ algorithm [Bahmani2012]

  • parallelPlusCSR - uses the parallel K-Means++ algorithm for data in a CSR numeric table

For GPU:

  • defaultDense - uses the first nClusters points as initial centroids

  • randomDense - uses nClusters random points as initial centroids

nClusters

any

Not applicable

The number of clusters. Required.

nTrials

  • parallelPlusDense

  • parallelPlusCSR

\(1\)

The number of trials used to generate every cluster except the first initial cluster. For details, see [Arthur2007], section 5.

oversamplingFactor

  • parallelPlusDense

  • parallelPlusCSR

\(0.5\)

The fraction of nClusters sampled in each of the nRounds rounds of parallel K-Means++: L = nClusters * oversamplingFactor points are sampled in a round. For details, see [Bahmani2012], section 3.3.

nRounds

  • parallelPlusDense

  • parallelPlusCSR

\(5\)

The number of rounds for parallel K-Means++. The product L * nRounds, where L = nClusters * oversamplingFactor, must be greater than nClusters. For details, see [Bahmani2012], section 3.3.

engine

any

SharedPtr<engines::mt19937::Batch>()

Pointer to the random number generator engine that is used internally to generate random numbers.
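The relationship between oversamplingFactor, nRounds, and nClusters can be checked with a few lines of arithmetic. The sketch below is illustrative only: pointsPerRound and roundsAreSufficient are hypothetical helper names, not part of the library API, and L is kept as a floating-point product on the assumption that no rounding is applied.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper, not a library function.
// Number of points sampled per round: L = nClusters * oversamplingFactor.
double pointsPerRound(std::size_t nClusters, double oversamplingFactor)
{
    assert(nClusters > 0 && oversamplingFactor > 0.0);
    return static_cast<double>(nClusters) * oversamplingFactor;
}

// Hypothetical helper: true when L * nRounds is strictly greater than
// nClusters, the condition stated for the nRounds parameter.
bool roundsAreSufficient(std::size_t nClusters, double oversamplingFactor, std::size_t nRounds)
{
    const double L = pointsPerRound(nClusters, oversamplingFactor);
    return L * static_cast<double>(nRounds) > static_cast<double>(nClusters);
}
```

With the default values (oversamplingFactor of 0.5 and nRounds of 5), L * nRounds equals 2.5 * nClusters, so the condition holds for any positive nClusters. Keeping oversamplingFactor at 0.5 and lowering nRounds to 2 gives L * nRounds equal to nClusters, which does not satisfy the strict inequality.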

Output

Centroid initialization for K-Means clustering calculates the result described below. Pass the Result ID as a parameter to the methods that access the results of your algorithm.

Algorithm Output for K-Means Initialization (Batch Processing)

Result ID

Result

centroids

Pointer to the \(nClusters \times p\) numeric table with the cluster centroids.

Note

By default, this result is an object of the HomogenNumericTable class, but you can define the result as an object of any class derived from NumericTable except for PackedTriangularMatrix, PackedSymmetricMatrix, and CSRNumericTable.
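To make the shape of the result concrete, the following minimal sketch mimics what the defaultDense and randomDense methods are described as doing: it takes an \(n \times p\) data set and returns an \(nClusters \times p\) set of centroids. It is not the library's implementation. The Table alias and both function names are hypothetical, std::mt19937 from the C++ standard library stands in for the engine parameter, and drawing random rows without replacement is an assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

// Hypothetical stand-in for a dense numeric table: n rows of p values each.
using Table = std::vector<std::vector<double>>;

// defaultDense-style choice: the first nClusters rows become the centroids.
Table firstRowsAsCentroids(const Table& data, std::size_t nClusters)
{
    assert(nClusters > 0 && nClusters <= data.size());
    return Table(data.begin(), data.begin() + static_cast<std::ptrdiff_t>(nClusters));
}

// randomDense-style choice: nClusters rows picked at random.
// Assumption: rows are drawn without replacement, so no row is picked twice.
Table randomRowsAsCentroids(const Table& data, std::size_t nClusters, std::mt19937& engine)
{
    assert(nClusters > 0 && nClusters <= data.size());

    // Shuffle the row indices and keep the first nClusters of them.
    std::vector<std::size_t> indices(data.size());
    std::iota(indices.begin(), indices.end(), std::size_t{0});
    std::shuffle(indices.begin(), indices.end(), engine);

    Table centroids;
    centroids.reserve(nClusters);
    for (std::size_t i = 0; i < nClusters; ++i)
        centroids.push_back(data[indices[i]]);
    return centroids;
}
```

In both cases the returned table has nClusters rows and the same number of columns as the input, matching the \(nClusters \times p\) shape of the centroids result.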