Basic Statistics

The basic statistics algorithm computes the following quantitative characteristics of a dataset:

  • minimums/maximums

  • sums

  • means

  • sums of squares

  • sums of squared differences from the means

  • second order raw moments

  • variances

  • standard deviations

  • variations (variation coefficients)

Operation           Computational methods  Programming Interface
Computing           dense                  compute(…), compute_input, compute_result
Partial Computing   dense                  partial_compute(…), partial_compute_input, partial_compute_result
Finalize Computing  dense                  finalize_compute(…), partial_compute_result, compute_result

Mathematical formulation

Computing

Given a set \(X\) of \(n\) \(p\)-dimensional feature vectors \(x_1 = (x_{11}, \ldots, x_{1p}), \ldots, x_n = (x_{n1}, \ldots, x_{np})\), the problem is to compute the following sample characteristics for each feature in the data set:

Statistic                                  Definition
Minimum                                    \(min(j) = \smash{\displaystyle \min_i } \{x_{ij}\}\)
Maximum                                    \(max(j) = \smash{\displaystyle \max_i } \{x_{ij}\}\)
Sum                                        \(s(j) = \sum_i x_{ij}\)
Sum of squares                             \(s_2(j) = \sum_i x_{ij}^2\)
Mean                                       \(m(j) = \frac {s(j)} {n}\)
Second order raw moment                    \(a_2(j) = \frac {s_2(j)} {n}\)
Sum of squared differences from the mean   \(\text{SDM}(j) = \sum_i (x_{ij} - m(j))^2\)
Variance                                   \(k_2(j) = \frac {\text{SDM}(j) } {n - 1}\)
Standard deviation                         \(\text{stdev}(j) = \sqrt {k_2(j)}\)
Variation coefficient                      \(V(j) = \frac {\text{stdev}(j)} {m(j)}\)
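The definitions above can be traced for a single feature with a short, library-independent sketch. The names `feature_stats` and `compute_feature_stats` are hypothetical and are not part of the library API; the code only mirrors the formulas, assuming the values of one feature \(j\) are available as a `std::vector<double>` with at least two elements.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical container for the statistics of one feature j.
struct feature_stats {
    double min, max, sum, sum_squares, mean, second_order_raw_moment,
        sum_squares_centered, variance, standard_deviation, variation;
};

// Evaluates the formulas for one feature column x_{1j}, ..., x_{nj}.
// Requires n >= 2 so that the n - 1 denominator of the variance is nonzero.
feature_stats compute_feature_stats(const std::vector<double>& x) {
    assert(x.size() >= 2);
    const double n = static_cast<double>(x.size());

    feature_stats r{}; // all fields start at zero
    r.min = *std::min_element(x.begin(), x.end());
    r.max = *std::max_element(x.begin(), x.end());

    // First pass: s(j) and s_2(j).
    for (double v : x) {
        r.sum += v;
        r.sum_squares += v * v;
    }
    r.mean = r.sum / n;                            // m(j)
    r.second_order_raw_moment = r.sum_squares / n; // a_2(j)

    // Second pass: SDM(j), which needs the mean.
    for (double v : x) {
        const double d = v - r.mean;
        r.sum_squares_centered += d * d;
    }
    r.variance = r.sum_squares_centered / (n - 1.0); // k_2(j)
    r.standard_deviation = std::sqrt(r.variance);    // stdev(j)
    r.variation = r.standard_deviation / r.mean;     // V(j), undefined if m(j) = 0
    return r;
}
```

For the column 1, 2, 3, 4 this gives \(s = 10\), \(s_2 = 30\), \(m = 2.5\), \(\text{SDM} = 5\), and \(k_2 = 5/3\).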

Partial Computing

Given a block \(X = \{ x_1, \ldots, x_n \}\) of the dataset with \(n\) feature vectors of dimension \(p\), each partial result is a \(1 \times p\) matrix that holds one value per feature. The partial results are computed with the following formulas:

Statistic               Definition
Partial Minimum         \(min(j) = \smash{\displaystyle \min_i } \{x_{ij}\}\)
Partial Maximum         \(max(j) = \smash{\displaystyle \max_i } \{x_{ij}\}\)
Partial Sum             \(s(j) = \sum_i x_{ij}\)
Partial Sum of squares  \(s_2(j) = \sum_i x_{ij}^2\)
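As an illustration of how these four partial results can be accumulated block by block, here is a minimal sketch. It is not the library's implementation: `partial_stats` and `accumulate_block` are hypothetical names, and each block is assumed to be a row-major array of `rows * p` doubles.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical accumulator holding the partial results for p features.
struct partial_stats {
    std::size_t row_count = 0; // number of rows seen so far
    std::vector<double> min, max, sum, sum_squares;

    explicit partial_stats(std::size_t p)
            : min(p, std::numeric_limits<double>::infinity()),
              max(p, -std::numeric_limits<double>::infinity()),
              sum(p, 0.0),
              sum_squares(p, 0.0) {}
};

// Folds one row-major block (rows x p values) into the accumulator.
void accumulate_block(partial_stats& ps, const std::vector<double>& block) {
    const std::size_t p = ps.sum.size();
    assert(p > 0 && block.size() % p == 0);
    const std::size_t rows = block.size() / p;

    for (std::size_t i = 0; i < rows; ++i) {
        for (std::size_t j = 0; j < p; ++j) {
            const double v = block[i * p + j];
            ps.min[j] = std::min(ps.min[j], v); // partial minimum
            ps.max[j] = std::max(ps.max[j], v); // partial maximum
            ps.sum[j] += v;                     // partial sum
            ps.sum_squares[j] += v * v;         // partial sum of squares
        }
    }
    ps.row_count += rows;
}
```

Because minimum, maximum, and sums are all associative, the blocks can arrive in any order and with any number of rows; only the running row count has to be kept alongside them for the finalization step.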

Finalize Computing

Given the partial results, each final statistic is a \(1 \times p\) matrix that holds one value per feature. The final statistics are computed with the following formulas:

Statistic                                           Definition
Finalize Minimum                                    \(min(j) = \smash{\displaystyle \min_i } \{x_{ij}\}\)
Finalize Maximum                                    \(max(j) = \smash{\displaystyle \max_i } \{x_{ij}\}\)
Finalize Sum                                        \(s(j) = \sum_i x_{ij}\)
Finalize Sum of squares                             \(s_2(j) = \sum_i x_{ij}^2\)
Finalize Mean                                       \(m(j) = \frac {s(j)} {n}\)
Finalize Second order raw moment                    \(a_2(j) = \frac {s_2(j)} {n}\)
Finalize Sum of squared differences from the mean   \(\text{SDM}(j) = \sum_i (x_{ij} - m(j))^2\)
Finalize Variance                                   \(k_2(j) = \frac {\text{SDM}(j) } {n - 1}\)
Finalize Standard deviation                         \(\text{stdev}(j) = \sqrt {k_2(j)}\)
Finalize Variation coefficient                      \(V(j) = \frac {\text{stdev}(j)} {m(j)}\)
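The sketch below shows one way the final statistics could be derived from partial results alone, without revisiting the data. It is an assumption made for illustration, not a description of the library's internals. It relies on the identity \(\text{SDM}(j) = s_2(j) - s(j)^2 / n\), which follows from expanding the square in the definition of \(\text{SDM}(j)\) and substituting \(s(j) = n \, m(j)\). The names `partial_result`, `final_stats`, `merge`, and `finalize` are hypothetical, and the code handles a single feature.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical partial result for one feature.
struct partial_result {
    std::size_t n; // number of rows covered
    double min, max, sum, sum_squares;
};

// Hypothetical final result for one feature.
struct final_stats {
    double min, max, sum, sum_squares, mean, second_order_raw_moment,
        sum_squares_centered, variance, standard_deviation, variation;
};

// Combines the partial results of two disjoint blocks.
partial_result merge(const partial_result& a, const partial_result& b) {
    return { a.n + b.n,
             std::min(a.min, b.min),
             std::max(a.max, b.max),
             a.sum + b.sum,
             a.sum_squares + b.sum_squares };
}

// Derives the final statistics from a merged partial result.
// Requires n >= 2 so that the n - 1 denominator is nonzero.
final_stats finalize(const partial_result& p) {
    assert(p.n >= 2);
    const double n = static_cast<double>(p.n);

    final_stats r{};
    r.min = p.min;
    r.max = p.max;
    r.sum = p.sum;
    r.sum_squares = p.sum_squares;
    r.mean = p.sum / n;
    r.second_order_raw_moment = p.sum_squares / n;
    // Identity SDM = s_2 - s^2 / n; avoids a second pass over the data.
    r.sum_squares_centered = p.sum_squares - p.sum * p.sum / n;
    r.variance = r.sum_squares_centered / (n - 1.0);
    r.standard_deviation = std::sqrt(r.variance);
    r.variation = r.standard_deviation / r.mean; // undefined if the mean is 0
    return r;
}
```

The single-pass identity is convenient but can lose precision through cancellation when the mean is large relative to the spread of the values, so a production implementation may prefer a numerically more careful update.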

Computation method: dense

The method computes the basic statistics for each feature in the data set.

Programming Interface

Refer to API Reference: Basic statistics.

Online mode

The algorithm supports online mode.

Distributed mode

The algorithm supports distributed execution in SPMD mode (only on GPU).

Usage Example

Computing

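 #include <iostream>

 #include "oneapi/dal/algo/basic_statistics.hpp"

 namespace dal = oneapi::dal;

 // Computes the basic statistics of `data` in batch mode and prints each result.
 // Printing a table with operator<< assumes that a suitable overload is in scope,
 // such as the helper shipped with the library's example utilities.
 void run_computing(const dal::table& data) {
     // Default-constructed descriptor for the basic statistics algorithm.
     const auto bs_desc = dal::basic_statistics::descriptor{};

     // Run the computation on the whole table at once.
     const auto result = dal::compute(bs_desc, data);

     // Each getter returns a table with one value per feature.
     std::cout << "Minimum:\n" << result.get_min() << std::endl;
     std::cout << "Maximum:\n" << result.get_max() << std::endl;
     std::cout << "Sum:\n" << result.get_sum() << std::endl;
     std::cout << "Sum of squares:\n" << result.get_sum_squares() << std::endl;
     std::cout << "Sum of squared differences from the means:\n"
               << result.get_sum_squares_centered() << std::endl;
     std::cout << "Mean:\n" << result.get_mean() << std::endl;
     std::cout << "Second order raw moment:\n"
               << result.get_second_order_raw_moment() << std::endl;
     std::cout << "Variance:\n" << result.get_variance() << std::endl;
     std::cout << "Standard deviation:\n" << result.get_standard_deviation() << std::endl;
     std::cout << "Variation:\n" << result.get_variation() << std::endl;
 }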

Examples