Basic Statistics¶
The basic statistics algorithm computes the following quantitative characteristics of a dataset:
minimums/maximums
sums
means
sums of squares
sums of squared differences from the means
second order raw moments
variances
standard deviations
variation coefficients
Operation | Computational methods | Programming Interface |
Mathematical formulation¶
Computing¶
Given a set \(X\) of \(n\) \(p\)-dimensional feature vectors \(x_1 = (x_{11}, \ldots, x_{1p}), \ldots, x_n = (x_{n1}, \ldots, x_{np})\), the problem is to compute the following sample characteristics for each feature in the data set:
Statistic | Definition |
---|---|
Minimum | \(min(j) = \smash{\displaystyle \min_i } \{x_{ij}\}\) |
Maximum | \(max(j) = \smash{\displaystyle \max_i } \{x_{ij}\}\) |
Sum | \(s(j) = \sum_i x_{ij}\) |
Sum of squares | \(s_2(j) = \sum_i x_{ij}^2\) |
Mean | \(m(j) = \frac {s(j)} {n}\) |
Second order raw moment | \(a_2(j) = \frac {s_2(j)} {n}\) |
Sum of squared differences from the mean | \(\text{SDM}(j) = \sum_i (x_{ij} - m(j))^2\) |
Variance | \(k_2(j) = \frac {\text{SDM}(j) } {n - 1}\) |
Standard deviation | \(\text{stdev}(j) = \sqrt {k_2(j)}\) |
Variation coefficient | \(V(j) = \frac {\text{stdev}(j)} {m(j)}\) |
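The definitions above can be sketched in plain C++ for a single feature (one column \(j\) of \(X\)). This is a minimal illustration of the formulas, not the library's implementation; the `FeatureStats` struct and the `compute_feature_stats` function are hypothetical names introduced here, and the sketch assumes \(n \geq 2\) and a nonzero mean.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical container for the per-feature statistics (not a oneDAL type).
struct FeatureStats {
    double min, max, sum, sum_squares, mean;
    double second_order_raw_moment, sum_squares_centered;
    double variance, standard_deviation, variation;
};

// Computes the statistics of one feature from its n observed values.
// Assumes n >= 2 so that the n - 1 denominator of the variance is nonzero.
FeatureStats compute_feature_stats(const std::vector<double>& column) {
    assert(column.size() >= 2);
    const double n = static_cast<double>(column.size());

    FeatureStats r{}; // all fields start at zero
    r.min = *std::min_element(column.begin(), column.end());
    r.max = *std::max_element(column.begin(), column.end());
    for (double x : column) {
        r.sum += x;             // s(j)
        r.sum_squares += x * x; // s_2(j)
    }
    r.mean = r.sum / n;                            // m(j)
    r.second_order_raw_moment = r.sum_squares / n; // a_2(j)
    for (double x : column) {
        const double d = x - r.mean;
        r.sum_squares_centered += d * d;           // SDM(j)
    }
    r.variance = r.sum_squares_centered / (n - 1.0); // k_2(j)
    r.standard_deviation = std::sqrt(r.variance);    // stdev(j)
    r.variation = r.standard_deviation / r.mean;     // V(j), meaningless if m(j) = 0
    return r;
}
```

For the values 1, 2, 3, 4 the sketch gives a sum of 10, a sum of squares of 30, a mean of 2.5, and an SDM of 5, so the variance is 5/3.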
Partial Computing¶
Given a block of a dataset \(X = \{ x_1, \ldots, x_n \}\) with \(n\) feature vectors of dimension \(p\), the partial minimums, maximums, sums, and sums of squares are each a \(1 \times p\) matrix. They are computed for each feature \(j\) with the following formulas:
Statistic | Definition |
---|---|
Partial Minimum | \(min(j) = \smash{\displaystyle \min_i } \{x_{ij}\}\) |
Partial Maximum | \(max(j) = \smash{\displaystyle \max_i } \{x_{ij}\}\) |
Partial Sum | \(s(j) = \sum_i x_{ij}\) |
Partial Sum of squares | \(s_2(j) = \sum_i x_{ij}^2\) |
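One way to picture partial computing is an accumulator that is updated block by block. The sketch below is an illustration under stated assumptions, not the library's implementation: `PartialStats` and `update` are hypothetical names, and a block is modeled as a vector of rows, each row being one feature vector of dimension \(p\).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical accumulator for the partial characteristics of p features.
struct PartialStats {
    std::size_t n = 0; // number of feature vectors folded in so far
    std::vector<double> min, max, sum, sum_squares;

    explicit PartialStats(std::size_t p)
        : min(p, std::numeric_limits<double>::infinity()),
          max(p, -std::numeric_limits<double>::infinity()),
          sum(p, 0.0),
          sum_squares(p, 0.0) {}
};

// Folds one block of feature vectors into the accumulator.
void update(PartialStats& acc, const std::vector<std::vector<double>>& block) {
    const std::size_t p = acc.sum.size();
    for (const auto& x : block) {
        assert(x.size() == p); // every vector must have p features
        for (std::size_t j = 0; j < p; ++j) {
            acc.min[j] = std::min(acc.min[j], x[j]);
            acc.max[j] = std::max(acc.max[j], x[j]);
            acc.sum[j] += x[j];
            acc.sum_squares[j] += x[j] * x[j];
        }
        ++acc.n;
    }
}
```

Because minimum, maximum, sum, and sum of squares all combine associatively, the accumulated values do not depend on how the dataset is split into blocks (up to floating-point rounding in the sums).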
Finalize Computing¶
Given the partial results accumulated over the blocks, each final statistic is a \(1 \times p\) matrix with one value per feature. The statistics are computed with the following formulas, where \(i\) runs over all \(n\) feature vectors of the dataset:
Statistic | Definition |
---|---|
Finalize Minimum | \(min(j) = \smash{\displaystyle \min_i } \{x_{ij}\}\) |
Finalize Maximum | \(max(j) = \smash{\displaystyle \max_i } \{x_{ij}\}\) |
Finalize Sum | \(s(j) = \sum_i x_{ij}\) |
Finalize Sum of squares | \(s_2(j) = \sum_i x_{ij}^2\) |
Finalize Mean | \(m(j) = \frac {s(j)} {n}\) |
Finalize Second order raw moment | \(a_2(j) = \frac {s_2(j)} {n}\) |
Finalize Sum of squared differences from the mean | \(\text{SDM}(j) = \sum_i (x_{ij} - m(j))^2\) |
Finalize Variance | \(k_2(j) = \frac {\text{SDM}(j) } {n - 1}\) |
Finalize Standard deviation | \(\text{stdev}(j) = \sqrt {k_2(j)}\) |
Finalize Variation coefficient | \(V(j) = \frac {\text{stdev}(j)} {m(j)}\) |
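The remaining statistics can be derived from the partial characteristics without revisiting the data, because expanding the square gives \(\text{SDM}(j) = \sum_i (x_{ij} - m(j))^2 = s_2(j) - \frac{s(j)^2}{n}\). The sketch below illustrates this for a single feature. It is not the library's implementation: `Partial`, `Final`, `merge`, and `finalize` are hypothetical names, and the rewritten SDM formula is an algebraic identity that can lose precision when the mean is large relative to the spread.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical partial characteristics of a single feature j.
struct Partial {
    std::size_t n;
    double min, max, sum, sum_squares;
};

// Hypothetical final characteristics derived from the partial ones.
struct Final {
    double mean, second_order_raw_moment, sum_squares_centered;
    double variance, standard_deviation, variation;
};

// Merges the partial characteristics of two disjoint blocks.
Partial merge(const Partial& a, const Partial& b) {
    return { a.n + b.n,
             std::fmin(a.min, b.min), std::fmax(a.max, b.max),
             a.sum + b.sum, a.sum_squares + b.sum_squares };
}

// Derives the final statistics; assumes n >= 2 and a nonzero mean.
Final finalize(const Partial& p) {
    assert(p.n >= 2);
    const double n = static_cast<double>(p.n);
    Final f{};
    f.mean = p.sum / n;                            // m(j)
    f.second_order_raw_moment = p.sum_squares / n; // a_2(j)
    // SDM(j) = s_2(j) - s(j)^2 / n, an algebraic rewrite of sum_i (x_ij - m(j))^2.
    f.sum_squares_centered = p.sum_squares - p.sum * p.sum / n;
    f.variance = f.sum_squares_centered / (n - 1.0); // k_2(j)
    f.standard_deviation = std::sqrt(f.variance);    // stdev(j)
    f.variation = f.standard_deviation / f.mean;     // V(j)
    return f;
}
```

For example, merging a block holding the values 1 and 2 with a block holding 3 and 4 yields \(s(j) = 10\) and \(s_2(j) = 30\), so \(\text{SDM}(j) = 30 - 100/4 = 5\), which matches the direct definition.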
Computation method: dense¶
The method computes the basic statistics for each feature in the data set.
Programming Interface¶
Refer to API Reference: Basic statistics.
Online mode¶
The algorithm supports online mode.
Distributed mode¶
The algorithm supports distributed execution in SPMD mode (only on GPU).
Usage Example¶
Computing¶
#include <iostream>

#include "oneapi/dal/algo/basic_statistics.hpp"

namespace dal = oneapi::dal;

// Computes all basic statistics of `data` in batch mode and prints them.
// Printing a table with operator<< assumes a stream helper for dal::table,
// such as the one shipped with the oneDAL examples, is in scope.
void run_computing(const dal::table& data) {
    // A default descriptor requests every supported statistic.
    const auto bs_desc = dal::basic_statistics::descriptor{};
    const auto result = dal::compute(bs_desc, data);

    std::cout << "Minimum:\n" << result.get_min() << std::endl;
    std::cout << "Maximum:\n" << result.get_max() << std::endl;
    std::cout << "Sum:\n" << result.get_sum() << std::endl;
    std::cout << "Sum of squares:\n" << result.get_sum_squares() << std::endl;
    std::cout << "Sum of squared difference from the means:\n"
              << result.get_sum_squares_centered() << std::endl;
    std::cout << "Mean:\n" << result.get_mean() << std::endl;
    std::cout << "Second order raw moment:\n" << result.get_second_order_raw_moment() << std::endl;
    std::cout << "Variance:\n" << result.get_variance() << std::endl;
    std::cout << "Standard deviation:\n" << result.get_standard_deviation() << std::endl;
    std::cout << "Variation:\n" << result.get_variation() << std::endl;
}
Examples¶
Batch Processing:
Online Processing: