Covariance#

In statistics, covariance and correlation are two of the most fundamental measures of linear dependence between two random variables. Both describe the joint variability of a pair of features. The correlation is dimensionless, while the covariance is measured in units obtained by multiplying the units of the two features. Another important distinction is that the covariance is affected by the scale of each feature, so a feature with a large variance can dominate it, whereas the correlation removes this effect by dividing the covariance of the two features by the product of their standard deviations (the square roots of their variances). Which measure to use depends on the application.

The covariance algorithm computes the following:

  • Means

  • Covariance (sample and estimated by maximum likelihood method)

  • Correlation
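To illustrate the difference in scale sensitivity described above, the following sketch uses NumPy (not this library's API) on synthetic data. The feature names `height_m`, `height_mm`, and `weight_kg` are hypothetical and serve only as an example of changing the units of one feature.

```python
import numpy as np

# Synthetic, purely illustrative data: two linearly related features.
rng = np.random.default_rng(0)
height_m = rng.normal(1.7, 0.1, size=200)
weight_kg = 60.0 + 40.0 * (height_m - 1.7) + rng.normal(0.0, 5.0, size=200)

# The same first feature expressed in millimeters instead of meters.
height_mm = height_m * 1000.0

# np.cov and np.corrcoef return 2 x 2 matrices; [0, 1] is the off-diagonal entry.
cov_m = np.cov(height_m, weight_kg)[0, 1]
cov_mm = np.cov(height_mm, weight_kg)[0, 1]
cor_m = np.corrcoef(height_m, weight_kg)[0, 1]
cor_mm = np.corrcoef(height_mm, weight_kg)[0, 1]

# The covariance scales with the unit change; the correlation does not.
print("covariance ratio (mm / m):", cov_mm / cov_m)
print("correlation (m, mm):", cor_m, cor_mm)
```

Rescaling one feature multiplies the covariance by the same factor, while the correlation is unaffected up to floating-point rounding.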

| Operation | Computational methods | Programming Interface |
|---|---|---|
| Computing | dense | compute(…), compute_input, compute_result |
| Partial Computing | dense | partial_compute(…), partial_compute_input, partial_compute_result |
| Finalize Computing | dense | finalize_compute(…), partial_compute_result, compute_result |

Mathematical formulation#

Computing#

Given a dataset \(X = \{ x_1, \ldots, x_n \}\) with \(n\) feature vectors of dimension \(p\), the means form a \(1 \times p\) matrix, and the covariance and correlation matrices are \(p \times p\) square matrices. Here \(x_{ij}\) denotes the \(j\)-th feature of the \(i\)-th vector. The means, the covariance, and the correlation are computed with the following formulas:

| Statistic | Definition |
|---|---|
| Means | \(M = (m_{1}, \ldots , m_{p})\), where \(m_{j} = \frac{1}{n}\sum_{i=1}^{n}x_{ij}\) |
| Covariance matrix (sample) | \(Cov = (v_{ij})\), where \(v_{ij} = \frac{1}{n-1}\sum_{k=1}^{n}(x_{ki}-m_{i})(x_{kj}-m_{j})\), \(i = \overline{1,p}\), \(j = \overline{1,p}\) |
| Covariance matrix (maximum likelihood) | \(Cov' = (v'_{ij})\), where \(v'_{ij} = \frac{1}{n}\sum_{k=1}^{n}(x_{ki}-m_{i})(x_{kj}-m_{j})\), \(i = \overline{1,p}\), \(j = \overline{1,p}\) |
| Correlation matrix | \(Cor = (c_{ij})\), where \(c_{ij} = \frac{v_{ij}}{\sqrt{v_{ii}\cdot v_{jj}}}\), \(i = \overline{1,p}\), \(j = \overline{1,p}\) |
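The definitions above can be sketched directly in NumPy. This is a minimal illustration of the formulas, not this library's implementation. The function name `covariance_stats` and the sample matrix `X` are hypothetical, and the sketch assumes \(n \ge 2\) and nonzero variance in every feature.

```python
import numpy as np

def covariance_stats(X):
    """Return means (1 x p), sample covariance, ML covariance, and correlation (each p x p)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    means = X.sum(axis=0) / n                  # m_j = (1/n) * sum_i x_ij
    centered = X - means                       # x_kj - m_j for every row k
    cross = centered.T @ centered              # sum_k (x_ki - m_i)(x_kj - m_j)
    cov_sample = cross / (n - 1)               # v_ij
    cov_ml = cross / n                         # v'_ij
    std = np.sqrt(np.diag(cov_sample))         # sqrt(v_jj)
    cor = cov_sample / np.outer(std, std)      # c_ij = v_ij / sqrt(v_ii * v_jj)
    return means.reshape(1, p), cov_sample, cov_ml, cor

# Hypothetical 4 x 2 dataset: n = 4 feature vectors of dimension p = 2.
X = np.array([[1.0, 2.0],
              [2.0, 4.5],
              [3.0, 5.5],
              [4.0, 8.0]])
means, cov_sample, cov_ml, cor = covariance_stats(X)
print(means)
print(cov_sample)
print(cov_ml)
print(cor)
```

The two covariance estimates differ only by the factor \((n-1)/n\), and the correlation matrix is the same whichever of them is normalized.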

Partial Computing#

Given a block of a dataset \(X = \{ x_1, \ldots, x_n \}\) with \(n\) feature vectors of dimension \(p\), the sums form a \(1 \times p\) matrix and the cross product is a \(p \times p\) square matrix. The sums and the cross product are computed with the following formulas:

| Statistic | Definition |
|---|---|
| Sums | \(S = (s_{1}, \ldots , s_{p})\), where \(s_{j} = \sum_{i=1}^{n}x_{ij}\) |
| Cross product matrix | \(Crossproduct = (v_{ij})\), where \(v_{ij} = \sum_{k=1}^{n}(x_{ki}-m_{i})(x_{kj}-m_{j})\) and \(m_{j} = \frac{s_{j}}{n}\), \(i = \overline{1,p}\), \(j = \overline{1,p}\) |
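As a rough illustration of the partial step, the sketch below computes the sums and the centered cross product for a single block with NumPy. The function name `partial_compute_block` and the example block are hypothetical and do not represent this library's API. The sketch assumes the cross product is centered on the block's own means, as the formula above reads.

```python
import numpy as np

def partial_compute_block(block):
    """Return (n, sums as 1 x p, cross product as p x p) for one block of rows."""
    block = np.asarray(block, dtype=float)
    n = block.shape[0]
    sums = block.sum(axis=0)                   # column sums over the block
    centered = block - sums / n                # subtract the block means
    crossproduct = centered.T @ centered       # sum_k (x_ki - m_i)(x_kj - m_j)
    return n, sums.reshape(1, -1), crossproduct

# Hypothetical block with n = 3 rows and p = 2 features.
block = np.array([[1.0, 2.0],
                  [3.0, 6.0],
                  [5.0, 7.0]])
n, sums, crossproduct = partial_compute_block(block)
print(n)
print(sums)
print(crossproduct)
```

Together with the row count, these two quantities are enough to recover the means and the covariance of the block without revisiting the data.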

Finalize Computing#

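Given a partial result holding the accumulated sums and cross products, the means form a \(1 \times p\) matrix, and the covariance and correlation matrices are \(p \times p\) square matrices. In terms of the partial result, the means are the sums divided by \(n\), and the covariance entries are the cross product entries divided by \(n-1\) (sample) or \(n\) (maximum likelihood). The means, the covariance, and the correlation are computed with the following formulas: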

| Statistic | Definition |
|---|---|
| Means | \(M = (m_{1}, \ldots , m_{p})\), where \(m_{j} = \frac{1}{n}\sum_{i=1}^{n}x_{ij}\) |
| Covariance matrix (sample) | \(Cov = (v_{ij})\), where \(v_{ij} = \frac{1}{n-1}\sum_{k=1}^{n}(x_{ki}-m_{i})(x_{kj}-m_{j})\), \(i = \overline{1,p}\), \(j = \overline{1,p}\) |
| Covariance matrix (maximum likelihood) | \(Cov' = (v'_{ij})\), where \(v'_{ij} = \frac{1}{n}\sum_{k=1}^{n}(x_{ki}-m_{i})(x_{kj}-m_{j})\), \(i = \overline{1,p}\), \(j = \overline{1,p}\) |
| Correlation matrix | \(Cor = (c_{ij})\), where \(c_{ij} = \frac{v_{ij}}{\sqrt{v_{ii}\cdot v_{jj}}}\), \(i = \overline{1,p}\), \(j = \overline{1,p}\) |
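The finalize step can be sketched as a function of the row count, the column sums, and the centered cross product matrix. The name `finalize_from_partials` and its argument layout are assumptions made for illustration and are not taken from this library's API. The sketch also assumes the cross product is centered on the means of all \(n\) rows.

```python
import numpy as np

def finalize_from_partials(n, sums, crossproduct):
    """Turn (n, sums, centered cross product) into means, covariances, and correlation."""
    sums = np.asarray(sums, dtype=float).reshape(1, -1)
    crossproduct = np.asarray(crossproduct, dtype=float)
    means = sums / n                           # 1 x p
    cov_sample = crossproduct / (n - 1)        # unbiased (sample) estimate
    cov_ml = crossproduct / n                  # maximum likelihood estimate
    # The 1/(n-1) factor cancels in the correlation, so normalize the cross product directly.
    d = np.sqrt(np.diag(crossproduct))
    cor = crossproduct / np.outer(d, d)
    return means, cov_sample, cov_ml, cor

# Build the partial quantities from a hypothetical dataset, then finalize.
X = np.array([[2.0, 1.0, 0.5],
              [4.0, 3.0, 0.1],
              [6.0, 2.0, 0.9],
              [8.0, 5.0, 0.3]])
n = X.shape[0]
sums = X.sum(axis=0)
centered = X - sums / n
crossproduct = centered.T @ centered

means, cov_sample, cov_ml, cor = finalize_from_partials(n, sums, crossproduct)
print(means)
print(cov_sample)
print(cor)
```

Because finalization only rescales the accumulated quantities, its cost depends on \(p\) and not on the number of rows processed.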

Computation method: dense#

The method computes the means and the variance-covariance or correlation matrix for dense data. It is the default method and the only one supported.

Programming Interface#

Refer to API Reference: Covariance.

Online mode#

The algorithm supports online mode.
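The text does not state how partial results from successive blocks are combined. One common approach, shown here purely as an assumption, is the standard pairwise update for centered cross products. All function names (`block_partial`, `merge_partials`) are hypothetical and are not part of this library's API.

```python
import numpy as np

def block_partial(block):
    """Return (n, sums, centered cross product) for one block of rows."""
    block = np.asarray(block, dtype=float)
    n = block.shape[0]
    sums = block.sum(axis=0)
    centered = block - sums / n
    return n, sums, centered.T @ centered

def merge_partials(a, b):
    """Combine two partial results with the pairwise update formula."""
    n_a, s_a, c_a = a
    n_b, s_b, c_b = b
    if n_a == 0:
        return b
    if n_b == 0:
        return a
    n = n_a + n_b
    delta = s_b / n_b - s_a / n_a              # difference of the block means
    # Correction term accounts for each block being centered on its own mean.
    c = c_a + c_b + np.outer(delta, delta) * (n_a * n_b / n)
    return n, s_a + s_b, c

# Stream a synthetic dataset in four blocks, keeping only the running partial result.
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))
state = (0, np.zeros(3), np.zeros((3, 3)))
for block in np.array_split(X, 4):
    state = merge_partials(state, block_partial(block))

n, sums, crossproduct = state
cov_online = crossproduct / (n - 1)
print(cov_online)
```

In this sketch the running state holds only \(1 + p + p^2\) numbers, so memory use does not grow with the number of blocks, and the result matches a single-pass computation over the full dataset up to floating-point rounding.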

Distributed mode#

The algorithm supports distributed execution in SPMD mode (only on GPU).