Multivariate BACON Outlier Detection

In multivariate outlier detection methods, the observation point is the entire feature vector.

Details

Given a set \(X\) of \(n\) feature vectors \(x_1 = (x_{11}, \ldots, x_{1p}), \ldots, x_n = (x_{n1}, \ldots, x_{np})\) of dimension \(p\), the problem is to identify the vectors that do not belong to the underlying distribution using the BACON method (see [Billor2000]).

In the iterative method, each iteration involves several steps:

  1. Identify an initial basic subset of \(m > p\) feature vectors that can be assumed as not containing outliers. The constant \(m\) is set to \(5p\). The library supports two approaches to selecting the initial subset:

    • Based on distances from the medians \(||x_i - \text{med}||\), where:

      • med is the vector of coordinate-wise medians

      • \(||.||\) is the vector norm

      • \(i = 1, \ldots, n\)

    • Based on the Mahalanobis distance \(d_i (\text{mean}, S) = \sqrt {(x_i - \text{mean})^T s^{-1} (x_i - \text{mean})}\), where:

      • mean and \(S\) are the mean and the covariance matrix, respectively, of \(n\) feature vectors

      • \(i = 1, \ldots, n\)

    Each method chooses \(m\) feature vectors with the smallest values of distances.

  2. Compute the discrepancies using the Mahalanobis distance above, where mean and S are the mean and the covariance matrix, respectively, computed for the feature vectors contained in the basic subset.

  3. Set the new basic subset to all feature vectors with the discrepancy less than \(c_{npr}\chi_{p, \frac {\alpha}{n}}^2\), where:

    • \(chi_{p, \alpha}^2\) is the \((1 - \alpha)\) percentile of the Chi-square distribution with \(p\) degrees of freedom

    • \(c_{npr} = c_{hr} + c_{np}\), where:

      • \(r\) is the size of the current basic subset

      • \(c_{hr} = \max \{0, \frac {h - r}{h + r}\}\), where \(h = [\frac{n + p + 1}{2}]\) and \([ ]\) is the integer part of a number

      • \(c_{np} = 1 + \frac{p + 1}{n - p} + \frac{2}{n - 1 - 3p}\)

  4. Iterate steps 2 and 3 until the size of the basic subset no longer changes.

  5. Nominate the feature vectors that are not part of the final basic subset as outliers.

Batch Processing

Algorithm Input

The multivariate BACON outlier detection algorithm accepts the input described below. Pass the Input ID as a parameter to the methods that provide input for your algorithm. For more details, see Algorithms.

Algorithm Input for Multivariate BACON Outlier Detection (Batch Processing)

Input ID

Input

data

Pointer to the \(n \times p\) numeric table with the data for outlier detection.

Note

The input can be an object of any class derived from the NumericTable class.

Algorithm Parameters

The multivariate BACON outlier detection algorithm has the following parameters:

Algorithm Parameters for Multivariate BACON Outlier Detection (Batch Processing)

Parameter

Default Value

Description

algorithmFPType

float

The floating-point type that the algorithm uses for intermediate computations. Can be float or double.

initializationMethod

baconMedian

The initialization method, can be:

  • baconMedian - median-based method

  • defaultDense - Mahalanobis distance-based method

alpha

\(0.05\)

One-tailed probability that defines the \((1 - \alpha)\) quantile of the \(\chi^2\) distribution with \(p\) degrees of freedom.

Recommended value: \(\frac{\alpha}{n}\), where \(n\) is the number of observations.

toleranceToConverge

\(0.005\)

The stopping criterion. The algorithm is terminated if the size of the basic subset is changed by less than the threshold.

Algorithm Output

The multivariate BACON outlier detection algorithm calculates the result described below. Pass the Result ID as a parameter to the methods that access the results of your algorithm. For more details, see Algorithms.

Algorithm Output for Multivariate BACON Outlier Detection (Batch Processing)

Result ID

Result

weights

Pointer to the \(n \times 1\) numeric table of zeros and ones. Zero in the \(i\)-th position indicates that the \(i\)-th feature vector is an outlier.

Note

By default, the result is an object of the HomogenNumericTable class, but you can define the result as an object of any class derived from NumericTable except the PackedSymmetricMatrix, PackedTriangularMatrix, and CSRNumericTable.

Examples