.. ******************************************************************************
.. * Copyright 2020 Intel Corporation
.. *
.. * Licensed under the Apache License, Version 2.0 (the "License");
.. * you may not use this file except in compliance with the License.
.. * You may obtain a copy of the License at
.. *
.. * http://www.apache.org/licenses/LICENSE-2.0
.. *
.. * Unless required by applicable law or agreed to in writing, software
.. * distributed under the License is distributed on an "AS IS" BASIS,
.. * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
.. * See the License for the specific language governing permissions and
.. * limitations under the License.
.. *******************************************************************************/
Multivariate BACON Outlier Detection
====================================
In multivariate outlier detection methods, the observation point is the entire feature vector.
Details
*******
Given a set :math:`X` of :math:`n` feature vectors
:math:`x_1 = (x_{11}, \ldots, x_{1p}), \ldots, x_n = (x_{n1}, \ldots, x_{np})` of dimension :math:`p`,
the problem is to identify the vectors that do not belong to the underlying distribution using the BACON method (see [Billor2000]_).
The BACON method is iterative and consists of the following steps:
#. Identify an initial basic subset of :math:`m > p` feature vectors that can be assumed to contain no outliers.
The constant :math:`m` is set to :math:`5p`. The library supports two approaches to selecting the initial subset:
- Based on distances from the medians :math:`||x_i - \text{med}||`, where:
- :math:`\text{med}` is the vector of coordinate-wise medians
- :math:`||.||` is the vector norm
- :math:`i = 1, \ldots, n`
- Based on the Mahalanobis distance :math:`d_i (\text{mean}, S) = \sqrt {(x_i - \text{mean})^T S^{-1} (x_i - \text{mean})}`, where:
- :math:`\text{mean}` and :math:`S` are the mean and the covariance matrix, respectively, of :math:`n` feature vectors
- :math:`i = 1, \ldots, n`
Each method chooses the :math:`m` feature vectors with the smallest distances.
#. Compute the discrepancies using the Mahalanobis distance above, where :math:`\text{mean}` and :math:`S` are the mean and the covariance matrix, respectively, computed for the feature vectors contained in the basic subset.
#. Set the new basic subset to all feature vectors with the discrepancy less than :math:`c_{npr}\chi_{p, \frac {\alpha}{n}}`,
where:
- :math:`\chi_{p, \alpha}^2` is the :math:`(1 - \alpha)` percentile of the Chi-square distribution with :math:`p` degrees of freedom
- :math:`c_{npr} = c_{hr} + c_{np}`, where:
- :math:`r` is the size of the current basic subset
- :math:`c_{hr} = \max \{0, \frac {h - r}{h + r}\}`, where :math:`h = [\frac{n + p + 1}{2}]` and :math:`[ ]` is the integer part of a number
- :math:`c_{np} = 1 + \frac{p + 1}{n - p} + \frac{2}{n - 1 - 3p}`
#. Iterate steps 2 and 3 until the size of the basic subset no longer changes.
#. Nominate the feature vectors that are not part of the final basic subset as outliers.
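
The steps above can be sketched in plain Python for the two-dimensional case (:math:`p = 2`). This is a minimal illustration under stated assumptions, not the library's implementation: the function names (``mean_cov_2d``, ``mahalanobis_2d``, ``correction_factor``, ``bacon_2d``) and the sample data are hypothetical, only the median-based initialization is shown, and the closed forms for the :math:`2 \times 2` inverse and for the Chi-square percentile hold only for :math:`p = 2`. The sketch also assumes, following [Billor2000]_, that the distance is compared with :math:`c_{npr}` times the square root of the Chi-square percentile, so that both sides are on the same scale.

```python
import math
from statistics import median


def mean_cov_2d(points):
    """Mean vector and sample covariance (sxx, sxy, syy) of 2-D points."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    return (mx, my), (sxx, sxy, syy)


def mahalanobis_2d(point, mean, cov):
    """Mahalanobis distance, using the closed-form inverse of a 2x2 matrix."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy * sxy
    dx, dy = point[0] - mean[0], point[1] - mean[1]
    return math.sqrt((syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det)


def correction_factor(n, p, r):
    """c_npr = c_hr + c_np for a basic subset of size r."""
    h = (n + p + 1) // 2  # integer part of (n + p + 1) / 2
    c_hr = max(0.0, (h - r) / (h + r))
    c_np = 1.0 + (p + 1) / (n - p) + 2.0 / (n - 1 - 3 * p)
    return c_hr + c_np


def bacon_2d(points, alpha=0.05, max_iter=100):
    """Return a list of weights: 1 for inliers, 0 for nominated outliers."""
    n, p = len(points), 2
    # Step 1 (median-based): the m = 5p points closest to the medians.
    med = (median(x for x, _ in points), median(y for _, y in points))
    order = sorted(range(n), key=lambda i: math.dist(points[i], med))
    subset = order[: 5 * p]
    # Assumption: for p = 2 the (1 - a) Chi-square percentile is -2 * ln(a);
    # its square root puts the threshold on the scale of the distance.
    chi = math.sqrt(-2.0 * math.log(alpha / n))
    for _ in range(max_iter):
        # Step 2: discrepancies relative to the current basic subset.
        mean, cov = mean_cov_2d([points[i] for i in subset])
        limit = correction_factor(n, p, len(subset)) * chi
        # Step 3: the new basic subset.
        new_subset = [i for i in range(n)
                      if mahalanobis_2d(points[i], mean, cov) < limit]
        converged = len(new_subset) == len(subset)
        subset = new_subset
        if converged:  # Step 4: the subset size no longer changes.
            break
    inliers = set(subset)
    # Step 5: everything outside the final basic subset is an outlier.
    return [1 if i in inliers else 0 for i in range(n)]


# Hypothetical data: a 7 x 7 grid of inliers plus two distant points.
points = [(0.5 * i, 0.5 * j) for i in range(-3, 4) for j in range(-3, 4)]
points += [(10.0, 10.0), (-12.0, 9.0)]
weights = bacon_2d(points)
```

In this sketch ``weights`` follows the same convention as the algorithm output: a zero marks a feature vector that is nominated as an outlier. The sketch does not guard against a singular covariance matrix, which a production implementation would need to handle.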
Batch Processing
****************
Algorithm Input
---------------
The multivariate BACON outlier detection algorithm accepts the input described below.
Pass the ``Input ID`` as a parameter to the methods that provide input for your algorithm.
For more details, see :ref:`algorithms`.
.. tabularcolumns:: |\Y{0.2}|\Y{0.8}|
.. list-table:: Algorithm Input for Multivariate BACON Outlier Detection (Batch Processing)
:widths: 10 60
:header-rows: 1
* - Input ID
- Input
* - ``data``
- Pointer to the :math:`n \times p` numeric table with the data for outlier detection.
.. note:: The input can be an object of any class derived from the ``NumericTable`` class.
Algorithm Parameters
--------------------
The multivariate BACON outlier detection algorithm has the following parameters:
.. tabularcolumns:: |\Y{0.15}|\Y{0.15}|\Y{0.7}|
.. list-table:: Algorithm Parameters for Multivariate BACON Outlier Detection (Batch Processing)
:header-rows: 1
:widths: 10 10 60
:align: left
:class: longtable
* - Parameter
- Default Value
- Description
* - ``algorithmFPType``
- ``float``
- The floating-point type that the algorithm uses for intermediate computations. Can be ``float`` or ``double``.
* - ``initializationMethod``
- ``baconMedian``
- The initialization method. Can be:
- ``baconMedian`` - median-based method
- ``defaultDense`` - Mahalanobis distance-based method
* - ``alpha``
- :math:`0.05`
- One-tailed probability that defines the :math:`(1 - \alpha)` quantile of the :math:`\chi^2` distribution with :math:`p` degrees of freedom.
Recommended value: :math:`\frac{\alpha}{n}`, where :math:`n` is the number of observations.
* - ``toleranceToConverge``
- :math:`0.005`
- The stopping criterion. The algorithm terminates when the size of the basic subset changes by less than this threshold.
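
As a small illustration of the recommended ``alpha`` value, the following sketch divides an overall level by the number of observations and, for the special case :math:`p = 2`, evaluates the corresponding Chi-square percentile in closed form. The helper names ``per_observation_alpha`` and ``chi2_percentile_2dof`` are hypothetical and are not part of the library; for other values of :math:`p` a statistics package would be needed to compute the percentile.

```python
import math


def per_observation_alpha(level, n):
    """Divide the overall level by the number of observations n."""
    return level / n


def chi2_percentile_2dof(tail_probability):
    """(1 - a) percentile of the Chi-square distribution with 2 degrees of
    freedom; the closed form -2 * ln(a) holds only for p = 2."""
    return -2.0 * math.log(tail_probability)


# Hypothetical setting: overall level 0.05 and 100 observations.
alpha = per_observation_alpha(0.05, n=100)
cutoff = chi2_percentile_2dof(alpha)
```

Here ``alpha`` is the kind of value that could be supplied as the ``alpha`` parameter, and ``cutoff`` shows how a smaller tail probability raises the percentile used in the threshold.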
Algorithm Output
----------------
The multivariate BACON outlier detection algorithm calculates the result described below.
Pass the ``Result ID`` as a parameter to the methods that access the results of your algorithm.
For more details, see :ref:`algorithms`.
.. tabularcolumns:: |\Y{0.2}|\Y{0.8}|
.. list-table:: Algorithm Output for Multivariate BACON Outlier Detection (Batch Processing)
:widths: 10 60
:header-rows: 1
* - Result ID
- Result
* - ``weights``
- Pointer to the :math:`n \times 1` numeric table of zeros and ones.
Zero in the :math:`i`-th position indicates that the :math:`i`-th feature vector is an outlier.
.. note::
By default, the result is an object of the ``HomogenNumericTable`` class,
but you can define the result as an object of any class derived from ``NumericTable``
except the ``PackedSymmetricMatrix``, ``PackedTriangularMatrix``, and ``CSRNumericTable``.
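
The following sketch shows one way to read the ``weights`` result once it is available as an :math:`n \times 1` table of zeros and ones. It assumes the table has already been converted to a nested list (or any sequence of one-element rows); the function name ``outlier_indices`` and the sample values are hypothetical.

```python
def outlier_indices(weights):
    """Indices of rows whose weight is zero in an n x 1 table of 0/1 values."""
    return [i for i, row in enumerate(weights) if row[0] == 0]


# Hypothetical result for five feature vectors.
weights = [[1.0], [1.0], [0.0], [1.0], [0.0]]
outliers = outlier_indices(weights)
```

With this sample, the third and fifth feature vectors (zero-based indices 2 and 4) are reported as outliers.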
Examples
********
.. tabs::
.. tab:: C++ (CPU)
Batch Processing:
- :cpp_example:`out_detect_bacon_dense_batch.cpp`
.. tab:: Python*
Batch Processing:
- :daal4py_example:`bacon_outlier.py`