k-Nearest Neighbors Classification, Regression, and Search (k-NN)¶
\(k\)-NN classification, regression, and search algorithms are based on finding the \(k\) observations in the training set that are nearest to a given feature vector. For classification, the problem is to infer the class of a new feature vector by taking the majority vote of its \(k\) nearest observations from the training set. For regression, the problem is to infer the target value of a new feature vector by averaging the target values of its \(k\) nearest observations from the training set. For search, the problem is to identify the \(k\) nearest observations from the training set to a new feature vector. The nearest observations are determined by the chosen distance metric.
Mathematical formulation¶
Training¶
For classification, let \(X = \{ x_1, \ldots, x_n \}\) be the training set of \(p\)-dimensional feature vectors and let \(Y = \{ y_1, \ldots, y_n \}\) be the set of class labels, where \(y_i \in \{ 0, \ldots, C-1 \}\), \(1 \leq i \leq n\), and \(C\) is the number of classes. Given \(X\), \(Y\), and the number of nearest neighbors \(k\), the problem is to build a model that allows distance computation between the feature vectors in the training and inference sets at the inference stage.
For regression, let \(X = \{ x_1, \ldots, x_n \}\) be the training set of \(p\)-dimensional feature vectors and let \(Y = \{ y_1, \ldots, y_n \}\) be the corresponding continuous target values, where \(y_i \in \mathbb{R}\), \(1 \leq i \leq n\). Given \(X\), \(Y\), and the number of nearest neighbors \(k\), the problem is to build a model that allows distance computation between the feature vectors in the training and inference sets at the inference stage.
For search, let \(X = \{ x_1, \ldots, x_n \}\) be the training set of \(p\)-dimensional feature vectors. Given \(X\) and the number of nearest neighbors \(k\), the problem is to build a model that allows distance computation between the feature vectors in the training and inference sets at the inference stage.
Training method: brute-force¶
The training operation produces the model that stores all the feature vectors from the initial training set \(X\).
Training method: k-d tree¶
The training operation builds a \(k\)-\(d\) tree that partitions the training set \(X\) (for more details, see k-d Tree).
Inference¶
For classification, let \(X' = \{ x_1', \ldots, x_m' \}\) be the inference set of \(p\)-dimensional feature vectors. Given \(X'\), the model produced at the training stage, and the number of nearest neighbors \(k\), the problem is to predict the label \(y_j' \in \{ 0, \ldots, C-1 \}\) for each \(x_j'\), \(1 \leq j \leq m\), by performing the following steps:
Identify the set \(N(x_j') \subseteq X\) of the \(k\) feature vectors in the training set that are nearest to \(x_j'\) with respect to the Euclidean distance, which is used by default. The distance can be customized by choosing one of the predefined pairwise distances: Minkowski distance with fractional degree (including the Euclidean distance), Chebyshev distance, or cosine distance.
Estimate the conditional probability for the \(l\)-th class as the fraction of vectors in \(N(x_j')\) whose labels \(y_j\) are equal to \(l\):
(1)¶\[P_{jl} = \frac{1}{| N(x_j') |} \Big| \big\{ x_r \in N(x_j') : y_r = l \big\} \Big|, \quad 1 \leq j \leq m, \; 0 \leq l < C.\]
Predict the class that has the highest probability for the feature vector \(x_j'\):
(2)¶\[y_j' = \mathrm{arg}\max_{0 \leq l < C} P_{jl}, \quad 1 \leq j \leq m.\]
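As an illustration of equations (1) and (2), the following standalone C++ sketch (independent of the library API; all names are illustrative) takes the labels of the \(k\) neighbors already found for some \(x_j'\), computes the per-class fractions, and returns the class with the highest one:

#include <algorithm>
#include <iostream>
#include <vector>

// Predicted class: the argmax over the per-class fractions P_{jl}
// computed from the labels of the k nearest neighbors (equations (1)-(2)).
int predict_class(const std::vector<int>& neighbor_labels, int class_count) {
    std::vector<double> probability(class_count, 0.0);
    for (int label : neighbor_labels)
        probability[label] += 1.0 / neighbor_labels.size();
    const auto max_it = std::max_element(probability.begin(), probability.end());
    return static_cast<int>(max_it - probability.begin());
}

int main() {
    // Labels of the k = 5 neighbors nearest to some x_j'; class 2 wins with 3/5 of the votes.
    const std::vector<int> neighbor_labels = { 2, 0, 2, 1, 2 };
    std::cout << predict_class(neighbor_labels, /*class_count=*/3) << "\n"; // prints 2
}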
For regression, let \(X' = \{ x_1', \ldots, x_m' \}\) be the inference set of \(p\)-dimensional feature vectors. Given \(X'\), the model produced at the training stage, and the number of nearest neighbors \(k\), the problem is to predict the continuous target value \(y_j' \in \mathbb{R}\) for each \(x_j'\), \(1 \leq j \leq m\), by performing the following steps:
Identify the set \(N(x_j') \subseteq X\) of the \(k\) feature vectors in the training set that are nearest to \(x_j'\) with respect to the Euclidean distance, which is used by default. The distance can be customized by choosing one of the predefined pairwise distances: Minkowski distance with fractional degree (including the Euclidean distance), Chebyshev distance, or cosine distance.
Estimate the conditional expectation of the target variable based on the nearest neighbors \(N(x_j')\) as the average of the target values for those neighbors:
(3)¶\[y_j' = \frac{1}{| N(x_j') |} \sum_{x_r \in N(x_j')} y_r, \quad 1 \leq j \leq m.\]
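Similarly, a standalone C++ sketch of equation (3) (names are illustrative): the prediction is the mean of the neighbors' target values:

#include <iostream>
#include <numeric>
#include <vector>

// Predicted target for x_j': the mean of the target values of its
// k nearest neighbors, as in equation (3).
double predict_target(const std::vector<double>& neighbor_targets) {
    const double sum = std::accumulate(neighbor_targets.begin(), neighbor_targets.end(), 0.0);
    return sum / neighbor_targets.size();
}

int main() {
    // Target values of the k = 5 nearest neighbors of some x_j'.
    const std::vector<double> neighbor_targets = { 1.5, 2.0, 2.5, 3.0, 1.0 };
    std::cout << predict_target(neighbor_targets) << "\n"; // prints 2
}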
For search, let \(X' = \{ x_1', \ldots, x_m' \}\) be the inference set of \(p\)-dimensional feature vectors. Given \(X'\), the model produced at the training stage, and the number of nearest neighbors \(k\), the problem is, for each \(x_j'\), \(1 \leq j \leq m\), to:
Identify the set \(N(x_j') \subseteq X\) of the \(k\) feature vectors in the training set that are nearest to \(x_j'\) with respect to the Euclidean distance, which is used by default. The distance can be customized by choosing one of the predefined pairwise distances: Minkowski distance with fractional degree (including the Euclidean distance), Chebyshev distance, or cosine distance.
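The predefined pairwise distances mentioned above can be illustrated with the following standalone C++ sketch (independent of the library API; function names are illustrative):

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <iostream>
#include <vector>

using vec = std::vector<double>;

// Minkowski distance of degree p (p = 2 gives the Euclidean distance).
double minkowski(const vec& a, const vec& b, double p) {
    double s = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i)
        s += std::pow(std::abs(a[i] - b[i]), p);
    return std::pow(s, 1.0 / p);
}

// Chebyshev distance: the largest coordinate-wise difference.
double chebyshev(const vec& a, const vec& b) {
    double m = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i)
        m = std::max(m, std::abs(a[i] - b[i]));
    return m;
}

// Cosine distance: 1 minus the cosine of the angle between the vectors.
double cosine(const vec& a, const vec& b) {
    double dot = 0.0, na = 0.0, nb = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        dot += a[i] * b[i];
        na += a[i] * a[i];
        nb += b[i] * b[i];
    }
    return 1.0 - dot / (std::sqrt(na) * std::sqrt(nb));
}

int main() {
    const vec a = { 1.0, 2.0 }, b = { 4.0, 6.0 };
    std::cout << minkowski(a, b, 2.0) << " "   // 5 (Euclidean)
              << chebyshev(a, b) << " "        // 4
              << cosine(a, b) << "\n";         // ~0.0077
}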
Inference method: brute-force¶
The brute-force inference method determines the set \(N(x_j')\) of the nearest feature vectors by iterating over all the pairs \((x_j', x_i)\) in an implementation-defined order, \(1 \leq i \leq n\), \(1 \leq j \leq m\).
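A standalone C++ sketch of the brute-force approach (a simplified illustration, not the library implementation): it evaluates the distance from the query to every training vector and keeps the \(k\) smallest:

#include <algorithm>
#include <cstddef>
#include <iostream>
#include <vector>

using vec = std::vector<double>;

double squared_distance(const vec& a, const vec& b) {
    double d = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i)
        d += (a[i] - b[i]) * (a[i] - b[i]);
    return d;
}

// Brute-force k-NN search: compute the distance from the query to every
// training vector and keep the k smallest. Returns indices into `training`.
std::vector<std::size_t> knn_search(const std::vector<vec>& training,
                                    const vec& query,
                                    std::size_t k) {
    std::vector<std::size_t> indices(training.size());
    for (std::size_t i = 0; i < indices.size(); ++i)
        indices[i] = i;
    std::partial_sort(indices.begin(), indices.begin() + k, indices.end(),
                      [&](std::size_t a, std::size_t b) {
                          return squared_distance(training[a], query) <
                                 squared_distance(training[b], query);
                      });
    indices.resize(k);
    return indices;
}

int main() {
    const std::vector<vec> training = { { 0., 0. }, { 1., 1. }, { 5., 5. }, { 1., 0. } };
    for (std::size_t i : knn_search(training, { 0.9, 0.9 }, 2))
        std::cout << i << " "; // prints 1 3
}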
Inference method: k-d tree¶
The k-d tree inference method traverses the \(k\)-\(d\) tree to find the feature vectors associated with a leaf node that are closest to \(x_j'\), \(1 \leq j \leq m\). The set \(\tilde{n}(x_j')\) of the currently known \(k\) nearest neighbors is progressively updated during the tree traversal. The search algorithm prunes the nodes for which the distance between \(x_j'\) and the corresponding part of the feature space is not less than the distance between \(x_j'\) and the most distant feature vector in \(\tilde{n}(x_j')\). Once the tree traversal is finished, \(\tilde{n}(x_j') \equiv N(x_j')\).
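The following standalone C++ sketch (a simplified illustration, not the library implementation) builds a k-d tree by median splits and searches it with the pruning rule described above: a subtree is visited only when its splitting plane lies closer to the query than the most distant of the currently known \(k\) neighbors:

#include <algorithm>
#include <cstddef>
#include <iostream>
#include <memory>
#include <queue>
#include <utility>
#include <vector>

using vec = std::vector<double>;

struct node {
    std::size_t point;            // index of the training vector stored in this node
    std::size_t axis;             // splitting dimension
    std::unique_ptr<node> left, right;
};

double squared_distance(const vec& a, const vec& b) {
    double d = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i)
        d += (a[i] - b[i]) * (a[i] - b[i]);
    return d;
}

// Builds a k-d tree by recursively splitting on the median along cycling axes.
std::unique_ptr<node> build(const std::vector<vec>& data, std::vector<std::size_t>& idx,
                            std::size_t lo, std::size_t hi, std::size_t depth) {
    if (lo >= hi) return nullptr;
    const std::size_t axis = depth % data[0].size();
    const std::size_t mid = (lo + hi) / 2;
    std::nth_element(idx.begin() + lo, idx.begin() + mid, idx.begin() + hi,
                     [&](std::size_t a, std::size_t b) { return data[a][axis] < data[b][axis]; });
    auto n = std::make_unique<node>();
    n->point = idx[mid];
    n->axis = axis;
    n->left = build(data, idx, lo, mid, depth + 1);
    n->right = build(data, idx, mid + 1, hi, depth + 1);
    return n;
}

// Max-heap of (distance, index) pairs holding the currently known k nearest neighbors.
using heap = std::priority_queue<std::pair<double, std::size_t>>;

// Recursive traversal: visit the child on the query's side first, then visit the
// other child only if the splitting plane is closer than the most distant of the
// currently known k neighbors (the pruning rule described above).
void search(const node* n, const std::vector<vec>& data, const vec& query,
            std::size_t k, heap& best) {
    if (!n) return;
    const double d = squared_distance(data[n->point], query);
    if (best.size() < k) best.emplace(d, n->point);
    else if (d < best.top().first) { best.pop(); best.emplace(d, n->point); }

    const double diff = query[n->axis] - data[n->point][n->axis];
    const node* near_side = diff < 0 ? n->left.get() : n->right.get();
    const node* far_side  = diff < 0 ? n->right.get() : n->left.get();
    search(near_side, data, query, k, best);
    if (best.size() < k || diff * diff < best.top().first)
        search(far_side, data, query, k, best);
}

int main() {
    const std::vector<vec> data = { { 2., 3. }, { 5., 4. }, { 9., 6. }, { 4., 7. }, { 8., 1. }, { 7., 2. } };
    std::vector<std::size_t> idx(data.size());
    for (std::size_t i = 0; i < idx.size(); ++i) idx[i] = i;
    const auto root = build(data, idx, 0, idx.size(), 0);

    heap best;
    search(root.get(), data, { 9., 2. }, 2, best);
    while (!best.empty()) { std::cout << best.top().second << " "; best.pop(); } // prints 5 4
}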
Programming Interface¶
Refer to API Reference: k-Nearest Neighbors Classification, Regression, and Search.
The following table describes current device support:
| Task           | CPU | GPU |
| -------------- | --- | --- |
| Classification | Yes | Yes |
| Regression     | No  | Yes |
| Search         | Yes | Yes |
Distributed mode¶
The algorithm supports distributed execution in SPMD mode (only on GPU).
Usage Example¶
Training¶
knn::model<> run_training(const table& data,
                          const table& labels) {
    const std::int64_t class_count = 10;
    const std::int64_t neighbor_count = 5;
    const auto knn_desc = knn::descriptor<float>{class_count, neighbor_count};
    const auto result = train(knn_desc, data, labels);
    return result.get_model();
}
Inference¶
table run_inference(const knn::model<>& model,
                    const table& new_data) {
    const std::int64_t class_count = 10;
    const std::int64_t neighbor_count = 5;
    const auto knn_desc = knn::descriptor<float>{class_count, neighbor_count};
    const auto result = infer(knn_desc, model, new_data);
    print_table("labels", result.get_labels());
    return result.get_labels();
}
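A minimal calling sketch, assuming the run_training and run_inference helpers above, the usual oneDAL example includes (for example, oneapi/dal/algo/knn.hpp and oneapi/dal/table/homogen_table.hpp), and the same namespace shortcuts used in those snippets; the data values, sizes, and the print_table helper are illustrative only:

#include "oneapi/dal/table/homogen_table.hpp"

int main() {
    // Row-major training data: 4 observations with 2 features each,
    // and one class label per observation (values in [0, class_count)).
    const float x_train[] = { 0.f, 0.f,
                              1.f, 1.f,
                              5.f, 5.f,
                              6.f, 6.f };
    const float y_train[] = { 0.f, 0.f, 1.f, 1.f };
    const float x_test[]  = { 0.5f, 0.5f,
                              5.5f, 5.5f };

    // homogen_table::wrap references the user memory rather than copying it,
    // so the arrays above must outlive the tables.
    const auto data     = homogen_table::wrap(x_train, 4, 2);
    const auto labels   = homogen_table::wrap(y_train, 4, 1);
    const auto new_data = homogen_table::wrap(x_test, 2, 2);

    const auto model = run_training(data, labels);
    run_inference(model, new_data);
    return 0;
}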
Examples¶
Batch Processing: