# k-Nearest Neighbors Classification and Search (k-NN)

$$k$$-NN classification and search algorithms are based on finding the $$k$$ observations in the training set that are nearest to a given feature vector. For classification, the problem is to infer the class of a new feature vector by computing the majority vote of its $$k$$ nearest observations from the training set. For search, the problem is to find the $$k$$ nearest observations from the training set to a new feature vector. The nearest observations are determined by the chosen distance metric.

| Operation | Computational methods | Programming Interface |
|-----------|-----------------------|-----------------------|
| Training  | Brute-force, k-d tree | train(…), train_input, train_result |
| Inference | Brute-force, k-d tree | infer(…), infer_input, infer_result |

## Mathematical formulation

### Training

Let $$X = \{ x_1, \ldots, x_n \}$$ be the training set of $$p$$-dimensional feature vectors and let $$Y = \{ y_1, \ldots, y_n \}$$ be the corresponding set of class labels, where $$y_i \in \{ 0, \ldots, C-1 \}$$, $$1 \leq i \leq n$$, and $$C$$ is the number of classes. Given $$X$$, $$Y$$, and the number of nearest neighbors $$k$$, the problem is to build a model that allows computing distances between the training feature vectors and the inference feature vectors at the inference stage.

#### Training method: brute-force

The training operation produces the model that stores all the feature vectors from the initial training set $$X$$.

#### Training method: k-d tree

The training operation builds a $$k$$-$$d$$ tree that partitions the training set $$X$$ (for more details, see k-d Tree).
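The median-split construction behind a $$k$$-$$d$$ tree can be sketched in standalone C++. This is an illustration only, not the oneDAL implementation: `kd_node` and `build_kd_tree` are hypothetical names. At depth $$d$$ the points are partitioned around the median of coordinate $$d \bmod p$$, so each node splits the feature space with an axis-aligned hyperplane.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// One node of a k-d tree over p-dimensional points (illustrative sketch).
struct kd_node {
    std::vector<double> point;            // splitting point stored at this node
    std::size_t axis = 0;                 // coordinate used to split at this depth
    std::unique_ptr<kd_node> left, right;
};

// Recursively build the tree: at each level, place the median (along the
// current axis) at the node and recurse on the two halves.
std::unique_ptr<kd_node> build_kd_tree(std::vector<std::vector<double>> pts,
                                       std::size_t depth = 0) {
    if (pts.empty()) return nullptr;
    const std::size_t axis = depth % pts[0].size();
    const std::size_t mid = pts.size() / 2;
    std::nth_element(pts.begin(), pts.begin() + mid, pts.end(),
                     [axis](const auto& a, const auto& b) { return a[axis] < b[axis]; });
    auto node = std::make_unique<kd_node>();
    node->point = pts[mid];
    node->axis = axis;
    node->left = build_kd_tree({pts.begin(), pts.begin() + mid}, depth + 1);
    node->right = build_kd_tree({pts.begin() + mid + 1, pts.end()}, depth + 1);
    return node;
}
```

`std::nth_element` gives the linear-time median selection that makes the expected construction cost $$O(n \log n)$$.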

### Inference

Let $$X' = \{ x_1', \ldots, x_m' \}$$ be the inference set of $$p$$-dimensional feature vectors. Given $$X'$$, the model produced at the training stage, and the number of nearest neighbors $$k$$, the problem is to predict the class label $$y_j'$$ for each $$x_j'$$, $$1 \leq j \leq m$$, by performing the following steps:

1. Identify the set $$N(x_j') \subseteq X$$ of the $$k$$ feature vectors in the training set that are nearest to $$x_j'$$ with respect to the Euclidean distance, which is the default. The distance can be customized with one of the predefined pairwise distances: Minkowski distance with fractional degree (including the Euclidean distance), Chebyshev distance, and cosine distance.

2. Estimate the conditional probability of the $$l$$-th class as the fraction of vectors in $$N(x_j')$$ whose labels $$y_r$$ are equal to $$l$$:

$$P_{jl} = \frac{1}{| N(x_j') |} \Big| \big\{ x_r \in N(x_j') : y_r = l \big\} \Big|, \quad 1 \leq j \leq m, \; 0 \leq l < C. \tag{1}$$
3. Predict the class that has the highest probability for the feature vector $$x_j'$$:

$$y_j' = \arg\max_{0 \leq l < C} P_{jl}, \quad 1 \leq j \leq m. \tag{2}$$
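The predefined distance families named in step 1 can be sketched as standalone functions. These free functions are illustrative only, not the oneDAL distance descriptors:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Minkowski distance of the given degree; degree = 2 recovers the
// Euclidean distance, and fractional degrees are also allowed.
double minkowski(const std::vector<double>& a, const std::vector<double>& b,
                 double degree) {
    double s = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i)
        s += std::pow(std::fabs(a[i] - b[i]), degree);
    return std::pow(s, 1.0 / degree);
}

// Chebyshev distance: the largest coordinate-wise difference.
double chebyshev(const std::vector<double>& a, const std::vector<double>& b) {
    double m = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i)
        m = std::max(m, std::fabs(a[i] - b[i]));
    return m;
}

// Cosine distance: one minus the cosine of the angle between a and b.
double cosine(const std::vector<double>& a, const std::vector<double>& b) {
    double dot = 0.0, na = 0.0, nb = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        dot += a[i] * b[i];
        na += a[i] * a[i];
        nb += b[i] * b[i];
    }
    return 1.0 - dot / (std::sqrt(na) * std::sqrt(nb));
}
```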

#### Inference method: brute-force

The brute-force inference method determines the set $$N(x_j')$$ of nearest feature vectors by iterating over all pairs $$(x_j', x_i)$$ in an implementation-defined order, $$1 \leq i \leq n$$, $$1 \leq j \leq m$$.
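Steps 1–3 above, combined with the brute-force pairwise iteration, can be sketched as follows. This is a standalone illustration under the default Euclidean distance; `knn_classify` is a hypothetical name, not a oneDAL function:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Classify one query point by majority vote among its k nearest training
// points under the (squared) Euclidean distance.
int knn_classify(const std::vector<std::vector<double>>& train,
                 const std::vector<int>& labels,
                 const std::vector<double>& query,
                 std::size_t k,
                 int class_count) {
    // Step 1: rank all training points by squared distance to the query.
    std::vector<std::size_t> idx(train.size());
    for (std::size_t i = 0; i < idx.size(); ++i) idx[i] = i;
    auto sq_dist = [&](std::size_t i) {
        double d = 0.0;
        for (std::size_t c = 0; c < query.size(); ++c) {
            const double diff = train[i][c] - query[c];
            d += diff * diff;
        }
        return d;
    };
    std::partial_sort(idx.begin(), idx.begin() + k, idx.end(),
                      [&](std::size_t a, std::size_t b) { return sq_dist(a) < sq_dist(b); });
    // Step 2: count labels among the k nearest; the class probabilities
    // P_jl of equation (1) are these counts divided by k.
    std::vector<int> votes(class_count, 0);
    for (std::size_t r = 0; r < k; ++r) ++votes[labels[idx[r]]];
    // Step 3: predict the class with the largest count, as in equation (2).
    return static_cast<int>(std::max_element(votes.begin(), votes.end()) - votes.begin());
}
```

`std::partial_sort` avoids fully ordering all $$n$$ distances when only the $$k$$ smallest are needed.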

#### Inference method: k-d tree

The k-d tree inference method traverses the $$k$$-$$d$$ tree to find the feature vectors associated with leaf nodes that are closest to $$x_j'$$, $$1 \leq j \leq m$$. The set $$\tilde{n}(x_j')$$ of the $$k$$ nearest neighbors known so far is progressively updated during the tree traversal. The search algorithm prunes the nodes for which the distance between $$x_j'$$ and the respective part of the feature space is not less than the distance between $$x_j'$$ and the most distant feature vector in $$\tilde{n}(x_j')$$. Once the tree traversal is finished, $$\tilde{n}(x_j') \equiv N(x_j')$$.
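The traversal with pruning can be sketched as standalone code. This is an illustration under assumed names (`kd_node`, `build`, `search`), not the oneDAL implementation: a max-heap holds the current candidate set $$\tilde{n}(x_j')$$, and a subtree is visited only if its splitting hyperplane is closer to the query than the most distant current candidate.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <memory>
#include <queue>
#include <utility>
#include <vector>

struct kd_node {
    std::vector<double> point;
    std::size_t axis = 0;
    std::unique_ptr<kd_node> left, right;
};

// Median-split construction (see the training section).
std::unique_ptr<kd_node> build(std::vector<std::vector<double>> pts, std::size_t depth = 0) {
    if (pts.empty()) return nullptr;
    const std::size_t axis = depth % pts[0].size();
    const std::size_t mid = pts.size() / 2;
    std::nth_element(pts.begin(), pts.begin() + mid, pts.end(),
                     [axis](const auto& a, const auto& b) { return a[axis] < b[axis]; });
    auto n = std::make_unique<kd_node>();
    n->point = pts[mid];
    n->axis = axis;
    n->left = build({pts.begin(), pts.begin() + mid}, depth + 1);
    n->right = build({pts.begin() + mid + 1, pts.end()}, depth + 1);
    return n;
}

double sq_dist(const std::vector<double>& a, const std::vector<double>& b) {
    double d = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) d += (a[i] - b[i]) * (a[i] - b[i]);
    return d;
}

// Max-heap over (squared distance, point): the top is the most distant of
// the currently known nearest neighbors.
using heap_t = std::priority_queue<std::pair<double, std::vector<double>>>;

void search(const kd_node* node, const std::vector<double>& q, std::size_t k, heap_t& best) {
    if (!node) return;
    const double d = sq_dist(node->point, q);
    if (best.size() < k) best.emplace(d, node->point);
    else if (d < best.top().first) { best.pop(); best.emplace(d, node->point); }
    // Descend first into the half-space that contains the query.
    const double delta = q[node->axis] - node->point[node->axis];
    const kd_node* near_side = delta < 0 ? node->left.get() : node->right.get();
    const kd_node* far_side  = delta < 0 ? node->right.get() : node->left.get();
    search(near_side, q, k, best);
    // Prune: visit the far side only if the splitting hyperplane is closer
    // than the most distant of the current k candidates.
    if (best.size() < k || delta * delta < best.top().first)
        search(far_side, q, k, best);
}
```

When the traversal returns, the heap holds exactly $$N(x_j')$$, because the pruning condition discards only subtrees that provably contain no closer point.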

## Usage example

### Training

```cpp
knn::model<> run_training(const table& data,
                          const table& labels) {
    const std::int64_t class_count = 10;
    const std::int64_t neighbor_count = 5;
    const auto knn_desc = knn::descriptor<float>{class_count, neighbor_count};

    const auto result = train(knn_desc, data, labels);

    return result.get_model();
}
```


### Inference

```cpp
table run_inference(const knn::model<>& model,
                    const table& new_data) {
    const std::int64_t class_count = 10;
    const std::int64_t neighbor_count = 5;
    const auto knn_desc = knn::descriptor<float>{class_count, neighbor_count};

    const auto result = infer(knn_desc, model, new_data);

    print_table("labels", result.get_labels());
    return result.get_labels();
}
```


## Examples

Batch Processing: