Notes on k-nearest neighbours

In this post the $k$-nearest neighbour algorithm is briefly introduced.

The principle of the $k$-nearest neighbour algorithm is that instances belonging to the same class will generally exist in close proximity to one another within the $r$-dimensional space, where each dimension corresponds to one of the $r$ attributes in the feature vector $\vec{x}$. The multidimensional space is populated by the reference set of labelled training instances $(\vec{x}^{(t)}, y^{(t)})$. A distance measure $D(\vec{x}^{(t)}, \tilde{\vec{x}})$ is used to measure the proximity between each instance in the reference set and an unlabelled instance $\tilde{\vec{x}}$. The unlabelled instance is classified by observing the class labels of its $k$ closest neighbours within the reference set, and assigning a class label by popularity vote or distance-weighted popularity vote over the set of observed class labels.
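As a minimal sketch of this procedure, the snippet below implements a Euclidean distance measure and a plain popularity vote in Python; the function name knn_classify and the use of NumPy arrays are illustrative choices rather than anything prescribed by the post.

```python
# A minimal sketch of k-nearest neighbour classification: Euclidean distance
# as D, followed by a plain popularity vote over the k closest class labels.
from collections import Counter

import numpy as np

def knn_classify(X_ref, y_ref, x_query, k=3):
    """Classify x_query against the labelled reference set (X_ref, y_ref)."""
    # Distance between every reference instance and the unlabelled instance.
    distances = np.linalg.norm(X_ref - x_query, axis=1)
    # Indices of the k nearest reference instances.
    nearest = np.argsort(distances)[:k]
    # Popularity vote over the observed class labels.
    votes = Counter(y_ref[i] for i in nearest)
    return votes.most_common(1)[0][0]
```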

The figure below shows how the $k$-nearest neighbour algorithm performs classification. Within the two-dimensional feature space, a 3-nearest neighbour algorithm classifies instance $\tilde{\vec{x}}_1$ as belonging to group $A$, because all three of its closest training instances have the class label $A$. The instance $\tilde{\vec{x}}_2$ is classified as belonging to group $B$ because two out of its three neighbouring training instances have the class label $B$.

Figure: The k-nearest neighbour algorithm.
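Reusing the knn_classify sketch above, a small numerical example in the spirit of the figure might look as follows; the coordinates and labels are invented for illustration and are not taken from the figure.

```python
# Illustrative reference set: group A clustered near the origin,
# group B clustered around (5, 5). Values are made up for the example.
X_ref = np.array([[0.0, 0.0], [0.5, 1.0], [1.0, 0.5],
                  [5.0, 5.0], [5.5, 4.5], [4.5, 5.5]])
y_ref = np.array(["A", "A", "A", "B", "B", "B"])

print(knn_classify(X_ref, y_ref, np.array([0.4, 0.6]), k=3))  # -> 'A'
print(knn_classify(X_ref, y_ref, np.array([4.8, 5.2]), k=3))  # -> 'B'
```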

Lazy supervised learning

The $k$-nearest neighbour algorithm is called a lazy supervised learning technique because classification is based directly on the reference set, and not on an induced function $h$. Therefore, classification using this technique can be computationally and memory intensive, especially when the reference set is large. Because the computational and memory requirements of the $k$-nearest neighbour algorithm grow with the number of instances in the reference set, several instance selection methods can be applied to reduce the size of the reference set [1]. Instance selection methods, also known as prototyping methods, try to reduce the number of instances in the reference set below that of the original training set. Reducing the size of the reference set not only decreases the computational and memory requirements of the $k$-nearest neighbour algorithm but also filters out noise in the original training set [2].

Prototyping to reduce computational and memory requirements

Prototyping assumes that a small number of instances can be found to represent the entire set of training instances. In general, one prototype per class can be constructed by averaging over all the example instances belonging to the same class. Alternatively, multiple prototypes can be constructed by looking for subsets of homogeneous instances within the set of instances belonging to the same class. Averaging over each identified subset produces a prototype. Clustering, an unsupervised learning technique, can be applied here to find homogeneous groups of instances. An introduction to clustering techniques can be found in [3].
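A rough sketch of both variants is given below: per-class averaging with NumPy, and a multi-prototype variant that uses scikit-learn's KMeans as one possible clustering technique. The function names and the choice of k-means are illustrative assumptions, not something the post prescribes.

```python
# Sketch of prototyping: replace the reference set with a few prototypes.
import numpy as np
from sklearn.cluster import KMeans

def class_mean_prototypes(X_ref, y_ref):
    """One prototype per class: the mean of all instances with that label."""
    labels = np.unique(y_ref)
    protos = np.array([X_ref[y_ref == c].mean(axis=0) for c in labels])
    return protos, labels

def clustered_prototypes(X_ref, y_ref, per_class=2):
    """Several prototypes per class: cluster each class, keep the centroids."""
    protos, labels = [], []
    for c in np.unique(y_ref):
        km = KMeans(n_clusters=per_class, n_init=10).fit(X_ref[y_ref == c])
        protos.extend(km.cluster_centers_)
        labels.extend([c] * per_class)
    return np.array(protos), np.array(labels)
```

The reduced set returned by either function can then stand in for the full reference set in the knn_classify sketch above.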

Single hyper-parameter

The $k$-nearest neighbour algorithm has the advantage of having only one hyper-parameter, the number of nearest neighbours $k$, that needs to be optimized. However, it has been shown that this supervised learning technique is intolerant to noisy data, intolerant to redundant features, and sensitive to the choice of similarity measure used [2]. Therefore, though the $k$-nearest neighbour algorithm is very simple to understand and implement, its performance varies considerably depending on the data and on these design choices.
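One common way to pick $k$, sketched below, is cross-validated accuracy over a small grid of candidate values. scikit-learn's KNeighborsClassifier and cross_val_score, and the Iris data set, are used purely for illustration; the post itself does not prescribe a tuning procedure.

```python
# Choosing the single hyper-parameter k by 5-fold cross-validation.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
scores = {
    k: cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
    for k in (1, 3, 5, 7, 9, 11)
}
best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```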

Footnotes

  1. Reinartz, T., 2002. A unifying view on instance selection. Data Mining and Knowledge Discovery 6 (2), 191-210.

  2. Aha, D. W., Kibler, D., Albert, M. K., 1991. Instance-based learning algorithms. Machine Learning 6 (1), 37-66.

  3. Nisbet, R., Elder, J., Miner, G., 2009. Handbook of Statistical Analysis and Data Mining Applications. Academic Press/Elsevier.
