# K-Nearest Neighbors

## About

The *k*-nearest neighbor (k-NN) algorithm can be used for both classification and regression. It is commonly applied when there is little or no prior knowledge about the distribution of the data; in other words, the advantage of *k*-nearest neighbors is that it enables discriminant analysis when reliable parametric estimates of the probability densities are unavailable. Note that *k*-nearest neighbors is a local, non-parametric method in which the parameter *k* controls the size of the neighborhood.

In order to perform *k*-nearest neighbors, a suitable *k* must be chosen along with a distance measure appropriate for the type of predictor data. Specifically, if the predictors are numerical, a norm-based distance (such as the L1, L2, or L∞ norms) can be used. If the data are categorical, some options are the Hamming distance and the Value Difference Measure (VDM). In practice, the choice of *k* is usually determined by *v*-fold cross-validation. ^{[1]}
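As a minimal illustration of the two distance types mentioned above (the function names here are illustrative, not from the cited reference):

```python
import numpy as np

def euclidean_distance(a, b):
    # L2 norm: suitable for numerical predictors
    return np.sqrt(np.sum((np.asarray(a) - np.asarray(b)) ** 2))

def hamming_distance(a, b):
    # number of positions at which categorical predictors differ
    return int(np.sum(np.asarray(a) != np.asarray(b)))
```

For example, `euclidean_distance([0, 0], [3, 4])` is 5.0, and `hamming_distance(["red", "small"], ["red", "large"])` is 1.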

## k-NN Algorithm

Consider a matrix **X** of training predictors with an associated response vector **y**. Note that **y** may be either a vector of classes (i.e. categorical responses) or a vector of numerical values. The former case is known as *k*-nearest neighbors classification, while the latter is known as *k*-nearest neighbors regression. Next, consider a matrix **X**\* of test predictors for which we would like to predict the response.

The *k*-nearest neighbors algorithm is as follows:

**Step 1:** Choose *k*.

**Step 2:** For each row of **X**\*, measure the distance to the *k* nearest rows of **X**. In other words, measure the distance from each observation in **X**\* to the *k* nearest observations in **X**.

**Step 3:** For classification (i.e. if **y** is a categorical vector of responses), predict the response for each row of **X**\* as the class (i.e. category) that occurs the greatest number of times among the *k* nearest neighbors.
For regression (i.e. if **y** is a real-valued vector of responses), predict the response for each row of **X**\* as the average of the responses among the *k* nearest neighbors.
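The three steps above can be sketched in Python (a minimal illustration using Euclidean distance; the function and variable names are assumptions, not from the article):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, X_test, k, classify=True):
    """Predict a response for each row of X_test from its k nearest
    rows of X_train under Euclidean distance."""
    predictions = []
    for x in X_test:
        # Step 2: distances from this test observation to every training row
        dists = np.sqrt(((X_train - x) ** 2).sum(axis=1))
        nearest = np.argsort(dists)[:k]  # indices of the k nearest neighbors
        if classify:
            # Step 3 (classification): majority vote among the k nearest labels
            predictions.append(Counter(y_train[nearest]).most_common(1)[0][0])
        else:
            # Step 3 (regression): average of the k nearest responses
            predictions.append(y_train[nearest].mean())
    return np.array(predictions)
```

For instance, with training points [0, 0], [0, 1] labeled 0 and [5, 5], [5, 6] labeled 1, a test point near the origin is assigned class 0 when k = 3.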

## Controversies

The *k*-NN algorithm does not learn anything from the training data; it simply stores the training data and uses it directly for prediction. To predict the label of a new instance, the *k*-NN algorithm finds the *k* closest neighbors to the new instance among the training data; the predicted class label is then the most common label among those neighboring points. This can result in an algorithm that does not generalize well and is not robust to noisy data. Additionally, the estimation of *k* can be problematic. For example, suppose *v*-fold cross-validation reveals the presence of two minima of the estimated error; this leads to the issue of deciding which minimum is optimal. ^{[1]}
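The selection of *k* by cross-validation can be made concrete with a leave-one-out error estimate (a sketch under the same Euclidean-distance assumption as above; the helper name `loo_error` is illustrative and nothing here comes from the cited paper):

```python
import numpy as np
from collections import Counter

def loo_error(X, y, k):
    """Leave-one-out misclassification rate of k-NN classification on (X, y)."""
    errors = 0
    for i in range(len(X)):
        dists = np.sqrt(((X - X[i]) ** 2).sum(axis=1))
        dists[i] = np.inf                       # exclude the held-out point itself
        nearest = np.argsort(dists)[:k]         # k nearest remaining neighbors
        vote = Counter(y[nearest]).most_common(1)[0][0]
        errors += vote != y[i]
    return errors / len(X)
```

Evaluating `loo_error` over a grid of candidate *k* values traces out the cross-validation curve; when that curve has more than one local minimum, the choice between them is exactly the ambiguity described above.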

## History

**1951** : Fix and Hodges introduced a non-parametric method for pattern classification that has since become known as the k-nearest neighbor rule.

**1967** : Cover and Hart worked out some of the formal properties of the *k*-nearest neighbor rule. ^{[2]}

## Top 5 Recent Tweets

Date | Author | Tweet |
---|---|---|
13 Dec 2015 | @PeterOliverCaya | Evaluating credit risk using K-nearest neighbors: |
11 Dec 2015 | @soulparking | is getting my k-Nearest Neighbors running with my IRIS dataset #datascience https://www.swarmapp.com/c/juL7n4fTe8X |
10 Dec 2015 | @bloopish | The k-NN algorithm is among the simplest of all #MachineLearning algorithms http://ow.ly/TBsPN |
26 Nov 2015 | @tetmorikawa | #KNN (K-Nearest Neighbors) classifier can do a decent job in this context, I guess. #DataScience http://www.edvancer.in/logistic-regression-vs-decision-trees-vs-svm-part1/ … |
7 Nov 2015 | @vnfrombucharest | Word Mover's Distance for K nearest neighbors document classification, @scikit_learn friendly code w/ Matt Kusner! http://vene.ro/blog/word-movers-distance-in-python.html … |

## Top 5 Recent News Headlines

Date | Title | Link |
---|---|---|
21 Apr 2015 | Kris Bryant: Which players are most similar? | http://www.beyondtheboxscore.com/2015/4/21/8460365/kris-bryant-similar-players-katoh-k-nearest-neighbors |
19 Jun 2015 | The future of data at scale | http://radar.oreilly.com/2015/06/the-future-of-data-at-scale.html |
9 Jul 2015 | Large-scale Direct Targeting for Drug Repositioning and Discovery | http://www.nature.com/articles/srep11970 |
1 Jul 2015 | A comparison of genomic selection models across time in interior spruce (Picea engelmannii × glauca) using unordered SNP imputation methods | http://www.nature.com/hdy/journal/v115/n6/full/hdy201557a.html |
10 Aug 2015 | Amazon talks new predictive model for personalised recommendations | http://www.cio.com.au/article/581562/amazon-talks-new-predictive-model-personalised-recommendations/ |

## Top 5 Lifetime Tweets

Date | Author | Tweet |
---|---|---|
4 Feb 2012 | @bigdata | k nearest neighbors: Parallel kNN Joins for Big Data in Hadoop/MapReduce, algo tested on 100M+ records & 30+ Dimensions http://goo.gl/XduYc |
21 Feb 2012 | @Gephi | RT @pcbje: Visualizing k-nearest neighbors with #Gephi http://pcbje.com/post/17432986180/visualizing-k-nearest-neighbors-with-gephi … |
31 Mar 2014 | @peterneubauer | Fantastic @_nicolemargaret! RT @newsycombinator: Movie Recommendations with k-Nearest Neighbors and Cosine Similarity http://gist.neo4j.org/?8173017 |
18 Feb 2014 | @YhatHQ | K-nearest neighbors implemented in javascript http://bit.ly/1jNUiNt |
2 Aug 2013 | @yurevich | Nice and easy - http://blog.yhathq.com/posts/classification-using-knn-and-python.html … - using K-nearest neighbors algorithm for classification with python and pandas |

## Top 5 Lifetime News Headlines

Date | Title | Link |
---|---|---|
5 Jun 2014 | Nearest Neighbors: An Algorithm You Know | http://www.science20.com/a_quantum_diaries_survivor/nearest_neighbors_an_algorithm_you_know-137796 |
18 Oct 2013 | Statistical models can predict a Kickstarter's success within 4 hours | http://arstechnica.com/business/2013/10/statistical-models-can-predict-a-kickstarters-success-within-4-hours/ |
6 Oct 2013 | Gaining access to the best machine-learning methods | http://www.forbes.com/sites/oreillymedia/2013/10/06/gaining-access-to-the-best-machine-learning-methods/ |
14 Apr 2013 | Supercomputer Apps Tackle Cancer, Autism, Heart Attacks | http://www.informationweek.com/government/leadership/supercomputer-apps-tackle-cancer-autism-heart-attacks/d/d-id/1109401? |
12 May 2014 | Visualizing playing styles in tennis | http://www.sbnation.com/tennis/2014/5/12/5697064/tennis-player-styles-rafael-nadal-novak-djokovic |

## References

1. ^{1.0} ^{1.1} Ghosh, Anil K. "On optimum choice of k in nearest neighbor classification." Computational Statistics & Data Analysis 50.11 (2006): 3113–3123.
2. http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=1053964