# K-Nearest Neighbors

## About

The ${\displaystyle k}$-nearest neighbor (k-NN) algorithm can be utilized for both classification and regression. It is commonly used when there is little or no prior knowledge about the distribution of the data. In other words, the advantage of ${\displaystyle k}$-nearest neighbors is that it enables discriminant analysis when reliable parametric estimates of probability densities are unknown. Note that ${\displaystyle k}$-nearest neighbor is a local and non-parametric method where the parameter ${\displaystyle k}$ controls the size of locality.

In order to perform ${\displaystyle k}$-nearest neighbors, a suitable ${\displaystyle k}$ must be chosen along with a distance measure that is appropriate for the type of predictor data. Specifically, if the predictors are numerical, a Minkowski-type distance (such as the ${\displaystyle L_{1}}$, ${\displaystyle L_{2}}$ (Euclidean), or ${\displaystyle L_{\infty }}$ norm) can be used. If the data are categorical, some options are the Hamming distance and the Value Difference Measure (VDM). In practice, the choice of ${\displaystyle k}$ is usually determined by use of ${\displaystyle k}$-fold cross-validation. [1]
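
The distance measures above can be sketched as follows (a minimal illustration; the function names are our own, not part of any standard library):

```python
import numpy as np

def minkowski_distances(z, X, p=2):
    """Distances from a query point z to every row of X.
    p=1 gives the L1 (Manhattan) norm, p=2 the L2 (Euclidean) norm,
    and p=np.inf the L-infinity (Chebyshev) norm."""
    diffs = np.abs(np.asarray(X, dtype=float) - np.asarray(z, dtype=float))
    if np.isinf(p):
        return diffs.max(axis=1)
    return (diffs ** p).sum(axis=1) ** (1.0 / p)

def hamming_distances(z, X):
    """For categorical predictors: count of positions where z and each row of X differ."""
    return (np.asarray(X) != np.asarray(z)).sum(axis=1)
```

For example, the Euclidean distance from the origin to the point (3, 4) is 5, while the ${\displaystyle L_{1}}$ distance is 7 and the ${\displaystyle L_{\infty }}$ distance is 4.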

## k-NN Algorithm

Consider a matrix of training predictors ${\displaystyle {\textbf {X}}={\begin{bmatrix}x_{1,1}&x_{1,2}&\cdots &x_{1,m}\\x_{2,1}&x_{2,2}&\cdots &x_{2,m}\\\vdots &\vdots &\ddots &\vdots \\x_{n,1}&x_{n,2}&\cdots &x_{n,m}\\\end{bmatrix}}={\begin{bmatrix}{\textbf {x}}_{1}\\{\textbf {x}}_{2}\\\vdots \\{\textbf {x}}_{n}\\\end{bmatrix}}}$ with an associated response vector ${\displaystyle {\textbf {y}}={\begin{bmatrix}y_{1}\\y_{2}\\\vdots \\y_{n}\\\end{bmatrix}}.}$ Note that ${\displaystyle {\textbf {y}}}$ may be either a vector of classes (i.e. categorical responses) or a vector of numerical values. The former case is known as ${\displaystyle k}$-nearest neighbor classification, while the latter is known as ${\displaystyle k}$-nearest neighbor regression. Next, consider a matrix ${\displaystyle {\textbf {Z}}}$ of test predictors for which we would like to predict the response. Denote ${\displaystyle {\textbf {Z}}}$ as ${\displaystyle {\textbf {Z}}={\begin{bmatrix}z_{1,1}&z_{1,2}&\cdots &z_{1,m}\\z_{2,1}&z_{2,2}&\cdots &z_{2,m}\\\vdots &\vdots &\ddots &\vdots \\z_{p,1}&z_{p,2}&\cdots &z_{p,m}\\\end{bmatrix}}={\begin{bmatrix}{\textbf {z}}_{1}\\{\textbf {z}}_{2}\\\vdots \\{\textbf {z}}_{p}\\\end{bmatrix}}.}$

The ${\displaystyle k}$-nearest neighbors algorithm is as follows:

Step 1: Choose ${\displaystyle k}$

Step 2: For each ${\displaystyle {\textbf {z}}_{i}}$, measure the distance to every ${\displaystyle {\textbf {x}}_{j}}$ and identify the ${\displaystyle k}$ nearest. In other words, compute the distance from each observation in ${\displaystyle {\textbf {Z}}}$ to all observations in ${\displaystyle {\textbf {X}}}$ and retain the ${\displaystyle k}$ closest.

Step 3: For classification (i.e. if ${\displaystyle {\textbf {y}}}$ is a categorical vector of responses), predict the response for ${\displaystyle {\textbf {z}}_{i}}$ as the class (i.e. category) that occurs the greatest number of times among the ${\displaystyle k}$ nearest ${\displaystyle {\textbf {x}}_{j}}$'s. For regression (i.e. if ${\displaystyle {\textbf {y}}}$ is a real-valued vector of responses), predict the response for ${\displaystyle {\textbf {z}}_{i}}$ as the average of the responses among the ${\displaystyle k}$ nearest ${\displaystyle {\textbf {x}}_{j}}$'s.
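
The three steps above can be sketched in Python (a minimal illustration using Euclidean distance; the function name and `classify` flag are our own):

```python
import numpy as np
from collections import Counter

def knn_predict(X, y, Z, k, classify=True):
    """Predict a response for each test row in Z from training pairs (X, y).
    For each z_i: compute the distance to every x_j, take the k smallest,
    then vote (classification) or average (regression)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    preds = []
    for z in np.asarray(Z, dtype=float):
        d = np.sqrt(((X - z) ** 2).sum(axis=1))   # Euclidean distance to every x_j
        nearest = np.argsort(d)[:k]               # indices of the k nearest x_j's
        if classify:
            # Step 3 (classification): most common class among the k neighbors
            preds.append(Counter(y[nearest]).most_common(1)[0][0])
        else:
            # Step 3 (regression): average of the k neighboring responses
            preds.append(float(y[nearest].mean()))
    return preds
```

For instance, with training points 0, 1, 2 labeled "a" and 10 labeled "b", a query at 0.5 with ${\displaystyle k=3}$ is classified as "a".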

## Controversies

The ${\displaystyle k}$-NN algorithm does not learn a model from the training data; it simply stores the training data and uses it directly at prediction time. To predict the label of a new instance, the algorithm finds the ${\displaystyle k}$ closest neighbors of the instance in the training data and sets the predicted class label to the most common label among those ${\displaystyle k}$ points. As a result, the algorithm may generalize poorly and may not be robust to noisy data. Additionally, the estimation of ${\displaystyle k}$ can be problematic. For example, suppose ${\displaystyle k}$-fold cross-validation reveals the presence of two minima in the error curve. This leads to the issue of deciding which minimum is optimal. [1]
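
The cross-validation procedure for estimating ${\displaystyle k}$ can be sketched as follows (a simplified illustration for classification with Euclidean distance; the function name and defaults are our own). Evaluating it over a grid of ${\displaystyle k}$ values produces the error curve whose minima are discussed above:

```python
import numpy as np

def cv_error_for_k(X, y, k, n_folds=5, seed=0):
    """Estimate the misclassification rate of k-NN by n-fold cross-validation:
    hold out each fold in turn, classify its points using the remaining data,
    and average the per-fold error rates."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), n_folds)
    errors = []
    for fold in folds:
        train = np.setdiff1d(np.arange(len(X)), fold)  # indices outside the fold
        wrong = 0
        for i in fold:
            d = np.linalg.norm(X[train] - X[i], axis=1)
            nearest = train[np.argsort(d)[:k]]          # k nearest training points
            vals, counts = np.unique(y[nearest], return_counts=True)
            if vals[np.argmax(counts)] != y[i]:
                wrong += 1
        errors.append(wrong / len(fold))
    return float(np.mean(errors))
```

When the resulting error curve has two local minima, a common heuristic is to prefer the smaller ${\displaystyle k}$ on variance grounds, but as noted above the choice remains a judgment call.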

## History

1951 : Fix and Hodges introduced a non-parametric method for pattern classification that has since become known as the k-nearest neighbor rule.

1967 : Cover and Hart worked out some of the formal properties of the ${\displaystyle k}$-nearest neighbor rule. [2]

## Top 5 Recent Tweets

| Date | Author | Tweet |
| --- | --- | --- |
| 13 Dec 2015 | @PeterOliverCaya | Evaluating credit risk using K-nearest neighbors: |
| 11 Dec 2015 | @soulparking | is getting my k-Nearest Neighbors running with my IRIS dataset #datascience https://www.swarmapp.com/c/juL7n4fTe8X |
| 10 Dec 2015 | @bloopish | The k-NN algorithm is among the simplest of all #MachineLearning algorithms http://ow.ly/TBsPN |
| 26 Nov 2015 | @tetmorikawa | #KNN (K-Nearest Neighbors) classifier can do a decent job in this context, I guess. #DataScience http://www.edvancer.in/logistic-regression-vs-decision-trees-vs-svm-part1/ |
| 7 Nov 2015 | @vnfrombucharest | Word Mover's Distance for K nearest neighbors document classification, @scikit_learn friendly code w/ Matt Kusner! http://vene.ro/blog/word-movers-distance-in-python.html |

## Top 5 Recent News Headlines

| Date | Title | Link |
| --- | --- | --- |
| 21 Apr 2015 | Kris Bryant: Which players are most similar? | http://www.beyondtheboxscore.com/2015/4/21/8460365/kris-bryant-similar-players-katoh-k-nearest-neighbors |
| 19 Jun 2015 | The future of data at scale | http://radar.oreilly.com/2015/06/the-future-of-data-at-scale.html |
| 9 Jul 2015 | Large-scale Direct Targeting for Drug Repositioning and Discovery | http://www.nature.com/articles/srep11970 |
| 1 Jul 2015 | A comparison of genomic selection models across time in interior spruce (Picea engelmannii × glauca) using unordered SNP imputation methods | http://www.nature.com/hdy/journal/v115/n6/full/hdy201557a.html |
| 10 Aug 2015 | Amazon talks new predictive model for personalised recommendations | http://www.cio.com.au/article/581562/amazon-talks-new-predictive-model-personalised-recommendations/ |

## Top 5 Lifetime Tweets

| Date | Author | Tweet |
| --- | --- | --- |
| 4 Feb 2012 | @bigdata | k nearest neighbors: Parallel kNN Joins for Big Data in Hadoop/MapReduce, algo tested on 100M+ records & 30+ Dimensions http://goo.gl/XduYc |
| 21 Feb 2012 | @Gephi | RT @pcbje: Visualizing k-nearest neighbors with #Gephi http://pcbje.com/post/17432986180/visualizing-k-nearest-neighbors-with-gephi |
| 31 Mar 2014 | @peterneubauer | Fantastic @_nicolemargaret! RT @newsycombinator: Movie Recommendations with k-Nearest Neighbors and Cosine Similarity http://gist.neo4j.org/?8173017 |
| 18 Feb 2014 | @YhatHQ | K-nearest neighbors implemented in javascript http://bit.ly/1jNUiNt |
| 2 Aug 2013 | @yurevich | Nice and easy - http://blog.yhathq.com/posts/classification-using-knn-and-python.html … - using K-nearest neighbors algorithm for classification with python and pandas |

## Top 5 Lifetime News Headlines

| Date | Title | Link |
| --- | --- | --- |
| 5 Jun 2014 | Nearest Neighbors: An Algorithm You Know | http://www.science20.com/a_quantum_diaries_survivor/nearest_neighbors_an_algorithm_you_know-137796 |
| 18 Oct 2013 | Statistical models can predict a Kickstarter's success within 4 hours | http://arstechnica.com/business/2013/10/statistical-models-can-predict-a-kickstarters-success-within-4-hours/ |
| 6 Oct 2013 | Gaining access to the best machine-learning methods | http://www.forbes.com/sites/oreillymedia/2013/10/06/gaining-access-to-the-best-machine-learning-methods/ |
| 14 Apr 2013 | Supercomputer Apps Tackle Cancer, Autism, Heart Attacks | http://www.informationweek.com/government/leadership/supercomputer-apps-tackle-cancer-autism-heart-attacks/d/d-id/1109401? |
| 12 May 2014 | Visualizing playing styles in tennis | http://www.sbnation.com/tennis/2014/5/12/5697064/tennis-player-styles-rafael-nadal-novak-djokovic |

## References

1. Ghosh, Anil K. "On optimum choice of k in nearest neighbor classification." Computational Statistics & Data Analysis 50.11 (2006): 3113-3123.
2. Cover, Thomas M., and Peter E. Hart. "Nearest neighbor pattern classification." IEEE Transactions on Information Theory 13.1 (1967): 21-27.
