Naive Bayes (also known as the Bayes Classifier) is a probabilistic classifier that has been widely used for both clustering and classification. Its probabilistic model is based on Bayes' theorem, and it is termed "naive" due to its strong independence assumptions. Naive Bayes is also called a generative model because it models the distribution of the data points within each class. By contrast, a discriminative model, such as logistic regression or a support vector machine, attempts to separate the classes directly with a line (or, more generally, a hyperplane), rather than modeling how the data in each class are distributed.

## Derivation

Suppose that, given ${\textbf {y}}$, we have ${\textbf {X}}={\textbf {x}}_{1},\dots ,{\textbf {x}}_{m}$ as conditionally independent random variables, where ${\textbf {y}}$ takes one of $n$ possible values $y_{1},\dots ,y_{n}$. By the definition of conditional independence we have:

$P({\textbf {X}}|{\textbf {y}})=P({\textbf {x}}_{1},\dots ,{\textbf {x}}_{m}|{\textbf {y}})=P({\textbf {x}}_{1}|{\textbf {y}})P({\textbf {x}}_{2},\dots ,{\textbf {x}}_{m}|{\textbf {y}})=P({\textbf {x}}_{1}|{\textbf {y}})P({\textbf {x}}_{2}|{\textbf {y}})P({\textbf {x}}_{3},\dots ,{\textbf {x}}_{m}|{\textbf {y}})=\dots =P({\textbf {x}}_{1}|{\textbf {y}})P({\textbf {x}}_{2}|{\textbf {y}})\cdots P({\textbf {x}}_{m}|{\textbf {y}})=\prod _{i=1}^{m}P({\textbf {x}}_{i}|{\textbf {y}}).$

Now, assuming ${\textbf {y}}$ is a discrete variable, by Bayes' rule and the factorization above we have:

$P({\textbf {y}}=y_{j}|{\textbf {x}}_{1},\dots ,{\textbf {x}}_{m})={\dfrac {P({\textbf {y}}=y_{j})P({\textbf {x}}_{1},\dots ,{\textbf {x}}_{m}|{\textbf {y}}=y_{j})}{\sum _{k=1}^{n}P({\textbf {y}}=y_{k})P({\textbf {x}}_{1},\dots ,{\textbf {x}}_{m}|{\textbf {y}}=y_{k})}}={\dfrac {P({\textbf {y}}=y_{j})\prod _{i=1}^{m}P({\textbf {x}}_{i}|{\textbf {y}}=y_{j})}{\sum _{k=1}^{n}P({\textbf {y}}=y_{k})\prod _{i=1}^{m}P({\textbf {x}}_{i}|{\textbf {y}}=y_{k})}}.$

Therefore, to attain the value of ${\textbf {y}}$ with the highest posterior probability (denoted by ${\hat {\textbf {y}}}$), it suffices to solve:

${\hat {\textbf {y}}}=\arg\max _{y_{j}}{\bigg \lbrace }{\dfrac {P({\textbf {y}}=y_{j})\prod _{i=1}^{m}P({\textbf {x}}_{i}|{\textbf {y}}=y_{j})}{\sum _{k=1}^{n}P({\textbf {y}}=y_{k})\prod _{i=1}^{m}P({\textbf {x}}_{i}|{\textbf {y}}=y_{k})}}{\bigg \rbrace }$.

Notice that the denominator does not depend on $y_{j}$, so the expression above reduces to:

${\hat {\textbf {y}}}=\arg\max _{y_{j}}{\bigg \lbrace }P({\textbf {y}}=y_{j})\prod _{i=1}^{m}P({\textbf {x}}_{i}|{\textbf {y}}=y_{j}){\bigg \rbrace }$.

This is called the Naive Bayes classification rule.
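As a concrete illustration, the classification rule can be sketched in a few lines of Python (this sketch is not part of the original article): each prior and conditional probability is estimated by its empirical frequency in a small categorical dataset, and the class maximizing the product is returned.

```python
from collections import Counter

def naive_bayes_predict(X, y, x_new):
    """Classify x_new by the Naive Bayes rule
    argmax_c P(y = c) * prod_i P(x_i | y = c),
    estimating each probability by its empirical frequency."""
    n = len(y)
    class_counts = Counter(y)
    best_class, best_score = None, -1.0
    for c, count in class_counts.items():
        score = count / n  # prior P(y = c)
        for i, value in enumerate(x_new):
            matches = sum(1 for row, label in zip(X, y)
                          if label == c and row[i] == value)
            score *= matches / count  # likelihood P(x_i = value | y = c)
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# Toy usage: two categorical features, two classes.
X = [["a", "x"], ["a", "y"], ["b", "y"], ["b", "y"]]
y = ["pos", "pos", "neg", "neg"]
print(naive_bayes_predict(X, y, ["a", "y"]))  # -> pos
```

The dataset, feature values, and function name here are illustrative; any categorical features and class labels work the same way.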

## Controversies

The adjective "naive" comes from the unrealistic assumption that the features in a data set are conditionally independent given the class. In practice, this assumption is often violated. Nevertheless, naive Bayes classifiers tend to perform very well despite this unrealistic assumption, particularly for small sample sizes.

Additionally, consider a binary response taking values $0$ and $1$, and suppose that every training observation with a particular level of a categorical predictor belongs to class $0$. The estimated conditional probability of that level given class $1$ is then zero, and because the classification rule involves a product, this single zero factor wipes out the information contributed by every other feature to the posterior. In other words, the Naive Bayes model will always predict class $0$ for any new observation with that predictor level. Depending on the application, this may or may not be desirable.
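A short Python sketch (with hypothetical numbers, not taken from the article) of how a single zero count dominates the product:

```python
# Suppose class 1's training rows never contain some feature level, so the
# empirical estimate of that conditional probability is 0 / count = 0.0.
prior_class1 = 0.5
conditionals_class1 = [0.0, 0.9, 0.8]  # first factor came from a zero count

score = prior_class1
for p in conditionals_class1:
    score *= p  # one zero factor wipes out the whole product

print(score)  # -> 0.0, so class 1 can never win the argmax for this input
```

In practice this is commonly mitigated by adding a small pseudocount to every cell before estimating the conditional probabilities (Laplace smoothing), so that no estimated probability is exactly zero.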

## History

1763: Bayes' theorem is named after the Reverend Thomas Bayes (1702–1761), who studied how to compute a distribution for the probability parameter of a binomial distribution; his work was published posthumously in 1763.

Early 1960s: Naive Bayes is introduced, under a different name, into the text retrieval community.

## Business problems that could be solved with Naive Bayes techniques

Naive Bayes can be used for purposes such as predicting customer behavior, preferences, and churn; detecting fraudulent financial reporting; spam detection; network security; and sentiment analysis. Naive Bayes may also be used as a distance measure between categorical variables.

## Example

Consider the following dataset of ten patients, recording three symptoms and whether each patient had the flu:

| Fever | Headache | Sore Throat | Flu |
|-------|----------|-------------|-----|
| Yes   | No       | No          | No  |
| No    | No       | Yes         | Yes |
| Yes   | Yes      | No          | Yes |
| Yes   | No       | Yes         | Yes |
| No    | Yes      | Yes         | No  |
| No    | No       | No          | No  |
| Yes   | No       | No          | Yes |
| No    | Yes      | Yes         | No  |
| No    | Yes      | Yes         | Yes |
| No    | No       | Yes         | Yes |

Suppose we want to classify a new patient with the following observation:

Fever = Yes, Headache = No, Sore Throat = Yes

Now, we have:

$P(Flu=Yes)=0.6$ $P(Flu=No)=0.4$
$P(Fever=Yes|Flu=Yes)=0.5$ $P(Fever=Yes|Flu=No)=0.25$
$P(Headache=Yes|Flu=Yes)=0.33$ $P(Headache=Yes|Flu=No)=0.5$
$P(SoreThroat=Yes|Flu=Yes)=0.66$ $P(SoreThroat=Yes|Flu=No)=0.5$
$P(Fever=No|Flu=Yes)=0.5$ $P(Fever=No|Flu=No)=0.75$
$P(Headache=No|Flu=Yes)=0.66$ $P(Headache=No|Flu=No)=0.5$
$P(SoreThroat=No|Flu=Yes)=0.33$ $P(SoreThroat=No|Flu=No)=0.5$

From this, we have

$P(Flu=Yes)P(Fever=Yes|Flu=Yes)P(Headache=No|Flu=Yes)P(SoreThroat=Yes|Flu=Yes)=(0.6)(0.5)(0.66)(0.66)\approx 0.13$

and

$P(Flu=No)P(Fever=Yes|Flu=No)P(Headache=No|Flu=No)P(SoreThroat=Yes|Flu=No)=(0.4)(0.25)(0.5)(0.5)=0.025$.

Therefore, as 0.13 is larger than 0.025, this new patient would be classified as having the flu.
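The arithmetic can be verified with a few lines of Python, reading the counts directly off the dataset (6 of 10 patients have the flu; the fractions below are the per-class counts for each symptom value):

```python
# Unnormalized posterior scores for the new patient
# (Fever = Yes, Headache = No, Sore Throat = Yes).
p_yes = 0.6 * (3/6) * (4/6) * (4/6)  # P(Flu=Yes) times its three conditionals
p_no = 0.4 * (1/4) * (2/4) * (2/4)   # P(Flu=No)  times its three conditionals

print(round(p_yes, 3), round(p_no, 3))  # -> 0.133 0.025
print("Yes" if p_yes > p_no else "No")  # -> Yes
```

Note that these scores are unnormalized; dividing each by their sum would give the actual posterior probabilities, but the argmax is the same either way.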

## Top 5 Recent Tweets

11 Dec 2015 @vyassaurabh411 One of the most informative articles on #NaiveBayes and #TextClassification!! Thanks @rasbt !! #DataScience
13 Dec 2015 @albuhhh Overhearing a Stats PhD talk about his daily fantasy sports algorithm. Hint: involves naive Bayes. Exhibit A that you're getting fleeced.

11 Dec 2015 @gcosma1 A nice explanation of naive bayes #machinelearning #datascience
10 Dec 2015 @kylemathews classifying domains into arbitrary categories. Using Naive bayes. It's working really well but there's lots of knobs to learn.
4 Dec 2015 @cljds An implementation of Naive Bayes in Clojure applied to the Titanic dataset https://github.com/clojuredatascience/ch4-classification/blob/master/src/cljds/ch4/examples.clj#L215 … #clojurex