Logistic regression, also called logit regression or logit modelling, is a statistical technique for modelling categorical dependent variables. When the dependent variable is binary, i.e. it can assume only two values (e.g. male or female), the technique is referred to as binary logistic regression. When the dependent variable can assume more than two values that are not ordered, the technique is referred to as multinomial logistic regression. When the dependent variable assumes more than two values that are ordered, the technique is known as ordinal logistic regression. The independent variables can be continuous or discrete, but a discrete variable that assumes more than two values must be dummy coded.

Logistic regression can be applied in the following situations:

1. When a dependent variable assumes only two values. For example, classification of students as admitted or rejected based on their physical and social characteristics. [1]
2. When a dependent variable takes on more than two values whose order does not matter. For example, high school students can choose a general, vocational or academic program; socioeconomic status and writing score could then be used to model their choice. [2]
3. When a dependent variable assumes more than two values that are ordered. For example, a market research firm would like to understand decision making in ordering the size (small, medium, large or extra large) of soda. [3]

The technique has found wide use among data scientists and is implemented in statistical software such as IBM SPSS, R, Stata and SAS.
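As an illustration of fitting in such software, a binary logistic regression takes only a few lines of Python with scikit-learn. The data below are simulated purely for the sketch, and note that scikit-learn applies L2 regularization by default:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Simulated data: two continuous predictors and a binary outcome.
n = 200
X = rng.normal(size=(n, 2))
true_logit = 0.8 * X[:, 0] + 1.2 * X[:, 1]
p = 1.0 / (1.0 + np.exp(-true_logit))
y = (rng.uniform(size=n) < p).astype(int)

# Note: scikit-learn regularizes by default (C=1.0), so the estimates
# shrink slightly toward zero relative to plain maximum likelihood.
model = LogisticRegression()
model.fit(X, y)

print(model.coef_)                  # estimated slope coefficients
print(model.predict_proba(X[:5]))   # predicted class probabilities
```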

## Assumptions of logistic regression

Logistic regression has several assumptions that must be satisfied for its results to be valid.

• The dependent variable must be binary (for binary logistic regression).
• Over-fitting should be avoided by excluding variables that are not relevant to the model.
• The error terms have to be independent.
• The independent variables should not be highly correlated, i.e. there should be no multicollinearity.
• Maximum likelihood estimation needs an adequate sample size; a common rule of thumb is at least 10 events per independent variable. [4]
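The multicollinearity assumption can be screened with a quick look at the pairwise correlations among the independent variables (variance inflation factors are the more thorough diagnostic). A minimal NumPy sketch with made-up data:

```python
import numpy as np

rng = np.random.default_rng(1)

# Three predictors; x2 is deliberately made almost collinear with x0.
x0 = rng.normal(size=500)
x1 = rng.normal(size=500)
x2 = x0 + rng.normal(scale=0.05, size=500)
X = np.column_stack([x0, x1, x2])

# Pairwise correlation matrix of the predictors (columns of X).
corr = np.corrcoef(X, rowvar=False)

# Flag pairs with |r| above a conventional screening threshold.
threshold = 0.9
i, j = np.triu_indices_from(corr, k=1)
for a, b in zip(i, j):
    if abs(corr[a, b]) > threshold:
        print(f"columns {a} and {b} are highly correlated: r = {corr[a, b]:.3f}")
```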

## Examples of logistic regression in practical use

• A study published in the American Journal of Botany used logistic regression to understand the effect of two herbs on subtropical shrubland habitat. [5]
• Uro Today reported a study that used ordinal logistic regression to understand factors that cause hand-foot skin reaction. [6]
• Nature.com reports a study that used logistic regression to predict cirrhosis hospitalization. [7]

## The Logistic Regression Model

With ${\displaystyle {\mathbf {y}}}$ a vector of binary responses, and ${\displaystyle {\mathbf {X}}}$ a matrix of numerical variables, the logistic regression model is

${\displaystyle y_{i}={\dfrac {1}{1+e^{{\mathbf {x}}_{i}^{T}\beta }}}+\epsilon _{i}}$.

Note that ${\displaystyle {\mathbf {x}}_{i}^{T}}$ is the ${\displaystyle i^{th}}$ row of ${\displaystyle {\mathbf {X}}}$, i.e. the transposed vector of explanatory variables for observation ${\displaystyle i}$.
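Note that this article parameterizes the success probability as 1/(1+e^{x'β}); the more common textbook form e^{x'β}/(1+e^{x'β}) is the same model with the sign of β flipped. A small sketch evaluating the mean response under the article's convention (the numbers are illustrative, not from the article):

```python
import numpy as np

def mean_response(X, beta):
    """E[y_i] = 1 / (1 + exp(x_i^T beta)), following the article's convention.

    (The more common convention, exp(x^T beta) / (1 + exp(x^T beta)),
    is the same model with beta negated.)
    """
    return 1.0 / (1.0 + np.exp(X @ beta))

# Illustrative design matrix and coefficient vector.
X = np.array([[1.0, 0.5],
              [1.0, -2.0]])
beta = np.array([0.2, 1.0])

print(mean_response(X, beta))  # probabilities strictly between 0 and 1
```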

## Parameter Estimation With Maximum Likelihood

Suppose that there are ${\displaystyle n_{1}}$ successes (1's) and hence ${\displaystyle n-n_{1}}$ failures (0's). The ungrouped case is where (re-ordering the observations if necessary) we have ${\displaystyle y_{1}=y_{2}=\dots =y_{n_{1}}=1}$ and ${\displaystyle y_{n_{1}+1}=y_{n_{1}+2}=\dots =y_{n}=0}$. In this case, the maximum likelihood estimates for logistic regression are the ${\displaystyle {\hat {\beta }}}$ that satisfy the equations

${\displaystyle \sum _{i=1}^{n}\left(1-{\dfrac {e^{{\mathbf {x}}_{i}^{T}\beta }}{1+e^{{\mathbf {x}}_{i}^{T}\beta }}}\right){\mathbf {x_{i}}}=\sum _{i=1}^{n_{1}}{\mathbf {x}}_{i}}$.

Note that these equations are not linear in ${\displaystyle \beta }$. Since there are ${\displaystyle p}$ equations in ${\displaystyle p}$ unknowns, iterative algorithms such as Newton-Raphson (the usual choice for logistic regression), Gauss-Newton or Levenberg-Marquardt must be used to approximate a solution.
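A from-scratch Newton-Raphson sketch in NumPy, using simulated data and the article's 1/(1+e^{x'β}) parameterization; at convergence the estimate satisfies the score equations displayed above:

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulated data following the article's convention:
# P(y_i = 1) = 1 / (1 + exp(x_i^T beta)).
n, p = 500, 3
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, -0.5, 0.25])
prob = 1.0 / (1.0 + np.exp(X @ beta_true))
y = (rng.uniform(size=n) < prob).astype(float)

# Newton-Raphson iterations for the maximum likelihood estimate.
beta = np.zeros(p)
for _ in range(25):
    mu = 1.0 / (1.0 + np.exp(X @ beta))   # fitted P(y_i = 1)
    score = X.T @ (mu - y)                # gradient of the log-likelihood
    W = mu * (1.0 - mu)                   # working weights
    step = np.linalg.solve(X.T @ (W[:, None] * X), score)
    beta = beta + step
    if np.linalg.norm(step) < 1e-10:
        break

# At the MLE the score is zero, which is exactly the displayed equation:
# sum_i (1 - e^{x_i^T b}/(1 + e^{x_i^T b})) x_i equals the sum of x_i
# over the successes.  (Note 1 - e/(1+e) = 1/(1+e) = mu.)
mu = 1.0 / (1.0 + np.exp(X @ beta))
lhs = (mu[:, None] * X).sum(axis=0)
rhs = X[y == 1.0].sum(axis=0)
print(beta)        # close to beta_true
print(lhs - rhs)   # numerically zero at the MLE
```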

## Goodness of Fit Using the Likelihood Ratio Test

Consider the hypothesis ${\displaystyle H_{0}}$: The logistic model is appropriate versus ${\displaystyle H_{1}}$: The saturated model is appropriate. To test this we use the likelihood ratio test, with ${\displaystyle L_{saturated}}$ the likelihood of the saturated model and ${\displaystyle L_{model}}$ the likelihood of the model. The test statistic is defined as

${\displaystyle D_{model}=-2\ln \left({\dfrac {L_{model}}{L_{saturated}}}\right)}$

and is called the model deviance. Note that under ${\displaystyle H_{0}}$, it follows that ${\displaystyle D_{model}\sim \chi _{n-p}^{2}}$ and so, ${\displaystyle H_{0}}$ is rejected when ${\displaystyle D_{model}>\chi _{\alpha ,n-p}^{2}}$ where ${\displaystyle \alpha }$ is a previously chosen significance level. Notice that this test is of limited use in practice, as the saturated model is generally unknown.

In practice, one generally wants to test whether a subset of the explanatory variables provides the same amount of information as the full set of explanatory variables. In this case, we write the vector ${\displaystyle \beta }$ as two stacked vectors (or a scalar and a vector). Specifically, we write ${\displaystyle \beta ={\begin{bmatrix}\beta _{1}\\\beta _{2}\end{bmatrix}}}$ and consider the hypothesis ${\displaystyle H_{0}}$: ${\displaystyle \beta _{1}={\mathbf {0}}}$ versus ${\displaystyle H_{1}}$: ${\displaystyle \beta _{1}\neq {\mathbf {0}}}$. The test statistic is defined as

${\displaystyle D_{sub-model}=-2\ln \left({\dfrac {L_{sub-model}}{L_{full-model}}}\right)}$

where the likelihood of the sub-model is the likelihood of the model with parameter vector ${\displaystyle \beta _{2}}$ alone, and the likelihood of the full model is the likelihood of the model with the entire parameter vector ${\displaystyle \beta }$. In this case, ${\displaystyle D_{sub-model}\sim \chi _{r}^{2}}$ where ${\displaystyle r}$ is the number of degrees of freedom associated with the vector ${\displaystyle \beta _{1}}$. This test statistic is called the deviance. ${\displaystyle H_{0}}$ is rejected when ${\displaystyle D_{sub-model}>\chi _{\alpha ,r}^{2}}$ where ${\displaystyle \alpha }$ is a previously chosen significance level.
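The nested-model test above can be sketched end to end: fit the full model and the sub-model by maximum likelihood, form the deviance, and compare it with a chi-square distribution with r degrees of freedom. The helper below re-implements a Newton-Raphson fit under the article's parameterization, and the data and the dropped variable are invented for illustration:

```python
import numpy as np
from math import erfc, sqrt

rng = np.random.default_rng(3)

def fit_logistic(X, y, iters=50):
    """Newton-Raphson MLE under the article's P(y=1) = 1/(1+exp(x^T beta))
    convention; returns (beta_hat, maximized log-likelihood)."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        mu = 1.0 / (1.0 + np.exp(X @ beta))
        W = mu * (1.0 - mu)
        step = np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (mu - y))
        beta = beta + step
        if np.linalg.norm(step) < 1e-10:
            break
    mu = 1.0 / (1.0 + np.exp(X @ beta))
    loglik = np.sum(y * np.log(mu) + (1.0 - y) * np.log(1.0 - mu))
    return beta, loglik

# Simulated data in which the first explanatory variable genuinely matters.
n = 400
X_full = np.column_stack([rng.normal(size=n), rng.normal(size=n), np.ones(n)])
prob = 1.0 / (1.0 + np.exp(X_full @ np.array([1.5, 0.3, -0.2])))
y = (rng.uniform(size=n) < prob).astype(float)

# Sub-model drops the first column, i.e. H0: beta_1 = 0, so r = 1 here.
_, ll_full = fit_logistic(X_full, y)
_, ll_sub = fit_logistic(X_full[:, 1:], y)

D = -2.0 * (ll_sub - ll_full)   # deviance of the sub-model
# Chi-square upper tail with r = 1 df via P(chi2_1 > D) = erfc(sqrt(D/2));
# for general r, use e.g. scipy.stats.chi2.sf(D, r) instead.
p_value = erfc(sqrt(D / 2.0))
print(D, p_value)
```

With a genuinely informative dropped variable, the deviance exceeds the 5% critical value and the null hypothesis is rejected.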

## History

1958: Logistic regression was developed by statistician David Cox.

## Top 5 Recent Tweets

4 Jan 2016 @FrannyKristell I liked a @YouTube video (http://youtu.be/EocjYP5h0cE?a - Statistics with R: Logistic Regression, Lesson 19 by Courtney Brown). https://twitter.com/FrannyKristell/status/683519754418688000
3 Jan 2016 @researchbib Predictive Analytics of Student Graduation Using Logistic Regression and Decision Tree Algorithm http://dlvr.it/D9N9KQ ReseachBib https://twitter.com/researchbib/status/683489983399723009
2 Jan 2016 @bryantravissmit Quick Post explaining Regularization in Logistic Regression http://wp.me/p73AmQ-e7 via #datascience #machinelearning https://twitter.com/bryantravissmit/status/683362921452183552
1 Jan 2016 @DaiDaniels One of the Best.I was lucky & having audited both his Logistic Regression & Survival Analysis classes over the years http://www.seattletimes.com/seattle-news/obituaries/dr-norman-breslow-74-dies-uw-biostatistician-led-to-advances-in-medical-research/?utm_source=twitter&utm_medium=social&utm_campaign=article_left https://twitter.com/DaiDaniels/status/683126255571488768
1 Jan 2016 @colbycosh Oh, there’s some good stuff in here. That pROC package for plotting ROCs in a logistic regression is new. https://twitter.com/colbycosh/status/682952900587433984