Logistic Regression

Logistic regression, also called logit regression or logit modeling, is a statistical technique for modelling categorical dependent variables. When the dependent variable is binary, i.e. it can assume only two values (e.g. male or female), the technique is referred to as binary logistic regression. When the dependent variable can assume more than two values that are not ordered, the technique is referred to as multinomial logistic regression. When the dependent variable assumes more than two values that are ordered, the technique is known as ordinal logistic regression. The independent variables can be continuous or discrete, but a discrete variable that assumes more than two values has to be dummy coded.
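Dummy coding of a discrete independent variable can be sketched as follows. This is a minimal illustration only: the function name dummy_code, the choice of reference level, and the sample values are assumptions made for this example, not part of any particular statistical package.

```python
def dummy_code(values, reference=None):
    """Turn a categorical variable with k levels into k - 1 indicator columns.

    The reference level gets no column of its own; it is represented by
    a row of zeros. (Hypothetical helper, written for illustration.)
    """
    levels = sorted(set(values))
    if reference is None:
        reference = levels[0]
    # One 0/1 column per non-reference level.
    columns = [level for level in levels if level != reference]
    rows = [[1 if value == level else 0 for level in columns] for value in values]
    return columns, rows


# Illustrative data: a three-level variable, with "general" as the reference.
program = ["general", "vocational", "academic", "general"]
columns, rows = dummy_code(program, reference="general")
print(columns)
print(rows)
```

A variable with three levels therefore enters the model as two indicator columns, and the coefficient of each column is interpreted relative to the reference level.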

Logistic regression can be applied in the following situations:

When the dependent variable assumes only two values, for example when classifying students as admitted or rejected based on their physical and social characteristics. ^[1]
When the dependent variable takes on more than two values and the order of the values does not matter. For example, high school students can choose a general, vocational or academic program, and socioeconomic status and writing score could then be used to model their choice. ^[2]
When the dependent variable assumes more than two values that are ordered, for example when a market research firm wants to understand what drives the choice of soda size (small, medium, large or extra large). ^[3]

The technique is widely used by data scientists and is implemented in statistical software such as IBM SPSS, R, Stata and SAS.

1 Assumptions of logistic regression
2 Examples of logistic regression in practical use
3 The Logistic Regression Model
4 Parameter Estimation With Maximum Likelihood
5 Goodness of Fit Using the Likelihood Ratio Test
6 History
7 Top 5 Recent Tweets
8 Top 5 Recent News Headlines
9 Top 5 Lifetime Tweets
10 References

Assumptions of logistic regression

Logistic regression relies on some assumptions that need to be satisfied for the results to be valid.

For binary logistic regression, the dependent variable must be binary.
The model should not be over-fitted: exclude variables that are not relevant to the model.
The error terms have to be independent.
The independent variables should not be highly correlated with one another, i.e. there should be no multicollinearity.
The sample must be large enough for accurate maximum likelihood estimation; a common rule of thumb is at least 10 cases of the least frequent outcome for each independent variable.^[4]

Examples of logistic regression in practical use

A study published in the American Journal of Botany used logistic regression to understand the effects of habitat degradation, microsite, and seed density on the persistence of two native herbs in a subtropical shrubland. ^[5]
Uro Today reported a study that used ordinal logistic regression to identify predictive factors for sorafenib-induced hand-foot skin reaction. ^[6]
Nature.com reported a study that used logistic regression to examine whether gut microbiota alterations predict hospitalizations in cirrhosis. ^[7]

The Logistic Regression Model

With ${\mathbf {y}}$ a vector of binary responses, and ${\mathbf {X}}$ a matrix of numerical variables, the logistic regression model is

$y_{i}={\dfrac {1}{1+e^{-{\mathbf {x}}_{i}^{T}\beta }}}+\epsilon _{i}$ .

Note that ${\mathbf {x}}_{i}^{T}$ is the $i^{th}$ row of ${\mathbf {X}}$ , and that ${\dfrac {1}{1+e^{-z}}}={\dfrac {e^{z}}{1+e^{z}}}$ .

Parameter Estimation With Maximum Likelihood

Suppose that there are $n_{1}$ successes (1's) and so $n-n_{1}$ failures (0's). The ungrouped case is where (by re-ordering the observations if necessary) we have $y_{1}=y_{2}=\dots =y_{n_{1}}=1$ and $y_{n_{1}+1}=y_{n_{1}+2}=\dots =y_{n}=0$ . In this case, the maximum likelihood estimates for logistic regression are the ${\hat {\beta }}$ that satisfy the equations

$\sum _{i=1}^{n}{\dfrac {e^{{\mathbf {x}}_{i}^{T}\beta }}{1+e^{{\mathbf {x}}_{i}^{T}\beta }}}{\mathbf {x}}_{i}=\sum _{i=1}^{n_{1}}{\mathbf {x}}_{i}$ .

These are $p$ equations in the $p$ unknowns $\beta$ . Because they are not linear in $\beta$ , they have no closed-form solution in general, and an iterative algorithm such as Newton-Raphson, Gauss-Newton or Levenberg-Marquardt is used to approximate a solution.
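As a rough illustration of how such equations can be solved numerically, the sketch below fits a binary logistic regression with plain Newton-Raphson iterations. Several things here are assumptions made for the example rather than taken from the text: it uses the parameterization $p_{i}=1/(1+e^{-{\mathbf {x}}_{i}^{T}\beta })$ , it uses Newton-Raphson rather than Gauss-Newton or Levenberg-Marquardt, the function names (sigmoid, solve, fit_logistic) are hypothetical, and the data are made up. It has no safeguards for perfectly separable data, where the estimates diverge.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function 1 / (1 + exp(-z))."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_logistic(X, y, iterations=25, tol=1e-10):
    """Approximate the maximum likelihood estimate of beta by Newton-Raphson."""
    p = len(X[0])
    beta = [0.0] * p
    for _ in range(iterations):
        probs = [sigmoid(sum(xj * bj for xj, bj in zip(row, beta))) for row in X]
        # Score vector: sum_i (y_i - p_i) x_i, which is zero at the solution.
        score = [sum((yi - pi) * row[j] for row, yi, pi in zip(X, y, probs))
                 for j in range(p)]
        # Information matrix: sum_i p_i (1 - p_i) x_i x_i^T.
        info = [[sum(pi * (1.0 - pi) * row[j] * row[k] for row, pi in zip(X, probs))
                 for k in range(p)] for j in range(p)]
        step = solve(info, score)
        beta = [b + s for b, s in zip(beta, step)]
        if max(abs(s) for s in step) < tol:
            break
    return beta


# Made-up data: an intercept column plus one numerical variable.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0], [1.0, 5.0]]
y = [0, 0, 1, 0, 1, 1]
beta = fit_logistic(X, y)
probs = [sigmoid(sum(xj * bj for xj, bj in zip(row, beta))) for row in X]
print(beta)
```

At convergence the fitted probabilities satisfy the likelihood equations: the sum of the fitted probabilities times ${\mathbf {x}}_{i}$ over all observations matches the sum of ${\mathbf {x}}_{i}$ over the successes.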

Goodness of Fit Using the Likelihood Ratio Test

Consider the hypotheses $H_{0}$ : the logistic model is appropriate, versus $H_{1}$ : the saturated model is appropriate. To test this we use the likelihood ratio test, with $L_{saturated}$ the likelihood of the saturated model and $L_{model}$ the likelihood of the fitted logistic model. The test statistic is defined as

$D_{model}=-2\ln \left({\dfrac {L_{model}}{L_{saturated}}}\right)$

and is called the model deviance. Under $H_{0}$ , $D_{model}$ approximately follows a $\chi _{n-p}^{2}$ distribution, so $H_{0}$ is rejected when $D_{model}>\chi _{\alpha ,n-p}^{2}$ , where $\alpha$ is a previously chosen significance level. In practice this test is of limited use, because for ungrouped binary data the chi-squared approximation to the distribution of the deviance is unreliable.

In practice, one would generally like to test if a subset of the explanatory variables provides the same amount of information as the explanatory variables in the full model. In this case, we write the vector $\beta$ as two separate vectors (or a scalar and a vector). Specifically, we write $\beta ={\begin{bmatrix}\beta _{1}\\\beta _{2}\end{bmatrix}}$ and consider the hypothesis $H_{0}$ : $\beta _{1}={\mathbf {0}}$ versus $H_{1}$ : $\beta _{1}\neq {\mathbf {0}}$ . The test statistic is defined as

$D_{sub-model}=-2\ln \left({\dfrac {L_{sub-model}}{L_{full-model}}}\right)$

where $L_{sub-model}$ is the likelihood of the model that contains only the parameters $\beta _{2}$ , and $L_{full-model}$ is the likelihood of the model that contains the entire parameter vector $\beta$ . Under $H_{0}$ , $D_{sub-model}$ approximately follows a $\chi _{r}^{2}$ distribution, where $r$ is the number of parameters in $\beta _{1}$ . This test statistic equals the difference between the deviance of the sub-model and the deviance of the full model. $H_{0}$ is rejected when $D_{sub-model}>\chi _{\alpha ,r}^{2}$ , where $\alpha$ is a previously chosen significance level.
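The sub-model test can be sketched in a few lines once the two maximized log-likelihoods are available. This is only an illustration under stated assumptions: the function names (chi2_sf, likelihood_ratio_test) are hypothetical, the log-likelihood values are made up, and the chi-squared tail probability is computed with a standard recurrence instead of a statistics library. Rejecting when the p-value is below $\alpha$ is equivalent to rejecting when the statistic exceeds $\chi _{\alpha ,r}^{2}$ .

```python
import math


def chi2_sf(x, dof):
    """Upper tail probability P(chi-squared with dof degrees of freedom > x)."""
    if dof < 1 or int(dof) != dof:
        raise ValueError("dof must be a positive integer")
    if x <= 0:
        return 1.0
    half = x / 2.0
    # Closed-form starting points for 1 and 2 degrees of freedom.
    if dof % 2 == 0:
        q, k = math.exp(-half), 2
    else:
        q, k = math.erfc(math.sqrt(half)), 1
    # Recurrence: Q(k + 2) = Q(k) + (x/2)^(k/2) e^(-x/2) / Gamma(k/2 + 1).
    while k < dof:
        q += half ** (k / 2.0) * math.exp(-half) / math.gamma(k / 2.0 + 1.0)
        k += 2
    return q


def likelihood_ratio_test(loglik_sub, loglik_full, r, alpha=0.05):
    """Return (deviance, p-value, reject H0?) for the sub-model test.

    loglik_sub and loglik_full are maximized log-likelihoods, and r is the
    number of parameters set to zero under H0.
    """
    deviance = -2.0 * (loglik_sub - loglik_full)
    p_value = chi2_sf(deviance, r)
    return deviance, p_value, p_value < alpha


# Made-up log-likelihoods: the full model has two more parameters.
deviance, p_value, reject = likelihood_ratio_test(-60.0, -55.0, r=2)
print(deviance, p_value, reject)
```

With these illustrative numbers the statistic is 10 on 2 degrees of freedom, so the hypothesis that the two extra parameters are zero is rejected at the 5% level.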

History

1958: Logistic regression was developed by statistician David Cox.

Top 5 Recent Tweets

Date	Author	Tweet	Link
4 Jan 2016	@FrannyKristell	I liked a @YouTube video (http://youtu.be/EocjYP5h0cE?a - Statistics with R: Logistic Regression, Lesson 19 by Courtney Brown).	https://twitter.com/FrannyKristell/status/683519754418688000
3 Jan 2016	@researchbib	Predictive Analytics of Student Graduation Using Logistic Regression and Decision Tree Algorithm http://dlvr.it/D9N9KQ ReseachBib	https://twitter.com/researchbib/status/683489983399723009
2 Jan 2016	@bryantravissmit	Quick Post explaining Regularization in Logistic Regression http://wp.me/p73AmQ-e7 via #datascience #machinelearning	https://twitter.com/bryantravissmit/status/683362921452183552
1 Jan 2016	@DaiDaniels	One of the Best.I was lucky & having audited both his Logistic Regression & Survival Analysis classes over the years http://www.seattletimes.com/seattle-news/obituaries/dr-norman-breslow-74-dies-uw-biostatistician-led-to-advances-in-medical-research/?utm_source=twitter&utm_medium=social&utm_campaign=article_left …	https://twitter.com/DaiDaniels/status/683126255571488768
1 Jan 2016	@colbycosh	Oh, there’s some good stuff in here. That pROC package for plotting ROCs in a logistic regression is new.	https://twitter.com/colbycosh/status/682952900587433984

Top 5 Recent News Headlines

Date	Title	Link
21 Dec 2015	Predictive factors for sorafenib-induced hand-foot skin reaction using ordered logistic regression analysis.	http://www.urotoday.com/recent-abstracts/urologic-oncology/renal-cancer/85576-predictive-factors-for-sorafenib-induced-hand-foot-skin-reaction-using-ordered-logistic-regression-analysis.html
17 Dec 2015	Effects of habitat degradation, microsite, and seed density on the persistence of two native herbs in a subtropical shrubland	http://www.amjbot.org/content/early/2015/11/24/ajb.1500125/suppl/DC1
17 Dec 2015	Men With Mustaches Outnumber Women in Top Med School Roles	http://fortune.com/2015/12/17/men-mustaches-women-med-school/
22 Dec 2015	Gut Microbiota Alterations can predict Hospitalizations in Cirrhosis Independent of Diabetes Mellitus	http://www.nature.com/articles/srep18559
20 Oct 2015	Kaggle Tackles Whale of an Identification Problem	http://www.datanami.com/2015/10/20/kaggle-tackles-whale-of-an-identification-problem/

Top 5 Lifetime Tweets

References

[1] https://onlinecourses.science.psu.edu/stat504/node/149

[2] http://www.ats.ucla.edu/stat/spss/dae/mlogit.htm

[3] http://www.ats.ucla.edu/stat/sas/dae/ologit.htm

[4] http://www.statisticssolutions.com/assumptions-of-logistic-regression/

[5] http://www.amjbot.org/content/early/2015/11/24/ajb.1500125/suppl/DC1

[6] http://www.urotoday.com/recent-abstracts/urologic-oncology/renal-cancer/85576-predictive-factors-for-sorafenib-induced-hand-foot-skin-reaction-using-ordered-logistic-regression-analysis.html

[7] http://www.nature.com/articles/srep18559
