Logistic Regression
Logistic regression, also called logit regression or logit modeling, is a statistical technique that enables modelling of categorical dependent variables. When the dependent variable is binary i.e it can assume only two values e.g male or female the technique is referred to as binary logistic regression. When a dependent variable can assume more than two values and is not ordered the technique is referred to as multinomial logistic regression. When a dependent variable assumes more than two values and is ordered the technique is known as ordinal logistic regression. The independent variables can be continuous or discrete but when a discrete variable assumes more than two variables dummy coding has to be done.
Logistic regression can be applied in the following situations:
- When a dependent variable assumes only two values. For example classification of students as admitted or rejected based on their physical and social characteristics. [1]
- When a dependent variable takes on more than two values and there is no importance in the order of the values. For example high school students can choose a general, vocational or academic program. Social economic and writing score could then be used to model their choice. [2]
- When a dependent variable assumes more than two values that are ordered. For example a market research firm would like to understand decision making in ordering the size (small, medium, large or extra large) of soda. [3]
The technique has found wide use among data scientists and has been implemented in statistical software like IBM SPSS, R, Stata and SAS among others.
Contents
- 1 Assumptions of logistic regression
- 2 Examples of logistic regression in practical use
- 3 The Logistic Regression Model
- 4 Parameter Estimation With Maximum Likelihood
- 5 Goodness of Fit Using the Likelihood Ratio Test
- 6 History
- 7 Top 5 Recent Tweets
- 8 Top 5 Recent News Headlines
- 9 Top 5 Lifetime Tweets
- 10 References
Assumptions of logistic regression
Logistic Regression has some assumptions that need to be satisfied for results to be valid.
- The dependent variable must be either a binary variable.
- You need to avoid model over-fitting by excluding variables not relevant to the model
- The error terms have to be independent
- The variables should not be related i.e no multicollinearity.
- Independent variables need at least 10 values for accurate maximum likelihood estimation.[4]
Examples of logistic regression in practical use
- A study published in the American Journal of Botany used logistic regression to understand the effect of two herbs on subtropical shrubland habitat. [5]
- Uro Today reported a study that used ordinal logistic regression to understand factors that cause hand-foot skin reaction. [6]
- Nature.com reports a study that used logistic regression to predict cirrhosis hospitalization. [7]
The Logistic Regression Model
With a vector of binary responses, and a matrix of numerical variables, the logistic regression model is
.
Note that is the transpose of the column of .
Parameter Estimation With Maximum Likelihood
Suppose that there are successes (1's) and so, failures (0's). The ungrouped case is where (by re-ordering the observations if necessary) we have and . In this case, the maximum likelihood estimates for logistic regression are the that satisfy the equations
.
Note that as these equations are not linear in the and as there are unknowns and equations, algorithms such as Gauss-Newton or Levenberg-Marquardt must be used to approximate a solution.
Goodness of Fit Using the Likelihood Ratio Test
Consider the hypothesis : The logistic model is appropriate versus : The saturated model is appropriate. To test this we use the likelihood ratio test, with the likelihood of the saturated model and the likelihood of the model. The test statistic is defined as
and is called the model deviance. Note that under , it follows that and so, is rejected when where is a previously chosen significance level. Notice that in practice this is not very useful as generally the saturated model is usually unknown.
In practice, one would generally like to test if a subset of the explanatory variables provides the same amount of information as the explanatory variables in the full model. In this case, we write the vector as two separate vectors (or a scalar and a vector). Specifically, we write and consider the hypothesis : versus : . The test statistic is defined as
where the likelihood of the sub-model is the likelihood in regards to the model with the vector of parameters and likelihood of the full-model is the likelihood in regards to the model with the entire vector of parameters . In this case, where are the degrees of freedom associated with the vector . This test statistic is called the deviance. In this case, is rejected when where is a previously chosen significance level.
History
1958: Logistic regression was developed by statistician David Cox.
Top 5 Recent Tweets
Date | Author | Tweet | link |
---|---|---|---|
4 Jan 2016 | @FrannyKristell | Me ha gustado un vídeo de @YouTube (http://youtu.be/EocjYP5h0cE?a - Statistics with R: Logistic Regression, Lesson 19 by Courtney Brown). | https://twitter.com/FrannyKristell/status/683519754418688000 |
3 Jan 2016 | @researchbib | Predictive Analytics of Student Graduation Using Logistic Regression and Decision Tree Algorithm http://dlvr.it/D9N9KQ ReseachBib | https://twitter.com/researchbib/status/683489983399723009 |
2 Jan 20156 | @bryantravissmit | Quick Post explaining Regularization in Logistic Regression http://wp.me/p73AmQ-e7 via #datascience #machinelearning | https://twitter.com/bryantravissmit/status/683362921452183552 |
1 Jan 2016 | @DaiDaniels | One of the Best.I was lucky & having audited both his Logistic Regression & Survival Analysis classes over the years http://www.seattletimes.com/seattle-news/obituaries/dr-norman-breslow-74-dies-uw-biostatistician-led-to-advances-in-medical-research/?utm_source=twitter&utm_medium=social&utm_campaign=article_left … | https://twitter.com/DaiDaniels/status/683126255571488768 |
1 Jan 2016 | @colbycosh | Oh, there’s some good stuff in here. That pROC package for plotting ROCs in a logistic regression is new. | https://twitter.com/colbycosh/status/682952900587433984 |
Top 5 Recent News Headlines
Date | Title | Link |
---|---|---|
21 Dec 2015 | Predictive factors for sorafenib-induced hand-foot skin reaction using ordered logistic regression analysis. | http://www.urotoday.com/recent-abstracts/urologic-oncology/renal-cancer/85576-predictive-factors-for-sorafenib-induced-hand-foot-skin-reaction-using-ordered-logistic-regression-analysis.html |
17 Dec 2015 | Effects of habitat degradation, microsite, and seed density on the persistence of two native herbs in a subtropical shrubland | http://www.amjbot.org/content/early/2015/11/24/ajb.1500125/suppl/DC1 |
17 Dec 2015 | Men With Mustaches Outnumber Women in Top Med School Roles | http://fortune.com/2015/12/17/men-mustaches-women-med-school/ |
22 Dec 2015 | Gut Microbiota Alterations can predict Hospitalizations in Cirrhosis Independent of Diabetes Mellitus | http://www.nature.com/articles/srep18559 |
20 Oct 2015 | Kaggle Tackles Whale of an Identification Problem | http://www.datanami.com/2015/10/20/kaggle-tackles-whale-of-an-identification-problem/ |
Top 5 Lifetime Tweets
References
- ↑ https://onlinecourses.science.psu.edu/stat504/node/149
- ↑ http://www.ats.ucla.edu/stat/spss/dae/mlogit.htm
- ↑ http://www.ats.ucla.edu/stat/sas/dae/ologit.htm
- ↑ http://www.statisticssolutions.com/assumptions-of-logistic-regression/
- ↑ http://www.amjbot.org/content/early/2015/11/24/ajb.1500125/suppl/DC1
- ↑ http://www.urotoday.com/recent-abstracts/urologic-oncology/renal-cancer/85576-predictive-factors-for-sorafenib-induced-hand-foot-skin-reaction-using-ordered-logistic-regression-analysis.html
- ↑ http://www.nature.com/articles/srep18559