Logistic Regression

From Verify.Wiki
Jump to: navigation, search

Logistic regression, also called logit regression or logit modeling, is a statistical technique that enables modelling of categorical dependent variables. When the dependent variable is binary i.e it can assume only two values e.g male or female the technique is referred to as binary logistic regression. When a dependent variable can assume more than two values and is not ordered the technique is referred to as multinomial logistic regression. When a dependent variable assumes more than two values and is ordered the technique is known as ordinal logistic regression. The independent variables can be continuous or discrete but when a discrete variable assumes more than two variables dummy coding has to be done.

Logistic regression can be applied in the following situations:

  1. When a dependent variable assumes only two values. For example classification of students as admitted or rejected based on their physical and social characteristics. [1]
  2. When a dependent variable takes on more than two values and there is no importance in the order of the values. For example high school students can choose a general, vocational or academic program. Social economic and writing score could then be used to model their choice. [2]
  3. When a dependent variable assumes more than two values that are ordered. For example a market research firm would like to understand decision making in ordering the size (small, medium, large or extra large) of soda. [3]

The technique has found wide use among data scientists and has been implemented in statistical software like IBM SPSS, R, Stata and SAS among others.

Assumptions of logistic regression

Logistic Regression has some assumptions that need to be satisfied for results to be valid.

  • The dependent variable must be either a binary variable.
  • You need to avoid model over-fitting by excluding variables not relevant to the model
  • The error terms have to be independent
  • The variables should not be related i.e no multicollinearity.
  • Independent variables need at least 10 values for accurate maximum likelihood estimation.[4]

Examples of logistic regression in practical use

  • A study published in the American Journal of Botany used logistic regression to understand the effect of two herbs on subtropical shrubland habitat. [5]
  • Uro Today reported a study that used ordinal logistic regression to understand factors that cause hand-foot skin reaction. [6]
  • Nature.com reports a study that used logistic regression to predict cirrhosis hospitalization. [7]

The Logistic Regression Model

With a vector of binary responses, and a matrix of numerical variables, the logistic regression model is

.

Note that is the transpose of the column of .

Parameter Estimation With Maximum Likelihood

Suppose that there are successes (1's) and so, failures (0's). The ungrouped case is where (by re-ordering the observations if necessary) we have and . In this case, the maximum likelihood estimates for logistic regression are the that satisfy the equations

.

Note that as these equations are not linear in the and as there are unknowns and equations, algorithms such as Gauss-Newton or Levenberg-Marquardt must be used to approximate a solution.

Goodness of Fit Using the Likelihood Ratio Test

Consider the hypothesis : The logistic model is appropriate versus : The saturated model is appropriate. To test this we use the likelihood ratio test, with the likelihood of the saturated model and the likelihood of the model. The test statistic is defined as

and is called the model deviance. Note that under , it follows that and so, is rejected when where is a previously chosen significance level. Notice that in practice this is not very useful as generally the saturated model is usually unknown.

In practice, one would generally like to test if a subset of the explanatory variables provides the same amount of information as the explanatory variables in the full model. In this case, we write the vector as two separate vectors (or a scalar and a vector). Specifically, we write and consider the hypothesis : versus : . The test statistic is defined as

where the likelihood of the sub-model is the likelihood in regards to the model with the vector of parameters and likelihood of the full-model is the likelihood in regards to the model with the entire vector of parameters . In this case, where are the degrees of freedom associated with the vector . This test statistic is called the deviance. In this case, is rejected when where is a previously chosen significance level.

History

1958: Logistic regression was developed by statistician David Cox.

Top 5 Recent Tweets

Date Author Tweet link
4 Jan 2016 @FrannyKristell Me ha gustado un vídeo de @YouTube (http://youtu.be/EocjYP5h0cE?a - Statistics with R: Logistic Regression, Lesson 19 by Courtney Brown). https://twitter.com/FrannyKristell/status/683519754418688000
3 Jan 2016 @researchbib Predictive Analytics of Student Graduation Using Logistic Regression and Decision Tree Algorithm http://dlvr.it/D9N9KQ ReseachBib https://twitter.com/researchbib/status/683489983399723009
2 Jan 20156 @bryantravissmit Quick Post explaining Regularization in Logistic Regression http://wp.me/p73AmQ-e7 via #datascience #machinelearning https://twitter.com/bryantravissmit/status/683362921452183552
1 Jan 2016 @DaiDaniels One of the Best.I was lucky & having audited both his Logistic Regression & Survival Analysis classes over the years http://www.seattletimes.com/seattle-news/obituaries/dr-norman-breslow-74-dies-uw-biostatistician-led-to-advances-in-medical-research/?utm_source=twitter&utm_medium=social&utm_campaign=article_left https://twitter.com/DaiDaniels/status/683126255571488768
1 Jan 2016 @colbycosh Oh, there’s some good stuff in here. That pROC package for plotting ROCs in a logistic regression is new. https://twitter.com/colbycosh/status/682952900587433984

Top 5 Recent News Headlines

Date Title Link
21 Dec 2015 Predictive factors for sorafenib-induced hand-foot skin reaction using ordered logistic regression analysis. http://www.urotoday.com/recent-abstracts/urologic-oncology/renal-cancer/85576-predictive-factors-for-sorafenib-induced-hand-foot-skin-reaction-using-ordered-logistic-regression-analysis.html
17 Dec 2015 Effects of habitat degradation, microsite, and seed density on the persistence of two native herbs in a subtropical shrubland http://www.amjbot.org/content/early/2015/11/24/ajb.1500125/suppl/DC1
17 Dec 2015 Men With Mustaches Outnumber Women in Top Med School Roles http://fortune.com/2015/12/17/men-mustaches-women-med-school/
22 Dec 2015 Gut Microbiota Alterations can predict Hospitalizations in Cirrhosis Independent of Diabetes Mellitus http://www.nature.com/articles/srep18559
20 Oct 2015 Kaggle Tackles Whale of an Identification Problem http://www.datanami.com/2015/10/20/kaggle-tackles-whale-of-an-identification-problem/

Top 5 Lifetime Tweets

References

  1. https://onlinecourses.science.psu.edu/stat504/node/149
  2. http://www.ats.ucla.edu/stat/spss/dae/mlogit.htm
  3. http://www.ats.ucla.edu/stat/sas/dae/ologit.htm
  4. http://www.statisticssolutions.com/assumptions-of-logistic-regression/
  5. http://www.amjbot.org/content/early/2015/11/24/ajb.1500125/suppl/DC1
  6. http://www.urotoday.com/recent-abstracts/urologic-oncology/renal-cancer/85576-predictive-factors-for-sorafenib-induced-hand-foot-skin-reaction-using-ordered-logistic-regression-analysis.html
  7. http://www.nature.com/articles/srep18559

Verification history