Simple Linear Regression
Simple linear regression is the most commonly used technique for determining how the response variable is affected by changes in a single explanatory variable. The terms "response" and "explanatory" mean the same thing as "dependent" and "independent", but the former terminology is preferred because, in the case of multiple linear regression, an "independent" variable may actually be interdependent with many other variables as well.
Simple linear regression is used for three main purposes:
- To describe the linear dependence of the response on the explanatory variable.
- To predict values of the response from values of the explanatory variable, for which more data are available.
- To correct for the linear dependence of one variable on another, in order to clarify other features of its variability.
Any line fitted through a cloud of data will deviate from each data point to a greater or lesser degree. The vertical distance between a data point and the fitted line is termed a "residual". This distance is a measure of prediction error, in the sense that it is the discrepancy between the actual value of the response variable and the value predicted by the line. Linear regression determines the best-fit line through a scatterplot of data, such that the sum of squared residuals is minimized; equivalently, it minimizes the error variance. The fit is "best" in precisely that sense: the sum of squared errors is as small as possible. That is why it is also termed "Ordinary Least Squares" regression. [1]
Contents
- 1 The Simple Linear Regression Model
- 2 Assumptions
- 3 Partitioning the total variability
- 4 Residuals
- 5 Parameter Estimation With Ordinary Least Squares
- 6 Degrees of Freedom
- 7 Mean Squares
- 8 Coefficient of Determination
- 9 Controversies
- 10 History
The Simple Linear Regression Model
Formally, consider an explanatory variable $x$ and a response $Y$, and suppose there are $n$ randomly selected subjects in an experiment. With $\epsilon_i$ as unknown random errors and $i = 1, \dots, n$, the simple linear regression model is:
$Y_i = \beta_0 + \beta_1 x_i + \epsilon_i$.
Notice the similarity of the equation above to the classic equation of a line, $y = mx + b$, where $m$ is the slope and $b$ is the intercept. It is easily seen that in simple linear regression $\beta_1$ is the slope while $\beta_0$ is the intercept. We call $\beta_0$ and $\beta_1$ the unknown parameters, and additionally we assume $\beta_0$ and $\beta_1$ to be constants (and not random variables).
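As a brief sketch of what the model says about the data-generating process, the following Python snippet simulates data from it. The particular values of $\beta_0$, $\beta_1$, $\sigma$, and $n$ are illustrative assumptions, not part of this article.

```python
import numpy as np

rng = np.random.default_rng(0)

beta0, beta1, sigma, n = 1.0, 2.0, 0.5, 50   # assumed "true" parameter values
x = rng.uniform(0, 10, size=n)               # explanatory values (fixed once generated)
eps = rng.normal(0, sigma, size=n)           # unknown random errors
y = beta0 + beta1 * x + eps                  # Y_i = beta_0 + beta_1 * x_i + eps_i

print(y[:5])
```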
Assumptions
- The $x_i$ are nonrandom and measured with negligible error.
- The $\epsilon_i$ are uncorrelated random variables with mean equal to 0 (i.e. $E[\epsilon_i] = 0$, with $E[\cdot]$ the expected value).
- The $\epsilon_i$ have homogeneous variance $\sigma^2$ (i.e. $\mathrm{Var}[\epsilon_i] = \sigma^2$, with $\mathrm{Var}[\cdot]$ the variance).
- At each level (or individual observation) of $x$, denoted $x_i$, $Y_i$ is a random variable with a mean, a variance, and a distribution. Note that $x_i$ denotes the levels, or individual observations, of $x$.
- It is often assumed (although not necessary) that the $\epsilon_i$ are normally distributed with mean 0 and variance $\sigma^2$ (i.e. $\epsilon_i \sim N(0, \sigma^2)$). Note that this is the same as assuming that $Y_i \sim N(\beta_0 + \beta_1 x_i, \sigma^2)$.
Notice that from the assumptions above, it follows that $E[Y_i] = \beta_0 + \beta_1 x_i$ (which is called the mean of $Y$ at the point $x_i$). It also follows that $\mathrm{Var}[Y_i] = \sigma^2$, which is called the variance of $Y$ at the point $x_i$.
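A quick Monte Carlo check of these two consequences, sketched in Python with assumed parameter values: holding a level of $x$ fixed and repeatedly drawing the error, the sample mean of $Y$ approaches $\beta_0 + \beta_1 x_i$ and its sample variance approaches $\sigma^2$.

```python
import numpy as np

rng = np.random.default_rng(1)
beta0, beta1, sigma = 1.0, 2.0, 0.5   # assumed parameter values
xi = 3.0                              # one fixed level of x

# Many independent realizations of Y at the same x_i.
y_draws = beta0 + beta1 * xi + rng.normal(0, sigma, size=100_000)

print(y_draws.mean())   # approximately beta0 + beta1 * xi = 7.0
print(y_draws.var())    # approximately sigma**2 = 0.25
```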
Partitioning the total variability
From the section above, the goal of simple linear regression is to estimate $E[Y_i] = \beta_0 + \beta_1 x_i$ for all $i$. This is accomplished by using estimates of $\beta_0$ and $\beta_1$, which are attained through a partitioning of the "total variability" of the observed response $Y$, where the total variability of $Y$ is the quantity $\sum_{i=1}^{n} (Y_i - \bar{Y})^2$. Denoting the least squares estimates of $\beta_0$ and $\beta_1$ as $\hat{\beta}_0$ and $\hat{\beta}_1$ respectively, the process follows in two steps.
- Estimate $\beta_0$ and $\beta_1$, obtaining $\hat{\beta}_0$ and $\hat{\beta}_1$.
- Calculate $\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$ for all $i$. The $\hat{Y}_i$ are called the predicted values.
Note that the partitioning of the total variability of $Y$ is achieved by adding and subtracting $\hat{Y}_i$ inside the equation of total variability in the following way: $\sum_{i=1}^{n} (Y_i - \bar{Y})^2 = \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2 + \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$,
and that the quantities from the equation above are given special names:
- The sum of squares total is $SST = \sum_{i=1}^{n} (Y_i - \bar{Y})^2$,
- The sum of squares regression is $SSR = \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2$, and
- The sum of squares error is $SSE = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$.
Note that some authors refer to the sum of squares regression as the sum of squares model.
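A numerical check of the partition $SST = SSR + SSE$, sketched in Python with made-up data (`np.polyfit` is used here only to supply the least squares estimates):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.9])

b1, b0 = np.polyfit(x, y, 1)     # least squares slope and intercept
y_hat = b0 + b1 * x              # predicted values

sst = np.sum((y - y.mean()) ** 2)
ssr = np.sum((y_hat - y.mean()) ** 2)
sse = np.sum((y - y_hat) ** 2)

print(sst, ssr + sse)            # the two agree up to floating-point error
```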
Residuals
The residual is the difference between the observed $Y_i$ and the predicted $\hat{Y}_i$. This is written as $e_i = Y_i - \hat{Y}_i$. Now, as $\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$, it follows that $Y_i = \hat{\beta}_0 + \hat{\beta}_1 x_i + e_i$. Due to this, in order to attain quality predictions, $\hat{\beta}_0$ and $\hat{\beta}_1$ are chosen to keep the $e_i$ as small as possible for each $i$.
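As a minimal numerical illustration of this idea, a sketch with made-up data (`np.polyfit` is used only to obtain the least squares fit): the least squares line attains a smaller sum of squared residuals than a nearby alternative line.

```python
import numpy as np

# Hypothetical data: a small cloud of (x, y) points.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.9])

# Least squares fit (degree-1 polynomial = straight line).
slope, intercept = np.polyfit(x, y, 1)

def sse(b0, b1):
    """Sum of squared residuals for the candidate line y = b0 + b1 * x."""
    residuals = y - (b0 + b1 * x)
    return np.sum(residuals ** 2)

print(sse(intercept, slope))              # SSE of the least squares line
print(sse(intercept + 0.3, slope - 0.1))  # any other line has a larger SSE
```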
Parameter Estimation With Ordinary Least Squares
One method for estimating the unknown parameters $\beta_0$ and $\beta_1$ is through the use of Ordinary Least Squares (OLS). This is accomplished by finding values for $\hat{\beta}_0$ and $\hat{\beta}_1$ that minimize the sum of the squared residuals: $SSE = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} \left(Y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i\right)^2$. Note that the squared residuals are used in the summation above due to the fact that the sum of the (non-squared) residuals of the fitted line is necessarily 0.
OLS proceeds by taking the partial derivatives of $SSE$ with respect to $\hat{\beta}_0$ and $\hat{\beta}_1$, setting each to 0, and solving. After algebra, the OLS estimates are
- $\hat{\beta}_1 = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$ and $\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{x}$.
Therefore, the OLS predictions are attained with the equation $\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$. Note that $\hat{\beta}_0$ and $\hat{\beta}_1$ are unbiased estimators for $\beta_0$ and $\beta_1$, as $E[\hat{\beta}_0] = \beta_0$ and $E[\hat{\beta}_1] = \beta_1$. Note also that the unbiased property of the parameter estimates follows from the assumption that $E[\epsilon_i] = 0$.
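The estimates above can be computed directly. The sketch below (with made-up data) implements the two closed-form expressions and also checks that the residuals of the OLS fit sum to (numerically) zero, as referenced above.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.9])

# OLS estimates from the closed-form expressions.
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

y_hat = b0 + b1 * x          # OLS predictions
e = y - y_hat                # residuals

print(b0, b1)
print(e.sum())               # ~0: residuals of the OLS fit sum to zero
```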
Degrees of Freedom
For each of the sum of squares equations (from the partitioning of total variability section), there are related degrees of freedom. For simple linear regression, we have:
- the total degrees of freedom, $df_T = n - 1$,
- the regression degrees of freedom, $df_R = p = 1$, and
- the error degrees of freedom, $df_E = n - 2$.
Note that $p$ is the number of explanatory variables in the model and $n - 2$ is $n$ minus the number of unknown parameters in the model.
Mean Squares
The mean squares are the ratios of the sums of squares to their respective degrees of freedom. Therefore, for simple linear regression:
- the mean square for the model is $MSR = \dfrac{SSR}{df_R} = \dfrac{SSR}{1}$, and
- the mean square error is $MSE = \dfrac{SSE}{df_E} = \dfrac{SSE}{n - 2}$.
Note also that MSE is an unbiased estimate for the variance $\sigma^2$. Specifically, $E[MSE] = \sigma^2$.
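A small simulation consistent with this claim, sketched with assumed parameter values: averaging $MSE = SSE/(n-2)$ over many simulated samples comes out close to the true $\sigma^2$.

```python
import numpy as np

rng = np.random.default_rng(2)
beta0, beta1, sigma, n = 1.0, 2.0, 0.5, 30   # assumed parameter values
x = np.linspace(0, 10, n)

mses = []
for _ in range(5000):
    y = beta0 + beta1 * x + rng.normal(0, sigma, size=n)
    b1, b0 = np.polyfit(x, y, 1)
    sse = np.sum((y - (b0 + b1 * x)) ** 2)
    mses.append(sse / (n - 2))               # MSE = SSE / (n - 2)

print(np.mean(mses))   # close to sigma**2 = 0.25
```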
Coefficient of Determination
The coefficient of determination, denoted by $R^2$, is a measure of fit for the estimated model. Specifically, $R^2$ is a measure of the amount of variance (of $Y$) explained by the explanatory variable $x$. For simple linear regression, the equation is:
$R^2 = \dfrac{SSR}{SST} = 1 - \dfrac{SSE}{SST}$.
Note that $R^2$ is a number between 0 and 1. For example, $R^2 = 1$ implies that all points fall on a straight line, while $R^2 = 0$ (or an $R^2$ close to 0) implies that the points are extremely scattered or that the points follow a non-linear pattern. In either case, the regression model is poor when $R^2$ is close to 0, while an $R^2$ close to 1 indicates that the model produces quality predictions (i.e. the model is a good fit).
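Computed from the sums of squares defined earlier, a sketch with the same made-up data used above:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.9])

b1, b0 = np.polyfit(x, y, 1)
y_hat = b0 + b1 * x

sst = np.sum((y - y.mean()) ** 2)
sse = np.sum((y - y_hat) ** 2)

r_squared = 1 - sse / sst
print(r_squared)   # close to 1 here, since the points are nearly collinear
```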
Controversies
The main criticisms of simple linear regression involve the simplicity of the model (it involves only one explanatory variable), the (optional) assumption of the normality of the response $Y$, and the fact that, most of the time, the relationship between an explanatory variable and a response cannot be described as linear.
History
1805: The earliest form of regression was the method of least squares, which was published by Legendre. [2]