# Simple Linear Regression

Simple linear regression is the most commonly used technique for determining how the response variable $Y$ is affected by changes in a single explanatory variable $X$. The terms "response" and "explanatory" mean the same thing as "dependent" and "independent", but the former terminology is preferred because in the case of multiple linear regression the "independent" variable may actually be interdependent with many other variables as well.

Simple linear regression is used for three main purposes:

1. To describe the linear dependence of the response $Y$ on the explanatory variable $X$.
2. To predict values of the response $Y$ from values of the explanatory variable $X$, which is useful when data on $X$ are more readily available than data on $Y$.
3. To correct for the linear dependence of one variable on another, in order to clarify other features of its variability.

Any line fitted through a cloud of data will deviate from each data point to a greater or lesser degree. The vertical distance between a data point and the fitted line is termed a "residual". This distance is a measure of prediction error, in the sense that it is the discrepancy between the actual value of the response variable and the value predicted by the line. Linear regression determines the best-fit line through a scatterplot of data, such that the sum of squared residuals is minimized; equivalently, it minimizes the error variance. The fit is "best" in precisely that sense: the sum of squared errors is as small as possible. That is why it is also termed "Ordinary Least Squares" regression.

## The Simple Linear Regression Model

Formally, consider an explanatory variable $X$ and a response $Y$ and suppose there are $n$ randomly selected subjects in an experiment. With $\epsilon _{i}$ as unknown random errors and $i=1,2,...,n$, the simple linear regression model is:

$y_{i}=\beta _{0}+\beta _{1}x_{i}+\epsilon _{i}$.

Notice the similarity of the equation above to the classic equation of a line, $y=mx+b$, where $m$ is the slope and $b$ is the intercept. It is easily seen that in simple linear regression $\beta _{1}$ is the slope while $\beta _{0}$ is the intercept. We call $\beta _{0}$ and $\beta _{1}$ the unknown parameters, and additionally we assume $\beta _{0}$ and $\beta _{1}$ to be constants (and not random variables).

## Assumptions

1. The $x_{i}$ are nonrandom and measured with negligible error.
2. The $\epsilon _{i}$ are uncorrelated random variables with mean equal to 0 (i.e. $E(\epsilon _{i})=0$ with $E$ the expected value).
3. The $\epsilon _{i}$ have homogeneous variance $\sigma ^{2}$. (i.e. $var(\epsilon _{i})=E(\epsilon _{i}^{2})=\sigma ^{2}$ with $var$ the variance).
4. At each level of $X$ (i.e. at each individual observation $x_{i}$), $Y$ is a random variable with a mean, a variance, and a distribution. Note that $y_{i}$ denotes the levels, or individual observations, of $Y$.
5. It is often assumed (although not necessary) that $\epsilon _{i}$ are normally distributed with mean 0 and variance $\sigma ^{2}$ (i.e. $\epsilon _{i}\sim N(0,\sigma ^{2})$). Note that this is the same as assuming that $y_{i}\sim N(\beta _{0}+\beta _{1}x_{i},\sigma ^{2})$.

Notice that from the assumptions above, it follows that $E(y_{i})=\mu (x_{i})=\beta _{0}+\beta _{1}x_{i}$, which is called the mean of $y_{i}$ at the point $x_{i}$. It also follows that $var(y_{i})=\sigma ^{2}(x_{i})=\sigma ^{2}$, which is called the variance of $y_{i}$ at the point $x_{i}$.
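The model and its assumptions can be illustrated with a minimal simulation. In the sketch below, the values of $\beta _{0}$, $\beta _{1}$, and $\sigma$ are illustrative choices, not values from the text:

```python
import random

random.seed(0)  # for reproducibility

# Illustrative parameter choices (assumed for this sketch only)
beta0, beta1, sigma = 1.0, 2.0, 0.5

# Assumption 1: the x_i are fixed and nonrandom
x = [float(i) for i in range(1, 21)]

# Assumption 5 (optional): eps_i ~ N(0, sigma^2), with mean 0
eps = [random.gauss(0.0, sigma) for _ in x]

# The model: y_i = beta0 + beta1 * x_i + eps_i
y = [beta0 + beta1 * xi + e for xi, e in zip(x, eps)]
```

Each $y_i$ is then a random variable whose mean is $\beta_0 + \beta_1 x_i$ and whose variance is $\sigma^2$, exactly as stated above.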

## Partitioning the total variability

From the section above, the goal of simple linear regression is to estimate $\mu (x_{i})$ for all $i$. This is accomplished by using estimates of $\beta _{0}$ and $\beta _{1}$, which are obtained through a partitioning of the "total variability" of the observed response $Y$, where the total variability of $Y$ is the quantity $\sum _{i=1}^{n}(y_{i}-{\bar {Y}})^{2}$. Denoting the least squares estimates of $\beta _{0}$ and $\beta _{1}$ as ${\hat {\beta }}_{0}$ and ${\hat {\beta }}_{1}$ respectively, the process proceeds in two steps.

1. Estimate ${\hat {\beta }}_{0}$ and ${\hat {\beta }}_{1}$.
2. Calculate ${\hat {y}}_{i}={\hat {\mu }}(x_{i})={\hat {\beta }}_{0}+{\hat {\beta }}_{1}x_{i}$ for all $i$. The ${\hat {y}}_{i}$ are called the predicted values.

Note that the partitioning of the total variability of $Y$ is achieved by adding $0=-{\hat {y}}_{i}+{\hat {y}}_{i}$ to the equation of total variability in the following way: $\sum _{i=1}^{n}(y_{i}-{\bar {Y}})^{2}=\sum _{i=1}^{n}(y_{i}-{\hat {y}}_{i}+{\hat {y}}_{i}-{\bar {Y}})^{2}=\sum _{i=1}^{n}(y_{i}-{\hat {y}}_{i})^{2}+\sum _{i=1}^{n}({\hat {y}}_{i}-{\bar {Y}})^{2}$ (the cross term vanishes for the least squares fit),

and that the quantities from equation above are given special names:

1. The sum of squares total is $SST=\sum _{i=1}^{n}(y_{i}-{\bar {Y}})^{2}$,
2. The sum of squares regression is $SSR=\sum _{i=1}^{n}({\hat {y}}_{i}-{\bar {Y}})^{2}$, and
3. The sum of squares error is $SSE=\sum _{i=1}^{n}(y_{i}-{\hat {y}}_{i})^{2}$.

Note that some authors refer to the sum of squares regression as the sum of squares model.
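The partition above can be checked numerically. The sketch below fits a least squares line to a small made-up data set and verifies that $SST=SSR+SSE$:

```python
# Illustrative data (made up for this example)
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.1, 5.9, 8.2, 9.8]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

# Least squares fit (slope and intercept)
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / \
     sum((xi - x_bar) ** 2 for xi in x)
b0 = y_bar - b1 * x_bar
y_hat = [b0 + b1 * xi for xi in x]  # predicted values

# The three sums of squares from the partition
SST = sum((yi - y_bar) ** 2 for yi in y)
SSR = sum((yhi - y_bar) ** 2 for yhi in y_hat)
SSE = sum((yi - yhi) ** 2 for yi, yhi in zip(y, y_hat))
# For the least squares fit, the cross term vanishes, so SST = SSR + SSE
```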

## Residuals

The $i^{th}$ residual is the difference between the observed $y_{i}$ and the predicted ${\hat {y}}_{i}$. This is written as $e_{i}=y_{i}-{\hat {y}}_{i}$. Now, as ${\hat {y}}_{i}={\hat {\beta }}_{0}+{\hat {\beta }}_{1}x_{i}$, it follows that $e_{i}=y_{i}-{\hat {\beta }}_{0}-{\hat {\beta }}_{1}x_{i}$. Due to this, in order to attain quality predictions, ${\hat {\beta }}_{0}$ and ${\hat {\beta }}_{1}$ are chosen so that the residuals are collectively as small as possible.

## Parameter Estimation With Ordinary Least Squares

One method for estimating the unknown parameters $\beta _{0}$ and $\beta _{1}$ is through the use of Ordinary Least Squares (OLS). This is accomplished by finding values for ${\hat {\beta }}_{0}$ and ${\hat {\beta }}_{1}$ that minimize the sum of the squared residuals: $\sum _{i=1}^{n}e_{i}^{2}=\sum _{i=1}^{n}(y_{i}-{\hat {y}}_{i})^{2}=\sum _{i=1}^{n}(y_{i}-{\hat {\beta }}_{0}-{\hat {\beta }}_{1}x_{i})^{2}$. Note that the squared residuals are used in the summation above due to the fact that the sum of the (non-squared) residuals is necessarily 0.

OLS proceeds by taking the partial derivatives of $\sum _{i=1}^{n}(y_{i}-{\hat {\beta }}_{0}-{\hat {\beta }}_{1}x_{i})^{2}$ with respect to ${\hat {\beta }}_{0}$ and ${\hat {\beta }}_{1}$ and setting them equal to zero. After some algebra, the OLS estimates are

1. ${\hat {\beta }}_{1}={\dfrac {\sum _{i=1}^{n}x_{i}y_{i}-n{\bar {X}}{\bar {Y}}}{\sum _{i=1}^{n}x_{i}^{2}-n{\bar {X}}^{2}}}$ and
2. ${\hat {\beta }}_{0}={\bar {Y}}-{\hat {\beta }}_{1}{\bar {X}}$

Therefore, the OLS predictions are attained with the equation ${\hat {y_{i}}}={\hat {\beta }}_{0}+{\hat {\beta }}_{1}x_{i}$. Note that ${\hat {\beta }}_{0}$ and ${\hat {\beta }}_{1}$ are unbiased estimators for $\beta _{0}$ and $\beta _{1}$ as $E({\hat {\beta }}_{0})=\beta _{0}$ and $E({\hat {\beta }}_{1})=\beta _{1}$. Note also that the unbiased property of the parameter estimates follows from the assumption that $E(\epsilon _{i})=0$.
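A minimal sketch of the closed-form estimators above, applied to a small made-up data set:

```python
# Illustrative data (made up for this example)
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# beta1_hat = (sum x_i*y_i - n*x_bar*y_bar) / (sum x_i^2 - n*x_bar^2)
beta1_hat = (sum(xi * yi for xi, yi in zip(x, y)) - n * x_bar * y_bar) / \
            (sum(xi ** 2 for xi in x) - n * x_bar ** 2)

# beta0_hat = y_bar - beta1_hat * x_bar
beta0_hat = y_bar - beta1_hat * x_bar

# OLS predictions: y_hat_i = beta0_hat + beta1_hat * x_i
y_hat = [beta0_hat + beta1_hat * xi for xi in x]
```

For this data set the estimates come out to ${\hat {\beta }}_{1}=1.99$ and ${\hat {\beta }}_{0}=0.05$, so the fitted line is close to the pattern $y\approx 2x$ visible in the data.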

## Degrees of Freedom

For each of the sum of squares equations (from the partitioning of total variability section), there are related degrees of freedom. For simple linear regression, we have:

1. $df_{total}=n-1$
2. $df_{regression}=1$
3. $df_{error}=n-2$

Note that $df_{regression}$ is the number of explanatory variables in the model and $df_{error}$ is $n$ minus the number of unknown parameters $\beta$ in the model.

## Mean Squares

The mean squares are the ratio of the sum of squares over the respective degrees of freedom. Therefore, for simple linear regression:

1. the mean square for the regression is $MSR={\dfrac {SSR}{df_{regression}}}=SSR$ and
2. the mean square error is $MSE={\dfrac {SSE}{df_{error}}}={\dfrac {SSE}{n-2}}$.

Note also that MSE is an unbiased estimate for the variance $\sigma ^{2}$. Specifically, ${\hat {\sigma }}^{2}=MSE$.
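Continuing with made-up data, MSE is computed by dividing SSE by its $n-2$ degrees of freedom:

```python
# Illustrative data (made up for this example)
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.2, 1.9, 3.2, 4.1, 4.8, 6.1]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

# Least squares fit
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / \
     sum((xi - x_bar) ** 2 for xi in x)
b0 = y_bar - b1 * x_bar

# Sum of squares error, then the mean square error
SSE = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
MSE = SSE / (n - 2)  # unbiased estimate of sigma^2
```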

## Coefficient of Determination

The coefficient of determination, denoted by $R^{2}$ is a measure of fit for the estimated model. Specifically, $R^{2}$ is a measure of the amount of variance (of Y) explained by the explanatory variable $X$. For simple linear regression, the equation is:

$R^{2}={\dfrac {SSR}{SST}}=1-{\dfrac {SSE}{SST}}$.

Note that $R^{2}$ is a number between 0 and 1. For example, $R^{2}=1$ implies that all points fall exactly on the fitted line, while an $R^{2}$ equal (or close) to 0 implies that the points are extremely scattered or that they follow a non-linear pattern. In either case, the regression model is poor when $R^{2}$ is close to 0, while an $R^{2}$ close to 1 indicates that the model produces quality predictions (i.e. the model is a good fit).
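A short sketch computing $R^{2}=1-SSE/SST$ for a least squares fit on made-up, strongly linear data:

```python
# Illustrative data (made up for this example)
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.1, 2.0, 2.9, 4.2, 4.8]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

# Least squares fit
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / \
     sum((xi - x_bar) ** 2 for xi in x)
b0 = y_bar - b1 * x_bar
y_hat = [b0 + b1 * xi for xi in x]

SST = sum((yi - y_bar) ** 2 for yi in y)
SSE = sum((yi - yhi) ** 2 for yi, yhi in zip(y, y_hat))

R2 = 1 - SSE / SST  # coefficient of determination
```

Because the data here lie nearly on a line, $R^{2}$ comes out close to 1; replacing `y` with scattered values would drive it toward 0.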

## Controversies

The main criticisms of simple linear regression involve the simplicity of the model (it involves only one explanatory variable), the (optional) assumption of the normality of the response $Y$, and the fact that, most of the time, the relationship between an explanatory variable and a response cannot be described as linear.

## History

1805: The earliest form of regression was the method of least squares, which was published by Legendre.