The extension of simple linear regression to multiple explanatory (or predictor) variables is known as multiple linear regression (or multivariable linear regression). Nearly all real-world regression models involve multiple predictors, and basic descriptions of linear regression are often phrased in terms of the multiple regression model. Multiple linear regression is used for two main purposes:

1. To describe the linear dependence of the response ${\displaystyle Y}$ on a collection of explanatory variables ${\displaystyle X_{1},X_{2},...,X_{k}}$.
2. To predict values of the response ${\displaystyle Y}$ from values of the explanatory variables ${\displaystyle X_{1},X_{2},...,X_{k}}$, for which more data are available.

Any hyper-plane fitted through a cloud of data will deviate from each data point to a greater or lesser degree. The vertical distance between a data point and the fitted hyper-plane is termed a "residual". This distance is a measure of prediction error, in the sense that it is the discrepancy between the actual value of the response variable and the value predicted by the hyper-plane. Linear regression determines the best-fit hyper-plane through a scattering of data, such that the sum of squared residuals is minimized; equivalently, it minimizes the estimated error variance. The fit is "best" in precisely that sense: the sum of squared errors is as small as possible. That is why it is also termed "Ordinary Least Squares" regression. [1]

## The Multiple Linear Regression Model

Formally, consider a collection of explanatory variables ${\displaystyle X_{1},X_{2},...,X_{k}}$ and a response variable ${\displaystyle Y}$ and suppose there are ${\displaystyle n}$ randomly selected subjects in an experiment. With ${\displaystyle \epsilon _{i}}$ as unknown random errors and ${\displaystyle i=1,2,...,n}$, the multiple linear regression model is:

${\displaystyle y_{i}=\beta _{0}+\beta _{1}x_{1i}+\beta _{2}x_{2i}+...+\beta _{k}x_{ki}+\epsilon _{i}}$.

We call ${\displaystyle \beta _{0},\beta _{1},...,\beta _{k}}$ the unknown parameters, and we assume them to be constants (not random variables). Multiple regression is usually written in terms of vectors and matrices:

${\displaystyle {\mathbf {y}}={\mathbf {X}}{\mathbf {\beta }}+{\mathbf {\epsilon }}}$

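where ${\displaystyle {\mathbf {y}}={\begin{bmatrix}y_{1}\\y_{2}\\\vdots \\y_{n}\end{bmatrix}},{\mathbf {X}}={\begin{bmatrix}1&x_{11}&\cdots &x_{k1}\\1&x_{12}&\cdots &x_{k2}\\\vdots &\vdots &\ddots &\vdots \\1&x_{1n}&\cdots &x_{kn}\end{bmatrix}},{\mathbf {\beta }}={\begin{bmatrix}\beta _{0}\\\beta _{1}\\\vdots \\\beta _{k}\end{bmatrix}},{\text{ and }}{\mathbf {\epsilon }}={\begin{bmatrix}\epsilon _{1}\\\epsilon _{2}\\\vdots \\\epsilon _{n}\end{bmatrix}}}$. Here ${\displaystyle {\mathbf {X}}}$ has ${\displaystyle n}$ rows and ${\displaystyle k+1}$ columns, and ${\displaystyle x_{ji}}$ denotes the value of the ${\displaystyle j}$th explanatory variable for subject ${\displaystyle i}$, matching the scalar form of the model.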
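As a minimal sketch of the matrix form, the following NumPy snippet builds a design matrix for a made-up data set with ${\displaystyle n=5}$ subjects and ${\displaystyle k=2}$ explanatory variables. The data values, the parameter vector `beta`, and the random seed are illustrative assumptions and are not taken from the text.

```python
import numpy as np

# Hypothetical data: n = 5 subjects, k = 2 explanatory variables.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0])

# Design matrix X: a leading column of ones for the intercept beta_0,
# followed by one column per explanatory variable.
X = np.column_stack([np.ones_like(x1), x1, x2])

# Illustrative parameter vector (beta_0, beta_1, beta_2) and error vector.
beta = np.array([1.0, 2.0, -0.5])
rng = np.random.default_rng(0)
epsilon = rng.normal(loc=0.0, scale=1.0, size=x1.size)

# The model in matrix form: y = X beta + epsilon.
y = X @ beta + epsilon

print(X.shape)  # (5, 3): n rows and k + 1 columns
```

Each row of `X` corresponds to one subject, so the matrix product reproduces the scalar model one observation at a time.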

## Assumptions

1. The ${\displaystyle x_{ji}}$ are nonrandom and measured with negligible error. Note that the columns of the matrix ${\displaystyle {\mathbf {X}}}$ can also be transformations of explanatory variables.
2. ${\displaystyle \epsilon }$ is a random vector.
3. For each ${\displaystyle i}$, ${\displaystyle E(\epsilon _{i})=0}$ where ${\displaystyle E}$ is the expected value. That is, the ${\displaystyle \epsilon _{i}}$ have mean equal to 0. This can also be written as ${\displaystyle E({\mathbf {\epsilon }})={\mathbf {0}}}$.
4. For each ${\displaystyle i}$, ${\displaystyle var(\epsilon _{i})=\sigma ^{2}}$ where ${\displaystyle var}$ is the variance. That is, the ${\displaystyle \epsilon _{i}}$ have homogeneous variance ${\displaystyle \sigma ^{2}}$. Equivalently, every diagonal entry of the covariance matrix ${\displaystyle var({\mathbf {\epsilon }})}$ equals ${\displaystyle \sigma ^{2}}$.
5. For each ${\displaystyle i\neq j}$, ${\displaystyle E(\epsilon _{i}\epsilon _{j})=0}$. That is, the ${\displaystyle \epsilon _{i}}$ are uncorrelated random variables.
6. It is often assumed (although not necessary) that ${\displaystyle {\mathbf {\epsilon }}}$ follows a multivariate normal distribution with mean ${\displaystyle {\mathbf {0}}}$ and covariance matrix ${\displaystyle \sigma ^{2}{\mathbf {I}}}$, where ${\displaystyle {\mathbf {0}}}$ is a vector of zeros with length ${\displaystyle n}$ and ${\displaystyle {\mathbf {I}}}$ is the ${\displaystyle n\times n}$ identity matrix.

Notice assumptions 4 and 5 can be written compactly as ${\displaystyle var({\mathbf {\epsilon }})=\sigma ^{2}{\mathbf {I}}}$.
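To make the compact covariance statement concrete, the sketch below draws many error vectors from the multivariate normal distribution of assumption 6 and checks that their sample mean and sample covariance are close to ${\displaystyle {\mathbf {0}}}$ and ${\displaystyle \sigma ^{2}{\mathbf {I}}}$. The values of `n`, `sigma`, the number of draws, and the seed are arbitrary choices made for illustration.

```python
import numpy as np

n = 3          # length of each error vector (illustrative)
sigma = 2.0    # assumed error standard deviation (illustrative)
draws = 200_000

rng = np.random.default_rng(0)

# Each row is one draw of the error vector epsilon ~ N(0, sigma^2 I).
errors = rng.multivariate_normal(
    mean=np.zeros(n),
    cov=sigma**2 * np.eye(n),
    size=draws,
)

# Sample mean vector: should be near the zero vector (assumption 3).
sample_mean = errors.mean(axis=0)

# Sample covariance matrix: diagonal near sigma^2 (assumption 4),
# off-diagonal entries near 0 (assumption 5).
sample_cov = np.cov(errors, rowvar=False)

print(np.round(sample_mean, 2))
print(np.round(sample_cov, 2))
```

Normality is used here only as a convenient way to generate errors; assumptions 3 to 5 themselves do not require it.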

## Partitioning the total variability

As described in the introduction, the goal of multiple regression is to estimate the mean response ${\displaystyle \mu (x_{i})=E(y_{i})}$ for all ${\displaystyle i}$. This is accomplished by using estimates of ${\displaystyle \beta }$, which are obtained through a partitioning of the "total variability" of the observed response ${\displaystyle Y}$, where the total variability of ${\displaystyle Y}$ is the quantity ${\displaystyle \sum _{i=1}^{n}(y_{i}-{\bar {Y}})^{2}}$. Denoting the least squares estimate of ${\displaystyle \beta }$ as ${\displaystyle {\hat {\beta }}}$, the process follows in two steps.

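1. Compute the estimate ${\displaystyle {\hat {\beta }}}$.
2. Calculate ${\displaystyle {\hat {y}}_{i}={\hat {\mu }}(x_{i})={\hat {\beta }}_{0}+{\hat {\beta }}_{1}x_{1i}+...+{\hat {\beta }}_{k}x_{ki}}$, or in matrix form ${\displaystyle {\hat {\mathbf {y}}}={\mathbf {X}}{\hat {\beta }}}$. The ${\displaystyle {\hat {y}}_{i}}$ are called the predicted values.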

Note that the partitioning of the total variability of ${\displaystyle Y}$ is achieved by adding ${\displaystyle 0=-{\hat {y}}_{i}+{\hat {y}}_{i}}$ to the equation of total variability in the following way: ${\displaystyle \sum _{i=1}^{n}(y_{i}-{\bar {Y}})^{2}=\sum _{i=1}^{n}(y_{i}-{\hat {y}}_{i}+{\hat {y}}_{i}-{\bar {Y}})^{2}=\sum _{i=1}^{n}(y_{i}-{\hat {y}}_{i})^{2}+\sum _{i=1}^{n}({\hat {y}}_{i}-{\bar {Y}})^{2}}$. The cross term ${\displaystyle 2\sum _{i=1}^{n}(y_{i}-{\hat {y}}_{i})({\hat {y}}_{i}-{\bar {Y}})}$ vanishes for the least squares fit when the model includes an intercept, which is why only the two sums of squares remain.

The quantities in the equation above are given special names:

1. The sum of squares total is ${\displaystyle SST=\sum _{i=1}^{n}(y_{i}-{\bar {Y}})^{2}}$,
2. The sum of squares regression is ${\displaystyle SSR=\sum _{i=1}^{n}({\hat {y}}_{i}-{\bar {Y}})^{2}}$, and
3. The sum of squares error is ${\displaystyle SSE=\sum _{i=1}^{n}(y_{i}-{\hat {y}}_{i})^{2}}$.

Note that some authors refer to the sum of squares regression as the sum of squares model.
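The partition can be checked numerically. The sketch below simulates a small data set, fits it with NumPy's least-squares solver `np.linalg.lstsq` (used here as a stand-in for the estimation step), and compares the total sum of squares with the sum of the regression and error sums of squares. The sample size, true coefficients, and seed are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 50, 2

# Simulated design matrix (intercept column plus k predictors) and response.
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=n)

# Least squares fit and predicted values.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta_hat
y_bar = y.mean()

sst = np.sum((y - y_bar) ** 2)      # sum of squares total
ssr = np.sum((y_hat - y_bar) ** 2)  # sum of squares regression
sse = np.sum((y - y_hat) ** 2)      # sum of squares error

# Cross term of the expansion; it is zero up to rounding error.
cross = np.sum((y - y_hat) * (y_hat - y_bar))

print(np.isclose(sst, ssr + sse))  # True
```

Dropping the column of ones from `X` would generally make `cross` nonzero, in which case the identity no longer holds.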

## Residuals

The ${\displaystyle i^{th}}$ residual is the difference between the observed ${\displaystyle y_{i}}$ and the predicted ${\displaystyle {\hat {y}}_{i}}$. This is written as ${\displaystyle e_{i}=y_{i}-{\hat {y}}_{i}}$. Now, as ${\displaystyle {\hat {y}}_{i}={\hat {\beta }}_{0}+{\hat {\beta }}_{1}x_{1i}+{\hat {\beta }}_{2}x_{2i}+...+{\hat {\beta }}_{k}x_{ki}}$, it follows that ${\displaystyle e_{i}=y_{i}-{\hat {\beta }}_{0}-{\hat {\beta }}_{1}x_{1i}-{\hat {\beta }}_{2}x_{2i}-...-{\hat {\beta }}_{k}x_{ki}}$. In general no single choice of parameters can make every residual zero, so in order to attain quality predictions, ${\displaystyle {\hat {\beta }}_{0},{\hat {\beta }}_{1},\dots ,{\hat {\beta }}_{k}}$ are chosen to make the residuals small collectively, by minimizing the sum of squared residuals ${\displaystyle \sum _{i=1}^{n}e_{i}^{2}}$.

## Parameter Estimation With Ordinary Least Squares

One method for estimating the vector of unknown parameters ${\displaystyle \beta }$ is through the use of Ordinary Least Squares (OLS). This is accomplished by finding the value ${\displaystyle {\hat {\beta }}}$ of ${\displaystyle \beta }$ that minimizes the sum of the squared residuals: ${\displaystyle ({\mathbf {y}}-{\mathbf {X}}\beta )^{T}({\mathbf {y}}-{\mathbf {X}}\beta )}$ where ${\displaystyle T}$ is the matrix transpose.

OLS proceeds by taking the matrix calculus partial derivatives of ${\displaystyle ({\mathbf {y}}-{\mathbf {X}}\beta )^{T}({\mathbf {y}}-{\mathbf {X}}\beta )}$ with respect to ${\displaystyle \beta }$ and setting them equal to zero. After some algebra, this yields a system of equations for the OLS estimate ${\displaystyle {\hat {\beta }}}$, known as the normal equations:

${\displaystyle ({\mathbf {X}}^{T}{\mathbf {X}}){\hat {\beta }}={\mathbf {X}}^{T}{\mathbf {y}}}$
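The algebra behind the normal equations can be sketched as follows, writing ${\displaystyle S(\beta )}$ for the sum of squared residuals (the symbol ${\displaystyle S}$ is introduced here only for convenience):

${\displaystyle {\begin{aligned}S(\beta )&=({\mathbf {y}}-{\mathbf {X}}\beta )^{T}({\mathbf {y}}-{\mathbf {X}}\beta )={\mathbf {y}}^{T}{\mathbf {y}}-2\beta ^{T}{\mathbf {X}}^{T}{\mathbf {y}}+\beta ^{T}{\mathbf {X}}^{T}{\mathbf {X}}\beta ,\\{\frac {\partial S}{\partial \beta }}&=-2{\mathbf {X}}^{T}{\mathbf {y}}+2{\mathbf {X}}^{T}{\mathbf {X}}\beta .\end{aligned}}}$

Setting the derivative equal to the zero vector at ${\displaystyle \beta ={\hat {\beta }}}$ and dividing by 2 gives the normal equations stated above.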

Therefore, assuming that the inverse of ${\displaystyle ({\mathbf {X}}^{T}{\mathbf {X}})}$ exists, the OLS estimates are obtained with the equation ${\displaystyle {\hat {\beta }}=({\mathbf {X}}^{T}{\mathbf {X}})^{-1}{\mathbf {X}}^{T}{\mathbf {y}}}$. Note that ${\displaystyle {\hat {\beta }}}$ is an unbiased estimator for ${\displaystyle \beta }$, as ${\displaystyle E({\hat {\beta }})=\beta }$. Note also that the unbiased property of the parameter estimates follows from the assumption that ${\displaystyle E(\epsilon _{i})=0}$.
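A minimal numerical sketch of this estimate is given below. It solves the normal equations directly with `np.linalg.solve` rather than forming the inverse explicitly, and cross-checks the result against `np.linalg.lstsq`. The simulated data, the "true" coefficients `beta_true`, and the noise level are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100

# Simulated design matrix with an intercept column and two predictors.
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(scale=0.1, size=n)

# Solve the normal equations (X^T X) beta_hat = X^T y.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check with NumPy's least-squares routine.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(beta_hat, beta_lstsq))  # True
```

Solving the linear system is usually preferred to computing the inverse because it is cheaper and numerically more stable; when the columns of `X` are nearly collinear, a least-squares routine such as `np.linalg.lstsq` is the safer choice.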

## Degrees of Freedom

For each of the sums of squares (from the partitioning of total variability section), there are related degrees of freedom. For multiple linear regression, we have:

1. ${\displaystyle df_{total}=n-1}$,
2. ${\displaystyle df_{regression}=p-1}$, and
3. ${\displaystyle df_{error}=n-p}$.

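Note that ${\displaystyle p=k+1}$ denotes the number of parameters in ${\displaystyle \beta }$ (the intercept plus one coefficient per explanatory variable). Hence ${\displaystyle df_{regression}=p-1=k}$ is the number of explanatory variables in the model, and ${\displaystyle df_{error}}$ is ${\displaystyle n}$ minus the number of parameters ${\displaystyle p}$. The two add up to the total: ${\displaystyle (p-1)+(n-p)=n-1}$.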

## Mean Squares

The mean squares are the ratios of the sums of squares to their respective degrees of freedom. Therefore, for multiple linear regression:

1. the mean square for the model is ${\displaystyle MSR={\dfrac {SSR}{df_{regression}}}={\dfrac {SSR}{p-1}}}$ and
2. the mean square error is ${\displaystyle MSE={\dfrac {SSE}{df_{error}}}={\dfrac {SSE}{n-p}}}$.

Note also that MSE is an unbiased estimate for the variance ${\displaystyle \sigma ^{2}}$. Specifically, ${\displaystyle {\hat {\sigma }}^{2}=MSE}$.
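The degrees of freedom and mean squares can be computed as in the following sketch. The data are simulated with error standard deviation 1, so the mean square error should land near ${\displaystyle \sigma ^{2}=1}$; the sample size, coefficients, and seed are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 200, 2
p = k + 1  # number of parameters, including the intercept

# Simulated data with error variance sigma^2 = 1.
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=1.0, size=n)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta_hat

ssr = np.sum((y_hat - y.mean()) ** 2)  # sum of squares regression
sse = np.sum((y - y_hat) ** 2)         # sum of squares error

# Degrees of freedom for each sum of squares.
df_total = n - 1
df_regression = p - 1
df_error = n - p

msr = ssr / df_regression  # mean square regression
mse = sse / df_error       # mean square error, an estimate of sigma^2

print(df_total == df_regression + df_error)  # True
print(round(mse, 3))
```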

## Coefficient of Determination

The coefficient of determination, denoted by ${\displaystyle R^{2}}$, is a measure of fit for the estimated model. Specifically, ${\displaystyle R^{2}}$ is the proportion of the variance of ${\displaystyle Y}$ explained by the explanatory variables ${\displaystyle X_{1},X_{2},...,X_{k}}$. For multiple linear regression, the equation is:

${\displaystyle R^{2}={\dfrac {SSR}{SST}}=1-{\dfrac {SSE}{SST}}}$.

Note that ${\displaystyle R^{2}}$ is a number between 0 and 1. For example, ${\displaystyle R^{2}=1}$ implies that all points fall exactly on the fitted hyper-plane, while ${\displaystyle R^{2}=0}$ (or ${\displaystyle R^{2}}$ close to 0) implies that the points are extremely scattered or that the points follow a non-linear pattern. In either case, the regression model is poor when ${\displaystyle R^{2}}$ is close to 0, while an ${\displaystyle R^{2}}$ close to 1 indicates that the model fits the observed data well (i.e. the model is a good fit).
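The two extreme cases can be illustrated with a short sketch. The helper `r_squared` is a hypothetical function written for this example (it is not part of NumPy); it fits the model by least squares and returns ${\displaystyle R^{2}}$ computed both as ${\displaystyle SSR/SST}$ and as ${\displaystyle 1-SSE/SST}$. The simulated data and seed are assumptions.

```python
import numpy as np

def r_squared(X, y):
    """Fit y on X by least squares; return (SSR/SST, 1 - SSE/SST)."""
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    y_hat = X @ beta_hat
    sst = np.sum((y - y.mean()) ** 2)
    ssr = np.sum((y_hat - y.mean()) ** 2)
    sse = np.sum((y - y_hat) ** 2)
    return ssr / sst, 1.0 - sse / sst

rng = np.random.default_rng(4)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta = np.array([1.0, 2.0, -0.5])

# Case 1: response lies exactly on a hyper-plane, so R^2 is 1.
r2_exact, r2_exact_alt = r_squared(X, X @ beta)

# Case 2: response is pure noise unrelated to X, so R^2 is near 0.
r2_noise, r2_noise_alt = r_squared(X, rng.normal(size=n))

print(round(r2_exact, 6))  # 1.0
print(round(r2_noise, 3))
```

Both formulas agree whenever the model contains an intercept, which is the case here because `X` includes a column of ones.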

## Controversies

The main criticisms of multiple linear regression involve the required linearity (in the coefficients) of the model as well as the (optional) assumption of normality of the errors ${\displaystyle \epsilon }$, and hence of the response ${\displaystyle Y}$.