What is Linear Regression?

In a cause and effect relationship, the independent variable is the cause, and the dependent variable is the effect. Least squares linear regression is a method for predicting the value of a dependent variable Y, based on the value of an independent variable X.

In this AP Statistics tutorial, we focus on the case where there is only one independent variable. This is called simple regression. In another tutorial (see Regression Tutorial), we cover multiple regression, which handles two or more independent variables.

Tip: The next lesson presents a simple linear regression example that shows how to apply the material covered in this lesson. Since this lesson is a little dense, you may benefit by also reading the next lesson.

Prerequisites for Regression

Simple linear regression is appropriate when the following conditions are satisfied.

The dependent variable Y has a linear relationship to the independent variable X. To check this, make sure that the XY scatterplot is linear and that the residual plot shows a random pattern. (Don't worry. We'll cover residual plots in a future lesson.)
For each value of X, the probability distribution of Y has the same standard deviation σ. When this condition is satisfied, the variability of the residuals will be relatively constant across all values of X, which is easily checked in a residual plot.
For any given value of X,
- The Y values are independent, as indicated by a random pattern on the residual plot.
- The Y values are roughly normally distributed (i.e., bell-shaped). A little skewness is ok if the sample size is large. A histogram or a dotplot will show the shape of the distribution.

The Least Squares Regression Line

Linear regression finds the straight line, called the least squares regression line or LSRL, that best represents observations in a bivariate data set. Suppose Y is a dependent variable, and X is an independent variable. The population regression line is:

Y = β₀ + β₁X

where β₀ is a constant, β₁ is the regression coefficient, X is the value of the independent variable, and Y is the value of the dependent variable.

Given a random sample of observations, the population regression line is estimated by a sample regression line. The sample regression line is:

ŷ = b₀ + b₁x

where b₀ is a constant, b₁ is the regression coefficient, x is the value of the independent variable, and ŷ is the predicted value of the dependent variable.

How to Define a Regression Line

Normally, you will use a computational tool - a software package (e.g., Excel) or a graphing calculator - to find b₀ and b₁. You enter the x and y values into your program or calculator, and the tool solves for the regression constant (b₀) and for the regression coefficient (b₁).

In the unlikely event that you find yourself on a desert island without a computer or a graphing calculator, you can solve for b₀ and b₁ "by hand". Here are the equations.

b₁ = Σ [ (x_i - x̄)(y_i - ȳ) ] / Σ [ (x_i - x̄)² ]

or, equivalently,

b₁ = r * (s_y / s_x)

b₀ = ȳ - b₁ * x̄

where b₀ is the constant in the regression equation, b₁ is the regression coefficient, r is the correlation between x and y, x_i is the x value for observation i, y_i is the y value for observation i, x̄ is the sample mean of x, ȳ is the sample mean of y, s_x is the standard deviation of x, and s_y is the standard deviation of y.
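To make the hand calculation concrete, here is a minimal Python sketch of the equations above. The function names (fit_line, slope_from_correlation) and the five data points are made up for illustration; they are not part of the lesson. Only the standard library is used.

```python
from math import sqrt
from statistics import mean, stdev


def fit_line(xs, ys):
    """Return (b0, b1) for the least squares line through the points."""
    x_bar, y_bar = mean(xs), mean(ys)
    # Sum of cross-products of deviations, and sum of squared x deviations.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


def slope_from_correlation(xs, ys):
    """Return b1 computed the second way, as r * (s_y / s_x)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    r = sxy / sqrt(sxx * syy)  # sample correlation between x and y
    return r * stdev(ys) / stdev(xs)


# Hypothetical sample data, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

b0, b1 = fit_line(xs, ys)
print(round(b0, 4), round(b1, 4))
print(round(slope_from_correlation(xs, ys), 4))
```

For this made-up data the slope works out to 0.6 and the intercept to 2.2, and both formulas for b₁ agree, as they should.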

Properties of the Regression Line

When the regression parameters (b₀ and b₁) are defined as described above, the regression line has the following properties.

The line minimizes the sum of squared differences between observed values (the y values) and predicted values (the ŷ values computed from the regression equation).
The regression line passes through the point (x̄, ȳ), where x̄ is the mean of the x values and ȳ is the mean of the y values.
The regression constant (b₀) is equal to the y intercept of the regression line.
The regression coefficient (b₁) is the average change in the dependent variable (y) for a 1-unit change in the independent variable (x). It is the slope of the regression line.

The least squares regression line is the only straight line that has all of these properties.
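These properties can be checked numerically. The sketch below fits a line to a small made-up data set, confirms that the line passes through the point of means, and shows that nudging the intercept or slope away from the fitted values increases the sum of squared differences. The data, the helper name sse, and the nudge size of 0.1 are assumptions chosen for illustration.

```python
from statistics import mean

# Hypothetical sample data, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

x_bar, y_bar = mean(xs), mean(ys)
b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
      / sum((x - x_bar) ** 2 for x in xs))
b0 = y_bar - b1 * x_bar


def sse(intercept, slope):
    """Sum of squared differences between observed and predicted y."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))


# Property: the predicted value at the mean of x equals the mean of y.
passes_through_means = abs((b0 + b1 * x_bar) - y_bar) < 1e-9

# Property: any nearby line has a larger sum of squared differences.
best = sse(b0, b1)
nudged = [
    sse(b0 + d0, b1 + d1)
    for d0 in (-0.1, 0.0, 0.1)
    for d1 in (-0.1, 0.0, 0.1)
    if (d0, d1) != (0.0, 0.0)
]

print(passes_through_means)
print(all(s > best for s in nudged))
```

A check like this does not prove the properties, since it only tries a handful of nearby lines, but it is a quick sanity test for a fitted line.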

The Coefficient of Determination

The coefficient of determination (denoted by R²) is a key output of regression analysis. It is interpreted as the proportion of the variance in the dependent variable that is predictable from the independent variable.

The coefficient of determination ranges from 0 to 1.
An R² of 0 means that the dependent variable cannot be predicted from the independent variable.
An R² of 1 means the dependent variable can be predicted without error from the independent variable.
An R² between 0 and 1 indicates the extent to which the dependent variable is predictable. An R² of 0.10 means that 10 percent of the variance in y is predictable from x; an R² of 0.20 means that 20 percent is predictable; and so on.

The formula for computing the coefficient of determination for a linear regression model with one independent variable is given below.

Coefficient of determination. The coefficient of determination (R²) for a linear regression model with one independent variable is:

R² = { ( 1 / N ) * Σ [ (x_i - x̄) * (y_i - ȳ) ] / ( σ_x * σ_y ) }²

where N is the number of observations used to fit the model, Σ is the summation symbol, x_i is the x value for observation i, x̄ is the mean x value, y_i is the y value for observation i, ȳ is the mean y value, σ_x is the standard deviation of x, and σ_y is the standard deviation of y. Because the sum is divided by N, σ_x and σ_y here are standard deviations computed with N (not N - 1) in the denominator.

If you know the linear correlation (r) between two variables, then the coefficient of determination (R²) is easily computed using the following formula: R² = r².
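As a rough illustration, the sketch below computes R² for a small made-up data set using the formula above, with pstdev from Python's statistics module supplying standard deviations that divide by N. It then cross-checks the result against 1 - SSE/SST, a standard identity for least squares lines that this lesson does not state explicitly. The data and variable names are assumptions for illustration.

```python
from statistics import mean, pstdev

# Hypothetical sample data, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

n = len(xs)
x_bar, y_bar = mean(xs), mean(ys)

# R² from the formula: ((1/N) * sum of cross-products / (sd_x * sd_y))²
covariance = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / n
r = covariance / (pstdev(xs) * pstdev(ys))
r_squared = r ** 2

# Cross-check: 1 - (unexplained variation / total variation).
b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
      / sum((x - x_bar) ** 2 for x in xs))
b0 = y_bar - b1 * x_bar
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
sst = sum((y - y_bar) ** 2 for y in ys)
r_squared_check = 1 - sse / sst

print(round(r_squared, 4), round(r_squared_check, 4))
```

For this data both routes give an R² of 0.6, meaning 60 percent of the variance in y is predictable from x.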

Standard Error

The standard error about the regression line (often denoted by SE) is a measure of the average amount that the regression equation over- or under-predicts. For a given data set, the higher the coefficient of determination, the lower the standard error, and the more accurate predictions are likely to be.
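This lesson does not give a formula for the standard error. One common definition, assumed here, is the square root of the sum of squared residuals divided by n - 2. The sketch below applies that definition to a small made-up data set; treat the formula choice, the data, and the variable names as assumptions for illustration.

```python
from math import sqrt
from statistics import mean

# Hypothetical sample data, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

n = len(xs)
x_bar, y_bar = mean(xs), mean(ys)
b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
      / sum((x - x_bar) ** 2 for x in xs))
b0 = y_bar - b1 * x_bar

# Residual = observed y minus predicted y.
residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]

# Assumed definition: SE = sqrt( sum of squared residuals / (n - 2) ).
# Two degrees of freedom are used up estimating b0 and b1.
standard_error = sqrt(sum(e ** 2 for e in residuals) / (n - 2))

print(round(standard_error, 4))
```

Under this definition, the standard error is in the same units as y, so it can be read as a typical size of a prediction miss.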

Test Your Understanding

Problem 1

A researcher uses a regression equation to predict home heating bills (dollar cost), based on home size (square feet). The correlation between heating bills and home size is 0.70. What is the correct interpretation of this finding?

(A) 70% of the variability in home heating bills can be explained by home size.
(B) 49% of the variability in home heating bills can be explained by home size.
(C) For each added square foot of home size, heating bills increased by 70 cents.
(D) For each added square foot of home size, heating bills increased by 49 cents.
(E) None of the above.

Solution

The correct answer is (B). The coefficient of determination measures the proportion of variation in the dependent variable that is predictable from the independent variable. With one independent variable, the coefficient of determination (R²) equals the square of the correlation (r²); in this case, (0.70)² = 0.49. Therefore, 49% of the variability in heating bills can be explained by home size.
