Teach yourself statistics

What is Linear Regression?

In a cause and effect relationship, the independent variable is the cause, and the dependent variable is the effect. Least squares linear regression is a method for predicting the value of a dependent variable Y, based on the value of an independent variable X.

For the next few lessons, we focus on the case where there is only one independent variable. This is called simple regression. Toward the end of the tutorial, we will cover multiple regression, which handles two or more independent variables.

Tip: The next lesson presents a simple linear regression example that shows how to apply the material covered in this lesson. Since this lesson is a little dense, you may benefit by also reading the next lesson.

Requirements for Regression

Simple linear regression is appropriate when the following conditions are satisfied.

Linearity. The relationship between the independent variable X and the dependent variable Y should be linear. To check this, make sure that the XY scatterplot is linear and that the residual plot shows a random pattern. (In a future lesson, we'll explain how to check linearity with a residual plot.)
Homoscedasticity. The variance of residuals should be constant across all levels of the independent variable. To check for homoscedasticity, plot residuals against the independent variable. If the spread is roughly constant, homoscedasticity holds. (Bartlett's test and Hartley's Fmax test can also be used to test for homogeneity of variance; but these tests are not part of the AP Statistics curriculum, and they will not appear on the AP Statistics test.)
Independence. Residuals should be independent of each other. The value of one residual should not provide any information about the value of another. Plot residuals against time or observation order. If the residuals fluctuate randomly around the zero line with no clear pattern, they are likely independent. If they show a trend (e.g., increasing or decreasing) or cyclical behavior, this indicates dependence.
Normality. The residuals should be normally distributed, especially for small sample sizes. Plot a histogram of the residuals and check for a bell-shaped distribution. Or produce a normal probability plot; if the points on the plot fall approximately along a straight line, the residuals are approximately normally distributed. (This assumption is less critical when the sample size is large.)

By checking these assumptions, you can ensure the validity of a simple linear regression model. If any assumptions are violated, appropriate transformations or other analytical methods should be considered. (We'll cover transformations in a future lesson.)
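The residual checks above all start from the same raw material: one residual per observation. Here is a minimal sketch of how those residuals could be computed, assuming a small made-up dataset (the values of x and y are illustrative only and do not come from this lesson). The fit uses the least squares formulas given later in this lesson.

```python
# Made-up illustrative data (an assumption, not from the lesson).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n

# Least squares slope (b1) and intercept (b0).
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar

# Residual = observed y minus predicted y.
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

# These (x, residual) pairs are the points a residual plot displays.
for xi, ei in zip(x, residuals):
    print(f"x = {xi}, residual = {ei:+.2f}")
```

Plotting these pairs (or plotting the residuals in observation order) gives the pictures described in the checks for linearity, homoscedasticity, and independence. With only five points, no real conclusion about the assumptions is possible; the sketch only shows the mechanics.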

The Least Squares Regression Line

Linear regression finds the straight line, called the least squares regression line or LSRL, that best represents observations in a bivariate dataset. Suppose Y is a dependent variable, and X is an independent variable. The population regression line is:

Y = β₀ + β₁X

where β₀ is a constant, β₁ is the regression coefficient, X is the value of the independent variable, and Y is the value of the dependent variable.

Given a random sample of observations, the population regression line is estimated by a sample regression line. The sample regression line is:

ŷ = b₀ + b₁x

where b₀ is a constant, b₁ is the regression coefficient, x is the value of the independent variable, and ŷ is the predicted value of the dependent variable.

How to Define a Regression Line

Normally, you will use a computational tool - a software package (e.g., Excel) or a graphing calculator - to find b₀ and b₁. You enter the x and y values into your program or calculator, and the tool solves for the regression constant (b₀) and for the regression coefficient (b₁).

In the unlikely event that you find yourself on a desert island without a computer or a graphing calculator, you can solve for b₀ and b₁ "by hand". Here are the equations.

b₁ = Σ [ (x_i - x̄)(y_i - ȳ) ] / Σ [ (x_i - x̄)² ]

or, equivalently,

b₁ = r * (s_y / s_x)

b₀ = ȳ - b₁ * x̄

where b₀ is the constant in the regression equation, b₁ is the regression coefficient, r is the correlation between x and y, x_i is the x value for observation i, y_i is the y value for observation i, x̄ is the sample mean of x, ȳ is the sample mean of y, s_x is the standard deviation of x, and s_y is the standard deviation of y.
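To see the "by hand" equations in action, here is a minimal sketch in Python that computes b₁ both ways and then b₀. The dataset is made up for illustration (it is an assumption, not data from this lesson), and the variable names are hypothetical.

```python
from math import sqrt

# Made-up illustrative data (an assumption, not from the lesson).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n

# Sums of squared deviations and cross-products around the means.
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_yy = sum((yi - y_bar) ** 2 for yi in y)

# First formula: slope from the deviation sums.
b1 = s_xy / s_xx

# Second formula: slope from the correlation and the sample standard deviations.
r = s_xy / sqrt(s_xx * s_yy)
s_x = sqrt(s_xx / (n - 1))
s_y = sqrt(s_yy / (n - 1))
b1_alt = r * (s_y / s_x)

# Intercept from the means and the slope.
b0 = y_bar - b1 * x_bar

print(f"b1 = {b1:.4f}, b1 (via r) = {b1_alt:.4f}, b0 = {b0:.4f}")
# prints: b1 = 0.6000, b1 (via r) = 0.6000, b0 = 2.2000
```

Both slope formulas give the same value, so for this made-up dataset the sample regression line is ŷ = 2.2 + 0.6x.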

Properties of the Regression Line

When the regression parameters (b₀ and b₁) are defined as described above, the regression line has the following properties.

The line minimizes the sum of squared differences between observed values (the y values) and predicted values (the ŷ values computed from the regression equation).
The regression line passes through the point (x̄, ȳ), where x̄ is the mean of the x values and ȳ is the mean of the y values.
The regression constant (b₀) is equal to the y intercept of the regression line.
The regression coefficient (b₁) is the average change in the dependent variable (y) for a 1-unit change in the independent variable (x). It is the slope of the regression line.

The least squares regression line is the only straight line that has all of these properties.
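The first two properties can be checked numerically. The sketch below, again using made-up data (an assumption for illustration), confirms that the fitted line passes through the point of means and that nudging the intercept or slope away from the least squares values increases the sum of squared errors. The helper name sse is hypothetical.

```python
# Made-up illustrative data (an assumption, not from the lesson).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n

# Least squares slope and intercept.
b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
      / sum((xi - x_bar) ** 2 for xi in x))
b0 = y_bar - b1 * x_bar


def sse(intercept, slope):
    """Sum of squared differences between observed and predicted y."""
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))


# Property: the prediction at x_bar equals y_bar.
passes_through_means = abs((b0 + b1 * x_bar) - y_bar) < 1e-12

# Property: every nearby line has a larger sum of squared errors.
best = sse(b0, b1)
nudged = [
    sse(b0 + d0, b1 + d1)
    for d0 in (-0.1, 0, 0.1)
    for d1 in (-0.1, 0, 0.1)
    if (d0, d1) != (0, 0)
]
is_minimum = all(s > best for s in nudged)

print(passes_through_means, is_minimum)
# prints: True True
```

Checking a handful of nearby lines is not a proof that the sum of squared errors is minimized, only a quick sanity check on one dataset.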

The Coefficient of Determination

The coefficient of determination (denoted by R²) is a key output of regression analysis. It is interpreted as the proportion of the variance in the dependent variable that is predictable from the independent variable.

The coefficient of determination ranges from 0 to 1.
An R² of 0 means that the dependent variable cannot be predicted from the independent variable.
An R² of 1 means the dependent variable can be predicted without error from the independent variable.
An R² between 0 and 1 indicates the extent to which the dependent variable is predictable. An R² of 0.10 means that 10 percent of the variance in y is predictable from x; an R² of 0.20 means that 20 percent is predictable; and so on.

The coefficient of determination (R²) for a linear regression model with one independent variable is:

R² = Σ (ŷ_i - ȳ)² / Σ (y_i - ȳ)²

where y_i is the value of the dependent variable for observation i, ŷ_i is the predicted value of the dependent variable for observation i, and ȳ is the mean of observed values of the dependent variable.

If you know the linear correlation (r) between the independent variable and the dependent variable, then the coefficient of determination (R²) is easily computed using the following formula: R² = r².
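As a quick check that the two routes to R² agree, here is a minimal sketch that computes it once from the sums of squares and once by squaring the correlation. The data are made up for illustration (an assumption, not from this lesson).

```python
from math import sqrt

# Made-up illustrative data (an assumption, not from the lesson).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n

s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)

# Least squares fit and predicted values.
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar
y_hat = [b0 + b1 * xi for xi in x]

# R² from the formula: explained variation over total variation.
ss_regression = sum((yh - y_bar) ** 2 for yh in y_hat)
ss_total = sum((yi - y_bar) ** 2 for yi in y)
r_squared = ss_regression / ss_total

# R² as the square of the correlation between x and y.
r = s_xy / sqrt(s_xx * ss_total)

print(f"R^2 = {r_squared:.4f}, r^2 = {r ** 2:.4f}")
# prints: R^2 = 0.6000, r^2 = 0.6000
```

For this made-up dataset, 60 percent of the variance in y is predictable from x.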

Standard Error of the Estimate

The standard error of the estimate (also known as the residual standard error) is a measure of the average amount that the regression equation over- or under-predicts. For a given dataset, the higher the coefficient of determination, the lower the standard error, and the more accurate predictions are likely to be.

For simple linear regression (regression with only one independent variable), the standard error of the estimate (SE) can be calculated from this formula:

SE = sqrt [ Σ(y_i - ŷ_i)² / (n - 2) ]

where y_i is the actual value of the dependent variable for observation i, ŷ_i is the predicted value of the dependent variable for observation i, and n is the sample size.

Here is how to interpret the standard error of the estimate.

The standard error tells you on average how much the actual data points deviate from the regression line.
A smaller standard error indicates the regression model fits the data more closely.
The standard error has the same units as the dependent variable y.

You can think of the standard error like the standard deviation of the residuals: If the standard error is, say, 2.3, then on average the actual values are about 2.3 units away from the predicted values.
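The calculation can be sketched in a few lines of Python. The data below are made up for illustration (an assumption, not from this lesson); the point is only to show the residuals being squared, summed, divided by n - 2, and square-rooted.

```python
from math import sqrt

# Made-up illustrative data (an assumption, not from the lesson).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n

# Least squares slope and intercept.
b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
      / sum((xi - x_bar) ** 2 for xi in x))
b0 = y_bar - b1 * x_bar

# Residuals: actual y minus predicted y.
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

# Standard error of the estimate, with n - 2 in the denominator.
sum_sq_resid = sum(e ** 2 for e in residuals)
se = sqrt(sum_sq_resid / (n - 2))

print(f"SE = {se:.4f}")
# prints: SE = 0.8944
```

Here the actual y values sit roughly 0.9 units from the regression line, in the same units as y.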

Note: The standard error of the estimate is different from the standard error of the slope. In future lessons, we'll describe the standard error of the slope; and we'll explain how the standard error of the slope is used to test hypotheses about the slope and to define a confidence interval around the slope.

Test Your Understanding

Problem 1

A researcher uses a regression equation to predict home heating bills (dollar cost), based on home size (square feet). The correlation between heating bills and home size is 0.70. What is the correct interpretation of this finding?

(A) 70% of the variability in home heating bills can be explained by home size.
(B) 49% of the variability in home heating bills can be explained by home size.
(C) For each added square foot of home size, heating bills increased by 70 cents.
(D) For each added square foot of home size, heating bills increased by 49 cents.
(E) None of the above.

Solution

The correct answer is (B). The coefficient of determination (R²) measures the proportion of variation in the dependent variable that is predictable from the independent variable. With one independent variable, R² is equal to r², the square of the correlation; in this case, (0.70)² or 0.49. Therefore, 49% of the variability in heating bills can be explained by home size.
