# Multicollinearity and Regression Analysis

In regression, multicollinearity refers to the extent to which independent variables are correlated. Multicollinearity exists when:

- One independent variable is correlated with another independent variable.
- One independent variable is correlated with a linear combination of two or more independent variables.

In this lesson, we'll examine the impact of multicollinearity on regression analysis.

## The Multicollinearity Problem

As part of regression analysis, researchers examine regression coefficients to assess the relative influence of independent variables. They look at the magnitude of coefficients, and they test the statistical significance of coefficients.

If the coefficient for a particular variable is significantly greater than zero, researchers judge that the variable contributes to the predictive ability of the regression equation. In this way, it is possible to distinguish variables that are more useful for prediction from those that are less useful.

This kind of analysis makes sense when multicollinearity is small. But it is problematic when multicollinearity is great. Here's why:

- When one independent variable is *perfectly* correlated with another independent variable (or with a combination of two or more other independent variables), a unique least-squares solution for regression coefficients does not exist.
- When one independent variable is *highly* correlated with another independent variable (or with a combination of two or more other independent variables), the marginal contribution of that independent variable is influenced by other independent variables. As a result:
  - Estimates for regression coefficients can be unreliable.
  - Tests of significance for regression coefficients can be misleading.
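The perfectly correlated case can be checked numerically. The sketch below is our own illustration in Python with NumPy, using made-up predictor values (not data from this lesson) in which one predictor is exactly twice the other. It shows that the design matrix loses rank and that two different sets of coefficients give identical fitted values, which is why no unique least-squares solution exists.

```python
import numpy as np

# Hypothetical predictor values, chosen only for illustration.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x2 = 2.0 * x1  # x2 is perfectly correlated with x1

# Design matrix with an intercept column, x1, and x2.
X = np.column_stack([np.ones_like(x1), x1, x2])

# With perfect collinearity the columns are linearly dependent,
# so the rank is smaller than the number of coefficients to estimate.
rank = np.linalg.matrix_rank(X)
n_coefficients = X.shape[1]

# Two different coefficient vectors produce the same fitted values,
# so least squares has no basis for preferring one over the other.
b_first = np.array([1.0, 3.0, 0.0])   # y-hat = 1 + 3*x1
b_second = np.array([1.0, 1.0, 1.0])  # y-hat = 1 + x1 + x2 = 1 + 3*x1
same_fit = np.allclose(X @ b_first, X @ b_second)

print(rank, n_coefficients, same_fit)
```

Here the rank is 2 while three coefficients are requested, so infinitely many coefficient vectors fit the data equally well.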

With this in mind, the analysis of regression coefficients should be contingent on the extent of multicollinearity. This means that the analysis of regression coefficients should be preceded by an analysis of multicollinearity.

If the set of independent variables is characterized by a little bit of multicollinearity, the analysis of regression coefficients should be straightforward. If there is a lot of multicollinearity, the analysis will be hard to interpret and can be skipped.

**Note:** Multicollinearity makes it hard to assess the relative importance of independent variables,
but it does not affect the usefulness of the regression equation for prediction. Even when multicollinearity is
great, the least-squares regression equation can be highly predictive. So, if you are only interested in
prediction, multicollinearity is not a problem.
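A small simulation can make this note concrete. The sketch below is an illustration under our own assumptions: the data are randomly generated, the true coefficients (2 and 3), the noise levels, and the helper name `fit_summary` are all hypothetical, and NumPy is simply the tool we chose. It fits the same model twice, once with unrelated predictors and once with nearly duplicate predictors, and reports the fit (R^{2}) and the standard error of the first slope.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable
n = 200

def fit_summary(x1, x2, y):
    """Return (R^2, standard errors) for y regressed on x1 and x2 with an intercept."""
    X = np.column_stack([np.ones(len(y)), x1, x2])
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ b
    r2 = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
    s2 = (resid @ resid) / (len(y) - X.shape[1])  # residual variance
    se = np.sqrt(s2 * np.diag(np.linalg.inv(X.T @ X)))
    return r2, se

x1 = rng.normal(size=n)
noise = rng.normal(scale=0.5, size=n)

# Case A: x2 is unrelated to x1.
x2_indep = rng.normal(size=n)
r2_a, se_a = fit_summary(x1, x2_indep, 2 * x1 + 3 * x2_indep + noise)

# Case B: x2 is almost a copy of x1 (strong multicollinearity).
x2_coll = x1 + rng.normal(scale=0.05, size=n)
r2_b, se_b = fit_summary(x1, x2_coll, 2 * x1 + 3 * x2_coll + noise)

print(f"unrelated predictors: R^2 = {r2_a:.3f}, SE(b1) = {se_a[1]:.3f}")
print(f"collinear predictors: R^2 = {r2_b:.3f}, SE(b1) = {se_b[1]:.3f}")
```

Under these assumptions we would expect both fits to have R^{2} close to 1, while the standard error of the first slope is many times larger in the collinear case. That is the pattern the note describes: prediction holds up, but individual coefficients become hard to pin down.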

## How to Measure Multicollinearity

There are two popular ways to measure multicollinearity: (1) compute a coefficient of multiple determination for each independent variable, or (2) compute a variance inflation factor for each independent variable.

### Coefficient of Multiple Determination

In the previous lesson, we described how the coefficient of multiple determination (R^{2})
measures the proportion of variance in the dependent variable that is explained
by all of the independent variables.

If we ignore the dependent variable, we can compute a coefficient of multiple determination (R^{2}_{k})
for each of the *k* independent variables. We do this by regressing the *k*^{th} independent variable
on all of the other independent variables. That is, we treat X_{k} as the dependent variable and use the other independent variables to predict X_{k}.

How do we interpret R^{2}_{k}? If R^{2}_{k} equals zero, variable *k* is not correlated
with any other independent variable; and multicollinearity is not a problem for variable *k*. As a rule of thumb,
most analysts feel that multicollinearity is a potential problem when R^{2}_{k} is greater than 0.75;
and, a serious problem when R^{2}_{k} is greater than 0.9.
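The procedure just described (treat X_{k} as the dependent variable and regress it on the other independent variables) can be sketched in Python with NumPy. This is a minimal illustration, not the Excel procedure used in this lesson; the function name `r_squared_k` and the small data matrix are hypothetical. The columns were constructed so that the first two share half of their variance and the third is uncorrelated with both.

```python
import numpy as np

def r_squared_k(X, k):
    """R^2 from regressing column k of X on all other columns (plus an intercept)."""
    target = X[:, k]
    others = np.delete(X, k, axis=1)
    design = np.column_stack([np.ones(len(target)), others])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    resid = target - design @ coef
    ss_resid = resid @ resid
    ss_total = np.sum((target - target.mean()) ** 2)
    return 1.0 - ss_resid / ss_total

# Hypothetical predictor values (rows are observations, columns are X_1, X_2, X_3).
# By construction, X_1 and X_2 have a squared correlation of 0.5,
# and X_3 is uncorrelated with both of them.
X = np.array([
    [-2.0, -3.0,  2.0],
    [-1.0,  1.0, -1.0],
    [ 0.0,  0.0, -2.0],
    [ 1.0, -1.0, -1.0],
    [ 2.0,  3.0,  2.0],
])

r2 = [r_squared_k(X, k) for k in range(X.shape[1])]
for k, value in enumerate(r2, start=1):
    print(f"R^2 for variable {k}: {value:.3f}")
```

For this made-up data, the first two variables get R^{2}_{k} = 0.5 and the third gets R^{2}_{k} = 0, so none of them would trip the 0.75 rule of thumb.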

### Variance Inflation Factor

The variance inflation factor is another way to express exactly the same information found in the coefficient of multiple determination. A variance inflation factor is computed for each independent variable, using the following formula:

VIF_{k} = 1 / ( 1 - R^{2}_{k} )

where VIF_{k} is the variance inflation factor for variable *k*, and R^{2}_{k} is the
coefficient of multiple determination for variable *k*.

In many statistical packages (e.g., SAS, SPSS, Minitab), the variance inflation factor is available as an optional regression output. In Minitab, for example, the variance inflation factor can be displayed as part of the regression coefficient table.

The interpretation of the variance inflation factor mirrors the interpretation of the coefficient of
multiple determination. If VIF_{k} = 1, variable *k* is not correlated with any other independent
variable. As a rule of thumb, multicollinearity is a potential problem when VIF_{k} is greater than 4;
and, a serious problem when it is greater than 10. For example, a VIF of 2.466 indicates
some multicollinearity but not enough to worry about.

**Bottom line:** If R^{2}_{k} is greater than 0.9 or VIF_{k} is greater than
10, it is likely that regression coefficients are poorly estimated. And significance tests on those coefficients
may be misleading.
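Because VIF_{k} is a direct transformation of R^{2}_{k}, the two rules of thumb describe the same cut-offs. The short Python sketch below is our own illustration (the function name `vif_from_r2` is hypothetical). It applies the formula VIF_{k} = 1 / ( 1 - R^{2}_{k} ) to the rule-of-thumb thresholds and to the R^{2}_{k} values reported in this lesson's problems.

```python
def vif_from_r2(r2_k):
    """Variance inflation factor for a given coefficient of multiple determination."""
    if not 0.0 <= r2_k < 1.0:
        # At R^2_k = 1 the variable is perfectly collinear and the VIF is undefined.
        raise ValueError("R^2_k must be at least 0 and less than 1")
    return 1.0 / (1.0 - r2_k)

# Rule-of-thumb thresholds (0.75, 0.9) and R^2_k values reported in the problems.
for r2_k in (0.0, 0.595, 0.603, 0.75, 0.9, 0.949, 0.956):
    print(f"R^2_k = {r2_k:.3f}  ->  VIF_k = {vif_from_r2(r2_k):.2f}")
```

An R^{2}_{k} of 0.75 corresponds to a VIF_{k} of 4, and an R^{2}_{k} of 0.9 corresponds to a VIF_{k} of 10. Up to rounding, the other results agree with the VIF_{k} values reported alongside each R^{2}_{k} in the problems.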

## How to Deal with Multicollinearity

If you only want to predict the value of a dependent variable, you may not have to worry about multicollinearity. Multiple regression can produce a regression equation that will work for you, even when independent variables are highly correlated.

The problem arises when you want to assess the relative importance of an independent variable with a high
R^{2}_{k} (or, equivalently, a high VIF_{k}). In this situation, try the following:

- Redesign the study to avoid multicollinearity. If you are working on a true experiment, the experimenter controls treatment levels. Choose treatment levels to minimize or eliminate correlations between independent variables.
- Increase sample size. Other things being equal, a bigger sample means reduced sampling error. The increased precision may overcome potential problems from multicollinearity.
- Remove one or more of the highly correlated independent variables. Then, define a new regression equation, based on the remaining variables. Because the removed variables were redundant, the new equation should be nearly as predictive as the old equation; and coefficients should be easier to interpret because multicollinearity is reduced.
- Define a new variable equal to a linear combination of the highly correlated variables. Then, define a new regression equation, using the new variable in place of the old highly correlated variables.
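The last two remedies can be illustrated with a short Python sketch. It rests on our own assumptions: the data are made up, the helper name `vif` is hypothetical, and a simple average stands in for "a linear combination". The first two columns are nearly identical, and the third is uncorrelated with both. The sketch computes variance inflation factors before and after each remedy.

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X (an intercept is added internally)."""
    n, p = X.shape
    factors = []
    for k in range(p):
        target = X[:, k]
        design = np.column_stack([np.ones(n), np.delete(X, k, axis=1)])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        resid = target - design @ coef
        r2_k = 1.0 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
        factors.append(1.0 / (1.0 - r2_k))
    return np.array(factors)

# Hypothetical predictors: columns 0 and 1 are highly correlated,
# column 2 is uncorrelated with both.
X = np.array([
    [-2.0, -2.2,  2.0],
    [-1.0, -0.6, -1.0],
    [ 0.0,  0.0, -2.0],
    [ 1.0,  0.6, -1.0],
    [ 2.0,  2.2,  2.0],
])

vif_all = vif(X)                    # all three predictors
vif_dropped = vif(X[:, [0, 2]])     # remedy: remove column 1

combined = X[:, [0, 1]].mean(axis=1)  # remedy: replace columns 0 and 1 with their average
vif_combined = vif(np.column_stack([combined, X[:, 2]]))

print("all predictors:   ", np.round(vif_all, 2))
print("one removed:      ", np.round(vif_dropped, 2))
print("combined variable:", np.round(vif_combined, 2))
```

For this constructed data, the two correlated predictors each have a VIF of 26, well above the threshold of 10. After either remedy, every remaining predictor has a VIF of 1.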

**Note:** Multicollinearity only affects variables that are highly correlated. If the variable you are
interested in has a small R^{2}_{k}, statistical analysis of its regression coefficient will
be reliable and informative. That analysis will be valid, even when other variables exhibit high multicollinearity.

## Test Your Understanding

In this section, two problems illustrate the role of multicollinearity in regression analysis. In Problem 1, we see what happens when multicollinearity is small; and in Problem 2, we see what happens when multicollinearity is big.

**Problem 1**

Consider the table below. It shows three performance measures for 10 students.

Student | Test score | IQ | Study hours |
---|---|---|---|
1 | 100 | 110 | 40 |
2 | 95 | 110 | 40 |
3 | 90 | 120 | 30 |
4 | 85 | 110 | 40 |
5 | 80 | 100 | 20 |
6 | 75 | 110 | 40 |
7 | 70 | 90 | 0 |
8 | 65 | 110 | 40 |
9 | 60 | 80 | 10 |
10 | 55 | 80 | 10 |

In the previous lesson, we used data from the table to develop a least-squares regression equation to predict test score. We also conducted statistical tests to assess the contribution of each independent variable (i.e., IQ and study hours) to the prediction.

For this problem,

- Measure multicollinearity, based on (1) IQ and (2) the number of hours that the student studied.
- Discuss the impact of multicollinearity for interpreting statistical tests on IQ and study hours.

**Solution**

In this lesson, we described two ways to measure multicollinearity:

- Compute a coefficient of multiple determination (R^{2}_{k}) for each independent variable.
- Compute a variance inflation factor (VIF_{k}) for each independent variable.

The two approaches are equivalent; so, in practice, you only need to do one or the other, not both. For this problem, though, we computed both. We used Minitab to compute variance inflation factors, and we used Excel to compute coefficients of multiple determination. (In the previous lesson, we showed how to compute a coefficient of multiple determination with Excel.)

Here are the results of our analysis.

Variable k | VIF_{k} | R^{2}_{k} |
---|---|---|
IQ | 2.466 | 0.595 |
Study hours | 2.466 | 0.595 |

We have rules of thumb to interpret VIF_{k} and R^{2}_{k}. Multicollinearity makes it hard
to interpret the statistical significance of the regression coefficient for variable *k* when VIF_{k}
is greater than 4 or when R^{2}_{k} is greater than 0.75. Since neither condition is evident
in this problem, we can safely accept the results of statistical tests on regression coefficients.

We actually conducted those tests for this problem in the previous lesson. For convenience, key results are reproduced below:

The p-values for IQ and for Study Hours are statistically significant at the 0.05 level. We can trust these findings, because multicollinearity is within acceptable levels.

It is also interesting to look at the effectiveness of the regression equation to predict the dependent
variable, Test Score. As part of the analysis in the previous lesson, we
found that the coefficient of determination (R^{2}) for the regression equation
was 0.905. This means about 90% of the variance in Test Score was accounted for by IQ and Study Hours.

**Problem 2**

Problem 2 is identical to Problem 1, except that we've added a new independent variable - grade point average (GPA). Values for each variable appear in the table below.

Student | Test score | IQ | Study hours | GPA |
---|---|---|---|---|
1 | 100 | 110 | 40 | 3.9 |
2 | 95 | 110 | 40 | 2.6 |
3 | 90 | 120 | 30 | 2.7 |
4 | 85 | 110 | 40 | 3.0 |
5 | 80 | 100 | 20 | 2.4 |
6 | 75 | 110 | 40 | 2.2 |
7 | 70 | 90 | 0 | 2.1 |
8 | 65 | 110 | 40 | 2.1 |
9 | 60 | 80 | 10 | 1.5 |
10 | 55 | 80 | 10 | 1.8 |

Assume that you want to predict Test Score, based on three independent variables - IQ, Study Hours, and GPA. As part of multiple regression analysis, you will assess the relative importance of each independent variable. Before you make that assessment, you need to understand multicollinearity among the independent variables.

For this problem, do the following:

- Measure multicollinearity, based on IQ, Study Hours, and GPA.
- Discuss how multicollinearity affects your ability to interpret statistical tests on IQ, Study Hours, and GPA.

**Solution**

By now, we know that there are two ways to measure multicollinearity:

- Compute a coefficient of multiple determination (R^{2}_{k}) for each independent variable.
- Compute a variance inflation factor (VIF_{k}) for each independent variable.

We used Minitab to compute variance inflation factors, and we used Excel to compute coefficients of multiple determination. Here are the results of our analysis.

Variable k | VIF_{k} | R^{2}_{k} |
---|---|---|
IQ | 22.73 | 0.956 |
Study hours | 2.52 | 0.603 |
GPA | 19.61 | 0.949 |

We have rules of thumb to interpret VIF_{k} and R^{2}_{k}. Multicollinearity makes it hard
to interpret the statistical significance of the regression coefficient for variable *k* when VIF_{k}
is greater than 4 or when R^{2}_{k} is greater than 0.75. Based on these guidelines, we would
conclude that multicollinearity is a problem with IQ and GPA, but not with Study Hours.

Here is the regression coefficients table for this problem. The table shows the following information for each coefficient: its value, its standard error, a t-statistic, and the significance of the t-statistic. Based on what you know about multicollinearity, how would you interpret results reported in this table?

The p-value for Study Hours is statistically significant at the 0.05 level. We can trust this finding, because multicollinearity is within acceptable levels for the Study Hours variable. The p-values for IQ and GPA are not statistically significant. However, multicollinearity is high for these variables, so their tests of significance are suspect. Despite the non-significant test results, we cannot say with confidence that IQ and GPA are poor predictors of test score.

In fact, based on the analysis in Problem 1, we know that IQ actually is a good predictor of test score. In this problem, the effect of IQ is confounded with the effect of GPA, because the two variables are so highly correlated. As a result, the p-value for IQ is not significant in this problem. Here's the moral: A non-significant p-value may be misleading when the variable being tested suffers from multicollinearity.

And finally, we looked at the effectiveness of the regression equation to predict the dependent
variable, Test Score. The coefficient of determination (R^{2}) for the regression equation
was 0.916. This is slightly greater than the coefficient of determination (0.905) found in Problem 1,
when we only used IQ and Study Hours as independent variables.
Even though predictors in this problem suffered from multicollinearity, the regression
equation was still highly predictive. This illustrates the fact that multicollinearity does not
affect the ability of the regression equation to predict a dependent variable; it only affects
statistical tests on regression coefficients.
