Multicollinearity and Regression
In regression, multicollinearity refers to the extent to which independent variables are correlated. Multicollinearity exists when:
- One independent variable is correlated with another independent variable.
- One independent variable is correlated with a linear combination of two or more independent variables.
In this lesson, we'll examine the impact of multicollinearity on regression analysis.
The Multicollinearity Problem
As part of regression analysis, researchers examine regression coefficients to assess the relative influence of independent variables. They look at the magnitude of coefficients, and they test the statistical significance of coefficients.
If the coefficient for a particular variable is significantly greater than zero, researchers judge that the variable contributes to the predictive ability of the regression equation. In this way, it is possible to distinguish variables that are more useful for prediction from those that are less useful.
This kind of analysis makes sense when multicollinearity is small. But it is problematic when multicollinearity is great. Here's why:
- When one independent variable is perfectly correlated with another independent variable (or with a combination of two or more other independent variables), a unique least-squares solution for the regression coefficients does not exist (see the numerical sketch after this list).
- When one independent variable is highly correlated with another independent variable (or with a combination of two or more other independent variables), the estimate of that variable's marginal contribution is influenced by the other independent variables in the equation.
As a result:
- Estimates for regression coefficients can be unreliable.
- Tests of significance for regression coefficients can be misleading.
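The first point can be seen numerically. Below is a minimal Python sketch (the data values are invented for illustration, and Python is assumed here and in the rest of this lesson's sketches) showing that when one predictor is an exact linear function of another, the normal-equations matrix X'X is singular, so the least-squares coefficients are not uniquely determined:

```python
import numpy as np

x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x2 = 2.0 * x1 + 3.0                      # x2 is a perfect linear function of x1

# Design matrix: intercept, x1, x2
X = np.column_stack([np.ones_like(x1), x1, x2])
xtx = X.T @ X

print(np.linalg.matrix_rank(xtx))        # 2, not 3: X'X is singular
print(np.linalg.cond(xtx))               # enormous condition number confirms it
```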
With this in mind, the analysis of regression coefficients should be contingent on the extent of multicollinearity. This means that the analysis of regression coefficients should be preceded by an analysis of multicollinearity.
If the set of independent variables shows only a little multicollinearity, the analysis of regression coefficients should be straightforward. If there is a lot of multicollinearity, the analysis of individual coefficients will be hard to interpret and should be viewed with caution.
Note: Multicollinearity makes it hard to assess the relative importance of independent variables, but it does not affect the usefulness of the regression equation for prediction. Even when multicollinearity is great, the least-squares regression equation can be highly predictive. So, if you are only interested in prediction, multicollinearity is not a problem.
How to Measure Multicollinearity
There are two popular ways to measure multicollinearity: (1) compute a coefficient of multiple determination for each independent variable, or (2) compute a variance inflation factor for each independent variable.
Coefficient of Multiple Determination
In the previous lesson, we described how the coefficient of multiple determination (R2) measures the proportion of variance in the dependent variable that is explained by all of the independent variables.
If we ignore the dependent variable, we can compute a coefficient of multiple determination (R2k) for each of the k independent variables. We do this by regressing the kth independent variable on all of the other independent variables. That is, we treat Xk as the dependent variable and use the other independent variables to predict Xk.
How do we interpret R2k? If R2k equals zero, variable k is not correlated with any other independent variable; and multicollinearity is not a problem for variable k. As a rule of thumb, most analysts feel that multicollinearity is a potential problem when R2k is greater than 0.75; and, a serious problem when R2k is greater than 0.9.
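Here is a minimal sketch of this auxiliary-regression idea in Python, assuming the numpy and statsmodels packages and a hypothetical predictor matrix X (one column per independent variable):

```python
import numpy as np
import statsmodels.api as sm

def r_squared_k(X, k):
    """Regress the kth independent variable on all of the others and return R2k."""
    y_k = X[:, k]                        # treat X_k as the dependent variable
    others = np.delete(X, k, axis=1)     # the remaining independent variables
    return sm.OLS(y_k, sm.add_constant(others)).fit().rsquared

# X is a hypothetical (n observations x p predictors) array
# r2k_values = [r_squared_k(X, k) for k in range(X.shape[1])]
```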
Variance Inflation Factor
The variance inflation factor is another way to express exactly the same information found in the coefficient of multiple determination. A variance inflation factor is computed for each independent variable, using the following formula:
VIFk = 1 / ( 1 - R2k )
where VIFk is the variance inflation factor for variable k, and R2k is the coefficient of multiple determination for variable k.
In many statistical packages (e.g., SAS, SPSS, Minitab), the variance inflation factor is available as an optional regression output. In Minitab, for example, the variance inflation factor can be displayed as part of the regression coefficient table.
The interpretation of the variance inflation factor mirrors the interpretation of the coefficient of multiple determination. If VIFk = 1, variable k is not correlated with any other independent variable. As a rule of thumb, multicollinearity is a potential problem when VIFk is greater than 4, and a serious problem when it is greater than 10. For example, a VIF of 2.466 (as in Problem 1 below) indicates some multicollinearity, but not enough to worry about.
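The formula translates directly into code. Here is a small sketch that converts R2k into VIFk and applies the rule-of-thumb cutoffs quoted above (the function names are our own; statsmodels also offers a ready-made variance_inflation_factor helper, used in Problem 2 below):

```python
def vif_from_r2(r2_k):
    """Convert a coefficient of multiple determination into a variance inflation factor."""
    return 1.0 / (1.0 - r2_k)

def assess(r2_k):
    """Apply the rule-of-thumb cutoffs: VIF > 4 is a potential problem, VIF > 10 is serious."""
    vif = vif_from_r2(r2_k)
    if vif > 10:
        return vif, "serious problem"
    if vif > 4:
        return vif, "potential problem"
    return vif, "not a concern"

print(assess(0.595))   # an R2k of 0.595 implies a VIF of about 2.47: not a concern
```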
Bottom line: If R2k is greater than 0.9 or VIFk is greater than 10, it is likely that the regression coefficient of variable k is a poor indicator of the relative importance of variable k. The statistical significance of that regression coefficient, as an indicator of relative importance, may be misleading.
How to Deal with Multicollinearity
If you only want to predict the value of a dependent variable, you may not have to worry about multicollinearity. Multiple regression can produce a regression equation that will work for you, even when independent variables are highly correlated.
The problem arises when you want to assess the relative importance of an independent variable with a high R2k (or, equivalently, a high VIFk). In this situation, try the following:
- Redesign the study to avoid multicollinearity. If you are running a true experiment, you control the treatment levels; choose levels that minimize or eliminate correlations between the independent variables.
- Increase sample size. Other things being equal, a bigger sample means reduced sampling error. The increased precision may overcome potential problems from multicollinearity.
- Remove one or more of the highly-correlated independent variables. Then, define a new regression equation, based on the remaining variables. Because the removed variables were redundant, the new equation should be nearly as predictive as the old equation; and coefficients should be easier to interpret because multicollinearity is reduced.
- Define a new variable equal to a linear combination of the highly-correlated variables. Then, define a new regression equation, using the new variable in place of the old highly-correlated variables.
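As a sketch of the last remedy on this list, here is a minimal example assuming numpy and two hypothetical, highly-correlated predictor arrays x1 and x2:

```python
import numpy as np

def composite(x1, x2):
    """Replace two highly-correlated predictors with one linear combination:
    standardize each variable and average the z-scores."""
    z1 = (x1 - x1.mean()) / x1.std(ddof=1)
    z2 = (x2 - x2.mean()) / x2.std(ddof=1)
    return (z1 + z2) / 2.0

# The new regression equation would then use composite(x1, x2)
# in place of the two original, highly-correlated variables.
```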
Note: Multicollinearity only affects variables that are highly correlated. If the variable you are interested in has a small R2k, statistical analysis of its regression coefficient will be reliable and informative. That analysis will be valid, even when other variables exhibit high multicollinearity.
Test Your Understanding
In this section, two problems illustrate the role of multicollinearity in regression analysis. In Problem 1, we see what happens when multicollinearity is small; in Problem 2, we see what happens when multicollinearity is great.
Problem 1
Consider the table below. It shows three performance measures for 10 students.
Student | Test score | IQ | Study hours |
---|---|---|---|
1 | 100 | 125 | 30 |
2 | 95 | 104 | 40 |
3 | 92 | 110 | 25 |
4 | 90 | 105 | 20 |
5 | 85 | 100 | 20 |
6 | 80 | 100 | 20 |
7 | 78 | 95 | 15 |
8 | 75 | 95 | 10 |
9 | 72 | 85 | 0 |
10 | 65 | 90 | 5 |
In the previous lesson, we used data from the table to develop a least-squares regression equation to predict test score. We also conducted statistical tests to assess the contribution of each independent variable (i.e., IQ and study hours) to the prediction.
For this problem,
- Measure multicollinearity when IQ and Study Hours are the independent variables.
- Discuss the impact of multicollinearity on the interpretation of statistical tests for IQ and Study Hours.
Solution
In this lesson, we described two ways to measure multicollinearity:
- Compute a coefficient of multiple determination (R2k) for each independent variable.
- Compute a variance inflation factor (VIFk) for each independent variable.
The two approaches are equivalent; so, in practice, you only need to do one or the other, but not both. In the previous lesson, we showed how to compute a coefficient of multiple determination with Excel, and how to derive a variance inflation factor from the coefficient of multiple determination.
Here are the variance inflation factors and the coefficients of multiple determination for the present problem.
Variable k | VIFk | R2k |
---|---|---|
IQ | 2.466 | 0.595 |
Study hours | 2.466 | 0.595 |
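These figures can be checked outside of Excel. Here is a minimal Python sketch using the Problem 1 data; with only two independent variables, R2k is just the R2 from regressing IQ on Study Hours (or vice versa), which is why both rows of the table are identical:

```python
import numpy as np
import statsmodels.api as sm

iq    = np.array([125, 104, 110, 105, 100, 100, 95, 95, 85, 90])
hours = np.array([ 30,  40,  25,  20,  20,  20, 15, 10,  0,  5])

# Regress one independent variable on the other to get R2k, then apply the VIF formula
r2_k  = sm.OLS(iq, sm.add_constant(hours)).fit().rsquared
vif_k = 1.0 / (1.0 - r2_k)

print(round(r2_k, 3), round(vif_k, 3))   # should come out near 0.595 and 2.47
```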
We have rules of thumb to interpret VIFk and R2k. Multicollinearity makes it hard to interpret the statistical significance of the regression coefficient for variable k when VIFk is greater than 4 or when R2k is greater than 0.75. Since neither condition is evident in this problem, we can safely accept the results of statistical tests on regression coefficients.
We actually conducted those tests for this problem in the previous lesson. For convenience, the key results are restated here:
The p-values for both IQ and Study Hours are statistically significant at the 0.05 level. We can trust these findings, because multicollinearity is within acceptable levels.
It is also interesting to look at how effectively the regression equation predicts the dependent variable, Test Score. As part of the analysis in the previous lesson, we found that the coefficient of determination (R2) for the regression equation was 0.905. This means that about 90% of the variance in Test Score was accounted for by IQ and Study Hours.
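The overall fit quoted above can be reproduced with the same kind of sketch (assuming numpy and statsmodels; the arrays repeat the Problem 1 data):

```python
import numpy as np
import statsmodels.api as sm

score = np.array([100, 95, 92, 90, 85, 80, 78, 75, 72, 65])
iq    = np.array([125, 104, 110, 105, 100, 100, 95, 95, 85, 90])
hours = np.array([ 30,  40,  25,  20,  20,  20, 15, 10,  0,  5])

# Full two-predictor model: Test Score regressed on IQ and Study Hours
fit = sm.OLS(score, sm.add_constant(np.column_stack([iq, hours]))).fit()

print(round(fit.rsquared, 3))   # overall R2; the lesson reports about 0.905
print(fit.pvalues)              # p-values for the intercept, IQ, and Study Hours
```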
Problem 2
Problem 2 is identical to Problem 1, except that we've added a new independent variable - grade point average (GPA). Values for each variable appear in the table below.
Student | Test score | IQ | Study hours | GPA |
---|---|---|---|---|
1 | 100 | 125 | 30 | 3.9 |
2 | 95 | 104 | 40 | 2.6 |
3 | 92 | 110 | 25 | 2.7 |
4 | 90 | 105 | 20 | 3 |
5 | 85 | 100 | 20 | 2.4 |
6 | 80 | 100 | 20 | 2.2 |
7 | 78 | 95 | 15 | 2.1 |
8 | 75 | 95 | 10 | 2.1 |
9 | 72 | 85 | 0 | 1.5 |
10 | 65 | 90 | 5 | 1.8 |
Assume that you want to predict Test Score, based on three independent variables - IQ, Study Hours, and GPA. As part of multiple regression analysis, you will assess the relative importance of each independent variable. Before you make that assessment, you need to understand multicollinearity among the independent variables.
For this problem, do the following:
- Measure multicollinearity, based on IQ, Study Hours, and GPA.
- Discuss how multicollinearity affects your ability to interpret statistical tests on IQ, Study Hours, and GPA.
Solution
By now, we know that there are two ways to measure multicollinearity:
- Compute a coefficient of multiple determination (R2k) for each independent variable.
- Compute a variance inflation factor (VIFk) for each independent variable.
We used Minitab to compute the variance inflation factors, and Excel to compute the coefficients of multiple determination. Here are the results of our analysis.
Variable k | VIFk | R2k |
---|---|---|
IQ | 22.64 | 0.956 |
Study hours | 2.52 | 0.603 |
GPA | 19.66 | 0.949 |
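Here is a hedged sketch of how this table could be approximated in Python; the variance_inflation_factor helper in statsmodels performs the same auxiliary regressions described earlier, and the values it prints should be close to those in the table above (the constant column must be included in the design matrix for it to behave as expected):

```python
import numpy as np
from statsmodels.stats.outliers_influence import variance_inflation_factor

iq    = np.array([125, 104, 110, 105, 100, 100, 95, 95, 85, 90])
hours = np.array([ 30,  40,  25,  20,  20,  20, 15, 10,  0,  5])
gpa   = np.array([3.9, 2.6, 2.7, 3.0, 2.4, 2.2, 2.1, 2.1, 1.5, 1.8])

# Design matrix with a constant column; the predictors start at column 1
X = np.column_stack([np.ones(len(iq)), iq, hours, gpa])

for name, idx in [("IQ", 1), ("Study hours", 2), ("GPA", 3)]:
    vif = variance_inflation_factor(X, idx)
    print(name, round(vif, 2), round(1.0 - 1.0 / vif, 3))   # VIFk and the implied R2k
```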
We have rules of thumb to interpret VIFk and R2k. Multicollinearity makes it hard to interpret the statistical significance of the regression coefficient for variable k when VIFk is greater than 4 or when R2k is greater than 0.75. Based on these guidelines, we would conclude that multicollinearity is not a problem for Study Hours, but it is a problem for IQ and GPA.
To solve this problem, we need a table of regression coefficients that shows the statistical significance of each coefficient. We explained how to generate such a table with Excel in a previous lesson. The table lists, for each coefficient, its value, its standard error, a t-statistic, and the significance of that t-statistic. Based on what you know about multicollinearity, how would you interpret those results?
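As an alternative to Excel, here is a brief sketch of how such a coefficient table could be produced in Python with statsmodels (the fitted summary lists each coefficient with its standard error, t-statistic, and p-value):

```python
import numpy as np
import statsmodels.api as sm

score = np.array([100, 95, 92, 90, 85, 80, 78, 75, 72, 65])
iq    = np.array([125, 104, 110, 105, 100, 100, 95, 95, 85, 90])
hours = np.array([ 30,  40,  25,  20,  20,  20, 15, 10,  0,  5])
gpa   = np.array([3.9, 2.6, 2.7, 3.0, 2.4, 2.2, 2.1, 2.1, 1.5, 1.8])

# Three-predictor model for Problem 2: Test Score on IQ, Study Hours, and GPA
X = sm.add_constant(np.column_stack([iq, hours, gpa]))
fit = sm.OLS(score, X).fit()

print(fit.summary())   # coefficient estimates, standard errors, t-statistics, p-values
```

The same fitted model's fit.rsquared attribute gives the overall coefficient of determination discussed at the end of this solution.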
The p-value for Study Hours is statistically significant at the 0.05 level. We can trust this finding, because multicollinearity is within acceptable levels for the Study Hours variable. The p-values for IQ and GPA are not statistically significant. However, multicollinearity is very high for these variables, so their tests of significance are suspect. Despite the nonsignificant test results, we cannot conclude with confidence that IQ and GPA are poor predictors of test score.
In fact, based on the analysis in Problem 1, we would conclude that IQ is a good predictor of test score. In this problem, however, IQ is strongly correlated with the other two predictors (R2IQ = 0.956). Similarly, GPA is strongly correlated with IQ and Study Hours (R2GPA = 0.949). This may explain the nonsignificant p-values for IQ and GPA in this problem. Here's the moral: nonsignificant p-values may be misleading when the variables being tested suffer from multicollinearity.
And finally, we looked at how effectively the regression equation predicts the dependent variable, Test Score. The coefficient of determination (R2) for the regression equation was 0.916. This is slightly greater than the coefficient of determination (0.905) found in Problem 1, when we used only IQ and Study Hours as independent variables. Even though two predictors in this problem suffered from multicollinearity, the regression equation was still highly predictive. This illustrates the fact that multicollinearity does not affect the ability of the regression equation to predict the dependent variable; it only affects statistical tests on regression coefficients.