Multiple Regression Analysis With Excel
In the real world, you will probably never conduct multiple regression analysis by hand. Most likely, you will use computer software (SAS, SPSS, Minitab, Excel, etc.).
Excel is a widely-available software application that supports multiple regression. In this lesson, we use Excel to demonstrate multiple regression analysis. (Other software packages produce outputs similar to Excel; so if you understand the outputs from Excel, you will understand similar outputs from other software.)
Sample Problem With Excel
Consider the table below. It shows three performance measures for 10 students.
Student | Test score | IQ | Study hours |
---|---|---|---|
1 | 100 | 125 | 30 |
2 | 95 | 104 | 40 |
3 | 92 | 110 | 25 |
4 | 90 | 105 | 20 |
5 | 85 | 100 | 20 |
6 | 80 | 100 | 20 |
7 | 78 | 95 | 15 |
8 | 75 | 95 | 10 |
9 | 72 | 85 | 0 |
10 | 65 | 90 | 5 |
In this lesson, using data from the table, we are going to complete the following tasks:
- Develop a least-squares regression equation to predict test score, based on (1) IQ and (2) the number of hours that the student studied.
- Assess how well the regression equation predicts test score, the dependent variable.
- Assess the contribution of each independent variable (i.e., IQ and study hours) to the prediction.
These are common tasks in regression analysis. With the right software, they are easy to accomplish. We'll walk you step by step through each task, starting with setting up Excel.
How to Enable Excel
When you open Excel, the module for regression analysis may or may not be enabled. So, before you do anything else, you need to determine whether Excel is enabled. Here's how to do that:
- Open Excel.
- Click the Data tab.
- If you see the Data Analysis button in the upper right corner, the Analysis TookPak is enabled and you are ready to go.
If the Data Analysis button is not visible, the Analysis ToolPak is not enabled. In that case, do the following:
- Click the File tab.
- Select Options to open the Excel Options dialog box.
- Click the Add-Ins item, from the left column. This opens the View and Manage Microsoft Office Add-ins screen.
- From the Manage drop-down box, choose Excel Add-Ins and click the Go button. This opens the Add-Ins dialog box.
- From the Add-Ins dialog, check the box beside Analysis ToolPak and click Go.
This enables the Analysis ToolPak. Now, when you click the Data tab, you will see a Data Analysis button in the upper right corner under the Data tab. (If this explanation of how to enable the Analysis ToolPak is unclear, go to https://stattrek.com/anova/excel-analysis-toolpak for more detailed instruction.)
Data Entry With Excel
Data entry with Excel is easy. There are three main steps:
- Enter data on spreadsheet.
- Identify independent and dependent variables.
- Specify desired analyses.
To illustrate the process, we'll walk through each step, using data from our sample problem. First, we want to enter data on an Excel spreadsheet. This means listing data for each variable in adjacent columns, as shown below:
Next, we want to identify the independent and dependent variables. Begin by clicking the Data tab and the Data Analysis button.
This will open the Data Analysis dialog box. From the drop-down list, select "Regression" and click OK.
Excel will display the Regression dialog box. This is where you identify data fields for the independent and dependent variables. In the Input Y Range, enter coordinates for the dependent variable. In the Input X Range, enter coordinates for the independent variable(s). If you include column labels in these input ranges, check the Labels box. In the example below, we have included labels, so the Labels box is checked.
By default, Excel will produce a standard set of outputs. For this sample problem, that's all we need; so click OK to generate standard regression outputs.
Note: If desired, you can request additional outputs in the form of residual plots and normal probability plots. To produce the plots, check the appropriate box(es) under Output options on the Regression dialog box.
Data Analysis With Excel
Excel provides everything we need to address the tasks we defined for this sample problem. Recall that we wanted to do three things:
- Develop a least-squares regression equation to predict test score, based on (1) IQ and (2) the number of hours that the student studied.
- Assess how well the regression equation predicts test score, the dependent variable.
- Assess the contribution of each independent variable (i.e., IQ and study hours) to the prediction.
Let's review the output produced by Excel and see how it addresses each task.
Regression Equation
The first task in our analysis is to define a linear, least-squares regression equation to predict test score, based on IQ and study hours. Since we have two independent variables, the equation takes the following form:
ŷ = b0 + b1x1 + b2x2
In this equation, ŷ is the predicted test score. The independent variables are IQ and study hours, which are denoted by x1 and x2, respectively. The regression coefficients are b0, b1, and b2. On the right side of the equation, the only unknowns are the regression coefficients; so to specify the equation, we need to assign values to the coefficients.
In the previous lesson, we showed how to assign values to regression coefficients, using matrix algebra - a time-consuming, labor-intensive process by hand. Excel does all the hard work behind the scenes, and displays the result in a regression coefficients table:
Here, we see that the regression intercept (b0) is 23.156, the regression coefficient for IQ (b1) is 0.509, and the regression coefficient for study hours (b2) is 0.467. So the least-squares regression equation can be re-written as:
ŷ = 23.156 + 0.505 * IQ + 0.467 * Hours
This is the only linear equation that satisfies a least-squares criterion. That means this equation fits the data from which it was created better than any other linear equation.
Coefficient of Multiple Determination
The fact that our equation fits the data better than any other linear equation does not guarantee that it fits the data well. We still need to ask: How well does our equation fit the data?
To answer this question, researchers look at the coefficient of multiple determination (R2). The coefficient of multiple determination measures the proportion of variation in the dependent variable that can be predicted from the set of independent variables in the regression equation. When the regression equation fits the data well, R2 will be large (i.e., close to 1); and vice versa.
The coefficient of multiple determination can be defined in terms of sums of squares:
SSR = Σ ( ŷ - y )2
SSTO = Σ ( y - y )2
R2 = SSR / SSTO
where SSR is the sum of squares due to regression, SSTO is the total sum of squares, ŷ is the predicted value of the dependent variable, y is the dependent variable mean, and y is the dependent variable raw score.
Luckily, you will never have to compute the coefficient of multiple determination by hand. It is a standard output of Excel (and most other analysis packages), as shown below.
A quick glance at the output suggests that the regression equation fits the data pretty well. The coefficient of muliple determination is 0.905. For our sample problem, this means 90.5% of test score variation can be explained by IQ and by hours spent in study.
An Alternative View of R2
The coefficient of multiple correlation (R2) is the square of the correlation between actual and predicted values of the dependent variable. Thus,
R2 = r2y, ŷ
where y is the dependent variable raw score, ŷ is the predicted value of the dependent variable, and ry, ŷ is the correlation between y and ŷ.
ANOVA Table
Another way to evaluate the regression equation would be to assess the statistical significance of the regression sum of squares. For that, we examine the ANOVA table produced by Excel:
This table tests the statistical significance of the independent variables as predictors of the dependent variable. The last column of the table shows the results of an overall F test. The F statistic (33.4) is big, and the p value (0.00026) is small. This indicates that one or both independent variables has explanatory power beyond what would be expected by chance.
Like the coefficient of multiple correlation, the overall F test found in the ANOVA table suggests that the regression equation fits the data well.
Significance of Regression Coefficients
With multiple regression, there is more than one independent variable; so it is natural to ask whether a particular independent variable contributes significantly to the regression after effects of other variables are taken into account. The answer to this question can be found in the regression coefficients table:
The regression coefficients table shows the following information for each coefficient: its value, its standard error, a t-statistic, and the significance of the t-statistic. In this example, the t-statistics for IQ and study hours are both statistically significant at the 0.05 level. This means that IQ contributes significantly to the regression after effects of study hours are taken into account. And study hours contribute significantly to the regression after effects of IQ are taken into account.
Note: This analysis omits any consideration of multicollinearity, a topic we will cover in the next lesson. Be aware, however, that it is best practice to assess multicollinearity in the independent variables before testing significance of regression coefficients.
Final Thoughts
This lesson was all about multiple regression analysis. We used Excel, but the analysis would be much the same with other software packages. All major software packages (SAS, SPSS, Minitab, etc.) produce three key outputs:
- Regression coefficients, based on a least-squares criterion.
- Measures of goodness of fit, like a coefficient of multiple determination and/or an overall F test.
- Significance tests for individual regression coefficients.
If you can interpret these regression outputs from Excel, you should have no trouble interpreting the same outputs from other packages.