13.5 Testing the Significance of the Overall Model
LEARNING OBJECTIVES
- Conduct and interpret an overall model test on a multiple regression model.
Previously, we learned that the population model for the multiple regression equation is
[latex]\begin{eqnarray*}y&=&\beta_0+\beta_1x_1+\beta_2x_2+\cdots+\beta_kx_k+\epsilon\end{eqnarray*}[/latex]
where [latex]x_1,x_2,\ldots,x_k[/latex] are the independent variables, [latex]\beta_0,\beta_1,\ldots,\beta_k[/latex] are the population parameters of the regression coefficients, and [latex]\epsilon[/latex] is the error variable. The error variable [latex]\epsilon[/latex] accounts for the variability in the dependent variable that is not captured by the linear relationship between the dependent and independent variables. The value of [latex]\epsilon[/latex] cannot be determined, but we must make certain assumptions about [latex]\epsilon[/latex] and the errors/residuals in the model in order to conduct a hypothesis test on how well the model fits the data. These assumptions include:
- The model is linear.
- The errors/residuals have a normal distribution.
- The mean of the errors/residuals is 0.
- The variance of the errors/residuals is constant.
- The errors/residuals are independent.
Because we do not have the population data, we cannot verify that these conditions are met. We need to assume that the regression model has these properties in order to conduct hypothesis tests on the model.
Testing the Overall Model
We want to test if there is a relationship between the dependent variable and the set of independent variables. In other words, we want to determine if the regression model is valid or invalid.
- Invalid Model. There is no relationship between the dependent variable and the set of independent variables. In this case, all of the regression coefficients [latex]\beta_1,\beta_2,\ldots,\beta_k[/latex] of the independent variables in the population model are zero. This is the claim for the null hypothesis in the overall model test: [latex]H_0:\beta_1=\beta_2=\cdots=\beta_k=0[/latex].
- Valid Model. There is a relationship between the dependent variable and the set of independent variables. In this case, at least one of the regression coefficients [latex]\beta_i[/latex] in the population model is not zero. This is the claim for the alternative hypothesis in the overall model test: [latex]H_a:\text{at least one }\beta_i\neq 0[/latex].
The overall model test procedure compares the mean explained variation and the mean unexplained variation in the model in order to determine if the explained variation (caused by the relationship between the dependent variable and the set of independent variables) is large relative to the unexplained variation (represented by the error variable [latex]\epsilon[/latex]). If the explained variation is sufficiently larger than the unexplained variation, then there is evidence of a relationship between the dependent variable and the set of independent variables, and the model is considered valid. Otherwise, there is not enough evidence of a relationship between the dependent variable and the set of independent variables, and the model is considered invalid.
The logic behind the overall model test is based on two independent estimates of the variance of the errors:
- One estimate of the variance of the errors, [latex]MSR[/latex], is based on the mean amount of explained variation in the dependent variable [latex]y[/latex].
- One estimate of the variance of the errors, [latex]MSE[/latex], is based on the mean amount of unexplained variation in the dependent variable [latex]y[/latex].
The overall model test compares these two estimates of the variance of the errors to determine if there is a relationship between the dependent variable and the set of independent variables. Because the overall model test involves the comparison of two estimates of variance, an [latex]F[/latex]-distribution is used to conduct the overall model test, where the test statistic is the ratio of the two estimates of the variance of the errors.
The mean square due to regression, [latex]MSR[/latex], is one of the estimates of the variance of the errors. The [latex]MSR[/latex] is determined by the variation of the predicted [latex]\hat{y}[/latex]-values from the regression model around the mean of the [latex]y[/latex]-values in the sample, [latex]\overline{y}[/latex]. If there is no relationship between the dependent variable and the set of independent variables, then the [latex]MSR[/latex] provides an unbiased estimate of the variance of the errors. If there is a relationship between the dependent variable and the set of independent variables, then the [latex]MSR[/latex] provides an overestimate of the variance of the errors.
[latex]\begin{eqnarray*}SSR&=&\sum\left(\hat{y}-\overline{y}\right)^2\\\\MSR&=&\frac{SSR}{k}\end{eqnarray*}[/latex]
The mean square due to error, [latex]MSE[/latex], is the other estimate of the variance of the errors. The [latex]MSE[/latex] is the estimate of the variance of the errors determined by the error [latex](y-\hat{y})[/latex] in using the regression model to predict the values of the dependent variable in the sample. The [latex]MSE[/latex] always provides an unbiased estimate of the variance of errors, regardless of whether or not there is a relationship between the dependent variable and the set of independent variables.
[latex]\begin{eqnarray*}SSE&=&\sum\left(y-\hat{y}\right)^2\\\\MSE&=&\frac{SSE}{n-k-1}\end{eqnarray*}[/latex]
The overall model test depends on the fact that the [latex]MSR[/latex] is influenced by the explained variation in the dependent variable, which results in the [latex]MSR[/latex] being either an unbiased estimate or an overestimate of the variance of the errors. Because the [latex]MSE[/latex] is based on the unexplained variation in the dependent variable, the [latex]MSE[/latex] is not affected by the relationship between the dependent variable and the set of independent variables, and is always an unbiased estimate of the variance of the errors.
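To make the two formulas concrete, the following is a minimal Python sketch of the sums of squares and mean squares. The function name `mean_squares` and the five-point data set are hypothetical and used only for illustration; the fitted line [latex]\hat{y}=2.2+0.6x[/latex] is the least squares line for these made-up points, with [latex]k=1[/latex] independent variable.

```python
def mean_squares(y, y_hat, k):
    """Return (SSR, SSE, MSR, MSE) for observed y, predicted y_hat, and k independent variables."""
    n = len(y)
    y_bar = sum(y) / n
    # Explained variation: predicted values around the sample mean
    ssr = sum((yh - y_bar) ** 2 for yh in y_hat)
    # Unexplained variation: observed values around the predicted values
    sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))
    msr = ssr / k
    mse = sse / (n - k - 1)
    return ssr, sse, msr, mse


# Hypothetical toy data (not from this section's example)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
y_hat = [2.2 + 0.6 * xi for xi in x]  # least squares line for the toy data

ssr, sse, msr, mse = mean_squares(y, y_hat, k=1)
print(round(ssr, 4), round(sse, 4), round(msr, 4), round(mse, 4))
```

For this toy data the explained and unexplained sums of squares are 3.6 and 2.4, so the mean squares are 3.6 and 0.8. Note that the degrees of freedom, not the sample size, are used as the divisors.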
The null hypothesis in the overall model test is that there is no relationship between the dependent variable and the set of independent variables. The alternative hypothesis is that there is a relationship between the dependent variable and the set of independent variables. The [latex]F[/latex]-score for the overall model test is the ratio of the two estimates of the variance of the errors, [latex]\displaystyle{F=\frac{MSR}{MSE}}[/latex] with [latex]df_1=k[/latex] and [latex]df_2=n-k-1[/latex]. The p-value for the test is the area in the right tail of the [latex]F[/latex]-distribution to the right of the [latex]F[/latex]-score.
NOTES
- If there is no relationship between the dependent variable and the set of independent variables, both the [latex]MSR[/latex] and the [latex]MSE[/latex] are unbiased estimates of the variance of the errors. In this case, the [latex]MSR[/latex] and the [latex]MSE[/latex] tend to be close in value, which results in an [latex]F[/latex]-score close to 1 and a large p-value. The conclusion of the test would be to not reject the null hypothesis.
- If there is a relationship between the dependent variable and the set of independent variables, the [latex]MSR[/latex] is an overestimate of the variance of the errors. In this case, the [latex]MSR[/latex] tends to be much larger than the [latex]MSE[/latex], which results in a large [latex]F[/latex]-score and a small p-value. The conclusion of the test would be to reject the null hypothesis in favour of the alternative hypothesis.
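The link between the size of the [latex]F[/latex]-score and the size of the p-value can be explored with a short Python sketch. The code below is one possible way to compute the right-tail area of an [latex]F[/latex]-distribution using only the standard library; it relies on the standard identity that expresses this area through the regularized incomplete beta function, evaluated with a continued fraction. The function names `_beta_cf`, `reg_inc_beta`, and `f_right_tail` are hypothetical, and this is not how Excel is documented to perform the calculation.

```python
import math


def _beta_cf(a, b, x, max_iter=300, eps=1e-12):
    """Continued fraction for the incomplete beta function (modified Lentz method)."""
    tiny = 1e-300
    c = 1.0
    d = 1.0 - (a + b) * x / (a + 1.0)
    d = 1.0 / (d if abs(d) > tiny else tiny)
    h = d
    for m in range(1, max_iter + 1):
        # Even-numbered term of the continued fraction
        num = m * (b - m) * x / ((a + 2 * m - 1.0) * (a + 2 * m))
        d = 1.0 + num * d
        d = 1.0 / (d if abs(d) > tiny else tiny)
        c = 1.0 + num / c
        c = c if abs(c) > tiny else tiny
        h *= d * c
        # Odd-numbered term of the continued fraction
        num = -(a + m) * (a + b + m) * x / ((a + 2 * m) * (a + 2 * m + 1.0))
        d = 1.0 + num * d
        d = 1.0 / (d if abs(d) > tiny else tiny)
        c = 1.0 + num / c
        c = c if abs(c) > tiny else tiny
        delta = d * c
        h *= delta
        if abs(delta - 1.0) < eps:
            break
    return h


def reg_inc_beta(a, b, x):
    """Regularized incomplete beta function I_x(a, b)."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    log_front = (math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
                 + a * math.log(x) + b * math.log(1.0 - x))
    front = math.exp(log_front)
    # Use the form of the continued fraction that converges faster
    if x < (a + 1.0) / (a + b + 2.0):
        return front * _beta_cf(a, b, x) / a
    return 1.0 - front * _beta_cf(b, a, 1.0 - x) / b


def f_right_tail(f_score, df1, df2):
    """p-value: area to the right of f_score under the F-distribution with (df1, df2)."""
    if f_score <= 0.0:
        return 1.0
    x = df2 / (df2 + df1 * f_score)
    return reg_inc_beta(df2 / 2.0, df1 / 2.0, x)


# k = 3 independent variables and n = 25 observations give df1 = 3 and df2 = 21
for f_score in (1.0, 7.18812504):
    print(f_score, f_right_tail(f_score, 3, 21))
```

The first [latex]F[/latex]-score illustrates the "close to 1" case described in the notes, and the second is the [latex]F[/latex]-score reported in the worked example later in this section, so its p-value can be compared with the Significance F entry of that example's regression summary table.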
Steps to Conduct a Hypothesis Test on the Overall Regression Model
- Write down the null hypothesis that there is no relationship between the dependent variable and the set of independent variables:
[latex]\begin{eqnarray*}H_0:&&\beta_1=\beta_2=\cdots=\beta_k=0\\\\\end{eqnarray*}[/latex]
- Write down the alternative hypothesis that there is a relationship between the dependent variable and the set of independent variables:
[latex]\begin{eqnarray*}H_a:&&\text{at least one }\beta_i\text{ is not 0}\\\\\end{eqnarray*}[/latex]
- Collect the sample information for the test and identify the significance level [latex]\alpha[/latex].
- Calculate the p-value, which is the area in the right tail of the [latex]F[/latex]-distribution to the right of the [latex]F[/latex]-score. The [latex]F[/latex]-score and degrees of freedom are
[latex]\begin{eqnarray*}F&=&\frac{MSR}{MSE}\\\\df_1&=&k\\\\df_2&=&n-k-1\\\\\end{eqnarray*}[/latex]
- Compare the p-value to the significance level and state the outcome of the test:
- If p-value[latex]\leq\alpha[/latex], reject [latex]H_0[/latex] in favour of [latex]H_a[/latex].
- The results of the sample data are significant. There is sufficient evidence to conclude that the null hypothesis [latex]H_0[/latex] is an incorrect belief and that the alternative hypothesis [latex]H_a[/latex] is most likely correct.
- If p-value[latex]\gt\alpha[/latex], do not reject [latex]H_0[/latex].
- The results of the sample data are not significant. There is not sufficient evidence to conclude that the alternative hypothesis [latex]H_a[/latex] may be correct.
- Write down a concluding sentence specific to the context of the question.
The calculation of the [latex]MSR[/latex], the [latex]MSE[/latex], and the [latex]F[/latex]-score for the overall model test can be time-consuming, even with the help of software like Excel. However, the required [latex]F[/latex]-score and p-value for the test can be found on the regression summary table, which we learned how to generate in Excel in a previous section.
EXAMPLE
The human resources department at a large company wants to develop a model to predict an employee’s job satisfaction from the number of hours of unpaid work per week the employee does, the employee’s age, and the employee’s income. A sample of 25 employees at the company is taken and the data is recorded in the table below. The employee’s income is recorded in $1000s and the job satisfaction score is out of 10, with higher values indicating greater job satisfaction.
Job Satisfaction | Hours of Unpaid Work per Week | Age | Income ($1000s) |
4 | 3 | 23 | 60 |
5 | 8 | 32 | 114 |
2 | 9 | 28 | 45 |
6 | 4 | 60 | 187 |
7 | 3 | 62 | 175 |
8 | 1 | 43 | 125 |
7 | 6 | 60 | 93 |
3 | 3 | 37 | 57 |
5 | 2 | 24 | 47 |
5 | 5 | 64 | 128 |
7 | 2 | 28 | 66 |
8 | 1 | 66 | 146 |
5 | 7 | 35 | 89 |
2 | 5 | 37 | 56 |
4 | 0 | 59 | 65 |
6 | 2 | 32 | 95 |
5 | 6 | 76 | 82 |
7 | 5 | 25 | 90 |
9 | 0 | 55 | 137 |
8 | 3 | 34 | 91 |
7 | 5 | 54 | 184 |
9 | 1 | 57 | 60 |
7 | 0 | 68 | 39 |
10 | 2 | 66 | 187 |
5 | 0 | 50 | 49 |
Previously, we found the multiple regression equation to predict the job satisfaction score from the other variables:
[latex]\begin{eqnarray*}\hat{y}&=&4.7993-0.3818x_1+0.0046x_2+0.0233x_3\\\\\hat{y}&=&\text{predicted job satisfaction score}\\x_1&=&\text{hours of unpaid work per week}\\x_2&=&\text{age}\\x_3&=&\text{income (\$1000s)}\end{eqnarray*}[/latex]
At the 5% significance level, test the validity of the overall model to predict the job satisfaction score.
Solution:
Hypotheses:
[latex]\begin{eqnarray*}H_0:&&\beta_1=\beta_2=\beta_3=0\\H_a:&&\text{at least one }\beta_i\text{ is not 0}\end{eqnarray*}[/latex]
p-value:
The regression summary table generated by Excel is shown below:
SUMMARY OUTPUT
Regression Statistics
Multiple R | 0.711779225 |
R Square | 0.506629665 |
Adjusted R Square | 0.436148189 |
Standard Error | 1.585212784 |
Observations | 25 |
ANOVA
 | df | SS | MS | F | Significance F |
Regression | 3 | 54.189109 | 18.06303633 | 7.18812504 | 0.001683189 |
Residual | 21 | 52.770891 | 2.512899571 | | |
Total | 24 | 106.96 | | | |
 | Coefficients | Standard Error | t Stat | P-value | Lower 95% | Upper 95% |
Intercept | 4.799258185 | 1.197185164 | 4.008785216 | 0.00063622 | 2.309575344 | 7.288941027 |
Hours of Unpaid Work per Week | -0.38184722 | 0.130750479 | -2.9204269 | 0.008177146 | -0.65375772 | -0.10993671 |
Age | 0.004555815 | 0.022855709 | 0.199329423 | 0.843922453 | -0.04297523 | 0.052086864 |
Income ($1000s) | 0.023250418 | 0.007610353 | 3.055103771 | 0.006012895 | 0.007423823 | 0.039077013 |
The p-value for the overall model test is in the middle part of the table under the ANOVA heading in the Significance F column of the Regression row. So the p-value=[latex]0.0017[/latex].
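As a quick check on how the ANOVA part of the table fits together, the MS and F entries can be recovered from the df and SS entries in the Regression and Residual rows, using [latex]k=3[/latex] and [latex]n=25[/latex] (values rounded to four decimal places):

[latex]\begin{eqnarray*}MSR&=&\frac{SSR}{k}=\frac{54.189109}{3}\approx 18.0630\\\\MSE&=&\frac{SSE}{n-k-1}=\frac{52.770891}{21}\approx 2.5129\\\\F&=&\frac{MSR}{MSE}\approx\frac{18.0630}{2.5129}\approx 7.1881\end{eqnarray*}[/latex]

The Significance F entry is the area to the right of this [latex]F[/latex]-score under the [latex]F[/latex]-distribution with [latex]df_1=3[/latex] and [latex]df_2=21[/latex].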
Conclusion:
Because p-value[latex]=0.0017\lt 0.05=\alpha[/latex], we reject the null hypothesis in favour of the alternative hypothesis. At the 5% significance level there is enough evidence to suggest that there is a relationship between the dependent variable “job satisfaction” and the set of independent variables “hours of unpaid work per week,” “age,” and “income.”
NOTES
- The null hypothesis [latex]\beta_1=\beta_2=\beta_3=0[/latex] is the claim that all of the regression coefficients are zero. That is, the null hypothesis is the claim that there is no relationship between the dependent variable and the set of independent variables, which means that the model is not valid.
- The alternative hypothesis is the claim that at least one of the regression coefficients is not zero. The alternative hypothesis is the claim that at least one of the independent variables is linearly related to the dependent variable, which means that the model is valid. The alternative hypothesis does not say that all of the regression coefficients are not zero, only that at least one of them is not zero. The alternative hypothesis does not tell us which independent variables are related to the dependent variable.
- The p-value for the overall model test is located in the middle part of the table under the Significance F column heading in the Regression row (right underneath the ANOVA heading). You will notice a p-value column heading at the bottom of the table in the rows corresponding to the independent variables. These p-values in the bottom part of the table are not related to the overall model test we are conducting here. These p-values in the independent variable rows are the p-values we will need when we conduct tests on the individual regression coefficients in the next section.
- The p-value of 0.0017 is a small probability compared to the significance level, which means that sample results like these are unlikely to occur if the null hypothesis is true. This suggests that the assumption that the null hypothesis is true is most likely incorrect, and so the conclusion of the test is to reject the null hypothesis in favour of the alternative hypothesis. In other words, at least one of the regression coefficients is not zero and at least one independent variable is linearly related to the dependent variable.
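For readers who want to see where the ANOVA entries come from without Excel, the following is a rough Python sketch that fits the model to the 25 rows of the data table by least squares and then computes the sums of squares, mean squares, and [latex]F[/latex]-score. It is an illustration under stated assumptions, not the method Excel uses internally: the helper `solve` (Gaussian elimination on the normal equations) and all variable names are hypothetical, and only the standard library is used.

```python
# Rows from the data table: (job satisfaction, unpaid hours, age, income in $1000s)
rows = [
    (4, 3, 23, 60), (5, 8, 32, 114), (2, 9, 28, 45), (6, 4, 60, 187),
    (7, 3, 62, 175), (8, 1, 43, 125), (7, 6, 60, 93), (3, 3, 37, 57),
    (5, 2, 24, 47), (5, 5, 64, 128), (7, 2, 28, 66), (8, 1, 66, 146),
    (5, 7, 35, 89), (2, 5, 37, 56), (4, 0, 59, 65), (6, 2, 32, 95),
    (5, 6, 76, 82), (7, 5, 25, 90), (9, 0, 55, 137), (8, 3, 34, 91),
    (7, 5, 54, 184), (9, 1, 57, 60), (7, 0, 68, 39), (10, 2, 66, 187),
    (5, 0, 50, 49),
]


def solve(a, b):
    """Solve the linear system a z = b by Gaussian elimination with partial pivoting."""
    size = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, size):
            factor = m[r][col] / m[col][col]
            for c in range(col, size + 1):
                m[r][c] -= factor * m[col][c]
    z = [0.0] * size
    for i in range(size - 1, -1, -1):
        s = sum(m[i][j] * z[j] for j in range(i + 1, size))
        z[i] = (m[i][size] - s) / m[i][i]
    return z


y = [float(r[0]) for r in rows]
X = [[1.0, *map(float, r[1:])] for r in rows]  # leading 1 for the intercept
n, k = len(y), len(X[0]) - 1

# Normal equations (X'X) b = X'y give the least squares coefficients
xtx = [[sum(row[i] * row[j] for row in X) for j in range(k + 1)] for i in range(k + 1)]
xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(k + 1)]
b = solve(xtx, xty)

y_hat = [sum(bi * xi for bi, xi in zip(b, row)) for row in X]
y_bar = sum(y) / n
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))
msr, mse = ssr / k, sse / (n - k - 1)
f_score = msr / mse
print(f"SSR={ssr:.4f} SSE={sse:.4f} MSR={msr:.4f} MSE={mse:.4f} F={f_score:.4f}")
```

The printed values can be compared against the SS, MS, and F columns of the ANOVA part of the regression summary table. The p-value is then the area to the right of the [latex]F[/latex]-score under the [latex]F[/latex]-distribution with [latex]df_1=3[/latex] and [latex]df_2=21[/latex].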
Watch this video: Basic Excel Business Analytics #51: Testing Significance of Regression Relationship with p-value by ExcelIsFun [20:44]
Concept Review
The overall model test determines if there is a relationship between the dependent variable and the set of independent variables. The test compares two estimates of the variance of the errors ([latex]MSR[/latex] and [latex]MSE[/latex]). The ratio of these two estimates of the variance of the errors is the [latex]F[/latex]-score from an [latex]F[/latex]-distribution with [latex]df_1=k[/latex] and [latex]df_2=n-k-1[/latex]. The p-value for the test is the area in the right tail of the [latex]F[/latex]-distribution. The p-value can be found on the regression summary table generated by Excel.
The overall model hypothesis test is a well-established process:
- Write down the null and alternative hypotheses in terms of the regression coefficients. The null hypothesis is the claim that there is no relationship between the dependent variable and the set of independent variables. The alternative hypothesis is the claim that there is a relationship between the dependent variable and the set of independent variables.
- Collect the sample information for the test and identify the significance level.
- The p-value is the area in the right tail of the [latex]F[/latex]-distribution. Use the regression summary table generated by Excel to find the p-value.
- Compare the p-value to the significance level and state the outcome of the test.
- Write down a concluding sentence specific to the context of the question.