"

13.4 Testing the Significance of the Overall Model

LEARNING OBJECTIVES

  • Conduct and interpret an overall model test on a multiple regression model.

Previously, we learned that the population model for the multiple regression equation is

[latex]\begin{eqnarray*}y&=&\beta_0+\beta_1x_1+\beta_2x_2+\cdots+\beta_kx_k+\epsilon\end{eqnarray*}[/latex]

where [latex]x_1,x_2,\ldots,x_k[/latex] are the independent variables, [latex]\beta_0,\beta_1,\ldots,\beta_k[/latex] are the population parameters of the regression coefficients, and [latex]\epsilon[/latex] is the error variable. The error variable [latex]\epsilon[/latex] accounts for the variability in the dependent variable that is not captured by the linear relationship between the dependent and independent variables. The value of [latex]\epsilon[/latex] cannot be determined, but we must make certain assumptions about [latex]\epsilon[/latex] and the errors/residuals in the model in order to conduct a hypothesis test on how well the model fits the data.  These assumptions include:

  • The model is linear.
  • The errors/residuals have a normal distribution.
  • The mean of the errors/residuals is [latex]0[/latex].
  • The variance of the errors/residuals is constant.
  • The errors/residuals are independent.

Because we do not have the population data, we cannot verify that these conditions are met. We need to assume that the regression model has these properties in order to conduct hypothesis tests on the model.

Testing the Overall Model

We want to test if there is a relationship between the dependent variable and the set of independent variables. In other words, we want to determine if the regression model is valid or invalid.

  • Invalid Model. There is no relationship between the dependent variable and the set of independent variables. In this case, all of the regression coefficients [latex]\beta_i[/latex] in the population model are zero. This is the claim for the null hypothesis in the overall model test:  [latex]H_0:\beta_1=\beta_2=\cdots=\beta_k=0[/latex].
  • Valid Model. There is a relationship between the dependent variable and the set of independent variables. In this case, at least one of the regression coefficients [latex]\beta_i[/latex] in the population model is not zero. This is the claim for the alternative hypothesis in the overall model test:  [latex]H_a:\text{at least one }\beta_i\neq 0[/latex].

The overall model test procedure compares the means of explained and unexplained variation in the model in order to determine if the explained variation (caused by the relationship between the dependent variable and the set of independent variables) in the model is larger than the unexplained variation (represented by the error variable [latex]\epsilon[/latex]). If the explained variation is larger than the unexplained variation, then there is a relationship between the dependent variable and the set of independent variables, and the model is valid. Otherwise, there is no relationship between the dependent variable and the set of independent variables, and the model is invalid.

The logic behind the overall model test is based on two independent estimates of the variance of the errors:

  • One estimate of the variance of the errors, [latex]MSR[/latex], is based on the mean amount of explained variation in the dependent variable [latex]y[/latex].
  • One estimate of the variance of the errors, [latex]MSE[/latex], is based on the mean amount of unexplained variation in the dependent variable [latex]y[/latex].

The overall model test compares these two estimates of the variance of the errors to determine if there is a relationship between the dependent variable and the set of independent variables. Because the overall model test involves the comparison of two estimates of variance, an [latex]F[/latex]-distribution is used to conduct the overall model test, where the test statistic is the ratio of the two estimates of the variance of the errors.

The mean square due to regression, [latex]MSR[/latex], is one of the estimates of the variance of the errors. The [latex]MSR[/latex] is the estimate of the variance of the errors determined by the variance of the predicted [latex]\hat{y}[/latex]-values from the regression model and the mean of the [latex]y[/latex]-values in the sample, [latex]\overline{y}[/latex]. If there is no relationship between the dependent variable and the set of independent variables, then the [latex]MSR[/latex] provides an unbiased estimate of the variance of the errors. If there is a relationship between the dependent variable and the set of independent variables, then the [latex]MSR[/latex] provides an overestimate of the variance of the errors.

[latex]\begin{eqnarray*}SSR&=&\sum\left(\hat{y}-\overline{y}\right)^2\\\\MSR&=&\frac{SSR}{k}\end{eqnarray*}[/latex]

The mean square due to error, [latex]MSE[/latex], is the other estimate of the variance of the errors. The [latex]MSE[/latex] is the estimate of the variance of the errors determined by the error [latex](y-\hat{y})[/latex] in using the regression model to predict the values of the dependent variable in the sample. The [latex]MSE[/latex] always provides an unbiased estimate of the variance of errors, regardless of whether or not there is a relationship between the dependent variable and the set of independent variables.

[latex]\begin{eqnarray*}SSE&=&\sum\left(y-\hat{y}\right)^2\\\\MSE&=&\frac{SSE}{n-k-1}\end{eqnarray*}[/latex]
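
These sums of squares are straightforward to compute directly from the observed and predicted values. The short sketch below is an illustration only (it is not part of the textbook's Excel-based workflow); the observed values, predicted values, and [latex]k=2[/latex] independent variables are made up for demonstration.

```python
import numpy as np

# Hypothetical values for illustration only (not from the textbook):
# observed y and predicted y-hat from a fitted model with k = 2 predictors.
y     = np.array([4.0, 5.0, 2.0, 6.0, 7.0, 8.0])
y_hat = np.array([4.3, 4.8, 2.6, 5.9, 6.5, 7.9])
k = 2
n = len(y)

y_bar = y.mean()
SSR = np.sum((y_hat - y_bar) ** 2)   # explained variation
SSE = np.sum((y - y_hat) ** 2)       # unexplained variation

MSR = SSR / k                        # mean square due to regression
MSE = SSE / (n - k - 1)              # mean square due to error
print(MSR, MSE)
```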

The overall model test depends on the fact that the [latex]MSR[/latex] is influenced by the explained variation in the dependent variable, which results in the [latex]MSR[/latex] being either an unbiased or overestimate of the variance of the errors. Because the [latex]MSE[/latex] is based on the unexplained variation in the dependent variable, the [latex]MSE[/latex] is not affected by the relationship between the dependent variable and the set of independent variables and is always an unbiased estimate of the variance of the errors.

The null hypothesis in the overall model test is that there is no relationship between the dependent variable and the set of independent variables. The alternative hypothesis is that there is a relationship between the dependent variable and the set of independent variables. The [latex]F[/latex]-score for the overall model test is the ratio of the two estimates of the variance of the errors, [latex]\displaystyle{F=\frac{MSR}{MSE}}[/latex] with [latex]df_1=k[/latex] and [latex]df_2=n-k-1[/latex]. The [latex]p-\text{value}[/latex] for the test is the area in the right tail of the [latex]F[/latex]-distribution to the right of the [latex]F[/latex]-score.
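
As a quick illustration of this calculation (using scipy, which the textbook itself does not use), the sketch below turns assumed values of [latex]MSR[/latex], [latex]MSE[/latex], [latex]k[/latex], and [latex]n[/latex] into the [latex]F[/latex]-score and the right-tail [latex]p-\text{value}[/latex].

```python
from scipy import stats

# Assumed values for illustration only.
MSR, MSE = 18.0, 2.5
k, n = 3, 25

F = MSR / MSE
df1 = k
df2 = n - k - 1

# The p-value is the area under the F(df1, df2) distribution to the right of F.
p_value = stats.f.sf(F, df1, df2)
print(F, p_value)
```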

NOTES

  1. If there is no relationship between the dependent variable and the set of independent variables, both the [latex]MSR[/latex] and the [latex]MSE[/latex] are unbiased estimates of the variance of the errors. In this case, the [latex]MSR[/latex] and the [latex]MSE[/latex] are close in value, which results in an [latex]F[/latex]-score close to 1 and a large [latex]p-\text{value}[/latex]. The conclusion of the test would be to not reject the null hypothesis.
  2. If there is a relationship between the dependent variable and the set of independent variables, the [latex]MSR[/latex] is an overestimate of the variance of the errors. In this case, the [latex]MSR[/latex] is significantly larger than the [latex]MSE[/latex], which results in a large [latex]F[/latex]-score and a small [latex]p-\text{value}[/latex]. The conclusion of the test would be to reject the null hypothesis in favour of the alternative hypothesis.

Conducting a Hypothesis Test on the Overall Regression Model

Follow these steps to perform a hypothesis test on the overall regression model:

  1. Write down the null hypothesis that there is no relationship between the dependent variable and the set of independent variables:

    [latex]\begin{eqnarray*}H_0:&&\beta_1=\beta_2=\cdots=\beta_k=0\\\\\end{eqnarray*}[/latex]

  2. Write down the alternative hypothesis that there is a relationship between the dependent variable and the set of independent variables:

    [latex]\begin{eqnarray*}H_a:&&\text{at least one }\beta_i\text{ is not } 0\\\\\end{eqnarray*}[/latex]

  3. Collect the sample information for the test and identify the significance level [latex]\alpha[/latex].
  4. The [latex]p-\text{value}[/latex] is the area in the right tail of the [latex]F[/latex]-distribution.  The [latex]F[/latex]-score and degrees of freedom are

    [latex]\begin{eqnarray*}F&=&\frac{MSR}{MSE}\\\\df_1&=&k\\\\df_2&=&n-k-1\\\\\end{eqnarray*}[/latex]

  5. Compare the [latex]p-\text{value}[/latex] to the significance level and state the outcome of the test.
    • If [latex]p-\text{value}\leq\alpha[/latex], reject [latex]H_0[/latex] in favour of [latex]H_a[/latex].
      • The results of the sample data are significant. There is sufficient evidence to conclude that the null hypothesis [latex]H_0[/latex] is an incorrect belief and that the alternative hypothesis [latex]H_a[/latex] is most likely correct.
    • If [latex]p-\text{value}\gt\alpha[/latex], do not reject [latex]H_0[/latex].
      • The results of the sample data are not significant. There is not sufficient evidence to conclude that the alternative hypothesis [latex]H_a[/latex] may be correct.
  6. Write down a concluding sentence specific to the context of the question.

The calculation of the [latex]MSR[/latex], the [latex]MSE[/latex], and the [latex]F[/latex]-score for the overall model test can be time-consuming, even with the help of software like Excel. However, the required [latex]F[/latex]-score and [latex]p-\text{value}[/latex] for the test can be found on the regression summary table, which we learned how to generate in Excel in a previous section.
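
For readers working outside Excel, most statistical software reports the same two numbers. As a rough sketch only (the textbook's workflow is Excel, and the data frame below holds just the first six rows of the example data that follows, as a placeholder), a library such as statsmodels prints the overall [latex]F[/latex]-score and its [latex]p-\text{value}[/latex] alongside the fitted coefficients.

```python
import pandas as pd
import statsmodels.api as sm

# Placeholder data: the first six rows of the example below. Replace with the full sample.
df = pd.DataFrame({
    "satisfaction": [4, 5, 2, 6, 7, 8],
    "unpaid_hours": [3, 8, 9, 4, 3, 1],
    "age":          [23, 32, 28, 60, 62, 43],
    "income":       [60, 114, 45, 187, 175, 125],
})

X = sm.add_constant(df[["unpaid_hours", "age", "income"]])   # add the intercept term
model = sm.OLS(df["satisfaction"], X).fit()

print(model.fvalue)    # overall F-score (Excel's "F")
print(model.f_pvalue)  # overall p-value (Excel's "Significance F")
```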

EXAMPLE

The human resources department at a large company wants to develop a model to predict an employee’s job satisfaction from the number of hours of unpaid work per week the employee does, the employee’s age, and the employee’s income. A sample of [latex]25[/latex] employees at the company is taken, and the data is recorded in the table below. The employee’s income is recorded in [latex]\$1000[/latex]s, and the job satisfaction score is out of [latex]10[/latex], with higher values indicating greater job satisfaction.

Job Satisfaction | Hours of Unpaid Work per Week | Age | Income ([latex]\$1000[/latex]s)
4 3 23 60
5 8 32 114
2 9 28 45
6 4 60 187
7 3 62 175
8 1 43 125
7 6 60 93
3 3 37 57
5 2 24 47
5 5 64 128
7 2 28 66
8 1 66 146
5 7 35 89
2 5 37 56
4 0 59 65
6 2 32 95
5 6 76 82
7 5 25 90
9 0 55 137
8 3 34 91
7 5 54 184
9 1 57 60
7 0 68 39
10 2 66 187
5 0 50 49

Previously, we found the multiple regression equation to predict the job satisfaction score from the other variables:

[latex]\begin{eqnarray*}\hat{y}&=&4.7993-0.3818x_1+0.0046x_2+0.0233x_3\\\\\hat{y}&=&\text{predicted job satisfaction score}\\x_1&=&\text{hours of unpaid work per week}\\x_2&=&\text{age}\\x_3&=&\text{income (\$1000s)}\end{eqnarray*}[/latex]

At the [latex]5\%[/latex] significance level, test the validity of the overall model to predict the job satisfaction score.

Solution 

Hypotheses:

[latex]\begin{eqnarray*}H_0:&&\beta_1=\beta_2=\beta_3=0\\H_a:&&\text{at least one }\beta_i\text{ is not }0\end{eqnarray*}[/latex]

[latex]p-\text{value}[/latex]:

The regression summary table generated by Excel is shown below:

SUMMARY OUTPUT

Regression Statistics
Multiple R | 0.711779225
R Square | 0.506629665
Adjusted R Square | 0.436148189
Standard Error | 1.585212784
Observations | 25

ANOVA
 | df | SS | MS | F | Significance F
Regression | 3 | 54.189109 | 18.06303633 | 7.18812504 | 0.001683189
Residual | 21 | 52.770891 | 2.512899571
Total | 24 | 106.96

 | Coefficients | Standard Error | t Stat | P-value | Lower 95% | Upper 95%
Intercept | 4.799258185 | 1.197185164 | 4.008785216 | 0.00063622 | 2.309575344 | 7.288941027
Hours of Unpaid Work per Week | -0.38184722 | 0.130750479 | -2.9204269 | 0.008177146 | -0.65375772 | -0.10993671
Age | 0.004555815 | 0.022855709 | 0.199329423 | 0.843922453 | -0.04297523 | 0.052086864
Income ([latex]\$1000[/latex]s) | 0.023250418 | 0.007610353 | 3.055103771 | 0.006012895 | 0.007423823 | 0.039077013

The [latex]p-\text{value}[/latex] for the overall model test is in the middle part of the table under the ANOVA heading in the Significance F column of the Regression row. So the [latex]p-\text{value}=0.0017[/latex].

Conclusion:  

Because [latex]p-\text{value}=0.0017\lt 0.05=\alpha[/latex], we reject the null hypothesis in favour of the alternative hypothesis. At the [latex]5\%[/latex] significance level, there is enough evidence to suggest that there is a relationship between the dependent variable “job satisfaction” and the set of independent variables “hours of unpaid work per week,” “age”, and “income.”
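
To connect the Excel output to the formulas from earlier in this section, the following sketch (using scipy, which is not part of the Excel workflow shown above) recomputes the [latex]F[/latex]-score and [latex]p-\text{value}[/latex] from the sums of squares in the ANOVA portion of the summary table.

```python
from scipy import stats

# Sums of squares taken from the ANOVA portion of the Excel summary above.
SSR, SSE = 54.189109, 52.770891
n, k = 25, 3

MSR = SSR / k              # 18.0630... (matches the MS column)
MSE = SSE / (n - k - 1)    # 2.5129...
F = MSR / MSE              # 7.1881... (matches the F column)

p_value = stats.f.sf(F, k, n - k - 1)
print(round(p_value, 4))   # approximately 0.0017, Excel's "Significance F"
```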

NOTES

  1. The null hypothesis [latex]\beta_1=\beta_2=\beta_3=0[/latex] is the claim that all of the regression coefficients are zero. That is, the null hypothesis is the claim that there is no relationship between the dependent variable and the set of independent variables, which means that the model is not valid.
  2. The alternative hypothesis is the claim that at least one of the regression coefficients is not zero. The alternative hypothesis is the claim that at least one of the independent variables is linearly related to the dependent variable, which means that the model is valid. The alternative hypothesis does not say that all of the regression coefficients are not zero, only that at least one of them is not zero. The alternative hypothesis does not tell us which independent variables are related to the dependent variable.
  3. The [latex]p-\text{value}[/latex] for the overall model test is located in the middle part of the table under the Significance F column heading in the Regression row (right underneath the ANOVA heading). You will notice a P-value column heading at the bottom of the table in the rows corresponding to the independent variables. These [latex]p-\text{values}[/latex] in the bottom part of the table are not related to the overall model test we are conducting here. The [latex]p-\text{values}[/latex] in the independent variable rows are the ones we will need when we conduct tests on the individual regression coefficients in the next section.
  4. The [latex]p-\text{value}[/latex] of [latex]0.0017[/latex] is a small probability compared to the significance level, and so is unlikely to happen, assuming the null hypothesis is true. This suggests that the assumption that the null hypothesis is true is most likely incorrect, and so the conclusion of the test is to reject the null hypothesis in favour of the alternative hypothesis. In other words, at least one of the regression coefficients is not zero, and at least one independent variable is linearly related to the dependent variable.

Video: “Basic Excel Business Analytics #51: Testing Significance of Regression Relationship with p-value” by excelisfun [20:45] is licensed under the Standard YouTube License. Transcript and closed captions available on YouTube.


Exercises

  1. A local restaurant advocacy group wants to study the relationship between a restaurant’s average weekly profit, the restaurant’s seating capacity, and the average daily traffic that passes the restaurant’s location. The group took a sample of restaurants and recorded their average weekly profit (in [latex]\$1000[/latex]s), the restaurant’s seating capacity, and the average number of cars (in [latex]1000[/latex]s) that pass the restaurant’s location. The data is recorded in the following table:
    Seating Capacity | Traffic Count ([latex]1000[/latex]s) | Weekly Net Profit ([latex]\$1000[/latex]s)
    120 19 23.8
    180 8 29.2
    150 12 22
    180 15 26.2
    220 16 33.5
    235 10 32
    115 18 22.4
    110 12 20.4
    165 21 23.7
    220 20 34.7
    140 24 27.1
    145 24 23.3
    140 13 20.9
    200 14 29.6
    210 14 31.4
    175 12 23.2
    175 15 31.1
    190 17 28.2
    100 23 25.2
    145 20 20.7
    135 13 37.2
    25 13 26.3
    140 25 20
    130 14 28.2
    135 10 24.6
    160 23 23.7

    In Question 1 of Section 13.1, we found the regression model to predict the average weekly profit from other variables. At the [latex]5\%[/latex] significance level, test the validity of the model.

    Click to see Answer
    • Hypotheses: [latex]\begin{eqnarray*}H_0:&&\beta_1=\beta_2=0\\H_a:&&\text{at least one }\beta_i\text{ is not }0\end{eqnarray*}[/latex]
    • [latex]p-\text{value}=0.0205[/latex]
    • At the [latex]5\%[/latex] significance level, there is enough evidence to suggest that there is a relationship between the dependent variable “weekly profit” and the set of independent variables “seating capacity” and “traffic count.”

     

  2. A local university wants to study the relationship between a student’s GPA, the average number of hours they spend studying each night, and the average number of nights they go out each week. The university took a sample of students and recorded the following data:
    GPA | Average Number of Hours Spent Studying Each Night | Average Number of Nights Go Out Each Week
    3.72 5 1
    3.88 3 1
    3.67 2 1
    3.87 3 4
    2.49 1 4
    1.29 1 2
    1.01 2 4
    2.12 1 1
    1.9 1 5
    3.42 3 2
    1.33 1 4
    1.07 0 2
    2.75 3 1
    3.82 4 1
    3.91 5 0
    2.25 2 3
    2.06 1 5
    2.92 3 2
    3.06 3 1
    3.65 2 2
    3.69 4 1

    In Question 2 of Section 13.1, we found the regression model to predict GPA from other variables. At the [latex]1\%[/latex] significance level, test the validity of the model.

    Click to see Answer
    • Hypotheses: [latex]\begin{eqnarray*}H_0:&&\beta_1=\beta_2=0\\H_a:&&\text{at least one }\beta_i\text{ is not }0\end{eqnarray*}[/latex]
    • [latex]p-\text{value}=0.0002[/latex]
    • At the [latex]1\%[/latex] significance level, there is enough evidence to suggest that there is a relationship between the dependent variable “GPA” and the set of independent variables “average number of hours spent studying each night” and “average number of nights go out each week.”

     

  3. A very large company wants to study the relationship between the salaries of employees in management positions, their age, the number of years the employee spent in college, and the number of years the employee has been with the company. A sample of management employees is taken, and the data is recorded below:
    Age | Years of College | Years with Company | Salary ([latex]\$1000[/latex]s)
    60 8 29 317.3
    33 3 5 97.3
    57 6 27 263.1
    32 4 5 101.3
    31 6 3 114.2
    61 8 19 350.4
    41 7 8 146.9
    35 4 2 91.7
    51 6 21 198.2
    50 8 10 196.5
    57 5 15 105.7
    49 6 18 118.3
    62 7 27 305.2
    52 8 26 239.9
    39 4 8 145.9
    42 7 5 175.4
    62 4 24 219.4
    60 4 22 202.1
    65 3 21 196.3
    40 4 10 143.9
    62 6 29 408.7
    53 7 5 145.2
    48 8 5 175.1
    61 5 6 152.7
    38 7 3 99.7
    40 7 12 174.9
    45 7 7 149.2
    58 7 14 282.8
    38 4 3 95.7
    41 5 18 232.8

    In Question 3 of Section 13.1, we found the regression model to predict salary from other variables. At the [latex]1\%[/latex] significance level, test the validity of the model.

    Click to see Answer
    • Hypotheses: [latex]\begin{eqnarray*}H_0:&&\beta_1=\beta_2=\beta_3=0\\H_a:&&\text{at least one }\beta_i\text{ is not }0\end{eqnarray*}[/latex]
    • [latex]p-\text{value}=0.0000002[/latex]
    • At the [latex]1\%[/latex] significance level, there is enough evidence to suggest that there is a relationship between the dependent variable “salary” and the set of independent variables “age”, “years of college”, and “years at company.”

     


“13.5 Testing the Significance of the Overall Model” and “13.8 Exercises” from Introduction to Statistics by Valerie Watts is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License, except where otherwise noted.
