10.3 The Goodness-of-Fit Test

Valerie Watts

LEARNING OBJECTIVES

Conduct and interpret [latex]\chi^2[/latex]-goodness-of-fit hypothesis tests.

Recall that a categorical (or qualitative) variable is a variable where the data can be grouped by specific categories. Examples of categorical variables include eye colour, blood type, or brand of car. A categorical variable is a random variable that takes on categories. Suppose we want to determine whether the data from a categorical variable “fit” a particular distribution or not. That is, for a categorical variable with a historical or assumed probability distribution, does a new sample from the population support the assumed probability distribution, or does the sample indicate that there has been a change in the probability distribution?

The [latex]\chi^2[/latex]-goodness-of-fit test allows us to test if the sample data from a categorical variable fit the pattern of expected probabilities for the variable. In a [latex]\chi^2[/latex]-goodness-of-fit test, we are analyzing the distribution of the frequencies for one categorical variable. This is a hypothesis test where the hypotheses state that the categorical variable does or does not follow an assumed probability distribution, and a [latex]\chi^2[/latex]-distribution is used to determine the [latex]p-\text{value}[/latex] for the test.

Conducting a [latex]\chi^2[/latex]-Goodness-of-Fit Test

Suppose a categorical variable has [latex]k[/latex] possible outcomes (categories) with probabilities [latex]p_1,p_2,\ldots,p_k[/latex]. Suppose [latex]n[/latex] independent observations are taken from this categorical variable.

Write down the null and alternative hypotheses:
[latex]\begin{eqnarray*}\\H_0:&&p_1=p_{1_0},p_2=p_{2_0},\ldots,p_k=p_{k_0}\\H_a:&&\text{at least one }p_i\neq p_{i_0}\\\\\end{eqnarray*}[/latex]
Collect the sample information for the test and identify the significance level [latex]\alpha[/latex].
Use the [latex]\chi^2[/latex]-distribution to find the [latex]p-\text{value}[/latex], which is the area in the right tail of the [latex]\chi^2[/latex]-distribution. The [latex]\chi^2[/latex]-score and degrees of freedom are
[latex]\begin{eqnarray*}\chi^2&=&\sum\frac{(\text{observed-expected})^2}{\text{expected}}\\\\df&=&k-1\\\\\text{observed}&=&\text{observed frequency from the sample data}\\\text{expected}&=&\text{expected frequency from assumed distribution}\\k&=&\text{number of categories}\\\\\end{eqnarray*}[/latex]
Compare the [latex]p-\text{value}[/latex] to the significance level and state the outcome of the test.
- If [latex]p-\text{value}\leq\alpha[/latex], reject [latex]H_0[/latex] in favour of [latex]H_a[/latex].
  - The results of the sample data are significant. There is sufficient evidence to conclude that the null hypothesis [latex]H_0[/latex] is an incorrect belief and that the alternative hypothesis [latex]H_a[/latex] is most likely correct.
- If [latex]p-\text{value}\gt\alpha[/latex], do not reject [latex]H_0[/latex].
  - The results of the sample data are not significant. There is not sufficient evidence to conclude that the alternative hypothesis [latex]H_a[/latex] may be correct.
Write down a concluding sentence specific to the context of the question.
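
The steps above can be sketched in Python using only the standard library. This is a minimal sketch and not part of the original procedure: the function names `chi_square_right_tail` and `goodness_of_fit_test` are hypothetical, `chi_square_right_tail` is assumed to play the same role as the spreadsheet function chisq.dist.rt, and the counts in the usage line are made up for illustration.

```python
import math


def chi_square_right_tail(x, df):
    """Area to the right of x under a chi-square curve with df degrees of freedom.

    Hypothetical stand-in for the spreadsheet function chisq.dist.rt.
    """
    if x <= 0:
        return 1.0
    z = x / 2
    # Start from a case with a simple closed form:
    # df = 2 gives exp(-z); df = 1 gives erfc(sqrt(z)).
    if df % 2 == 0:
        a, tail = 1.0, math.exp(-z)
    else:
        a, tail = 0.5, math.erfc(math.sqrt(z))
    # Each pass adds one term, which raises the degrees of freedom by 2.
    while a < df / 2:
        tail += z ** a * math.exp(-z) / math.gamma(a + 1)
        a += 1
    return tail


def goodness_of_fit_test(observed, null_probs, alpha):
    """Return the chi-square score, df, p-value, and whether H0 is rejected."""
    n = sum(observed)                              # sample size
    expected = [n * p for p in null_probs]         # expected frequencies under H0
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    df = len(observed) - 1                         # k - 1
    p_value = chi_square_right_tail(chi2, df)
    return chi2, df, p_value, p_value <= alpha     # reject H0 when p-value <= alpha


# Made-up counts for three categories with assumed probabilities 50%, 30%, 20%.
chi2, df, p_value, reject = goodness_of_fit_test([27, 14, 9], [0.5, 0.3, 0.2], alpha=0.05)
print(round(chi2, 4), df, round(p_value, 4), reject)
```

The final value returned is only the reject / do not reject decision; the concluding sentence in context still has to be written by hand.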

NOTES

The null hypothesis is the claim that the categorical variable follows the assumed distribution. That is, the probability [latex]p_i[/latex] of each possible outcome of the categorical variable equals a hypothesized probability [latex]p_{i_0}[/latex].
The alternative hypothesis is the claim that the categorical variable does not follow the assumed distribution. That is, for at least one possible outcome of the categorical variable, the probability [latex]p_i[/latex] does not equal the claimed probability [latex]p_{i_0}[/latex].
In order to use the [latex]\chi^2[/latex]-goodness-of-fit test, the expected frequency for each category must be at least [latex]5[/latex].
The [latex]p-\text{value}[/latex] for a [latex]\chi^2[/latex]-goodness-of-fit test is always the area in the right tail of the [latex]\chi^2[/latex]-distribution. So, we use chisq.dist.rt to find the [latex]p-\text{value}[/latex] for a [latex]\chi^2[/latex]-goodness-of-fit test.
To calculate the [latex]\chi^2[/latex]-score:
- For each of the possible outcomes of the categorical variable, calculate [latex]\displaystyle{\frac{(\text{observed-expected})^2}{\text{expected}}}[/latex]:
  1. Find the difference between the observed frequency (from the sample) and the expected frequency (from the null hypothesis). The expected frequency equals [latex]n\times p_{i_0}[/latex] where [latex]n[/latex] is the sample size and [latex]p_{i_0}[/latex] is the assumed probability for the [latex]i[/latex]th outcome claimed in the null hypothesis.
  2. Square the difference found in step 1.
  3. Divide the value found in step 2 by the expected frequency.
- Add up the values of [latex]\displaystyle{\frac{(\text{observed-expected})^2}{\text{expected}}}[/latex] for each of the outcomes.
We expect that there will be a discrepancy between the observed frequency and the expected frequency. If this discrepancy is very large, the value of [latex]\chi^2[/latex] will be very large and result in a small [latex]p-\text{value}[/latex].
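
The calculation of the [latex]\chi^2[/latex]-score described in these notes can be sketched in Python as follows. The function name `chi_square_contributions` and the sample counts are hypothetical, chosen only to illustrate the steps. The sketch also checks the requirement that every expected frequency is at least 5.

```python
def chi_square_contributions(observed, null_probs):
    """Return the expected frequencies and each category's contribution."""
    n = sum(observed)                        # sample size
    expected = [n * p for p in null_probs]   # n times the probability claimed in H0
    if min(expected) < 5:
        raise ValueError("every expected frequency must be at least 5")
    contributions = []
    for o, e in zip(observed, expected):
        difference = o - e                   # step 1: observed minus expected
        squared = difference ** 2            # step 2: square the difference
        contributions.append(squared / e)    # step 3: divide by the expected frequency
    return expected, contributions


# Made-up counts for three categories with assumed probabilities 50%, 30%, 20%.
expected, parts = chi_square_contributions([27, 14, 9], [0.5, 0.3, 0.2])
chi2 = sum(parts)                            # add up the contributions
print(expected, parts, chi2)
```

Keeping the individual contributions, rather than only their sum, makes it easy to see which categories account for most of the discrepancy.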

EXAMPLE

Absenteeism of college students from math classes is a major concern to math instructors because missing class appears to increase the drop rate. Suppose that a study was done to determine if the actual student absenteeism rate follows faculty perception. The faculty believe that the distribution of the number of absences per term is as follows:

Number of Absences per Term	Expected Percent of Students
0–2	[latex]50\%[/latex]
3–5	[latex]30\%[/latex]
6–8	[latex]12\%[/latex]
9–11	[latex]6\%[/latex]
12+	[latex]2\%[/latex]

At the end of the semester, a random survey of [latex]300[/latex] students across all mathematics courses was taken, and the actual (observed) number of absences for the [latex]300[/latex] students was recorded.

Number of Absences per Term	Observed Number of Students
0–2	120
3–5	100
6–8	55
9–11	15
12+	10

At the [latex]5\%[/latex] significance level, determine if the number of absences per term follows the distribution assumed by the faculty.

Solution

Let [latex]p_1[/latex] be the probability a student has 0-2 absences, [latex]p_2[/latex] be the probability a student has 3-5 absences, [latex]p_3[/latex] be the probability a student has 6-8 absences, [latex]p_4[/latex] be the probability a student has 9-11 absences, and [latex]p_5[/latex] be the probability a student has 12 or more absences.

Hypotheses:

[latex]\begin{eqnarray*}H_0:&&p_1=50\%,p_2=30\%,p_3=12\%,p_4=6\%,p_5=2\%\\H_a:&&\text{at least one of the }p_i's\text{ does not equal its stated probability}\end{eqnarray*}[/latex]

[latex]p-\text{value}[/latex]:

From the question, we have [latex]n=300[/latex] and [latex]k=5[/latex]. Now we need to calculate the [latex]\chi^2[/latex]-score for the test.

The observed frequency for each category is the number of observations in the sample that fall into that category. This is the information provided in the sample above.

Next, we must calculate the expected frequencies. The expected frequency is the number of observations we would expect to see in the sample, assuming the null hypothesis is true. To calculate the expected frequency for each category, we multiply the sample size [latex]n[/latex] by the probability associated with that category claimed in the null hypothesis.

Number of Absences per Term	Observed Frequency	Expected Frequency
0-2	120	0.5[latex]\times[/latex]300=150
3-5	100	0.3[latex]\times[/latex]300=90
6-8	55	0.12[latex]\times[/latex]300=36
9-11	15	0.06[latex]\times[/latex]300=18
12+	10	0.02[latex]\times[/latex]300=6

To calculate the [latex]\chi^2[/latex]-score, we work out the quantity [latex]\displaystyle{\frac{(\text{observed-expected})^2}{\text{expected}}}[/latex] for each category and then add up these quantities.

[latex]\begin{eqnarray*}\chi^2&=&\sum\frac{(\text{observed-expected})^2}{\text{expected}}\\&=&\frac{(120-150)^2}{150}+\frac{(100-90)^2}{90}+\frac{(55-36)^2}{36}+\frac{(15-18)^2}{18}+\frac{(10-6)^2}{6}\\&=&20.305\ldots\end{eqnarray*}[/latex]

The degrees of freedom for the [latex]\chi^2[/latex]-distribution is [latex]df=k-1=5-1=4[/latex]. The [latex]\chi^2[/latex]-goodness-of-fit test is a right-tailed test, so we use the chisq.dist.rt function to find the [latex]p-\text{value}[/latex]:

This is a chi square distribution. Along the horizontal axis the point chi square is labeled. The area in the right tail to the right of chi square is shaded and labeled with p-value.

Function	chisq.dist.rt
Field 1	20.305…
Field 2	4
Answer	0.0004

So the [latex]p-\text{value}=0.0004[/latex].

Conclusion:

Because [latex]p-\text{value}=0.0004\lt 0.05=\alpha[/latex], we reject the null hypothesis in favour of the alternative hypothesis. At the [latex]5\%[/latex] significance level, there is enough evidence to suggest that the number of absences per term does not follow the distribution assumed by faculty.

NOTES

The null hypothesis is the claim that the percent of students that fall into each category is as stated. That is, [latex]50\%[/latex] of students miss between 0 and 2 classes, [latex]30\%[/latex] of students miss between 3 and 5 classes, etc.
The alternative hypothesis is the claim that at least one of the percentages of students that fall into each category is not as stated. The alternative hypothesis does not say that every [latex]p_i[/latex] differs from its stated probability, only that at least one of them does.
Keep all of the decimals throughout the calculation (i.e. in the calculation of the [latex]\chi^2[/latex]-score) to avoid any round-off error in the calculation of the [latex]p-\text{value}[/latex]. This ensures that we get the most accurate value for the [latex]p-\text{value}[/latex]. Use Excel to calculate the expected frequencies and the [latex]\chi^2[/latex]-score.
The [latex]p-\text{value}[/latex] is the area in the right tail of the [latex]\chi^2[/latex]-distribution, to the right of [latex]\chi^2=20.305...[/latex]. In the calculation of the [latex]p-\text{value}[/latex]:
- The function is chisq.dist.rt because we are finding the area in the right tail of a [latex]\chi^2[/latex]-distribution.
- Field 1 is the value of [latex]\chi^2[/latex].
- Field 2 is the value of the degrees of freedom [latex]df[/latex].
The [latex]p-\text{value}[/latex] of [latex]0.0004[/latex] is a small probability compared to the significance level, and so is unlikely to happen assuming the null hypothesis is true. This suggests that the assumption that the null hypothesis is true is most likely incorrect, and so the conclusion of the test is to reject the null hypothesis in favour of the alternative hypothesis. In other words, student absenteeism does not fit faculty perception.
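
As a check on the absenteeism example, the calculation can be reproduced in a few lines of Python. This sketch relies on a fact not stated in this section: when [latex]df=4[/latex], the area to the right of [latex]x[/latex] under the [latex]\chi^2[/latex]-distribution equals [latex]e^{-x/2}(1+x/2)[/latex], so no spreadsheet function is needed for this particular case. The variable names are illustrative.

```python
import math

observed = [120, 100, 55, 15, 10]             # survey of 300 students
null_probs = [0.50, 0.30, 0.12, 0.06, 0.02]   # distribution assumed by the faculty

n = sum(observed)
expected = [n * p for p in null_probs]
chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Right-tail area, valid for df = 4 only: exp(-x/2) * (1 + x/2).
p_value = math.exp(-chi2 / 2) * (1 + chi2 / 2)

print(round(chi2, 3))     # 20.306 (rounded for display only)
print(round(p_value, 4))  # 0.0004
```

The rounding happens only in the `print` calls; the unrounded score is what goes into the tail-area formula, in line with the note about keeping all of the decimals.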

EXAMPLE

Employers want to know which days of the week employees have the highest number of absences in a five-day work week. Most employers would like to believe that employees are absent equally during the week. Suppose a random sample of [latex]60[/latex] managers are asked on which day of the week they had the highest number of employee absences. The results are recorded in the table below. At the [latex]5\%[/latex] significance level, test if the day of the week with the highest number of absences occurs with equal frequency during a five-day work week.

Day of the Week	Observed Frequency
Monday	15
Tuesday	11
Wednesday	10
Thursday	9
Friday	15

Solution

Let [latex]p_1[/latex] be the probability the highest number of absences occurs on Monday, [latex]p_2[/latex] be the probability the highest number of absences occurs on Tuesday, [latex]p_3[/latex] be the probability the highest number of absences occurs on Wednesday, [latex]p_4[/latex] be the probability the highest number of absences occurs on Thursday, and [latex]p_5[/latex] be the probability the highest number of absences occurs on Friday.

If the day of the week with the highest number of absences occurs with equal frequency, then the probability that any day has the highest number of absences is the same as any other day. Because there are [latex]5[/latex] days (categories), if the frequencies are equal, then each day would have a probability of [latex]20\%[/latex] [latex]\left(\text{or }\frac{1}{5}\right)[/latex].

Hypotheses:

[latex]\begin{eqnarray*}H_0:&&p_1=p_2=p_3=p_4=p_5=20\%\\H_a:&&\text{at least one of the }p_i\neq 20\%\end{eqnarray*}[/latex]

[latex]p-\text{value}[/latex]:

From the question, we have [latex]n=60[/latex] and [latex]k=5[/latex]. Now we need to calculate the [latex]\chi^2[/latex]-score for the test.

The observed frequency for each category is the number of observations in the sample that fall into that category. This is the information provided in the sample above.

Next, we must calculate the expected frequencies. The expected frequency is the number of observations we would expect to see in the sample, assuming the null hypothesis is true. To calculate the expected frequency for each category, we multiply the sample size [latex]n[/latex] by the probability associated with that category claimed in the null hypothesis.

Day of the Week	Observed Frequency	Expected Frequency
Monday	15	0.2[latex]\times[/latex]60=12
Tuesday	11	0.2[latex]\times[/latex]60=12
Wednesday	10	0.2[latex]\times[/latex]60=12
Thursday	9	0.2[latex]\times[/latex]60=12
Friday	15	0.2[latex]\times[/latex]60=12

To calculate the [latex]\chi^2[/latex]-score, we work out the quantity [latex]\displaystyle{\frac{(\text{observed-expected})^2}{\text{expected}}}[/latex] for each category and then add up these quantities.

[latex]\begin{eqnarray*}\chi^2&=&\sum\frac{(\text{observed-expected})^2}{\text{expected}}\\&=&\frac{(15-12)^2}{12}+\frac{(11-12)^2}{12}+\frac{(10-12)^2}{12}+\frac{(9-12)^2}{12}+\frac{(15-12)^2}{12}\\&=&2.666\ldots\end{eqnarray*}[/latex]

The degrees of freedom for the [latex]\chi^2[/latex]-distribution is [latex]df=k-1=5-1=4[/latex]. The [latex]\chi^2[/latex]-goodness-of-fit test is a right-tailed test, so we use the chisq.dist.rt function to find the [latex]p-\text{value}[/latex]:

Function	chisq.dist.rt
Field 1	2.666…
Field 2	4
Answer	0.6151

So the [latex]p-\text{value}=0.6151[/latex].

Conclusion:

Because [latex]p-\text{value}=0.6151\gt 0.05=\alpha[/latex], we do not reject the null hypothesis. At the [latex]5\%[/latex] significance level, there is not enough evidence to suggest that the day of the week with the highest number of absences occurs with unequal frequency during a five-day work week.

NOTES

The null hypothesis is the claim that the probability each day of the week has the highest number of absences is [latex]20\%[/latex].
The alternative hypothesis is the claim that at least one of the probabilities is not [latex]20\%[/latex]. The alternative hypothesis does not say that every [latex]p_i[/latex] does not equal [latex]20\%[/latex], only that at least one of them does not equal [latex]20\%[/latex].
Keep all of the decimals throughout the calculation (i.e. in the calculation of the [latex]\chi^2[/latex]-score) to avoid any round-off error in the calculation of the [latex]p-\text{value}[/latex]. This ensures that we get the most accurate value for the [latex]p-\text{value}[/latex].
The [latex]p-\text{value}[/latex] of [latex]0.6151[/latex] is a large probability compared to the significance level, and so is likely to happen assuming the null hypothesis is true. This suggests that the sample data are consistent with the null hypothesis, and so the conclusion of the test is to not reject the null hypothesis.
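
To see what a large [latex]p-\text{value}[/latex] means here, the test can be imitated by simulation. The sketch below is an illustration only and is not the method used in this section: it draws many samples of 60 managers assuming each day has probability [latex]20\%[/latex], and counts how often the simulated discrepancy is at least as large as the observed one. The seed, the number of simulations, and all names are arbitrary assumptions.

```python
import random
from collections import Counter

observed = [15, 11, 10, 9, 15]          # Monday through Friday
n, k = sum(observed), len(observed)
expected = n / k                        # 12 per day when every day has probability 20%

# With equal expected counts, comparing chi-square scores is the same as
# comparing sums of squared differences, which keeps the arithmetic exact.
observed_ss = sum((o - expected) ** 2 for o in observed)

rng = random.Random(1)                  # fixed seed so the run is repeatable
simulations = 20_000
at_least_as_extreme = 0
for _ in range(simulations):
    # One simulated sample: each of the n managers names a day at random.
    counts = Counter(rng.choices(range(k), k=n))
    ss = sum((counts[day] - expected) ** 2 for day in range(k))
    if ss >= observed_ss:
        at_least_as_extreme += 1

estimate = at_least_as_extreme / simulations
print(estimate)  # a proportion in the neighbourhood of the p-value 0.6151
```

The simulated proportion will not match [latex]0.6151[/latex] exactly, because the [latex]\chi^2[/latex]-distribution is itself an approximation and the simulation is random, but it shows that discrepancies this large are common when the null hypothesis is true.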

TRY IT

Teachers want to know which night each week their students are doing most of their homework. Most teachers think that students do an equal amount of homework each night. Suppose a random sample of [latex]49[/latex] students are asked on which night of the week they did the most homework. The results are shown in the table below. At the [latex]5\%[/latex] significance level, are the nights that students do most of their homework equally distributed?

Day of Week	Number of Students
Sunday	11
Monday	8
Tuesday	10
Wednesday	7
Thursday	10
Friday	5
Saturday	5

Click to see Solution

Let [latex]p_1[/latex] be the probability students do their homework on Sunday, [latex]p_2[/latex] be the probability students do their homework on Monday, [latex]p_3[/latex] be the probability students do their homework on Tuesday, [latex]p_4[/latex] be the probability students do their homework on Wednesday, [latex]p_5[/latex] be the probability students do their homework on Thursday, [latex]p_6[/latex] be the probability students do their homework on Friday, and [latex]p_7[/latex] be the probability students do their homework on Saturday.

Hypotheses:

[latex]\begin{eqnarray*}H_0:&&p_1=p_2=p_3=p_4=p_5=p_6=p_7=\frac{1}{7}\\H_a:&&\text{at least one of the }p_i\neq\frac{1}{7}\end{eqnarray*}[/latex]

[latex]p-\text{value}[/latex]:

From the question, we have [latex]n=49[/latex] and [latex]k=7[/latex].

Day of the Week	Observed Frequency	Expected Frequency
Sunday	11	1/7[latex]\times[/latex]49=7
Monday	8	1/7[latex]\times[/latex]49=7
Tuesday	10	1/7[latex]\times[/latex]49=7
Wednesday	7	1/7[latex]\times[/latex]49=7
Thursday	10	1/7[latex]\times[/latex]49=7
Friday	5	1/7[latex]\times[/latex]49=7
Saturday	5	1/7[latex]\times[/latex]49=7

[latex]\begin{eqnarray*}\chi^2&=&\sum\frac{(\text{observed-expected})^2}{\text{expected}}\\&=&\frac{(11-7)^2}{7}+\frac{(8-7)^2}{7}+\frac{(10-7)^2}{7}+\frac{(7-7)^2}{7}\\&&+\frac{(10-7)^2}{7}+\frac{(5-7)^2}{7}+\frac{(5-7)^2}{7}\\&=&6.142\ldots\end{eqnarray*}[/latex]

The degrees of freedom for the [latex]\chi^2[/latex]-distribution is [latex]df=k-1=7-1=6[/latex].

Function	chisq.dist.rt
Field 1	6.142…
Field 2	6
Answer	0.4074

So the [latex]p-\text{value}=0.4074[/latex].

Conclusion:

Because [latex]p-\text{value}=0.4074\gt 0.05=\alpha[/latex], we do not reject the null hypothesis. At the [latex]5\%[/latex] significance level, there is not enough evidence to suggest that the nights students do most of their homework are not equally distributed.

TRY IT

One study indicates that the number of televisions that American families have is distributed as shown in this table:

Number of Televisions	Percent
0	[latex]10\%[/latex]
1	[latex]16\%[/latex]
2	[latex]55\%[/latex]
3	[latex]11\%[/latex]
4 or more	[latex]8\%[/latex]

A researcher wants to determine if the number of televisions owned by families in the far western part of the U.S. has the same distribution as in the above study. A random sample of [latex]600[/latex] families in the far western U.S. is taken, and the results are recorded in the following table:

Number of Televisions	Observed Frequency
0	66
1	119
2	340
3	60
4 or more	15

At the [latex]1\%[/latex] significance level, does it appear that the distribution of the number of televisions for families in the far western U.S. is different from the distribution for the American population as a whole?

Click to see Solution

Let [latex]p_1[/latex] be the probability a family owns 0 televisions, [latex]p_2[/latex] be the probability a family owns 1 television, [latex]p_3[/latex] be the probability a family owns 2 televisions, [latex]p_4[/latex] be the probability a family owns 3 televisions, and [latex]p_5[/latex] be the probability a family owns 4 or more televisions.

Hypotheses:

[latex]\begin{eqnarray*}H_0:&&p_1=10\%,p_2=16\%,p_3=55\%,p_4=11\%,p_5=8\%\\H_a:&&\text{at least one of the }p_i's\text{ does not equal its stated probability}\end{eqnarray*}[/latex]

[latex]p-\text{value}[/latex]:

From the question, we have [latex]n=600[/latex] and [latex]k=5[/latex].

Number of Televisions	Observed Frequency	Expected Frequency
0	66	0.1[latex]\times[/latex]600=60
1	119	0.16[latex]\times[/latex]600=96
2	340	0.55[latex]\times[/latex]600=330
3	60	0.11[latex]\times[/latex]600=66
4 or more	15	0.08[latex]\times[/latex]600=48

[latex]\begin{eqnarray*}\chi^2&=&\sum\frac{(\text{observed-expected})^2}{\text{expected}}\\&=&\frac{(66-60)^2}{60}+\frac{(119-96)^2}{96}+\frac{(340-330)^2}{330}+\frac{(60-66)^2}{66}+\frac{(15-48)^2}{48}\\&=&29.646\ldots\end{eqnarray*}[/latex]

The degrees of freedom for the [latex]\chi^2[/latex]-distribution is [latex]df=k-1=5-1=4[/latex].

Function	chisq.dist.rt
Field 1	29.646…
Field 2	4
Answer	0.000006

So the [latex]p-\text{value}=0.000006[/latex].

Conclusion:

Because [latex]p-\text{value}=0.000006\lt 0.01=\alpha[/latex], we reject the null hypothesis in favour of the alternative hypothesis. At the [latex]1\%[/latex] significance level, there is enough evidence to suggest that the distribution of the number of televisions for families in the far western U.S. is different from the distribution for the American population as a whole.

TRY IT

The expected percentage of the number of pets students in the United States have in their homes is distributed as follows:

Number of Pets	Percent
0	[latex]18\%[/latex]
1	[latex]25\%[/latex]
2	[latex]30\%[/latex]
3	[latex]18\%[/latex]
4 or more	[latex]9\%[/latex]

A researcher wants to find out if the distribution of the number of pets students in Canada have is the same as the distribution shown in the U.S. A random sample of [latex]1,000[/latex] students from Canada is taken, and the results are shown in the table below:

Number of Pets	Observed Frequency
0	210
1	240
2	320
3	140
4 or more	90

At the [latex]1\%[/latex] significance level, is the distribution of the number of pets students in Canada have different from the distribution for the United States?

Click to see Solution

Let [latex]p_1[/latex] be the probability a student owns 0 pets, [latex]p_2[/latex] be the probability a student owns 1 pet, [latex]p_3[/latex] be the probability a student owns 2 pets, [latex]p_4[/latex] be the probability a student owns 3 pets, and [latex]p_5[/latex] be the probability a student owns 4 or more pets.

Hypotheses:

[latex]\begin{eqnarray*}H_0:&&p_1=18\%,p_2=25\%,p_3=30\%,p_4=18\%,p_5=9\%\\H_a:&&\text{at least one of the }p_i's\text{ does not equal its stated probability}\end{eqnarray*}[/latex]

[latex]p-\text{value}[/latex]:

From the question, we have [latex]n=1000[/latex] and [latex]k=5[/latex].

Number of Pets	Observed Frequency	Expected Frequency
0	210	0.18[latex]\times[/latex]1000=180
1	240	0.25[latex]\times[/latex]1000=250
2	320	0.30[latex]\times[/latex]1000=300
3	140	0.18[latex]\times[/latex]1000=180
4 or more	90	0.09[latex]\times[/latex]1000=90

[latex]\begin{eqnarray*}\chi^2&=&\sum\frac{(\text{observed-expected})^2}{\text{expected}}\\&=&\frac{(210-180)^2}{180}+\frac{(240-250)^2}{250}+\frac{(320-300)^2}{300}+\frac{(140-180)^2}{180}+\frac{(90-90)^2}{90}\\&=&15.622\ldots\end{eqnarray*}[/latex]

The degrees of freedom for the [latex]\chi^2[/latex]-distribution is [latex]df=k-1=5-1=4[/latex].

Function	chisq.dist.rt
Field 1	15.622…
Field 2	4
Answer	0.0036

So the [latex]p-\text{value}=0.0036[/latex].

Conclusion:

Because [latex]p-\text{value}=0.0036\lt 0.01=\alpha[/latex], we reject the null hypothesis in favour of the alternative hypothesis. At the [latex]1\%[/latex] significance level, there is enough evidence to suggest that the distribution of the number of pets students in Canada have is different from the distribution for the United States.

Video: “Pearson’s chi square test (goodness of fit) | Probability and Statistics | Khan Academy” by Khan Academy [11:48] is licensed under the Standard YouTube License. Transcript and closed captions available on YouTube.

Exercises

A teacher predicts that the distribution of grades on the final exam will be as follows:

Grade	Percent
A	[latex]25\%[/latex]
B	[latex]30\%[/latex]
C	[latex]35\%[/latex]
D	[latex]10\%[/latex]

In a class of [latex]20[/latex] students, the frequency of the grades on the final exam is given below:

Grade	Frequency
A	7
B	7
C	5
D	1

At the [latex]5\%[/latex] significance level, do the actual grades match the teacher’s assumed distribution?
Click to see Answer
- Hypotheses: [latex]\begin{eqnarray*}H_0:&&p_1=25\%,p_2=30\%,p_3=35\%,p_4=10\%\\H_a:&&\text{at least one of the }p_i's\text{ does not equal its stated probability}\end{eqnarray*}[/latex]
- [latex]p-\text{value}=0.5645[/latex]
- Conclusion: At the [latex]5\%[/latex] significance level, there is not enough evidence to conclude that the distribution of grades on the final exam differs from the teacher’s stated distribution.
A six-sided die is rolled [latex]120[/latex] times, and the results are recorded in the table below. At the [latex]5\%[/latex] significance level, determine if the die is fair. (Hint: in a fair die, each of the faces is equally likely to occur.)

Face Value	Frequency
1	15
2	29
3	16
4	15
5	30
6	15
Click to see Answer
- Hypotheses: [latex]\begin{eqnarray*}H_0:&&p_1=p_2=p_3=p_4=p_5=p_6=16.67\%\\H_a:&&\text{at least one of the }p_i's\text{ does not equal }16.67\%\end{eqnarray*}[/latex]
- [latex]p-\text{value}=0.0184[/latex]
- Conclusion: At the [latex]5\%[/latex] significance level, there is enough evidence to conclude that the distribution of the die rolls does not follow the assumed distribution. The die is not fair.
The distribution of the marital status for the male population of a certain country, ages 15 and older, is as shown in the table below.

Marital Status	Percent
never married	[latex]31.3\%[/latex]
married	[latex]56.1\%[/latex]
widowed	[latex]2.5\%[/latex]
divorced/separated	[latex]10.1\%[/latex]

Suppose that a random sample of [latex]400[/latex] young adult males from the country, ages 18 to 24 years old, yield the following frequency distribution.

Marital Status	Frequency
never married	140
married	238
widowed	2
divorced/separated	20

At the [latex]1\%[/latex] significance level, test if this young adult male age group fits the distribution of the adult male population of the country.
Click to see Answer
- Hypotheses: [latex]\begin{eqnarray*}H_0:&&p_1=31.3\%,p_2=56.1\%,p_3=2.5\%,p_4=10.1\%\\H_a:&&\text{at least one of the }p_i's\text{ does not equal its stated probability}\end{eqnarray*}[/latex]
- [latex]p-\text{value}=0.0002[/latex]
- Conclusion: At the [latex]1\%[/latex] significance level, there is enough evidence to conclude that the distribution of the marital status of the young adult male age group is different than the distribution of the adult male population.

The columns in the table below contain the Race/Ethnicity of the high schools in a certain country for a recent year and the percentages of the Overall Student Population. A local school district wants to determine if its high schools follow the same distribution for the ethnicity of its students. The school district takes a sample of [latex]1,000[/latex] high school students in the district, and the right column in the table contains the breakdown of the ethnicity of the students in the sample.

Race/Ethnicity	Overall Student Population	Survey Frequency
Asian, Asian American, or Pacific Islander	[latex]5.4\%[/latex]	82
Black	[latex]14.5\%[/latex]	135
Hispanic or Latino	[latex]15.9\%[/latex]	136
Indigenous	[latex]1.2\%[/latex]	10
White	[latex]61.6\%[/latex]	604
Not reported/other	[latex]1.4\%[/latex]	33

At the [latex]5\%[/latex] significance level, determine if the distribution of ethnicity at the local school district follows the overall student population.

Click to see Answer

Hypotheses: [latex]\begin{eqnarray*}H_0:&&p_1=5.4\%,p_2=14.5\%,p_3=15.9\%,p_4=1.2\%,p_5=61.6\%,p_6=1.4\%\\H_a:&&\text{at least one of the }p_i's\text{ does not equal its stated probability}\end{eqnarray*}[/latex]
[latex]p-\text{value}\approx 0.00000002[/latex]
Conclusion: At the [latex]5\%[/latex] significance level, there is enough evidence to conclude that the distribution of the ethnicity of high schools in the local district is different than the distribution of the overall high school student population.

The table below shows the expected distribution of majors of all male university students across the country. A local university wants to know if the distribution of majors for its male students follows the same distribution. A sample of [latex]5,000[/latex] male students at the university is taken, and their majors are recorded. The data from the sample is shown in the right column in the table.

Major	Expected Major (Percent)	Actual Major (Frequency)
Arts & Humanities	[latex]12.0\%[/latex]	630
Biological Sciences	[latex]6.7\%[/latex]	320
Business	[latex]22.7\%[/latex]	1100
Education	[latex]5.8\%[/latex]	315
Engineering	[latex]15.6\%[/latex]	800
Physical Sciences	[latex]3.6\%[/latex]	175
Professional	[latex]9.3\%[/latex]	450
Social Sciences	[latex]7.6\%[/latex]	370
Technical	[latex]1.8\%[/latex]	90
Other	[latex]8.2\%[/latex]	400
Undecided	[latex]6.7\%[/latex]	350

At the [latex]5\%[/latex] significance level, determine if the distribution of the majors of male students at the local university fits the distribution of majors for all male university students.

Click to see Answer

Hypotheses: [latex]\begin{align*}&H_0:p_1=12\%,p_2=6.7\%,p_3=22.7\%,p_4=5.8\%,p_5=15.6\%,p_6=3.6\%,p_7=9.3\%,p_8=7.6\%,p_9=1.8\%,p_{10}=8.2\%,p_{11}=6.7\%\\&H_a:\text{at least one of the }p_i's\text{ does not equal its stated probability}\end{align*}[/latex]
[latex]p-\text{value}=0.6561[/latex]
Conclusion: At the [latex]5\%[/latex] significance level, there is not enough evidence to conclude that the distribution of majors for male students at the local university differs from the distribution of majors for all male university students across the country.

The table below shows the expected distribution of majors of all female university students across the country. A local university wants to know if the distribution of majors for its female students follows the same distribution. A sample of [latex]5,000[/latex] female students at the university is taken, and their majors are recorded. The data from the sample is shown in the right column in the table.

Major	Expected Major (Percent)	Actual Major (Frequency)
Arts & Humanities	[latex]14.0\%[/latex]	670
Biological Sciences	[latex]8.4\%[/latex]	410
Business	[latex]13.1\%[/latex]	685
Education	[latex]13.0\%[/latex]	650
Engineering	[latex]2.6\%[/latex]	145
Physical Sciences	[latex]2.6\%[/latex]	125
Professional	[latex]18.9\%[/latex]	975
Social Sciences	[latex]13.0\%[/latex]	605
Technical	[latex]0.4\%[/latex]	15
Other	[latex]5.8\%[/latex]	300
Undecided	[latex]8.2\%[/latex]	420

At the [latex]5\%[/latex] significance level, determine if the distribution of majors of female students at the local university fits the distribution of majors for all female university students.

Click to see Answer

Hypotheses:[latex]\begin{eqnarray*}H_0:&&p_1=14\%,p_2=8.4\%,p_3=13.1\%,p_4=13\%,p_5=2.6\%,p_6=2.6\%,p_7=18.9\%,p_8=13\%,p_9=0.4\%,p_{10}=5.8\%,p_{11}=8.2\%\\H_a:&&\text{at least one of the }p_i's\text{ does not equal its stated probability}\end{eqnarray*}[/latex]
[latex]p-\text{value}=0.3791[/latex]
Conclusion: At the [latex]5\%[/latex] significance level, there is not enough evidence to conclude that the distribution of majors for female students at the local university differs from the distribution of majors for all female university students across the country.

A local police department wants to know if the percentage of traffic accidents is the same for each day of the week. The department takes a sample of [latex]500[/latex] traffic accidents and records the day of the week on which they occurred.

Day of the Week	Frequency
Sunday	75
Monday	60
Tuesday	74
Wednesday	57
Thursday	65
Friday	79
Saturday	90

At the [latex]5\%[/latex] significance level, determine if the proportion of traffic accidents is the same for each day of the week.
Click to see Answer
- Hypotheses:[latex]\begin{eqnarray*}H_0:&&p_1=p_2=p_3=p_4=p_5=p_6=p_7=14.29\%\\H_a:&&\text{at least one of the }p_i's\text{ does not equal }14.29\%\end{eqnarray*}[/latex]
- [latex]p-\text{value}=0.0817[/latex]
- Conclusion: At the [latex]5\%[/latex] significance level, there is not enough evidence to conclude that the proportion of traffic accidents differs by day of the week.
A local retailer provides a variety of payment options for its customers: cash, cheque, credit, and debit. The retailer believes that the current distribution of the payment options is as follows: [latex]15\%[/latex] cash, [latex]5\%[/latex] cheque, [latex]50\%[/latex] credit, and [latex]30\%[/latex] debit. The retailer takes a sample of [latex]450[/latex] transactions and records the payment method.

Payment Method	Frequency
Cash	78
Cheque	12
Credit	250
Debit	110

At the [latex]1\%[/latex] significance level, determine if the retailer’s claimed distribution of the payment methods is correct.
Click to see Answer
- Hypotheses:[latex]\begin{eqnarray*}H_0:&&p_1=15\%,p_2=5\%,p_3=50\%,p_4=30\%\\H_a:&&\text{at least one of the }p_i's\text{ does not equal its stated probability}\end{eqnarray*}[/latex]
- [latex]p-\text{value}=0.003[/latex]
- Conclusion: At the [latex]1\%[/latex] significance level, there is enough evidence to conclude that the distribution of the payment methods is different than the retailer’s claim.
A local restaurant owner has five locations across the city. The owner wants to know if the percentage of customers is the same at each location. The owner takes a sample of [latex]750[/latex] customers and records which restaurant location they visited.

Location	Frequency
1	126
2	179
3	141
4	131
5	173

At the [latex]1\%[/latex] significance level, determine if the proportion of customers is the same for each restaurant location.
Click to see Answer
- Hypotheses:[latex]\begin{eqnarray*}H_0:&&p_1=p_2=p_3=p_4=p_5=20\%\\H_a:&&\text{at least one of the }p_i's\text{ does not equal }20\%\end{eqnarray*}[/latex]
- [latex]p-\text{value}=0.0031[/latex]
- Conclusion: At the [latex]1\%[/latex] significance level, there is enough evidence to conclude that the proportion of customers is not the same for each location.

“10.4 The Goodness-of-Fit Test” and “10.6 Exercises” from Introduction to Statistics by Valerie Watts is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License, except where otherwise noted.

License

Introduction to Statistics - Second Edition Copyright © 2025 by Valerie Watts is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License, except where otherwise noted.
