6.6 Hypothesis Tests In-Depth
Establishing the parameter of interest, the type of distribution to use, the test statistic, and the p-value tells you how to carry out a hypothesis test. However, there are several other factors you should consider when interpreting the results.
Rare Events
Suppose you make an assumption about a property of the population (this assumption is the null hypothesis). Then you gather sample data randomly. If the sample has properties that would be very unlikely to occur if the assumption is true, then you would conclude that your assumption about the population is probably incorrect. (Remember that your assumption is just an assumption—it is not a fact and it may or may not be true. But your sample data are real and the data are showing you a fact that seems to contradict your assumption.)
For example, Didi and Ali are at a birthday party of a very wealthy friend. They hurry to be first in line to grab a prize from a tall basket that they cannot see inside because they will be blindfolded. There are 200 plastic bubbles in the basket, and Didi and Ali have been told that only one of them contains a $100 bill. Didi is the first person to reach into the basket and pull out a bubble. Her bubble contains a $100 bill. The probability of this happening is 1/200 = 0.005. Because this is so unlikely, Ali is hoping that what the two of them were told is wrong and there are more $100 bills in the basket. A “rare event” has occurred (Didi getting the $100 bill), so Ali doubts the assumption that only one $100 bill is in the basket.
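As a quick sanity check, here is a minimal Python sketch (not part of the original example) that simulates many blindfolded first draws from 200 bubbles and lands on the same probability:

```python
import random

# One of 200 bubbles (labeled 0) holds the $100 bill, as the partygoers were told.
trials = 100_000
wins = sum(random.randrange(200) == 0 for _ in range(trials))
print(wins / trials)  # close to 1/200 = 0.005
```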
Errors in Hypothesis Tests
When you perform a hypothesis test, there are four possible outcomes depending on the actual truth (or falseness) of the null hypothesis H0 and the decision to reject or not. The outcomes are summarized in the following table:
| ACTION | H0 IS ACTUALLY TRUE | H0 IS ACTUALLY FALSE |
|---|---|---|
| Do not reject H0 | Correct outcome | Type II error |
| Reject H0 | Type I error | Correct outcome |
The four possible outcomes in the table are:
- The decision is not to reject H0 when H0 is true (correct decision).
- The decision is to reject H0 when H0 is true (incorrect decision known as a Type I error).
- The decision is not to reject H0 when, in fact, H0 is false (incorrect decision known as a Type II error).
- The decision is to reject H0 when H0 is false (correct decision whose probability is called the power of the test).
Each of the errors occurs with a particular probability. The Greek letters α and β represent the probabilities.
α = probability of a Type I error = P(Type I error) = probability of rejecting the null hypothesis when the null hypothesis is true.
β = probability of a Type II error = P(Type II error) = probability of not rejecting the null hypothesis when the null hypothesis is false.
The power of a test is 1 – β.
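To make α, β, and power concrete, here is a minimal simulation sketch in Python. The setup is an illustrative assumption, not from the text: a one-sided z-test of H0: μ = 0 at α = 0.05, with known σ = 1, n = 30, and a true mean of 0.5 under the alternative. The rejection rate when H0 is true estimates α; the rejection rate when H0 is false estimates the power, 1 – β:

```python
import random
import statistics

def z_test_rejects(sample, mu0=0.0, sigma=1.0, z_crit=1.645):
    """One-sided z-test: reject H0: mu = mu0 when z > z_crit (alpha = 0.05)."""
    n = len(sample)
    z = (statistics.mean(sample) - mu0) / (sigma / n ** 0.5)
    return z > z_crit

def rejection_rate(true_mu, n=30, trials=10_000):
    """Fraction of simulated samples in which the test rejects H0."""
    rejections = 0
    for _ in range(trials):
        sample = [random.gauss(true_mu, 1.0) for _ in range(n)]
        if z_test_rejects(sample):
            rejections += 1
    return rejections / trials

# H0 is true (mu = 0): the rejection rate estimates alpha, about 0.05.
print("alpha ~", rejection_rate(true_mu=0.0))
# H0 is false (mu = 0.5): the rejection rate estimates the power, 1 - beta.
power = rejection_rate(true_mu=0.5)
print("power ~", power, " beta ~", 1 - power)
```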
Ideally, α and β should be as small as possible because they are probabilities of errors, but they are rarely zero. Equivalently, we want the power of the test to be as close to one as possible. In practice, α is fixed in advance by the chosen significance level; for a fixed α, increasing the sample size reduces β and therefore increases the power of the test.
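Reusing the rejection_rate helper from the sketch above, holding α at 0.05 while increasing the sample size shrinks β (raises the power) for the same true difference:

```python
# Same one-sided z-test, same true mean of 0.5; only the sample size changes.
for n in (10, 30, 100):
    print(f"n = {n:3d}: power ~ {rejection_rate(true_mu=0.5, n=n):.2f}")
```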
Example
Suppose the null hypothesis, H0, is: Frank’s rock climbing equipment is safe.
Type I error: Frank thinks that his rock climbing equipment may not be safe when, in fact, it really is safe.

Type II error: Frank thinks that his rock climbing equipment may be safe when, in fact, it is not safe.

α = probability that Frank thinks his rock climbing equipment may not be safe when, in fact, it really is safe.

β = probability that Frank thinks his rock climbing equipment may be safe when, in fact, it is not safe.
Notice that, in this case, the error with the greater consequence is the Type II error. (If Frank thinks his rock climbing equipment is safe, he will go ahead and use it.)
Your turn!
Suppose the null hypothesis, H0, is: the blood cultures contain no traces of pathogen X. State the Type I and Type II errors.
Statistical Significance Versus Practical Significance
When the sample size becomes larger, point estimates become more precise, and any real difference between the mean and the null value becomes easier to detect. Even a very small difference will likely be detected if we take a large enough sample. Sometimes researchers take such large samples that even the slightest difference is detected, including differences with no practical value. In such cases, we still say the difference is statistically significant, but it is not practically significant.
For example, an online experiment might identify that placing additional ads on a movie review website statistically significantly increases viewership of a TV show by 0.001%, but this increase might not have any practical value.
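A small sketch makes this concrete. The 0.2-percentage-point lift, the baseline rate of 0.50, and the sample sizes are invented for illustration, and the sample proportion is assumed to land exactly on the true rate; only at an enormous n does the negligible difference become statistically significant:

```python
from math import erf, sqrt

def two_sided_p_value(p_hat, p0, n):
    """Normal-approximation p-value for H0: p = p0 given sample proportion p_hat."""
    z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))

p0, p_observed = 0.500, 0.502  # a 0.2-point lift with no practical value
for n in (1_000, 100_000, 10_000_000):
    print(f"n = {n:>10,}: p-value = {two_sided_p_value(p_observed, p0, n):.4f}")
```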
One role of a data scientist in conducting a study is often to plan the size of the study. The data scientist might first consult experts or the scientific literature to learn the smallest meaningful difference from the null value. She would also obtain other information, such as a very rough estimate of the true proportion p, so that she could approximate the standard error. From there, she can suggest a sample size large enough that, if a real and meaningful difference exists, the test can detect it. Larger sample sizes may still be used, but these calculations are especially helpful when weighing costs or potential risks, such as possible health impacts on volunteers in a medical study.
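The sketch below illustrates one such back-of-the-envelope calculation, using the common normal-approximation formula for a proportion. The defaults of α = 0.05 (two-sided) with roughly 80% power, and the example numbers, are illustrative assumptions, not prescriptions from the text:

```python
from math import ceil

def rough_n_for_proportion(p_rough, delta, z_alpha=1.96, z_power=0.84):
    """Back-of-the-envelope n to detect a difference `delta` from the null
    proportion with alpha = 0.05 (two-sided) and about 80% power, using a
    rough guess p_rough to approximate the standard error."""
    return ceil((z_alpha + z_power) ** 2 * p_rough * (1 - p_rough) / delta ** 2)

# Example: smallest meaningful difference of 3 percentage points near p ~ 0.30.
print(rough_n_for_proportion(p_rough=0.30, delta=0.03))  # about 1830
```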
Key Terms

- Type I error: the decision to reject the null hypothesis when, in fact, the null hypothesis is true.
- Error (in hypothesis testing): erroneously rejecting a true null hypothesis, or erroneously failing to reject a false null hypothesis.
- β (beta): the probability of failing to reject a false null hypothesis.
- Statistically significant: finding sufficient evidence that the effect we see is not just due to variability, often from rejecting the null hypothesis.