2.6 Measures of Dispersion

LEARNING OBJECTIVES

  • Recognize, describe, calculate, and analyze the measures of the spread of data: variance, standard deviation, and range.

It can be misleading to only use the measures of central tendency (mean, median, mode) to describe a data set.  Measures of central tendency describe the center of a distribution.  Measures of dispersion or variability are used to describe the spread or dispersion of the data.  So far in this chapter, we have already seen a measure of dispersion—the interquartile range.  The interquartile range describes the spread of the middle 50% of the data.  But there are other measures of dispersion, including range, variance, and standard deviation.

Range

The range is the difference between the largest and smallest value in a set of data:

[latex]\displaystyle{\mbox{Range}=\mbox{Maximum Value}-\mbox{Minimum Value}}[/latex]

Range is not a very good measure of variability because it is based on only two values in the data set (the largest and smallest values) and is highly influenced by outliers.  Also, the range does not help us distinguish between two data sets with the same largest and smallest values because the two data sets will have the same range.

EXAMPLE

AIDS data indicating the number of months a patient with AIDS lives after taking a new antibody drug are as follows:

3 4 8 8 10 11 12 13 14 15
15 16 16 17 17 18 21 22 22 24
24 25 26 26 27 27 29 29 31 32
33 33 34 34 35 37 40 44 44 47

Calculate the range.

Solution:

The largest value is 47 and the smallest value is 3, so

[latex]\displaystyle{\mbox{Range}=47-3=44}[/latex]
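The same calculation can be checked outside a spreadsheet.  The short sketch below uses Python, which this text does not otherwise use; the variable names months and data_range are made up for illustration.

```python
# Survival times in months, copied from the example above.
months = [
    3, 4, 8, 8, 10, 11, 12, 13, 14, 15,
    15, 16, 16, 17, 17, 18, 21, 22, 22, 24,
    24, 25, 26, 26, 27, 27, 29, 29, 31, 32,
    33, 33, 34, 34, 35, 37, 40, 44, 44, 47,
]

# Range = maximum value - minimum value.
data_range = max(months) - min(months)
print(data_range)  # 44
```

Only the two extreme values enter the calculation, which is why a single outlier can change the range dramatically.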

Variance and Standard Deviation

An important characteristic of any set of data is the variation in the data from the mean.  In some data sets, the data values are concentrated close to the mean, but in other data sets, the data values are more widely spread out from the mean.  The most common measure of variation, or spread, is the standard deviation. The standard deviation is a number that measures, on average, how far data values are from their mean.  The standard deviation provides a numerical measure of the overall amount of variation in a data set, and can be used to determine whether a particular data value is close to or far away from the mean.

The standard deviation is always positive or zero.  It is small when the data are all concentrated close to the mean because there is little variation or spread in the data, and it is larger when the data values are more spread out from the mean because there is a lot of variation in the data.  The lower case letter [latex]s[/latex] represents the sample standard deviation and the Greek letter [latex]\sigma[/latex] represents the population standard deviation.

Suppose that we are studying the amount of time customers wait in line at the checkout at supermarket A and supermarket B. The mean wait time at both supermarkets is five minutes.  At supermarket A the standard deviation for the wait time is two minutes and at supermarket B the standard deviation for the wait time is four minutes.  Because supermarket B has a higher standard deviation, we know that there is more variation in the wait times at supermarket B. Overall, wait times at supermarket B are more spread out from the mean and wait times at supermarket A are more concentrated near the mean.

The standard deviation can also be used to determine whether a data value is close to or far from the mean.  For example, suppose that Rosa and Binh both shop at supermarket A where the mean wait time at the checkout is five minutes and the standard deviation is two minutes.  Suppose Rosa’s wait time is seven minutes and Binh’s wait time is one minute:

  • Rosa’s wait time of seven minutes is two minutes longer than the mean of five minutes.  Because two minutes is equal to one standard deviation, Rosa’s wait time of seven minutes is one standard deviation above the mean of five minutes.
  • Binh’s wait time of one minute is four minutes less than the mean of five minutes.  Because four minutes is equal to two standard deviations, Binh’s wait time of one minute is two standard deviations below the mean of five minutes.

A data value that is two standard deviations from the mean is just on the borderline for what many statisticians would consider to be far from the mean.  Considering data to be far from the mean if it is more than two standard deviations away is more of an approximate “rule of thumb” than a rigid rule.  In general, the shape of the distribution of the data affects how much of the data is further away than two standard deviations.

Calculating the Standard Deviation

If [latex]x[/latex] is a number, then the difference “[latex]x[/latex] – mean” is called its deviation from the mean.  In a data set, there are as many deviations as there are items in the data set.  The deviations are used to calculate the standard deviation.  If the numbers belong to a population, in symbols a deviation is [latex]x-\mu[/latex].  For sample data, in symbols a deviation is [latex]\displaystyle{x-\overline{x}}[/latex].

The procedure to calculate the standard deviation depends on whether the numbers are the entire population or are data from a sample.  The calculations are similar, but not identical.  Therefore the symbol used to represent the standard deviation depends on whether it is calculated from a population or a sample.  The lower case letter [latex]s[/latex] represents the sample standard deviation and the Greek letter [latex]\sigma[/latex] represents the population standard deviation.  If the sample has the same characteristics as the population, then [latex]s[/latex] should be a good estimate of [latex]\sigma[/latex].

To calculate the standard deviation, we need to calculate the variance first.  The variance is the average of the squares of the deviations (the [latex]x-\overline{x}[/latex] values for a sample or the [latex]x-\mu[/latex] values for a population).  The symbol [latex]\sigma^2[/latex] represents the population variance and the population standard deviation [latex]\sigma[/latex] is the square root of the population variance.  The symbol [latex]s^2[/latex] represents the sample variance and the sample standard deviation [latex]s[/latex] is the square root of the sample variance.  The standard deviation can be thought of as a special average of the deviations.

To calculate a population standard deviation [latex]\sigma[/latex]:

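  1. Calculate the deviation from the mean, [latex]x-\mu[/latex], for each data value, square each deviation, and add up the squared deviations.
  2. Divide the sum in step 1 by the population size [latex]N[/latex].  The result is the population variance [latex]\sigma^2[/latex].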
  3. The population standard deviation is the square root of the value from step 2.

The formula for the population standard deviation is:  [latex]\displaystyle{\sigma=\sqrt{\frac{\sum(x-\mu)^2}{N}}}[/latex]

To calculate a sample standard deviation [latex]s[/latex]:

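  1. Calculate the deviation from the mean, [latex]x-\overline{x}[/latex], for each data value, square each deviation, and add up the squared deviations.
  2. Divide the sum in step 1 by [latex]n-1[/latex], one less than the sample size [latex]n[/latex].  The result is the sample variance [latex]s^2[/latex].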
  3. The sample standard deviation is the square root of the value from step 2.

The formula for the sample standard deviation is:  [latex]\displaystyle{s=\sqrt{\frac{\sum(x-\overline{x})^2}{n-1}}}[/latex]
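The two formulas translate almost term by term into code.  The following is a minimal sketch in Python, not part of the original text; the function names population_sd and sample_sd and the list toy are hypothetical, and the toy values are made up so that the arithmetic is easy to follow.

```python
import math

def population_sd(values):
    """Population standard deviation: divide the sum of squared deviations by N."""
    n = len(values)
    mu = sum(values) / n
    squared_deviations = [(x - mu) ** 2 for x in values]
    return math.sqrt(sum(squared_deviations) / n)

def sample_sd(values):
    """Sample standard deviation: divide the sum of squared deviations by n - 1."""
    n = len(values)
    x_bar = sum(values) / n
    squared_deviations = [(x - x_bar) ** 2 for x in values]
    return math.sqrt(sum(squared_deviations) / (n - 1))

# Made-up data with mean 5; the squared deviations add up to 32.
toy = [2, 4, 4, 4, 5, 5, 7, 9]

print(population_sd(toy))  # sqrt(32 / 8) = 2.0
print(sample_sd(toy))      # sqrt(32 / 7), slightly larger than 2
```

The only difference between the two functions is the divisor, so for the same data the sample standard deviation is always a little larger than the population standard deviation.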


Watch this video: How to calculate Standard Deviation and Variance by statisticsfun [5:04] (transcript available).


CALCULATING VARIANCE IN EXCEL

To find the variance in Excel:

  • If the data is population data, use the var.p(array) function where array is the array or cell range containing the data.  The output from the var.p function is the population variance.
  • If the data is sample data, use the var.s(array) function where array is the array or cell range containing the data.  The output from the var.s function is the sample variance.
    • Visit the Microsoft page for more information about the var.s function.

NOTE

There are two different functions to calculate variance in Excel because variance is calculated differently depending on whether the data is from a sample or from a population.  When calculating variance, make sure that you are using the correct function based on the type of data you are working with (sample or population).

CALCULATING STANDARD DEVIATION IN EXCEL

To find the standard deviation in Excel:

  • If the data is population data, use the stdev.p(array) function where array is the array or cell range containing the data.  The output from the stdev.p function is the population standard deviation.
    • Visit the Microsoft page for more information about the stdev.p function.
  • If the data is sample data, use the stdev.s(array) function where array is the array or cell range containing the data.  The output from the stdev.s function is the sample standard deviation.
    • Visit the Microsoft page for more information about the stdev.s function.

NOTE

There are two different functions to calculate standard deviation in Excel because standard deviation is calculated differently depending on whether the data is from a sample or from a population.  When calculating standard deviation, make sure that you are using the correct function based on the type of data you are working with (sample or population).
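The same sample-versus-population split shows up in other tools.  As an illustration only (the text itself uses Excel), Python's standard statistics module provides four functions that appear to play the same roles as var.p, var.s, stdev.p, and stdev.s.  The data values below are made up.

```python
import statistics

# Made-up values for illustration only; they do not come from the text.
values = [3, 5, 5, 7]

pop_var = statistics.pvariance(values)   # population variance, like var.p
samp_var = statistics.variance(values)   # sample variance, like var.s
pop_sd = statistics.pstdev(values)       # population standard deviation, like stdev.p
samp_sd = statistics.stdev(values)       # sample standard deviation, like stdev.s

print(pop_var, pop_sd)
print(samp_var, samp_sd)
```

For these four values the squared deviations add up to 8, so the population variance is 8/4 = 2 and the sample variance is 8/3.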


Watch this video: Range, Variance, Standard Deviation in Excel by Joshua Emmanuel [1:10] (transcript available).


EXAMPLE

In a fifth grade class, the teacher was interested in the standard deviation of the ages of her students. The following data are the ages, in years, for a sample of 20 fifth grade students. The ages are rounded to the nearest half year:

9 9.5 9.5 10 10 10 10 10.5 10.5 10.5
10.5 11 11 11 11 11 11 11.5 11.5 11.5

Calculate the mean, the variance, and the standard deviation of the ages of the students.  Interpret the standard deviation.

Solution:

Enter the data into an Excel spreadsheet.  For this example, suppose we entered the data in column A from cell A1 to A20.

For the mean:

Function | average | Answer
Field 1 | A1:A20 | 10.525 years

For the variance:

Function | var.s | Answer
Field 1 | A1:A20 | 0.5125

For the standard deviation:

Function | stdev.s | Answer
Field 1 | A1:A20 | 0.7159 years

Interpreting the standard deviation:

On average, the ages of the fifth graders in the sample are about 0.7159 years away from the mean age of 10.525 years.

NOTES

  1. We are using the var.s (not var.p) and stdev.s (not stdev.p) functions to calculate the variance and standard deviation because the data is from a sample.
  2. Standard deviation has the same units as the data.  In this case, the data is measured in years, so the standard deviation is also in years.
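  3. Variance is measured in the square of the data’s units (here, years squared), which is why the standard deviation, not the variance, is the value that is usually interpreted.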
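As a cross-check on the spreadsheet results, the same three values can be reproduced with Python's standard statistics module.  This is a sketch for comparison only; the variable names are hypothetical.

```python
import statistics

# Ages of the 20 sampled fifth graders, copied from the example above.
ages = [9, 9.5, 9.5, 10, 10, 10, 10, 10.5, 10.5, 10.5,
        10.5, 11, 11, 11, 11, 11, 11, 11.5, 11.5, 11.5]

mean_age = statistics.mean(ages)
var_age = statistics.variance(ages)  # sample variance: divides by n - 1
sd_age = statistics.stdev(ages)      # sample standard deviation

print(round(mean_age, 3))  # 10.525
print(round(var_age, 4))   # 0.5125
print(round(sd_age, 4))    # 0.7159
```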

TRY IT

On a baseball team, the ages, in years, of each of the players are as follows:

21 21 22 23 24
24 25 25 28 29
28 31 32 33 33
34 35 36 36 36
36 38 38 38 40

Find the mean and standard deviation.

 

Solution:

 

[latex]\begin{eqnarray*}\mu&=&30.64\mbox{ years}\\\\\sigma&=&5.99\mbox{ years}\end{eqnarray*}[/latex]

NOTE

We are using the stdev.p (not stdev.s) function to calculate the standard deviation because the data includes every player on the baseball team, so the team is a population.
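To see how much the choice of function matters here, the sketch below (Python, for illustration only; variable names are hypothetical) computes the population standard deviation for the team and, for contrast, the value that the sample formula would have given.

```python
import statistics

# Ages of all 25 players, copied in the order listed above.
team_ages = [21, 21, 22, 23, 24,
             24, 25, 25, 28, 29,
             28, 31, 32, 33, 33,
             34, 35, 36, 36, 36,
             36, 38, 38, 38, 40]

mu = statistics.mean(team_ages)
sigma = statistics.pstdev(team_ages)        # population formula: divides by N
s_if_sample = statistics.stdev(team_ages)   # sample formula: divides by n - 1

print(round(mu, 2), round(sigma, 2))  # 30.64 5.99
print(round(s_if_sample, 2))          # about 6.11, larger than sigma
```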

NOTE

Your concentration should be on what the standard deviation tells you about the data. The standard deviation is a number which measures how far the data are spread from the mean. Let a calculator or computer do the arithmetic.

The standard deviation, [latex]s[/latex] or [latex]\sigma[/latex], is either zero or a positive number.  When the standard deviation is zero, there is no dispersion; that is, all the data values are equal to each other.  The standard deviation is small when the data are all concentrated close to the mean and is larger when the data values show more variation from the mean.  When the standard deviation is a lot larger than zero, the data values are very spread out about the mean.  Outliers in the data can make [latex]s[/latex] or [latex]\sigma[/latex] very large.

The standard deviation, when first presented, can seem unclear.  By graphing your data, you can get a better “feel” for the deviations and the standard deviation.  You will find that in symmetrical distributions, the standard deviation can be very helpful but in skewed distributions the standard deviation may not be much help.  The reason is that the two sides of a skewed distribution have different spreads.  In a skewed distribution, it is better to look at the first quartile, the median, the third quartile, the smallest value, and the largest value.  Because numbers can be confusing, always graph your data.

EXAMPLE

Use the following sample of exam scores from Susan Dean’s spring pre-calculus class:

33 42 49 49 53 55 55 61
63 67 68 68 69 69 72 73
74 78 80 83 88 88 88 90
92 94 94 94 94 96 100

Calculate the following:

  • The mean.
  • The standard deviation.
  • The median.
  • The first quartile.
  • The third quartile.
  • [latex]IQR[/latex].

Solution:

Enter the data into an Excel spreadsheet.  For this example, suppose we entered the data in column A from cell A1 to A31.

For the mean:

Function | average | Answer
Field 1 | A1:A31 | 73.5

For the median:

Function | median | Answer
Field 1 | A1:A31 | 73

For the standard deviation:

Function | stdev.s | Answer
Field 1 | A1:A31 | 17.92

For the first quartile:

Function | quartile.exc | Answer
Field 1 | A1:A31 | 61
Field 2 | 1 |

For the third quartile:

Function | quartile.exc | Answer
Field 1 | A1:A31 | 90
Field 2 | 3 |

For the IQR:  [latex]\displaystyle{IQR=90-61=29}[/latex]
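The whole solution can also be reproduced in a few lines of Python, shown here only as a cross-check on the spreadsheet.  The sketch assumes Python 3.8 or later, where statistics.quantiles is available; its "exclusive" method places the quartiles at the same positions as quartile.exc for this data set.  Variable names are hypothetical.

```python
import statistics

# Exam scores copied from the example above (31 values).
scores = [33, 42, 49, 49, 53, 55, 55, 61,
          63, 67, 68, 68, 69, 69, 72, 73,
          74, 78, 80, 83, 88, 88, 88, 90,
          92, 94, 94, 94, 94, 96, 100]

mean_score = statistics.mean(scores)
sd_score = statistics.stdev(scores)  # sample standard deviation

# Cut points that split the sorted data into four parts: Q1, median, Q3.
q1, median, q3 = statistics.quantiles(scores, n=4, method="exclusive")
iqr = q3 - q1

print(round(mean_score, 1), round(sd_score, 2))  # 73.5 17.92
print(q1, median, q3, iqr)                       # 61.0 73.0 90.0 29.0
```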

Comparing Values from Different Data Sets

The standard deviation is useful when comparing data values that come from different data sets. If the data sets have different means and different standard deviations, then comparing the data values directly can be misleading.  In order to directly compare values in different data sets, we can compare how many standard deviations away from the mean of its data set a value is.  This is done by calculating the value’s [latex]z[/latex]-score:

Sample:  [latex]\displaystyle{z = \frac{x - \overline{x}}{s}}[/latex]
Population:  [latex]\displaystyle{z = \frac{x - \mu}{\sigma}}[/latex]

The value [latex]x[/latex] is [latex]z[/latex] standard deviations away from the mean.

EXAMPLE

Two students, John and Ali, attend different high schools and want to find out who has the higher GPA when compared to their own school.  Which student has the higher GPA when compared to their school?

Student | GPA | School Mean GPA | School Standard Deviation
John | 2.85 | 3.0 | 0.7
Ali | 77 | 80 | 10

Solution:

For each student, determine how many standard deviations, the [latex]z[/latex]-score, their GPA is away from the mean of their school.

John:  [latex]\displaystyle{z=\frac{2.85 - 3.00}{0.7}=-0.21}[/latex]

Ali:  [latex]\displaystyle{z=\frac{77-80}{10}=-0.3}[/latex]

John has the better GPA when compared to his school.  Both GPAs are below their school’s mean, but John’s GPA is only 0.21 standard deviations below his school’s mean while Ali’s GPA is 0.3 standard deviations below her school’s mean.  Because [latex]-0.21[/latex] is greater than [latex]-0.3[/latex], John’s GPA is higher relative to his school.

NOTE

The sign of a [latex]z[/latex]-score is important.  A negative [latex]z[/latex]-score tells us that [latex]x[/latex] is below the mean.  A positive [latex]z[/latex]-score tells us that [latex]x[/latex] is above the mean.  The absolute value of the [latex]z[/latex]-score tells us how many standard deviations away from the mean the value of [latex]x[/latex] is.
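The comparison in the GPA example can be sketched in a few lines of Python.  The helper z_score is a hypothetical name chosen for illustration; it applies the same formula whether the mean and standard deviation come from a sample or a population.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd

# Values taken from the GPA example above.
john = z_score(2.85, 3.0, 0.7)
ali = z_score(77, 80, 10)

print(round(john, 2), round(ali, 2))  # -0.21 -0.3
print(john > ali)                     # True: John is higher relative to his school
```

Comparing the signed z-scores, rather than the raw GPAs, is what makes the two different grading scales comparable.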

TRY IT

Two swimmers, Angie and Beth, are from different teams and want to find out who had the faster time for the 50 meter freestyle when compared to her team’s mean time.  Which swimmer had the faster time when compared to her team?

Swimmer | Time (seconds) | Team Mean Time | Team Standard Deviation
Angie | 26.2 | 27.2 | 0.8
Beth | 27.3 | 30.1 | 1.4

Solution:

 

Angie:  [latex]\displaystyle{z=\frac{26.2 - 27.2}{0.8}=-1.25}[/latex]

Beth:  [latex]\displaystyle{z=\frac{27.3-30.1}{1.4}=-2}[/latex]

Angie’s time is 1.25 standard deviations below her team’s mean time and Beth’s time is 2 standard deviations below her team’s mean time.  In swimming a lower time is faster, so the more negative [latex]z[/latex]-score is the better one: Beth had the faster time when compared to her team.

The following lists give a few facts that provide a little more insight into what the standard deviation tells us about the distribution of the data.

For ANY data set, no matter what the distribution of the data is:

  • At least [latex]75\%[/latex] of the data is within two standard deviations of the mean.
  • At least [latex]89\%[/latex] of the data is within three standard deviations of the mean.
  • At least [latex]95\%[/latex] of the data is within [latex]4.5[/latex] standard deviations of the mean.
  • This is known as Chebyshev’s Rule.
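The three percentages in this list are consistent with the general form of Chebyshev’s Rule, which is not stated in the text: for any [latex]k>1[/latex], at least [latex]1-\frac{1}{k^2}[/latex] of the data lies within [latex]k[/latex] standard deviations of the mean.  A short Python sketch, with the hypothetical function name chebyshev_lower_bound, reproduces the values:

```python
def chebyshev_lower_bound(k):
    """Minimum share of any data set within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("the bound is only informative for k > 1")
    return 1 - 1 / k ** 2

for k in (2, 3, 4.5):
    print(k, round(chebyshev_lower_bound(k), 4))
# 2 0.75
# 3 0.8889
# 4.5 0.9506
```

These are guaranteed minimums, so a real data set will often have a larger share of its values inside each interval.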

For data having a distribution that is BELL-SHAPED and SYMMETRIC:

  • Approximately [latex]68\%[/latex] of the data is within one standard deviation of the mean.
  • Approximately [latex]95\%[/latex] of the data is within two standard deviations of the mean.
  • More than [latex]99\%[/latex] of the data is within three standard deviations of the mean.
  • This is known as the Empirical Rule.
  • It is important to note that this rule only applies when the shape of the distribution of the data is bell-shaped and symmetric.
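The Empirical Rule percentages can be checked against an exactly normal distribution, which is one common model for bell-shaped, symmetric data.  The sketch below assumes that model and uses NormalDist from Python’s standard statistics module (Python 3.8 or later); the function name share_within is hypothetical.

```python
from statistics import NormalDist

# Standard normal model: mean 0, standard deviation 1 (an assumption,
# used here as an idealized bell-shaped, symmetric distribution).
standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Share of a normal distribution within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(k, round(share_within(k), 4))
# 1 0.6827
# 2 0.9545
# 3 0.9973
```

Real data are never exactly normal, which is why the rule is stated with "approximately".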

Concept Review

The standard deviation measures the average spread of the data about the mean.  There are different equations to use depending on whether we are calculating the standard deviation of a sample or of a population.  The standard deviation allows us to compare individual data values to the mean of the data numerically.

  • The formula for calculating a sample standard deviation is [latex]\displaystyle{s=\sqrt{\frac{\sum(x-\overline{x})^2}{n-1}}}[/latex].
  • The formula for calculating a population standard deviation is [latex]\displaystyle{\sigma=\sqrt{\frac{\sum(x-\mu)^2}{N}}}[/latex].

Attribution

2.7 Measures of the Spread of the Data in Introductory Statistics by OpenStax is licensed under a Creative Commons Attribution 4.0 International License.

License

Introduction to Statistics Copyright © 2022 by Valerie Watts is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License, except where otherwise noted.