Variability refers to how "spread out" a group of scores is.
To see what we mean by spread out, consider the graphs in Figure 1. These graphs represent the
scores on two quizzes. The mean score for each quiz is
7.0. Despite the equality of
means, you can see that the distributions are quite different.
Specifically, the scores on Quiz 1 are more densely packed and
those on Quiz 2 are more spread out. The differences among
students were much greater on Quiz 2 than on Quiz 1.
The terms variability, spread, and dispersion are synonyms,
and refer to how spread out a distribution is. Just as in the
section on central tendency we discussed measures of the
center of a distribution of scores, in this chapter we will
discuss measures of the variability of a distribution. There
are four frequently used measures of variability: the range,
interquartile range, variance, and standard deviation. In the
next few paragraphs, we will look at each of these four
measures of variability in more detail.
The range is the simplest measure of variability
to calculate, and one you have probably encountered many times
in your life. The range is simply the highest score minus the
lowest score. Let's take a few examples. What is the range of
the following group of numbers: 10, 2, 5, 6, 7, 3, 4? Well, the highest number is
10, and the lowest number is 2, so $10 - 2 = 8$.
The range is 8. Let's take
another example. Here's a dataset with
10 numbers: 99, 45, 23, 67, 45, 91, 82, 78, 62, 51. What is the range? The highest number is
99 and the lowest number is 23, so $99 - 23 = 76$; the range is 76. Now
consider the two quizzes shown in Figure 1. On Quiz 1, the lowest score was
5 and the highest score was 9. Therefore, the range is
4. The range on Quiz 2 was
larger: the lowest score was 4
and the highest score was 10.
Therefore the range is 6.
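The range calculation is easy to check in code. The following minimal Python sketch applies it to the two example datasets above; the helper name `score_range` is an assumption made for this illustration.

```python
def score_range(scores):
    """Return the range: the highest score minus the lowest score."""
    return max(scores) - min(scores)

first_example = [10, 2, 5, 6, 7, 3, 4]
second_example = [99, 45, 23, 67, 45, 91, 82, 78, 62, 51]

print(score_range(first_example))   # 8
print(score_range(second_example))  # 76
```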
The interquartile range (IQR) is a range that
contains the middle 50% of the scores in a distribution. It
is computed as follows:
$$\text{IQR} = \text{75th percentile} - \text{25th percentile}$$
For Quiz 1, the 75th percentile is
8 and the 25th percentile is
6. The interquartile range is
therefore 2. For Quiz 2, which
has greater spread, the 75th percentile is
9, the 25th percentile is
5, and the interquartile range
is 4. Recall that in the
discussion of boxplots, the 75th percentile was called the
upper hinge and the 25th percentile was called the lower
hinge. Using this terminology, the interquartile range is
referred to as the H-spread.
A related measure of variability is called the
semi-interquartile range. The semi-interquartile
range is defined simply as the interquartile range divided by
2. If a distribution is
symmetric, the median plus or minus the semi-interquartile
range contains half the scores in the distribution.
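As an illustrative sketch, the interquartile range and semi-interquartile range for Quiz 1 can be computed with Python's standard `statistics` module. The scores are the 20 Quiz 1 scores listed in Table 1; the variable names are assumptions made for this example. Software packages use several slightly different conventions for computing percentiles, so results on other datasets may differ a little from hand calculations.

```python
import statistics

# Quiz 1 scores (the 20 scores listed in Table 1).
quiz1 = [9, 9, 9, 8, 8, 8, 8, 7, 7, 7, 7, 7, 6, 6, 6, 6, 6, 6, 5, 5]

# quantiles(n=4) returns three cut points: the 25th, 50th, and 75th percentiles.
q25, q50, q75 = statistics.quantiles(quiz1, n=4)

iqr = q75 - q25       # interquartile range (H-spread)
semi_iqr = iqr / 2    # semi-interquartile range

print(q25, q75, iqr, semi_iqr)  # 6.0 8.0 2.0 1.0
```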
Variability can also be defined in terms of how close the
scores in the distribution are to the middle of the
distribution. Using the mean as the measure of the middle of
the distribution, the variance is defined as the average
squared difference of the scores from the mean. The data from
Quiz 1 are shown in Table 1. The
mean score is 7.0. Therefore,
the column "Deviation from Mean" contains the score
minus 7.
The column "Squared Deviation" is simply the previous column
squared.
Table 1. Calculation of variance for Quiz 1 scores.

| | Scores | Deviation from Mean | Squared Deviation |
|------|--------|---------------------|-------------------|
| | 9 | 2 | 4 |
| | 9 | 2 | 4 |
| | 9 | 2 | 4 |
| | 8 | 1 | 1 |
| | 8 | 1 | 1 |
| | 8 | 1 | 1 |
| | 8 | 1 | 1 |
| | 7 | 0 | 0 |
| | 7 | 0 | 0 |
| | 7 | 0 | 0 |
| | 7 | 0 | 0 |
| | 7 | 0 | 0 |
| | 6 | -1 | 1 |
| | 6 | -1 | 1 |
| | 6 | -1 | 1 |
| | 6 | -1 | 1 |
| | 6 | -1 | 1 |
| | 6 | -1 | 1 |
| | 5 | -2 | 4 |
| | 5 | -2 | 4 |
| Mean | 7 | 0 | 1.5 |

One thing that is important to notice is that the mean
deviation from the mean is 0.
This will always be the case. The mean of the squared
deviations is 1.5. Therefore,
the variance is 1.5. Analogous
calculations with Quiz 2 show that its variance is
6.7. The formula for the
variance is:
$$\sigma^2 = \frac{\sum (X - \mu)^2}{N}$$
where $\sigma^2$ is the variance, $\mu$ is the
mean, and $N$ is the number of
numbers. For Quiz 1, $\mu = 7$ and $N = 20$.
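The calculation laid out in Table 1 can be reproduced with a short Python sketch. It computes the deviations, the squared deviations, and their means directly, and then compares the result with `statistics.pvariance`, the standard library's population variance function. The variable names are assumptions made for this example.

```python
import statistics

# Quiz 1 scores from Table 1.
quiz1 = [9, 9, 9, 8, 8, 8, 8, 7, 7, 7, 7, 7, 6, 6, 6, 6, 6, 6, 5, 5]

n = len(quiz1)
mu = sum(quiz1) / n                                # mean of the scores
deviations = [x - mu for x in quiz1]               # "Deviation from Mean" column
squared_deviations = [d ** 2 for d in deviations]  # "Squared Deviation" column

mean_deviation = sum(deviations) / n               # 0 for any dataset
variance = sum(squared_deviations) / n             # mean squared deviation

print(mu, mean_deviation, variance)   # 7.0 0.0 1.5
print(statistics.pvariance(quiz1))    # 1.5
```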
If the variance in a sample is used to estimate the variance
in a population, then the previous formula underestimates the
variance and the following formula should be used:
$$s^2 = \frac{\sum (X - M)^2}{N - 1}$$
where $s^2$ is the estimate of the variance and
$M$ is the sample mean. Note that
$M$ is the mean of a sample taken
from a population with a mean of
$\mu$. Since, in practice, the
variance is usually computed in a sample, this formula is most
often used. The simulation "estimating variance" illustrates the bias
in the formula with $N$ in the
denominator.
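The bias can also be seen in a small simulation of your own. The sketch below is not the "estimating variance" simulation referred to above; it is an independent illustration that assumes a normal population with a mean of 50 and a variance of 100, a sample size of 5, and a fixed random seed, all chosen arbitrarily for this example.

```python
import random

random.seed(0)             # fixed seed so the run is repeatable

POP_MEAN, POP_SD = 50, 10  # assumed population; true variance = 100
N = 5                      # size of each sample
TRIALS = 20000             # number of samples drawn

sum_biased = 0.0
sum_unbiased = 0.0
for _ in range(TRIALS):
    sample = [random.gauss(POP_MEAN, POP_SD) for _ in range(N)]
    m = sum(sample) / N
    ss = sum((x - m) ** 2 for x in sample)  # sum of squared deviations
    sum_biased += ss / N                    # N in the denominator
    sum_unbiased += ss / (N - 1)            # N - 1 in the denominator

mean_biased = sum_biased / TRIALS
mean_unbiased = sum_unbiased / TRIALS
print(round(mean_biased, 1), round(mean_unbiased, 1))
```

With these settings, the average of the estimates that divide by $N$ should come out near 80, well below the true variance of 100, while the average of the estimates that divide by $N - 1$ should come out near 100.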
Let's take a concrete example. Assume the scores
1, 2, 4, and 5 were sampled from a larger
population. To estimate the variance in the population you
would compute $s^2$ as follows:
$$M = \frac{1 + 2 + 4 + 5}{4} = \frac{12}{4} = 3$$
$$s^2 = \frac{(1 - 3)^2 + (2 - 3)^2 + (4 - 3)^2 + (5 - 3)^2}{4 - 1} = \frac{4 + 1 + 1 + 4}{3} = \frac{10}{3} = 3.333$$
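The same estimate can be checked in Python. This is a minimal sketch with assumed variable names; `statistics.variance` is the standard library function that divides by $N - 1$.

```python
import statistics

sample = [1, 2, 4, 5]
n = len(sample)

m = sum(sample) / n                                # sample mean M
s2 = sum((x - m) ** 2 for x in sample) / (n - 1)   # divide by N - 1

print(m, round(s2, 3))                        # 3.0 3.333
print(round(statistics.variance(sample), 3))  # 3.333
```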
There are alternate formulas that can be easier to use if
you are doing your calculations with a hand calculator:
$$\sigma^2 = \frac{\sum X^2 - \frac{(\sum X)^2}{N}}{N}$$
and
$$s^2 = \frac{\sum X^2 - \frac{(\sum X)^2}{N}}{N - 1}$$
For this example,
$$\sum X^2 = 1^2 + 2^2 + 4^2 + 5^2 = 46$$
$$\frac{(\sum X)^2}{N} = \frac{(1 + 2 + 4 + 5)^2}{N} = \frac{144}{4} = 36$$
$$\sigma^2 = \frac{46 - 36}{4} = 2.5$$
and
$$s^2 = \frac{46 - 36}{3} = 3.333$$
as with the other formula.
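A short Python sketch of the computational formulas, using the same four scores, shows each intermediate quantity. The variable names (`sum_x2`, `correction`) are assumptions made for this illustration.

```python
sample = [1, 2, 4, 5]
n = len(sample)

sum_x2 = sum(x ** 2 for x in sample)         # sum of squared scores
correction = sum(sample) ** 2 / n            # (sum of scores) squared, over N

pop_var = (sum_x2 - correction) / n          # formula with N in the denominator
samp_var = (sum_x2 - correction) / (n - 1)   # formula with N - 1 in the denominator

print(sum_x2, correction, pop_var, round(samp_var, 3))  # 46 36.0 2.5 3.333
```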
The standard deviation is simply the square root
of the variance. This makes the standard deviations of the
two quiz distributions 1.225
and 2.588. The standard
deviation is an especially useful measure of variability when
the distribution is normal or approximately normal (see Probability)
because the proportion of the distribution within a given
number of standard deviations from the mean can be
calculated. For example, 68% of
the distribution is within one standard deviation of the mean
and approximately 95% of the
distribution is within two standard deviations of the
mean. Therefore, if you had a normal distribution with a mean
of 50 and a standard deviation
of 10, then
68% of the distribution would be
between $50 - 10 = 40$ and $50 + 10 = 60$. Similarly, about 95%
of the distribution would be between
$50 - 2 \times 10 = 30$ and $50 + 2 \times 10 = 70$. The symbol for the population standard deviation
is $\sigma$; the symbol for an
estimate computed in a sample is
$s$. Figure 2 shows two normal distributions. Both
distributions have means of 50.
The blue distribution has a standard deviation of
5; the red distribution has a
standard deviation of 10. For
the blue distribution, 68% of
the distribution is between 45
and 55; for the red
distribution, 68% is between
40 and
60.
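These figures can be checked with a brief Python sketch. It takes the square roots of the two quiz variances and uses `statistics.NormalDist` from the standard library to find the proportion of a normal distribution within a given number of standard deviations of the mean. The helper name `proportion_within` is an assumption made for this example.

```python
import math
from statistics import NormalDist

# Standard deviations of the two quizzes, from their variances (1.5 and 6.7).
sd_quiz1 = math.sqrt(1.5)
sd_quiz2 = math.sqrt(6.7)
print(round(sd_quiz1, 3), round(sd_quiz2, 3))  # 1.225 2.588

def proportion_within(mean, sd, k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    dist = NormalDist(mu=mean, sigma=sd)
    return dist.cdf(mean + k * sd) - dist.cdf(mean - k * sd)

print(round(proportion_within(50, 10, 1), 3))  # 0.683 (between 40 and 60)
print(round(proportion_within(50, 10, 2), 3))  # 0.954 (between 30 and 70)
print(round(proportion_within(50, 5, 1), 3))   # 0.683 (between 45 and 55)
```

The proportion within one standard deviation is the same for both distributions in Figure 2; only the width of the interval changes with the standard deviation.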