Stat Handout
sum by the number of data values. This formula can be written in symbols as
$$\text{mean} = \frac{\sum x}{n}$$
Statisticians often collect data from small portions of a large group in order to determine
information about the group. In such situations, the entire group under consideration is known as
the population, and any subset of the population is called a sample. It is traditional to denote the
mean of a sample by x̄ (read as “x bar”) and to denote the mean of a population by the Greek
letter μ (lowercase mu).
Example: Six friends in a biology class of 20 students received test grades of 92, 84, 65, 76,
88, and 90. Find the mean of these test scores.
Solution: The six friends are a sample of the population of 20 students. Use x̄ to represent
the mean.
$$\bar{x} = \frac{\sum x}{n} = \frac{92 + 84 + 65 + 76 + 88 + 90}{6} = \frac{495}{6} = 82.5$$
Therefore, the mean of the test scores is 82.5.
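The same calculation can be sketched in a few lines of Python. The variable names are illustrative and not part of the handout.

```python
# Test grades of the six friends (the sample from the example).
scores = [92, 84, 65, 76, 88, 90]

# Mean = sum of the data values divided by the number of data values.
mean = sum(scores) / len(scores)

print(sum(scores))  # 495
print(mean)         # 82.5
```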
Another measure of central tendency is the median, which is the middle number (or the
mean of the two middle numbers) in a list of numbers that have been arranged in numerical order
from least to greatest or vice versa. Any list of numbers that is arranged as such is known as a
ranked list.
The median of a ranked list of n numbers is the middle number if n is odd and the mean
of the two middle numbers if n is even.
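As a minimal sketch of this rule, the Python function below ranks the list and then picks the middle value(s). The function name `median` and the use of the test grades from the earlier example are illustrative choices, and the five-value list is simply the first five grades, used here only to show the odd case.

```python
def median(values):
    """Return the median of a non-empty list of numbers."""
    ranked = sorted(values)          # arrange as a ranked list
    n = len(ranked)
    mid = n // 2
    if n % 2 == 1:                   # n is odd: the middle number
        return ranked[mid]
    # n is even: the mean of the two middle numbers
    return (ranked[mid - 1] + ranked[mid]) / 2

scores = [92, 84, 65, 76, 88, 90]
print(median(scores))      # 86.0  (mean of 84 and 88)
print(median(scores[:5]))  # 84    (middle of 65, 76, 84, 88, 92)
```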
The third measure of central tendency is the mode, which is the number that occurs most
frequently within a list of numbers. However, it is possible for some lists of numbers to not have
a mode. For instance, in the list 1, 6, 8, 10, 32, 15, 49, each number occurs exactly once. Because
no number occurs more often than the other numbers in the list, there is no mode.
On the other hand, it is also possible for a list of numerical data to have more than one
mode. For instance, in the list 4, 2, 6, 2, 7, 9, 2, 4, 9, 8, 9, 7, the numbers 2 and 9 each occur three
times. Each of the other numbers occurs fewer than three times. Thus 2 and 9 are both modes for
the data.
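One way to find the mode or modes is to count how often each value occurs. The sketch below uses Python's `collections.Counter`. The function name `modes` is illustrative, and it assumes the convention described above: when no value occurs more often than the others, the list has no mode, which the function reports as an empty list.

```python
from collections import Counter

def modes(values):
    """Return a sorted list of modes, or [] if the data has no mode."""
    if not values:
        return []
    counts = Counter(values)
    # Assumed convention: if several distinct values all occur equally
    # often, no value stands out, so there is no mode.
    if len(counts) > 1 and len(set(counts.values())) == 1:
        return []
    highest = max(counts.values())
    return sorted(v for v, c in counts.items() if c == highest)

print(modes([1, 6, 8, 10, 32, 15, 49]))               # []
print(modes([4, 2, 6, 2, 7, 9, 2, 4, 9, 8, 9, 7]))    # [2, 9]
```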
The mean, the median, and the mode are all acceptable measures of central tendency.
However, they are generally not equal. The mean of a set of data is the most sensitive among the
three. A change in any of the numbers changes the mean, and the mean can be changed
drastically by changing an extreme value. In contrast, the median and the mode of a set of data
are usually not changed by changing an extreme value.
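To illustrate this sensitivity, the sketch below reuses the six test grades from the earlier example and replaces the lowest grade, 65, with a hypothetical grade of 5. The modified list is an assumption made only for this illustration. It uses Python's standard `statistics` module.

```python
import statistics

scores = [92, 84, 65, 76, 88, 90]
# Hypothetical change: the lowest grade 65 becomes an extreme value of 5.
changed = [92, 84, 5, 76, 88, 90]

print(statistics.mean(scores), statistics.mean(changed))      # 82.5 72.5
print(statistics.median(scores), statistics.median(changed))  # 86.0 86.0
```

The mean drops by 10 points, while the median stays the same because the two middle grades, 84 and 88, are unaffected.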
When a data set has one or more extreme values that are very different from the majority
of data values (known as outliers), the mean will not necessarily be a good indicator of an
average value. To see why, let us compare the mean, median, and mode for the salaries of 5
employees of a small company.
The sum of the 5 salaries is ₱506,000, and hence the mean is ₱506,000 / 5 = ₱101,200. Meanwhile, the
median is the middle number, which is ₱36,000, and because the ₱20,000 salary occurs most
frequently, the mode is ₱20,000. The data set contains one outlier, which makes the mean
considerably larger than the median. Most of the employees of this company would probably
agree that the median of ₱36,000 better represents the average of the salaries than does either the
mean or the mode.
MEASURES OF DISPERSION
In the preceding section we introduced three measures of central tendency—the mean,
the median, and the mode. However, some characteristics of a set of data may not be evident
from an examination of these quantities. For instance, consider a soft-drink dispensing machine
that should dispense 8 oz of your selection into a cup. The table below shows data for two of
these machines.
Machine 1      Machine 2
9.52           8.01
6.41           7.99
10.07          7.95
5.85           8.03
8.15           8.02
x̄ = 8.0        x̄ = 8.0
The mean data value for each machine is 8 oz. However, the quantity of soda dispensed
in Machine 1 is highly inconsistent—in some cases the soda overflows the cup, and in other
cases too little soda is dispensed. The machine obviously needs adjustment. Machine 2, on the
other hand, is working just fine. The quantity dispensed is very consistent, with little variation.
This example shows that average values do not reflect the spread or dispersion of data. To
measure dispersion, we must introduce statistical values known as the range and the standard
deviation.
The range of a set of data values is the difference between the greatest data value and the
least data value. In the above example, the greatest quantity dispensed by Machine 1 is 10.07 oz
and the least quantity is 5.85 oz. Thus, the range of the number of ounces dispensed is
10.07 − 5.85 = 4.22 oz.
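As a short sketch, the range of each machine can be computed from the tabulated quantities. The helper name `data_range` is illustrative, and the results are rounded to two decimal places to hide floating-point noise.

```python
# Quantities dispensed (in oz), taken from the table for the two machines.
machine_1 = [9.52, 6.41, 10.07, 5.85, 8.15]
machine_2 = [8.01, 7.99, 7.95, 8.03, 8.02]

def data_range(values):
    """Range = greatest data value minus least data value."""
    return max(values) - min(values)

print(round(data_range(machine_1), 2))  # 4.22
print(round(data_range(machine_2), 2))  # 0.08
```

Both machines have the same mean, but the range of Machine 1 is far larger than that of Machine 2.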
The range of a set of data is easy to calculate, but it can be deceiving. The range is a
measure that depends only on the two most extreme values, and as such it is very sensitive. A
measure of dispersion that is less sensitive to extreme values is the standard deviation. The
standard deviation of a set of numerical data makes use of the amount by which each individual
data value deviates from the mean. These deviations, represented by (x − x̄), are positive when the
data value x is greater than the mean x̄ and are negative when x is less than x̄.
The sum of all the deviations (x − x̄) is 0 for all sets of data. Because of this, we cannot
use the sum of the deviations as a measure of dispersion for a set of data. Instead, the standard
deviation uses the sum of the squares of the deviations.
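The reason the deviations always sum to 0 follows directly from the definition of the mean. Because the mean of n values is the sum of the values divided by n, the sum of the values equals n times the mean, and so:

```latex
\sum (x - \bar{x}) = \sum x - n\bar{x} = n\bar{x} - n\bar{x} = 0
```

Squaring each deviation makes every term nonnegative, so the positive and negative deviations can no longer cancel.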
If x₁, x₂, x₃, …, xₙ is a population of n numbers with a mean of μ, then the standard
deviation of the population is
$$\sigma = \sqrt{\frac{\sum (x - \mu)^2}{n}}$$
If x₁, x₂, x₃, …, xₙ is a sample of n numbers with a mean of x̄, then the standard deviation of the
sample is
$$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$$
Most statistical applications involve a sample rather than a population, which is the complete set
of data values. Sample standard deviations are designated by the lowercase letter s. In those
cases in which we do work with a population, we designate the standard deviation of the
population by σ, which is the lowercase Greek letter sigma. We can use the following procedure
to calculate the standard deviation of n numbers.
1. Determine the mean of the n numbers.
2. For each number, calculate the deviation (difference) between the number and the mean
of the numbers.
3. Calculate the square of each deviation and find the sum of these squared deviations.
4. If the data is a population, divide the sum by n. If the data is a sample, divide the sum by
n−1.
5. Find the square root of the quotient in Step 4.
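The five steps above can be sketched as a small Python function. The function name `standard_deviation` and the `is_sample` flag are illustrative. The usage lines apply it to the soft-drink machine data from earlier in this section, treating each machine's five readings as a sample, which is an assumption made for this illustration.

```python
import math

def standard_deviation(values, is_sample=True):
    """Standard deviation following the five-step procedure."""
    n = len(values)
    mean = sum(values) / n                           # Step 1: the mean
    deviations = [x - mean for x in values]          # Step 2: deviations
    sum_of_squares = sum(d ** 2 for d in deviations) # Step 3: squared deviations
    divisor = n - 1 if is_sample else n              # Step 4: n - 1 or n
    return math.sqrt(sum_of_squares / divisor)       # Step 5: square root

# Quantities dispensed (in oz) by the two machines.
machine_1 = [9.52, 6.41, 10.07, 5.85, 8.15]
machine_2 = [8.01, 7.99, 7.95, 8.03, 8.02]

print(round(standard_deviation(machine_1), 2))  # 1.86
print(round(standard_deviation(machine_2), 2))  # 0.03
```

The much larger value for Machine 1 reflects how inconsistently it dispenses, even though both machines have the same mean.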
Example: The following numbers were obtained by sampling a population: 2, 4, 7, 12, 15.
Find the standard deviation of the sample.
Solution:
1. The mean of the numbers is
$$\bar{x} = \frac{2 + 4 + 7 + 12 + 15}{5} = \frac{40}{5} = 8$$
2. For each number, calculate the deviation between the number and the mean.
x      x − x̄
2      2 − 8 = −6
4      4 − 8 = −4
7      7 − 8 = −1
12     12 − 8 = 4
15     15 − 8 = 7
3. Calculate the square of each deviation in Step 2, and find the sum of those
squared deviations.
x      x − x̄           (x − x̄)²
2      2 − 8 = −6      (−6)² = 36
4      4 − 8 = −4      (−4)² = 16
7      7 − 8 = −1      (−1)² = 1
12     12 − 8 = 4      4² = 16
15     15 − 8 = 7      7² = 49
                       ∑(x − x̄)² = 118
4. Because we have a sample of n = 5 values, divide the sum 118 by n − 1, which is
4.
$$\frac{118}{4} = 29.5$$
5. The standard deviation of the sample is s = √29.5. To the nearest hundredth, the
standard deviation is s ≈ 5.43.
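As a cross-check, Python's standard `statistics` module gives the same result. `statistics.stdev` divides by n − 1 (sample), while `statistics.pstdev` divides by n (population). The population value is shown only for comparison and is not part of the example above.

```python
import statistics

sample = [2, 4, 7, 12, 15]

s = statistics.stdev(sample)       # sample standard deviation (divides by n - 1)
sigma = statistics.pstdev(sample)  # population standard deviation (divides by n)

print(round(s, 2))      # 5.43
print(round(sigma, 2))  # 4.86
```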