Adult Basic Education
Advanced Level
MATHEMATICS
Data Analysis
Ministry of Advanced Education,
Training and Technology
Prepared by
Paul Grinder, Okanagan University College
with
Pat Corbett-Labatt, North Island College
Bob Darling, Malaspina University-College
Peter Robbins, Kwantlen University College
Ada Sarsiat, Northwest Community College
for the
Province of British Columbia
Ministry of Advanced Education, Training and Technology
and the
Centre for Curriculum, Transfer and Technology
© 2000-2020 Province of British Columbia, Ministry of
Advanced Education, Skills & Training
Republished by BCcampus with permission.
Victoria, B.C.
Data Analysis by Paul Grinder is released under a
Creative Commons Attribution 4.0 International Licence,
except where otherwise noted.
The CC licence permits you to retain, reuse, copy,
redistribute, and revise this book, in whole or in part, for
free, providing the authors are attributed as follows:
Data Analysis by Paul Grinder is under a CC BY 4.0 Licence.
If you redistribute all or part of this book, it is
recommended the following statement be added to the
copyright page so readers can access the original book
at no cost:
Download for free from the B.C. Open Textbook Collection: https://open.bccampus.ca
This textbook can be referenced. In APA citation style, it
would appear as follows:
Grinder, P. (2020). Data analysis. BCcampus.
Visit BCcampus Open Education to learn about open
education in British Columbia.
Contents
Learning outcomes
Glossary
Unit 1: The uses and abuses of statistics
Unit 2: Introduction: Mean, median, mode, range and graphs
Unit 3: Measures of position: quartiles and percentiles
Unit 4: The standard deviation
Unit 5: The normal distribution
Unit 6: The normal curve
Unit 7: Analysing survey data
Unit 8: A statistics project
Appendix A
Appendix B
Answers
Learning outcomes
The word statistics is derived from the Latin word status which means “state”. Governments
were the first to use statistics. They used statistics to collect and interpret data about their
countries. Today, statistics are used in almost every major field of study.
Upon completion of this Module, you should be able to:
• explain the uses and misuses of statistics
• demonstrate an understanding of mean, median, mode, range, quartiles, percentiles,
  standard deviation, the normal curve, z scores, sampling error and confidence intervals
• graphically present data in the form of frequency tables, line graphs, bar graphs and stem
  and leaf plots
• design and conduct a statistics project, analyze the data and communicate your
  observations about the data
Procedure for independent study
1. Read each of the units in order and complete all of the exercises. If you need
assistance, contact your instructor.
2. Complete the Activity Exercises wherever possible.
3. Study the terminology in the Glossary to become familiar with the definitions.
4. If recommended by your instructor, complete additional problem sets.
5. Complete the Project for this Module.
Glossary
Bar graph
A graph that uses side by side bars of different lengths to represent ranked data.
Confidence interval
The interval in which a statistic will likely fall, a certain percent of the time, after repeated
experimentation.
Data
The information collected for statistical analysis.
Deviation
The difference between one data value and the mean.
Frequency
The number of times that a particular value occurs in a set of data.
Frequency graph
Sometimes called a broken line graph. A graph with a horizontal axis representing data
values and a vertical axis representing frequency values.
Frequency histogram
Also known as a bar graph.
Measures of central tendency
Statistics that describe where the data is centred. The mean, median and mode are measures
of central tendency.
Measures of position
Statistics that describe how one data value compares to another. Percentiles, quartiles and z
scores are measures of position.
Measures of variation
Statistics that describe how the data is spread out or dispersed. The range, deviation and
standard deviation are measures of variation.
Mean
The average. The mean is obtained by finding the sum of the data values and dividing by the
number of data values.
Median
The middle value, or the average of the two middle values, of a set of ranked data.
Mode
The data value that occurs most frequently.
Normal curve
Also called a bell curve. The graph of data that is distributed symmetrically about the mean
so that most of the data is close to the mean.
Normal distribution
A distribution that takes the shape of a normal curve when graphed. Approximately 68% of
the data values will fall within one standard deviation of the mean, 95.5% will fall within two
standard deviations of the mean and 99.7% of the data will fall within three standard
deviations of the mean.
Percentile
One of the 100 values that divide a set of ranked data into 100 equal intervals. The 48th
percentile is a value that is higher than 48% of all the data values.
Population
A large group from which samples are taken for statistical analysis.
Quartile
One of four values that divide a set of ranked data into four equal intervals. The first quartile
is equal to the 25th percentile.
Random
A value is random if it has the same chance of occurring as any other value from the same set.
Random sample
A sample that has the same probability of being chosen as any other sample of the same size.
Range
The difference between the largest data value and the smallest data value.
Ranked data
Data that is listed from highest to lowest or lowest to highest.
Sample
A small set of data chosen from a larger set of data.
Sampling error
The amount of error associated with a calculated value as determined by the size of the
sample.
Standard deviation
The square root of the average squared deviation of a set of data.
Statistic
A value calculated from a set of data. The mean and z scores are statistics.
Statistics
A branch of mathematics that collects, organizes and analyzes data.
Stem and leaf plot
A table of data values where the last digits of data values (leaves) are strung out behind their
first digits (or stem values).
Survey
Information derived from a sampling of a certain population.
Tally
A method of counting data using “tic” marks.
Yes population
A 40% yes population is one that has responded yes to a particular question 40% of the time.
z score
Also known as a standard score. The value obtained by dividing the deviation by the standard
deviation.
Unit 1: The uses and abuses of statistics
The word statistics has two meanings. A statistic is a numerical measurement describing
some characteristic of a set of data. For example, a statistic like 290 pounds could be used to
describe the average or mean weight of a football team. Statistics is also a collection of
methods for planning experiments, collecting data, analyzing the data and drawing
conclusions.
The uses of statistics
It is hard to read a magazine or newspaper without coming across some statistical survey or
analysis. Sportscasts, TV documentaries and newscasts also have their share of statistics. The
uses of statistics include applications in business, sports, medicine, agriculture, psychology,
sociology, education and political science. Governments use statistics to monitor everything
from lifestyle preferences to crime rates. New drugs are statistically analyzed to determine
their effectiveness on patients. The statistical technique of random selection is employed to
help ensure that a small sample of a larger population group is an unbiased
representation of the whole population. Statistics, such as plus-minus records, can even be
used to determine whether a certain hockey player should be given more or less ice time.
The abuses of statistics
Just as statistics can be used to provide a solid quantitative analysis of a set of data, statistics
can be misused to distort data. The abuse of statistics is what Benjamin Disraeli (nineteenth
century British prime minister) was referring to when he made the famous comment, “There
are three kinds of lies: lies, damned lies and statistics.”
Statistics can be used to misrepresent a situation. Suppose a small store employs 6 people
who earn an average, or mean, wage of $8.50 per hour, as calculated below:

\[ \frac{\$8 + \$8 + \$8 + \$8 + \$9 + \$10}{6} = \$8.50 \]

Now suppose the store owner, who earns $40 per hour, includes his or her own wage in the
calculation:

\[ \frac{\$8 + \$8 + \$8 + \$8 + \$9 + \$10 + \$40}{7} = \$13 \]

If the store owner reports that the average wage earned at the store is $13, he or she is
misrepresenting the situation, since the store owner is the only person making $13 per hour or
more.
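The wage example can be checked with a few lines of Python. This is only a minimal sketch: the variable names are chosen here for illustration, and the wages are the ones given in the example.

```python
from statistics import mean

# Hourly wages of the six employees, taken from the example above.
employee_wages = [8, 8, 8, 8, 9, 10]
owner_wage = 40

mean_without_owner = mean(employee_wages)
mean_with_owner = mean(employee_wages + [owner_wage])

print(f"Mean wage, employees only: ${mean_without_owner:.2f}")
print(f"Mean wage, owner included: ${mean_with_owner:.2f}")

# Count how many people actually earn at least the reported "average".
everyone = employee_wages + [owner_wage]
at_or_above = [w for w in everyone if w >= mean_with_owner]
print(f"People earning at least the reported mean: {len(at_or_above)}")
```

A single very large value pulls the mean upward, which is why the reported figure describes nobody except the owner.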
Another source of deceptive statistics results from the faulty collection of data. Companies
that conduct public opinion polls have to be extremely careful that they survey a large
enough sample of the population and also an unbiased segment of the population. For
example, suppose a poll was conducted in BC to determine whether a luxury tax should be
imposed on buyers of new pick-up trucks. The citizens of Prince George might respond quite
differently to the poll than the residents of Victoria. The poll could be quite biased if it was
only conducted in Victoria, or only conducted in Prince George.
Statistical graphs can be presented in a deceptive manner. Consider the two bar graphs
below depicting the same data.
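[Figure: two bar graphs of the same data, titled “Hours spent per week watching TV”, each
comparing Men and Women. In the first graph the vertical axis (hours watching TV) is marked
from 15 to 25; in the second graph the vertical axis starts at 0.]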
Without a close inspection of the vertical scale, the first bar graph creates the impression that
men watch twice as much TV as women do. In the second graph, the vertical scale starts at 0,
and the lengths of the bars are proportional to the actual hours of TV watching.
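The effect of a vertical scale that does not start at 0 can be put into numbers. The sketch below uses hypothetical values (25 hours for men and 20 hours for women); they are assumptions chosen for illustration and are not read from the graphs. The function name is also made up for this example.

```python
def drawn_bar_length(value, axis_start):
    """Length of a bar as drawn when the vertical axis begins at axis_start."""
    return value - axis_start

# Hypothetical values, for illustration only (not taken from the graphs).
men_hours = 25
women_hours = 20

# Ratio of the actual values.
true_ratio = men_hours / women_hours

# Ratio of the bar lengths when the axis starts at 15 instead of 0.
truncated_ratio = drawn_bar_length(men_hours, 15) / drawn_bar_length(women_hours, 15)

# Ratio of the bar lengths when the axis starts at 0.
zero_based_ratio = drawn_bar_length(men_hours, 0) / drawn_bar_length(women_hours, 0)

print(f"Actual ratio of hours:          {true_ratio:.2f}")
print(f"Bar ratio, axis starting at 15: {truncated_ratio:.2f}")
print(f"Bar ratio, axis starting at 0:  {zero_based_ratio:.2f}")
```

With these assumed values, one bar is drawn twice as long as the other on the truncated scale even though the actual values differ by only one quarter.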
The above examples illustrate only a few of the abuses of statistics. To avoid the “lies and
damned lies”, every step of the statistical process must be scrupulously carried out, from the
collection of the data, to the calculation of a statistic, to the presentation of conclusions.
Now complete Exercise 1 and check your answers.
Exercise 1
1. Why is the following bar graph misleading?
[Bar graph: horizontal bars for Women and Men; the horizontal axis, “Life expectancy from
birth”, is marked at 75 and 80.]
2. What factor or factors might cause the following surveys to be biased?
a. TV news watchers are asked to phone in their opinion on whether marijuana
smoking should be legalized.
b. A questionnaire asking family members to list the number of books they read in
the last year is mailed to 1000 homes in the city of Vancouver.
c. To determine how many college students are smokers, Butler asks the first 20
students he sees standing outside the main entrance to the college, “Are you a
smoker?”
Answers are on page 69.
Activity 1: Watching TV
Ask every student in the room to write down, on a small piece of paper, an estimate of the
number of minutes they spent watching TV yesterday. Collect the data (pieces of paper) in
some sort of container.
1. a. Draw one piece of paper and record the number.
b. Do you think that this one piece of data is a good representation of the actual (yet
to be calculated) average?
2. Replace the first piece of paper and draw two pieces of data. Find the mean of these
two.
3. Replace the two pieces of data and now draw four pieces of paper. What is the
average time spent watching TV based on just these four pieces of data?
4. Replace the four pieces of paper and draw one half (or one half plus one) of the data.
Find the average time for one half the data.
5. Now find the mean using all the data.
a. How do the previous calculations of the mean, using smaller samples of the total
data, compare to the actual mean?
b. Some of the students may have recorded 0 minutes for the time they spent
watching TV yesterday. How did these zeros affect the mean time?
c. Now calculate the mean for only those students who actually watched some TV
yesterday.
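If you would like to try the activity on a computer first, the drawing of slips can be simulated. The sketch below is only one possible approach: the list of minutes is hypothetical data invented for illustration, and the fixed seed is an assumption made so that the run can be repeated.

```python
import random
from statistics import mean

# Hypothetical class data: minutes of TV watched yesterday by 12 students.
minutes = [0, 30, 45, 0, 120, 60, 90, 15, 0, 180, 75, 50]

rng = random.Random(1)  # fixed seed so the run is repeatable

def sample_mean(data, k):
    """Mean of k slips drawn at once; the full list is left unchanged,
    which matches replacing the slips before the next draw."""
    return mean(rng.sample(data, k))

# Steps 1 to 4: samples of 1, 2, 4 and one half of the data.
for k in (1, 2, 4, len(minutes) // 2):
    print(f"Sample of {k:2d}: mean = {sample_mean(minutes, k):.1f} minutes")

# Step 5: the mean of all the data, then the mean without the zeros.
full_mean = mean(minutes)
watchers_mean = mean([m for m in minutes if m > 0])

print(f"All data:      mean = {full_mean:.1f} minutes")
print(f"Watchers only: mean = {watchers_mean:.1f} minutes")
```

Small samples can land far from the mean of all the data, and removing the zeros raises the mean because the same total is divided by fewer values.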
Unit 2: Introduction: Mean, median, mode, range and graphs
Statistics is the science of collecting, classifying, presenting and interpreting numerical data.
The data are numbers or measurements collected by a statistician. For example, the data
below are scores obtained by 12 students on a math quiz out of 40 marks.
32, 39, 32, 27, 30, 34, 32, 35, 40, 36, 32, 36
In order to statistically describe the above data, we might ask the following questions.
1. What is the average, or mean, score?
2. What is the middle, or median, score?
3. What score occurs most often, or what is the mode?
4. What is the difference between the highest and lowest score, or what is the range?
5. How can the data be represented graphically, with a line graph, bar graph or stem
and leaf plot?
The mean, median, mode, and range are four statistics which can be used to describe a set of
data. The mean, median, and mode are called measures of central tendency because they tell
us where the data is centered. The range is a measure of variation because it tells us how
much the data is spread out.
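As a quick illustration of the mode and the range, the sketch below applies the definitions to the 12 quiz scores listed above. The variable names are chosen here for illustration only.

```python
from statistics import mode

# The 12 math quiz scores listed above.
scores = [32, 39, 32, 27, 30, 34, 32, 35, 40, 36, 32, 36]

# Mode: the score that occurs most often.
most_common_score = mode(scores)

# Range: the difference between the highest and lowest score.
score_range = max(scores) - min(scores)

print(f"Mode:  {most_common_score}")
print(f"Range: {score_range}")
```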
The mean is the most important measure of central tendency. It is calculated as follows:

The mean is the sum of all the data values divided by the number of data values, or

\[ \bar{x} = \frac{\Sigma x}{n} \]

where \(\bar{x}\) is the mean, \(x\) is a data value, and \(n\) is the number of data values.

The symbol “Σ” is the Greek letter “sigma” and means “the sum of all”. Here “Σx” means the
sum of all x (or data) values.
Example 1
Find the mean score of the following 12 math test scores:
32, 39, 32, 27, 30, 34, 32, 35, 40, 36, 32 and 36.
Solution
Using the formula (note that n = 12),

\[ \bar{x} = \frac{\Sigma x}{n} = \frac{32 + 39 + 32 + 27 + 30 + 34 + 32 + 35 + 40 + 36 + 32 + 36}{12} = \frac{405}{12} = 33.75 \]

The mean score is 33.75.
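The calculation in Example 1 can be reproduced in Python as a check. This is a minimal sketch that follows the formula step by step; the variable names are illustrative.

```python
# The 12 math test scores from Example 1.
scores = [32, 39, 32, 27, 30, 34, 32, 35, 40, 36, 32, 36]

n = len(scores)         # number of data values
total = sum(scores)     # sum of all the data values
mean_score = total / n  # mean = sum divided by number of values

print(f"n = {n}, sum = {total}, mean = {mean_score}")
```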
The median is the middle value when the data is arranged from highest to
lowest. If there are two middle values, then the median is the mean of these two
values.
Example 2
Find the median score of the above 12 math scores.
Solution
Arrange the data from the highest to lowest.

40, 39, 36, 36, 35, 34, 32, 32, 32, 32, 30, 27

The middle values are 34 and 32 (the sixth and seventh values in the list).

Note that because there are an even number (twelve) of data values, we have two middle
values. The mean of these two values is

\[ \frac{34 + 32}{2} = \frac{66}{2} = 33 \]

Hence the median math score is 33.
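The steps of Example 2 can also be sketched in Python: rank the data, pick out the middle value or values, and average them if there are two. The variable names are illustrative, and the last lines compare the result with the `median` function from Python's standard `statistics` module.

```python
from statistics import median

# The 12 math scores from Example 2.
scores = [32, 39, 32, 27, 30, 34, 32, 35, 40, 36, 32, 36]

# Arrange the data from highest to lowest.
ranked = sorted(scores, reverse=True)
n = len(ranked)

if n % 2 == 0:
    # Even number of values: take the mean of the two middle values.
    first_middle = ranked[n // 2 - 1]
    second_middle = ranked[n // 2]
    median_score = (first_middle + second_middle) / 2
else:
    # Odd number of values: there is a single middle value.
    median_score = ranked[n // 2]

print(f"Ranked data: {ranked}")
print(f"Median: {median_score}")

# Cross-check against the standard library.
library_median = median(scores)
```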