
Lecture 2: Graphical Techniques and Numerical Measures

Graphical Excellence
• “Graphical excellence” deals with the effective use of graphical techniques.
• Effective graphical techniques give a presentation of the data that is
– informative,
– concise,
– clear to the viewer.

How can we achieve graphical excellence?

• Graphical excellence is achieved when
– The graph presents large data sets concisely and coherently.
– The ideas and concepts to be delivered are clearly understood by the viewer.
– The graph encourages the viewer to compare variables.
– The display induces the viewer to address the substance of the data and not the form of the graph.
– There is no distortion of what the data reveal.

Graphical Deception
• It is important to be able to evaluate critically the information presented by graphical techniques.
• Things to be cautious about when observing a graph:
– Is there a missing scale on one axis?
– Do not be influenced by a graph’s caption.
– Are changes presented in absolute values only, or in percent form too?
– Has any axis been stretched?
[Figures: plots of a series over time, one with no vertical scale ("?"), one labelled in percent (1%, 2%, 3%) and one in absolute values (100.0, 110.0, 120.0); and plots of dollars over time showing a 10% change, with time axes running from Aug. 98 to Sept. 98 and from 1980 to 1990.]

Measures of Central Location
• Usually, we focus our attention on two aspects of measures of central location:
– Measure of the central data point (the average).
– Measure of dispersion of the data about the average.
• The central data point reflects the locations of all the actual data points.

Measures of Central Location (Central Tendency)
• With one data point, clearly the central location is at the point itself.
• With two data points, the central location should fall in the middle between them (in order to reflect the location of both of them).
• If the third data point appears exactly in the middle of the current range, the central location should not change (because it is currently residing in the middle).
• But if the third data point appears on the left-hand side of the midrange, it should “pull” the central location to the left.

Arithmetic mean
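– This is the most popular and useful measure of central location.
\[ \text{Mean} = \frac{\text{Sum of the measurements}}{\text{Number of measurements}} \]
– Sample mean (n is the sample size):
\[ \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} \]
– Population mean (N is the population size):
\[ \mu = \frac{\sum_{i=1}^{N} x_i}{N} \]
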
• Example 4.1
The mean of the sample of six measurements 7, 3, 9, -2, 4, 6 is
\[ \bar{x} = \frac{\sum_{i=1}^{6} x_i}{6} = \frac{x_1 + x_2 + x_3 + x_4 + x_5 + x_6}{6} = \frac{7 + 3 + 9 - 2 + 4 + 6}{6} = 4.5 \]
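The calculation in Example 4.1 can be checked with a few lines of Python. This is only an illustrative sketch; the function name `mean` and the variable names are choices made here, not part of the lecture.

```python
def mean(values):
    """Arithmetic mean: sum of the measurements / number of measurements."""
    if not values:
        raise ValueError("mean() needs at least one measurement")
    return sum(values) / len(values)


sample = [7, 3, 9, -2, 4, 6]   # the six measurements of Example 4.1
x_bar = mean(sample)
print(x_bar)  # 4.5
```

The same function applies to a population; only the interpretation of the result (x-bar versus mu) changes.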
• Example 4.2
Suppose the telephone bills of Example 2.1 represent the population of measurements. The population mean is
\[ \mu = \frac{\sum_{i=1}^{200} x_i}{200} = \frac{x_1 + x_2 + \dots + x_{200}}{200} = \frac{42.19 + 15.30 + \dots + 53.21}{200} = 43.59 \]

• Example 4.3
When many of the measurements have the same value, the measurements can be summarized in a frequency table. Suppose the number of children in a sample of 16 employees was recorded as follows:

NUMBER OF CHILDREN     0   1   2   3
NUMBER OF EMPLOYEES    3   4   7   2    (16 employees)

\[ \bar{x} = \frac{\sum_{i=1}^{16} x_i}{16} = \frac{x_1 + x_2 + \dots + x_{16}}{16} = \frac{3(0) + 4(1) + 7(2) + 2(3)}{16} = 1.5 \]

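When data arrive as a frequency table, the mean is a weighted sum of the distinct values. A minimal sketch for Example 4.3, assuming the table is stored as a dictionary (a representation chosen here for illustration):

```python
# Frequency table of Example 4.3: number of children -> number of employees
frequency = {0: 3, 1: 4, 2: 7, 3: 2}

n = sum(frequency.values())                                        # 16 employees
total = sum(value * count for value, count in frequency.items())   # 24 children
x_bar = total / n
print(n, total, x_bar)  # 16 24 1.5
```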
The median
– The median of a set of measurements is the value that falls in the middle when the measurements are arranged in order of magnitude.
Example 4.4
Seven employee salaries were recorded (in 1000s): 28, 60, 26, 32, 30, 26, 29. Find the median salary.
– Odd number of observations. First, sort the salaries: 26, 26, 28, 29, 30, 32, 60. Then, locate the value in the middle: the median is 29.
Suppose one employee’s salary of $31,000 was added to the group recorded before. Find the median salary.
– Even number of observations. First, sort the salaries: 26, 26, 28, 29, 30, 31, 32, 60. Then, locate the values in the middle. There are two middle values, 29 and 30, so the median is their average, 29.5.
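The sort-then-pick-the-middle procedure of Example 4.4 can be sketched as follows. The function is written out by hand to mirror the two cases (odd and even number of observations); Python's standard `statistics.median` gives the same results.

```python
def median(values):
    """Middle value of the sorted data; average of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median() needs at least one measurement")
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


salaries = [28, 60, 26, 32, 30, 26, 29]   # in $1000s
print(median(salaries))                   # 29
print(median(salaries + [31]))            # 29.5
```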
The mode
– The mode of a set of measurements is the value that occurs most frequently.
– A set of data may have one mode (or modal class), or two or more modes.
– For large data sets the modal class is much more relevant than a single-value mode.

– Example 4.5
• The manager of a men’s store observes the waist size (in inches) of trousers sold yesterday: 31, 34, 36, 33, 28, 34, 30, 34, 32, 40.
• The mode of this data set is 34 in.
• This information seems valuable (for example, for the design of a new display in the store), much more than “the mean is 33.2 in.”.

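A short sketch of how the mode of Example 4.5 could be found by counting occurrences. The list name `waist_sizes` is an illustrative choice, and the mean and median are printed only for comparison with the mode.

```python
from collections import Counter
from statistics import mean, median

waist_sizes = [31, 34, 36, 33, 28, 34, 30, 34, 32, 40]   # inches, Example 4.5

counts = Counter(waist_sizes)
highest = max(counts.values())
# Keep every value that reaches the highest count, so ties give several modes.
modes = sorted(size for size, count in counts.items() if count == highest)

print(modes)                # [34]
print(mean(waist_sizes))    # 33.2
print(median(waist_sizes))  # 33.5
```

Collecting all values with the highest count, rather than a single one, reflects the point that a data set may have two or more modes.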
• Example 4.6
A professor of statistics wants to report the results of a midterm exam, taken by 100 students. The data appear in file XM04. Find the mean, median, and mode, and describe the information they provide.

Marks
Mean                  73.98
Standard Error        2.1502163
Median                81
Mode                  84
Standard Deviation    21.502163
Sample Variance       462.34303
Kurtosis              0.3936606
Skewness              -1.073098
Range                 89
Minimum               11
Maximum               100
Sum                   7398
Count                 100

– The mean provides information about the overall performance level of the class. It can serve as a tool for making comparisons with other classes and/or other exams.
– The median indicates that half of the class received a grade below 81%, and half of the class received a grade above 81%.
– The mode must be used when data is qualitative. If marks are classified by letter grade, the frequency of each grade can be calculated. Then, the mode becomes a logical measure to compute.
Excel Histogram

Bin     Frequency
10      0
20      3
30      2
40      6
50      6
60      5
70      10
80      16
90      28
100     24
More    0

The histogram is skewed to the left. The modal class is the bin labelled 90, with a frequency of 28.

Relationship among Mean, Median, and Mode
• If a distribution is symmetrical, the mean, median and mode coincide.
• If a distribution is non-symmetrical, and skewed to the left or to the right, the three measures differ.
– A positively skewed distribution (“skewed to the right”): from left to right, the order is mode, median, mean.
– A negatively skewed distribution (“skewed to the left”): from left to right, the order is mean, median, mode.
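The ordering of the three measures under skewness can be illustrated on a small made-up data set. The numbers below are hypothetical and chosen only so that the pattern is easy to see; real data need not follow the ordering this neatly.

```python
from statistics import mean, median, mode

# Hypothetical right-skewed data: most values are small, a few are large.
data = [1, 2, 2, 2, 3, 3, 4, 5, 9, 14]
m_mode, m_median, m_mean = mode(data), median(data), mean(data)
print(m_mode, m_median, m_mean)   # 2 3.0 4.5

# Mirroring the data flips the skew, and with it the ordering.
mirrored = [-x for x in data]
l_mode, l_median, l_mean = mode(mirrored), median(mirrored), mean(mirrored)
print(l_mean, l_median, l_mode)   # -4.5 -3.0 -2
```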
The geometric mean
– This is a measure of the average growth rate.
– Let \(R_i\) denote the rate of return in period i (i = 1, 2, …, n). The geometric mean of the returns \(R_1, R_2, \dots, R_n\) is the constant \(R_g\) that produces the same terminal wealth at the end of period n as do the actual returns for the n periods.
– For the given series of rates of return, the n-period return is calculated by \((1+R_1)(1+R_2)\cdots(1+R_n)\).
– If the rate of return was \(R_g\) in every period, the n-period return would be calculated by \((1+R_g)^n\).
Setting the two equal,
\[ (1+R_1)(1+R_2)\cdots(1+R_n) = (1+R_g)^n \]
so that
\[ R_g = \sqrt[n]{(1+R_1)(1+R_2)\cdots(1+R_n)} - 1 \]
– Example 4.7
• A firm’s sales were $1,000,000 three years ago.
• Sales have grown annually by 20%, 10%, -5%.
• Find the geometric mean rate of growth in sales.
– Solution
• Since \(R_g\) is the geometric mean,
\[ (1+R_g)^3 = (1+.2)(1+.1)(1-.05) = 1.2540 \]
Thus,
\[ R_g = \sqrt[3]{(1+.2)(1+.1)(1-.05)} - 1 = .0784, \]
or 7.84%.

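The geometric mean of Example 4.7 can be reproduced with a small helper. The function name `geometric_mean_return` is a label chosen here; the sketch assumes the returns are given as decimal fractions (0.20 for 20%).

```python
import math


def geometric_mean_return(returns):
    """Constant per-period rate Rg with (1 + Rg)**n equal to the product of (1 + Ri)."""
    if not returns:
        raise ValueError("at least one period is required")
    growth = math.prod(1 + r for r in returns)
    return growth ** (1 / len(returns)) - 1


returns = [0.20, 0.10, -0.05]             # annual growth rates of Example 4.7
rg = geometric_mean_return(returns)
print(round(rg, 4))                       # 0.0784
print(round(1_000_000 * (1 + rg) ** 3))   # 1254000
```

The second line printed confirms the defining property: growing at the constant rate for three years leads to the same sales level as the actual 20%, 10%, -5% sequence.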
Measures of variability
(Dispersion or Spread)
• Measures of central location fail to tell the whole story about the distribution.
• A question of interest still remains unanswered:
How typical is the average value of all the measurements in the data set?
or
How spread out are the measurements about the average value?

Observe two hypothetical data sets
– Low variability data set: the average value provides a good representation of the values in the data set.
– High variability data set: the same average value does not provide as good a representation of the values in the data set as before.

The range
– The range of a set of measurements is the difference
between the largest and smallest measurements.
– Its major advantage is the ease with which it can be
computed.
– Its major shortcoming is its failure to provide
information on the dispersion of the values between
the two end points.
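But how are all the measurements spread out between the end points? The range cannot assist in answering this question.
[Figure: the range drawn as the distance from the smallest measurement to the largest measurement, with question marks over the values in between.]
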
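The shortcoming of the range can be seen on two made-up data sets that share the same end points. Both lists are hypothetical; the standard deviation (a measure introduced later in this lecture) is used only to show that the spread really differs.

```python
from statistics import pstdev


def value_range(values):
    """Largest measurement minus smallest measurement."""
    return max(values) - min(values)


# Two hypothetical data sets with the same smallest and largest values.
clustered = [1, 5, 5, 5, 5, 5, 9]    # most values sit at the centre
spread_out = [1, 1, 1, 5, 9, 9, 9]   # most values sit at the end points

print(value_range(clustered), value_range(spread_out))  # 8 8
print(pstdev(clustered) < pstdev(spread_out))            # True
```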
The variance
– This measure of dispersion reflects the values of all the measurements.
– The variance of a population of N measurements \(x_1, x_2, \dots, x_N\) having a mean \(\mu\) is defined as
\[ \sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N} \]
– The variance of a sample of n measurements \(x_1, x_2, \dots, x_n\) having a mean \(\bar{x}\) is defined as
\[ s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} \]
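The two definitions differ only in the divisor, which a direct translation into Python makes visible. The function names and the data below are illustrative assumptions; the standard library functions `statistics.pvariance` and `statistics.variance` implement the same two quantities.

```python
def population_variance(values):
    """Average squared deviation from the population mean (divide by N)."""
    mu = sum(values) / len(values)
    return sum((x - mu) ** 2 for x in values) / len(values)


def sample_variance(values):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    n = len(values)
    if n < 2:
        raise ValueError("sample variance needs at least two measurements")
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)


data = [2, 4, 4, 4, 5, 5, 7, 9]     # hypothetical measurements, mean 5
print(population_variance(data))    # 4.0
print(sample_variance(data))        # slightly larger, because of the n - 1 divisor
```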
Consider two small populations:
Population A: 8, 9, 10, 11, 12
Population B: 4, 7, 10, 13, 16
The mean of both populations is 10, but measurements in B are much more dispersed than those in A. Thus, a measure of dispersion is needed that agrees with this observation.
Let us start by calculating the sum of deviations.
A: 8-10 = -2, 9-10 = -1, 10-10 = 0, 11-10 = +1, 12-10 = +2. Sum = 0
B: 4-10 = -6, 7-10 = -3, 10-10 = 0, 13-10 = +3, 16-10 = +6. Sum = 0
The sum of deviations is zero in both cases; therefore, another measure is needed.

The sum of squared deviations is used in calculating the variance. See example next.

Let us calculate the variance of the two populations
\[ \sigma_A^2 = \frac{(8-10)^2 + (9-10)^2 + (10-10)^2 + (11-10)^2 + (12-10)^2}{5} = 2 \]
\[ \sigma_B^2 = \frac{(4-10)^2 + (7-10)^2 + (10-10)^2 + (13-10)^2 + (16-10)^2}{5} = 18 \]
Why is the variance defined as the average squared deviation? Why not use the sum of squared deviations as a measure of dispersion instead? After all, the sum of squared deviations increases in magnitude when the dispersion of a data set increases!
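The numbers for populations A and B can be verified directly. This sketch computes, for each population, the sum of deviations (which cancels to zero) and the average squared deviation (the variance); the dictionary `results` is just a convenient container introduced here.

```python
populations = {
    "A": [8, 9, 10, 11, 12],
    "B": [4, 7, 10, 13, 16],
}

results = {}
for name, data in populations.items():
    mu = sum(data) / len(data)
    deviations = [x - mu for x in data]
    variance = sum(d ** 2 for d in deviations) / len(data)
    results[name] = (sum(deviations), variance)
    print(name, mu, sum(deviations), variance)
# A 10.0 0.0 2.0
# B 10.0 0.0 18.0
```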
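Which data set has a larger dispersion?
Data set A: the value 1 five times and the value 3 five times (mean 2). Data set B: the values 1 and 5 (mean 3). Data set B is more dispersed around the mean.
Let us calculate the sum of squared deviations for both data sets:
\[ \text{Sum}_A = (1-2)^2 + \dots + (1-2)^2 + (3-2)^2 + \dots + (3-2)^2 = 10 \quad \text{(five terms of each kind)} \]
\[ \text{Sum}_B = (1-3)^2 + (5-3)^2 = 8 \]
The sum for A is larger than the sum for B, even though B is the more dispersed data set!
However, when calculated on a “per observation” basis (variance), the data set dispersions are properly ranked:
\[ \sigma_A^2 = \text{Sum}_A / N = 10/10 = 1 \qquad \sigma_B^2 = \text{Sum}_B / N = 8/2 = 4 \]
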
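The ranking problem with the raw sum of squared deviations can be sketched numerically. The sketch assumes, from the "5 times" annotations, that data set A contains the value 1 five times and the value 3 five times (ten observations in total) and that data set B contains only the two values 1 and 5.

```python
# Assumption: A holds 1 five times and 3 five times (N = 10); B holds 1 and 5 (N = 2).
data_a = [1] * 5 + [3] * 5
data_b = [1, 5]


def sum_of_squared_deviations(values):
    mu = sum(values) / len(values)
    return sum((x - mu) ** 2 for x in values)


sum_a = sum_of_squared_deviations(data_a)   # 10.0
sum_b = sum_of_squared_deviations(data_b)   # 8.0

print(sum_a > sum_b)                               # True: the raw sum ranks A as more dispersed
print(sum_a / len(data_a) < sum_b / len(data_b))   # True: per observation, B is more dispersed
```

Dividing by the number of observations removes the effect of data set size, which is why the variance, not the raw sum, is used as the measure of dispersion.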
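– Example 4.8
• Find the mean and the variance of the following sample of measurements (in years): 3.4, 2.5, 4.1, 1.2, 2.8, 3.7
– Solution
\[ \bar{x} = \frac{\sum_{i=1}^{6} x_i}{6} = \frac{3.4 + 2.5 + 4.1 + 1.2 + 2.8 + 3.7}{6} = \frac{17.7}{6} = 2.95 \]
A shortcut formula:
\[ s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1} = \frac{1}{n-1}\left[ \sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n} \right] \]
\[ s^2 = \frac{1}{5}\left[ (3.4^2 + 2.5^2 + \dots + 3.7^2) - \frac{(17.7)^2}{6} \right] = 1.075 \ \text{(years)}^2 \]
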
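The shortcut formula can be checked against the data of Example 4.8. The function name below is a label chosen for this sketch; the test of agreement with the definitional formula uses Python's `statistics.variance`.

```python
import statistics


def sample_variance_shortcut(values):
    """s^2 = [sum(x^2) - (sum(x))^2 / n] / (n - 1)."""
    n = len(values)
    total = sum(values)
    total_of_squares = sum(x * x for x in values)
    return (total_of_squares - total ** 2 / n) / (n - 1)


years = [3.4, 2.5, 4.1, 1.2, 2.8, 3.7]             # sample of Example 4.8
print(round(sum(years) / len(years), 2))           # 2.95
print(round(sample_variance_shortcut(years), 3))   # 1.075
print(round(statistics.variance(years), 3))        # 1.075
```

The shortcut needs only the sum and the sum of squares, so it can be computed in one pass over the data; with values that are large relative to their spread it is more sensitive to rounding error than the definitional form.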
– The standard deviation of a set of measurements is the square root of the variance of the measurements.
Sample standard deviation: \( s = \sqrt{s^2} \)
Population standard deviation: \( \sigma = \sqrt{\sigma^2} \)
– Example 4.9
• Rates of return over the past 10 years for two mutual funds are shown below. Which one has a higher level of risk?
Fund A: 8.3, -6.2, 20.9, -2.7, 33.6, 42.9, 24.4, 5.2, 3.1, 30.5
Fund B: 12.1, -2.8, 6.4, 12.2, 27.8, 25.3, 18.2, 10.7, -1.3, 11.4
– Solution
– Let us use the Excel printout that is run from the “Descriptive statistics” sub-menu (use file Xm04-09).

                      Fund A    Fund B
Mean                  16        12
Standard Error        5.295     3.152
Median                14.6      11.75
Mode                  #N/A      #N/A
Standard Deviation    16.74     9.969
Sample Variance       280.3     99.37
Kurtosis              -1.34     -0.46
Skewness              0.217     0.107
Range                 49.1      30.6
Minimum               -6.2      -2.8
Maximum               42.9      27.8
Sum                   160       120
Count                 10        10

Fund A should be considered riskier because its standard deviation is larger.
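The key figures of the printout can be reproduced without Excel. In this sketch the last Fund A return is entered as 30.5; that is an assumption, chosen because it is the value that reproduces the printout's sum of 160 and mean of 16.

```python
from statistics import mean, stdev

# Rates of return (%) from Example 4.9.
# Assumption: the last Fund A value is 30.5, which matches the printout's Sum of 160.
fund_a = [8.3, -6.2, 20.9, -2.7, 33.6, 42.9, 24.4, 5.2, 3.1, 30.5]
fund_b = [12.1, -2.8, 6.4, 12.2, 27.8, 25.3, 18.2, 10.7, -1.3, 11.4]

for name, returns in (("Fund A", fund_a), ("Fund B", fund_b)):
    print(name, round(mean(returns), 2), round(stdev(returns), 2))
# Fund A 16.0 16.74
# Fund B 12.0 9.97
```

`stdev` is the sample standard deviation (divisor n - 1), which is what the "Standard Deviation" row of the printout reports.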
The coefficient of variation
– The coefficient of variation of a set of measurements is the standard deviation divided by the mean value.
Sample coefficient of variation: \( cv = \dfrac{s}{\bar{x}} \)
Population coefficient of variation: \( CV = \dfrac{\sigma}{\mu} \)
– This coefficient provides a proportionate measure of variation.
A standard deviation of 10 may be perceived as large when the mean value is 100, but only moderately large when the mean value is 500.
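The comparison in the last sentence can be made concrete with a two-line calculation. The function name is illustrative only.

```python
def coefficient_of_variation(std_dev, mean_value):
    """Standard deviation divided by the mean."""
    if mean_value == 0:
        raise ValueError("the coefficient of variation is undefined for a zero mean")
    return std_dev / mean_value


print(coefficient_of_variation(10, 100))  # 0.1
print(coefficient_of_variation(10, 500))  # 0.02
```

The same standard deviation of 10 is 10% of a mean of 100 but only 2% of a mean of 500.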
Interpreting Standard Deviation
• The standard deviation can be used to
– compare the variability of several distributions
– make a statement about the general shape of a distribution.
• The empirical rule: If a sample of measurements has a mound-shaped distribution, the interval
– \((\bar{x} - s, \bar{x} + s)\) contains approximately 68% of the measurements
– \((\bar{x} - 2s, \bar{x} + 2s)\) contains approximately 95% of the measurements
– \((\bar{x} - 3s, \bar{x} + 3s)\) contains virtually all of the measurements
– Example 4.10
• The durations of 30 long-distance telephone calls are shown next. Check the empirical rule for this set of measurements.
• Solution
First check if the histogram has an approximate mound shape.
[Figure: histogram of the call durations, with bins 2, 5, 8, 11, 14, 17, 20, More and frequencies on a scale from 0 to 10.]

• Calculate the mean and the standard deviation:
Mean = 10.26; Standard deviation = 4.29.
• Calculate the intervals:
\[ (\bar{x} - s, \bar{x} + s) = (10.26 - 4.29, 10.26 + 4.29) = (5.97, 14.55) \]
\[ (\bar{x} - 2s, \bar{x} + 2s) = (1.68, 18.84) \]
\[ (\bar{x} - 3s, \bar{x} + 3s) = (-2.61, 23.13) \]

Interval          Empirical Rule    Actual percentage
5.97, 14.55       68%               70%
1.68, 18.84       95%               96.7%
-2.61, 23.13      100%              100%

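The interval arithmetic above can be sketched in Python. The 30 call durations themselves are not reproduced in these notes, so the helper `share_within` (a hypothetical name) is shown only as the way the "actual percentage" column could be computed once the data are available.

```python
def empirical_rule_intervals(mean_value, std_dev):
    """Return the intervals mean +/- k*s for k = 1, 2, 3, rounded to 2 decimals."""
    return {
        k: (round(mean_value - k * std_dev, 2), round(mean_value + k * std_dev, 2))
        for k in (1, 2, 3)
    }


def share_within(values, low, high):
    """Fraction of the measurements that lie strictly inside (low, high)."""
    return sum(low < x < high for x in values) / len(values)


intervals = empirical_rule_intervals(10.26, 4.29)
print(intervals)
# {1: (5.97, 14.55), 2: (1.68, 18.84), 3: (-2.61, 23.13)}
```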
Other conclusions
By the empirical rule, approximately 95% of the area under a mound-shaped histogram lies between \((\bar{x} - 2s, \bar{x} + 2s)\).
– Since about 95% of all the measurements fall within two standard deviations around the mean, the range spans roughly four standard deviations, which gives the approximation
\[ s \cong \frac{\text{Range}}{4} \]
– For the telephone calls duration problem the range is 19.5 - 2.3 = 17.2 minutes, so
\[ s \cong \frac{17.2}{4} = 4.3 \text{ minutes} \]
The Chebyshev theorem
– Given any set of measurements and a number k (not smaller than 1), the fraction of these measurements that lie within k standard deviations around the mean is at least \(1 - 1/k^2\).
– This theorem is valid for any set of measurements (sample, population) of any shape.

k   Interval                              Chebyshev                                 Empirical Rule
1   \((\bar{x} - s, \bar{x} + s)\)        at least 0%                               approximately 68%
2   \((\bar{x} - 2s, \bar{x} + 2s)\)      at least 75% (\(1 - 1/2^2 = 3/4\))        approximately 95%
3   \((\bar{x} - 3s, \bar{x} + 3s)\)      at least 89% (\(1 - 1/3^2 = 8/9\))        approximately 100%