Course Guide
MAT 211 Probability and Statistics
Summer 2015
Offered by
Department of Physical Sciences
School of Engineering and Computer Science
Independent University, Bangladesh
Course Coordinator: Dr. Shipra Banik, Associate Professor
Instructors:
Dr. Shipra Banik
Dr. Md. Hanif Murad
Dr. ABM Shahadat Hossain
Ms. Proma Anwer Khan
Ms. Rumana Hossain
Ms. Zainab Lutfun Nahar
Topics
Text/Reference
Lecture 1
Introduction: Definition: variable, scales of measurement, raw data, qualitative data, quantitative data, cross-sectional data, time series data, census survey, sample survey, target population, random sample, computer and statistical packages
Text/Reference: Course Guide, pp.7-8
HW: Text, Ex: 2,4,6,9-13, pp.21-23
Lecture 2
Lecture 3
Lecture 4
Lecture 5
Lecture 6
Review
Lecture 1 - Lecture 5
Lecture 7
Topic
Lecture 1-Lecture 5
Lecture 8
Lecture 9
Probability Theory: Random experiment, random variable, sample space, events, counting rules, tree diagram, probability defined on events
Text/Reference: Course Guide, pp.29-33
HW: Text, Ex: 1-9, pp.158-159; Ex: 14-21, pp.162-164
Lecture 10
Lecture 11
Review
Lecture 8 - Lecture 10
Lecture 12
Topic:
Lecture 8-Lecture 10
Lecture 13
Normal Distribution
Lecture 14
Lecture 13 continued
Lecture 15
Topic:
Lecture 13- Lecture 14
Lecture 16
Lecture 17
Lecture 18
Test of hypothesis
Lecture 19 continued
Test of hypothesis
Lecture 21
Lecture 22
Lecture 23
Lecture 24
Lecture 22 continued
Lecture-1
Chapter-1: Introduction
Important Definition:
Data, elements, variable, observations, raw data, qualitative data, quantitative data, scales of measurement
population, random sample, census, sample survey, cross-sectional data, time series data, Computer and
statistical analysis, glossary.
Textbook: Anderson D.R., Sweeney, D.J. and Thomas A.W. (2011), Statistics for Business and
Economics (11th Edition), South-Western, A Division of Thomson Learning.
Data (or Variable) - A characteristic that changes from one element to another.
Examples: Gender, Grade, Family size, Score, Age, and many others.
Gender, Grade - Qualitative data (letter)
Family size, Score, Age - Quantitative data (numeric value)
Family size - Whole number - Discrete data
Score, Age - Continuous data
Note: ID # and cell # are qualitative data, even though they are written with digits.
Observations - Data size (n)
A variable is denoted by X, Y, Z or by its first letter (e.g. Score S, Age A).
Elements - For a variable X, the elements are x1, x2, ..., xn.
Raw data - Data collected by survey, census, etc. It is also known as ungrouped data.
Note: We always start with raw data. We have to process it, or summarize it, using various statistical techniques (covered in Chapters 2-3).
Scales of measurement
Before analysis, the scale of each selected variable has to be defined, especially when the analysis is done with a statistical package (e.g. SPSS, Minitab, Stata, or even Excel). We have to assign a scale to each variable involved in the analysis.
There are four kinds of scale: nominal, ordinal, interval and ratio.
Nominal, ordinal - Qualitative data
Nominal scale - Variables like Name, ID, Address and Cell # are declared on this scale. The values are labels only, so no ordering or arithmetic is possible.
Ordinal scale - Qualitative data such as test performance (excellent, good, poor, etc.) or quality of food (good or bad) can be put in order. Some analysis is possible.
HW: Text
Ex: 2,4,6,9-13, pp.21-23
Lecture 2
Chapter-2: Summary of raw data
You will get an idea about the following:
Aim of presentation of raw data; Tabular form of raw data (e.g. Summarizing qualitative and quantitative
Data).
The aim of presentation of raw data is to make a large and complicated set of raw data into a more
compact and meaningful form. Usually, one can summarize the raw data by
(a) The tabular form
(b) The graphical form and
(c) Finally numerically such as measures of central tendency, measures of dispersion and others.
Under the tabular and the graphical form, we will learn frequency distribution (grouping data), bar graphs,
histograms, stem-leaf display method and others.
Presentations of data can be found in annual reports, newspaper articles and research studies. Everyone is exposed to these types of presentations. Hence, it is important to understand how they are prepared and how they should be interpreted.
As indicated in Lecture 1, data can be classified as either qualitative or quantitative.
The plan of this lecture is to introduce the tabular methods, which are commonly used to summarize both
the qualitative and the quantitative data.
Summarizing qualitative data
Recall raw data and find the following data:
Table 1: Test Performances of MAT 211
Good
Excellent
Poor
Excellent
Poor
Good
Poor
Excellent
Excellent
Good
Excellent
Excellent
Good
Poor
Good
Tabular summary
T            Tally marks   Frequency (fi)   Relative (percent) frequency rfi (pfi)
Excellent    ||||/ |       6                0.40 (40%)
Good         ||||/         5                0.33 (33%)
Poor         ||||          4                0.27 (27%)
Total                      n = 15           1.00 (100%)
(||||/ denotes a crossed group of five tally marks.)
[Figure: bar chart and pie chart of test performances; Excellent 40%, Good 33%, Poor 27%]
Data Summary: Our analysis shows that 40% of the test performances were excellent, 33% were good and 27% were poor.
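The tabular summary of Table 1 can also be produced with a few lines of code. The sketch below is only an illustration: Python is not part of the course material, and the names `performances` and `relative_frequencies` are made up for this example.

```python
from collections import Counter

# Table 1: test performances of MAT 211 (n = 15)
performances = [
    "Good", "Excellent", "Poor", "Excellent", "Poor",
    "Good", "Poor", "Excellent", "Excellent", "Good",
    "Excellent", "Excellent", "Good", "Poor", "Good",
]

def relative_frequencies(values):
    """Return {category: (frequency, relative frequency)}."""
    counts = Counter(values)
    n = len(values)
    return {cat: (f, f / n) for cat, f in counts.items()}

summary = relative_frequencies(performances)
for cat in ("Excellent", "Good", "Poor"):
    f, rf = summary[cat]
    print(f"{cat:<10} {f:>2}  {rf:.2f}")
```

Counting by category and dividing by n is all the table requires; the relative frequencies should add up to 1 apart from rounding.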
88
69
78
85
59
78
93
57
46
89
We know very well that these data are quantitative. Processing this kind of data differs a little from processing qualitative data. Proceed as follows:
Solution: Define T - Test Score, with n = 15.
We need to find the lowest and highest values of the given raw data set. Here L = 46 and H = 95. Assume the number of classes K = 5. Thus, the class size is c = (H-L)/K = (95-46)/5 = 9.8, which is rounded up to 10.
Tabular Summary
T        Tally marks   Frequency (fi)   Relative (percent) frequency rfi (pfi)   Cumulative frequency (Fi)
46-56    ||            2                0.13 (13%)                               2
56-66    ||            2                0.13 (13%)                               4
66-76    ||            2                0.13 (13%)                               6
76-86    |||           3                0.20 (20%)                               9
86-96    ||||/ |       6                0.40 (40%)                               15
(||||/ denotes a crossed group of five tally marks.)
HW: Text
Ex:4-10, pp.36-39
Ex:15-21, pp.46-48
Ex: 39, 41,42, pp.65-67
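The class size and class limits used in this lecture follow mechanically from L, H and K. A minimal Python sketch of that step is given below; the function name `class_limits` is an assumption of this example, and rounding the class size up to a whole number is the choice made in the lecture (9.8 becomes 10).

```python
import math

def class_limits(lowest, highest, k):
    """Return the class size c and a list of (lower, upper) class limits."""
    c = math.ceil((highest - lowest) / k)  # (95 - 46) / 5 = 9.8, rounded up
    limits = [(lowest + i * c, lowest + (i + 1) * c) for i in range(k)]
    return c, limits

c, limits = class_limits(46, 95, 5)
print(c)
print(limits)
```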
Lecture 3
Summarizing Raw Data Continued
Graphical summary: Histogram, Ogive
Recall the tabular summary of test scores from Lecture 2. Both graphs are drawn from that frequency table.
[Figure: Histogram and Ogive of test scores; x-axis: classes 46-56, 56-66, 66-76, 76-86, 86-96]
Data Summary
Our analysis shows that 6 students scored 86 to 96, only 2 students scored 46 to 56, and so on.
9 students scored less than 86, 6 students scored less than 76, and so on.
Other Graphical summaries: stem and leaf display, line chart
Line chart - Time plots of the stock indices (We need a time series data)
HW: Text
Ex:15-21, pp.46-48
Ex:25-28, pp.52-53
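Crosstabulation of X (quality of food) and Y (meal price, $) for 10 restaurants:
Y        Good     Very good   Excellent   Total
10-20    || (2)   || (2)      (0)         4
20-30    || (2)   || (2)      (0)         4
30-40    (0)      | (1)       | (1)       2
Total    4        5           1           n=10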
Data summary:
We see that 2 restaurants have very good food quality with meal prices ranging from $20 to $30, 1 restaurant has excellent food quality, 4 restaurants have meal prices ranging from $10 to $20, and so on.
Graphical method - scatter diagram
A scatter diagram provides the following information about the relationship between two variables:
Strength
Shape - linear, curved, etc.
Direction - positive or negative
Presence of outliers
Problem -2:
Now consider the following two variables: # of commercials and total sales for 5 sound equipment stores.
Data are as follows:
# of commercials: 2, 5, 1, 3, 4 and total sales: 50, 57, 41, 54, 54
[Figure: Scatter diagram of Sales and # of commercials for 5 sound equipment stores; x-axis: # of commercials, y-axis: Sales]
HW: Text
Ex: 31, 33-36, pp.60-61
Lecture 4
Chapter 3: Summarizing Raw Data (Numerical measures)
We will learn several numerical measures that provide a data summary using numeric formulas.
Now we will learn the following:
(1) Measures of average: simple mean, weighted mean, median, mode, quartiles, percentiles
(2) Measures of variation: Range, inter-quartile range, variance, standard deviation
(3) Measures of skewness: symmetry, positive skewness, negative skewness
(4) Measures of Kurtosis: leptokurtic, platykurtic and mesokurtic
Measures of average: simple mean, weighted mean, median, mode, quartiles, percentiles
Definition of average: It is a single central value that represents the whole set of data. Different
measures of averages are: simple mean, weighted mean, median, mode, quartiles, percentiles.
We will learn the above measures for the raw data and grouped data.
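Mean: Denoted by $\bar{x}$ and calculated by $\bar{x} = \sum x_i / n$.
For example, for a set of monthly starting salaries of 5 graduates: 3450, 3550, 3550, 3480, 3355, the mean is $\bar{x} = 17385/5 = 3477$.
Median and quartiles: sort the data and find the position i = (pn)/100 of the pth percentile. If i is not an integer, round up to the next integer; if i is an integer, average the values in positions i and i+1.
Sorted data: 3355, 3450, 3480, 3550, 3550
For Q2: i = (pn)/100 = (50*5)/100 = 2.50. The next integer is 3. Thus, Q2 is 3480.
For Q1: i = (pn)/100 = (25*5)/100 = 1.25. The next integer is 2. Thus, Q1 is 3450.
For Q3: i = (pn)/100 = (75*5)/100 = 3.75. The next integer is 4. Thus, Q3 is 3550.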
Now consider the following data: 3450, 3550, 3550, 3480, 3355, 3490
Here $\bar{x} = \sum x_i / n = 20875/6 = 3479.2$.
Sorted data: 3355, 3450, 3480, 3490, 3550, 3550
For Q2: i = (pn)/100 = (50*6)/100 = 3, which is an integer. Q2 is therefore the average of the 3rd and 4th observations of the sorted data.
Thus, Q2 = (3480+3490)/2 = 3485.
For Q1: i = (pn)/100 = (25*6)/100 = 1.50. The next integer is 2. Thus, Q1 is 3450.
For Q3: i = (pn)/100 = (75*6)/100 = 4.5. The next integer is 5. Thus, Q3 is 3550.
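The location rule used in these calculations (i = pn/100; take the next integer when i is not an integer, average positions i and i+1 when it is) can be sketched in Python. This is only an illustration of the rule as applied in this guide; statistical packages often use other interpolation rules and may report slightly different quartiles. The function name `percentile` is chosen for this sketch, and it assumes 0 < p < 100.

```python
import math

def percentile(data, p):
    """pth percentile (0 < p < 100) using the location rule i = p*n/100."""
    xs = sorted(data)
    n = len(xs)
    if (p * n) % 100 == 0:
        i = p * n // 100                  # i is an integer
        return (xs[i - 1] + xs[i]) / 2    # average of positions i and i+1
    i = math.ceil(p * n / 100)            # round up to the next integer
    return xs[i - 1]                      # positions are 1-based

five = [3450, 3550, 3550, 3480, 3355]
six = [3450, 3550, 3550, 3480, 3355, 3490]
print(percentile(five, 50), percentile(six, 50))
```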
Mode: It is the value that occurs with the greatest frequency. Denoted by Mo.
Consider the following observations:
(1) 3450, 3550, 3550, 3480, 3355 - Mo is 3550.
(2) 3450, 3550, 3550, 3480, 3450 - Mo are 3450 and 3550.
(3) 3450, 3550, 3550, 3450, 3450 - Mo is 3450.
(4) 3450, 3650, 3550, 3480, 3355 - no mode.
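The four cases above can be checked with a small helper. This is a sketch only; the name `modes` is an assumption of this example, and returning an empty list when no value repeats is how "no mode" is represented here.

```python
from collections import Counter

def modes(values):
    """Return the list of modes, or [] when no value repeats."""
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:
        return []          # every value occurs once: no mode
    return sorted(v for v, f in counts.items() if f == top)

print(modes([3450, 3550, 3550, 3480, 3450]))
```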
Data Summary:
Mean = 3477 means that the average monthly starting salary of the graduates is $3477.
Median = 3485 means that 50% of the graduates have monthly starting salaries below $3485 and the remaining 50% have monthly starting salaries above $3485.
First quartile = 3450 means that 25% of the graduates have monthly starting salaries below $3450 and the remaining 75% have monthly starting salaries above $3450.
Third quartile = 3550 means that 75% of the graduates have monthly starting salaries below $3550 and the remaining 25% have monthly starting salaries above $3550.
Mode = 3450 means that the most common monthly starting salary is $3450.
HW: Text
Ex: 5-10, pp.92-94
Lecture 5
Chapter 3_Numerical measures continued
We will learn measures of variation
Recall the concept of average (Ref. Lecture 4). Suppose we have the following 2 sets of raw data:
1) 15, 15, 15, 15, 15 - Average 15 and variation 0.
2) 15, 16, 19, 13, 12 - Average 15 and variation (SD) 2.74.
Statistical meaning of variation
Ask the question: is there any difference between each observation and the average value?
Suppose X is the score of CT1 (class test 1) and the calculated average score is 15.
The next investigation is to look at the difference between each student's mark and the average mark.
If the difference is 0, the student's score and the average score are the same.
If the difference is positive (negative), the student's score is greater (lower) than the average score.
How can we measure the variation of a data set? Various measures (or formulas) are available. These are:
1. Range, R = H - L, where H is the highest value and L the lowest value of the data set
2. Inter-quartile range, IR = p75 - p25, where p75 is the 75th percentile and p25 the 25th percentile
3. Variance, $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$
4. Standard deviation (SD), $s = \sqrt{s^2}$. That means SD = sqrt(variance).
Note: Measures of variation cannot be negative. The smallest possible value is 0, which indicates that all students got the same score.
Calculation for variance and SD
Recall the monthly starting salaries of 5 graduates: 3450, 3550, 3550, 3480, 3355, where we found mean = 3477.
xi      (xi - mean)^2
3450    729
3550    5329
3550    5329
3480    9
3355    14884
Total   26280
Variance = 26280/(5-1) = 6570 and SD = sqrt(6570) = 81.06.
Data summary: SD = 81.06 indicates that the graduates' salaries deviate from the average salary of $3477 by about $81.
Note: Variance is hard to interpret because its unit is squared. For example, if the mean is $3477, the variance is 6570 squared dollars. Taking the square root of the variance removes this problem by returning to the original unit of the data; the result is the standard deviation (SD).
So we do not interpret the variance and always report the SD.
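As a cross-check of the hand calculation, the sample variance and SD (with divisor n - 1, as in this guide) can be computed in a few lines of Python. This is an illustrative sketch; the function names are chosen for this example.

```python
import math

def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)

def sample_sd(values):
    """Standard deviation: square root of the sample variance."""
    return math.sqrt(sample_variance(values))

salaries = [3450, 3550, 3550, 3480, 3355]
print(sample_variance(salaries))       # 6570.0
print(round(sample_sd(salaries), 2))   # 81.06
```

The same functions give SD 0 for the data set 15, 15, 15, 15, 15 and about 2.74 for 15, 16, 19, 13, 12.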
Coefficient of variation
See Text, p.99
HW: Text, Ex: 16-24, pp.100-102
Lecture 7
Class Test 1 (20%)
Exam Time: 90 minutes
Requirements:
1) A two-variable scientific calculator is required (no alternatives).
2) Mobile phones must be switched off during the exam.
Format of questions
1) Lecture 1- Lecture 5 solved and HW problems
2) Related Text book questions
Lecture 8
Working with grouped data
So far we have focused on calculating measures of average and variation for ungrouped (raw) data.
Sometimes only grouped data (a frequency table) is available. In this situation, the formulas for ungrouped (raw) data cannot be applied directly. Proceed as follows:
Recall the tabular summary from Lecture 2, where X - Test Score and n = 15. Here fi is the frequency (# of students) of the ith class, i = 1, 2, ..., 5, and mi is its midpoint.
X        fi      mi    fi*mi   fi*(mi - mean)^2
46-56    2       51    102     1352
56-66    2       61    122     512
66-76    2       71    142     72
76-86    3       81    243     48
86-96    6       91    546     1176
Total    n=15          1155    3160
Mean = $\sum f_i m_i / n$ = 1155/15 = 77
Variance = $\sum f_i (m_i - \bar{x})^2 / (n-1)$ = 3160/14 = 225.71 and SD = sqrt(variance) = sqrt(225.71) = 15.02
Data Summary: SD = 15.02 indicates that the students' scores deviate from the average score of 77 by about 15 marks.
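The grouped-data calculation can be verified with a short Python sketch. It uses the class midpoints as stand-ins for the individual scores, exactly as in the table; the function name `grouped_mean_sd` is an assumption of this example.

```python
import math

def grouped_mean_sd(classes, freqs):
    """Mean, variance and SD from class limits and frequencies."""
    n = sum(freqs)
    mids = [(lo + hi) / 2 for lo, hi in classes]
    mean = sum(f * m for f, m in zip(freqs, mids)) / n
    var = sum(f * (m - mean) ** 2 for f, m in zip(freqs, mids)) / (n - 1)
    return mean, var, math.sqrt(var)

classes = [(46, 56), (56, 66), (66, 76), (76, 86), (86, 96)]
freqs = [2, 2, 2, 3, 6]
mean, var, sd = grouped_mean_sd(classes, freqs)
print(mean, round(var, 2), round(sd, 2))
```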
HW: Text, Ex: 54-55, pp.128-129
Measures of Skewness
We can get a general impression of skewness by drawing a histogram. To understand the concept of
skewness, consider the following 3 histograms:
[Figure-1, Figure-2, Figure-3: three histograms illustrating skewness]
Problem
Suppose X is the test score and skewness is measured by SK = 3(mean - median)/SD. Let mean = 15, median (50th percentile or 2nd quartile) = 17 and SD = 3.
Here SK = 3(15 - 17)/3 = -2.00.
Data summary: SK = -2.00 means that the test scores are negatively skewed: most students scored over 15.
Let mean = 18, median = 14 and SD = 5. Here SK = 3(18 - 14)/5 = 2.40.
Data summary: SK = 2.40 means that the test scores are positively skewed: most students scored below 18.
Let mean = 16, median = 16 and SD = 5. Here SK = 0.
Data summary: SK = 0 means that the test scores are symmetric: about as many students scored below 16 as over 16.
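The three SK values above follow from the skewness formula given in the Formulae section of this guide, SK = 3(mean - median)/SD. A minimal Python sketch, with a function name chosen for this example:

```python
def pearson_skewness(mean, median, sd):
    """Skewness measure SK = 3 * (mean - median) / SD."""
    return 3 * (mean - median) / sd

print(pearson_skewness(15, 17, 3))   # -2.0
print(pearson_skewness(16, 16, 5))   # 0.0
```

A negative value signals a longer left tail, a positive value a longer right tail, and 0 a symmetric distribution.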
Kurtosis
If a distribution is symmetric, the next question is about the central peak: is it high and sharp, or short and broad?
Pearson (1905) described kurtosis in comparison with the normal distribution and used the terms leptokurtic, platykurtic and mesokurtic to describe different distributions.
If the distribution has more values in the tails and a sharper peak, it is leptokurtic. A leptokurtic curve, like two leaping kangaroos, has long tails and is peaked up in the center.
If there are fewer values in the tails, more in the shoulders and fewer in the peak, it is platykurtic.
A platykurtic curve, like a platypus, has a short tail and is flat-topped.
HW: Text book
Ex: 5 and 6, pp.92-93 (Calculate skewness and interpret)
5.8
4.6
4.9
7.1
5.2
8.1
0.2
3.4
4.5
8.0
7.9
6.1
5.6
5.5
3.1
6.8
4.6
3.8
2.6
4.5
4.6
7.7
3.8
4.1
6.1
4.1
4.4
5.2
1.5
4 114556669
5 22568
6 1128
7 179
8 01
Interpretation: Table 1 shows that 9 programs need 4.1 to 4.9 seconds to run, 5 programs need 5.2 to 5.8 seconds, and so on.
b)
Table 2: Frequency Distribution of X
X (Classes)   Frequency (fi)
0.2-1.5       1
1.5-2.8       1
2.8-4.1       6
4.1-5.4       10
5.4-6.7       6
6.7-8.1       6
Total         n = 30

Table 3: Relative Frequency Distribution of X
X (Classes)   Relative Frequency (rfi)
0.2-1.5       0.03
1.5-2.8       0.03
2.8-4.1       0.20
4.1-5.4       0.33
5.4-6.7       0.20
6.7-8.1       0.20
Total         $\sum_i rf_i = 1$

Table 4: Cumulative Frequency Distribution of X
X (Classes)   Cumulative Frequency (Fi)
0.2-1.5       1
1.5-2.8       2
2.8-4.1       8
4.1-5.4       18
5.4-6.7       24
6.7-8.1       30
Interpretation: Table 2 shows that 10 programs need 4.1 to 5.4 seconds to run, Table 3 shows that 33 percent of the programs need 4.1 to 5.4 seconds, Table 4 shows that 18 programs need at most 5.4 seconds, and so on.
c)
Descriptive Statistics: X
Variable   n    Minimum   Mean   Maximum
X          30   0.20      5.0    8.1
Variable   Median(Q2)   StDev   Q1     Q3
X          4.75         1.859   4.02   6.12
Interpretation:
Mean = 5.0 seconds means that, on average, a program needs about 5 seconds to run.
Median = 4.75 seconds means that 50% of the programs need less than 4.75 seconds to run and the other 50% need more than 4.75 seconds.
Standard deviation = 1.859 seconds means that the run times are not all equal to 5 seconds; they deviate from the mean by about 1.9 seconds.
Q1 = 4.02 seconds means that the first 25% of the programs need less than 4.02 seconds to run and the other 75% need more than 4.02 seconds.
Q3 = 6.12 seconds means that the first 75% of the programs need less than 6.12 seconds to run and the other 25% need more than 6.12 seconds.
Formulae
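Mean - Ungrouped data:
Formula: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$, where $\sum$ is the summation sign.
Mean - Grouped data:
Formula: $\bar{x} = \frac{1}{n}\sum_{i=1}^{k} f_i m_i$, where $f_i$ is the frequency of the ith class, $m_i$ is the midpoint of the ith class, midpoint = (LCL+UCL)/2 of the ith class, and k is the total number of classes.
Median - Grouped data:
Formula: $M_e = l_{M_e} + \frac{\frac{n}{2} - F_{M_e-1}}{f_{M_e}} \times c$, where $l_{M_e}$ is the LCL of the median class, $F_{M_e-1}$ is the cf below the median class, $f_{M_e}$ is the frequency of the median class and c is the size of the median class.
Percentile - Ungrouped data:
Compute i = (pn)/100. If i is not an integer, round up to the next integer; the pth percentile is the value in that position. If i is an integer, the pth percentile is the mean of the values in positions i and i + 1.
Percentile - Grouped data:
Formula: $p_i = l_{p_i} + \frac{\frac{in}{100} - F_{p_i-1}}{f_{p_i}} \times c$, where $l_{p_i}$ is the LCL of the ith percentile class, $F_{p_i-1}$ is the cf below the ith percentile class, $f_{p_i}$ is the frequency of the ith percentile class and c is the size of the ith percentile class. For application, refer to the calculation of the median for grouped data.
Standard Deviation for Ungrouped Data:
Formula: $s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$, where $x_i$ are the raw data and $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$.
Standard Deviation for Grouped Data:
Formula: $s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{k} f_i (m_i - \bar{x})^2}$, where $f_i$ are the frequencies of the ith class, $m_i$ are the mid-points and $\bar{x} = \frac{1}{n}\sum_{i=1}^{k} f_i m_i$.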
d) Histogram of X
[Figure: histogram of cpu time; x-axis: classes 0.2-1.5, 1.5-2.8, 2.8-4.1, 4.1-5.4, 5.4-6.7, 6.7-8.1; y-axis: # of programs]
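Skewness:
Formula: $S_k = \frac{3(\bar{x} - M_e)}{s}$, where $-3 \le S_k \le +3$, i.e. skewness can range from -3 to +3.
Here $S_k = 3(5.0 - 4.75)/1.859 = 0.4034$.
Interpretation: Sk = 0.4034 is positive, which indicates that the run times are slightly positively skewed: a few programs need considerably more than 5 seconds to run.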
Lecture 9
Chapter-4_Introduction to probability
Some Important Definitions:
Random experiment, random variable, sample space, events (simple event, compound event), counting
rules, combinations, permutations, tree diagram, probability defined on events
Introduction
We have finished the first important part of the course (known as data summary) and have sat for CT1. We now move to the second very important part of the course, namely Chance Theory, also known as Probability Theory. We use the words chance and probability frequently in real life. For example:
(i) What is the chance of getting grade A for the course MAT 211?
(ii) What is the chance that sales will decrease if we increase the price of a commodity?
(iii) What is the chance that a new investment will be profitable?
$0 \le P(E) \le 1$
If
P(E) = 0, there is no chance that E will occur (impossible event).
P(E) = 0.5, there is a 50% chance that E will occur.
P(E) = 1.0, there is a 100% chance that E will occur (sure event).
Recall grade example
P(grade A) =(#of E)/S = 95/100 = 0.95, where S = {all possible grades}, E = grade A.
Summary: There is a 95% chance that a randomly selected student will get grade A.
Some random experiments and S (Text, p.143)
Random experiment 1:
Toss a fair coin. S = {H,T}, H - head and T - tail. If E is head, then P(H) = 1/2 = 0.5 and P(T) = 1/2 = 0.5.
Random experiment 2:
Select a part for inspection, S = {defective, non-defective}.
Random experiment 3:
Conduct a sales call, S = {purchase, no purchase}.
Random experiment 4:
Roll a fair die, S={1,2,3,4,5,6}.
Random experiment 5:
Play a football game, S = {win, lose, tie}.
Note that the possible events in the sample space S are read as "or", not "and". It is impossible to get H and T in one toss. Winning and losing the same game is also impossible.
Important concepts
Mutually exclusive events, equally likely events, tree diagram, combination
Mutually exclusive events - Two events that cannot occur simultaneously. Toss a coin: H and T cannot occur in a single random experiment. This is written as $P(H \cap T) = 0$.
If $P(H \cap T) \ne 0$, the events are not mutually exclusive (they are mutually inclusive). Toss two coins (or one coin two times): H and T can both occur in this random experiment. For example, P(one H and one T) = 2/4 = 0.50, where S = {HH, HT, TH, TT}.
Equally likely events - Two events that have an equal chance of occurring. Toss a coin: P(H) = P(T) = 0.5.
Tree diagram - A technique to summarize all possible events of a random experiment graphically.
Combination - A formula to count all possible events of a random experiment.
Counting rules - Two rules: Combination and Permutation
Combination - It allows one to count the number of experimental outcomes when the experiment involves selecting n objects from a set of N objects and the order of selection is not important: $C^N_n = \frac{N!}{n!\,(N-n)!}$.
For example, if we want to select 5 students from a group of 10 students, then
$C^{10}_5 = \frac{10!}{5!\,(10-5)!} = 252$.
Permutation - It allows one to count the number of experimental outcomes when n objects are to be selected from a set of N objects, where the order of selection is important: $P^N_n = \frac{N!}{(N-n)!}$.
For example, if we want to select 5 students from a group of 10 students (where order is important), then
$P^{10}_5 = \frac{10!}{(10-5)!} = 30240$.
Ex: 1. In how many ways can three items be selected from a group of six items? Use the letters A, B, C, D, E and F to identify the items and list each of the different combinations of three items.
Solution: $C^6_3 = \frac{6!}{3!\,(6-3)!} = 20$ possible ways the letters can be selected. Some examples: ABC, ABD, ...
Ex: 2. How many permutations of three items can be selected from a group of six items? Use the letters A, B, C, D, E and F to identify the items and list each of the different permutations of items B, D and F.
Solution: $P^6_3 = \frac{6!}{(6-3)!} = 120$.
Different permutations of items B, D and F: BDF, BFD, DBF, DFB, FDB, FBD, i.e. 6 outcomes.
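The counting rules can be checked with Python's standard library. This is an illustrative sketch and not part of the course material; `math.comb` and `math.perm` require Python 3.8 or later, and the variable names are chosen for this example.

```python
import math
from itertools import combinations, permutations

# Selecting 5 students from 10
print(math.comb(10, 5))   # 252 (order not important)
print(math.perm(10, 5))   # 30240 (order important)

# Ex: 1 - combinations of three items from A..F
items = "ABCDEF"
combos = ["".join(c) for c in combinations(items, 3)]
print(len(combos), combos[:2])   # 20 ['ABC', 'ABD']

# Ex: 2 - permutations of three items from six, and of B, D, F
print(math.perm(6, 3))    # 120
bdf = ["".join(p) for p in permutations("BDF")]
print(len(bdf))           # 6
```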
Ex:3: An experiment with three outcomes has been repeated 50 times and it was learned that E1 occurred
20 times, E2 occurred 13 times and E3 occurred 17 times. Assign probabilities to the outcomes.
Solution: S = {E1, E2, E3}. Here P(E1) = 20/50=0.40, P(E2) = 13/50=0.26, P(E3) = 17/50=0.34
P(S) = P(E1)+ P(E2)+ P(E3)= 0.40+0.26+0.34 = 1.0
Ex:4: A decision maker subjectively assigned the following probabilities to the four outcomes of an
experiment: P(E1) = 0.10, P(E2) = 0.15, P(E3) = 0.40 and P(E4) = 0.20. Are these probability
assignments valid? Explain
Solution: S = {E1, E2, E3, E4}. Here P(E1) = 0.10, P(E2) = 0.15, P(E3) = 0.40, P(E4) = 0.20.
P(S) = P(E1)+P(E2)+P(E3)+P(E4) = 0.10+0.15+0.40+0.20 = 0.85 < 1.0. Thus, the probability assignments are invalid because $P(S) \ne 1$.
The above two problems tell us that, for any random experiment, P(S) = 1.
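The check used in Ex: 3 and Ex: 4 (each probability between 0 and 1, and all of them adding up to 1) is easy to automate. The sketch below is an illustration; the function name `is_valid_assignment` is an assumption, and `math.isclose` is used because decimal fractions are not stored exactly in floating point.

```python
import math

def is_valid_assignment(probs):
    """True when every probability is in [0, 1] and they sum to 1."""
    return all(0 <= p <= 1 for p in probs) and math.isclose(sum(probs), 1.0)

ex3 = [20 / 50, 13 / 50, 17 / 50]      # relative frequencies from Ex: 3
ex4 = [0.10, 0.15, 0.40, 0.20]         # subjective probabilities from Ex: 4
print(is_valid_assignment(ex3), is_valid_assignment(ex4))
```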
Ex:5: Suppose that a manager of a large apartment complex provide the following probability estimates
about the number of vacancies that will exist next month
Vacancies:     0      1      2      3      4      5
Probability:   0.05   0.15   0.35   0.25   0.10   0.10
Ex:6: The National Sporting Goods Association conducted a survey of persons 7 years of age or older
about participation in sports activities. The total population in this age group was reported at 248.5
million, with 120.9 million male and 127.6 million female. The number of participation for the top five
sports activities appears here
                              Participants (millions)
Activity                      Male     Female
Bicycle riding                22.2     21.0
Camping                       25.6     24.3
Exercise walking              28.7     57.7
Exercising with equipment     20.4     24.4
Swimming                      26.4     34.4
a. For a randomly selected female, estimate the probability of participation in each of the sports
activities
b. For a randomly selected male, estimate the probability of participation in each of the sports
activities
c. For a randomly selected person, what is the probability the person participates to exercise
walking?
d. Suppose you just happen to see an exercise walker going by. What is the probability the walker is
a woman? What is the probability the walker is a man?
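Solution: S = {Br, C, EW, EE, S}, where Br - Bicycle riding, C - Camping, EW - Exercise walking, EE - Exercising with equipment and S - Swimming.
a. The selected person is known to be female, so each probability is taken over the 127.6 million females: P(Br/F) = 21.0/127.6 = 0.16, P(C/F) = 24.3/127.6 = 0.19, P(EW/F) = 57.7/127.6 = 0.45, P(EE/F) = 24.4/127.6 = 0.19, P(S/F) = 34.4/127.6 = 0.27.
b. The selected person is known to be male, so each probability is taken over the 120.9 million males: P(Br/M) = 22.2/120.9 = 0.18, P(C/M) = 25.6/120.9 = 0.21, P(EW/M) = 28.7/120.9 = 0.24, P(EE/M) = 20.4/120.9 = 0.17, P(S/M) = 26.4/120.9 = 0.22.
c. The person can be male or female. Thus, P(EW) = P(Male and EW) + P(Female and EW) = (28.7/248.5) + (57.7/248.5) = 86.4/248.5 = 0.35 = 35%.
d. We have to consider the exercise walker population. Thus, P(woman/EW) = 57.7/(28.7+57.7) = 57.7/86.4 = 0.67 = 67%. P(man/EW) = 28.7/(28.7+57.7) = 28.7/86.4 = 0.33 = 33%.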
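The arithmetic for Ex: 6 can be reproduced in Python. This is a sketch for checking the numbers only; the dictionary names are assumptions of this example, and the per-gender rates divide by the male and female population totals stated in the exercise (120.9 and 127.6 million).

```python
# Participants in millions (Ex: 6)
male = {"Br": 22.2, "C": 25.6, "EW": 28.7, "EE": 20.4, "S": 26.4}
female = {"Br": 21.0, "C": 24.3, "EW": 57.7, "EE": 24.4, "S": 34.4}
n_male, n_female = 120.9, 127.6
n_total = n_male + n_female

# Participation rates within each gender
p_given_female = {a: x / n_female for a, x in female.items()}
p_given_male = {a: x / n_male for a, x in male.items()}

# Part c: a randomly selected person is an exercise walker
p_ew = (male["EW"] + female["EW"]) / n_total

# Part d: restrict the sample space to exercise walkers
p_woman_given_ew = female["EW"] / (male["EW"] + female["EW"])
p_man_given_ew = 1 - p_woman_given_ew

print(round(p_ew, 2), round(p_woman_given_ew, 2), round(p_man_given_ew, 2))
```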
HW: Textbook
Ex: 1-9, pp.158-159
Ex: 14-21, pp.162-164
Lecture 10
Basic relationships of probability (addition law, complement law, conditional law, multiplication law)
Addition Law
Suppose we have two events A and B ($A, B \subset S$). The chance of A or B occurring is written as
$P(A \cup B) = P(A) + P(B) - P(A \cap B)$, if the two events are not mutually exclusive.
$P(A \cup B) = P(A) + P(B)$, if the two events are mutually exclusive.
Above average
14
Excellent
13
What is the chance that the viewer will rate the new show as average or better?
What is the chance that the viewer will rate the new show as average or worse?
The viewer will rate the new show as average or better, chance is 76%.
(ii)
The viewer will rate the new show as average or worse, chance is 46%.
What is the chance that the randomly selected worker completed work will not be late?
Suppose one employee if selected randomly what is the probability that the worker completed
work as late nor will assembled a defective product?
Roll a fair die. Find the probability of getting:
(i) 2
(ii) Even number
(iii) 2 or even number
(iv) Not 2
(v) 2 given that the die will show an even number
(vi) 2 given that the die will show an odd number
Solution: n(S) = 6. (i) P(2) = 1/6 (ii) P(Even number) = 3/6 (iii) $P(2 \cup \text{even number})$ = (1/6)+(3/6)-(1/6) = 3/6
(iv) $P(2^c)$ = 1-P(2) = 5/6 (v) P(2/even number) = 1/3 (vi) P(2/odd number) = 0
Observe carefully that (i) to (iv) are unconditional probabilities, but (v) and (vi) are conditional probabilities.
To calculate (i) to (iv) we used the unconditional sample space, whereas to calculate (v) and (vi) we used the conditional sample space, restricted by the given condition that the roll shows an even or an odd number.
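The difference between the unconditional and the conditional sample space can be made concrete by enumerating the outcomes of the die. The sketch below is an illustration only; the helper `prob` and its `given` argument are names invented for this example, and exact fractions are used to avoid rounding.

```python
from fractions import Fraction

S = {1, 2, 3, 4, 5, 6}   # sample space of a fair die

def prob(event, given=None):
    """P(event) over S, or over the reduced sample space `given`."""
    space = S if given is None else given
    return Fraction(len(event & space), len(space))

two = {2}
even = {2, 4, 6}
odd = {1, 3, 5}

print(prob(two), prob(even), prob(two | even), prob(S - two))
print(prob(two, given=even), prob(two, given=odd))
```

Passing `given=even` shrinks the sample space from six outcomes to three, which is why the probability of a 2 changes from 1/6 to 1/3.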
Multiplication law
Suppose we have two events A and B ($A, B \subset S$). The chance of getting A and B is defined as
$P(A \cap B) = P(A)\,P(B/A) = P(B)\,P(A/B)$
               Men    Women   Total
Promoted       288    36      324
Not promoted   672    204     876
Total          960    240     1200
a) Joint probability table:
               Men    Women   Total
Promoted       0.24   0.03    0.27
Not promoted   0.56   0.17    0.73
Total          0.80   0.20    1.00
b) P(Men) = 0.80, P(Women) = 0.20, P(Promoted) = 0.27, P(Not Promoted) = 0.73. These are known as marginal probabilities.
c) P(Promoted/Men) = 288/960 = 0.30.
d) P(Not Promoted/Women) = 204/240 = 0.85.
e) P(Men/Promoted) = 288/324 = 0.89.
f) P(Women/Not Promoted) = 204/876 = 0.23.
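Joint, marginal and conditional probabilities for the promotion table can all be derived from the four cell counts. The Python sketch below is illustrative; the row labels "Promoted" and "Not promoted" and the helper names `marginal` and `conditional` are assumptions of this example.

```python
counts = {
    ("Promoted", "Men"): 288, ("Promoted", "Women"): 36,
    ("Not promoted", "Men"): 672, ("Not promoted", "Women"): 204,
}
n = sum(counts.values())   # 1200

# Joint probabilities: each cell divided by the grand total
joint = {key: c / n for key, c in counts.items()}

def marginal(label):
    """Sum the joint probabilities of every cell carrying this label."""
    return sum(p for key, p in joint.items() if label in key)

def conditional(event, given):
    """P(event / given) = P(event and given) / P(given)."""
    both = sum(p for key, p in joint.items() if event in key and given in key)
    return both / marginal(given)

print(round(marginal("Men"), 2), round(conditional("Promoted", "Men"), 2))
```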
Lecture 12
Mid-term test -20%
Requirements:
1) A two-variable scientific calculator is required (no alternatives).
2) Mobile phones must be switched off during the exam.
Format of questions
1) Lecture 8-Lecture 10 solved and HW problems
2) Related Text book questions
/Good Luck/
Lecture 13
Chapter 6
Normal distribution
Discovered by Abraham de Moivre, a French mathematician, in 1733.
The form or shape can be given in the following:
[Figure: normal curve]
The mathematical equation depends upon the two parameters mean ($\mu$) and standard deviation ($\sigma$) as follows:
$f(X) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{X-\mu}{\sigma}\right)^2}$, $-\infty < X < \infty$
where $\mu$ = mean of the normal variable, $\sigma$ = SD of the normal variable ($\mu$ and $\sigma$ determine the location and shape of the normal probability distribution) and $\pi$ and e are mathematical constants, whose values are approximately 3.14 and 2.718 respectively.
By notation X ~ N($\mu$, $\sigma$), read as X is normally distributed with mean $\mu$ and standard deviation $\sigma$.
It is true that once $\mu$ and $\sigma$ are specified, the normal curve is completely determined.
Standard Normal Probability Distribution
A random variable that has a normal distribution with a mean of zero and a standard deviation of one is said to have a standard normal probability distribution.
The letter Z is commonly used to designate this particular normal random variable.
For the standard normal probability distribution, areas under the normal curve have been computed and are available in tables that can be used in computing probabilities.
The final page table is an example of such a table.
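The normal density and the areas listed in a standard normal table can be evaluated numerically. The sketch below is an illustration, not the table from the text: the function names are chosen for this example, and the cumulative area is obtained from the error function `math.erf` in Python's standard library rather than from a printed table.

```python
import math

def normal_pdf(x, mu, sigma):
    """Height of the normal curve with mean mu and SD sigma at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

def standard_normal_cdf(z):
    """Area under the standard normal curve to the left of z."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

print(round(normal_pdf(0, 0, 1), 4))         # 0.3989
print(round(standard_normal_cdf(1.96), 3))   # 0.975
```

The area between -1.96 and 1.96 is then about 0.95, which is where the value 1.96 used for 95% confidence intervals comes from.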
Lecture 15
Class Test _2 -15%
Requirements:
1) A two-variable scientific calculator is required (no alternatives).
2) Mobile phones must be switched off during the exam.
Format of questions
1) Lecture 13-Lecture 14 solved and HW problems
2) Related Text book questions
Lecture 16
Chapter 8
Random Sampling
Our step here is how to collect random samples from the target population and how to summarize the collected raw data in effective ways so that general readers can understand it clearly.
Generally, there are two ways the required information may be obtained:
a) Census survey and
b) Sample survey.
The total count of all units of the population for a certain characteristic is known as complete enumeration, also termed census survey.
The money, manpower and time required for carrying out complete enumeration will generally be large, and there are many situations where complete enumeration is not possible. Thus, sample enumeration or sample survey is used to select a random part of the population using the table of random numbers (e.g. see Text, p.269), which has been constructed from the digits 0, 1, ..., 9.
To illustrate how to select sample by the method of use of table of random numbers, consider the
following problem:
Suppose the monthly pocket money (TK/-) given to each of the 50 School of Business students at IUB as
follows:
Pocket Money (TK/-)
1100  1500  8900  4500  2700  3800  3000  6700  2600  3600
7500  7900  4600  2000  2400  1300  8500  6500  6200  5800
6000  6800  9200  3800  1200  8000  7100  8600  8700  6300
7600  7700  2600  7800  2000  9000  7300  8400  1700  2500
5700  5300  5500  1700  3700  5400  2400  4000  1200  7300
To draw a random sample of size 10 from a population of size 50, we first need to label the 50 units of the population with the numbers 1 to 50.
Pocket Money (TK/-), with the unit number in parentheses
1100(1)   1500(2)   8900(3)   4500(4)   2700(5)   3800(6)   3000(7)   6700(8)   2600(9)   3600(10)
7500(11)  7900(12)  4600(13)  2000(14)  2400(15)  1300(16)  8500(17)  6500(18)  6200(19)  5800(20)
6000(21)  6800(22)  9200(23)  3800(24)  1200(25)  8000(26)  7100(27)  8600(28)  8700(29)  6300(30)
7600(31)  7700(32)  2600(33)  7800(34)  2000(35)  9000(36)  7300(37)  8400(38)  1700(39)  2500(40)
5700(41)  5300(42)  5500(43)  1700(44)  3700(45)  5400(46)  2400(47)  4000(48)  1200(49)  7300(50)
Then, in the given random number table, start with the first number and move row-wise (or column-wise or diagonally), picking out the numbers in pairs, one by one, and ignoring those numbers which are greater than 50, until 10 numbers have been selected.
# Selected row-wise sample numbers: 27, 15, 45, 11, 02, 14, 18, 07, 39, 31
# Selected row-wise monthly pocket money (TK/-) of 10 students out of 50: 7100, 2400, 3700, 7500, 1500, 2000, 6500, 3000, 1700, 7600
HW:
Calculate mean and standard deviation of 10 students monthly pocket money (Use formula and
Scientific calculator)
Text, Ex: 3-8, pp.272-273
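The selection procedure described in this lecture (read two-digit numbers, skip those above 50, stop after 10 units) can be sketched in Python. The stream of two-digit numbers below is hypothetical: the entries above 50 are invented for this example to show the skipping step, while the accepted numbers are the ones listed in the lecture. Skipping repeated numbers is also an assumption of this sketch (sampling without replacement); the function name `select_sample` is made up.

```python
population = [
    1100, 1500, 8900, 4500, 2700, 3800, 3000, 6700, 2600, 3600,
    7500, 7900, 4600, 2000, 2400, 1300, 8500, 6500, 6200, 5800,
    6000, 6800, 9200, 3800, 1200, 8000, 7100, 8600, 8700, 6300,
    7600, 7700, 2600, 7800, 2000, 9000, 7300, 8400, 1700, 2500,
    5700, 5300, 5500, 1700, 3700, 5400, 2400, 4000, 1200, 7300,
]

def select_sample(two_digit_numbers, population, size):
    """Pick units by label, skipping labels outside 1..N and repeats."""
    chosen = []
    for label in two_digit_numbers:
        if 1 <= label <= len(population) and label not in chosen:
            chosen.append(label)
        if len(chosen) == size:
            break
    # Labels are 1-based, list positions are 0-based
    return chosen, [population[label - 1] for label in chosen]

# Hypothetical stream of pairs read row-wise from a random number table
stream = [27, 63, 15, 99, 45, 11, 2, 71, 14, 18, 7, 88, 39, 31, 5]
labels, sample = select_sample(stream, population, 10)
print(labels)
print(sample)
```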
Lecture 17
Chapter 8_Interval estimation (Estimation of Parameters)
Aim
Be familiar with how to construct a confidence interval for the population parameter.
The sample statistic is calculated from the sample data and the population parameter is inferred (or estimated) from this sample statistic. In other words, statistics are calculated; parameters are estimated.
There are two types of estimates: point estimates and interval estimates.
Point Estimate - It is the single best value. For example, the mean and SD of total marks for a course of IUB students are point estimates because each is a single value.
Interval Estimate - Confidence Interval
The point estimate varies from sample to sample and will differ from the population parameter because of sampling error. There is no way to know how close it is to the actual parameter. For this reason, statisticians like to give an interval estimate (confidence interval), which is a range of values used to estimate the parameter.
A confidence interval is an interval estimate with a specific level of confidence. The level of confidence is the probability that the interval estimate will contain the parameter. The level of confidence is $1 - \alpha$; an area of $1 - \alpha$ lies within the confidence interval.
Confidence interval for $\mu$ based on large samples
Problem
Suppose the total marks for a course of 35 randomly selected IUB students are normally distributed with mean 78 and SD 9. Find 90%, 95% and 99% confidence intervals for the population mean $\mu$. Make a summary based on the findings.
Solution:
We are given X~N(78, 9), where X - total marks for a course of randomly selected IUB students, and n = 35.
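90% confidence interval for μ:
\[ \bar{x} \pm z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}} \]
Here x̄ = 78, σ = 9, n = 35 and z_{α/2} = 1.65.
Thus,
\[ 78 \pm 1.65 \times \frac{9}{\sqrt{35}} = 78 \pm 2.5101, \]
that is, 75.4899 ≤ μ ≤ 80.5101.
Summary: Based on our findings, we are 90% confident that the population mean lies between 75.5 and 80.5.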
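95% confidence interval for μ:
\[ \bar{x} \pm z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}} \]
Here x̄ = 78, σ = 9, n = 35 and z_{α/2} = 1.96.
Thus,
\[ 78 \pm 1.96 \times \frac{9}{\sqrt{35}} = 78 \pm 2.9817, \]
that is, 75.0183 ≤ μ ≤ 80.9817.
Summary:
Based on our findings, we are 95% confident that the population mean lies between 75.02 and 80.98.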
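99% confidence interval for μ:
\[ \bar{x} \pm z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}} \]
Here x̄ = 78, σ = 9, n = 35 and z_{α/2} = 2.58.
Thus,
\[ 78 \pm 2.58 \times \frac{9}{\sqrt{35}} = 78 \pm 3.9249, \]
that is, 74.0751 ≤ μ ≤ 81.9249.
Summary:
Based on our findings, we are 99% confident that the population mean lies between 74.08 and 81.92.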
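The three intervals above can be reproduced with a short Python sketch. The function name `z_interval` and the dictionary `z_table` are illustrative names, and the z values are simply the table values quoted in the solution, not computed ones.

```python
from math import sqrt

def z_interval(xbar, sigma, n, z):
    """Return (margin, lower, upper) for the interval xbar +/- z * sigma / sqrt(n)."""
    margin = z * sigma / sqrt(n)
    return margin, xbar - margin, xbar + margin

# z values as read from the z table in the worked solution (assumed, not computed here).
z_table = {90: 1.65, 95: 1.96, 99: 2.58}

# Sample mean 78, SD 9, n = 35, as given in the problem.
intervals = {level: z_interval(78, 9, 35, z) for level, z in z_table.items()}

for level, (margin, lower, upper) in intervals.items():
    print(f"{level}% CI: 78 +/- {margin:.4f} -> ({lower:.4f}, {upper:.4f})")
```

The output makes the trade-off visible: a higher confidence level gives a wider interval for the same sample.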
Practice problems
1. In an effort to estimate the mean amount spent per customer for dinner at a major Atlanta restaurant,
data were collected for a sample of 49 customers over a three-week period. Assume a population
standard deviation of $2.50.
a. At a 95% confidence level, what is the margin of error?
b. If the sample mean is $22.6, what is the 95% confidence interval for the population mean?
Guideline:
X - Amount spent per customer for dinner at a major Atlanta restaurant. Here n=49, SD = σ = $2.50 and
z_{α/2} = 1.96.
a) Find Margin of error (ME):
\[ ME = z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}} \]
(Solve it)
2. A machine fills bags of popcorn; the weight of the bags is known to be normally distributed with mean
weight 14.1 oz and SD 0.3 oz. For a sample of 40 bags, what is a 95% confidence interval for the population
mean μ?
Guideline:
a) X - weight of bags. Here n=40, x̄ = 14.1, z_{α/2} = 1.96 and σ/√n = 0.3/√40.
(Solve it)
3. The National Quality Research Center at the University of Michigan provides a quarterly measure of
consumer opinions about products and services (The Wall Street Journal, February 18, 2013). A survey of
40 restaurants in the Fast Food/ Pizza group showed a sample mean customer satisfaction index of 71.
Past data indicate that the population standard deviation of the index has been relatively stable with σ = 5.
a. Using 95% confidence, determine the margin of error.
b. Determine the margin of error if 99% confidence is desired.
Guideline:
Follow the guidelines for questions 1 and 2.
4. The undergraduate GPA for students admitted to the top graduate business schools is 3.37. Assume this
estimate is based on a sample of 120 students admitted to the top schools. Using past years' data, the
population standard deviation can be assumed known with σ = 0.28. What is the 95% confidence interval
estimate of the mean undergraduate GPA for students admitted to the top graduate business schools?
Guideline:
Follow the guidelines for questions 1 and 2.
HW: Text,
Problem
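Suppose we are given the sample heights of 20 IUB students, where x̄ = 67.3", SD = 3.6" and the
distribution is symmetric. Develop a 95% confidence interval for μ and make a summary based on your
findings.
Solution:
We are given X~N(67.3,3.6), where X - heights of 20 randomly selected IUB students and n=20. Because
the sample is small and the SD comes from the sample, the t distribution with n-1 = 19 degrees of
freedom is used.
95% confidence interval for μ:
\[ \bar{x} \pm t_{\alpha/2,\,n-1}\,\frac{s}{\sqrt{n}} \]
Here x̄ = 67.3, s = 3.6, n = 20 and t_{0.025,19} = 2.093.
Thus,
\[ 67.3 \pm 2.093 \times \frac{3.6}{\sqrt{20}} = 67.3 \pm 1.6848, \]
that is, 65.6152 ≤ μ ≤ 68.9848.
Summary: Based on our findings, we are 95% confident that the population mean lies between 65.62 and 68.98.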
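The small-sample interval for the heights problem can be checked with the sketch below. The helper name `t_interval` is illustrative, and the critical value 2.093 is hard-coded from the t table (19 degrees of freedom, 95% two-sided) rather than computed.

```python
from math import sqrt

def t_interval(xbar, s, n, t_crit):
    """Return (lower, upper) for the interval xbar +/- t_crit * s / sqrt(n)."""
    margin = t_crit * s / sqrt(n)
    return xbar - margin, xbar + margin

# t-table value for 19 degrees of freedom, 95% two-sided (assumed from the worked solution).
T_CRIT_95_DF19 = 2.093

# Sample mean 67.3", sample SD 3.6", n = 20, as given in the problem.
lower, upper = t_interval(67.3, 3.6, 20, T_CRIT_95_DF19)
print(f"95% CI for the mean height: ({lower:.4f}, {upper:.4f})")
```

The same function serves any of the small-sample practice problems once the matching t-table value is substituted.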
Practice problems
1. The International Air Transport Association surveys business travelers to develop quality ratings for
transatlantic gateway airports. The maximum possible rating is 10. Suppose a simple random sample of
25 business travelers is selected and each traveler is asked to provide a rating for the Miami International
Airport. The ratings obtained from the sample of 25 business travelers follow.
6, 4, 6, 8, 7, 7, 6, 3, 3, 8, 10, 4, 8, 7, 8, 7, 5, 9, 5, 8, 4, 3, 8, 5, 5
Develop a 95% confidence interval estimate of the population mean rating for Miami.
2. Text book, Ex.15-17, p.324
3. A machine fills bags of popcorn; the weight of the bags is known to be normally distributed with mean
weight 10.5 oz and SD 0.8 oz. For a sample of 10 bags, what is a 90% confidence interval for the population
mean μ?
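Confidence interval for σ² and σ
For a random sample from a normal population, the quantity (n-1)s²/σ² has a chi-square distribution with
n-1 degrees of freedom. The confidence interval for the population variance is
\[ \frac{(n-1)s^2}{\chi^2_{\alpha/2}} \le \sigma^2 \le \frac{(n-1)s^2}{\chi^2_{1-\alpha/2}} \]
where the χ² values are based on a chi-square distribution with n-1 degrees of freedom and 1-α is the
confidence coefficient. The confidence interval for the population standard deviation is
\[ \sqrt{\frac{(n-1)s^2}{\chi^2_{\alpha/2}}} \le \sigma \le \sqrt{\frac{(n-1)s^2}{\chi^2_{1-\alpha/2}}} \]
where the χ² values are defined in the same way.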
Problem-1
A statistician chooses 27 randomly selected dates and when examining the occupancy records of a
particular motel for those dates, finds a standard deviation of 5.86 rooms rented. If the number of rooms
rented is normally distributed, find the 95% confidence interval for the population standard deviation of
the number of rooms rented.
Solution:
Here X - Number of rooms rented, S = 5.86 and n=27
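95% confidence interval for the population standard deviation (σ):
\[ \sqrt{\frac{(n-1)s^2}{\chi^2_{\alpha/2}}} \le \sigma \le \sqrt{\frac{(n-1)s^2}{\chi^2_{1-\alpha/2}}} \]
Here n-1 = 26, χ²_{0.025} = 41.923 and χ²_{0.975} = 13.844.
Thus,
\[ \sqrt{\frac{26 \times 5.86^2}{41.923}} \le \sigma \le \sqrt{\frac{26 \times 5.86^2}{13.844}}, \]
that is, 4.615 ≤ σ ≤ 8.031.
Summary: Based on our findings, we are 95% confident that the population standard deviation lies between
4.615 and 8.031.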
Problem-2
A statistician chooses 27 randomly selected dates and when examining the occupancy records of a
particular motel for those dates, finds a standard deviation of 5.86 rooms rented. If the number of rooms
rented is normally distributed, find the 95% confidence interval for the population variance of the number
of rooms rented.
Solution:
Here X - Number of rooms rented, S = 5.86 and n=27
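95% confidence interval for the population variance (σ²):
\[ \frac{(n-1)s^2}{\chi^2_{\alpha/2}} \le \sigma^2 \le \frac{(n-1)s^2}{\chi^2_{1-\alpha/2}} \]
Here n-1 = 26, χ²_{0.025} = 41.923 and χ²_{0.975} = 13.844.
Thus,
\[ \frac{26 \times 5.86^2}{41.923} \le \sigma^2 \le \frac{26 \times 5.86^2}{13.844}, \]
that is, 21.297 ≤ σ² ≤ 64.492.
Summary: Based on our findings, we are 95% confident that the population variance lies between 21.297 and
64.492.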
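Both motel problems use the same chi-square arithmetic, which the following sketch reproduces. The function name `variance_interval` is illustrative, and the two chi-square values are hard-coded as the standard table entries for 26 degrees of freedom (an assumption of this sketch, not something computed here).

```python
from math import sqrt

def variance_interval(s, n, chi2_upper, chi2_lower):
    """Return (lower, upper) limits for the population variance.

    lower = (n-1)s^2 / chi2_upper, upper = (n-1)s^2 / chi2_lower
    """
    ss = (n - 1) * s ** 2
    return ss / chi2_upper, ss / chi2_lower

# Chi-square table values for 26 degrees of freedom (assumed from a standard table).
CHI2_0025_DF26 = 41.923  # upper-tail area 0.025
CHI2_0975_DF26 = 13.844  # upper-tail area 0.975

# S = 5.86 rooms and n = 27 dates, as given in the problems.
var_low, var_high = variance_interval(5.86, 27, CHI2_0025_DF26, CHI2_0975_DF26)
sd_low, sd_high = sqrt(var_low), sqrt(var_high)

print(f"95% CI for the variance: ({var_low:.3f}, {var_high:.3f})")
print(f"95% CI for the SD:       ({sd_low:.3f}, {sd_high:.3f})")
```

The interval for the SD is just the square root of each limit of the variance interval, which is why the two problems share one calculation.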
Practice problems
1. The variance in drug weights is critical in the pharmaceutical industry. For a specific drug,
with weights measured in grams, a sample of 18 units provided a sample variance of s² = 0.36.
a. Construct a 90% confidence interval estimate of the population variance for the weight of this
drug.
b. Construct a 90% confidence interval estimate of the population standard deviation.
2. The daily car rental rates for a sample of eight cities follow.
City             Daily Car Rental Rate ($)
Atlanta          69
Chicago          72
Dallas           75
New Orleans      67
Phoenix          62
Pittsburgh       65
San Francisco    61
Seattle          59
a. Compute the sample variance and the sample standard deviation for these data.
b. What is the 95% confidence interval estimate of the variance of car rental rates for the population?
c. What is the 90% confidence interval estimate of the standard deviation for the population?
Lecture 18
Chapter 10
Interval estimation for two population means and two standard deviations: see Text, Chapter 10
Lecture 19
Tests of hypothesis
In general, we do not know the true value of population parameters (mean, proportion, variance,
SD and others). They must be estimated based on random samples. However, we do have
hypotheses about what the true values are.
The major purpose of hypothesis testing is to choose between two competing hypotheses about
the value of a population parameter.
Actually, in hypothesis testing we begin by making a tentative assumption about a population
parameter. This tentative assumption is called the null hypothesis and is denoted by H0.
It is then necessary to define another hypothesis, called the alternative hypothesis, which is the
opposite of what is stated in H0. It is denoted by Ha or H1.
Both the null and alternative hypotheses should be stated before any statistical test of significance
is conducted.
In general, it is most convenient to always have the null hypothesis contain an equal sign, e.g.
(1) H0: μ = 100
H1: μ ≠ 100
(2) H0: μ ≥ 100
H1: μ < 100
(3) H0: μ ≤ 100
H1: μ > 100
Thus, note that
under H0, signs are =, ≥ and ≤
under H1, signs are ≠, < and >
In general, a hypothesis test about the value of the population mean takes one of the following
three forms:
H0: μ = μ0      H0: μ ≥ μ0      H0: μ ≤ μ0
H1: μ ≠ μ0      H1: μ < μ0      H1: μ > μ0
For example, consider the following problems in choosing the proper form for a hypothesis test:
Problem 1
The manager of an automobile dealership is considering a new bonus plan designed to increase
sales volume. Currently, the mean sales volume is 14 automobiles per month. The manager
wants to conduct a research study to see whether the new bonus plan increases sales volume. To
collect data on the plan, a sample of sales personnel will be allowed to sell under the new bonus
plan for a 1-month period. Define the null and the alternative hypotheses.
Solution: Here H0: μ ≤ 14 and H1: μ > 14.
Problem 2
The manager of an automobile dealership is considering a new bonus plan designed to increase
sales volume. Currently, the mean sales volume is 14 automobiles per month. The manager
wants to conduct a research study to see whether the new bonus plan decreases sales volume. To
collect data on the plan, a sample of sales personnel will be allowed to sell under the new bonus
plan for a 1-month period. Define the null and the alternative hypotheses.
Solution: Here H0: μ ≥ 14 and H1: μ < 14.
Problem 3
The manager of an automobile dealership is considering a new bonus plan designed to increase
sales volume. Currently, the mean sales volume is 14 automobiles per month. The manager
wants to conduct a research study to see whether the new bonus plan changes sales volume. To
collect data on the plan, a sample of sales personnel will be allowed to sell under the new bonus
plan for a 1-month period. Define the null and the alternative hypotheses.
Solution: Here H0: μ = 14 and H1: μ ≠ 14.
Steps for conducting a hypothesis test
1. Develop H0 and H1.
2. Specify the level of significance, α, which defines unlikely values of the sample statistic if the
null hypothesis is true. It is selected by the researcher at the start. The common values of α are 0.01,
0.05 and 0.10, and the most common is 0.05.
3. Select the test statistic (a quantity calculated using the sample values that is used to perform
the hypothesis test) that will be used to test the hypothesis.
Guidelines to select test statistic:
Problem-4
Individuals filing federal income tax returns prior to March 31 had an average refund of $1056.
Consider the population of last minute filers who mail their returns during the last 5 days of the
income tax period typically April 10 to April 15. A researcher suggests that one of the reasons
individuals wait until the last 5 days to file their returns is that on average those individuals have
a lower refund than early filers.
a) Develop appropriate hypotheses such that rejection of the null hypothesis will support the
researcher's argument.
b. Using 5% level of significance, what is the critical value for the test statistic and what is the
rejection rule?
c. For a sample of 400 individuals who filed a return between April 10 and April 15, the sample
mean refund was $910 and the sample standard deviation was $1600. Compute the value of the
test statistic.
d. What is your conclusion?
e. What is the p-value for the test?
Solution
Denote X - refund on an individual's federal income tax return. Here n = 400, x̄ = $910 and
s = $1600.
(a) Set up the following hypotheses:
H0: μ ≥ $1056 vs. H1: μ < $1056
(b) Since n > 30, choose the z-statistic. The critical value of the z-statistic at the 5% level
of significance, found from the z table, is -1.645.
Rejection rule: Reject H0 if zcal ≤ -1.645
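Parts (c) to (e) of Problem-4 follow mechanically from the figures given in the problem, and a short Python sketch can be used to check a hand solution. The function name `z_statistic` is illustrative. The sketch assumes the large-sample statistic zcal = (x̄ - μ0)/(s/√n) and a lower-tail p-value taken from the standard normal distribution.

```python
from math import sqrt
from statistics import NormalDist

def z_statistic(xbar, mu0, s, n):
    """Large-sample test statistic: (xbar - mu0) / (s / sqrt(n))."""
    return (xbar - mu0) / (s / sqrt(n))

# Figures from the problem: sample mean $910, hypothesised mean $1056, s = $1600, n = 400.
z_cal = z_statistic(910, 1056, 1600, 400)  # (910 - 1056) / 80 = -1.825

z_crit = -1.645                   # lower-tail critical value at the 5% level (z table)
reject_h0 = z_cal <= z_crit       # rejection rule for H1: mu < 1056
p_value = NormalDist().cdf(z_cal) # area to the left of z_cal under the standard normal curve

print(f"z_cal = {z_cal:.3f}, reject H0: {reject_h0}, p-value = {p_value:.4f}")
```

Under these assumptions zcal falls in the rejection region and the p-value is about 0.034, which is below 0.05, so both approaches lead to rejecting H0.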
Problem- 5
Individuals filing federal income tax returns prior to March 31 had an average refund of $1056.
Consider the population of last minute filers who mail their returns during the last 5 days of the
income tax period typically April 10 to April 15. A researcher suggests that one of the reasons
individuals wait until the last 5 days to file their returns is that on average those individuals have
a greater refund than early filers.
a) Develop appropriate hypotheses such that rejection of the null hypothesis will support the
researcher's argument.
b. Using 5% level of significance, what is the critical value for the test statistic and what is the
rejection rule?
c. For a sample of 400 individuals who filed a return between April 10 and April 15, the sample
mean refund was $910 and the sample standard deviation was $1600. Compute the value of the
test statistic.
d. What is your conclusion?
e. What is the p-value for the test?
Solution
Denote X - refund on an individual's federal income tax return. Here n = 400, x̄ = $910 and
s = $1600.
(a) Set up the following hypotheses:
H0: μ ≤ $1056 vs. H1: μ > $1056
(b) Since n > 30, choose the z-statistic. The critical value of the z-statistic at the 5% level
of significance, found from the z table, is 1.645.
Rejection rule: Reject H0 if zcal ≥ 1.645
(d) Conclusion
or zcal ≤ -1.960
422 and
Decision:
{Insert decision curve}
Accept the null hypothesis.
Thus, it is possible to conclude that we are 99% confident that we may accept the null
hypothesis. More clearly, based on sample evidence, it may be concluded the variability in
interest rates increased.
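One Sample Tests
Note: Form (i) is a two-sided (two-tailed) test; the other two are one-sided (one-tailed) lower or upper
tests.

Population mean (μ) Test
Statistic (large sample):
\[ Z_{cal} = \frac{\sqrt{n}\,(\bar{x} - \mu_{H_0})}{\sigma_x} \]
Statistic (small sample):
\[ t_{cal} = \frac{\sqrt{n}\,(\bar{x} - \mu_{H_0})}{s_x} \]
Distribution:
Standard Normal Z (or t)
and use Z-table (or t-table)
to have Ztab or ttab.

Population proportion (P) Test
Statistic:
\[ Z_{cal} = \frac{\hat{p} - P_{H_0}}{\sigma_{\hat{p}}}, \quad \text{where } \sigma_{\hat{p}} = \sqrt{\frac{P_{H_0}(1 - P_{H_0})}{n}} \]

Population SD (σ) Test
Statistic:
\[ \chi^2 = \frac{(n-1)S_x^2}{\sigma_{H_0}^2}, \quad \text{where } S_x^2 \text{ is the sample variance} \]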
Lecture 21
Tests of two population means and two standard deviations; applications from real data
See Text, Chapter 11
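Note: Form (i) is a two-sided (two-tailed) test; the other two are one-sided (one-tailed) lower or upper
tests.

Statistic (two means, large sample):
\[ Z_{cal} = \frac{\bar{x}_1 - \bar{x}_2}{\sigma_{\bar{x}_1 - \bar{x}_2}} \]
Statistic (two means, small sample):
\[ t_{cal} = \frac{\bar{x}_1 - \bar{x}_2}{S_{\bar{x}_1 - \bar{x}_2}} \]
Distribution:
Standard Normal Z (or t)
and use Z-table (or t-table)
to have Ztab or ttab.

Statistic (two proportions):
\[ Z_{cal} = \frac{\hat{p}_1 - \hat{p}_2}{\sigma_{\hat{p}_1 - \hat{p}_2}}, \quad \text{where } \sigma_{\hat{p}_1 - \hat{p}_2} = \sqrt{\frac{P_1(1 - P_1)}{n_1} + \frac{P_2(1 - P_2)}{n_2}} \]

Statistic (two standard deviations):
\[ F = \frac{S_1^2}{S_2^2}, \quad \text{where } S_1^2 \text{ and } S_2^2 \text{ are the sample variances} \]
Distribution:
For example, Ftab = F_{n,α} for a one-sided test and Ftab = F_{n,α/2} for a
two-sided test, where n = n1 + n2 - 2.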
Lectures 22-23
Chapter 14_Correlation and Regression Analysis
on No. of commercials
Draw a scatter diagram to see what sort of relation exists between x and y.
[Scatter diagram: Total Sales (vertical axis, 0 to 70) against No. of TV Commercials (horizontal axis)]
Summary: We see that a positive relation exists between no. of TV commercials and total sales.
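To measure how strong the linear relation between x and y is, we apply the following formula (known as
the correlation coefficient), defined as
\[ r_{xy} = \frac{s_{xy}}{s_x s_y} \]
where
\[ s_{xy} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n-1}, \quad s_x = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n-1}}, \quad s_y = \sqrt{\frac{\sum (y_i - \bar{y})^2}{n-1}} \]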
Make the following calculation table (for details, see Textbook, pp.115-116) to find r_xy
No. of TV          Total
Commercials (x)    Sales (y)    (y - 51)^2    (x - 3)^2    (x - 3)(y - 51)
2                  50           1             1            1
5                  57           36            4            12
1                  41           100           4            20
3                  54           9             0            0
4                  54           9             1            3
1                  38           169           4            26
5                  63           144           4            24
3                  48           9             0            0
4                  59           64            1            8
2                  46           25            1            5
Total 30           510          566           20           99
x̄ = 30/10 = 3, ȳ = 510/10 = 51
s_x = √(20/9) = 1.49, s_y = √(566/9) = 7.93, s_xy = 99/9 = 11
r_xy = 11/(1.49 × 7.93) = 0.9310
Summary
We see that r_xy = 0.93, which indicates a strong positive linear relation: as the no. of TV commercials
increases, total sales tend to increase.
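The correlation coefficient for the TV-commercials data can be verified with the sketch below, which rebuilds the column totals of the calculation table. Variable names such as `sxx` and `sxy` are illustrative. Because no intermediate rounding is applied, the result is about 0.9305, slightly below the 0.9310 obtained above from the rounded values 1.49 and 7.93.

```python
from math import sqrt

# Data from the calculation table.
x = [2, 5, 1, 3, 4, 1, 5, 3, 4, 2]            # no. of TV commercials
y = [50, 57, 41, 54, 54, 38, 63, 48, 59, 46]  # total sales

n = len(x)
x_bar = sum(x) / n  # 3
y_bar = sum(y) / n  # 51

# Column totals of the calculation table.
sxx = sum((xi - x_bar) ** 2 for xi in x)                       # sum of (x - 3)^2
syy = sum((yi - y_bar) ** 2 for yi in y)                       # sum of (y - 51)^2
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) # sum of (x - 3)(y - 51)

s_x = sqrt(sxx / (n - 1))
s_y = sqrt(syy / (n - 1))
r_xy = (sxy / (n - 1)) / (s_x * s_y)

print(f"totals: {sxx:.0f}, {syy:.0f}, {sxy:.0f}; r_xy = {r_xy:.4f}")
```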
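Here there are two parameters, α and β. These two will be estimated based on random sample data.
Using the Ordinary Least Squares method, we find the estimated values of α and β:
\[ \hat{\beta} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}, \qquad \hat{\alpha} = \bar{y} - \hat{\beta}\,\bar{x} \]
Estimated model y on x:
\[ \hat{y}_i = \hat{\alpha} + \hat{\beta} x_i, \quad i = 1, 2, \ldots, n \]
Prediction or Forecasting
The predicted model is defined by
\[ \hat{y}_p = \hat{\alpha} + \hat{\beta} x_p \]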
Fit a model y on x.
Predict (or forecast) total sales when x=5.
Solution:
Consider the following two-variable regression model
Yi = α + β Xi + ei, i = 1,2,...,n
where Y = Total sales
α = constant
β = regression coefficient of y on x
X = No. of commercials
e = random error
The two parameters α and β will be estimated based on random sample data y and x.
Calculation table
No. of TV          Total
Commercials (x)    Sales (y)    (y - 51)^2    (x - 3)^2    (x - 3)(y - 51)
2                  50           1             1            1
5                  57           36            4            12
1                  41           100           4            20
3                  54           9             0            0
4                  54           9             1            3
1                  38           169           4            26
5                  63           144           4            24
3                  48           9             0            0
4                  59           64            1            8
2                  46           25            1            5
Total 30           510          566           20           99
x̄ = 30/10 = 3, ȳ = 510/10 = 51
β̂ = 99/20 = 4.95
α̂ = 51 - (4.95 × 3) = 36.15
Thus, the estimated model y on x becomes: ŷi = 36.15 + 4.95 xi
Summary
α̂ = 36.15 means that if there are no commercials (i.e. x=0), then expected sales may be $36.15.
β̂ = 4.95 means that for each additional TV commercial, total sales are expected to increase by
$4.95.
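The least-squares fit and the forecast at x = 5 can be sketched in Python as follows. The names `beta_hat`, `alpha_hat` and `predict` are illustrative, and the code assumes the usual OLS formulas: slope = Σ(x - x̄)(y - ȳ)/Σ(x - x̄)² and intercept = ȳ - slope × x̄.

```python
# Data from the calculation table.
x = [2, 5, 1, 3, 4, 1, 5, 3, 4, 2]            # no. of TV commercials
y = [50, 57, 41, 54, 54, 38, 63, 48, 59, 46]  # total sales

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

sxx = sum((xi - x_bar) ** 2 for xi in x)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

beta_hat = sxy / sxx                   # slope: 99 / 20
alpha_hat = y_bar - beta_hat * x_bar   # intercept: 51 - slope * 3

def predict(x_p):
    """Forecast total sales for x_p commercials from the fitted line."""
    return alpha_hat + beta_hat * x_p

print(f"fitted model: y = {alpha_hat:.2f} + {beta_hat:.2f} x")
print(f"forecast at x = 5: {predict(5):.2f}")
```

With these data the forecast at x = 5 is 36.15 + 4.95 × 5 = 60.9, in the same units as the sales column.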
HW: Text
Ex: 47-51, pp.122-124
Ex: 4-14, 18-21, pp.570-582