Lecture Notes

15MA301-Probability & Statistics

Prepared by

S ATHITHAN
Assistant Professor
Department of Mathematics
Faculty of Engineering and Technology
SRM UNIVERSITY
Kattankulathur-603203, Kancheepuram District.

Probability & Statistics S.ATHITHAN
Unit-4
Correlation, Regression and Analysis of Variance
TOPICS:
* Pearson's correlation coefficient
* Spearman's rank correlation coefficient
* Regression concepts: regression lines
* Analysis of variance
    One-way classification and
    Two-way classification
* Introduction to non-parametric tests: Wilcoxon signed rank test (one-sample test)
* Wilcoxon Mann-Whitney rank test (two-sample test)

Page 1 of 27 https://sites.google.com/site/lecturenotesofathithans/home
1 Pearson's Correlation Coefficient

Example: 1. Find the correlation coefficient for the following data:

X   27   28   29   30   32   32   33
Y   17   18   19   19   21   20   21

Solution: Karl Pearson's correlation coefficient is given by

$$ r_{XY} = r_{UV} = \frac{N\sum uv - \sum u \sum v}{\sqrt{N\sum u^2 - \left(\sum u\right)^2}\;\sqrt{N\sum v^2 - \left(\sum v\right)^2}} $$

with $u = X - 30$ and $v = Y - 19$.

   X      Y    u = X - 30   v = Y - 19   u^2   v^2    uv
  27     17        -3           -2         9     4     6
  28     18        -2           -1         4     1     2
  29     19        -1            0         1     0     0
  30     19         0            0         0     0     0
  32     21         2            2         4     4     4
  32     20         2            1         4     1     2
  33     21         3            2         9     4     6
 211    135         1            2        31    14    20   (totals)

Now,

$$ r_{XY} = r_{UV} = \frac{7 \times 20 - 1 \times 2}{\sqrt{7 \times 31 - 1}\;\sqrt{7 \times 14 - 4}} = \frac{138}{\sqrt{216}\,\sqrt{94}} = \frac{138}{142.49} = 0.968. $$
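The shifted-origin formula above can be checked numerically. The following is a minimal Python sketch, not part of the original notes: the function name `pearson_r` and its `x_origin`/`y_origin` parameters are illustrative choices. It evaluates the same sums (N, Σu, Σv, Σu², Σv², Σuv) and shows that the choice of origin does not change r.

```python
from math import sqrt


def pearson_r(xs, ys, x_origin=0.0, y_origin=0.0):
    """Pearson's r from deviations u = x - x_origin and v = y - y_origin."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("xs and ys must have the same length (at least 2)")
    n = len(xs)
    u = [x - x_origin for x in xs]
    v = [y - y_origin for y in ys]
    sum_u, sum_v = sum(u), sum(v)
    sum_uu = sum(a * a for a in u)
    sum_vv = sum(b * b for b in v)
    sum_uv = sum(a * b for a, b in zip(u, v))
    numerator = n * sum_uv - sum_u * sum_v
    denominator = sqrt(n * sum_uu - sum_u ** 2) * sqrt(n * sum_vv - sum_v ** 2)
    return numerator / denominator


# Data of Example 1 (origin 30 for X, 19 for Y)
x1 = [27, 28, 29, 30, 32, 32, 33]
y1 = [17, 18, 19, 19, 21, 20, 21]
r1 = pearson_r(x1, y1, x_origin=30, y_origin=19)

# Data of Example 2 (origin 69 for X, 152 for Y)
x2 = [62, 64, 65, 69, 70, 71, 72, 74]
y2 = [126, 125, 139, 145, 165, 152, 180, 208]
r2 = pearson_r(x2, y2, x_origin=69, y_origin=152)

print(round(r1, 3), round(r2, 3))
```

Calling `pearson_r(x1, y1)` with the default origin 0 gives the same value as `r1`, which is why any convenient origin may be used in hand calculation.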

Example: 2. Find the correlation coefficient for the following data:

X    62    64    65    69    70    71    72    74
Y   126   125   139   145   165   152   180   208

Solution: Karl Pearson's correlation coefficient is given by

$$ r_{XY} = r_{UV} = \frac{N\sum uv - \sum u \sum v}{\sqrt{N\sum u^2 - \left(\sum u\right)^2}\;\sqrt{N\sum v^2 - \left(\sum v\right)^2}} $$

with $u = X - 69$ and $v = Y - 152$.

   X      Y    u = X - 69   v = Y - 152   u^2    v^2     uv
  62    126        -7          -26         49    676    182
  64    125        -5          -27         25    729    135
  65    139        -4          -13         16    169     52
  69    145         0           -7          0     49      0
  70    165         1           13          1    169     13
  71    152         2            0          4      0      0
  72    180         3           28          9    784     84
  74    208         5           56         25   3136    280
 Total             -5           24        129   5712    746

Now,

$$ r_{XY} = r_{UV} = \frac{8 \times 746 - (-5) \times 24}{\sqrt{8 \times 129 - 25}\;\sqrt{8 \times 5712 - (24)^2}} = \frac{5968 + 120}{\sqrt{1032 - 25}\;\sqrt{45696 - 576}} $$

$$ = \frac{6088}{\sqrt{1007}\,\sqrt{45120}} = \frac{6088}{31.733 \times 212.415} = 0.903. $$

2 Spearman's Rank Correlation Coefficient

Example: 3. Find Spearman's rank correlation coefficient for the following data:

X   78   36   98   25   75   82   90   62   65   39
Y   84   51   91   60   68   62   86   58   63   47

Solution: Spearman's rank correlation coefficient is given by

$$ \rho_{XY} = 1 - \frac{6\sum d^2}{n(n^2 - 1)} $$

with $d = R_x - R_y$, where $R_x$ is the rank in X and $R_y$ is the rank in Y (rank 1 goes to the largest value).

  X     Y    R_x   R_y    d    d^2
 78    84     4     3     1     1
 36    51     9     9     0     0
 98    91     1     1     0     0
 25    60    10     7     3     9
 75    68     5     4     1     1
 82    62     3     6    -3     9
 90    86     2     2     0     0
 62    58     7     8    -1     1
 65    63     6     5     1     1
 39    47     8    10    -2     4
 Total                         26

Now,

$$ \rho_{XY} = 1 - \frac{6\sum d^2}{n(n^2 - 1)} = 1 - \frac{6 \times 26}{10(99)} = 0.8424 $$

Example: 4. Find Spearman's rank correlation coefficient for the following data:

X   35   23   47   17   10   43    9    6   28
Y   30   33   45   23    8   49   12    4   11

Solution: Spearman's rank correlation coefficient is given by

$$ \rho_{XY} = 1 - \frac{6\sum d^2}{n(n^2 - 1)} $$

with $d = R_x - R_y$, where $R_x$ is the rank in X and $R_y$ is the rank in Y.

  X     Y    R_x   R_y    d    d^2
 35    30     3     4    -1     1
 23    33     5     3     2     4
 47    45     1     2    -1     1
 17    23     6     5     1     1
 10     8     7     8    -1     1
 43    49     2     1     1     1
  9    12     8     6     2     4
  6     4     9     9     0     0
 28    11     4     7    -3     9
 Total                         22

Now,

$$ \rho_{XY} = 1 - \frac{6\sum d^2}{n(n^2 - 1)} = 1 - \frac{6 \times 22}{9(80)} = 1 - 0.1833 = 0.8167 $$
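The ranking and the $\sum d^2$ computation in Examples 3 and 4 can be sketched in Python as follows. The helper names `descending_ranks` and `spearman_rho` are hypothetical, and the sketch assumes there are no tied values (true for both examples); tied data would need average ranks and a corrected formula, which this sketch deliberately rejects.

```python
def descending_ranks(values):
    """Rank 1 = largest value. Assumes no ties, as in Examples 3 and 4."""
    order = sorted(values, reverse=True)
    if len(set(order)) != len(order):
        raise ValueError("tied values need average ranks; not handled in this sketch")
    return [order.index(v) + 1 for v in values]


def spearman_rho(xs, ys):
    """Spearman's rho = 1 - 6*sum(d^2) / (n*(n^2 - 1)) for untied data."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    rx, ry = descending_ranks(xs), descending_ranks(ys)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * sum_d2 / (n * (n * n - 1))


# Example 3
x3 = [78, 36, 98, 25, 75, 82, 90, 62, 65, 39]
y3 = [84, 51, 91, 60, 68, 62, 86, 58, 63, 47]
rho3 = spearman_rho(x3, y3)

# Example 4
x4 = [35, 23, 47, 17, 10, 43, 9, 6, 28]
y4 = [30, 33, 45, 23, 8, 49, 12, 4, 11]
rho4 = spearman_rho(x4, y4)

print(round(rho3, 4), round(rho4, 4))
```

Ranking in ascending order instead would give the same coefficient, since every difference d only changes sign.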

3 Regression Analysis

Example: 5. Find the regression equations for the following data:

X   27   28   29   30   32   32   33
Y   17   18   19   19   21   20   21

Solution: The two regression equations are as follows.

The regression line (equation) of y on x is

$$ y - \bar{y} = \frac{\mathrm{Cov}(X, Y)}{\sigma_x^2}\,(x - \bar{x}) = b_{yx}(x - \bar{x}), \qquad \text{where } b_{yx} = b_{vu} = \frac{N\sum uv - \sum u \sum v}{N\sum u^2 - \left(\sum u\right)^2}, $$

and the regression line (equation) of x on y is

$$ x - \bar{x} = \frac{\mathrm{Cov}(X, Y)}{\sigma_y^2}\,(y - \bar{y}) = b_{xy}(y - \bar{y}), \qquad \text{where } b_{xy} = b_{uv} = \frac{N\sum uv - \sum u \sum v}{N\sum v^2 - \left(\sum v\right)^2}, $$

with $u = X - 30$ and $v = Y - 19$.

   X      Y    u = X - 30   v = Y - 19   u^2   v^2    uv
  27     17        -3           -2         9     4     6
  28     18        -2           -1         4     1     2
  29     19        -1            0         1     0     0
  30     19         0            0         0     0     0
  32     21         2            2         4     4     4
  32     20         2            1         4     1     2
  33     21         3            2         9     4     6
 211    135         1            2        31    14    20   (totals)

Now,

$$ \bar{x} = \frac{\sum X}{n} = \frac{211}{7} = 30.14, \qquad \bar{y} = \frac{\sum Y}{n} = \frac{135}{7} = 19.286, $$

$$ b_{xy} = b_{uv} = \frac{N\sum uv - \sum u \sum v}{N\sum v^2 - \left(\sum v\right)^2} = \frac{7 \times 20 - 1 \times 2}{7 \times 14 - 4} = \frac{138}{94} = 1.468, $$

$$ b_{yx} = b_{vu} = \frac{N\sum uv - \sum u \sum v}{N\sum u^2 - \left(\sum u\right)^2} = \frac{7 \times 20 - 1 \times 2}{7 \times 31 - 1} = \frac{138}{216} = 0.6389. $$

The regression line (equation) of y on x is

$$ y - \bar{y} = b_{yx}(x - \bar{x}) = 0.6389\,(x - \bar{x}) $$
$$ y - 0.6389x = 19.286 - 0.6389 \times 30.14 = 0.0296 $$

and the regression line (equation) of x on y is

$$ x - \bar{x} = b_{xy}(y - \bar{y}) = 1.468\,(y - \bar{y}) $$
$$ x - 1.468y = 30.14 - 1.468 \times 19.286 = 1.828 $$
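The two regression coefficients of Example 5 can be reproduced with a short Python sketch. The function name `regression_coefficients` is an illustrative choice, and the sketch uses deviations from the means rather than from the working origins 30 and 19; both routes give the same slopes. It also illustrates the identity $r^2 = b_{yx}\,b_{xy}$, which ties this example to Example 1.

```python
def regression_coefficients(xs, ys):
    """Return (mean_x, mean_y, b_yx, b_xy) for paired observations."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("xs and ys must have the same length (at least 2)")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    # b_yx is the slope of y on x, b_xy is the slope of x on y
    return mean_x, mean_y, s_xy / s_xx, s_xy / s_yy


x = [27, 28, 29, 30, 32, 32, 33]
y = [17, 18, 19, 19, 21, 20, 21]
mean_x, mean_y, b_yx, b_xy = regression_coefficients(x, y)

# Constants in the forms  y - b_yx * x = c1  and  x - b_xy * y = c2
intercept_y_on_x = mean_y - b_yx * mean_x
intercept_x_on_y = mean_x - b_xy * mean_y

print(round(b_yx, 4), round(b_xy, 4))
print(round(intercept_y_on_x, 4), round(intercept_x_on_y, 4))
print(round(b_yx * b_xy, 4))  # equals r squared
```

With unrounded means the constants come out near 0.0278 and 1.8298; the hand calculation differs in the third decimal place only because the means and slopes were rounded before multiplying.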



Example: 6. The two regression equations are given by x + 0.87y = 19.13 and 0.50x + y = 11.64. Find the means of x and y and the correlation coefficient.

Solution: Here
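One way to work this example is sketched below in Python. It rests on two standard facts: both regression lines pass through $(\bar{x}, \bar{y})$, and $r^2 = b_{xy}\,b_{yx}$ with r taking the common sign of the two slopes. The assignment of the first equation to "x on y" and the second to "y on x" is an assumption of this sketch, checked by the requirement that the product of the slopes must not exceed 1 (the opposite assignment gives a product above 1).

```python
from math import copysign, sqrt

# Coefficients of  a1*x + b1*y = c1  and  a2*x + b2*y = c2
a1, b1, c1 = 1.0, 0.87, 19.13   # assumed to be the line of x on y
a2, b2, c2 = 0.50, 1.0, 11.64   # assumed to be the line of y on x

# The means are the point of intersection (Cramer's rule)
det = a1 * b2 - a2 * b1
mean_x = (c1 * b2 - c2 * b1) / det
mean_y = (a1 * c2 - a2 * c1) / det

# Slopes: x = c1/a1 - (b1/a1)*y  and  y = c2/b2 - (a2/b2)*x
b_xy = -b1 / a1
b_yx = -a2 / b2
if b_xy * b_yx > 1:
    raise ValueError("slope product exceeds 1: swap the roles of the two equations")

# r carries the common sign of the two regression coefficients
r = copysign(sqrt(b_xy * b_yx), b_yx)

print(round(mean_x, 3), round(mean_y, 3), round(r, 4))
```

Under these assumptions the means come out near 15.93 and 3.67, and r is negative because both regression coefficients are negative.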

4 Analysis of Variance (ANOVA)

4.1 One way classification

Example: 7. The following data give the number of mistakes made by 4 photographic laboratory technicians on 5 successive days. Is there any significant difference in performance among the technicians?

Technician I   Technician II   Technician III   Technician IV
      6              14              10                9
     14               9              12               12
     10              12               7                8
      8              10              15               10
     11              14              11               11

Solution:
H0: There is no significant difference between the technicians.
H1: There is a significant difference between the technicians.

We shift the origin to 10 (change of origin), i.e. we subtract 10 from every observation.

          X1    X2    X3    X4   Total   X1^2   X2^2   X3^2   X4^2
          -4     4     0    -1     -1     16     16      0      1
           4    -1     2     2      7     16      1      4      4
           0     2    -3    -2     -3      0      4      9      4
          -2     0     5     0      3      4      0     25      0
           1     4     1     1      7      1     16      1      1
Total     -1     9     5     0     13     37     37     39     10

Step 1: N = 20
Step 2: T = 13 (sum of all the values)
Step 3: Calculate $\dfrac{T^2}{N} = \dfrac{(13)^2}{20} = 8.45$
Step 4: TSS (Total Sum of Squares)

$$ TSS = \sum X_1^2 + \sum X_2^2 + \sum X_3^2 + \sum X_4^2 - \frac{T^2}{N} = 37 + 37 + 39 + 10 - 8.45 = 114.55 $$

Step 5: SSC (Sum of Squares between samples (columns))

$$ SSC = \frac{\left(\sum X_1\right)^2}{N_{C_1}} + \frac{\left(\sum X_2\right)^2}{N_{C_2}} + \frac{\left(\sum X_3\right)^2}{N_{C_3}} + \frac{\left(\sum X_4\right)^2}{N_{C_4}} - \frac{T^2}{N} $$

where $N_{C_i}$ = number of elements in column i, i = 1, 2, 3, 4.

$$ SSC = \frac{(-1)^2}{5} + \frac{(9)^2}{5} + \frac{(5)^2}{5} + \frac{(0)^2}{5} - 8.45 = \frac{1}{5} + \frac{81}{5} + \frac{25}{5} + 0 - 8.45 = 12.95 $$

Step 6: SSE = TSS - SSC = 114.55 - 12.95 = 101.6

ANOVA Table

Source of variation   Sum of squares          d.f.                 Mean square                           Variance ratio (F)                   Table value (F)
Between columns       SSC = 12.95             c - 1 = 4 - 1 = 3    MSC = SSC/(c - 1) = 12.95/3 = 4.317   F_c = MSC/MSE = 4.317/6.35 = 0.68    F_t(3, 16) at 1% LOS = 5.29
Residual/Error        SSE = 101.6             N - c = 20 - 4 = 16  MSE = SSE/(N - c) = 101.6/16 = 6.35
Total                 SSC + SSE = 114.55

Step 7: Conclusion: Since the calculated value of F is less than the table value of F, i.e. $F_c < F_t$, H0 is accepted.
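The one-way computation (Steps 1 to 6) can be sketched in Python as below. The function name `one_way_anova` and the dictionary keys are illustrative. The sketch works on the deviations from the origin 10, exactly as tabulated in the solution; since the sums of squares are unchanged by a shift of origin, the raw counts would give the same result. The F ratio here is formed as MSC/MSE.

```python
def one_way_anova(columns):
    """Sums of squares, mean squares and F = MSC/MSE for a list of samples."""
    n_total = sum(len(col) for col in columns)
    grand_total = sum(sum(col) for col in columns)
    correction = grand_total ** 2 / n_total                       # T^2 / N
    tss = sum(v * v for col in columns for v in col) - correction
    ssc = sum(sum(col) ** 2 / len(col) for col in columns) - correction
    sse = tss - ssc
    df_between = len(columns) - 1                                 # c - 1
    df_error = n_total - len(columns)                             # N - c
    msc = ssc / df_between
    mse = sse / df_error
    return {"TSS": tss, "SSC": ssc, "SSE": sse,
            "MSC": msc, "MSE": mse, "F": msc / mse,
            "df": (df_between, df_error)}


# Deviations from the origin 10, one list per technician (Example 7)
technicians = [
    [-4, 4, 0, -2, 1],    # X1
    [4, -1, 2, 0, 4],     # X2
    [0, 2, -3, 5, 1],     # X3
    [-1, 2, -2, 0, 1],    # X4
]
result = one_way_anova(technicians)
for key, value in result.items():
    print(key, value)
```

The resulting F is then compared with the tabulated value for (3, 16) degrees of freedom, which has to be read from an F table; the sketch does not compute critical values.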

4.2 Two way classification

Example: 8. A company appoints 4 salesmen A, B, C and D and observes their sales in 3 seasons. The figures (in lakhs of Rs.) are given below. Carry out the analysis of variance.

                 Salesmen
Season      A     B     C     D
Summer     45    40    38    37
Winter     43    41    45    38
Monsoon    39    39    41    41

Solution:
H0: There is no significant difference between the sales of the salesmen, nor between the sales in the different seasons.
H1: There is a significant difference between the sales of the salesmen or between the sales in the different seasons.

In order to simplify the calculations we shift the origin to 40 (change of origin).

             Salesmen
Seasons    X1    X2    X3    X4   Total   X1^2   X2^2   X3^2   X4^2
Y1          5     0    -2    -3      0     25      0      4      9
Y2          3     1     5    -2      7      9      1     25      4
Y3         -1    -1     1     1      0      1      1      1      1
Total       7     0     4    -4      7     35      2     30     14

Step 1: N = 12
Step 2: T = 7 (sum of all the values)
Step 3: Calculate $\dfrac{T^2}{N} = \dfrac{(7)^2}{12} = 4.083$
Step 4: TSS (Total Sum of Squares)

$$ TSS = \sum X_1^2 + \sum X_2^2 + \sum X_3^2 + \sum X_4^2 - \frac{T^2}{N} = 35 + 2 + 30 + 14 - 4.083 = 76.917 $$

Step 5: SSC (Sum of Squares between columns)

$$ SSC = \frac{\left(\sum X_1\right)^2}{N_{C_1}} + \frac{\left(\sum X_2\right)^2}{N_{C_2}} + \frac{\left(\sum X_3\right)^2}{N_{C_3}} + \frac{\left(\sum X_4\right)^2}{N_{C_4}} - \frac{T^2}{N} $$

where $N_{C_i}$ = number of elements in column i, i = 1, 2, 3, 4.

$$ SSC = \frac{(7)^2}{3} + \frac{(0)^2}{3} + \frac{(4)^2}{3} + \frac{(-4)^2}{3} - 4.083 = \frac{49}{3} + 0 + \frac{16}{3} + \frac{16}{3} - 4.083 = 22.917 $$

Step 6: SSR (Sum of Squares between rows)

$$ SSR = \frac{\left(\sum Y_1\right)^2}{N_{R_1}} + \frac{\left(\sum Y_2\right)^2}{N_{R_2}} + \frac{\left(\sum Y_3\right)^2}{N_{R_3}} - \frac{T^2}{N} $$

where $N_{R_i}$ = number of elements in row i, i = 1, 2, 3.

$$ SSR = \frac{(0)^2}{4} + \frac{(7)^2}{4} + \frac{(0)^2}{4} - 4.083 = 0 + \frac{49}{4} + 0 - 4.083 = 8.167 $$

Step 7: SSE = TSS - SSC - SSR = 76.917 - 22.917 - 8.167 = 45.833

ANOVA Table

Source of variation   Sum of squares   d.f.                                     Mean square                            Variance ratio (F)                     Table value (F)
Between columns       SSC = 22.917     c - 1 = 4 - 1 = 3                        MSC = SSC/(c - 1) = 7.639              F_c = MSC/MSE = 7.639/7.639 = 1.000    F_t(3, 6) at 5% LOS = 4.76
Between rows          SSR = 8.167      r - 1 = 3 - 1 = 2                        MSR = SSR/(r - 1) = 4.083              F_r = MSR/MSE = 4.083/7.639 = 0.535    F_t(2, 6) at 5% LOS = 5.14
Residual/Error        SSE = 45.833     N - (c + r - 1) = (c - 1)(r - 1) = 6     MSE = SSE/6 = 45.833/6 = 7.639
Total                 TSS = 76.917

Step 8: Conclusion: Since the calculated value of F is less than the table value of F in both cases, i.e. $F_c < F_t$, H0 is accepted.
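The two-way computation can be sketched in Python in the same spirit. The function name `two_way_anova` is illustrative, and the sketch assumes one observation per cell, as in Example 8. Working directly from the twelve deviations (origin 40), the squared deviations sum to 35 + 2 + 30 + 14 = 81, so the sketch reports TSS = 81 - 4.083 = 76.917 and SSE = 45.833. Both F ratios are formed as (effect mean square)/MSE.

```python
def two_way_anova(table):
    """Two-way ANOVA with one observation per cell; rows and columns are the two factors."""
    r = len(table)
    c = len(table[0])
    n = r * c
    grand_total = sum(sum(row) for row in table)
    correction = grand_total ** 2 / n                             # T^2 / N
    tss = sum(v * v for row in table for v in row) - correction
    ssr = sum(sum(row) ** 2 for row in table) / c - correction    # between rows
    col_totals = [sum(row[j] for row in table) for j in range(c)]
    ssc = sum(t * t for t in col_totals) / r - correction         # between columns
    sse = tss - ssc - ssr
    msc = ssc / (c - 1)
    msr = ssr / (r - 1)
    mse = sse / ((c - 1) * (r - 1))
    return {"TSS": tss, "SSC": ssc, "SSR": ssr, "SSE": sse,
            "F_columns": msc / mse, "F_rows": msr / mse}


# Deviations from the origin 40: rows are seasons, columns are salesmen A-D
sales = [
    [5, 0, -2, -3],    # Summer
    [3, 1, 5, -2],     # Winter
    [-1, -1, 1, 1],    # Monsoon
]
anova = two_way_anova(sales)
for key, value in anova.items():
    print(key, round(value, 3))
```

The two ratios are compared with the tabulated F values for (3, 6) and (2, 6) degrees of freedom respectively.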



5 Non-Parametric Tests (Wilcoxon and Mann-Whitney Tests)

5.1 Wilcoxon Signed Rank Test

When using this technique, the following assumptions should hold:

1. The differences $d_i$ are symmetrically distributed.
2. The differences $d_i$ are mutually independent.
3. The differences $d_i$ all have the same median.

This test can be applied to two types of sample: one sample or paired samples.
* For one sample, the method tests whether the sample could have been drawn from a population having a hypothesized value as its median.
* For paired samples, it tests whether the two populations from which the samples are drawn are identical.

Parameters used:
$d_i$ - difference of the paired samples (in the one-sample case, $x_i - m_0$)
$|d_i|$ - absolute value (modulus) of the difference
R - rank
$R(d_i)$ - signed rank

The Wilcoxon Signed Rank test for one sample

Null and alternative hypotheses

                    Two-tailed test      Right-tailed test    Left-tailed test
H0                  median = m0          median = m0          median = m0
H1                  median ≠ m0          median > m0          median < m0
Rejection region    min(T+, T-) ≤ a      T- ≤ a               T+ ≤ a

Here a is the critical value read from the Wilcoxon signed rank table for the chosen level of significance.

Test Procedure

1. For each of the observed values, find the difference between the value and the median, $d_i = x_i - m_0$, where $m_0$ is the median value that has been specified.
2. Ignoring the observations where $d_i = 0$, rank the $|d_i|$ values so that the smallest $|d_i|$ has rank 1. Where two or more differences have the same value, find their mean rank and use this.
3. For observations where $x_i > m_0$, list the rank in the $+R(d_i)$ column; for observations where $x_i < m_0$, list the rank in the $-R(d_i)$ column.
4. Sum the ranks of the positive differences, $T^+$, and sum the ranks of the negative differences, $T^-$:

$$ T^+ = \sum +R(d_i), \qquad T^- = \sum -R(d_i) $$

5. The test statistic W depends on the alternative hypothesis:
   * For a two-tailed test, the test statistic is $W = \min(T^+, T^-)$.
   * For a one-tailed test with H1: median > $m_0$, the test statistic is $W = T^-$.
   * For a one-tailed test with H1: median < $m_0$, the test statistic is $W = T^+$.
6. Critical region: compare the test statistic W with the critical value a in the tables; the null hypothesis is rejected if W ≤ a.
7. Make a decision.

Example: 9. An environmental activist believes her community's drinking water contains at least the 40.0 parts per million (ppm) limit recommended by health officials for a certain metal. In response to her claim, the health department samples and analyzes drinking water from a sample of 11 households in the community. The results are given in the table below. At the 0.05 level of significance, using the Wilcoxon method, can we conclude that the community's drinking water contains at least the recommended limit of 40.0 ppm?

Household                A     B      C     D      E      F      G      H      I      J      K
Observed concentration   39    20.2   40    32.2   30.5   26.5   42.1   45.6   42.1   29.9   40.9

Hints/Solution:

Here $m_0 = 40$.

Household   x_i     d_i = x_i - m0   |d_i|   Rank R(d_i)   +R(d_i)   -R(d_i)
A           39          -1            1          2                       2
B           20.2       -19.8         19.8       10                      10
C           40           0            0          -
D           32.2        -7.8          7.8        6                       6
E           30.5        -9.5          9.5        7                       7
F           26.5       -13.5         13.5        9                       9
G           42.1         2.1          2.1        3.5          3.5
H           45.6         5.6          5.6        5            5
I           42.1         2.1          2.1        3.5          3.5
J           29.9       -10.1         10.1        8                       8
K           40.9         0.9          0.9        1            1
Total                                                        13         42

Step: 1
Null Hypothesis H0: median ≥ 40, i.e. the community's drinking water contains at least the recommended limit of 40.0 ppm.
Alternative Hypothesis H1: median < 40, i.e. the community's drinking water contains less than the recommended limit of 40.0 ppm.

Step: 2
Based on the alternative hypothesis, the test statistic is $W_c = T^+ = \sum +R(d_i) = 13$.

Step: 3
The table value (critical value) of the Wilcoxon signed rank test for a one-tailed test at 5% LOS with n = 11 - 1 = 10 non-zero differences is $W_t = 10$.

Step: 4
H0 is rejected only if $W_c \le W_t$. Since $W_c = 13 > W_t = 10$, we fail to reject H0: at the 5% level the data do not contradict the claim that the water contains at least 40.0 ppm.
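The ranking procedure of Example 9 can be sketched in Python as follows. The function name `signed_rank_sums` is hypothetical; the critical value 10 is taken from the example, not computed. Zero differences are dropped, tied absolute differences receive their mean rank, and H0 is rejected only when the test statistic is less than or equal to the critical value.

```python
def signed_rank_sums(values, m0):
    """Return (n, T_plus, T_minus) for a one-sample Wilcoxon signed rank test."""
    # Round to avoid floating-point noise when detecting zeros and ties
    diffs = [round(x - m0, 6) for x in values]
    diffs = [d for d in diffs if d != 0]              # ignore d_i = 0
    ordered = sorted(abs(d) for d in diffs)

    def mean_rank(a):
        first = ordered.index(a) + 1                  # smallest |d_i| has rank 1
        count = ordered.count(a)                      # tied values share a mean rank
        return first + (count - 1) / 2

    t_plus = sum(mean_rank(abs(d)) for d in diffs if d > 0)
    t_minus = sum(mean_rank(abs(d)) for d in diffs if d < 0)
    return len(diffs), t_plus, t_minus


concentration = [39, 20.2, 40, 32.2, 30.5, 26.5, 42.1, 45.6, 42.1, 29.9, 40.9]
n, t_plus, t_minus = signed_rank_sums(concentration, m0=40)

critical_value = 10          # one-tailed, 5% LOS, n = 10 (from the table used in Example 9)
w = t_plus                   # H1: median < m0, so W = T+
reject_h0 = w <= critical_value

print(n, t_plus, t_minus, reject_h0)
```

A quick consistency check is that $T^+ + T^- = n(n+1)/2$, which is 55 for n = 10.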

5.2 Wilcoxon Rank Sum Test or Mann-Whitney Test

+ Used to determine whether a difference exists between two populations.
+ Sometimes called the Wilcoxon rank sum test.
+ Two independent random samples are required, one from each population. Let $m_1$ and $m_2$ denote the medians of populations X and Y, from which random samples of sizes $n_1$ and $n_2$ (with $n_1 \le n_2$) are drawn respectively.

Null and alternative hypotheses

                    Two-tailed test     Left-tailed test           Right-tailed test
H0                  m1 = m2             m1 = m2 (or m1 ≥ m2)       m1 = m2 (or m1 ≤ m2)
H1                  m1 ≠ m2             m1 < m2                    m1 > m2
Rejection region    W ∉ [a, b]          W < cv                     W > cv

cv = [a, b] = critical value, with b = upper critical value and a = lower critical value.

Hypotheses to be tested are:
H0: The distributions for populations 1 and 2 are identical.
H1: The distributions for populations 1 and 2 are different (two-tailed test).
H1: The distribution for population 1 lies to the left of that for population 2 (left-tailed test).
H1: The distribution for population 1 lies to the right of that for population 2 (right-tailed test).

1. Rank all $n_1 + n_2$ observations from smallest to largest.
2. Test statistic:
   i. Find $T_1$, the rank sum for the observations in sample 1 (left-tailed test).
   ii. Find $T_1^* = n_1(n_1 + n_2 + 1) - T_1$ (right-tailed test).
   iii. Find T, the minimum of $T_1$ and $T_1^*$.
3. Rejection region: reject H0 if the test statistic is less than the critical value. The critical value is read from the Wilcoxon rank sum test table.
4. Conclusion.

Example: 10. The data below show the marks obtained by electrical engineering students in an examination. Can we conclude that the achievements of male and female students are identical at significance level α = 0.1?

Gender    Marks
Male       60
Male       62
Male       78
Male       83
Female     40
Female     65
Female     70
Female     88
Female     92

Hints/Solution:

Gender    Marks   Rank
Male       60      2
Male       62      3
Male       78      6
Male       83      7
Female     40      1
Female     65      4
Female     70      5
Female     88      8
Female     92      9

Step: 1
Null Hypothesis H0: Male and female achievement are the same.
Alternative Hypothesis H1: Male and female achievement are not the same.

Step: 2
The test statistic is obtained as follows. We have $n_1 = 4$, $n_2 = 5$ and $W = \sum R_1 = 2 + 3 + 6 + 7 = 18$.

Step: 3
From the table of the Wilcoxon rank sum test for α = 0.1 (two-tailed test) with $n_1 = 4$, $n_2 = 5$, the critical values are $W_t$ = 13, 27.

Step: 4
Reject H0 if W ∉ [a, b], i.e. W ∉ [13, 27].
Since 18 ∈ [13, 27], we fail to reject H0 and conclude that the achievements of male and female students are not significantly different.
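The pooled ranking and rank sums of Example 10 can be sketched in Python as below. The function name `rank_sums` is illustrative, and the critical interval [13, 27] is the one quoted in the example rather than something the code derives. Tied observations would receive their mean rank, although this data set has none.

```python
def rank_sums(sample1, sample2):
    """Return (T1, T1_star): rank sum of sample 1 in the pooled data and its complement."""
    pooled = sorted(sample1 + sample2)

    def mean_rank(v):
        first = pooled.index(v) + 1          # rank 1 = smallest observation
        count = pooled.count(v)              # ties share the mean of their ranks
        return first + (count - 1) / 2

    n1, n2 = len(sample1), len(sample2)
    t1 = sum(mean_rank(v) for v in sample1)
    t1_star = n1 * (n1 + n2 + 1) - t1
    return t1, t1_star


male = [60, 62, 78, 83]             # sample 1, n1 = 4
female = [40, 65, 70, 88, 92]       # sample 2, n2 = 5
t1, t1_star = rank_sums(male, female)

lower, upper = 13, 27               # two-tailed critical values at alpha = 0.1 (from the example)
reject_h0 = not (lower <= t1 <= upper)

print(t1, t1_star, reject_h0)
```

Here `t1` is the W of Step 2; it lies inside the interval, so H0 is not rejected.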

Practice/Exercise Problems

Problems based on Pearson's correlation, Spearman's correlation and regression coefficients

1. Find Pearson's and Spearman's correlation coefficients and the two lines of regression for the following data:

Sales         15   18   25   27    30    35
Expenditure   50   65   82   95   110   120

2. Find Pearson's and Spearman's correlation coefficients and the two lines of regression for the following data:

X    1    3    5    7    8   10
Y    8   12   15   17   18   20

3. Find Pearson's and Spearman's correlation coefficients and the two lines of regression for the following data:

X   43   44   46   40   44   42   45   42   38   40   42   57
Y   29   31   19   18   19   27   27   29   41   30   26   10

4. Find Pearson's and Spearman's correlation coefficients and the two lines of regression for the following data:

Marks in Mathematics (columns): 47 52 57 62 67
Marks in Statistics (rows):
57   3 4 2
62   4 8 8 2
67   7 12 4 4
72   3 10 8 5
77   3 5 8

Hint: Use the formula

$$ r_{XY} = r_{UV} = \frac{N\sum f_{XY}\,uv - \sum f_X u \,\sum f_Y v}{\sqrt{N\sum f_X u^2 - \left(\sum f_X u\right)^2}\;\sqrt{N\sum f_Y v^2 - \left(\sum f_Y v\right)^2}} $$

with $u = X - 20$ and $v = Y - 35$ or $v = \dfrac{Y - 35}{10}$.

Ans: $r_{XY} = 0.63$
5. Find Pearson's and Spearman's correlation coefficients and the two lines of regression for the following data:

                 Age X
Marks Y      18    19    20    21   Total
10-20         4     2     2     -      8
20-30         5     4     6     4     19
30-40         6     8    10    11     35
40-50         4     4     6     8     22
50-60         -     2     4     4     10
60-70         -     2     3     1      6
Total        19    22    31    28    100

Ans: $r_{XY} = 0.1897$

6. Calculate Pearson's correlation coefficient and the regression equations for the following data:

Age X (columns): 16-17, 17-18, 18-19, 19-20
Marks Y (rows):
30-40   20 10 3 2
40-50   4 28 6 4
50-60   5 11
60-70   2
70-80   5

7. Calculate Pearson's and Spearman's correlation coefficients for the following data:

X   43   44   46   40   44   42   45   42   38   40   42   57
Y   29   31   19   18   19   27   27   29   41   30   26   10

Sales         15   18   25   27    30    35
Expenditure   50   65   82   95   110   120

8. The competitors in a musical contest were ranked by 3 judges. Which pair of judges have more or less the same taste in music?

S.No.             1    2    3    4    5    6    7    8    9   10
Rank by Judge A   6    5    3   10    2    4    9    7    8    1
Rank by Judge B   5    8    4    7   10    2    1    6    9    3
Rank by Judge C   4    9    8    1    2    3   10    5    7    6

Ans: $r_{AB} = 0.0486$, $r_{BC} = -0.2970$ and $r_{AC} = 0.0424$. A and B have the same taste or attitude towards ranking.
9. Ten participants in a competition were ranked according to their performance by 3 judges. Which pair of judges have nearly the same attitude towards ranking?

S.No.       1    2    3    4    5    6    7    8    9   10
Rank by X   1    6    5   10    3    2    4    9    7    8
Rank by Y   3    5    8    4    7   10    2    1    6    9
Rank by Z   6    4    9    8    1    2    3   10    5    7

Ans: X and Z have the same taste or attitude towards ranking.

10. Find Pearson's and Spearman's correlation coefficients and the two lines of regression for the following data:

X   1   2    3    4    5    6    7
Y   9   8   10   12   11   13   14
11. The owner of a chain of ten stores wishes to forecast net profit with the help of next year's projected sales of food and non-food items. The data on the current year's sales of food items, sales of non-food items and net profit for all ten stores are as follows.

Supermarket No.                     1     2     3     4     5     6     7     8     9    10
Net profit in crores (Y)          5.6   4.7   5.4   5.5   5.1   6.8   5.8   8.2   5.8   6.2
Sales of food in crores (X1)       20    15    18    20    16    25    22    30    24    25
Sales of non-food in crores (X2)    5     5     6     5     6     6     4     7     3     4

Fit the equation

$$ Y = b_0 + b_1 X_1 + b_2 X_2 $$

where $b_0$, $b_1$ and $b_2$ are found by solving the normal equations

$$ \sum Y = n b_0 + b_1 \sum X_1 + b_2 \sum X_2 $$
$$ \sum Y X_1 = b_0 \sum X_1 + b_1 \sum X_1^2 + b_2 \sum X_1 X_2 $$
$$ \sum Y X_2 = b_0 \sum X_2 + b_1 \sum X_1 X_2 + b_2 \sum X_2^2 $$

Answer: $b_0 = 0.233$, $b_1 = 0.196$, $b_2 = 0.287$, $y = 0.233 + 0.196x_1 + 0.287x_2$
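The normal equations of Problem 11 can be set up and solved with a short Python sketch. The helper names `solve_linear` and `fit_two_predictors` are illustrative, and plain Gaussian elimination is just one convenient way to solve the 3 x 3 system; any linear solver would do.

```python
def solve_linear(matrix, rhs):
    """Solve a small square system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [[float(v) for v in row] + [float(b)] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[r][k] -= factor * aug[col][k]
    solution = [0.0] * n
    for i in reversed(range(n)):                       # back substitution
        known = sum(aug[i][j] * solution[j] for j in range(i + 1, n))
        solution[i] = (aug[i][n] - known) / aug[i][i]
    return solution


def fit_two_predictors(y, x1, x2):
    """Return (b0, b1, b2) of Y = b0 + b1*X1 + b2*X2 from the normal equations."""
    n = len(y)
    s1, s2 = sum(x1), sum(x2)
    s11 = sum(a * a for a in x1)
    s22 = sum(b * b for b in x2)
    s12 = sum(a * b for a, b in zip(x1, x2))
    matrix = [[n, s1, s2], [s1, s11, s12], [s2, s12, s22]]
    rhs = [sum(y),
           sum(v * a for v, a in zip(y, x1)),
           sum(v * b for v, b in zip(y, x2))]
    return tuple(solve_linear(matrix, rhs))


profit = [5.6, 4.7, 5.4, 5.5, 5.1, 6.8, 5.8, 8.2, 5.8, 6.2]
food = [20, 15, 18, 20, 16, 25, 22, 30, 24, 25]
non_food = [5, 5, 6, 5, 6, 6, 4, 7, 3, 4]
b0, b1, b2 = fit_two_predictors(profit, food, non_food)

print(round(b0, 3), round(b1, 3), round(b2, 3))
```

The same two functions apply unchanged to any data set with two predictors, such as the family food-expenditure survey in the next problem.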



12. The annual food expenditure of a family depends on the net income of the family and the number of members in the family. Data from a sample survey of 6 families are given below. Find the equation of multiple regression.

Food expenditure in 1000s (Y)   10   12   14   15   10   11
Net income (X1)                 25   30   25   32   20   21
No. of members (X2)              5    6    3    6    2    2

Ans: The normal equations are

$$ 6b_0 + 153b_1 + 24b_2 = 72 $$
$$ 153b_0 + 4015b_1 + 654b_2 = 1871 $$
$$ 24b_0 + 654b_1 + 114b_2 = 296 $$

On solving, $b_0 = -6.81$, $b_1 = 1.05$, $b_2 = -2.01$.

The regression equation is $Y = -6.81 + 1.05X_1 - 2.01X_2$.
13. Given the following data, fit a regression equation representing the dependence of the number of credit cards on family size and family income. Also show whether the addition of the family income variable has improved the relationship, by finding the sums of squares of errors and by calculating the simple and multiple correlation coefficients. Fit and determine the multiple regression equation.

No. of credit cards         4    6    6    7    8    7    8   10
Family size                 2    2    4    4    5    5    6    6
Family income in lakhs     14   16   14   17   18   21   17   25
14. Bivariate data often arise from the use of two different techniques to measure the same quantity. As an example, the accompanying observations on x = hydrogen concentration (ppm) using a gas chromatography method and y = concentration using a new sensor method were given.

x    47    62    65    70    70    78    95   100   114   118
y    38    62    53    67    84    79    93   106   117   116

x   124   127   140   140   140   150   152   164   198   221
y   127   114   134   139   142   170   149   154   200   215

Find Pearson's and Spearman's correlation coefficients and the lines of regression. Construct a scatter plot. Does there appear to be a very strong relationship between the two types of concentration measurements? Do the two methods appear to be measuring roughly the same quantity? Explain your reasoning.
15. The accompanying data on y = ammonium concentration (mg/L) and x = transpiration (ml/h) are given. Find Pearson's and Spearman's correlation coefficients and the lines of regression. How would you describe the relationship between the variables, and does simple linear regression appear to be an appropriate modeling strategy?

x    5.8    8.8   11.0   13.6   18.5   21.0   23.7
y    7.8    8.2    6.9    5.3    4.7    4.9    4.3

x   26.0   28.3   31.9   36.5   38.2   40.4
y    2.7    2.8    1.8    1.9    1.1    0.4


16. An experiment was carried out to investigate how the behavior of mozzarella cheese varies with temperature. Consider the accompanying data on x = temperature and y = elongation (%) at failure of the cheese.

x    59    63    68    72    74    78    83
y   118   182   247   208   197   135   132

Find Pearson's and Spearman's correlation coefficients and the lines of regression.
a. Construct a scatter plot in which the axes intersect at (0, 0). Mark 0, 20, 40, 60, 80, and 100 on the horizontal axis and 0, 50, 100, 150, 200, and 250 on the vertical axis.
b. Construct a scatter plot in which the axes intersect at (55, 100), as was done in the cited article. Does this plot seem preferable to the one in part (a)? Explain your reasoning.
c. What do the plots of parts (a) and (b) suggest about the nature of the relationship between the two variables?

Problems based on ANOVA

1. The following table gives the number of refrigerators sold by 4 dealers in three months. Is there any significant difference between the sales made by the dealers and the sales made by them in different months?

              Dealer
Month     A     B     C     D
I        50    40    48    39
II       46    48    50    45
III      39    44    40    39

2. There are three main brands of a certain powder. A set of 120 sample values is examined and found to be allocated among four groups and three brands as shown below. Is there any significant difference in brand preference?

              Groups
Brands    A     B     C     D
I         0     4     8    15
II        5     8    13     6
III       8    19    11    13


3. The following data represent the number of units of production per day turned out by 5 different workers using 4 different types of machines. (a) Test whether the five men differ with respect to mean productivity. (b) Test whether the mean productivity is the same for the four different machine types.

              Machine Type
Workers     A     B     C     D
I          44    38    47    36
II         46    40    52    43
III        34    36    44    32
IV         43    38    46    33
V          38    42    49    39

4. Four doctors each test four treatments for a certain disease and observe the number of days each patient takes to recover. The results are as follows (recovery time in days). Discuss the difference between the doctors and the treatments.

              Groups
Brands    1     2     3     4
A        10    14    19    20
B        11    15    17    21
C         9    12    16    19
D         8    13    17    20

5. The following data are on total Fe for four types of iron formation (1 = carbonate, 2 = silicate, 3 = magnetite, 4 = hematite).

1:   20.5   28.1   27.8   27.0   28.0   25.2   25.3   27.1   20.5   31.3
2:   26.3   24.0   26.2   20.2   23.7   34.0   17.1   26.8   23.7   24.9
3:   29.5   34.0   27.5   29.4   27.9   26.2   29.9   29.5   30.0   35.6
4:   36.5   44.2   34.1   30.3   31.4   33.1   34.1   32.9   36.3   25.5

Carry out an analysis of variance F test at significance level 0.01, and summarize the results in an ANOVA table.
6. A consumer product-testing organization wished to compare the annual power consumption for five different brands of dehumidifier. Because power consumption depends on the prevailing humidity level, it was decided to monitor each brand at four different levels ranging from moderate to heavy humidity (thus blocking on humidity level). Within each level, brands were randomly assigned to the five selected locations. (a) Test whether the five brands differ with respect to treatment. (b) Test whether the brands are the same for the four different humidity levels.

                        Blocks (humidity level)
Treatment (brands)     A      B      C      D
I                     685    792    838    875
II                    722    806    893    953
III                   733    802    880    941
IV                    811    888    952   1005
V                     828    920    978   1023


7. Four different coatings are being considered for corrosion protection of metal pipe. The pipe will be buried in three different types of soil. To investigate whether the amount of corrosion depends either on the coating or on the type of soil, 12 pieces of pipe are selected. Each piece is coated with one of the four coatings and buried in one of the three types of soil for a fixed time, after which the amount of corrosion (depth of maximum pits, in 0.0001 in.) is determined. The data appear in the table. (a) Test whether the soil types differ with respect to treatment. (b) Test whether the soil types are the same for the four different coatings.

              Soil Type
Coating     A     B     C
I          64    49    50
II         53    51    48
III        47    45    50
IV         51    43    52

8. In an experiment to see whether the amount of coverage of light-blue interior latex paint depends either on the brand of paint or on the brand of roller used, one gallon of each of four brands of paint was applied using each of three brands of roller, resulting in the following data (number of square feet covered). (a) Test whether the roller brands differ with respect to treatment. (b) Test whether the paint brands differ with respect to treatment. (c) Test whether the roller brands are the same for the different paint brands.

                 Roller Brand
Paint Brand     A      B      C
I              454    446    451
II             446    444    447
III            439    442    444
IV             444    437    443

9. The following data are given on the effort required of a subject to arise from four different types of stools (Borg scale). Perform an analysis of variance using α = 0.05, and follow this with a multiple comparisons analysis if appropriate. (a) Test whether the subjects differ with respect to treatment. (b) Test whether the types of stools differ with respect to treatment. (c) Test whether the subject is the same for the different types of stools.

                   Subject
Type of Stool    1    2    3    4    5    6    7    8    9
I               12   10    7    7    8    9    8    7    9
II              15   14   14   11   11   11   12   11   13
III             12   13   13   10    8   11   12    8   10
IV              10   12    9    9    7   10   11    7    8


Problems based on Non-Parametric tests (Wilcoxon and Mann-Whitney tests)

1. Student satisfaction surveys ask students to rate a particular course on a scale of 1 (poor) to 10 (excellent). In previous years the replies have been symmetrically distributed about a median of 4. This year there has been a much greater on-line element to the course, and staff want to know how the rating of this version of the course compares with the previous one. 14 students, randomly selected, were asked to rate the new version of the course and their ratings were 1, 3, 6, 4, 8, 2, 3, 6, 5, 2, 3, 4, 1, 2. Is there any evidence at the 5% level that students rate this version any differently?

2. The following data represent the number of hours that a rechargeable hedge trimmer operates before a recharge is required: 1.5, 2.2, 0.9, 1.3, 2.0, 1.6, 1.8, 1.5, 2.0, 1.2, 1.7. Use the Wilcoxon signed rank test to test the hypothesis, at the 0.05 level of significance, that this particular trimmer operates a median of 1.8 hours before requiring a recharge.
3. The following data represent the time, in minutes, that a patient has to wait during 12 visits to a doctor's office before being seen by the doctor: 17, 15, 20, 20, 32, 28, 12, 26, 25, 25, 35 and 24. Use the Wilcoxon signed rank test at the 0.05 level of significance to test the doctor's claim that the median waiting time for her patients is not more than 20 minutes.

4. Using high school records, Johnson High School administrators selected a random sample of four high school students who attended Garfield Junior High and another random sample of five students who attended Mulbery Junior High. The ordinal class standings for the nine students are listed in the table below. Test using the Mann-Whitney test at the 0.05 level of significance.

Garfield Junior High           Mulbery Junior High
Student   Class standing       Student   Class standing
Fields           8             Hart             70
Clark           52             Phipps          202
Jones          112             Kirwood         144
Tibbs           21             Abbott          175
                               Guest           146

5. The effectiveness of advertising for two rival products (Brand X and Brand Y) was compared. Market research at a local shopping centre was carried out, with the participants being shown adverts for two rival brands of coffee, which they then rated on the overall likelihood of them buying the product (out of 10, with 10 being definitely going to buy the product). Half of the participants gave ratings for one of the products, the other half gave ratings for the other product. Test using the Mann-Whitney test at the 0.05 level of significance.

Brand X                    Brand Y
Participant   Rating       Participant   Rating
1               3          1               9
2               4          2               7
3               2          3               5
4               6          4              10
5               2          5               6
6               5          6               8

6. The nicotine content of two brands of cigarettes, measured in milligrams, was found to be as follows:

Brand-A   2.1   4.0   6.3   5.4   4.8   3.7   6.1   3.3
Brand-B   4.1   0.6   3.1   2.5   4.0   6.2   1.6   2.2   1.9   5.4

Test the hypothesis using the Mann-Whitney test, at the 0.05 level of significance, that the median nicotine contents of the two brands are equal against the alternative that they are unequal.

7. To find out whether a new serum will arrest leukemia, nine patients, who have all reached an advanced stage of the disease, are selected. Five patients receive the treatment and four do not. The survival times, in years, from the time the experiment commenced are

Treatment      2.1   5.3   1.4   4.6   0.9
No-treatment   1.9   0.5   2.8   3.1

Use the rank-sum test, at the 0.05 level of significance, to determine if the serum is effective.
8. A fishing line is being manufactured by two processes. To determine if there is a difference in the mean breaking strength of the lines, 10 pieces manufactured by each process are selected and then tested for breaking strength. The results are as follows:

Process-1   10.4    9.8   11.5   10.0    9.9    9.6   10.9   11.8    9.3   10.7
Process-2    8.7   11.2    9.8   10.1   10.8    9.5   11.0    9.8   10.5    9.9

Use the rank-sum test with α = 0.1 to determine if there is a difference between the mean breaking strengths of the lines manufactured by the two processes.
9. The urinary fluoride concentration (parts per million) was measured both for a sample of livestock grazing in an area previously exposed to fluoride pollution and for a similar sample grazing in an unpolluted region:

Polluted     21.3   18.7   23.0   17.1   16.8   20.9   19.7
Unpolluted   14.2   18.3   17.2   18.4   20.0

Do the data indicate strongly that the true average fluoride concentration for livestock grazing in the polluted region is larger than for the unpolluted region? Use the Wilcoxon rank-sum test at level α = 0.01.
10. A random sample of 15 automobile mechanics certified to work on a certain type of car was selected, and the time (in minutes) necessary for each one to diagnose a particular problem was determined, resulting in the following data: 30.6, 30.1, 15.6, 26.7, 27.1, 25.4, 35.0, 30.8, 31.9, 53.2, 12.5, 23.2, 8.8, 24.9, 30.2. Use the Wilcoxon test at significance level 0.10 to decide whether the data suggest that the true average diagnostic time is less than 30 minutes.
TAKE MORE PROBLEMS FROM THE TEXTBOOK AND SOME REFERENCE BOOKS FOR YOUR PRACTICE.

Contact: (+91) 979 111 666 3 (or) athithan.s@ktr.srmuniv.ac.in
Visit: https://sites.google.com/site/lecturenotesofathithans/home

[Tables of critical values: Wilcoxon rank sum test; Wilcoxon signed rank test; Spearman's rank correlation coefficient]