
4.

Regression and Correlation Analysis

This section introduces regression analysis, a method used to describe the relationship between two variables, and then explains correlation analysis, which measures the strength of that relationship. This manual uses Spearman's rank correlation coefficient and Pearson's product moment correlation coefficient as measures of the strength of the relationship between two variables.

Regression analysis is concerned with estimating one variable (the dependent variable) on the basis of one or more other variables (the independent variables). An analyst trying to predict the share price of a particular sector, for instance, would have a whole range of independent variables to consider. In this manual we restrict our attention to the particular case where a dependent variable y is related to a single independent variable x.

The Regression Equation

When only one independent variable is used in making a forecast, the technique is called simple linear regression. The forecasts are made by means of a straight line using the equation

$$y = a + bx$$

where
a = the y-intercept (the value of y when x = 0)
b = the slope (the amount by which y changes for a unit change in x)

The linear function is useful because it is mathematically simple and because it provides a reasonably close approximation in many situations. The first step in establishing whether there is a relationship between two variables is to draw a scatter diagram, which is a plot of the two variables on an x-y graph. Given that we believe there is a relationship between the two variables, the second step is to determine the form of this relationship.


Example 1 Consider the following data from a major appliance store: the daily high temperature and the number of air conditioning units sold on 8 randomly selected business days during the hot dry season.

Daily high temperature (x), °C   27  35  18  20  46  36  26  23
Number of units (y)               5   6   2   1   6   4   3   3

Draw a scatter diagram for the data.

Figure 1.0: Scatter diagram of number of units (y) against daily high temperature in °C (x).


The distribution of points in the scatter diagram suggests that a straight line roughly fits these points. The most straightforward method of fitting a straight line to a set of data points is by eye. The values of a and b can then be determined from the graph: a is the intercept on the vertical axis and b is the slope.

The other method is that of semi-averages. This technique consists of splitting the data into two equal groups, plotting the mean point for each group and joining these points with a straight line.

Example 2 Using the data of Example 1, fit a straight line using the method of semi-averages.

The procedure for obtaining the y on x regression line is as follows:

Step 1 Sort the data into size order by x-value.

x   18  20  23  26  27  35  36  46
y    2   1   3   3   5   6   4   1

Step 2 Split the data into two equal groups, a lower half and an upper half (if there is an odd number of items, drop the central one).

            Lower half of data      Upper half of data
            x        y              x        y
            18       2              27       5
            20       1              35       6
            23       3              36       4
            26       3              46       1
Totals      87       9              144      16
Averages    21.75    2.25           36       4

Table 1.0
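The method of semi-averages can be checked numerically. The following is a minimal Python sketch, not part of the original manual: it sorts the data, splits it into halves, computes the two mean points and returns the line through them. The function name `semi_average_line` is an illustrative choice, and the data are the sorted x and y values listed in Step 1.

```python
def semi_average_line(xs, ys):
    """Return (a, b) for the line y = a + b*x through the two half-group means."""
    pairs = sorted(zip(xs, ys))      # sort the data by x-value
    n = len(pairs)
    half = n // 2
    lower = pairs[:half]             # lower half of the data
    upper = pairs[n - half:]         # upper half (central item dropped if n is odd)

    def mean_point(group):
        mean_x = sum(x for x, _ in group) / len(group)
        mean_y = sum(y for _, y in group) / len(group)
        return mean_x, mean_y

    (x1, y1), (x2, y2) = mean_point(lower), mean_point(upper)
    b = (y2 - y1) / (x2 - x1)        # slope of the line joining the mean points
    a = y1 - b * x1                  # intercept on the y-axis
    return a, b


# Data as tabulated in Step 1 of Example 2
x = [18, 20, 23, 26, 27, 35, 36, 46]
y = [2, 1, 3, 3, 5, 6, 4, 1]

a, b = semi_average_line(x, y)
print(f"y = {a:.3f} + {b:.3f}x")
```

The two mean points here are (21.75, 2.25) and (36, 4), so the slope is 1.75 / 14.25. Reading a and b from a hand-drawn graph will give slightly different values depending on the accuracy of the plot.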

Figure 2.0: Method of semi-averages for Example 1.

Step 3 Calculate the mean point for each group.

Step 4 Plot the mean points from Step 3 on a graph with suitably scaled axes and join them with a straight line. This is the required y on x regression line.

Least Squares Line

Consider a typical data point with coordinates (x_i, y_i) (see Figure 3.0). The error in the forecast (the y-coordinate of the data point minus the forecast value given by the straight line) is denoted by e_i. The line that minimizes the sum of the squared errors, Σ e_i², is called the least squares line or the regression line. Its coefficients can be derived using calculus; here we simply give the best estimates of a and b, which are obtained from the following formulas.

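$$b = \frac{\sum xy - \dfrac{\sum x \sum y}{n}}{\sum x^2 - \dfrac{\left(\sum x\right)^2}{n}}, \qquad a = \bar{y} - b\,\bar{x}$$

where n is the number of data points.

The values of a and b are then substituted into the equation $\hat{y} = a + bx$.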
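The formulas for a and b translate directly into a few lines of code. The sketch below is illustrative only (the function name `least_squares_line` is an assumption, not something from the manual); it uses the eight sorted (x, y) pairs listed in Example 2.

```python
def least_squares_line(xs, ys):
    """Return (a, b) for the least squares line y = a + b*x."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)

    # b = (Sxy - Sx*Sy/n) / (Sx2 - (Sx)^2/n)
    b = (sum_xy - sum_x * sum_y / n) / (sum_x2 - sum_x ** 2 / n)
    # a = mean(y) - b * mean(x)
    a = sum_y / n - b * sum_x / n
    return a, b


# Data as tabulated in Example 2
x = [18, 20, 23, 26, 27, 35, 36, 46]
y = [2, 1, 3, 3, 5, 6, 4, 1]

a, b = least_squares_line(x, y)
print(f"a = {a:.2f}, b = {b:.4f}")
```

Working from the sums rather than from deviations about the means mirrors the hand calculation, which makes it easy to compare each intermediate total with a worked table.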

Figure 3.0: The least squares line with the error term e_i, shown as the vertical distance between the data point y_i and the line.

Example 3 Fit the least squares line to the data in Example 1.


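Table 2 shows the calculations for the estimates of a and b.

          x      y      x²      y²     xy
          18     2      324     4      36
          20     1      400     1      20
          23     3      529     9      69
          26     3      676     9      78
          27     5      729     25     135
          35     6      1225    36     210
          36     4      1296    16     144
          46     1      2116    1      46
Totals    231    25     7295    101    738

Table 2

n = 8, Σx = 231, Σy = 25, Σx² = 7295, Σy² = 101, Σxy = 738

$$\bar{x} = \frac{231}{8} = 28.875, \qquad \bar{y} = \frac{25}{8} = 3.125$$

$$b = \frac{\sum xy - \dfrac{\sum x \sum y}{n}}{\sum x^2 - \dfrac{\left(\sum x\right)^2}{n}} = \frac{738 - \dfrac{(231)(25)}{8}}{7295 - \dfrac{(231)^2}{8}} = \frac{16.125}{624.875} = 0.0258$$

$$a = \bar{y} - b\,\bar{x} = 3.125 - 0.0258(28.875) = 2.38$$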

giving the equation of the regression line as $\hat{y} = 2.38 + 0.0258x$.

Forecasting Using the Regression Line

Having obtained the regression line, it can be used to forecast the value of y for a given value of x. Suppose that we wish to determine the number of units sold when the daily high temperature is 42 °C. From the regression line, the forecast value of y is $\hat{y} = 2.38 + 0.0258(42) = 3.46$, i.e. the expected number of units sold is 3.
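A forecast of this kind is simply an evaluation of the fitted line. As a hedged illustration, the sketch below evaluates the line with the rounded coefficients from Example 3 and also reports whether the chosen x value lies inside the range of the observed temperatures, which affects how much the forecast can be trusted. The names `forecast` and `within_data_range` are hypothetical helpers introduced for this sketch.

```python
# Rounded coefficients of the regression line from Example 3
A = 2.38
B = 0.0258

# Observed daily high temperatures used to fit the line
X_OBSERVED = [18, 20, 23, 26, 27, 35, 36, 46]


def forecast(x, a=A, b=B):
    """Forecast y from the regression line y = a + b*x."""
    return a + b * x


def within_data_range(x, xs=X_OBSERVED):
    """True if x lies between the smallest and largest observed x values."""
    return min(xs) <= x <= max(xs)


temperature = 42
y_hat = forecast(temperature)
print(f"forecast at {temperature} degrees: {y_hat:.2f}")
print("inside observed range:", within_data_range(temperature))
```

A temperature of 42 °C lies between the smallest (18) and largest (46) observed values, so the check returns True for this forecast.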

Now suppose that we wish to determine the number of units sold if the temperature is 49 °C. The forecast value of y is then given by $\hat{y} = 2.38 + 0.0258(49) = 3.64$, i.e. the expected number of units sold is 4.

The two examples differ in that the first y value was forecast from an x value within the range of x values in the original data set, while the second was forecast from an x value outside that range. The first example is a case of interpolation and the second is a case of extrapolation. With extrapolation, the assumption is that the relationship between the two variables continues to behave in the same way outside the range of x values from which the least squares line was computed.

Exercise 7

1. For the following data

x   2   5   6   8   10   11   13   16
y   2   3   4   5   6    8    9    10

a) Draw a scatter diagram.
b) By eye, fit a straight line to the data (ensuring it passes through the mean value).
c) Fit the equation of the line by the method of semi-averages.

2. The following data have been collected regarding sales and advertising expenditure.

Sales (K millions)                  10.5   11.2   9.9   10.6   11.4   12.1
Advertising expenditure (K'000s)    230    280    310   350    400    430

a) Plot the above data on a scatter diagram.
b) Fit the regression line using the method of least squares.
c) Estimate the sales if K530 000 is spent on advertising expenditure. Note that advertising expenditure is the x variable and sales is the y variable.

3. Fit a least squares line to the data in the table below.

x   5   7   8   10   11   13
y   4   5   6   8    7    10

4. The table below shows the final grades in Mathematics and Communication obtained by students selected at random from a large group of students.

Mathematics (x)      80   86   97   70   89   75   99   69   87   78
Communication (y)    75   65   80   65   80   70   79   45   70   80

a) Graph the data.
b) Fit a least squares line.
c) If a student receives a grade of 85 in Mathematics, what is her expected grade in Communication?
d) If a student receives a grade of 65 in Communication, what is her expected grade in Mathematics?

5. The table below shows the birth rate per 1000 population during 1999 to 2005.

Year                   1999   2000   2001   2002   2003   2004   2005
Birth rate per 1000    14.6   14.5   13.8   13.4   13.6   12.8   12.6

a) Graph the data.
b) Find the least squares line fitting the data. Code the years 1999 to 2005 as the whole numbers 1 through 7.
c) Predict the birth rate in 2009, assuming the present trend continues.

4.7

Correlation Analysis

Correlation analysis is used to determine the degree of association between two variables. Having obtained the equation of the regression line, correlation analysis can be used to measure how well one variable is linearly related to another. The coefficient of correlation r can assume any value in the range -1 to +1 inclusive. When r is close to or equal to +1, we have a positive correlation; when r is close to or equal to -1, we have a negative correlation. The sign of the correlation coefficient is the same as the sign of the slope of the regression line. The following scatter diagrams illustrate certain values of the correlation coefficient.

[Scatter diagrams illustrating r = 0, r = 1 and r = -1]

Whether a linear relationship exists between two variables x and y can be investigated by calculating Pearson's product moment correlation coefficient (PPMCC), denoted by r and given by the formula

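$$r = \frac{\sum xy - \dfrac{\sum x \sum y}{n}}{\sqrt{\left(\sum x^2 - \dfrac{\left(\sum x\right)^2}{n}\right)\left(\sum y^2 - \dfrac{\left(\sum y\right)^2}{n}\right)}}$$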

Example 4 By calculating the PPMCC, find the degree of association between weekly earnings and the amount of tax paid for each member of a group of 10 manual workers.

Weekly wage (K'000)   79   81   87   88   91   92   98   98   103   113
Income tax (K'000)    10   8    14   14   17   12   18   22   21    24

The PPMCC is calculated in the table below.

          x      y      x²       y²      xy
          79     10     6241     100     790
          81     8      6561     64      648
          87     14     7569     196     1218
          88     14     7744     196     1232
          91     17     8281     289     1547
          92     12     8464     144     1104
          98     18     9604     324     1764
          98     22     9604     484     2156
          103    21     10609    441     2163
          113    24     12769    576     2712
Totals    930    160    87446    2814    15334

Σx = 930, Σy = 160, Σx² = 87446, Σy² = 2814, Σxy = 15334

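$$r = \frac{\sum xy - \dfrac{\sum x \sum y}{n}}{\sqrt{\left(\sum x^2 - \dfrac{\left(\sum x\right)^2}{n}\right)\left(\sum y^2 - \dfrac{\left(\sum y\right)^2}{n}\right)}} = \frac{15334 - \dfrac{(930)(160)}{10}}{\sqrt{\left(87446 - \dfrac{(930)^2}{10}\right)\left(2814 - \dfrac{(160)^2}{10}\right)}} = \frac{454}{\sqrt{(956)(254)}} = 0.921$$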

r is near 1 and indicates a strong positive linear correlation between the two variables.
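The PPMCC calculation for Example 4 can be reproduced from the raw data. The following is a minimal sketch, assuming nothing beyond the formula for r given earlier; the function name `pearson_r` and the variable names are illustrative.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson's product moment correlation coefficient computed from sums."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)

    numerator = sum_xy - sum_x * sum_y / n
    denominator = sqrt((sum_x2 - sum_x ** 2 / n) * (sum_y2 - sum_y ** 2 / n))
    return numerator / denominator


# Example 4: weekly wage and income tax (both in K'000)
weekly_wage = [79, 81, 87, 88, 91, 92, 98, 98, 103, 113]
income_tax = [10, 8, 14, 14, 17, 12, 18, 22, 21, 24]

r = pearson_r(weekly_wage, income_tax)
print(f"r = {r:.3f}")
```

The numerator and the two bracketed terms in the denominator correspond to the values 454, 956 and 254 in the worked solution.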
Example 5 Evaluate the PPMCC for the following data.

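x   15    20    25    30    35
y   143   141   144   149   148

The PPMCC is calculated in the table below.

          x      y      x²      y²       xy
          15     143    225     20449    2145
          20     141    400     19881    2820
          25     144    625     20736    3600
          30     149    900     22201    4470
          35     148    1225    21904    5180
Totals    125    725    3375    105171   18215

Σx = 125, Σy = 725, Σx² = 3375, Σy² = 105171, Σxy = 18215

$$r = \frac{\sum xy - \dfrac{\sum x \sum y}{n}}{\sqrt{\left(\sum x^2 - \dfrac{\left(\sum x\right)^2}{n}\right)\left(\sum y^2 - \dfrac{\left(\sum y\right)^2}{n}\right)}} = \frac{18215 - \dfrac{(125)(725)}{5}}{\sqrt{\left(3375 - \dfrac{(125)^2}{5}\right)\left(105171 - \dfrac{(725)^2}{5}\right)}} = \frac{90}{\sqrt{(250)(46)}} = 0.839$$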

The Coefficient of Determination

The coefficient of determination is the square of the coefficient of correlation r. In words, it gives the proportion of the variation in the y-values that is explained by the variation in the x-values. In Example 5, the correlation coefficient is r = 0.839. Therefore the coefficient of determination is

$$r^2 = (0.839)^2 = 0.704 \quad \text{(3 decimals)}$$

This means that 70.4% of the variation in the variable y is explained by the variation in the variable x. Note that the coefficient of determination r² lies between 0 and +1 inclusive.
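The coefficient of determination can also be computed directly from the data without first taking a square root, since r² equals the squared numerator of r divided by the product of the two bracketed terms in its denominator. The sketch below applies this to the data of Example 5; the function name `coefficient_of_determination` is an illustrative choice.

```python
def coefficient_of_determination(xs, ys):
    """Return r squared, the proportion of variation in y explained by x."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    s_xy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    s_xx = sum(x * x for x in xs) - sum_x ** 2 / n
    s_yy = sum(y * y for y in ys) - sum_y ** 2 / n
    return s_xy ** 2 / (s_xx * s_yy)


# Example 5 data
x = [15, 20, 25, 30, 35]
y = [143, 141, 144, 149, 148]

r_squared = coefficient_of_determination(x, y)
print(f"r^2 = {r_squared:.3f}")
```

For these data the three quantities are 90, 250 and 46, so r² = 8100 / 11500, which agrees to three decimals with squaring the rounded value r = 0.839.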

Spearman's Rank Correlation Coefficient

An alternative method of measuring correlation is by means of Spearman's rank correlation coefficient, obtained by the formula

$$r = 1 - \frac{6\sum d^2}{n(n^2 - 1)}$$

where d = the difference between rankings and n = the number of ranked pairs.

Example 6 Two members of an interview panel have ranked seven applicants in order of preference for a specified post. Calculate the degree of agreement between the two members.

Applicant        A   B   C   D   E   F   G
Interviewer X    1   2   3   4   5   6   7
Interviewer Y    4   3   1   2   5   7   6

The differences in rankings are shown below.

d     -3   -1    2    2    0   -1    1
d²     9    1    4    4    0    1    1

Σd = 0, Σd² = 20

$$r = 1 - \frac{6\sum d^2}{n(n^2 - 1)} = 1 - \frac{6(20)}{7(49 - 1)} = 1 - \frac{120}{336} = 1 - 0.3571 = 0.6429$$
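When the data are already given as ranks, as in Example 6, Spearman's formula needs only the rank differences. A minimal sketch follows; the function name `spearman_from_ranks` is an assumption made for illustration.

```python
def spearman_from_ranks(ranks_x, ranks_y):
    """Spearman's rank correlation coefficient from two lists of ranks."""
    n = len(ranks_x)
    # Sum of squared differences between paired rankings
    sum_d2 = sum((rx - ry) ** 2 for rx, ry in zip(ranks_x, ranks_y))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))


# Example 6: rankings of applicants A to G by the two interviewers
interviewer_x = [1, 2, 3, 4, 5, 6, 7]
interviewer_y = [4, 3, 1, 2, 5, 7, 6]

r = spearman_from_ranks(interviewer_x, interviewer_y)
print(f"r = {r:.4f}")
```

A useful hand check, also visible in the worked solution, is that the rank differences d should sum to zero before they are squared.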

Example 7 The results of two tests taken by 10 employees are shown below (figures in %).

Employee   A    B    C    D    E    F    G    H    I    J
Test X     50   52   58   66   70   74   77   86   92   94
Test Y     56   51   53   65   64   81   76   78   80   92

Rank each employee in order of performance in the two tests and calculate the rank correlation coefficient.

Ranking the employees in each test (rank 1 = highest score), we have

Employee   A    B    C    D    E    F    G    H    I    J
Test X     10   9    8    7    6    5    4    3    2    1
Test Y     8    10   9    6    7    2    5    4    3    1
d          2    -1   -1   1    -1   3    -1   -1   -1   0
d²         4    1    1    1    1    9    1    1    1    0

Σd = 0, Σd² = 20

$$r = 1 - \frac{6\sum d^2}{n(n^2 - 1)} = 1 - \frac{6(20)}{10(100 - 1)} = 1 - \frac{120}{990} = 1 - 0.1212 = 0.8788$$
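When the data are raw scores rather than ranks, as in Example 7, the scores must be ranked first. The sketch below does this under the assumption that there are no tied scores within a test (true for this example); tied values would need averaged ranks, which this sketch does not handle. The helper names `rank_descending` and `spearman_from_scores` are hypothetical.

```python
def rank_descending(scores):
    """Rank scores so that the highest score gets rank 1 (assumes no ties)."""
    ordered = sorted(scores, reverse=True)
    return [ordered.index(s) + 1 for s in scores]


def spearman_from_scores(scores_x, scores_y):
    """Rank both sets of scores, then apply Spearman's formula."""
    ranks_x = rank_descending(scores_x)
    ranks_y = rank_descending(scores_y)
    n = len(scores_x)
    sum_d2 = sum((rx - ry) ** 2 for rx, ry in zip(ranks_x, ranks_y))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))


# Example 7: percentage scores of employees A to J
test_x = [50, 52, 58, 66, 70, 74, 77, 86, 92, 94]
test_y = [56, 51, 53, 65, 64, 81, 76, 78, 80, 92]

print("ranks for Test Y:", rank_descending(test_y))
r = spearman_from_scores(test_x, test_y)
print(f"r = {r:.4f}")
```

Ranking in ascending rather than descending order would give the same coefficient, provided both tests are ranked in the same direction.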

Exercise 8

1. Draw a scatter diagram of each of the sets of values given below, and calculate the PPMCC in each case.

x
a)

6 3 1

7 6 3

8 9 5

9 12 7

10 15 9 11

b)


8 2 12

7 4 8

6 6 8

5 8 14

4 10 9

3 12 7 14 13

c)

x
y

2. The following table gives the percentage unemployment figures for males and females in 9 regions. Draw a scatter diagram of these data and calculate the PPMCC.

                 Unemployment %
Region           Male     Female
Central          4.4      3.8
Lusaka           12.5     11.8
Luapula          3.4      3.2
Northern         3.5      3.8
Eastern          4.5      4.6
Copperbelt       12.8     11.5
N.Western        3.2      4.0
Western          4.2      3.8
Southern         4.8      3.5

3. In a job evaluation exercise an assessor ranks eight jobs in order of increasing health risk. The same jobs have also been ranked in decreasing order on the basis of the number of applicants attracted per advertised post.

Job          A   B   C   D   E   F   G   H
Health       1   2   3   4   5   6   7   8
Applicant    4   3   2   1   6   5   8   7

Calculate the rank correlation coefficient for this information.

4. The table below gives the shorthand and typing speeds (words/min) of a sample of seven secretaries.

Secretary    1    2    3    4    5     6    7
Typing       42   44   47   47   50    54   57
Shorthand    97   84   95   96   107   98   117

Calculate the degree of correlation between the two skills by:

a) the PPMCC, and
b) the rank correlation coefficient.

5. On different days (picked at random) the following values were obtained for the price of a share of a particular company together with the index on that day.

Share price (K) Index 26 0 11 5 25 0 13 5 350 140 200 120 150 105 100 110 115 106 120 165 135 175 145 115

Calculate Spearman's rank correlation coefficient and indicate whether the index is a reasonable indicator of the price of the company's share.

EXAMINATION QUESTIONS WITH ANSWERS

Multiple Choice Questions

1.1 If Σd² = 10 and n = 8, the Spearman's rank correlation coefficient to 3 decimal places will be?

A. 0.188     B. 0.841     C. 0.821     D. 0.881

(Natech, 1.2. Mathematics & Statistics, December 2004)

1.2

The prices of the following items are to be ranked prior to the calculation of Spearman's rank correlation coefficient. What is the rank of item G?

Item Price A. 5

E 18

F 24 B.

G 23 4

I C.

J 3

K 19 D.

L 25 2.5

(Natech , 1.2. Mathematics & Statistics, December 2003)

1.3 8 x 6 x 4 2 x 0 4 8 x x x x 12 16 x x x

On the basis of the Scatter diagram above, which of the following equations would best represent the regression line of Y on X? A. D. 1.4 y = x8 y=x8
(Natech , 1.2. Mathematics & Statistics, December 2003)

B.

y=x+8

C.

y = x + 8

An investigation is being carried out regarding the hypothesis that factor X is a cause of ailment Y. Which coefficient of correlation between X and Y gives most support to the hypothesis?

A. -0.9     B. -0.2     C. +0.8     D. 0

(Natech , 1.2. /B1Mathematics & Statistics, December 1999 (Rescheduled))

1.5 If Σx = 216, Σy = 555, Σx² = 10436, Σy² = 46075, Σxy = 19635 and n = 8, then the value of r, the coefficient of correlation to two decimal places, is

A. 0.79     B. 0.62     C. 1.01     D. 1.02

(Natech, 1.2. Mathematics & Statistics, December 2001)

1.6 The scatter diagram below shows

[Scatter diagram]

A. High positive correlation
B. Very high correlation
C. Very high negative correlation
D. Perfect correlation

(Natech, 1.2. Mathematics & Statistics, June 2005)

1.7 Find the value of a in a regression equation if b = 7, Σx = 150, Σy = 400 and n = 10.

A. 145     B. -65     C. 25     D. -650

(Natech, 1.2. Mathematics & Statistics, June 2005)

1.8 In regression analysis, the variable whose value is estimated is referred to as the:

A. Simple variable
B. Independent variable
C. Linear variable
D. Dependent variable

1.9 The value of the coefficient of determination is interpreted as indicating

A. The proportion of unexplained variance
B. The proportion of explained variance
C. The extent of causation
D. The extent of relationship

1.10

Of the following coefficients of correlation, the one that is indicative of the greatest extent of relationship between the independent and dependent variables is

A.

B.

+.20

C. SECTION B

.95

D.

+.70

QUESTION ONE

a) Derive the product moment correlation coefficient from the following data and comment on your results.

Pupil                   A    B    C    D    E    F    G    H    I    J    K
Mathematics marks, x    41   37   38   39   49   47   42   34   36   48   29
Physics marks, y        36   20   31   24   37   35   42   26   27   29   23

b) Find the estimated line, by the method of least squares, fitting the following results from a Physics experiment.

Load, x (Newtons)     0.1   0.1   0.2   0.2   0.3   0.3   0.4   0.4   0.5   0.5
Extension, y (mm)     18    11    25    22    35    50    54    45    52    68

(Natech, 1.2. Mathematics & Statistics, June 2001)

c) A company has the following data on its profit (y) and advertising expenditure (x) over the last six years.

Profits (K million)        11.3   12.1   14.1   14.6   15.1   15.2
Advertising (K million)    0.52   0.61   0.63   0.70   0.70   0.75

i) Use two (2) methods to justify your assumption that there is a relationship between the two variables.
ii) Forecast the profits for next year if an advertising budget of K800 000 is allocated.

(Natech, 1.2. Mathematics & Statistics, December 2003)

QUESTION TWO

a) In the context of regression analysis explain what is meant by the following terms.

i) Regression coefficient
ii) Explanatory variable.

b) The following data show the monthly imports (I) of apples and average prices (P) over a twelve-month period.

Monthly imports (I) ('000 tonnes)        100   120   125   130   128   126   120   100   90    90    95    98
Average monthly prices (P) (K/tonne)     232   220   218   210   210   212   217   240   242   238   230   230

i) Determine the regression equation of imports (I) of apples on the price (P) and use it to forecast monthly imports when the average monthly price is K250 per tonne.
ii) If the correlation coefficient of the data is 0.95, interpret the results.

(Natech, 1.2. Mathematics & Statistics, December 2004)

QUESTION THREE

Hungry Lion is a major food retailing company, which has recently decided to open several new restaurants. In order to assist with the choice of siting these restaurants, the management of fast foods limited wished to investigate the effect of income on eating habits. As part of their report a marketing agency produced the following table showing the percentage of annual income spent on food, y, for a given annual family income, x (in K).

x (K'000,000)    18   27   36   45   54   72   90
y                62   48   37   31   27   22   18

a) Plot, on separate scatter diagrams,

i) y against x
ii) log10 y against log10 x,

and comment on the relationship between income and the percentage of family income spent on food.

b) Use the method of least squares to fit the relationship $y = ax^b$ to the data. Estimate a and b.

c) Estimate the percentage of annual income spent on food by a family with an annual income of K64,800,000.

(Natech, 1.2. Mathematics & Statistics, December 2001)

QUESTION FOUR

a) Sales of product A between years 0 and 4 were as follows:

Year                   2000   2001   2002   2003   2004
Units sold ('000s)     20     18     15     14     11

Required:

i) Calculate the correlation coefficient r.
ii) Comment on the result in (i) above.
iii) Calculate the coefficient of determination and comment.
iv) Use a regression equation to estimate the sales in the year 2005.

b) The table below shows the respective masses X and Y of a sample of 12 fathers and their oldest sons.

Mass X 65 of father (Kg) Mass Y 68 of son (Kg) 63 66 67 68 64 3 65 68 69 62 66 70 68 66 65 68 71 67 67 69 68 71 70

From the data given above:

i) Construct a scatter diagram.
ii) Calculate the rank correlation coefficient using Spearman's method.

(Natech, 1.2. Mathematics & Statistics, June 2005)

c) Find the degree of correlation between the Bank of Zambia base lending rate and the dollar exchange rate taken over the past six months using:

i) The product moment coefficient of correlation.
ii) The coefficient of rank correlation.

Month                               Jan    Feb    Mar    Apr    May    Jun
Base % as on 1st of each month      14     14     13.5   12.5   12     12
Average rate ($)                    1.90   1.91   1.86   1.84   1.84   1.83

(Natech, 1.2. Mathematics & Statistics, Nov/Dec 2000)

QUESTION FIVE

The following table shows the number of units of a good produced and the total costs incurred.

Units produced      100      200      300      400      500      600      700
Total costs (K)     40 000   45 000   50 000   65 000   70 000   70 000   80 000

a) Draw a scatter diagram.
b) Find the appropriate least squares regression line so that the costs can be predicted from production levels, and estimate the total costs when production is 250 units.
c) State the fixed costs of production.
d) Calculate r and explain how much of the variation in the dependent variable is explained by the variation of the independent variable.

(Natech, 1.2. Mathematics & Statistics, June 2002)

QUESTION SIX

a) A sample of eight employees is taken from the Production Department of an electronics factory. The data below relate to the number of weeks of experience in the soldering of components, and the number of components which were rejected as unsatisfactory last week.

Employee                   A    B    C    D    E    F    G    H
Weeks of experience (x)    4    5    7    9    10   11   12   14
No. of rejections (y)      21   22   15   18   14   14   11   13

i) Draw a scatter diagram of the data.
ii) Calculate a coefficient of correlation for these data and interpret its value.
iii) Find the least squares regression equation of rejects on experience. Predict the number of rejects you would expect from an employee with one week of experience.

(Natech, 1.2. Mathematics & Statistics, December 1999 (Rescheduled))

b) i) Distinguish between regression and correlation.

ii) An experiment was conducted on 8 children to determine how a child's reading ability varied with his/her ability to write. The points awarded were as follows:

Child      A   B   C   D   E   F   G   H
Writing    7   8   4   0   2   6   9   5
Reading    8   9   4   2   3   7   6   5

Calculate the coefficient of rank correlation and interpret the results.

(Natech, 1.2. Mathematics & Statistics, December 2002)

c) The mass of a growing animal is measured, in g, on the same day each week for eight weeks. The results are given below.

Week, x        1     2     3     4     5     6     7     8
Mass (g), y    480   504   560   616   666   702   759   801

i) Using 2 cm to represent 1 week on the x-axis and 2 cm to represent 100 g on the y-axis, plot a scatter diagram of mass y against week x.
ii) Find the equation of the regression line of y on x.

(Natech, 1.2. Mathematics & Statistics, December 1998)

QUESTION SEVEN

a) The following table gives the cost price and number of faults per annum experienced with seven brands of video recorders.

Brand                      A     B     C     D     E     F     G
Price (K'000)              492   458   435   460   505   439   477
No. of faults per annum    2     6     7     4     3     5     1

i) Determine Spearman's rank correlation coefficient.
ii) Interpret your answer in (i) above.

(Natech, 1.2. Mathematics & Statistics, December 1998)

b) The following table gives a set of ten pairs of observations of inspection costs per thousand articles produced and number of defective articles per thousand, recorded on a number of occasions at several factories controlled by a single group and producing comparable products.

Observation                                     1      2      3      4      5      6      7      8      9      10
Inspection costs per thousand articles          0.25   0.30   0.15   0.75   0.40   0.65   0.45   0.24   0.35   0.70
Number of defective articles per thousand       50     35     60     15     46     20     28     45     42     22

Putting inspection costs = x and number of defectives = y, you are required to:

i) Represent these pairs of observations on a scatter diagram.
ii) Find the regression line of y on x.

(Natech, 1.2. Mathematics & Statistics, December 1997)

QUESTION EIGHT

A Quality Control Manager has been hiring temporary workers to check all the surgical needles before they are dispatched (in boxes) to the customers. He believes that there is a relationship between the number of defective needles (per 1,000) dispatched to customers and the experience (in weeks) of the workers. To test this theory, he randomly selects a sample of eight workers and gives them a box each of surgical needles to check. Unknown to the workers, the inspected boxes are returned to the Manager for further checking. He checks all the surgical needles in each of these boxes and records the number of defective needles (per 1,000) contained in it. This information is summarized in the table below.

Worker                       A    B    C    D    E    F    G    H
Experience (in weeks), x     4    5    7    9    10   11   12   14
Defective (per 1,000), y     21   22   15   18   14   14   11   13

You are required to:

a) Draw a scatter plot of y against x.
b) Calculate the coefficient of correlation and interpret its value.
c) Find the least squares regression equation of the number of defectives on experience.
d) Estimate the number of defectives in a box inspected by a worker with 6 weeks of experience.

(Natech, 1.2. Mathematics & Statistics, December 1996)
