QBM 101 Business Statistics

Dr. Lai Kee Huong


Department of Business Studies
Faculty of Business, Economics & Accounting
keehuong.lai@help.edu.my

SUBJECT OUTLINE:
Module 1: Introduction; organizing and graphing data; numerical descriptive measures
Module 2: Probability; discrete random variables; continuous random variables and the normal distribution
Module 3: Sampling distributions; estimation; hypothesis testing
Module 4: Simple linear regression

CHAPTER 10:
SIMPLE LINEAR REGRESSION
10.1 Simple linear regression
10.2 Standard deviation of errors and coefficient of determination
10.3 Inferences about B
10.4 Linear correlation
10.5 Regression analysis: A complete example
10.6 Interpretation of Excel output

A regression model is a mathematical equation that describes the relationship between two or more variables. A simple regression model includes only two variables: one independent and one dependent. The dependent variable is the one being explained, and the independent variable is the one used to explain the variation in the dependent variable.
A (simple) regression model that gives a straight-line relationship between two variables is called a linear regression model.

Regression: describing the nature of the relationship between variables (positive, negative, linear, or nonlinear).
Correlation: determining whether a relationship between variables exists.
Questions: Are the two variables related? If so, how strong is the relationship? What kind of relationship is it? What predictions can be made?
Examples: height and weight of humans; number of cigarettes smoked vs. weights of infants; time spent studying and exam marks.

Dependent variable (DV) (y, the one being explained) vs. independent variable (IV) (x, used to explain the variation).
Simple (only 1 IV) vs. multiple (> 1 IV) regression.
Linear (straight-line relationship) vs. nonlinear regression.

SIMPLE LINEAR REGRESSION ANALYSIS


In the regression model \( y = A + Bx + \varepsilon \), A is called the y-intercept or constant term, B is the slope, and \( \varepsilon \) is the random error term. The dependent and independent variables are y and x, respectively.
In the model \( \hat{y} = a + bx \), a and b, which are calculated using sample data, are called the estimates of A and B, respectively.

SCATTER PLOT/DIAGRAM

ERROR SUM OF SQUARES (SSE)

The error sum of squares, denoted SSE, is

\[ SSE = \sum e^2 = \sum (y - \hat{y})^2 \]

The values of a and b that give the minimum SSE are called the least squares estimates of A and B, and the regression line obtained with these estimates is called the least squares line.

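Least squares/best-fit line:

\[ \hat{y} = a + bx \]
\[ SS_{xx} = \sum x^2 - \frac{(\sum x)^2}{n}, \qquad SS_{yy} = \sum y^2 - \frac{(\sum y)^2}{n}, \qquad SS_{xy} = \sum xy - \frac{(\sum x)(\sum y)}{n} \]
\[ b = \frac{SS_{xy}}{SS_{xx}}, \qquad a = \bar{y} - b\bar{x} \]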
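The formulas above can be sketched in Python as follows. This is a minimal illustration and not part of the original slides: the function name `least_squares_line` and the four toy data points are hypothetical, chosen so that the points lie exactly on the line y = 1 + 2x.

```python
def least_squares_line(xs, ys):
    """Return (a, b) for the least squares line y_hat = a + b*x."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)

    ss_xy = sum_xy - sum_x * sum_y / n   # SS_xy
    ss_xx = sum_x2 - sum_x ** 2 / n      # SS_xx

    b = ss_xy / ss_xx                    # slope
    a = sum_y / n - b * sum_x / n        # intercept: y_bar - b * x_bar
    return a, b


# Hypothetical toy data lying exactly on y = 1 + 2x (not from the slides).
a, b = least_squares_line([1, 2, 3, 4], [3, 5, 7, 9])
print(f"y_hat = {a:.4f} + {b:.4f}x")
```

Because the toy points fall on a straight line, the fitted intercept and slope recover that line exactly, which makes the function easy to check by hand.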

FORMULAS

Source: http://mathworld.wolfram.com/LeastSquaresFitting.html


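Least squares/best-fit line:

\[ \bar{x} = \frac{\sum x}{n} = \frac{386}{7} = 55.1429, \qquad \bar{y} = \frac{\sum y}{n} = \frac{108}{7} = 15.4286 \]
\[ SS_{xy} = \sum xy - \frac{(\sum x)(\sum y)}{n} = 6403 - \frac{(386)(108)}{7} = 447.5714 \]
\[ SS_{xx} = \sum x^2 - \frac{(\sum x)^2}{n} = 23058 - \frac{(386)^2}{7} = 1772.8571 \]
\[ b = \frac{SS_{xy}}{SS_{xx}} = \frac{447.5714}{1772.8571} = 0.2525 \]
\[ a = \bar{y} - b\bar{x} = 15.4286 - (0.2525)(55.1429) = 1.5050 \]
\[ \hat{y} = a + bx = 1.5050 + 0.2525x \]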
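The arithmetic of this worked example can be checked with a short Python sketch. It uses only the sums quoted on the slide (n = 7, Σx = 386, Σy = 108, Σxy = 6403, Σx² = 23058); the variable names are illustrative. Note that the slide rounds b to 0.2525 before computing a, so the unrounded intercept printed here differs slightly in the third decimal place.

```python
# Summary statistics quoted on the slide (n = 7 households).
n = 7
sum_x, sum_y = 386, 108
sum_xy, sum_x2 = 6403, 23058

x_bar = sum_x / n
y_bar = sum_y / n

ss_xy = sum_xy - sum_x * sum_y / n   # SS_xy
ss_xx = sum_x2 - sum_x ** 2 / n      # SS_xx

b = ss_xy / ss_xx                    # slope estimate
a = y_bar - b * x_bar                # intercept estimate (unrounded b)

print(f"SS_xy = {ss_xy:.4f}, SS_xx = {ss_xx:.4f}")
print(f"b = {b:.4f}, a = {a:.4f}")
```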

Least squares/best-fit line (estimation and its reliability):

\[ b = \frac{SS_{xy}}{SS_{xx}} = \frac{447.5714}{1772.8571} = 0.2525 \]
\[ a = \bar{y} - b\bar{x} = 15.4286 - (0.2525)(55.1429) = 1.5050 \]
\[ \hat{y} = a + bx = 1.5050 + 0.2525x \]

Estimate the food expenditure when the income is $6100.
\[ \hat{y} = a + bx = 1.5050 + 0.2525(61) = 16.9075 \text{ hundred} = \$1690.75 \]
\[ \text{Error: } e = y - \hat{y} = 16 - 16.9075 = -0.9075 \text{ hundred} = -\$90.75 \]
Estimate the food expenditure when the income is $6000.
\[ \hat{y} = a + bx = 1.5050 + 0.2525(60) = 16.655 \text{ hundred} = \$1665.50 \]
The estimate is reliable because \( 60 \in (33, 83) \).
Estimate the food expenditure when the income is $2000.
\[ \hat{y} = a + bx = 1.5050 + 0.2525(20) = 6.555 \text{ hundred} = \$655.50 \]
The estimate is not reliable because \( 20 \notin (33, 83) \) (extrapolation).
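The three estimates above, together with the reliability check, could be sketched as follows. The function name `predict_food_expenditure` is hypothetical, the coefficients are the rounded values from the slides, and treating the endpoints 33 and 83 as inside the observed range is an assumption.

```python
A_HAT, B_HAT = 1.5050, 0.2525   # rounded estimates from the slides
X_MIN, X_MAX = 33, 83           # observed income range, in hundreds of dollars


def predict_food_expenditure(income_hundreds):
    """Return (prediction in hundreds, True if x lies within the observed range)."""
    y_hat = A_HAT + B_HAT * income_hundreds
    within_range = X_MIN <= income_hundreds <= X_MAX
    return y_hat, within_range


for x in (61, 60, 20):
    y_hat, ok = predict_food_expenditure(x)
    note = "within observed range" if ok else "extrapolation, treat with caution"
    print(f"x = {x}: y_hat = {y_hat:.4f} hundred ({note})")
```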

ERROR OF PREDICTION

Least squares/best-fit line (interpretation of regression coefficients):

\[ \hat{y} = a + bx = 1.5050 + 0.2525x \]
y-intercept, a = 1.5050:
A family with RM0 income is predicted to spend RM1.5050 hundred = RM150.50 on food.
Slope coefficient, b = 0.2525:
For every one-unit (RM100) increase in income, the expenditure on food is predicted to increase by RM0.2525 hundred = RM25.25.

Least squares/best-fit line (assumptions of regression models):
1. The error term has a mean of zero.
2. The errors are independent.
3. The distribution of the errors is normal.
4. The distribution of population errors has the same (constant) standard deviation.

\[ \varepsilon \sim N(0, \sigma_\varepsilon^2) \]

Degrees of Freedom for a Simple Linear Regression Model
The degrees of freedom for a simple linear regression model are
\[ df = n - 2 \]

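Standard deviation of errors:

\( \sigma_\varepsilon \) is estimated by \( s_e \):
\[ s_e = \sqrt{\frac{SSE}{n - 2}}, \qquad \text{where } SSE = \sum (y - \hat{y})^2 \text{ and } df = n - 2 \]
\[ s_e = \sqrt{\frac{SS_{yy} - b\,SS_{xy}}{n - 2}} \]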

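Standard deviation of errors:

\[ b = \frac{SS_{xy}}{SS_{xx}} = \frac{447.5714}{1772.8571} = 0.2525 \]
\[ SS_{xy} = \sum xy - \frac{(\sum x)(\sum y)}{n} = 6403 - \frac{(386)(108)}{7} = 447.5714 \]
\[ SS_{yy} = \sum y^2 - \frac{(\sum y)^2}{n} = 1792 - \frac{(108)^2}{7} = 125.7143 \]
\[ s_e = \sqrt{\frac{SS_{yy} - b\,SS_{xy}}{n - 2}} = \sqrt{\frac{125.7143 - (0.2525)(447.5714)}{7 - 2}} = 1.5939 \]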
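As a quick check of this calculation, the sketch below recomputes SS_yy and s_e in Python from the sums quoted on the slides (Σy = 108, Σy² = 1792, n = 7) and the rounded slope b = 0.2525. The variable names are illustrative only.

```python
import math

n = 7
sum_y, sum_y2 = 108, 1792      # sums quoted on the slide
ss_xy = 447.5714               # SS_xy from the slides
b = 0.2525                     # rounded slope from the slides

ss_yy = sum_y2 - sum_y ** 2 / n        # SS_yy
sse = ss_yy - b * ss_xy                # SSE via the shortcut formula
s_e = math.sqrt(sse / (n - 2))         # standard deviation of errors, df = n - 2

print(f"SS_yy = {ss_yy:.4f}, SSE = {sse:.4f}, s_e = {s_e:.4f}")
```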

Coefficient of determination (COD)

\[ r^2 = \frac{b\,SS_{xy}}{SS_{yy}}, \qquad 0 \le r^2 \le 1 \]

\[ b = 0.2525, \quad SS_{xy} = 447.5714, \quad SS_{yy} = 125.7143 \]
\[ r^2 = \frac{b\,SS_{xy}}{SS_{yy}} = \frac{0.2525(447.5714)}{125.7143} = 0.899 = 89.9\% \]

Interpretation: 89.9% of the total variation in household food expenditures can be explained by the variation in incomes, and the remaining 10.1% is due to randomness and other variables.

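Coefficient of correlation (COC)

\[ r = \frac{SS_{xy}}{\sqrt{SS_{xx}\,SS_{yy}}}, \qquad -1 \le r \le 1 \]

\[ SS_{xx} = 1772.8571, \quad SS_{xy} = 447.5714, \quad SS_{yy} = 125.7143 \]
\[ r = \frac{SS_{xy}}{\sqrt{SS_{xx}\,SS_{yy}}} = \frac{447.5714}{\sqrt{(1772.8571)(125.7143)}} = 0.9481 \]

Interpretation: The sign shows whether the variables are positively or negatively correlated. The magnitude shows the strength: very weak, average/moderate, strong, or very strong.
r = 0.9481: very strong and positively correlated
Other example:
r = -0.1111: very weak and negatively correlated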
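The COC and COD for the food expenditure example can be reproduced with a few lines of Python. This is only a sketch using the SS values printed on the slides; squaring r gives the same COD (to three decimal places) as the formula b·SS_xy/SS_yy.

```python
import math

# SS values as printed on the slides.
ss_xx, ss_yy, ss_xy = 1772.8571, 125.7143, 447.5714

r = ss_xy / math.sqrt(ss_xx * ss_yy)   # coefficient of correlation
r_squared = r ** 2                     # coefficient of determination

direction = "positive" if r > 0 else "negative"
print(f"r = {r:.4f} ({direction}), r^2 = {r_squared:.3f}")
```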

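Test statistic:
\[ t_{calc} = \frac{b - B}{s_b}, \qquad df = n - 2 \]

\[ H_0: B = 0 \]
\[ H_1: B \ne 0 \text{ (two-tailed test)} \]
\[ H_1: B > 0 \text{ (positive)}, \quad B < 0 \text{ (negative)} \text{ (one-tailed test)} \]
\( \sigma_\varepsilon \) is unknown, so use the t distribution.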

Hypothesis test (HT) about the slope coefficient, B

Test at the 1% significance level whether the slope of the regression line is positive.
\[ H_0: B = 0, \qquad H_1: B > 0 \text{ (one-tailed test)} \]
\[ \alpha = 0.01, \qquad df = n - 2 = 7 - 2 = 5 \]
\[ t_{calc} = \frac{b - B}{s_b} = \frac{0.2525 - 0}{0.0379} = 6.662 \]
\[ t_{critical} = t_{\alpha, n-2} = t_{0.01, 5} = 3.365 \]
\[ t_{calc} = 6.662 > t_{critical} = 3.365 \]
Reject \( H_0 \). There is sufficient evidence to conclude that the slope is positive; that is, income has a positive linear relationship with food expenditure.
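The test above can be sketched in Python as follows. The slides give s_b = 0.0379 without showing how it is obtained; the sketch assumes the standard formula s_b = s_e / sqrt(SS_xx), which reproduces that value. The critical value 3.365 is taken from the slide rather than computed, and because the slide rounds s_b before dividing, the unrounded t statistic here differs from 6.662 in the second decimal place.

```python
import math

b = 0.2525            # slope estimate from the slides
s_e = 1.5939          # standard deviation of errors from the slides
ss_xx = 1772.8571     # SS_xx from the slides
t_critical = 3.365    # t_{0.01, 5}, as quoted on the slide

# Standard error of b (standard formula, assumed; not printed on the slide).
s_b = s_e / math.sqrt(ss_xx)

t_calc = (b - 0) / s_b            # test statistic under H0: B = 0
reject_h0 = t_calc > t_critical   # right-tailed test, H1: B > 0

print(f"s_b = {s_b:.4f}, t_calc = {t_calc:.3f}, reject H0: {reject_h0}")
```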

A random sample of eight drivers, all insured with the same company and holding similar minimum required auto insurance policies, was selected from a small city. The following table lists their driving experience (in years) and monthly auto insurance premiums (in dollars).

Regression Analysis: A Complete Example

(a) Identify the IV and DV. Do you expect a positive or negative relationship?
(b) Compute \( SS_{xx} \), \( SS_{yy} \), and \( SS_{xy} \).
(c) Find the least squares regression line.
(d) Interpret the regression coefficients in (c).
(e) Calculate the COC and COD. Interpret their meanings.
(f) Predict the monthly premium for a driver with 10 years of experience. Comment on the reliability of the estimate.
(g) Compute the standard deviation of errors.
(h) Test at the 5% significance level whether B is negative.

(a) IV: driving experience; DV: monthly auto insurance premium.
A negative linear relationship is expected.

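(b)
\[ \sum x = 90, \qquad \sum y = 474 \]
\[ \bar{x} = \frac{\sum x}{n} = \frac{90}{8} = 11.25, \qquad \bar{y} = \frac{\sum y}{n} = \frac{474}{8} = 59.25 \]
\[ SS_{xy} = \sum xy - \frac{(\sum x)(\sum y)}{n} = 4739 - \frac{(90)(474)}{8} = -593.5 \]
\[ SS_{xx} = \sum x^2 - \frac{(\sum x)^2}{n} = 1396 - \frac{(90)^2}{8} = 383.5 \]
\[ SS_{yy} = \sum y^2 - \frac{(\sum y)^2}{n} = 29{,}642 - \frac{(474)^2}{8} = 1557.5 \]

(c)
\[ b = \frac{SS_{xy}}{SS_{xx}} = \frac{-593.5}{383.5} = -1.5476 \]
\[ a = \bar{y} - b\bar{x} = 59.25 - (-1.5476)(11.25) = 76.6605 \]
\[ \hat{y} = a + bx = 76.6605 - 1.5476x \]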
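Parts (b) and (c) can be verified with the short Python sketch below. It relies only on the sums shown in the solution (n = 8, Σx = 90, Σy = 474, Σxy = 4739, Σx² = 1396, Σy² = 29,642), since the raw data table is not reproduced here; the variable names are illustrative.

```python
# Sums quoted in the worked solution (n = 8 drivers).
n = 8
sum_x, sum_y = 90, 474
sum_xy, sum_x2, sum_y2 = 4739, 1396, 29642

ss_xy = sum_xy - sum_x * sum_y / n    # SS_xy (negative here)
ss_xx = sum_x2 - sum_x ** 2 / n       # SS_xx
ss_yy = sum_y2 - sum_y ** 2 / n       # SS_yy

b = ss_xy / ss_xx                     # slope estimate
a = sum_y / n - b * sum_x / n         # intercept: y_bar - b * x_bar

print(f"SS_xy = {ss_xy}, SS_xx = {ss_xx}, SS_yy = {ss_yy}")
print(f"y_hat = {a:.4f} + ({b:.4f})x")
```

The negative slope agrees with the relationship expected in part (a): more driving experience goes with a lower premium.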

(d)
\[ \hat{y} = a + bx = 76.6605 - 1.5476x \]
y-intercept, a = 76.6605:
A driver with 0 years of driving experience is predicted to pay a monthly premium of $76.66.
Slope coefficient, b = -1.5476:
For every extra year of driving experience, the monthly premium is predicted to decrease by $1.55.

(e) COC:
\[ r = \frac{SS_{xy}}{\sqrt{SS_{xx}\,SS_{yy}}} = \frac{-593.5}{\sqrt{(383.5)(1557.5)}} = -0.7679 \]
A moderately strong, negative correlation.

COD:
\[ r^2 = \frac{b\,SS_{xy}}{SS_{yy}} = \frac{(-1.5476)(-593.5)}{1557.5} = 0.5897 \]
Alternative:
\[ r^2 = (-0.7679)^2 = 0.5897 \]

58.97% of the variation in monthly premium can be explained by driving experience, whereas the remaining 41.03% is due to randomness and other unaccounted factors.

(f)
\[ \hat{y}(10) = 76.6605 - 1.5476(10) = \$61.18 \]
The estimate is reliable because \( 10 \in (2, 25) \).

(g)
\[ s_e = \sqrt{\frac{SS_{yy} - b\,SS_{xy}}{n - 2}} = \sqrt{\frac{1557.5 - (-1.5476)(-593.5)}{8 - 2}} = 10.3199 \]

(h)
\[ H_0: B = 0, \qquad H_1: B < 0 \]
\[ \alpha = 0.05, \qquad df = n - 2 = 8 - 2 = 6 \]
\[ t_{calc} = \frac{b - B}{s_b} = \frac{-1.5476 - 0}{0.5270} = -2.937 \]
\[ t_{critical} = -t_{\alpha, df} = -t_{0.05, 6} = -1.943 \]
\[ t_{calc} = -2.937 < t_{critical} = -1.943 \]
Reject \( H_0 \). There is sufficient evidence to conclude that the slope is negative.
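Parts (g) and (h) can be checked together with a small Python sketch. It starts from the SS values of part (b) and assumes the standard formula s_b = s_e / sqrt(SS_xx) for the standard error of the slope, which is not printed in the solution but reproduces s_b = 0.5270. The critical value is the one quoted in the solution.

```python
import math

n = 8
ss_xx, ss_yy, ss_xy = 383.5, 1557.5, -593.5   # from part (b)
t_critical = -1.943                           # -t_{0.05, 6}, as quoted in the solution

b = ss_xy / ss_xx                                  # slope estimate
s_e = math.sqrt((ss_yy - b * ss_xy) / (n - 2))     # standard deviation of errors
s_b = s_e / math.sqrt(ss_xx)                       # standard error of b (assumed formula)

t_calc = (b - 0) / s_b            # test statistic under H0: B = 0
reject_h0 = t_calc < t_critical   # left-tailed test, H1: B < 0

print(f"s_e = {s_e:.4f}, s_b = {s_b:.4f}, t_calc = {t_calc:.3f}, reject H0: {reject_h0}")
```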

The hypothesis test on B can be performed using the p-value approach, using the output obtained from statistical software.
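To illustrate the p-value approach without any particular software package, the sketch below approximates the one-tailed p-value for the driving experience example (t = -2.937, df = 6, α = 0.05) by numerically integrating the Student t density with Simpson's rule. The helper names `t_pdf` and `upper_tail_prob` are hypothetical; statistical software would normally report this value directly.

```python
import math


def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + t * t / df) ** (-(df + 1) / 2)


def upper_tail_prob(t, df, steps=10_000):
    """P(T > |t|), using Simpson's rule on [0, |t|] and symmetry about 0."""
    t = abs(t)
    h = t / steps                      # steps must be even for Simpson's rule
    total = t_pdf(0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3


t_calc, df, alpha = -2.937, 6, 0.05    # values from the complete example
p_value = upper_tail_prob(t_calc, df)  # left-tail area equals P(T > |t|) by symmetry
print(f"p-value = {p_value:.4f}; reject H0: {p_value < alpha}")
```

The decision rule is to reject H0 when the p-value is smaller than α, which leads to the same conclusion as the critical value approach.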

EXCEL OUTPUT

Source: http://www.excel-easy.com/examples/regression.html

EXCEL


SUMMARY
Identify IV (x) and DV (y)
Calculate SS of xx, yy, and xy
Determine the best fit line
Calculate and interpret regression coefficients
Calculate and interpret COC and COD
Estimate and comment on its reliability
Hypothesis test on B (critical value approach using manual calculation, or p-value approach from the Excel output)
Finding missing values from the given Excel output
