
Business Analytics

Linear Regression

4.b.Introduction to Regression Analysis
Regression analysis is used to:
Predict the value of a dependent variable based on
the value of at least one independent variable
Explain the impact of changes in an independent
variable on the dependent variable
Dependent variable: the variable we wish to
explain, usually denoted by Y
Independent variable: the variable used to
explain the dependent variable, denoted by X; it
is sometimes also referred to as the predictor variable

4.b.Simple Linear Regression Model
(SLRM)
Only one independent variable, x (When there is
only one predictor variable, the prediction
method is called simple regression)
Relationship between x and y is described by a
linear function (in other words, a simple linear
regression is one where the predictions of Y, when
plotted as a function of X, form a straight line).
Changes in y are generally assumed to be caused
by changes in x


4.b.SLRM Example
I have a data set that contains 25,000
records of human heights and weights
(source:
http://socr.ucla.edu/docs/resources/SOCR_Data/SOCR_Data_Dinov_020108_HeightsWeights.html)
You can download a csv file of the data from:
https://app.box.com/s/10z9keaeyucqc0nks8uq

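As a rough sketch of how this data set could be used to fit a simple regression in Python: the file name heights_weights.csv and the column names "Height(Inches)" and "Weight(Pounds)" below are assumptions about the downloaded CSV, not something specified on this slide.

```python
# Sketch: load the heights/weights CSV and fit a simple linear regression.
# "heights_weights.csv", "Height(Inches)" and "Weight(Pounds)" are assumed
# names -- adjust them to match the downloaded file.
import numpy as np
import pandas as pd

df = pd.read_csv("heights_weights.csv")
x = df["Height(Inches)"].to_numpy()
y = df["Weight(Pounds)"].to_numpy()

# np.polyfit with degree 1 returns the least-squares slope and intercept.
b1, b0 = np.polyfit(x, y, 1)
print(f"Estimated line: weight = {b0:.2f} + {b1:.2f} * height")
```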

4.b.Assumptions
1. A linear relationship exists between the dependent and the
independent variable.

2. The independent variable is uncorrelated with the residuals.

3. The expected value of the residual term is zero

4. The variance of the residual term is constant for all observations
(Homoskedasticity)

5. The residual term is independently distributed; that is, the residual
for one observation is not correlated with that of another
observation

6. The residual term is normally distributed.


In symbols, assumptions 3 to 5 say:

$E[\varepsilon_i] = 0$, $\quad E[\varepsilon_i^2] = \sigma_\varepsilon^2$, $\quad E[\varepsilon_i \varepsilon_j] = 0$ for $i \neq j$
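Once a line has been fitted, assumptions 2 and 3 can be eyeballed directly from the residuals. A minimal sketch on synthetic data (all numbers below are arbitrary, chosen only for illustration):

```python
# Sketch: basic residual checks on a fitted simple regression (synthetic data).
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)
y = 2.0 + 0.5 * x + rng.normal(0, 1, 200)      # a true line plus noise

b1, b0 = np.polyfit(x, y, 1)
residuals = y - (b0 + b1 * x)

print("mean residual (assumption 3, should be ~0):", residuals.mean())
print("corr(residuals, x) (assumption 2, should be ~0):",
      np.corrcoef(residuals, x)[0, 1])
```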
4.b.Types of Regression Models

[Figure: four scatter-plot panels illustrating a Positive Linear Relationship, a
Negative Linear Relationship, No Relationship, and a Relationship that is NOT Linear]
4.b.Population Linear Regression

$Y = \beta_0 + \beta_1 X + u$

[Figure: scatter of Y against X with the population regression line; the line has
intercept $\beta_0$ and slope $\beta_1$, the predicted value of Y for $x_i$ lies on
the line, and the random error $u_i$ for this x value is the vertical distance from
the observed point (an individual person's marks) to the line]
4.b.Population Regression Function

$Y = \beta_0 + \beta_1 X + u$

where Y is the dependent variable, X is the independent variable,
$\beta_0$ is the population y-intercept, $\beta_1$ is the population slope
coefficient, $\beta_0 + \beta_1 X$ is the linear component, and u is the
random error term (residual), i.e. the random error component.
But can we actually get this equation?
If yes, what information will we need?
4.b.Sample Regression Function

$y = b_0 + b_1 x + e$

[Figure: scatter of y against x with the sample regression line; the line has
intercept $b_0$ and slope $b_1$, the predicted value of y for $x_i$ lies on the
line, and the residual $e_i$ for this x value is the vertical distance between the
observed value of y for $x_i$ and the line]
4.b.Sample Regression Function

$y_i = b_0 + b_1 x_i + e_i$

where $b_0$ is the estimate of the regression intercept, $b_1$ is the estimate of
the regression slope, $x_i$ is the independent variable, and $e_i$ is the error term.

Notice the similarity with the Population Regression Function.
Can we do something about the error term?
4.b.The error term (residual)
Represents the influence of all the variables that
we have not accounted for in the equation
It represents the difference between the actual
"y" values and the y values predicted
by the Sample Regression Line
Wouldn't it be good if we were able to reduce this
error term?
What are we trying to achieve by Sample
Regression?


4.b.Our Objective

To predict the Population Regression Line (PRL) from the Sample Regression Line (SRL):

SRL: $\hat{y}_i = b_0 + b_1 x_i$

PRL: $Y = \beta_0 + \beta_1 X + u$

4.b.One method to find b0 and b1
Method of Ordinary Least Squares (OLS)
b0 and b1 are obtained by finding the values of b0
and b1 that minimize the sum of the squared
residuals:





$\sum e^2 = \sum (y - \hat{y})^2 = \sum \left(y - (b_0 + b_1 x)\right)^2$

Are there any advantages of minimizing the squared
errors?
Why don't we take the sum?
Why don't we take absolute values instead?
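For a single predictor, the values that minimize this sum have the familiar closed form $b_1 = \sum(x - \bar{x})(y - \bar{y}) \,/\, \sum(x - \bar{x})^2$ and $b_0 = \bar{y} - b_1 \bar{x}$. A minimal Python sketch of that computation (the function name is illustrative):

```python
# Sketch: OLS estimates for one predictor via the closed-form solution.
import numpy as np

def ols_simple(x, y):
    """Return (b0, b1) that minimize the sum of squared residuals."""
    x_bar, y_bar = x.mean(), y.mean()
    b1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
    b0 = y_bar - b1 * x_bar
    return b0, b1
```

Applied to any (x, y) arrays, ols_simple should agree with np.polyfit(x, y, 1).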


4.b.OLS Regression Properties
The sum of the residuals from the least squares
regression line is 0:

$\sum (y - \hat{y}) = 0$

The sum of the squared residuals, $\sum (y - \hat{y})^2$, is a minimum.

The simple regression line always passes through the
mean of the y variable and the mean of the x variable.

The least squares coefficients are unbiased estimates
of $\beta_0$ and $\beta_1$.
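These properties are easy to verify numerically. A small sketch on synthetic data (the data-generating numbers are arbitrary):

```python
# Sketch: numerically verify two OLS properties on synthetic data.
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=100)
y = 1.0 + 2.0 * x + rng.normal(size=100)

b1, b0 = np.polyfit(x, y, 1)
y_hat = b0 + b1 * x

print("sum of residuals (should be ~0):", np.sum(y - y_hat))
# The fitted line passes through (x_bar, y_bar):
print("y_bar - (b0 + b1 * x_bar):", y.mean() - (b0 + b1 * x.mean()))
```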


4.b.Interpretation of the Slope and
the Intercept
b0 is the estimated average value of y when the value
of x is zero. More often than not it does not have a
physical interpretation.
b1 is the estimated change in the average value of y
as a result of a one-unit change in x.

[Figure: the fitted line $\hat{Y} = b_0 + b_1 X$ plotted against x and y, with
intercept $b_0$ and slope of the line $b_1$]

4.b.Limitations of Regression
Analysis
Parameter Instability - This happens in situations
where correlations change over a period of time.
This is very common in financial markets where
economic, tax, regulatory, and political factors
change frequently.
Public knowledge of a specific regression relation
may cause a large number of people to react in a
similar fashion towards the variables, negating its
future usefulness.
If any regression assumptions are violated,
predictions of the dependent variable and hypothesis
tests will not be valid.
16

4.b.General Multiple Linear
Regression Model
In simple linear regression, the dependent
variable was assumed to be dependent on only
one variable (independent variable)
In the General Multiple Linear Regression model, the
dependent variable derives its value from two or
more independent variables.
The General Multiple Linear Regression model takes
the following form:

$Y_i = b_0 + b_1 X_{1i} + b_2 X_{2i} + \dots + b_k X_{ki} + \varepsilon_i$

where:
$Y_i$ = i-th observation of the dependent variable Y
$X_{ki}$ = i-th observation of the k-th independent variable X
$b_0$ = intercept term
$b_k$ = slope coefficient of the k-th independent variable
$\varepsilon_i$ = error term of the i-th observation
n = number of observations
k = total number of independent variables
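A minimal sketch of this model with two independent variables, simulated and then estimated by least squares in Python (all variable names and numbers are illustrative, not taken from the slides):

```python
# Sketch: multiple linear regression Y = b0 + b1*X1 + b2*X2 + error,
# estimated by least squares on synthetic data.
import numpy as np

rng = np.random.default_rng(2)
n = 500
X1 = rng.normal(size=n)
X2 = rng.normal(size=n)
Y = 3.0 + 1.5 * X1 - 0.8 * X2 + rng.normal(size=n)

# Design matrix with a column of ones for the intercept term b0.
X = np.column_stack([np.ones(n), X1, X2])
coeffs, *_ = np.linalg.lstsq(X, Y, rcond=None)
b0, b1, b2 = coeffs
print(f"b0 = {b0:.2f}, b1 = {b1:.2f}, b2 = {b2:.2f}")
```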
4.b.Estimated Regression Equation
As we calculated the intercept and the slope
coefficient in simple linear regression by
minimizing the sum of squared errors, we
estimate the intercept and slope coefficients in
multiple linear regression in the same way.

The sum of squared errors, $\sum_{i=1}^{n} \varepsilon_i^2$, is minimized
and the slope coefficients are estimated.

The resultant estimated equation becomes:

$\hat{Y}_i = \hat{b}_0 + \hat{b}_1 X_{1i} + \hat{b}_2 X_{2i} + \dots + \hat{b}_k X_{ki}$

Now the error in the i-th observation can be written
as:

$\varepsilon_i = Y_i - \hat{Y}_i = Y_i - \left(\hat{b}_0 + \hat{b}_1 X_{1i} + \hat{b}_2 X_{2i} + \dots + \hat{b}_k X_{ki}\right)$
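A short sketch of computing the estimated equation's fitted values and the per-observation errors, again on a synthetic design matrix (names and numbers are illustrative):

```python
# Sketch: fitted values Y_hat and residuals e_i = Y_i - Y_hat_i
# from an estimated multiple regression (synthetic data).
import numpy as np

rng = np.random.default_rng(3)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
Y = X @ np.array([2.0, 1.0, -0.5]) + rng.normal(size=n)

b_hat = np.linalg.lstsq(X, Y, rcond=None)[0]   # estimated coefficients
Y_hat = X @ b_hat                              # estimated regression equation
residuals = Y - Y_hat                          # error of each observation
print("sum of squared residuals:", np.sum(residuals ** 2))
```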
4.b.Interpreting the Estimated
Regression Equation
Intercept Term (b0): It's the value of the dependent
variable when the value of all independent
variables becomes zero:

$b_0 = \text{Value of } Y \text{ when } X_1 = X_2 = \dots = X_k = 0$

Slope coefficient (bk): It's the change in the
dependent variable from a unit change in the
corresponding independent variable (Xk), keeping
all other independent variables constant.
In reality, when the value of the independent variable changes by one unit, the
change in the dependent variable is not equal to the slope coefficient but
depends on the correlation among the independent variables as well.
Therefore, the slope coefficients are also called partial slope coefficients.
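One way to see the "partial" behaviour: when X1 and X2 are correlated, the slope from a simple regression of Y on X1 absorbs part of X2's effect, while the multiple-regression coefficient isolates X1's own effect. A sketch on synthetic data (the true coefficients 2 and 3 and the 0.8 correlation structure are made up for illustration):

```python
# Sketch: with correlated regressors, the simple-regression slope on X1
# differs from the partial slope on X1 in the multiple regression.
import numpy as np

rng = np.random.default_rng(4)
n = 2000
X1 = rng.normal(size=n)
X2 = 0.8 * X1 + rng.normal(size=n)            # X2 correlated with X1
Y = 1.0 + 2.0 * X1 + 3.0 * X2 + rng.normal(size=n)

slope_simple = np.polyfit(X1, Y, 1)[0]        # absorbs part of X2's effect (~4.4)
X = np.column_stack([np.ones(n), X1, X2])
b = np.linalg.lstsq(X, Y, rcond=None)[0]
print("simple slope on X1: ", round(slope_simple, 2))
print("partial slope on X1:", round(b[1], 2))  # close to the true 2.0
```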
4.b.Assumptions of Multiple
Regression Model
There exists a linear relationship between the
dependent and independent variables.
The expected value of the error term, conditional
on the independent variables is zero.
The error terms are homoskedastic, i.e. the
variance of the error terms is constant for all the
observations.
The expected value of the product of error terms
is always zero, which implies that the error terms
are uncorrelated with each other.
The error term is normally distributed.
The independent variables don't have any linear
relationships with each other (a quick screen is sketched below).
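A minimal sketch of that screen, looking at the pairwise correlations between the independent variables (synthetic regressors; in practice you would pass in your own columns):

```python
# Sketch: screen the independent variables for strong linear relationships
# by inspecting their pairwise correlation matrix.
import numpy as np

rng = np.random.default_rng(5)
X1 = rng.normal(size=300)
X2 = 0.9 * X1 + 0.1 * rng.normal(size=300)    # nearly collinear with X1
X3 = rng.normal(size=300)

corr = np.corrcoef(np.column_stack([X1, X2, X3]), rowvar=False)
print(np.round(corr, 2))   # off-diagonal values near +/-1 flag potential trouble
```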

Thank you!
Pristine
702, Raaj Chambers, Old Nagardas Road, Andheri (E), Mumbai-400 069. INDIA
www.edupristine.com
Ph. +91 22 3215 6191
