
Business Analytics

A Spreadsheet Oriented
Approach

Dr. A.N. Sah


To my daughter Akanksha
Copyright 2017 Dr. A.N. Sah
All rights reserved.
Preface
This ebook is written with the objective of giving essential knowledge of Business Analytics in a clear and precise manner. The book is divided into four parts.

Part I deals with the introduction to Business Analytics. In this part, I deal with the basics of business analytics. Specifically, I try to answer the following questions:
What is business analytics?
What are the types of analytics?
What is the process of business analytics?
It also discusses the similarity between the business analytics process (BAP) and the organization decision-making process (ODMP). I also explain the meaning of frequently used terminology such as data, database, Big data and little data.

Part II deals with descriptive analytics for decision making. In this part the book not only provides conceptual clarity but also tells you how to carry out the analytics using MS Excel. The various topics it includes are: Line Chart, Bar Chart, Sub-divided Bar Graph, Percentage Bar Graph, Pie Chart, Doughnut Chart, Area Chart, Stock Chart, Scatter Chart, Bubble Chart, Stem-and-Leaf Display, Measures of Central Tendency, Measures of Dispersion and Measures of Shape, Exploratory Data Analysis, Five-Number Summary, Box Plot, Weighted Arithmetic Mean, Mean of Grouped Data, Variance and Standard Deviation of Grouped Data, Covariance and Correlation, and Regression.

Part III deals with predictive analytics. The various topics it includes are: Introduction to Prediction, Regression Analysis, Linear Trendline, Quadratic Trendline, Cubic Trendline, Exponential Trend, Double Log Model, Moving Average, and Exponential Smoothing.

Part IV deals with prescriptive analytics. Its objective is to provide prescriptions to business managers based on objective analysis, ruling out subjectivity in the decision-making process. Topics included are linear optimization, applications of linear programming in various areas, and decision analysis.
Contents

Part I
Chapter 1: Introduction to Business Analytics

Part II
Chapter 2: Introduction to Data Analytics
Chapter 3: Data Visualization and Representation
Chapter 4: Descriptive Analytics by Numerical Methods
Chapter 5: Measures of Shape
Chapter 6: Covariance and Correlation

Part III
Chapter 7: Introduction to Predictive Analytics
Chapter 8: Time Series and its Components
Chapter 9: Trend Analytics
Chapter 10: Predictive Models for Stationary Time Series
Chapter 11: Simple Regression
Chapter 12: Multiple Regression
Chapter 13: Seasonal Forecasting with Dummy Regression
Chapter 14: ARIMA Modeling
Chapter 15: Markov Analysis
Chapter 16: Monte Carlo Simulation
Chapter 17: Qualitative Methods of Prediction

Part IV
Chapter 18: Linear Programming
Chapter 19: Applications of Linear Programming
Chapter 20: Decision Analysis
Chapter 1
Introduction to Business
Analytics
Introduction
Business analytics recognizes that data sets or databases may contain information that can not only help solve a problem but also reveal opportunities to improve business performance. The business analytics process starts with the collection of business-related data, followed by the sequential application of descriptive, predictive and prescriptive analytics, with a view to improving business decision making and organizational performance. It is a set of techniques and processes used to analyse data to improve business performance through fact-based decision-making.
Business Analytics creates capabilities for companies to compete effectively in the market and is likely to become one of the main functional areas in most companies. Competing on Analytics: The New Science of Winning, a well acclaimed book by Thomas Davenport, emphasized that a significant proportion of high-performance companies have high analytical skills among their personnel. In a recent survey of nearly 3,000 executives, MIT Sloan Management Review reported a striking correlation between an organization's analytics sophistication and its competitive performance. Capital One, a credit card company, has managed a profit of close to $1 billion in its credit card business in the recent past, whereas many of its competitors have shown losses of several million dollars in the same business. The success of Capital One is attributed to its analytical strength. Thus there is significant evidence from the corporate world that the ability to make better decisions improves with analytical skills. According to recent research, effective data management and business analytics are growing in relevance, are considered strategic, and are discussed at board-room level.
Types of Analytics
The Institute for Operations Research and the Management Sciences (INFORMS) has broadly categorized analytics into:
Descriptive Analytics
Descriptive analytics means describing a phenomenon in terms of graphs, pictures, symbols and simple statistical tools, such as the mean and standard deviation, that describe the nature of the data. Descriptive analytics is the first step in data handling.
Predictive Analytics
Predictive analytics means estimating in unknown situations. Forecasting and prediction are often used interchangeably; however, prediction is the more general term and connotes estimating for any time period before, during, or after the current observation. Predictive analytics searches for patterns in historical and transactional data to understand a business problem. Advanced statistical techniques such as regression analysis and time series models are the main tools of predictive analytics.

Prescriptive Analytics
Prescriptive analytics comprises applications of decision theory, operations research and management science to make informed decisions.

Business Analytics Process


The business analytics process involves applying the three types of analytics, viz. descriptive, predictive and prescriptive, sequentially to a source of data. The exercise is aimed at improving business performance in some way. The process is described as follows:

Step 1: Obtaining data


Business analytics begins either with a data set or with a database. A database refers to a collection of data files containing information on people, businesses, etc. Data can be obtained from a data warehouse (a collection of large databases) or from the cloud (hardware and software used for data storage and retrieval). In this context, it is important to distinguish between Big data and little data. Big data refers to very large and complex data sets or databases that cannot be processed by conventional software systems. Little data is any data that is not big; it can be used to help an individual business keep track of its customers.

Step 2: Descriptive Analytics


Descriptive analytics deals with what is contained in a data set or database. It involves describing the broad features of the data either with graphs and charts or with simple statistical tools. For example, a pie chart can be used to classify customers and show who the buyers of a product are.
The purpose of descriptive analytics is to identify possible trends. The objective is to get a broad picture of what the data generally look like, i.e., what actually happened? From descriptive analytics we try to find possible opportunities.

Descriptive analytics employs charts and graphs in order to summarize data or databases. It also includes measures of central tendency (mean, median, and mode), measures of dispersion (standard deviation), frequency distributions, probability and probability distributions, and sampling methods.
Step 3: Predictive Analytics
Predictive analytics aims at finding:
a) What is happening?
b) Why is it happening? and
c) What will happen in the future?

The purpose of conducting predictive analytics is to predict opportunities of which the firm can take advantage. It employs various statistical methods such as regression analysis, time series forecasting techniques, qualitative methods of prediction, Markov processes, and Monte Carlo simulation.

Step 4: Prescriptive Analytics


Prescriptive analytics tries to answer the question: how should the problem be handled? The purpose of prescriptive analytics is to allocate scarce resources optimally to take advantage of the predicted trends or future opportunities. The main tools of prescriptive analytics are linear programming, simplex procedures, and decision theory.

Step 5: Outcome of the Business Analytics Process

The outcome of the entire business analytics process must be expressed in measurable metrics. It explicitly measures the increase in business value and organizational performance.

The following figure depicts the sequential steps of the business analytics process (BAP).

Figure 1: Business Analytics Process (BAP)
Business Analytic Process
(BAP) and Organization
Decision-Making Process
(ODMP)

The business analytics process and the various steps involved in the traditional organization decision-making process (ODMP) are shown in Figure 2. The business analytics process starts with a database or data set, followed by the sequential application of descriptive, predictive and prescriptive analytics with a result-oriented approach in terms of metrics. Thus the business analytics process is a data-driven analytic process.

The organization decision-making process (ODMP) was given by Elbing (1970). The five-step ODMP starts with recognizing a problem that needs a decision or solution. In the next step the problem is explored to determine its intensity, its size, its impact and other aspects.
Figure 2: BAP and ODMP
In the third step of the ODMP, we try to find various strategies to solve the identified problem in relation to organizational goals and objectives. This stage of the ODMP is very similar to the predictive analytics stage of the business analytics process, which intends to find paths, trends and strategies. The fourth stage of the ODMP is to implement the strategy. Finally, the fifth step of both processes is essentially the same: measuring the increase in business value or business performance in metrics.

Thus there is a very close similarity between the BAP and the ODMP. While the BAP is a data-driven process, the ODMP is descriptive in nature and loaded with a lot of subjectivity. The business analytics process, however, emphasizes objectivity and is a fact-based decision-making process.

Exercises

1. What is business analytics and what is its importance?
2. What are the various types of analytics?
3. Explain the business analytics process. Why are the steps in the business analytics process sequential?
4. How is the business analytics process (BAP) similar to the organization decision-making process (ODMP)?
Chapter 2
Introduction to Data
Analytics

Introduction
Numerical facts and figures are called
data. Consider the following examples:
The Indian economy will grow by 9-10% per annum in the coming 5 years.
The money supply in the US economy is increasing by 5% every year.
Some stock market analysts believe that the BSE Sensex will be at 35,000 points by 2020 A.D.
The male-female ratio in India is 980 as per the 2011 census.
The population of India is growing by above 2% every year.
The voter turnout in India is only 50%.
The literacy rate in Bihar, even after 50 years of Independence, is only 47%.
Inflation in India in the year 2015-16 was below 4%.

To conduct any kind of empirical analysis, analysts in business and economics, or for that matter in any field, require data. Once the data is in hand, the first problem before the analyst is how the data should be handled so that meaningful inferences can be drawn from it. This chapter provides the basics of data handling. Essentially, it focuses on: (1) the types of data used in business analysis; (2) the use of various graphs for presenting information; and (3) descriptive statistics, which are used to numerically summarize the main characteristics of a data set.

Meaning of Data
As said earlier, numerical facts and figures are called data. In fact, data is simply statistics in the plural sense, where the word is used to denote numerical and quantitative information. For example, Bill scored 72 marks in the business statistics paper. In 2015, while the Chinese economy grew at 6.5 per cent, the Indian economy grew at 7.5 per cent per annum. Such numerical and quantitative information is called data. Data are collected, tabulated, summarized, and analyzed for presentation and interpretation.

Elements and Variables


The unit on which data are collected is known as an element.

Table 2.1: Marks of Students
Name of Student   Marks Obtained
Bob               75
Alan              86
Alisha            77
Mike              80

For example, in Table 2.1, each individual student is an element. The data in Table 2.1 contain four elements.

A variable is a characteristic of the elements. In Table 2.1, marks obtained is the variable. Thus:
Bob (Element) → Marks Obtained (Variable)

Measurements on a variable provide data. The measurement obtained for a particular element is called an observation. Thus:
Bob (Element) → Marks Obtained (Variable) → 75 (Observation)

Levels of Measurement
There are generally four types of variables encountered in empirical analysis. The type of variable under consideration plays an important role in selecting the appropriate statistical tool for analysis. For instance, it is not advisable to compute the arithmetic mean of a nominal scale variable. The various types of variables and their nature are discussed as follows:
Nominal Scale Variables
Nominal scale variables are very common in marketing and social science research. A nominal scale divides data into categories which are mutually exclusive and collectively exhaustive. In other words, a data point is grouped under one and only one category, and all other data points are grouped elsewhere on the scale. The word nominal means "name-like", which means that the numbers or codes given to objects or events are for naming or classifying only. These numbers have no true numerical meaning and thus cannot be added, multiplied or divided. They are simply labels or identification numbers.

The following are examples of nominal scales:
Gender: (1) Men (2) Women
Nationality: (1) Indian (2) American (3) Other

What kind of analysis can be done on this type of data? We can only find the numbers and percentages of items in each group. The only appropriate measure of central tendency that can be applied to such data is the mode.
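For readers who also work in Python, the following is a minimal sketch of summarizing a nominal variable with the pandas library; the sample nationality values are assumed purely for illustration, and the only summaries computed are counts, percentages and the mode.

import pandas as pd

# Nominal data: categories carry no order, so only counting makes sense
nationality = pd.Series(
    ["Indian", "American", "Indian", "Other", "Indian", "American"],
    dtype="category",
)

print(nationality.value_counts())                       # counts per category
print(nationality.value_counts(normalize=True) * 100)   # percentage of items in each group
print(nationality.mode())                               # the only appropriate "average"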

Ordinal Scale Variables


An ordinal scale is one step further in the levels of measurement. Ordinal scales not only classify data into categories but also order them. One point to note here is that ordinal measurement requires the transitivity assumption, which is described as: if A is preferred to B, and B is preferred to C, then A is preferred to C. The following is an example of an ordinal scale:
Please rank the following cars by look from 1 to 5, with 1 being the most stylish and 5 the least stylish.
Enzo Ferrari ( )
Koenigsegg CCXR ( )
McLaren F1 ( )
Bugatti Veyron ( )
Lamborghini Reventon ( )
Numbers on an ordinal scale only stand for rank order. They do not represent absolute quantities, and the intervals between two numbers need not be equal. For ordinal scaled data, mathematical operations such as addition and division are not permitted. The mode and median can be found, but the arithmetic mean must not be computed. One can also use quartiles and percentiles for measuring variation.

Interval and Ratio Scale Variables


Interval scales take measurement one step further. They have all the characteristics of ordinal scales in addition to equal intervals between the points on the scale. On interval scales, common mathematical operations are permitted: one can find the arithmetic mean, variance and other statistics.

Ratio scales have a meaningful origin or absolute zero in addition to all the features of an interval scale. Since the origin or location of the zero point is universally acceptable, we can compare magnitudes of ratio-scaled variables. In other words, this scale includes a zero value to indicate that nothing exists for the variable at the zero point. For example, a smart phone price of zero means the phone has no cost and is free. Suppose the price of an Apple smart phone is Rs. 40,000 and the price of a Micromax phone is Rs. 20,000. It can be inferred that the price of the Apple phone is twice that of the Micromax. Examples of ratio scales are income, weight, etc. All types of statistical and mathematical computations are permitted on ratio-scaled variables.

Types of Data
There are three common types of data.
Time Series Data
Time series data is a sequence of observations ordered in time. For example, data on real gross domestic product, money supply, inflation, etc. are collected at specific points in time, say yearly. These data are ordered by time and are called time series. The observations made on GDP or money supply at time t and t+1 are separated by some unit of time such as days, weeks, months or years. The following are some examples of time series data:
Annual GDP at current prices from 1950 to 2016
Monthly figures on broad money (M3) from April 2005 to March 2016
Profits of Reliance Industries over 20 years
Daily closing prices of the BSE Sensex over 10 years
A macroeconomist, who studies the economy as a whole, often works with time series data on important macroeconomic variables like real GDP, inflation, etc.

Cross-Sectional Data
Cross-sectional data refers to parallel data on many units such as individuals, firms, countries, etc. at the same point of time. The following are a few examples of cross-sectional data:
Profits of 50 firms in 2014-15
Per capita income of 100 nations in 2016
Heights and weights of 500 people in a company in 2016
Income, education and experience of 2,500 workers in a locality in 2015
In economics, microeconomists often work with cross-sectional data. For example, suppose a labor economist wishes to know how much the workers of the textile industry earn. To this end, he asks 100 workers how much they earn. The income of these 100 workers in the textile industry is cross-sectional data.

Panel Data
Panel data combines features of both time series and cross-sectional data: the same cross-section units are surveyed over time. The following are examples of panel data:
Growth rate of M3 for 20 countries from 1995 to 2015
Profits of 100 firms over 10 years
Rates of return of 120 mutual funds over 15 years
Exercises

1. What is data?
2. What are elements and variables?
3. What are the four levels of measurement of a variable?
4. What is nominal data? Can you find the arithmetic mean of such data? Why?
5. What is ordinal data? Is it qualitative or quantitative data?
6. What is interval data? Give one example.
7. What is a ratio-scaled variable? Is it qualitative or quantitative data?
8. What are cross-sectional, time series and panel data?
Chapter 3
Data Visualization and
Representation
Introduction
Descriptive analytics means describing a phenomenon in terms of graphs, pictures, symbols and mathematical expressions. Describing data related to the economy, business and other areas is the first step in data handling.

There is a famous saying: one picture is worth a thousand words. Summarizing data through graphic representation involves the use of graphs and charts to report the key features of the data.

Diagrammatic Representation
Diagrams and graphs are useful because they provide a bird's-eye view of the entire data, and the information presented is easily understood. However, there is a distinction between a diagram and a graph.

A diagram is generally constructed on plain paper. Diagrams are better suited for publicity, campaigns, and promotion.

Steps for Constructing a Diagram

Step 1: The diagram must be given a suitable title. The title should convey, in as few words as possible, the main idea that the diagram intends to portray.

Step 2: A proper proportion between the width and height of the diagram should be maintained.

Step 3: The scale should be in even numbers or multiples of five. Odd scales such as 1, 3, 5, 7, 9 should be avoided.

Step 4: Footnotes should be given at the bottom of the diagram.

Step 5: An index of the different types of lines, shades or colors should be given, which helps in understanding the meaning of the diagram.
Graphic Representation
A graph presents the figures in concrete form. In constructing a graph we generally make use of graph paper. For representing frequency distributions and time series, graphs are more appropriate than diagrams. Graphs indicate trends and relations.

Method of Construction of a Graph

1. Graphs are drawn on a special type of paper called graph paper, which has a network of horizontal and vertical lines. Thick lines mark each division of a centimeter or an inch, and thin lines mark the smaller subdivisions.
2. In a graph of any size, two straight lines are drawn at right angles to each other, intersecting at a point O called the origin.
3. The two lines are called the coordinate axes; the horizontal line is called the X-axis (denoted by X'OX) and the vertical line is called the Y-axis (denoted by YOY').
4. The graph is divided into four quadrants. Generally we use the first quadrant unless negative magnitudes are to be displayed.
5. The scale along each axis should be chosen so that the entire data can be accommodated in the available space without crowding. If the data are irregular, use a false base line.

In this chapter, we will discuss techniques for graphically presenting time series and cross-sectional data, which are frequently used in economic and business analysis.

Line Chart
A line graph is used to present a particular variable measured at various points over time. It is drawn by connecting lines between successive data points. A line graph is very useful in highlighting trends in a variable over time. The monthly closing stock price data of Facebook, Inc. from 18th May 2012 through 1st February 2017, shown in Figure 3.1, is an example of a line chart. This data set contains 58 observations, and it is difficult for a reader to comprehend anything meaningful by observing the raw data. However, a reader can easily grasp the main characteristics of the series and its trend over the years by looking at the line graph.

Figure 3.1
As evident from Fig. 3.1, the stock price of Facebook shows a rising trend after 18th May 2013; before this the stock price of Facebook had remained stagnant for almost a year. The stock price of Facebook was around $30 in May 2012 and it crossed $50 somewhere in mid-2013. The trend in the stock is up and, as on 1st February 2017, the stock was quoting a price of $135.54.

Computer Application
Microsoft Excel's Chart Wizard can generate a line graph. To construct a line graph in Excel, enter the label and data into a column. Choose Insert from the menu bar, then go to Chart from the pull-down menu. Select Line from this, follow the instructions, and in four steps the line graph is completed. There are options in Excel such as including a legend, deciding data labels and finally deciding the location of the chart.

Using Excel
Many of the graphs such as the line graph, bar graph and scatter plot discussed in this chapter can be generated with the help of the Chart Wizard. Excel can also generate histograms using the Data Analysis tool.

Line Graph

Step 1: Open Excel and enter the data in columns or rows in the worksheet as shown below.
Step 2: Select the data.
Step 3: On the Insert tab, in the Charts group, click the line chart icon.
Step 4: If you click on the first option, Excel generates the line chart shown below.
Step 5: Put the cursor on the figure and click. Chart Tools will appear above the toolbar. After this, click Layout; you will find options like Axis Title, Chart Title, etc.

Step 6: For labeling the X-axis, click Axis Title and select Primary Horizontal Axis Title from the drop-down menu. When you click Primary Horizontal Axis Title another drop-down menu will appear. Click Title Below Axis, and a box will appear below the X-axis.
Step 7: Write the name of the variable in the box as shown below.
Step 8: For labeling the Y-axis, click Axis Title and select Primary Vertical Axis Title from the drop-down menu. When you click Primary Vertical Axis Title another drop-down menu will appear. Click either Vertical Title or Horizontal Title; a box will appear on the Y-axis.
Step 9: Write the name of the variable in the box.
Step 10: If you want to put a chart title, click Chart Title; a drop-down menu will appear. Next, select either Centered Overlay Title or Above Chart.
Step 11: Write the chart title in the box.
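For readers who prefer Python to Excel, the following is a minimal sketch that draws the same kind of line chart with the matplotlib library; the file name facebook_close.csv and its Date and Close columns are assumed for illustration.

import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical file: monthly closing prices with columns Date, Close
df = pd.read_csv("facebook_close.csv", parse_dates=["Date"])

plt.plot(df["Date"], df["Close"])       # connect successive data points
plt.xlabel("Date")                      # X-axis title
plt.ylabel("Closing price ($)")         # Y-axis title
plt.title("Facebook monthly closing stock price")
plt.show()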
Bar Graph
A simple bar graph is used to represent figures on a single variable. Figures of profits of different companies in a particular year, population in various years, etc. may be represented by a bar graph. The bars have the same width, but their lengths are drawn in proportion to the size of the figures.

Table 3.1: Sales/Revenue of Facebook
Year   Revenue ($ Billions)
2012   5.09
2013   7.87
2014   12.47
2015   17.93
2016   27.64

Table 3.1 gives figures on the sales of Facebook from 2012 to 2016. This information may be represented by a bar graph, which provides a visual display of the information.

Fig. 3.2: Sales/Revenue of Facebook

Fig. 3.2 shows the revenue of Facebook between 2012 and 2016. The sales of Facebook, which were merely 5 billion dollars in 2012, increased to in excess of 25 billion dollars in 2016.
Computer Application
Microsoft Excel's Chart Wizard can also generate a bar graph. To construct a bar graph in Excel, enter the labels into one column and the data into another column. Choose Insert from the menu bar, then go to Chart from the pull-down menu. Select Bar from this, follow the instructions, and in four steps the bar graph is completed. There are options in Excel such as including a legend, deciding data labels and finally deciding the location of the chart.
Using Excel

Step 1: Open Excel and enter the data as shown below.
Step 2: Click Insert on the toolbar; the screen below will appear.
Step 3: Select the data and click Column as shown below.
Step 4: If you click on the first option, the screen below will appear.
Step 5: The procedure for labeling the bar chart is the same as that of the line graph discussed under Step 5 to Step 11. Following those steps, the bar graph produced in Excel is shown below.
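A minimal Python sketch of the same bar graph, using the Table 3.1 figures:

import matplotlib.pyplot as plt

years = ["2012", "2013", "2014", "2015", "2016"]
revenue = [5.09, 7.87, 12.47, 17.93, 27.64]   # $ billions, Table 3.1

plt.bar(years, revenue)                       # equal-width bars, height proportional to the value
plt.xlabel("Year")
plt.ylabel("Revenue ($ billions)")
plt.title("Sales/Revenue of Facebook")
plt.show()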
Multiple Bar Graph
When you want to compare two or more related series, it is better to construct the bars side by side instead of drawing two separate bar graphs in two different places. In a multiple bar graph, two or more bars are drawn together on a common scale using different shades for different series. For example, if you want to compare the revenue and expenditure of a country each year, you can draw two bars together to show the revenue and expenditure for each year.

Illustration
The following table shows the sales of Apple Inc. ($ billions) and the cost of goods sold ($ billions) from 2012 to 2016.

Year   Sales of Apple Inc. ($ billions)
2012   155.97
2013   170.87
2014   183.24
2015   231.28
2016   214.23

The information in the above table can be suitably represented using a multiple bar diagram.
Fig. 3.3: Sales of Apple Inc. and Cost of Goods Sold from 2012 to 2016
From the above multiple bar graph, it is clear that the total sales of Apple Inc. are always higher than the total cost of goods sold. Both variables move in tandem.

Computer Application

Step 1: Enter the data in the Excel sheet as shown below.
Step 2: Click Insert on the toolbar; the screen below will appear.
Step 3: Select the data as shown below.
Step 4: Next, click Column as shown below.
Step 5: Click on the first option; the screen will appear as shown below.
Step 6: The procedure for labeling the multiple bar chart is the same as that of the line graph discussed under Step 5 to Step 11. Following those steps, the multiple bar graph produced in Excel is shown below.
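A minimal Python sketch of a multiple (grouped) bar chart for this illustration; the cost-of-goods-sold figures below are hypothetical placeholders, since that column is not reproduced in the table above.

import numpy as np
import matplotlib.pyplot as plt

years = ["2012", "2013", "2014", "2015", "2016"]
sales = [155.97, 170.87, 183.24, 231.28, 214.23]   # $ billions (table above)
cogs  = [87.85, 106.61, 112.26, 140.09, 131.38]    # hypothetical placeholders

x = np.arange(len(years))
width = 0.35                                       # width of each bar

plt.bar(x - width / 2, sales, width, label="Sales")
plt.bar(x + width / 2, cogs, width, label="Cost of goods sold")
plt.xticks(x, years)
plt.ylabel("$ billions")
plt.legend()
plt.show()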
Subdivided Bar Diagrams
A subdivided or component bar diagram is used to represent the various parts of a total. In this diagram, the total magnitude is divided into different parts or components. The procedure for constructing a subdivided bar graph is first to construct a simple bar for each class, taking the total of that class. Next, we divide these simple bars into parts in the ratio of the various components. A subdivided bar diagram is also known as a component bar diagram.

Illustration
The table below shows sector-wise capital formation by the household sector, private corporate sector and public sector in India from 2010-11 to 2014-15. The appropriate bar diagram to represent the contribution of each sector to total capital formation in India is the subdivided bar diagram.

Years     Household Sector   Private Corporate Sector
2010-11   437544             33262
2011-12   415207             47916
2012-13   461627             57591
2013-14   481026             70848
2014-15   526209             59064

Fig. 3.4: Sector-wise Capital Formation in India from 2010-11 to 2014-15
Computer Application

Step 1: Enter the data in Excel as shown below.
Step 2: Click Insert on the toolbar; the screen below will appear.
Step 3: Select the data as shown below.
Step 4: Next, click Column as shown below.
Step 5: Click on the second option under 2-D Column; the screen will appear as shown below.
Step 6: The procedure for labeling the subdivided bar graph is the same as that of the line graph discussed under Step 5 to Step 11. Following those steps, the subdivided bar graph produced in Excel is shown below.
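A minimal Python sketch of a subdivided (stacked) bar chart for the two sectors reproduced in the table above:

import matplotlib.pyplot as plt

years = ["2010-11", "2011-12", "2012-13", "2013-14", "2014-15"]
household = [437544, 415207, 461627, 481026, 526209]
private_corporate = [33262, 47916, 57591, 70848, 59064]

plt.bar(years, household, label="Household sector")
plt.bar(years, private_corporate, bottom=household,   # stack on top of the first component
        label="Private corporate sector")
plt.ylabel("Capital formation")
plt.legend()
plt.show()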
Percentage Bar Diagram
When a subdivided bar is drawn on a percentage basis it is termed a percentage bar diagram. The main idea behind this is comparison on a relative basis. To draw a percentage bar diagram, the total of each bar is kept equal to 100 and each subdivision is cut in the percentage of its component in the aggregate. In this bar diagram, all the bars are of equal height; the components show the percentages visually and help in making comparisons very easily.

Illustration
The table below shows the major components of Central Government receipts (Rs. crore) from 2011-12 to 2014-15. Construct a percentage bar chart to illustrate this data.

Years     Tax Revenue   Non-Tax Revenue   Capital Receipts
2011-12   439547        102317            170807
2012-13   443319        96940             343697
2013-14   465103        112191            444243
2014-15   534094        148118            426537

Solution
The percentage bar graph of Central Government receipts from 2011-12 to 2014-15 is shown below.
Fig. 3.5: Percentage Bar Graph of Major Components of Central Government Receipts (Rs. Cr)
Fig. 3.5 shows that in 2011-12, the share of tax revenue was more than 60% of total receipts. The share of non-tax revenue in total receipts was around 15% and the share of capital receipts was roughly 25 per cent. However, in 2014-15, we find a slight change in the composition of total receipts: the share of tax revenue was less than 50% of total receipts, non-tax revenue was around 13%, and capital receipts were around 38% of total receipts.
Computer Application

Step 1: Enter the data in the Excel sheet as shown below.
Step 2: Click Insert on the toolbar; the screen below will appear.
Step 3: Select the data as shown below.
Step 4: Next, click Column as shown below.
Step 5: Click on the third option under 2-D Column; the screen will appear as shown below.
Step 6: The procedure for labeling the percentage bar graph is the same as that of the line graph discussed under Step 5 to Step 11. Following those steps, the percentage bar graph produced in Excel is shown below.
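A minimal Python sketch of a percentage (100% stacked) bar chart for the receipts data above; each component is first converted to its share of the yearly total.

import numpy as np
import matplotlib.pyplot as plt

years = ["2011-12", "2012-13", "2013-14", "2014-15"]
tax     = np.array([439547, 443319, 465103, 534094], dtype=float)
non_tax = np.array([102317, 96940, 112191, 148118], dtype=float)
capital = np.array([170807, 343697, 444243, 426537], dtype=float)

total = tax + non_tax + capital
tax_pct = 100 * tax / total
non_tax_pct = 100 * non_tax / total
capital_pct = 100 * capital / total

plt.bar(years, tax_pct, label="Tax revenue")
plt.bar(years, non_tax_pct, bottom=tax_pct, label="Non-tax revenue")
plt.bar(years, capital_pct, bottom=tax_pct + non_tax_pct, label="Capital receipts")
plt.ylabel("Share of total receipts (%)")
plt.legend()
plt.show()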
Pie Chart
A pie chart is a circular chart which is
divided into segments to give an idea
about percentages or the relative shares
of parts to a whole. A pie chart provides
a quick visual idea of the relative
magnitudes of a part to a whole. Pie
charts are frequently used when the
objective is to compare a part of a group
with the whole group. One can use a pie
chart to show, say, market shares for
various airline companies in India, or
different types of toys sold by a store.
However, a pie chart is not a good
technique for showing increases and
decreases, or direct relationships
between two variables. We will explain the construction of a pie chart with an example. Consider the hypothetical annual advertising expenditure of various companies given in Table 3.2:

Table 3.2
Company    Expenditure (Rs. lakhs)
Tata       600
Reliance   1200
Godrej     200
M&M        200
Bajaj      300
Birla      100
Others     400
Total      3000

The first step in the construction of a pie chart is to determine the proportion of each subunit to the total. This is done by dividing each subunit by the total. The next step is to multiply each proportion by 360 to obtain the number of degrees representing each item, because a circle contains 360°. For example, Tata spends Rs. 600 lakhs, which represents a 0.20 proportion of the total advertising spending. When this value is multiplied by 360°, it results in 72°. Hence, the Tata expenditure will account for 72° of the circle. To draw the other slices of the pie chart a compass can be used. The pie chart for the above problem is shown in the following figure.
Fig 3.6: Pie Chart
From the above figure, it is evident that
Reliance constitutes 40% of the total
advertising expenditure. Birla accounted
for only 3% in total advertising
expenditure.

Computer Application
Microsoft Excel's Chart Wizard can also generate a pie chart. To construct a pie chart in Excel, enter the labels into one column and the data into another column. Choose Insert from the menu bar, then go to Chart from the pull-down menu. Select Pie from this, follow the instructions, and in four steps the pie chart is completed. There are options in Excel such as including a legend, deciding data labels and finally deciding the location of the chart.

Using Excel
The following Table 3.3 shows the sectoral allocation of financial resources during the 10th Plan.

Table 3.3: Sectoral Allocation during the 10th Plan
Sector                                               Rs. Crores
Education                                            62461
Rural Development, Land Resources & Panchayati Raj   87041
Health, Family Welfare & Ayush                       45771
Agriculture & Irrigation                             50639
Social Justice                                       36381
Physical Infrastructure                              89021
Scientific Departments                               29823
Energy                                               47266
Total Priority Sector                                448403
Others                                               365375
Total                                                813778

We will now create a pie chart using Excel.

Step 1: Open Excel and enter the data as shown below. Compute the share of each sector by dividing each subunit by the total; to obtain the percentage, multiply it by 100. Column C gives the percentage share of each sector in the total.
Step 2: Click Insert on the toolbar; the screen below will appear.
Step 3: Select the data and click Pie as shown below.
Step 4: If you click on the first option, the screen below will appear.
Step 5: The procedure for labeling the pie chart is the same as that of the line graph discussed under Step 5 to Step 11. Following those steps, the pie chart produced in Excel is shown below.
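A minimal Python sketch of a pie chart for the Table 3.2 advertising data; matplotlib computes the angles (proportion multiplied by 360°) automatically.

import matplotlib.pyplot as plt

companies = ["Tata", "Reliance", "Godrej", "M&M", "Bajaj", "Birla", "Others"]
spend = [600, 1200, 200, 200, 300, 100, 400]   # Rs. lakhs, Table 3.2

plt.pie(spend, labels=companies, autopct="%1.0f%%")   # label each slice with its percentage share
plt.title("Annual advertising expenditure")
plt.show()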
Doughnut Chart
A doughnut chart is very similar to a pie chart. Like a pie chart, it shows the size of items in a data series, proportional to the sum of the items. However, it can be used for more than one data series. Doughnut charts show data in rings, where each ring represents a data series. Keep in mind that doughnut charts are not easy to read.

Example
In an MBA course there are 55 male and 30 female students. Construct a doughnut chart to display this information.
Computer Application
Microsoft Excel's Chart Wizard can also generate a doughnut chart. To construct a doughnut chart in Excel, enter the labels into one column and the data into another column. Choose Insert from the menu bar, then go to Chart from the pull-down menu. Select Doughnut from this and follow the instructions. There are options in Excel such as including a legend, deciding data labels and finally deciding the location of the chart.

Using Excel
The following table shows data relating to the gender of students:

       Male   Female
MBA    55     30

We will now create a doughnut chart using Excel.

Step 1: Open Excel and enter the data as shown below.
Step 2: Click Insert on the toolbar; the screen below will appear.
Step 3: Select the data and click Other Charts as shown below.
Step 4: If you click on the first option of the doughnut chart, the following screen will appear.
Step 5: The procedure for labeling the doughnut chart is the same as that of the line graph discussed under Step 5 to Step 11. Following those steps, the doughnut chart produced in Excel is shown below.
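A minimal Python sketch of a doughnut chart for the data above; in matplotlib a doughnut is simply a pie chart whose wedges are drawn with a reduced width.

import matplotlib.pyplot as plt

labels = ["Male", "Female"]
counts = [55, 30]

plt.pie(counts, labels=labels, autopct="%1.0f%%",
        wedgeprops={"width": 0.4})   # width < 1 hollows out the centre, giving the ring
plt.title("MBA students by gender")
plt.show()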
Area Chart
Area charts are also used to plot change over time or across categories. An area chart sums the plotted values, and thereby depicts the relationship of parts to a whole. Thus, we can use area charts to highlight the magnitude of change over time.
The following table shows US energy consumption (trillion Btu) by sector from 2012 to 2016.

End-Use Sector   2012    2013    2014
Residential      17835   18687   19278
Commercial       15840   16227   16622
Industrial       28258   28660   28931
Transportation   24098   24516   24683

We will now create an area chart for the above data using Excel.

Step 1: Open Excel and enter the data as shown below.
Step 2: Click Insert on the toolbar; the screen below will appear.
Step 3: Select the data and click Area charts as shown below.
Step 4: If you click on the first option of the area chart, the following screen will appear.
Step 5: The procedure for labeling the area chart is the same as that of the line graph discussed under Step 5 to Step 11. Following those steps, the area chart produced in Excel is shown below.
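A minimal Python sketch of a simple area chart for one of the series above (residential consumption), using fill_between to shade the area under the line:

import matplotlib.pyplot as plt

years = [2012, 2013, 2014]
residential = [17835, 18687, 19278]   # trillion Btu (table above)

plt.fill_between(years, residential, alpha=0.5)   # shaded area under the series
plt.plot(years, residential)
plt.xticks(years)
plt.ylabel("Energy consumption (trillion Btu)")
plt.title("US residential energy consumption")
plt.show()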
Subdivided (Stacked) Area Chart
The subdivided area chart is more appropriate than the simple area chart in this particular example. So now we will construct a subdivided or stacked area chart using Excel.

Step 1: Open Excel and enter the data as shown below.
Step 2: Click Insert on the toolbar; the screen below will appear.
Step 3: Select the data and click Area charts as shown below.
Step 4: If you click on the second option (either 2-D or 3-D) of the area chart, the following screen will appear.
Step 5: The procedure for labeling the area chart is the same as that of the line graph discussed under Step 5 to Step 11. Following those steps, the stacked area chart produced in Excel is shown below.
Percentage (100% Stacked) Area
Chart
The limitation of the subdivided area chart leads us to the 100% stacked area chart. So now we will construct a percentage or 100% stacked area chart using Excel.

Step 1: Open Excel and enter the data as shown below.
Step 2: Click Insert on the toolbar; the screen below will appear.
Step 3: Select the data and click Area charts as shown below.
Step 4: If you click on the third option (either 2-D or 3-D) of the area chart, the following screen will appear.
Step 5: The procedure for labeling the area chart is the same as that of the line graph discussed under Step 5 to Step 11. Following those steps, the percentage area chart produced in Excel is shown below.
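A minimal Python sketch covering both the stacked and the 100% stacked area charts for the sector data above; stackplot stacks the series, and dividing by the column totals converts them to percentage shares.

import numpy as np
import matplotlib.pyplot as plt

years = [2012, 2013, 2014]
sectors = ["Residential", "Commercial", "Industrial", "Transportation"]
data = np.array([
    [17835, 18687, 19278],
    [15840, 16227, 16622],
    [28258, 28660, 28931],
    [24098, 24516, 24683],
], dtype=float)                        # trillion Btu (table above)

# Stacked area chart
plt.stackplot(years, data, labels=sectors)
plt.legend(loc="upper left")
plt.ylabel("Energy consumption (trillion Btu)")
plt.show()

# 100% stacked area chart: convert each year's values to percentage shares
shares = 100 * data / data.sum(axis=0)
plt.stackplot(years, shares, labels=sectors)
plt.legend(loc="upper left")
plt.ylabel("Share of total consumption (%)")
plt.show()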
Stock Charts
Stock charts are used by technical analysts. As the name suggests, they are useful for showing fluctuations in stock prices. However, these charts are also useful for showing the low and high values of variables such as rainfall, temperature, pressure, etc.
While constructing stock charts, the data need to be entered in Excel in a specific order. For instance, arrange the data with High, Low, and Close as column headings to generate a simple high-low-close stock chart.
Computer Application
Microsoft Excel's Chart Wizard can also generate stock charts. To construct a stock chart in Excel, enter the labels into one column and the data into adjacent columns. Choose Insert from the menu bar, then go to Other Charts from the pull-down menu. Select the first option under Stock chart, follow the instructions, and in four steps your stock chart is ready. There are options in Excel such as including a legend, deciding data labels and finally deciding the location of the chart.

High-Low-Close Stock Chart Using Excel

The following table shows data relating to stock prices of the Coca-Cola Company from 23.02.2017 to 02.03.2017.

Date           High    Low
Mar 02, 2017   42.56   42.07
Mar 01, 2017   42.35   41.88
Feb 28, 2017   42.07   41.64
Feb 27, 2017   41.75   41.59
Feb 24, 2017   41.91   41.60
Feb 23, 2017   42.00   41.62

We will now create a stock chart using Excel.
Step 1: Open Excel and enter the data as shown below.
Step 2: Click Insert on the toolbar; the screen below will appear.
Step 3: Select the data and click Other Charts as shown below.
Step 4: If you click on the first option of the stock chart, the following screen will appear.
Step 5: The procedure for labeling the stock chart is the same as that of the line graph discussed under Step 5 to Step 11. Following those steps, the stock chart produced in Excel is shown below.
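A minimal Python sketch of a high-low-close stock chart drawn with plain matplotlib: a vertical line spans each day's low-high range and a marker shows the close. The closing prices below are hypothetical placeholders, since the Close column is not reproduced in the table above.

import matplotlib.pyplot as plt

dates = ["Feb 23", "Feb 24", "Feb 27", "Feb 28", "Mar 01", "Mar 02"]
high  = [42.00, 41.91, 41.75, 42.07, 42.35, 42.56]
low   = [41.62, 41.60, 41.59, 41.64, 41.88, 42.07]
close = [41.94, 41.78, 41.71, 41.95, 42.22, 42.44]   # hypothetical placeholders

x = range(len(dates))
plt.vlines(x, low, high)                  # low-high range for each trading day
plt.plot(x, close, "o", label="Close")    # closing price marker
plt.xticks(x, dates, rotation=45)
plt.ylabel("Price ($)")
plt.legend()
plt.tight_layout()
plt.show()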
Open-High-Low-Close Stock Chart
Using Excel
The following table shows data relating to stock prices of the Coca-Cola Company from 23.02.2017 to 02.03.2017.

Date           Open    High    Low
Mar 02, 2017   42.08   42.56   42.07
Mar 01, 2017   42.01   42.35   41.88
Feb 28, 2017   41.68   42.07   41.64
Feb 27, 2017   41.75   41.75   41.59
Feb 24, 2017   41.70   41.91   41.60
Feb 23, 2017   41.67   42.00   41.62

We will now create a stock chart using Excel.
Step 1: Open Excel and enter the data as shown below.
Step 2: Click Insert on the toolbar; the screen below will appear.
Step 3: Select the data and click Other Charts as shown below.
Step 4: If you click on the open-high-low-close option of the stock chart, the following screen will appear.
Step 5: The procedure for labeling the stock chart is the same as that of the line graph discussed under Step 5 to Step 11. Following those steps, the stock chart produced in Excel is shown below.
Volume-High-Low-Close Stock Chart Using Excel
The following table shows data relating to stock prices of the Coca-Cola Company from 23.02.2017 to 02.03.2017.

Date           Volume        High    Low
Mar 02, 2017   1,54,89,200   42.56   42.07
Mar 01, 2017   1,46,62,000   42.35   41.88
Feb 28, 2017   1,59,46,800   42.07   41.64
Feb 27, 2017   1,21,86,400   41.75   41.59
Feb 24, 2017   1,32,15,300   41.91   41.60
Feb 23, 2017   1,28,56,800   42.00   41.62


We will now create a stock chart using Excel.
Step 1: Open Excel and enter the data as shown below.
Step 2: Click Insert on the toolbar; the screen below will appear.
Step 3: Select the data and click Other Charts as shown below.
Step 4: If you click on the volume-high-low-close option of the stock chart, the following screen will appear.
Step 5: The procedure for labeling the stock chart is the same as that of the line graph discussed under Step 5 to Step 11. Following those steps, the stock chart produced in Excel is shown below.
Volume-Open-High-Low-Close Stock Chart Using Excel

The following table shows data relating to stock prices of the Coca-Cola Company from 23.02.2017 to 02.03.2017.

Date           Volume        Open    High
Mar 02, 2017   1,54,89,200   42.08   42.56
Mar 01, 2017   1,46,62,000   42.01   42.35
Feb 28, 2017   1,59,46,800   41.68   42.07
Feb 27, 2017   1,21,86,400   41.75   41.75
Feb 24, 2017   1,32,15,300   41.70   41.91
Feb 23, 2017   1,28,56,800   41.67   42.00
We will now create a stock chart using Excel.

Step 1: Open Excel and enter the data as shown below.
Step 2: Click Insert on the toolbar; the screen below will appear.
Step 3: Select the data and click Other Charts as shown below.
Step 4: If you click on the volume-open-high-low-close option of the stock chart, the following screen will appear.
Step 5: The procedure for labeling the stock chart is the same as that of the line graph discussed under Step 5 to Step 11. Following those steps, the stock chart produced in Excel is shown below.
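A minimal Python sketch of a volume-plus-price chart: a bar panel for volume beneath a price panel, sharing the date axis. Only the columns reproduced above are used.

import matplotlib.pyplot as plt

dates  = ["Feb 23", "Feb 24", "Feb 27", "Feb 28", "Mar 01", "Mar 02"]
volume = [12856800, 13215300, 12186400, 15946800, 14662000, 15489200]
open_  = [41.67, 41.70, 41.75, 41.68, 42.01, 42.08]
high   = [42.00, 41.91, 41.75, 42.07, 42.35, 42.56]

x = range(len(dates))
fig, (ax_price, ax_vol) = plt.subplots(2, 1, sharex=True)

ax_price.plot(x, open_, marker="o", label="Open")
ax_price.plot(x, high, marker="^", label="High")
ax_price.set_ylabel("Price ($)")
ax_price.legend()

ax_vol.bar(x, volume)                  # traded volume per day
ax_vol.set_ylabel("Volume")
ax_vol.set_xticks(list(x))
ax_vol.set_xticklabels(dates, rotation=45)

plt.tight_layout()
plt.show()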
Scatter Plot
Sometimes analysts are interested in finding the nature of the relationship between two or more variables so that one phenomenon can be explained in terms of another. For example: Are changes in interest rates related to the profitability of the corporate sector? Are changes in trading volume associated with the movement of a stock's price?

Fig. 3.7: Scatter Plot between Stock Prices of Microsoft and Facebook

Such a relationship between two variables can be shown graphically by a scatter plot. A scatter plot gives a quick idea about the relationship between the stock prices of Microsoft and Facebook over the time period, as shown in Fig. 3.7. A careful inspection of this figure gives some indication that the two stock prices are related. The relationship between these two variables seems to be positive.

From Figure 3.7 we cannot say precisely what the degree of association is. A scatter plot only gives the direction of the relationship between two variables. If two variables move in unison over time, there will be a positive relationship between them. However, when two variables move in opposite directions, we will have a negative association between them. One important aspect to keep in mind is that such relationships are only tendencies and may not necessarily hold for every year.
Computer Application
Microsoft Excel's Chart Wizard can also generate a scatter plot. To construct a scatter plot in Excel, enter the labels and data into columns. Choose Insert from the menu bar, then go to Chart from the pull-down menu. Select X-Y (Scatter) from this, follow the instructions, and in four steps the scatter plot is completed. There are options in Excel such as including a legend, deciding data labels and finally deciding the location of the chart.

Using Excel
Step 1: To find the relationship between the price movements of Microsoft and Facebook, we can construct a scatter plot. To construct a scatter plot in Excel, first enter the data as shown below.
Step 2: Click Insert on the toolbar; the screen below will appear.
Step 3: Select the data and click Scatter as shown below.
Step 4: If you click on the first option, the screen below will appear.
Step 5: The procedure for labeling the scatter plot is the same as that of the line graph discussed under Step 5 to Step 11. Following those steps, the scatter plot produced in Excel is shown below.
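A minimal Python sketch of a scatter plot for two price series; the file name prices.csv and its Microsoft and Facebook columns are assumed for illustration.

import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical file with closing prices in columns: Microsoft, Facebook
df = pd.read_csv("prices.csv")

plt.scatter(df["Microsoft"], df["Facebook"])   # one point per observation
plt.xlabel("Microsoft closing price ($)")
plt.ylabel("Facebook closing price ($)")
plt.title("Scatter plot of Microsoft vs Facebook stock prices")
plt.show()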
Bubble Chart
A bubble chart is like a scatter chart with an additional third column that determines the size of the bubbles. Each bubble represents a data point in the data series.

The following table shows the sales and R&D expenses of Apple Inc. from 2012 to 2016.

Year   Sales ($ Billions)   R&D ($ Billions)
2012   155.97
2013   170.87
2014   183.24
2015   231.28
2016   214.23

We will now create a bubble chart for the above data using Excel.

Step 1: Open Excel and enter the data as shown below.
Step 2: Click Insert on the toolbar; the screen below will appear.
Step 3: Select the data and click Bubble charts as shown below.
Step 4: If you click on the bubble chart option (either 2-D or 3-D), the following screen will appear.
Step 5: The procedure for labeling the bubble chart is the same as that of the line graph discussed under Step 5 to Step 11. Following those steps, the bubble chart produced in Excel is shown below.
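A minimal Python sketch of a bubble chart: the marker size is scaled by the third variable. The R&D figures below are hypothetical placeholders, since that column is not reproduced in the table above.

import matplotlib.pyplot as plt

years = [2012, 2013, 2014, 2015, 2016]
sales = [155.97, 170.87, 183.24, 231.28, 214.23]   # $ billions (table above)
rnd   = [3.4, 4.5, 6.0, 8.1, 10.0]                 # hypothetical placeholders

# Bubble size proportional to the third variable (scaled up for visibility)
plt.scatter(years, sales, s=[v * 40 for v in rnd], alpha=0.5)
plt.xticks(years)
plt.xlabel("Year")
plt.ylabel("Sales ($ billions)")
plt.title("Apple Inc. sales, bubble size = R&D spending")
plt.show()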
Frequency Distribution
A frequency distribution is a useful tool to summarize data in the form of class intervals and frequencies. Mr. Prasoom Dwivedi teaches the Business Statistics paper at the University of Petroleum. The following table gives the final marks obtained by students. Mr. Prasoom wants to develop some charts and graphs to show the marks obtained by students. What are the highest and lowest marks?

Table: Marks Obtained in Statistics


70 87 88 72 72
82 78 72 72 68
65 57 67 86 82
71 78 72 81 84
85 82 71 48 67
67 56 68 90 71
86 69 72 77 52
81 75 66 76 55
In the above table, individual marks are listed. Such unorganized data is called raw data or ungrouped data. No useful information is easily comprehensible even after a careful examination of the data. This raw data can provide better insights if organized into a frequency distribution. A frequency distribution is a table where data are arranged by class along with the class frequency. According to Morris Hamburg, "A frequency distribution or frequency table is simply a table in which the data are grouped into classes and the number of cases which fall in each class are recorded. The numbers in each class are referred to as frequencies."

We will describe the construction of a frequency distribution with the help of an example. To construct a frequency distribution, we should first determine the range of the data. The range of the data can be computed as the difference between the highest and lowest numbers. For the above example, the range is 49 (90 - 41).

Next, determine the number of classes. A useful rule in this context is the 2^K rule, which suggests selecting the smallest number K for the number of classes such that 2^K exceeds the total number of observations (N). In the above example, there are 64 marks, so N = 64. If we take K = 5, which means 5 classes, then 2^5 = 32, which is less than the total number of observations and may not include every observation. If we take K = 6, then 2^6 = 64, which includes every item. So the recommended number of classes is 6.

After deciding the number of classes, we must determine the width of the class interval. In general, it is computed by dividing the range by the number of classes. In this example, the class interval comes to 49/6 = 8.16. Generally, the number is rounded up to the next whole number, which in this case is 9. However, in this case, we will keep 10 as the class width. The frequency distribution must begin at a number equal to or lower than the lowest number and end at a number equal to or greater than the highest number in the data set. Since the lowest number in our data set is 41 and the highest number is 90, the frequency distribution in our case starts with 40 and ends with 90. The following table shows the frequency distribution for our data:
Table: Frequency Distribution

Class Interval   Frequency   Relative Frequency
40-50            2           0.031
50-60            7           0.1093
60-70            15          0.2343
70-80            21          0.3281
80-90            19          0.2968

Relative frequency is calculated by dividing the individual class frequency by the total frequency. When we multiply the relative frequency by 100, it can be interpreted in terms of percentages. Thus, 32% of students got marks between 70 and 80, 29% of students got marks between 80 and 90, and only 3% of students got marks between 40 and 50.
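A minimal Python sketch of the same construction: the 2^K rule picks the number of classes, and pandas' cut function tabulates the class frequencies. Only the marks reproduced in the table above are used, so the counts will differ slightly from the worked example.

import math
import pandas as pd

marks = pd.Series([70, 87, 88, 72, 72, 82, 78, 72, 72, 68,
                   65, 57, 67, 86, 82, 71, 78, 72, 81, 84,
                   85, 82, 71, 48, 67, 67, 56, 68, 90, 71,
                   86, 69, 72, 77, 52, 81, 75, 66, 76, 55])

n = len(marks)
k = math.ceil(math.log2(n))          # smallest K with 2**K >= N (the 2^K rule)
print("classes:", k, "range:", marks.max() - marks.min())

# Class width of 10, classes starting at 40, as in the worked example
bins = range(40, 101, 10)
freq = pd.cut(marks, bins=bins, right=False).value_counts().sort_index()
rel_freq = freq / n
print(pd.DataFrame({"Frequency": freq, "Relative Frequency": rel_freq.round(3)}))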
Frequency Polygon
Frequency polygons are a graphical representation of a frequency table. Like the histogram, the frequency polygon is useful in knowing the shape of the distribution. The procedure for obtaining a frequency polygon is first to find the mid-value of each class interval in a frequency distribution. Next, plot these frequencies against the corresponding mid-points and then connect these points by straight lines. Finally, extend these straight lines to meet the X-axis to form a frequency polygon. The advantage of the frequency polygon over the histogram is that when two or more distributions are compared, the frequency polygon provides a better and clearer picture of the shape of the distributions than the histogram.
Illustration
The table below shows a frequency distribution
of marks obtained by students in business
statistics paper. Construct a frequency polygon
for this data.
Marks Frequency
0 - 10 4
10-20 7
20-30 3
30-40 5
40-50 18
50-60 30
60-70 20
70-80 8
80-90 6
90-100 2

Solution
The first step is to find the mid-point of each class interval in the frequency table, as shown below.

Marks    Mid-Point   Frequency
0-10     5           4
10-20    15          7
20-30    25          3
30-40    35          5
40-50    45          18
50-60    55          30
60-70    65          20
70-80    75          8
80-90    85          6
90-100   95          2

The second step is to plot the frequencies against the corresponding mid-points, as shown below.
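A minimal Python sketch of the frequency polygon: frequencies are plotted at the class mid-points and joined by straight lines, with zero-frequency points added one class width beyond each end so the polygon closes on the X-axis.

import matplotlib.pyplot as plt

midpoints = [5, 15, 25, 35, 45, 55, 65, 75, 85, 95]
freq = [4, 7, 3, 5, 18, 30, 20, 8, 6, 2]

# Extend to the X-axis with zero-frequency points at either end
x = [-5] + midpoints + [105]
y = [0] + freq + [0]

plt.plot(x, y, marker="o")
plt.xlabel("Marks (class mid-points)")
plt.ylabel("Frequency")
plt.title("Frequency polygon")
plt.show()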
Ogive
An ogive is a cumulative frequency polygon, i.e. a type of frequency polygon that depicts cumulative frequencies. An ogive plots the cumulative (relative) frequency on the vertical axis and the class intervals on the horizontal axis from left to right. Consider the following data:

Marks    Frequency   Cumulative Frequency
0-10     4           4
10-20    7           11
20-30    3           14
30-40    5           19
40-50    18          37
50-60    30          67
60-70    20          87
70-80    8           95
80-90    6           101
90-100   2           103
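A minimal Python sketch of the ogive: cumulative frequencies are plotted against the upper class boundaries.

import numpy as np
import matplotlib.pyplot as plt

upper_bounds = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
freq = [4, 7, 3, 5, 18, 30, 20, 8, 6, 2]
cum_freq = np.cumsum(freq)                     # 4, 11, 14, ..., 103

plt.plot([0] + upper_bounds, [0] + list(cum_freq), marker="o")
plt.xlabel("Marks (upper class boundary)")
plt.ylabel("Cumulative frequency")
plt.title("Ogive (less-than type)")
plt.show()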
Histogram
Another way of summarizing data is the histogram. A histogram is a two-dimensional graph where not only the length but also the width of the bar is important. It is used to display the frequency distribution of a data set: the frequency is represented on the vertical axis, while the range of data values is represented on the horizontal axis. Fig. 3.8 shows a histogram in which twenty-eight companies made profits between Rs. 400-500 lakhs, six companies earned profits between Rs. 100-200 lakhs and only one company made a profit between Rs. 900-1000 lakhs.

Fig. 3.8: Histogram
Computer Application
Excel makes histograms using a tool called Data Analysis. It is important to note that classes are called bins in Excel. To construct a histogram in Excel, click Tools on the Excel menu bar. From the Tools drop-down menu select Data Analysis. When you click Data Analysis, a dialog box will appear. Choose Histogram from this dialog box. Select the data for the Input Range. If you would like to decide the lower and upper limits of the class intervals, i.e. the bins, enter the endpoints in the Bin Range. If you would like Excel to establish the bins, leave this blank. If you want labels, then click Labels. For the histogram graph, click Chart Output. After this, click OK.
Using Excel

Listed below are daily closing values of the NSE Nifty. We will explain how to construct a histogram using these values.
4150.85 4117.35
4111.15 4077 4079.3
4076.65 4134.3 4120.3
4260.9 4278.1 4246.2
4293.25 4249.65 4295.8
4198.25 4179.5 4145
4170 4171.45 4147.1
4252.05 4259.4 4285.7
4313.75 4357.55 4359.3
4406.05 4387.15 4446.15
4499.55 4562.1 4566.05
4619.8 4445.2 4440.05

Step 1: To construct a histogram in Excel, first enter the data and click Data on the toolbar as shown below.
Step 2: Click Data Analysis on the drop-down menu. When you click Data Analysis a dialog box will appear, shown below.
Step 3: Select Histogram from the dialog box and click OK. Another dialog box will appear, shown below.
Step 4: Specify the range of the raw data in the Input Range. If you want to specify the class lower and upper limits, put these limits in the Bin Range. If you want Excel to decide the bins (class size), leave this blank. If the data has a label, then select Labels. If you want the histogram graph, select Chart Output and click OK.
Step 5: Now right-click keeping the cursor on any of the bars as shown below.
Step 6: When you click Format Data Series, you will get the following dialogue box.
Step 7: Type 0 in Gap Width and click OK. The resulting figure is the histogram shown below.
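A minimal Python sketch of a histogram for the Nifty closing values listed above (only the values reproduced in the text are used); the bins argument plays the role of Excel's Bin Range.

import matplotlib.pyplot as plt

nifty = [4150.85, 4117.35, 4111.15, 4077.00, 4079.30, 4076.65, 4134.30,
         4120.30, 4260.90, 4278.10, 4246.20, 4293.25, 4249.65, 4295.80,
         4198.25, 4179.50, 4145.00, 4170.00, 4171.45, 4147.10, 4252.05,
         4259.40, 4285.70, 4313.75, 4357.55, 4359.30, 4406.05, 4387.15,
         4446.15, 4499.55, 4562.10, 4566.05, 4619.80, 4445.20, 4440.05]

# Class width of 100 points from 4000 to 4700 (the choice of bins is ours)
plt.hist(nifty, bins=range(4000, 4701, 100), edgecolor="black")
plt.xlabel("NSE Nifty closing value")
plt.ylabel("Frequency")
plt.title("Histogram of daily closing values")
plt.show()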
Exploratory Data Analysis: Stem and
Leaf
Exploratory data analysis comprises simple calculations and graphs that help to summarize data in an easy manner. One such technique is the stem-and-leaf display, which shows not only the rank order but also the shape of the data.
Consider the marks of 40 students in
business statistics paper:
66 32 48 42 51 54 92
50 71 72 80 12 15 65
82 78 84 86 93 96 77
82 91 28 65
In order to develop a stem-and-leaf display, first arrange the leading digits of each observation to the left of a vertical line. Second, put the last digit of each observation to the right of the vertical line, maintaining the order of the observations.
Next, sort the digits on each line into rank order as shown below.

The numbers to the left of the vertical line form the stem and each digit to the right of the vertical line is called a leaf. So in the first row, 1 is the stem value and 2, 5, 7 and 9 are the leaves. Thus, it shows that four data values have a first digit of 1. The leaves show that these data values are 12, 15, 17 and 19. Likewise, in the 9th row, 9 is the stem value and 1, 2, 3 and 6 are the leaf values.
The stem-and-leaf display also depicts the shape of the distribution. When you use a rectangle to contain the leaves of each stem, we obtain the following exhibit:

Thus, the stem-and-leaf display is very similar to a histogram and gives information regarding the shape of the distribution. A stem-and-leaf display for data with more digits is possible by assuming a leaf unit of 10, 100, etc.
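A minimal Python sketch that builds a stem-and-leaf display for the marks listed above (only the values reproduced in the text are used): each mark is split into a tens-digit stem and a units-digit leaf.

from collections import defaultdict

marks = [66, 32, 48, 42, 51, 54, 92, 50, 71, 72, 80, 12, 15,
         65, 82, 78, 84, 86, 93, 96, 77, 82, 91, 28, 65]

stems = defaultdict(list)
for m in sorted(marks):
    stems[m // 10].append(m % 10)    # stem = tens digit, leaf = units digit

for stem in sorted(stems):
    leaves = " ".join(str(leaf) for leaf in sorted(stems[stem]))
    print(f"{stem} | {leaves}")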
Cross Tabulation
A cross-tabulation analysis is one of the most widely used analytical tools in the market research industry. It is generally a two-dimensional table in which the top and left margin labels define the classes for the two variables. It shows the relationship between two categorical variables and is also called a contingency table. In simple words, a contingency table shows observed frequencies classified into columns and rows in order to find the relationship between two or more categorical variables. The simplest contingency table is a 2x2 table, shown below:

              Male   Female   Total
Smoke         60     14       74
Don't Smoke   24     80       104
Total         84     94       178

The right-hand column and the bottom-row figures are called marginal totals. After arranging the data into columns and rows in a contingency table, the next thing is to find the relationship between smoking and gender.
Example
T-Series is a well known music company in India. The research and development department of this company wants to know whether the type of music preferred and the age of the listener are independent. A random sample of 620 music listeners was taken; the results of the survey are summarized in the following contingency table.

Age group     Hindi Film Music   Hindi Bhajan
10-25         190                20
26-50         80                 60
51 and more   46                 98
Total         316                178

In the age category of 10-25, there were 242 respondents out of 620. Of these 242, 190 respondents preferred Hindi film music; this shows that 78.51 per cent of young people preferred Hindi film music. About 8 per cent of young people preferred Hindi Bhajan. Only 3 per cent of young people listened to Hindi Gazal. However, 9.91 per cent of young people preferred western music. Likewise, we can analyze the other age categories.

Exercises
1. The following table shows the US
Field production of crude oil
(Thousand Barrels) from 2000 to
2015. Construct a line chart using
excel.
Year      U.S. Field Production of Crude Oil (Thousand Barrels)
2000 2130
2001 2117
2002 2096
2003 2061
2004 1991
2005 1892
2006 1856
2007 1852
2008 1829
2009 1953
2010 1998
2011 2060
2012 2374
2013 2725
2014 3198
2015 3436

2. The following table shows a list of the largest private employers in the United States in 2015. Construct a bar chart using excel.
Companies                                No. of Employees
Wal-Mart Stores 1,5
McDonald 420
Kroger 400
International Business Machines (IBM)    377
The Home Depot 371
United Parcel Service 362
Target 347
Amazon.com 341
Berkshire Hathaway 316
Yum! Brands 303

3. Consider the sales and cost of goods sold data of Amazon.com given below in the table. Construct a multiple bar chart using excel.

Year    Sales (Billion $)    Cost of Goods Sold (Billion $)
2012 61.09 45.97
2013 74.45 54.18
2014 88.99 62.75
2015 107.01 71.65
2016 135.99 88.27

4. The following table shows the various components of total expenses (Rs. cr) of Reliance Industries in 2016. Construct a subdivided and a percentage bar chart using excel.

5. The market capitalizations of the top IT companies in the US for the year 2015-16 are given below in the table. Construct a pie chart and write a brief report summarizing the data.
6. Consider Amazon's sales growth (%) and cost of goods sold growth (%) for the year 2016. Construct a doughnut chart using excel.
Items                              Percentage (%)
1. Sales Growth                         27.08
2. Cost of Goods Sold Growth            23.19

7. The table below shows the open, high, low and closing prices of Reliance from 1st March, 2017 to 15th March, 2017. Draw a stock chart using excel.
Date          Open      High      Low
15-03-2017    1289.40   1316.00   1283.25
14-03-2017    1315.95   1319.10   1285.25
10-03-2017    1293.95   1296.50   1261.50
09-03-2017    1289.80   1298.00   1281.70
08-03-2017    1307.05   1310.00   1286.00
07-03-2017    1316.15   1326.75   1296.60
06-03-2017    1268.00   1309.90   1265.45
03-03-2017    1237.00   1287.80   1237.00
02-03-2017    1236.00   1254.00   1225.65
01-03-2017    1242.00   1244.90   1230.25

8. Consider the sales and EBITDA data of Amazon from 2012 to 2016. Construct a scatter plot using excel.
Year    Sales (Billion $)    EBITDA (Billion $)
2012 61.09 2.67
2013 74.45 3.66
2014 88.99 4.5
2015 107.01 8.05
2016 135.99 11.84

9. Consider the following data:
70 81 5 24 67 62
48 56 18 40 41 6
51 46 62 21 26
a) Construct a frequency
distribution.
b) Draw a relative frequency
distribution.
c) Comment on the results

10. Construct a histogram for the following data in Excel.
505 78 154 280 250
85 236 64 518 51
705 129 360 450 38
978 713 512 486 325
11. The following data represent the marks of 70 students in the Business Analytics paper. Construct a stem-and-leaf display.
68 74 80 65
71 62 37 60
41 45 58 89
58 67 64 68
83 36 62 61
51 57 61 51
Chapter 4
Descriptive Analytics by
Numerical Methods
In the earlier chapters, we discussed graphical methods of displaying information. These are very informative and useful in enriching an essay or a report. However, in many cases, providing a precise numerical figure is more desirable. In this chapter, we will discuss frequently used descriptive statistics for summarizing data, namely measures of central tendency and measures of dispersion.
Measures of Central Tendency
Measures of central tendency tell where
the middle value of the data set lies.
The most common measures of central
tendency are mean, median and the
mode.

Mean
The arithmetic mean is what most
laymen call an average. Mean is
computed by adding all the observations
in a data set and dividing the resulting
sum by the total number of observations.
The mathematical formula for computing the mean is:

X-bar = ΣX / N        (3.1)

where N is the sample size and X-bar represents the mean.

The mean is used to represent the entire data set by a single number. Though it is a representative figure in most cases, it can be misleading when extreme values are present in the data set.

Solved Example
Compute arithmetic mean of the
following marks in Statistics obtained by
10 students in a test:
Roll No:  1   2   3   4   5   6   7   8   9   10
Marks:   56  78  65  44  90  88  75  52  51  68
Solution:
Roll. No. Marks
1 56
2 78
3 65
4 44
5 90
6 88
7 75
8 52
9 51
10 68
N = 10                ΣX = 667

Thus, the mean mark is 66.7.

Using Excel
In Microsoft Excel, AVERAGE function
can be used to calculate arithmetic mean.
In particular, we calculate the mean by
entering the formula:

= AVERAGE (A2:A11)
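As a quick cross-check, the same value can be obtained from the definition of the mean itself, assuming the ten marks are entered in cells A2:A11 as above:

= SUM (A2:A11) / COUNT (A2:A11)

Both formulas should return 66.7 for this data.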
Median
The median is another frequently used
measure of central tendency which is the
middle value in data set when it is
arranged in an ascending order. When
the number of observations is odd, the
middle value is the median. In case of
even number of observations, there is no
single value. So in such case the median
is computed as the average of two
middle observations.

Solved Example 1
Compute the median of the following
sales data of ABC Company.
304 414 520 315 480 600 665

Solution
First arrange the data in ascending order
as shown below:

304 315 414 480 520 600 665

Second, since the number of observations is odd, the middle value is the median. Thus, the median sales of ABC Company is 480.

Solved Example 2
Compute the median for the following
data
68 76 84 52 40 94 66 54

Solution
First arrange the data in ascending order
as shown below:
40 52 54 66 68 76 84 94

Since the number of observations is even, we have to find the two middle values in the arranged data set. Here, 66 and 68 are the two middle values. The median is the average of these two middle values.

Median = (66 + 68) / 2 = 67
Using Excel
In Microsoft Excel, MEDIAN function
can be used to calculate the median. In
particular, we calculate the median by
entering the formula:
= MEDIAN (A2:A9)
Mode
Mode is defined as the value that most
often occurs with highest frequency in a
data set. For example, consider the
sample of marks of 5 students in a class
given below:
45 62 58 62 76

In the above example, the only figure that occurs twice is 62. As this figure occurs with a frequency of 2, it has the highest frequency. Thus, the modal mark of the class is 62. Sometimes the highest frequency occurs at two or more values; in this case, more than one mode exists. When the data have exactly two modes, the data are called bimodal. If there are more than two modes, the data set is called multimodal.

Using Excel
In Microsoft Excel, MODE function can
be used to calculate the mode. In
particular, we calculate the mode by
entering the formula:

= MODE (A2:A6)
Quartile
The quartiles divide a series or a set of observations into 4 equal parts. Since the median is the second quartile, there are really only two other quartiles to compute. The lower quartile (Q1) divides a series such that one-fourth of the total frequency lies below Q1 and three-fourths lie above it. The upper quartile (Q3) divides a series such that three-fourths of the total frequency lie below Q3 and one-fourth lies above it.
Using Excel
I collected stock prices data of Apple
Inc., from 1st February 2017 to 3rd
March, 2017 to illustrate how to
compute 1st and 3rd quartiles in excel. In
Microsoft Excel, QUARTILE function
can be used to calculate the 1st, 2nd, and
3rd quartiles. In particular, we calculate
the lower quartile (Q1) by entering the
formula:
= QUARTILE (B2:B23, 1)
Thus, the 1st quartile is 132.06.

We calculate the upper quartile (Q3) by entering the formula:
= QUARTILE (B2:B23, 3)

Thus, the 3rd quartile stock price of Apple Inc. is 136.87.

Percentile
Every year in India, more than 200,000 MBA aspirants appear in the Common Admission Test (CAT). You may often have heard students saying, "I got the 95th percentile in CAT" or "My CAT score is the 99th percentile." What does it mean? If my score is at the 95th percentile, it means that only 5 per cent of the candidates scored more marks than I did.

Like the median and quartiles, the percentile is also a positional measure. While quartiles divide a data set into four equal parts, percentiles divide a data set into 100 equal parts.

Using Excel

In Microsoft Excel, the PERCENTILE function can be used to calculate the
percentiles. In particular, we calculate
the 5th percentile by entering the
formula:

= PERCENTILE (B2:B23, 0.05)


Thus, the 5th percentile of the Apple stock price is 128.76, meaning that 95 per cent of the items in the sample are above 128.76.

We can also calculate the 95th percentile by entering the formula:

= PERCENTILE (B2:B23, 0.95)


Thus, the 95th percentile stock price of Apple Inc. is 139.73. This implies that only 5 per cent of the prices are above 139.73.
Measures of Dispersion
Measures of dispersion give the extent of deviation of the data from a central value such as the mean. Measures of central tendency such as the mean or median only provide the location of the middle value; they do not tell anything about the spread of the data. Dispersion, also known as variability, measures the extent to which items deviate from some central value. The significance of dispersion lies in the fact that a small value for a measure of dispersion shows that the data are clustered closely, i.e. the mean is representative of the data and therefore reliable. However, a large value for a measure of dispersion shows that the data are scattered; in this case the mean is not a representative figure and therefore it is not reliable. The various measures of dispersion are the range, the inter-quartile range, the mean deviation and the standard deviation.

Range
The range is the simplest measure of
dispersion. The range is defined as the
difference between the highest value and
the lowest value.

Range = Highest Value - Lowest Value        (3.2)
Solved Example
A sample of 6 MBA graduates from IIM
(Indore) revealed their starting package
(Rs. Lakhs). Compute range.

14 21 12 36 25 8

Solution
The range is given by the following:
Range = Highest Value - Lowest Value = 36 - 8 = 28
Thus, the range is Rs. 28 lakhs.

The range is a good measure of dispersion when the data set shows a stable pattern. However, its biggest limitation is that it is based on only two observations, that is, the highest and lowest values of the data set.

Inter-quartile Range
The range as a measure of dispersion is based only on the maximum and minimum values in the data set and is therefore sensitive to extreme values. To avoid this problem, one can resort to the inter-quartile range.
Inter -quartile range is computed on the
middle 50% of the observations after
elimination of highest and lowest 25%
observations in a data set which is
arranged in ascending order. Unlike
range, inter-quartile range is not
sensitive to extreme values.

Solved Example
The following data shows quarterly
operating profit (Rs. Cr) of Reliance
industries from September 2008 to June
2011. Calculate inter-quartile range.
6474 5363 5437 5921 7217 7844 9136 9342 9396 9545 9843 9926

Solution
First arrange the data in ascending order

5363 5437 5921 6474 7217 7844 9136 9342 9396 9545 9843 9926


Drop the first three figures (5363, 5437 and 5921) and the last three figures (9545, 9843 and 9926) from this data set. The remaining observations constitute the middle 50% of the observations. These observations are 6474, 7217, 7844, 9136, 9342 and 9396. If we calculate the range of these remaining observations, we get the inter-quartile range.

Inter-quartile Range = 9396 - 6474 = 2922
You can notice that the range for this problem is 9926 - 5363 = 4563. The inter-quartile range of 2922 is much smaller than the range, showing that it is less sensitive to extreme values present in the data set.

Using Excel
I collected the stock price data of Apple Inc. from 1st February 2017 to 3rd March 2017 to illustrate the computation of the 1st and 3rd quartiles and the inter-quartile range in excel.
First calculate the lower quartile (Q1) by
entering the formula:
= QUARTILE (B2:B23, 1)
So the 1st quartile is 132.06.

Next calculate the upper quartile (Q3) by entering the formula:
= QUARTILE (B2:B23, 3)
The 3rd quartile stock price of Apple
Inc., is 136.87.
Inter-Quartile Range = Q3-Q1
= 136.87-132.06
= 4.81
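Alternatively, the inter-quartile range can be obtained in a single cell by combining the two quartile formulas, assuming the same price data in B2:B23:

= QUARTILE (B2:B23, 3) - QUARTILE (B2:B23, 1)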

Mean Absolute Deviation
Mean absolute deviation is another measure of dispersion. It is defined as follows:

MAD = Σ|X - X-bar| / N

where X-bar is the mean of the distribution.

Solved Example
The following data shows annual gross
profit margin (%) of Indian Oil
Corporation Ltd (IOCL) from March
2007 to March 2011. Calculate Mean
Absolute Deviation (MAD).
Year:      2007   2008   2009   2010   2011
GPM (%):   5.15   5.13   2.33   6.36   4.11
Solution

Year    GPM (%)    (X - X-bar)    |X - X-bar|
2007      5.15          0.54           0.54
2008      5.13          0.52           0.52
2009      2.33         -2.28           2.28
2010      6.36          1.75           1.75
2011      4.11         -0.50           0.50
        Mean = 4.616                MAD = 1.118

Thus, the mean absolute deviation of the gross profit margin of IOCL is about 1.12%.
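In Excel, the same figure can be obtained directly with the AVEDEV function, which returns the average of the absolute deviations from the mean. Assuming the five GPM values are entered in cells B2:B6 (an illustrative range), the formula would be:

= AVEDEV (B2:B6)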

Variance
The variance is the most widely used measure of variability. It is basically the average of the squared deviations from the arithmetic mean. The formulas for the population and sample variance are as follows:

Population variance (σ²) = Σ(X - μ)² / N, where μ is the population mean
Sample variance (s²) = Σ(X - X-bar)² / (n - 1), where X-bar is the sample mean

It is to be kept in mind that we generally work with a sample. Also, the variances of the population and the sample are practically the same when the number of observations is large.
Suppose in a class there are 20 students.
Their marks in business statistics paper
are as follows:
Students Marks
Mike

Tony
Ryan
Bob
Joe
Smith
Robin
Kate
Silsa
Tisca
Tom
Jim
David
Adam
Singer
Rocky
Mark
John
Hillan

Now it is important to remember that the data of the entire class, i.e. the population, is given in the above table, so to compute the variance we should use the population variance formula.

In Microsoft Excel, to obtain the variance of a population, enter the formula:

= VARP (B2:B20)
Suppose we take a representative
sample from the class which is given as
follows:

Students Marks
Joe
Kate
Tom
David
Singer
Rocky
Mark
Hillan

Now it is important to remember that the data of a sample from the population is given in the above table. So to compute the variance we should use the sample variance formula.

In Microsoft Excel, to obtain the variance of a sample, enter the formula:

= VAR (B2:B9)
The variance is expressed in squared units instead of the original units, which creates a problem in interpretation. In fact, this is the reason the standard deviation is preferred to the variance.
Standard Deviation
While the mean indicates a representative value for the data, the standard deviation shows the dispersion or variability across data points. The other measures of variation already discussed above are the range, the inter-quartile range and the mean deviation, but the standard deviation is considered to be the most efficient measure of dispersion.
The standard deviation was introduced by Karl Pearson in 1893. It is a measure of the variation present in the data. If all the data points in a sample are near to each other, the standard deviation tends to be small. However, if the data points are greatly dispersed, the standard deviation will tend to be large. It is denoted by σ (sigma). The mathematical formula for computing the standard deviation is:

σ = sqrt[ Σ(X - μ)² / N ] for a population, and s = sqrt[ Σ(X - X-bar)² / (n - 1) ] for a sample.

The standard deviation has little meaning in an absolute sense. However, when the standard deviations of two distributions are compared, the distribution with the smaller standard deviation shows less variability. The meaning will become clear from the following example. Let's say there are two projects A and B; the average return and time horizon of the two projects are the same. However, project A has a standard deviation of 2.8 and project B has a standard deviation of 3.6. Which project will you prefer? No doubt a rational investor will choose project A because it is less risky.

Solved Example
Compute standard deviation for the
following data:
52 65 88 72 81 112 66 105 90 102 58

Solution:
X       (X - X-bar)    (X - X-bar)²
52          -29             841
65          -16             256
88            7              49
72           -9              81
81            0               0
112          31             961
66          -15             225
105          24             576
90            9              81
102          21             441
58          -23             529
ΣX = 891               Σ(X - X-bar)² = 4040

s = sqrt[ 4040 / (11 - 1) ] = sqrt(404) = 20.09
Thus, the standard deviation is 20.09.

Using Excel
In Microsoft Excel, to obtain standard
deviation of sample, enter the formula:

= STDEV (A1:A11)
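If the data set represents an entire population rather than a sample, the population standard deviation can be obtained instead with:

= STDEVP (A1:A11)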
Coefficient of Variation
The coefficient of variation measures
dispersion in relation to the mean. This
is a relative measure of dispersion and
is used to compare the relative variation
in one data set with the relative variation
in another data set. For example, suppose you want to compare the relative variation of marks for two classes of students, Class 1 and Class 2. This relative measure of dispersion, the coefficient of variation, can serve the purpose.
The coefficient of variation is given by the following expression:

C.V. = (S / X-bar) × 100

where S is the standard deviation and X-bar is the mean.
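Excel has no built-in coefficient-of-variation function, but it can be computed directly from the standard deviation and the mean. As a sketch, assuming the prices of one stock are in cells B2:B31:

= STDEV (B2:B31) / AVERAGE (B2:B31) * 100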
Solved Example
The following table gives closing prices
of Infosys Technology Ltd and Tata
Consultancy Services (TCS) from
29.06.2011 to 9.08.2011 in descending
order i.e. 9th Aug to 29th June.
TCS Infosys
964.8 2374.55
1005.35 2470.5
1057.95 2591.2
1095.85 2709.15
1110.35 2732.6
1130.65 Mean= 1134.473 2750.95 Mea
1135.25 S = 50.59333 2815.1 S=
1137 2775.9
1129.55 C.V= 4.459351 2751.1 C.V=
1147.15 2796.65
1144.65 2801.8
1139.9 2807.75
1133.4 2828.25
1122.55 2768.1
1132.45 2752.45
1140.05 2750.5
1125.05 2713.4
1146.05 2731.35
1123.7 2740.35
1149.05 2777.3
1145.35 2791.55
1156.65 2921.15
1171.65 2976.55
1195.9 2995.7
1182.8 2953.7
1179.45 2956.45
1185.7 2938.95
1191.9 2934.15
1184.2 2910.45
1169.85 2881.75

Thus, the coefficients of variation of TCS and Infosys show that the variation in TCS is lower than the variation in Infosys. In other words, the fluctuation in the stock price of Infosys is high compared to the fluctuation in the stock price of TCS.

Exercises
1. The following table shows closing
stock prices of Google and
Microsoft from 15th September,
2015 to 16th October, 2015.
Date Google Stock Price
15-09-2015 635.14
16-09-2015 635.98
17-09-2015 642.90
18-09-2015 629.25
21-09-2015 635.44
22-09-2015 622.69
23-09-2015 622.36
24-09-2015 625.80
25-09-2015 611.97
28-09-2015 594.89
29-09-2015 594.97
30-09-2015 608.42
01-10-2015 611.29
02-10-2015 626.91
05-10-2015 641.47
06-10-2015 645.44
07-10-2015 642.36
08-10-2015 639.16
09-10-2015 643.61
12-10-2015 646.67
13-10-2015 652.30
14-10-2015 651.16
15-10-2015 661.74
16-10-2015 662.20

Compute using Excel's functions:
a) Mean
b) Median
c) Mode
d) Q1 and Q3
e) 5th Percentile
f) 95th percentile
g) Variance
h) Standard deviation
i) Coefficient of Variation

2. The following table shows data on the return on equity (ROE) and price-earnings (P/E) ratios of major players in the IT sector of India.

Companies            ROE      P/E
Patni               22.25     5.6
Infosys             26.29    20.2
TCS                 38.8     26.2
Wipro               20.41    17.1
Tech Mahindra       20.58    11.9
HCL Technologies    21.41    23.2
Mphasis             34.27     7.23
a) Compute mean, median and
mode of ROE and P/E.
b) Compute range, mean absolute
deviation and standard deviation of
ROE and P/E.

3. The following table shows the net sales of HCL Technologies from 2007 to 2015.
Year Net Sales
(Rs. Cr)
2007 3768.62
2008 4615.39
2009 4675.09
2010 5078.76
2011 6794.48
2012 8907.22
2013 12517.82
2014 16497.37
2015 17153.44

Conduct a descriptive statistics analysis using the Microsoft Excel Data Analysis tool.
Appendix to Chapter 4
How to Install Microsoft Excel Data
Analysis
Step 1: Open Microsoft excel. The following
spread sheet will pop up.
Step 2: Click the Office Button (or the File menu) and then choose Excel Options. The following window appears.
Step 3: When you click Excel Options, the following window pops up. Next click Add-Ins.
Step 4: When you click Add-Ins the
following window will appear.
Step 5: When you click Go then
another dialogue box pops up. Select
first two options Analysis ToolPak and
Analysis ToolPak-VBA from Add-Ins
available.
Step 6: After selecting the above options from Add-Ins, click OK. MS Excel will take a few minutes to install Data Analysis.
Step 7: After Data Analysis gets installed, when you click Data on the menu bar, the Data Analysis option will appear on the ribbon as shown below.
Step 8: Now when you click Data Analysis, another dialogue box pops up in which a list of statistical tools used for analysis is given.
Chapter 5
Measures of Shapes
Introduction
Most statistical analysis is based on the assumption of a normal distribution. Measures of shape tell us whether a data set is normally distributed or not.

Symmetrical Distribution
The shape of a distribution has a very
important role in statistical analysis. In
fact most of the statistical analysis is
based on the assumption of normal or
symmetrical distribution. Rarely
binomial or Poisson or other types of
distribution is used for statistical
analysis. It is important to note here that
normal distribution or symmetrical
distribution is a requirement in most
statistical analysis and to begin
statistical analysis, we have to check
whether data are normally distributed or
not.

Asymmetrical Distribution
Asymmetrical distribution means the
distribution is not normal. The non-
symmetrical distribution or non-normal
distribution is called skewed
distribution. In Figure 1 panel b shows
the shape of a normal distribution and
panel a and panel c show skewed
distributions.
Figure 1

So skewness refers to the lack of symmetry in the shape of a frequency distribution. When a distribution is not symmetrical, it is called a skewed distribution. It is important to note here that the skewness of any distribution is defined in relation to the normal distribution. Thus, skewness tells us how the observations in a particular distribution are spread out compared to a normal distribution.
In figure 1 panel b shows the shape of a
normal distribution. In a normal
distribution or bell-shaped curve, the
arithmetic mean, median and mode are
lying at the centre of the curve and they
are equal. In a normal curve, spread of
the items on the both side of the centre
point are same.
Panel a in figure 1 shows the shape of a
negatively skewed distribution. In this
case, the skewness will be negative. In
this distribution, the frequencies are spread out over a greater range of low-end values on the left side of the distribution from the centre point.
In a negatively skewed distribution, the
value of mode is maximum and the value
of mean is minimum. Median lies
between the mode and the mean.

Panel c in figure 1 shows the shape of a positively skewed distribution. In this case, the skewness will be positive. In this distribution, the frequencies are spread out over a greater range of high-end values on the right side of the distribution from the centre point. In a positively skewed distribution, the value of the mode is minimum and the value of the mean is maximum. The median lies between the mode and the mean.

Measure of Skewness
Skewness is defined as the lack of symmetry in a frequency distribution. There are two types of measures of skewness:
Absolute measure of skewness
Relative measure of skewness
Absolute Measure of Skewness
It is measured by taking the difference between the mean and the mode:
Absolute Skewness = Mean - Mode
If the value of mean is greater than the
mode the skewness will be positive.
However, if the mode is greater than the
mean, the skewness will be negative.
Here it is important to ask why the
skewness is defined as the difference
between the mean and the mode? In a
symmetrical or normal distribution, the
mean, median and mode all are equal.
However, in a skewed distribution, the
mean moves away from the mode, which
is nothing but skewness. Thus, the
distance between the mean and the mode
could be used to measure skewness. The
greater the distance, whether positive or
negative, the higher is the skewness.
Illustration
The following table shows stock prices
of State bank of India (SBI) and ICICI
Bank from 01.06.2016 to 08.07.2016.
Compute absolute skewness for the
above data.
Date SBI
01-06-2016
02-06-2016 20
03-06-2016
06-06-2016 19
07-06-2016
08-06-2016
09-06-2016
10-06-2016
13-06-2016
14-06-2016 20
15-06-2016 21
16-06-2016
17-06-2016
20-06-2016
21-06-2016
22-06-2016
23-06-2016
24-06-2016 21
27-06-2016
28-06-2016
29-06-2016
30-06-2016 21
01-07-2016 21
04-07-2016 22
05-07-2016
07-07-2016
08-07-2016

Solution
We computed the mean, median and mode for the SBI and ICICI Bank stock prices, which are given below:

Company       Mean      Median    Mode*
SBI           212.21    213.9     217.28
ICICI Bank    242.65    241.25    245.45

Note: The mode is computed by the formula: Mode = 3 Median - 2 Mean

The absolute skewness = Mean - Mode

Thus, the absolute skewness for SBI is: 212.21 - 217.28 = -5.07
Similarly, the absolute skewness for ICICI Bank is: 242.65 - 245.45 = -2.8
Thus, the stock prices of SBI and ICICI Bank have a negative skewness of 5.07 rupees and 2.8 rupees respectively.
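If the raw prices were entered in a worksheet, the same quantities could be obtained with Excel functions; assuming, for illustration, that the SBI prices are in cells B2:B28:

= AVERAGE (B2:B28)
= MEDIAN (B2:B28)
= 3*MEDIAN (B2:B28) - 2*AVERAGE (B2:B28)

The last formula reproduces the mode as estimated from the empirical relationship noted above, and subtracting it from the mean gives the absolute skewness.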
Relative Measure of Skewness
If the absolute skewness is expressed in relation to some measure of dispersion, such as the standard deviation of the respective distribution, the resulting measure is relative in nature and can be used for direct comparison.

The Karl Pearson's Coefficient of Skewness

It is based on the difference between the arithmetic mean and the mode divided by the standard deviation. Symbolically:

Coefficient of Skewness = (Mean - Mode) / Standard Deviation

When a distribution is symmetrical or
normal, the values of mean, median and
mode are equal and coincide and,
therefore, the coefficient of skewness
will be zero.
However, if the distribution is positively skewed, the coefficient of skewness will have a positive value, and the degree of skewness is given by its numerical value. Likewise, if the distribution is negatively skewed, the coefficient of skewness will have a negative value.

Using Excel
In Microsoft Excel, to obtain skewness,
enter the formula:
= Skew (B2:B28)
It is important to note that the above Excel formula gives a relative, moment-based measure of skewness.

Concept of Kurtosis
Kurtosis refers to the degree of flatness
or peakedness of a frequency
distribution. It is always measured in
relation to the peakedness of normal
curve. It tells us the extent to which a
distribution is more peaked or flat than
the normal curve. There are three
possibilities:
1. The frequency distribution exactly coincides with the normal curve. Such a curve is itself called mesokurtic. Figure 2 shows all three possibilities.
Figure 2

2. If the distribution is more peaked than the normal curve, it is called leptokurtic. In a leptokurtic distribution, items are more closely clustered around the mean.
3. If the distribution is flatter than the normal curve, it is called platykurtic. In a platykurtic distribution, the observations are more dispersed from the mean than in the normal curve.

Concept of Moments
The deviation of any item in a distribution from its mean is given by the following expression:

x = X - X-bar

The arithmetic means of the various powers of these deviations are called the moments of the distribution. For example:
1. If we take the mean of the first power of the deviations of items from the mean, we get the first moment about the mean. It is denoted by μ1. Symbolically, μ1 = Σx / N.
2. Likewise, if we take the mean of the second power of the deviations of items from the mean, we get the second moment about the mean. It is denoted by μ2. Symbolically, μ2 = Σx² / N.
3. Similarly, if we take the mean of the third power of the deviations of items from the mean, we get the third moment about the mean. It is denoted by μ3. Symbolically, μ3 = Σx³ / N.
4. And if we take the mean of the fourth power of the deviations of items from the mean, we get the fourth moment about the mean. It is denoted by μ4. Symbolically, μ4 = Σx⁴ / N.
The moments can also be extended to higher powers, but in practice the first four moments suffice.

Importance of Moments
The concept of moment is very important
in statistical work. Moments can help to
measure the central tendency of a set of
items, their dispersion, their asymmetry
and their peakedness. The computation
of first four moments about the mean
helps to identify the various
characteristics of a frequency
distribution. This is, in fact, the first step
in the analysis of a frequency
distribution. The following table summarizes how moments help in analyzing a distribution.

Moment                                         What it measures
1. First moment about the origin               Mean
2. Second moment about the arithmetic mean     Variance
3. Third moment about the arithmetic mean      Skewness
4. Fourth moment about the arithmetic mean     Kurtosis

Two important constants of a distribution, computed from μ2, μ3 and μ4, are:

β1 = μ3² / μ2³   and   β2 = μ4 / μ2²

β1 measures skewness and β2 measures kurtosis. In a symmetrical distribution, all the odd moments about the mean, i.e. μ1, μ3, etc., are always zero.
Illustration 1
Skewness measures the asymmetry in the dispersion of data. In other words, it indicates the degree of asymmetry. In a positively skewed distribution, the mean lies to the right of the peak of the distribution, as it is pulled up by a few very high observations. Similarly, in a negatively skewed distribution, the mean lies to the left of the peak. The moment coefficient of skewness is given by the following formula:

Moment coefficient of skewness = μ3 / (μ2)^(3/2)

If the value of skewness is zero, the distribution is symmetric. When it is greater than 0, the distribution is positively skewed (skewed to the right), and when it is less than 0, it is negatively skewed (skewed to the left).

The moment coefficient of kurtosis is given as follows:

Moment coefficient of kurtosis = μ4 / μ2²

When the value of the coefficient of kurtosis is 3, the distribution is normal. When it is different from 3, the distribution is not normal.

Solved Example
Find the kurtosis for the following data:

57 60 62 65 68 72 78

Using Excel
In Microsoft Excel, to obtain the kurtosis, enter the formula:

= KURT (B2:B8)
Chebyshev Theorem
For a symmetrical or normal distribution, about 68% of the items fall between -1 and +1 standard deviation from the arithmetic mean, about 95% of the observations fall between -2 and +2 standard deviations, and about 99.7% of the items fall between -3 and +3 standard deviations. However, Chebyshev's theorem allows us to use this idea for any distribution, irrespective of the shape of the distribution. The theorem states that, given a group of N numbers, at least the proportion 1 - 1/K² of the N observations will lie within K standard deviations of the mean.
Chebyshev Proportions

K Values    Range              Minimum proportion of items
1           Mean ± 1 s.d.      0
2           Mean ± 2 s.d.      75%
3           Mean ± 3 s.d.      88.89%
4           Mean ± 4 s.d.      93.75%
5           Mean ± 5 s.d.      96%
...         ...                ...
K           Mean ± K s.d.      1 - 1/K²
Empirical Rule
In case of a normal distribution, the
following relationships hold good:

Approximately 68% of the area under the curve lies within 1 standard deviation of the mean.
Approximately 95% of the area under the curve lies within 2 standard deviations of the mean.
Approximately 99.7% of the area under the curve lies within 3 standard deviations of the mean.
This is known as the empirical rule or
the 68-95-99.7 rule. It is clear that in a
normal distribution most outcomes will
lie within 3 standard deviations from the
mean.
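These percentages can also be verified in Excel with the cumulative standard normal function. For example, the proportion of the area within one standard deviation of the mean is:

= NORMSDIST(1) - NORMSDIST(-1)

which returns approximately 0.6827, in line with the 68% figure quoted above.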

Solved Example
I collected the daily stock prices of Apple Inc. from 1st February 2016 to 3rd March 2017 to illustrate the empirical rule. About 68 per cent of the observed stock prices of Apple Inc. fall between 97.37 and 119.65. The other ranges are obtained in a similar manner, as shown below:
z- Score
Standard normal distribution is a special
normal probability distribution with a
mean of zero and a standard deviation of
one. A normal variable can be
transformed into a standard normal variable by the following formula:

z = (X - μ) / σ

where X is an observation from the original normal distribution, μ is the mean and σ is the standard deviation of the original normal distribution. The standard normal distribution is also called the Z distribution or Z score. A Z score tells us the number of standard deviations a particular observation is above or below the mean. A Z score is a unit-free number, which helps in computing probabilities; because variables are expressed in different units, their values cannot be compared directly, so it is necessary to convert them into Z scores first.
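In Excel, this conversion can be done with the STANDARDIZE function. For instance, using the mean of 78 and standard deviation of 15 from the marks example later in this chapter:

= STANDARDIZE (90, 78, 15)

returns 0.8, the Z score of a mark of 90.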

To find probabilities, one can use the following table. In the table, values of z are given in the left-hand column, and the second decimal place of z is given in the top row. For example, what is the probability corresponding to a z value of 1.75? For a z value of 1.75, we look for 1.7 in the left-hand column and 0.05 in the top row, and then select the value where that column and row intersect. The value of 0.4599 is the area under the curve between 0 and 1.75. In other words, the probability that the random variable lies between 0 and 1.75 is 45.99 percent. It is important to note that the table gives the area under the curve between the mean and any positive value of z.

Similarly, suppose the value of z is 0.89; then the corresponding probability can be found by the same procedure. The area under the curve between 0 and 0.89 is 0.3133. Thus, the probability of the random variable lying between 0 and 0.89 is 31.33 percent. Suppose, however, that you want the probability of a z value between -0.89 and +0.89. We have already found the probability associated with a z value lying between 0 and +0.89, which is 0.3133. Since the normal distribution is symmetric, the left tail is the mirror image of the right tail. Thus, the probability of a z value between 0 and +0.89 is the same as the probability of a z value between -0.89 and 0, that is, 0.3133. Hence, the probability of a z value between -0.89 and +0.89 is 0.3133 + 0.3133 = 0.6266.
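The same areas can be obtained in Excel with the cumulative standard normal function, which removes the need for a printed table. For example:

= NORMSDIST(1.75) - 0.5

returns approximately 0.4599, the area between 0 and 1.75, and

= NORMSDIST(0.89) - NORMSDIST(-0.89)

returns approximately 0.6266, the area between -0.89 and +0.89.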
z 0.00 0.01 0.02 0.03 0.04 0.05
0.0 0.0000 0.0040 0.0080 0.0120 0.0160 0.0199
0.1 0.0398 0.0438 0.0478 0.0517 0.0557 0.0596
0.2 0.0793 0.0832 0.0871 0.0910 0.0948 0.0987
0.3 0.1179 0.1217 0.1255 0.1293 0.1331 0.1368
0.4 0.1554 0.1591 0.1628 0.1664 0.1700 0.1736
0.5 0.1915 0.1950 0.1985 0.2019 0.2054 0.2088
0.6 0.2257 0.2291 0.2324 0.2357 0.2389 0.2422
0.7 0.2580 0.2611 0.2642 0.2673 0.2704 0.2734
0.8 0.2881 0.2910 0.2939 0.2967 0.2995 0.3023
0.9 0.3159 0.3186 0.3212 0.3238 0.3264 0.3289
1.0 0.3413 0.3438 0.3461 0.3485 0.3508 0.3531
1.1 0.3643 0.3665 0.3686 0.3708 0.3729 0.3749
1.2 0.3849 0.3869 0.3888 0.3907 0.3925 0.3944
1.3 0.4032 0.4049 0.4066 0.4082 0.4099 0.4115
1.4 0.4192 0.4207 0.4222 0.4236 0.4251 0.4265
1.5 0.4332 0.4345 0.4357 0.4370 0.4382 0.4394
1.6 0.4452 0.4463 0.4474 0.4484 0.4495 0.4505
1.7 0.4554 0.4564 0.4573 0.4582 0.4591 0.4599
1.8 0.4641 0.4649 0.4656 0.4664 0.4671 0.4678
1.9 0.4713 0.4719 0.4726 0.4732 0.4738 0.4744
2.0 0.4772 0.4778 0.4783 0.4788 0.4793 0.4798
2.1 0.4821 0.4826 0.4830 0.4834 0.4838 0.4842
2.2 0.4861 0.4864 0.4868 0.4871 0.4875 0.4878
2.3 0.4893 0.4896 0.4898 0.4901 0.4904 0.4906
2.4 0.4918 0.4920 0.4922 0.4925 0.4927 0.4929
2.5 0.4938 0.4940 0.4941 0.4943 0.4945 0.4946
2.6 0.4953 0.4955 0.4956 0.4957 0.4959 0.4960
2.7 0.4965 0.4966 0.4967 0.4968 0.4969 0.4970
2.8 0.4974 0.4975 0.4976 0.4977 0.4977 0.4978
2.9 0.4981 0.4982 0.4982 0.4983 0.4984 0.4984
3.0 0.4987 0.4987 0.4987 0.4988 0.4988 0.4989

In a similar manner, the probability of a z value between -2.00 and +2.00 is 0.9544, and between -3.00 and +3.00 it is 0.9972. Let us now consider finding the probability that z is greater than 1.00. The area under the curve between 0 and 1 is 0.3413. As we know that the total area above the mean is 0.5000, the area above 1.00 must be 0.5000 - 0.3413 = 0.1587. Thus, the probability that the random variable exceeds 1.00 is 15.87 percent. Also, the probability that z will be less than -1.00 is 15.87 percent (why?).

Finally, we will find the probability that z is between 1.5 and 3.00. Note that the area between 0 and 3.00 is 0.4987 and the area between 0 and 1.5 is 0.4332. Thus, the area between 1.5 and 3.00 is 0.4987 - 0.4332 = 0.0655, so the probability that z lies between 1.5 and 3.00 is about 6.55 percent.

Solved Example
The mean mark of students in business
statistics paper is 78 with a standard
deviation of 15. The random variable
marks of students follow a normal
distribution.
a) What is probability of obtaining
marks less than 50?
b) What is the probability of
obtaining marks more than 90?
c) What is the probability that the
marks lie between 80 and 90?

Solution
a) Using the standard normal distribution, we convert the mark of 50 to a z score first:

z = (50 - 78) / 15 ≈ -1.86

For a z value between -1.86 and 0, the area under the curve is 0.4686. As we know that the total area below the mean is 0.5000, the area below -1.86 must be 0.5000 - 0.4686 = 0.0314. Thus, the probability that the marks will be less than 50 is 3.14 percent.

b) Using the standard normal distribution, we convert the mark of 90 to a z score first:

z = (90 - 78) / 15 = 0.80

For a z value between 0 and 0.80, the area under the curve is 0.2881. As we know that the total area above the mean is 0.5000, the area above 0.80 must be 0.5000 - 0.2881 = 0.2119. Thus, the probability that the marks will be more than 90 is 21.19 percent.

c) First, we convert the random variable X into z scores as follows:

z1 = (80 - 78) / 15 = 0.13 and z2 = (90 - 78) / 15 = 0.80

Next, we have to find the probability that z is between 0.13 and 0.80. Note that the area between 0 and 0.80 is 0.2881 and the area between 0 and 0.13 is 0.0517. Thus, the area between 0.13 and 0.80 is 0.2881 - 0.0517 = 0.2364, so the probability that the marks lie between 80 and 90 is 23.64 percent.

Computer Application
To demonstrate using Microsoft excel
for calculating probability of a normally
distributed variable, the following
example will be used.

The life of a CFL bulb is normally distributed with a mean life of 12 months and a standard deviation of 3 months.
a) What is the probability that a CFL bulb lasts for less than 6 months?
b) What is the probability that a CFL bulb lasts for more than 15 months?
c) What is the probability that a CFL bulb has a life between 14 and 18 months?

Solution
Step 1: Open a Microsoft Excel sheet.
Step 2: Click Insert Function; the following dialog box appears.
Step 3: Select Statistical from select
category. When you select statistical the
following will appear:
Step 4: Next select NORMDIST from 'Select a function'. The following dialog box will come up.
Step 5: When you click OK, the
following dialog box will appear
Step 6: Enter the value of X, the mean, the standard deviation, and 1 (TRUE) in the Cumulative cell. It is important to note that Microsoft Excel always returns the cumulative probability, i.e. the probability below X. When we need the probability below X (the left tail), the value returned by Excel is the answer directly. When we need the probability above X (the right tail), we obtain the answer by subtracting the probability returned by Excel from 1, i.e. 1 - probability value.
Thus, the probability that a CFL bulb lasts for less than 6 months is 0.0227, or 2.27 percent.

Step 7: For part (b), enter the values and Excel gives the following probability. The probability that a CFL bulb lasts for more than 15 months is 1.00 - 0.8413 = 0.1587. Thus, the probability of a CFL bulb lasting for more than 15 months is 15.87 percent.

Step 8: For part (c), from the above Excel output we get the cumulative probability up to 18 months as 0.9772 and up to 14 months as 0.7475. Thus, the probability of a CFL bulb lasting between 14 months and 18 months is 0.9772 - 0.7475 = 0.2297.
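The same three answers can be reproduced without the dialog box by typing the NORMDIST formulas directly into cells:

= NORMDIST(6, 12, 3, TRUE)
= 1 - NORMDIST(15, 12, 3, TRUE)
= NORMDIST(18, 12, 3, TRUE) - NORMDIST(14, 12, 3, TRUE)

These return approximately 0.0227, 0.1587 and 0.2297 respectively.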

Exploratory Data Analysis

In the graphical and tabular methods of summarizing data, we discussed the stem-and-leaf display as a tool of exploratory data analysis. It allows us to use simple arithmetic and easy-to-draw diagrams to summarize data. Under numerical methods, we have two further tools for summarizing data: the five-number summary and the box plot.
Five-Number Summary
The five-number summary comprises of:
1. Smallest Value
2. First Quartile (Q1)
3. Median (Q2)
4. Third Quartile (Q3)
5. Largest Value
To develop five-number summaries first
arrange data in ascending order. After
arranging data in ascending order you
can easily identify the smallest, the
quartiles and the largest values.
Using Excel
In Microsoft excel, five-number
summary can easily computed with excel
function as shown below:
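For instance, assuming the data are in cells A2:A21 (an illustrative range), the five numbers can be obtained with the following formulas:

= MIN (A2:A21)
= QUARTILE (A2:A21, 1)
= MEDIAN (A2:A21)
= QUARTILE (A2:A21, 3)
= MAX (A2:A21)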
Box-Plot
A box plot is a graph that summarizes
data. It is based on five-number
summary. In order to construct a box
plot, you have to compute median and
the first and third quartiles, i.e., Q1 and
Q3. The inter-quartile range (IQR) = Q3-
Q1 is also used.
The various steps involved in the
construction of a box plot are as
follows:
1. A box is drawn. At the ends of the
box 1st and 3rd quartiles are located.
2. A perpendicular is drawn in the box
at the location of median.
3. The IQR fixes the limits: the lower limit is 1.5 (IQR) below Q1 and the upper limit is 1.5 (IQR) above Q3. Data points outside these limits are termed outliers.
4. The dashed lines are called whiskers. The whiskers are drawn from the ends of the box to the smallest and the largest values inside the limits.
5. The * symbol is used to locate the positions of the outliers.
Working with Grouped Data
Weighted Arithmetic Mean
In simple arithmetic mean computation
all items in a series have equal
importance. However, sometimes all
items or observations may not have
equal importance. For example, in a
household budget food items may have
greater importance than entertainment. In
such cases, appropriate weights are
assigned to various observations.
Arithmetic mean calculated using weight
is called weighted arithmetic mean.
In the computation of the weighted arithmetic mean, an important problem arises in the selection of weights. Weights may be either actual or arbitrary. The weighted mean is useful when different items have different importance.
Method of Computation of Weighted
Arithmetic Mean
1. Each item is assigned weight
2. Each item is multiplied with
respective weight and summation of
product is obtained
3. Total of product is divided by sum of
weights

Illustration
A student of MBA got the following
marks in the first semester where the
weightage of mid-term, end-term and
internal assessment are 30%, 50% and
20% respectively. Find the weighted
arithmetic mean.
Exams       Marks (X)   Weight (w)      wX
Mid-term        76          30         2280
End-term        60          50         3000
Internal        70          20         1400
                         Σw = 100   ΣwX = 6680

Weighted Arithmetic Mean = ΣwX / Σw = 6680 / 100 = 66.8

Thus, the student secured 66.8 per cent marks in the first semester.
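In Excel, the weighted mean can be computed in one step with SUMPRODUCT, assuming the marks are in cells B2:B4 and the weights in C2:C4:

= SUMPRODUCT (B2:B4, C2:C4) / SUM (C2:C4)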

Arithmetic Mean for Grouped Data

In a continuous series or grouped data, the arithmetic mean can be computed by the following formula:

Mean = ΣfM / Σf

where
M = mid-point of the class interval
f = frequency of each class

Illustration
The following table shows dividend
declared by different companies during
2015. Compute the average dividend.
Dividend (%)   Mid-Point (M)   No. of Companies (f)      fM
0-10                  5                 12                 60
10-20                15                 15                225
20-30                25                 20                500
30-40                35                 25                875
40-50                45                 30               1350
50-60                55                 45               2475
60-70                65                 60               3900
70-80                75                 36               2700
80-90                85                 42               3570
90-100               95                 50               4750
                               Σf = 335          ΣfM = 20405

Mean = ΣfM / Σf = 20405 / 335 = 60.91

Thus, the average dividend given by the companies during 2015 is 60.91 per cent.
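If the mid-points are entered in cells B2:B11 and the frequencies in C2:C11 (an assumed layout), the same grouped mean can be obtained in Excel with:

= SUMPRODUCT (B2:B11, C2:C11) / SUM (C2:C11)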

Standard Deviation in case of Grouped Data
For computing the standard deviation in the case of a continuous series or grouped data, any of the following three methods may be applied:
Actual Mean Method
Assumed Mean Method
Step Deviation Method
Actual Mean Method
The procedure of calculating standard
deviation by actual mean method is as
follows:
1. Take the mid-points of the various classes.
2. Multiply the mid-points by the respective frequencies and find the sum, ΣfM.
3. Compute the mean.
4. Take the deviations of the mid-points from the computed mean, i.e. d = M - Mean.
5. Multiply the deviations by the respective frequencies and find Σfd.
6. Square the deviations, multiply them by the respective frequencies and obtain Σfd².
7. Apply the formula:

Standard Deviation = sqrt( Σfd² / Σf )

Illustration
The following data show the monthly
stock prices of State Bank of India from
January 2015 to April 2016. Compute
the standard deviation of prices.
Date Close
Jan-15 310
Feb-15 301.6
Mar-15 267
Apr-15 270.05
May-15 278.15
Jun-15 262.8
Jul-15 270.4
Aug-15 247.1
Sep-15 237.25
Oct-15 237.2
Nov-15 250.45
Dec-15 224.4
Jan-16 179.95
Feb-16 158.4
Mar-16 194.3
Apr-16 188.95

Solution
The data on SBI shows that the highest
and lowest prices of SBI during this
period were 310 and 158.4. We grouped
the data taking a class width of 20. The
following is the grouped price data for
the SBI:
Class       Frequency   Mid-point     fm     d = (m - Mean)     fd        fd²
Interval       (f)         (m)
150-170         1          160        160        -82.5         -82.5     6806.25
170-190         2          180        360        -62.5        -125.0     7812.50
190-210         1          200        200        -42.5         -42.5     1806.25
210-230         1          220        220        -22.5         -22.5      506.25
230-250         3          240        720         -2.5          -7.5       18.75
250-270         3          260        780         17.5          52.5      918.75
270-290         3          280        840         37.5         112.5     4218.75
290-310         2          300        600         57.5         115.0     6612.50
Total          16                    3880                               28700.00

Now apply the formula to obtain the standard deviation:

Standard Deviation = sqrt( Σfd² / Σf ) = sqrt( 28700 / 16 ) = 42.35
Thus, the standard deviation of stock
prices of SBI is 42.35. In other words,
the stock price of SBI may deviate by
42.35 on both sides from its mean value
of 242.5. On the upper side it may go to
284.85 and on the lower side it may
touch 200.15.
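A quick spreadsheet version of the same calculation, assuming the mid-points are in A2:A9, the frequencies in B2:B9 and the mean (242.5) in cell D1, would be:

= SQRT( SUMPRODUCT(B2:B9, (A2:A9 - D1)^2) / SUM(B2:B9) )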

Computer Application
To compute descriptive statistics in Excel, click Data on the Excel ribbon and then click Data Analysis. When you click Data Analysis, a dialog box will appear. Choose Descriptive Statistics from this dialog box. Select the data into the Input Range. If we want labels, then click Labels. Select Summary Statistics from the dialog box, specify the location where you want the results, and click OK.

Step 1: Open Excel and enter the data as shown below.
Step 2: Click Data Analysis on the Data tab. When you click Data Analysis, a dialog box will appear as shown below.
Step 3: Select descriptive statistics from
the dialog box and click ok. Another
dialog box will appear shown below:
Step 4: Specify the range of raw data
into Input Range. Select Column if data
is grouped by column otherwise select
row. If data has label, then click Labels.
To get descriptive statistics, tick
summary statistics and then click OK.
The descriptive statistics are shown
below:
Reading Computer Output
Given below is a computer printout of
descriptive statistics from MS-Excel.
Using the following data on the marks of 10 students: 56 78 92 60 65 70 84 45 72 88, the descriptive statistics are computed in Excel.

Marks
Mean                     71
Standard Error           4.72
Median                   71
Mode                     #N/A
Standard Deviation       14.93
Sample Variance          223.11
Kurtosis                 -0.68
Skewness                 -0.24
Range                    47
Minimum                  45
Maximum                  92
Sum                      710
Count                    10

The above table shows that the mean mark is 71. The standard deviation is 14.93, and the variance, which is the square of the standard deviation, is 223.11. The values of kurtosis and skewness are -0.68 and -0.24 respectively, which shows that the data are not perfectly normally distributed. The range, which is defined as the highest mark minus the lowest mark, is 47. The minimum mark in this data set is 45, while the maximum mark is 92. The sum of all the marks is 710 and the total number of observations is 10.

Exercises

1. Consider the daily stock prices of TCS and Infosys from 7th September 2016 to 7th October 2016.

Date TCS
07-09-2016 2440.55
08-09-2016 2322.1
09-09-2016 2352.45
12-09-2016 2359.05
13-09-2016 2359.05
14-09-2016 2328.45
15-09-2016 2326.95
16-09-2016 2361.7
19-09-2016 2411.45
20-09-2016 2406.5
21-09-2016 2413.4
22-09-2016 2378
23-09-2016 2398.1
26-09-2016 2401.1
27-09-2016 2436.1
28-09-2016 2423.05
29-09-2016 2437.8
30-09-2016 2430.8
03-10-2016 2411.7
04-10-2016 2405.15
05-10-2016 2386.35
06-10-2016 2388.75
07-10-2016 2367.8

a) Calculate Karl Pearson's coefficient of skewness.
b) Compute the kurtosis for the TCS and Infosys data.
c) Develop a five-number summary.

2. From the following data, compute the arithmetic mean by the direct method.
Marks:             0-10   10-20   20-30   30-40   40-50   50-60
No. of students:     5      10      25      30      20      10

3. Find the standard deviation for the following data:
Return on Equity (RoE in %)      No. of companies
0-10 20
10-20 25
20-30 40
30-40 21
40-50 18
Chapter 6
Covariance and Correlation
Sample covariance is given by the following formula:

Cov(X, Y) = Σ(X - X-bar)(Y - Y-bar) / (n - 1)

If the covariance between X and Y is zero, it implies the absence of a linear association between them. A positive value of Cov(X, Y) indicates a positive linear relation between X and Y, and a negative value of Cov(X, Y) shows a negative relationship between X and Y.
Using Excel
In Microsoft excel, to obtain covariance,
enter the formula:
= COVAR (A2:A16, B2:B16)
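Note that Excel's COVAR function divides by n, i.e. it returns the population covariance. To match the sample formula given above, the value can be rescaled by n/(n - 1); for example, with the 15 observations in A2:A16 and B2:B16:

= COVAR (A2:A16, B2:B16) * 15/14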
Correlation is a technique for finding the relationship between any two variables. The Karl Pearson coefficient of correlation is the technique by which one can measure the degree of linear relationship between any two variables. There are three types of correlation, namely positive, negative and zero correlation. The Karl Pearson coefficient of correlation is denoted by r. The value of the correlation coefficient lies between -1 and +1. The formula for computing the coefficient of correlation, r, is:

r = Cov(X, Y) / (Sx × Sy)

where Sx and Sy are the standard deviations of X and Y.
Consider two variables annual income
and expenditure of 10 individuals
working in UPES given below:
Income (Rs Lakhs):
5.8 6.2 4.5 3 3.6 6.5 8
3.6
Expenditure (Rs Lakhs):
2.1 2.8 1.8 1.5 1.9 3.6
1.4
Step 1: Open Excel and entered the data
as shown below:
Step 2: Select cell D2, access the Insert
Function and choose CORREL from the
dialog box as shown below:
Step 3: Click OK
Step 4: When the function arguments
dialog box appears:
Enter cells A1:A11 in the Array 1
box and B1:B11 in Array 2 box
Click OK

Thus, the correlation between income and expenditure is 0.90. The positive sign of the coefficient indicates that the two variables are positively related, and the magnitude of 0.90 indicates a strong positive linear relationship between income and expenditure.
Exercises
Consider the yearly stock prices of ITC
and Hindustan Unilever Limited (HUL).
Calculate the covariance and correlation
coefficient.
Year Stock Price of ITC S
(Rs.) H
2000 19.92 2
2001 15.04 2
2002 14.68 1
2003 21.88 2
2004 29.1 1
2005 47.33 1
2006 58.65 2
2007 70.09 2
2008 57.15 2
2009 83.61 2
2010 116.32 3
2011 134.19 4
2012 191.18 5
2013 214.38 5
2014 245.58 7
2015 218.38 8

Appendix 1: List of Common Formulas in Excel
1. Addition: = A1+B1
2. Subtraction: = A1-B1
3. Multiplication: = A1*B1
4. Division: = A1/B1
5. Power: = A1^n
6. Sum: = SUM(Range)
7. Average: = AVERAGE(Range)
8. Standard Deviation: = STDEV(Range)
9. Square root: = SQRT(A1)
10. Logarithm (base 10): = LOG(A1)
11. Natural Logarithm: = LN(A1)
Chapter 7
Introduction to Predictive
Analytics
Introduction
Predictive analytics means estimating in
unknown situations. Forecasting and
prediction are used often
interchangeably in forecasting literature.
However, the prediction is a more
general term and connotes estimating for
any time period before, during, or after
the current observation. Forecasting is
commonly used in the context of
analyzing time series data. In this book,
we will also use forecasting and
prediction interchangeably.

Predictive analytics is very helpful for managers who are operating in a very competitive and uncertain environment today. The importance of predictive analytics can be seen from the following statement:

A successful business manager is a forecaster first; purchasing raw materials, producing, pricing, marketing and organizing all will follow.

Demand results in sales, so predicting demand is very important in overall business. When demand is predicted accurately, it can be met in a timely and efficient manner. Accurate predictions help a company avoid lost sales or stock-out situations and prevent customers from going to competitors. At the bottom line, accurate prediction helps in procuring raw materials and component parts much more cost-effectively, avoiding last-minute purchases.

What is Prediction?
Prediction is a method for guessing the
future. Predictive analysts believe in the
law of repetition; they extrapolate from
the past. Prediction is classified into
qualitative, quantitative, and mixed
methods.

A qualitative forecasting method is based strictly on human judgment and is useful for long-range strategic forecasting, usually for a period of 2 to 10 years. Some of the techniques used here are the hierarchical method, surveys, panels of experts and the Delphi method. It is this strategic forecasting which tends to make or break companies.

However, the area of predictive analytics largely revolves around demand forecasting for marketing and production over shorter horizons. The following are important uses of predictive analytics in various areas of business:

The marketing department forecasts sales for new product lines to give strategic information about their sales later at maturity.

It also forecasts sales for existing product lines to give feedback on whether the current sales techniques are working well or not. Such forecasts are called short-range strategic forecasts and are usually made for 1 to 2 years.
Besides this, there is tactical and
requirement forecasting that is done
for very short horizons of 1 week to
12 months.

Actually, for producing short-range and tactical forecasts, we make use of objective methods of predictive analytics such as smoothing techniques and causal methods.

The Process of Predictive Analytics

The process of predictive analytics involves five basic steps:
Problem Identification
Information Collection
Preliminary Analysis
Choosing and Fitting Models
Evaluating the Model

Problem Identification
It is the first step in the predictive
analytics process. This involves
identifying the exact variable of interest
that is to be forecasted. The problem
identification stage requires a deep
understanding of the system or the
company. This first stage also raises
some critical questions, such as:
1. How will the predictions or forecasts be used in the company?
2. Who requires the forecasts?
3. For what purpose are the forecasts required?
4. How does the whole predictive analytics exercise fit within the company?

Collection of Information
After problem identification, both quantitative and qualitative information is collected in order to carry out further predictive analytics. In particular, this gives rise to two types of information: a) statistical data, and b) the personal judgement and opinions of experts. In the process of forecasting, both kinds of information must be obtained to arrive at a reliable prediction.

Preliminary Analysis
It is the third stage. In this stage we try to
understand the data in hand. We can start
by constructing line chart or bar chart
using excel to understand the major
trends in data. The line chart can suggest
whether sales or any other variable is
linear or non-linear over time. One can
also compute descriptive statistics such
as mean, standard deviation, skewness
and kurtosis to know the distributional
aspects of data. The idea behind doing
this preliminary analysis is to get a feel
for the data. This stage helps in
suggesting some insightful models that
might be useful in the prediction
process.
Choosing and Fitting Models
The next step in the prediction process is
to choose and fit the correct model for
forecasting. As pointed out earlier, the
preliminary analysis is very useful and
can suggest appropriate forecasting
model for the underlying data generating
process. One can pick up one or two
leading models for subsequent analysis.
Depending upon the forecasting
horizons, the predictive analytic
techniques can be selected.

The various types of models which are employed by analysts are:
1. Moving average
2. Exponential smoothing
3. Correlation
4. Trend
5. Regression models
6. Box-Jenkins ARIMA models
7. Input- Output model
There are varieties of other models also
from which an analyst makes his choice
depending upon the assumption of
historical data.

Evaluating the Predictive Model

After the estimation of the model, it is time to interpret the results of the predictive model and use it for making forecasts. The performance of a predictive analytical model can be judged from the accuracy of its forecasts. As a predictive analyst, you must know the use to which the forecasts will be put. Moreover, it is important not only to predict the existing trend but also to predict turning points, so that the decision-making process in an organization can become more effective and efficient.

Exercises
1. What is prediction? Discuss its
significance in business decision
making process.
2. Explain the process of predictive
analytics.
Chapter 8
Time Series and its
Components
Introduction
A time series data is a sequence of
observations ordered in time. For
example data on sales turnover, net
profit, total expenses, stock prices,
exchange rates, etc., are collected at
specific points in time, say weekly,
monthly, quarterly or yearly basis. These
data are ordered by time and are called
time series. Time series analysis helps
to understand performance of a business
entity; its evolution over time and its
likely performance in the future.
According to Prof. Werner Z. Hirsch, "The main objective in analyzing time series is to understand, interpret and evaluate changes in economic phenomenon in the hope of more correctly anticipating the course of the events." In other words, the analysis of time series data not only helps in studying the past behavior of an economic or business phenomenon but also aids in forecasting economic variables such as sales, costs, etc. Based on these predictions, businessmen, administrators and planners can formulate their policies and future plans. This importance of time series analysis was highlighted long ago by Edward E. Lewis: "For the economist in his effort to learn more and more about how the economic system works, the study of time series is perhaps the most important source of information."

Importance of the Analysis of Time Series

Time series data relate to a sequence of observations ordered by time. They play an important role in the empirical analysis of economic and commercial variables. Time series analysis is extremely useful for the following reasons:

Helpful in analyzing past behavior

Time series data on economic variables such as GDP, inflation, etc. are helpful in evaluating the past performance of an economy. For example, the real GDP growth rate in 2014-15 was 7.2% and the real GDP growth rate in 2015-16 was 7.6%. Hence, India's performance was better in 2015-16 than in 2014-15. Similarly, the sales turnover of TCS was Rs. 73578.06 crores in 2015 and Rs. 85863.85 crores in 2016. Thus, the performance of TCS was better in 2016 compared to its performance in 2015.
Helpful in forecasting and strategic
planning
Analysis of time series data also helps
in forecasting GDP, inflation, sales
turnover, net profit, etc if past
information are available. For instance,
the real GDP growth rate of India in
2014-15 and 2015-16 were 7.2% and
7.6% respectively. Now suppose, I want
to forecast real GDP growth rate for
2016-17? How can I do it? Obviously,
in this regard, past information on last
two years data on GDP growth rate will
be helpful.

Strategic planning refers to long range


planning of an economy or a company.
For instance, what will be the size of
Indian economy by 2050 or the market
share of TCS in coming 10 years. The
planners can deliberately try to achieve
some envisaged target based on past
time series information. Times series
analysis also facilitates comparative
study over a period of time.
Components of Time Series
Time series such as gross domestic
product, sales of refrigerators,
movement of stock prices, money supply,
etc are influenced collectively by a large
number of factors and forces. The
influence of various factors affecting a
time series can be classified under
certain definite categories. These
categories are called components of time
series. These components are:
Trend
Trend refers to the general tendency of
an economic variable over a long time.
According to Simpson & Kafka, Trend
also called secular or long term trend is
the basic tendency of production, sales,
income, employment, or the like to grow
or decline over a period of time. The
concept of trend does not include short
range oscillations but, rather, steady
movements over a long time.

Thus, trend is the overall average


tendency of phenomenon or variable
over a long period of time. It is possible
to find an increasing, decreasing or
stable trend over different periods of
time. However, trend is the general and
gradual movement of the variable.

In short, it can be said that despite short-


term fluctuations from time to time in a
phenomenon, there will be underlying
tendency of movement either upward or
downward and this long term tendency is
called trend. For example, GDP,
government expenditure, population in
India, closing prices of BSE Sensex, etc
are in general increasing tendency over
time. Thus, the trend is the general,
smooth, and long term average tendency.
The trend component of time series can
be linear or non-linear. Trend analyzes
the pattern of behavior of a phenomenon
in the past and its future behavior can be
forecasted in the assumption that past
behavior will also continue in the future.
Seasonal Variations
The seasonal variation is responsible for
regular rise and fall in the magnitude of
the time series. According to Biterson,
The seasonal variation in a time series
is repetitive, recurrent pattern of
changes, which occurs within a year or
shorter time period. Thus, it refers to
regular and repetitive movements in a
time series which occurs periodically
over a span of less than a year. Thus,
seasonality is a phenomenon which
linked to daily, weekly, monthly,
quarterly and half-yearly data.

The main cause of seasonal variations in


time series data is the change in climate
and man-made conventions. For
example, sales of woolen clothes
generally increase in winter season.
Besides this, customs and tradition also
affect economic variables for instance
sales of gold increase during marriage
season.

The main objective of studying


seasonality and its measurement is to
know their effects and isolate the
seasonal component from the trend. Its
study is extremely useful for producers,
sales managers, etc., for operational
planning and decision making regarding
purchase and procurement of raw
materials, production, marketing,
advertising programs, etc. In the absence
of seasonal analysis, a sudden jump and
decline in sales may be misinterpreted
and it can have adverse impact on
business.
Cyclical Variations
It also occurs periodically like seasonal
component but cyclical variations may
take more than a year to reoccur. It is
called cyclical because they occur in
cyclical manner. This cycle has four
stages, namely, prosperity, recession,
depression and recovery. It is important
to note that there is no definite period of
cyclical variations. In general, this
period varies from 3 to 10 years.
Irregular Variations
Sometimes economic time series such as
production, sales, etc., are also
influenced by certain unforeseen
incidents called irregular variations. For
example decline in sales of an exporting
firm due to 9/11 terrorist attack. This
type of variations does not exhibit any
definite pattern and therefore cannot be
forecasted.
Classical Decomposition Method
So far, we discussed that all time series
have four components namely, trend,
seasonal, cyclical and irregular
variations. How these four components
are interrelated? Classical
decomposition method recognizes two
basic types of model.
Additive Model
This model is written as: Y = T+S+C+I
where the sum of all four components
yields the original series. Thus, if you
want to find short term variations,
deduct trend from the original series i.e.
Y-T = S+C+I. Likewise, for computing
cyclical and irregular variations, deduct
trend and seasonal variations from
original series, that is, Y-T-S = C+I.
And, if you want to find irregular
variation then deduct trend, seasonal and
cyclical components from the original
series, that is, Y-T-S-C =I. This
procedure of isolating each component
present in time series is based on the
assumption that all components are
residual.

Multiplicative Model
This model has a form: Y = TxSxCxI. If
you want to find short-term variations,
divide the original series by calculated
trend values. The formula below
describes this operation:
Similarly, if the aim is to find irregular
component from the original time series,
the following operations will achieve
this:

Besides these two basic models, many


time series do not grouped under two
categories. Very often in practice a
mixture additive and multiplicative
model is used for analysis. Examples of
few mixed models are given below:
Y = (TxSxC)+I Y =
Tx(S+C+I)

Exercises
1. What is time series data? Discuss the
various components of a time series.
2. Distinguish between additive and
multiplicative models.
Chapter 9
Trend Analytics
Introduction
Trend analytics means explaining any
variable in terms of time. Each
observation of a phenomenon in a time
series is the compound effect of the four
components namely, trend, seasonal
variation, cyclical variation and random
component. Trend is one of the dominant
components of time series data. The
procedure of isolating the trend values
from the time series involves the
measurement of trend.
Least Square Method of Estimating
Method
It is a mathematical method where a
trend line is fitted to the data in such a
way that:
a) the sum of deviations of the
actual values of Y and the estimated
values of

is zero i.e.,
b) the sum of squares of the
deviations of actual and estimated
values is least or minimum from the
estimated line and hence the name
least square method i.e.,

The line obtained by this method is


known as the line of best fit.
Fitting a Straight line Trend
The equation for straight line is given by
the following equation:

where Y is the dependent variable i.e.


time series
X is independent variable. Here time is
the independent variable.
is the Y-intercept which gives initial
value of time series
is the slope coefficient

In order to find the values of constants


and the following two normal
equations are to be solved:
Illustration
Fit straight line trend by the method of
least square for the following data:
Year Production of Pulses
(Million Tonnes)
2009-10 14.66
2010-11 18.24
2011-12 17.09
2012-13 18.34
2013-14 19.25
2014-15 17.20

Solution
Year Production(Y) Time (X) X2
2009- 14.66 1 1
10 18.24 2 4
2010- 17.09 3 9
11 18.34 4 16
2011- 19.25 5 25
12 17.20 6 36
2012-
13
2013-
14
2014-
15
104.78 = 6 +21 (i)
375.22 = 21+91 (ii)
Multiply equation (i) by 7 and equation
(ii) by 2
42 +147 = 733.46
42 +182 =750.44
_ _ _

-35 = -16.98
To find the values of substitute the
values of in equation (i), we get
104.78 = 6 +21
104.78 = 6 +21(0.4851)
104.78 = 6 + 10.1871
104.78-10.1871 = 6
94.6 = 6

Or
Thus, the estimated linear trend line is
Y = 15.76 +0.4851X
What is the use of the above estimated
trend line? The estimated trend line can
be used to forecast the production of
pulses in 2015-16. In this case the value
of time variable for 2015-16 will be 7.
Substitute the value of 7 in the estimated
linear trend and we get

Thus, the forecasted production of


pulses for the year 2015-16 is 19.15
million tones.
Illustration
This illustration is based on deviation
method. Using annual production of rice
in India from 1990-91 to 2005-06, we
will tell you the procedure of finding
trend in the below table as follows:
Y Time yt

74.29 1 -7.86 -7.5 58.9


74.68 2 -7.47 -6.5 48.5
72.86 3 -9.29 -5.5 51.0
80.3 4 -1.85 -4.5 8.32
81.81 5 -0.34 -3.5 1.19
76.98 6 -5.17 -2.5 12.9
81.73 7 -0.42 -1.5 0.63
82.54 8 0.39 -0.5 -0.1
86.08 9 3.93 0.5 1.96
89.68 10 7.53 1.5 11.2
84.98 11 2.83 2.5 7.07
93.34 12 11.19 3.5 39.1
71.82 13 -10.33 4.5 -46.
88.53 14 6.38 5.5 35.0
83.13 15 0.98 6.5 6.37
91.79 16 9.64 7.5 72.3

=82.15

From the above calculations, now we


can find the value of regression
coefficients as follows:
Thus, our trend line for this example is
given as:

As pointed out earlier, if someone is


dealing with annual data there is no
question of seasonality arises. However,
if we consider rice production as any
time series, theoretically we find short-
term fluctuations by deducting trend
values from actual value if the model is
additive. If the model is multiplicative,
short-term variations can be found by
dividing actual values by trend values.

Using Excel
Step 1: Enter the data as shown below:

Step 2: Click Data and select Data


Analysis from the menu bar. When you
click data analysis the following dialog
box will appear.
Step 3: Select Regression from the
dialog box as shown below:
Step 4: When you click ok the following
another dialog box will appear shown
below:
Step 4: Select dependent variable in
Input Y Range. Here, rice production
is Y variable. Select independent
variable, that is, T, in Input X Range.
Click labels and specify output range as
shown below:
Step 5: After this, when you click OK
the following results will appear shown
below:
Step 6: The trend line is given below
from the Excel output:

Y = 74.45 + 0.9066 T

Non-Linear Trend
If a variable tends to increase or
decrease by constant amount over time
then the linear trend measurement is
appropriate. However, if the increase or
decrease in a variable over time
expands by uneven increment then
parabolic curve of second or third
power may be more appropriate. The
parabolic curves of various degrees are:

Second Degree Parabola


The equation for second degree parabola
or quadratic equation is:

where a is the value of trend at the


origin, b is the slope at the origin and
c establishes whether the curve is up or
down by how much. The values of a, b,
and c can be found by solving the
following three normal equations:
If short-cut method is used by taking
middle year as the origin much time and
energy can be saved. Since ,
then the summation of any odd power
will also be zero. Thus, the .
As a result, the normal equations are
reduced to:
Solving

Exponential Trend
If the time series is increasing or
decreasing by a constant percentage
rather than a constant amount, the
exponential trend model is considered
appropriate often many economic and
business date show such tendency. The
equation for exponential function is:

Y = X

The above model cannot be estimated in


its original form. However, after
logarithmic transformation, it will
become linear and can be estimated by
OLS method. The transformed model is
given below as:
Log (Y) = log() +log(X)
The following equations are used to find
the values of and

If middle year is assumed to origin, the


above equations yield:
Illustration
To illustrate exponential trend model
consider data relating to production of
natural gas in India from 1990-91 to
2004-05 in table 6.2 given below:
Natural Log (Y) X XlogY
Gas
(Y)
17,998 4.26 -7.00 -29.79
18,645 4.27 -6.00 -25.62
18,060 4.26 -5.00 -21.28
18,335 4.26 -4.00 -17.05
19,468 4.29 -3.00 -12.87
22,642 4.35 -2.00 -8.71
23,256 4.37 -1.00 -4.37
26,401 4.42 0.00 0.00
27,428 4.44 1.00 4.44
28,446 4.45 2.00 8.91
29,477 4.47 3.00 13.41
29,714 4.47 4.00 17.89
31,389 4.50 5.00 22.48
31,962 4.50 6.00 27.03
31,763 4.50 7.00 31.51
LogY=65.82 XLogY
From the above data, we obtained the
values of regression coefficients as:

= 65.82/15 =4.39
Similarly,

= 5.98/280 = 0.02
Thus, our estimated exponential trend
model is:
= 4.39+0.02t
The above estimated model can be used
to forecast next year natural gas
production as given below:
2005-06= Antilog [4.39+0.02(8)]=
Antilog [4.55]= 35481.34 million cubic
meters

Using Excel
Quadratic Trends
The linear function to model trend is
very common. However, sometimes time
series shows nonlinear trend like
quadratic trend or cubic trend. In
quadratic relationship the value of
dependent variable Y from time variable
T, which is, independent variable takes
the form

We will consider monthly sales of


Mobile Handsets data discussed earlier
to illustrate quadratic trend model.

Step 1: Enter the data as shown below:


Step 2: Create first independent
variable (T) by coding month. Create
second independent variable (T2) by
squaring first independent variable using
formula = C2^2 as shown below:
Step 3: Click Data and select Data
Analysis from the menu bar. When you
click data analysis the following dialog
box will appear.
Step 3: Select Regression from the
dialog box as shown below:
Step 4: When you click ok the following
another dialog box will appear shown
below:
Step 4: Enter data from B1:B13 in
Input Y Range and enter data from
C1:D1 in Input X Range Click labels.
Specify output range as shown below:
Step 5: After this, when you click OK
the following results will appear shown
below:
Thus, the quadratic trend model from the
excel output is:
Exponential Trend
When the time series increases or
decreases by a constant percentage in
every period one could fit exponential
trend model which is specified as
follows:

Y = t
The above model cannot be estimated in
its original form. However, after
logarithmic transformation, it will
become linear and can be estimated by
OLS method.

To illustrate exponential trend model


consider data relating to production of
natural gas in India from 1990-91 to
2004-05 in table given below:
Year Natural Gas
(Y)
1990-91 17,998
1991-92 18,645
1992-93 18,060
1993-94 18,335
1994-95 19,468
1995-96 22,642
1996-97 23,256
1997-98 26,401
1998-99 27,428
1999-2000 28,446
2000-01 29,477
2001-02 29,714
2002-03 31,389
2003-04 31,962
2004-05 31,763

Step 1: Enter the data as shown below:


Step 2: Click Data and select Data
Analysis from the menu bar. When you
click data analysis the following dialog
box will appear.
Step 3: Select Regression from the
dialog box as shown below:
Step 4: When you click ok the following
another dialog box will appear shown
below:
Step 4: Select dependent variable in
Input Y Range. Here, log (Y) is Y
variable. Select independent variable,
that is, T, in Input X Range. Click
labels and specify output range as shown
below:
Step 5: After this, when you click OK
the following results will appear shown
below:
Thus, the exponential trend model is:

Log (Y) = 9.70 + 0.0491T


The above estimated model can be used
to forecast next year natural gas
production as given below:
2005-06=
Antilog [9.70+0.0491(16)]=
Antilog [10.4966]= 36192.24 million
cubic meters.

Exercises
1. The following table shows index of
electricity generation in India
(Million Kwh) from 2004-05 to
2015-16. Fit straight line trend by
least square method.
Year Index of Electricity
Generation (Millions Kwh)
2004-05 100
2005-06 105.2
2006-07 112.8
2007-08 120.3
2008-09 123.3
2009-2010 130.8
2010-11 138
2011-12 149.5
2012-13 155.2
2013-14 164.7
2014-15 178.6
2015-16 188.7

2. The following table shows index of


passenger cars in India from 2004-
05 to 2015-16. Fit a straight line
trend by least square method.
Year Index of Passenger
Cars
2004-05 100
2005-06 108.5
2006-07 128.3
2007-08 147.3
2008-09 157.1
2009-2010 197.9
2010-11 254.1
2011-12 260.3
2012-13 251.7
2013-14 240.7
2014-15 251.6
2015-16 262

3. The following table shows index of


production of tea (000 tons) in India
from 2005-06 to 2015-16. Fit
straight line trend by least square
method.
4.
Year Index of Passenger Cars
2005-06 107.5
2006-07 114.3
2007-08 114.1
2008-09 116.5
2009-2010 119.3
2010-11 116.4
2011-12 117
2012-13 136.5
2013-14 145.5
2014-15 141.4
2015-16 145.1

3. Consider the following data on sales


turnover of Infosys Technologies
from 1992 to 2016 given in table
below
4.
Year Sales Turnover
(Rs.Cr)
1992 9.22
1993 14.23
1994 28.97
1995 55.42
1996 88.55
1997 139.22
1998 303.65
1999 508.8
2000 869.7
2001 1900
2002 2603.59
2003 3622.69
2004 4760.89
2005 6859.66
2006 9028
2007 13149
2008 15648
2009 20264
2010 21140
2011 25385
2012 31254
2013 36765
2014 44341
2015 47300
2016 53983

Fit an exponential trend


model
Chapter 10
Predictive Models for
Stationary Time Series
Introduction
A time series is stationary if it oscillates
around its mean value, i.e. it moves in a
horizontal fashion. Generally these kinds
of time series are easier to handle than
the nonstationary ones, i.e. those that
exhibit some kind of upward or
downward movement.
Models for Stationary Time Series
Data
The various models which are used for
prediction of stationary time series are:
Naive Model
A naive model simply assumes the value
in previous period as the forecasted
value in the current period. This model
answers the following question:

What will be the sales turnover of Apple


Inc., next quarter?

Suppose the top management says, The


sales turnover in the next quarter will be
similar to the last quarter. The above
statement is mathematically expressed as
follows:

Yt = Yt-1
where Yt is the forecasted value for the
next quarter or at time t, and Yt-1 is the
actual value of sales in the previous
quarter or at time t-1. There other
versions of nave models also:

1. A variant of naive model involves


taking average of two prior values to
generate forecast.
2. Another variation of naive modelling
involves incorporating trend that may
be present in data.
Example
The following table shows sales of
Facebook from 2012 to 2016. Predict
the sales of the Facebook for next year
using nave model.
Year Revenue ($
Billions)
2012 5.09

2013 7.87

2014 12.47
2015 17.93
2016 27.64

Solution
1. The nave model says the forecasted
sales of the Facebook for the next
year will be same as the previous
year.
Yt = Yt-1
Thus the sales for the year 2017 will
be 27.64 billion dollars.

2. A variant of naive model involves


taking the average of two prior
values to generate forecast.
According to this model the sales of
Facebook is:

3. Another variation of naive modelling


involves incorporating trend that may
be present in data. According to this
model the sales of Facebook will be:

Yt = Yt-1+Y
= 27.64+9.71 = 37.35

Simple Average
If a time series is stationary and we just
want to predict a single future value of
this series, then using an average value
of the series is almost as good as any
other method. The most elementary
forecasting method is simple average
model. With this model, the forecasts for
time periodt is the average of the
values for a given number of previous
time periods.
The advantage of this simple method is
that it can be extended further in the
future. If we need to forecast for the next
five observations, we just extend the
mean line. By definition, if a series is
stationary it fluctuates around its mean.
Therefore, the mean is its best predictor.
This method does not produce very
accurate forecasts, but the results will be
precise enough. To add more
sophistication to our forecasting and to
try to emulate the movements of the
original series, we need to extend the
principle of a simple average to a
moving average principle.
Example
The following table shows call money
rate which is a proxy for interest rate in
India. In general, the interest rate is a
non-trending variable and for such
variable the mean is its best predictor.
Call Money
Year
Rate (%)
2002-
5.89
03
2003-
4.62
04
2004-
4.65
05
2005-
5.60
06
2006-
07 7.22

2007-
6.07
08
2008-
7.26
09
2009-
3.29
10
2010-
5.89
11
2011-
8.22
12
2012-
8.09
13
2013-
8.28
14
2014-
7.97
15
2015- 6.98
16
2016-
6.40
17
Thus the predicted call money rate for
the next year is 6.42 per cent.
Moving Average Method
Moving average is another method of
determining trend. It consists of a series
of arithmetic means computed from
overlapping groups of successive items
of a time series. Each moving average is
computed using values covering a fixed
time interval called period of moving
average. Successive moving average is
computed by removing the first
observations of the previously averaged
group by the next observation. The
objective of finding these averages is to
remove the periodic type of variations.
The averaging process smoothens out
fluctuations and ups and downs in the
data. To remove variations appropriate
period of moving average is used.
Usually 3, 4, 5, or 7 period moving
average are used to compute the moving
average.

As an example look at the procedure of


3 period moving average as shown
below:
1. Take the average of first three
observations and place it against the
middle value (i.e., 2nd observation).
2. Leave the first observation and take
the average of next 3 values and
place it against the middle value
(i.e., 3rd observation).
3. This process is continued till the last
observation for computing the
moving average.
4. The resultant series of moving
average is the trend.

Moving average is a very popular


technique of technical analysis used for
analyzing stocks and commodities. Often
traders use 30 days, 50 days, 150 days
and 200 days moving average for
analytical purpose. In fact, 200 days
moving average is considered as the
long term trend. If a particular stock
price is quoting price above its 200 days
moving average, it is said to be bullish
in nature. However, if a stock quoting
price below its 200 days moving
average, it is said to be bearish in
nature.

Illustration
Find the trend of the yearly stock prices
of TCS using 3 period moving average
method.
Year Stock Price
2004-05 333.88
2005 425.62
2006 609.3
2007 541.68
2008 239.05
2009 749.75
2010 1165.05
2011 1161.25
2012 1258.55
2013 2170.95
2014 2554.7
2015-16 2439.2
Solution
Year Stock Price 3 MA
2004-05 333.88 #N/A
2005 425.62 #N/A
2006 609.3 456.2667
2007 541.68 525.5333
2008 239.05 463.3433
2009 749.75 510.16
2010 1165.05 717.95
2011 1161.25 1025.35
2012 1258.55 1194.95
2013 2170.95 1530.25
2014 2554.7 1994.733
2015-16 2439.2 2388.283
Using Excel
Step 1: Enter the data as shown below:
Step 2: Click Data and select Data
Analysis from the menu bar. When you
click data analysis the following dialog
box will appear.
Step 3: Select Moving Average from
the dialog box as shown below:
Step 4: When you click ok the following
another dialog box will appear shown
below:
Step 4: Enter data from B1:B13 in
Input Range. Click labels and put 3 in
Interval form 3-period moving
average. Specify output range as shown
below:
Step 5: After this, when you click OK
the following results will appear shown
below:
Illustration
The table below shows are the
shipments (in millions of dollars) over a
12-month period. Use these data to
compute 4-month moving average for all
available months.
Month Shipments
January 1,056
February 1,345
March 1,381
April 1,191
May 1,259
June 1,361
July 1,110
August 1,334
September 1,416
October 1,282
November 1,341
December 1,382
Solution
The first moving average is

The next 4-month moving average is


calculated as

This first 4-month moving average can


be used to forecast the shipments in May.
Because 1,259 shipments were actually
made in May, the error of the forecast is:
Errormay=1,259-1243.25=15.75

Shown next, along with the monthly


shipments, are the 4-month moving
average and errors of forecast when
using 4-month moving averages to
predict next months shipments. The first
moving average is displayed beside the
month of May because it i9s computed
by suing January, February, March and
April and because it is being used to
forecast the shipments for May. The rest
of the 4-month moving average and
errors of forecast are shown below.

4-Month Moving Average


Weighted Moving Average
A forecaster may want to place more
weight on certain periods of time than on
others. For example, a forecaster might
believe that the previous months value
is three times as important in forecasting
as other months. A moving average in
which some time periods are weighted
differently than others is called a
weighted moving average. As an
example, suppose a 3-month weighted
average is computed by weighting last
months value by 3, the value for the
previous month by 2, and the value for
the month before that by 1. The
weighted average is computed as:

Notice that the divisor is 6. With a


weighted average, the divisor always
equals to total number of weights. In this
example, the value of Mt-1 counts three
times as much as the value of Mt-3.

Illustration
Compute a 4-month weighted moving
average for the data given in table 4.1,
using weights of 4 for the last months
value, 2 for the previous months value,
and 1 months for each values from the 2
months prior to that.
Solution
The first weighted average is

This moving average is recomputed for


each ensuing month. Displayed next are
the monthly values, weighted moving
averages, and the forecast error for the
data.

4-Month Weighted Moving Average


Exponential Smoothing
Another forecasting technique for
stationary series is exponential
smoothing. It is used to weight data from
previous time periods with
exponentially deceasing importance in
the forecast. Exponential smoothing is
accomplished by multiplying the actual
value for the present time period, Xt, by
a value between 0 and 1 referred to as
and adding the result to the product of
the present time periods forecast, Ft,
and (1-). The following is more
formalized version.

Ft+1=.Xt+ (1-) Ft
where
Ft+1= the forecast for the next time
period (t+1)
Ft= the forecast for the present time
period (t)
Xt = the actual value for the present time
period
= a value between 0 and 1 referred to
as the exponential smoothing constant.

The value of is determined by the


forecaster. The essence of this procedure
is that the new forecast is a combination
of the present forecast and the present
actual value. If the is chosen to be less
than 0.5, less weight is placed on the
actual value than on the forecast of that
value. If is chosen to be more than 0.5,
more weight is placed on the actual
value than on the forecast of that value.

Illustration
The table gives monthly price data on jet
kerosene. Use exponential smoothing to
forecast the values for ensuing time
period. Work the problem using =0.2,
0.5 and 0.8.
Monthly Price Data on Jet kerosene
Month Price of Jet
kerosene
January 66.1
February 66.1
March 66.4
April 64.3
May 63.2
June 61.6
July 59.3
August 58.1
September 58.9
October 60.9
November 60.7
December 59.4
Solution
The following table provides the
forecasts with each of the three values of
alpha. Note that because no forecast is
given for the first time period, we cannot
compute a forecast based on exponential
smoothing for the second period.
Instead, we can use the actual value for
the first period as the forecast for the
second period to get started.

As examples, the forecasts for the third,


fourth, and fifth periods are computed
for = 0.2
F3=0.2(66.1)+0.8(66.1) =66.1
F4=0.2(66.4)+0.8(66.1)=66.16
F5= 0.2(64.3)+0.8(66.16)=65.78

Using Excel

The table below gives monthly sales


data of mobile hand sets of a retail store.
Use exponential smoothing to forecast
the values for next time period using
=0.2, 0.8 and 0.6.
Month Sales of
Mobile
Hand Sets
January 150
February 180
March 200
April 175
May 160
June 148
July 165
August 190
September 230
October 210
November 200
December 245
Solution

The following table provides the


forecasts with each of the three values of
alpha. When = 0.2, the forecast for
next January month is 189 mobiles.
When = 0.8 and 0.6 the forecasts for
January are 202 and 203 mobiles
respectively.

Mobile
Month Sales =0.2
January 150 #N/A
February 180 150
March 200 156
April 175 164.8
May 160 166.84
June 148 165.472
July 165 161.9776
August 190 162.5821
September 230 168.0657
October 210 180.4525
November 200 186.362
December 245 189.0896

Computer Application
Step 1: Enter the data as shown below:
Step 2: Click Data and select Data
Analysis from the menu bar. When you
click data analysis the following dialog
box will appear.
Step 3: Select Exponential Smoothing
from the dialog box as shown below:
Step 4: When you click ok the following
another dialog box will appear shown
below:
Step 4: Enter data from B1:B13 in
Input Range. Click labels and put 0.8 if
=0.02 in Damping factor form.
Specify output range as shown below:
Step 5: After this, when you click OK
the following results will appear shown
below:
Exercises

1. The following figure shows net sales


of HCL Technology from 2007 to
2015. Find trend by average method
for the HCL data.
Year Net Sales (Rs.
Cr)
2007 3768.62
2008 4615.39
2009 4675.09
2010 5078.76
2011 6794.48
2012 8907.22
2013 12517.82
2014 16497.37
2015 17153.44

2. The following table shows data on


annual stock price of ITC from 2000
to 2015 as on 16.11.2016. Find trend
of ITC stock prices using 3-period
moving average.

Year ITC Stock Price


(Rs.)
2000 19.92
2001 15.04
2002 14.68
2003 21.88
2004 29.1
2005 47.33
2006 58.65
2007 70.09
2008 57.15
2009 83.61
2010 116.32
2011 134.19
2012 191.18
2013 214.38
2014 245.58
2015 218.38

Also compute exponential


smoothing using an alpha value of 0.55.
3. Consider the sales figures of a
particular company for the period of
2012 to 2015 in the following table.
Predict sales of next quarter using
different nave models.
Year Q1 Q2 Q3 Q4
2012 672 636 680 704
2013 744 700 756 784
2014 828 800 840 880
2015 936 860 944 972
Chapter 11
Simple Regression Model

INTRODUCTION
Regression analysis is a statistical tool
for analyzing the nature of relationship
between two variables. It is an important
tool frequently used by economists to
understand the relationship among two
or more variables. When our interest
lies in explaining one variable in terms
of another variable i.e. y in terms of x,
we use regression. Regression model
studies the relationship between two
variables when one is the dependent
variable and the other is an independent
variable. For example, an agriculture
economist might be interested in
explaining crop yield (y) with the help of
amount of fertilizers (x) used; change in
inflation (y) in terms of change in money
supply (x).

THE POPULATION REGRESSION


FUNCTION (PRF)
Before discussing the population
regression function, let us introduce here
the meaning of population.

Population or Universe means all units


related to an enquiry. Population may be
finite or infinite. Finite population is that
population whose elements are fixed and
can be counted. However, in case of
infinite population elements include
large number of units that cannot be
counted.

In any econometric investigation,


studying each and every unit or item is
time consuming and costly affair. Thus,
most often a representative sample is
drawn from the population and based on
the study of sample conclusions are
drawn for the whole population.
Population
The ultimate objective is to estimate
population mean on the basis of sample
mean. Similarly, the estimation of
population variance on the basis of
sample variance. This whole process is
called inferential statistics.

In case of two variables Y and X the


relationship between Y and X is
envisaged at the population level as
shown below:
Regression Function
Sample Regression Function

We may think that mark of students (Y)


and attendance (X) are related but the
relationship is unknown. Assuming a
linear relationship between expected
value of Y and X, we have:

E(Y|X) = +X
(11.1)

Where E(Y|X) = conditional expected


value of Y given the X

and are the parameters that describe


the characteristics of population.

X is the independent variable or


explanatory variable.

The relationship between Y and X is


defined by the parameters and.
Equation (1) given as follows:

E(Y|Xi) = +Xi
(11.2)

is called population regression function


(PRF) which is not known to us. Then
the question arises, how to estimate
and that define the relation between Y
and X.

To achieve the above objective, we have


to select a representative sample from
the population, that is, we have to
collect information on Y and X
variables. Here it is to be remembered
that we can get n number of samples.
The idea is to estimate the PRF from the
sample data. Since the PRF is estimated
from sample data, SRF is at best an
approximation of the true PRF.

Thus, we are trying to estimate the PRF


which is unknown with the help of
sample data. The sample counterpart of
PRF is called SRF which is given as
follows:

The above equation is called the Sample


Regression Function (SRF)

is the estimator of E(Y|X)


is the estimator of
is the estimator of

STOCHASTIC SPECIFICATION OF
PRF AND SRF

As discussed earlier, the deterministic


or mathematical relationship between
two variables, say Y and X is given as
follows:

Y=
+X
(11.4)

is exact in nature. Equation (11.3) says Y


is explained by only and only by the X
variable. There is no other variable
exists in the Universe that can add to the
explanation of phenomenon Y. This is the
meaning of deterministic relationship
which says that there is no randomness
or stochastic in the relationship between
Y and X.

But as I said earlier, the laws of


economics, management, behavioural
sciences and social sciences are not
deterministic in nature because they deal
with human beings and their behavior.
They do not come under exact sciences.
Therefore, the randomness is inherent
attribute of laws of economics and
management. Hence the stochastic
specification of population regression
function (PRF) and stochastic sample
regression function (SRF).
The stochastic specification of PRF is
obtained by introducing the random error
term in the conditional expected value of
Y given the value of X. Thus, Y is
comprises of two components as shown
below:

Yi =E(Y|Xi)+i
(11.5)

Or

Yi =+Xi+i

Equation (5.5) comprises of two parts


1 E(Y|X) or +X is the conditional
expectation of Y given the value of X.
This component is called systematic or
deterministic component.

2. is the random or undeterministic


component which is a proxy for
everything that may influence Y but are
not incorporated in the regression model
explicitly.

Now just as we expressed the PRF in


stochastic form we can also specify the
SRF in its stochastic form. The SRF in
its deterministic form is:
The above equation can be written in
stochastic form by simply introducing
the stochastic random disturbance term
as shown below:

is the estimator of E(Y|X)


is the estimator of
is the estimator of
sample estimate of i or residual

LINE OF BEST FIT


Linear regression analysis helps in
finding the best-fitting straight line
through the points of a data set. The line
of best fit is called a regression line. We
obtain the line of best fit with the help of
an example. Lets us consider two
variables inflation as measured by
Wholesale Price Index (WPI) and Prices
of crude oil measured by West Texas
Intermediate. In general, it is believed
that with increase in prices of crude oil
in international market, inflation in India
also rises. Table 11.1 gives monthly data
on Wholesale price index in India and
Price of WTI crude oil.
Table 11.1: Monthly Data on WPI and
WTI

WPI WTI

180.9 36.75
182.1 40.28
185.2 38.03
186.6 40.78
188.4 44.9
189.5 45.94
188.9 53.28
190.2 48.47
188.8 43.15
188.6 46.84
188.8 48.15
189.5 54.19
191.6 52.98
192.2 49.83
193.2 56.35
194.6 59
195.3 64.99
197.2 65.59
197.8 62.26
198.2 58.32
197.2 59.41
196.3 65.49
196.4 61.63
196.8 62.69

A scatter plot of the above data is shown


in Figure 11.2. Visual inspection of the
data suggests that Wholesale Price Index
(WPI) increase as the Price of crude oil
(WTI) increases. The question now is
how to formalize this relationship in
quantitative manner. The answer is
regression analysis which formally
analyzes the relationship.

Figure 11.2: Scatter Diagram


To begin, lets assume that a linear
relationship exists between wholesale
price index and crude oil prices. We can
express this linear relationship
mathematically as:
Y = +X (11.8)

Equation (11.8) represents the simple


linear regression model. It is also called
the two-variable regression model
because it relates only two variables Y
and X. Variables Y and X are called by
different names. Variable Y is called
dependent variable, the explained
variable, the response variable, the
predicated variable etc. Variable X is
called the independent variable, the
explanatory variable, the control
variable, the predictor variable etc.
These terms are used interchangeably in
economic analysis. However, dependent
variable and explained variable are
frequently used for variable Y and
independent variable and explanatory
variable are used for variable X.

The above equation (11.8) relates


inflation and crude oil price which
precisely says that inflation can be
explained by crude oil prices. In other
words, variation in inflation can be
explained by variation in crude oil
prices. But, in reality, crude oil price is
not the only factor which influences
inflation. There are numerous other
factors such level of money supply
interest rate etc affecting inflation. There
are certain missing variables on which
we cannot find data. The omission of
these factors would mean that data
points will not lie precisely on fitted
straight line or the model makes an error.
To account all these errors, we include
error term in the regression model and is
given as:

Y = +X+ e (11.9)
where
Y = dependent variable
X = independent variable
= Y intercept
= slope coefficient

Equation (11.9) is called regression line.


If we know what the values of and
were, we would know the relationship
between inflation and crude oil prices.
Thus, our first and foremost problem is
to find the values of and .
ASSUMPTIONS OF CLASSICAL
LINEAR REGRESSION MODEL

In regression analysis, we are not only


concerned with estimation of parameters
of the regression model but also
inferences about them. Therefore, our
concern is not only functional form of the
model but also how Y is generated. To
this end, we make certain assumptions as
to how Y is generated. Look at above
equation (11.2). It shows that the
dependent variable (Y) is a function of
both independent variable (X) and error
term (e). Thus, it is important to specify
how X and error term are generated
otherwise we cannot draw statistical
inferences about Y or and .
Therefore, we make certain assumptions
about the independent variable and error
term. If these assumptions are not
satisfied then valid interpretation of the
regression estimates is not possible.

The various assumptions of classical


linear regression model are as follows:

The linear regression model is based on


certain assumptions. There are three
types of assumptions:
1. Assumptions relating to the
distribution of the random variable
2. Assumptions relating to the
relationship between and the
independent variables
3. Assumptions relating to the
relationship between the independent
variables themselves
Assumptions relating to the
distribution of the random variable
These are assumptions about the
distribution of the values of and they
are crucial for the estimates of the
parameters.
1. i is random real variable: i is a
random variable and it may assume
value in any period with chance. It
may be positive, negative or zero.
Each value assumed by in any
particular period has certain
probability.
2. The mean value of in any
particular period is zero: For each
value of X, may assume various
values, some may be positive values
and some may negative values but if
we the mean of values, for any given
value of X, they will be zero. This
assumption ensure that Yi = +Xi
gives the relationship between Y and
X on the average.
3. The variance of i is constant in
each period: This assumption means
for all values of X, the s will have
the equal spread around their mean,
i.e., assumption of homoskadastic
errors.
4. The stochastic random error term i
has a normal distribution: For each
value of X, the random variable has
a normal distribution. Symbolically,
~ N(0, 2)
5. The various values of random error
term are independent: The value of
random error term assumed in one
period does not depend on the value
assumed by it in any other period. In
other words covariance of i and j
are equal to zero. This is called
assumption of no autocorrelation in
error terms.
Assumptions relating to the
relationship between and the
independent variables
1. is independent of the
independent variable: The
covariance of and X is zero.
Symbolically
Cov (X) = E{[Xi-E(Xi)] [i-
E(i)]}=0
2. The independent variables are
measured without error: absorbs
the errors of measurement in Ys and
that is why we assume that the
independent variables are error free,
while the Y values may or may not
include errors of measurement.
Assumptions relating to the
relationship between the independent
variables themselves
1. The independent variables are not
perfectly linearly related. If there is
more than one independent variable
in the relationship it is assumed that
they are not perfectly correlated with
each other. In fact, the independent
variables should not even be strongly
correlated, they should not be highly
multicollinear.
2. The macroeconomic variables
should be correctly aggregated: In
general, the variables Y and X are
aggregative variables, representing
the sum of individual items. It is
assumed that the appropriate
aggregation method has been used in
compiling the aggregate variables.
3. The relationship between Y and X
being estimated is identified: It is
assumed that the relationship whose
coefficients we want to estimate has
a unique mathematical form, which is
it does not contain the same
variables as any other equation
related to the one being investigated.
Only if this assumption is fulfilled
can we be certain that the
coefficients which results from our
computations are the true parameters
of the relationships which we study.
4. The relationship is correctly
specified: It is assumed that we have
not committed any specification
error in determining the explanatory
variables, that we have included all
the important regressors explicitly in
the model, and that its mathematical
form is correct.
LEAST SQUARE METHOD
The least square method is a
mathematical technique that can be used
to find the values of and that form the
straight line. No straight line can
envelop perfectly each observation in
the scatter plot exactly on it. This is
reflected in the discrepancies between
the actual values and predicted values
by the regression line. Any straight line
fitted to the data will result in some
error. A number of straight lines could
be drawn that would seem to fit the data.

The values of and that is found by the


least square method result in less
difference between actual values and
predicted values. Any values for and
other than those determined by the least
square method result in greater
differences between actual value and
values indicated by regression line. In
more strict sense, the least square
method determines the values of and
that minimizes the sum of squared
errors.Any values for and other than
those determined by the least square
method result in greater sum of squared
differences between actual value and
predicted value.

Here, the relationship between inflation


and crude oil prices as envisaged in
equation (11.8) will be found with help
of a sample data. Thus, our sample
regression line will be:

Y = dependent variable
X = independent variable
= estimated Y intercept
= estimated slope coefficient

Values of and can be calculated as


follows:
(11.10)

(11.11)

Table 11.2 Least Square Computation


Y X Y2
180.9 36.75 32724.81
182.1 40.28 33160.41
185.2 38.03 34299.04
186.6 40.78 34819.56
188.4 44.9 35494.56
189.5 45.94 35910.25
188.9 53.28 35683.21
190.2 48.47 36176.04
188.8 43.15 35645.44
188.6 46.84 35569.96
188.8 48.15 35645.44
189.5 54.19 35910.25
191.6 52.98 36710.56
192.2 49.83 36940.84
193.2 56.35 37326.24
194.6 59 37869.16
195.3 64.99 38142.09
197.2 65.59 38887.84
197.8 62.26 39124.84
198.2 58.32 39283.24
197.2 59.41 38887.84
196.3 65.49 38533.69
196.4 61.63 38572.96
196.8 62.69 38730.24
Y X=1259.3 Y2=880048.5
=4594.3
=52.47
=191.42
n = 24

Using data from Table 11.2, is


calculated as:

=0.50

= 191.42-0.50(52.47)

= 165.07

Thus, estimated regression equation is:


= 165.07+0.50X

where is the estimated value or


predicted value of inflation for a given
value of crude oil price. According to
the estimated regression equation, for
every 1 percent increase in crude oil
price, inflation will increase by half
percentage point. The value of is
165.07. This shows that when price of
crude oil is zero, the level of inflation
will be 165.07.

MEASURING FIT COEFFICIENT


OF DETERMINATION ( R2)
Earlier, we discussed that the least
square method determines that values of
and such that it minimizes the sum
of squared errors. However, it may
happen that the best fitting line does not
fit the data at all. Thus, it is desirable to
develop some measure of fit to assess
the adequacy of the fitted line. The most
common statistic used to measure fit is
called coefficient of determination. It is
denoted by R2. To explain R2, we have
developed some measures of variation.
The first measure is the total sum of
squares (SST). It measures the variation
of the Yi around their mean . The total
sum of squares is defined as the sum of
squared differences between actual Y
and its mean . Symbolically, it is
given as follows:

We must remember that SST is a


measure of variation of Y and this
variation in Y is explained through
explanatory variable X. Thus, it is
possible to think that total variation in Y
consists of two parts: one that which is
attributable to explanatory variable, X;
other part that which is attributable to all
factors other than explanatory variable,
X, as shown below:

TSS = RSS +SSE (11.12)


where SSR is the regression sum of
squares. It refers to the sum of squared
differences between the predicted value
of Y and mean . It is given by:

SSE stands for error sum of squares


which is equal to the sum of squared
differences between the actual Y and the
predicted which is given by:
These measures such as SST, SSR and
SSE provide little that can be directly
interpreted. However, the ratio of SSR
and SST tells us the proportion of
variation in Y that is explained by the
explanatory variable X. This ratio is
called the coefficient of determination. It
is defined as:

R2 =

The value of R2 ranges between 0 and 1.


A R2 of zero means that none of the
variability in dependent variable, y, is
explained by the independent variable,
x. A R2 of 1 means all the variability in
the dependent variable, y, is explained
by the independent variable, x and there
are no errors. Hence, SSR=0 and R2 =1.
In sum, high values of R2 indicate a good
fit and low values a poor fit.

STANADRAD ERROR OF OLS


ESTIMATES

In last section, we found the values of


and based on a sample. Since these
values of and are obtained by
ordinary least squares method, they are
called least squares estimates. Least
squares estimates are function of the
sample data. So when a sample change
there is possibility that values of least
squares estimate may also change. Now,
the question arises, how we can
determine the reliability of and . In
statistics, the reliability of OLS
estimates are measured by standard
error. The standard errors of and are
obtained by the following formulae:
wherevar =variance and se =standard
error and 2 is the variance of error
term. We can calculate the variance of
and and hence standard errors once we
know the value of 2 because all other
quantities except 2 can be estimated
from the data. The formula for
calculating variance of error term is
given as follows:

where is estimator of population


variance of error term, that is, 2, n-2
refers to number of degrees of freedom
and is the sum of errors squared.
The square root of variance of error
term is known as standard error of
estimate or the standard error of
regression which is given as follows:

Standard error of regression is nothing


but standard deviation of actual Y from
the regression line. It is quite often used
as a summary measure to assess the
overall fit of the estimated regression
model. Higher the standard error of
estimate lowers the fit of the model and
vice-versa.

The important features of variances of


and must be noted here. If we look at
the formula of variance of , it is clear
that the variance of is directly
proportional to 2 and inversely
proportional to . So, the larger the
variation in the values of independent
variable, X, the lower the variance of
given 2. Further, greater the variance
of error term, 2, greater the variance ,
given . Thus, it is better to have
more number of observations in any
regression analysis because as n
increases, the number of terms in
increases which in turn decreases the
variance of . Also, the variance of
is directly related to 2 and but
inversely related to and the sample
size.

CONFIDENCE INTERVAL FOR


AND

In statistical inference, we try to find the


characteristics of population based on
sample. Here, our objective is to find the
true but unknown values of population
and . To this end, we collected one
sample and found the values of and .
These values of and obtained from a
sample are called statistic and are
denoted by and . Under statistical
inference, we try to find how close is
to and is to .

Now, before constructing confidence


interval for OLS estimates, it is better to
have some idea of one concept called
estimation. Let us take one example to
illustrate this concept. Suppose exam for
statistics paper is going to be held. Can
we predict the average marks obtained
by students in coming exam? Let us say
that based on a sample of 10 students
who appeared in last semester exam in
the same paper got an average of 75
marks. From this, if we say that the
average marks obtained by students in
coming exam will be 75 then it is called
point estimate. However, if we say that
the average marks obtained by students
will be between 70 and 80 then it is
called interval estimate. Next, with
what confidence you are saying that the
average marks will be between 70 and
80. Can you say that I am 90 percent
confident that the average marks of
student will be between 70 and 80? In
general, decisions in social sciences are
made at 90 percent, 95 percent and 99
percent level.

In this case, the point estimate is 75. A


point estimate like this may differ from
the true value. In that case, our
prediction may go wrong. How reliable
is this estimate? We know that the
reliability of a point estimator is
measured by its standard error. With
standard error, we can construct an
interval around this point. Now, we can
claim that this random interval has, say
95 percent probability of including true
value of average marks. The concept of
standard error is required to construct
interval around estimators and that is
why, we discussed this concept
previously.

To construct confidence interval, we


need two positive numbers and , the
latter lying between 0 and 1, such that
the probability that the random interval
contains the true is 1-.
In mathematical terms,

(11.22)

Such an interval is known as confidence


interval; 1- is known as the coefficient
of confidence; which lies between 0
and 1 is known as the level of
significance. The endpoints of the
confidence interval are known as the
confidence limits, being the lower
limit of confidence interval and
the upper limit of the confidence
interval.

As discussed earlier, with the


assumption of error term as normal, the
OLS estimators and follow normal
distribution. In other words,
follow Z distribution as given below:

(11.23)

Thus, we can use the normal distribution


for making probabilistic statement about
if the population variance 2 is known.
However, if the population variance 2
is not known to us, the estimator and
will follow t-distribution as given
below:

(11.24)

where refers to standard error.


Now, we can use t-distribution to
construct confidence interval with the
help of standard error for as follows:

(11.25)
Or in short we can write equation
(11.25) as

(11.26)

Similarly, we can write the confidence


interval for the intercept term as:

(11.27)

or

(11.28)
We must keep one thing in the mind that
the width of confidence interval is
directly proportional to the standard
error meaning if the standard error of
estimate is large then the confidence
interval will also be wide and vice-
versa.

Lets back to our numerical example. We


found that = 165.0741, =2.3002
and degree of freedom = 22. If we want
to make confidence interval at 95%
level then =0.05. If we look for 22
degrees of freedom in t-table the critical
= = 2.074. If we substitute these
values in equation (11.27), we will have
confidence interval for the intercept term
at 5 percent level of significance as
follows:

165.07412.074 (2.3002)
(11.29)

or

160.3035169.844
(11.30)

Thus, given the level of significance of


5%, in 95 out of 100 cases interval like
160.3035169.8447 will contain the
true population in the long run. In a
similar way, we can construct
confidence interval for slope coefficient
as follows:

0.50222.074 (0.0432) i.e. 0.4127


0.5917
(11.31)

HYPOTHESIS TESTING

You must have read mathematical proofs


in class 9th or 10th by contradiction.
Hypothesis testing or what you can call
theory testing, preposition testing,
principle testing is the same thing which
has a special name in inferential
statistics called hypothesis testing. In
fact, new knowledge in the society is
found by falsifying the existing
hypothesis or theory.

The hypothesis is an assumption


regarding population parameter. The
purpose of hypothesis testing is to find
whether enough statistical evidence exist
to arrive at the conclusion that a belief
or hypothesis about population
parameter is supported by the data. In
statistics, the stated hypothesis is called
the null hypothesis which is donated as
. The null hypothesis is usually tested
against an alternative hypothesis which
is opposite of null. Alternate hypothesis
is denoted by .
Two-Tail Test

Let us consider the following null


hypothesis and alternate hypothesis:

= 0.20

0.20

Thus, in the null hypothesis the value of


is 0.20 which is a single hypothesis.
However, the alternate hypothesis is a
composite hypothesis. This implies that
value of is either greater or less than
0.20. Thus, we are interested in both
tails of the distribution. Hence, it is
called two-tail test. Such two-tail
alternate hypothesis is very often
formulated when the researcher do not
have a strong a priori theoretical idea to
frame alternate hypothesis.

One -Tail Test

One-tail test is resorted when we have


strong a priori theoretical basis to
suggest that the alternate hypothesis is
unidirectional. For example, consider
the following null and alternate
hypotheses:

0.20
0.20

In the above, alternate hypothesis tells us


that the value of is necessarily greater
than 0.20. Thus, in this case, the
researcher is only interested in the right
tail of the distribution. Since only one
tail is the relevant while conducting this
hypothesis testing, it is termed as one-
tail test.

Procedure of Hypothesis Testing

Let's revert to our example, in which we found β̂2 = 0.5022. Let's say the hypothesized true population β2 is 0.75. Now we have to conduct a hypothesis test to find whether there exists enough statistical evidence to claim that the true β2 is 0.75. The procedure of hypothesis testing is as follows:

a) Frame the null and alternate hypotheses as:
H0: β2 = 0.75
H1: β2 ≠ 0.75

b) Choose an appropriate significance level, i.e. α. In general, decisions in the social sciences are made at the 10 percent, 5 percent and 1 percent levels of significance. Let's take α = 0.05, i.e. 5%, to test this hypothesis.
c) Choose an appropriate test such as the Z test or t test. As we know, β̂1 and β̂2 follow the t-distribution because the variance of the population is not known. So in our case the t-test is the appropriate test.
d) Compute the value of the t-test as follows:

t = (β̂2 − β2) / se(β̂2) = (0.5022 − 0.75) / 0.0432 = −5.7361
e) Compare the calculated t value, i.e. −5.7361, with the table value of t, i.e. the critical t value. The critical t value for 22 degrees of freedom is 2.074 or −2.074, since the t-distribution is symmetric.
f) Take the decision. The rule is: when the calculated t value exceeds the critical t value in absolute terms, reject the null hypothesis. In our case, the calculated t value of −5.7361 is more negative than the critical t value of −2.074, so we reject the null hypothesis.

The conclusion is that there is not enough statistical evidence to support the claim that the true β2 is 0.75. The null is rejected in favour of the alternate hypothesis, which implies that the true β2 is either less than or greater than 0.75.

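The six-step test above is easy to mechanize. Below is a small Python sketch of the same calculation (an illustration only, assuming scipy is available; the variable names are ours), using the estimate, standard error and degrees of freedom quoted in the example.

from scipy import stats

beta_hat, se_beta = 0.5022, 0.0432   # slope estimate and its standard error
beta_null = 0.75                     # hypothesized value of the slope
df = 22

t_calc = (beta_hat - beta_null) / se_beta          # about -5.7361
t_crit = stats.t.ppf(0.975, df)                    # about 2.074
p_value = 2 * stats.t.sf(abs(t_calc), df)          # two-tailed p-value

reject = abs(t_calc) > t_crit
print(round(t_calc, 4), round(t_crit, 3), reject)  # reject H0: beta = 0.75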
Type I and Type II Error

In hypothesis testing, the value of α, the level of significance, plays a critical role in either rejecting or accepting a null hypothesis. Generally, while testing a hypothesis, the value of α is chosen at the 1, 5 or 10 percent level. When the null
hypothesis is correct but the test rejects
it, the researcher is committing type I
error. When null hypothesis is false but
statistical test accepts it, we are
committing type II error. There is
relationship between type I and type II
error. For a given sample size, when we
try to minimize a type I error, probability
of making type II error increases and
vice-versa. Thus, it involves a tradeoff
between type I and type II error, given
the sample size. Then, how to decide
which type of error is acceptable? One
can decide by considering the relative
costs of the two types of errors.
However, costs of these errors are
rarely known to the researcher.
Therefore, researchers, after fixing the value of α, choose a test statistic which minimizes the chance of making a type II error. Let α denote the probability of committing a type I error and β denote the probability of committing a type II error. Then one minus β is known as the power of the test. Ideally, the power of the test is maximized so that the probability of making a type II error is as small as possible.

FORECASTING WITH
REGRESSION MODEL
As discussed earlier, we estimated the sample regression function based on sample data as:

Ŷ = β̂1 + β̂2X

where Ŷ is the estimator of the true expected Y for a given value of X. The above equation can now be used for forecasting or prediction. There are two types of forecasting or prediction, namely mean prediction and individual prediction. Mean prediction is the prediction of the conditional mean value of Y corresponding to a given value of X, say X1. Individual prediction is the prediction of an individual value of Y corresponding to the given value X1.

Mean Prediction
Let us consider the value of X1, that is, crude oil at $75. From this given value of X1, we can find the expected level of inflation with the help of the estimated regression line as follows:

Ŷ1 = 165.07 + 0.50 (75)

= 202.57

Thus, 202.57 is the expected level of inflation when the price of crude is $75 per barrel. Since Ŷ1 is an estimator of the true expected value of Y, we need to know the sampling distribution of Ŷ1. It is distributed with mean (β1 + β2X1) and variance:

var(Ŷ1) = σ² [1/n + (X1 − X̄)² / Σ(Xi − X̄)²]

If we replace the unknown σ² with its estimate σ̂², then (Ŷ1 − (β1 + β2X1)) / se(Ŷ1) will follow the t-distribution with n − 2 degrees of freedom. And, since the standard error of Ŷ1 is then known to us, we can construct a confidence interval and test hypotheses in the usual manner. The confidence interval for the mean value E(Y1|X1) is given by the following expression:

Ŷ1 − t(α/2) se(Ŷ1) ≤ E(Y1|X1) ≤ Ŷ1 + t(α/2) se(Ŷ1)            (11.33)

To help you understand how the confidence interval is constructed around Ŷ1, we revert to our example and compute the variance of Ŷ1, which comes out as:

Variance (Ŷ1) = 1.0947

and

se (Ŷ1) = 1.04

Now, let us construct the confidence interval for Ŷ1, that is, 202.57, at the 95% confidence level as:

202.57 − 2.074 (1.04) ≤ E(Y1|X1 = 75) ≤ 202.57 + 2.074 (1.04)            (11.35)

Thus, 200.42 and 204.72 are the lower and upper confidence limits for inflation, respectively, for a given crude oil price of $75 per barrel. In other words, given the value of X, in repeated sampling 95 out of 100 intervals like (11.35) will contain the true mean value of Y.

Individual Prediction

Suppose you are interested in forecasting an individual value of Y, Y1, for a given value of X, say X1. It can be computed from the same equation, but its variance is different and is given as follows. As discussed earlier, Y1 follows a normal distribution with mean (β1 + β2X1) and variance:

var(Y1 − Ŷ1) = σ² [1 + 1/n + (X1 − X̄)² / Σ(Xi − X̄)²]

As σ² is rarely known, the standardized forecast error follows the t-distribution with n − 2 degrees of freedom. Once we know the variance of (Y1 − Ŷ1), we can construct the confidence interval around this point estimate as follows:

Pr [202.57 − 2.074 (2.16) ≤ Y1|X1 ≤ 202.57 + 2.074 (2.16)] = 95%            (11.38)

or

Pr (198.0902 ≤ Y1|X1 = 75 ≤ 207.0498) = 95%            (11.39)

We can notice that the confidence interval is wider in the latter case, and that both intervals widen as X1 moves away from X̄. Therefore, researchers must exercise great caution while predicting the mean value or an individual value for a given level of X which is very far away from the sample mean, X̄.

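As a rough numerical illustration of the two kinds of forecast intervals, the Python sketch below (not part of the book's Excel treatment; variable names are ours) uses the point forecast and the two standard errors quoted above, and simply shows that the individual-prediction interval is the wider of the two.

from scipy import stats

y_hat = 165.07 + 0.50 * 75        # point forecast of inflation at X1 = 75, i.e. 202.57
df = 22
t_crit = stats.t.ppf(0.975, df)   # about 2.074

se_mean = 1.04                    # standard error of the mean forecast (from the text)
se_indiv = 2.16                   # standard error of the individual forecast (from the text)

mean_ci = (y_hat - t_crit * se_mean, y_hat + t_crit * se_mean)      # about (200.4, 204.7)
indiv_ci = (y_hat - t_crit * se_indiv, y_hat + t_crit * se_indiv)   # about (198.1, 207.0)

print(mean_ci, indiv_ci)          # the individual interval is the wider one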
REPORTING REGRESSION
RESULTS

We must try to put up the results of the estimated model in such a way that the reader can easily understand the report. Let's take our inflation and crude oil example to show how a regression result should be presented:

Ŷ = 165.0741 + 0.5022X

se = (2.3002) (0.0432)

t = (71.76) (11.62)
r2 = 0.85 df = 22

F = 135.1037 F1, 22, 0.05 = 4.30

In the above estimated regression model, the bracketed figures in the second row are the standard errors of the regression coefficients, and the bracketed figures in the third row are the computed t values. These t values are computed under the assumption that the true population value of each regression coefficient is individually zero. r2 is the
coefficient of determination which
indicates how much variation in
dependent variable, Y, is explained by
independent variable, X. The last row
gives computed F and critical F value
under the null hypothesis that regression
coefficients are jointly zero. In this case,
since the calculated F value is greater
than critical F value, we will reject the
null hypothesis.

Solved Example

The following table shows average


household expenditure and average
household income across various cities
in India published in Business Today
dated September 7, 2008.

Cities    Avg. Household Exp (Rs. Lakhs)
Mumbai 2.0114
Delhi 2.05028
Kolkatta 1.74951
Chennai 1.55286
Banglore 1.64923
Hyderbad 1.49251
Ahmadabad 1.34479
Pune 1.26918
Surat 1.90591
kanpur 1.18567
jaipur 1.6754
Lucknow 1.52948
Nagpur 1.82871
Bhopal 1.28836
Coimbatore 1.5205
Faridabad 1.64457
Amritsar 1.6454
Ludhiana 1.34187
Chandigarh 2.12805
Jalandhar 2.29335

1. Estimate parameters of simple


linear regression model.
2. Compute coefficient of
determination.
3. Calculate the standard error of
intercept and slope coefficient
4. Construct confidence interval for
slope coefficient at 5 percent
significance level.
5. Test the hypothesis that β2 = 0.05 at 1 percent significance level.
6. Construct 95 % confidence
interval to estimate the expected
value of expenditure when
average household income is
Rs. 2,50,000
7. Report the regression results in a
standard format.
Solution
As a first step in fitting the simple linear regression model, we constructed a scatter plot of the two variables. The figure below shows that the two variables, viz. average household expenditure and average household income, are positively related to each other. More importantly, the relationship is clearly linear in nature. Thus, a simple linear regression model is appropriate.
a) The simple linear regression model for the above data is specified as follows:

Y = β1 + β2X + ε

where
Y = Average household expenditure (dependent variable)
X = Average household income (independent variable)
β1 & β2 are the parameters of the model
ε is the random disturbance term.

Avg. Exp (Y)    Avg. Income (X)    y = Y − Ȳ    x = X − X̄
2.0114 4.59457 0.3614 1.584
2.05028 4.08237 0.40028 1.072
1.74951 2.87199 0.09951 -0.138
1.55286 3.37059 -0.09714 0.360
1.64923 3.00678 -0.00077 -0.003
1.49251 2.73353 -0.15749 -0.276
1.34479 3.17856 -0.30521 0.168
1.26918 2.10458 -0.38082 -0.905
1.90591 4.31201 0.25591 1.302
1.18567 1.59761 -0.46433 -1.412
1.6754 3.00374 0.0254 -0.006
1.52948 2.80393 -0.12052 -0.206
1.82871 3.08625 0.17871 0.076
1.28836 1.6521 -0.36164 -1.35
1.5205 2.19846 -0.1295 -0.811
1.64457 2.52558 -0.00543 -0.484
1.6454 2.67056 -0.0046 -0.339
1.34187 2.73211 -0.30813 -0.277
2.12805 4.84775 0.47805 1.837
2.29335 2.96651 0.64335 -0.043
Ȳ = 1.65    X̄ = 3.01

Thus, the regression line is:

Ŷ = 0.8652 + 0.2607X

The above regression indicates that when income is zero, the expected average household expenditure will be Rs. 86,520, because even if income is zero people will spend either from borrowing or from past savings for their survival. The slope coefficient value of 0.2607 shows that when income increases by Rs. 1,00,000, expenditure will increase, on average, by about Rs. 26,000.
b) For finding coefficient of
determination which tells us
percentage of variation in the
dependent variable that is explained
by independent variable, we have to
compute SSR, SST and SSE.

(Y − Ȳ)²    (Ŷ − Ȳ)²
0.13061 0.173563
0.160224 0.080137
0.009902 0.001053
0.009436 0.009513
5.93E-07 7.25E-06
0.024803 0.004698
0.093153 0.002254
0.145024 0.054056
0.06549 0.117614
0.215602 0.132977
0.000645 3.61E-06
0.014525 0.002519
0.031937 0.000548
0.130783 0.122819
0.01677 0.043275
2.95E-05 0.015068
2.12E-05 0.007217
0.094944 0.004749
0.228532 0.232912
0.413899 6.09E-05
SST =1.786332 SSR =1.005042
The coefficient of determination, denoted by R2, is given as follows:

R2 = SSR / SST = 1.005042 / 1.786332 = 0.56

Thus, R2 is 0.56, which shows that 56% of the variation in average household expenditure is explained by average household income.

c) The standard errors of the intercept and the slope coefficient are computed as follows:

se(β̂1) = 0.1699 and se(β̂2) = 0.0541

d) For constructing the confidence interval around the slope coefficient, we require the t-table value at the stated level of significance. For 18 degrees of freedom, the t-value at the 5 percent significance level is 2.101.

0.2607 ± 2.101 (0.0541), i.e. 0.15 ≤ β2 ≤ 0.37

Thus, in the long run, in 95 out of 100 cases the true β2 will lie between 0.15 and 0.37. Similarly, a confidence interval around the intercept can be constructed.

e) To test the hypothesis that β2 = 0.05, we will use the t-statistic, which is given as follows:

t = (0.2607 − 0.05) / 0.0541 = 3.89

For 18 degrees of freedom, the critical t-value at the 1 percent significance level is 2.878. Since the calculated t-value is greater than the critical t value, we reject the null hypothesis. Thus, the β2 value is not 0.05; it is significantly different from 0.05.

f) When the given value is X0 = 2.5 lakhs, the point forecast can be estimated using the regression line as:

Ŷ0 = 0.8652 + 0.2607 (2.5) = 1.51695

Thus, when average income is Rs. 2.5 lakhs, the expected average household expenditure will be 1.51695 lakhs. To construct a confidence interval around the point forecast, we have to find the variance of Ŷ0, which is:

Variance (Ŷ0) = 0.01404

The standard error of Ŷ0 is the square root of the variance of Ŷ0; the square root of 0.01404 is 0.1184. Thus, using the critical t-value of 2.878 for 18 degrees of freedom, the confidence interval for the expected average household expenditure when income is 2.5 lakhs is given as follows:

1.51695 − 2.878 (0.1184) ≤ E(Y0|X0 = 2.5) ≤ 1.51695 + 2.878 (0.1184)
1.17619 ≤ E(Y0|X0 = 2.5) ≤ 1.85770

Thus, the confidence interval for expected average household expenditure lies between 1.17619 and 1.85770.
g) The standard format for reporting the regression result is given as follows:

Ŷ = 0.8652 + 0.2607X

se = (0.1699) (0.0541)
t = (5.11) (4.81)
R2 = 0.56 df = 18

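Readers who wish to cross-check the hand computations can do so with a few lines of Python. The sketch below (a verification aid only; the book's own workflow is Excel-based, and the variable names are ours) re-estimates the line from the twenty observations listed in the working table above.

import numpy as np

# Average household expenditure (Y) and income (X), both in Rs. lakhs.
Y = np.array([2.0114, 2.05028, 1.74951, 1.55286, 1.64923, 1.49251, 1.34479,
              1.26918, 1.90591, 1.18567, 1.6754, 1.52948, 1.82871, 1.28836,
              1.5205, 1.64457, 1.6454, 1.34187, 2.12805, 2.29335])
X = np.array([4.59457, 4.08237, 2.87199, 3.37059, 3.00678, 2.73353, 3.17856,
              2.10458, 4.31201, 1.59761, 3.00374, 2.80393, 3.08625, 1.6521,
              2.19846, 2.52558, 2.67056, 2.73211, 4.84775, 2.96651])

slope, intercept = np.polyfit(X, Y, 1)       # OLS fit of Y on X

y_hat = intercept + slope * X
sst = np.sum((Y - Y.mean()) ** 2)            # total sum of squares
ssr = np.sum((y_hat - Y.mean()) ** 2)        # regression sum of squares
r2 = ssr / sst                               # coefficient of determination

print(round(intercept, 4), round(slope, 4))  # compare with 0.8652 and 0.2607 above
print(round(sst, 3), round(ssr, 3), round(r2, 2))  # compare with 1.786, 1.005 and 0.56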
EVALUATING REGRESSION
RESULTS
In the last section, we discussed how to
report the results of regression analysis.
After this, one would like to know
whether the fitted model is appropriate
or not. How to evaluate an estimated
regression model? There are certain
criteria of judging the estimated
regression model which are as follows:

a) The researcher must check the signs of the estimated coefficients. Are the signs correct and as per theoretical expectation? In our example, it is expected that inflation is positively related to crude oil prices. The sign of the estimated β̂2 is positive, which is in accordance with prior expectation.
b) Is the estimated coefficient statistically significant, as theory claims? To check this, one can use the t statistic and infer whether a particular coefficient is statistically significant or not.
c) How much variation in the
dependent variable is explained by
independent variable? One can use r2
which tells percentage variation in Y
that is explained by X. In our
example r2 is 0.85 which means 85%
variation in inflation can be
explained by crude oil prices.
d) Are the assumptions of the classical linear regression model violated? We will discuss this topic in detail later on, but here we would like to check the normality of the residual or error term, ε. This is important because the hypothesis testing procedure based on the t and F tests is valid only when the error term is normally distributed; otherwise, the t and F values are not reliable in small or finite samples. To check the normality of the errors, one easy technique is to draw a histogram of the residuals. One can also use the Jarque-Bera (JB) test of normality, which is based on the OLS residuals. The JB test first finds the skewness and kurtosis of the estimated error term and then uses the following test statistic:

JB = n [S²/6 + (K − 3)²/24]            (11.40)

where n = sample size, S = skewness, and K = kurtosis. A variable is considered to be normally distributed when S = 0 and K = 3. Thus, in the JB test of normality the null hypothesis is that S and K are jointly 0 and 3 respectively. The JB statistic follows the chi-square distribution with 2 degrees of freedom. If the calculated chi-square value is greater than the critical chi-square value, reject the null hypothesis at the given level of significance. But we must remember that the JB statistic is a large-sample test.

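The JB statistic in (11.40) is straightforward to compute once the residuals are available. The following Python sketch shows one possible implementation of that formula; the function name and the simulated residuals are ours, used purely for illustration.

import numpy as np
from scipy import stats

def jarque_bera(residuals):
    """JB = n * (S^2/6 + (K-3)^2/24), computed on the OLS residuals."""
    n = len(residuals)
    s = stats.skew(residuals)
    k = stats.kurtosis(residuals, fisher=False)   # raw kurtosis, equal to 3 for a normal
    jb = n * (s ** 2 / 6 + (k - 3) ** 2 / 24)
    p_value = stats.chi2.sf(jb, df=2)             # JB follows chi-square with 2 df
    return jb, p_value

# Illustrative residuals only; in practice use the residuals of your fitted model.
rng = np.random.default_rng(0)
e = rng.normal(size=200)
print(jarque_bera(e))   # a large p-value means we do not reject normality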
EXCEL APPLICATION
Regression analysis can be done by
Excel with the help of Data Analysis.
Select Tools on the menu bar and choose
Data Analysis from the drop down
menu. Next, select Regression from the
Data Analysis dialog box. Enter the
dependent variable into Input Y Range.
Enter the independent variables in Input
X Range. Select Labels, if you have
labels for series. If you want regression
through origin, click Constant is Zero
otherwise leave it blank. Excel also provides residuals, standardized residuals, residual plots and line fit plots. If you are interested in these items, select them from the Regression dialog box. Click OK.

The standard excel regression output


gives coefficient of determination (R2),
Standard error of estimate and an
ANOVA table. It also provides F
statistic to assess the overall
significance and t-statistic to assess the
significance of the marginal contribution of each independent variable, with associated p-values.

Step 1: Enter the data in excel as shown


below:
Step 2: Click data analysis and
Regression from the dialog box.
Step 3: After selecting regression click
OK. When you click OK the following
dialog box will appear.
Step 4: Enter the dependent variable in
Input Y Range and independent variable
in Input X range. Select labels if you
have labels.
Step 5: Click OK. When you click ok
the following regression results will
appear.
Reading the Computer Output
Given below is the Excel printout of the regression results. Since we have already studied the simple linear regression model, the table below should be comprehensible. However, it may be useful to interpret the key figures of this table.

1. The regression line is Ŷ = 21.2193 + 1.6452X. To get a point forecast, simply plug the value of X into the regression line. For example, if X = 25, the predicted value of Y is 62.3493.
2. R2 = 0.90, which means that
independent variable X explain 90
percent of total variation in
dependent variable Y.
3. Standard error of estimate is equal to
4.80. It is the dispersion around the
regression line.
4. SSR = 1755.18, SSE =184.91 and
SST=1940.1. From these figures, one
can obtain R2.
5. The standard error of intercept and
slope coefficients are 5.3920 and
0.1888, respectively. The t-statistics are 3.93 and 8.71, respectively, which shows that both coefficients are highly statistically significant.
6. The confidence interval for Y-
intercept term at 5% significance
level lies between 8.7851 and
33.6535.
7. The confidence interval for slope
coefficient at 5 % significance level
is 1.2098 and 2.0806.
EXERCISES

1. Consider the following data on Y and X.
Y: 45 60 76 95 105
X: 16 20 26 34 48
a) Fit a regression model for the
above data assuming linear
relationship between Y and X.
b) Interpret the intercept and slope
coefficient.

2. Consider the following data on two


variables, X and Y.
Y: 4 9 11 18 26
X: 1 2 3 4
5 6
a) Develop a scatter plot for these
data. What does scatter plot indicate
about the relationship between the Y
and X.
b) Estimate the regression line. Use
the estimated regression line to
predict the value of Y when X = 5.

3. Given are data on two variables, X


and Y.
Y: 1 3 5 8 15
X: 2 5 8 12
18 25
a) Estimate the regression line.
Interpret the estimated intercept and
the slope coefficient.
b) Compute SST, SSR and SSE.
c) Compute and interpret
coefficient of determination (R2).

4. Consider the following data on net


profit (Rs.Cr) and sales turnover (Rs.
Cr.)
Net
Profit: 34.12 56.45
Sales
Turnover: 280 320 300
a) Fit a regression model assuming
linear relationship between net profit
and sales turnover.
b) Interpret the intercept and slope
coefficient.
c) Compute standard error of
estimate
d) Compute SST, SSR and SSE
e) Compute R2

5. Consider the following data


Y: 36 41 53 58 71 82
X:
14 12 18 24 20 28

a) Fit a regression model assuming


linear relationship between net profit
and sales turnover.
b) Interpret the intercept and slope
coefficient.
c) Compute standard error of
estimate
d) Compute SST, SSR and SSE
e) Compute R2
f) Compute the standard errors of β̂1 and β̂2.

6. Consider the following data


Y: 136 148 172 258 280 30
X:
114 102 82 64 50 48

a) Fit a regression model assuming


linear relationship between net profit
and sales turnover.
b) Interpret the intercept and slope
coefficient.
c) Compute standard error of
estimate
d) Compute SST, SSR and SSE
e) Compute R2
f) Compute the standard errors of β̂1 and β̂2.
g) Construct confidence intervals around the estimated β1 and β2 values at 5 percent significance level.
h) Test the hypothesis that β2 = 0 at 5 percent level.
7. Consider the following data on sales
(in Rs.Cr.) and advertising expenditure
(in Rs.Cr.)
Sales:
60 80 118 146 155 160
Adv.Exp: 2.3 2.1 2.5 3 1.
a) Fit a simple linear regression
model to the above data.
b) Compute coefficient of
determination.
c) Calculate the standard error
of intercept and slope coefficient
d) Construct confidence interval
for slope coefficient at 5 percent
significance level.
e) Test the hypothesis that β2 = 0 at 1 percent significance level.
f) Construct 95 % confidence
interval to estimate the expected
value of sales when advertising
expenditure is 1.5 Cr.
g) Construct a 99% confidence
interval to estimate single value
of sales when advertising
expenditure is 2.6 Cr.
h) Check the normality
assumption of residuals.
i) Report the regression results
in standard format.
Chapter 12
Multiple Regression
INTRODUCTION
In the previous chapter, we studied the simple linear regression model, where there is one dependent and only one independent variable. In practice, this model is often
inadequate in explaining complex
economic relationships. For instance,
inflation is not only influenced by high
crude oil prices in the international
market but also by other factors such as
money supply in the economy. Suppose
one researcher wants to include money
supply as another independent variable
then we need to extend the concept of
simple regression model. When we do
this, it will become multiple linear
regression model where there is one
dependent variable and two or more
independent variables are present in the
model.

MULTIPLE REGRESSION MODEL

The simplest multiple regression model is one which contains one dependent and two independent or explanatory variables. This model can be expressed as follows:

Y = β0 + β1X1 + β2X2 + ε            (12.1)

where Y is the dependent variable, X1 and X2 are independent variables, and ε is the error term. β0, β1 and β2 are the parameters of the model. In general, a multiple regression model with k independent variables can be written as:

Y = β0 + β1X1 + β2X2 + ... + βkXk + ε            (12.2)

Our immediate problem is to find the values of β0, β1 and β2 with the help of a sample. The estimates β̂0, β̂1 and β̂2 can be found by the ordinary least squares (OLS) method, which minimizes the sum of squared errors, Σe². Symbolically,

minimize Σe² = Σ(Y − β̂0 − β̂1X1 − β̂2X2)²            (12.3)

When we differentiate (12.3) with respect to β̂0, β̂1 and β̂2 and set the derivatives equal to zero, we get the following three normal equations:

ΣY = n β̂0 + β̂1 ΣX1 + β̂2 ΣX2            (12.4)

ΣYX1 = β̂0 ΣX1 + β̂1 ΣX1² + β̂2 ΣX1X2            (12.5)

ΣYX2 = β̂0 ΣX2 + β̂1 ΣX1X2 + β̂2 ΣX2²            (12.6)

Solving equations (12.4), (12.5) and (12.6), we get the regression coefficients in deviation form as follows:

β̂1 = (Σyx1 Σx2² − Σyx2 Σx1x2) / (Σx1² Σx2² − (Σx1x2)²)            (12.7)

β̂2 = (Σyx2 Σx1² − Σyx1 Σx1x2) / (Σx1² Σx2² − (Σx1x2)²)            (12.8)

β̂0 = Ȳ − β̂1 X̄1 − β̂2 X̄2            (12.9)

where the lowercase letters denote deviations from the respective sample means. These expressions provide the OLS estimators of the population parameters β0, β1 and β2. After estimating β̂0, β̂1 and β̂2, the question arises as to how to interpret these regression coefficients.

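Rather than solving the normal equations (12.4)–(12.6) by hand, one can let a matrix least-squares routine do the work. The Python sketch below is a minimal illustration of this (not the book's Excel procedure); the tiny data set is invented purely to show the mechanics.

import numpy as np

def ols_two_regressors(y, x1, x2):
    """Least-squares estimates for Y = b0 + b1*X1 + b2*X2 + e."""
    X = np.column_stack([np.ones(len(y)), x1, x2])    # design matrix with an intercept
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)      # minimizes the sum of squared errors
    return coef                                       # (b0, b1, b2)

# Hypothetical illustration data.
y  = np.array([3.1, 4.0, 5.2, 6.1, 7.3, 8.2])
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([2.0, 1.5, 2.5, 2.0, 3.0, 2.5])
print(ols_two_regressors(y, x1, x2))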
ASSUMPTIONS OF MULTIPLE
REGRESSION MODEL

The various assumptions of the classical multiple linear regression model are as follows:
a) The parameters of the regression
model are linear.
b) The independent variables (X1)
and (X2) are considered to be fixed
in repeated samples that is,
independent variables are
nonstochastic in nature.
c) Given the values of independent
variables, the expected value of the
error term is zero.
d) Given the values of the independent variables, the variance of the error term is constant for all observations. This is called the homoscedasticity assumption.
e) No autocorrelation between any
two error terms.
f) The covariance between error
term and independent variables is
zero.
g) No multicollinearity

Interpretation of Partial Regression


Coefficients

Before we interpret, let us suppose that we want to know what determines the movement of the stock market. Some economists think that one of the factors on which stock market movement depends is the growth rate of the economy. Another factor one could think of is corporate profit. Since we want to explain stock market movement in terms of the growth rate of the economy and corporate profit, our dependent variable (Y) is the stock market and the two independent variables are the growth rate (X1) and corporate profit (X2). Now we can express our multiple regression model as follows:

Y = β0 + β1X1 + β2X2 + ε            (12.10)

where β0, β1 and β2 are the parameters of the multiple regression model, which are unknown. However, approximate values of β0, β1 and β2 can be computed with the help of a sample. For this purpose, suppose we have stock market, growth rate and corporate profit data in the form of the S&P 500, U.S. GDP and U.S. corporate profit respectively. Table 12.1 gives data on these economic variables.
Table 12.1

Year    S&P 500    U.S. GDP ($bn)    Corporate Profit ($bn)
1985. 186.84 6,053.7 87.6
1986. 236.34 6,263.6 83.1
1987. 286.83 6,475.1 115.6
1988. 265.79 6,742.7 153.8
1989. 322.84 6,981.4 135.1
1990. 334.59 7,112.5 110.1
1991. 376.18 7,100.5 66.4
1992. 415.74 7,336.6 22.1
1993. 451.41 7,532.7 83.2
1994. 460.42 7,835.5 174.9
1995. 541.72 8,031.7 198.2
1996. 670.50 8,328.9 224.9
1997. 873.43 8,703.5 244.5
1998. 1,085.50 9,066.9 234.4
1999. 1,327.33 9,470.3 257.8
2000. 1,427.22 9,817.0 275.3
2001 1,194.18 9,890.7 41.7
2002 993.94 10,048.8 36.2
2003 965.23 10,301.0 134.7
Source: Economic Report of the
President, 2008

Based on the above data given in Table


12.1, we estimated the (12.10) model
and the result is provided below as:

S&P 500 = -1524.72 + 0.25GDP +


0.95 Corporate Profits
Se = (189.22) (0.02)
(0.42)

t = (-8.05) (12.5)
(2.26)

r2 =0.89 df
=16

Based on the sample data on the stock market, gross domestic product and corporate profit, we found the values of β̂0, β̂1 and β̂2 to be -1524.72, 0.25 and 0.95 respectively. The signs of the coefficients of the GDP and corporate profit variables are positive, which is as per expectation: we expected that corporate profit and the growth of the economy positively influence the stock market. The estimated regression model says that when GDP and corporate profit are zero, the level of the S&P 500 will be -1524.72. Please note that it is not always possible to give an economic meaning to the intercept term. Therefore, if economic theory does not suggest the presence of an intercept term, we can safely ignore it in empirical work. Further, β̂1 = 0.25 can be interpreted as follows: when GDP changes by one unit (here $1 billion), the value of the S&P 500 will change by 0.25 units, corporate profit remaining the same. Similarly, when corporate profit changes by one unit, the value of the S&P 500 will change by 0.95 units, gross domestic product remaining the same. We will interpret the other figures later on.

STANDARD ERROR OF OLS


ESTIMATORS

We have estimated the unknown parameters of model (12.10), β0, β1 and β2, with sample data and found their values to be -1524.72, 0.25 and 0.95 respectively. As we know, standard errors are required for two main purposes: to establish confidence intervals and to test hypotheses about the estimated regression coefficients. In the case of a multiple regression model with two explanatory variables, the formulas for the variances and standard errors are as follows:

var(β̂1) = σ² Σx2² / [Σx1² Σx2² − (Σx1x2)²]            (12.11)

Standard error (β̂1) = √var(β̂1)            (12.12)

var(β̂2) = σ² Σx1² / [Σx1² Σx2² − (Σx1x2)²]            (12.13)

Standard error (β̂2) = √var(β̂2)            (12.14)

var(β̂0) = σ² [1/n + (X̄1² Σx2² + X̄2² Σx1² − 2 X̄1 X̄2 Σx1x2) / (Σx1² Σx2² − (Σx1x2)²)]            (12.15)

Standard error (β̂0) = √var(β̂0)            (12.16)

where the lowercase letters again denote deviations from the sample means.

As pointed out earlier, all the quantities entering into equations (12.11), (12.13) and (12.15) except σ² can be computed from the data. σ² is the population error variance, which can be estimated as follows:

σ̂² = Σe² / (n − 3)

Since there are three parameters to be estimated in model (12.10), the degrees of freedom are now given by (n − 3). In the simple regression model, the number of parameters was two and that is why the degrees of freedom were (n − 2). Thus, in a model with k parameters, the degrees of freedom will be (n − k).
MULTIPLE COEFFICIENT OF
DETERMINATION (ADJUSTED R2)
In the simple regression model, we discussed the concept of the coefficient of determination, denoted by r2. It measures the percentage of total variation in the dependent variable (Y) explained by the independent variable (X). This concept can easily be extended to a multiple regression model containing two or more independent variables. The proportion of total variation in the dependent variable (Y) that is explained by the variables (X1) and (X2) jointly is given by the multiple coefficient of determination, which is denoted by R2. It is the ratio of the regression sum of squares to the total sum of squares:

R2 = SSR / SST

or

R2 = 1 − SSE / SST            (12.18)

R2, like r2, lies between 0 and 1. If the value of R2 is 1, the estimated regression line explains 100% of the variation in the dependent variable (Y). However, if its value is 0, the estimated regression model is unable to explain any of the variation in the dependent variable (Y). The higher the value of R2, the better the fit of the estimated model.

An important property of R2 is that when the number of explanatory variables increases, R2 invariably increases and never declines. In other words, when an additional explanatory variable is added to the model, R2 never decreases. Thus, the larger the number of explanatory variables, the smaller the error sum of squares (SSE) tends to become. So, at least in theory, it is possible to construct a model with n − 1 independent variables that makes SSE = 0, so that R2 = 1. Hence, R2 is adjusted for the number of independent variables in the model, i.e. it imposes a penalty, in terms of degrees of freedom lost, for each independent variable that is included in the model. Adjusted R2 is denoted by R̄2 and defined as:

R̄2 = 1 − (1 − R2) (n − 1) / (n − k)            (12.19)

Adjusted R2 can be negative, and will always be less than or equal to R2. Econometricians differ on whether to report R2 or R̄2 along with regression results. We would advise students to treat it as simply another summary statistic. In our stock market example, R2 is 0.89, which implies that the independent variables gross domestic product and corporate profit jointly explain 89% of the variation in the movement of the S&P 500.

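As a quick numerical check of the adjustment in (12.19), the small Python sketch below applies the formula to the stock market example's figures (R2 = 0.89, n = 19 observations, k = 3 parameters); the function name is ours.

def adjusted_r2(r2, n, k):
    """Adjusted R-squared: 1 - (1 - R^2) * (n - 1) / (n - k), with k parameters."""
    return 1 - (1 - r2) * (n - 1) / (n - k)

print(round(adjusted_r2(0.89, 19, 3), 3))   # about 0.876 for the stock market example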
Can we compare two regression models based on the coefficient of determination? The answer is yes. However, the sample size and the dependent variable must be the same. One more point to note is that we must avoid always maximizing R̄2. Very often researchers try to choose the model which gives a high R̄2, indicating a better fit. However, it must be remembered that the objective of regression analysis is not to obtain a high R̄2 but to estimate dependable values of the population parameters and to draw statistical inferences about them.
EVALUATING CONTRIBUTION OF
INDIVIDUAL INDEPENDENT
VARIABLES: THE T TEST

In the last section, we discussed that the proportion of variation in Y that is jointly explained by the variables (X1) and (X2) can be known by looking at R2. But how can we know whether a particular independent variable is statistically significant or not? The statistical significance of the independent variables in a multiple regression model is ascertained from the computed t values. The t value of each regression coefficient is obtained by dividing the estimated regression coefficient by its standard error, under the null hypothesis that the corresponding population parameter is zero. In the stock market example, the estimated regression coefficient of corporate profit is β̂2 = 0.95 and its standard error is 0.02. Under the null hypothesis that β2 = 0, the t-statistic for this coefficient is given as follows:

t = β̂2 / se(β̂2) = 10.38
Thus, the computed t value for the estimated regression coefficient is 10.38. Similarly, we can find the t-statistics of the other regression coefficients. After computing the t-statistic, how can we decide whether the corporate profit variable is statistically significant in explaining the variation in Y? A simple rule is: if the computed value of the t-statistic is 2 or greater in absolute terms, then that particular variable is statistically significant at the 5 percent level of significance. Since the computed t-statistic of β̂2 is 10.38, the corporate profit variable is statistically significant at the 5% significance level.
We can also test hypotheses about individual regression coefficients by the t-test. Let us test the hypothesis that β1 = 0.20. The procedure of testing is as follows:

a) Frame the null and alternate hypotheses as
H0: β1 = 0.20
H1: β1 ≠ 0.20

b) Choose an appropriate level of significance: let us take α as 0.01, or 1%.

c) Choose an appropriate test: since in this case we do not know the variance of the population, β̂1 follows the t-distribution with n − 3 degrees of freedom.

d) Compute the t-statistic as follows:

t = (β̂1 − β1) / se(β̂1) = (0.25 − 0.20) / 0.02 = 2.5

e) Compare the computed t value with the critical t value and take the decision: since the calculated t value is less than the critical t(0.005, 16) value of 2.92, do not reject the null hypothesis.
Similarly, we can conduct hypothesis tests on the other regression coefficients.

Evaluating the Overall Fit of the Model: The F-Statistic
In the last section, we tested the significance of individual regression coefficients by the t-test, under the null hypothesis that each true population regression coefficient was individually zero. Now we will consider the following hypothesis:

H0: β1 = β2 = 0            (12.20)

The null hypothesis given in (12.20) is a joint hypothesis that β1 and β2 are jointly zero. This joint hypothesis is called the test of overall significance of the estimated regression line. It can be tested by the F-statistic, which is defined as follows:

F = [SSR / (k − 1)] / [SSE / (n − k)]            (12.21)

where SSR = Sum of squares due to


regression
SSE = Sum of squares due to
error
n = total number of
observations
k = number of parameters in the
model

The decision rule is: when the computed F is greater than the critical F, reject the null hypothesis. Let us illustrate the procedure of joint hypothesis testing using the stock market example. We computed the following values required for this test:

SSR = 26,11,134
SSE = 2,93,898.1
n = 19
k = 3

Putting these values into (12.21), we get:

F = (26,11,134 / 2) / (2,93,898.1 / 16) = 71.07

We will now compare this computed value of 71.07 with the critical F value. Our numerator degrees of freedom are 2 and denominator degrees of freedom are 16. At the 5% significance level the critical F value is 3.63. Since the computed F value is greater than the critical F value, we reject the null hypothesis that β1 and β2 are simultaneously equal to zero. This implies that at least one slope coefficient is not zero.

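The overall F test is easy to reproduce from the sums of squares quoted above. The Python sketch below (an illustration only, assuming scipy is available) recomputes the statistic and compares it with the 5% critical value.

from scipy import stats

ssr = 2611134.0      # regression sum of squares (written 26,11,134 in the text)
sse = 293898.1       # error sum of squares (written 2,93,898.1 in the text)
n, k = 19, 3         # observations and number of parameters

f_calc = (ssr / (k - 1)) / (sse / (n - k))                   # about 71.07
f_crit = stats.f.ppf(0.95, k - 1, n - k)                     # about 3.63
print(round(f_calc, 2), round(f_crit, 2), f_calc > f_crit)   # reject the joint null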
Relationship between R2 and the F-statistic

In the case of a multiple regression model (with an intercept), if the random error term is normally distributed then the null hypothesis

H0: β1 = β2 = ... = βk−1 = 0

can be tested by the F-statistic, which is given by the following formula:

F = [R2 / (k − 1)] / [(1 − R2) / (n − k)]

with (k − 1) degrees of freedom in the numerator and (n − k) degrees of freedom in the denominator.

The relation between R2 and the F-statistic can be established as follows. Let us write

F = [SSR / (k − 1)] / [SSE / (n − k)]

or

F = [(n − k) / (k − 1)] × SSR / (SST − SSR)

or, dividing the numerator and denominator by SST,

F = [R2 / (k − 1)] / [(1 − R2) / (n − k)]            (12.22)

The above equation (12.22) shows that there is an intimate relationship between R2 and the F-test. They are directly related to each other. When R2 = 0, F is also zero, and when R2 = 1, F is infinite. Thus, it is clear that testing the overall significance of the regression is equivalent to testing the null hypothesis that R2 is zero.
We have discussed the significance of individual regression coefficients by the t-test, as well as overall significance, i.e. whether all slope coefficients are jointly zero, by the F-test. We must bear in mind that these two tests are different. It is possible that the null hypothesis that a particular slope coefficient is zero cannot be rejected, and yet the null hypothesis that all the slope coefficients are jointly zero is rejected.

Testing Marginal Contribution of an


Independent Variable

In the previous sections, we established


that GDP(X1) and corporate profit (X2)
were statistically significant variables in
explaining the movement of S&P 500
(Y) based on t tests. We have also found
by testing overall significance that both
the variables GDP(X1) and corporate
profit (X2) jointly have significant effect
on the S&P 500. The R2 was found to be 0.89. But it is very difficult to say what part of 0.89 is due to GDP and what part is due to the corporate profit variable. This problem arises because the GDP and corporate profit variables may be correlated with each other.

Now to find the marginal contribution of


an independent variable, we will
introduce Corporate profit and GDP
variables in a sequential manner. We
will first regress S&P 500 on Corporate
profit only. The result is given below as:

S&P500 = 322.46 + 2.34Corporate


profit

t = (1.85) (2.17)

r2 = 0.21 df =17

The above results show that corporate


profit variable is statistically significant.
However, the overall fit as indicated by
r2 is only 0.21 which implies that only
21% variation in S&P 500 is explained
by corporate profit.

Can we improve this model further? Suppose a researcher thinks that gross domestic product (GDP) is another variable which also impacts the S&P 500. If the addition of GDP increases SSR significantly in relation to SSE, so that R2 rises, such a contribution is called the marginal or incremental contribution of an independent variable. To assess the marginal or incremental contribution of the GDP variable, the following F-statistic is used:

F = [(SSRnew − SSRold) / number of new regressors] / [SSEnew / (n − k)]            (12.23)

where k is the number of parameters in the new model. The above F-statistic can also be written in terms of R2.
In our example the required values are:

SSRnew = 26,11,134
SSRold = 6,30,772
SSEnew = 2,93,898.1
Number of new regressors = 1
Number of parameters in the new model = 3

F = [(26,11,134 − 6,30,772) / 1] / [2,93,898.1 / 16] = 107.81

Thus, the computed F of 107.81 is statistically highly significant, suggesting that the inclusion of the GDP variable in the model significantly increases SSR and hence the coefficient of determination. The above F-test gives a formal way of deciding whether a variable should be added to the regression model.

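The incremental F test in (12.23) can be verified in the same way. The Python sketch below plugs in the SSR and SSE figures quoted above; the variable names are ours.

from scipy import stats

ssr_new, ssr_old = 2611134.0, 630772.0   # SSR with and without GDP in the model
sse_new = 293898.1                       # SSE of the larger model
n, k_new, q = 19, 3, 1                   # observations, parameters in new model, regressors added

f_calc = ((ssr_new - ssr_old) / q) / (sse_new / (n - k_new))   # about 107.8
f_crit = stats.f.ppf(0.95, q, n - k_new)                       # about 4.49
print(round(f_calc, 2), f_calc > f_crit)                       # GDP adds significant explanatory power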
Thumb Rule: When the adjusted R2 increases with the inclusion of a variable, the variable is retained in the model even if it does not reduce the error sum of squares (SSE) substantially.

Solved Example
Tata Motors is a major automobile
manufacturer in India. The company is
interested to know how octane level
present in gasoline and weight of the car
impact the mileage. A random sample of 12 cars was taken to study this relationship. The following table gives
data on mileage (Y), weight of the car
(X1) and Octane level (X2). Answer the
following questions.
Y X1(tons) X2
16.5 3.4 88
14.6 4.1 90
21.8 2.5 94
15.4 2.8 86
18.4 5.6 86
20.2 8.2 95
25.4 4 98
16.6 6.5 92
13.5 4.5 84
18.7 4.8 96
14.8 3.2 81
21.6 5.5 92

a) Fit a multiple linear regression


model to the above data.
b) Compute coefficient of
determination and adjusted R2
c) Calculate the standard error of
slope coefficients
d) Construct confidence interval
for slope coefficients at 5 percent
significance level.
e) Test the hypothesis that 2= 1
at 1 percent significance level.
f) Test the significance of overall
fit of the model.
g) Test the marginal contribution
of each independent variables in the
model.
h) Report the regression results in
a standard format.
Solution
a) The multiple linear regression model for the above data is specified as follows:

Y = β0 + β1X1 + β2X2 + ε

where Y = mileage
X1 = weight of the car
X2 = octane level
β0, β1 and β2 are the parameters of the model
ε is the random disturbance term.
Thus, the estimated regression line is given as follows:

b) For finding coefficient of


determination which tells us
percentage of variation in the
dependent variable that is explained
by independent variables, we have to
compute SSR, SST and SSE.

(Y − Ȳ)²    (Ŷ − Ȳ)²
2.6244 0.844242
12.3904 0.000933
13.5424 7.042729
7.3984 3.580972
0.0784 6.597604
4.3264 3.372392
52.9984 20.50639
2.3104 0.324146
21.3444 11.70559
0.3364 10.34756
11.0224 22.89639
12.1104 0.657479
SST =140.48 SSR =87.87

The coefficient of determination, denoted by R2, is given as follows:

R2 = SSR / SST = 87.87 / 140.48 = 0.62

Thus, R2 is 0.62, which shows that 62% of the variation in mileage is explained by the weight of the car and the octane level.

In the context of the multiple regression model, R2 is adjusted for the degrees of freedom lost:

R̄2 = 1 − (1 − 0.62) (12 − 1) / (12 − 3) = 0.54

Thus, after adjusting for degrees of freedom, 54 percent of the variation in mileage can be explained by the independent variables weight of the car and octane level of the gasoline.
c) The standard error of slope
coefficients are computed as follows:

d) For constructing confidence


interval around slope coefficient, we
require t-table value at stated level of
significance. For 9 degrees of freedom,
t-value at 5 percent significance level
is 2.262.

Thus, in the long run, in 95 out of 100 cases the true β1 will lie between -1.29 and 0.81. Similarly, a confidence interval around the second slope coefficient can be constructed.
e) To test the hypothesis that β2 = 1, we will use the t-statistic, which is given as follows:

t = (β̂2 − 1) / se(β̂2)

For 9 degrees of freedom, the critical t-value at the 1 percent significance level is 3.250. Since the calculated t-value is less than the critical t value in absolute terms, we cannot reject the null hypothesis. Thus, the β2 value is not statistically different from 1.
f) Testing the overall fit of the model involves testing the following hypothesis:
H0: β1 = β2 = 0
H1: at least one of them is not zero

The above hypothesis can be tested by the F statistic, which is given as follows:

F = [SSR / (k − 1)] / [SSE / (n − k)] = [87.87 / 2] / [52.61 / 9] = 7.51

The critical F value at the 5% significance level is 4.26. Since the calculated F is greater than the critical F, we reject the null hypothesis.
g) The marginal contribution of each independent variable can be judged from its t-statistic. The t-statistics of the independent variables (X1) and (X2) are -0.51 and 3.8 respectively. It is clear that (X1) is not significant at the 5% level, while (X2) is significant at the 1 percent significance level.

h) The standard format of reporting


regression results is given as follows:

Se = (12.74) (0.43)
(0.1385)
t = (-2.44) (-0.51)
(3.80)

R2 = 0.62 adjusted R2=


0.54 F= 7.51 (0.00) df =
9

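As with the simple-regression example, the mileage model can be cross-checked programmatically. The Python sketch below (a verification aid only, not part of the book's Excel workflow) refits the model to the twelve observations in the table and reports R2 and adjusted R2, which should come out close to the 0.62 and 0.54 reported above.

import numpy as np

# Mileage (Y), weight of the car in tons (X1) and octane level (X2) from the table above.
Y  = np.array([16.5, 14.6, 21.8, 15.4, 18.4, 20.2, 25.4, 16.6, 13.5, 18.7, 14.8, 21.6])
X1 = np.array([3.4, 4.1, 2.5, 2.8, 5.6, 8.2, 4.0, 6.5, 4.5, 4.8, 3.2, 5.5])
X2 = np.array([88, 90, 94, 86, 86, 95, 98, 92, 84, 96, 81, 92], dtype=float)

X = np.column_stack([np.ones(len(Y)), X1, X2])     # design matrix with an intercept
b, *_ = np.linalg.lstsq(X, Y, rcond=None)          # OLS coefficients b0, b1, b2

y_hat = X @ b
sst = np.sum((Y - Y.mean()) ** 2)
ssr = np.sum((y_hat - Y.mean()) ** 2)
r2 = ssr / sst
n, k = len(Y), X.shape[1]
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k)

print(np.round(b, 4))                   # intercept and the two slope coefficients
print(round(r2, 2), round(adj_r2, 2))   # compare with the 0.62 and 0.54 reported above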
POLYNOMIAL REGRESSION MODELS
We discussed in the last chapter how to deal with non-linearity. Here we will discuss non-linearity in the context of multiple regression models. When the dependent variable (Y) is non-linearly related to the independent variable (X), one can fit a polynomial regression model. In a polynomial regression model, powers of the independent variable are used. A polynomial regression model of kth order is given as follows:

Y = β0 + β1X + β2X² + ... + βkX^k + ε

where the independent variable (X) is raised from the first power through the kth power. If the regression model consists of the independent variable (X) raised only to its first power, it becomes the simple regression model given below:

Y = β0 + β1X + ε

The above model is also known as a first-degree polynomial model. A second-degree polynomial regression model is one in which the independent variable (X) is raised to its second power, and is specified as:

Y = β0 + β1X + β2X² + ε            (12.26)

Model (12.26) is called the quadratic model, which has wide application in economics, particularly for cost and production functions. Let us consider the relationship between marginal cost and quantity of output, and between total cost and quantity of output, as shown in figures 12.1 and 12.2 respectively.

Fig.12.1: U-shaped marginal cost curve

Fig. 12.2: total cost curve


How do we characterize these relationships in econometrics? To fit a curve through scatter points which first declines and then rises, as indicated by fig. 12.1, we can go for a quadratic model as in (12.26). The cubic relationship depicted in fig. 12.2 can be captured by a third-degree polynomial regression model.

Interaction Effects
Multiple regression is a very flexible tool that can be used for various purposes. One such use is incorporating interaction effects in quantitative analysis. One can easily incorporate an interaction between two variables within the multiple regression framework. For example, let us consider the placement of students. A researcher thinks that the placement of a student largely depends on knowledge, as reflected by CGPA, and on the communication skills of the student. Accordingly, he specifies a multiple linear regression model assuming that the independent variables are linearly related to the dependent variable:

Placement = f(CGPA, Communication skills)

Assuming a linear relationship between the independent variables and the dependent variable, the above function can be written as follows:

Y = β0 + β1 CGPA + β2 Communication + ε

The above multiple regression model says that the placement of students can be largely explained in terms of CGPA and communication skills. The researcher also thinks that these variables may jointly affect the prospects of placing students in a business school. So now the question before him is how to incorporate the interaction effect of CGPA and communication skills on placement. This can be done very easily in the multiple regression framework by simply introducing another independent variable, the product of the two, as shown below:

Y = β0 + β1 CGPA + β2 Communication + β3 (CGPA × Communication) + ε

In the above model, β3 gives the combined (interaction) effect of CGPA and communication skills on placement.

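The sketch below shows how the interaction term is constructed in practice: it is literally the product of the two regressors, added as an extra column before fitting. The data here are invented purely for illustration; only the mechanics matter.

import numpy as np

# Hypothetical illustration: CGPA, a communication-skills score and a placement index.
cgpa  = np.array([6.5, 7.2, 8.1, 8.8, 9.1, 7.8, 6.9, 8.4])
comm  = np.array([5.0, 6.5, 7.0, 8.5, 9.0, 6.0, 5.5, 8.0])
place = np.array([4.8, 6.1, 7.4, 9.0, 9.6, 6.6, 5.2, 8.3])

interaction = cgpa * comm                       # the extra regressor capturing the joint effect
X = np.column_stack([np.ones(len(place)), cgpa, comm, interaction])

b, *_ = np.linalg.lstsq(X, place, rcond=None)   # b[3] estimates the interaction coefficient
print(np.round(b, 3))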
FORECASTING WITH MULTIPLE


REGRESSION
Forecasting by using multiple regression
model is just an extension of simple
linear regression. The estimated multiple
regression model can be used for mean
prediction and individual prediction. Let us forecast the value of the S&P 500 using the estimated model (12.10), when the values of U.S. GDP and corporate profit are $10,000 billion and $150 billion respectively. Simply put these figures into the estimated model:
S&P 500 = -1524.72 + 0.25(10,000) + 0.95(150)
= 1117.78

Thus, the mean predicted level of S&P


500 will be 1117.78 when US GDP and
corporate profit are 10,000 and 150 respectively. As the formulas for computing the variance and standard error of the forecast value are complicated, we restrict this discussion to the point estimate only. However, one can find the variance and standard error more easily by the matrix method, which is beyond the scope of this book.
Excel Application

Regression analysis can be done by


Excel with the help of Data Analysis.
Select Tools on the menu bar and choose
Data Analysis from the drop down
menu. Next, select Regression from the
Data Analysis dialog box. Enter the
dependent variable into Input Y Range.
Enter the independent variables in Input
X Range. Here independent variable X
range may include a number of columns.
Excel decides the number of explanatory
variables from the number of columns
entered in Input X Range. Select
Labels, if you have labels for series. If
you want regression through origin, click
Constant is Zero otherwise leave it
blank. Excel also provides residuals, standardized residuals, residual plots and line fit plots. If you are interested in these items, select them from the Regression dialog box. Click OK.

The standard excel regression output


gives coefficient of determination (R2),
Standard error of estimate and an
ANOVA table. It also provides F
statistic to assess the overall
significance and t-statistic to assess the
significance of marginal contribution by
independent variables with associated
p-values.
Step 1: Enter the data in excel as shown
below:
Step 2: Go to data analysis. When you
click data analysis the following dialog
box will come:
Step 3: Select regression and click ok.
When you click ok the following dialog
box will appear.
Step 4: Enter the dependent variable in
Input Y Range and independent variables
in Input X range that is, $B$1:$C$11.
Select labels if it has label for your
variables.
Step 5: After this when you click ok. The
following results will be given by excel.
Reading the Computer Output
Given below is the output of a multiple regression model from Excel, where Y is the dependent variable and X and P are the independent variables.
1. The regression line is Ŷ = 0.2438 + 0.011X + 0.11P. When the values of X and P are zero, on average the value of the dependent variable (Y) will be 0.2438. The interpretation of the 0.011 coefficient is that when X increases by 1 percent, on average Y will increase by 0.011 percent, keeping P constant. Similarly, keeping X constant, when P increases by 1 percent, on average Y will increase by 0.11 percent.

To get a point forecast, simply plug


the given values of X and P in the
regression line.

2. R2 = 0.52, which means that


independent variable X and P
explain 52 percent of total variation
in the dependent variable Y. The adjusted R2 is 0.39, which shows that when the degrees of freedom lost are taken into account, the two variables together explain only 39 percent of the total variation in Y.

3. Standard error of estimate is equal to


0.3066. It is the dispersion around
the regression line.
4. SSR = 0.7308, SSE =0.6581 and
SST=1.38. From these figures, one
can obtain R2.
5. The F-statistic is 3.88 and the associated p-value is 0.07, which shows that the null hypothesis of β1 = β2 = 0 cannot be rejected at the 5 percent significance level. However, it is rejected at the 10 percent significance level.

6. The standard error of intercept and


slope coefficients are 0.2888,
0.0054 and 0.1088 respectively. The t-statistics are 0.84, 2.06 and 1.08 respectively, which shows that only the variable X is significant at the 10 percent level. The other independent variable is not significant at all.
7. The confidence interval for Y-
intercept term at 5% significance
level lies between -0.43 and 0.92.
8. The confidence interval for slope
coefficient of X at 5 % significance
level is -0.001 and 0.024
9. The confidence interval for slope
coefficient of P at 5 % significance
level is -0.13 and 0.37
REVIEW EXERCISES

1. Using the following data on Y, X1 and X2:
Y 34 56 45
62 74 66
86 90
120 146
X1 12 21 18
34 43 52
65 72
80 88
X2 88 67 55
46 38 30
24 18
26 20

1. Fit a multiple linear regression


model to the above data.
2. Interpret the estimated
coefficients of the model.
2. Consider the following data on Y, X1
and X2
Y 134 256 245
262 274 310
386 490
512
X1 102 121 118
124 143 152
170 181
210
X2 78 77 65
56 48 30
26 18
12 9

1. Fit a multiple linear regression


model to the above data.
2. Interpret the estimated
coefficients of the model.
3. Compute standard error of
estimate
4. Compute SST, SSR and SSE.
5. Compute R2 and adjusted R2
3. The following table shows data on
imports, gross national product (GNP)
and time variable from 1970-71 to
2012-13.
Imports    GNP    Time
16.34 5856.72
18.25 5917.03
18.67 5901.38
29.55 6174.98
45.19 6254.37
52.65 6823.55
50.74 6910.96
60.2 7432.23
68.11 7842.97
91.43 7447.72
125.49 7985.04
136.08 8423.24
142.93 8642.88
158.32 9320.51
171.34 9674.85
196.58 10079.99
200.96 10510.71
222.44 10862.09
282.35 11936.97
353.28 12667.67
431.93 13310.4
478.51 13495.41
633.75 14226.92
731.01 15061.38
899.71 16032.64
1226.78 17200.69
1389.2 18593.7
1541.76 19432.08
1783.32 20731.4
2152.37 22385.67
2308.73 23246.81
2452 24535.91
2972.06 25519.75
3591.08 27550.56
5010.65 29490.89
6604.09 32281.77
8405.06 35348.49
10123.12 38794.57
13744.36 41332.92
13637.36 44883.14
16834.67 48822.49
23454.63 51968.48
26731.13 54491.04

1. Fit a multiple linear regression


model to the above data.
2. Compute coefficient of
determination and adjusted R2
3. Calculate the standard error of
intercept and slope coefficients
4. Construct confidence interval for
slope coefficients at 5 percent
significance level.
5. Test the hypothesis that β1 = 1 and β2 = 0 at 1 percent significance level.
6. Test the significance of overall fit
of the model.
7. Test the marginal contribution of
each independent variable in the
model.
8. Report the regression results in
standard form.

4. The following table shows data on


gross domestic savings, GNP and rate of
interest from 2000-01 to 2011-12.
Year    GDS    GNP    Long-term Interest Rate
2000-
01 5155.45 23246.81
2001-
02 5853.75 24535.91
2002-
03 6562.29 25519.75
2003-
04 8237.75 27550.56
2004-
05 10507.03 29490.89
2005-
06 12351.51 32281.77
2006-
07 14859.09 35348.49
2007-
18363.32 38794.57
08
2008-
09 18026.20 41332.92
2009-
10 21823.38 44883.14
2010-
11 26519.34 48822.49
2011-
12 27652.90 51968.48

1. Fit a multiple linear regression


model to the above data taking
gross domestic savings as
dependent variable.
2. Compute coefficient of
determination and adjusted R2
3. Calculate the standard error of
intercept and slope coefficients
4. Construct confidence interval for
slope coefficients at 1 percent
significance level.
5. Test the hypothesis that β1 = 1 and β2 = 50 at 10 percent significance level.
6. Test the significance of overall fit
of the model.
7. Test the marginal contribution of
each independent variable in the
model.
8. Report the regression results in
standard form.

5. The following table shows data on


stock return (%) dividend payout ratio,
retained earnings ratio (%) and return on
capital (%) of software industry in India
during 2012-13.

Company    Stock Return (%)    Dividend Payout (%)
Hexaware
Technology 17.36 65.23
HCL Tech 83.46 49.49
Mastek 25.38 28.64
MindTree 97.8 17.05
Persistent 68.65 23.06
Polaris 9.92 34.84
Rolta -11.84 17.56
TCS 62.21 30.3
Tech
Mahindra 63.68 11.49
Wipro 46.32 35.64
Zensar Tech -2.01 33.34

1. Fit a multiple linear regression


model to the above data taking
stock return as dependent variable.
2. Compute coefficient of
determination and adjusted R2
3. Calculate the standard error of
intercept and slope coefficients
4. Construct confidence interval for
slope coefficients at 1 percent
significance level.
5. Test the hypothesis that β1 = 0 and β3 = 1 at 5 percent significance level.
6. Test the significance of overall fit
of the model.
7. Test the marginal contribution of
each independent variable in the
model.
8. Report the regression results in
standard form.
Chapter 13
Seasonal Forecasting with
Dummy Regression
INTRODUCTION
Before dealing with dummy variables, we would like to discuss the types of variables. In general there are four types of variables: nominal scale, ordinal scale, interval scale and ratio
discussed regression analysis assuming
that dependent and independent
variables are either interval or ratio
scale variables on which all types of
statistical operations are applicable.

TYPES OF VARIABLES
There are generally four types of
variables encountered in empirical
analysis. The type of variable under
consideration plays an important role in
selecting appropriate statistical tool for
analysis. For instance, it is not
advisable to compute arithmetic mean of
nominal scale variable. These various
types of variables and its nature are
discussed as follows:
Nominal Scale Variables
Nominal scale variable is very common
in marketing or social science research.
A nominal scale divides data into
categories which are mutually exclusive and collectively exhaustive. In other words, a data point is grouped under one and only one category, and all other data points will be grouped somewhere else in the scale. The word nominal means 'name-like', which means that the numbers or codes given to objects or events only name or classify them. These
numbers have no true meaning and thus
cannot be added, multiplied or divided.
They are simply labels or identification
number. The following are examples of
nominal scales:
Gender (1) Men
(2) Women
Nationality (1) Indian (2)
American (3) Others

What kind of analysis can be done on


this type of data? We can only find the counts and percentages of items in each group. The only appropriate measure of central tendency that can be applied to such data is the mode.

Ordinal Scale Variables


Ordinal scale is one step further in
levels of measurement. Ordinal scales
not only classify data into categories but
also order them. One point to note here
is that ordinal measurement requires
transitivity assumption which is
described as: if A is preferred to B, and
B is preferred to C, then A is preferred
to C. The following is an example of
ordinal scale:

Please rank the following cars by look


from 1 to 5 with 1 being most stylish and
5 the least stylish.

Enzo Ferrari ( )
Koenigsegg CCXR ( )
McLaren F1 ( )
Bugatti Veyron ( )
Lamborghini Reventon ( )

Numbers in an ordinal scale only stand


for rank order. They do not represent
absolute quantities and the interval
between two numbers need not be equal.
In case of ordinal scaled data
mathematical operations such as
addition and division cannot be
permitted. Mode and median can be
found but arithmetic mean must not be
computed. One can also use quartile and
percentile for measuring variation.

Interval and Ratio Scale Variables


Interval scales take measurement one
step further. It has all the characteristics
of ordinal scales in addition to equal
interval between the points on the scale.
On interval scales common mathematical
operations are permitted. One can find
arithmetic mean, variance and other
statistics. Ratio scales have a meaningful
origin or absolute zero in addition to all
the features of interval scale. Since the
origin or location of zero point is
universally acceptable, we can compare
magnitudes of ratio-scaled variables.
Examples of ratio scales are income,
weights, etc. All types of statistical and
mathematical computations are permitted
on ratio-scaled variables.

NATURE OF DUMMY VARIABLES


In regression analysis so far we have
analyzed situations where both
dependent and independent variables
were either interval or ratio-scaled data.
For example, we tried to explain
movement of S&P 500 in terms of U.S.
GDP and corporate profit. In this
example, S&P 500 (Y) and corporate
profit(X2) both are ratio-scaled
variables as they have a universally
accepted origin. However, it is a known
fact that S&P 500 is not only influenced
by US GDP and corporate profit but also
by nominal scale variable such as days
of the week. Some days provide more
returns than other days or returns in
January month is different than other
months. Thus, it can be said that
dependent variables are not only
affected by ratio-scaled variables but
also by nominal variables such as
religion, race, sex, etc. For instance, if
you want to find the effect of gender on
salary drawn other things holding
constant, it can be done by dummy
variable technique.

Before including gender as another


explanatory variable in regression
model, we have to quantify our nominal variable. One can quantify such a nominal variable by constructing an artificial variable which takes only two values, 1 and 0, where 1 indicates that a person is male and 0 indicates that a person is female. Thus, variables which take only two values such as 0 and 1 are called dummy variables. They are also called categorical, indicator or binary variables in the literature.

ANOVA MODELS
Analysis of variance (ANOVA) model is
that model where the dependent variable
is quantitative in nature and all the
independent variables are categorical in
nature. We will illustrate ANOVA
models with an example. Table 7.1 gives
monthly returns of BSE Sensex for the
period May 1990-91 and December
2007.

Table 13.1: Monthly Returns of BSE Sensex
Observation    Month/Year    Return of Sensex
1 May 90 0.69
2 June 90 2.14
3 July 90 16.89
4 August 90 18.99
5 September 90 17.17
6 October 90 3.52
7 November 90 -3.53
8 December 90 -11.04
10 January 91 -14.23
11 February 91 10.47
12 March 91 7.26
13 April 91 6.31
... .
-- .. .
210 September 07 16.99
211 October 07 14.72
212 November 07 -2.39
213 December 07 4.35

To find whether there are monthly effects


in BSE Sensex returns, we used ANOVA
model as follows:

(13.1)

where Y = monthly return of the BSE Sensex
D1 = 1 if the month is June; 0 otherwise
D2 = 1 if the month is July; 0 otherwise
D3 = 1 if the month is August; 0 otherwise
D4 = 1 if the month is September; 0 otherwise
D5 = 1 if the month is October; 0 otherwise
D6 = 1 if the month is November; 0 otherwise
D7 = 1 if the month is December; 0 otherwise
D8 = 1 if the month is January; 0 otherwise
D9 = 1 if the month is February; 0 otherwise
D10 = 1 if the month is March; 0 otherwise
D11 = 1 if the month is April; 0 otherwise
The above equation is similar to a multiple regression model; the only difference is that, instead of quantitative independent variables, the regressors are categorical and take the value 1 if the observation belongs to a particular group and 0 otherwise. The results of the estimated model are as follows:

Y = -1.38 + 2.10D1 + 4.13D2 + 4.88D3 + 5.68D4 + 1.48D5 + 0.98D6 + ... + 4.88D8 + 5.69D9 + 3.48D10 + 2.09D11
t = (-0.75) (0.80) (1.58) (1.86) (2.17) (0.56) (0.37) (1.35) (1.84) (2.14) (1.31) (0.78)
R² = 0.05, df = 200

The results indicate that the mean return for the month of May is given by the intercept, that is, -1.38% over the sample period. Mean returns of the BSE Sensex for all other months are given as follows:

Mean return for June: α̂ + β̂1 = -1.38 + 2.10 = 0.72
Mean return for July: α̂ + β̂2 = -1.38 + 4.13 = 2.75

Thus, the mean returns of the BSE Sensex for June and July exceed the May return by 2.10% and 4.13%, giving mean returns of 0.72% and 2.75% respectively. Similarly, we can find the returns of the other months. In the ANOVA model discussed above, the benchmark category is May, the month for which no dummy variable is introduced. The coefficients of the dummy variables in model (13.1) are called differential intercept coefficients: each one measures by how much the mean of its category differs from the mean of the benchmark category given by the intercept. For example, the mean return for June is higher than the May return of -1.38% by 2.10%, so the actual June mean return is 0.72%. Please note that we introduced only 11 dummy variables for the twelve months. Thus, if a categorical variable has n categories, introduce only (n-1) dummy variables to avoid the dummy variable trap.
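The same month-dummy regression can be sketched in a few lines of Python, assuming the numpy package is available; the return series below is illustrative only, not the actual Sensex data.

import numpy as np

# Illustrative monthly returns ordered May, June, ..., April, May, ... (hypothetical values)
returns = np.array([0.69, 2.14, 16.89, 18.99, 17.17, 3.52, -3.53, -11.04,
                    -14.23, 10.47, 7.26, 6.31, 1.20, -0.50, 3.30, 4.10,
                    2.20, 0.90, -1.10, 2.80, 5.00, -2.00, 1.50, 0.70])
n = len(returns)
month = np.arange(n) % 12                     # 0 = May (benchmark), 1 = June, ..., 11 = April

# Design matrix: intercept plus 11 dummies; May is the omitted benchmark category
X = np.column_stack([np.ones(n)] + [(month == m).astype(float) for m in range(1, 12)])
coef, *_ = np.linalg.lstsq(X, returns, rcond=None)

print("Mean May return (intercept):", round(coef[0], 2))
print("Differential intercepts for June..April:", np.round(coef[1:], 2))
print("Mean June return:", round(coef[0] + coef[1], 2))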

ANCOVA MODELS
ANOVA models are very common in market research, sociology and psychology research. However, in economics, regression models very often contain independent variables that are quantitative as well as qualitative in nature. Such models are called analysis of covariance (ANCOVA) models. ANCOVA models are an extension of ANOVA models used to control for the effects of quantitative independent variables in a model containing both types of variables, quantitative and qualitative. We will consider the earlier example and include one quantitative regressor, the Index of Industrial Production (IIP), a proxy for GDP in India.

Yt = α + β1D1t + ... + β11D11t + γXt + ut    (13.2)

where X = IIP. The results of the above model are given below:

Y = -6.16 + 1.42D1 + .. + 0.50D11 + 0.03X
t = (-2.15) (0.56) (0.20) (2.38)
R² = 0.10, df = 199

The above model can be interpreted as follows: other things remaining the same, when the IIP rises by one unit, BSE Sensex returns increase on average by 0.03 percentage points.

DESEASONALIZATION USING DUMMY VARIABLES
The demand for woollen clothes usually increases during winter and declines in all other seasons. You must have noticed that demand for soft drinks such as Coke and Pepsi rises during summer. Similarly, sales of department stores around Diwali and Holi increase compared with other days. This regular and repetitive variation in economic variables, such as sales of refrigerators, is called seasonal variation. A time series is composed of four components: trend, seasonal oscillations, cyclical oscillations and random fluctuations. Many economic variables show patterns of seasonality. Very often the seasonal component of a time series is removed, and the process of removing seasonality is called deseasonalization. There are various methods of removing seasonality from a time series; here we discuss how seasonality is removed by the dummy variable technique. Let us consider quarterly data on gold prices for the period 1990-91 to 1999-2000. To find the seasonal effect in each quarter, we fitted the following model:

Yt = α + β1D1t + β2D2t + β3D3t + ut    (13.3)

where Y = quarterly price of gold and the Ds are dummies that take the value 1 for the relevant quarter and 0 otherwise. In the above model, we introduced only three dummy variables to avoid the dummy variable trap. The intercept gives the mean price of gold in the first quarter. Seasonality in gold prices in a given quarter can be detected from a significant t value on the dummy coefficient for that quarter. For the quarterly gold price data, we obtain the following regression results:

Y = 4450.45 + 16.64D1 + 1.06D2 + 129.96D3
t = (31.71) (0.08) (0.005) (0.65)
R² = 0.01, df = 31

The above results show that the mean price of gold in the first quarter was Rs. 4450.45 per 10 gms in India over the sample period. The mean prices of gold in the second, third and fourth quarters are higher than the first quarter by Rs. 16.64, Rs. 1.06 and Rs. 129.96 respectively. The t statistics of the dummy variables D1, D2 and D3 are less than 2, which implies that these coefficients are statistically insignificant at the 5 percent level of significance, implying no seasonality in gold prices in the second, third and fourth quarters relative to the first. Only the first quarter shows a sign of seasonality, as the t value of its coefficient is 31.71.

The next step is to obtain the deseasonalized time series. This can be done by computing the fitted value of Y from the estimated model for each observation and subtracting it from the actual Y; these differences are simply the residuals from the regression model. The residuals form the deseasonalized time series of gold prices. This series is now free from the seasonal component; however, the other components, namely trend, cycle and random oscillations, are still present in the time series.
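A minimal sketch of this deseasonalization in Python, assuming numpy and an illustrative quarterly price series (not the actual gold data):

import numpy as np

# Illustrative quarterly prices ordered Q1, Q2, Q3, Q4, Q1, ...
prices = np.array([4300.0, 4320.0, 4290.0, 4460.0, 4380.0, 4400.0,
                   4370.0, 4550.0, 4450.0, 4470.0, 4440.0, 4620.0])
n = len(prices)
quarter = np.arange(n) % 4                    # 0 = Q1 (benchmark quarter)

# Regress the series on quarter dummies, then keep the residuals as the deseasonalized series
X = np.column_stack([np.ones(n)] + [(quarter == q).astype(float) for q in (1, 2, 3)])
coef, *_ = np.linalg.lstsq(X, prices, rcond=None)
deseasonalized = prices - X @ coef            # actual Y minus estimated Y
print(np.round(deseasonalized, 2))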

PIECEWISE LINEAR REGRESSION


The dummy variable technique can be used to model a non-linear relationship with several linear segments; this is known as piecewise linear regression. For instance, it can be assumed that expenditure on luxury items is progressive in income, that is, the more you earn, the larger the share of income that goes to luxury items. Let expenditure on luxury items (Y) be related to gross household income (X) as:

Y = A + BX + ε    (13.5)

In order to transform (13.5) into a piecewise linear regression, we need to define two dummy variables that describe on which linear segment, determined by the threshold incomes X1 and X2, a household is located:

D1 =

D2 =

Next, we re-specify the intercept and the slope coefficient of (13.5) in the following way:

A = A0 + A1D1 + A2D2    (13.6)
B = B0 + B1D1 + B2D2    (13.7)

Substituting (13.6) and (13.7) into equation (13.5) gives:

Y = (A0 + A1D1 + A2D2) + (B0 + B1D1 + B2D2)X + ε    (13.8)

or

Y = A0 + A1D1 + A2D2 + B0X + B1(D1X) + B2(D2X) + ε

The estimated relations in the three income ranges are now given as follows:
a) When X < X1:   Ŷ = a0 + b0X
b) When X1 < X < X2:   Ŷ = (a0 + a1) + (b0 + b1)X
c) When X > X2:   Ŷ = (a0 + a1) + (b0 + b2)X
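A sketch of the piecewise regression in Python, assuming numpy, simulated income data and the convention that D1 = 1 when X exceeds a threshold X1 and D2 = 1 when X exceeds X2; the thresholds and data here are illustrative, since the original leaves the exact dummy definitions blank.

import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(10, 100, 200)                          # gross household income (illustrative)
Y = 2 + 0.05 * X + 0.10 * np.clip(X - 40, 0, None) \
      + 0.15 * np.clip(X - 70, 0, None) + rng.normal(0, 1, 200)

X1, X2 = 40.0, 70.0                                    # assumed knot (threshold) incomes
D1 = (X > X1).astype(float)
D2 = (X > X2).astype(float)

# Y = A0 + A1*D1 + A2*D2 + B0*X + B1*(D1*X) + B2*(D2*X) + error
Z = np.column_stack([np.ones_like(X), D1, D2, X, D1 * X, D2 * X])
a0, a1, a2, b0, b1, b2 = np.linalg.lstsq(Z, Y, rcond=None)[0]
print("Slope below X1:", round(b0, 3))
print("Slope between X1 and X2:", round(b0 + b1, 3))
print("Slope above X2:", round(b0 + b1 + b2, 3))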

Using Excel
Regression analysis can be done by
Excel with the help of Data Analysis.
Select Tools on the menu bar and choose
Data Analysis from the drop down
menu. Next, select Regression from the
Data Analysis dialog box. Enter the
dependent variable into Input Y Range.
Enter the independent variables in Input
X Range. Here independent variable X
range may include a number of columns.
Excel decides the number of explanatory
variables from the number of columns
entered in Input X Range. Select
Labels, if you have labels for series. If
you want regression through origin, click
Constant is Zero otherwise leave it
blank. Excel also provides residuals, standardized residuals, residual plots and line fit plots. If you are interested in these items, select them in the Regression dialog box. Click OK.

The standard Excel regression output gives the coefficient of determination (R²), the standard error of estimate and an ANOVA table. It also provides the F statistic to assess overall significance and t-statistics, with associated p-values, to assess the significance of the marginal contribution of each independent variable.
Using Excel
The quarterly sales data over the past
three years are as follows:

Quarter   Year 1   Year 2   Year 3
1         1698     1810     1880
2          950      920     1120
3         2630     2950     2940
4         2580     2360     2620

We will fit a dummy regression model to


find quarterly effect in sales.

Step 1: Enter the data in excel as shown


below:
Step 2: Go to data analysis. When you
click data analysis the following dialog
box will come:
Step 3: Select regression and click ok.
When you click ok the following dialog
box will appear.
Step 4: Enter the dependent variable in Input Y Range and the independent variables in Input X Range, that is, $B$1:$D$13. Select Labels if you have labels for your variables.
Step 5: When you click OK, the following results will be given by Excel.
Exercises
Q.1 Consider the sales figures of a particular company for the period 2012 to 2015 in the following table. Predict the next four quarters' sales using the dummy variable technique.
Year Q1 Q2 Q3 Q4
2012 672 636 680 704
2013 744 700 756 784
2014 828 800 840 880
2015 936 860 944 972

Q.2 Given below are data on monthly salaries, work experience and gender. Can you say that there is a gender effect in monthly salary?

Monthly Salary (Rs 000)   Experience (years)   Gender (1 = male, 0 = female)
26 3 0
54 10 1
32 2.5 0
28 5 0
18 1 1
12 0 0
60 15 1
75 16 1
85 10 0
45 3 0
67 12 1
24 4 0
30 6 0
35 8 0
55 3 1
16 1 0
Chapter 14
ARIMA Modeling
Introduction
Autoregressive Integrated Moving Average (ARIMA) modeling captures persistence, or autocorrelation, in a time series. ARIMA was introduced by Box and Jenkins in the 1970s for the analysis and forecasting of time series and is considered one of the most efficient forecasting techniques. The model is widely used for modeling economic and business time series and requires only historical data on the variable of interest. ARIMA modeling involves three stages, viz. identification of the model, estimation and verification. Once a particular ARIMA model passes the verification stage, it can be used for forecasting.

AR, MA and ARIMA Processes


The fundamental philosophy behind ARIMA modeling is that history repeats itself. Using historical data on a particular variable, the goal is first to identify the underlying structure of the data and then to predict future values of that variable. The basic assumption is that the factors that have influenced patterns of activity in the past and present will continue to do so in more or less the same way in the future.

Autoregressive Process
An AR process expresses a time series as a linear function of its own lagged values. The simplest AR model is the first-order autoregressive, or AR(1), model shown below:

Yt = δ + φ1Yt-1 + εt    (14.1)

where Yt is the series in time t, Yt-1 is the series in the previous period, δ and φ1 are the parameters of the model and εt is the random error term. The order of the AR model shows how many lagged past values are included in the model. An AR process of order 2 can be written as follows:

Yt = δ + φ1Yt-1 + φ2Yt-2 + εt    (14.2)

In general, an AR process of order p can be specified as:

Yt = δ + φ1Yt-1 + φ2Yt-2 + ... + φpYt-p + εt    (14.3)

In autoregressive models, the current value is determined by previous values of the time series; there are no other independent variables explaining the variation in the dependent variable. In such models the data speak for themselves.

Given below is the procedure for constructing lagged values of food grain production in India, using data from 1990-91 to 2006-07.

Table 14.1: Current and Lagged Production of Food Grains in India
Yt Yt-1 Yt-2 Yt-3
176.39 - - -
168.38 176.39 - -
179.48 168.38 176.39 -
184.26 179.48 168.38 176.39
191.5 184.26 179.48 168.38
180.42 191.5 184.26 179.48
199.44 180.42 191.5 184.26
192.26 199.44 180.42 191.5
203.61 192.26 199.44 180.42
209.8 203.61 192.26 199.44
196.81 209.8 203.61 192.26
212.85 196.81 209.8 203.61
174.78 212.85 196.81 209.8
213.19 174.78 212.85 196.81
198.36 213.19 174.78 212.85
208.59 198.36 213.19 174.78
216.13 208.59 198.36 213.19
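This lag construction maps directly onto a least-squares AR fit. Below is a minimal Python sketch, assuming the numpy package and using only the 17 observations shown in Table 14.1, so the estimates will differ from those reported later for the longer 1950-51 sample.

import numpy as np

y = np.array([176.39, 168.38, 179.48, 184.26, 191.50, 180.42, 199.44, 192.26,
              203.61, 209.80, 196.81, 212.85, 174.78, 213.19, 198.36, 208.59, 216.13])

p = 3                                                  # order of the AR model
Y = y[p:]                                              # current values Yt
lags = [y[p - k:-k] for k in range(1, p + 1)]          # columns Yt-1, Yt-2, Yt-3
X = np.column_stack([np.ones(len(Y))] + lags)

coef, *_ = np.linalg.lstsq(X, Y, rcond=None)
print("Intercept and AR(1..3) coefficients:", np.round(coef, 4))

# One-step-ahead forecast from the three most recent observations
forecast = coef[0] + coef[1] * y[-1] + coef[2] * y[-2] + coef[3] * y[-3]
print("Next-period forecast:", round(forecast, 2))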
MA Process
A moving average (MA) process is a model where the time series is thought of as a moving average of a random error term, εt. The first-order moving average, or MA(1), model is given as:

Yt = μ + εt + θ1εt-1    (14.4)

As with AR models, higher-order MA models include higher lagged error terms. The letter q is used to denote the order of the moving average model.

The autoregressive and moving average components can be combined to arrive at what is called the autoregressive moving average, or ARMA(p, q), model, where p and q represent the number of AR and MA terms respectively. The simplest ARMA model is the ARMA(1, 1) shown below:

Yt = δ + φ1Yt-1 + εt + θ1εt-1    (14.5)

ARIMA modeling requires the time series to be stationary. If the series is stationary in its level, then we can fit an ARMA model to the time series in question.

Box-Jenkins Methodology
ARMA modeling is done in a series of well-defined steps. The first step is identification of the model. Identification involves specifying the appropriate process (AR, MA or ARMA) and its order. To identify the appropriate process and its order, the autocorrelation function (ACF) and partial autocorrelation function (PACF) are used. Sometimes identification is done by an automated iterative procedure: fitting many different possible model structures and orders and using a goodness-of-fit statistic to select the best model.
The second step is to estimate the coefficients of the model. Coefficients of AR models can be estimated by least-squares regression. Estimation of the parameters of MA and ARMA models usually requires a more complicated iterative procedure (Chatfield 1975). In practice, estimation is fairly transparent to the user, as it is accomplished automatically by a computer program with little or no user interaction.

The third step is to check the model. This step is also called diagnostic checking, or verification (Anderson 1976). Two important elements of checking are to ensure that the residuals of the model are random, and to ensure that the estimated parameters are statistically significant. Usually the fitting process is guided by the principle of parsimony, by which the best model is the simplest possible model that adequately describes the data, that is, the model with the fewest parameters.

Identification by ACF and PACF Plots


The classical method of model identification, as described by Box and Jenkins (1970), is to judge the appropriate model structure and order from the appearance of the plotted autocorrelation function and partial autocorrelation function. The autocorrelation function is a bar chart of the correlation coefficients between the current value of a time series and its lagged values. For our example of food grain production in India, the autocorrelation function is given below.
The partial autocorrelation function (PACF) is a bar chart showing the correlation between a variable and its lagged value that is not influenced by the correlations at all lower-order lags. For the food grain example, the PACF is shown below.
The identification of ARMA models from the ACF and PACF plots is difficult and requires much experience for all but the simplest models. Let's look at the diagnostic patterns for the two simplest models: AR(1) and MA(1).
The table below summarizes the typical shape of the ACF and PACF for the various ARIMA models.

Type of Model   ACF                                 PACF
AR(p)           Declines exponentially              Significant spikes through p lags
MA(q)           Significant spikes through q lags   Declines exponentially
ARMA(p, q)      Declines exponentially              Declines exponentially
The ACF of an AR(1) model declines geometrically as a function of lag. For example, the ACF of a series that follows an AR(1) model with coefficient φ1 = 0.5 is {0.5, 0.5², 0.5³, 0.5⁴} at lags 1-4. The PACF of the AR(1) process at lags k > 1 is zero, because once the AR(1) term is accounted for, no autocorrelation remains at higher lags.

In summary, the diagnostic patterns of ACF and PACF for an AR(1) model are:
ACF: declines in geometric progression from its highest value at lag 1
PACF: cuts off abruptly after lag 1
The opposite patterns apply to an MA(1) process:
ACF: cuts off abruptly after lag 1
PACF: declines in geometric progression from its highest value at lag 1
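The sample ACF and PACF used for identification can be computed directly. A sketch in Python, assuming numpy; the PACF here is obtained by a common regression-based approach, taking the last coefficient of successive AR(k) fits.

import numpy as np

def sample_acf(x, nlags):
    # autocorrelation coefficients r1..rk of a series around its mean
    x = np.asarray(x, dtype=float) - np.mean(x)
    denom = np.dot(x, x)
    return np.array([np.dot(x[:-k], x[k:]) / denom for k in range(1, nlags + 1)])

def sample_pacf(x, nlags):
    # partial autocorrelations: the coefficient on lag k in an AR(k) least-squares fit
    x = np.asarray(x, dtype=float)
    out = []
    for k in range(1, nlags + 1):
        Y = x[k:]
        X = np.column_stack([np.ones(len(Y))] + [x[k - j:-j] for j in range(1, k + 1)])
        coef, *_ = np.linalg.lstsq(X, Y, rcond=None)
        out.append(coef[-1])
    return np.array(out)

y = np.array([176.39, 168.38, 179.48, 184.26, 191.50, 180.42, 199.44, 192.26,
              203.61, 209.80, 196.81, 212.85, 174.78, 213.19, 198.36, 208.59, 216.13])
print("ACF lags 1-4: ", np.round(sample_acf(y, 4), 3))
print("PACF lags 1-4:", np.round(sample_pacf(y, 4), 3))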

Automated Identification by the FPE Criterion
Identification of ARIMA models can also be done by trial and error, using a goodness-of-fit statistic. Akaike's Final Prediction Error (FPE) and the Akaike Information Criterion (AIC) are two closely related statistical measures of the goodness of fit of an ARMA(p, q) model. Goodness of fit might be expected to be measured by some function of the variance of the model residuals: the fit improves as the residuals become smaller. Both the FPE and the AIC are functions of the variance of the residuals. Another factor that must be considered, however, is the number of estimated parameters n = p + q, because by including enough parameters we can force a model to fit any data set perfectly. Measures of goodness of fit must therefore compensate for the artificial improvement in fit that comes from increasing the complexity of the model structure. The FPE is given by

FPE = V (N + n) / (N - n)    (14.6)

where V is the variance of the model residuals, N is the length of the time series, and n = p + q is the number of estimated parameters in the ARMA model. In application, the FPE is computed for various candidate models, and the model with the lowest FPE is selected as the best-fit model. The AIC (Akaike Information Criterion) is another widely used goodness-of-fit measure, and is given by

AIC = N ln(V) + 2n    (14.7)
As with the FPE, the best-fit model has the minimum value of the AIC. Neither the FPE nor the AIC directly addresses the question of whether the model residuals are white noise. A strategy for model identification by the FPE is to iteratively fit several different models and select the model that gives approximately the minimum FPE and does a good job of producing random residuals. The checking of residuals is described in the next section.
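A sketch of this trial-and-error search in Python, assuming numpy, a least-squares AR fit, and the FPE and AIC forms written above; the series is the short food grain sample, used purely for illustration.

import numpy as np

def ar_residual_variance(y, p):
    # least-squares AR(p) fit; returns the residual variance
    Y = y[p:]
    X = np.column_stack([np.ones(len(Y))] + [y[p - k:-k] for k in range(1, p + 1)])
    coef, *_ = np.linalg.lstsq(X, Y, rcond=None)
    resid = Y - X @ coef
    return np.mean(resid ** 2)

y = np.array([176.39, 168.38, 179.48, 184.26, 191.50, 180.42, 199.44, 192.26,
              203.61, 209.80, 196.81, 212.85, 174.78, 213.19, 198.36, 208.59, 216.13])
N = len(y)
for p in (1, 2, 3, 4):
    V = ar_residual_variance(y, p)
    n = p                                   # number of estimated AR parameters (q = 0 here)
    fpe = V * (N + n) / (N - n)
    aic = N * np.log(V) + 2 * n
    print("AR(%d): FPE = %.3f, AIC = %.3f" % (p, fpe, aic))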

Checking the Model: Are the Residuals Random?
A key question in ARMA modeling is whether the model effectively describes the persistence. If so, the model residuals should be random, or uncorrelated in time, and the autocorrelation function (ACF) of the residuals should be zero at all lags except lag zero. Of course, for sample series the ACF will not be exactly zero, but it should fluctuate close to zero.

The ACF of the residuals can be examined in two ways. First, the ACF can be scanned to see whether any individual coefficients fall outside some specified confidence interval around zero. Approximate confidence intervals can be computed. The correlogram of the true residuals (which are unknown) is such that rk is normally distributed with mean

E(rk) = 0

and variance

var(rk) = 1/N

where rk is the autocorrelation coefficient of the ARMA residuals at lag k. The appropriate confidence interval for rk can be found by referring to the normal distribution. We know that the 0.975 probability point of the standard normal distribution is 1.96, so the 95% confidence interval for rk is ±1.96/√N. For the 99% confidence interval, the probability point of the normal distribution is 2.57, so the 99% CI is ±2.57/√N. An rk outside this CI is evidence that the model residuals are not random.

It should be pointed out that the correlogram of the residuals from a fitted ARMA model has somewhat different properties than the ACF of the true residuals, which are unknown because the true model is unknown. As a result, the above approximation overestimates the width of the CI at low lags when applied to the ACF of the residuals of a fitted model. At large lags, however, the approximation is close.
A different approach to evaluating the randomness of the ARMA residuals is to look at the ACF as a whole rather than at the individual rk values separately. The test is called the portmanteau lack-of-fit test, and the test statistic is:

Q = N Σ (k = 1 to K) rk²    (14.8)

This statistic is referred to as the portmanteau statistic, or Q statistic. The Q statistic, computed from the lowest K autocorrelations, say at lags k = 1, 2, ..., 20, follows a χ² (chi-square) distribution with (K - p - q) degrees of freedom, where p and q are the AR and MA orders of the model and N is the length of the time series. If the computed value of Q exceeds the value from the χ² table at some specified significance level, the null hypothesis that the series of autocorrelations represents a random series is rejected at that level.
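A sketch of the portmanteau check in Python, assuming numpy and scipy for the chi-square tail probability; the residual series here is simulated white noise, purely for illustration.

import numpy as np
from scipy.stats import chi2

def portmanteau_q(resid, K, p, q):
    # Q = N times the sum of the first K squared residual autocorrelations
    e = np.asarray(resid, dtype=float) - np.mean(resid)
    denom = np.dot(e, e)
    r = np.array([np.dot(e[:-k], e[k:]) / denom for k in range(1, K + 1)])
    N = len(e)
    Q = N * np.sum(r ** 2)
    p_value = chi2.sf(Q, df=K - p - q)        # chi-square with (K - p - q) degrees of freedom
    return Q, p_value

rng = np.random.default_rng(1)
residuals = rng.normal(size=100)              # stand-in for ARMA model residuals
Q, p_value = portmanteau_q(residuals, K=20, p=1, q=0)
print("Q =", round(Q, 2), " p-value =", round(p_value, 3))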

Solved Example

To demonstrate ARIMA modeling, consider data on food grain production from 1950-51 to 2006-07. We will adopt a trial and error method to fit an autoregressive model to the data. At the outset, we fitted an AR(3) model to the food grain data on an ad hoc basis. The fitted AR(3) model is:

Ŷt = 6.2932 + 0.2428Yt-1 + 0.5085Yt-2 + 0.2444Yt-3

where the initial year is 1953-54.

Next, we have to conduct a significance test of the highest-order parameter, which is 0.2444 with a standard error of 0.1386. The null and alternative hypotheses are:

H0: φ3 = 0
H1: φ3 ≠ 0

The above hypothesis can be tested by the t statistic:

t = φ̂3 / se(φ̂3) = 0.2444 / 0.1386 ≈ 1.7629

This calculated t value is compared with the critical t value to decide whether or not to reject the null hypothesis. The t statistic has n - 2p - 1 degrees of freedom. Thus, in our example, for 45 degrees of freedom the critical t value at the 5% level of significance is 2.00. Since the calculated t value of 1.7629 is less than the critical value of 2.00, we do not reject the null hypothesis. This implies that the coefficient of Yt-3 is not significant at the 5 percent level and could be deleted from the analysis; however, it is significant at the 10 percent level. Similarly, we can conduct significance tests of the other coefficients.

Table 8.1: Third-Order Autoregressive Model of Food Grain Production in India

As is evident from Table 8.1, all three coefficients are significant at the 10 percent level, so we retained them in our model. Using the estimates δ̂ = 6.2932, φ̂1 = 0.2428, φ̂2 = 0.5085 and φ̂3 = 0.2444 and the most recent observation on food grain production, Y53 = 216.13, the forecast for the year 2007-08 is obtained by substituting the three most recent observations into the fitted equation.

To forecast two periods ahead, we use the same equation, replacing the most recent observation with the one-step-ahead forecast just obtained and shifting the other lags accordingly. Finally, the food grain forecast for the year 2009-10 is obtained by carrying the recursion one step further, as sketched below.
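The recursion can be written out explicitly. A sketch in Python using the coefficient estimates reported above and, as an assumption, the last three observations from Table 14.1 (Y51 = 198.36, Y52 = 208.59, Y53 = 216.13):

# AR(3) recursive forecasting with the reported estimates (no external packages needed)
alpha, b1, b2, b3 = 6.2932, 0.2428, 0.5085, 0.2444
history = [198.36, 208.59, 216.13]            # assumed Y51, Y52, Y53

for year in ("2007-08", "2008-09", "2009-10"):
    forecast = alpha + b1 * history[-1] + b2 * history[-2] + b3 * history[-3]
    print(year, "forecast:", round(forecast, 2))
    history.append(forecast)                  # multi-step forecasts reuse earlier forecasts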
The residuals from the estimated AR(3) model are plotted in Figure 8.2.

Figure 8.2: Residuals

The figure shows that the residuals are randomly distributed, which indicates that our fitted model is appropriate. However, it is also clear from Figure 8.2 that a few outliers are present in the data set; removing them may produce more accurate forecasts. ARIMA modeling is a sophisticated forecasting technique that usually requires a statistical software package.

Illustration Using EViews Software


An example of ARIMA modeling estimated with the statistical software EViews is given below. ARIMA modeling starts with graphing the data. The time series of aviation fuel data (FAD) is plotted below.

Fig.: Fuel Aviation Data

There does not appear to be any obvious seasonal pattern. However, the graph gives the impression that there is a trend in the data, which we can check with the correlogram, i.e. the autocorrelation function (ACF) and partial autocorrelation function (PACF). We can then check the correlogram of the first differences. We would not expect the simple integrated model to eliminate all the autocorrelation in the data. Because the ACF drops off quickly while the PACF persists, we suspect that an MA process is indicated; if it were an ARIMA(1,1,0), we would expect the correlogram of the differences to show a much longer memory process.

To estimate the MA(1) in the differences with a trend, that is an ARIMA(0,1,1), in EViews we specify the equation:

d(fad) c ma(1)

The estimated parameter is significant. Note that fit statistics such as the adjusted R-squared refer to the differences. The correlogram of the residuals indicates that they are white noise.

Alternatively, we could have fitted an AR(1) to the differenced data, an ARIMA(1,1,0), which is specified in EViews as:

d(fad) c ar(1)

The AIC (Akaike Information Criterion) is 3.299 for the AR model and 3.269 for the MA model; the best model is the one with the lowest value of this criterion. We could also have tried both terms by specifying an ARIMA(1,1,1) as:

d(fad) c ar(1) ma(1)

Note that the AR coefficient is then estimated as insignificant, with the MA coefficient only marginally significant at the 90% level.
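Readers without EViews can reproduce the same kind of comparison with the statsmodels package in Python. A sketch, using a simulated random-walk series as a stand-in for the FAD data:

import numpy as np
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(0)
fad = 100 + np.cumsum(rng.normal(1.0, 5.0, 120))   # illustrative stand-in for the fuel series

for order in [(1, 1, 0), (0, 1, 1), (1, 1, 1)]:    # candidate ARIMA(p, d, q) specifications
    result = ARIMA(fad, order=order).fit()
    print(order, "AIC:", round(result.aic, 2))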

Distributed-Lag Models
In economics, we often talk about the long run and the short run. Time in economics is not defined in terms of days or weeks but in terms of how quickly supply adjusts to demand, and supply seldom adjusts to demand instantaneously. Many economic variables take time to influence other economic variables. For example, one of the functions of the central bank of any country is to maintain price stability in the economy, in simple words, keeping inflation under control. One way to control inflation is to tighten the money supply. However, tightening the money supply will not curb inflation immediately; it may take some time for the effect to show. Thus, inflation responds to a tightening of the money supply with a lapse of time, which is called a lag in economics.

How do we incorporate lags in regression models? Consider the following model:

Yt = α + βXt + γYt-1 + ut    (14.9)

where lagged values of the dependent variable are included as explanatory variables. This is called an autoregressive or dynamic model, which has already been discussed. Another example of a regression model containing lagged variables is:

Yt = α + β1Xt + β2Xt-1 + β3Xt-2 + ut

The above regression model contains current and lagged values of X as independent variables. Such a model is called a distributed-lag model; it captures a delayed response, such as the response of inflation to a change in the money supply that is spread out over a number of time periods. In the above model, the coefficient β1 is called the short-run multiplier because it gives the change in the expected value of the dependent variable (Y) due to a unit change in the independent variable (X) in the same time period; (β1 + β2) gives the change in the mean value of Y in the next period. The sum of all the partial coefficients is called the long-run multiplier.

Estimation of Distributed-Lag Models


For simplicity, we will consider a regression model where Y depends on the current and lagged values of a single explanatory variable:

Yt = α + β1Xt + β2Xt-1 + β3Xt-2 + ... + ut

How do we find the values of α and the βs? There are two approaches: (1) ad hoc estimation and (2) the Koyck approach. We restrict our analysis to ad hoc estimation of the distributed-lag model to avoid complexity.

Ad hoc estimation of the distributed-lag model was suggested by F. F. Alt and J. Tinbergen. They suggested a sequential approach to estimating the above regression model: first regress Yt on the current value Xt, then regress Yt on the current value and one lagged value of X, that is, Xt and Xt-1, and so on. One should stop this sequential process when the coefficients of the lagged regressors either become statistically insignificant or the sign of at least one coefficient changes from positive to negative or vice versa.
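A sketch of this sequential procedure in Python, assuming numpy and simulated data in which Y truly depends on Xt and Xt-1; the data and variable names are illustrative only.

import numpy as np

rng = np.random.default_rng(2)
x_full = rng.uniform(50, 150, 61)
y = 200 + 1.4 * x_full[1:] + 0.6 * x_full[:-1] + rng.normal(0, 10, 60)
x = x_full[1:]                                  # current-period X aligned with y

def ols_t(Y, cols):
    # ordinary least squares with conventional t statistics
    X = np.column_stack([np.ones(len(Y))] + list(cols))
    coef, *_ = np.linalg.lstsq(X, Y, rcond=None)
    resid = Y - X @ coef
    s2 = resid @ resid / (len(Y) - X.shape[1])
    se = np.sqrt(np.diag(s2 * np.linalg.inv(X.T @ X)))
    return coef, coef / se

# Regress Yt on Xt, then add Xt-1, then Xt-2; stop when the added lag is insignificant
for nlags in (0, 1, 2):
    Y = y[nlags:]
    cols = [x[nlags - k: len(x) - k] if k > 0 else x[nlags:] for k in range(nlags + 1)]
    coef, t = ols_t(Y, cols)
    print(nlags, "lag(s): t statistics =", np.round(t, 2))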

Solved Example
To demonstrate the mechanics of the distributed-lag model, consider the Wilmore quarterly data on sales and advertisement from Winter 1994 to Fall 2004, as given in Table 7.2. We will estimate a distributed-lag model by the ad hoc approach using these data.

Table 7.2: Wilmore Quarterly Data on Sales and Advertisement
Quarter Year Sales
Winter 1994 507000
Spring 1994 500000
Summer 1994 710000
Fall 1994 893000
Winter 1995 888000
Spring 1995 1490000
Summer 1995 2121000
Fall 1995 2300000
Winter 1996 2506000
Spring 1996 3501000
Summer 1996 3672000
Fall 1996 3266000
. . .
. . .
Winter 1997 3382000
Spring 1997 4503000
Summer 1997 4196000
Fall 1997 4422000
Winter 2003 6132000
Spring 2003 6138000
Summer 2003 6145000
Fall 2003 6232000
Winter 2004 6310000
Spring 2004 6289000
Summer 2004 6391000
Fall 2004 6418000

Source:
www.ciadvertising.org/SA/spring_05/adv
First, we regressed sales on advertisement, and the results are as follows:

Salest = 410701.7 + 1.39advt
t = (2.45) (26.35)
r² = 0.94, df = 42

The estimated model says that when advertisement expenditure is nil, the expected sales of Wilmore are 410,701 cases. Advertisement is a statistically significant determinant of sales, and this variable explains 94 percent of the variation in sales.

However, we know that advertisement influences sales with a lag. Let us say advertising expenditure in the first quarter starts influencing sales from the second quarter. To handle such situations, we specified a distributed-lag model of sales as:

Salest = α + β1advt + β2advt-1 + ut    (8.12)

That is, we regressed Yt on Xt and Xt-1. We estimated model (8.12) and the results are:

Salest = 468253.2 + 1.67advt - 0.30advt-1
t = (2.46) (2.78) (-0.51)
R² = 0.94, df = 41
It is evident from the results that the coefficient of advertisement lagged one quarter is statistically insignificant, so we should stop the sequential process here. Nevertheless, we went further and estimated the following regression model:

Salest = α + β1advt + β2advt-1 + β3advt-2 + ut    (8.13)

The results are as follows:

Salest = 479472.1 + 1.6advt + 0.34advt-1 - 0.58advt-2
t = (1.80) (2.11) (0.48) (-0.86)
R² = 0.93, df = 40

Notice that the sign of the coefficient of advt-1 changed when we estimated model (8.13). From the three models, it is clear that the first model explains sales best, since the lagged advertisement variables are statistically insignificant.

Granger Causality
At the outset, we categorically said that regression does not mean causation. Regression analysis helps in explaining one variable in terms of other variables, but it does not imply causation. However, in the case of time series data it is possible to interpret regression in terms of cause and effect. Time does not run backward: if event A occurs before event B, then A may cause B, while B cannot cause A. This is the basic idea behind Granger causality. Very often we try to find whether GDP causes money supply or money supply causes GDP. A similar case is to determine whether growth causes inflation or inflation causes growth. Such issues in macroeconomics can be settled by the Granger causality test. This test involves estimation of the following pair of regressions:

GDPt = Σ αi GDPt-i + Σ βj Mt-j + u1t    (8.6)
Mt = Σ λi Mt-i + Σ δj GDPt-j + u2t    (8.7)

where M denotes money supply and the sums run over the included lags.

The random errors u1t and u2t are assumed to be independent. Equation (8.6) says that current GDP is determined by its own lagged values as well as by lagged values of money supply. Similarly, equation (8.7) says that the current money supply is determined by lagged values of money supply and lagged values of GDP. Granger causality is tested by an F test. The procedure for testing Granger causality is given below:
1. Regress current GDP on all lagged GDP terms only; do not include the lagged money supply variables. This is termed the restricted regression. From it, obtain the restricted error sum of squares, SSER.
2. Now regress current GDP on all lagged GDP and lagged money supply terms. This is termed the unrestricted regression. From it, obtain the unrestricted error sum of squares, SSEUR.
3. Formulate the null and alternative hypotheses as:
H0: the coefficients of the lagged money supply terms are jointly zero
H1: at least one of these coefficients is not zero
4. To test this hypothesis, the F test is applied:
F = [(SSER - SSEUR)/m] / [SSEUR/(n - k)]
This statistic follows the F distribution with m and (n - k) degrees of freedom, where m is the number of lagged money supply terms and k is the number of parameters in the unrestricted model.
5. Reject the null hypothesis if the computed F value is greater than the critical F value at the chosen level of significance. Rejecting the null implies that the lagged money supply terms belong in the regression, which is another way of saying that money supply Granger-causes GDP. Steps 1 to 5 can be repeated to test whether GDP causes money supply. A sketch of this test in code is given after this list.
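A compact sketch of the restricted-versus-unrestricted F test in Python, assuming numpy; the two series are simulated random walks, purely for illustration.

import numpy as np

def granger_f(y, x, m):
    # F test of whether m lags of x add explanatory power for y beyond m lags of y
    n = len(y)
    Y = y[m:]
    lags_y = [y[m - k: n - k] for k in range(1, m + 1)]
    lags_x = [x[m - k: n - k] for k in range(1, m + 1)]

    def sse(cols):
        X = np.column_stack([np.ones(len(Y))] + cols)
        coef, *_ = np.linalg.lstsq(X, Y, rcond=None)
        resid = Y - X @ coef
        return resid @ resid, X.shape[1]

    sse_r, _ = sse(lags_y)                      # restricted: own lags only
    sse_u, k = sse(lags_y + lags_x)             # unrestricted: add the lags of x
    return ((sse_r - sse_u) / m) / (sse_u / (len(Y) - k))

rng = np.random.default_rng(3)
gdp = np.cumsum(rng.normal(2, 3, 80))
money = np.cumsum(rng.normal(1, 2, 80))
d_gdp, d_money = np.diff(gdp), np.diff(money)   # first differences, as in the solved example below
print("F (money -> GDP):", round(granger_f(d_gdp, d_money, 2), 3))
print("F (GDP -> money):", round(granger_f(d_money, d_gdp, 2), 3))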

Solved Example
To illustrate Granger causality, we collected data on the gross domestic product of India at current prices and on money supply (narrow money) from 1950-51 to 2002-03. The objective is to find whether GDP causes money supply or money supply causes GDP. In this illustration we run the regressions on the change in GDP (ΔGDP) and the change in money supply (ΔMoney) instead of running the regressions in levels. In Table 7.3 we constructed two lagged values of ΔGDP and two lagged values of ΔMoney for testing Granger causality. The step-by-step procedure of the Granger causality test is explained below:

1. We regress ΔGDP on its own two lagged values. This is our restricted regression:

ΔGDPt = α + β1ΔGDPt-1 + β2ΔGDPt-2 + ut

From it we obtain the restricted error sum of squares, SSER, which is equal to 11684952249.
2. Next, we estimated the following unrestricted regression model, which contains not only the lagged values of ΔGDP but also the lagged values of ΔMoney:

ΔGDPt = α + β1ΔGDPt-1 + β2ΔGDPt-2 + γ1ΔMoneyt-1 + γ2ΔMoneyt-2 + ut

From it we obtain the unrestricted error sum of squares, SSEUR, which is equal to 10678740492.
3. We have to test the following hypotheses:
H0: the coefficients of the lagged ΔMoney terms are jointly zero
H1: at least one of these coefficients is not zero
4. To test the null hypothesis, we apply the F test:
F = [(SSER - SSEUR)/m] / [SSEUR/(n - k)]
Thus, our calculated F value is 2.025.

5. Finally, we compare the computed F value with the critical F value. At the 5% level of significance, the critical F value is 3.23. Since the computed F value is less than the critical F value, we do not reject the null hypothesis. This implies that the lagged money supply terms do not belong in the regression; in other words, money supply does not Granger-cause GDP. One can repeat the above steps to determine whether GDP causes money supply.

Thus, it is established that GDP is not caused by money supply in the Granger sense.

Using Excel
Listed below are 64 daily closing values of the NSE Nifty. We will fit an AR(5) model to these data.

4150.85 4117.35
4111.15 4077 4079.3
4076.65 4134.3 4120.3 417
4260.9 4278.1 4246.2 4204
4293.25 4249.65 4295.8 4297
4198.25 4179.5 4145 414
4170 4171.45 4147.1 421
4252.05 4259.4 4285.7 4263
4313.75 4357.55 4359.3 435
4406.05 4387.15 4446.15 450
4499.55 4562.1 4566.05 461
4619.8 4445.2 4440.05 4528

Step 1: Enter the data in excel as shown


below:
Step 2: Go to data analysis. When you
click data analysis the following dialog
box will come:
Step 3: Select regression and click ok.
When you click ok the following dialog
box will appear.
Step 4: Enter the dependent variable in Input Y Range and the independent variables in Input X Range, that is, $B$1:$F$65. Select Labels if you have labels for your variables.
Step 5: When you click OK, the following results will be given by Excel.
Exercises

Q. 1 Fit an ARMA model to the following monthly data on oil imports using the Box-Jenkins method.
Year/Month Oil Import
(Rs. Crore)

April 5652.5
May 5864.2
June 5990.9
July 6831.6
August 6566.7
September 6434.1
October 6637.5
November 7551.1
December 5154.2
January 5674.7
February 4359.9
March 4929.8
2009-10
April 5599.2
May 6907.7
June 5772.3
July 6722.7
August 5677
September 5577
October 5037
November 4506
December 5061
January 4921
February 4648
March 6341
2011-12
April 6090
May 7583
June 6329
July 6785
August 7891
September 7580
October 7454
November 5681
December 6666
January 7622
February 7525
March 8161
2012-13
April 6890
May 7887
June 6760
July 7111
August 6724
September 7501
October 8139
November 8165
December 8743
January 9189
February 7740
March 9671
2013-14
April 10254
May 9729
June 12395
July 10454
August 11695
September 11871
October 12197
November 9521
December 9014
January 11477
February 11054
March 14433
2014-15
April 13511
May 14178
June 13495
July 15215
August 17275
September 17778
October 15423
November 15382
December 16967
January 18502
February 18219
March 18695
Q. 2 The table below shows data on
Research & Development (R&D) and
Sales of various companies. It is not
clear whether sales is determined by
R&D or R&D is determined by sales.
Using these data, determine the direction
of Granger causality.

R&D Sales
54.95 380
72.66 450
87.58 515
64.69 400
74.81 458
66.44 460
51.46 305
72.77 485
80.03 518
76.39 506
69.84 540
52.08 354
61.98 375
73.3 330
56.99 410
78.38 490
96.44 528
60.74 445
89.5 630
95.24 600
68.33 465
56.71 388
88.18 578
64.8 416

Q. 3 Using the data of Q.2, fit a distributed-lag model by the ad hoc method, assuming that sales are determined not only by current R&D expenses but also by previous R&D expenses.
Chapter 15
Markov Analysis
Introduction
Market share analysis can be done by a technique called Markov analysis, which deals with the probabilities of future occurrences by analyzing presently known probabilities. The technique has numerous applications in business, including market share analysis, bad debt prediction, university enrollment predictions, and determining whether a machine will break down in the future.

Markov analysis makes the assumption


that the system starts in an initial state or
condition. For example, two competing
manufacturers might have 40% and 60%
of the market sales, respectively, as
initial states. Perhaps in two months the
market shares for the two companies
will change to 45% and 55% of the
market, respectively. Predicting these
future states involves knowing the
system's likelihood or probability of
changing from one state to another. For a
particular problem, these probabilities
can be collected and placed in a matrix
or table. This matrix of transition
probabilities shows the likelihood that
the system will change from one time
period to the next. This is the Markov
process, and it enables us to predict
future states or conditions.

Like many other quantitative techniques,


Markov analysis can be studied at any
level of depth and sophistication.
Fortunately, the major mathematical
requirements are just that you know how
to perform basic matrix manipulations
and solve several equations with several
unknowns.
Because the level of this course
prohibits a detailed study of Markov
mathematics, we limit our discussion to
Markov processes that follow four
assumptions:
1. There are a limited or finite
number of possible states.
2. The probability of changing
states remains the same over
time.
3. We can predict any future state
from the previous state and the
matrix of transition
probabilities.
4. The size and makeup of the system (e.g., the total number of manufacturers and customers) do not change during the analysis.

States and State Probabilities
States are used to identify all possible conditions of a process or a system. For example, a machine can be in one of two states at any point in time: it can be either functioning correctly or not functioning correctly. We can call the proper operation of the machine the first state and the incorrect functioning the second state. Indeed, it is possible to identify specific states for many processes or systems. If there are only three grocery stores in a small town, a resident can be a customer of any one of the three at any point in time; therefore, there are three states corresponding to the three stores.

In Markov analysis we also assume that


the states are both collectively
exhaustive and mutually exclusive.
Collectively exhaustive means that we
can list all of the possible states of a
system or process. Our discussion of
Markov analysis assumes that there is a
finite number of states for any system.
Mutually exclusive means that a system
can be in only one state at any point in
time.
After the states have been identified, the next step is to determine the probability that the system is in each state. Such information is then placed in a vector of state probabilities:

π(i) = (π1, π2, ..., πn)

where n = number of states and π1, π2, ..., πn = probability of being in state 1, state 2, ..., state n.

Vector of State Probabilities


Let's look at the vector of states for people choosing among the three airline companies. Suppose a total of 100,000 people fly during any given month. Forty thousand people may be using Jet Airways, which will be called state 1. Thirty thousand people may be using Indian Airlines, which will be called state 2, and 30,000 people may be flying Air Deccan, which will be called state 3. The probability that a person will be using any of these three airlines is as follows:

State 1, Jet Airways: 40,000/100,000 = 0.40 = 40%
State 2, Indian Airlines: 30,000/100,000 = 0.30 = 30%
State 3, Air Deccan: 30,000/100,000 = 0.30 = 30%

These probabilities can be placed in the vector of state probabilities shown as follows:

π(1) = (0.40, 0.30, 0.30)

where
π(1) = vector of state probabilities for the three airlines for period 1
π1 = 0.4 = probability that a person will travel by Jet Airways
π2 = 0.3 = probability that a person will travel by Indian Airlines
π3 = 0.3 = probability that a person will travel by Air Deccan

Notice that the probabilities in the vector of state probabilities for the three airlines represent their market shares in the first period. Thus Jet Airways has 40% of the market, Indian Airlines 30%, and Air Deccan 30% in period 1. When we are dealing with market shares, the market shares can be used in place of the probability values.

The management of these three airlines should be interested in how their market shares change over time. Travelers do not always remain with one airline; they may switch to a different airline for their next trip. In this example, a study has been performed to determine how loyal the travelers have been. It found that 80% of the people who travel by Jet Airways one year will return to that airline the next year, while of the remaining 20%, 10% will switch to Indian Airlines and the other 10% will switch to Air Deccan. Of the people who travel this year by Indian Airlines, 70% will return, 10% will switch to Jet Airways, and 20% will switch to Air Deccan. Of the people who travel this year by Air Deccan, 60% will return, 20% will go to Jet Airways and 20% will switch to Indian Airlines.

To find the market share of each airline next year, the concept of the matrix of transition probabilities is required.

Matrix of Transition Probabilities
The concept that allows us to get from a current state, such as market shares, to a future state is the matrix of transition probabilities. This is a matrix of conditional probabilities of being in a future state given a current state. The following definition is helpful:

Let Pij = conditional probability of being in state j in the future given the current state i.

For example, P12 is the probability of being in state 2 in the future given that the system was in state 1 in the period before.

Let P = matrix of transition probabilities:

P = | P11  P12  ...  P1n |
    | P21  P22  ...  P2n |
    | ...                |
    | Pn1  Pn2  ...  Pnn |    (15.1)

Individual Pij values are usually determined empirically. For example, if we have observed over time that 10% of the people currently shopping at store 1 (or state 1) will be shopping at store 2 (state 2) next period, then we know that P12 = 0.1, or 10%.

Transition Probabilities for the Three Airlines
We used historical data on the three airlines to determine what percentage of the travelers would switch each period. We put these transition probabilities into the following matrix:

P = | 0.8  0.1  0.1 |
    | 0.1  0.7  0.2 |
    | 0.2  0.2  0.6 |

Recall that Jet Airways represents state 1, Indian Airlines state 2, and Air Deccan state 3. The meaning of these probabilities can be expressed in terms of the various states, as follows:

Row 1
0.8 = P11 = probability of being in state 1 after being in state 1 the preceding period
0.1 = P12 = probability of being in state 2 after being in state 1 the preceding period
0.1 = P13 = probability of being in state 3 after being in state 1 the preceding period
Row 2
0.1 = P21 = probability of being in state 1 after being in state 2 the preceding period
0.7 = P22 = probability of being in state 2 after being in state 2 the preceding period
0.2 = P23 = probability of being in state 3 after being in state 2 the preceding period
Row 3
0.2 = P31 = probability of being in state 1 after being in state 3 the preceding period
0.2 = P32 = probability of being in state 2 after being in state 3 the preceding period
0.6 = P33 = probability of being in state 3 after being in state 3 the preceding period

Note that the three probabilities in the top row sum to 1. The probabilities in any row of a matrix of transition probabilities will always sum to 1.

After the state probabilities have been determined along with the matrix of transition probabilities, it is possible to predict future state probabilities.

Predicting Future Market Shares
One of the purposes of Markov analysis is to predict the future. Given the vector of state probabilities and the matrix of transition probabilities, it is not very difficult to determine the state probabilities at a future date. With this type of analysis, we are able to compute the probability that a person will be flying with one of the airlines in the future. Because this probability is equivalent to market share, it is possible to determine future market shares for Jet Airways, Indian Airlines, and Air Deccan. When the current period is 0, the state probabilities for the next period (period 1) are calculated as follows:

π(1) = π(0)P    (15.2)

Furthermore, if we are in any period n, we can compute the state probabilities for period n + 1 as follows:

π(n + 1) = π(n)P    (15.3)

The above equation can be used to answer the question of next period's market shares for the airlines. The computations are:

π(1) = π(0)P
= [(0.4)(0.8) + (0.3)(0.1) + (0.3)(0.2), (0.4)(0.1) + (0.3)(0.7) + (0.3)(0.2), (0.4)(0.1) + (0.3)(0.2) + (0.3)(0.6)]
= (0.41, 0.31, 0.28)

As you can see, the market shares for Jet Airways and Indian Airlines have increased while the market share for Air Deccan has decreased. Will this trend continue in the next period and the one after that? From the above equation, we can derive a model that will tell us what the state probabilities will be in any time period in the future.
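A sketch of this projection in Python, assuming numpy; the vector of state probabilities and the transition matrix are the ones developed above.

import numpy as np

pi = np.array([0.40, 0.30, 0.30])        # period-0 shares: Jet Airways, Indian Airlines, Air Deccan
P = np.array([[0.8, 0.1, 0.1],
              [0.1, 0.7, 0.2],
              [0.2, 0.2, 0.6]])

for period in range(1, 4):               # project a few periods ahead
    pi = pi @ P                          # pi(n + 1) = pi(n) P
    print("Period", period, "market shares:", np.round(pi, 4))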

Exercises
1. The buying patterns for two brands of health drink can be expressed as a Markov process with the following transition probabilities:

From         To Bournvita
Bournvita    0.40
Horlicks     0.55

a) Which brand has the most loyal customers? Explain.
b) What are the projected market shares for the two brands?

2. Suppose that in Problem 1 a new health drink brand enters the market such that the following transition probabilities exist:

From         To Bournvita
Bournvita    0.35
Horlicks     0.50
Boost        0.20

What are the new long-term market shares? Which brand appears to suffer the most with the introduction of the new brand?
Chapter 16
Monte Carlo Simulation
Boeing Corporation and Airbus Industries commonly build simulation models of their proposed jet aircraft and then test the aerodynamic properties of the models. Civil defense organizations carry out rescue and evacuation practice drills that simulate the conditions of a natural disaster such as a hurricane or tornado. The U.S. Army simulates enemy attacks and defense strategies in war games played on a computer. Business students take courses that use management games to simulate realistic competitive business situations. And thousands of business, government, and service organizations develop simulation models to assist in making decisions concerning inventory control, maintenance scheduling, plant layout, investments, and sales forecasting.
As a matter of fact, simulation is one of the most widely used quantitative analysis tools. Various surveys of the largest U.S. corporations reveal that over half use simulation in corporate planning.
Simulation sounds like it may be the
solution to all management problems.
This is, unfortunately, by no means true.
Yet we think you may find it one of the
most flexible and fascinating of the
quantitative techniques in your studies.
Let's begin our discussion of simulation
with a simple definition.
To simulate is to try to duplicate the
features, appearance, and characteristics
of a real system. In this section, we
show how to simulate a business or
management system by building a
mathematical model that comes as close
as possible to representing the reality of
the system. We won't build any physical
models, as might be used in airplane
wind tunnel simulation tests. But just as
physical model airplanes are tested and
modified under experimental conditions,
our mathematical models are used to
experiment and to estimate the effects of
various actions. The idea behind
simulation is to imitate a real-world
situation mathematically, then to study its
properties and operating characteristics,
and, finally, to draw conclusions and
make action decisions based on the
results of the simulation. In this way, the
real-life system is not touched until the
advantages and disadvantages of what
may be a major policy decision are first
measured on the system's model.
Using simulation, a manager should (1)
define a problem, (2) introduce the
variables associated with the problem,
(3) construct a numerical model, (4) set
up possible courses of action for testing,
(5) run the experiment, (6) consider the
results (possibly deciding to modify the
model or change data inputs), and (7)
decide what course of action to take.
The problems tackled by simulation can
range from very simple to extremely
complex, from bank teller lines to an
analysis of the U.S. economy. Although
very small simulations can be conducted
by hand, effective use of this technique
requires some automated means of
calculation, namely, a computer. Even
large-scale models, simulating perhaps
years of business decisions, can be
handled in a reasonable amount of time
by computer. Though simulation is one of the oldest quantitative analysis tools, it was not until the introduction of computers in the mid-1940s and early 1950s that it became a practical means of solving management and military problems.
In this chapter we will start with a
presentation of the advantages and
disadvantages of simulation. An
explanation of the Monte Carlo method
of simulation follows. Other simulation
models besides the Monte Carlo
approach are also discussed briefly.
Finally, the important role of computers
in simulation is illustrated.
Advantages and Disadvantages of
Simulation
Simulation is a tool that has become
widely accepted by managers for
several reasons:
1. It is relatively straightforward
and flexible.
2. Recent advances in software
make some simulation models
very easy to develop.
3. It can be used to analyze large
and complex real-world
situations that cannot be
solved by conventional
quantitative analysis models.
For example, it may not be
possible to build and solve a
mathematical model of a city
government system that
incorporates important
economic, social,
environmental, and political
factors. Simulation has been
used successfully to model
urban systems, hospitals,
educational systems, national
and state economies, and even
world food systems.
4. Simulation allows what-if
types of questions. Managers
like to know in advance what
options are attractive. With a
computer, a manager can try
out several policy decisions
within a matter of
minutes.
5. Simulations do not interfere
with the real-world system. It
may be too disruptive, for
example, to experiment with
new policies or ideas in a
hospital, school, or manufac
turing plant. With simulation,
experiments are done with the
model, not on the system
itself.
6. Simulation allows us to study
the interactive effect of
individual components or
variables to determine which
ones are important.
7. "Time compression" is
possible with simulation. The
effect of ordering, advertising
or other policies over many
months or years can be
obtained by computer
simulation in a short time.
8. Simulation allows for the
inclusion of real-world
complications that most
quantitative analysis models
cannot permit. For example,
some queuing models require
exponential or Poisson
distributions; some inventory
and network models require
normality. But simulation can
use any probability
distribution that the user
defines; it does not require
standard distributions.

The main disadvantages of simulation


are:
1. Good simulation models for
complex situations can be
very expensive. It is often a
long, complicated process to
develop a model. A corporate
planning model, for example,
may take months or even years
to develop.
2. Simulation does not generate
optimal solutions to problems
as do other quantitative
analysis techniques such as
economic order quantity,
linear programming, or PERT.
It is a trial-and-error approach
that can produce different
solutions in repeated runs.
3. Managers must generate all of
the conditions and constraints
for solutions that they want to
examine. The simulation
model does not produce
answers by itself.
4. Each simulation model is
unique. Its solutions and
inferences are not usually
transferable to other problems.

Monte Carlo Simulation


When a system contains elements that
exhibit chance in their behavior, the
Monte Carlo method of simulation can
be applied.
The basic idea in Monte Carlo
simulation is to generate values for the
variables making up the model being
studied. There are a lot of variables in
real-world systems that are probabilistic
in nature and that we might want to
simulate. A few examples of these
variables follow:
1. Inventory demand on a daily
or weekly basis
2. Lead time for inventory orders
to arrive
3. Times between machine
breakdowns
4. Times between arrivals at a
service facility
5. Service times
6. Times to complete project
activities
7. Number of employees absent
from work each day
The basis of Monte Carlo simulation is
experimentation on the chance (or
probabilistic) elements through random
sampling. The technique breaks down
into five simple steps:
Five Steps of Monte Carlo Simulation
1. Setting up a probability
distribution for important
variables
2. Building a cumulative
probability distribution for
each variable in step 1
3. Establishing an interval of
random numbers for each
variable
4. Generating random numbers
5. Actually simulating a series of
trials
We will examine each of these steps and
illustrate them with the following
illustration.
Illustration
Harry's Aircraft Tire sells all types of tires, but a popular radial tire accounts for a large portion of Harry's overall sales. Recognizing that inventory costs can be quite significant with this product, Harry wishes to determine a policy for managing this inventory. To see what the demand would look like over a period of time, he wishes to simulate the daily demand for a number of days.
Step 1: Establishing Probability
Distributions. One common way to
establish a probability distribution for a
given variable is to examine historical
outcomes. The probability, or relative
frequency, for each possible outcome of
a variable is found by dividing the
frequency of observation by the total
number of observations. The daily
demand for radial tires at Harry's Tire
over the past 200 days is shown in Table
16.1. We can convert these data to a
probability distribution, if we assume
that past demand rates will hold in the
future, by dividing each demand
frequency by the total number of days, 200. This is illustrated in Table 16.2.
Table 16.1: Daily Demand for Radial Tires

Demand for Tires   Frequency (days)
0                  10
1                  20
2                  40
3                  60
4                  40
5                  30
Probability distributions, we should
note, need not be based solely on
historical observations. Often,
managerial estimates based on judgment
and experiences are used to create a
distribution. Sometimes, a sample of
sales, machine breakdowns, or service
rates is used to create probabilities for
those variables. And the distributions
themselves can be either empirical, as in
Table 16.1, or based on the commonly
known normal, binomial, Poisson, or
exponential patterns.
Table 16.2: Probability of Demand for Radial Tires

Demand Variable   Probability of Occurrence
0                 10/200 = 0.05
1                 20/200 = 0.10
2                 40/200 = 0.20
3                 60/200 = 0.30
4                 40/200 = 0.20
5                 30/200 = 0.15
Total             200/200 = 1.00

Step 2: Building a Cumulative Probability Distribution for Each Variable. The conversion from a regular probability distribution, such as in the right-hand column of Table 16.2, to a cumulative distribution is an easy job. A cumulative probability is the probability that a variable (demand) will be less than or equal to a particular value. A cumulative distribution lists all of the possible values and their probabilities. In Table 16.3 we see that the cumulative probability for each level of demand is the sum of the number in the probability column (middle column) and the previous cumulative probability (rightmost column). The cumulative probability, graphed in Figure 16.2, is used in step 3 to help assign random numbers.
Table 16.3: Cumulative Probabilities

Daily Demand   Probability   Cumulative Probability
0              0.05          0.05
1              0.10          0.15
2              0.20          0.35
3              0.30          0.65
4              0.20          0.85
5              0.15          1.00

Step 3: Setting Random Number Intervals. After we have established a
cumulative probability distribution for
each variable included in the simulation,
we must assign a set of numbers to
represent each possible value or
outcome. These are referred to as
random number intervals. Random
numbers are discussed in detail in step
4. Basically, a random number is a
series of digits (say, two digits from 01,
02, . . . , 98, 99, 00) that have
been selected by a totally random
process.
If there is a 5% chance that demand for a
product (such as Harry's radial tires) is
0 units per day, we want 5% of the
random numbers available to correspond
to a demand of 0 units. If a total of 100
two-digit numbers is used in the
simulation (think of them as being
numbered chips in a bowl), we could
assign a demand of 0 units to the first
five random numbers: 01, 02, 03, 04,
and 05. Then a simulated demand for 0
units would be created every time one of
the numbers 01 to 05 was drawn. If there
is also a 10% chance that demand for the
same product is 1 unit per day, we could
let the next 10 random numbers (06, 07,
08, 09, 10, 11, 12, 13, 14, and 15)
represent that demand, and so on for
other demand levels.

In general, using the cumulative
probability distribution computed and
graphed in step 2, we can set the interval
of random numbers for each level of
demand in a very simple fashion. You
will note in Table 16.4 that the interval
selected to represent each possible daily
demand is very closely related to the
cumulative probability on its left. The
top end of each interval is always equal
to the cumulative probability percentage.
Similarly, we can see in Figure 16.2 and
in Table 16.4 that the length of each
interval on the right corresponds to the
probability of one of each of the
possible daily demands. Hence, in
assigning random numbers to the daily
demand for three radial tires, the range
of the random number interval (36 to 65)
corresponds exactly to the probability
(or proportion) of that outcome. A daily
demand for three radial tires occurs
30% of the time. Any of the 30 random
numbers greater than 35 up to and
including 65 are assigned to that event.

Step 4: Generating Random Numbers.
Random numbers may be generated for
simulation problems in several ways. If
the problem is very large and the
process being studied involves
thousands of simulation trials, computer
programs are available to generate the
random numbers needed.
If the simulation is being done by hand,
as in this book, the numbers may be
selected by the spin of a roulette wheel
that has 100 slots, by blindly grabbing
numbered chips out of a bowl, or by
reading them from a table of random
numbers such as Table 16.5.
Table 16.4: Assignment of Random Number Intervals for Harry's Tire
Daily Demand    Probability    Cumulative Probability    Interval of Random Numbers
0               0.05           0.05                       01 to 05
1               0.10           0.15                       06 to 15
2               0.20           0.35                       16 to 35
3               0.30           0.65                       36 to 65
4               0.20           0.85                       66 to 85
5               0.15           1.00                       86 to 00
Step 5: Simulating the Experiment. We
can simulate outcomes of an experiment
by simply selecting random numbers
from Table 16.5. Beginning anywhere in
the table, we note the interval in Table
16.4 or Figure 16.2 into which each
number falls. For example, if the random
number chosen is 81 and the interval 66
to 85 represents a daily demand for four
tires, we select a demand of four tires.
Table 16.5: Table of random numbers
We now illustrate the concept further by
simulating 10 days of demand for radial
tires at Harry's Tire (see Table 16.6).
We select the random numbers needed
from Table 16.5, starting in the upper
left-hand corner and continuing down the
first column.
Table 16.6: Ten-Day Simulation of Demand for Tires
Day    Random Number    Simulated Daily Demand
1      52               3
2      37               3
3      82               4
4      69               4
5      98               5
6      96               5
7      33               2
8      50               3
9      88               5
10     90               5
Total 10-day demand = 39
Average daily demand = 39/10 = 3.9

It is interesting to note that the average
demand of 3.9 tires in this 10-day
simulation differs significantly from the
expected daily demand, which we can
compute from the data in Table 16.2:
Expected daily demand = (0.05)(0) + (0.10)(1) + (0.20)(2) + (0.30)(3) + (0.20)(4) + (0.15)(5)
= 2.95 tires
If this simulation were repeated
hundreds or thousands of times, it is
much more likely that the average
simulated demand would be nearly the
same as the expected demand.
Naturally, it would be risky to draw any
hard and fast conclusions regarding the
operation of a firm from only a short
simulation. However, this simulation by
hand demonstrates the important
principles involved. It helps us to
understand the process of Monte Carlo
simulation
The simulation for Harry's Tire involved
only one variable. The true power of
simulation is seen when several random
variables are involved and the situation
is more complex. As you might expect,
the computer can be a very helpful tool
in carrying out the tedious work in larger
simulation undertakings.
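To make that point concrete, here is a minimal sketch, in Python rather than a spreadsheet, of the five Monte Carlo steps applied to Harry's Tire. The probabilities come from Table 16.2; the function and variable names are illustrative choices, not part of the original example.

import random

# Step 1: probability distribution of daily demand, taken from Table 16.2
demand_probs = {0: 0.05, 1: 0.10, 2: 0.20, 3: 0.30, 4: 0.20, 5: 0.15}

# Steps 2 and 3: cumulative probabilities, which play the role of the
# random-number intervals in Table 16.4
cumulative = []
running = 0.0
for demand, prob in sorted(demand_probs.items()):
    running += prob
    cumulative.append((running, demand))

def simulate_day():
    """Steps 4 and 5: draw a random number and map it to a daily demand."""
    r = random.random()                  # uniform number in [0, 1)
    for cum_prob, demand in cumulative:
        if r < cum_prob:
            return demand
    return cumulative[-1][1]             # guard against rounding at the top end

random.seed(42)                          # fixed seed so a rerun gives the same result
days = 10
demands = [simulate_day() for _ in range(days)]
print("Simulated daily demands:", demands)
print("Average daily demand:", sum(demands) / days)

Increasing days from 10 to, say, 10,000 shows the simulated average settling near the expected demand of 2.95 tires computed above.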
Two Other Types of Simulation Models
Simulation models are often broken into
three categories. The first, the Monte
Carlo method just discussed, uses the
concepts of probability distribution and
random numbers to evaluate system
responses to various policies. The other
two categories are operational gaming
and systems simulation. Although in
theory the three methods are distinctly
different, the growth of computerized
simulation has tended to create a
common basis in procedures and blur
these differences.
Operational Gaming
Operational gaming refers to simulation
involving two or more competing
players. The best examples are military
games and business games. Both allow
participants to match their management
and decision-making skills in
hypothetical situations of conflict.
Military games are used worldwide to
train a nation's top military officers, to
test offensive and defensive strategies,
and to examine the effectiveness of
equipment and armies. Business games,
first developed by the firm Booz, Allen
and Hamilton in the 1950s, are popular
with both executives and business
students. They provide an opportunity to
test business skills and decision-making
ability in a competitive environment.
The person or team that performs best in
the simulated environment is rewarded
by knowing that his or her company has
been most successful in earning the
largest profit, grabbing a high market
share, or perhaps increasing the firm's
trading value on the stock exchange.
During each period of competition, be it
a week, month, or quarter, teams respond
to market conditions by coding their
latest management decisions with
respect to inventory, production,
financing, investment, marketing, and
research. The competitive business
environment is simulated by computer,
and a new printout summarizing current
market conditions is presented to
players. This allows teams to simulate
years of operating conditions in a matter
of days, weeks, or a semester.
Systems Simulation
Systems simulation is similar to business
gaming in that it allows users to test
various managerial policies and
decisions to evaluate their effect on the
operating environment. This variation of
simulation models the dynamics of large
systems. Such systems include corporate
operations, the national economy, a
hospital, or a city government system.
In a corporate operating system, sales,
production levels, marketing policies,
investments, union contracts, utility
rates, financing, and other factors are all
related in a series of mathematical
equations that are examined by
simulation. In a simulation of an urban
government, systems simulation can be
employed to evaluate the impact of tax
increases, capital expenditures for roads
and buildings, housing availability, new
garbage routes, immigration and out-
migration, locations of new schools or
senior citizen centers, birth and death
rates, and many more vital issues.
Simulations of economic systems, often
called econometric models, are used by
government agencies, bankers, and large
organizations to predict inflation rates,
domestic and foreign money supplies,
and unemployment levels.
The value of systems simulation lies in
its allowance of what-if questions to test
the effects of various policies. A
corporate planning group, for example,
can change the value of any input, such
as an advertising budget, and examine
the impact on sales, market share, or
short-term costs. Simulation can also be
used to evaluate different research and
development projects or to determine
long-range planning horizons.
Verification and Validation
In the development of a simulation
model, it is important that the model be
checked to see that it is working
properly and providing a good
representation of the real world
situation. The verification process
involves determining that the computer
model is internally consistent and
following the logic of the conceptual
model.

Validation is the process of comparing a
model to the real system that it
represents to make sure that it is
accurate. The assumptions of the model
should be checked to see that the
appropriate probability distribution is
being used. An analysis of the inputs and
outputs should be made to see that the
results are reasonable. If we know what
the actual outputs are for a specific set
of inputs, we could use those inputs in
the computer model to see that the
outputs of the simulation are consistent
with the real world system.

Role of Computers in Simulation

We recognize that computers are critical
in simulating complex tasks. They can
generate random numbers, simulate
thousands of time periods in a matter of
seconds or minutes, and provide
management with reports that make
decision making easier. As a matter of
fact, a computer approach is almost a
necessity for us to draw valid
conclusions from a simulation. Because
we require a very large number of
simulations, it would be a real burden to
rely on pencil and paper alone.

Three types of computer programming
languages are available to help the
simulation process. The first type,
general-purpose languages, includes
Visual Basic, C++, and Java. The
second type, special-purpose
simulation languages, have three
advantages: (1) they require less
programming time for large simulations,
(2) they are usually more efficient and
easier to check for errors, and (3) they
have random number generators already
built in as subroutines. Three of the
major special-purpose languages are
GPSS/H, SLAM II, and SIMSCRIPT
II.5.

Simulation has proven so popular that a
third type, commercial, easy-to-use
prewritten simulation programs, are also
available. Some are generalized to
handle a wide variety of situations,
ranging from queuing to inventory. These
include Extend, AutoMod, ALPH/Sim,
SIMUL8, STELLA, Arena, AweSim!,
SLX, and numerous others. These
programs run on personal computers and
often have animated graphic
capabilities. Many of these packages
have tools for testing to see if the
appropriate probability distribution is
being used and for statistically analyzing
the output.

Exercises

1. Assume that the share price of
Orange Inc. is currently $50. The
following probability distribution
shows how the share price of Orange
is expected to change over a three
month period.
Share Probability
Price
Change
($)
-5 0.05
-2 0.10
0 0.15
1 0.25
2 0.20
3 0.10
4 0.10
5 0.05

1. Set up intervals of random numbers
that can be used to generate the
changes in share price over a
3-month period.
2. Suppose the random numbers are
0.3014, 0.9276, 0.1095, and 0.7580
and the current price is $50.
Simulate the share price of Orange.
3. What is the ending share price of
Orange?
Chapter 17
Qualitative Methods of
Prediction
Qualitative forecasting techniques are
based on judgements about the future.
These are often called judgemental or
nonextrapolative techniques. In this type
of forecasting, dependence on numbers
is small, and the techniques lack a
rigorous specification of the underlying
assumptions. Judgemental forecasting is
one of the important techniques of
forecasting. In this method, an individual
or small group of people prepare
forecasts regarding likely future
conditions. When used by experienced
persons keeping in mind historical
trends, current economic situations and
other relevant factors, this method can
produce good estimates. This method,
however, tends to work best when the
environment changes very rapidly. When
economic and political conditions are
very stable, quantitative methods may
yield very good estimates. However,
when economic and political
environments are in flux, quantitative
methods may not capture important clues
which have the ability to change the
historical patterns in a country. A variant
of judgemental approach is consensus
forecasting. In this, experts familiar with
factors influencing a particular thing/
variable meet to arrive at some
consensus regarding what is likely to
happen in near future. Consensus
forecasting works best when there is little
or no historical information. The
following are important qualitative
forecasting techniques:

Delphi Method

Delphi is the most common method of
qualitative forecasting. It was originally
developed by researchers at the Rand
Corporation. This method develops
forecasts through group consensus. The
procedure adopted in Delphi technique
is outlined here:

1. First, the members of a panel of
experts, who are physically separated
and unknown to each other, are asked
to respond to a series of
questionnaires.
2. Second, responses of the experts
from the first questionnaire are
analyzed and become the basis for the
second questionnaire, which contains
the views of the entire group.
3. Third, each expert is then asked to
reconsider or revise his/her previous
views in light of what the group is
thinking.
4. Fourth, this process continues until
the coordinator judges that a sufficient
degree of consensus has been reached.
Expert Judgment Method
Expert method uses the experience of
people like executives, salespeople,
marketing people, distributors, or
outside experts. These people are
familiar with a product line or a group of
products, which is a great help in
techniques under this method generally
involve combining inputs from multiple
sources. The advantage of this method is
that it can offset biases introduced into a
forecast when the forecast is produced
by a single person.
Jury of Executive Opinion
When executives from various corporate
functions involved in forecasting, for
example people from finance, marketing,
sales, production, and logistics, meet to
produce a forecast, the meeting is termed
a jury of executive opinion. The jury of
executive opinion is one of the most
familiar and frequently used of all
forecasting techniques.

Exercises
1. Explain the Delphi method.
2. Distinguish between jury of
executive opinion and expert
judgment method.
Chapter 18
Linear Programming

Introduction
Linear programming is a mathematical
technique to find the best or optimal
solution to a problem under a given
constraint. Here linear means a linear
equation, that is, an equation of degree 1;
if you plot a linear equation, you get a
straight line. The word programming is
closely associated with computer
programs, the act or job of creating sets
of instructions that enable a computer to
perform a certain task. Moreover, today
you cannot conduct optimization analysis
involving more than two decision
variables without the help of some
computer software.
The prime objective of any business
entity is to maximize profit. The profit is
equal to total revenue minus total cost
which is expressed as:
Profit = Total Revenue - Total Cost
Thus the various objectives of firms can
be:
1. Maximization of profit
2. Maximization of total revenue
3. Minimization of total cost or cost of
production
Thus linear programming deals with
either maximizing or minimizing some
objective function, subject to a set of
linear constraints. Thus, a linear
programming model consists of the
following:
1. An objective function
2. A set of decision variables
3. A set of constraints
Linear Programming Model
Let X1, X2, X3, . . . , Xn be the decision
variables. Z is the objective function,
which is a function of the decision
variables X1, X2, X3, . . . , Xn. The
objective is to maximize the objective
function Z, which is assumed to be
linear, shown as follows:
Maximize Z = c1X1 + c2X2 + . . . + cnXn
where c1, c2, . . . , cn are the per unit
contributions of the decision variables.
Formulation of Linear Programming


The linear programming technique is a
very useful mathematical tool to find the
best or optimal solution in many
business situations. Here it is important
to remember that linear programming
model is formulated in per unit terms.
Let us assume that there are two
products X and Y. The per unit profit
from the product X and Y are
respectively $10 and $ 12. Given this
information the objective is to maximize
profit from the production of X and Y.
Now the objective function is written as:
Maximize Z = $10X + $12Y
Thus, while formulating the objective
function of the linear programming
model, determine the objective of the
problem and express it in terms of the
decision variables.
Next we have to find out the constraints.
Let's say that in order to produce 1 unit
of X and 1 unit of Y, we require 10 labour
hours and 15 labour hours respectively.
The total labour hours available in a day
are 400. Thus our first constraint is:
10X + 15Y ≤ 400
In other words the per unit requirement
of labour hours to produce 1 unit of X
and Y are respectively 10 and 15.
Further we also require machine to
produce X and Y; in order to produce 1
unit of X and 1 unit of Y we need 6
hours of machine and 10 hours of
machine respectively and the total
machine hours available in a day is 180.
So the second constraint is:
6X + 10Y ≤ 180
Thus in this case also the per unit
requirement of machine hours to produce
1 unit of X and Y are 6 and 10
respectively.
Finally we have:
Maximize: Z = $10X + $12Y
subject to the following constraints:
10X + 15Y ≤ 400 (labour hours constraint)
6X + 10Y ≤ 180 (machine hours constraint)
X, Y ≥ 0
How to Solve Linear Programming
Problem
There are two methods for solving linear
programming problem. The graphical
method in case of two decision
variables. If there are more than two
decision variables then simplex method
is used for solving linear programming
problem.
Graphical Method
The graphical method is useful for
simple problems. Under this method, the
problem is plotted on graph paper to
locate the alternative choices and the
ultimate best single solution.
The non-negativity constraints
These ensure that the values of the
decision variables X and Y are either
zero or positive (X, Y ≥ 0). As a result
of this assumption, our choice of values
is confined to the first (+, +) quadrant
only.
Inequality Constraints
Plotting the inequality constraints one by
one progressively limits the area of
choice. Consider the labour hours
constraint:
10X + 15Y ≤ 400
When drawn on graph paper, this reduces
the choices to the triangle OAB.
(Plotting it is left as an exercise.)
Assuming that all the labour hours are
utilized for the production of X and none
for Y, we have:
10X + 15(0) = 400
10X = 400
X = 400/10 = 40
Similarly, assuming that all the labour
hours are utilized for the production of Y
and none for X, we have:
10(0) + 15Y = 400
15Y = 400
Y = 400/15 = 26.66
As pointed earlier, plotting of the labour
hours constraint on the graph reduces the
choices to the triangle OAB.
Now consider the machine hour
constraint:
6X + 10Y ≤ 180

Assume that all the machine hours
available are utilized for the production
of X and none for Y; then we have:
6X + 10 (0) =180
6X=180
X =180/6 = 30
Likewise assume that all machine hours
available are utilized for the production
of Y and none for X then we have:
6(0) + 10Y =180
10Y = 180
Y = 180/10 =18

Now compare the two constraints. In the
first quadrant, every point that satisfies
the machine hours constraint
(6X + 10Y ≤ 180) also satisfies the
labour hours constraint
(10X + 15Y ≤ 400), so the effective area
is the triangle bounded by the machine
hours line and the two axes.
Feasible Region
This shaded area satisfies both
constraints: it does not exceed the
available labour hours or machine hours.
The region which satisfies all the
inequality and non-negativity constraints
is called the feasible region.
Final Solution
The corner point of the feasible region
that maximizes profit is the ultimate
choice. Mathematically, one of the
combinations of X and Y at the corner
points of the feasible region will be the
final solution. Evaluating Z = 10X + 12Y
at the corner points gives:
Points    Combination of X and Y    Total Profit
O         (0, 0)                    0
D         (0, 18)                   216
C         (30, 0)                   300
Thus producing 30 units of X and no
units of Y at point C gives the maximum
profit of $300 and therefore becomes the
final choice under the given constraints.
Simplex Method
Often linear programming problems
have more than two decision variables.
In cases, where there are more than two
decision variables, it is inconvenient to
use the graphical method to find optimal
solution. In such situations, a procedure
called the simplex method is used to find
the best solution iteratively. Here
you should know that solving linear
programming problems manually is a
very time consuming process. Moreover,
if there are, say, 10 or 20 decision
variables in a linear programming
problem then manually solving it by
simplex procedure is almost impossible.
Therefore I suggest that my readers
become familiar with some mathematical
software and online tools for solving
linear programming problems. In this
ebook, I will demonstrate how to use an
online tool to solve linear programming
problems.
Consider the following example.
Maximize: P = 12X +10Y
Subject to
24X + 16Y ≤ 96
10X + 20Y ≤ 80
6X + 6Y ≤ 36
X, Y ≥ 0
Step 1: Type simplex method tool in
Google as shown below:

Step 2: Go to www.zweigmedia.com/RealWorld/simplex.html
Step 3: Scroll down. When you click on
Example, the following dialogue box
pops up:
Step 4: Type your linear program using
space bar as shown below:
Step 5: After typing your linear
programming problem, click Solve.
Thus the profit function P will be
maximized when 2 units of X and 3 units
of Y are produced. The total profit is
54.
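As a cross-check on the online tool's answer, here is a minimal sketch using SciPy's linprog routine (an assumption of this example; the book itself uses the online tool and Excel). Because linprog minimizes, the profit coefficients are negated.

# Cross-check of the example: Maximize P = 12X + 10Y
# subject to 24X + 16Y <= 96, 10X + 20Y <= 80, 6X + 6Y <= 36, X, Y >= 0.
from scipy.optimize import linprog

c = [-12, -10]                       # negated because linprog minimizes
A_ub = [[24, 16], [10, 20], [6, 6]]  # left-hand sides of the <= constraints
b_ub = [96, 80, 36]                  # right-hand sides

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None), (0, None)], method="highs")
print(res.x)        # approximately [2.0, 3.0]
print(-res.fun)     # approximately 54.0, the maximum profit

The same few lines of data describe any maximization problem of this form, so the sketch is easy to adapt to the exercises at the end of the chapter.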

The Dual in Linear Programming


In linear programming, for every
maximization problem, there exists a
symmetrical minimization problem and
vice-versa. The original program is
referred to as the primal program and its
symmetrical counterpart is called the
dual program. The optimal solution from
both the primal and dual will be same as
they originate from the same data.
Formulation of a Dual Problem
Consider the following linear
programming problem:
Maximize: P = 12X +10Y
Subject to
24X + 16Y ≤ 96
10X + 20Y ≤ 80
6X + 6Y ≤ 36
X, Y ≥ 0

Solution
Minimize: P = 96Y1 + 80Y2 + 36Y3
subject to
24Y1 + 10Y2 + 6Y3 ≥ 12
16Y1 + 20Y2 + 6Y3 ≥ 10
Y1, Y2, Y3 ≥ 0
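To illustrate the claim that the primal and dual share the same optimal value, here is a minimal sketch, again using SciPy's linprog (an assumption, not part of the original text). The ≥ constraints are multiplied by -1 so that they fit linprog's ≤ form.

# Solve the dual: Minimize 96Y1 + 80Y2 + 36Y3
# subject to 24Y1 + 10Y2 + 6Y3 >= 12, 16Y1 + 20Y2 + 6Y3 >= 10, Y >= 0.
from scipy.optimize import linprog

c = [96, 80, 36]
A_ub = [[-24, -10, -6],   # >= constraints rewritten as <= by negating both sides
        [-16, -20, -6]]
b_ub = [-12, -10]

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None)] * 3, method="highs")
print(res.x)     # the dual values attached to the three primal constraints
print(res.fun)   # approximately 54.0, matching the primal optimum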
Linear Programming with Excel
I will demonstrate how to solve a linear
program with Excel using an example.
Consider the following problem.
Maximize: P = 12X +10Y
Subject to
24X + 16Y ≤ 96
10X + 20Y ≤ 80
6X + 6Y ≤ 36
X, Y ≥ 0
Step 1: Enter the linear program data as
shown below:
In the above worksheet
1. Cells C3 to D5 show the per unit
requirements of X and Y
2. Cells E3 to E5 show the maximum
amount available of two inputs.
3. Cells C7 to D7 show the per unit
profit for X and Y
Step 2: Specify the cell locations for the
decision variables as shown below:
While cell C11 will show the number of
X produced the cell D11 will contain the
number of Y produced.
Step 3: Choose a cell and enter a
formula for computing the value of
objective function.
Cell C13 = C7*C11 + D7*D11
Step 4: Select a cell and enter a formula
for computing the left hand side of the
each constraint:
We enter the following:
Cell C16 = C3*C11+D3*D11
Cell C17 = C4*C11+D4*D11
Cell C18 =C5*C11 +D5*D11
Step 5: Select a cell and enter a formula
for calculating the right-hand side of
each constraint:
Thus we have
Cell E16 = E3
Cell E17 = E4
Cell E18 = E5
After writing the linear program, I will
show how to use excel to solve it:
Step 1: Click data as shown below:
Step 2: Select the Solver and when you
click it the following dialogue box will
pop up:
Step 3: Enter cell C13 into the Set
Target Cell and select the Equal to Max
option
Step 4: Enter cells C11 to D11 in By
Changing Cells box as shown below:
Step 5: Select Add and when the Add
constraint box appears, enter cells C16
:C18 in the cell reference box. Next
select <= . Further, enter Cells E16:E18
in the Constraint box shown below:
Step 6: Click Ok
Step 7: Click Solve
Step 8: Click Ok
Thus, the profit function will be
maximized when 2 units of X and 3 units
of Y are produced. The maximum profit
is 54. We got the same result using the
online tool also.
Exercises
Q1. Solve the following linear
programming problem using either MS
Excel or online simplex tool.
Maximize Z = 8x + 2y
s.t.
20x + 4y ≤ 60
6x + 8y ≤ 12
4x + 4y ≤ 20
x, y ≥ 0

Q2. WMI is planning an advertising
campaign with the help of FM radio,
newspaper and television. The aim of
the advertising campaign is to reach as
many MBA aspirants as possible. The
following information is given:
                                                       FM Radio    Newspaper    Television
Cost per unit of advertisement                         50000       500000
Number of customers reached per advertisement (lakh)   200000      300000
Number of B.Com students reached (lakh)                45000       100000

The upper limit of the advertising budget
is 25 lakhs, and it is required that
a) at least 20 lakh B.com students are
approached.
b) at least 2 units of advertisement be
made each on TV and FM radio
Formulate the linear program and solve
it using either MS Excel or online
simplex tool.
Chapter 19
Applications of Linear
Programming

Introduction
Linear programming has applications
in many areas of business and
management. In order to gain deeper
insight, we will discuss various
applications of the linear programming
method.
Product-Mix Application
GTC manufactures two types of smart
phones. Type A smart phone belongs to
premium category and its price is
Rs.50000 per handset. Type B is an
economy brand costing Rs. 18000 per
handset. Type A smart phone contributes
a profit of Rs. 6000 per unit while Type
B phone contributes a profit of Rs. 3600.
Both the smart phones require three
inputs labour hours, machine hours and
materials. The requirements per unit and
total availability of inputs are
summarized in the following table:
Smart Phones      Labour Hours    Machine Hours    Materials
Type A            8               4                34
Type B            8               5                20
Total Available   600             1400             750

Solution
Let the product mix comprise X units of
Type A phones and Y units of Type B
phones. The objective is to find the
product mix that maximizes total profit,
specified as below:
Maximize Z = 6000X + 3600Y
subject to
8X + 8Y ≤ 600
4X + 5Y ≤ 1400
34X + 20Y ≤ 750
X, Y ≥ 0

Thus the optimal solution requires 37.5
handsets of type B and no type A smart
phones. The maximized total profit is
135000.
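As with the earlier example, the stated optimum can be verified with a short script. This is a minimal sketch using SciPy's linprog (an assumption, not part of the original solution), with the profit coefficients negated because linprog minimizes.

# Product-mix model: Maximize Z = 6000X + 3600Y
# subject to 8X + 8Y <= 600, 4X + 5Y <= 1400, 34X + 20Y <= 750, X, Y >= 0.
from scipy.optimize import linprog

c = [-6000, -3600]
A_ub = [[8, 8], [4, 5], [34, 20]]
b_ub = [600, 1400, 750]

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None), (0, None)], method="highs")
print(res.x)      # approximately [0.0, 37.5]
print(-res.fun)   # approximately 135000.0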
Investment Application
Ketan is contemplating how to allocate
funds to the various investment avenues
available to him so as to maximize the
return on investment.
The various investments alternatives
are:-
a) EPF
b) Company bonds
c) Time deposits
d) Gold
e) Mutual fund
f) Real estate
g) Stock
He estimated the risk involved in
various investment alternatives in terms
of standard deviation of returns. The
data on the return of investment the
locking period and the risk involved are
as follows:
Instruments Return Locking Risk
period
EPF 9% 15 1
Company 12% 5 3
Bonds
Time 8.75% 7 1
deposits
Gold 18% 3 2
Mutual 15% 3 3
fund
Real estate 20% 4 4
Stock 30% 3 5

He decided that the average standard
deviation (risk) should not be more than
4 and that the average locking period
should not be more than 15 years.
Further, he also specified that at least
30% of the investible funds must be
invested in the mutual fund.
Maximize Return Z = 0.09x1 + 0.12x2 + 0.875x3 + 0.18x4 + 0.15x5 + 0.2x6 + 0.30x7
subject to
x1 + 3x2 + x3 + 2x4 + 3x5 + 4x6 + 5x7 ≤ 4 (risk constraint)
15x1 + 5x2 + 7x3 + 3x4 + 3x5 + 4x6 + 5x7 ≤ 15 (locking period constraint)
x5 ≥ 0.30
x1, x2, x3, x4, x5, x6, x7 ≥ 0
Thus the total return is 187.5%. The
solution suggests putting money in the
time deposits given the risk profile of
the investor.

Transportation Problem Application


One of the important applications of
linear programming is the transportation
problem. It deals with the distribution of
goods from several points of supply to a
number of points of demand. While
supply points are called sources, points
of demand are called destinations.
Each point of supply or source supplies
a fixed number of units often called
availability and each destination has a
fixed demand known as the
requirement.
The objective in transportation problem
may be to minimize transportation cost
or minimize transportation time while
transporting goods from sources to
destination.

Solved Example
JE Motors, an automobile manufacturing
company, transports cars from three
plants, 1, 2, and 3, to three major
cities, A, B, and C. The manufacturing
plants are able to supply the following
number of cars per month:
Plants    Supply (capacity)
1         1500
2         1500
3         1000
The requirements of the three major
cities in number of cars per month:
Cities    Demand (requirement)
A         1000
B         1000
C         2000

In the table below, each row represents a
source and each column represents a
destination. The cost of transporting 1
car from each plant to each city, in
rupees, is shown as follows:
From \ To    City A       City B       City C
Plant 1      4000 (D)     3000 (E)     8000 (F)
Plant 2      7000 (G)     5000 (H)     9000 (I)
Plant 3      4000 (J)     5000 (K)     5000 (L)
Demand       1000         1000         2000

In this problem, the decision variables
Xij are the number of cars transported
from plant i (where i = 1, 2, 3) to city j
(where j = A, B, C). Here it is important
to remember that total supply (4000) is
equal to total requirement (4000). Such a
transportation problem is called a
balanced problem.
Let D be the number of cars transported
from plant 1 to city A. Since the cost of
transporting 1 car from plant 1 to city A
is 4000, the cost of this route is 4000D.
The other cost terms can be obtained
similarly.
The objective in this problem is to
minimize total cost of transportation
represented by:
Minimize Z = 4000D + 3000E + 8000F + 7000G + 5000H + 9000I + 4000J + 5000K + 5000L
subject to plant and city constraints
shows as follows:

The total number of cars transported
from each plant to cities A, B, and C
must be equal to that plant's supply:
D + E +F = 1500
G + H + I = 1500
J + K + L = 1000
The total number of cars received by
each city from plants 1, 2, and 3 must be
equal to that city's demand, so:
D + G + J = 1000
E + H + K = 1000
F + I + L = 2000
All decision variables are non-negative.
Step 1: Write the linear program in the
online tool as shown below.
Step 2: Click Solve.
Thus, the total minimized transportation
cost is 22000000. It suggests
transporting 1000 cars from plant 1 to
city A and 500 cars to city B and no car
to city C. Likewise the solution suggests
no car from plant 2 to city A and only
500 and 1000 cars from plant 2 to city B
and C respectively. Further it suggests
transportation of no car from plant 3 to
city A and B and only 1000 cars from
plant 3 to city C.
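The same solution can be reproduced with a short script. The sketch below uses SciPy's linprog with equality constraints for the supply and demand equations (an assumption of this example, not the tool used in the text).

# Balanced transportation problem: minimize total shipping cost.
# Variables in the order D, E, F, G, H, I, J, K, L
# = (1A, 1B, 1C, 2A, 2B, 2C, 3A, 3B, 3C).
from scipy.optimize import linprog

c = [4000, 3000, 8000, 7000, 5000, 9000, 4000, 5000, 5000]
A_eq = [
    [1, 1, 1, 0, 0, 0, 0, 0, 0],  # plant 1 supply
    [0, 0, 0, 1, 1, 1, 0, 0, 0],  # plant 2 supply
    [0, 0, 0, 0, 0, 0, 1, 1, 1],  # plant 3 supply
    [1, 0, 0, 1, 0, 0, 1, 0, 0],  # city A demand
    [0, 1, 0, 0, 1, 0, 0, 1, 0],  # city B demand
    [0, 0, 1, 0, 0, 1, 0, 0, 1],  # city C demand
]
b_eq = [1500, 1500, 1000, 1000, 1000, 2000]

res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=[(0, None)] * 9, method="highs")
print(res.x)     # the optimal shipment on each of the nine routes
print(res.fun)   # 22000000.0, the minimized transportation cost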
Assignment Problem
There are situations where we generally
encounter the following problems:
assigning salesman to sales
territories
assigning instructors to courses
assigning Agents to tasks
assigning Contracts to bidders
assigning taxis to customers
The distinguishing feature of assignment
problem is one to one relation; one agent
to one and only one job. So in the
assignment problem the assumption is:
(Number of agents) = (number of jobs)
Such problems are called balanced
problems.
Solved Example
A department has three visiting
professors, A, B, and C who are to be
assigned to three courses, 1, 2, and 3.
The estimated course completion times
in hours are given in the following table:
             Course 1    Course 2    Course 3
Professor A  36 (P)      30 (Q)      45 (R)
Professor B  18 (S)      20 (T)      24 (U)
Professor C  30 (V)      15 (W)      34 (X)

The objective is to minimize the total
completion time in hours, represented by
Z. Let P = 1 if professor A is assigned to
course 1 and P = 0 otherwise, so that the
time contributed by this assignment is
36P; the other decision variables Q to X
are defined likewise. Thus
Minimize Z = 36P + 30Q + 45R + 18S + 20T + 24U + 30V + 15W + 34X
subject to:
Since each professor is assigned to one
and only one course, we have:
P + Q + R =1
S +T + U =1
V+ W+ X= 1
Since each course is assigned to one and
only one professor, we have:
P +S +V =1
Q + T +W =1
R +U + X =1
Thus the total minimized course
completion time is 75 hours by assigning
professor A to course 1, professor B to
course 3 and professor C to course 2.
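Because the assignment problem is one-to-one, a three-professor example can also be checked by simply enumerating all possible assignments. The sketch below does that in Python (an illustrative check, not the method prescribed in the text).

# Enumerate all one-to-one assignments of professors A, B, C to courses 1, 2, 3
# and pick the one with the smallest total completion time.
from itertools import permutations

hours = {                      # completion time of each professor on each course
    "A": [36, 30, 45],
    "B": [18, 20, 24],
    "C": [30, 15, 34],
}
professors = list(hours)

best_total, best_assignment = None, None
for courses in permutations(range(3)):   # courses[i] is the course given to professors[i]
    total = sum(hours[p][c] for p, c in zip(professors, courses))
    if best_total is None or total < best_total:
        best_total = total
        best_assignment = {p: c + 1 for p, c in zip(professors, courses)}

print(best_assignment, best_total)   # {'A': 1, 'B': 3, 'C': 2} 75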
Exercises
Q1. Sah & Sah produces products at
three plants 1, 2, and 3 and transported
to three warehouses A, B, and C. The
transportation cost per unit, plant
capacity and requirement from the
warehouses are given below:

                     Warehouse
Plant                A       B       C       Plant Capacity
1                    200     165     244     3000
2                    105     108     88      5000
3                    125     180     100     1000
Warehouse Demand     2000    4000    3000
1. Formulate a linear programming
model for minimizing transportation
cost.
2. Solve the linear programming
problem either using Solver-add-in
or online simplex tool.
Q2. A supervisor has 4 workers, A, B,
C, and D who are to be assigned to three
jobs, 1,2, and 3. The following data
show the number of hours required for
each worker to complete each job:

             Jobs
Workers      1       2       3
A            160     200     275
B            176     236     225
C            186     235     226
D            161     245     235

1. Formulate a linear programming
model for minimizing job completion
time.
2. Solve the linear programming
problem either using Solver-add-in
or online simplex tool.
Chapter 20
Decision Theory
Introduction
Managers today operate in a very
competitive and uncertain environment.
Managers always try to reduce
uncertainty and make better estimates of
what will happen in the future. This goal
is attained by decision analysis
techniques. Decision theory provides
managers with a framework for making
decisions. When a future event is
characterized by uncertainty, decision
analysis allows us to make optimal
decisions from a set of possible decision
alternatives.
A decision maker's main job is to
identify the problem, understand its
cause and prepare an action plan for its
rectification. If he/she feels that the
decision is not optimal, then more
information is sought. However, a
decision cannot be unduly delayed;
making decisions on time is important,
as delayed decisions mean lost
opportunities.
Effective decision-making requires
following things:
Identify and define the problem
Identify the possible alternatives
available
Evaluate each possible alternative
Select the best alternative
The decision-making process viewed in
a quantitative way is very aptly put by
Israel Brosch as:

The quantitative approach is based upon
data, facts, information and logic. It
consists of an orderly and systematic
framework of defining, analyzing and
solving problems in an objective and
scientific manner. The quantitative
approach is not intended to replace
perception, good judgment and common
sense, the fundamental decisions tools of
competent managers. It is intended to
improve managers decision making
ability and to provide them an
accountable means for justifying and
evaluating their own managerial
performance.

The ultimate power with regard to
decision making rests with the manager.
However, statistical or quantitative
analysis aids the manager in making
optimal decisions. In this chapter, we
will introduce students to Decision
Theory which is an analytical and
systematic approach to the study of
decision-making process. In brief, in
decision theory, we select a decision
from a set of possible decision
alternatives when future is uncertain.
The goal is to optimize the
payoff/benefit.

Elements of Decision
Theory
The important elements of decision
theory are:
1. Alternatives
2. States of Nature
3. Payoff
4. Payoff Table

Alternatives
These are the various courses of action
available to the decision maker. When a
decision maker is faced with
alternatives, decision making involves
choosing among two or more possible
alternatives with a view to selecting the
best one so that the benefit/payoff is
maximized. For
example, suppose a company has Rs. 60
cr. of cash surplus. This cash surplus can
be used for various alternatives:
distribute the cash as dividend
reinvest the cash in the company
keep the cash for meeting future
obligation

The decision maker has some control
over the various alternatives or courses
of action.

States of Nature
States of nature, or outcomes of an
event, are uncontrolled occurrences. The
states of nature correspond to the
possible future events which are not in
the control of the decision maker. States
of nature are mutually exclusive,
meaning that when one state of nature
occurs, it rules out the occurrence of all
other states of nature, and collectively
exhaustive, meaning that all the possible
states of nature must be identified by the
decision maker.

Payoff
Payoff is the resulting net benefit to the
decision maker that arises from selecting
a particular alternative in a particular
state of nature.

Payoff Table
A payoff table lists the various
alternatives and states of nature; the
entries in the table are the payoffs. In the
table shown below, the rows correspond
to the various alternatives, the columns
correspond to the various states of
nature, and the entries of the payoff table
are the payoffs.

Payoff Table
                        States of Nature
Alternatives    S1      S2      S3      ...     Sm
A1              P11     P12     P13     ...     P1m
A2              P21     P22     P23     ...     P2m
A3              P31     P32     P33     ...     P3m
...             ...     ...     ...     ...     ...
An              Pn1     Pn2     Pn3     ...     Pnm

Decision-Making Criteria
The decision-making criteria broadly
divided into three:
1. Decision-making under Certainty
2. Decision-making under Uncertainty
3. Decision-making under Risk

Decision-making under Certainty


Decision making under certainty is also
termed as decision making with perfect
information. In this situation, the future
state of nature is assumed to be known.
The decision maker, in this situation,
precisely knows the outcome or
consequence of every decision
alternative in advance and chooses the
alternative with the best payoff. Linear
programming is a
technique frequently used for making
decision in such circumstance.
Decision-making under Uncertainty
Under uncertainty, the decision maker
has no knowledge about the probabilities
of the various states of nature.
Consequently, the expected payoff of
each alternative cannot be estimated. For
example, while launching a new product
in the market, a company generally has
no idea how the market will react to the
new product, as there are no historical
data or past experience available with
the company.
These criteria will be helpful in doing a
best-case or worst-case analysis. It is to
be kept in mind that different criteria
will lead to different decision
recommendations. Thus, the decision
maker must understand this and then
select the specific criteria for decision
making under uncertain environment.
Consider the following example to
illustrate how decision is made under
uncertainty. An investor is planning to
make some investment. The various
alternatives in which he/she can put
his/her money are:
Saving Bank Account
Fixed Deposit
Bonds
Mutual Funds
Equities
The returns (%) from each instrument
are dependent upon the state of the
economy. Three states of nature are
identified as:
No Growth Scenario
Low Growth Scenario
High Growth Scenario
However, the probabilities of occurrence
of the above states of nature are not
available to the investor. The payoff
table for this problem is given below as:
Decisions          States of Nature
                   No Growth    Low Growth    High Growth
Saving Bank A/C    3            4             5
Fixed Deposit      7.5          8             8.5
Gold               10           12            18
Company Bond       8            10            11.5
Mutual Fund        -3           12            21
Equities           -12          15            25

The criteria used for decision making
under uncertainty are:
The optimistic approach (Maximax)
The conservative approach
(Maximin)
Equally Likely (Laplace Criterion)
Hurwicz Criterion
The minimax regret approach
(Minimax)

The Optimistic Approach (Maximax Criterion)
Decision making using this criterion is
based on best possible outcome. This is
the approach taken by optimistic and
aggressive decision-maker. The users of
this approach are very optimistic about
future outcome and think that the best
possible outcome always occurs. An
optimistic and aggressive decision-
maker searches for highest payoff. The
procedure is as follows:
First, we find the maximum payoff
associated with each decision
alternative.
Next select the decision alternative
associated with maximum of
maximum payoff.
In the above example, using the
optimistic criteria, the maximum of
maximum return is 25% coming from
investment in equities. Thus, an
optimistic decision-maker invests money
in equities.
Decisions          No Growth    Low Growth    High Growth    Maximax Criterion
Saving Bank A/C    3            4             5              5
Fixed Deposit      7.5          8             8.5            8.5
Gold               10           12            18             18
Company Bond       8            10            11.5           11.5
Mutual Fund        -3           12            21             21
Equities           -12          15            25             25

The Conservative Approach (Maximin Criterion)
Using this criterion, the decision is
based on worst possible outcome. This
is the approach taken by pessimistic and
conservative decision-maker. The users
of this approach are very pessimistic
about future outcome and think that the
worst possible outcome always occurs.
A pessimistic and conservative
decision-maker looks for lowest payoff.
The procedure is as follows:
First, we find the minimum payoff
associated with each decision
alternative.
Next select the decision alternative
associated with maximum of
minimum payoff.
In the above example, using the
pessimistic criteria, the maximum of
minimum return is 10% coming from
investment in gold. Thus, a pessimistic
decision-maker invests money in gold.
Decisions          No Growth    Low Growth    High Growth    Maximin Criterion
Saving Bank A/C    3            4             5              3
Fixed Deposit      7.5          8             8.5            7.5
Gold               10           12            18             10
Company Bond       8            10            11.5           8
Mutual Fund        -3           12            21             -3
Equities           -12          15            25             -12
Equally Likely (Laplace Criterion)
In the Laplace criterion, the decision-
maker assumes that each state of nature
is equally likely and computes the
expected payoff for each decision
alternative.

Decisions          No Growth (1/3)    Low Growth (1/3)    High Growth (1/3)    Laplace Criterion
Saving Bank A/C    3                  4                   5                    4
Fixed Deposit      7.5                8                   8.5                  8
Gold               10                 12                  18                   13.33
Company Bond       8                  10                  11.5                 9.83
Mutual Fund        -3                 12                  21                   10
Equities           -12                15                  25                   9.33

Thus, according to the Laplace criterion, the
decision-maker selects gold instrument
for investment.

Hurwicz Criterion
So far we have seen that the most
optimistic and most pessimistic criteria
are the maximax and maximin
respectively, which are two extreme
ways of making a decision. Therefore, it
is more realistic to select an appropriate
combination of the two extremes. This
approach was suggested by Hurwicz. In
this case, the degree of optimism
between extreme pessimism (0) and
extreme optimism (1) is given by α,
which lies between 0 and 1. In this
approach, a decision index is defined for
each alternative as:
Hi = α Mi + (1 - α) mi
where
Mi = maximum payoff corresponding to the ith decision alternative
mi = minimum payoff corresponding to the ith decision alternative
Decisions          Decision Index (α = 0.6)         Hurwicz Criterion
Saving Bank A/C    0.6*(5) + (1 - 0.6)*(3)          4.2
Fixed Deposit      0.6*(8.5) + (1 - 0.6)*(7.5)      8.1
Gold               0.6*(18) + (1 - 0.6)*(10)        14.8
Company Bond       0.6*(11.5) + (1 - 0.6)*(8)       10.1
Mutual Fund        0.6*(21) + (1 - 0.6)*(-3)        11.4
Equities           0.6*(25) + (1 - 0.6)*(-12)       10.2

Hence, according to the Hurwicz criterion,
the decision-maker selects gold
instrument for investment.

Regret Criterion
In this approach, a decision-maker tries
to minimize his/her opportunity loss, or
regret. For each state of nature, the regret
of an alternative is the difference
between the best payoff available under
that state and the payoff of the chosen
alternative; when the decision maker
chooses the alternative with the best
payoff for the state that occurs, there is
no opportunity loss or regret. For
example, if the money is invested in gold
and the no-growth scenario occurs, the
payoff of 10 is the best available under
that state, so the regret is 10 - 10 = 0. If
instead the money is put in the saving
bank a/c and the no-growth scenario
occurs, the regret is 10 - 3 = 7. Under the
low growth scenario the best payoff is
15 (equities), so the regret of the saving
bank a/c is 15 - 4 = 11, and under high
growth the best payoff is 25 (equities),
so its regret is 25 - 5 = 20. The regret
table for each decision alternative can
be constructed likewise:

Regret Table
Decisions          No Growth    Low Growth    High Growth    Maximum Regret
Saving Bank A/C    7            11            20             20
Fixed Deposit      2.5          7             16.5           16.5
Gold               0            3             7              7
Company Bond       2            5             13.5           13.5
Mutual Fund        13           3             4              13
Equities           22           0             0              22

The minimax regret criterion selects the
decision alternative whose maximum
regret is smallest. The maximum regrets
are {20, 16.5, 7, 13.5, 13, 22}
respectively, and the minimum among
these is 7, corresponding to gold. Hence,
according to the regret or opportunity
loss criterion, the decision alternative
selected for investment is gold.
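The five criteria can also be computed programmatically. Here is a minimal sketch in Python for the investment payoff table above (the dictionary layout and names are illustrative, not from the original text).

# Decision-making-under-uncertainty criteria for the investment payoff table.
payoffs = {
    "Saving Bank A/C": [3, 4, 5],
    "Fixed Deposit":   [7.5, 8, 8.5],
    "Gold":            [10, 12, 18],
    "Company Bond":    [8, 10, 11.5],
    "Mutual Fund":     [-3, 12, 21],
    "Equities":        [-12, 15, 25],
}
alpha = 0.6                      # Hurwicz degree of optimism

maximax = max(payoffs, key=lambda d: max(payoffs[d]))
maximin = max(payoffs, key=lambda d: min(payoffs[d]))
laplace = max(payoffs, key=lambda d: sum(payoffs[d]) / len(payoffs[d]))
hurwicz = max(payoffs, key=lambda d: alpha * max(payoffs[d]) + (1 - alpha) * min(payoffs[d]))

# Minimax regret: regret = best payoff under each state minus the alternative's payoff
best_per_state = [max(col) for col in zip(*payoffs.values())]
regret = {d: max(b - p for b, p in zip(best_per_state, payoffs[d])) for d in payoffs}
minimax_regret = min(regret, key=regret.get)

print(maximax, maximin, laplace, hurwicz, minimax_regret)
# -> Equities, Gold, Gold, Gold, Gold

Only the optimistic (maximax) criterion picks equities; the other four criteria point to gold for this payoff table.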

Decision-making under
Risk
Decision making under situations of less
than complete certainty can be classified
into two:
Risk
Uncertainty
Risk refers to a situation where decision
maker is aware of all possible states of
nature or outcome and could attach
probability to each state of nature.
Uncertainty refers to a situation where
decision maker has no knowledge about
the probability of the states of nature
occurring.

A decision maker in a risky situation has
some knowledge regarding the states of
nature and can attach probabilities to
them. In such circumstances, with the
help of the inputs collected by the
decision maker, a probability estimate
for the occurrence of each state of nature
or outcome is assigned. Next, for each
decision alternative, the expected value
across the states of nature is computed,
given by the following formula:
Expected Monetary Value (EMV) = Σ Xi Pi
where
Xi = payoff under the ith state of nature
Pi = probability of the ith state of nature
Using the expected monetary value
(EMV) criterion, the decision maker
selects the alternative with the best
expected payoff.
Example
Today education is considered an
investment. A student's choice of a
professional or non-professional course
depends on the kind of job and salary
package they expect to get from it. To
make the decision, a student views the
prospect of a highly paid job as either
low, medium or high. The following
payoff table shows the projected package
in Rs. lakhs:

                     Prospects of Jobs
Course       Low (0.5)    Medium (0.3)    High (0.2)
MBA/MCA      2.3          4               8
MCOM         3.5          5               6
Others       2.5          3.6             7

Using expected value, the optimal
decision strategy is:
EMV (D1: MBA/MCA) = 2.3(0.5) + 4(0.3) + 8(0.2) = 3.95 lakhs
EMV (D2: MCOM) = 3.5(0.5) + 5(0.3) + 6(0.2) = 4.45 lakhs
EMV (D3: Others) = 2.5(0.5) + 3.6(0.3) + 7(0.2) = 3.73 lakhs
By doing MCOM, a student can expect
on average a package of 4.45 lakhs,
which is higher than the expected value
of the other alternatives. Thus, the best
decision alternative is the MCOM
course.
Expected Value of Perfect
Information (EVPI)
Probability estimates of the states of
nature can be improved if one can gather
more information. The expected profit
may increase if one knows with certainty
which state of nature will occur. If the
decision maker precisely knows which
state of nature will occur, then we can
find what is called the expected value of
perfect information (EVPI). If we knew
which state would occur, then as
decision makers we would choose the
highest payoff under that state of nature.
The expected value with perfect
information is therefore
EVwPI = 3.5(0.5) + 5(0.3) + 8(0.2) = 4.85 lakhs
and
EVPI = EVwPI - EMV = 4.85 - 4.45 = 0.4 lakh
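For completeness, here is a minimal sketch, in Python, of the EMV and EVPI calculations for the education-choice example above (the variable names are illustrative assumptions, not part of the original text).

# EMV and EVPI for the education-choice example.
probs = [0.5, 0.3, 0.2]                     # P(low), P(medium), P(high)
payoffs = {
    "MBA/MCA": [2.3, 4, 8],                 # projected package in Rs. lakhs
    "MCOM":    [3.5, 5, 6],
    "Others":  [2.5, 3.6, 7],
}

def emv(values):
    """Expected monetary value: sum of payoff times probability."""
    return sum(x * p for x, p in zip(values, probs))

emvs = {course: emv(values) for course, values in payoffs.items()}
best_course = max(emvs, key=emvs.get)        # MCOM, EMV = 4.45

# Expected value with perfect information: best payoff under each state
ev_wpi = sum(max(col) * p for col, p in zip(zip(*payoffs.values()), probs))
evpi = ev_wpi - emvs[best_course]            # 4.85 - 4.45 = 0.4 lakh
print(emvs, best_course, ev_wpi, evpi)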
Decision Tree Analysis
Decision tree is a sequential
representation of the multi-stage
decision problem. In the decision-
making process, the decision maker has
to take into account a number of
alternatives and uncertainties associated
with those decision alternatives.
Decision-maker analyzing the
alternatives has to identify the time
sequence in which various actions and
events follow. Each different strategy
involves a different payoff. When this
type of analytical process is displayed
graphically in the form of a tree with
branches, it is called a decision tree.
A decision tree is a graphical display
used for representing sequential,
multistage decision problem. Each
decision tree has two types of nodes.
While square or box type nodes
correspond to decision alternative,
round or circle nodes represent the
outcome or chance. The branches
originating from each circle node
represent the different outcomes and the
branches originating from each square
node represent the different decision
alternatives. At the end of each branch of
a decision tree are the payoffs.

Example
Omega Group has to decide whether to
set up a consulting firm in Delhi or
Patna. Setting up the consulting firm in
Delhi will cost the company Rs. 50
lakhs. However, if it sets up the firm in
Patna, it will cost Rs. 25 lakhs. Omega Group
conducted a cost-profit analysis which
reveals the following estimates over the
next 5 years:
High demand for
services Probability = 0.5
Moderate demand for
services Probability = 0.3
Low demand for
services Probability =
0.2

Setting up the consulting firm in Delhi
with high demand will yield an
annual profit of Rs. 30 lakhs
Setting up consulting firm in Delhi
with moderate demand will yield an
annual profit of Rs. 20 lakhs
Setting up consulting firm in Delhi
with low demand will yield an
annual profit of Rs.8 lakhs
Setting up consulting firm in Patna
with high demand will yield an
annual profit of Rs.25 lakhs
Setting up consulting firm in Patna
with medium demand will yield an
annual profit of Rs.15 lakhs
Setting up consulting firm in Patna
with low demand will yield an
annual profit of Rs.10 lakhs
When we add the probabilities of each
state of nature and the payoff associated
with each outcome to the tree, this is
called rolling out the tree.

In order to compute the value of the
decision, we will roll out the tree. The
expected value of node (2) is given as:
Expected Monetary Value (EMV) of node 2 = 0.5*(30) + 0.3*(20) + 0.2*(8) = 22.6 lakhs

Similarly, the expected value of node 3 = 0.5*(25) + 0.3*(15) + 0.2*(10) = 19 lakhs.

Thus, on the basis of the above analysis,
Omega Group is advised to set up the
consulting firm in Delhi.

Summary
Decision theory provides a
framework for making decisions to
managers. When future event is
characterized by uncertainty,
decision analysis allows us to make
optimal decisions from a set of
possible decision alternatives.
Effective decision-making requires
following things:
Identify and define the problem
Identify the possible alternatives
available
Evaluate each possible
alternative
Select the best alternative

The important elements of decision
theory are:
Alternatives
States of Nature
Payoff
The various alternatives
available to the decision maker.
States of nature of outcomes of
an event are uncontrolled
occurrences. The states of nature
correspond to the possible future
events which are not in the
control of the decision maker.
Payoff is the resulting net benefit to
the decision maker that arises from
selecting a particular alternative in a
particular state of nature.

Decision making under certainty is
also termed as decision making with
perfect information. In this situation,
the future state of nature is assumed
to be known.
Under uncertainty, decision maker
has no knowledge about the
probabilities of various states of
nature. Consequently, the expected
payoff of each alternative cannot be
estimated.

The criteria used for decision
making under uncertainty are:
The optimistic approach
(Maximax)
The conservative approach
(Maximin)
Equally Likely (Laplace
Criterion)
Hurwicz Criterion
The minimax regret approach
(Minimax)
Using the expected monetary value
(EMV) criterion, the decision maker
selects the alternative with the best
expected payoff under risk.

Decision tree is a sequential
representation of the multi-stage
decision problem. In the decision-
making process, the decision maker
has to take into account a number of
alternatives and uncertainties
associated with those decision
alternatives.
Glossary

Decision theory: It provides a framework for making decisions to managers. When a future event is characterized by uncertainty, decision analysis allows us to make optimal decisions from a set of possible decision alternatives.

Alternatives: The various alternatives (courses of action) available to the decision maker.

States of Nature: States of nature, or outcomes of an event, are uncontrolled occurrences. The states of nature correspond to the possible future events which are not in the control of the decision maker.

Payoff: Payoff is the resulting net benefit to the decision maker that arises from selecting a particular alternative in a particular state of nature.

Payoff Table: A payoff table lists the various alternatives and states of nature; the entries in the table are the payoffs.

Maximax Criterion: Decision making using this criterion is based on the best possible outcome. This is the approach taken by an optimistic and aggressive decision-maker.

Maximin Criterion: Using this criterion, the decision is based on the worst possible outcome. This is the approach taken by a pessimistic and conservative decision-maker.

Laplace Criterion: In the Laplace criterion, the decision-maker assumes that each state of nature is equally likely and computes the expected payoff for each decision alternative.

Hurwicz Criterion: This approach was suggested by Hurwicz. In this case, the degree of optimism between extreme pessimism (0) and extreme optimism (1) is given by α, which lies between 0 and 1.

Regret Criterion: In this approach, a decision-maker tries to minimize his/her opportunity loss. When a decision maker chooses the best payoff pertaining to a decision alternative, there is no opportunity loss or regret.

Risk: Risk refers to a situation where the decision maker is aware of all possible states of nature or outcomes and can attach a probability to each state of nature.

Uncertainty: It refers to a situation where the decision maker has no knowledge about the probability of the states of nature occurring.

Decision tree: It is a sequential representation of a multi-stage decision problem. In the decision-making process, the decision maker has to take into account a number of alternatives and the uncertainties associated with those decision alternatives.

Exercises
1. Assume that a decision maker is faced
with three decision alternatives and
three states of nature with the
following payoffs table:
States of
Decision S1
Alternatives
D1 160 1
D2 112 1
175
D3

What is optimal decision using the


following criteria:
1. Optimistic approach
2. Conservative approach
3. Laplace approach
4. Hurwicz approach (take α = 0.6)
2. Suppose that a decision maker is faced
with four decision alternatives and
three states of nature with the
following payoffs table:
States of
Decision S1 S
Alternatives
D1 150 100
D2 120 105
110 100
D3
80 112
D4

Let us assume that the decision maker has no
probability estimates for the three states of
nature. Determine the best decision using:
a) Maximax criterion
b) Maximin criterion
c) Laplace criterion
d) Minimax regret criterion
e) Hurwicz criterion (α = 0.55)
3. A real estate company has to decide whether
to construct luxury apartments or non-luxury
apartments. The selection of apartments
depends on interest rate scenario in the
economy. The company has conducted
economic analysis to estimate the interest
rate scenario as low, medium or high. The
following payoff table shows the projected
profit in Rs. Lakhs:

States of
Decision S1 (6-7%) S2 (8-9
Alternatives
Luxury 2000 150
Non-luxury 1050 950

a) Using the optimistic, conservative and
Laplace criteria recommend an optimal
decision
b) Suppose that the probability estimates of
each state of nature are S1=0.5, S2=0.3 and
S3=0.2 respectively. What is best decision
using the expected monetary value approach?

4. Assume that a decision maker is faced with
three decision alternatives and three states
of nature with probability estimates of
S1=0.65, S2=0.25 and S3=0.2 respectively.
The payoff table is as follows:

States of
Decision S1(0.65) S2 (0.
Alternatives
D1 160 18
D2 70 110
185 50
D3
a) Recommend an optimal decision
based on expected monetary value
(EMV) approach
b) Calculate the expected value of
perfect information (EVPI)

5. Suppose that a decision maker has
estimated the following payoff table
for a particular venture.

States of Nat
Decision S1 S2 S3
Alternatives
D1 160 235 190
D2 190 150 210
230 195 205
D3 200 220 240
D4

Find the optimal decision using:
a) Maximin criterion
b) Maximax criterion
c) Laplace criterion
d) Hurwicz Criterion (α = 0.65)
e) Minimax regret criterion
f) Assume that probability estimates
of S1=0.4, S2=0.3 and S3=0.15,
S4=0.10 and S5=0.05. Find the optimal
decision using (EMV).
g) Calculate the EVPI.
6. The payoff table of a decision problem
is given as follows:

States of Natur
Decision S1(0.20) S2 (0.4) S3 (
Alternatives

D1 15 35
D2 25 5
20 15
D3

1. Draw a decision tree.


2. What is the expected value of perfect
information?
