Identifier: M-001
Title: Handout (Reader)
Prepared by:
Atakilt Hagos
March, 2014
Addis Ababa
PPS- 562
Advanced Research Methods
The module Advanced Research Methods is specified as follows:

Module: Advanced Research Methods
Level:
Abbreviation: PPS-562
Subtitle:
Duration in Semesters:
Frequency:
Language:
Mode of Delivery:
ECTS:
Workload:
Contact Hours: 70 Hrs
Non-Contact Hours: 140 Hrs
Total Hours: 210 Hrs
Assessment Description:
Examination Types:
Examination Duration:
Assignments:
Repetition:
Description:
Learning Outcomes:
Prerequisites:
Content:
Media:
Literature:
Electronic Books:
Kothari, C.R. (2004) Research Methodology: Methods and Techniques. New Age International (P) Limited Publishers, New Delhi.
Dawson, Catherine (2002) Practical Research Methods. How To Books, Oxford, UK.
The Reader
Part I: Research Meaning and Process
Unit One:
Scientific Research and the Research Process
As students of the masters program and as professionals after graduation, you will be engaged in
scientific research. As decision makers, you may be provided with information on the progress
and findings of a research project sponsored by your organization or another agency. One way or
another, you are likely to be involved in research. So it is very essential for you to know what
research is and how it is carried out. Research requires passion, knowledge and skills. So what is
research? Why do we conduct research? What are the building blocks of scientific research?
What process should you follow in conducting scientific research? We will address these and
other questions in this chapter.
Learning Objectives: After reading this chapter, you should be able to:
Explain the meaning and objectives of research
Discuss characteristics of scientific research
Describe the criteria of Good Research
Distinguish between inductive and deductive research types
Discuss the nine steps of the Research Process
Objectives of Research:
To gain familiarity with a phenomenon or achieve new insights into it (via:
exploratory or formative research studies)
To portray/describe the characteristics of a particular individual, situation or group
(via: descriptive research studies)
It is empirical: Science is based purely on observation and measurement, and the vast majority of scientific knowledge is derived from them.
It is intellectual and visionary: Science requires vision, and the ability to observe the
implications of results. The visionary part of science lies in relating the findings back to the real
world. This process is known as induction, or inductive reasoning, and is a way of relating the
findings to the universe around us.
It uses experiments to test predictions: The process of induction and generalization allows
scientists to make predictions about how they think something should behave, and to design an
experiment to test those predictions, whether in a laboratory or by observing the natural world.
iii) Good research is empirical: It implies that research is related basically to one or
more aspects of a real situation and deals with concrete data that provides a basis
for external validity to research results.
iv) Good research is replicable: This characteristic allows research results to be
verified by replicating the study and thereby building a sound basis for decisions.
In induction, the conclusion is based on reasons, with proof and evidence for a fact.
The first practice is that masters and even PhD students are required to make their topic
as broad as possible and incorporate as many aspects of the problem as possible. The
advantage of this approach is that the student gains some wider knowledge of the
problem; what it lacks is depth, since it takes more time to make the research both wider and
deeper at the same time. In the absence of sufficient time and resources,
o The literature survey becomes too broad.
o Many of the variables in this type of research are not well identified or well
defined, and relevant indicators are not sufficiently included.
o As a result, the questions in a questionnaire or interview are too general and
shallow. The research methods adopted tend in most cases to be descriptive.
With the quality of research in mind, the second approach is more appealing, and you can
follow this guideline while identifying your research problem/topic.
Decide the general area of interest or aspect of a subject matter that you would like to
enquire into and consider the feasibility of a particular solution. Pick a smaller part of a
bigger problem; do not try to address a big problem in one research.
Understand the problem by discussing with friends/colleagues or with those that have
some expertise in the matter or with those agencies working in relation to the issue.
Narrow the problem down based on the general discussion and phrase the problem in
operational terms. This process of narrowing down the problem is iterative.
Examine all available literature to acquaint yourself with the selected problem. There are
two types of literature: the conceptual (concepts and theories) and the empirical. This can
also help the researcher know what data are available.
Verify carefully the validity and objectivity of the background facts concerning the
problem.
Your problem statement must be specific to the issue at hand and often ends up with
research questions. Within a given research topic, it is possible that different researchers
could formulate different research questions. Therefore, it is very important to write
down your research questions at the end of the problem statement. The research
questions specifically indicate what your study is about.
ii) While you state the problem, you may need to provide some data or information to
express the magnitude of the problem. This may require you to do some preliminary data
gathering.
The theoretical framework refers to a summary of the theories that you will refer to in your study. You
will refer to the theories during the development of the hypotheses and the conceptual framework; when
you prepare the research design; do the data analysis and generalization. Your conceptual framework will
indicate the important issues to be assessed or the variables to be measured; their possible indicators; the
type and direction of relationship that exists among the variables, and so on.
What you summarize as part of the theoretical framework has to be very relevant to the topic and
particularly the research problem and the research questions. While conducting the literature survey,
students often throw in whatever literature is in one way or another related to the research area but
not necessarily with the research problem. Failing to prepare the theoretical framework properly has at
least the following disadvantages:
a) You will not have the basis to define relevant concepts, identify the assessment issue or the
variables and define them.
b) You don't know what relationship to expect.
c) It will not be easy for you to choose the appropriate research design.
d) Your data collection instruments will be ill designed.
e) During analysis, you will not have any theory to compare your results with.
f) The contribution of your research to the existing theory will be blurred.
ii) The conceptual framework
Based on the theoretical framework, you are expected to develop your conceptual framework. In
this part, you will define the concepts you will use in your research. In the literature, concepts
may be defined in different ways and you will have to make a choice here. In your study, how
are the concepts defined operationally? What are the variables and indicators that you will use in your
study to measure the concepts? How do the different concepts relate to each other? These
and other questions should be answered via your conceptual framework.
Conceptual frameworks are best presented graphically rather than in text. The diagram depicts the
concepts/issues and how they relate to each other. You may create your own illustrative diagram
or adapt it from the literature. In the latter case, you have to clearly cite the sources for
your diagram.
A hypothesis:
Is a tentative assumption made in order to draw out and set its logical or empirical
consequences; it should be specific and pertinent to the piece of research in hand.
Provides the focal point for the research: to delimit the area, sharpen thinking and keep
the researcher on the right track.
Determines the data type required, the data collection and sampling methods to be used,
and the tests that must be conducted during data analysis.
Results from a-priori thinking about the subject and examination of available data and
material.
Organizing fieldwork
Organizing data processing (e.g. entering the coding of the questionnaire items into an Excel or
SPSS spreadsheet; entering data before the data collection has been completed)
This is a very important part of your study. Depending on the methods you selected during your
research design, which in turn depend on the type of the research (exploratory, descriptive, etc) and the
research approach (qualitative, quantitative, or both), you need to analyze the data using those
methods. What is expected as a result of your analysis are the findings pertinent to the research questions
and objectives.
In exploratory research, it could be in the form of proposing a hypothesis that shall be tested by
several other studies in the future.
In explanatory studies, your generalization could be in the form of statements regarding what
factors explain the dependent variable and whether this is in accordance with what is stated in
theory.
If you had no hypothesis at the beginning, explain the findings on the basis of some theory; this
is known as interpretation. The process of interpretation may trigger new questions which will
serve as a basis for further research.
[Figure: The cycle of qualitative research. Fieldwork (data collection through observation, artifacts, interviews, hanging out, focus groups and field notes) feeds data analysis and a tentative working hypothesis; the hypothesis is refined, theory is developed and, if necessary, additional data are collected before writing the report.]
Unit Two
Types of Research Design and the Sampling Design
2.1 Research Approaches
2.1.1 Types of Research based on the nature of enquiry
Based on the nature of the research enquiry, research approaches are classified as exploratory,
descriptive, explanatory (causal) or predictive.
A. Exploratory research:
Research undertaken to explore an issue or a topic to identify a problem, clarify the nature of the problem
or define the issue involved. It can be used to develop propositions (hypothesis) for further research, to
gain new insights and a greater understanding of the issue, especially when little is known about the issue.
Characteristics of exploratory research:
Exploratory studies can be carried out based on a literature search (review), an experience survey, or the
analysis of selected cases. Observation, focus groups, and interviews are useful methods of data collection
for exploratory research.
B. Descriptive research:
Fact-finding enquiries, describing the state of affairs as it exists; the researcher has no control over the
variables and can only report what happened or what is happening, using survey or correlational methods.
It aims at answering the questions: Who? What? Where? When? How? and How many? This type of
research is carried out to answer more clearly defined research questions.
Longitudinal: Studying different units (e.g. households) over time (e.g. over several years).
This study could be based on either a true panel or an omnibus panel.
True panel: In this case, the units of analysis included in the sample (e.g. households) are
consistently studied over time. For example, if the study is about the consumption pattern of
households and Ato Ayele's household is part of the study, Ato Ayele's household will
be studied throughout the time period.
Omnibus panel: In this case, the members of the sample may change. In the previous
example, Ato Ayele's household could be part of the study in year I while it may not be
part of the study in year II.
Cross sectional: Studying different units (e.g. households, sub-cities, regions, etc) at a given
point in time.
C. Causal or explanatory (hypothesis-testing/experimental) research:
Helps to develop causal explanations about variables/factors by addressing the "why" questions: Why
do people choose brand A and not brand B? Why are some customers and not others satisfied with a
firm's product? Why do some celebrities and not others use drugs? Explanatory research may
involve experiments (laboratory or field experiments).
D. Predictive research: to predict the likely future effects of current actions using the if...then proposition.
b) Field-Setting Research
c) Laboratory/Simulation Research
(iii) Sampling unit: A sampling unit may be a geographical one such as a state, district,
village, etc., a construction unit such as a house, flat, etc., a social unit such as a
family, club, school, etc., or an individual. The researcher will have to decide which
one or more of such units to select for the study.
(iv) Source list (sampling frame): the frame from which the sample is to be drawn. It
contains the names of all items of the universe (in the case of a finite universe only). If
the source list is not available, the researcher has to prepare it.
(v) Size of sample: An optimum sample is one which fulfills the requirements of
efficiency, representativeness, reliability and flexibility. Furthermore, the desired
precision as well as an acceptable confidence level for the estimate, and the parameters of
interest, must be kept in view while deciding the size of the sample.
A systematic bias results from errors in the sampling procedures, and it cannot be
reduced or eliminated by increasing the sample size. At best, the causes responsible
for these errors can be detected and corrected.
Usually a systematic bias is the result of one or more of the following factors:
1. Inappropriate sampling frame:
2. Defective measuring device:
3. Non-respondents:
4. Indeterminacy principle: Sometimes we find that individuals act differently when kept
under observation than they do in non-observed situations.
5. Natural bias in the reporting of data: People in general understate their incomes if asked
about them for tax purposes, but overstate them if asked for social status or to show their
affluence. Generally, in psychological surveys, people tend to give what they think is the
correct answer rather than revealing their true feelings.
Sampling errors:
Are the random variations in the sample estimates around the true population
parameters (e.g., population mean).
Since they occur randomly and are equally likely to be in either direction, their
nature happens to be of compensatory type and the expected value of such errors
happens to be equal to zero.
Sampling error decreases with an increase in the size of the sample, and it tends to be
of a smaller magnitude in the case of a homogeneous population.
If we increase the sample size, precision can be improved. But increasing the size
of the sample has its own limitations: a large sample increases the cost of
collecting data and may also enhance the systematic bias.
Thus the effective way to increase precision is usually to select a better sampling
design, one which has a smaller sampling error for a given sample size at a given cost.
In practice, however, people may prefer a less precise design because it is easier to
adopt and because systematic bias can be controlled better in such a design.
In brief, while selecting a sampling procedure, the researcher must ensure that the
procedure causes a relatively small sampling error and helps to control systematic
bias in a better way.
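The relationship between sample size and sampling error can be illustrated with a small simulation. The sketch below is illustrative only (the population values and sizes are invented, not taken from this reader): it draws repeated simple random samples from a synthetic population and shows that the average error of the sample mean shrinks as the sample grows.

```python
import random
import statistics

# Synthetic population of 10,000 "household incomes" (illustrative values only).
random.seed(42)
population = [random.gauss(5000, 1500) for _ in range(10_000)]
true_mean = statistics.mean(population)

def mean_sampling_error(sample_size, trials=200):
    """Average absolute deviation of the sample mean from the true population mean."""
    errors = []
    for _ in range(trials):
        sample = random.sample(population, sample_size)
        errors.append(abs(statistics.mean(sample) - true_mean))
    return statistics.mean(errors)

# The average error falls as n grows, roughly in proportion to 1/sqrt(n).
for n in (25, 100, 400):
    print(f"n={n:4d}  average |error| = {mean_sampling_error(n):.1f}")
```

Note that no amount of extra sampling fixes a systematic bias: if the frame itself is defective, every sample inherits the same distortion.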
There are several reasons for taking a sample (and hence adopting a sampling design) instead of a
complete enumeration of the whole population or census: sampling costs less, takes less time, and is
often the only feasible option for a large population.
A. Probability sampling methods
Simple Random Sampling: This is a method of sampling in which every member of the population has
the same chance of being included in the sample.
Systematic Random Sampling: In some instances, the most practical way of sampling is to select, say,
every 20th name on a list, every 12th house on one side of a street, every 50th piece of item coming off a
production line, and so on. This is called systematic sampling, and an element of randomness can be
introduced into this kind of sampling by using random numbers to pick the unit with which to start.
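A minimal sketch of these two probability sampling methods; the frame of 500 numbered households and the sample size are invented for illustration:

```python
import random

# Hypothetical sampling frame: 500 numbered households.
frame = [f"household_{i}" for i in range(1, 501)]

def simple_random_sample(frame, n, seed=None):
    """Every member of the frame has the same chance (n/N) of being included."""
    return random.Random(seed).sample(frame, n)

def systematic_sample(frame, n, seed=None):
    """Take every k-th element; a random starting point supplies the randomness."""
    k = len(frame) // n                       # sampling interval, e.g. every 20th name
    start = random.Random(seed).randrange(k)  # random unit with which to start
    return frame[start::k][:n]

print(len(simple_random_sample(frame, 25, seed=1)))  # 25 units
print(len(systematic_sample(frame, 25, seed=1)))     # 25 units, evenly spaced
```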
Stratified Random Sampling: The methods of stratified sampling tend to be economically desirable if
the population to be sampled can be divided into relatively homogeneous subdivisions or strata. Stratified
random sampling is the procedure of dividing the population into relatively homogeneous groups, called
strata, and then taking a simple random sample from each stratum. If the population elements are
homogeneous, then there is no need to apply this technique.
Example: If our interest is the income of households in a city, then our strata may be:
low income households
middle income households
high income households
To obtain a sample from each stratum, we may follow three ways:
i. Taking a sample of size proportional to the sub-population (stratum) size, i.e., drawing a
large sample from a large stratum and a small sample from a small sub-population. This
is known as proportional allocation.
ii. Selecting a sample from each stratum so that the variation due to sampling is minimized.
This is known as optimum allocation.
iii. Selecting equal numbers of units from each stratum. This is known as equal allocation.
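The income example above can be sketched in Python using proportional allocation (each stratum contributes n × stratum size / population size); the stratum sizes and overall sample size below are invented for illustration:

```python
import random

# Hypothetical strata of a city's households (sizes are illustrative only).
strata = {"low income": 6000, "middle income": 3000, "high income": 1000}
total = sum(strata.values())
n = 500  # overall sample size

# Proportional allocation: each stratum contributes in proportion to its size.
allocation = {name: round(n * size / total) for name, size in strata.items()}
print(allocation)  # {'low income': 300, 'middle income': 150, 'high income': 50}

# A simple random sample is then drawn within each stratum.
samples = {name: random.sample(range(size), allocation[name])
           for name, size in strata.items()}
```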
Cluster Sampling: This is a method of sampling in which the total population is divided into relatively
small subdivisions, called clusters, and then some of these clusters are randomly selected using simple
random sampling. Once the clusters are selected, one possibility is to use all the elements in the selected
clusters. However, if elements within selected clusters give similar results, it seems uneconomical to
measure them all. In such cases, we take a random sample of elements from each of the selected clusters
(called two-stage sampling).
Example: Suppose we want to make a survey on the attitude and awareness of households about solid
waste management (SWM) in Addis Ababa. Collecting information on each and every household is
impractical from the point of view of cost and time. What we do is divide the city into a number of
relatively small subdivisions, say, Kebeles. So the Kebeles are our clusters. Then we randomly select, say,
20 Kebeles using simple random sampling. To collect information about individual households, we have
two options:
1) Collect information from every household in the 20 selected Kebeles (one-stage cluster sampling); or
2) Take a random sample of households from each selected Kebele (two-stage sampling).
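The Kebele example can be sketched as two-stage sampling; the number of Kebeles, their sizes and the within-cluster sample size below are invented for illustration:

```python
import random

random.seed(0)
# Hypothetical frame: 100 Kebeles (clusters), each holding its list of households.
kebeles = {f"kebele_{k}": [f"hh_{k}_{i}" for i in range(random.randint(200, 400))]
           for k in range(100)}

# Stage 1: select 20 Kebeles by simple random sampling.
selected = random.sample(list(kebeles), 20)

# Stage 2: within each selected Kebele, sample 30 households
# rather than interviewing every household in the cluster.
survey = {k: random.sample(kebeles[k], 30) for k in selected}

total_interviews = sum(len(v) for v in survey.values())
print(total_interviews)  # 20 clusters x 30 households = 600 interviews
```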
B. Non-probability sampling methods
Convenience, Haphazard or Accidental sampling (members of the population are chosen based on
their relative ease of access)
Judgmental sampling or Purposive sampling (The researcher chooses the sample based on who
he/she thinks would be appropriate for the study)
Purposive sampling starts with a purpose in mind and the sample is thus selected to include
people or objects of interest and exclude those who do not suit the purpose. Purposive
sampling can be subject to bias and error.
Case study (The research is limited to one group, often with a similar characteristic or of small size.)
Ad hoc quotas (A quota is established and researchers are free to choose any respondent they wish
as long as the quota is met.)
Snowball sampling (The first respondent refers a friend. The friend also refers a friend, etc.)
Comparison: Probability and non-probability sampling
Probability sampling (or random sampling) is a sampling technique in which the probability of
getting any particular sample may be calculated. Non-probability sampling does not meet this
criterion and should be used with caution. Non-probability sampling techniques cannot be used
to infer from the sample to the general population. Performing non-probability sampling is
considerably less expensive than doing probability sampling, but the results are of limited value.
The difference between non-probability (accidental or purposive) and probability sampling is
that non-probability sampling does not involve random selection and probability sampling does.
Does that mean non-probability samples aren't representative of the population? Not necessarily.
But it does mean that non-probability samples cannot depend upon the rationale of probability
theory. At least with a probabilistic sample, we know the odds or probability that we have
represented the population well. We are able to estimate confidence intervals for the statistic.
With non-probability samples, we may or may not represent the population well, and it will often
be hard for us to know how well we've done so. In general, researchers prefer probabilistic or
random sampling methods over non-probabilistic ones, and consider them to be more accurate
and rigorous.
Unit Three
i. Structured interview: A structured interview means that the questions are developed ahead of
time, with some opportunity to ask pre-planned, open-ended, probing questions.
ii. Semi-structured interview: In a semi-structured interview, the interviewer will have some set
questions but can also ask spontaneous questions.
iii. Unstructured interview: This interview is also called an in-depth interview. The interviewer
begins by asking a general question and then encourages the respondent to talk freely.
A. Personal interview:
A personal interview is the process taking place between the interviewer (the person asking the
questions) and the interviewee (the person answering the questions).
Disadvantages of Personal Interview:
Time consuming
Geographic limitation
Can be expensive
Make an appointment with the respondent(s), discussing why you want the interview and how long it will take.
Try to fix a venue and time when you will not be disturbed.
B. Telephonic interview:
This is an alternative form of interview to the personal, face-to-face interview.
Telephone interviews are less time consuming and less expensive, and the researcher has ready
access to anyone on the planet who has a telephone.
Advantages of Telephone Interview:
Questionnaire required
Questionnaire Method:
Description: A questionnaire is a series of written questions on a topic about which the subjects' opinions
are sought. In this method of data collection, the questionnaire is sent to the respondent by post
or e-mail, and the respondent is asked to fill it in and send it back to the researcher. A
questionnaire consists of a number of well-formulated questions, printed or typed in a definite order, to
probe and obtain responses from respondents. The form and content of a questionnaire therefore vary
from situation to situation.
Advantages of Questionnaire Method
Usually the cost of gathering secondary data is much lower than the cost of organizing primary
data. Moreover, secondary data have several supplementary uses.
They also help to plan the collection of primary data, in case it becomes necessary.
Advantages of Secondary data analysis:
Secondary data analysis has several advantages:
It makes use of data that were already collected by someone else.
It often allows you to extend the scope of your study considerably.
It saves time that would otherwise be spent collecting data.
It provides a larger database (usually) than would be possible to collect on one's own.
In many small research projects it is impossible to consider taking a national sample because of
the costs involved. Many archived databases are already national in scope and, by using them, a
researcher can leverage a relatively small budget into a much broader study than if the data were
collected first-hand.
Disadvantages of secondary data:
You may have less control over how the data were collected.
There may be biases in the data that you don't know about.
The data may not exactly fit your research questions.
The data may be obsolete.
Old secondary data collections can distort the results of the research.
Secondary data can also raise issues of authenticity and copyright.
What is the research problem?: The problem definition and objectives of the research.
ii) What type(s) of evidence is needed to address it?: Exploratory, descriptive, causal or
explanatory
iii) What ideas, concepts, variables are we measuring? Content, definition and indicators
v) From whom should we collect the data? Nature of the target population or sample (e.g.,
their education level, cultural background, etc)
vii) Where will the data be collected? In the street/shopping centre, or at the respondent's office
or home.
viii) How will responses be captured? Pen and paper, computer, audio and/or video recording,
photograph.
Decide on the question content: This is done by clarifying the research objectives (the
information requirements) and what exactly it is that the question needs to measure.
ii)
Some questions require standard answer options. For example: Marital status has standard
answer options (single or never married, married, living as married, separated, divorced,
widowed). While developing the content of a questionnaire, clarify the meaning (concepts,
definitions and indicators). If you are not clear with the concepts and their indicators, it is
difficult to craft the questions with the right wording of the statements.
Ensure Proper Wording of the Questions:
"How much money do you earn?" - What type of earning is this question referring to (from
work? investment? remittances? social benefits?), and what time period is it referring to (daily?
weekly? annually?)
"Do you have a personal computer?" - Is this question referring to ownership of the computer or
the type of computer? What is meant by "you" (myself? my household?) and by "have" (own? or
have access to?)
Using technical jargon and abbreviations: Examples - refurbishments, fiscal policy, monetary
policy, UNHCR, UNESCO, etc.
Using words that are difficult to read out or pronounce: e.g., "in an anonymous form".
Use of double-barrelled questions: Example- Do you like using e-mail and the web; Would you
like to be rich and famous?
Use of negatively phrased questions: Do you agree that it is NOT the job of the government to
take decisions about the following?
Use of questions that challenge the respondents memory: Example How many hours of
television did you watch last month?; List the books you have read in the last year
Including leading questions: Example - Public speeches against racism should not be allowed.
Do you agree or disagree?
Wording questions using sensitive or loaded non-neutral questions: Example What do you
think of welfare for the poor?
Questions that make assumptions: Example How often do you travel to rural areas? (this
wording assumes that the respondent travels to rural areas); When did you stop beating your
wife?
Questions with overlapping response categories: Example - How many hours did you spend in the
library yesterday? (response categories: 0-1 hours, 1-2 hours, 2-3 hours, etc)
Questions with insufficient response categories: Example - How do you travel to work each
day? (response categories: by my own car, on foot, by public bus, on a bicycle). In this case other
modes are missing (e.g., by service bus, by a friend's/colleague's car, by motorbike, by Bajaj,
by cart, etc) and the possibility to choose more than one mode is not provided.
Questions on sensitive topics: Example - Which political party did you vote for during the May 7
election?
iii)
Put your questions into an effective and logical order. Don't ask sensitive and difficult questions too
early. Also, it is preferable to ask personal questions (e.g. classification questions such as those on
age, income, etc) at the end.
Classify your questions into groups and provide a brief introduction to each group. Within each
group (module), begin with general questions and then move on to specific questions.
In case a particular question is not relevant to the respondent, indicate which question they
should skip to.
iv)
v)
The questionnaire must be long enough to cover the research objectives but not too long
considering the research cost and time to the respondent.
vi)
Test the questionnaire to identify its pitfalls and correct them before you go for the full-scale
survey.
Unit Four
Research Proposal, Referencing, Reporting Results and Ethical
Considerations
4.1 The Research Proposal
4.1.1 The Need for Research Proposal
Before you embark upon the research for your masters thesis, you will be required to submit a
research proposal. Most journals and calls for conference papers also require the submission of a
research proposal, abstract or synopsis.
The title: the title should be as short as possible but should adequately represent the topic
and the research problem/objective
1. The introduction
1.1 General Background
Your research proposal should have an introduction. The introduction should give a general background
to the research area/topic and enhance the readers' interest. Also indicate the debate/controversy
in the literature over the topic or issue you intend to deal with in your research. You should demonstrate
the relevance of your research to theory (to the debate) and/or to practice (policy).
2. Theoretical/conceptual framework
This is the part where you will provide a summary of the literature review you have conducted. Your
review should lead to a clear definition of the theoretical framework and the conceptual framework.
2.1 Theoretical framework
The theoretical framework refers to a summary of the theories that you will refer to in your study. Review
the relevant literature and identify what the theory says about the issue/topic you are addressing.
Elaborate the different perspectives. Also provide alternative definitions (if any) of the important concepts
and variables that you will use in your research. The literature review will help you identify the relevant
variables that could potentially be used as indicators/measures for your concepts. If you are going to
investigate the relationship among variables, show what the theories state about the relationships (i.e.
about the direction of relationships and significance).
What you summarize as part of the theoretical framework has to be very relevant to the topic and
particularly the research problem and the research questions. While conducting the literature survey,
students often throw in whatever literature is in one way or another related to the research area but
not necessarily with the research problem. Failing to prepare the theoretical framework properly has at
least the following disadvantages:
You will not have the basis to define relevant concepts, identify the assessment issue or the
variables and define them.
It will not be easy for you to choose the appropriate research design.
During analysis, you will not have any theory to compare your results with.
Based on the theoretical framework, you are expected to develop your conceptual framework.
This is the part in which you are going to define the concepts operationally (in the way you would like
your readers to understand them) and where you will select your factors/variables. In the literature,
concepts may be defined in different ways and you will have to make a choice here. In your study, how
are the concepts defined operationally? What are the variables and indicators that you will use
in your study to measure the concepts? How do the different concepts relate to each other?
These and other questions should be answered via your conceptual framework. For instance,
education quality can be measured using a number of indicators. Among these indicators (which you must
have identified and defined in the conceptual framework), clearly show which ones you are going to use
in your research. Consider the usefulness/appropriateness of the indicator as well as data availability and
feasibility in your choice of the variables.
If you are doing qualitative research that involves, for instance, assessments, you should make clear the
assessment themes/issues and the indicators. Adding a diagrammatic representation of the
conceptual framework will give your operationalization a visual effect. The diagram depicts the
concepts/issues and how they relate to each other. You may create your own illustrative diagram
or adapt one from the literature. In the latter case, you have to clearly cite the sources for
your diagram.
2.3 Hypothesis
In deductive research designs, it is necessary that you formulate hypotheses. Once you are clear about
the theory and your conceptual framework, you can state your hypotheses. A hypothesis is a tentative
assumption made in order to draw out and test its logical or empirical consequences. It should be
specific and pertinent to the piece of research in hand. The hypothesis provides the focal
point for your research: it delimits the area, sharpens thinking and keeps the researcher on the right
track. Also remember that your hypothesis determines the type of data required, the data collection and
sampling methods to be used, and the tests that must be conducted during data analysis.
3. Methodology (Research design):
Once you have developed your hypothesis, the next step is to craft your research design. A
research design is like the blueprint for house construction. If you start building a house without
first having the design (consisting of the architectural, electrical, sanitary and other designs), you
do not know what type of house you will end up with; the construction will be costly and time-consuming,
often involving demolition and rebuilding of what has been constructed. Most importantly, the
house will lack quality and may be prone to risks. Likewise, research conducted without a research
design at hand is aimless, ambiguous, time-consuming, costly and may be totally irrelevant and
unacceptable in light of the requirements of a scientific investigation. Research design refers to
the crafting of the conceptual structure within which research will be conducted in a way that is
as efficient as possible, allowing the collection of relevant evidence with minimal expenditure of effort,
time and money.
3.1 Research type:
Describe the type of your research based on some commonly known criteria. More specifically, describe
the type of your research based on the nature of the research enquiry (e.g., exploratory, descriptive, etc.);
the mode of data collection; the type of the data; and so on. Also describe the methods of data collection
(which method for which data type: interview, questionnaire, FGD, observation, etc.).
The American Psychological Association reference style uses the Author-Date format.
Refer to the Publication Manual of the American Psychological Association (6th ed.) for
more information. Check the Library Catalogue for call number and location(s).
When quoting directly or indirectly from a source, the source must be acknowledged in
the text by author name and year of publication. If quoting directly, a location reference
such as page number(s) or paragraph number is also required.
IN-TEXT CITATION
Direct quotation: use quotation marks around the quote and include page numbers.
Samovar and Porter (1997) point out that "language involves attaching meaning to symbols"
(p. 188).
Alternatively, "Language involves attaching meaning to symbols" (Samovar & Porter, 1997,
p. 188).
Indirect quotation/paraphrasing: no quotation marks.
Attaching meaning to symbols is considered to be the origin of written language (Samovar &
Porter, 1997).
N.B. Page numbers are optional when paraphrasing, although it is useful to include them
(Publication Manual, p. 171).
Author (last name, initials only for first & middle names)
Publication date
Title (in italics; capitalize only the first word of title and subtitle, and proper nouns)
Place of publication
Publisher
Citing Books
Source
Example Citation
Book by a corporate
author
Author (last name, initials only for first & middle names)
Date of publication of article (year and month for monthly publications; year, month and
day for daily or weekly publications)
Title of article (capitalize only the first word of title and subtitle, and proper nouns)
Title of publication in italics (i.e., Journal of Abnormal Psychology, Newsweek, New York
Times)
Example Citation
Article in a monthly
magazine (include
volume # if given)
Article in a weekly
magazine (include
volume # if given)
Article in a daily
newspaper
Article in a scholarly
journal
Book review
If the DOI number is not available, APA recommends giving the URL of the publication.
If the URL is not known, include the database name and accession number, if known:
Retrieved from ERIC database (ED496394).
Citing Articles from the Library's Online Subscription Databases
Source
Example Citation
Magazine article
with URL
Journal article
with DOI
Author (last name, initials only for first & middle names)
Date of publication of article
Title of article
DOI number, if given. More information about DOI numbers is available on the
American Psychological Association's APA Style page.
If the DOI is not available, give the URL (Web address) of the article.
Example Citation
Overbay, A., Patterson, A. S., & Grable, L. (2009). On the outs: Learning
styles, resistance to change, and teacher retention. Contemporary Issues
in Technology and Teacher Education, 9(3). Retrieved from
http://www.citejournal.org/vol9/iss3/currentpractice/article1.cfm
Article in an online
magazine
Romm, J. (2008, February 27). The cold truth about climate change.
Salon.com. Retrieved from http://www.salon.com
Article in an online
newspaper
McCarthy, M. (2004, May 24). Only nuclear power can now halt global
warming. Earthtimes. Retrieved from http://www.earthtimes.org
Date you accessed the information (APA recommends including this if the information is
likely to change)
Example Citation
(unknown author)
http://www.psu.edu/ur/about/myths.html
Page within a Web site (unknown author)
Global warming 101. (2012). In Union of Concerned Scientists. Retrieved
December 14, 2012, from
http://www.ucsusa.org/global_warming/global_warming_101/
Electronic Books
Important Elements:
Author (last name, initials only for first & middle names)
Publication date
Title (in italics; capitalize only the first word of title and subtitle, and proper nouns)
Place of publication
Publisher
URL (Web address) of the site from which you accessed the book
Citing Electronic Books
Source
Electronic Book
Example Citation
McKernan, B. (2005). Digital cinema: The revolution in cinematography,
postproduction, and distribution. New York, NY: McGraw-Hill. Retrieved
from www.netlibrary.com
Post, E. (1923). Etiquette in society, in business, in politics, and at home.
New York, NY: Funk & Wagnalls. Retrieved from
http://books.google.com/books
Studio
Example Citation
Johnston, J. (Director). (2004). Hidalgo [Motion picture]. United States:
Touchstone/Disney.
Television program
in series
Example Citation
Buckner, N., & Whittlesey, R. (Writers, Producers, & Directors). (2006).
Dogs and more dogs [Television series episode]. In P. Apsell (Senior
Executive Producer), NOVA. Boston: WGBH.
Place of publication
Publisher
Citing Government Publications
Source
Government
document
Example Citation
U.S. Dept. of Housing and Urban Development. (2000). Breaking the cycle of
domestic violence: Know the facts. Washington, DC: U.S. Government
Printing Office.
Paragraphs:
Not too long, not too short;
Convey one idea in one paragraph;
Usually a paragraph has an introductory and a concluding sentence;
Check for flow, coherence, economy, etc.;
Use connecting words/phrases; avoid repetition of words/phrases.
4.3.3
4.3.4
Do not report results again! Focus on the why part (reasons/explanations) for the
findings, drawing on your data or on theory, or give your own interpretation;
Link with theory/other studies and with your hypothesis;
Use the specific objectives of your study as a guide.
4.3.5 When writing the conclusion and recommendations:
The conclusions should be based on your findings
While concluding, answer your research questions/address your specific objectives
Recommendations should be based on your conclusions
While recommending solutions, indicate how your recommendations could be put into
practice (the how part)
For example, the tendency to describe some populations as deviant leads away from
focusing on larger problems of the distribution of political, economic, and social power.
Social scientists need to be aware of the possible uses to which their research may be put.
Research should not only enhance the researcher's career, but also benefit the group,
organization, or population studied.
Those who fund and conduct research also reap its benefits.
The people who are the "subjects" of the research may have neither the power to shape
the research nor the ability to refuse to participate.
Is participation voluntary?
1. Participation must be voluntary and not coerced.
2. Participants cannot be threatened with a loss of other, unrelated benefits (e.g., food stamps,
bilingual education).
3. Participants cannot be offered unreasonably large inducements to participate (e.g., prisoners).
4. Information must be provided about all risks or potential risks of participation including
physical harm, pain, discomfort, embarrassment, loss of privacy; exposure to illness, etc.
5. Information must be provided about all benefits or potential benefits of participation, for
example, free health care, monetary incentives, the value of the research to science, etc.
6. The ratio of risks to benefits should be stated.
7. Are the benefits sufficient to allow participants to put themselves at risk? Should the study be
done at all?
8. Are the participants' rights and well being sufficiently protected?
9. Are the means of obtaining informed consent adequate and appropriate?
10. Participants can withdraw from the study at any time, refuse to comply with any part of the
study and refuse to answer any questions.
III. Ethical issues related to the methods
Most ethical violations correspond to illegitimate use of the investigator's power.
Researchers need to be trained to be concerned, as social scientists, with people as well as with
research design, methodology, etc.
Ethical concerns include:
1. Involvement without consent:
- through participant observation or covert observations;
- through unknown intervention in ongoing programs or operations;
- through field experiments.
2. Disguising the true nature or purpose of the research:
- the way it will be used is not revealed;
- information is withheld that would affect informed consent;
3. Deceiving the research participant:
- to conceal the purpose of the research;
- to conceal the true function of the participant's actions;
- to conceal the experiences the participants will have to
undergo;
4. Leading participants to commit acts that lessen their self-esteem:
- cheating, lying, stealing, harming others;
- yielding to social pressure contrary to one's ideas;
- prohibiting the rendering of aid when needed;
- behavioral control or character change;
- denial of the right to self-determination;
5. Coercion that abridges freedom of choice:
- research is linked to participation in organizational or institutional programs;
- "requests for participation" are worded in such a way that it is difficult to say no;
- participation is made a requirement of a college course;
6. Physical or mental stress:
horror, threat to identity, failure, fear, emotional shock.
7. Invasion of privacy:
- covert observation;
- unnecessary questions of personal nature on interviews or questionnaires;
- disguised, indirect, or projective tests;
- using third-party information without consent;
- it is also the ethical responsibility of the researcher to ensure that the data are accurately
collected, coded, entered, analyzed, and interpreted, so as not to perform a disservice to the
subject population.
After the project is over, the researcher should:
- remove any harmful after-effects from the participants;
- maintain anonymity and/or confidentiality;
- publish the findings in reports and articles.
UNIT FIVE:
The Survey Method and Case Studies
5.1 The Survey Method
5.1.1 What is a survey?
A survey is a detailed and quantified description of a population. Surveys attempt to identify
something about a population, that is, a set of objects about which we wish to make
generalizations. A population is frequently a set of people, but organizations, institutions or even
countries can comprise the unit of analysis.
Surveys involve the systematic collecting of data, whether this be by interview, questionnaire or
observation methods, so at the very heart of surveys lies the importance of standardization.
Precise samples are selected for surveying, and attempts are made to standardize and eliminate
errors from survey data gathering tools.
A particular form of survey, a census, is a study of every member of a given population. For
example, the Central Statistical Agency of Ethiopia conducts the population and housing census
every ten years. A census provides essential data for government policy makers and planners, but
is also useful, for example, to businesses that want to know about trends in consumer behavior
such as ownership of durable goods, and demand for services.
A large number of observations can be collected within a certain period of time from a relatively
large number of respondents.
The sample can be spread over a wide area, which has statistical value, thereby making the
study somewhat more generalizable.
At the same time, the principal researcher need not spend too much time in the field.
Surveys require respondents to answer questions about their opinions, attitudes or preferences
and about their socio-demographic characteristics.
Therefore, a case study is conducted only for a specific case. A case study is, in essence, a study in
depth: depth here means exploring all the peculiarities of the case. It gives detailed knowledge about
the phenomenon, but that knowledge cannot be generalized. In the physical sciences every unit
is a true representative of the population, but in the social sciences the units may not be true
representatives of the population, because there are individual differences as well as intra-individual
differences. Therefore, predictions cannot be made on the basis of knowledge obtained
from a case study, and no statistical inferences can be drawn from the exploration of a phenomenon.
Here 'case' does not necessarily mean an individual. A case is a unit: it may be an institution,
a nation, a religion, an individual or even a concept.
PART TWO
Presenting and Analyzing Quantitative Data
(With SPSS Application)
Contents:
Unit 6: Analyzing Quantitative Data - Descriptive Statistics:
Basic concepts in statistics;
Classification and Presentation of Statistical Data (bar chart, pie chart, histogram);
Measures of central tendency and dispersion (mean, median, mode, mean deviation,
variance, standard deviation, covariance, Z-score);
Exercise with SPSS Application
Unit 7: Analyzing Quantitative Data- Tests of hypothesis concerning means and
proportions:
Tests of hypotheses concerning means; Tests concerning the difference between two
means (independent samples);
Tests of mean difference between several populations (independent samples); Paired-samples t-test (differences between dependent groups);
Tests of association (the Pearson coefficient of correlation and test of its significance, The
Spearman rank correlation coefficient and test of its significance); Nonparametric
Correlations (The Chi-square test);
Hypothesis test for the difference between two proportions; Exercise with SPSS
Application
Unit 8: Analyzing Quantitative Data - The simple linear regression model and Statistical
Inference;
The simple linear regression model, estimation of regression coefficients and interpreting
results;
Hypothesis testing;
Exercise with SPSS application
Unit 9: Analyzing Quantitative Data - The multiple linear regression model and Statistical
Inference;
The multiple linear regression model, estimation of regression coefficients and
interpreting results;
Hypothesis testing;
Exercise with SPSS application
Unit Six
Analyzing Quantitative Data: Descriptive Statistics
6.1 Basic concepts in statistics
6.1.1 What is Statistics?
Statistics is a science pertaining to the collection, presentation, analysis and interpretation or explanation
of data. Data can then be subjected to statistical analysis, serving two related purposes: description and
inference.
Descriptive statistics summarize the population data by describing what was observed in the
sample numerically or graphically.
Inferential statistics uses patterns revealed through analysis of sample data to draw inferences
about the population represented.
For a sample to be used as a guide to an entire population, it is important that it is truly representative of
that overall population. Appropriate and scientific sampling procedures assure that the inferences and
conclusions can be safely extended from the sample to the population as a whole.
The raw materials for any statistical analysis are the data. Once data are collected, we have to organize
and describe them in a concise manner so that they become meaningful. In order to
determine their significance, we must display the data in the form of tables, graphs and charts (so that we
can have a good overall picture of the data). Then, we have to analyze the data, i.e., we calculate
summary measures such as the mean and standard deviation; assess the extent of relationship (correlation)
between two (or more) variables; and the like. Finally, based on the analysis, we have to make
generalizations and arrive at reasonable decisions.
fact, that is, the student population of the university has increased from 1000 to 2500 this year. The fact is
that the percentage of minority students has decreased: from 1% last year to 0.8% this year.
For example, we can record the gender of respondents as 0 and 1, where 0 stands for
male and 1 stands for female. The numbers we assign for the various categories are
purely arbitrary, and any arithmetic operation applied to these numbers is meaningless.
ii.
Ordinal Scale: The ordinal scale is the next higher level of measurement precision. It ensures
that the possible categories can be placed in a specific order (rank) or in some natural way.
Again here the numbers are not obtained as a result of a counting or measurement process, and
consequently, arithmetic operations are not allowed.
For example, responses for health service provision can be coded as 1, 2, 3 and 4: 1 for
poor, 2 for moderate, 3 for good and 4 for excellent. It is quite obvious that there is some
natural ordering: the category 'excellent' (which is coded as 4) indicates a better health
service provision than the category 'moderate' (which is coded as 2) and, thus, order
relations are meaningful.
iii.
Interval Scale: The interval scale is the second highest level of measurement precision. Unlike the
nominal and ordinal scales of measurement, the numbers in an interval scale are obtained as a
result of a measurement process and have some units of measurement. Also the differences
between any two adjacent points on any part of the scale are meaningful. However, a point cannot
be considered to be a multiple of another; that is, ratios have no meaningful interpretation.
For example, the Celsius temperature scale, which subdivides the distance between the freezing
and boiling points into 100 equally spaced parts, is an interval scale. There is a meaningful
difference between 30 degrees Celsius and 12 degrees Celsius. However, a temperature of
20 degrees Celsius cannot be interpreted as twice as hot as a temperature of 10 degrees
Celsius.
iv.
Ratio Scale: The ratio scale represents the highest form of measurement precision. In addition to
the properties of all lower scales of measurement, it possesses the additional feature that ratios
have meaningful interpretation. Furthermore, there is no restriction on the kind of statistics that
can be computed for ratio scaled data.
For example, the height of individuals (in centimeters), the annual profit of firms (in Birr)
and plot elevation (in meters) represent ratio scales. The statement "the annual profit of
Firm X is twice as large as that of Firm Y" has a meaningful interpretation.
First, knowing the level of measurement helps you decide on how to interpret the data. For example,
if you know that a measure is nominal, then you know that the numerical values are just short codes
for the longer names.
Second, knowing the level of measurement helps you decide what statistical analysis is appropriate
on the values that were assigned. If a measure is nominal, for instance, then you know that you
would never average the data values.
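The point about nominal data can be made concrete with a short sketch in Python, reusing the arbitrary 0/1 gender coding from the example above (the sample values are hypothetical):

```python
from collections import Counter

# Nominal data: 0 = male, 1 = female (the codes are arbitrary labels).
gender_codes = [0, 1, 1, 0, 1, 0, 0, 1, 1, 1]

# Meaningful for nominal data: frequency counts per category.
counts = Counter(gender_codes)
print(counts[0], counts[1])  # 4 males, 6 females

# Meaningless for nominal data: the "average gender" is just an
# artifact of the arbitrary coding, not a summary of anything real.
mean_code = sum(gender_codes) / len(gender_codes)
print(mean_code)  # 0.6
```

Swapping the codes (1 = male, 0 = female) would leave the counts intact but change the "mean" to 0.4, which is exactly why averaging nominal codes is never appropriate.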
Examples: the daily Dow-Jones stock market average close for the past 90 days, a firm's
quarterly sales over the past 5 years, etc.
The data may be quantitative (e.g. exchange rates, prices, number of shares outstanding), or qualitative
(e.g. the day of the week, the number of the financial products purchased by private individuals over a
period of time, etc.).
ii) Cross-sectional data: Cross-sectional data are data on one or more variables collected at a single point
in time. Such data do not have a meaningful sequence. For example, the data might be on:
Sales of 30 companies
Productivity of each sales division
A cross-section of stock returns on the New York Stock Exchange (NYSE)
iii) Panel data: Panel data have the dimensions of both time series and cross-sections, e.g. the daily
prices of a number of blue chip stocks over two years.
Note:
i) For time series data, it is usual to denote the individual observation numbers using the index t, and the
total number of observations available for analysis by T. For cross-sectional data, the individual
observation numbers are indicated using the index i, and the total number of observations available
for analysis by N.
ii) In contrast to the time series case, there is no natural ordering of the observations in a cross-sectional sample.
For example, the observations i might be on the price of bonds of different firms at a particular point
in time, ordered alphabetically by company name. On the other hand, in a time series context, the
ordering of the data is relevant since the data are usually ordered chronologically.
6.2.4 Continuous and discrete variables
As well as classifying data as being of the time series or cross-sectional type, we could also distinguish it
as being either continuous or discrete.
i) A quantitative variable that has a connected string of possible values at all points along the number
line, with no gaps between them, is called a continuous variable. In other words, a variable is said to
be continuous if it can assume an infinite number of real values within a certain range. It can take on
any value and is not confined to take specific numbers. The values of such variables are often obtained
by measuring. Examples of a continuous variable are distance, age and daily revenue.
ii) A quantitative variable that can assume only certain values, typically counts with gaps between
them along the number line, is called a discrete variable. The number of people in a particular shopping
mall per hour or the number of shares traded during a day are examples of discrete variables. These can
take on values such as 0, 1, 2, 3, ... In these cases, having 86.3 people in the mall or 585.7 shares
traded would not make sense.
Example: The following data represents the average monthly number of road fatalities (human injury due
to traffic accidents) for a total of 50 major roads in a city:
46.6  55.2  55.7  48.8  48.8  61.5  56.0  59.3  60.5  58.0
43.2  54.5  54.6  56.8  50.3  51.2  50.6  55.4  50.9  52.6
47.1  57.6  57.0  57.8  53.9  58.8  53.6  63.8  49.6  53.3
57.9  53.9  52.7  52.4  47.4  53.0  55.2  58.3  59.1  56.8
56.0  59.2  57.1  53.3  52.4  47.8  45.8  57.3  56.1  51.8
Class limits     Frequency (number of major roads)
43.2 - 46.6      3
46.7 - 50.1      6
50.2 - 53.6      13
53.7 - 57.1      15
57.2 - 60.6      11
60.7 - 64.1      2
Total            50
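The grouped frequency distribution above can be reproduced directly from the raw data. A minimal sketch in Python, using the class limits from the table (with one-decimal data, no value can fall between a class's upper limit and the next class's lower limit):

```python
# Average monthly road fatalities for the 50 major roads (data from the example).
data = [46.6, 55.2, 55.7, 48.8, 48.8, 61.5, 56.0, 59.3, 60.5, 58.0,
        43.2, 54.5, 54.6, 56.8, 50.3, 51.2, 50.6, 55.4, 50.9, 52.6,
        47.1, 57.6, 57.0, 57.8, 53.9, 58.8, 53.6, 63.8, 49.6, 53.3,
        57.9, 53.9, 52.7, 52.4, 47.4, 53.0, 55.2, 58.3, 59.1, 56.8,
        56.0, 59.2, 57.1, 53.3, 52.4, 47.8, 45.8, 57.3, 56.1, 51.8]

# Class limits from the table (class width 3.4, starting at the minimum 43.2).
limits = [(43.2, 46.6), (46.7, 50.1), (50.2, 53.6),
          (53.7, 57.1), (57.2, 60.6), (60.7, 64.1)]

# Count how many observations fall within each class.
freqs = [sum(1 for x in data if lo <= x <= hi) for lo, hi in limits]
print(freqs)       # frequencies per class
print(sum(freqs))  # 50
```

Every observation lands in exactly one class, so the frequencies sum to the 50 roads in the sample.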
variable over time) and is a simpler visual aid for comparison purposes. Some of the reasons why we use
graphs when presenting data include: they are quick and direct; they facilitate understanding of the data;
they can convince readers; and they can be easily remembered.
If you have decided that using a graph is the best method to relay your message, then some of the
guidelines to follow are:
1. Define your target audience.
Ask yourself the following questions to help you understand more about your audience and what their
needs are: Who is your target audience? What do they know about the issue? What do they expect to see?
What do they want to know? What will they do with the information?
2. Determine the message(s) to be transmitted.
Ask yourself the following questions to figure out what your message is and why it is important: What do
the data show? Is there more than one main message? What aspect of the message(s) should be
highlighted? Can all of the message(s) be displayed on the same graphic?
Knowing what type of graph to use with what type of information is crucial. Depending on the nature of
the data some graphs might be more appropriate than others. There are many different types of graphs that
can be used to convey information. These include vertical line graphs, bar graphs (charts), pie charts and
histograms, among others.
The presentation of data in the form of tables, graphs and charts is an important part of the
process of data analysis and report writing.
While the results can be expressed within the text of a report, data are usually more digestible if they
are presented in the form of a table or graphical display.
Graphs and charts can quickly convey to the reader the essential points or trends in the data.
The presentation should be as simple as possible, avoid the trap of adding too much information. A
good rule of thumb is to only present one idea or to have only one purpose for each graph or chart
you create.
The title should be clear and concise, indicating what, when and where the data were obtained.
Codes, legends and labels should be clear and concise, following standard formats if possible.
The use of footnotes is advised to explain essential features of the data that are critical for the correct
interpretation of the graph or chart.
Show: frequency of occurrence (simple percentages or comparisons of magnitude). Use: bar chart,
pie chart, Pareto chart.
Show: trends over time. Use: line graph, run chart, control chart. Data needed: measurements
taken in chronological order (attribute or variable data can be used).
Show: distribution (variation not related to time). Use: histogram.
Show: association between paired measurements. Use: scatter diagram.
i) Bar graphs
Bar graphs are one of the many techniques used to present data in a visual form so that the reader may
readily recognize patterns or trends. Bar graphs usually present categorical (qualitative) variables or
numeric (discrete) variables grouped in class intervals.
Example: The following data are on the bed sizes (that is, the total number of beds available for patient
use) in three hospitals for the years 2003-2005.

Hospital    2003    2004    2005
A           40      45      45
B           25      60      60
C           35      45      75
Total       100     150     180

The simple bar chart does not consider the contribution of each hospital to the total bed size. It simply
provides information on the aggregate bed sizes in the three hospitals for the years 2003-2005.
Figure 1: A simple bar chart for the total bed size in three hospitals
A multiple bar chart (double (or group) bar graph) is another effective means of comparing sets of data.
This type of vertical bar graph gives two or more pieces of information for each item on the x-axis instead
of just one as in Figure 1. In this particular example it allows you to make direct comparisons of values
across categories on the same graph, where the values are bed sizes and the categories are the years
2003-2005. The graph is shown in Figure 2 below.
Figure 2: A multiple bar chart for the bed sizes in three hospitals
From Figure 2, for example, we can see that the total number of beds available for patient use in Hospital
C has consistently increased from 2003 to 2005. On the other hand, the bed size of Hospital B has
increased from 2003 to 2004, but remained the same in 2005. When comparison is made between
hospitals, we can see that Hospital A had the highest bed size in 2003, but gave way to Hospital B in 2004
and to Hospital C in 2005.
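The comparisons read off Figure 2 can be checked directly from the table. A small sketch in Python, with the hospitals labelled A, B and C as in the example:

```python
# Bed sizes per hospital for the years 2003, 2004 and 2005 (from the table).
beds = {"A": [40, 45, 45], "B": [25, 60, 60], "C": [35, 45, 75]}

# Yearly totals, i.e. the bar heights of the simple bar chart (Figure 1).
totals = [sum(h[year] for h in beds.values()) for year in range(3)]
print(totals)  # [100, 150, 180]

# Hospital C increases every year; B rises in 2004, then stays the same.
assert beds["C"][0] < beds["C"][1] < beds["C"][2]
assert beds["B"][1] == beds["B"][2]
```

The assertions restate, in code, exactly the two observations made in the text about Hospitals B and C.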
ii)
The Pie chart
A pie chart is a chart that is used to summarize a set of categorical data or to display the different values
of a given variable by means of percentage distribution. This type of chart is a circle divided into a series
of segments (or sectors) each representing a particular category. The area of each segment is the same
proportion of a circle as the category is of the total data set. The use of the pie chart is quite popular since
the circle provides a visual concept of the whole (100%).
Example: A sample of 100 adults was asked what they feel is the most important issue facing today's
youth among: unemployment, youth violence, rising school fees, drugs in schools, and career
counselling. The results are shown below:

Table: Adults' opinion on the most important issue facing the youth

Issue                  Number of adults
Unemployment           38
Youth violence         8
Rising school fees     12
Drugs in schools       22
Career counselling     20
Total                  100
First we have to find the percentage contribution of each category (issue), and then the angle measures of
the sectors representing each category of responses have to be calculated.

Issue                  Percentage share       Angle measure
Unemployment           38/100 = 38%           38% x 360° = 136.8°
Youth violence         8/100 = 8%             8% x 360° = 28.8°
Rising school fees     12/100 = 12%           12% x 360° = 43.2°
Drugs in schools       22/100 = 22%           22% x 360° = 79.2°
Career counselling     20/100 = 20%           20% x 360° = 72.0°
Total                  100%                   360°

Finally, we partition the circle into sectors based on the above angle measures.
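The percentage shares and angle measures follow one rule: each category receives its share of the full 360°. A quick sketch in Python, using the counts from the worked table:

```python
# Number of adults per issue (from the example; total sample = 100).
responses = {"Unemployment": 38, "Youth violence": 8, "Rising school fees": 12,
             "Drugs in schools": 22, "Career counselling": 20}

total = sum(responses.values())
# Angle of each pie sector: the category's share of the whole times 360 degrees.
angles = {issue: n * 360 / total for issue, n in responses.items()}
print(angles["Unemployment"])          # 136.8
print(round(sum(angles.values()), 6))  # 360.0
```

Because the shares exhaust the sample, the sector angles always sum back to the full circle.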
Figure 3: A pie chart of the opinion of adults as to the most important issue facing today's youth
Class limits     Class boundaries     Frequency (number of major roads)
43.2 - 46.6      43.15 - 46.65        3
46.7 - 50.1      46.65 - 50.15        6
50.2 - 53.6      50.15 - 53.65        13
53.7 - 57.1      53.65 - 57.15        15
57.2 - 60.6      57.15 - 60.65        11
60.7 - 64.1      60.65 - 64.15        2
Total                                 50
Summary
After being collected and processed, data need to be organized to produce useful information or output.
Output is usually governed by the need to communicate specific information to a specific audience. The
only limit to the different forms of output you can produce is the different types of output devices
currently available. To help determine the best output type for the information you have produced, you
need to ask yourself these questions: For whom is the output being produced? How will the audience best
understand it?
Generally we have two main forms of output: tables and graphs. Grouping variables and presenting
them as a grouped frequency distribution is part of the process of organizing data so that they become
useful information. If a variable is continuous or takes a large number of values, then it is easier to present
and handle the data by grouping the values into class intervals. Discrete variables, on the other hand,
may or may not be grouped into class intervals.
The other main form of output is graphs. Graphs are effective visual tools because they present
information quickly and easily. If you have decided that using a graph is the best method to relay your
message, then the guidelines to remember are: define your target audience (understand more about your
audience and what their needs are); determine the message to be transmitted (figure out what your
message is and why it is important); and experiment with different types of graphs and select the most
appropriate. Note that it is not appropriate to use a graph when there are too few data (one, two or three
data points) or when the data show little or no variation.
centrally within a set of data arranged in increasing or decreasing order of magnitude. Thus, we refer to
these as measures of central tendency. These include the mean, median and mode.
A. The Mean
i) The arithmetic mean
The arithmetic mean, or simply the mean, of a set of n observations X1, X2, . . ., Xn, denoted by X̄, is defined as:

X̄ = (X1 + X2 + . . . + Xn) / n = (1/n) Σ Xj, j = 1, . . ., n .......................................................... (1)

ii) The weighted mean

If the observations X1, X2, . . ., Xk have weights w1, w2, . . ., wk attached to them, the weighted mean, denoted by X̄w, is defined as:

X̄w = (w1X1 + w2X2 + . . . + wkXk) / (w1 + w2 + . . . + wk) = Σ wjXj / Σ wj ................................. (3)
Portfolio expected return (an interest rate, indicating performance) is the weighted average of the
expected rates of return of assets in the portfolio, weighted by dollars invested
Suppose portfolio contains three stocks. One ($1,000 invested) is expected to return 20%. Another
($1,800 invested) expects 15%. Third is $2,200 and 30%.
Total invested is 1,000+1,800+2,200 = $5,000
Weights are:
w1 = $1,000/$5,000 = 0.20
w2 = $1,800/$5,000 = 0.36
w3 = $2,200/$5,000 = 0.44
Weighted average is
0.20 (20%) + 0.36 (15%) + 0.44 (30%) = 22.6%
This is the expected (mean) return for the portfolio. Note that each stock is represented in proportion to $
invested.
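The portfolio calculation above is just a weighted arithmetic mean with dollars invested as the weights; a minimal sketch in Python:

```python
# Weighted mean: sum(w_j * X_j) / sum(w_j), with dollar amounts as the weights.

def weighted_mean(values, weights):
    return sum(w * x for w, x in zip(weights, values)) / sum(weights)

returns = [20.0, 15.0, 30.0]          # expected returns of the three stocks, in percent
invested = [1000.0, 1800.0, 2200.0]   # dollars invested (the weights)
portfolio_return = weighted_mean(returns, invested)
# 0.20*20 + 0.36*15 + 0.44*30 = 22.6 percent, as computed above
```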
iii) The combined mean

If k groups of sizes n1, n2, . . ., nk have means X̄1, X̄2, . . ., X̄k, the combined mean of all the groups, denoted by X̄G, is:

X̄G = (n1X̄1 + n2X̄2 + . . . + nkX̄k) / (n1 + n2 + . . . + nk) = Σ njX̄j / Σ nj ................................... (4)
Example: In a company, the average salary of a group of 50 male employees is 700 Birr and that of a
group of 30 female employees is 500 Birr. Find the average salary of male and female employees
combined.
Solution:
Let n M , n F denote the number of male and female employees, respectively, and let X M , X F denote the
average salary of male and female employees, respectively. Then the average salary of male and female
employees combined is:
X̄G = (nM X̄M + nF X̄F) / (nM + nF) = (50(700) + 30(500)) / (50 + 30) = 625 birr.
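The combined-mean calculation can be checked the same way (a sketch; the group sizes act as the weights):

```python
# Combined (grouped) mean: group sizes weight the group means.

def combined_mean(sizes, means):
    return sum(n * m for n, m in zip(sizes, means)) / sum(sizes)

avg = combined_mean([50, 30], [700.0, 500.0])
# (50*700 + 30*500) / 80 = 625.0 birr, as in the example
```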
B. The median
The median, denoted by X̃, is a single value that divides a set of data into two equal parts. It is the middle most or most central item in the data set.
Note: Data values which are by far smaller or larger as compared to the bulk of data are called extreme
values or outliers. Whenever such extreme values exist, the mean may give a distorted picture of the
data. On the other hand, the median of such data gives a good overall picture of the data.
Example: income
Average (mean) income for a country equally divides the total, which may include some
very high incomes
Median income chooses the middle person (half earn less, half earn more), giving less
influence to high incomes (if any)
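A small illustration of why the median resists outliers (the incomes below are hypothetical):

```python
# Median vs. mean in the presence of an extreme value.

def median(data):
    s = sorted(data)
    n = len(s)
    mid = n // 2
    return s[mid] if n % 2 == 1 else (s[mid - 1] + s[mid]) / 2

incomes = [20, 25, 30, 35, 1000]          # one extreme earner
mean_income = sum(incomes) / len(incomes) # 222.0 -- distorted by the outlier
median_income = median(incomes)           # 30 -- a better overall picture
```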
C. Quartiles
The median divides a set of data into two equal parts. The values that divide a set of data into four equal
parts are called quartiles, and are denoted by Q1 , Q 2 and Q3 . If data are arranged in increasing order,
the positions of the quartiles are:
min --- Q1 --- Q2 --- Q3 --- max
where min is the minimum observation, and max is the maximum observation. Note that the second
quartile ( Q 2 ) is the median.
D. The mode
The mode, denoted by X̂, of a set of numbers is that value which occurs with the greatest frequency. A
data set is called uni-modal, bi-modal or multi-modal depending on whether it has one mode, two modes
or more than two modes, respectively.
6.3.2 Measures of dispersion
The measures of central tendency help us in describing a set of data by a single number or by a typical
value. However, they do not provide us any information about the extent to which the values differ from
one another or from an average value. This information is very essential, as illustrated in the
following example.
Example: Suppose we are told that the mean of two numbers is 1000. Then, these two numbers may be 4
and 1996, or 990 and 1010. Definitely 1000 is not a good descriptive measure for the numbers 4 and
1996, while it is a good representative figure for 990 and 1010. The reason behind this is that there is a
considerable difference between the numbers 4 and 1996, while the difference between 990 and 1010 is
relatively small.
Thus, the dispersion (spread or variability) of a data set gives us additional information that enables us to judge the reliability of our measure of central tendency: if data are widely dispersed, then the mean
(median or mode) is less representative of the data as a whole than it would be for data with small
dispersion. The measures of dispersion also enable us to compare several samples with similar averages.
Examples:

1. Financial analysts are concerned about the dispersion of a firm's earnings. Widely dispersed earnings, those varying from extremely high to low or even negative levels, indicate a higher risk to stockholders and creditors than do earnings remaining relatively stable.

2. Quality control experts analyze the dispersion of a product's quality levels. A drug that is average in purity but ranges from very pure to highly impure may endanger lives.
A. The range
A quick and easy indication of dispersion is the range. The range of a set of data is the difference
between the largest and smallest observed values, i.e;
Range = max - min
where max = largest observation and min = smallest observation.
B. The interquartile range
In case an extreme value(s) exists, we use another measure of dispersion, called the interquartile range
(Q), which is defined as:
Q = Q3 - Q1
where Q3 and Q1 are the third and first quartiles, respectively.
Identifying Outliers

Outliers are observations that are far from the center of the distribution. Using the interquartile range Q, these are commonly defined as observations which are either less than Q1 - 1.5Q or greater than Q3 + 1.5Q.

C. The mean deviation

The mean deviation about the mean is the average of the absolute deviations of the observations from the mean:

MD (about mean) = (1/n) Σ |Xj - X̄|, j = 1, . . ., n

where X̄ is the mean of the data set.

D. The variance and the standard deviation

If X1, X2, . . ., XN are the N observations of an entire population with mean μ, the population variance is defined as:

σ² = (1/N) Σ (Xj - μ)², j = 1, . . ., N
The positive square root of the population variance is called the population standard deviation, i.e.;
σ = √[ (1/N) Σ (Xj - μ)² ]
Usually, information about the entire population is not available. The reason for this is that collecting data
about the entire population is time consuming, expensive, and sometimes impossible. Thus, we often take
a sample and infer something about the population based on sample statistics. If we have a sample of size
n comprising of the values X1 , X 2 , . . . , X n , we calculate the sample variance as:
S² = Σ (Xj - X̄)² / (n - 1), j = 1, . . ., n

where X̄ = (1/n) Σ Xj is the sample mean.
The sample standard deviation is defined as the positive square root of the sample variance, i.e.;
S = √[ Σ (Xj - X̄)² / (n - 1) ]
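The sample variance and standard deviation formulas can be sketched directly (note the n - 1 divisor; the two-value data set echoes the 990/1010 example above):

```python
import math

# Sample variance with the n-1 divisor, and its positive square root.

def sample_variance(data):
    n = len(data)
    mean = sum(data) / n
    return sum((x - mean) ** 2 for x in data) / (n - 1)

def sample_std(data):
    return math.sqrt(sample_variance(data))

# mean of [990, 1010] is 1000; S^2 = ((-10)^2 + 10^2) / 1 = 200
```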
b) If we multiply each value of a data set by a constant, then the new standard deviation will be the
original standard deviation multiplied by that constant.
The coefficient of variation (CV) expresses the standard deviation as a percentage of the mean:

CV = (S / X̄) × 100%
The standard score (Z-score) of a value X is defined as:

Z = (X - X̄) / S
The standard score measures the deviation of each value of a data set from the mean in units of standard
deviation. It is used to compare the relative standing of values.
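A small hypothetical illustration of comparing relative standing with standard scores (the test means and standard deviations below are made up):

```python
# Standard scores put values from different scales on a common footing.

def z_score(x, mean, s):
    return (x - mean) / s

z_a = z_score(65, 50, 10)  # 1.5 standard deviations above the mean of test A
z_b = z_score(80, 75, 5)   # 1.0 standard deviation above the mean of test B
# 65 on test A is the stronger relative performance, despite being the smaller raw score.
```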
2) If a distribution is lop-sided, that is, if most of the data are concentrated either on the left or the right
end, it is said to be skewed.
a) A distribution is said to be skewed to the right, or positively skewed, when most of the data are
concentrated on the left of the distribution.
Income provides one example of a positively skewed distribution. Most people have small incomes, but some make quite a bit more, with a smaller number making many millions of dollars a year. Therefore, the positive (right) tail on the line graph for income extends out quite a long way, whereas the negative (left) tail stops at zero. The right tail clearly extends farther from the distribution's centre than the left tail as shown below:
b) A distribution is said to be skewed to the left, or negatively skewed, if most of the data are
concentrated on the right of the distribution. The left tail clearly extends farther from the distribution's
centre than the right tail as shown below:
Example: The following data are the scores of 41 students on a math test (rounded to the nearest integer):
31, 49, 19, 62, 50, 24, 45, 23, 51, 32, 48, 55, 60, 40, 35, 54, 26, 57, 37, 43, 65, 50, 55, 18, 53, 41, 50, 34,
67, 56, 44, 4, 54, 57, 39, 52, 45, 35, 51, 63, 42
The histogram and the frequency polygon superimposed on it are shown in the Figure below.
Unit Seven
Tests of hypothesis concerning means and proportions
7.1 Tests of hypotheses concerning means
1. Parametric and non-parametric statistics (tests)
Parametric tests are statistical tests which make certain assumptions about the parameters of the full
population from which the sample is taken (e.g., a normal distribution). If those assumptions are correct,
parametric methods can produce more accurate and precise estimates (they are said to have more
statistical power). However, if those assumptions are incorrect, parametric methods can be very
misleading. These tests normally involve data expressed in absolute numbers (interval or ratio) rather than
ranks and categories (nominal or ordinal). Such tests include analysis of variance (ANOVA), t-tests, etc.
Consider the frequency distribution shown below. It can easily be observed that the distribution deviates
substantially from the normal distribution (the bell-shaped distribution).
This may also be the case for many variables of interest. For example, is income distributed normally in
the population? -- probably not. The incidence rates of rare diseases are not normally distributed in the
population, and neither are very many other variables in which a researcher might be interested. With a
sample of small size at hand, analyzing such variables using parametric tests might be misleading!
Note: We can apply parametric tests even if we are not sure that the distribution of the variable under
investigation in the population is normal as long as our sample is large enough. If our sample is very
small, however, then those tests can be used only if we are sure that the variable is normally distributed.
Applications of tests that are based on the normality assumptions are further limited by a lack of precise
measurement. For example, course grade (A, B, C, D, F) is a crude measure of scholastic
accomplishments that only allows us to establish a rank ordering of students from "good" students to
"poor" students. Most common statistical techniques assume that the underlying measurements are at least
of interval scale. However, as in our example, this assumption is very often not tenable, and the data represent a rank ordering of observations (ordinal) rather than precise measurements.
Thus, the need is evident for statistical procedures that allow us to process data of low quality (nominal
or ordinal), from small samples on variables about which nothing is known (concerning their
distribution). Nonparametric methods have been developed to be used in cases when the researcher
knows nothing about the parameters of the variable of interest in the population (hence the name
nonparametric).
In more technical terms, nonparametric methods do not rely on the estimation of parameters (such as the
mean or the standard deviation) describing the distribution of the variable of interest in the population.
Therefore, these methods are also sometimes (and more appropriately) called parameter-free methods or
distribution-free methods.
Non-parametric methods are widely used for studying populations that take on a ranked order. The use of
non-parametric methods may be necessary when data have a ranking but no clear numerical
interpretation, such as when assessing preferences; in terms of levels of measurement, for data on an
ordinal scale.
When to use which method
Basically, there is at least one nonparametric equivalent for each parametric general type of test. In
general, these tests fall into the following categories:
Differences between independent groups: Usually, when we have two samples that we want to compare
concerning their mean value for some variable of interest, we would use the t-test for independent
samples. A nonparametric alternative for this test is the Mann-Whitney U test. If we have multiple groups,
we would use (the parametric) analysis of variance; a nonparametric equivalent to this method is the
Kruskal-Wallis analysis of ranks test.
Differences between dependent groups: If we want to compare two variables measured in the same
sample we would customarily use the t-test for dependent samples (if we want to compare students' math
skills at the beginning of the semester with their skills at the end of the semester). Nonparametric
alternatives to this test are the Sign test and Wilcoxon's matched pairs test. If the variables of interest are
dichotomous in nature (i.e., "pass" vs. "no pass") then McNemar's Chi-square test is appropriate.
Relationships between variables: To express a relationship between two variables one usually computes
the correlation coefficient. A nonparametric equivalent to the standard correlation coefficient is the
Spearman rank correlation coefficient. If the two variables of interest are categorical in nature (e.g.,
"passed" vs. "failed" by "male" vs. "female") an appropriate nonparametric test of the relationship
between the two variables is the Chi-square test.
2. Tests of mean difference between two populations (independent samples)

Suppose we want to test whether the means of two populations are equal:

H0: μ1 = μ2
HA: μ1 ≠ μ2

where μ1 is the mean of population 1 and μ2 is the mean of population 2. The null hypothesis (H0) simply states that the two populations under consideration have equal means. Note that our conclusion is about the means of the two populations (i.e., true means); it is not about the samples (or sample means)!
a) The t- test:
Assumptions:
1) The samples come from two normally distributed populations.
2) The population variances σ1² and σ2² are assumed equal but are not known.
Under these assumptions, we can use the t-distribution with (n1 + n2 - 2) degrees of freedom to find the critical value t_{α/2}(n1 + n2 - 2) for a given level of significance α. The test statistic is given by:
t_cal = (X̄1 - X̄2) / [ Sp √(1/n1 + 1/n2) ]

where the pooled standard deviation Sp is:

Sp = √[ ((n1 - 1)S1² + (n2 - 1)S2²) / (n1 + n2 - 2) ]

Here S1² and S2² are the variances computed from the samples.

Decision rule: Reject H0 if |t_cal| > t_{α/2}(n1 + n2 - 2).
Example 1: The following summary statistics are on the annual household income (in thousands
of dollars) of individuals who previously defaulted (group 1) and not defaulted (group 2) on their
bank loans (data obtained from SPSS package: bankloan.sav).
              Defaulted (group 1)    Not defaulted (group 2)
Mean          X̄1 = 41.2131           X̄2 = 47.1547
Variance      S1² = 1858.949         S2² = 1171.019
Sample size   n1 = 183               n2 = 517
Test if there is a significant difference in the mean income of defaulters and non-defaulters at the
5% level of significance.
Solution
The null and alternative hypotheses are:
H0: μ1 = μ2
HA: μ1 ≠ μ2

The level of significance is α = 0.05.
The pooled standard deviation is computed as:

Sp = √[ ((183 - 1)(1858.949) + (517 - 1)(1171.019)) / (183 + 517 - 2) ] = 36.7477

The test statistic is then:

t_cal = (X̄1 - X̄2) / [ Sp √(1/n1 + 1/n2) ] = (41.2131 - 47.1547) / [ 36.7477 √(1/183 + 1/517) ] = -1.880

Decision: Since |t_cal| = 1.880 < t_{0.025}(698) ≈ 1.96, we do not reject H0; the t-test finds no significant difference in mean income at the 5% level of significance.
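The pooled computation can be reproduced from the summary statistics above (a Python sketch of the equal-variances formula; this is a check, not part of the original SPSS workflow):

```python
import math

# Two-sample t statistic assuming equal population variances (pooled formula).

def pooled_t(mean1, var1, n1, mean2, var2, n2):
    sp2 = ((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2)
    se = math.sqrt(sp2) * math.sqrt(1 / n1 + 1 / n2)
    return (mean1 - mean2) / se

t = pooled_t(41.2131, 1858.949, 183, 47.1547, 1171.019, 517)
# |t| is below the 5% critical value (about 1.96 for 698 d.f.), so H0 is not rejected
```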
To apply the Mann-Whitney U test, the procedure in SPSS is as follows:

Analyze → Nonparametric Tests → 2 Independent Samples → Grouping Variable: default (? ?) → Define Groups: Group 1: 1; Group 2: 0 → OK
The output is as shown below:
Mann-Whitney Test
Ranks

Previously defaulted   N     Mean Rank   Sum of Ranks
No                     517   368.83      190685.50
Yes                    183   298.71      54664.50
Total                  700
Test Statistics

                         Household income in thousands
Mann-Whitney U           37828.500
Wilcoxon W               54664.500
Z                        -4.032
Asymp. Sig. (2-tailed)   .000
Decision: Since the p-value is less than 1%, we reject H 0 . Thus, we conclude that there is a
significant difference in the mean income of defaulters and non-defaulters at the 1% level of
significance.
Question: Why did the two tests lead to different conclusions?

This is unexpected, and we have to look for the source of the discrepancy. The box plot of the data is
shown below:
We see from the box plot that the item in the 445th row is an outlier (extreme value). Probably the discrepancy is due to this value. After removing this row, the independent-samples t-test yields the following result:
Independent Samples Test (household income in thousands)

Levene's test for equality of variances: Sig. = .061. t-test for equality of means: Sig. (2-tailed) = .005, with:

                              Mean Difference   Std. Error Difference   95% CI of the Difference
Equal variances assumed       8.16573           2.87926                 (2.51266, 13.8187)
Equal variances not assumed   8.16573           2.74484                 (2.76713, 13.5643)
Remark

The independent samples t-test is of two types: equal variances assumed (H0: σ1² = σ2²) and equal variances not assumed (H1: σ1² ≠ σ2²). In order to identify which of the two tests is appropriate, we use Levene's test for equality of variances. If the p-value for this test is less than 5%, then we reject H0 and consider the result in the "Equal variances not assumed" row; otherwise, we use the result in the "Equal variances assumed" row.

In our case, Levene's test for equality of variances has a p-value of 0.061, which is greater than 5%. Thus, we do not reject H0: σ1² = σ2², and consequently, we refer to the result in the "Equal variances assumed" row. The p-value there is 0.005, which is less than 1%. Thus, we reject the hypothesis of equality of means of the two groups.
3. Tests of mean difference between several populations (independent samples)
We have seen above how to apply hypothesis testing procedures to test the null
hypothesis of no difference between two population means. It is not unusual for the investigator
to be interested in testing the null hypothesis of no difference among several population means.
Assumptions: Random samples of size n are taken from each of the k populations, which are independently and normally distributed with means μ1, μ2, . . ., μk and common variance σ² (i.e., the variability in each group is the same). Also, all observations are continuous.
Under this general principle we want to test:
H0: μ1 = μ2 = . . . = μk
against the alternative:
H1 : At least two of them are different.
ANOVA
For such test of hypothesis we use a method called analysis of variance. Analysis of variance is
a method of splitting the total variation into meaningful components that measure different
sources of variation.
In other words, we split the total sum of squares ( SS total ) into between groups (sample) sum
of squares ( SS between ) and within group (sample) sum of squares ( SS within ). And the test
statistic for testing H 0 versus H1 is given by the variance ratio:
F_cal = [ SS_between / (k - 1) ] / [ SS_within / (k(n - 1)) ]

where k is the number of groups, and n is the sample size from each group. In order to decide whether the null hypothesis has to be rejected or not, we compare the above test statistic with F_α(k - 1, k(n - 1)), the value from the F-distribution with (k - 1) and k(n - 1) degrees of freedom for a given level of significance α.

Decision rule:

If the calculated value (test statistic) exceeds F_α(k - 1, k(n - 1)), we reject H0 and conclude that the group means are not all equal.
The analysis of variance (ANOVA) table for testing such hypotheses is as shown below.
ANOVA table for a one-way classification

Source of variation   Sum of squares   d.f.       Mean square              F-ratio
Between groups        SS_between       k - 1      SS_between / (k - 1)     [SS_between/(k - 1)] / [SS_within/(k(n - 1))]
Within groups         SS_within        k(n - 1)   SS_within / (k(n - 1))
Total                 SS_total         kn - 1
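The sum-of-squares decomposition and the F-ratio above can be sketched for equal group sizes (the three small groups below are synthetic data for illustration):

```python
# One-way ANOVA F ratio for k groups of equal size n, following the
# decomposition SS_total = SS_between + SS_within.

def one_way_f(groups):
    k = len(groups)
    n = len(groups[0])                      # equal group sizes assumed
    grand = sum(sum(g) for g in groups) / (k * n)
    means = [sum(g) / n for g in groups]
    ss_between = n * sum((m - grand) ** 2 for m in means)
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    return (ss_between / (k - 1)) / (ss_within / (k * (n - 1)))

f = one_way_f([[1, 2, 3], [2, 3, 4], [6, 7, 8]])
# SS_between = 42 (2 d.f.), SS_within = 6 (6 d.f.), so F = 21.0
```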
Example 3: Consider data on the average cost of vehicle insurance claims for vehicles of different ages (four age groups: 0-3, 4-7, 8-9 and 10+ years). A first Levene's test output from SPSS was:

Levene Statistic   df1   df2   Sig.
1.353              3     119   .261
The Levene's test output for the homogeneity of variances of the average cost of claims across the four groups:

Levene Statistic   df1   df2   Sig.
4.696              3     116   .004
Here the p-value is less than 1%. Thus, we reject the null hypothesis and conclude that the
variability in the cost of claims is different depending on the age of a vehicle. Now we can test if
the average cost of claims for the three groups of vehicles (based on age) is the same.
The ANOVA table for testing:

H0: μ1 = μ2 = μ3 = μ4 versus
H1: at least two of the means are different

is shown below:

ANOVA (Average cost of claims)

                 Sum of Squares   df    Mean Square   F        Sig.
Between Groups   462109.712       3     154036.571    49.721   .000
Within Groups    359371.613       116   3098.031
Total            821481.325       119
Since the p-value is less than 1%, we conclude that there is a significant difference in the mean
cost of claims for vehicles of different ages at the one percent level of significance.
Question: which groups of means are different?
To answer this question, we apply pair-wise comparison of means. Since the equality of
variances assumption is rejected, the appropriate tests are those listed under Equal Variances
Not Assumed in SPSS. The output of Post Hoc Tests (Multiple comparisons) is shown below:
Multiple Comparisons (dependent variable: average cost of claims)

(I) Vehicle age   (J) Vehicle age   Mean Difference (I-J)   Std. Error   Sig.   Lower Bound   Upper Bound
0-3               4-7               34.774                  15.546       .164   -7.68         77.23
                  8-9               98.677*                 16.000       .000   55.04         142.31
                  10+               165.570*                15.113       .000   124.17        206.97
4-7               0-3               -34.774                 15.546       .164   -77.23        7.68
                  8-9               63.903*                 13.206       .000   27.97         99.84
                  10+               130.796*                12.115       .000   97.75         163.84
8-9               0-3               -98.677*                16.000       .000   -142.31       -55.04
                  4-7               -63.903*                13.206       .000   -99.84        -27.97
                  10+               66.892*                 12.693       .000   32.26         101.52
10+               0-3               -165.570*               15.113       .000   -206.97       -124.17
                  4-7               -130.796*               12.115       .000   -163.84       -97.75
                  8-9               -66.892*                12.693       .000   -101.52       -32.26

* The mean difference is significant at the 5% level.

All pairwise mean differences are significant except that between the 0-3 and 4-7 age groups (Sig. = .164).
The nonparametric equivalent of the one-way ANOVA is the Kruskal-Wallis test. The Kruskal-Wallis one-way analysis of variance by ranks is a nonparametric method for testing equality of population medians among groups. It is identical to a one-way analysis of variance with the data replaced by their ranks. It is an extension of the Mann-Whitney U test to three or more groups. Since it is a non-parametric method, the Kruskal-Wallis test does not assume a normal population, unlike the analogous one-way analysis of variance.
Example 4: Consider the data on the average cost of vehicle insurance claims and vehicle age. To apply the Kruskal-Wallis one-way analysis of variance to test for a difference in the mean cost of claims for vehicles of different ages, the procedure in SPSS is as follows:
Analyze → Nonparametric Tests → k Independent Samples → Grouping Variable: vehicleage (? ?) → Define Range: Minimum 1; Maximum 4 → OK
The output is as shown below:
Kruskal-Wallis Test
Ranks (Average cost of claims)

Vehicle age   N     Mean Rank
0-3           31    89.69
4-7           31    79.68
8-9           31    48.58
10+           27    18.65
Total         120
Test Statistics (Average cost of claims)

Chi-Square    73.989
df            3
Asymp. Sig.   .000

Since the p-value is less than 0.01, we reject the null hypothesis and again conclude that there is a significant difference in the cost of claims for vehicles of different ages.
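As a consistency check, the Kruskal-Wallis statistic can be recomputed from the printed group sizes and mean ranks using the standard formula H = 12/(N(N+1)) Σ nᵢ R̄ᵢ² - 3(N+1) (a sketch that ignores the small tie correction SPSS applies):

```python
# Kruskal-Wallis H from group sizes and mean ranks (no tie correction).

def kruskal_wallis_h(sizes, mean_ranks):
    n_total = sum(sizes)
    s = sum(n * r ** 2 for n, r in zip(sizes, mean_ranks))
    return 12.0 / (n_total * (n_total + 1)) * s - 3 * (n_total + 1)

h = kruskal_wallis_h([31, 31, 31, 27], [89.69, 79.68, 48.58, 18.65])
# about 74, matching the chi-square value 73.989 in the output above
```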
4. Tests of mean difference between two populations (paired samples)

Given n paired observations (Xj, Yj), we first compute the differences:

Dj = Xj - Yj,   j = 1, 2, . . ., n

We then calculate the mean and sample variance of the differences Dj's as:

D̄ = (1/n) Σ Dj   and   SD² = (1/(n - 1)) Σ (Dj - D̄)²

Assumption: The differences (Dj's) are normally distributed with mean μd and variance σd².

To test H0: μd = 0 against HA: μd ≠ 0, the test statistic is:

t_cal = (D̄ - 0) / (SD / √n)

Decision rule: Reject H0 if |t_cal| > t_{α/2}(n - 1).
Example 5: Consider data on the current sale value of houses and their value at the last appraisal (n = 1000 houses). The SPSS paired-samples output is shown below:

Paired Samples Statistics

                          Mean       N      Std. Deviation   Std. Error Mean
Sale value of house       161.4920   1000   55.44955         1.75347
Value at last appraisal   134.9620   1000   44.79421         1.41652

Paired Samples Test

Pair 1: Mean (D) = 26.53000, Std. Error = 1.00371, 95% CI of the difference = (24.56037, 28.49963), t = 26.432, df = 999, Sig. (2-tailed) = .000
Since the p-value is less than 0.01, we reject the null hypothesis and conclude that there is a
significant difference between the current mean sale value of houses and the mean sale value in
the last appraisal at the 1% level of significance (the sale value has appreciated, on average).
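The paired-samples t statistic above can be sketched directly (the four pairs below are synthetic data for illustration):

```python
import math

# Paired-samples t statistic for H0: mean difference = 0.

def paired_t(x, y):
    d = [xi - yi for xi, yi in zip(x, y)]
    n = len(d)
    dbar = sum(d) / n
    sd2 = sum((di - dbar) ** 2 for di in d) / (n - 1)
    return dbar / math.sqrt(sd2 / n)

t = paired_t([5, 6, 7, 8], [4, 4, 6, 6])
# differences [1, 2, 1, 2]: mean 1.5, SD^2 = 1/3, so t = 1.5 / sqrt(1/12) ≈ 5.196
```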
Remark: The nonparametric equivalent to the paired-samples t-test is the Wilcoxon Signed
Ranks Test. This test does not assume a normal population.
Example 6: For the data in example 5, the Wilcoxon Signed Ranks Test results are shown below:
Wilcoxon Signed Ranks Test
Ranks (Value at last appraisal - Sale value of house)

                 N      Mean Rank   Sum of Ranks
Negative Ranks   839    548.34      460056.50
Positive Ranks   160    246.52      39443.50
Ties             1
Total            1000

Test Statistics

Z                        -23.055
Asymp. Sig. (2-tailed)   .000
Since the p-value is less than 0.01, we again reject the null hypothesis and conclude that there is
a significant difference between the current mean sale value of houses and the mean sale value in
the last appraisal at the 1% level of significance.
5. Tests of association
a) The Pearson coefficient of correlation and test of its significance
For two continuous variables X and Y, a measure of the strength of linear relationship is
provided by the Pearson coefficient of correlation which is defined as:
r = [ nΣxy - (Σx)(Σy) ] / √[ (nΣx² - (Σx)²)(nΣy² - (Σy)²) ]
The sign of r indicates the direction of the relationship between the two variables X and Y. If
an inverse relationship exists, then r will fall between 0 and -1. Likewise, if there is a direct
relationship, then r will be a value within the range 0 to 1.
To see if this value of r is of sufficient magnitude to indicate that the two variables of interest are
correlated, we test the hypothesis:
H0: ρ = 0
HA: ρ ≠ 0

where ρ is the true (population) coefficient of correlation. The test statistic is:
t_cal = r √[ (n - 2) / (1 - r²) ]

which follows the t-distribution with (n - 2) degrees of freedom.

Example 7: The following data are the advertising spending and detrended sales of a company recorded over a period of n = 24 months:
Month   Advertising spending (X)   Detrended sales (Y)   Month   Advertising spending (X)   Detrended sales (Y)
1       4.69                       12.23                 13      5.15                       12.27
2       6.41                       11.84                 14      5.25                       12.57
3       5.47                       12.25                 15      1.72                       8.87
4       3.43                       11.1                  16      3.04                       11.15
5       4.39                       10.97                 17      4.92                       11.86
6       2.15                       8.75                  18      4.85                       11.07
7       1.54                       7.75                  19      3.13                       10.38
8       2.67                       10.5                  20      2.29                       8.71
9       1.24                       6.71                  21      4.9                        12.07
10      1.77                       7.6                   22      5.75                       12.74
11      4.46                       12.46                 23      3.61                       9.82
12      1.83                       8.47                  24      4.62                       11.51
Solution
Summary statistics:

n = 24,  Σx = 89.28,  Σy = 253.65,  Σxy = 1001.954,  Σx² = 386.58,  Σy² = 2755.299

Plugging in these values, the sample coefficient of correlation is:

r = [ nΣxy - (Σx)(Σy) ] / √[ (nΣx² - (Σx)²)(nΣy² - (Σy)²) ] = 0.91627

The test statistic is:

t_cal = r √[ (n - 2) / (1 - r²) ] = (0.91627) √[ 22 / (1 - (0.91627)²) ] = 10.72913

For α = 0.01, t_{α/2}(n - 2) = t_{0.005}(22) = 2.819. Since the calculated value t = 10.72913 is greater than 2.819, we reject H0 and conclude that there is a significant positive (or direct) correlation between advertising spending and sales at the one percent level of significance. The SPSS output is shown below:
Correlations

                                             Advertising spending   Detrended sales
Advertising spending   Pearson Correlation   1                      .916**
                       Sig. (2-tailed)                              .000
                       N                     24                     24
Detrended sales        Pearson Correlation   .916**                 1
                       Sig. (2-tailed)       .000
                       N                     24                     24

** Correlation is significant at the 0.01 level (2-tailed).
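The calculation of r and its t statistic can be reproduced from the summary sums (a Python sketch of the formulas used above):

```python
import math

# Pearson r from summary sums, and the t statistic for testing rho = 0.

def pearson_r(n, sx, sy, sxy, sxx, syy):
    num = n * sxy - sx * sy
    den = math.sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))
    return num / den

def t_for_r(r, n):
    return r * math.sqrt((n - 2) / (1 - r ** 2))

r = pearson_r(24, 89.28, 253.65, 1001.954, 386.58, 2755.299)
t = t_for_r(r, 24)
# r is about 0.916 and t about 10.7, as computed in the solution
```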
b) The Spearman rank correlation coefficient

When the observations are replaced by their ranks, the Spearman rank correlation coefficient is computed as:

r_s = 1 - [ 6Σd² ] / [ n(n² - 1) ]

Step 1: Rank the x's among themselves, giving rank 1 to the largest (or smallest) observation, rank 2 to the second largest (or second smallest) observation, and so on.
Step 2: Rank the y's similarly.
Step 3: Find d = rank of x - rank of y for each pair of observations.
Step 4: Find Σd² (the sum of squares of the differences between each pair of ranks).
Step 5: Compute the rank correlation coefficient using the above formula.
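The five steps above can be sketched as follows (assuming no tied observations; the data are illustrative):

```python
# Spearman rank correlation: rank both variables, then apply
# r_s = 1 - 6*sum(d^2) / (n(n^2 - 1)).

def ranks(values):
    """Rank 1 for the smallest value, 2 for the next, and so on (no ties)."""
    order = sorted(values)
    return [order.index(v) + 1 for v in values]

def spearman_rs(x, y):
    d2 = sum((rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y)))
    n = len(x)
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

rs = spearman_rs([10, 20, 30, 40, 50], [1, 2, 3, 5, 4])
# sum of d^2 is 2, so rs = 1 - 12/120 = 0.9
```

Ranking from the smallest or from the largest gives the same coefficient, as long as both variables are ranked in the same direction.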
Example 8: For the data on the advertising spending and sales of a company recorded over a
period of n = 24 months, the SPSS output of nonparametric correlations is shown below:
Nonparametric Correlations

Spearman's rho                                    Advertising spending   Detrended sales
Advertising spending   Correlation Coefficient    1.000                  .889**
                       Sig. (2-tailed)                                   .000
                       N                          24                     24
Detrended sales        Correlation Coefficient    .889**                 1.000
                       Sig. (2-tailed)            .000
                       N                          24                     24
Since the p-value is less than 0.01, we reject H 0 and conclude that there is a significant
correlation between advertising spending and sales at the one percent level of significance.
c) The Chi-square test
If the two variables whose degree of association we want to test are categorical in nature (for
example, job satisfaction versus income), the appropriate nonparametric statistic for testing such
relationship is the Chi-square test.
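For a test of independence, the chi-square statistic compares each observed count with its expected count (row total × column total / grand total); a sketch with a small hypothetical table:

```python
# Chi-square statistic for a contingency table (test of independence).

def chi_square(table):
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    total = sum(rows)
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            exp = rows[i] * cols[j] / total   # expected count under independence
            stat += (obs - exp) ** 2 / exp
    return stat

chi2 = chi_square([[10, 20], [20, 10]])
# all expected counts are 15, so chi2 = 4 * (25/15) ≈ 6.67
```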
Example 9: Here we use the data in SPSS package:
Files\SPSSInc\Statistics17\Samples\English\customer_dbase.sav
Suppose we want to check if there is a relationship between the level of income of employees
(categorized into five) and job satisfaction (categorized from highly dissatisfied to highly
satisfied). Here job satisfaction is not continuous, and hence we cannot apply the Pearson
coefficient of correlation.
Before going to the test it is a good idea to see what the data look like using graphs. The multiple
bar chart for the said variables is shown below:
The chart gives us some idea about the relationship between the two variables. For example, the
frequency of highly dissatisfied employees keeps on decreasing as income increases. However,
we should not come to a final judgement before applying objective statistical tests. In tests of
independence, the null and alternative hypotheses are of the form:
H0: The two classifications are independent.
HA: The two classifications are dependent.
The null hypothesis can also be written as 'there is no association between the two classifications.' The SPSS procedure for testing such hypotheses is:
Analyze → Nonparametric Tests → Chi-Square → (select Income category and Job satisfaction) → OK

The SPSS output is as shown below:
Chi-Square Test

Test Statistics

              Income category in thousands   Job satisfaction
Chi-Square    24.426                         1252.834
Asymp. Sig.   .000                           .000
Since the p-value is less than 0.01, we reject H0 and conclude that income and job satisfaction are
dependent or associated. A cross-tabulation of income category versus job satisfaction is shown
below.
Income category (in thousands) by job satisfaction (count and row percentage):

             Highly         Somewhat
             dissatisfied   dissatisfied   Neutral        Somewhat satisfied   Highly satisfied   Total
Under $25    413 (31.1%)    303 (22.8%)    242 (18.2%)    219 (16.5%)          153 (11.5%)        1330 (100.0%)
$25 - $49    365 (20.4%)    413 (23.0%)    440 (24.5%)    348 (19.4%)          227 (12.7%)        1793 (100.0%)
$50 - $74    119 (14.5%)    173 (21.1%)    182 (22.2%)    168 (20.5%)          177 (21.6%)        819 (100.0%)
$75 - $124   53 (7.9%)      100 (15.0%)    143 (21.4%)    189 (28.3%)          183 (27.4%)        668 (100.0%)
$125+        17 (4.4%)      52 (13.3%)     85 (21.8%)     90 (23.1%)           146 (37.4%)        390 (100.0%)
100
It can be seen that for income category Under $25, more than 50% of employees are highly
dissatisfied or somewhat dissatisfied. On the other hand, for the income category $125+, about
60% of employees are either somewhat satisfied or highly satisfied. In general, the higher the
income level, the more likely are the employees to be satisfied with their job.
6. Tests of the difference between two proportions

To compare the proportions P1 and P2 observed in two independent samples of sizes n1 and n2, we first compute the pooled sample proportion:

P = (n1P1 + n2P2) / (n1 + n2)

Standard error (SE) of the sampling distribution of the difference between two proportions:

SE(P1 - P2) = √[ P(1 - P)(1/n1 + 1/n2) ]

The test statistic is:

Z_cal = (P1 - P2) / SE(P1 - P2) = (P1 - P2) / √[ P(1 - P)(1/n1 + 1/n2) ]
We then compare this statistic with the critical value from the standard normal distribution for a
given level of significance .
Decision: Reject the null hypothesis if: | Zcal | Z / 2
Example 10: Consider the data in example 9 (the level of income of employees and job
satisfaction).
Now let us compare employees who earn under $25 (thousand per year) and those who earn $25
$49.
Income category (in thousands) by job satisfaction (count and row percentage):

            Highly         Somewhat
            dissatisfied   dissatisfied   Neutral        Somewhat satisfied   Highly satisfied   Total
Under $25   413 (31.1%)    303 (22.8%)    242 (18.2%)    219 (16.5%)          153 (11.5%)        1330
$25 - $49   365 (20.4%)    413 (23.0%)    440 (24.5%)    348 (19.4%)          227 (12.7%)        1793
Is there a significant difference between the proportion of those who are highly dissatisfied in the
two income groups?
Solution
Here the group sizes are n1 = 1330 and n2 = 1793, and the sample proportions of highly dissatisfied employees are P1 = 413/1330 = 0.311 and P2 = 365/1793 = 0.204. The pooled sample proportion is:

P = (n1P1 + n2P2) / (n1 + n2) = (413 + 365) / (1330 + 1793) = 0.2491

The standard error is:

SE(P1 - P2) = √[ P(1 - P)(1/n1 + 1/n2) ] = √[ 0.2491(1 - 0.2491)(1/1330 + 1/1793) ] = 0.01565

Z_cal = (P1 - P2) / SE(P1 - P2) = (0.311 - 0.204) / 0.01565 = 6.84
The critical value from the standard normal distribution for α = 0.01 is Z_{α/2} = 2.576.

Decision: Since |Z_cal| > Z_{α/2}, we reject the null hypothesis. Thus, there is a significant difference between the proportions of those who are highly dissatisfied in the two income groups at the one percent level of significance.
Note that P1 - P2 > 0. This indicates that a significantly higher proportion of employees who earn under $25 are highly dissatisfied with their job as compared to those who earn $25 - $49.
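Using the group totals as the sample sizes, the two-proportion z test can be sketched as:

```python
import math

# Two-proportion z test: 413 of 1330 vs 365 of 1793 "highly dissatisfied".

def two_prop_z(x1, n1, x2, n2):
    p1, p2 = x1 / n1, x2 / n2
    p = (x1 + x2) / (n1 + n2)                        # pooled proportion
    se = math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))  # standard error of p1 - p2
    return (p1 - p2) / se

z = two_prop_z(413, 1330, 365, 1793)
# z clearly exceeds the 1% critical value of 2.576, so H0 is rejected
```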
Unit Eight:
The simple linear regression model and Statistical Inference
8.1 What is a regression model?
What is regression analysis? In very general terms, regression is concerned with describing and
evaluating the relationship between a given variable and one or more other variables on which
the given variable depends. More specifically, regression is an attempt to explain movements in a
variable by reference to movements in one or more other variables.
The given variable is referred to as the dependent (or response) variable (denoted by Y), while
the variables which are thought to affect it are referred to as independent (explanatory or
regressor) variables (denoted by X1 , X 2 , X 3 , . . ., X k ). The case where we have just one
explanatory variable is called simple linear regression. If we have two or more explanatory variables, then we have the multiple linear regression model.
8.2 Regression versus correlation
The correlation between two variables measures the degree of linear association between them. If
X and Y are correlated, then there is evidence of a linear relationship between the two
variables. However, it is not implied that changes in X cause changes in Y, or that changes in Y
cause changes in X. The degree of linear relationship between these two variables is measured by
the coefficient of correlation.
In regression, the dependent variable and the independent variable are treated very differently.
The dependent variable is assumed to be random or stochastic in some way, i.e. to have
a certain probability distribution.
The independent variables are, however, assumed to have fixed (non-stochastic) values
in repeated samples.
Suppose that a researcher has some idea that there should be a relationship between two
variables Y and X, and that economic, financial, etc. theory suggests that an increase in X will
lead to an increase in Y. A sensible first stage to testing whether there is indeed an association
between the variables would be to form a scatter plot of them. Suppose that the outcome of this
plot is as shown in figure 1.
Y X .. (1)
to get the line that best fits the data. The researcher would then be seeking to find the values of
the parameters or coefficients, and , which would place the line as close as possible to all of
the data points taken together.
However, this equation is an exact one. Assuming that this equation is appropriate, if the values
of α and β had been calculated, then given a value of X, it would be possible to determine with
certainty what the value of Y would be. Imagine: a model which says with complete certainty
what the value of one variable will be given any value of the other!
Clearly this model is not realistic. Statistically, it would correspond to the case where the model
fitted the data perfectly, that is, all of the data points lay exactly on a straight line. To make the
model more realistic, a random disturbance or error term, denoted by u_t, is added to the
equation. Thus, we have:

Y_t = α + βX_t + u_t . . . (2)
where the subscript t (= 1, 2, 3, . . .) denotes the observation number.
Even in the general case where there is more than one explanatory variable, some
determinants of Yt will always in practice be omitted from the model. This might, for
example, arise because the number of influences on Y is too large to place in a single
model, or because some determinants of Y may be unobservable or not measurable.
There may be errors in the way that Y is measured which cannot be modelled.
There are bound to be random outside influences on Y that again cannot be modelled. For
example, a terrorist attack, a hurricane or a computer failure could all affect financial
asset returns in a way that cannot be captured in a model and can not be forecast reliably.
Similarly, many researchers would argue that human behaviour has an inherent
randomness and unpredictability!
Note: The reason that the sum of the squared distances is minimised rather than, for example,
finding the sum of û_t that is as close to zero as possible, is that in the latter case some points will
lie above the line while others lie below it. Then, when the sum to be made as close to zero as
possible is formed, the points above the line would count as positive values, while those below
would count as negative values. So these distances will cancel each other out and the sum would
be zero. However, taking the squared distances ensures that all deviations that enter the
calculation are positive and therefore do not cancel out.
So the sum of squared distances to be minimized is given by:

û₁² + û₂² + û₃² + . . . + û_T² = Σ_{t=1}^{T} û_t² . . . (3)
This sum is known as the residual sum of squares (RSS) or the sum of squared residuals. But
what is û_t? Again, it is the difference between the actual point and the estimating line, that is,
û_t = Y_t − Ŷ_t. So minimising Σ û_t² is equivalent to minimising Σ (Y_t − Ŷ_t)²,
which is also known as a loss function. Take the summation over all of the observations, i.e. from
t = 1 to T, where T is the number of observations:

L = Σ_{t=1}^{T} (Y_t − Ŷ_t)² = Σ_{t=1}^{T} (Y_t − α̂ − β̂X_t)² . . . (4)

Letting α̂ and β̂ denote the values of α and β selected by minimising the RSS, respectively, the
equation for the fitted line is given by Ŷ_t = α̂ + β̂X_t.
To find the values of α̂ and β̂ which minimise the residual sum of squares (equivalently, to find
the equation of the line that is closest to the data), L is minimised with respect to α̂ and β̂. This
is achieved by differentiating L with respect to α̂ and β̂ and setting the first derivatives to
zero. The resulting coefficient estimators for the slope and the intercept are given by:

β̂ = [T ΣX_tY_t − (ΣX_t)(ΣY_t)] / [T ΣX_t² − (ΣX_t)²]
  = Σ(X − X̄)(Y − Ȳ) / Σ(X − X̄)²
  = (ΣX_tY_t − T X̄ Ȳ) / (ΣX_t² − T X̄²) . . . (5)

α̂ = Ȳ − β̂ X̄ . . . (6)

Thus, given only the sets of observations Y_t and X_t, it is always possible to calculate the
estimated values of the two parameters so that the line Ŷ_t = α̂ + β̂X_t is the best fit to
the set of data. This method of finding the optimum is known as ordinary least squares (OLS).
Note (estimator and estimate)
Estimators are the formulae used to calculate the coefficients (or parameters in general). For
example, the expressions given above for α̂ and β̂ are estimators. Estimates, on the other hand,
are the actual numerical values for the coefficients that are obtained from the sample.
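The OLS formulas (5) and (6) can be sketched in a few lines of code. The function below is an illustrative implementation, not part of the original reader; it is checked on an exactly linear data set where the true intercept and slope are known in advance.

```python
def ols(x, y):
    """Estimate alpha-hat and beta-hat by ordinary least squares,
    using formulas (5) and (6): beta-hat = Sxy / Sxx, alpha-hat = ybar - beta-hat * xbar."""
    T = len(x)
    xbar = sum(x) / T
    ybar = sum(y) / T
    sxy = sum(xi * yi for xi, yi in zip(x, y)) - T * xbar * ybar
    sxx = sum(xi * xi for xi in x) - T * xbar * xbar
    beta = sxy / sxx
    alpha = ybar - beta * xbar
    return alpha, beta

# On data that lie exactly on the line Y = 2 + 3X, OLS recovers the
# intercept and slope exactly (the residual sum of squares is zero).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0 + 3.0 * xi for xi in x]
alpha, beta = ols(x, y)
print(alpha, beta)  # -> 2.0 3.0
```

With noisy data the estimates would no longer equal the true parameters exactly, which is the motivation for the standard errors discussed later in this unit.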
Example 1: The following data are the excess returns of a given asset (Y) together with the
excess returns on a market index (market portfolio) (X) from January 2009 to December 2010,
recorded on a monthly basis:
year/month      X        Y
2009/01      -7.93   -17.75
2009/02      -9.93   -14.39
2009/03       8.83    12.90
2009/04      10.17    16.11
2009/05       5.33     7.21
2009/06       0.44    -1.96
2009/07       7.76     9.08
2009/08       3.23     7.70
2009/09       4.15     3.17
2009/10      -2.49    -5.38
2009/11       5.64     5.65
2009/12       2.80     0.76
2010/01      -3.51    -0.61
2010/02       3.39     3.63
2010/03       6.30     8.09
2010/04       2.13     1.85
2010/05      -7.86    -8.73
2010/06      -5.67    -6.50
2010/07       7.27     6.81
2010/08      -4.81    -7.06
2010/09       9.56     8.48
2010/10       4.02     2.05
2010/11       0.63     0.58
2010/12       6.77     9.15
The idea here is to check if there is a linear relationship between X and Y. The first stage could
be to form a scatter plot of the two variables. This is shown in Figure 3 below. Clearly, there
appears to be a positive, approximately linear, relationship between X and Y.
   X        Y         XY        X²
 -7.93   -17.75   140.7575   62.8849
 -9.93   -14.39   142.8927   98.6049
  8.83    12.90   113.9070   77.9689
 10.17    16.11   163.8387  103.4289
  5.33     7.21    38.4293   28.4089
  0.44    -1.96    -0.8624    0.1936
  7.76     9.08    70.4608   60.2176
  3.23     7.70    24.8710   10.4329
  4.15     3.17    13.1555   17.2225
 -2.49    -5.38    13.3962    6.2001
  5.64     5.65    31.8660   31.8096
  2.80     0.76     2.1280    7.8400
 -3.51    -0.61     2.1411   12.3201
  3.39     3.63    12.3057   11.4921
  6.30     8.09    50.9670   39.6900
  2.13     1.85     3.9405    4.5369
 -7.86    -8.73    68.6178   61.7796
 -5.67    -6.50    36.8550   32.1489
  7.27     6.81    49.5087   52.8529
 -4.81    -7.06    33.9586   23.1361
  9.56     8.48    81.0688   91.3936
  4.02     2.05     8.2410   16.1604
  0.63     0.58     0.3654    0.3969
  6.77     9.15    61.9455   45.8329
TOTAL:
 46.22    40.84  1164.7554  896.9532
Applying formulas (5) and (6), with T = 24, X̄ = 46.22/24 = 1.925833 and Ȳ = 40.84/24 = 1.701667:

β̂ = (ΣX_tY_t − T X̄ Ȳ) / (ΣX_t² − T X̄²) = (1164.7554 − 24(1.925833)(1.701667)) / (896.9532 − 24(1.925833)²) = 1.344

α̂ = Ȳ − β̂ X̄ = 1.701667 − (1.344)(1.925833) = −0.887

Thus, the fitted line is:

Ŷ_t = −0.887 + 1.344 X_t
where X_t is the excess return of the market portfolio over the risk-free rate (i.e. Rm − Rf), also
known as the market risk premium.
Interpretation of α̂ and β̂
The coefficient estimate β̂ is interpreted as follows: if X increases by 1 unit, Y will be expected to
increase by β̂ units, everything else being equal. If β̂ is negative, a rise in X would
on average cause a fall in Y. The intercept coefficient estimate (α̂) is interpreted as the value that
would be taken by the dependent variable Y if the independent variable X took a value of zero.
In Example 1, the coefficient estimate of 1.344 is interpreted as follows: if the excess return of the
market portfolio over the risk-free rate increases by 1%, then the excess return of this particular
asset will be expected to increase by 1.344%, everything else being equal.
If an analyst tells you that he expects the market to yield a return 10% higher than the risk-free
rate next month, what would you expect the excess return on this asset to be? To answer this,
plug X = 10 into the estimated equation. This yields:

Ŷ = −0.887 + 1.344(10) = 12.55

That is, the expected excess return on the asset is about 12.55%.

Precision and standard errors
The estimated standard errors of α̂ and β̂ are given by:

se(α̂) = σ̂ √( ΣX_t² / (T[ΣX_t² − T X̄²]) ) . . . (10)

se(β̂) = σ̂ √( 1 / (ΣX_t² − T X̄²) ) . . . (11)
where σ̂ is the estimated standard deviation of the residuals. It is worth noting that the standard
errors give only a general indication of the likely accuracy of the regression parameters. They do
not show how accurate a particular set of coefficient estimates is. If the standard errors are small,
it shows that the coefficients are likely to be precise on average, not how precise they are for this
particular sample. Thus, standard errors give a measure of the degree of uncertainty in the
estimated values for the coefficients. It can be seen that they are a function of the actual
observations on the explanatory variable, X, the sample size, T, and another term, σ̂.
σ̂² is an estimate of the variance of the disturbance term. The actual variance of the disturbance
term u_t is denoted by σ². An estimator of σ² is given by:

σ̂² = [1/(T − 2)] Σ_{t=1}^{T} û_t² . . . (12)
The OLS residuals corresponding to each time point, their squares, and the total sum of squares of
the residuals are shown in the following table.
year/month      X        Y        û_t       û_t²
2009/01      -7.93   -17.75   -6.2026    38.4723
2009/02      -9.93   -14.39   -0.1540     0.0237
2009/03       8.83    12.90    1.9172     3.6755
2009/04      10.17    16.11    3.3258    11.0610
2009/05       5.33     7.21    0.9322     0.8689
2009/06       0.44    -1.96   -1.6643     2.7698
2009/07       7.76     9.08   -0.4645     0.2157
2009/08       3.23     7.70    4.2452    18.0214
2009/09       4.15     3.17   -1.5216     2.3152
2009/10      -2.49    -5.38   -1.1455     1.3122
2009/11       5.64     5.65   -1.0446     1.0911
2009/12       2.80     0.76   -2.1168     4.4808
2010/01      -3.51    -0.61    4.9957    24.9565
2010/02       3.39     3.63   -0.0399     0.0016
2010/03       6.30     8.09    0.5082     0.2583
2010/04       2.13     1.85   -0.1261     0.0159
2010/05      -7.86    -8.73    2.7233     7.4163
2010/06      -5.67    -6.50    2.0093     4.0373
2010/07       7.27     6.81   -2.0758     4.3088
2010/08      -4.81    -7.06    0.2932     0.0860
2010/09       9.56     8.48   -3.4842    12.1395
2010/10       4.02     2.05   -2.4668     6.0852
2010/11       0.63     0.58    0.6203     0.3848
2010/12       6.77     9.15    0.9364     0.8768
TOTAL                          0.0000   144.8748
Note that the sum of the residuals (Σ û_t) is zero. The residual variance is computed as:

σ̂² = [1/(T − 2)] Σ_{t=1}^{T} û_t² = [1/(24 − 2)](144.8748) = 6.585217

The residual standard deviation is the square root of the variance, i.e.,

σ̂ = √6.585217 = 2.566168
Estimated standard errors of α̂ and β̂ are given by:

se(α̂) = σ̂ √( ΣX_t² / (T[ΣX_t² − T X̄²]) )
      = (2.566168) √( 896.9532 / (24[896.9532 − 24(1.925833)²]) ) = 0.551918

se(β̂) = σ̂ √( 1 / (ΣX_t² − T X̄²) )
      = (2.566168) √( 1 / (896.9532 − 24(1.925833)²) ) = 0.090281

With the standard errors calculated, the results are written as:

Ŷ_t = −0.887 + 1.344 X_t
     (0.551918)  (0.090281)

The standard error estimates are usually placed in parentheses under the relevant coefficient
estimates.
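As a check on the arithmetic in Example 1, the whole calculation (coefficients, residual variance and standard errors) can be reproduced with a short script. This is an illustrative sketch, not part of the original reader; it uses only the data tabulated above and formulas (5), (6), (10), (11) and (12).

```python
import math

# Monthly excess returns from Example 1: market (X) and asset (Y).
X = [-7.93, -9.93, 8.83, 10.17, 5.33, 0.44, 7.76, 3.23, 4.15, -2.49, 5.64, 2.80,
     -3.51, 3.39, 6.30, 2.13, -7.86, -5.67, 7.27, -4.81, 9.56, 4.02, 0.63, 6.77]
Y = [-17.75, -14.39, 12.90, 16.11, 7.21, -1.96, 9.08, 7.70, 3.17, -5.38, 5.65, 0.76,
     -0.61, 3.63, 8.09, 1.85, -8.73, -6.50, 6.81, -7.06, 8.48, 2.05, 0.58, 9.15]

T = len(X)
xbar, ybar = sum(X) / T, sum(Y) / T
sum_x2 = sum(x * x for x in X)
sum_xy = sum(x * y for x, y in zip(X, Y))

# Formulas (5) and (6): OLS slope and intercept.
beta = (sum_xy - T * xbar * ybar) / (sum_x2 - T * xbar ** 2)
alpha = ybar - beta * xbar

# Formula (12): residual variance, then standard errors (10) and (11).
rss = sum((y - alpha - beta * x) ** 2 for x, y in zip(X, Y))
sigma_hat = math.sqrt(rss / (T - 2))
se_alpha = sigma_hat * math.sqrt(sum_x2 / (T * (sum_x2 - T * xbar ** 2)))
se_beta = sigma_hat * math.sqrt(1 / (sum_x2 - T * xbar ** 2))

print(round(alpha, 3), round(beta, 3))        # close to -0.887 and 1.344
print(round(se_alpha, 4), round(se_beta, 4))  # close to 0.5519 and 0.0903
```

The printed values should agree with the worked example up to rounding, since the hand calculation above rounds intermediate quantities.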
Suppose now that a different regression yields a slope estimate β̂ = 0.6091 with a standard error of
0.2109 reported in parentheses beneath it. β̂ = 0.6091 is a single (point) estimate of the unknown
population parameter, β. As stated above,
the reliability of the point estimate is measured by the coefficient's standard error. The
information from the sample coefficients (α̂ and β̂) and their standard errors (se(α̂) and se(β̂))
can be used to make inferences about the population parameters (α and β). So the estimate of the
slope coefficient is β̂ = 0.6091, but it is obvious that this number is likely to vary to some degree
from one sample to the next. Thus, it might be of interest to answer the question: is it plausible,
given this estimate, that the true population parameter, β, could be 0.5? Is it plausible that β could
be 1? And so on. Answers to these questions can be obtained through hypothesis testing.
Suppose, for example, that theory suggests that a value of β greater than
0.5, corresponding to an increase in risk, is not of interest. In this case, the null and alternative
hypotheses would be specified as:

H0: β = 0.5
H1: β < 0.5

This prior information should come from (financial) theory of the problem under consideration,
and not from an examination of the estimated value of the coefficient.
Note that there is always an equality sign under the null hypothesis. So, for example, β < 0.5
would not be specified under the null hypothesis.
There are two ways to conduct a hypothesis test: via the test of significance approach or via the
confidence interval approach. Both methods centre on a statistical comparison of the estimated
value of the coefficient, and its value under the null hypothesis. In very general terms, if the
estimated value is a long way away from the hypothesised value, the null hypothesis is likely to
be rejected; if the value under the null hypothesis and the estimated value are close to one
another, the null hypothesis is less likely to be rejected. For example, consider β̂ = 0.6091 as
above. A null hypothesis that the true value of β is 5 (H0: β = 5) is more likely to be rejected
than a null hypothesis that the true value of β is 0.5 (H0: β = 0.5). What is required now is a
statistical decision rule that will permit the formal testing of such hypotheses.
The sampling distributions of the estimators can be derived by noting that β̂ is a weighted sum of
the observations on Y:

β̂ = Σ(X − X̄)(Y − Ȳ) / Σ(X − X̄)²
  = [Σ(X − X̄)Y − Σ(X − X̄)Ȳ] / Σ(X − X̄)²
  = Σ(X − X̄)Y / Σ(X − X̄)²   (since Σ(X − X̄) = 0)
  = Σ W_t Y_t

where W_t = (X_t − X̄) / Σ(X_t − X̄)² are the weights. Since a weighted sum of normal random variables
is also normally distributed, it can be said that the coefficient estimates will also be normally
distributed. Thus:

α̂ ~ N(α, var(α̂))   and   β̂ ~ N(β, var(β̂))

where the standard errors (the square roots of the variances) are:

se(α̂) = σ √( ΣX_t² / (T[ΣX_t² − T X̄²]) ) . . . (13)

se(β̂) = σ √( 1 / (ΣX_t² − T X̄²) ) . . . (14)

Thus, inferences about the true regression coefficients α and β can be made based on the normal
distribution (or similar distributions). Note that relations (13) and (14) involve the unknown
residual standard deviation σ.
Will the coefficient estimates still follow a normal distribution if the errors do not follow a
normal distribution? The answer is yes provided that the sample size is sufficiently large. This
is due to the central limit theorem (CLT). The normal distribution is plotted below.
Standardising each estimator by subtracting its mean and dividing by its standard error gives:

(α̂ − α)/se(α̂) ~ N(0, 1)   and   (β̂ − β)/se(β̂) ~ N(0, 1)
The square roots of the coefficient variances are the standard errors given by relations
(13) and (14) above. Unfortunately, the true standard errors of the regression coefficients are
never known (since the true residual standard deviation σ is unknown); all that is available are
their sample counterparts, the calculated standard errors of the coefficient estimates, se(α̂) and
se(β̂), defined by relations (10) and (11), respectively. Replacing the true values of the standard
errors with the sample estimated versions induces another source of uncertainty, and also means
that the standardised statistics are no longer normally distributed. They instead follow another
distribution, namely Student's t-distribution with (T − 2) degrees of freedom. That is:

(α̂ − α)/se(α̂) ~ t(T − 2)   and   (β̂ − β)/se(β̂) ~ t(T − 2)
What does the t-distribution look like? It looks similar to a normal distribution, but with fatter tails and
a smaller peak at the mean (shown in Figure 5 below). In addition to the two parameters (mean
and variance), the t-distribution has another parameter, its degrees of freedom.
Upper-tail critical values from the standard normal and t-distributions:

Significance level   N(0,1)   t(5)    t(40)
5%                    1.645   2.015   1.684
2.5%                  1.960   2.571   2.021
1%                    2.330   3.365   2.423
There are broadly two approaches to testing hypotheses under regression analysis: the test of
significance approach and the confidence interval approach. Each of these will now be
considered in turn.
a) The test of significance approach
Assume that the regression equation is given by:
Y_t = α + βX_t + u_t ,   t = 1, 2, . . . , T.
1. Estimate α̂ and β̂, and their standard errors se(α̂) and se(β̂), in the usual way.
2. Calculate the test statistic. If the null hypothesis is H0: β = β* and the alternative hypothesis
is H1: β ≠ β* (for a two-sided test), the test statistic is given by:

t = (β̂ − β*) / se(β̂) . . . (15)
3. A tabulated distribution with which to compare the estimated test statistics is required. Test
statistics derived in this way can be shown to follow the t-distribution with (T 2) degrees of
freedom.
4. Choose a significance level, often denoted by α (note that this is not the same as the regression
intercept coefficient). It is conventional to use a significance level of 5% or 1%.
5. Given a significance level, a rejection region and non-rejection region can be determined. If a
5% significance level is employed, this means that 5% of the total distribution (5% of the
area under the curve) will be in the rejection region. That rejection region can either be split
in half (for a two-sided test) or it can all fall on one side of the y-axis, as is the case for a
one-sided test. For a two-sided test, the 5% rejection region is split equally between the two tails,
as shown in figure 6(a). For a one-sided test, the 5% rejection region is located solely in one
tail of the distribution, as shown in figures 6(b) and 6(c), for a test where the alternative is of
the 'less than' form, and where the alternative is of the 'greater than' form, respectively.
H0: β = β*   vs   H1: β ≠ β*
Figure 6(a): Rejection regions for a two-sided 5% test of hypothesis

H0: β = β*   vs   H1: β < β*
Figure 6(b): Rejection region for a one-sided (left) 5% test of hypothesis

H0: β = β*   vs   H1: β > β*
Figure 6(c): Rejection region for a one-sided (right) 5% test of hypothesis
6. Use the t-tables to obtain a critical value or values with which to compare the test statistic.
The critical value will be that value of x that puts 5% into the rejection region. In figures 6(a)-6(c),
c1, c2, c3 and c4 denote such critical values.
7. Finally perform the test. If the test statistic lies in the rejection region, then reject the null
hypothesis ( H 0 ), else do not reject H 0 .
Steps 2-7 require further comment. In step 2, the estimated value of β̂ is compared with the
value that is subject to test under the null hypothesis, but this difference is normalised or scaled
by the standard error of the coefficient estimate. The standard error is a measure of how
confident one is in the coefficient estimate obtained in the first stage. If a standard error is small,
the value of the test statistic will be large relative to the case where the standard error is large.
For a small standard error, it would not require the estimated and hypothesised values to be far
away from one another for the null hypothesis to be rejected.
In this context, the number of degrees of freedom can be interpreted as the number of pieces of
additional information beyond the minimum requirement. If two parameters are estimated (α and
β, the intercept and the slope of the line, respectively), a minimum of two observations is
required to fit this line to the data. As the number of degrees of freedom increases, the critical
values in the tables decrease in absolute terms, since less caution is required and one can be more
confident that the results are appropriate.
The significance level is also sometimes called the size of the test (note that this is completely
different from the size of the sample) and it determines the region where the null hypothesis
under test will be rejected or not rejected. Remember that the distributions in figures 6(a)-6(c)
are for a random variable. Purely by chance, a random variable will take on extreme values
(either large and positive values or large and negative values) occasionally. More specifically, a
significance level of 5% means that a result as extreme as this or more extreme would be
expected only 5% of the time as a consequence of chance alone. To give one illustration, if the
5% critical value for a one-sided test is 1.68, this implies that the test statistic would be expected
to be greater than this only 5% of the time by chance alone. There is nothing magical about the
test -- all that is done is to specify an arbitrary cut-off value for the test statistic that determines
whether the null hypothesis would be rejected or not. It is conventional to use a 5% size of test,
but 10% and 1% are also commonly used.
However, one potential problem with the use of a fixed (e.g. 5%) size of test is that if the sample
size is sufficiently large, any null hypothesis can be rejected. This is particularly worrisome in
finance, where tens of thousands of observations or more are often available. What happens is
that the standard errors reduce as the sample size increases, thus leading to an increase in the
value of all t-test statistics. This problem is frequently overlooked in empirical work, but some
econometricians have suggested that a lower size of test (e.g. 1%) should be used for large
samples.
Note also the use of terminology in connection with hypothesis tests: it is said that the null
hypothesis is either rejected or not rejected. It is incorrect to state that if the null hypothesis is not
rejected, it is accepted (although this error is frequently made in practice), and it is never said
that the alternative hypothesis is accepted or rejected. One reason why it is not sensible to say
that the null hypothesis is accepted is that it is impossible to know whether the null is actually
true or not! In any given situation, many null hypotheses will not be rejected. For example,
suppose that H0: β = 0.5 and H0: β = 1 are separately tested against the relevant two-sided
alternatives and neither null is rejected. Clearly then it would not make sense to say that H0: β
= 0.5 is accepted and H0: β = 1 is accepted, since the true (but unknown) value of β cannot be
both 0.5 and 1. So, to summarise, the null hypothesis is either rejected or not rejected on the
basis of the available evidence.
Example 4: Consider the data in Example 1 (on the excess returns of a given asset (Y) together
with the excess returns on a market portfolio (X) from January 2009 to December 2010, recorded
on a monthly basis). The regression result was:

Ŷ_t = −0.887 + 1.344 X_t
     (0.551918)  (0.090281)

where the figures in parentheses are the standard error estimates. Using both the test of
significance and confidence interval approaches, test the hypothesis that β = 0 against a two-sided
alternative at the 5% level of significance.
Solution
The test of significance approach
The null and alternative hypotheses are:

H0: β = 0
H1: β ≠ 0

The test statistic is calculated as:

t = (1.344 − 0) / 0.090281 = 14.887
Now we need the critical value from the t-distribution with (T-2) = (24-2) = 22 degrees of
freedom and at the 5% level. This means that 5% of the total distribution will be in the rejection
region, and since this is a two-sided test, 2.5% of the distribution is required to be contained in
each tail. From the t-distribution table we have:
t_{α/2}(T − 2) = t_{0.025}(22) = 2.074
From the symmetry of the t-distribution around zero, the critical values in the upper and lower
tail will be equal in magnitude, but opposite in sign. Thus, the rejection (critical) regions are as
shown below:
Decision: Reject H0: β = 0 since the test statistic (14.887) lies within the rejection region. Thus, the
CAPM beta is significantly different from zero. The implication is that movements in the given asset
(Y) are significantly related to movements in the market (X).
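The arithmetic of the test of significance approach in Example 4 can be sketched as follows; this is an illustrative snippet, with the critical value 2.074 read from the t-table with 22 degrees of freedom, as in the text.

```python
def t_test(coef, hypothesised, se, critical):
    """Two-sided test of significance: compare |t| with the tabulated critical value."""
    t_stat = (coef - hypothesised) / se
    reject = abs(t_stat) > critical
    return t_stat, reject

# Example 4: H0: beta = 0 vs H1: beta != 0 at the 5% level, T - 2 = 22 d.f.
t_stat, reject = t_test(coef=1.344, hypothesised=0.0, se=0.090281, critical=2.074)
print(round(t_stat, 3), reject)  # -> 14.887 True
```

The same function could be reused to test, say, H0: β = 1 by setting `hypothesised=1.0`.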
Some more terminology
If the null hypothesis is rejected at the 5% level, it would be said that the result of the test is
statistically significant. If the null hypothesis is not rejected, it would be said that the result of
the test is not significant, or that it is insignificant. Finally, if the null hypothesis is rejected at
the 1% level, the result is termed highly statistically significant.
The possible outcomes of a hypothesis test can be summarised as follows:

                       H0 is true          H0 is false
Significant
(Reject H0)            Type I error        correct decision
Not significant
(Do not reject H0)     correct decision    Type II error
The probability of a type I error is just α, the significance level or size of the test chosen. To see
this, recall what is meant by significance at the 5% level: it is only 5% likely that a result as or
more extreme as this could have occurred purely by chance. Or, to put this in another way, it is
only 5% likely that this null would be rejected when it was in fact true.
Note that there is no chance for a free lunch (i.e. a cost-less gain) here! What happens if the size
of the test is reduced (e.g. from a 5% test to a 1% test)? The chances of making a type I error
would be reduced but so would the probability that the null hypothesis would be rejected at all,
so increasing the probability of a type II error. So there always exists a direct trade-off between
type I and type II errors when choosing a significance level. The only way to reduce the chances
of both is to increase the sample size or to select a sample with more variation, thus increasing
the amount of information upon which the results of the hypothesis test are based. In practice, up
to a certain level, type I errors are usually considered more serious and hence a small size of test
is usually chosen (5% or 1% are the most common).
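The claim that the size of the test equals the type I error probability can be illustrated with a small simulation (not from the reader): repeatedly draw samples for which the null hypothesis is true, run the t-test of H0: β = 1 at the 5% level, and count how often the true null is wrongly rejected. Roughly 5% of the samples should lead to rejection.

```python
import random
import math

random.seed(42)  # reproducible illustration

T = 24            # sample size, as in Example 1
CRITICAL = 2.074  # two-sided 5% critical value from t(22)
X = [x / 2 for x in range(T)]  # fixed regressor values

def fit_and_t(y, x, beta_null):
    """OLS fit, then the t-statistic for H0: beta = beta_null."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    beta = sum((xi - xbar) * yi for xi, yi in zip(x, y)) / sxx
    alpha = ybar - beta * xbar
    rss = sum((yi - alpha - beta * xi) ** 2 for xi, yi in zip(x, y))
    se_beta = math.sqrt(rss / (n - 2)) * math.sqrt(1 / sxx)
    return (beta - beta_null) / se_beta

# Data generated with true beta = 1, so H0: beta = 1 is true in every sample.
rejections = 0
reps = 2000
for _ in range(reps):
    y = [2.0 + 1.0 * xi + random.gauss(0, 1) for xi in X]
    if abs(fit_and_t(y, X, beta_null=1.0)) > CRITICAL:
        rejections += 1

print(rejections / reps)  # should be close to 0.05
```

Raising the critical value (a 1% test) would lower this rejection rate, but would also make it harder to reject false nulls, which is exactly the type I/type II trade-off described above.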
The steps involved in building an empirical model can be summarised as follows:
1. general statement of the problem: This will usually involve the formulation of a theoretical
model, based on a certain theory, that two or more variables should be related to one another
in a certain way. The model is unlikely to be able to completely capture every relevant
real-world phenomenon, but it should present a sufficiently good approximation that it is useful
for the purpose at hand.
2. collection of data relevant to the model: The data required may be available electronically
from a data provider or from published government figures. Alternatively, the required data
may be available only via a survey after distributing a set of questionnaires, i.e. primary data.
3. choice of estimation method relevant to the model proposed: For example, is a single
equation or multiple equation technique to be used?
4. statistical evaluation of the model: What assumptions were required to estimate the
parameters of the model optimally? Were these assumptions satisfied by the data or the
model? Also, does the model adequately describe the data? If the answer is yes, proceed to
step 5; if not, go back to steps 1--3 and either reformulate the model, collect more data, or
select a different estimation technique that has less stringent requirements.
5. evaluation of the model from a theoretical perspective: Are the parameter estimates of the
sizes and signs that the theory or intuition from step 1 suggested? If the answer is yes,
proceed to step 6; if not, again return to stages 1--3.
6. use of model: When a researcher is finally satisfied with the model, it can then be used for
testing the theory specified in step 1, or for formulating forecasts or suggested courses of
action. This suggested course of action might be for an individual, or as an input to
government policy.
It is important to note that the process of building a robust empirical model is an iterative one,
and it is certainly not an exact science. Often, the final preferred model could be very different
from the one originally proposed, and need not be unique in the sense that another researcher
with the same data and the same initial theory could arrive at a different final specification.
SPSS Application
Unit Nine:
The Multiple linear regression model and Statistical Inference
9.1 Introduction
So far we have seen the basic statistical tools and procedures for analyzing relationships between
two variables. But in practice, economic models generally contain one dependent variable and
two or more independent variables. Such models are called multiple regression models.
Example 1:
a) In demand studies we study the relationship between the demand for a good (Y) and price of
the good (X1), prices of substitute goods (X2) and the consumer's income (X3). Here, Y is
the dependent variable and X1 , X 2 and X 3 are the explanatory (independent) variables. The
relationship is estimated by a multiple linear regression equation (model) of the form:
Ŷ = β̂0 + β̂1X1 + β̂2X2 + β̂3X3

where β̂0, β̂1, β̂2 and β̂3 are estimated regression coefficients.
b) In a study of the amount of output (product), we are interested to establish a relationship
between output (Q), labour input (L) and capital input (K). The equations are often
estimated in log-linear form as:

log(Q) = β0 + β1 log(L) + β2 log(K)
c) In a study of the determinants of the number of children born per woman (Y), the possible
explanatory variables include years of schooling of the woman (X1), the woman's (or husband's)
earnings at marriage (X2), age of the woman at marriage (X3) and survival probability of
children at age five (X4). The relationship can thus be expressed as:

Y = β0 + β1X1 + β2X2 + β3X3 + β4X4
In deviation form (yi = Yi − Ȳ, x1i = X1i − X̄1, x2i = X2i − X̄2), the OLS estimators of the
coefficients of the model Y = β0 + β1X1 + β2X2 + u are given by:

β̂1 = [ Σx1iyi Σx2i² − Σx2iyi Σx1ix2i ] / [ Σx1i² Σx2i² − (Σx1ix2i)² ]

β̂2 = [ Σx2iyi Σx1i² − Σx1iyi Σx1ix2i ] / [ Σx1i² Σx2i² − (Σx1ix2i)² ]

β̂0 = Ȳ − β̂1X̄1 − β̂2X̄2

where Ȳ, X̄1 and X̄2 are the mean values of the variables Y, X1 and X2, respectively.
An estimator of the variance of the errors σ² is given by:

σ̂² = Σ_{i=1}^{n} ûi² / (n − 3) = Σ_{i=1}^{n} (Yi − Ŷi)² / (n − 3)

where Ŷi = β̂0 + β̂1X1i + β̂2X2i.
The standard errors of the estimated regression coefficients β̂1 and β̂2 are estimated as:

se(β̂1) = √( σ̂² / [(1 − r12²) Σx1i²] )   and   se(β̂2) = √( σ̂² / [(1 − r12²) Σx2i²] )

where r12 = Σx1ix2i / √( Σx1i² Σx2i² ) is the sample correlation coefficient between X1 and X2.
Example 2: Consider the following data on per capita food consumption (Y), price of food (X1)
and per capita income (X2) for the years 1927-1941 in the United States. The retail price of food
and per capita disposable income are deflated by the Consumer Price Index.
Year    Y      X1     X2       Year    Y      X1     X2
1927   88.9   91.7   57.7      1935   85.4   88.1   52.1
1928   88.9   92.0   59.3      1936   88.5   88.0   58.0
1929   89.1   93.1   62.0      1937   88.4   88.4   59.8
1930   88.7   90.9   56.3      1938   88.6   83.5   55.9
1931   88.0   82.3   52.7      1939   91.7   82.4   60.3
1932   85.9   76.3   44.4      1940   93.3   83.0   64.1
1933   86.0   78.3   43.8      1941   95.1   86.2   73.7
1934   87.1   84.3   47.8
The original variables and their deviations from the means are shown below:

Year     Y       X1      X2        y        x1       x2
1927    88.9    91.7    57.7    -0.007    5.800    1.173
1928    88.9    92.0    59.3    -0.007    6.100    2.773
1929    89.1    93.1    62.0     0.193    7.200    5.473
1930    88.7    90.9    56.3    -0.207    5.000   -0.227
1931    88.0    82.3    52.7    -0.907   -3.600   -3.827
1932    85.9    76.3    44.4    -3.007   -9.600  -12.127
1933    86.0    78.3    43.8    -2.907   -7.600  -12.727
1934    87.1    84.3    47.8    -1.807   -1.600   -8.727
1935    85.4    88.1    52.1    -3.507    2.200   -4.427
1936    88.5    88.0    58.0    -0.407    2.100    1.473
1937    88.4    88.4    59.8    -0.507    2.500    3.273
1938    88.6    83.5    55.9    -0.307   -2.400   -0.627
1939    91.7    82.4    60.3     2.793   -3.500    3.773
1940    93.3    83.0    64.1     4.393   -2.900    7.573
1941    95.1    86.2    73.7     6.193    0.300   17.173
Total 1333.6  1288.5   847.9
Mean  88.90667  85.90  56.52667
The necessary calculations using the transformed variables are shown below:

    y       x1       x2       x1y      x2y      x1x2      x1²       x2²        y²
 -0.007    5.800    1.173   -0.039   -0.008    6.805    33.640     1.377   4.45E-05
 -0.007    6.100    2.773   -0.041   -0.018   16.917    37.210     7.691   4.45E-05
  0.193    7.200    5.473    1.392    1.058   39.408    51.840    29.957   0.037
 -0.207    5.000   -0.227   -1.033    0.047   -1.133    25.000     0.051   0.043
 -0.907   -3.600   -3.827    3.264    3.470   13.776    12.960    14.643   0.822
 -3.007   -9.600  -12.127   28.864   36.461  116.416    92.160   147.056   9.040
 -2.907   -7.600  -12.727   22.091   36.992   96.723    57.760   161.968   8.449
 -1.807   -1.600   -8.727    2.891   15.766   13.963     2.560    76.155   3.264
 -3.507    2.200   -4.427   -7.715   15.523   -9.739     4.840    19.595  12.297
 -0.407    2.100    1.473   -0.854   -0.599    3.094     4.410     2.171   0.165
 -0.507    2.500    3.273   -1.267   -1.658    8.183     6.250    10.715   0.257
 -0.307   -2.400   -0.627    0.736    0.192    1.504     5.760     0.393   0.094
  2.793   -3.500    3.773   -9.777   10.540  -13.207    12.250    14.238   7.803
  4.393   -2.900    7.573  -12.741   33.272  -21.963     8.410    57.355  19.301
  6.193    0.300   17.173    1.858   106.36    5.152     0.090   294.923  38.357
TOTAL                        27.630  257.397  275.900   355.14   838.289  99.929
Summary statistics:

Σx1i² = 355.14,  Σx2i² = 838.289,  Σx1ix2i = 275.900,  Σx1iyi = 27.630,  Σx2iyi = 257.397

β̂1 = [ Σx1iyi Σx2i² − Σx2iyi Σx1ix2i ] / [ Σx1i² Σx2i² − (Σx1ix2i)² ] = −0.21596

β̂2 = [ Σx2iyi Σx1i² − Σx1iyi Σx1ix2i ] / [ Σx1i² Σx2i² − (Σx1ix2i)² ] = 0.378127

β̂0 = Ȳ − β̂1X̄1 − β̂2X̄2 = 88.90667 − (−0.21596)(85.9) − (0.378127)(56.52667) = 86.08318
Thus, the fitted regression equation is:

Ŷi = 86.08318 − 0.21596 X1i + 0.378127 X2i

The error sum of squares (ESS) is Σ ûi² = 8.567271, so the estimated error variance is:

σ̂² = Σ ûi² / (n − 3) = 8.567271 / (15 − 3) = 0.713939225

The sample correlation between X1 and X2 is:

r12 = Σx1ix2i / √( Σx1i² Σx2i² ) = 275.9 / √( (355.14)(838.289) ) = 0.505655733

The standard errors of the estimated regression coefficients β̂1 and β̂2 are estimated as:

se(β̂1) = √( σ̂² / [(1 − r12²) Σx1i²] ) = √( 0.713939225 / ([1 − (0.505655733)²][355.14]) ) = 0.05197

se(β̂2) = √( σ̂² / [(1 − r12²) Σx2i²] ) = √( 0.713939225 / ([1 − (0.505655733)²][838.289]) ) = 0.033826
The total sum of squares (TSS) is a measure of the dispersion of the observed values of Y about
their mean. This is computed as:

TSS = Σ_{i=1}^{n} (Yi − Ȳ)² = Σ_{i=1}^{n} Yi² − nȲ² = Σ_{i=1}^{n} yi²

The regression (explained) sum of squares (RSS) measures the amount of the total
variability in the observed values of Y that is accounted for by the linear relationship between
the observed values of X and Y. This is computed as:

RSS = Σ_{i=1}^{n} (Ŷi − Ȳ)²

The error (residual or unexplained) sum of squares (ESS) is a measure of the dispersion of
the observed values of Y about the regression line. This is computed as:

ESS = Σ_{i=1}^{n} (Yi − Ŷi)²

Note: It can be shown that the total sum of squares is the sum of the regression sum of squares
and the error sum of squares; i.e., TSS = RSS + ESS.
If a regression equation does a good job of describing the relationship between the dependent
variable and the independent variables, the regression (explained) sum of squares (RSS) should
constitute a large proportion of the total sum of squares (TSS). Thus, it would be of interest to
determine the magnitude of this proportion by computing the ratio of the explained sum of
squares to the total sum of squares. This proportion is called the sample coefficient of
determination, R 2 . That is:
2
Coefficient of determination = R
RSS
ESS
1
TSS
TSS
R² measures the proportion of variation in the dependent variable that is explained by the independent variables (or by the linear regression model). It is a goodness-of-fit statistic. The proportion of total variation in the dependent variable that is accounted for by factors other than X (for example, due to excluded variables, chance, etc.) is equal to (1 − R²) × 100%.
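As a numerical sketch (using TSS = 99.929 and ESS = 8.567271 from the food-consumption example that follows):

```python
# R^2 computed as the complement of the unexplained share of variation.
tss = 99.929     # total sum of squares (food-consumption example)
ess = 8.567271   # error (residual) sum of squares
r_squared = 1 - ess / tss

print(round(r_squared, 3))              # 0.914
print(round((1 - r_squared) * 100, 1))  # 8.6 (% left unexplained)
```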
Example 3: Consider the data on per capita food consumption (Y), price of food (X1) and per capita income (X2). Calculate the coefficient of determination and interpret.
Solution
The variation in the dependent variable Y (food consumption) can be decomposed into:
Total sum of squares: TSS = Σ(Yi − Ȳ)² = Σyi² = 99.929
Error sum of squares: ESS = Σ(Yi − Ŷi)² = Σêi² = 8.567271
Regression sum of squares: RSS = TSS − ESS = 99.929 − 8.567271 = 91.361729
Thus:
R² = RSS/TSS = 91.361729/99.929 = 0.914
R² = 0.914 indicates that 91.4% of the variation (change) in food consumption is attributed to the effect of food price and/or consumer income.
1 − R² = 0.086. This indicates that 8.6% of the variation in food consumption is due to factors (variables) not included in our specification (such as habit persistence, geographical and time variation, etc.).
H0: β1 = β2 = . . . = βk = 0
HA: H0 is not true
The null hypothesis (H0) states that all regression coefficients are insignificant (none of them explains the dependent variable). Not rejecting H0 would mean that the model is inadequate and is useless for prediction or inferential purposes.
The above test is accomplished by means of analysis of variance (ANOVA), which enables us to test the significance of R². The ANOVA table for the multiple linear regression model is given below:

ANOVA table for multiple linear regression

Source of variation   Sum of squares   Degrees of freedom   Mean square
Regression            RSS              k − 1                RSS/(k − 1)
Residual              ESS              n − k                ESS/(n − k)
Total                 TSS              n − 1

Variance ratio: Fcal = [RSS/(k − 1)] / [ESS/(n − k)]
Here k is the number of parameters estimated from the sample data and n is the sample size. To test for the significance of R², we compare the variance ratio with Fα(k − 1, n − k), the critical value from the F distribution with (k − 1) and (n − k) degrees of freedom in the numerator and denominator, respectively, for a given significance level α.
Decision rule: Reject H0 if:
Fcal = [RSS/(k − 1)] / [ESS/(n − k)] > Fα(k − 1, n − k)
If H0 is rejected, we then conclude that R² is significant (or that the fitted model is adequate and is useful for prediction purposes).
Note:
As the number of explanatory (independent) variables increases, R² always increases. This implies that the goodness-of-fit of an estimated model depends on the number of independent (explanatory) variables regardless of whether they are important or not. To eliminate this dependency, we calculate the adjusted R² (denoted by R̄²) as:
R̄² = 1 − (1 − R²) · (n − 1)/(n − K)
Unlike R², R̄² may increase or decrease when new variables are added into the model.
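For the food-consumption example (n = 15, K = 3 estimated parameters), the adjusted R² can be sketched as:

```python
# Adjusted R^2 penalizes R^2 for the number of estimated parameters.
n, K = 15, 3                    # sample size, parameters estimated
r2 = 1 - 8.567271 / 99.929      # R^2 = 1 - ESS/TSS from the example
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - K)

print(f"{adj_r2:.3f}")  # 0.900, matching the SPSS "Adjusted R Square"
```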
Example 4: Consider the multiple regression model of per capita food consumption (Y) on price of food (X1) and per capita income (X2) given by:
Yi = β0 + β1·X1i + β2·X2i + εi
The fitted multiple regression model from the sample data was:
Ŷi = 86.08318 − 0.21596·X1i + 0.378127·X2i
The ANOVA table computed from the sample data is:
Source of variation   Sum of squares   Degrees of freedom   Mean square
Regression            91.362           3 − 1 = 2            45.681
Residual              8.567            15 − 3 = 12          0.714
Total                 99.929           15 − 1 = 14

Variance ratio: Fcal = 45.681/0.714 = 63.98448
The critical values from the F distribution are F0.05(2, 12) = 3.89 and F0.01(2, 12) = 6.93. Since the test statistic is greater than both tabulated values, the ratio is significant at the conventional levels of significance (1% and 5%). Thus, we reject the null hypothesis and conclude that the model is adequate, that is, variation (change) in per capita food consumption is significantly attributed to the effect of food price and/or per capita disposable income.
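The variance ratio can be reproduced directly from the sums of squares (a sketch; the F-table critical values for (2, 12) degrees of freedom are quoted from standard tables and should be verified against your own table):

```python
# Overall F test of model adequacy from the ANOVA quantities.
rss, ess = 91.362, 8.567271   # regression and error sums of squares
k, n = 3, 15                  # parameters estimated, sample size

f_cal = (rss / (k - 1)) / (ess / (n - k))

# Standard F-table critical values for (2, 12) degrees of freedom
f_05, f_01 = 3.89, 6.93

print(round(f_cal, 2))             # 63.98
print(f_cal > f_05, f_cal > f_01)  # True True: reject H0 at 5% and 1%
```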
11. Tests on the regression coefficients
Once we come up with the conclusion that the model is adequate, the next step would be to test
for the significance of each of the coefficients in the model. To test whether each of the
coefficients is significant or not, the null and alternative hypotheses are given by:
H0: βj = 0
HA: βj ≠ 0
for j = 1, 2.
The test statistic is the ratio of each estimated regression coefficient to its estimated standard error, that is,
tj = β̂j / s.e.(β̂j),  j = 1, 2, . . ., k
Decision rule:
Reject H0 if |tj| > tα/2(n − k), the critical value from the Student's t-distribution.
We have already calculated the standard errors of the estimated regression coefficients β̂1 and β̂2 as:
s.e.(β̂1) = 0.05197,  s.e.(β̂2) = 0.033826.
a) Does food price significantly affect per capita food consumption?
The hypothesis to be tested is:
H0: β1 = 0
HA: β1 ≠ 0
The test statistic is calculated as:
t1 = β̂1 / s.e.(β̂1) = −0.21596/0.05197 = −4.155
For significance level α = 0.01 and degrees of freedom (n − 3) = (15 − 3) = 12, the critical value from the Student's t-distribution is:
tα/2(n − 3) = t0.005(12) = 3.055
Decision: Since |t1| = 4.155 > 3.055, we reject the null hypothesis and conclude that food price significantly affects per capita food consumption at the 1% level of significance.
b) Does disposable income significantly affect per capita food consumption?
The hypothesis to be tested is:
H0: β2 = 0
HA: β2 ≠ 0
The test statistic is calculated as:
t2 = β̂2 / s.e.(β̂2) = 0.378127/0.033826 = 11.179
Decision: Since |t2| = 11.179 > 3.055, we reject the null hypothesis and conclude that disposable income significantly affects per capita food consumption at the 1% level of significance.
Food price significantly and negatively affects per capita food consumption, while disposable
income significantly and positively affects per capita food consumption.
The estimated coefficient of food price is -0.21596. Holding disposable income constant, a
one dollar increase in food price results in a 0.216 dollar decrease in per capita food
consumption.
The estimated coefficient of disposable income is 0.378127. Holding food price constant, a one dollar increase in disposable income results in a 0.378 dollar increase in per capita food consumption.
Statistical software packages (such as SPSS) are readily available for such analysis, and thus one does not need to go through the details of the calculations involved. The SPSS output for the above data is shown below.
Model Summary

Model   R      R Square   Adjusted R Square   Std. Error of the Estimate
1       .956   .914       .900                .84495
R 2 = 0.914 indicates that 91.4% of the variation (change) in food consumption is attributed to
the effect of food price and/or consumer income.
ANOVA

Model 1      Sum of Squares   df   Mean Square   F        Sig.
Regression   91.362           2    45.681        63.984   .000
Residual     8.567            12   .714
Total        99.929           14
Example:
If p-value = 0.002, then we can reject the null hypothesis for all values of α greater than 0.002 (such as α = 0.01, α = 0.05).
If p-value = 0.03, then we can reject the null hypothesis for all values of α greater than 0.03 (such as α = 0.05). However, we cannot reject the null hypothesis at α = 0.01.
If p-value = 0.07, then we cannot reject the null hypothesis at either the 5% or the 1% level of significance.
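The decision rule behind these three cases is simply "reject when the p-value falls below the chosen significance level", which can be sketched as:

```python
def reject(p_value, alpha):
    """Reject H0 when the p-value is below the significance level alpha."""
    return p_value < alpha

# The three cases from the text:
print(reject(0.002, 0.01), reject(0.002, 0.05))  # True True
print(reject(0.03, 0.05), reject(0.03, 0.01))    # True False
print(reject(0.07, 0.05), reject(0.07, 0.01))    # False False
```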
Note: In SPSS output, the p-values are displayed in the column labeled Sig.
In this particular example, Sig. = 0.000, which is less than 0.01 or 0.05. Thus, we can conclude that the model is adequate at the 1% level of significance. That is, there is a significant linear relationship between food consumption and food price and/or consumer income. This means that, based on food price and consumer income, we can make valid inferences about per capita food consumption at the 99% level of confidence.
Coefficients

Model 1      B        Std. Error   Beta     t        Sig.
(Constant)   86.083   3.873                 22.226   .000
price        -.216    .052         -.407    -4.155   .001
income       .378     .034         1.095    11.178   .000

(B and Std. Error are the unstandardized coefficients; Beta is the standardized coefficient.)
As can be seen from the table above, the p-values for price and income are both less than 0.01.
Thus, we can conclude that both variables significantly affect consumption at the 1% level of
significance. From the signs of the estimated regression coefficients we can see that the direction
of influence is opposite: price affects consumption negatively while income affects consumption
positively. The constant term (intercept) is also significant.
Note: In general, if Sig. > 0.05, we doubt the importance of the variable!
Multicollinearity
Introduction
In the construction of an econometric model, it may happen that two or more variables giving rise to the same piece of information are included, that is, we may have redundant information or unnecessarily included related variables. This is what we call a multicollinearity (MC) problem. MC is common in macroeconomic time-series data (such as GNP, money supply, income, etc.) since economic variables tend to move together over time.
We have seen earlier that the variances of β̂2 and β̂3 are estimated by:
V(β̂2) = σ̂² / [(1 − r23²)·Σx2i²]  and  V(β̂3) = σ̂² / [(1 − r23²)·Σx3i²]
When X2 and X3 are highly correlated (r23 close to ±1), both V(β̂2) and V(β̂3) become very large (or will be inflated).
The test statistic for the significance of each coefficient is:
tj = β̂j / s.e.(β̂j),  j = 2, 3  where  s.e.(β̂j) = √V(β̂j)
Thus, under a high degree of MC, the standard errors will be inflated and the test statistics will be very small numbers. This often leads to incorrectly accepting (not rejecting) the null hypothesis when in fact the parameter is significantly different from zero!
Major implications of a high degree of MC
Data on imports (Y), GDP (X2), stock formation (X3) and consumption (X4) for the years 1949-1967:

Year   Y      X2      X3    X4        Year   Y      X2      X3    X4
1949   15.9   149.3   4.2   108.1     1959   26.3   239.0   0.7   167.6
1950   16.4   161.2   4.1   114.8     1960   31.1   258.0   5.6   176.8
1951   19.0   171.5   3.1   123.2     1961   33.3   269.8   3.9   186.6
1952   19.1   175.5   3.1   126.9     1962   37.0   288.4   3.1   199.7
1953   18.8   180.8   1.1   132.1     1963   43.3   304.5   4.6   213.9
1954   20.4   190.7   2.2   137.7     1964   49.0   323.4   7.0   223.8
1955   22.7   202.1   2.1   146.0     1965   50.3   336.8   1.2   232.0
1956   26.5   212.4   5.6   154.1     1966   56.6   353.9   4.5   242.9
1957   28.1   226.1   5.0   162.3     1967   59.9   369.7   5.0   252.0
1958   27.6   231.9   5.1   164.3
The regression of imports (Y) on GDP, stock formation and consumption gives:

                  Coefficient   Standard error   t-ratio
Constant          -19.982       4.372            -4.570
GDP               0.100         0.194            0.515
Stock formation   0.447         0.341            1.309
Consumption       0.149         0.297            0.501
The overall F test of this regression is significant at the conventional levels of significance. Thus, the linear regression model is adequate. However, all of the estimated regression coefficients (save the constant term) are insignificant at the conventional levels of significance. This is an indication that the standard errors are inflated due to MC. Since an increase in GDP is often associated with an increase in consumption, the two variables tend to grow together over time, leading to MC. The coefficient of correlation between GDP and consumption is 0.999. Thus, it seems that the problem of MC is due to the joint appearance of these two variables.
Methods of detection of MC
Multicollinearity almost always exists in applied work. So the question is not whether it is present; it is a question of degree. Also, MC is not a statistical problem; it is a data (sample) problem. Therefore, we do not test for MC; rather, we measure its degree in any particular sample (using some rules of thumb).
The variance inflation factor (VIF) of the estimated coefficient β̂j is defined as:
VIF(β̂j) = 1 / (1 − Rj²),  j = 2, 3, . . ., K
where Rj² is the coefficient of determination obtained when the variable Xj is regressed on the remaining explanatory variables (called the auxiliary regression). For example, the VIF of β̂2 is defined as:
VIF(β̂2) = 1 / (1 − R2²)
where R2² is the coefficient of determination of the auxiliary regression:
X2i = α1 + α3·X3i + α4·X4i + . . . + αK·XKi + ui
Rule of thumb:
a) If VIF(β̂j) exceeds 10, then β̂j is poorly estimated because of MC (or the jth regressor variable (Xj) is responsible for MC).
b) (Klein's rule) MC is troublesome if any of the Rj² exceeds the overall R² (the coefficient of determination of the original regression equation).
Example: Consider the data on imports (Y), GDP (X2), stock formation (X3) and consumption (X4) for the years 1949-1967. The coefficient of determination of the auxiliary regression of GDP (X2) on stock formation (X3) and consumption (X4):
X2i = α1 + α3·X3i + α4·X4i + ui
is (using SPSS) R2² = 0.998203. The VIF of β̂2 is thus:
VIF(β̂2) = 1 / (1 − R2²) = 1 / (1 − 0.998203) = 556.5799
Since this figure far exceeds 10, we can conclude that the coefficient of GDP is poorly estimated because of MC (or that GDP is responsible for MC). It can also be shown that VIF(β̂4) = 555.898, indicating that consumption is also responsible for MC.
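The VIF can be reproduced from the data table above in pure Python by running the auxiliary regression in closed form (a sketch; the exact figure may differ slightly from the SPSS value because the published data are rounded):

```python
# Auxiliary regression of GDP (X2) on stock formation (X3) and
# consumption (X4), using the 19 yearly observations from the table.
x2 = [149.3, 161.2, 171.5, 175.5, 180.8, 190.7, 202.1, 212.4, 226.1, 231.9,
      239.0, 258.0, 269.8, 288.4, 304.5, 323.4, 336.8, 353.9, 369.7]
x3 = [4.2, 4.1, 3.1, 3.1, 1.1, 2.2, 2.1, 5.6, 5.0, 5.1,
      0.7, 5.6, 3.9, 3.1, 4.6, 7.0, 1.2, 4.5, 5.0]
x4 = [108.1, 114.8, 123.2, 126.9, 132.1, 137.7, 146.0, 154.1, 162.3, 164.3,
      167.6, 176.8, 186.6, 199.7, 213.9, 223.8, 232.0, 242.9, 252.0]

n = len(x2)

def dev(v):
    """Deviations from the mean."""
    m = sum(v) / n
    return [vi - m for vi in v]

d2, d3, d4 = dev(x2), dev(x3), dev(x4)
S33 = sum(a * a for a in d3)
S44 = sum(a * a for a in d4)
S34 = sum(a * b for a, b in zip(d3, d4))
S23 = sum(a * b for a, b in zip(d2, d3))
S24 = sum(a * b for a, b in zip(d2, d4))
S22 = sum(a * a for a in d2)

# Solve the two-regressor normal equations in closed form
det = S33 * S44 - S34 ** 2
b3 = (S23 * S44 - S24 * S34) / det
b4 = (S24 * S33 - S23 * S34) / det

r2_aux = (b3 * S23 + b4 * S24) / S22  # R^2 of the auxiliary regression
vif = 1 / (1 - r2_aux)

print(vif > 10)  # True: GDP is heavily collinear with the other regressors
```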
Remedial measures
To circumvent the problem of MC, some of the possibilities are:
1. Include additional observations maintaining the original model so that a reduction in the
correlation among variables is attained.
2. Dropping a variable.
This may result in an incorrect specification of the model (called specification bias). If we
consider our example, we expect both GDP and Consumption to have an impact on Imports.
By dropping one or the other, we have introduced specification bias.
Exercise with SPSS Application