
SESSION 3

BIVARIATE DATA ANALYSIS

Quantitative Economics Econ314


Introduction to Bivariate Analysis
Last week we examined a number of descriptive techniques for
describing one variable only (univariate analysis).
However, a more important part of statistics is describing the
relationship between two (or more) variables.

For example:
Is government spending related to GDP growth?
How does an increase in education affect earnings?

If two variables are related, this means that you can use information
about one variable to predict the values of the other variable.
We will investigate these types of predictions further when we
begin the econometrics part of the course next week.
The univariate summary statistics we featured last week can be
used to describe each of the individual variables, but they will not
give any indication of the relationship between variables.

Therefore the purpose of bivariate analysis is to summarise the
nature of the relationship between two variables.

This type of analysis can be performed on quantitative or qualitative
data, and can be presented in the form of statistics or graphs.
Bivariate Analysis of Categorical Data
Cross-tabs
The relationship between two categorical variables can be
represented as a two-dimensional frequency distribution, by cross-
tabulating the variables:
Note that the Total in the bottom row of the table gives the total
number of males and females, regardless of their employment
status.

The Total in the end column of the table gives the total number of
people with each employment status, regardless of their gender.

In the table above, there are more females than males in each
employment status category.

However, since there are also more females than males in the
sample, it is difficult to tell whether females are more likely to be
economically inactive than males, for example.

For this reason, we often present cross-tabulated data in the form
of percentages, rather than frequencies.
Cross-tabs with row, column and cell percentages
The same example is presented below, but this time the percentages
are displayed as well as the frequencies:
There is a lot of information contained in this table.
The key to interpreting the figures correctly is to note:
whether the figure is a frequency or percentage;
where the figures sum to (i.e. where the 100% is located), for the
percentage figures.

The figures in the top of each row (right-aligned) are the frequencies,
identical to Table 1.

The next row contains row percentages (note that the values add to
100% at the end of the row).
Thus the figure of 38.2 means that 38.2% of all values in that row are
male
i.e. 38.2% of all economically inactive people are male, or
55% of all employed people are females.
etc.
The next row contains column percentages (note that the values
add to 100% at the bottom of the column).
Thus the figure of 65.6 means that 65.6% of all values in that
column are economically inactive
i.e. 65.6% of all males are economically inactive, or
6.3% of all females are unemployed.
etc.
The final row contains cell percentages (note that the values add to
100% in the bottom right cell of the table).
Thus the figure of 26.3 means that 26.3% of all values in the sample
are economically inactive and male
i.e. 26.3% of the sample are economically inactive males, or
13.8% of the sample are employed females.
etc.
Interpret the figure of 70.8 in one sentence.
Compare it to the figure of 65.6; what does it tell you?
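As a minimal sketch of how the three kinds of percentage are calculated, the Python snippet below uses made-up counts (hypothetical, not the figures from the table in this session). The variable names are illustrative only.

```python
# Hypothetical counts (NOT the lecture's table): employment status by gender.
counts = {
    "Inactive":   {"Male": 30, "Female": 60},
    "Employed":   {"Male": 45, "Female": 55},
    "Unemployed": {"Male": 5,  "Female": 5},
}
genders = ["Male", "Female"]

# Marginal totals: one per row (status), one per column (gender), and overall.
row_totals = {s: sum(counts[s].values()) for s in counts}
col_totals = {g: sum(counts[s][g] for s in counts) for g in genders}
grand_total = sum(row_totals.values())


def pct(part, whole):
    """Return part as a percentage of whole, rounded to one decimal place."""
    return round(100 * part / whole, 1)


# Row percentages sum to 100 across each row (within each status).
row_pct = {s: {g: pct(counts[s][g], row_totals[s]) for g in genders} for s in counts}
# Column percentages sum to 100 down each column (within each gender).
col_pct = {s: {g: pct(counts[s][g], col_totals[g]) for g in genders} for s in counts}
# Cell percentages sum to 100 over the whole table.
cell_pct = {s: {g: pct(counts[s][g], grand_total) for g in genders} for s in counts}

for s in counts:
    print(s, "row %:", row_pct[s], "col %:", col_pct[s], "cell %:", cell_pct[s])
```

In this made-up table, the same cell (inactive males) gives three different percentages depending on the denominator, which is why it matters to check where the 100% is located.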
Bivariate Analysis of Continuous Data
Introduction
When quantitative variables have numerous different values, we
cannot meaningfully cross-tabulate them. (Why not?)
Instead, we look for summary statistics, or graphical displays, that
describe the extent of association or relationship between the
variables.
e.g., the univariate descriptive statistics below summarise education
attainment and monthly income for several individuals.

The summary statistics for the two variables are useful but do not
tell us much about the relationship between the two variables.
However, a scatterplot reveals a much clearer relationship between
them.
Scatterplots
A scatterplot, or X-Y graph, is a graphical method of depicting the
relationship between two variables. It is particularly useful for examining
what we believe to be a causal relationship.
The independent or explanatory variable, X, is represented on the
horizontal axis, while the dependent variable, Y, is represented on the
vertical axis.
A scatterplot may reveal three important features:
The direction of the relationship:
An upward-sloping line indicates a positive linear relationship [(a), (c)
and (e) below].
A downward-sloping line indicates a negative linear relationship [(b), (d)
and (f)].
A curved pattern indicates a non-linear relationship [(h)].
No pattern indicates that there is no apparent relationship [(g)].
The strength of the relationship:
The more the data points cluster along an imaginary line, the stronger
the relationship [compare (a) and (b) with (c) and (d) and these again,
with (e) and (f)].
The presence of outliers:
Data points that are distant from the bulk of the data, or that lie far
away from the imaginary line showing the relationship between the
variables, may be outliers and should be investigated further.

Directions and Strengths of Linear relationships


Covariance
Graphical methods of depicting relationships are useful, but their
interpretation can be subjective.
Calculating a summary statistic is an objective way of measuring such
relationships.
Covariance is a measure of association between two variables.
It is calculated using deviations, i.e. the differences between each data
value and the mean of that variable.
The (sample) covariance between X and Y is given by:
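In standard notation, and assuming the usual n - 1 divisor for a sample statistic (the symbols s_XY, x-bar and y-bar are notation chosen here), the definition can be written as:

```latex
s_{XY} = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})
```

Here n is the number of observations, and x-bar and y-bar are the sample means of X and Y. Each term in the sum is the product of the two deviations for one individual, so pairs that are both above or both below their means contribute positively.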

The following table shows completed education (in years) and monthly
income (in thousands of Rands) for 8 individuals:
The positive sign on the covariance statistic indicates that there is a positive
relationship between education and income i.e. on average, as education
increases, so too does income.

Although the covariance measure is useful in identifying the nature of the
association between X and Y, it has a serious problem: the numerical value
is very sensitive to the units in which X and Y are measured.

In the example above, if income were measured in Rands rather than
thousands of Rands, the covariance measure would go up by a factor of
1000, to 5143.
However, this does not indicate that the relationship has become
stronger.
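The sensitivity to units can be checked with a short Python sketch. The data below are hypothetical (six made-up individuals, not the eight in the lecture's table), and the n - 1 divisor is an assumption about the sample covariance being used.

```python
import math


def sample_covariance(x, y):
    """Sample covariance: sum of products of deviations, divided by n - 1."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    products = [(a - mean_x) * (b - mean_y) for a, b in zip(x, y)]
    return sum(products) / (n - 1)


# Hypothetical data: years of education and monthly income (thousands of Rands).
education = [8, 10, 12, 12, 14, 16]
income_thousands = [2.0, 3.5, 4.0, 5.0, 6.5, 9.0]

# The same incomes expressed in Rands instead of thousands of Rands.
income_rands = [1000 * v for v in income_thousands]

cov_thousands = sample_covariance(education, income_thousands)
cov_rands = sample_covariance(education, income_rands)

print(cov_thousands)              # covariance with income in thousands of Rands
print(cov_rands)                  # covariance with income in Rands
print(cov_rands / cov_thousands)  # ratio of the two
```

The underlying relationship is identical in both calculations; only the unit of income changes, yet the covariance is 1000 times larger in the second case.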
Correlation coefficient
To avoid this problem with the units of measurement, we instead use
Pearson's correlation coefficient.
The correlation coefficient between X and Y is defined as:
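In standard notation (the symbols below are chosen here for illustration), Pearson's correlation coefficient is the sample covariance divided by the product of the two sample standard deviations:

```latex
r = \frac{s_{XY}}{s_X \, s_Y}
  = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \, \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```

Here s_XY is the sample covariance and s_X and s_Y are the sample standard deviations of X and Y. The divisor (n - 1 under the usual sample definitions) cancels between numerator and denominator, and so do the units of measurement.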

The correlation coefficient has the following useful properties:

r is metric-independent (i.e. does not depend on the units of
measurement)

r reflects the direction of the relationship (it can be positive or negative)

r reflects the magnitude of the relationship (it can be large or small)


The correlation coefficient has:

a maximum value of 1, which is attained when there is a perfect positive
linear correlation between the sample values of X and Y.
On a scatterplot, all the points lie exactly on an upward-sloping straight
line.

a minimum value of -1, which is attained when there is a perfect negative
linear correlation between the sample values of X and Y.
On a scatterplot, all the points lie exactly on a downward-sloping
straight line.

A value of 0 indicates that there is no linear association between the
observations on X and Y in the sample.
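These properties can be illustrated with a small Python sketch. All data are hypothetical, and the helper name correlation is illustrative only.

```python
import math


def correlation(x, y):
    """Pearson's r; the n - 1 divisors cancel, so sums of deviations suffice."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Points lying exactly on an upward-sloping and a downward-sloping line.
x = [1, 2, 3, 4, 5]
y_up = [2 * v + 1 for v in x]
y_down = [10 - 3 * v for v in x]
r_up = correlation(x, y_up)
r_down = correlation(x, y_down)

# Hypothetical education (years) and income data, in two different units.
education = [8, 10, 12, 12, 14, 16]
income_thousands = [2.0, 3.5, 4.0, 5.0, 6.5, 9.0]
income_rands = [1000 * v for v in income_thousands]
r_thousands = correlation(education, income_thousands)
r_rands = correlation(education, income_rands)

print(r_up, r_down)          # perfect positive and perfect negative correlation
print(r_thousands, r_rands)  # same value whichever unit income is measured in
```

Unlike the covariance, r is the same (up to floating-point rounding) whether income is recorded in Rands or in thousands of Rands.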
Using the education and earnings example from earlier:

This indicates that there is a fairly strong, positive, linear relationship
between education and earnings.
Another way of interpreting this (correlation coefficient) measure is
to square it:

r² measures the proportion of the variability in Y that is
associated (or matches up) with the variation in X.

In our example,

Therefore 73.5% of the variation in earnings between
individuals can be associated with the variation in their
education.
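As a rough sketch of the squaring step, the Python snippet below computes r and r² for hypothetical data (not the lecture's eight individuals, so the result differs from the 73.5% quoted above).

```python
import math

# Hypothetical data: years of education and monthly income (thousands of Rands).
education = [8, 10, 12, 12, 14, 16]
income = [2.0, 3.5, 4.0, 5.0, 6.5, 9.0]

n = len(education)
mean_x = sum(education) / n
mean_y = sum(income) / n

# Sums of products and squares of deviations from the means.
sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(education, income))
sxx = sum((a - mean_x) ** 2 for a in education)
syy = sum((b - mean_y) ** 2 for b in income)

r = sxy / math.sqrt(sxx * syy)  # Pearson's correlation coefficient
r_squared = r ** 2              # share of variation in Y associated with X

print(round(r, 3), round(100 * r_squared, 1))
```

Note that r² is always between 0 and 1 and loses the sign of r, so the direction of the relationship has to be read from r itself.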
Bivariate Analysis of Continuous and Categorical Data
We may be interested in the relationship between a continuous and a
categorical variable.
However, we have seen previously that it is not useful to perform cross-
tabulations for continuous variables.
The correlation coefficient above is calculated only for two continuous variables.

Although there are methods of calculating correlation between a
continuous and a categorical variable, they are beyond the scope of this
course.
How then can we examine a relationship between a continuous and a
categorical variable?
By combining the univariate descriptive statistics for the continuous variable
with the categories of the categorical variable.
Comparing distributions, and measures of central tendency and dispersion,
between categories can then highlight key features of the relationship
between the variables.
For example, consider the relationship between gender and earnings.
One way to represent this relationship is by comparing the distribution of
earnings of males to that of females:

What conclusion could you draw from the graph above?


You could also use measures of central tendency and dispersion to
describe this relationship.
Again, separate the two categories, and calculate the univariate
statistics for each category:

What conclusion could you draw from the table above?
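A minimal Python sketch of this approach is given below. The earnings figures are hypothetical (not the lecture's data), and the choice of mean, median and standard deviation as the summary statistics is an assumption made for illustration.

```python
import math
import statistics

# Hypothetical monthly earnings (thousands of Rands), recorded with gender.
records = [
    ("Male", 4.0), ("Female", 3.0), ("Male", 6.0), ("Female", 5.0),
    ("Female", 5.5), ("Male", 8.0), ("Female", 6.5),
]

# Separate the continuous variable by the categories of the categorical one.
by_gender = {}
for gender, earnings in records:
    by_gender.setdefault(gender, []).append(earnings)

# Calculate the same univariate statistics within each category.
summary = {
    gender: {
        "n": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
    }
    for gender, values in by_gender.items()
}

for gender, stats in summary.items():
    print(gender, stats)
```

Comparing the rows of such a summary shows whether the centre or the spread of earnings differs between the two groups.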


Introduction to practical
This week's computer practical session involves using Stata to construct the
descriptive statistics we've featured in this session, and interpreting them.

Commands
For cross-tabulating two categorical variables, use the command tab (short
for tabulate) followed by the names of the variables.
For calculating correlation coefficients, use the command correlate
followed by the names of the variables.
For calculating covariances, use the command correlate followed by the
names of the variables, then a comma and the option covariance.

Scatterplots
Use the drop-down menu Graphics > Twoway graph. Create a scatterplot,
enter the names of the variables, and add titles for the axes if desired, or
use the command scatter followed by the name of the Y variable and then the
name of the X variable.
