by
Vipul Mehta
Data & Variable
What is data?
Facts, statistics or items of information
Which can be used further to obtain some knowledge
What is a variable?
Data is expressed in terms of a variable where
Variable is any characteristic that varies from one member of a
population to another.
For example, the height of students in this classroom, which varies from
one individual to another
Types of Variables
There are two types of variables
Numerical variables (quantitative)
Categorical variables (qualitative)
[Diagram: types of variables. Numerical (discrete or continuous) and categorical.]
Dataset and Data Table
Dataset: Data of a group of variables for a collection of
people
Data table: A dataset organized into a table, with one
column for each variable and one row for each individual
Data Table
Sample
A sample is selected to represent the population in a research
study.
Statistics
Types of statistics
Descriptive statistics
Inferential statistics
Measures of Central Tendency
What are the mean, median and mode?
Mean
Average of the numbers
Median
Central number when all values are put in increasing order.
What happens when the set has an even number of values?
Mode
Number that appears most often in the set
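As a small illustration of the three measures, the sketch below uses Python's standard `statistics` module. The list `values` is made up for this example (it is not from the slides) and has an even number of entries on purpose: in that case the median is taken as the average of the two middle values.

```python
import statistics

# Illustrative values only (not from the slides); an even count on purpose.
values = [2, 3, 3, 5, 7, 10]

mean = statistics.mean(values)      # sum of the values divided by their count
median = statistics.median(values)  # even count: average of the two middle values (3 and 5)
mode = statistics.mode(values)      # value that appears most often

print(mean, median, mode)
```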
Other Measures of Central Tendency
Percentile
Quartile
Q1
Q2
Q3
Percentile
To calculate the pth percentile for a set of data, the following process is used:
1. Arrange the data in increasing order (smallest to largest)
2. Compute an index i given as
$i = \frac{p}{100} \, n$
where p is the percentile of interest and n is the number of observations
3. A. If i is not an integer, round up to the next integer; the pth percentile is the value in that position
B. If i is an integer, the pth percentile is the average of the values in positions i and i + 1
Let's say we measure the height of 10 students of this class (in cm) and get the following
results:
180 165 158 160 176 163 179 161 152 159
Find the 75th, 85th and 50th percentiles.
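The index rule above can be sketched in Python as follows. The function name `percentile` is our own, and the sketch assumes p is a whole number between 1 and 99. Other textbooks and libraries use slightly different percentile conventions, so results may differ from theirs.

```python
def percentile(data, p):
    """pth percentile using the index rule i = (p / 100) * n.

    Assumes p is a whole number from 1 to 99 and data is non-empty.
    """
    values = sorted(data)                  # step 1: increasing order
    n = len(values)
    whole, remainder = divmod(p * n, 100)  # step 2: i = p * n / 100, kept exact
    if remainder != 0:
        # i is not an integer: round up, i.e. 1-based position whole + 1
        return values[whole]
    # i is an integer: average the values in 1-based positions i and i + 1
    return (values[whole - 1] + values[whole]) / 2


heights = [180, 165, 158, 160, 176, 163, 179, 161, 152, 159]
for p in (75, 85, 50):
    print(p, percentile(heights, p))
```

For the 50th percentile the index is exactly 5, so the result is the average of the 5th and 6th sorted values.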
Variance: $\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \mu)^2$
Standard Deviation: $\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - \mu)^2}$
Coefficient of Variation: $C_v = \frac{\sigma}{\mu} \times 100\%$
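A minimal sketch of these three measures, applied to the ten student heights from the percentile example and treating them as the whole population (that is, dividing by n). The variable names are our own.

```python
import math

heights = [180, 165, 158, 160, 176, 163, 179, 161, 152, 159]

n = len(heights)
mean = sum(heights) / n

# Variance: average squared distance from the mean (dividing by n)
variance = sum((x - mean) ** 2 for x in heights) / n

# Standard deviation: square root of the variance, in the same unit as the data (cm)
std_dev = math.sqrt(variance)

# Coefficient of variation: standard deviation as a percentage of the mean
cv = std_dev / mean * 100

print(round(variance, 2), round(std_dev, 2), round(cv, 2))
```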
Exploratory Data Analysis - Five Number Summary
In Five Number Summary, the following five numbers are used
to summarize the data:
1. Smallest value
2. First quartile (Q1)
3. Median (Q2)
4. Third quartile (Q3)
5. Largest value
For our set of numbers discussed in the previous class,
142 158 159 160 161 163 165 176 200 240
Perform the Five Number Summary
Hence draw the Box Plot of the data
Smallest value =
Q1 =
Median =
Q3 =
Largest Value =
[Box plot marking, from left to right: Smallest, Q1, Median, Q3, Largest]
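One way to compute the five numbers is to treat Q1, the median and Q3 as the 25th, 50th and 75th percentiles and reuse the index rule from the percentile section. The sketch below does that; the function names are our own, and other quartile conventions exist that can give slightly different values.

```python
def percentile(data, p):
    """pth percentile via i = (p / 100) * n; assumes whole-number p from 1 to 99."""
    values = sorted(data)
    n = len(values)
    whole, remainder = divmod(p * n, 100)
    if remainder != 0:
        return values[whole]                        # round i up to the next position
    return (values[whole - 1] + values[whole]) / 2  # average positions i and i + 1


def five_number_summary(data):
    """Return (smallest, Q1, median, Q3, largest)."""
    values = sorted(data)
    return (
        values[0],
        percentile(values, 25),
        percentile(values, 50),
        percentile(values, 75),
        values[-1],
    )


data = [142, 158, 159, 160, 161, 163, 165, 176, 200, 240]
summary = five_number_summary(data)
print(summary)
```

The box of the box plot spans Q1 to Q3, with a line at the median.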
Exercise
The following are the marks of 10 students in this class (out
of 20):
4 4 5 5 6 7 8 8 9 14
a) Perform the five number summary and draw the box plot.
b) Do you observe any outliers in the data set?
Bessel's Correction
Multiply the variance obtained with the population formula (dividing by n) by n/(n - 1)
$s^2 = \left[\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2\right] \cdot \frac{n}{n-1}$
This increases the sample variance, which otherwise tends to underestimate the
population variance
Thus the unbiased sample variance becomes
$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$
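The correction can be checked numerically. The sketch below treats the ten heights from the percentile example as a sample. It uses `statistics.pvariance` (divides by n) and `statistics.variance` (divides by n - 1) from Python's standard library; the other names are our own.

```python
import statistics

heights = [180, 165, 158, 160, 176, 163, 179, 161, 152, 159]
n = len(heights)

biased = statistics.pvariance(heights)   # divides by n
corrected = biased * n / (n - 1)         # Bessel's correction applied by hand
unbiased = statistics.variance(heights)  # divides by n - 1 directly

print(round(biased, 2), round(corrected, 2), round(unbiased, 2))
```

The hand-corrected value and the n - 1 formula agree, and both are larger than the uncorrected variance.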
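Sample vs. Population
Mean
Population: $\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$
Sample: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
Variance
Population: $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2$
Sample: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$
Standard Deviation
Population: $\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2}$
Sample: $s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$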
Covariance
$s_{xy} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{n - 1}$
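A minimal sketch of the sample covariance formula. The function name and the two short lists are made up for illustration; they are not data from the slides.

```python
def sample_covariance(xs, ys):
    """Sample covariance: sum of products of deviations, divided by n - 1."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)


# Illustrative values only (not from the slides)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

s_xy = sample_covariance(x, y)
print(s_xy)  # positive: y tends to rise as x rises
```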
Scatter Diagram
[Scatter diagram: Sales ($100s) on the vertical axis against Number of Commercials on the horizontal axis, divided into quadrants I to IV]
What do we see?
We observe that sales increase as the number of commercials
increases
Covariance Shortcoming
The sign of sxy tells us the direction of the linear relationship
The magnitude of sxy does not tell us the strength of the linear
relationship, because it depends on the units in which x and y are measured
Pearson's Correlation Coefficient
To overcome the shortcoming of covariance, we use a
correlation coefficient
Pearson's Correlation Coefficient is defined as:
$r_{xy} = \frac{s_{xy}}{s_x s_y}$
Where,
$r_{xy}$ is the sample correlation coefficient
$s_{xy}$ is the sample covariance
$s_x$ is the sample standard deviation of x
$s_y$ is the sample standard deviation of y
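The sketch below computes Pearson's coefficient from the definition and shows why it addresses the covariance shortcoming: rescaling x (for example, changing its unit) changes the covariance but leaves the correlation unchanged. Function names and data are made up for illustration and are not from the slides.

```python
import math


def sample_covariance(xs, ys):
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)


def sample_std(xs):
    # The sample variance is the covariance of a variable with itself
    return math.sqrt(sample_covariance(xs, xs))


def pearson_r(xs, ys):
    return sample_covariance(xs, ys) / (sample_std(xs) * sample_std(ys))


# Illustrative values only (not from the slides)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
x_scaled = [100 * v for v in x]  # same quantity expressed in a smaller unit

print(sample_covariance(x, y), sample_covariance(x_scaled, y))  # covariance changes
print(pearson_r(x, y), pearson_r(x_scaled, y))                  # correlation does not
```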
Spearman's Rank Coefficient
Spearman's Rank Coefficient is defined as:
$r_s = 1 - \frac{6\sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}$
where,
$r_s$ is Spearman's Rank Coefficient
$d_i$ is the difference between the ranks of the two variables for observation i
n is the sample size
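A minimal sketch of the rank formula. It assumes there are no tied values (the simple formula does not handle ties), and both the helper names and the data are made up for illustration.

```python
def ranks(values):
    """Rank 1 for the smallest value; assumes no ties."""
    order = sorted(values)
    return [order.index(v) + 1 for v in values]


def spearman_rs(xs, ys):
    n = len(xs)
    # d_i is the difference between the two ranks of observation i
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


# Illustrative values only (not from the slides)
x = [10, 20, 30, 40, 50]
y = [1, 3, 2, 5, 4]

r_s = spearman_rs(x, y)
print(r_s)
```

Here the rank differences are 0, -1, 1, -1 and 1, so the sum of squared differences is 4.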
Exercise
You are the marketing manager in a furniture making unit.
You have been assigned the task of figuring out the optimum
price of the newly developed table-chair set your company
has designed. You do a survey on a sample of people to see
their preference for the following set of prices. For example,
50 people said they prefer a price of Rs 400 for the set.
Similarly you get other observations.
Price (Rs 00s):             4   6   11  3   16  14
No. of responses in favor:  50  45  40  60  30  35
Using the above data, analyze the following:
a) If you have to draw a scatter diagram, which one would be the
independent variable?
b) Draw a scatter diagram of the data. What does it indicate
about the relationship between the two?
c) Compute and interpret the sample covariance.
d) Compute and interpret Pearson's correlation coefficient.
e) Compute and interpret Spearman's Rank Coefficient.
Weighted Mean
What is the weighted average price of the following set of
data?
Price (Rs 00s):             4   6   11  3   16  14
No. of responses in favor:  50  45  40  60  30  35
$\bar{x} = \frac{\sum w_i x_i}{\sum w_i}$
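The weighted mean formula can be sketched as below. It assumes the weights $w_i$ are the number of responses in favor of each price, which is our reading of the question; the variable names are our own.

```python
prices = [4, 6, 11, 3, 16, 14]         # x_i, in hundreds of rupees
responses = [50, 45, 40, 60, 30, 35]   # w_i, assumed to be the weights

# Weighted mean: sum of w_i * x_i divided by the sum of w_i
weighted_mean = sum(w * x for w, x in zip(responses, prices)) / sum(responses)

print(round(weighted_mean, 2))
```

Prices that more respondents favored pull the average towards themselves.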