
MPH 1st Year

Biostatistics

Prabesh Ghimire
Biostatistics MPH 19th
Batch

Table of Contents
UNIT 1: INTRODUCTION TO BIOSTATISTICS ................................................................................................ 4
Biostatistics and its Role in Public Health ................................................................................................. 4
UNIT 2: DESCRIPTIVE STATISTICS ................................................................................................................. 4
Variables.................................................................................................................................................... 4
Scales of Measurement............................................................................................................................. 5
Diagrammatic and graphic presentation .................................................................................................. 7
Measure of Central Tendency ................................................................................................................. 12
Measures of Dispersion .......................................................................................................................... 14
UNIT 3: PROBABILITY DISTRIBUTION......................................................................................................... 16
Probability Distributions ......................................................................................................................... 18
Binomial Distribution .............................................................................................................................. 18
Poisson Distribution.................................................................................................. 19
Normal Distribution ................................................................................................................................ 20
UNIT 4: CORRELATION AND REGRESSION ................................................................................................. 21
Correlation .............................................................................................................................................. 21
Regression ............................................................................................................................................... 26
UNIT 5: SAMPLING THEORY, SAMPLING DISTRIBUTION AND ESTIMATION ......................................... 33
Sampling Techniques .............................................................................................................................. 39
Determination of Sample Size................................................................................................................. 43
UNIT 6: Hypothesis Testing ........................................................................................................................ 43
Parametric Test ....................................................................................................................................... 48
Z-test ................................................................................................................................................... 48
T-test ................................................................................................................................................... 50
Analysis of Variance (ANOVA) ............................................................................................................. 51
Scheffe Test ........................................................................................................................................ 52
Tukey Test .......................................................................................................... 52
Bonferroni Test ................................................................................................................................... 53
Non-Parametric Tests ............................................................................................................................. 53
Mann Whitney U Test ......................................................................................................................... 53
Wilcoxon Signed Rank Test ................................................................................................................. 53

Kruskal Wallis Test .............................................................................................................................. 54


Tests of Association ............................................................................................................................ 54
Chi-Square Test ................................................................................................................................... 54
Fisher's Exact Test ............................................................................................. 56
McNemar's Chi-Square Test ............................................................................... 56
STATISTICAL SOFTWARE IN BIOSTATISTICS ............................................................................................... 57
Introduction to Various Statistical Softwares ......................................................................................... 57
Data Management in Epidata ................................................................................................................. 60
Data Management in SPSS ...................................................................................................................... 61
Miscellaneous ............................................................................................................................................. 62
Important Formulae for Biostatistics ........................................................................................................ 64

UNIT 1: INTRODUCTION TO BIOSTATISTICS


Biostatistics and its Role in Public Health

Biostatistics is the branch of statistics responsible for interpreting the scientific data that is
generated in the health sciences, including the public health sphere.
- In essence, the goal of biostatistics is to disentangle the data received and make valid
inferences that can be used to solve problems in public health.

Role/Usefulness of biostatistics in public health


- All quantitative public health research involves wider application of biostatistics from
sampling, sample size calculation to data collection, processing, analysis and generating
evidence.
- Calculating risk measures such as relative risk, odds ratio etc. in populations involves
biostatistics.
- Biostatistical methods such as hypothesis testing are critical to establishing exposure-
outcome relationships.
- Statistical methods are applied in the evaluation of interventions, screening and prevention
programs in populations.
- Proper trial and intervention studies require an understanding of the proper use of
parametric and non-parametric statistical tests.
- Regular monitoring of public health programs involves analyzing data to identify problems
and solutions in public health sector.
- Disease surveillance programs involve collection, analysis and interpretation of data.
- Choice of sampling techniques is the foundation to all public health research.

UNIT 2: DESCRIPTIVE STATISTICS

Variables

Concept of Variables
If a characteristic takes on different values in different persons, places, or things, we label the
characteristic as a Variable.
Some examples of variables include diastolic blood pressure, heart rate, the heights of male
adults, the weights of under-5 years children, the ages of patients in OPD.

Types of Variables
Variables can usually be distinguished into qualitative and quantitative:

i. Qualitative Variables
- Qualitative variables have values that are intrinsically non-numeric (categorical)
- E.g., cause of death, nationality, race, gender, severity of pain (mild, moderate, severe), etc.
- Qualitative variables generally have either nominal or ordinal scales.
- Qualitative variables can be reassigned numeric values (e.g., male=0, female=1), but they
are still intrinsically qualitative.

ii. Quantitative variables


- Quantitative variables have values that are intrinsically numeric.
- E.g., survival time, systolic blood pressure, number of children in a family, height, age, body
mass index, etc.
- Quantitative variables can be further sub-divided into discrete and continuous variables.
a. Discrete Variables
- Discrete variables have a set of possible values that is either finite or countably infinite.
- E.g., number of pregnancies, shoe size, number of still births, etc.
- For discrete variables there are gaps between the possible values. Discrete variables often
take integer (whole number) values (e.g., counts), but some can take non-integer
values.

b. Continuous Variable
- A continuous variable has a set of possible values including all values in an interval of
the real line.
- E.g., duration of seizure, body mass index, height
- No gaps exist between possible values.

Scales of Measurement

There are four different measurement scales. Each is designed for a specific purpose.
i. Nominal Scale
ii. Ordinal Scale
iii. Interval Scale
iv. Ratio Scale

i. Nominal Scale
- It is simply a system of assigning numbers to events in order to label/identify them, e.g.
assignment of numbers to cricket players in order to identify them.
- Nominal data can be grouped but not ranked. For example, male/female, urban/rural,
diseased/healthy are examples of nominal data and such data consists of numbers used
only to classify an object, person or characteristics.
- Nominal scale is the least powerful among the scales of measurement.
- It indicates no order or distance relationship and has no arithmetic origin.
- Chi-square test is the most common test of statistical significance that can be utilized in this
scale.

ii. Ordinal Scale


- Among the three ordered scales, i.e. ordinal, interval and ratio, the ordinal scale occupies
the lowest level (ordinal < interval < ratio).
- This scale places events in a meaningful order. (ARI may be classified as no pneumonia,
pneumonia, severe pneumonia and very severe disease).
- The size of the interval is not defined, i.e. no conclusion about whether the difference
between first and second grade is same as the difference between second and third grade.

- Ordinal scale only permits ranking of items from highest to lowest. Thus use of this scale
implies statement of greater than or less than without being able to state how much greater
or less.
- Ordinal data can be both grouped and ranked. Examples include mild, moderate and severe
malnutrition, first degree, second degree and third degree uterine prolapse, etc.

iii. Interval Scale


- Similar to the ordinal scale, data can be placed in a meaningful order; in addition, the
intervals between values are meaningful and can be measured.
- On the Celsius scale, the difference between 100° and 90° is the same as the difference
between 60° and 50°. However, interval scales do not have an absolute zero (an arbitrary
zero point is assigned).
- Ratios of scores are therefore not meaningful, i.e. 100°C is not twice as hot as 50°C, because 0°C does
not indicate a complete absence of heat; rather it is the freezing point of water.
- An Intelligence Quotient of zero does not indicate a complete absence of IQ, but indicates a serious
intellectual problem.

iv. Ratio Scale


- The ratio scale has the same properties as an interval scale, but because it has an absolute zero,
meaningful ratios do exist.
- Weight in grams or pounds, time in seconds or days, BP in mm of Hg and pulse rate are all
ratio scale data.
- The only temperature scale that follows the ratio scale is the Kelvin scale, in which zero degrees
indicates an absolute absence of heat, just as a zero pulse rate indicates an absolute lack of pulse.
Therefore, it is correct to say a pulse rate of 120 is twice as fast as a pulse rate of 60.

Summary of Scales of Measurement


Scale      Characteristic Question                  Examples
Nominal    Is A different than B?                   Marital status, eye color, gender, religious affiliation
Ordinal    Is A bigger than B?                      Stage of disease, severity of pain, level of satisfaction
Interval   By how many units do A and B differ?     Temperature, SAT scores
Ratio      By how many times is A bigger than B?    Distance, length, weight, pulse rate

Diagrammatic and graphic presentation

There are different types of diagrams and graphic representations for a given dataset:
i. Bar Graph
- Bar graph is the simplest qualitative graphic representation of data.
- A bar graph contains two or more categories along one-axis and a series of bars, one for
each category, along the other axis.
- Typically, the length of the bar represents the magnitude of the measure (amount,
frequency, percentage, etc.) for each category.
- The bar graph is qualitative because the categories are non-numerical, and it may be either
horizontal or vertical.

Construction of bar chart


- First the frequencies are labeled on one axis and categories of the variable on the other
axis.
- A rectangle is constructed at each category of the variable with a height equal to the
frequency (number of observations) in the category.
- A space is left between each category to connote distinct, separate categories and to clarify
the presentation.
Advantages of bar chart
- A bar graph gives a clearer picture of the distribution since data are presented as rectangles.
- It is commonly used in presentations of research papers by speakers at symposia, seminars
and trainings.

Limitations
- Bar graph is not applicable for plotting data overlapping with each other because it gives a
confusing picture.

ii. Pie diagram/ Sector diagram


- Pie diagram is a circle in which the total area is divided into number of sectors. Each sector
represents the percentage value of the concerned category.
- Areas of segments of a circle are compared and it enables comparative differences at a
glance.
- Degrees of angle denote the frequency and area of the sector. A pie diagram has a total area
of 100%, with 1% equivalent to 3.6° of the circle.
- The size of each angle is calculated by multiplying the class percentage by 3.6, or by the formula:
angle = (class frequency / total observations) × 360°
- A legend/index must be there to represent the different categories.
- Some authorities opine that the segments should start at the 12 o'clock position and be
arranged clockwise in order of magnitude, largest first.
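The angle formula above can be sketched in a few lines of Python; the category counts here are hypothetical, for illustration only:

```python
# angle = (class frequency / total observations) * 360 degrees
# Hypothetical category counts, for illustration only
counts = {"ARI": 50, "Diarrhoea": 30, "Fever": 20}
total = sum(counts.values())

angles = {cat: freq / total * 360 for cat, freq in counts.items()}
print(angles)   # the three angles sum to 360 degrees
```

Whatever the counts, the computed angles always sum to 360°, which is a quick sanity check on the arithmetic.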

iii. Histogram (Also known as block diagram)


- We may display a frequency distribution graphically in the form of histogram, which is a
special type of bar graph.
- A histogram represents continuous, ordered data; the bars are adjacent to each other, with
no space in between.
- An epidemic curve is an example of a histogram.
Construction of Histogram
- When we construct a histogram the values of the variable under consideration are
represented by the horizontal axis, while the vertical axis has the frequency of occurrence.
- Above each class interval on the horizontal axis, a rectangular bar or cell is erected so that
its height corresponds to the respective frequency.
- The cells of the histogram must be joined and should be of equal width for each class
interval.

Advantages
- It allows visual comparison of the distributions of different data sets.

Limitations
- Histogram is not applicable in plotting two or more sets of data over-lapping with each other
because it gives a confusing picture.
- Since only one distribution can be plotted per graph, it is more expensive and time
consuming.

iv. Frequency Polygon


- A frequency distribution can be portrayed graphically by means of a frequency polygon,
which is a special kind of line graph.
- Frequency polygon represents distribution of continuous and ordered data.
- A frequency polygon can show the trend of an event over a period of time, i.e. increasing,
declining, or static. For example, a frequency polygon can be used to show a
declining trend of infant mortality rate over a period of time or to show the increasing
incidence of HIV infection.

Construction of Frequency polygon


- To draw a frequency polygon, we first place a dot above the midpoint of each class interval
represented on the horizontal axis of a graph.
- The height of a given dot above the horizontal axis corresponds to the frequency of the
relevant class interval.
- Connecting the dots by straight lines produces the frequency polygon.

Advantages
- The change from one point to another is shown directly and gives a correct impression.
- Unlike the histogram, two or more distributions can be plotted overlapping on the same
baseline while still giving a clear picture of the comparison of each distribution.

Limitations
- Can be used only with continuous data

v. Stem and Leaf display/plot


- Stem and leaf display is one of the graphical representations of quantitative data sets.
- A stem and leaf plot bears a strong resemblance to a histogram and serves the same
purpose.
- A properly constructed stem and leaf display, like a histogram provides information
regarding the range of the data set, shows the location of the highest concentration of
measurements, and reveals the presence or absence of symmetry.

Construction of Stem and Leaf Display


- To construct a stem and leaf display, each measurement is partitioned into two parts. The
first part is called the stem, and the second part is called the leaf.
- The stem consists of one or more of the initial digits of the measurement, and the leaf is
composed of one or more of the remaining digits.
- All partitioned numbers are shown together in a single display; the stems form an ordered
column with the smallest stem at the top and the largest at the bottom.
- In the stem column, all stems within the range of data are included even when a
measurement with that stem is not in the data set.
- The rows of the display contain the leaves, ordered and listed to the right of their respective
stems. When leaves consist of more than one digit, all digits after the first may be deleted.
- Decimals when present in the original data are omitted in the stem and leaf display.
- The stems are separated from their leaves by a vertical line
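The construction steps above can be sketched in Python. This is a minimal illustration, assuming a tens-digit stem and a units-digit leaf, with decimals dropped as the text describes; the sample values are illustrative only:

```python
from collections import defaultdict

def stem_and_leaf(data):
    """Stem = tens digit, leaf = units digit; decimals are dropped."""
    leaves = defaultdict(list)
    for v in sorted(int(v) for v in data):      # decimals omitted, per the text
        leaves[v // 10].append(v % 10)
    rows = []
    # Include every stem in the range, even when it has no leaves
    for stem in range(min(leaves), max(leaves) + 1):
        rows.append(f"{stem:2d} | " + " ".join(str(leaf) for leaf in leaves.get(stem, [])))
    return "\n".join(rows)

print(stem_and_leaf([14.6, 24.3, 24.9, 27.0, 31.5, 38.3]))
```

Note how the stem column runs from the smallest to the largest stem without gaps, matching the rule that empty stems are still listed.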

Advantages of stem and leaf plot


- Unlike histogram, it preserves the information contained in the individual measurements.
Such measurements are lost when measurements are assigned to the class intervals of a
histogram.
- It can be constructed during the tallying process, so the intermediate step of preparing an
ordered array is eliminated.
- It is primarily of value in helping researchers and decision makers understand the nature of
data.

Limitations
- As a rule, it is not suitable for use in annual reports or other dissemination purposes.

vi. Box and Whisker Plot


- A box and whisker plot is a graph that displays a summary of a large amount of data using five
numbers: the median, upper quartile, lower quartile, minimum data value and maximum data
value.
- The construction of a box and whisker plot makes use of the quartiles of a data set.

Construction of Box and Whisker Plot


- The variable of interest is represented on the horizontal axis.
- A box is drawn in the space above the horizontal axis in such a way that the left end of the
box aligns with the first quartile Q1 and the right end of the box aligns with the third quartile
Q3.
- The box is divided into two parts by a vertical line that aligns with the median Q2.
- A horizontal line called a whisker is drawn from the left end of the box to a point that aligns
with the smallest measurement in the dataset.
- Another horizontal line or whisker is drawn from the right end of the box to a point that
aligns with the largest measurement in the data set.

Advantages
- Examination of a box-and-whisker plot for a set of data reveals information regarding the
amount of spread, location of concentration, and symmetry of data.
- It is easy to compare the stratified data using the Box and Whisker Plot.

Limitations
- Mean and mode cannot be identified from a box plot.
- If large outliers are present, the box plot may give a distorted representation.
- When handling large amounts of data in a box plot, the exact values and details of the
distribution of results are not retained.

Example
Given the weight measurements (Kg) in a group of selected students:

14.6 24.3 24.9 27.0 27.2 28.2 28.8 29.9 30.7


31.5 31.6 32.3 32.8 33.3 33.6 34.3 36.9 38.3

Here, the smallest and largest values are 14.6 and 44 respectively. Q1 is 27.25, the median is 31.1
and the third quartile is 33.525. The Box and Whisker Plot for the given dataset will be:
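The five-number summary behind a box and whisker plot can be computed directly; here for the weights listed above. Note that quartile values depend on the interpolation convention used, so software may return slightly different Q1 and Q3 from the figures quoted:

```python
import statistics

weights = [14.6, 24.3, 24.9, 27.0, 27.2, 28.2, 28.8, 29.9, 30.7,
           31.5, 31.6, 32.3, 32.8, 33.3, 33.6, 34.3, 36.9, 38.3]

# Cut points Q1, Q2 (median), Q3; the default 'exclusive' method
# interpolates at rank positions based on (n + 1)/4.
q1, q2, q3 = statistics.quantiles(weights, n=4)
five_number = (min(weights), q1, q2, q3, max(weights))
print(five_number)
```

The median agrees with the value quoted above (31.1); the quartiles differ slightly because of the interpolation method.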

Measure of Central Tendency

A measure of central tendency is popularly known as an average. Central tendency is defined


as the statistical measure that identifies a single value as representative of an entire distribution.

The commonly used measures of central tendency are


1. Mean
2. Median and
3. Mode

1. Arithmetic Mean
- The arithmetic mean (or mean) of a set of measurements is defined as the sum of the
measurements divided by the total number of measurements.
- The population mean is denoted by the Greek letter μ and the sample mean is denoted by
the symbol x̄.

Mean (x̄) = Σx / n

Properties of Mean
i. The sum of the deviations of a given set of observations from the arithmetic mean is equal
to zero.
Σ(x − x̄) = 0

ii. The sum of squared deviations of a set of observations is minimum when taken about the arithmetic mean.
Σ(x − x̄)² ≤ Σ(x − A)², for any value A

iii. If every value of the variable X is increased (or decreased) by a constant value, the arithmetic
mean of the observations so obtained also increases (or decreases) by the same constant.
If y = x + c, then ȳ = x̄ + c
If y = x − c, then ȳ = x̄ − c
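The three properties can be checked numerically; a quick sketch with an arbitrary sample:

```python
from statistics import mean

x = [4, 8, 15, 16, 23, 42]
m = mean(x)   # 18

# i. Deviations from the mean sum to zero
assert abs(sum(v - m for v in x)) < 1e-9

# ii. Squared deviations are smallest when taken about the mean
ss = lambda a: sum((v - a) ** 2 for v in x)
assert ss(m) < ss(m + 1) and ss(m) < ss(m - 1)

# iii. Adding a constant c to every value shifts the mean by c
assert mean([v + 5 for v in x]) == m + 5
```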

Advantages of mean
- The mean uses every value in the data and hence is a good representative of the data
(although, ironically, the mean itself often does not appear among the raw values).
- Repeated samples drawn from the same population tend to have similar means. The mean
is therefore the measure of central tendency that best resists the fluctuation between
different samples.
- It is closely related to standard deviation, the most common measure of dispersion.

Disadvantages
- The important disadvantage of mean is that it is sensitive to extreme values/outliers,
especially when the sample size is small. Therefore, it is not an appropriate measure of
central tendency for skewed distribution.
- The mean cannot be calculated for nominal or non-numeric ordinal data. Even though a mean can
be calculated for numerical ordinal data, it often does not give a meaningful value,
e.g. stage of cancer.

2. Median
- Median is the value which occupies the middle position when all the observations are
arranged in an ascending/descending order.
- It divides the frequency distribution exactly into two halves. Fifty percent of observations in a
distribution have scores at or below the median. Hence median is the 50th percentile.
- Median is also known as positional average

Median = value of the ((n + 1)/2)th observation (with the data arranged in order)

Advantages
- It is easy to compute and comprehend
- It is not distorted by outliers/skewed data
- It can be determined for ratio, interval, and ordinal scale

Disadvantages
- It does not take into account the precise value of each observation and hence does not use
all information available in the data.
- Unlike mean, median is not amenable to further mathematical calculation and hence is not
used in many statistical tests.
- If we pool the observations of two groups, median of the pooled group cannot be expressed
in terms of the individual medians of the pooled groups.

3. Mode
- Mode is defined as the value that occurs most frequently in the data.
- Some data sets do not have a mode because each value occurs only once. On the other
hand, some data sets can have more than one mode.
- Mode is rarely used as a summary statistic except to describe a bimodal distribution.

Advantages
- It is the only measure of central tendency that can be used for data measured in a nominal
scale.
- It can be calculated easily.

Disadvantages
- It is rarely used in statistical analysis, as it is not algebraically defined, and the fluctuation in the
frequency of observations is large when the sample size is small.

Selecting the appropriate measure


i. Mean is generally considered the best measure of central tendency and the most frequently
used one. However, there are some situations where the other measures of central
tendency are preferred.
ii. Median is preferred to mean when
- There are few extreme scores in the distribution.

- Some scores have undetermined values.


- There is an open-ended distribution.
- Data are measured in an ordinal scale.
iii. Mode is the preferred measure when data are measured in a nominal scale.
iv. Geometric mean is the preferred measure of central tendency when data are measured in a
logarithmic scale.

Empirical Relationship between mean, median and mode


i. In a symmetrical distribution the mean, median and mode are identical, i.e. have the same
value.
Mean = Median = Mode

ii. In the case of a moderately asymmetrical or skewed
distribution, the values of mean, median and mode are
observed to have the following empirical relationship:
Mode = 3 Median − 2 Mean

For a positively skewed distribution


Mean > Median > Mode

For a negatively skewed distribution


Mean < Median < Mode
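The ordering for a positively skewed distribution can be verified on a small, hypothetical right-skewed sample (a long tail to the right pulls the mean above the median, which sits above the mode):

```python
from statistics import mean, median, mode

# A hypothetical right-skewed (positively skewed) sample
data = [1, 1, 1, 2, 2, 3, 4, 20]

print(mode(data), median(data), mean(data))
assert mode(data) < median(data) < mean(data)   # Mode < Median < Mean
```

(The empirical relation Mode ≈ 3 Median − 2 Mean holds only approximately, and only for moderately skewed distributions, so it is not asserted here.)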

Measures of Dispersion

Measures of dispersion refer to the variability of data from the measure of central tendency.
Some commonly used measures of dispersion are
i. Range
- The range is the difference between the largest and the smallest observation in the data

Advantages
- It is independent of measures of central tendency and easy to calculate.

Disadvantages
- It is very sensitive to outliers and does not use all the observation in a data set.
- It is more informative to provide maximum and minimum value rather than providing range.

ii. Inter-quartile range


- The interquartile range is defined as the difference between the first and third quartiles.
- Hence, the inter-quartile range describes the middle 50% of the observations.
- If the interquartile range is large, it means that the middle 50% of observations are spaced
wide apart.
- Half the distance between Q1 and Q3 is called the quartile deviation (also known as the
semi-interquartile range).

Advantage
- It can be used as a measure of dispersion when the extreme values are not recorded
exactly (as in the case of open-ended class intervals in a frequency distribution).
- It is not affected by extreme values.
- It is useful for erratic or highly skewed distributions

Disadvantages
- It is not amenable to further mathematical manipulation
- It is very much affected by sampling fluctuations

iii. Standard Deviation


- Standard deviation is the most commonly used measure of dispersion. It is a measure of
spread of data about the mean.
- The standard deviation is the square root of the sum of squared deviations from the mean divided by
the number of observations.

Population SD: σ = √( Σ(x − μ)² / N )

Sample SD: s = √( ( Σx² − (Σx)²/n ) / (n − 1) )
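Both forms are available in Python's standard library: pstdev implements the population formula (divide by N) and stdev the sample formula (divide by n − 1). A small sketch with made-up data:

```python
import statistics

x = [2, 4, 4, 4, 5, 5, 7, 9]           # hypothetical data, mean = 5

sigma = statistics.pstdev(x)            # population SD: divides by N
s = statistics.stdev(x)                 # sample SD: divides by (n - 1)
print(sigma, round(s, 3))
```

For this data the sum of squared deviations is 32, so σ = √(32/8) = 2.0, while s = √(32/7) ≈ 2.138; the sample SD is always slightly larger.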

Advantages
- If the observations are from a normal distribution, SD serves as a basis for many further
statistical analyses.
- Along with mean it can be used to detect skewness.

Disadvantages
- It is an inappropriate measure of dispersion for skewed data.

Selection of Measures of dispersion


- SD is used as a measure of dispersion when the mean is used as a measure of central
tendency (i.e. for symmetric numerical data).
- For ordinal data or skewed numerical data, the median and interquartile range are used.

Some relative measures of dispersion


Relative Measure                        Formula
1 Coefficient of range                  (L − S)/(L + S)
2 Coefficient of quartile deviation     (Q3 − Q1)/(Q3 + Q1)
3 Coefficient of variation (CV)         SD/mean (often expressed as a percentage)

UNIT 3: PROBABILITY DISTRIBUTION


Probability is defined as the chance of an event occurring.

There are three basic interpretations of probability


i. Classical probability
ii. Empirical or relative frequency probability
iii. Subjective probability

i. Classical probability
- Classical probability assumes that all outcomes in the sample space are equally likely to
occur.
- The probability of any event E is the number of outcomes favourable to E divided by the
total number of outcomes in the sample space:

P(E) = n(E) / n(S)

ii. Empirical probability


- Empirical probability relies on actual experience (experimentation or historical data) to
determine the likelihood of outcomes.
- Given a frequency distribution, the probability of an event is its observed relative frequency:

P(E) = frequency of E / total number of trials = f/n

- For example, if in rolling a die 50 times the number one is obtained 14 times, then the
probability of obtaining a one in a trial is 14/50.
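Both interpretations can be illustrated with the die examples above, using exact fractions:

```python
from fractions import Fraction

# Classical: equally likely outcomes, P(E) = n(E)/n(S)
sample_space = list(range(1, 7))                    # one roll of a fair die
event = [x for x in sample_space if x % 2 == 0]     # "an even number"
p_classical = Fraction(len(event), len(sample_space))
print(p_classical)                                   # 1/2

# Empirical: observed relative frequency, P(E) = f/n
p_empirical = Fraction(14, 50)    # "one" turned up 14 times in 50 rolls
print(p_empirical)                # 7/25
```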

iii. Subjective probability


- Subjective probability uses a probability value based on an educated guess or estimate,
employing opinions and inexact information.
- Example: A physician might say that, on the basis of his diagnosis, there is a 30% chance that
the patient will die.

Axioms of Probability
i. The probability of any event A lies between 0 and 1,
i.e. 0 ≤ P(A) ≤ 1

ii. The sum of the probabilities of all the outcomes in a sample space is always 1,
i.e. ΣP(Eᵢ) = 1
P(S) = 1

iii. For any mutually exclusive events A and B, the probability that at least one of these events
occurs is equal to the sum of their respective probabilities.
P(A∪B) = P(A) + P(B)
a. Proposition 1
- The probability that an event does not occur is 1 minus the probability that the event occurs.
P(Aᶜ) = 1 − P(A)

b. Proposition 2
- For any non-mutually exclusive events A and B, the probability that at least one of these events
occurs is equal to the sum of their respective probabilities minus the probability of both events
occurring together.
P(A∪B) = P(A) + P(B) − P(A&B)

Conditional Probability
- The conditional probability of an event A in relation to an event B is defined as the
probability that event A will occur given that event B has already occurred.
- The conditional probability of A given B is equal to the probability of A&B divided
by the probability of B, provided the probability of B is not zero.

P(A|B) = P(A&B) / P(B),   P(B) ≠ 0
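A minimal sketch of the formula, using hypothetical figures:

```python
# P(A|B) = P(A & B) / P(B), provided P(B) > 0
# Hypothetical figures: 30% of patients both smoke and have hypertension (A & B);
# 50% of patients smoke (B).
p_a_and_b = 0.30
p_b = 0.50

p_a_given_b = p_a_and_b / p_b
print(p_a_given_b)   # P(hypertension | patient smokes) = 0.6
```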

Bayes Theorem
If E1, E2, E3, …, En are mutually disjoint events with P(Eᵢ) ≠ 0 (i = 1, 2, …, n), then for any
arbitrary event A which is a subset of ⋃Eᵢ:

P(Eᵢ|A) = P(Eᵢ)·P(A|Eᵢ) / Σⱼ P(Eⱼ)·P(A|Eⱼ)

If i = 1, 2, 3 then

P(E1|A) = P(E1)·P(A|E1) / [ P(E1)·P(A|E1) + P(E2)·P(A|E2) + P(E3)·P(A|E3) ]
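The three-cause special case can be computed directly; the priors and likelihoods below are hypothetical:

```python
# P(E1|A) = P(E1)P(A|E1) / sum_j P(Ej)P(A|Ej), with hypothetical figures
priors = [0.5, 0.3, 0.2]            # P(E1), P(E2), P(E3) - must sum to 1
likelihoods = [0.02, 0.03, 0.05]    # P(A|Ei), e.g. P(defective | machine i)

evidence = sum(p * l for p, l in zip(priors, likelihoods))   # denominator
posterior_e1 = priors[0] * likelihoods[0] / evidence
print(round(posterior_e1, 4))
```

Here the evidence term is 0.5·0.02 + 0.3·0.03 + 0.2·0.05 = 0.029, so the posterior for E1 is 0.010/0.029 ≈ 0.3448.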

Probability Distributions

i. Discrete probability distribution


- If X is a discrete random variable, then the probability function of X is a function of a real
variable x, denoted by f(x), defined by
f(x) = P(X = x) for all x
- The set of ordered pairs [xᵢ, f(xᵢ)] is called the probability distribution of the discrete random
variable X.
- Mean of a discrete probability distribution: μ = Σ[x·P(x)]
- Variance of a discrete probability distribution: σ² = Σ[(xᵢ − μ)²·P(x)]
For computation, the variance is σ² = Σ(xᵢ²pᵢ) − μ²
- Expectation: E(X) = mean

Properties/Requirements for discrete probability distribution


a. The sum of probabilities of all events in the sample space must be equal to 1.
i.e. ΣP(X=x) = 1
b. The probability of each event in the sample space must be between or equal to 0 and 1.
i.e. 0 ≤ P(X=x) ≤ 1
A random variable is a variable whose values are determined by chance.
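The mean and variance formulas for a discrete distribution can be sketched directly. The distribution below (number of heads in two fair coin tosses) is an illustrative example:

```python
# Sketch: mean and variance of a discrete probability distribution,
# using μ = Σ x·P(x) and σ² = Σ x²·P(x) − μ².

def mean_variance(dist):
    """dist: list of (x, p) pairs with Σp = 1."""
    mu = sum(x * p for x, p in dist)
    var = sum(x * x * p for x, p in dist) - mu ** 2
    return mu, var

# X = number of heads in two fair coin tosses (illustrative example)
dist = [(0, 0.25), (1, 0.50), (2, 0.25)]
mu, var = mean_variance(dist)
print(mu, var)  # 1.0 0.5
```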

ii. Continuous Probability Distribution


- The probability function of the continuous random variable X is a function f(x) of a real
variable x, called the probability density function.
- For a continuous random variable, P(X = x) = 0 for any single value x; probabilities are given
by areas under the density curve, i.e.
P(a ≤ X ≤ b) = ∫ₐᵇ f(x) dx

Properties of continuous probability distribution


a. f(x) ≥ 0 for all x
b. ∫ f(x) dx = 1, integrated over (−∞, +∞)

Binomial Distribution

- Binomial distribution is one of the most widely used discrete probability distributions in
applied statistics.
- This distribution is derived from a process known as a Bernoulli trial (after James Bernoulli).
- The random variable X is said to have a binomial distribution if its probability function is
given by
b(x; n, p) = P(X = x) = nCx p^x q^(n−x), x = 0, 1, 2, 3, …, n
where,
n= number of trials
p = probability of success
q = probability of failure = 1-p
x= number of success


Assumptions for Binomial Distribution


i. There must be a fixed number of trials.
ii. Each trial can have only two outcomes. These outcomes can be considered as either
success or failure.
iii. The probability of success must remain the same for each independent trial.
iv. The outcomes of each trial must be independent of one another.

Mean and Variance


The mean and variance of a variable that has the binomial distribution can be found by using
the following formula
Mean, μ = np
Variance, σ² = npq
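The binomial formulas above can be sketched with the standard library alone (the n = 10, p = 0.5 setup below is an illustrative assumption):

```python
# Sketch: binomial probability P(X = x) = nCx p^x q^(n-x),
# plus the mean np and variance npq.
from math import comb

def binom_pmf(x, n, p):
    q = 1 - p
    return comb(n, x) * p**x * q**(n - x)

n, p = 10, 0.5  # illustrative: 10 trials, success probability 0.5
print(binom_pmf(5, n, p))      # P(exactly 5 successes) → 0.24609375
print(n * p, n * p * (1 - p))  # mean and variance → 5.0 2.5
```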

Poisson Distribution

- Poisson distribution expresses the probability of a given number of events occurring in a


fixed interval of time and/or space, with known average rate.
- The random variable X is said to have a Poisson distribution with parameter λ, if its
probability function is given by
P(X = x) = f(x; λ) = (e^(−λ) λ^x) / x!, x = 0, 1, 2, 3, …
f(x) ≥ 0 for every x and Σf(x) = 1
where λ is the shape parameter, which indicates the average number of events in the given
time interval.
- Poisson distribution is useful when n is large and the probability of an event is small.
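The Poisson probability function can be sketched in a few lines; λ = 2 below is an assumed average event rate (e.g. cases arriving per day), chosen only for illustration:

```python
# Sketch: Poisson probability P(X = x) = e^(-λ) λ^x / x!, standard library only.
from math import exp, factorial

def poisson_pmf(x, lam):
    return exp(-lam) * lam**x / factorial(x)

lam = 2  # assumed average number of events per interval
print(poisson_pmf(0, lam))                          # P(no events in the interval)
print(sum(poisson_pmf(x, lam) for x in range(50)))  # ≈ 1, as the axioms require
```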

Assumptions for Poisson distribution


i. The occurrences of the events are independent.
ii. Theoretically, an infinite number of occurrences of the event must be possible in the interval.
iii. The probability of the single occurrence of the event in a given interval is proportional to the
length of the interval.
iv. In any infinitesimally small portion of the interval, the probability of more than one
occurrence of the event is negligible.
v. Mean and variance are equal i.e. mean = variance =

Practical Situations where Poisson law holds


i. Person contracting a rare disease.
ii. Certain drug having an uncomfortable reaction.
iii. Deaths occurring in hospital per day.
iv. No. of measles occurring in a location per year
v. No. of myocardial infarction cases arriving in a hospital per day.


Normal Distribution

- Normal distribution is the most widely used continuous probability distribution.


- It is frequently called the Gaussian distribution
- The normal distribution depends on mean and standard deviation. Mean defines the
center of the curve and standard deviation defines the spread.

Characteristics of Normal Distribution


- A normal distribution curve is bell-shaped.
- The mean, median and mode are all equal and located at the center of the distribution.
- A normal distribution is unimodal.
- The total area under the curve above x-axis is one square unit. This characteristic follows
from the fact that the normal distribution is a probability distribution.
- The area under the part of a normal curve that lies within 1 standard deviation of the mean
is 68.26%, within 2 standard deviations is 95.44%, and within 3 standard deviations is 99.73%.

Importance of Normal Distribution


- Normal distribution is the basis of sampling theory. With the help of normal distribution, one
can test whether the samples drawn from the population represent the population
satisfactorily or not.
- Large sample tests are based on the properties of normal distribution.
- Normal distribution is widely used in the study of natural phenomena such as birth rates,
blood pressure, etc.

Standard normal distribution


- If X is a normal variate with mean μ and variance σ², then the standard normal variate Z is
defined by,

Z = (X − μ) / σ

- Standard normal distribution is the distribution with mean μ = 0 and standard deviation σ = 1.
- It is denoted by Z ~ N(0, 1), which means that Z follows the normal distribution with mean 0
and standard deviation 1.
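The z-transformation and the 68–95–99.7 areas above can be checked numerically using the error function, since Φ(z) = (1 + erf(z/√2))/2 for the standard normal CDF:

```python
# Sketch: z-score and standard normal CDF via the error function.
from math import erf, sqrt

def z_score(x, mu, sigma):
    return (x - mu) / sigma

def std_normal_cdf(z):
    return (1 + erf(z / sqrt(2))) / 2

# Area within ±1, ±2, ±3 standard deviations of the mean:
for k in (1, 2, 3):
    area = std_normal_cdf(k) - std_normal_cdf(-k)
    print(k, round(area, 4))  # 0.6827, 0.9545, 0.9973
```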


UNIT 4: CORRELATION AND REGRESSION

Correlation

If two (or more) variables are so related that a change in the value of one variable is
accompanied by a change in the value of the other variable, then they are said to have
correlation. Hence, correlation analysis is defined as the statistical technique which measures
the degree (or strength) and direction of relationship between two or more variables.

E.g. a rise in body temperature is accompanied by a rise in pulse rate.

Correlation is different from Association


Association is the term used for assessing relationship between categorical variables.
Correlation is the term used for assessing relationship between continuous variables.

Types of correlation
1. Positive and Negative Correlation
i. Positive Correlation
- If two variables X and Y move in the same direction (i.e. if one variable rises, the other rises
and vice versa), then it is called positive correlation.
- Example: Gestational age against birth weight of baby

ii. Negative correlation


- If two variables X and Y move in opposite directions (i.e. if one variable rises, the other falls
and vice versa), then it is called negative correlation.

2. Linear and non-linear correlation


i. Linear correlation:
- If the ratio of change of two variables X and Y remains constant throughout, then they are
said to be linearly correlated.
- The graph of variables having such a relationship will form a straight line.

ii. Non-linear correlation


- If the ratio of change between the two variables is not constant but changing, correlation is
said to be non-linear.
- In case of non-linear correlation, values of the variable plotted on a graph will give a curve.

3. Simple, Partial and Multiple Correlation


i. Simple correlation (Bivariate)
- When we study the relationship between two variables only, then it is called simple
correlation.
- E.g. Relationship between age and diabetes.


ii. Partial correlation


- When three or more variables are taken but the relationship between any two variables is
studied, assuming the other variables constant, then it is called partial correlation.
- E.g. relationship between amount of rainfall and crop production under constant temperature.

iii. Multiple correlation


- When we study the relationship among three or more variables, then it is called multiple
correlation.
- E.g. relationship of crop production with amount of rainfall and temperature.

Bivariate Correlation
Many biomedical studies are designed to explore relationship between two variables and
specifically to determine whether these variables are independent or dependent.
E.g. Are obesity and blood pressure related?

Purpose of studying bivariate relationship


- To assess whether two variables are associated.
- To enable the value of one variable to be predicted from a known value of the other variable.

Methods of determining bivariate correlation


i. Scatter diagram
ii. Karl Pearson's correlation coefficient (r)
iii. Spearman's rank correlation coefficient (ρ)
iv. Kendall's tau-b (τ) with their significance levels

Scatter Diagram
- A scatter diagram is the graphic method of finding out correlation between two variables.
- By this method, the direction of correlation can be ascertained.
- For constructing a scatter diagram, the X-variable is represented on the X-axis and the
Y-variable on the Y-axis.
- Each pair of values of the X and Y series is plotted as a dot in the two-dimensional X-Y
space.
- The diagram formed by the bivariate data in this way is known as a scatter diagram.

The scatter diagram gives an idea about the direction and magnitude of correlation in the
following ways:
a. Perfect Positive Correlation (r = +1)
- If all points are plotted in the shape of a straight line passing from the lower left corner to the
upper right corner, then both series X and Y have perfect positive correlation.

b. Perfect Negative Correlation (r = -1)


- If all points are plotted in the shape of a straight line passing from the upper left corner to the
lower right corner, then both series X and Y have perfect negative correlation.


c. High degree of positive correlation


- When the concentration of points runs from the lower left corner to the upper right corner
and the points are close to each other, then X and Y have a high degree of positive correlation.

d. High degree of negative correlation


- When the concentration of points runs from the upper left corner to the lower right corner
and the points are close to each other, then X and Y have a high degree of negative correlation.

e. Zero correlation (r=0)


- When all points are scattered in four directions and are lacking any pattern, then there is
absence of correlation.

Demerits of Scatter Diagram


- This diagram does not give the degree of correlation between two variables. Thus strength
of correlation cannot be ascertained.

Karl Pearson's Correlation Coefficient


- This is the quantitative method of measuring the degree of correlation between two
variables.
- This method was given by Karl Pearson and, after his name, is known as Pearson's
coefficient of correlation.
- This is the best method of working out the correlation coefficient.
- It is denoted by r(x,y), r_xy, or simply r.
- It measures both the magnitude and direction of relationship.

Formula for calculating correlation coefficient


r = Cov(x, y) / (σx · σy)
Where,
Co-variance of (x, y) is Cov(x, y) = Σ(x − x̄)(y − ȳ) / n


i. Actual mean method (Product-moment method)


r = Σ(x − x̄)(y − ȳ) / √[Σ(x − x̄)² · Σ(y − ȳ)²]

ii. Actual data method (direct method)


r = [nΣxy − (Σx)(Σy)] / √[nΣx² − (Σx)²] · √[nΣy² − (Σy)²]
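The direct (actual data) method translates into code almost line for line; the x, y values below are made-up illustrative data:

```python
# Sketch of Pearson's r by the direct method:
# r = [nΣxy − ΣxΣy] / √[(nΣx² − (Σx)²)(nΣy² − (Σy)²)]
from math import sqrt

def pearson_r(x, y):
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    syy = sum(b * b for b in y)
    return (n * sxy - sx * sy) / sqrt((n * sxx - sx**2) * (n * syy - sy**2))

x = [1, 2, 3, 4, 5]   # illustrative data
y = [2, 4, 5, 4, 5]
print(round(pearson_r(x, y), 4))  # 0.7746
```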

Properties of correlation coefficient


i. Correlation coefficient (r) is a pure number and is independent of the units of measurement.
ii. The limit of Karl Pearson's correlation coefficient lies between -1 and +1.
Symbolically, −1 ≤ r ≤ +1
iii. Correlation coefficient between the two variables is symmetric.
i.e. r_xy = r_yx
iv. The correlation coefficient between the two variables is independent of change of origin.
i.e. r_xy = r_uv, where u = X − A and v = Y − B
v. The correlation coefficient between the two variables is independent of change of scale.
i.e. r_xy = r_uv, where u = (X − A)/h and v = (Y − B)/k
vi. The geometric mean of the two regression coefficients gives the value of the correlation
coefficient.
i.e. r = ±√(b_yx · b_xy)

Degree of correlation and interpretation of r


i. Perfect correlation
- If r = +1, it is perfect positive correlation
- If r = -1, it is perfect negative correlation

ii. High degree of correlation


- If 0.75 ≤ r ≤ 1, it has a high degree of correlation

iii. Moderate degree of correlation


- If 0.25 ≤ r ≤ 0.75, it has a moderate degree of correlation

iv. Low degree of correlation


- If 0 ≤ r ≤ 0.25, it has a low degree of correlation

v. No correlation
- If r =0, then there is no existence of correlation

Limitations of Pearson's correlation coefficient


i. It is affected by extreme values
ii. It does not demonstrate cause effect relationship (correlation does not imply causation)


iii. Directionality problem: It does not explain whether variable X causes a change in variable Y
or reverse is true.
iv. It is unstable with small sample sizes
v. It measures only a linear relationship.

Spearman's Rank Correlation Coefficient


- Pearson's correlation coefficient is very sensitive to outlying values. One approach is to rank
the two sets of variables X and Y separately and measure the degree of correlation between
the ranks. This is known as Spearman's rank correlation coefficient.
- This method of determining correlation was propounded by Prof. Charles Edward Spearman
in 1904.
- This method can be used for those types of variables where quantitative measurement is
not suitable but the values can be arranged in rank or order (e.g. intelligence).
- Spearman's rank correlation coefficient is denoted by ρ (rho) and given by

ρ = 1 − 6ΣD² / (n³ − n)

Where, D = R1 − R2
R1 = rank in the first series of data
R2 = rank in the second series of data
n = number of paired observations
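A minimal sketch of ρ = 1 − 6ΣD²/(n³ − n), assuming no tied ranks; the score pairs below are made-up illustrative data:

```python
# Sketch of Spearman's rank correlation for data without tied ranks.

def ranks(values):
    """Rank values from 1 (smallest) upward; assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman_rho(x, y):
    n = len(x)
    d2 = sum((rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n ** 3 - n)

x = [35, 23, 47, 17, 10]   # illustrative scores
y = [30, 33, 45, 23, 8]
print(spearman_rho(x, y))  # 0.9
```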

For Repeated ranks


When two or more items have equal values in a series, then in such case item of equal values
are assigned common ranks, which is the average of the ranks.
In such a case, we use a modified formula to determine the rank correlation coefficient:

ρ = 1 − 6[ΣD² + (m1³ − m1)/12 + (m2³ − m2)/12 + …] / (n³ − n)

Where,
m1 is the number of repetitions of the 1st tied value
m2 is the number of repetitions of the 2nd tied value

Properties of ρ
- It is less sensitive to outlying values than Pearson's correlation coefficient.
- It can be used when one or both of the relevant variables are ordinal.
- It relies on ranks rather than on actual observations.
- The sum total of rank differences (i.e. ΣD) is always equal to zero.
- The value of the rank correlation coefficient will be equal to the value of Pearson's coefficient
of correlation for the two characteristics taking the ranks as values of the variables, provided
no rank value is repeated.

Demerits
- This method cannot be used for finding correlation in grouped frequency distribution.


Kendall's Tau Rank Correlation Coefficient


- This measure was proposed by the British statistician Maurice Kendall.
- This is also a non-parametric measure of correlation, similar to Spearman's rank correlation
coefficient.
- The Kendall's tau rank correlation coefficient compares the ranks of the numerical values in
x and y, which means a total of n(n−1)/2 pairs to compare.
- A pair of observations is said to be concordant if the x and y values agree in order (the
observation with the larger x also has the larger y), and discordant if they are in opposite
order.
- Kendall's correlation typically has a lower absolute value than Spearman's correlation
coefficient.
- It is denoted by τ (tau) and given by

τ = (nc − nd) / [n(n − 1)/2]
Where,
nc = number of concordant pairs
nd = number of discordant pairs
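Kendall's τ can be sketched by brute-force comparison of all n(n−1)/2 pairs, assuming no ties; the data below are illustrative:

```python
# Sketch of Kendall's τ = (nc − nd) / [n(n−1)/2], no ties assumed.
from itertools import combinations

def kendall_tau(x, y):
    n = len(x)
    nc = nd = 0
    for i, j in combinations(range(n), 2):
        s = (x[i] - x[j]) * (y[i] - y[j])
        if s > 0:
            nc += 1   # concordant: x and y agree in order
        elif s < 0:
            nd += 1   # discordant: opposite order
    return (nc - nd) / (n * (n - 1) / 2)

x = [1, 2, 3, 4, 5]   # illustrative ranks
y = [3, 4, 1, 2, 5]
print(kendall_tau(x, y))  # 0.2
```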

Regression

- The term Regression was coined by Sir Francis Galton.


- Statistical regression is the study of the nature of the relationship between variables so that
one may be able to predict the unknown value of one variable from a known value of another
variable.
- In regression, one variable is considered as independent variable and another variable is
taken as dependent variable. With the help of regression, possible values of the dependent
variable are estimated on the basis of the values of the independent variable.

Purpose of regression
- To predict the value of dependent variable based on the value of an independent variable.
- To explain the impact on the dependent variable of every unit change in the
independent variable.
- To explain the nature of relationship between variables.

Types of variables in regression


1. Dependent Variable
- The variable we wish to explain, or the unknown value which is to be estimated with the help
of known values.
- It is also called the response or predicted variable.

2. Independent variable
- The variable used to explain or the known variable which is used for prediction is called
independent variable.
- It is also called explanatory variable.


Types of regression analysis


1. Simple and multiple regression
- In simple regression analysis, we study only two variables at a time, in which one variable is
dependent and another is independent.
- In multiple regression analysis, more than two variables are studied at a time, in which one
is dependent variable and others are independent variable.

2. Linear and non-linear regression


- When dependent variable changes with independent variable in some fixed ratio, this is
called a linear regression.
- When dependent variable varies with change in independent variable in a changing ratio,
then it is referred to as non-linear regression.

Simple Linear Regression Analysis


- It is a type of regression in which the relationship between two variables (one dependent and
one independent) is described by a linear function.
- Changes in Y (dependent) are assumed to be caused by changes in X (independent).

Assumptions underlying simple linear regression


Some basic assumptions underlying simple regression model are
1. Variables are measured without error (Nonstochastic): The mean of the probability
distribution of the random error e is 0. This assumption states that the mean value of y for a
given value of x is given by b0 + b1x.
2. Constant variance: The variance of the probability distribution of e is constant for all levels of
the independent variable, x. That is, the variance of e is constant for all values of x.
3. Normality: The probability distribution of error is normal.
4. Independence of error: The errors that are associated with two different observations are
independent of each other.
5. Linearity: The means of the subpopulations of Y all lie on the same straight line.

Cautions in regression analysis


- Outliers
- Non-linear relations
- Confounding
- Randomness

Regression lines
- The regression line shows the average relationship between two variables. It is also
known as the line of best fit.
- If two variables X and Y are given, then there are two regression lines related to them which
are as follows:

i. Regression line x on y (x depends on y)


- The regression line of X on Y gives the best estimate for the value of X for any given value
of Y.


ii. Regression line y on x


- The regression line y on x gives the best estimate for the value of Y for any given value of X.

Fitting regression line Y on X


Let X be the independent variable and Y be the dependent variable, then the simple linear
regression equation of y on x is given by

Generally the regression line is given by,


Y = a_yx + b_yx X + ε
Where,
Y = dependent variable
a_yx = population Y intercept = value of Y at X = 0 = constant
b_yx = population slope coefficient = regression coefficient of Y on X = change in Y for each unit
change in X = slope of regression line
X = independent variable
ε = random error

For simple regression, the random error is assumed to have mean zero. So, the estimate of the
population regression line is given by
Ŷ = a_yx + b_yx X

To fit the regression line we must find the unique values of a_yx and b_yx

For this we use the principle of least squares. Using this principle, derive a and b by solving the
following two normal equations:
ΣY = na + bΣX
And
ΣXY = aΣX + bΣX²
Alternatively,
The values can be obtained by,
b_yx = [nΣXY − (ΣX)(ΣY)] / [nΣX² − (ΣX)²]
a_yx = Ȳ − b_yx·X̄

After the calculation of the values of a_yx and b_yx, the fitted regression equation of Y on X
becomes

Ŷ = a_yx + b_yx·X

Interpretation of a and b in Y= a+bx


- a is the estimated average value of Y when the value of X is zero.
- b is the estimated change in the average value of Y as a result of one-unit change in X.
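The least-squares fit can be sketched directly from the slope and intercept formulas; the data below are illustrative:

```python
# Sketch: fitting Ŷ = a + bX by least squares using
# b = [nΣXY − ΣXΣY] / [nΣX² − (ΣX)²] and a = Ȳ − bX̄.

def fit_line(x, y):
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    b = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
    a = sy / n - b * sx / n
    return a, b

x = [1, 2, 3, 4, 5]   # illustrative data
y = [2, 4, 5, 4, 5]
a, b = fit_line(x, y)
print(round(a, 2), round(b, 2))   # intercept ≈ 2.2, slope ≈ 0.6
print(round(a + b * 6, 2))        # predicted Y at X = 6
```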


Properties of regression coefficient


i. The geometric mean of the two regression coefficients gives the correlation coefficient
i.e. r = ±√(b_xy · b_yx)
ii. Both the regression coefficients must have the same algebraic sign.
iii. The coefficient of correlation will have the same sign as that of the regression coefficients.
iv. The product of the two regression coefficients is equal to or less than one
i.e. b_xy · b_yx ≤ 1
v. The arithmetic mean of the two regression coefficients is either equal to or greater than the
correlation coefficient.
i.e. (b_xy + b_yx)/2 ≥ r
vi. The two regression lines intersect at the point (X̄, Ȳ)
vii. The regression coefficient between the variables is independent of change of origin but not
of scale.

Some important terms in correlation and regression

Probable Error
- The probable error of correlation coefficient helps in determining the accuracy and reliability
of the value of the coefficient in so far as it depends on the conditions of random sampling.
- Probable error of r is an amount which, if added to and subtracted from the value of r,
produces limits within which the coefficient of correlation in the population can be expected
to lie.
- The probable error of the coefficient of correlation is obtained as follows:

P.E. = 0.6745 × (1 − r²) / √N

Where, r is the coefficient of correlation and N is the number of pairs of observations.

Interpretation of Probable Error


- If the value of r is more than six times the probable error, the coefficient of correlation is
significant.
- If the value of r is less than the probable error, there is no evidence of correlation i.e. the
value of r is not at all significant.

Utility of Probable Error


- It is used to determine the reliability of coefficient of correlation.
- Probable error is used to interpret the value of the correlation coefficient.
If |r| > 6PE, then correlation coefficient is taken to be significant.
If |r| < 6PE, the coefficient of correlation is taken to be insignificant.
- Probable error also determines the upper and lower limits within which the correlation of
a randomly selected sample from the same universe will fall.
Symbolically, ρ (rho) = r ± P.E.


Conditions for use of Probable Error


- The data must approximate a normal frequency curve.
- The statistical measure for which P.E. is computed must have been calculated from a
sample.
- The sample must have been drawn at random.

Coefficient of determination
- The concept coefficient of determination is used for the interpretation of regression
coefficient.
- The coefficient of determination is also called r-squared and is denoted by r².
- The coefficient of determination explains the percentage variation in the dependent variable
Y that can be explained in terms of the independent variable X.
- It measures the closeness of fit of the regression equation to the observed values of Y.
- For example, if r is 0.9 then the coefficient of determination (r²) will be 0.81, which implies
that 81% of the total variation in the dependent variable (Y) occurs due to the independent
variable X. The remaining 19% of variation occurs due to other external factors.
- Thus the coefficient of determination is defined as the ratio of the explained variance to the
total variance.
- Coefficient of determination lies between 0 and 1 i.e. 0 ≤ r² ≤ +1
- When r² = 1, all the observations fall on the regression line.
- When r² = 0, none of the variation in Y is explained by the regression.


r² = Explained variance / Total variance

Coefficient of non-determination
- By dividing the unexplained variance by the total variation, the coefficient of non-
determination can be determined.
- Assuming the total of variation as 1, then the coefficient of non-determination can be
determined by subtracting the coefficient of determination from1.
- It is denoted by K².
K² = 1 − r²
- Suppose the coefficient of determination is 0.81; then the coefficient of non-determination
will be 0.19, which means that 19% of the variations are due to other factors.
- Coefficient of alienation is the square root of the coefficient of non-determination (= √(1 − r²))


Differences between correlation and regression


Meaning
- Correlation is a statistical measure which determines the co-relationship or association of
two variables.
- Regression describes how an independent variable is numerically related to the dependent
variable.

Usage
- Correlation is used to represent the linear relationship between two variables.
- Regression is used to fit a best line and estimate one variable on the basis of another
variable.

Origin and scale
- Correlation coefficient is independent of the change of origin and scale.
- Regression coefficient is independent of change of origin but not of scale.

Symmetry
- Correlation coefficients are symmetric: r_xy = r_yx.
- Regression coefficients are asymmetric: b_xy ≠ b_yx.

Dependent and independent variables
- For correlation, it is of no importance which of X and Y is the dependent variable and which
is the independent variable.
- For regression, it makes a difference which variable is dependent and which is independent.

Indicates
- Correlation coefficient indicates the extent to which the variables move together.
- Regression indicates the impact of a unit change in the known variable (X) on the estimated
variable (Y).

Multiple Linear Regression

- It is a type of regression in which the relationship between one dependent variable and more
than one independent variable is described by a linear function.
- Changes in Y (dependent) are assumed to be caused by changes in X1, X2, X3, X4, …
(independent).
- Multiple regression analysis is used when a statistician thinks there are several independent
variables contributing to the variation of the dependent variable.
- For example, if a statistician wants to see whether birth weight of a child is dependent on
gestational age, age of mother and antenatal visits, then multiple regression analysis may
be applicable.
Birth Weight = β0 + β1(GA) + β2(Age) + β3(ANC) + e
Where, GA = Gestational age
Age = Age of mother in yrs
ANC = Antenatal care

Assumptions of Multiple Linear Regression Analysis


The assumptions for multiple regression analysis are similar to those for simple regression
i. Normality: For any specific value of the independent variable, the values of y variable are
normally distributed.
ii. Equal variance: The variances for the y variables are the same for each value of the
independent variable.
iii. Linearity: There is a linear relationship between the dependent variable and the independent
variables.
iv. Non-multicollinearity: The independent variables are not correlated.
v. Independence: The values for the y variable are independent


Logistic Regression
- Logistic regression is a kind of predictive model that can be used when the dependent
variable is a categorical variable having two categories and the independent variables are
either numerical or categorical.
- Examples of categorical variables are disease/ no disease, smokers/non-smokers, etc.
- The dependent variable in the logistic model is often termed as outcome or target variable,
whereas independent variables are known as predictive variables.
- It provides answers to questions such as:
How does the probability of getting lung cancer change for every additional pound of
overweight and for every X cigarettes smoked per day?
Do body weight calorie intake, fat intake, and age have an influence on heart attacks
(yes vs. no)?

Purpose of Logistic Regression


- To make maximum likelihood estimation by transforming the dependent into logit variable.
- To model a nonlinear association in a linear way.
- To estimate the probability of a certain event occurring.
- To predict category of outcomes for single cases
- To predict group membership or outcome
- To compare two models with different number of predictors

Assumptions for logistic regression


i. Assumes a linear relationship between the logit of the dependent variable and the
independent variables. However, it does not assume a linear relationship between the actual
dependent and independent variables.
ii. The sample is large - reliability of estimation declines when there are only a few cases.
iii. Independent variables are not linear functions of each other.
iv. There should be no outliers in the data.
v. The dependent variable should be dichotomous in nature.
vi. There should be no high inter-correlations (multicollinearity).

Advantages of Logistic Regression


i. The independent variables are not required to be linearly related with the dependent
variable.
ii. It can be used with the data having non-linear relationship.
iii. The dependent variable need not follow normal distribution.
iv. The assumption of homoscedasticity is not required. In other words, no homogeneity of
variance assumption is required.
v. This model can use independent variables in any measurement scale.

Limitations
i. Outcome variable must always be discrete
ii. When continuous outcome is categorized or categorical variables are dichotomized, some
important information may be lost.


iii. Ratio of cases to variables: Using discrete variables requires that there are enough
responses in every given category.
- If there are too many cells with no responses, parameter estimates and standard errors will
likely blow up.
- It can also make groups perfectly separable (complete separation), which will make maximum
likelihood estimation impossible.

UNIT 5: SAMPLING THEORY, SAMPLING DISTRIBUTION AND ESTIMATION

Sampling theory is the field of statistics that is involved with the collection, analysis and
interpretation of data gathered from random samples of a population under study.

Objectives of sampling theory


- Statistical estimation
- Hypothesis testing

Principles of sampling theory


The main aim of sampling theory is to make the sampling more effective so that the answer to a
particular question can be given in a valid, efficient and economical way. The theory of sampling
is based on some important basic principles:
i. Principle of validity
- This principle states that the sampling design should provide a valid estimate of a population
value.
- Thus, the principle of validity ensures that there is some definite and preassigned probability
for each individual unit to be selected in the representative sample.

ii. Principle of statistical regularity


- This theory is based upon the following two conditions:
As the sample size increases, the true characteristics of the population are more likely to
be revealed.
The sample should be selected randomly in which each and every unit has an equal
chance of being selected.

iii. Principle of optimization


- This principle gives emphasis to obtaining optimum results with minimized total loss in terms
of cost and mean square error or sampling variance.

Concept of Descriptive and Inferential Statistics


i. Descriptive Statistics:
- Descriptive statistics consists of procedures used to summarize and describe the important
characteristics of a set of measurements.
- Examples: Measures of central tendency, Dispersion, Probability distribution, etc.


ii. Inferential Statistics


- Inferential statistics consists of procedures used to make inferences about population
characteristics from information contained in a sample drawn from this population.
- It is the act of generalizing from a sample to a population with calculated degree of certainty.
- The objective of inferential statistics is to make inferences (that is draw conclusions, make
predictions, make decisions) about the characteristics of a population from information
contained in a sample.

                     Parameter     Statistic
Source               Population    Sample
Notation for mean    μ             x̄
Notation for SD      σ             s
Vary                 No            Yes
Calculated           No            Yes

There are two forms of statistical inference


1. Estimation (point and interval)
- Estimate true value of the parameter from a sample
2. Hypothesis testing
- Assess the strength of evidence for/against a hypothesis

1. Estimation
Estimation is the statistical process by which population characteristics (i.e. parameters) are
estimated from the sample characteristics (i.e. statistics) with a desired degree of precision.

Types of estimation
i. Point Estimations
- A point estimate is a specific numerical value from a sample that estimates a parameter.
- The best point estimate of the population mean μ is the sample mean x̄.

ii. Interval estimation


- An interval estimate of a parameter is an interval or a range of values used to estimate the
parameter.
- This estimate may or may not contain the value of the parameter being estimated.
- It provides more information about a population parameter than a point estimate.
- Interval estimation is done by finding a confidence interval at a given level of precision.


Confidence Interval
A confidence interval is a range of values around a sample statistic within which the true
population value is assumed to lie at a given level of confidence.

The confidence level is the probability that the interval estimate will contain the parameter,
assuming that a large number of samples are selected and that the estimation process on the
same parameter is repeated. Confidence level generally used is 90%, 95% and 99%.

Confidence interval for the population mean can be calculated as


Confidence interval = point estimate ± (measure of confidence × standard error)
Confidence interval = point estimate ± margin of error
Confidence interval = x̄ ± Z(1-α/2) × σ/√n

Confidence Level (1-α)   Alpha Level (α)   Associated Z value (Z(1-α/2))


0.90                     0.10              1.65
0.95                     0.05              1.96
0.99                     0.01              2.58

As the length of the CI increases, it is more likely to capture μ. Therefore, the CI is longer
at 99% confidence than at 90% confidence.
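The calculation above can be sketched in Python. The figures (sample mean 120, σ = 15, n = 100) are hypothetical, and the Z values are the tabled ones from above:

```python
import math

def z_confidence_interval(mean, sigma, n, z=1.96):
    """CI = point estimate ± z × standard error (population sigma known)."""
    se = sigma / math.sqrt(n)   # standard error of the mean
    margin = z * se             # margin of error
    return mean - margin, mean + margin

# Hypothetical example: sample mean 120 mmHg, known sigma 15, n = 100
low, high = z_confidence_interval(120, 15, 100)               # 95% CI
low99, high99 = z_confidence_interval(120, 15, 100, z=2.58)   # 99% CI
print(round(low, 2), round(high, 2))       # (117.06, 122.94)
print(round(low99, 2), round(high99, 2))   # the 99% interval is wider
```

Note how the 99% interval is longer than the 95% one, as described above.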

Interval estimation for the population proportion can be calculated as


CI = p̂ ± Z(1-α/2) × √(p̂(1-p̂)/n)

When variance is not known


In practice, we rarely know σ. In this case we use s as an estimate of σ. This adds another
element of uncertainty to our inference, which is not captured by the Z statistic.
Therefore, we use a modification of Z based on Student's t distribution.

CI = x̄ ± t(n-1, 1-α/2) × s/√n
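A t-based interval can be sketched the same way. The glucose values below are hypothetical, and the critical value t(9, 0.975) = 2.262 is taken from a t table:

```python
import math
import statistics

def t_confidence_interval(sample, t_crit):
    """CI = x̄ ± t(n-1, 1-α/2) × s/√n, for unknown population sigma."""
    n = len(sample)
    xbar = statistics.mean(sample)
    s = statistics.stdev(sample)   # sample SD (n-1 denominator)
    margin = t_crit * s / math.sqrt(n)
    return xbar - margin, xbar + margin

# Hypothetical sample of 10 fasting glucose values; t(9, 0.975) = 2.262 from a t table
data = [92, 88, 95, 101, 97, 90, 94, 99, 93, 96]
low, high = t_confidence_interval(data, t_crit=2.262)
print(round(low, 2), round(high, 2))
```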

Points to remember
- Confidence interval applies only when a sample is selected by a probability sampling
technique and the population is normal or the sample is large.
- In addition, the CI does not account for practical problems such as:
Measurement error and processing error
Other selection biases


Properties of a good estimator


i. The estimator should be an unbiased estimator. That is, the expected value or the mean of
the estimates obtained from samples of a given size is equal to the parameter being
estimated.
ii. The estimator should be consistent. For a consistent estimator, as sample size increases,
the value of the estimator approaches the value of the parameter estimated.
iii. The estimator should be a relatively efficient estimator. That is, of all the statistics that can
be used to estimate a parameter, the relatively efficient estimator has the smallest variance.

Factors affecting width of confidence interval


i. Level of confidence:
- The level of confidence influences the width of the interval through means of t or Z value.
- Larger the level of confidence, the larger is the t or Z value and larger is the interval.

ii. Sample size:


- Bigger the sample size, the length of interval gets smaller.
- As the sample size increase the standard error decreases and the interval also gets smaller.

iii. Standard deviation


- Standard deviation directly related to the margin of error.
- Other things remaining constant, the greater standard deviation produces wider margin of
error.
- Therefore, confidence interval increases with increase in the standard deviation.

Why confidence interval is preferred over P-value?


- The advantage of confidence intervals in comparison to giving p-values after hypothesis
testing is that the result is given directly at the level of data measurement. Confidence
intervals provide information about statistical significance, as well as the direction and
strength of the effect. This also allows a decision about the clinical relevance of the results.
- Statistical significance must be distinguished from medical relevance or biological
importance. If the sample size is large enough, even very small differences may be
statistically significant. On the other hand, even large differences may lead to non-significant
results if the sample is too small. However, the investigator should be more interested in the
size of the difference in interventional effect between two study groups in public health
studies, as this is what is important for successful intervention, rather than whether the result
is statistically significant or not.

Sampling distribution
- It is a distribution obtained using the statistics computed from all possible random samples
of a specific size taken from a normal population.
- Sampling distribution is a theoretical concept.
- In practice it is too expensive to take many samples from a population. Simulation may be
used instead of many samples to approximate sampling distribution.


- Probability may be used to obtain an exact sampling distribution without simulation.


- Information from the sample is linked to the population via the sampling distribution.

Sampling distribution of Mean


Sampling distribution of mean is an important sampling distribution in statistics.
It is a distribution obtained using the means computed from all possible random samples of a
specific size taken from a normal population.

When sampling is from a normally distributed population, the distribution of the sample mean
will possess the following properties:
i. The sampling distribution of mean tends to be normal as sample size increases (Central
Limit Theorem).
ii. The mean obtained from the sampling distribution of means will be the same as the population
mean.
iii. The standard deviation of the sample means will be smaller than the standard deviation of
the population, and it will be equal to the population standard deviation divided by the
square root of the sample size. This is called standard error.

Central Limit Theorem


- For a non-normal population with mean μ and standard deviation σ, the sampling
distribution of means computed from this population will tend to become normal as the
sample size increases.
- This distribution will have mean μ and standard deviation σ/√n.
- The importance of central limit theorem is that it removes the constraint of normality in the
population.
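The theorem can be illustrated with a small simulation; the exponential parent population, sample size, and number of replications are arbitrary illustrative choices:

```python
import random
import statistics

random.seed(42)

# Skewed (non-normal) parent population: exponential with mean 10
POP_MEAN = 10

def sample_mean(n):
    """Mean of one random sample of size n from the exponential population."""
    return statistics.mean(random.expovariate(1 / POP_MEAN) for _ in range(n))

# Approximate the sampling distribution of the mean with 2000 simulated samples
means_n30 = [sample_mean(30) for _ in range(2000)]

# The mean of the sample means approaches the population mean (10), and their
# SD approaches sigma/sqrt(n); for this exponential, sigma = 10, so 10/sqrt(30) ≈ 1.83
print(round(statistics.mean(means_n30), 2))
print(round(statistics.stdev(means_n30), 2))
```

The standard deviation of these simulated means approximates the standard error σ/√n discussed below.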

Sampling and Non-sampling error


Sampling (random) error and non-sampling (systematic) error distort the estimation of
population parameters from sample statistics.

i. Sampling Error/ Random Error


- Sampling error is the difference between the sample
measure and the corresponding population measure due to
the fact that the sample is not a perfect representation of
the population.
- Random error occurs by chance and is the result of
sampling variability.
- Because of chance, different samples will produce different
results and therefore it must be taken into account when using a sample to make inferences
about a population. This difference is referred to as the sampling error and its variability is
measured by the standard error.
- The effect generated by random error can be corrected by increasing the size of the sample.
- Random selection is an effective way to reduce random errors.


ii. Systematic Error/Non-sampling error/Bias


- Systematic error refers to the tendency to consistently underestimate or overestimate a true
value.
- It appears equally in repeated measurements of the same quantity.
- Systematic errors cannot be eliminated by repeated measurement and averaging. It is also
difficult to identify all causes of systematic errors.
- The major sources of systematic error are selection and information (measurement) bias.
- Systematic error cannot be eliminated but can be reduced significantly by adopting various
measures depending on the nature of bias.
Calibrating measurement instrument
Blinding
Training interviewers and observers

Standard Error of Mean (SEM)


- To find out how close the sample mean is to the population mean, we find the standard
deviation of the distribution of means.
- This standard deviation of the sampling distribution of means is called the standard error of
mean.
- Standard error is not an error in a true sense.
- The standard error quantifies the precision of a sample mean.
- The standard error of mean is equal to standard deviation divided by the square root of
sample size

Standard Error of Mean = σ/√n

- As the formula shows, the standard error is dependent on the size of the sample; standard
error is inversely related to the square root of the sample size. Therefore, larger the n
becomes, the more closely will the sample means represent the true population mean.
- Also the standard error is influenced by the standard deviation and the sample size. The
greater the dispersion around the mean, greater is the standard error and less certain we
are about the actual population mean.


Sampling Techniques

Sampling is a statistical procedure of drawing a sample from a population with a belief that the
drawn sample will exhibit the relevant characteristics of the whole population.

Applications of sampling in public health


- Random sampling is the basic requirement for establishing causes-effect relationship.
- Use of appropriate sampling methods help generalize the findings of health research to the
entire population of interest.
- Sampling is useful to assure both internal and external validity of public health research.

Sampling techniques can be divided into two types:


i. Probability or random sampling
ii. Non-probability or non-random sampling

i. Probability Sampling
- Probability sampling is a method of drawing a sample so that all the units in the population
have equal probability of being selected as a unit of the sample.
- The advantage of probability method is that the sampling error of a given sample size can
be estimated statistically and therefore the samples can be subjected to further statistical
procedures.

ii. Non-Probability Sampling


- In non-probability sampling, the probability that a specific unit of the population
will be selected is unknown and cannot be determined.
- This technique is based on the judgment of the researcher.

Types of Probability Sampling


i. Simple Random Sampling
- This is the simplest and most common method in probability sampling.
- In this method, the units are selected in such a way that each and every unit of the population
has an equal chance of inclusion in the sample.
- If n is the number of units to be drawn from a population of size N, then each unit
has an n/N probability of being selected.
- Simple random sampling is mostly used when the elements of population are more or less
homogenous.
- Usually the selection of sample is done by lottery or random number table.
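A minimal sketch of drawing a simple random sample with a random-number generator; the 500-household frame is hypothetical:

```python
import random

random.seed(1)

# Sampling frame: a numbered list of N = 500 households
frame = list(range(1, 501))

# Draw a simple random sample of n = 20 without replacement
sample = random.sample(frame, 20)
print(sorted(sample))

# Each unit's probability of inclusion is n/N = 20/500 = 0.04
```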

Advantages
- Reduces selection bias as selection depends on chance.
- Relatively cheap compared to stratified random sampling
- Sampling error can be easily measured


Limitations
- Complete list of sampling frame is needed.
- This method may not always achieve best representatives.
- Units may be scattered
- Less suitable for large population

ii. Stratified Random Sampling


- A stratified sample is obtained by separating the population into non-overlapping groups
called strata and then obtaining a simple random sample from each stratum.
- The population is divided to make the elements within a group as homogenous as possible.
- In this method to get a higher precision, following points are to be examined carefully.
Formation of strata
No. of strata to be formed
Allocation of sample size in each stratum

- There are two types of stratified random sampling; proportionate and disproportionate
- In proportionate stratified sampling, the sample size from each stratum depends on the
size of that stratum. Therefore, larger strata are sampled more heavily, as they make up a larger
percentage of the target population.
- In disproportionate sampling, the sample selection from each stratum is independent of its
size.
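Proportionate allocation can be sketched as follows; the urban/rural strata and their sizes are invented for illustration:

```python
import random

random.seed(7)

# Hypothetical strata: population units grouped by stratum
strata = {
    "urban": list(range(0, 600)),      # 600 units
    "rural": list(range(600, 1000)),   # 400 units
}
total = sum(len(units) for units in strata.values())   # 1000
n = 50                                                 # overall sample size

# Proportionate allocation: each stratum contributes in proportion to its size
sample = {}
for name, units in strata.items():
    n_h = round(n * len(units) / total)   # 30 urban, 20 rural
    sample[name] = random.sample(units, n_h)

print({k: len(v) for k, v in sample.items()})
```

In disproportionate sampling, n_h would instead be fixed independently of stratum size.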

Advantages
- This method produces more representative samples.
- Facilitates comparison between strata and understanding of each stratum and its unique
characteristics.
- It is suitable for large and heterogeneous population.

Limitations
- It requires more cost, time and resources
- Stratification is a difficult process.

iii. Systematic Sampling


- In systematic sampling, only the first sample unit is selected at random and the remaining
units are automatically selected at a fixed equal interval according to some rule.
- Suppose N units of the population are numbered from 1 to N in some order. Then, the sampling
interval K = N/n is determined, where n is the desired sample size. The first item between
1 and K is selected at random, and every subsequent element is automatically selected at
intervals of K.
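The selection rule above can be sketched as follows (the values of N and n are hypothetical):

```python
import random

random.seed(3)

N, n = 1000, 50
k = N // n                     # sampling interval K = N/n = 20
start = random.randint(1, k)   # random start between 1 and K
sample = list(range(start, N + 1, k))
print(len(sample), sample[:5])
```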

Advantages
- This method is simple and easy.
- The selected samples are evenly spread in the population and therefore minimize chances
of clustered selection of subjects.


Limitations
- The method may introduce bias when elements are not arranged in random order.
- In some cases, complete sampling frame may be unavailable.

iv. Cluster Sampling


- In the cluster sampling, the population is divided into separate groups of elements called
clusters, in such a way that characteristics of elements within the clusters are
heterogeneous and between the clusters are homogeneous. The size of clusters may be
approximately equal or it differs.
- Then, simple random sampling or probability proportionate to size sampling technique is
applied to select the cluster.
- Clusters can also be naturally occurring homogenous groups such as villages, towns or
schools.
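One-stage cluster selection might look like this; the villages and household counts are hypothetical:

```python
import random

random.seed(5)

# Hypothetical frame of 40 villages (clusters), each with 50 households
villages = {f"village_{i}": [f"hh_{i}_{j}" for j in range(50)] for i in range(40)}

# One-stage cluster sampling: randomly select 5 whole clusters
chosen = random.sample(list(villages), 5)

# Every household in a selected cluster enters the sample
sample = [hh for v in chosen for hh in villages[v]]
print(len(chosen), len(sample))
```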

Advantages
- This method is faster, easier and cheaper
- It is useful when sampling frame is not available
- It is economical when study area is large

Limitations
- There are high chances of sampling error
- Over or underrepresentation of cluster can skew the result of the study.

v. Multi-stage sampling
- In multi-stage sampling, the selection of the sample is drawn in two or more stages.
- The population is divided into a number of first-stage units, from which a sample is drawn
at random.
- In the second stage, elements are randomly drawn from within each selected first-stage unit;
these are called second-stage units. The procedure can further be repeated for third and fourth
stages as required.
- The ultimate unit is called the unit of analysis
- Example:
First stage: Development regions
Second stage: Districts
Third stage: VDCs

Advantages
- It is quite convenient for a very large area
- Saves cost, time and resources
- Sample frame is required for only those which are selected.

Limitations
- This method may not always achieve representative samples.
- High level of subjectivity


Types of Non-Probability Sampling


i. Purposive or judgmental sampling
- In this method, the choice of element in the sample depends entirely on the judgment of the
investigator.
- Researchers might decide purposely to select subjects who are judged to be typical of the
population.
- This approach involves high degree of subjectivity. However, this method can be of
advantage in many situations. For example, purposive sampling is often used when
researchers want a sample of experts.

Advantages
- Useful when the sample size is small
- Applied when the number of elements in the population is unknown.

Limitations
- There are high chances of selection bias
- It is not a scientific method

ii. Convenience sampling


- Convenience sampling entails using most conveniently available people as participants.
- For example, a researcher selects as the sample those patients who happen to appear at the hospital.

Advantages
- It is useful for pre-testing questionnaires
- It is useful for pilot studies

Limitations
- Selected samples might be atypical of the population
- There are high chances of selection bias

iii. Quota Sampling


- This technique is similar to stratified random sampling, however instead of randomly
sampling from each stratum, the researcher uses a non-random sampling method to gather
data from one stratum until the desired quota of samples is filled.

iv. Snowball sampling


- Snowball sampling is used to reach target population where the sampling units are difficult
to identify.
- Under snowball sampling, each identified member of the target population is asked to
identify other sampling units who belong to the same target population.
- Snowball sampling would be used to identify successive sampling units, for example drug
addicts, sex workers, etc.
- The issues under investigation are usually confidential or sensitive in nature.


Determination of Sample Size

Assumptions in calculating sample size


- The sampling method used is random sampling.
- The proportion or variability in the population is known.
- The population is normally distributed

Sample size for estimating population mean


The formula for estimating sample size is given by
n = Z²σ² / d²

Where,
n = sample size
d = maximum allowable error or margin of error
σ = population standard deviation
Z = value of Z at the desired confidence level

Sample size for estimating population proportion


The formula for estimating sample size is given by

n = Z²P(1-P) / d²

Where,
n = sample size
d = maximum allowable error or margin of error
P = population proportion
Z = value of Z at the desired confidence level
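Both formulas can be sketched in Python; the inputs (Z = 1.96, σ = 15, d = 3, and P = 0.5 with d = 0.05) are illustrative:

```python
import math

def n_for_mean(z, sigma, d):
    """n = Z² σ² / d², rounded up to the next whole unit."""
    return math.ceil((z ** 2 * sigma ** 2) / d ** 2)

def n_for_proportion(z, p, d):
    """n = Z² P(1-P) / d², rounded up to the next whole unit."""
    return math.ceil((z ** 2 * p * (1 - p)) / d ** 2)

# Hypothetical examples at 95% confidence (Z = 1.96)
print(n_for_mean(1.96, sigma=15, d=3))         # estimate a mean to within 3 units -> 97
print(n_for_proportion(1.96, p=0.5, d=0.05))   # worst-case proportion, 5% margin -> 385
```

P = 0.5 maximizes P(1-P) and therefore gives the most conservative (largest) sample size.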

UNIT 6: Hypothesis Testing


A hypothesis is simply a statement or claim about a population parameter.

There are two types of statistical hypothesis


i. Null hypothesis
- The null hypothesis states that there is no difference between a parameter and a specific
value, or that there is no difference between two parameters.
- It is denoted by H₀.
- The null hypothesis may or may not be rejected.
- It always contains an =, ≤ or ≥ sign.

ii. Alternative Hypothesis


- An alternative hypothesis states that there exists a difference between a parameter and
a specific value, or that there is a difference between two parameters.
- It is the opposite of the null hypothesis.
- The alternative hypothesis never contains an =, ≤ or ≥ sign.
- This hypothesis is also known as the researcher's hypothesis.


Use of hypothesis testing in Public Health


- To test the efficacy of drug in a clinical trial
- To test the effectiveness of public health interventions (differences between pre-post
intervention)
- To establish causality.
- To test the effectiveness of screening and diagnostic tests.

Errors in hypothesis testing


There are usually two types of errors a researcher can make, Type I error and Type II error.

Type I error
- A type I error is characterized by the rejection of the null hypothesis when it is true and is referred to
by the alpha (α) level.
- Alpha level or the level of the significance of a test is the probability researchers are willing to take in
the making of a type I error.
- In public health research, alpha level is usually set at a level of 0.05 or 0.01.
- Type I error can be minimized by increasing the sample size.

Type II error
- Type II error is characterized by failure to reject a false null hypothesis.
- The probability of making a type II error is called beta (β), and the probability of avoiding a type II
error is called power (1-β).

                           Actual Situation
Decision                   H₀ true                  H₀ false

Reject H₀                  Type I error (α)         Correct decision (1-β)
Do not reject H₀           Correct decision (1-α)   Type II error (β)

Relationship between type I and Type II error


- It is important to point out that both Type I and Type II errors are always going to be there in the
decision making process. But Type I and Type II error cannot happen at the same time.
Type I error can occur only when H₀ is true
Type II error can occur only when H₀ is false
- If Type I error increases then Type II error decreases

Factors affecting type II error


- β increases when α decreases
- β increases when σ increases
- β increases when n decreases


Power of a Test
- The power of a statistical test measures the sensitivity of the test to detect a real difference
in parameter if one actually exists.
- The power of a test is a probability and like all probabilities, can have values ranging 0 to 1.
- The higher the power, the more sensitive the test is to detecting a real difference between
parameters if there is a difference.
- In other words, the closer the power of a test is to 1, the better the test is for rejecting the
null hypothesis, if the null hypothesis is, in fact false.
- The power of a test is equal to 1-β, that is, 1 minus the probability of committing a type II error. So the
power of a test depends upon the probability of committing a type II error.
- The power of a test can be increased by increasing the value of α. For example, instead of using α =
0.01, use α = 0.05.
- Another way to increase the power of a test is to select a larger sample size. A larger sample size
makes the standard error of the mean smaller and consequently reduces β.

Steps in hypothesis testing


i. Stating the null hypothesis
ii. Stating the alternative hypothesis
iii. Choosing the level of significance
iv. Choosing the sample size
v. Determining the appropriate statistical technique and the test statistic to use
vi. Finding the critical values at corresponding level of significance and determining the
rejection regions
vii. Collecting data and computing the test statistic
viii. Comparing the test statistic to the critical value to determine whether the test statistic falls in
the region of rejection. This can also be done by comparing P-value as appropriate.
ix. Making the statistical decision: Rejecting H 0 if the test statistic falls in the rejection region.
x. Expressing the decision in the context of problem

Level of Significance
- A level of significance is a threshold that demarcates statistical significance.
- Levels of significance are expressed in probability terms, and are denoted with the Greek letter
alpha (α).
- In tests of statistical significance, we use a cut-off point called the level of significance or α. It defines
the rejection region of the sampling distribution.
- It provides the critical values of the test. The results of a test are compared to these critical values
and are categorized as statistically significant or not statistically significant.
- The level of significance is anticipated by the researcher at the beginning.
- The most commonly used level of significance in public health studies are .01, .05 or .10.
- In another sense, the level of significance can also be viewed as the probability of making a type I
error.
- It is the margin that we use to tolerate type I error.


- When the level of significance is set to any value, we mean to say that α is the risk of making a
type I error that we are prepared to accept.

P-Value
- The P-value (or probability value) is the probability of getting a sample statistic (such as the
mean) or a more extreme sample statistic in the direction of the alternative hypothesis when
the null hypothesis is true.
- In other words, the P-value is the actual area under the standard normal distribution curve
representing the probability of particular sample statistic or a more extreme sample statistics
occurring if the null hypothesis is true.
- A P-value of 0.05 means that the probability of obtaining such a difference by chance would be 5 in
100.
- P-value is particularly important in determining the significance of the results in hypothesis
testing.

Decision rule in hypothesis testing using P-value


- If P-value ≤ α, reject the null hypothesis
- If P-value > α, do not reject the null hypothesis
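The rule is trivial to express in code; the P-values shown are hypothetical:

```python
def decide(p_value, alpha=0.05):
    """Decision rule: reject H0 when the P-value is less than or equal to alpha."""
    return "reject H0" if p_value <= alpha else "do not reject H0"

# Hypothetical P-values
print(decide(0.03))               # reject H0 at the default alpha = 0.05
print(decide(0.20))               # do not reject H0
print(decide(0.03, alpha=0.01))   # the same P-value fails at a stricter alpha
```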

Types of Tests of Hypothesis


i. Parametric tests
- Parametric tests are statistical tests for population parameters such as means, variances
and proportions that involve assumptions about the populations from which the samples
were selected.
- Common parametric tests include Z-test, T-test and F-test

Assumptions for parametric test


- Data must be normally distributed
- Samples must be drawn randomly from the population
- Homogeneity of variance: the variance should be similar in each group

ii. Non-parametric test


- Non-parametric tests are distribution free tests that do not rely on any sampling distribution
and use ordinal or nominal level data.
- Non-parametric tests are used when the data are non-normal or skewed.
- Non-parametric tests work on the principle of ranking data
- Common non-parametric tests include
Mann-Whitney U-test or Wilcoxon rank sum test,


Wilcoxon signed rank test


Kruskal-Wallis test
Friedman's test
Tests of Associations
o Proportion test
o Chi-square test
o Fisher's exact test
o McNemar test

Advantages of Non-Parametric test


- Nonparametric methods require no or very limited assumptions to be made about the format
of the data, and they may therefore be preferable when the assumptions required for
parametric methods are not valid.
- Nonparametric methods can be useful for dealing with unexpected, outlying observations
that might be problematic with a parametric approach.
- Nonparametric methods are intuitive and are simple to carry out by hand, for small samples
at least.
- Nonparametric methods are often useful in the analysis of ordered categorical data in which
assignation of scores to individual categories may be inappropriate.

Disadvantages
- Nonparametric methods may lack power as compared with more traditional approaches.
This is a particular concern if the sample size is small or if the assumptions for the
corresponding parametric method (e.g. Normality of the data) hold.
- Nonparametric methods are geared toward hypothesis testing rather than estimation of
effects. It is often possible to obtain nonparametric estimates and associated confidence
intervals, but this is not generally straightforward.
- Tied values can be problematic when these are common, and adjustments to the test
statistic may be necessary.
- Appropriate computer software for nonparametric methods can be limited, although the
situation is improving.

Difference between parametric and non-parametric tests


                       Parametric test    Non-parametric test
Scale of measurement   Interval/Ratio     Nominal/Ordinal
Distribution           Normal             Normal or not
Variance               Equal variance     Different variance
Sample size            Large              Small
Selection              Random sample      Random
Power                  More power         Less power


Parametric and Non-parametric tests


Sample                 Parametric       Non-parametric
One                    t-test           Kolmogorov-Smirnov
Two (independent)      t-test           Mann-Whitney
Two (dependent)        Paired t-test    Wilcoxon signed rank test
Three (independent)    ANOVA            Kruskal-Wallis
Three (repeated)       Repeated ANOVA   Friedman

Parametric Test

Z-test

The test statistic applied in the case of large samples is called the Z-test.

Fundamental assumptions for Z-test


- the underlying distribution is normal or the Central Limit Theorem can be assumed to hold
- the sample has been selected randomly
- the population standard deviation is known

Types of Z-test
1. Test of Significance of single population parameter
2. Test of significance of two population parameter

1. Test of significance of single population parameter


i. Test of significance of a single population mean
- In developing the test of significance for a single mean, we are interested in testing whether
μ = μ₀, i.e. the population has the specified mean value μ₀, that is,
the sample has been drawn from the given population with specified mean μ₀ and
variance σ².
- The test statistic (Z) for the test of significance of a single population mean is given by

Z = (x̄ - μ) / (σ/√n)

Where,
x̄ = sample mean
μ = population mean
σ = S.D of the population
n = sample size

Standard error of mean is given by σ/√n
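A sketch of the single-mean Z statistic with hypothetical blood-pressure figures:

```python
import math

def z_stat_mean(xbar, mu, sigma, n):
    """Z = (xbar - mu) / (sigma / sqrt(n))."""
    return (xbar - mu) / (sigma / math.sqrt(n))

# Hypothetical: is the mean systolic BP 120 (mu0) when xbar = 124, sigma = 15, n = 64?
z = z_stat_mean(124, 120, 15, 64)
print(round(z, 2))   # 2.13; |Z| > 1.96, so reject H0 at alpha = 0.05 (two-sided)
```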


ii. Test of significance of single population proportion


- The test statistic (Z) for the test of significance of a single population proportion is given by

Z = (p - P) / √(PQ/n)

Where,
p = sample proportion
P = population proportion
Q = 1-P
n = sample size

Standard error of proportion is given by √(PQ/n)

2. Test of significance of two population parameters

i. Test of significance of difference between two population means

Assumptions
the underlying distribution is normal or the Central Limit Theorem can be assumed to hold
the sample has been selected randomly
the population variances are known
two population means are equal

- The test statistic (Z) for testing the significance of the difference between two means is
given by

Z = (x̄₁ - x̄₂) / √(σ₁²/n₁ + σ₂²/n₂)

ii. Test of significance of difference between two population proportions


- The test statistic (Z) for testing the significance of the difference between two population
proportions is given by

Z = (p₁ - p₂) / √(P₁Q₁/n₁ + P₂Q₂/n₂)


T-test

If the sample size is small and the population S.D is not known, it is not appropriate to apply the
Z-test; instead, the sample S.D is used as an estimator of the population S.D. This adds another
uncertainty to our inference which cannot be captured by the Z-test. Hence, to address this
uncertainty in the inference we use a modification of the Z-test based on Student's t-distribution.

Assumptions for t-test


the underlying distribution is normal or the Central Limit Theorem can be assumed to hold
the samples are independent and randomly selected
the population variances are equal in both the groups
the independent variable is categorical and contains only two levels.

Types of t-test
i. One-sample test
ii. T-test for two independent (uncorrelated) samples (equal and unequal variances)
iii. T-test for paired samples
iv. T-test for significance of an observed sample correlation coefficient

i. One sample t-test


- When sampling is from a normal population with unknown variance, the test statistic (t) for
testing the significance of a single population mean is given by:

t = (x̄ - μ) / (s/√n)

Where,
x̄ = sample mean
μ = population mean
s = S.D of the sample
n = sample size
- The degrees of freedom are n-1

ii. T-test for two uncorrelated samples


- When sampling is from normal populations with unknown but equal population variances,
the test statistic for testing the significance of the difference between two uncorrelated samples
is given by

t = (x̄₁ - x̄₂) / √(s_p²(1/n₁ + 1/n₂))

Where,
s_p² = pooled variance = [(n₁-1)s₁² + (n₂-1)s₂²] / (n₁ + n₂ - 2)

- The degrees of freedom are n₁ + n₂ - 2
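The pooled-variance calculation can be sketched with hypothetical group summaries:

```python
import math

def pooled_t(x1bar, x2bar, s1, s2, n1, n2):
    """Two-sample t with pooled variance (equal-variance assumption)."""
    # Pooled variance: weighted average of the two sample variances
    sp2 = ((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2)
    t = (x1bar - x2bar) / math.sqrt(sp2 * (1 / n1 + 1 / n2))
    df = n1 + n2 - 2
    return t, df

# Hypothetical group summaries
t, df = pooled_t(x1bar=75, x2bar=70, s1=8, s2=9, n1=20, n2=25)
print(round(t, 2), df)
```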


iii. Paired t-test


- In the t-test for the difference of means, the two samples were independent of each other. However,
in a situation where the samples are pairwise dependent on each other, the paired t-test is used.
- For example: single sample taken before and after intervention (pre and post test)

Assumptions for paired t-test


The same set of samples is measured twice in different circumstances
The outcome variable should be continuous
Sample sizes are equal
The difference between pre-post measurements should be normally distributed
- The test statistic for the paired t-test is given by

t = d̄ / (s_d/√n)

Where,
d̄ = Σd / n

s_d² = [Σd² - (Σd)²/n] / (n - 1)

d = x - y
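A sketch of the paired t statistic on hypothetical pre/post measurements:

```python
import math

def paired_t(before, after):
    """t = dbar / (s_d / sqrt(n)) computed on differences d = x - y."""
    d = [x - y for x, y in zip(before, after)]
    n = len(d)
    dbar = sum(d) / n
    # s_d^2 = [sum(d^2) - (sum d)^2 / n] / (n - 1)
    s2 = (sum(v ** 2 for v in d) - (sum(d) ** 2) / n) / (n - 1)
    return dbar / math.sqrt(s2 / n), n - 1

# Hypothetical pre/post weights for 6 subjects
pre = [80, 75, 90, 68, 84, 77]
post = [78, 74, 85, 67, 80, 76]
t, df = paired_t(pre, post)
print(round(t, 2), df)
```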

iv. T-test for significance of an observed sample correlation coefficient


- Let r be the sample correlation coefficient of a sample of n pairs of observations from a
bivariate normal population with population correlation coefficient ρ.
- In order to test whether the sample correlation coefficient is significant or merely due to
fluctuation in sampling, the test statistic is calculated by

t = r√(n-2) / √(1-r²)

- The degrees of freedom are n-2
Note: The standard error of r is given by the formula:

S.E.(r) = (1-r²) / √(n-2)
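The calculation is a one-liner; the values r = 0.60 and n = 27 are hypothetical:

```python
import math

def t_for_r(r, n):
    """t = r * sqrt(n - 2) / sqrt(1 - r^2), with n - 2 degrees of freedom."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

# Hypothetical: r = 0.60 observed from n = 27 pairs
t = t_for_r(0.60, 27)
print(round(t, 2))   # 3.75; compare with a t table at 25 degrees of freedom
```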

Analysis of Variance (ANOVA)

- When a test is used to test a hypothesis concerning the means of three or more populations,
the technique is called analysis of variance (ANOVA).
- The test involved in ANOVA is called F-test.
- With analysis of variance, all the means are compared simultaneously.
- In ANOVA, two different estimates of the population variances are made
The first estimate is between group variance and it involves finding the variance of the
means.
The second estimate is within-group variance, which is made by computing the variance
using all the data and is not affected by difference in the means.


- If there is no difference in the means, the between-group variance estimate will be
approximately equal to the within-group variance estimate, and the F-test value will be
approximately equal to 1. The null hypothesis will not be rejected.
- However, when the means differ significantly, the between-group variance will be much
larger than the within-group variance; the F-test value will be significantly greater than 1 and
the null hypothesis will be rejected.
- Since variances are compared, this process is called analysis of variance.

Assumptions for ANOVA


- The samples are random and independent of each other
- The independent variable is categorical and contains more than two levels
- The distribution of dependent variable is normal. If the distribution is skewed, the ANOVA
may be invalid
- The groups should have equal variances

Hypotheses in ANOVA
For the test of difference among three groups or three means, the following hypotheses are
used:
i. H₀: The means of all groups are equal
   i.e. H₀: μ₁ = μ₂ = μ₃ = … = μk
ii. H₁: At least one mean is different from the others
    (ANOVA cannot say which group differs)
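The between-group and within-group variance estimates described above can be sketched in Python; the three groups of scores are hypothetical:

```python
def anova_f(*groups):
    """One-way ANOVA F = between-group mean square / within-group mean square."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    # between-group sum of squares: variance of the group means around the grand mean
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    # within-group sum of squares: unaffected by differences between the means
    ss_within = sum(sum((x - sum(g) / len(g)) ** 2 for x in g) for g in groups)
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))

# Hypothetical scores for three groups
f = anova_f([1, 2, 3], [2, 3, 4], [3, 4, 5])
```

An F value near 1 supports the null hypothesis; a value well above 1 (compared with the F critical value at k − 1 and N − k degrees of freedom) leads to rejection.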

Post-hoc Tests
When the null hypothesis is rejected using the F-test, the researcher may want to know where
the difference among the means is. Several procedures have been developed to determine
where the significant differences in the means lie after the ANOVA procedure has been
performed. Among the most commonly used tests are the Scheffé test and the Tukey test.

Scheffé Test
- To conduct the Scheffé test, it is necessary to compare the means two at a time, using all
possible combinations of means.
- For example, if there are three means, the following comparisons must be made:
        x̄₁ versus x̄₂
        x̄₂ versus x̄₃
        x̄₁ versus x̄₃
- This test uses the F-sampling distribution.
- This method is recommended when
        The sizes of the samples selected from the different populations are unequal
        Comparisons other than simple pairwise comparisons of means are of interest

Tukey Test
- The Tukey test can be used after the analysis of variance has been completed to make pairwise
comparisons between means when the groups have the same sample size.
- The symbol for the test statistic in the Tukey test is q (the studentized range statistic).


This method is applicable when


- The sample size from each group are equal
- Pair-wise comparisons of means are of primary interest.

Bonferroni Test

- The Bonferroni method is a simple method that allows many comparison statements to be
made (or confidence intervals to be constructed) while still assuring an overall confidence
coefficient is maintained.
- This method applies to an ANOVA situation when the analyst has picked out a particular set
of pairwise comparisons or contrasts or linear combinations in advance.
- The Bonferroni method is valid for equal and unequal sample sizes.
- A disadvantage of this procedure is that the true overall significance level may be much
smaller than the nominal value, making the individual tests conservative (less likely to reject).
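The Bonferroni adjustment itself is simple: divide the overall significance level by the number of comparisons. A sketch for k groups (the α and k values are illustrative):

```python
def bonferroni(alpha, k):
    """Per-comparison significance level for all k*(k-1)/2 pairwise tests at overall level alpha."""
    m = k * (k - 1) // 2          # number of pairwise comparisons among k groups
    return alpha / m, m

per_test_alpha, m = bonferroni(0.05, 4)   # 4 groups -> 6 pairwise comparisons
```

Each pairwise test is then carried out at the reduced level, which guarantees the overall error rate stays at or below α (often well below it, hence the conservatism noted above).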

Non-Parametric Tests

Mann Whitney U Test

- The Mann-Whitney U test is also called the Wilcoxon Rank Sum Test

- It is the non-parametric alternative to the independent-samples t-test.
- It is used to test whether two independent samples come from the same population, i.e.,
whether the two population medians are equal or not.
- It does not require equal sample sizes.
- Usually the Mann-Whitney U test is used when the data are ordinal.

Assumptions for Mann Whitney Test


- The sample drawn from the population is random and independent.
- Two probability distributions from which the samples are drawn are continuous.
- Measurement scale is ordinal

Applications
- In public health, it is used to compare the effects of two medicines, i.e., whether they are
equal or not.
- It is also used to assess whether or not a particular medicine cures an ailment.
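A minimal sketch of the U statistic via rank sums, assuming no tied values (ties require averaged ranks, omitted here); the two samples of responses are hypothetical:

```python
def mann_whitney_u(x, y):
    """U statistic via rank sums; simple sketch assuming no tied values."""
    ranks = {v: i + 1 for i, v in enumerate(sorted(x + y))}   # combined ranking
    n1, n2 = len(x), len(y)
    r1 = sum(ranks[v] for v in x)            # rank sum of the first sample
    u1 = r1 - n1 * (n1 + 1) / 2
    return min(u1, n1 * n2 - u1)             # smaller U is compared with the critical value

u = mann_whitney_u([12, 15, 17], [20, 22, 25])   # hypothetical responses to two medicines
```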

Wilcoxon Signed Rank Test

- The Wilcoxon signed rank test is a statistical comparison of the averages of two dependent samples.
- It works with metric (interval or ratio) data that are not normally distributed, or with
ranked/ordinal data.
- Generally it is the non-parametric alternative to the dependent-samples t-test (paired t-test).
- It tests the null hypothesis that the average signed rank of the differences between two
dependent samples is zero.


Assumptions
- Sample must be paired
- The sample drawn from the population is random and independent
- Continuous dependent variable
- Measurement scale is at least of ordinal scale
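The signed-rank statistic W can be sketched as below, assuming no zero differences and no tied absolute differences (real implementations handle both); the paired measurements are made up:

```python
def wilcoxon_w(x, y):
    """Signed-rank W = min(W+, W-); sketch assuming no zero or tied |differences|."""
    d = [a - b for a, b in zip(x, y) if a != b]      # drop zero differences
    ranked = sorted(d, key=abs)                      # rank differences by absolute size
    w_plus = sum(i + 1 for i, v in enumerate(ranked) if v > 0)
    w_minus = sum(i + 1 for i, v in enumerate(ranked) if v < 0)
    return min(w_plus, w_minus)

# Hypothetical post vs pre measurements on the same four subjects
w = wilcoxon_w([12, 17, 34, 45], [10, 20, 30, 40])
```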

Kruskal Wallis Test

- The Kruskal-Wallis test is a nonparametric (distribution free) test, and is used when the
assumptions of ANOVA are not met.
- Like the one-way ANOVA, it assesses significant differences on a continuous dependent
variable by a grouping independent variable (with three or more groups).
- In the ANOVA, we assume that distribution of each group is normally distributed and there is
approximately equal variance on the scores for each group. However, in the Kruskal-Wallis
Test, we do not have any of these assumptions.
- Like all non-parametric tests, the Kruskal-Wallis Test is not as powerful as the ANOVA.

Assumptions
- The sample drawn from the population is random and independent
- Measurement scale is at least of ordinal scale
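The Kruskal-Wallis H statistic, H = 12/[N(N+1)] · Σ(Rᵢ²/nᵢ) − 3(N+1), can be sketched as follows (no-tie assumption; group data are hypothetical):

```python
def kruskal_h(*groups):
    """Kruskal-Wallis H = 12/(N(N+1)) * sum(R_i^2 / n_i) - 3(N+1); sketch assuming no ties."""
    ranks = {v: i + 1 for i, v in enumerate(sorted(v for g in groups for v in g))}
    n_total = len(ranks)
    term = sum(sum(ranks[v] for v in g) ** 2 / len(g) for g in groups)
    return 12 / (n_total * (n_total + 1)) * term - 3 * (n_total + 1)

h = kruskal_h([1, 2], [3, 4], [5, 6])   # hypothetical scores in three groups
```

H is compared with a chi-square critical value at k − 1 degrees of freedom.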

Tests of Association

Chi-Square Test

- Chi-square test is a non-parametric statistic, also called distribution free test


- Chi-square test is used to test the counts of categorical data.
- The tests are of three types
Test of independence/ association (bivariate)
Test of homogeneity (univariate with two samples)
Test of goodness of fit (univariate)

Assumptions for Chi-Square Test


- Must be a random sample from population
- The data in the cells should be in raw frequencies, or counts of cases rather than
percentages.
- The levels (or categories) of the variables are mutually exclusive.
- The study groups must be independent. This means that a different test must be used if the
two groups are related (e.g paired samples)
- There are two variables and both are measured as categories usually at the nominal level.
However, data may be ordinal. Interval or ratio data that have been collapsed into ordinal
categories may also be used.
- The sample size should be more than 50. The expected value should be 5 or more in at
least 80% of the cells, and no cell should have an expected value of less than one.


Chi-Square Test for Independence


- The chi-square test for independence can be used to test the independence of two
qualitative variables when the data is in the form of counts.
- Hypotheses for the test of independence
        Null hypothesis: Variable A and Variable B are independent (there is no association)
        Alternative hypothesis: Variable A and Variable B are not independent (there is an
        association)

- In order to analyze the sample data for the test of independence, we find the degrees of
freedom, expected frequencies and test statistic using a contingency table (e.g. a four-fold
2×2 table).
        Degrees of Freedom: The degrees of freedom are equal to:
        df = (r − 1)(c − 1)
        Where r is the number of levels for one categorical variable and c is the number of
        levels for the other categorical variable.

        Expected Frequencies: The expected frequency counts are computed separately for
        each cell using the formula

        E(r,c) = (row total × column total) / N

        Test Statistic: The test statistic (χ²) is defined by the equation

        χ² = Σ (O − E)² / E

        χ² = Σ (O(r,c) − E(r,c))² / E(r,c),    df = (r − 1)(c − 1)
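The expected-frequency and χ² computations above can be sketched for any r × c table of counts (the 2×2 counts below are hypothetical):

```python
def chi_square(table):
    """Chi-square for an r x c table of observed counts; returns (chi2, df)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n   # E = (row total * column total) / N
            chi2 += (observed - expected) ** 2 / expected
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return chi2, df

# Hypothetical 2x2 table: exposure (rows) by disease status (columns)
chi2, df = chi_square([[10, 20], [20, 10]])
```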

Yates' Continuity Correction

- The continuity correction is always advisable, although it has most effect when the expected
numbers are small (usually <5).
- In order to apply the Yates correction, 0.5 is subtracted from the absolute difference
between the observed and expected frequencies:

        χ² = Σ (|O(r,c) − E(r,c)| − 0.5)² / E(r,c)

Chi-Square Test for Homogeneity


- The test is applied to a single categorical variable from two (or more) independent populations.
- It is used to determine whether frequency counts are distributed identically (homogeneously)
across the different populations.
- Hypotheses for the test of homogeneity
        Null Hypothesis: The two (or more) distributions are the same
        Alternative Hypothesis: The distributions are different
(Note: The assumptions and computation of test statistic for chi-square is similar to the test of
independence)


Chi-Square Test of Goodness of Fit


- This test is applied when we have one categorical variable from a single population.
- It is used to determine whether the observed sample frequencies are consistent with a
hypothesized distribution.
- Hypotheses for the test of goodness of fit
        Null Hypothesis: The data are consistent with a specified distribution
        Alternative Hypothesis: The data are not consistent with a specified distribution
- The degrees of freedom are k − 1, where k is the number of categories.

Fisher's Exact Test

- The Fisher exact test is a test of significance that is used in place of the chi-square test in
2×2 tables, especially in cases of small samples.
- The Fisher test is recommended when
        The data are on a nominal scale
        It is a 2×2 table with expected frequencies less than 5
        The overall total of the table is less than 20, or is between 20 and 40 with small
        expected frequencies
- The Fisher exact test uses the following formula:

        p = (a+b)! (c+d)! (a+c)! (b+d)! / (N! a! b! c! d!)

Where a, b, c and d are the individual frequencies of the 2×2 contingency table and N is the
total frequency
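The factorial formula above gives the exact probability of one particular 2×2 table under fixed margins; a sketch (the full Fisher test sums such probabilities over all tables as extreme or more extreme, which is omitted here, and the counts are hypothetical):

```python
from math import factorial

def fisher_p(a, b, c, d):
    """Exact probability of a single 2x2 table, per the factorial formula above."""
    n = a + b + c + d
    num = factorial(a + b) * factorial(c + d) * factorial(a + c) * factorial(b + d)
    den = factorial(n) * factorial(a) * factorial(b) * factorial(c) * factorial(d)
    return num / den

p = fisher_p(1, 2, 3, 4)   # hypothetical 2x2 counts
```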

McNemar's Chi-Square Test

- McNemar's chi-square test is used for comparing two proportions which are paired.
- This non-parametric test assesses whether a statistically significant change in proportions
has occurred on a dichotomous trait at two points in time on the same population.
- For example, if a researcher wants to determine whether or not a particular intervention has
an effect on a disease (yes or no), a count of the individuals is recorded in a 2×2 table
before and after the intervention, and McNemar's test is applied to make a statistical
decision as to whether or not the intervention has an effect on the disease.
- Hypotheses:
        Null Hypothesis: The intervention has no impact on the disease
        Alternative Hypothesis: The intervention has an impact on the disease

                                 Intervention 2
                                   +      −
        Intervention 1     +       a      b
                           −       c      d

- The test statistic (with continuity correction) can be calculated as:

        χ² = (|b − c| − 1)² / (b + c),    df = 1
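The statistic uses only the discordant pairs b and c; a one-line sketch with hypothetical counts:

```python
def mcnemar_chi2(b, c):
    """McNemar chi-square with continuity correction on the discordant pairs b and c; df = 1."""
    return (abs(b - c) - 1) ** 2 / (b + c)

chi2 = mcnemar_chi2(b=10, c=20)   # hypothetical discordant-pair counts
```

The result is compared with the chi-square critical value at 1 degree of freedom.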


STATISTICAL SOFTWARE IN BIOSTATISTICS

Introduction to Various Statistical Software

Database: A database is a collection of information that is organized so that it can easily be


retrieved, managed and updated. The databases can be physical (paper/prints) as well as
electronic.
Classification of Statistical Software:
Classification on the basis of methodological capabilities
        Basic: Excel, Access
        Intermediate: Epi Info, Epidata, SPSS
        Advanced: SAS, Stata, R

Classification on the basis of cost

        Freeware: Epidata Entry, Epi Info, R, CS Pro
        Commercial: SPSS, Stata, SAS, Minitab

Some of the common statistical software that can be used in health sciences are
i. Excel
- Excel is a spreadsheet developed by Microsoft.
- It features calculation, graphing tools, pivot tables and a macro programming language.

Advantages
- Easy to use for data entry and data storage
- Relatively easy to use for basic descriptive statistics
- Excel has some built in basic analysis tools (e.g. t-test, correlation, chi-squared tests)

Disadvantages
- Excel has no restriction on data type storage.
- Excel allows multiple user-errors to slip through the gaps

ii. Epi info


- Epi info is a public domain, free software package designed for the global community of
public health practitioners and researchers.
- It allows for easy form and database construction, data entry, and analysis with
epidemiologic statistics, maps and graphs.
- It allows for electronic survey creation, data entry and analysis.
- With Epi info one can rapidly develop a questionnaire or form, customize the data entry
process, and enter and analyze data.


Advantages
- Freeware
- Easy to select subsets of data for analysis, without having to delete records or make
multiple copies of datasets.
- Performs both descriptive statistics and a lot of basic to intermediate analysis (e.g.
comparison of means, proportions; many regression methods, etc.)
- Keeps a savable record of the analysis steps that have been performed.
- Ability to rapidly develop a questionnaire

Disadvantages
- Runs under Windows only
- Limited analysis options beyond the basic methods
- Graphics can look quite sloppy: good for interpretation but not so good for scientific
publications.
- Not a dedicated statistical package

iii. Epidata
- Epidata is a computer program for simple or programmed data entry and data
documentation.
- It is highly reliable.

iv. SPSS
- It is a widely used statistical package in social sciences, marketing, education and public
health.
Features
- Menu-driven statistical software, but a scripting (syntax) language is also available for typing
commands or creating reusable scripts
- Plug-ins are available for other programming languages, such as JAVA, python, E, and VB.
- Can take data from almost any type of file
- Separate Data view and variable view tabs in the worksheet
- Separate output files that can be customized.

Advantages
- Good range of statistics from descriptive methods (mean, median, frequencies, etc.) through
to common tests (t-tests, regression, ANOVA, etc.) and some more advanced statistical
measures (e.g. factor analysis)
- Can produce some nice looking graphs.
- New variables can be added to worksheet or created using formula
- Easy to use with powerful statistical functions, making it useful in academic fields.

Disadvantages
- Expensive
- Can only analyze one data set at a time
- Can be a bit rigid with regard to advanced options for tests.


v. SAS
SAS is commercial software with advanced methodological capabilities.
Advantages
- Pretty much industry standard.
- Used widely in medical and pharmaceutical industry.
- Is very adept at data manipulation (e.g. counting elapsed days between two dates) as well as
analysis.
- Range of procedures from descriptive statistics to simple analysis and on to complex
analysis
- Usually good help files, with many worked examples are available.
- Can write programmes to automate some time consuming processes.
- Graphical user interfaces are available that can help with analysis although complex tasks
may only be possible through scripting.

Disadvantages
- SAS is a script-based programming language.
- There is a steep learning curve at the start, even for very simple analyses
- Essentially a programming language; can be tricky and intimidating.
- Some versions are not 100% compatible

vi. Stata
- STATA is a powerful statistical package that provides comprehensive solutions for data
analysis, data management and graphics.

Advantages
- Performs a large number of statistical analyses
- Has some quick to use commands that give results for simple questions quickly.
- Has a large amount of example data available within the package as well as online.
- Large number of downloadable extensions are available that can be used to do more
complex analysis/data presentation.
- The main Stata files are frequently updated to fix bugs and errors in the programme.

Disadvantages
- Help files are frequently difficult to understand.
- Less flexible than SAS in terms of data manipulation

vii. R software
- R is free open-source software with a programming language for statistical computing and
graphics.
- R provides a wide variety of statistical analyses (linear and non linear modeling, classical
statistical tests, time-series analysis, classification, clustering, etc.)

Advantages
- Freeware


- Highly flexible scripting based language


- Downloadable packages freely available online that have been developed to perform
specialized statistical analyses.
- Excellent graphing capabilities- highly flexible output, easy to overlay multiple graphs in the
same figure and figure can be customized.
- Works on any operating system
- Can be installed on removable media (e.g. a USB flash drive) and run on any computer

Disadvantages
- R is a scripting based language and therefore has a steep learning curve. More complex to
learn than SAS or Stata
- Help files vary in usefulness; the command structure can often be complicated to
understand
- Getting results quickly out of R (copying tables to paste in word) can be complicated.

Data Management in Epidata


Data Management in SPSS

The overall process of data management in SPSS involves the following steps:

1. Data Collection and Preparation


i. Collecting Data
- Data for analysis may come from a survey or from available existing data.
- For data entered in other softwares such as epidata or excel, it is necessary to import those
data to SPSS format (.sav).

ii. Preparing SPSS Codebook and setting up structure of data


- The codebook may include, label versus name, type of variables, values of a variable,
missing values, type of measurement, etc.
- In the variable view tab, the required variables are defined based on codebook.

iii. Entering data


- Before entering data, it is essential to customize various options in SPSS (going to
Edit>>Option). The option settings may be changed for output notification, output labels,
charts formats, table formats, etc.
- In the data view tab, the data are entered in the respective variables set before. The data
may also be entered using excel.
- In some cases, merging data file (adding cases/variables) or sorting or splitting data file,
may also be required.

iv. Screening and cleaning data


a. Checking for errors
- For categorical variables, following checks are performed to screen and clean data.
Analyze>>descriptive statistics>>frequencies
Statistics: min/max


- For numerical variables


Analyze>descriptive statistics>descriptive
Options: mean, standard deviation, minimum and maximum

b. Finding and correcting the errors


- Sorting cases
- After identification of errors, missing values, and true (extreme or normal) values, they
should be corrected, deleted or left unchanged.

2. Exploration of data
- For categorical data: frequencies
- For numerical data: mean, standard deviation, minimum, maximum, skewness, kurtosis, etc.
- Graphs: histograms, boxplot, bar charts, scatterplots, line graphs, etc.

3. Data Analysis
a. Exploring relationships among variables
- Following analysis are performed by using respective tabs in SPSS window.
Crosstabulation/ chi-square
Correlation
Regression/Multiple regression
Logistic regression
Factor analysis

b. Comparing Groups
- Following analysis can be performed
Non parametric statistics
T-tests
One-way analysis of variance (ANOVA)
Two-way between groups ANOVA
Multivariate analysis of variance (MANOVA)

Finally, the output files from analyses are saved or archived for future reference. The results of
analysis are used and presented according to the objective of the study.

Miscellaneous

Data Cleaning and Editing


Data cleaning: It is the process of detecting, diagnosing, and editing faulty data.
Data editing: It is the process of changing the value of data shown to be incorrect.

Data cleaning intends to identify and correct the errors in data or at least to minimize their
impact on study results.
There are three processes involved in data cleaning and editing:


i. Screening
- During screening four basic types of oddities should be distinguished: lack or excess of
data; outliers, including inconsistencies; strange patterns in (joint) distributions; and
unexpected analysis results and other types of inferences and abstractions.
- For this, data can be examined with simple descriptive tools. For example, in a statistical
package, analyzing range, minimum and maximum values can help detect outliers.
Frequency measure may provide information with excess or lack of data.
- Screening methods:
Checking questionnaire using fixed algorithm
Range checks
Graphical exploration of distribution (histogram, box plot)
Frequency distribution

ii. Diagnosis
- In this phase, the purpose is to clarify the true nature of the worrisome data points, patterns,
and statistics.
- Possible diagnoses for each data point are: erroneous, true extreme, true normal,
or idiopathic (no explanation found, but still suspect).

iii. Treatment/ Editing


- After identification of errors, missing values, and true (extreme or normal) values, they
should be corrected, deleted or left unchanged.
- Impossible values are never left unchanged, but should be corrected if a correct value can
be found, otherwise they should be deleted.
- Missing values may be identified from the data collection tool and entered.


Important Formulae for Biostatistics

Probability
Conditional Probability:        P(A|B) = P(A and B) / P(B)
Bayes' Theorem:                 P(A₁|B) = P(A₁)·P(B|A₁) / [P(A₁)·P(B|A₁) + P(A₂)·P(B|A₂) + P(A₃)·P(B|A₃)]

Discrete Probability:           Expectation = Mean = Σ[x·P(x)]
                                Variance = σ² = Σ[xᵢ²·pᵢ] − (Σ[xᵢ·pᵢ])²
Binomial Distribution:          f(x) = nCx · pˣ · qⁿ⁻ˣ
Mean and Variance of
Binomial Distribution:          Mean μ = np; Variance σ² = npq
Poisson Distribution:           f(x; λ) = e⁻λ · λˣ / x!
Mean and Variance of
Poisson Distribution:           Mean = Variance = λ
Normal Distribution:            z = (x − μ) / σ

Correlation and Regression
Karl Pearson's Coefficient
of Correlation:                 r = [nΣxy − (Σx)(Σy)] / {√[nΣx² − (Σx)²] · √[nΣy² − (Σy)²]}
Spearman's Rank Correlation
Coefficient (non-repeated):     ρ = 1 − 6ΣD² / (n³ − n), where D = R₁ − R₂
Spearman's Rank Correlation
Coefficient (repeated ranks):   ρ = 1 − 6[ΣD² + (m₁³ − m₁)/12 + (m₂³ − m₂)/12 + …] / (n³ − n)
Regression Coefficient
(Least Squares Method):         Solve the two normal equations for a and b (b = regression
                                coefficient):
                                Σy = na + bΣx
                                Σxy = aΣx + bΣx²

Estimation
Interval Estimation for
Population Mean:                When population variance known: CI = x̄ ± z(1−α/2) · σ/√n
                                When population variance unknown: CI = x̄ ± t(n−1, 1−α/2) · s/√n
Interval Estimation for
Population Proportion:          CI = p̂ ± z(1−α/2) · √(p̂q̂/n)
Standard Error of Mean:         SEM = σ/√n
Standard Error of Proportion:   SEP = √(pq/n)

Sample Size Estimation
Sample Size for population
mean:                           n = z²σ² / e², where e is the allowable error
Sample Size for population
proportion:                     n = z²pq / e²

Hypothesis Testing
Z-Test
Single population mean:         z = (x̄ − μ) / (σ/√n)
Single population proportion:   z = (p̂ − P) / √(PQ/n)
Two population means:           z = (x̄₁ − x̄₂) / √(σ₁²/n₁ + σ₂²/n₂)
Two population proportions:     z = (p̂₁ − p̂₂) / √(p₁q₁/n₁ + p₂q₂/n₂)

T-test
One-sample test:                t = (x̄ − μ) / (s/√n)
Two uncorrelated samples:       t = (x̄₁ − x̄₂) / √[Sp²(1/n₁ + 1/n₂)]
                                Sp² = pooled variance = [(n₁−1)s₁² + (n₂−1)s₂²] / (n₁ + n₂ − 2)
                                Degrees of freedom: n₁ + n₂ − 2
Paired t-test:                  t = d̄ / (s_d/√n)
                                d̄ = Σd/n; s_d² = [Σd² − (Σd)²/n] / (n − 1); d = x − y

Chi-Square Test:                χ² = Σ (O(r,c) − E(r,c))² / E(r,c), with df = (r − 1)(c − 1)