Biostatistics
Prabesh Ghimire
Biostatistics MPH 19th Batch
Table of Contents
UNIT 1: INTRODUCTION TO BIOSTATISTICS
Biostatistics and its Role in Public Health
UNIT 2: DESCRIPTIVE STATISTICS
Variables
Scales of Measurement
Diagrammatic and Graphic Presentation
Measures of Central Tendency
Measures of Dispersion
UNIT 3: PROBABILITY DISTRIBUTION
Probability Distributions
Binomial Distribution
Poisson Distribution
Normal Distribution
UNIT 4: CORRELATION AND REGRESSION
Correlation
Regression
UNIT 5: SAMPLING THEORY, SAMPLING DISTRIBUTION AND ESTIMATION
Sampling Techniques
Determination of Sample Size
UNIT 6: HYPOTHESIS TESTING
Parametric Tests
Z-test
T-test
Analysis of Variance (ANOVA)
Scheffe Test
Tukey Test
Bonferroni Test
Non-Parametric Tests
Mann Whitney U Test
Wilcoxon Signed Rank Test
Biostatistics is the branch of statistics responsible for interpreting the scientific data that is
generated in the health sciences, including the public health sphere.
- In essence, the goal of biostatistics is to disentangle the data received and make valid
inferences that can be used to solve problems in public health.
Variables
Concept of Variables
If a characteristic takes on different values in different persons, places, or things, we label the
characteristic as a Variable.
Some examples of variables include diastolic blood pressure, heart rate, the heights of male
adults, the weights of under-5 years children, the ages of patients in OPD.
Types of Variables
Variables can usually be classified as qualitative or quantitative:
i. Qualitative Variables
- Qualitative variables have values that are intrinsically non-numeric (categorical)
- E.g., cause of death, nationality, race, gender, severity of pain (mild, moderate, severe), etc.
- Qualitative variables generally have either nominal or ordinal scales.
- Qualitative variables can be reassigned numeric values (e.g., male = 0, female = 1), but they are still intrinsically qualitative.
ii. Quantitative Variables
- Quantitative variables naturally take numeric values.
a. Discrete Variable
- A discrete variable takes values from a countable set, with gaps between possible values.
- E.g., number of children, number of hospital admissions
b. Continuous Variable
- A continuous variable has a set of possible values including all values in an interval of
the real line.
- E.g., duration of seizure, body mass index, height
- No gaps exist between possible values.
Scales of Measurement
There are four different measurement scales. Each is designed for a specific purpose.
i. Nominal Scale
ii. Ordinal Scale
iii. Interval Scale
iv. Ratio Scale
i. Nominal Scale
- It is simply a system of assigning numbers to events in order to label/identify them, e.g., assigning jersey numbers to cricket players in order to identify them.
- Nominal data can be grouped but not ranked. For example, male/female, urban/rural,
diseased/healthy are examples of nominal data and such data consists of numbers used
only to classify an object, person or characteristics.
- Nominal scale is the least powerful among the scales of measurement.
- It indicates no order or distance relationship and has no arithmetic origin.
- The chi-square test is the most common test of statistical significance that can be used with data on this scale.
ii. Ordinal Scale
- The ordinal scale only permits ranking of items from highest to lowest. Thus, use of this scale implies a statement of greater than or less than, without being able to state how much greater or less.
- Ordinal data can be both grouped and ranked. Examples include mild, moderate and severe malnutrition; first degree, second degree and third degree uterine prolapse, etc.
Diagrammatic and Graphic Presentation
There are different types of diagrams and graphic representations for a given dataset:
i. Bar Graph
- Bar graph is the simplest qualitative graphic representation of data.
- A bar graph contains two or more categories along one-axis and a series of bars, one for
each category, along the other axis.
- Typically, the length of the bar represents the magnitude of the measure (amount, frequency, percentage, etc.) for each category.
- The bar graph is qualitative because the categories are non-numerical, and it may be either
horizontal or vertical.
Limitations
- Bar graph is not applicable for plotting data overlapping with each other because it gives a
confusing picture.
ii. Histogram
Advantages
- It allows visual comparison of the distributions of different sets of data.
Limitations
- Histogram is not applicable in plotting two or more sets of data over-lapping with each other
because it gives a confusing picture.
- Since only one set of distribution can be plotted in one graph, it is expensive and more time
consuming.
iii. Frequency Polygon
Advantages
- The change of points from one place to another is direct and gives a correct impression.
- Unlike the histogram, it is possible to plot two or more sets of distribution overlapping on the same baseline, because it still gives a clear picture of the comparison of each distribution.
Limitations
- Can be used only with continuous data
Limitations
- As a rule, it is not suitable for use in annual reports or other dissemination purposes.
Advantages
- Examination of a box-and-whisker plot for a set of data reveals information regarding the
amount of spread, location of concentration, and symmetry of data.
- It is easy to compare the stratified data using the Box and Whisker Plot.
Limitations
- Mean and mode cannot be identified using the box plot.
- If large outliers are present, the box plot is more likely to give a misleading representation.
- When handling large amounts of data in a box plot, the exact values and details of the distribution of results are not retained.
Example
Given the weight measurements (Kg) in a group of selected students:
Here, the smallest and largest values are 14.6 and 44 respectively. The first quartile (Q1) is 27.25, the median is 31.1, and the third quartile (Q3) is 33.525. The box and whisker plot for the given dataset will be:
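Since the original data list did not survive, the five-number summary behind a box-and-whisker plot can be sketched with a hypothetical set of weights (all values below are made up for illustration and do not reproduce the quartiles quoted above):

```python
import statistics

# Hypothetical weights (kg); illustrative only
weights = [14.6, 25.0, 27.4, 28.1, 30.2, 31.1, 32.0, 33.1, 34.0, 44.0]

# Q1, median, Q3 are the three cut points that divide the data into quarters
q1, q2, q3 = statistics.quantiles(weights, n=4, method="inclusive")

summary = {"min": min(weights), "Q1": q1, "median": q2,
           "Q3": q3, "max": max(weights)}
print(summary)
```

The box spans Q1 to Q3, the line inside marks the median, and the whiskers run out to the minimum and maximum.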
1. Arithmetic Mean
- The arithmetic mean or mean of a set of measurement is defined to be the sum of the
measurements divided by the total number of measurements.
- The population mean is denoted by the Greek letter μ and the sample mean is denoted by the symbol x̄.

Mean (x̄) = Σx / n
Properties of Mean
i. The sum of the deviations of a given set of observations from the arithmetic mean is equal to zero.
Σ(X − x̄) = 0
ii. The sum of squares of deviations of a set of observations from the arithmetic mean is minimum.
Σ(X − x̄)² ≤ Σ(X − A)², for any value A
iii. If every value of the variable X is increased (or decreased) by a constant value k, the arithmetic mean of the observations so obtained also increases (or decreases) by the same constant.
If Y = X + k, then ȳ = x̄ + k
If Y = X − k, then ȳ = x̄ − k
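The three properties above can be verified numerically on a small made-up sample:

```python
# Numerical check of the three properties of the arithmetic mean
data = [2.0, 4.0, 6.0, 8.0]
mean = sum(data) / len(data)

# (i) deviations from the mean sum to zero
assert abs(sum(x - mean for x in data)) < 1e-9

# (ii) the sum of squared deviations is minimised at the mean
ss = lambda c: sum((x - c) ** 2 for x in data)
assert ss(mean) <= ss(mean + 1) and ss(mean) <= ss(mean - 1)

# (iii) adding a constant k to every value shifts the mean by k
k = 3.0
shifted_mean = sum(x + k for x in data) / len(data)
assert abs(shifted_mean - (mean + k)) < 1e-9
print("all three properties hold")
```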
Advantages of mean
- The mean uses every value in the data and hence is a good representative of the data. Ironically, the mean itself often does not appear among the raw data values.
- Repeated samples drawn from the same population tend to have similar means. The mean
is therefore the measure of central tendency that best resists the fluctuation between
different samples.
- It is closely related to standard deviation, the most common measure of dispersion.
Disadvantages
- The important disadvantage of mean is that it is sensitive to extreme values/outliers,
especially when the sample size is small. Therefore, it is not an appropriate measure of
central tendency for skewed distribution.
- The mean cannot be calculated for nominal or non-numeric ordinal data. Even though the mean can be calculated for numerical ordinal data, it often does not give a meaningful value, e.g., stage of cancer.
2. Median
- Median is the value which occupies the middle position when all the observations are
arranged in an ascending/descending order.
- It divides the frequency distribution exactly into two halves. Fifty percent of observations in a
distribution have scores at or below the median. Hence median is the 50th percentile.
- Median is also known as positional average
Median = value of the ((n + 1)/2)th item, when the observations are arranged in order
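The positional formula can be checked on a small made-up sample:

```python
import statistics

# Median as the ((n + 1)/2)th ordered observation (n odd, made-up ages)
ages = [35, 24, 31, 29, 27]
ordered = sorted(ages)                 # [24, 27, 29, 31, 35]
pos = (len(ordered) + 1) // 2          # the 3rd observation
median = ordered[pos - 1]
assert median == statistics.median(ages)
print(median)
```

For an even number of observations, the median is the mean of the two middle values.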
Advantages
- It is easy to compute and comprehend
- It is not distorted by outliers/skewed data
- It can be determined for ratio, interval, and ordinal scale
Disadvantages
- It does not take into account the precise value of each observation and hence does not use
all information available in the data.
- Unlike mean, median is not amenable to further mathematical calculation and hence is not
used in many statistical tests.
- If we pool the observations of two groups, median of the pooled group cannot be expressed
in terms of the individual medians of the pooled groups.
3. Mode
- Mode is defined as the value that occurs most frequently in the data.
- Some data sets do not have a mode because each value occurs only once. On the other
hand, some data sets can have more than one mode.
- Mode is rarely used as a summary statistic except to describe a bimodal distribution.
Advantages
- It is the only measure of central tendency that can be used for data measured in a nominal
scale.
- It can be calculated easily.
Disadvantages
- It is rarely used in statistical analysis, as it is not algebraically defined and its frequency fluctuates considerably when the sample size is small.
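A short sketch of the mode, including the nominal-scale and bimodal cases (data made up):

```python
import statistics

# Mode works even for nominal data, where mean and median do not
pain = ["mild", "moderate", "mild", "severe", "mild"]
print(statistics.mode(pain))           # most frequent category: "mild"

# A data set can have more than one mode (bimodal)
bimodal = [2, 3, 3, 5, 7, 7, 9]
print(statistics.multimode(bimodal))   # both 3 and 7 occur twice
```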
Measures of Dispersion
Measures of dispersion refer to the variability of data from the measure of central tendency.
Some commonly used measures of dispersion are:
i. Range
- The range is the difference between the largest and the smallest observation in the data
Advantages
- It is independent of measure of central tendency and easy to calculate
Disadvantages
- It is very sensitive to outliers and does not use all the observations in a data set.
- It is more informative to provide maximum and minimum value rather than providing range.
Advantage
- It can be used as a measure of dispersion if the extreme values are not being recorded
exactly (as in case of open-ended class intervals in frequency distribution).
- It is not affected by extreme values.
- It is useful for erratic or highly skewed distributions
Disadvantages
- It is not amenable to further mathematical manipulation
- It is very much affected by sampling fluctuations
The sample variance is given by

s² = Σ(x − x̄)² / (n − 1)

and the standard deviation (SD) is its positive square root.
Advantages
- If the observations are from a normal distribution, SD serves as a basis for many further
statistical analyses.
- Along with mean it can be used to detect skewness.
Disadvantages
- It is an inappropriate measure of dispersion for skewed data.
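The sample standard deviation can be computed directly from the (n − 1) formula and checked against the standard library (data made up):

```python
import statistics

# s = sqrt( sum((x - mean)^2) / (n - 1) ) on a small illustrative sample
x = [4.0, 8.0, 6.0, 5.0, 3.0, 7.0]
n = len(x)
mean = sum(x) / n
s2 = sum((v - mean) ** 2 for v in x) / (n - 1)   # sample variance
s = s2 ** 0.5                                    # standard deviation

assert abs(s - statistics.stdev(x)) < 1e-9       # matches the library value
print(round(s, 4))
```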
i. Classical probability
- Classical probability assumes that all outcomes in the sample space are equally likely to
occur.
- The probability of any event E is the ratio of the number of outcomes favourable to E to the total number of outcomes in the sample space.
- This probability is denoted by

P(E) = n(E) / n(S)
Axioms of Probability
i. The probability of any event A lies between 0 and 1,
i.e. 0 ≤ P(A) ≤ 1
ii. The sum of probabilities of all the outcomes in a sample space is always 1,
i.e. Σ P(Ei) = 1
P(S) = 1
iii. For any mutually exclusive events A & B, the probability that at least one of these events
occurs is equal to sum of their respective probabilities.
P(A∪B) = P(A) + P(B)
a. Proposition 1
- The probability that an event does not occur is 1 minus the probability that the event occurs.
P(Aᶜ) = 1 − P(A)
b. Proposition 2
- For any non-mutually exclusive events A and B, the probability that at least one of these events occurs is equal to the sum of their respective probabilities minus the probability of both events occurring together.
P(A∪B) = P(A) + P(B) − P(A∩B)
Conditional Probability
- The conditional probability of an event A in a relationship to an event B is defined as the
probability that event A will occur after event B has already occurred.
- The conditional probability of A given that B has occurred is equal to the probability of A∩B divided by the probability of B, provided the probability of B is not zero.

P(A|B) = P(A∩B) / P(B), P(B) ≠ 0
Bayes Theorem
If E1, E2, E3, …, En are mutually disjoint events with P(Ei) ≠ 0 (i = 1, 2, …, n), then for any arbitrary event A which is a subset of their union,

P(Ei|A) = P(Ei) · P(A|Ei) / Σ P(Ei) · P(A|Ei)

If i = 1, 2, 3 then

P(E1|A) = P(E1) · P(A|E1) / [P(E1) · P(A|E1) + P(E2) · P(A|E2) + P(E3) · P(A|E3)]
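A worked numeric example of Bayes' theorem in the classic diagnostic-test setting; the prevalence, sensitivity, and false-positive rate below are made-up assumptions, not data from the text:

```python
# P(disease | positive test) via Bayes' theorem
p_d = 0.01        # prevalence: P(disease)
sens = 0.95       # sensitivity: P(test+ | disease)
fpr = 0.10        # false-positive rate: P(test+ | no disease)

# total probability of a positive test
p_pos = p_d * sens + (1 - p_d) * fpr

# Bayes' theorem
p_d_pos = p_d * sens / p_pos
print(round(p_d_pos, 4))
```

Even with a fairly accurate test, the low prevalence keeps the post-test probability under 10% here, which is why Bayes' theorem matters in screening.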
Probability Distributions
Binomial Distribution
- The binomial distribution is one of the most widely used discrete probability distributions in applied statistics.
- This distribution is derived from a process known as Bernoulli trial (by James Bernoulli)
- The random variable X is said to have a binomial distribution if its probability function is
given by
P(X = x) = b(x; n, p) = nCx · p^x · q^(n−x), x = 0, 1, 2, 3, …, n
where,
n= number of trials
p = probability of success
q = probability of failure = 1-p
x= number of success
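The binomial probability function can be evaluated directly; n, p and x below are made-up values:

```python
from math import comb

# P(X = x) = nCx * p^x * q^(n - x)
n, p, x = 5, 0.3, 2          # 5 trials, success probability 0.3, 2 successes
q = 1 - p
pmf = comb(n, x) * p ** x * q ** (n - x)
print(round(pmf, 4))         # 10 * 0.09 * 0.343 = 0.3087
```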
Poisson Distribution
Normal Distribution
Correlation
If two (or more) variables are so related that a change in the value of one variable is accompanied by a change in the value of the other variable, then they are said to have correlation. Hence, correlation analysis is defined as the statistical technique which measures the degree (or strength) and direction of the relationship between two or more variables.
Types of correlation
1. Positive and Negative Correlation
i. Positive Correlation
- If two variables X and Y move in the same direction (i.e. if one variable rises, the other rises and vice versa), then it is called positive correlation.
- Example: Gestational age against birth weight of baby
Bivariate Correlation
Many biomedical studies are designed to explore relationship between two variables and
specifically to determine whether these variables are independent or dependent.
E.g. Are obesity and blood pressure related?
Scatter Diagram
- The scatter diagram is a graphic method of finding out the correlation between two variables.
- By this method, the direction of correlation can be ascertained.
- For constructing a scatter diagram, the X-variable is represented on the X-axis and the Y-variable on the Y-axis.
- Each pair of values of the X and Y series is plotted as a dot in the two-dimensional X-Y space.
- The diagram formed by the bivariate data is known as a scatter diagram.
The scatter diagram gives an idea about the direction and magnitude of correlation in the following ways:
a. Perfect Positive Correlation (r = +1)
- If all points are plotted in the shape of a straight line passing from the lower left corner to the
upper right corner, then both series X and Y have perfect positive correlation.
v. No correlation
- If r = 0, then no correlation exists between the variables.
iii. Directionality problem: It does not explain whether variable X causes a change in variable Y or the reverse is true.
iv. It is unstable with small sample sizes
v. It measures only a linear relationship.
Where,
m1 is the number of repetitions of the 1st item
m2 is the number of repetitions of the 2nd item
Properties of the Rank Correlation Coefficient
- It is less sensitive to outlying values than Pearson's correlation coefficient.
- It can be used when one or both of the relevant variables are ordinal.
- It relies on ranks rather than on actual observations.
- The sum total of rank differences (i.e. ΣD) is always equal to zero.
- The value of the rank correlation coefficient will be equal to the value of Pearson's coefficient of correlation for the two characteristics, taking the ranks as values of the variables, provided no rank value is repeated.
Demerits
- This method cannot be used for finding correlation in grouped frequency distribution.
Regression
Purpose of regression
- To predict the value of dependent variable based on the value of an independent variable.
- To explain the change in the dependent variable for every unit change in the independent variable.
- To explain the nature of relationship between variables.
2. Independent variable
- The variable used to explain or the known variable which is used for prediction is called
independent variable.
- It is also called explanatory variable.
Regression lines
- The regression line shows the average relationship between two variables. It is also known as the line of best fit.
- If two variables X and Y are given, then there are two regression lines related to them which
are as follows:
For simple regression, the error is assumed to be zero. So, the estimate of the population regression line is given by

Ŷ = a_yx + b_yx · X

To fit the regression line we must find the unique values of a_yx and b_yx.
For this we use the principle of least squares. Using this principle, a and b are derived by solving the following two normal equations:

ΣY = n·a + b·ΣX
And
ΣXY = a·ΣX + b·ΣX²

Alternatively,
The values can be obtained by

b = [n·ΣXY − (ΣX)(ΣY)] / [n·ΣX² − (ΣX)²]
a = Ȳ − b·X̄
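A minimal sketch of the least-squares solution using the alternative formula for b; the data points are made up:

```python
# Fit Y = a + b*X by least squares on a small illustrative dataset
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
n = len(x)

sx, sy = sum(x), sum(y)
sxy = sum(xi * yi for xi, yi in zip(x, y))
sxx = sum(xi * xi for xi in x)

b = (n * sxy - sx * sy) / (n * sxx - sx ** 2)   # slope
a = sy / n - b * sx / n                         # intercept = ybar - b*xbar
print(round(a, 10), round(b, 10))               # a = 2.2, b = 0.6
```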
Probable Error
- The probable error of correlation coefficient helps in determining the accuracy and reliability
of the value of the coefficient in so far as it depends on the conditions of random sampling.
- The probable error of r is an amount which, if added to and subtracted from the value of r, produces limits within which the coefficient of correlation in the population can be expected to lie.
- The probable error of the coefficient of correlation is obtained as follows:

P.E. = 0.6745 × (1 − r²) / √N

Where, r is the coefficient of correlation and N is the number of pairs of observations.
Coefficient of determination
- The concept coefficient of determination is used for the interpretation of regression
coefficient.
- The coefficient of determination is also called r-squared and is denoted by r².
- The coefficient of determination explains the percentage variation in the dependent variable
Y that can be explained in terms of the independent variable X.
- It measures the closeness of fit of the regression equation to the observed values of Y.
- For example, if r is 0.9 then the coefficient of determination (r²) will be 0.81, which implies that 81% of the total variation in the dependent variable (Y) is explained by the independent variable X. The remaining 19% of variation is due to other external factors.
- Thus the coefficient of determination is defined as the ratio of the explained variance to the
total variance.
- The coefficient of determination lies between 0 and 1, i.e. 0 ≤ r² ≤ 1
- When r² = 1, all the observations fall on the regression line.
- When r² = 0, none of the variation in Y is explained by the regression.

r² = explained variance / total variance
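The ratio of explained to total variation can be computed for a fitted least-squares line on a small made-up dataset:

```python
# r^2 = explained variation / total variation for a least-squares line
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n

# slope and intercept by least squares
b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
    sum((xi - xbar) ** 2 for xi in x)
a = ybar - b * xbar
yhat = [a + b * xi for xi in x]          # fitted values

ss_total = sum((yi - ybar) ** 2 for yi in y)
ss_explained = sum((yh - ybar) ** 2 for yh in yhat)
r2 = ss_explained / ss_total
print(round(r2, 3))                      # 60% of the variation is explained
```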
Coefficient of non-determination
- By dividing the unexplained variance by the total variation, the coefficient of non-
determination can be determined.
- Assuming the total of variation as 1, then the coefficient of non-determination can be
determined by subtracting the coefficient of determination from1.
- It is denoted by K².

K² = 1 − r²
- Suppose if coefficient of determination is 0.81, then coefficient of non-determination will be
0.19, which means that 19% of the variations are due to other factors.
- The coefficient of alienation is the square root of the coefficient of non-determination (= √(1 − r²))
Multiple Regression
- It is a type of regression in which the relationship between one dependent variable and more than one independent variable is described by a linear function.
- Changes in Y (dependent) are assumed to be caused by changes in X1, X2, X3, X4, … (independent).
- Multiple regression analysis is used when a statistician thinks there are several independent
variables contributing to the variation of the dependent variable.
- For example, if a statistician wants to see whether birth weight of a child is dependent on
gestational age, age of mother and antenatal visits, then multiple regression analysis may
be applicable.
Birth Weight = β0 + β1·GA + β2·Age + β3·ANC + e
Where, GA = Gestational age
Age = Age of mother in yrs
ANC = Antenatal care
Logistic Regression
- Logistic regression is a kind of predictive model that can be used when the dependent variable is a categorical variable with two categories and the independent variables are either numerical or categorical.
- Examples of categorical variables are disease/ no disease, smokers/non-smokers, etc.
- The dependent variable in the logistic model is often termed the outcome or target variable, whereas the independent variables are known as predictor variables.
- It provides answers to questions such as:
How does the probability of getting lung cancer change for every additional pound of overweight and for every X cigarettes smoked per day?
Do body weight, calorie intake, fat intake, and age have an influence on heart attacks (yes vs. no)?
Limitations
i. Outcome variable must always be discrete
ii. When continuous outcome is categorized or categorical variables are dichotomized, some
important information may be lost.
iii. Ratio of cases to variables: Using discrete variables requires that there are enough
responses in every given category.
- If there are too many cells with no responses, parameter estimates and standard errors will
likely blow up.
- It can also make groups perfectly separable (e.g., multi-collinear), which makes maximum likelihood estimation impossible.
Sampling theory is the field of statistics that is involved with the collection, analysis and
interpretation of data gathered from random samples of a population under study.
                      Parameter        Statistic
Source                Population       Sample
Notation for Mean     μ                x̄
Notation for SD       σ                s
Vary                  No               Yes
Calculated            No               Yes
1. Estimation
Estimation is the statistical process by which population characteristics (i.e parameter) are
estimated from the sample characteristics (i.e. statistic) with desired degree of precision.
Types of estimation
i. Point Estimations
- A point estimate is a specific numerical value from a sample that estimates a parameter.
- The best point estimate of the population mean μ is the sample mean x̄.
Confidence Interval
A confidence interval is a range of values around a sample statistic within which the true population value is expected to lie, at a given level of confidence.
The confidence level is the probability that the interval estimate will contain the parameter, assuming that a large number of samples are selected and that the estimation process on the same parameter is repeated. The confidence levels generally used are 90%, 95% and 99%.
As the length of the CI increases, it is more likely to capture the population parameter μ. Therefore, the CI is longer at 99% confidence than at 90%.

CI = x̄ ± t(n−1, 1−α/2) · s/√n
Points to remember
- Confidence interval applies only when a sample is selected by a probability sampling
technique and the population is normal or the sample is large.
- In addition, the CI does not account for practical problems such as:
Measurement error and processing error
Other selection biases
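As a sketch, a large-sample 95% confidence interval for a mean can be computed with the z value 1.96; the summary figures below are hypothetical:

```python
from math import sqrt

# CI = sample mean +/- z * (s / sqrt(n)) for a large sample
xbar, s, n = 120.0, 15.0, 100   # hypothetical mean, SD, and sample size
z = 1.96                        # z value for 95% confidence

se = s / sqrt(n)                # standard error of the mean
lower, upper = xbar - z * se, xbar + z * se
print(round(lower, 2), round(upper, 2))
```

Interpreted at the 95% confidence level: if sampling were repeated many times, about 95% of intervals constructed this way would contain the true mean.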
Sampling distribution
- It is a distribution obtained using the statistics computed from all possible random samples
of a specific size taken from a normal population.
- Sampling distribution is a theoretical concept.
- In practice it is too expensive to take many samples from a population. Simulation may be
used instead of many samples to approximate sampling distribution.
When sampling is from a normally distributed population, the distribution of the sample mean
will possess the following properties:
i. The sampling distribution of mean tends to be normal as sample size increases (Central
Limit Theorem).
ii. The mean obtained from the sampling distribution of mean will be same as the population
mean.
iii. The standard deviation of the sample means will be smaller than the standard deviation of
the population, and it will be equal to the population standard deviation divided by the
square root of the sample size. This is called standard error.
- As the formula shows, the standard error depends on the size of the sample; the standard error is inversely related to the square root of the sample size. Therefore, the larger n becomes, the more closely the sample mean will represent the true population mean.
- Also the standard error is influenced by the standard deviation and the sample size. The
greater the dispersion around the mean, greater is the standard error and less certain we
are about the actual population mean.
Sampling Techniques
Sampling is a statistical procedure of drawing a sample from a population with a belief that the
drawn sample will exhibit the relevant characteristics of the whole population.
i. Probability Sampling
- Probability sampling is a method of drawing a sample so that every unit in the population has a known, non-zero probability of being selected as a unit of the sample.
- The advantage of probability method is that the sampling error of a given sample size can
be estimated statistically and therefore the samples can be subjected to further statistical
procedures.
i. Simple Random Sampling
Advantages
- Reduces selection bias, as selection depends on chance.
- Relatively cheap compared to stratified random sampling.
- Sampling error can be easily measured.
Limitations
- Complete list of sampling frame is needed.
- This method may not always achieve best representatives.
- Units may be scattered
- Less suitable for large population
ii. Stratified Random Sampling
- There are two types of stratified random sampling: proportionate and disproportionate.
- In proportionate stratified sampling, the sample size from each stratum depends on the size of the stratum. Therefore, the largest strata are sampled most heavily, as they make up a larger percentage of the target population.
- In disproportionate sampling, the sample selection from each stratum is independent of its size.
Advantages
- This method produces more representative samples.
- Facilitates comparison between strata and understanding of each stratum and its unique
characteristics.
- It is suitable for large and heterogeneous population.
Limitations
- It requires more cost, time and resources
- Stratification is a difficult process.
iii. Systematic Sampling
Advantages
- This method is simple and easy.
- The selected samples are evenly spread over the population and therefore minimize the chances of clustered selection of subjects.
Limitations
- The method may introduce bias when elements are not arranged in random order.
- In some cases, complete sampling frame may be unavailable.
iv. Cluster Sampling
Advantages
- This method is faster, easier and cheaper.
- It is useful when a sampling frame is not available.
- It is economical when the study area is large.
Limitations
- There are high chances of sampling error
- Over or underrepresentation of cluster can skew the result of the study.
v. Multi-stage sampling
- In multi-stage sampling, the selection of the sample is drawn in two or more stages.
- The population is divided into a number of first-stage units, from which a sample is drawn at random.
- In the second stage, elements are randomly drawn from within the selected first-stage units; these are called second-stage units. The procedure can be repeated for third and fourth stages as required.
- The ultimate unit is called the unit of analysis.
- Example:
First stage: Development regions
Second stage: Districts
Third stage: VDCs
Advantages
- It is quite convenient in a very large area.
- Saves cost, time and resources
- Sample frame is required for only those which are selected.
Limitations
- This method may not always achieve representative samples.
- High level of subjectivity
Advantages
- Useful when the sample size is small
- Applied when the number of elements in the population is unknown.
Limitations
- There are high chances of selection bias
- It is not a scientific method
Advantages
- It is useful for pre-testing questionnaires
- It is useful for pilot studies
Limitations
- Selected samples might be atypical to the population
- There are high chances of selection bias
Determination of Sample Size

n = Z² · p · q / d²

Where,
n = sample size
d = maximum allowable error or margin of error
p = population proportion
q = 1 − p
Z = value of Z at the desired confidence level
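A quick evaluation of the formula with the common defaults p = 0.5, 95% confidence (Z = 1.96) and a 5% margin of error:

```python
from math import ceil

# n = Z^2 * p * q / d^2, rounded up to the next whole subject
z, p, d = 1.96, 0.5, 0.05
q = 1 - p
n = z ** 2 * p * q / d ** 2
print(ceil(n))                  # the familiar n = 385
```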
Type I error
- A type I error is characterized by the rejection of the null hypothesis when it is true, and is referred to as the alpha (α) level.
- The alpha level, or the level of significance of a test, is the probability of making a type I error that researchers are willing to accept.
- In public health research, alpha level is usually set at a level of 0.05 or 0.01.
- The probability of a type I error can be minimized by choosing a smaller alpha level (e.g., 0.01 instead of 0.05).
Type II error
- Type II error is characterized by failure to reject the false null hypothesis.
- The probability of making a type II error is called beta (β), and the probability of avoiding a type II error is called power (1 − β).
                          Actual Situation
Decision                  H0 true                H0 false
Reject H0                 Type I error (α)       Correct decision (1 − β)
Do not reject H0          Correct decision       Type II error (β)
Power of a Test
- The power of a statistical test measures the sensitivity of the test to detect a real difference
in parameter if one actually exists.
- The power of a test is a probability and, like all probabilities, can have values ranging from 0 to 1.
- The higher the power, the more sensitive the test is to detecting a real difference between parameters if there is a difference.
- In other words, the closer the power of a test is to 1, the better the test is for rejecting the
null hypothesis, if the null hypothesis is, in fact false.
- The power of a test is equal to 1 − β, that is, 1 minus the probability of committing a type II error. So the power of a test depends upon the probability of committing a type II error.
- The power of a test can be increased by increasing the value of α. For example, instead of using α = 0.01, use α = 0.05.
- Another way to increase the power of a test is to select a larger sample size. The larger sample size
would make the standard error of the mean smaller and consequently reduce .
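The two points above can be illustrated with the analytic power of a one-sided z-test; the numbers (effect of 0.5 SD, n = 25, α = 0.05 so z = 1.645) are illustrative assumptions, not from the text:

```python
import math

def norm_cdf(x):
    # Standard normal CDF via the error function
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def z_test_power(mu0, mu1, sigma, n, z_alpha):
    """Power of a one-sided z-test: P(reject H0 | true mean = mu1)."""
    shift = abs(mu1 - mu0) / (sigma / math.sqrt(n))  # effect in standard-error units
    return norm_cdf(shift - z_alpha)

power = z_test_power(0, 0.5, 1, 25, 1.645)    # n = 25
low_n = z_test_power(0, 0.5, 1, 10, 1.645)    # smaller sample, lower power
print(round(power, 3))  # ≈ 0.804
```

Reducing n from 25 to 10 visibly lowers the power, matching the note that larger samples shrink the standard error and reduce β.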
Level of Significance
- A level of significance is a threshold that demarcates statistical significance.
- Levels of significance are expressed in probability terms, and are denoted with the Greek letter
alpha (α).
- In tests of statistical significance, we use a cut-off point called the level of significance, or α. It defines
the rejection region of the sampling distribution.
- It provides the critical values of the test. The results of a test are compared to these critical values
and are categorized as statistically significant or not statistically significant.
- The level of significance is set by the researcher at the beginning.
- The most commonly used levels of significance in public health studies are .01, .05 or .10.
- In another sense, the level of significance can also be viewed as the probability of making a type I
error.
- It is the margin that we use to tolerate type I error.
- When the level of significance is set to a value α, we mean that α is the risk of making a
type I error that we are prepared to accept.
P-Value
- The P-value (or probability value) is the probability of getting a sample statistic (such as the
mean) or a more extreme sample statistic in the direction of the alternative hypothesis when
the null hypothesis is true.
- In other words, the P-value is the actual area under the standard normal distribution curve
representing the probability of a particular sample statistic, or a more extreme sample statistic,
occurring if the null hypothesis is true.
- A P-value of 0.05 means that the probability of getting such a difference by chance alone would be 5 in 100
times.
- P-value is particularly important in determining the significance of the results in hypothesis
testing.
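As a small sketch of how a two-sided P-value follows from an observed z statistic (the standard normal case described above):

```python
import math

def two_sided_p_value(z):
    """Two-sided P-value for an observed z statistic under H0 (standard normal)."""
    # P(|Z| >= |z|) equals the complementary error function erfc(|z| / sqrt(2))
    return math.erfc(abs(z) / math.sqrt(2))

p = two_sided_p_value(1.96)
print(round(p, 3))  # 0.05
```

This recovers the familiar correspondence between z = 1.96 and the 0.05 significance level.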
Disadvantages
- Nonparametric methods may lack power as compared with more traditional approaches.
This is a particular concern if the sample size is small or if the assumptions for the
corresponding parametric method (e.g. Normality of the data) hold.
- Nonparametric methods are geared toward hypothesis testing rather than estimation of
effects. It is often possible to obtain nonparametric estimates and associated confidence
intervals, but this is not generally straightforward.
- Tied values can be problematic when these are common, and adjustments to the test
statistic may be necessary.
- Appropriate computer software for nonparametric methods can be limited, although the
situation is improving.
Parametric Test
Z-test
The test statistic which is applied in the case of large samples is called the z-test.
Types of Z-test
1. Test of significance of a single population parameter
2. Test of significance of two population parameters
Z = (x̄ − μ) / (σ/√n)
Where,
x̄ = sample mean
μ = population mean
σ = S.D of the population
n = sample size
The standard error of the mean is given by σ/√n
For a single population proportion:
Z = (p − P) / √(PQ/n)
Where,
p = sample proportion
P = population proportion
Q = 1 − P
n = sample size
The standard error of the proportion is given by √(PQ/n)
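As an illustrative sketch (the sample values below are made up, not from the text), the single-sample Z statistics can be computed directly:

```python
import math

def z_stat_mean(x_bar, mu, sigma, n):
    """Z statistic for a single population mean."""
    se = sigma / math.sqrt(n)           # standard error of the mean
    return (x_bar - mu) / se

def z_stat_proportion(p_hat, P, n):
    """Z statistic for a single population proportion."""
    se = math.sqrt(P * (1 - P) / n)     # standard error of the proportion
    return (p_hat - P) / se

# Sample mean 52 vs claimed mean 50, sigma = 5, n = 25
z_mean = z_stat_mean(52, 50, 5, 25)
# Sample proportion 0.6 vs hypothesized 0.5, n = 100
z_prop = z_stat_proportion(0.6, 0.5, 100)
print(z_mean)  # 2.0
```

Both examples give z = 2.0, which would exceed the 1.96 critical value at α = 0.05.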
Assumptions
- The underlying distribution is normal, or the Central Limit Theorem can be assumed to hold
- The sample has been selected randomly
- The population variances are known
- Under the null hypothesis, the two population means are equal
- The test statistic (Z) for testing the significance of the difference between two means is
given by
Z = (x̄₁ − x̄₂) / √(σ₁²/n₁ + σ₂²/n₂)
- For the difference between two proportions:
Z = (p₁ − p₂) / √(P₁Q₁/n₁ + P₂Q₂/n₂)
T-test
If the sample size is small and the population S.D is not known, then it is not appropriate to
treat the sample S.D as an unbiased estimator of the population S.D. This adds
another uncertainty to our inference which cannot be captured by the Z-test. Hence, to address this
uncertainty in the inference we use a modification of the Z-test based on Student's t-distribution.
Types of t-test
i. One-sample test
ii. T-test for two independent (uncorrelated) samples (equal and unequal variances)
iii. T-test for paired samples
iv. T-test for significance of an observed sample correlation coefficient
t = (x̄ − μ) / (s/√n)
Where,
x̄ = sample mean
μ = population mean
s = S.D of the sample
n = sample size
- The degree of freedom is n-1
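The one-sample t statistic can be sketched with the standard library; the data and hypothesized mean below are illustrative assumptions:

```python
import math
import statistics

def one_sample_t(data, mu):
    """One-sample t statistic: t = (x_bar - mu) / (s / sqrt(n)), df = n - 1."""
    n = len(data)
    x_bar = statistics.mean(data)
    s = statistics.stdev(data)            # sample S.D. (n - 1 denominator)
    t = (x_bar - mu) / (s / math.sqrt(n))
    return t, n - 1                       # t statistic and degrees of freedom

# Illustrative data: is the mean different from a hypothesized value of 6?
t, df = one_sample_t([5, 6, 7, 8, 9], mu=6)
print(round(t, 4), df)  # 1.4142 4
```

The resulting t would be compared with the t-distribution on df = n − 1 degrees of freedom rather than the standard normal.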
t = (x̄₁ − x̄₂) / √(Sp²(1/n₁ + 1/n₂))
Where,
Sp² = pooled variance = [(n₁ − 1)s₁² + (n₂ − 1)s₂²] / (n₁ + n₂ − 2)
- The degrees of freedom are n₁ + n₂ − 2
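A short sketch of the pooled-variance t-test for two independent samples (the two samples below are illustrative, not from the text):

```python
import math
import statistics

def pooled_t(sample1, sample2):
    """Independent-samples t with pooled variance; df = n1 + n2 - 2."""
    n1, n2 = len(sample1), len(sample2)
    v1, v2 = statistics.variance(sample1), statistics.variance(sample2)
    # Pooled variance: weighted average of the two sample variances
    sp2 = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
    se = math.sqrt(sp2 * (1 / n1 + 1 / n2))
    t = (statistics.mean(sample1) - statistics.mean(sample2)) / se
    return t, n1 + n2 - 2

t, df = pooled_t([4, 5, 6], [7, 8, 9])
print(round(t, 4), df)  # -3.6742 4
```

Pooling is appropriate only when the two population variances can be assumed equal; otherwise the unequal-variance form of the test is used.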
- When a test is used to test a hypothesis concerning the means of three or more populations,
the technique is called analysis of variance (ANOVA).
- The test involved in ANOVA is called F-test.
- With analysis of variance, all the means are compared simultaneously.
- In ANOVA, two different estimates of the population variances are made
The first estimate is between group variance and it involves finding the variance of the
means.
The second estimate is within-group variance, which is made by computing the variance
using all the data and is not affected by difference in the means.
Hypothesis in ANOVA
For the test of difference among three groups or three means, the following hypothesis should
be used:
i. H₀: The means of all groups are equal
   i.e. H₀: μ₁ = μ₂ = μ₃ = … = μₖ
ii. H₁: At least one mean is different from the others
   (ANOVA can't say which group differs)
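The two variance estimates described above can be sketched from scratch; the three groups below are invented illustrative data:

```python
def one_way_anova_f(groups):
    """One-way ANOVA F statistic: between-group vs within-group variance."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    # Between-group sum of squares: variation of group means around the grand mean
    ssb = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: variation of observations around their own group mean
    ssw = sum(sum((x - sum(g) / len(g)) ** 2 for x in g) for g in groups)
    msb = ssb / (k - 1)            # between-group variance estimate
    msw = ssw / (n_total - k)      # within-group variance estimate
    return msb / msw

F = one_way_anova_f([[1, 2, 3], [2, 3, 4], [6, 7, 8]])
print(F)  # 21.0
```

A large F means the group means vary far more than the within-group scatter would predict, so H₀ (all means equal) is rejected.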
Scheffe Test
When the null hypothesis is rejected using the F test, the researcher may want to know where
the difference among the means is. Several procedures have been developed to determine
where the significant differences in the means lie after the ANOVA procedure has been
performed. Among the most commonly used tests are the Scheffé test and the Tukey test.
Scheffe test
- To conduct the Scheffe test, it is necessary to compare the means two at a time, using all
possible combinations of means.
- For example, if there are three means, the following comparisons must be done:
x̄₁ versus x̄₂
x̄₂ versus x̄₃
x̄₁ versus x̄₃
- This test uses F-sampling distribution
- This method is recommended when
The sizes of the samples selected from the different populations are unequal
Comparisons between two means are of interest
Tukey Test
- The Tukey test can be used after the analysis of variance has been completed to make pairwise
comparisons between means when the groups have the same sample size.
- The symbol for the test statistic in the Tukey test is q (the studentized range statistic).
Bonferroni Test
- The Bonferroni method is a simple method that allows many comparison statements to be
made (or confidence intervals to be constructed) while still assuring an overall confidence
coefficient is maintained.
- This method applies to an ANOVA situation when the analyst has picked out a particular set
of pairwise comparisons or contrasts or linear combinations in advance.
- The Bonferroni method is valid for equal and unequal sample sizes.
- A disadvantage of this procedure is that the true overall significance level may be so much less than the
nominal maximum that the individual tests become overly conservative and none is likely to be rejected.
Non-Parametric Tests
Applications
- In public health, nonparametric tests are used to compare the effects of two medicines and determine whether
they are equal or not.
- They are also used to determine whether or not a particular medicine cures the ailment.
- The Wilcoxon sign test is a statistical comparison of the averages of two dependent samples.
- The Wilcoxon sign test works with metric (interval or ratio) data that is not multivariate
normal, or with ranked/ordinal data.
- Generally, it is the non-parametric alternative to the dependent samples t-test (paired t-test).
- The Wilcoxon sign test tests the null hypothesis that the average signed rank of two
dependent samples is zero.
Assumptions
- Sample must be paired
- The sample drawn from the population is random and independent
- Continuous dependent variable
- Measurement scale is at least of ordinal scale
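A simplified sketch of the Wilcoxon signed-rank statistic W, assuming no tied absolute differences (zero differences are dropped, as is standard); the paired data are illustrative, not from the text:

```python
def wilcoxon_w(x, y):
    """Wilcoxon signed-rank W for paired samples.

    Simplified: zero differences are discarded and no tied
    absolute differences are assumed (ties would need average ranks).
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]   # discard zero differences
    ranked = sorted(diffs, key=abs)                   # rank |d| from smallest to largest
    w_plus = sum(r for r, d in enumerate(ranked, start=1) if d > 0)
    w_minus = sum(r for r, d in enumerate(ranked, start=1) if d < 0)
    return min(w_plus, w_minus)   # W is the smaller of the two rank sums

# Illustrative paired scores before and after an intervention
before = [10, 8, 12, 14, 9]
after  = [9, 10, 9, 18, 4]
print(wilcoxon_w(before, after))  # 6
```

The computed W is compared against tabulated critical values (or a normal approximation for larger n); a small W indicates a systematic signed difference between the pairs.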
- The Kruskal-Wallis test is a nonparametric (distribution free) test, and is used when the
assumptions of ANOVA are not met.
- They both assess for significant differences on a continuous dependent variable by a
grouping independent variable (with three or more groups).
- In the ANOVA, we assume that distribution of each group is normally distributed and there is
approximately equal variance on the scores for each group. However, in the Kruskal-Wallis
Test, we do not have any of these assumptions.
- Like all non-parametric tests, the Kruskal-Wallis Test is not as powerful as the ANOVA.
Assumptions
- The sample drawn from the population is random and independent
- Measurement scale is at least of ordinal scale
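A minimal sketch of the Kruskal-Wallis H statistic, assuming no tied observations across groups (ties would require average ranks and a correction factor); the three groups below are illustrative:

```python
def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic (assumes no tied observations)."""
    pooled = sorted(x for g in groups for x in g)
    rank = {v: i + 1 for i, v in enumerate(pooled)}   # rank of each value, starting at 1
    n = len(pooled)
    # H = 12 / (N(N+1)) * sum(R_j^2 / n_j) - 3(N+1), where R_j is the rank sum of group j
    h = 12 / (n * (n + 1)) * sum(
        sum(rank[x] for x in g) ** 2 / len(g) for g in groups
    ) - 3 * (n + 1)
    return h

H = kruskal_wallis_h([[1, 2], [3, 4], [5, 6]])
print(round(H, 4))  # 4.5714
```

H is referred to the chi-square distribution with k − 1 degrees of freedom, so only ranks, not raw values, drive the result.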
Tests of Association
Chi-Square Test
- In-order to analyze the sample data for test of independence, we find the degrees of
freedom, expected frequencies and test statistic using a four-fold (2×2) table.
Degree of Freedom: The degree of freedom is equal to:
Df = (r-1) (c-1)
Where r is the number of levels for one categorical variable and c is the number of
levels for other categorical variable.
Expected Frequencies: The expected frequency counts are computed separately for
each cell using the formula
E(r,c) = (row total for r × column total for c) / N
Test Statistic: The test statistic is
χ² = Σ [O(r,c) − E(r,c)]² / E(r,c)
which follows a chi-square distribution with (r − 1)(c − 1) degrees of freedom, where
O(r,c) is the observed frequency in cell (r,c).
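The expected-frequency and χ² calculations can be sketched for any r × c table of observed counts; the 2×2 counts below are invented for illustration:

```python
def chi_square(table):
    """Chi-square statistic and df for an r x c contingency table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n  # E = row total x col total / N
            chi2 += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, df

chi2, df = chi_square([[10, 20], [20, 10]])
print(round(chi2, 4), df)  # 6.6667 1
```

With 1 degree of freedom the 0.05 critical value is 3.84, so this illustrative table would indicate an association between the two variables.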
- The Fisher exact test is a test of significance that is used in place of the chi-square test in 2×2
tables, especially in cases of small samples.
- The Fisher test is recommended when:
Nominal scale
2×2 table with expected frequencies less than 5
The overall total of the table is less than 20, or is between 20 and 40 with the smallest
expected frequency less than 5
- The Fisher exact test uses the following formula:
p = [(a+b)! (c+d)! (a+c)! (b+d)!] / [N! a! b! c! d!]
Where a, b, c and d are the individual frequencies of the 2×2 contingency table and N is the
total frequency
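The formula above gives the hypergeometric probability of one specific table; a sketch (with made-up cell counts) looks like this. Note the full Fisher P-value sums this probability over all tables at least as extreme with the same margins:

```python
from math import factorial

def fisher_table_probability(a, b, c, d):
    """Probability of one specific 2x2 table: the term summed in Fisher's exact test."""
    n = a + b + c + d
    numerator = (factorial(a + b) * factorial(c + d) *
                 factorial(a + c) * factorial(b + d))
    denominator = (factorial(n) * factorial(a) * factorial(b) *
                   factorial(c) * factorial(d))
    return numerator / denominator

p = fisher_table_probability(1, 3, 3, 1)
print(round(p, 4))  # 0.2286
```

Because the calculation is exact rather than a chi-square approximation, it remains valid however small the cell counts are.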
- McNemar's chi-square test is used for comparing two proportions which are paired.
- This non-parametric test assesses whether a statistically significant change in proportions has
occurred on a dichotomous trait measured at two points in the same population.
- For example, if a researcher wants to determine whether or not a particular intervention has
an effect on a disease (yes or no), a count of the individuals is recorded in a 2×2 table
before and after the intervention, and McNemar's test is applied to make a statistical
decision as to whether or not the intervention has an effect on the disease.
- Hypothesis:
Null Hypothesis: Intervention has no impact on disease
Alternative Hypothesis: Intervention has an impact on the disease
                       Intervention 2
                       +        -
Intervention 1    +    a        b
                  -    c        d
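McNemar's statistic uses only the discordant cells b and c of the table above; a brief sketch with illustrative counts:

```python
def mcnemar_chi2(b, c):
    """McNemar chi-square: (b - c)^2 / (b + c), using only the discordant pairs."""
    return (b - c) ** 2 / (b + c)

# Illustrative counts: 15 subjects changed from + to -, 5 changed from - to +
chi2 = mcnemar_chi2(15, 5)
print(chi2)  # 5.0
```

The statistic is compared with the chi-square distribution on 1 degree of freedom (critical value 3.84 at α = 0.05), so this illustrative change would be judged significant.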
Some of the common statistical software that can be used in health sciences are
i. Excel
- Excel is a spreadsheet developed by Microsoft.
- It features calculation, graphing tools, pivot tables and a macro programming language.
Advantages
- Easy to use for data entry and data storage
- Relatively easy to use for basic descriptive statistics
- Excel has some built in basic analysis tools (e.g. t-test, correlation, chi-squared tests)
Disadvantages
- Excel has no restriction on data type storage.
- Excel allows multiple user errors to slip through the gaps
ii. Epi Info
Advantages
- Freeware
- Easy to select subsets of data for analysis, without having to delete records or make
multiple copies of datasets.
- Performs both descriptive statistics and a lot of basic to intermediate analysis (e.g.
comparison of means, proportions; many regression methods, etc.)
- Keeps a savable record of the analysis steps that has been performed.
- Ability to rapidly develop a questionnaire
Disadvantages
- Runs under windows only
- Limited analysis options beyond the basic methods
- Graphics can look quite sloppy: good for interpretation but not so good for scientific
publications.
- Not a dedicated statistical package
iii. Epidata
- Epidata is a computer program for simple or programmed data entry and data
documentation.
- It is highly reliable.
iv. SPSS
- It is a widely used statistical package in social sciences, marketing, education and public
health.
Features
- Menu-driven statistical software, but it does have a scripting language available for typing
commands or creating syntax files
- Plug-ins are available for other programming languages, such as Java, Python, R, and VB.
- Can take data from almost any type of file
- Separate Data view and variable view tabs in the worksheet
- Separate output files that can be customized.
Advantages
- Good range of statistics from descriptive methods (mean, median, frequencies, etc.) through
to common tests (t-tests, regression, ANOVA, etc.) and some more advanced statistical
measures (e.g. factor analysis)
- Can produce some nice looking graphs.
- New variables can be added to worksheet or created using formula
- Easy to use, with powerful statistical functions, making it useful in academic fields.
Disadvantages
- Expensive
- Can only analyze one data set at a time
- Can sometimes be a bit rigid with regard to advanced options for tests.
v. SAS
SAS is commercial software with advanced methodological capabilities.
Advantages
- Pretty much industry standard.
- Used widely in medical and pharmaceutical industry.
- Is very adept at data manipulation (e.g. counting elapsed days between two dates) as well as
analysis.
- Range of procedures from descriptive statistics to simple analysis and on to complex
analysis
- Usually good help files, with many worked examples are available.
- Can write programmes to automate some time consuming processes.
- Graphical user interfaces are available that can help with analysis although complex tasks
may only be possible through scripting.
Disadvantages
- SAS is script-based software.
- There is a steep learning curve at the start, even for very simple analyses.
- Essentially a programming language, it can be tricky and intimidating.
- Some versions are not 100% compatible with each other.
vi. Stata
- STATA is a powerful statistical package that provides comprehensive solutions for data
analysis, data management and graphics.
Advantages
- Performs a large number of statistical analyses
- Has some quick to use commands that give results for simple questions quickly.
- Has a large amount of example data available within the package as well as online.
- Large number of downloadable extensions are available that can be used to do more
complex analysis/data presentation.
- The main Stata files are frequently updated to fix bugs and errors in the programme.
Disadvantages
- Help files are frequently difficult to understand.
- Less flexible than SAS in terms of data manipulation
vii. R software
- R is free open-source software with a programming language for statistical computing and
graphics.
- R provides a wide variety of statistical analyses (linear and non linear modeling, classical
statistical tests, time-series analysis, classification, clustering, etc.)
Advantages
- Freeware
Disadvantages
- R is a scripting based language and therefore has a steep learning curve. More complex to
learn than SAS or Stata
- Help files are variable in usefulness, and it can often be complicated to understand the
command structure
- Getting results out of R quickly (e.g. copying tables to paste into Word) can be complicated.
The overall process of data management in SPSS is depicted in the figure below:
2. Exploration of data
- For categorical data: frequencies
- For numerical data: mean, standard deviation, minimum, maximum, skewness, kurtosis, etc.
- Graphs: histograms, boxplot, bar charts, scatterplots, line graphs, etc.
3. Data Analysis
a. Exploring relationships among variables
- Following analysis are performed by using respective tabs in SPSS window.
Crosstabulation/ chi-square
Correlation
Regression/Multiple regression
Logistic regression
Factor analysis
b. Comparing Groups
- Following analysis can be performed
Non parametric statistics
T-tests
One-way analysis of variance (ANOVA)
Two-way between groups ANOVA
Multivariate analysis of variance (MANOVA)
Finally, the output files from analyses are saved or archived for future reference. The results of
analysis are used and presented according to the objective of the study.
Miscellaneous
Data cleaning intends to identify and correct the errors in data or at least to minimize their
impact on study results.
There are three processes involved in data cleaning and editing:
i. Screening
- During screening four basic types of oddities should be distinguished: lack or excess of
data; outliers, including inconsistencies; strange patterns in (joint) distributions; and
unexpected analysis results and other types of inferences and abstractions.
- For this, data can be examined with simple descriptive tools. For example, in a statistical
package, analyzing range, minimum and maximum values can help detect outliers.
Frequency measures may provide information about excess or lack of data.
- Screening methods:
Checking questionnaire using fixed algorithm
Range checks
Graphical exploration of distribution (histogram, box plot)
Frequency distribution
ii. Diagnosis
- In this phase, the purpose is to clarify the true nature of the worrisome data points, patterns, and
statistics.
- Possible diagnoses for each data point are as follows: erroneous, true extreme, true normal,
or idiopathic (no explanation found, but still suspect).
Spearman's Rank Correlation Coefficient (for non-repeated ranks):
ρ = 1 − 6ΣD² / (n³ − n)
Where, D = R₁ − R₂

Spearman's Rank Correlation Coefficient (for repeated ranks):
ρ = 1 − 6[ΣD² + (m₁³ − m₁)/12 + (m₂³ − m₂)/12 + …] / (n³ − n)
Where m₁, m₂, … are the numbers of times each tied value is repeated.
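The non-repeated-ranks formula can be sketched directly; the two sets of ranks below (e.g. from two judges) are illustrative:

```python
def spearman_rho(rank1, rank2):
    """Spearman's rho for non-repeated ranks: rho = 1 - 6*sum(D^2)/(n^3 - n)."""
    n = len(rank1)
    d_squared = sum((r1 - r2) ** 2 for r1, r2 in zip(rank1, rank2))  # D = R1 - R2
    return 1 - 6 * d_squared / (n ** 3 - n)

# Illustrative ranks given by two judges to five items
rho = spearman_rho([1, 2, 3, 4, 5], [2, 1, 4, 3, 5])
print(rho)  # 0.8
```

Because only ranks enter the formula, rho measures monotonic agreement and is unaffected by the scale of the underlying scores.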
Hypothesis Testing
Z-Test (single population mean):
Z = (x̄ − μ) / (σ/√n)
T-Test (two independent samples):
Sp² = pooled variance = [(n₁ − 1)s₁² + (n₂ − 1)s₂²] / (n₁ + n₂ − 2)
The degrees of freedom are n₁ + n₂ − 2
Paired t-test
t = d̄ / (s_d/√n)
Where,
d̄ = Σd / n
s_d = √[ (Σd² − (Σd)²/n) / (n − 1) ]
d = x − y
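The paired t-test formulas can be sketched end to end; the before/after values below are invented for illustration:

```python
import math

def paired_t(x, y):
    """Paired t statistic: t = d_bar / (s_d / sqrt(n)), where d = x - y; df = n - 1."""
    d = [a - b for a, b in zip(x, y)]
    n = len(d)
    d_bar = sum(d) / n
    # s_d = sqrt( (sum(d^2) - (sum(d))^2 / n) / (n - 1) )
    s_d = math.sqrt((sum(v ** 2 for v in d) - sum(d) ** 2 / n) / (n - 1))
    return d_bar / (s_d / math.sqrt(n)), n - 1

t, df = paired_t([10, 12, 14], [9, 11, 12])
print(round(t, 4), df)  # 4.0 2
```

Working with the differences d reduces the paired problem to a one-sample t-test on d with n − 1 degrees of freedom.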