
BASICS IN

EPIDEMIOLOGY AND BIOSTATISTICS

Waqar H Kazmi MD MS (Tufts, Boston)
Principal, Professor of Nephrology and Director Research
Karachi Medical and Dental College/Abbasi Shaheed Hospital
Karachi, Pakistan

Farida Habib Khan DCH MPH MCPS FCPS
Professor of Community Medicine
Princess Nora Bint Abdulrahman University
Riyadh, Kingdom of Saudi Arabia

Foreword
Waris Qidwai
The Health Sciences Publisher
New Delhi | London | Philadelphia | Panama
Jaypee Brothers Medical Publishers (P) Ltd.
Headquarters
Jaypee Brothers Medical Publishers (P) Ltd.
4838/24, Ansari Road, Daryaganj
New Delhi 110 002, India
Phone: +91-11-43574357
Fax: +91-11-43574314
E-mail: jaypee@jaypeebrothers.com
Overseas Offices
J.P. Medical Ltd.
83, Victoria Street, London
SW1H 0HW (UK)
Phone: +44-20 3170 8910
Fax: +44(0)20 3008 6180
E-mail: info@jpmedpub.com

Jaypee-Highlights Medical Publishers Inc.
City of Knowledge, Bld. 237, Clayton
Panama City, Panama
Phone: +1 507-301-0496
Fax: +1 507-301-0499
E-mail: cservice@jphmedical.com

Jaypee Medical Inc.
The Bourse
111, South Independence Mall East
Suite 835, Philadelphia, PA 19106, USA
Phone: +1 267-519-9789
E-mail: jpmed.us@gmail.com

Jaypee Brothers Medical Publishers (P) Ltd.
17/1-B, Babar Road, Block-B, Shaymali
Mohammadpur, Dhaka-1207
Bangladesh
Mobile: +08801912003485
E-mail: jaypeedhaka@gmail.com

Jaypee Brothers Medical Publishers (P) Ltd.
Bhotahity, Kathmandu, Nepal
Phone: +977-9741283608
E-mail: kathmandu@jaypeebrothers.com

Website: www.jaypeebrothers.com
Website: www.jaypeedigital.com
© 2015, Jaypee Brothers Medical Publishers
The views and opinions expressed in this book are solely those of the original contributor(s)/author(s)
and do not necessarily represent those of editor(s) of the book.
All rights reserved. No part of this publication may be reproduced, stored or transmitted in any form
or by any means, electronic, mechanical, photocopying, recording or otherwise, without the prior
permission in writing of the publishers.
All brand names and product names used in this book are trade names, service marks, trademarks
or registered trademarks of their respective owners. The publisher is not associated with any product
or vendor mentioned in this book.
Medical knowledge and practice change constantly. This book is designed to provide accurate,
authoritative information about the subject matter in question. However, readers are advised to
check the most current information available on procedures included and check information from the
manufacturer of each product to be administered, to verify the recommended dose, formula, method
and duration of administration, adverse effects and contraindications. It is the responsibility of the
practitioner to take all appropriate safety precautions. Neither the publisher nor the author(s)/editor(s)
assume any liability for any injury and/or damage to persons or property arising from or related to use
of material in this book.
This book is sold on the understanding that the publisher is not engaged in providing professional
medical services. If such advice or services are required, the services of a competent medical
professional should be sought.
Every effort has been made where necessary to contact holders of copyright to obtain permission to
reproduce copyright material. If any have been inadvertently overlooked, the publisher will be pleased
to make the necessary arrangements at the first opportunity.
Inquiries for bulk sales may be solicited at: jaypee@jaypeebrothers.com
Basics in Epidemiology and Biostatistics
First Edition: 2015
ISBN: 978-93-5152-631-5
Printed at

Dedicated to

Medical and Dental Students
and
Young Researchers
Foreword

It gives me immense pleasure to write the foreword for Basics in Epidemiology
and Biostatistics, written by the highly eminent and respected scholars
Professor Waqar H Kazmi and Professor Farida Habib Khan. Prof Kazmi is
considered an authority on this subject and has the skill to present
challenging concepts in epidemiology and biostatistics in easy-to-understand
language. He obtained his Masters in Epidemiology from Tufts University,
Boston, USA, and has a strong clinical background as a Professor of
Nephrology. Farida Habib Khan is Professor of Community Medicine. She has
served the College of Physicians and Surgeons as a regular facilitator of the
workshops on research methodology and dissertation writing, and has served
two medical journals as an Associate Editor.
The book fills a great need for books on this important yet neglected
subject. Epidemiology and biostatistics have been neglected in the medical
education curriculum and, therefore, healthcare providers lack expertise in
this important area. The book will go a long way in addressing the need for
an easy-to-understand guide that helps healthcare providers and others to
understand and apply the concepts of epidemiology and biostatistics in their
work. Its simple language and practical approach make it indispensable for
those involved in research work as well as those teaching epidemiology and
biostatistics. It will be useful for undergraduate and postgraduate students
in various healthcare disciplines as well as for those practicing medicine.
The book will also be highly useful to healthcare providers, teachers
and researchers.

Waris Qidwai
Chair, Working Party on Research
World Organization of Family Doctors (WONCA)
Former Chair
International Federation of Primary Care
Research Networks (IFPCRN)
Professor and Chairman
Department of Family Medicine
Aga Khan University
Karachi, Pakistan
Preface

Basics in Epidemiology and Biostatistics introduces medical and dental
students, postgraduates, researchers and clinicians to the study of
statistics applied to medicine. We have drawn on our experience in medicine
and statistics to develop a comprehensive text covering the traditional
topics of biostatistics and epidemiology. Particular emphasis is given to
study design and the interpretation of the results of medical research.
For more than a decade we have been giving lectures at various
undergraduate and postgraduate institutes. Students find these lectures
worthwhile for understanding the basic concepts of biostatistics and
epidemiology. We realized that by writing a book we could reach a large
number of students and faculty members in remote areas who were not
otherwise accessible to us. We hope that anyone interested in research
will find the book extremely helpful.
We have tried to explain all statistical concepts in simple terms. No special
background knowledge is required to understand the text. An effort has been
made to cover all the fundamental concepts and important terms in the book.

The book contains the following features:
Simple Text
The book is written in a simple and easy-to-understand manner. The
information is relevant to the needs of any junior or early-stage
researcher. It is presented in a schematic pattern, because a learner must
understand the prerequisite information before moving on to the more
advanced concepts in basic epidemiology and biostatistics. Thus, all the
information has been presented in a schematic and synchronized way so that
the reader can grasp it easily.

Pictorial and Tabular Display of Information
Different learners have different learning styles. Some find textual
information easy to understand, while others are more at ease with a
pictorial and tabular display of information. Thus, all relevant text has
also been presented in pictorial and tabular form. We hope that many readers
will grasp the important and useful information by taking a good look at the
pictures and tables.
Relevant Examples
We have used multiple clinical and nonclinical examples, including simple
and interesting ones, so that the reader can understand the basic concepts
of epidemiology and biostatistics.
Software Relevant to Use in Research
Several software packages are relevant for research purposes. In this book,
multiple packages have been used to compute sample size. The reader will
find the book useful for understanding how to use the relevant software for
sample size calculation.

Waqar H Kazmi
Farida Habib Khan
Acknowledgments

We are extremely grateful to Muhammad Abdul Samad, Lecturer, Research
Department, Karachi Medical and Dental College, Karachi, Pakistan, for his
invaluable support and efforts in every stage of writing the book.
We express our gratitude to Mrs Huma Khan, Research Co-ordinator,
Universal Research Group, Pakistan, for her support in proofreading
the book.
We are thankful to Asma Kazmi, Assistant Professor, California Institute of
Fine Arts, Los Angeles, USA, for designing the Cover Page.
Our special thanks to M/s Jaypee Brothers Medical Publishers (P) Ltd, New
Delhi, India, for their active co-operation in publishing this book.
Contents

1. Introduction to Research
   • What is Research?
   • Types of Research
   • Steps to Conduct Research
   • Selection of Research Topic
   • Scale for Rating Research Topics
   • Resources of Literature Search
2. Study Designs
   • Definition
   • Types of Epidemiological Study Designs
   • Descriptive Observational Studies
   • Analytical or Comparative Studies
   • Analytical Observational Studies
   • Registries
   • Interventional/Experimental Studies
   • Blinding
   • Consent Form
   • Intent to Treat Analysis
   • Quasi-experimental Studies
   • Clinical Trials and their Phases
   • Research Questions and Study Types
   • Meta-analysis
3. Sampling Procedure
   • Population
   • Reasons for Sampling
   • Sampling Techniques
4. Variables, Data and its Presentation
   • Variables and their Types
   • Data and its Types
   • Tabulation and Graphical Presentation of Data
5. Biostatistics: Basic
   • Measures of Central Tendency
   • Measures of Variation
   • Standard Error of Mean
   • Normal Distribution
6. Estimation and Hypothesis Testing
   • Point Estimate
   • Interval Estimate
   • Hypothesis Testing
   • Introduction to the Scale of Probability
   • Test of Hypothesis
   • Decision Errors
7. Measures of Disease Frequency
   • Ratio, Proportion and Rate
   • Prevalence and Incidence
   • Special Types of Incidence Rates
8. Measures of Association
   • Association between Two Continuous Variables
   • Relative Risk and Odds Ratio
9. Factors Affecting Study Outcomes
   • Introduction
   • Bias
   • Control of Bias
   • Confounding
   • Effect Modifiers
10. Sample Size Estimation
   • Sample Size
   • Sample Size for Single Proportion
   • Sample Size for Single Group Mean
   • Sample Size for Two Proportions
   • Sample Size for Two Group Means
   • Sample Size for Sensitivity and Specificity
   • Suggested Websites for Sample Size Calculator
11. Screening
   • Reliability and Validity of a Screening Test
   • Sensitivity and Specificity
   • Predictive Values
12. Basic Statistical Tests
   • Unpaired Samples
   • Paired Samples
   • What are Validity and Reliability in Research Findings?
13. Overview of Data Collection Techniques
   • Different Data Collection Techniques
14. Data Analysis Plan
   • Importance of Data Analysis Plan
   • What Should the Plan Include?
15. Synopsis Writing
   • Methodology
   • Plan for Analysis of Results
   • Title/Topic
   • Introduction
16. Dissertation Writing
   • Steps in Writing a Dissertation
   • Title
   • Table of Content
   • Title Page
   • Abstract
   • Introduction
   • Hypothesis
   • Study Objective
   • Subjects/Material and Methods
   • Results
   • Discussion
   • Optional Components
   • References
   • Annexes
   • The Whole Manuscript/Dissertation Should be in Past Tense
   • Sample of Title Page
17. Reference Writing
   • Citing a Journal Article
   • Title of Journal Article
   • Journal’s Title
   • Citing a Book Reference
   • Other Authors
   • Dissertation Reference
   • Citing Internet and other Electronic Sources
18. Guidelines for Consent Writing
   • General Ethical Principles
   • Guidelines for Drafting an Informed Consent Form
   • Important Notes
19. Consent to Participate in Research (Sample)
   • Title or Paraphrased Title of the Study
   • Purpose of the Study
   • Procedures
   • Potential Risks and Discomforts
   • Potential Benefits to Subjects and/or to Society
   • For Biomedical Studies only, Add the Following Section Here
   • Identification of Investigators
   • Rights of Research Subjects
Index
CHAPTER 1
Introduction to Research

WHAT IS RESEARCH?
Research is a systematic process of collecting, analyzing and then
interpreting data in order to find solutions to a problem or to explain
an event around us (Fig. 1.1).

TYPES OF RESEARCH
Research is basically of two types: empirical and theoretical
(see Flow chart 1.1 for the classification of research). The empirical
approach is based upon observation and experience, while the theoretical
approach is based upon theory and abstraction. Empirical and theoretical
research complement each other to develop an understanding of a
phenomenon, predict future events and prevent harmful events for the
general welfare of the population of interest.
Empirical research is further divided into qualitative and
quantitative research.

Qualitative Research
This type of research is context based. It is an inquiry whose goal is
to understand a social or human problem by building up a complex and
holistic picture of the phenomenon of interest. The researcher
interprets the perspectives or information obtained from the subjects.

Figure 1.1 Research as a systematic process

Flow chart 1.1 Classification of research

In logic, the two broad methods of reasoning are referred to as
the deductive and inductive approaches.
Deductive reasoning works from the more general to the more
specific. This is informally called the “top down” approach.
Inductive reasoning works the other way, moving from specific
observations to broader generalizations and theories, and is called
the “bottom up” approach. Qualitative research follows the inductive
form.
There are three types of qualitative research: case studies,
ethnographic studies and phenomenological studies.
1. A case study is a descriptive study of a single entity, bounded by
   time and activity.
2. An ethnographic study is a study of a cultural group in a natural
   setting. A cultural group could be a group of people who share a
   common location or a common social experience, e.g. prisoners
   in a jail, or a cultural group such as Muslims.
3. A phenomenological study examines the human experience of a small
   group of people over a long period of time.

Quantitative Research
In quantitative research, reality is studied objectively by the
researcher. A theory or hypothesis is tested using numerical data
analyzed by statistical methods. This type of research is based
on the deductive form of logic. Ultimately, the researcher develops
generalizations and contributes to theory.
The three types of quantitative research are experimental studies,
quasi-experimental studies and surveys.
1. In experimental research, subjects are randomly assigned to
   experimental conditions, and the results are compared with controls.
2. Quasi-experimental studies are similar to experimental studies,
   except that the assignment of subjects to experimental conditions
   is not randomized.
3. Surveys are cross-sectional studies that use questionnaires or
   interviews with the intent of estimating the characteristics of a
   larger population from a smaller group drawn from that population.
Health science research mostly uses the quantitative approach.

STEPS TO CONDUCT RESEARCH
Research is a systematic process that starts with the selection of a
research topic and ends with the reporting of the research findings in
local or international journals or at scientific meetings. Table 1.1
gives the steps in conducting research and the purpose of each.

Table 1.1: Steps to conduct research
Each step is followed by its purpose.
• Selecting a research topic and formulating objective(s)
  – To assess what questions the study will address
  – What will it measure?
• Undertaking literature review
  – To establish why the question is important
  – What is already known about it?
  – What new will this study assess?
• Selecting a study design
  – To ensure that the research design matches the objectives set
• Selecting the subjects
  – To ensure generalizability and validity
• Identifying study variables
  – To be clear about the predictor, outcome and confounding variables
• Collection of data
  – To ensure collection of data aligned to the objective(s) in a
    reliable and nonbiased manner
• Analyzing data
  – To present quantifiable results and assess validity

SELECTION OF RESEARCH TOPIC
Main Criteria for Selecting a Research Topic
There are seven criteria for selecting a research topic.
1. Relevance: Consider the prevalence of the problem in which
   you are interested. In other words, how big is the problem?
2. Innovation: It is good to look into a new problem, but it is not
   always possible to search for new problems as you may have limited
   resources. You can instead work on an old problem from a different
   perspective.
3. Feasibility: This means the availability of the resources that you
   may need to carry out the research project. It includes manpower,
   money, material, machinery, skills and time.
4. Acceptability: It is important to consider whether your proposal
   will be supported by the local authorities. It also includes the
   acceptability of the procedure or method that you are going to apply
   in the community, as certain communities have social boundaries
   that may hamper your research procedure.
5. Cost-effectiveness: Consider whether the resources you are spending
   are worthwhile, for example, in terms of a decline in
   morbidity/mortality rates or length of stay in hospital.
6. Ethical consideration: This includes informed consent, beneficence,
   nonmaleficence (do no harm) and confidentiality of the information
   taken.
7. Applicability of possible results and recommendations: Is it likely
   that the recommendations from the study will be applied? This
   depends not only on the blessing of the authorities but also on the
   availability of resources for implementing the recommendations.
   The opinion of the relevant stakeholders (i.e. potential clients
   and the responsible staff) will influence the implementation of
   the recommendations as well.

SCALE FOR RATING RESEARCH TOPICS
Each of the criteria mentioned above is graded from 1 to 3, where
1 is low, 2 is medium and 3 is high (Table 1.2). Hence, the maximum
possible score for any topic is 21. The topic with the highest score
should be chosen.
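
The rating scale can be sketched in a few lines of Python. This is only an illustration of the scoring rule described above: the topic names and ratings are hypothetical, and the function names are our own.

```python
# Illustrative sketch: topic names and ratings are hypothetical.
CRITERIA = [
    "relevance", "innovation", "feasibility", "acceptability",
    "cost_effectiveness", "ethical_consideration", "applicability",
]

def total_score(ratings):
    """Add up the seven ratings; each must be 1 (low), 2 (medium) or 3 (high)."""
    for name in CRITERIA:
        if ratings[name] not in (1, 2, 3):
            raise ValueError(f"{name} must be rated 1, 2 or 3")
    return sum(ratings[name] for name in CRITERIA)

def best_topic(topics):
    """Return the name of the topic with the highest total score."""
    return max(topics, key=lambda name: total_score(topics[name]))

topics = {
    "Topic A": dict(zip(CRITERIA, [3, 2, 3, 3, 2, 3, 2])),
    "Topic B": dict(zip(CRITERIA, [2, 3, 1, 2, 2, 3, 1])),
}
for name, ratings in topics.items():
    print(name, total_score(ratings))
print("Chosen topic:", best_topic(topics))
```

With seven criteria each rated at most 3, no topic can score more than 21, which agrees with the maximum stated above.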

RESOURCES OF LITERATURE SEARCH
Relevant scientific literature can be searched through the internet,
medical journals, conference literature, newspapers, or documents
of government or nongovernment organizations. The internet is usually
used because the process is quick, reliable and freely accessible.
Through the internet one can access library catalogues, online
databases such as MEDLINE, and a number of biomedical journals.
Researchers should give adequate time to the literature search,
as this will help in writing a good-quality synopsis and
dissertation.
Before using the internet for a literature search, the researcher
should set the keywords for the topic of interest.
Suppose a researcher wants to work on nephropathy as a complication
among diabetic patients who are hypertensive. The keywords are
diabetes, hypertension and nephropathy.

Table 1.2: Scale for rating research topics
Criterion                Low (1)    Medium (2)    High (3)
Relevance
Innovation
Feasibility
Acceptability
Cost-effectiveness
Ethical consideration
Applicability

After opening the PubMed window by directly entering www.pubmed.com
or http://www.ncbi.nlm.nih.gov/PubMed/, the first keyword (diabetes)
is entered in the search bar. Approximately 160000 research papers
will be displayed, which is not manageable (Fig. 1.2).

Figure 1.2 PubMed window after entering the first keyword: “diabetes”

After entering the second keyword (hypertension), the number of
articles narrows down to 16057, but this is still a very large
figure (Fig. 1.3).

Figure 1.3 PubMed window after entering the second keyword: “hypertension”

After entering the third keyword (nephropathy), the number
of articles narrows down to just 3010, which is manageable
(Fig. 1.4).

Figure 1.4 PubMed window after entering the third keyword: “nephropathy”
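
Adding keywords one at a time amounts to combining them with AND, so each new keyword can only narrow the result set. The sketch below builds such a query string. It also builds a search address for the NCBI E-utilities `esearch` service, which the text above does not mention and is included here as an assumption for readers who want to script the search. The code only constructs the strings and does not contact PubMed.

```python
from urllib.parse import urlencode

def build_query(keywords):
    """Join keywords with AND so that every keyword must match."""
    return " AND ".join(keywords)

def esearch_url(keywords):
    """Build (but do not send) an E-utilities search address for PubMed.

    Assumption: the NCBI esearch endpoint with its 'db' and 'term' parameters.
    """
    base = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
    return base + "?" + urlencode({"db": "pubmed", "term": build_query(keywords)})

keywords = ["diabetes", "hypertension", "nephropathy"]
for n in range(1, len(keywords) + 1):
    # Each step adds one more keyword, as in Figs 1.2 to 1.4.
    print(build_query(keywords[:n]))
print(esearch_url(keywords))
```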

CHAPTER 2
Study Designs

DEFINITION
A study design is a plan for conducting a study that allows the
researcher to translate a conceptual hypothesis into an operational
one. It is the method of data collection with respect to time, exposure
and outcome (Fig. 2.1).
The selection of a study design depends upon the research
objective and hypothesis. The researcher should know and use the
study design that best matches the objective.

TYPES OF EPIDEMIOLOGICAL STUDY DESIGNS
Epidemiological study designs are classified as follows (Flow chart 2.1):
• Descriptive observational designs for generating hypotheses:
  – Case report
  – Case series
  – Cross-sectional studies.
• Analytical observational designs for generating/testing hypotheses:
  – Case control studies
  – Cohort studies.
• Analytical experimental designs for testing hypotheses:
  – Randomized controlled clinical trials (gold standard)
  – Quasi-experimental design.

Figure 2.1 Study designs with respect to time

Flow chart 2.1 Types of epidemiological study designs

The difference between hypothesis testing and hypothesis generation
is that a hypothesis-generating study can establish only “an
association” between an exposure and an outcome, while on the basis
of a hypothesis-testing study one can say with confidence that a
certain exposure causes a certain outcome. Experimental studies
(randomized controlled clinical trials) are the most robust studies
and the only hypothesis-testing studies, and hence are considered the
gold standard. Observational studies are weaker and can only generate
a hypothesis.
Epidemiological study designs are broadly divided into two
main types: descriptive and analytical. In descriptive studies, the
researcher quantifies (as % or mean ± SD) the distribution of certain
variables in a study population at a point in time (Table 2.1), while
in analytical studies (observational or experimental), the researcher
tests a previously stated hypothesis.
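
As a minimal sketch of what “% or mean ± SD” means in practice, the snippet below summarizes one continuous and one categorical variable using Python's standard `statistics` module. The ages and genders are invented for illustration, and the helper names are our own.

```python
from statistics import mean, stdev

def describe_continuous(values):
    """Summarize a continuous variable as 'mean ± SD' (sample SD)."""
    return f"{mean(values):.1f} ± {stdev(values):.1f}"

def percent(values, category):
    """Percentage of observations that fall in the given category."""
    return 100 * values.count(category) / len(values)

# Hypothetical data for five patients.
ages = [52, 61, 47, 70, 65]
genders = ["M", "F", "M", "M", "F"]

print("Age (years):", describe_continuous(ages))
print("Male gender (%):", percent(genders, "M"))
```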
In observational studies, the researcher merely observes what
is happening or what has happened in the past and tries to draw
conclusions based on these observations. In experimental studies,
the investigator assigns an intervention to one of the groups. Another
distinguishing feature of the experimental study is the process of
randomization.
The basic variable that defines a study design is time (Flow
chart 2.1). If both the exposure and the outcome are determined at
one point in time, it is a cross-sectional (descriptive) study. If the
outcome has already occurred and the researcher goes back from the
outcome towards the exposure, it is a case control study. If patients
are followed from the exposure towards the outcome, it is a cohort
study or an experimental study.
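
The rule of thumb above can be written as a small decision function. This is a sketch of the classification as stated in this section only. The argument names are our own, and real studies may need more detail than three inputs.

```python
def classify_design(same_time, starts_from=None, intervention_assigned=False):
    """Classify a study by how exposure and outcome relate in time.

    same_time: True if exposure and outcome are determined at one point in time.
    starts_from: "outcome" if the researcher works back from the outcome,
                 "exposure" if subjects are followed forward from the exposure.
    intervention_assigned: True if the investigator assigns the intervention.
    """
    if same_time:
        return "cross-sectional study"
    if starts_from == "outcome":
        return "case control study"
    if starts_from == "exposure":
        return "experimental study" if intervention_assigned else "cohort study"
    raise ValueError("starts_from must be 'outcome' or 'exposure'")

print(classify_design(same_time=True))
print(classify_design(same_time=False, starts_from="outcome"))
print(classify_design(same_time=False, starts_from="exposure"))
print(classify_design(same_time=False, starts_from="exposure",
                      intervention_assigned=True))
```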

DESCRIPTIVE OBSERVATIONAL STUDIES
These studies are usually carried out in one patient or one group.
They describe an event or a problem with respect to time, place
and person. The researcher usually does not have a hypothesis at
the beginning of the study, though one can be formulated from the
conclusions of the study.
The three types of descriptive observational study design are the
case report, the case series and the cross-sectional study.

Case Report
A case report is a report of a single case of disease, usually with an
unexpected presentation, which typically describes the findings,
clinical course and prognosis of the case. Writing a case report is
like writing a good clinical history of a patient: it includes the
presenting features, clinical signs, laboratory investigations, and
the diagnosis reached after excluding a list of differential diagnoses.
A classical example of a case report from history is that of a
congenital anomaly affecting limbs and digits, reported from Germany
in late 1959 (the thalidomide tragedy). The world had never heard of
or seen such a unique congenital anomaly before. These are the types
of cases that should be presented as a case report.

Table 2.1: Baseline characteristics of patients with chronic kidney disease
(hypothetical table of a descriptive study design)
Patient characteristics          Mean ± SD or %
Age (years)
Male gender
Race
  Caucasians
  African-American
  Asians
  Others
Insurance
  Private
  HMO
  Medicare
  Medical aid
  None
Comorbidity index
  Zero
  One
  Two
  Three
Cause of CRI
  Diabetes mellitus
  Hypertension
  GN/PKD/IN
  Other
Laboratory values
  Serum creatinine (mg/dL)
  GFR (mL/min/1.73 m²)
  BUN (mg/dL)
  Serum albumin (g/dL)
  Hct (%)

Case Series
When several unusual cases with similar conditions are described
in a published report, it is called a “case series”. A case series does
not include a control group. After the first case report of the
thalidomide tragedy, a case series was published in 1961. Thalidomide
was used for nausea and vomiting in pregnancy in that era, and soon
more such maldeveloped children were identified, forming the basis
for a case series.
It was then quite easy to identify thalidomide as the exposure,
because all the mothers of children with the outcome (maldevelopment)
had used this drug.

Cross-sectional Studies
In a cross-sectional study, the data are collected at one point in time.
The hallmark of such studies is that there is no follow-up. These
studies are also called “prevalence studies” as they determine the
burden of disease in a population, e.g. the National Health Survey of
Pakistan on the prevalence of hypertension in Pakistan, or the
Pakistan National Diabetic Survey, which shows the prevalence of
diabetes mellitus in Pakistan.
A survey is a classical example of a cross-sectional study. These
days, surveys are also carried out by people other than health
professionals, for example, the media.
In a cross-sectional study, both the exposure and the outcome are
determined at the same time. Hence, four groups are formed: those
exposed who have the outcome, those exposed who do not have the
outcome, those unexposed who have the outcome, and those unexposed
who do not have the outcome (Flow chart 2.2). These four groups form
a 2 × 2 table, and the rates calculated from it are compared.
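
As a rough illustration of the four groups, the sketch below tallies hypothetical survey records into a 2 × 2 table and computes the proportion with the outcome among the exposed and the unexposed. The cell labels a, b, c and d and all the counts are assumptions made for this example.

```python
from collections import Counter

def two_by_two(records):
    """Tally (exposed, has_outcome) pairs into the four cells of a 2 x 2 table."""
    counts = Counter((bool(exposed), bool(outcome)) for exposed, outcome in records)
    return {
        "a": counts[(True, True)],    # exposed, outcome present
        "b": counts[(True, False)],   # exposed, outcome absent
        "c": counts[(False, True)],   # unexposed, outcome present
        "d": counts[(False, False)],  # unexposed, outcome absent
    }

def outcome_proportions(table):
    """Proportion with the outcome among the exposed and among the unexposed."""
    exposed = table["a"] / (table["a"] + table["b"])
    unexposed = table["c"] / (table["c"] + table["d"])
    return exposed, unexposed

# Hypothetical survey of 100 people, all assessed at a single point in time.
records = ([(True, True)] * 30 + [(True, False)] * 20
           + [(False, True)] * 10 + [(False, False)] * 40)
table = two_by_two(records)
print(table)
print(outcome_proportions(table))
```

Because exposure and outcome are recorded together, a difference between the two proportions shows only that they occur together, not which came first.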
If a cross-sectional study covers the whole population, it is called
a census.

Flow chart 2.2 Design of a cross-sectional study

A cross-sectional design is not suitable for studying the association
between an exposure and an outcome, because it is difficult to
establish whether the exposure preceded the outcome. For example,
suppose the researcher is studying the association between uric acid
level and hypertension and, on analysis, finds that most of the
hypertensive patients have hyperuricemia as well. The researcher
cannot say with confidence that hyperuricemia is an exposure/risk
factor for hypertension (the outcome), because hyperuricemia can cause
hypertension and hypertension is also a risk factor for hyperuricemia.
Hence, a temporal association cannot be established in such studies.
Temporal association is one of the first of Hill’s criteria for
confirming an association between an exposure and an outcome. It
simply means that there has to be a time period between the exposure
and the outcome, and that the exposure must precede the outcome.
In the above example, it would have to be shown that a person had
hyperuricemia initially and then, after a period of time, developed
hypertension. In a cross-sectional study, however, the data on
hyperuricemia and hypertension are collected at the same time, so it
cannot be established which came first: the “chicken and egg”
phenomenon.
Hence, cross-sectional studies are useful for determining the
prevalence of a disease, but are not recommended if the researcher
wants to study an association between an exposure and an outcome.

Advantages
• Easy to perform
• Prevalence/frequency of the disease can be calculated
• Inexpensive as compared to analytical studies
• Useful for evaluating diagnostic procedures, e.g. comparing two
  diagnostic or treatment modalities, or the usefulness of a new
  diagnostic procedure
• Useful for measuring current health status and planning for some
  health services
• Takes less time as compared to analytical studies
• Researcher can generate a hypothesis.

Disadvantages
• The data about both the exposure to risk factors and the presence
  or absence of disease are collected simultaneously, hence it is
  difficult to determine the temporal relationship of a presumed cause
  and effect.
• Nonresponse bias (in surveys): it is difficult to obtain sufficiently
  large response rates, as some people are too busy or reluctant to
  participate.
• Although a hypothesis can be generated, it is a weak hypothesis
  that needs to be tested by a further analytical study.

ANALYTICAL OR COMPARATIVE STUDIES


The hallmark of these study designs is that the researcher
has at least two groups (formed on the basis of either exposure or
outcome) at the beginning of the study and a follow-up. Such studies
are also called longitudinal studies.
Hence, the association between an exposure and an outcome can be
established.
It includes:
• Observational studies, e.g. case control and cohort study designs
• Interventional or experimental studies.

ANALYTICAL OBSERVATIONAL STUDIES


Analytical observational studies include:
• Case control study design
• Cohort study design (prospective, retrospective and combination
of retrospective and prospective cohort study).
Such study designs are useful to test etiological hypotheses. From
each of these studies, the data is analyzed to find out:
• Whether any association exists between the exposure/risk factor
and the outcome/disease (by calculating odds ratio in case
control study and relative risk in cohort study).

• If so, what is the strength of association between the exposure/
risk factor and the outcome/disease under study?
• Whether the association between the exposure and the outcome
could be due to chance. This is assessed by a test of
significance, the result of which is reported as the p-value.

Case Control Study


Here the two groups are recruited on the basis of their outcome.
The people who have the outcome in which the researcher is
interested are called “cases”, while the people who do not
have that outcome are called “controls” (Flow chart 2.3).
For example, a pediatrician researcher wants to study the
association between the use of tap water for drinking and diarrhea.
His hypothesis is that “children using tap water for drinking are more
likely to suffer from diarrhea” as compared to those who use mineral
water. In this example, children who are suffering from diarrhea
will be “cases”, while those not having diarrhea will be controls. The
exposure in this study is the use of tap water for drinking, while the
outcome is diarrhea.
Cases and controls are questioned, or their medical records are
consulted regarding past exposure to risk factors. Later the measure
of association is determined which in case of a case-control study is
“odds ratio”.
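The odds ratio calculation can be sketched as follows. The counts below are hypothetical and only illustrate the tap water example; the confidence interval uses the common log-based (Woolf) approximation, which the text itself does not specify.

```python
import math

def odds_ratio(exposed_cases, unexposed_cases,
               exposed_controls, unexposed_controls, z=1.96):
    """Odds ratio (ad/bc) with an approximate 95% CI on the log scale."""
    a, c = exposed_cases, unexposed_cases
    b, d = exposed_controls, unexposed_controls
    or_ = (a * d) / (b * c)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    low = math.exp(math.log(or_) - z * se_log)
    high = math.exp(math.log(or_) + z * se_log)
    return or_, low, high

# Hypothetical counts: 100 children with diarrhea (cases) and 100 without
# (controls); 60 cases and 30 controls drank tap water.
or_, low, high = odds_ratio(60, 40, 30, 70)
print(f"OR = {or_:.2f} (95% CI {low:.2f} to {high:.2f})")  # OR = 3.50
```

An odds ratio above 1 with a confidence interval that excludes 1 would suggest that tap water use is associated with diarrhea in these made-up data.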

Flow chart 2.3  Case control study design


Advantages
• Multiple exposure for a single outcome can be detected
• Inexpensive as compared to other analytical study designs
• No need of follow-up
• Takes less time than other analytical study designs
• Recommended for problems which have a long latent
period, such as cancers
• Recommended for studies on rare diseases
• Recommended for investigating a preliminary hypothesis.

Disadvantages
• Recall bias is the main problem as the “cases” will be more likely to
recall the past exposure. Similarly, if the researcher is working on
geriatric patients then recall bias can be problematic both in cases
and controls as the respondents might not have good memory
due to old age. For example, in a study looking at the association
of being a cigarette smoker for ten years and development of lung
cancer, some participants may have difficulty in recalling whether
they have been a cigarette smoker for ten years or not.
• Selection bias is another problem if the cases and controls are not
properly selected. Here are two examples of selection bias in two
studies carried out at two leading tertiary care centers of the world
by two very eminent researchers of the time.

Study 1
In 1929, Raymond Pearl at Johns Hopkins, Baltimore conducted a
study to test the hypothesis that tuberculosis (TB) protected against
cancer. He selected 816 cases of cancer from 7500 consecutive
autopsies. He also selected 816 controls from others on whom
autopsies had been carried out at Johns Hopkins. Of the 816 cases
(with cancer), 6.6% had TB. Of the 816 controls (without cancer), 16.3%
had TB. From the finding that the prevalence of TB was considerably
higher in the control group, Pearl concluded that TB was protective
against cancer. Actually at the time of this study, TB was one of the
major reasons for hospitalization at Johns Hopkins Hospital. Pearl
thought that the control group’s rate of TB would represent the level
of TB in the general population; but because of the way he selected
the controls, they came from a pool that was heavily weighted with
TB. He should have compared the patients with cancer to a group of
patients admitted for some specific diagnosis other than cancer. The
way the controls are selected is a major determinant of whether a
conclusion is valid or not.

Study 2
Coffee drinking and cancer of the pancreas in women. The cases
(patients with cancer of the pancreas) were white cancer patients
from 11 Boston and Rhode Island hospitals. The controls were
recruited from the gastrointestinal clinics of the same hospitals.
MacMahon found that coffee consumption was greater in cases
than in controls. The controls were patients who had reduced their
coffee consumption because of their physician’s advice. The controls’
level of coffee consumption was not representative of the general
population. When a difference in exposure is observed between
cases and controls we must ask: “Is the level of exposure observed
in the controls really the expected level in the general population?”
In the two studies (1 and 2) the researchers drew erroneous
conclusions about the association between an exposure and an outcome
because of improper selection of controls.

Cohort Studies
Cohort means a group of people sharing the same attribute, e.g. all
those who are exposed to the use of tobacco as compared to those
not exposed to the use of tobacco.
In a cohort study design, the two groups are made on the basis of
exposure (i.e. smokers and nonsmokers). These groups are followed
for a specific period of time for the outcome of interest. This study
design is preferred if the researcher aims to determine the incidence
and the risk factors associated with the disease.
There are two types of cohort studies:
1. Prospective Cohort Study or Concurrent Cohort Study
2. Retrospective Cohort Study or Historical Cohort Study

Prospective Cohort Study


In prospective cohort studies the investigators conceive and design
the study, recruit subjects, and collect baseline exposure data from
all subjects, before any of the subjects have developed an outcome
of interest. The subjects are then followed into the future in order
to record the development of an outcome of interest. The follow-up
can be conducted by mail questionnaires, by phone interviews, via
the Internet, or in person with interviews, physical examinations,
and laboratory or imaging tests. For example, in a study investigating
the association between cigarette smoking for ten years or more and
lung cancer, a researcher who chooses a prospective cohort
design would start the study in the year 2013 and end it in 2023
(Flow chart 2.4).
The Framingham Heart Study is a good example of a large
prospective cohort study. It is an ongoing study that aims to
identify the risk factors associated with heart disease.

Advantages
• Multiple outcomes to a single exposure can be detected
• Incidence rates are calculated
• It helps in calculating the relative risk and the attributable risk
• Temporal association is best studied in prospective cohort study
• It allows the assessment of dose-response relationship
• It helps to accept or to refute the hypothesis with a high degree of
validity
• Complete control over the data.

Flow chart 2.4 Prospective cohort study
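The incidence, relative risk and attributable risk mentioned above can be sketched as follows. The counts are hypothetical. Because textbooks differ on the definition, attributable risk is shown here both as a risk difference and as a percentage of the incidence among the exposed; which of the two the text intends is an assumption.

```python
def cohort_measures(exposed_cases, exposed_total,
                    unexposed_cases, unexposed_total):
    """Incidence in each group, relative risk and attributable risk."""
    incidence_exposed = exposed_cases / exposed_total
    incidence_unexposed = unexposed_cases / unexposed_total
    risk_difference = incidence_exposed - incidence_unexposed
    return {
        "incidence_exposed": incidence_exposed,
        "incidence_unexposed": incidence_unexposed,
        "relative_risk": incidence_exposed / incidence_unexposed,
        "risk_difference": risk_difference,
        "attributable_risk_percent": 100 * risk_difference / incidence_exposed,
    }

# Hypothetical cohort: 30 of 1000 smokers and 6 of 2000 nonsmokers
# develop lung cancer during follow-up.
measures = cohort_measures(30, 1000, 6, 2000)
for name, value in measures.items():
    print(f"{name}: {value:.4g}")
```

In these made-up data the relative risk is about 10, meaning that the incidence among smokers is about ten times that among nonsmokers.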

Disadvantages
• Expensive
• Time consuming
• Strict follow-up is required
• Not suitable for diseases that have a long incubation period
• Not suitable for rare diseases
• Attrition (loss to follow-up) due to migration or death of the
respondents.

Retrospective Cohort Study


Retrospective cohort studies are also called historical cohort studies.
In a prospective cohort study of an outcome that takes a long time to
develop (for example, the study of cigarette smoking for ten years and
lung cancer), loss to follow-up, the long wait for completion of the
study and finding a funding source are all problems. To save time and
money and to complete the study sooner, a retrospective cohort design
is a good alternative (Flow chart 2.5).

Flow chart 2.5  Retrospective cohort study


Advantages
• Less expensive
• Less time consuming
• Follow-up data is obtained through records so ‘follow-up time’ is
saved
• The other advantages of cohort studies are retained.

Disadvantages
There is no control over the data: only the variables that happen to
have been recorded are available. Nothing can be done about missing
data, and sometimes information on a variable of interest is not
available at all.
In a prospective cohort study, the investigators are typically
present from the beginning to the end of the observation period.
However, it is possible to maintain the advantages of the cohort
study without the continuous presence of the investigator, or having
to wait for a long time to collect the necessary data, through the
use of a retrospective cohort study. In other words, although the
investigator was not present when the exposure was first identified,
he reconstructs the exposed and unexposed population from records,
and then proceeds as though he has been present throughout the
study. For example, if the study of ten years of cigarette smoking and
lung cancer were done today (year 2013) using a retrospective cohort
design, the investigator would look into records and identify
the people who were smokers in the year 2003. In this manner, he
selects a cohort who have been exposed to cigarette smoking
for ten years. He would now determine the outcome of lung cancer
today (year 2013). By using the retrospective cohort design
he is able to complete in a few months a study which would
otherwise have taken ten years.

REGISTRIES
In the developed world, researchers have collected data pertaining to
specific diseases in registries such as the United States Renal Data
System (USRDS) for end-stage renal disease (ESRD) patients. The USRDS
has data on all dialysis patients being dialyzed in any of the 50
states of the US. Any patient who initiates dialysis is immediately
registered in this database and subsequently the entire follow-up,
including clinical characteristics, laboratory results and medicines,
is recorded continuously for as long as the patient is alive, or until
the patient dies or receives a kidney transplant. A researcher may be
interested in the risk factors associated with ESRD and may like
to study patients who initiated dialysis from 2001 to 2006. The data
from this registry may be used to conduct a retrospective cohort study.
Data from registries are ideal for retrospective cohort studies.
Clinicians of every specialty should be encouraged to conduct chart
audits to collect data retrospectively on disease of their interest.
Unfortunately, hospital records are not well maintained in low
resource settings and, hence, it is difficult to create registries. In the
developed world, many of the studies done are retrospective
cohort studies using registries. We can follow in these footsteps by
improving our inpatient record systems.

INTERVENTIONAL/EXPERIMENTAL STUDIES
Here intervention or some action is involved such as deliberate
application of a drug in the experimental (study) group and
no intervention in the control group. Later, the outcome of the
experiment is compared in both the groups (Flow chart 2.6).
Thus it differs from the observational analytical study designs
in that here the experiment is directly under the control of the
investigator whereas in the observational analytical studies, the
investigator takes no action, just observes.
There are three key components of an experimental study design:
(1) pre-post test design, (2) a treatment group and a control group,
and (3) random assignment of study participants.
A pre-post test design requires collecting data on study
participants’ level of performance before the intervention is given
(pre-), and collecting the same data on the participants
after the intervention is given (post-). This design is the best way to
ensure that the intervention had a causal effect.

Flow chart 2.6  Sketch of experimental study design

* Pretest: characteristics measured at baseline.
** Post-test: characteristics measured at the end point of the trial.
To get the true effects of the program or intervention, it is necessary
to have both a treatment group and a control group. As the name
suggests, the treatment group receives the intervention while the
control group does not receive the intervention. It is also important
that both the treatment group and the control group are of adequate size
to be able to determine whether an effect took place or not. While
the size of the sample ought to be determined by specific scientific
methods, a general rule of thumb is that each group ought to have at
least 30 participants.
Finally, it is important to make sure that both the treatment group
and the control group are statistically similar. While no two groups
will ever be exactly alike, the best way to ensure that two groups
are comparable is by randomly assigning the participants into the
treatment group and control group. Such random allocation ensures
that any baseline difference between the treatment group and control
group is due to chance alone, and not to selection bias (Table 2.2).
Randomization is the heart of the clinical trial as every individual
from the reference population has an equal chance of being allocated
to either the study group or the control group.
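Random allocation can be sketched in a few lines. This is a minimal illustration of simple randomization by shuffling, with hypothetical participant IDs and a fixed seed used only to make the example reproducible. Real trials often use other schemes (for example block or stratified randomization) which are not shown here.

```python
import random

def randomize(participant_ids, seed=None):
    """Randomly split participants into two equal-sized groups."""
    rng = random.Random(seed)
    ids = list(participant_ids)
    rng.shuffle(ids)  # every ordering of the participants is equally likely
    half = len(ids) // 2
    return {"treatment": sorted(ids[:half]), "control": sorted(ids[half:])}

# Hypothetical trial with 60 participants numbered 1 to 60.
groups = randomize(range(1, 61), seed=2015)
print(len(groups["treatment"]), len(groups["control"]))  # 30 30
```

Because the shuffle does not depend on any characteristic of the participants, neither the investigator nor the participant can influence who ends up in which group.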

Table 2.2: Baseline characteristics of coronary artery disease patients
treated by medical/surgical therapy

Characteristics                           Surgical therapy  Medical therapy   p-value
                                          group (N = 1140)  group (N = 1130)
Age, years                                61.4 ± 10.0       61.7 ± 9.6        0.54
Sex, no. (%)                                                                  0.95
  Male                                    974 (85.4)        964 (85.3)
  Female                                  165 (14.5)        165 (14.6)
Race or ethnic group, no. (%)                                                 0.64
  White                                   984 (86.3)        972 (86.0)
  Black                                   55 (4.8)          55 (4.9)
  Hispanic                                66 (5.8)          56 (5.0)
  Others                                  34 (3.0)          46 (4.1)
Clinical
Angina (CCS class), no. (%)                                                   0.24
  0                                       132 (11.6)        146 (12.9)
  I                                       338 (29.6)        339 (30.0)
  II                                      407 (35.7)        423 (37.4)
  III                                     259 (22.7)        219 (19.4)
  Missing data                            3 (<1)            2 (<1)
Duration of angina, months                                                    0.53
  Median                                  5                 5
Episodes/week with exertion or at rest
within last month                                                             0.83
  Median                                  3                 3
History, no. (%)
  Diabetes                                365 (32.0)        395 (35.0)        0.12
  Hypertension                            755 (66.2)        763 (67.5)        0.53
  Congestive heart failure                56 (4.9)          51 (4.5)          0.59
  Cerebrovascular disease                 99 (8.7)          100 (8.8)         0.83
  Myocardial infarction                   435 (38.2)        437 (38.7)        0.80
  Previous PCI*                           173 (15.2)        183 (16.2)        0.49
  Coronary artery bypass graft (CABG)     124 (10.9)        124 (11.0)        0.94
Stress test
  Total patients, no. (%)                 968 (84.9)        974 (86.2)        0.84
  Treadmill tests, no. (%)                552 (57.0)        550 (56.5)
  Duration of treadmill test, minutes     6.9 ± 2.6         6.8 ± 2.2         0.43
  Pharmacologic stress, no. (%)           415 (42.9)        425 (43.6)
  Echocardiography, no. (%)               61 (5.4)          52 (4.6)
  Nuclear imaging, no. (%)                683 (70.6)        705 (72.2)        0.59
  Single reversible defects               152 (22.2)        159 (22.6)        0.09
  Multiple reversible defects             441 (66.0)        481 (68.2)        0.09
* PCI: percutaneous coronary intervention
The aim of the experimental study designs is to provide scientific
proof of the etiological factors/risk factors.
There are three main types of experimental study designs:
1. Clinical Trial or Randomized Controlled Trial with patients as
unit of study.
2. Field Trial or Community Intervention Studies with healthy
people as unit of study.
3. Community Trial with entire community as unit of study.

Table 2.2 shows a good example of how evenly balanced the
characteristics of the two groups are in a Randomized Controlled
Clinical Trial (RCCT). All the p-values are statistically nonsignificant,
which means that the characteristics of the two groups are comparable.
This is important in a clinical trial because unless the two groups
are comparable you cannot compare the outcomes in the two
groups. If the two groups are not comparable, as often happens in
an observational study, then the study amounts to “comparing
apples with oranges”.
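A baseline comparison of this kind can be sketched with a chi-square test on one row of Table 2.2. The example below uses the diabetes counts from the table and a plain Pearson chi-square test without continuity correction. The source does not state which test produced the printed p-values, so the choice of test is an assumption and the result need not match the table exactly.

```python
import math

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (1 degree of freedom) and p-value for a 2x2 table.

    a, b = with / without the characteristic in group 1
    c, d = with / without the characteristic in group 2
    """
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Upper tail of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Diabetes row of Table 2.2: 365 of 1140 (surgical) vs 395 of 1130 (medical).
chi2, p = chi_square_2x2(365, 1140 - 365, 395, 1130 - 395)
print(f"chi-square = {chi2:.2f}, p = {p:.2f}")
```

A p-value above 0.05 is read here as no statistically significant difference between the two groups for that characteristic.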

BLINDING
Blinding represents an important, distinct aspect of randomized
controlled trials. The term blinding refers to keeping trial participants,
investigators or assessors (those collecting outcome data) unaware
of an assigned intervention. Blinding is of three types:

Single-blind
Here the participants do not know whether they are assigned to the
study or the control group. It means that they do not know whether
they are getting the new drug which is under investigation or the
old conventional drug. Only the investigator knows who is
getting which drug. This type of blinding helps to overcome subject
variation.

Double-blind
Here neither the investigator (doctor) nor the participant (patient)
knows the group allocation and treatment received. However, the
statistician knows it. The drug is coded before it is handed over to
the doctor. This is the type of blinding usually used in practice.

Triple-blind
It goes one step further. All the participants, the doctor and the
statistician are unaware (blind) of the group allocation. Only the
principal investigator is aware of the group allocation and the
treatment allocation.

CONSENT FORM
Since these studies involve human subjects, there are always
ethical issues which cannot be overlooked. Approval from the Ethical
Review Board (ERB) is mandatory. Consent forms are always
required and are scrutinized in detail by the ERB.

INTENT TO TREAT ANALYSIS


In clinical trials, once a patient is randomized to a particular group
he/she will always be analyzed in that group. For example, in a
study on coronary artery disease comparing the outcome (mortality)
between patients who receive medical therapy and those who receive
surgical therapy, a patient who is randomized to the medical therapy
group will be analyzed in this group. If during the trial he has an
acute myocardial infarction and subsequently undergoes CABG surgery,
he will still not be counted in the surgery group despite the fact
that he has undergone surgical treatment.
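The difference between analyzing patients by the group they were randomized to and by the treatment they actually received can be sketched with a small example. The six patient records below are hypothetical, and the as-treated result is shown only for contrast.

```python
from collections import defaultdict

# Hypothetical records: assigned group, treatment actually received,
# and whether the patient died (1) or survived (0).
patients = [
    {"id": 1, "assigned": "medical", "received": "medical", "died": 0},
    {"id": 2, "assigned": "medical", "received": "surgical", "died": 1},  # crossed over after acute MI
    {"id": 3, "assigned": "medical", "received": "medical", "died": 1},
    {"id": 4, "assigned": "surgical", "received": "surgical", "died": 0},
    {"id": 5, "assigned": "surgical", "received": "surgical", "died": 0},
    {"id": 6, "assigned": "surgical", "received": "surgical", "died": 1},
]

def mortality_by(records, key):
    """Proportion of deaths in each group, with groups defined by `key`."""
    deaths, totals = defaultdict(int), defaultdict(int)
    for record in records:
        totals[record[key]] += 1
        deaths[record[key]] += record["died"]
    return {group: deaths[group] / totals[group] for group in totals}

intent_to_treat = mortality_by(patients, "assigned")  # patient 2 stays in "medical"
as_treated = mortality_by(patients, "received")       # patient 2 moves to "surgical"
print(intent_to_treat)
print(as_treated)
```

Grouping by assignment keeps the groups exactly as randomization created them, which is why patient 2 is counted under medical therapy.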

QUASI-EXPERIMENTAL STUDIES
In a quasi-experimental study, one characteristic of a true
experiment is missing, either randomization or the use of a separate
control group. A quasi-experimental study, however, always
includes the manipulation of an independent variable which is the
intervention.
One of the most common quasi-experimental designs uses two
(or more) groups, one of which serves as a control group. Both groups
are observed before as well as after the intervention, to test if the
intervention has made any difference.

CLINICAL TRIALS AND THEIR PHASES


There are five phases of clinical trials:
Preclinical Phase
Drug is developed and evaluated in cells and animals to see its
potential effect on human body.

Phase I Trial
These trials are conducted to determine the recommended dose, side
effects and the manner in which the drug is processed by the body.
Here just 10–20 healthy volunteers are recruited.

Phase II Trial
These are controlled clinical studies in which the drug or treatment
is given to a larger group of people (100–300) to evaluate its
effectiveness. These trials further evaluate its safety
and determine the common short-term side effects and risks.

Phase III Trial


These trials are used as a basis for regulatory approval of a new
drug/device, or for a new indication for a marketed product. These
are expanded controlled and uncontrolled trials after preliminary
evidence suggesting effectiveness of the drug has been obtained.
The study drug or treatment is given to large groups of people
(1,000–3,000) to confirm its effectiveness, monitor side effects,
compare it to commonly used treatments, and collect information
that will allow the drug or treatment to be used safely. These trials
are intended to gather additional information to evaluate the overall
benefit-risk relationship of the drug and provide adequate basis for
physician prescription.

Phase IV Trial
This includes post-marketing studies to delineate additional
information including the drug’s risks, benefits, optimal use and
long-term side effects.

Post-marketing Surveillance
These involve observational studies such as case reports, cohort
studies or case control studies. Their purpose is to assess drug safety
under the conditions of use in general practice, as opposed to the
conditions under which the drug was tested in phase III trials.

Post-marketing Clinical Trials


Here uncontrolled clinical trials are designed to gain more experience
with efficacy and safety—and promote use of the drug or device.
It also includes controlled clinical trials designed to obtain
regulatory approval for a new indication (Phase IIIB).

RESEARCH QUESTIONS AND STUDY TYPES
See Table 2.3.

Table 2.3: Research questions and study types

1. State of knowledge of the problem: Knowing that a problem exists
   but knowing little about its characteristics or possible causes
   Type of research question:
   • What is the nature/magnitude of the problem?
   • Who is affected?
   • How do the affected people behave?
   • What do they know, believe, and think about the problem and its
     causes?
   Type of study: Exploratory studies, or descriptive studies:
   • Descriptive case studies
   • Cross-sectional studies

2. State of knowledge of the problem: Suspecting that certain factors
   contribute to the problem
   Type of research question:
   • Are certain factors indeed associated with the problem? (e.g. Is
     lack of preschool education related to low school performance?
     Is low fiber diet related to carcinoma of the large intestine?)
   Type of study: Analytical (comparative) studies:
   • Cross-sectional comparative studies
   • Case control studies
   • Cohort studies

3. State of knowledge of the problem: Having established that certain
   factors are associated with the problem: desiring to establish
   the extent to which a particular factor causes or contributes to
   the problem
   Type of research question:
   • What is the cause of the problem?
   • Will the removal of a particular factor prevent or reduce the
     problem? (e.g. stopping smoking, providing safe water)
   Type of study: Cohort studies, experimental or quasi-experimental
   studies

4. State of knowledge of the problem: Having sufficient knowledge
   about cause(s) to develop and assess an intervention that would
   prevent, control or solve the problem
   Type of research question:
   • What is the effect of a particular intervention/strategy? (e.g.
     treating with a particular drug: being exposed to a certain type
     of health education)
   • Which of two alternate strategies gives better results?
   • Which strategy is most cost-effective?
   Type of study: Experimental or quasi-experimental studies

META-ANALYSIS
A meta-analysis is a particular type of systematic review that
focuses on the numerical results. The main aim of meta-analysis
is to combine the results from individual studies to produce, if
appropriate, an estimate of the overall or average effect of interest
(e.g., the relative risk). The direction and magnitude of this average
effect, together with a consideration of the associated confidence
interval and hypothesis test result, can be used to make decisions
about the therapy under investigation and the management of
patients.
Figure 2.2 shows a hypothetical meta-analysis comparing two
interventions for a certain outcome. Studies A [RR = 0.65 (CI = 0.1
– 0.7); p-value = 0.01] and E [RR = 0.7 (CI = 0.1 – 0.4); p-value = 0.0001]
show that group A is better, while study H [RR = 1.5 (CI = 1.2 – 2.0);
p-value = 0.001] shows that group B is better. The overall effect is
not significant [RR = 0.75 (95% CI = 0.3 – 1.1); p-value = 0.32], as its
confidence interval includes 1.

Figure 2.2  Hypothetical meta-analysis figure to compare two interventions
(group A and group B) for a certain outcome

Statistical Approach
We decide on the effect of interest and, if the raw data is available,
evaluate it for each study. However, in practice, we may have to
extract these effects from published results. For example, if the
outcome in a clinical trial comparing two treatments is numerical,
the effect may be the difference in treatment means. A zero difference
implies no treatment effect. Similarly, if the outcome is binary (e.g.
died/survived) we consider the risks of the outcome (e.g. death) in
the treatment groups. The effect may be the difference in the risks
or their ratio, the RR. If the difference in risks equals zero or RR = 1,
then there is no treatment effect.
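One common way to combine relative risks from several studies is fixed-effect, inverse-variance pooling on the log scale, sketched below. The text does not say which pooling method it has in mind, so this choice is an assumption. The three studies are hypothetical and are not the studies of Figure 2.2.

```python
import math

def pooled_relative_risk(studies, z=1.96):
    """Fixed-effect (inverse-variance) pooled RR with a 95% CI.

    Each study is (rr, ci_low, ci_high), where the CI is a 95% interval.
    The standard error of log(RR) is recovered from the width of the CI.
    """
    weights, weighted_logs = [], []
    for rr, ci_low, ci_high in studies:
        se = (math.log(ci_high) - math.log(ci_low)) / (2 * z)
        weight = 1 / se ** 2  # more precise studies get more weight
        weights.append(weight)
        weighted_logs.append(weight * math.log(rr))
    pooled_log = sum(weighted_logs) / sum(weights)
    pooled_se = math.sqrt(1 / sum(weights))
    return (math.exp(pooled_log),
            math.exp(pooled_log - z * pooled_se),
            math.exp(pooled_log + z * pooled_se))

# Hypothetical studies: (RR, lower 95% limit, upper 95% limit).
studies = [(0.80, 0.60, 1.07), (0.90, 0.70, 1.16), (1.10, 0.80, 1.51)]
rr, low, high = pooled_relative_risk(studies)
includes_no_effect = low <= 1 <= high  # RR = 1 means no treatment effect
print(f"Pooled RR = {rr:.2f} (95% CI {low:.2f} to {high:.2f})")
```

If the pooled confidence interval includes 1, as it does for these made-up studies, the overall effect is not statistically significant.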

CHAPTER 3

Sampling Procedure

POPULATION
A major purpose of research is to infer or generalize findings from
a sample to a target population. Population is the term statisticians
use to describe a large set or collection of items that have something
in common (e.g. all pregnant women, all pregnant women in third
trimester, all anemic pregnant women in third trimester, etc.).
Target population is the group about which the researcher aims to draw
conclusions. In medicine, population generally refers to patients
or other living organisms, but the term can also be used to denote
collections of inanimate objects, such as autopsy reports, X-ray
reports, or birth certificates.
Figure 3.1 shows the relationship among target population, study
population and sample. Target population is a population of ultimate
clinical interest about which the researcher aims to draw a conclusion.
On account of the cost and other practical issues, the entire target
population cannot be studied. Study population is a subset of
target population that can be studied. Samples are subsets of study
populations investigated in clinical research because often not every
individual in a study population can be measured.
A “sample” is a subset of population with all its inherent qualities.
Studies are conducted on samples but inference is made about the
target population. That is why it is important that the sample should
be a true representative of the target population. Hence, the selected
elements should be properly approached, recruited in the study and
interviewed. Thus, selection of sample is critical as, otherwise, the
research findings might not be valid.
It is vital to have a clear understanding of the terms population
and sample; these two terms must not be used interchangeably.


Figure 3.1  Relationship among target population, study population and sample

REASONS FOR SAMPLING


It is reasonable and practical to collect information from sample
rather than the whole population. Below are the reasons listed for
sampling:
• Samples can be studied more quickly than population
• A study of a sample is less expensive than that of an entire
population
• A study of a population is impossible in most situations
• Results based on samples are often more accurate than results
based on an entire population
• If samples are properly selected, probability methods can be used
to estimate the error in the resulting statistics
• Samples can be selected to reduce heterogeneity.

SAMPLING TECHNIQUES
Broadly, there are two types of sampling techniques (Table 3.1):
1. Probability sampling techniques.
2. Nonprobability sampling techniques.
In a probability sampling technique, each participant in a study
population has an equal (or at least a known) chance of being
selected. The method protects the research from bias and helps ensure
that the sample is a true representative of a population. Importantly,
it helps a researcher to make meaningful statistical estimation while
analyzing the results of the research. In a nonprobability technique,
not every participant has an equal (or known) chance of being selected.

Table 3.1: Different sampling techniques

Probability sampling                Nonprobability sampling
1. Simple random sampling           1. Consecutive sampling
2. Systematic random sampling       2. Convenience sampling
3. Cluster sampling                 3. Purposive sampling
4. Stratified random sampling       4. Quota sampling
                                    5. Snowball sampling

Probability Sampling Techniques


Simple Random Sampling
Simple random sampling is the simplest method of probability
sampling. In this type of sampling technique each individual within
the study frame has an equal chance of inclusion in the sample. A
common example is sometimes called the ‘lottery method’ and
illustrated in Figure 3.2.

Figure 3.2 Lottery sampling technique




For example, 100 participants are available for recruitment into a
study and 25 of them have to be selected (sample size). The
participants to be recruited will be selected randomly by
drawing chits bearing the names/ID numbers of the 100 individuals.
At each draw, every individual still in the study frame has an equal
probability of being selected (i.e. when the first participant is to
be selected the probability is 1/100 for all participants, for the
second participant it is 1/99 for all remaining participants, for the
third participant it is 1/98 for all remaining participants, and so
on). Thus each participant has an equal probability of being selected
for the study.
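The per-draw probabilities above (1/100, 1/99, 1/98, ...) can be combined to give the overall chance that a given individual ends up in the sample. The following minimal sketch does this with exact fractions, assuming draws without replacement as in the lottery method.

```python
from fractions import Fraction

def inclusion_probability(population_size, sample_size):
    """Chance that one given individual is drawn in any of the draws."""
    not_drawn = Fraction(1)
    for draw in range(sample_size):
        remaining = population_size - draw  # chits left before this draw
        not_drawn *= Fraction(remaining - 1, remaining)  # missed on this draw
    return 1 - not_drawn

probability = inclusion_probability(100, 25)
print(probability)  # 1/4
```

Every one of the 100 individuals therefore has the same overall chance, 25/100, of being included in the sample.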
The recommended way to select a simple random sample is to
use a table of random numbers, or a computer-generated list of
random numbers. For this approach each participant should have
an identification number (ID); the list of ID numbers is called a
“sampling frame”.
The steps of simple random sampling are as follows:
• Prepare the sampling frame (assign a number to each element) of
the whole population [Participants are numbered from 1 to 100].
• Determine the sample size [Estimated sample size is 25]
• Randomly select the element [Any 25 numbers are picked from
1 to 100]
OR
• If using computer generated lists to randomly select the
participant
–– Enter lowest ID number (i.e. in this case 001)
–– Enter highest ID number (i.e. in this case 100)
–– Enter the estimated sample size as 25
–– Computer generated randomization software will generate a
table of randomly selected participants/ID number (Fig. 3.3).
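The computer-generated selection described in the steps above can be sketched as follows. The function name and the fixed seed are illustrative only. The seed makes the example reproducible and would normally be left out.

```python
import random

def simple_random_sample(lowest_id, highest_id, sample_size, seed=None):
    """Draw `sample_size` distinct IDs at random from the sampling frame."""
    rng = random.Random(seed)
    sampling_frame = range(lowest_id, highest_id + 1)
    return sorted(rng.sample(sampling_frame, sample_size))

# Lowest ID 001, highest ID 100, estimated sample size 25.
selected = simple_random_sample(1, 100, 25, seed=42)
print(" ".join(f"{participant_id:03d}" for participant_id in selected))
```

Sampling is done without replacement, so no participant can be selected twice.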

Systematic Random Sampling


In systematic sampling technique study participants are selected at
regular intervals using a sampling frame (Fig. 3.4).
Estimate the population size (N) and calculate the required
sample size (n).
Now divide the population size by the sample size, i.e. N/n. This
gives the sampling interval (k). In the above study example,
the number of individuals was 100 and the required sample size
was 25, hence 100/25 is 4 and so every 4th individual should be
selected.

Figure 3.3 Random selection of 25 participants from IDs 001 to 100

Figure 3.4 Systematic random sampling (every 3rd selected)
The first element is selected randomly from the 1st to the kth element
(i.e. from 1 to 4 in the above example). Then every kth element is
selected until the researcher achieves the required sample size. For
example, in Figure 3.5 the second individual in the study population
is selected at random and then every fourth individual is selected
(i.e. 6th, 10th, 14th, etc.).
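
The systematic procedure can be sketched as follows. This is a hedged illustration under the assumption that N is an exact multiple of n; the function name systematic_sample and its start and seed parameters are hypothetical and introduced only for this example.

```python
import random


def systematic_sample(population_size, sample_size, start=None, seed=None):
    """Select every k-th ID, where k = N // n, after a random start in 1..k."""
    k = population_size // sample_size  # sampling interval, e.g. 100 // 25 = 4
    if start is None:
        start = random.Random(seed).randint(1, k)  # random start between 1 and k
    return [start + i * k for i in range(sample_size)]


# Reproduce the example in the text: the random start happens to be the 2nd individual.
ids = systematic_sample(population_size=100, sample_size=25, start=2)
print(ids[:4], "...", ids[-1])
```

With a start of 2 and an interval of 4, the selection runs 2, 6, 10, 14 and so on up to 98, giving exactly 25 individuals.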

Figure 3.5 Systematic random sampling (every 4th selected)

Stratified Random Sampling


Stratified random sampling is a sampling technique that divides
the population into sub-groups (strata), e.g. based on gender, age
group, ethnicity, etc. (Fig. 3.6), and then uses any of the random
sampling techniques to select participants from each stratum
(Fig. 3.7). Suppose a population consists of more females than
males. Whichever random technique is employed, females are then
likely to constitute a greater proportion of the sample than males.
This problem can be overcome by using stratified random sampling.
For example, a population consists of 60 individuals and the
researcher wants equal representation of all strata based on
ethnicity. First, the population is stratified according to ethnicity
(i.e. Caucasian, African-American and Hispanic-American). There
are 30 Caucasians, 20 African-Americans and 10 Hispanic-Americans.
As the researcher wants to select 15 participants, each stratum
must contribute 5 participants. Finally, 5 participants are randomly
selected from each stratum.
One of the main purposes of stratified sampling is to compare
different strata (e.g. males with females, different age groups, etc.)
which may not be possible with simple random sampling alone.
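
The ethnicity example can be sketched in Python as below. The stratum sizes (30, 20 and 10) and the 5 participants per stratum come from the text; the ID numbering, the function name stratified_sample and the seed are assumptions made purely for illustration.

```python
import random


def stratified_sample(strata, per_stratum, seed=None):
    """Draw the same number of members at random from every stratum."""
    rng = random.Random(seed)
    return {
        name: sorted(rng.sample(members, per_stratum))
        for name, members in strata.items()
    }


# Hypothetical ID ranges: 30 + 20 + 10 = 60 individuals in total.
strata = {
    "Caucasian": list(range(1, 31)),
    "African-American": list(range(31, 51)),
    "Hispanic-American": list(range(51, 61)),
}

sample = stratified_sample(strata, per_stratum=5, seed=7)
for name, members in sample.items():
    print(name, members)
```

Each stratum contributes 5 participants regardless of its size, so the smallest group is deliberately over-represented relative to its share of the population.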

Figure 3.6 Stratified random sampling technique (individuals in each stratum)

Figure 3.7 Stratified random sampling technique (participants selected from
each stratum represented by bold-headed stickman)

Cluster Sampling
In the cluster sampling technique, a sub-group of the population is
used as the sampling unit instead of individuals. It is a probability
sampling technique, employed when the researcher aims to select
participants from a large geographical area, e.g. a country, province,
state or city (Flow chart 3.1). Suppose the city of Karachi consists of
18 towns and each town consists of 10 union councils. Initially,
5 towns are selected by any of the random sampling techniques.
Next, 4 union councils are randomly selected from each selected
town. Finally, houses are randomly selected from these 20 union
councils. Thus, in this type of sampling method, households are the
sampling unit instead of individual residents.

Flow chart 3.1 Cluster random sampling technique
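
The three stages of the Karachi example can be sketched as below. The 18 towns, 10 union councils per town, 5 selected towns and 4 selected union councils per town come from the text; the number of households per union council and the number sampled from each are not given in the source and are assumptions chosen only to make the sketch runnable.

```python
import random

rng = random.Random(2015)  # arbitrary seed, for reproducibility only

# Stage 1: choose 5 of the 18 towns.
towns = list(range(1, 19))
selected_towns = sorted(rng.sample(towns, 5))

# Stage 2: choose 4 of the 10 union councils (UCs) within each selected town.
selected_ucs = [
    (town, uc)
    for town in selected_towns
    for uc in sorted(rng.sample(range(1, 11), 4))
]

# Stage 3: choose households within each selected UC.
# ASSUMPTION: 200 households per UC and 10 sampled per UC (not stated in the text).
HOUSEHOLDS_PER_UC = 200
HOUSEHOLDS_SAMPLED_PER_UC = 10
selected_households = {
    uc: sorted(rng.sample(range(1, HOUSEHOLDS_PER_UC + 1), HOUSEHOLDS_SAMPLED_PER_UC))
    for uc in selected_ucs
}

total_households = sum(len(h) for h in selected_households.values())
print(len(selected_towns), "towns,", len(selected_ucs), "union councils,",
      total_households, "households")
```

Only the selected union councils need a household list, which is the practical advantage of sampling clusters rather than individuals across a whole city.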

Nonprobability Sampling Techniques


Consecutive Sampling
Consecutive sampling involves the sequential selection of all
accessible, eligible participants who meet the selection criteria.
Because every eligible participant is recruited in the order in which
they present, the sample is likely to resemble the accessible
population that meets the inclusion and exclusion criteria of the study.
Suppose a strategy is devised to recruit 100 patients (the estimated
sample size) who satisfy the selection criteria and are seen in a
nephrology clinic from Monday to Friday between 9.00 am and
12.00 pm. The first 100 patients who meet the eligibility criteria and
attend the outpatient clinic during these days and times will be
recruited into the study. This method is the best among the
nonprobability sampling techniques, as it minimizes selection bias
by recruiting the complete accessible population within the limits of
the estimated sample size and the selection criteria.

Convenience Sampling
Convenience sampling is presumed to be the most commonly used
technique in clinical research. It involves the selection of subjects
who are conveniently accessible to the researcher. Suppose a
researcher working as a professor of nephrology aims to assess the
communication skills of postgraduate trainees. The sample described
as “20 postgraduate trainees” is simply the 20 postgraduate trainees
in the nephrology ward who volunteered for the study. The
participants were selected because it was feasible for the
investigator, who works in the nephrology ward, to recruit them. The
method is easy, fast and inexpensive, but the sample is not
representative of the larger overall population, and this introduces
selection bias into the research.

Purposive Sampling
Purposive sampling is also called judgmental sampling. The
technique is criticized for introducing selection bias into the research,
as the researcher recruits participants based on a pre-existing belief
that certain subjects are more likely to benefit, comply or respond
in a certain way. Thus, the researcher selects study participants with a
‘particular purpose’ in mind.
For example, suppose the researcher wants to test the hypothesis
that Pakistani females have better knowledge of medical research
than American females. Pakistani female medical students (a group
that has a better understanding of medical research than other
women) and American females who came to a market for shopping
are selected. As the two groups are not comparable, the Pakistani
females will evidently display better knowledge of medical research,
which might not be the case in the wider population. Such deviation
from the truth is due to purposive sampling.
Similarly, in a knowledge survey on the modes of transmission of
HIV, selecting participants who are relatives of AIDS patients will
show excellent knowledge of the transmission modes of HIV. The
selection of study participants is evidently biased, as the sample is
not truly representative of the target population.

Snowball Sampling
The snowball sampling method is employed when study participants
are difficult to identify, access or locate. The method is commonly
used to recruit participants from hard-to-reach groups (e.g. sex
workers, IV drug users, etc.). The sample is built through chain
referrals. Suppose you are investigating knowledge about
contraception among female sex workers. Female sex workers are
hard to identify as they are not registered in Pakistan. Thus, one
female sex worker will be identified and recruited into the study.
Later, this participant will be requested to recommend more sex
workers. Each of these will in turn recommend more sex workers. In
this way, a sizeable sample may be obtained even for a hard-to-reach
group (Flow chart 3.2).

Flow chart 3.2 Snowball sampling technique

Quota Sampling
Quota sampling is a nonprobability sampling method that ensures
that a certain number of study participants from different subgroups
constitute the sample, so that all of these characteristics are
represented. Suppose you aim to assess the quality of life of
dialysis patients, but you think that socioeconomic status has a
strong effect on quality of life in these patients. You therefore decide
to include 25% of respondents from each socioeconomic group (i.e.
upper, middle, lower middle and lower). If the estimated sample
size is 200, each socioeconomic group will include 50 participants.
Thus, the population is first divided into different strata and then
any nonprobability sampling technique is applied to select
participants from each stratum.

CHAPTER 4

Variables, Data and its Presentation

VARIABLES AND THEIR TYPES


Variable
A variable is a measurable characteristic of a person, object or
phenomenon that can take on different values. A simple example of
a variable is a person’s age. The variable age can take on different
values because a person can be 20 years old, 35 years old, and so on.

Type of Variables
Dependent and Independent Variables
Because in health systems research you often look for causal
explanations, it is important to distinguish between dependent and
independent variables.
The variable that is used to describe or measure the problem
under study is called the dependent variable. It represents the
output or effect, or is tested to see whether it is the effect. A
dependent variable is also known as a “response variable”,
“outcome variable” or “output variable”.
The variables that are used to describe or explain differences
in the dependent variable, or that cause changes in the dependent
variable, are called the independent (exposure) variables. They
represent the inputs or causes, or are tested to see whether they are
the cause. An independent variable is also known as a “predictor
variable”, “explanatory variable” or “exposure variable”.
For example, in a study of the relationship between smoking and
lung cancer, ‘suffering from lung cancer’ (with the values yes or no)
would be the dependent variable and ‘smoking’ (varying from not
smoking to smoking more than three packets a day) would be the
independent variable.
Whether a variable is dependent or independent is determined
by the statement of the problem and the objectives of the study. It is
therefore important, when designing an analytical study, to indicate
clearly which variable is the dependent and which is the independent
variable.
If a researcher investigates why people smoke, ‘smoking’ is the
dependent variable and ‘pressure from peers to smoke’ could be an
independent variable. In the lung cancer study, by contrast,
‘smoking’ was the independent variable and lung cancer was the
outcome.

DATA AND ITS TYPES


Data are the values of the observations recorded for variables (e.g.
age, weight, sex).

Types of Data
Data are classified as either qualitative or quantitative (Flow chart 4.1):
1. Qualitative or categorical data.
2. Quantitative or numerical data.

Flow chart 4.1 Classification of data types

* Mutually exclusive means that both events cannot occur at the same time
(e.g. tossing a coin will result in either a head or a tail).

Qualitative or Categorical Data


Qualitative or categorical data comprise characteristics that
cannot be expressed numerically, such as gender, ethnicity, healing, etc.
They are divided into three types:
1. Binary
2. Nominal
3. Ordinal.
Binary data: In binary data, the variable is divided into two
mutually exclusive categories.
Example:
Binary data          Categories
Gender               Male, female
Nominal data: In nominal data, the variable is divided into more
than two mutually exclusive categories. These categories, however,
cannot be ordered one above another (as they are not greater or less
than each other).
Example:
Nominal data         Categories
Marital status       Single, married, widowed, separated and divorced
Employment status    Unemployed, self-employed, public employee and Govt. employee
Ordinal data: In ordinal data, the variable is also divided into more
than two mutually exclusive categories, but they can be ordered one
above another, from lowest to highest or vice versa.
Example:
Ordinal data               Categories
Level of knowledge         Good, average, poor
Level of blood pressure    High, moderate, low

Quantitative or Numerical Data


These are characteristics that can be expressed numerically, such as
age, height, weight, blood pressure, hemoglobin, temperature
and number of children in a family.
Discrete and continuous: Quantitative variables can be classified as
discrete or continuous.
A discrete variable is one whose values can only be whole
numbers. If we are studying the number of children in a family, each
child is equal with respect to providing one counting unit. There are
no intermediate values between consecutive numbers.
A continuous variable is one in which there are no gaps in the
values of the variable: there is an unlimited number of possible
values between any two values on the scale. Thus, if the variable is
height measured in inches, there can be an infinite number of
intermediate values between 4 and 5 inches, such as 4.5 and
4.7 inches. Variables such as these, whose values can occur in
fractions or decimals, are known as continuous variables.

TABULATION AND GRAPHICAL PRESENTATION OF DATA
Data, once collected, should be presented in such a way as to be
easily understood. The style of presentation depends, of course, on
the type of data. Data can be presented as frequency tables, charts,
graphs, etc. Here, we discuss some of the important means of
presentation.

Frequency Tables
The most common way of presenting data is to arrange them in
the form of a table. A frequency table gives the frequency with which
(i.e. the number of times) a particular value appears in the data
(Tables 4.1 and 4.2).
The basic principles of tabulation of data are:
1. The information should be presented in a simple and orderly manner.
2. The table should have a title, which must be brief and comprehensive.
3. Rows and columns must have their own captions.
4. The titles of the rows must be entered on the left side of the table,
   while the titles of the columns are in the top row. The rest of
   the table, constituting the body, contains the numerical values as
   actual numbers, as percentages or in both forms.
5. The class intervals are usually taken at equal intervals.
6. Standard codes or symbols, if used, should be explained in a
   footnote.

Table 4.1: Systolic blood pressure of 100 patients coming to a tertiary care
hospital
Systolic blood pressure   Frequency    Relative     Cumulative relative
(mm Hg)                   (n = 100)    frequency    frequency
Below 100                 15           0.15         0.15
100–120                   25           0.25         0.40
121–140                   20           0.20         0.60
141–160                   30           0.30         0.90
Above 160                 10           0.10         1.00
Total                     100          1.00
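
The relative and cumulative relative frequencies in Table 4.1 follow directly from the counts, as the short sketch below shows. It is an illustration only; the variable names are introduced here and the class labels are copied from the table.

```python
# Frequencies from Table 4.1 (systolic blood pressure classes, mm Hg).
counts = {
    "Below 100": 15,
    "100-120": 25,
    "121-140": 20,
    "141-160": 30,
    "Above 160": 10,
}

total = sum(counts.values())  # n = 100
rows = []
running_count = 0
for label, frequency in counts.items():
    running_count += frequency
    relative = frequency / total        # share of all patients in this class
    cumulative = running_count / total  # share of patients in this class or below
    rows.append((label, frequency, relative, cumulative))

for label, frequency, relative, cumulative in rows:
    print(f"{label:<10} {frequency:>3} {relative:.2f} {cumulative:.2f}")
```

The cumulative column is computed from a running count rather than by adding rounded proportions, which avoids accumulating rounding error.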

Table 4.2: Clinical presentation of patients coming to medical OPD
Clinical presentation    Frequency    Percentage
Vomiting                 30           30.0%
Fever                    25           25.0%
Dyspepsia                20           20.0%
Nausea                   15           15.0%
Headache                 10           10.0%
Total                    100          100.0%

Graphs
Another way to summarize and display data is through graphs, i.e.
pictorial representations of data, which make the data easier to
interpret. Graphs should be designed so that they convey at a single
glance the general patterns in a set of data.

Types of Graphs
• Bar charts
• Pie charts
• Histograms
• Line graphs
• Scatter plots
Figure 4.1  Marital status of respondents

Bar Charts
Bar charts are used for binary, nominal and ordinal (categorical)
data and consist of nonadjacent bars. The bars can be vertical or
horizontal.
Example: The marital status of the respondents (200 in total) who
participated in a knowledge, attitude and practice survey regarding
dengue fever was as follows: single 60 (30%), married 120 (60%) and
divorced 20 (10%). The bar chart is shown in Figure 4.1.
Y-axis = Percentage of respondents
X-axis = Marital status of respondents

Pie Charts
Pie charts can also be used to display binary, nominal and ordinal
(categorical) data. A pie chart consists of a circular region partitioned
into sections, with each section representing a part or percentage of
the whole.
Example: Data regarding knowledge of research ethics were
collected from 150 postgraduate trainees. The survey showed that
60 (40%) of the respondents were male and 90 (60%) were female.
The data are represented in Figure 4.2.

Figure 4.2 Gender distribution of respondents

Figure 4.3 Histogram with normal curve showing distribution of age (in years)

Histograms
A histogram depicts a frequency distribution for quantitative data; it
consists of a series of adjacent bars (Fig. 4.3).
Histograms are constructed to represent continuous or
quantitative data. They are also a convenient way to check visually
whether a quantitative variable is approximately normally
distributed (bell-shaped curve), as many statistical methods assume.

Line Graphs
A line graph (also called a time series plot) is appropriate for
representing data that vary continuously. It shows the trend of a
variable over time. To construct a time series plot, time is placed on
the horizontal axis and the variable being measured on the vertical
axis, with the points connected by line segments (Fig. 4.4).
Example: The population statistics of the US for the years 1860–1950
are given in Table 4.3:

Table 4.3: Population statistics of the US
Year    Population (in millions)
1860    31.4
1870    39.8
1880    50.2
1890    62.9
1900    76.0
1910    92.0
1920    105.7
1930    122.8
1940    131.7
1950    151.1

Figure 4.4 Line graph of US population data

Scatter Plots
A scatter plot represents the relationship between two continuous
variables.
Example: Suppose a researcher wishes to identify whether studying
for longer hours leads to better scores. A collection of data is given
in Table 4.4.
Based on these data, a scatter plot has been constructed as
shown in Figure 4.5. (Note: When constructing a scatter plot, do not
connect the dots.)

Table 4.4: Data on studying hours and corresponding scores
Participant No.    Study hours    Score
1                  3              80
2                  5              90
3                  2              75
4                  6              80
5                  7              90
6                  1              50
7                  2              65
8                  7              85
9                  1              40
10                 7              100

Figure 4.5  Scatter plot of students test scores and hours of study
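
A scatter plot shows the direction of a relationship; one common way to put a number on it is the Pearson correlation coefficient. The text does not compute one here, so the sketch below is an added illustration under that assumption, using the data from Table 4.4 and a hand-written helper named pearson_r.

```python
import math

# Data from Table 4.4.
study_hours = [3, 5, 2, 6, 7, 1, 2, 7, 1, 7]
scores = [80, 90, 75, 80, 90, 50, 65, 85, 40, 100]


def pearson_r(x, y):
    """Pearson correlation: co-variation of x and y scaled by their spreads."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sum((a - mean_x) ** 2 for a in x)
    spread_y = sum((b - mean_y) ** 2 for b in y)
    return cross / math.sqrt(spread_x * spread_y)


r = pearson_r(study_hours, scores)
print(f"r = {r:.2f}")
```

A value close to +1 indicates that scores tend to rise with study hours, which agrees with the upward pattern of the points in the scatter plot. It does not by itself show that longer study causes better scores.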
CHAPTER 5

Biostatistics: Basic

MEASURES OF CENTRAL TENDENCY


Measures of central tendency are the summary measures used
to describe the most “typical” value in a set of values. The three most
common measures of central tendency are the mean, median and mode.
Mean: The most popular measure of central tendency for a
quantitative data set is the arithmetic mean, or simply the mean, of
the data set. It is also known as the average. It is calculated by adding
all the observations and dividing by the total number of observations.
The sample mean is denoted by x̄ (pronounced “x bar”) and the
population mean is denoted by µ (the Greek letter mu). Note that the
mean can only be calculated for quantitative data.
Median: The median is an important measure of central tendency.
It is the value that divides a distribution into two equal halves. To
find it, we arrange the observations in order from smallest to largest
(or vice versa). If there is an odd number of observations, the median
is the middle value. If there is an even number of observations, the
median is the average of the two middle values.
The median is useful when some measurements are much bigger
or much smaller than the rest. The mean of such data will be pulled
toward these extreme values, while the median is not influenced by
them.
Mode: The mode is the most frequently occurring value in a set of
observations.

Example of Mean, Median and Mode


Suppose we draw a sample of five women and measure their weights.
They weigh 110 pounds, 110 pounds, 140 pounds, 150 pounds and
160 pounds.
The mean weight would equal (110 + 110 + 140 + 150 + 160)/5 =
670/5 = 134 pounds.
The median value would be 140 pounds, since 140 pounds is the
middle weight.
The mode of the data set is 110 pounds, since it is the most frequent
value (occurring twice).
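
The same three measures can be checked with Python's standard statistics module. This is a small sketch for verification only; the variable names are introduced here.

```python
import statistics

weights = [110, 110, 140, 150, 160]  # weights in pounds, from the example

mean_weight = statistics.mean(weights)      # sum of observations / number of observations
median_weight = statistics.median(weights)  # middle value of the sorted data
mode_weight = statistics.mode(weights)      # most frequently occurring value

print(mean_weight, median_weight, mode_weight)
```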

Mean Versus Median


The median may be a better indicator of the most typical value if a
set of scores has an outlier. An outlier is an extreme value that differs
greatly from the other values, i.e. a score that lies far above or below
the mean.
For example, suppose that in the above data one individual has a
weight of 250 lbs (the weight of 160 lbs is replaced by 250 lbs). This is
an extreme value, i.e. an outlier, and it will affect the mean.
Mean = (110 + 110 + 140 + 150 + 250)/5 = 760/5
Mean = 152 pounds
Because of the 250-pound value, the mean is much higher than
most readings in the data set. Hence, in such cases the median,
which continues to be 140 pounds, should be reported.
However, when the sample size is large and does not include
outliers, the mean usually provides a better measure of central
tendency.
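
The effect of the outlier on the mean, and its lack of effect on the median, can be verified with a short sketch using the two data sets from the example (variable names are introduced here for illustration).

```python
import statistics

original = [110, 110, 140, 150, 160]
with_outlier = [110, 110, 140, 150, 250]  # 160 lbs replaced by 250 lbs

mean_shift = statistics.mean(with_outlier) - statistics.mean(original)
median_shift = statistics.median(with_outlier) - statistics.median(original)

print("mean moves by", mean_shift, "pounds")
print("median moves by", median_shift, "pounds")
```

The mean rises by 18 pounds (from 134 to 152) while the median stays at 140, which is why the median is preferred when outliers are present.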

MEASURES OF VARIATION
These are measures that describe the amount of variability or
spread in a set of data. The most common measures of variability are
the range, variance and standard deviation.
Range is the simplest measure of variability. It is defined as the
difference in value between the highest (maximum) and the lowest
(minimum) observation in the data set. For example, consider the
following weights of women in the data set: 110 pounds, 110 pounds,
140 pounds, 150 pounds and 160 pounds. The range would be
160 – 110 = 50 lbs.
Variance quantifies the amount of variability or spread about the
mean of the sample.
For instance, the weights of the women in the above example were
110, 110, 140, 150 and 160 pounds, and the mean weight was 134 pounds.
\[ s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} \]
Where $s^2$ = sample variance
$x_i$ = individual sample observation
$\bar{x}$ = sample mean
$n$ = total sample size
$\sum$ = sum of the squared differences between each individual sample
observation and the sample mean
Example:
\[ s^2 = \frac{(110-134)^2 + (110-134)^2 + (140-134)^2 + (150-134)^2 + (160-134)^2}{5 - 1} \]
\[ s^2 = \frac{(-24)^2 + (-24)^2 + (6)^2 + (16)^2 + (26)^2}{4} \]
\[ s^2 = \frac{576 + 576 + 36 + 256 + 676}{4} = \frac{2120}{4} = 530 \]
Standard deviation is the square root of the variance. The standard
deviation is a measure that describes how much individual
measurements differ, on average, from the mean.
\[ SD = \sqrt{s^2} = \sqrt{530} = 23.02 \]
The same results can easily be obtained by SPSS (statistical
package).
Below is the SPSS output showing central tendency and variation
of above data set.

N (Number of observations)    5
Mean                          134
Median                        140.00
Mode                          110.00
Standard deviation            23.02
Variance                      530.00
Range                         50.00
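
For readers without SPSS, the same figures can be reproduced with a few lines of Python. This is an illustrative sketch, not SPSS output; it spells out the sample variance step by step (dividing by n – 1) and then cross-checks against the standard statistics module.

```python
import math
import statistics

weights = [110, 110, 140, 150, 160]  # pounds

mean_weight = statistics.mean(weights)
squared_deviations = [(x - mean_weight) ** 2 for x in weights]

# Sample variance: sum of squared deviations divided by (n - 1).
variance = sum(squared_deviations) / (len(weights) - 1)
standard_deviation = math.sqrt(variance)
value_range = max(weights) - min(weights)

print(f"variance = {variance:.2f}")
print(f"standard deviation = {standard_deviation:.2f}")
print(f"range = {value_range}")
```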

A large standard deviation reflects a wide scatter of measured values
around the mean, while a small standard deviation reflects that the
individual values are concentrated around the mean with little
variation among them.

STANDARD ERROR OF MEAN


When we draw a sample from a study population and compute its
mean, the sample mean is unlikely to be identical to the population
mean. If we draw another sample from the same population and
compute its mean, this may not be identical to the first sample mean
either. It may also differ from the true mean of the total population
from which the sample was drawn; this phenomenon is called
sampling variation.
The standard error of the mean gives an estimate of the degree
to which the sample mean (x̄) varies from the population mean (µ).
This measure is used to calculate the confidence interval (CI), which
is discussed in the next chapter.
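
The text does not state a formula at this point. The standard definition, added here as an assumption, is that the standard error of the mean equals the sample standard deviation divided by the square root of the sample size. A minimal sketch using the five weights from the earlier example:

```python
import math
import statistics

weights = [110, 110, 140, 150, 160]  # pounds, from the earlier example

sample_sd = statistics.stdev(weights)  # sample standard deviation (n - 1 in the denominator)
n = len(weights)

# Standard error of the mean: SD / sqrt(n). Formula assumed, not given in the text.
standard_error = sample_sd / math.sqrt(n)

print(f"SE = {standard_error:.2f} pounds")
```

Because n appears in the denominator, larger samples give a smaller standard error, i.e. sample means that vary less from one sample to the next.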

NORMAL DISTRIBUTION
A normal distribution, such as the distribution shown in Figures 5.1A
and B, is classically a bell-shaped curve. Most of the values are
clustered near the mean and a few values are near the tails. The
normal distribution is symmetrical around the mean. If a variable is
normally distributed, the mean, median and mode will be
approximately equal.
An important characteristic of a normally distributed variable is
that approximately 95% of the measurements lie within 2 standard
deviations (SD) of the mean (Fig. 5.1B).
When the area under the normal curve is divided into sections by
standard deviations above and below the mean, the area in each
section is a known quantity. For example, 34 percent of all the
values of a normally distributed variable lie between the mean
and one standard deviation above it. This also means that there is
a 0.34 chance that a value drawn at random from the distribution
will lie between these two points. Similarly, 34 percent of all the
values lie between the mean and one standard deviation below it, so
there is also a 0.34 chance that a randomly drawn value will lie
between those two points. Consequently, 68 percent of all the values
of a normally distributed variable lie within one standard deviation
on either side of the mean.
Sections of the curve above and below the mean may be added
together to find the probability of obtaining a value within (plus
or minus) a given number of standard deviations of the mean.
For example, the area under the curve between one standard
deviation above the mean and one standard deviation below it is
0.34 + 0.34 = 0.68, which means that approximately 68 percent
of the values lie in that range. Similarly, about 95 percent of the
values lie within two standard deviations and 99.7 percent of
the values lie within three standard deviations of the mean
(Fig. 5.1B).

Figures 5.1A and B Proportion of cases under portions of the normal curve
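
The 68, 95 and 99.7 percent figures are rounded areas under the normal curve. They can be checked with the error function from Python's math module, using the identity that the area within k standard deviations of the mean is erf(k / sqrt(2)). The helper name area_within is an assumption made for this sketch.

```python
import math


def area_within(k):
    """Proportion of a normal distribution lying within k SDs of the mean."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(f"within {k} SD: {area_within(k):.4f}")

# Area between the mean and one SD above it (half of the 1-SD area).
one_sided = area_within(1) / 2
print(f"mean to +1 SD: {one_sided:.4f}")
```

The exact areas are slightly different from the rounded textbook values (for example, a little over 95 percent within two standard deviations), which is why the text says "approximately".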
Example: Suppose that hemoglobin levels were obtained for a study
of 300 chronic kidney disease (CKD) patients and the data were
plotted. The data are normally distributed, with the mean
hemoglobin (Hb) and standard deviation calculated as 7 g/dL
and 1.0 g/dL, respectively.
Thus, out of 300 CKD patients (Fig. 5.2):
• 68% (204) will have hemoglobin within the range of 6.0 g/dL to
  8.0 g/dL (within one standard deviation of the mean).
• 95% (285) will have hemoglobin within the range of 5.0 g/dL to
  9.0 g/dL (within two standard deviations of the mean).
• 99.7% (299) will have hemoglobin within the range of 4.0 g/dL
  to 10.0 g/dL (within three standard deviations of the mean).

Figure 5.2 Hemoglobin level of 300 CKD patients
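
The ranges and patient counts in this example follow mechanically from the mean, the standard deviation and the rounded 68, 95 and 99.7 percent rule. A short sketch, with variable names introduced here for illustration:

```python
mean_hb = 7.0     # mean hemoglobin from the example
sd_hb = 1.0       # standard deviation from the example
n_patients = 300

# Rounded proportions within 1, 2 and 3 standard deviations of the mean.
empirical_rule = {1: 0.68, 2: 0.95, 3: 0.997}

results = []
for k, proportion in empirical_rule.items():
    lower = mean_hb - k * sd_hb
    upper = mean_hb + k * sd_hb
    expected_patients = round(proportion * n_patients)
    results.append((lower, upper, expected_patients))
    print(f"within {k} SD: {lower} to {upper}, about {expected_patients} patients")
```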

CHAPTER 6

Estimation and Hypothesis Testing

Estimation refers to the process by which one makes inferences
about a population, based on information obtained from a sample.

POINT ESTIMATE
• A point estimate is a specific numerical value estimate of a
  parameter.
• The best point estimate of the population mean (µ) is the sample
  mean.
• But how good is a point estimate?
There is no way of knowing how close the point estimate is to
the population mean. Statisticians therefore prefer another type of
estimate called an interval estimate.

INTERVAL ESTIMATE
• An interval estimate of a parameter is an interval or range of
  values used to estimate the parameter (a confidence interval).
• The confidence level of an interval estimate of a parameter is the
  probability that the interval estimate will contain the parameter.
• Two commonly used confidence levels are 95 percent and
  99 percent.
• If one desires to be more confident, the sample size must be
  large enough.
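
As a hedged illustration of an interval estimate, the sketch below computes 95 and 99 percent confidence intervals for a mean using the large-sample normal approximation (sample mean plus or minus z times the standard error). The text gives no formula at this point, so the method is an assumption; the figures (mean 7.0, SD 1.0, n = 300) are borrowed from the hemoglobin example in the previous chapter, and the function name is hypothetical.

```python
import math
from statistics import NormalDist


def mean_confidence_interval(sample_mean, sample_sd, n, level=0.95):
    """Large-sample (normal approximation) confidence interval for a mean."""
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)  # about 1.96 for 95%
    standard_error = sample_sd / math.sqrt(n)
    margin = z * standard_error
    return sample_mean - margin, sample_mean + margin


low95, high95 = mean_confidence_interval(7.0, 1.0, 300, level=0.95)
low99, high99 = mean_confidence_interval(7.0, 1.0, 300, level=0.99)

print(f"95% CI: {low95:.2f} to {high95:.2f}")
print(f"99% CI: {low99:.2f} to {high99:.2f}")
```

For the same data, the 99 percent interval is wider than the 95 percent interval: greater confidence is bought with a less precise range, unless the sample size is increased.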

HYPOTHESIS TESTING
What is a Hypothesis?
A hypothesis is a testable theory. Hypothesis testing is the method
of testing whether claims or hypotheses regarding a population are
likely to be true. Such claims could concern, for example, the
prevalence of an outcome of interest, the mean of an outcome of
interest, or the association between a dependent and an independent
variable. For example, an investigator can have a hypothesis that the
mean Hb of CKD patients on dialysis is 7.5 g/dL. In epidemiological
or clinical studies, a hypothesis is an expected or anticipated
association between an independent variable and a dependent
variable. For example, an investigator aims to look at the association
between cigarette smoking (independent variable) and lung cancer
(dependent variable). His or her hypothesis would be that cigarette
smokers are more likely to develop lung cancer than nonsmokers.
A hypothesis is also referred to as an ‘assumption’ that is
formulated regarding a population parameter of interest, e.g. a mean
or proportion, before the research is conducted. During the course
of the research, the researcher endeavors to test the formulated
hypothesis.

What is the Need to Test a Hypothesis in Research?


Most often we are interested in finding a difference between
two groups or an association between two variables. An observed
difference or association can be real, or it can be attributable to
chance.
Hypothesis testing is done to determine whether the observed
difference or association is due to chance. If a researcher has been
meticulous in controlling bias and confounding, then after hypothesis
testing he or she can ascertain the ‘reality’ of the observed difference
or association with a certain degree of confidence. This confidence is
expressed as a percentage, which is derived from the criterion
at which the hypothesis has been tested.

INTRODUCTION TO THE SCALE OF PROBABILITY


It is worthwhile to understand certain basic concepts related
to probability before discussing the steps of hypothesis testing.
Probability is defined as the chance of occurrence of an event. It is a
common everyday concept which relates to chances of happening of
a particular incident, e.g. it is usual to talk about ‘probability’ of rain
on a given day. The chances of a formulated hypothesis being true or
otherwise are also determined in terms of probability.
Probability is measured on a scale of 0 to 1 (zero to one). A zero
probability means that there is an absolute certainty of an event not
happening or an outcome not appearing, whereas a probability of one
means an absolute certainty of an occurrence of an event or
appearance of an outcome. In between the absolute values of zero and
one there is a whole range of probabilities. The application of the
scale of probability in the concept of hypothesis testing will be
elaborated subsequently (Fig. 6.1).

Figure 6.1 Probability scale: Range (Zero to One)

TEST OF HYPOTHESIS

• Suppose a study is being conducted to answer questions about
  differences between two regimens for the management of diarrhea in
  children: the sugar-based modern ORS, and the time-tested indigenous
  herbal solution made from locally available herbs.
• One question that could be asked is:
  “In the population, is there a difference in overall improvement
  (after three days of treatment) between the ORS and the herbal
  solution?”
• There could be only two answers to this question:
  1. Yes
  2. No

Null Hypothesis (H0)
“There is no difference between the two regimens in terms of
improvement” (null hypothesis). A null hypothesis is usually a
statement that there is no difference between groups, or that one
factor is not dependent on another, and corresponds to the “No”
answer.

Alternative Hypothesis (HA)


“There is a difference in terms of improvement achieved by three days
of treatment with the ORS and that achieved with the herbal solution”
(alternative hypothesis).
Associated with the null hypothesis there is always another
hypothesis or implied statement concerning the true relationship
among the variables or conditions under study, which holds if “No” is
an implausible answer. This statement is called the alternative
hypothesis and corresponds to the “Yes” answer.

Types of Alternate Hypothesis


• Directional
• Nondirectional
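A directional hypothesis is one in which the researcher is able to
explicitly state the direction of the relation between the
populations. A nondirectional hypothesis states only that a
difference exists, without specifying its direction.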

Steps in Hypothesis Testing


1. State the hypotheses: Every hypothesis test requires the researcher
   to state a null hypothesis (H0) and an alternative hypothesis (HA).
   The hypotheses are stated in such a way that they are mutually
   exclusive. That is, if one is true, the other must be false, and
   vice versa.
2. Formulate an analysis plan: The analysis plan describes how to use
   sample data to reject or retain the null hypothesis. It should
   specify the following elements:
   • Selection of significance level (α): Researchers often choose a
     significance level of 0.01 (1 in 100) or 0.05 (1 in 20). The
     significance level is the risk we are willing to take that a
     sample which showed a difference was misleading. A 5 percent
     significance level means that we are ready to take a 5 percent
     chance of wrongly rejecting a true null hypothesis. The
     significance level is set before the actual testing of the null
     hypothesis. If alpha is set at 0.01, the researcher desires to be
     99 percent confident before rejecting the null hypothesis.
   • Choosing a test statistic: t-test or z-test for continuous data,
     chi-square test for proportions, etc. The test statistic is
     computed from the sample data and is used to determine whether
     the null hypothesis should be rejected or retained. The test
     statistic yields a p-value.
3. Analyze sample data: Using the sample data, perform the
   computations called for in the analysis plan.
   • p-value: Indicates the probability of obtaining a result at least
     as extreme as that observed in the study by chance alone,
     assuming that there is truly no association between the exposure
     and the outcome under consideration.
   The final decision to reject the null hypothesis or not depends on
   the p-value.
   By convention the cut-off for the p-value is set at 0.05. Thus, any
   value of p less than or equal to 0.05 indicates that there is at
   most a 5 percent probability of observing an association as large
   as or larger than that found in the study due to chance alone,
   given that there is no association between exposure and outcome.
   If the p-value is higher than the set value of alpha, e.g.
   p > 0.05, then we do not reject the null hypothesis (Fig. 6.2).
4. Interpret the results: If the sample findings are unlikely, given
   the null hypothesis, the researcher rejects the null hypothesis.
   Typically, this involves comparing the p-value to the significance
   level, and rejecting the null hypothesis when the p-value is less
   than the significance level.

Figure 6.2 Level of significance (α = 0.05) for hypothesis testing
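
The decision rule in steps 3 and 4 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the test statistic is taken to be a z-score from a standard normal distribution, the test is two-sided, and the function names and example z-values are hypothetical rather than taken from the text.

```python
import math


def two_sided_p_value(z: float) -> float:
    """Two-sided p-value for a z statistic, assuming a standard normal distribution."""
    # P(|Z| >= |z|) for Z ~ N(0, 1), written with the complementary error function.
    return math.erfc(abs(z) / math.sqrt(2))


def decide(p_value: float, alpha: float = 0.05) -> str:
    """Compare the p-value with the significance level that was set in advance."""
    return "reject H0" if p_value <= alpha else "do not reject H0"


# Hypothetical test statistics, used only to show the rule at work.
for z in (0.5, 1.0, 2.5):
    p = two_sided_p_value(z)
    print(f"z = {z:.2f}  p = {p:.4f}  ->  {decide(p)}")
```

The significance level is an argument with a default of 0.05, so the same rule can be rerun at 0.01 or 0.10 without changing the code.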

DECISION ERRORS
Two types of errors can result from a hypothesis test.
1. Type I error: A type I error occurs when the researcher rejects a
   null hypothesis that is true. The probability of committing a
   type I error is called the significance level. This probability is
   also called alpha, and is often denoted by α. Thus, if the α of a
   study is lowered from 0.05 to 0.01, the maximum chance of
   committing a type I error also reduces from 5 to 1 percent.
   Suppose a researcher wants to compare the mean ages of males and
   females in a class of final year students. The null hypothesis for
   this research is that there is no difference in the mean age of
   males and females of this class. For some reason (e.g. small sample
   size, inappropriate statistical analysis technique, etc.) the
   p-value is calculated as 0.01 (less than 0.05, thus significant at
   the 5 percent level). As a result, the researcher rejects the true
   null hypothesis and thus makes a type I error. In this example,
   there was no true difference in the mean ages of males and females,
   as they are students in the same class, and the null hypothesis was
   true.
   Similarly, this type of error can happen in a court of law: if a
   judge presiding over a trial sends an innocent person behind bars,
   he has committed a type I error. A type I error is considered the
   more serious error, which both researcher and judge must avoid, and
   for this reason they make every effort not to commit it (α = 0.05).
2. Type II error: A type II error occurs when the researcher fails to
   reject a null hypothesis that is false. The probability of
   committing a type II error is called beta, and is often denoted by
   β. The probability of not committing a type II error is called the
   power of the test. The power is generally kept at 80 percent and is
   given by 1 - β. The level of significance and the power of a study
   play a crucial role in sample size determination.
   Suppose a researcher aims to compare the mean Hb of chronic kidney
   disease (CKD) patients with that of the normal population. The null
   hypothesis is that there is no difference in the mean Hb of CKD
   patients and the normal population. Suppose the mean hemoglobin of
   the sample of CKD patients was 7 g/dL, and that of the normal
   population was 12 g/dL. If in this study the sample size is
   inadequate, or an inappropriate statistical technique is used, the
   p-value may be calculated as 0.15 (greater than 0.05, thus
   nonsignificant at the 5 percent level). As a result, the researcher
   fails to reject a null hypothesis which was false, thus making a
   type II error. In this example, there is a big difference between
   the hemoglobin levels of the CKD sample and the normal population.
   There was a true difference between the mean hemoglobin of CKD
   patients and the normal population and the null hypothesis was
   false, but because the sample size was small, the analysis failed
   to detect a significant difference. In order to avoid this error,
   sample size calculation is carried out at the synopsis stage.
   Similarly, this type of error can happen in a court of law: if a
   judge presiding over a trial declares a guilty person innocent and
   frees him or her, he has committed a type II error. This can happen
   where a person who is in fact guilty escapes punishment because the
   court does not have enough evidence against him. So we can see that
   in a court of law having enough evidence is a must to make a
   decision, whereas in research the evidence is a large and adequate
   sample size. A type II error is considered less serious than a
   type I error, but it should also be avoided by having an adequate
   sample size with a minimum power of 80 percent (Table 6.1).
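
The two error rates can be made concrete with a small simulation. The sketch below is an illustration only: the population mean of 12 and SD of 2 echo the Hb figures used in this chapter, but the sample sizes, the alternative mean of 13, the number of trials, and all function names are hypothetical assumptions. It uses a two-sided z-test with the SD treated as known.

```python
import math
import random
from statistics import NormalDist, mean


def rejects_null(sample, mu0, sigma, alpha=0.05):
    """Two-sided z-test of H0: population mean == mu0, with sigma assumed known."""
    z = (mean(sample) - mu0) / (sigma / math.sqrt(len(sample)))
    critical = NormalDist().inv_cdf(1 - alpha / 2)
    return abs(z) > critical


def rejection_rate(true_mean, mu0=12.0, sigma=2.0, n=10, trials=2000, seed=1):
    """Fraction of simulated studies in which H0 is rejected."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    hits = 0
    for _ in range(trials):
        sample = [rng.gauss(true_mean, sigma) for _ in range(n)]
        hits += rejects_null(sample, mu0, sigma)
    return hits / trials


# H0 is true here, so every rejection is a type I error (rate close to alpha).
type_1_rate = rejection_rate(true_mean=12.0)

# H0 is false here, so every rejection is correct; the rest are type II errors.
power_small = rejection_rate(true_mean=13.0, n=10)
power_large = rejection_rate(true_mean=13.0, n=40)

print(f"type I error rate : {type_1_rate:.3f}")
print(f"power with n = 10 : {power_small:.3f} (type II rate {1 - power_small:.3f})")
print(f"power with n = 40 : {power_large:.3f} (type II rate {1 - power_large:.3f})")
```

Running the last two cases side by side shows the point made above: with the same true difference, the larger sample misses it far less often.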

Table 6.1: α- and β-errors

| Truth in the population | Decision: retain the null hypothesis | Decision: reject the null hypothesis |
| Null hypothesis true    | Correct (1 - α)                      | Type I error (α)                     |
| Null hypothesis false   | Type II error (β)                    | Correct (1 - β, power)               |

Simple Explanation of p-value and 95 Percent Confidence Interval
Hypothesis testing is all about the confidence of the researcher in
his results. Having completed his study, he is faced with two
questions:
1. Are you 100 percent sure about your results?
2. Are you sure the results are not by chance?

The researcher takes the help of the p-value and the 95 percent
confidence interval to show the confidence he has in his result.
For the first question, he explains that he cannot be 100 percent
sure about his results, but he can report a 95 percent confidence
interval: a range calculated so that, if the study were repeated many
times, about 95 percent of such ranges would contain the true value.
Regarding question number two, he seeks help from the p-value. A
p-value of less than 0.05 means that, if there were truly no
difference or association, results at least as extreme as his would
arise by chance less than 5 percent of the time. The smaller the
p-value, e.g. 0.001, the less plausible it is that his results arose
by chance alone. So p-values and 95 percent confidence intervals are
all about the confidence the researcher has in his results.
Table 6.2 is a multivariate regression analysis of factors associated
with mortality in dialysis patients. The interpretation for the
independent variable age is that for every one-year increase in age
the risk of mortality increases by 3 percent (RR = 1.03). The 95
percent confidence interval for the same variable is 1.02-1.04. This
means that if the study were carried out a hundred times, the
intervals calculated in this way would be expected to contain the
true relative risk about 95 times. Also, for this variable the limits
of the 95 percent confidence interval are extremely close to the
relative risk. Such an interval is called a tight confidence
interval, and it means that the estimate is very precise. The p-value
for the variable age is less than 0.0001. This means that the
probability of obtaining such a result by chance alone is almost
negligible. So we can see that the researcher takes the aid of the 95
percent confidence interval and the p-value to be sure that his
results are valid. The next variable in this multivariate regression
analysis is gender. The relative risk of mortality among females
compared to males is 1.12, i.e. 12 percent higher. However, when we
look at the p-value (0.14) and the 95 percent confidence interval
(0.97-1.31, which includes 1) for this variable, the result is
statistically nonsignificant. The interpretation for the variable
gender in this study, considering the p-value and 95 percent
confidence interval, would be that there is no statistically
significant difference in mortality between males and females.

Table 6.2: Multivariate regression analysis of factors associated
with mortality among end stage renal disease patients

| Variable                                           | Relative risk (RR) | 95% CI     | p-value  |
| Age (per year increase)                            | 1.03               | 1.02, 1.04 | <0.0001  |
| Female (ref = male)                                | 1.12               | 0.97, 1.31 | 0.14     |
| White (ref = non-white)                            | 1.51               | 1.27, 1.78 | <0.0001  |
| GFR at initiation of dialysis (per mL/min/1.73 m²) | 1.04               | 1.03, 1.05 | <0.0001  |
| Angina (ref = no)                                  | 1.11               | 1.03, 1.18 | 0.02     |
| Congestive heart failure (ref = no)                | 1.12               | 1.04, 1.19 | 0.01     |
| Ambulate independently (ref = no)                  | 0.48               | 0.39, 0.58 | <0.0001  |
| LR (ref = ER)                                      | 1.66               | 1.30, 2.07 | <0.0001  |

Outcome (dependent variable) for this regression analysis was
mortality at 1 year after initiation of dialysis.
Abbreviations: LR: Late referral; ER: Early referral
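
One way to read Table 6.2 systematically is to check, for each variable, whether the 95 percent confidence interval contains 1 (the value of the relative risk when there is no association). The sketch below is an illustration only: the figures are copied from Table 6.2, while the function names and the layout of the rows are hypothetical.

```python
# (variable, relative risk, lower 95% limit, upper 95% limit), copied from Table 6.2
rows = [
    ("Age (per year increase)", 1.03, 1.02, 1.04),
    ("Female (ref = male)", 1.12, 0.97, 1.31),
    ("White (ref = non-white)", 1.51, 1.27, 1.78),
    ("GFR at initiation of dialysis", 1.04, 1.03, 1.05),
    ("Angina (ref = no)", 1.11, 1.03, 1.18),
    ("Congestive heart failure (ref = no)", 1.12, 1.04, 1.19),
    ("Ambulate independently (ref = no)", 0.48, 0.39, 0.58),
    ("LR (ref = ER)", 1.66, 1.30, 2.07),
]


def excludes_no_effect(lower, upper, null_value=1.0):
    """True when the interval does not contain the 'no association' value."""
    return not (lower <= null_value <= upper)


def percent_change(rr):
    """Relative risk expressed as a percentage change in risk."""
    return round((rr - 1.0) * 100, 1)


significant = {name: excludes_no_effect(lo, hi) for name, _, lo, hi in rows}

for name, rr, lo, hi in rows:
    verdict = "significant" if significant[name] else "not significant"
    print(f"{name:38s} RR {rr:.2f} ({percent_change(rr):+.1f}%)  "
          f"CI {lo:.2f}-{hi:.2f}  {verdict}")
```

For every row of this table the interval check agrees with the reported p-value: only gender has an interval that spans 1, and only gender has a p-value above 0.05.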

Solving Hypothesis Testing Problems
The six steps for solving hypothesis testing problems are as follows:
1. State the hypothesis and identify the claim
2. Choose a significance level α
3. Find the critical value(s)
4. Compute the test value
5. Make the decision to reject or not to reject the null hypothesis
6. State the appropriate conclusion.

Example: The population has a mean Hb level (µ) of 12 g/dL and an SD
of 2 g/dL. A sample of the population (x) has a mean Hb of 7 g/dL. Is
the Hb level of the sample representative of the population mean?

Solution
Step 1: State the hypothesis and identify the claim:
Null: The mean Hb level of x = µ
Alternate: The mean Hb level of x ≠ µ

Step 2: Choose a significance level α:
Alpha = 0.05
The researcher is willing to accept a chance of up to 5 percent of
committing a type I error (rejecting a true null hypothesis by
chance).

Step 3: Find the critical value(s)
Two-tailed hypothesis (the alternate hypothesis is x ≠ µ)
Alpha = 0.05, so the critical values are z = -1.96 and z = +1.96

Step 4: Compute the test value
$$Z = \frac{x - \mu}{SD} = \frac{7 - 12}{2} = \frac{-5}{2} = -2.5$$

Step 5: Make the decision to reject or not to reject the null
hypothesis
Since z = -2.5, and -2.5 is less than -1.96, we reject the null
hypothesis.

Step 6: State the appropriate conclusion
We reject the null hypothesis of no difference, and conclude that the
mean Hb level of the sample is different from the population mean
(Fig. 6.3).
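
The six steps of the worked example can be retraced in Python as a check on the arithmetic. This is a minimal sketch that follows the example exactly as stated (the test value is the difference divided by the SD, with no sample size involved); the variable names are hypothetical, and the critical value is obtained from the standard normal distribution in Python's `statistics` module.

```python
from statistics import NormalDist

# Values taken from the worked example.
mu = 12.0     # population mean Hb, g/dL
sd = 2.0      # population SD, g/dL
x = 7.0       # sample mean Hb, g/dL
alpha = 0.05  # significance level chosen in Step 2

# Step 3: two-tailed critical value, leaving alpha/2 in each tail.
critical = NormalDist().inv_cdf(1 - alpha / 2)

# Step 4: test value as defined in the example.
z = (x - mu) / sd

# Step 5: reject H0 when z falls beyond either critical value.
reject = abs(z) > critical

print(f"critical values: -{critical:.2f} and +{critical:.2f}")
print(f"z = {z}")
print("reject H0" if reject else "do not reject H0")
```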

Interpretation
The z-score cut-off points calculated by statisticians for
approximately 2 standard deviations are -1.96 and +1.96. Any z-score
less than -1.96 or greater than +1.96 will fall in the region of
rejection. In the above study, -2.5 is smaller than -1.96, so the
sample mean falls within the region of rejection of the population
mean (Fig. 6.4). Hence, we reject the null hypothesis.

Figure 6.3 Critical regions (the two tails) for rejecting the null
hypothesis (α = 0.025 in each tail)

Figure 6.4 Region of rejection and region of acceptance

Level of Significance (α Level)
The level of significance is the maximum probability of committing a
type I error. Statisticians generally agree on using three arbitrary
significance levels, i.e. 0.10, 0.05 and 0.01. If the significance
level is 0.01, there is a 1 percent probability of committing a
type I error (accepting a difference that is not real). If the
significance level is 0.10, there is a 10 percent probability of
committing this mistake, while if the significance level is 0.05,
there is a 5 percent probability. If the significance level is set at
1 percent, a much larger sample size is required, which may be
difficult to achieve, while if the significance level is set at 10
percent the sample size required will be smaller, but the validity of
the results will become questionable. An increase in sample size
makes the findings more valid, while a decrease in sample size
invariably affects the validity of the results. Thus it is the
investigator's choice and decision to set the significance level so
that the findings are valid at an affordable sample size.
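
The trade-off between significance level and sample size can be illustrated numerically. The sketch below rests on an assumption that is not part of this chapter: it uses a common textbook approximation for comparing two means, n per group = 2(z for α/2 + z for power)² × SD² / (difference)². The SD of 2 g/dL, the difference of 1 g/dL, the power of 80 percent and the function name are all hypothetical choices.

```python
import math
from statistics import NormalDist


def n_per_group(alpha, power=0.80, sd=2.0, difference=1.0):
    """Approximate sample size per group for a two-sided comparison of two means.

    Assumed formula (not from this chapter):
    n = 2 * (z_{1 - alpha/2} + z_{power})^2 * sd^2 / difference^2
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return math.ceil(2 * (z_alpha + z_power) ** 2 * sd ** 2 / difference ** 2)


sizes = {alpha: n_per_group(alpha) for alpha in (0.10, 0.05, 0.01)}
for alpha, n in sizes.items():
    print(f"alpha = {alpha:.2f}  ->  about {n} subjects per group")
```

Under these assumptions the required sample size rises steadily as the significance level is tightened from 0.10 to 0.01, which is the pattern described above.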

CHAPTER 7
Measures of Disease Frequency

For epidemiological purposes, the occurrence of cases of disease must
be related to the “population at risk” giving rise to the cases.
Several measures of disease frequency are in common use. There are
three general classes of mathematical parameters used to relate the
number of cases of a disease or outcome to the size of the source
population.

RATIO, PROPORTION AND RATE
Ratio
It is obtained by simply dividing one quantity by another without
implying any specific relationship between the numerator and
denominator, such as the gender ratio, i.e. females : males. In a
ratio, the numerator and denominator are mutually exclusive.
For example, the female to male ratio of postgraduate trainees in
Abbasi Shaheed Hospital is:
$$\text{Gender ratio} = \frac{\text{Number of female trainees in Abbasi Shaheed Hospital}}{\text{Number of male trainees in Abbasi Shaheed Hospital}} = \frac{150}{50}$$
Female : Male = 3:1



Proportion
It is a type of ratio in which those who are included in the numerator
must also be included in the denominator.
For example: the proportion of postgraduate trainees who have passed
the FCPS part 2 examinations.
Total number of postgraduate trainees who appeared in the FCPS part 2
examinations = 1500
Total number of postgraduate trainees who cleared the FCPS part 2
examinations = 150
$$\text{Proportion} = \frac{\text{Total number of postgraduate trainees who cleared FCPS part 2}}{\text{Total number of postgraduate trainees who appeared in FCPS part 2}} = \frac{150}{1500} = \frac{1}{10}$$


Rate
A rate is a proportion with specification of time. There is a distinct
relationship between the numerator and denominator with a
measure of time being an intrinsic part of the denominator.
For example, the number of newly diagnosed cases of cervical

cancer per 100,000 women during a given year.
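
The three measures can be computed side by side. In the sketch below the trainee counts come from the examples above; the cervical cancer figures (120 new cases among 800,000 women in one year) are hypothetical numbers chosen only to show how a rate per 100,000 is formed, and the function name is likewise an assumption.

```python
from fractions import Fraction

# Ratio: numerator and denominator are mutually exclusive groups.
female_trainees, male_trainees = 150, 50
gender_ratio = Fraction(female_trainees, male_trainees)
print(f"Female : Male = {gender_ratio.numerator}:{gender_ratio.denominator}")

# Proportion: everyone in the numerator is also in the denominator.
appeared, cleared = 1500, 150
proportion = Fraction(cleared, appeared)
print(f"Proportion who cleared = {proportion} ({float(proportion):.0%})")


def rate_per_100k(new_cases, population):
    """New cases per 100,000 population over the stated time period."""
    return new_cases * 100_000 / population


# Rate: a proportion with time built in (hypothetical figures, one year).
cervical_cancer_rate = rate_per_100k(new_cases=120, population=800_000)
print(f"Rate = {cervical_cancer_rate} per 100,000 women per year")
```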

Important Point
It is necessary to be very specific about what constitutes both the
numerator and the denominator. In some circumstances, it is
important to make clear distinction whether the measure represents
the number of events or the number of individuals.
For example, the frequency of myopia among a population of

school children could represent the number of affected eyes in
relation to total eyes (measure represents the event), or the number
of children affected in one or both eyes relative to all students
(measure represents the number of individuals).

PREVALENCE AND INCIDENCE


The frequency of disease is commonly measured in epidemiological
studies and broadly categorized as incidence and prevalence.


Prevalence
Prevalence quantifies the proportion of individuals in a population
who have the disease at a specific instant and provides an estimate
of the probability that an individual will be ill at a point in time.
Prevalence is a proportion, so it has no unit.
Prevalence “P” can be calculated as
$$\text{Prevalence} = \frac{\text{Number of existing cases (both old and new) of a disease}}{\text{Total population at a given point in time}}$$

Point Prevalence

Point prevalence measures the frequency of the disease of interest in
a defined population at a single point in time.
$$\text{Point prevalence} = \frac{\text{Number of cases (diseased) in a defined population at one point in time}}{\text{Number of persons in a defined population at the same point in time}}$$
For example: Of 25,000 male residents in Steel Town on 1st March,
2013, 2,500 have diabetes. The prevalence of diabetes among men in
Steel Town on 1st March, 2013 is calculated as:
Prevalence (P) = 2,500/25,000 = 1/10 or 0.1
Prevalence can also be expressed as a percentage (cases per 100) by
multiplying P by 100. Thus the prevalence of diabetes among men in
Steel Town on 1st March, 2013, expressed as a percentage, is:
0.1 × 100 = 10%

Period Prevalence
Period prevalence is the total number of cases (diseased) at any
point during a specified period of time divided by the population at
risk midway through the period.
$$\text{Period prevalence} = \frac{\text{Cases present at the start of the time period} + \text{New cases developed during this time period}}{\text{Population at risk midway through that time period}}$$
For example: A study was conducted in Gulberg Town, Karachi from
January 1st 2011 to December 31st 2012, to determine the period
prevalence of hypertension in women older than 45 years during this
time period. The period prevalence based on the data below is as
follows:
• Number of hypertensive women older than 45 years residing in
  Gulberg Town as on 1st January 2011 = 2,500
• Number of women older than 45 years residing in Gulberg Town who
  developed hypertension from 1st Jan 2011 to 31st Dec 2012 = 500
• Population at risk midway (as on 31st Dec, 2011) = 60,000
$$\text{Period prevalence} = \frac{2500 \text{ (old cases)} + 500 \text{ (new cases)}}{60{,}000 \text{ (midway population)}} = \frac{3000}{60{,}000} = \frac{1}{20} = 0.05$$
Period prevalence when expressed as a percentage would be:
= 0.05 × 100
= 5%
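
Both prevalence calculations can be reproduced with two short functions. The sketch below uses the Steel Town and Gulberg Town figures from the examples above; the function names are hypothetical.

```python
def point_prevalence(cases, population):
    """Existing cases divided by the population at the same point in time."""
    return cases / population


def period_prevalence(cases_at_start, new_cases, midway_population):
    """Old plus new cases during the period, divided by the midway population."""
    return (cases_at_start + new_cases) / midway_population


# Diabetes among men in Steel Town on 1st March, 2013.
steel_town = point_prevalence(cases=2_500, population=25_000)

# Hypertension in women older than 45 years, Gulberg Town, 2011 to 2012.
gulberg_town = period_prevalence(cases_at_start=2_500, new_cases=500,
                                 midway_population=60_000)

print(f"Point prevalence  = {steel_town:.2f} ({steel_town:.0%})")
print(f"Period prevalence = {gulberg_town:.2f} ({gulberg_town:.0%})")
```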


Factors Influencing Prevalence
Prevalence is a useful measure for quantifying the burden of disease
in a population at a given point in time, and is thus beneficial in
planning health services. However, as it is influenced by a number of
factors (Table 7.1), it is not a useful measure for establishing the
determinants of disease (causality) in a population.

Incidence
Incidence quantifies the number of new events or cases of disease
that develop in a population of individuals at risk during a specified
time interval.
$$\text{Incidence} = \frac{\text{Number of new cases of a disease}}{\text{Total population at risk}}$$


Table 7.1: Factors influencing prevalence

| Increased by                                      | Decreased by                                    |
| Longer duration of disease                        | Shorter duration of disease                     |
| Prolongation of life of patients without cure     | High case fatality rate from disease            |
| Increase in new cases (increase in incidence)     | Decrease in new cases (decrease in incidence)   |
| In-migration of cases                             | In-migration of healthy people                  |
| Out-migration of healthy people                   | Out-migration of cases                          |
| In-migration of susceptible people                | Increased cure rate of cases                    |
| Improved diagnostic facilities (better reporting) | Worsening diagnostic facilities (poor reporting) |

Issues in the Calculation of Measures of Incidence
For any measure of disease frequency, precise definition of the
denominator is essential for both accuracy and clarity. This is of
particular concern in the calculation of incidence. The denominator
of a measure of incidence should include only those who are
considered “at risk” of developing the disease, that is, the total
population from which the new cases could arise.
Consequently, those who currently have or have already had the
disease under study, or persons who cannot develop the disease for
reasons such as age, immunization, or prior removal of the involved
organ, should be excluded from the denominator.

SPECIAL TYPES OF INCIDENCE RATES
• Cumulative incidence rate or incidence risk
• Incidence density rate.

Cumulative Incidence Rate


Cumulative incidence (CI) is a simpler measure of disease occurrence.
It is the proportion of people who become diseased during a specified
period of time. It provides an estimate of the probability, or
incidence risk, that an individual will develop a disease during a
specified period of time. Hence, the characteristics of CI are:
• A population is identified and screened for the disease at baseline.
• Those who do not have the disease are followed for a specified
  period (for example, a year) and then rescreened.
• Any cases that develop in this period are ‘new’ cases.
• It measures the denominator at only one point in time (the
  beginning of the specified period).
$$\text{CI} = \frac{\text{Number of new cases of a disease in a specified period of time}}{\text{Number of disease-free persons at the beginning of that time period}}$$
It is important to note that the denominator is the total number of
people who were free of the disease at the beginning of the study
period, defined as the population at risk. The cumulative incidence
assumes that the entire population at risk at the beginning of the
study period has been followed for the specified time period for the
development of the outcome under investigation. This is called a
closed population.
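
The cumulative incidence calculation can be sketched as follows. The text gives no worked figures for this measure, so the numbers below (1,000 people screened, 100 already diseased at baseline, 45 new cases after one year) and the function name are entirely hypothetical.

```python
def cumulative_incidence(screened, diseased_at_baseline, new_cases):
    """New cases divided by the people who were disease-free at baseline."""
    at_risk = screened - diseased_at_baseline  # existing cases are not at risk
    if new_cases > at_risk:
        raise ValueError("new cases cannot exceed the population at risk")
    return new_cases / at_risk


# Hypothetical closed population followed for one year.
ci = cumulative_incidence(screened=1_000, diseased_at_baseline=100, new_cases=45)
print(f"One-year cumulative incidence = {ci:.3f} ({ci:.1%})")
```

Excluding the 100 existing cases matters: dividing by all 1,000 people screened would understate the risk among those who could actually develop the disease.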

Incidence Density (ID) Rate


Incidence density measures the rate (speed) at which new cases of
disease occur in a population. It is a more precise measure of
incidence as it takes into account varying periods of follow-up
arising for reasons such as refusal to continue participating in the
study, migration, death, and new participants entering the study some
time after it starts.
As not every individual in the denominator is followed up for the
full time, commonly because of loss to follow-up, different
individuals may be observed for different lengths of time. The
“Incidence Density Rate” is therefore calculated for a more precise
estimate of incidence.
$$\text{Incidence density} = \frac{\text{Number of new cases of disease during a given period of time}}{\text{Total person-time at risk during the follow-up period}}$$


Table 7.2: Person-time (years) at risk for 5 individuals in a
hypothetical cohort study between 2008-2012

| Person | Jan 2008   | Jan 2009   | Jan 2010   | Jan 2011   | Jan 2012   | Years at risk |
| 1      | ---------- | ---------- | ---------- | ---------- | ---------- | 5 years       |
| 2      | ---------- | ---------- | ---------- | ---------- | -----x     | 4.5 years     |
| 3      | ---------- | ---------- | -----x     |            |            | 2.5 years     |
| 4      | ---------- | ---------- | -----L     |            |            | 2.5 years     |
| 5      | ---------- | ---------- | ---------- | -----x     |            | 3.5 years     |
| Total  |            |            |            |            |            | 18 years      |

---------- = Time at risk
x = Developed disease
L = Person lost to follow-up

Calculating Person-time at Risk


The denominator in incidence density is person-time at risk, which
is the sum of each individual’s time at risk (i.e. the length of time
study participants were followed in the study). It is commonly
expressed as person-years at risk. When a study subject develops the
disease, dies or leaves the study, they are no longer at risk and
will no longer contribute person-time units at risk (Table 7.2).
Thus, in the above example, three persons (2, 3 and 5) developed the
disease (x) over a total of 18 person-years at risk, and the
incidence density is calculated as:
$$\text{Incidence density} = \frac{3 \text{ cases}}{18 \text{ person-years}} \times 100 \approx 16.7 \text{ cases per 100 person-years at risk}$$
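
The person-time calculation for Table 7.2 can be checked in a few lines. In this sketch the follow-up times and outcomes are read directly from Table 7.2 (persons 2, 3 and 5 developed the disease; person 4 was lost to follow-up); the variable and function names are hypothetical.

```python
# (years at risk, developed disease?) for persons 1 to 5, read from Table 7.2
follow_up = [
    (5.0, False),  # person 1: followed for the full period, no disease
    (4.5, True),   # person 2: developed disease
    (2.5, True),   # person 3: developed disease
    (2.5, False),  # person 4: lost to follow-up
    (3.5, True),   # person 5: developed disease
]


def incidence_density(records, per=100):
    """New cases per `per` person-years at risk."""
    person_years = sum(years for years, _ in records)
    cases = sum(1 for _, diseased in records if diseased)
    return cases / person_years * per


person_years = sum(years for years, _ in follow_up)
cases = sum(1 for _, diseased in follow_up if diseased)
rate = incidence_density(follow_up)

print(f"{cases} cases over {person_years} person-years")
print(f"Incidence density = {rate:.1f} per 100 person-years at risk")
```

With 3 cases over 18 person-years, the rate works out to about 16.7 per 100 person-years at risk.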

Morbidity Rate
It is the incidence rate of nonfatal cases in the total population at
risk during a specified period of time. For example, the morbidity
rate of tuberculosis (TB) in the US in 1982 can be calculated by
dividing the number of nonfatal cases newly reported during that year
by the total US mid-year population.
Total number of nonfatal cases of TB in the population at risk /
mid-year population = 25,520/231,534,000 = 11.0 per 100,000
population.
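
The TB figure can be verified with a one-line rate calculation. The case count and mid-year population below are the ones quoted in the example; the function name is hypothetical.

```python
def rate_per_100k(events, mid_year_population):
    """Events per 100,000 mid-year population."""
    return events * 100_000 / mid_year_population


# Nonfatal TB cases newly reported in the US in 1982, as quoted in the text.
tb_morbidity = rate_per_100k(events=25_520, mid_year_population=231_534_000)
print(f"TB morbidity rate = {tb_morbidity:.1f} per 100,000 population")
```

The same function would serve for a mortality rate by passing the number of deaths during the period as `events`.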

Mortality Rate
It expresses the incidence of deaths in a particular population during
a period of time. It is calculated by dividing the number of fatalities
during that period by the total population. This can be further
divided into cause specific mortality rate, age specific mortality rate
or sex specific mortality rate, etc.

CHAPTER 8
Measures of Association

Measures of association are used to describe the strength of the
relationship between an exposure (independent variable) and an
outcome (dependent variable). The type of measure used to define the
association between an exposure and an outcome depends upon the type
of data.

ASSOCIATION BETWEEN TWO CONTINUOUS VARIABLES
Correlation
Correlation measures the strength of the linear association between
two continuous variables. The correlation coefficient ‘r’ is a
measure of the degree (magnitude) to which two continuous variables
are associated with each other. It is always a number between -1 and
+1 (Table 8.1). The sign of r indicates whether the correlation is
positive or negative. The magnitude (absolute value) of r indicates
the strength of the correlation, or how close the array of data
points is to a straight line.

Table 8.1: Approximate degrees of association corresponding to level
of ‘r’

| Correlation coefficient (r) | Degree of association |
| ± 1.0                       | Perfect               |
| ± 0.7 to ± 1.0              | Strong                |
| ± 0.4 to ± 0.7              | Moderate              |
| ± 0.2 to ± 0.4              | Weak                  |
| ± 0.01 to ± 0.2             | Negligible            |
| 0.0                         | No association        |

The different types of correlation coefficient are:


• Pearson correlation coefficient
• Spearman rank correlation coefficient.
Example 1: Consider the data table below which contains measurements
on two variables for ten people: the number of months the person has
owned an exercise machine and the number of hours the person spent on
exercise in the past week.

| Person                    | 1 | 2  | 3 | 4 | 5 | 6 | 7 | 8 | 9  | 10 |
| Machine owned (in months) | 5 | 10 | 4 | 8 | 2 | 7 | 9 | 6 | 1  | 12 |
| Hours exercised           | 5 | 2  | 8 | 3 | 8 | 5 | 5 | 7 | 10 | 3  |

If you display these data pairs as points in a scatter plot
(Fig. 8.1), you can see a definite trend. The points appear to form a
line that slants from the upper left to the lower right. As you move
along that line from left to right, the values on the vertical axis
(hours of exercise) decrease, while the values on the horizontal axis
(months owned) increase. Another way to express this is to say that
the two variables are inversely related: the more months the machine
was owned, the less the person tends to exercise. Thus, there seems
to be a correlation between these two continuous variables, but the
two variables are correlated negatively.

Figure 8.1 Scatter plot of two continuous variables (months exercise
machine owned and hours of exercise) showing negative correlation
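
The negative trend in Example 1 can be quantified by computing the Pearson correlation coefficient from the ten data pairs. The sketch below is an illustration only: the data are copied from the Example 1 table, the formula is the standard product-moment definition written out by hand, the labels follow Table 8.1, and the function names are hypothetical.

```python
import math

# Data copied from the Example 1 table.
months_owned = [5, 10, 4, 8, 2, 7, 9, 6, 1, 12]
hours_exercised = [5, 2, 8, 3, 8, 5, 5, 7, 10, 3]


def pearson_r(xs, ys):
    """Pearson product-moment correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def degree_of_association(r):
    """Label the magnitude of r using the bands of Table 8.1."""
    size = abs(r)
    if size == 1.0:
        return "Perfect"
    if size >= 0.7:
        return "Strong"
    if size >= 0.4:
        return "Moderate"
    if size >= 0.2:
        return "Weak"
    if size > 0.0:
        return "Negligible"
    return "No association"


r = pearson_r(months_owned, hours_exercised)
direction = "negative" if r < 0 else "positive"
print(f"r = {r:.2f} ({degree_of_association(r)}, {direction})")
```

For these data r comes out at roughly -0.9, a strong negative correlation by the bands of Table 8.1, which matches the downward slant seen in the scatter plot.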
Example 2: Now consider the data table below, which contains
measurements on two continuous variables for ten people: the
number of months the person has owned an exercise machine and
their cardiovascular fitness (measured on a scale from 1 to 12, higher
scores showing better cardiovascular fitness).

Person                                       1    2    3    4    5    6    7    8    9   10
Machine owned (in months)                    5   10    4    8    2    7    9    6    1   12
Cardiovascular fitness (score from 1 to 12)  4    9    5    7    3    7    8    5    2   11

In Figure 8.2, months owned is plotted against cardiovascular fitness
on a scatter plot. The pattern of these data points suggests a line
that slants from lower left to upper right, which is the opposite of
the direction of slant in the first example. Thus, the figure shows
that the longer the person has owned the exercise machine, the better
his or her cardiovascular fitness tends to be; this is an example of a
positive correlation (Fig. 8.2).

Figure 8.2 Scatter plot of two continuous variables (months exercise machine
owned and cardiovascular fitness score) showing positive correlation

If two variables are positively correlated, as the value of one
increases, so does the value of the other. If they are negatively (or
inversely) correlated, as the value of one increases, the value of the
other decreases.
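The direction and strength described in Examples 1 and 2 can be checked numerically. The following is a minimal sketch, assuming the Pearson correlation coefficient computed from its textbook definition; the function name `pearson_r` and the variable names are illustrative and do not come from the book.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sums of squares of deviations from the means
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / sqrt(sxx * syy)


# Data copied from the tables of Example 1 and Example 2
months_owned = [5, 10, 4, 8, 2, 7, 9, 6, 1, 12]
hours_exercised = [5, 2, 8, 3, 8, 5, 5, 7, 10, 3]
fitness_score = [4, 9, 5, 7, 3, 7, 8, 5, 2, 11]

r_hours = pearson_r(months_owned, hours_exercised)
r_fitness = pearson_r(months_owned, fitness_score)

print(f"Example 1: r = {r_hours:.2f}")    # negative sign: inverse relation
print(f"Example 2: r = {r_fitness:.2f}")  # positive sign: direct relation
```

The sign of each result gives the direction of the correlation, and its absolute value can be read against Table 8.1 to judge the strength.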
A third possibility remains: as the value of one variable increases,
the value of the other neither increases nor decreases.
Example 3: Now consider the data table below, which contains
measurements on two variables for ten people: the number of
months the person has owned an exercise machine and their height.

Person                       1     2     3     4     5     6     7     8     9    10
Machine owned (in months)    5    10     4     8     2     7     9     6     1    12
Height (meters)              2   1.3   1.8   1.5   1.9   1.3   1.9   1.4   1.8   1.5

Figure 8.3 is a scatter plot of months exercise machine owned
(horizontal axis) by person’s height (vertical axis). No line trend can
be seen in the plot. These two variables appear to be uncorrelated.

Figure 8.3 Scatter plot of two continuous variables (months exercise
machine owned and height) showing no correlation

You can go even further in expressing the relationship between
variables. Compare the two scatter plots in Figures 8.4 and 8.5. Both
plots show a positive correlation because as the values on one axis
increase, so do the values on the other. But the data points in
Figure 8.5 are more closely packed than the points in Figure 8.4,
which are more spread out. If a line were drawn through the
middle of the trend, the points in Figure 8.5 would be closer to the
line than the points in Figure 8.4. In addition to direction (positive or
negative), correlations can also have strength, which is a reflection of
the closeness of the data points to the perfect line. Figure 8.5 shows
a stronger correlation than Figure 8.4.

Figure 8.4 Weak/low correlation

Figure 8.5 Strong/perfect correlation

Simple Linear Regression


Correlation is not concerned with causation in relationships among
variables. A related statistical procedure called regression describes
how one variable changes with another; like correlation, it cannot by
itself establish causality. Regression is used to assess the contribution
of one or more predictor/explanatory variables (called independent
variables) to one dependent variable. It can also be used to predict
the value of one variable from the values of another variable. When
there is only one independent variable and when the relationship
can be expressed as a straight line, the procedure is called simple
linear regression. Any straight line (Fig. 8.6) in two-dimensional
space can be represented by the equation:
Y’ = a + bX’
where
Y’ is the variable on the vertical axis.
X’ is the variable on the horizontal axis.
a is the value of Y’ where the line crosses the vertical axis (often
called the intercept).
b is the amount of change in Y’ corresponding to a one-unit increase
in X’ (often called the slope).

Figure 8.6 Simple linear regression
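As an illustration of how the intercept a and slope b of such a line can be obtained from data, here is a minimal sketch. It assumes ordinary least squares, the usual fitting method for simple linear regression, which the text does not name explicitly; the data are the months-owned and hours-exercised values from Example 1 earlier in this chapter, and the function name `fit_line` is hypothetical.

```python
def fit_line(x, y):
    """Return (a, b) of the least-squares line y = a + b*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx                # slope: change in y per one-unit increase in x
    a = mean_y - b * mean_x      # intercept: value of y where x = 0
    return a, b


# Example 1 data: months the machine was owned vs hours exercised last week
months_owned = [5, 10, 4, 8, 2, 7, 9, 6, 1, 12]
hours_exercised = [5, 2, 8, 3, 8, 5, 5, 7, 10, 3]

a, b = fit_line(months_owned, hours_exercised)
print(f"Y' = {a:.3f} + ({b:.3f})X'")
```

The negative slope agrees with the downward-slanting trend described for Figure 8.1.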
Example: In a cross-sectional survey, data were collected from 77
patients on maintenance hemodialysis. The variables recorded were the
number of months on dialysis and the Beck depression score (a
validated tool to identify the presence and severity of depression).
To predict the Beck depression score (dependent variable) from
months on dialysis (independent variable), a linear regression
analysis was performed in SPSS. SPSS generates several output
tables, but the most important inferential output is displayed below
(Table 8.2).

Table 8.2: SPSS output (labeled as coefficient) for linear regression

Coefficients
                              Unstandardized          Standardized                         95% confidence
                              coefficients            coefficients                         interval for B
Model                         B         Std. error    Beta       t         Significance    Lower bound   Upper bound
1  (Constant)                 19.932    2.07                     9.61      0.000           15.801        24.064
   Dialysis duration          –0.061    0.023         –0.300     –2.728    0.008           –0.106        –0.017
   in months
Dependent variable: Beck depression inventory (scoring)

• The output labeled as coefficient is the most important table, as
the values in this table are used to generate the equation of
the regression line.
• The standardized coefficient (Beta), –0.300, equals the correlation
between Beck depression score and dialysis duration in months,
because there is only one independent variable.
• As the p-value (0.008) is less than 0.05, the correlation is
statistically significant.
• The ‘dialysis duration in months’ row under the unstandardized
coefficient (B) column gives the slope of the regression line, that
is –0.061.
• The slope (–0.061) indicates that with each additional month of
dialysis the Beck depression score is predicted to decrease by 0.061.
• The ‘Constant’ row under the B column gives the intercept,
that is 19.932.
• The constant gives the value of the dependent variable when the
explanatory variable is 0. Thus, if the dialysis duration in months is
0, the predicted Beck depression score is 19.932.
• The Statistical Package for the Social Sciences (SPSS) does not
generate the equation of the regression line.
• Thus, these two coefficients are used to construct the regression
equation, that is
Y’ = a + b (X)
where
Y’ is the predicted value of the dependent variable Y
a is the intercept (in this case, it is 19.932)
b is the slope or the gradient of the regression line (in this case, it
is –0.061)
X is the independent or explanatory variable
Thus, the equation of the regression line will be
Y’ = 19.932 + (–0.061)X
Y’ = 19.932 – 0.061X


Task: Predict the Beck depression score for maintenance dialysis
patients who have been on dialysis for 16 months and for 17 months.

Dialysis duration 17 months      Dialysis duration 16 months
Y’ = 19.932 – 0.061 X            Y’ = 19.932 – 0.061 X
Y’ = 19.932 – 0.061 (17)         Y’ = 19.932 – 0.061 (16)
Y’ = 19.932 – 1.037              Y’ = 19.932 – 0.976
Y’ = 18.895                      Y’ = 18.956

Thus, with a one-unit increase in dialysis duration (from 16 to
17 months), the predicted Beck depression score decreases from 18.956
to 18.895, a decrease of 0.061. This equals the slope (B = –0.061)
reported in the SPSS output above (Table 8.2).
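The prediction task above can be sketched in a few lines. This is only an illustration of plugging values into the regression equation; the constant and function names are hypothetical, while the coefficients are those reported in Table 8.2.

```python
INTERCEPT = 19.932   # 'Constant' row, column B of Table 8.2
SLOPE = -0.061       # 'dialysis duration in months' row, column B


def predict_beck_score(months_on_dialysis):
    """Predicted Beck depression score: Y' = a + bX."""
    return INTERCEPT + SLOPE * months_on_dialysis


score_16 = predict_beck_score(16)
score_17 = predict_beck_score(17)

print(f"16 months: {score_16:.3f}")
print(f"17 months: {score_17:.3f}")
print(f"Change for one extra month: {score_17 - score_16:.3f}")
```

The change between any two consecutive months is always the slope, which is why the 16-month and 17-month predictions differ by 0.061.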
The intercept, 19.932, is the predicted Beck depression score at time 0
of dialysis. This is the value on the y-axis where the best-fit line
meets the y-axis (Fig. 8.7).

Figure 8.7 Linear regression showing an association between duration of
dialysis (months) and Beck depression scores

RELATIVE RISK AND ODDS RATIO
The relative risk (RR) and odds ratio (OR) are commonly used to
describe the relationship between an exposure and an outcome. The
RR is used in cohort studies, whereas the OR is used in case-control
studies.

Relative Risk
The relative risk (or risk ratio) is defined as the ratio of the incidence of
disease in the exposed group to the corresponding incidence
of disease in the nonexposed group. Relative risk can be calculated in
cohort studies such as the Framingham Heart Study, where subjects
with certain exposures (e.g. hypertension, hyperlipidemia, smoking)
were followed prospectively for cardiovascular outcomes. The
incidence of cardiac events in subjects with and without exposures
was then used to calculate relative risk and determine whether
exposures were cardiac risk factors.

Relative risk = Incidence in exposed / Incidence in nonexposed

Risk          Diseased    Nondiseased    Total
Exposed       a           b              a + b
Nonexposed    c           d              c + d

Relative risk (RR) = [a/(a + b)] / [c/(c + d)]
        
Incidence in exposed individuals = a/(a + b), or the proportion of
exposed people who developed the disease. Incidence in nonexposed
individuals = c/(c + d), or the proportion of nonexposed people who
developed the disease.

                     Disease status
Risk factor    CHD present    CHD absent    Total
Smoker         112 (a)        176 (b)       288 (a + b)
Nonsmoker      88 (c)         224 (d)       312 (c + d)

Incidence in exposed = a/(a + b) = 112/288 = 0.389
Incidence in nonexposed = c/(c + d) = 88/312 = 0.282
RR = 0.389/0.282 = 1.38

Interpretation of RR
Compared to nonsmokers, the smokers have 1.38 times the risk of
developing CHD.
Alternative explanation: Compared to nonsmokers, the smokers
have a 38 percent greater risk of developing CHD.
Interpretation of RR if the RR is < 1.0: Suppose the RR in the above
study was 0.68.
Then, the interpretation would be that compared to nonsmokers
the smokers have a 32 percent lower risk of developing CHD. In
research this is called a protective effect of the exposure. In other
words, the exposure is beneficial.
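The relative risk calculation for the smoking and CHD table can be reproduced with a short sketch. The function name `relative_risk` is illustrative; the cell labels a, b, c and d follow the 2 x 2 layout used above.

```python
def relative_risk(a, b, c, d):
    """RR from a cohort 2 x 2 table.

    a: exposed and diseased        b: exposed, not diseased
    c: nonexposed and diseased     d: nonexposed, not diseased
    """
    incidence_exposed = a / (a + b)
    incidence_nonexposed = c / (c + d)
    return incidence_exposed / incidence_nonexposed


# Smokers: 112 with CHD, 176 without; nonsmokers: 88 with CHD, 224 without
rr = relative_risk(a=112, b=176, c=88, d=224)
excess_percent = (rr - 1) * 100   # percent greater (or lower, if negative) risk

print(f"RR = {rr:.2f}")
print(f"Excess risk in exposed = {excess_percent:.0f} percent")
```

Working from the unrounded incidences avoids the small discrepancy that appears when the incidences are first rounded to two decimal places.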

Odds Ratio
The odds ratio is defined as the odds of exposure in the group
with disease divided by the odds of exposure in the control group.
Because subjects are selected on the basis of disease status in
case-control studies, it is not possible to calculate the rate of
development of disease (or the incidence).

In research the word “risk” is used for the development of a
disease or outcome, e.g. the risk of developing CHD. In a case-control
study, the cases and controls are defined on the basis of the
outcome/disease, i.e. those who have CHD are the cases, and those
who do not have CHD are the controls. Since the study starts with the
disease/outcome, researchers use a different word when comparing
the frequency of the exposure in those who had the disease with that
in those who did not have the disease: they prefer the word “odds”
for an exposure rather than “risk”.

Odds ratio = Odds of exposure in the cases / Odds of exposure in the controls

Calculating Odds Ratio (Case-Control Studies)

              Cases    Controls    Total
Exposed       a        b           a + b
Nonexposed    c        d           c + d

Odds ratio = (a/c) / (b/d)

Odds ratio = ad/bc

Oral Contraceptives and Breast Cancer

                                          Breast cancer
Exposure                                Yes        No         Total
Exposed (oral contraceptive users)      140 (a)    370 (b)    510
Nonexposed                              40 (c)     234 (d)    274

Odds ratio = (a/c) / (b/d)

Odds of exposure in cases = a/c = 140/40 = 3.5

Odds of exposure in controls = b/d = 370/234 = 1.6

OR = 3.5/1.6 = 2.2

Interpretation of OR
Compared to the controls (those who did not have Ca breast), the
odds of being an oral contraceptive user were 2.2 times as high in
those who had Ca breast (cases).
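The odds ratio for the oral contraceptive and breast cancer table can be verified with the following sketch. The function name `odds_ratio` is illustrative; it computes the ratio of the two odds of exposure, which is algebraically the same as ad/bc.

```python
def odds_ratio(a, b, c, d):
    """OR from a case-control 2 x 2 table.

    a: exposed cases        b: exposed controls
    c: nonexposed cases     d: nonexposed controls
    """
    odds_exposure_cases = a / c
    odds_exposure_controls = b / d
    return odds_exposure_cases / odds_exposure_controls


# Oral contraceptive use among breast cancer cases and controls
or_breast_cancer = odds_ratio(a=140, b=370, c=40, d=234)
cross_product = (140 * 234) / (370 * 40)   # ad/bc shortcut

print(f"OR = {or_breast_cancer:.1f}")
print(f"ad/bc = {cross_product:.1f}")
```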

CHAPTER 9
Factors Affecting Study Outcomes

INTRODUCTION
Results of an epidemiological study may reflect the true effect of an
exposure(s) on the development of the outcome under investigation,
but it must always be considered that the results may in fact be due
to an alternative explanation. Such alternative explanations may be
the effects of chance (random error), bias or confounding, which may
produce spurious results, leading the researcher to believe in the
existence of a valid statistical association when one does not exist
or, alternatively, in the absence of an association when one is
truly present.
Observational studies are more susceptible to the effects of chance,
bias and confounding, so appropriate steps must be taken at both
the design and analysis stages to minimize their effects.

BIAS
Any systematic error that results in an incorrect estimate of the
association between an exposure and the disease/outcome is
called a bias. It is usually introduced by the researcher due to
nonstandardized measuring techniques.

Types of Bias
More than 50 types of bias are identified in epidemiological studies,
but for simplicity, they are broadly grouped into two categories:
1. Selection bias
2. Information bias

Selection Bias
It occurs when the inclusion of subjects in a study depends in some
way on the outcome of interest. It occurs mainly in case-control and
retrospective cohort studies, and rarely in prospective cohort studies,
as the outcome of interest has not yet occurred at enrollment.
Selection bias can occur due to improper means or source of
selection of study subjects.
A classical example of selection bias is a study conducted to see
the association between oral contraceptives (OC) and
thromboembolism. There was a concern in this study that, as
physicians were already aware of the possible relationship between
OC and thromboembolism, women who were current users of OCs
were more likely to be hospitalized for evaluation of
thromboembolism. So any increased frequency of thromboembolism
in oral contraceptive users could be in part due to the fact that
hospitalization and the determination of the diagnosis were both
influenced by a history of OC use.
Another source of selection bias is an inappropriate source of
selection, e.g. cases selected from hospitals and controls
from household surveys. In this case it is possible that a number of
demographic and lifestyle variables differ between the
cases and controls, leading to noncomparability between groups and
incorrect results with respect to the association between exposures
and outcome.
In a clinical trial, selection bias can occur if there is no
randomization. Suppose that the principal investigator decides
which patients are included in the standard drug group and which
patients are included in the new drug group. If allowed to do so,
the investigator might include all the healthy patients in the new
drug group and all patients who are sick (and have multiple
comorbid conditions) in the standard drug group. The investigator
can then show better outcomes in the new drug group (healthy
patients) than in the standard treatment group (sicker patients) and
present results which are not true. Randomization guards against
this selection bias, provided that the principal investigator and his
or her team members are kept away from the randomization
process.

Observation or Information Bias
It includes any systematic error in the measurement of information
on exposure or outcome. It is further classified into different
categories on the basis of the source of noncomparability:
• Recall bias: It occurs when individuals with previous adverse
health outcomes remember and report their previous exposure
differently, or with a different degree of completeness and accuracy,
than those who are unaffected. It can lead to an over- or
underestimate of the association between exposure and disease,
depending on whether the cases recall their exposure to a greater or
lesser extent than the controls.
For example, in a case-control study mothers whose recent
pregnancies had ended in fetal death (cases) may report their
exposure experience (drug history) differently than a matched group
of mothers whose pregnancy had ended normally (controls). That is,
cases may have a better recall of past exposure than controls. Recall
bias can be reduced by:
–– Collecting exposure data from work or medical records
–– Blinding participants to the study hypothesis.
• Interviewer bias: It refers to any systematic difference in
the soliciting, recording or interpretation of information by the
interviewer from study participants and can affect every type
of epidemiologic study.
• Lost to follow-up bias: It is a major concern in a cohort or
any prospective study. When persons lost to follow-up differ
from those who remain in the study with respect to both the
exposure and the outcome, any observed association will be
biased. Even a very small loss to follow-up can introduce
bias as long as such loss is related to both exposure and disease.
• Misclassification bias: It occurs when the sensitivity and/
or specificity of the procedure/tool to detect exposure and/
or outcome is not perfect, that is, exposed/diseased subjects
can be classified as nonexposed/nondiseased and vice versa,
based on a means of determination which may be unclear
or not standardized. It is inevitable in every study and always
a potential concern, and therefore should be carefully
evaluated.

CONTROL OF BIAS
Control of bias is mostly done at the design phase of the study.
Following are some means to ensure the same.

For Control of Selection Bias


• Correct choice of study population (sampling procedure)
• Randomization.

For Control of Information Bias


• Correct training of interviewers and use of clearly written protocols
ensuring uniform methodology of obtaining information.
• Use of standardized, tested instruments for data collection, and
utilizing uniform source of data on all study subjects.
• Maintaining of complete records and having definite means of
contact with respondents to prevent loss to follow-up.
• Use of clearly defined means of determination of both exposure
and outcome variables.
• Blinding of interviewees and interviewers to study objectives.

CONFOUNDING
The concept of confounding is a central one in the interpretation
of any epidemiological study. It can be thought of as a mixing of the
effect of the exposure under study on the outcome with that of an
extraneous factor, the “confounder”. To be deemed a confounder,
this extraneous variable must be associated with the exposure and,
independently of the exposure, must be a risk factor for the disease.
Confounding can lead to an over- or an underestimation
of the true association between exposure and outcome.
Example 1: In a study conducted to determine the association
between smoking and myocardial infarction (MI), age can be a
confounder as it is associated with both exposure and outcome
independently.

Example 2: A study was conducted to assess the association between
recent oral contraceptive use and MI; the crude results are shown in
Table 9.1. However, the data were confounded by age, which led to an
underestimation of the true effect, as can be seen from the
age-specific estimates in Table 9.2.

Table 9.1: Relation of myocardial infarction (MI) to recent oral contraceptive
(OC) use

Oral contraceptive (OC)    MI +    MI –    Estimated relative risk
Yes                        29      135
No                         205     1607    1.68
Total                      234     1742

Table 9.2: Age-specific relation of myocardial infarction (MI) to recent oral
contraceptive (OC) use

Age (years)    Recent OC use    MI +    MI –    Estimated age-specific relative risk
25–29          Yes              4       62      7.2
               No               2       224
30–34          Yes              9       33      8.9
               No               12      390
35–39          Yes              4       26      1.5
               No               33      330
40–44          Yes              6       9       3.7
               No               65      362
45–49          Yes              6       5       3.9
               No               93      301
Total                           234     1742

Confounding can be controlled in study design by restriction,
matching and randomization. In analyses, it can be controlled
through stratification and multivariate analysis (Tables 9.1 and 9.2).
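The stratified analysis in Tables 9.1 and 9.2 can be sketched as follows. Two assumptions are made here and are not stated in the text: first, that the tabulated "estimated relative risk" values are cross-product ratios ad/bc (this reproduces every tabulated figure); second, that a single age-adjusted summary is wanted, for which the Mantel-Haenszel pooled estimate is used as one common choice. All function and variable names are illustrative.

```python
def cross_product_ratio(a, b, c, d):
    """ad/bc for a 2 x 2 table (a, b = OC users with/without MI;
    c, d = non-users with/without MI)."""
    return (a * d) / (b * c)


# Table 9.1: crude (unstratified) data
crude = cross_product_ratio(29, 135, 205, 1607)

# Table 9.2: one 2 x 2 table per age stratum, as (a, b, c, d)
strata = {
    "25-29": (4, 62, 2, 224),
    "30-34": (9, 33, 12, 390),
    "35-39": (4, 26, 33, 330),
    "40-44": (6, 9, 65, 362),
    "45-49": (6, 5, 93, 301),
}
stratum_estimates = {
    age: cross_product_ratio(*cells) for age, cells in strata.items()
}

# Mantel-Haenszel pooled estimate (an addition, not taken from the text):
# sum(a*d/n) / sum(b*c/n), where n is the stratum total
numerator = sum(a * d / (a + b + c + d) for a, b, c, d in strata.values())
denominator = sum(b * c / (a + b + c + d) for a, b, c, d in strata.values())
mantel_haenszel = numerator / denominator

print(f"Crude estimate: {crude:.2f}")
for age, estimate in stratum_estimates.items():
    print(f"Age {age}: {estimate:.1f}")
print(f"Age-adjusted (Mantel-Haenszel) estimate: {mantel_haenszel:.2f}")
```

The age-adjusted estimate is well above the crude estimate, which is the underestimation by confounding that the example describes.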

EFFECT MODIFIERS
Effect modifiers are variables that change the magnitude of the
effect of an exposure on an outcome. Unlike a confounder, an effect
modifier does not need to be related to both the exposure and the
outcome variable. For example, if we want to determine the
incidence of coronary heart disease (CHD) amongst smokers, the
outcome will be affected by age. Hence, in such cases age is an
effect modifier. Its impact has to be reported through stratification.
It is important to bear in mind the roles of bias, confounding and
chance, and to evaluate/control for them so as to ensure that the
results are valid and generalizable.

CHAPTER 10
Sample Size Estimation

SAMPLE SIZE
The sample size calculation depends on:
• Type of study
• Magnitude of the outcome of interest derived from previous
studies
• Type of statistical analysis required (comparing means or
proportions)
• Level of significance/power.

SAMPLE SIZE FOR SINGLE PROPORTION


Sample size for single proportion depends on:
• The prevalence of the condition/attribute of interest
• Level of confidence
• Margin of error.

Example of Sample Size Calculation for a Single Proportion
A researcher aims to estimate the prevalence of chronic kidney
disease (CKD) among adults older than 18 years of age in a locality.
How many adults should be included in the sample so that the
prevalence may be estimated within 5 percentage points of the true
value with 95 percent confidence, if it is known that the true rate is
unlikely to exceed 40 percent?
Values needed to be entered into the WHO sample size calculator:
Confidence level: 95 percent
Anticipated prevalence or population proportion for CKD: 40 percent
Absolute precision required (based on researcher judgment): 5 percent
When the above values are entered into the WHO sample size
calculator, the estimated sample size will be calculated (Fig. 10.1).

Figure 10.1 Sample size calculation and formula for single proportion

The estimated sample size calculated is 369. Thus, at least
369 participants must be recruited in the study to determine the
prevalence of CKD at a confidence level of 95 percent, with a
precision of 5 percent.
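The figure of 369 can be reproduced without the calculator. The sketch below assumes the standard formula for estimating a single proportion, n = z²p(1 – p)/d², since the formula shown in Figure 10.1 is not reproduced in the text; the function name is hypothetical.

```python
from math import ceil
from statistics import NormalDist


def sample_size_single_proportion(p, d, confidence=0.95):
    """n = z^2 * p * (1 - p) / d^2, rounded up to a whole participant.

    p: anticipated proportion, d: absolute precision (both as fractions).
    """
    # Two-sided critical value, about 1.96 for 95 percent confidence
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return ceil(z ** 2 * p * (1 - p) / d ** 2)


n_ckd = sample_size_single_proportion(p=0.40, d=0.05)
print(f"Required sample size: {n_ckd}")
```

Rounding up rather than to the nearest integer keeps the achieved precision at least as good as the one requested.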

SAMPLE SIZE FOR SINGLE GROUP MEAN


Sample size for single group mean depends on:
• The mean of the condition of interest
• Level of confidence
• Margin of error.

Example of Sample Size Calculation for Single Group Mean
A researcher aims to estimate the mean hemoglobin level among
pregnant women admitted to a tertiary care hospital. A previous
study of pregnant women showed an average hemoglobin level of
8.2 g/dL and a standard deviation of 4.2 g/dL. How many pregnant
women must be studied if the estimate is to fall within
1 g/dL of the true mean with 95 percent confidence?
Values needed to be entered into the WHO sample size calculator:
Confidence level: 95 percent
Population mean (average hemoglobin of pregnant women identified
from previous study): 8.2 g/dL
Population standard deviation: 4.2 g/dL
Absolute precision required: 1 g/dL
When the above values are entered into the WHO sample size
calculator, the estimated sample size will be calculated (Fig. 10.2),
where ε = d/µ
ε = Relative precision
d = Absolute precision
µ = Population mean
The estimated sample size calculated is 68. Thus, at least 68
participants must be recruited in the study to estimate the mean
hemoglobin level among pregnant women at a confidence level of
95 percent, with a precision of 1 g/dL.

Figure 10.2 Sample size calculation and formula for single group mean
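The figure of 68 can likewise be reproduced by hand. The sketch below assumes the standard formula for estimating a single mean, n = z²σ²/d², which is equivalent to the relative-precision form with ε = d/µ; the formula shown in Figure 10.2 is not reproduced in the text, and the function name is hypothetical.

```python
from math import ceil
from statistics import NormalDist


def sample_size_single_mean(sd, d, confidence=0.95):
    """n = z^2 * sd^2 / d^2, rounded up to a whole participant.

    sd: population standard deviation, d: absolute precision (same units).
    """
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return ceil(z ** 2 * sd ** 2 / d ** 2)


# Hemoglobin example: SD 4.2 g/dL, precision 1 g/dL
n_hemoglobin = sample_size_single_mean(sd=4.2, d=1.0)
print(f"Required sample size: {n_hemoglobin}")
```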

SAMPLE SIZE FOR TWO PROPORTIONS


The sample size for two proportions depends on:
• The prevalence of the condition/attribute of interest for both
groups
• Level of confidence
• Power of the test.

Example of Sample Size Calculation for Two Proportions
It is believed that the proportion of patients who develop depression
on one type of dialysis modality (peritoneal dialysis) is 5 percent,
while the proportion of patients who develop depression on the
other type of dialysis modality (hemodialysis) is 15 percent. How
large should the sample be in each of the two groups of patients
if an investigator wishes to detect, with a power of 90 percent, whether
the second dialysis modality (hemodialysis) has a depression rate
significantly higher than the first at the 5 percent level of significance?
Values needed to be entered into the WHO sample size calculator:
Level of significance: 5 percent
Anticipated prevalence of depression in peritoneal dialysis patients:
5 percent
Anticipated prevalence of depression in hemodialysis patients:
15 percent
Power of test: 90 percent
When the above values are entered into the WHO sample size
calculator, the estimated sample size will be calculated (Fig. 10.3).
The estimated sample size calculated is 153. Thus, at least 153
participants must be recruited in each group to detect a significant
difference in depression between the two types of dialysis
modality at a power of 90 percent.
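The figure of 153 per group can be reproduced with the sketch below. It assumes the usual formula for a one-sided test comparing two proportions with equal group sizes (one-sided because the question asks whether one rate is higher than the other); the formula shown in Figure 10.3 is not reproduced in the text, so this choice is an assumption, and the function name is hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_two_proportions(p1, p2, alpha=0.05, power=0.90):
    """Per-group n for a one-sided test of two proportions (equal groups)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha)   # one-sided critical value
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2                       # average of the two proportions
    numerator = (
        z_alpha * sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return ceil(numerator / (p1 - p2) ** 2)


# Depression: 5 percent on peritoneal dialysis vs 15 percent on hemodialysis
n_per_group = sample_size_two_proportions(p1=0.05, p2=0.15)
print(f"Required sample size per group: {n_per_group}")
```

With a two-sided test the same inputs would give a larger number, so the sidedness of the hypothesis should be fixed before the calculation.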

SAMPLE SIZE FOR TWO GROUP MEANS


The sample size for group means depends on:
• The means and variance of both groups
• Level of confidence
• Power of the test


Figure 10.3  Sample size calculation and formula for two proportions

Example of Sample Size Calculation for Two Group Means
Suppose the true mean systolic blood pressure (SBP) of
35–39-year-old oral contraceptive (OC) users is 135 mm Hg with a
standard deviation of 16 mm Hg. Similarly, for non-OC users, the
mean SBP is 130 mm Hg with a standard deviation of 17 mm Hg.
If we wish to detect the difference between two groups of equal
size, what would be the minimum sample size required with a power
of 80 percent at the 95 percent confidence level?
Values to be Entered into the OpenEpi Software
The following values should be entered in the OpenEpi sample size
calculator (Table 10.1).
Confidence level: 95 percent
Power: 80 percent
Mean systolic BP of oral contraceptive users: 135 mm Hg
Standard deviation of oral contraceptive users: 16 mm Hg
Mean systolic BP of nonoral contraceptive users: 130 mm Hg
Standard deviation of nonoral contraceptive users: 17 mm Hg

Table 10.1: Sample size calculation for comparing two means

Confidence interval % (two-sided)         95                   Enter a value between 0 and 100, usually 95%
Power                                     80                   Enter a value between 0 and 100, usually 80%
Ratio of sample size (Group 2/Group 1)    1
                                          Group 1    Group 2
Mean                                      135        130       Enter mean values of each group
Standard deviation                        16         17        Enter standard deviation or variance of
Variance                                                       each individual group

Table 10.2: Sample size calculation result

Input data
Confidence interval (2-sided)             95%
Power                                     80%
Ratio of sample size (Group 2/Group 1)    1

                      Group 1    Group 2    Mean difference*
Mean                  135        130        (135 – 130) = 5
Standard deviation    16         17
Variance              256        289

Sample size of group 1    172
Sample size of group 2    172
Total sample size         344
* Mean difference of systolic BP of Group 1 and Group 2

The minimum sample size required to compare the means of OC
and non-OC users is 344. Thus, 344 participants (172 in each group)
must be recruited in the study to detect the difference between the
two groups at the 95 percent confidence level with a power of
80 percent (Table 10.2).
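The result in Table 10.2 can be reproduced with the sketch below. It assumes the standard normal-approximation formula for comparing two means with equal group sizes, n per group = (σ1² + σ2²)(z for confidence + z for power)²/(difference in means)²; whether OpenEpi uses exactly this computation internally is an assumption, and the function name is hypothetical.

```python
from math import ceil
from statistics import NormalDist


def sample_size_two_means(mean1, sd1, mean2, sd2, confidence=0.95, power=0.80):
    """Per-group n for comparing two means with equal group sizes."""
    z_alpha = NormalDist().inv_cdf(1 - (1 - confidence) / 2)   # two-sided
    z_beta = NormalDist().inv_cdf(power)
    variance_sum = sd1 ** 2 + sd2 ** 2          # 256 + 289 in the example
    difference = mean1 - mean2                  # 135 - 130 = 5 mm Hg
    return ceil(variance_sum * (z_alpha + z_beta) ** 2 / difference ** 2)


n_each = sample_size_two_means(mean1=135, sd1=16, mean2=130, sd2=17)
n_total = 2 * n_each
print(f"Sample size per group: {n_each}")
print(f"Total sample size: {n_total}")
```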


SAMPLE SIZE FOR SENSITIVITY AND SPECIFICITY


The sample size for sensitivity and specificity depends on:
• The prevalence of the condition/attribute of interest
• Estimated sensitivity
• Estimated specificity
• Level of significance
• Margin of error.

Example of Sample Size Calculation for Sensitivity and Specificity
Suppose we want to determine the sensitivity and specificity of ELISA
in the diagnosis of HIV against the gold standard, Western blot. The
prevalence of HIV is 15 percent, the estimated sensitivity is
97 percent and the estimated specificity is 94 percent. With
95 percent confidence and a margin of error of 5 percent, how many
patients should be invited to participate in this study? A total of
347 patients will be required for the sensitivity and specificity
analysis in this study.
Values to be entered into the WHO software for sensitivity and
specificity studies:
Prevalence of HIV: 15 percent
Sensitivity: 97 percent
Specificity: 94 percent
Confidence level: 95 percent
Margin of error: 5 percent

Expected sensitivity    0.97    From literature or pilot study
Expected specificity    0.94    From literature or pilot study
Expected prevalence     0.15    From literature or pilot study
Desired precision       0.05    Researcher’s judgment
Confidence level        95%     95% is recommended

To achieve the precision of 0.05 for ‘Sensitivity’, we need a total
sample size of 347. This is preferable as it will give precision of 0.05
or less for both sensitivity and specificity. With this sample size, the
precision for ‘Specificity’ will be 0.027.

SUGGESTED WEBSITES FOR SAMPLE SIZE CALCULATORS
1. http://www.raosoft.com/samplesize.html
2. http://www.quantitativeskills.com/sisa/calculations/samsize.htm
3. http://www.openepi.com/Menu/OpenEpiMenu.htm

BIBLIOGRAPHY
1. Calkins KG. Power and sample size: an appropriate sample size is
crucial to any well-planned research investigation. [Online]. 2005 [cited
19 Sep. 2008]; Available from: URL: http://www.andrews.edu/~calkins/
math/edrm611/edrm11.htm.
2. Naing L, Winn T, Rusli BN. Practical issues in calculating the sample
size for prevalence studies. Arch Orofac Sci. 2006;1:9-14.

3. Naing L. Sample size calculation for sensitivity and specificity studies.
[Online]. 2004 [cited 10 Aug. 2008]; Available from: URL: http://www.
kck.usm.my/ppsg/statistical_resources/samplesize_forsensitivity_
specificitystudiesLinNaing.xls.
4. OpenEpi Version 2.2.1: open source epidemiologic statistics for public
health. [Online]. 2008 [cited 10 Oct. 2008]; Available from: URL: http://
www.openepi.com/Menu/OpenEpiMenu.htm.
5. Sample size calculations: statistics guide for research grant applicants.
[Online]. [2001?] [cited 14 Oct. 2008]; Available from: URL: http://
www.sgul.ac.uk/index.cfm?D7DEB028-B5BE-7536-BD9D-
2EC800CE3789CAB35E63-88E4-4358-889C-043A012DF815.
CHAPTER 11
Screening

The active search for disease among apparently healthy people is a
fundamental aspect of prevention. This is embodied in screening,
which has been defined as “the search for unrecognized disease
or defect by means of rapidly applied tests, examinations or other
procedures in apparently healthy individuals.”
Screening is a way of improving patient outcomes by detecting
disease in apparently healthy individuals at an earlier stage,
which is usually a treatable stage. For this purpose, there are tests
such as physical examination, biochemical assay of blood, urine
and other body fluids, radiography, ultrasonography, cytology and
histopathology. One question that needs to be answered in the
context of screening is how good these tests are at distinguishing
individuals with and without the disease in question.
A screening program is most effective and beneficial if it is directed
to a high-risk target population. Screening a total population for a
relatively infrequent disease can be very wasteful of resources and
may yield very few previously undetected cases.

RELIABILITY AND VALIDITY OF A SCREENING TEST


An effective screening program uses tests that are ideally
inexpensive, easy to administer, minimally uncomfortable for
those tested, reliable (they measure a variable
consistently and are free of random error) and valid (they are able to
differentiate between individuals with a disease or its precursor and
those without).

Validity (Accuracy)
The term validity refers to the extent to which a test accurately
measures what it is intended to measure. In other words, validity
expresses the ability of a test to separate or distinguish those who
have the disease from those who do not. For example, glycosuria is a
useful screening test for diabetes, but a more valid or accurate test
is the glucose tolerance test. Accuracy refers to the closeness with
which measured values agree with the true values.
Assessment of test performance is presented in a two-by-two
table (Table 11.1). The disease status (as assessed through the Gold
Standard) is conventionally put in the top row and the screening
test result in the first column.
In Table 11.1, a is the number of subjects who have the disease
and are found positive by the test (true positives), b is the number of
subjects who do not have the disease and are found positive by the
test (false positives), c is the number of subjects who have the disease
but are found negative by the test (false negatives), and d is the number
of subjects who do not have the disease and are found negative by
the test (true negatives).
Validity has two components—sensitivity and specificity.

SENSITIVITY AND SPECIFICITY
Sensitivity is defined as the ability of a test or procedure to identify
correctly all those who have the disease, that is, the “true positives” in the
screened population. Sixty percent sensitivity means that 60 percent
of the diseased people screened by the test will give a true-positive
result and the remaining 40 percent a false-negative result. Sensitivity
is thus expressed as the proportion of those with disease who are
correctly identified by a positive screening test result.

Table 11.1: A two-by-two table for screening test

                 Disease status as determined by ‘Gold Standard’
                 Disease present         No disease               Total
Test Positive    *True Positives (a)     #False Positives (b)     Total Test Positive (a + b)
Test Negative    ‡False Negatives (c)    ~True Negatives (d)      Total Test Negative (c + d)
Total            Total with Disease      Total without Disease    Total Screened
                 (a + c)                 (b + d)                  (a + b + c + d)

*True positives = number of individuals with disease and a positive screening test
(a); #False positives = number of individuals without disease but with a positive
screening test (b); ‡False negatives = number of individuals with disease but with a
negative screening test (c); ~True negatives = number of individuals without disease
and a negative screening test (d)

$$\text{Sensitivity} = \frac{\text{Number of true positives}}{\text{Total with disease}} = \frac{a}{a + c}$$

When expressed in percent:

$$\text{Sensitivity} = \frac{a}{a + c} \times 100$$

Specificity is the ability of the test or procedure to identify correctly
all those who do not have the disease, that is, the “true negatives” in
the screened population. Thirty percent specificity means that
30 percent of the nondiseased persons screened by the test will give a
true-negative result, while the remaining 70 percent
will be incorrectly classified as “diseased” when they are not. Specificity
is thus expressed as the proportion of those without disease who are
correctly identified by a negative screening test result.

$$\text{Specificity} = \frac{\text{Number of true negatives}}{\text{Total without disease}} = \frac{d}{b + d}$$

When expressed in percent:

$$\text{Specificity} = \frac{d}{b + d} \times 100$$

PREDICTIVE VALUES

Predictive value reflects the diagnostic power of the test. The
predictive accuracy depends upon sensitivity, specificity and disease
prevalence.
Positive predictive value describes the probability of having
the disease given a positive screening test result in the screened
population. It is thus expressed as the proportion of those with disease
among all screening test positives. The positive predictive value of
mammography, for example, tells a woman how likely it is that
she has breast cancer after a positive mammogram.

$$\text{Positive predictive value (PPV)} = \frac{\text{Number of true positives}}{\text{Total test positives}} = \frac{a}{a + b}$$

When expressed in percent:

$$\text{PPV} = \frac{a}{a + b} \times 100$$

Negative predictive value describes the probability of not having
the disease given a negative screening test result in the screened
population. It is thus expressed as the proportion of those without
disease among all screening test negatives. The negative predictive
value of mammography, for example, tells a woman the probability
that she truly does not have breast cancer if the mammogram is
negative.

$$\text{Negative predictive value (NPV)} = \frac{\text{Number of true negatives}}{\text{Total test negatives}} = \frac{d}{c + d}$$

When expressed in percent:

$$\text{NPV} = \frac{d}{c + d} \times 100$$

Example
A new ELISA (antibody test) is developed to diagnose HIV infections.
Serum from 80 patients that were positive by Western Blot (the Gold
Standard assay) was tested, and 60 were found to be positive by the
new ELISA screening test. The manufacturers then used the new
ELISA to test serum from 120 study participants that were negative
by Western Blot (the Gold Standard assay); 70 were found to be
negative by the new test.

                      HIV infected        Non-infected           Total
ELISA test positive   60 (a = TP)         50 (b = FP)            a + b = 110 (total test positive)
ELISA test negative   20 (c = FN)         70 (d = TN)            c + d = 90 (total test negative)
Total                 80 (a + c)          120 (b + d)            a + b + c + d = 200
                      (total infected)    (total not infected)   (total screened)

$$\text{Sensitivity} = \frac{a}{a + c} \times 100 = \frac{60 \times 100}{80} = 75\%$$

i.e. the new ELISA test is 75 percent sensitive in correctly identifying HIV infection.

$$\text{Specificity} = \frac{d}{b + d} \times 100 = \frac{70 \times 100}{120} = 58\%$$

i.e. the new ELISA test is 58 percent specific in correctly identifying persons
who are not infected with HIV.

Positive Predictive Value (PPV)

$$\text{PPV} = \frac{a}{a + b} \times 100 = \frac{60 \times 100}{110} = 55\%$$

i.e. with the new ELISA screening test for HIV, 55 percent of persons who
test positive actually have HIV infection.

Negative Predictive Value (NPV)

$$\text{NPV} = \frac{d}{c + d} \times 100 = \frac{70 \times 100}{90} = 78\%$$

i.e. with the new ELISA screening test for HIV, 78 percent of persons who
test negative are actually free from HIV.
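
The four measures in this example can be checked with a few lines of code. The following is a minimal sketch, not part of the original text; the class name `TwoByTwo` and its method names are illustrative assumptions, while the cell letters a, b, c and d follow Table 11.1 and the counts come from the ELISA example (60 true positives, 50 false positives, 20 false negatives, 70 true negatives).

```python
from dataclasses import dataclass


@dataclass
class TwoByTwo:
    """Cells of a screening two-by-two table, lettered as in Table 11.1."""
    a: int  # true positives
    b: int  # false positives
    c: int  # false negatives
    d: int  # true negatives

    def sensitivity(self) -> float:
        return self.a / (self.a + self.c)

    def specificity(self) -> float:
        return self.d / (self.b + self.d)

    def ppv(self) -> float:
        return self.a / (self.a + self.b)

    def npv(self) -> float:
        return self.d / (self.c + self.d)


# Counts from the ELISA example: 60 TP, 50 FP, 20 FN, 70 TN.
elisa = TwoByTwo(a=60, b=50, c=20, d=70)
for name, value in [
    ("Sensitivity", elisa.sensitivity()),
    ("Specificity", elisa.specificity()),
    ("PPV", elisa.ppv()),
    ("NPV", elisa.npv()),
]:
    print(f"{name}: {value * 100:.1f}%")
```

Run as written, this prints 75.0%, 58.3%, 54.5% and 77.8%, which round to the 75, 58, 55 and 78 percent quoted in the example.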

Relationship between Sensitivity,
Specificity, PPV and NPV
Sensitivity and NPV
Sensitivity and negative predictive value are positively correlated
(an increase in one increases the other). If the test is more sensitive, it
is less likely that an individual with a negative result will have the
disease, so the negative predictive value is greater.

Specificity and PPV
Specificity and positive predictive value are directly correlated (i.e.
an increase in one increases the other). If the test is more specific, it
is less likely that an individual with a positive test will be free from
disease, so the positive predictive value is greater.

The Effect of Disease Prevalence


Sensitivity and specificity are independent of the prevalence of disease
because they are test specific (they describe how well the screening test
performs against the gold standard).
Positive predictive value (PPV) and negative predictive
value (NPV) depend on disease prevalence because they are
population specific. Both PPV and NPV give information on how
well a screening test will perform in a given population with known
prevalence. Prevalence is directly related to PPV and inversely related to
NPV; thus a higher prevalence will increase the PPV and decrease
the NPV.
Example 1a: In a population of 10,000 with a disease prevalence of
1%, using test A with sensitivity = 99% and specificity = 95%:

Disease prevalence 1%   Disease positive   Disease negative   Total
Test (Positive)         99                 495                594
Test (Negative)         1                  9405               9406
Total                   100                9900               10,000

$$\text{PPV} = \frac{99}{594} \times 100 = 17\%; \qquad \text{NPV} = \frac{9405}{9406} \times 100 = 99.99\%$$


However, with the same sensitivity, specificity and population
size, a change in prevalence changes the
test’s positive predictive value (PPV); see Example 1b.
Example 1b: In a population of 10,000 with a disease prevalence of
5%, using test A with sensitivity = 99% and specificity = 95%:

Disease prevalence 5%   Disease positive   Disease negative   Total
Test (Positive)         495                475                970
Test (Negative)         5                  9025               9030
Total                   500                9500               10,000

$$\text{PPV} = \frac{495}{970} \times 100 = 51.03\%; \qquad \text{NPV} = \frac{9025}{9030} \times 100 = 99.94\%$$


Thus an increase in prevalence from 1 to 5 percent, with the
same level of sensitivity and specificity, has increased the test’s positive
predictive value (PPV) from 17 to 51.03 percent and decreased the
negative predictive value (NPV) from 99.99 to 99.94 percent.
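
The dependence of PPV and NPV on prevalence can be reproduced without building the full table. The sketch below is an illustration under stated assumptions rather than the authors' method: the function name `predictive_values` is hypothetical, and it works with expected proportions of the population in each cell instead of counts out of 10,000.

```python
def predictive_values(sensitivity: float, specificity: float, prevalence: float) -> tuple[float, float]:
    """Return (PPV, NPV) as proportions from sensitivity, specificity and prevalence."""
    tp = sensitivity * prevalence              # diseased, test positive
    fn = (1 - sensitivity) * prevalence        # diseased, test negative
    tn = specificity * (1 - prevalence)        # non-diseased, test negative
    fp = (1 - specificity) * (1 - prevalence)  # non-diseased, test positive
    return tp / (tp + fp), tn / (tn + fn)


# Test A from Examples 1a and 1b: sensitivity 99%, specificity 95%.
for prevalence in (0.01, 0.05):
    ppv, npv = predictive_values(0.99, 0.95, prevalence)
    print(f"prevalence {prevalence:.0%}: PPV = {ppv * 100:.2f}%, NPV = {npv * 100:.2f}%")
```

For a prevalence of 1 percent this gives a PPV of 16.67 percent (17 percent when rounded) and an NPV of 99.99 percent; for 5 percent it gives 51.03 and 99.94 percent, matching the two examples.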

BIBLIOGRAPHY
1. Hennekens CH, Buring JE. Screening. Epidemiology in medicine.
Boston, Mass: Little, Brown and Co; 1987.pp.327-47.
2. Park K. Screening for Disease. In Park’s Textbook of Preventive and
Social Medicine. India: Bhanot; 2009.pp.123-130.
3. Petrie A, Sabin C. Diagnostic tools. Medical statistics at a glance. UK:
Blackwell Science; 2000.pp.90-1.
4. Wassertheil-Smoller S. Mostly about screening. Biostatistics and
epidemiology: a primer for health and biomedical professionals. New
York: Springer-Verlag; 1995.pp.118-28.

CHAPTER

12
Basic Statistical Tests

Selecting the correct test is vital before running an analysis in SPSS. The
choice of test depends on whether:
• Data are qualitative (categorical) or quantitative (continuous)
• Data are unpaired (independent groups) or paired (repeated
measures)
• The distribution is normal or skewed.

UNPAIRED SAMPLES
In unpaired samples, there is no relation between subjects in group
1 and subjects in group 2 (two independent groups). Suppose data
are collected on ICT skills to compare medical and engineering
students. These are two independent groups. Whenever you
compare the mean of a continuous variable between two independent groups
(e.g. medical students and engineering students), an independent
samples t-test is applied.

PAIRED SAMPLES
In paired samples, repeated measures (pre-post test) are taken on the
same subject. For example, to determine how much a
student learned in a statistics class, you would give a pre (before) and
post (after) test to determine the impact of the intervention (the statistics
class) on the score.
Whenever a categorical variable (qualitative data) is compared
between two groups, a Chi-square test is used.
Comparing a continuous variable (quantitative data) between
two groups is called comparing two means, and a t-test is applied for
this purpose: an independent samples t-test when the two groups are
independent, and a paired t-test when the two sets of measurements
are paired (pre-post).

Flow charts 12.1 and 12.2 give different choices of tests for
qualitative and quantitative data.

Flow chart 12.1 Selection of statistical test for qualitative data

Flow chart 12.2 Selection of parametric statistical test for
quantitative data to compare means

Nonparametric Tests
When the assumptions of the parametric tests are not satisfied, i.e.
the data are not normally distributed or were collected from fewer
than 30 participants, a nonparametric test is applied (Flow chart
12.3). Nonparametric tests are an alternative to parametric tests.
Chi-square is the most frequently used nonparametric test. Other
nonparametric tests are:
Wilcoxon Rank Sum test or Mann-Whitney U test is the
nonparametric version of the independent samples t-test and can
be used when the assumptions of the parametric tests are not satisfied.
Thus, the Mann-Whitney U test is used to compare the medians of two
independent samples when the data are either:
• On interval scale; or
• Ranked (ordered) scale.
The test is used to test the hypothesis that two population
distributions do not differ in median (e.g. a null hypothesis comparing
median biceps skinfold thickness of patients with celiac disease and
Crohn’s disease would say that the two medians are equal).

Flow chart 12.3 Selection of nonparametric statistical
test for quantitative data to compare means

Wilcoxon signed rank test is the nonparametric version of a paired
samples t-test; it is also called the Wilcoxon matched pairs test
and is used when the data are either:
• On interval scale; or
• Ranked scale.
The test is based on the ranks of the absolute differences, rather than
the numerical values of the differences (Table 12.1).
Kruskal-Wallis test is the nonparametric version of ANOVA and is
used when the assumptions of the parametric tests are not satisfied.
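
The selection rules described in this chapter can be summarized as a small decision function. This is a simplified sketch, not a substitute for the flow charts: the function `choose_test` and its parameter names are hypothetical, and the branch for more than two groups assumes independent groups compared by ANOVA or its nonparametric counterpart.

```python
def choose_test(data_type: str, paired: bool = False, normal: bool = True, groups: int = 2) -> str:
    """Suggest a test from the type of data, pairing, distribution and number of groups."""
    if data_type == "qualitative":
        return "Chi-square test"
    if data_type != "quantitative":
        raise ValueError("data_type must be 'qualitative' or 'quantitative'")
    if groups > 2:
        # Assumes more than two independent groups.
        return "ANOVA" if normal else "Kruskal-Wallis test"
    if paired:
        return "Paired t-test" if normal else "Wilcoxon signed rank test"
    return "Independent samples t-test" if normal else "Mann-Whitney U test"


print(choose_test("qualitative"))
print(choose_test("quantitative", paired=False, normal=True))
print(choose_test("quantitative", paired=True, normal=False))
print(choose_test("quantitative", normal=False, groups=3))
```

Here `normal=False` stands for any situation in which the parametric assumptions are not satisfied, including the small-sample case mentioned above.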

WHAT ARE VALIDITY AND RELIABILITY
IN RESEARCH FINDINGS?

Validity and reliability are illustrated in Figures 12.1A to D.
Validity means that your scientific observations actually measure
what they intend to measure (your conclusions are true).
Reliability means that someone else using the same method in
the same circumstances should be able to obtain the same findings
(your findings are repeatable).
Reliability (repeatability) refers to the possibility of replicating
(repeating) the observations and is related to the precision of the
instrument used for scientific observations. Validity refers to the
soundness of the observations and to the accuracy of the data
collected by the research method/instrument.

Table 12.1: Wilcoxon signed rank test

Participant ID   Placebo   Drug   Difference (Placebo - Drug)
1                2         1      1
2                5         2      3
3                8         3      5
4                6         4      2
5                9         3      6
6                13        16     -3
7                19        8      11
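
To make the ranking step concrete, the sketch below ranks the absolute differences from Table 12.1 and sums the ranks separately for positive and negative differences. It is an illustration only: the function name `signed_rank_sums` is hypothetical, tied absolute differences are given their average rank (a common convention that the text does not spell out), and no p-value is computed.

```python
def signed_rank_sums(differences: list[float]) -> tuple[float, float]:
    """Return (W_plus, W_minus): rank sums of the positive and negative differences.

    Zero differences are dropped and tied absolute values share their average rank.
    """
    nonzero = [d for d in differences if d != 0]
    ordered = sorted(nonzero, key=abs)
    ranks = {}
    i = 0
    while i < len(ordered):
        j = i
        while j < len(ordered) and abs(ordered[j]) == abs(ordered[i]):
            j += 1
        # Positions i..j-1 (0-based) hold ranks i+1..j; ties get the average.
        ranks[abs(ordered[i])] = (i + 1 + j) / 2
        i = j
    w_plus = sum(ranks[abs(d)] for d in nonzero if d > 0)
    w_minus = sum(ranks[abs(d)] for d in nonzero if d < 0)
    return w_plus, w_minus


placebo = [2, 5, 8, 6, 9, 13, 19]
drug = [1, 2, 3, 4, 3, 16, 8]
differences = [p - d for p, d in zip(placebo, drug)]
w_plus, w_minus = signed_rank_sums(differences)
print(differences)      # [1, 3, 5, 2, 6, -3, 11]
print(w_plus, w_minus)  # 24.5 3.5
```

The two differences of size 3 (participants 2 and 6) tie for ranks 3 and 4 and each receive 3.5, so the only negative difference contributes a rank sum of 3.5 against 24.5 for the positive differences.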

Figures 12.1A to D (A) Neither valid nor reliable. The research method does
not measure the research outcome (not valid) and repeated attempts are
unfocused; (B) Reliable but not valid. The research method does not measure
the research outcome (not valid), but repeated attempts get almost the same
(wrong) results; (C) Fairly valid but not very reliable. The research method
measures the research outcomes fairly closely, but repeated attempts have
very scattered results (not reliable); (D) Valid and reliable. The research
method precisely measures the research outcomes, and repeated attempts
produce similar results

BIBLIOGRAPHY
1. Data management: preparing to analyse the data. In: Peat J, Barton B.
Medical Statistics: a guide to data analysis and critical appraisal. USA:
Blackwell Publishing Ltd; 2005.pp.1-23.
2. Field A. Discovering statistics using IBM SPSS statistics. Sage
Publications, 2013.
3. Petrie A, Sabin C. Medical Statistics at a glance (vol 29). John Wiley &
Sons; 2009.
4. Pallant J. SPSS Survival manual: A step-by-step guide to data analysis
using SPSS for windows (version 10): Allen and Unwin, 2001.
5. Rosner B. Fundamentals of biostatistics. Cengage Learning, 2010.

CHAPTER

13
Overview of Data
Collection Techniques

Data collection techniques allow us to systematically collect
information about our objects of study (people, objects, phenomena)
and about the settings in which they occur.

DIFFERENT DATA COLLECTION TECHNIQUES

• Using available information
• Observing
• Interviewing (face-to-face)
• Administering written questionnaire
• Focus group discussion
• Projective techniques
• Mapping and scaling.

Using Available Information
Usually, a large amount of information/data has already been
collected by some other source but has not been analyzed or
published. An example is the analysis of information collected from a
Primary Health Care Center on the proportion of different
diseases, and the age groups affected by them, in an area. The
advantage of using existing information is that it is a very inexpensive
method; however, the data may be incomplete or too
disorganized.

Observing
Observation is a technique that involves systematically selecting, watching
and recording the behavior and characteristics of living beings, objects
or phenomena. Observations can be open (e.g. observing a health
worker during his/her routine activities) or concealed (e.g. mystery
clients trying to obtain antibiotics without a medical prescription).
This method gives more accurate, additional information on the
behavior of people than interviews or questionnaires. It can also be used
to check the information collected through interviews, especially on sensitive
topics such as alcohol use or the behavior of people towards a patient
with a stigmatizing disease.

Interviewing
Interviewing involves oral questioning of respondents. Answers to the
questions posed during an interview are written down,
recorded by a tape recorder, or both.
The unstructured method of asking questions is used. This method
is frequently used in exploratory studies where the investigator has,
as yet, little understanding of the problem, or if the topic is sensitive.

Questionnaire
A written questionnaire, also known as a self-administered questionnaire,
is a data collection tool in which questions are presented that
are to be answered by the respondent himself or herself in written form.
A questionnaire comprises a formal, written set of closed-ended/
open-ended questions that are asked of every respondent in the
study. It provides an objective means of collecting information (data)
related to the exposure/outcome of interest as well as on confounders or
effect modifiers.

Types of Questions
• Open-ended questions: Open-ended questions are those questions that
solicit additional information from the respondent. They are also called
infinite response or unsaturated type questions. By definition, they are
broad and require more than one- or two-word responses. These
types of questions are useful in qualitative research.
• Closed-ended questions: Closed-ended questions are those
questions that can be answered finitely by either “yes” or “no”.
They are also called dichotomous or saturated type questions.
Closed-ended questions are the type most used in quantitative research.


Ways of Administration of Questionnaire

• Mail
• Telephone
• Via computer
• Interviewer.

Important Points in Designing a Questionnaire

• Each question should yield information that is specific to
what you will need in your analysis. Therefore,
before you compose any question, think through your research
questions/objectives and how you will conduct your
analysis.
• The format of the questionnaire should be
attractive and easy for the respondents to fill in; overcrowding or
cluttering of questions should be avoided. All pages and questions
should be clearly numbered.
• The questionnaire should never be too long. In general, questions
should be short and to the point (around 12 words or fewer).
• Only information relevant to the objective should be solicited; the
proforma/questionnaire should not resemble a history sheet.
• Be careful about responses of ‘neutral’ or ‘no opinion’ versus ‘do
not know’.
• Questions concerning major areas should be grouped together.
• Simple questions about age, birth date, etc. should be put at the
beginning to warm up the respondent.
• Each question should ask for only one piece of information.
• Question wording should ensure that every respondent is
answering the same thing, so avoid ambiguous wording or
wording that means different things to different respondents. Also
avoid terms whose definition can vary (if a term is unavoidable,
provide the respondent with a definition).
• Questions should preferably be closed-ended; possible answers
to a closed-ended question should be lined up vertically, preceded by
boxes, brackets or numbers.
Example: How many different medicines do you take daily (check
one)?
– [ ] None
– [ ] 1–2
– [ ] 3–4
– [ ] 5–6
– [ ] 7 or more

• If more details are required pertaining to a question, then the
filter/skip technique should be used to save time and allow
respondents to skip irrelevant questions.
Example: Have you ever been told that you have hypertension?
1. Yes
2. No
If yes, proceed to the next question; if no, skip it.
How long ago were you told that you have hypertension?
• Always choose an appropriate means of measurement, e.g. scores/
scales.
Example: Two words that are often used inappropriately are
“frequently” and “regularly”. A poorly designed question might read,
“I frequently engage in exercise,” and offer a Likert scale giving
responses from “strongly agree” through to “strongly disagree.”
But “frequently” implies frequency, so a frequency-based rating
scale (with options such as at least once a day, twice a week, and
so on) would be more appropriate.
• Sensitive questions should be left for the end.
• Using a previously validated and published questionnaire will
save you time and resources, so if similar research instruments
are available it may be a good idea to review them and borrow questions.
• If questions are to be asked in any language other than English,
make sure they are also written in that language.

Focus Group Discussion


Focus group discussion allows a group of 8–10 informants to freely
discuss a certain subject with the guidance of a facilitator or reporter.

Projective Techniques
When a researcher uses projective techniques, he or she asks an informant
to react to some kind of visual or verbal stimulus.
An example is the presentation of a hypothetical question,
an incomplete sentence or a case study (a story with
a gap) to an informant. The researcher then asks the informant to
complete the sentence in writing, such as:
If I were to discover that my neighbor had tuberculosis, I would
suggest him ______
If my wife were having labor pains, I would do ______
If my child had diarrhea, I would give him ______

Mapping and Scaling
Mapping is a valuable technique for displaying relationships and resources. In
a water supply project, for example, mapping is invaluable. It can be
used to present the placement of wells, the distance of homes from
the wells, other water systems, etc. It gives researchers a good overview
of the physical situation and may help to highlight relationships
hitherto unrecognized.
Scaling is a technique that allows researchers, through their
respondents, to categorize certain variables that they would not be
able to rank themselves. For example, they may ask their informants
to bring certain types of herbal medicine and to arrange
these into piles according to their usefulness. The informants would
then be asked to explain the logic of their ranking.
Mapping and scaling are used as techniques in rapid appraisals
or situation analysis. Rapid appraisal is an approach often
used in health systems research.

BIBLIOGRAPHY
1. Bourque, Linda and Eve Fielder. How to Conduct Self-Administered
and Mail Surveys? Learning Objectives. Thousand Oaks, CA: Sage
Publications, 1995.
2. Converse Jean M, Stanley Presser. Survey Questions: Handcrafting the
Standardized Questionnaire. Quantitative Applications in the Social
Sciences (series). Thousand Oaks, CA: Sage Publications, 1986.
3. Dillman Don A. Mail and Internet Surveys: The Tailored Design Method.
New York: J Wiley, 2000.
4. Fink Arlene. How To Ask Survey Questions? Thousand Oaks, CA: Sage
Publications, 1995.
5. Fowler, Floyd J Jr. Improving Survey Questions: Design and Evaluation.
Thousand Oaks, CA: Sage Publications, 1995.
6. Sudman Seymour, Norman M Bradburn. Asking Questions: A Practical
Guide to Questionnaire Design. San Francisco: Jossey-Bass Inc., 1982.
CHAPTER

14
Data Analysis Plan

Development of a research process is a cyclical process. The double-headed arrows indicate
that the process is never linear.

IMPORTANCE OF DATA ANALYSIS PLAN


Preparation of a plan for data processing and analysis will provide
you with better insight into the feasibility of the analysis to be
performed as well as the resources that are required. It also provides
an important review of the appropriateness of the data collection
tools for collecting the data you need. That is why you have to plan for
data analysis before the pretest. When you process and analyze the
data you collect during the pretest you will spot gaps and overlaps
which require changes in the data collection tools before it is too late!

WHAT SHOULD THE PLAN INCLUDE?

When making a plan for data processing and analysis the following
issues should be considered:
• Sorting data
• Performing quality-control checks
• Data processing
• Data analysis.

Sorting Data
An appropriate system for sorting the data is important for facilitating
subsequent processing and analysis.
If you have different study populations (for example, doctors,
paramedical staff and medical students), you would
number the questionnaires for each population separately.
In a comparative study, it is best to sort the data right after
collection into the two or three groups that you will be comparing
during data analysis. For example, in a study of the use of sedatives
by doctors, users and
nonusers would be the two basic categories. In a study of the reasons
why doctors object to being posted in rural areas, rural and urban
doctors would be the basic categories. In a case-control study,
the cases are to be compared with the controls. It is useful to number
the questionnaires belonging to each of these categories separately,
right after they are sorted.

Performing Quality-Control Checks


Usually the data have already been checked in the field to ensure
that all the information has been properly collected and recorded.
Before and during data processing, however, the information should
be checked again for completeness and internal consistency.
If a questionnaire has not been filled in completely, you will have
missing data for some of your variables. If there are many missing
data in a particular questionnaire, you may decide to exclude the
whole questionnaire from further analysis.
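
A completeness check of this kind is easy to automate once the data are entered. The sketch below is purely illustrative: the questionnaires, the variable names and the 50 percent cut-off `MAX_MISSING` are all assumptions, since the text leaves the decision on how many missing items are "many" to the researcher.

```python
def missing_fraction(record: dict) -> float:
    """Fraction of variables in one questionnaire left unanswered (None)."""
    return sum(value is None for value in record.values()) / len(record)


# Hypothetical questionnaires; None marks an unanswered question.
questionnaires = {
    "001": {"age": 25, "sex": "M", "smoker": "Y", "cough": "N"},
    "002": {"age": None, "sex": "F", "smoker": None, "cough": None},
    "003": {"age": 41, "sex": "F", "smoker": None, "cough": "Y"},
}

MAX_MISSING = 0.5  # assumed cut-off; the text gives no figure

kept = {qid: rec for qid, rec in questionnaires.items() if missing_fraction(rec) <= MAX_MISSING}
excluded = sorted(set(questionnaires) - set(kept))
print("kept:", sorted(kept), "excluded:", excluded)
```

Questionnaire 002, with three of four items missing, is excluded under this assumed threshold, while 003 is kept with one missing value for the smoking variable.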

Data Processing: Quantitative Data

Data from questionnaires can be processed and analyzed either:
• Manually, using data master sheets or manual compilation of the
questionnaires, or
• By computer, for example, using a microcomputer and existing
software or self-written programs for data analysis.
Data processing in both cases involves:
• Categorizing the data
• Coding
• Summarizing the data in data master sheets, manual compilation
without master sheets, or data entry and verification by computer.
Answers that are difficult or impossible to categorize may be
put in a separate residual category called ‘others’, but this category
should not contain more than 5 percent of the answers obtained.
If the data will be entered in a computer for subsequent processing
and analysis, it is essential to develop a coding system.
For computer analysis, each category of a variable can be coded
with a letter, group of letters or word, or be given a number. For
example, the answer ‘yes’ may be coded as ‘Y’ or 1; ‘no’ as ‘N’ or 2
and ‘no response’ or ‘unknown’ as ‘U’ or 9.
The codes should be entered on the questionnaires (or checklists)
themselves. When finalizing your questionnaire, for each question
you should insert a box for the code in the right margin of the page.
These boxes should not be used by the interviewer. They are only
filled in afterwards during data processing. Take care that you have
as many boxes as the number of digits in each code.

For example:
Yes (or positive response)    code Y or 1
No (or negative response)     code N or 2
Do not know                   code D or 8
No response/unknown           code U or 9
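
When data are entered by computer, a coding scheme like this one is usually stored as a lookup table. The following is a minimal sketch using the numeric codes from the example; the names `RESPONSE_CODES` and `code_response` are hypothetical, and treating blank or unrecognized answers as 9 is an assumption made for illustration.

```python
# Numeric codes from the example above; keys are normalized to lower case.
RESPONSE_CODES = {
    "yes": 1,
    "no": 2,
    "do not know": 8,
    "no response": 9,
}


def code_response(answer):
    """Map a raw answer to its numeric code; blank or unrecognized answers become 9."""
    if answer is None:
        return RESPONSE_CODES["no response"]
    return RESPONSE_CODES.get(answer.strip().lower(), RESPONSE_CODES["no response"])


raw_answers = ["Yes", "no", "Do not know", None, " YES "]
print([code_response(a) for a in raw_answers])  # [1, 2, 8, 9, 1]
```

Normalizing case and surrounding spaces before the lookup keeps small inconsistencies in data entry from producing spurious "unknown" codes.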

A number of computer programs are available on the market that
can be used to process and analyze research data. The most widely
used programs are:
• Epi Info (version 6), a very user-friendly program for data
entry and analysis, which also has a word processing function
for creating questionnaires (developed by the Center for Disease
Control, Atlanta, USA and World Health Organization, Geneva),
• LOTUS 1-2-3, a spreadsheet program (from the Lotus
Development Corporation),
• dBase (version III plus or IV), a data-management program (from
Ashton-Tate), and
• SPSS, which is a quite advanced Statistical Package for Social
Sciences (SPSS Inc.).
If you intend to use a computer, you may ask advice from
an experienced person concerning which program is the most
appropriate for your type of data. Note that Epi Info may be freely
used and copied. All the other programs are copyrighted.

Data Analysis: Quantitative Data

Analysis of quantitative data involves the production and
interpretation of frequencies, tables, graphs, etc., that describe the
data.
After deciding on a data entry format, the information on the
data collection instrument will have to be coded (e.g., Male: M or 1,
Female: F or 2). During data entry, the information relating to each
subject in the study is keyed into the computer in the form of the
relevant code (e.g., if the first subject (identified as 001) is a male
(code 1) aged 25, the data could be keyed in as 001125).
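
The keyed-in string 001125 is a fixed-width record: three digits for the subject ID, one for the sex code and two for age. The sketch below encodes and decodes such a record; the function names and the field widths are inferred from this single example and are assumptions, so a real study with subjects aged 100 or more, or with over 999 subjects, would need wider fields.

```python
def encode_record(subject_id: int, sex_code: int, age: int) -> str:
    """Key in one subject as ID (3 digits) + sex code (1 digit) + age (2 digits)."""
    if not (0 <= subject_id <= 999 and sex_code in (1, 2) and 0 <= age <= 99):
        raise ValueError("value does not fit the assumed field widths")
    return f"{subject_id:03d}{sex_code:d}{age:02d}"


def decode_record(record: str) -> dict:
    """Split a keyed-in record back into its fields."""
    if len(record) != 6 or not record.isdigit():
        raise ValueError("record must be exactly six digits")
    return {"id": record[0:3], "sex_code": int(record[3]), "age": int(record[4:6])}


print(encode_record(1, 1, 25))   # 001125
print(decode_record("001125"))   # {'id': '001', 'sex_code': 1, 'age': 25}
```

Validating each value against its field width at entry time is one way to catch keying errors before they reach the analysis.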

The computer can do all kinds of analysis and the results can be
printed. It is important to decide whether each of the tables, graphs,
and statistical tests that can be produced makes sense and should
be used in your report. That is why we plan the data analysis
beforehand!
• Frequency counts: From the data master sheets, simple tables can
be made with frequency counts for each variable. A frequency
count is an enumeration of how often a certain measurement or a
certain answer to a specific question occurs.

For example:
Smokers       51
Nonsmokers    93
Total         144

If numbers are large enough it is better to calculate the frequency
distribution in percentages (relative frequencies): 51/144 × 100 =
35 percent are smokers and 93/144 × 100 = 65 percent nonsmokers.
This makes it easier to compare groups than when only absolute
numbers are given. In other words, percentages standardize the
data.
• For numerical variables (e.g. distance or age), divide the range into
three to five categories. You can either aim
at having a reasonable number of observations in each category (e.g. 0–2 km,
3–4 km, 5–9 km, 10+ km for home-clinic distance) or you can
define the categories in such a way that they are each equal in width
(e.g. 20–29 years, 30–39 years, 40–49 years, etc.).
• Construct a table indicating how data are grouped and count the
number of observations in each group.
• Cross-tabulations: Further analysis of the data usually requires
the combination of information on two or more variables in order
to describe the problem or to arrive at possible explanations for it.
For this purpose it is necessary to design cross-tabulations.

Depending on the objectives and the type of study, two major
kinds of cross-tabulations may be required:
1. Descriptive cross-tabulations that aim at describing the
problem under study.
2. Analytic cross-tabulations in which groups are compared in
order to determine differences, or which focus on exploring
relationships between variables.
A descriptive cross-tabulation (Table 14.1) would, for example,
relate smoking behavior to sex or occupational background.
The males appear to be smoking more (43%) than females (28%).

Table 14.1: Smoking by sex
Sex        Smoking     Not smoking   Total
Males      31 (43%)    41 (57%)       72 (100%)
Females    20 (28%)    52 (72%)       72 (100%)
Total      51 (35%)    93 (65%)      144 (100%)
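
The counts and row percentages of a descriptive cross-tabulation such as Table 14.1 can be produced from raw records with a few lines of code. The following is a minimal sketch in Python using only the standard library. The record layout (a list of (sex, smoking status) pairs) and the function name cross_tabulate are assumptions made for illustration; the raw records are rebuilt here from the cell counts of Table 14.1.

```python
from collections import Counter


def cross_tabulate(records):
    """Count (row, column) pairs and return, for each cell, (count, row percentage)."""
    counts = Counter(records)
    rows = sorted({row for row, _ in counts})
    cols = sorted({col for _, col in counts})
    table = {}
    for row in rows:
        row_total = sum(counts[(row, col)] for col in cols)
        table[row] = {
            col: (counts[(row, col)], round(100 * counts[(row, col)] / row_total))
            for col in cols
        }
    return table


# Hypothetical raw data rebuilt from the cell counts in Table 14.1.
records = (
    [("Males", "Smoking")] * 31
    + [("Males", "Not smoking")] * 41
    + [("Females", "Smoking")] * 20
    + [("Females", "Not smoking")] * 52
)

table = cross_tabulate(records)
for sex, cells in table.items():
    print(sex, cells)
```

Percentages are computed within each row (each sex), which is what allows the two groups to be compared even if they were of unequal size.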

An analytic cross-tabulation serves to investigate whether there is a
relationship between smoking (independent variable) and persistent
cough or chest complaints (dependent variables/problems).
Of the informants with a cough, the majority (77%) smoke,
whereas among those without a cough, fewer than one-third (31%) are
smokers (Table 14.2). The expected relationship between smoking and
chest problems seems confirmed.
When the plan for data analysis is being developed, the data are, of
course, not yet available. However, in order to visualize how the
data can be organized and summarized, it is useful at this stage to
construct so-called dummy cross-tabulations.
A dummy table contains all elements of a real table, except that
the cells are still empty. In a research proposal, dummy tables should
be prepared to describe the study population and to show the
crucial relationships between variables.
For the study exploring the relationship between smoking
behavior and persistent cough, a table should be constructed as
below (Table 14.2).
Table 14.2: Smoking in relation to persistent cough over the past 2 weeks
Smoking behavior    Cough        No cough      Total
Smoking             10 (77%)     41 (31%)      51 (35%)
Not smoking          3 (23%)     90 (69%)      93 (65%)
Total               13 (100%)   131 (100%)    144 (100%)

Some practical hints when constructing tables:
• If a dependent and an independent variable are cross-tabulated,
the headings of the dependent variable are usually placed
horizontally (Table 14.2: ‘cough’ and ‘no cough’), and the
headings of the independent variable vertically (‘smoking’ and
‘not smoking’ in the same table).
• All tables should have a clear title and clear headings for all rows
and columns.
• All tables should have a separate row and a separate column for
totals to enable you to check if your totals are the same for all
variables and to make further analysis easier.
• All tables related to a certain objective should be numbered and
kept together so the work can be easily organized and the writing
of the final report will be simplified.
To further analyze and interpret the data, certain calculations or
statistical procedures must usually be completed. Especially in large
cross-sectional surveys and in comparative studies, statistical
procedures are necessary if the data are to be adequately
interpreted. Statistical tests should, for example, indicate whether
the gender differences in smoking behavior are true differences or
due to chance. When conducting such studies it is advisable to
consult a person with statistical knowledge right from the start.
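
To illustrate what such a statistical test looks like, the sketch below applies a Pearson chi-square test to the smoking-by-sex counts of Table 14.1. The choice of test (chi-square without continuity correction) and the function name chi_square_2x2 are assumptions for illustration; the text itself does not prescribe a particular test, and a statistician may recommend a different one.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    observed = [[a, b], [c, d]]
    row_totals = [a + b, c + d]
    col_totals = [a + c, b + d]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            # Expected count under the hypothesis of no association.
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed[i][j] - expected) ** 2 / expected
    # With 1 degree of freedom, the upper-tail probability is erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Males: 31 smoking, 41 not smoking; females: 20 smoking, 52 not smoking.
chi2, p_value = chi_square_2x2(31, 41, 20, 52)
print(f"chi-square = {chi2:.2f}, p = {p_value:.3f}")
```

For these counts the statistic is about 3.67 and the p-value is about 0.055. At the conventional 0.05 level, the apparent difference between males (43%) and females (28%) could therefore still be due to chance, which is exactly the kind of question a statistical test is meant to answer.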

Processing and Analysis of Qualitative Data


Qualitative data may be collected through open-ended questions in
self-administered questionnaires, in individual interviews or focus
group discussions or through observations during fieldwork. For a
detailed description of the analysis of qualitative data see Module
10C and in particular Module 23, which specify the methods most
often used. Here we will concentrate on the analysis of responses
obtained from open-ended questions in interviews or self-
administered questionnaires.
Commonly solicited data in open-ended questions include:

• Opinions of respondents on a certain issue;
• Reasons for a certain behavior; and
• Descriptions of certain procedures, practices or perceptions with
which the researcher is not familiar.
The data can be analyzed in seven steps:

Step 1: Take a sample of (say 20) questionnaires and list all answers for
a particular question. Take care to include the source of each answer
you list (in the case of questionnaires you can use the questionnaire
number), so that you can place each answer in its original context, if
required.
Step 2: To establish your categories, you first read carefully through
the whole list of answers. Then you start giving codes (A, B, C, for
example, or keywords) for the answers that you think belong together
in one category, and write these codes in the left margin. Use a pencil
so that it is easy to change the categories if you change your mind.


Step 3: List the answers again, grouping those with the same code
together.
Step 4: Then interpret each category of answers and try to give it a label
that covers the content of all answers. In the case of data on opinions,
for example, there may be only a limited number of possibilities,
which may range from (very) positive, neutral, to (very) negative.
Data on reasons may require different categories depending on
the topic and the purpose of your question. In the exercise below
you will be asked to categorize the reasons why people smoke by
grouping them in such a way that it is easy to find entry points for
health education aimed at reducing smoking.
After some shuffling you usually end up with 5 to 7 categories.
Step 5: Now try a next batch of 20 questionnaires and check if the
labels work. Adjust the categories and labels, if necessary.
Step 6: Make a final list of labels for each category and give each label
a code (keyword, letter or number).
Step 7: Code all your data, including what you have already coded,
and enter these codes in your master sheet or in the computer.
Note again that you may include a category ‘others’, but that it
should be as small as possible, preferably used for less than 5 percent
of the total answers. If you categorize your responses to open-ended
questions in this way you can:
• Analyze the content of each answer given in particular categories,
for example, in order to plan what actions should be taken (e.g.,
for health education). Gaining insight into a problem, or into possible
interventions for a problem, is the most important function of
qualitative data.
• Report the number and percentage of respondents that fall into
each category, so that you gain insight into the relative weight of
different opinions or reasons.
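
Once every answer carries a code, the tally per category and the check on the size of the ‘others’ category can be sketched as follows. The category labels, the counts, and the function name summarize_codes are hypothetical and serve only as an illustration; the 5 percent limit is the one suggested in the text.

```python
from collections import Counter


def summarize_codes(coded_answers, other_label="Others", other_limit=5.0):
    """Return {label: (count, percent)} and whether 'Others' stays below the limit."""
    counts = Counter(coded_answers)
    total = sum(counts.values())
    summary = {
        label: (n, 100 * n / total) for label, n in counts.most_common()
    }
    other_percent = summary.get(other_label, (0, 0.0))[1]
    return summary, other_percent < other_limit


# Hypothetical coded reasons for smoking from 40 questionnaires.
coded_answers = (
    ["Stress relief"] * 12
    + ["Enjoyment"] * 10
    + ["Peer pressure"] * 9
    + ["Habit"] * 8
    + ["Others"] * 1
)

summary, others_ok = summarize_codes(coded_answers)
for label, (n, percent) in summary.items():
    print(f"{label}: {n} ({percent:.1f}%)")
print("'Others' below 5 percent:", others_ok)
```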
Questions that ask for descriptions of procedures, practices, or
beliefs usually do not provide quantifiable answers (though you may
quantify certain aspects of them). The answers rather form part of a
jigsaw puzzle that you have to put together in order to obtain insight
into your problem/topic under study.

BIBLIOGRAPHY
1. AO Foundation (n.d.). Step-by-step guide to doing clinical research.
Retrieved on 09 October 2006 from
http://www.aofoundation.org/portal/wps/portal/!ut/p/.cmd/cs/.ce/7_0_7T5/_s.7_0_A/7_0_7T5.
2. Designing and conducting Health Systems Research Project Volume
1. The International Development Research Centre (Science for
Humanity). Module 13: Plan for data processing and analysis. Retrieved
on 14 April 2010 from
http://www.idrc.ca/en/ev-56622-201-1-D0-TOPIC.html
3. Professional Data Analysts (n.d.). Stage 3: Data Analysis. Retrieved on
09 October 2006 from http://www.pdastats.com/default.asp

CHAPTER

15
Synopsis Writing

A research proposal/synopsis and research protocol are synonymous
terms and can be used interchangeably. Development of a research
proposal is the first step taken prior to initiation of any research
project. It precisely and elegantly describes the importance of
the area of research, the research questions/hypothesis behind the
research and how it will be carried out.
A research proposal is basically a research plan with well-defined,
measurable outcomes that an investigator aims to follow to achieve
the research objectives. A good research proposal is vital for
successful research. All research must begin with a clearly focused
research proposal. In recent years research culture has spread
enormously, so formulating an excellent research proposal has become
necessary, not only for ensuring a high quality of research but also
for reasons like attracting a research grant. A research proposal
must be precise and convincing, with feasibility of execution (can
the researcher actually do this research?) as the ultimate test.
Importantly, a research proposal must incorporate a properly
formulated hypothesis and a good analytical plan.
The components of research proposal are outlined as follows:
• Title of the research project
• Project summary
• Statement of the problem
• Justification and use of the results
• Theoretical framework
• Research objectives (general and specific).

METHODOLOGY
• Operational definitions
• Type of study and general design
• Universe of study, sample selection and size, unit of analysis and
observation, selection criteria
• Proposed intervention
• Data collection procedures, instrument used and methods for
data quality control
• Procedure to ensure ethical considerations in research with
human subjects.

PLAN FOR ANALYSIS OF RESULTS


• Methods and model of data analysis
• Programs to be used for data analysis
– SPSS

– SAS

TITLE/TOPIC
The title of a synopsis/research protocol should reflect the
objective(s) of the proposed study concisely and clearly. The title
must provide the “keywords” for classification and indexing of the
research project. If your study is a clinical trial being carried out
on children with ear infections using an antibiotic X, then these key
features should be reflected in the title. Your title should say “Role of
antibiotic X in children with ear infections: a clinical trial.”
It is important to include the keywords in the title, as it helps
the reader to identify whether the article is of relevance to him/her
or not. Because articles are searched through keywords, these
keywords help the reader to locate articles of interest. For example, if
the reader enters the keywords (antibiotic X, children, ear infection),
the search engine of an electronic repository (e.g. PubMed) will yield
all articles containing these keywords.

INTRODUCTION
An introduction is the most important part of the research protocol
and it should come on strongly, like thunder, to grab the
reader’s attention. It is here that one tries to let the reviewer know
that his research is going to be different from what other people have
done. In a research protocol the onus is on the researcher to tell
the reviewer how important the study is going to be. If the reviewer
comes from a specialty different from the researcher’s, the reviewer
might not know the relevant details of the topic of interest. It is
the responsibility of the researcher to tell him the precise facts
about the subject. This is best done by giving him statistical facts.
For example, in a research proposal from a nephrologist on end-stage
renal disease it would be a good idea to give statistics about how
many thousands of patients are suffering from end-stage renal disease
and how many billions of dollars are being spent each year. To a
non-nephrologist reviewer these statistics highlighting the burden of
disease and cost of illness would be enough to convey the importance
and significance of the subject. The first paragraph, which is about
how big the problem is, should have these statistics so as to make a
really loud, thunderous introduction.
An introduction should ideally comprise four paragraphs, addressing
what is known about the problem, what is not known about the problem,
the existing gaps in scientific knowledge and how the study will
contribute to filling them, and the strengths of the study planned.
It is important to present some statistics from local and
international populations to convey the seriousness and magnitude of
the problem of interest. An example follows:

A Template of a First Paragraph


End-stage renal disease (ESRD) is a significant clinical and public
health problem. In 1999, the prevalent ESRD population approached
350,000. The total annual cost of care of ESRD in the US was estimated
to be $17.9 billion in 1999. Despite this high cost, there were 68,000
ESRD deaths reported in 1999.
The second paragraph should include “what is not known” about
the problem. In this paragraph one should focus on novel
ideas, gray areas, or controversial subjects in that area, because these
are the areas that people would like to know more about. Hence,
in this paragraph you would choose one of the three things
mentioned above, that is, a novel idea, a gray area, or a
controversial area. This will convince the reviewer that your study is
going to be a useful addition to whatever is available in the literature.

A Template of a Second Paragraph


In a study about patients with end-stage renal disease (ESRD) the
researcher wanted to investigate further the issue of late
referral (LR) to a nephrologist and its associated subsequent
outcome, mortality, which was a controversial one. Some previous
studies had shown LR to have worse outcomes compared to early
referral (ER), while others had not.
As shown in the above example, the researcher has identified a
controversial subject among ESRD patients, the issue of delayed vs
early referral. In the second paragraph he has highlighted the
controversy and tried to justify why his study is still important.
This is important because the reviewer is always preoccupied with the
question “why is your study important?”

Third Paragraph
The third paragraph should point out the existing gaps in scientific
knowledge and how the present study will contribute to filling
them.
A template of a third paragraph: Using the same study as an example,
this is how the researchers made their point. Mortality as an outcome
of LR vs ER has been a controversial issue among end-stage renal
disease (ESRD) patients. The previous studies on this subject have
been single-center studies with a sample size of a few hundred
patients only.
“Our study will be carried out on a generalizable United States
population of about 350,000 dialysis patients recruited from all
states of the US. Our study will also be using a novel statistical
technique called propensity score analysis (PS analysis). PS analysis
is a proxy for randomization. Thus using a PS analysis will bring the
study closer to a randomized controlled clinical trial. Hence, our
study is going to make an effort to settle this controversy regarding
LR vs ER in a robust fashion.”

Fourth Paragraph
The fourth paragraph should give details about the rationale of the
study planned. Thus a clear emphasis must be placed on why this study
is important.
A template of a fourth paragraph: The outcome (i.e. mortality)
associated with late vs early referral has been a controversial subject
and has generated immense debate among researchers. There
is a lack of consensus among researchers on whether late referral is
associated with worse outcomes than early referral.
Thus a more valid study is required, with a well-developed
research plan and a robust statistical analysis technique applied to a
large dataset, to correctly answer this controversial issue.
We feel that our study will be a unique study, different from the
other studies on this subject because we will use the novel technique
of PS analysis in a nationally representative sample of ESRD patients
to examine this issue in a more robust fashion. The PS analysis is a
proxy for randomization in observational studies and will be used to
balance the covariates in ER and LR groups.

Research Objectives
A research objective is a statement that clearly depicts the goal to be
achieved by a research project. In other words, the objectives of a
research project summarize what a study plans to achieve.
The formulation of objectives will help you to:
• Focus the study (narrowing it down to essentials)
• Avoid the collection of data which are not strictly necessary for
understanding and solving the problem you have identified (to
establish the limits of the study)
• Organize the study in clearly defined parts or phases.
Properly formulated, specific objectives will facilitate the
development of your research methodology and will help to orient
the collection, analysis, interpretation and utilization of data.
Objectives should be stated using “action verbs” that are specific
enough to be measured:
Examples: To determine …, To compare…, To verify…, To calculate…,
To describe…, etc.
Do not use vague nonaction verbs such as:
To appreciate … To understand… To believe
An objective is a statement of what the researcher wants to determine
and should be stated in clear, measurable terms. While developing
a research protocol a researcher must ensure that the research
objectives match the hypothesis and data analysis plan.
Moreover, a researcher can have as many objectives as the study can
feasibly achieve.
Given below (Table 15.1) is an example of specific aims/objectives
for a study looking at the impact of socioeconomic factors on
outcomes among kidney transplant recipients, submitted to the National
Institutes of Health. The point worth noticing in this synopsis is that
the objectives match well with the hypothesis and the analytical plan
for each objective.
Material and methods: The methodology explains the procedures
that will be used to achieve the objectives. The methodology of a
research project is the core of the study. Components of a research
design that should be addressed in the methodology section of a
research proposal are:
• Operational definition
• Hypothesis
• Variables
• Research methods or techniques
• Sampling method
• Plan for data collection
• Plan for analysis of data and interpretation of the results
• Staffing, supplies and equipment (covered in detail in ‘Budget
and plan for data collection and analysis’ section)
• Ethical considerations.

Operational Definition
It is the definition of the exposure and outcome variables of interest
in the context of the objectives of a particular study, together with
their means of measurement/determination.
Consider that one wishes to do a study on anemia in patients
with chronic kidney disease (CKD). He has to give an operational
definition of anemia in his study. This definition of anemia should
not be a textbook definition of anemia; rather, it should state
what anemia means in this particular study. For example, he
should give an operational definition such as: anemia in this study
is defined as hemoglobin less than 11 g/dL. This cut-off of 11 g/dL
should ideally come from a world-recognized body like the WHO or the
National Kidney Foundation.
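
An operational definition is precise enough to be written down as a rule that a computer could apply. The sketch below does this for the anemia example, using the 11 g/dL cut-off quoted above. The function name is_anemic and the hemoglobin values are hypothetical and only illustrate the idea.

```python
ANEMIA_HB_CUTOFF_G_DL = 11.0  # cut-off quoted in the example above


def is_anemic(hemoglobin_g_dl, cutoff=ANEMIA_HB_CUTOFF_G_DL):
    """Apply the study's operational definition: hemoglobin below the cut-off."""
    return hemoglobin_g_dl < cutoff


# Hypothetical hemoglobin values (g/dL) for five patients.
hemoglobin_values = [9.8, 11.0, 12.4, 10.9, 13.1]
flags = [is_anemic(hb) for hb in hemoglobin_values]
prevalence_percent = 100 * sum(flags) / len(flags)
print(flags, prevalence_percent)
```

Note that a patient with exactly 11.0 g/dL is not classified as anemic under a "less than 11 g/dL" definition; boundary cases like this are the reason the definition must be stated explicitly.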
Take another example, a study to compare the effectiveness of
dressing A and dressing B in patients presenting with infected wounds
of the foot. An outcome variable should be easily measurable. From
the objective alone it is not clear what will be deemed
effective and how effectiveness will be measured. So effectiveness
should be defined in clear, measurable terms: “The effectiveness
could be defined as positive if there is presence of granulation tissue
on clinical examination on the 7th postoperative day.”
Hypothesis: Statistical hypothesis (null and alternate hypothesis)
where required, should be appropriately framed in terms of
objectives (please see “hypothesis” for specific aim 1 and specific
aim 2 in Table 15.1).

Table 15.1: Formulation of specific aims, hypotheses, and statistical
analysis plan for synopsis writing: A template

Title of the study: Impact of socioeconomic factors on outcomes among
kidney transplant recipients (KTR)

Specific Aim 1
Evaluate the prevalence of complications of chronic kidney disease
[CKD] (anemia, malnutrition, hyperlipidemia, abnormal
calcium-phosphorus metabolism), and comorbid conditions (hypertension,
diabetes, cardiovascular) among kidney transplant recipients

Hypothesis/Rationale (for Specific Aim 1)
The prevalence of complications of CKD (anemia, malnutrition,
hyperlipidemia, abnormal calcium-phosphorus metabolism) and comorbid
conditions (diabetes, hypertension, cardiovascular disease) is high
among kidney transplant recipients

Statistical analysis (for Specific Aim 1)
Descriptive statistics and/or frequency distributions of continuous
variables and of categorical variables will be obtained. The
prevalence of malnutrition, various levels of anemia, dyslipidemia,
abnormal calcium-phosphorus metabolism, presence of diabetes mellitus,
presence of cardiac comorbidity, and hypertension will be determined
at baseline and yearly intervals post-kidney transplant
The prevalence will be determined separately by marital status,
education level, employment status, and race

Specific Aim 2
Determine the influence of complications of CKD (anemia,
malnutrition, hyperlipidemia, abnormal calcium-phosphorus
metabolism), comorbid conditions (hypertension, diabetes,
cardiovascular), and socioeconomic factors (decreased access to care)
on mortality among KTR

Hypothesis/Rationale (for Specific Aim 2)
Complications of CKD (anemia, malnutrition, hyperlipidemia, abnormal
calcium-phosphorus metabolism), comorbid conditions (hypertension,
diabetes, cardiovascular), and decreased access to care are
associated with increased mortality among KTR

Statistical analysis (for Specific Aim 2)
Descriptive statistics and/or the proportion of deaths among kidney
transplant recipients will be determined for overall deaths and by
specific causes
The proportion of deaths in year 1 post-transplant, and then at each
post-transplant year will be determined
The proportion of deaths will be determined separately by marital
status, education level, employment status, and race
Analytic technique: Time at risk will be calculated as the time in
days from the date of transplant to the earliest of return to
dialysis, death, transfer, loss to follow-up or end of study
(12/31/05)
Cox proportional hazards regression (CPHR) model: CPHR models will be
used to examine the independent contributions of anemia,
hypoalbuminemia and hyperlipidemia, comorbid conditions, and
socioeconomic characteristics to mortality among KTR
Survival analysis: Kaplan-Meier curves will be developed to examine
the differences by marital status, education level, employment
status, and race as well as by different hematocrit (Hct) levels
(<30, 30–32.9, 33–35.9, >36), albumin levels (<3.5, >3.5), stages of
CKD, cholesterol levels (<240, >240), and SF-36 scores (< and >
median or mean score of study population), and will be compared using
log-rank test
Dependent variable/outcome: All-cause mortality
Independent variables: Age, gender, race, presence of DM,
socioeconomic factors (type of insurance, employment status, marital
status, education level, language), comorbidity (hypertension,
cardiovascular disease), serum albumin, hematocrit (as a continuous
variable; categorical <30, 30-<33, 33-<36, >36), hyperlipidemia (yes
vs no), abnormal calcium-phosphorus metabolism (yes vs no),
ACE-Inhibitors (yes vs no), antihypertensives (yes vs no),
lipid-lowering drugs (yes vs no), rHuEPO use (yes vs no), calcineurin
inhibitor (yes vs no), cyclosporine vs tacrolimus, antimetabolite
(MMF vs AZA), HLA matching, type of transplant (living vs cadaveric),
delayed graft function (yes vs no), number of rejection episodes in
year 1 post-transplant
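
The "time at risk" rule in the analysis plan of Table 15.1 (days from transplant to the earliest of return to dialysis, death, transfer, loss to follow-up or end of study) can be sketched in Python as follows. The function name time_at_risk, the dictionary layout of the event dates, and the example dates are assumptions made for illustration; only the end-of-study date comes from the table.

```python
from datetime import date

STUDY_END = date(2005, 12, 31)  # end of study given in Table 15.1


def time_at_risk(transplant_date, events, study_end=STUDY_END):
    """Return (days at risk, reason follow-up ended).

    `events` maps an event name (e.g. "death") to its date, or to None
    if the event did not occur.
    """
    candidates = {name: d for name, d in events.items() if d is not None}
    candidates["end of study"] = study_end
    # Follow-up stops at whichever date comes first.
    reason = min(candidates, key=candidates.get)
    return (candidates[reason] - transplant_date).days, reason


# Hypothetical patient who died one year after transplant.
print(time_at_risk(date(2003, 1, 1), {"death": date(2004, 1, 1), "transfer": None}))
# Hypothetical patient with no event, censored at the end of the study.
print(time_at_risk(date(2003, 1, 1), {"death": None}))
```

The returned reason distinguishes patients who had the outcome (death) from those who are censored, which is the information a Kaplan-Meier or Cox analysis needs alongside the time at risk.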

Variables: A variable is a measurable characteristic of a person,
object or phenomenon that can take on different values. A simple
example of a variable is a person’s age. The variable age can take on
different values because a person can be 20 years old, 35 years old,
and so on.
The variable that is used to describe or measure the problem
under study is called the dependent variable. It represents the
output or effect, or is tested to see if there is an effect. A dependent
variable is also known as a “response variable”, “outcome variable”,
or “output variable”.
The variables that are used to describe or explain the differences
in the dependent variable or to cause changes in the dependent
variable are called the independent (exposure) variables. They
represent the inputs or causes, or are tested to see if they are the cause.
An independent variable is also known as a “predictor variable”,
“explanatory variable”, or “exposure variable”.

Numerical and Categorical Variables


The values of some variables (e.g. age, number of children, monthly
income) are expressed in numbers; we call them numerical variables.
Some variables may be expressed in categories. For example, the
variable gender has two distinct categories, male and female. Since
these variables are expressed in categories, we call them categorical
variables.
Study design: The selection of an appropriate study design is essential
for any study. The study design should match the specific aims of the
study. It is the foundation or cornerstone of any research project. If
the study design is not appropriate, the study will not be able to yield
valid and reliable results. The type of study design chosen depends
on the:
• Type of problem under investigation
• Knowledge already available about the problem
• Resources available for the study.

Sampling Method
Sampling is the process of selecting a finite number
of elements from a given population of interest, for purposes of
inquiry. A researcher can use either a probability or nonprobability
sampling technique after considering the cost, resources available
and practicability.
Large-scale descriptive studies almost always use probability
sampling techniques. Intervention studies sometimes use probability
sampling but also frequently use nonprobability sampling. Qualitative
studies almost always use nonprobability samples.
Probability sampling techniques are preferred by researchers
because they maximize the external validity or generalizability of the
results of the study, whereas nonprobability sampling techniques can
introduce selection bias into the research.
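
The simplest probability sampling technique is simple random sampling, in which every element of the sampling frame has the same chance of selection. The sketch below is a minimal Python illustration; the sampling frame of 200 patient identifiers, the sample size of 20, and the function name simple_random_sample are hypothetical.

```python
import random


def simple_random_sample(sampling_frame, n, seed=None):
    """Draw n distinct elements from the sampling frame, each equally likely."""
    rng = random.Random(seed)  # a fixed seed makes the draw reproducible
    return rng.sample(sampling_frame, n)


# Hypothetical sampling frame: 200 patient identifiers.
frame = [f"patient-{i:03d}" for i in range(1, 201)]
sample = simple_random_sample(frame, 20, seed=42)
print(sample)
```

Recording the seed in the study files allows the selection to be audited and repeated, which is harder to do with convenience (nonprobability) samples.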

Inclusion and Exclusion Criteria


It is important for a researcher to have predefined inclusion and
exclusion criteria specifying which participants will be included
in the study. Including subjects who should not be included, or
excluding participants who should have been included, would make
the findings less valid. It is also important to mention the study
setting, that is, from where the participants will be enrolled (e.g.
private clinics, tertiary care hospitals, rural settings, etc.).
For example, a study was planned to evaluate the efficacy of
parenteral iron. As parenteral iron has a teratogenic effect in
pregnant women in the first and second trimesters, all pregnant
women in the first and second trimesters were excluded. Moreover,
parenteral iron is indicated when the Hb level falls below 8 g/dL,
thus only those pregnant women in the third trimester whose Hb level
was below 7 g/dL were included.
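
Predefined criteria of this kind can be applied mechanically to each candidate, which helps to avoid enrolling ineligible subjects. The following Python sketch illustrates the idea for the parenteral iron example. The record layout, the participant data, and the function name is_eligible are hypothetical, and the hemoglobin cut-off is left as a parameter because its value is fixed by the study protocol (7 g/dL is the inclusion threshold stated in the example).

```python
def is_eligible(participant, hb_cutoff_g_dl):
    """Apply the example's inclusion and exclusion criteria to one record."""
    # Exclusion: pregnancies in the first or second trimester.
    if participant["trimester"] in (1, 2):
        return False
    # Inclusion: third trimester with hemoglobin below the protocol cut-off.
    return participant["trimester"] == 3 and participant["hb_g_dl"] < hb_cutoff_g_dl


# Hypothetical screening list.
participants = [
    {"id": "P01", "trimester": 1, "hb_g_dl": 6.5},
    {"id": "P02", "trimester": 3, "hb_g_dl": 6.8},
    {"id": "P03", "trimester": 3, "hb_g_dl": 9.2},
    {"id": "P04", "trimester": 2, "hb_g_dl": 6.0},
]

enrolled = [p["id"] for p in participants if is_eligible(p, hb_cutoff_g_dl=7.0)]
print(enrolled)
```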
Note: A word of caution for junior researchers, who often tend
to write a few inclusion criteria and then write the opposite in the
exclusion criteria. For example, in a study of an adult population
they would write “all participants who are aged 18 and above will
be included.” In the exclusion criteria, they would write “children will
be excluded.” This is bad practice and must be avoided.

Duration of Study
It is also important to make clear during what time period the
data will be collected. For example, “all participants who attend the
outpatient diabetic clinics of XYZ hospital from 1st January 2012 to
31st December 2013 will be included in the study.”

Sample Size Calculations


Estimating an appropriate sample size is an important determinant of
the accuracy of the results and is vital to avoid type 2 error. Calculation
of sample size depends on the level of significance (normally 0.05),
power (should be at least 80%), and estimates taken from
reference studies. For a detailed reading on “sample size calculation”
read chapter 10.
Example: This is an example of the sample size calculation done
for a study looking at the association of betel nut and oral cancer in
Pakistan (Fig. 15.1).

Software
The sample size calculation was done using the WHO software for
“Sample Size Calculation” edited by Lemeshow S and Lwanga SK.

Reference Study
The reference study used for this sample size calculation is:
Charité, Virchow Klinikum et al. “Betel quid chewing, oral cancer
and other oral mucosal diseases in Vietnam”. J Oral Pathol Med.
2008 Oct;37(9):511-4. Epub 2008 Jul 8. The values obtained from the
reference study are P1 = 0.30 (30% of the controls in the reference
study were consuming betel quid, a preparation similar to ghutka) and
P2 = 0.70 (70% of the cases in the reference study were consuming
betel quid). These two numbers, 30 percent and 70 percent, were
plugged into the WHO sample size software.
According to the proportions of exposure among cases and controls
in the above study, the sample size calculated with the WHO software
is 38 cases and 38 controls (Fig. 15.1). A study with at least this
many participants is considered adequately powered under these
assumptions.

Figure 15.1 Sample size calculation for two population proportions
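
For readers without access to the WHO software, the calculation for two population proportions can be sketched with a standard textbook formula for comparing two proportions with equal group sizes. This is an illustration, not the software's own code. The significance level and power used for Figure 15.1 are not stated in the text, so the values below (two-sided significance level 0.05, power 95%) are assumptions; under them the formula returns the 38 per group quoted above. The function name sample_size_two_proportions is hypothetical.

```python
import math
from statistics import NormalDist


def sample_size_two_proportions(p1, p2, alpha=0.05, power=0.80):
    """Per-group sample size for a two-sided test comparing two proportions."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # e.g. about 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # e.g. about 0.84 for 80% power
    p_bar = (p1 + p2) / 2
    numerator = (
        z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    # Round up: a fraction of a participant cannot be recruited.
    return math.ceil(numerator / (p1 - p2) ** 2)


n_assumed = sample_size_two_proportions(0.30, 0.70, alpha=0.05, power=0.95)
n_80 = sample_size_two_proportions(0.30, 0.70, alpha=0.05, power=0.80)
print(n_assumed, n_80)
```

With the same proportions but the minimum conventional power of 80%, the formula gives 24 per group, which shows how strongly the required sample size depends on the power chosen.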




Moreover, in many studies the sample size is inflated by 5 percent
to 10 percent to account for nonresponse or loss to follow-up of
subjects. Thus in this study, the sample size was calculated as 76 (38
cases and 38 controls), but to account for nonresponse the sample
size was inflated by 10 percent to 84 (42 cases and 42 controls).
Also, when doing a multivariate regression analysis (to look at
factors associated with a certain outcome) the sample size should
take into consideration the number of independent variables
(risk factors) being studied. For example, in the previous study,
a multivariate regression analysis of factors associated with the
risk of developing oral cancer was planned. The sample size
calculated was 42 cases and 42 controls. If in this study there were 8
independent variables, the sample size should have been at least 80
cases and 80 controls (10 cases and 10 controls for each independent
variable).
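
Both adjustments are simple arithmetic and can be checked with a short Python sketch. The function names are hypothetical; the 10 percent inflation and the rule of 10 cases and 10 controls per independent variable are the ones described in the text.

```python
def inflate_for_nonresponse(n, percent):
    """Increase n by a whole-number percentage, rounding up to the next integer."""
    # Integer ceiling division avoids floating-point rounding surprises.
    return (n * (100 + percent) + 99) // 100


def minimum_per_group_for_regression(n_variables, per_variable=10):
    """Rule of thumb from the text: 10 cases and 10 controls per independent variable."""
    return n_variables * per_variable


per_group = inflate_for_nonresponse(38, 10)
total = inflate_for_nonresponse(76, 10)
required = minimum_per_group_for_regression(8)
print(per_group, total, required)
```

Inflating 38 by 10 percent gives 41.8, which rounds up to 42 per group (84 in total). With 8 independent variables the rule of thumb calls for 80 per group, so the larger of the two requirements should determine the final sample size.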
Data collection: Data collection is the most important part of any
research. Initially, the researcher must make it clear whether
primary data will be collected or secondary data will be used for
the research.
Primary data is first-hand information collected from
the study participants, usually through a data collection form or
proforma. In the data collection portion of the research protocol,
the investigator must state the variables on which data will be
collected, including demographic, socioeconomic, laboratory
and outcome variables. The demographic variables
include age, gender, race, ethnicity, marital status, etc. Moreover,
the researcher should state which outcome
variable or variables (e.g. mortality, hospitalization, length of stay,
quality of life, etc.) will be collected. It is also important to mention
whether a validated tool (e.g. SF-36 for quality of life) will be used or
not. In many cases a translated version of a validated tool is used,
which must also be mentioned. In a cross-sectional study the data is
collected at one point in time, whereas in a follow-up study like the
cohort or case control, the data is collected on multiple occasions.
Thus the investigator must specify at what time points (e.g. week
1, 4, 8, etc.) the data will be collected. A detailed proforma must
be attached, preferably as an appendix because of the limited word
count or number of pages in the synopsis.
In the case of secondary data, the investigator must make clear from
which database or registry the data has been extracted. The original
database usually contains a large number of variables, and not all of
them will be of use to the researcher. Thus, the researcher extracts
only the variables of interest from the database. Moreover, large
databases are divided into various sections. The researcher must also
specify which variable was obtained from which section of the
database.
For example, a study looking at the factors associated with
mortality among dialysis patients used a database, the United States
Renal Data System (USRDS). This is a large registry of dialysis patients
being dialyzed in all states of the US. The information on patient
demographics, laboratory values and comorbid conditions was obtained
from a patient questionnaire (the DMMS Wave II), while the data on
mortality was obtained from the “patients file.”

Ethical Concerns
Ethical concerns are of paramount importance in any research.
The researcher must obtain informed consent in the local
language from all participants. The purpose of the research, the
intervention to be given, potential benefits and harms, voluntary
participation, healthcare costs, etc. must be explained in detail to
all study participants. It is also important to protect the rights of
vulnerable groups (e.g. children, mentally ill people, etc.). If children
are to be included in the study, consent from a guardian is essential.
A translated version of the informed consent form must be attached
as an appendix. It is the duty of the researcher to ensure that the
anonymity of the participants is maintained throughout the
research. Moreover, the confidentiality of participants' responses must
also be maintained. The researcher must make
sure that appropriate data protection policies are adopted, so that no
unauthorized person has access to confidential data collected from
study participants. Finally, the researcher must ensure that the
study is conducted in accordance with the Declaration of
Helsinki and, if deemed necessary, obtain approval from the local
ethical review board. All these details must be
included in the ethical considerations portion of the methodology.

Data Analysis
Descriptive Analysis
Data analysis usually begins with the descriptive analysis, which
describes the characteristics of the
population/sample being studied. The descriptive analysis is usually
presented in research studies as shown in Table 15.2.
When the study describes a single sample or population, a widely
accepted wording for the descriptive analysis is:
A descriptive statistical analysis of continuous and categorical
variables will be performed. Data on continuous variables will be
presented as mean ± SD and data on categorical variables will be
presented as proportions.
Please note that there is no p-value column in Table 15.2, as no
comparison is being made.
If a comparison is to be made between two groups, then the values
of each variable in both groups must be calculated, with a p-value
indicating any difference (Table 15.3).
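As a rough illustration of this descriptive summary, the following Python sketch formats a continuous variable as mean ± SD and a categorical variable as percentages. The function names and the sample values are made up for the example and are not from any study.

```python
from collections import Counter
from statistics import mean, stdev


def describe_continuous(values):
    """Summarize a continuous variable as 'mean ± SD' (sample SD)."""
    return f"{mean(values):.1f} ± {stdev(values):.1f}"


def describe_categorical(values):
    """Summarize a categorical variable as the percentage in each level."""
    counts = Counter(values)
    n = len(values)
    return {level: round(100 * count / n, 1) for level, count in counts.items()}


# Hypothetical data for illustration only
ages = [50, 60, 70]
gender = ["Male", "Male", "Female", "Male"]

print(describe_continuous(ages))      # 60.0 ± 10.0
print(describe_categorical(gender))   # {'Male': 75.0, 'Female': 25.0}
```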
Ideally, a statistical analysis should include various types of
analyses such as cross-tabulation, linear regression, multivariate
regression analysis and survival analysis. New researchers are
strongly encouraged to include these types of analysis to add
depth to the research. Examples of some of the analyses
mentioned above are given here.

Association Between Two Variables


An association between two variables that seem to be related
can be studied in two ways: cross-tabulation and linear regression.
Cross-tabulation is used when the researcher wants to determine
the association between two categorical variables, while linear
regression is used when the researcher wants to determine the
association between two continuous variables (Tables 15.4 and 15.5).
Table 15.4 is a hypothetical cross-tabulation table. The researcher
would like to study the proportion of patients with various Hb levels
in each creatinine category. Hb has been categorized as
Hb <10, Hb 10–11, Hb 11–12 and Hb >12 g/dL. Patients in the <10
and 10–11 categories are anemic, while patients in the 11–12 and
>12 categories are not anemic.

Table 15.2: Baseline characteristics of patients with chronic kidney
disease (CKD): A hypothetical table

Patient characteristics                      Mean ± SD or %
Age (in years)
Gender
  Male
Race
  Caucasians
  African-American
  Asian
  Other
Insurance
  Private
  Health Maintenance Organization (HMO)
  Medicare
  Medicaid
  None
Comorbidity index
  Zero
  One
  Two
  Three
Cause of CRI
  Diabetes mellitus
  Hypertension
  GN/PKD/IN
  Other
Laboratory values
  Serum creatinine (mg/dL)
  GFR (mL/min/1.73 m²)
  BUN (mg/dL)
  Serum albumin (g/dL)
  Hematocrit (Hct) (%)

Table 15.3: Comparison of characteristics of patients of CKD with and
without anemia: A hypothetical table of descriptive comparative analysis

Variable                      CKD patients       CKD patients       p-value
                              with anemia        without anemia
                              (Mean ± SD or %)   (Mean ± SD or %)
Age (in years)
Gender
  Male
Race
  Caucasians
  African-American
  Asian
  Other
Insurance
  Private
  Health Maintenance Organization (HMO)
  Medicare
  Medicaid
  None
Comorbidity index
  Zero
  One
  Two
  Three
Cause of CRI
  Diabetes mellitus
  Hypertension
  GN/PKD/IN
  Other
Laboratory values
  Serum creatinine (mg/dL)
  GFR (mL/min/1.73 m²)
  BUN (mg/dL)
  Serum albumin (g/dL)
  Hematocrit (Hct) (%)

Table 15.4: The prevalence of various levels of hemoglobin (Hb) at different
serum creatinine levels: A hypothetical table of a cross-tabulation

                  Cr <2    Cr 2–3   Cr 3–4   Cr 4–5   Cr >5
                  mg/dL    mg/dL    mg/dL    mg/dL    mg/dL
Hb >12 g/dL
Hb 11–12 g/dL
Hb 10–11 g/dL
Hb <10 g/dL

Table 15.5: Factors associated with anemia in CKD patients: Multivariate
analysis (A hypothetical table)

Characteristics                               OR (95% CI)    p-value
Age (per 1 year increase)
Male (ref = Females)
Whites (ref = Non-Whites)
Diabetes (ref = No)
Hypertension (ref = No)
GFR (per 1 mL/min increase)
Serum creatinine (per 1 mg/dL increase)
Serum albumin (per 1 g/dL increase)

The creatinine categories are Cr <2, Cr 2–3, Cr 3–4, Cr 4–5 and
Cr >5. Patients in the <2 and 2–3 categories have mild CKD, while
patients in the Cr 4–5 and Cr >5 categories have moderate
to severe CKD.
Figure 15.2 is a cross-tabulation between two variables, hematocrit and
creatinine, published in the American Journal of Kidney Diseases.
The X-axis shows the creatinine categories and the Y-axis shows the
hematocrit categories. It can be seen in Figure 15.2 that the
creatinine category less than 2 (mild CKD) has a high proportion
of patients with good hematocrit values, while in the creatinine
category greater than 5 there are more patients with low hematocrit
values and only a few with good hematocrit values.
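A cross-tabulation of this kind can be sketched as follows. The category labels follow Table 15.4; the handling of boundary values (for example, whether Hb of exactly 11 g/dL falls in 10–11 or 11–12), the function names and the sample patients are assumptions made for illustration.

```python
from collections import Counter

HB_LEVELS = ["Hb <10", "Hb 10-11", "Hb 11-12", "Hb >12"]
CR_LEVELS = ["Cr <2", "Cr 2-3", "Cr 3-4", "Cr 4-5", "Cr >5"]


def hb_category(hb):
    """Assign a hemoglobin value (g/dL) to a category (boundaries assumed)."""
    if hb < 10:
        return "Hb <10"
    if hb < 11:
        return "Hb 10-11"
    if hb <= 12:
        return "Hb 11-12"
    return "Hb >12"


def cr_category(cr):
    """Assign a serum creatinine value (mg/dL) to a category (boundaries assumed)."""
    if cr < 2:
        return "Cr <2"
    if cr < 3:
        return "Cr 2-3"
    if cr < 4:
        return "Cr 3-4"
    if cr <= 5:
        return "Cr 4-5"
    return "Cr >5"


def cross_tabulate(patients):
    """Return, for each creatinine category, the percentage of patients
    in each Hb category. `patients` is a list of (hb, creatinine) pairs."""
    counts = Counter((hb_category(hb), cr_category(cr)) for hb, cr in patients)
    column_totals = Counter()
    for (_, cr_cat), n in counts.items():
        column_totals[cr_cat] += n
    table = {}
    for cr_cat in CR_LEVELS:
        total = column_totals[cr_cat]
        table[cr_cat] = {
            hb_cat: (100 * counts[(hb_cat, cr_cat)] / total if total else 0.0)
            for hb_cat in HB_LEVELS
        }
    return table


# Hypothetical patients: (Hb in g/dL, creatinine in mg/dL)
patients = [(13.0, 1.5), (9.0, 1.5), (9.0, 6.0), (9.5, 6.0)]
table = cross_tabulate(patients)
print(table["Cr <2"])
print(table["Cr >5"])
```

Each column of the result adds up to 100 percent, which matches the layout of the hypothetical table: the distribution of Hb levels within each creatinine category.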

Figure 15.2  Association between hematocrit and creatinine


(Source : Kazmi WH et al. Am J Kidney Dis. 2001;38:803-12)

Note: The hypothetical table for this cross-tabulation was conceived
much before the analysis was carried out and can be seen in
Table 15.4.
In the above example of cross-tabulation, two continuous variables,
hematocrit and creatinine, were first stratified into categories of
creatinine and categories of hematocrit. Two continuous variables
that seem to be associated can also be studied using linear
regression (Figs 15.3A and B). In Figure 15.3A, hematocrit
and GFR have been studied, while in Figure 15.3B, hematocrit
and creatinine have been studied. Figure 15.3A shows a positive
correlation between hematocrit and glomerular filtration rate (GFR):
with higher GFR we see higher hematocrit values
(Fig. 15.4). Figure 15.3B shows a negative correlation between
hematocrit and creatinine: with higher creatinine values we see
lower hematocrit values.
Figures 15.3A and B Relationship of hematocrit to renal function: Linear
regression between hematocrit and creatinine

Figure 15.4 Hypothetical figure showing expected association between
hematocrit and GFR

Figure 15.5 Hypothetical figure looking at the expected association
between hematocrit and creatinine

Note: Figures 15.4 and 15.5 are hypothetical figures of linear regression
analysis between hematocrit and GFR, and between hematocrit and
creatinine. The direction of each plot is based on the anticipation
of the researcher. These hypothetical figures were conceived at the
synopsis stage. At that stage the researcher does not yet have the data,
but has in mind how these two continuous variables should be
associated. The true associations between hematocrit and GFR, and
between hematocrit and creatinine, can be seen in Figures 15.3A and B,
which are from a published study by Kazmi et al.
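For readers who want to see what a linear regression between two continuous variables computes, here is a minimal least-squares sketch. The function name and the data points are invented for illustration and are not taken from the published study.

```python
def linear_regression(x, y):
    """Ordinary least-squares fit of y on x.

    Returns (slope, intercept, r), where r is the Pearson correlation
    coefficient; its sign gives the direction of the association.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / (sxx * syy) ** 0.5
    return slope, intercept, r


# Hypothetical values: creatinine (mg/dL) and hematocrit (%)
creatinine = [1.0, 2.0, 3.0, 4.0, 5.0]
hematocrit = [42.0, 39.0, 35.0, 33.0, 30.0]

slope, intercept, r = linear_regression(creatinine, hematocrit)
print(f"slope={slope:.2f}, intercept={intercept:.2f}, r={r:.2f}")
```

A negative slope and a negative r correspond to the expected downward plot of hematocrit against creatinine, while a regression of hematocrit on GFR would be expected to give positive values.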

Multivariate Regression Analysis


This type of analysis is performed in a study to determine the
risk factors associated with a certain disease/outcome. These
analyses are done in studies with a follow-up (like case control and
cohort studies). The outcome variable (dependent variable) and the
independent variables (risk factors) should be precisely identified.
Table 15.5 is a hypothetical table of a study looking at the factors
associated with the outcome (anemia) among patients with chronic
kidney disease (CKD).
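Table 15.5 reports an odds ratio (OR) with a 95% confidence interval for each factor. A full multivariate model needs logistic regression software, but the basic quantity can be illustrated with an unadjusted odds ratio from a 2 × 2 table. The sketch below uses the standard log-based (Woolf) confidence interval; the function name and the counts are hypothetical.

```python
import math


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Unadjusted odds ratio with an approximate 95% confidence interval.

    a: exposed with outcome      b: exposed without outcome
    c: unexposed with outcome    d: unexposed without outcome
    All four counts must be greater than zero.
    """
    odds_ratio = (a * d) / (b * c)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Hypothetical counts: diabetes (exposure) and anemia (outcome)
or_value, lower, upper = odds_ratio_ci(20, 10, 10, 20)
print(f"OR = {or_value:.2f} (95% CI {lower:.2f} to {upper:.2f})")
```

If the interval excludes 1, the association is statistically significant at the 0.05 level. The adjusted odds ratios in a multivariate table are interpreted the same way, but each is estimated while holding the other variables constant.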
References: All resources used should be referenced appropriately.
The recommended referencing styles are Vancouver and Harvard.
All references should be verified.
Work schedule or timeline: A work schedule is a table that summarizes
the tasks to be performed in the research project and the duration of
each activity (Fig. 15.6).

Figure 15.6  Work schedule and timeline for researcher

Appendix: The appendix must include:
• CV of researchers
• CV of supervisor
• Previously published articles
• Information on institutional affiliations of researchers
• Sample of data collection instrument
• Informed consent form
• Letters of endorsement for the study.
Logistics: This section must include:
• A description of the resources and facilities available for the study
• Any anticipated difficulty
• A brief management plan
• A realistic budget.


CHAPTER 16
Dissertation Writing

A dissertation is a detailed discourse on a subject, especially one
submitted for a higher degree in a university [Oxford].

STEPS IN WRITING A DISSERTATION


Format of Dissertation
Part-1
• Title page
• Supervisor’s certificate
• Dedication
• Acknowledgement
• Table of contents
• List of tables
• List of figures, graphs, illustrations
• List of abbreviations
Part-2 (about 70–100 pages)
• Abstract
• Introduction
• Review of literature
• Objective(s) of study
• Operational definitions
• Hypothesis
• Material and Methods
• Results
• Discussion
• Conclusion(s)
• References (Bibliography)
• Annexures (Proforma, etc.)

TITLE
It should highlight the key features of the study.

TABLE OF CONTENTS
List the headings and subheadings with their page numbers.

TITLE PAGE
It includes the complete title of the manuscript, the names of the authors
with their highest qualifications, the department or institution to
which they are attached, and the address for correspondence with telephone
and fax numbers, if possible.

ABSTRACT
Structured: All original articles should have a structured abstract.
Usually the limit ranges from 150 to 250 words. The abstract should have
the headings objective, study design, settings, subjects, interventions
(if applicable), main outcome measures, results and conclusions.
Keywords: Below the abstract give a few keywords, which should not be
more than ten. These keywords are used in cross-indexing the article
and are usually published with the abstract. Use terms from the Medical
Subject Headings (MeSH) list of Index Medicus, e.g. glomerulonephritis,
paraplegia, infertility. In some cases, MeSH terms are not yet available
for recently introduced terms, and present terms may be used. Keywords
are included with the structured abstract.

INTRODUCTION
It includes:
• Importance of the subject (what is known).
• Limitation of previous studies/gray areas/controversies (what is
unknown).
• Justification of your study/rationale (based on the above aspects
e.g., gaps in knowledge).
• Any special strength of your study.

A collective review and critique of the literature should be written in
the candidate's own words (not copied). References should preferably be
from the last 5 years, although older relevant and historical references
can be used. A review of the
local as well as international literature must be included. Literature
cited must belong to MedLine, ExtraMed or journals approved by the
Pakistan Medical and Dental Council (PMDC).

HYPOTHESIS
It is an expected relationship between the exposure and the outcome.

STUDY OBJECTIVE
Formulate your objective(s) clearly. Remember Quality Thoughts
Precede Quality Results.

SUBJECTS/MATERIAL AND METHODS


Subjects: These are the patients or persons on whom the study was done.
Their age (mean and standard deviation), sex and other relevant
characteristics should be given. The term subject is replaced by the
term material if data is noted down directly from laboratory reports,
devices/machinery or any inanimate object.
Apparatus: This refers to the main device or procedure used to make the
observations. It may be laboratory equipment, a surgical procedure,
a questionnaire or a clinical method; for example, a laboratory
instrument for hemoglobin estimation, a procedure to remove a
stone from the bile duct, a questionnaire developed to assess the effect of
poverty on nutritional status, or clinical criteria to assess the severity
of pain.
Method: This is the procedure of data collection. Mention the study
design, the setting (place) where the study was conducted and the
procedure of data collection.
Mention the study variables, such as predictor variables, outcome
variables, confounding variables, etc.
Mention the names of the statistical tests and the software program applied.

RESULTS
Firstly, the demographic profile is shown (e.g. if the study is done
on human subjects, show the different age groups, common areas of
belonging, gender, educational level, different professional cadres,
etc.). Quantitative variables are presented as mean ± standard
deviation. Qualitative variables are shown as proportions (or %).
For graphic representation, show qualitative variables by using
either bar graphs or pie charts, while for quantitative variables a
histogram is appropriate.
Cross-tabulation could also be done. For any hypothesis-based study
design, variables (exposure with outcome) are cross matched by
applying either the Chi-square test or Student's t-test. The Chi-square
test is applied for qualitative variables, while the t-test is applied
for quantitative variables. The level of significance is usually set at
0.05. Odds ratio and
relative risk (with confidence interval) are calculated if the study
design is case control or cohort, respectively. Further analytical tests
such as correlation, regression and multivariate analysis are applied
where required. (For further detail regarding statistical analysis, see
Chapter 14, Data Analysis Plan, pages 102–111.)
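The two tests named above can be sketched in plain Python. This is a simplified illustration, not a substitute for statistical software: the chi-square function handles only a 2 × 2 table without continuity correction, and the t-test function returns the pooled-variance t statistic and degrees of freedom, leaving the p-value to be read from a t table. The function names and data are hypothetical.

```python
import math
from statistics import mean, variance


def chi_square_2x2(a, b, c, d):
    """Chi-square test for a 2x2 table [[a, b], [c, d]] (1 degree of freedom).

    For 1 degree of freedom the p-value equals erfc(sqrt(chi2 / 2)).
    """
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


def student_t(sample1, sample2):
    """Student's t statistic for two independent samples (equal variances)."""
    n1, n2 = len(sample1), len(sample2)
    pooled_var = ((n1 - 1) * variance(sample1) + (n2 - 1) * variance(sample2)) / (n1 + n2 - 2)
    t = (mean(sample1) - mean(sample2)) / math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return t, n1 + n2 - 2


# Qualitative variables: exposure by outcome (hypothetical counts)
chi2, p = chi_square_2x2(20, 10, 10, 20)
print(f"chi-square = {chi2:.2f}, p = {p:.4f}")

# Quantitative variable compared between two groups (hypothetical values)
t, df = student_t([1, 2, 3], [4, 5, 6])
print(f"t = {t:.2f}, df = {df}")
```

With the level of significance set at 0.05, a p-value below 0.05 means the null hypothesis of no association is rejected.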

DISCUSSION
It should emphasize the salient features of the present findings.
Variations from or similarities with the
results of previous similar studies, both national and international,
should be compared, with references. The detailed data should not be
repeated in the discussion. It must be mentioned whether the hypothesis
in the article was rejected, or could not be rejected. It is important to
remember to discuss only those points that you
have highlighted in the results. The second last paragraph highlights
the limitations of your study. It is a good idea to mention your
limitations before they are pointed out to you by the reviewer. The
conclusions of your study must be based on what you have observed
in your results.

OPTIONAL COMPONENTS
These are added only where applicable. They are as follows:
Acknowledgement: If desired, it should be included after the
discussion and before the references.
Letter of undertaking: A letter signed by the main author must accompany
all manuscripts.

A sample letter of undertaking is as follows:
This is to confirm that the original/review article/case report
titled ------ written by ------ submitted for publication has not been
published in any other journal and, if accepted for publication in
the requested journal, will not be published in any other medical
journal in Pakistan or overseas.

REFERENCES
Citations in the text should be numbered serially. List
the references in Vancouver style.

ANNEXES
Annexes should be added if they increase the understanding or evaluation
of the study. All annexures should be serially numbered and referred
to at appropriate places in the body of the dissertation.

THE WHOLE MANUSCRIPT/DISSERTATION SHOULD BE IN PAST TENSE

SAMPLE OF TITLE PAGE


Cost incurred on Directly Observed Therapy-Short Course (in terms
of time and money) by Tuberculous patients.

Dr XYZ
FCPS Student
(2008-2009)

Supervisor:

Dr ABC

Institute

Department
Name of Institution

Supervisor’s Certificate (Sample)

I hereby certify that Dr. ______________________ having Enrolment
Number: ________________________ and RTMC Registration
Number: ______________ has been working under my direct
supervision with effect from (date): ____________________________
to (date) __________ in the Department: _________________________
Unit: ________________________________________________________
of Training institution:________________________________________
in the city of: _________________________________ The enclosed
Dissertation titled:_________________________________________
____________ was prepared according to the “FCPS Dissertation—
Guidelines” under my direct supervision. I have read the Dissertation
and have found it satisfactory for FCPS part II examination in the
subject.

Signature of the Supervisor: ___________________________________


Name of the Supervisor: ______________________________________
Designation: ______________________ Date: _____________________

Official stamp:


CHAPTER 17
Reference Writing

Reference writing is a standardized method of acknowledging
sources of information and ideas used in a research article, synopsis,
dissertation, assignment, etc. in a way that uniquely identifies their
source. Direct quotations, facts and figures, as well as ideas and
theories, from both published and unpublished works, must be
referenced.
There are many acceptable forms of quoting references. This
chapter provides a brief guide to the Vancouver style
of reference writing only. The Vancouver style is
predominantly used in the medical field. It was
first published by the Vancouver group, which expanded from time
to time and evolved into the International Committee of Medical
Journal Editors (ICMJE).
It is very important to use the right punctuation and order
of details in the reference. In this style, the journal titles used in the
references are abbreviated from an authoritative list.
A reference list at the end of the chapter contains the full details of
all the in-text citations. References are necessary to avoid plagiarism,
to verify quotations, and to enable readers to follow up and read
the cited authors' arguments in more detail.

CITING A JOURNAL ARTICLE


Name(s) of Author(s)
• Name(s) of author(s) of the article
– Where there are six or fewer authors, list all authors.
– Where there are more than six authors, list only the first six
and add “et al.” (et al. means “and others”).
– Put a comma and 1 space between each name. The last author
must have a full-stop after his or her initial(s).
• Format name(s) of author(s): Surname (1 space) initial(s) (no
spaces or punctuation between initials) (full-stop
OR if further names comma, 1 space).
– Example: Halpern SD, Ubel PA, Caplan AL. Solid-organ
transplantation in HIV-infected patients. N Engl J Med.
2002;347(4):284-7.
As an option, if a journal carries continuous pagination
throughout a volume (as many medical journals do), the month
and issue number may be omitted.
– Example: Halpern SD, Ubel PA, Caplan AL. Solid-organ
transplantation in HIV-infected patients. N Engl J Med.
2002;347:284-7.
• More than six authors
– Example: Rose ME, Huerbin MB, Melick J, Marion DW, Palmer
AM, Schiding JK, et al. Regulation of interstitial excitatory
amino acid concentrations after cortical contusion injury.
Brain Res. 2002;935(1-2):40-6.
• Organization as author
– Example: Diabetes Prevention Program Research Group.
Hypertension, insulin, and proinsulin in participants with
impaired glucose tolerance. Hypertension. 2002;40(5):679-86.
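The author and punctuation rules above can be expressed as a small formatting routine. The sketch below is only an illustration with hypothetical function names; it follows the punctuation seen in the worked examples (a full-stop after the abbreviated journal title, no spaces around the semicolon and colon) and does not cover every reference type.

```python
def format_authors(authors, max_listed=6):
    """List up to six authors; if there are more, add 'et al.' after the sixth."""
    listed = list(authors[:max_listed])
    if len(authors) > max_listed:
        listed.append("et al")
    return ", ".join(listed) + "."


def format_journal_reference(authors, title, journal, year, volume, pages, issue=None):
    """Build a Vancouver-style journal reference from its parts."""
    if not title.endswith((".", "?", "!")):
        title += "."
    issue_part = f"({issue})" if issue else ""
    return f"{format_authors(authors)} {title} {journal}. {year};{volume}{issue_part}:{pages}."


reference = format_journal_reference(
    authors=["Halpern SD", "Ubel PA", "Caplan AL"],
    title="Solid-organ transplantation in HIV-infected patients",
    journal="N Engl J Med",
    year=2002,
    volume=347,
    issue=4,
    pages="284-7",
)
print(reference)
```

Leaving out the `issue` argument gives the shorter form that is allowed when a journal has continuous pagination throughout a volume.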

TITLE OF JOURNAL ARTICLE


• Do not use italics or underlining.
• Only the first word of the article title (and words that normally
begin with a capital letter) is capitalized.
• Format of journal article: Title (full-stop, 1 space).
– Example: Clinical results in pediatric cochlear implantation.
• Format subtitle of publication: Title (colon, 1 space) subtitle.
– Example: Cochlear implantation after meningitis: Does the
post-meningitic deafness etiology influence worse speech
rehabilitation progress?

JOURNAL’S TITLE
• Title of journal (abbreviated)
– Abbreviate the title according to the style used in Medline. A list of
abbreviations can be found at: http://www.nlm.nih.gov
– Note: No punctuation is used within the abbreviated journal name.
• Format title of journal: Journal title abbreviation (1 space).
– Example: J Coll Physicians Surg Pak
– Example: Ann King Edward Med Coll

Year (and Month/Day, if Necessary) of Publication


• Abbreviate the month to the first 3 letters.
• If the journal has continuous page numbering through the volume,
the month/day and issue information can be omitted.
• Format year of publication: Year (1 space) month (1 space) day
(semi-colon, no space) OR year (semi-colon, no space).
– Example: 2003 Sep;

Volume Number
• If the journal has continuous page numbering through volume,
the month/day and issue information can be omitted.
• Format volume of publication: Volume number (no space) issue
number in brackets (colon, no space) OR volume number (colon,
no space).
–– Example: 4(3):

Page Numbers
• Format of page number: Page numbers (full-stop).
– Example: 122-9.
– Example: 1129-57.

CITING A BOOK REFERENCE


• Name(s) of author(s), editor(s), compiler(s) or the institution
responsible
– Where there are six or fewer authors, list all authors.
– Where there are seven or more authors, list only the first six
and add “et al.” (et al. means “and others”).
– Put a comma and 1 space between each name. The last author
must have a full-stop after their initial(s).
• Format of author(s): Surname (1 space) initial(s) (no spaces or
punctuation between initials) (full-stop OR if further names
comma, 1 space).
• Title of publication and subtitle, if any
– Do not use italics or underlining.
– Only the first word of a journal article or book title (and words
that normally begin with a capital letter) is capitalized.
• Format title of publication: Title (full-stop, 1 space).
– Example: Harrison’s principles of internal medicine.
• Format subtitle of publication: Title (colon, 1 space) subtitle.
– Example: Physical pharmacy: Physical chemical principles in
the pharmaceutical.
• Edition, if other than first edition
– Abbreviate the word edition to “edn” (do not confuse with editor).
• Format of edition: Edition statement (full-stop, 1 space).
– Example: 3rd edn.
• Place of publication
– If the publishers are located in more than one city, cite the
name of the city that is printed first.
– Write the place name in full.
– If the place name is not well known, add a comma, 1 space
and the state or the country for clarification. For places in the
USA, add after the place name the 2 letter postal code for the
state. This must be in upper case, e.g. Hartford, CT (where
CT = Connecticut).
• Format place of publication: Place of publication (colon, 1 space).
– Example: New York:
• Publisher
– The publisher’s name should be spelt out in full.
• Format name of publisher: Publisher (semicolon, 1 space).
– Example: Williams and Wilkins;
• Year of publication
• Format year of publication: Year (full-stop, add 1 space if page
numbers follow).
– Example: 1999.
– Example: 2000. pp. 12-5.
• Page numbers (if applicable)
– Abbreviate the word page to “p.” (pages to “pp.”).
– Note: Do not repeat digits unnecessarily; abbreviate the last page
number.
• Format of page number: pp (full-stop, 1 space) page numbers
(full-stop).
– Example: pp. 122-9.
– Example: pp. 1129-57.
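The rule of not repeating digits unnecessarily in page ranges can be sketched as a short function. The function name is hypothetical, and leaving the range unabbreviated when the two page numbers have different lengths is an assumption made for the sketch.

```python
def abbreviate_page_range(first, last):
    """Drop the leading digits of the last page that repeat the first page.

    Example: pages 122 to 129 become '122-9'.
    """
    first_s, last_s = str(first), str(last)
    if first == last:
        return first_s
    if len(first_s) != len(last_s):
        return f"{first_s}-{last_s}"
    i = 0
    # Skip shared leading digits, but always keep at least one digit
    while i < len(first_s) - 1 and first_s[i] == last_s[i]:
        i += 1
    return f"{first_s}-{last_s[i:]}"


print(abbreviate_page_range(122, 129))    # 122-9
print(abbreviate_page_range(1129, 1157))  # 1129-57
print(abbreviate_page_range(12, 15))      # 12-5
```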

OTHER AUTHORS
• More than six authors: Give the first six names in full and add “et
al.” The authors are listed in the order in which they appear on the
title page.
• Editor(s): Follow the same methods used with authors but use the
word “editor” or “editors” in full after the name(s). The word editor
or editors must be in lower case. (Do not confuse with “edn” used
for edition).
– Example: Millares M, editor. Applied drug information:
strategies for information management. Vancouver, WA:
Applied Therapeutics, Inc.; 1998.
• Sponsored by institution, corporation or other organization
(including pamphlet)
– Example: Australian Pharmaceutical Advisory Council.
Integrated best practice model for medication management
in residential aged care facilities. Canberra: Australian
Government Publishing Service; 1997.
• Chapter or part of a book to which a number of authors have
contributed
• Format of book chapter: Author(s)/editor(s) of chapter. Title
of chapter. In: author(s)/editor(s) of book. Title of book. City of
publication (State or country of publication): Publisher; year.
pages of book chapter.
– Example: Porter RJ, Meldrum BS. Antiepileptic drugs. In:
Katzung BG, editor. Basic and clinical pharmacology. Norwalk,
CT: Appleton and Lange; 1995. pp. 361-80.

DISSERTATION REFERENCE
Example: Borkowski MM. Infant sleep and feeding: a telephone
survey of Hispanic Americans [dissertation]. Mount Pleasant (MI):
Central Michigan University; 2002.

CITING INTERNET AND OTHER ELECTRONIC SOURCES
This includes software and Internet sources such as websites,
electronic journals and databases. These sources are proliferating
and the guidelines for citation are developing and subject to change.
The following information is based on the recommendations of the
National Library of Medicine.

Journal on the Internet


• Format: Author(s) (full-stop after last author, 1 space) Title of
article (full-stop, 1 space) Abbreviated title of electronic journal
(1 space) [serial on the Internet] Publication year (month if
applicable) [cited year month date] (full-stop, 1 space) Volume
number (no space) (Issue number in round brackets if applicable)
(colon, no space) [Page number in square brackets] (full-stop, 1
space) Available from (colon, 1 space) URL address.
– Example: Abood S. Quality improvement initiative in nursing
homes: the ANA acts in an advisory role. Am J Nurs [serial on
the Internet]. 2002 Jun [cited 2002 Aug 12]; 102(6):[about 3
p.]. Available from:
http://www.nursingworld.org/AJN/2002/june/Wawatch.htm
(If the author is not documented, the title becomes the first
element of the reference).
• Format (homepage or website): Organization name (1 space)
[homepage on the Internet]
(full-stop, 1 space) place of publication (colon, 1 space) publisher
of the website (semicolon) published year (1 space) [updated
year month date; cited year month date]. Available from (colon, 1
space) URL address.
– Example: Cancer-Pain.org [homepage on the Internet]. New
York: Association of Cancer Online Resources, Inc.; 2000-01
[updated 2002 May 16; cited 2002 Jul 9]. Available from:
http://www.cancer-pain.org/.
In the Vancouver style, a consecutive number is allocated
to each reference as it is cited for the first time in the text of
the assignment. This number becomes the unique identifier of
that source, and if the source is cited again the same number is
repeated. Numbers are inserted to the right of commas and
full-stops, and to the left of colons and semicolons. Multiple sources
can be listed at a single reference point. The numbers are then
separated by commas, and consecutive numbers are joined
with a hyphen, like 2–7. Vancouver uses superscript numbers,
or standard numbers in brackets, in the text, e.g. 1–4,10,12 or
(1–4,10,12). Superscript numbers are preferred in
the text.
The references are listed at the end of your dissertation and
synopsis in the same numerical order as cited in the text.
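The numbering rules above can be sketched in Python. Both function names are hypothetical, and joining any run of two or more consecutive numbers with a plain hyphen is an assumption; some journals hyphenate only runs of three or more.

```python
def number_citations(cited_keys):
    """Give each source a consecutive number at its first citation.

    A source that is cited again keeps the number it already has.
    """
    numbers = {}
    for key in cited_keys:
        numbers.setdefault(key, len(numbers) + 1)
    return numbers


def compress_citations(numbers):
    """Format citation numbers for one reference point, e.g. '1-4,10,12'."""
    nums = sorted(set(numbers))
    if not nums:
        return ""
    parts = []
    start = prev = nums[0]
    for n in nums[1:]:
        if n == prev + 1:
            prev = n
            continue
        parts.append(str(start) if start == prev else f"{start}-{prev}")
        start = prev = n
    parts.append(str(start) if start == prev else f"{start}-{prev}")
    return ",".join(parts)


print(number_citations(["smith", "jones", "smith", "khan"]))
print(compress_citations([1, 2, 3, 4, 10, 12]))   # 1-4,10,12
```

The dictionary returned by `number_citations` also gives the order of the reference list, since references are listed in the same numerical order as they are first cited.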

CHAPTER 18
Guidelines for Consent Writing

Informed consent has been recognized as an important component
of research protocols. Procedures of disclosure and consent in
collaborative research have been criticized, as they may not be in
keeping with the cultural norms of developing countries.
The Nuremberg Doctors’ Trial (the so-called “Medical Case”)
following World War II heightened international concern about the
ethical issues surrounding human experimentation. These
proceedings judged medical experiments conducted by the Nazis on
prisoners of concentration camps. In 1947, the Nuremberg Code,
the first international code of ethics for research involving human
subjects, was issued. The Nuremberg Code emphasized a strong
commitment to the informed and voluntary consent of research
participants. The World Medical Association’s Declaration of
Helsinki, adopted in 1964 and revised several times since,
reiterated concerns for voluntary and informed consent for research.
In 1982, the Council for International Organizations of Medical
Sciences (CIOMS) and the World Health Organization (WHO) published
Proposed International Guidelines for Biomedical Research. These
guidelines were developed in response to concerns raised about
the particular circumstances surrounding the implementation of
scientific research in developing countries.

GENERAL ETHICAL PRINCIPLES


All research involving human subjects should be conducted in
accordance with three basic ethical principles, namely respect for
persons, beneficence and justice. It is generally agreed that these
principles, which in the abstract have equal moral force, guide the
conscientious preparation of proposals for scientific studies. In
varying circumstances they may be expressed differently and given
different moral weight, and their application may lead to different
decisions or courses of action. The present guidelines are directed
at the application of these principles to research involving human
subjects. 
• Respect for persons incorporates at least two fundamental ethical
considerations, namely:
1. Respect for autonomy, which requires that those who are
capable of deliberation about their personal choices should be
treated with respect for their capacity for self-determination.
2. Protection of persons with impaired or diminished autonomy
(vulnerable groups, e.g. children/minors, subjects with
psychiatric illness, etc.), which requires that those who are
dependent or vulnerable be afforded security against harm or
abuse.
• Beneficence refers to the ethical obligation to maximize benefits
and to minimize harms. This principle gives rise to norms
requiring that the risks of research be reasonable in the light of the
expected benefits, that the research design should be sound, and
that the investigators must be competent to conduct the research
and to safeguard the welfare of the research subjects. Beneficence
further proscribes the deliberate infliction of harm on persons;
this aspect of beneficence is sometimes expressed as a separate
principle, nonmaleficence (do no harm).
• Justice refers to the ethical obligation to treat each person in
accordance with what is morally right and proper, to give each
person what is due to him or her. In the ethics of research involving
human subjects the principle refers primarily to distributive
justice, which requires the equitable distribution of both the
burdens and the benefits of participation in research. Differences
in distribution of burdens and benefits are justifiable only if they
are based on morally relevant distinctions between persons;
one such distinction is vulnerability. “Vulnerability” refers to a
substantial incapacity to protect one’s own interests owing to
such impediments as lack of capability to give informed consent,
lack of alternative means of obtaining medical care or other
expensive necessities, or being a junior or subordinate member
of a hierarchical group. Accordingly, special provision must be
made for the protection of the rights and welfare of vulnerable
persons.
Sponsors of research or investigators cannot, in general, be held
accountable for unjust conditions where the research is conducted,
but they must refrain from practices that are likely to worsen unjust
conditions or contribute to new inequities. Neither should they
take advantage of the relative inability of low-resource countries or
vulnerable populations to protect their own interests, by conducting
research inexpensively and avoiding complex regulatory systems of
industrialized countries in order to develop products for the lucrative
markets of those countries.
In general, the research project should leave low-resource
countries or communities better off than previously or, at least, no
worse off. It should be responsive to their health needs and priorities
in that any product developed is made reasonably available to them,
and as far as possible leave the population in a better position to
obtain effective healthcare and protect its own health.
Justice requires also that the research be responsive to the health
conditions or needs of vulnerable subjects. The subjects selected
should be the least vulnerable necessary to accomplish the purposes
of the research. Risk to vulnerable subjects is most easily justified
when it arises from interventions or procedures that hold out for
them the prospect of direct health-related benefit. Risk that does
not hold out such prospect must be justified by the anticipated
benefit to the population of which the individual research subject is
representative.

GUIDELINES FOR DRAFTING AN INFORMED CONSENT FORM
The following guidelines are intended to help researchers draft a
proper, acceptable consent form.
• All studies involving human subjects should have a properly
drafted consent form. No study should be done on human subjects
without obtaining informed consent. Consent should be obtained
sufficiently before the start of the study, at an appropriate time,
and not when the subject is under stress (for example, around a
surgical procedure) and unable to understand the study.
• Consent may be written, verbal or telephonic. In the case of
unwritten consent, the consent record should be signed by the
person taking consent and witnessed by a second person.
• In the case of children, an assent form from the child and consent
from the parents/guardian are needed.
• In the case of a mentally or physically incapacitated subject, consent
should be obtained from the immediate guardian or a relative, such as
wife or husband, father or mother, brother or sister, etc.
• In the case of community studies, community leaders, elders,
local political leaders, religious leaders (in certain cases), and
governmental officials should be taken into confidence, and
written consent should be obtained.
• When a study is done in other locations, such as other hospitals
and clinics, permission from the appropriate authority or physicians
should also be obtained.
• The consent form should be in English, Urdu or another local
language if needed. The versions should be equivalent, so that
translating one into another gives the same meaning. The language
should be simple enough to be understood by study subjects
(including those with no schooling or only primary education). Use
of technical terms should be avoided.
• A properly drafted consent form should contain the following
important points:
– Information sheet. There should be one paragraph or page
giving information about the nature of the study, its purpose
and need, possible benefits of the study, and procedures to be
carried out on the study subjects.
– Possible risks and benefits to the study subjects.
– Availability of alternate treatment in case of therapeutic trials.
– Voluntary participation without any compulsion, moral or
otherwise, and without any financial incentive or coercion.
However, reimbursement for time and travel may be provided
to study subjects; it should be commensurate with the time
spent and should not be too high.
– Right to withdraw from the study at any time without affecting
their rights and treatment.
– Confidentiality.
– If any specimen is to be stored, its time of storage and
permission to use it in further research.
– Name and contact number of the investigator in case the study
subject wants further clarification or information about the study.
– Authorization from study subjects with their signature, thumb
impression, signature of witness, etc.

IMPORTANT NOTES
• Studies should not be done at the patient’s expense.
• If any new or additional tests are to be done as a requirement of
the study, their cost should be borne by the study.
• If a new treatment is compared with an existing, established one,
or two treatment modalities are being evaluated and compared, the
cost of treatment or the difference in cost of treatment should be
borne by the study. In addition, the cost of managing any expected
or unexpected complication arising as a result of the new treatment
should also be borne by the study.
• Studies that are unlikely to produce any significant results
because of faulty design are often considered unethical, as
such studies waste time and resources. These should
be avoided unless there is a strong justification.


CHAPTER 19
Consent to Participate in Research (Sample)

TITLE OR PARAPHRASED TITLE OF THE STUDY

You are asked to participate in a research study conducted by [names
of PI (and faculty sponsor if the PI is a student)], from the [departmental
affiliation] at Michigan Technological University. [If student, indicate
whether the study is being conducted as part of an undergraduate project,
graduate student project, thesis, or dissertation.] Your participation
in this study is entirely voluntary. Please read the information below
and ask questions about anything you do not understand, before
deciding whether or not to participate.

Optional: You have been asked to participate in this study because
[explain succinctly and simply why the prospective subject is eligible
to participate]. If appropriate, state the approximate number of
subjects involved in the study. State whether there are inclusion
or exclusion criteria for participation (e.g. medical conditions that
would include or exclude a person).

PURPOSE OF THE STUDY
Briefly state what the study is designed to examine, assess, or
establish.

PROCEDURES
If you volunteer to participate in this study, you will be asked to do
the following things:
Describe the procedures chronologically using simple language,
short sentences, and short paragraphs. If there are several procedures
or if they are complex, then use of subheadings may help organize
this section and increase readability.
Define and explain scientific or discipline-specific terms. Use
language appropriate to the population.
If applicable, specify the subject’s assignment to study groups,
length of time for participation in each procedure or study activity,
the total length of time for participation, frequency of procedures
and location of the procedures to be done.
If subjects will be recorded (audiotaped, videotaped, or digitally
recorded), describe the procedures to be used.
If any study procedures are experimental, clearly identify which
ones.

POTENTIAL RISKS AND DISCOMFORTS


Describe any reasonably foreseeable risks or discomforts, including
physical inconveniences and their likelihood, and explain how these
will be managed. In addition to physiological risks/discomforts,
describe any reasonably foreseeable psychological, social, legal, or
financial risks or harms that might result from participating in the
research.
If there are circumstances in which the researcher may terminate
the study, describe them. (This refers to situations in which the study
itself may be terminated. It is not the same thing as circumstances
in which a specific subject may be withdrawn; this issue is to be
discussed below, if relevant.)
In the event of physical and/or mental injury resulting from
participation in this research project, Michigan Technological
University does not provide any medical, hospitalization or other
insurance for participants in this research study, nor will Michigan
Technological University provide any medical treatment or
compensation for any injury sustained as a result of participation in
this research study, except as required by law.

POTENTIAL BENEFITS TO SUBJECTS AND/OR TO SOCIETY
Describe benefits to subjects expected from the research. If the
subject will not benefit directly from participation, clearly state this
fact.
State the potential benefits, if any, to science or society expected
from the research.
Note: Payment or other compensation for participation (e.g. a gift
certificate, extra credit) is not a benefit and is not to be discussed in
this section.

For Biomedical Studies Only: Include the Following Paragraph, if Relevant
Based on experience with this [drug, procedure, device, etc.] in
[animals, patients with similar disorders], researchers believe it may
be of benefit to subjects with your condition [or, it may be as good
as standard therapy but with fewer side effects]. Of course, because
individuals respond differently to therapy, no one can know in
advance if it will be helpful in your particular case. The potential
benefits may include: [describe the anticipated benefits to subjects
resulting from their participation in the research].
If there is no likelihood that participants will benefit directly from
their participation in the research, state this in clear terms. For example:
“You should not expect your condition to improve as a result of
participating in this research” or “This study is not being conducted
to improve your condition or health. You have the right to refuse to
participate in this study.”

Payment for Participation (Optional)
State whether the subject will receive payment. If not, delete
this section. If the subject will receive compensation, describe the type
and amount, when compensation (e.g. money, extra credit, gift
certificate) is scheduled, and the proration schedule, if any, should
the subject decide to withdraw or be withdrawn by the investigator.

Confidentiality
Any information that is obtained in connection with this study
and that can be identified with you will remain confidential and
will be disclosed only with your permission or as required by law.
Confidentiality will be maintained by means of [describe coding
procedures and plans to safeguard data, including where data will
be kept, who will have access to it, etc.].
If information will be released to any other party for any reason,
then state the person or agency to whom the information will
be furnished, the nature of the information, the purpose of the
disclosure, and the conditions under which it will be released.
If activities are to be audio- or videotaped or digitally recorded,
describe who will have access, whether the tapes/files will be used for
educational purposes, and when they will be erased or destroyed.

Participation and Withdrawal


You can choose whether or not to be in this study. If you volunteer to
be in this study, you may withdraw at any time without consequences
of any kind or loss of benefits to which you are otherwise entitled.
You may also refuse to answer any questions you do not want to
answer. There is no penalty if you withdraw from the study, and you
will not lose any benefits to which you are otherwise entitled.

Include the Following Paragraph in this Section Only if Relevant
The investigator may withdraw you from this research if
circumstances arise which warrant doing so. Describe the
anticipated circumstances under which the subject’s participation
may be terminated by the investigator without regard to the subject’s
consent.

FOR BIOMEDICAL STUDIES ONLY, ADD THE FOLLOWING SECTION HERE
Alternatives to Participation (If Applicable)
Describe any appropriate alternative therapeutic, diagnostic, or
preventive procedures that should be considered before the subjects
decide whether to participate in the study. If applicable, explain
why these procedures are being withheld. If there are no efficacious
alternatives, state that an alternative is not to participate in the study.

IDENTIFICATION OF INVESTIGATORS
If you have any questions or concerns about this research, please
contact [identify research personnel: Principal Investigator, Faculty
Sponsor (if a student is the PI), Co-Investigator(s), if any. Include
daytime phone numbers, addresses, and email addresses for all listed
individuals. For some studies of greater than minimal risk, it may be
necessary to include night/emergency phone numbers].

RIGHTS OF RESEARCH SUBJECTS

The Michigan Tech Institutional Review Board has reviewed my
request to conduct this project. If you have any concerns about your
rights in this study, please contact Joanne Polzien of the Michigan
Tech-IRB at 906-487-2902 or email jpolzien@mtu.edu.

I understand the procedures described above. My questions have
been answered to my satisfaction, and I agree to participate in this
study. I have been given a copy of this form.

________________________________________
Printed Name of Subject

________________________________________    ____________________
Signature of Subject                        Date

________________________________________    ____________________
Signature of Witness                        Date

BIBLIOGRAPHY
1. www.uoguelph.ca/research/forms/.../sample%20consent%20form.doc
