A Thesis
Presented to the
Faculty of the Department of Mathematics
College of Arts and Sciences
Caraga State University
Butuan City
In Partial Fulfillment
of the Requirements for the Degree
Bachelor of Science in Mathematics
(Applied Statistics)
Marchan A. Saga
February 2016
ABSTRACT
As usually aimed at in any estimation problem, researchers are keenly interested in the reliability of estimates. As far as the theory is concerned, confidence interval estimates are expressed as functions of tabular values of either the t-distribution or the z-distribution. These distributions in turn rest on the assumption of normality of the data. Thus, the reliability of estimates depends absolutely on this necessary requirement. A common fault in many analyses is committed when this important assumption is ignored. When this happens, the uncertainty of inferences about the population parameter becomes higher than what is pre-defined (the level of significance). In this way, decision making is misled.

This study investigates the reliability of confidence interval estimates via a computer program (written in the C programming language) for both normal and non-normal data taken from a normal and a non-normal population, respectively, via Simple Random Sampling Without Replacement (SRSWOR). Confidence interval reliability is also observed as the sample size and the level of significance vary.

Results showed that when the normality assumption is ignored, confidence interval estimates become less reliable. This worsens as the specified level of significance gets smaller. However, it is further revealed that increasing the sample size contributes a gain in reliability.
TABLE OF CONTENTS

Title Page
Approval Sheet  ii
Abstract  iii
Acknowledgement  iv
Table of Contents
I. Introduction
1.1
1.2
1.3
II. Basic Concepts and Preliminaries  4
2.1 Basic Concepts
2.2 Preliminaries  10
III. Methodology  12
3.1 Algorithm  12
3.2 Flowchart  13
IV. Results and Discussions  14
4.1 Results  14
V. Summary, Conclusion and Recommendation  19
5.1  19
5.2 Recommendations  20
Bibliography  21
Annexes  23
CHAPTER 1
Introduction
cost of data collection, time to accomplish the project, and even human resources. This problem has been addressed through the development of both sampling and survey theory, where estimation is an inevitable element. Estimation has a big role in the field of statistics: one of the major applications of statistics is estimating population parameters from sample statistics. In conducting a research study, there is nothing wrong with covering all the respondents; however, as mentioned, it would be costly and time consuming. These consequences are resolved through estimation. As an important part of statistics, estimation theory aims to search for an accurate and efficient estimate of a given parameter of interest. These estimates can be in point or interval form. A point estimate is a single value computed from a random sample which represents a plausible value of the parameter; it pinpoints a location, or a point, in the distribution of possible values of the random variable [8]. On the other hand, an interval estimate is a range of values computed from a random sample which represents an interval of plausible values for the unknown value of the population parameter. When some measure of certainty or confidence is attached to the interval estimate, the interval is called a confidence interval estimate. The confidence interval is a fundamental technique in statistical inference and a widely used method of inference [13]. The interpretation of a confidence interval derives from the sampling process that generates the sample from which the confidence interval is calculated. It provides a measure of how confident the researcher is in stating that the interval estimate obtained from the random sample contains the true value of the parameter. Equivalently, if a 95% confidence interval is constructed, then, in the long run, 95% of the intervals constructed in a similar manner will contain the true value of the parameter [8].

As usually aimed at in any estimation problem, researchers are keenly interested in the reliability of estimates. As far as the theory is concerned, confidence interval estimates are expressed as functions of tabular values of either the t-distribution or the z-distribution. These distributions in turn rest on the assumption of normality of the data. Thus, the reliability of estimates depends absolutely on these necessary requirements or assumptions. A common fault in many analyses is committed when these important assumptions are ignored. When this happens, the uncertainty of inferences about the population parameter becomes higher than what is pre-defined (the level of significance). In this way, decision making is misled.
1.2
CHAPTER 2
Basic Concepts And Preliminaries
This chapter presents the basic concepts and terminologies that will be used in the succeeding chapters. Definitions and theorems are taken from [12], [10] and [8].
2.1 Basic Concepts
If $X_1, X_2, \ldots, X_N$ denote the $N$ values of a finite population, the population mean is

$$\mu = \frac{1}{N}\sum_{i=1}^{N} X_i.$$

The population mean describes a characteristic of the population and is thus a parameter.
Example 2.1.6 The numbers of employees at 5 different drugstores are 3, 5, 6, 4 and 6. Treating the data as a population, find the mean number of employees for the 5 stores.

Solution. Since the data are considered to be a finite population,

$$\mu = \frac{3 + 5 + 6 + 4 + 6}{5} = 4.8.$$
Similarly, the population variance of a finite population is

$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (X_i - \mu)^2,$$

and the sample mean of a sample of size $n$ is

$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i.$$
Example 2.1.11 A food inspector examined a random sample of 7 cans of
a certain brand of tuna to determine the percent of foreign impurities. The
following data were recorded: 1.8, 2.1, 1.7, 1.6, 0.9, 2.7 and 1.8. Compute the
sample mean.
The sample standard deviation is

$$s = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}}.$$

For instance, for a sample of nine observations whose squared deviations from the sample mean sum to $(5)^2 + (4)^2 + \cdots + (4)^2 + (7)^2 = 124$,

$$s = \sqrt{\frac{124}{9 - 1}} = \sqrt{\frac{124}{8}} \approx 3.94.$$
Definition 2.1.14 Sampling is the process of selecting a sample from a universe or a population.
Definition 2.1.15 Probability Sampling is sampling where samples are obtained using some objective chance mechanism, thus involving randomization. It requires the use of a complete listing of the elements of the universe, called the sampling frame. Simple random sampling is one example of probability sampling. There are two types of simple random sampling: simple random sampling without replacement (SRSWOR) does not allow repetitions of selected units in the sample, while simple random sampling with replacement (SRSWR) allows repetitions of selected units in the sample.

Definition 2.1.16 Non-Probability Sampling is sampling where samples are obtained haphazardly, selected purposively, or taken as volunteers. The probabilities of selection are unknown.
9
Definition 2.1.20 The numerical values of the test statistic for which the
null hypothesis will be rejected. The value of is usually chosen to be small
(e.g., 0.01, 0.05, 0.10) and is reffered to as level of significance of the test.
which reduces to
9.4 < < 10.26.
10
2.2 Preliminaries
The next theorem is stated in [12].
$$t = \frac{\bar{X} - \mu}{s/\sqrt{n}}$$
Theorem 2.2.2 If all possible random samples of size $n$ are drawn without replacement from a finite population of size $N$ with mean $\mu$ and standard deviation $\sigma$, then the sampling distribution of the sample mean $\bar{X}$ will be approximately normally distributed with mean and standard deviation given by

$$\mu_{\bar{X}} = \mu, \qquad \sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}\sqrt{\frac{N - n}{N - 1}}.$$
Theorem 2.2.3 (Central Limit Theorem). If random samples of size $n$ are drawn from a large or infinite population with mean $\mu$ and variance $\sigma^2$, then the sampling distribution of the sample mean $\bar{X}$ is approximately normally distributed with mean $\mu_{\bar{X}} = \mu$ and standard deviation $\sigma_{\bar{X}} = \sigma/\sqrt{n}$. Hence,

$$z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}}.$$
The sample size required to attain a margin of error $e$ is

$$n = \left(\frac{z_{\alpha/2}\,\sigma}{e}\right)^2.$$
Theorem 2.2.5 is the core of this study. The researcher will validate whether the theorem really holds given the assumptions above. Moreover, $\bar{X}$ and $s^2$ computed from non-normal data will also be part of the validation.
CHAPTER 3
Methodology
In this chapter, the methodology by which the researcher obtained the results is presented. The researcher generated normal data using the R programming language (free software) and took non-normal data of size 20 from the examination test results of students. The researcher then used the C programming language (free software) to exhaust all combinations of sample sizes 5, 10 and 15 from the population of size 20.
3.1 Algorithm
1st Step: Set the level of significance $\alpha$, the sample size $n$, a counter $x$, the population mean $\mu$, and the number of possible samples ${}_N C_n$.

2nd Step: For each of the ${}_N C_n$ possible samples, compute $\bar{X}$ and $s$ and construct the confidence interval

$$\left[\bar{X} - t_{\alpha/2,(n-1)}\frac{s}{\sqrt{n}}\sqrt{\frac{N - n}{N}},\; \bar{X} + t_{\alpha/2,(n-1)}\frac{s}{\sqrt{n}}\sqrt{\frac{N - n}{N}}\right].$$

3rd Step: Count in $x$ the number of intervals that contain $\mu$. Then the following proportion gives the percent reliability:

$$P = \frac{x}{{}_N C_n}.$$
3.2 Flowchart
CHAPTER 4
Results And Discussions
4.1 Results
Figure 4.1: Histogram of a Non-normal Data.
The figure above shows the histogram of the non-normal data taken from the examination test results of the students. Observe that the histogram is skewed to the left; from that observation, it is visually evident that the data set is not normally distributed.
Figure 4.2: Histogram of a Normal Data.
The figure above shows the histogram of the normal data generated in the R programming language (free software). Observe that the histogram is roughly bell-shaped; that is, it is visually evident that the data are normally distributed.
Table 1: Simulation results on the proportion of $(1 - \alpha) \times 100\%$ confidence intervals containing the population parameter across normal and non-normal data, varying levels of significance and increasing sample sizes.
Table 1 above shows the main result of the study. For $\alpha = 0.10$, theoretically it must be expected that at least 90% of the constructed intervals will contain the population mean $\mu$. This is indeed true for data taken from a normal population: the proportions are 90.33%, 90.47% and 90.50% for sample sizes 5, 10 and 15, respectively. It must be noted that there is a relatively small increase in the said reliability as the sample size increases. However, this is not observed when data are taken from a non-normal population. As shown, the observed proportions are 83.29% for sample size 5, 87.19% for sample size 10 and 89.99% for sample size 15. This result reveals that when the normality assumption is ignored, the error in estimating a desired parameter becomes relatively higher than what is pre-defined by the researcher. Notably, increasing the sample size still offers a remedy when the normality assumption is not met; this is in fact explained by the well-known central limit theorem.
At $\alpha = 0.05$, an almost similar trend is observed. As usual, it is expected that, in the long run, 95% of the intervals under a given sampling design will contain the parameter of interest, the population mean. Under the normality assumption, the observed proportions are 95.37%, 95.35% and 95.68% for sample sizes 5, 10 and 15, respectively. However, samples from the non-normal population still generated fewer intervals that contained the declared parameter: the observed proportions are 89.86%, 91.36% and 93.64% for respective sample sizes 5, 10 and 15, which are obviously smaller than the theoretically expected value (95%). It is noted that increasing the sample size contributed a relative increase in reliability even while the normality assumption is ignored.
Lastly, at the stricter level of significance $\alpha = 0.01$, the same pattern was observed. When the normality assumption is violated, the estimation process becomes worse (less accurate and less precise). Under this unsatisfied assumption, the proportions of intervals that contain the population parameter are 95.98%, 95.94% and 97.24% for sample sizes 5, 10 and 15, respectively. These values are unfortunately smaller than what is again expected (99%). The idea of increasing the sample size as a remedy for increasing the reliability of estimates for non-normal data still holds.
CHAPTER 5
Summary, Conclusion and Recommendation
5.1
5.2 Recommendations
The study focuses only on interval estimates of the population mean. Thus, for the interest of future researchers, the following recommendations are offered:

Constructing interval estimates for the population variance using both normal and non-normal data.

Constructing interval estimates for the population proportion using both normal and non-normal data.

Comparing the power of the test in hypothesizing a parameter value using a one-sample t-test on both normal and non-normal data.

Comparing the power of the test in testing mean differences using an independent-samples t-test on both normal and non-normal data.
Bibliography

[8] Institute of Statistics, 2014. Workbook in Statistics 1, Tenth Ed. University of the Philippines, Los Baños.

[9] J. Sauro and J.R. Lewis, 2005. Estimating Completion Rates From Small Samples Using Binomial Confidence Intervals: Comparisons and Recommendations. Proceedings of the Human Factors and Ergonomics Society 49th Annual Meeting.

[10] J.T. McClave, F.H. Dietrich and T. Sincich, 1997. Statistics, Seventh Ed. Prentice Hall.

[11] A.D. Aczel, 1995. Statistics: Concepts and Applications. Richard D. Irwin, Inc.

[12] R.E. Walpole, 1982. Introduction to Statistics. Macmillan Publishing Company, New York.

[13] D. Gilliland and V. Melfi, 2010. A Note on Confidence Interval Estimation and Margin of Error. Journal of Statistics Education, Volume 18, No. 1.

[14] J. Orloff and J. Bloom, 2014. Confidence Intervals for the Mean of Non-normal Data. Class 23, 18.05, Spring.
Annexes

CODE:

Actual Results for Normal Data at
$\alpha = 0.10$: 20 taken 5, 20 taken 10, 20 taken 15
$\alpha = 0.05$: 20 taken 5, 20 taken 10, 20 taken 15
$\alpha = 0.01$: 20 taken 5, 20 taken 10, 20 taken 15

Actual Results for Non-normal Data at
$\alpha = 0.10$: 20 taken 5, 20 taken 10, 20 taken 15
$\alpha = 0.05$: 20 taken 5, 20 taken 10, 20 taken 15
$\alpha = 0.01$: 20 taken 5, 20 taken 10, 20 taken 15