
TMA1111 Mathematical Techniques Faculty of Information Science & Technology

TOPIC 6: HYPOTHESES TESTING & REGRESSION

Contents:

A. HYPOTHESES TESTING

6.1 Introduction

6.1.1 Statistical Hypothesis


6.1.2 Important Concepts in Hypothesis Testing
6.1.3 Tests of Hypotheses using Critical Region

6.2 Hypothesis Test about the Mean

6.2.1 Single Sample – Test Concerning the Mean


6.2.2 Two Samples - Test Concerning the Difference between Two Means

6.3 Hypothesis Test about the Proportion

6.3.1 Single Sample – Test on a Single Proportion

6.4 Hypothesis Test about the Variance

6.4.1 Single Sample - Test Concerning The Variance

B. REGRESSION & CORRELATION

6.5 The Simple Linear Regression Model


6.6 Estimating Model Parameters
6.7 The Coefficient of Determination & Correlation

TCK (2016) Page 1 of 41



A. HYPOTHESES TESTING

6.1 Introduction

While doing a particular research, one may propose a hypothesis (assumption), and
then design an experiment and collect data in order to carry out hypothesis testing.

In order to reach a conclusion about the hypothesis:

• The data may support the research hypothesis, or
• The data may not support the research hypothesis.

Since a conclusion is reached based on data from a sample of the population, there
always exists a chance that our conclusion about the hypothesis may turn out to be
wrong!

6.1.1 Statistical Hypothesis

A statistical hypothesis, or just hypothesis, is a conjecture or claim (assertion)
concerning one or more populations.

- Examples of Claims:
• At least 50% of the students will skip the morning class on Friday
• The mean lifetime of a certain light bulb is 8000 hours
• The mean of batch 1 is different from the mean of batch 2.
• The variance of batch 1 is not different from the variance of batch 2.
• The defective percentage is less than 2%.

Evidence from the sample that is inconsistent with the stated hypothesis leads to a
rejection of the hypothesis, whereas evidence supporting the hypothesis leads to its
acceptance.

The null hypothesis, denoted by H0, is a claim concerning a population parameter that
is initially favoured as true.

The Alternative hypothesis, denoted by H1 (or Ha) is the assertion that is contrary to
Ho.

A test of hypothesis is a method of using sample data to decide whether the null
hypothesis H0 should be rejected (in favour of the alternative hypothesis).

There are two possible conclusions from a hypothesis-testing (H.T.) analysis:

• reject H0, or
• fail to reject H0.

If the sample data provide sufficient evidence to suggest that H0 is false, H0 is
rejected in favour of H1. Otherwise, we continue to believe in the truth of H0.

We will confine ourselves to H.T. of the following format:

$H_0: \theta = \theta_0$
$H_1: \theta \ne \theta_0$ OR $\theta > \theta_0$ OR $\theta < \theta_0$

The parameters used to form the hypotheses are population parameters, for example
the mean (μ), the proportion (p), and the variance (σ²) or standard deviation (σ).

How to define the hypotheses?

Given a scenario, read it carefully and determine the claim that you want to test
(refer to Table 1).

• H0 always carries the equality (=) sign (refer to the H0 column).

• If the claim suggests a simple direction such as more than, less than, superior to,
inferior to, and so on, then H1 is stated using the inequality symbol (< or >)
corresponding to the suggested direction (refer to rows 2 and 4).

• If the claim suggests a compound direction (equality as well as direction) such
as at least, equal to or greater, at most, no more than, and so on, then this entire
compound direction (≤ or ≥) is expressed as H0, but using only the equality (=)
sign, and H1 is given by the opposite direction (refer to rows 1 and 3).

• If no direction whatsoever is suggested by the claim, then H1 is stated using the
not-equal symbol, ≠ (refer to row 5).


Row   Claim   Keywords (e.g.)   H0   H1
1     ≥       At least          =    <
2     >       More than         =    >
3     ≤       At most           =    >
4     <       Less than         =    <
5     ≠       Not equal         =    ≠
Table 1. The Claims and the Hypotheses
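
The rules in Table 1 can be written as a small lookup. The following is a minimal sketch, not part of the course material: the names `CLAIM_TO_HYPOTHESES` and `state_hypotheses` and the text encoding of the relations (">=", "<=", "!=") are assumptions made for illustration.

```python
# Table 1 as a lookup: relation in the claim -> (relation in H0, relation in H1).
CLAIM_TO_HYPOTHESES = {
    ">=": ("=", "<"),    # row 1: "at least"
    ">":  ("=", ">"),    # row 2: "more than"
    "<=": ("=", ">"),    # row 3: "at most"
    "<":  ("=", "<"),    # row 4: "less than"
    "!=": ("=", "!="),   # row 5: "not equal"
}


def state_hypotheses(parameter, claim_relation, value):
    """Return the (H0, H1) strings for a claim such as mu <= 1.5."""
    h0_relation, h1_relation = CLAIM_TO_HYPOTHESES[claim_relation]
    h0 = f"H0: {parameter} {h0_relation} {value}"
    h1 = f"H1: {parameter} {h1_relation} {value}"
    return h0, h1


# Claim: "the average does not exceed 1.5", i.e. mu <= 1.5 (row 3 of Table 1).
print(state_hypotheses("mu", "<=", 1.5))
```

Note that H0 always receives the equality sign, whatever the claim; only H1 changes with the direction of the claim.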

Example 6.1:

State the null and alternative hypotheses to be used in testing each claim:

(a) A manufacturer of a certain brand of rice cereal claims that the average
saturated fat content does not exceed 1.5 grams.
Claim: $\mu \le 1.5$ ∴ $H_0: \mu = 1.5$ vs. $H_1: \mu > 1.5$

(b) A manufacturer of a certain brand of rice cereal claims that the average
saturated fat content is more than 1.5 grams.
Claim: $\mu > 1.5$ ∴ $H_0: \mu = 1.5$ vs. $H_1: \mu > 1.5$

(c) A manufacturer of a certain brand of rice cereal claims that the average
saturated fat content is at least 1.5 grams.
Claim: $\mu \ge 1.5$ ∴ $H_0: \mu = 1.5$ vs. $H_1: \mu < 1.5$

(d) A manufacturer of a certain brand of rice cereal claims that the average
saturated fat content is less than 1.5 grams.
Claim: $\mu < 1.5$ ∴ $H_0: \mu = 1.5$ vs. $H_1: \mu < 1.5$

(e) A real estate agent claims that 60% of all private residences being built today
are 3-bedroom homes. To test this claim, a large sample of new residences is
inspected; the proportion of these homes with 3 bedrooms is recorded and used
as our statistic.
Claim: $p = 0.6$ ∴ $H_0: p = 0.6$ vs. $H_1: p \ne 0.6$


Example 6.2:

For the following pairs of assertions, indicate which do not comply with our rules for
setting up hypotheses, and why.

(a) $H_0: \mu = 100$, $H_1: \mu > 100$
These hypotheses comply with our rules.

(b) $H_0: \sigma = 20$, $H_1: \sigma \le 20$
H1 includes the equality claim ($\sigma \le 20$), so it overlaps with H0 instead of
being contrary to it. ∴ Does not comply.

(c) $H_0: p \ne 0.25$, $H_1: p = 0.25$
H0 should contain the equality claim, so these are not legitimate. ∴ Does not comply.

(d) $H_0: \mu = 120$, $H_1: \mu = 150$
We do not allow both H0 and H1 to be equality claims. ∴ Does not comply.

6.1.2 Important Concepts in Hypothesis Testing

A test procedure is specified by:

1. A test statistic, a function of the sample data on which the decision is to be
based.
2. A critical region (or rejection region), the set of all test statistic values for
which H0 will be rejected (the null hypothesis is rejected if and only if the test
statistic value falls in this region).

A test statistic is the sample statistic that is used in the hypothesis testing process.
The calculated value of the test statistic is used for either rejecting or accepting the
null hypothesis. Examples of test statistic:

• Mean, $\mu$:
$$Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \quad \text{and} \quad T = \frac{\bar{X} - \mu}{s/\sqrt{n}}$$

• Proportion, $p$:
$$Z = \frac{\hat{p} - p}{\sqrt{pq/n}}$$

• Variance, $\sigma^2$:
$$\chi^2 = \frac{(n-1)s^2}{\sigma^2}$$

The decision procedure could lead to either of two wrong conclusions. So, there are
two types of errors.

Type I error consists of rejecting the null hypothesis H0 when it is true.

P(Type I error) = P(Reject H0 when it is true) = α

Type II error involves accepting H0 when H0 is false.

P(Type II error) = P(Accept H0 when H0 is false) = β

               Accept H0        Reject H0
H0 is true     Correct          Type I error
H0 is false    Type II error    Correct

• The level of significance is the probability of committing a Type I error and is
denoted by α (alpha).
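
The interpretation of α as the probability of a Type I error can be illustrated by simulation. The following is a minimal sketch under stated assumptions: a normal population with known σ, a two-tailed z-test, and illustrative parameter values (μ0 = 8, σ = 0.16, n = 25) that are chosen here and are not prescribed by the notes. The function name `type_i_error_rate` is hypothetical.

```python
import random
from statistics import NormalDist, fmean


def type_i_error_rate(mu0=8.0, sigma=0.16, n=25, alpha=0.05,
                      trials=20000, seed=1):
    """Estimate P(reject H0 | H0 true) for a two-tailed z-test by simulation."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)   # z_{alpha/2}
    rejections = 0
    for _ in range(trials):
        # Draw a sample from a population in which H0 (mu = mu0) really holds.
        sample = [rng.gauss(mu0, sigma) for _ in range(n)]
        z = (fmean(sample) - mu0) / (sigma / n ** 0.5)
        if abs(z) >= z_crit:                       # falls in the critical region
            rejections += 1
    return rejections / trials


if __name__ == "__main__":
    print(type_i_error_rate())
```

Because H0 is true in every simulated sample, each rejection is a Type I error, so the returned proportion should be close to the chosen α.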

A test of a statistical hypothesis, where the region of rejection is on only one side of
the sampling distribution of the test statistic, is called a one-tailed test. {Note:
For one-tailed test, it can be upper-tailed test/ right-tailed test, or lower-tailed
test/ left-tailed test}

A test of a statistical hypothesis, where the region of rejection is on both sides of
the sampling distribution of the test statistic, is called a two-tailed test.

We refer to H1 to determine whether the test is right-tailed, left-tailed or
two-tailed.

The critical region is chosen according to three possible cases (upper-tailed test/
right-tailed test, or lower-tailed test/ left-tailed test, or two tailed test), illustrated
with a test statistic that is a standard normal random variable under H0.

Three possible cases:

• A test of any statistical hypothesis where H1 is one-sided, such as
$H_0: \theta = \theta_0$ vs $H_1: \theta > \theta_0$ (one-sided test; right-tailed test as ">" is used in H1).
The critical region:
Reject H0 if $Z \ge z_{\alpha}$ (upper-tailed test/right-tailed test)

• A test of any statistical hypothesis where H1 is one-sided, such as
$H_0: \theta = \theta_0$ vs $H_1: \theta < \theta_0$ (one-sided test; left-tailed test as "<" is used in H1).
The critical region:
Reject H0 if $Z \le -z_{\alpha}$ (lower-tailed test/left-tailed test)

• A test of any statistical hypothesis where H1 is two-sided, such as
$H_0: \theta = \theta_0$ vs $H_1: \theta \ne \theta_0$ (two-tailed test as "≠" is used in H1).
The critical region:
Reject H0 if $Z \ge z_{\alpha/2}$ or $Z \le -z_{\alpha/2}$
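
The three critical regions can be computed rather than read from a table. The following is a minimal sketch using `statistics.NormalDist` from the Python standard library; the function names `z_critical_region` and `in_critical_region` and the tail labels "right", "left" and "two" are assumptions made for illustration.

```python
from statistics import NormalDist


def z_critical_region(alpha, tail):
    """Return (lower, upper) cut-offs; None means that side has no rejection region."""
    std_normal = NormalDist()                      # standard normal, mean 0, sd 1
    if tail == "right":                            # H1 uses ">"
        return None, std_normal.inv_cdf(1 - alpha)
    if tail == "left":                             # H1 uses "<"
        return std_normal.inv_cdf(alpha), None
    if tail == "two":                              # H1 uses "not equal"
        z = std_normal.inv_cdf(1 - alpha / 2)
        return -z, z
    raise ValueError("tail must be 'right', 'left' or 'two'")


def in_critical_region(z, alpha, tail):
    """True if the test statistic z leads to rejection of H0."""
    lower, upper = z_critical_region(alpha, tail)
    return (lower is not None and z <= lower) or (upper is not None and z >= upper)


print(z_critical_region(0.05, "two"))
print(in_critical_region(2.1, 0.05, "two"))
```

For a two-tailed test α is split equally between the two tails, which is why the cut-off is taken at 1 − α/2 rather than 1 − α.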


Example 6.3:

State the null and alternative hypotheses to be used in testing the claim and determine
where the critical region is located:

(a) A manufacturer of a certain brand of rice cereal claims that the average
saturated fat content does not exceed 1.5 grams. State the null and
alternative hypotheses to be used in testing this claim.
$H_0: \mu = 1.5$ vs. $H_1: \mu > 1.5$ (one-tailed test/right-tailed test)
Critical region: Reject H0 if $z \ge z_{\alpha}$

(b) A manufacturer of a certain brand of rice cereal claims that the average
saturated fat content is more than 1.5 grams. State the null and alternative
hypotheses to be used in testing this claim.
$H_0: \mu = 1.5$ vs. $H_1: \mu > 1.5$ (one-tailed test/right-tailed test)
Critical region: Reject H0 if $z \ge z_{\alpha}$

(c) A manufacturer of a certain brand of rice cereal claims that the average
saturated fat content is at least 1.5 grams. State the null and alternative
hypotheses to be used in testing this claim.
$H_0: \mu = 1.5$ vs. $H_1: \mu < 1.5$ (one-tailed test/left-tailed test)
Critical region: Reject H0 if $z \le -z_{\alpha}$

(d) A real estate agent claims that 60% of all private residences being built
today are 3-bedroom homes. To test this claim, a large sample of new
residences is inspected; the proportion of these homes with 3 bedrooms is
recorded and used as our statistic.
$H_0: p = 0.6$ vs. $H_1: p \ne 0.6$ (two-tailed test)
Critical region: Reject H0 if $z \le -z_{\alpha/2}$ or $z \ge z_{\alpha/2}$

6.1.3 Tests of hypotheses using critical region

Steps for conducting a test of hypotheses:


1. State the null hypothesis and the alternative hypothesis:
$H_0: \theta = \theta_0$ vs $H_1: \theta \ne \theta_0$ (for 2-tailed test)
or $H_1: \theta > \theta_0$ (for right-tailed test)
or $H_1: \theta < \theta_0$ (for left-tailed test)
2. Determine whether it is a one or two-tailed test by referring to H1.
3. Decide on the sampling distribution of the test statistic under H0.
4. State the critical region for the selected significance level, α (or draw the curve).
5. Give the formula for the test statistic. Compute the value of the test statistic from
the sample data.
6. Decide whether H0 should be rejected and state this conclusion in the problem
context.


6.2 Hypothesis Test about the Mean

6.2.1 Single Sample - Test concerning the mean

Case 1a: $\sigma^2$ is known (for $n \ge 30$ or $n < 30$)

Suppose we have a sample of size n taken from a population whose mean is μ and
variance is σ². We want to test whether this sample is taken from a population whose
mean is μ0. We know that the sample mean $\bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right)$ if n is large.

Steps:
(1) $H_0: \mu = \mu_0$ vs $H_1: \mu \ne \mu_0$ (for 2-tailed test)
or $H_1: \mu > \mu_0$ (for right-tailed test)
or $H_1: \mu < \mu_0$ (for left-tailed test)
(2) Determine whether it is a one or two-tailed test by referring to H1.
(3) Use z distribution.
(4) Critical region:

Alternative hypothesis     Critical/rejection region for level α test
$H_1: \mu > \mu_0$         $z \ge z_{\alpha}$ (upper-tailed test/right-tailed test)
$H_1: \mu < \mu_0$         $z \le -z_{\alpha}$ (lower-tailed test/left-tailed test)
$H_1: \mu \ne \mu_0$       $z \le -z_{\alpha/2}$ or $z \ge z_{\alpha/2}$ (two-tailed test)

(5) Test statistic:
$$Z = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}}$$
(6) Decision & Conclusion.


Case 1b: $\sigma^2$ is unknown but $n \ge 30$ (large sample)

The steps are the same as in Case 1a, but σ is replaced with s in the test
statistic, i.e.
$$Z = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}$$

Case 2: $\sigma^2$ is unknown and $n < 30$ (small sample)

Suppose we have a sample of size n, where n < 30, taken from a normal population
whose mean is μ and whose variance (σ²) is unknown. We want to test whether this
sample is taken from a population whose mean is μ0.

Steps:

(1) $H_0: \mu = \mu_0$ vs $H_1: \mu \ne \mu_0$ (for 2-tailed test)
or $H_1: \mu > \mu_0$ (for right-tailed test)
or $H_1: \mu < \mu_0$ (for left-tailed test)
(2) Determine whether it is a one or two-tailed test by referring to H1.
(3) Use t-distribution because n < 30 and σ² is unknown.
(4) Critical region:

Alternative hypothesis     Critical/rejection region for level α test
$H_1: \mu > \mu_0$         $T \ge t_{\alpha,\,n-1}$ (upper-tailed test)
$H_1: \mu < \mu_0$         $T \le -t_{\alpha,\,n-1}$ (lower-tailed test)
$H_1: \mu \ne \mu_0$       $T \ge t_{\alpha/2,\,n-1}$ or $T \le -t_{\alpha/2,\,n-1}$ (two-tailed test)

(5) Test statistic:
$$T = \frac{\bar{X} - \mu_0}{s/\sqrt{n}}$$
(6) Decision & Conclusion.


Example 6.4:

Suppose that it is known from experience that the standard deviation of the 8-cm
diameter CDs made by a certain company is 0.16 cm. To check whether its
production is under control on a given day, namely, to check whether the true average
diameter of the CDs is 8 cm, an employee selects a random sample of 25 CDs and
finds that their mean diameter is $\bar{x} = 8.091$ cm. Since the company stands to
lose money when μ > 8 and the customer loses out when μ < 8, test the null
hypothesis μ = 8 against the alternative hypothesis μ ≠ 8 at α = 0.01.

Solution:

Let X be the diameter of the CD. Given: σ = 0.16, n = 25, $\bar{x} = 8.091$

(i) Hypotheses: $H_0: \mu = 8$ vs. $H_1: \mu \ne 8$ (2-tailed test)
(ii) σ² is known ⇒ z-distribution
(iii) Critical region:
α = 0.01 ⇒ α/2 = 0.01/2 = 0.005 (α is divided by 2 because it is a 2-tailed test)
⇒ $z_{0.005} = 2.576$
Reject H0 if z ≤ −2.576 or z ≥ 2.576
(iv) Test statistic:
$$Z = \frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}} = \frac{8.091 - 8}{0.16/\sqrt{25}} = 2.8438$$
(v) Since z > 2.576 ⇒ Reject H0.
{Note: We can also draw the normal curve and shade the critical region.
[Figure: standard normal curve with critical regions z ≤ −2.576 and z ≥ 2.576 shaded; z = 2.8438 lies in the upper critical region]
We can then write: z = 2.8438 falls into the critical region ⇒ Reject H0.}
{Note: Rejecting H0 means that we accept H1, so use H1 to state the conclusion.}
(vi) Conclusion: μ ≠ 8. Since z lies in the upper tail, μ > 8 and the company stands to lose money.
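
The arithmetic of Example 6.4 can be checked with a short script. This is a minimal sketch using only the Python standard library; the function name `one_sample_z` is an assumption for illustration, and the critical value is computed from the standard normal distribution instead of being read from a table.

```python
from math import sqrt
from statistics import NormalDist


def one_sample_z(x_bar, mu0, sigma, n):
    """Test statistic Z = (x_bar - mu0) / (sigma / sqrt(n))."""
    return (x_bar - mu0) / (sigma / sqrt(n))


# Data of Example 6.4: sigma known, two-tailed test at alpha = 0.01.
alpha = 0.01
z = one_sample_z(x_bar=8.091, mu0=8, sigma=0.16, n=25)
z_crit = NormalDist().inv_cdf(1 - alpha / 2)   # z_{alpha/2}
reject_h0 = abs(z) >= z_crit                   # two-tailed critical region

print(round(z, 4), round(z_crit, 3), reject_h0)
```

The same function applies to Case 1b by passing the sample standard deviation s in place of σ when n ≥ 30.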

Example 6.5:
The daily yield for a local chemical plant has averaged 880 tons for the last several
years. The quality control manager would like to know whether this average has
changed in recent months. She randomly selects 50 days from the computer database
and computes the average and standard deviation as 871 and 21 tons respectively.
Test the appropriate hypothesis using α = 0.05.

Solution:

Let X be the daily yield for the local chemical plant. Given: n = 50, $\bar{x} = 871$, s = 21
(i) $H_0: \mu = 880$ vs. $H_1: \mu \ne 880$ (2-tailed test)
Note: use ≠ in H1 because the claim is "whether this average has changed...", which
indicates that it may be either greater or lower.
(ii) n > 30 ⇒ z-distribution
(iii) α = 0.05 ⇒ α/2 = 0.05/2 = 0.025 ⇒ $z_{0.025} = 1.96$
Critical region: Reject H0 if z ≤ −1.96 or z ≥ 1.96
(iv) Test statistic:
$$z = \frac{871 - 880}{21/\sqrt{50}} = -3.0305$$
(v) [Figure: standard normal curve with critical regions z ≤ −1.96 and z ≥ 1.96 shaded; z = −3.0305 lies in the lower critical region]
Since z = −3.0305 falls into the critical region {or z < −1.96} ⇒ Reject H0.
(vi) Conclusion: μ ≠ 880. Since z lies in the lower tail, μ < 880 and the average has
changed to a value lower than 880 tons.


Example 6.6:

The specification for a certain kind of ribbon states that it should have a mean
breaking strength of 185 pounds. If five pieces randomly selected from different rolls
have a mean breaking strength of 183.14 pounds and a standard deviation of 8.219
pounds, test the null hypothesis μ = 185 pounds against the alternative μ < 185 pounds
at α = 0.05.

Solution:

Let X be the breaking strength for that certain kind of ribbon.
Given: n = 5, $\bar{x} = 183.14$, s = 8.219.

(i) $H_0: \mu = 185$ vs. $H_1: \mu < 185$ (left-tailed test)
(ii) σ² is unknown, n < 30 ⇒ t-distribution
(iii) α = 0.05 ⇒ $t_{0.05,\,4} = 2.132$
Critical region: Reject H0 if t ≤ −2.132
(iv) Test statistic:
$$t = \frac{183.14 - 185}{8.219/\sqrt{5}} = -0.506$$
(v) [Figure: t curve with critical region t ≤ −2.132 shaded; t = −0.506 lies outside it]
Since t > −2.132 (or we can write: t = −0.506 does not fall into the critical region)
⇒ Do not reject H0 (or we can write: accept H0).

(vi) Conclusion: the mean breaking strength is not significantly less than 185.
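
The small-sample calculation in Example 6.6 can be reproduced as follows. This is a minimal sketch; the function name `one_sample_t` is an assumption for illustration. The Python standard library has no t-distribution quantile function, so the critical value 2.132 is taken from the t table, exactly as in the worked solution.

```python
from math import sqrt


def one_sample_t(x_bar, mu0, s, n):
    """Test statistic T = (x_bar - mu0) / (s / sqrt(n)), with n - 1 degrees of freedom."""
    return (x_bar - mu0) / (s / sqrt(n))


# Data of Example 6.6: sigma unknown, n < 30, left-tailed test at alpha = 0.05.
T_CRIT = 2.132                     # t_{0.05, 4}, read from a t table
t = one_sample_t(x_bar=183.14, mu0=185, s=8.219, n=5)
reject_h0 = t <= -T_CRIT           # left-tailed critical region

print(round(t, 3), reject_h0)
```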


6.2.2 Two Samples: Test Concerning the Difference between Two Means

Case 1: $\sigma_1^2$ and $\sigma_2^2$ are known, or $\sigma_1^2$ and $\sigma_2^2$ are unknown but $n_1, n_2 \ge 30$

Two independent samples of sizes n1 and n2 are taken from populations with means μ1, μ2
and variances $\sigma_1^2$ and $\sigma_2^2$. To test whether these samples are taken from populations
whose means are equal:

Steps:
(1) For 2-tailed test:
$H_0: \mu_1 = \mu_2$ vs $H_1: \mu_1 \ne \mu_2$, or
$H_0: \mu_1 - \mu_2 = 0$ vs $H_1: \mu_1 - \mu_2 \ne 0$

For right-tailed test:
$H_0: \mu_1 = \mu_2$ vs $H_1: \mu_1 > \mu_2$, or
$H_0: \mu_1 - \mu_2 = 0$ vs $H_1: \mu_1 - \mu_2 > 0$

For left-tailed test:
$H_0: \mu_1 = \mu_2$ vs $H_1: \mu_1 < \mu_2$, or
$H_0: \mu_1 - \mu_2 = 0$ vs $H_1: \mu_1 - \mu_2 < 0$

(2) Determine 1-tailed or 2-tailed test.
(3) Use Z-distribution.
(4) For a particular value of α, determine the critical region.

Alternative hypothesis            Critical/rejection region for level α test
$H_1: \mu_1 - \mu_2 > 0$          $z \ge z_{\alpha}$ (upper-tailed test/right-tailed test)
$H_1: \mu_1 - \mu_2 < 0$          $z \le -z_{\alpha}$ (lower-tailed test/left-tailed test)
$H_1: \mu_1 - \mu_2 \ne 0$        $z \le -z_{\alpha/2}$ or $z \ge z_{\alpha/2}$ (two-tailed test)

(5) Test statistic:
$$Z = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}}$$
where $\mu_1 - \mu_2 = 0$ under H0.
(6) Conclusion.

• Note: Replace $\sigma_1^2$ and $\sigma_2^2$ with $s_1^2$ and $s_2^2$ in the test statistic if $\sigma_1^2$ and $\sigma_2^2$
are unknown.

Example 6.7:

A company claims that its light bulbs are superior to those of its main competitor. A
study showed that a sample of 40 of its bulbs has a mean lifetime of 647 hours of
continuous use with a standard deviation of 27 hours, while a sample of 40 bulbs
made by its main competitor had a mean lifetime of 638 hours of continuous use with a
standard deviation of 31 hours. Does this substantiate the claim at the 0.05 level of
significance?

Solution:

Let X1 and X2 be the lifetimes of the light bulbs made by that company and its main
competitor respectively.

Given: $n_1 = 40$, $\bar{x}_1 = 647$, $s_1 = 27$
$n_2 = 40$, $\bar{x}_2 = 638$, $s_2 = 31$

(i) $H_0: \mu_1 = \mu_2$ vs. $H_1: \mu_1 > \mu_2$ (right-tailed test)
(or we can write $H_0: \mu_1 - \mu_2 = 0$ vs. $H_1: \mu_1 - \mu_2 > 0$)
(ii) $\sigma_1^2$ and $\sigma_2^2$ unknown, n1, n2 > 30 ⇒ z-distribution
(iii) α = 0.05 ⇒ $z_{0.05} = 1.645$
Critical region: Reject H0 if z ≥ 1.645
(iv) Test statistic:
$$z = \frac{(647 - 638) - 0}{\sqrt{\dfrac{27^2}{40} + \dfrac{31^2}{40}}} = 1.3846$$
(v) [Figure: standard normal curve with critical region z ≥ 1.645 shaded; z = 1.3846 lies outside it]
Since z < 1.645 (or z = 1.3846 does not fall into the critical region) ⇒ Do not reject H0.
(vi) Conclusion: At the 0.05 level of significance the data do not substantiate the
claim; the mean lifetimes of the two brands of light bulbs are not significantly different.
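
The two-sample statistic of Example 6.7 can be verified numerically. This is a minimal sketch; the function name `two_sample_z` and its argument names are assumptions for illustration, and the sample standard deviations stand in for the unknown population values because both samples are large.

```python
from math import sqrt
from statistics import NormalDist


def two_sample_z(x_bar1, x_bar2, sd1, sd2, n1, n2, diff0=0.0):
    """Z = ((x_bar1 - x_bar2) - diff0) / sqrt(sd1^2/n1 + sd2^2/n2)."""
    standard_error = sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)
    return ((x_bar1 - x_bar2) - diff0) / standard_error


# Data of Example 6.7: right-tailed test at alpha = 0.05.
alpha = 0.05
z = two_sample_z(x_bar1=647, x_bar2=638, sd1=27, sd2=31, n1=40, n2=40)
z_crit = NormalDist().inv_cdf(1 - alpha)   # z_alpha for a one-tailed test
reject_h0 = z >= z_crit

print(round(z, 4), round(z_crit, 3), reject_h0)
```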

6.3 Hypothesis Test about the Proportion

6.3.1 One Sample - Test on a Single Proportion

When each observation in a random sample of size n (n large) results in one of two
possible outcomes, and we want to test whether the sample, with sample proportion of
"successes" $\hat{p}$, could have been drawn from a population whose proportion of
"successes" is $p_0$, we use the hypothesis test about the proportion.

Steps:
(1) $H_0: p = p_0$ vs $H_1: p \ne p_0$ (for 2-tailed test)
or $H_1: p > p_0$ (for right-tailed test)
or $H_1: p < p_0$ (for left-tailed test)
(2) Determine whether it is a one or two-tailed test by referring to H1.
(3) Use z-distribution.
(4) For the required α, the critical region is determined.

Alternative hypothesis     Critical/rejection region for level α test
$H_1: p > p_0$             $z \ge z_{\alpha}$ (upper-tailed test/right-tailed test)
$H_1: p < p_0$             $z \le -z_{\alpha}$ (lower-tailed test/left-tailed test)
$H_1: p \ne p_0$           $z \le -z_{\alpha/2}$ or $z \ge z_{\alpha/2}$ (two-tailed test)

(5) Test statistic:
$$Z = \frac{\hat{p} - p_0}{\sqrt{\dfrac{p_0 q_0}{n}}}, \quad q_0 = 1 - p_0$$
(6) Conclusion regarding the acceptance/rejection of the null hypothesis based on the
rejection criteria.

Example 6.8:

If 4 out of 20 patients suffered serious side effects from a new medication, test the
null hypothesis p = 0.5 against the alternative hypothesis p ≠ 0.5 at α = 0.05. Here,
p is the true proportion of patients suffering side effects from the new medication.

Solution:
(i) $H_0: p = 0.5$ vs. $H_1: p \ne 0.5$ (2-tailed test)
(ii) α = 0.05 ⇒ α/2 = 0.05/2 = 0.025 ⇒ $z_{0.025} = 1.96$
Critical region: Reject H0 if z ≤ −1.96 or z ≥ 1.96
(iii) Test statistic: n = 20, $p_0 = 0.5$, $q_0 = 1 - p_0 = 0.5$, $\hat{p} = 4/20 = 0.2$, and thus
$$z = \frac{0.2 - 0.5}{\sqrt{\dfrac{(0.5)(0.5)}{20}}} = -2.68$$
(iv) Since z < −1.96 ⇒ Reject H0.
(Note: We can also draw the normal curve and shade the critical region, as in the
previous examples, to decide whether to reject or accept H0.)
(v) Conclusion: p ≠ 0.5
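
The proportion test of Example 6.8 can be checked in the same way. This is a minimal sketch; the function name `one_sample_proportion_z` is an assumption for illustration.

```python
from math import sqrt
from statistics import NormalDist


def one_sample_proportion_z(successes, n, p0):
    """Z = (p_hat - p0) / sqrt(p0 * q0 / n), with q0 = 1 - p0."""
    p_hat = successes / n
    q0 = 1 - p0
    return (p_hat - p0) / sqrt(p0 * q0 / n)


# Data of Example 6.8: 4 of 20 patients, H0: p = 0.5, two-tailed at alpha = 0.05.
alpha = 0.05
z = one_sample_proportion_z(successes=4, n=20, p0=0.5)
z_crit = NormalDist().inv_cdf(1 - alpha / 2)
reject_h0 = abs(z) >= z_crit

print(round(z, 2), round(z_crit, 2), reject_h0)
```

The standard error uses the hypothesised value p0, not the sample proportion, because the test statistic is evaluated under H0.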


6.4 Hypothesis Test about the Variance

6.4.1 One Sample - Test Concerning the Variance

We test whether the population variance $\sigma^2$ equals a specific value $\sigma_0^2$, using
the sample variance $s^2$.

Steps:
(1) $H_0: \sigma^2 = \sigma_0^2$ vs $H_1: \sigma^2 \ne \sigma_0^2$ (two-tailed test)
or $H_1: \sigma^2 > \sigma_0^2$ (right-tailed test)
or $H_1: \sigma^2 < \sigma_0^2$ (left-tailed test)
(2) Determine whether it is a one or two-tailed test by referring to H1.
(3) Use the $\chi^2$ distribution.
(4) Determine the critical region for the required value of α and the given n, based on
the $\chi^2$ distribution table (with n − 1 degrees of freedom).

Hypotheses                                                   Critical/rejection region for level α test
$H_0: \sigma^2 = \sigma_0^2$, $H_1: \sigma^2 > \sigma_0^2$         $\chi^2 > \chi^2_{\alpha}$ (right-tailed rejection region)
$H_0: \sigma^2 = \sigma_0^2$, $H_1: \sigma^2 < \sigma_0^2$         $\chi^2 < \chi^2_{1-\alpha}$ (left-tailed rejection region)
$H_0: \sigma^2 = \sigma_0^2$, $H_1: \sigma^2 \ne \sigma_0^2$       $\chi^2 > \chi^2_{\alpha/2}$ or $\chi^2 < \chi^2_{1-\alpha/2}$ (two-tailed rejection region)

The rules in the table above are based on the following diagrams:

[Figures: $\chi^2$ density curves with (i) the right-tailed, (ii) the left-tailed and (iii) the two-tailed rejection region shaded]

(5) Test statistic:
$$\chi^2 = \frac{(n-1)s^2}{\sigma_0^2}$$
(6) Conclusion regarding the acceptance/rejection of the null hypothesis based on the
rejection criteria.

Example 6.9:

Suppose that the thickness of a part used in a semiconductor is its critical dimension
and that measurements of the thickness of a random sample of 18 such parts have the
variance s² = 0.68. The process is said to be under control if the variation of the
thickness is given by a variance not greater than 0.36. Assuming that the
measurements constitute a random sample from a normal population, test the
hypothesis at α = 0.05.

Solution:

Let X be the thickness of a part used in the semiconductor.
n = 18, s² = 0.68
(i) $H_0: \sigma^2 = 0.36$ vs. $H_1: \sigma^2 > 0.36$ (right-tailed test)
(ii) α = 0.05 ⇒ $\chi^2_{0.05,\,17} = 27.587$
Critical region: Reject H0 if $\chi^2 \ge 27.587$
[Figure: $\chi^2$ curve with the right-tail area α = 0.05 shaded beyond $\chi^2 = 27.587$]
(iii) Test statistic: n = 18, s² = 0.68, and thus
$$\chi^2 = \frac{(18 - 1)(0.68)}{0.36} = 32.111$$
(iv) Since $\chi^2 > 27.587$ ⇒ Reject H0.
(or: $\chi^2 = 32.111$ falls in the critical region ⇒ Reject H0.)
(v) Conclusion: $\sigma^2 > 0.36$.
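
The variance test of Example 6.9 reduces to one line of arithmetic, sketched below. The function name `chi_square_statistic` is an assumption for illustration. The Python standard library has no chi-square quantile function, so the critical value 27.587 is taken from the table, as in the worked solution.

```python
def chi_square_statistic(n, sample_variance, sigma0_squared):
    """Chi-square statistic (n - 1) * s^2 / sigma0^2, with n - 1 degrees of freedom."""
    return (n - 1) * sample_variance / sigma0_squared


# Data of Example 6.9: right-tailed test at alpha = 0.05.
CHI2_CRIT = 27.587                 # chi-square_{0.05, 17}, read from a table
chi2 = chi_square_statistic(n=18, sample_variance=0.68, sigma0_squared=0.36)
reject_h0 = chi2 >= CHI2_CRIT      # right-tailed critical region

print(round(chi2, 3), reject_h0)
```

For a two-tailed test the same statistic is compared against both a lower and an upper table value.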

Example 6.10:

You have a random sample of size 20, with a standard deviation of 125. You have
good reason to believe that the underlying population is normal. Is the population
variance different from 10,000, at the 0.05 significance level?

Solution:

n = 20, s = 125, $\sigma_0^2 = 10{,}000$, α = 0.05.

(i) $H_0: \sigma^2 = 10{,}000$ vs. $H_1: \sigma^2 \ne 10{,}000$ (two-tailed test)

(ii) α = 0.05. Since we have a 2-tailed test, divide α by 2, i.e. α/2 = 0.05/2 = 0.025.

Rejection criteria: Reject H0 if
$\chi^2 > \chi^2_{\alpha/2,\,n-1}$ or $\chi^2 < \chi^2_{1-\alpha/2,\,n-1}$,
i.e. $\chi^2 > \chi^2_{0.025,\,19}$ or $\chi^2 < \chi^2_{0.975,\,19}$,
i.e. $\chi^2 > 32.852$ or $\chi^2 < 8.907$.
[Figure: $\chi^2$ curve with area 0.025 shaded in each tail, below 8.907 and above 32.852]

(iii) Test statistic:
$$\chi^2 = \frac{(n-1)s^2}{\sigma_0^2} = \frac{(20 - 1)(125^2)}{10000} = 29.688$$
Since $\chi^2 = 29.688$ does not fall in the rejection region ($8.907 < \chi^2 < 32.852$)
⇒ Accept H0.
(iv) Conclusion: $\sigma^2 = 10{,}000$ (the population variance is not significantly
different from 10,000).

B. REGRESSION & CORRELATION

6.5 The Simple Linear Regression Model

The simplest deterministic mathematical relationship between two variables x and y
is a linear relationship $y = \beta_0 + \beta_1 x$.

However, the relationship between two variables x and y may not be deterministic.


Regression analysis is the part of statistics that deals with investigation of the
relationship between two or more variables using probabilistic models.

For our discussion, we shall assume that values of the variable x are fixed by the
experimenter. The variable x is the independent (predictor, explanatory) variable.
For a fixed x, the second variable will be a random variable Y with observed value y,
referred to as the dependent (response) variable.

Usually observations will be made for a number of settings of the independent
variable. Let x1, x2, . . . , xn denote values of the independent variable for which
observations are made, and let Yi and yi respectively denote the random variable and
observed value associated with xi. The available bivariate data then consists of the n
pairs (x1, y1), (x2, y2), . . . , (xn, yn). A first step in regression analysis involving two
variables is to construct a scatter plot of the observed data. In such a plot, each (xi, yi)
is represented as a point plotted on a two-dimensional coordinate system.

Scatter Plot
A scatter plot is a useful summary of a set of bivariate data (two variables), usually
drawn before working out a linear correlation coefficient or fitting a regression line. It
gives a good visual picture of the relationship between the two variables, and aids the
interpretation of the correlation coefficient or regression model.

Each unit contributes one point to the scatter plot, on which points are plotted but not
joined. The resulting pattern indicates the type and strength of the relationship between
the two variables.

(from Statistics Glossary by Easton & McColl)


A simple linear regression model describes the linear relationship between dependent
variable Y and a single independent variable x as
𝑌 = 𝛽0 + 𝛽1 𝑥 + 𝜀

where
Y is the response variable/dependent variable
x is the explanatory variable/predictor/ independent variable
𝛽0 and 𝛽1 are the regression coefficients
𝜀 is the random error, with E[𝜀] = 0 and Var[𝜀] = 𝜎 2
𝛽0 , 𝛽1 and 𝜎 2 are parameters.

Note:
(i) $\beta_0$ is the y-intercept of the line; it has a practical interpretation only if the
scope of the model includes the value x = 0.
(ii) $\beta_1$ is the change in the mean response associated with a one-unit increase in
x. ($\beta_1$ is also the slope of the regression line.)
(iii) The true (or population) regression line $y = \beta_0 + \beta_1 x$ is the line of mean
values; for a particular x value, the height of the line is the expected value of Y for
that value of x.

Figure 1. Points corresponding to observations from the simple linear regression model

Linear models: The simplest relationship between two variables is a straight line,
thus termed as Simple Linear Regressions. By having such relationship
Y = 𝛽0 + 𝛽1 𝑥, one may be able to predict Y at unknown values of X from the
knowledge of the trend between X and Y.


Example 6.11
Suppose that in a certain chemical process the reaction time Y (hr) is related to the
temperature X (°F) in the chamber in which the reaction takes place according to the
simple linear regression model with equation Y = 5.00 − 0.01X and σ = 0.075.
a. What is the expected change in reaction time for a 1 °F increase in
temperature? For a 10 °F increase in temperature?

b. What is the expected reaction time when the temperature is 200 °F? When the
temperature is 250 °F?

Solution:
a. For a 1 °F increase in temperature, the expected change in reaction time is
$\beta_1 \times 1 = -0.01 \times 1 = -0.01$ hr.

For a 10 °F increase in temperature, the expected change in reaction time is
$\beta_1 \times 10 = -0.01 \times 10 = -0.1$ hr.
b. When X = 200 °F, the expected reaction time is 5.00 − 0.01(200) = 3.00 hr.
When X = 250 °F, the expected reaction time is 5.00 − 0.01(250) = 2.50 hr.
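
The answers to Example 6.11 follow directly from the line of mean values, as the short sketch below shows. The function name `mean_reaction_time` is an assumption for illustration; the coefficients are those given in the example.

```python
BETA0 = 5.00    # intercept of the true regression line (hr)
BETA1 = -0.01   # slope: change in mean reaction time per 1 degree F (hr)


def mean_reaction_time(temperature_f):
    """Expected reaction time E[Y] = beta0 + beta1 * x at a given temperature."""
    return BETA0 + BETA1 * temperature_f


# Part (a): the expected change depends only on the slope and the size of the increase.
change_for_1_degree = BETA1 * 1
change_for_10_degrees = BETA1 * 10

# Part (b): expected reaction times at 200 and 250 degrees F.
print(change_for_1_degree, change_for_10_degrees)
print(mean_reaction_time(200), mean_reaction_time(250))
```

The random error ε, with σ = 0.075, does not enter these calculations because it has mean zero.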

6.6 Estimating Model Parameters

Consider a given sample data set {(x1, y1), (x2, y2), …, (xi, yi), …, (xn, yn)}. Let yi be
the observed value of a random variable (rv) Yi, where $Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$. The
errors $\varepsilon_i$ are independent rv's.

If the line $y = \beta_0 + \beta_1 x$ is used to fit the model, the fitted values $\hat{y}_i$ are obtained
via $\hat{y}_i = \beta_0 + \beta_1 x_i$. The residual $e_i = y_i - \hat{y}_i = y_i - (\beta_0 + \beta_1 x_i)$ is the vertical
deviation of the point (xi, yi) from the fitted line $y = \beta_0 + \beta_1 x$.

The error sum of squares, denoted by SSE, is:

$$SSE = \sum_i \varepsilon_i^2 = \sum_i (Y_i - \hat{Y}_i)^2 = \sum_i (Y_i - \beta_0 - \beta_1 X_i)^2$$

It is used as the measure of goodness of fit.
Using the principle of least squares, we minimize this sum of squares to obtain the
estimated regression line or least squares line.
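
The minimization can be carried out explicitly. The following is the standard derivation, sketched here for completeness: SSE is differentiated with respect to each parameter, the derivatives are set to zero, and the resulting pair of linear equations (the normal equations) is solved.

```latex
\begin{aligned}
\frac{\partial\, SSE}{\partial \beta_0} &= -2\sum_i (y_i - \beta_0 - \beta_1 x_i) = 0, \\
\frac{\partial\, SSE}{\partial \beta_1} &= -2\sum_i x_i\,(y_i - \beta_0 - \beta_1 x_i) = 0.
\end{aligned}

% Rearranging gives the normal equations:
\begin{aligned}
n\beta_0 + \beta_1 \sum_i x_i &= \sum_i y_i, \\
\beta_0 \sum_i x_i + \beta_1 \sum_i x_i^2 &= \sum_i x_i y_i.
\end{aligned}

% Solving the two equations simultaneously:
\hat{\beta}_1 = \frac{\sum_i x_i y_i - \frac{1}{n}\left(\sum_i x_i\right)\left(\sum_i y_i\right)}
                     {\sum_i x_i^2 - \frac{1}{n}\left(\sum_i x_i\right)^2},
\qquad
\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}.
```

The first normal equation shows that the least squares line always passes through the point $(\bar{x}, \bar{y})$.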


A line provides a good fit to the data if the vertical distances (deviations) from the
observed points to the line are “small” (see Figure 2).

Figure 2. Deviations of observed data from line y = 𝛽0 + 𝛽1 𝑥.

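The least-squares (regression) line for the data is given by

$$y = \hat{\beta}_0 + \hat{\beta}_1 x.$$

Here the least squares estimate of the slope is

$$\hat{\beta}_1 = \frac{S_{xy}}{S_{xx}}$$

and the least squares estimate of the intercept is

$$\hat{\beta}_0 = \frac{1}{n}\left(\sum_i y_i - \hat{\beta}_1 \sum_i x_i\right) = \bar{y} - \hat{\beta}_1 \bar{x},$$

where

$$S_{xy} = \sum_i (x_i - \bar{x})(y_i - \bar{y}) = \sum_i x_i y_i - \frac{1}{n}\left(\sum_i x_i\right)\left(\sum_i y_i\right)$$

$$S_{xx} = \sum_i (x_i - \bar{x})^2 = \sum_i x_i^2 - \frac{\left(\sum_i x_i\right)^2}{n}$$

These estimates minimize SSE with respect to the linear regression parameters $(\beta_0, \beta_1)$.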


Least squares estimators of 𝛽0 and 𝛽1 given above are unbiased and have minimum
variance among all other unbiased estimators.
In computing $\hat{\beta}_0$, use extra digits (at least 4 decimal places) in $\hat{\beta}_1$, because if $\bar{x}$ is
large in magnitude, rounding will affect the final answer.
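
The least squares estimates can be computed directly from the sums. The following is a minimal sketch; the function name `least_squares_line` and the five data points are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
def least_squares_line(xs, ys):
    """Return (beta0_hat, beta1_hat) using the S_xy / S_xx formulas."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    s_xy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    s_xx = sum(x * x for x in xs) - sum_x ** 2 / n
    beta1_hat = s_xy / s_xx                        # slope
    beta0_hat = sum_y / n - beta1_hat * sum_x / n  # intercept: y_bar - beta1_hat * x_bar
    return beta0_hat, beta1_hat


# Hypothetical data, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
beta0_hat, beta1_hat = least_squares_line(xs, ys)
print(round(beta0_hat, 4), round(beta1_hat, 4))
```

Working in full floating-point precision, as here, sidesteps the rounding issue: the slope is never rounded before it is used to compute the intercept.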

The Line

After estimating the model parameters, the fitted regression line can then be written
as:
y = 𝛽̂0 + 𝛽̂1 𝑥.

Note: It must be emphasized that before 𝛽̂0 and 𝛽̂1 are computed, a scatter plot
should be examined to see whether a linear probabilistic model is plausible. If the
points do not tend to cluster about a straight line with roughly the same degree of
spread for all x, other models should be investigated. In practice, plots and regression
calculations are usually done by using a statistical computer package.

Estimating 2 and 

The parameter variance, 2, determines the amount of variability inherent in the
regression model. After a regression model has been fitted, the fitted values 𝑦̂𝑖 are
obtained via 𝑦̂𝑖 = 𝛽̂0 + 𝛽̂1 𝑥 with residuals 𝑒𝑖 = 𝑦𝑖 − 𝑦̂𝑖 .

The residuals can be used to estimate $\sigma^2$. An unbiased estimate of $\sigma^2$ is given by

$$\hat{\sigma}^2 = s^2 = \frac{SSE}{n-2}$$

where SSE is the error sum of squares:

$$SSE = S_{YY} - \Bigl(\frac{S_{XY}}{S_{XX}}\Bigr) S_{XY} = S_{YY} - \hat{\beta}_1 S_{XY}$$

where

$$S_{YY} = \sum_i y_i^2 - \frac{1}{n}\Bigl(\sum_i y_i\Bigr)^2 \quad\text{and}\quad S_{XY} = \sum_i x_i y_i - \frac{1}{n}\Bigl(\sum_i x_i\Bigr)\Bigl(\sum_i y_i\Bigr)$$
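As a check on the two ways of obtaining SSE, the sketch below computes it once from the residuals and once from the shortcut $S_{YY} - \hat{\beta}_1 S_{XY}$, then forms $s^2 = SSE/(n-2)$. The function name and the five-point data set are hypothetical, chosen only to keep the arithmetic easy to follow.

```python
def fit_and_error_variance(x, y):
    """Least squares fit plus SSE (two ways) and s^2 = SSE / (n - 2)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar

    # SSE from the residuals e_i = y_i - y_hat_i.
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    sse_residuals = sum(e ** 2 for e in residuals)
    # SSE from the shortcut formula S_yy - slope * S_xy.
    sse_shortcut = s_yy - slope * s_xy

    s_squared = sse_shortcut / (n - 2)
    return slope, intercept, sse_residuals, sse_shortcut, s_squared


# Hypothetical data, not taken from the notes.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
slope, intercept, sse_res, sse_short, s2 = fit_and_error_variance(x, y)
print(slope, intercept, sse_res, sse_short, s2)
```

Both routes to SSE agree up to floating-point rounding, which is why the shortcut formula is convenient for hand calculation.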


Steps to Solve Simple Linear Regression Problem

𝑦 = 𝛽0 + 𝛽1 𝑥

Step 1: Draw the scatter plot of the (X,Y) data for visual inspection of the
relationship that may exist between X and Y. {Note: This step can be
skipped if the scatter diagram is not required in the question.}

Step 2: Construct the following table to facilitate computation.

k      X        Y        X²        Y²        XY
1      x1       y1       x1²       y1²       x1 y1
2      x2       y2       x2²       y2²       x2 y2
...    ...      ...      ...       ...       ...
n      xn       yn       xn²       yn²       xn yn
Sum    Σ xi     Σ yi     Σ xi²     Σ yi²     Σ xi yi

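Step 3: Calculate the linear regression parameters ($\beta_0$, $\beta_1$) using the formulas below:

$$\hat{\beta}_1 = \frac{S_{XY}}{S_{XX}} \quad\text{and}\quad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$$

where

$$S_{XY} = \sum_i x_i y_i - \frac{1}{n}\Bigl(\sum_i x_i\Bigr)\Bigl(\sum_i y_i\Bigr) \quad\text{and}\quad S_{XX} = \sum_i x_i^2 - \frac{1}{n}\Bigl(\sum_i x_i\Bigr)^2$$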

Step 4: The linear regression model of the data is given by

$$Y = \hat{\beta}_0 + \hat{\beta}_1 X$$
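Steps 2 to 4 can be sketched in Python as below. The function name `regression_from_table` and the three-point data set are hypothetical; the code simply builds the computation table, sums its columns and applies the Step 3 formulas.

```python
def regression_from_table(xs, ys):
    """Follow Steps 2 to 4: build the table, sum the columns, apply the formulas."""
    n = len(xs)
    # Step 2: one row per observation with columns X, Y, X^2, Y^2, XY.
    rows = [(x, y, x * x, y * y, x * y) for x, y in zip(xs, ys)]
    sum_x, sum_y, sum_x2, sum_y2, sum_xy = (sum(col) for col in zip(*rows))

    # Step 3: S_XY and S_XX from the column sums, then slope and intercept.
    s_xy = sum_xy - sum_x * sum_y / n
    s_xx = sum_x2 - sum_x ** 2 / n
    slope = s_xy / s_xx
    intercept = sum_y / n - slope * sum_x / n

    # Step 4: the fitted line is Y = intercept + slope * X.
    return rows, (sum_x, sum_y, sum_x2, sum_y2, sum_xy), slope, intercept


# Hypothetical data lying exactly on the line Y = 1 + 2X.
rows, sums, slope, intercept = regression_from_table([1, 2, 3], [3, 5, 7])
print(sums, slope, intercept)
```

Because the sample points lie exactly on a line, the recovered slope and intercept are 2 and 1, which makes the function easy to sanity-check before using it on real data.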


Additionally, we can compute $SSE = S_{yy} - \hat{\beta}_1 S_{xy}$ and hence an unbiased
estimate of $\sigma^2$, $\hat{\sigma}^2 = s^2 = \frac{SSE}{n-2}$.

Example 6.12

A cloth manufacturer wants to determine the relationship between the thickness of a
synthetic fiber and its tensile strength. Researchers took measurements at various
pre-selected, known levels of fiber thickness, and the following data were collected.

Fiber thickness, X     40  31  34  44  49  36  41  50  39  45
Tensile strength, Y    83  74  72  70  75  73  70  76  79  72

If the fiber thickness were 45, what would be the predicted tensile strength?
In addition, give an estimate of the standard deviation of the model error.

Solution:

Consider the simple linear regression model to fit the data

$$Y = \beta_0 + \beta_1 X$$

Step 1: Draw the scatter plot of the (X,Y) data for visual inspection of the
relationship that may exist between X and Y. {Note: can be skipped}

[Figure: scatter plot of tensile strength Y (axis from 70 to 85) against fiber thickness X (axis from 30 to 50).]


Step 2: Construct the following table to facilitate computation.

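k      X      Y      X²       Y²       XY
1      40     83     1600     6889     3320
2      31     74     961      5476     2294
3      34     72     ...      ...      ...
4      44     70     ...      ...      ...
5      49     75     ...      ...      ...
6      36     73     ...      ...      ...
7      41     70     ...      ...      ...
8      50     76     ...      ...      ...
9      39     79     ...      ...      ...
10     45     72     ...      ...      ...
Sum    409    744    17077    55504    30436

That is, $\sum_i x_i = 409$, $\sum_i y_i = 744$, $\sum_i x_i^2 = 17077$, $\sum_i y_i^2 = 55504$ and $\sum_i x_i y_i = 30436$.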

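Step 3: Calculate the linear regression parameters ($\beta_0$, $\beta_1$):

Using the table above with n = 10, we determine

$$\bar{x} = \frac{1}{n}\sum_i x_i = 40.9, \qquad \bar{y} = \frac{1}{n}\sum_i y_i = 74.4$$

and

$$\hat{\beta}_1 = \frac{S_{XY}}{S_{XX}} = \frac{6.4}{348.9} = 0.01834$$

$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} = 74.4 - (0.01834)(40.9) = 73.6499$$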

Step 4: The linear regression model of the data is given by

$$Y = 73.6499 + 0.0183X$$

When thickness x = 45, the model predicts tensile strength Y to be

Y = 73.6499 + 0.0183(45) = 74.4734#


For SSE and the estimate of $\sigma^2$:

$$S_{yy} = \sum_i y_i^2 - \frac{\bigl(\sum_i y_i\bigr)^2}{n} = 55504 - \frac{744^2}{10} = 150.4$$

$$SSE = S_{yy} - \hat{\beta}_1 S_{xy} = 150.4 - (0.01834)(6.4) = 150.28$$

$$\hat{\sigma}^2 = s^2 = \frac{150.28}{8} = 18.785$$

An estimate of the standard deviation ($\sigma$) of the model error is $\sqrt{18.785} = 4.33$.
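The figures in this example can be checked with a few lines of Python. This is a verification sketch using only the ten data pairs given in the example; the variable names are illustrative.

```python
import math

thickness = [40, 31, 34, 44, 49, 36, 41, 50, 39, 45]  # X
strength = [83, 74, 72, 70, 75, 73, 70, 76, 79, 72]   # Y
n = len(thickness)

sum_x, sum_y = sum(thickness), sum(strength)
s_xx = sum(x * x for x in thickness) - sum_x ** 2 / n
s_yy = sum(y * y for y in strength) - sum_y ** 2 / n
s_xy = sum(x * y for x, y in zip(thickness, strength)) - sum_x * sum_y / n

slope = s_xy / s_xx
intercept = sum_y / n - slope * sum_x / n
predicted_at_45 = intercept + slope * 45

sse = s_yy - slope * s_xy
s = math.sqrt(sse / (n - 2))  # estimate of the model error standard deviation

print(f"S_xx={s_xx:.1f}, S_yy={s_yy:.1f}, S_xy={s_xy:.1f}")
print(f"slope={slope:.5f}, intercept={intercept:.4f}")
print(f"prediction at x=45: {predicted_at_45:.4f}")
print(f"SSE={sse:.2f}, s={s:.2f}")
```

Because the code carries the slope at full precision, the intercept and the prediction differ from the hand-worked values in the third or fourth decimal place (about 74.475 rather than 74.4734), which is the rounding effect noted in Section 6.6.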

Note: For the above example

1) $\beta_0$ has no practical meaning here, since the range of the sample data does not include x = 0.

2) Within the scope of the model, we have a linear relationship between x and y.

3) We should not make inferences about the relationship between x and y for values
outside the range of the sample data.

Example 6.13

A chemical engineer is investigating the effect of process operating temperature on
product yield. The study results in the following data:

Temperature, °C (x)    100  110  120  130  140  150  160  170  180  190
Yield, % (y)            45   51   54   61   66   70   74   78   85   89

These pairs of points are plotted in Fig. 14-1. Such a display is called a scatter
diagram. Examination of this scatter diagram indicates that there is a strong
relationship between yield and temperature, and the tentative assumption of the
straight-line model $y = \beta_0 + \beta_1 x + \varepsilon$ appears to be reasonable. Find the regression
line equation that represents this set of data.


Solution:
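A minimal sketch of one way to obtain the line, assuming the Section 6.6 formulas are applied directly to the ten data pairs above. The variable names are illustrative and the code is not taken from the original notes.

```python
temperature = [100, 110, 120, 130, 140, 150, 160, 170, 180, 190]  # x, deg C
yield_pct = [45, 51, 54, 61, 66, 70, 74, 78, 85, 89]              # y, %
n = len(temperature)

x_bar = sum(temperature) / n
y_bar = sum(yield_pct) / n
s_xx = sum((x - x_bar) ** 2 for x in temperature)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(temperature, yield_pct))

slope = s_xy / s_xx
intercept = y_bar - slope * x_bar
print(f"y_hat = {intercept:.5f} + {slope:.5f} x")
```

Run as written, this gives $S_{xx} = 8250$, $S_{xy} = 3985$ and a fitted line of approximately $\hat{y} = -2.7394 + 0.4830x$.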


6.7 The Coefficient of Determination & Correlation

6.7.1 The Coefficient of Determination

The sample coefficient of determination, $r^2$, represents the proportion of the total
variation of the variable Y that can be explained by a linear relationship with the
values of X. It is widely used to assess how well a regression line fits the data, in other
words, how close the points are to the regression line.

A quantitative measure of the total amount of variation in the observed y values is given
by the total sum of squares, SST.

SSE = the sum of squared deviations about the least squares line $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$,
SST = the sum of squared deviations about the horizontal line at height $\bar{y}$ (that is, SST = $S_{yy}$),
SSE/SST = the proportion of total variation that cannot be explained by the simple
linear regression model,
1 – SSE/SST = the proportion of observed y variation explained by the model.

Thus, $r^2 = 1 - SSE/SST$.

The higher the value of $r^2$, the more successful the simple linear regression model is
in explaining the variation in y.
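The definition $r^2 = 1 - SSE/SST$ can be applied directly, as in the sketch below. The function name is hypothetical; the two data sets are those of Examples 6.12 and 6.13.

```python
def coefficient_of_determination(x, y):
    """Return r^2 = 1 - SSE/SST for the least squares line through (x, y)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar

    # SSE: squared deviations about the fitted line.
    sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    # SST: squared deviations about the horizontal line at height y_bar.
    sst = sum((yi - y_bar) ** 2 for yi in y)
    return 1 - sse / sst


# Data from Example 6.12 (fiber thickness, tensile strength).
r2_fiber = coefficient_of_determination(
    [40, 31, 34, 44, 49, 36, 41, 50, 39, 45],
    [83, 74, 72, 70, 75, 73, 70, 76, 79, 72],
)
# Data from Example 6.13 (temperature, yield).
r2_yield = coefficient_of_determination(
    [100, 110, 120, 130, 140, 150, 160, 170, 180, 190],
    [45, 51, 54, 61, 66, 70, 74, 78, 85, 89],
)
print(f"Example 6.12: r^2 = {r2_fiber:.4f}")
print(f"Example 6.13: r^2 = {r2_yield:.4f}")
```

The fiber data give an $r^2$ close to 0, so the fitted line explains almost none of the variation in strength, while the yield data give an $r^2$ close to 1.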


6.7.2 Correlation

Correlation analysis is used to measure the strength of linear relation between X and
Y by means of a single number called a correlation coefficient.

The population correlation coefficient $\rho$ is defined as

$$\rho = \frac{\sigma_{XY}}{\sqrt{\sigma_{XX}\,\sigma_{YY}}}, \quad\text{with } -1 \le \rho \le +1.$$

Some useful properties of the correlation coefficient:

• $\rho = \pm 1$ occurs only when there is a perfect linear relationship between the two
variables.
• $\rho = +1$ implies a perfect linear relationship with a positive slope ($\beta_1 > 0$).
• $\rho = -1$ implies a perfect linear relationship with a negative slope ($\beta_1 < 0$).

Thus, if a sample's correlation coefficient is close to unity in magnitude, this
implies a good correlation or linear association between X and Y, whereas values
near zero indicate little or no correlation.

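Sample Estimate of correlation coefficient

The sample estimate of the correlation coefficient, r, is defined as

$$r = \frac{S_{XY}}{\sqrt{S_{XX}\,S_{YY}}} \quad\text{or}\quad r = \pm\sqrt{r^2}$$

where the sign of r is the sign of $\hat{\beta}_1$. The value of r ($-1 \le r \le 1$) measures how strong the linear relationship between X
and Y is.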


[Figure: four scatter plots of Y against X.]

A value of r near 0 does not show that X and Y are unrelated; it indicates only the
absence of a linear relationship or correlation.
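A small sketch makes this point concrete: below, Y is determined exactly by X through $y = x^2$, yet the sample correlation is zero because the relationship is not linear. The function name and the data are hypothetical.

```python
import math


def sample_correlation(x, y):
    """Sample correlation r = S_xy / sqrt(S_xx * S_yy)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return s_xy / math.sqrt(s_xx * s_yy)


# Hypothetical data: y depends exactly on x, but through a parabola.
x = [-3, -2, -1, 0, 1, 2, 3]
y = [xi ** 2 for xi in x]
r = sample_correlation(x, y)
print(f"r = {r:.4f}")
```

The x values are symmetric about zero, so the positive and negative products in $S_{xy}$ cancel, which is why a scatter plot should be examined alongside r.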
A correlation is considered weak if the magnitude of r falls between 0 and 0.5, strong if it
falls between 0.8 and 1, and moderate otherwise. The table below summarizes the
interpretation of r:

Range of r          Interpretation
-1 to -0.8          Strong negative relationship
-0.8 to -0.5        Moderate negative relationship
-0.5 to 0           Weak negative relationship
0 to 0.5            Weak positive relationship
0.5 to 0.8          Moderate positive relationship
0.8 to 1            Strong positive relationship


Example 6.14

Compute the correlation coefficient between X (test grade) and Y (number of years) if
$S_{XX} = 10.5$, $S_{YY} = 1504.1$ and $S_{XY} = 114.5$.

Solution:

$$r = \frac{S_{XY}}{\sqrt{S_{XX}\,S_{YY}}} = \frac{114.5}{\sqrt{10.5 \times 1504.1}}$$

= 0.9111#

In conclusion, the r value of 0.9111 shows a strong positive correlation between test grade
and the number of years.
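The calculation and the verbal label can be reproduced with the sketch below. Both function names are hypothetical, and the cut-offs 0.5 and 0.8 are the ones from the interpretation scale in Section 6.7.2 (0 to 0.5 weak, 0.8 to 1 strong, moderate otherwise).

```python
import math


def correlation_from_sums(s_xy, s_xx, s_yy):
    """Sample correlation r = S_XY / sqrt(S_XX * S_YY)."""
    return s_xy / math.sqrt(s_xx * s_yy)


def describe_strength(r):
    """Label r using the 0.5 and 0.8 cut-offs of the interpretation scale."""
    if r == 0:
        return "no linear correlation"
    size = abs(r)
    if size >= 0.8:
        strength = "strong"
    elif size > 0.5:
        strength = "moderate"
    else:
        strength = "weak"
    direction = "positive" if r > 0 else "negative"
    return f"{strength} {direction}"


# Values given in Example 6.14.
r = correlation_from_sums(s_xy=114.5, s_xx=10.5, s_yy=1504.1)
print(f"r = {r:.4f} ({describe_strength(r)})")
```

How the boundary values 0.5 and 0.8 are assigned is a choice made here to match the wording of the scale; other texts may draw the lines differently.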


Statistical Tables
