
Chem 75

Winter, 2015

An Introduction to Error Analysis


Introduction
The following notes (courtesy of Prof. Ditchfield) provide an introduction to quantitative error
analysis: the study and evaluation of uncertainty in measurement. Before doing your first lab write-up, you will put some of these concepts to work by analyzing some data in a homework
assignment. Your answers to this assignment should be handed in along with other assigned
problems for the first class assignment. To the extent possible, a quantitative error analysis should
be included in each lab write-up.
Your experience in the laboratory to date has likely shown you that no measurement, no
matter how carefully it is made, can be completely free of uncertainties. Thus, the usefulness of
numerical scientific results is severely limited unless their uncertainty is known and communicated.
There is no single "correct" way to do this, but there are lots of wrong ways. These notes describe
some conventional ways of measuring and reporting experimental uncertainty.

Estimating Uncertainties
Single Measurements
Frequently you will have only one or two measurements of some experimental observable. In
such cases, you must simply make estimates of the reasonable uncertainty in each of the operations
that went into the measurements. For example, typical uncertainties associated with some
commonly used equipment are:
Mettler balance    ±0.0001 g
pipette            ±0.01 mL
burette            ±0.02 mL
etc.

In some cases, your estimate of an uncertainty will simply be an educated guess, but in others it
may be worthwhile to do some experiments to determine the uncertainties of some of the devices
used. The trick is to estimate the uncertainty large enough to avoid overstating the reliability of the
results, but at the same time small enough not to wipe out important scientific conclusions.
Developing this judgment is actually a very important part of the skill of a good scientific
investigator.
Repeatable Measurements
In general, if the uncertainty in an experimental measurement is random, it can be estimated
more reliably if the measurement can be repeated several times. Suppose, for example, we measure
the time (t_blue) for the blue color to disappear in the iodination of cyclohexanone (an experiment
that many of you did in the Chemistry 6 laboratory) and find 23.5 min. From this single
measurement we can't say much about the experimental uncertainty. However, if we repeat the
experiment and get 23.6 min, then we can say that the uncertainty is probably of the order of 0.1
min. If a sequence of four timings gives the following results (in minutes)
23.5, 23.6, 23.7, 23.6,
then we can begin to make more realistic estimates.
The first step is to calculate the best estimate of t_blue as the average or mean value, 23.6 min.
As a second step, we make a reasonable assumption that the correct value of t_blue lies between the
lowest value, 23.5 min, and the highest value, 23.7 min. Thus, we might reasonably conclude that the
best estimate is 23.6 min with a probable range of 23.5 to 23.7 minutes. These results can be
expressed in the compact form

t_blue = 23.6 ± 0.1 min.

In general, the result of any measurement of a quantity x is stated as

(measured value of x) = x̄ ± Δx,

where x̄ denotes the mean value of x, and Δx is an estimate of the uncertainty in x.
When a measurement can be repeated several times, the spread in the measured values gives a
valuable indication of the uncertainty in the measured quantity. In a later section of these notes,
statistical methods for treating such repeated measurements are outlined. Such methods can give a
more accurate estimate of uncertainty than that presented above, and, moreover, give a much more
objective value of the uncertainty.
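To make this concrete, here is a minimal Python sketch that computes the mean and the crude half-range uncertainty for the four timings quoted above (the variable names are illustrative):

```python
# Estimate the mean and a crude range-based uncertainty for repeated timings.
from statistics import mean

timings = [23.5, 23.6, 23.7, 23.6]             # t_blue values, in minutes

t_mean = mean(timings)                          # best estimate: 23.6 min
half_range = (max(timings) - min(timings)) / 2  # crude uncertainty: 0.1 min

print(f"t_blue = {t_mean:.1f} +/- {half_range:.1f} min")
```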
Precision and Accuracy
Precision refers to the amount of scatter in a set of numbers presumed to measure the same
quantity. For example, suppose you repeatedly placed the same coin on an analytical balance. The
scatter in your results would define the precision of the measurement. Accuracy refers to the
degree to which the set of numbers represents the "true" value of the quantity. If, in the above
example, the balance were not properly leveled, it might give good precision, but the value could be
wrong, thereby leading to poor accuracy.
Random and Systematic Errors
The scatter of results which leads to the concept of precision is attributed to random errors
which are presumed to originate from external influences that:
1. are large in number,
2. are unknown and therefore uncontrollable and unpredictable,
3. cause relatively small individual effects, and


4. act independently.
Defects in accuracy are often caused by systematic errors in a set of data. These refer to a
consistent shift of all values away from the "true" value, such as might be caused by a faulty
measuring device or other undetected external influence which acts in the same direction on all
measurements. For example, suppose the clock used to measure t_blue was running consistently
10% slow. Then, all timings made with it will be 10% too small, and repetitive measurements (with
the same clock) will not reveal this deficiency.
The distinction between random and systematic error is rarely sharp, and much "random
error" can be attributed to inadequate experimental design. In some experiments it is worth
considerable effort to determine whether the "random error" is indeed random.
Significant Figures
Several basic rules for reporting uncertainties are worth emphasizing. Because the quantity
Δx is an estimate of the uncertainty, it should not be stated with too much precision. It would be
inappropriate to state the rate constant for iodination of cyclohexanone, for example, in the form

k_measured = (1.9 ± 0.1046) × 10⁻² M⁻¹ min⁻¹.

The uncertainty in the measurement cannot possibly be known to four significant figures. In the
above case, the result should be reported as

k_measured = (1.9 ± 0.1) × 10⁻² M⁻¹ min⁻¹
A rough indication of precision is given by the use of significant figures, following these rules.
1. Write a number with all digits known to be correct, plus one doubtful figure.
2. In addition and subtraction, the number of significant figures in the result is limited by the
component term with the largest absolute uncertainty: (5.25 ± 0.04) + (6.386 ± 0.001) = 11.64 ± 0.04.
3. In multiplication and division, the result is limited by the term with the largest relative or
fractional uncertainty: (5.0 ± 0.5) × (6.0 ± 0.1) = 30 ± 3.
In this case, the relative or fractional uncertainty in the first term is (0.5/5.0), i.e., 5 parts in 50,
which is a relative uncertainty of 10%. The relative uncertainty in the second term is a little less
than 2%. Thus, here, the relative uncertainty in the product cannot be less than 10% and is
dominated by that in the first term. A systematic approach for treating the combination of such
uncertainties is presented in the later section on the propagation of errors.
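As a quick numerical check of rule 3, the sketch below compares the two relative uncertainties; the max() rule here is the approximate significant-figure rule, not the rigorous quadrature formula developed later:

```python
# Compare relative uncertainties in a product; the larger one dominates.
a, da = 5.0, 0.5    # 10% relative uncertainty
b, db = 6.0, 0.1    # about 1.7% relative uncertainty

product = a * b
rel_unc = max(da / a, db / b)        # sig-fig rule: dominated by largest term
print(product, product * rel_unc)    # 30.0, ~3.0 -> report 30 +/- 3
```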
Statistical Analysis of Data


As mentioned above, the reliability of an estimate of uncertainty in a measurement can be
improved if the measurement is repeated many times. The first problem in reporting the results of
many repeated measurements is to find a concise way to record and display the values obtained.
This is a problem that you may have encountered previously. For example, suppose you measured
the masses of all the pennies minted since 1982. Clearly, you would not always find the same
value, since there would be some variation in alloy composition, stamping pressure, care of
handling, etc. One way to display the results is to construct a bin histogram as shown in Figure 1
for a sample of 25 pennies. Here, one divides the range of values into a convenient number of
intervals or "bins" of width Δ (equal to 0.01 g in this example), and counts the number of values in
each "bin." One then plots the data in such a way that the fraction of measurements that fall in each
bin is indicated by the area of the rectangle drawn above the bin. That is, the height P(k) of the
rectangle drawn above the kth bin is chosen so that

rectangular area = P(k) × Δ = fraction of measurements in the kth bin.

For example, the shaded area above the interval from mass = 2.50 to 2.51 g has area 20 × 0.01 =
0.2, indicating that one fifth of the 25 masses fell in this interval.
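Such a plot is easy to generate in practice; the sketch below, assuming simulated penny masses, uses matplotlib's density option so that each bar's area equals the fraction of measurements in its bin:

```python
# Build a bin histogram like Figure 1 from simulated penny masses.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
masses = rng.normal(loc=2.50, scale=0.015, size=25)  # hypothetical masses, g

bins = np.arange(2.45, 2.56, 0.01)           # bin width 0.01 g
plt.hist(masses, bins=bins, density=True,    # density: bar AREA = fraction
         edgecolor="black")
plt.xlabel("Mass/g")
plt.ylabel("P(k)")
plt.show()
```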
[Figure 1: bin histogram of the masses of a sample of N = 25 pennies; the height P(k) is plotted against Mass/g.]
Such a plot gives us a visual representation of how the masses of pennies are distributed. In
most experiments, as the number of measurements increases, the histogram begins to take on a
definite simple shape, and as the number of measurements approaches infinity, their distribution
approaches some definite, continuous curve, the so-called limiting distribution. If a measurement is
subject to many small sources of random error and negligible systematic error, the limiting

distribution will have the form of the smooth bell-shaped curve shown in Figure 2. This curve will
be centered on the true value of the measured quantity.

[Figure 2: the bell-shaped limiting distribution P(x) plotted against x.]
In the general case, this limiting distribution defines a function which we will call P(x). From
the symmetry of the bell-shaped curve, we see P(x) is centered on the average value of x. Thus, if
we knew the limiting distribution we could calculate the mean value, given the symbol μ, that would
be found after an infinite number of measurements. This is defined as

μ = lim_{N→∞} (x_1 + x_2 + x_3 + ... + x_N)/N = lim_{N→∞} (1/N) Σ_{i=1}^{N} x_i   (1)

where x_i is the ith of N measurements (i = 1, 2, 3, ..., N) and the symbol Σ is the standard
summation notation. If the data represent repeated measurements of the same quantity (such as the
mass of one penny), then μ represents the true value, but only in the absence of systematic
errors.
If the measurement of interest can be made with high precision, the majority of the values
obtained will be very close to the true value of x, and the limiting distribution will be narrowly
peaked about the value μ. In contrast, if the measurement of interest is of low precision, the values
found will be widely spread and the distribution will be broad, but still centered on the value μ.
Thus, we see that the breadth of the distribution not only provides us with a very visual
representation of the uncertainty in our measurement, but also defines another important measure
of the distribution.
How do we quantify this measure of the distribution? The spread of values about μ is
characterized by the standard deviation σ, defined for an infinite number of measurements as

σ = lim_{N→∞} √[(1/N) Σ_{i=1}^{N} (x_i − μ)²]   (2)

The standard deviation, σ, characterizes the average uncertainty in each of the measurements x_1, x_2,
x_3, ..., x_N from which μ and σ were calculated. Clearly, P(x), μ, and σ are related. Gauss showed
that, for randomly distributed errors, the limiting distribution function (the bell-shaped curve) is
related to μ and σ by the equation:

P_{μ,σ}(x) = (1/(σ√(2π))) exp[−(x − μ)²/(2σ²)]   (3)

Here, the subscripts μ and σ have been added to P(x) to indicate the center and width of the
distribution. Measurements whose limiting distribution is given by the Gauss function are said to
be normally distributed.
The significance of this function is shown by Figure 3. The fraction of measurements that
fall in any small interval x to x + dx is equal to the area P_{μ,σ}(x) dx of the white strip in Figure 3(a).

[Figure 3(a): the Gaussian P_{μ,σ}(x) with the narrow strip of area P_{μ,σ}(x) dx between x and x + dx shaded. Figure 3(b): the same curve with the area between x = a and x = b shaded.]
More generally, the fraction of measurements that fall between any two values a and b is the total
area under the graph between x = a and x = b as in Figure 3(b). This area is just the definite
integral of P_{μ,σ}(x). Thus, we have the important result that, after we have made a very large number
of measurements,

∫_a^b P_{μ,σ}(x) dx = fraction of measurements that fall between x = a and x = b.   (4)
Similarly, the integral ∫_a^b P_{μ,σ}(x) dx defines the probability that any one measurement will lie
between x = a and x = b. Because the total probability of our measurement falling anywhere
between −∞ and +∞ must be unity, the limiting distribution function P_{μ,σ}(x) must satisfy

∫_{−∞}^{+∞} P_{μ,σ}(x) dx = 1   (5)

A function satisfying equation (5) is said to be normalized.


Thus, the probability that any one x value lies between the limits x = μ − σ and x = μ + σ is
the area under the Gaussian curve between these limits. If one computes (by integration) such areas
for various choices of the limits, one can show that the probability of finding any one measurement of x
between various limits, measured as multiples of the standard deviation σ, is given by the data
presented in Table 1:

Table 1. Gaussian Probability Intervals

Probability        Interval
0.50               μ − 0.674σ < x < μ + 0.674σ
0.68               μ − 1.000σ < x < μ + 1.000σ
0.80               μ − 1.282σ < x < μ + 1.282σ
0.90               μ − 1.645σ < x < μ + 1.645σ
0.95               μ − 1.960σ < x < μ + 1.960σ
0.99               μ − 2.576σ < x < μ + 2.576σ
0.999              μ − 3.291σ < x < μ + 3.291σ

This table says that, for example, we can be 95% confident that any one measurement will lie within
approximately 2σ of the mean (where we have approximated 1.960σ by 2σ). We can thus think of the
Probability column in the table as a confidence level and the Interval column as a corresponding
confidence interval.
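The entries of Table 1 can be checked with a few lines of Python: for a Normal distribution, the probability of falling within z standard deviations of the mean is erf(z/√2).

```python
# Verify Table 1: P(|x - mu| < z*sigma) = erf(z / sqrt(2)) for a Gaussian.
from math import erf, sqrt

for z in (0.674, 1.000, 1.282, 1.645, 1.960, 2.576, 3.291):
    print(f"mu +/- {z} sigma: probability {erf(z / sqrt(2)):.3f}")
```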
Although this analysis is elegant and all looks very straightforward, it would not be
unreasonable for you to feel somewhat perplexed at this stage, since we can only know μ and σ if
we can make an infinite number of measurements! This would clearly make for very long lab
periods! Practically, we always sample only a finite number of all possible measurements. Thus,
we need to know how the mean and standard deviation for a finite number of measurements are
related to μ and σ.


For any finite number of measurements, the mean of those measurements will, in general,
depend on how many measurements are made. To distinguish the mean of a particular finite set of
measurements from the mean of an infinite number, we will use a different symbol:

x̄ = (x_1 + x_2 + x_3 + ... + x_N)/N = (1/N) Σ_{i=1}^{N} x_i   (6)

For a finite number of measurements, the experimental standard deviation s is defined as

s = √[(1/(N − 1)) Σ_{i=1}^{N} (x_i − x̄)²]   (7)

(Note that s is defined with a factor N − 1 in the denominator rather than N. As N approaches ∞, N
− 1 approaches N, but for finite N one uses N − 1 simply because the calculation of x̄ has used up
one independent piece of information. Initially, all N of the x_i values are independent of each other;
they were made as N independent measurements. But once we compute x̄, we lose one independent
piece of information in the sense that we can calculate any one x_i given x̄ and the N − 1 other data.)
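Python's standard library computes both quantities directly; note that statistics.stdev uses the N − 1 definition of equation (7):

```python
# Mean (eq. 6) and experimental standard deviation s (eq. 7, N - 1 denominator).
from statistics import mean, stdev

data = [23.5, 23.6, 23.7, 23.6]
print(mean(data))    # x-bar = 23.6
print(stdev(data))   # s = 0.0816...
```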
Thus, our goal is to determine how x̄ is related to μ and how s is related to σ. Consider a
finite number (N) of measurements of x with the results: x_1, x_2, ..., x_N. The problem we confront
is to determine the best estimates of μ and σ based on these N measured values. If the
measurements follow a Normal distribution P_{μ,σ}(x), and if we knew the parameters μ and σ, we
could easily calculate the probability of obtaining the values x_1, x_2, ..., x_N that were actually
measured. The probability of finding a value of x within a small interval dx_1 of x_1 is given by:

Prob(x between x_1 and x_1 + dx_1) ≡ Prob(x_1) = (1/(σ√(2π))) exp[−(x_1 − μ)²/(2σ²)] dx_1   (8)

Similarly, the probability of finding a value within a small interval dx_2 of x_2 is given by:

Prob(x between x_2 and x_2 + dx_2) ≡ Prob(x_2) = (1/(σ√(2π))) exp[−(x_2 − μ)²/(2σ²)] dx_2.   (9)

Since these probabilities are uncorrelated, the simultaneous probability of finding x_1 in the range
x_1 to x_1 + dx_1, x_2 in the range x_2 to x_2 + dx_2, x_3 in the range x_3 to x_3 + dx_3, etc. is given by:

Prob(x_1, x_2, x_3, ..., x_N) = Prob(x_1) × Prob(x_2) × ... × Prob(x_N)   (10)

or

Prob_{μ,σ}(x_1, x_2, ..., x_N) ∝ (1/σ^N) exp[−Σ_{i=1}^{N} (x_i − μ)²/(2σ²)]   (11)
In equation (11), the numbers μ and σ are not known; we want to find the best estimates for μ
and σ based on the given observations x_1, x_2, ..., x_N. We might start by guessing values of μ and σ
(call them μ′ and σ′) and computing the probability Prob_{μ′,σ′}(x_1, x_2, ..., x_N). The next step would
be to guess new values μ″ and σ″ and compute the probability Prob_{μ″,σ″}(x_1, x_2, ..., x_N). If
Prob_{μ″,σ″}(x_1, x_2, ..., x_N) was larger than Prob_{μ′,σ′}(x_1, x_2, ..., x_N), then μ″ and σ″ would be better
estimates of μ and σ: the best estimates of μ and σ are those values for which the observed x_1, x_2,
..., x_N are most likely. Continuing in this way, we would select different values for μ and σ to make
the probability Prob_{μ,σ}(x_1, x_2, ..., x_N) as large as possible. That is, we wish to maximize the value
of Prob_{μ,σ}(x_1, x_2, ..., x_N) with respect to variations in μ and σ.
Using this approach, we can easily find the best estimate for the true value μ. From equation
(11), the probability Prob_{μ,σ}(x_1, x_2, ..., x_N) is a maximum when the sum in the exponent is a
minimum. That is, the best estimate for μ is that value of μ for which

Σ_{i=1}^{N} (x_i − μ)²/(2σ²)   (12)

is a minimum. To locate this minimum, we differentiate equation (12) with respect to μ and set the
derivative equal to zero, giving

Σ_{i=1}^{N} (x_i − μ) = Σ_{i=1}^{N} x_i − Nμ = 0   (13)

or

(best estimate for μ) = (1/N) Σ_{i=1}^{N} x_i   (14)

Thus, we have shown that the best estimate for the true mean μ is simply the mean of our N
measurements, x̄ = (1/N) Σ_{i=1}^{N} x_i.
Proceeding in a similar manner, we obtain

(best estimate for σ) = √[(1/N) Σ_{i=1}^{N} (x_i − μ)²].   (15)

Since the true value of μ is unknown, in practice we have to replace μ in equation (15) by our best
estimate for μ, namely the mean value x̄. Because the calculation of x̄ has used up one independent
piece of information, the factor of N in the denominator of equation (15) must also be replaced
by N − 1. Thus, in practice,

(best estimate for σ) = √[(1/(N − 1)) Σ_{i=1}^{N} (x_i − x̄)²] = s   (16)

At this point, it is probably worthwhile to summarize our progress to date. If the measurements
of x are subject only to random errors, their limiting distribution is the Gaussian or Normal
distribution P_{μ,σ}(x) centered on the true value μ, and with width parameter σ. The width σ is the
68% confidence level, in that there is a 68% probability that any measurement of x will fall within σ
of the true value μ. In practice, neither μ nor σ is known. The data available are the N measured
values x_1, x_2, x_3, ..., x_N, where N is as large as our time and patience (and research budget!) allow.
Based on these N measured values, we have shown that the best estimate of μ is the mean value x̄,
and the best estimate of σ is the standard deviation s.
Several other questions remain. As mentioned above, the standard deviation, s, characterizes
the average uncertainty in each of the measurements x_1, x_2, x_3, ..., x_N. We may also ask, what is the
uncertainty in the mean value x̄? How is this uncertainty related to s? This question can be
answered by considering the following experiment. We start by weighing a penny N times and
determine x̄ and s for this set of measurements. Now suppose that we repeat this experiment M
times. For each of the M data sets we would compute a value of x̄ and s, and, in general, each of
these M values of x̄ would be different. We could now average the values of x̄ to give a mean of
means, but this value would be the same number that would result from analyzing the combined
data sets (N × M values) from the M experiments. We could also compute the standard deviation
of the means. This number, which we will call s_m, can be shown to have the following simple
relationship to s, which characterizes the uncertainty in each experimental measurement:

s_m = s/√N   (17)

That is, s characterizes the uncertainty associated with each experimental measurement, while sm
characterizes the uncertainty associated with the mean of any one set of N measurements. Clearly,
the more times we measure a quantity, the better we know its mean, but as the equation above
shows, the uncertainty decreases only as the square root of the number of measurements.
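Equation (17) is easy to verify by simulation; the sketch below draws M hypothetical experiments of N measurements each and compares the scatter of the means with σ/√N:

```python
# Simulate M experiments of N measurements each and check s_m = s / sqrt(N).
import numpy as np

rng = np.random.default_rng(1)
N, M, sigma = 4, 10000, 0.08
samples = rng.normal(loc=23.6, scale=sigma, size=(M, N))

means = samples.mean(axis=1)
print(means.std(ddof=1))       # scatter of the M means ...
print(sigma / np.sqrt(N))      # ... is close to sigma / sqrt(N) = 0.04
```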
We now see how we might report the results of our experimental measurements of x. If we
wish to report a value of x at the 68% confidence level, we might report it as:
(value of x) = x̄ ± s_m.   (18)

On the other hand, if we wish to report a value for x at the 95% confidence level frequently used in
scientific reports, we might report it as:
(value of x) = x̄ ± 2s_m.   (19)
Note that the multipliers of s_m in equations (18) and (19) are taken from Table 1. This raises one last
issue to address in our discussion of statistical methods. The data in Table 1 allow us to relate a
confidence level to some multiple of σ, which we do not know, rather than to some multiple of s,
which we can determine. That is, the values given in Table 1 are only valid for the limiting
distribution resulting from an infinite set of measurements. For a finite number of measurements,
what is the correct relationship between the confidence level and s?
The connection between the confidence level and s was made by W. S. Gosset, a mathematician
who worked in the quality control department of a British brewery. Apparently, the company
realized the importance of Gosset's work to both the general scientific community and to their own
business, because they allowed him to publish his results, but only under a pseudonym! The
pseudonym Gosset used was "A. Student," and the critical quantity of his analysis, which he
denoted by the symbol t in his papers, is to this day known as Student's t.
Gosset found a function (which cannot be evaluated analytically, but is tabulated from the
result of numerical calculations) that lets us compute the one number we need to relate a measured,
experimental standard deviation, s, to a confidence level. In short, Gosset found that the true value
of x fell in the interval

x̄ − t s/√N < (true value of x) < x̄ + t s/√N   (20)

where t is found, at various confidence levels, from Table 2 below. Note that t depends not only on
the confidence level, but also on N, and note as well that as N approaches ∞, t approaches the
confidence-interval values tabulated earlier from the Gaussian function.
In this course, we will use the common 95% confidence level (i.e., the 0.95 column of Table
2), and we will approximate t for any value of N > 15 by t = 2.0.
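A sketch of this recipe in Python, assuming SciPy is available; scipy.stats.t.ppf takes the number of degrees of freedom, which by the standard convention is N − 1 for N measurements:

```python
# 95% confidence interval for the mean using Student's t (eq. 20).
import numpy as np
from scipy import stats

data = np.array([23.5, 23.6, 23.7, 23.6])
N = len(data)
xbar, s = data.mean(), data.std(ddof=1)

t95 = stats.t.ppf(0.975, df=N - 1)      # two-sided 95% confidence level
half_width = t95 * s / np.sqrt(N)
print(f"{xbar:.2f} +/- {half_width:.2f} min (95% confidence)")
```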

Table 2. Table of Student's t factors

                 Confidence Level
  N       0.50      0.90      0.95      0.99
  1       1.00      6.31     12.71     63.66
  2       0.82      2.92      4.30      9.93
  3       0.77      2.35      3.18      5.84
  4       0.74      2.13      2.78      4.60
  5       0.73      2.02      2.57      4.03
  6       0.72      1.94      2.45      3.71
  7       0.71      1.90      2.37      3.50
  8       0.71      1.86      2.31      3.36
  9       0.70      1.83      2.26      3.25
 10       0.70      1.81      2.23      3.17
 11       0.70      1.80      2.20      3.11
 12       0.70      1.78      2.18      3.06
 13       0.69      1.77      2.16      3.01
 14       0.69      1.76      2.15      2.98
 15       0.69      1.75      2.13      2.95
  ∞       0.67      1.65      1.96      2.58

Adapted from Handbook of Mathematical Functions, edited by M. Abramowitz and I. A. Stegun, Dover Publications, Inc.,
New York, 1972.

Propagation of Uncertainties
Suppose the observable an experiment is designed to determine is not measured directly, but
is derived from other measured experimental variables through some explicit functional
relationship. For example, the Bomb Calorimetry experiment does not measure the molar internal
energy of combustion of a hydrocarbon directly, but derives a value for it from measurements of the
temperature rise produced on combustion, the heat capacity of the apparatus, and the weight of
hydrocarbon used in the experiment: the so-called raw data. How is the uncertainty in the derived
quantity related to the uncertainties in the raw data?
In general, suppose an experimental quantity, C, depends on other variables (the raw data) A,
B, ... via the relationship C = f(A, B, ...). The values of {A, B, ...} are obtained through
measurement, and each has an associated uncertainty, ΔA, ΔB, ..., that has been estimated in some
way. For example, we may have used a statistical analysis to estimate ΔA as ΔA = t s_A/√N, or we
may simply have made an educated guess to estimate ΔA. How is the uncertainty in C, ΔC,
dependent on the uncertainties ΔA, ΔB, ...? The simple "significant figure" rules referred to earlier
are very approximate attempts to account for uncertainties in derived quantities. However, for
complicated functional dependencies, such as the logarithmic, exponential, and trigonometric
dependencies that appear in many equations in chemistry and physics, these approximate ideas are of
little value. Sometimes a tiny uncertainty in a measurement will produce a huge uncertainty in a
derived quantity; the exponential function is notorious for this. Clearly, we need a more general
approach for propagating uncertainties.
If one assumes that the results of many duplicate measurements would produce a Normal or
Gaussian distribution about the mean, then statistical theory provides a mechanism for estimating
the uncertainty ΔC in the derived quantity C = f(A, B). When the uncertainties ΔA and ΔB are both
small and uncorrelated, statistical arguments show that the propagated uncertainty ΔC is given by

ΔC = √{[(∂f/∂A)_B]² (ΔA)² + [(∂f/∂B)_A]² (ΔB)²}

where the derivatives are evaluated using the best (i.e., the mean) values of A and B. When C
depends on more than two quantities, for example C = f(A, B, D, ...), the formula is extended:

ΔC = √{[(∂f/∂A)_{B,D,...}]² (ΔA)² + [(∂f/∂B)_{A,D,...}]² (ΔB)² + [(∂f/∂D)_{A,B,...}]² (ΔD)² + ...}

Note that all of the quantities inside the square root are positive; random errors in uncorrelated
variables tend to add in the calculated uncertainty.
To help reduce the effort in the analysis of error propagation, a short list of error propagation
formulae for some common functional relationships is given below. When a calculation requires
several of these operations, these formulae may be combined according to the rules of differentiation. In complicated cases, you may find it easier to do the overall calculation in stages, obtaining
the uncertainties in intermediate results as you go along.
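The general formula above can also be applied numerically, which is handy for complicated functions; the following sketch approximates the partial derivatives by central finite differences (all names are illustrative):

```python
# Numerical error propagation: quadrature sum of (df/dx_i * dx_i)^2.
from math import sqrt

def propagate(f, values, uncerts, h=1e-6):
    """Return the propagated uncertainty in f(*values)."""
    total = 0.0
    for i, (v, dv) in enumerate(zip(values, uncerts)):
        step = h * max(abs(v), 1.0)
        hi, lo = list(values), list(values)
        hi[i], lo[i] = v + step, v - step
        deriv = (f(*hi) - f(*lo)) / (2 * step)  # central-difference df/dx_i
        total += (deriv * dv) ** 2
    return sqrt(total)

# Example: C = A / B with A = 5.0 +/- 0.5 and B = 6.0 +/- 0.1
print(propagate(lambda a, b: a / b, [5.0, 6.0], [0.5, 0.1]))  # ~0.085
```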
Error Propagation Formulae
A and B are measurements with associated uncertainties ΔA and ΔB, respectively. C is a
derived quantity with associated uncertainty ΔC.
1. Addition of an exact (constant) number α: C = A + α
   ΔC = ΔA
2. Multiplication by an exact number α: C = αA
   ΔC = |α| ΔA
3. Addition (or subtraction): C = A ± B ± D ± ...
   ΔC = √[(ΔA)² + (ΔB)² + (ΔD)² + ...]

4. Multiplication and division: C = (A × B × ...)/(D × E × ...)
   ΔC = |C| √[(ΔA/A)² + (ΔB/B)² + (ΔD/D)² + ...]
5. Power law: C = Aⁿ (n ≠ 0; n can be fractional or negative, hence the absolute value below)
   ΔC = |n C ΔA/A|
   For example: C = A²: ΔC = |2A² ΔA/A| = 2A ΔA
   C = A^{1/2}: ΔC = |(1/2)A^{1/2} ΔA/A| = ΔA/(2A^{1/2})

6. Combined multiplication and power law: C = A^m B^n
   ΔC = |C| √[m² (ΔA/A)² + n² (ΔB/B)²]
7. Exponential relationship: C = α exp(βA), where α and β are exact numbers
   ΔC = |β C| ΔA = |αβ| ΔA exp(βA)
8. Logarithmic relationship: C = α ln(βA)
   ΔC = |α| ΔA/A

Discarding Suspect Data


Sometimes one result in a set seems way out of line, and it is suspected that some influence
outside the usual play of random fluctuations was at work (a power surge, a stuck needle, human
error in reading, etc.). When should a suspect result be tossed out? There are two schools of thought:
1. Never
2. Sometimes
Even the "sometimes" school does not agree on when, but a commonly used criterion is that one
and only one result in a set can be tossed out, if it has less than a 10% chance of being a legitimate
part of the random set. The test is the "Q-test" and goes as follows. Calculate the quantity Q:
Q = |x_suspect − x_closest| / (x_highest − x_lowest)

Here x_suspect is the value of the suspect result, x_closest is the measured value nearest to it, and
x_highest and x_lowest are the highest and lowest values in the set. If Q exceeds the value in the
accompanying table, the value in question may be discarded with 90% confidence (N is the total
number of measurements in a data set). A similar table exists for 95% confidence, etc.
N:   3     4     5     6     7     8     9     10
Q:   0.94  0.76  0.64  0.56  0.51  0.47  0.44  0.41
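A sketch of the Q-test as a Python function, using the 90% critical values from the table (the data in the example are made up):

```python
# Q-test for a single suspect value at 90% confidence.
Q_CRIT_90 = {3: 0.94, 4: 0.76, 5: 0.64, 6: 0.56,
             7: 0.51, 8: 0.47, 9: 0.44, 10: 0.41}

def q_test(data, suspect):
    """Return (Q, True if the suspect value may be discarded)."""
    ordered = sorted(data)
    closest = min((x for x in data if x != suspect),
                  key=lambda x: abs(x - suspect))
    q = abs(suspect - closest) / (ordered[-1] - ordered[0])
    return q, q > Q_CRIT_90[len(data)]

print(q_test([23.5, 23.6, 23.7, 24.4], 24.4))  # Q = 0.78 > 0.76: discard
```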

Least Squares Fitting


One of the most common types of experiment involves the measurement of several values of
two different variables to investigate the mathematical relationship between the two variables. For
example, suppose that one wished to determine the rate constant, k, for the second-order
dimerization of butadiene: 2 C4H6 → C8H12. The integrated rate law for this process is

1/[C4H6] − 1/[C4H6]_0 = 2kt   (1)
Here [C4H6] is the concentration of butadiene at time t and [C4H6]0 is the initial concentration.
One would measure the concentration of C4H6 at known times t, and if this reaction is second
order, a plot of 1/[C4H6] vs. t should be linear with slope equal to twice the value of the rate
constant k. How do we obtain the best value of k from such data, and what is an appropriate
estimate of the uncertainty in this best value of k?
We will consider the general case where two variables x and y are connected by a linear
relation of the form
y = A + Bx   (2)

where A and B are constants. If y and x are linearly related, then a graph of y vs. x should be a
straight line with slope B and y-intercept A. If we were to measure N different values y_1, y_2, ...,
y_N corresponding to N values x_1, x_2, ..., x_N, and if our measurements were subject to no
uncertainties, then each of the points (x_i, y_i) would lie exactly on the line y = A + Bx as in Figure
4(a). In practice, there are always uncertainties, and the most we can expect is that the distance of
each point (x_i, y_i) from the line will be reasonable compared with the uncertainties in the data; see
Figure 4(b).

Slope = B

Intercept = A

x
Figure 4(a)

x
Figure 4(b)

How do we find the values of A and B that give the best straight-line fit to the measured data?
This problem can be approached graphically, but it can also be treated analytically using least-squares
fitting. To simplify our discussion, we assume that although there is appreciable
uncertainty in the measured y values, the uncertainty in our measurement of x is negligible. This
assumption is often reasonable, because the uncertainties in one variable often are much larger than
those in the other, which we can then safely ignore. For example, in the kinetics experiment
mentioned above, uncertainties in the measured concentrations are usually much larger than
uncertainties in the measured times. We also assume that the uncertainties in y all have the same
magnitude. (If the uncertainties are different, then the following analysis can be generalized to
weight the measurements appropriately; this is so-called weighted least-squares fitting.) Finally, we
assume that the measurement of each y_i is governed by the Gaussian or Normal distribution, with
the same width parameter σ_y for all measurements.
If we knew the constants A and B, then, for any given value of x_i we could calculate the true
value of the corresponding y_i:

(true value for y_i) = A + B x_i   (3)

From our assumptions, the measurement of y_i is governed by a Normal distribution centered on this
true value, with width parameter σ_y. Therefore, the probability of obtaining the observed value y_i is

Prob_{A,B}(y_i) ∝ (1/σ_y) exp[−(y_i − A − B x_i)²/(2σ_y²)]   (4)

where the subscripts A and B indicate that this probability depends on the (unknown) values of A
and B. Since the measurements of the y_i are independent, the probability of obtaining the set of N
values y_1, y_2, ..., y_N is the product

Prob_{A,B}(y_1, y_2, ..., y_N) = Prob_{A,B}(y_1) × Prob_{A,B}(y_2) × ... × Prob_{A,B}(y_N)   (5)

∝ (1/σ_y^N) e^{−χ²/2}   (6)

where the exponent χ² is given by

χ² = Σ_{i=1}^{N} (y_i − A − B x_i)²/σ_y²   (7)

Using the approach followed in the section on the statistical treatment of data, we will assume that
the best estimates for A and B based on the measured data are those values for which the probability
Prob_{A,B}(y_1, y_2, ..., y_N) is a maximum, or for which the sum of squares χ² in equation (7) is a
minimum; hence the name least-squares fitting. Thus, to find the best values of A and B we
differentiate χ² with respect to A and B and set the derivatives equal to zero:

∂χ²/∂A = −(2/σ_y²) Σ_{i=1}^{N} (y_i − A − B x_i) = 0   (8)

∂χ²/∂B = −(2/σ_y²) Σ_{i=1}^{N} x_i (y_i − A − B x_i) = 0   (9)

These two equations can be rewritten as simultaneous equations for A and B:

A N + B Σ_{i=1}^{N} x_i = Σ_{i=1}^{N} y_i   (10)

A Σ_{i=1}^{N} x_i + B Σ_{i=1}^{N} x_i² = Σ_{i=1}^{N} x_i y_i   (11)
Equations (10) and (11), sometimes called the normal equations, are easily solved for the best
least-squares estimates of A and B:

A = [Σ_{i=1}^{N} x_i² Σ_{i=1}^{N} y_i − Σ_{i=1}^{N} x_i Σ_{i=1}^{N} x_i y_i]/D   (12)

B = [N Σ_{i=1}^{N} x_i y_i − Σ_{i=1}^{N} x_i Σ_{i=1}^{N} y_i]/D   (13)

where the denominator D is given by

D = N Σ_{i=1}^{N} x_i² − (Σ_{i=1}^{N} x_i)²   (14)
The resulting line is called the least-squares fit to the data.


The next step is to calculate the uncertainties in the constants A and B. To achieve this goal
we first need to estimate the uncertainty σ_y in the y values, i.e., in y_1, y_2, ..., y_N. In the course of
measuring the y values, we will have formed some idea of their uncertainty. Nonetheless, it is
important to estimate σ_y by analyzing the data themselves. Since the numbers y_1, y_2, ..., y_N
are not measurements of the same quantity, we cannot get any idea of their reliability by examining
the spread in their values. We can, however, estimate the uncertainty σ_y as follows. Since we
assume that the measurement of each y_i is normally distributed about its true value A + B x_i, with
width parameter σ_y, the deviations (y_i − A − B x_i) are normally distributed, all with the same
central value (zero) and the same width σ_y. As usual, the best estimate for σ_y is that value for
which the probability of equation (6) is a maximum. By differentiating equation (6) with respect to
σ_y and setting the derivative equal to zero we obtain the following familiar-looking expression for σ_y:

σ_y = √[(1/N) Σ_{i=1}^{N} (y_i − A − B x_i)²]   (15)

As you may have suspected, this estimate of σ_y is not quite the end of the story. The numbers A and
B in equation (15) are the unknown, true values of the constants. In practice, these numbers
must be replaced by the best, least-squares estimates for A and B given by equations (12) and (13).
This replacement must be accompanied by replacing N in the denominator of equation (15) by
N − 2. Initially, all the N (x_i, y_i) data pairs are independent of each other; they were made as N
independent measurements. But once we compute A and B we lose two independent pieces of
information. Thus, our final expression for σ_y is:

σ_y = √[(1/(N − 2)) Σ_{i=1}^{N} (y_i − A − B x_i)²]   (16)

with A and B given by equations (12) and (13).


Having found the uncertainty σ_y, we can easily calculate the uncertainties in A and B. From
equations (12) and (13), we see that there are well-defined functional relationships between A, B, and
the measured values y_1, y_2, ..., y_N. Therefore, we can estimate the uncertainties in A and B using the
propagation of errors ideas discussed earlier. The results are:
σ_A = σ_y √[Σ_{i=1}^{N} x_i²/D]   (17)

and

σ_B = σ_y √[N/D]   (18)
where D is given by equation (14).


The problem of least-squares fitting to a general polynomial can be approached in a
completely analogous manner. Many computer programs are capable of finding the values of the
parameters (A and B in the above linear model) that provide a best fit, in the least-squares sense, of
the sets of data to a theoretical model, as well as the associated uncertainties in the parameters. The
Excel spreadsheet application, for example, has basic least-squares fitting built in (such as its chart
trendlines), but those basic tools do not report statistical uncertainties in the A and B parameters.
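Equations (12) through (18) translate directly into a short function; the sketch below applies it to made-up 1/[C4H6] vs. t data, where the slope estimates 2k:

```python
# Least-squares fit y = A + B*x with uncertainties, per eqs. (12)-(18).
import numpy as np

def linfit(x, y):
    x, y = np.asarray(x, float), np.asarray(y, float)
    N = len(x)
    D = N * np.sum(x**2) - np.sum(x)**2                        # eq. (14)
    A = (np.sum(x**2)*np.sum(y) - np.sum(x)*np.sum(x*y)) / D   # eq. (12)
    B = (N*np.sum(x*y) - np.sum(x)*np.sum(y)) / D              # eq. (13)
    sigma_y = np.sqrt(np.sum((y - A - B*x)**2) / (N - 2))      # eq. (16)
    sigma_A = sigma_y * np.sqrt(np.sum(x**2) / D)              # eq. (17)
    sigma_B = sigma_y * np.sqrt(N / D)                         # eq. (18)
    return A, B, sigma_A, sigma_B

t = [0, 10, 20, 30, 40]                  # time, min (hypothetical)
inv_c = [50.0, 59.8, 70.3, 80.1, 89.7]   # 1/[C4H6], 1/M (hypothetical)
A, B, sA, sB = linfit(t, inv_c)
print(f"slope 2k = {B:.3f} +/- {sB:.3f} M^-1 min^-1")
```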

