
Probability and Statistics for Computer Science: Who Should Teach It?




Daniel Joyce
Department of Computing Sciences
Villanova University
Villanova, PA 19085

ABSTRACT
Despite initial reservations, after teaching a semester of
Probability and Statistics for Computer Science, the
author now firmly believes, for reasons described in this
paper, that the course can be taught most effectively
from within the computer science department.


Keywords: Probability, Statistics, Modeling, Active
Learning, Assignments


1. Background

We are an accredited, medium-sized Computer Science
(CS) program within a College of Liberal Arts and
Sciences. For many years our majors were required to
take a probability and statistics course offered by our
Mathematics Department. This course, entitled
"Statistics for Experimenters", was an excellent course -
it had been created by the Mathematics Department for
us in the late 1990s, and was taken by students from
throughout the college, not just by computer science
majors.

Last year a committee of members of our
department recommended that we begin to teach the
probability and statistics course ourselves, i.e., to teach it
within our own department. The primary argument for
the recommendation was that we would be able to use
computer science related examples throughout the
course, therefore allowing the students to appreciate the
role of probability and statistics within computing. Such
an approach would have been difficult for the
mathematics instructors to accomplish, so the department
accepted the recommendation.

During the discussion regarding the course, I was
one of the few dissenting voices. I realized that the
current course was sound and although I could see some
benefit to being able to concentrate on examples from
computing, I did not think that the status quo approach
was "broken" - I did not think that the work needed to
bring the course into our department would be worth the
predicted benefits. Ironically, as one of the few dissenting
voices, I was slated to be the instructor the second time
we offered the course.

This paper describes my experience teaching this
course. It explains how my experience has changed my
opinion. I now do believe that it is a good idea for us to
offer the Probability and Statistics course from within
our department, although my reasons are not exactly the
same as the ones originally put forth by the committee.


2. Related work

It is easy to find areas of computing related to probability
and statistics. Probabilistic algorithms use randomness,
artificial intelligence learning patterns are often based on
conditional probabilities, computer based modeling uses
random number generation, software reliability analysis
can use statistical regression techniques, and in general,
many claims related to software usability and
development can be studied using experimental
procedures.

Some computer science educators have long called
for an increase in the emphasis on "empirical methods"
across the CS curriculum [1]. One might ask what should
be taught to CS students in this area. A solid list of
"Core empirical competencies for computer science" is
given in [2]. These include a basic understanding of
probability, distributions, sampling, statistics, and
experimental design. A recent paper on CS curriculum
design [3] suggests that the core of CS be broken into six
courses, with one of the six courses being "Probability
Theory for Computer Science". This paper also argues
that the correct place for a probability course for CS
majors is within the CS department.


Papers describing courses similar to the one discussed in
our paper have appeared. Notably, Anderson [4]
describes a course on simulation, probability, and
statistics taught within the CS department and fulfilling a
college-wide "quantitative reasoning" requirement. More
recently, Sahami [5] presents a course on probability
theory for computer scientists, which is much like the
course we describe in this paper.

Before leaving this section we should mention that
textbooks targeted towards this course do exist and in our
opinion, more will be available in upcoming years.
Examples include [6], [7], and [8]. The Horgan book
nicely incorporates the open source statistical processing
environment R throughout the text.


3. Our Course

We have now offered this course twice, once in the
spring of 2010, and once in the spring of 2011. This
paper discusses the latter offering.

The course prerequisite is our first semester
introductory programming course. It might be a good
idea to raise this prerequisite to two semesters of
programming, essentially requiring data structures,
although in our environment scheduling concerns make
this problematic. In any case, in our view, to reap the
benefit of teaching this course within the CS department,
it is essential that all students have some programming
experience.

There were twenty-three students in the course,
mostly second year students but with a few first year
students and a few third year students. The class met
three times a week in a lecture classroom environment.
The course web site is
www.csc.villanova.edu/~joyce/csc5930stat/index.html
We had weekly quizzes, three tests plus a final, and
multiple projects. The approximate schedule of topics
was:

Basic Probability
Combinatorics
Conditional Probability
Families of Discrete Distributions
Families of Continuous Distributions
Descriptive Statistics
The Central Limit Theorem
Confidence Intervals
Hypothesis Testing
Experimental Design


4. The use of Computer Science related examples

As mentioned previously, one of the primary reasons for
bringing this course into the direct jurisdiction of the CS
department was to allow the use of CS related examples.
Throughout the textbook CS related examples are used,
and almost all of the exercises involve CS. In many cases,
however, the examples are strained. Let's consider a
simple example:

In a shipment of 200 computer chips, 3 are defective. If
you choose two chips to use for a project, what is the
probability that your project will fail due to a defective
chip?

Such an example is obviously contrived. It is not really
CS related. It is just a "computerized" version of a
similar generic problem that would use the term
"widgets" instead of "computer chips", or that would
simply have a bag full of 197 blue balls and 3 red balls.
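
Contrived or not, the chip question has a short exact answer via the complement rule. The sketch below is only an illustration: it assumes the project fails whenever at least one of the two chosen chips is defective, and the function name is hypothetical.

```python
from fractions import Fraction
from math import comb


def prob_at_least_one_defective(total: int, defective: int, chosen: int) -> Fraction:
    """Probability that a random sample of `chosen` items contains a defective one.

    Assumption: items are drawn without replacement and every subset is
    equally likely.
    """
    good = total - defective
    # Complement rule: 1 - P(every chosen item is good).
    return 1 - Fraction(comb(good, chosen), comb(total, chosen))


p = prob_at_least_one_defective(total=200, defective=3, chosen=2)
print(p, f"(about {float(p):.4f})")
```

The same function answers the "197 blue balls and 3 red balls" version unchanged, which is exactly the point made above: the computing vocabulary adds nothing to the problem.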

There is little pedagogic benefit to stating problems
such as the example above using computer terminology.
The students realize that it is not really related to
computing. The additional verbiage required in such
examples makes the example more difficult to follow
and detracts from the real point of the example. It is our
opinion that textbook authors and instructors would
benefit from not trying to shove everything into a
computer science context. It is pedagogically sounder to
keep examples simple and uncluttered, such as by using
the classic "balls in a bag" or by using the familiar pair of
dice or deck of cards.

That said, there are some places where good,
solid, CS related examples can be used:
Some of the subtleties of combining probabilities
can be studied by considering the reliability of
networks, based on the reliability of their
components. If component A has reliability 0.9 and
component B has reliability 0.8, then the network
with A and B connected in series has reliability
0.9 × 0.8 = 0.72, whereas the network with A and B
connected in parallel has reliability
1 − (1 − 0.9) × (1 − 0.8) = 0.98.

[Figures: components A and B in series; components A and B in parallel]

Conditional/Bayesian probability is often used as the
basis for spam filtering approaches. Discussing and
researching spam filtering offers a concrete basis for
the study of these topics.

Software reliability models are usually based on
collecting historical data and applying regression
analysis. This direct use of an "advanced" statistical
approach within a software engineering context also
allows CS students to appreciate the practical power
of the theory.

In addition to covering basic probability and
introducing statistics, our course teaches the students
that there is a place within computer science for the
application of the scientific method [9]. A late
semester project involves the selection by groups of
a research paper related to CS that involves
experimentation, and a short presentation by the
group to the class describing the experiment, the
dependent and independent variables, the statistics
used, the internal and external validity, and the
significance of the work.
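
The network reliability example in the list above can be checked with a few lines of code. This is a minimal sketch, assuming that components fail independently of one another; the function names are hypothetical.

```python
from math import prod


def series_reliability(reliabilities):
    """A series network works only if every component works."""
    return prod(reliabilities)


def parallel_reliability(reliabilities):
    """A parallel network fails only if every component fails."""
    return 1 - prod(1 - r for r in reliabilities)


# Components A and B from the example, assumed to fail independently.
components = [0.9, 0.8]
print(f"series:   {series_reliability(components):.2f}")
print(f"parallel: {parallel_reliability(components):.2f}")
```

Larger networks can be evaluated by nesting the two functions, for example a parallel pair placed in series with a third component.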

So, it is not in the simple examples that we gain
benefits by using CS related examples in our class, but
rather in more complex examples and the ability to point
out direct benefits of the material under study, helping to
answer that age-old student question "why do we have to
learn this?"

5. Programming

Opportunities abound within this course for creating
programs that allow us to obtain insight into the material.
During this past semester we created the following
programs, among others:

A program that flips a coin N times and prints out
the ratio (showing 5 decimal places) of times Heads
appears. It does this for N = 1, 10, 100, 1000, 10000,
100000, and 1000000. This program clearly
demonstrates the "Law of Large Numbers", with the
ratio approaching the expected value as more and
more coins are "flipped", as shown by the following
sample output:

ratio with 1 flips: 1.00000
ratio with 10 flips: 0.40000
ratio with 100 flips: 0.48000
ratio with 1000 flips: 0.47800
ratio with 10000 flips: 0.50640
ratio with 100000 flips: 0.49884
ratio with 1000000 flips: 0.49994
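
A minimal sketch of such a coin-flipping program is given here in Python. The language, the seed, and the function name are assumptions made for illustration, so the printed ratios will differ from the sample output above.

```python
import random


def heads_ratio(n_flips: int, rng: random.Random) -> float:
    """Flip a fair coin n_flips times and return the fraction of heads."""
    heads = sum(rng.random() < 0.5 for _ in range(n_flips))
    return heads / n_flips


if __name__ == "__main__":
    rng = random.Random(2011)  # arbitrary seed, only for repeatable runs
    for exponent in range(7):  # N = 1, 10, ..., 1000000
        n = 10 ** exponent
        print(f"ratio with {n} flips: {heads_ratio(n, rng):.5f}")
```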

A program that flips a coin N times and prints out
the length of the longest "run" of heads or tails,
for N = 1, 10, 100, 1000, 10000, 100000, 1000000, and
10000000. Our intuition tells us that we could
"never" flip, say, 20 heads in a row, but probability
theory tells us that if we flip a coin enough times it
will happen. This program verifies the theory (and
helps give us confidence in our random number
generator). Sample output:

# of Flips    Max Run
1             1
10            3
100           7
1000          8
10000         13
100000        19
1000000       21
10000000      22
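
The longest-run program can be sketched as follows. This is an illustrative Python version with hypothetical function names. The demonstration loop stops at 1000000 flips only to keep the run short, and the numbers it prints depend on the seed.

```python
import random


def longest_run(flips) -> int:
    """Length of the longest run of identical consecutive outcomes."""
    longest = current = 0
    previous = None
    for flip in flips:
        if flip == previous:
            current += 1
        else:
            current = 1
            previous = flip
        longest = max(longest, current)
    return longest


def simulate_longest_run(n_flips: int, rng: random.Random) -> int:
    """Flip a fair coin n_flips times and return the longest run seen."""
    return longest_run(rng.random() < 0.5 for _ in range(n_flips))


if __name__ == "__main__":
    rng = random.Random(2011)  # arbitrary seed
    print("# of Flips    Max Run")
    for exponent in range(7):
        n = 10 ** exponent
        print(f"{n:<13d} {simulate_longest_run(n, rng)}")
```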


A program that simulates betting $10 on the Field of
a craps table, over and over, 1,000,000 times. This is
another example about the idea of expected value
and teaches us not to bet against the house!
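
A sketch of the Field bet simulation follows. The payout table is an assumption, since house rules vary: here 3, 4, 9, 10, and 11 pay even money, 2 and 12 pay double, and 5, 6, 7, and 8 lose. All names are hypothetical. The exact expected value is computed alongside the simulation so the two can be compared.

```python
import random
from fractions import Fraction

BET = 10
# Assumed payout multipliers for a winning Field bet (house rules vary).
PAYOUT = {2: 2, 3: 1, 4: 1, 9: 1, 10: 1, 11: 1, 12: 2}


def field_bet_result(total: int, bet: int = BET) -> int:
    """Net win (positive) or loss (negative) for one roll with the given total."""
    return bet * PAYOUT[total] if total in PAYOUT else -bet


def exact_expected_value(bet: int = BET) -> Fraction:
    """Average net result per bet over the 36 equally likely dice outcomes."""
    return sum(
        (Fraction(field_bet_result(d1 + d2, bet), 36)
         for d1 in range(1, 7) for d2 in range(1, 7)),
        Fraction(0),
    )


def simulate(n_bets: int, rng: random.Random, bet: int = BET) -> float:
    """Average net result per bet over n_bets simulated rolls of two dice."""
    net = 0
    for _ in range(n_bets):
        net += field_bet_result(rng.randint(1, 6) + rng.randint(1, 6), bet)
    return net / n_bets


if __name__ == "__main__":
    rng = random.Random(2011)  # arbitrary seed
    print("exact expected value per bet:", exact_expected_value())
    print("simulated average per bet:   ", round(simulate(1_000_000, rng), 4))
```

Under these assumed payouts the player loses a little over half a dollar per $10 bet on average.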

A program that investigates the famous "Monty
Hall" problem [10]. The interesting thing about this
counterintuitive "puzzle" is that the mere act of
writing a program to simulate it clarifies the
apparent paradox in the programmer's mind.
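
One possible simulation of the Monty Hall game is sketched below. It assumes the standard rules: three doors, and a host who always opens a door that hides no car and was not picked by the contestant. The function names are hypothetical.

```python
import random


def play(switch: bool, rng: random.Random) -> bool:
    """Play one game; return True if the contestant wins the car."""
    doors = [0, 1, 2]
    car = rng.choice(doors)
    pick = rng.choice(doors)
    # The host opens a door that is neither the contestant's pick nor the car.
    opened = rng.choice([d for d in doors if d != pick and d != car])
    if switch:
        # Switch to the one remaining closed door.
        pick = next(d for d in doors if d != pick and d != opened)
    return pick == car


def win_rate(switch: bool, n_games: int, rng: random.Random) -> float:
    """Fraction of n_games won with the given strategy."""
    return sum(play(switch, rng) for _ in range(n_games)) / n_games


if __name__ == "__main__":
    rng = random.Random(2011)  # arbitrary seed
    print(f"stay:   {win_rate(False, 100_000, rng):.4f}")
    print(f"switch: {win_rate(True, 100_000, rng):.4f}")
```

Writing the line that computes `opened` is where the paradox tends to dissolve: the host's choice depends on where the car is, so the door he leaves closed carries information.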

A program that estimates the value of Pi by
randomly generating points in a square and counting
how many of them fall inside an inscribed circle. (Pi
is approximately equal to four times the number of
points that land inside the circle divided by the total
number of points.) This fun and surprising result
demonstrates that random number generation and
probability can be used in interesting, innovative
ways.
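
The Pi estimate can be sketched in a few lines. This version assumes the square [-1, 1] x [-1, 1] with an inscribed unit circle; the function name and the seed are hypothetical.

```python
import random


def estimate_pi(n_points: int, rng: random.Random) -> float:
    """Estimate Pi from the fraction of random points that land in the circle."""
    inside = 0
    for _ in range(n_points):
        x = rng.uniform(-1.0, 1.0)
        y = rng.uniform(-1.0, 1.0)
        if x * x + y * y <= 1.0:  # point lies inside the inscribed circle
            inside += 1
    # Area of circle / area of square = Pi / 4.
    return 4 * inside / n_points


if __name__ == "__main__":
    rng = random.Random(2011)  # arbitrary seed
    for n in (1_000, 100_000, 1_000_000):
        print(f"{n:>9d} points: {estimate_pi(n, rng):.5f}")
```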

A program that generates a sequence of random
numbers based upon an exponential distribution for
a given rate λ, by generating random real numbers in the
range 0 to 1 and then using the inverse of the
cumulative distribution function. This program,
which includes visual output in the form of a
histogram, demonstrates the relationship between
the probability density function and cumulative
distribution function and provides a useful approach
that can be used in Monte Carlo modeling. Sample
output:


* equals 2265 occurrences

0 ***************************************
1 **********************************
2 ******************************
3 ************************
4 ******************
5 *************
6 **********
7 ********
8 ******
9 ****
10 ***
11 **
12 **
13 **
14 *
15 *
16 *

A program that generates a multitude of samples
from a given distribution in an attempt to estimate
the (known) mean of the underlying population. For
each sample the program calculates a 90%
confidence interval, and then reports the results.
From this project we learn, in an active hands-on
way, what it means to call something a 90%
confidence interval. This exercise strikes at the very
heart of statistics.
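
A sketch of the confidence interval experiment follows. It makes simplifying assumptions that are not stated in the text: the population is normal with a known standard deviation, so a z-based interval applies, and the mean, standard deviation, and sample size are hypothetical values.

```python
import random
from statistics import NormalDist, fmean


def coverage(mu: float, sigma: float, sample_size: int,
             n_samples: int, confidence: float, rng: random.Random) -> float:
    """Fraction of z-based confidence intervals that contain the true mean mu."""
    # Two-sided critical value, e.g. about 1.645 for 90% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    half_width = z * sigma / sample_size ** 0.5
    hits = 0
    for _ in range(n_samples):
        xbar = fmean(rng.gauss(mu, sigma) for _ in range(sample_size))
        if xbar - half_width <= mu <= xbar + half_width:
            hits += 1
    return hits / n_samples


if __name__ == "__main__":
    rng = random.Random(2011)  # arbitrary seed
    rate = coverage(mu=50.0, sigma=10.0, sample_size=30,
                    n_samples=10_000, confidence=0.90, rng=rng)
    print(f"intervals containing the true mean: {rate:.3f}")
```

Roughly nine out of ten intervals capture the true mean, which is what the "90%" refers to: it describes the procedure, not any single interval.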

In addition to the programs we wrote ourselves as
exercises and those provided by the instructor, we used
many example applets that are available on the web. We
believe the fact that we programmed many examples
ourselves made the use of such applets more "real" than
if we had not experienced similar programming.

6. Bonus benefits of programming

In addition to the direct benefits of the programming
problems and solutions highlighted in the previous
section, we noticed several additional benefits to this
approach. First, the programming assignments allowed
the students to use their programming skills outside a
programming class. It is nice for them to see that there
can be applications for programs that are useful in
classes besides their programming class. Furthermore, as
we've written elsewhere [11], we believe there are many
inherent benefits in simply having students program,
especially for those students with a weaker background.
More programming gives such students a chance to
"catch up". In all cases where students were assigned a
programming project they were later able to see and
learn about an "expert" solution, thus enhancing their
own approaches.


One assignment in particular led to a rather nice "bonus"
benefit. We called this the HTX problem:

Write a program that simulates flipping a coin over
and over again until the sequence HTH has been
seen, counting the number of flips required. This
"experiment" should be repeated one million times;
then output the average number of flips needed.
Next repeat the entire process, except this time
flipping until the sequence HTT appears. Again
output the average number of flips. Include code,
output, observations and analysis with your report.

Most people believe the average number of flips
expected to obtain the sequence HTH is the same as the
expected number of flips needed to get HTT. However,
that is not the case, and a correctly coded program will
report that on average we require about 10 flips to see
HTH and 8 flips to see HTT. Note that we are not asking
about the probability of flipping a coin three times and
getting one of these sequences; we are asking for the
expected number of flips until we get the sequence.
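
A direct simulation of the HTX problem might look like the sketch below. The function names are hypothetical, and the demonstration uses 100,000 repetitions per pattern rather than the one million in the assignment, only to keep the run short.

```python
import random


def flips_until(pattern: str, rng: random.Random) -> int:
    """Flip a fair coin until `pattern` appears; return the number of flips."""
    recent = ""
    count = 0
    while not recent.endswith(pattern):
        flip = "H" if rng.random() < 0.5 else "T"
        # Keep only as many recent flips as the pattern is long.
        recent = (recent + flip)[-len(pattern):]
        count += 1
    return count


def average_flips(pattern: str, trials: int, rng: random.Random) -> float:
    """Average number of flips needed to see `pattern`, over `trials` runs."""
    return sum(flips_until(pattern, rng) for _ in range(trials)) / trials


if __name__ == "__main__":
    rng = random.Random(2011)  # arbitrary seed
    for pattern in ("HTH", "HTT"):
        print(f"{pattern}: {average_flips(pattern, 100_000, rng):.3f} flips on average")
```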

The benefit from this assignment is that the
difference between the two results can be easily
explained using Finite State Automata (FSA). FSA are,
of course, of core importance within computing, playing
a role not only in theory, but also in language, system,
and software design. If we construct an FSA for each of
the two desired sequences and compare we can easily see
why the results are different:
[Figure: FSA for HTH (left) and FSA for HTT (right)]

State S is the start state, F is the finish state (success), H
is the state of having "seen" an H, and HT is the state of
having "seen" an HT in sequence. As can be easily seen
in the figures, if you have flipped an HT and are looking
for HTH and you "fail", i.e. you flip a T, you go all the
way back to the start state. However, if you have flipped
an HT and are looking for HTT and you "fail", i.e. you
flip an H, you do not go all the way back to the start state;
instead you go back to the state of having seen an H and
therefore you have already made some progress. It is this
difference, easily seen using the FSAs, that explains the
difference in expected values. This example allowed us
to talk about the use of FSAs outside the theory class and
to show how one can easily implement an FSA in a
program.
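
One way to implement the two FSAs is as transition tables, sketched below. The tables are reconstructed from the description of the states S, H, HT, and F given above, and the function names are hypothetical. Besides driving a simulation, the same tables yield the expected number of flips from each state: every flip costs one step and then moves to either successor with probability 1/2, and repeating that update until the values settle gives the expected values.

```python
import random

# Transition tables: state -> {flip outcome -> next state}. "F" is the finish state.
HTH = {"S": {"H": "H", "T": "S"},
       "H": {"H": "H", "T": "HT"},
       "HT": {"H": "F", "T": "S"}}   # failing from HT returns to the start state
HTT = {"S": {"H": "H", "T": "S"},
       "H": {"H": "H", "T": "HT"},
       "HT": {"H": "H", "T": "F"}}   # failing from HT only falls back to H


def run_fsa(fsa, rng: random.Random) -> int:
    """Drive the FSA with fair coin flips; return the flips needed to reach F."""
    state, flips = "S", 0
    while state != "F":
        state = fsa[state]["H" if rng.random() < 0.5 else "T"]
        flips += 1
    return flips


def expected_flips(fsa, iterations: int = 2000):
    """Approximate the expected flips to reach F from every state by iteration."""
    expected = {state: 0.0 for state in fsa}
    expected["F"] = 0.0
    for _ in range(iterations):
        updated = {"F": 0.0}
        for state, moves in fsa.items():
            updated[state] = 1 + 0.5 * expected[moves["H"]] + 0.5 * expected[moves["T"]]
        expected = updated
    return expected


if __name__ == "__main__":
    for name, fsa in (("HTH", HTH), ("HTT", HTT)):
        print(name, {s: round(v, 3) for s, v in expected_flips(fsa).items()})
```

The only difference between the two tables is the "failure" transition out of state HT, and that single entry accounts for the gap between about 10 and about 8 flips.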

The final bonus benefit of programming was earned
from the semester long class project. We had the
students, in groups, design and build a web based survey,
with results presented visually and dynamically. The
"bonus" part of this is it allowed us to expose the
students to some web programming technologies. About
one third of the students already had plenty of experience
with these technologies but the rest of the students were
very happy with the chance to learn. This project was
broken into the following phases:

1. Basic HTML: create a web page related to probability
and statistics, including a link to an interesting
video, and a list of suggested survey questions.

2. Server side: create a server side script that
demonstrates the use of echo, a loop, a decision, and
file input and output.

3. Database: create a client side form that collects
information and sends it to a server side script which
saves the information in a database. Create a server
side script that displays the information in the
database.

4. Visualization: update the display report so that it
uses some sort of graphical visualization.

We believe that many of these students will go on, of
their own volition, and study these topics in more detail,
now that they have been provided an introduction. And
yes, we know this is not directly related to probability
and statistics but the project did include content
components related to those topics and, as we said, the
rest was bonus!






7. Conclusion

The Probability and Statistics for Computer Science
course does belong within the computer science
department.

8. References

[1] Empirical Investigation throughout the CS
Curriculum. David Reed, Craig Miller, and Grant
Braught, SIGCSE 2000, March 2000.

[2] Core Empirical Concepts and Skills for Computer
Science. Grant Braught, Craig S. Miller, David Reed,
SIGCSE 2004.

[3] Expanding the Frontiers of Computer Science:
Designing a Curriculum to Reflect a Diverse Field.
Mehran Sahami, Alex Aiken, Julie Zelenski, SIGCSE
2010.

[4] A Course on Simulation, Probability, and Statistics.
Scott D. Anderson, SIGCSE 2007.

[5] A Course on Probability Theory for Computer
Scientists. Mehran Sahami, SIGCSE 2011.

[6] Probability and Statistics for Computer Scientists.
Michael Baron, Chapman and Hall/CRC, 2007.

[7] Probability with R: An Introduction with Computer
Science Applications. Jane M. Horgan, Wiley, 2009.

[8] Probability and Statistics for Computer Science,
James L. Johnson, Wiley, 2007.

[9] Experimental Computer Science: The Need for a
Cultural Change, Dror G. Feitelson, White Paper, was
available at www.cs.huji.ac.il/~feit/papers/exp05.pdf on
March 31, 2011.

[10] A problem in probability, Steve Selvin, letter to the
editor, American Statistician, August 1975.

[11] Dealing with Experience Imbalance in Introductory
Computer Programming Courses, Daniel Joyce, FECS
2010, July 2010.
