
Appendix B: Introduction to Statistics


Biologists often want to make sense of natural phenomena, and statistics aid in the interpretation of the data they collect. Statistics can be broadly classified in two ways according to their function: descriptive statistics, used to summarize and describe data, and inferential statistics, used to draw conclusions from data, e.g., making predictions and decisions.

GENERAL TERMINOLOGY
Types of data: The first thing to know is that there are different kinds of data. Data can be qualitative or quantitative. Quantitative data is numerical data, e.g., height measured in centimeters for a sample of sunflowers, distance in miles traveled to school for a sample of off-campus commuter students, or the number of chocolate chips in a sample of the ten best-selling cookie brands in the U.S. Qualitative data cannot be measured on a numerical scale; it is data that can be classified into categories, e.g., the genre of movie rented by each of a sample of 50 movie renters, the gender of each individual in a sample of 50 voters, or the species of bird observed.

Variables: A variable is a characteristic or property of an individual that may vary in a population. Variables are what we measure, manipulate, or control in our research. We are interested in measuring the response variable, or dependent variable: what we expect may be affected by our treatment, or what differs between groups. For example, if a gardener wants to determine whether a new fertilizing treatment helps tomatoes grow larger, an appropriate response variable would be the weight of the tomatoes.

Sample & population: It is rare when we collect data that we can actually sample or survey every individual in a population. More frequently, we take measurements on only a smaller portion of individuals within a population: a sample. In doing so, we accept a major assumption about the data we are collecting: that this sample is representative of the entire population. To increase the likelihood of obtaining a representative sample, we choose our sample randomly.

DESCRIPTIVE STATISTICS
After we collect data, we calculate several parameters that describe two aspects of the data: measures of central tendency and measures of dispersion. First, we create a graph to look at the frequency of values in the data set. The graph that we create is a histogram (Figure 1); the x-axis displays the data values and the y-axis displays the frequency with which those values occur in the data set. Scientists use histograms as a quick way to visualize a data set.
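As a quick sketch of the idea, the Python snippet below tallies a hypothetical sample of heights (invented values, not the Figure 1 data) into 5 cm bins and prints a simple text histogram.

```python
from collections import Counter

# Hypothetical sample of heights in cm, binned to the nearest 5 cm.
heights = [152, 154, 155, 157, 158, 158, 160, 160, 161, 163, 163, 164, 166, 170]
bins = Counter(5 * round(h / 5) for h in heights)

# Print a text histogram: each * is one individual at that binned height.
for value in sorted(bins):
    print(f"{value} cm | {'*' * bins[value]}")
```

Reading down the output, the tallest column of stars marks the most frequent range of values, just as the tallest bar does in a plotted histogram.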




Figure 1. Top: a human height dataset. Above: a histogram displaying the frequency of the values of this dataset.


It is very important to look at your data before you begin a statistical analysis. Explore what your dataset looks like. Do you see patterns? Are there data points that don't seem to fit with the rest of the data? What's the spread of the data? Figure 2 illustrates why it is important to understand what your dataset looks like: it is not always possible to get a sense of your data by looking at means alone.

Figure 2. Left: the graph displays biomass data collected from plants grown with and without nutrients, with every data point shown. Right: the graph displays only the means.

Measures of central tendency: Measures of central tendency provide information about how the values of the data cluster around a single middle value (Figure 3).

Figure 3. Measures of central tendency and dispersion.

There are three measures of central tendency: the mean, or average value of the data set, usually represented by x̄ (x bar); the median, or middle value when the data set is ordered sequentially; and the mode, or the most


frequently observed value in the data set. The mean is estimated by calculating the sum of the individual values (xi) and dividing that sum by the total number of individuals in the sample (n) (Box 1).

Mean = x̄ = Σxi / n = (x1 + x2 + x3 + … + xn) / n

Box 1. Equation for the mean.
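The Box 1 formula, along with the median and mode, can be sketched in Python; the chip counts below are made-up numbers for illustration.

```python
import statistics

# Hypothetical data: number of chocolate chips counted in ten cookies.
chips = [12, 15, 15, 16, 17, 18, 18, 18, 20, 21]

mean = sum(chips) / len(chips)     # Box 1: sum of the values divided by n
median = statistics.median(chips)  # middle value of the ordered data set
mode = statistics.mode(chips)      # most frequently observed value

print(mean, median, mode)  # 17.0 17.5 18
```

Note that the mean, median, and mode need not agree: here the mean is 17.0 but the most common count is 18.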


Figure 4. Two datasets that differ in their measures of dispersion.

Measures of dispersion: Measures of dispersion (Figure 4) tell us about the spread of the data. We are concerned with three parameters: variance, standard deviation, and standard error (Box 2). The variance is the sum of the squared differences, or deviations, between the individual values and the mean, divided by the number of individuals in the sample minus one. The standard deviation, or s, is the square root of the variance. The standard error is the standard deviation divided by the square root of the sample size.

Variance = s² = Σ (xi − x̄)², summed from i = 1 to n, divided by (n − 1)

Standard deviation = s = square root of the variance

Standard error = SE = s / √n (the standard deviation divided by the square root of the sample size)
Box 2. Equations for measures of dispersion.
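A minimal Python sketch of the Box 2 formulas, using an invented data set; the final checks compare the hand-rolled formulas against Python's statistics module, which uses the same n − 1 (sample) versions.

```python
import math
import statistics

# Hypothetical sample (any small numeric data set works here).
data = [12, 15, 15, 16, 17, 18, 18, 18, 20, 21]
n = len(data)
x_bar = sum(data) / n

# Box 2: variance is the sum of squared deviations divided by n - 1.
variance = sum((x - x_bar) ** 2 for x in data) / (n - 1)
std_dev = math.sqrt(variance)     # standard deviation s
std_err = std_dev / math.sqrt(n)  # standard error of the mean

# statistics.variance and statistics.stdev use the same sample formulas.
assert math.isclose(variance, statistics.variance(data))
assert math.isclose(std_dev, statistics.stdev(data))
```

Dividing by n − 1 rather than n corrects for the fact that the sample mean, rather than the true population mean, is used inside the squared deviations.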


Error bars: Often, we graph only a single data point, the mean, but we also need some way to represent the overall spread, or distribution, of the data in the graph. Error bars are used to represent the overall distribution of the data and to describe researchers' confidence that their data represent a true value. Confidence in the value of the mean decreases as the width of the error bars around the mean increases. As you make more measurements, your confidence in your mean value should increase; this is reflected by the standard error, which, because the standard deviation is divided by the square root of n, grows smaller as the number of measurements (n) grows larger. Keep in mind, however, that wide or overlapping error bars may mean either that technical limitations made it difficult or impossible to record true differences, or that there is no significant difference. By comparing the error bars for different values, we can estimate whether the values differ significantly: if the error bars for one value overlap with the range of the error bars for another value, it is unlikely that the two values are significantly different. It is important to specify in your graph which values were used to calculate your error bars, i.e., one standard deviation or one standard error.

Normal distribution: Figure 5 shows normally distributed data. Recall from above that we can look at the shape of the graph to learn about the frequency of values in different ranges of the variable of interest. The standard normal distribution has a mean of 0 and a standard deviation of 1. Normally distributed data are perfectly symmetrical about the mean, which gives the normal distribution its characteristic bell shape.
As you can see from Figure 5, most values will be clustered near the middle of the continuum of the data: 68% of the observations fall within 1 standard deviation of the mean, and 95% of the observations fall within 2 standard deviations of the mean. So why is the normal distribution important? The idea is that variation is generally normally distributed. Sometimes in biology, however, the variables we measure are not normally distributed. One reason a measurement for a trait or character may not be normally distributed is that strong selection has acted to minimize the amount of variation in the character; for example, a particular form of a trait may be critical for an organism's survival. In such cases, statistics based on other kinds of distributions may be more appropriate. Here, we discuss statistics that are based on the normal distribution (sometimes called the z distribution) and the t-test, which is based on the t distribution, a distribution similar to the normal.
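The 68%/95% figures can be checked empirically with a quick simulation; the mean of 100 and standard deviation of 15 below are arbitrary choices.

```python
import random

# Draw a large sample from a normal distribution (mean 100, sd 15 arbitrary).
random.seed(42)
sample = [random.gauss(100, 15) for _ in range(100_000)]

# Fraction of observations within 1 and within 2 standard deviations.
within_1sd = sum(100 - 15 <= x <= 100 + 15 for x in sample) / len(sample)
within_2sd = sum(100 - 30 <= x <= 100 + 30 for x in sample) / len(sample)

print(within_1sd, within_2sd)  # close to 0.68 and 0.95
```

However large the simulated sample, the fractions settle near 0.68 and 0.95, which is exactly the empirical rule stated above.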


Figure 5. Normal distribution with mean, and one, two, and three standard deviations indicated.



Outliers: Outliers are observations that fall outside the range of 2 standard deviations around the mean. There are quantitative methods for excluding outliers; however, defining an outlier is subjective, and each outlier is identified on an individual basis. Outliers may be indicative of a phenomenon that differs from the typical pattern in the sample. Researchers carefully evaluate outliers to determine whether to include or exclude them from analysis. Outliers should not be excluded simply because they fall outside the range of the normal distribution; these data points may represent real biological variation. When examining your data set for outliers, it is a good idea to go back to your original data sheets to check for evidence of measurement error, mistakes in recording data, etc.

Coefficient of correlation (r): You've probably heard that correlation does not mean causation. What does that mean? A correlation is a measure that describes the strength, or degree, of a relationship between two variables. Two variables may be correlated, or related, but that does not mean that one variable causes the other. Correlation coefficients are measured on a scale from -1 to 1. A coefficient of -1 means that two variables are perfectly negatively correlated, while a coefficient of 1 means that two variables are perfectly positively correlated. A coefficient of zero indicates that there is no relationship between the two variables. The closer the coefficient is to -1 or to 1, the stronger the relationship between the two variables.

How to estimate a correlation in Excel: Notice that there are two columns of information in Figure 6, one column for each variable. Excel refers to each of these columns as an array. The function =CORREL gives the Pearson product moment coefficient of correlation (r). The Excel Help menu is excellent.
You can search for formulas and find more information about the particular arguments involved in the equation of interest. Let's look at the following example, based on the data shown in Figure 6. Every semester, a teacher finds that his students do not study much for the first few exams, usually until they receive a few low grades. He's decided that he'd like to persuade his students that, while more hours spent studying does not necessarily result in higher test grades, the two variables may be related! On the final exam, he asks his students to record the hours they studied in preparation for the exam. He is interested to see whether there is a relationship between test grade and hours studied. He plans to share the results with his students during the first class next semester.
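Outside of Excel, the Pearson coefficient can also be computed directly from its definition. The hours and scores below are invented stand-ins, not the actual Figure 6 data.

```python
import math

# Hypothetical study-hours and exam-score pairs for eight students.
hours = [1, 2, 3, 4, 5, 6, 7, 8]
scores = [55, 60, 58, 70, 72, 78, 80, 85]

n = len(hours)
mean_x = sum(hours) / n
mean_y = sum(scores) / n

# Pearson r: co-variation of the two variables divided by the product
# of their individual spreads, so r always lands between -1 and 1.
cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(hours, scores))
spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in hours))
spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in scores))
r = cov / (spread_x * spread_y)

print(round(r, 3))
```

For these invented values r comes out strongly positive (close to 1), which would suggest a strong association between hours studied and score, though, as the text stresses, not that studying caused the scores.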



Figure 6. An example of how to conduct a correlation in Excel. Notice that the equation for a correlation is =CORREL(array1, array2); the arrays refer to the two variables of interest. As you enter the equation, the reference equation will automatically pop up as a reminder.

Figure 7. The result of the correlation: the correlation coefficient.

How would you interpret the value of the correlation coefficient shown in Figure 7? What advice would you give students in this class? Keep in mind, correlation does not mean causation! Why is a correlation a descriptive statistic rather than an inferential statistic?



INFERENTIAL STATISTICS
What exactly is statistical significance (the p-value)? If we are interested in whether a treatment is in fact different from a control, we usually need to know whether there is a statistically significant difference between the groups (Figure 8). The statistical significance of a result is the probability that the observed relationship, either between variables or between means, in a sample occurred by pure chance alone. The measure of significance is reported as a p-value: the probability of observing a value of the statistic at least as extreme as the one actually observed in the data. The null hypothesis (H0) states that there is no significant difference and that any observed difference is due to chance. The alternative hypothesis (Ha), on the other hand, states that there is a significant difference. Some other stats-speak: we don't say that we accept the null hypothesis; instead, we say that we failed to reject the null hypothesis. Why do you think this is?

P-values generally have three (arbitrary, but traditional) levels of significance: the 0.05 level, the 0.01 level, and the 0.001 level. Essentially, a p-value of 0.05 tells you that there is a 5% probability that the relationship observed arose by chance alone. It is really important to recognize, however, that biologists do not live and die by 0.05! A higher p-value might indicate an important trend that, for example, your research design was not quite sensitive enough to pick up. So, if you find a result with a p-value of 0.072, you may still have found something that is biologically significant. Something to think about: biologically significant and statistically significant are not necessarily the same thing! Always consider your statistical results in light of the biology!
Conversely, the higher the p-value, the less confident you can be that the observed relationship between variables is a reliable indicator for the whole population. Be aware that as you increase your sample size, you also increase the likelihood of finding statistically significant results. Another point to keep in mind is that the more analyses you perform on a dataset, the more likely you are to find statistically significant results by chance. (Don't go fishing for significance! This is not good science!)
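One way to make "the probability of a result at least as extreme by chance" concrete is a permutation test, a technique not covered above but directly tied to the definition of the p-value: shuffle the group labels many times and ask how often chance alone produces a difference as large as the one observed. The two groups below are hypothetical.

```python
import random

# Hypothetical measurements for a treatment and a control group.
treatment = [9.1, 8.4, 7.9, 9.5, 8.8, 9.0]
control = [7.2, 7.8, 6.9, 8.1, 7.5, 7.4]

def mean(xs):
    return sum(xs) / len(xs)

observed = mean(treatment) - mean(control)

random.seed(0)
pooled = treatment + control
extreme = 0
trials = 10_000
for _ in range(trials):
    random.shuffle(pooled)  # relabel the 12 values into two groups at random
    diff = mean(pooled[:6]) - mean(pooled[6:])
    if abs(diff) >= abs(observed):  # at least as extreme (two-tailed)
        extreme += 1

p_value = extreme / trials
print(p_value)
```

Because the two hypothetical groups barely overlap, random relabelings almost never reproduce a difference this large, so the estimated p-value is small and we would reject the null hypothesis at the 0.05 level.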

Figure 8. Are two groups statistically different?




The t-test: The t-test is the most common method used to evaluate whether the means of two groups are significantly different. To perform a t-test, one independent variable and one dependent variable are needed. The means of the dependent variable are compared between groups defined by the values of the independent variable. If you have more than two groups' means to compare, you will need to use analysis of variance (ANOVA), an extension of the t-test. Be aware that the formulas for the t-test change slightly if the sample sizes of the groups are unequal.

Box 3. Equation for the t-test.

It may help to think of the t-test (Box 3) as a ratio between the difference in two means (the numerator) and the total variation in both groups (the denominator). As the amount of variation increases, the t value decreases, since the difference between the means may be swamped out by a large amount of variation. As the difference between the two means increases, the t value increases. The other quantity in the denominator is the sample size, n: as the sample size increases, the estimated variation of the mean decreases. Thinking about sample size also leads to thinking about degrees of freedom.

Degrees of freedom (df): In general, the degrees of freedom are equal to the number of independent samples minus the number of parameters estimated in the calculation of the statistic itself. Yikes! What does that mean? For a t-test with independent groups (two groups that are unrelated, e.g., blood pressure measured in 40 men and 40 women), it means the sum of the sample sizes minus the 2 parameters (two means) that are estimated (2n − 2 when the groups are the same size). For dependent groups (groups that are related, e.g., blood pressure measured in the same 40 women in May and then in June), it means n − 1. The degrees of freedom are also important because they determine the particular form of the distribution for the data.
Hypothesis testing & statistical significance: Sometimes we develop hypotheses that are one-sided, meaning that there is only one alternative hypothesis to compare with the null hypothesis. For example, a researcher might want to test the following ideas:

H0: Reading for 30 minutes each night has no effect on 2nd graders' standardized test scores.
Ha: Reading for 30 minutes each night improves 2nd graders' standardized test scores.

The alternative hypothesis is an example of a one-sided, or one-tailed, hypothesis because Ha states that a change will occur in only one direction: test scores will improve. Sometimes, however, we are interested in testing two-sided, or two-tailed, hypotheses because we do not have an a priori direction or expectation that we are testing. How does this affect how you determine whether the t statistic you've estimated is significant? To determine significance, you pick a risk level, or alpha (α) level. The standard alpha level is 0.05, which means that 5 times out of 100 you will find statistical significance by chance. Alpha is distributed differently depending

t = (mean_a − mean_b) / √(s_a²/n_a + s_b²/n_b)
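A Python sketch of the Box 3 formula, with made-up groups of equal size; it also computes the degrees of freedom for two independent groups.

```python
import math

# Hypothetical measurements for two independent groups.
group_a = [5, 6, 7, 8, 9]
group_b = [1, 2, 3, 4, 5]

def mean(xs):
    return sum(xs) / len(xs)

def sample_variance(xs):
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

na, nb = len(group_a), len(group_b)

# Box 3: difference of means over the combined per-group variation.
t = (mean(group_a) - mean(group_b)) / math.sqrt(
    sample_variance(group_a) / na + sample_variance(group_b) / nb
)
df = na + nb - 2  # degrees of freedom for two independent groups

print(t, df)  # 4.0 8
```

Notice the behavior the text describes: making the two means farther apart raises t, while making the values within each group more variable lowers it.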


on whether you are evaluating a one-tailed or a two-tailed test. In a one-tailed test, α (0.05) is placed entirely in one tail of the distribution of the test statistic (Figure 9).


Figure 9. A one-tailed t distribution with alpha = 0.05; 0.05 of the total area under the curve lies in one tail. The critical value shown, t = 1.645, applies when sample sizes are large.

In a two-tailed test, α is split so that 0.025 lies in each tail of the distribution of the test statistic (Figure 10).


Figure 10. A two-tailed t distribution with alpha = 0.05; 0.025 of the total area under the curve lies in each tail. Here, the critical value t = 2.228 is the value of tcrit when there are 10 degrees of freedom (df = 10).


How to perform a t-test in Excel: If you are still confused after reading the directions below, look at the Excel Help menu. It is an excellent resource: you can search for formulas and find more information about the particular arguments involved in the equation of interest. Let's start with an example of when you might use a t-test. A farmer wanted to determine whether adding vitamins to his Holsteins' diet resulted in a significant weight gain. Before starting the experiment, the farmer weighed each cow. He gave 14 cows added vitamins and 14 cows no added vitamins. After several weeks of the treatments, he weighed each cow and recorded the weight gain. In this example, the treatment (vitamins) is being compared to the control (no vitamins), and the response variable is weight gain (Figure 11).

Figure 11. The data entered in Excel. Notice that there are two columns of information: one column with the group type (control, vitamins) and one column with the response variable (weight gain).


The formula for a t-test in Excel is =TTEST(array1, array2, tails, type).

Figure 12. An example of how to conduct a t-test in Excel.


The p-value for the t-test is shown in Figure 12. How do you interpret it? Does the vitamin treatment result in significantly more weight gained?

You can also calculate descriptive statistics, e.g., mean, mode, median, standard deviation. Note that these are calculated separately from the t-test, using =AVERAGE to calculate the means, and =STDEV to calculate the standard deviations.

Figure 13. The result of the t-test. Note that the actual calculated t value is not shown. When you calculate a t-test by hand, you compare your calculated t value to a table of critical t values to determine your p-value. Here, instead, the p-value is displayed directly.

What if my question can't be answered with a t-test? What other statistics might be useful? Here, we'll discuss one more commonly used statistic: the chi square test (χ²).

Chi square test (χ²): A chi square test is used to determine whether an observed pattern differs from an expected pattern. This test is based on the comparison of frequencies. For example, suppose a new campus eatery is opening in a building in which both graduate and undergraduate students reside. The owners are trying to determine which beverages to sell, so they survey both undergraduates and graduates about drink preferences to determine their market. The survey data the business generates is measured in terms of frequencies: percentages of graduate students and undergraduates who prefer coffee versus soda. The analysis that the business

conducts is based on comparing the observed frequencies (the actual, collected data) to the expected frequencies, given the numbers of students surveyed.

Figure 14. The format for entering data to set up a chi square test in Excel.

Chi square test in Excel:
1. Enter the observed values.
2. Calculate the expected frequencies by multiplying each row total by each column total and dividing by the total number of observations. E.g., the expected frequency for graduate students who prefer coffee is (83 x 117)/237 = 40.97.
3. Use =CHITEST to calculate the chi square test. This will return a p-value. The formula asks for the actual (observed) range and the expected range.
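The expected-frequency and chi square arithmetic can be sketched in Python. The observed counts below are hypothetical, chosen only so the totals match the worked example above (83 graduate students, 117 coffee drinkers, 237 students overall); note also that this computes the chi square statistic itself, whereas Excel's =CHITEST returns the p-value.

```python
# Hypothetical 2x2 survey counts consistent with the worked example's totals.
observed = {
    ("grad", "coffee"): 35, ("grad", "soda"): 48,
    ("undergrad", "coffee"): 82, ("undergrad", "soda"): 72,
}

rows = ["grad", "undergrad"]
cols = ["coffee", "soda"]
total = sum(observed.values())
row_totals = {r: sum(observed[(r, c)] for c in cols) for r in rows}
col_totals = {c: sum(observed[(r, c)] for r in rows) for c in cols}

# Expected frequency = (row total x column total) / grand total.
expected = {(r, c): row_totals[r] * col_totals[c] / total
            for r in rows for c in cols}

# Chi square statistic: sum of (observed - expected)^2 / expected.
chi_square = sum((observed[cell] - expected[cell]) ** 2 / expected[cell]
                 for cell in observed)

print(round(expected[("grad", "coffee")], 2))  # 40.97, as in the worked example
print(round(chi_square, 3))
```

The larger the chi square statistic, the further the observed preferences depart from what would be expected if graduate and undergraduate students had the same drink preferences.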


Is there a difference in beverage preferences between graduate students and undergraduates? What does the p-value tell you?

Figure 15. The results of the chi square test. Note that the p-value is what is returned and displayed. Although the p-value calculated is 0.170845217, values are typically reported to two or three significant digits, e.g., 0.17 or 0.171.

