
Running Head: THOMPSON - ASSUMPTIONS AND MULTIPLE REGRESSION

Assumptions and Multiple Regression
Shauna Thompson
EDPS 607


Assumptions and Multiple Regression

In this paper I intend to outline the importance of assumptions in multiple regression and look at how we test them using SPSS. Once the premise of each assumption is explained, I will provide picture examples from SPSS output. Since you are accessing the electronic version of this paper, you will find that the pictures have been shrunk for ease of review; if you find them too small, simply click on the bottom-right corner of a picture and drag it down and to the right to resize it. Rather than providing a step-by-step photo example of how to access each option within SPSS, I have italicized the menu commands and options and placed an arrow (→) between them in each sequence. It is assumed that the reader is familiar with how to run a typical multiple regression in SPSS, and discussion in the paper is limited to the different tables and graphs you find in the standard analysis.

Multiple regression is a prediction method that is most effective at identifying relationships between a dependent variable and a combination of independent variables when its underlying assumptions are satisfied. We use this technique to attempt to predict a dependent variable (in this example, math enjoyment) from a set of predictors (in this example, confidence in math abilities and hours spent monthly on math homework).

So what are assumptions, and why should we care? Regression assumptions clarify the conditions under which multiple regression works well. Multiple regression assumes that each of the variables is normally distributed, that the relationships between the variables are linear, and that the relationship shows homoscedasticity. When these assumptions are met, our estimates are more likely to be reliable (rather than over- or under-estimating the strength of the relationship, and therefore how well the independent variables can predict the dependent variable), with as little error as possible. Serious violations of one or more assumptions lead to the wrong conclusions being drawn at the end of the analysis, and potentially to negative consequences for those the research is being done for. According to Osborne and Waters (2002), the consequences of failing to meet these statistical assumptions are "a Type I or Type II error, or over- or under-estimation of significance or effect size(s)" (p. 1).

In my review of EDPS 607 course notes, university teaching sites, textbooks, and online articles, it seems that the number of assumptions to be considered depends on the person conducting the analysis. Based on what I learned in EDPS 607, I am going to consider three assumptions in multiple regression: normality, homoscedasticity, and linearity. Each assumption is explained below along with examples of how to test for it in pre-analysis and post-analysis; for the post-analysis techniques, run a multiple regression and then check your SPSS output. Each assumption is divided into pre-analysis and post-analysis where needed.
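Before turning to the assumptions themselves, it may help to write out the model being fitted in this example. The notation below is my own sketch of a standard two-predictor regression equation, not something taken from the course materials or the SPSS output:

```latex
\mathrm{enjoyment}_i = b_0 + b_1\,\mathrm{confidence}_i + b_2\,\mathrm{homework\ hours}_i + e_i
```

The assumptions discussed below concern the behaviour of the errors e_i (normality and constant variance) and the form of the relationship between the predictors and the outcome (linearity).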
Assumption 1: Normality

Many popular statistical methods require that the assumption of normality (i.e. a normal distribution) be met; in other words, that the errors are normally distributed. Since multiple regression is a multivariate statistic, the assumption of normality is that the combination of variables follows a multivariate normal distribution (Schwab, 2012). There is no straightforward test of multivariate normality in the standard SPSS menus, so it is my understanding that we test each variable individually and take individual normality as evidence of multivariate normality (though this does not have to be the case). As is true when evaluating most statistical assumptions in SPSS, there are both graphical and statistical methods for evaluating normality. The graphical methods I will look at are the histogram and the normality plot (with both we are looking for evidence of the variables' skewness and kurtosis). Remember: none of these evaluation methods is 100% definitive.

Pre-Analysis: Normality

The histogram gives a nice visual representation of normality. To request a histogram in SPSS, click Analyze → Descriptive Statistics → Explore. In the new window, click your dependent variable and then the right-arrow button to move it to the Dependent List; then click Plots to choose the diagnostic charts for the output. Select None under Boxplots, Histogram under Descriptive, and Normality plots with tests before clicking Continue → OK to produce the output.
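For readers who would like to reproduce these graphical checks outside SPSS, a minimal Python sketch follows. The file name (math_data.csv) and the column names are placeholders of my own, not part of the course data set:

```python
# Hypothetical sketch: the file name and column names are assumed, not from the paper.
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats

df = pd.read_csv("math_data.csv")          # assumed data file with the study variables
enjoy = df["math_enjoyment"].dropna()      # assumed column name for the dependent variable

fig, axes = plt.subplots(1, 2, figsize=(10, 4))

# Histogram: look for a roughly bell-shaped, symmetric distribution.
axes[0].hist(enjoy, bins=20, edgecolor="black")
axes[0].set_title("Histogram of math enjoyment")

# Normal Q-Q plot: points should fall close to the diagonal reference line.
stats.probplot(enjoy, dist="norm", plot=axes[1])
axes[1].set_title("Normal Q-Q plot of math enjoyment")

plt.tight_layout()
plt.show()
```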

To get a general feel for the normality of the distribution, we look at the first graphic produced in the SPSS output, the histogram. If the distribution were skewed in one direction or another, we would see the data pooled at one end of the x-axis; if it bore no relation to the normal distribution at all, we would see no discernible pattern. If we had a problem with kurtosis, we would see either a peak with very steep sides and little on either end, or a very low plateau spread widely across the x-axis. In this example the histogram has a rise and fall that approximates the normal curve, giving us a first impression that our dependent variable likely complies with the assumption of normality.

Next it is important to examine the normality of the dependent variable itself by evaluating its skewness and kurtosis (assuming you did not do so before choosing to run a multiple regression). This is a key step, because if the scores on our dependent variable are not normally distributed, this can affect its relationships with all of the other variables. To double-check skewness and kurtosis we can look at the Descriptives table in the SPSS output. The rule of thumb suggested in EDPS 607 is that a variable is relatively close to normal if its skewness and kurtosis both fall between -1.0 and +1.0.
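The same two statistics can also be computed directly; the sketch below again uses my placeholder file and column names. Note that scipy reports excess kurtosis (0 for a normal distribution), which is on the same scale as the kurtosis SPSS prints, although the small-sample corrections differ slightly:

```python
# Hypothetical sketch: file and column names are assumed, not from the paper.
import pandas as pd
from scipy.stats import skew, kurtosis

df = pd.read_csv("math_data.csv")
enjoy = df["math_enjoyment"].dropna()

sk = skew(enjoy)       # 0 for a perfectly symmetric distribution
ku = kurtosis(enjoy)   # excess kurtosis; 0 for a normal distribution

print(f"skewness = {sk:.3f}, kurtosis = {ku:.3f}")
# Rule of thumb from the course notes: treat the variable as close enough to
# normal if both values fall between -1.0 and +1.0.
print("within +/- 1.0:", abs(sk) <= 1.0 and abs(ku) <= 1.0)
```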
Descriptive Statistics

                     N     Minimum  Maximum  Mean     Std. Deviation  Skewness (Std. Error)  Kurtosis (Std. Error)
math enjoyment       235   12.50    28.93    20.0665  3.26679         .129 (.159)            -.285 (.316)
Valid N (listwise)   235


In looking at the Descriptive Statistics table, we can see that our skewness (0.129) and kurtosis (-0.285) values fall between the preferred values of -1.0 and +1.0, supporting our impression from the histogram that the dependent variable is normally distributed.

The second graphic produced is the normality plot (which can be viewed below). The diagonal line shows the normal trend we are looking for in our data, and the circles represent our data points. If the variable is normally distributed, the dots align closely with the line. In this case we have a fairly normal trend along the line, reinforcing our impression from the histogram. Finally, we go to the statistical output, the Tests of Normality table SPSS produced.
Tests of Normality

                 Kolmogorov-Smirnov(a)            Shapiro-Wilk
                 Statistic   df    Sig.           Statistic   df    Sig.
math enjoyment   .036        235   .200*          .994        235   .444

a. Lilliefors Significance Correction
*. This is a lower bound of the true significance.

Since our sample size (N = 235) is larger than 50, we look at the Kolmogorov-Smirnov test. The null hypothesis for this test of normality states that the actual distribution of the variable is equal to the distribution we expect, i.e. that the variable is normally distributed. Since the probability associated with our example is 0.200, we fail to reject the null hypothesis and conclude that the math enjoyment scores are normally distributed across our sample of students.
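The same pair of tests can be approximated outside SPSS. In the hedged sketch below (placeholder file and column names again), the Lilliefors function in statsmodels plays the role of SPSS's Kolmogorov-Smirnov test with the Lilliefors correction, and the Shapiro-Wilk test comes from scipy:

```python
# Hypothetical sketch: file and column names are assumed, not from the paper.
import pandas as pd
from scipy.stats import shapiro
from statsmodels.stats.diagnostic import lilliefors

df = pd.read_csv("math_data.csv")
enjoy = df["math_enjoyment"].dropna()

ks_stat, ks_p = lilliefors(enjoy, dist="norm")   # K-S test with Lilliefors correction
sw_stat, sw_p = shapiro(enjoy)                   # Shapiro-Wilk test

print(f"Lilliefors K-S: statistic = {ks_stat:.3f}, p = {ks_p:.3f}")
print(f"Shapiro-Wilk:   statistic = {sw_stat:.3f}, p = {sw_p:.3f}")
# As in the SPSS table, p-values above .05 mean we fail to reject the null
# hypothesis that the variable is normally distributed.
```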


Post-Analysis: Normality

Now let's evaluate the assumption of normality once an analysis has been run. First we'll return to the histogram, this time the histogram of the standardized residuals produced by the regression. In this example the histogram has a rise and fall that approximates the normal curve for the most part, giving us a first impression that the assumption of normality likely holds. Since there is a sharper peak in the middle and a value at the negative end of the scale, it will be useful to examine the Mahalanobis distance and Cook's distance in our Residuals Statistics table to ensure that outliers in the data set are not interfering with the assumption of normality. First, let's consider the Mahalanobis distance; from there we can decide whether or not we also need to use Cook's distance (Stevens, 2009).
Residuals Statistics(a)

                                    Minimum    Maximum   Mean      Std. Deviation   N
Predicted Value                     14.9765    26.2194   20.0665   2.40191          235
Std. Predicted Value                -2.119     2.562     .000      1.000            235
Standard Error of Predicted Value   .145       .440      .243      .065             235
Adjusted Predicted Value            14.8870    26.2275   20.0659   2.40208          235
Residual                            -8.54988   5.73953   .00000    2.21422          235
Std. Residual                       -3.845     2.581     .000      .996             235
Stud. Residual                      -3.856     2.601     .000      1.002            235
Deleted Residual                    -8.60030   5.82692   .00060    2.24071          235
Stud. Deleted Residual              -3.977     2.634     .000      1.007            235
Mahal. Distance                     .002       8.174     1.991     1.646            235
Cook's Distance                     .000       .034      .004      .006             235
Centered Leverage Value             .000       .035      .009      .007             235

a. Dependent Variable: math enjoyment

In the example above we want to use the table on page 108 of the Stevens (2009) text to determine the critical value. In this case we have 2 predictors (our 2 independent variables) and a sample size of 235, so our critical value is about 15.99 (p < 0.05). Looking at the minimum (0.002) and maximum (8.174) Mahalanobis distance values in the SPSS output above, we see that our values fall within the acceptable range, so we do not need to be concerned about outliers in this instance. We don't need to consider Cook's distance in this case because it is a measure of the change in the regression coefficients that would occur if outliers were omitted (Stevens, 2009), which is unnecessary here because we have no outliers to remove.
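As an illustration only, Mahalanobis distances can also be computed by hand from the two predictors. The sketch below uses my placeholder file and column names and compares the distances to a chi-square cutoff, which is a common approximation rather than the exact critical value tabled in Stevens (2009):

```python
# Hypothetical sketch: file and column names are assumed, not from the paper.
import numpy as np
import pandas as pd
from scipy.stats import chi2

df = pd.read_csv("math_data.csv")
X = df[["confidence", "homework_hours"]].dropna().to_numpy()   # assumed predictor columns

mean_vec = X.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
diff = X - mean_vec

# Squared Mahalanobis distance for every case (what SPSS labels "Mahal. Distance").
d2 = np.sum((diff @ cov_inv) * diff, axis=1)

# Conservative chi-square cutoff with df = number of predictors (an approximation,
# not the Stevens table value).
cutoff = chi2.ppf(0.999, df=X.shape[1])
print(f"max Mahalanobis distance = {d2.max():.3f}, cutoff = {cutoff:.2f}")
print("cases flagged as potential outliers:", int((d2 > cutoff).sum()))
```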


The other relevant graphic produced in our SPSS output that helps us to be confident that the assumption of normality has been met (which can be viewed below) is the P-P normality plot of the regression standardized residuals. The diagonal line shows the normal trend we are looking for in our data, and the circles represent our data points. If the residuals are normally distributed, the dots will align closely with the line.

In this case we have a fairly normal trend along the line, with some minor deviation toward the centre. Since the deviation is slight, it reinforces our impression from the histogram, and we can assume that the assumption of normality has been satisfied.

Assumption 2: Homoscedasticity

Homoscedasticity, also known as homogeneity of variance, refers to the assumption that the dependent variable exhibits similar amounts of variance across the range of values of an independent variable. Variance, in statistics, is a measure of how spread out a distribution is. Homoscedasticity is evaluated for pairs of variables. As when testing for normality, there are both graphical and statistical methods for evaluating homoscedasticity. In pre-analysis the graphical method we will look at is the box plot; the statistical method is the Levene statistic, which is computed by SPSS. Unfortunately, box plots require the independent variable to be nominal or ordinal and the dependent variable to be ordinal or interval, and our sample variables don't meet these requirements. To show how you might use a boxplot in pre-analysis, an example from another analysis is included here, followed by consideration of the Levene statistic. In post-analysis we will examine a residual analysis (a scatterplot of the z-scores of the residuals, explained in that section). As always, it is important to remember that no single method is absolutely definitive.


Pre-Analysis (special example): Homoscedasticity

To produce a boxplot in SPSS, click Graphs → Legacy Dialogs → Boxplot → Define. In the new window, click your dependent variable and then the right-arrow button to move it to the Variable text box; do the same for your independent variable to move it into the Category Axis text box, and click OK to produce the boxplot.
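Because this special example comes from a different data set (general stress scale scores with a grouping variable), the sketch below assumes a hypothetical file and column names of my own. It produces the same kind of side-by-side boxplot outside SPSS:

```python
# Hypothetical sketch: the file "stress_data.csv" and the "group" / "stress_score"
# column names are assumed, not from the paper.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("stress_data.csv")

# One boxplot per group; boxes of similar height suggest similar variances.
df.boxplot(column="stress_score", by="group")
plt.suptitle("")          # drop pandas' automatic super-title
plt.title("Stress score by group")
plt.xlabel("group")
plt.ylabel("score on general stress scale")
plt.show()
```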

Each box shows the middle 50% of the cases for the group (the median of each group is indicated by the horizontal bar through the box), indicating how spread out the group's scores are. If the variance across the groups is equal, the heights of the boxes will be similar across groups; if the heights differ, the plot suggests that the variance across groups is not homogeneous. In this example the groups appear to be very similar, which suggests equal variance.

The visual impression the boxplot gives us is a good place to start, but it is always important to look at the numbers as well, in this case Levene's statistic. To get Levene's test for homogeneity of variance, click Analyze → Compare Means → One-Way ANOVA. Click your independent variable followed by the right-arrow button to put it into the Factor text box; do the same for your dependent variable to move it to the Dependent List text box. When this is complete, click Options, select Homogeneity of variance test, and click Continue → OK to produce the test for homogeneity of variance.
Test of Homogeneity of Variances
score on general stress scale

Levene Statistic   df1   df2   Sig.
.314               2     232   .731

The null hypothesis for the test of homogeneity of variance states that the variance of the dependent variable is equal across the groups defined by the independent variable (i.e. the variance is homogeneous). In our example the probability associated with the Levene statistic (0.731) is greater than our level of significance (0.05), so we fail to reject the null hypothesis and conclude that the variance is homogeneous.
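Levene's test itself can also be run outside SPSS; this sketch uses the same hypothetical stress-scale file and column names as the boxplot sketch above:

```python
# Hypothetical sketch: file and column names are assumed, not from the paper.
import pandas as pd
from scipy.stats import levene

df = pd.read_csv("stress_data.csv")
groups = [g["stress_score"].dropna() for _, g in df.groupby("group")]

# center="mean" mirrors the classic Levene statistic reported in the ANOVA output.
stat, p = levene(*groups, center="mean")
print(f"Levene statistic = {stat:.3f}, p = {p:.3f}")
# p above .05: fail to reject the null hypothesis of equal (homogeneous) variances.
```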


Post-Analysis (related to example data): Homoscedasticity

Now, back to our original example. If we run a multiple regression analysis and the model is well fitted, there should be no pattern in the residuals plotted against the fitted values. In this case we will look at the standardized residuals (ZRESID) versus the standardized predicted values (ZPRED) in a scatterplot.
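A rough non-SPSS counterpart of this plot is sketched below using statsmodels; the file and column names are again placeholders of my own, and the standardization is a simple z-scoring of the fitted values and residuals rather than SPSS's exact computation:

```python
# Hypothetical sketch: file and column names are assumed, not from the paper.
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm

df = pd.read_csv("math_data.csv").dropna()
X = sm.add_constant(df[["confidence", "homework_hours"]])
model = sm.OLS(df["math_enjoyment"], X).fit()

# Standardize the predicted values and the residuals (z-scores).
zpred = (model.fittedvalues - model.fittedvalues.mean()) / model.fittedvalues.std()
zresid = model.resid / model.resid.std()

plt.scatter(zpred, zresid, alpha=0.6)
plt.axhline(0, linestyle="--")
plt.xlabel("standardized predicted value (ZPRED)")
plt.ylabel("standardized residual (ZRESID)")
plt.title("Residuals vs predicted values")
plt.show()
# A shapeless cloud around zero (no funnel or curve) is consistent with homoscedasticity.
```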

If the assumptions of the model are reasonable, then the residuals should scatter randomly about the zero point like a cloud (Stevens, 2009). We want the data to form a roughly cloud-shaped distribution, without major concentrations on one side of the plot or a funnel that widens across the predicted values. Looking at our example, the data points do form a cloud-like pattern rather than concentrating at one end of the scatterplot or fanning out, so we can conclude that the variance of the residuals is reasonably constant and that the assumption of homoscedasticity has been satisfied.

Assumption 3: Linearity

Linearity means that the rate of change between the scores on two variables is constant across the full range of scores for those variables; the relationship between the predictors and the outcome variable should be linear. As you have probably guessed, SPSS has statistical and graphical methods for evaluating this assumption. The most commonly recommended tool for considering linearity pre-analysis is visual examination of a scatterplot, and we will look at correlation coefficients post-analysis.

Pre-Analysis: Linearity

To produce a scatterplot in SPSS, select Graphs → Legacy Dialogs → Scatter/Dot → Simple → Define. Click your dependent variable followed by the right arrow to move it to the Y text box; click one independent variable followed by the right arrow to move it to the X text box and click OK. Do the same with the dependent variable and your other independent variable (keeping in mind that we will also examine linearity post-analysis rather than relying on the pre-analysis check alone, which could carry a greater error rate). We are looking for a roughly elliptical grouping of points to say that the variables are linearly related.
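For completeness, the same kind of pre-analysis scatterplots can be produced outside SPSS; the sketch below uses my placeholder file and column names:

```python
# Hypothetical sketch: file and column names are assumed, not from the paper.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("math_data.csv")
predictors = ["confidence", "homework_hours"]

fig, axes = plt.subplots(1, len(predictors), figsize=(10, 4))
for ax, col in zip(axes, predictors):
    ax.scatter(df[col], df["math_enjoyment"], alpha=0.6)
    ax.set_xlabel(col)
    ax.set_ylabel("math enjoyment")
    ax.set_title(f"math enjoyment vs {col}")

plt.tight_layout()
plt.show()
# A roughly elliptical cloud of points, rather than a curve, suggests a linear relationship.
```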


The results in our example seem reasonably elliptical in shape; let's move on to the correlation matrix to see what the numbers look like. To get the correlation matrix, select Analyze → Correlate → Bivariate, move your dependent variable and the independent variables into the Variables text box, and click OK.
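A quick non-SPSS counterpart of this correlation check is sketched below (placeholder file and column names); the SPSS table produced by the menu steps above follows:

```python
# Hypothetical sketch: file and column names are assumed, not from the paper.
from itertools import combinations

import pandas as pd
from scipy.stats import pearsonr

df = pd.read_csv("math_data.csv").dropna()
cols = ["math_enjoyment", "confidence", "homework_hours"]

# Pairwise Pearson correlations with their two-tailed p-values.
for a, b in combinations(cols, 2):
    r, p = pearsonr(df[a], df[b])
    print(f"{a} vs {b}: r = {r:.3f}, p = {p:.4f}")
```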
Correlations

                                    confidence in        hours of math        math
                                    ability to do math   homework per month   enjoyment
confidence in ability to do math    1                    .488**               .637**
hours of math homework per month    .488**               1                    .631**
math enjoyment                      .637**               .631**               1

N = 235 for every pairing; Sig. (2-tailed) = .000 for each correlation.
**. Correlation is significant at the 0.01 level (2-tailed).
The probability associated with each correlation coefficient is less than the level of significance (p < 0.001), so we reject the null hypothesis that there is no relationship among our three variables. This supports the assumption of linearity and suggests that there is a relationship between math enjoyment, confidence, and hours of homework.

Post-Analysis: Linearity

Post-analysis, evidence for linearity in the relationships between the independent variables and the dependent variable is found in the Correlations table of our regression output. First we want to ensure that our dependent variable has a significant relationship with each of our independent variables, and we also want to check that the relationship between the dependent variable and each independent variable is stronger than the relationship between the two independent variables. Here we look at the correlation coefficient for each pairing (the Pearson correlation) and also examine the probability for each pairing to ensure it is statistically significant (p < 0.05).


Correlations

Pearson Correlation                 math enjoyment   hours of math homework per month   confidence in ability to do math
math enjoyment                      1.000            .631                               .637
hours of math homework per month    .631             1.000                              .488
confidence in ability to do math    .637             .488                               1.000

Sig. (1-tailed) = .000 and N = 235 for every pairing.

In our example we see that all correlations are statistically significant (p < 0.001). We also see that the relationship of our dependent variable (math enjoyment) with each independent variable (r = 0.631 with homework hours and r = 0.637 with confidence) is stronger than the relationship between the independent variables themselves (r = 0.488). The null hypothesis, which is that there is no linear relationship between the variables, is rejected here, and we conclude that there is a linear relationship between the variables.

Conclusion

In the examples provided in this paper we had a rather straightforward data sample that did not make it too difficult to determine whether the assumptions of normality, linearity, and homoscedasticity were met. By reviewing the relevant histograms and the SPSS output for skewness, kurtosis, and Mahalanobis distance, we can say with confidence that the assumption of normality was met for our data set. By reviewing the box plots, Levene's test, and the z-score residuals, we showed that the assumption of homoscedasticity was met. Finally, by reviewing our scatterplots and correlation coefficients, we determined that there is a linear relationship between our dependent and independent variables. In this case it is appropriate to go forward with a multiple regression for these variables, and hopefully with further analyses to see whether the independent variables are useful and effective in predicting the dependent variable.

Osborne, Christensen, and Gunter (2001) observed that very few published articles report the results of the assumption testing that should form the basis for their conclusions. Unfortunately, that leaves us, as consumers of this research, wondering about the conclusions drawn, since we have no way of knowing whether the statistical assumptions were met.


Appendix 1: Running a Multiple Regression Analysis

To run a multiple regression on your data, click Analyze → Regression → Linear. Select your independent variables and then the right arrow to move them to the Independent(s) text box; then select your dependent variable followed by the right arrow to move it to the Dependent text box. Next, select Statistics and choose Model fit, R squared change, Descriptives, Part and partial correlations, Collinearity diagnostics, and Durbin-Watson, followed by Continue. In the next window select Plots, move ZPRED (standardized predicted values) to the Y text box and ZRESID (standardized residuals) to the X text box (we do this to see whether our assumptions are met). Make sure you have selected Histogram and Normal probability plot, and select Continue. Next, select Options and Exclude cases pairwise, followed by Continue and OK to have SPSS produce the multiple regression output.
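For readers working outside SPSS, the sketch below shows how the same two-predictor regression might be fit with statsmodels; the file and column names are placeholders of my own. The summary output includes the coefficients, R-squared, and a Durbin-Watson statistic comparable to the options requested above:

```python
# Hypothetical sketch: file and column names are assumed, not from the paper.
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("math_data.csv").dropna()

y = df["math_enjoyment"]
X = sm.add_constant(df[["confidence", "homework_hours"]])   # adds the intercept term

model = sm.OLS(y, X).fit()
print(model.summary())   # coefficients, R-squared, and residual diagnostics
```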


References

Harlow, L. L. (2005). The Essence of Multivariate Thinking: Basic Themes and Methods. New Jersey: Lawrence Erlbaum Associates.

Keith, T. Z. (2005). Multiple Regression and Beyond. New York: Pearson.

Osborne, J. W., Christensen, W. R., & Gunter, J. (2001). A New Look at Outliers and Fringeliers: Their Effects on Statistical Accuracy and Type I and Type II Error Rates. Unpublished manuscript, Department of Educational Research and Leadership and Counsellor Education, North Carolina State University.

Osborne, J. W., & Waters, E. (2002). Four assumptions of multiple regression that researchers should always test. Practical Assessment, Research & Evaluation, 8(2). Retrieved from http://PAREonline.net/getvn.asp?v=8&n=2

Schwab, J. (2012). Data Analysis and Computers II.

Stevens, J. P. (2009). Applied Multivariate Statistics for the Social Sciences (5th ed.). New York: Routledge.

University of Calgary (2012). EDPS 607 course lectures and materials.

University of North Texas (2012). Multiple Regression: Assumptions. Retrieved from http://courses.unt.edu/yeatts/6200-Multivariate%20Stats/Lectures-Tests/Test%202/Week-12assumptions.pdf

University of Texas (2012). Assumptions of Multiple Regression. Retrieved from http://www.slideserve.com/tom/assumptions-of-multiple-regression
