
Multivariate Analysis Business Research Methods

Multiple Regression
Q-1 What is Multiple Regression? Ans: Multiple regression is used to account for (predict) the variance in an interval-level dependent variable, based on linear combinations of interval, dichotomous, or dummy independent variables. Multiple regression can establish that a set of independent variables explains a proportion of the variance in a dependent variable at a significant level (through a significance test of R2), and can establish the relative predictive importance of the independent variables (by comparing beta weights). Power terms can be added as independent variables to explore curvilinear effects. Cross-product terms can be added as independent variables to explore interaction effects. One can test the significance of the difference of two R2's to determine if adding an independent variable to the model helps significantly. Using hierarchical regression, one can see how much variance in the dependent can be explained by one or a set of new independent variables, over and above that explained by an earlier set. Of course, the estimates (b coefficients and constant) can be used to construct a prediction equation and generate predicted scores on a variable for further analysis. The multiple regression equation takes the form y = b1x1 + b2x2 + ... + bnxn + c. The b's are the regression coefficients, representing the amount the dependent variable y changes when the corresponding independent variable changes one unit. The c is the constant, where the regression line intercepts the y axis, representing the value the dependent y takes when all the independent variables are 0. The standardized versions of the b coefficients are the beta weights, and the ratio of the beta coefficients is the ratio of the relative predictive power of the independent variables. Associated with multiple regression is R2 (the squared multiple correlation), which is the percent of variance in the dependent variable explained collectively by all of the independent variables. Multiple regression shares all the assumptions of correlation: linearity of relationships, the same level of relationship throughout the range of the independent variable ("homoscedasticity"), interval or near-interval data, absence of outliers, and data whose range is not truncated. In addition, it is important that the model being tested is correctly specified. The exclusion of important causal variables or the inclusion of extraneous variables can change markedly the beta weights and hence the interpretation of the importance of the independent variables.
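The sketch below is a minimal illustration of these ideas in Python with statsmodels, using simulated data; the variable names x1, x2, and y are hypothetical, and the beta weights are obtained simply by refitting on standardized variables. It is a sketch of the general technique, not of any particular SPSS/SAS procedure.

# Illustrative sketch: multiple regression on hypothetical simulated data
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
df = pd.DataFrame({"x1": rng.normal(size=100), "x2": rng.normal(size=100)})
df["y"] = 2.0 * df["x1"] - 1.0 * df["x2"] + rng.normal(size=100)

X = sm.add_constant(df[["x1", "x2"]])   # adds the constant c
model = sm.OLS(df["y"], X).fit()

print(model.params)        # b coefficients and constant
print(model.rsquared)      # R-square
print(model.rsquared_adj)  # adjusted R-square

# Beta weights: the b coefficients obtained after standardizing all variables
z = (df - df.mean()) / df.std()
print(sm.OLS(z["y"], z[["x1", "x2"]]).fit().params)

The same quantities (b coefficients, constant, R2, adjusted R2, beta weights) appear in standard regression output from any statistical package.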

Q-2 What is R-square? Ans: R2, the coefficient of multiple determination (the square of the multiple correlation R), is the percent of the variance in the dependent explained uniquely or jointly by the independents. R-squared can also be interpreted as the proportionate reduction in error in estimating the dependent when knowing the independents. That is, R2 reflects the error made when using the regression model to guess the value of the dependent, in ratio to the total error made when using only the dependent's mean as the basis for estimating all cases. Mathematically, R2 = 1 - (SSE/SST), where SSE = error sum of squares = SUM((Yi - EstYi)squared), where Yi is the actual value of Y for the ith case and EstYi is the regression prediction for the ith case; and where SST = total sum of squares = SUM((Yi - MeanY)squared). The "residual sum of squares" in SPSS/SAS output is SSE and reflects regression error. Thus R-square is 1 minus regression error as a percent of total error and will be 0 when regression error is as large as it would be if you simply guessed the mean for all cases of Y. Put another way, regression sum of squares/total sum of squares = R-square, where the regression sum of squares = total sum of squares - residual sum of squares.

Q-3 What is Adjusted R-square and how is it calculated? Ans: Adjusted R-square is an adjustment for the fact that when one has a large number of independents, it is possible that R2 will become artificially high simply because some independents' chance variations "explain" small parts of the variance of the dependent. At the extreme, when there are as many independents as cases in the sample, R2 will always be 1.0. The adjustment to the formula lowers R2 as k, the number of independents, increases. Some authors conceive of adjusted R2 as the percent of variance "explained in a replication, after subtracting out the contribution of chance." When used for the case of a few independents, R2 and adjusted R2 will be close. When there are a great many independents, adjusted R2 may be noticeably lower. The greater the number of independents, the more the researcher is expected to report the adjusted coefficient. Always use adjusted R2 when comparing models with different numbers of independents.

Adjusted R2 = 1 - ((1 - R2)(N - 1)/(N - k - 1)), where N is the sample size and k is the number of terms in the model not counting the constant (i.e., the number of independents).

Q-4 What is Multicollinearity and how is it measured? Ans: Multicollinearity is the intercorrelation of independent variables. R2's near 1 violate the assumption of no perfect collinearity, while high R2's increase the standard error of the beta coefficients and make assessment of the unique role of each independent difficult or impossible. While simple correlations tell something about multicollinearity, the preferred method of assessing multicollinearity is to regress each independent on all the other independent variables in the equation. Inspection of the correlation matrix reveals only bivariate multicollinearity, with the typical criterion being bivariate correlations > .90. To assess multivariate multicollinearity, one uses tolerance or VIF, which build in the regressing of each independent on all the others. Even when multicollinearity is present, note that estimates of the importance of other variables in the equation (variables which are not collinear with others) are not affected.

Types of multicollinearity. The type of multicollinearity matters a great deal. Some types are necessary to the research purpose.

Tolerance is 1 - R2 for the regression of that independent variable on all the other independents, ignoring the dependent. There will be as many tolerance coefficients as there are independents. The higher the intercorrelation of the independents, the more the tolerance will approach zero. As a rule of thumb, if tolerance is less than .20, a problem with multicollinearity is indicated. When tolerance is close to 0 there is high multicollinearity of that variable with other independents and the b and beta coefficients will be unstable. The more the multicollinearity, the lower the tolerance and the larger the standard error of the regression coefficients. Tolerance is part of the denominator in the formula for calculating the confidence limits on the b (partial regression) coefficient.

Variance inflation factor (VIF) is simply the reciprocal of tolerance. Therefore, when VIF is high there is high multicollinearity and instability of the b and beta coefficients. VIF and tolerance are found in the SPSS and SAS output section on collinearity statistics.

Condition indices and variance proportions. Condition indices are used to flag excessive collinearity in the data. A condition index over 30 suggests serious collinearity problems and an index over 15 indicates possible collinearity problems. If a factor (component) has a high condition index, one looks in the variance proportions column. Criteria for a "sizable proportion" vary among researchers, but the most common criterion is two or more variables having a variance proportion of .50 or higher on a factor with a high condition index. If this is the case, these variables have high linear dependence and multicollinearity is a problem, with the effect that small data changes or arithmetic errors may translate into very large changes or errors in the regression analysis. Note that it is possible for the rule of thumb for condition indices (no index over 30) to indicate multicollinearity even when the rules of thumb for tolerance (> .20) or VIF (< 4) suggest no multicollinearity. Computationally, a "singular value" is the square root of an eigenvalue, and "condition indices" are the ratio of the largest singular value to each other singular value. In SPSS or SAS, select Analyze, Regression, Linear; click Statistics; check Collinearity diagnostics to get condition indices.
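The following sketch computes tolerance and VIF exactly as described above, by regressing each independent on all the others; the predictor names are hypothetical and x2 is deliberately made collinear with x1. (statsmodels also ships a ready-made variance_inflation_factor helper in statsmodels.stats.outliers_influence.)

# Illustrative sketch: tolerance and VIF from auxiliary regressions
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(1)
x1 = rng.normal(size=200)
x2 = 0.8 * x1 + rng.normal(scale=0.6, size=200)   # deliberately collinear with x1
x3 = rng.normal(size=200)
X = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3})

for col in X.columns:
    others = sm.add_constant(X.drop(columns=col))
    r2 = sm.OLS(X[col], others).fit().rsquared   # regress this independent on the rest
    tolerance = 1.0 - r2                         # rule of thumb: problem if < .20
    vif = 1.0 / tolerance                        # VIF is the reciprocal of tolerance
    print(col, round(tolerance, 3), round(vif, 3))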

Q-5 What is homoscedasticity? Ans: Homoscedasticity means the researcher should test to assure that the residuals are dispersed randomly throughout the range of the estimated dependent. Put another way, the variance of residual error should be constant for all values of the independent(s). If not, separate models may be required for the different ranges. Also, when the homoscedasticity assumption is violated, "conventionally computed confidence intervals and conventional t-tests for OLS estimators can no longer be justified." However, moderate violations of homoscedasticity have only minor impact on regression estimates. Nonconstant error variance can be observed by requesting a simple residual plot (a plot of residuals on the Y axis against predicted values on the X axis). A homoscedastic model will display a cloud of dots, whereas lack of homoscedasticity will be characterized by a pattern such as a funnel shape, indicating greater error as the dependent increases. Nonconstant error variance can indicate the need to respecify the model to include omitted independent variables. Lack of homoscedasticity may mean (1) there is an interaction effect between a measured independent variable and an unmeasured independent variable not in the model; or (2) that some independent variables are skewed while others are not. One method of dealing with heteroscedasticity is to select the weighted least squares regression option. This causes cases with smaller residuals to be weighted more in calculating the b coefficients. Square root, log, and reciprocal transformations of the dependent may also reduce or eliminate lack of homoscedasticity. (An illustrative residual check is sketched after the suggested readings below.)

Suggested Readings and Links:
http://www2.chass.ncsu.edu/garson/pa765/regress.htm
www.cs.uu.nl/docs/vakken/arm/SPSS/spss4.pdf
Kahane, Leo H. (2001). Regression basics. Thousand Oaks, CA: Sage Publications.
Menard, Scott (1995). Applied logistic regression analysis. Thousand Oaks, CA: Sage Publications. Series: Quantitative Applications in the Social Sciences, No. 106.
Miles, Jeremy and Mark Shevlin (2001). Applying regression and correlation. Thousand Oaks, CA: Sage Publications. Introductory text built around model-building.
Schroeder, Larry D., David L. Sjoquist, and Paula E. Stephan (1986). Understanding regression analysis: An introductory guide. Thousand Oaks, CA: Sage Publications. Series: Quantitative Applications in the Social Sciences, No. 57.
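As a rough illustration of the residual-plot and weighted least squares ideas above, here is a minimal Python sketch on deliberately heteroscedastic simulated data. The variable names are hypothetical, and the Breusch-Pagan test is one common formal check added here for convenience rather than something described in the text.

# Illustrative sketch: checking and handling nonconstant error variance
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(2)
x = rng.uniform(1, 10, size=200)
y = 3.0 * x + rng.normal(scale=x, size=200)   # error variance grows with x
X = sm.add_constant(x)
model = sm.OLS(y, X).fit()

# Residuals-vs-predicted plot: a funnel shape indicates nonconstant error variance
plt.scatter(model.fittedvalues, model.resid)
plt.axhline(0, linestyle="--")
plt.xlabel("Predicted values")
plt.ylabel("Residuals")
plt.show()

# Breusch-Pagan test: a small p-value is evidence against homoscedasticity
lm_stat, lm_p, f_stat, f_p = het_breuschpagan(model.resid, model.model.exog)
print(lm_p)

# One remedy mentioned above: weighted least squares
wls = sm.WLS(y, X, weights=1.0 / x**2).fit()
print(wls.params)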

Discriminant Analysis
Q-1 What is Discriminant Analysis? Ans: Discriminant function analysis, a.k.a. discriminant analysis or DA, is used to classify cases into the values of a categorical dependent, usually a dichotomy. If discriminant function analysis is effective for a set of data, the classification table of correct and incorrect estimates will yield a high percentage correct. Discriminant function analysis is found in SPSS/SAS under Analyze, Classify, Discriminant. One gets DA or MDA from this same menu selection, depending on whether the specified grouping variable has two categories or more than two. Multiple discriminant analysis (MDA) is an extension of discriminant analysis and a cousin of multiple analysis of variance (MANOVA), sharing many of the same assumptions and tests. MDA is used to classify a categorical dependent which has more than two categories, using as predictors a number of interval or dummy independent variables. MDA is sometimes also called discriminant factor analysis or canonical discriminant analysis. There are several purposes for DA and/or MDA:

To classify cases into groups using a discriminant prediction equation.
To test theory by observing whether cases are classified as predicted.
To investigate differences between or among groups.
To determine the most parsimonious way to distinguish among groups.
To assess the relative importance of the independent variables in classifying the dependent variable.
To infer the meaning of MDA dimensions which distinguish groups, based on discriminant loadings.

Discriminant analysis has two steps: (1) an F test (Wilks' lambda) is used to test if the discriminant model as a whole is significant, and (2) if the F test shows significance, then the individual independent variables are assessed to see which differ significantly in mean by group and these are used to classify the dependent variable. Discriminant analysis shares all the usual assumptions of correlation, requiring linear and homoscedastic relationships, and untruncated interval or near interval data. Like multiple regression, it also assumes proper model specification (inclusion of all important independents and exclusion of extraneous variables). DA also assumes the dependent variable is a true dichotomy since data which are forced into dichotomous coding are truncated, attenuating correlation.

DA is an earlier alternative to logistic regression, which is now frequently used in place of DA as it usually involves fewer violations of assumptions (independent variables needn't be normally distributed, linearly related, or have equal within-group variances), is robust, handles categorical as well as continuous variables, and has coefficients which many find easier to interpret. Logistic regression is preferred when data are not normal in distribution or group sizes are very unequal. See also the separate topic on multiple discriminant function analysis (MDA) for dependents with more than two categories.
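A minimal sketch of a two-group DA in Python with scikit-learn, on simulated data. The data and names are hypothetical, and scikit-learn's coefficients parameterize its decision function, so they are analogous to, not identical with, SPSS's canonical discriminant coefficients.

# Illustrative sketch: two-group discriminant analysis and its classification table
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import confusion_matrix

rng = np.random.default_rng(3)
X0 = rng.normal(loc=0.0, size=(50, 3))     # group 0 on three discriminating variables
X1 = rng.normal(loc=1.0, size=(50, 3))     # group 1, shifted means
X = np.vstack([X0, X1])
y = np.array([0] * 50 + [1] * 50)          # the categorical grouping (criterion) variable

da = LinearDiscriminantAnalysis()
da.fit(X, y)

scores = da.transform(X)                   # discriminant scores (one function for two groups)
print(da.coef_, da.intercept_)             # decision-function coefficients and constant
pred = da.predict(X)
print(confusion_matrix(y, pred))           # classification (confusion) table
print((pred == y).mean())                  # hit ratio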

A Few Definitions and Concepts

Discriminating variables: These are the independent variables, also called predictors.

The criterion variable: This is the dependent variable, also called the grouping variable in SPSS. It is the object of classification efforts.

Discriminant function: A discriminant function, also called a canonical root, is a latent variable which is created as a linear combination of discriminating (independent) variables, such that L = b1x1 + b2x2 + ... + bnxn + c, where the b's are discriminant coefficients, the x's are discriminating variables, and c is a constant. This is analogous to multiple regression, but the b's are discriminant coefficients which maximize the distance between the group means of the criterion (dependent) variable. Note that the foregoing assumes the discriminant function is estimated using ordinary least-squares, the traditional method, but there is also a version involving maximum likelihood estimation.

Number of discriminant functions: There is one discriminant function for 2-group discriminant analysis, but for higher order DA, the number of functions (each with its own cut-off value) is the lesser of (g - 1), where g is the number of categories in the grouping variable, or p, the number of discriminating (independent) variables. Each discriminant function is orthogonal to the others. A dimension is simply one of the discriminant functions when there are more than one, in multiple discriminant analysis.

The eigenvalue, also called the characteristic root of each discriminant function, reflects the ratio of importance of the dimensions which classify cases of the dependent variable. There is one eigenvalue for each discriminant function. For two-group DA, there is one discriminant function and one eigenvalue, which accounts for 100% of the explained variance. If there is more than one discriminant function, the first will be the largest and most important, the second next most important in explanatory power, and so on. The eigenvalues assess relative importance because they reflect the percents of variance explained in the dependent variable, cumulating to 100% for all functions. That is, the ratio of the eigenvalues indicates the relative discriminating power of the discriminant functions. If the ratio of two eigenvalues is 1.4, for instance, then the first discriminant function accounts for 40% more between-group variance in the dependent categories than does the second discriminant function. Eigenvalues are part of the default output in SPSS (Analyze, Classify, Discriminant).

The relative percentage of a discriminant function equals a function's eigenvalue divided by the sum of all eigenvalues of all discriminant functions in the model. Thus it is the percent of discriminating power for the model associated with a given discriminant function. Relative % is used to tell how many functions are important. One may find that only the first two or so eigenvalues are of importance.

The canonical correlation, R, is a measure of the association between the groups formed by the dependent and the given discriminant function. When R is zero, there is no relation between the groups and the function. When the canonical correlation is large, there is a high correlation between the discriminant functions and the groups. Note that relative % and R do not have to be correlated. R is used to tell how useful each function is in determining group differences. An R of 1.0 indicates that all of the variability in the discriminant scores can be accounted for by that dimension. Note that for two-group DA, the canonical correlation is equivalent to the Pearson correlation of the discriminant scores with the grouping variable.

The discriminant score, also called the DA score, is the value resulting from applying a discriminant function formula to the data for a given case. The Z score is the discriminant score for standardized data. To get discriminant scores in SPSS, select Analyze, Classify, Discriminant; click the Save button; check "Discriminant scores". One can also view the discriminant scores by clicking the Classify button and checking "Casewise results."

Cutoff: If the discriminant score of the function is less than or equal to the cutoff, the case is classed as 0; if it is above the cutoff, the case is classed as 1. When group sizes are equal, the cutoff is the mean of the two centroids (for two-group DA). If the groups are unequal, the cutoff is the weighted mean.

Unstandardized discriminant coefficients are used in the formula for making the classifications in DA, much as b coefficients are used in regression in making predictions. The constant plus the sum of products of the unstandardized coefficients with the observations yields the discriminant scores. That is, discriminant coefficients are the regression-like b coefficients in the discriminant function, in the form L = b1x1 + b2x2 + ... + bnxn + c, where L is the latent variable formed by the discriminant function, the b's are discriminant coefficients, the x's are discriminating variables, and c is a constant. The discriminant function coefficients are partial coefficients, reflecting the unique contribution of each variable to the classification of the criterion variable.

The standardized discriminant coefficients, also termed the standardized canonical discriminant function coefficients, are used, like beta weights in regression, to assess the relative classifying importance of the independent variables. Note that importance is assessed relative to the model being analyzed. Addition or deletion of variables in the model can change discriminant coefficients markedly. As with regression, since these are partial coefficients, only the unique explanation of each independent is being compared, not considering any shared explanation. Also, if there are more than two groups of the dependent, the standardized discriminant coefficients do not tell the researcher between which groups the variable is most or least discriminating. For this purpose, group centroids and factor structure are examined.

Q-2 What is Wilks' Lambda? Ans: Wilks' lambda is used to test the significance of the discriminant function as a whole. In SPSS, the "Wilks' Lambda" table will have a column labeled "Test of Function(s)" and a row labeled "1 through n" (where n is the number of discriminant functions). The "Sig." level for this row is the significance level of the discriminant function as a whole. A significant lambda means one can reject the null hypothesis that the two groups have the same mean discriminant function scores. Wilks' lambda is part of the default output in SPSS (Analyze, Classify, Discriminant). In SPSS, this use of Wilks' lambda is in the "Wilks' lambda" table of the output section on "Summary of Canonical Discriminant Functions."

The ANOVA table for discriminant scores is another overall test of the DA model. It is an F test, where a "Sig." p value < .05 means the model differentiates discriminant scores between the groups significantly better than chance (than a model with just the constant). It is obtained in SPSS by asking for Analyze, Compare Means, One-Way ANOVA, using discriminant scores from DA (which SPSS will label Dis1_1 or similar) as dependent.

Wilks' lambda also can be used to test which independents contribute significantly to the discriminant function. The smaller the lambda for an independent variable, the more that variable contributes to the discriminant function. Lambda varies from 0 to 1, with 0 meaning group means differ (thus the more the variable differentiates the groups) and 1 meaning all group means are the same. The F test of Wilks' lambda shows which variables' contributions are significant. Wilks' lambda is sometimes called the U statistic. In SPSS, this use of Wilks' lambda is in the "Tests of equality of group means" table in DA output.

Q-3 What is a Confusion or Classification Matrix?

Ans: The classification table, also called a classification matrix, or a confusion, assignment, or prediction matrix or table, is used to assess the performance of DA. This is simply a table in which the rows are the observed categories of the dependent and the columns are the predicted categories of the dependent. When prediction is perfect, all cases will lie on the diagonal. The percentage of cases on the diagonal is the percentage of correct classifications. This percentage is called the hit ratio.

Expected hit ratio. Note that the hit ratio must be compared not to zero but to the percent that would have been correctly classified by chance alone. For two-group discriminant analysis with a 50-50 split in the dependent variable, the expected percent is 50%. For unequally split 2-way groups of different sizes, the expected percent is computed in the "Prior Probabilities for Groups" table in SPSS, by multiplying the prior probabilities times the group size, summing for all groups, and dividing the sum by N. (A small computational sketch follows the suggested readings below.)

Adapted from the link: http://faculty.chass.ncsu.edu/garson/PA765/discrim2.htm

Suggested Readings:
Huberty, Carl J. (1994). Applied discriminant analysis. NY: Wiley-Interscience. (Wiley Series in Probability and Statistics).
Klecka, William R. (1980). Discriminant analysis. Quantitative Applications in the Social Sciences Series, No. 19. Thousand Oaks, CA: Sage Publications.
Lachenbruch, P. A. (1975). Discriminant analysis. NY: Hafner.
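A minimal sketch of the chance-expected hit ratio calculation just described, using hypothetical group sizes and priors proportional to group size:

# Illustrative sketch: expected hit ratio by chance for unequal groups
group_sizes = {"group_0": 70, "group_1": 30}
N = sum(group_sizes.values())

# Priors proportional to group size (the "compute from group sizes" choice)
priors = {g: n / N for g, n in group_sizes.items()}

# Sum of prior x group size, divided by N
expected_hit_ratio = sum(priors[g] * n for g, n in group_sizes.items()) / N
print(expected_hit_ratio)   # 0.58 here: (0.7 * 70 + 0.3 * 30) / 100

An observed hit ratio is only impressive to the extent that it clearly exceeds this chance-expected figure.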


Cluster Analysis
Q-1 What is Cluster Analysis? Ans: Cluster analysis, also called segmentation analysis or taxonomy analysis, seeks to identify homogeneous subgroups of cases in a population. That is, cluster analysis seeks to identify a set of groups which both minimize within-group variation and maximize between-group variation. Other techniques, such as latent class analysis and Q-mode factor analysis, also perform clustering and are discussed separately. SPSS offers three general approaches to cluster analysis. Hierarchical clustering allows users to select a definition of distance, then select a linking method for forming clusters, then determine how many clusters best suit the data. In k-means clustering the researcher specifies the number of clusters, K, in advance, and the procedure then assigns cases to the K clusters. K-means clustering is much less computer-intensive and is therefore sometimes preferred when datasets are very large (ex., > 1,000 cases). Finally, two-step clustering creates pre-clusters, then clusters the pre-clusters.

Key Concepts and Terms


Cluster formation is the selection of the procedure for determining how clusters are created, and how the calculations are done. In agglomerative hierarchical clustering every case is initially considered a cluster, then the two cases with the lowest distance (or highest similarity) are combined into a cluster. The case with the lowest distance to either of the first two is considered next. If that third case is closer to a fourth case than it is to either of the first two, the third and fourth cases become the second two-case cluster; if not, the third case is added to the first cluster. The process is repeated, adding cases to existing clusters, creating new clusters, or combining clusters to get to the desired final number of clusters. There is also divisive clustering, which works in the opposite direction, starting with all cases in one large cluster. Hierarchical cluster analysis, discussed below, can use either agglomerative or divisive clustering strategies.

Similarity and Distance


Distance. The first step in cluster analysis is establishment of the similarity or distance matrix. This matrix is a table in which both the rows and columns are the units of analysis and the cell entries are a measure of similarity or distance for any pair of cases. Euclidean distance is the most common distance measure. A given pair of cases is plotted on two variables, which form the x and y axes. The Euclidean distance is the square root of the sum of the square of the x difference plus the square of the y difference. (Recall high school geometry: this is the formula for the length of the hypotenuse of a right triangle.) It is common to use squared Euclidean distance, which simply omits the square root. When two or more variables are used to define distance, the one with the larger magnitude will dominate, so to avoid this it is common to first standardize all variables. There are a variety of different measures of inter-observation distances and inter-cluster distances to use as criteria when merging nearest clusters into broader groups or when considering the relation of a point to a cluster. SPSS supports these interval distance measures: Euclidean distance, squared Euclidean distance, Chebychev, block, Minkowski, or customized; for count data, chi-square or phi-square. For binary data, it supports Euclidean distance, squared Euclidean distance, size difference, pattern difference, variance, shape, or Lance and Williams.

Similarity. Distance measures how far apart two observations are; cases which are alike share a low distance. Similarity measures how alike two cases are; cases which are alike share a high similarity. SPSS supports a large number of similarity measures for interval data (Pearson correlation or cosine) and for binary data (Russell and Rao, simple matching, Jaccard, Dice, Rogers and Tanimoto, Sokal and Sneath 1, Sokal and Sneath 2, Sokal and Sneath 3, Kulczynski 1, Kulczynski 2, Sokal and Sneath 4, Hamann, Lambda, Anderberg's D, Yule's Y, Yule's Q, Ochiai, Sokal and Sneath 5, phi 4-point correlation, or dispersion).

Absolute values. Since for Pearson correlation, high negative as well as high positive values indicate similarity, the researcher normally selects absolute values. This can be done by checking the absolute value checkbox in the Transform Measures area of the Methods subdialog (invoked by pressing the Methods button) of the main Cluster dialog.

Summary. In SPSS, similarity/distance measures are selected in the Measure area of the Method subdialog obtained by pressing the Method button in the Classify dialog. There are three measure pulldown menus, for interval, binary, and count data respectively. The proximity matrix table in the output shows the actual distances or similarities computed for any pair of cases. In SPSS, proximity matrices are selected under Analyze, Cluster, Hierarchical clustering; Statistics button; check proximity matrix.

Method. Under the Method button in the SPSS Classify dialog, the pull-down Method selection determines how cases or clusters are combined at each step. Different methods will result in different cluster patterns. SPSS offers these method choices:


Nearest neighbor. In this single linkage method, the distance between two clusters is the distance between their closest neighboring points.

Furthest neighbor. In this complete linkage method, the distance between two clusters is the distance between their two furthest member points.

UPGMA (unweighted pair-group method using averages). The distance between two clusters is the average distance between all inter-cluster pairs. UPGMA is generally preferred over the nearest or furthest neighbor methods since it is based on information about all inter-cluster pairs, not just the nearest or furthest ones, and it is the default method in SPSS. SPSS labels this "between-groups linkage."

Average linkage within groups is the mean distance between all possible inter- or intra-cluster pairs. The average distance between all pairs in the resulting cluster is made to be as small as possible. This method is therefore appropriate when the research purpose is homogeneity within clusters. SPSS labels this "within-groups linkage."

Ward's method calculates the sum of squared Euclidean distances from each case in a cluster to the mean of all variables. The cluster to be merged is the one which will increase the sum the least. This is an ANOVA-type approach and is preferred by some researchers for this reason.

Centroid method. The cluster to be merged is the one with the smallest sum of Euclidean distances between cluster means for all variables.

Median method. Clusters are weighted equally regardless of group size when computing centroids of two clusters being combined. This method also uses Euclidean distance as the proximity measure.

Correlation of items can be used as a similarity measure. One transposes the normal data table in which columns are variables and rows are cases. By using columns as cases and rows as variables instead, the correlation is between cases, and these correlations may constitute the cells of the similarity matrix.

Binary matching is another type of similarity measure, where 1 indicates a match and 0 indicates no match between any pair of cases. There are multiple matched attributes and the similarity score is the number of matches divided by the number of attributes being matched. Note that it is usual in binary matching to have several attributes because there is a risk that when the number of attributes is small, they may be orthogonal to (uncorrelated with) one another, and clustering will be indeterminate.

Summary measures assess how the clusters differ from one another.

Means and variances. A table of means and variances of the clusters with respect to the original variables shows how the clusters differ on the original variables. SPSS does not make this available in the Cluster dialog, but one can click the Save button, which will
save the cluster number for each case (or numbers if multiple solutions are requested). Then in Analyze, Compare Means, Means the researcher can use the cluster number as the grouping variable to compare differences of means on any other continuous variable in the dataset.

Linkage tables show the relation of the cases to the clusters.

Cluster membership table. This shows cases as rows, where columns are alternative numbers of clusters in the solution (as specified in the "Range of Solutions" option in the Cluster Membership group in SPSS, under the Statistics button). Cell entries show the number of the cluster to which the case belongs. From this table, one can see which cases are in which groups, depending on the number of clusters in the solution.

Agglomeration schedule. The agglomeration schedule is a choice under the Statistics button for Hierarchical Cluster in the SPSS Cluster dialog. In this table, the rows are stages of clustering, numbered from 1 to (n - 1). The (n - 1)th stage includes all the cases in one cluster. There are two "Cluster Combined" columns, giving the case or cluster numbers for combination at each stage. In agglomerative clustering using a distance measure like Euclidean distance, stage 1 combines the two cases which have the lowest proximity (distance) score. The cluster number goes by the lower of the cases or clusters combined, where cases are initially numbered 1 to n. For instance, at stage 1, cases 3 and 18 might be combined, resulting in a cluster labeled 3. Later cluster 3 and case 2 might be combined, resulting in a cluster labeled 2. The researcher looks at the "Coefficients" column of the agglomeration schedule and notes when the proximity coefficient jumps up and is not a small increment from the one before (or when the coefficient reaches some theoretically important level). Note that for distance measures, low is good, meaning the cases are alike; for similarity measures, high coefficients mean cases are alike. After the stopping stage is determined in this manner, the researcher can work backward to determine how many clusters there are and which cases belong to which clusters (but it is easier just to get this information from the cluster membership table). Note, though, that SPSS will not stop on this basis but instead will compute the range of solutions (ex., 2 to 4 clusters) requested by the researcher in the Cluster Membership group of the Statistics button in the Hierarchical Clustering dialog. When there are relatively few cases, icicle plots or dendrograms provide the same linkage information in an easier format.

Linkage plots show similar information in graphic form.

Icicle plots are usually horizontal, showing cases as rows and number of clusters in the solution as columns. If there are few cases, vertical icicle plots may be plotted, with cases as columns. Reading from the last column right to left (horizontal icicle plots) or last row bottom to top (vertical icicle plots), the researcher can see how agglomeration proceeded. The last/bottom row will show all the cases in separate one-case clusters. This is the (n - 1) solution. The next-to-last/bottom column/row will show the (n - 2) solution, with two cases combined into one cluster. Subsequent columns/rows show further clustering steps. Row 1 (vertical icicle plots) or column 1 (horizontal icicle plots) will show all cases in a
single cluster. This is a visual way of representing information on the agglomeration schedule, but without the proximity coefficient information.

Dendrograms, also called tree diagrams, show the relative size of the proximity coefficients at which cases were combined. The bigger the distance coefficient or the smaller the similarity coefficient, the more the clustering involved combining unlike entities, which may be undesirable. Trees are usually depicted horizontally, not vertically, with each row representing a case on the Y axis, while the X axis is a rescaled version of the proximity coefficients. Cases with low distance/high similarity are close together. Cases showing low distance are close, with a line linking them a short distance from the left of the dendrogram, indicating that they are agglomerated into a cluster at a low distance coefficient, indicating alikeness. When, on the other hand, the linking line is to the right of the dendrogram, the linkage occurs at a high distance coefficient, indicating the cases/clusters were agglomerated even though much less alike. If a similarity measure is used rather than a distance measure, the rescaling of the X axis still produces a diagram with linkages involving high alikeness to the left and low alikeness to the right. In SPSS, select Analyze, Classify, Hierarchical Cluster; click the Plots button; check the Dendrogram checkbox.
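The sketch below illustrates the same workflow with SciPy on simulated data: a standardized data matrix, a squared-Euclidean proximity matrix, an average-linkage (UPGMA-style) agglomeration schedule, a cut of the tree into a chosen number of clusters, and a dendrogram. All data and the choice of four clusters are hypothetical.

# Illustrative sketch: agglomerative hierarchical clustering with SciPy
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(4)
X = rng.normal(size=(20, 3))
Xz = (X - X.mean(axis=0)) / X.std(axis=0)     # standardize so no variable dominates

d = pdist(Xz, metric="sqeuclidean")           # squared Euclidean distances
print(squareform(d))                          # proximity matrix

Z = linkage(d, method="average")              # average (between-groups) linkage
print(Z[:5])  # each row: the two clusters merged, the merge distance, cluster size

labels = fcluster(Z, t=4, criterion="maxclust")   # cut the tree into 4 clusters
dendrogram(Z, orientation="right")
plt.show()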

Q-2 What is Hierarchical Cluster Analysis?


Hierarchical clustering is appropriate for smaller samples (typically < 250). To accomplish hierarchical clustering, the researcher must specify how similarity or distance is defined, how clusters are aggregated (or divided), and how many clusters are needed. Hierarchical clustering generates all possible clusters of sizes 1...K, but is used only for relatively small samples. In hierarchical clustering, the clusters are nested rather than being mutually exclusive, as is the usual case. That is, in hierarchical clustering, larger clusters created at later stages may contain smaller clusters created at earlier stages of agglomeration. One may wish to use the hierarchical cluster procedure on a sample of cases (ex., 200) to inspect results for different numbers of clusters. The optimum number of clusters depends on the research purpose. Identifying "typical" types may call for few clusters and identifying "exceptional" types may call for many clusters. After using hierarchical clustering to determine the desired number of clusters, the researcher may wish then to analyze the entire dataset with k-means clustering (a.k.a. the Quick Cluster procedure: Analyze, Cluster, K-Means Cluster Analysis), specifying that number of clusters.

Forward clustering, also called agglomerative clustering: Small clusters are formed by using a high similarity index cut-off (ex., .9). Then this cut-off is relaxed to establish broader and broader clusters in stages until all cases are in a single cluster at some low similarity index cut-off. The merging of clusters is visualized using a tree format. Backward clustering, also called divisive clustering, is the same idea, but starting with a low cut-off and working toward a high cut-off. Forward and backward methods need not generate the same results.


Clustering variables. In the Hierarchical Cluster dialog, in the Cluster group, the researcher may select Variables rather than the usual Cases, in order to cluster variables. SPSS calls hierarchical clustering the "Cluster procedure." In SPSS, select Analyze, Classify, Hierarchical Cluster; select variables; select Cases in the Cluster group; click Statistics, select Proximity Matrix; select Range of Solutions in the Cluster Membership group, specify the number of clusters (typically 3 to 6); Continue; OK.

Q-3 What is K-means Cluster Analysis?


K-means cluster analysis uses Euclidean distance. The researcher must specify in advance the desired number of clusters, K. Initial cluster centers are chosen in a first pass of the data, then each additional iteration groups observations based on nearest Euclidean distance to the mean of the cluster. Cluster centers change at each pass. The process continues until cluster means do not shift more than a given cut-off value or the iteration limit is reached. Cluster centers are the average value on all clustering variables of each cluster's members. The "Initial cluster centers" table, in spite of its title, gives the average value of each variable for each cluster for the k well-spaced cases which SPSS selects for initialization purposes when no initial file is supplied. The "Final cluster centers" table in SPSS output gives the same thing for the last iteration step. The "Iteration history" table shows the change in cluster centers when the usual iterative approach is taken. When the change drops below a specified cutoff, the iterative process stops and cases are assigned to clusters according to which cluster center they are nearest. Large datasets are possible with K-means clustering, unlike hierarchical clustering, because K-means clustering does not require prior computation of a proximity matrix of the distance/similarity of every case with every other case.

Method. The default method is "Iterate and classify," under which an iterative process is used to update cluster centers, then cases are classified based on the updated centers. However, SPSS supports a "Classify only" method, under which cases are immediately classified based on the initial cluster centers, which are not updated.

Agglomerative K-means clustering. Normally in K-means clustering, a given case may be assigned to a cluster, then reassigned to a different cluster as the algorithm unfolds. However, in agglomerative K-means clustering, the solution is constrained to force a given case to remain in its initial cluster.

SPSS: Analyze, Cluster, K-Means Cluster Analysis; enter variables in the Variables: area; optionally, enter a variable in the "Label cases by:" area; enter "Number of clusters:"; choose Method (Iterate and classify, or Classify only). Unlike hierarchical clustering, there is no option for "Range of solutions"; instead you must re-run K-means clustering, asking for a different number of clusters.
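A minimal K-means sketch in Python with scikit-learn, on simulated data with three well-separated groups; the parameter choices (three clusters, a maximum of ten iterations) are hypothetical and simply echo the defaults discussed in the text.

# Illustrative sketch: K-means clustering
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(loc=c, size=(100, 2)) for c in (0.0, 4.0, 8.0)])

km = KMeans(n_clusters=3, n_init=10, max_iter=10, random_state=0)
labels = km.fit_predict(X)

print(km.cluster_centers_)   # final cluster centers (mean of each variable per cluster)
print(km.n_iter_)            # number of iterations used before convergence
print(np.bincount(labels))   # number of cases in each cluster

# Distance from each case to its own cluster center (analogous to a saved QCL_2 column)
dist = np.linalg.norm(X - km.cluster_centers_[labels], axis=1)
print(dist[:5])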


Iterate button. Optionally, you may press the Iterate button and set the number of iterations and the convergence criterion. The default maximum number of iterations in SPSS is 10. For the convergence criterion, by default, iterations terminate if the largest change in any cluster center is less than 2% of the minimum distance between initial centers (or if the maximum number of iterations has been reached). To override this default, enter a positive number less than or equal to 1 in the convergence box. There is also a "Use running means" checkbox which, if checked, will cause the cluster centers to be updated after each case is classified, rather than the default, which is after the entire set of cases is classified.

Save button. Optionally, you may press the Save button to save the final cluster number of each case as an added column in your dataset (labeled QCL_1), and/or you may save the Euclidean distance between each case and its cluster center (labeled QCL_2) by checking "Distance from cluster center."

Options button. Optionally, you may press the Options button to select statistics or missing values options. There are three statistics options: "Initial cluster centers" (gives the initial variable means for each cluster); ANOVA table (ANOVA F-tests for each variable; as the F tests are only descriptive, the resulting probabilities are for exploratory purposes only, but non-significant variables might nonetheless be dropped as not contributing to the differentiation of clusters); and "Cluster information for each case" (gives each case's final cluster assignment and the Euclidean distance between the case and the cluster center; also gives the Euclidean distance between final cluster centers).

Getting different clusters. Sometimes the researcher wishes to experiment to get different clusters, as when the "Number of cases in each cluster" table shows highly imbalanced clusters and/or clusters with very few members. Different results may occur by setting different initial cluster centers from file (see above), by changing the number of clusters requested, or even by presenting the data file in a different case order.

Q-4 What is Two-Step Cluster Analysis?


Two-step cluster analysis groups cases into pre-clusters which are treated as single cases. Standard hierarchical clustering is then applied to the pre-clusters in the second step. This is the method used when one or more of the variables are categorical (not interval or dichotomous). Also, since it is a method requiring neither a proximity table like hierarchical clustering nor an iterative process like K-means clustering, but rather is a one-pass-through-the-dataset method, it is recommended for very large datasets.

Cluster feature tree. The preclustering stage employs a CF (cluster feature) tree with nodes leading to leaf nodes. Cases start at the root node and are channeled toward the nodes and eventually the leaf nodes which match them most closely. If there is no adequate match, the case is used to start its own leaf node. It can happen that the CF tree fills up and cannot accept new leaf entries in a node, in which case it is split using the most-distant pair in the node as seeds. If this recursive process grows the CF tree beyond maximum size, the threshold distance is increased and the tree is rebuilt, allowing new cases to be input. The process continues
until all the data are read. Click the Advanced button in the Options button dialog to set threshold distances, maximum levels, and maximum branches per leaf node manually.

Proximity. When one or more of the variables are categorical, log-likelihood is the distance measure used, with cases categorized under the cluster which is associated with the largest log-likelihood. If variables are all continuous, Euclidean distance is used, with cases categorized under the cluster which is associated with the smallest Euclidean distance.

Number of clusters. By default SPSS determines the number of clusters using the change in BIC (the Schwarz Bayesian Criterion): when the change in BIC is small, it stops and selects as many clusters as have been created thus far. It is also possible to have this done based on changes in AIC (the Akaike Information Criterion), or simply to tell SPSS how many clusters are wanted. The researcher can also ask for a range of solutions, such as 3-5 clusters. The "Autoclustering statistics" table in SPSS output gives, for example, BIC and BIC change for all solutions. (An illustrative sketch of BIC-based selection of the number of clusters, using a different algorithm, follows the suggested readings below.)

SPSS. Choose Analyze, Classify, Two-Step Cluster; select your categorical and continuous variables; if desired, click Plots and select the plots wanted; click Output and select the statistics wanted (descriptive statistics, cluster frequencies, AIC or BIC); Continue.

Adapted from:
http://faculty.chass.ncsu.edu/garson/PA765/cluster.htm
www.cs.uu.nl/docs/vakken/arm/SPSS/spss8.pdf

Suggested Readings:
Anil K. Jain and Richard C. Dubes, Algorithms for Clustering Data, 2004.
Leonard Kaufman and Peter J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis, 2005.
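The sketch below is an analogue of BIC-based selection of the number of clusters, not SPSS's TwoStep algorithm itself: it fits Gaussian mixture models for a range of candidate cluster counts on simulated data and picks the count with the smallest BIC. The data and the candidate range 1-6 are hypothetical.

# Illustrative sketch: choosing the number of clusters by BIC
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(6)
X = np.vstack([rng.normal(loc=c, size=(150, 2)) for c in (0.0, 5.0, 10.0)])

bics = {}
for k in range(1, 7):
    gm = GaussianMixture(n_components=k, random_state=0).fit(X)
    bics[k] = gm.bic(X)           # lower BIC is better

best_k = min(bics, key=bics.get)  # small further changes in BIC suggest stopping
print(bics, best_k)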


Factor Analysis
Q-1 What is Factor Analysis?

Ans: Factor analysis is a correlational technique to determine meaningful clusters of shared variance. Factor analysis should be driven by a researcher who has a deep and genuine interest in relevant theory, in order to get optimal value from choosing the right type of factor analysis and interpreting the factor loadings. Factor analysis begins with a large number of variables and then tries to reduce the interrelationships amongst the variables to a smaller number of clusters or factors. Factor analysis finds relationships or natural connections where variables are maximally correlated with one another and minimally correlated with other variables, and then groups the variables accordingly. After this process has been done many times, a pattern of relationships or factors that captures the essence of all of the data emerges. Summary: Factor analysis refers to a collection of statistical methods for reducing correlational data into a smaller number of dimensions or factors.
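The sketch below illustrates the core computations on simulated data using a principal-component-style extraction with plain NumPy: the correlation matrix, eigenvalues (Kaiser criterion), loadings, communalities, and percent of variance. It is a minimal sketch of the arithmetic only; statistical packages add other extraction methods and rotation. The data, with four variables built from two hypothetical latent factors, are invented for illustration.

# Illustrative sketch: principal-component-style factor extraction
import numpy as np

rng = np.random.default_rng(7)
f1, f2 = rng.normal(size=(2, 300))                  # two hypothetical latent factors
X = np.column_stack([
    f1 + 0.3 * rng.normal(size=300),
    f1 + 0.3 * rng.normal(size=300),
    f2 + 0.3 * rng.normal(size=300),
    f2 + 0.3 * rng.normal(size=300),
])

R = np.corrcoef(X, rowvar=False)                    # correlation matrix
eigvals, eigvecs = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]                   # sort factors by eigenvalue
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

keep = eigvals > 1.0                                # Kaiser criterion
loadings = eigvecs[:, keep] * np.sqrt(eigvals[keep])   # variable-factor correlations
communalities = (loadings ** 2).sum(axis=1)            # variance explained per variable
pct_variance = eigvals[keep] / R.shape[0]               # eigenvalue / number of variables

print(eigvals.round(2))
print(loadings.round(2))
print(communalities.round(2), pct_variance.round(2))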

Key Concepts and Terms


Exploratory factor analysis (EFA) seeks to uncover the underlying structure of a relatively large set of variables. The researcher's a priori assumption is that any indicator may be associated with any factor. This is the most common form of factor analysis. There is no prior theory and one uses factor loadings to intuit the factor structure of the data.

Confirmatory factor analysis (CFA) seeks to determine if the number of factors and the loadings of measured (indicator) variables on them conform to what is expected on the basis of pre-established theory. Indicator variables are selected on the basis of prior theory and factor analysis is used to see if they load as predicted on the expected number of factors. The researcher's a priori assumption is that each factor (the number and labels of which may be specified a priori) is associated with a specified subset of indicator variables. A minimum requirement of confirmatory factor analysis is that one hypothesize beforehand the number of factors in the model, but usually the researcher will also posit expectations about which variables will load on which factors (Kim and Mueller, 1978b: 55). The researcher seeks to determine, for instance, if measures created to represent a latent variable really belong together.

Factor loadings: The factor loadings, also called component loadings in PCA, are the correlation coefficients between the variables (rows) and factors (columns). Analogous to Pearson's r, the squared factor loading is the percent of variance in that variable explained by the factor. To get the percent of variance in all the variables accounted for by each factor, add the sum of the squared factor loadings for that factor (column) and divide by
the number of variables. (Note the number of variables equals the sum of their variances, as the variance of a standardized variable is 1.) This is the same as dividing the factor's eigenvalue by the number of variables.

Communality, h2, is the squared multiple correlation for the variable as dependent, using the factors as predictors. The communality measures the percent of variance in a given variable explained by all the factors jointly and may be interpreted as the reliability of the indicator. When an indicator variable has a low communality, the factor model is not working well for that indicator and possibly it should be removed from the model. However, communalities must be interpreted in relation to the interpretability of the factors. A communality of .75 seems high but is meaningless unless the factor on which the variable is loaded is interpretable, though it usually will be. A communality of .25 seems low but may be meaningful if the item is contributing to a well-defined factor. That is, what is critical is not the communality coefficient per se, but rather the extent to which the item plays a role in the interpretation of the factor, though often this role is greater when communality is high.

Eigenvalues: Also called characteristic roots. The eigenvalue for a given factor measures the variance in all the variables which is accounted for by that factor. The ratio of eigenvalues is the ratio of explanatory importance of the factors with respect to the variables. If a factor has a low eigenvalue, then it is contributing little to the explanation of variances in the variables and may be ignored as redundant with more important factors. Thus, eigenvalues measure the amount of variation in the total sample accounted for by each factor. Note that the eigenvalue is not the percent of variance explained but rather a measure of amount of variance in relation to total variance (since variables are standardized to have means of 0 and variances of 1, total variance is equal to the number of variables). SPSS will output a corresponding column titled '% of variance'. A factor's eigenvalue may be computed as the sum of its squared factor loadings for all the variables.

Q-2 What are the criteria for determining the number of factors? Ans: The common criteria, roughly in order of frequency of use in social science (see Dunteman, 1989: 22-3), are the following.

Kaiser criterion: A common rule of thumb for dropping the least important factors from the analysis. The Kaiser rule is to drop all components with eigenvalues under 1.0. The Kaiser criterion is the default in SPSS and most computer programs.

Scree plot: The Cattell scree test plots the components as the X axis and the corresponding eigenvalues as the Y axis. As one moves to the right, toward later components, the eigenvalues drop. When the drop ceases and the curve makes an elbow toward less steep decline, Cattell's scree test says to drop all further components after the one starting the elbow. This rule is sometimes criticised for being amenable to researcher-controlled "fudging." That is, as picking the "elbow" can be subjective because the curve has multiple elbows or is a smooth curve, the researcher may be tempted to set the cut-off
at the number of factors desired by his or her research agenda. Even when "fudging" is not a consideration, the scree criterion tends to result in more factors than the Kaiser criterion. Variance explained criteria: Some researchers simply use the rule of keeping enough factors to account for 90% (sometimes 80%) of the variation. Where the researcher's goal emphasizes parsimony (explaining variance with as few factors as possible), the criterion could be as low as 50%. Q-3 What are the different rotation methods used in factor analysis? Ans: No rotation is the default, but it is a good idea to select a rotation method, usually varimax. The original, unrotated principal components solution maximizes the sum of squared factor loadings, efficiently creating a set of factors which explain as much of the variance in the original variables as possible. The amount explained is reflected in the sum of the eigenvalues of all factors. However, unrotated solutions are hard to interpret because variables tend to load on multiple factors. Varimax rotation is an orthogonal rotation of the factor axes to maximize the variance of the squared loadings of a factor (column) on all the variables (rows) in a factor matrix, which has the effect of differentiating the original variables by extracted factor. Each factor will tend to have either large or small loadings of any particular variable. A varimax solution yields results which make it as easy as possible to identify each variable with a single factor. This is the most common rotation option. Quartimax rotation is an orthogonal alternative which minimizes the number of factors needed to explain each variable. This type of rotation often generates a general factor on which most variables are loaded to a high or medium degree. Such a factor structure is usually not helpful to the research purpose. Q-4 How many cases are required to do factor analysis? There is no scientific answer to this question, and methodologists differ. Alternative arbitrary "rules of thumb," in descending order of popularity, include those below. These are not mutually exclusive: Bryant and Yarnold, for instance, endorse both STV and the Rule of 200. Rule of 10. There should be at least 10 cases for each item in the instrument being used. STV ratio. The subjects-to-variables ratio should be no lower than 5 (Bryant and Yarnold, 1995)


Rule of 100: The number of subjects should be the larger of 5 times the number of variables, or 100. Even more subjects are needed when communalities are low and/or few variables load on each factor. (Hatcher, 1994)

Rule of 150: Hutcheson and Sofroniou (1999) recommend at least 150-300 cases, more toward the 150 end when there are a few highly correlated variables, as would be the case when collapsing highly multicollinear variables.

Q-5 What is "sampling adequacy" and what is it used for? Ans: Measured by the Kaiser-Meyer-Olkin (KMO) statistic, sampling adequacy predicts if data are likely to factor well, based on correlation and partial correlation. In the old days of manual factor analysis, this was extremely useful. KMO can still be used, however, to assess which variables to drop from the model because they are too multicollinear. There is a KMO statistic for each individual variable, and these are combined into the KMO overall statistic. KMO varies from 0 to 1.0 and KMO overall should be .60 or higher to proceed with factor analysis. If it is not, drop the indicator variables with the lowest individual KMO statistic values, until KMO overall rises above .60. To compute KMO overall, the numerator is the sum of squared correlations of all variables in the analysis (except the 1.0 self-correlations of variables with themselves, of course). The denominator is this same sum plus the sum of squared partial correlations of each variable i with each variable j, controlling for others in the analysis. The concept is that the partial correlations should not be very large if one is to expect distinct factors to emerge from factor analysis. (A computational sketch of KMO follows the suggested readings below.) In SPSS, KMO is found under Analyze - Statistics - Data Reduction - Factor - Variables (input variables) - Descriptives - Correlation Matrix - check KMO and Bartlett's test of sphericity and also check Anti-image - Continue - OK. The KMO output is KMO overall. The diagonal elements of the Anti-image correlation matrix are the KMO individual statistics for each variable.

Adapted from:
http://faculty.chass.ncsu.edu/garson/PA765/factspss.htm
www.sussex.ac.uk/Users/andyf/factor.pdf
www.cs.uu.nl/docs/vakken/arm/SPSS/spss7.pdf

Suggested Readings:
Bruce Thompson, Exploratory and Confirmatory Factor Analysis: Understanding Concepts and Applications, 2004.
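A minimal sketch of the overall KMO calculation described above, with partial correlations obtained from the inverse of the correlation matrix; the data are simulated and hypothetical. Third-party Python packages (for example, factor_analyzer) provide ready-made KMO functions, so this is only to make the formula concrete.

# Illustrative sketch: overall KMO from correlations and partial correlations
import numpy as np

def kmo_overall(X):
    R = np.corrcoef(X, rowvar=False)
    Rinv = np.linalg.inv(R)
    # Partial correlation of each pair, controlling for all other variables
    d = np.sqrt(np.outer(np.diag(Rinv), np.diag(Rinv)))
    partial = -Rinv / d
    mask = ~np.eye(R.shape[0], dtype=bool)          # off-diagonal elements only
    r2 = (R[mask] ** 2).sum()                       # sum of squared correlations
    p2 = (partial[mask] ** 2).sum()                 # sum of squared partial correlations
    return r2 / (r2 + p2)                           # ranges 0-1; proceed if >= .60

rng = np.random.default_rng(8)
f = rng.normal(size=300)
X = np.column_stack([f + 0.5 * rng.normal(size=300) for _ in range(5)])
print(kmo_overall(X))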
