

A project's developmental history can be captured by information systems. Many software development organizations maintain very large data bases for configuration management and for problem reporting, which record events during development. Such data bases are potential sources of new information relating software quality factors to the attributes of software products and the attributes of their development processes. For large legacy systems or product lines, the amount of available data can be overwhelming. The combination of numerous attributes of software products and processes, very large data bases designed for other purposes, and weak theoretical support [Kitchenham and Pfleeger (1996)] mandates an empirical approach to software quality prediction, rather than a strictly deductive approach [Khoshgoftaar et al. (2000)].

Fayyad (1996) defines knowledge discovery in data bases as "the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data". Given a set of large data bases or a data warehouse, the major steps of the knowledge discovery process are [Fayyad et al. (1996)]: (1) selection and sampling of data; (2) preprocessing and cleaning of data; (3) data reduction and transformation; (4) data mining; and (5) evaluation of knowledge. Fayyad restricts the term data mining to the step of extracting patterns or models from clean, transformed data, for example, fitting a model or finding a pattern. Classification-tree modeling is an acknowledged tool for data mining [Glymour et al. (1996), Hand (1998)]. Knowledge discovery in general, and the data mining step in particular, is focused on finding patterns and models that can be interpreted as useful knowledge [Fayyad et al. (1996)].

Industrial software systems often have thousands of modules, and a large number of variables can be extracted from source code measurements, configuration management data, and problem reporting data. The result is a large amount of multidimensional data to be analyzed by the data mining step. Classification trees can be used as a data mining technique to identify significant relationships between faults and software product and process attributes [Khoshgoftaar et al. (1996a), Porter and Selby (1990), Troster and Tian (1995)].

This paper introduces the Classification And Regression Trees (CART) algorithm [Breiman et al. (1984)] to software engineering practitioners. A "classification tree" is an algorithm, depicted as a tree graph, that classifies an input object. Alternative classification techniques used in software quality modeling include discriminant analysis [Khoshgoftaar et al. (1996b)], the discriminative power technique [Schneidewind (1995)], logistic regression [Basili et al. (1996)], pattern recognition [Briand et al. (1992)], artificial neural networks [Khoshgoftaar and Lanning (1995)], and fuzzy classification [Ebert (1996)]. A classification tree differs from these in the way it models complex relationships between class membership and combinations of variables. CART automatically builds a parsimonious tree by first growing a maximal tree and then pruning it to an appropriate level of detail. CART is attractive because it emphasizes pruning to achieve robust models. Although Kitchenham briefly
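As a concrete illustration of the grow-then-prune procedure described above, the following is a minimal sketch using scikit-learn, whose DecisionTreeClassifier implements an optimized CART variant with minimal cost-complexity pruning (the pruning scheme of Breiman et al.). The module metrics, fault-proneness labels, and data split below are hypothetical illustration data, not drawn from this paper.

```python
# Sketch: CART-style classification of software modules as fault-prone,
# first growing a maximal tree, then selecting a cost-complexity pruning
# level. All data here are synthetic stand-ins for real module metrics.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Hypothetical attributes for 1000 modules: lines of code,
# cyclomatic complexity, and number of prior changes.
X = rng.uniform(size=(1000, 3)) * np.array([5000.0, 50.0, 20.0])
# Hypothetical labels: 1 = fault-prone, 0 = not fault-prone.
y = (0.001 * X[:, 0] + 0.05 * X[:, 1] + 0.1 * X[:, 2]
     + rng.normal(scale=1.0, size=1000) > 5).astype(int)

X_train, X_hold, y_train, y_hold = train_test_split(X, y, random_state=0)

# Step 1: grow the maximal tree (no depth limit), as CART does.
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Step 2: enumerate the cost-complexity pruning levels and keep the
# pruned subtree that generalizes best on held-out data.
path = full_tree.cost_complexity_pruning_path(X_train, y_train)
best_alpha, best_score = 0.0, 0.0
for alpha in path.ccp_alphas:
    pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha)
    pruned.fit(X_train, y_train)
    score = pruned.score(X_hold, y_hold)
    if score >= best_score:
        best_alpha, best_score = alpha, score

print(f"selected alpha={best_alpha:.5f}, holdout accuracy={best_score:.3f}")
```

In practice, the pruning level would be chosen with a separate validation set or cross-validation rather than the single held-out split used in this sketch; the point is only to show the two-step structure that makes CART's trees parsimonious and robust.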
