
TEXAS TECH UNIVERSITY

Homework 2
ISQS 7339 Data Mgmt. & Business Intelligence
Dr. Donald Jones
Darshan D. Komarlu 9/13/2013

TABLE OF CONTENTS

1 Problem 1:
  1.1.1 Datasets Configurations:
  1.1.2 Results:
2 Problem 2:
  2.1.1 Datasets Configurations:
  2.1.2 Results:
3 Problem 3:
  3.1.1 Results
4 Problem 4:
  4.1.1 Datasets Configurations:
  4.1.2 Results:


Homework #2
1 PROBLEM 1:
Make use of the XML file in the network drive, DMDT4.xml. Import it, modify its configuration, and make it your model (ignore the Assessment node). Use what you have done in Homework 1 as a starting point to work out the results of the following model. Report the average profits of the models on the test dataset in a table with the columns Model Name and Average Profit.

Solution 1:

1.1.1 DATASETS CONFIGURATIONS:


1.1.1.1 INSURANCE DATASET: The Insurance data set has 19,357 cases and 50 columns. The target, INS, is binary and indicates whether the customer has an insurance product. The INS variable is set as the target. Numeric variables with fewer than 20 levels are assigned a nominal measurement level by default. Only 3 variables, IDNUM, BRANCH and RES, are set to the nominal level.

1.1.1.2 VARIABLE SETTINGS:


1.1.1.2.1 PRIOR PROBABILITY ADJUSTMENTS FOR INSURANCE DATASET: The insurance dataset was oversampled for modeling, so these settings adjust the predicted probabilities to reflect the prior probabilities in the bank's customer population.
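For concreteness, the adjustment can also be expressed outside Enterprise Miner. Below is a minimal Python sketch of the standard prior-correction formula; the prior (0.02) and oversampled event rate (0.35) passed in at the end are hypothetical placeholders, not values from this assignment.

    import numpy as np

    def adjust_for_priors(p_sample, pi1, rho1):
        """Rescale probabilities from an oversampled model to population priors.

        p_sample : predicted event probability under the oversampled data
        pi1      : population (prior) event proportion
        rho1     : event proportion in the oversampled training data
        """
        pi0, rho0 = 1.0 - pi1, 1.0 - rho1
        num = p_sample * pi1 / rho1
        den = num + (1.0 - p_sample) * pi0 / rho0
        return num / den

    # Hypothetical values, for illustration only:
    print(adjust_for_priors(np.array([0.2, 0.5, 0.8]), pi1=0.02, rho1=0.35))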

1.1.1.2.2 DECISION WEIGHTS FOR INSURANCE DATASET: The $3 cost corresponds to the marketing offer and the $150.00 corresponds to the generated profit. The decision matrix is set to maximize profit.
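To make the decision logic concrete, here is a minimal Python sketch of how a profit matrix drives the classification decision. Whether the $150 is gross (so the $3 offer cost is netted out, as assumed below) or already net is an assumption, not something stated above.

    import numpy as np

    # Rows: decision (solicit, do not solicit); columns: outcome (INS=1, INS=0).
    # Assumes $150 gross profit when a solicited customer buys, $3 offer cost.
    profit = np.array([[150.0 - 3.0, -3.0],   # solicit
                       [0.0,          0.0]])  # do not solicit

    def best_decision(p_event):
        """Pick the profit-matrix row with the highest expected profit."""
        expected = profit @ np.array([p_event, 1.0 - p_event])
        return ("solicit", "do not solicit")[int(np.argmax(expected))]

    print(best_decision(0.05))  # expected profit 4.5 vs 0.0 -> "solicit"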

1.1.1.3 CUSTOMERS DATASET: The CUSTOMERS data source represents an additional segment of the customer database with the same prior proportion of insured customers and the same sampling rate. The role of this data source is set to TEST.


1.1.1.4 INS_SMALL DATASET: INS_SMALL is a small sample of the insurance dataset with 1,937 records. The INS variable is set as the target and only 3 variables (IDNUM, RES, BRANCH) are set to nominal.


1.1.1.5 PRIOR PROBABILITY ADJUSTMENTS FOR INS_SMALL DATASET:

1.1.1.6 DECISION WEIGHTS FOR INS_SMALL DATASET: The $3 cost corresponds to the marketing offer and the $150.00 corresponds to the generated profit. The decision matrix is set to maximize profit.


1.1.1.7 DATA PARTITION NODE: The training and validation data set allocations are set to 70% and 30%, respectively.
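A rough scikit-learn analogue of the Data Partition node is sketched below; the synthetic X and y are placeholders for the insurance inputs and the INS target.

    import numpy as np
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    X = rng.normal(size=(19357, 5))       # placeholder inputs
    y = rng.integers(0, 2, size=19357)    # placeholder binary INS target

    # 70/30 train/validation split, stratified on the target.
    X_train, X_valid, y_train, y_valid = train_test_split(
        X, y, train_size=0.70, stratify=y, random_state=42)
    print(len(X_train), len(X_valid))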

1.1.1.8 DEFAULT TREE CONFIGURATION: The default tree node configurations are used.

1.1.1.9 CART LIKE CLASS PROB CONFIGURATION: This type of tree is not designed for classification based on accuracy or profit. The only difference from the CART-like tree is the pruning measure, average squared error.


1.1.1.10 CHAID LIKE + VALID CONFIGURATION: This tree is a blend of CHAID-like and CART-like settings. The increased minimum category size and the default depth adjustment constrain the tree from growing too large.

1.1.1.11 CART LIKE CONFIGURATION: This tree is configured to approximate CART. The Gini index is used as the splitting criterion. CART trees tend not to grow as wide as CHAID trees, which allow multi-way splits.
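To illustrate the splitting criterion, here is a minimal Python sketch of the Gini index and the size-weighted impurity of a candidate binary split; CART chooses the split that minimizes this quantity.

    import numpy as np

    def gini(counts):
        """Gini impurity of a node given its class counts."""
        p = np.asarray(counts, dtype=float)
        p /= p.sum()
        return 1.0 - np.sum(p ** 2)

    def gini_split(left_counts, right_counts):
        """Size-weighted Gini impurity of a binary split."""
        n_l, n_r = sum(left_counts), sum(right_counts)
        n = n_l + n_r
        return (n_l / n) * gini(left_counts) + (n_r / n) * gini(right_counts)

    print(gini([50, 50]))                  # 0.5, maximally impure node
    print(gini_split([40, 10], [10, 40]))  # 0.32, a useful split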


1.1.1.12 CHAID LIKE CONFIGURATION: This tree is configured to approximate CHAID. Because no validation data is used to tune the model, a small significance level is chosen for larger datasets.
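For context, CHAID judges candidate splits with a chi-square test of independence between the split and the target, keeping a split only if its p-value beats the significance level. A minimal SciPy sketch follows; the contingency counts and the 0.002 level are hypothetical.

    from scipy.stats import chi2_contingency

    table = [[120, 380],   # left branch:  INS=1, INS=0
             [210, 290]]   # right branch: INS=1, INS=0
    chi2, p_value, dof, _ = chi2_contingency(table)

    alpha = 0.002          # a small significance level for a large dataset
    print(f"chi2={chi2:.1f}, p={p_value:.2e}, keep split: {p_value < alpha}")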

1.1.1.13 CROSS VALIDATION TREE CONFIGURATION: Cross validation is most often associated with CART.
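A scikit-learn analogue of assessing a CART-style tree by cross validation rather than a holdout validation partition, with synthetic data as a placeholder:

    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=2000, random_state=0)

    # Average accuracy over 10 folds stands in for the validation estimate.
    tree = DecisionTreeClassifier(criterion="gini", random_state=0)
    scores = cross_val_score(tree, X, y, cv=10)
    print(scores.mean(), scores.std())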


1.1.1.14 COMPLETE MODEL:

1.1.2 RESULTS:


1.1.2.1 CUMULATIVE LIFT CHART:

The CHAID-like tree with validation data has the best test set lift in the top 5%, whereas the CART-like and cross-validated CART-like trees fare worse in the top 5%.

1.1.2.2 ROC CHART INS:


1.1.2.3 FIT STATISTICS:

1.1.2.4 COMPARISON OF PROFITS:

Model Name              Test Average Profit    Test Total Profit
Default Tree            1.243513               16050.02
CART like Class Prob    1.30368                16826.6
CHAID like + Valid      1.331732               17188.67
CART like               1.323522               17082.69
CHAID like              1.283861               16570.8
Cross Validation        1.290181               16652.37

1.1.2.5 SCORE RANKINGS OVERLAY:


1.1.2.6 SCORE RANKINGS OVERLAY:


1.1.2.7 ASSESS NODE


2 PROBLEM 2:
Referring to DMDT Section 4.4, go through all data mining tasks in the section. In addition:
1. What is the main reason for using the dataset CUSTOMERS? Why doesn't it use a partition from INSURANCE for the same purpose?
2. Explain the function of the following nodes in the model:
   a. Control Point
   b. Variable Selection Tree
   c. Consolidation Tree
   d. Metadata
3. Provide the same results as the ones in the textbook by screenshots, each with a one-sentence explanation.
4. Add in a comparison node to compare the performance of the three models.

Solution 2:

2.1.1 DATASETS CONFIGURATIONS:


2.1.1.1 INSURANCE DATASET: The Insurance data set has 19,357 cases and 50 columns. The target, INS, is binary and indicates whether the customer has an insurance product. The inputs represent other product usage and demographics prior to insurance acquisition. The INS variable is set as the target. Numeric variables with fewer than 20 levels are assigned a nominal measurement level by default. Only 3 variables, IDNUM, BRANCH and RES, are set to the nominal level.

2.1.1.2 VARIABLE SETTINGS:


2.1.1.3 DECISION WEIGHTS FOR INSURANCE DATASET: The $3 cost corresponds to the marketing offer and the $150.00 corresponds to the generated profit. The decision matrix is set to maximize profit.

2.1.1.4 CUSTOMERS DATASET: The CUSTOMERS data source represents an additional segment of the customer database with the same prior proportion of insured customers and the same sampling rate. The role of this data source is set to TEST. The model can be assessed more fairly by using different datasets for training and testing.


2.1.1.5 DATA PARTITION NODE: The training and validation data set allocations are set to 70% and 30%, respectively.

2.1.1.6 IMPUTE NODE: This node creates a missing value indicator for each variable, flagging whether the variable's value was imputed.
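As a rough analogue, scikit-learn's SimpleImputer can produce the same kind of missing-value indicator columns; the tiny matrix below is a placeholder.

    import numpy as np
    from sklearn.impute import SimpleImputer

    X = np.array([[1.0, np.nan],
                  [2.0, 3.0],
                  [np.nan, 5.0]])

    # add_indicator=True appends a binary missing-value flag per imputed
    # column, mirroring the Impute node's indicator variables.
    imputer = SimpleImputer(strategy="mean", add_indicator=True)
    print(imputer.fit_transform(X))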


2.1.1.7 CONTROL POINT: We use the Control Point node to establish a control point within process flow diagrams. A control point simplifies distributing the connections between process flow steps that have multiple interconnected nodes, reducing the number of connections that must be made.

2.1.1.8 VARIABLE SELECTION TREE: The variable selection tree is configured to fit a class probability-like tree. This node selects variables based on their importance level; in this model, it selects 22 variables.
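A scikit-learn analogue of tree-based variable selection, where inputs are kept only if their importance clears a threshold; the synthetic data and the "mean" threshold are illustrative choices, not the node's exact rule.

    from sklearn.datasets import make_classification
    from sklearn.feature_selection import SelectFromModel
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=2000, n_features=50,
                               n_informative=10, random_state=0)

    # Fit a tree, then keep inputs with above-average importance.
    selector = SelectFromModel(DecisionTreeClassifier(random_state=0),
                               threshold="mean").fit(X, y)
    print(selector.get_support().sum(), "inputs selected")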

2.1.1.9 CONSOLIDATION TREE: This node is used to restrict the tree to only the desired inputs. If we select only one input (for example, BRANCH in our model), the tree is built using splitting rules on the BRANCH variable only. This can be seen in the tree plot below.


2.1.1.10 TREE PLOT:

2.1.1.11 METADATA: We use the Metadata node to modify columns' metadata (such as roles, measurement levels, and order) in a process flow diagram.

2.1.1.12 FLEXIBLE REGRESSION:


These settings enable fitting all possible quadratic terms and two-way interactions.
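The same term expansion can be sketched with scikit-learn's PolynomialFeatures, which generates every quadratic term and two-way interaction at degree 2; the toy matrix is a placeholder.

    import numpy as np
    from sklearn.preprocessing import PolynomialFeatures

    X = np.arange(6).reshape(3, 2)

    # degree=2 yields x1, x2, x1^2, x1*x2, x2^2 for each row.
    poly = PolynomialFeatures(degree=2, include_bias=False)
    print(poly.fit_transform(X))
    print(poly.get_feature_names_out(["x1", "x2"]))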

These settings result in the fitting of a hierarchically well-formulated stepwise regression.

2.1.1.13 REGRESSION:

Regression Node configuration.


2.1.1.14 NEURAL NETWORK:

2.1.2 RESULTS:
2.1.2.1 VARIABLE SELECTION TREE RESULTS: 22 inputs are selected by this node, as shown in the table below.


2.1.2.2 REGRESSION NODE RESULTS: An 11-variable model is selected using the Regression node.

2.1.2.3 FIT STATISTICS:


The performance of the general regression model is relatively poor compared to the tree models, because of the presence of more complex and/or non-additive effects. One approach to handling such patterns is to use a flexible regression model.

2.1.2.3.1 FLEXIBLE REGRESSION MODEL RESULTS:

2.1.2.3.2 FIT STATISTICS:


A 26-variable model is selected here. Model performance is improved by the presence of higher-order terms, but it is still relatively poor compared to the tree models. Neural networks provide a more flexible alternative.

2.1.2.4 NEURAL NETWORK RESULTS:

The neural network model outperforms the two regression models seen earlier.
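For reference, here is a minimal scikit-learn sketch of a small feed-forward network; the single three-unit hidden layer is in the spirit of the default Neural Network node but is only a placeholder architecture, and the data are synthetic.

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.neural_network import MLPClassifier

    X, y = make_classification(n_samples=3000, random_state=0)
    X_tr, X_va, y_tr, y_va = train_test_split(X, y, train_size=0.7,
                                              stratify=y, random_state=0)

    net = MLPClassifier(hidden_layer_sizes=(3,), max_iter=1000,
                        random_state=0).fit(X_tr, y_tr)
    print("validation accuracy:", net.score(X_va, y_va))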


3 PROBLEM 3:
Referring to DMDT Section 4.5, complete the interactive tree tasks in the textbook. Report the main results with screenshots, plus a one-sentence explanation for each.

Solution 3:

3.1.1 RESULTS


3.1.1.1 INTERACTIVE TREE CONFIGURATION:

3.1.1.2 SELECTING THE SPLITTING RULE USING THE BRANCH OF BANK VARIABLE: In this scenario, the bank wants to stratify tree modeling based on branch characteristics.


3.1.1.3 USING THE SAVBAL VARIABLE: In this scenario, the bank wants to impose specific split values on the savings account balance.
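Conceptually, imposing an interactive split just partitions the data at a hand-chosen value. A small pandas sketch follows; both the toy data and the 1000 threshold are hypothetical, not the value chosen in the assignment.

    import pandas as pd

    df = pd.DataFrame({"SAVBAL": [0, 250, 1200, 7400, 15000],
                       "INS":    [0,   0,    1,    1,     1]})

    # Split on a hand-chosen savings-balance value and compare event rates.
    left = df[df["SAVBAL"] < 1000]
    right = df[df["SAVBAL"] >= 1000]
    print(left["INS"].mean(), right["INS"].mean())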


3.1.1.4 FULLY GROWN TREE:


3.1.1.5 ASSESSMENT PLOT: The plot below shows the average profit for a fully grown tree, i.e., one with 34 leaves. However, this is not the optimal solution, because the same level of average profit can be obtained with 20 leaves.

3.1.1.6 ASSESSMENT TABLE: This table gives the assessment value, i.e., the average profit, with respect to the number of leaves in the tree.

3.1.1.7 FIT STATISTICS:


The fit statistics table gives the various tree statistics for the training, validation, and test data. The average profit for INS on the training, validation, and test data is 0.737914, 0.722968, and 0.728364, respectively.

4 PROBLEM 4:
Referring to DMDT Section 5.2, complete the data mining tasks in the following diagram. Report the main results with screenshots, plus a one-sentence explanation for each.
1) If you want to do bagging ensemble modeling, which nodes are needed?
2) What are the main differences between the two models in the diagram?

Solution:

4.1.1 DATASETS CONFIGURATIONS:


4.1.1.1 INSURANCE DATASET: The Insurance data set has 19,357 cases and 50 columns. The target, INS, is binary and indicates whether the customer has an insurance product. The inputs represent other product usage and demographics prior to insurance acquisition. The INS variable is set as the target. Numeric variables with fewer than 20 levels are assigned a nominal measurement level by default. Only 3 variables, IDNUM, BRANCH and RES, are set to the nominal level.


4.1.1.2 VARIABLE SETTINGS:

4.1.1.3 CUSTOMERS DATASET: The CUSTOMERS data source represents an additional segment of the customer database with the same prior proportion of insured customers and the same sampling rate. The role of this data source is set to TEST.


4.1.1.4 DATA PARTITION NODE: The training and validation data set allocations are set to 70% and 30%, respectively.

Configuring Bagging:

4.1.1.5 START GROUPS CONFIGURATION:

4.1.1.6 BAGGED CART LIKE TREE CONFIGURATION:


4.1.1.7 GRADIENT BOOSTING NODE CONFIGURATION:

4.1.1.8 MODEL:


4.1.2 RESULTS:
4.1.2.1 GRADIENT BOOSTING SUBSERIES PLOT: The subseries plot gives the assessment measures by iteration. By default, the node selects the iteration with the best decision measure (profit) on the validation data.

4.1.2.2 GRADIENT BOOSTING SUBSERIES TABLE: The best profit on the validation data is 0.749576, occurring at the 96th iteration.
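The iteration-selection idea can be sketched with scikit-learn's staged predictions, scoring the validation data after every boosting iteration and keeping the best one; accuracy stands in here for the profit measure, and the data are synthetic.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=3000, random_state=0)
    X_tr, X_va, y_tr, y_va = train_test_split(X, y, train_size=0.7,
                                              random_state=0)

    gbm = GradientBoostingClassifier(n_estimators=200,
                                     random_state=0).fit(X_tr, y_tr)

    # Validation score after each boosting iteration; keep the best.
    val_acc = [np.mean(pred == y_va) for pred in gbm.staged_predict(X_va)]
    print("best iteration:", int(np.argmax(val_acc)) + 1)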


4.1.2.3 STATISTICS PLOT - BAGGING PLOT: This plot shows the performance of the individual trees within the bagging sequence, displayed for the training, validation, and test datasets.

4.1.2.4 MODEL COMPARISON RESULTS: ROC CHART AND SCORE RANKINGS OVERLAY:


4.1.2.5 FIT STATISTICS:

The gradient boosting model has a better test set profit than the bagged model, and its ROC index is also higher.

4.1.2.6 IF YOU WANT TO DO BAGGING ENSEMBLE MODELING, WHICH NODES ARE NEEDED?

An ensemble model is a combination of multiple models. Bagging (bootstrap aggregation) is the original perturb-and-combine (P&C) method. We use the Start Groups and Decision Tree nodes to do bagging ensemble modeling; in the Start Groups node, we change the mode to Bagging. A scikit-learn analogue is sketched below.
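A minimal scikit-learn sketch of the same idea, with synthetic data as a placeholder:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import BaggingClassifier

    X, y = make_classification(n_samples=3000, random_state=0)

    # BaggingClassifier's default base learner is a decision tree, so this
    # mirrors Start Groups (mode = Bagging) feeding a Decision Tree node:
    # fit the tree on bootstrap resamples and combine by majority vote.
    bag = BaggingClassifier(n_estimators=25, bootstrap=True,
                            random_state=0).fit(X, y)
    print(bag.score(X, y))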

4.1.2.7 WHAT ARE THE MAIN DIFFERENCES BETWEEN THE TWO MODELS IN THE DIAGRAM? The gradient boosting algorithm is a weighted linear combination of simple models, whereas bagging is a combination of simple models with equal weights assigned to each model. Gradient boosting generally yields better accuracy than bagging; however, it is more prone to over-fitting the training data.
