Homework 2
ISQS 7339 Data Mgmt. & Business Intelligence
Dr. Donald Jones
Darshan D. Komarlu 9/13/2013
TABLE OF CONTENTS
1 Problem 1
  1.1.1 Datasets Configurations
  1.1.2 Results
2 Problem 2
  2.1.1 Datasets Configurations
  2.1.2 Results
3 Problem 3
  3.1.1 Results
4 Problem 4
  4.1.1 Datasets Configurations
  4.1.2 Results
1 PROBLEM 1:
Make use of the XML file in the network drive, DMDT4.xml. Import it, modify its configuration, and make it your model (ignore the Assessment node). Use what you have done in Homework 1 as a starting point to work out the results of the following model. Report the average profits of the model from the test dataset in a table with columns: model name, average profit.

Solution 1:
1.1.1.2.1 PRIOR PROBABILITY ADJUSTMENTS FOR INSURANCE DATASET: The insurance dataset is oversampled for modeling, so these settings adjust the predicted probabilities to reflect the prior probabilities in the bank customer population.
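SAS Enterprise Miner applies this adjustment internally once the priors are specified; the idea can be sketched in Python as the standard rescaling of a probability estimated on an oversampled dataset back to the population prior (the event rates below are hypothetical illustration values, not taken from the data):

```python
# Sketch of adjusting a predicted probability from an oversampled training
# distribution back to the population priors. Not SAS's internal code, just
# the standard rescaling formula.
def adjust_probability(p_sample, prior_pop, prior_sample):
    """Rescale an event probability p_sample (estimated on oversampled data)
    to reflect the true population prior."""
    # Weight each class by the ratio of population prior to sample prior.
    num = p_sample * prior_pop / prior_sample
    den = num + (1 - p_sample) * (1 - prior_pop) / (1 - prior_sample)
    return num / den

# Hypothetical example: the model scores 0.5 on oversampled data where the
# event rate is 34.6%, while the population event rate is 2%.
adjusted = adjust_probability(0.5, 0.02, 0.346)
```

A sanity check on the formula: a score equal to the sample prior maps to the population prior, and when the two priors agree the score is unchanged.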
1.1.1.2.2 DECISION WEIGHTS FOR INSURANCE DATASET: The $3 cost corresponds to the marketing offer and the $150.00 corresponds to the generated profit. The decision matrix is set to maximize profit.
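A minimal sketch of what this decision matrix implies, assuming the $3 cost is incurred for every solicited customer and the $150 profit is earned only when a solicited customer responds (the dollar amounts come from the text; the decision rule shown is the generic expected-profit argument, not SAS's code):

```python
# Expected-profit decision rule implied by the decision weights.
COST = 3.0     # marketing offer cost per solicited customer
PROFIT = 150.0 # profit when a solicited customer responds

def expected_profit(p_respond, solicit):
    """Expected profit of a decision given the predicted response probability."""
    if solicit:
        return p_respond * PROFIT - COST
    return 0.0  # not soliciting earns (and costs) nothing

def best_decision(p_respond):
    """Solicit only when the expected profit of soliciting is positive,
    i.e. when p_respond exceeds the break-even rate COST / PROFIT = 0.02."""
    return expected_profit(p_respond, True) > expected_profit(p_respond, False)
```

This is why the prior-adjusted probabilities matter: the 0.02 break-even threshold only makes sense against population-scale probabilities.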
1.1.1.3 CUSTOMERS DATASET: The CUSTOMERS data source represents an additional segment of the customer database with the same prior proportion of insured customers and the same sampling rate. The role of this data source is set to TEST.
1.1.1.4 INS_SMALL DATASET: INS_SMALL is a small sample of the INSURANCE dataset with 1937 records. The INS variable is set as the target, and only 3 variables are set to nominal (IDNUM, RES, BRANCH).
1.1.1.6 DECISION WEIGHTS FOR INS_SMALL DATASET: The $3 cost corresponds to the marketing offer and the $150.00 corresponds to the generated profit. The decision matrix is set to maximize profit.
1.1.1.7 DATA PARTITION NODE: The training and validation dataset allocations are set to 70% and 30%, respectively.
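For illustration, a 70/30 split of this kind can be sketched as a simple random partition (the Data Partition node also supports stratified sampling; this is only the basic idea):

```python
import random

# Illustrative 70/30 train/validation split, mirroring the Data Partition
# node's allocation. Simple random sampling with a fixed seed.
def partition(rows, train_frac=0.7, seed=12345):
    rng = random.Random(seed)
    shuffled = rows[:]            # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

train, valid = partition(list(range(1000)))
```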
1.1.1.8 DEFAULT TREE CONFIGURATION: The default tree node configurations are used.
1.1.1.9 CART LIKE CLASS PROB CONFIGURATION: This type of tree is not designed for classification based on accuracy or profit. The only difference compared to the CART-like tree is the pruning measure, average squared error.
1.1.1.10 CHAID LIKE + VALID CONFIGURATION: This tree is a blend of CHAID-like and CART-like settings. The increased minimum category size and the default depth adjustment constrain the tree from growing too large.
1.1.1.11 CART LIKE CONFIGURATION: This tree is configured to approximate CART, with the Gini index as the splitting criterion. CART trees tend not to grow as wide as CHAID trees, which allow multi-way splits.
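The Gini criterion mentioned here is easy to state concretely; a minimal sketch (generic CART-style impurity, not SAS's implementation):

```python
# Sketch of the Gini impurity used as a CART-style splitting criterion.
def gini(labels):
    """Gini impurity of a set of class labels: 1 - sum(p_k^2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def split_impurity(left, right):
    """Size-weighted Gini impurity after a binary split; CART picks the
    split that minimizes this quantity."""
    n = len(left) + len(right)
    return (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
```

A pure node has impurity 0, a 50/50 binary node has impurity 0.5, and a split that perfectly separates the classes drives the weighted impurity to 0.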
1.1.1.12 CHAID LIKE CONFIGURATION: This tree is configured to approximate CHAID. Because no validation data is used to tune the model, a small significance level is chosen for larger datasets.
1.1.1.13 CROSS VALIDATION TREE CONFIGURATION: Cross validation is most often associated with CART.
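The cross-validation idea behind this configuration can be sketched as the usual k-fold loop: the tree is grown k times, each time holding out one fold for assessment (a generic sketch, not the node's internal procedure):

```python
# Minimal k-fold cross-validation index generator of the kind associated
# with CART: each record appears in exactly one held-out fold.
def kfold_indices(n, k):
    """Yield (train_idx, test_idx) pairs for k contiguous folds."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, test
        start += size

folds = list(kfold_indices(10, 5))
```

Averaging the assessment measure over the k held-out folds gives the cross-validated estimate used to choose the tree size.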
1.1.2 RESULTS:
The CHAID-like tree with validation data has the best test-set lift in the top 5%, whereas the CART-like and cross-validated CART-like trees fare worse there.

1.1.2.2 ROC CHART INS:
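The ROC chart is summarized by the ROC index (area under the ROC curve): the probability that a randomly chosen event is scored higher than a randomly chosen non-event. A small self-contained sketch of that statistic (generic definition, with made-up scores):

```python
# Sketch of the ROC index (AUC) via the pairwise-ranking definition:
# fraction of (event, non-event) pairs ranked correctly, ties counting half.
def roc_index(scores, labels):
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))

auc = roc_index([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0])  # a perfectly ranked toy case
```

An index of 0.5 corresponds to random ranking and 1.0 to perfect separation, which is how the charts below should be read.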
1.1.2.4 COMPARISON OF PROFITS:

Model Name              Test Average Profit   Test Total Profit
Default Tree            1.243513              16050.02
CART like Class Prob    1.30368               16826.6
CHAID like + Valid      1.331732              17188.67
CART like               1.323522              17082.69
CHAID like              1.283861              16570.8
Cross Validation        1.290181              16652.37

1.1.2.5 SCORE RANKINGS OVERLAY:
2 PROBLEM 2:
Referring to DMDT Section 4.4, go through all data mining tasks in the section. In addition:
1. What is the main reason for using the dataset CUSTOMERS? Why doesn't it use a partition from INSURANCE for the same purpose?
2. Explain the function of the following nodes in the model:
   a. Control Point
   b. Variable Selection Tree
   c. Consolidation Tree
   d. Metadata
3. Provide the same results as the ones in the textbook by screenshots, each with a one-sentence explanation.
4. Add in a comparison node to compare the performance of the three

Solution 2:
2.1.1.3 DECISION WEIGHTS FOR INSURANCE DATASET: The $3 cost corresponds to the marketing offer and the $150.00 corresponds to the generated profit. The decision matrix is set to maximize profit.
2.1.1.4 CUSTOMERS DATASET: The CUSTOMERS data source represents an additional segment of the customer database with the same prior proportion of insured customers and the same sampling rate. The role of this data source is set to TEST. The model can be assessed better by using different datasets for training and testing.
2.1.1.5 DATA PARTITION NODE: The training and validation dataset allocations are set to 70% and 30%, respectively.
2.1.1.6 IMPUTE NODE: This node creates a missing-value indicator for each variable, recording whether the variable's value was imputed.
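The effect of the Impute node can be sketched for a single numeric variable: fill missing values (here with the observed mean, one of several strategies the node offers) and record which rows were filled:

```python
# Sketch of mean imputation with a missing-value indicator, mirroring what
# the Impute node produces for a numeric input (None stands for missing).
def impute_with_indicator(values):
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    filled = [v if v is not None else mean for v in values]
    indicator = [1 if v is None else 0 for v in values]  # 1 = value was imputed
    return filled, indicator

filled, flag = impute_with_indicator([10.0, None, 30.0])
```

The indicator column lets downstream models treat "was missing" as information in its own right.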
2.1.1.7 CONTROL POINT: We use the Control Point node to establish a control point within process flow diagrams. A control point simplifies distributing the connections between process flow steps that have multiple interconnected nodes, and can reduce the number of connections that are made.

2.1.1.8 VARIABLE SELECTION TREE:
The variable selection tree is configured to fit a class probability-like tree. This node selects variables based on their importance level; in this model, 22 variables are selected.
2.1.1.9 CONSOLIDATION TREE: This node is used to select only the desired inputs. If we select only one input (for example, BRANCH in our model), the tree is built using splitting rules on the BRANCH variable only, as can be seen in the tree plot below.
2.1.1.11 METADATA: We use the Metadata node to modify columns' metadata information (such as roles, measurement levels, and order) in a process flow diagram.
These settings enable fitting all possible quadratic terms and two-way interactions.
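What "all quadratic terms and two-way interactions" means for the design matrix can be sketched directly (a generic term-expansion sketch with hypothetical input names, not the Regression node's internals):

```python
from itertools import combinations

# Sketch of expanding a row of inputs with all squared terms and all
# pairwise interaction products.
def expand_terms(row):
    """row: dict of input name -> value. Returns the row augmented with
    squared terms (x*x) and all two-way products (x*y)."""
    out = dict(row)
    for name, value in row.items():
        out[name + "^2"] = value * value
    for (a, va), (b, vb) in combinations(sorted(row.items()), 2):
        out[a + "*" + b] = va * vb
    return out

expanded = expand_terms({"x1": 2.0, "x2": 3.0})
```

With p inputs this adds p squared terms and p(p-1)/2 interactions, which is why stepwise selection is then needed to keep the model manageable.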
These settings result in the fitting of a hierarchically well-formulated stepwise regression.

2.1.1.13 REGRESSION:
2.1.2 RESULTS:
2.1.2.1 VARIABLE SELECTION TREE RESULTS: 22 inputs are selected by this node, as shown in the table below.
2.1.2.2 REGRESSION NODE RESULTS: An 11-variable model is selected by the Regression node.
The performance of the general regression model is relatively poor compared to the tree models, because of the presence of more complex and/or non-additive effects. One approach to handling such patterns is a flexible regression model.

2.1.2.3.1 FLEXIBLE REGRESSION MODEL RESULTS:
A 26-variable model is selected here. Model performance is enhanced by the higher-order terms, but it is still relatively poor compared to the tree models. Neural networks provide a more flexible alternative.
The neural network model outperforms the two regression models seen earlier.
3 PROBLEM 3:
Referring to DMDT Section 4.5, complete the interactive tree tasks in the textbook. Report the main results with screenshots, plus a one-sentence explanation for each.

Solution 3:
3.1.1 RESULTS
3.1.1.2 SELECTING THE SPLITTING RULE: Using the Branch of Bank variable: in this scenario, the bank wants to stratify tree modeling based upon branch characteristics.
3.1.1.3 USING SAVBAL VARIABLE: In this scenario, the bank wants to impose further split values for the savings account balance.
3.1.1.5 ASSESSMENT PLOT: The plot below shows the average profit for a fully grown tree, i.e., with 34 leaves. However, this is not the optimal solution, because the same level of average profit can be obtained with 20 leaves as well.
3.1.1.6 ASSESSMENT TABLE: This table gives the assessment value, i.e., the average profit, with respect to the number of leaves in the tree.
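The pruning decision read off this table can be sketched as: among all subtree sizes, pick the smallest number of leaves whose average profit matches the best observed value (the profit figures below are made-up illustration values, not the real table):

```python
# Sketch of choosing the optimal subtree from an assessment table:
# smallest tree whose average profit ties the maximum.
def optimal_leaves(table, tol=1e-9):
    """table: list of (n_leaves, avg_profit). Returns the smallest leaf
    count whose profit is within tol of the maximum."""
    best = max(profit for _, profit in table)
    candidates = [n for n, profit in table if best - profit <= tol]
    return min(candidates)

# Hypothetical assessment table: the 20-leaf tree ties the 34-leaf tree.
assessment = [(34, 0.66), (20, 0.66), (10, 0.60), (2, 0.40)]
chosen = optimal_leaves(assessment)
```

This mirrors the observation above: the fully grown 34-leaf tree is not optimal when a 20-leaf tree achieves the same average profit.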
The fit statistics table gives the various tree statistics for the train, validation, and test data. The average profit for INS is 0.737914, 0.722968, and 0.728364 for the train, validation, and test data, respectively.
4 PROBLEM 4:
Referring to DMDT Section 5.2, complete the data mining tasks in the following diagram. Report the main results with screenshots, plus a one-sentence explanation for each.
1) If you want to do bagging ensemble modeling, which nodes are needed?
2) What are the main differences between the two models in the diagram?

Solution:
4.1.1.3 CUSTOMERS DATASET: The CUSTOMERS data source represents an additional segment of the customer database with the same prior proportion of insured customers and the same sampling rate. The role of this data source is set to TEST.
4.1.1.4 DATA PARTITION NODE: The training and validation dataset allocations are set to 70% and 30%, respectively.
4.1.1.8 MODEL:
4.1.2 RESULTS:
4.1.2.1 GRADIENT BOOSTING SUBSERIES PLOT: The subseries plot gives the assessment measures by iteration. By default, the node selects the iteration with the best decision measure (profit) on the validation data.
4.1.2.2 GRADIENT BOOSTING SUBSERIES TABLE: The best profit on the validation data is 0.749576, occurring at the 96th iteration.
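The selection rule behind the subseries table can be sketched as a simple argmax over per-iteration validation profit (the profit series below is a made-up illustration, not the actual 100-iteration run):

```python
# Sketch of how the best boosting iteration is chosen from the subseries:
# keep the iteration where validation profit peaks.
def best_iteration(validation_profit_by_iter):
    """Return (iteration, profit) for the iteration with maximal validation
    profit; iterations are numbered from 1."""
    best_i, best_p = 1, validation_profit_by_iter[0]
    for i, p in enumerate(validation_profit_by_iter, start=1):
        if p > best_p:
            best_i, best_p = i, p
    return best_i, best_p

profits = [0.70, 0.72, 0.75, 0.74]   # hypothetical per-iteration profits
it, profit = best_iteration(profits)
```

Stopping at the validation peak rather than the last iteration is what guards the boosted model against overfitting the training data.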
4.1.2.3 STATISTICS PLOT - BAGGING PLOT: This shows the performance of the individual trees within the bagging sequence. The results are displayed for the train, validation, and test datasets.
4.1.2.4 MODEL COMPARISON RESULTS: ROC CHART AND SCORE RANKINGS OVERLAY:
The gradient boosting model has a better test-set profit than the bagged model, and its ROC index is also higher.

4.1.2.6 IF YOU WANT TO DO BAGGING ENSEMBLE MODELING, WHICH NODES ARE NEEDED?
An ensemble model is a combination of multiple models. Bagging (bootstrap aggregation) is the original perturb-and-combine (P&C) method. We use the Start Groups and Decision Tree nodes to do bagging ensemble modeling; for the Start Groups node, we change the mode to Bagging.
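What the Start Groups node does in Bagging mode can be sketched in miniature: fit one simple model per bootstrap resample and combine them with an equal-weight vote. The "learner" below is a hypothetical one-variable decision stump, standing in for the Decision Tree node:

```python
import random

# Minimal bagging sketch: bootstrap resamples + equal-weight voting.
# fit_stump is a toy stand-in learner, not a real tree.
def fit_stump(sample):
    """Trivial learner: predict 1 when x exceeds the sample's mean x."""
    threshold = sum(x for x, _ in sample) / len(sample)
    return lambda x: 1 if x > threshold else 0

def bagged_model(data, n_models=25, seed=7):
    rng = random.Random(seed)
    stumps = []
    for _ in range(n_models):
        boot = [rng.choice(data) for _ in data]   # bootstrap resample
        stumps.append(fit_stump(boot))
    # Equal-weight vote: every model contributes the same, unlike boosting.
    def predict(x):
        votes = sum(s(x) for s in stumps)
        return 1 if votes * 2 >= n_models else 0
    return predict

# Toy data: (x, label) pairs where the label is 1 for x above 5.
data = [(x, 1 if x > 5 else 0) for x in range(11)]
model = bagged_model(data)
```

The perturbation comes from the bootstrap resamples; the combination is the plain majority vote.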
4.1.2.7 WHAT ARE THE MAIN DIFFERENCES BETWEEN THE TWO MODELS IN THE DIAGRAM? Gradient boosting produces a weighted linear combination of simple models, whereas bagging combines simple models with equal weights. Gradient boosting yields better accuracy than bagging; however, it is more prone to overfitting the training data.