LABORATORY MANUAL
on
DATA MINING
1) INTRODUCTION TO WEKA
WEKA is an open-source application that is freely available under the GNU General
Public License. Originally written in C, the WEKA application has been completely
rewritten in Java and is compatible with almost every computing platform. It is
user-friendly, with a graphical interface that allows for quick setup and operation.
WEKA operates on the premise that the user's data is available as a flat file or
relation. This means that each data object is described by a fixed number of
attributes of a specific type, normally nominal or numeric values.
The WEKA application gives novice users a tool to identify hidden information in
databases and file systems, with simple-to-use options and visual interfaces.
The WEKA workbench contains a collection of visualization tools and algorithms for
data analysis and predictive modeling, together with graphical user interfaces for
easy access to this functionality.
The original version was primarily designed as a tool for analyzing data from
agricultural domains, but the more recent, fully Java-based version (WEKA 3),
development of which started in 1997, is now used in many different application
areas, in particular for educational purposes and research.
2) ADVANTAGES OF WEKA
The obvious advantage of a package like WEKA is that a whole range of data
preparation, feature selection and data mining algorithms is integrated. This means
that only one data format is needed, and trying out and comparing different
approaches becomes very easy. The package also comes with a GUI, which makes
it easier to use.
WEKA is portable, since it is fully implemented in the Java programming language
and thus runs on almost any modern computing platform.
WEKA supports several standard data mining tasks, more specifically, data
preprocessing, clustering, classification, regression, visualization, and feature
selection.
All of WEKA's techniques are predicated on the assumption that the data is available
as a single flat file or relation, where each data point is described by a fixed number
of attributes.
WEKA provides access to SQL databases using Java Database Connectivity and
can process the result returned by a database query.
It is not capable of multi-relational data mining, but there is separate software for
converting a collection of linked database tables into a single table that is suitable for
processing using WEKA. Another important area that is not covered is sequence
modeling.
Attribute-Relation File Format (ARFF) is the text file format used by WEKA to
store data.
The ARFF file contains two sections: the header and the data section. The first line of
the header tells us the relation name.
Then there is the list of the attributes (@attribute...). Each attribute is associated with
a unique name and a type.
The latter describes the kind of data contained in the variable and what values it can
have. The variable types are: numeric, nominal, string and date.
The class attribute is by default the last one in the list. The header section can
also contain comment lines, identified by a '%' at the beginning, which can
describe the database content or give the reader information about the author. After
that comes the data itself (@data); each line stores the attributes of a single entry,
separated by commas.
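A minimal ARFF file illustrating this structure is shown below (this example uses the small weather dataset that ships with WEKA; only a few data rows are reproduced):

```
% Comment lines in the header can describe the database content.
@relation weather

@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute humidity numeric
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}

@data
sunny,85,85,FALSE,no
overcast,83,86,FALSE,yes
rainy,70,96,FALSE,yes
```

Note how the class attribute (play) is the last one in the list, and each @data line gives the attribute values of one instance, separated by commas.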
WEKA's main user interface is the Explorer, but essentially the same functionality
can be accessed through the component-based Knowledge Flow interface and from
the command line. There is also the Experimenter, which allows the systematic
comparison of the predictive performance of WEKA's machine learning algorithms on
a collection of datasets.
Launching WEKA
The WEKA GUI Chooser window is used to launch WEKA's graphical environments.
At the bottom of the window are four buttons:
1. Simple CLI. Provides a simple command-line interface that allows direct
execution of WEKA commands for operating systems that do not provide their
own command-line interface.
2. Explorer. An environment for exploring data with WEKA.
3. Experimenter. An environment for performing experiments and conducting
statistical tests between learning schemes.
4. Knowledge Flow. This environment supports essentially the same functions as the
Explorer but with a drag-and-drop interface. One advantage is that it supports
incremental learning.
S.K.T.R.M College of Engineering
Classification
Selecting a Classifier
At the top of the Classify section is the Classifier box. This box has a text field that
gives the name of the currently selected classifier and its options. Clicking on the
text box brings up a GenericObjectEditor dialog box, just the same as for filters,
which you can use to configure the options of the current classifier. The Choose
button allows you to choose one of the classifiers that are available in WEKA.
Test Options
The result of applying the chosen classifier will be tested according to the options
that are set by clicking in the Test options box. There are four test modes:
1. Use training set. The classifier is evaluated on how well it predicts the class of the
instances it was trained on.
2. Supplied test set. The classifier is evaluated on how well it predicts the class of a
set of instances loaded from a file. Clicking the Set... button brings up a dialog
allowing you to choose the file to test on.
3. Cross-validation. The classifier is evaluated by cross-validation, using the
number of folds that are entered in the Folds text field.
4. Percentage split. The classifier is evaluated on how well it predicts a certain
percentage of the data which is held out for testing. The amount of data held out
depends on the value entered in the % field.
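To make the cross-validation mode concrete, here is a plain-Python sketch (outside WEKA) of what k-fold evaluation does; the tiny dataset and the majority-class "learner" are made up purely for illustration:

```python
import random

def cross_validation_accuracy(data, labels, train_fn, k=10, seed=1):
    """Estimate accuracy by k-fold cross-validation: split the data into
    k folds, train on k-1 folds, test on the held-out fold, and pool the
    correct predictions over all folds."""
    rng = random.Random(seed)          # the "random seed for xval"
    indices = list(range(len(data)))
    rng.shuffle(indices)               # randomize before dividing up
    folds = [indices[i::k] for i in range(k)]
    correct = 0
    for fold in folds:
        train_idx = [i for i in indices if i not in fold]
        model = train_fn([data[i] for i in train_idx],
                         [labels[i] for i in train_idx])
        correct += sum(1 for i in fold if model(data[i]) == labels[i])
    return correct / len(data)

def majority_class(train_x, train_y):
    """A stand-in learner: always predict the most common class."""
    majority = max(set(train_y), key=train_y.count)
    return lambda x: majority

xs = list(range(20))
ys = ["good"] * 14 + ["bad"] * 6
print(cross_validation_accuracy(xs, ys, majority_class, k=10))
```

Because every instance is held out exactly once, the pooled accuracy uses each instance as test data while still training on most of the data each time.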
Note: No matter which evaluation method is used, the model that is output is always
the one built from all the training data. Further testing options can be set by clicking
on the More options... button:
1. Output model. The classification model on the full training set is output so that it can
be viewed, visualized, etc. This option is selected by default.
2. Output per-class stats. The precision/recall and true/false statistics for each class
are output. This option is also selected by default.
3. Output entropy evaluation measures. Entropy evaluation measures are included in
the output. This option is not selected by default.
4. Output confusion matrix. The confusion matrix of the classifier's predictions is
included in the output. This option is selected by default.
5. Store predictions for visualization. The classifier's predictions are remembered so
that they can be visualized. This option is selected by default.
6. Output predictions. The predictions on the evaluation data are output. Note that in
the case of a cross-validation the instance numbers do not correspond to the location
in the data!
7. Cost-sensitive evaluation. Errors are evaluated with respect to a cost matrix. The
Set... button allows you to specify the cost matrix used.
8. Random seed for xval / % Split. This specifies the random seed used when
randomizing the data before it is divided up for evaluation purposes.
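The entropy evaluation measures mentioned in option 3 are information-theoretic; the core idea can be sketched (outside WEKA) as the mean number of bits of surprise in the classifier's probability estimates. The probabilities below are made up for illustration:

```python
import math

def mean_cross_entropy(prob_of_true_class):
    """Average -log2(p) over the evaluation instances, where p is the
    probability the classifier assigned to each instance's true class.
    Lower is better; a perfectly confident, correct classifier scores 0 bits."""
    return sum(-math.log2(p) for p in prob_of_true_class) / len(prob_of_true_class)

# Hypothetical predicted probabilities for the true class of four instances.
probs = [0.9, 0.8, 0.5, 0.25]
print(round(mean_cross_entropy(probs), 4))
```

A classifier that always assigns probability 0.5 to the true class of a two-class problem scores exactly 1 bit per instance, the same as random guessing.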
Training a Classifier
Once the classifier, test options and class have all been set, the learning process is
started by clicking on the Start button. While the classifier is busy being trained, the
little bird moves around. You can stop the training process at any time by clicking on
the Stop button.
When training is complete, several things happen. The Classifier output area
to the right of the display is filled with text describing the results of training and
testing. A new entry appears in the Result list box. We look at the result list below;
but first we investigate the text that has been output.
The results of the chosen test mode are broken down thus:
1. Summary. A list of statistics summarizing how accurately the classifier was able
to predict the true class of the instances under the chosen test mode.
2. Confusion Matrix. Shows how many instances have been assigned to each
class. Elements show the number of test examples whose actual class
is the row and whose predicted class is the column.
3. Visualize margin curve. Generates a plot illustrating the prediction margin. The
margin is defined as the difference between the probability predicted for the
actual class and the highest probability predicted for the other classes. For
example, boosting algorithms may achieve better performance on test data by
increasing the margins on the training data.
4. Visualize cost curve. Generates a plot that gives an explicit representation of
the expected cost, as described by Drummond and Holte (2000). Options are
greyed out if they do not apply to the specific set of results.
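The confusion-matrix layout described above (actual class as the row, predicted class as the column) can be reproduced in a few lines of Python; the class labels and predictions here are illustrative, not WEKA output:

```python
def confusion_matrix(actual, predicted, classes):
    """Build a confusion matrix: rows are actual classes,
    columns are predicted classes."""
    m = {a: {p: 0 for p in classes} for a in classes}
    for a, p in zip(actual, predicted):
        m[a][p] += 1
    return m

actual    = ["good", "good", "good", "bad", "bad"]
predicted = ["good", "good", "bad",  "bad", "good"]
cm = confusion_matrix(actual, predicted, ["good", "bad"])

# Per-class recall, as reported by "Output per-class stats":
recall_good = cm["good"]["good"] / sum(cm["good"].values())
print(cm, recall_good)
```

Reading along a row shows how the instances of one actual class were distributed over the predicted classes, which is exactly how WEKA prints the matrix.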
Credit risk is an investor's risk of loss arising from a borrower who does not make
payments as promised. Such an event is called a default. Other terms for credit risk
are default risk and counterparty risk.
Credit risk is most simply defined as the potential that a bank borrower or
counterparty will fail to meet its obligations in accordance with agreed terms.
Banks need to manage the credit risk inherent in the entire portfolio as well as the
risk in individual credits or transactions.
Banks should also consider the relationships between credit risk and other risks.
A good credit assessment means you should be able to qualify, within the limits of
your income, for most loans.
Lab Experiments
1. List all the categorical (or nominal) attributes and the real-valued
attributes separately.
From the German Credit Assessment case study given to us, the following attributes
are found to be applicable for credit-risk assessment.

Total valid attributes:
1. checking_status
2. duration
3. credit_history
4. purpose
5. credit_amount
6. savings_status
7. employment
8. installment_rate
9. personal_status
10. debtors
11. residence_since
12. property
13. age
14. installment_plans
15. housing
16. existing_credits
17. job
18. num_dependents
19. telephone
20. foreign_worker

Categorical (or nominal) attributes (which take true/false, etc. values):
1. checking_status
2. credit_history
3. purpose
4. savings_status
5. employment
6. personal_status
7. debtors
8. property
9. installment_plans
10. housing
11. job
12. telephone
13. foreign_worker

Real-valued attributes:
1. duration
2. credit_amount
3. installment_rate
4. residence_since
5. age
6. existing_credits
7. num_dependents
5. To generate the decision tree, right click on the result list and select visualize
tree option by which the decision tree will be generated.
6. The obtained decision tree for credit-risk assessment is too large to fit on the
screen.
It is a bad idea to take all of the data as the training set: how, then, can we test
whether the classification is correct?
As a rule of thumb, for a reliable accuracy estimate we should take about 2/3 of the
dataset as the training set and the remaining 1/3 as the test set. In the model above
we took the complete dataset as the training set, which yields only 85.5% accuracy.
Training on everything also spends effort analyzing attributes that play no crucial
role in credit-risk assessment; the complexity increases and the accuracy ultimately
suffers. If part of the dataset is used as the training set and the remainder as the
test set, the results are more trustworthy and the computation time is lower.
This is why we prefer not to take the complete dataset as the training set.
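The 2/3 train, 1/3 test split described above can be sketched in plain Python; the 1,000 placeholder records stand in for the German credit data:

```python
import random

def train_test_split(instances, train_fraction=2/3, seed=1):
    """Hold out part of the data for testing instead of training on
    everything: shuffle the instances, then split at the given fraction."""
    rng = random.Random(seed)
    shuffled = instances[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

data = list(range(1000))   # stand-in for the 1000 credit records
train, test = train_test_split(data)
print(len(train), len(test))
```

The model is then built only on the training portion, and accuracy is measured on instances it has never seen, which is what the Percentage split test option automates.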
Use training set result for the table GermanCreditData:

Correctly Classified Instances       855        85.5    %
Incorrectly Classified Instances     145        14.5    %
Kappa statistic                        0.6251
Mean absolute error                    0.2312
Root mean squared error                0.34
Relative absolute error               55.0377 %
Root relative squared error           74.2015 %
Total Number of Instances           1000
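The kappa statistic in the summary compares the observed accuracy with the accuracy expected by chance agreement. A sketch of the computation from a 2x2 confusion matrix follows; the counts below are hypothetical, not the actual German-credit matrix:

```python
def kappa(cm):
    """Cohen's kappa from a confusion matrix given as a list of rows
    (rows = actual class, columns = predicted class)."""
    n = sum(sum(row) for row in cm)
    observed = sum(cm[i][i] for i in range(len(cm))) / n
    # Chance agreement: sum over classes of (row marginal * column marginal).
    expected = sum(
        (sum(cm[i]) / n) * (sum(row[i] for row in cm) / n)
        for i in range(len(cm))
    )
    return (observed - expected) / (1 - expected)

# Hypothetical counts with 85.5% accuracy (600+255 correct out of 1000).
cm = [[600, 100],
      [ 45, 255]]
print(round(kappa(cm), 4))
```

Kappa is 1.0 for perfect agreement and 0 for chance-level agreement, which is why it is a more honest summary than raw accuracy on an imbalanced class distribution.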
1. Select classify tab and J48 decision tree and in the test option select cross
validation radio button and the number of folds as 10.
2. The number of folds indicates the number of partitions into which the dataset is divided.
Correctly Classified Instances       698        69.8    %
Incorrectly Classified Instances     302        30.2    %
Kappa statistic                        0.2264
Mean absolute error                    0.3571
Root mean squared error                0.4883
Relative absolute error               85.0006 %
Root relative squared error          106.5538 %
Total Number of Instances           1000
Correctly Classified Instances       709        70.9    %
Incorrectly Classified Instances     291        29.1    %
Kappa statistic                        0.2538
Mean absolute error                    0.3484
Root mean squared error                0.4825
Relative absolute error               82.9304 %
Root relative squared error          105.2826 %
Total Number of Instances           1000
Correctly Classified Instances       710        71      %
Incorrectly Classified Instances     290        29      %
Kappa statistic                        0.2587
Mean absolute error                    0.3444
Root mean squared error                0.4771
Relative absolute error               81.959  %
Root relative squared error          104.1164 %
Total Number of Instances           1000
Percentage split does not allow 100%; it allows only up to 99.9%.
Correctly Classified Instances       362        72.4    %
Incorrectly Classified Instances     138        27.6    %
Kappa statistic                        0.2725
Mean absolute error                    0.3225
Root mean squared error                0.4764
Relative absolute error               76.3523 %
Root relative squared error          106.4373 %
Total Number of Instances            500
If we remove the 9th attribute, the accuracy further increases to 86.6%, which shows
that these attributes are not significant for training.
After removing attributes 1, 4, 6, 8, 9, 11, 12, 13, 14, 15, 16, 18, 19 and 20, we
select the remaining attributes and visualize them.
After removing 14 attributes, the accuracy decreases to 76.4%; hence we can
try further random combinations of attributes to increase the accuracy.
Cross validation:
Percentage split:
4. Set classes as 2.
5. Click on Resize; then we get the cost matrix.
6. Then change the 2nd entry in the 1st row and the 2nd entry in the 1st column to 5.0.
7. Then the confusion matrix will be generated, and you can find out the difference
between the good and bad attributes.
8. Check whether the accuracy is changing or not.
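The effect of the cost matrix set in steps 4-7 can be sketched as follows: the total cost is the element-wise product of the confusion matrix and the cost matrix, summed over all cells. The confusion-matrix counts below are hypothetical; the 5.0 penalties mirror the entries changed in step 6:

```python
def total_cost(confusion, costs):
    """Sum of confusion[i][j] * costs[i][j]: each cell of the confusion
    matrix (actual class i, predicted class j) is weighted by its
    misclassification cost. Correct predictions cost 0."""
    return sum(
        confusion[i][j] * costs[i][j]
        for i in range(len(confusion))
        for j in range(len(confusion[i]))
    )

# Cost matrix after step 6: both off-diagonal entries set to 5.0.
costs = [[0.0, 5.0],
         [5.0, 0.0]]
# Hypothetical confusion matrix (actual = rows, predicted = columns).
confusion = [[620, 80],
             [ 70, 230]]
print(total_cost(confusion, costs))
```

With equal off-diagonal costs the ranking of classifiers matches plain error counting; the evaluation only changes behavior when the two kinds of error are penalized differently.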
4. To generate the decision tree, right click on the result list and select the visualize
tree option, by which the decision tree will be generated.
Visualize tree
4. We find that the accuracy increases when the reduced-error pruning
option is selected.
If outlook = overcast then
    play = yes
If outlook = sunny and humidity = high then
    play = no
else
    play = yes
If outlook = rainy and windy = true then
    play = no
else
    play = yes
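The rule set above translates directly into a small function; a sketch in Python:

```python
def play(outlook, humidity, windy):
    """Decision rules for the weather data, as listed above."""
    if outlook == "overcast":
        return "yes"
    if outlook == "sunny":
        return "no" if humidity == "high" else "yes"
    if outlook == "rainy":
        return "no" if windy else "yes"
    return None  # outlook value not covered by the rules

print(play("sunny", "high", False))
```

Each top-level branch corresponds to one rule, so changing a rule in the tree changes exactly one branch of the function.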