
CMM510 Data Mining Devised by W.Ji ; updated by C.H.Bryant and IArana.

Lab Data Mining by Clementine

Aims
• To become familiar with the Clementine programming interface
• To learn about the visualization functions in Clementine
• To predict iris types using Clementine modelling functions

Background
The Clementine data mining toolkit was originally developed by Integral
Solutions Limited. The company was later acquired by SPSS Inc. in 1999.
SPSS (Statistical Package for the Social Sciences) is a software package for
comprehensive data mining (not its initial objective) and analytic applications
for enhanced decision making. The main strength of SPSS lies in
statistical analysis: it contains a systematic series of statistical functions,
ranging from descriptive analysis, through parametric and nonparametric tests,
to nonlinear regression.

Clementine is regarded as a supplement to SPSS because it provides many
intelligent modelling functions (as opposed to traditional statistical
techniques). C5.0 is one such example. Clementine and SPSS run independently.
However, to strengthen Clementine’s speciality without losing generality in
statistical analysis, Clementine not only embeds most SPSS functions in its
interface but also provides a facility for exporting its processes to SPSS.

As a data mining tool, Clementine follows the basic
preprocessing-modelling-postprocessing routine to reveal the information and
knowledge behind the data.

© The Robert Gordon University


School of Computing P1/7

Clementine Programming Interface


Start Clementine

Desktop => start => programs => Clementine Desktop 9.0 => Clementine
Desktop 9.0

The Clementine programming interface is shown in Fig. 1.

Fig. 1 Clementine programming interface

Seek Help

On the Clementine toolbar, clicking Help and then Help Topics opens the
Clementine user manual. You may need to consult it occasionally during the
lab.

Alternatively, you can open an individual topic directly, for example
(CTRL+click to follow the link):

C:\Program Files\Clementine Desktop 9.0\Help\i18n\English_US\Clemhelp\clem_intro.htm

Build and Work With Stream

The Clementine programming interface is divided into blocks. The main one
is the large blank area in which visual programs are placed.
Generated models are automatically placed in the upper right-hand block. The
nodes, which represent operations to be performed on the data, are located in
seven palettes at the bottom of the interface. Nodes can be dragged into the
programming area and connected by links, which indicate the direction of data
flow. The resulting visual program, consisting of nodes and links, is called a
stream.

In summary, to build a data stream:

• Add nodes to the stream pane
• Connect the nodes to form a stream
• Specify any node or stream options
• Execute the stream

Once a stream has been built, it can be run from any node, which is useful for
debugging the program step by step.
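The idea of a stream can be sketched in a few lines of Python. This is only an illustration of nodes, links and execution from a chosen node; the `Node` class and everything in it are hypothetical and are not part of Clementine.

```python
# Hypothetical stand-in for a stream; none of these names are Clementine's API.

class Node:
    """One operation on the data, linked to at most one upstream node."""

    def __init__(self, name, operation, upstream=None):
        self.name = name
        self.operation = operation  # function: list of records -> list of records
        self.upstream = upstream    # the link; data flows from upstream to this node

    def execute(self):
        """Run every node upstream of this one, then this node."""
        data = self.upstream.execute() if self.upstream else []
        return self.operation(data)


# Add nodes and connect them: source -> select -> table
source = Node("source", lambda _: [{"x": 1}, {"x": 2}, {"x": 3}])
select = Node("select", lambda rows: [r for r in rows if r["x"] > 1], upstream=source)
table = Node("table", lambda rows: rows, upstream=select)

# Executing from an intermediate node shows the data at that point,
# which is how a stream can be checked step by step.
print(source.execute())
print(select.execute())
print(table.execute())
```

Each node only knows its upstream link, so running any node pulls the data through everything before it and nothing after it.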

Master Clementine Nodes

Nodes are programming functions that represent different objects and actions.
They are grouped into 6 palettes according to their operations.

• Source nodes

Source nodes can be used to import the contents of various flat files and to
connect to data held in ODBC-compliant relational databases.
Various file formats can be read through source nodes, such as txt, csv and tsv.
The csv format can be produced by many data processing packages, e.g.
spreadsheets.

Later in this lab, you will use the variable file node to read in a csv data file.

• Record operation nodes

Data sets are composed of records or cases or instances. These records can
be manipulated by record operation nodes.

We often need to split a data set into subsets so that the results of
learning can be evaluated, e.g. in hold-out and multi-fold cross validation. In
Clementine this can be done using the sample node. Later in this lab, you will
use the sample node to split the data set into training and testing subsets.

Records can be deliberately selected (select node) according to their
similarity, i.e. because they satisfy the same condition.

When new attributes are obtained for the records in the data set, the merge node
allows you to extend the attribute space. If new records are collected for the
data set, the append node can easily expand the data set to a larger volume.
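The difference between selecting, appending and merging can be seen in a small Python sketch. It uses plain lists of dictionaries rather than Clementine nodes, and the field names `id`, `petallen` and `petalwid` and all values are illustrative assumptions.

```python
# Records are plain dicts here; the field names and values are illustrative only.
records = [
    {"id": 1, "petallen": 1.4},
    {"id": 2, "petallen": 4.7},
    {"id": 3, "petallen": 6.0},
]

# select: keep only the records that satisfy a condition
long_petals = [r for r in records if r["petallen"] > 4.0]

# append: add newly collected records, so the data set gets more rows
new_records = [{"id": 4, "petallen": 5.1}]
appended = records + new_records

# merge: add newly obtained attributes, matched on a key,
# so each record gets more fields
new_attributes = {1: {"petalwid": 0.2}, 2: {"petalwid": 1.4}, 3: {"petalwid": 2.5}}
merged = [{**r, **new_attributes[r["id"]]} for r in records]

print(long_petals)
print(len(appended), "records after append")
print(merged[0])
```

Appending grows the data set downwards (more records), whereas merging grows it sideways (more fields per record).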

• Field operation nodes

Fields store the values of each attribute. Most data pre-processing tasks are
carried out by field operation nodes. The type node is the most frequently
used node because it allows you to assign a data type, direction, and blank
handling for each field in the data. The last of these functions, blank
handling, is one of two ways in Clementine to deal with missing values. When a
smaller attribute space is wanted, e.g. after an attribute selection process,
the filter node will screen out unwanted attributes or eliminate fields with a
high proportion of missing values. If new attributes are required based on the
current attributes (sales per transaction, for instance), the derive node can
create new fields for this purpose.

In this lab, you will use:

• the type node to assign types and define the input and output properties of
each field;
• the filter node to choose suitable attributes for decision making;
• the derive node to generate a new field to support decision making.
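As a rough illustration of what the three field operations do to the data, the following Python sketch types, filters and derives fields on two made-up records. It follows the sales per transaction example mentioned above; the field names and values are assumptions, and the code does not use any Clementine API.

```python
# Illustrative records; field names and values are made up for this sketch.
raw = [
    {"store": "A", "sales": "1200.0", "transactions": "40", "notes": ""},
    {"store": "B", "sales": "900.0", "transactions": "30", "notes": ""},
]

# type: assign a data type to each field
types = {"store": str, "sales": float, "transactions": int, "notes": str}
typed = [{field: types[field](value) for field, value in r.items()} for r in raw]

# filter: drop fields that are not wanted downstream
unwanted = {"notes"}
filtered = [{f: v for f, v in r.items() if f not in unwanted} for r in typed]

# derive: create a new field from existing ones
derived = [
    {**r, "sales_per_transaction": r["sales"] / r["transactions"]}
    for r in filtered
]

for r in derived:
    print(r)
```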

• Graph nodes

The nodes in this palette play the major role in information visualization in
Clementine. Although the functions of the graph nodes themselves are not very
‘colourful’ (compared to some pure visualization software packages), they do
have the following desirable features:
(a) a graph node can be used at any stage of data mining (i.e. it may be
connected to any non-output node in the stream) to acquire information on
the intermediate results achieved so far;
(b) a displayed graph can be stored in several formats (.ghp, .bmp, .ps) that
can be used by other software or processes later on;
(c) the data at any stage of processing (stream) can be exported and, if
needed, visualized using the powerful functions of other software, e.g.
Excel, SPSS, or partners of Clementine such as AISoft@re/Visualmine.

• Modelling nodes

Modelling nodes are the heart of the data mining process. They are drawn
from machine learning, artificial intelligence, and statistics. The number of
nodes in this palette will increase in line with the development of new data
mining technology. The modelling nodes include Neural Net (a neural
network), C5.0 (the successor to C4.5) and C&R Tree (a tree-based
classification and prediction model).

Later in this lab, you will use C5.0.

• Output nodes

The nodes in this palette can provide powerful analyses of the results from
further upstream. Output nodes are the terminals of the stream, which
means there are no further nodes downstream of them.

The node most frequently used to display the data at any stage of data mining
is the Table node. The Analysis node is another popular node; it lets you
inspect the results and assess the performance of data mining. Output nodes
also provide a mechanism for exporting data in various formats to interface
with other software tools.

Later in this lab, you will use both the Table node and the Analysis node.


Step by Step Exercises


Tasks

To predict iris types using C5.0 and neural networks.

Dataset

In this lab, you will use the Iris dataset. This dates back to seminal work by
the eminent statistician R.A.Fisher in the mid-1930s. It contains 150
examples in total. There are 50 examples of each of three types of plants.
The classes are the types of plants. There are four attributes:
1. sepal length (sepallen)
2. sepal width (sepalwid)
3. petal length (petallen)
4. petal width (petalwid)
All four attributes are numeric and are measured in cm.

Downloading the dataset

Create a new folder called ‘clementine_lab’ inside your ‘H:\CMM510’ folder.

Download the data set from http://www.comp.rgu.ac.uk/staff/chb/teach.html.
Save it to your ‘H:\CMM510\clementine_lab’ folder. The data set is stored in
csv format, with the field names on the first line.

Data Set Input and Display

In the Sources palette, click and drag the variable file node to the
programming area. Double click the node and a new window appears. Set the
correct path to the data file (iris data) and, according to the file format,
choose the following options: get field names from file, skip header
characters 0, delimiter characters ‘,’, delimiter on new line.

Now, in the Output palette, click and drag the table node to the programming
area. Next, create a link between your two nodes as follows. Click and hold
the middle button/scroll button of the mouse on your file node, then drag to
your table node. You should be able to see a link from the file node to the
table node. Double click the table node and choose execute (or right click the
node and choose execute from the dropdown menu); a table appears displaying
the iris data.
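For comparison, the same read-and-display step can be sketched outside Clementine with Python's standard `csv` module. The inline text stands in for the downloaded file; the four attribute names come from this lab sheet, while the class column name `class`, the class labels and the three rows shown are assumptions made for illustration.

```python
import csv
import io

# Stand-in for the downloaded file. The four attribute names come from the
# lab sheet; the column name "class" and the label spellings are assumptions.
sample_csv = """sepallen,sepalwid,petallen,petalwid,class
5.1,3.5,1.4,0.2,setosa
7.0,3.2,4.7,1.4,versicolor
6.3,3.3,6.0,2.5,virginica
"""

# Field names come from the first line, ',' separates fields,
# and each new line starts a new record.
reader = csv.DictReader(io.StringIO(sample_csv), delimiter=",")
rows = list(reader)

# Display the records as a simple table.
print("\t".join(reader.fieldnames))
for row in rows:
    print("\t".join(row[name] for name in reader.fieldnames))
```

With the real file, `io.StringIO(sample_csv)` would be replaced by an opened file handle for the saved csv.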

Save your stream in the folder you created.

Define Type and Input/Output

In the Field Ops palette, choose the type node and add it to the programming
area. Draw a link from your file node to the type node. Double clicking the
type node brings up the option setting window. Make sure the types are correct
for the attributes and the iris type (class). Clicking the direction column
allows you to change the input/output mode. Set the four attributes as IN and
only the class as OUT.

Feature Selection

Add a filter node from the Field Ops palette after the type node. Determine how
to discard the sepal length attribute. Display the new data set by connecting
your filter node to a new table node.

Training and Testing Data

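Add two sample nodes from the Record Ops palette. Set one to
include sample and 1-in-2, and the other to discard sample and 1-in-2. Link the
first one to the filter node to act as the training branch. Link the second
to the filter node as well, to act as the testing branch. Because one node
passes every second record and the other discards exactly those records, the
two branches receive disjoint halves of the data.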
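The effect of the two sample nodes can be mimicked in Python on a list of record numbers. This is a sketch under one assumption: that 1-in-2 picks the 2nd, 4th, 6th, ... record. Whether Clementine starts from the first or the second record does not change the point that the two branches are complementary.

```python
# Twelve stand-in records, numbered so that the split is easy to see.
records = list(range(1, 13))

n = 2  # 1-in-2 sampling

# "Include sample, 1-in-2": pass every second record
# (assumed here to be the 2nd, 4th, 6th, ...).
training = [r for i, r in enumerate(records, start=1) if i % n == 0]

# "Discard sample, 1-in-2": pass everything the include branch did not.
testing = [r for i, r in enumerate(records, start=1) if i % n != 0]

print(training)  # [2, 4, 6, 8, 10, 12]
print(testing)   # [1, 3, 5, 7, 9, 11]
```

No record appears in both lists, so the model is later tested on data it has not seen during training.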

C5.0 Prediction

Link a C5.0 node to the sample node in the training branch, then double
click the node to execute the stream. A model will be generated and placed in
the model palette. Click and drag the model (which is now a node) to the
programming area and link it to the sample node in the testing branch. The
rule set can be viewed by double clicking the model.
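C5.0 is a commercial algorithm and its internals are not reproduced here. As background only, the sketch below shows the entropy-based information gain that the C4.5 family of decision tree learners uses to judge a candidate split (C4.5 further normalises this into a gain ratio, which is omitted). The function names and the six data points are illustrative assumptions, not output from Clementine.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(values, labels, threshold):
    """Entropy reduction from splitting a numeric attribute at `threshold`."""
    left = [lab for v, lab in zip(values, labels) if v <= threshold]
    right = [lab for v, lab in zip(values, labels) if v > threshold]
    if not left or not right:
        return 0.0  # a split that leaves one side empty separates nothing
    weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
    return entropy(labels) - weighted


# Illustrative petal lengths (cm) and classes; not the real training subset.
petallen = [1.4, 1.5, 4.7, 4.5, 6.0, 5.9]
iris_type = ["setosa", "setosa", "versicolor", "versicolor", "virginica", "virginica"]

gain = information_gain(petallen, iris_type, threshold=2.0)
print(f"gain for petallen <= 2.0: {gain:.3f} bits")
```

A tree learner of this kind evaluates many candidate thresholds on every attribute and prefers the split with the best score, which is why rules in the generated rule set take the form of threshold tests on attributes.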

Prediction Analysing

Add an analysis node to the testing branch after the model, double click it to
set the options, then execute it. You should be able to see the prediction
results.
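The kind of summary such an analysis gives can be reproduced by hand. The Python sketch below compares actual and predicted classes for six made-up test records and reports the proportion correct together with a table of (actual, predicted) counts; the data are illustrative and are not results from this lab.

```python
from collections import Counter

# Illustrative actual and predicted classes for a handful of test records.
actual = ["setosa", "setosa", "versicolor", "versicolor", "virginica", "virginica"]
predicted = ["setosa", "setosa", "versicolor", "virginica", "virginica", "virginica"]

# Overall accuracy: share of test records whose prediction matches the actual class.
correct = sum(a == p for a, p in zip(actual, predicted))
accuracy = correct / len(actual)

# Coincidence (confusion) matrix: count of each (actual, predicted) pair.
matrix = Counter(zip(actual, predicted))

print(f"correct: {correct} of {len(actual)} ({accuracy:.1%})")
for (a, p), count in sorted(matrix.items()):
    print(f"actual={a:<11} predicted={p:<11} count={count}")
```

The off-diagonal pairs, where actual and predicted differ, show which classes the model confuses with each other.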

Exercises
1. Use the nodes in the Graphs palette to display histograms of sepal
length, sepal width, petal length, and petal width.
2. Use a node in the Graphs palette to display scatterplots of some pairs
of attributes. Investigate why we discarded the sepal length attribute by
looking at pairs that include it.
