
LAB MANUAL

SUBJECT: DATA WAREHOUSING AND MINING

CLASS: T.E. (Computer Engineering)

SEMESTER: VI


INDEX

1. Design data warehouse for auto sales analysis
2. Design data warehouse for Student Attendance analysis
3. Implement classification using K nearest neighbor classification
4. Implement decision tree based algorithm for classification
5. Implement K-means algorithm for clustering
6. Implement Apriori algorithm for association rule
7. Bayesian Classification
8. Linear Regression
9. Minimum Spanning Tree based clustering
10. Introduction to the Weka machine learning toolkit
11. Classification using the Weka toolkit - Part 1
12. Classification using the Weka toolkit - Part 2
13. Performing data preprocessing tasks for data mining in Weka


Problem 1:
Title:
Design a data warehouse for auto sales analysis using
1. Star schema,
2. Snowflake schema, and
3. Galaxy schema.

Perform the following for the above warehouse:
4. Maximum/minimum sale in the first quarter w.r.t. location
5. Maximum/minimum sale of vehicle items throughout the year
6. Maximum/minimum sale of each item throughout the year
7. Maximum/minimum sale during the second & third quarters w.r.t. location and item
8. List the items in increasing order w.r.t. sales amount & quantity sold
9. List the suppliers who supply the maximum number of bikes/cars during the year
10. Find the customer who purchases the maximum number of items, and also find all details of that customer along with the region.

Notes:

Design a data cube which contains one fact table; design item, time, supplier, location and customer dimension tables, and identify the measures for sales. Insert a minimum of 4 item types such as bikes, small cars, mid-segment cars, car consumable items, etc. Also enter a minimum of 10-12 records.
For region/location, enter a minimum of 2 cities from each state and a minimum of 2 states.
Keep track of sales quarter-wise.
Perform and implement the above fact & dimension tables in Oracle 10g as relational database tables, and analyse them with the help of an SQL tool.
Use the concepts of OLAP operations such as slice, dice, roll-up, drill-down, etc.

Objective:

To learn the fundamentals of data warehousing
To learn the concepts of dimensional modeling
To learn the star, snowflake & galaxy schemas

Reference:

SQL, PL/SQL by Ivan Bayross
Data Mining: Concepts and Techniques by Han & Kamber
Data Warehousing Fundamentals by Paulraj Ponniah
Data Warehousing & Mining by Reema Thareja

Prerequisite:

Fundamental knowledge of Database Management
Fundamental knowledge of SQL

Theory:

Dimensional modeling (DM) is the name of a logical design technique often used for data warehouses. Dimensional modeling always uses the concepts of facts, measures, and dimensions. Facts are typically (but not always) numeric values that can be aggregated; dimensions are groups of hierarchies and descriptors that define the facts. For example, sales amount is a fact, while timestamp, product, register #, store #, etc. are elements of dimensions. Dimensional models are built by business process area, e.g. store sales, inventory, claims, etc.

Fact table
The fact table is not a typical relational database table, as it is denormalized on purpose to enhance query response times. The fact table typically contains records that are ready to explore, usually with ad hoc queries. Records in the fact table are often referred to as events, due to the time-variant nature of a data warehouse environment.
The primary key for the fact table is a composite of all the columns except the numeric values/scores (like QUANTITY, TURNOVER, exact invoice date and time).
Typical fact tables in a global enterprise data warehouse are (there may be additional company- or business-specific fact tables):
Sales fact table: contains all details regarding sales.
Orders fact table: in some cases the table can be split into open orders and historical orders; sometimes the values for historical orders are stored in a sales fact table.
Budget fact table: usually grouped by month and loaded once at the end of a year.
Forecast fact table: usually grouped by month and loaded daily, weekly or monthly.
Inventory fact table: reports stock levels, usually refreshed daily.

Dimension table
Nearly all of the information in a typical fact table is also present in one or more dimension tables. The main purpose of maintaining dimension tables is to allow browsing the categories quickly and easily.


The primary keys of each of the dimension tables are linked together to form the composite primary key of the fact table. In a star schema design, there is only one denormalized table for a given dimension.

Typical dimension tables in a data warehouse are:
Time dimension table
Customers dimension table
Products dimension table
Key account managers (KAM) dimension table
Sales office dimension table

Star schema architecture
Star schema architecture is the simplest data warehouse design. The main feature of a star schema is a table at the center, called the fact table, surrounded by the dimension tables which allow browsing of specific categories, summarizing, drill-downs and specifying criteria.
Typically, most of the fact tables in a star schema are in database third normal form, while the dimension tables are denormalized (second normal form).
Despite the fact that the star schema is the simplest data warehouse architecture, it is the most commonly used in data warehouse implementations across the world today (about 90-95% of cases).
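The fact and dimension tables of such a star schema can be created and queried from a small Java/JDBC program. The sketch below is only illustrative: the connection URL, user, password and all table and column names are assumptions rather than part of the assignment, and the single ROLLUP query stands in for the OLAP-style analysis asked for above.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class StarSchemaDemo {
    public static void main(String[] args) throws Exception {
        // Connection details are placeholders; adjust them for your Oracle 10g instance.
        Class.forName("oracle.jdbc.OracleDriver");
        Connection con = DriverManager.getConnection(
                "jdbc:oracle:thin:@localhost:1521:XE", "scott", "tiger");
        Statement st = con.createStatement();

        // Dimension tables (one denormalized table per dimension in a star schema)
        st.executeUpdate("CREATE TABLE time_dim (time_id NUMBER PRIMARY KEY, "
                + "quarter NUMBER, year NUMBER)");
        st.executeUpdate("CREATE TABLE item_dim (item_id NUMBER PRIMARY KEY, "
                + "item_name VARCHAR2(30), item_type VARCHAR2(30))");
        st.executeUpdate("CREATE TABLE location_dim (loc_id NUMBER PRIMARY KEY, "
                + "city VARCHAR2(30), state VARCHAR2(30))");

        // Fact table: foreign keys to every dimension plus the numeric measures
        st.executeUpdate("CREATE TABLE sales_fact (time_id NUMBER, item_id NUMBER, "
                + "loc_id NUMBER, quantity NUMBER, sales_amount NUMBER)");

        // A roll-up style analytical query: total sales per state per quarter
        ResultSet rs = st.executeQuery(
                "SELECT l.state, t.quarter, SUM(f.sales_amount) AS total_sales "
                + "FROM sales_fact f JOIN location_dim l ON f.loc_id = l.loc_id "
                + "JOIN time_dim t ON f.time_id = t.time_id "
                + "GROUP BY ROLLUP (l.state, t.quarter)");
        while (rs.next()) {
            System.out.println(rs.getString(1) + " Q" + rs.getInt(2)
                    + " -> " + rs.getDouble(3));
        }
        con.close();
    }
}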
Snowflake schema architecture
Snowflake schema architecture is a more complex variation of the star schema design. The main difference is that the dimension tables in a snowflake schema are normalized, so they have a typical relational database design.

Snowflake schemas are generally used when a dimension table becomes very big and when a star schema cannot represent the complexity of the data structure. For example, if a PRODUCT dimension table contains millions of rows, the use of a snowflake schema should significantly improve performance by moving some data out to another table (with BRANDS, for instance).

The problem is that the more normalized the dimension tables are, the more complicated the SQL joins that must be issued to query them. This is because, in order for a query to be answered, many tables need to be joined and aggregates generated.

Fact constellation / galaxy schema architecture
For each star schema or snowflake schema it is possible to construct a fact constellation schema.

This schema is more complex than the star or snowflake architecture because it contains multiple fact tables. This allows dimension tables to be shared amongst many fact tables.
In a fact constellation schema, different fact tables are explicitly assigned to the dimensions which are relevant for the given facts. This may be useful in cases when some facts are associated with a given dimension level and other facts with a deeper dimension level.
Use of this model is reasonable when, for example, there is a sales fact table (with details down to the exact date and invoice header id) and a fact table with sales forecasts which are calculated based on month, client id and product id.
These dimensions allow us to answer questions such as:
In what regions of the country are pleated pants most popular? (fact table joined with the product and ship-to dimensions)
What percentage of pants were bought with coupons and how has that varied from quarter to quarter? (fact table joined with the promotion and time dimensions)
How many pants were sold on holidays versus non-holidays? (fact table joined with the time dimension)

Post lab assignment:

1. Describe OLAP operations like slice, dice, roll-up, drill-down with examples
2. Star schema vs. snowflake schema
3. Dimensional table vs. relational table
4. Advantages of the snowflake schema


Problem 2:
Title:
Choose any system for which a data warehouse is suitable, design all the dimension & fact tables for the same, and also perform any three analytical queries.
Perform and implement the above fact & dimension tables in Oracle 10g as relational database tables, and analyse them with the help of an SQL tool.
Note: Students have to perform the above case study in groups of 3-4.

Objective:

To learn the fundamentals of data warehousing
To learn the concepts of dimensional modeling
To learn the star, snowflake & galaxy schemas
Teamwork

Reference:

SQL, PL/SQL by Ivan Bayross
Data Mining: Concepts and Techniques by Han & Kamber
Data Warehousing Fundamentals by Paulraj Ponniah
Data Warehousing & Mining by Reema Thareja

Prerequisite:

Fundamental knowledge of Database Management
Fundamental knowledge of SQL

Theory:
Please refer to the supportive material / case study of the previous problem for analytical queries & schema design.

Post lab assignment:
1. Describe OLAP vs. OLTP
2. Multidimensional data cube
3. DMQL


Problem 3

Title:
Implement classification using K nearest neighbor classification

Objective:

To learn how to classify data using the K nearest neighbor algorithm

Reference:

Data Mining Introductory & Advanced Topics by Margaret H. Dunham
Data Mining: Concepts and Techniques by Han & Kamber

Prerequisite:

Fundamental knowledge of Database Management

Theory:
In k-nearest-neighbor classification, the training dataset is used to classify each member of a "target" dataset.
The structure of the data is that there is a classification (categorical) variable of interest ("buyer" or "non-buyer," for example) and a number of additional predictor variables (age, income, location, ...).
Algorithm:
1. For each row (case) in the target dataset (the set to be classified), locate the k closest members (the k nearest neighbors) of the training dataset. A Euclidean distance measure is used to calculate how close each member of the training set is to the target row being examined.
2. Examine the k nearest neighbors: which classification (category) do most of them belong to? Assign this category to the row being examined.
3. Repeat this procedure for the remaining rows (cases) in the target set.
4. The user may also select a maximum value for k; models are then built in parallel for all values of k up to this maximum, and scoring is done on the best of these models. The computing time goes up as k goes up, but the advantage is that higher values of k provide smoothing that reduces vulnerability to noise in the training data.
In practical applications, k is typically in the units or tens rather than in the hundreds or thousands.


Input:

Name       Gender   Height (m)
Kristina   F        1.6
Jim        M        2.0
Maggie     F        1.9
Bob        M        1.85
Dave       F        1.7
Kimm       M        1.9
Todd       M        1.9
Amy        F        1.85
Kathy      F        1.6

Class definition: Height >= 2 m is Tall, 1.7 m < Height < 2 m is Medium, Height <= 1.7 m is Short.

A new tuple <Pat, F, 1.6> is to be classified. Suppose K = 5 is given; the K nearest neighbors to the input tuple include {(Kristina, F, 1.6), (Kathy, F, 1.6), (Dave, F, 1.7)}.

Output:

Pat - Short
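A minimal Java sketch of the procedure above, using the training table and height ranges given in this problem. It assumes the distance is computed on the height attribute only (gender is ignored), which reproduces the Pat -> Short result.

import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class KnnDemo {
    // Class label derived from the height ranges given above
    static String labelFor(double h) {
        if (h >= 2.0) return "Tall";
        if (h > 1.7) return "Medium";
        return "Short";
    }

    public static void main(String[] args) {
        String[] names = {"Kristina", "Jim", "Maggie", "Bob", "Dave", "Kimm", "Todd", "Amy", "Kathy"};
        double[] heights = {1.6, 2.0, 1.9, 1.85, 1.7, 1.9, 1.9, 1.85, 1.6};

        double query = 1.6;   // height of the new tuple <Pat, F, 1.6>
        int k = 5;

        // Sort the training indexes by distance to the query (height attribute only)
        Integer[] order = new Integer[names.length];
        for (int i = 0; i < order.length; i++) order[i] = i;
        Arrays.sort(order, (a, b) ->
                Double.compare(Math.abs(heights[a] - query), Math.abs(heights[b] - query)));

        // Majority vote among the k nearest neighbours
        Map<String, Integer> votes = new HashMap<>();
        for (int i = 0; i < k; i++) {
            String label = labelFor(heights[order[i]]);
            System.out.println("neighbour: " + names[order[i]] + " (" + label + ")");
            votes.merge(label, 1, Integer::sum);
        }
        String answer = votes.entrySet().stream()
                .max(Map.Entry.comparingByValue()).get().getKey();
        System.out.println("Pat -> " + answer);   // prints: Pat -> Short
    }
}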

Post lab assignment:

1. Difference between the simple approach of distance-based classification vs. K nearest neighbor classification.


Problem 4

Title:
Implement decision tree based algorithm for classification

Objective:

To learn a decision tree based algorithm for classification

Reference:

Data Mining Introductory & Advanced Topics by Margaret H. Dunham
Data Mining: Concepts and Techniques by Han & Kamber

Prerequisite:

Fundamental knowledge of Database Management

Theory:
Decision tree learning, used in data mining and machine learning, uses a decision tree as
a predictive model which maps observations about an item to conclusions about the
item's target value. More descriptive names for such tree models are classification trees
or regression trees. In these tree structures, leaves represent classifications and branches
represent conjunctions of features that lead to those classifications.
In decision analysis, a decision tree can be used to visually and explicitly represent
decisions and decision making. In data mining, a decision tree describes data but not
decisions; rather the resulting classification tree can be an input for decision making.

Basic steps:
Building the tree
Applying the tree to the database

Internal node: test on an attribute
Branch: outcome of the test
Leaf node: class label
Topmost node: root node
Algorithm:
1. Create a node N.
2. If all the samples are from the same class C, then return N as a leaf node labeled with C.
3. If the attribute list is empty, return N as a leaf node labeled with the most common class among the samples.
4. Select the test-attribute from the attribute list that has the highest information gain, and label N with the test-attribute (this determines the best splitting criterion; a sketch of this computation follows the algorithm).
5. For each value ai of the test-attribute, grow a branch from node N for the condition test-attribute = ai, and assign the test value to that branch. Let Si be the set of samples for which test-attribute = ai.
6. Repeat the above steps recursively on each Si.
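The attribute-selection step mentioned in step 4 can be sketched in Java as follows: the entropy of the class labels and the information gain of one categorical attribute are computed. The toy arrays in main are assumptions used only for illustration.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class InfoGain {
    // Entropy of a set of class labels: -sum p * log2(p)
    static double entropy(List<String> labels) {
        Map<String, Integer> counts = new HashMap<>();
        for (String l : labels) counts.merge(l, 1, Integer::sum);
        double h = 0.0;
        for (int c : counts.values()) {
            double p = (double) c / labels.size();
            h -= p * Math.log(p) / Math.log(2);
        }
        return h;
    }

    // Information gain obtained by splitting the labels on a categorical attribute
    static double gain(String[] attribute, List<String> labels) {
        Map<String, List<String>> partitions = new HashMap<>();
        for (int i = 0; i < labels.size(); i++)
            partitions.computeIfAbsent(attribute[i], k -> new ArrayList<>()).add(labels.get(i));
        double remainder = 0.0;
        for (List<String> part : partitions.values())
            remainder += (double) part.size() / labels.size() * entropy(part);
        return entropy(labels) - remainder;
    }

    public static void main(String[] args) {
        // Toy training data (values assumed for illustration): one attribute vs. the class label
        String[] gender = {"F", "M", "F", "M", "F", "M", "M", "F", "F"};
        List<String> cls = List.of("Short", "Tall", "Medium", "Medium", "Short",
                                   "Medium", "Medium", "Medium", "Short");
        System.out.println("Entropy(S)      = " + entropy(cls));
        System.out.println("Gain(S, gender) = " + gain(gender, cls));
    }
}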

Post lab assignment:
1. What are the issues of classification? Explain with an example.
2. Explain the NN-based algorithm.

Hierarchical clustering creates a hierarchy of clusters which may be represented in a tree structure called a dendrogram. The root of the tree consists of a single cluster containing all observations, and the leaves correspond to individual observations.
Algorithms for hierarchical clustering are generally either agglomerative, in which one starts at the leaves and successively merges clusters together, or divisive, in which one starts at the root and recursively splits the clusters.
Any valid metric may be used as a measure of similarity between pairs of observations. The choice of which clusters to merge or split is determined by a linkage criterion, which is a function of the pairwise distances between observations.
Cutting the tree at a given height gives a clustering at a selected precision. In the following example, cutting after the second row yields clusters {a} {b c} {d e} {f}. Cutting after the third row yields clusters {a} {b c} {d e f}, which is a coarser clustering with a smaller number of larger clusters.
The algorithm forms clusters in a bottom-up manner, as follows:
1. Initially, put each article in its own cluster.
2. Among all current clusters, pick the two clusters with the smallest distance.
3. Replace these two clusters with a new cluster, formed by merging the two original ones.
4. Repeat the above two steps until there is only one remaining cluster in the pool.
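A compact Java sketch of this bottom-up procedure under the single-link criterion. The 6x6 distance matrix is made up for illustration, chosen so that the merge order reproduces the {a} {b c} {d e} {f} and {a} {b c} {d e f} cuts described above.

import java.util.ArrayList;
import java.util.List;

public class AgglomerativeDemo {
    public static void main(String[] args) {
        // Pairwise distances between items a..f (values assumed for illustration)
        double[][] d = {
            {0, 9, 10, 6, 7, 11},
            {9, 0, 1, 5, 4, 8},
            {10, 1, 0, 6, 5, 9},
            {6, 5, 6, 0, 1, 3},
            {7, 4, 5, 1, 0, 2},
            {11, 8, 9, 3, 2, 0}};
        String[] names = {"a", "b", "c", "d", "e", "f"};

        // Start with each item in its own cluster
        List<List<Integer>> clusters = new ArrayList<>();
        for (int i = 0; i < names.length; i++)
            clusters.add(new ArrayList<>(List.of(i)));

        // Repeatedly merge the two closest clusters (single-link distance)
        while (clusters.size() > 1) {
            int bi = 0, bj = 1;
            double best = Double.MAX_VALUE;
            for (int i = 0; i < clusters.size(); i++)
                for (int j = i + 1; j < clusters.size(); j++) {
                    double link = Double.MAX_VALUE;
                    for (int p : clusters.get(i))
                        for (int q : clusters.get(j))
                            link = Math.min(link, d[p][q]);
                    if (link < best) { best = link; bi = i; bj = j; }
                }
            clusters.get(bi).addAll(clusters.remove(bj));
            System.out.println("merged at distance " + best + " -> " + show(clusters, names));
        }
    }

    static String show(List<List<Integer>> clusters, String[] names) {
        StringBuilder sb = new StringBuilder();
        for (List<Integer> c : clusters) {
            sb.append("{");
            for (int i : c) sb.append(names[i]).append(" ");
            sb.append("} ");
        }
        return sb.toString();
    }
}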


Problem 5:

Title:
Implement K-means algorithm for clustering

Objective:

To learn the K-means algorithm for clustering

Reference:

Data Mining Introductory & Advanced Topics by Margaret H. Dunham
Data Mining: Concepts and Techniques by Han & Kamber

Prerequisite:

Fundamental knowledge of Database Management

Theory:
In statistics and machine learning, k-means clustering is a method of cluster analysis which aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean. How does the k-means clustering algorithm work?

Here is the step-by-step k-means clustering algorithm:

Step 1. Begin with a decision on the value of k = number of clusters.
Step 2. Put any initial partition that classifies the data into k clusters. You may assign the training samples randomly, or systematically as follows:
1. Take the first k training samples as single-element clusters.
2. Assign each of the remaining (N - k) training samples to the cluster with the nearest centroid. After each assignment, recompute the centroid of the gaining cluster.
Step 3. Take each sample in sequence and compute its distance from the centroid of each of the clusters. If a sample is not currently in the cluster with the closest centroid, switch this sample to that cluster and update the centroids of the cluster gaining the new sample and the cluster losing the sample.
Step 4. Repeat step 3 until convergence is achieved, that is, until a pass through the training samples causes no new assignments.
If the number of data points is less than the number of clusters, then we assign each data point as the centroid of a cluster, and each centroid is given a cluster number. If the number of data points is greater than the number of clusters, then for each data point we calculate the distance to every centroid and take the minimum; the data point is said to belong to the cluster from whose centroid it has the minimum distance.

Input:
Suppose we have several objects (4 types of medicines) and each object has two attributes or features as shown in the table below. Our goal is to group these objects into K = 2 groups of medicine based on the two features (pH and weight index).

Object       Attribute 1 (X): weight index   Attribute 2 (Y): pH
Medicine A   1                               1
Medicine B   2                               1
Medicine C   4                               3
Medicine D   5                               4

Each medicine represents one point with two attributes (X, Y), so we can represent it as a coordinate in an attribute space.

1. Initial value of centroids: Suppose we use medicine A and medicine B as the first centroids. Let c1 and c2 denote the coordinates of the centroids; then c1 = (1, 1) and c2 = (2, 1).

2. Objects-centroids distances: we calculate the distance between each cluster centroid and each object. Using Euclidean distance, we obtain the distance matrix at iteration 0. Each column in the distance matrix corresponds to one object; the first row of the matrix holds the distance of each object to the first centroid and the second row the distance of each object to the second centroid. For example, the distance from medicine C = (4, 3) to the first centroid (1, 1) is sqrt(3^2 + 2^2) = 3.61, and its distance to the second centroid (2, 1) is sqrt(2^2 + 2^2) = 2.83, etc.

3. Objects clustering: We assign each object based on the minimum distance. Thus, medicine A is assigned to group 1, medicine B to group 2, medicine C to group 2 and medicine D to group 2. The element of the group matrix is 1 if and only if the object is assigned to that group.

4. Iteration 1, determine centroids: Knowing the members of each group, we now compute the new centroid of each group based on these new memberships. Group 1 has only one member, so its centroid remains at c1 = (1, 1). Group 2 now has three members, so its centroid is the average coordinate of the three members: c2 = ((2+4+5)/3, (1+3+4)/3) = (3.67, 2.67).

5. Iteration 1, objects-centroids distances: The next step is to compute the distances of all objects to the new centroids. Similar to step 2, we obtain the distance matrix at iteration 1.

6. Iteration 1, objects clustering: Similar to step 3, we assign each object based on the minimum distance. Based on the new distance matrix, medicine B moves to group 1 while all the other objects remain where they were.

7. Iteration 2, determine centroids: We repeat step 4 to calculate the new centroid coordinates based on the clustering of the previous iteration. Group 1 and group 2 both have two members, so the new centroids are c1 = ((1+2)/2, (1+1)/2) = (1.5, 1) and c2 = ((4+5)/2, (3+4)/2) = (4.5, 3.5).

8. Iteration 2, objects-centroids distances: Repeating step 2, we obtain the new distance matrix at iteration 2.

9. Iteration 2, objects clustering: Again, we assign each object based on the minimum distance. Comparing the grouping of the last iteration with this iteration reveals that the objects no longer move between groups. Thus, the computation of k-means clustering has reached its stability and no more iterations are needed.

Output:
We get the final grouping as the result:

Object       Feature 1 (X): weight index   Feature 2 (Y): pH   Group (result)
Medicine A   1                             1                   1
Medicine B   2                             1                   1
Medicine C   4                             3                   2
Medicine D   5                             4                   2
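A compact Java sketch of the whole procedure, using the medicine data from the worked example, K = 2, and medicines A and B as the initial centroids; it prints the same final grouping as the table above.

import java.util.Arrays;

public class KMeansDemo {
    public static void main(String[] args) {
        // Medicines A..D with (weight index, pH) from the worked example
        double[][] x = {{1, 1}, {2, 1}, {4, 3}, {5, 4}};
        String[] names = {"A", "B", "C", "D"};
        int k = 2;
        // Initial centroids: medicine A and medicine B, as in the example
        double[][] c = {{1, 1}, {2, 1}};
        int[] group = new int[x.length];

        boolean changed = true;
        while (changed) {
            changed = false;
            // Assignment step: each point goes to its nearest centroid
            for (int i = 0; i < x.length; i++) {
                int best = 0;
                for (int j = 1; j < k; j++)
                    if (dist(x[i], c[j]) < dist(x[i], c[best])) best = j;
                if (best != group[i]) { group[i] = best; changed = true; }
            }
            // Update step: each centroid becomes the mean of its members
            for (int j = 0; j < k; j++) {
                double sx = 0, sy = 0; int n = 0;
                for (int i = 0; i < x.length; i++)
                    if (group[i] == j) { sx += x[i][0]; sy += x[i][1]; n++; }
                if (n > 0) { c[j][0] = sx / n; c[j][1] = sy / n; }
            }
        }
        for (int i = 0; i < x.length; i++)
            System.out.println("Medicine " + names[i] + " -> group " + (group[i] + 1));
        System.out.println("Centroids: " + Arrays.deepToString(c));
    }

    static double dist(double[] p, double[] q) {
        return Math.hypot(p[0] - q[0], p[1] - q[1]);
    }
}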

Note: You can implement problems 3 to 6 above in C/C++/Java.


Problem 6:

Title:
Implement Apriori algorithm for association rule

Objective:

To learn association rule mining using the Apriori algorithm

Reference:

Data Mining Introductory & Advanced Topics by Margaret H. Dunham
Data Mining: Concepts and Techniques by Han & Kamber

Prerequisite:

Fundamental knowledge of Database Management

Theory:
Association rule mining is to find association rules that satisfy the predefined minimum support and confidence from a given database. The problem is usually decomposed into two sub-problems:

1. Find those itemsets whose occurrence exceeds a predefined threshold in the database; these itemsets are called frequent or large itemsets.
2. Generate association rules from those large itemsets subject to the constraint of minimum confidence.

Suppose one of the large itemsets is Lk = {I1, I2, ..., Ik}; association rules for this itemset are generated in the following way: the first rule is {I1, I2, ..., Ik-1} => {Ik}. By checking its confidence, this rule can be determined to be interesting or not. Then, other rules are generated by deleting the last item in the antecedent and inserting it into the consequent, and the confidences of the new rules are checked to determine their interestingness. This process iterates until the antecedent becomes empty.
Since the second sub-problem is quite straightforward, most of the research focuses on the first sub-problem. The Apriori algorithm finds the frequent sets L in database D as follows:

Find the frequent itemset Lk-1.
Join step: Ck is generated by joining Lk-1 with itself.
Prune step: Any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset, and hence should be removed.

where
Ck: candidate itemset of size k
Lk: frequent itemset of size k

Apriori pseudocode:

Apriori(T, minsup)
    L1 <- {large 1-itemsets that appear in at least minsup transactions}
    k <- 2
    while Lk-1 is not empty
        Ck <- Generate(Lk-1)            // join and prune
        for each transaction t in T
            Ct <- Subset(Ck, t)         // candidates contained in t
            for each candidate c in Ct
                count[c] <- count[c] + 1
        Lk <- {c in Ck | count[c] >= minsup}
        k <- k + 1
    return the union of all Lk

Input:
A large supermarket tracks sales data by SKU (Stock Keeping Unit, i.e. by item), and thus is able to know what items are typically purchased together. Apriori is a moderately efficient way to build a list of frequently purchased item pairs from this data. Let the database of transactions consist of the sets {1,2,3,4}, {2,3,4}, {2,3}, {1,2,4}, {1,2,3,4}, and {2,4}.


Output:
Each number corresponds to a product such as "butter" or "water". The first step of Apriori is to count up the frequencies, called the supports, of each member item separately:

Item  Support
{1}   3
{2}   6
{3}   4
{4}   5

We can define a minimum support level to qualify as "frequent," which depends on the context. For this case, let min support = 3. Therefore, all items are frequent. The next step is to generate a list of all 2-pairs of the frequent items. Had any of the above items not been frequent, it would not have been included as a possible member of a 2-item pair. In this way, Apriori prunes the tree of all possible sets.

Item   Support
{1,2}  3
{1,3}  2
{1,4}  3
{2,3}  4
{2,4}  5
{3,4}  3

This is counting up the occurrences of each of those pairs in the database. Since minsup = 3, we don't need to generate 3-sets involving {1,3}: because {1,3} is not frequent, no superset of it can possibly be frequent. Continuing:

Item     Support
{1,2,4}  3
{2,3,4}  3
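A minimal Java sketch of level-wise frequent-itemset generation on the six transactions above with min support = 3. For brevity the prune step is folded into the support check during candidate generation, so this illustrates the idea rather than the exact textbook join/prune procedure.

import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

public class AprioriDemo {
    public static void main(String[] args) {
        List<Set<Integer>> db = List.of(
                Set.of(1, 2, 3, 4), Set.of(2, 3, 4), Set.of(2, 3),
                Set.of(1, 2, 4), Set.of(1, 2, 3, 4), Set.of(2, 4));
        int minSup = 3;

        // L1: frequent 1-itemsets
        List<Set<Integer>> current = new ArrayList<>();
        for (int item = 1; item <= 4; item++)
            if (support(db, Set.of(item)) >= minSup) current.add(Set.of(item));

        while (!current.isEmpty()) {
            for (Set<Integer> s : current)
                System.out.println(s + " support=" + support(db, s));
            // Join step: combine pairs of frequent k-itemsets into (k+1)-candidates,
            // keeping only those that meet the minimum support.
            Set<Set<Integer>> candidates = new LinkedHashSet<>();
            for (Set<Integer> a : current)
                for (Set<Integer> b : current) {
                    Set<Integer> union = new TreeSet<>(a);
                    union.addAll(b);
                    if (union.size() == a.size() + 1 && support(db, union) >= minSup)
                        candidates.add(union);
                }
            current = new ArrayList<>(candidates);
        }
    }

    // Number of transactions that contain every item of the given itemset
    static int support(List<Set<Integer>> db, Set<Integer> itemset) {
        int count = 0;
        for (Set<Integer> t : db) if (t.containsAll(itemset)) count++;
        return count;
    }
}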

Post lab assignment:

1. Give an example for Apriori with transactions and explain the Apriori-gen algorithm.

Problem 7:

Title:
Bayesian Classification

Objective:

To implement classification using Bayes' theorem.

Reference:

Data Mining Introductory & Advanced Topics by Margaret H. Dunham
Data Mining: Concepts and Techniques by Han & Kamber

Prerequisite:

Fundamental knowledge of probability and Bayes' theorem

Theory:
The simple (naive) Bayesian classification assumes that the effect of an attribute value on a given class membership is independent of the other attributes.
Bayes' theorem is as follows. Let X be an unknown sample and let H be the hypothesis that X belongs to a particular class C. We need to determine P(H|X), the probability that the hypothesis H holds given the observed values of X:

P(H|X) = P(X|H) * P(H) / P(X)

In this program, we initially take the number of tuples in the training data set in variable L. The string arrays name, gender, height and output store the tuple details and the class label respectively; the tuple details are read from the user using for loops.
Counter variables are kept for the various attribute values, i.e. (male/female) for gender and (short/medium/tall) for height.
The tuples are scanned and the respective counters are incremented accordingly using an if-else-if structure.
The variables pshort, pmed and plong are then used to convert the counter variables into the corresponding probabilities.
Algorithm:
1. START
2. Store the training data set
3. Specify ranges for classifying the data
4. Calculate the probability of being tall, medium, short
5. Also calculate the probabilities of tall, short, medium according to gender and the classification ranges
6. Calculate the likelihood of short, medium and tall
7. Calculate P(t) by summing up the probable likelihoods
8. Calculate the actual probabilities

Input:
Training data set

Name       Gender   Height   Output
Christina  F        1.6 m    Short
Jim        M        1.9 m    Tall
Maggie     F        1.9 m    Medium
Martha     F        1.88 m   Medium
Stephony   F        1.7 m    Medium
Bob        M        1.85 m   Short
Dave       M        1.7 m    Short
Steven     M        2.1 m    Tall
Amey       F        1.8 m    Medium

Output:
The tuple belongs to the class having the highest probability; the new tuple is classified accordingly.
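A minimal Java sketch of the classifier described above, using the training table. Height is discretized into three buckets so that both attributes are categorical, Laplace smoothing is applied, and the new tuple <M, 2.0 m> is an assumption chosen only to illustrate the calculation.

public class NaiveBayesDemo {
    public static void main(String[] args) {
        // Training data from the table above
        String[] gender = {"F", "M", "F", "F", "F", "M", "M", "M", "F"};
        double[] height = {1.6, 1.9, 1.9, 1.88, 1.7, 1.85, 1.7, 2.1, 1.8};
        String[] cls    = {"Short", "Tall", "Medium", "Medium", "Medium", "Short", "Short", "Tall", "Medium"};
        String[] classes = {"Short", "Medium", "Tall"};

        // New tuple to classify (assumed for illustration): a male of height 2.0 m
        String newGender = "M";
        String newBucket = bucket(2.0);

        String best = null;
        double bestScore = -1;
        for (String c : classes) {
            int n = 0, gMatch = 0, hMatch = 0;
            for (int i = 0; i < cls.length; i++) {
                if (!cls[i].equals(c)) continue;
                n++;
                if (gender[i].equals(newGender)) gMatch++;
                if (bucket(height[i]).equals(newBucket)) hMatch++;
            }
            // P(C) * P(gender | C) * P(height bucket | C), with Laplace smoothing
            double prior = (double) n / cls.length;
            double pGender = (gMatch + 1.0) / (n + 2.0);   // 2 possible gender values
            double pHeight = (hMatch + 1.0) / (n + 3.0);   // 3 possible height buckets
            double score = prior * pGender * pHeight;
            System.out.println("score(" + c + ") = " + score);
            if (score > bestScore) { bestScore = score; best = c; }
        }
        System.out.println("Predicted class: " + best);   // Tall for this tuple
    }

    // Discretize height into three buckets so the attribute becomes categorical
    static String bucket(double h) {
        if (h >= 2.0) return "high";
        if (h > 1.7) return "mid";
        return "low";
    }
}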

Post lab assignment:


Problem 8:

Title:
Linear Regression

Objective:

To write a program to classify tuples using linear regression.

Reference:

Data Mining Introductory & Advanced Topics by Margaret H. Dunham
Data Mining: Concepts and Techniques by Han & Kamber

Prerequisite:

Knowledge of regression techniques

Theory:
Regression problems deal with the estimation of output values based on input values. In this method we estimate the formula of a straight line which partitions the data into two classes. By determining the regression coefficients c0, c1, ..., cn, the relation between the output parameter Y and the input parameters X1, X2, X3, ..., Xn can be estimated as Y = c0 + c1*X1 + c2*X2 + ... + cn*Xn.

Input:
Training data set

Name       Gender   Height
Christina  F        1.6 m
Jim        M        1.9 m
Maggie     F        1.9 m
Martha     F        1.88 m
Stephony   F        1.7 m
Bob        M        1.85 m
Dave       M        1.7 m
Steven     M        2.1 m
Amey       F        1.8 m


Output:
The tuple is classified using the linear regression technique. A tuple whose predicted value is greater than 0.5 is classified as Medium; otherwise (predicted value less than or equal to 0.5) the tuple is classified as Short.
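A minimal Java sketch of this classification by regression: a least-squares line is fitted to the Short (class 0) and Medium (class 1) tuples of the training set; leaving the Tall tuples out, so that the 0.5 threshold separates exactly two classes, is an assumption of this sketch. A new height is then classified with the stated threshold.

public class LinearRegressionDemo {
    public static void main(String[] args) {
        // Short (class 0) and Medium (class 1) tuples from the training set above
        double[] h = {1.6, 1.85, 1.7, 1.9, 1.88, 1.7, 1.8};
        double[] y = {0, 0, 0, 1, 1, 1, 1};

        // Least-squares estimates of the line y = a + b * height
        int n = h.length;
        double sumH = 0, sumY = 0;
        for (int i = 0; i < n; i++) { sumH += h[i]; sumY += y[i]; }
        double meanH = sumH / n, meanY = sumY / n;
        double sxy = 0, sxx = 0;
        for (int i = 0; i < n; i++) {
            sxy += (h[i] - meanH) * (y[i] - meanY);
            sxx += (h[i] - meanH) * (h[i] - meanH);
        }
        double b = sxy / sxx;
        double a = meanY - b * meanH;
        System.out.println("y = " + a + " + " + b + " * height");

        // Classify a new tuple (the height value is assumed for illustration)
        double newHeight = 1.65;
        double predicted = a + b * newHeight;
        String cls = predicted > 0.5 ? "Medium" : "Short";
        System.out.println("height " + newHeight + " -> " + predicted + " -> " + cls);
    }
}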

Post lab assignment:


Problem 9:

Title:
Minimum Spanning Tree based clustering

Objective:

To write a program to implement Minimum Spanning Tree based clustering.

Reference:

Data Mining Introductory & Advanced Topics by Margaret H. Dunham
Data Mining: Concepts and Techniques by Han & Kamber

Prerequisite:

Knowledge of the single-link technique and the tree data structure called a dendrogram.

Theory:

Algorithm:
1. Store the dendrogram structure
2. Store the clusters at each level
3. Search the adjacency matrix for the smallest distance among clusters
4. Form the cluster
5. Add the newly formed cluster to the vector
6. Form the dendrogram
7. Merge clusters based on their single-link distance

Input:
Adjacency matrix
Output:
Minimum spanning tree
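A minimal Java sketch: Prim's algorithm builds the minimum spanning tree from an adjacency (distance) matrix and the MST edges are listed by weight; dropping the heaviest edges then yields the single-link clusters. The 6x6 matrix is made up for illustration.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class MstClusteringDemo {
    public static void main(String[] args) {
        // Adjacency (distance) matrix between 6 points (values assumed for illustration)
        double[][] d = {
            {0, 2, 9, 10, 8, 7},
            {2, 0, 8, 9, 7, 6},
            {9, 8, 0, 1, 3, 8},
            {10, 9, 1, 0, 2, 9},
            {8, 7, 3, 2, 0, 10},
            {7, 6, 8, 9, 10, 0}};
        int n = d.length;

        // Prim's algorithm: grow the minimum spanning tree one vertex at a time
        boolean[] inTree = new boolean[n];
        int[] parent = new int[n];
        double[] key = new double[n];
        Arrays.fill(key, Double.MAX_VALUE);
        key[0] = 0; parent[0] = -1;
        for (int step = 0; step < n; step++) {
            int u = -1;
            for (int v = 0; v < n; v++)
                if (!inTree[v] && (u == -1 || key[v] < key[u])) u = v;
            inTree[u] = true;
            for (int v = 0; v < n; v++)
                if (!inTree[v] && d[u][v] < key[v]) { key[v] = d[u][v]; parent[v] = u; }
        }

        // MST edges sorted by weight; removing the heaviest edge splits the tree into 2 clusters
        List<double[]> edges = new ArrayList<>();   // each entry: {u, v, weight}
        for (int v = 1; v < n; v++) edges.add(new double[]{parent[v], v, d[parent[v]][v]});
        edges.sort((a, b) -> Double.compare(a[2], b[2]));
        System.out.println("MST edges (sorted by weight):");
        for (double[] e : edges)
            System.out.println((int) e[0] + " - " + (int) e[1] + " : " + e[2]);
        System.out.println("Dropping the heaviest MST edge yields two single-link clusters.");
    }
}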

Post lab assignment:


Problem 10:

Title:
Introduction to the Weka machine learning toolkit

Objective:
To learn to use the Weka machine learning toolkit

References:
Witten, Ian and Frank, Eibe. Data Mining: Practical Machine Learning Tools and Techniques. Springer.

Requirements:
How do you load Weka?
1. What options are available on the main panel?
2. What is the purpose of the following in Weka:
   1. The Explorer
   2. The Knowledge Flow interface
   3. The Experimenter
   4. The command line interface
3. Describe the arff file format.
4. Press the Explorer button on the main panel, load the weather dataset and answer the following questions:
   1. How many instances are there in the dataset?
   2. State the names of the attributes along with their types and values.
   3. What is the class attribute?
   4. In the histogram on the bottom right, which attributes are plotted on the X and Y axes? How do you change the attributes plotted on the X and Y axes?
   5. How will you determine how many instances of each class are present in the data?
   6. What happens when the Visualize All button is pressed?
   7. How will you view the instances in the dataset? How will you save the changes?
5. What is the purpose of the following in the Explorer panel?
   1. The Preprocess panel
      1. What are the main sections of the Preprocess panel?
      2. What are the primary sources of data in Weka?
   2. The Classify panel
   3. The Cluster panel
   4. The Associate panel
   5. The Select Attributes panel
   6. The Visualize panel
6. Load the iris dataset and answer the following questions:
   1. How many instances are there in the dataset?
   2. State the names of the attributes along with their types and values.
   3. What is the class attribute?
   4. In the histogram on the bottom right, which attributes are plotted on the X and Y axes? How do you change the attributes plotted on the X and Y axes?
   5. How will you determine how many instances of each class are present in the data?
   6. What happens when the Visualize All button is pressed?
7. Load the weather dataset and perform the following tasks:
   1. Use the unsupervised filter RemoveWithValues to remove all instances where the attribute humidity has the value high.
   2. Undo the effect of the filter.
   3. Answer the following questions:
      1. What is meant by filtering in Weka?
      2. Which panel is used for filtering a dataset?
      3. What are the two main types of filters in Weka?
      4. What is the difference between the two types of filters? What is the difference between an attribute filter and an instance filter?
8. Load the iris dataset and perform the following tasks:
   1. Press the Visualize tab to view the Visualizer panel.
   2. What is the purpose of the Visualizer?
   3. Select one panel in the Visualizer and experiment with the buttons on the panel.

Post lab:
Provide answers to all the questions given above.


Problem 11:

Title:
Classification using the Weka toolkit - Part 1

Objective:
To perform classification on datasets using the Weka machine learning toolkit

References:
Witten, Ian and Frank, Eibe. Data Mining: Practical Machine Learning Tools and Techniques. Springer.

Requirements:
1. Load the weather.nominal.arff dataset into Weka and run the Id3 classification algorithm. Answer the following questions:
   1. List the attributes of the given relation along with the type details.
   2. Create a table of the weather.nominal.arff data.
   3. Study the classifier output and answer the following questions:
      1. Draw the decision tree generated by the classifier.
      2. Compute the entropy values for each of the attributes.
      3. What is the relationship between the attribute entropy values and the nodes of the decision tree?
      4. Draw the confusion matrix. What information does the confusion matrix provide?
      5. Describe the Kappa statistic.
      6. Describe the following quantities:
         1. TP Rate
         2. FP Rate
         3. Precision
         4. Recall
2. Load the weather.arff dataset in Weka and run the Id3 classification algorithm. What problem do you have and what is the solution?
3. Load the weather.arff dataset in Weka and run the OneR rule generation algorithm. Write down the rules that were generated.
4. Load the weather.arff dataset in Weka and run the PRISM rule generation algorithm. Write down the rules that are generated.

Post lab:
Provide answers to all the questions given above.

Problem 12:

Title:
Classification using the Weka toolkit - Part 2

Objective:
To perform classification on datasets using the Weka toolkit

References:
Witten, Ian and Frank, Eibe. Data Mining: Practical Machine Learning Tools and Techniques. Springer.

Requirements:
1. Load the glass.arff dataset and perform the following tasks:
   1. How many items are there in the dataset?
   2. List the attributes in the dataset.
   3. List the classes in the dataset along with the count of instances in each class.
   4. How will you determine the color assigned to each class?
   5. By examining the histograms, how will you determine which attributes should be the most important in classifying the types of glass?
2. Perform the following classification tasks:
   1. Run the IBk classifier for various values of K.
   2. What is the accuracy of this classifier for each value of K?
   3. What type of classifier is the IBk classifier?
3. Perform the following classification tasks:
   1. Run the J48 classifier.
   2. What is the accuracy of this classifier?
   3. What type of classifier is the J48 classifier?
4. Compare the results of the IBk and the J48 classifiers. Which is better?
5. Run the J48 and IBk classifiers using
   1. the cross-validation strategy with various fold levels. Compare the accuracy results.
   2. the holdout strategy with three percentage levels. Compare the accuracy results.
   A programmatic sketch of such an evaluation is given after this list.
6. Perform the following tasks:
   1. Remove instances belonging to the following classes:
      1. build wind float
      2. build wind non-float
   2. Perform classification using the IBk and J48 classifiers. What is the effect of this filter on the accuracy of the classifiers?
7. Perform the following tasks:
   1. Run the J48 and the NaiveBayes classifiers on the following datasets and determine the accuracy:
      1. vehicle.arff
      2. kr-vs-kp.arff
      3. glass.arff
      4. waveform-5000.arff
   On which datasets does NaiveBayes perform better? Why?
8. Perform the following tasks:
   1. Use the results of the J48 classifier to determine the most important attributes.
   2. Remove the least important attributes.
   3. Run the J48 and IBk classifiers and determine the effect of this change on the accuracy of these classifiers. What will you conclude from the results?
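The same experiments can also be scripted against the Weka Java API instead of the Explorer. The sketch below is illustrative only (the dataset path is an assumption and must point at your local copy of glass.arff); it builds a J48 tree and evaluates it with 10-fold cross-validation, and the identical pattern applies to IBk or NaiveBayes by swapping in the corresponding classifier class.

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class WekaJ48Demo {
    public static void main(String[] args) throws Exception {
        // Load the dataset (path assumed; point it at your local glass.arff)
        Instances data = DataSource.read("glass.arff");
        data.setClassIndex(data.numAttributes() - 1);   // last attribute is the class

        // Build a J48 (C4.5) decision tree on the full training set
        J48 tree = new J48();
        tree.buildClassifier(data);
        System.out.println(tree);   // prints the decision tree

        // Evaluate with 10-fold cross-validation, as the Explorer does by default
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(tree, data, 10, new Random(1));
        System.out.println(eval.toSummaryString());
        System.out.println(eval.toMatrixString());      // confusion matrix
    }
}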

Post lab:
Provide answers to all the questions given above.

Problem 13:

Title:
Performing data preprocessing tasks for data mining in Weka

Objective:
To learn how to use various data preprocessing methods as a part of data mining

References:
Witten, Ian and Frank, Eibe. Data Mining: Practical Machine Learning Tools and Techniques. Springer.

Requirements:

Part A: Application of Discretization Filters
1. Perform the following tasks:
   1. Load the 'sick.arff' dataset.
   2. How many instances does this dataset have?
   3. How many attributes does it have?
   4. Which is the class attribute and what are the characteristics of this attribute?
   5. How many attributes are numeric? What are the attribute indexes of the numeric attributes?
   6. Apply the NaiveBayes classifier. What is the accuracy of the classifier?
2. Perform the following tasks:
   1. Load the 'sick.arff' dataset.
   2. Apply the supervised discretization filter.
   3. What is the effect of this filter on the attributes?
   4. How many distinct ranges have been created for each attribute?
   5. Undo the filter applied in the previous step.
   6. Apply the unsupervised discretization filter. Do this twice:
      1. In this step, set 'bins' = 5
      2. In this step, set 'bins' = 10
      3. What is the effect of the unsupervised filter on the dataset?
   7. Run the NaiveBayes classifier after applying the following filters:
      1. Unsupervised discretization with 'bins' = 5
      2. Unsupervised discretization with 'bins' = 10
      3. Unsupervised discretization with 'bins' = 20
   8. Compare the accuracy of the following cases:
      1. NaiveBayes without discretization filters
      2. NaiveBayes with a supervised discretization filter
      3. NaiveBayes with an unsupervised discretization filter with different values for the 'bins' attribute

Part B: Attribute Selection
1. Perform the following tasks:
   1. Load the 'mushroom.arff' dataset.
   2. Run the J48, IBk, and NaiveBayes classifiers.
   3. What is the accuracy of each of these classifiers?
2. Perform the following tasks:
   1. Go to the 'Select Attributes' panel.
   2. Set the attribute evaluator to CfsSubsetEval.
   3. Set the search method to 'GreedyStepwise'.
   4. Analyze the results window.
   5. Record the attribute numbers of the most important attributes.
   6. Run the meta classifier AttributeSelectedClassifier using the following:
      1. CfsSubsetEval
      2. GreedyStepwise
      3. J48, IBk, and NaiveBayes
   7. Record the accuracy of the classifiers.
   8. What are the benefits of attribute selection?

Part C
1. Perform the following tasks:
   1. Load the 'vote.arff' dataset.
   2. Run the J48, IBk, and NaiveBayes classifiers.
   3. Record the accuracies.
2. Perform the following tasks:
   1. Go to the 'Select Attributes' panel.
   2. Set the attribute evaluator to 'WrapperSubsetEval'.
   3. Set the search method to 'RankSearch'.
   4. Set the attribute evaluator to 'InfoGainAttributeEval'.
   5. Analyze the results.
   6. Run the meta classifier AttributeSelectedClassifier using the following:
      1. WrapperSubsetEval
      2. RankSearch
      3. InfoGainAttributeEval
   7. Sampling:
      1. Load the 'letter.arff' dataset.
      2. Take any attribute and record the min, max, mean, and standard deviation of the attribute.
      3. Apply the Resample filter with 'sampleSizePercent' set to 50 percent.
      4. What is the size of the filtered dataset? Observe the min, max, mean, and standard deviation of the attribute that was selected in step 2. What is the percentage change in the values?
      5. Give the benefits of sampling a large dataset.

Post lab:
Provide answers to all the questions given above.

