Principles of Data Mining

Pham Tho Hoan hoanpt@hnue.edu.vn

References
[1] David Hand, Heikki Mannila, and Padhraic Smyth, Principles of Data Mining, MIT Press, 2002.
[2] Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, 2nd Edition, 2006.
[3] Christopher M. Bishop, Pattern Recognition and Machine Learning, 2006.

What is data mining
Data mining is the analysis of (often large) observational data sets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data owner.
The relationships and summaries derived through a data mining exercise are often referred to as models or patterns. Examples include linear equations, rules, clusters, graphs, tree structures, and recurrent patterns in time series.
Observational data vs. experimental data, convenience (opportunity) samples vs. random samples, huge data vs. small data, data mining vs. statistics.
Novelty vs. triviality: novelty must be measured relative to the user's prior knowledge.
Simple relationships are more readily understood than complicated ones, and may well be preferred, but simple ones may not be useful.

Data Mining and Knowledge Discovery in Data

Types of data sets
n × p data matrix {real number, category, missing, noise}
Text, sequence, structure, pictures
Transactions
Etc.
Lost information

Model and pattern structures
A model structure, as defined here, is a global summary of a data set; it makes statements about any point in the full measurement space (Y = aX + c).
Pattern structures make statements only about restricted regions of the space spanned by the variables. An example is a simple probabilistic statement of the form: if X > x1 then prob(Y > y1) = p1; or p(Y > y1 | X > x1) = p1. This structure consists of constraints on the values of the variables X and Y, related in the form of a probabilistic rule.
Once we have established the structural form we are interested in finding, the next step is to estimate its parameters from the available data. We refer to a particular model, such as y = 3.2x + 2.8, as a "fitted model," or just "model" for short (and similarly for patterns).
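The distinction between a global model and a local pattern can be sketched in a few lines of Python. This is only an illustration: the data values, the thresholds passed as x1 and y1, and the function names fitted_model and pattern_probability are assumptions made for the example, not part of the slides.

```python
# Hypothetical toy data (X, Y); the values are made up for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [6.1, 9.0, 12.5, 15.6, 18.8, 22.0]


def fitted_model(x, a=3.2, c=2.8):
    """Global model structure Y = aX + c: gives an answer for any x."""
    return a * x + c


def pattern_probability(xs, ys, x1, y1):
    """Estimate p(Y > y1 | X > x1) from the data.

    The pattern only speaks about the region X > x1; it returns None
    when no observation falls in that region.
    """
    region = [y for x, y in zip(xs, ys) if x > x1]
    if not region:
        return None
    return sum(1 for y in region if y > y1) / len(region)
```

Calling fitted_model(10.0) yields a prediction even far outside the observed range, whereas pattern_probability(xs, ys, 3.0, 16.0) only summarizes the observations with X > 3.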

Data mining tasks
Exploratory Data Analysis (EDA): the goal is simply to explore the data without any clear ideas of what we are looking for. Typically, EDA techniques are interactive and visual, and there are many effective graphical display methods for relatively small, low-dimensional data sets.
Descriptive Modeling: the goal of a descriptive model is to describe all of the data (or the process generating the data). Examples of such descriptions include models for the overall probability distribution of the data (density estimation), partitioning of the p-dimensional space into groups (cluster analysis and segmentation), and models describing the relationship between variables (dependency modeling).
Predictive Modeling: Classification and Regression. The aim here is to build a model that will permit the value of one variable to be predicted from the known values of other variables.
Discovering Patterns and Rules
Retrieval by Content

Components of data mining algorithms
1. Model or Pattern Structure: determining the underlying structure or functional forms that we seek from the data
2. Score Function: judging the quality of a fitted model
3. Optimization and Search Method: optimizing the score function and searching over different model and pattern structures
4. Data Management Strategy: handling data access efficiently during the search/optimization

Score functions
Without some form of score function, we cannot tell whether one model is better than another or, indeed, how to choose a good set of values for the parameters of the model.
Several score functions are widely used for this purpose; these include likelihood, sum of squared errors, and misclassification rate (the latter is used in supervised classification problems).
Penalize model complexity: score(model) = error(model) + penaltyFunction(model)
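Two of the score functions named above, and the penalized form, can be written down directly. The sketch below is a minimal illustration; the function names and the choice of a penalty that grows linearly with the number of parameters are assumptions, since the slides do not fix a particular penaltyFunction.

```python
def sum_squared_errors(y_true, y_pred):
    """Sum of squared differences between observed and predicted values."""
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred))


def misclassification_rate(labels, predicted):
    """Fraction of cases whose predicted class differs from the true class."""
    wrong = sum(1 for t, p in zip(labels, predicted) if t != p)
    return wrong / len(labels)


def penalized_score(error, n_params, penalty_weight=1.0):
    """score(model) = error(model) + penaltyFunction(model).

    Assumption: the penalty is penalty_weight times the number of
    parameters; other penalty functions are equally possible.
    """
    return error + penalty_weight * n_params
```

With this convention a lower score is better, so a more complex model is preferred only if its reduction in error outweighs the added penalty.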

Optimization and Search Methods
The goal of optimization and search is to determine the structure and the parameter values that achieve a minimum (or maximum, depending on the context) value of the score function.
Methods: Greedy Search Algorithm, Systematic Search and Search Heuristics, Branch-and-Bound, Gradient-Based Methods for Optimizing Smooth Functions, Univariate Parameter Optimization, Multivariate Parameter Optimization, Constrained Optimization, etc.
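As one illustration of a gradient-based method for a smooth score function, the sketch below minimizes the sum of squared errors of a line y = ax + b by plain gradient descent. The function name, the learning rate, the number of steps, and the toy data in the usage note are all assumptions chosen so the iteration converges on this small example; they are not taken from the slides.

```python
def gradient_descent_sse(xs, ys, learning_rate=0.01, steps=5000):
    """Minimize sum_i (a*x_i + b - y_i)^2 over (a, b) by gradient descent.

    The learning rate must be small enough for the data at hand; the
    default here is an assumption that suits small, low-magnitude inputs.
    """
    a, b = 0.0, 0.0
    for _ in range(steps):
        residuals = [a * x + b - y for x, y in zip(xs, ys)]
        # Partial derivatives of the sum of squared errors.
        grad_a = 2.0 * sum(r * x for r, x in zip(residuals, xs))
        grad_b = 2.0 * sum(residuals)
        a -= learning_rate * grad_a
        b -= learning_rate * grad_b
    return a, b
```

For example, gradient_descent_sse([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0]) approaches a = 2 and b = 1, the line that passes through all four points.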

Data Management Strategy
The ways in which the data are stored, indexed, and accessed.

An example
Problem:
Input: a data set of credit card spending {(x_i, y_i), i = 1, ..., n}, where x_i is a person's annual income and y_i their annual credit card spending;
Output: a model which would allow us to predict a person's annual credit card spending given their annual income.
One solution: the model would not be perfect, but since spending typically increases with income, the model might well be adequate as a rough characterization.
Model structure: the variable spending (f) is linearly related to the variable income (x): f(x) = ax + b
The score function: the sum of squared errors
$$\sum_{i=1}^{n} [y_i - f(x_i)]^2$$
The smaller this sum is, the better the model fits the data.
The optimization algorithm (to find a, b) is quite simple in the case of linear regression: a and b can be expressed as explicit functions of the observed values of spending and income.
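The explicit expressions for a and b are the standard least-squares formulas: a is the sum of (x_i - mean(x))(y_i - mean(y)) divided by the sum of (x_i - mean(x))^2, and b = mean(y) - a * mean(x). A minimal sketch follows; the function names fit_line and sum_squared_errors are assumptions introduced for this example.

```python
def fit_line(incomes, spendings):
    """Closed-form least-squares fit of spending = a * income + b."""
    n = len(incomes)
    if n < 2:
        raise ValueError("need at least two observations")
    x_bar = sum(incomes) / n
    y_bar = sum(spendings) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(incomes, spendings))
    s_xx = sum((x - x_bar) ** 2 for x in incomes)
    if s_xx == 0:
        raise ValueError("all incomes are identical; the slope is undefined")
    a = s_xy / s_xx
    b = y_bar - a * x_bar
    return a, b


def sum_squared_errors(a, b, incomes, spendings):
    """Score of the fitted line: sum of [y_i - (a * x_i + b)]^2."""
    return sum((y - (a * x + b)) ** 2 for x, y in zip(incomes, spendings))
```

No iterative search is needed here, which is what makes linear regression an unusually easy optimization problem.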

Some questions
For the same problem as above (modeling the relationship between x and y), consider the following three models. Which one do you prefer?
M1: y = (y1 + ... + yn)/n for every x
M2: y = ax + b (with a, b found as in the previous slide)
M3: if (x = x1) then y = y1 else if (x = x2) then y = y2 ... else if (x = xn) then y = yn else y = random-value (default)
Which model is the most complex?
Which model fits the training data best?
Which model is likely to predict best?
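One way to explore these questions is to compute the sum of squared errors of each model on the training data and on a few held-out points. The sketch below does this on made-up numbers. The data, the helper names, and the choice to draw M3's default value uniformly between the smallest and largest training y are all assumptions for illustration.

```python
import random

# Hypothetical data: roughly linear, values made up for illustration.
train_x = [1.0, 2.0, 3.0, 4.0, 5.0]
train_y = [2.1, 3.9, 6.2, 7.8, 10.1]
test_x = [6.0, 7.0]
test_y = [12.0, 14.1]


def make_m1(xs, ys):
    """M1: predict the mean of the training y for every x."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y


def make_m2(xs, ys):
    """M2: least-squares line y = ax + b."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    a = s_xy / s_xx
    b = y_bar - a * x_bar
    return lambda x: a * x + b


def make_m3(xs, ys, rng):
    """M3: look up memorized (x, y) pairs; otherwise return a random value.

    Assumption: the random default is uniform over the training y range.
    """
    table = dict(zip(xs, ys))
    low, high = min(ys), max(ys)
    return lambda x: table[x] if x in table else rng.uniform(low, high)


def sse(model, xs, ys):
    """Sum of squared errors of a model on the given points."""
    return sum((y - model(x)) ** 2 for x, y in zip(xs, ys))


rng = random.Random(0)  # fixed seed so the run is repeatable
models = {
    "M1": make_m1(train_x, train_y),
    "M2": make_m2(train_x, train_y),
    "M3": make_m3(train_x, train_y, rng),
}
results = {
    name: (sse(m, train_x, train_y), sse(m, test_x, test_y))
    for name, m in models.items()
}
for name, (train_err, test_err) in results.items():
    print(f"{name}: train SSE = {train_err:.3f}, test SSE = {test_err:.3f}")
```

On this toy data M3 reproduces the training set exactly, because it memorizes one value per training point, yet the line M2 has the lowest error on the unseen points. This is the situation that the complexity penalty in the score function is meant to address.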