
MEERUT INSTITUTE OF ENGINEERING AND TECHNOLOGY Department Of Computer Science and Engineering Data Mining and Data Warehousing

(ECS-075) QUESTION BANK (UNIT-I)

1. What is data mining? What is the need of data mining?
Sol.) Refer page 5, Jiawei Han and Micheline Kamber, Second Edition.
2. Explain the major issues involved in the process of data mining.
Sol.) Refer page 36, Jiawei Han and Micheline Kamber, Second Edition.
3. Explain the architecture of a data mining system. Write down the working of its components.
Sol.) Refer page 8, Jiawei Han and Micheline Kamber, Second Edition.
4. What are the KDD process and data pre-processing? How are they different from the data mining process?
Sol.) Refer page 5, Jiawei Han and Micheline Kamber, Second Edition.
5. Explain the functionalities of data mining.
Sol.) Refer page 21, Jiawei Han and Micheline Kamber, Second Edition.
6. How can we categorize data mining systems?
Sol.) Refer page 29, Jiawei Han and Micheline Kamber, Second Edition.
7. Explain what kind of data can be mined in the data mining process and on what kind of data we can perform data mining techniques.
Sol.) Refer page 21, Jiawei Han and Micheline Kamber, Second Edition.
8. Why do we need data preprocessing? Point out the main tasks of data preprocessing.
Sol.) Refer page 7, Jiawei Han and Micheline Kamber, Second Edition.
9. In the process of data cleaning, how can we fill in the missing values? Write down the methods.
Sol.) Refer page 61, Jiawei Han and Micheline Kamber, Second Edition.
10. What are noise and noisy data? How can we remove this data?
Sol.) Refer page 62, Jiawei Han and Micheline Kamber, Second Edition.
11. What is data integration? Discuss its issues and the methods used to integrate different types of data sources.
Sol.) Refer page 67, Jiawei Han and Micheline Kamber, Second Edition.
12. Explain data transformation. Why do we use it? Illustrate its methods.
Sol.) Refer page 70, Jiawei Han and Micheline Kamber, Second Edition.
13. What are the techniques used in data reduction? Explain one of them in detail.
Sol.) Refer page 72, Jiawei Han and Micheline Kamber, Second Edition.

14. What is a histogram? Define its types. Which type is most accurate?
Sol.) Refer page 81, Jiawei Han and Micheline Kamber, Second Edition.
15. Explain sampling as a reduction technique. In how many ways can sampling be performed, and what are their advantages?
Sol.) Refer page 84, Jiawei Han and Micheline Kamber, Second Edition.
16. Write down the steps in the process of entropy-based discretization.
Sol.) Refer page 89, Jiawei Han and Micheline Kamber, Second Edition.
17. Explain the methods of generating concept hierarchies for categorical data with a suitable example.
Sol.) Refer page 94, Jiawei Han and Micheline Kamber, Second Edition.
18. A group of sales prices has been sorted as follows: 4, 8, 15, 21, 21, 24, 25, 28, 34. Partition them into three bins by each of the following methods:
a.) Equal-frequency
b.) By bin boundaries
c.) By bin means
Sol.)
a.) Equal-frequency:
Bin 1: 4, 8, 15
Bin 2: 21, 21, 24
Bin 3: 25, 28, 34
b.) Smoothing by bin boundaries (each value is replaced by the closer of the bin's minimum and maximum):
Bin 1: 4, 4, 15
Bin 2: 21, 21, 24
Bin 3: 25, 25, 34
c.) Smoothing by bin means (each value is replaced by the mean of its bin):
Bin 1: 9, 9, 9
Bin 2: 22, 22, 22
Bin 3: 29, 29, 29
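The three binning steps of question 18 can be checked with a short Python sketch. The function names are illustrative only, and resolving ties toward the lower boundary is an assumption made for this sketch, not something the question prescribes.

```python
def equal_frequency_bins(values, n_bins):
    """Sort the values and split them into n_bins bins of equal size."""
    data = sorted(values)
    size = len(data) // n_bins  # assumes len(values) is divisible by n_bins
    return [data[i * size:(i + 1) * size] for i in range(n_bins)]


def smooth_by_means(bins):
    """Replace every value in a bin by the mean of that bin."""
    return [[sum(b) / len(b)] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace every value by the closer of the bin's minimum and maximum.

    Ties go to the lower boundary (an arbitrary choice for this sketch).
    """
    smoothed = []
    for b in bins:
        lo, hi = b[0], b[-1]
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed


prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bins = equal_frequency_bins(prices, 3)
print(bins)
print(smooth_by_means(bins))
print(smooth_by_boundaries(bins))
```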

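19. A 2*2 contingency table is given. Are gender and preferred reading correlated?

          FICTION   NON-FICTION   TOTAL
MALE          250            50     300
FEMALE        200          1000    1200
TOTAL         450          1050    1500

Sol.) Pearson's chi-square statistic is computed as
$\chi^2 = \sum_i \sum_j (o_{ij} - e_{ij})^2 / e_{ij}$, where $e_{ij} = \mathrm{count}(A = a_i) \times \mathrm{count}(B = b_j) / N$
is the expected frequency and $o_{ij}$ is the observed frequency. So,
e11 = (count(male)*count(fiction))/N = 300*450/1500 = 90
e12 = (count(male)*count(non-fiction))/N = 300*1050/1500 = 210
e21 = (count(female)*count(fiction))/N = 1200*450/1500 = 360
e22 = (count(female)*count(non-fiction))/N = 1200*1050/1500 = 840
The statistic is therefore
(250-90)^2/90 + (50-210)^2/210 + (200-360)^2/360 + (1000-840)^2/840 = 284.44 + 121.90 + 71.11 + 30.48 = 507.93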
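As a cross-check, the expected frequencies and the statistic can be recomputed with a small Python sketch. The dictionary layout and the function name are assumptions of this example; only the four observed counts come from the question.

```python
# Observed counts from the 2*2 contingency table of question 19.
observed = {
    ("male", "fiction"): 250, ("male", "non_fiction"): 50,
    ("female", "fiction"): 200, ("female", "non_fiction"): 1000,
}


def chi_square(table):
    """Return (expected counts, Pearson chi-square statistic)."""
    rows = sorted({r for r, _ in table})
    cols = sorted({c for _, c in table})
    n = sum(table.values())
    row_total = {r: sum(table[(r, c)] for c in cols) for r in rows}
    col_total = {c: sum(table[(r, c)] for r in rows) for c in cols}
    # e_ij = count(A = a_i) * count(B = b_j) / N
    expected = {(r, c): row_total[r] * col_total[c] / n
                for r in rows for c in cols}
    stat = sum((table[k] - expected[k]) ** 2 / expected[k] for k in table)
    return expected, stat


expected, stat = chi_square(observed)
print(expected)
print(round(stat, 2))
```

Summing the unrounded terms gives a value a hundredth or so away from the hand computation, which adds terms already rounded to two decimals.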

For this 2*2 table, the degrees of freedom are (2-1)(2-1) = 1. For 1 degree of freedom, the chi-square value needed to reject the hypothesis at the 0.001 significance level is 10.828. Since our computed value is above this, we can reject the hypothesis that gender and preferred reading are independent and conclude that the two attributes are strongly correlated for the given group of people.
20. Normalize the income value $73,600 using the methods given below. The target range is [0.0, 1.0], min = $12,000, max = $98,000, mean = $54,000, standard deviation = $16,000.
a.) min-max normalization
b.) z-score normalization
Sol.)
a.) min-max normalization:
v' = ((v - min)/(max - min))(new_max - new_min) + new_min
= ((73600 - 12000)/(98000 - 12000))(1.0 - 0) + 0
= 0.716
b.) z-score normalization:
v' = (v - mean)/(standard deviation)
= (73600 - 54000)/16000
= 1.225
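The two normalization formulas of question 20 translate directly into code. The following is a minimal Python sketch; the function and parameter names are chosen for illustration.

```python
def min_max(v, old_min, old_max, new_min=0.0, new_max=1.0):
    """Map v from [old_min, old_max] linearly onto [new_min, new_max]."""
    return (v - old_min) / (old_max - old_min) * (new_max - new_min) + new_min


def z_score(v, mean, std):
    """Express v as a number of standard deviations from the mean."""
    return (v - mean) / std


print(round(min_max(73600, 12000, 98000), 3))
print(z_score(73600, 54000, 16000))
```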

21. Given a set of data: 12, 8, 45, 23, 6, 7, 8, 25, 34, 22, 19, 22, 34, 45, 41
a.) Employ the binning method to allocate the data into three bins by equal-frequency partitioning, then smooth by bin means and by bin boundaries respectively.
b.) Use the histogram method to allocate the data into four equal-width intervals.
Sol.)
a.) First we sort the data in increasing order:
6, 7, 8, 8, 12, 19, 22, 22, 23, 25, 34, 34, 41, 45, 45
Equal-frequency:
Bin-1: 6, 7, 8, 8, 12
Bin-2: 19, 22, 22, 23, 25
Bin-3: 34, 34, 41, 45, 45
Smoothing by bin means (the means 8.2, 22.2 and 39.8 are rounded to the nearest integer):
Bin-1: 8, 8, 8, 8, 8
Bin-2: 22, 22, 22, 22, 22
Bin-3: 40, 40, 40, 40, 40
Smoothing by bin boundaries (in Bin-2 the value 22 is equally far from 19 and 25; the lower boundary is used):
Bin-1: 6, 6, 6, 6, 12
Bin-2: 19, 19, 19, 25, 25
Bin-3: 34, 34, 45, 45, 45
b.) The interval width is W = (max - min)/4 = (45 - 6)/4 = 9.75, which is rounded up to 10 for integer data. Thus we have 4 intervals: [6,15], [16,25], [26,35] and [36,45], containing 5, 5, 2 and 3 values respectively.

22. Use the methods below to normalize the following group of data: 200, 300, 400, 600, 1000
a.) min-max normalization by setting min = 0 and max = 1
b.) z-score normalization
Sol.) The mean of the data is 500 and the (population) standard deviation is 282.84.

Original data          200     300     400     600    1000
a.) [0,1] normalized     0   0.125    0.25     0.5       1
b.) z-score          -1.06   -0.71   -0.35    0.35    1.77

23. Suppose that the data for analysis includes the attribute age. The age values for the data tuples are: 13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70
a.) What is the mean of the data? What is the median?
b.) What is the mode of the data? Comment on the data's modality.
c.) What is the midrange of the data?
d.) Can you find the first quartile (Q1) and the third quartile (Q3) of the data?
e.) Give the five-number summary of the data.
Sol.)
a.) Mean = 809/27 = 29.96, approximately 30. Median = 25 (the 14th of the 27 sorted values).
b.) The dataset has two values that occur with the same highest frequency (four times each), so it is bimodal. The modes of the data are 25 and 35.
c.) Midrange = (70 + 13)/2 = 41.5
d.) The first quartile is 20 and the third quartile is 35.
e.) The five-number summary (minimum, Q1, median, Q3, maximum) is 13, 20, 25, 35, 70.
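The central-tendency measures for the age data can be reproduced with Python's standard `statistics` module, as in the sketch below. Quartiles are left out on purpose, because different quartile conventions give slightly different values; `statistics.multimode` assumes Python 3.8 or later.

```python
import statistics

ages = [13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30,
        33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70]

mean = statistics.mean(ages)            # arithmetic mean
median = statistics.median(ages)        # middle value of the sorted data
modes = statistics.multimode(ages)      # every value with the highest count
midrange = (min(ages) + max(ages)) / 2  # average of the extremes

print(round(mean, 2), median, modes, midrange)
```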

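24. Suppose that the values for the given set of data are grouped into intervals and the corresponding frequencies are as follows:

AGE          1-5   5-15   15-20   20-50   50-80   80-110
FREQUENCY    200    450     300    1500     700       44

Compute an approximate median value for the data.
Sol.) For interval (grouped) data, the median is approximated by
$\text{median} = L_1 + \left( \frac{N/2 - (\sum \text{freq})_l}{\text{freq}_{\text{median}}} \right) \times \text{width}$
where L1 is the lower boundary of the median interval, N is the total number of values, (Σ freq)_l is the sum of the frequencies of all intervals below the median interval, freq_median is the frequency of the median interval, and width is its width.

First the median interval is located. Here N = 3194, so the median is the average of the 1597th and 1598th values. The cumulative frequency is 200 + 450 + 300 = 950 up to age 20 and 2450 up to age 50, so both values lie in the interval 20-50. Thus L1 = 20, N = 3194, (Σ freq)_l = 950, freq_median = 1500 and width = 30:
Median = 20 + ((1597 - 950)/1500)(30) = 32.94
Hence, 32.94 is the approximate median value.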
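The grouped-median formula can be sketched in Python as follows. The tuple layout (lower bound, upper bound, frequency) and the function name are assumptions of this example.

```python
def grouped_median(intervals):
    """Approximate the median of data given as (lower, upper, frequency)
    tuples listed in ascending order."""
    n = sum(freq for _, _, freq in intervals)
    cumulative = 0  # sum of frequencies below the current interval
    for lower, upper, freq in intervals:
        if cumulative + freq >= n / 2:
            # This is the median interval: interpolate linearly inside it.
            return lower + (n / 2 - cumulative) / freq * (upper - lower)
        cumulative += freq
    raise ValueError("empty frequency table")


age_table = [(1, 5, 200), (5, 15, 450), (15, 20, 300),
             (20, 50, 1500), (50, 80, 700), (80, 110, 44)]
print(round(grouped_median(age_table), 2))
```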

25. The training data set T is given as follows. It is described by three input attributes and each tuple belongs to one of two given classes: Class 1 or Class 2. Compute the information gain.

ATTRIBUTE 1   ATTRIBUTE 2   ATTRIBUTE 3   CLASS
A             70            TRUE          CLASS1
A             90            TRUE          CLASS2
A             85            FALSE         CLASS2
A             95            FALSE         CLASS2
A             70            FALSE         CLASS1
B             90            TRUE          CLASS1
B             78            FALSE         CLASS1
B             65            TRUE          CLASS1
B             75            FALSE         CLASS1
C             80            TRUE          CLASS2
C             70            TRUE          CLASS2
C             80            FALSE         CLASS1
C             80            FALSE         CLASS1
C             96            FALSE         CLASS1

Sol.) Refer page 299, Jiawei Han and Micheline Kamber, Second Edition.
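For orientation, the entropy-based information gain of the two categorical attributes in this table can be computed with the Python sketch below. It is not the textbook's worked solution; the function names are illustrative, and the numeric Attribute 2 is skipped because it would first need a split point.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Expected information (in bits) needed to classify a tuple."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())


def information_gain(values, labels):
    """Entropy of the whole set minus the weighted entropy after splitting
    on a categorical attribute."""
    n = len(labels)
    remainder = 0.0
    for v in set(values):
        subset = [lab for x, lab in zip(values, labels) if x == v]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder


attr1 = list("AAAAABBBBCCCCC")
attr3 = [True, True, False, False, False, True, False,
         True, False, True, True, False, False, False]
labels = [1, 2, 2, 2, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1]  # 1 = CLASS1, 2 = CLASS2

print(round(entropy(labels), 3))
print(round(information_gain(attr1, labels), 3))
print(round(information_gain(attr3, labels), 3))
```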

MEERUT INSTITUTE OF ENGINEERING AND TECHNOLOGY Department Of Computer Science and Engineering Data Mining and Data Warehousing (ECS-075) QUESTION BANK (UNIT-II)

1) What are data generalization and data characterization? State the difference between the two.
Sol.) Refer page 198, Jiawei Han and Micheline Kamber, Second Edition.
2) What is attribute-oriented induction? Why do we use it? Explain its essential steps.
Sol.) Refer page 199, Jiawei Han and Micheline Kamber, Second Edition.
3) Write short notes on: (i) attribute generalization threshold control (ii) generalized relation threshold control
Sol.) Refer page 202, Jiawei Han and Micheline Kamber, Second Edition.
4) Give an example of the attribute-oriented induction method. The initial working relation should contain the following attributes: name, gender, major, birth-place, birth-date, residence, phone, gpa.
Sol.) Refer page 203, Jiawei Han and Micheline Kamber, Second Edition.
5) Write short notes on: (i) cross tabs (ii) bar charts (iii) pie charts
Sol.) Refer pages 206, 207, Jiawei Han and Micheline Kamber, Second Edition.
6) What is the method of discrimination between different classes? Write down its complete procedure with an example.
Sol.) Refer page 210, Jiawei Han and Micheline Kamber, Second Edition.
7) What is a quantitative characteristic rule? What does it represent? State the equation for computing its measure.
Sol.) Refer page 209, Jiawei Han and Micheline Kamber, Second Edition.
8) What is a quantitative discriminant rule? What does it represent? State the equation for computing its measure.
Sol.) Refer page 215, Jiawei Han and Micheline Kamber, Second Edition.
9) What is class description? Is there any rule called a quantitative description rule? State the equation of the description rule and the difference between its two parts.
Sol.) Refer page 215, Jiawei Han and Micheline Kamber, Second Edition.

10) Write down the quantitative description rule for the given data:

Item          Europe   North America   Both regions
TV                80             120            200
Computer         240             560            800
Both items       320             680           1000

Sol.) Refer page 217, Jiawei Han and Micheline Kamber, Second Edition.
11) Suppose the following table is derived from the induction method:

Class        Birth place   Count
Programmer   USA             180
Programmer   Others          120
DBA          USA              20
DBA          Others           80

i) Transform this table into a cross tab showing the associated t-weights and d-weights.
ii) Map the class Programmer into a description rule.
Sol.) Refer pages 216, 218, Jiawei Han and Micheline Kamber, Second Edition.
12) What is descriptive data summarization? How can we compute the central tendency of given data?
Sol.) Refer page 51, Jiawei Han and Micheline Kamber, Second Edition.
13) Write short notes on: a. variance b. standard deviation c. midrange d. range
Sol.) Refer page 54, Jiawei Han and Micheline Kamber, Second Edition.
14) Write down the complete procedure for computing the five-number summary of given data. What does it represent? Also define the terms associated with this summary.
Sol.) Refer page 54, Jiawei Han and Micheline Kamber, Second Edition.
15) What are outliers? How are outliers detected by the method of measuring the dispersion of data?
Sol.) Refer page 54, Jiawei Han and Micheline Kamber, Second Edition.
16) Which methods are used to represent the summaries generated by descriptive data summarization?
Sol.) Refer page 56, Jiawei Han and Micheline Kamber, Second Edition.

17) Suppose the data for analysis is: 13, 15, 16, 16, 19, 20, 20, 21, 25, 30, 33, 111
(i) Compute the mean, median, midrange, five-number summary and interquartile range.
(ii) What are the mode and modality of the data?
(iii) Show a box-plot of the data.
Sol.)
i) Mean = 339/12 = 28.25
Median = 20
Midrange = (13 + 111)/2 = 62
Five-number summary = min, Q1, median, Q3, max = 13, 16, 20, 25, 111
Interquartile range = Q3 - Q1 = 25 - 16 = 9
ii) Mode = 16 and 20, so the data is bimodal.
18) Suppose the interval data is given. Find the approximate median value for this data:

AGE          1-5   5-15   15-20   20-50   50-80   80-110
FREQUENCY    200    450     300    1500     700       44

Sol.) For interval (grouped) data,
$\text{median} = L_1 + \left( \frac{N/2 - (\sum \text{freq})_l}{\text{freq}_{\text{median}}} \right) \times \text{width}$
where L1 is the lower boundary of the median interval, (Σ freq)_l is the sum of the frequencies of the intervals below it, and freq_median and width are the frequency and width of the median interval.
First the median interval is located. N = 3194, so the median is the average of the 1597th and 1598th values, and both lie in the interval 20-50 (the cumulative frequency is 950 below age 20 and 2450 below age 50). Thus L1 = 20, N = 3194, (Σ freq)_l = 950, freq_median = 1500 and width = 30:
Median = 20 + ((1597 - 950)/1500)(30) = 32.94
Hence, 32.94 is the approximate median value.

19) What are association rules? What are the applications of association rules? Give an example.
Sol.) Refer page 230, Jiawei Han and Micheline Kamber, Second Edition.
20) What are confidence and support? State their difference. How are they computed?
Sol.) Refer page 230, Jiawei Han and Micheline Kamber, Second Edition.
21) Write down the procedure for generating strong association rules from the frequent itemsets.
Sol.) Refer page 231, Jiawei Han and Micheline Kamber, Second Edition.
22) How can we mine association rules from multilevel transactional databases?
Sol.) Refer page 250, Jiawei Han and Micheline Kamber, Second Edition.
23) What are single-dimensional and multidimensional association rules? Write down the procedure for mining rules from multidimensional databases.
Sol.) Refer page 254, Jiawei Han and Micheline Kamber, Second Edition.
24) A database has five transactions. Let min-support = 60% and min-confidence = 80%.

TID    Items-bought
T100   M, O, N, K, E, Y
T200   D, O, N, K, E, Y
T300   M, A, K, E
T400   M, U, C, K, Y
T500   C, O, O, K, I, E

(i) Find all frequent itemsets.
(ii) Find all strong association rules.

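Sol.) With five transactions, a minimum support of 60% corresponds to a support count of 3.

Step 1: Generating 1-itemset frequent patterns
Candidate 1-itemsets:
ITEM      M   O   N   K   E   Y   D   A   U   C   I
SUPPORT   3   3   2   5   4   3   1   1   1   2   1
(O appears twice in T500 but is counted once, because support counts transactions, not occurrences.)
Frequent 1-itemsets:
ITEM      M   O   K   E   Y
SUPPORT   3   3   5   4   3

Step 2: Generating 2-itemset frequent patterns
Candidate 2-itemsets:
ITEMSET   {M,O}  {M,K}  {M,E}  {M,Y}  {O,K}  {O,E}  {O,Y}  {K,E}  {K,Y}  {E,Y}
SUPPORT     1      3      2      2      3      3      2      4      3      2
Frequent 2-itemsets:
ITEMSET   {M,K}  {O,K}  {O,E}  {K,E}  {K,Y}
SUPPORT     3      3      3      4      3

Step 3: Generating 3-itemset frequent patterns
Candidate 3-itemsets:
ITEMSET   {O,K,E}  {K,E,Y}
SUPPORT      3        2
Frequent 3-itemsets:
ITEMSET   {O,K,E}
SUPPORT      3

Hence the frequent itemsets are {M}, {O}, {K}, {E}, {Y}, {M,K}, {O,K}, {O,E}, {K,E}, {K,Y} and {O,K,E}; the largest frequent itemset is {O,K,E}.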
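The level-wise search above, together with the rule-generation step that part (ii) asks for, can be sketched in Python. This is a compact illustration of the Apriori idea on the five transactions of question 24, not an optimized implementation; all function names are chosen for this example.

```python
from itertools import combinations

# Each transaction is a set, so the repeated O in T500 counts once.
transactions = [set("MONKEY"), set("DONKEY"), set("MAKE"),
                set("MUCKY"), set("COOKIE")]
MIN_SUPPORT = 3        # 60% of 5 transactions
MIN_CONFIDENCE = 0.8


def support(itemset):
    """Number of transactions that contain every item of itemset."""
    return sum(1 for t in transactions if itemset <= t)


def frequent_itemsets():
    """Level-wise search: frequent k-itemsets are built from (k-1)-itemsets."""
    frequent = {}
    current = [frozenset([i]) for i in sorted(set().union(*transactions))]
    k = 1
    while current:
        level = {}
        for candidate in current:
            count = support(candidate)
            if count >= MIN_SUPPORT:
                level[candidate] = count
        frequent.update(level)
        k += 1
        # Join step: unions of two frequent itemsets that have exactly k items.
        joined = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        current = [c for c in joined
                   if all(frozenset(s) in level
                          for s in combinations(c, k - 1))]
    return frequent


def strong_rules(frequent):
    """Rules lhs -> rhs with confidence = support(itemset) / support(lhs)."""
    rules = []
    for itemset, count in frequent.items():
        for size in range(1, len(itemset)):
            for lhs in combinations(sorted(itemset), size):
                lhs = frozenset(lhs)
                confidence = count / frequent[lhs]
                if confidence >= MIN_CONFIDENCE:
                    rules.append((lhs, itemset - lhs, confidence))
    return rules


frequent = frequent_itemsets()
rules = strong_rules(frequent)
for lhs, rhs, confidence in rules:
    print(sorted(lhs), "->", sorted(rhs), round(confidence, 2))
```

Note that the prune step discards {K,E,Y} before its support is counted, because its subset {E,Y} is not frequent; the hand solution counts it and then rejects it, with the same final result.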

25) A database with four transactions is given:

TID    Items-bought
T100   I1, I2, I3
T200   I2, I3, I4
T300   I1, I2, I3, I4
T400   I3, I4

Find the strong association rules. The minimum support count is 2 and the minimum confidence is 30%.
Sol.) Step 1: Generating 1-itemset frequent patterns
ITEM      I1   I2   I3   I4
SUPPORT    2    3    4    3
All four items are frequent.

Step 2: Generating 2-itemset frequent patterns
Candidate 2-itemsets:
ITEMSET   {I1,I2}  {I1,I3}  {I1,I4}  {I2,I3}  {I2,I4}  {I3,I4}
SUPPORT      2        2        1        3        2        3
Frequent 2-itemsets:
ITEMSET   {I1,I2}  {I1,I3}  {I2,I3}  {I2,I4}  {I3,I4}
SUPPORT      2        2        3        2        3

Step 3: Generating 3-itemset frequent patterns
ITEMSET   {I1,I2,I3}  {I2,I3,I4}
SUPPORT       2           2

Hence, the largest frequent itemsets are {I1,I2,I3} and {I2,I3,I4}. Every frequent itemset has support at least 2 and no antecedent has support above 4, so every rule generated from these itemsets has confidence of at least 2/4 = 50%, which exceeds the 30% threshold; all such rules are therefore strong.

MEERUT INSTITUTE OF ENGINEERING AND TECHNOLOGY

Department Of Computer Science and Engineering Data Mining and Data Warehousing (ECS-075) QUESTION BANK (UNIT-III)

1. What are classification and prediction techniques? Write down their applications.
Sol.) Refer page 285, Jiawei Han and Micheline Kamber, Second Edition.
2. Write down the issues in classification and prediction.
Sol.) Refer page 289, Jiawei Han and Micheline Kamber, Second Edition.
3. Explain the algorithm of decision tree induction. How are attributes selected in the attribute selection step?
Sol.) Refer page 291, Jiawei Han and Micheline Kamber, Second Edition.
4. What is Bayes' theorem? Write the algorithm for naive Bayes classification.
Sol.) Refer page 310, Jiawei Han and Micheline Kamber, Second Edition.
5. What is a multilayer feed-forward network? Explain its working principle.
Sol.) Refer page 328, Jiawei Han and Micheline Kamber, Second Edition.
6. Illustrate the complete procedure of classification using the backpropagation algorithm. Write its algorithm.
Sol.) Refer page 330, Jiawei Han and Micheline Kamber, Second Edition.
7. Write a short note on the k-nearest neighbour classifier.
Sol.) Refer page 348, Jiawei Han and Micheline Kamber, Second Edition.
8. What is the genetic algorithm? Explain the classification method using the genetic algorithm.
Sol.) Refer page 351, Jiawei Han and Micheline Kamber, Second Edition.
9. What is clustering? Write down the issues related to clustering.
Sol.) Refer page 385, Jiawei Han and Micheline Kamber, Second Edition.
10. Write short notes on: (i) interval-scaled variables (ii) binary variables
Sol.) Refer pages 387, 389, Jiawei Han and Micheline Kamber, Second Edition.
11. Write short notes on: (i) categorical data (ii) ordinal data (iii) ratio-scaled data
Sol.) Refer page 392, Jiawei Han and Micheline Kamber, Second Edition.
12. Write down the method for computing the dissimilarity matrix of the following types of data: (i) variables of mixed type (ii) vector objects
Sol.) Refer pages 395, 397, Jiawei Han and Micheline Kamber, Second Edition.

13. Give a brief overview of the methods used in clustering. Also classify the clustering methods.
Sol.) Refer page 398, Jiawei Han and Micheline Kamber, Second Edition.
14. Explain the partitioning methods and write their algorithms.
Sol.) Refer page 401, Jiawei Han and Micheline Kamber, Second Edition.
15. Write short notes on: (i) CURE (ii) Chameleon
Sol.) Refer page 416, Jiawei Han and Micheline Kamber, Second Edition.
16. Explain the density-based clustering method. Write the procedure of the DBSCAN method.
Sol.) Refer page 418, Jiawei Han and Micheline Kamber, Second Edition.
17. Explain the working of the CLIQUE grid-based clustering method with a suitable diagram.
Sol.) Refer page 436, Jiawei Han and Micheline Kamber, Second Edition.
18. Write short notes on: (i) OPTICS (ii) STING
Sol.) Refer pages 420, 425, Jiawei Han and Micheline Kamber, Second Edition.
19. What is the neural network approach in clustering? State the difference between clustering and classification.
Sol.) Refer page 433, Jiawei Han and Micheline Kamber, Second Edition.
20. What are outliers? Explain in detail one method used in outlier analysis.
Sol.) Refer page 451, Jiawei Han and Micheline Kamber, Second Edition.
21. Class-labeled tuples from the AllElectronics customer database:

RID   AGE           INCOME   STUDENT   CREDIT_RATING   CLASS: buys_computer
1     youth         high     no        fair            no
2     youth         high     no        excellent       no
3     middle_aged   high     no        fair            yes
4     senior        medium   no        fair            yes
5     senior        low      yes       fair            yes
6     senior        low      yes       excellent       no
7     middle_aged   low      yes       excellent       yes
8     youth         medium   no        fair            no
9     youth         low      yes       fair            yes
10    senior        medium   yes       fair            yes
11    youth         medium   yes       excellent       yes
12    middle_aged   medium   no        excellent       yes
13    middle_aged   high     yes       fair            yes
14    senior        medium   no        excellent       no

(The income of tuple 12 is taken as medium, which is the value the solution below relies on when it counts 4 of the 9 buys_computer = yes tuples with medium income.)

Predict the class label of the following tuple using naive Bayesian classification:
X = (age = youth, income = medium, student = yes, credit_rating = fair)

Sol.) The prior probability of each class can be computed as:
P(buys_computer=yes) = 9/14 = 0.643
P(buys_computer=no) = 5/14 = 0.357
To compute P(X|Ci), for i = 1, 2, we compute the following conditional probabilities:
P(age=youth | buys_computer=yes) = 2/9 = 0.222
P(age=youth | buys_computer=no) = 3/5 = 0.600
P(income=medium | buys_computer=yes) = 4/9 = 0.444
P(income=medium | buys_computer=no) = 2/5 = 0.400
P(student=yes | buys_computer=yes) = 6/9 = 0.667
P(student=yes | buys_computer=no) = 1/5 = 0.200
P(credit_rating=fair | buys_computer=yes) = 6/9 = 0.667
P(credit_rating=fair | buys_computer=no) = 2/5 = 0.400
Using the above probabilities we obtain
P(X | buys_computer=yes) = P(age=youth | buys_computer=yes) * P(income=medium | buys_computer=yes) * P(student=yes | buys_computer=yes) * P(credit_rating=fair | buys_computer=yes)
= 0.222 * 0.444 * 0.667 * 0.667 = 0.044
Similarly,
P(X | buys_computer=no) = 0.600 * 0.400 * 0.200 * 0.400 = 0.019
To find the class that maximizes P(X|Ci)P(Ci), we compute
P(X | buys_computer=yes) P(buys_computer=yes) = 0.044 * 0.643 = 0.028
P(X | buys_computer=no) P(buys_computer=no) = 0.019 * 0.357 = 0.007
Since 0.028 > 0.007, the naive Bayesian classifier predicts buys_computer=yes for tuple X.
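The same calculation can be sketched in Python by counting directly over the 14 training tuples. Two assumptions are made here: the income of tuple 12 is taken as medium (the value consistent with the 4/9 used in the solution), and no Laplace smoothing is applied, as in the hand computation. The function name `predict` is illustrative.

```python
# (age, income, student, credit_rating, buys_computer)
rows = [
    ("youth", "high", "no", "fair", "no"),
    ("youth", "high", "no", "excellent", "no"),
    ("middle_aged", "high", "no", "fair", "yes"),
    ("senior", "medium", "no", "fair", "yes"),
    ("senior", "low", "yes", "fair", "yes"),
    ("senior", "low", "yes", "excellent", "no"),
    ("middle_aged", "low", "yes", "excellent", "yes"),
    ("youth", "medium", "no", "fair", "no"),
    ("youth", "low", "yes", "fair", "yes"),
    ("senior", "medium", "yes", "fair", "yes"),
    ("youth", "medium", "yes", "excellent", "yes"),
    ("middle_aged", "medium", "no", "excellent", "yes"),  # income assumed medium
    ("middle_aged", "high", "yes", "fair", "yes"),
    ("senior", "medium", "no", "excellent", "no"),
]


def predict(x):
    """Return (best class, {class: P(X|C) * P(C)}) for attribute tuple x."""
    scores = {}
    for label in {r[-1] for r in rows}:
        in_class = [r for r in rows if r[-1] == label]
        score = len(in_class) / len(rows)  # prior P(C)
        for i, value in enumerate(x):
            # conditional P(attribute_i = value | C), by relative frequency
            score *= sum(1 for r in in_class if r[i] == value) / len(in_class)
        scores[label] = score
    return max(scores, key=scores.get), scores


label, scores = predict(("youth", "medium", "yes", "fair"))
print(label, {k: round(v, 3) for k, v in scores.items()})
```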

22. Consider a network with the following units: 1 and 2 as input units, 3 and 4 as hidden units, and 5 as the output unit. Let the learning rate be 0.9. The initial input and weight values are given as:

x1     x2    w13   w14   w23   w24   w35   w45
0.35   0.9   0.1   0.4   0.8   0.6   0.3   0.9

The biases at units 3, 4 and 5 are 0.1, 0.3 and 0.4 respectively. Compute the error at each unit.
Sol.) Refer page 334, Jiawei Han and Micheline Kamber, Second Edition.

23. Consider the following units of a network. There are three input units (1, 2 and 3), two hidden units (4 and 5) and one output unit (6). Let the learning rate be 0.9. The initial input and weight values are given as:

x1   x2   x3   w14   w15    w24   w25   w34    w35   w46    w56
1    0    1    0.2   -0.3   0.4   0.1   -0.5   0.2   -0.3   -0.2

The biases at units 4, 5 and 6 are -0.4, 0.2 and 0.1 respectively. Compute the error at each unit.
Sol.) The first training tuple is X = (1, 0, 1) and its class label is 1. X is fed into the network and the net input and output of each unit are computed.

UNIT j   NET INPUT Ij                                    OUTPUT Oj
4        0.2 + 0 - 0.5 - 0.4 = -0.7                      1/(1 + e^0.7) = 0.332
5        -0.3 + 0 + 0.2 + 0.2 = 0.1                      1/(1 + e^-0.1) = 0.525
6        (-0.3)(0.332) - (0.2)(0.525) + 0.1 = -0.105     1/(1 + e^0.105) = 0.474

The error at each unit is calculated as:

UNIT j   Err j
6        (0.474)(1 - 0.474)(1 - 0.474) = 0.1311
5        (0.525)(1 - 0.525)(0.1311)(-0.2) = -0.0065
4        (0.332)(1 - 0.332)(0.1311)(-0.3) = -0.0087

The weights and biases are updated as:

WEIGHT OR BIAS   NEW VALUE
w46              -0.3 + (0.9)(0.1311)(0.332) = -0.261
w56              -0.2 + (0.9)(0.1311)(0.525) = -0.138
w14              0.2 + (0.9)(-0.0087)(1) = 0.192
w15              -0.3 + (0.9)(-0.0065)(1) = -0.306
w24              0.4 + (0.9)(-0.0087)(0) = 0.4
w25              0.1 + (0.9)(-0.0065)(0) = 0.1
w34              -0.5 + (0.9)(-0.0087)(1) = -0.508
w35              0.2 + (0.9)(-0.0065)(1) = 0.194
θ6 (bias)        0.1 + (0.9)(0.1311) = 0.218
θ5 (bias)        0.2 + (0.9)(-0.0065) = 0.194
θ4 (bias)        -0.4 + (0.9)(-0.0087) = -0.408
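One forward and one backward pass of question 23 can be reproduced with the Python sketch below. The dictionary-based representation of the network is an assumption made for readability, and the sketch covers a single training tuple only, not a full training loop.

```python
from math import exp


def sigmoid(z):
    """Logistic activation used for the hidden and output units."""
    return 1.0 / (1.0 + exp(-z))


x = {1: 1.0, 2: 0.0, 3: 1.0}                       # input tuple X = (1, 0, 1)
w = {(1, 4): 0.2, (1, 5): -0.3, (2, 4): 0.4, (2, 5): 0.1,
     (3, 4): -0.5, (3, 5): 0.2, (4, 6): -0.3, (5, 6): -0.2}
bias = {4: -0.4, 5: 0.2, 6: 0.1}
target, rate = 1.0, 0.9

# Forward pass: net input is the weighted sum plus bias, output is sigmoid.
out = dict(x)
for j in (4, 5):
    out[j] = sigmoid(sum(w[(i, j)] * out[i] for i in (1, 2, 3)) + bias[j])
out[6] = sigmoid(sum(w[(j, 6)] * out[j] for j in (4, 5)) + bias[6])

# Backward pass: output-unit error first, then the hidden-unit errors.
err = {6: out[6] * (1 - out[6]) * (target - out[6])}
for j in (4, 5):
    err[j] = out[j] * (1 - out[j]) * err[6] * w[(j, 6)]

# Updates: w_ij += rate * Err_j * O_i and bias_j += rate * Err_j.
new_w = {(i, j): wij + rate * err[j] * out[i] for (i, j), wij in w.items()}
new_bias = {j: b + rate * err[j] for j, b in bias.items()}

print({k: round(v, 3) for k, v in new_w.items()})
print({k: round(v, 3) for k, v in new_bias.items()})
```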

24. A table of data containing two attributes is given:

X (Experience, years)    3    8    9   13    3    6   11   21    1   16
Y (Salary, $1000)       30   57   64   72   36   43   59   90   20   83

For a new experience value X = 10, predict the salary.
Sol.) In the linear model y = w0 + w1 x, the coefficients are estimated by least squares:
$w_1 = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}$, $w_0 = \bar{y} - w_1 \bar{x}$
Here the mean of x is 9.1 and the mean of y is 55.4, so
w1 = ((3-9.1)(30-55.4) + (8-9.1)(57-55.4) + (9-9.1)(64-55.4) + (13-9.1)(72-55.4) + (3-9.1)(36-55.4) + (6-9.1)(43-55.4) + (11-9.1)(59-55.4) + (21-9.1)(90-55.4) + (1-9.1)(20-55.4) + (16-9.1)(83-55.4)) / ((3-9.1)^2 + (8-9.1)^2 + (9-9.1)^2 + (13-9.1)^2 + (3-9.1)^2 + (6-9.1)^2 + (11-9.1)^2 + (21-9.1)^2 + (1-9.1)^2 + (16-9.1)^2)
= 1269.6/358.9 = 3.5 (rounded to one decimal place)
w0 = 55.4 - (3.5)(9.1) = 23.6

The equation is therefore y = 23.6 + 3.5x.
For x = 10, y = 23.6 + 35 = 58.6, i.e. a predicted salary of $58,600.
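The least-squares fit can be checked with a few lines of Python. The function name `fit_line` is illustrative. Because the sketch keeps full precision instead of rounding the slope to 3.5 first, its intercept (about 23.2) differs slightly from the hand-computed 23.6, while the prediction at x = 10 still rounds to 58.6.

```python
def fit_line(xs, ys):
    """Least-squares estimates (w0, w1) for the model y = w0 + w1 * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    w1 = sxy / sxx
    w0 = mean_y - w1 * mean_x
    return w0, w1


experience = [3, 8, 9, 13, 3, 6, 11, 21, 1, 16]
salary = [30, 57, 64, 72, 36, 43, 59, 90, 20, 83]   # in $1000

w0, w1 = fit_line(experience, salary)
print(round(w0, 2), round(w1, 2), round(w0 + w1 * 10, 1))
```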

25. Given the following measurements for the variable age: 18, 22, 25, 42, 28, 43, 33, 35, 56, 28. Standardize the variable as follows:
a.) Compute the mean absolute deviation of age.
b.) Compute the z-score for the first measurement.
Sol.)
a.) The mean is mf = 330/10 = 33. The mean absolute deviation sf is
$s_f = \frac{1}{n} \left( |x_{1f} - m_f| + |x_{2f} - m_f| + \dots + |x_{nf} - m_f| \right)$
= 1/10 (|18-33| + |22-33| + |25-33| + |42-33| + |28-33| + |43-33| + |33-33| + |35-33| + |56-33| + |28-33|)
= 1/10 (15 + 11 + 8 + 9 + 5 + 10 + 0 + 2 + 23 + 5)
= 8.8
b.) The z-score for the first measurement is
zif = (xif - mf)/sf = (18 - 33)/8.8 = -1.70
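The standardization in question 25 uses the mean absolute deviation rather than the standard deviation. A minimal Python sketch of that variant follows; the function names are chosen for illustration.

```python
def mean_absolute_deviation(values):
    """Average absolute distance of the values from their mean."""
    m = sum(values) / len(values)
    return sum(abs(v - m) for v in values) / len(values)


def z_score_mad(v, values):
    """Standardized measurement using the mean absolute deviation."""
    m = sum(values) / len(values)
    return (v - m) / mean_absolute_deviation(values)


ages = [18, 22, 25, 42, 28, 43, 33, 35, 56, 28]
print(mean_absolute_deviation(ages))
print(round(z_score_mad(ages[0], ages), 2))
```

The first measurement lies below the mean, so its z-score is negative.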

MEERUT INSTITUTE OF ENGINEERING AND TECHNOLOGY Department Of Computer Science & Engineering Data Mining and Data Warehousing (ECS-075) UNIT-4

Q1) What is a data warehouse? Explain its characteristics.
Sol.) Refer page 105, Jiawei Han and Micheline Kamber, Second Edition.
Q2) Differentiate between OLAP and OLTP systems.
Sol.) Refer page 109, Jiawei Han and Micheline Kamber, Second Edition.
Q3) What is a dimensional model and what are the dimensions?
Sol.) Refer page 110, Jiawei Han and Micheline Kamber, Second Edition.
Q4) Describe data warehouse architecture.
Sol.) Refer page 131, Jiawei Han and Micheline Kamber, Second Edition.
Q5) What are the differences between MOLAP and ROLAP models?
Sol.) Refer page 135, Jiawei Han and Micheline Kamber, Second Edition.
Q6) What is meant by slice and dice? Give an example.
Sol.) Refer page 284, Data Mining & Warehousing, Sunita Tiwari and Neha Chaudhary, Dhanpat Rai and Co.
Q7) How does a snowflake schema differ from a star schema?
Sol.) Refer page 114, Jiawei Han and Micheline Kamber, Second Edition.
Q8) What is the star schema? What are the fact tables?
Sol.) Refer page 116, Jiawei Han and Micheline Kamber, Second Edition.
Q9) Describe data warehouse architecture.
Sol.) Refer page 131, Jiawei Han and Micheline Kamber, Second Edition.
Q10) Briefly compare the following concepts:
a) Snowflake schema, fact constellation, starnet model
b) Data cleaning, data transformation, refresh
c) Enterprise warehouse, data mart, virtual warehouse
Sol.) Refer pages 117, 132, Jiawei Han and Micheline Kamber, Second Edition.
Q11) Explain concept hierarchy.
Sol.) Refer page 121, Jiawei Han and Micheline Kamber, Second Edition.
Q12) What is data marting and how is it different from a data warehouse?
Sol.) Refer page 132, Jiawei Han and Micheline Kamber, Second Edition.
Q13) Explain the process architecture of a data warehouse.
Sol.) Refer page 129, Jiawei Han and Micheline Kamber, Second Edition.
Q14) What are the various multidimensional data models of a data warehouse?
Sol.) Refer page 110, Jiawei Han and Micheline Kamber, Second Edition.
Q15) Explain the three-tier architecture of a data warehouse.
Sol.) Refer page 131, Jiawei Han and Micheline Kamber, Second Edition.
Q16) Explain the various steps involved in the design and construction of a data warehouse.
Sol.) Refer page 128, Jiawei Han and Micheline Kamber, Second Edition.
Q17) Give the syntax of cube and dimension definitions for the star, snowflake and fact constellation schemas, taking any example.
Sol.) Refer page 117, Jiawei Han and Micheline Kamber, Second Edition.
Q18) How can a table be represented in the form of a data cube? Explain with the help of an example.
Sol.) Refer page 112, Jiawei Han and Micheline Kamber, Second Edition.
Q19) How are measures computed?
Sol.) Refer page 119, Jiawei Han and Micheline Kamber, Second Edition.
Q20) What are the various ways of categorizing the measures?
Sol.) Refer page 119, Jiawei Han and Micheline Kamber, Second Edition.

MEERUT INSTITUTE OF ENGINEERING AND TECHNOLOGY Department Of Computer Science & Engineering Data Mining and Data Warehousing (ECS-075) UNIT-5

Q1) Give the reasons why the data warehouse must be backed up.
Sol.) Refer page 305, Data Mining & Warehousing, Sunita Tiwari and Neha Chaudhary, Dhanpat Rai and Co.
Q2) What are the various techniques of fine-tuning the data warehouse?
Sol.) Refer page 309, Data Mining & Warehousing, Sunita Tiwari and Neha Chaudhary, Dhanpat Rai and Co.
Q3) Why is data quality considered critical in a data warehouse?
Sol.) Refer page 312, Data Mining & Warehousing, Sunita Tiwari and Neha Chaudhary, Dhanpat Rai and Co.
Q4) Explain OLAP functions and tools in brief. What are the main features of OLAP servers?
Sol.) Refer page 283, Data Mining & Warehousing, Sunita Tiwari and Neha Chaudhary, Dhanpat Rai and Co.
Q5) What are the differences between MOLAP and HOLAP models?
Sol.) Refer page 292, Data Mining & Warehousing, Sunita Tiwari and Neha Chaudhary, Dhanpat Rai and Co.
Q6) What do you mean by aggregation? How does OLAP handle aggregation?
Sol.) Refer page 273, Data Mining & Warehousing, Sunita Tiwari and Neha Chaudhary, Dhanpat Rai and Co.
Q7) Write short notes on:
a) Slice and dice operation
b) Roll-up and drill-down operation
Sol.) Refer pages 284, 289, Data Mining & Warehousing, Sunita Tiwari and Neha Chaudhary, Dhanpat Rai and Co.
Q8) Write short notes on testing of data warehouses.
Sol.) Refer page 310, Data Mining & Warehousing, Sunita Tiwari and Neha Chaudhary, Dhanpat Rai and Co.
Q9) What is metadata? What is the role of metadata in a data warehouse?
Sol.) Refer page 282, Data Mining & Warehousing, Sunita Tiwari and Neha Chaudhary, Dhanpat Rai and Co.
Q10) What is the role of data mining interfaces?
Sol.) Refer page 297, Data Mining & Warehousing, Sunita Tiwari and Neha Chaudhary, Dhanpat Rai and Co.
Q11) Explain the backup and recovery process of a data warehouse.
Sol.) Refer page 305, Data Mining & Warehousing, Sunita Tiwari and Neha Chaudhary, Dhanpat Rai and Co.
Q12) Write a short note on OLAP models.
Sol.) Refer page 291, Data Mining & Warehousing, Sunita Tiwari and Neha Chaudhary, Dhanpat Rai and Co.
Q13) Explain various OLAP functions with the help of an example.
Sol.) Refer page 283, Data Mining & Warehousing, Sunita Tiwari and Neha Chaudhary, Dhanpat Rai and Co.
Q14) What is OLAP? What are its applications?
Sol.) Refer page 276, Data Mining & Warehousing, Sunita Tiwari and Neha Chaudhary, Dhanpat Rai and Co.
Q15) Explain the query management process in a data warehouse.
Sol.) Refer page 275, Data Mining & Warehousing, Sunita Tiwari and Neha Chaudhary, Dhanpat Rai and Co.
Q16) Give a few benefits of OLAP.
Sol.) Refer page 285, Data Mining & Warehousing, Sunita Tiwari and Neha Chaudhary, Dhanpat Rai and Co.
Q17) Explain with the help of an example how data mining is used in a data warehouse environment.
Sol.) Refer page 305, Data Mining & Warehousing, Sunita Tiwari and Neha Chaudhary, Dhanpat Rai and Co.
Q18) What are the challenges involved in data warehouse testing?
Sol.) Refer page 311, Data Mining & Warehousing, Sunita Tiwari and Neha Chaudhary, Dhanpat Rai and Co.
Q19) Differentiate between database testing and data warehouse testing.
Sol.) Refer page 311, Data Mining & Warehousing, Sunita Tiwari and Neha Chaudhary, Dhanpat Rai and Co.
Q20) Describe the various phases of testing data warehouses.
Sol.) Refer page 312, Data Mining & Warehousing, Sunita Tiwari and Neha Chaudhary, Dhanpat Rai and Co.