
Data Mining

Course Number: CS616BH1

GRADE: 20/20
Assignment 1
Student’s Name: Chintapalli Sri Ram

1.) Discuss whether or not each of the following activities is a data mining task.
(a) Dividing the customers of a company according to their gender.

A) No, as this is a simple segregation problem that a database query can solve.
Example: SELECT * FROM CustomersTable WHERE Gender = 'Male';
The other group can be selected in the same way, and the results joined on any required condition (such as sales, purchases, or age).
(b) Dividing the customers of a company according to their profitability.
A) No. This is a simple mathematical problem: the customers of a company are divided just by comparing profits, which in turn depend on the products they consume, the margin made on those products, and the volume of sales.
(c) Computing the total sales of a company.
A) No, as it is just a sum over all sales; that is, all the manufactured products, the items sold, and the margin at which they were sold are taken into account.
(d) Sorting a student database based on student identification numbers.
A) No. A simple sorting procedure can solve the problem.
Example: SELECT * FROM StudentTable ORDER BY sid;
(e) Predicting the outcomes of tossing a (fair) pair of dice.

A) No. Since it is given that the dice are fair, this is a probabilistic problem rather than a data mining one.
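To illustrate why this is a probability calculation rather than something learned from data, here is a minimal Python sketch that enumerates every outcome of two fair six-sided dice and computes the exact probability of each total. The function name `dice_sum_distribution` is hypothetical and is not part of the assignment.

```python
from fractions import Fraction
from itertools import product


def dice_sum_distribution(sides: int = 6) -> dict:
    """Exact probability of each total when rolling two fair dice."""
    counts = {}
    # Every ordered pair (a, b) is equally likely for fair dice.
    for a, b in product(range(1, sides + 1), repeat=2):
        counts[a + b] = counts.get(a + b, 0) + 1
    total_outcomes = sides * sides
    return {s: Fraction(c, total_outcomes) for s, c in sorted(counts.items())}


if __name__ == "__main__":
    for total, probability in dice_sum_distribution().items():
        print(total, probability)
```

No historical records are needed: the distribution follows directly from the assumption that the dice are fair.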
(f) Predicting the future stock price of a company using historical records.

A) This belongs to data mining, since it involves predicting future stock prices from historical data. For example, consider Verizon stock: a customer may estimate its value before buying in a simple way, using sales, company assets, and other historical data, while a company can refer to the same data together with private data (of which the consumer has no knowledge) that bears directly on whether the stock price will rise or whether to sell.
(g) Monitoring the heart rate of a patient for abnormalities.
A) This problem also falls within the data mining domain. Detecting an abnormality involves continuous observation of the heartbeat and reporting anything unusual.
(h) Monitoring seismic waves for earthquake activities.
A) Yes. Similar to the above, all the seismic waves are monitored over time, and an alarm is raised if any unusual wave appears. It is a data mining problem.
(i) Extracting the frequencies of a sound wave.
A) No, this is not a data mining problem.

2.) For each of the following data sets, explain whether or not data privacy is an important issue.
(a) Census data collected from 1900–1950.
A) No. Census data is published data, therefore data privacy is not the primary issue.
(b) IP addresses and visit times of Web users who visit your Website.
A) Yes, because these are private data for the users. Systems on a network are identified by their IP addresses and can communicate peer to peer, so anyone on the same network who has your IP address has many chances to hack or misuse your data.
(c) Images from Earth-orbiting satellites.
A) No, images from Earth-orbiting satellites are not private. They are used in public transport (navigation) and also help in identifying natural threats.

(d) Names and addresses of people from the telephone book.
A) No. In the past these were meant to be shared, because it is very hard to remember all the names and phone numbers; to contact someone, people used these books, which were in hard copy.
(e) Names and email addresses collected from the Web.
A) No, names and email addresses are not private data. This is very similar to a telephone book, except that the data is saved as soft copy and referred to when needed.

3) You are approached by the marketing director of a local company, who believes that he has devised a foolproof way to measure customer satisfaction. He explains his scheme as follows: “It’s so simple that I can’t believe that no one has thought of it before. I just keep track of the number of customer complaints for each product. I read in a data mining book that counts are ratio attributes, and so, my measure of product satisfaction must be a ratio attribute. But when I rated the products based on my new customer satisfaction measure and showed them to my boss, he told me that I had overlooked the obvious, and that my measure was worthless. I think that he was just mad because our best-selling product had the worst satisfaction since it had the most complaints. Could you help me set him straight?”
(a) Who is right, the marketing director or his boss? If you answered, his boss, what would you do to fix the measure of satisfaction?
A) The boss is correct, because the key factor, the number of sales, is not included in the measure of satisfaction. A best-selling product can draw the most complaints simply because it has the most customers. Therefore the appropriate measure of satisfaction would be a function as follows:
Measure = f(number of complaints for the product, total sales of the product).
(b) What can you say about the attribute type of the original product satisfaction attribute?
A) The attribute type cannot be clearly determined, because the original product satisfaction attribute depends on many determining factors, such as satisfaction level, number of complaints, etc.

4.) An educational psychologist wants to use association analysis to analyze test results. The test consists of 100 questions with four possible answers each.

(a) How would you convert this data into a form suitable for association analysis?
A) The first step of association analysis is to present the data in binary form. The binary representation of the above problem is as follows:

Student  Q1(A)  Q1(B)  Q1(C)  Q1(D)  ...  Q100(A)  Q100(B)  Q100(C)  Q100(D)
1        0      1      0      0      ...  1        0        0        0
2        1      0      0      0      ...  0        0        1        0
...

If the answer to the nth question is A then Qn(A) is 1, else it is 0.
(b) In particular, what type of attributes would you have and how many of them are there?
A) Since the chosen answer (a value of 1) is considered more important than the options that were not chosen (a value of 0), the attributes are asymmetric binary variables. There are a total of 400 variables (100 questions * 4 options).

5.) Discuss why a document-term matrix is an example of a data set that has asymmetric discrete or asymmetric continuous features.
A) A document-term matrix is an i x j matrix in which the ijth entry is the number of times the ith term appears in the jth document. Since most terms do not appear in a particular document, most entries are zero, and only the non-zero entries are considered to be important. Thus it is a data set that has asymmetric discrete features. If normalization is performed on this matrix so that each document vector has an L2 norm of 1, the matrix will have continuous features. This transformation does not create non-zero entries for values that were previously zero, and those zero entries still do not carry any meaning; therefore the matrix possesses asymmetric continuous features.
What is the L2 norm: the L2 norm of a vector x is $\|x\|_2 = \sqrt{\sum_i x_i^2}$. It is related to least squares, which minimizes the sum of the squares of the differences (D) between the target and estimated values.
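The effect of L2 normalization described in question 5 can be sketched in a few lines of Python. This is only an illustration: the count matrix is made up, the helper name `l2_normalize_rows` is hypothetical, and for readability it stores one document per row (an assumption of this sketch, not something stated in the answer).

```python
import math


def l2_normalize_rows(matrix):
    """Scale each row (one document) to unit L2 norm.

    All-zero rows are returned unchanged to avoid dividing by zero.
    """
    normalized = []
    for row in matrix:
        norm = math.sqrt(sum(x * x for x in row))
        normalized.append([x / norm for x in row] if norm > 0 else list(row))
    return normalized


# Hypothetical term counts: rows are documents, columns are terms.
counts = [
    [3, 0, 4, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
]
normalized = l2_normalize_rows(counts)
```

After normalization the non-zero counts become real-valued (continuous), while every entry that was zero stays zero, which is why the features remain asymmetric.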

The least-squares quantity D is $D = \sum_{i=1}^{n} (t_i - E_i)^2$, where $t_i$ is the target value and $E_i$ is the estimated value.

6.) Consider a document-term matrix, where $tf_{ij}$ is the frequency of the ith word (term) in the jth document and m is the number of documents. Consider the variable transformation that is defined by
$tf'_{ij} = tf_{ij} \cdot \log\frac{m}{df_i}$
where $df_i$ is the number of documents in which the ith term appears and is known as the document frequency of the term. This transformation is known as the inverse document frequency transformation.
(a) What is the effect of this transformation if a term occurs in one document? In every document?
A) a) In one document: i.e. $df_i$ is 1, then the transformation has its maximum weight, $\log(m/1) = \log m$.
b) In every document: i.e. $df_i$ is m, then the transformation has zero value, $\log(m/m) = \log 1 = 0$.
(b) What might be the purpose of this transformation?
A) The above shows that a term which appears in only one document has the maximum value, while a term which appears in all the documents has zero value. This reflects that a term which appears in all the documents does not play as crucial a role in distinguishing documents as one which appears in only certain documents.

Criterion: D | C | B | A
Correctness: No justification of correctness | Explanation justifies, mostly correct | Explanation justifies correctness well | Explanation justifies correctness extremely well, complete and thorough justification
Clarity: Unclear | Explained, somewhat clear | Every point clearly specified, commented, clear | Every point precisely specified, thoroughly commented, entirely clear
Understanding: Minor understanding evidenced | Satisfactory understanding evidenced | Evidence of good understanding throughout | Evidence throughout of entirely thorough understanding

Note: Nicely done!
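As an illustration of the inverse document frequency transformation from question 6, the following Python sketch applies it to a small made-up matrix. It assumes the transformation has the standard form tf'_ij = tf_ij * log(m / df_i), which is consistent with the log m and log 1 values in the answer; the function name `idf_transform` and the sample counts are hypothetical.

```python
import math


def idf_transform(tf):
    """Apply tf'_ij = tf_ij * log(m / df_i) (assumed form of the transformation).

    tf[i][j] is the count of term i in document j; m is the number of documents.
    """
    m = len(tf[0])
    result = []
    for row in tf:
        # Document frequency: number of documents in which this term appears.
        df = sum(1 for count in row if count > 0)
        weight = math.log(m / df) if df > 0 else 0.0
        result.append([count * weight for count in row])
    return result


# Hypothetical counts for 3 terms (rows) across 4 documents (columns).
tf = [
    [2, 0, 0, 0],  # appears in one document
    [1, 3, 1, 2],  # appears in every document
    [0, 1, 0, 4],  # appears in two documents
]
weighted = idf_transform(tf)
```

In this sketch the term found in a single document is scaled by log 4, the largest possible weight for four documents, while the term found in every document is scaled to zero.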