Data Mining

Course Number: CS616BH1

GRADE: 20/20
Assignment 1
Student’s Name: Chintapalli Sri Ram

1.) Discuss whether or not each of the following activities is a data mining task.
(a) Dividing the customers of a company according to their gender.

A) No. This is a simple database query that segregates customers on a known attribute.
Example: SELECT * FROM CustomersTable WHERE Gender = 'Male';
The other group can be selected in the same way, and the results joined on any other
required attributes (such as sales, purchases, or age).
(b) Dividing the customers of a company according to their profitability.
A) No. This is a simple arithmetic calculation: customers are divided by comparing
the profit each one generates, which depends on the products they buy, the margin
made on those products, and the volume of sales.
(c) Computing the total sales of a company.
A) No. It is just a sum over all sales, that is, over all manufactured items that
were sold, at whatever margin applies to each.
(d) Sorting a student database based on student identification numbers.
A) No. A simple sorting procedure can solve the problem.

Example: SELECT * FROM StudentTable ORDER BY sid;
(e) Predicting the outcomes of tossing a (fair) pair of dice.

A) No. Since the dice are given to be fair, this is a probability calculation rather than a data mining task.
(f) Predicting the future stock price of a company using historical records.

A) Yes. This is a data mining task, since it involves predicting future stock
prices from historical data. For example, consider Verizon stock: a customer
may estimate its value before buying from sales figures, company assets, and
other historical data, while a company may use the same data together with
private data (which the consumer knows nothing about) that is directly related
to whether the stock price will rise, in order to decide whether to buy or sell.
(g) Monitoring the heart rate of a patient for abnormalities.

A) Yes. It is a data mining problem. Detecting an abnormality involves continuous observation of the heartbeat and reporting anything unusual.
(h) Monitoring seismic waves for earthquake activities.

A) Yes. This problem also falls in the data mining domain. Similar to the above, all the seismic waves are monitored continuously, and an alarm is raised if any unusual wave appears.
(i) Extracting the frequencies of a sound wave.

A) No. This is not a data mining problem.

2.) For each of the following data sets, explain whether or not data privacy is an important issue.
(a) Census data collected from 1900–1950.
A) No. Census data is published data, therefore data privacy is not the primary issue.
(b) IP addresses and visit times of Web users who visit your Website.
A) Yes, because these are private data for the users. Systems on a network are identified by their IP addresses and communicate with one another directly, so if someone else on the same network has your IP address there is a real chance that your data will be hacked or misused.
(c) Images from Earth-orbiting satellites.
A) No. Images from Earth-orbiting satellites are not private, because they are used in public transport (navigation) and also help in identifying natural threats.

(d) Names and addresses of people from the telephone book.
A) No. Telephone books were meant to be shared: in earlier days it was very hard to remember all the names and phone numbers, so people used these hardcopy books to contact someone.
(e) Names and email addresses collected from the Web.
A) No. Names and email addresses are not private data. This is very similar to a telephone book, except that the data is saved and looked up as a softcopy.

3) You are approached by the marketing director of a local company, who believes that he has devised a foolproof way to measure customer satisfaction. He explains his scheme as follows: “It’s so simple that I can’t believe that no one has thought of it before. I just keep track of the number of customer complaints for each product. I read in a data mining book that counts are ratio attributes, and so, my measure of product satisfaction must be a ratio attribute. But when I rated the products based on my new customer satisfaction measure and showed them to my boss, he told me that I had overlooked the obvious, and that my measure was worthless. I think that he was just mad because our best-selling product had the worst satisfaction since it had the most complaints. Could you help me set him straight?”
(a) Who is right, the marketing director or his boss? If you answered, his boss, what would you do to fix the measure of satisfaction?
A) The boss is right, because the key factor, the number of sales, is not included in the measure of satisfaction. An appropriate measure of satisfaction would therefore be a function of both quantities:
Measure = f(number of complaints for the product, total sales of the product).
(b) What can you say about the attribute type of the original product satisfaction attribute?
A) The attribute type cannot be determined clearly, because the original product satisfaction attribute depends on many factors, such as satisfaction level and number of complaints.

4.) An educational psychologist wants to use association analysis to analyze test results. The test consists of 100 questions with four possible answers each.

(a) How would you convert this data into a form suitable for association analysis?
A) The first step in association analysis is to present the data in binary form. The binary representation of the above problem is as follows:

Student  Q1(A)  Q1(B)  Q1(C)  Q1(D)  ...  Q100(A)  Q100(B)  Q100(C)  Q100(D)
1        0      1      0      0      ...  1        0        0        0
2        1      0      0      0      ...  0        0        1        0
...

If the answer to the nth question is A, then Qn(A) is 1; otherwise it is 0.
(b) In particular, what type of attributes would you have and how many of them are there?
A) Since the answer that was chosen is considered more important than the options that were not, the attributes are asymmetric binary variables. There are 400 of them in total (100 questions * 4 options).

5.) Discuss why a document-term matrix is an example of a data set that has asymmetric discrete or asymmetric continuous features.
A) A document-term matrix is a matrix in which the ij-th entry is the number of times the i-th term appears in the j-th document. Because each entry counts the occurrences of a term in a particular document, only the non-zero entries are considered important. Thus it is a data set that has asymmetric discrete features. If the matrix is normalized so that each document has an L2 norm of 1, the entries become continuous. This transformation does not create non-zero entries where the value was previously zero, and the zero entries still carry no meaning, so the matrix then has asymmetric continuous features.
Note on the L2 norm: the L2 norm of a vector $x$ is its Euclidean length, $\|x\|_2 = \sqrt{\sum_k x_k^2}$. Least squares is a related idea: it minimizes a sum of squared differences between target and estimated values.
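The claim in question 5 that L2 normalization keeps the zero entries at zero can be checked with a small sketch. The count matrix below is made up for illustration, the function name `l2_normalize_columns` is hypothetical, and the layout (rows are terms, columns are documents) is an assumption chosen to match the indexing used in the answer.

```python
import math

def l2_normalize_columns(matrix):
    """Scale each column (one document) of a term-by-document matrix to unit L2 norm."""
    n_terms = len(matrix)
    n_docs = len(matrix[0])
    # L2 norm of each document vector: square root of the sum of squared counts.
    norms = [
        math.sqrt(sum(matrix[i][j] ** 2 for i in range(n_terms)))
        for j in range(n_docs)
    ]
    # Divide every entry by its document's norm; an all-zero document stays all zero.
    return [
        [matrix[i][j] / norms[j] if norms[j] > 0 else 0.0 for j in range(n_docs)]
        for i in range(n_terms)
    ]

# Hypothetical counts: rows are terms, columns are documents.
counts = [
    [3, 0, 1],
    [0, 2, 0],
    [4, 0, 0],
]
normalized = l2_normalize_columns(counts)

for row in normalized:
    print(row)
```

Every entry that was zero in `counts` is still zero in `normalized`, while the non-zero counts become fractional values, which is why the features change from discrete to continuous but remain asymmetric.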

The least-squares sum of squared differences between target values $t_k$ and estimated values $E_k$ is $D = \sum_{k=1}^{n} (t_k - E_k)^2$.

6.) Consider a document-term matrix, where tfij is the frequency of the ith word (term) in the jth document and m is the number of documents. Consider the variable transformation that is defined by
$tf'_{ij} = tf_{ij} \times \log\frac{m}{df_i}$
where dfi is the number of documents in which the ith term appears and is known as the document frequency of the term. This transformation is known as the inverse document frequency transformation.
(a) What is the effect of this transformation if a term occurs in one document? In every document?
A) a) In one document: dfi is 1, so the log factor takes its maximum value, log(m/1) = log m.
b) In every document: dfi is m, so the log factor is log(m/m) = log(1) = 0 and the transformed value is zero.
(b) What might be the purpose of this transformation?
A) The result above shows that a term which appears in only one document receives the maximum weight, while a term which appears in all the documents receives zero weight. This reflects the fact that a term which appears in every document plays a much smaller role in distinguishing documents from one another than a term which appears in only certain documents.

Criterion | D | C | B | A
Correctness | No justification of correctness | Explanation justifies, mostly correct | Explanation justifies correctness well | Explanation justifies correctness extremely well, complete and thorough justification
Clarity | Unclear | Explained, somewhat clear | Every point clearly specified, well commented, clear | Every point precisely specified, thoroughly commented, entirely clear
Understanding | Minor understanding evidenced | Satisfactory understanding evidenced | Evidence of good understanding throughout | Evidence throughout of entirely thorough understanding

Note: Nicely done!
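As a small illustration of the inverse document frequency transformation from question 6, the sketch below applies tf'ij = tfij * log(m / dfi) to a made-up matrix. The function name `idf_transform`, the sample counts, and the use of the natural logarithm are assumptions for illustration only; rows are terms and columns are documents, as in the question.

```python
import math

def idf_transform(tf):
    """Apply tf'[i][j] = tf[i][j] * log(m / df_i), where tf[i][j] is the
    frequency of term i in document j and m is the number of documents."""
    m = len(tf[0])
    transformed = []
    for row in tf:
        # Document frequency: number of documents in which this term appears.
        df = sum(1 for count in row if count > 0)
        # A term that appears nowhere gets weight 0 to avoid dividing by zero.
        weight = math.log(m / df) if df > 0 else 0.0
        transformed.append([count * weight for count in row])
    return transformed

# Hypothetical term frequencies for 3 terms across m = 4 documents.
tf = [
    [2, 0, 0, 0],  # appears in one document: weight log(4/1) = log 4
    [1, 3, 2, 1],  # appears in every document: weight log(4/4) = 0
    [0, 1, 1, 0],  # appears in half the documents: weight log(4/2) = log 2
]
transformed = idf_transform(tf)

for row in transformed:
    print(row)
```

The second term, which occurs in every document, is zeroed out entirely, while the first term, which occurs in a single document, is scaled by the largest possible factor, log m.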
