Data Mining

Course Number: CS616BH1

GRADE: 20/20
Assignment 1
Student’s Name: Chintapalli Sri Ram

1.) Discuss whether or not each of the following activities is a data mining task.
(a) Dividing the customers of a company according to their gender.

A) No, as this is a simple segregation problem.
Example: select * from CustomersTable
where Gender = 'Male' ... select the other gender in the same way,
and join the results on any further condition required (such as sales, purchases, age).
(b) Dividing the customers of a company according to their profitability.
A) No. This is a simple mathematical problem: the customers are divided
just by comparing their profits, which depend on the products they
consume, the margin made on those products, and the sales they generate.
(c) Computing the total sales of a company.
A) No, as it is just a sum of the total sales. That is, all the manufactured
products and sold items, and the margin at which they were sold, are considered.
(d) Sorting a student database based on student identification numbers.
A) No. A simple sorting procedure can solve the problem.

Example: select * from StudentTable
order by sid
(e) Predicting the outcomes of tossing a (fair) pair of dice.

A) No. Since the dice are given to be fair, this is a probability calculation rather than a data mining problem.
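The claim that a fair pair of dice calls for probability rather than mining can be illustrated with a minimal sketch that enumerates all 36 equally likely outcomes. The function name `sum_distribution` is an illustrative choice, not something from the assignment.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def sum_distribution(sides=6):
    """Exact distribution of the sum of two fair dice, derived by enumeration."""
    # Every ordered pair (a, b) is equally likely, so counting pairs is enough.
    counts = Counter(a + b for a, b in product(range(1, sides + 1), repeat=2))
    total = sides * sides
    return {s: Fraction(c, total) for s, c in sorted(counts.items())}


dist = sum_distribution()
print(dist[7])  # prints 1/6, the most likely sum
```

No historical data is needed: the whole distribution follows from the fairness assumption.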
(f) Predicting the future stock price of a company using historical records.

A) This belongs to data mining, since it involves predicting the future stock
prices from the historical data. Example: consider Verizon stock. A customer
buying it estimates its value in a simple way from sales, company assets,
and other historical data, while a company refers to the same data plus
private data (of which the consumer has no idea), which bear directly on
whether the stock price will rise and on whether to buy or sell.
(g) Monitoring the heart rate of a patient for abnormalities.

A) Yes. It is a data mining problem. Detecting an abnormality involves
continuous observation of the heartbeat and reporting if anything unusual happens.
(h) Monitoring seismic waves for earthquake activities.

A) This problem also comes into the data mining domain. Similar to the above,
all the seismic waves are monitored over time. If any unusual wave appears
then an alarm is raised.
(i) Extracting the frequencies of a sound wave.

A) No, this is not a data mining problem.

2.) For each of the following data sets, explain whether or not data privacy is an important issue.
(a) Census data collected from 1900–1950.

A) No. Census data is published data, therefore data privacy is not the primary issue.
(b) IP addresses and visit times of Web users who visit your Website.

A) Yes, because these are private data for the users. Systems on a network
are identified by their IP addresses and communicate peer to peer, so if
others in the same network have your IP address there is a good chance
that your data can be hacked or misused.
(c) Images from Earth-orbiting satellites.

A) No. Images from Earth-orbiting satellites are not private, because they
are used in public transport (navigation) and also help in identifying
natural threats.

(d) Names and addresses of people from the telephone book.

A) No. Telephone books were meant to be shared: because it was very hard to
remember all the names and phone numbers, people used these printed books
to contact someone.
(e) Names and email addresses collected from the Web.

A) No, names and email addresses are not private data. This is very similar to
the telephone book, except that we save the data as a soft copy and refer to it.

3.) You are approached by the marketing director of a local company, who
believes that he has devised a foolproof way to measure customer satisfaction.
He explains his scheme as follows: “It’s so simple that I can’t believe that
no one has thought of it before. I just keep track of the number of customer
complaints for each product. I read in a data mining book that counts are
ratio attributes, and so, my measure of product satisfaction must be a ratio
attribute. But when I rated the products based on my new customer
satisfaction measure and showed them to my boss, he told me that I had
overlooked the obvious, and that my measure was worthless. I think that he
was just mad because our best-selling product had the worst satisfaction
since it had the most complaints. Could you help me set him straight?”
(a) Who is right, the marketing director or his boss? If you answered, his
boss, what would you do to fix the measure of satisfaction?

A) The boss is correct, because the key factor, the number of sales, is not
included in the measure of satisfaction. Therefore an appropriate measure of
satisfaction would be a function of the form
Measure = f(number of complaints for the product, total sales of the product).
(b) What can you say about the attribute type of the original product satisfaction attribute?

A) The attribute type cannot be clearly determined, because the original
product satisfaction attribute depends on many factors, such as satisfaction
level, number of complaints, etc.

4.) An educational psychologist wants to use association analysis to analyze
test results. The test consists of 100 questions with four possible answers each.

(a) How would you convert this data into a form suitable for association analysis?

A) The first step of association analysis is to present the data in binary form.
The binary representation of the above problem is as follows:

Test taker:   1    2    ...
Q1(A)         0    1
Q1(B)         1    0
Q1(C)         0    0
Q1(D)         0    0
...
Q100(A)       1    0
Q100(B)       0    0
Q100(C)       0    1
Q100(D)       0    0

If the answer for the nth question is A then Qn(A) is 1, else it is 0.
(b) In particular, what type of attributes would you have and how many of them are there?

A) Since the selected answer (value 1) is considered more important than the
options not selected, the attributes are asymmetric binary variables. There
are a total of 400 variables (100 questions * 4 options).

5.) Discuss why a document-term matrix is an example of a data set that has
asymmetric discrete or asymmetric continuous features.

A) A document-term matrix is an i x j matrix in which the ijth entry is the
number of times the ith term appears in the jth document. Since each entry
represents the number of times a term appears in a particular document, only
the non-zero entries are considered important. Thus it is a data set that has
asymmetric discrete features. If the matrix is normalized to have an L2 norm
of 1, its features become continuous. Since this transformation does not
create non-zero entries where the value was previously zero, and the zero
entries still carry no meaning, the matrix then has asymmetric continuous
features.

What is the L2 norm: the L2 norm of a vector is the square root of the sum of
its squared entries. It underlies least squares, which minimizes the sum of
the squares of the differences (D) between the target and estimated values:
$D = \sum_{i=1}^{n} (t_i - E_i)^2$

6.) Consider a document-term matrix, where $tf_{ij}$ is the frequency of the
ith word (term) in the jth document and $m$ is the number of documents.
Consider the variable transformation that is defined by
$tf'_{ij} = tf_{ij} \cdot \log\frac{m}{df_i}$
where $df_i$ is the number of documents in which the ith term appears and is
known as the document frequency of the term. This transformation is known as
the inverse document frequency transformation.
(a) What is the effect of this transformation if a term occurs in one document? In every document?

A) a) In one document, i.e. $df_i$ is 1: the factor takes its maximum value,
$\log(m/1) = \log m$.
b) In every document, i.e. $df_i$ is $m$: the factor is zero,
$\log(m/m) = \log 1 = 0$.
(b) What might be the purpose of this transformation?

A) The above shows that a term which appears in only one document gets the
maximum weight, while a term which appears in all the documents gets zero
weight. This reflects that a term which appears in all the documents plays a
less crucial role in distinguishing documents than one which appears in only
certain documents.

Grading rubric

Correctness
D: No justification of correctness
C: Explanation justifies, mostly correct
B: Explanation justifies correctness well
A: Explanation justifies correctness extremely well, complete and thorough justification

Clarity
D: Unclear
C: Explained, somewhat clear
B: Every point clearly specified, commented, clear
A: Every point precisely specified, thoroughly commented, entirely clear

Understanding
D: Minor understanding evidenced
C: Satisfactory understanding evidenced
B: Evidence of good understanding throughout
A: Evidence throughout of entirely thorough understanding

Note: Nicely done!
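The inverse document frequency transformation discussed in question 6 can be sketched as follows. This is a minimal illustration, assuming the transformation has the form tf'_ij = tf_ij * log(m / df_i) and using a small made-up term-by-document count matrix; the function name `idf_transform` and the sample counts are hypothetical.

```python
import math


def idf_transform(tf):
    """Apply tf'_ij = tf_ij * log(m / df_i) to a term-by-document count matrix.

    tf[i][j] is the count of term i in document j (rows are terms, columns
    are documents, matching the indexing used in question 6).
    """
    m = len(tf[0])  # number of documents
    result = []
    for row in tf:
        df = sum(1 for count in row if count > 0)  # document frequency of the term
        # A term that appears in no document keeps an all-zero row.
        weight = math.log(m / df) if df > 0 else 0.0
        result.append([count * weight for count in row])
    return result


# Hypothetical counts: 3 terms x 4 documents.
tf = [
    [2, 0, 0, 0],  # term in exactly one document -> weight log(4)
    [1, 3, 2, 1],  # term in every document       -> weight log(1) = 0
    [0, 1, 0, 4],  # term in two documents        -> weight log(2)
]
weighted = idf_transform(tf)
```

The second row becomes all zeros, while entries that were zero before the transformation stay zero, which is why the transformed matrix remains asymmetric.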

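The binary encoding described in question 4 can be sketched as below. This is only an illustration under stated assumptions: the helper `binarize_answers`, the attribute naming, and the sample answers are hypothetical and not part of the assignment.

```python
OPTIONS = ("A", "B", "C", "D")


def binarize_answers(answers):
    """Map one test taker's answers to asymmetric binary attributes.

    answers[n] is the option chosen for question n + 1. The result has one
    0/1 attribute per (question, option) pair, named like "Q1(A)".
    """
    row = {}
    for n, chosen in enumerate(answers, start=1):
        for option in OPTIONS:
            row[f"Q{n}({option})"] = 1 if chosen == option else 0
    return row


# Hypothetical test taker who answers B on odd questions and C on even ones.
answers = ["B" if n % 2 else "C" for n in range(1, 101)]
row = binarize_answers(answers)
print(len(row), sum(row.values()))  # prints 400 100
```

Each test taker yields 400 attributes of which exactly 100 are set to 1, one per question.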