Data Mining

Course Number: CS616BH1

GRADE: 20/20
Assignment 1
Student’s Name: Chintapalli Sri Ram

1.) Discuss whether or not each of the following activities is a data mining task.
(a) Dividing the customers of a company according to their gender.

A) No, as this is a simple segregation problem that a database query can solve.
Example: SELECT * FROM CustomersTable
WHERE Gender = 'Male';
The other group is selected in the same way, and either result can then be joined with other tables on the required condition (such as sales, purchases, or age).
(b) Dividing the customers of a company according to their profitability.
A) No. This is a simple mathematical problem: the customers are divided just by
comparing the profit each one brings in, which depends on the products they
consume, the margin made on those products, and the sales volume.
(c) Computing the total sales of a company.
A) No, as it is just a sum of all sales, that is, of every manufactured product
that was sold and the price (including margin) at which it was sold.
(d) Sorting a student database based on student identification numbers.
A) No. A simple sorting procedure can solve the problem.

Example: SELECT * FROM StudentTable
ORDER BY sid;
(e) Predicting the outcomes of tossing a (fair) pair of dice.

A) No. Since the dice are given to be fair, this is a probability calculation rather than a data mining task.
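To illustrate why fair dice call for a probability calculation rather than mining, the exact distribution of the total of two dice can be computed by enumerating all outcomes. This is only a sketch: the function name `two_dice_sum_distribution` and the choice to look at the sum of the two dice are assumptions made for the example, not part of the assignment.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def two_dice_sum_distribution(sides=6):
    """Exact probability of each total when rolling two fair dice (hypothetical helper)."""
    # Every ordered pair of faces is equally likely for fair dice.
    outcomes = list(product(range(1, sides + 1), repeat=2))
    counts = Counter(a + b for a, b in outcomes)
    return {total: Fraction(n, len(outcomes)) for total, n in sorted(counts.items())}


distribution = two_dice_sum_distribution()
for total, probability in distribution.items():
    print(total, probability)
```

No historical data is needed here: the probabilities follow directly from the assumption that every face is equally likely.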
(f) Predicting the future stock price of a company using historical records.

A) Yes, this belongs to data mining, since it involves predicting future stock prices from historical data. For example, consider Verizon stock: a customer deciding whether to buy may estimate its value in a simple way from sales, company assets, and other historical data, while a company will use the same data plus private data (of which the consumer has no knowledge) that bears directly on whether the stock price will rise and whether to buy or sell.
(g) Monitoring the heart rate of a patient for abnormalities.
A) This problem also falls in the data mining domain. Detecting an abnormality involves continuous observation of the heartbeat and reporting if anything unusual happens.
(h) Monitoring seismic waves for earthquake activities.
A) Yes. It is a data mining problem. Similar to the above, all the seismic waves are monitored at all times. If any unusual wave appears then an alarm is raised.
(i) Extracting the frequencies of a sound wave.
A) No, this is not a data mining problem.

2.) For each of the following data sets, explain whether or not data privacy is an important issue.
(a) Census data collected from 1900–1950.
A) No. Census data is published data, therefore data privacy is not the primary issue.
(b) IP addresses and visit times of Web users who visit your Website.
A) Yes, because these are private data of the users. Systems on a network are identified by their IP addresses and communicate peer to peer, so anyone on the same network who has your IP address has many opportunities to hack or misuse your data.
(c) Images from Earth-orbiting satellites.
A) No. Images from Earth-orbiting satellites are not private, because they are used for public purposes such as transport (navigation) and they also help in identifying natural threats.

(d) Names and addresses of people from the telephone book.
A) No. Telephone books were meant to be shared: because it is very hard to remember all the names and phone numbers, people used these printed books to look up how to contact someone.
(e) Names and email addresses collected from the Web.
A) No. Names and email addresses are not private data. This is very similar to the telephone book, except that the data is saved and looked up as a soft copy.

3.) You are approached by the marketing director of a local company, who believes that he has devised a foolproof way to measure customer satisfaction. He explains his scheme as follows: “It’s so simple that I can’t believe that no one has thought of it before. I just keep track of the number of customer complaints for each product. I read in a data mining book that counts are ratio attributes, and so, my measure of product satisfaction must be a ratio attribute. But when I rated the products based on my new customer satisfaction measure and showed them to my boss, he told me that I had overlooked the obvious, and that my measure was worthless. I think that he was just mad because our best-selling product had the worst satisfaction since it had the most complaints. Could you help me set him straight?”
(a) Who is right, the marketing director or his boss? If you answered, his boss, what would you do to fix the measure of satisfaction?
A) The boss is correct, because the key factor, the number of sales, is not included in the measure of satisfaction. Therefore the appropriate measure of satisfaction would be a function as follows:
Measure = f(number of complaints for the product, total sales of the product).
(b) What can you say about the attribute type of the original product satisfaction attribute?
A) The attribute type cannot be clearly determined, because the original product satisfaction attribute depends on many determining factors such as satisfaction level, number of complaints, etc.

4.) An educational psychologist wants to use association analysis to analyze test results. The test consists of 100 questions with four possible answers each.

(a) How would you convert this data into a form suitable for association analysis?
A) The first step in association analysis is to present the data in binary form. The binary representation of the above problem is as follows:

      Q1(A)  Q1(B)  Q1(C)  Q1(D)  ...  Q100(A)  Q100(B)  Q100(C)  Q100(D)
1       0      1      0      0    ...     1        0        0        0
2       1      0      0      0    ...     0        0        1        0
...

If the answer for the nth question is A then Qn(A) is 1, else it is 0.
(b) In particular, what type of attributes would you have and how many of them are there?
A) Since the selected answer is considered more important than the other options, the attributes are asymmetric binary variables. There are a total of 400 variables (100 questions * 4 options).

5.) Discuss why a document-term matrix is an example of a data set that has asymmetric discrete or asymmetric continuous features.
A) A document-term matrix is an i x j matrix in which the ijth entry is the number of times the ith term appears in the jth document. Because each entry counts how often a term appears in a particular document, the nonzero entries are the ones considered important, while a zero entry only says that the term is absent. Thus it is a data set that has asymmetric discrete features. If the matrix is normalized so that each document has an L2 norm of 1, the entries become continuous. This transformation does not create nonzero entries where the value was previously zero, and the zero entries still carry no meaning, therefore the normalized matrix has asymmetric continuous features.
Note on the L2 norm: the L2 norm of a vector is the square root of the sum of the squares of its entries. Least squares uses the same idea: it minimizes the sum of the squared differences D between the target (t) and estimated (E) values, $D = \sum_{k=1}^{n} (t_k - E_k)^2$.
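The claim that L2 normalization turns counts into continuous values without creating new nonzero entries can be checked on a small example. The sketch below is illustrative only: the helper `l2_normalize_rows` and the counts are made up, and for convenience each inner list holds one document's term counts (an assumed layout, so that one list is one document vector).

```python
import math


def l2_normalize_rows(matrix):
    """Scale each row (one document vector) to unit L2 norm; all-zero rows stay unchanged."""
    normalized = []
    for row in matrix:
        norm = math.sqrt(sum(value * value for value in row))
        if norm > 0:
            normalized.append([value / norm for value in row])
        else:
            normalized.append(list(row))
    return normalized


# Hypothetical term counts: each inner list is one document, each position one term.
counts = [
    [3, 0, 4, 0],
    [0, 2, 0, 0],
    [1, 0, 0, 2],
]
normalized = l2_normalize_rows(counts)
for row in normalized:
    print(row)
```

Every position that held a zero count is still zero after the scaling, while the nonzero counts become fractional values, which is why the features stay asymmetric but are now continuous.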

6.) Consider a document-term matrix, where $tf_{ij}$ is the frequency of the ith word (term) in the jth document and m is the number of documents. Consider the variable transformation that is defined by

$tf'_{ij} = tf_{ij} \cdot \log(m / df_i)$

where $df_i$ is the number of documents in which the ith term appears and is known as the document frequency of the term. This transformation is known as the inverse document frequency transformation.
(a) What is the effect of this transformation if a term occurs in one document? In every document?
A) a) In one document: here $df_i$ is 1, so the log factor takes its maximum value, log(m/1) = log m.
b) In every document: here $df_i$ is m, so the transformed value is zero, since log(m/m) = log(1) = 0.
(b) What might be the purpose of this transformation?
A) The above shows that a term which appears in only one document receives the maximum weight, while a term which appears in all the documents receives zero weight. This reflects that a term which appears in every document plays a less important role in distinguishing documents than a term which appears in only certain documents.

Grading rubric (grades D to A):
Correctness: D = No justification of correctness; C = Explanation justifies, mostly correct; B = Explanation justifies correctness well; A = Explanation justifies correctness extremely well, complete and thorough justification.
Clarity: D = Unclear; C = Explained, somewhat clear; B = Every point clearly specified, well commented, clear; A = Every point precisely specified, thoroughly commented, entirely clear.
Understanding: D = Minor understanding evidenced; C = Satisfactory understanding evidenced; B = Evidence of good understanding throughout; A = Evidence throughout of entirely thorough understanding.

Note: Nicely done!
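The two cases worked out in question 6 can be reproduced numerically. The sketch below assumes the transformation has the standard form tf'_ij = tf_ij * log(m / df_i), which agrees with the log(m/1) and log(m/m) values in the answer. The function name `idf_transform` and the counts are hypothetical, and tf[i][j] follows the question's indexing (term i, document j).

```python
import math


def idf_transform(tf):
    """Apply tf'_ij = tf_ij * log(m / df_i), where tf[i][j] is the count of term i in document j."""
    m = len(tf[0])  # number of documents (columns)
    transformed = []
    for row in tf:
        # Document frequency: number of documents in which this term appears.
        df = sum(1 for count in row if count > 0)
        weight = math.log(m / df) if df > 0 else 0.0
        transformed.append([count * weight for count in row])
    return transformed


# Hypothetical counts: 3 terms (rows) across 4 documents (columns).
tf = [
    [2, 0, 0, 0],  # term appears in one document
    [1, 3, 2, 1],  # term appears in every document
    [0, 1, 1, 0],  # term appears in two documents
]
weighted = idf_transform(tf)
for row in weighted:
    print(row)
```

The term found in a single document keeps its count scaled by log m, the term found in every document is zeroed out, and the term found in half of the documents falls in between.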