• There are different types of attributes
– Nominal
• Examples: ID numbers, eye color, zip codes
– Ordinal
• Examples: rankings (e.g., taste of potato chips on a scale from 1-10), grades, height in {tall, medium, short}
– Interval
• Examples: calendar dates, temperatures in Celsius or Fahrenheit.
– Ratio
• Examples: temperature in Kelvin, length, time, counts

• The type of an attribute depends on which of the following properties it possesses:
– Distinctness: = ≠
– Order: < >
– Addition: + -
– Multiplication: * /

– Nominal attribute: distinctness
– Ordinal attribute: distinctness & order
– Interval attribute: distinctness, order & addition
– Ratio attribute: all 4 properties
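The cumulative relationship between attribute types and properties can be written down as a small lookup. This is only a sketch: the names `PROPERTIES`, `ATTRIBUTE_TYPES`, `supported_properties` and `is_meaningful` are made up for illustration and do not come from the slides.

```python
# Properties in the order the slides introduce them.
PROPERTIES = ("distinctness", "order", "addition", "multiplication")

# Attribute types from weakest to strongest; each type adds one property.
ATTRIBUTE_TYPES = ("nominal", "ordinal", "interval", "ratio")


def supported_properties(attribute_type):
    """Return the properties an attribute of the given type possesses."""
    level = ATTRIBUTE_TYPES.index(attribute_type)  # ValueError if unknown
    return PROPERTIES[: level + 1]


def is_meaningful(attribute_type, prop):
    """True if the property (and its operations) applies to this type."""
    return prop in supported_properties(attribute_type)


# Zip codes are nominal, so ordering them carries no meaning.
zip_code_order_ok = is_meaningful("nominal", "order")
# Celsius temperatures are interval: differences make sense, ratios do not.
celsius_props = supported_properties("interval")
# Kelvin temperatures are ratio, so multiplication and division apply.
kelvin_ratio_ok = is_meaningful("ratio", "multiplication")
```

A check like `is_meaningful` could guard an analysis step, for example refusing to average a nominal attribute.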
Data Mining Lecture 2 5 Data Mining Lecture 2 6
Attribute Type: Nominal
– Description: The values of a nominal attribute are just different names, i.e., nominal attributes provide only enough information to distinguish one object from another. (=, ≠)
– Examples: zip codes, employee ID numbers, eye color, sex: {male, female}
– Operations: mode, entropy, contingency correlation, χ2 test

Attribute Level: Nominal
– Transformation: Any permutation of values
– Comments: If all employee ID numbers were reassigned, would it make any difference?
Data Matrix
• If data objects have the same fixed set of numeric attributes, then the data objects can be thought of as points in a multi-dimensional space, where each dimension represents a distinct attribute
• Such a data set can be represented by an m by n matrix, where there are m rows, one for each object, and n columns, one for each attribute

Projection of x Load   Projection of y load   Distance   Load   Thickness
10.23                  5.27                   15.22      2.7    1.2
12.65                  6.25                   16.22      2.2    1.1

Document Data
• Each document becomes a `term' vector,
– each term is a component (attribute) of the vector,
– the value of each component is the number of times the corresponding term occurs in the document.

             team  coach  play  ball  score  game  win  lost  timeout  season
Document 1    3     0      5     0     2      6     0    2     0        2
Document 2    0     7      0     2     1      0     0    3     0        0
Document 3    0     1      0     0     1      2     2    0     3        0
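The term-vector idea for document data can be illustrated with a short sketch. The function name `term_vector`, the four-term vocabulary and the sample sentence are hypothetical, and the tokenizer is deliberately naive (lowercase alphabetic runs only).

```python
import re
from collections import Counter


def term_vector(document, vocabulary):
    """Count how often each vocabulary term occurs in the document."""
    tokens = re.findall(r"[a-z]+", document.lower())
    counts = Counter(tokens)
    # A Counter returns 0 for terms that never occur.
    return [counts[term] for term in vocabulary]


# Hypothetical vocabulary and document, for illustration only.
vocabulary = ["team", "coach", "play", "ball"]
document = "The coach told the team to play; the team lost the ball."
vector = term_vector(document, vocabulary)
```

Stacking one such vector per document gives a document-term matrix with one row per document and one column per term.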
• A special type of record data, where
– each record (transaction) involves a set of items.
– For example, consider a grocery store. The set of products purchased by a customer during one shopping trip constitutes a transaction, while the individual products that were purchased are the items.
[Table: TID, Items]

• Examples: Generic graph and HTML Links
[Figure: generic graph]
<a href="papers/papers.html#bbbb">
Data Mining </a>
<li>
<a href="papers/papers.html#aaaa">
Graph Partitioning </a>
<li>
<a href="papers/papers.html#aaaa">
Parallel Solution of Sparse Linear System of Equations </a>
<li>
<a href="papers/papers.html#ffff">
N-Body Computation and Dense Linear System Solvers

[Figure: ordered data; label: An element of the sequence]
Ordered Data

• What kinds of data quality problems?
• How can we detect problems with the data?
• What can we do about these problems?

• Noise refers to modification of original values
– Examples: distortion of a person’s voice when talking on a poor phone and “snow” on television screen

• Outliers are data objects with characteristics that are considerably different than most of the other data objects in the data set

• Reasons for missing values
– Information is not collected (e.g., people decline to give their age and weight)
– Attributes may not be applicable to all cases (e.g., annual income is not applicable to children)
Duplicate Data
Data Preprocessing
Aggregation
• Purpose
– Data reduction
• Reduce the number of attributes or objects
– Change of scale
• Cities aggregated into regions, states, countries, etc
– More “stable” data
• Aggregated data tends to have less variability
[Figures: Standard Deviation of Average Monthly Precipitation; Standard Deviation of Average Yearly Precipitation]
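The claim that aggregated data tends to have less variability can be checked on synthetic numbers. The sketch below does not use the precipitation data behind the figures: the monthly values are random draws with an assumed mean of 50 and standard deviation of 15, and all variable names are illustrative.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

YEARS, MONTHS = 30, 12

# Synthetic "monthly precipitation": 30 years of 12 independent values.
monthly = [[random.gauss(50.0, 15.0) for _ in range(MONTHS)]
           for _ in range(YEARS)]

# Variability at the monthly scale (all 360 values).
all_months = [value for year in monthly for value in year]
monthly_std = statistics.stdev(all_months)

# Aggregate each year into one average, then measure variability again.
yearly_averages = [statistics.mean(year) for year in monthly]
yearly_std = statistics.stdev(yearly_averages)
```

With independent monthly values, the yearly averages vary far less than the individual months, which is the "more stable data" effect of changing scale.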
Sampling
• Sampling is the main technique employed for data selection.
– It is often used for both the preliminary investigation of the data and the final data analysis.
• Statisticians sample because obtaining the entire set of data of interest is too expensive or time consuming.
• Sampling is used in data mining because processing the entire set of data of interest is too expensive or time consuming.

• The key principle for effective sampling is the following:
– using a sample will work almost as well as using the entire data set, if the sample is representative
– A sample is representative if it has approximately the same property (of interest) as the original set of data
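To make "representative" concrete, the sketch below draws a simple random sample and compares one property of interest, the mean, against the full data set. The population is synthetic (an assumed normal distribution) and the sample size of 500 is an arbitrary choice for illustration.

```python
import random
import statistics

random.seed(1)  # fixed seed so the run is repeatable

# Synthetic "entire data set" of 10,000 values.
population = [random.gauss(100.0, 20.0) for _ in range(10_000)]

# Simple random sample without replacement.
sample = random.sample(population, 500)

population_mean = statistics.mean(population)
sample_mean = statistics.mean(sample)

# Relative gap between the sample's property and the population's.
relative_gap = abs(sample_mean - population_mean) / population_mean
```

A small `relative_gap` suggests the sample is representative for the mean; a different property of interest may need a larger sample.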
Types of Sampling
• Stratified sampling
– Split the data into several partitions; then draw random samples from each partition

[Figure: Sample Size]
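Stratified sampling (split into partitions, then sample randomly within each) might look like the sketch below. The function name `stratified_sample`, its parameters, and the choice to draw the same fraction from every partition (at least one record each) are assumptions made for illustration; other allocation schemes exist.

```python
import random
from collections import defaultdict


def stratified_sample(records, key, fraction, seed=0):
    """Draw roughly `fraction` of the records from each partition."""
    rng = random.Random(seed)

    # Step 1: split the data into partitions by the stratification key.
    partitions = defaultdict(list)
    for record in records:
        partitions[key(record)].append(record)

    # Step 2: draw a random sample from each partition.
    sample = []
    for members in partitions.values():
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample


# Hypothetical data: 90 records of class "a" and 10 of class "b".
records = [("a", i) for i in range(90)] + [("b", i) for i in range(10)]
sample = stratified_sample(records, key=lambda r: r[0], fraction=0.1)
counts = {label: sum(1 for r in sample if r[0] == label) for label in "ab"}
```

Unlike a plain random sample of 10 records, this guarantees the rare class "b" is present.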
• Goal is to find a projection that captures the largest amount of variation in data
• Find the eigenvectors of the covariance matrix
• The eigenvectors define the new space
[Figure: data plotted on axes x1 and x2 with direction e]
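The procedure above (covariance matrix, then eigenvectors) is commonly known as principal component analysis. A minimal sketch for the two-dimensional case follows; it uses the closed-form eigen-decomposition of a symmetric 2x2 matrix rather than a linear-algebra library, and the function name and sample points are hypothetical.

```python
import math


def principal_axis_2d(points):
    """Return (e, largest, smallest): the unit eigenvector of the sample
    covariance matrix with the largest eigenvalue, and both eigenvalues."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n

    # Sample covariance matrix [[sxx, sxy], [sxy, syy]].
    sxx = sum((x - mean_x) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - mean_y) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points) / (n - 1)

    # Eigenvalues of a symmetric 2x2 matrix in closed form.
    half_trace = (sxx + syy) / 2
    radius = math.hypot((sxx - syy) / 2, sxy)
    largest, smallest = half_trace + radius, half_trace - radius

    # Direction of the eigenvector belonging to the largest eigenvalue.
    angle = 0.5 * math.atan2(2 * sxy, sxx - syy)
    e = (math.cos(angle), math.sin(angle))
    return e, largest, smallest


# Hypothetical points lying close to the line x2 = x1.
points = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.8), (5.0, 5.0)]
e, largest, smallest = principal_axis_2d(points)

# Share of the total variance captured by projecting onto e.
captured = largest / (largest + smallest)
```

For these points `e` comes out close to the diagonal direction and `captured` is close to 1, meaning a single projected coordinate keeps almost all of the variation.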
Fuzzy Sets and Logic
DM: Prediction and classification are often fuzzy.

Fuzzy Sets
[Figure: Crisp Sets vs. Fuzzy Sets]
[Figure: loan decision regions (Accept, Reject) over Loan Amount and Salary; panels: 0-1 Decision, Fuzzy Decision]

Information Retrieval (IR): retrieving desired information from textual data
– Library Science
– Digital Libraries
– Web Search Engines
– Traditionally has been keyword based
– Sample query:
• Find all documents about “data mining”.
DM: Similarity measures; Mine text or Web data

IR Classification
[Figure residue: |Relevant|]
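Keyword-based retrieval, as in the sample query about “data mining”, can be sketched in a few lines. This is a toy matcher under stated assumptions: the function name `keyword_search`, the three documents and the rule "every query word must appear" are all illustrative, not a description of any real search engine.

```python
def keyword_search(documents, query):
    """Return the ids of documents that contain every query keyword."""
    keywords = query.lower().split()
    hits = []
    for doc_id, text in documents.items():
        words = set(text.lower().split())
        if all(keyword in words for keyword in keywords):
            hits.append(doc_id)
    return hits


# Hypothetical document collection.
documents = {
    "d1": "an introduction to data mining",
    "d2": "mining for gold in the hills",
    "d3": "data mining and machine learning",
}
result = keyword_search(documents, "data mining")
```

Document d2 is skipped because it mentions "mining" without "data", which shows both the appeal and the bluntness of pure keyword matching.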
Machine Learning
• Machine Learning (ML): area of AI that examines how to devise algorithms that can learn.
• Techniques from ML are often used in classification and prediction.
• Supervised Learning: learns by example.
• Unsupervised Learning: learns without knowledge of correct answers.
• Machine learning often deals with small or static datasets.

Statistics
• Usually creates simple descriptive models.
• Statistical inference: generalizing a model created from a sample of the data to the entire dataset.
• Exploratory Data Analysis:
– Data can actually drive the creation of the model.
– Opposite of traditional statistical view.
• Data mining targeted to business users.
Point Estimate: estimate a population parameter.
• May be made by calculating the parameter for a sample.
• May be used to predict values for missing data.
Ex:
– R contains 100 employees
– 99 have salary information
– Mean salary of these is $50,000
– Use $50,000 as value of remaining employee’s salary.
Is this a good idea?

Bias: Difference between expected value and actual value. Writing $\hat{\theta}$ for the estimate and $\theta$ for the actual value: $\mathrm{Bias} = E[\hat{\theta}] - \theta$
Mean Squared Error (MSE): expected value of the squared difference between the estimate and the actual value: $\mathrm{MSE}(\hat{\theta}) = E[(\hat{\theta} - \theta)^2]$
• Why square?
• Root Mean Square Error (RMSE).
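The salary example and the error measures can be tried out together. In the sketch below the 99 known salaries are invented so that their mean is exactly 50,000, and the three "repeated estimates" with an actual value of 51,000 are likewise hypothetical; the expectations in bias and MSE are approximated by averages over those estimates.

```python
import math
import statistics


def impute_with_mean(values):
    """Replace each missing value (None) with the mean of the known ones."""
    known = [v for v in values if v is not None]
    estimate = statistics.mean(known)
    return [estimate if v is None else v for v in values], estimate


def bias(estimates, actual):
    """Average estimate minus the actual value."""
    return statistics.mean(estimates) - actual


def mse(estimates, actual):
    """Average squared difference between the estimates and the actual value."""
    return statistics.mean((e - actual) ** 2 for e in estimates)


# Hypothetical relation R: 99 known salaries (mean 50,000) and one missing.
salaries = [40_000] * 33 + [50_000] * 33 + [60_000] * 33 + [None]
filled, point_estimate = impute_with_mean(salaries)

# Hypothetical repeated estimates of a value that is actually 51,000.
estimates = [48_000, 52_000, 50_000]
estimate_bias = bias(estimates, 51_000)
estimate_mse = mse(estimates, 51_000)
estimate_rmse = math.sqrt(estimate_mse)
```

Squaring keeps positive and negative errors from cancelling, and the RMSE brings the result back to the units of the data (here, dollars).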
• Maximize L.
MLE Example
Algorithm
• Obtain initial estimates for parameters.
• Iteratively use estimates for missing data and
continue refinement (maximization) of the estimate
until convergence.
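The two steps above follow the pattern usually called expectation-maximization. As a hedged illustration, the sketch below estimates a mean when some values are missing: each round fills the missing entries with the current estimate and then recomputes the mean. The function name, the observed values, the number of missing items and the initial guess are all assumptions chosen for the example.

```python
def estimate_mean_with_missing(observed, n_missing, initial_guess,
                               tol=1e-6, max_iter=1000):
    """Iteratively refine an estimate of the mean when values are missing."""
    n = len(observed) + n_missing
    estimate = initial_guess  # step 1: initial estimate of the parameter
    for _ in range(max_iter):
        # Use the current estimate in place of each missing value,
        # then re-estimate the mean over all n items.
        new_estimate = (sum(observed) + n_missing * estimate) / n
        if abs(new_estimate - estimate) < tol:  # converged
            return new_estimate
        estimate = new_estimate
    return estimate


# Hypothetical data: four observed values, four missing, initial guess 3.
result = estimate_mean_with_missing([1, 5, 10, 4], n_missing=4,
                                    initial_guess=3.0)
```

For this simple model the iteration settles on the mean of the observed values (5 here), whatever the starting guess.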
Scatter Diagram
Bayes Theorem
• Calculate P(xi|hj) and P(xi)
• Ex: P(x7|h1)=2/6; P(x4|h1)=1/6; P(x2|h1)=2/6; P(x8|h1)=1/6; and P(xi|h1)=0 for all other xi.
• Predict the class for x4:
– Calculate P(hj|x4) for all hj.
– Place x4 in class with largest value.
– Ex:
• P(h1|x4) = P(x4|h1) P(h1) / P(x4) = (1/6)(0.6)/0.1 = 1.
• x4 in class h1.

• Find model to explain behavior by creating and then testing a hypothesis about the data.
• Exact opposite of usual DM approach.
• H0 – Null hypothesis; Hypothesis to be tested.
• H1 – Alternative hypothesis.
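The Bayes computation for x4 can be reproduced with exact fractions. The numbers P(x4|h1) = 1/6, P(h1) = 0.6 and P(x4) = 0.1 are the ones from the example; the function names are illustrative, and the zero posterior for a second hypothesis `h2` is only inferred from the fact that posteriors sum to 1.

```python
from fractions import Fraction


def posterior(likelihood, prior, evidence):
    """Bayes theorem: P(h|x) = P(x|h) * P(h) / P(x)."""
    return likelihood * prior / evidence


def most_probable(posteriors):
    """Pick the hypothesis (class) with the largest posterior."""
    return max(posteriors, key=posteriors.get)


# Values from the example for hypothesis h1 and observation x4.
p_h1_given_x4 = posterior(Fraction(1, 6), Fraction(6, 10), Fraction(1, 10))

# P(h1|x4) is 1, so any other hypothesis must have posterior 0 (inferred).
posteriors = {"h1": p_h1_given_x4, "h2": Fraction(0)}
predicted_class = most_probable(posteriors)
```

Using `Fraction` avoids the floating-point rounding that `(1/6) * 0.6 / 0.1` would introduce.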
Chi Squared Statistic
Regression

• Examine the degree to which the values for two variables behave similarly.
• Correlation coefficient r:
– 1 = perfect correlation
– -1 = perfect but opposite correlation
– 0 = no correlation

• Determine similarity between two objects.
• Characteristics of a good similarity measure:
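The three reference values of the correlation coefficient r (1, -1 and 0) can be reproduced with a short sketch. The slides do not spell out the formula, so this assumes r is Pearson's coefficient, the covariance divided by the product of the standard deviations; the function name and test vectors are illustrative.

```python
import math


def correlation(xs, ys):
    """Pearson correlation coefficient r of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # The 1/(n-1) factors cancel between numerator and denominator.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


xs = [1, 2, 3, 4]
r_perfect = correlation(xs, [2, 4, 6, 8])     # values rise together
r_opposite = correlation(xs, [8, 6, 4, 2])    # one rises as the other falls
r_none = correlation(xs, [1, -1, -1, 1])      # no linear relationship
```

Note that r measures linear association only; the last pair has r = 0 even though the values follow a clear pattern.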
Twenty Questions Game
Decision Trees
• Advantages:
– Easy to understand.
– Easy to generate rules from.
• Disadvantages:
– May suffer from overfitting.
– Classify by rectangular partitioning.
– Do not easily handle nonnumeric data.
– Can be quite large – pruning is often necessary.
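Two of the points above, rule generation and rectangular partitioning, can be seen in a tiny hand-built tree. This sketch does not learn a tree from data: the tree, its attributes (`salary`, `loan_amount`) and its thresholds are hypothetical, and each test compares a single numeric attribute with a threshold, which is what makes the resulting regions rectangular.

```python
# Internal node: (attribute, threshold, subtree_if_below, subtree_otherwise).
# Leaf: a class label string.
tree = ("salary", 40_000,
        "reject",
        ("loan_amount", 100_000, "accept", "reject"))


def classify(node, record):
    """Follow threshold tests from the root down to a leaf label."""
    while not isinstance(node, str):
        attribute, threshold, below, otherwise = node
        node = below if record[attribute] < threshold else otherwise
    return node


def rules(node, conditions=()):
    """List one (condition, class) rule per root-to-leaf path."""
    if isinstance(node, str):
        return [(" and ".join(conditions) or "always", node)]
    attribute, threshold, below, otherwise = node
    return (rules(below, conditions + (f"{attribute} < {threshold}",))
            + rules(otherwise, conditions + (f"{attribute} >= {threshold}",)))


low_salary = classify(tree, {"salary": 30_000, "loan_amount": 50_000})
small_loan = classify(tree, {"salary": 50_000, "loan_amount": 50_000})
tree_rules = rules(tree)
```

Each entry of `tree_rules` reads as an if-then rule, for example `("salary >= 40000 and loan_amount < 100000", "accept")`.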