Introduction To Data Mining With R: Yanchang Zhao

Introduction to Data Mining with R1
Yanchang Zhao
http://www.RDataMining.com
Statistical Modelling and Computing Workshop at Geoscience Australia
8 May 2015
1
Presented at AusDM 2014 (QUT, Brisbane) in Nov 2014, at Twitter (US) in Oct 2014, at UJAT (Mexico) in
Sept 2014, and at University of Canberra in Sept 2013
1 / 44
Questions
Do you know data mining and its algorithms and techniques?
2 / 44
Questions
Have you heard of R?
2 / 44
Questions
Have you heard of R?
Have you ever used R in your work?
2 / 44
Outline
Introduction
Classification with R
Clustering with R
Association Rule Mining with R
Text Mining with R
Time Series Analysis with R
Social Network Analysis with R
R and Big Data
Online Resources
3 / 44
What is R?
I
R 2 is a free software environment for statistical computing

and graphics.
R can be easily extended with 6,600+ packages available on

CRAN3 (as of May 2015).
Many other packages provided on Bioconductor4 , R-Forge5 ,

GitHub6 , etc.
R manuals on CRAN7
I
I
I
I
An Introduction to R
The R Language Definition
R Data Import/Export
...
http://www.r-project.org/
http://cran.r-project.org/
4
http://www.bioconductor.org/
5
http://r-forge.r-project.org/
6
https://github.com/
7
http://cran.r-project.org/manuals.html
3
4 / 44
Why R?
I
R is widely used in both academia and industry.
R was ranked no. 1 in the KDnuggets 2014 poll on Top

Languages for analytics, data mining, data science 8 (actually,
no. 1 in 2011, 2012 & 2013!).
The CRAN Task Views 9 provide collections of packages for
different tasks.
I
I
I
I
I
I
8
9
Machine learning & statistical learning

Cluster analysis & finite mixture models
Time series analysis
Multivariate statistics
Analysis of spatial data
...
http://www.kdnuggets.com/polls/2014/languages-analytics-data-mining-data-science.html
http://cran.r-project.org/web/views/
5 / 44
Outline
Introduction
Clustering with R
Text Mining with R
R and Big Data
Online Resources
6 / 44
Decision trees: rpart, party
Random forest: randomForest, party
SVM: e1071, kernlab
Neural networks: nnet, neuralnet, RSNNS
Performance evaluation: ROCR
7 / 44
The Iris Dataset

# iris data
str(iris)
## 'data.frame': 150 obs. of 5 variables:

## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1..
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1..
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0..
## $ Species
: Factor w/ 3 levels "setosa","versicolor",...
# split into training and test datasets
set.seed(1234)
ind <- sample(2, nrow(iris), replace=T, prob=c(0.7, 0.3))
iris.train <- iris[ind==1, ]
iris.test <- iris[ind==2, ]
8 / 44
Build a Decision Tree
# build a decision tree

library(party)
iris.formula <- Species ~ Sepal.Length + Sepal.Width +
Petal.Length + Petal.Width
iris.ctree <- ctree(iris.formula, data=iris.train)
9 / 44
plot(iris.ctree)
1
Petal.Length
p < 0.001
1.9
> 1.9
3
Petal.Width
p < 0.001
1.7
> 1.7
4
Petal.Length
p = 0.026
4.4
> 4.4
Node 2 (n = 40)
Node 5 (n = 21)
Node 6 (n = 19)
Node 7 (n = 32)
0.8
0.8
0.8
0.8
0.6
0.6
0.6
0.6
0.4
0.4
0.4
0.4
0.2
0.2
0.2
0.2
setosa
setosa
0
setosa
setosa
10 / 44
Prediction
# predict on test data

pred <- predict(iris.ctree, newdata = iris.test)
# check prediction result
table(pred, iris.test$Species)
##
## pred
setosa versicolor virginica
##
setosa
10
0
0
##
versicolor
0
12
2
##
virginica
0
0
14
11 / 44
Outline
Introduction
Clustering with R
Text Mining with R
R and Big Data
Online Resources
12 / 44
Clustering with R
k-means: kmeans(), kmeansruns()10
k-medoids: pam(), pamk()
Hierarchical clustering: hclust(), agnes(), diana()
DBSCAN: fpc
BIRCH: birch
Cluster validation: packages clv, clValid, NbClust
10
Functions are followed with (), and others are packages.

13 / 44
k-means Clustering
set.seed(8953)
iris2 <- iris
# remove class IDs
iris2$Species <- NULL
# k-means clustering
iris.kmeans <- kmeans(iris2, 3)
# check result
table(iris$Species, iris.kmeans$cluster)
##
##
##
##
##
1 2 3
setosa
0 50 0
versicolor 2 0 48
virginica 36 0 14
14 / 44
3.0
2.5
2.0
Sepal.Width
3.5
4.0
# plot clusters and their centers

plot(iris2[c("Sepal.Length", "Sepal.Width")], col=iris.kmeans$cluster)
points(iris.kmeans$centers[, c("Sepal.Length", "Sepal.Width")],
col=1:3, pch="*", cex=5)
4.5
5.0
5.5
6.0
6.5
7.0
7.5
8.0
15 / 44
Density-based Clustering
library(fpc)
iris2 <- iris[-5] # remove class IDs
# DBSCAN clustering
ds <- dbscan(iris2, eps = 0.42, MinPts = 5)
# compare clusters with original class IDs
table(ds$cluster, iris$Species)
##
##
##
##
##
##
0
1
2
3
setosa versicolor virginica

2
10
17
48
0
0
0
37
0
0
3
33
16 / 44
# 1-3: clusters; 0: outliers or noise

plotcluster(iris2, ds$cluster)
0
3
3 33
0
3
3
03 3
1
dc 2
1
1
3
3
3 3
0
0 2 2
0 2 22
2
2
2
0
0
3 33 0 333
3
3
3 3
3
3 30
33
0
3
22
3
2 22022 2 20
3
2 20 2 2
2
3
2 2 22
02
0
22
30
0
3
2 20
2
0 0
0
0
2 2
0 1
1
1 1
1 1
11 1 1
1
1
11
111 1 1 11 11
1 111111 1
1 1
1
11
1 11
1
11
2
dc 1
0
0
17 / 44
Outline
Introduction
Clustering with R
Text Mining with R
R and Big Data
Online Resources
18 / 44
Association rules: apriori(), eclat() in package arules
Sequential patterns: arulesSequence
Visualisation of associations: arulesViz
19 / 44
The Titanic Dataset

load("./data/titanic.raw.rdata")
dim(titanic.raw)
## [1] 2201
idx <- sample(1:nrow(titanic.raw), 8)

titanic.raw[idx, ]
##
##
##
##
##
##
##
##
##
501
477
674
766
1485
1388
448
590
Class
Sex
Age Survived
3rd
Male Adult
No
3rd
Male Adult
No
3rd
Male Adult
No
Crew
Male Adult
No
3rd Female Adult
No
2nd Female Adult
No
3rd
Male Adult
No
3rd
Male Adult
No
20 / 44
Association Rule Mining
# find association rules with the APRIORI algorithm

library(arules)
rules <- apriori(titanic.raw, control=list(verbose=F),
parameter=list(minlen=2, supp=0.005, conf=0.8),
appearance=list(rhs=c("Survived=No", "Survived=Yes"),
default="lhs"))
# sort rules
quality(rules) <- round(quality(rules), digits=3)
rules.sorted <- sort(rules, by="lift")
# have a look at rules
# inspect(rules.sorted)
21 / 44
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
lhs
{Class=2nd,
Age=Child}
2 {Class=2nd,
Sex=Female,
Age=Child}
3 {Class=1st,
Sex=Female}
4 {Class=1st,
Sex=Female,
Age=Adult}
5 {Class=2nd,
Sex=Male,
Age=Adult}
6 {Class=2nd,
Sex=Female}
7 {Class=Crew,
Sex=Female}
8 {Class=Crew,
Sex=Female,
Age=Adult}
9 {Class=2nd,
Sex=Male}
10 {Class=2nd,
rhs
support confidence
lift
=> {Survived=Yes}
0.011
1.000 3.096
=> {Survived=Yes}
0.006
1.000 3.096
=> {Survived=Yes}
0.064
0.972 3.010
=> {Survived=Yes}
0.064
0.972 3.010
=> {Survived=No}
0.070
0.917 1.354
=> {Survived=Yes}
0.042
0.877 2.716
=> {Survived=Yes}
0.009
0.870 2.692
=> {Survived=Yes}
0.009
0.870 2.692
=> {Survived=No}
0.070
0.860 1.271
22 / 44
library(arulesViz)
plot(rules, method = "graph")
Graph for 12 rules
width: support (0.006 0.192)

color: lift (1.222 3.096)
{Class=3rd,Sex=Male,Age=Adult}
{Class=2nd,Sex=Male,Age=Adult}
{Survived=No}{Class=3rd,Sex=Male}
{Class=2nd,Sex=Male}
{Class=1st,Sex=Female}
{Class=2nd,Sex=Female}
{Class=1st,Sex=Female,Age=Adult}
{Class=2nd,Sex=Female,Age=Child}
{Survived=Yes}
{Class=Crew,Sex=Female}
{Class=2nd,Age=Child}
{Class=Crew,Sex=Female,Age=Adult}
{Class=2nd,Sex=Female,Age=Adult}
23 / 44
Outline
Introduction
Clustering with R
Text Mining with R
R and Big Data
Online Resources
24 / 44
Text Mining with R
Text mining: tm
Topic modelling: topicmodels, lda
Word cloud: wordcloud
Twitter data access: twitteR
25 / 44
Retrieve Tweets
Retrieve recent tweets by @RDataMining
## Option 1: retrieve tweets from Twitter
library(twitteR)
tweets <- userTimeline("RDataMining", n = 3200)
## Option 2: download @RDataMining tweets from RDataMining.com
url <- "http://www.rdatamining.com/data/rdmTweets.RData"
download.file(url, destfile = "./data/rdmTweets.RData")
## load tweets into R
load(file = "./data/rdmTweets.RData")
(n.tweet <- length(tweets))
## [1] 320
strwrap(tweets[[320]]$text, width = 55)
## [1] "An R Reference Card for Data Mining is now available"
## [2] "on CRAN. It lists many useful R functions and packages"
## [3] "for data mining applications."
26 / 44
Text Cleaning
library(tm)
# convert tweets to a data frame
df <- twListToDF(tweets)
# build a corpus
myCorpus <- Corpus(VectorSource(df$text))
# convert to lower case
myCorpus <- tm_map(myCorpus, tolower)
# remove punctuations and numbers
myCorpus <- tm_map(myCorpus, removePunctuation)
myCorpus <- tm_map(myCorpus, removeNumbers)
# remove URLs, 'http' followed by non-space characters
removeURL <- function(x) gsub("http[^[:space:]]*", "", x)
myCorpus <- tm_map(myCorpus, removeURL)
# remove 'r' and 'big' from stopwords
myStopwords <- setdiff(stopwords("english"), c("r", "big"))
# remove stopwords
myCorpus <- tm_map(myCorpus, removeWords, myStopwords)
27 / 44
Stemming
# keep a copy of corpus
myCorpusCopy <- myCorpus
# stem words
myCorpus <- tm_map(myCorpus, stemDocument)
# stem completion
myCorpus <- tm_map(myCorpus, stemCompletion,
dictionary = myCorpusCopy)
# replace "miners" with "mining", because "mining" was
# first stemmed to "mine" and then completed to "miners"
myCorpus <- tm_map(myCorpus, gsub, pattern="miners",
replacement="mining")
strwrap(myCorpus[320], width=55)
## [1] "r reference card data mining now available cran list"
## [2] "used r functions package data mining applications"
28 / 44
Frequent Terms
myTdm <- TermDocumentMatrix(myCorpus,

control=list(wordLengths=c(1,Inf)))
# inspect frequent words
(freq.terms <- findFreqTerms(myTdm, lowfreq=20))
## [1] "analysis"
## [5] "examples"
## [9] "position"
## [13] "slides"
## [17] "used"
"big"
"mining"
"postdoctoral"
"social"
"computing"
"network"
"r"
"tutorial"
"data"
..
"package"..
"research..
"universi..
29 / 44
Associations
# which words are associated with 'r'?
findAssocs(myTdm, "r", 0.2)
##
r
## examples 0.32
## code
0.29
## package 0.20
# which words are associated with 'mining'?
findAssocs(myTdm, "mining", 0.25)
##
##
##
##
##
##
##
##
data
mahout
recommendation
sets
supports
frequent
itemset
mining
0.47
0.30
0.30
0.30
0.30
0.26
0.26
30 / 44
Network of Terms
library(graph)
library(Rgraphviz)
plot(myTdm, term=freq.terms, corThreshold=0.1, weighting=T)
university
tutorial
social
network
analysis
mining
research
postdoctoral
position
used
data
big
package
examples
computing
slides
31 / 44
Word Cloud
library(wordcloud)
m <- as.matrix(myTdm)
freq <- sort(rowSums(m), decreasing=T)
wordcloud(words=names(freq), freq=freq, min.freq=4, random.order=F)
provided melbourne
analysis outlier
map
mining network
open
graphics
thanks
conference users
processing
cfp text
analyst
exampleschapter
postdoctoral
slides used big
job
analytics join
high
sydney
topic
china
large
snowfall
casesee available poll draft
performance applications
group now
reference course code can via
visualizing
series tenuretrack
industrial center due introduction
association clustering access
information
page distributed
sentiment videos techniques tried
youtube
top presentation science
classification southern
wwwrdataminingcom
canberra added experience
management
predictive
talk
linkedin
vacancy
research
package
notes card
get
data
database
statistics
rdatamining
knowledge list
graph
free online
using
recent
published
workshop find
position
fast call
studies
tutorial
california
cloud
frequent
week tools
document
technology
nd
australia social university

datasets
google
short software
time learn
details
lecture
book
forecasting functions follower submission

business events
kdnuggetsinteractive
detection programmingcanada
spatial
search
machine
pdf
ausdm
modelling
twitter
starting fellow
web
scientist
computing parallel ibm
amp rules
dmapps
handling
32 / 44
Topic Modelling
library(topicmodels)
set.seed(123)
myLda <- LDA(as.DocumentTermMatrix(myTdm), k=8)
terms(myLda, 5)
##
##
##
##
##
##
##
##
##
##
##
##
[1,]
[2,]
[3,]
[4,]
[5,]
[1,]
[2,]
[3,]
[4,]
[5,]
Topic 1
Topic 2 Topic 3
Topic 4
"mining"
"data"
"r"
"position"
"data"
"free"
"examples" "research"
"analysis" "course" "code"
"university"
"network" "online" "book"
"data"
"social"
"ausdm" "mining"
"postdoctoral"
Topic 5
Topic 6
Topic 7
Topic 8
"data"
"data"
"r"
"r"
"r"
"scientist" "package"
"data"
"mining"
"research" "computing" "clustering"
"applications" "r"
"slides"
"mining"
"series"
"package"
"parallel" "detection"
33 / 44
Outline
Introduction
Clustering with R
Text Mining with R
R and Big Data
Online Resources
34 / 44
Time series decomposition: decomp(), decompose(), arima(),

stl()
Time series forecasting: forecast
Time Series Clustering: TSclust
Dynamic Time Warping (DTW): dtw
35 / 44
Outline
Introduction
Clustering with R
Text Mining with R
R and Big Data
Online Resources
36 / 44
Packages: igraph, sna
Centrality measures: degree(), betweenness(), closeness(),

transitivity()
Clusters: clusters(), no.clusters()
Cliques: cliques(), largest.cliques(), maximal.cliques(),

clique.number()
Community detection: fastgreedy.community(),

spinglass.community()
Graph database Neo4j: package RNeo4j

http://nicolewhite.github.io/RNeo4j/
37 / 44
Outline
Introduction
Clustering with R
Text Mining with R
R and Big Data
Online Resources
38 / 44
R and Big Data Platforms

I
Hadoop
I
Spark
I
Spark - a fast and general engine for large-scale data

processing, which can be 100 times faster than Hadoop
SparkR - R frontend for Spark
H2O
I
Hadoop (or YARN) - a framework that allows for the

distributed processing of large data sets across clusters of
computers using simple programming models
R Packages: RHadoop, RHIPE
H2O - an open source in-memory prediction engine for big

data science
R Package: h2o
MongoDB
I
I
MongoDB - an open-source document database

R packages: rmongodb, RMongo
39 / 44
R and Hadoop
I
Packages: RHadoop, RHive
RHadoop11 is a collection of R packages:

I
I
I
I
rmr2 - perform data analysis with R via MapReduce on a

Hadoop cluster
rhdfs - connect to Hadoop Distributed File System (HDFS)
rhbase - connect to the NoSQL HBase database
...
You can play with it on a single PC (in standalone or

pseudo-distributed mode), and your code developed on that
will be able to work on a cluster of PCs (in full-distributed
mode)!
Step-by-Step Guide to Setting Up an R-Hadoop System

http://www.rdatamining.com/big-data/r-hadoop-setup-guide
11
https://github.com/RevolutionAnalytics/RHadoop/wiki
40 / 44
An Example of MapReducing with R12

library(rmr2)
map <- function(k, lines) {
words.list <- strsplit(lines, "\\s")
words <- unlist(words.list)
return(keyval(words, 1))
}
reduce <- function(word, counts) {
keyval(word, sum(counts))
}
wordcount <- function(input, output = NULL) {
mapreduce(input = input, output = output, input.format = "text",
map = map, reduce = reduce)
}
## Submit job
out <- wordcount(in.file.path, out.file.path)
12
From Jeffrey Breens presentation on Using R with Hadoop
http://www.revolutionanalytics.com/news-events/free-webinars/2013/using-r-with-hadoop/
41 / 44
Outline
Introduction
Clustering with R
Text Mining with R
R and Big Data
Online Resources
42 / 44
Online Resources
I
RDataMining website:
I
I
I
http://www.rdatamining.com
R Reference Card for Data Mining

RDataMining Slides Series
R and Data Mining: Examples and Case Studies
RDataMining Group on LinkedIn (12,000+ members)

http://group.rdatamining.com
RDataMining on Twitter (2,000+ followers)

@RDataMining
Free online courses

http://www.rdatamining.com/resources/courses
Online documents
http://www.rdatamining.com/resources/onlinedocs
43 / 44
The End
Thanks!
Email: yanchang(at)rdatamining.com
44 / 44

Introduction To Data Mining With R: Yanchang Zhao

Cargado por

Información del documento

Título original

Derechos de autor

Formatos disponibles

Compartir este documento

Compartir o incrustar documentos

Opciones para compartir

¿Le pareció útil este documento?

¿Este contenido es inapropiado?

Copyright:

Formatos disponibles

Introduction To Data Mining With R: Yanchang Zhao

Cargado por

Copyright:

Formatos disponibles

Introduction to Data Mining with R1

Statistical Modelling and Computing Workshop at Geoscience Australia

Do you know data mining and its algorithms and techniques?

Do you know data mining and its algorithms and techniques?

Have you heard of R?

Do you know data mining and its algorithms and techniques?

Have you heard of R?

Have you ever used R in your work?

R 2 is a free software environment for statistical computing

R can be easily extended with 6,600+ packages available on

Many other packages provided on Bioconductor4 , R-Forge5 ,

R is widely used in both academia and industry.

R was ranked no. 1 in the KDnuggets 2014 poll on Top

Machine learning & statistical learning

Decision trees: rpart, party

Random forest: randomForest, party

SVM: e1071, kernlab

Neural networks: nnet, neuralnet, RSNNS

Performance evaluation: ROCR

The Iris Dataset

## 'data.frame': 150 obs. of 5 variables:

Build a Decision Tree

# build a decision tree

# predict on test data

k-means: kmeans(), kmeansruns()10

k-medoids: pam(), pamk()

Hierarchical clustering: hclust(), agnes(), diana()

Cluster validation: packages clv, clValid, NbClust

Functions are followed with (), and others are packages.

# plot clusters and their centers

setosa versicolor virginica

# 1-3: clusters; 0: outliers or noise

Association Rule Mining with R

Association rules: apriori(), eclat() in package arules

Sequential patterns: arulesSequence

Visualisation of associations: arulesViz

The Titanic Dataset

idx <- sample(1:nrow(titanic.raw), 8)

Association Rule Mining

# find association rules with the APRIORI algorithm

width: support (0.006 0.192)

Text Mining with R

Topic modelling: topicmodels, lda

Word cloud: wordcloud

Twitter data access: twitteR

myTdm <- TermDocumentMatrix(myCorpus,

slides used big

australia social university

forecasting functions follower submission

Time Series Analysis with R

Time series decomposition: decomp(), decompose(), arima(),

Time series forecasting: forecast

Time Series Clustering: TSclust

Dynamic Time Warping (DTW): dtw

Social Network Analysis with R

Packages: igraph, sna

Centrality measures: degree(), betweenness(), closeness(),

Clusters: clusters(), no.clusters()

Cliques: cliques(), largest.cliques(), maximal.cliques(),

Community detection: fastgreedy.community(),

Graph database Neo4j: package RNeo4j

R and Big Data Platforms

Spark - a fast and general engine for large-scale data