
CS 850 4 Big Data with SAS Assignment # 2

1. Apache Spark: What is Scala programming? Why is it so important
in Big Data? What is an RDD (Resilient Distributed Dataset) in the context
of Apache Spark?
Write a standalone application for linear regression. Repeat the steps in
question 1 on the given training data set (lpsa.data). Please explain in
detail how you solved this problem using the following steps, with
screenshots. Also explain the challenges and pain points you experienced in
running this exercise. Give a business case where you would use this
type of analysis method in the context of Apache Spark.
Please refer to the following links for installation instructions, installation
files, and the data file needed for running this example:
Instructions and Intro:
https://www.dropbox.com/s/guu6pb80l13er52/ApacheSparkITU.pdf?dl=0
Installation and Data Files:
https://www.dropbox.com/s/4aq0cpg9zgibk2v/SparkLabs.zip?dl=0
Template to run a Scala program in the Spark shell:
$ source ./setup.sh
$ spark-shell --master yarn-client --driver-memory 512m --executor-memory 512m -i <Scalafile.scala>
Scala program to run regression in the Spark shell:
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.regression.LinearRegressionModel
import org.apache.spark.mllib.regression.LinearRegressionWithSGD
import org.apache.spark.mllib.linalg.Vectors

// Load and parse the data


val data = sc.textFile("data/mllib/ridge-data/lpsa.data")
val parsedData = data.map { line =>
val parts = line.split(',')
LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(' ').map(_.toDouble)))
}.cache()
// Building the model
val numIterations = 100
val model = LinearRegressionWithSGD.train(parsedData, numIterations)
// Evaluate model on training examples and compute training error
val valuesAndPreds = parsedData.map { point =>
val prediction = model.predict(point.features)
(point.label, prediction)
}
val MSE = valuesAndPreds.map{case(v, p) => math.pow((v - p), 2)}.mean()
println("training Mean Squared Error = " + MSE)

// Save and load model


model.save(sc, "myModelPath")
val sameModel = LinearRegressionModel.load(sc, "myModelPath")
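To check your understanding of what the Spark job computes, the same pipeline can be sketched in plain Python: parse "label,f1 f2 ..." lines (the lpsa.data format), fit weights by stochastic gradient descent (the idea behind LinearRegressionWithSGD, here single-machine rather than distributed), and report the training Mean Squared Error. The three inline data lines are invented for illustration.

```python
# Sketch of the Spark program above, single-machine. The tiny dataset
# (label equals the single feature) is made up for illustration only.
lines = [
    "1.0,1.0",
    "2.0,2.0",
    "3.0,3.0",
]

def parse(line):
    # mirrors the Scala parsing: "label,f1 f2 f3 ..."
    label, feats = line.split(",")
    return float(label), [float(x) for x in feats.split(" ")]

data = [parse(l) for l in lines]

w = [0.0] * len(data[0][1])        # one weight per feature, no intercept
lr, num_iterations = 0.1, 200
for _ in range(num_iterations):
    for label, x in data:
        pred = sum(wi * xi for wi, xi in zip(w, x))
        err = pred - label
        for j, xj in enumerate(x):
            w[j] -= lr * err * xj  # SGD update, one example at a time

# training MSE, same quantity the Scala program prints
mse = sum((sum(wi * xi for wi, xi in zip(w, x)) - label) ** 2
          for label, x in data) / len(data)
print("training Mean Squared Error =", mse)
```

On this exact toy data the weight converges to 1.0 and the error goes to essentially zero; on lpsa.data the Spark job prints a nonzero MSE because the data is noisy.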

2. Apache Kafka:
Run the following exercise in Apache Kafka. Please explain: What is Kafka?
What is it used for? What are its pros, cons, and use cases?
Explain step by step how you ran the exercise and the pain points and
challenges you faced in running it. Also explain the
output and give business context on how you would use it in a real-life
business. What is the Truck Event exercise doing, and why do we use Kafka
for this purpose?
Refer to following linked file for installation and other details
https://www.dropbox.com/s/7qd6pr9xixaoaug/Installation_Exercises_NetappD4.pptx?dl=0

Make sure you have installed Kafka following the steps in the above link. Then
you can run the code step by step to get the results.

Make sure Kafka is running


$ jps
Create the Truck Event topic
$ sh kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic truckevent
List all active Kafka topics
$ sh kafka-topics.sh --list --zookeeper localhost:2181
open another terminal
$ sudo mkdir /opt/TruckEvents
$ cd /opt/TruckEvents
$ wget http://hortonassets.s3.amazonaws.com/mda/Tutorials-master.zip
$ unzip Tutorials-master.zip
$ cd Tutorials-master
$ wget http://apache.arvixe.com/maven/maven-3/3.2.5/binaries/apache-maven-3.2.5-bin.tar.gz
$ tar xvf apache-maven-3.2.5-bin.tar.gz
$ sudo mv apache-maven-3.2.5 /usr/local/
$ export PATH=/usr/local/apache-maven-3.2.5/bin:$PATH
$ mvn -version
$ mvn clean package
$ cd target
$ java -cp Tutorial-1.0-SNAPSHOT.jar com.hortonworks.tutorials.tutorial1.TruckEventsProducer localhost:9092 localhost:2181 &
Listen from the consumer
$ bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic truckevent --from-beginning
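To reason about what the producer/consumer pair above is doing, here is a toy in-memory stand-in for a Kafka topic. This is not the real Kafka client API; the Topic and Consumer classes are invented for illustration. It shows the two ideas the exercise exploits: a topic is an append-only log, and each consumer tracks its own offset, so a consumer started with --from-beginning simply replays the log from offset 0.

```python
# Toy model of one Kafka topic partition (NOT the real Kafka API).
class Topic:
    def __init__(self, name):
        self.name = name
        self.log = []                  # append-only, like a Kafka partition

    def produce(self, event):
        self.log.append(event)
        return len(self.log) - 1       # offset assigned to the new record

class Consumer:
    def __init__(self, topic, from_beginning=True):
        self.topic = topic
        # "--from-beginning" means: start reading at offset 0
        self.offset = 0 if from_beginning else len(topic.log)

    def poll(self):
        events = self.topic.log[self.offset:]
        self.offset = len(self.topic.log)
        return events

# the TruckEventsProducer plays this role, emitting driver events
truckevent = Topic("truckevent")
for e in ["truck1|normal", "truck2|overspeed", "truck1|lane_departure"]:
    truckevent.produce(e)

consumer = Consumer(truckevent, from_beginning=True)
events = consumer.poll()
print(events)
```

Because the log is durable and offsets belong to consumers, many independent consumers (dashboards, alerting, archival) can read the same truck events at their own pace, which is why Kafka fits this streaming use case.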

3. SAS: What are the PROC and DATA steps in SAS? Give examples of some
of the most used SAS PROCs and explain their use in a business
context.
Run the following SAS code for multiple linear regression to get
multicollinearity and influence statistics (from the SAS manual) in the
SAS virtual machine, and explain the results in detail in a statistical
context.
options linesize=80;
data fitness;
input age weight oxy runtime rstpulse runpulse maxpulse;
cards;
44 89.47 44.609 11.37 62 178 182
40 75.07 45.313 10.07 62 185 185
44 85.84 54.297 8.65 45 156 168
42 68.15 59.571 8.17 40 166 172
38 89.02 49.874 9.22 55 178 180
47 77.45 44.811 11.63 58 176 176
40 75.98 45.681 11.95 70 176 180
43 81.19 49.091 10.85 64 162 170
44 81.42 39.442 13.08 63 174 176
38 81.87 60.055 8.63 48 170 186
44 73.03 50.541 10.13 45 168 168
45 87.66 37.388 14.03 56 186 192
45 66.45 44.754 11.12 51 176 176
47 79.15 47.273 10.60 47 162 164
54 83.12 51.855 10.33 50 166 170
49 81.42 49.156 8.95 44 180 185
51 69.63 40.836 10.95 57 168 172
51 77.91 46.672 10.00 48 162 168
48 91.63 46.774 10.25 48 162 164
49 73.37 50.388 10.08 67 168 168
57 73.37 39.407 12.63 58 174 176
54 79.38 46.080 11.17 62 156 165
52 76.32 45.441 9.63 48 164 166
50 70.87 54.625 8.92 48 146 155
51 67.25 45.118 11.08 48 172 172
54 91.63 39.203 12.88 44 168 172
51 73.71 45.790 10.47 59 186 188
57 59.08 50.545 9.93 49 148 155
49 76.32 48.673 9.40 56 186 188
48 61.24 47.920 11.50 52 170 176
52 82.78 47.467 10.50 53 170 172
;
title 'SAS Fitness data';

title2 'Example of multicollinearity and influence diagnostics';


proc reg data=fitness simple corr;
FULL: model oxy=runtime age weight runpulse maxpulse rstpulse /
stb collin tol vif corrb influence;
plot student.*predicted.;
run;
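The TOL, VIF, and COLLIN options above are the multicollinearity diagnostics. A quick way to see what VIF measures: with exactly two predictors, VIF = 1 / (1 - r^2), where r is their Pearson correlation. The sketch below (plain Python, not SAS) applies that to the first six rows of runpulse and maxpulse from the fitness data above, two variables PROC REG will flag as nearly collinear.

```python
# VIF illustration on the first six (runpulse, maxpulse) pairs from the
# fitness data above. With two predictors, VIF = 1 / (1 - r^2).
import math

runpulse = [178, 185, 156, 166, 178, 176]
maxpulse = [182, 185, 168, 172, 180, 176]

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

r = pearson(runpulse, maxpulse)
vif = 1.0 / (1.0 - r ** 2)
print(f"r = {r:.3f}, VIF = {vif:.1f}")  # VIF above ~10 is a common red flag
```

A large VIF means the predictor's standard error is inflated because other predictors nearly determine it, which is exactly what the COLLIN eigenvalue/condition-index table in the SAS output quantifies across all predictors at once.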

4. What is Cluster Analysis? Explain in detail the different types of
clustering methods, their business use cases, and their pros and cons. Run
a cluster analysis in SAS using the following code and explain the
results:

/* file: mammalsteeth.sas
   Example of cluster analysis taken from Example
   4 of the SAS documentation for PROC CLUSTER */
options nocenter nodate pageno=1 linesize=132;
title h=1 j=l 'File: cluster.mammalsteeth.sas';
title2 h=1 j=l 'Cluster Analysis of Mammals'' teeth data';
data teeth;
input mammal $ 1-16
      @21 (v1-v8) (1.);
label v1='Top incisors'
      v2='Bottom incisors'
      v3='Top canines'
      v4='Bottom canines'
      v5='Top premolars'
      v6='Bottom premolars'
      v7='Top molars'
      v8='Bottom molars';
cards;
BROWN BAT           23113333
MOLE                32103333
SILVER HAIR BAT     23112333
PIGMY BAT           23112233
HOUSE BAT           23111233
RED BAT             13112233
PIKA                21002233
RABBIT              21003233
BEAVER              11002133
GROUNDHOG           11002133
GRAY SQUIRREL       11001133
HOUSE MOUSE         11000033
PORCUPINE           11001133
WOLF                33114423
BEAR                33114423
RACCOON             33114432
MARTEN              33114412
WEASEL              33113312
WOLVERINE           33114412
BADGER              33113312
RIVER OTTER         33114312
SEA OTTER           32113312
JAGUAR              33113211
COUGAR              33113211
FUR SEAL            32114411
SEA LION            32114411
GREY SEAL           32113322
ELEPHANT SEAL       21114411
REINDEER            04103333
ELK                 04103333
DEER                04003333
MOOSE               04003333
;
/* principal components analysis of teeth
   here we score the principal components and
   output them to data set teeth2 */
proc princomp data=teeth out=teeth2;
var v1-v8;
run;
/* average linkage cluster analysis
   a dendrogram (tree diagram) is also output */
proc cluster data=teeth2 method=average outtree=ttree
             ccc pseudo rsquare;
var v1-v8;
id mammal;
run;
/* PROC TREE prints the tree diagram
   here we also output a data set, called ttree2,
   that contains four clusters */
proc tree data=ttree out=ttree2 nclusters=4;
id mammal;
run;

/* the next set of statements sort the data sets
   by variable mammal and then merge the tree data set
   (with the cluster scores) with the teeth data set
   (with the principal components) */
proc sort data=teeth2;
by mammal;
run;
proc sort data=ttree2;
by mammal;
run;
data teeth3;
merge teeth2 ttree2;
by mammal;
run;

/* stuff for plotting */
symbol1 c=black f=, v='1';
symbol2 c=black f=, v='2';
symbol3 c=black f=, v='3';
symbol4 c=black f=, v='4';

proc gplot;
plot prin2*prin1=cluster;
run;
proc sort;
by cluster;
run;
proc print;
by cluster;
var mammal prin1 prin2;
run;
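METHOD=AVERAGE in PROC CLUSTER is agglomerative (bottom-up) clustering where the distance between two clusters is the average of all pairwise distances between their members. A rough pure-Python sketch of that idea, on invented 1-D toy points rather than the mammal teeth data:

```python
# Minimal average-linkage agglomerative clustering sketch.
# Points A, B, C, D are made-up 1-D values; A/B and C/D sit close together.
import math

points = {"A": [1.0], "B": [1.2], "C": [5.0], "D": [5.3]}

def dist(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def avg_link(c1, c2):
    # average pairwise distance between members of the two clusters
    return sum(dist(points[i], points[j])
               for i in c1 for j in c2) / (len(c1) * len(c2))

clusters = [[name] for name in points]
while len(clusters) > 2:                 # stop when 2 clusters remain
    i, j = min(((i, j) for i in range(len(clusters))
                for j in range(i + 1, len(clusters))),
               key=lambda ij: avg_link(clusters[ij[0]], clusters[ij[1]]))
    clusters[i] = clusters[i] + clusters[j]   # merge the closest pair
    del clusters[j]

print(sorted(sorted(c) for c in clusters))
```

The sequence of merges is what PROC TREE draws as a dendrogram; cutting it at NCLUSTERS=4 (here, at 2) assigns each observation a cluster number, just like the ttree2 data set above.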

5. R Programming Text Processing:
Problem statement: Budweiser wants to analyze the responses posted
by people on Twitter to its Super Bowl commercial. It is a humongous
task for them to go through all the tweets. Complete the following
objectives.
Objective 1: Group tweets into 10 categories (based on content) using
K-Means clustering.
Objective 2: Using K-Means clustering, find tweets which
reference the words "Clydesdale" and "Budweiser".
Data for this task is uploaded at the following location:
https://www.dropbox.com/s/6uevdeygb92vzvw/Tweets.csv?dl=0
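Before doing this in R on the real Tweets.csv, the mechanics of Objective 1 can be sketched in plain Python: turn each tweet into a word-count vector, then run K-Means (Lloyd's algorithm) to group similar vectors. The four "tweets" below and the fixed seeds are invented so the sketch stays deterministic; it also assumes no cluster ever empties, which holds for this data.

```python
# K-Means sketch on toy word-count vectors (invented tweets, k = 2).
tweets = [
    "budweiser clydesdale ad",
    "clydesdale budweiser horse",
    "super bowl score",
    "bowl game score",
]
vocab = sorted(set(w for t in tweets for w in t.split()))

def vectorize(text):
    words = text.split()
    return [words.count(w) for w in vocab]   # bag-of-words counts

X = [vectorize(t) for t in tweets]

def dist2(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

centers = [X[0], X[2]]          # fixed seeds keep the toy run deterministic
for _ in range(10):             # Lloyd's iterations: assign, then re-center
    labels = [min(range(2), key=lambda k: dist2(x, centers[k])) for x in X]
    centers = [mean([x for x, l in zip(X, labels) if l == k]) for k in range(2)]

print(labels)
```

The Clydesdale/Budweiser tweets end up in one cluster and the game-talk tweets in the other, which is the intuition behind Objective 2: find the cluster whose tweets mention those brand words.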
6. Descriptive Questions on NoSQL and Big Data Science
a. Please explain the difference between NoSQL and RDBMS.
b. What is the CAP theorem? What are ACID and BASE in terms of
databases?
c. What is Cassandra? Explain the Cassandra data model.
d. What is polyglot persistence in terms of NoSQL and
databases?
e. What is MongoDB? Where would you use MongoDB? Give an
example of MongoDB.
f. What is a graph database? Explain a use case.
g. What is R programming? Give background and a use case.
h. What is machine learning? Give examples of machine
learning algorithms. What are supervised and unsupervised
algorithms? Give examples of each.
7. Group Assignment Q1: R-Hadoop
Use the Big Data Science virtual machine at the following link:
http://goo.gl/gAqdf4
R packages needed to install and set up RHadoop on CentOS:
https://goo.gl/PDN3Jx
Set up R-Hadoop using the instructions in the attached file:
https://www.dropbox.com/s/kh2aa8fejbdrdhx/Team%20Assignment%20R%20%E2%80%93Hadoop%20Installation%20Instructions.docx?dl=0
Once installed and tested as instructed in the above document,
please run the following to do linear regression using rhadoop and
rmapreduce.
Run the following code and please explain what you
understood from this exercise in detail. Explain the process,
the meaning of the packages you installed, and why we use Hadoop with R,
and explain how you ran the following code in R-Hadoop, along with the
challenges and pain points you experienced in executing it.
Regression in R without Hadoop MapReduce:
# Defining data variables
X = matrix(rnorm(2000), ncol = 10)
y = as.matrix(rnorm(200))
# Bundling data variables into a data frame
train_data <- data.frame(X, y)
# Training model for generating predictions
lmodel <- lm(y ~ train_data$X1 + train_data$X2 +
  train_data$X3 + train_data$X4 + train_data$X5 +
  train_data$X6 + train_data$X7 + train_data$X8 +
  train_data$X9 + train_data$X10, data = train_data)

Regression using R-Hadoop

# Defining the data sets with big data matrix X


X = matrix(rnorm(20000), ncol = 10)
X.index = to.dfs(cbind(1:nrow(X), X))
y = as.matrix(rnorm(2000))
# Function defined to be used as the reducer
Sum =
  function(., YY)
    keyval(1, list(Reduce('+', YY)))
XtX =
  values(
    from.dfs(
      mapreduce(
        input = X.index,
        map =
          function(., Xi) {
            yi = y[Xi[,1],]
            Xi = Xi[,-1]
            keyval(1, list(t(Xi) %*% Xi))},
        reduce = Sum,
        combine = TRUE)))[[1]]
Xty = values(
  from.dfs(
    mapreduce(
      input = X.index,
      map = function(., Xi) {
        yi = y[Xi[,1],]
        Xi = Xi[,-1]
        keyval(1, list(t(Xi) %*% yi))},
      reduce = Sum,
      combine = TRUE)))[[1]]
solve(XtX, Xty)
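The R-Hadoop job above never sees X in one piece: each map task computes the partial products t(Xi) %*% Xi and t(Xi) %*% yi on one chunk of rows, the reduce step sums the partials, and the driver solves the normal equations (XtX) b = Xty. A pure-Python sketch of the same split/sum/solve pattern, with a tiny exact dataset (y = 2*x1 + 3*x2) and hypothetical helper names map_chunk and reduce_sum:

```python
# Chunked normal-equations regression, mirroring the mapreduce() calls above.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
y = [2.0, 3.0, 5.0, 7.0]          # exactly 2*x1 + 3*x2, so b = (2, 3)

def map_chunk(Xi, yi):
    # one mapper's contribution: (t(Xi) %*% Xi, t(Xi) %*% yi)
    p = len(Xi[0])
    xtx = [[sum(r[a] * r[b] for r in Xi) for b in range(p)] for a in range(p)]
    xty = [sum(r[a] * v for r, v in zip(Xi, yi)) for a in range(p)]
    return xtx, xty

def reduce_sum(parts):
    # elementwise sum of the per-chunk matrices/vectors (the Sum reducer)
    xtx, xty = parts[0]
    for m, v in parts[1:]:
        xtx = [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(xtx, m)]
        xty = [a + b for a, b in zip(xty, v)]
    return xtx, xty

chunks = [(X[:2], y[:2]), (X[2:], y[2:])]      # stand-ins for HDFS splits
XtX, Xty = reduce_sum([map_chunk(Xi, yi) for Xi, yi in chunks])

# solve the 2x2 system (XtX) b = Xty by Cramer's rule
det = XtX[0][0] * XtX[1][1] - XtX[0][1] * XtX[1][0]
b = [(Xty[0] * XtX[1][1] - XtX[0][1] * Xty[1]) / det,
     (XtX[0][0] * Xty[1] - XtX[1][0] * Xty[0]) / det]
print(b)
```

The point of the Hadoop version is that XtX and Xty are tiny (p-by-p and p-by-1) even when X has billions of rows, so only the cheap solve happens on the driver.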
8. Group Assignment Q2:
Imagine 10,000 receipts sitting on your table. Each receipt
represents a transaction with items that were purchased. The
receipt is a representation of the stuff that went into a customer's
basket, hence the name Market Basket Analysis.
That is exactly what the Groceries data set contains: a collection of
receipts, with each line representing one receipt and the items
purchased. Each line is called a transaction, and each column in a
row represents an item. You can download the Groceries data set to
take a look at it, but this is not a necessary step.
Dataset:
https://www.dropbox.com/s/8k0frnc0ju8bsbs/groceries.csv?dl=0
Use the data to develop an association-rules-based recommendation
system. You can use the code below. Explain what steps you will
follow. What are the key outcomes from this machine learning
algorithm, and where will you use it? Copying and pasting the code below
won't get you any credits.
# Load the libraries
library(arules)
library(arulesViz)
library(datasets)

# Load the data set
data(Groceries)
# Create an item frequency plot for the top 20 items
itemFrequencyPlot(Groceries, topN = 20, type = "absolute")
# Get the rules
rules <- apriori(Groceries, parameter = list(supp = 0.001, conf = 0.8))

# Show the top 5 rules, but only 2 digits
options(digits = 2)
inspect(rules[1:5])
rules <- sort(rules, by = "confidence", decreasing = TRUE)
rules <- apriori(Groceries, parameter = list(supp = 0.001,
  conf = 0.8, maxlen = 3))
subset.matrix <- is.subset(rules, rules)
subset.matrix[lower.tri(subset.matrix, diag = T)] <- NA
redundant <- colSums(subset.matrix, na.rm = T) >= 1
rules.pruned <- rules[!redundant]
rules <- rules.pruned
rules <- apriori(data = Groceries,
  parameter = list(supp = 0.001, conf = 0.08),
  appearance = list(default = "lhs", rhs = "whole milk"),
  control = list(verbose = F))
rules <- sort(rules, decreasing = TRUE, by = "confidence")
inspect(rules[1:5])
rules <- apriori(data = Groceries,
  parameter = list(supp = 0.001, conf = 0.15, minlen = 2),
  appearance = list(default = "rhs", lhs = "whole milk"),
  control = list(verbose = F))
rules <- sort(rules, decreasing = TRUE, by = "confidence")
inspect(rules[1:5])
library(arulesViz)
plot(rules, method = "graph", interactive = TRUE, shading = NA)
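When explaining the apriori output, it helps to compute the two key rule metrics by hand: support(A -> B) is the share of receipts containing both A and B, and confidence is that support divided by the share containing A. A pure-Python sketch on four invented toy transactions (not the real Groceries data):

```python
# Support and confidence for one association rule, computed directly.
transactions = [
    {"whole milk", "bread", "butter"},
    {"whole milk", "bread"},
    {"whole milk", "eggs"},
    {"bread", "eggs"},
]

def support(itemset):
    # fraction of transactions containing every item in the itemset
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(lhs, rhs):
    # P(rhs | lhs) estimated from the transactions
    return support(lhs | rhs) / support(lhs)

s = support({"bread", "whole milk"})
c = confidence({"bread"}, {"whole milk"})
print("support =", s, "confidence =", c)
```

The supp= and conf= parameters in the apriori() calls above are thresholds on exactly these quantities: apriori enumerates the frequent itemsets efficiently and keeps only rules that clear both cutoffs.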

Bonus Question:
