Agenda
Apache Spark Overview
What is Spark
Why Spark
1. MapReduce as a big data processing technology has proven to be the solution of choice for processing large data sets.
Hadoop and Spark Ecosystem
1. Spark is a processing engine
2. It occupies the same place in the Hadoop stack as MapReduce
[Diagram: the Hadoop stack, with Pig and Hive running on top of the MapReduce / Spark processing layer]
Hadoop and Spark Ecosystem
1. Spark can spawn its own servers to process data on HDFS, i.e. without using the ResourceManager and ApplicationMaster of YARN
Apache Spark Ecosystem
SPARK - Ecosystem
Spark Streaming:
For real-time streaming data processing, based on micro-batch style computing and processing.
Spark SQL:
Spark SQL provides the capability to expose Spark datasets over the JDBC API and allows running SQL-like queries on Spark data using traditional BI and visualization tools.
Spark MLlib:
MLlib is Spark's scalable machine learning library, consisting of common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering and dimensionality reduction, as well as underlying optimization primitives.
Spark GraphX:
GraphX is Spark's API for graphs and graph-parallel computation.
1. Start the Spark cluster and connect using the Spark shell as given below
a. Start the Spark daemons. Spark and Hadoop ship start scripts with the same name, so the Spark copy is renamed here
$ > start-allspark.sh (to start the Spark master and worker processes)
2. Start the Spark shell ($ > spark-shell)
3. The Spark shell will connect to the Spark master and create a SparkContext called sc.
4. The Spark shell will display a Scala prompt as in the screen below
Spark Lab -1 Starting Clusters
$> start-dfs.sh
$> jps
18831 Worker (Spark cluster)
18651 Master (Spark cluster)
21545 SecondaryNameNode (HDFS layer)
21346 DataNode (HDFS layer)
21222 NameNode (HDFS layer)
21669 Jps
Spark Context
SPARK - Context
• SparkContext is the object that manages the connection to the Spark cluster; it is the entry point for creating RDDs and scheduling jobs
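In the Spark shell, sc is created automatically; in a standalone application it is constructed explicitly. A minimal sketch (Spark 1.x API; the app name and master URL are placeholders):
import org.apache.spark.{SparkConf, SparkContext}
val conf = new SparkConf().setAppName("MyApp").setMaster("spark://master:7077") // placeholder master URL
val sc = new SparkContext(conf) // manages the connection to the cluster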
Resilient Distributed Datasets
Resilient Distributed Datasets
1. A Resilient Distributed Dataset (RDD) is
a. the primary abstraction in Spark
b. a fault-tolerant collection of data partitions distributed across the cluster
c. operated on in parallel
Spark Lab -1 Creating RDD
1. To access an HDFS file in Spark, we create an RDD, e.g.
a. scala> val log_file = sc.textFile("/user/data/apache_log.txt")
b. scala> val errs = log_file.filter(line => line.contains("error"))
c. scala> errs.collect().foreach(println)
2. In Scala, val defines a fixed value that cannot change; var defines a variable.
3. Line 1.a defines the base RDD "log_file", created from the HDFS text file
4. Line 1.b defines another RDD, "errs", derived from the base RDD "log_file" through a "filter" transformation
5. Note the word "defines" in points 3 and 4 for "log_file" and "errs" respectively, rather than "creates". The reason is:
a) These steps in the program are converted into a DAG (Directed Acyclic Graph) of tasks.
b) One of the tasks is to create the "log_file" RDD and another is to create the "errs" RDD out of the "log_file" RDD.
c) The fact that "errs" is derived from a "filter" on "log_file" (a.k.a. its lineage) is noted down in the "errs" RDD meta information.
d) The tasks are executed only when an "action" task in the DAG is executed on the RDD.
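A quick way to observe this laziness in the shell (same file as in the lab above):
scala> val log_file = sc.textFile("/user/data/apache_log.txt") // no data is read yet
scala> val errs = log_file.filter(_.contains("error")) // still no job; only lineage is recorded
scala> errs.count() // action: the DAG is now scheduled and tasks actually run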
SPARK DAG Scheduler
[Diagram: the Spark DAG scheduler groups the tasks (task 1 ... task n) into stages (Stage 1, Stage 2) and submits them to the YARN scheduler; each stage turns the original RDD into a transformed RDD]
7. The RDD structure that abstracts the scattered blocks of data on the cluster holds the following information:
a. A set of partitions (atomic pieces of the dataset)
b. A set of dependencies on parent RDDs
c. A function for computing the dataset based on its parents
d. Metadata about its partitioning scheme (hash partition or otherwise)
e. Data placement / preferred location for each partition
8. Each RDD ("log_file" and "errs" from lines 1.a and 1.b of the earlier lab) will have this structure associated with it.
9. The RDD structure contains the actual partition locations (7.e) only once the data is materialized on the cluster.
10. The RDD structure is maintained by the driver program
RDD Operations
RDD Operations
1. Spark allows two types of operations on an RDD
a. Transformations – create a new RDD from an existing RDD.
For example, the "errs" RDD was created from a "filter"
transformation of the "log_file" RDD
b. Actions – return a final value to the driver program or write data to external storage (see the Actions slide below)
RDD Transformations
[Diagram: RDD 1 → transformation T → RDD 2]
Transformations Meaning
map(func) Return a new distributed dataset formed by passing
each element of the source through a function denoted
by func. map transforms an RDD of length N into
another RDD of length N.
Ref: http://spark.apache.org/docs/latest/programming-guide.html#rdd-operations
1. // base RDD
a. val lines = sc.textFile("/user/data/blackhole.txt")
2. // transformed RDDs
a. val errors = lines.filter(_.contains("ERROR"))
b. errors.cache()
Actions (sample only)
Actions are operations that return a final value to the driver program or write data to an external storage system; they force the evaluation of the transformations in the RDD's lineage.
[Diagram: RDD 1 → action A → results (e.g. X = 10, Y = 14) returned to the driver or written to hard disk]
Actions Meaning
reduce(func) Aggregate the elements of the dataset using a function func
collect() Return all the elements of the dataset as an array at the driver
program
count() Return the number of elements in the dataset
first() Return the first element of the dataset
saveAsTextFile(path) Write the elements of the dataset as a text file (or set of text files) in a
given directory in the local file system, HDFS or any other Hadoop-
supported file system
Ref: http://spark.apache.org/docs/latest/programming-guide.html#rdd-operations
1. Action
a. val errors = lines.filter(_.contains("ERROR")) // transformation: nothing runs yet
b. errors.collect().foreach(println) // action: triggers the job and returns the results to the driver
Resilience
Why is it called resilient?
[Diagram: three RDDs – Apache_log, Errors, Top_Errors – with their partitions spread across cluster Nodes A to D]
1. Imagine you executed a Spark application that created the three RDDs shown above.
3. The "Errors" RDD is derived from "Apache_log" through a "filter" transformation, and its data blocks are stored in partitions 1 and 2 on Node B and Node D respectively
4. The "Top_Errors" RDD is created through a transformation of "Errors", and its partitions 1 to 4 are scattered on Nodes A to D respectively
5. If Node D fails, the driver can reconstruct the lost partitions of Apache_log, Errors and Top_Errors elsewhere on the cluster using the lineage information
What is resilience?
[Diagram: the transformation sequence (lineage) Apache_log → Errors → Top_Errors]
Paired RDD
Pair RDD
4. Pair RDDs consist of keys and values. For example, in the picture below, the "Fruit" column is the key column and the "Color" column holds the values
[Diagram: a two-column table with Fruit as the key column and Color as the value column]
Pair RDD
9. groupByKey() (avoid)
a. Extracts the elements with the same key and transmits them to another location for further summation.
[Diagram: word-count pairs such as (The,1), (Little,1), (Fox,1), (Jumped,1), (Over,1), (Brown,1), (Dog,1), (Lazy,1) are all shuffled across the network and grouped by key before being summed, e.g. The → 4, Little → 4, Over → 2]
b. All the keys and values transmit over the wire, leading to a lot of data movement
Pair RDD
10. reduceByKey()
a. Extracts the elements with the same key and creates summary totals before the shuffle
[Diagram: pairs are pre-aggregated locally on each node, e.g. (The,2) and (The,2) travel instead of four (The,1) pairs, and are combined after the shuffle into The → 4, Little → 4, Fox → 1, Jumped → 1, Over → 1, Brown → 1, Dog → 1, Lazy → 1]
b. Far fewer keys and values travel over the wire than in groupByKey()
https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/best_practices/prefer_reducebykey_over_groupbykey.html
PairRDD Lab -2
1. Create the following pair RDD and perform the steps. What do you see?
a. val pairRDD = sc.parallelize(List((1, 2), (3, 4), (3, 6)))
b. val keySum = pairRDD.reduceByKey( (x, y) => x + y )
c. keySum.collect().foreach(println)
Ans:
(3,10)
(1,2)
2. Now try groupByKey() on the same pair RDD (reconstructed step; it matches the second answer shown):
a. val keyGroup = pairRDD.groupByKey()
b. keyGroup.collect().foreach(println)
Ans:
(3,CompactBuffer(6, 4))
(1,CompactBuffer(2))
Spark Lab -3 WordCount (Scala)
1. Copy the following Scala snippet into the Spark shell
a. val lines = sc.textFile("/user/data/blackhole.txt")
b. val words = lines.flatMap(line => line.split(' '))
c. val counts = words.map(word => (word, 1)).reduceByKey{ case (x, y) => x + y }
d. counts.saveAsTextFile("/user/data/sparkwc53")
2. In 1.a we defined an RDD called "lines" created from an HDFS text file. This is also known as the base RDD.
3. Transformations include:
1. In 1.b the "words" RDD is defined as a transformation of the "lines" RDD using the "flatMap" function, which splits each line in the "lines" RDD into tokens based on the space character
2. In 1.c the "counts" RDD is defined as a transformation of the "words" RDD using the "map" function which, for every word, emits that word with a "1". For example, when it comes across the word "black", the map function makes it ("black", 1).
3. The "reduceByKey" function works on the output of the "map" function to collect all the same words and total the "1"s for each word
4. Actions include:
1. saveAsTextFile is the action which triggers the creation of all the RDDs and stores the contents of the "counts" RDD in a directory
Spark Lab -3 WordCount (Scala) as standalone program
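A minimal sketch of the same word count as a standalone application (assuming the Spark 1.x API used in this deck; the master URL is supplied via spark-submit):
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    // In a standalone program the SparkContext is created explicitly
    val conf = new SparkConf().setAppName("WordCount")
    val sc = new SparkContext(conf)
    val lines = sc.textFile("/user/data/blackhole.txt")
    val counts = lines.flatMap(_.split(' ')).map(word => (word, 1)).reduceByKey(_ + _)
    counts.saveAsTextFile("/user/data/sparkwc53")
    sc.stop()
  }
}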
RDD Persistence
9. Each persisted RDD can be stored using a different storage level; for example, persist the dataset on disk, or persist it in memory but in serialized form, etc.
Spark RDD Persistence Lab -1
1. scala> import org.apache.spark.storage.StorageLevel
2. scala> val lines = sc.textFile("/user/data/blackhole.txt") // assumed setup; the original steps 1-2 were not captured
3. scala> lines.persist(StorageLevel.MEMORY_ONLY)
– We can also use the cache() method if we need the MEMORY_ONLY storage level
4. scala> lines.count() // this is the point where the "lines" RDD is actually cached. Note down the time taken to give the count.
RDD Persistence
Storage Level Meaning
MEMORY_ONLY Store RDD as deserialized Java objects in the JVM. If the RDD does not
fit in memory, some partitions will not be cached and will be recomputed
on the fly each time they're needed. This is the default level.
MEMORY_AND_DISK Store RDD as deserialized Java objects in the JVM. If the RDD does not
fit in memory, store the partitions that don't fit on disk, and read them
from there when they're needed.
MEMORY_ONLY_SER Store RDD as serialized Java objects (one byte array per partition). This
is generally more space-efficient than deserialized objects, especially
when using a fast serializer, but more CPU-intensive to read.
Spark SQL - RDDs and DataFrames
Data Frames
4. DataFrames can be used to write queries using the HiveQL parser and to read data from Hive tables
DataFrames Lab -3
1. import org.apache.spark.sql._
4. case class Salesclass(custid: String, pid: String, qty: Integer, date: String, salesid: String)
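The remaining steps of this lab were not captured; a plausible sketch under the Spark 1.x API used elsewhere in this deck, assuming a comma-separated sales file at a hypothetical path:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
// Build an RDD of Salesclass objects from a hypothetical CSV file
val salesRDD = sc.textFile("/user/data/sales.txt")
  .map(_.split(","))
  .map(f => Salesclass(f(0), f(1), f(2).trim.toInt, f(3), f(4)))
val SalesDF = salesRDD.toDF() // convert the RDD of case-class objects to a DataFrame
SalesDF.registerTempTable("sales") // expose it to SQL queries
sqlContext.sql("SELECT pid, SUM(qty) FROM sales GROUP BY pid").show()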
DataFrames Lab -4
Connecting Spark to a JDBC data source
1. Download the MySQL connector jar from http://dev.mysql.com/downloads/file/?id=462849
2. The connector jar file should be available on the client and all worker nodes. Here it is kept in /home/hduser/mysql-connector-java-5.1.39/mysql-connector-java-5.1.39-bin.jar
5. Ensure the Spark master and workers are running. Start the Spark shell
9. val url = "jdbc:mysql://127.0.0.1:3306/testdb"
11. prop.setProperty("user","root")
12. prop.setProperty("password","root")
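Steps 10 and 13 were not captured; a plausible reconstruction using the standard Spark 1.x JDBC reader (the table name "person" is an assumption inferred from the "people" DataFrame used in the next slide):
10. val prop = new java.util.Properties
13. val people = sqlContext.read.jdbc(url, "person", prop) // "person" is a hypothetical table name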
Ref: http://www.infoobjects.com/dataframes-with-apache-spark/
DataFrames Lab -4
Connecting Spark to a JDBC data source
14. people.show
16. males.show
18. below60.show
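Steps 15 and 17 (the definitions of "males" and "below60") were not captured; a plausible sketch, with the column names gender and age assumed:
15. val males = people.filter("gender = 'M'") // hypothetical column and value
17. val below60 = people.filter("age < 60") // hypothetical column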
Ref: http://www.infoobjects.com/dataframes-with-apache-spark/
DataFrames Lab -5
Connecting Spark to a Hive data source
7. SalesDF.write.mode("overwrite").saveAsTable("salesdet") // storing the DataFrame as a table in Hive
DataFrames Lab -5
Connecting Spark to a Hive data source
======== Using Hive Context to read the table back from Hive ========
9. import org.apache.spark.sql.hive._
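The remaining steps were not captured; a minimal sketch using the Spark 1.x HiveContext:
10. val hiveContext = new HiveContext(sc)
11. val salesBack = hiveContext.table("salesdet") // read back the table stored in step 7
12. salesBack.show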
Data Frames
http://spark.apache.org/docs/latest/sql-programming-guide.html#dataframes
Local vs Cluster Mode Impact
Local vs. cluster modes
1. The behavior of code may depend on the way the application is executed / deployed. Results when deployed locally vs. on a cluster may differ for the same code.
a. var counter = 0
b. val somerdd = sc.parallelize(data)
c. // unreliable code: don't do this!!
d. somerdd.foreach(x => counter += x)
e. println("Counter value: " + counter)
2. Prior to executing the foreach(), Spark computes the task's closure.
3. The closure is the set of variables and methods which must be visible for the executor to perform its computations on the RDD (in this case foreach()). This closure is serialized and sent to each executor.
Local vs. cluster modes
4. The variables within the closure sent to each executor are now copies; thus, when counter is referenced within the foreach function, it is no longer the counter on the driver node.
5. There is still a counter in the memory of the driver node, but it is no longer visible to the executors! The executors only see the copy from the serialized closure.
6. Thus, the final value of counter will still be zero, since all operations on counter were referencing the value within the serialized closure.
Accumulators
1. Accumulators are variables that are only "added" to through an associative operation and can therefore be efficiently supported in parallel.
3. Accumulators can be read only by the driver. The workers can only update, but not read, them.
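A sketch of the earlier counter example rewritten with an accumulator (Spark 1.x API; the sample data is illustrative):
val counter = sc.accumulator(0) // created on the driver
val somerdd = sc.parallelize(1 to 100)
somerdd.foreach(x => counter += x) // executors can add to it
println("Counter value: " + counter.value) // only the driver reads the final value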
Broadcast Variables
1. Broadcast variables allow the programmer to keep a read-only variable
cached on each machine rather than shipping a copy of it with tasks
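For example (the lookup map here is illustrative):
val severity = sc.broadcast(Map("ERROR" -> 3, "WARN" -> 2, "INFO" -> 1)) // shipped once per executor, not once per task
val levels = sc.parallelize(Seq("ERROR", "INFO", "ERROR"))
levels.map(level => severity.value(level)).collect().foreach(println)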
Spark Streaming
1. Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams
3. Spark Streaming receives live input data streams (e.g. from HDFS or network sockets) and divides the data into batches, chopping up the live stream into batches of X seconds
4. These batches are then processed by the Spark engine to generate the final stream of results, again in batches (DStreams)
Source: http://spark.apache.org/docs/latest/streaming-programming-guide.html
Spark Streaming
Discretized Streams
Ref: https://spark.apache.org/docs/1.6.1/streaming-programming-guide.html#linking
Spark Streaming
[Diagram: the input stream is divided along the time axis into batches at 1-second intervals]
Source: https://stanford.edu/~rezab/sparkclass/slides/td_streaming.pdf
Spark Streaming Lab -6
2. The source of the input stream will be one terminal, where we will start a netcat server on port 9999 and key in our sentences
3. The Spark Streaming application will listen for data on the same host and port, i.e. 9999. The application will count the frequency of words in each line that we input.
Spark Streaming Lab -6 (Contd...)
nc -lk 9999  # run this on one terminal and the following on another
Ref: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/NetworkWordCount.scala
Spark Streaming Lab -6 (Contd...)
import org.apache.log4j.{Level, Logger}

object NetworkWordCount {
  def main(args: Array[String]) {
    if (args.length < 2) {
      System.err.println("Usage: NetworkWordCount <hostname> <port>")
      System.exit(1)
    }
    Logger.getLogger("org").setLevel(Level.OFF)
    Logger.getLogger("akka").setLevel(Level.OFF)
    // ... (the rest of the example creates the StreamingContext and counts words; see the Ref link above)
  }
}
1. The Spark shell does not support stream data capture from streaming sources such as twitter.com
Streaming live tweets from Twitter.com
1. import org.apache.spark.streaming._
2. import org.apache.spark.streaming.twitter._
3. import org.apache.spark.streaming.StreamingContext._
4. System.setProperty("twitter4j.oauth.consumerKey", "ABCD...")
5. System.setProperty("twitter4j.oauth.consumerSecret", "EFGH....")
6. System.setProperty("twitter4j.oauth.accessToken", "81498....")
7. System.setProperty("twitter4j.oauth.accessTokenSecret", "XYZ...")
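Steps 8-11 were not captured; a plausible sketch based on the standard TwitterUtils example (the batch interval and the printed output are assumptions):
8. val ssc = new StreamingContext(sc, Seconds(10)) // 10-second batches, assumed interval
9. val tweets = TwitterUtils.createStream(ssc, None) // None: use the OAuth credentials set above
10. val statuses = tweets.map(status => status.getText)
11. statuses.print()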
12. ssc.start()
Some useful URLs to learn Spark Streaming analysis
examples/src/main/scala/org/apache/spark/examples/streaming/TwitterAlgebirdCMS.scala
examples/src/main/scala/org/apache/spark/examples/streaming/TwitterAlgebirdHLL.scala
examples/src/main/scala/org/apache/spark/examples/streaming/TwitterHashTagJoinSentiments.scala
examples/src/main/scala/org/apache/spark/examples/streaming/TwitterPopularTags.scala
For DataFrames:
http://www.infoobjects.com/dataframes-with-apache-spark/
http://blog.jaceklaskowski.pl/2015/07/20/real-time-data-processing-using-apache-kafka-and-spark-streaming.html
http://ingest.tips/2015/06/24/real-time-analytics-with-kafka-and-spark-streaming/
Apache Spark Deployment Modes
Spark Application Deployment
1. The machine where the Spark application process (the one that creates the SparkContext) is running is the "Driver" node, with the process being called the Driver process
2. The cluster manager could be: Standalone cluster, Mesos or YARN
Spark Application Deployment
Spark Standalone Master-Slave Architecture
[Diagram: a driver node & client, a master node, and two worker nodes]
• Spark driver node – where the Spark application is initiated / its driver class is executed
• Spark master node – where the cluster manager is initiated
• Spark slave nodes – where the slave (worker) processes are initiated
Spark Application Deployment
Spark Application Driver in cluster mode
[Diagram: in cluster mode the driver process runs on a node inside the cluster rather than on the client]
Spark Application Deployment
Spark Application Driver in Cluster Mode
1. Not suited to using Spark interactively
2. The application instance is allocated an AppMaster (YARN ApplicationMaster) process
4. The AppMaster instructs the NodeManagers to start containers on its behalf
Spark Application Deployment
Local mode:
a) Install Spark on a local Unix box
b) Launch all the Spark daemons together using "start-all.sh". Please note, this script name is also used in Hadoop; you may need to change it if you wish to run on the same system where the Hadoop daemons are also running
c) Check that all the Spark servers are up and running (use the jps utility)
d) Good for debugging. To invoke Spark in local mode:
$SPARK_HOME/bin/spark-shell --master local
Thanks