
Introduction To Spark

1
Agenda

Apache Spark overview

Apache Spark Architecture

Apache Spark Installation / Deployment

Apache Spark Building Blocks

Apache Resilient Distributed Datasets

Apache Spark Paired RDD

Apache Spark RDD Persistence

Data Frames and Streaming

2
Apache Spark Overview

3
What is Spark

1. Apache Spark is an open source big data processing framework
   designed for speed, ease of use, and analytics.

2. A comprehensive, unified framework to manage big data processing
   requirements; a general-purpose engine.

3. Enables applications in Hadoop clusters to run up to 100 times
   faster in memory and 10 times faster even when running on disk.

4. In addition to Map and Reduce operations, it supports SQL queries,
   streaming data, machine learning and graph data processing.

5. These capabilities can be used stand-alone or combined in a
   single data pipeline use case.

4
Why Spark
1. MapReduce as a big data processing technology has proven to
   be the solution of choice for processing large data sets.

2. It is a good solution for one-pass computations, but not very
   efficient for use cases that require multi-pass computations
   and algorithms.

3. The job output data between each step has to be stored in
   the distributed file system before the next step can begin.

4. An end-to-end solution requires the integration of several tools
   for different big data use cases (like Mahout for machine
   learning and Storm for streaming data processing).

5
Hadoop and Spark Ecosystem
1. Spark is a processing engine
2. It occupies the same place in the Hadoop stack as MapReduce

[Hadoop stack, top to bottom]
   Pig / Hive
   MapReduce (MR) Framework / Spark
   Yet Another Resource Negotiator (YARN)        (processing layer)
   Hadoop Distributed File System (HDFS)         (storage layer)

6
Hadoop and Spark Ecosystem
1. Spark can spawn its own servers to process data on HDFS, i.e. not
   use the ResourceManager and ApplicationMaster of YARN.

2. This mode of installation is called "standalone".

3. Spark's schedulers are not as robust as YARN's schedulers, hence
   standalone mode is not used in large-scale setups.

   SPARK

   Hadoop Distributed File System (HDFS) Storage Layer

7
Apache Spark Ecosystem

8
SPARK - Ecosystem

Spark Streaming:
For real-time streaming data processing,
based on micro batch style computing and
processing.

Spark SQL:
Spark SQL provides the capability to expose the Spark datasets over JDBC API and
allow running the SQL like queries on Spark data using traditional BI and visualization
tools.

Spark MLlib:
MLlib is Spark's scalable machine learning library consisting of common learning
algorithms and utilities, including classification, regression, clustering, collaborative
filtering, and dimensionality reduction, as well as underlying optimization primitives.

Spark GraphX:
GraphX is Spark API for graphs and graph-parallel computation.

Spark Core is the underlying general execution engine: in-memory computing to
deliver speed, a generalized execution model to support a wide variety of
applications, and Java, Scala, and Python APIs for ease of development.
9
Spark Lab -1 Starting Clusters

1. Start the Spark cluster and connect to it using the Spark shell as given below
   a. Start the Spark daemons. Spark and Hadoop ship start scripts with the same
      name, so the Spark copy is renamed here.
      $> start-allspark.sh (starts the Spark master and worker processes)

2. Execute the following command at the Unix prompt
   a. $> spark-shell --master spark://127.0.1.1:7077

3. The Spark shell will connect to the spark master and create a
Spark Context called sc.

4. The Spark shell will display a scala prompt as in the screen below

16/02/12 22:32:40 INFO repl.SparkILoop: Created spark context..


Spark context available as sc.
scala>

10
Spark Lab -1 Starting Clusters
$> start-dfs.sh

$> jps

18831 Worker              (Spark cluster)
18651 Master              (Spark cluster)
21669 Jps
21545 SecondaryNameNode   (HDFS layer)
21346 DataNode            (HDFS layer)
21222 NameNode            (HDFS layer)

11
1. Spark can spawn its own servers to process data on HDFS, i.e. not
   use the ResourceManager and ApplicationMaster of YARN.

2. This mode of installation is called "standalone".

3. Spark's schedulers are not as robust as YARN's schedulers, hence
   standalone mode is not used in large-scale setups.

   SPARK

   Hadoop Distributed File System (HDFS) Storage Layer

12
Spark Context

13
SPARK - Context
• SparkContext is the object that manages the connection to the
clusters in Spark

• SparkContext connects to cluster managers, which manage the
  actual executors that run the specific computations

• The SparkContext object is usually referenced as the variable sc
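
In the spark-shell, sc is created automatically; in a standalone application you create it yourself. A minimal sketch, assuming an illustrative app name and master URL:

   import org.apache.spark.{SparkConf, SparkContext}

   // App name and master URL below are placeholder values
   val conf = new SparkConf()
     .setAppName("MyFirstSparkApp")           // shown in the Spark UI
     .setMaster("spark://127.0.1.1:7077")     // or "local[*]" for local testing

   val sc = new SparkContext(conf)            // connects to the cluster manager

   // ... define RDDs and run actions here ...

   sc.stop()                                  // release the cluster resources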

14
Resilient Distributed Datasets

15
Resilient Distributed Data sets
1. Resilient Distributed Datasets (RDD) are the
a. primary abstraction in Spark
b. a fault-tolerant collection of distributed partitions of data sets
c. can be operated on in parallel

2. An RDD can be created in two ways:

   a. Hadoop datasets – run functions on each record of a file in Hadoop
      distributed file storage
      val log_file = sc.textFile("/user/data/apache_log.txt")

   b. Parallelized collections – take an existing Scala collection and run
      functions on it in parallel
      scala> val someRDD = sc.parallelize(1 to 100, 4)

16
Spark Lab -1 Creating RDD
1. To access HDFS file in Spark, we create an RDD. For e.g.
a. scala> val log_file = sc.textFile("/user/data/apache_log.txt")
b. scala> val errs = log_file.filter(line => line.contains("error"))
c. scala> errs.collect().foreach(println);

2. In Scala val defines a fixed value that cannot change, var defines a
variable.

3. Since RDDs are immutable, they are defined as val.

4. Line 1.a defines an RDD "log_file". It acts as an alias to the Hadoop
   blocks of the file "apache_log.txt" (a.k.a. the "base RDD")

5. Line 1.b defines another RDD “errs” derived from base RDD
“log_file” through a “filter” transformation process

17
6. I am using the word "defines" in points 4 and 5 for "log_file" and "errs"
   respectively, not the word "creates". The reason for that is:

   a) These steps in the program are converted into a DAG (Directed Acyclic
      Graph) of tasks.

   b) One of the activities is to create the log_file RDD and another is to
      create the errs RDD out of the log_file RDD.

   c) The "errs" RDD is derived from a filter process on the "log_file" RDD. This
      fact (a.k.a. lineage) is noted down in the "errs" RDD meta information.

   d) The process of deriving an RDD out of another is done in a series of
      tasks. Some or all of these tasks may be grouped and executed together
      as a stage.

   e) The tasks are executed only when an "action" task in the DAG is executed
      on the RDD.

18
SPARK DAG Scheduler

[Figure: the Spark DAG scheduler takes the tasks (Task 1, Task 2, ..., Task n)
that turn the original RDD into the transformed RDD, groups them into stages
(Stage 1, Stage 2), and hands them to the cluster scheduler (e.g. YARN).]
19
7. The RDD structure that helps abstract the scattered blocks of data
on the cluster has following information
a. A set of partitions(atomic pieces of datasets)
b. A set of dependencies on parent RDDs
c. A function for computing the dataset based on its parents
d. Metadata about its partitioning scheme (hash partition or otherwise)
e. Data placement / preferred location for each partition

8. Each RDD, "log_file" and "errs" from lines 1.a and 1.b of the lab, will
   have this structure associated with it.

9. The RDD structure contains the partition locations (7.e) only once the
   actual data processing happens on the cluster.

20
10. The RDD structure is maintained by the driver program

11. If Spark is working in standalone mode, then the RDD structure is
    maintained on the Spark master

12. If Spark is connecting to a YARN cluster then
    a. In client mode, the driver program maintains the RDD structure on the client node
    b. In cluster mode, the RDD structure is maintained by the driver program in the
       application master

21
RDD Operations

22
RDD Operations
1. Spark allows two types of operations on an RDD
a. Transformations – create new RDD from existing RDD.
For example, “errs” RDD was created from “filter”
transformation of “log_file” RDD

   b. Actions – perform some function on the RDD and return
      the result to the driver program. For e.g.
      errs.collect().foreach(println)

2. Transformations are lazy; they don't compute right away.
   They are computed only when an action is performed.

3. Until an action is performed, all the transformations are
   noted down in the RDD's lineage.
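
A small spark-shell illustration of this laziness, reusing the log-file RDDs defined in the earlier lab:

   val log_file = sc.textFile("/user/data/apache_log.txt")          // nothing is read yet
   val errs = log_file.filter(line => line.contains("error"))       // still nothing executed

   // Only an action forces the lineage to be evaluated:
   val numErrors = errs.count()    // a job runs now and returns the number of matching lines
   errs.take(5).foreach(println)   // another action; prints up to 5 error lines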

23
RDD Transformations

RDD 1 T RDD 2

Transformations Meaning
map(func) Return a new distributed dataset formed by passing
each element of the source through a function denoted
by func. map transforms an RDD of length N into
another RDD of length N.

flatMap(func) Similar to map, but each input element can be mapped to
zero or more output elements, so func should return a
sequence rather than a single item. Loosely speaking,
flatMap transforms an RDD of length N into N collections,
then flattens these into a single RDD of results.
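
A quick spark-shell sketch of the difference (the input strings are made up for illustration):

   val lines = sc.parallelize(Seq("the quick fox", "jumped over"))

   // map: one output element per input element -> RDD[Array[String]] of length 2
   val tokenised = lines.map(line => line.split(" "))
   tokenised.collect()   // Array(Array(the, quick, fox), Array(jumped, over))

   // flatMap: each input can produce many outputs, flattened -> RDD[String] of length 5
   val words = lines.flatMap(line => line.split(" "))
   words.collect()       // Array(the, quick, fox, jumped, over)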

Ref: http://spark.apache.org/docs/latest/programming-guide.html#rdd-operations
24
1. // base RDD
   a. val lines = sc.textFile("/user/data/blackhole.txt").map(x => x.split(" "))

2. // transformed RDDs
   a. val errors = lines.filter(_.contains("ERROR"))
   b. errors.cache()

25
Actions (sample only)
Actions are operations that return a final value to the driver program or write data to an
external storage system; they trigger the evaluation of the transformations in the RDD.
[Figure: an action A on RDD 1 returns results (e.g. X = 10, Y = 14) to the
driver program or writes them to the hard disk.]

Actions Meaning
reduce(func) Aggregate the elements of the dataset using a function func

collect() Return all the elements of the dataset as an array at the driver
program
count() Return the number of elements in dataset
first() Return the first element of the dataset
saveAsTextFile(path) Write the elements of the dataset as a text file (or set of text files) in a
given directory in the local file system, HDFS or any other Hadoop-
supported file system
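
A few of these actions applied to a small parallelized collection (the values and the output path are chosen only for illustration):

   val nums = sc.parallelize(1 to 10)

   nums.count()                      // 10 elements
   nums.first()                      // 1
   nums.reduce((a, b) => a + b)      // 55, the sum of 1..10
   nums.collect()                    // Array(1, 2, ..., 10) pulled back to the driver
   nums.saveAsTextFile("/user/data/nums_out")   // hypothetical HDFS output directory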

Ref: http://spark.apache.org/docs/latest/programming-guide.html#rdd-operations
26
1. Action
a. val errors = lines.filter(_.contains("ERROR"))
b. errors.collect().foreach(println)

27
Resilience

28
Why is it called resilient?

[Figure: lineage of three RDDs (Apache_log -> Errors -> Top_Errors) with their
partitions spread across Nodes A to D.]
1. Imagine you executed a Spark application that created the three RDDs shown above.

2. RDD “Apache_log” is mapped to partitions 1,2,3 on Node A, Node C and Node D

3. The "Errors" RDD is derived from "Apache_log" through a "filter" transformation and its data
   blocks are stored in partition 1 and partition 2 on Node B and Node D respectively

4. The "Top_Errors" RDD is created through a transformation of the "Errors" RDD and its partitions
   are partition 1 to partition 4, scattered on Node A to Node D respectively

5. If Node D fails, the driver can reconstruct all the lost partitions (partition 3 of Apache_log,
   partition 2 of Errors and partition 4 of Top_Errors) elsewhere on the cluster using the lineage
   information
29
What is resilience?

[Figure: the same transformation sequence (Apache_log -> Errors -> Top_Errors)
spread over Computer 1 to Computer 4. When Computer 4 fails, the Spark cluster
manager identifies another computer in the cluster with sufficient resources
(Computer 5) and the RDD partitions are recreated on it transparently.]

30
Paired RDD

31
Pair RDD

1. In the Hadoop environment the concept of a key-value pair plays an
   important role.

2. Based on keys, data from data blocks scattered across a cluster are
   brought together, as in the case of MapReduce.

3. Spark allows the creation of RDDs with keys and values.

4. A pair RDD consists of keys and values. For example, in the sketch below,
   "fruit" is the key and "colour" is the value.
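
A minimal sketch of such a pair RDD, with fruit names as keys and colours as values (the data is made up):

   // Each element is a (key, value) tuple: (fruit, colour)
   val fruitColours = sc.parallelize(Seq(
     ("apple",  "red"),
     ("banana", "yellow"),
     ("grape",  "green"),
     ("apple",  "green")
   ))

   // Pair-RDD operations become available, e.g. grouping all colours per fruit
   fruitColours.groupByKey().collect()
   // e.g. (apple, CompactBuffer(red, green)), (banana, CompactBuffer(yellow)), ...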

32
Pair RDD

6. Some of the most frequently used functions are:
   reduceByKey(func), groupByKey(), combineByKey() etc.

7. A simple RDD can be converted into a pair RDD through the "map"
   command as shown below:
   a. val wordpair = words.map(word => (word, 1))
      will create the "wordpair" RDD from the "words" RDD

Words Wordpair (Pair RDD)


The The, 1
Little Little, 1
Brown Brown,1
Fox Fox ,1
Jumped words.map(word => (word, 1)) Jumped,1
Over Over,1
The The ,1
Little Little,1
Lazy Lazy,1
Dog Dog,1

33
Pair RDD

8. Aggregating (aggregate statistics across all elements with the same
   key); a combineByKey sketch follows below:
   a. reduceByKey()
   b. groupByKey()
   c. combineByKey()
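
reduceByKey and groupByKey are illustrated on the next slides; here is a hedged sketch of combineByKey, computing a per-key average (the scores are illustrative):

   val scores = sc.parallelize(Seq(("maths", 80), ("maths", 90), ("physics", 70)))

   // combineByKey needs three functions:
   //   1. createCombiner: turn the first value for a key into (sum, count)
   //   2. mergeValue:     fold another value into an existing (sum, count)
   //   3. mergeCombiners: combine partial (sum, count) pairs from different partitions
   val sumCount = scores.combineByKey(
     (v: Int)                       => (v, 1),
     (acc: (Int, Int), v: Int)      => (acc._1 + v, acc._2 + 1),
     (a: (Int, Int), b: (Int, Int)) => (a._1 + b._1, a._2 + b._2)
   )

   val averages = sumCount.mapValues { case (sum, count) => sum.toDouble / count }
   averages.collect()   // e.g. (maths, 85.0), (physics, 70.0)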

34
Pair RDD

9. groupByKey() (avoid)
a. Extracts the elements with same key and transmits them to another
location for further summation.
   Before shuffle (per partition)              After groupByKey() and summation
   The 1, The 1, The 1, The 1                  The 4
   Little 1, Little 1, Little 1, Little 1      Little 4
   Fox 1, Jumped 1, Over 1, Over 1             Fox 1, Jumped 1, Over 2
   Brown 1, Dog 1, Lazy 1                      Brown 1, Dog 1, Lazy 1

b. All the individual keys and values travel over the wire, leading to a lot of
   data movement

35
Pair RDD

10. reduceByKey()
a. Extracts the elements with same key, creates summary totals before
shuffle

   After local (map-side) combine              After reduceByKey() shuffle
   The 2, The 2                                The 4
   Little 2, Little 2                          Little 4
   Fox 1, Jumped 1, Over 1                     Fox 1, Jumped 1, Over 1
   Brown 1, Dog 1, Lazy 1                      Brown 1, Dog 1, Lazy 1

b. Fewer keys and values travel over the wire than with
   groupByKey()

https://databricks.gitbooks.io/databricks-spark-knowledge-
base/content/best_practices/prefer_reducebykey_over_groupbykey.html
36
PairRDD Lab -2

1. Create the following pairRDD and perform the steps. What do you see?
a. val pairRDD = sc.parallelize(List((1, 2), (3, 4), (3, 6)))
b. val keySum = pairRDD.reduceByKey( (x, y) => x + y)
c. keySum.collect().foreach(println)

Ans : -
(3,10)
(1,2)

2. Repeat the exercise and this time do a groupByKey as shown below
   a. val keyGroup = pairRDD.groupByKey()
b. keyGroup.collect().foreach(println)

Ans: -
(3,CompactBuffer(6, 4))
(1,CompactBuffer(2))

37
PairRDD Lab -2

3. Execute the following commands in the Spark shell and discuss the
   output
   a. val pairs = sc.parallelize(List((1, 2), (3, 4), (3, 6)))
   b. val pairs1 = pairs.mapValues(x=>x+1)
   c. pairs1.collect()
   Ans: the mapValues function works only on the values in the (key, value) pair. In
   this case the values are increased by 1.

4. Execute the following commands in the Spark shell and discuss the
   output
   a. val pairs = sc.parallelize(List((1, 2), (3, 4), (3, 6)))
   b. val pairs2 = pairs.map {case (x,y) => (x, y+1)}
   c. pairs2.collect()
   Ans: the map function operates on the whole (key, value) tuple. In this case the
   keys are unchanged and the values are increased by 1.

38
PairRDD Lab -2

5. Execute the following commands in the Spark shell and discuss the
   output
   a. val pairs = sc.parallelize(List((1, 2), (3, 4), (3, 6)))
   b. val pairs3 = pairs.mapValues(x=>(x,1))
   c. pairs3.collect()
   Ans: the mapValues function works only on the values in the (key, value) pair. In
   this case each value is changed to a tuple where the value is
   associated with a "1".
   The output should look like:

   Array[(Int, (Int, Int))] = Array((1,(2,1)), (3,(4,1)), (3,(6,1)))

39
Spark Lab -3 WordCount (Scala)
1. Copy the following Scala snippet in the spark shell
a. val lines = sc.textFile("/user/data/blackhole.txt")
b. val words = lines.flatMap(line => line.split( ' '))
c. val counts = words.map(word => (word, 1)).reduceByKey{case (x, y) => x+y}
d. counts.saveAsTextFile("/user/data/sparkwc53")

2. In 1.a we defined an RDD called “lines” created from a HDFS text file. This is
also known as base RDD.

3. Transformations include -
1. In 1.b the "words" RDD is defined as a transformation of the "lines" RDD using the "flatMap" function, which
   splits each line of the "lines" RDD into tokens based on the space character
2. In 1.c the "counts" RDD is defined as a transformation of the "words" RDD using the "map" function which, for
   every word, emits that word paired with a 1. For e.g. when it comes across the word "black", the map function
   emits ("black", 1).
3. The "reduceByKey" function works on the output of the "map" function to collect all identical words and
   total the 1s for each word

4. Actions include:
   1. saveAsTextFile is the action which triggers computation of all the RDDs and stores the contents of the
      "counts" RDD in a directory

40
Spark Lab -3 WordCount (Scala) as standalone program

1. The same Scala snippet as used in the spark shell (a standalone-application sketch follows below)
   a. val lines = sc.textFile("/user/data/blackhole.txt")
   b. val words = lines.flatMap(line => line.split( ' '))
   c. val counts = words.map(word => (word, 1)).reduceByKey{case (x, y) => x+y}
   d. counts.saveAsTextFile("/user/data/sparkwc")
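
A minimal sketch of the same word count packaged as a standalone Scala application; the object name, build layout and jar path are assumptions, and it would be built (e.g. with sbt) and launched with spark-submit:

   import org.apache.spark.{SparkConf, SparkContext}

   object SparkWordCount {
     def main(args: Array[String]): Unit = {
       // In a standalone program we create the SparkContext ourselves
       val conf = new SparkConf().setAppName("SparkWordCount")
       val sc   = new SparkContext(conf)

       val lines  = sc.textFile("/user/data/blackhole.txt")
       val words  = lines.flatMap(line => line.split(' '))
       val counts = words.map(word => (word, 1)).reduceByKey(_ + _)

       counts.saveAsTextFile("/user/data/sparkwc")   // action; triggers the whole job
       sc.stop()
     }
   }

It could then be submitted with something like
   spark-submit --class SparkWordCount --master spark://127.0.1.1:7077 target/scala-2.11/wordcount.jar
(the jar path is a placeholder).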

41
RDD Persistence

42
RDD Persistence

1. One of the most important Spark capabilities is persisting
   (or caching) a dataset in memory across operations

2. When an RDD is persisted in the cache, each node stores in memory any
   partitions of it that it computes

3. The RDD partitions stored in memory are reused in other actions on that
   dataset (or datasets derived from it)

4. This allows future actions to be much faster (sometimes 100x)

5. Caching is a useful feature for iterative algorithms and fast interactive
   use

43
RDD Persistence

44
RDD Persistence

6. An RDD can be persisted using the persist() or cache() methods on
   it

7. The first time it is computed in an action, it will be kept in memory on
   the nodes

8. Spark's cache is fault-tolerant – if any partition of an RDD is lost, it
   will automatically be recomputed using the transformations that
   originally created it

9. Each persisted RDD can be stored using a different storage level; for
   example, persist the dataset on disk, or persist it in memory as
   serialized objects, etc.

45
Spark RDD Persistence Lab -1

1. scala> val lines = sc.textFile("/user/data/blackhole.txt")

2. scala> import org.apache.spark.storage.StorageLevel

3. scala> lines.persist(StorageLevel.MEMORY_ONLY)
– We can also use cache() method if we need MEMORY_ONLY
storage level

4. scala> lines.count() // this is the point where the lines RDD will
be cached. Note down the time taken to give the count.

5. scala> lines.count() // run it a second time and note the time taken
   now. You should see a significant improvement because of the cache

46
RDD Persistence
Storage Level Meaning
MEMORY_ONLY Store RDD as deserialized Java objects in the JVM. If the RDD does not
fit in memory, some partitions will not be cached and will be recomputed
on the fly each time they're needed. This is the default level.

MEMORY_AND_DISK Store RDD as deserialized Java objects in the JVM. If the RDD does not
fit in memory, store the partitions that don't fit on disk, and read them
from there when they're needed.
MEMORY_ONLY_SER Store RDD as serialized Java objects (one byte array per partition). This
is generally more space-efficient than deserialized objects, especially
when using a fast serializer, but more CPU-intensive to read.

MEMORY_AND_DISK_SER Similar to MEMORY_ONLY_SER, but spill partitions that don't fit in
memory to disk instead of recomputing them on the fly each time they're
needed.
DISK_ONLY Store the RDD partitions only on disk.
MEMORY_ONLY_2, Same as the levels above, but replicate each partition on two cluster
MEMORY_AND_DISK_2, nodes.
etc.

47
Spark SQL - RDD Dataframes

48
Data Frames

1. Spark SQL provides a programming abstraction called DataFrame for
   distributed query processing

2. A DataFrame is a Dataset organized into named columns. It is
   conceptually equivalent to a table in a relational database

3. It is a distributed collection of data organized into named columns

4. DataFrames can be created from different data sources such as:
   – Existing RDDs
   – Structured data files
   – JSON datasets
   – Hive tables
   – External databases

5. Spark SQL supports automatically converting an RDD containing case
   classes to a DataFrame with the method toDF()

49
Data Frames

1. Spark SQL provides SQLContext to encapsulate all relational
   functionality in Spark

2. Create the SQLContext from the existing SparkContext
   a. import org.apache.spark.sql._
   b. val sqlContext = new SQLContext(sc)

3. HiveContext provides a superset of the functionality provided
   by SQLContext
   a. import org.apache.spark.sql.hive._
   b. val hc = new HiveContext(sc)

4. It can be used to write queries using the HiveQL parser and read
   data from Hive tables

50
DataFrames Lab -3

Create DataFrame and temporary table from RDD

1. import org.apache.spark.sql._

2. val sqlContext = new SQLContext(sc)

3. val Sales = sc.textFile("/user/data/Sales.csv")

4. case class Salesclass(custid: String, pid: String, qty: Integer, date: String,
salesid: String)

5. val Salesdet = Sales.map(_.split(",")).map(p => Salesclass(p(0),
   p(1), p(2).toInt, p(3), p(4)))

6. val SalesDF = Salesdet.toDF()

51
DataFrames Lab -3

Create DataFrame and temporary table from RDD

7. SalesDF.printSchema() //what do you see here?

8. SalesDF.registerTempTable("SalesTable") // register the DF as a temporary
   table in the SQLContext (not in the Hive metastore)

9. val results = sqlContext.sql("select * from SalesTable").show()

52
DataFrames Lab -3

10. val results = sqlContext.sql("SELECT custid, sum(qty) FROM SalesTable
    GROUP BY custid").show()
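
The same aggregation can also be expressed with the DataFrame API instead of a SQL string; a minimal sketch using the SalesDF defined earlier in this lab:

   // Equivalent of: SELECT custid, sum(qty) FROM SalesTable GROUP BY custid
   SalesDF.groupBy("custid")
          .sum("qty")
          .show()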

53
DataFrames Lab -4
Connecting SPARK to JDBC data source
1. Download mysql connector jar from
http://dev.mysql.com/downloads/file/?id=462849

2. The connector jar file should be available on client and all worker
nodes. It is kept in /home/hduser/mysql-connector-java-5.1.39/mysql-
connector-java-5.1.39-bin.jar

3. Create a table in MySQL, in any database. A table called "person" is
   already created in "testdb" for this hands-on (refer to the attached doc)

4. Insert test data into the table (This is already done)

5. Ensure spark master and workers are running. Start Spark shell

6. spark-shell --driver-class-path /home/hduser/mysql-connector-java-
   5.1.39/mysql-connector-java-5.1.39-bin.jar
54
DataFrames Lab -4
Connecting SPARK to JDBC data source
7. import org.apache.spark.sql._

8. val sqlContext = new SQLContext(sc)

9. val url="jdbc:mysql://127.0.0.1:3306/testdb"

10. val prop = new java.util.Properties

11. prop.setProperty("user","root")

12. prop.setProperty("password","root")

Ref: http://www.infoobjects.com/dataframes-with-apache-spark/

55
DataFrames Lab -4
Connecting SPARK to JDBC data source

13. val people = sqlContext.read.jdbc(url,"person",prop)

14. people.show

15. val males = sqlContext.read.jdbc(url, "person", Array("gender='M'"), prop)

16. males.show

17. val below60 = people.filter(people("age") < 60)

18. below60.show

Ref: http://www.infoobjects.com/dataframes-with-apache-spark/

56
DataFrames Lab -4
Connecting SPARK to JDBC data source

57
DataFrames Lab -5
Connecting SPARK to Hive data source

1. val Sales = sc.textFile("/user/data/Sales.csv")

2. case class Salesclass(custid: String, pid: String, qty: Integer, date:
   String, salesid: String)

3. val Salesdet = Sales.map(_.split(",")).map(p => Salesclass(p(0),
   p(1), p(2).toInt, p(3), p(4)))

4. Salesdet.collect().foreach(println) // note that every line is a tuple, i.e. a
   record

5. val SalesDF = Salesdet.toDF() //converting RDD to DF

6. SalesDF.show //note the tabular format Vs the tuple

7. SalesDF.write.mode("overwrite").saveAsTable("salesdet") //Storing
RDD as a table in Hive

58
DataFrames Lab -5
Connecting SPARK to Hive data source

======== Using Hive Context to Read the table back from Hive =====

9. import org.apache.spark.sql.hive._

10. val hc = new HiveContext(sc)

11. val sales = hc.table("salesdet")

12. sales.count // Number of records in the table

13. sales.select("*").show //show all the records in the table

59
Data Frames

val SalesDF = Salesdet.toDF()

[Figure: the Salesdet RDD of Salesclass tuples converted to the SalesDF DataFrame with named columns.]

http://spark.apache.org/docs/latest/sql-programming-guide.html#dataframes

60
Local Vs Cluster mode
Impact

61
Local vs. cluster modes
1. The behavior of code may depend on the way the application is executed /
   deployed. Results when deployed locally vs. on a cluster may differ for the
   same code.

a. var counter = 0
b. val somerdd = sc.parallelize(data)
c. //unreliable code : Don't do this!!
d. somerdd.foreach(x => counter += x)
e. println("Counter value: " + counter)

2. Spark breaks up the processing of RDD operations into tasks, each of
   which is executed by an executor. Prior to execution, Spark computes
   the task's closure.

3. The closure is the set of variables and methods which must be visible for the
   executor to perform its computations on the RDD (in this case foreach()).
   This closure is serialized and sent to each executor.
62
Local vs. cluster modes
4. The variables within the closure sent to each executor are now copies and
thus, when counter is referenced within the foreach function, it’s no longer
the counter on the driver node.

5. There is still a counter in the memory of the driver node but this is no longer
visible to the executors! The executors only see the copy from the serialized
closure.

6. Thus, the final value of counter will still be zero since all operations
on counter were referencing the value within the serialized closure.

7. To ensure well-defined behavior in these sorts of scenarios one should use
   an Accumulator. Accumulators in Spark are used specifically to provide a
   mechanism for safely updating a variable when execution is split up across
   worker nodes in a cluster.

8. Accumulators are variables that are only "added" to through an associative
   and commutative operation and can therefore be efficiently supported in
   parallel. They can be used to implement counters (as in MapReduce) or sums.

63
Accumulators
1. Accumulators are variables that are only “added” to through an associative
operation and can therefore, be efficiently supported in parallel.

2. They can be used to implement counters (as in MapReduce) or sums.

3. Can be accessed only by the driver. The workers can only update but not
read the accumulators

1. val accum = sc.accumulator(0)
2. sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum += x)

64
Broadcast Variables
1. Broadcast variables allow the programmer to keep a read-only variable
cached on each machine rather than shipping a copy of it with tasks

2. Explicitly creating broadcast variables is only useful when tasks across
   multiple stages need the same data or when caching the data in deserialized
   form is important.
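
A minimal sketch of a broadcast variable used as a small lookup table shared by all tasks (the lookup data is made up):

   // A small lookup map we want every executor to hold locally
   val countryNames = Map("IN" -> "India", "US" -> "United States", "DE" -> "Germany")
   val bcNames = sc.broadcast(countryNames)

   val codes = sc.parallelize(Seq("IN", "DE", "US", "IN"))

   // Each task reads the broadcast value instead of shipping the map with every task
   val named = codes.map(code => bcNames.value.getOrElse(code, "Unknown"))
   named.collect()   // Array(India, Germany, United States, India)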

65
Spark Streaming

66
Spark Streaming

1. Spark Streaming is an extension of the core Spark API that enables scalable,
high-throughput, fault-tolerant stream processing of live data streams

2. Spark Streaming provides a high-level abstraction called a discretized
   stream or DStream

[Figure: input stream -> Spark Streaming -> DStream of batches -> HDFS]

3. Spark Streaming receives live input data streams and divides the data into
   batches, chopping the live stream up into batches of X seconds,

4. which are then processed by the Spark engine to generate the final stream of
   results in batches (DStreams again)

Source: http://spark.apache.org/docs/latest/streaming-programming-guide.html

67
Spark Streaming
Discretized Streams

• Refers to batches of input data stream or processed output data
  stream

• Internally, a DStream is represented by a continuous series of RDDs,
  which is Spark's abstraction of an immutable, distributed dataset

• Each RDD in a DStream contains data from a certain interval, as
  shown in the following figure.

Ref: https://spark.apache.org/docs/1.6.1/streaming-programming-guide.html#linking
68
Spark Streaming

val ssc = new StreamingContext(sparkContext, Seconds(1))
val tweets = TwitterUtils.createStream(ssc, auth)

[Figure: the incoming tweet stream is sliced along the time axis into 1-second batches.]

The original stream is converted into a discretized stream based on time lapse

Source: https://stanford.edu/~rezab/sparkclass/slides/td_streaming.pdf

69
Spark Streaming

val tweets = TwitterUtils.createStream(ssc, auth)
val hashTags = tweets.flatMap(status => getTags(status))

[Figure: each 1-second batch of the tweets DStream is mapped to a batch of the hashTags DStream.]

A new "hashTags" DStream is created out of the tweets DStream through the
transformation process of flatMap

Source: https://stanford.edu/~rezab/sparkclass/slides/td_streaming.pdf
70
Spark Streaming

val tweets = TwitterUtils.createStream(ssc, None)
val hashTags = tweets.flatMap(status => getTags(status))
hashTags.saveAsHadoopFiles("hdfs://...")

[Figure: every 1-second batch of the hashTags DStream is written out to HDFS.]

Source: https://stanford.edu/~rezab/sparkclass/slides/td_streaming.pdf
71
Spark Streaming Lab -6

1. Using Spark's streaming context and socket streaming function,
   build a word count program that will count the frequency of words in
   each input data point in a stream, where a data point is one full line
   of words terminated with a newline character ('\n')

2. The source of input stream will be one terminal where we will start
netcat server at port 9999 and key in our sentences

3. Spark Stream application will listen for data on same host and
same port i.e. 9999. The application will count frequency of words
in each line that we input.

72
Spark Streaming Lab -6 (Contd...)

• The sentence on the right is captured by the Spark Streaming program and
  broken into individual words to reflect the frequency of each word

73
Spark Streaming Lab -6 (Contd...)

Objective – to demonstrate the use of the Spark streaming context with socket
streaming

$> nc -lk 9999    # run this on one terminal and the following on another

$> $SPARK_HOME/bin/spark-submit --class "NetworkWordCount" \
   --master local[4] ./target/scala-2.11/sparknetworkstreaming_2.11-1.0.jar 127.0.0.1 9999

Ref:
https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/Network
WordCount.scala
74
Spark Streaming Lab -6 (Contd...)

// Imports needed by this example (the continuation on the next slide also uses StorageLevel)
import org.apache.log4j.{Level, Logger}
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}

object NetworkWordCount {
def main(args: Array[String]) {
if (args.length < 2) {
System.err.println("Usage: NetworkWordCount <hostname> <port>")
System.exit(1)
}

Logger.getLogger("org").setLevel(Level.OFF)
Logger.getLogger("akka").setLevel(Level.OFF)

// Create the context with a 1 second batch size
val sparkConf = new SparkConf().setAppName("NetworkWordCount")
val ssc = new StreamingContext(sparkConf, Seconds(1))
Ref:
https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/Network
75 WordCount.scala
Spark Streaming Lab -6 (Contd...)

// Create a socket stream on target ip:port and count the words in
// the input stream of \n delimited text

val lines = ssc.socketTextStream(args(0), args(1).toInt,
  StorageLevel.MEMORY_AND_DISK_SER)
val words = lines.flatMap(_.split(" "))
val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
wordCounts.print()
ssc.start()
ssc.awaitTermination()
}
}
Ref:
https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/Network
76 WordCount.scala
Streaming live tweets from Twitter.com

1. Spark shell does not support stream data capture from streaming
sources such as twitter.com

2. Since this feature (of connecting to Twitter.com and pulling tweets)
   is not core Spark functionality, it is not available by
   default

3. To capture tweets, one needs to download the Twitter4J libraries

4. The Twitter OAuth keys are also required

77
Streaming live tweets from Twitter.com
1. import org.apache.spark.streaming._
2. import org.apache.spark.streaming.twitter._
3. import org.apache.spark.streaming.StreamingContext._

4. System.setProperty("twitter4j.oauth.consumerKey", “ABCD...")
5. System.setProperty("twitter4j.oauth.consumerSecret", “EFGH....")
6. System.setProperty("twitter4j.oauth.accessToken", “81498....")
7. System.setProperty("twitter4j.oauth.accessTokenSecret", “XYZ...")

8. var ssc = new StreamingContext(sc, Seconds(1))
9. var tweets = TwitterUtils.createStream(ssc, None)
10. var statuses = tweets.map(_.getText)
11. statuses.print()

12. ssc.start()
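
As an optional extension, a sketch only, building on the statuses DStream above: hashtags could be extracted and counted per batch.

   // Extract hashtags from each tweet's text and count them in every 1-second batch
   val hashTags = statuses.flatMap(text => text.split(" ").filter(_.startsWith("#")))
   val tagCounts = hashTags.map(tag => (tag, 1)).reduceByKey(_ + _)
   tagCounts.print()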

78
Some useful URLS to learn Spark Streaming Analysis

examples/src/main/scala/org/apache/spark/examples/streaming/TwitterAlgebi
rdCMS.scala
examples/src/main/scala/org/apache/spark/examples/streaming/TwitterAlgebi
rdHLL.scala
examples/src/main/scala/org/apache/spark/examples/streaming/TwitterHash
TagJoinSentiments.scala
examples/src/main/scala/org/apache/spark/examples/streaming/TwitterPopul
arTags.scala

For DataFrames

http://www.infoobjects.com/dataframes-with-apache-spark/
http://blog.jaceklaskowski.pl/2015/07/20/real-time-data-processing-using-
apache-kafka-and-spark-streaming.html
http://ingest.tips/2015/06/24/real-time-analytics-with-kafka-and-spark-
streaming/

79
Apache Spark Deployment Modes

80
Spark Application Deployment

1. The machine where the Spark application process (the one that creates the
   SparkContext) is running is the "driver" node, with the process being called the
   driver process
2. The cluster manager could be a
   standalone cluster, Mesos or YARN.

3. A worker node is a node machine that
   hosts one or multiple Workers.

4. A Worker is launched within a single JVM
   (1 process) with access to a configured
   number of resources in the node machine,
   e.g. number of cores per Worker and
   RAM memory. A Worker can spawn and
   own multiple Executors.

5. An Executor is also launched within a
   single JVM (1 process), created by
   the Worker, and is in charge of running
   multiple Tasks in a ThreadPool.

81
Spark Application Deployment
Spark Standalone Master Slave Architecture

[Figure: Spark standalone master-slave architecture: a driver node & client, a
master node, and multiple worker nodes.]
• Spark driver node – where the spark app is initiated/ its driver class is executed
• Spark master node – where cluster manager is initiated
• Spark slave nodes – where the slave processes are initiated
82
Spark Application Deployment

Spark Standalone Master Slave Architecture

[Figure: driver node & client, Spark master node, and Spark worker nodes.]

• Spark cluster (blue arrows) with HDFS cluster (red arrows).
• To access the HDFS data set "raw_tweets.txt" the examples.jar application has the line
  tweets = sc.textFile("hdfs://172.31.27.148/home/hduser/sampledata/raw_tweets.txt")
83
Spark Application Deployment
Spark Application Driver as YARN Client

[Figure: driver node & client, YARN master node, and YARN slave nodes.]

• Spark applications run on the YARN cluster. No Spark masters & workers
• The Spark application driver (blue revolving circle) runs on the driver node

84
Spark Application Deployment

Spark Application Driver as YARN Client

1. Suitable for interactive mode, such as using the Spark shell
2. The application driver runs on the machine from where the application was submitted
3. The AppMaster is started and used only to negotiate resources in the cluster
4. The AppMaster instructs NodeManagers to start containers on its behalf
5. The client interacts with the executors / containers on the worker nodes

85
Spark Application Deployment
Spark Application Driver in cluster mode

[Figure: driver node, YARN master node, and YARN slave nodes; the driver runs inside the cluster.]

• The Spark application is deployed on the YARN cluster (shown with the blue arrow)
• The ResourceManager is used to manage computing resources
• The Spark application driver program (revolving blue circle) runs on the cluster

86
Spark Application Deployment
Spark Application Driver in Cluster Mode
1. Not suited to using Spark
interactively

2. The application instance is allocated an
   AppMaster (a YARN process)

3. The AppMaster is responsible for requesting
   resources from the ResourceManager

4. The AppMaster instructs NodeManagers to
   start containers on its behalf

5. The AppMaster makes the need for an active
   client unnecessary

87
Spark Application Deployment

Local mode:
a) Install Spark on a local Unix box
b) Launch all the Spark daemons together using "start-all.sh". Note that
   this script name is also used in Hadoop; you may need to rename it if
   Hadoop daemons are running on the same system
c) Check that all the Spark servers are up and running (use the jps utility)
d) Good for debugging. To invoke the Spark shell in local mode:
   $SPARK_HOME/bin/spark-shell --master local

88
Thanks

89
