Agenda
Apache Spark Overview
What is Spark
Why Spark
1. MapReduce as a big data processing technology has proven to be the solution of choice for processing large data sets.
Hadoop and Spark Ecosystem
1. Spark is a processing engine
2. It occupies the same place in the Hadoop stack as MapReduce
[Diagram: the Hadoop stack, with Pig and Hive running on top of the MapReduce / Spark processing layer]
Hadoop and Spark Ecosystem
1. Spark can spawn its own servers to process data on HDFS, i.e. without using the ResourceManager and ApplicationMaster of YARN
Apache Spark Ecosystem
SPARK - Ecosystem
Spark Streaming:
For real-time streaming data processing, based on micro-batch style computing and processing.
Spark SQL:
Spark SQL provides the capability to expose Spark datasets over the JDBC API and allows running SQL-like queries on Spark data using traditional BI and visualization tools.
Spark MLlib:
MLlib is Spark's scalable machine learning library, consisting of common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering and dimensionality reduction, as well as underlying optimization primitives.
Spark GraphX:
GraphX is Spark's API for graphs and graph-parallel computation.
1. Start the Spark cluster and connect using the Spark shell as given below
a. Start the Spark daemons. Spark and Hadoop ship start scripts with the same name, so the Spark copy is renamed here
$ > start-allspark.sh (to start the Spark master and worker processes)
2. Start the Spark shell ($ > spark-shell)
3. The Spark shell will connect to the Spark master and create a SparkContext called sc.
4. The Spark shell will display a Scala prompt as in the screen below
Spark Lab -1 Starting Clusters
$> start-dfs.sh
$> jps
18831 Worker (Spark cluster)
18651 Master (Spark cluster)
21545 SecondaryNameNode (HDFS layer)
21346 DataNode (HDFS layer)
21222 NameNode (HDFS layer)
21669 Jps
Spark Context
SPARK - Context
• SparkContext is the object that manages the connection to the Spark cluster; it is the entry point for creating RDDs and scheduling jobs
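In the Spark shell, sc is created automatically; in a standalone application it is constructed explicitly. A minimal sketch (Spark 1.x API; the app name and master URL are placeholders):
import org.apache.spark.{SparkConf, SparkContext}
val conf = new SparkConf().setAppName("MyApp").setMaster("spark://master:7077") // placeholder master URL
val sc = new SparkContext(conf) // manages the connection to the cluster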
Resilient Distributed Datasets
Resilient Distributed Datasets
1. A Resilient Distributed Dataset (RDD) is
a. the primary abstraction in Spark
b. a fault-tolerant collection of data partitions distributed across the cluster
c. operated on in parallel
Spark Lab -1 Creating RDD
1. To access an HDFS file in Spark, we create an RDD, e.g.
a. scala> val log_file = sc.textFile("/user/data/apache_log.txt")
b. scala> val errs = log_file.filter(line => line.contains("error"))
c. scala> errs.collect().foreach(println)
2. In Scala, val defines a fixed value that cannot change; var defines a variable.
3. Line 1.a defines the base RDD "log_file", created from the HDFS text file
4. Line 1.b defines another RDD, "errs", derived from the base RDD "log_file" through a "filter" transformation
5. Note the word "defines" in points 3 and 4 for "log_file" and "errs" respectively, rather than "creates". The reason is:
a) These steps in the program are converted into a DAG (Directed Acyclic Graph) of tasks.
b) One of the tasks is to create the "log_file" RDD and another is to create the "errs" RDD out of the "log_file" RDD.
c) The fact that "errs" is derived from a "filter" on "log_file" (a.k.a. its lineage) is noted down in the "errs" RDD meta information.
d) The tasks are executed only when an "action" task in the DAG is executed on the RDD.
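A quick way to observe this laziness in the shell (same file as in the lab above):
scala> val log_file = sc.textFile("/user/data/apache_log.txt") // no data is read yet
scala> val errs = log_file.filter(_.contains("error")) // still no job; only lineage is recorded
scala> errs.count() // action: the DAG is now scheduled and tasks actually run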
SPARK DAG Scheduler
[Diagram: the Spark DAG scheduler groups the tasks (task 1 ... task n) into stages (Stage 1, Stage 2) and submits them to the YARN scheduler; each stage turns the original RDD into a transformed RDD]
7. The RDD structure that abstracts the scattered blocks of data on the cluster holds the following information:
a. A set of partitions (atomic pieces of the dataset)
b. A set of dependencies on parent RDDs
c. A function for computing the dataset based on its parents
d. Metadata about its partitioning scheme (hash partition or otherwise)
e. Data placement / preferred location for each partition
8. Each RDD ("log_file" and "errs" from lines 1.a and 1.b of the earlier lab) will have this structure associated with it.
9. The RDD structure contains the actual partition locations (7.e) only once the data is materialized on the cluster.
10. The RDD structure is maintained by the driver program
RDD Operations
RDD Operations
1. Spark allows two types of operations on an RDD
a. Transformations – create a new RDD from an existing RDD.
For example, the "errs" RDD was created from a "filter"
transformation of the "log_file" RDD
b. Actions – return a final value to the driver program or write data to external storage (see the Actions slide below)
RDD Transformations
[Diagram: RDD 1 → transformation T → RDD 2]
Transformations Meaning
map(func) Return a new distributed dataset formed by passing
each element of the source through a function denoted
by func. map transforms an RDD of length N into
another RDD of length N.
Ref: http://spark.apache.org/docs/latest/programming-guide.html#rdd-operations
1. // base RDD
a. val lines = sc.textFile("/user/data/blackhole.txt")
2. // transformed RDDs
a. val errors = lines.filter(_.contains("ERROR"))
b. errors.cache()
Actions (sample only)
Actions are operations that return a final value to the driver program or write data to an external storage system; they force the evaluation of the transformations in the RDD's lineage.
[Diagram: RDD 1 → action A → results (e.g. X = 10, Y = 14) returned to the driver or written to hard disk]
Actions Meaning
reduce(func) Aggregate the elements of the dataset using a function func
collect() Return all the elements of the dataset as an array at the driver
program
count() Return the number of elements in the dataset
first() Return the first element of the dataset
saveAsTextFile(path) Write the elements of the dataset as a text file (or set of text files) in a
given directory in the local file system, HDFS or any other Hadoop-
supported file system
Ref: http://spark.apache.org/docs/latest/programming-guide.html#rdd-operations
1. Action
a. val errors = lines.filter(_.contains("ERROR")) // transformation: nothing runs yet
b. errors.collect().foreach(println) // action: triggers the job and returns the results to the driver
Resilience
Why is it called resilient?
[Diagram: three RDDs – Apache_log, Errors, Top_Errors – with their partitions spread across cluster Nodes A to D]
1. Imagine you executed a Spark application that created the three RDDs shown above.
3. The "Errors" RDD is derived from "Apache_log" through a "filter" transformation, and its data blocks are stored in partitions 1 and 2 on Node B and Node D respectively
4. The "Top_Errors" RDD is created through a transformation of "Errors", and its partitions 1 to 4 are scattered on Nodes A to D respectively
5. If Node D fails, the driver can reconstruct the lost partitions of Apache_log, Errors and Top_Errors elsewhere on the cluster using the lineage information
What is resilience?
[Diagram: the transformation sequence (lineage) Apache_log → Errors → Top_Errors]
Paired RDD
Pair RDD
4. Pair RDDs consist of keys and values. For example, in the picture below, the "Fruit" column is the key column and the "Color" column holds the values
[Diagram: a two-column table with Fruit as the key column and Color as the value column]
Pair RDD
9. groupByKey() (avoid)
a. Extracts the elements with the same key and transmits them to another location for further summation.
[Diagram: word-count pairs such as (The,1), (Little,1), (Fox,1), (Jumped,1), (Over,1), (Brown,1), (Dog,1), (Lazy,1) are all shuffled across the network and grouped by key before being summed, e.g. The → 4, Little → 4, Over → 2]
b. All the keys and values transmit over the wire, leading to a lot of data movement
Pair RDD
10. reduceByKey()
a. Extracts the elements with the same key and creates summary totals before the shuffle
[Diagram: pairs are pre-aggregated locally on each node, e.g. (The,2) and (The,2) travel instead of four (The,1) pairs, and are combined after the shuffle into The → 4, Little → 4, Fox → 1, Jumped → 1, Over → 1, Brown → 1, Dog → 1, Lazy → 1]
b. Far fewer keys and values travel over the wire than in groupByKey()
https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/best_practices/prefer_reducebykey_over_groupbykey.html
PairRDD Lab -2
1. Create the following pair RDD and perform the steps. What do you see?
a. val pairRDD = sc.parallelize(List((1, 2), (3, 4), (3, 6)))
b. val keySum = pairRDD.reduceByKey( (x, y) => x + y )
c. keySum.collect().foreach(println)
Ans:
(3,10)
(1,2)
2. Now try groupByKey() on the same pair RDD (reconstructed step; it matches the second answer shown):
a. val keyGroup = pairRDD.groupByKey()
b. keyGroup.collect().foreach(println)
Ans:
(3,CompactBuffer(6, 4))
(1,CompactBuffer(2))
Spark Lab -3 WordCount (Scala)
1. Copy the following Scala snippet into the Spark shell
a. val lines = sc.textFile("/user/data/blackhole.txt")
b. val words = lines.flatMap(line => line.split(' '))
c. val counts = words.map(word => (word, 1)).reduceByKey{ case (x, y) => x + y }
d. counts.saveAsTextFile("/user/data/sparkwc53")
2. In 1.a we defined an RDD called "lines" created from an HDFS text file. This is also known as the base RDD.
3. Transformations include:
1. In 1.b the "words" RDD is defined as a transformation of the "lines" RDD using the "flatMap" function, which splits each line in the "lines" RDD into tokens based on the space character
2. In 1.c the "counts" RDD is defined as a transformation of the "words" RDD using the "map" function which, for every word, emits that word with a "1". For example, when it comes across the word "black", the map function makes it ("black", 1).
3. The "reduceByKey" function works on the output of the "map" function to collect all the same words and total the "1"s for each word
4. Actions include:
1. saveAsTextFile is the action which triggers the creation of all the RDDs and stores the contents of the "counts" RDD in a directory
Spark Lab -3 WordCount (Scala) as standalone program
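A minimal sketch of the same word count as a standalone application (assuming the Spark 1.x API used in this deck; the master URL is supplied via spark-submit):
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    // In a standalone program the SparkContext is created explicitly
    val conf = new SparkConf().setAppName("WordCount")
    val sc = new SparkContext(conf)
    val lines = sc.textFile("/user/data/blackhole.txt")
    val counts = lines.flatMap(_.split(' ')).map(word => (word, 1)).reduceByKey(_ + _)
    counts.saveAsTextFile("/user/data/sparkwc53")
    sc.stop()
  }
}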
RDD Persistence
9. Each persisted RDD can be stored using a different storage level; for example, persist the dataset on disk, or persist it in memory but in serialized form, etc.
Spark RDD Persistence Lab -1
1. scala> import org.apache.spark.storage.StorageLevel
2. scala> val lines = sc.textFile("/user/data/blackhole.txt") // assumed setup; the original steps 1-2 were not captured
3. scala> lines.persist(StorageLevel.MEMORY_ONLY)
– We can also use the cache() method if we need the MEMORY_ONLY storage level
4. scala> lines.count() // this is the point where the "lines" RDD is actually cached. Note down the time taken to give the count.
RDD Persistence
Storage Level Meaning
MEMORY_ONLY Store RDD as deserialized Java objects in the JVM. If the RDD does not
fit in memory, some partitions will not be cached and will be recomputed
on the fly each time they're needed. This is the default level.
MEMORY_AND_DISK Store RDD as deserialized Java objects in the JVM. If the RDD does not
fit in memory, store the partitions that don't fit on disk, and read them
from there when they're needed.
MEMORY_ONLY_SER Store RDD as serialized Java objects (one byte array per partition). This
is generally more space-efficient than deserialized objects, especially
when using a fast serializer, but more CPU-intensive to read.
Spark SQL - RDDs and DataFrames
Data Frames
4. DataFrames can be used to write queries using the HiveQL parser and to read data from Hive tables
DataFrames Lab -3
1. import org.apache.spark.sql._
4. case class Salesclass(custid: String, pid: String, qty: Integer, date: String, salesid: String)
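The remaining steps of this lab were not captured; a plausible sketch under the Spark 1.x API used elsewhere in this deck, assuming a comma-separated sales file at a hypothetical path:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
// Build an RDD of Salesclass objects from a hypothetical CSV file
val salesRDD = sc.textFile("/user/data/sales.txt")
  .map(_.split(","))
  .map(f => Salesclass(f(0), f(1), f(2).trim.toInt, f(3), f(4)))
val SalesDF = salesRDD.toDF() // convert the RDD of case-class objects to a DataFrame
SalesDF.registerTempTable("sales") // expose it to SQL queries
sqlContext.sql("SELECT pid, SUM(qty) FROM sales GROUP BY pid").show()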
DataFrames Lab -4
Connecting Spark to a JDBC data source
1. Download the MySQL connector jar from http://dev.mysql.com/downloads/file/?id=462849
2. The connector jar file should be available on the client and all worker nodes. Here it is kept in /home/hduser/mysql-connector-java-5.1.39/mysql-connector-java-5.1.39-bin.jar
5. Ensure the Spark master and workers are running. Start the Spark shell
9. val url = "jdbc:mysql://127.0.0.1:3306/testdb"
11. prop.setProperty("user","root")
12. prop.setProperty("password","root")
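Steps 10 and 13 were not captured; a plausible reconstruction using the standard Spark 1.x JDBC reader (the table name "person" is an assumption inferred from the "people" DataFrame used in the next slide):
10. val prop = new java.util.Properties
13. val people = sqlContext.read.jdbc(url, "person", prop) // "person" is a hypothetical table name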
Ref: http://www.infoobjects.com/dataframes-with-apache-spark/
DataFrames Lab -4
Connecting Spark to a JDBC data source
14. people.show
16. males.show
18. below60.show
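Steps 15 and 17 (the definitions of "males" and "below60") were not captured; a plausible sketch, with the column names gender and age assumed:
15. val males = people.filter("gender = 'M'") // hypothetical column and value
17. val below60 = people.filter("age < 60") // hypothetical column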
Ref: http://www.infoobjects.com/dataframes-with-apache-spark/
DataFrames Lab -5
Connecting Spark to a Hive data source
7. SalesDF.write.mode("overwrite").saveAsTable("salesdet") // storing the DataFrame as a table in Hive
DataFrames Lab -5
Connecting Spark to a Hive data source
======== Using Hive Context to read the table back from Hive ========
9. import org.apache.spark.sql.hive._
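The remaining steps were not captured; a minimal sketch using the Spark 1.x HiveContext:
10. val hiveContext = new HiveContext(sc)
11. val salesBack = hiveContext.table("salesdet") // read back the table stored in step 7
12. salesBack.show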
Data Frames
http://spark.apache.org/docs/latest/sql-programming-guide.html#dataframes
Local vs Cluster Mode Impact
Local vs. cluster modes
1. The behavior of code may depend on the way the application is executed / deployed. Results when deployed locally vs. on a cluster may differ for the same code.
a. var counter = 0
b. val somerdd = sc.parallelize(data)
c. // unreliable code: don't do this!!
d. somerdd.foreach(x => counter += x)
e. println("Counter value: " + counter)
2. Prior to executing the foreach(), Spark computes the task's closure.
3. The closure is the set of variables and methods which must be visible for the executor to perform its computations on the RDD (in this case foreach()). This closure is serialized and sent to each executor.
Local vs. cluster modes
4. The variables within the closure sent to each executor are now copies; thus, when counter is referenced within the foreach function, it is no longer the counter on the driver node.
5. There is still a counter in the memory of the driver node, but it is no longer visible to the executors! The executors only see the copy from the serialized closure.
6. Thus, the final value of counter will still be zero, since all operations on counter were referencing the value within the serialized closure.
Accumulators
1. Accumulators are variables that are only "added" to through an associative operation and can therefore be efficiently supported in parallel.
3. Accumulators can be read only by the driver. The workers can only update, but not read, them.
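A sketch of the earlier counter example rewritten with an accumulator (Spark 1.x API; the sample data is illustrative):
val counter = sc.accumulator(0) // created on the driver
val somerdd = sc.parallelize(1 to 100)
somerdd.foreach(x => counter += x) // executors can add to it
println("Counter value: " + counter.value) // only the driver reads the final value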
Broadcast Variables
1. Broadcast variables allow the programmer to keep a read-only variable
cached on each machine rather than shipping a copy of it with tasks
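For example (the lookup map here is illustrative):
val severity = sc.broadcast(Map("ERROR" -> 3, "WARN" -> 2, "INFO" -> 1)) // shipped once per executor, not once per task
val levels = sc.parallelize(Seq("ERROR", "INFO", "ERROR"))
levels.map(level => severity.value(level)).collect().foreach(println)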
Spark Streaming
1. Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams
3. Spark Streaming receives live input data streams (e.g. from HDFS or network sockets) and divides the data into batches, chopping up the live stream into batches of X seconds
4. These batches are then processed by the Spark engine to generate the final stream of results, again in batches (DStreams)
Source: http://spark.apache.org/docs/latest/streaming-programming-guide.html
Spark Streaming
Discretized Streams
Ref: https://spark.apache.org/docs/1.6.1/streaming-programming-guide.html#linking
Spark Streaming
[Diagram: the input stream is divided along the time axis into batches at 1-second intervals]
Source: https://stanford.edu/~rezab/sparkclass/slides/td_streaming.pdf
Spark Streaming Lab -6
2. The source of the input stream will be one terminal, where we will start a netcat server on port 9999 and key in our sentences
3. The Spark Streaming application will listen for data on the same host and port, i.e. 9999. The application will count the frequency of words in each line that we input.
Spark Streaming Lab -6 (Contd...)
nc -lk 9999  # run this on one terminal and the following on another
Ref: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/NetworkWordCount.scala
Spark Streaming Lab -6 (Contd...)
import org.apache.log4j.{Level, Logger}

object NetworkWordCount {
  def main(args: Array[String]) {
    if (args.length < 2) {
      System.err.println("Usage: NetworkWordCount <hostname> <port>")
      System.exit(1)
    }
    Logger.getLogger("org").setLevel(Level.OFF)
    Logger.getLogger("akka").setLevel(Level.OFF)
    // ... (the rest of the example creates the StreamingContext and counts words; see the Ref link above)
  }
}
1. The Spark shell does not support stream data capture from streaming sources such as twitter.com
Streaming live tweets from Twitter.com
1. import org.apache.spark.streaming._
2. import org.apache.spark.streaming.twitter._
3. import org.apache.spark.streaming.StreamingContext._
4. System.setProperty("twitter4j.oauth.consumerKey", "ABCD...")
5. System.setProperty("twitter4j.oauth.consumerSecret", "EFGH....")
6. System.setProperty("twitter4j.oauth.accessToken", "81498....")
7. System.setProperty("twitter4j.oauth.accessTokenSecret", "XYZ...")
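Steps 8-11 were not captured; a plausible sketch based on the standard TwitterUtils example (the batch interval and the printed output are assumptions):
8. val ssc = new StreamingContext(sc, Seconds(10)) // 10-second batches, assumed interval
9. val tweets = TwitterUtils.createStream(ssc, None) // None: use the OAuth credentials set above
10. val statuses = tweets.map(status => status.getText)
11. statuses.print()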
12. ssc.start()
Some useful URLs to learn Spark Streaming analysis
examples/src/main/scala/org/apache/spark/examples/streaming/TwitterAlgebirdCMS.scala
examples/src/main/scala/org/apache/spark/examples/streaming/TwitterAlgebirdHLL.scala
examples/src/main/scala/org/apache/spark/examples/streaming/TwitterHashTagJoinSentiments.scala
examples/src/main/scala/org/apache/spark/examples/streaming/TwitterPopularTags.scala
For DataFrames:
http://www.infoobjects.com/dataframes-with-apache-spark/
http://blog.jaceklaskowski.pl/2015/07/20/real-time-data-processing-using-apache-kafka-and-spark-streaming.html
http://ingest.tips/2015/06/24/real-time-analytics-with-kafka-and-spark-streaming/
Apache Spark Deployment Modes
Spark Application Deployment
1. The machine where the Spark application process (the one that creates the SparkContext) is running is the "Driver" node, with the process being called the Driver process
2. The cluster manager could be: Standalone cluster, Mesos or YARN
Spark Application Deployment
Spark Standalone Master-Slave Architecture
[Diagram: a driver node & client, a master node, and two worker nodes]
• Spark driver node – where the Spark application is initiated / its driver class is executed
• Spark master node – where the cluster manager is initiated
• Spark slave nodes – where the slave (worker) processes are initiated
Spark Application Deployment
Spark Application Driver in cluster mode
[Diagram: in cluster mode the driver process runs on a node inside the cluster rather than on the client]
Spark Application Deployment
Spark Application Driver in Cluster Mode
1. Not suited to using Spark interactively
2. The application instance is allocated an AppMaster (YARN ApplicationMaster) process
4. The AppMaster instructs the NodeManagers to start containers on its behalf
Spark Application Deployment
Local mode:
a) Install Spark on a local Unix box
b) Launch all the Spark daemons together using "start-all.sh". Please note, this script name is also used in Hadoop; you may need to change it if you wish to run on the same system where the Hadoop daemons are also running
c) Check that all the Spark servers are up and running (use the jps utility)
d) Good for debugging. To invoke Spark in local mode:
$SPARK_HOME/bin/spark-shell --master local
Thanks