
Spark ETL

How to create an optimal daily fantasy baseball roster.


Chicago Hadoop Users Group
May 12, 2015
Don Drake

don@drakeconsulting.com
@dondrake
Overview
Who am I?
ETL
Daily Fantasy Baseball
Spark
Data Flow
Extracting - web crawler
Parquet
DataFrames
Python or Scala?
Transforming - RDDs
Transforming - Moving Average
Who Am I?
Don Drake @dondrake
Currently: Principal Big Data Consultant @ Allstate
5 years consulting on Hadoop
Independent consultant for last 14 years
Previous clients:
Navteq / Nokia
Sprint
Mobile Meridian - Co-Founder
MailLaunder.com - my SaaS anti-spam service
cars.com
Tribune Media Services
Family Video
Museum of Science and Industry
ETL
Informally: Any repeatable programmed data movement
Extraction - Get data from another source
Oracle/PostgreSQL
CSV
Web crawler
Transform - Normalize formatting of phone #s, addresses
Create surrogate keys
Joining data sources
Aggregate
Load -
Load into data warehouse
CSV
Sent to predictive model
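As an illustration of a transform step, here is a minimal sketch of normalizing US phone numbers (the function name and target format are hypothetical, not from the talk):

import re

def normalize_phone(raw):
    digits = re.sub(r"\D", "", raw or "")
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]                      # drop the country code
    if len(digits) != 10:
        return None                              # flag as a data quality issue
    return "(%s) %s-%s" % (digits[:3], digits[3:6], digits[6:])

print normalize_phone("312.555.0142")            # (312) 555-0142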
Spark
Apache Spark is a fast and general purpose engine for large-scale data
processing.
It provides high-level APIs in Java, Scala, and Python (REPLs for Scala
and Python)
Includes an advanced DAG execution engine that supports in-memory
computing
RDD (Resilient Distributed Dataset) - the core construct of the framework
Includes a set of high-level tools including Spark SQL for SQL and
structured data processing, MLlib for machine learning, GraphX for graph
processing and Spark Streaming
Can run in a cluster (Hadoop (YARN), EC2, Mesos), Standalone, Local
Open source, core committers from Databricks
Latest version is 1.3.1, which includes DataFrames
Started in 2009 (AMPLab) as a research project; an Apache project since 2013.
LOTS of momentum.
Daily Fantasy Baseball
Spark 101- Execution
Driver - your program's main() method
Only 1 per application
Executors - do the distributed work
As many as your cluster can handle
You determine ahead of time how many
You determine the amount of RAM required
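A minimal sketch of setting these up front via SparkConf (the numbers are made up; spark.executor.instances applies when running on YARN):

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("etl-demo")
        .set("spark.executor.instances", "10")   # how many executors
        .set("spark.executor.memory", "4g")      # RAM per executor
        .set("spark.executor.cores", "2"))       # cores per executor
sc = SparkContext(conf=conf)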
Spark 101 - RDD
RDD - Resilient Distributed Dataset
Can be created from Hadoop Input formats (text
file, sequence file, parquet file, HBase, etc.) OR by
transforming other RDDs.
RDDs have actions, which return values, and transformations,
which return pointers to new RDDs
RDDs can contain anything

scala> val textFile = sc.textFile("README.md")

textFile: spark.RDD[String] = spark.MappedRDD@2ee9b6e3

scala> textFile.count() // Number of items in this RDD

res0: Long = 126

scala> textFile.first() // First item in this RDD

res1: String = # Apache Spark


DEMO
Spark 101 - DEMO
#!/bin/env python

from pyspark import SparkContext

sc = SparkContext()

# create some numbers
nums = sc.parallelize(xrange(1000))

nums.getNumPartitions()

nums.count()
Spark 101 - Lazy Demo
#!/bin/env python

from pyspark import SparkContext

sc = SparkContext()

# create some numbers
nums = sc.parallelize(xrange(1000))

# accumulator to count how many times the filter actually runs
check = sc.accumulator(0)

def isEven(x):
    check.add(1)
    return x % 2 == 0

evens = nums.filter(isEven)   # nothing runs yet - filter() is lazy

evens.count()

evens.collect()

check
Spark 101 - Lazy RDDs
RDDs are executed *only* when an action is called
upon them.
This allows multiple transformations on an RDD,
letting Spark compute an optimal execution plan.
Uncached RDDs are re-evaluated *every* time an action
is called upon them.
Cache RDDs if you know you will iterate over them
more than once.
You can cache to RAM + disk; by default, cache() persists to
RAM only.
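A minimal sketch of caching, reusing the lazy demo above (MEMORY_AND_DISK spills to disk when RAM runs short):

from pyspark import SparkContext, StorageLevel

sc = SparkContext()
nums = sc.parallelize(xrange(1000))
evens = nums.filter(lambda x: x % 2 == 0)
evens.persist(StorageLevel.MEMORY_AND_DISK)   # or evens.cache() for RAM only

evens.count()     # first action computes and caches the partitions
evens.collect()   # second action reuses the cached partitions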
Data Flow
Overview
Extraction & Ingestion
Extraction
We have seen it's easy to create a parallelized data
structure that we can execute code against
Pro-tip: if extracting from a relational database, use
Sqoop and save as Parquet
We need to download a set of files for each
baseball game previously played (player statistics)
TODO List Pattern
We know the filenames to download for each
game (they are static)
Download all files for a game in parallel
def getFiles(game):
    # download all static files for one game
    session = requests.Session()
    files = scrape.get_files([game], session=session)
    count, fails = scrape.download(files, DownloadMLB.cache)
    return (count, fails)

def summarize(a, x):
    # combine (count, fails) pairs from each game
    total = a[0] + x[0]
    alist = a[1]
    alist.extend(x[1])
    return (total, alist)

def run(self):

    sc = SparkContext()
    start_scrape = datetime.now()
    begin, begin_parts = scrape.get_boundary(self.begin)
    end, end_parts = scrape.get_boundary(self.end)

    session = requests.Session()

    all_years_months_days = self.getYearsMonths(self.WEB_ROOT, session)
    games = scrape.get_games(all_years_months_days, session=session)

    gamesRDD = sc.parallelize(games)
    print "gamesRDD=", gamesRDD

    gamesRDD.foreach(dump)
    print "# partitions:", gamesRDD.getNumPartitions()
    print "count=", gamesRDD.count()
    res = gamesRDD.map(getFiles).reduce(summarize)
    print "res=", res

    count = res[0]
    fails = res[1]
    end_scrape = datetime.now()
    self.log.info("%d files downloaded in %s", count,
                  str(end_scrape - start_scrape))
    if fails:
        for url in fails:
            self.log.error("failed to download %s", url)

    sc.stop()
TODO DEMO
Butter..Butter
Parquet?
Features
Interoperability - Spark, Impala, Sqoop, much more
Space efficient
Query efficient
Schema - (field name, data type, nullable)
Columnar Format
Different Encoding (Compression) Algorithms
Delta Encoding - diffs per row per column
Dictionary Encoding (~60k distinct items), e.g. dates, IP addresses
Run Length Encoding (for repetitive data)
Example: http://blog.cloudera.com/blog/2014/05/using-impala-at-scale-at-allstate/
http://parquet.apache.org/presentations/
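A minimal sketch of writing and reading Parquet with the Spark 1.3-era API used later in this deck (the path and column names are made up):

from pyspark import SparkContext
from pyspark.sql import SQLContext, Row

sc = SparkContext()
sqlContext = SQLContext(sc)

rows = sc.parallelize([Row(player_id="abc123", avg=0.301),
                       Row(player_id="xyz789", avg=0.275)])
df = sqlContext.createDataFrame(rows)

df.saveAsParquetFile("/tmp/batting.parquet")          # columnar, schema included
back = sqlContext.parquetFile("/tmp/batting.parquet")
back.printSchema()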
Using DataFrames - Spark SQL
Previously called SchemaRDD; DataFrames now contain extra
functionality to query/filter
Allow you to write SQL (joins, filters, etc.) against DataFrames (or
RDDs with a little effort)
All DataFrames contain a schema
batter_mov_avg = sqlContext.parquetFile(self.rddDir + "/" +
                                        "batter_moving_averages.parquet")
batter_mov_avg.registerTempTable("bma")
batter_mov_avg.persist(storageLevel=StorageLevel.MEMORY_AND_DISK)
print "batter_mov_avg=", batter_mov_avg.take(2)

batter_games = sqlContext.sql("select * from games g, bma, game_players gp \
    where bma.player_id = gp.player_id \
    and bma.game_date = gp.game_date \
    and gp.game_id = g.game_id \
    and g.game_date = gp.game_date \
    and g.game_date = bma.game_date")
DataFrames to the rescue
DataFrames offer a DSL that provides distributed
data manipulation
In Python, you can convert a DataFrame to a pandas data
frame and vice versa (see the sketch below)
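A sketch of that round trip, assuming a SQLContext named sqlContext, an existing DataFrame df, and made-up column names:

pdf = df.toPandas()                        # collects the data to the driver!
pdf['obp_minus_avg'] = pdf.obp - pdf.avg   # any pandas manipulation
df2 = sqlContext.createDataFrame(pdf)      # back to a distributed DataFrame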
More DataFrames
We can select columns from a DataFrame
Run our own transform on it with map()
The transform function gets a Row() object and returns one
The toDF() function will infer data types for you.
https://issues.apache.org/jira/browse/SPARK-7182
batting_features = batter_games.select(*unique_cols)
print "batting_features=", batting_features.schema
#print "batting_features=", batting_features.show()

def transformBatters(row_object):
    row = row_object.asDict()
    row = commonTransform(row)

    return Row(**row)

batting_features = batting_features.map(transformBatters).toDF()
batting_features.persist(storageLevel=StorageLevel.MEMORY_AND_DISK)

self.rmtree(self.rddDir + "/" + "batting_features.parquet")
batting_features.saveAsParquetFile(self.rddDir + "/" +
                                   "batting_features.parquet")
Programmatically Specifying
the Schema
class Game(AbstractDF):
    schema = StructType(sorted(
        [
            StructField("game_id", StringType()),
            StructField("game_date", DateType()),
            StructField("id", IntegerType()),
            StructField("type", StringType()),
            StructField("local_game_time", StringType()),
            StructField("game_pk", IntegerType()),
            StructField("gameday_sw", StringType()),
            StructField("game_time_et", TimestampType()),
            StructField("home_code", StringType()),
            StructField("home_abbrev", StringType()),
            StructField("home_name", StringType()),
            StructField("home_won", IntegerType()),
            StructField("home_loss", IntegerType()),
            StructField("home_division_id", IntegerType()),
            StructField("home_league", StringType()),
            StructField("away_code", StringType()),
            StructField("away_abbrev", StringType()),
            StructField("away_name", StringType()),
            StructField("away_won", IntegerType()),
            StructField("away_loss", IntegerType()),
            StructField("away_division_id", IntegerType()),
            StructField("away_league", StringType()),
            StructField("stadium_id", IntegerType()),
            StructField("stadium_name", StringType()),
            StructField("stadium_venue_w_chan_loc", StringType()),
            StructField("stadium_location", StringType()),
            StructField("modified", TimestampType()),
        ],
        key=lambda x: x.name))
    skipSelectFields = ['modified']
AbstractDF
Download: https://gist.github.com/dondrake/c7fcf42cf051492fdd91
AbstractDF adds 3 major features to the class:
Creates a Python object with attributes based on the field
names defined in the schema
e.g. g = Game(); g.game_id = 123
Exposes a helper function to create a Row object
containing all of the fields in the schema, stored with the correct
data types.
Needed so Scala can correctly infer the data types of the
values sent.
Provides a method that returns the list of columns to use in
a select statement.
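For a feel of the shape, here is a minimal hypothetical sketch of those three features (this is NOT the gist above, just an illustration):

from pyspark.sql import Row

class SimpleDF(object):
    schema = None            # subclasses supply a StructType, as Game does
    skipSelectFields = []

    def __init__(self):
        # one attribute per schema field, e.g. g.game_id
        for field in self.schema.fields:
            setattr(self, field.name, None)

    def createRow(self):
        # Row with every schema field, so data types can be inferred reliably
        values = dict((f.name, getattr(self, f.name)) for f in self.schema.fields)
        return Row(**values)

    @classmethod
    def getSelectFields(cls):
        return [f.name for f in cls.schema.fields
                if f.name not in cls.skipSelectFields]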
Python or Scala??
Python or Scala???
Use Python for prototyping
Access to pandas, scikit-learn, etc.
Python is strongly typed (but not statically typed)
Spark's Python API support lags and is not as popular as Scala's. Bugs might
not get fixed right away, or at all.
Python is slower due to the gateway required: all data must be serialized
and sent through the gateway to/from Scala. (Not as bad for DataFrames.)
Use Scala for application development
Scala learning curve is steep.
Functional Programming learning curve is steep.
Scala is a statically typed language
Java was intentionally left off
Don't bother with Java 7 (IMO)
Transformations
Building a moving average
Example of a moving average
How do we compute a 5-day MA for batting average?
For an individual player, on a particular date, we need
the previous 5 days' batting averages.
Please note: not all players play every day.
To build a predictive model, we need historical data, so
we would need to calculate this for each day, every
player, for (possibly) each metric.
We would also want to use different moving average
durations (e.g. 5-, 7-, and 14-day moving averages) for each
player-day.
Our compute space just got big, and it is also
embarrassingly parallel
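A toy illustration (plain Python, made-up numbers, not the real pipeline) of a 5-game moving batting average for a single player; the real job computes this per player, per as-of date, per duration:

games = [("2015-05-01", 0.250), ("2015-05-02", 0.333), ("2015-05-04", 0.200),
         ("2015-05-05", 0.400), ("2015-05-06", 0.000), ("2015-05-08", 0.500)]

WINDOW = 5
for i in range(WINDOW, len(games)):
    as_of = games[i][0]
    prev = [avg for (_, avg) in games[i - WINDOW:i]]   # previous 5 games played
    print as_of, sum(prev) / len(prev)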
Some Stats
load table (2 full years + this year so far, ~17%):
gamePlayers = sqlContext.parquetFile(rddDir + 'game_players.parquet').cache()
# Hitters:
gamePlayers.filter(gamePlayers.fd_position != 'P').count()
248856L (# of hitter-gamedate combinations)
# Pitchers:
gamePlayers.filter(gamePlayers.fd_position == 'P').count()
244926L
Season-level moving average would require about 248,000 *
(162/2) rows of data (~20 million rows)
MapReduce is dead.
Long live MapReduce.
Spark has a set of transformations that operate on
RDDs of key/value pairs.
Transformations
groupByKey([numTasks])
reduceByKey(func, [numTasks])
aggregateByKey(zeroValue)(seqOp, combOp, [numTasks])
sortByKey([ascending], [numTasks])
join(otherDataset, [numTasks])
cogroup(otherDataset, [numTasks])
Actions
reduce(func)
countByKey()
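For instance, a tiny key/value example, assuming an existing SparkContext sc (the player ids and hit counts are made up):

hits = sc.parallelize([("a001", 2), ("a001", 1), ("b002", 3)])

totals = hits.reduceByKey(lambda x, y: x + y)   # combines per key before shuffling
print totals.collect()                          # e.g. [('a001', 3), ('b002', 3)]
print hits.countByKey()                         # action: {'a001': 2, 'b002': 1}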
Broadcast Map-side join
Broadcast a dictionary of lists of game_dates;
the keys of the dictionary are the years the games took place.
A flatMap operation loops over the broadcast
game_dates for the year the stats took place
We use flatMap because we emit many rows
from a single input (7-day, 14-day, etc.)
Key: player_id + asOfDate + moving_length
Value: game stats dictionary from the input
Since this generates a lot of output, I
repartition(50) the output of flatMap (see the sketch below)
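A minimal sketch of that emit step, assuming game_date and the broadcast dates are datetime.date objects and that gameStatsRDD holds one stats dictionary per player per game (all names here are assumptions, not the talk's code):

dates_by_year = sc.broadcast(game_dates_by_year)   # e.g. {2014: [date, ...], ...}
LENGTHS = [5, 7, 14]

def emit(stat):
    out = []
    for as_of in dates_by_year.value[stat['game_date'].year]:
        for n in LENGTHS:
            # this game contributes to the n-day average as of as_of
            # if it was played within the n days before as_of
            if 0 < (as_of - stat['game_date']).days <= n:
                key = (stat['player_id'], as_of, n)
                out.append((key, stat))
    return out

keyed = gameStatsRDD.flatMap(emit).repartition(50)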
Reduce
Now call groupByKey()
Creates an RDD that has a list of values for each key.
These values are the stats used to calculate the average
If performing an aggregation, use reduceByKey instead; less shuffling is
involved
Calculate the average (and stddev, etc.)
Emit a new key of player_id + asOfDate
The value is a dictionary of all the moving average fields
Run reduceByKey() to combine (concatenate) the 7-day, 14-day
metrics into a single row (sketched below)
Save the results as Parquet; they will be joined with other features to
create a predictive model
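A sketch of that reduce side, continuing the (assumed) keyed RDD from the previous sketch; the input field 'avg' and the output field names are made up:

def to_averages(kv):
    (player_id, as_of, n), stats = kv
    avgs = [s['avg'] for s in stats]                 # the stats to average
    mean = sum(avgs) / len(avgs)
    return ((player_id, as_of), {'avg_%dday' % n: mean})

def merge(d1, d2):
    combined = dict(d1)
    combined.update(d2)                              # concatenate 5/7/14-day fields
    return combined

moving_avgs = (keyed.groupByKey()
                    .map(to_averages)
                    .reduceByKey(merge))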
Useful Links
1. https://databricks.com/blog/2015/04/28/project-tungsten-bringing-spark-closer-to-bare-metal.html
2. https://www.usenix.org/system/files/conference/nsdi15/nsdi15-paper-ousterhout.pdf
3. https://zeppelin.incubator.apache.org/
4. https://codewords.recurse.com/issues/one/an-introduction-to-functional-programming
5. http://apache-spark-user-list.1001560.n3.nabble.com/
Spark Books
Q&A
