
Hadoop Developer Day

Nicolas Morales
IBM Big Data
nicolasm@us.ibm.com
@NicolasJMorales
Big Data Developers
FREE Monthly Events
San Jose & Foster City
Full Day Developer Days
Afternoon & Evening Hackathons
Past Meetups covered:
Text Analytics
Real-time Analytics
SQL for Hadoop
HBase
Social Media Analytics
Machine Data Analytics
Security and Privacy
Development Environment provided
Live streaming
Topic suggestions welcome
http://www.meetup.com/BigDataDevelopers/
NEXT MEETUP: Streams Developer Day on Thursday, April 17.
Coming Soon: Big R, Watson, Big Data in the Cloud, Big SQL, MongoDB & more!
Agenda: Hadoop Developer Day
Time | Subject
8:00 AM - 9:00 AM | Registration & Breakfast
9:00 AM - 9:30 AM | Introduction to Hadoop
9:30 AM - 11:00 AM | Hadoop Architecture and HDFS + Hands-on Lab
11:00 AM - 11:45 AM | Introduction to MapReduce
11:45 AM - 12:45 PM | Lunch
12:45 PM - 2:00 PM | MapReduce Hands-on Lab
2:00 PM - 4:00 PM | Using Hive for Data Warehousing + Hands-on Lab
4:00 PM - 6:00 PM | SQL for Hadoop + Hands-on Lab
6:00 PM | Closing Remarks
Big Data University
www.bigdatauniversity.com
Quick Start Edition VM
Download: http://ibm.co/QuickStart
Unpack the downloaded .tar.gz archive using WinRAR, 7-Zip, or similar.
Your feedback is important; please complete your survey.
Introduction to Hadoop
Rafael Coss
IBM Big Data
rcoss@us.ibm.com
@racoss
Executive Summary
What's Big Data?
More Analytics on More Data for More People
More than just Hadoop
What's Hadoop?
Distributed computing framework that is:
Scalable
Cost Effective
Flexible
Fault Tolerant
What's a Hadoop Distribution?
Common set of Apache Projects
Installer
Unique Value Add
Key Business-driven Use Cases to Improve Business Outcomes:
Enrich your information base with Big Data Exploration
Improve customer interaction with an enhanced 360° view of the customer
Help reduce risk and prevent fraud with security and intelligence extension
Optimize infrastructure and monetize data with operations analysis
Gain IT efficiency and scale with data warehouse modernization
Customer results highlighted on the slide: 42 TB of real-time acoustic data analyzed; 40X gain in analysis performance; 60K metered customers in five states; 1,100 association publishing partnerships; 99% reduction in time required for analysis.
Why is Big Data important?
The gap between the data AVAILABLE to an organization and the data an organization can PROCESS keeps widening: organizations are able to process less and less of the available data, leaving enterprises increasingly blind to new opportunities.
100 million tweets are posted every day, 35 hours of video are uploaded every minute, 6.1 x 10^12 text messages were sent in 2011, and 247 x 10^9 e-mails passed through the net, 80% of them spam and viruses. => Pre-filtering is more and more important.
What is Big Data?
Transactional & Application Data: Volume; Structured; Throughput
Machine Data: Velocity; Semi-structured; Ingestion
Social Data: Variety; Highly unstructured; Veracity
Enterprise Content: Variety; Highly unstructured; Volume
More Analytics on More Data for More People
Every Industry can Leverage Big Data and Analytics
Insurance
360 View of Domain
or Subject
Catastrophe Modeling
Fraud & Abuse
Producer Performance
Analytics
Analytics Sandbox
Banking
Optimizing Offers and
Cross-sell
Customer Service and
Call Center Efficiency
Fraud Detection &
Investigation
Credit & Counterparty
Risk
Telco
Pro-active Call Center
Network Analytics
Location Based
Services
Energy &
Utilities
Smart Meter Analytics
Distribution Load
Forecasting/Scheduling
Condition Based
Maintenance
Create & Target
Customer Offerings
Media &
Entertainment
Business process
transformation
Audience & Marketing
Optimization
Multi-Channel
Enablement
Digital commerce
optimization
Retail
Actionable Customer Insight
Merchandise Optimization
Dynamic Pricing
Travel & Transport
Customer Analytics & Loyalty Marketing
Predictive Maintenance Analytics
Capacity & Pricing Optimization
Consumer Products
Shelf Availability
Promotional Spend Optimization
Merchandising Compliance
Promotion Exceptions & Alerts
Government
Civilian Services
Defense & Intelligence
Tax & Treasury Services
Healthcare
Measure & Act on Population Health Outcomes
Engage Consumers in their Healthcare
Automotive
Advanced Condition
Monitoring
Data Warehouse
Optimization
Actionable Customer
Intelligence
"i#e
$ciences
Increase visibility into
drug safety and
effectiveness
Chemical &
Petroleum
Operational Surveillance,
Analysis & Optimization
Data Warehouse
Consolidation, Integration
& Augmentation
Big Data Exploration for
Interdisciplinary
Collaboration
Aerospace & Defense
Uniform Information
Access Platform
Data Warehouse
Optimization
Airliner Certification
Platform
Advanced Condition
Monitoring (ACM)
Electronics
Customer/ Channel
Analytics
Advanced Condition
Monitoring
Big data adoption
Big Data use study
The 2012 Big Data @ Work Study surveyed 1,144 business and IT professionals in 95 countries
When segmented into four groups based on current levels of big data activity, respondents showed significant consistency in
organizational behaviors
Warehouse Modernization Has Two Themes

Traditional Analytics: structured & repeatable; structure is built to store the data
Business users determine the questions; the IT team builds a system to answer those known questions
Capacity-constrained down-sampling of the available information
Carefully cleanse all information before any analysis
Result: the analyzed information is only a subset of the available information

Big Data Analytics: iterative & exploratory; data is the structure
The IT team delivers data on a flexible platform; business users explore and ask any question
Analyze ALL available information: whole-population analytics connects the dots
Analyze information as-is and cleanse as needed
Warehouse Modernization Has Two Themes

Traditional Analytics: structured & repeatable
Start with a hypothesis; test against selected data
Question -> Hypothesis -> Data -> Answer
Analyze after landing

Big Data Analytics: iterative & exploratory
Data leads the way: explore all data, identify correlations
All Information -> Exploration -> Correlation -> Actionable Insight
Analyze in motion
Getting the Value from Big Data: Why a Platform?
Almost all big data use cases require an integrated set of big data technologies to address the business pain completely
The Whole is Greater than the Sum of the Parts
BIG DATA PLATFORM: Accelerators; Data Warehouse; Stream Computing; Hadoop System; Discovery; Application Development; Systems Management; Information Integration & Governance
Reduce time and cost and provide quick ROI by leveraging pre-integrated components
Provide both out-of-the-box and standards-based services
Start small with a single project and progress to others over your big data journey
Data sources: Data, Media, Content, Machine, Social
Watson Foundations
Data types: transaction & application data; machine and sensor data; enterprise content; social data; third-party data; image and video
Real-time processing & analytics: Streams and Data Replication
Data zones: exploration, landing and archive; trusted data; operational systems
Discovery and exploration; reporting & interactive analysis; deep analytics & modeling
Actionable insight: decision management; predictive analytics & modeling; reporting, analysis, content analytics
Information Integration & Governance

Watson Foundations Differentiators
1. More than Hadoop
Greater resiliency and recoverability
Advanced workload management, multi-tenancy
Enhanced, flexible storage management (GPFS)
Enhanced data access (Big SQL, Search)
Analytics accelerators & visualization
Enterprise-ready security framework
2. Data in Motion
Enterprise-class stream processing & analytics
3. Analytics Everywhere
Richest set of analytics capabilities
Ability to analyze data in place
4. Governance Everywhere
Complete integration & governance capabilities
Ability to govern all data wherever it is
5. Complete Portfolio
End-to-end capabilities to address all needs
Ability to grow and address future needs
Remains open to work with existing investments
IBM Watson Foundations
IBM Big Data & Analytics: all data feeding new and enhanced applications
What is happening? Discovery and exploration
Why did it happen? Reporting and analysis
What could happen? Predictive analytics and modeling
What action should I take? Decision management and the Cognitive Fabric
Data zones: landing, exploration and archive data zone; EDW and data mart zone; operational data zone; deep analytics data zone; predictive analytics and modeling data zone
Real-time data processing & analytics
Information Integration & Governance
IBM Big Data & Analytics Infrastructure: systems, security, storage; on premise, cloud, as a service
What is Hadoop?
Apache open source software framework for reliable, scalable, distributed computing over massive amounts of data
Hides underlying system details and complexities from user
Developed in Java
Core sub projects:
MapReduce
Hadoop Distributed File System a.k.a. HDFS
Hadoop Common
Supported by several Hadoop-related projects
HBase
Zookeeper
Avro
Etc.
Meant for heterogeneous commodity hardware
Design principles of Hadoop
New way of storing and processing the data:
Let system handle most of the issues automatically:
Failures
Scalability
Reduce communications
Distribute data and processing power to where the data is
Make parallelism part of operating system
Relatively inexpensive hardware ($2-4K)
Bring processing to Data!
Hadoop = HDFS + MapReduce infrastructure
Optimized to handle
Massive amounts of data through parallelism
A variety of data (structured, unstructured, semi-structured)
Using inexpensive commodity hardware
Reliability provided through replication
Hadoop is not for all types of work
Not for processing transactions (random access)
Not good when work cannot be parallelized
Not good for low latency data access
Not good for processing lots of small files
Not good for intensive calculations with little data
Big Data Solution
Who uses Hadoop?
MapReduce -> Hadoop -> BigInsights
What is Apache Hadoop?
Flexible, enterprise-class support for processing large volumes of
data
Inspired by Google technologies (MapReduce, GFS, BigTable, ...)
Initiated at Yahoo
Originally built to address scalability problems of Nutch, an open source Web search
technology
Well-suited to batch-oriented, read-intensive applications
Supports wide variety of data
Enables applications to work with thousands of nodes and petabytes
of data in a highly parallel, cost effective manner
CPU + disks = node
Nodes can be combined into clusters
New nodes can be added as needed without changing
Data formats
How data is loaded
How jobs are written
Hadoop Open Source Projects
Hadoop is supplemented by an ecosystem of open source projects
How do I leverage Hadoop to create new value for my
enterprise?
Hadoop, Pig, Hive, ZooKeeper, Jaql, HBase, Oozie, Flume
Diagram: HDFS and MapReduce scaling from terabytes through petabytes to exabytes, across workloads such as log analysis, machine learning, sentiment analysis (AQL), and CDRs.
What's a Hadoop Distribution?
What's a Linux Distribution?
Linux Kernel
Open Source Tools around the Kernel
Installer
Administration UI
Apps (e.g., WebSphere WAS)
Open Source Distribution Formula:
Kernel
Core Projects around the Kernel
Value Add
Test Components
Installer
Administration UI
25+ Apache projects + additional open source + installer + IBM value add
BigInsights: Value Beyond Open Source
Enterprise Capabilities
Connectors
Advanced Engines
Visualization & Exploration
Development Tools
Key differentiators
Built-in analytics
Text engine, annotators, Eclipse tooling
Interface to project R (statistical platform)
Enterprise software integration
Spreadsheet-style analysis
Integrated installation of supported open source
and other components
Web Console for admin and application access
Platform enrichment: additional security, performance features, . . .
Administration & Security
Workload Optimization
Connectors
Open source components
IBM-certified Apache Hadoop
World-class support
Full open source compatibility
Business benefits
Quicker time-to-value due to IBM technology
and support
Reduced operational risk
Enhanced business knowledge with flexible
analytical platform
Leverages and complements existing software
From Getting Started to Enterprise Deployment:
Different BigInsights Editions for Varying Needs
Editions step up from the free Quick Start Edition through the Standard Edition to the Enterprise Edition, growing in enterprise class and breadth of capabilities; all are built on Apache Hadoop.
Quick Start (Free. Non-production):
- Spreadsheet-style tool
- Web console
- Dashboards
- Pre-built applications
- Eclipse tooling
- RDBMS connectivity
- Big SQL
- Jaql
- Platform enhancements
- . . .
Enterprise Edition adds:
- Accelerators
- GPFS FPO
- Adaptive MapReduce
- Text analytics
- Enterprise Integration
- Monitoring and alerts
- Big R
- InfoSphere Streams*
- Watson Explorer*
- Cognos BI*
- . . .
* Limited use license
IBM Enriches Hadoop
Hadoop is:
Scalable: new nodes can be added on the fly
Affordable: massively parallel computing on commodity servers
Flexible: schema-less, able to absorb any type of data
Fault Tolerant: through the MapReduce software framework
IBM adds:
Performance & reliability: Adaptive MapReduce, compression, indexing, flexible scheduler, and more
Enterprise hardening of Hadoop
Productivity accelerators: web-based UIs and tools, end-user visualization
Analytic accelerators, and more
Enterprise integration: to extend and enrich your information supply chain
Big Database Vendors Adopt Hadoop
Competing Hadoop Distribution Vendors
Cloudera
Cloudera makes it easy to run open source Hadoop in production
Focus on deriving business value from all your data instead of worrying about managing Hadoop
Hortonworks
Make Hadoop easier to consume for enterprises and technology vendors
Provide expert support by the leading contributors to the Apache Hadoop open source projects
EMC Greenplum HD / Pivotal HD
Delivering enterprise-ready Apache Hadoop
Provides a complete platform including installation, training, global support, and value-add beyond
simple packaging of the Apache Hadoop distribution
MapR
High Performance Hadoop, up to 2-5 times faster performance than Apache-based distributions
The first distribution to provide true high availability at all levels making it more dependable
Amazon Elastic MapReduce
Amazon Elastic MapReduce lets you focus on crunching or analyzing your data without having to
worry about time-consuming set-up, management or tuning of Hadoop clusters or the compute
capacity upon which they sit
Capabilities Required for Hadoop-Style Workloads
Visualization & Discovery
Analytics Engines
Application Support and Development Tooling
Runtime
Cluster and Workload Management
Data Ingest
File System
Data Store
Security
Open Source Hadoop Components
Visualization & Discovery
Data Ingest: Sqoop, Flume, Avro
Analytics Engines, Application Support and Development Tooling: Pig, Hive, Lucene, Oozie, HCatalog
Cluster Optimization and Management: ZooKeeper
Runtime: MapReduce
File System: HDFS
Data Store: HBase, Derby
Security
Open Source Components Across Distributions

Component | BigInsights 2.0 | Hortonworks HDP 1.2 | MapR 2.0 | Greenplum HD 1.2 | Cloudera CDH3u5 | Cloudera CDH4*
Hadoop | 1.0.3 | 1.1.2 | 0.20.2 | 1.0.3 | 0.20.2 | 2.0.0*
HBase | 0.94.0 | 0.94.2 | 0.92.1 | 0.92.1 | 0.90.6 | 0.92.1
Hive | 0.9.0 | 0.10.0 | 0.9.0 | 0.8.1 | 0.7.1 | 0.8.1
Pig | 0.10.1 | 0.10.1 | 0.10.0 | 0.9.2 | 0.8.1 | 0.9.2
Zookeeper | 3.4.3 | 3.4.5 | X | 3.3.5 | 3.3.5 | 3.4.3
Oozie | 3.2.0 | 3.2.0 | 3.1.0 | X | 2.3.2 | 3.1.3
Avro | 1.6.3 | X | X | X | X | X
Flume | 0.9.4 | 1.3.0 | 1.2.0 | X | 0.9.4 | 1.1.0
Sqoop | 1.4.1 | 1.4.2 | 1.4.1 | X | 1.3.0 | 1.4.1
HCatalog | 0.4.0 | 0.5.0 | 0.4.0 | X | X | X
(X = not included)
BigInsights Enterprise Edition Components
IBM InfoSphere BigInsights
Visualization & Discovery: BigSheets; Dashboard & Visualization; Apps
Integration: Streams, Netezza, DB2, JDBC, Flume, DataStage, R, Big SQL
Advanced Analytic Engines: Text Processing Engine & Extractor Library
Applications & Development: Text Analytics, MapReduce, Pig & Jaql, Big SQL & Hive
Systems Management: Admin Console, Workflow Monitoring, Management, Audit & History, Lineage
Workload Optimization: Adaptive MapReduce, Index, Splittable Text Compression, Enhanced Security, Flexible Scheduler, Integrated Installer
Runtime: MapReduce, Jaql, Pig, Hive, ZooKeeper, Lucene, Oozie, Sqoop, HCatalog
File System: HDFS, GPFS
Data Store: HBase (column store)
(Legend: IBM vs. open source components)
Two Key Aspects of Hadoop
Hadoop Distributed File System = HDFS
Where Hadoop stores data
A file system that spans all the nodes in a Hadoop cluster
It links together the file systems on many local nodes to
make them into one big file system
MapReduce framework
How Hadoop understands and assigns work to the nodes
(machines)
What is the Hadoop Distributed File System?
HDFS stores data across multiple nodes
HDFS assumes nodes will fail, so it achieves
reliability by replicating data across multiple nodes
The file system is built from a cluster of data nodes,
each of which serves up blocks of data over the
network using a block protocol specific to HDFS.
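To make the storage model concrete, here is a minimal sketch using the Hadoop FileSystem Java API of the era; the NameNode URI and file path are placeholders, not values from the slides:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsHello {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Placeholder address; point this at your own cluster's NameNode.
    conf.set("fs.default.name", "hdfs://namenode:9000");
    FileSystem fs = FileSystem.get(conf);

    // Write a file; HDFS splits it into blocks and replicates each block
    // across several data nodes for fault tolerance.
    Path file = new Path("/user/demo/hello.txt");
    FSDataOutputStream out = fs.create(file);
    out.writeUTF("Hello, HDFS");
    out.close();

    // Report the replication factor actually applied (commonly 3).
    System.out.println("Replication: " + fs.getFileStatus(file).getReplication());
  }
}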
MapReduce
Map:
Take a large problem and divide it into sub-problems
Break the data set down into small chunks
Perform the same function (DoWork()) on every sub-problem
Reduce:
Combine the output from all sub-problems into a single output
MapReduce Example
Hadoop computation model:
Data stored in a distributed file system spanning many inexpensive computers
Bring function to the data: distribute the application to the compute resources (data nodes) where the data is stored
Scalable to thousands of nodes and petabytes of data

A MapReduce application runs in three phases across the Hadoop data nodes:
1. Map phase: break the job into small parts and distribute the map tasks to the cluster
2. Shuffle: transfer interim output for final processing
3. Reduce phase: boil all output down to a single result set, returned to the client

WordCount mapper and reducer:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public static class TokenizerMapper
    extends Mapper<Object, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(Object key, Text val, Context context)
      throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(val.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      context.write(word, one);   // emit (word, 1) for every token
    }
  }
}

public static class IntSumReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {
  private IntWritable result = new IntWritable();

  public void reduce(Text key, Iterable<IntWritable> val, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable v : val) {
      sum += v.get();             // add up the 1s emitted for this word
    }
    result.set(sum);
    context.write(key, result);
  }
}
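The slide shows only the mapper and reducer; a minimal driver sketch to wire them into a job might look like the following, with class names and paths chosen for illustration (standard WordCount pattern, not something from the deck):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // local pre-aggregation cuts shuffle traffic
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not already exist
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Packaged as a jar, it could be launched with something like hadoop jar wordcount.jar WordCount /input /output, where both paths are placeholders.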
InfoSphere BigInsights

Analytics and discovery:
BigSheets
Apps: Web Crawler, Distrib. file copy, DB export, DB import, Boardreader, Custom, Data processing, . . .
Text processing engine and library
Accelerator for machine data analysis
Accelerator for social data analysis
R

Administrative and development tools:
Web console (used in today's hands-on lab):
Monitor cluster health, jobs, etc.
Add / remove nodes
Start / stop services
Inspect job and workflow status
Deploy applications
Launch apps & jobs
Work with distributed file system
Work with spreadsheet interface
Support REST-based API
. . .
Eclipse tools:
Text analytics
MapReduce programming
Jaql, Hive, Pig, Big SQL development
BigSheets plug-in development
Oozie workflow generation

Connectivity and Integration: Streams, Netezza, JDBC, Flume, DB2, Sqoop

Infrastructure: Jaql, Hive, Pig, HBase, MapReduce, HDFS, GPFS FPO, ZooKeeper, Indexing, Lucene, Adaptive MapReduce, Oozie, Text compression, Enhanced security, Flexible scheduler, HCatalog, Integrated installer

Optional IBM and partner offerings: Cognos BI, Big SQL, Guardium, DataStage, Data Explorer
(Legend: open source vs. IBM components)
So What Does This Result In?
Easy To Scale
Fault Tolerant and Self-Healing
Data Agnostic
Extremely Flexible
Resources
bigdatauniversity.com
youtube.com/ibmBigData
Quick Start Editions
ibm.co/quickstart
ibm.co/streamsqs
ibm.meetup.com
ibmdw.net/streamsdev
ibm.co/streamscon
ibmbigdatahub.com
ibm.co/bigdatadev
http://tinyurl.com/biginsights
Links to demos, papers, forums, downloads, etc.
Thank You
Your feedback is important!
Please fill out survey
Acknowledgements and Disclaimers
Availability. References in this presentation to IBM products, programs, or services do not imply that they will be available in all countries in
which IBM operates.
The workshops, sessions and materials have been prepared by IBM or the session speakers and reflect their own views. They are provided for
informational purposes only, and are neither intended to, nor shall have the effect of being, legal or other guidance or advice to any participant.
While efforts were made to verify the completeness and accuracy of the information contained in this presentation, it is provided AS-IS without
warranty of any kind, express or implied. IBM shall not be responsible for any damages arising out of the use of, or otherwise related to, this
presentation or any other materials. Nothing contained in this presentation is intended to, nor shall have the effect of, creating any warranties or
representations from IBM or its suppliers or licensors, or altering the terms and conditions of the applicable license agreement governing the use
of IBM software.
All customer examples described are presented as illustrations of how those customers have used IBM products and the results they may have
achieved. Actual environmental costs and performance characteristics may vary by customer. Nothing contained in these materials is intended
to, nor shall have the effect of, stating or implying that any activities undertaken by you will result in any specific sales, revenue growth or other
results.
Copyright IBM Corporation 2014. All rights reserved.
U.S. Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with
IBM Corp.
IBM, the IBM logo, ibm.com, and InfoSphere BigInsights are trademarks or registered trademarks of International Business
Machines Corporation in the United States, other countries, or both. If these and other IBM trademarked terms are marked on their
first occurrence in this information with a trademark symbol (® or ™), these symbols indicate U.S. registered or common law
trademarks owned by IBM at the time this information was published. Such trademarks may also be registered or common law
trademarks in other countries. A current list of IBM trademarks is available on the Web at Copyright and trademark information at
www.ibm.com/legal/copytrade.shtml
Other company, product, or service names may be trademarks or service marks of others.
Backup
Implications of Big Data
Just reading 100 terabytes is slow (the arithmetic is sketched below):
One standard computer (100 MB/s): ~11 days
Across a 10 Gbit link (high-end storage): ~1 day
1000 standard computers: ~15 minutes!
Seek times for random disk access are a problem:
For a 1 TB data set with 10^10 100-byte records, updating 1% of the records would require about 1 month*
Reading and rewriting the whole data set would take 1 day
One node is not enough!
Need to scale out, not up!
* From the Hadoop mailing list
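A quick back-of-the-envelope check of those read times, assuming 100 MB/s per machine, 10 Gbit/s ≈ 1.25 GB/s, and perfectly parallel reads:

\[
\frac{100\,\text{TB}}{100\,\text{MB/s}} = 10^{6}\,\text{s} \approx 11.6\ \text{days},
\qquad
\frac{100\,\text{TB}}{1.25\,\text{GB/s}} = 8 \times 10^{4}\,\text{s} \approx 0.9\ \text{days},
\qquad
\frac{10^{6}\,\text{s}}{1000} \approx 17\ \text{minutes},
\]

so with 1000 machines the job drops to roughly a quarter of an hour.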
Scaling out
Bad news: nodes fail, especially if you have many
Mean time between failures for 1 node = 3 years, 1000 nodes = 1 day
Super-fancy hardware still fails and commodity machines give better performance
per dollar
Bad news II: distributed programming is hard
Communication, synchronization, and deadlocks
Recovering from machine failure
Debugging
Optimization
Bad news III: repeat for every problem
A new model is needed
It's all about the right level of abstraction
Hide system-level details from the developers
No more race conditions, lock contention, etc.
Separating the what from how
Developer specifies the computation that needs to be performed
Execution framework (runtime) handles actual execution
MapReduce
Traditional computing vs. MapReduce computing (diagram)

MapReduce, the reality
Many nodes, little communication between the nodes, some stragglers and failures
Big Difference: Schema on Run
Regular database (schema on load): raw data -> schema to filter -> storage (pre-filtered data)
Big Data / Hadoop (schema on run): raw data -> storage (unfiltered, raw data) -> schema to filter -> output
Key Benefit: Agility/Flexibility
Schema-on-Write (RDBMS):
Schema must be defined before any data is loaded
An explicit load operation has to take place which transforms the data to the internal DB structure
New columns must be added explicitly before new data for such columns can be loaded into the database
Pros: reads are fast; standards and governance

Schema-on-Read (Hadoop):
Data is copied to the file store; no transformation is needed
A SerDe (Serializer/Deserializer) is applied during read time to extract the required columns (late binding)
New data can start flowing anytime and will appear retroactively once the SerDe is updated to parse it
Pros: loads are fast; flexibility and agility
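To make late binding concrete, here is a minimal, self-contained Java sketch in the spirit of a SerDe; the tab-separated layout and field positions are invented for illustration and are not a real BigInsights or Hive interface:

import java.util.Arrays;
import java.util.List;

public class LateBindingExample {
  public static void main(String[] args) {
    // Raw lines land in the file store untouched; the "schema" is applied
    // only here, at read time. Format and field positions are assumptions.
    List<String> rawLines = Arrays.asList(
        "u42\t2013-06-01T10:00\thttp://example.com/a",
        "u7\t2013-06-01T10:01\thttp://example.com/b\tnew-field");
    for (String line : rawLines) {
      String[] fields = line.split("\t");
      // Late binding: project only the columns this query needs;
      // unknown trailing fields are ignored rather than rejected at load.
      System.out.println(fields[0] + " -> " + fields[2]);
    }
  }
}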
Scalability: Scalable Software Deployment