
Introduction to Hadoop

Outline

Introduction to Hadoop project


HDFS (Hadoop Distributed File System) overview
MapReduce overview
Reference


What is Hadoop?
Hadoop is a cloud computing platform for processing and storing
vast amounts of data.
Apache top-level project
Cloud Applications
Open Source
Hadoop Core includes
Hadoop Distributed File System (HDFS)
MapReduce framework

Hadoop subprojects
HBase, ZooKeeper, etc.

Written in Java
Client interfaces are available in C++, Java, and shell scripting
Runs on
Linux, Mac OS X, Windows, and Solaris
Commodity hardware

[Diagram: MapReduce and HBase layered on top of the Hadoop Distributed File System (HDFS), running on a cluster of machines]


A Brief History of Hadoop


2003.2
First MapReduce library written at Google

2003.10
Google File System paper published

2004.12
Google MapReduce paper published

2005.7
Doug Cutting reports that Nutch now uses the new MapReduce
implementation

2006.2
Hadoop code moves out of Nutch into new Lucene sub-project

2006.11
Google Bigtable paper published

A Brief History of Hadoop (continued)


2007.2
First HBase code drop from Mike Cafarella

2007.4
Yahoo! Running Hadoop on 1000-node cluster

2008.1
Hadoop made an Apache Top Level Project


Who uses Hadoop?

Yahoo!
More than 100,000 CPUs in ~20,000 computers running Hadoop

Google
University Initiative to Address Internet-Scale Computing Challenges

Amazon
Amazon builds product search indices using the streaming API and preexisting C++, Perl, and Python tools.
Process millions of sessions daily for analytics

IBM
Blue Cloud Computing Clusters

Trend Micro
Threat solution research

More
http://wiki.apache.org/hadoop/PoweredBy


Hadoop Distributed File System


Overview


HDFS Design
Single Namespace for entire cluster
Very Large Distributed File System
10K nodes, 100 million files, 10 PB

Data Coherency
Write-once-read-many access model
A file once created, written, and closed need not be changed.
Appending writes to existing files (planned for the future)

Files are broken up into blocks


Typically 128 MB block size
Each block replicated on multiple DataNodes


HDFS Design
Moving computation is cheaper than moving data
Data locations exposed so that computations can move to where
data resides

File replication
Default is 3 copies.
Configurable by clients

Assumes Commodity Hardware


Files are replicated to handle hardware failure
Detect failures and recover from them

Streaming Data Access


High throughput of data access rather than low latency of data
access
Optimized for Batch Processing

NameNode
Manages File System Namespace
Maps a file name to a set of blocks
Maps a block to the DataNodes where it resides

Cluster Configuration Management


Replication Engine for Blocks


NameNode Metadata
Meta-data in Memory
The entire metadata is in main memory
No demand paging of meta-data

Types of Metadata

List of files
List of Blocks for each file
List of DataNodes for each block
File attributes, e.g. creation time, replication factor


NameNode Metadata (cont.)


A Transaction Log (called the EditLog)
Records file creations, file deletions, etc.

FsImage
The entire namespace, the mapping of blocks to files, and the file
system properties are stored in a file called the FsImage.
The NameNode can be configured to maintain multiple copies of the
FsImage and EditLog.

Checkpoint
Occurs at NameNode startup
Reads the FsImage and EditLog from disk and applies all transactions
from the EditLog to the in-memory representation of the FsImage
Then flushes the new version out to a new FsImage

Secondary NameNode
Copies FsImage and EditLog from NameNode to a temporary
directory
Merges the FsImage and EditLog into a new FsImage in the
temporary directory
Uploads the new FsImage to the NameNode
The Transaction Log on the NameNode is then purged

[Diagram: the Secondary NameNode merges the FsImage and EditLog into a new FsImage]


NameNode Failure
The NameNode is a single point of failure (SPOF)
Transaction Log stored in multiple directories
A directory on the local file system
A directory on a remote file system (NFS/CIFS)

Need to develop a real HA solution


DataNode
A Block Server
Stores data in the local file system (e.g. ext3)
Stores meta-data of a block (e.g. CRC)
Serves data and meta-data to Clients

Block Report
Periodically sends a report of all existing blocks to the
NameNode

Facilitates Pipelining of Data


Forwards data to other specified DataNodes


HDFS - Replication
Default is 3x replication.
The block size and replication factor are configured per file.
Block placement algorithm is rack-aware
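
As a rough sketch (not from the original slides), the per-file settings can be applied through the Java FileSystem API; the file name below is just the example path used elsewhere in this deck:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Create a file with an explicit replication factor (2) and block size (128 MB).
    Path file = new Path("/foodir/myfile.txt");
    FSDataOutputStream out =
        fs.create(file, true, 4096, (short) 2, 128L * 1024 * 1024);
    out.writeBytes("hello hdfs\n");
    out.close();

    // Change the replication factor of an existing file back to the default of 3.
    fs.setReplication(file, (short) 3);
  }
}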


User Interface
API
Java API
C language wrapper (libhdfs) for the Java API is also available

POSIX-like commands


hadoop dfs -mkdir /foodir
hadoop dfs -cat /foodir/myfile.txt
hadoop dfs -rm /foodir/myfile.txt

HDFS Admin
bin/hadoop dfsadmin -safemode get
bin/hadoop dfsadmin -report
bin/hadoop dfsadmin -refreshNodes

Web Interface
Ex: http://172.16.203.136:50070

Web Interface

[Screenshot: HDFS NameNode web interface at http://172.16.203.136:50070]

Web Interface

[Screenshot: browsing the file system through the web interface at http://172.16.203.136:50070]

POSIX-Like Commands


Java API
Latest API
http://hadoop.apache.org/core/docs/current/api/
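
A minimal sketch (assumed, not from the slides) of writing and reading an HDFS file through this Java API; the path and file content are hypothetical:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsApiExample {
  public static void main(String[] args) throws Exception {
    // Picks up core-site.xml / hdfs-site.xml from the classpath.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Write a file.
    Path path = new Path("/foodir/hello.txt");   // hypothetical path
    FSDataOutputStream out = fs.create(path);
    out.writeBytes("hello from the HDFS Java API\n");
    out.close();

    // Read it back, line by line.
    BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(path)));
    String line;
    while ((line = in.readLine()) != null) {
      System.out.println(line);
    }
    in.close();
  }
}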


MapReduce Overview


Brief history of MapReduce

A demand for large-scale data processing at Google


The folks at Google discovered certain common themes for
processing those large input sizes

Many machines are needed


Two basic operations on the input data: Map and Reduce

Google's idea is inspired by the map and reduce functions
commonly used in functional programming.
Functional programming: MapReduce
Map

[1, 2, 3, 4] → (*2) → [2, 4, 6, 8]

Reduce

[1, 2, 3, 4] → (sum) → 10

Divide and Conquer paradigm
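
To make the analogy concrete, here is a small sketch using Java streams (a newer Java feature than these 2009 slides, used here only to illustrate the two operations):

import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class MapReduceAnalogy {
  public static void main(String[] args) {
    List<Integer> input = Arrays.asList(1, 2, 3, 4);

    // Map: apply (*2) to every element -> [2, 4, 6, 8]
    List<Integer> doubled =
        input.stream().map(x -> x * 2).collect(Collectors.toList());

    // Reduce: fold the list with (+) -> 10
    int sum = input.stream().reduce(0, Integer::sum);

    System.out.println(doubled + " " + sum);
  }
}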


MapReduce
MapReduce is a software framework introduced by Google
to support distributed computing on large data sets on
clusters of computers.
Users specify
a map function that processes a key/value pair to generate a set
of intermediate key/value pairs
a reduce function that merges all intermediate values
associated with the same intermediate key

The MapReduce framework handles the details of execution.


Application Area
The MapReduce framework is for computing certain kinds of
distributable problems using a large number of
computers (nodes), collectively referred to as a cluster.
Examples
Large-scale data processing such as search, indexing, and sorting
Data mining and machine learning on large data sets
Analysis of huge access logs in large portals

More applications
http://www.dbms2.com/2008/08/26/known-applications-of-mapreduce/


MapReduce Programming Model

The user defines two functions:

map
(K1, V1) → list(K2, V2)
Takes an input key/value pair and produces a set of intermediate
key/value pairs

reduce
(K2, list(V2)) → list(K3, V3)
Accepts an intermediate key and the set of values for that key, and
merges those values together to form a possibly smaller set of
values
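
In Hadoop's Java MapReduce API these two functions are written by subclassing Mapper and Reducer; the skeleton below is a hedged sketch with arbitrarily chosen key/value types (the word count example later in these slides fills it in concretely):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// map: (K1, V1) -> list(K2, V2); here K1=LongWritable, V1=Text, K2=Text, V2=LongWritable
class MyMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // Emit intermediate (K2, V2) pairs; here one pair per input record.
    context.write(new Text(value.toString()), new LongWritable(1));
  }
}

// reduce: (K2, list(V2)) -> list(K3, V3); merges all values seen for one key
class MyReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
  @Override
  protected void reduce(Text key, Iterable<LongWritable> values, Context context)
      throws IOException, InterruptedException {
    long merged = 0;
    for (LongWritable v : values) {
      merged += v.get();   // fold the values for this key
    }
    context.write(key, new LongWritable(merged));
  }
}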

MapReduce Logical Flow


[Diagram: input → map → shuffle/sort → reduce → output]


MapReduce Execution Flow


[Diagram: map tasks → shuffle (by key) → sort (by key) → reduce tasks]


Word Count Example

Word count problem:


Consider the problem of counting the number of occurrences of
each word in a large collection of documents.

Input:
A collection of (document name, document content)

Expected output
A collection of (word, count)

User defines:
Map: (offset, line) → [(word1, 1), (word2, 1), ...]
Reduce: (word, [1, 1, ...]) → [(word, count)]
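
A hedged sketch of this word count using Hadoop's Java MapReduce API (class names are illustrative; input and output paths are taken from the command line):

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map: (offset, line) -> [(word, 1), (word, 1), ...]
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reduce: (word, [1, 1, ...]) -> (word, count)
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "word count");   // Job.getInstance(conf, ...) in newer releases
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}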


Word Count Execution Flow


Monitoring the execution from the web console


http://172.16.203.132:50030/


Reference
Hadoop project official site
http://hadoop.apache.org/core/

Google File System Paper


http://labs.google.com/papers/gfs.html

Google MapReduce paper


http://labs.google.com/papers/mapreduce.html

Google: MapReduce in a Week


http://code.google.com/edu/submissions/mapreduce/listing.html

Google: Cluster Computing and MapReduce


http://code.google.com/edu/submissions/mapreduceminilecture/listing.html



