
Course Topics

Week 1
Understanding Big Data
Introduction to HDFS
Playing around with the Cluster
Data Loading Techniques

Week 2
Map-Reduce Basics, Types and Formats
Use Cases for Map-Reduce
Analytics using Pig
Understanding Pig Latin

Week 3
Analytics using Hive
Understanding HiveQL
NoSQL Databases
Understanding HBase

Week 4
Zookeeper, Sqoop, Flume
Debug MapReduce Programs in Eclipse
Real-World Datasets and Analysis
Planning a Career in Big Data

What is Big Data?

Facebook Example

Facebook users spend 10.5 billion minutes (almost 20,000 years) online on the social network. An average of 3.2 billion likes and comments are posted on Facebook every day.

Twitter Example
Twitter has over 500 million registered users. The USA's 141.8 million accounts represent 27.4 percent of all Twitter users, well ahead of Brazil, Japan, the UK and Indonesia.
79% of US Twitter users are more likely to recommend brands they follow.
67% of US Twitter users are more likely to buy from brands they follow.
57% of all companies that use social media for business use Twitter.

Other Industrial Usecases


Insurance
Healthcare
Genome Sequencing
Utilities

Hadoop Users

http://wiki.apache.org/hadoop/PoweredBy

Data volume is growing exponentially

Estimated Global Data Volume: 2011: 1.8 ZB

2015: 7.9 ZB
The world's information doubles every two years.
Over the next 10 years:
The number of servers worldwide will grow by 10x.
The amount of information managed by enterprise data centers will grow by 50x.
The number of files enterprise data centers handle will grow by 75x.

Source: http://www.emc.com/leadership/programs/digit al-universe.htm, which was based on the 2011 IDC Digital Universe Study

Un-Structured Data is exploding

Why DFS?
Read 1 TB of data:

1 Machine
4 I/O channels, each channel 100 MB/s: 45 minutes

10 Machines
4 I/O channels per machine, each channel 100 MB/s: 4.5 minutes
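The arithmetic behind the slide's numbers can be sketched as below; the figures (1 TB, 4 channels, 100 MB/s per channel) come from the slide, and the slide rounds the single-machine time up to 45 minutes.

```python
# Illustrative sketch: estimate how long reading 1 TB takes when the work
# is spread across machines, using the slide's assumed hardware numbers.

def read_time_minutes(data_tb, machines, channels_per_machine=4, channel_mb_s=100):
    """Time in minutes to read `data_tb` terabytes in parallel."""
    total_mb = data_tb * 1_000_000  # 1 TB = 1,000,000 MB (decimal units)
    aggregate_mb_s = machines * channels_per_machine * channel_mb_s
    return total_mb / aggregate_mb_s / 60

print(read_time_minutes(1, 1))   # 1 machine: ~41.7 minutes (slide rounds to 45)
print(read_time_minutes(1, 10))  # 10 machines: ~4.2 minutes (slide: 4.5)
```

Ten machines give ten times the aggregate I/O bandwidth, so the read time drops by a factor of ten; this is the core motivation for a distributed file system.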

What Is a Distributed File System (DFS)?

What is Hadoop?
Apache Hadoop is a framework that allows for the distributed processing of large data sets across clusters of commodity computers using a simple programming model.

Companies using Hadoop:
- Yahoo
- Google
- Facebook
- Amazon
- AOL
- IBM
- And many more at http://wiki.apache.org/hadoop/PoweredBy

Hadoop Eco-System

Hadoop Core Components:


HDFS (Hadoop Distributed File System): storage
MapReduce: processing
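The MapReduce processing model can be illustrated with the classic word-count example. This is a minimal in-memory sketch of the map, shuffle, and reduce phases only; a real Hadoop job is written against the Hadoop MapReduce API and reads its input from HDFS.

```python
# Minimal sketch of the MapReduce flow (word count), not a real Hadoop job.
from collections import defaultdict

def map_phase(line):
    # Map: emit a (word, 1) pair for every word in one line of input.
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    # Shuffle: group all values by key, as the framework does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reduce: sum the counts emitted for one word.
    return key, sum(values)

lines = ["big data big ideas", "data everywhere"]
mapped = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts)  # {'big': 2, 'data': 2, 'ideas': 1, 'everywhere': 1}
```

The division of labor mirrors Hadoop's: map tasks run independently on splits of the data (stored in HDFS), and the framework handles grouping before the reduce tasks aggregate the results.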

Any questions? See you in the next class.


Thank you. Sainagaraju vaduka
