
BIG DATA ANALYTICS NOTES
Dr. Kirti Wankhede
IT Dept., SIMSR
Q.1) What is Big Data? What is the importance of Big Data? What are the
characteristics of Big Data?
Ans:
• Due to the rapid advancement in the field of IT, the amount of data
generated is alarming.
• Such huge amounts of data are referred to as Big Data. Industries use
Big Data to analyse market trends, study customer behaviour &
accordingly take financial decisions.
• The expansion of Big Data has also created vast job opportunities in
the IT industry.
• Big Data consists of large datasets that cannot be managed efficiently
by regular database management systems; data sizes typically range
from terabytes to exabytes.
• Phones, credit cards, social media sources and RFID devices generate
huge amounts of data, which remains stored unused on unknown servers
for long durations. With the help of Big Data, this information is
processed, analysed & turned into useful information.
• Examples:
- Every second, there are 822 Tweets on Twitter.
- Consumers make 11.5 million payments on PayPal, etc.
- According to IBM, 2.5 quintillion bytes of data are generated
every day.
Such amounts of data are Big Data.

Importance of Big Data:


Processing, studying & implementing the conclusions derived from the analysis
of Big Data helps organizations collect accurate data, take timely & more
informed strategic decisions, target the right set of audiences & customers,
increase benefits & reduce wastage & costs.
• Procurement - To find out which suppliers are more efficient & cost-
effective in delivering products on time.
• Product Development - To draw insights on innovative product &
service formats & designs for enhancing the development process &
coming up with in-demand products.
• Manufacturing - To identify machinery & process variations that may
be indicators of quality problems.
• Distribution - To enhance supply chain activities & standardize
optimal inventory levels vis-a-vis various external factors such as
weather, holidays, etc.
• Marketing - To identify which marketing campaigns will be the most
effective in driving & engaging customers & understanding customer
behaviours & channel behaviours.
• Price Management - To optimize prices based on the analysis of
external factors.
• Merchandising - To improve merchandise breakdown on the basis of
current buying patterns & to adjust inventory levels, drawing insight
from the analysis of various customer behaviours.
• Sales - To optimize the assignment of sales resources & accounts,
products & other operations.
• Store Operations - To adjust inventory levels on the basis of predicted
buying patterns, study of demographics, weather, key events, etc.
• Human Resources - To find out the characteristics & behaviours of
successful & effective employees, as well as of other employees, for
managing talent better.

Characteristics of Big Data:

● Volume
Big data implies enormous volumes of data. Earlier, data was generated only
by employees; now it is generated by machines, networks and human
interaction on systems like social media, so the volume of data to be analysed
is massive.
● Variety
Big Data includes two types of data: structured and unstructured. The data
comes from spreadsheets, databases, emails, photos, videos, monitoring
devices, PDFs, audio, etc. Such variety of unstructured data creates problems
for storage, mining and analysing data.
● Velocity
Data flows into Big Data systems at an extremely fast pace from business
processes, machines, social media sites, etc. The flow of data is massive and continuous.
This real-time data can help researchers and businesses make valuable
decisions that provide strategic competitive advantages.

Q2. Explain different big data types with suitable example.


Ans: Being able to manage a variety of data types is important because big
data encompasses everything from dollar transactions to tweets to images to
audio. Therefore, taking advantage of big data requires that all this
information be integrated for analysis and data management.
The three main types of big data:-
1) Structured data
2) Unstructured data
3) Semi-structured data

1) Structured data :-
i) The term structured data generally refers to data that has a defined
length and format. Structured data has defined, repeating patterns and
is organized in a predefined format.
ii) This kind of data accounts for about 20% of the data that is out
there. It’s usually stored in a database.
iii) For ex:-
- Structured data includes numbers, dates, and groups of words
and numbers called strings.
- Relational databases (in the form of tables)
- Flat files in the form of records (like tab-separated files)
- Multidimensional databases
- Legacy databases.
iv) The sources of data are divided into two categories:-
- Computer or machine generated:- Machine-generated data
generally refers to data that is created by a machine without human
intervention. Ex:- Sensor data, web log data, financial data.
- Human-generated:- This is data that humans, in interaction
with computers, supply. Ex:- Input data, Gaming-related data.

2) Unstructured data:-
i) Unstructured data is data that does not follow a specified format.
Unstructured data refers to information that either does not have a
pre-defined data model or is not organized in a predefined manner.
ii) Unstructured information is typically text-heavy, but may also
contain data such as dates, numbers and facts. 80% of business
relevant information originates in unstructured form, primarily text.
iii) Sources:-
- Social media:- YouTube, twitter, Facebook
- Mobile data:- Text messages and location information.
- Call center notes, e-mails, written comments in a survey,
blog entries.
iv) Mainly two types:-
- Machine generated :- It refers to data that is created by a
machine without human intervention.
Ex:-Satellite images, scientific data, photographs and video.
- Human generated :- It is generated by humans, in interaction
with computers, machines etc.
Ex:-Website content, text messages.

3) Semi-structured data:-
i) Semi-structured data is data that has not been organized into a
specialized repository such as a database, but that nevertheless has
associated information, such as metadata, that makes it more
amenable to processing than raw data.
ii) Schema-less or self-describing structure refers to a form of
structured data that contains tags or markup elements in order to
separate elements and generate hierarchies of records and fields in
the given data. Semi-structured data is a kind of data that falls
between structured and unstructured data.
iii) Sources:-
- File systems such as web data in the form of cookies.
- Web server log and search patterns.
- Sensor data.
Q3. Write down the difference between traditional Analytics & Big Data
Analytics.
Ans:
Traditional Data Warehouse Analytics vs. Big Data Analytics:

1. Data terrain: Traditional analytics works on a known data terrain, and on
data that is well understood. Most data warehouses have elaborate ETL
processes and database constraints, which means the data loaded into a data
warehouse is well understood, cleansed and in line with the business metadata.
The biggest advantage of Big Data Analytics is that it is targeted at
unstructured data outside the traditional means of capturing data, which means
there is no guarantee that the incoming data is well formed, clean and devoid
of errors. This makes it more challenging, but at the same time it gives scope
for much more insight into the data.

2. Data model: Traditional analytics is built on top of the relational data
model; relationships between the subjects of interest are created inside the
system and the analysis is done based on them. In the real world it is very
difficult to establish relationships between all the information in a formal
way, and hence unstructured data in the form of images, videos, mobile-generated
information, RFID, etc. has to be considered in Big Data Analytics. Most big
data analytics databases are based on columnar databases.

3. Latency: Traditional analytics is batch oriented, and we need to wait for
nightly ETL and transformation jobs to complete before the required insight is
obtained. Big Data Analytics is aimed at near real-time analysis of the data,
using software meant for that purpose.

4. Parallelism: Parallelism in a traditional analytics system is achieved
through costly hardware like MPP (Massively Parallel Processing) and/or SMP
systems. While there are appliances in the market for Big Data Analytics, it
can also be achieved through commodity hardware and a new generation of
analytical software like Hadoop or other analytical databases.
Q4. What is Big Data Analytics? Why is big data analytics important?
Explain different types of Big Data Analytics.
Ans:
What is Big Data Analytics?
• Data Analytics: Examining raw data, or analyzing large amounts of data
generated from a number of data sources, to find relationships among
important information or gain insight into the data, which can benefit
the data owner's organization in taking sound decisions.
• Big Data Analytics: The process of gaining insight into huge amount of
data (Big Data) in order to find out unseen, hidden, and useful
information for any organization which is left untouched by traditional
methods of analytics is known as Big Data Analytics.

Why is big data analytics important?


Big data analytics helps organizations harness their data and use it to identify
new opportunities.
1. Cost reduction - Big data technologies such as Hadoop and cloud-based
analytics bring significant cost advantages when it comes to storing large
amounts of data – plus they can identify more efficient ways of doing
business.
2. Faster, better decision making - With the speed of Hadoop and in-
memory analytics, combined with the ability to analyze new sources of
data, businesses are able to analyze information immediately – and make
decisions based on what they’ve learned.
3. New products and services - With the ability to gauge customer needs
and satisfaction through analytics comes the power to give customers what
they want. Davenport points out that with big data analytics, more
companies are creating new products to meet customers’ needs.

Explain different types of Big Data Analytics.

The different types of big data analytics are as follows:


1. Basic analytics (Descriptive analytics)
2. Advanced analytics (Predictive analytics)
3. Operational analytics (Prescriptive analytics)
Basic analytics (Descriptive analytics)
• The simplest class of analytics, one that allows you to condense big data
into smaller, more useful nuggets of information.
• The purpose of descriptive analytics is to summarize what happened,
when it happened & its impact.
• Involves visualizations of simple statistics.
• More than 80% of business analytics -- most notably social analytics,
where a lot of disparate data needs to be analyzed -- are descriptive.
• For example, the number of posts, mentions, fans, followers, page views,
kudos, +1s, check-ins, pins, etc. There are literally thousands of these
metrics -- it's pointless to list them -- but they are all just simple event
counters.
Some techniques & examples of descriptive analytics
• Slicing and dicing
– Division of data into smaller groups that are easy to explore. E.g.
sales data can be divided/categorized on the basis of region for
better analysis (see the short R sketch after this list).
• Basic monitoring
– Monitoring of huge volumes of data in real time. E.g. when a new ad
campaign is launched, monitoring the data related to it every single
minute.
• Anomaly identification
– Refers to the identification of anomalies (an anomaly is an event
showing a difference between actual observations & what you
expected in your data), e.g. in the operations of an organization.
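
For illustration only, here is a small R sketch of the slicing-and-dicing idea; the region names and sales figures are made up:
#made-up sales data for illustration
sales <- data.frame(
  region = c("North", "South", "North", "East", "South", "East"),
  amount = c(1200, 800, 950, 1100, 700, 1300))
#slice/dice: total and average sales per region
aggregate(amount ~ region, data = sales, FUN = sum)
aggregate(amount ~ region, data = sales, FUN = mean)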

Advanced analytics (Predictive analytics)

• It is one step ahead of basic analytics: it finds the cause of what has
happened & the measures that can be taken to prevent it from happening
again in the future.
• Predictive analytics is the next step up in data reduction. It utilizes a
variety of statistical, modeling, data mining, and machine learning
techniques to study recent and historical data, thereby allowing analysts
to make predictions about the future.
• The purpose of predictive analytics is NOT to tell you what will happen
in the future. It cannot do that. In fact, no analytics can do that.
Predictive analytics can only forecast what might happen in the future,
because all predictive analytics are probabilistic in nature.


Some examples of predictive analytics
• Statistical & Data mining algorithms
– Such as advanced forecasting, optimization, cluster analysis for
segmentation, & affinity analysis
• Predictive modeling
– Refers to a data mining solution that provides algorithms &
techniques that can be used on structured & unstructured data to
ascertain future outcomes. E.g. a telecommunications company can
identify the customers who are about to drop the service, or perform
sentiment analysis (see the short R sketch after this list).
• Text analytics
– Refers to the process of examining unstructured text, extracting
important information out of it, & transforming that information
into structured information that can be leveraged in different ways.
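
As a minimal sketch of the predictive idea (not any particular product, and not the only way to do it), the following R snippet fits a simple linear trend on made-up monthly sales figures and forecasts the next month:
# made-up historical data: 12 months of sales
history <- data.frame(month = 1:12,
                      sales = c(200, 210, 215, 230, 228, 240,
                                255, 260, 270, 268, 280, 295))
# fit a simple linear trend and forecast month 13
fit <- lm(sales ~ month, data = history)
predict(fit, newdata = data.frame(month = 13))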

Operational analytics (Prescriptive analytics)

• The emerging technology of prescriptive analytics goes beyond
descriptive and predictive models by recommending one or more
courses of action -- and showing the likely outcome of each decision.
• It's basically when we need to prescribe an action, so the business
decision-maker can take this information and act.
• Prescriptive analytics requires a predictive model with two additional
components: actionable data and a feedback system that tracks the
outcome produced by the action taken.
• Since a prescriptive model is able to predict the possible consequences
based on different choices of action, it can also recommend the best
course of action for any pre-specified outcome.
• E.g. an insurance company can use a model to predict the probability of
a claim being fraudulent. This model can form an important part of its
claims processing system, flagging claims that have a high probability
of fraud. The flagged claims are then sent to a separate unit for
investigation, which reviews them further (a small R sketch of this follows).
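
A hypothetical R sketch of the claims example above; it assumes some predictive model has already produced a fraud probability per claim, and the threshold and column names are purely illustrative:
# made-up scored claims: ids plus fraud probabilities from an upstream model
claims <- data.frame(claim_id = 1:5,
                     fraud_prob = c(0.05, 0.82, 0.40, 0.91, 0.10))
# prescriptive rule: route high-probability claims to the investigation unit
threshold <- 0.75
claims$action <- ifelse(claims$fraud_prob > threshold,
                        "send to investigation unit", "process normally")
claims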

Q5. Explain Hadoop Ecosystem?

Ans: Hadoop Ecosystem is a framework of various types of complex and
evolving tools and components. It can be defined as a comprehensive
collection of tools and technologies that can be effectively implemented
and deployed to provide Big Data solutions in a cost-effective manner.
Along with Hadoop MapReduce & HDFS, the Hadoop Ecosystem provides a
collection of various elements to support the development and deployment of
Big Data solutions.

Components of Hadoop Ecosystem:-


1) MapReduce:-
MapReduce is now the most widely used general purpose computing model
and runtime system for distributed data analytics. MapReduce is based on the
parallel programming framework to process the large amounts of data
dispersed across different systems. The process is initiated when a user
request is received to execute the MapReduce program and terminated once
the results are written back to HDFS. MapReduce enables the computational
processing of data stored in a file system without the requirement of loading
the data initially into a database.

2) Pig:-
Pig is a platform for constructing data flows for Extract, Transform, & Load
(ETL) processing and analysis of large data sets. Pig Latin, the programming
language for Pig, provides common data manipulation operations such as
grouping, joining, & filtering. Pig generates Hadoop MapReduce jobs to
perform the data flows. The Pig Latin scripting language is not only a
higher-level data flow language but also has operators similar to SQL (Ex:- FILTER, JOIN).
3) Hive:-
Hive is a SQL-based data warehouse system for Hadoop that facilitates data
summarization, ad hoc queries, and the analysis of large data sets stored in
Hadoop-compatible file systems (Ex:- HDFS) and some NoSQL databases.
Hive is not a relational database but a query engine that supports the parts of
SQL specific to querying data, with some additional support for writing new
tables or files, but not updating individual records.
4) HDFS:-
HDFS is an effective, scalable, fault tolerant, and distributed approach for
storing and managing huge volumes of data. HDFS works on write once read
many times approach and this makes it capable of handling such huge
volumes of data with the least possibilities of errors caused by the replication
of data.
5) HBASE:-
HBASE is one of the projects of APACHE Software Foundation that is
distributed under Apache Software License V2.0. It is a non-relational
database suitable for distributed Environment and uses HDFS as its
persistence storage. HBASE facilitates reading/writing of Big data randomly
and efficiently in real time. It is highly configurable, allows efficient
management of huge amount of data, and helps in dealing with Big Data
challenges in many ways.
6) Sqoop:-
Sqoop is a tool for data transfer between hadoop and relational databases.
Critical Processes are employed by MapReduce to move data into Hadoop and
back to other data sources. Sqoop is a command line interpreter which
sequentially executes sqoop commands.
Sqoop operates by selecting an appropriate import function for source data
from the specified database.
7) Zookeeper:-
Zookeeper helps in coordinating all the elements of distributed applications.
Zookeeper enables different nodes of a service to communicate and
coordinate with each other and also to find master IP addresses.
Zookeeper provides a central location for keeping information, thus acting as
a coordinator that makes the stored information available to all nodes of a
service.
8) Flume:-
Flume aids in transferring large amounts of data from distributed resources to
a single centralized repository. It is robust and fault-tolerant, and efficiently
collects, assembles, and transfers data. Flume is used for real-time data
capture in Hadoop. The simple and extensible data model of Flume
facilitates fast online data analytics.
9) Oozie :-
Oozie is an open source Apache Hadoop service used to manage and process
submitted jobs. It supports the workflow/coordination model and is highly
extensible and scalable. Oozie is a data-aware service that coordinates
dependencies among different jobs executing on different platforms of Hadoop
such as HDFS, Pig, and MapReduce.
10) Mahout:-
Mahout is a scalable machine learning and data mining library. There are
currently four main groups of algorithms in Mahout:-
a) Recommendations or collaborative filtering
b) Classification, categorization
c) Clustering
d) Frequent item set mining, parallel frequent pattern mining

Q6. What is R? Explain its features and limitations.


Ans:
What is R?
• R is a language and environment for Statistical Computing and Graphics.
• It is based on S - a statistical programming language earlier developed
at Bell Labs during 1975-76 by John Chambers and colleagues.
• Today, R is a widely used environment for statistical analysis.
R features
• Cross-platform.
• Free/Open Source Software.
• Package-based, rich repository of all sorts of packages required for data
analysis.
• Strong graphics capabilities.
• Highly extensible.
• Powerful LaTeX-like documentation environment.
• Strong user, developer communities, active development.
• Maintained by scientists for scientists.
Limitations of R
• Objects are stored in primary memory, which may impose performance
bottlenecks in the case of large datasets.
• No provision for built-in dynamic or 3D graphics, but external packages
like plot3D, scatterplot3D, etc. are available.
• Similarly, no built-in support for web-based processing; this can be done
through third-party packages.
• Functionality scattered among packages.

Q7. Write utility commands/functions in R


Ans: The commonly used utility functions are listed below; a short example session follows the list.
• setwd() - sets working directory.
setwd("C:/RDemo")
• getwd() - gets current working directory.
• dir() - lists the contents of current working directory.
• ls() - lists names of objects in R environment
• help.start() - provides general help.
• help("foo") or ?foo - help on function "foo". For ex. help("mean") or
?mean.
• help.search("foo") or ??foo - search for string "foo" in help system. For
ex. help.search("mean") or ??mean.
• example("foo") - shows examples of function "foo".
example("mean")
• library() - lists all available packages
• data(foo) - loads dataset “foo” in R. For ex. data(mtcars)
• library(foo) - load package “foo” in R. For ex. library(plyr).
• install.packages("foo") - installs package "foo". For ex.
install.packages("reshape2").
• help(package="package-name") - provides a brief description of the
package, an index of functions and datasets in the package.
• q() - quits current R session.
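
A short example session tying a few of these together (the directory, package and dataset names are just examples):
setwd("C:/RDemo")          # point R at a working directory
getwd()                    # confirm it
install.packages("plyr")   # install a package from CRAN (one-time)
library(plyr)              # load it for the current session
data(mtcars)               # load a built-in dataset
help("mean")               # open the help page for mean()
q()                        # quit the session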

Q8. Explain the common data objects in R.


Ans:
1. Vector
• Contains objects of same class.
• Creating Vector:
#using c() function
x<-c(1,2,3)
#using vector() function
y<-vector("logical", length=10)
#length of vector x
length(x)
Output: [1] 3
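Note: if elements of different classes are combined, R coerces them to a common class. A small illustrative example:
#mixing numeric, character and logical values coerces everything to character
z <- c(1, "a", TRUE)
class(z)
Output: [1] "character"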

2. Matrix
• Two-dimensional array having elements of same class.
• Creating Matrix:
#using matrix() function.
m<-matrix(c(1,2,3,11,12,13), nrow=2,ncol=3)
m
Output:
     [,1] [,2] [,3]
[1,]    1    3   12
[2,]    2   11   13

#dimensions of matrix m
dim(m)
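
By default, matrix() fills values column by column; passing byrow=TRUE fills row by row instead (shown here only for illustration):
#byrow=TRUE fills the matrix row-wise
m2 <- matrix(c(1,2,3,11,12,13), nrow=2, ncol=3, byrow=TRUE)
m2
Output:
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]   11   12   13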

3. List

• A list is a generic vector containing other objects.


• For example, the following variable x is a list containing copies of three
vectors n, s, b, and a numeric value 3.
> n = c(2, 3, 5)
> s = c("aa", "bb", "cc", "dd", "ee")
> b = c(TRUE, FALSE, TRUE, FALSE, FALSE)
> x = list(n, s, b, 3) # x contains copies of n, s, b
List Slicing
• We retrieve a list slice with the single square bracket "[]" operator. The
following is a slice containing the second member of x, which is a copy
of s.
• > x[2]
[[1]]
[1] "aa" "bb" "cc" "dd" "ee"
Member Reference
• In order to reference a list member directly, we have to use the double
square bracket "[[]]" operator. The following object x[[2]] is the second
member of x. In other words, x[[2]] is a copy of s, but it is not a slice
containing s or its copy.
> x[[2]]
[1] "aa" "bb" "cc" "dd" "ee"
• We can modify its content directly.
> x[[2]][1] = "ta"
> x[[2]]
[1] "ta" "bb" "cc" "dd" "ee"

4. Factors
• Factors are the data objects which are used to categorize the data and
store it as levels. They can store both strings and integers. They are
useful in columns which have a limited number of unique values, like
"Male", "Female" and TRUE, FALSE, etc. They are useful in data analysis
for statistical modeling.
• Factors are created using the factor() function by taking a vector as
input.
• # Create a vector
> input.data <- c("East","West","East","North","North","East","West","West",
"West","East","North")
• # Apply the factor function.
> factor_data <- factor(input.data)
> print(factor_data)
• Output:
[1] East West East North North East West West West East North
Levels: East North West
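Continuing the same example, summary() counts how many observations fall in each level:
> summary(factor_data)
Output:
 East North  West
    4     3     4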

5. Data frame
• Used to store tabular data. Can contain different classes.
student_id<-c(1,2,3)
student_names<-c("Ram","Shyam","Laxman")
position<-c("First","Second","Third")
#using data.frame() function
data<-data.frame(student_id,student_names,position)
data
#accessing a particular column
data$student_id
#no. of rows in data
nrow(data)
#no. of columns in data
ncol(data)
#column names of data; for a data frame, colnames() can also be used.
names(data)
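
Extending the same data frame, a few further illustrative ways to inspect and subset it:
#structure of the data frame: column classes and a preview of values
str(data)
#first row, all columns
data[1, ]
#rows where position is "First"
data[data$position == "First", ]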
