ANALYTICS
NOTES
Dr. Kirti Wankhede
IT Dept., SIMSR
Q.1) What is Big Data? What is the importance of Big Data? What are the
characteristics of Big Data?
Ans:
Due to the rapid advancement in the field of IT, the amount of data
generated is enormous.
Such a huge amount of data is referred to as Big Data. Industries use
Big Data to analyse market trends and study customer behaviour, and
accordingly take financial decisions.
The expansion of Big Data has also created vast job opportunities in
the IT industry.
Big Data consists of large datasets that cannot be managed
efficiently by regular database management systems; data sizes
generally range from terabytes to exabytes.
Phones, credit cards, social media sources, and RFID devices generate
huge amounts of data that would otherwise sit unused on remote servers
for long durations; with the help of Big Data this information is
processed, analysed, and turned into useful information.
Example:-
Every second, there are 822 Tweets on Twitter.
Consumers make 11.5 million payments on PayPal, etc.
According to IBM, 2.5 quintillion bytes of data are generated every
day.
Data at such a scale is Big Data.
Characteristics of Big Data:
● Volume
Big Data implies enormous volumes of data. Earlier, data was generated only
by employees; now it is generated by machines, networks, and human
interaction on systems like social media, so the volume of data to be
analysed is massive.
● Variety
Big Data stores both structured and unstructured data. The data
comes from spreadsheets, databases, emails, photos, videos, monitoring
devices, PDFs, audio, etc. Such a variety of unstructured data creates
problems for storing, mining and analysing the data.
● Velocity
Data flows into Big Data systems at an extremely fast pace from business
processes, machines, social media sites, etc. The flow of data is massive and continuous.
This real-time data can help researchers and businesses make valuable
decisions that provide strategic competitive advantages.
Q.2) Explain the types of digital data: structured, unstructured, and
semi-structured data.
Ans:
1) Structured data :-
i) The term structured data generally refers to data that has a defined
length and format. Structured data has defined, repeating patterns and
is organized in a predefined format.
ii) This kind of data accounts for about 20% of the data that is out
there. It’s usually stored in a database.
iii) For example:-
- Structured data includes numbers, dates, and groups of words and
numbers called strings.
- Relational databases (in the form of tables)
- Flat files in the form of records (like tab-separated files)
- Multidimensional databases
- Legacy databases
iv) The sources of data are divided into two categories:-
- Computer or machine generated:- Machine-generated data
generally refers to data that is created by a machine without human
intervention. Ex:- sensor data, web log data, financial data.
- Human-generated:- This is data that humans, in interaction
with computers, supply. Ex:- input data, gaming-related data. (A small
R sketch of reading a structured flat file follows below.)
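As a minimal illustration in base R (consistent with the R examples later in
these notes), a tab-separated flat file is structured data: every record has
the same fields with a defined format. The file contents and column names
below are made up for the example.
# A tiny tab-separated "flat file" held in a string (hypothetical records)
flat_file <- "id\tname\tamount\n1\tRam\t250.50\n2\tShyam\t120.00\n3\tLaxman\t99.99"
# read.delim() parses the tab-separated records into a data frame,
# i.e. structured data with one class per column
records <- read.delim(text = flat_file)
records
# str() shows the defined format: integer, character (or factor,
# depending on the R version), and numeric columns
str(records)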
2) Unstructured data:-
i) Unstructured data is data that does not follow a specified format.
Unstructured data refers to information that either does not have a
pre-defined data model or is not organized in a predefined manner.
ii) Unstructured information is typically text-heavy, but may also
contain data such as dates, numbers and facts. 80% of business
relevant information originates in unstructured form, primarily text.
iii) Sources:-
- Social media:- YouTube, Twitter, Facebook
- Mobile data:- Text messages and location information.
- Call center notes, e-mails, written comments in a survey,
blog entries.
iv) Mainly two types:-
- Machine generated :- It refers to data that is created by a
machine without human intervention.
Ex:-Satellite images, scientific data, photographs and video.
- Human generated :- It is generated by humans, in interaction
with computers, machines etc.
Ex:-Website content, text messages.
3) Semi-structured data:-
i) Semi-structured data is data that has not been organized into a
specialized repository such as a database, but that nevertheless has
associated information, such as metadata, that makes it more
amenable to processing than raw data.
ii) A schema-less or self-describing structure refers to a form of
data that contains tags or markup elements in order to separate
elements and generate hierarchies of records and fields in the given
data. Semi-structured data is a kind of data that falls between
structured and unstructured data (a small JSON example follows the
list of sources below).
iii) Sources:-
- File systems such as web data in the form of cookies.
- Web server log and search patterns.
- Sensor data.
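As a minimal sketch of such self-describing, tag-based data, the snippet below
parses a small JSON document in R. It assumes the third-party jsonlite package
is installed, and the field names and values are made up for illustration.
# Hypothetical JSON: the tags (field names) describe the data,
# but there is no rigid, table-like schema
library(jsonlite)
json_text <- '{"user": "abc", "logins": 3, "devices": ["mobile", "laptop"]}'
record <- fromJSON(json_text)
record$user      # "abc"
record$devices   # "mobile" "laptop"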
Q3. Write down the difference between traditional Analytics & Big Data
Analytics.
Ans:
Traditional Data Warehouse Analytics vs. Big Data Analytics:
1) Data terrain:
- Traditional Analytics analyzes a known data terrain, i.e. data that is well
understood. Most data warehouses have elaborate ETL processes and database
constraints, which means the data loaded into a data warehouse is well
understood, cleansed, and in line with the business metadata.
- The biggest advantage of Big Data Analytics is that it is targeted at
unstructured data outside of the traditional means of capturing data, which
means there is no guarantee that the incoming data is well formed, clean, and
devoid of errors. This makes it more challenging, but at the same time it
gives scope for much more insight into the data.
2) Data model:
- Traditional Analytics is built on top of the relational data model;
relationships between the subjects of interest have been created inside the
system, and the analysis is done based on them.
- In the real world it is very difficult to establish relationships between
all the information in a formal way, hence unstructured data in the form of
images, videos, mobile-generated information, RFID, etc. has to be considered
in Big Data Analytics. Most Big Data analytics databases are based on columnar
databases.
3) Processing mode:
- Traditional Analytics is batch oriented, and we need to wait for nightly ETL
and transformation jobs to complete before the required insight is obtained.
- Big Data Analytics is aimed at near real-time analysis of the data, using
software meant for that purpose.
• It is one step ahead of basic analytics: it finds the cause of what has
happened & the measures that can be taken to prevent it from happening
again in the future.
• Predictive analytics is the next step up in data reduction. It utilizes a
variety of statistical, modeling, data mining, and machine learning
techniques to study recent and historical data, thereby allowing analysts
to make predictions about the future (a small R sketch follows this list).
• The purpose of predictive analytics is NOT to tell you what will happen
in the future. It cannot do that. In fact, no analytics can do that.
Predictive analytics can only forecast what might happen in the future,
because all predictive analytics are probabilistic in nature.
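As a minimal, hedged illustration of the idea (not any specific product), the
base R sketch below fits a simple linear model on made-up historical data and
forecasts a probable value, with a prediction interval to emphasize that the
forecast is probabilistic.
# Made-up historical data: advertising spend vs. sales
spend <- c(10, 15, 20, 25, 30, 35)
sales <- c(102, 148, 205, 247, 300, 352)
# Fit a simple linear regression model on the historical data
model <- lm(sales ~ spend)
# Forecast sales for a new spend value; the interval reflects uncertainty,
# i.e. the prediction is probabilistic, not a statement of what will happen
predict(model, newdata = data.frame(spend = 40), interval = "prediction")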
2) Pig:-
Pig is a platform for constructing data flows for Extract, Transform & Load
(ETL) processing and analysis of large data sets. Pig Latin, the programming
language for Pig, provides common data manipulation operations such as
grouping, joining, & filtering. Pig generates Hadoop MapReduce jobs to
perform the data flows. The Pig Latin scripting language is not only a
higher-level data flow language but also has operators similar to
SQL (Ex:- FILTER, JOIN).
3) Hive:-
Hive is a SQL-based data warehouse system for Hadoop that facilitates data
summarization, ad hoc queries, and the analysis of large data sets stored in
Hadoop-compatible file systems (Ex:- HDFS) and some NoSQL databases.
Hive is not a relational database but a query engine that supports the parts
of SQL specific to querying data, with some additional support for writing
new tables or files, but not for updating individual records.
4) HDFS:-
HDFS is an effective, scalable, fault-tolerant, and distributed approach for
storing and managing huge volumes of data. HDFS works on a write-once,
read-many-times approach, which, together with the replication of data, makes
it capable of handling such huge volumes of data with the least possibility
of error.
5) HBASE:-
HBASE is one of the projects of the Apache Software Foundation and is
distributed under the Apache Software License v2.0. It is a non-relational
database suitable for distributed environments and uses HDFS as its
persistent storage. HBASE facilitates reading/writing Big Data randomly
and efficiently in real time. It is highly configurable, allows efficient
management of huge amounts of data, and helps in dealing with Big Data
challenges in many ways.
6) Sqoop:-
Sqoop is a tool for data transfer between Hadoop and relational databases.
It employs MapReduce to move data into Hadoop and back out to other data
sources. Sqoop is a command line interpreter which sequentially executes
Sqoop commands.
Sqoop operates by selecting an appropriate import function for the source
data from the specified database.
7) Zookeeper:-
ZooKeeper helps in coordinating all the elements of distributed applications.
ZooKeeper enables different nodes of a service to communicate and
coordinate with each other and also to find the master node's IP address.
ZooKeeper provides a central location for keeping information, thus acting as
a coordinator that makes the stored information available to all nodes of a
service.
8) Flume:-
Flume aids in transferring large amounts of data from distributed resources to
a single centralized repository. It is robust and fault tolerant, and it
efficiently collects, aggregates, and transfers data. Flume is used for
real-time data capturing in Hadoop. The simple and extensible data model of
Flume facilitates fast online data analytics.
9) Oozie:-
Oozie is an open source Apache Hadoop service used to manage and process
submitted jobs. It supports the workflow/coordination model and is highly
extensible and scalable. Oozie is a workflow scheduler service that
coordinates dependencies among different jobs executing on different Hadoop
platforms such as HDFS, Pig, and MapReduce.
10) Mahout:-
Mahout is a scalable machine learning and data mining library. There are
currently four main groups of algorithms in Mahout (a small clustering
illustration in R follows this list):-
a) Recommendations or collaborative filtering
b) Classification, categorization
c) Clustering
d) Frequent item set mining, parallel frequent pattern mining
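Mahout itself is a Java library, so as a rough, language-swapped illustration
of what the clustering group of algorithms does, here is a small k-means
sketch using base R's kmeans() on made-up points.
# Made-up two-dimensional points forming two loose groups
x <- c(1.0, 1.2, 0.8, 5.0, 5.3, 4.9)
y <- c(2.0, 1.8, 2.2, 6.1, 5.8, 6.0)
pts <- data.frame(x, y)
# k-means clustering with k = 2; each point is assigned to the nearest centroid
fit <- kmeans(pts, centers = 2)
fit$cluster   # cluster label for each point
fit$centers   # coordinates of the two cluster centroids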
2. Matrix
• Two-dimensional array having elements of the same class.
• Creating Matrix:
#using matrix() function; values fill column by column by default
m <- matrix(c(1, 2, 3, 11, 12, 13), nrow = 2, ncol = 3)
m
Output:
     [,1] [,2] [,3]
[1,]    1    3   12
[2,]    2   11   13
#dimensions of matrix m
dim(m)
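As a small follow-up (base R), byrow = TRUE fills the same values row by row
instead of column by column.
#filling the matrix row-wise instead of column-wise
m2 <- matrix(c(1, 2, 3, 11, 12, 13), nrow = 2, ncol = 3, byrow = TRUE)
m2
Output:
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]   11   12   13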
3. List
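• A list is an R object that can contain elements of different types: numbers,
strings, vectors, and even other lists or functions. Unlike vectors and
matrices, the elements of a list need not be of the same class.
• Creating List (a minimal base R sketch; the element values are made up for
illustration):
#using list() function
student <- list(name = "Ram", marks = c(80, 95, 72), passed = TRUE)
student
#accessing list elements by name or by position
student$name
student[[2]]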
4. Factors
• Factors are the data objects which are used to categorize the data and
store it as levels. They can store both strings and integers. They are
useful in columns which have a limited number of unique values,
like "Male", "Female" and TRUE, FALSE, etc. They are useful in data analysis
for statistical modeling.
• Factors are created using the factor() function by taking a vector as
input.
• # Create a vector
> input.data <- c("East","West","East","North","North","East","West","West",
"West","East","North")
• # Apply the factor function.
> factor_data <- factor(input.data)
> print(factor_data)
• Output:
[1] East West East North North East West West West East North
Levels: East North West
5. Data frame
• Used to store tabular data. Unlike a matrix, its columns can contain
different classes.
student_id<-c(1,2,3)
student_names<-c("Ram","Shyam","Laxman")
position<-c("First","Second","Third")
#using data.frame() function
data<-data.frame(student_id,student_names,position)
#printing the data frame (R is case-sensitive, so use the lower-case name)
data
#accessing a particular column
data$student_id
#no. of rows in data
nrow(data)
#no. of columns in data
ncol(data)
#column names of data. for a dataframe, colnames() can also be used.
names(data)
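As a small follow-up (base R, using the data frame created above), rows can
also be selected by position or by a logical condition.
#first row, all columns
data[1, ]
#rows where position is "First"
data[data$position == "First", ]
#structure of the data frame: column names and classes
str(data)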