Tableau
Keen IO
Heap
Google Analytics
Crazyegg
Hadoop
Reign of big data
The term big data was first used to refer to increasing data volumes in the
mid-1990s.
In 2001, Doug Laney, then an analyst at consultancy Meta Group Inc.,
expanded the notion of big data to also include increases in the variety of
data being generated by organizations and the velocity at which that data
was being created and updated.
Those three factors -- volume, velocity and variety -- became known as
the 3Vs of big data, a concept Gartner popularized after acquiring Meta Group
and hiring Laney in 2005.
Separately, the Hadoop distributed processing framework was launched as
an Apache open source project in 2006, planting the seeds for a clustered
platform built on top of commodity hardware and geared to run big data
applications.
By 2011, big data analytics began to take a firm hold in organizations and the
public eye, along with Hadoop and various related big data technologies that
had sprung up around it.
Initially, as the Hadoop ecosystem took shape and started to mature, big data applications were primarily the province of large internet and e-commerce companies, such as Yahoo, Google and Facebook, as well as analytics and marketing services providers.
In ensuing years, though, big data analytics has increasingly been embraced
by retailers, financial services firms, insurers, healthcare organizations,
manufacturers, energy companies and other mainstream enterprises.
Types of analytics
Big data analytics is a form of business intelligence that is now used to increase profits and make better use of resources.
It can also improve managerial operations and take organisations to the next level.
Why, then, all the hype about big data?
The reasons why every company is inclined to adopt big data are –
YARN: responsible for allocating system resources to the various applications running in a Hadoop cluster and scheduling tasks to be executed on different cluster nodes.
MapReduce: a software framework that allows developers to write programs that process
massive amounts of unstructured data in parallel across a distributed cluster of processors or
stand-alone computers.
Spark: an open-source parallel processing framework that enables users to run large-scale
data analytics applications across clustered systems.
HBase: a column-oriented key/value data store built to run on top of the Hadoop Distributed
File System (HDFS).
Hive: an open-source data warehouse system for querying and analyzing large datasets stored
in Hadoop files.
Kafka: a distributed publish-subscribe messaging system designed to replace
traditional message brokers.
Pig: an open-source technology that offers a high-level mechanism for the parallel
programming of MapReduce jobs to be executed on Hadoop clusters.
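The MapReduce model listed above can be sketched in plain Python. This is a minimal, single-process illustration of the map, shuffle and reduce phases for a word count; it does not use the real Hadoop API, and the function names and sample documents are hypothetical.

```python
from collections import defaultdict

def map_phase(document):
    # Map: emit a (word, 1) pair for every word in the input split.
    for word in document.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle: group all emitted values by key, as the framework
    # would do between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reduce: sum the counts for one word.
    return (key, sum(values))

docs = ["big data big analytics", "data at scale"]
pairs = [p for d in docs for p in map_phase(d)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts)  # {'big': 2, 'data': 2, 'analytics': 1, 'at': 1, 'scale': 1}
```

In a real Hadoop cluster the map and reduce functions run in parallel on different nodes, and the shuffle is handled by the framework; the programming model, however, is exactly this pair of functions.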
Applications
A large number of medical devices are big data oriented.
Today, data is used to such an extent that a doctor can prescribe medicines without even visiting the patient, by reading the heartbeat and temperature from a monitoring watch worn on the wrist of a patient in a remote place.
Nanobots are miniature robots under development that will boost immunity in the human body by fighting bacteria and other harmful germs.
They have their own sensors and will be valuable for delivering chemotherapy.
These biotech robots could also be used to carry oxygen, destroy germs and repair tissue.
Public sector
Big data also enables better customer retention for insurance agencies.
In production, big data is the technology tool used to gain customer insights for transparent and simpler products, by identifying and predicting buyer behaviour through information obtained from internet websites, including social media, as well as CCTV video recordings.
Industry and natural resources
The public sector uses big data in traffic management, route planning, intelligent transport systems and congestion management.
The private sector uses big data in revenue management, manufacturing improvements, logistics and for competitive advantage.
Personal use of big data includes route forecasting to save on fuel and time, and for travel activities such as sightseeing.
Contributions in finance & crime detection
In the banking sector, big data is used to detect fraudulent activity, such as misuse of credit cards and debit cards.
In business, big data helps greatly in understanding customers' shopping patterns and competitors' CRM tactics, so that companies can apply these lessons to improve their own sales.
Statistics
Statistical functions and algorithms can be used to analyse primary data, build statistical models and predict outcomes.
An analysis of any situation can be done in two ways.
Although both kinds of analysis are useful, statistical analysis gives deeper insight and a clearer picture, which makes it more widely applicable in business.
There are 2 major categories of statistics:
Descriptive statistics
Inferential statistics
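The difference between the two categories can be shown with a small sketch, using a hypothetical sample of eight measurements: descriptive statistics summarise the observed sample itself (its mean and spread), while inferential statistics use the sample to estimate something about the wider population — here an approximate 95% confidence interval for the population mean, using the normal approximation.

```python
import math
import statistics

sample = [4, 5, 6, 5, 4, 6, 5, 5]  # hypothetical measurements

# Descriptive: summarise the data we actually observed.
mean = statistics.mean(sample)      # 5.0
sd = statistics.stdev(sample)       # sample standard deviation

# Inferential: estimate the population mean from the sample.
se = sd / math.sqrt(len(sample))    # standard error of the mean
low, high = mean - 1.96 * se, mean + 1.96 * se
print(f"mean={mean}, 95% CI=({low:.2f}, {high:.2f})")
```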
Descriptive Statistics
It helps organize data and focuses on the main characteristics of the data.
Descriptive statistics are used in areas such as:
Insurance
Stock market
Genetics
Medical studies
Shopping
Weather forecasting
Related terms…
Population
Sample
Variable
Quantitative variable
Qualitative variable
Discrete variable
Continuous variable
Measures of spread
Spread describes how similar or varied the set of observed values is for a particular variable.
The measures of spread are:
Standard deviation – the square root of the variance; it measures how far the data deviate from the mean.
Variance – gives us an understanding of how far the measurements are from the mean.
Quartiles – give us an understanding of how spread out the given data are.
The measures of spread are also called measures of dispersion.
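As a minimal sketch using Python's standard library (the data set is hypothetical), the three measures of spread can be computed directly:

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical observations

variance = statistics.pvariance(data)        # population variance = 4.0
stdev = statistics.pstdev(data)              # square root of variance = 2.0
quartiles = statistics.quantiles(data, n=4)  # Q1, Q2 (median), Q3

print(variance, stdev, quartiles)
```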
Measures of position
Position identifies the exact location of a particular data value in the given
data set.
The measures of position are percentiles, quartiles and standard scores.
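Two of these measures of position can be sketched with a hypothetical data set: the standard score (z-score) tells us how many standard deviations a value lies from the mean, and the percentile rank tells us what fraction of the observations fall at or below it.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical observations
x = 9

# Standard score: how many standard deviations x is from the mean.
z = (x - statistics.mean(data)) / statistics.pstdev(data)
print(z)  # (9 - 5) / 2 = 2.0

# Percentile rank: share of observations at or below x.
rank = 100 * sum(v <= x for v in data) / len(data)
print(rank)  # 100.0 — x is the largest value
```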
Random variables
A random variable is a variable whose value is determined by the outcome of a random experiment.
Eg: when drawing two cards from a deck of cards, what are the chances of getting two aces?
A discrete probability distribution is a table or formula that lists the probabilities for each outcome of the random variable X.
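The two-aces example can be checked with a short computation: the first card is an ace with probability 4/52, and, given that, the second card is an ace with probability 3/51.

```python
from fractions import Fraction

# P(both aces) = P(first ace) * P(second ace | first ace)
p = Fraction(4, 52) * Fraction(3, 51)
print(p, float(p))  # 1/221, roughly 0.0045
```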
Example of statistical analysis