I. INTRODUCTION
Online social networks, like Facebook or Twitter, are
increasingly responsible for a significant portion of the
digital content produced today. As a consequence, it becomes essential for publishers, stakeholders and observers
to understand and analyze the data streams originating from
those networks in real-time.
Advertisers, for instance, would originally publicize their
latest campaigns statically using pre-selected hashtags on
Twitter. Today, real-time data processing opens the door
to continuous tracking of their campaign on the social
networks, and to online adaptation of the content being
published to better interact with the public, e.g., by augmenting or linking the original content to new content, or
by reposting the material using new hashtags.
Typical Big Data analytics solutions such as batch data
processing systems can scale-out gracefully and provide
insight into large amounts of historical data at the expense
of a high latency. They are hence a bad fit for online and
dynamic analyses on continuous streams of potentially high
velocity.
In the following case study, we discuss how a modern, popular and open-source stream processing system, Storm (http://storm-project.net/), can be put to use to carry out complex analytics such as trend prediction on high-velocity streams. We introduce our problem and briefly describe related work below in Section II. We introduce Storm and present our solution in Section III.
platform that allows programmers to develop arbitrary applications for processing continuous and unbounded streams of data. IBM InfoSphere (http://www-01.ibm.com/software/data/infosphere/), Apache S4 (http://incubator.apache.org/s4/) or Twitter Storm are popular examples of such frameworks.
We focus on the Storm framework in the following. We
describe a production-ready pipeline for integrating and
processing both Twitter and Bitly streams and supporting
two essential features:
• Real-Time Complex Analysis: we wish our system to support complex analyses on data streams continuously and in real-time.
• Distribution & Fault Tolerance: in addition, the system should be able to scale out (i.e., to support parallel processing on clusters of commodity machines) to handle high-velocity streams whenever necessary and also for fault-tolerance purposes.
III. SOLUTION OVERVIEW
The general idea of our real-time trend analysis implementation is to combine real-time data from Bitly, a
popular URL shortening and bookmarking service, with real-time data from Twitter, an online microblogging platform.
Bitly.com is a very well-known link shortener. It is mainly used for shortening long links so as to use the 140-character limit on Twitter more efficiently. For example, if someone wants to post a tweet containing a link to the Wikipedia page of Albert Einstein, he/she can shorten the link http://en.wikipedia.org/wiki/Albert_Einstein with Bitly to produce a shorter URL: http://bit.ly/4za3r. In this URL, 4za3r is the code used by Bitly to match the shortened URL with the original one. In the following, we refer to this value as the Bitly code.
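For illustration, the small helper below (not part of the original pipeline) shows how Bitly codes could be extracted from the text of a tweet with a regular expression; restricting the pattern to the bit.ly domain is an assumption, and other Bitly-operated domains would need additional alternatives.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BitlyCodeExtractor {
    // Matches links such as http://bit.ly/4za3r and captures the code part.
    private static final Pattern BITLY_LINK =
            Pattern.compile("https?://bit\\.ly/([A-Za-z0-9]+)");

    public static List<String> extractCodes(String tweetText) {
        List<String> codes = new ArrayList<>();
        Matcher m = BITLY_LINK.matcher(tweetText);
        while (m.find()) {
            codes.add(m.group(1));   // e.g. "4za3r"
        }
        return codes;
    }

    public static void main(String[] args) {
        System.out.println(extractCodes(
                "Albert Einstein on Wikipedia http://bit.ly/4za3r #physics"));
    }
}
```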
From a data analysis perspective, the integration of Bitly
and Twitter streams in real-time allows for a finer-grained
analysis of the social interactions between the different
users sharing the data. For example, it becomes possible to
construct a timeline between the posting of the information
on Twitter and the corresponding Webpage visits by analyzing both Twitter and Bitly data (Figure 1 below shows an
example of such an analysis). This profile could then be used
to track social advertising campaigns and further increase the
reach and click rate by reacting to the information provided
by this analysis. The main idea of the present use case is hence to track the Bitly clicks and to record the associated hashtags or words over time. This allows for
a detailed sentiment analysis and trend detection. In our
system, we use a distributed rolling count implementation
that calculates the frequency of terms over a moving window.
A. Stream Processing Architecture
A Storm cluster can be compared to a Hadoop cluster. Whereas on Hadoop you run MapReduce jobs, on Storm you run topologies. Unlike a MapReduce job, which eventually finishes, a topology keeps processing messages until it is explicitly killed.
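To make the analogy concrete, the sketch below shows what defining and submitting such a topology looks like with the Storm 0.8/0.9-era API. The component classes and parallelism settings (a Kafka spout, a word-extraction bolt, rolling-count bolts and a collector) are illustrative placeholders rather than the exact components of our implementation.

```java
import backtype.storm.Config;
import backtype.storm.StormSubmitter;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.tuple.Fields;

public class TrendTopology {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();

        // Hypothetical components: a spout reading tweets from Kafka, a bolt
        // extracting words/hashtags, parallel rolling-count bolts, a collector.
        builder.setSpout("tweets", new KafkaTweetSpout(), 2);
        builder.setBolt("words", new WordExtractorBolt(), 4)
               .shuffleGrouping("tweets");
        builder.setBolt("counts", new RollingCountBolt(300, 10), 8)
               .fieldsGrouping("words", new Fields("word")); // same word -> same task
        builder.setBolt("collector", new CollectorBolt(), 1)
               .globalGrouping("counts");                    // merge partial counts

        Config conf = new Config();
        conf.setNumWorkers(4);
        StormSubmitter.submitTopology("trend-analysis", conf, builder.createTopology());
    }
}
```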
Figure 1. Our current stream dashboard combining in real-time both Twitter and Bitly streams to provide additional information such as geographic
information about the users and click timelines for each Tweet containing a Bitly link.
A. Twitter Data
We were able to use a Twitter dataset from June 2012
in our experiments. The whole dataset (Bitly and Twitter)
is about 1600 GB of compressed text. The main difficulty
with this static dataset is that Storm is designed to process
continuous streams rather than a historical dataset. To bypass
this issue, we simulate a stream by reading the files line by line and pushing them to a Kafka (http://kafka.apache.org/) message queue.
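A minimal sketch of this replay step is shown below; it assumes the current Kafka producer API and an illustrative topic name, whereas our original setup relied on an earlier Kafka release.

```java
import java.io.BufferedReader;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class StreamReplayer {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");

        // Read the archived tweets line by line and push each line into Kafka,
        // turning the static dataset into a stream the topology can consume.
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props);
             BufferedReader reader = Files.newBufferedReader(Paths.get(args[0]))) {
            String line;
            while ((line = reader.readLine()) != null) {
                producer.send(new ProducerRecord<>("tweets", line));
            }
        }
    }
}
```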
From the Twitter stream, we extract the hashtags and the words of the tweets containing the Bitly code. Since the Bitly code is unique for each URL, this allows us to track specific topics even though the hashtags might vary or contain spelling errors.
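One possible way to realize this matching in Storm, sketched below under our own assumptions about component and field names, is to route both the tweet stream and the Bitly click stream to the same bolt instance with a fields grouping on the Bitly code, so that the bolt can remember which hashtags have been seen for each code.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

// Receives tuples from two upstream components, both fieldsGrouped on "code",
// so all tweets and clicks for a given Bitly code arrive at the same task.
// The upstream component name "tweet-parser" is a hypothetical placeholder.
public class CodeJoinBolt extends BaseBasicBolt {
    private final Map<String, Set<String>> hashtagsByCode = new HashMap<>();

    @Override
    public void execute(Tuple tuple, BasicOutputCollector collector) {
        String code = tuple.getStringByField("code");
        if ("tweet-parser".equals(tuple.getSourceComponent())) {
            // Tweet side: remember the hashtags associated with this code.
            hashtagsByCode.computeIfAbsent(code, c -> new HashSet<>())
                          .add(tuple.getStringByField("hashtag"));
        } else {
            // Bitly decode side: a click on this code; emit it together with
            // every hashtag seen so far for that code.
            for (String tag : hashtagsByCode.getOrDefault(code, new HashSet<>())) {
                collector.emit(new Values(code, tag));
            }
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("code", "hashtag"));
    }
}
```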
Our rolling count is implemented using a single bolt as an emitter for words and hashtags, and then uses multiple instances of an aggregation bolt to distribute the load of the count. Each bolt maintains minimal stateful information. The results of the count are emitted at a fixed time interval to a single collector bolt that merges the results from all instances of the aggregation bolt.
The configuration can be tuned using two parameters. The first parameter is the size of the moving window. The larger the window, the larger the memory consumption of the aggregation bolt. As we might record not only hashtags but all other words as well, memory consumption can become critical, even if the original tweet text has been cleaned of stop words. The second parameter is the emit frequency of the rolling count from the central collector node. Depending on the use case, the emit frequency should be chosen between a minimal delay and the width of the moving window to avoid losing information.
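A simplified sketch of such an aggregation bolt, exposing the two parameters discussed above, is shown below. It uses Storm's tick tuples to trigger the periodic emit and a plain slot-based counter per term; it is an illustration under these assumptions, not our exact implementation.

```java
import java.util.HashMap;
import java.util.Map;
import backtype.storm.Config;
import backtype.storm.Constants;
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

public class RollingCountBolt extends BaseBasicBolt {
    private final int emitFrequencySecs;   // how often window totals are emitted
    private final int numSlots;            // window length = numSlots * emitFrequencySecs
    private final Map<String, long[]> slotCounts = new HashMap<>();
    private int currentSlot = 0;

    public RollingCountBolt(int windowLengthSecs, int emitFrequencySecs) {
        this.emitFrequencySecs = emitFrequencySecs;
        this.numSlots = windowLengthSecs / emitFrequencySecs;
    }

    @Override
    public Map<String, Object> getComponentConfiguration() {
        // Ask Storm to send this bolt a "tick" tuple every emitFrequencySecs seconds.
        Map<String, Object> conf = new HashMap<>();
        conf.put(Config.TOPOLOGY_TICK_TUPLE_FREQ_SECS, emitFrequencySecs);
        return conf;
    }

    @Override
    public void execute(Tuple tuple, BasicOutputCollector collector) {
        if (isTick(tuple)) {
            // Emit the total count over the window, then advance and clear the
            // oldest slot. Entries whose total is zero could be pruned here to
            // bound memory, which is the concern discussed above.
            for (Map.Entry<String, long[]> e : slotCounts.entrySet()) {
                long total = 0;
                for (long c : e.getValue()) total += c;
                collector.emit(new Values(e.getKey(), total));
            }
            currentSlot = (currentSlot + 1) % numSlots;
            for (long[] slots : slotCounts.values()) slots[currentSlot] = 0;
        } else {
            // Regular tuple: count the term in the current slot.
            String word = tuple.getStringByField("word");
            slotCounts.computeIfAbsent(word, w -> new long[numSlots])[currentSlot]++;
        }
    }

    private boolean isTick(Tuple tuple) {
        return Constants.SYSTEM_COMPONENT_ID.equals(tuple.getSourceComponent())
                && Constants.SYSTEM_TICK_STREAM_ID.equals(tuple.getSourceStreamId());
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word", "count"));
    }
}
```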
B. Bitly Data
Through a cooperation between Bitly and Verisign, we
were able to access the complete Bitly stream for the
same period as for the Twitter stream and to combine the
information of both feeds. The Bitly stream is actually
separated into two streams: Bitly encode and Bitly decode.
The Bitly encode stream contains every new short link created, while the Bitly decode stream contains all the clicks on the shortened links.
In our use case, we mainly use the Bitly decode stream. Each Bitly decode record is encoded in JSON and contains attributes such as the user-agent string of the browser, the Bitly code, the timestamp at which the user clicked on the link, and several geolocation attributes.
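As an illustration, a decode record could be parsed with Jackson as sketched below; the single-letter field names ("a" for the user agent, "g" for the Bitly code, "t" for the click timestamp, "c" for the country) are assumptions about the feed layout rather than a documented schema.

```java
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class DecodeRecordParser {
    private static final ObjectMapper MAPPER = new ObjectMapper();

    // Container for the attributes we care about.
    public static class Click {
        public String userAgent;
        public String bitlyCode;
        public long clickTimestamp;
        public String country;
    }

    public static Click parse(String json) throws Exception {
        JsonNode node = MAPPER.readTree(json);
        Click click = new Click();
        // Field names are assumed, single-letter keys as in Bitly's public feeds.
        click.userAgent      = node.path("a").asText();
        click.bitlyCode      = node.path("g").asText();
        click.clickTimestamp = node.path("t").asLong();
        click.country        = node.path("c").asText();
        return click;
    }
}
```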
C. Performance Results
On average, the topology we used was able to process about 12,000 tweets per second and, concurrently, 37,000 Bitly codes per second, which was more than enough to process the full version of both streams in real-time. At full speed, our topology was able to process the data in 8 days. This means that we are able to process a full month of data on commodity machines approximately four times faster (about 30 days of data replayed in 8 days) than would be necessary for real-time processing.
Figure 2. Our topology combining and processing the Twitter and Bitly streams in real-time
Figure 3.
Scalability measurements