
An Insight into US Elections 2016 using Twitter Sentiment Analysis
Sachit Mahajan, TIGP SNHCC15
1. Introduction
Social media is now widely used to study many areas, and politics is
one of them. Researchers around the world use microblogging and social
networking websites such as Twitter and Facebook to understand public
views and opinions about political parties and candidates. In this
project, I investigate how Twitter can be used to understand public
opinion and sentiment about the candidates and thus predict the US
Presidential Elections.
Microblogging websites provide a simple and engaging platform for
individuals to share their views and thoughts on any topic. These
social networking websites are easy to use and widely accessible,
which is one of the reasons people are shifting away from conventional
tools such as mailing lists and blogs [1]. As people increasingly post
about almost everything on microblogging websites, it becomes easy to
gather information that can be processed to understand people's
sentiments and opinions. Such data can readily be applied to social
and marketing studies [2]. In another study [3], the authors used
Twitter to forecast box office revenues for movies before their
release. Other researchers have used public social media data sets to
predict stock market trends [4] and to study political opinion [5].
To understand public opinion about sensitive political issues, I
collected thousands of tweets through the Twitter API and analyzed
them. The project aims to gauge current opinion and sentiment towards
the two presidential candidates from real-time tweets rather than from
polls conducted by media corporations. Each tweet was classified as
having positive, negative or neutral sentiment.
2. Data
Around 3000 tweets were collected regarding the presidential
candidates. Recent tweets were used to understand the sentiment
towards Donald Trump and Hillary Clinton.
2.1 Input
The tweets were downloaded using the Twitter API: 1500 tweets for each
of Donald Trump and Hillary Clinton. Of the total 3000 tweets, 2100
were used for training and the remaining 900 for testing.
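A minimal sketch of such a 2100/900 split is given here. The report does not say how the tweets were assigned to the two sets, so the random shuffle, the fixed seed and the `split_tweets` helper are assumptions made purely for illustration, and the tweet texts are placeholders rather than real data.

```python
import random

def split_tweets(tweets, n_train=2100, seed=42):
    """Shuffle a copy of the tweets and split it into training and test sets.

    The random shuffle and the seed are assumptions; the report only
    states the sizes of the two sets.
    """
    shuffled = list(tweets)
    random.Random(seed).shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]

# Placeholder records standing in for the 1500 + 1500 downloaded tweets.
tweets = [("trump", f"tweet {i}") for i in range(1500)]
tweets += [("clinton", f"tweet {i}") for i in range(1500)]

train, test = split_tweets(tweets)
print(len(train), len(test))  # 2100 900
```

Shuffling before splitting keeps both candidates represented in the training and test sets, which would not be the case if the first 2100 tweets of the combined list were taken as they are.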
2.2 Output
2.2.1 Training Output
The training output is the sentiment string to which the
sentiment-bearing text in a tweet is mapped. For example, a tweet like
"We love Hillary Clinton" would be mapped to "positive" because it
contains the word "love".
2.2.2 Testing Output
The testing output is the sentiment string expressed by the overall
tweet, which can be "Positive", "Negative" or "Neutral". It is
obtained by fitting the classifier to the training data.
3. Methodology
The system setup is shown in Figure 1. It comprises several steps,
which are explained in the sections below.

Figure 1. System Setup


3.1 Web Scraping
In the first step, tweets about the two frontrunners for the
presidency were collected from Twitter. The tweets were used to
analyze which candidate was the subject of most tweets and what kinds
of sentiment the tweets expressed. The tweets for both candidates were
combined and non-alphanumeric characters were removed.
3.2 Classify Sentiments
In this step, the procedure for extracting sentiment from a tweet had
to be determined. Hand-classifying thousands of tweets is inefficient
and time consuming, so the lexicon-based sentiment analysis approach
[6] was used to perform the sentiment analysis.
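The idea behind lexicon-based scoring can be sketched as follows: count the words of a tweet that appear in a positive word list, subtract the count of those in a negative word list, and map the sign of the result to a sentiment string. The two tiny word lists and the function names below are hypothetical stand-ins; an actual run would load a full opinion lexicon such as the one associated with [6], and the exact scoring rule used in this project is not stated in the report.

```python
# Hypothetical miniature lexicon, for illustration only.
POSITIVE_WORDS = {"love", "great", "good", "win", "support"}
NEGATIVE_WORDS = {"hate", "bad", "liar", "lose", "corrupt"}

def sentiment_score(text):
    """Return (#positive words) - (#negative words) for one tweet."""
    words = text.lower().split()
    positive = sum(word in POSITIVE_WORDS for word in words)
    negative = sum(word in NEGATIVE_WORDS for word in words)
    return positive - negative

def sentiment_label(text):
    """Map the sign of the score to a sentiment string."""
    score = sentiment_score(text)
    if score > 0:
        return "Positive"
    if score < 0:
        return "Negative"
    return "Neutral"

print(sentiment_label("We love Hillary Clinton"))  # Positive
```

A tweet containing no lexicon words, or equal numbers of positive and negative words, receives a score of zero and is labelled "Neutral" under this rule.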
3.3 Processing Text
This step covers the text pre-processing. The ideas for pre-processing
were drawn from the related work cited in the references. First, all
text was converted to lowercase. Extra spaces and the hash signs
before hashtag words were removed. URLs and usernames were discarded,
as they don't relate to any particular emotion. Because of the casual
nature of microblogging websites, people sometimes write words with
repeated letters, such as "cooool" or "envyyy". To handle these cases,
I replaced any run of more than two identical consecutive characters
with exactly two of that character [7]. Next, stop words and
punctuation marks were removed. Finally, some words occur in very few
tweets, i.e. they are sparse. High sparsity means that the tweets have
few words in common, so these infrequent words were also removed.
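The pre-processing steps described above can be sketched roughly as below. This is an illustrative reconstruction and not the code used in the project: the regular expressions, the short `STOP_WORDS` set, the function names and the `min_doc_fraction` threshold are all assumptions.

```python
import re
from collections import Counter

# Illustrative subset; a real run would use a full stop-word list.
STOP_WORDS = {"a", "an", "and", "are", "i", "am", "is", "of", "the", "to"}

def clean_tweet(text):
    """Apply the cleaning steps to one tweet and return its tokens."""
    text = text.lower()
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # drop URLs
    text = re.sub(r"@\w+", " ", text)                   # drop usernames
    text = re.sub(r"#(\w+)", r"\1", text)               # keep the word, drop '#'
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)          # cooool -> cool
    text = re.sub(r"[^a-z0-9\s]", " ", text)            # drop punctuation
    return [word for word in text.split() if word not in STOP_WORDS]

def remove_sparse_terms(token_lists, min_doc_fraction=0.01):
    """Drop words that occur in fewer than the given fraction of tweets."""
    doc_freq = Counter(w for tokens in token_lists for w in set(tokens))
    min_docs = min_doc_fraction * len(token_lists)
    return [[w for w in tokens if doc_freq[w] >= min_docs]
            for tokens in token_lists]

print(clean_tweet("I am sooooo happy!!! #Election2016 @user https://t.co/xyz"))
# ['soo', 'happy', 'election2016']
```

URLs and usernames are removed before punctuation is stripped, because otherwise characters such as "@" and "/" would already be gone and the remaining fragments could no longer be recognized.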
3.4 Predictive Models and Evaluation
Choosing the best model is one of the most challenging parts of the
task. The task was modeled as classifying the sentiments that people
express about the two presidential candidates. CART and logistic
regression were used to predict negative sentiment. They were compared
with a random forest classification model, and the accuracy of both
models was around 0.89.
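For reference, accuracy here is the fraction of test tweets whose predicted label matches the true label. The sketch below computes it, together with the confusion counts, for a binary "negative sentiment" target. The label vectors are invented for illustration and are not the project's data, and the helper names are hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for boolean labels."""
    pairs = list(zip(y_true, y_pred))
    return {
        "tp": sum(t and p for t, p in pairs),
        "fp": sum((not t) and p for t, p in pairs),
        "fn": sum(t and (not p) for t, p in pairs),
        "tn": sum((not t) and (not p) for t, p in pairs),
    }

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Invented example: True means the tweet is negative.
y_true = [True, False, False, True, False, False, False, True, False, False]
y_pred = [True, False, False, False, False, False, False, True, False, True]

print(confusion_counts(y_true, y_pred))
print(accuracy(y_true, y_pred))  # 0.8
```

When negative tweets are a minority, accuracy alone can look high even for a model that rarely predicts the negative class, so the confusion counts are worth inspecting alongside it.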
4. Data Analysis
Histograms, box plots and word clouds were generated to analyze the
data. Figure 2 shows the box plots obtained for Donald Trump and
Hillary Clinton.

Figure 2. Box Plots for Presidential Candidates


The box plots in Figure 2 show that the sentiments towards Hillary
Clinton are more negative than those towards Donald Trump. This may
reflect the fact that Donald Trump had recently been confirmed as the
Republican Party presidential candidate. Figure 3 shows the histogram
of tweet sentiment scores for both candidates.

Figure 3. Sentiment score histogram


The histogram shows that the sentiment scores of tweets are more
negative for Hillary Clinton than for Donald Trump. Based on the
percentages of negative and positive tweets, Donald Trump appears to
be leading the presidential race. In the next step, word clouds with
sentiment percentages were generated to see which words, and with
which sentiments, are used to refer to the two presidential
candidates.
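The percentage comparison between candidates can be sketched as follows. The label list is made up for illustration, and `sentiment_percentages` is a hypothetical helper, not a function from the project.

```python
from collections import Counter

def sentiment_percentages(labels):
    """Return the percentage of tweets carrying each sentiment label."""
    counts = Counter(labels)
    total = len(labels)
    return {name: 100.0 * counts[name] / total
            for name in ("Positive", "Negative", "Neutral")}

# Invented labels for one candidate's tweets.
labels = ["Positive", "Negative", "Neutral", "Neutral"]
print(sentiment_percentages(labels))
# {'Positive': 25.0, 'Negative': 25.0, 'Neutral': 50.0}
```

Computing the percentages separately for each candidate's tweets makes the two distributions directly comparable even if the number of tweets per candidate differs.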

Figure 4. Word Cloud for Donald Trump


According to the word cloud in Figure 4, the sentiment towards Donald
Trump is mostly neutral, whereas for Hillary Clinton (Figure 5) it is
fairly evenly distributed.

Figure 5. Word Cloud for Hillary Clinton


The two word clouds show that although Hillary Clinton attracts more
words with positive sentiment, she also attracts more words with
negative sentiment than Donald Trump. The neutral sentiment towards
Donald Trump suggests that although people are not talking positively
about him, they still favor him over Hillary Clinton.

[1] Predicting US Primary Elections with Twitter.
[2] Alexander Pak and Patrick Paroubek. Twitter as a corpus for sentiment analysis and opinion
mining. In Proceedings of the Seventh International Conference on Language Resources and
Evaluation (LREC'10), May 2010.
[3] Sitaram Asur and Bernardo A. Huberman. Predicting the future with social media. In Proceedings
of the 2010 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent
Technology, Volume 01, WI-IAT '10, pages 492-499, 2010.
[4] J. Bollen, H. Mao, and X.-J. Zeng. Twitter mood predicts the stock market. Journal of
Computational Science, volume 2, pages 1-8, 2011.
[5] B. O'Connor, R. Balasubramanyan, B. R. Routledge, and N. A. Smith. From tweets to polls: Linking
text sentiment to public opinion time series. In Proceedings of the 4th ICWSM, AAAI Press, pages
122-129, 2010.
[6] Hu and Liu, KDD-2004.
[7] http://web.stanford.edu/~jesszhao/files/twitterSentiment.pdf