Sentiment Analysis of
Twitter
Using Knowledge based and Machine
Learning Techniques
Anne Hennessy
National College of Ireland
5/28/2014
National College of Ireland
Higher Diploma in Science in Data Analytics
2013/2014
Anne Hennessy
X13119966
anne.hennessy@student.ncirl.ie
Table of Contents
1 Executive Summary ....................................................................................... 5
2 Introduction .................................................................................................... 6
8 Conclusions ................................................................................................. 41
10 References ................................................................................................ 43
11 Appendix ................................................................................................... 46
1 Executive Summary
2 Introduction
Twitter is a “micro-blogging” social networking website with a large and
rapidly growing user base. Twitter users can write short updates of 140
characters or fewer, called ‘tweets’. ‘Tweets’ are seen by those who ‘follow’ the
person who ‘tweeted’. Because of the website’s growing popularity, Twitter can
provide a rich bank of data in the form of harvested tweets. Twitter, by its very
nature, allows people to openly convey their opinions and thoughts about
whatever topic, discussion point or product they are interested in sharing
their opinions about. Twitter is therefore a good medium in which to search for
potentially interesting trends regarding prominent topics in the news or popular culture.
R studio is one of many programmes that offer packages for analysing
data; however, R studio works particularly well with statistical problems and has a
user-friendly interface (Verzani, 2011).
Sentiment analysis (or opinion mining) refers to the use of natural language
processing, text analysis and computational linguistics to identify and extract
subjective information in source material (Verzani, 2011).
The classification model which this project develops will determine whether
a tweet status update (which cannot exceed 140 characters) reflects a positive
or a negative opinion on the part of the person who tweeted. This paper
will use a hybrid of knowledge based sentiment analysis methodologies, which
have traditionally been used, and machine learning methodologies,
which take a more intuitive approach to sentiment. The results of these two
methodologies will be used to perform a thorough analysis of the dataset.
2.2 Motivation
This paper chose to analyse tweets with the hashtags #ConchitaWurst and
#Eurovision2014, as these were topical at the time of data collection. Conchita
Wurst is the drag stage persona of Thomas "Tom" Neuwirth, an Austrian singer.
Wurst represented Austria and won the Eurovision Song Contest
2014 in Copenhagen, Denmark (Brooks, 2014). This paper investigates
the tweets related to Wurst and the Eurovision Song Contest 2014 because
Wurst’s selection sparked controversy in Austria. Four days after ORF
announced its decision, more than 31,000 people had liked an "Anti-
Wurst" Facebook page (Michaels, 2014).
2.3 Aims
In order to conduct any kind of analysis on Twitter, a suitable dataset of
tweets needs to be built. The Twitter API is an application interface which extracts
tweets from Twitter and loads them into a dataset. The aims of this paper are threefold;
2.4 Solution Overview
A suitable dataset of tweets needs to be constructed. The Twitter API is
an application interface which extracts tweets from Twitter and loads them into a dataset.
Additionally, for this project R studio was used along with the following packages
and libraries;
Plyr; The package plyr is a set of tools that solves common problems by
breaking down bigger problems into more workable pieces. The package then
operates on each problem before reassembling the reworked pieces back
together.
Stringr; stringr makes R string functions more consistent, simpler and easier to
use by ensuring the function and argument names are consistent and all
functions deal with NA’s and zero length characters appropriately. Stringr also
ensures that the data output from each function matches the input data
structures of other functions.
e1071; this package provides functions for a range of statistical methods, including
latent class analysis, support vector machines, bagged clustering and naive Bayes classification.
Gmodels; This package provides various R programming tools for model fitting.
By combining knowledge based techniques with machine
learning techniques, this paper allows for a full and thorough analysis of the
data. The combination of both techniques to analyse the data will give
researchers the opportunity to see that the two approaches can be
complementary to the analysis.
2.5 Structure
Related work describes the related literature that has been reviewed
around the topic of sentiment analysis and the various methodologies
used. The related work is loosely divided up into knowledge based
approaches and machine learning based approaches. This is to ensure a
better understanding of the two areas this paper hopes to combine.
Design and Architecture describes how the overall layout of the project will
be achieved. The design of the paper is based on Naive Bayes classifier
which is explained in this section. The Architecture diagram provides a
‘road map’ of the different distribution systems that will be used and how
they are connected to other systems.
Implementation provides a step by step guide through the processes of
the paper which includes the code that was used as well as outputs.
Requirements provide an insight into the aspects of the project that must
be considered such as Functional requirements, Data requirements, User
requirements etc.
Datasets. This section describes the datasets used in this paper. It
describes in detail how the data was acquired and the basic description of
the data. It describes where it is stored and how it will be used to answer
the aims of the project.
Results describes the results that came to light after the various
methodologies were applied to the dataset.
Conclusion describes the advantages and disadvantages,
opportunities and limits of the project.
Further development or research outlines where the results of this project
could lead in future research.
3 Related Work (Literature Review)
Twitter is different to other forms of raw data used for sentiment
analysis, as sentiments are conveyed in one- or two-sentence blurbs rather
than paragraphs. Twitter is much more informal and less consistent in terms
of language. Users cover a wide array of topics which interest them and use
many symbols, such as emoticons, to express their views on many aspects of
their lives (Agarwal et al. 2011). In human generated status updates,
sentiments are not always obvious; many tweets are ambiguous and may use
humour that conveys the opinion to human readers but obscures it from a
machine learning algorithm (Agarwal et al. 2011). Another
consideration when using a dataset generated from Twitter is that a
considerable number of tweets convey no sentiment, such as those simply
linking to a news article, which can lead to difficulties in data gathering,
training and testing (Parikh and Movassate, 2009). Sentiment analysis provides a
means of tracking opinions and attitudes on the web and determining whether they
are positively or negatively received by the public.
According to Mejova (2009), sentiment analysis is usually conducted at
two levels: a coarse level and a fine level. Coarse level sentiment analysis deals
with determining the sentiment of an entire document, while fine level analysis deals with
attribute level sentiment (Neethu and Rajasree, 2013). Sentence level
sentiment analysis comes in between these two (Mejova, 2009). Sentiment
analysis on Twitter provides a dramatically different dataset in which multiple
interesting challenges can arise.
The next two sections deal with these techniques in further detail.
A. Symbolic Techniques
Symbolic techniques in supervised classification models make use of
available lexical resources. In his sentiment analysis, Turney (2002) used a bag-
of-words approach, in which the document is treated as a collection
of words and the relationships between words are not considered important. To
determine the overall sentiment, every word is given a sentiment value and
those values are combined using aggregation functions. Turney (2002) found
that the polarity of a review could be based on the average semantic orientation
of the tuples extracted from the review, where tuples are phrases containing
adjectives or adverbs which may be considered positive or negative.
A test set is used to evaluate the model by predicting the class labels of unseen
feature vectors, based on what was learned from the training set.
A number of machine learning techniques, such as Naive Bayes (NB) and Support
Vector Machines (SVM), are used to classify reviews into either a positive or a
negative orientation (Vinodhini and Chandrasekaran, 2012). In their paper,
Domingos and Pazzani (1997) found that Naive Bayes works well as a classifier for
certain problems even when the features are highly dependent.
A new model based on the Bayesian algorithm was introduced
by Zhen Niu et al. (2012). In this model, the weights of the classifier are adjusted
by making use of representative features (information that represents a class)
and unique features (information that helps to distinguish between classes). Using
those weights, the researchers calculated the probability of each classification,
which allowed for an improved Bayesian algorithm.
Pak and Paroubek (2010) created a Twitter corpus by using a Twitter API
which automatically collected tweets from Twitter and annotated them
using emoticons. Using that corpus, they built a sentiment classifier based
on the multinomial Naive Bayes classifier, using N-grams and POS-tags as
features.
In this paper, we construct a Twitter corpus using the Twitter API and use R studio
code to preprocess the Twitter corpus; then, using knowledge based methods,
we apply an available lexical resource to the Twitter corpus. To
compare the results of the knowledge based method with a machine learning
technique, we then apply Naive Bayes classification models to the corpus, which
split the corpus into positive and negative tweets as well as highlighting
which tweets are classified. Naïve Bayes is used as it often works well as a
good first classifier in data analysis.
4 System and Datasets
’X’ is the feature vector, defined as X = {x1, x2, ..., xm}, and yj is the class label. In
the tweets collected for this paper there are different independent features, such
as emoticons and emotional keywords, which are treated as either positive or
negative and so are utilized by the Naive Bayes classifier for classification. The
machine learning algorithm is then applied to the model classifier and a label is
produced, as seen in figure 1.
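The decision rule of the Naive Bayes classifier described above can be written out explicitly. This is the standard textbook formulation, not reproduced from figure 1: under the naive assumption that the features are conditionally independent given the class, the predicted label is

```latex
\hat{y} \;=\; \arg\max_{y_j} \; P(y_j) \prod_{i=1}^{m} P(x_i \mid y_j)
```

where P(yj) is the prior probability of class yj and each P(xi | yj) is estimated from the frequency of feature xi among the tweets belonging to class yj.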
Figure 1
5 Implementation
Figure 2
application will show as below. The properties were changed to ‘Read,
Write and Access Direct Messages’. The Consumer Key and Consumer
Secret numbers were used in R Studio.
Figure 3
Figure 4
Figure 5
Once this file is downloaded, the next stage is to access the Twitter API. This
step includes the script code that performs the handshake using the Consumer Key and
Consumer Secret number of the application. Figure 6 shows the code that must be
run to perform the handshake.
Figure 6
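Figure 6 is a screenshot in the original; the handshake it showed can be sketched with the twitteR and ROAuth packages, the workflow current in 2014. The key and secret values are placeholders for the application's own credentials:

```r
# Hedged sketch of the OAuth handshake shown in figure 6
library(twitteR)
library(ROAuth)

requestURL <- "https://api.twitter.com/oauth/request_token"
accessURL  <- "https://api.twitter.com/oauth/access_token"
authURL    <- "https://api.twitter.com/oauth/authorize"

consumerKey    <- "YOUR_CONSUMER_KEY"      # placeholder: taken from the Twitter application page
consumerSecret <- "YOUR_CONSUMER_SECRET"   # placeholder

cred <- OAuthFactory$new(consumerKey    = consumerKey,
                         consumerSecret = consumerSecret,
                         requestURL     = requestURL,
                         accessURL      = accessURL,
                         authURL        = authURL)

cred$handshake()   # prints the authorization URL; paste the PIN back into the console
registerTwitterOAuth(cred)
```

Later versions of twitteR replaced this flow with setup_twitter_oauth(), but the above matches the handshake described in this section.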
In order to access the Twitter API, the programme requires the request URL,
access URL and authorization URL of the Twitter application to be assigned to the
variables requestURL, accessURL and authURL respectively. consumerKey and
consumerSecret are unique to a Twitter application. Running this gives the following
message on the R console:
Figure 7
The last three lines of the console are a message to the user: to enable the
connection, direct your web browser to:
http://api.twitter.com/oauth/authorize?oauth_token=dHwEGXdxbjJ093sG0tVjYVT0NQrkjU3DuCxcC1YQyc
After opening the above link in the browser, the application is authorized
by providing your username and password. The code that Twitter then provides
must be entered into the console.
Figure 8
The console will give a message of TRUE, which means that the handshake is
complete. Now we can get tweets from the Twitter timeline.
Figure 9
Figure 10
This command will get 1500 tweets related to #ConchitaWurst. The function
“searchTwitter” is used to download tweets from the timeline. The Twitter API can
only return a fixed maximum number of tweets (1500). This maximum may not
always be reached, as there may not be enough tweets for a particular keyword.
This was the case for #Eurovision2014: the search did not
return many tweets, so it was decided that the paper would concentrate all
of its efforts on the tweets pulled by #ConchitaWurst. As can be seen in the code
above, the data for the 1500 tweets was converted into a data frame so that analysis
could be performed on it. Finally, the data was converted into a .csv file.
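The retrieval commands shown in figures 9 and 10 (screenshots in the original) follow this general shape; the output file name is illustrative:

```r
library(twitteR)

# pull up to 1500 tweets mentioning the hashtag
tweets <- searchTwitter("#ConchitaWurst", n = 1500)

# convert the list of status objects into a data frame for analysis
tweets.df <- twListToDF(tweets)

# persist the data frame as a .csv file (file name is an assumption)
write.csv(tweets.df, file = "ConchitaWurst.csv", row.names = FALSE)
```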
5.4 Sentiment Function
Once the tweets were obtained, some functions had to be applied to convert
these tweets into useful information. The main working
principle of sentiment analysis is to find the words in the tweets that
represent positive sentiments and the words in the tweets that
represent negative sentiments. For this, a list of positive
and negative sentiment words was needed. A list of positive and negative
words compiled by B. Liu, which is publicly available, was downloaded
from the University of Illinois at Chicago website (Liu and Hu, 2004). After
downloading the list, it was saved in the working directory. The sentiment
analysis uses two packages, plyr and stringr, to manipulate strings. The
function can be seen in the following screen prints; figure 11.
Figure 11
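Figure 11 is a screenshot in the original; the function it showed is, judging by the description that follows, essentially Jeffrey Breen's widely circulated score.sentiment() function, sketched here under that assumption:

```r
library(plyr)
library(stringr)

score.sentiment <- function(sentences, pos.words, neg.words) {
  scores <- laply(sentences, function(sentence) {
    # clean the tweet text
    sentence <- gsub("[[:punct:]]", "", sentence)
    sentence <- gsub("[[:cntrl:]]", "", sentence)
    sentence <- gsub("\\d+", "", sentence)
    sentence <- tolower(sentence)

    # split into words and compare against the opinion lexicons
    words <- unlist(str_split(sentence, "\\s+"))
    pos.matches <- !is.na(match(words, pos.words))
    neg.matches <- !is.na(match(words, neg.words))

    # TRUE/FALSE are treated as 1/0 by sum()
    sum(pos.matches) - sum(neg.matches)
  })
  data.frame(score = scores, text = sentences)
}
```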
The sentiment function calculates a score for each individual tweet. It first calculates
a positive score by comparing the tweet's words with the positive words list, and then
calculates a negative score by comparing them with the negative words list. The final
score is calculated as the positive score minus the negative score:
Figure 12
Figure 13
5.6 Import the csv file
When this csv file is imported, a dataset file is created in the working directory.
The next step is to score the tweets; this can be done by creating a separate csv file
which contains the score of each tweet. This can be done as follows:
Figure 14
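Figure 14 showed the scoring step as a screenshot. Assuming the sentiment function described in section 5.4 and the Liu and Hu word lists saved in the working directory (file names here are assumptions), the step might look like:

```r
# load the harvested tweets and the opinion lexicons
dataset   <- read.csv("ConchitaWurst.csv", stringsAsFactors = FALSE)
pos.words <- scan("positive-words.txt", what = "character", comment.char = ";")
neg.words <- scan("negative-words.txt", what = "character", comment.char = ";")

# score every tweet and write the results to a separate csv file
scores.df <- score.sentiment(dataset$text, pos.words, neg.words)
write.csv(scores.df, file = "ConchitaWurst_scores.csv", row.names = FALSE)
```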
The snapshot of the score file shows the score of each tweet as an integer in
front of every tweet.
Figure 15
5.7 Visualizing the tweets
The next step is to create visual histograms and other plots to visualize the
sentiments of the user. This can be done by using hist function.
Figure 16
Figure 17
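The hist() call behind figures 16 and 17 (screenshots in the original) can be sketched as follows; the titles and colour are illustrative:

```r
# frequency of tweets by sentiment score
hist(scores.df$score,
     main = "Sentiment scores for #ConchitaWurst",
     xlab = "Score", col = "lightblue")
```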
The first step in processing text data involves creating a corpus, which refers to
a collection of text documents. In this project, a text document refers to a single
tweet. We'll build a corpus containing the tweets in the training data using the
following command:
Figure 18
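Figure 18 was a screenshot; the corpus-building command it showed takes this form with the tm package, assuming the tweets sit in a data frame column named text:

```r
library(tm)

# each tweet becomes one text document in the corpus
twitter_corpus <- Corpus(VectorSource(dataset$text))
```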
Text mining of the tweets can be performed next. The functions used to refine
and filter the text to our needs, such as removing numbers and
punctuation and handling uninteresting words such as ‘and’, ‘but’ and ‘or’,
are taken from the tm package.
Figure 19
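The cleaning steps described above (figure 19 in the original) follow the usual tm_map() pattern; note that newer versions of tm require wrapping base functions such as tolower in content_transformer():

```r
corpus_clean <- tm_map(twitter_corpus, tolower)            # lower-case all text
corpus_clean <- tm_map(corpus_clean, removeNumbers)        # strip digits
corpus_clean <- tm_map(corpus_clean, removeWords,
                       stopwords("english"))               # drop 'and', 'but', 'or', ...
corpus_clean <- tm_map(corpus_clean, removePunctuation)    # strip punctuation
corpus_clean <- tm_map(corpus_clean, stripWhitespace)      # collapse extra spaces
```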
The next step is to tokenize the corpus and return a sparse matrix with the
name twitter_dtm. From here, analyses involving word frequency can be
performed.
The data then needs to be split into a training dataset and a test dataset for Naïve
Bayes, so that the classifier can be evaluated.
The data is split into two portions: 75 percent (tweets 1 to 1299) for training and 25
percent (tweets 1300 to 1500) for testing.
Figure 20
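Figure 20 showed the split as a screenshot; given the 1:1299 / 1300:1500 split stated above, the tokenization and split might look like this. The scores.df name is carried over from section 5.4:

```r
# tokenize the cleaned corpus into a sparse document-term matrix
twitter_dtm <- DocumentTermMatrix(corpus_clean)

# split the raw data, the document-term matrix and the corpus identically
raw_train <- scores.df[1:1299, ]
raw_test  <- scores.df[1300:1500, ]
dtm_train <- twitter_dtm[1:1299, ]
dtm_test  <- twitter_dtm[1300:1500, ]
corpus_train <- corpus_clean[1:1299]
corpus_test  <- corpus_clean[1300:1500]
```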
To confirm that the subsets are representative of the complete set of Twitter data,
the proportions of scores in the training and test data frames are
compared.
Figure 21
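The comparison in figure 21 can be reproduced with prop.table(); the score column name is an assumption:

```r
# proportion of each sentiment score in the training and test portions
prop.table(table(raw_train$score))
prop.table(table(raw_test$score))
```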
Both the training data and test data contain about 13 percent negative sentiment
and 75 percent positive sentiment. This suggests that the tweets were divided
evenly between the two datasets.
Figure 22
A word cloud can be produced afterwards. Words appearing more often in the
text are shown in a larger font, while less common terms are shown in smaller
fonts, thereby illustrating the frequency of words in the dataset. The code for
the word cloud is;
Figure 23
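Figure 23 was a screenshot of the word cloud code; a minimal version with the wordcloud package, with an assumed minimum frequency, is:

```r
library(wordcloud)

# more frequent words are drawn in larger fonts
wordcloud(corpus_train, min.freq = 30, random.order = FALSE)
```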
Figure 24
According to this word cloud, we can see that ‘Eurovision’, ‘Conchita Wurst’ and
‘Lady Gaga’ are the most used terms in the tweets, followed by ‘gaga’, ‘gay’ and
‘queen’, which shows that while tweeting about ‘Conchita Wurst’ people also
connect words like ‘Eurovision’ and ‘Lady Gaga’.
5.9 Create a Naive Bayes classifier
The final step in the data preparation process is to transform the sparse matrix
into a data structure that can be used to train a naive Bayes classifier.
It's unlikely that all of the features in the sparse matrix are useful for
classification. To reduce the number of features, any words that
appear in fewer than five tweets will be eliminated.
Figure 26
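Figure 26 showed this step as a screenshot; with the tm package, the filter can be sketched as:

```r
# keep only words appearing in at least five tweets of the training data
freq_terms <- findFreqTerms(dtm_train, 5)
dtm_train  <- dtm_train[, freq_terms]
dtm_test   <- dtm_test[, freq_terms]
```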
The naive Bayes classifier is typically trained on data with categorical features.
This poses a problem since the cells in the sparse matrix indicate a count of the
times a word appears in a message. Changing this to a factor variable that
indicates yes or no will alleviate this problem. The following code defines a
convert_counts() function to convert counts to factors:
Figure 27
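The convert_counts() function (figure 27 in the original is a screenshot) is typically written as follows, applied column by column to both matrices:

```r
# recode word counts as a categorical Yes/No feature
convert_counts <- function(x) {
  x <- ifelse(x > 0, 1, 0)
  factor(x, levels = c(0, 1), labels = c("No", "Yes"))
}

twitter_train <- apply(dtm_train, MARGIN = 2, convert_counts)
twitter_test  <- apply(dtm_test,  MARGIN = 2, convert_counts)
```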
As can be seen in the above screenshot, the classifier is created by the naiveBayes
function and is applied to the training data. The classifier then predicts
the classes of the test data. A crosstable is created to
visualize the results, in which the data is classified into actual and predicted
values.
Figure 28
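The classifier and crosstable steps just described (figures 27 and 28 were screenshots) can be sketched with the e1071 and gmodels packages. The sentiment label factors are assumptions, derived from the tweet scores:

```r
library(e1071)
library(gmodels)

# class labels: factors derived from the scores (column name is an assumption)
train_labels <- factor(raw_train$sentiment)
test_labels  <- factor(raw_test$sentiment)

# train a naive Bayes classifier on the training features
classifier <- naiveBayes(twitter_train, train_labels)

# predict the test set and crosstabulate predicted against actual classes
predictions <- predict(classifier, twitter_test)
CrossTable(predictions, test_labels,
           prop.chisq = FALSE, dnn = c("predicted", "actual"))
```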
6 Requirements
The programme used for this paper was R studio. R studio was selected as a
suitable programme as its interface and programme libraries met the
requirements of the brief. Additionally, the user interface of R studio was simple
to navigate and easy to understand. R studio is free and open source, and works
well on both Windows and Mac hardware. It also contains advanced statistical
routines.
The interface will include user inputs as well as two graphics, as outlined below.
1. Edit: this function will let the user edit keywords by adding, editing, or
removing keywords for each topic, and
2. Time: this function will let the user specify the duration of each analysis
session.
analyzed. It will also display the most frequently used words used on the subject
of #Conchita Wurst through the use of a word cloud.
The software will receive input from four sources: first, the programme R studio;
second, the Twitter API; third, Excel, which will hold the dataset once retrieved
from the Twitter API; and fourth, the Twitter app ‘Sentiment140’. The programme
R studio will supply the code results and the majority of the graphs for the analysis,
while the Twitter API will supply the dataset of tweet text. The Sentiment140 app
will supply an additional pie chart which will add a more visual element to the
interpretation of the data.
Outputs
The output will portray the current mood of the Twitter community on
#ConchitaWurst in the form of simple charts, a word cloud and histograms.
6.1.6 Functional Requirements
Retrieving Input
The software will receive three inputs: R studio code and R studio libraries,
analysis session duration and Tweets.
● R studio code and R studio libraries will be entered by the user for each topic.
● The analysis session duration will be set by the user before each session.
● Tweets will be retrieved from the Twitter API and saved in an Excel file.
Real-Time Processing
The software will take input, process data, and display output in real-time. This
will ensure the data provided by Twitter is a current view of the Twitter
community’s mood on #ConchitaWurst.
Sentiment Analysis
Output
The software must output real time data in the form of simple charts, a word cloud
and histograms. In addition, the software may output additional statistics
pertaining to the topic of #ConchitaWurst. This output will be clear and easy to
understand.
6.1.7 Use Cases
This software will serve as a tool of interest, providing users with the current
mood of the Twitter
Community on #ConchitaWurst.
The Twitter API will provide up-to-date information, limited only by the rate of
Twitter input. R studio will provide prompt analysis of the data using the various
software packages available to it. The output should display the latest results at
all times, and if it lags behind, the user should be notified. The application should
be capable of operating in the background should the user wish to utilize other
applications.
Reliability
The software will meet all of the functional requirements without any unexpected
behavior. At no time should the output display incorrect or outdated information
without alerting the user to potential errors; in such instances an error message
will be shown.
Availability
The software will be available at all times on the user's desktop or laptop device,
as long as the device is in proper working order. The functionality of the software
will depend on external services, such as internet access, that are required. If
those services are unavailable, the user should be alerted.
Security
The software should never disclose any personal information of Twitter users,
and should collect no personal information from its own users. The use of
passwords and API keys will ensure private use of the Twitter API. The
programmes will be run on a password protected laptop and desktop to
ensure maximum security.
Maintainability
The software should be written clearly and concisely. The code will be well
documented. Particular care will be taken to design the software modularly to
ensure that maintenance is easy.
Portability
6.4 Datasets
A Twitter API app was used to pull tweets from Twitter's public timeline in real-
time. A dataset was created using twitter tweets from a topic that was dominating
twitter at the time of data collection; #ConchitaWurst and #Eurovision2014.
Eurovision2014 did not produce very many tweets therefore the concentration of
the project fell to the tweets returned from ConchitaWurst which returned 1500
tweets. A sentence level sentiment analysis was performed on tweets as many
were full of slang words and misspellings. This is done in three phases. In the
first phase of a sentence level sentiment analysis pre-processing is done.
Secondly a feature vector is created using relevant features. A publicly available
sentiment lexicon which consists of around 6800 words in a list of positive and
negative opinion words or sentiment words for English was used to separate the
tweets. This list was compiled over many years by Liu and Hu (2004) finally
tweets are classified into positive and negative classes using different classifiers.
The final sentiment is based on the number of tweets in each class using several
sentiment analysis methodologies; the bag-of-words approach, which uses
available lexical resources as seen in Turney (2002) sentiment analysis.
Machine learning approaches are also used where the tweets dataset was split in
two Training and testing. Of these, we chose to use 1199 of the data set for
training and the remaining 300 tweets to be used for testing We had a total of
1135 positive tweets, 197 negative tweets and 152 very positive tweets as well
as 15 very negative tweets. These tweets were then used for training and testing
so to conduct a Naive Bayes classifier.
Creation of a Dataset
Since standard Twitter datasets are not available for analysis, we created a new
dataset by collecting tweets over a period ranging from May 6th 2014 to
May 8th 2014. Tweets were collected automatically using the Twitter API and were
manually annotated as positive or negative. In total, 1500 tweets were
collected for #ConchitaWurst and 100 tweets were collected for
#Eurovision2014. Unexpectedly, a number of the tweets were neutral; however,
the positive and negative tweets made up the dataset.
Preprocessing of Tweets
Punctuation marks, control characters and digits were removed; the tweet texts
were changed to lower case; sentences were split into words; and the corpus was
compared against the dictionary's positive and negative words. A
matched term is returned as a true or false value, which is treated as
1 or 0 by the sum function. Finally, the scores are put into a data frame named
scores.df. Before the tweets can be scored, however, a sentiment lexicon of words
must be obtained; this sentiment lexicon was found on Bing Liu's website (Liu
and Hu, 2004). The final score for each tweet is the number of positive words
minus the number of negative words. If the score is higher than 0, the tweet is
regarded as positive. If the score is lower than 0, the tweet is
regarded as expressing a negative opinion.
7 Results
The aim of this paper was to analyse the results of a sentiment orientation analysis
on the keyword #ConchitaWurst.
Figure 29
The above histogram shows the frequency of tweets with respect to the scores
allotted to each tweet. The x-axis shows the score of each tweet as a negative
integer, a positive integer or zero. A positive score represents positive or good
sentiments associated with that particular tweet, whereas a negative score
represents negative or bad sentiments associated with that tweet. A score of zero
indicates a neutral sentiment. The more positive the score, the more positive the
sentiments of the person tweeting, and vice versa.
The above histogram is slightly skewed towards positive scores, which shows that
the sentiments of people regarding Conchita Wurst are overwhelmingly positive,
with a slight skew towards very positive.
Of the 1500 tweets that were fetched from Twitter, a majority (1135)
were positive, whereas around 197 had negative sentiments. 152 tweets
had very positive sentiments, but the overall score is positive, as can be seen from
the plot.
In order to see how accurately the Naïve Bayes classifier worked, the table below,
which sets the actual classes of the test tweets against the classes predicted by
the trained model, must be analysed. It
can be seen that of the 225 positive tweets, 6 were incorrectly classified, as very
positive (4), negative (1) and very negative (1), while 2 of the 35 negative tweets
were incorrectly classified, as positive (1) and very negative (1). The presence of
some mis-categorised tweets suggests that although the model fit the training
data well, it was slightly under fitted when applied to the test data.
Figure 30
8 Conclusions
In this paper a hybrid of knowledge based methodologies and machine learning
methodologies was used in order to give a thorough examination of the tweets
of #ConchitaWurst and #Eurovision2014, which were extracted from Twitter.
#Eurovision2014 did not produce very many tweets; therefore the project
concentrated on the tweets pulled from #ConchitaWurst, which returned 1500
tweets. A publicly available sentiment lexicon, consisting of around 6800
positive and negative opinion words or sentiment words for
English, was used to separate the tweets. This list was compiled over many years
by Hu and Liu (2004). Tweets were then classified into positive and
negative classes using the machine learning classifier Naïve Bayes. The
extraction of tweets from Twitter proved to be more difficult than expected, and
several attempts were made to produce a dataset. The lack of tweets pulled from
#Eurovision2014 highlights the limitations of the app when faced with too few
tweets. It was found that certain issues can arise when dealing
with a tweet based dataset: the presence of white space, punctuation and
numbers had to be confronted in the preprocessing stage. To further alleviate
these issues, Twitter specific features were extracted and added to the feature
vector after proper preprocessing.
The classification accuracy of the feature vector was tested using a classifier such
as Naïve Bayes. The Naïve Bayes assumption that the features are independent
proved this classification methodology to be an excellent tool in this analysis. It was
found by the author that the machine learning algorithms were simpler to implement
and more efficient than other aspects of the paper, as they produced a table which
allowed for transparency in the accuracy of the Naive Bayes classification. Overall,
the hybrid approach to sentiment analysis allowed for a thorough analysis of the data
and performs well for a Twitter dataset. However, the accuracy of the Naïve
Bayes classifier still leaves room for improvement; this may be achieved by better
preprocessing.
9 Further development or research
10 References
4. Brooks, D. (2014) ‘Bearded Austrian drag queen to take on Eurovision’. Reuters, Apr 28,
2014. Available at http://uk.reuters.com/article/2014/04/28/uk-austria-eurovision-drag-idUKKBN0DE06O20140428
[Accessed on 23 April 2014]
5. Domingos, P. and Pazzani, M. (1997) “On the optimality of the simple Bayesian classifier
under zero-one loss”, Machine Learning, vol. 29, no. 2-3, pp. 103–130. [Accessed
on 3 May 2014]
7. Liu, B. and Hu, M. (2004) ‘Opinion Mining, Sentiment Analysis, and Opinion Spam Detection’.
Available at http://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html#lexicon [Accessed
on 16th May 2014]
10. Neethu, M. and Rajasree, R. (2013) ‘Sentiment Analysis in Twitter using Machine Learning
Techniques’, 4th International Conference on Computing, Communications and
Networking Technologies (ICCCNT), Tiruchengode, India, July 4-6 2013. IEEE, pp. 1-5
[Accessed April 25th 2014].
11. Niu, Z., Yin, Z. and Kong, X. (2012) “Sentiment classification for microblog by machine
learning”, in Computational and Information Sciences (ICCIS), 2012 Fourth International
Conference on, pp. 286–289. IEEE [Accessed on 3 May 2014]
12. Pak, A. and Paroubek, P. (2010) “Twitter as a corpus for sentiment analysis and opinion
mining”, in Proceedings of the Seventh Conference on Language Resources and
Evaluation (LREC'10), Valletta, Malta, May 2010. European Language Resources
Association.
15. Vinodhini, G. and Chandrasekaran, RM. (2012) ‘Sentiment Analysis and Opinion Mining: A
Survey’, International Journal of Advanced Research in Computer Science and Software
Engineering, Volume 2, Issue 6, June 2012, pp. 61-75 [Accessed April 25th 2014].
16. Verzani, J.(2011) Getting Started with R Studio. CA. O'Reilly Media, Inc.
17. Xia, R., Zong, C. and Li, S. (2011) “Ensemble of feature sets and classification algorithms for
sentiment classification”, Information Sciences: an International Journal, vol. 181, no. 6,
pp. 1138–1152. [Accessed on 3 May 2014]
11 Appendix
In order to conduct any kind of analysis on Twitter, a suitable dataset of tweets
needs to be built. The Twitter API is an application interface which extracts tweets
from Twitter and loads them into a dataset.
The aim of this paper is to use the results from the knowledge based techniques and
those of the machine learning techniques to ensure a thorough analysis of the dataset.
Background
Twitter is different to other forms of raw data which are used for sentiment analysis, as
sentiments are conveyed in one or two sentence blurbs rather than paragraphs. Twitter
is much more informal and less consistent in terms of language. Users cover a wide
array of topics which interest them and use many symbols such as emoticons to express
their views on many aspects of their lives (Agarwal et al. 2011). In human
generated status updates, sentiments are not always obvious; many tweets are
ambiguous and may use humor that conveys the opinion to human readers but
obscures it from a machine learning algorithm (Agarwal et al. 2011). This provides
a challenge for machine learning algorithms. Sentiment analysis provides a means of
tracking opinions and attitudes on the web and determines if they are positively or
negatively received by the public.
By combining knowledge based techniques with machine learning techniques, this paper allows for a full and thorough analysis of the data. Using both families of techniques to analyse the data will give researchers the opportunity to see that the two approaches can be complementary to the analysis.
Project Plan
[Gantt chart: project timeline from 20/02/2014 to 20/06/2014, covering the project proposal, requirements specifications, data retrieval, data analysis, statistics/algorithms, literature review, system architecture and editing.]
Technical Details
A suitable dataset of tweets first had to be constructed; tweets were extracted from Twitter via the Twitter API and loaded into a dataset. Additionally, for this project R studio was used along with the following packages and libraries:
ROAuth: This package provides an interface to the OAuth 1.0 specification, which allows users to authenticate via OAuth to the server of their choice.
plyr: The plyr package is a set of tools that solves common problems by breaking bigger problems down into more workable pieces. The package operates on each piece before reassembling the reworked pieces back together.
stringr: stringr makes R string functions more consistent, simpler and easier to use by ensuring that function and argument names are consistent and that all functions deal with NAs and zero length characters appropriately. stringr also ensures that the data output from each function matches the input data structures of other functions.
wordcloud: This package creates a word cloud to illustrate the frequency of words in text mining.
e1071: This package provides functions for statistical and machine learning methods such as support vector machines, bagged clustering and Naive Bayes classification.
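As an illustration of the Naive Bayes classification that e1071 offers, the sketch below shows a minimal multinomial Naive Bayes classifier in Python with Laplace smoothing. The training tweets and labels are hypothetical, and this is not the R code used in the project.

```python
from collections import Counter
import math

def train_nb(docs):
    """Count word and class frequencies from (tokens, label) training pairs."""
    word_counts = {"pos": Counter(), "neg": Counter()}
    class_counts = Counter()
    for tokens, label in docs:
        class_counts[label] += 1
        word_counts[label].update(tokens)
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, class_counts, vocab

def classify_nb(model, tokens):
    """Return the label with the highest Laplace-smoothed log posterior."""
    word_counts, class_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label in class_counts:
        score = math.log(class_counts[label] / total_docs)  # log prior
        denom = sum(word_counts[label].values()) + len(vocab)
        for w in tokens:
            score += math.log((word_counts[label][w] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

train = [
    (["love", "this", "song"], "pos"),
    (["great", "performance"], "pos"),
    (["hate", "this"], "neg"),
    (["awful", "song"], "neg"),
]
model = train_nb(train)
print(classify_nb(model, ["great", "song"]))  # prints "pos"
```

The e1071 naiveBayes function follows the same principle of multiplying (smoothed) per-class word likelihoods by a class prior; only the interface differs.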
Systems/Datasets
The process of sentiment analysis for this project is outlined in the diagram below.
[Diagram: the sentiment analysis pipeline, from crawling tweets through to ranked topics.]
11.2 Initial Requirement Specification
Introduction
Twitter is a micro-blogging site where users have the ability to send mini blogs (tweets) in the form of 140 character long messages to a group of friends (followers). Despite some restrictions, in general people are permitted to read and follow one another’s tweets; tweets are therefore public by default. According to Forbes, the number of active Twitter users increased by 40% from the second quarter of 2012 to the fourth quarter of 2012 (forbes.com 2014).
By its nature, Twitter allows people to tweet how they feel about certain topics, and because Twitter is now integrated with several web applications and can be used via various messaging and other social networking platforms, people can share their opinions more freely. This makes Twitter an obvious choice for research purposes in data and opinion mining across a variety of fields.
Recently, the Irish drag queen Panti (Rory O’Neill) used the word “homophobe” on RTÉ, the national broadcaster, leading to a debate over a provision in legislation relating to offence being given during a broadcast, as well as to broader issues of homophobia and gay rights in Ireland. Because of the strong public outcry against some of the broadcaster’s reactions, and the important issues that Panti has raised with regard to homophobia in Ireland, her campaign has gone international. Her Twitter account @Pantibliss allows us the opportunity to gain a strong insight into the underlying social structures and complexities of modern day Ireland.
Purpose
The purpose of this document is to outline the requirements for the ‘Panti on Twitter’ sentiment analysis tool. The audience of this tool will be politicians, media broadcasters, LGBTQ activists and followers, drag enthusiasts, anti-Panti activists and the general public.
Scope
The scope of the project is to develop a sentiment analysis of a Twitter account called @Pantibliss. The medium of Twitter is used for the following reasons:
Twitter is used by millions of different people as a medium to express their opinions and is thus a valuable source of opinion data.
Twitter enjoys a range of contributors and audiences: celebrities and regular users can interact, barriers are broken down in this way and real opinions can be expressed through these interactions. This also allows for the analysis of tweets from different social groups which it may otherwise be hard to collect data from.
Twitter allows for international contributions; therefore, in the example of Panti, we can see contributions on an ‘Irish’ issue going global, and these tweets add to the tapestry of the opinions we seek.
Overview
The process of sentiment analysis for this project is outlined in the diagram below.
[Diagram: the sentiment analysis pipeline, from crawling tweets through to ranked topics.]
The process will begin with an acquisition program applied to Twitter. Specified keywords will be taken from the data retrieved from Twitter. In order to support a derivation to a conceptual level, the data segments are then fragmented, assuming that every message contains only a single concept. Three distinct categories have been chosen, as follows:
Positive: tweets that favour what Panti has said or that react positively to her comments (note that these texts may be negative towards those who oppose her).
Negative: tweets that are not in favour of what Panti has said and that react negatively towards her (these tweets may be positive towards other groups).
Neutral: objective tweets or those which do not state an opinion.
General Description
Product perspective
Python
There are numerous Python libraries that can be used to interact with the Twitter API. Two
of the most popular are Python Twitter Tools and python-twitter. The Twitter API requires
that requests are authenticated. For this project, I will use PTT (Python Twitter Tools). PTT
has a twitter command-line tool for getting tweets from followers and setting your own
tweets from the safety and security of your python shell. PTT also allow me to perform
actions with Twitter’s code without being on the website, and open up other options that
are not readily available to normal users.
Twitter API
Twitter exposes its data via an Application Programming Interface (API). The Twitter API comes in two different flavours: REST and Streaming. The Streaming API works by making a request for a specific type of data (filtered by keyword, user, geographic area, or a random sample) and then keeping the connection open for as long as there are no errors in the connection.
The REST API is useful for getting things like lists of followers and of those who follow a particular user, and is what most Twitter clients are built on. However, one of its main drawbacks is that only tweets from the preceding 5 days can be searched, and queries are limited to approximately 10 per minute at the time of writing (Manjaly, 2013). For this project I am going to focus on the Streaming API.
Product Functions
A vital aspect of this project is document preparation, which determines how each document is represented. A full text string representation is not very useful, because it is hard to find similarities between two text strings.
Therefore the ‘Bag of Words’ model is used, which ignores the ordering of words and instead counts the number of occurrences of each word in the document (Manjaly, 2013). Some information is lost in this representation; however, the bag of words model is still commonly used and performs very strongly. It is computationally simple, and in many applications much of the information required for learning is captured by this representation. According to Bespalov, Bai and Shokoufandeh (2011), the bag of words model is a natural predecessor of the bag of N-grams, which counts groups of consecutive words of size n. This is important as it can eliminate ambiguities that occur in bag of words models, such as “gay rights” being significantly different from “The gay flower leaves a shadow which falls to the right”. This approach gives the advantage of increased string length and therefore greater context.
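The two representations above can be sketched as follows. This is a hypothetical example using naive whitespace tokenisation; a real system would tokenise tweets more carefully.

```python
from collections import Counter

def bag_of_words(text):
    """Count individual word occurrences, ignoring word order."""
    return Counter(text.lower().split())

def bag_of_ngrams(text, n=2):
    """Count groups of n consecutive words, preserving local context."""
    tokens = text.lower().split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

tweet = "gay rights are human rights"
print(bag_of_words(tweet))   # 'rights' occurs twice, every other word once
print(bag_of_ngrams(tweet))  # counts pairs such as ('gay', 'rights')
```

The bigram ('gay', 'rights') survives as a single feature, which is exactly the local context that a plain bag of words throws away.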
User Characteristics
The intended users will be politicians, media broadcasters, LGBTQ activists and followers, drag enthusiasts, anti-Panti activists and members of the general public who are interested in the sentiment of the Twitter population with respect to the opinions formed by @Pantibliss. Users are not expected to have a very high level of technical expertise.
General Constraints
Personal Data
If a user has not made information public, Twitter does not return that data. Any personal information that is collected from Twitter will not be stored or used in any way.
Twitter Data
The application must comply with the Twitter Developer terms of service. This includes the following:
1. Defining an application privacy policy (what we do with tweets, user data, etc.)
2. Not redistributing tweets
3. Providing a link to Twitter sign-up if the user does not have a registered Twitter account
Specific Constraints
Jose, Bhatia and Krishna (2010) outline specific constraints that this product may encounter when dealing with user generated text:
1. Negative sentences: many people write their tweets with a negation before the adjective or verb, which complicates the data. For example, a sentence such as “Not satisfied with the situation of gay marriage” contains the adjective “satisfied”, which could be assigned a positive polarity if the negation in the sentence is not considered.
2. Confusing polarity: for certain tweets there will be confusion or disagreement over the polarity to be assigned. For instance, “Norton defeats Pantibliss” is negative when taken from Pantibliss’s point of view, while it is positive when Norton is the search query.
3. Dealing with emoticons: our data should contain clean labels, and emoticons are deemed a noisy label. However, emoticons are popular on Twitter, so they will have to be cleaned out of the data.
4. Casual language: tweets contain very casual language. For example, a user may want to write the word “happy” as “happpppyyy”, “happpiieee” or “hap-e”; besides showing that people are happy, this emphasises the casual nature of Twitter.
5. Usage of links: users very often include links in their tweets. Thus there is a need to classify this type of tweet by using keywords such as “URL”. But even then it is difficult to extract the opinion conveyed by the linked content.
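Constraint 1 can be illustrated with a small Python sketch: a naive keyword scorer rates “Not satisfied …” as positive, while a negation-aware variant flips the polarity of a sentiment word that follows a negation term. The word lists here are hypothetical, not the project’s actual lexicon.

```python
POSITIVE = {"satisfied", "happy", "great"}
NEGATIVE = {"awful", "sad"}
NEGATIONS = {"not", "never", "no"}

def naive_score(tokens):
    """Score each word on its own, ignoring any preceding negation."""
    return sum((w in POSITIVE) - (w in NEGATIVE) for w in tokens)

def negation_aware_score(tokens):
    """Flip the polarity of a sentiment word preceded by a negation term."""
    score, negate = 0, False
    for w in tokens:
        if w in NEGATIONS:
            negate = True
            continue
        s = (w in POSITIVE) - (w in NEGATIVE)
        score += -s if negate else s
        negate = False
    return score

tweet = "not satisfied with the situation".split()
print(naive_score(tweet))           # 1 (wrongly positive)
print(negation_aware_score(tweet))  # -1
```

Only adjacent negation is handled here; longer-range negation scope is a harder problem that this sketch does not attempt.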
Specific Requirements
User Interfaces
The programme used for this paper was R studio. R studio was selected as a suitable programme as its interface and programme libraries met the requirements of the brief. Additionally, the user interface of R studio is simple to navigate and easy to understand. R studio is free and open source, and works well on both Windows and Mac hardware. It contains advanced statistical routines as well as a large, coherent, integrated collection of tools for data analysis. R studio also possesses powerful graphics capabilities which greatly aid the visualisation of data and results. Due to these properties, the user could interact with the application with optimal ease.
The interface will include user inputs as well as two graphics, as outlined below.
The user will be able to control the sentiment analysis of topics in two ways:
1. Edit: this function will let the user edit keywords, by adding, editing or removing keywords for each topic, and
2. Time: this function will let the user specify the duration of each analysis session.
This graphic will consist of a simple histogram showing the current mood of the Twitter community on the topic of #ConchitaWurst. The percentage of Twitter users who are currently for or against the topic being analysed will be displayed. The interface will also display the most frequently used words on the subject of #ConchitaWurst through the use of a word cloud.
Error notifications will be required within the R studio programme; these will be presented to the user as appropriate messages in red which describe the error that has taken place. Where applicable, error messages will suggest possible solutions to the problem.
Hardware Interface
The application will run on a password protected personal Microsoft laptop. No further
hardware devices or interfaces will be required for this analysis.
Software Interfaces
Inputs
The software will receive input from four sources: first, the programme R studio; second, the Twitter API; third, Excel, which will hold the dataset once retrieved from the Twitter API; and fourth, the Twitter app ‘Sentiment140’. The programme R studio will supply the code results and the majority of the graphs for the analysis, while the Twitter API will supply the dataset of tweet text. The Sentiment140 app will supply an additional pie chart which will add a more visual element to the interpretation of the data.
Outputs
The output will portray the current mood of the Twitter community on #ConchitaWurst in
the form of a simple charts, word cloud and histograms.
Functional Requirements
Retrieving Input
The software will receive three inputs: R studio code and libraries, the analysis session duration, and tweets.
● R studio code and libraries will be entered by the user for each topic.
● The analysis session duration will be set by the user before each session.
● Tweets will be retrieved from the Twitter API and saved in an Excel file.
Real-Time Processing
The software will take input, process data, and display output in real-time. This will ensure
the data provided by Twitter is a current view of the Twitter community’s mood on
#ConchitaWurst.
Sentiment Analysis
Sentiment analysis will be performed on the keywords within the Tweet to determine the
overall mood of the Tweet relative to the topic. The sentiment analysis will provide a
negative or positive numeric sentiment value.
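A minimal sketch of this kind of keyword scoring, with the numeric score mapped onto the project’s three categories, might look as follows in Python. The word lists are illustrative assumptions, not the opinion lexicon actually used.

```python
import re

POSITIVE_WORDS = {"love", "great", "win", "amazing"}
NEGATIVE_WORDS = {"hate", "awful", "lose", "terrible"}

def sentiment_score(tweet):
    """Return positive-list matches minus negative-list matches."""
    words = re.findall(r"[a-z']+", tweet.lower())
    return (sum(w in POSITIVE_WORDS for w in words)
            - sum(w in NEGATIVE_WORDS for w in words))

def label(score):
    """Map the numeric score onto the three project categories."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

tweet = "I love #ConchitaWurst, what an amazing performance"
s = sentiment_score(tweet)
print(s, label(s))  # 2 positive
```

A score of zero, covering tweets with no lexicon matches or with balanced matches, falls into the neutral category described earlier.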
Output
The software must output real time data in the form of simple charts, a word cloud and histograms. In addition, the software may output additional statistics pertaining to the topic #ConchitaWurst. This output will be clear and easy to understand.
Use Cases
This software will serve as a tool of interest, providing users with the current mood of the Twitter community on #ConchitaWurst.
The Twitter API will provide up-to-date information, limited only by the rate of Twitter input.
R studio will provide prompt analysis of the data using the various software packages available to it. The output should display the latest results at all times, and if it lags behind, the user should be notified. The application should be capable of operating in the background should the user wish to use other applications.
Reliability
The software will meet all of the functional requirements without any unexpected behaviour. At no time should the output display incorrect or outdated information without alerting the user to potential errors; in this instance an error message will be shown.
Availability
The software will be available at all times on the user’s desktop or laptop device, as long as the device is in proper working order. The functionality of the software will depend on required external services, such as internet access. If those services are unavailable, the user should be alerted.
Security
The software should never disclose any personal information of Twitter users, and should collect no personal information from its own users. The use of passwords and API keys will ensure private use of the Twitter API. The analysis will be performed on a password protected laptop and desktop to ensure maximum security.
Maintainability
The software should be written clearly and concisely. The code will be well documented.
Particular care will be taken to design the software modularly to ensure that
maintenance is easy.
Portability
This software will be designed to run on any Android operating system. To ensure the longevity of the software, it will be forward compatible with all currently released Android operating systems.
Design Constraints
The Twitter API has some limitations: for example, it can only return a fixed maximum number of tweets (1,500). This maximum may not always be reached, as there may not be enough tweets for a particular keyword.
Analysis Models
11.3 Management Progress Reports
At this stage the project proposal has been submitted, as well as the requirements specifications. The project proposal was due on the 20th of February; the requirements specifications were submitted on the 2nd of March. Between the weeks of the two submissions, it was decided that my project would be changed from a sentiment analysis of several books to a sentiment analysis of a tweet feed. Therefore a second project proposal had to be completed. The process of researching the project had begun, as efforts were made to read and research papers that had investigated Twitter sentiment analysis. A lecture on Web APIs, provided by the data mining module, was attended during this period. A preliminary attempt at pulling tweets from Twitter failed.
Quality Reviews
The lecture on data mining provided some insight into the project and therefore allowed for a better understanding of the research question and the tools needed to perform the analysis. Code was provided by the lecturer, as well as links to other sites with step by step instructions for the process of sentiment analysis. A first attempt to pull data from the Twitter API was made; however, I was unable to successfully retrieve any tweets and encountered a number of problems using the code.
Issues Arising
The ‘handshake’ in the code outlined by Jeffrey Breen was not performing as it should and failed to provide a URL address. This URL would supply a code to be inputted into the R programme; if accepted, it allowed the Twitter data to be loaded into R. The issue arose for a number of people in the class, and efforts were made to figure out why this was happening. A week or two later the issue was resolved, as it was discovered that the servers in college did not allow the handshake to occur. All sentiment analysis work would have to be performed on a computer outside of college with its own server.
Between the weeks of the two submissions, it was decided that my project would be changed from a sentiment analysis of several books to a sentiment analysis of a tweet feed, so a second project proposal had to be completed. The reason for this variance from plan was that initially I was not sure of the capabilities of data mining and took a guess. Once I was better informed about what data mining was and what it could do, I felt far more comfortable doing a sentiment analysis on Twitter. A number of other people in the class were also performing sentiment analysis on Twitter, and it was felt that there would be sufficient support for this subject area.
In the next period I plan to retrieve a dataset from Twitter using an API app. I also plan to do more research in order to produce a solid research question. I then plan to research the methodologies I might use in order to answer that research question.
11.3.2 Management Progress Report 2
Progress Management Report from 16th March 2014 to 16th April 2014.
A Progress Management report provides the Project Supervisors with a summary of the status of a project at agreed stages and is used to monitor progress. The Project manager uses the Progress Management report to advise the Project Supervisors of any potential problems or areas where the Project Supervisor can help.
A research question was formulated from extensive reading on the subject of Twitter analysis: the project would focus on sentiment analysis of user generated Twitter updates using knowledge based and machine learning techniques.
Further research was performed on the different methodologies I could use to answer the research question.
Quality Reviews
Further research was conducted on the subject of sentiment analysis, as well as on the mathematical equations behind the analysis. Lectures in Advanced Business Data Analysis helped with the understanding of the processes involved. These lectures also provided more experience with R studio, as that was the primary programme we used.
Issues Arising
During this period a number of attempts to retrieve a dataset were made; however, a dataset was still not produced. Problems continued as I tried to retrieve tweets. Finally, a setting in the Twitter API was changed from ‘read only’ to ‘read, write and access direct messages’, and tweets were pulled at last. However, the original keyword #Pantibliss did not seem to collect enough tweets for analysis. Therefore the keywords were changed to #Eurovision2014 and #ConchitaWurst. This should provide the project with enough tweets and a variation of opinions, as #ConchitaWurst is a topical contestant in Eurovision 2014.
Variance from Plan
The original Requirements Specifications outlined performing the Twitter sentiment analysis in Python; however, after increased experience with R studio in the Advanced Business Data Analysis lectures, and given the applicability of R studio to statistical problems, it became apparent that R studio would be the more suitable package to use for the analysis.
Retrieval of dataset
Analysis of data
Progress Management Report from 15th April 2014 to 4th May 2014.
In this period a dataset was created using the keywords #ConchitaWurst and #Eurovision2014. A number of data analysis techniques were applied to the data, including the machine learning classifier Naive Bayes.
The literature behind the project has also started to take form, and the writing of the project has begun. A template for the paper has been provided, which allows for a more structured approach to the writing of the paper.
Products Completed during the Period
Dataset retrieved
Quality Reviews
Templates for the progress management reports, as well as a template for the paper, have been provided in the project class. These will provide excellent guidelines for approaching the paper. Lectures have also been given and guidance has been provided on writing style. The project word count has also been reduced from 10,000 to 7,500 words. The deadline for final submission has been pushed to the end of May, after the exams.
Issues Arising
The time allocated to the project has had to be deferred as the exams are approaching quickly. The writing of the rest of the paper will have to occur after the exams and other projects and presentations. The next steps are to complete my understanding of Naive Bayes and write up the remaining parts of the project.