
2014

Sentiment Analysis of
Twitter
Using Knowledge based and Machine
Learning Techniques

Anne Hennessy
National College of Ireland
5/28/2014
National College of Ireland
Higher Diploma in Science in Data Analytics
2013/2014

Anne Hennessy
X13119966
anne.hennessy@student.ncirl.ie

Table of Contents
1 Executive Summary
2 Introduction
2.1 Domain(s) Description
2.2 Motivation
2.3 Aims
2.4 Solution Overview
2.5 Structure
3 Related Work (Literature Review)
4 System and Datasets
4.1 Design and Architecture
5 Implementation
5.1 Creating a Twitter Application
5.2 Working on R Studio - Building the corpus
5.3 Saving Tweets
5.4 Sentiment Function
5.5 Scoring tweets and adding column
5.6 Import the csv file
5.7 Visualizing the tweets
5.8 Text Analysis
5.9 Create a Naive Bayes classifier
6 Requirements
6.1 User Interfaces
7 Results
8 Conclusions
9 Further development or research
10 References
11 Appendix
11.1 Project Proposal
11.2 Initial Requirement Specification
11.3 Management Progress Reports
11.3.1 Management Progress Report 1
11.3.2 Management Progress Report 2
11.3.3 Management Progress Report 3
11.4 Other Material Used

1 Executive Summary

Twitter is a "micro-blogging" social networking website with a large and rapidly growing user base. The aim of this paper is to collect tweets through the Twitter API on the keywords #ConchitaWurst and #Eurovision2014 and to determine the sentiment orientation of those tweets. The classification model developed in this project determines whether a tweet status update (which cannot exceed 140 characters) reflects a positive or a negative opinion on the part of the person who tweeted. The paper uses a hybrid of knowledge based sentiment analysis methodologies, which have traditionally been used, and machine learning methodologies, which take a more intuitive approach to sentiment, such as Naïve Bayes. The results of both methodologies indicate an overwhelmingly positive response towards Conchita Wurst.

2 Introduction
Twitter is a "micro-blogging" social networking website with a large and rapidly growing user base. Users of Twitter can write short updates of 140 characters or fewer, called 'tweets'. Tweets are seen by those who 'follow' the person who tweeted. Due to the growing popularity of the website, Twitter can provide a rich bank of data in the form of harvested tweets. Twitter, by its very nature, allows people to openly convey their opinions and thoughts about whatever topic, discussion point or product they are interested in. Twitter is therefore a good medium in which to search for potentially interesting trends regarding prominent topics in the news or popular culture.

R Studio is one of many programmes that offer packages for analysing data; it works well with statistical problems and has a user friendly interface (Verzani, 2011).

Sentiment analysis (or opinion mining) refers to the use of natural language processing, text analysis and computational linguistics to identify and extract subjective information in source material (Verzani, 2011).

The value of Twitter has increased in recent years as businesses, political groups and curious Internet users alike have started to assess the public's general sentiment towards their products and services from Twitter posts. Sentiment analysis provides a means of tracking opinions and attitudes on the web and of determining whether they are positively or negatively received by the public. The purpose of text mining is to process unstructured (textual) information and to extract meaningful numeric indices from the text, allowing various data mining algorithms to be applied to explain the textual dataset (Verzani, 2011).

The classification model developed in this project determines whether a tweet status update (which cannot exceed 140 characters) reflects a positive or a negative opinion on the part of the person who tweeted. The paper uses a hybrid of knowledge based sentiment analysis methodologies, which have traditionally been used, and machine learning methodologies, which take a more intuitive approach to sentiment. The results of the two methodologies are combined to perform a thorough analysis of the dataset.

2.1 Domain(s) Description


Twitter is different to other forms of raw data used for sentiment analysis, as sentiments are conveyed in one or two sentence blurbs rather than in paragraphs. Twitter is much more informal and less consistent in terms of language. Users cover a wide array of topics which interest them and use many symbols, such as emoticons, to express their views on many aspects of their lives (Agarwal et al. 2011). In human generated status updates, sentiment is not always obvious; many tweets are ambiguous and may use humour that amplifies the opinion for other human readers but obscures it from a machine learning algorithm (Agarwal et al. 2011). This poses a challenge for machine learning algorithms. Sentiment analysis provides a means of tracking opinions and attitudes on the web and of determining whether they are positively or negatively received by the public.

According to Mejova (2009), sentiment analysis is usually conducted at two levels: a coarse level and a fine level. Coarse level sentiment analysis deals with determining the sentiment of an entire document, while fine level analysis deals with attribute level sentiment (Neethu and Rajasree 2013). Sentence level sentiment analysis falls between these two (Mejova 2009). Sentiment analysis on Twitter provides a dramatically different dataset in which multiple interesting challenges can arise.

2.2 Motivation

This paper chose to analyse tweets with the hashtags #ConchitaWurst and #Eurovision2014 as these were topical at the time of data collection. Conchita Wurst is the drag stage persona of Thomas "Tom" Neuwirth, an Austrian singer. Wurst represented Austria and won the Eurovision Song Contest 2014 in Copenhagen, Denmark (Brooks 2014). This paper decided to investigate the tweets related to Wurst and the Eurovision Song Contest 2014 as Wurst's selection sparked controversy in Austria. Four days after ORF announced its decision, more than 31,000 people liked an "Anti-Wurst" Facebook page (Michaels 2014).

In October, the Ministry of Information in Belarus received a petition calling on Belarus's state broadcaster to edit Wurst's performance out of its Eurovision broadcast. The petition claimed that the performance would turn Eurovision "into a hotbed of sodomy" (Michaels 2014). At the final held in Copenhagen on 10 May 2014, Wurst won the competition with 290 points, Austria's first Eurovision win since 1966 (Michaels 2014).

2.3 Aims
In order to conduct any kind of analysis on Twitter, a suitable dataset of tweets must first be constructed. The Twitter API is used to extract tweets from Twitter and load them into a dataset. The aims of this paper are threefold:

 To construct a database of tweets on the keywords #Eurovision2014 and #ConchitaWurst using a Twitter API application.
 To perform a series of analyses on the data in R Studio: a knowledge based technique, which uses a sentiment lexicon dictionary to determine the number of positive and negative tweets, and a machine learning technique, which is based on a training set and also determines the number of positive and negative tweets.
 To combine the results of the knowledge based technique and the machine learning technique to ensure a thorough analysis of the dataset.

2.4 Solution Overview
A suitable dataset of tweets needs to be constructed. The Twitter API is used to extract tweets from Twitter and load them into a dataset. Additionally, for this project R Studio was used along with the following packages and libraries:

twitteR; provides an interface to the Twitter API.

ROAuth; provides an interface to the OAuth 1.0 specification, which allows users to authenticate via OAuth with the server of their choice.

plyr; a set of tools that solves common problems by breaking bigger problems down into more workable pieces, operating on each piece and then reassembling the reworked pieces.

stringr; makes R string functions more consistent, simpler and easier to use by ensuring that function and argument names are consistent and that all functions handle NAs and zero length characters appropriately. stringr also ensures that the output of each function matches the input data structures of other functions.

ggplot2; an implementation of the grammar of graphics in R, combining the advantages of both base and lattice graphics. Plots can be built up step by step from multiple data sources.

RColorBrewer; provides colour palettes, for example for shading maps according to a variable.

tm; a framework for text mining applications within R.

wordcloud; creates a word cloud to illustrate the frequency of words in text mining.

e1071; provides functions for latent class analysis, support vector machines, bagged clustering and Naive Bayes classification.

gmodels; provides various R programming tools for model fitting.

Using a combination of knowledge based techniques and machine learning techniques allows for a full and thorough analysis of the data. Combining both techniques gives researchers the opportunity to see that the two approaches can be complementary in an analysis.

2.5 Structure
 Related work describes the literature that has been reviewed around the topic of sentiment analysis and the various methodologies used. The related work is loosely divided into knowledge based approaches and machine learning based approaches, to ensure a better understanding of the two areas this paper hopes to combine.
 Design and Architecture describes the overall layout of the project. The design of the paper is based on the Naive Bayes classifier, which is explained in this section. The architecture diagram provides a 'road map' of the different systems that will be used and how they are connected.
 Implementation provides a step by step guide through the processes of the paper, including the code that was used as well as the outputs.
 Requirements provides an insight into the aspects of the project that must be considered, such as functional requirements, data requirements and user requirements.
 Datasets describes the datasets used in this paper: how the data was acquired, a basic description of the data, where it is stored and how it is used to answer the aims of the project.
 Results describes the results obtained after the various methodologies have been applied to the dataset.
 Conclusions describes the advantages, disadvantages, opportunities and limits of the project.
 Further development or research considers where the results of this project could lead in future research.

3 Related Work (Literature Review)
Twitter is different to other forms of raw data used for sentiment analysis, as sentiments are conveyed in one or two sentence blurbs rather than in paragraphs. Twitter is much more informal and less consistent in terms of language. Users cover a wide array of topics which interest them and use many symbols, such as emoticons, to express their views on many aspects of their lives (Agarwal et al. 2011). In human generated status updates, sentiment is not always obvious; many tweets are ambiguous and may use humour that amplifies the opinion for other human readers but obscures it from a machine learning algorithm (Agarwal et al. 2011). Another consideration when using a dataset generated from Twitter is that a considerably large number of tweets convey no sentiment at all, such as links to news articles, which can lead to difficulties in data gathering, training and testing (Parikh and Movassate 2009). Sentiment analysis provides a means of tracking opinions and attitudes on the web and of determining whether they are positively or negatively received by the public.

According to Mejova (2009), sentiment analysis is usually conducted at two levels: a coarse level and a fine level. Coarse level sentiment analysis deals with determining the sentiment of an entire document, while fine level analysis deals with attribute level sentiment (Neethu and Rajasree 2013). Sentence level sentiment analysis falls between these two (Mejova 2009). Sentiment analysis on Twitter provides a dramatically different dataset in which multiple interesting challenges can arise.

According to Boiy et al. (2007), symbolic techniques and machine learning techniques are the two basic methodologies used in sentiment analysis of text. The next two sections deal with these techniques in further detail.

A. Symbolic Techniques

Symbolic techniques in supervised classification models make use of available lexical resources. In his sentiment analysis, Turney (2002) used a bag-of-words approach, in which the document is treated as a collection of words and the relationships between words are not considered important. To determine the overall sentiment, every word is given a sentiment value and those values are combined using aggregation functions. Working with tuples, phrases containing adjectives or adverbs which may be considered positive or negative, Turney (2002) found that the polarity of a review could be based on the average semantic orientation of the tuples extracted from the review.

WordNet, a database of words and their synonyms, was used by Kamps et al. (2004). In that study a distance metric was developed on WordNet and the semantic orientation of adjectives was determined from this metric. In their study, Balahur et al. (2012) introduced a conceptual representation of text, which stored the structure and the semantics of real events, in a system called EmotiNet. EmotiNet was able to identify the emotional responses triggered by actions using the information it stored.

The difficulty with a knowledge based approach, however, is that it requires a large lexical database. This has become harder and harder to provide, as the language of social networks is so trend dependent and changeable that lexicon datasets cannot keep up. Knowledge based approaches to sentiment analysis are therefore not as popular as they used to be.

B. Machine Learning Techniques

In contrast to knowledge based approaches, machine learning techniques are not dependent on a lexicon dataset; instead, a training set and a test set are used for classification. This allows the algorithm to remain dynamic in the face of ever changing social network language. In this methodology a classification model is developed from a training set, which tries to classify the input feature vectors into corresponding class labels. A test set is then used to validate the model by predicting the class labels of unseen feature vectors.

A number of machine learning techniques, such as Naive Bayes (NB) and Support Vector Machines (SVM), are used to classify reviews into either positive or negative orientation (Vinodhini and Chandrasekaran 2012). Domingos and Pazzani (1997) found that Naive Bayes works as a good classifier for certain problems even when the features are highly dependent.

A new model based on the Bayesian algorithm was introduced by Zhen Niu et al. (2012). In this model the weights of the classifier are adjusted by making use of representative features (information that represents a class) and unique features (information that helps to distinguish classes). Using those weights, the researchers calculated the probability of each classification, which allowed for an improved Bayesian algorithm.

Pak and Paroubek (2010) created a Twitter corpus by using a Twitter API which automatically collected tweets from Twitter and annotated them using emoticons. Using that corpus, they built a sentiment classifier which used N-grams and POS tags as features, based on the multinomial Naive Bayes classifier.

Combining various feature sets and classification techniques creates an ensemble framework. Xia et al. (2011) used an ensemble framework for sentiment classification in their paper, where two types of feature sets and three base classifiers were used to form the framework. Part-of-speech information and word relations were used to create the two types of feature sets, and three base classifiers, Naive Bayes, Maximum Entropy and Support Vector Machines, were selected. Fixed combination, weighted combination and meta-classifier combination ensemble methods for sentiment classification were applied and measured so as to obtain better accuracy. When the classifiers were measured separately, Naive Bayes was found to be the most accurate.

In this paper, we construct a Twitter corpus using the Twitter API and preprocess it in R Studio. Using knowledge based methods, we apply an available lexical resource to the Twitter corpus. To compare the results from the knowledge based method with a machine learning technique, we then apply a Naive Bayes classification model to the corpus, which splits the corpus into positive and negative tweets and highlights how the tweets are classified. Naïve Bayes is used as it often works well as a first classifier in data analysis.

4 System and Datasets

4.1 Design and Architecture


The system architecture proposed for this paper is shown in Figure 1. The paper proposes a hybrid approach involving both knowledge based methodologies and machine learning based methodologies to analyse the sentiment orientation of the tweets. Tweets are accessed through the Twitter API. The collected tweets are processed in R and saved to an Excel file. The words are extracted and stored in a feature vector, and are scored for their sentiment orientation using a publicly available sentiment lexicon sourced from the internet. The Excel file is then loaded into R and a corpus is produced. The corpus is divided into a 'training set' and a 'test set', and features are extracted from each. From the training set feature extraction a machine learning algorithm is produced, and from the test set feature extraction a model classifier in the form of a Naive Bayes classifier is produced. Naive Bayes does not consider the relationships between features such as emotional keywords and emoticons. This is suitable for sentiment analysis, as these features do not always relate to one another, for example a smiley emoticon at the end of a negative tweet. The Naive Bayes classifier analyses each feature of the feature vector individually, as it assumes that they are all independent of each other. The conditional probability for Naive Bayes can be defined as

P(X | yj) = Π i=1..m P(xi | yj)

where X is the feature vector, defined as X = {x1, x2, ..., xm}, and yj is the class label. In the tweets collected for this paper there are different independent features, such as emoticons and emotional keywords, which are treated as either positive or negative and are therefore used by the Naive Bayes classifier for classification.

The machine learning algorithm is then applied to the model classifier and a label is produced, as seen in Figure 1.

Figure 1

5 Implementation

5.1 Creating a Twitter Application


The first step in performing Twitter analysis is to create a Twitter application. This application allows sentiment analysis to be performed by connecting the R console to Twitter through the Twitter API. The steps for creating the Twitter application are as follows:

 Go to https://dev.twitter.com and log in using a Twitter account.

 Then go to My Applications and create a new application.

Figure 2

 Give the application a name (in this case somesortofanne6), provide a description of the application (in this case twitter API) and a website URL (the student email URL was used). The Callback URL was left blank. Complete the other formalities and create the Twitter application. Once all the steps are done, the created application is shown as below. The access properties were changed to 'Read, Write and Access Direct Messages'. The Consumer Key and Consumer Secret were then used in R Studio.

Figure 3

The Twitter application is now created, as seen in figure 3.

5.2 Working on R Studio - Building the corpus

The first step of the R Studio process was the installation of some packages and libraries in R: twitteR, ROAuth, plyr, stringr and ggplot2. The installation of these packages can be seen in the following commands:

Figure 4
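The screenshot itself is not reproduced here. A minimal sketch of the installation and loading commands, assuming the package names listed above, is:

# Install the packages used in this project (a one-off step)
install.packages(c("twitteR", "ROAuth", "plyr", "stringr", "ggplot2"))

# Load them into the current R session
library(twitteR)
library(ROAuth)
library(plyr)
library(stringr)
library(ggplot2)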

Windows users also need to download a small file using the following command:

Figure 5
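The command behind this screenshot is not reproduced; the step commonly used at the time downloads the curl CA certificate bundle into the working directory (the destination file name cacert.pem is an assumption carried through the later sketches):

# Download the CA certificate file needed for SSL verification on Windows
download.file(url = "http://curl.haxx.se/ca/cacert.pem", destfile = "cacert.pem")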

Once this file has been downloaded, the next stage is to access the Twitter API. This step includes the script code to perform the OAuth handshake using the Consumer Key and Consumer Secret of the application. Figure 6 shows the code that must be run to perform the handshake.

Figure 6
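The handshake code in the screenshot is not reproduced; a sketch of the ROAuth handshake of this era, with placeholder values for the application keys, is:

requestURL <- "https://api.twitter.com/oauth/request_token"
accessURL  <- "https://api.twitter.com/oauth/access_token"
authURL    <- "https://api.twitter.com/oauth/authorize"
consumerKey    <- "YOUR_CONSUMER_KEY"     # placeholder - copied from the Twitter application
consumerSecret <- "YOUR_CONSUMER_SECRET"  # placeholder - copied from the Twitter application

# Create the OAuth credential object and perform the handshake
cred <- OAuthFactory$new(consumerKey = consumerKey,
                         consumerSecret = consumerSecret,
                         requestURL = requestURL,
                         accessURL = accessURL,
                         authURL = authURL)
cred$handshake(cainfo = "cacert.pem")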

In order to access the Twitter API, the programme assigns the request URL, access URL and authorization URL of the Twitter application to the variables requestURL, accessURL and authURL respectively. consumerKey and consumerSecret are unique to the Twitter application. Running this gives the following message on the R console:

Figure 7

The last three lines of the console are a message to the user: to enable the connection, direct your web browser to

http://api.twitter.com/oauth/authorize?oauth_token=dHwEGXdxbjJ093sG0tVjYVT0NQrkjU3DuCxcC1YQyc

After opening the above link in the browser, the application is authorised by providing your username and password. The code provided must then be entered into the R console.

Registration of the handshake is completed with the command shown in figure 8:

Figure 8
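A sketch of this registration step, assuming the credential object from the handshake above is called cred, is:

# Register the completed credential with the twitteR package; returns TRUE on success
registerTwitterOAuth(cred)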

The console returns TRUE, which means that the handshake is complete. We can now get tweets from the Twitter timeline.

Figure 9

5.3 Saving Tweets

Once the handshake is done and authorised by Twitter, we can fetch the most recent tweets related to any keyword. The keywords #ConchitaWurst and #Eurovision2014 were used, as both were topical at the time of data collection. The code for getting tweets related to #ConchitaWurst and #Eurovision2014 is:

Figure 10
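The exact code in the screenshot is not reproduced; a sketch of this step, with assumed object and file names, is:

# Fetch up to 1,500 recent tweets for the keyword and convert them to a data frame
conchita_tweets <- searchTwitter("#ConchitaWurst", n = 1500, cainfo = "cacert.pem")
conchita_df <- twListToDF(conchita_tweets)

# Save the data frame to a csv file in the working directory
write.csv(conchita_df, file = "conchita_tweets.csv", row.names = FALSE)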

This command gets 1,500 tweets related to #ConchitaWurst. The function "searchTwitter" is used to download tweets from the timeline. The Twitter API can only return a fixed maximum number of tweets (1,500), and this maximum may not be reached when there are not enough tweets for a particular keyword. This was the case for #Eurovision2014, which did not return many tweets, so it was decided that the paper would concentrate all of its efforts on the tweets pulled by #ConchitaWurst. As can be seen in the code above, the 1,500 tweets were converted into a data frame so that analysis could be performed on them. Finally, the data was written to a .csv file.

5.4 Sentiment Function

Once the tweets were obtained, some functions had to be applied to convert them into useful information. The main working principle of sentiment analysis is to find the words in the tweets that represent positive sentiment and the words that represent negative sentiment. For this, a list of positive and negative sentiment words was needed. A publicly available list of positive and negative words compiled by B. Liu was downloaded from the University of Illinois at Chicago website (Liu and Hu 2004) and saved in the working directory. The sentiment analysis uses the two packages plyr and stringr to manipulate strings. The function can be seen in the following screen print, figure 11.

Figure 11
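The function in the screenshot is not reproduced; a sketch of a scoring function of the kind described, based on the widely used plyr/stringr approach (the name score.sentiment and the exact cleaning steps are assumptions), is:

score.sentiment <- function(sentences, pos.words, neg.words) {
  require(plyr)
  require(stringr)
  scores <- laply(sentences, function(sentence, pos.words, neg.words) {
    # clean the tweet text: strip punctuation, control characters and digits
    sentence <- gsub('[[:punct:]]', '', sentence)
    sentence <- gsub('[[:cntrl:]]', '', sentence)
    sentence <- gsub('\\d+', '', sentence)
    sentence <- tolower(sentence)
    # split the tweet into words and compare them against the opinion lexicons
    words <- unlist(str_split(sentence, '\\s+'))
    pos.matches <- !is.na(match(words, pos.words))
    neg.matches <- !is.na(match(words, neg.words))
    # TRUE/FALSE are treated as 1/0 by sum(): score = positives minus negatives
    sum(pos.matches) - sum(neg.matches)
  }, pos.words, neg.words)
  data.frame(score = scores, text = sentences)
}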

The sentiment function calculates a score for each individual tweet. It first calculates the positive score by comparing the tweet's words with the positive word list, and then calculates the negative score by comparing them with the negative word list. The final score is calculated as

score = positive score – negative score

5.5 Scoring tweets and adding column


The next stage is to score the tweets using the sentiment function above.

Figure 12
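A sketch of this scoring step, assuming the lexicon files keep their original names and the data frame created earlier is called conchita_df, is:

# Read the opinion lexicon (Liu and Hu, 2004); lines starting with ";" are comments
pos.words <- scan("positive-words.txt", what = "character", comment.char = ";")
neg.words <- scan("negative-words.txt", what = "character", comment.char = ";")

# Score the text of every tweet
conchita_scores <- score.sentiment(conchita_df$text, pos.words, neg.words)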

The console gives the following output:

Figure 13

5.6 Import the csv file

When we import this csv file, a dataset file is created in the working directory. The next step is to score the tweets; this can be done by creating a separate csv file which contains the score of each tweet, as follows:

Figure 14
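A sketch of this step, with an assumed file name, is:

# Write the scored tweets to a separate csv file in the working directory
write.csv(conchita_scores, file = "conchita_scores.csv", row.names = FALSE)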

The snapshot of the score file shows the score of each tweet as an integer in front of the tweet text.

Figure 15

5.7 Visualizing the tweets

The next step is to create histograms and other plots to visualize the sentiments of the users. This can be done using the hist function.

Figure 16
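A sketch of the histogram call, with assumed object names and labels, is:

# Histogram of the sentiment scores of the collected tweets
hist(conchita_scores$score, col = "skyblue",
     main = "Sentiment scores of #ConchitaWurst tweets", xlab = "Score")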

5.8 Text Analysis

The data from the tweets was saved in a csv file, which needed to be loaded into R. The score variable was read as a character vector. Since this is a categorical variable, it is better to convert it to a factor, as shown in figure 17:

Figure 17
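A sketch of this step, assuming the scored tweets were saved as conchita_scores.csv, is:

# Load the scored tweets and convert the score column to a factor
tweets_raw <- read.csv("conchita_scores.csv", stringsAsFactors = FALSE)
tweets_raw$score <- factor(tweets_raw$score)

# Inspect the distribution of the factor levels
table(tweets_raw$score)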

The first step in processing text data involves creating a corpus, which refers to
a collection of text documents. In this project, a text document refers to a single
tweet. We'll build a corpus containing the tweets in the training data using the
following command:

Figure 18
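A sketch of the corpus construction, with assumed object names, is:

library(tm)
# Build a corpus in which each document is a single tweet
twitter_corpus <- Corpus(VectorSource(tweets_raw$text))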

The text mining of the tweets can be performed next. The functions to refine and filter the text, such as removing numbers and punctuation and handling uninteresting words such as 'and', 'but' and 'or', are taken from the tm package.

Figure 19
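A sketch of the cleaning steps described above, using the transformations provided by the tm package (the exact set and order of steps are assumptions), is:

# Clean the corpus: lower case, remove numbers, stop words, punctuation and extra whitespace
corpus_clean <- tm_map(twitter_corpus, content_transformer(tolower))
corpus_clean <- tm_map(corpus_clean, removeNumbers)
corpus_clean <- tm_map(corpus_clean, removeWords, stopwords("english"))
corpus_clean <- tm_map(corpus_clean, removePunctuation)
corpus_clean <- tm_map(corpus_clean, stripWhitespace)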

The next step is to tokenize the corpus and return the sparse matrix with the
name twitter_dtm. From here, analyses involving word frequency will be
performed.
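A sketch of this tokenisation step, using the matrix name given in the text and assuming tm's DocumentTermMatrix function, is:

# Tokenise the cleaned corpus into a sparse document-term matrix
twitter_dtm <- DocumentTermMatrix(corpus_clean)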

The data then needs to be split into a training dataset and a test dataset for Naïve Bayes, so that the classifier can be evaluated. The data is split into two portions: 75 percent (rows 1 to 1299) for training and 25 percent (rows 1300 to 1500) for testing. This is done by splitting the raw data frame, as seen below.

Figure 20
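A sketch of the split, using the row ranges stated above (the object names are assumptions), is:

# Split the raw data frame, the document-term matrix and the corpus
tweets_raw_train <- tweets_raw[1:1299, ]
tweets_raw_test  <- tweets_raw[1300:1500, ]

twitter_dtm_train <- twitter_dtm[1:1299, ]
twitter_dtm_test  <- twitter_dtm[1300:1500, ]

corpus_train <- corpus_clean[1:1299]
corpus_test  <- corpus_clean[1300:1500]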

To confirm that the subsets are representative of the complete set of Twitter data, the proportions of scores in the training and test data frames are compared.

Figure 21
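A sketch of this comparison is:

# Compare the proportion of each sentiment score in the two subsets
prop.table(table(tweets_raw_train$score))
prop.table(table(tweets_raw_test$score))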

Both the training data and the test data contain about 13 percent negative sentiment and 75 percent positive sentiment, which suggests that the tweets were divided evenly between the two datasets.

Figure 22

A word cloud can be produced next. Words appearing more often in the text are shown in a larger font, while less common terms are shown in smaller fonts, illustrating the frequency of words in the dataset. The code for the word cloud is:

Figure 23
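A sketch of the word cloud call, where the minimum frequency of 30 is an arbitrary assumed threshold, is:

library(wordcloud)
library(RColorBrewer)
# Word cloud of the training corpus; more frequent words are drawn larger
wordcloud(corpus_train, min.freq = 30, random.order = FALSE,
          colors = brewer.pal(8, "Dark2"))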

The final word cloud obtained is as follows:

Figure 24

According to this word cloud, 'Eurovision', 'Conchita Wurst' and 'Lady Gaga' are the most used terms in the tweets, followed by 'gaga', 'gay' and 'queen', which shows that when tweeting about Conchita Wurst people also connect words such as 'Eurovision' and 'Lady Gaga'.

5.9 Create a Naive Bayes classifier

The final step in the data preparation process is to transform the sparse matrix into a data structure that can be used to train a naive Bayes classifier. It is unlikely that all of the features in the sparse matrix are useful for classification, so to reduce the number of features any words that appear in fewer than five tweets are eliminated.

Figure 26
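A sketch of this feature reduction step, keeping only frequent terms and rebuilding the matrices from them (object names assumed), is:

# Keep only terms that appear at least five times in the training tweets
freq_terms <- findFreqTerms(twitter_dtm_train, 5)

twitter_train <- DocumentTermMatrix(corpus_train, list(dictionary = freq_terms))
twitter_test  <- DocumentTermMatrix(corpus_test,  list(dictionary = freq_terms))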

The naive Bayes classifier is typically trained on data with categorical features. This poses a problem, since the cells in the sparse matrix indicate a count of the times a word appears in a tweet. Changing this to a factor variable that simply indicates yes or no alleviates the problem. The following code defines a convert_counts() function to convert counts to factors:

Figure 27
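A sketch of the conversion, training and prediction steps described here (object names assumed) is:

# Convert word counts to a categorical Yes/No value
convert_counts <- function(x) {
  ifelse(x > 0, "Yes", "No")
}

twitter_train <- apply(twitter_train, MARGIN = 2, convert_counts)
twitter_test  <- apply(twitter_test,  MARGIN = 2, convert_counts)

# Train the Naive Bayes classifier and predict the classes of the unseen test tweets
library(e1071)
tweet_classifier  <- naiveBayes(twitter_train, tweets_raw_train$score)
tweet_predictions <- predict(tweet_classifier, twitter_test)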

As can be seen above, the classifier is created by the naive Bayes function and trained on the training data. The classifier then predicts the results for the test data. A cross table is created to visualize the results, in which the data is classified into actual and predicted classes.

Figure 28
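A sketch of the cross table, assuming the prediction object from the previous step, is:

library(gmodels)
# Cross tabulate the predicted classes against the actual classes of the test tweets
CrossTable(tweet_predictions, tweets_raw_test$score,
           prop.chisq = FALSE, prop.t = FALSE,
           dnn = c("predicted", "actual"))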

6 Requirements

6.1 User Interfaces

The programme used for this paper was R Studio. R Studio was selected as a suitable programme because its interface and programme libraries met the requirements of the brief. Additionally, the user interface of R Studio is simple to navigate and easy to understand. R Studio is free and open source, works well on both Windows and Mac hardware, contains advanced statistical routines and offers a large, coherent, integrated collection of tools for data analysis. R Studio also possesses powerful graphics capabilities which greatly aid the visualisation of data and results. These properties allowed the user to interact with the application with optimal ease.

The interface will include user inputs as well as two graphics, as outlined below.

6.1.1 User Inputs (Mandatory)

The user will be able to control the sentiment analysis of topics in two ways:

1. Edit: this function will let the user edit keywords by adding, editing or removing keywords for each topic, and
2. Time: this function will let the user specify the duration of each analysis session.

6.1.2 Graphic 1: Topic Mood Gauge (Mandatory)

This graphic will consist of a simple histogram showing the current mood of the Twitter community on the topic of #ConchitaWurst. The percentage of Twitter users who are currently for or against the topic being analyzed will be displayed. It will also display the most frequently used words on the subject of #ConchitaWurst through the use of a word cloud.

6.1.3 Error Notifications (Mandatory)

Error notifications will be required within R Studio; these will be presented to the user as appropriate messages in red describing the error that has taken place. Where applicable, error messages will suggest possible solutions to the problem.

6.1.4 Hardware Interface


The application will run on a password protected personal Microsoft laptop. No
further hardware devices or interfaces will be required for this analysis.

6.1.5 Software Interfaces

Inputs

The software will receive input from four sources: first, the programme R Studio; second, the Twitter API; third, Excel, which will hold the dataset once retrieved from the Twitter API; and fourth, the Twitter app 'Sentiment140'. R Studio will supply the code results and the majority of the graphs for the analysis, while the Twitter API will supply the dataset of tweet text. The Sentiment140 app will supply an additional pie chart which adds a more visual element to the interpretation of the data.

Outputs

The output will portray the current mood of the Twitter community on #ConchitaWurst in the form of simple charts, a word cloud and histograms.

6.1.6 Functional Requirements

Retrieving Input

The software will receive three inputs: R Studio code and libraries, the analysis session duration and tweets.

● R Studio code and libraries are entered by the user for each topic.

● The analysis session duration will be set by the user before each session.

● Tweets will be retrieved from the Twitter API and saved in an Excel file.

Real-Time Processing

The software will take input, process data and display output in real time. This will ensure the data provided by Twitter is a current view of the Twitter community's mood on #ConchitaWurst.

Sentiment Analysis

Sentiment analysis will be performed on the keywords within the tweet to determine the overall mood of the tweet relative to the topic. The sentiment analysis will provide a negative or positive numeric sentiment value.

Output

The software must output real time data in the form of simple charts, a word cloud and histograms. In addition, the software may output additional statistics pertaining to the topic of #ConchitaWurst. This output will be clear and easy to understand.

6.1.7 Use Cases

This software will serve as a tool of interest, providing users with the current mood of the Twitter community on #ConchitaWurst.

6.1.8 Non-Functional Requirements

Performance

The Twitter API will provide up-to-date information, limited only by the rate of Twitter input. R Studio will provide prompt analysis of the data using the various software packages available to it. The output should display the latest results at all times, and if it lags behind, the user should be notified. The application should be capable of operating in the background should the user wish to use other applications.

Reliability

The software will meet all of the functional requirements without any unexpected behavior. At no time should the output display incorrect or outdated information without alerting the user to potential errors; in this instance an error message will be shown.

Availability

The software will be available at all times on the user's desktop or laptop, as long as the device is in proper working order. The functionality of the software will depend on external services, such as internet access, being available. If those services are unavailable, the user should be alerted.

Security

The software should never disclose any personal information of Twitter users, and should collect no personal information from its own users. The use of passwords and API keys will ensure private use of the Twitter API. The programmes will be run on a password protected laptop and desktop to ensure maximum security.

Maintainability

The software should be written clearly and concisely. The code will be well
documented. Particular care will be taken to design the software modularly to
ensure that maintenance is easy.

Portability

This software will be designed to run on any Android operating system. To ensure the longevity of the software, it will be forward compatible with all currently released Android operating systems.

6.2 Design Constraints

The Twitter API has some limitations: it can only return a fixed maximum number of tweets (1,500), and this maximum may not be reached when there are not enough tweets for a particular keyword.

6.3 Logical Database Requirements

The tweets taken from Twitter will be stored in an Excel spreadsheet. Excel is a suitable programme for storing large amounts of data and makes it easy to upload the data to R Studio. The data will have two columns: column one will hold the score of the tweet (pos, neg, very pos or very neg) and column two will store the actual tweet content. Each row will represent an individual tweet.

6.4 Datasets

A Twitter API app was used to pull tweets from Twitter's public timeline in real time. A dataset was created using tweets on topics that were dominating Twitter at the time of data collection: #ConchitaWurst and #Eurovision2014. #Eurovision2014 did not produce very many tweets, so the project concentrated on the tweets returned for #ConchitaWurst, of which there were 1,500. A sentence level sentiment analysis was performed on the tweets, as many were full of slang words and misspellings. This is done in three phases. In the first phase, pre-processing is performed. Secondly, a feature vector is created using relevant features. A publicly available sentiment lexicon, consisting of around 6,800 positive and negative opinion words or sentiment words for English and compiled over many years by Liu and Hu (2004), was used to separate the tweets. Finally, tweets are classified into positive and negative classes using different classifiers. The final sentiment is based on the number of tweets in each class, using several sentiment analysis methodologies, including the bag-of-words approach, which uses available lexical resources as seen in Turney's (2002) sentiment analysis. Machine learning approaches are also used, in which the tweets dataset was split into training and testing sets: 1,199 tweets were used for training and the remaining 300 for testing. There were a total of 1,135 positive tweets, 197 negative tweets, 152 very positive tweets and 15 very negative tweets. These tweets were then used for training and testing in order to build the Naive Bayes classifier.

Creation of a Dataset

Since standard Twitter datasets are not available for analysis, we created a new dataset by collecting tweets over a period ranging from May 6th 2014 to May 8th 2014. Tweets were collected automatically using the Twitter API and were manually annotated as positive or negative. In total, 1,500 tweets were collected for #ConchitaWurst and 100 tweets for #Eurovision2014. Unexpectedly, a number of the tweets were neutral; however, the positive and negative tweets made up the dataset.

Preprocessing of Tweets

Pre-processing steps were performed to make keyword extraction as simple as possible for the algorithm. Punctuation marks, control characters and digits were removed, the tweet text was changed to lower case, sentences were split into words, and the words were compared against the dictionary's positive and negative words. A matched term is returned as a true or false value, which is treated as 1 or 0 by the sum function. Finally, the scores are put into a data frame named scores.df. Before tweets can be scored, however, a sentiment lexicon of words must be obtained; this lexicon was found on Bing Liu's website (Liu and Hu 2004). The final score for each tweet is the number of positive words minus the number of negative words. If the score is higher than 0, the tweet is regarded as positive; if the score is lower than 0, the tweet is regarded as expressing a negative opinion.

7 Results

The aim of this paper was to analyse the results of a sentiment orientation analysis on the keyword #ConchitaWurst.

Figure 29

The above histogram shows the frequency of tweets with respect to the scores allotted to each tweet. The x-axis shows the score of each tweet as a negative or positive integer or zero. A positive score represents positive or good sentiment associated with that particular tweet, whereas a negative score represents negative or bad sentiment. A score of zero indicates neutral sentiment. The more positive the score, the more positive the sentiment of the person tweeting, and vice versa.

The histogram is slightly skewed towards positive scores, which shows that the sentiment of people regarding Conchita Wurst is overwhelmingly positive, with a slight skew towards very positive.

Out of the 1,500 tweets fetched from Twitter, the majority (1,135) are positive, whereas around 197 had negative sentiment. 152 tweets had very positive sentiment and, as can be seen from the plot, the overall score is positive.

In order to see how accurately the Naïve Bayes classifier worked, the cross table below, which compares the actual classes of the test tweets with the classes predicted by the trained model, must be examined. It can be seen that of the 225 positive tweets, 6 were incorrectly classified as very positive (4), negative (1) or very negative (1), while 2 of the 35 negative tweets were incorrectly classified as positive (1) and very negative (1). The presence of some miscategorised tweets might suggest that the model fitted the training data well but was slightly underfitted when applied to the test data.

Figure 30

8 Conclusions

In this paper a hybrid of knowledge based methodologies and machine learning methodologies was used in order to give a thorough examination of the tweets for #ConchitaWurst and #Eurovision2014 which were extracted from Twitter. #Eurovision2014 did not produce very many tweets, so the project concentrated on the tweets pulled from #ConchitaWurst, which returned 1,500 tweets. A publicly available sentiment lexicon, consisting of around 6,800 positive and negative opinion words or sentiment words for English and compiled over many years by Hu and Liu (2004), was used to separate the tweets. Tweets were then classified into positive and negative classes using the machine learning classifier Naïve Bayes. The extraction of tweets from Twitter proved to be more difficult than expected and several attempts were made to produce a dataset. The lack of tweets pulled from #Eurovision2014 highlights the limitations of the app when faced with too few tweets. It was also found that certain issues can arise when dealing with a tweet based dataset: the presence of white space, punctuation and numbers had to be confronted in the preprocessing stage. To further alleviate these issues, Twitter specific features were extracted and added to the feature vector after proper preprocessing.

The classification accuracy of the feature vector was tested using a classifier such as Naïve Bayes. The Naïve Bayes assumption that the features are independent proved this classification methodology to be an excellent tool in this analysis. The author found the machine learning algorithms simpler to implement and more efficient than other aspects of the paper, as they produced a table which allowed for transparency in the accuracy of the Naive Bayes classification. Overall, the hybrid approach to sentiment analysis allowed for a thorough analysis of the data and performs well for a Twitter dataset. However, the accuracy of the Naïve Bayes classifier still leaves room for improvement, which may be achieved by better preprocessing.

9 Further development or research

The applicability of sentiment analysis for businesses and marketing, using keywords and analysing public sentiment around those keywords, is only going to increase as the popularity of Twitter grows over the next few years. In terms of long-term development or research, however, the ability of the Twitter API to pull older data should be developed, as should APIs for other social media, so that sentiment analysis could be performed over a period of time. This would be especially valuable in the social sciences, where researchers could enquire into social and political shifts of opinion on social media sites. Equally, the lack of change in opinion over time on some issues might be worth pursuing as a topic of research for Twitter sentiment analysis. Such a sentiment analyser would allow for an interesting analysis of social and political issues.

10 References

1. Agarwal,A. Xie,B. Vovsha,I. Rambow,O. Passonneau,R.(2011) ‘Sentiment Analysis of


Twitter Data’ Proceedings of the Workshop on Language in Social Media (LSM 2011).
Portland, Oregon, 23 June 2011. Association for Computational Linguistics. Stroudsburg,
PA. pages 30–38,

2. Balahur, A., Hermida, J. M. and Montoyo, A. (2012) “Building and exploiting EmotiNet, a


knowledge base for emotion detection based on the appraisal theory model,” Affective
Computing, IEEE Transactions on, vol. 3, no. 1, pp. 88–101. Available at
http://ieeexplore.ieee.org/xpl/articleDetails.jsp?tp=&arnumber=6042854&queryText%3DB
alahur%2C+A.+Hermida%2C+J.+M.++and+Montoyo%2C+A.+.LB.+2012..RB.+%E2%80
%9CBuilding+and+exploiting+emo .[Accessed On 3 May 2014]

3. Boiy, E. Hens, P. Deschacht, K. Moens, M.F.( 2007) “Automatic sentiment analysis in


on-line text,” in Proceedings of the 11th International Conference on Electronic
Publishing, IEEE pp. 349-360, [Accessed May 5th 2014].

4. Brooks, D. (2014) ‘Bearded Austrian drag queen to take on Eurovision’. Reuters, 28 April 2014. Available at http://uk.reuters.com/article/2014/04/28/uk-austria-eurovision-drag-idUKKBN0DE06O20140428 [Accessed on 23 April 2014]

5. Domingos, P and Pazzani, M. (1997) “On the optimality of the simple bayesian classifier
under zero-one loss,” Machine Learning, vol. 29, no. 2-3, pp. 103–130.IEEE [Accessed
On 3 May 2014]

6. Kamps, J. Marx, M. Mokken, R. J. De Rijke, M. (2004 )“Using wordnet to measure


semantic orientations of adjectives,” Found in Neethu,M. Rajasree R.(2013) ‘Sentiment
Analysis in Twitter using Machine Learning Techniques’ 4th International Conference on
Computing, Communications and Networking Technologies (ICCCNT) Tiruchengode
India. July 4-6 2013. IEEE. pp1-5 [Accessed April 25th 2014].

7. Liu, B. Hu ,M. (2004) ‘Opinion Mining, Sentiment Analysis, and Opinion Spam Detection’.
available at http://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html#lexicon [Accessed
on 16th May 2014]

8. Mejova, Y. (2009) “Sentiment analysis: An overview,” available at http://www.cs.uiowa.edu/~ymejova/publications/CompsYelenaMejova.pdf [accessed on 02-05-2014]

9. Michaels,S.(2014). ‘Parade for Eurovision's Conchita Wurst banned by Russian officials’.


The Guardian. 16 May 2014. Available at
http://www.theguardian.com/music/2014/may/16/eurovision-conchita-wurst-parade-
russia-ban. [Accessed on 23 April 2014]

10. Neethu,M. Rajasree R.(2013) ‘Sentiment Analysis in Twitter using Machine Learning
Techniques’ 4th International Conference on Computing, Communications and
Networking Technologies (ICCCNT) Tiruchengode India. July 4-6 2013. IEEE. pp1-5
[Accessed April 25th 2014].

11. Niu, Z. Yin, Z. Kong, X. (2012 )“Sentiment classification for microblog by machine
learning,” in Computational and Information Sciences (ICCIS),2012 Fourth International
Conference on, pp. 286–289, IEEE[Accessed On 3 May 2014]

12. Pak, A. and Paroubek, P. (2010) “Twitter as a corpus for sentiment analysis and opinion mining,” In Proceedings of the Seventh Conference on Language Resources and Evaluation (LREC'10). Valletta, Malta, May 2010. European Language Resources Association.

13. Parikh,R. Movassate,M (2009) ‘Sentiment Analysis of User-Generated TwitterUpdates


using Various Classification Techniques’.Found in Neethu,M. Rajasree R.(2013)
‘Sentiment Analysis in Twitter using Machine Learning Techniques’ 4th International
Conference on Computing, Communications and Networking Technologies (ICCCNT)
Tiruchengode India. July 4-6 2013. IEEE. pp1-5 [Accessed April 25th 2014].

14. Turney, P. D. (2002 )“Thumbs up or thumbs down?: semantic orientation applied to


unsupervised classification of reviews,” in Proceedings of the 40th annual meeting on
association for computational linguistics, Association for Computational Linguistics,
IEEE. pp. 417–424.

15. Vinodhini G. Chandrasekaran, RM.(2012) ‘Sentiment Analysis and Opinion Mining: A
Survey’. International Journal of Advanced Research in Computer Science and Software
Engineering Volume 2, Issue 6, June 2012 IEEE. Pp61-75 [Accessed April 25th 2014].

16. Verzani, J.(2011) Getting Started with R Studio. CA. O'Reilly Media, Inc.

17. Xia, R., Zong, C. and Li, S. (2011) “Ensemble of feature sets and classification algorithms for
sentiment classification,” Information Sciences: an International Journal, vol. 181, no. 6,
pp. 1138–1152. IEEE [Accessed On 3 May 2014]

11 Appendix

11.1 Project Proposal

Objectives and Contribution to the Knowledge


This paper will use a hybrid of knowledge based sentiment analysis methodologies, which have traditionally been used, and machine learning methodologies, which take a more intuitive approach to sentiment. The results of these two methodologies will be combined to perform a thorough analysis of the dataset.

In order to conduct any kind of analysis on Twitter, a suitable dataset of tweets must first be constructed. The Twitter API is used to extract tweets from Twitter and load them into a dataset.

The aim of this paper is to use the results from the knowledge based techniques together with those of the machine learning techniques to ensure a thorough analysis of the dataset.

Background
Twitter is different to other forms of raw data used for sentiment analysis, as sentiments are conveyed in one or two sentence blurbs rather than in paragraphs. Twitter is much more informal and less consistent in terms of language. Users cover a wide array of topics which interest them and use many symbols, such as emoticons, to express their views on many aspects of their lives (Agarwal et al. 2011). In human generated status updates, sentiment is not always obvious; many tweets are ambiguous and may use humour that amplifies the opinion for other human readers but obscures it from a machine learning algorithm (Agarwal et al. 2011). This poses a challenge for machine learning algorithms. Sentiment analysis provides a means of tracking opinions and attitudes on the web and of determining whether they are positively or negatively received by the public.

Using a combination of knowledge based techniques and machine learning techniques, this paper will allow for a full and thorough analysis of the data. Combining both techniques gives researchers the opportunity to see that the two approaches can be complementary in an analysis.

Special resources required


R studio and dependent libraries
Twitter API
A Personal twitter account.
Excel
A comprehensive and publicly available sentiment lexicon.

Project Plan

The project plan runs from 20/02/2014 to 20/06/2014 and covers the following tasks: project proposal, requirements specifications, data retrieval, data analysis, statistics/algorithms, literature review, system architecture and editing. (The Gantt chart is not reproduced here.)

Technical Details
A suitable dataset of tweets needs to be constructed. The Twitter API is used to extract tweets from Twitter and load them into a dataset. Additionally, for this project R Studio was used along with the following packages and libraries:

twitteR; provides an interface to the Twitter API.

ROAuth; provides an interface to the OAuth 1.0 specification, which allows users to authenticate via OAuth with the server of their choice.

plyr; a set of tools that solves common problems by breaking bigger problems down into more workable pieces, operating on each piece and then reassembling the reworked pieces.

stringr; makes R string functions more consistent, simpler and easier to use by ensuring that function and argument names are consistent and that all functions handle NAs and zero length characters appropriately. stringr also ensures that the output of each function matches the input data structures of other functions.

ggplot2; an implementation of the grammar of graphics in R, combining the advantages of both base and lattice graphics. Plots can be built up step by step from multiple data sources.

tm; a framework for text mining applications within R.

wordcloud; creates a word cloud to illustrate the frequency of words in text mining.

e1071; provides functions for latent class analysis, support vector machines, bagged clustering and Naive Bayes classification.

Systems/Datasets
The process of sentiment analysis for this project is outlined in a diagram (not reproduced here). Its stages are: Twitter; crawling; test data and training data; data preparation and sentiment training; Fisher sentiment analysis and document probabilities; ranked topics.

Evaluation, Tests and Analysis


The evaluation will be the results at the end of the analysis.

Consultation with Specialization Person(s)


Dr. Ioana Ghergulescu

11.2 Initial Requirement Specification

Introduction
Twitter is a micro-blogging site where users can send mini blogs (tweets) in the form of messages of up to 140 characters to a group of friends (followers). Despite some restrictions, in general people are permitted to read and follow one another's tweets, so tweets are by default public. According to Forbes, the number of active Twitter users increased by 40% from the second quarter of 2012 to the fourth quarter of 2012 (forbes.com 2014).

Due to its nature, Twitter allows people to tweet how they feel about certain topics and products, and because Twitter is now integrated with several web applications and can be used via various messaging and other social networking platforms, people can share their opinions more freely. This makes Twitter an obvious choice for research purposes in data and opinion mining for a variety of fields.

Recently the use of the word "homophobe" by an Irish drag queen named Panti (Rory O'Neill) on RTÉ, the national broadcaster, led to a debate over a provision in legislation relating to offence being given during a broadcast, as well as to broader issues of homophobia and gay rights in Ireland. Because of the strong public outcry against some of the broadcaster's reactions and the important issues that Panti has raised with regard to homophobia in Ireland, her campaign has gone international. Her Twitter account @Pantibliss gives us the opportunity to gain a strong insight into the underlying social structures and complexities of modern day Ireland.

Purpose
The purpose of this document is to outline the requirements which the ‘Panti on Twitter’ sentiment analysis tool will utilize. The audience of this tool will be politicians, media broadcasters, LGBTQ activists and followers, drag enthusiasts, anti-Panti activists and the general public.

Scope
The scope of the project is to develop a sentiment analysis of a Twitter page called @Pantibliss. The medium of Twitter is used for the following reasons:

Twitter is used by millions of different people as a medium to express their opinions and is thus a valuable source of opinion data.

Twitter enjoys a range of contributors and audiences, where celebrities and regular users can interact; barriers are broken down in this way and real opinions can be expressed through these interactions. This also allows for the analysis of tweets from different social groups which may otherwise be hard to collect data from.

Twitter allows for international contributions; therefore, in the example of Panti, we can see contributions on an ‘Irish’ issue going global, and these tweets add to the tapestry of opinions we seek.

Definitions, Acronyms, and Abbreviations


API – Application Programming Interface

PTT - Python Twitter Tools.

Overview
The process of sentiment analysis for this project is outlined in the diagram below:

[Process flow diagram: Twitter → Crawling → Test Data / Training Data → Data Preparation / Sentiment Training → Fisher Sentiment Analysis / Document Probabilities → Ranked Topics]

The process will begin with an acquisition program applied to Twitter. Specified keywords
will be taken from the data that is retrieved from Twitter. In order to support a derivation
to a conceptual level, the data segments are then fragmented, assuming that every
message will only contain a single concept. Three distinct categories have been chosen as follows.

Tweets will be evenly split into three sets of text types:

Positive: tweets that favour what Panti has said or that react positively to her comments (note these texts may be negative towards those who oppose her)

Negative: tweets that are not in favour of what Panti has said and that react negatively towards her (these tweets may be positive towards other groups)

Neutral: objective tweets or those which do not state an opinion.

General Description
Product perspective
Python

There are numerous Python libraries that can be used to interact with the Twitter API. Two of the most popular are Python Twitter Tools and python-twitter. The Twitter API requires that requests are authenticated. For this project, I will use PTT (Python Twitter Tools). PTT has a command-line tool for getting tweets from followers and posting your own tweets from the safety and security of the Python shell. PTT also allows me to perform actions with Twitter’s code without being on the website, and opens up options that are not readily available to normal users.

Twitter API

Twitter exposes its data via an Application Programming Interface (API). The Twitter API has two different flavours: RESTful and Streaming. The Streaming API works by making a request for a specific type of data, filtered by keyword, user, geographic area, or a random sample, and then keeping the connection open as long as there are no errors in the connection.

The RESTful API is useful for getting things like lists of followers and those who follow a particular user, and is what most Twitter clients are built on. However, one of its main drawbacks is that only tweets from the preceding 5 days can be searched, and queries are limited to approximately 10 per minute at the time of writing (Manjaly, 2013). For this project I am going to focus on the Streaming API.
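
Although this specification envisages Python, the analysis was ultimately carried out in R (see the progress reports in section 11.3). A minimal sketch of authenticating against the Twitter API from R with the twitteR package is given below; the key and secret values are placeholders obtained when registering a Twitter application.

library(twitteR)

# Placeholder credentials taken from the registered Twitter application
setup_twitter_oauth(consumer_key    = "YOUR_CONSUMER_KEY",
                    consumer_secret = "YOUR_CONSUMER_SECRET",
                    access_token    = "YOUR_ACCESS_TOKEN",
                    access_secret   = "YOUR_ACCESS_SECRET")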

Product Functions
A vital aspect of this project is document preparation. This determines how we represent our documents. A full text string representation is not very useful, because it is hard to find similarities between two text strings.

Therefore the ‘Bag of Words’ model is used, which ignores the ordering of words and instead counts the number of occurrences of each word in the document (Manjaly, 2013). Some information may be lost in this representation; however, the bag of words model is still commonly used and performs very strongly. It is computationally simple, and in many applications much of the information required for learning is captured by this representation. According to Bespalov, Bai and Shokoufandeh (2011), the bag of words model is a natural predecessor to the bag of n-grams, which counts groups of consecutive words of size n. This is important as it can eliminate ambiguities that can occur in bag of words models, such as “gay rights” being significantly different to “The gay flower leaves a shadow which falls to the right”. This approach gives us the advantage of increased string length and therefore greater context.
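
As a rough sketch of the two representations, assuming the tm package and a small illustrative character vector docs (both names are examples, not part of the original design):

library(tm)

docs   <- c("gay rights",
            "the gay flower leaves a shadow which falls to the right")
corpus <- Corpus(VectorSource(docs))
dtm    <- DocumentTermMatrix(corpus)   # bag of words: counts only, ordering ignored
inspect(dtm)

# Bigrams (n = 2) keep pairs of consecutive words, preserving some local context
bigrams <- function(txt) {
  w <- unlist(strsplit(tolower(txt), "\\s+"))
  paste(head(w, -1), tail(w, -1))
}
bigrams(docs[2])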

User Characteristics
The intended user will be politicians, media broadcasters, LGBTQ activists and followers,
drag enthusiasts, anti-Panti activists and general public who are interested in the
sentiment of the Twitter population with respect to the opinions formed by @Pantibliss.

Users are not expected to have a very high level of technical expertise.

General Constraints
Personal Data

If a user has not made information public, Twitter does not return that data. Any personal information that is collected from Twitter will not be stored or used in any way.

Twitter Data

The application must comply with the Twitter Developer terms of service. This includes the following:

1. Defining an application privacy policy (what we do with tweets, user data, etc.)
2. Not redistributing Tweets
3. Providing a link to Twitter sign-up if user does not have a registered Twitter
account

Specific Constraints

The following are specific constraints that this product may encounter when dealing with users, as identified by Jose, Bhatia and Krishna (2010).

1. Negative sentences: many people write their tweets with a negation before the adjective or verb, which complicates the data. For example, a sentence such as “Not satisfied with the situation of gay marriage” contains the adjective “satisfied”, which could be assigned a positive polarity if the negation in the sentence is not considered.

2. Confusing polarity: for certain tweets there will be confusion or disagreement about the polarity to be assigned. For instance, “Norton defeats Pantibliss” is negative when taken from Pantibliss’s point of view, while it is positive when Norton is the search query.

3. Dealing with emoticons: our data should contain clean labels, and emoticons are deemed noisy labels. However, emoticons are popular on Twitter, so the data will have to be cleaned to remove them.

4. Casual language: tweets contain very casual language. For example, a user may want to write the word happy as “happpppyyy”, “happpiieee”, “happy” or “hap-e”; besides showing that people are happy, this emphasizes the casual nature of Twitter.

5. Usage of links: users very often include links in their tweets. Thus there is a need to classify this type of tweet by using keywords such as URL. But even then it is sometimes difficult to extract the sentiment, as it may not be given or may be unclear.
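
A rough sketch of the kind of pre-processing that constraints 3 to 5 call for is given below; the regular expressions are illustrative rather than exhaustive, and the negation and polarity issues in constraints 1 and 2 require handling beyond simple text cleaning.

# Illustrative cleaning helper; patterns are examples, not an exhaustive list
clean_tweet <- function(txt) {
  txt <- gsub("http[[:alnum:][:punct:]]*", "", txt)                 # strip links (constraint 5)
  txt <- iconv(txt, to = "ASCII", sub = "")                         # drop non-ASCII emoji (constraint 3)
  txt <- gsub("([[:alpha:]])\\1{2,}", "\\1\\1", txt, perl = TRUE)   # happpppyyy -> happyy (constraint 4)
  tolower(txt)
}

clean_tweet("Sooooo happpppyyy :) http://example.com")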

Assumptions and Dependencies


An assumption is that it is possible to accurately determine the sentiment of a 140-character string of English text.

Specific Requirements
User Interfaces

The programme used for this paper was R studio. R studio was selected as a suitable programme as its interface and programme libraries met the requirements of the brief. Additionally, the user interface of R studio is simple to navigate and easy to understand. R studio is free and open source, and works well on both Windows and Mac hardware. It contains advanced statistical routines as well as a large, coherent, integrated collection of tools for data analysis. R studio also possesses powerful graphics capabilities which greatly aid the visualisation of data and results. Due to these properties, the programme allowed the user to interact with the application with optimal ease.
The interface will include user inputs as well as two graphics, as outlined below.

User Inputs (Mandatory)

The user will be able to control the sentiment analysis of topics in two ways:

1. Edit: this function will let the user edit keywords, by adding, editing, or removing keywords for each topic, and
2. Time: this function will let the user specify the duration of each analysis session.

Graphic 1: Topic Mood Gauge (Mandatory)

This graphic will consist of a simple histogram showing the current mood of the Twitter community on the topic of #ConchitaWurst. It will display the percentage of Twitter users who are currently for or against the topic being analyzed. It will also display the most frequently used words on the subject of #ConchitaWurst through the use of a word cloud.

Error Notifications (Mandatory)

Error notifications will be required within the programme R studio; these will be presented to the user as appropriate messages in red, which will describe the error that has taken place. Where applicable, error messages will suggest possible solutions to the problem.

Hardware Interface

The application will run on a password protected personal Microsoft laptop. No further
hardware devices or interfaces will be required for this analysis.

Software Interfaces

Inputs

The software will receive input from four sources: first, the programme R studio; second, the Twitter API; third, Excel, which will hold the dataset once retrieved from the Twitter API; and fourth, the Twitter app ‘Sentiment140’. The programme R studio will supply the code results and the majority of the graphs for the analysis, while the Twitter API will supply the dataset of tweet text. The Sentiment140 app will supply an additional pie chart, which will add a more visual element to the interpretation of the data.

Outputs

The output will portray the current mood of the Twitter community on #ConchitaWurst in the form of simple charts, a word cloud and histograms.

Functional Requirements
Retrieving Input

The software will receive three inputs: R studio code and R studio libraries, the analysis session duration, and tweets.

● R studio code and R studio libraries will be entered by the user for each topic.

● The analysis session duration will be set by the user before each session.

● Tweets will be retrieved from the Twitter API and saved in an Excel file (csv), as sketched below.
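
A minimal sketch of the retrieval step, assuming authentication has already been completed as shown earlier and using an illustrative file name (the csv file can be opened in Excel):

library(twitteR)

raw_tweets <- searchTwitter("#ConchitaWurst", n = 1500, lang = "en")
tweets_df  <- twListToDF(raw_tweets)                  # list of status objects -> data frame
write.csv(tweets_df["text"], "conchita_tweets.csv", row.names = FALSE)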

Real-Time Processing

The software will take input, process data, and display output in real-time. This will ensure
the data provided by Twitter is a current view of the Twitter community’s mood on
#ConchitaWurst.

Sentiment Analysis

Sentiment analysis will be performed on the keywords within the Tweet to determine the
overall mood of the Tweet relative to the topic. The sentiment analysis will provide a
negative or positive numeric sentiment value.
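
A minimal sketch of such a lexicon-based scoring function, in the spirit of the approach used in the implementation, is shown below; pos.words and neg.words are assumed to be character vectors loaded from positive and negative opinion word lists.

library(plyr)
library(stringr)

# Score = count of positive words minus count of negative words per tweet
score.sentiment <- function(sentences, pos.words, neg.words) {
  laply(sentences, function(sentence) {
    sentence <- tolower(gsub("[[:punct:][:cntrl:][:digit:]]", "", sentence))
    words    <- unlist(str_split(sentence, "\\s+"))
    sum(!is.na(match(words, pos.words))) - sum(!is.na(match(words, neg.words)))
  })
}

# e.g. scores <- score.sentiment(tweets_df$text, pos.words, neg.words)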

Output

The software must output real-time data in the form of simple charts, a word cloud and histograms. In addition, the software may output additional statistics pertaining to the topic #ConchitaWurst. This output will be clear and easy to understand.

Use Cases
This software will serve as a tool of interest, providing users with the current mood of the Twitter community on #ConchitaWurst.

Non-Functional Requirements


Performance

The Twitter API will provide up-to-date information, limited only by the rate of Twitter input.
R studio will provide prompt analysis of the data using the various software packages
available to it. The output should display the latest results at all times, and if it lags behind,
the user should be notified. The application should be capable of operating in the
background should the user wish to utilize other applications.

Reliability

The software will meet all of the functional requirements without any unexpected behavior. At no time should the output display incorrect or outdated information without alerting the user to potential errors; in this instance an error message will be shown.

Availability

The software will be available at all times on the user’s desktop or laptop device, as long as the device is in proper working order. The functionality of the software will depend on external services, such as internet access, that are required; if those services are unavailable, the user should be alerted.

Security

The software should never disclose any personal information of Twitter users, and should
collect no personal information from its own users. The use of passwords and API keys will
ensure private use of the Twitter API. The programmes will be performed on a password
protected laptop and desktop to ensure maximum security.

Maintainability

The software should be written clearly and concisely. The code will be well documented.
Particular care will be taken to design the software modularly to ensure that
maintenance is easy.

Portability

This software will be designed to run on any Android operating system. To ensure the longevity of the software, it will be forward compatible with all currently released Android operating systems.

Design Constraints
The Twitter API has some limitations: it can only return a fixed maximum number of tweets (1,500) per search, and this maximum may not be reached when there are not enough tweets for a particular keyword.

Logical Database Requirements


The tweets taken from Twitter will be stored in an Excel spreadsheet. Excel is an excellent programme for storing large amounts of data, and the data is easy to upload to R studio. The data will have two columns: column one will hold the score of the tweets (pos, neg, very pos and very neg), and column two will store the actual tweet content. Each row will represent an individual tweet.
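
A minimal sketch of reading that spreadsheet (saved as csv) back into R studio and checking the two-column layout; the file and column names are illustrative:

scored <- read.csv("conchita_tweets_scored.csv", stringsAsFactors = FALSE)
str(scored)           # expect two columns, e.g. 'score' and 'text'
table(scored$score)   # distribution over pos / neg / very pos / very neg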

Analysis Models
List all analysis models used in developing specific requirements previously given in this SRS. Each model should include an introduction and a narrative description. Furthermore, each model should be traceable to the SRS’s requirements.

Change Management Process


Identify and describe the process that will be used to update the SRS, as needed, when project scope or requirements change: who can submit changes and by what means, and how these changes will be approved.

11.3 Management Progress Reports

11.3.1 Management Progress Report 1

Progress Management Report to 16 March 2014.

Progress Management Report Purpose


A Progress Management report provides the Project Supervisors with a summary of the
status of a project at agreed stages and is used to monitor progress. The Project
manager uses the Progress Management report to advise the Project Supervisors of any
potential problems or areas where the Project supervisor can help.

Activities during the Period

At this stage the project proposal has been submitted, as well as the requirements specification. The project proposal was due on the 20th of February. The requirements specification was submitted on the 2nd of March. Between the weeks of the two submissions, it was decided that my project would be changed from a sentiment analysis of several books to a sentiment analysis of a tweet feed. Therefore a second project proposal had to be completed. The process of researching the project had begun, as efforts were made to read and research papers that had investigated Twitter sentiment analysis. A lecture on Web APIs was attended during this period, provided by the data mining module. A preliminary attempt at pulling tweets from Twitter failed.

Products Completed during the Period

1. A second project proposal outlining a Twitter sentiment analysis.

2. The requirements specification was also completed and submitted.

Quality Reviews

The lecture on data mining provided some insight into the project and therefore allowed for a better understanding of the research question and the tools needed to perform the analysis. Code was provided by the lecturer, as well as some links to other sites that had step-by-step instructions for the process of sentiment analysis. A first attempt to pull data from the Twitter API was made; however, I was unable to successfully retrieve any tweets and encountered a number of problems using the code.

Issues Arising
The ‘handshake’ in the code outlined by Jeffrey Breen was not performing as it should and was not providing a URL address. This URL provided a code that would be inputted into the R programme; if it was accepted, it allowed the Twitter data to be loaded into R. The issue arose for a number of people in the class. Efforts were made to figure out why this was happening. A week or two later the issue was resolved, as it was discovered that the servers in college did not allow the handshake to occur. All sentiment analysis work would have to be performed on a computer outside of college with its own server.

Variance from Plan

Between the weeks of the two submissions, it was decided that my project would be changed from a sentiment analysis of several books to a sentiment analysis of a tweet feed. Therefore a second project proposal had to be completed. The reason for this variance from plan was that initially I was not sure of the capabilities of data mining, so I took a guess. Once I was more informed about what data mining was and its capabilities, I felt far more comfortable doing a sentiment analysis on Twitter. A number of other people in the class were also performing sentiment analysis on Twitter and it was felt that there would be sufficient support for this subject area.

Planned Work for Next Period (to 13th April 2014)

In the next period I plan to retrieve a dataset from Twitter using an API app. I also plan to
do a bit more research in order to produce a solid research question. I then plan to
research the methodologies I might be able to perform in order to answer my research
question

Product Completions next Period

Management Report number 2 – formal submission

A solid research question

A dataset retrieved from Twitter

11.3.2 Management Progress Report 2

Progress Management Report from 16th March 2014 to 16th April 2014.

Progress Management Report Purpose

A Progress Management report provides the Project Supervisors with a summary of the
status of a project at agreed stages and is used to monitor progress. The Project
manager uses the Progress Management report to advise the Project Supervisors of any
potential problems or areas where the Project supervisor can help.

Activities during the Period

A research question was formulated from extensive reading on the subject of Twitter analysis: the project would focus on sentiment analysis of user-generated Twitter updates using knowledge-based and machine learning techniques.

Further research was performed on the different methodologies I could use to answer the research question.

Products Completed during the Period

Research question

Progress Management report 2

Quality Reviews

Further research was conducted on the subject of sentiment analysis, along with further work on understanding the mathematical equations behind the analysis. Lectures in Advanced Business Data Analysis helped with the understanding of the processes involved. These lectures also provided more experience with R studio, as that was the primary programme we used.

Issues Arising

During this period a number of attempts to retrieve a dataset were made; however, a dataset was still not obtained. Problems continued as I tried to retrieve tweets from Twitter. Finally, a setting in the Twitter API was changed from ‘read only’ to ‘read, write and access direct messages’, and tweets were at last pulled. However, the original keyword of #Pantibliss did not seem to collect enough tweets for analysis. Therefore a change of keyword would be implemented, to #Eurovision2014 and #ConchitaWurst. This should provide the project with enough tweets and a variety of opinions, as #ConchitaWurst is a topical contestant in Eurovision 2014.

Variance from Plan

The original Requirements Specification outlined carrying out the Twitter sentiment analysis in Python; however, after increased experience of R studio in the Advanced Business Data Analysis lectures, and given the applicability of R studio to statistical problems, it became apparent that R studio would be the more suitable package to use for the analysis.

Planned Work for Next Period (to 4th May 2014)

Get a dataset from Twitter API

Perform analysis on the dataset using R studio

Start writing some of the literature background

Product Completions next Period

Progress Management Report 2 – formal submission

Retrieval of dataset

Analysis of data

11.3.3 Management Progress Report 3

Progress Management Report from 15th April 2014 to 4th May 2014.

Progress Management Report Purpose


A Progress Management report provides the Project Supervisors with a summary of the
status of a project at agreed stages and is used to monitor progress. The Project
manager uses the Progress Management report to advise the Project Supervisors of any
potential problems or areas where the Project supervisor can help.

Activities during the Period

In this period a dataset was created using the keywords #ConchitaWurst and #Eurovision2014. A number of data analysis techniques were applied to the data, including the machine learning classifier Naive Bayes.

The literature behind the project has also started to take form and writing of the project
has begun. A template of the paper has also been provided and this allows for a more
structured approach to the writing of the paper.

Products Completed during the Period

Dataset retrieved

Word list has been sourced

Analysis of the dataset

Literature review of papers

Progress management report 3

Quality Reviews

Templates for the progress management reports, as well as a template for the paper, have been provided in the project class. This will provide excellent guidelines for approaching the paper. Lectures have also been given and guidance has been provided on writing style. The project has also been reduced in word count from 10,000 to 7,500 words. The deadline for final submission has been pushed to the end of May, after the exams.

Issues Arising

No issues arising – just time as exams are approaching.

Variance from Plan

Time allocated to spend on the project has had to be deferred as exams are
approaching quickly. The writing of the rest of the paper will have to occur after the
exams and other projects and presentations.

Planned Work for Next Period (to 29th May 2014)

Complete the understanding of Naive Bayes and write up the remaining parts of the project.

Product Completions next Period

Finish project and supply disk with code.

11.4 Other Material Used


Please find attached a CD containing the code, which should be attached to the technical report.

