Exploiting Human-Generated Text for Trend Mining
Vasileios Lampos
July, 2013
V. Lampos, bill@lampos.net
1/47
Outline
Motivation, Aims [Facts, Questions]
Data
Nowcasting Events
Extracting Mood Patterns
Inferring Voting Intention
Conclusions
2/47
Facts
We started to work on those ideas back in 2008, when...
Web contained 1 trillion unique pages (Google)
Social Networks were rising, e.g.
  Facebook: 100m (2008) → >1.11 billion active users (March, 2013)
  Twitter: 6m (2008) → 554m active users (July, 2013)
New technologies to handle Big Data (e.g., Map-Reduce)
User behaviour was changing
  Socialising via the Web
  Giving up privacy (Debatin et al., 2009)
3/47
General questions/aims
Does human-generated text posted on web platforms (or elsewhere) contain useful information?
How can we extract this information...
  ... automatically? That is, not by hand, but by a machine.
Practical / real-life applications?
Can those large samples of human input assist studies in other scientific fields?
  Social Sciences, Psychology, Epidemiology
4/47
The Data (1/3): Why Twitter?
Twitter...
  has a lot of content that is publicly accessible
  provides a well-documented API for several forms of data collection
  contains opinions and personal statements on various domains
  is connected with current affairs (usually in real-time)
  includes geo-located content
  offers the option for personalised, per-user modelling
5/47
The Data (2/3)
What does a @tweet look like?
Figure 1: Some biased and anonymised examples of tweets (limit of 140 characters/tweet, # denotes a topic)
(a) (user will remain anonymous) (b) they live around us
(c) citizen journalism (d) flu attitude
6/47
The Data (3/3)
Data Collection & Preprocessing
The easiest part of the process...
  not true! Storage space, crawler implementation, parallel data processing, adapting to new technologies
Data collected via Twitter's Search API:
  collective sampling
  tweets geo-located in 54 urban centres in the UK
  periodical crawling (every 3 or 5 minutes per urban centre)
Data collected via Twitter's REST API:
  user-centric sampling
  preprocessing to approximate users' location (city & country)
  ... or manual user selection from domain experts
  get their latest tweets (3,000 or more)
Several forms of ground truth (flu/rainfall rates, polls)
7/47
Nowcasting Events from the Social Web
8/47
Nowcasting?
We do not predict the future, but infer the present
i.e. the very recent past
Figure 2: Nowcasting the magnitude of an event emerging in the real world from Web information (diagram labels: State of the World, M, W, S)
Our case studies: nowcasting (a) flu rates & (b) rainfall rates (?!)
9/47
What do we get in the end?
This is a regression problem
i.e. for a time interval $i$ we aim to infer $y_i \in \mathbb{R}$ using text input $\boldsymbol{x}_i \in \mathbb{R}^n$
[Plot: rainfall rate (mm) in Bristol over 30 days, actual vs. inferred]
Figure 3: Inferred rainfall rates for Bristol, UK (October, 2009)
10/47
Methodology (1/5): Text in Vector Space
Candidate features (n-grams): $C = \{c_i\}$
Set of Twitter posts for a time interval $u$: $P^{(u)} = \{p_j\}$
Frequency of $c_i$ in $p_j$:
$$g(c_i, p_j) = \begin{cases} \varphi & \text{if } c_i \in p_j, \\ 0 & \text{otherwise.} \end{cases}$$
$g$ is Boolean: the maximum value of $\varphi$ is 1
Score of $c_i$ in $P^{(u)}$:
$$s\left(c_i, P^{(u)}\right) = \frac{\sum_{j=1}^{|P^{(u)}|} g(c_i, p_j)}{|P^{(u)}|}$$
11/47
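The scoring step above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: whitespace tokenisation, lower-casing and the helper names `ngrams`, `g` and `score` are assumptions made here, and no stemming is applied.

```python
def ngrams(tokens: list[str], n: int) -> set[str]:
    """Return the set of n-grams (tokens joined by spaces) in a token list."""
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def g(candidate: str, post: str) -> int:
    """Boolean frequency: 1 if the candidate n-gram occurs in the post, else 0."""
    tokens = post.lower().split()  # assumption: simple whitespace tokenisation
    n = len(candidate.split())
    return 1 if candidate in ngrams(tokens, n) else 0


def score(candidate: str, posts: list[str]) -> float:
    """s(c, P): fraction of the posts in a time interval that contain c."""
    if not posts:
        return 0.0
    return sum(g(candidate, p) for p in posts) / len(posts)


# Hypothetical posts for one time interval u
posts_u = [
    "i have the flu",
    "swine flu is everywhere",
    "sunny day in bristol",
    "flu flu flu",
]
candidates = ["flu", "swine flu", "rain"]
x_u = [score(c, posts_u) for c in candidates]
print(x_u)  # [0.75, 0.25, 0.0]
```

Because `g` is Boolean, the last post counts once for "flu" even though the word appears three times.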
Methodology (2/5)
Set of time intervals: $U = \{u_k\}$ (1 hour, 1 day, ...)
Time series of candidate features scores:
$$X^{(U)} = \left[\boldsymbol{x}^{(u_1)} \;\dots\; \boldsymbol{x}^{(u_{|U|})}\right]^T,$$
where
$$\boldsymbol{x}^{(u_i)} = \left[s\left(c_1, P^{(u_i)}\right) \;\dots\; s\left(c_{|C|}, P^{(u_i)}\right)\right]^T$$
Target variable (event):
$$\boldsymbol{y}^{(U)} = \left[y_1 \;\dots\; y_{|U|}\right]^T$$
12/47
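Methodology (3/5): Feature selection
Solve the following optimisation problem:
$$\min_{\boldsymbol{w}} \left\| X^{(U)} \boldsymbol{w} - \boldsymbol{y}^{(U)} \right\|_2^2 \quad \text{s.t. } \|\boldsymbol{w}\|_1 \le t, \quad t = \alpha \cdot \|\boldsymbol{w}_{\mathrm{OLS}}\|_1, \; \alpha \in (0, 1].$$
Least Absolute Shrinkage and Selection Operator (LASSO):
$$\underset{\boldsymbol{w}}{\operatorname{argmin}} \left\| X^{(U)} \boldsymbol{w} - \boldsymbol{y}^{(U)} \right\|_2^2 + \lambda \|\boldsymbol{w}\|_1$$
(Tibshirani, 1996)
Expect a sparse $\boldsymbol{w}$ (feature selection)
Least Angle Regression (LARS) computes entire regularisation path ($\boldsymbol{w}$'s for different values of $\lambda$) (Efron et al., 2004)
13/47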
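To make the sparsity effect concrete, here is a minimal sketch of the penalised LASSO objective solved by cyclic coordinate descent with soft-thresholding. Coordinate descent is swapped in for brevity; the slides use LARS. The function names and the toy data are hypothetical.

```python
import math


def soft_threshold(a: float, t: float) -> float:
    """Shrink a towards zero by t; values within [-t, t] become 0."""
    return math.copysign(max(abs(a) - t, 0.0), a)


def lasso(X, y, lam, n_iter=200):
    """Minimise ||Xw - y||_2^2 + lam * ||w||_1 by cyclic coordinate descent."""
    m, n = len(X), len(X[0])
    w = [0.0] * n
    for _ in range(n_iter):
        for j in range(n):
            rho = 0.0  # correlation of feature j with the partial residual
            z = 0.0    # squared norm of feature j
            for i in range(m):
                pred_without_j = sum(X[i][k] * w[k] for k in range(n) if k != j)
                rho += X[i][j] * (y[i] - pred_without_j)
                z += X[i][j] ** 2
            w[j] = soft_threshold(rho, lam / 2.0) / z if z > 0 else 0.0
    return w


# Toy data: feature 0 carries the signal, feature 1 is almost irrelevant
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
y = [2.0, 0.1, 2.0, 0.1]
print(lasso(X, y, lam=0.0))  # ordinary least squares: both weights nonzero
print(lasso(X, y, lam=1.0))  # the weak feature is driven to zero
```

With `lam=1.0` the weight of the weak feature is zeroed while the strong one is only shrunk, which is the feature selection behaviour the slide refers to.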
Methodology (4/5)
LASSO is model-inconsistent:
  inferred sparsity pattern may deviate from the true model, e.g., when predictors are highly correlated (Zhao and Yu, 2006)
Bootstrap* LASSO (Bolasso) performs a more robust feature selection (Bach, 2008)
  *: in each bootstrap, input space is sampled with replacement
  apply LASSO (LARS) to select features
  select features with nonzero weights in all bootstraps
A better alternative is soft-Bolasso:
  a less strict feature selection
  select features with nonzero weights in p% of bootstraps (learn p using a separate validation set)
Weights of selected features are determined via OLS regression
14/47
Methodology (5/5): Simplified summary
Observations: $X \in \mathbb{R}^{m \times n}$ ($m$ time intervals, $n$ features)
Response variable: $\boldsymbol{y} \in \mathbb{R}^m$
For i = 1 to number of bootstraps
  Form $X_i \subset X$ by sampling $X$ with replacement
  Solve LASSO for $X_i$ and $\boldsymbol{y}$, i.e. learn $\boldsymbol{w}_i \in \mathbb{R}^n$
  Get the $k \le n$ features with nonzero weights
End_For
Select the $v \le n$ features with nonzero weight in p% of the bootstraps
Learn their weights with OLS regression on $X^{(v)} \in \mathbb{R}^{m \times v}$ and $\boldsymbol{y}$
15/47
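The selection loop of this summary can be sketched as follows. It is an illustration under stated assumptions rather than the authors' implementation: the LASSO step uses a small coordinate descent solver instead of LARS, rows of $X$ are resampled together with the matching entries of $\boldsymbol{y}$, the threshold `p` is fixed instead of learnt on a validation set, and the final OLS refit is left out. All names are hypothetical.

```python
import math
import random


def lasso(X, y, lam, n_iter=100):
    """Coordinate descent for ||Xw - y||_2^2 + lam * ||w||_1 on plain lists."""
    m, n = len(X), len(X[0])
    w = [0.0] * n
    for _ in range(n_iter):
        for j in range(n):
            rho = sum(
                X[i][j] * (y[i] - sum(X[i][k] * w[k] for k in range(n) if k != j))
                for i in range(m)
            )
            z = sum(X[i][j] ** 2 for i in range(m))
            shrunk = math.copysign(max(abs(rho) - lam / 2.0, 0.0), rho)
            w[j] = shrunk / z if z > 0 else 0.0
    return w


def soft_bolasso_select(X, y, lam, n_boot=20, p=0.8, seed=0):
    """Indices of features with nonzero LASSO weight in at least p of the bootstraps."""
    rng = random.Random(seed)
    m, n = len(X), len(X[0])
    counts = [0] * n
    for _ in range(n_boot):
        idx = [rng.randrange(m) for _ in range(m)]  # sample rows with replacement
        Xb = [X[i] for i in idx]
        yb = [y[i] for i in idx]
        w = lasso(Xb, yb, lam)
        for j in range(n):
            if w[j] != 0.0:
                counts[j] += 1
    return [j for j in range(n) if counts[j] >= p * n_boot]


# Toy data: y depends only on feature 0; features 1 and 2 are distractors
X = [[float(i), (-1.0) ** i, float(i % 3 == 0)] for i in range(1, 13)]
y = [3.0 * row[0] for row in X]
selected = soft_bolasso_select(X, y, lam=1.0)
print(selected)
```

Setting `p=1.0` recovers the stricter Bolasso rule (nonzero in all bootstraps).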
How do we form candidate features?
Commonly formed by indexing the entire corpus (Manning, Raghavan and Schütze, 2008)
We extract them from Wikipedia, Google Search results, Public Authority websites (e.g., NHS)
Why?
  reduce dimensionality to bound the error of LASSO
  $$\mathcal{L}(\boldsymbol{w}) \le \mathcal{L}(\hat{\boldsymbol{w}}) + Q, \quad \text{with } Q \sim \min\left\{\frac{W_1^2}{N} + \frac{p}{N},\; \frac{W_1^2}{N} + \frac{W_1}{\sqrt{N}}\right\}$$
  $p$ candidate features, $N$ samples, empirical loss $\mathcal{L}(\hat{\boldsymbol{w}})$ and $\|\boldsymbol{w}\|_1 \le W_1$ (Bartlett, Mendelson and Neeman, 2011)
  Harry Potter Effect!
16/47
The Harry Potter effect (1/2)
Figure 4: Events co-occurring (correlated) with the inference target may affect feature selection, especially when the sample size is small.
[Plot: event score against day number (2009, days 180 to 340) for Flu (England & Wales), Hypothetical Event I and Hypothetical Event II]
(Lampos, 2012a)
17/47
The Harry Potter effect (2/2)
Table 1: Top 1-grams correlated with flu rates in England/Wales (06-12/2009)
1-gram       Event                 Corr. Coef.
latitud      Latitude Festival     0.9367
flu          Flu epidemic          0.9344
swine        Flu epidemic          0.9212
harri        Harry Potter Movie    0.9112
slytherin    Harry Potter Movie    0.9094
potter       Harry Potter Movie    0.8972
benicassim   Benicàssim Festival   0.8966
graduat      Graduation (?)        0.8965
dumbledor    Harry Potter Movie    0.8870
hogwart      Harry Potter Movie    0.8852
quarantin    Flu epidemic          0.8822
gryffindor   Harry Potter Movie    0.8813
ravenclaw    Harry Potter Movie    0.8738
princ        Harry Potter Movie    0.8635
swineflu     Flu epidemic          0.8633
ginni        Harry Potter Movie    0.8620
weaslei      Harry Potter Movie    0.8581
hermion      Harry Potter Movie    0.8540
draco        Harry Potter Movie    0.8533
Solution: ground truth with some degree of variability
(Lampos, 2012a)
18/47
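The coefficients in Table 1 are correlations between a term's frequency series and the flu rate series. A minimal sketch of that computation, assuming the Pearson coefficient and using made-up series, shows how an unrelated term that happens to peak at the same time scores almost as high as a relevant one:

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


# Hypothetical weekly series: a single epidemic peak
flu_rate = [5.0, 20.0, 80.0, 150.0, 90.0, 30.0, 10.0]
# Hypothetical term frequencies: a relevant term and a film-release term
freq_flu = [0.1, 0.5, 1.9, 3.2, 2.0, 0.8, 0.2]
freq_potter = [0.0, 0.3, 1.5, 3.5, 1.8, 0.4, 0.1]

print(round(pearson(flu_rate, freq_flu), 3))
print(round(pearson(flu_rate, freq_potter), 3))
```

With a single peak in the ground truth, both terms correlate strongly, which is why the slide suggests ground truth with more variability.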
About n-grams
1-grams
  decent (dense) representation in the Twitter corpus
  unclear semantic interpretation
  Example: "I am not sick. But I don't feel great either!"
2-grams
  very sparse representation in tweets
  sometimes clearer semantic interpretation
Experimental process indicated that a hybrid combination of 1-grams and 2-grams delivers the best inference performance
  refer to (Lampos, 2012a)
19/47
Flu rates: Example of selected features
Figure 5: Font size is proportional to the weight of each feature; flipped n-grams are negatively weighted. All words are stemmed (Porter, 1980).
(Lampos and Cristianini, 2012)
20/47
Rainfall rates: Example of selected features
Figure 6: Font size is proportional to the weight of each feature; flipped n-grams are negatively weighted. All words are stemmed (Porter, 1980).
21/47
Examples of inferences
[Plots: actual vs. inferred values over 30 days; flu rate for panels (a) and (b), rainfall rate (mm) for panel (c)]
(a) Central England/Wales (flu)
(b) South England (flu)
(c) Bristol (rain)
Figure 7: Examples of flu and rainfall rates inferences from Twitter content
22/47
Performance figures
Table 2: RMSE for flu rates inference (5-fold cross validation), 50m tweets, 21/06/2009 to 19/04/2010
Method            1-grams         2-grams         Hybrid
Baseline*         12.44 ± 2.37    13.81 ± 3.29    11.62 ± 1.58
Bolasso           11.14 ± 2.35    12.64 ± 2.57    10.57 ± 2.2
CART ensemble**   9.63 ± 5.21     13.13 ± 4.72    9.4 ± 4.21
Table 3: RMSE (in mm) for rainfall rates inference (6-fold cross validation), 8.5m tweets, 01/07/2009 to 30/06/2010
Method            1-grams         2-grams         Hybrid
Baseline*         2.91 ± 0.6      3.1 ± 0.57      4.39 ± 2.99
Bolasso           2.73 ± 0.65     2.95 ± 0.55     2.60 ± 0.68
CART ensemble**   2.71 ± 0.69     2.72 ± 0.72     2.64 ± 0.63
*: As implemented in (Ginsberg et al., 2009)
**: Classification and Regression Tree (Breiman et al., 1984) & (Sutton, 2005)
23/47
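Each table cell reports an RMSE averaged over cross-validation folds, followed by its spread. A minimal sketch of that summary is given below; the fold data are invented, and the use of the sample standard deviation is an assumption, since the slides do not say which estimator was used.

```python
import math
from statistics import mean, stdev


def rmse(actual, inferred):
    """Root mean squared error between two equal-length series."""
    squared = [(a - b) ** 2 for a, b in zip(actual, inferred)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical (actual, inferred) pairs, one per cross-validation fold
folds = [
    ([10.0, 12.0, 15.0], [11.0, 12.0, 13.0]),
    ([8.0, 9.0, 20.0], [8.0, 11.0, 18.0]),
    ([30.0, 25.0, 22.0], [28.0, 26.0, 22.0]),
]
fold_rmse = [rmse(actual, inferred) for actual, inferred in folds]
print(f"{mean(fold_rmse):.2f} ± {stdev(fold_rmse):.2f}")
```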
Flu Detector
URL: http://geopatterns.enm.bris.ac.uk/epidemics
Figure 8: Flu Detector uses the content of Twitter to nowcast flu rates in several UK regions
(Lampos, De Bie and Cristianini, 2010)
24/47
Extracting Mood Patterns from Human-Generated Content
25/47
Computing a mood score
Table 4: Mood terms from WordNet Affect
Fear         Sadness        Joy            Anger
afraid       depressed      admire         angry
fearful      discouraged    cheerful       despise
frighten     disheartened   enjoy          enviously
horrible     dysphoria      enthusiastic   harassed
panic        gloomy         exciting       irritate
...          ...            ...            ...
(92 terms)   (115 terms)    (224 terms)    (146 terms)
Mood score computation for a time interval $d$ using $n$ mood terms:
$$ms_d = \frac{1}{n} \sum_{i=1}^{n} \frac{c_i^{(t_d)}}{N(t_d)}$$
$c_i^{(t_d)}$: count of term $i$ in the Twitter corpus of day $d$
$N(t_d)$: number of tweets for day $d$
Using the sample of $d$ days, compute a standardised mood score:
$$ms_d^{\mathrm{std}} = \frac{ms_d - \mu_{ms}}{\sigma_{ms}}$$
26/47
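A minimal sketch of the two formulas above, with invented counts. The function names are hypothetical, and the population standard deviation is an assumption; the slide does not specify the estimator.

```python
from statistics import mean, pstdev


def mood_score(term_counts, n_tweets):
    """ms_d: average over mood terms of (term count / number of tweets that day)."""
    n = len(term_counts)
    return sum(c / n_tweets for c in term_counts) / n


def standardise(scores):
    """Subtract the mean of the daily scores and divide by their standard deviation."""
    mu, sigma = mean(scores), pstdev(scores)
    return [(s - mu) / sigma for s in scores]


# Hypothetical data: per-day counts of three mood terms and the daily tweet volume
days = [
    ([2, 0, 2], 100),
    ([4, 2, 0], 100),
    ([8, 5, 5], 200),
]
daily = [mood_score(counts, volume) for counts, volume in days]
print(daily)
print(standardise(daily))
```

Dividing by the daily tweet volume keeps the score comparable across days with different amounts of traffic.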
The mood of the nation (1/5)
Figure 9: Daily time series (actual & their 14-point moving average) for the mood
of Joy based on Twitter content geo-located in the UK
(Excerpt from an embedded magazine page, p. 27, August 2012)

We turned our attention to the issue of public mood, or sentiment. Our goal was to analyse the sentiment expressed in the collective discourse that constantly streams through Twitter. Or, as we called it, "the mood of the nation".

We used tweets sampled from the 54 largest cities in the UK over a period of 30 months. There were more than 9 million different users, and 484 million tweets. It is important to notice that studies of this kind rely on very efficient methods of data management and text mining, which we have been refining for years, during our studies of news content [5], as well as social media content. Our infrastructure is based on a central database, and multiple independent modules that can annotate the data [6].

Notice also that the period we analysed goes from July 2009 to January 2012, a period marked by economic downturn and some social tensions. This will become relevant when analysing our findings.

There are standard methods in text analysis to detect sentiment: they are used mostly in marketing research, when analysts want to know the opinion of users of a certain camera, or viewers of a certain TV show. Each of the basic emotions (fear, joy, anger, sadness) is associated with a list of words, generated by a combination of manual and automatic methods, and successively benchmarked on a test set. This is called citation-sentiment analysis.

We did not want to develop a new method for sentiment analysis, so we directly applied a standard one to the textual stream generated by UK Twitter users. We sampled the tweet-stream every 3 to 5 minutes, specifying location to within 10 km of an urban centre. Our word-list contained 146 anger words, 92 fear words, 224 joy words and 115 sadness words. They can be found at the WordNet-Affect website (http://wordnet.princeton.edu) [7].

In the flu project we had a "ground truth", of independently-measured flu cases. This time around we did not, as no one seems to be constantly measuring sentiment in the general population. This means that the methods and the conclusions will be of a different nature. Whereas in the flu project the list of keywords (whose frequency is used to compute the flu score) is discovered by our algorithm, with the goal of maximising correlation with the ground truth, in the mood project we had to feed the key words in ourselves (we got them from citation-sentiment analysis as mentioned above) and we have no ground truth to compare the result with.

By applying these tools to a time series of about 3 years of Twitter content we found that each of the four key emotions changes over time, in a manner that is partly predictable (or at least interpretable). We were reassured to find there was a periodic peak of joy around Christmas (Figure 2), surely due to greetings messages, and a periodic peak of fear around Halloween, again probably due to increased usage of certain keywords such as "scary". These were sanity checks, which showed us that word-counting methods can provide a reasonable approach to sentiment or mood analysis. How far Christmas greetings accurately represent real joy, as opposed to duty and wishful thinking, is of course another question. We do not expect that a high frequency of the word "happy" necessarily signifies happier mood in the population. Our measures of mood are not perfect, but these effects could be filtered away by a more sophisticated tool designed to ignore conventional expressions such as "Happy New Year". It is, however, a remarkable observation that certain days have reliably similar values in different years. This suggests that we have reduced statistical errors to a very low level.

But what came out most strongly is the strong transition, towards a more negative mood, that started in the week of October 20th, 2010. This was the week in which the UK government announced massive cuts in public spending. It was a clear change point that we could validate by a statistical test. It was, if you like, the moment that people realised that austerity was not just for others; it would be affecting their own lives too. The effects of that major shift in collective mood are still felt today.

We also found a sustained growth in anger (Figure 4) in the weeks leading up to the summer riots of August 2011, when parts of London and several other cities across England suffered widespread violence, looting and arson.

It is interesting that the growth in anger seems to have started before the riots themselves, but this does not mean that we could
Figure 1. A word cloud automatically generated from Twitter traffic. The larger the word, the greater the correlation with flu epidemics. Upside-down words have negative correlations
Figure 2. Plot of the time series representing levels of joy estimator over a period of 2 years. Notice the peaks corresponding to Christmas and New Year, Valentine's day and the Royal Wedding
[Plot: 933 Day Time Series for Joy in Twitter Content; normalised emotional valence by date, Jul 09 to Jan 12; raw joy signal and 14-day smoothed joy; annotated events: RIOTS, CUTS, XMAS, roy.wed., halloween, valentine, easter]
(Lansdall-Welfare, Lampos and Cristianini, 2012a&b)
27/47
Figure 10: Daily time series (actual & their 14-point moving average) for the
mood of Anger based on Twitter content geo-located in the UK
(Excerpt from an embedded magazine page, p. 28, August 2012)

have predicted them. Discovering an interesting correlation after the fact can be of great help to social scientists and other scholars, when interpreting those events, but is very different from predicting the events. There have been other increases in anger before, without this leading to any riots. As there is no official record of public mood, we need to be contented with finding correlations between trends in the time series of each emotion and events in the external world. We can find peaks of emotion for the death of Amy Winehouse, and of Osama Bin Laden; during the run-up to the Royal Wedding in April 2011 people felt calmer.

After the collection and the analysis part, we considered how to best visualise our results. With big data this is always a consideration. The data sets are so large, and the possible interactions they represent can be so complex, that graphic displays are becoming the norm. We are dealing with emotions; and we found an open source tool that represents emotions by a cartoon of a face whose expression depends on degrees of anger, joy, surprise, fear, sadness and disgust. It is called the grimace project (http://grimace-project.net), and we used it in conjunction with timelines. The end result can be used by the public as well as by researchers. Figure 3 is taken from our mood browser tool, which is live and interactive at http://mediapatterns.enm.bris.ac.uk/mood/. If you visit the site and drag the cursor along the time-line to October 2010, you will easily identify the week of the spending cuts: you will see the face suddenly wince.

There are some important considerations to make and lessons to learn, from the point of view of data analysis. The first is that the social sciences can now enter a data-driven phase, but this will require vast amounts of non-traditional data. The exploitation of big data will require the use of multiple tools, from different fields. Data management, data mining, text mining and data visualisation all seem to be as necessary as the statistical analysis part.

The second consideration is a caveat: since we did not choose the parameters of the mood system so as to correlate our score to the same score for the general UK population, we cannot claim that our mood scores were calibrated to compensate for the various and obvious biases we have in the data collection (unlike in the flu study). So all that we can claim at best is that we have measured the mood of city-dwelling Twitter users. They tend to be young; they tend to be savvy and techno-literate; they are definitely a biased sample of the UK population, although a large one, since we included posts by more than 9 million individual users.

Finally, there is the obvious caveat that goes with every statistical study: correlations, as we all know, are not causations. Even if there was an increase in anger and fear after the spending cuts were announced, how do we know that this was due to the announcement? Many other factors could have caused it. This is where data analysis must stop, and the interpretation of social scientists must begin. But at least we have collected and digested 484 million tweets for them, so that they can focus on the relevant questions. Big data can change the way social science is performed, but will not replace statistical common sense.
References
1. Weingrill, T., Gray, D.A., Barrett, L. and Henzi, S.P. (2004) Fecal cortisol levels in free-ranging female chacma baboons: relationship to dominance, reproductive state and environmental factors. Hormones and Behavior, 45(4), 259-269.
2. Giannone, D., Reichlin, L. and Small, D. (2008) Nowcasting: The real-time informational content of macroeconomic data. Journal of Monetary Economics, 55(4), 665-676.
3. Lampos, V. and Cristianini, N. (2011) Nowcasting events from the Social Web with statistical learning. ACM Transactions on Intelligent Systems and Technology, 3(4).
4. Ginsberg, J., Mohebbi, M.H., Patel, R.S., Brammer, L., Smolinski, M.S. and Brilliant, L. (2009) Detecting influenza epidemics using search engine query data. Nature, 457(7232), 1012-1014.
5. http://mediapatterns.enm.bris.ac.uk
6. http://www.tijldebie.net/
V\VWHPOHV6,*02'BBGHPRB,OLDVSGI
7. Strapparava, C. and Valitutti, A. (2004) WordNet-Affect: an affective extension of WordNet. In Proceedings of the 4th International Conference on Language Resources and Evaluation (LREC 2004), Lisbon, May, pp. 1083-1086.
Thomas Lansdall-Welfare, Vasileios Lampos and Nello Cristianini are at the Intelligent Systems Laboratory at the University of Bristol.
Figure 3. Visualisation of overall mood levels for the UK over 2 years using timeline plots and the Grimace tool for facial expressions. The facial expression refers to October 27th, 2010. Visit mediapatterns.enm.bris.ac.uk/mood
Figure 4. Plot of the time series for anger estimator over 2 and a half years. Notice visible change points corresponding to spending cuts and riots
[Plot: 933 Day Time Series for Anger in Twitter Content; normalised emotional valence by date; raw anger signal and 14-day smoothed anger; annotated events: RIOTS, CUTS, XMAS, roy.wed., halloween, valentine, easter]
(Lansdall-Welfare, Lampos and Cristianini, 2012a&b)
28/47
Window of 100 days: 50 before & after the point of interest
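$$\Delta ms^{\mathrm{std}}_i = \operatorname{mean}\left(ms^{\mathrm{std}}_{i+1}, \dots, ms^{\mathrm{std}}_{i+50}\right) - \operatorname{mean}\left(ms^{\mathrm{std}}_{i-50}, \dots, ms^{\mathrm{std}}_{i-1}\right)$$
[Plot: Rate of Mood Change by Day using the Difference in 50-day Mean; difference in mean by date for Anger and Fear, with the Date of Budget Cuts and the Date of Riots marked]
Figure 11: Change point detection using a 100-day moving window
(Lansdall-Welfare, Lampos and Cristianini, 2012a)
29/47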
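The windowed difference in means can be sketched as below. The function name and the synthetic step series are hypothetical, and any significance testing of the detected change point is outside this sketch.

```python
def mean_shift(series, i, half_window=50):
    """Mean of the half_window values after day i minus the mean of those before it."""
    after = series[i + 1 : i + 1 + half_window]
    before = series[i - half_window : i]
    return sum(after) / len(after) - sum(before) / len(before)


# Synthetic standardised mood series with a step change after day 59
series = [0.0] * 60 + [1.0] * 60

# Only days with a full window on both sides are scored
valid_days = range(50, len(series) - 50)
shifts = {i: mean_shift(series, i) for i in valid_days}
print(max(shifts.values()))  # 1.0, reached next to the step
```

Days close to the start or end of the series are skipped because they lack a full 50-day window on one side.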
Figure 12: Projections of 4-dimensional mood score signals (joy, sadness, anger and fear) on their top-2 principal components (PCA); Twitter content from 2011
[Scatter plot: days of the week (Saturday to Friday) on the 1st and 2nd principal components]
(a) Days of the week (2011)
[Scatter plot: days 1 to 365 of 2011 on the 1st and 2nd principal components]
(b) Days of the year (2011)
Cluster I: New Year (1), Valentine's (45), Christmas Eve (358), New Year's Eve (365)
Cluster II: O.B. Laden's death (122), Winehouse's death + Breivik (204), UK riots (221)
(Lampos, 2012a)
30/47
URL: http://geopatterns.enm.bris.ac.uk/mood
Figure 13: Mood of the Nation uses the content of Twitter to nowcast mood
rates in several UK regions
(Lampos, 2012a)
31/47
Circadian mood patterns (1/3)
Compute 24-h mood score patterns
Mood score computation for a time interval $u$ = 24 hours using $n$ mood terms (WordNet) and a sample of $D$ days:
$$M_s(u) = \frac{1}{|D|} \sum_{j=1}^{|D|} \left( \frac{1}{n} \sum_{i=1}^{n} sf_i^{(t_{j,u})} \right)$$
$$sf_i^{(t_{d,u})} = \frac{f_i^{(t_{d,u})} - \bar{f}_i}{\sigma_{f_i}}, \quad i \in \{1, ..., n\}.$$
$f_i^{(t_{d,u})}$: normalised frequency of a mood term $i$ during time interval $u$ in day $d \in D$
32/47
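A minimal sketch of the circadian score, assuming the input is a nested list `f[d][u][i]` of normalised term frequencies and that the mean and standard deviation of each term are pooled over all days and intervals (the slide does not state the pooling). Names are hypothetical.

```python
from statistics import mean, pstdev


def circadian_scores(f):
    """f[d][u][i]: normalised frequency of mood term i in interval u of day d.

    Returns one score M_s(u) per interval u.
    """
    n_days, n_intervals, n_terms = len(f), len(f[0]), len(f[0][0])
    mu, sigma = [], []
    for i in range(n_terms):
        # Assumption: statistics of term i pooled over every (day, interval) pair
        values = [f[d][u][i] for d in range(n_days) for u in range(n_intervals)]
        mu.append(mean(values))
        sigma.append(pstdev(values))
    scores = []
    for u in range(n_intervals):
        per_day = [
            mean([(f[d][u][i] - mu[i]) / sigma[i] for i in range(n_terms)])
            for d in range(n_days)
        ]
        scores.append(mean(per_day))
    return scores


# Toy input: 2 days, 2 intervals, 1 mood term that is always higher in interval 1
f = [[[1.0], [3.0]], [[1.0], [3.0]]]
print(circadian_scores(f))
```

Standardising each term before averaging stops very frequent terms from dominating the pattern.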
[Plots: Fear, Sadness, Joy and Anger scores over hourly intervals (3 to 24); one column for Winter vs. Summer, one for Aggregated Data]
Figure 14: Circadian (24-hour) mood patterns based on UK Twitter content
33/47
Figure 15: Autocorrelation of circadian mood patterns based on hourly lags
revealing daily and weekly periodicities
[Plots: autocorrelation against lag in hours (1 to 168), with confidence bounds]
(a) Fear
(b) Sadness
(c) Joy
(d) Anger
Further analysis available in (Lampos, Lansdall-Welfare, Araya and Cristianini, 2013)
34/47
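The daily periodicity in Figure 15 can be reproduced on a synthetic signal. The sketch below computes the sample autocorrelation at a given hourly lag; the sine-wave input and the function name are assumptions for illustration, and the confidence bounds shown in the figure are not computed here.

```python
import math


def autocorr(x, lag):
    """Sample autocorrelation of series x at the given non-negative lag."""
    n = len(x)
    mu = sum(x) / n
    denom = sum((v - mu) ** 2 for v in x)
    num = sum((x[t] - mu) * (x[t + lag] - mu) for t in range(n - lag))
    return num / denom


# Synthetic hourly mood signal with a 24-hour cycle, two weeks long
hours = 24 * 14
signal = [math.sin(2 * math.pi * t / 24) for t in range(hours)]

for lag in (12, 24, 168):
    print(lag, round(autocorr(signal, lag), 3))
```

A peak at lag 24 indicates a daily cycle and one at lag 168 a weekly cycle; here the lag-12 value is strongly negative because the signal is half a period out of phase.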
Emotion in Books
Input: Google Ngram corpus of 5m digitised books (Michel et al., 2010)
Tool: WordNet Affect (Strapparava and Valitutti, 2004)
[Plots: z-scores by year, 1900 to 2000]
(a) Joy minus Sadness
(b) Use of emotion-related terms through time (Emotion minus Random; series: All, Fear, Disgust)
(c) American versus British English (American minus British)
Figure 16: Emotion trends in 20th century books
(Acerbi, Lampos, Garnett and Bentley, 2013)
35/47
Inferring Voting Intention from Social Media Content
... and a new way for modelling text regression
36/47
Motivations and Aims
Social Media contain a vast amount of information about various topics (health, politics, finance)
This information ($X$) can be used to assist predictions ($y$)
  $f : X \to y$, $f$ usually formulates a linear regression task
$X$ accounts only for word frequencies; can we incorporate user information as well?
Could we also exploit the statistical information held in multiple response variables?
37/47
Data Sets
UK case study
  60m tweets by 42K users from 30/04/2010 to 13/02/2012
  Random selection and distribution of geo-located users proportional to regional population figures
  Main language: English
  240 unique voting intention polls from YouGov
  percentages for Conservatives (CON), Labour Party (LAB) and Liberal Democrats (LIB)
Austrian case study
  800K tweets by 1.1K users from 25/01 to 01/12/2012
  Users manually selected by Austrian political analysts
  Main language: German
  98 unique voting intention polls from various pollsters
  percentages for Social Democratic Party (SPÖ), People's Party (ÖVP), Freedom Party (FPÖ) and Green Alternative Party (GRÜ)
38/47
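The Bilinear Model (1/2)
The main idea is simple:
$$f(X) = \boldsymbol{u}^T X \boldsymbol{w} + \beta$$
$X \in \mathbb{R}^{p \times m}$: matrix of user-word frequencies ($p$ users, $m$ words)
$\boldsymbol{u}$, $\boldsymbol{w}$: user and word weights
Our original bilinear text regression model:
$$\{\boldsymbol{w}^\star, \boldsymbol{u}^\star\} = \underset{\boldsymbol{w}, \boldsymbol{u}, \beta}{\operatorname{argmin}} \sum_{i=1}^{n} \left(\boldsymbol{u}^T Q_i \boldsymbol{w} + \beta - y_i\right)^2 + \psi(\boldsymbol{w}, \rho_1) + \psi(\boldsymbol{u}, \rho_2)$$
$Q_i$: $X$ for time instance $i$, $\boldsymbol{y} \in \mathbb{R}^n$: response variable (voting intention)
$\boldsymbol{w} \in \mathbb{R}^m$, $\boldsymbol{u} \in \mathbb{R}^p$: word and user weights, $\beta \in \mathbb{R}$: bias
$\psi(\cdot)$: a regularisation function
Elastic Net (Zou and Hastie, 2005) for $\psi(\cdot)$
Bilinear Elastic Net (BEN) (Lampos, Preoţiuc-Pietro and Cohn, 2013)
39/47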
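A small numeric sketch of the bilinear prediction may help. It evaluates $\boldsymbol{u}^T Q \boldsymbol{w} + \beta$ for a toy user-by-word matrix and shows that, once the user weights are held fixed, the model is an ordinary linear model in the word weights. The data and function names are hypothetical; the slide does not describe the solver, so no training loop is shown.

```python
def bilinear_predict(u, Q, w, beta):
    """f(Q) = u^T Q w + beta, with Q of shape (users x words)."""
    total = 0.0
    for k in range(len(u)):
        row_dot_w = sum(Q[k][j] * w[j] for j in range(len(w)))
        total += u[k] * row_dot_w
    return total + beta


def collapse_users(u, Q):
    """u^T Q: one aggregated frequency per word, for fixed user weights."""
    return [sum(u[k] * Q[k][j] for k in range(len(u))) for j in range(len(Q[0]))]


# Toy example: 2 users, 3 words
Q = [[1.0, 0.0, 2.0],
     [0.0, 3.0, 1.0]]
u = [0.5, 1.0]         # user weights
w = [1.0, -1.0, 2.0]   # word weights
beta = 0.1             # bias

x = collapse_users(u, Q)  # [0.5, 3.0, 2.0]
linear = sum(xj * wj for xj, wj in zip(x, w)) + beta
print(bilinear_predict(u, Q, w, beta), linear)
```

Both printed values agree, which illustrates why a problem of this form can be approached by alternating between a regression for the word weights and one for the user weights.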
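The Bilinear Model: Multi-Task Learning (2/2)
Apply the $\ell_1/\ell_2$ regulariser (Argyriou, Evgeniou and Pontil, 2008)
Extends the notion of Group LASSO (Yuan and Lin, 2006) for a $\tau$-dimensional $\boldsymbol{y}$
Bilinear Group $\ell_1/\ell_2$ (BGL):
$$\{W^\star, U^\star\} = \underset{W, U, \boldsymbol{\beta}}{\operatorname{argmin}} \sum_{t=1}^{\tau} \sum_{i=1}^{n} \left(\boldsymbol{u}_t^T Q_i \boldsymbol{w}_t + \beta_t - y_{ti}\right)^2 + \lambda_1 \sum_{j=1}^{m} \|W_j\|_2 + \lambda_2 \sum_{k=1}^{p} \|U_k\|_2$$
$W = [\boldsymbol{w}_1 \dots \boldsymbol{w}_\tau]$: words weight matrix; $\boldsymbol{w}_t$ refers to the $t$-th political entity
$U = [\boldsymbol{u}_1 \dots \boldsymbol{u}_\tau]$: users weight matrix
$W_j$, $U_j$: $j$-th rows of weight matrices $W$ and $U$ respectively
$\boldsymbol{\beta} \in \mathbb{R}^\tau$: bias term per task
(Lampos, Preoţiuc-Pietro and Cohn, 2013)
40/47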
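The group penalty in the objective sums the Euclidean norms of the rows of a weight matrix. A minimal sketch with a made-up word weight matrix (rows are words, columns are political entities) shows what it measures; the function name is hypothetical.

```python
import math


def l1_l2_norm(M):
    """Sum over rows of the Euclidean norm of each row."""
    return sum(math.sqrt(sum(v * v for v in row)) for row in M)


# Hypothetical word weight matrix W: 3 words x 2 tasks
W = [
    [3.0, 4.0],   # word used by both tasks, row norm 5
    [0.0, 0.0],   # word dropped for every task, row norm 0
    [1.0, 0.0],   # word used by one task, row norm 1
]
print(l1_l2_norm(W))  # 6.0
```

Because the penalty acts on whole rows, it tends to zero out a word (or user) for all tasks at once, so the tasks share a common set of selected features.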
Evaluation: Performance Tables (1/3)
Table 5: UK case study. Average RMSEs representing the error of the inferred voting intention percentage for a 10-step validation process
          CON      LAB      LIB      μ
B_μ       2.272    1.663    1.136    1.69
B_last    2        2.074    1.095    1.723
LEN       3.845    2.912    2.445    3.067
BEN       1.939    1.644    1.136    1.573
BGL       1.785*   1.595*   1.054*   1.478*
Table 6: Austrian case study
          SPÖ      ÖVP      FPÖ      GRÜ      μ
B_μ       1.535    1.373    3.3      1.197    1.851
B_last    1.148*   1.556    1.639*   1.536    1.47
LEN       1.291    1.286    2.039    1.152*   1.442
BEN       1.392    1.31     2.89     1.205    1.699
BGL       1.619    1.005*   1.757    1.374    1.439*
*: lowest RMSE in the column
41/47
Evaluation (2/3)
[Plots: voting intention % over time (50 polls) for CON, LAB and LIB]
(a) Polls
(b) BEN
(c) BGL
Figure 17: UK case study, 50 consecutive poll predictions
42/47
Evaluation (3/3)
[Plots: voting intention % over time (50 polls) for SPÖ, ÖVP, FPÖ and GRÜ]
(a) Polls
(b) BEN
(c) BGL
Figure 18: Austrian case study, 50 consecutive poll predictions
43/47
Conclusions
Social Media hold valuable information
We can develop methods to extract portions of this information automatically
  detect, quantify, nowcast events (examples of flu and rainfall rates)
  extract collective mood patterns (we can do this for books too!)
  model other domains (such as politics)
Different types of information (word frequencies, user accounts) can be fused for improved inference performance
Side effect: user privacy
44/47
Significant collaborators...
Prof. Nello Cristianini, University of Bristol (Artificial Intelligence)
Prof. Alexander Bentley, University of Bristol (Anthropology)
Dr. Trevor Cohn, University of Sheffield (Natural Language Processing)
Dr. Alberto Acerbi, University of Bristol (Anthropology)
Daniel Preoţiuc-Pietro, University of Sheffield (Computer Science)
45/47
Last Slide!
The end.
Any questions?
Download the slides from
http://www.lampos.net/research/presentations-and-posters
46/47
References
Acerbi, Lampos, Garnett and Bentley. The Expression of Emotions in 20th Century Books. PLoS ONE, 2013.
Argyriou, Evgeniou and Pontil. Convex multi-task feature learning. Machine Learning, 2008.
Bach. Bolasso: Model Consistent Lasso Estimation through the Bootstrap. ICML, 2008.
Bartlett, Mendelson and Neeman. L1-regularized linear regression: persistence and oracle inequalities. PTRF, 2011.
Debatin, Lovejoy, Horn and Hughes. Facebook and Online Privacy: Attitudes, Behaviors, and Unintended Consequences. JCMC, 2009.
Efron et al. Least Angle Regression. The Annals of Statistics, 2004.
Ginsberg et al. Detecting influenza epidemics using search engine query data. Nature, 2009.
Lampos and Cristianini. Tracking the flu pandemic by monitoring the Social Web. CIP, 2010.
Lampos, De Bie and Cristianini. Flu Detector - Tracking Epidemics on Twitter. ECML PKDD, 2010.
Lampos and Cristianini. Nowcasting Events from the Social Web with Statistical Learning. ACM TIST, 2012.
Lampos. Detecting Events and Patterns in Large-Scale User Generated Textual Streams with Statistical Learning Methods. Ph.D. Thesis, University of Bristol, 2012.(a)
Lampos. On voting intentions inference from Twitter content: a case study on UK 2010 General Election. CoRR, 2012.(b)
Lampos, Preoţiuc-Pietro and Cohn. A user-centric model of voting intention from Social Media. ACL, 2013.
Lampos, Lansdall-Welfare, Araya and Cristianini. Analysing Mood Patterns in the United Kingdom through Twitter Content. CoRR, 2013.
Lansdall-Welfare, Lampos and Cristianini. Effects of the Recession on Public Mood in the UK. WWW, 2012.(a)
Lansdall-Welfare, Lampos and Cristianini. Nowcasting the mood of the nation. Significance, 2012.(b)
Manning, Raghavan and Schütze. Introduction to Information Retrieval, 2008.
Michel et al. Quantitative Analysis of Culture Using Millions of Digitized Books. Science, 2010.
Porter. An algorithm for suffix stripping. Program, 1980.
Strapparava and Valitutti. WordNet-Affect: an affective extension of WordNet. LREC, 2004.
Tibshirani. Regression Shrinkage and Selection via the LASSO. JRSS, 1996.
Yuan and Lin. Model selection and estimation in regression with grouped variables. JRSS, 2006.
Zhao and Yu. On model selection consistency of LASSO. JMLR, 2006.
Zou and Hastie. Regularization and variable selection via the elastic net. JRSS, 2005.
47/47
