
Why Text?

How much data? 1.8 zettabytes (1.8 trillion GB)


Most of the World's Data is Unstructured
2009 HP survey: 70%
Gartner: 80%
Jerry Hill (Teradata), Anant Jhingran (IBM): 85%
Structured (stored) data often misses elements critical to predictive modeling
Un-transcribed fields, notes, comments
Ex: examiner/adjuster notes, surveys with free-text fields, medical charts

Text Mining - Perspective

Taming Text
Grant Ingersoll
CTO, LucidWorks
@tamingtext, @gsingers

About the Book


Goal: An engineer's guide to search, Natural Language Processing (NLP), and Machine Learning
Target Audience: You
All examples in Java, but concepts easily ported
Covers:
Search, fuzzy string matching, human language basics, clustering, classification, question answering, intro to advanced topics

Content
Question Answering In Detail

Building Blocks
Indexing
Search/Passage Retrieval
Classification
Scoring

Other Interesting Topics


Clustering
Fuzzy-Wuzzy Strings

What's next?
Resources

A Grain of Salt
Text is a strange and magical world filled with:
Evil villains
Jesters
Wizards
Unicorns
Heroes!

In other words, no system will be perfect


http://images1.wikia.nocookie.net/__cb20121110131756/lotr/images/thumb/e/e7/Gandalf_the_Grey.jpg/220px-Gandalf_the_Grey.jpg

The Ugly Truth


You will spend most of your time in NLP, search, etc. doing grunt work nicely labeled as:

Preprocessing
Feature Selection
Sampling
Validation/testing/etc.
Content extraction
ETL

Corollary: Start with simple, tried-and-true algorithms, then iterate

Term-document matrix

The most common form of representation in text mining is the term-document matrix
Term: typically a single word, but could be a word phrase like "data mining"
Document: a generic term meaning a collection of text to be retrieved
Can be large: vocabularies are often 50k terms or larger, and documents can number in the billions (the web)
Can be binary, or use counts
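A minimal sketch of building such a matrix of counts, in Java to match the book's examples (the class name and whitespace tokenization here are illustrative, not from any library):

```java
import java.util.*;

// Build a small term-document matrix from raw strings:
// each vocabulary term maps to a vector of per-document counts.
public class TermDocMatrix {
    public static Map<String, int[]> build(List<String> docs) {
        Map<String, int[]> matrix = new TreeMap<>();
        for (int d = 0; d < docs.size(); d++) {
            for (String term : docs.get(d).toLowerCase().split("\\W+")) {
                if (term.isEmpty()) continue;
                int[] row = matrix.computeIfAbsent(term, k -> new int[docs.size()]);
                row[d]++;  // count occurrences; a boolean matrix would just set 1
            }
        }
        return matrix;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
            "the database index speeds up SQL queries",
            "linear regression maximizes the likelihood");
        Map<String, int[]> m = build(docs);
        System.out.println(Arrays.toString(m.get("database"))); // [1, 0]
    }
}
```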

Term-document matrix

Example: 10 documents, 6 terms

        Database  SQL  Index  Regression  Likelihood  linear
D1         24      21     9          0           0        3
D2         32      10     5          0           3        0
D3         12      16     5          0           0        0
D4          6       7     2          0           0        0
D5         43      31    20          0           3        0
D6          2       0     0         18           7       16
D7          0       0     1         32          12        0
D8          3       0     0         22           4        2
D9          1       0     0         34          27       25
D10         6       0     0         17           4       23

Each document is now just a vector of term counts, sometimes boolean

Term-document matrix


We have lost all semantic content
Be careful constructing your term list!
Not all words are created equal!
Words that are the same should be treated the same!

Stop Words
Stemming


Stop words
Many of the most frequently used words in English are worthless in retrieval and text mining; these words are called stop words.
"the", "of", "and", "to", ...
Typically about 400 to 500 such words
For an application, an additional domain-specific stop word list may be constructed

Why do we need to remove stop words?

Reduce indexing (or data) file size
Stop words account for 20-30% of total word counts

Improve efficiency
Stop words are not useful for searching or text mining
Stop words always produce a large number of hits
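A minimal stop-word filter sketch; a real system would load the 400-500 word list (plus domain-specific additions) from a file rather than hard-code this tiny sample:

```java
import java.util.*;

// Drop stop words from a token stream before indexing or mining.
public class StopWordFilter {
    static final Set<String> STOP_WORDS =
        new HashSet<>(Arrays.asList("the", "of", "and", "to", "a", "in", "is"));

    public static List<String> filter(List<String> tokens) {
        List<String> kept = new ArrayList<>();
        for (String t : tokens)
            if (!STOP_WORDS.contains(t.toLowerCase())) kept.add(t);
        return kept;
    }

    public static void main(String[] args) {
        System.out.println(filter(Arrays.asList("the", "index", "of", "the", "database")));
        // [index, database]
    }
}
```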

Stemming
Techniques used to find the root/stem of a word, e.g.:

user, users, used, using --> stem: use
engineering, engineered, engineer --> stem: engineer

Usefulness
Improving effectiveness of retrieval and text mining
matching similar words
Reducing indexing size
combining words with the same root may reduce indexing size by as much as 40-50%

Basic stemming methods

Remove endings:
if a word ends with a consonant other than s, followed by an s, then delete the s
if a word ends in es, drop the s
if a word ends in ing, delete the ing unless the remaining word consists of only one letter or of "th"
if a word ends with ed, preceded by a consonant, delete the ed unless this leaves only a single letter
...

Transform words:
if a word ends with ies but not eies or aies, then change ies to y
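The rules above can be sketched directly (this toy stemmer is mine; production systems use longer, ordered rule sets such as Porter's algorithm):

```java
// Apply the simple suffix rules listed above, in order.
public class SimpleStemmer {
    static boolean consonant(char c) { return "aeiou".indexOf(c) < 0; }

    public static String stem(String w) {
        // ies (but not eies or aies) --> y; must run before the es rule
        if (w.endsWith("ies") && !w.endsWith("eies") && !w.endsWith("aies"))
            return w.substring(0, w.length() - 3) + "y";
        // ends in es: drop the s
        if (w.endsWith("es"))
            return w.substring(0, w.length() - 1);
        // consonant + s: drop the s
        if (w.endsWith("s") && w.length() > 1 && consonant(w.charAt(w.length() - 2)))
            return w.substring(0, w.length() - 1);
        // ing: delete unless one letter or "th" remains
        if (w.endsWith("ing")) {
            String rest = w.substring(0, w.length() - 3);
            if (rest.length() > 1 && !rest.equals("th")) return rest;
        }
        // ed preceded by a consonant: delete unless a single letter remains
        if (w.endsWith("ed") && w.length() > 3 && consonant(w.charAt(w.length() - 3)))
            return w.substring(0, w.length() - 2);
        return w;
    }

    public static void main(String[] args) {
        System.out.println(stem("queries")); // query
        System.out.println(stem("users"));   // user
        System.out.println(stem("using"));   // us
    }
}
```

Note that the rules are crude: "using" stems to "us", not "use", which is why real stemmers carry many more rules and exceptions.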


Feature Selection
Performance of text classification algorithms can be optimized by selecting only a subset of the discriminative terms
Even after stemming and stopword removal

Greedy search
Start from the full set and delete one term at a time
Find the least important variable
Can use the Gini index for this if it is a classification problem

Often performance does not degrade even with orders-of-magnitude reductions
Chakrabarti, Chapter 5: Patent data: 9,600 patents in communication, electricity, and electronics
Only 140 out of 20,000 terms needed for classification!
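The Gini index mentioned above measures class impurity; a greedy backward-elimination pass could use something like this to rank terms (a sketch with my own naming, not Chakrabarti's code):

```java
// Gini impurity of a class-count distribution: 0 for a pure split,
// approaching 1 - 1/k for k evenly mixed classes.
public class GiniIndex {
    public static double gini(int[] classCounts) {
        int total = 0;
        for (int c : classCounts) total += c;
        double g = 1.0;
        for (int c : classCounts) {
            double p = (double) c / total;
            g -= p * p;  // subtract squared class proportions
        }
        return g;
    }

    public static void main(String[] args) {
        System.out.println(gini(new int[]{10, 0})); // 0.0 - pure
        System.out.println(gini(new int[]{5, 5}));  // 0.5 - maximally mixed
    }
}
```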


Distances in TD matrices
Given a term-document matrix representation, we can now define distances between documents (or terms!)
Elements of the matrix can be 0/1 or term frequencies (sometimes normalized)
Can use Euclidean or cosine distance
Cosine distance is the angle between the two vectors
Not intuitive, but has been proven to work well
If docs are the same, dc = 1; if they have nothing in common, dc = 0
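The cosine measure dc is the normalized dot product of the two vectors. A sketch, using two count vectors patterned after the example matrix (a database-heavy and a regression-heavy document):

```java
// Cosine measure between two term-count vectors:
// 1 for identical direction, 0 for no terms in common.
public class CosineSimilarity {
    public static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    public static void main(String[] args) {
        double[] d1 = {24, 21, 9, 0, 0, 3};  // database-heavy document
        double[] d9 = {1, 0, 0, 34, 27, 25}; // regression-heavy document
        System.out.println(cosine(d1, d1));  // ~1: identical docs
        System.out.println(cosine(d1, d9));  // near 0: little overlap
    }
}
```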

We can calculate cosine and Euclidean distance for this matrix
What would you want the distances to look like?

        Database  SQL  Index  Regression  Likelihood  linear
D1         24      21     9          0           0        3
D2         32      10     5          0           3        0
D3         12      16     5          0           0        0
D4          6       7     2          0           0        0
D5         43      31    20          0           3        0
D6          2       0     0         18           7       16
D7          0       0     1         32          12        0
D8          3       0     0         22           4        2
D9          1       0     0         34          27       25
D10         6       0     0         17           4       23

Document distance

Pairwise distances between documents
Image plots of cosine distance, Euclidean, and scaled Euclidean
R function: image

Weighting in TD space
Not all terms are of equal importance
E.g. "David" is less important than "Beckham"
If a term occurs frequently in many documents it has less discriminatory power
One way to correct for this is inverse document frequency (IDF):

IDF = log(N / Nj)

Nj = # of docs containing the term
N = total # of docs
Term importance = Term Frequency (TF) x IDF
A term is important if it has a high TF and/or a high IDF
TF x IDF is a common measure of term importance
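A sketch of the weighting, assuming the natural log (which matches the example values below, e.g. the SQL weight in D1: TF = 21, with "SQL" in 5 of the 10 docs, gives 21 * ln(10/5) ≈ 14.6):

```java
// TF x IDF term weighting with IDF = log(N / Nj).
public class TfIdf {
    public static double idf(int totalDocs, int docsWithTerm) {
        return Math.log((double) totalDocs / docsWithTerm);
    }

    public static double tfidf(int termFreq, int totalDocs, int docsWithTerm) {
        return termFreq * idf(totalDocs, docsWithTerm);
    }

    public static void main(String[] args) {
        // SQL in D1: TF = 21, appears in 5 of 10 docs
        System.out.printf("%.1f%n", tfidf(21, 10, 5)); // 14.6
    }
}
```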

Raw term counts:

        Database  SQL  Index  Regression  Likelihood  linear
D1         24      21     9          0           0        3
D2         32      10     5          0           3        0
D3         12      16     5          0           0        0
D4          6       7     2          0           0        0
D5         43      31    20          0           3        0
D6          2       0     0         18           7       16
D7          0       0     1         32          12        0
D8          3       0     0         22           4        2
D9          1       0     0         34          27       25
D10         6       0     0         17           4       23

TF x IDF

        Database   SQL   Index  Regression  Likelihood  linear
D1        2.53    14.6    4.6        0           0        2.1
D2        3.3      6.7    2.6        0          1.0        0
D3        1.3     11.1    2.6        0           0         0
D4        0.7      4.9    1.0        0           0         0
D5        4.5     21.5   10.2        0          1.0        0
D6        0.2       0      0       12.5         2.5      11.1
D7         0        0     0.5      22.2         4.3        0
D8        0.3       0      0       15.2         1.4       1.4
D9        0.1       0      0       23.56        9.6      17.3
D10       0.6       0      0       11.8         1.4      16.0

Simple Question Answering


Workflow

Building Blocks
Sentence Detection
Part of Speech Tagging
Parsing
Ch. 2

QA in Taming Text
Apache Solr for Passage Retrieval and
integration
Apache OpenNLP for sentence detection,
parsing, POS tagging and answer type
classification
Custom code for Query Parsing, Scoring
See com.tamingtext.qa package

Wikipedia for truth

Indexing
Ingest raw data into the system and make it
available for search
Garbage In, Garbage Out
Need to spend some time understanding and
modeling your data just like you would with a DB
Lather, rinse, repeat

See the $TT_HOME/apache-solr/solrqa/conf/schema.xml for setup, and WikipediaWexIndexer.java for the indexing code

Aside: Named Entity Recognition

NER is the process of extracting proper names, etc. from text
Plays a vital role in QA and many other NLP systems
Often solved using classification approaches

Answer Type Classification

Answer Type examples:
Person (P), Location (L), Organization (O), Time Point (T), Duration (R), Money (M)
See page 248 for more

Train an OpenNLP classifier off of a set of previously annotated questions, e.g.:
P Which French monarch reinstated the divine right of the monarchy to France and was known as 'The Sun King' because of the splendour of his reign?

Search engines

Other Areas of NLP/Machine Learning

Clustering
Group together content based on some notion of similarity
Book covers (ch. 6):
Search result clustering using Carrot2
Whole collection clustering using Mahout
Topic Modeling

Mahout comes with many different algorithms

Clustering Use Cases


Google News
Outlier detection in smart grids
Recommendations
Products
People, etc.

In Focus: K-Means

http://en.wikipedia.org/wiki/K-means_clustering
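Not Mahout's implementation, but a bare-bones 1-D Lloyd's iteration to show the idea: assign each point to its nearest centroid, recompute centroids as cluster means, repeat:

```java
import java.util.*;

// Minimal k-means (Lloyd's algorithm) on 1-D points.
public class KMeans1D {
    public static double[] cluster(double[] points, double[] centroids, int iters) {
        double[] c = centroids.clone();
        for (int it = 0; it < iters; it++) {
            double[] sum = new double[c.length];
            int[] count = new int[c.length];
            for (double p : points) {
                // assignment step: nearest centroid
                int best = 0;
                for (int j = 1; j < c.length; j++)
                    if (Math.abs(p - c[j]) < Math.abs(p - c[best])) best = j;
                sum[best] += p;
                count[best]++;
            }
            // update step: centroid = mean of assigned points
            for (int j = 0; j < c.length; j++)
                if (count[j] > 0) c[j] = sum[j] / count[j];
        }
        return c;
    }

    public static void main(String[] args) {
        double[] pts = {1.0, 1.2, 0.8, 9.0, 9.5, 8.5};
        double[] c = cluster(pts, new double[]{0.0, 10.0}, 10);
        System.out.println(Arrays.toString(c)); // centroids settle near 1.0 and 9.0
    }
}
```

Real document clustering runs this over TF x IDF vectors with cosine similarity; Mahout handles the distributed bookkeeping.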

Fuzzy-Wuzzy Strings

Fuzzy string matching is a common, and difficult, problem
Useful for solving problems like:
"Did you mean" spell checking
Auto-suggest
Record linkage

Common Approaches
See com.tamingtext.fuzzy package
Jaccard
Measure character overlap
Levenshtein (Edit Distance)
Count the number of edits required to transform one word into the other
Jaro-Winkler
Account for position
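A standard dynamic-programming sketch of Levenshtein distance (my own, not the com.tamingtext.fuzzy code):

```java
// Levenshtein edit distance: minimum number of single-character
// insertions, deletions, and substitutions to turn a into b.
public class EditDistance {
    public static int levenshtein(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i; // delete all of a
        for (int j = 0; j <= b.length(); j++) d[0][j] = j; // insert all of b
        for (int i = 1; i <= a.length(); i++)
            for (int j = 1; j <= b.length(); j++) {
                int sub = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(d[i - 1][j - 1] + sub,          // substitute/match
                          Math.min(d[i - 1][j] + 1,                // delete
                                   d[i][j - 1] + 1));              // insert
            }
        return d[a.length()][b.length()];
    }

    public static void main(String[] args) {
        System.out.println(levenshtein("tamming", "taming")); // 1
    }
}
```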

Text Mining: Helpful Data

WordNet

Data Mining -Volinsky - 2011 - Columbia University

Courtesy: Luca Lanzi


Text Mining - Other Topics

Sentiment Analysis
Automatically determine tone in text: positive, negative, or neutral
Typically uses collections of good and bad words
"While the traditional media is slowly starting to take John McCain's straight-talking image with increasingly large grains of salt, his base isn't quite ready to give up on their favorite son. Jonathan Alter's bizarre defense of McCain after he was caught telling an outright lie, perfectly captures that reluctance[.]"
Often fit using Naïve Bayes

There are sentiment word lists out there:
See http://neuro.imm.dtu.dk/wiki/Text_sentiment_analysis
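The word-list approach can be sketched as a toy scorer (these tiny hard-coded lists are placeholders; real lexicons run to thousands of entries):

```java
import java.util.*;

// Toy lexicon-based sentiment scorer: count hits against
// positive and negative word lists.
public class SentimentScore {
    static final Set<String> POS = new HashSet<>(Arrays.asList("good", "great", "perfectly"));
    static final Set<String> NEG = new HashSet<>(Arrays.asList("bad", "bizarre", "lie"));

    public static int score(String text) {
        int s = 0;
        for (String w : text.toLowerCase().split("\\W+")) {
            if (POS.contains(w)) s++;  // positive hit
            if (NEG.contains(w)) s--;  // negative hit
        }
        return s;
    }

    public static void main(String[] args) {
        System.out.println(score("a bizarre defense of an outright lie")); // -2
    }
}
```

A Naïve Bayes classifier trained on labeled examples usually beats raw counting, since it learns term weights from data.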


Text Mining - Other Topics

Summarizing text: Word Clouds
Takes text as input, finds the most interesting words, and displays them graphically
Blogs do this
Wordle.net


Much Harder Problems

Semantics, Pragmatics and beyond


Relationship Extraction
Cross-language Search
Importance
