
Why Text?

How much data? 1.8 zettabytes (1.8 trillion GB)


Most of the World's Data is Unstructured
2009 HP survey: 70%
Gartner: 80%
Jerry Hill (Teradata), Anant Jhingran (IBM): 85%
Structured (stored) data often misses elements critical to predictive modeling
Un-transcribed fields, notes, comments
Ex: examiner/adjuster notes, surveys with free-text fields, medical charts

Text Mining - Perspective

Taming Text
Grant Ingersoll
CTO, LucidWorks
@tamingtext, @gsingers

About the Book


Goal: An engineer's guide to search, Natural Language Processing (NLP), and Machine Learning
Target Audience: You
All examples in Java, but concepts easily ported
Covers:
Search, fuzzy string matching, human language basics, clustering, classification, question answering, intro to advanced topics

Content
Question Answering In Detail

Building Blocks
Indexing
Search/Passage Retrieval
Classification
Scoring

Other Interesting Topics


Clustering
Fuzzy-Wuzzy Strings

What's next?
Resources

A Grain of Salt
Text is a strange and magical world filled with:
Evil villains
Jesters
Wizards
Unicorns
Heroes!

In other words, no system will be perfect


http://images1.wikia.nocookie.net/__cb20121110131756/lotr/images/thumb/e/e7/Gandalf_the_Grey.jpg/220px-Gandalf_the_Grey.jpg

The Ugly Truth


You will spend most of your time in NLP, search, etc. doing grunt work nicely labeled as:

Preprocessing
Feature Selection
Sampling
Validation/testing/etc.
Content extraction
ETL

Corollary: Start with simple, tried-and-true algorithms, then iterate

Term-document matrix

The most common form of representation in text mining is the term-document matrix
Term: typically a single word, but could be a word phrase like "data mining"
Document: a generic term meaning a collection of text to be retrieved
Can be large: vocabularies are often 50k terms or larger, and documents can number in the billions (the web)
Can be binary, or use counts
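A minimal sketch of building such a matrix of counts, in Java to match the book's examples (the class name and whitespace tokenization here are illustrative, not from any library):

```java
import java.util.*;

// Build a small term-document matrix from raw strings:
// each vocabulary term maps to a vector of per-document counts.
public class TermDocMatrix {
    public static Map<String, int[]> build(List<String> docs) {
        Map<String, int[]> matrix = new TreeMap<>();
        for (int d = 0; d < docs.size(); d++) {
            for (String term : docs.get(d).toLowerCase().split("\\W+")) {
                if (term.isEmpty()) continue;
                int[] row = matrix.computeIfAbsent(term, k -> new int[docs.size()]);
                row[d]++;  // count occurrences; a boolean matrix would just set 1
            }
        }
        return matrix;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
            "the database index speeds up SQL queries",
            "linear regression maximizes the likelihood");
        Map<String, int[]> m = build(docs);
        System.out.println(Arrays.toString(m.get("database"))); // [1, 0]
    }
}
```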

Term-document matrix

Example: 10 documents, 6 terms

        Database  SQL  Index  Regression  Likelihood  linear
D1         24      21     9          0           0        3
D2         32      10     5          0           3        0
D3         12      16     5          0           0        0
D4          6       7     2          0           0        0
D5         43      31    20          0           3        0
D6          2       0     0         18           7       16
D7          0       0     1         32          12        0
D8          3       0     0         22           4        2
D9          1       0     0         34          27       25
D10         6       0     0         17           4       23

Each document is now just a vector of term counts, sometimes boolean

Term-document matrix


We have lost all semantic content
Be careful constructing your term list!
Not all words are created equal!
Words that are the same should be treated the same!

Stop Words
Stemming


Stop words
Many of the most frequently used words in English are worthless in retrieval and text mining; these words are called stop words.
"the", "of", "and", "to", ...
Typically about 400 to 500 such words
For an application, an additional domain-specific stop word list may be constructed

Why do we need to remove stop words?

Reduce indexing (or data) file size
Stop words account for 20-30% of total word counts

Improve efficiency
Stop words are not useful for searching or text mining
Stop words always produce a large number of hits
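A minimal stop-word filter sketch; a real system would load the 400-500 word list (plus domain-specific additions) from a file rather than hard-code this tiny sample:

```java
import java.util.*;

// Drop stop words from a token stream before indexing or mining.
public class StopWordFilter {
    static final Set<String> STOP_WORDS =
        new HashSet<>(Arrays.asList("the", "of", "and", "to", "a", "in", "is"));

    public static List<String> filter(List<String> tokens) {
        List<String> kept = new ArrayList<>();
        for (String t : tokens)
            if (!STOP_WORDS.contains(t.toLowerCase())) kept.add(t);
        return kept;
    }

    public static void main(String[] args) {
        System.out.println(filter(Arrays.asList("the", "index", "of", "the", "database")));
        // [index, database]
    }
}
```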

Stemming
Techniques used to find the root/stem of a word, e.g.:

user, users, used, using --> stem: use
engineering, engineered, engineer --> stem: engineer

Usefulness
Improving effectiveness of retrieval and text mining
matching similar words
Reducing indexing size
combining words with the same root may reduce indexing size by as much as 40-50%

Basic stemming methods

Remove endings:
if a word ends with a consonant other than s, followed by an s, then delete the s
if a word ends in es, drop the s
if a word ends in ing, delete the ing unless the remaining word consists of only one letter or of "th"
if a word ends with ed, preceded by a consonant, delete the ed unless this leaves only a single letter
...

Transform words:
if a word ends with ies but not eies or aies, then change ies to y
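The rules above can be sketched directly (this toy stemmer is mine; production systems use longer, ordered rule sets such as Porter's algorithm):

```java
// Apply the simple suffix rules listed above, in order.
public class SimpleStemmer {
    static boolean consonant(char c) { return "aeiou".indexOf(c) < 0; }

    public static String stem(String w) {
        // ies (but not eies or aies) --> y; must run before the es rule
        if (w.endsWith("ies") && !w.endsWith("eies") && !w.endsWith("aies"))
            return w.substring(0, w.length() - 3) + "y";
        // ends in es: drop the s
        if (w.endsWith("es"))
            return w.substring(0, w.length() - 1);
        // consonant + s: drop the s
        if (w.endsWith("s") && w.length() > 1 && consonant(w.charAt(w.length() - 2)))
            return w.substring(0, w.length() - 1);
        // ing: delete unless one letter or "th" remains
        if (w.endsWith("ing")) {
            String rest = w.substring(0, w.length() - 3);
            if (rest.length() > 1 && !rest.equals("th")) return rest;
        }
        // ed preceded by a consonant: delete unless a single letter remains
        if (w.endsWith("ed") && w.length() > 3 && consonant(w.charAt(w.length() - 3)))
            return w.substring(0, w.length() - 2);
        return w;
    }

    public static void main(String[] args) {
        System.out.println(stem("queries")); // query
        System.out.println(stem("users"));   // user
        System.out.println(stem("using"));   // us
    }
}
```

Note that the rules are crude: "using" stems to "us", not "use", which is why real stemmers carry many more rules and exceptions.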


Feature Selection
Performance of text classification algorithms can be optimized by selecting only a subset of the discriminative terms
Even after stemming and stopword removal

Greedy search
Start from the full set and delete one term at a time
Find the least important variable
Can use the Gini index for this if it is a classification problem

Often performance does not degrade even with orders-of-magnitude reductions
Chakrabarti, Chapter 5: Patent data: 9,600 patents in communication, electricity, and electronics
Only 140 out of 20,000 terms needed for classification!
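The Gini index mentioned above measures class impurity; a greedy backward-elimination pass could use something like this to rank terms (a sketch with my own naming, not Chakrabarti's code):

```java
// Gini impurity of a class-count distribution: 0 for a pure split,
// approaching 1 - 1/k for k evenly mixed classes.
public class GiniIndex {
    public static double gini(int[] classCounts) {
        int total = 0;
        for (int c : classCounts) total += c;
        double g = 1.0;
        for (int c : classCounts) {
            double p = (double) c / total;
            g -= p * p;  // subtract squared class proportions
        }
        return g;
    }

    public static void main(String[] args) {
        System.out.println(gini(new int[]{10, 0})); // 0.0 - pure
        System.out.println(gini(new int[]{5, 5}));  // 0.5 - maximally mixed
    }
}
```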


Distances in TD matrices
Given a term-document matrix representation, we can now define distances between documents (or terms!)
Elements of the matrix can be 0/1 or term frequencies (sometimes normalized)
Can use Euclidean or cosine distance
Cosine distance is the angle between the two vectors
Not intuitive, but has been proven to work well
If docs are the same, dc = 1; if they have nothing in common, dc = 0
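The cosine measure dc is the normalized dot product of the two vectors. A sketch, using two count vectors patterned after the example matrix (a database-heavy and a regression-heavy document):

```java
// Cosine measure between two term-count vectors:
// 1 for identical direction, 0 for no terms in common.
public class CosineSimilarity {
    public static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    public static void main(String[] args) {
        double[] d1 = {24, 21, 9, 0, 0, 3};  // database-heavy document
        double[] d9 = {1, 0, 0, 34, 27, 25}; // regression-heavy document
        System.out.println(cosine(d1, d1));  // ~1: identical docs
        System.out.println(cosine(d1, d9));  // near 0: little overlap
    }
}
```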

We can calculate cosine and Euclidean distance for this matrix
What would you want the distances to look like?

        Database  SQL  Index  Regression  Likelihood  linear
D1         24      21     9          0           0        3
D2         32      10     5          0           3        0
D3         12      16     5          0           0        0
D4          6       7     2          0           0        0
D5         43      31    20          0           3        0
D6          2       0     0         18           7       16
D7          0       0     1         32          12        0
D8          3       0     0         22           4        2
D9          1       0     0         34          27       25
D10         6       0     0         17           4       23

Document distance

Pairwise distances between documents
Image plots of cosine distance, Euclidean, and scaled Euclidean
R function: image

Weighting in TD space
Not all terms are of equal importance
E.g. "David" is less important than "Beckham"
If a term occurs frequently in many documents it has less discriminatory power
One way to correct for this is inverse document frequency (IDF):

IDF = log(N / Nj)

Nj = # of docs containing the term
N = total # of docs
Term importance = Term Frequency (TF) x IDF
A term is important if it has a high TF and/or a high IDF
TF x IDF is a common measure of term importance
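A sketch of the weighting, assuming the natural log (which matches the example values below, e.g. the SQL weight in D1: TF = 21, with "SQL" in 5 of the 10 docs, gives 21 * ln(10/5) ≈ 14.6):

```java
// TF x IDF term weighting with IDF = log(N / Nj).
public class TfIdf {
    public static double idf(int totalDocs, int docsWithTerm) {
        return Math.log((double) totalDocs / docsWithTerm);
    }

    public static double tfidf(int termFreq, int totalDocs, int docsWithTerm) {
        return termFreq * idf(totalDocs, docsWithTerm);
    }

    public static void main(String[] args) {
        // SQL in D1: TF = 21, appears in 5 of 10 docs
        System.out.printf("%.1f%n", tfidf(21, 10, 5)); // 14.6
    }
}
```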

Raw term counts:

        Database  SQL  Index  Regression  Likelihood  linear
D1         24      21     9          0           0        3
D2         32      10     5          0           3        0
D3         12      16     5          0           0        0
D4          6       7     2          0           0        0
D5         43      31    20          0           3        0
D6          2       0     0         18           7       16
D7          0       0     1         32          12        0
D8          3       0     0         22           4        2
D9          1       0     0         34          27       25
D10         6       0     0         17           4       23

TF x IDF

        Database   SQL   Index  Regression  Likelihood  linear
D1        2.53    14.6    4.6        0           0        2.1
D2        3.3      6.7    2.6        0          1.0        0
D3        1.3     11.1    2.6        0           0         0
D4        0.7      4.9    1.0        0           0         0
D5        4.5     21.5   10.2        0          1.0        0
D6        0.2       0      0       12.5         2.5      11.1
D7         0        0     0.5      22.2         4.3        0
D8        0.3       0      0       15.2         1.4       1.4
D9        0.1       0      0       23.56        9.6      17.3
D10       0.6       0      0       11.8         1.4      16.0

Simple Question Answering


Workflow

Building Blocks
Sentence Detection
Part of Speech Tagging
Parsing
Ch. 2

QA in Taming Text
Apache Solr for Passage Retrieval and
integration
Apache OpenNLP for sentence detection,
parsing, POS tagging and answer type
classification
Custom code for Query Parsing, Scoring
See com.tamingtext.qa package

Wikipedia for truth

Indexing
Ingest raw data into the system and make it
available for search
Garbage In, Garbage Out
Need to spend some time understanding and
modeling your data just like you would with a DB
Lather, rinse, repeat

See the $TT_HOME/apache-solr/solrqa/conf/schema.xml for setup, and WikipediaWexIndexer.java for the indexing code

Aside: Named Entity Recognition

NER is the process of extracting proper names, etc. from text
Plays a vital role in QA and many other NLP systems
Often solved using classification approaches

Answer Type Classification

Answer Type examples:
Person (P), Location (L), Organization (O), Time Point (T), Duration (R), Money (M)
See page 248 for more

Train an OpenNLP classifier off of a set of previously annotated questions, e.g.:
P Which French monarch reinstated the divine right of the monarchy to France and was known as 'The Sun King' because of the splendour of his reign?

Search engines

Other Areas of NLP/Machine Learning

Clustering
Group together content based on some notion of similarity
Book covers (ch. 6):
Search result clustering using Carrot2
Whole collection clustering using Mahout
Topic Modeling

Mahout comes with many different algorithms

Clustering Use Cases


Google News
Outlier detection in smart grids
Recommendations
Products
People, etc.

In Focus: K-Means

http://en.wikipedia.org/wiki/K-means_clustering
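Not Mahout's implementation, but a bare-bones 1-D Lloyd's iteration to show the idea: assign each point to its nearest centroid, recompute centroids as cluster means, repeat:

```java
import java.util.*;

// Minimal k-means (Lloyd's algorithm) on 1-D points.
public class KMeans1D {
    public static double[] cluster(double[] points, double[] centroids, int iters) {
        double[] c = centroids.clone();
        for (int it = 0; it < iters; it++) {
            double[] sum = new double[c.length];
            int[] count = new int[c.length];
            for (double p : points) {
                // assignment step: nearest centroid
                int best = 0;
                for (int j = 1; j < c.length; j++)
                    if (Math.abs(p - c[j]) < Math.abs(p - c[best])) best = j;
                sum[best] += p;
                count[best]++;
            }
            // update step: centroid = mean of assigned points
            for (int j = 0; j < c.length; j++)
                if (count[j] > 0) c[j] = sum[j] / count[j];
        }
        return c;
    }

    public static void main(String[] args) {
        double[] pts = {1.0, 1.2, 0.8, 9.0, 9.5, 8.5};
        double[] c = cluster(pts, new double[]{0.0, 10.0}, 10);
        System.out.println(Arrays.toString(c)); // centroids settle near 1.0 and 9.0
    }
}
```

Real document clustering runs this over TF x IDF vectors with cosine similarity; Mahout handles the distributed bookkeeping.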

Fuzzy-Wuzzy Strings

Fuzzy string matching is a common, and difficult, problem
Useful for solving problems like:
"Did you mean" spell checking
Auto-suggest
Record linkage

Common Approaches
See com.tamingtext.fuzzy package
Jaccard
Measure character overlap
Levenshtein (Edit Distance)
Count the number of edits required to transform one word into the other
Jaro-Winkler
Account for position
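A standard dynamic-programming sketch of Levenshtein distance (my own, not the com.tamingtext.fuzzy code):

```java
// Levenshtein edit distance: minimum number of single-character
// insertions, deletions, and substitutions to turn a into b.
public class EditDistance {
    public static int levenshtein(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i; // delete all of a
        for (int j = 0; j <= b.length(); j++) d[0][j] = j; // insert all of b
        for (int i = 1; i <= a.length(); i++)
            for (int j = 1; j <= b.length(); j++) {
                int sub = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(d[i - 1][j - 1] + sub,          // substitute/match
                          Math.min(d[i - 1][j] + 1,                // delete
                                   d[i][j - 1] + 1));              // insert
            }
        return d[a.length()][b.length()];
    }

    public static void main(String[] args) {
        System.out.println(levenshtein("tamming", "taming")); // 1
    }
}
```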

Text Mining: Helpful Data

WordNet

Data Mining -Volinsky - 2011 - Columbia University

Courtesy: Luca Lanzi


Text Mining - Other Topics

Sentiment Analysis
Automatically determine tone in text: positive, negative, or neutral
Typically uses collections of good and bad words
"While the traditional media is slowly starting to take John McCain's straight-talking image with increasingly large grains of salt, his base isn't quite ready to give up on their favorite son. Jonathan Alter's bizarre defense of McCain after he was caught telling an outright lie, perfectly captures that reluctance[.]"
Often fit using Naïve Bayes

There are sentiment word lists out there:
See http://neuro.imm.dtu.dk/wiki/Text_sentiment_analysis
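The word-list approach can be sketched as a toy scorer (these tiny hard-coded lists are placeholders; real lexicons run to thousands of entries):

```java
import java.util.*;

// Toy lexicon-based sentiment scorer: count hits against
// positive and negative word lists.
public class SentimentScore {
    static final Set<String> POS = new HashSet<>(Arrays.asList("good", "great", "perfectly"));
    static final Set<String> NEG = new HashSet<>(Arrays.asList("bad", "bizarre", "lie"));

    public static int score(String text) {
        int s = 0;
        for (String w : text.toLowerCase().split("\\W+")) {
            if (POS.contains(w)) s++;  // positive hit
            if (NEG.contains(w)) s--;  // negative hit
        }
        return s;
    }

    public static void main(String[] args) {
        System.out.println(score("a bizarre defense of an outright lie")); // -2
    }
}
```

A Naïve Bayes classifier trained on labeled examples usually beats raw counting, since it learns term weights from data.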


Text Mining - Other Topics

Summarizing text: Word Clouds
Takes text as input, finds the most interesting words, and displays them graphically
Blogs do this
Wordle.net


Much Harder Problems

Semantics, Pragmatics and beyond


Relationship Extraction
Cross-language Search
Importance
