Taming Text
Grant Ingersoll
CTO, LucidWorks
@tamingtext, @gsingers
Content
Question Answering In Detail
Building Blocks
Indexing
Search/Passage Retrieval
Classification
Scoring
What's next?
Resources
A Grain of Salt
Text is a strange and magical world filled with
Evil villains
Jesters
Wizards
Unicorns
Heroes!
Preprocessing
Feature Selection
Sampling
Validation/testing/etc.
Content extraction
ETL
[Example term-document matrix: rows D1-D10 (documents); columns Database, SQL, Index, Regression, Likelihood, linear (terms); cells hold raw term counts. Column alignment was lost in extraction.]
Stop Words
Stemming
Stop words
Many of the most frequently used words in English are worthless for retrieval and text mining; these words are called stop words.
e.g., the, of, and, to, …
Typically about 400 to 500 such words
For an application, an additional domain-specific stop word list may be constructed
Removing them improves efficiency:
stop words are not useful for searching or text mining
stop words always produce a large number of hits
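A minimal sketch of stop word filtering (illustrative Python, not the book's Java code; the tiny hand-built list here stands in for a real 400-500-word list):

```python
# Illustrative stop word removal. STOP_WORDS is a toy stand-in
# for a full list such as the one shipped with a search engine.
STOP_WORDS = {"the", "of", "and", "to", "a", "in", "is"}

def remove_stop_words(tokens):
    """Drop any token that appears in the stop word list."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

print(remove_stop_words("the quick brown fox jumped over the dog".split()))
# -> ['quick', 'brown', 'fox', 'jumped', 'over', 'dog']
```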
Stemming
Techniques used to find the root/stem of a word.
E.g.,
user, users, used, using → stem: use
engineering, engineered, engineer → stem: engineer
Usefulness
improving effectiveness of retrieval and text mining by matching similar words
Stemmers work by transforming words with rewrite rules, e.g.:
if a word ends in ies but not eies or aies, then ies → y
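The single rewrite rule above can be sketched in a few lines (illustrative Python; real stemmers such as the Porter stemmer chain many rules like this):

```python
def ies_to_y(word):
    """Apply the one rule from the slide: a word ending in 'ies',
    but not 'eies' or 'aies', has its 'ies' rewritten to 'y'."""
    if word.endswith("ies") and not word.endswith(("eies", "aies")):
        return word[:-3] + "y"
    return word

print(ies_to_y("cities"))  # -> city
print(ies_to_y("ponies"))  # -> pony
print(ies_to_y("dog"))     # unchanged -> dog
```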
Feature Selection
Performance of text classification algorithms can be optimized by selecting only a subset of the most discriminative terms
even after stemming and stop word removal
Greedy search:
start from the full set and delete one term at a time
at each step, drop the least important variable
For classification problems, the Gini index can be used to rank importance
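The Gini index mentioned above can be sketched as follows (illustrative Python; `class_counts` is a hypothetical per-class count of documents containing a given term, not data from the slides):

```python
def gini(class_counts):
    """Gini index of the class distribution among documents that
    contain a term: 1 - sum(p_c^2). Lower values mean the term is
    concentrated in fewer classes, i.e. more discriminative."""
    total = sum(class_counts)
    return 1.0 - sum((c / total) ** 2 for c in class_counts)

# A term appearing only in class A is maximally discriminative:
print(gini([10, 0]))  # -> 0.0
# A term split evenly across two classes carries no signal:
print(gini([5, 5]))   # -> 0.5
```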
Distances in TD matrices
Given a term-document matrix representation, we can define distances between documents (or terms!)
Elements of the matrix can be 0/1 or term frequencies (sometimes normalized)
Can use Euclidean or cosine distance
Cosine distance is based on the angle between the two vectors
Not intuitive, but it has been shown to work well
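The cosine distance can be sketched in a few lines (illustrative Python; the vectors below are made-up term-count rows, not the slide's actual data):

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity between two term-frequency vectors.
    0 means same direction (same term mix), 1 means orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

d1 = [24, 21, 0]   # hypothetical term counts for one document
d2 = [32, 10, 0]   # and for another
print(round(cosine_distance(d1, d1), 3))  # identical docs -> 0.0
print(round(cosine_distance(d1, d2), 3))  # small angle -> small distance
```

Note that cosine distance ignores document length: a document and its concatenation with itself are at distance 0.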
Document distance
R function: image() (renders the distance matrix as a heatmap)
Weighting in TD space
Not all phrases are of equal importance
e.g., David is less important than Beckham
If a term occurs frequently in many documents, it has less discriminatory power
One way to correct for this is inverse document frequency (IDF)
TF-IDF
[TF-IDF weighted term-document matrix for D1-D10 over the same terms (Database, SQL, Index, Regression, Likelihood, linear); weights range from 0.1 to 23.56. Column alignment was lost in extraction.]
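The IDF correction described above can be sketched as follows (illustrative Python; the weighting w = tf · log(N/df) is one common TF-IDF variant, and the slides do not specify the exact formula behind the table):

```python
import math

def tf_idf(tf_matrix):
    """Weight a raw term-count matrix (rows = docs, cols = terms) by
    inverse document frequency: w = tf * log(N / df), where df is the
    number of documents containing the term."""
    n_docs = len(tf_matrix)
    n_terms = len(tf_matrix[0])
    df = [sum(1 for row in tf_matrix if row[t] > 0) for t in range(n_terms)]
    return [[row[t] * math.log(n_docs / df[t]) if df[t] else 0.0
             for t in range(n_terms)]
            for row in tf_matrix]

# Toy data: term 0 appears in every doc, term 1 in only one.
counts = [[3, 0], [2, 1], [5, 0]]
weighted = tf_idf(counts)
# Term 0 occurs in all 3 docs, so log(3/3) = 0 and its weight vanishes:
print(weighted[0][0])  # -> 0.0
```

This captures the point of the slide: a term that appears in every document has no discriminatory power, so IDF drives its weight to zero.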
Building Blocks
Sentence Detection
Part of Speech Tagging
Parsing
Ch. 2
QA in Taming Text
Apache Solr for Passage Retrieval and
integration
Apache OpenNLP for sentence detection,
parsing, POS tagging and answer type
classification
Custom code for Query Parsing, Scoring
See com.tamingtext.qa package
Indexing
Ingest raw data into the system and make it
available for search
Garbage In, Garbage Out
Need to spend some time understanding and
modeling your data just like you would with a DB
Lather, rinse, repeat
Search engines
Clustering
Group together content based
on some notion of similarity
Book covers (ch. 6):
Search result clustering using
Carrot2
Whole collection clustering using
Mahout
Topic Modeling
In Focus: K-Means
http://en.wikipedia.org/wiki/K-means_clustering
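A bare-bones sketch of the k-means loop (illustrative Python on 2-D points; Mahout's implementation is far more robust, and initializing from the first k points is an assumption made here for determinism):

```python
def kmeans(points, k, iters=10):
    """Plain Lloyd's algorithm: alternate assignment and update steps.
    Initial centroids are simply the first k points (a naive choice;
    real implementations seed more carefully, e.g. k-means++)."""
    centroids = list(points[:k])
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: (p[0] - centroids[c][0]) ** 2
                                + (p[1] - centroids[c][1]) ** 2)
            clusters[i].append(p)
        # Update step: move each centroid to its cluster's mean.
        for i, cl in enumerate(clusters):
            if cl:
                centroids[i] = (sum(p[0] for p in cl) / len(cl),
                                sum(p[1] for p in cl) / len(cl))
    return centroids

# Two well-separated blobs; k-means recovers one centroid per blob.
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
print(sorted(kmeans(pts, 2)))
```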
Fuzzy-Wuzzy Strings
Common Approaches
See com.tamingtext.fuzzy package
Jaccard
Measure character overlap
Jaro-Winkler
Account for position
WordNet
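The character-overlap Jaccard measure can be sketched as follows (illustrative Python; the book's actual implementations live in com.tamingtext.fuzzy, and unlike Jaro-Winkler this measure ignores character position entirely):

```python
def jaccard(s1, s2):
    """Jaccard similarity over character sets: |A n B| / |A u B|.
    1.0 means the strings use exactly the same characters."""
    a, b = set(s1), set(s2)
    return len(a & b) / len(a | b)

print(jaccard("taming", "text"))    # only 't' is shared -> 0.125
print(jaccard("text", "text"))      # -> 1.0
```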