N.P. Singh
Professor
Web Mining & Text Mining
Text mining and web mining are two sub-areas of data mining,
both focused on less structured data.
Reasons for Growth:
ref: www.stanford.edu/class/cs276b/handouts/lecture10.ppt
APPLICATION
Bioinformatics
Why Biology Text Mining?
Strong motivations from biology side
Difficulty for biologists to access literature
No theory in biology, so we must keep all literature alive
Observations about the same biology mechanism may be described
in different terms (e.g., due to different perspectives of study)
Many unanswered research questions
Text mining may help better organize and link the biology literature,
and answer simple questions (e.g., what do we know about
this gene?)
Why Biology Text Mining? (cont.)
Potentially high impact from CS side
Any discovery from biology text could be potentially
significant
Biology text is relatively easy to mine
Literature is cleaner (compared with web data)
Biology text often has many annotations
Many other kinds of biology data can be exploited (e.g.,
DNA/Protein sequences, gene expression information, metabolic
networks)
Simple techniques may work
Characteristics of Biology Text
Large number of entities (e.g., genes, proteins) that
have well-defined semantics
No standard for terminology (inconsistencies)
Ambiguities (e.g., many acronyms)
Synonyms
High complexity in phrases and sentence structures
Research Topics
General goal: Applying known text mining techniques to help biology
research
Problem 1: Data/Information Integration
How can we integrate text information (discovering terminology linkages)?
How can we link text with databases (semantic interpretations of text on top of
entities/relations in DB, e.g., entity extraction)?
How can we integrate biology DBs (many fields are text)?
Problem 2: Functional annotations
How can we annotate a biological entity (e.g., a gene) with functional
information extracted from literature?
How can we annotate a set of related genes with functional information?
How can we exploit the ontologies/thesauri in biology?
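Entity extraction and terminology linkage (Problems 1 and 2) can be illustrated with a minimal dictionary-based sketch. This is not the method of any particular system: the lexicon, the function names `extract_entities` and `annotate`, and the sample sentences are all assumptions made for illustration. The idea is only that different surface forms (synonyms) are mapped to one canonical symbol, so that sentences can be attached to a database entity.

```python
import re

# Hypothetical mini-lexicon: surface form (lowercased) -> canonical gene symbol.
# A real system would load this from a curated database or thesaurus.
LEXICON = {
    "tp53": "TP53",
    "p53": "TP53",
    "erbb2": "ERBB2",
    "her2": "ERBB2",
}

TOKEN = re.compile(r"[A-Za-z0-9]+")


def extract_entities(sentence):
    """Return (surface form, canonical symbol) pairs found in the sentence."""
    hits = []
    for match in TOKEN.finditer(sentence):
        surface = match.group(0)
        canonical = LEXICON.get(surface.lower())
        if canonical is not None:
            hits.append((surface, canonical))
    return hits


def annotate(sentences):
    """Group sentences under the canonical entity they mention."""
    index = {}
    for sentence in sentences:
        for _, canonical in extract_entities(sentence):
            bucket = index.setdefault(canonical, [])
            if sentence not in bucket:
                bucket.append(sentence)
    return index


# Made-up example sentences, used only to exercise the sketch.
sentences = [
    "p53 regulates the cell cycle.",
    "TP53 is mutated in many tumors.",
    "No gene is mentioned here.",
]
annotations = annotate(sentences)
```

Both sentences that mention the gene end up under the single key `TP53`, even though they use different names for it. Exact token matching like this does nothing about ambiguous acronyms, which would need context-based disambiguation.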
Research Topics (cont.)
Problem 3: Data/Information Cleanup & Curation
How can we detect suspicious data/information in existing databases?
How can we automate many manual tasks of database curation?
Problem 4: Research question answering
How can we answer simple research questions? (e.g., what functional
connections are there between these two genes?)
How can we support exploratory access and digest of literature
information? (e.g., a biology research workbench)
Swanson Example
[Diagram: Migraine and Magnesium connected through the intermediate concepts
Stress, Spreading Cortical Depression, and Platelet Aggregability]
Ref 1: www.stanford.edu/class/cs276b/handouts/lecture10.ppt
Ref 2: Swanson, D. R. (1988). Migraine and magnesium: Eleven neglected connections.
Perspectives in Biology and Medicine, 31, 526-557
Swanson Example (cont.)
Observations:
Stress can lead to a loss of magnesium
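The linking idea in the Swanson example can be sketched as a search for intermediate terms: terms that co-occur with migraine in some documents and with magnesium in others, while the two topics never appear together. The sketch below is a toy under stated assumptions. Each "document" is a hand-written set of terms mirroring the concepts named on the slide, not real literature search results, and the function names are hypothetical.

```python
def cooccurring_terms(docs, topic):
    """Collect every term that shares at least one document with the topic."""
    terms = set()
    for doc in docs:
        if topic in doc:
            terms |= doc - {topic}
    return terms


def directly_linked(docs, a, c):
    """True if some document mentions both topics together."""
    return any(a in doc and c in doc for doc in docs)


def linking_terms(docs, a, c):
    """Return the intermediate terms that co-occur with both a and c."""
    shared = cooccurring_terms(docs, a) & cooccurring_terms(docs, c)
    return sorted(shared - {a, c})


# Hypothetical toy corpus: each set stands for the key terms of one document.
docs = [
    {"migraine", "stress"},
    {"stress", "magnesium"},
    {"migraine", "spreading cortical depression"},
    {"spreading cortical depression", "magnesium"},
    {"migraine", "platelet aggregability"},
    {"platelet aggregability", "magnesium"},
]

bridges = linking_terms(docs, "migraine", "magnesium")
```

In this toy corpus no document mentions migraine and magnesium together, yet three intermediate terms connect them. That is the kind of neglected connection the example is about.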
conducted
Three largest pizza chains: Pizza Hut, Domino's Pizza
relationships.
Enron Case:
Kenneth Lay and Tim Belden. Both actors played a
PR(A) = (1 - d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
Page Rank Algorithm
Example
Calculate C(A), C(B), C(C), C(D). Which one will be
highest?
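Since the figure for this exercise did not survive, the following is only a sketch on a hypothetical four-page link graph. It assumes the commonly stated form of the formula, PR(A) = (1 - d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)), where T1..Tn are the pages linking to A and C(T) is the number of outbound links of T. The damping factor d = 0.85, the graph, and the function names are assumptions, not values taken from the slides.

```python
def out_counts(links):
    """C(page): number of outbound links of each page."""
    return {page: len(targets) for page, targets in links.items()}


def pagerank(links, d=0.85, iterations=50):
    """Iterate PR(A) = (1 - d) + d * sum(PR(T) / C(T)) over pages T linking to A.

    Assumes every page has at least one outbound link, so C(T) is never zero.
    """
    counts = out_counts(links)
    pr = {page: 1.0 for page in links}
    for _ in range(iterations):
        new_pr = {}
        for page in links:
            incoming = sum(
                pr[src] / counts[src]
                for src, targets in links.items()
                if page in targets
            )
            new_pr[page] = (1 - d) + d * incoming
        pr = new_pr
    return pr


# Hypothetical link graph: page -> pages it links to.
links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}

counts = out_counts(links)
ranks = pagerank(links)
most_outlinks = max(counts, key=counts.get)
top_ranked = max(ranks, key=ranks.get)
```

In this graph page A has the most outbound links, while page C collects the highest rank because three pages link to it. Page D has no inbound links and therefore keeps only the base value 1 - d.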