
Text Mining & Web Mining

N.P. Singh

Web Mining & Text Mining
 Text mining and web mining are two sub-areas of data mining;
both focus on less structured data.
 Reason for growth:

◦ Size of unstructured data: 85% of an organization’s data is text
or otherwise less structured.

Text Mining
 Motivation for text Mining
◦ Approximately 90% of the world’s data is held in unstructured formats:
 Web pages
 Emails
 Technical documents
 Books
 Digital Libraries
 Customer complaint letters
 Transcripts of phone calls with customers
◦ Growing rapidly in size and importance

Challenges of Text Mining
 Information is in unstructured textual form
 Not well structured text
• Email/Chat rooms
 - “r u available ?”
 - “Hey whazzzzzz up”
• Speech
• Multilingual
 Large textual databases.
 Very high number of possible “dimensions” (but sparse):
◦ all possible word and phrase types in the language.
 Complex and subtle relationships between concepts in text
 “AOL merges with Time-Warner” vs. “Time-Warner is bought by AOL”

Challenges of Text Mining
 Word ambiguity and context sensitivity
◦ automobile = car = vehicle = Toyota
◦ Apple (the company) or apple (the fruit).
• Word ambiguity
 - Pronouns (he, she …)
 - Synonyms (buy, purchase…)
 - Words with multiple meanings (bat – is it related to baseball or to the animal?)
• Semantic ambiguity
 The king saw the rabbit with his glasses (multiple meanings)
 Noisy data
 Example: Spelling mistakes
• Abbreviations
• Acronyms

Business Opportunity


Application
 Bioinformatics

Why Biology Text Mining?
 Strong motivations from biology side
◦ Difficulty for biologists to access literature
  No theory in biology, so we must keep all literature “alive”
  Observations about the same biology mechanism may be described in different terms (e.g., due to different perspectives of study)
◦ Many unanswered research questions
◦ Text mining may help better organize and link biology literature, and answer simple questions (e.g., what do we know about this gene?)

Why Biology Text Mining? (cont.)
 Potentially high impact from CS side
◦ Any “discovery” from biology text could be potentially significant
◦ Biology text is relatively “easy” for mining
  Literature is cleaner (compared with web data)
  Biology text often has many annotations
  Many other kinds of biology data can be exploited (e.g., DNA/protein sequences, gene expression information, metabolic networks)
◦ Simple techniques may work

Characteristics of Biology Text
 Large number of entities (e.g., genes, proteins) that have well-defined semantics
 No standard for terminology (inconsistencies)
 Ambiguities (e.g., many acronyms)
 Synonyms
 High complexity in phrases and sentence structures

Research Topics
 General goal: Applying known text mining techniques to help biology research
 Problem 1: Data/Information Integration
◦ How can we integrate text information (discovering terminology linkages)?
◦ How can we link text with databases (semantic interpretations of text on top of entities/relations in DB, e.g., entity extraction)?
◦ How can we integrate biology DBs (many fields are text)?
 Problem 2: Functional annotations
◦ How can we annotate a biological entity (e.g., a gene) with functional information extracted from literature?
◦ How can we annotate a set of related genes with functional information?
◦ How can we exploit the ontologies/thesauri in biology?

Research Topics (cont.)
 Problem 3: Data/Information Cleanup & Curation
◦ How can we detect suspicious data/information in existing databases?
◦ How can we automate many manual tasks of database curation?
 Problem 4: Research question answering
◦ How can we answer simple research questions? (e.g., what functional connections are there between these two genes?)
◦ How can we support exploratory access and digest of literature information? (e.g., a biology research workbench)

Swanson Example
[Figure labels: All Migraine Research; All Nutrition Research; Ca Channel Blockers; Platelet Aggregability; Magnesium; Spreading Cortical Depression; Stress; Migraine]
Ref 1: www.stanford.edu/class/cs276b/handouts/lecture10.ppt
Ref 2: Swanson, D. R. (1988). Migraine and magnesium: Eleven neglected connections. Perspectives in Biology and Medicine, 31, 526-557.

Swanson Example
Observations:
 Stress can lead to a loss of magnesium
 Magnesium is a natural calcium channel blocker
 High levels of magnesium inhibit SCD (spreading cortical depression)
 Magnesium can suppress platelet aggregability
 Stress is associated with migraines
 Calcium channel blockers prevent some migraines
 SCD is implicated in some migraines
 Migraine patients have high platelet aggregability
Ref: Swanson, D. R. (1988). Migraine and magnesium: Eleven neglected connections. Perspectives in Biology and Medicine, 31, 526-557.

Swanson Example
• Hypothesis: Magnesium deficiency is related to migraine
• Found by extracting features from medical literature on migraines and nutrition
• Three of his hypotheses received experimental verification
• Information that not even the writer knows
• Literature might be full of such undiscovered connections

2nd Case
 Social media competitive analysis and text mining: A case study in the pizza industry

Continued…. Objectives
 This study examined the social media sites of the three largest pizza chains and applied text mining to analyze unstructured text content on their Facebook and Twitter sites. Specifically, the study attempts to answer the following questions:
 What patterns can be found from their Facebook sites respectively?
 What patterns can be found from their Twitter sites respectively?
 What are the main differences in terms of their Facebook and Twitter patterns?

Continued… Methodology
 The U.S. pizza industry is one of the first industries that has entered the social media arena for business purposes and has a large social media user base
 This is the reason social media competitive analysis is conducted
 Three largest pizza chains: Pizza Hut, Domino’s Pizza and Papa John’s Pizza in our case study

Text mining process for social media content.

Analysis: Market share

SN  Pizza chain   Market share
1   Pizza Hut     11.65%
2   Domino’s      7.60%
3   Papa John’s   4.23%
4   Others        (not given)
    Total         100%

Continued… Promotion of Business
 Pizza businesses promote sales to customers through various marketing channels such as direct mail, print coupons, newspapers, magazines, and TV advertising.
 Due to the rapid development of the Internet and the widespread use of Facebook, Twitter and YouTube by customers, more and more pizza stores are promoting their pizza business via social media.

More facts
 People actively look for specific media outlets and information for gratification purposes.
 Nearly half of the survey participants had looked for a restaurant recommendation by reading online reviews and information posted on blogs, Facebook and Twitter.
 85% of pizza-chain sales are now tied to promotions and discounts mostly acquired through social media sites.
 As social media becomes an increasingly popular media outlet among consumers, many pizza restaurants such as Pizza Hut and Domino’s have assigned specific staff members with responsibilities to engage customers and build an online community.

Continued… More facts
 By using these social media applications, customers can engage in activities such as customizing pizzas, giving praise and complaints, providing feedback to the pizza seller, and discussing pizza quality, tastes and deal information with peer customers.
 On the other hand, many pizza restaurants are using social media as a customer service tool to listen to customers and address their concerns.
 Currently, large pizza chains are focusing their social media use on Facebook and Twitter.

Process of Analysis
 To answer the research questions, we conducted a social media competitive analysis for the Facebook and Twitter sites of the Big Three by following two phases.
 First, quantitative data was collected manually from their individual social media sites, such as number of fans/followers, number of postings, frequency of posting, comments, shares and likes.
 Secondly, text mining was applied to analyze the text messages posted on their Facebook and Twitter sites in order to discover new knowledge and patterns, and to acquire a deeper understanding of how the three pizza chains are using social media in practice.
 As October is the busiest month of the year in the pizza industry, the study used the posts collected between October 1, 2011 and October 31, 2011 as the sample for text mining.
 The posts were saved into Excel spreadsheets for analysis.

Continued. Tools used
 Two leading tools in textual data analysis and mining, the SPSS Clementine text mining tool and Nvivo 9, were used to facilitate the mining and analysis.
 Raw data was transformed into a usable format, mainly by cleaning, assigning attributes, and integrating data.
 Subsequently, data mining and text mining techniques were applied to examine the data sets in order to gain insights about participants’ social media activities.

Trend of tweet numbers in October for the Big Three.

Pizza Hut’s customer engagement trend in October, 2011.

Domino’s customer engagement trend in October, 2011.

Papa John’s customer engagement trend in October, 2011.

Examples related to ordering and delivering.

Examples related to the quality of their pizzas.

A summary of the six main themes on Facebook sites.

Analysis of E-mail Data Set
 Rule Mining
 Link Analysis & Content Analysis
 Enron Fraud Case.
 E-mail dataset = 517,431 messages
 Messages belong to 150 mailboxes
 Link analysis can extract dynamic movies of the evolution of social networks, identifying gatekeepers and other central actors, as well as generate temporally correlated cluster maps of e-mail content.

Enron Case – Three Approaches to get Some Meaningful Information
 Filtering out messages with potentially suspicious contents, and then focusing on the social network created by those messages,
 Doing a large-scale social network analysis, judging actors by their closeness to suspicious people, and
 Searching for clusters of suspicious activity, by looking for what we term “collaborative innovation networks” (COINs).

Enron Case Content Analysis
 Objective: To reduce the size of the data, keeping only the e-mails of the “evils” (suspicious actors).
 There is a common language of evil e-mails. It is not consistent, but one can make sense of it through patterns.
 “Package” vs. “bombs”, etc.
 The following words or combinations were used to filter the e-mails.
◦ Affairs (criminals do not use clear words)
◦ FERC
◦ Devastating (they know what is coming up)
◦ Investigation (dangerous things)
◦ Disclosures (dangerous things)
◦ Bonus (most important thing)
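
The keyword filtering step described on this slide could be sketched roughly as follows. This is a minimal illustration, not the tooling used in the Enron analysis: the function name `filter_suspicious`, the sample messages, and the case-insensitive whole-word matching rule are all assumptions; only the keyword list is taken from the slide.

```python
import re

# Keywords taken from the slide; everything else here is illustrative.
KEYWORDS = ["affairs", "FERC", "devastating", "investigation", "disclosures", "bonus"]

# One case-insensitive pattern that matches any keyword as a whole word.
PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(k) for k in KEYWORDS) + r")\b",
    re.IGNORECASE,
)

def filter_suspicious(messages):
    """Return only the messages that contain at least one keyword."""
    return [m for m in messages if PATTERN.search(m)]

# Hypothetical sample messages (not taken from the Enron corpus).
sample = [
    "Lunch at noon?",
    "The FERC investigation could be devastating.",
    "Please review the bonus schedule.",
]
kept = filter_suspicious(sample)
print(kept)
```

A filter like this only shrinks the data set; the surviving messages would still need the link and content analysis described on the following slides.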

View of these communications

Content Combination

Concept Map of Interest

Few Actors of Enron

Birds of a Feather Flock Together
 What is the relation between actors?

Establishing Relations using Data Mining Tools
 Common ways to establish a relation
 A relationship between two actors exists if there is direct communication between them.
 The more they communicate, the more intensive their relationship is.
 Enron Case:
 Kenneth Lay and Tim Belden. Both actors played a central role at Enron during the 2001 Californian energy crisis.
 But there is no communication between them

Continued….
 But they have 13 common communication partners:
 David Forster, Liz Taylor, Sally Beck, Greg Whalley, Karen Denne, David Oxley, Philipp Allen, Mark Palmer, Sarah Novosel, Richard Shapiro, Jeff Skilling, David Delainey, and Steven Kean.
 Out of which 6 appear in our Enron main actor list; the names recoverable here are David Delainey, Sally Beck, Jeff Skilling, Richard Shapiro, and Steven Kean.
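
The idea of relating two actors through shared communication partners, even when they never write to each other directly, can be sketched as a set intersection. This is an illustrative sketch only: the function `common_partners`, the (sender, recipient) input format, and the placeholder actor names are assumptions, not the method or data of the original study.

```python
def common_partners(emails, actor_a, actor_b):
    """Return (direct_contact, shared_partners) for two actors.

    `emails` is an iterable of (sender, recipient) pairs.
    """
    partners = {}
    for sender, recipient in emails:
        # Treat communication as undirected for the purpose of "partners".
        partners.setdefault(sender, set()).add(recipient)
        partners.setdefault(recipient, set()).add(sender)
    a = partners.get(actor_a, set())
    b = partners.get(actor_b, set())
    direct = actor_b in a
    shared = (a & b) - {actor_a, actor_b}
    return direct, shared

# Hypothetical toy data with placeholder names.
emails = [
    ("actor_1", "p1"), ("p1", "actor_2"),
    ("actor_1", "p2"), ("p2", "actor_2"),
    ("actor_1", "p3"), ("actor_2", "p4"),
]
direct, shared = common_partners(emails, "actor_1", "actor_2")
print(direct, sorted(shared))
```

In this toy data the two actors never exchange a message, yet two partners sit between them; such partners are candidates for the gatekeeper role.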

Continued…… Gatekeepers Between Lay & Belden

Searching Innovation Structures
 Three types of communities work together to form an ecosystem of interconnected communities
 COINs (Collaborative Innovation Networks)
 CLNs (Collaborative Learning Networks)
 CINs (Collaborative Interest Networks)

COINs (Collaborative Innovation Networks)
 COINs (Collaborative Innovation Networks) develop around a small core group of people over time.
 Around the core team, there are people linked to only one or two of the core team members.
 A COIN has high density and relatively low group betweenness centrality.

A typical visualisation of a CLN
 In a CLN (Collaborative Learning Network), a small group of subject matter experts talking among themselves is developing around the coordinator in the center of the graph, who builds a learning network.
 The communication activities are arranged around them, communicating with a large group of other community members, who are not communicating among themselves.

CIN
 In a CIN (Collaborative Interest Network) there are different small teams, operating as isolated islands.
 There is no clear center in this graph; different people are acting as local hubs. Over time, the structural holes are filling up, until the network is almost fully connected.

Three Communities with Jeffrey Skilling at Center

Results
 As innovation can be done for good or for bad, discovering potential COINs is most interesting for gathering intelligence from e-mail data analysis
 The next figure shows two COINs of the Enron e-mail data.
 Almost all the suspicious people appear in the bigger of the two COINs.

Two COINs of Enron E-Mails

Further Analysis
 Actors can have certain roles within their respective communities, e.g. as a gatekeeper, who connects at least two communities, or a leader, or a knowledge expert.
 Besides identifying those roles visually in the social network graph, they can also be found by calculating their contribution index.
 Roles of different actors can be obtained by measuring differences in their contribution frequency (measured in the numbers of messages sent), and the extent to which their communication is balanced between sending and receiving messages, which we measured via a simple contribution index:

Index
 [messages sent – messages received] / total of messages sent and received.
 This index is –1 for somebody who only receives messages, 0 for somebody who sends and receives the same number of messages, and +1 for somebody who only sends messages.
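
The contribution index defined on this slide is simple enough to state directly in code. A minimal sketch follows; the function name `contribution_index` and the choice to raise an error for an actor with no messages are assumptions, since the slide does not say how that case is handled.

```python
def contribution_index(sent, received):
    """(messages sent - messages received) / (messages sent + messages received)."""
    total = sent + received
    if total == 0:
        # The slide does not define the index for an actor with no traffic;
        # raising an error is an assumption made for this sketch.
        raise ValueError("actor has neither sent nor received messages")
    return (sent - received) / total

print(contribution_index(0, 10))   # only receives
print(contribution_index(5, 5))    # balanced
print(contribution_index(10, 0))   # only sends
```

The three calls correspond to the three reference cases named on the slide: a pure receiver, a balanced communicator, and a pure sender.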


Conclusion
 Most suspects involved in the COIN structure above are now located in the blue circle. It is straightforward to recognize communities and their leaders.
 It is anticipated that by applying a similar approach to correlate suspicious activity with temporal communication structure, government, intelligence and law enforcement professionals will be able to analyze, in near-real time, “conversations” and links between email traffic, blogs, message boards and other network-based communication, discovering hidden links and emerging trends.
 This way, compliance analysts and members of the legal community can track emerging trends in email conversations – who is writing to whom about what, and when and where they are writing from.

Web Mining
 Web Structure Mining:
 Web structure mining is concerned with discovering the model underlying the link structure of the web.
 It is used to study the topology of the hyperlinks without the description of the links.
◦ The model can be used to categorize web pages and is useful to generate information such as similarity & relationships between web sites.

Algorithm of Web Structure Mining
 Page Rank Algorithm
 HITS (Hyperlink-Induced Topic Search)
 CLEVER

Web Mining
 Web Content mining
◦ Web content consists of several types of data, i.e., textual, image, audio, video, metadata, as well as hyperlinks
 Tools based on two approaches of mining
◦ Agent-Based Approach
◦ Database Approach

Agent-Based Approach
 The agent-based approach to Web mining involves the development of sophisticated AI systems that can act autonomously or semi-autonomously on behalf of a particular user, to discover and organize Web-based information.
◦ Intelligent Search Agents: Several intelligent Web agents have been developed that search for relevant information using characteristics of a particular domain (and possibly a user profile) to organize and interpret the discovered information.
◦ Harvest, FAQ-Finder, Information Manifold, Parasite

Continued
 Information Filtering/Categorization
◦ A number of Web agents use various information retrieval techniques and characteristics of open hypertext Web documents to automatically retrieve, filter, and categorize them.
◦ HyPursuit
 Personalized Web Agents
◦ Web agents include those that obtain or learn user preferences and discover Web information sources that correspond to these preferences, and possibly those of other individuals with similar interests (using collaborative filtering).
◦ WebWatcher

Database Approach
 The database approaches to Web mining have generally focused on techniques for integrating and organizing the heterogeneous and semi-structured data on the Web into more structured and high-level collections of resources, such as in relational databases, and using standard database querying mechanisms and data mining techniques to access and analyze this information.
 Multilevel Databases: ARANEUS system
 Web Query Systems: W3QL, UnQL

Web Mining
 Three Types
◦ Web Structure Mining
◦ Web Content Mining
  Web Page Content Mining
  Search Result Mining
◦ Web Usage Mining
  General Access Pattern Tracking
  Customized Usage Tracking

Web Mining
 Web usage mining
 Deals with studying the data generated by the web surfer’s sessions or behavior.
 Web content & structure mining utilizes the real or primary data on the web. On the contrary, web usage mining mines the secondary data such as access logs, proxy server logs, browser logs, user profiles, registration data, user sessions or transactions, cookies, user queries, bookmark data, mouse clicks, scrolls etc.
 Data can be accumulated by the web server
 Two approaches
◦ General Access Pattern Tracking and
◦ Customized Usage Tracking.

Web Usage Mining
 General Access Pattern Tracking
◦ To learn user navigation patterns (impersonalized).
◦ To understand access patterns & trends.
◦ Analyses can shed better light on the structure and grouping of the resource providers.
 Customized Usage Tracking
◦ To learn user profile or user modeling in adaptive interfaces (personalized).
◦ Customized usage tracking analyzes individual trends
◦ Helps in customizing the web sites to the users.

Web Usage Mining
 Mining Techniques
◦ The first approach maps the usage data of the web server into relational tables before data mining techniques are applied (classification & clustering)
◦ The second approach uses the log data directly. It can be represented with graphs.
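
The first approach, mapping server usage data into table-like rows before mining, might look roughly like the sketch below. It assumes the log uses the Common Log Format; the regular expression, the field names, the function `parse_log`, and the sample lines are all illustrative assumptions rather than part of the original slides.

```python
import re
from collections import Counter

# Pattern for one Common Log Format line (assumed log layout).
LOG_LINE = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\S+)'
)

def parse_log(lines):
    """Turn raw log lines into a list of row dictionaries, skipping malformed lines."""
    rows = []
    for line in lines:
        match = LOG_LINE.match(line)
        if match:
            rows.append(match.groupdict())
    return rows

# Hypothetical sample log lines.
sample_log = [
    '10.0.0.1 - - [01/Oct/2011:10:00:00 +0000] "GET /menu HTTP/1.1" 200 512',
    '10.0.0.1 - - [01/Oct/2011:10:00:05 +0000] "GET /order HTTP/1.1" 200 1024',
    '10.0.0.2 - - [01/Oct/2011:10:01:00 +0000] "GET /menu HTTP/1.1" 200 512',
    'not a log line',
]
rows = parse_log(sample_log)

# A simple general-access-pattern summary: hits per page.
hits_per_page = Counter(row["path"] for row in rows)
print(hits_per_page.most_common())
```

Once the log is in row form, the rows could be loaded into a relational table and grouped by host or session for classification or clustering.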

Web structure mining
 The web, by its very nature as hypertext, shows a structure that can be described by means of graphs. The visualization of said structure, of crucial importance to understanding the results of web mining, is reduced to that of graph drawing. This is nevertheless a vast field with many possibilities, some of which are presented in this issue.




Page Rank Algorithm

Page Rank Algorithm
 Page Rank is one of the methods Google uses to determine a page’s relevance or importance
 Page Rank is used by search engine optimization experts.
 Questions for discussion:
◦ How is Page Rank calculated?
◦ How is Page Rank used?

Definitions
 PR: Page Rank: Page rank is calculated for each page. Varies from 0.15 to billions.
 Toolbar PR: The page rank displayed in the Google toolbar in your browser. This rank ranges from 0 to 10.
 Back Link: if page A links out to page B, then page B is said to have a back link from page A.

Page Rank
 It is a vote by all other pages on the web, about how important a page is.
 A link to a page counts as a vote of support.
 If there is no link there is no support (but it is an abstention from voting rather than a vote against the page)

Page Rank Algorithm
 We assume page A has pages T1...Tn which point to it (i.e., are citations). The parameter d is a damping factor which can be set between 0 and 1. We usually set d to 0.85. Also C(A) is defined as the number of links going out of page A. The PageRank of a page A is given as follows:
 PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))

Page Rank Algorithm
 PR(Tn) - Each page has a notion of its own self-importance. That’s “PR(T1)” for the first page in the web all the way up to “PR(Tn)” for the last page
 C(Tn) - Each page spreads its vote out evenly amongst all of its outgoing links. The count, or number, of outgoing links for page 1 is “C(T1)”, “C(Tn)” for page n, and so on for all pages.
 PR(Tn)/C(Tn) - So if our page (page A) has a backlink from page “n”, the share of the vote page A will get is “PR(Tn)/C(Tn)”

Page Rank Algorithm
 d(... - All these fractions of votes are added together but, to stop the other pages having too much influence, this total vote is “damped down” by multiplying it by 0.85 (the factor “d”)
 (1 - d) - The (1 – d) bit at the beginning is a bit of probability math magic so the “sum of all web pages' PageRanks will be one”: it adds in the bit lost by the d(.... It also means that if a page has no links to it (no backlinks) even then it will still get a small PR of 0.15 (i.e. 1 – 0.85). (Aside: the Google paper says “the sum of all pages” but they mean “the normalised sum”, otherwise known as “the average” to you and me.)

Example
 PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
 Each page has one outgoing link (the outgoing count is 1, i.e. C(A) = 1 and C(B) = 1).
 d = 0.85
 PR(A)= (1 – d) + d(PR(B)/1)
 PR(B)= (1 – d) + d(PR(A)/1)
i.e., starting from a guess of 1 for each page:
 PR(A)= 0.15 + 0.85 * 1 = 1
 PR(B)= 0.15 + 0.85 * 1 = 1

Guess 2
 Let’s start the guess at 0 instead and re-calculate:
 PR(A)= 0.15 + 0.85 * 0 = 0.15
 PR(B)= 0.15 + 0.85 * 0.15 = 0.2775
NB. we’ve already calculated a “next best guess” at PR(A) so we use it here
And again:
 PR(A)= 0.15 + 0.85 * 0.2775 = 0.385875
 PR(B)= 0.15 + 0.85 * 0.385875 = 0.47799375
 And again
 PR(A)= 0.15 + 0.85 * 0.47799375 = 0.5562946875
 PR(B)= 0.15 + 0.85 * 0.5562946875 = 0.622850484375
 and so on.
 The numbers just keep going up.
 But will the numbers stop increasing when they get to 1.0?
 What if a calculation over-shoots and goes above 1.0?

Guess 3
 Let’s start the guess at 40 each and do a few cycles:
 PR(A) = 40 PR(B) = 40
 First calculation
 PR(A)= 0.15 + 0.85 * 40 = 34.15
 PR(B)= 0.15 + 0.85 * 34.15 = 29.1775
 And again
 PR(A)= 0.15 + 0.85 * 29.1775 = 24.950875
 PR(B)= 0.15 + 0.85 * 24.950875 = 21.35824375
 Numbers are heading down alright!
 It sure looks like the numbers will get to 1.0 and stop
 Principle:
 It doesn’t matter where you start your guess; once the PageRank calculations have settled down, the “normalized probability distribution” (the average PageRank for all pages) will be 1.0
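
The repeated guessing for the two-page example can be reproduced with a short loop. The sketch below is an illustration written for this purpose, not the code from the original presentation; the function name `iterate_two_pages` and the choice of 100 cycles are assumptions. It follows the slides in using the freshly computed PR(A) when updating PR(B).

```python
def iterate_two_pages(start, d=0.85, cycles=100):
    """Repeat the slide's calculation for pages A and B, which link only to each other."""
    pr_a = pr_b = float(start)
    history = []
    for _ in range(cycles):
        # C(B) = 1, so page A receives the whole of B's vote.
        pr_a = (1 - d) + d * pr_b / 1
        # Uses the PR(A) just computed (the "next best guess"), as on the slides.
        pr_b = (1 - d) + d * pr_a / 1
        history.append((pr_a, pr_b))
    return history

from_zero = iterate_two_pages(0)
from_forty = iterate_two_pages(40)

print("first cycle from 0: ", from_zero[0])
print("last cycle from 0:  ", from_zero[-1])
print("first cycle from 40:", from_forty[0])
print("last cycle from 40: ", from_forty[-1])
```

Running both starting guesses side by side illustrates the principle stated above: the starting value changes the path the numbers take, not where they settle.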

Example
 Calculate C(A), C(B), C(C), C(D). Which one will be highest?

Example
 PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
 d = 0.85
 PR(A)=0, PR(B)=0, PR(C)=0, PR(D)=0.
 # forward links
# a -> b, c - 2 outgoing links
# b -> c - 1 outgoing link
# c -> a - 1 outgoing link
# d -> c - 1 outgoing link

Example
 # "backward" links (what's pointing to me?)
# a <= c
# b <= a
# c <= a, b, d
# d - nothing
 PR(A) = (1-d) + d * PR(C)/C(C) = .15 + .85*0 = .15
 PR(B) = (1-d) + d * (PR(A)/C(A)) = .15 + .85*(.15/2) = 0.21375
 PR(C) = (1-d) + d * (PR(A)/C(A) + PR(B)/C(B) + PR(D)/C(D)) = .15 + 0.85 (0.15/2 + 0.21375/1 + 0/1) = 0.3954375
 PR(D) = 1 - d = 0.15
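
The same calculation can be carried through many passes for the four-page example. The sketch below is only an illustration: the function `pagerank_sweeps`, the dictionary representation of the links, and the number of sweeps are assumptions. It also assumes the backward-link list (c <= a, b, d) describes the intended graph, so page d links to page c, and it updates pages in the order a, b, c, d using each new value immediately.

```python
def pagerank_sweeps(links, d=0.85, sweeps=200, start=0.0):
    """Iterate PR(A) = (1-d) + d * sum(PR(T)/C(T)) over all pages.

    `links` maps each page to the list of pages it links to.
    Returns a list with the PageRank values after each sweep.
    """
    pr = {page: start for page in links}
    # Build the "backward" links: which pages point at each page.
    backlinks = {page: [] for page in links}
    for src, targets in links.items():
        for dst in targets:
            backlinks[dst].append(src)
    history = []
    for _ in range(sweeps):
        for page in links:
            votes = sum(pr[src] / len(links[src]) for src in backlinks[page])
            pr[page] = (1 - d) + d * votes
        history.append(dict(pr))
    return history

# Forward links of the four-page example (d -> c is assumed, see above).
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
history = pagerank_sweeps(links)
first, final = history[0], history[-1]
print("after one sweep:", first)
print("after settling: ", final)
```

Page d has no backlinks, so it stays at 1 - d throughout, while the loop between a and c lets those two pages accumulate most of the rank.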


Ranks

Rank

Example
 Fig

Ranks

Examples
 Observation: a hierarchy concentrates votes and PR into one page

Example
 Hierarchical – but with a link in and one out.
 Site A contributed 0.85 PR to us, but the raised PR in the “About”, “Product” and “More” pages has had a lovely “feedback” effect, pushing up the home page’s PR even further!
 Principle: a well structured site will amplify the effect of any contributed PR

Example
 The vote of the “Product” page has been split evenly between it and the external site. We now value the external Site B equally with our “More” page. The “More” page is getting only half the vote it had before – this is good for Site B but very bad for us!

Example

A case study of Pizza Industry
 Questions to be answered
 Data is taken from the three largest pizza chains.
◦ Pizza Hut (11.65%)
◦ Domino’s Pizza (7.60%)
◦ Papa John’s Pizza (4.23%)
◦ Total around 23%
 What patterns can be found from their Facebook sites?
 What patterns can be found from their Twitter sites?
 What are the main differences in terms of their Facebook and Twitter patterns?


Social media use as of October 2011

Trend in tweet numbers for October for the Big Three


Pizza Hut’s customer engagement trends in October 2011

Domino’s customer engagement trends in October 2011

Papa John’s customer engagement trends in October 2011

Five themes
 Ordering & Delivery
◦ Percentage of customers sharing their experience, feelings and emotions
◦ What else
◦ ……….
◦ ………….
◦ --------------
◦ -------------

Pizza Quality

Feedback on Customer Purchase Decision

Continued….
 Casual Socialization Tweets
 Marketing Tweets