
University of Malta | Shashi Narayan

Documentation: Search Engine



With the growth of the World Wide Web (WWW), finding information and getting appropriate results have become its most significant problems. A Search Engine (SE) tries to solve these problems. The main challenges for such systems are handling an enormous amount of dynamic data (the data storage problem), presenting results in real time (the time efficiency problem) and fetching relevant, ranked results for each query (the relevancy decision problem). Google and Yahoo are among the largest SEs on the current WWW.
This short documentation presents a very simple search engine consisting of a simple indexing system and a query processing system. The corpus of this SE is static, and the relevancy decision is based on the vector space model with cosine similarity weighting. The following sections cover the methodology (approach, assumptions and implementation details), some test cases and their evaluation, discussion and conclusions. All code is written in Python.
Methodology
This section discusses our approach, assumptions and implementation details. The process of engineering the Search Engine can be divided into several smaller steps, which are described here in the order in which they are carried out.
Details of Data Used
This SE is based on a small, static data set of 378 HTML documents, most of them related to H1N1 swine flu. A few of these documents are actually in PDF or XML format, but because every file carries the html extension, the tokenization process does not distinguish them and treats each one as a plain HTML file. The source of the data can be found here.
Because the data is small and static, the SE does not need to crawl the corpus frequently, and the information extracted from all documents can be stored in memory without any problem.
Information Retrieval Model
We use the vector space model with cosine similarity weighting. In this IR model, each document in the corpus, and the query, is converted to an equivalent vector. The dimensions of these vectors are the keywords (tokens) extracted from the documents. To find the documents relevant to a query, the query vector is compared with each document vector using cosine similarity. The documents with the highest cosine similarity are the desired results. The whole process can be divided into the following sub-processes.

Tokenization Process
Tokenization converts each document into a set of keywords, or tokens. A token (case insensitive) is a word of the document that is taken to represent the document's theme or meaning. In this SE, all words of a document that remain after processing are considered tokens. The tokenization process can be summarized in the following steps.
1. Text Extraction from HTML File:
As mentioned earlier, the corpus contains HTML documents, so each document needs to be parsed to extract the text from the HTML tags. First, to make regular expression matching faster, all comments (/* */ and <!-- -->) are removed to reduce the document size. After that, the two main parts of an HTML document, the head part and the body part, are processed one by one.
a. Head Part (<head></head>)
Although the text in the head part is not directly visible to users, it can be used to improve the SE. The text from the title tag and from the meta tags named title, description, summary and keywords is extracted and stored for each document in a Python list, $L_{\text{Doc Details}}$. This text is used in two ways:
- The text (in its original case) from title, description and summary is used to present the link to each relevant document attractively on the result page of a query. The title text gives the link its title, and the description and summary text are used for snippet generation.
- All of this text (case insensitive) is also used for token extraction for its document. This improves the SE as long as web designers do not misuse meta tags to mislead the results. I checked our small corpus and found no such misuse. Note that Google does not use this text, in order to prevent such misuse of meta tags.
b. Body Part (<body></body>)
The text from the body part is the main source for token extraction. First, the script (<script></script>) and style (<style></style>) parts are removed from the body. Then the remaining HTML tags (<>) are replaced by the empty string. This produces tag-free text from the body part, which (case insensitive) is used for token extraction.
Some files have problems with the opening or closing of the head or body tags; these are handled with appropriate regular expressions.
For documents that contain neither a head nor a body (such as a document with the html extension that is actually PDF or XML), no text is extracted.

2. Punctuation Removal:
Text generally contains various punctuation symbols, but these symbols do not themselves help in understanding a document's theme, so they can be filtered out during token extraction. In this SE, all punctuation symbols listed on the Wikipedia Punctuation page are considered. Note that this step also reduces the number of distinct tokens: for example, "world" and "world?" both produce the single token "world", and "?" is filtered out.

3. Stop Word Removal:
Not all words in the text are considered tokens. Some words occur frequently in almost all documents, so their discrimination power is negligible. Such words are called stop words, and they can be filtered out during tokenization. Different stop word lists [Probable Google's, Onix 1, Onix 2, WordNet and RDS] have been tried in combination, individually and not at all; switching between them needs only a small update to the global information of our system's implementation.

4. Stemmer:
A token represents the theme of the document. Different words with the same root do not add any information about the theme. For example, the words "play", "playing", "plays" and "played" all share the root "play" and all express the same theme, "to play". So, to avoid token sparseness, all of these words can be replaced by their root "play" without losing any information. For this purpose, a stemmer can be used to convert every word to its root, and the root is then taken as the token. The current SE uses the Porter Stemmer. With a small update to the global information file of our system's implementation, the stemmer can also be switched off.
Note that the regular expression for punctuation, the list of stop words and the use of the stemmer can all be changed directly in the Global Information file of our system's implementation.
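The four tokenization steps can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code of the actual system: the names `extract_text`, `tokenize`, `STOP_WORDS` and `PUNCTUATION_RE` are hypothetical, the stop word set is a tiny sample, the punctuation set is Python's `string.punctuation` rather than the Wikipedia list, meta tags are left out for brevity, and the stemmer is passed in as a function (identity by default) so that a Porter implementation could be plugged in.

```python
import re
import string

# Hypothetical stand-ins for the settings kept in the Global Information file.
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "to"}
PUNCTUATION_RE = re.compile("[" + re.escape(string.punctuation) + "]")


def extract_text(html):
    """Return (title, body_text) from raw HTML using regular expressions."""
    # Drop <!-- --> and /* */ comments first to shrink the document.
    html = re.sub(r"<!--.*?-->|/\*.*?\*/", " ", html, flags=re.S)
    head = re.search(r"<head\b[^>]*>(.*?)</head>", html, flags=re.S | re.I)
    body = re.search(r"<body\b[^>]*>(.*?)</body>", html, flags=re.S | re.I)
    title = ""
    if head:
        m = re.search(r"<title\b[^>]*>(.*?)</title>", head.group(1),
                      flags=re.S | re.I)
        title = m.group(1).strip() if m else ""
    body_text = ""
    if body:
        # Remove script and style blocks, then every remaining tag.
        text = re.sub(r"<(script|style)\b.*?</\1\s*>", " ", body.group(1),
                      flags=re.S | re.I)
        body_text = re.sub(r"<[^>]+>", " ", text)
    return title, body_text


def tokenize(text, stem=lambda word: word):
    """Lower-case, strip punctuation, drop stop words, then stem each word."""
    words = PUNCTUATION_RE.sub(" ", text.lower()).split()
    return [stem(w) for w in words if w not in STOP_WORDS]
```

A file with neither head nor body (for example a PDF saved with the html extension) yields empty strings here, which matches the behaviour described above.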
Inverted Index Building
The tokenization process constructs the set of tokens for each document. At the same time, for each token, the list of documents in which it occurs is constructed. This is called the inverted index. The inverted index makes it easy to access the different weight values for a particular token. A Python dictionary is used for this purpose.
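$$
\begin{aligned}
D_{\text{Inverted Index}} &= \{\text{key} : \text{value}\} \\
\text{key} &= \text{token } t_i \\
\text{value} &= [\,L_{\text{frequency}},\ df\,]
\end{aligned}
$$
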
The keys of the dictionary $D_{\text{Inverted Index}}$ are the tokens $t_i$, and the value corresponding to each token is a list of two elements, $L_{\text{frequency}}$ and $df$. $L_{\text{frequency}}$ is the list of frequencies of the corresponding token in all documents, so the size of $L_{\text{frequency}}$ is the number of documents in the corpus. $df$ (document frequency) is the number of documents in the corpus in which the corresponding token occurs at least once.
Thus, by the end of the tokenization process, $D_{\text{Inverted Index}}$ and $L_{\text{Doc Details}}$ are completely filled.
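As a sketch of this data structure, the following Python builds a dictionary that maps each token to a two-element list holding the per-document frequency list and the document frequency. The function name `build_inverted_index` and the input format (one token list per document) are assumptions made for the example.

```python
from collections import Counter


def build_inverted_index(doc_tokens):
    """Map each token to [frequency list, df] as described above.

    doc_tokens: list of token lists, one per document (index j = document d_j).
    """
    n_docs = len(doc_tokens)
    index = {}
    for j, tokens in enumerate(doc_tokens):
        for token, count in Counter(tokens).items():
            if token not in index:
                # The frequency list has one slot per document; df starts at 0.
                index[token] = [[0] * n_docs, 0]
            index[token][0][j] = count
            index[token][1] += 1  # token occurs at least once in d_j
    return index
```

For three toy documents `[["swine", "flu", "flu"], ["flu"], ["blog"]]`, the entry for "flu" is `[[2, 1, 0], 2]`: it occurs twice in the first document, once in the second, and in two documents overall.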
TF-IDF Weight Calculation
$$tf_{ij} = \frac{f_{ij}}{\max_{k} f_{kj}}, \qquad idf_i = \log_2\left(\frac{N}{df_i}\right)$$

$tf_{ij}$ is the term frequency of token $t_i$ in document $d_j$. $f_{ij}$ is the frequency of token $t_i$ in document $d_j$. Note that $f_{ij}$ can be retrieved from $L_{\text{frequency}}$ of key $t_i$ in $D_{\text{Inverted Index}}$. $tf_{ij}$ is normalized by the frequency of the most frequently occurring token in document $d_j$. $idf_i$ is the inverse document frequency of token $t_i$. $N$ is the total number of documents in the corpus. $df_i$ is the document frequency of token $t_i$. Note that $df_i$ can be retrieved from the $df$ value of key $t_i$ in $D_{\text{Inverted Index}}$. Also note that the $idf_i$ value is constant for a token and does not vary across documents.

$$W_{ij} = tf_{ij} \cdot idf_i$$

where $W_{ij}$ is the TF-IDF weight of token $t_i$ in document $d_j$. Note that $W_{ij}$ needs no extra data structure; it can be stored in $D_{\text{Inverted Index}}$ itself. The only change is that $L_{\text{frequency}}$ now contains the list of TF-IDF weights of the corresponding token $t_i$ for all documents. By going through all the keys (tokens) in $D_{\text{Inverted Index}}$ for a fixed index $j$ in $L_{\text{frequency}}$, the vector $V_{d_j}$ for document $d_j$ can be retrieved very easily:

$$V_{d_j} = (W_{1j}, W_{2j}, \ldots, W_{Mj})$$

where $M$ is the total number of tokens present in the corpus.
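The weight calculation can be illustrated with a short Python sketch. It assumes the inverted index is a dictionary of the form `{token: [frequency list, df]}` and overwrites the frequency list with TF-IDF weights in place, as the text describes; the function name `tfidf_weights` is hypothetical.

```python
import math


def tfidf_weights(index, n_docs):
    """Replace raw frequencies in index[token][0] with TF-IDF weights, in place."""
    # Highest raw frequency of any token in each document d_j.
    max_freq = [0] * n_docs
    for freqs, _df in index.values():
        for j, f in enumerate(freqs):
            max_freq[j] = max(max_freq[j], f)
    for entry in index.values():
        freqs, df = entry
        idf = math.log2(n_docs / df)  # constant for the token
        entry[0] = [
            (f / max_freq[j]) * idf if max_freq[j] else 0.0
            for j, f in enumerate(freqs)
        ]
    return index
```

Note that a token occurring in every document gets an idf of zero, so it contributes nothing to any document vector.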
Query Vector Construction
A query is a small text that is itself treated as a document $d_q$. The only difference between the query document and the other documents in the corpus is that the query document is already plain text (free of HTML tags), so tokenization can start with punctuation removal. The same punctuation regular expression, stop word list and stemmer that are used to extract tokens from the corpus documents are used to extract tokens from the query document. Tokens extracted from the query document that do not occur in $D_{\text{Inverted Index}}$ are filtered out. The remaining tokens form the dimensions of the query document vector $V_{d_q}$. Note that because the query text is short, the dimension $m$ of $V_{d_q}$ is very small in comparison to $M$.

$$V_{d_q} = (W_{1q}, W_{2q}, \ldots, W_{mq}), \quad m \ll M$$

Calculation of the TF-IDF weight ($W_{iq}$) for token $t_i$: $tf_{iq}$ can be calculated very easily with the formula described above, because it depends only on the query document $d_q$. Calculating $idf_i$ is slightly less obvious, because $df_i$ and $N$ could be taken over the existing corpus plus the query document; including the query document would increase both by one. In the current SE, $df_i$ and $N$ are taken independently of the query document, so $idf_i$ is the discrimination power of token $t_i$ based on the existing corpus. Finally, the product of $tf_{iq}$ and $idf_i$ gives $W_{iq}$, and hence $V_{d_q}$.
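A possible Python sketch of the query vector construction follows. It assumes the query has already been tokenized and that the inverted index maps each token to a list whose second element is the document frequency; `query_vector` is a hypothetical name. As in the text, unknown tokens are dropped and the document frequency and corpus size are taken from the corpus alone.

```python
import math
from collections import Counter


def query_vector(query_tokens, index, n_docs):
    """Return {token: weight} for query tokens that occur in the inverted index."""
    counts = Counter(t for t in query_tokens if t in index)
    if not counts:
        return {}  # no known token: the search fails
    max_f = max(counts.values())
    # idf uses df and N of the corpus only; the query is not counted.
    return {
        t: (f / max_f) * math.log2(n_docs / index[t][1])
        for t, f in counts.items()
    }
```

An empty result corresponds to a failed search, since no query token is present in the corpus.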
Cosine Similarity Model
To find the relevant documents, the query document $d_q$ is compared with every document $d_j$ in the corpus, one by one, using cosine similarity. This similarity value is used to rank the relevant documents $d_j$.

$$sim(d_q, d_j) = \frac{V_{d_q} \cdot V_{d_j}}{|V_{d_q}|\,|V_{d_j}|} = \frac{\sum_{i=1}^{m} W_{iq} W_{ij}}{\sqrt{\sum_{i=1}^{m} W_{iq}^2}\ \sqrt{\sum_{i=1}^{M} W_{ij}^2}}$$

Note that the dot product runs over the dimension $m$ of $V_{d_q}$, which is very small in comparison to $M$ ($m \ll M$). This makes the process very fast. The only part that looks time-consuming is the calculation of $|V_{d_j}|$ (dimension $M$). Fortunately, $|V_{d_j}|$ can be calculated once for all $j$ and stored in memory. These assumptions and pre-calculations make the search for relevant documents extremely fast.
A threshold on $sim(d_q, d_j)$ can be used to reduce the size of the relevant document set.
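The ranking step might look like the following Python sketch. It assumes an inverted index of the form `{token: [TF-IDF weight list, df]}` and a query vector given as `{token: weight}` whose tokens all occur in the index; `document_norms`, `rank`, `top_k` and `threshold` are hypothetical names. The document magnitudes are computed once, and the dot product only visits the query tokens.

```python
import math


def document_norms(index, n_docs):
    """Pre-compute the magnitude of every document vector from the weights."""
    squares = [0.0] * n_docs
    for weights, _df in index.values():
        for j, w in enumerate(weights):
            squares[j] += w * w
    return [math.sqrt(s) for s in squares]


def rank(query_vec, index, norms, top_k=10, threshold=0.0):
    """Return up to top_k (document index, similarity) pairs, best first."""
    q_norm = math.sqrt(sum(w * w for w in query_vec.values()))
    if q_norm == 0.0:
        return []
    scores = []
    for j, d_norm in enumerate(norms):
        if d_norm == 0.0:
            continue  # document without any token
        # Dot product over the m query tokens only.
        dot = sum(w * index[t][0][j] for t, w in query_vec.items())
        sim = dot / (q_norm * d_norm)
        if sim > threshold:
            scores.append((j, sim))
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:top_k]
```

Calling `document_norms` once after indexing and reusing the result for every query is what keeps the per-query cost low in this sketch.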
Result Format (HTML and XML)
The result shows the 10 most relevant documents in decreasing order of similarity value $Value_{\text{similarity}}$. The number of relevant documents shown can be changed with a small update to the Global Information file of my implementation.
For each document on the result page, the following details are shown:
$Text_{\text{query}}$: The query text.
$Value_{\text{similarity}}$: Cosine similarity between the query document and the document, $sim(d_q, d_j)$.
$Rank_{\text{doc}}$: Rank of the document in the list sorted by decreasing similarity value $Value_{\text{similarity}}$.
$Link_{\text{doc}}$: Hyperlink to the document in the corpus, for easy viewing of the document's content.
$Name_{\text{doc}}$: Name of the document.
$Title_{\text{doc}}$: Title of the document, taken from the title value of the document in the list $L_{\text{Doc Details}}$ (stored during tokenization of the head part). If no such detail is available, the title is computed after the search completes and is taken as the first line of the document, cut to a fixed length (specified in the Global Information file of the system's implementation).
$Snippet_{(\text{doc},\,\text{query})}$: Snippet explaining the document for the query, taken from the description and summary values of the document in the list $L_{\text{Doc Details}}$ (meta tag information). If no such details are available, the snippet can be constructed from the document itself using sentence fragments (Construction). In the current implementation, if no meta tag information is available, each sentence of the document is ranked against the query vector and the snippet is built from the top-ranked sentences, up to a fixed length (specified in the Global Information file of the system's implementation). Note that in addition to these two methods, Google also uses the Open Directory Project (ODP) to get the snippet. Our current SE cannot use ODP because no web address is provided for the documents.
Tokens from the query text are highlighted in $Title_{\text{doc}}$ and $Snippet_{(\text{doc},\,\text{query})}$.
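The fallback snippet construction (ranking sentences against the query) can be sketched as follows. This is only one plausible reading of the description: sentences are split on end punctuation, each sentence is scored by summing the query weights of the words it contains (unstemmed, lower-cased), and top-ranked sentences are joined with "..." up to a fixed length. The names `make_snippet`, `query_weights` and `max_chars` are hypothetical.

```python
import re


def make_snippet(text, query_weights, max_chars=200):
    """Build a snippet from the sentences that best match the query tokens."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

    def score(sentence):
        words = re.findall(r"\w+", sentence.lower())
        return sum(query_weights.get(w, 0.0) for w in words)

    # Stable sort: sentences with equal scores keep their document order.
    ranked = sorted(sentences, key=score, reverse=True)
    parts, length = [], 0
    for sentence in ranked:
        remaining = max_chars - length
        if score(sentence) == 0 or remaining <= 0:
            break
        parts.append(sentence[:remaining])  # cut to the fixed length
        length += len(parts[-1])
    return "...".join(parts)
```

With the query token "myriad", only the sentence containing that word scores above zero, so the snippet consists of that single sentence.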
HTML Structure
    <html><head><title> Text_query </title></head><body>
    <h2> Text_query </h2>
    ( <b> Result Rank_doc : Value_similarity <a href = Link_doc> Name_doc </a></b>
      <br><b> Title: </b> Title_doc
      <br><b> Snippet: </b> Snippet_(doc, query)
      <br> )
    <br>
    </body></html>

XML Structure
    <?xml version="1.0" encoding="ISO-8859-1"?>
    <SE><query> Text_query </query><results>
    ( <result>
      <doc id = "Name_doc" title = "Title_doc" relevancy = "Value_similarity">
      <snippet> Snippet_(doc, query) </snippet>
      </doc></result> )
    </results></SE>

The bracketed part in both the HTML and the XML format is the part that is repeated for each document in the list of relevant documents.
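As an illustration of the XML result format, the sketch below serializes a result list with the standard library module `xml.etree.ElementTree`. The function name `results_to_xml` and the dictionary keys `name`, `title`, `similarity` and `snippet` are assumptions for the example; the element and attribute names follow the structure described above.

```python
import xml.etree.ElementTree as ET


def results_to_xml(query_text, results):
    """Serialize ranked results to XML bytes following the structure above.

    results: list of dicts with the hypothetical keys
    name, title, similarity and snippet.
    """
    root = ET.Element("SE")
    ET.SubElement(root, "query").text = query_text
    results_el = ET.SubElement(root, "results")
    for r in results:
        # One <result> element per relevant document (the repeated part).
        result_el = ET.SubElement(results_el, "result")
        doc_el = ET.SubElement(
            result_el, "doc",
            id=r["name"], title=r["title"], relevancy=str(r["similarity"]),
        )
        ET.SubElement(doc_el, "snippet").text = r["snippet"]
    # A non-UTF-8 encoding makes ElementTree emit the XML declaration itself.
    return ET.tostring(root, encoding="ISO-8859-1")
```

Building the tree with ElementTree rather than string concatenation means that characters such as "&" in titles and snippets are escaped automatically.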
HTML Example File:

XML Example File:

System Interface Details
The whole system is written in Python. For line-by-line details of the implementation (variables, functions and packages), please see the software documentation in the following section.
To run the system, run the file named SearchEngine_Interface.py, which contains the main method. To control the system's behaviour, we only need to update the global variables in the GlobalInformation.py file. Through these variables we can change the stop word set, the punctuation regular expression, the use of the stemmer, the number of documents shown on the result page, the background image of the system's GUI, the system interface type, etc.



Two types of interface are provided to operate the system:
1. Command Line Interface
This interface is most useful to an advanced user who knows the implementation details. Through the command line, the whole data structure can be inspected in minute detail. As the result of a query search, it lists the top relevant documents in decreasing order of relevancy value. The size of the list can be controlled from the Global Information file. It also shows the time taken by the search. At the same time, in the background, it generates the HTML and XML result files and stores them in the result directory (specified in the Global Information file).
The following is a snapshot of the command line interface of the SE:












2. Graphical User Interface (GUI)
The GUI is the easiest way to access our system. When a query is searched, the result appears in the right-hand widget. The result consists of the time taken by the search and the list of the top relevant documents in decreasing order of relevancy value. The size of the list can be controlled from the Global Information file. A successful search also enables the Show HTML and Show XML buttons, which can be pressed to see the result in HTML or XML format respectively. When there is no relevant document (a failed search), both buttons stay disabled.

Some of the snapshots are shown here:

Successful Search for a Query:
















Failed Search for a Query:



Test Cases and Evaluations
The use of an inverted index for accessing token details, together with an efficient calculation of the cosine similarity vector model, makes the presented SE extremely fast. The settings used for the results shown next are:
- Stop Words: The first 3 stop word files (Probable Google's, Onix 1 and Onix 2) were used; 724 unique stop words were extracted from these files.
- Punctuation Symbols: Punctuation symbols from the Wikipedia Punctuation page
- Stemmer: Python implementation of the Porter Stemmer
- The maximum number of result documents shown is 10.
With the current setting of the system (without stemmer), a total of 21915 tokens were extracted during tokenization. The complete tokenization process, including construction of the inverted index, took 17.2672 seconds on average. With the stemmer, the number of tokens extracted fell to 17022 and the complete tokenization process took 35.0264 seconds. TF-IDF weight calculation took 8.7253 seconds on average. Finally, to speed up the cosine similarity calculation during relevancy ranking, the magnitude of each document vector was pre-calculated. This pre-calculation took 10.3223 seconds on average. With the stemmer, the time taken by the TF-IDF weight calculation and by the magnitude calculation of the document vectors fell to 6.0324 seconds and 8.2893 seconds respectively.
Some search time results are shown below:
Query                                        Search Status   Time Taken by SE (sec)   Time Taken by SE (sec)
                                                             (Without Stemmer)        (With Stemmer)
Swine                                        Succeed         0.00156779702434         0.00135693991853
H1n1 swine flu                               Succeed         0.00415276243211         0.00384628632967
myriad                                       Succeed         0.0017619557793          0.00165802567118
university                                   Succeed         0.00216396217957         0.00200048288257
shashi                                       Failed          0.000297454006159        0.000246825504141
World Health Organization                    Succeed         0.00403871162416         0.00296909244025
Swine flu reports                            Succeed         0.00401112431882         0.00358241340741
Blog, symptoms and prevention                Succeed         0.00382716239119         0.00346626736104
Health Fitness                               Succeed         0.00284358766248         0.00255513689591
university Minnesota copyright 2009 regents  Succeed         0.00577461660623         0.00479542930721
Eiffel Tower                                 Failed/Succeed  0.00059260324997 (F)     0.00228933425899 (S)

Based on the time taken by the search engine for different queries, we can observe the following about the current SE:
- The search process is extremely fast; the average time taken for a query search (computed from the table above) is 0.002821067 seconds.
- The time taken by the SE grows roughly in proportion to the number of unique tokens present in the query text. A query with a single token takes less time than a query with more tokens.
- A query can contain a lot of text, but the time taken by the SE depends on the number of unique tokens extracted from it. For example, "World Health Organization" and "Blog, symptoms and prevention" take almost equal time because in both cases the number of unique tokens is 3; the "and" in the second query is filtered out during tokenization because it is a stop word.
- The SE is extremely fast when there are no relevant documents. For such queries the number of tokens is zero (failed search).
- For each query, the search time varies depending on whether the stemmer is used. With the stemmer, the time taken is generally lower than without it, because fewer tokens are extracted from the corpus when the stemmer is used.
- Our SE responds even faster than Google, but this comparison does not mean anything, as the two systems work on different corpora.
Our evaluation is qualitative. For each search, the questions I ask are "How many irrelevant documents occur in the top 10 relevant documents?", "Are these the results we might be looking for?", "How many documents in the top 10 relevant documents are an exact match to the query requirement?" etc.
My evaluation process is manual and not based on comparison with another search engine such as Google; such a comparison would not make sense, as the corpus used is completely different. Also, even with a corpus this small, it is not feasible to label every document in the corpus as relevant or irrelevant for a given query. So, unfortunately, evaluation in terms of precision and recall is not feasible.
Test Case 1: [Query: h1n1 swine flu] [With Stemmer]
[Result 1] [0.181129564946] [191.htm]
Title: H1N1 Influenza (Swine Flu) in the Yahoo! Directory
Snippet: Yahoo! reviewed these sites and found them related to H1N1 Influenza (Swine Flu)
[Result 2] [0.178197589149] [319.htm]
Title: Swine Flu - Influenza A (H1N1) - Novel H1N1 Flu
Snippet: Information and advice about Swine Flu (swine influenza - Influenza A H1N1)
[Result 3] [0.155311207686] [39.htm]
Title: Swine Flu Symptoms, Treatment, H1N1 Pandemic News, Vaccine and Transmission by MedicineNet.com
Snippet: Get the facts on swine flu (swine influenza A H1N1 virus) history, symptoms, how this contagious infection is
transmitted, prevention with a vaccine, diagnosis, treatment, news and research.
[Result 4] [0.14328112026] [87.htm]
Title: Swine Influenza (H1N1)
Snippet: CDC Information about swine flu - cdc.gov...No. Swine influenza viruses are not spread by food. You cannot
get swine influenza from eating pork or pork products. Eating properly handled and cooked pork products is safe....
[Result 5] [0.135996960162] [300.htm]
Title: Swine Flu Symptoms, Treatment, H1N1 Pandemic News, Vaccine and Transmission by MedicineNet.com
Snippet: Get the facts on swine flu (swine influenza A H1N1 virus) history, symptoms, how this contagious infection is
transmitted, prevention with a vaccine, diagnosis, treatment, news and research.
[Result 6] [0.130567225723] [173.htm]
Title: Open Directory - Health: Conditions and Diseases: Infectious Diseases: Viral: Influenza: A-H1N1
Snippet: " h1n1 swine flu " search on:...Swine Influenza - Latest news on the swine influenza situation in humans
around the world provided by the WHO. [RSS]...
[Result 7] [0.129887390989] [69.htm]
Title: Key Facts About Swine Influenza - Swine Flu Center - EverydayHealth.com
Snippet: What are the symptoms of the swine influenza? Can you catch swine flu from eating pork? Get the answers to
these and more at EverydayHealth.com.
[Result 8] [0.124097572097] [226.htm]
Title: Swine Flu - Telegraph
Snippet: As Swine flu (H1N1) cases rise follow latest news on this global epidemic.
[Result 9] [0.12087530591] [61.htm]
Title: Vaccines and Vaccine-Preventable Diseases in the News: Influenza - H1N1 (swine flu)
Snippet: samples of the new swine flu, taken from people who fell ill in Mexico...Administration's swine flu work, told
The Associated Press. Using...ingredient for a swine flu vaccine ready in early May, but are finding...Medicine.
[Result 10] [0.116807375249] [73.htm]
Title: NCPH: Influenza in N.C.
Snippet: Q A about H1N1 novel flu ("swine flu")...Influenza Sentinel Surveillance Program...de la gripe(flu) en
español...Q&A about diagnostic tests for flu (CDC)...Farms and H1N1 flu in North Carolina
Discussion: Note that most of the documents in the corpus are related to swine flu, so selecting the appropriate list was a tough task, but the documents that appear in the top 10 list turned out to be highly relevant to the query. The query "h1n1 swine flu" seems to look for information or news about h1n1 swine flu. The current system successfully retrieved the Yahoo! list of related websites, the list from the Open Directory Project, newspaper articles from the Telegraph and NCPH, and other documents devoted directly to swine flu (MedicineNet.com and Novel H1N1). Most of the documents talk about symptoms, treatment, news, vaccines and transmission of h1n1 swine flu.
The only problem is with the rank of some of the documents. When a user searches for a query, she expects direct information, not information reached through a list of documents (Result 1: Yahoo! Directory). Result 1 should appear somewhere near Result 6 (Open Directory Project). The rest of the documents are all good matches for the query; a strict decision on their ranking is not possible.
Test Case 2: [Query: Blog, symptoms and prevention] [With Stemmer]
[Result 1] [0.318280819858] [201.htm]
Title: Health News Blog -- Swine Flu - 2009 H1N1 - Resources
Snippet: Health News Blog -- Bird Flu / Avian Flu H5N1
[Result 2] [0.144399440093] [159.htm]
Title: Social Blogger Community Blog Directory
Snippet: autism prevention blogspot | October 23rd 2009 by leslie feldman...http://symptoms-of-swine-
flu.blogspot.com/...The U.S. Food and Drug Administration (FDA) issued an advisory on October
[Result 3] [0.144137217327] [160.htm]
Title: Social Blogger Community Blog Directory
Snippet: http://symptoms-of-swine-flu.blogspot.com/...http://preventah1n1.blogspot.com/...centers for disease
control and prevention...Great Info: "Simple ways to prevent influenza/H1N1 (Swine Flu)
[Result 4] [0.138454658183] [304.htm]
Title: Pregnancy and Swine Influenza A (H1N1) - The Well-Timed Period
Snippet: Swine flu cases reached 1,085 worldwide, spreading to every major region of the U.S., Bloomberg reports. So, if
you're pregnant, here are a few things about pregnancy and swine flu you should know
[Result 5] [0.132531227241] [149.htm]
Title: Swine influenza, H1N1 flu, swine flu Symptoms and Preventions. India Amazing
Snippet: Nice blogs by nitawriter...More blogs...My Other blogs Global Recession: How to Survive in Financial Crisis
Rajasthan won Indian premier League (IPL) like Royals Lowest scores in Test Cricket by Indians.
[Result 6] [0.121857581987] [359.htm]
Title: Swine influenza
Snippet: University of Leicester Clinical Sciences Library
[Result 7] [0.120162752856] [328.htm]
Title: Swine Flu|Influenza A H1N1 News: Swine Influenza (H1N1) in Malaysia.Did u Know??
Snippet: What is swine influenza? It is a respiratory disease of pigs caused by type A strains of the influenza virus. It
regularly causes high flu outbreaks in pigs but with low death rates. There are four main sub-types of the virus
[Result 8] [0.115997430571] [42.htm]
Title: Digital Pathology Blog: CDC advisory for laboratory workers - Swine Influenza (H1N1) guidelines
Snippet: Swine Influenza A (H1N1) Virus Biosafety Guidelines for Laboratory Workers This guidance is for laboratory
workers who may be processing or performing diagnostic testing on clinical specimens from patients
[Result 9] [0.0898530442974] [263.htm]
Title: Varun's ScratchPad: Influenza A (H1N1) a.k.a Swine Influenza
Snippet: More details @ http://www.blogactionday.org/...With the increase in the number of Influenza A(H1N1) a.k.a
Swine Flu cases in the city, it is highly recommended that one should be aware of certain information about this outbreak.
[Result 10] [0.0896877541716] [370.htm]
Title: Taking Care Of H1N1 Swine Influenza Patient At Home | MEDDESKTOP
Snippet: Swine influenza A virus infection ( swine flu ) can cause a wide range of symptoms, including fever, cough, sore
throat, body aches, headache, chills and fatigue. Some people have reported diarrhea and vomiting
Test Case 2: [Query: Blog, symptoms and prevention] [Without Stemmer]
[Result 1] [0.290034118371] [201.htm]
Title: Health News Blog -- Swine Flu - 2009 H1N1 - Resources
Snippet: Health News Blog -- Bird Flu / Avian Flu H5N1
[Result 2] [0.146818853196] [159.htm]
Title: Social Blogger Community Blog Directory
Snippet: autism prevention blogspot | October 23rd 2009 by leslie feldman...http://symptoms-of-swine-
flu.blogspot.com/...blog for games,business,news and entertainment | October 23rd 2009 by tracychopz ...
[Result 3] [0.145407712097] [160.htm]
Title: Social Blogger Community Blog Directory
Snippet: http://symptoms-of-swine-flu.blogspot.com/...centers for disease control and prevention...Swine flu
(H1N1) prevention starts with proper hand hygiene! Hand sanitation for swine flu prevention begins
[Result 4] [0.138886132308] [304.htm]
Title: Pregnancy and Swine Influenza A (H1N1) - The Well-Timed Period
Snippet: Medical Weblogs...Surgeonsblog...Medblogs...Swine flu cases reached 1,085 worldwide, spreading to every
major region of the U.S., Bloomberg reports. So, if you're pregnant, here are a few things about pregnancy and swine
[Result 5] [0.12025094208] [328.htm]
Title: Swine Flu|Influenza A H1N1 News: Swine Influenza (H1N1) in Malaysia.Did u Know??
Snippet: blogarama.com...Earn money using blog......What is swine influenza? It is a respiratory disease of pigs caused by
type A strains of the influenza virus. It regularly causes high flu outbreaks in pigs but with low death rates..
[Result 6] [0.101464942491] [42.htm]
Title: Digital Pathology Blog: CDC advisory for laboratory workers - Swine Influenza (H1N1) guidelines
Snippet: Swine Influenza A (H1N1) Virus Biosafety Guidelines for Laboratory Workers This guidance is for laboratory
workers who may be processing or performing diagnostic testing on clinical specimens from patients with suspected
[Result 7] [0.0903952660191] [370.htm]
Title: Taking Care Of H1N1 Swine Influenza Patient At Home | MEDDESKTOP
Snippet: Swine influenza A virus infection ( swine flu ) can cause a wide range of symptoms, including fever, cough, sore
throat, body aches, headache, chills and fatigue. Some people have reported diarrhea and vomiting associated with
[Result 8] [0.0886311243466] [102.htm]
Title: H1N1 Swine Influenza Guidance For The Public And Clinicians | MEDDESKTOP
Snippet: The United States has 2,618 cases of the H1N1 swine influenza in 44 states (click on the map above to see a large
version), and three deaths, CDC, the U.S. Centers for Disease Control and Prevention reported on Monday.
[Result 9] [0.0842996795404] [263.htm]
Title: Varun's ScratchPad: Influenza A (H1N1) a.k.a Swine Influenza
Snippet: More details @ http://www.blogactionday.org/...With the increase in the number of Influenza A(H1N1) a.k.a
Swine Flu cases in the city, it is highly recommended that one should be aware of certain information about this outbreak.
[Result 10] [0.0825507550642] [227.htm]
Title: Swine influenza H1N1 update
Snippet: Follow virology blog on twitter...Subscribe to virology blog Email...blog comments powered by Disqus...Here s a
new question for your next Q&A session: A New Zealand blogger posits that getting swine flu now could be beneficial if
Discussion: For the query "Blog, symptoms and prevention", the performance of the SE is really good. All the documents in the top 10 list are blogs discussing news, symptoms, vaccines and prevention of swine flu. I was also looking for blogs that are not about swine flu, but I think we do not have such data; otherwise the SE should have fetched them.
Personal blogs like Varun's ScratchPad are ranked at the bottom; top priority is given to more general and important blogs like Health News Blog, Blogger and MEDDESKTOP.
Note that without the stemmer the tokens considered were "blog", "symptoms" and "prevention", whereas with the stemmer they were "blog", "symptom" and the stem of "prevention". This minor difference creates a noticeable difference in rank. With the stemmer, two important results, India Amazing and University of Leicester Clinical Sciences Library, are fetched at the fifth and sixth positions.
Test Case 3: [Query: myriad] [With Stemmer]
[Result 1] [0.00996001037105] [61.htm]
Title: Vaccines and Vaccine-Preventable Diseases in the News: Influenza - H1N1 (swine flu)
Snippet: crisis, not to mention the myriad chronic health issues that threaten...email admin@immunize.org...tel 651-647-
9009 fax 651-647-9131...Immunization Action Coalition 1573 Selby Avenue Saint Paul, MN 55104...
Discussion: The query "myriad" is just a test that the SE is working properly. The token
"myriad" occurs only once in the whole corpus, and the SE successfully retrieved that document.
Note that the top 10 list contains only one document, the one that contains the token.
Note also that the snippet shows the sentence containing the word "myriad" (which occurs only once
in the document). During snippet construction for the query "myriad", this sentence was given the
highest score among all sentences in the document.
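The sentence selection described above can be sketched as follows. This is only a minimal
illustration, assuming that each sentence is scored by the number of query tokens it contains.
The function name best_snippet_sentence, the regular-expression tokenizer and the sample text are
hypothetical and are not taken from the system's source.

```python
import re


def best_snippet_sentence(text, query):
    """Return the sentence of `text` that contains the most query tokens.

    Assumption: a sentence's score is the number of its tokens that also
    occur in the query. The real system may score sentences differently.
    """
    query_tokens = set(re.findall(r"[a-z0-9]+", query.lower()))
    # Naive sentence split: break after '.', '!' or '?' followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

    def score(sentence):
        tokens = re.findall(r"[a-z0-9]+", sentence.lower())
        return sum(1 for token in tokens if token in query_tokens)

    # max() keeps the first sentence when several share the highest score.
    return max(sentences, key=score) if sentences else ""


# Hypothetical sample text, not taken from the corpus.
sample_text = ("Flu spreads quickly. There are myriad chronic health issues. "
               "Wash your hands.")
print(best_snippet_sentence(sample_text, "myriad"))
# prints: There are myriad chronic health issues.
```

With a single-token query such as "myriad", only the sentence containing that token gets a
non-zero score, which matches the behaviour observed in this test case.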
Test Case 4: [Query: Lake County] [With Stemmer]
[Result 1] [0.21592553478] [107.htm]
Title: 2009 H1N1 (swine) Influenza - Lake County, IL
Snippet: Swine Flu Info...H1N1 (Swine Flu) Widget. Flash Player 9 is required....H1N1 (Swine Flu)...Centers for Disease
Control Illinois Department of Public Health Informacion de gripe procina en espanol World Health Organization (WHO)
[Result 2] [0.21129565701] [38.htm]
Title: Influenza Type A H1N1 (Swine Influenza) - Public Information
Snippet: Influenza Type A H1N1 (Swine Influenza) - Public Information Information and updates on the current situation
are available at: Ministry of Health: http://www.moh.govt.nz/influenza-a-h1n1
[Result 3] [0.120708584834] [179.htm]
Title: News and Alerts - Department of Health Services, Public Health Division, County of Sonoma
Snippet: Website for Public Health Clinics in Sonoma County, California.
[Result 4] [0.117021970944] [35.htm]
Title: Update: Swine Influenza A (H1N1) Infections -- California and Texas, April 2009 - The Body
Snippet: On April 24, this report was posted as an MMWR Dispatch on the MMWR website (http://www.cdc.gov/mmwr).
[Result 5] [0.112324937647] [341.htm]
Title: H1N1 Influenza (swine flu)
Snippet: About human cases of pandemic Influenza A (H1N1) virus infection.
[Result 6] [0.102410179733] [175.htm]
Title: Bureau of Epidemiology - Swine Flu Information
Snippet: Advisory...Information * Email...Accessibility...Statement * Disclaimer...Privacy...2009 State of
Florida...Copyright...Infections with the Novel Influenza A (H1N1) Virus...
[Result 7] [0.0870538855301] [80.htm]
Title: H1N1 Flu - General Information
Snippet: What can I do to protect myself from getting sick? There is no vaccine available right now to protect against
H1N1 flu (swine flu). There are everyday actions that can help prevent the spread of germs that cause respiratory ...
[Result 8] [0.0795519553746] [234.htm]
Title: Merced County, CA - Official Website - H1N1 (Swine) Influenza Information
Snippet: H1N1 (Swine) Influenza Information
[Result 9] [0.0745545218655] [148.htm]
Title: A/H1N1 Swine Flu (Influenza) Timeline
Snippet: A well organised, automated, and easy to read, easy to understand timeline of events, from new case
confirmations, new infections, and school closings to technical details on the H1N1 Swine Influenza/Flu outbreak.
[Result 10] [0.0708008975208] [285.htm]
Title: Swine Influenza A (H1N1) Infection in Two Children --- Southern California, March--April 2009
Snippet: The lack of known exposure to pigs in the two cases described in this report increases the possibility that
human-to-human transmission of this new influenza virus has occurred.
Test Case 4: [Query: Lake County] [Without Stemmer]
[Result 1] [0.21425377278] [107.htm]
Title: 2009 H1N1 (swine) Influenza - Lake County, IL
Snippet: 2009 H1N1 (swine) Influenza Human cases of 2009 H1N1 (swine) Influenza continue to be identified in the
United States and internationally. Lake County continues to experience ongoing spread of the novel virus and
[Result 2] [0.116875534044] [179.htm]
Title: News and Alerts - Department of Health Services, Public Health Division, County of Sonoma
Snippet: Website for Public Health Clinics in Sonoma County, California.
[Result 3] [0.10936833554] [341.htm]
Title: H1N1 Influenza (swine flu)
Snippet: About human cases of pandemic Influenza A (H1N1) virus infection.
[Result 4] [0.0917571595656] [175.htm]
Title: Bureau of Epidemiology - Swine Flu Information
Snippet: local county health department or the Bureau of Epidemiology...Advisory...Information *
Email...Accessibility...Statement * Disclaimer...Privacy...2009 State of Florida...
[Result 5] [0.0800258632905] [80.htm]
Title: H1N1 Flu - General Information
Snippet: What can I do to protect myself from getting sick? There is no vaccine available right now to protect against
H1N1 flu (swine flu). There are everyday actions that can help prevent the spread of germs that cause respiratory
[Result 6] [0.0784464265073] [35.htm]
Title: Update: Swine Influenza A (H1N1) Infections -- California and Texas, April 2009 - The Body
Snippet: On April 24, this report was posted as an MMWR Dispatch on the MMWR website (http://www.cdc.gov/mmwr).
[Result 7] [0.0731715820846] [234.htm]
Title: Merced County, CA - Official Website - H1N1 (Swine) Influenza Information
Snippet: H1N1 (Swine) Influenza Information
[Result 8] [0.0660780042229] [148.htm]
Title: A/H1N1 Swine Flu (Influenza) Timeline
Snippet: A well organized, automated, and easy to read, easy to understand timeline of events, from new case
confirmations, new infections, and school closings to technical details on the H1N1 Swine Influenza/Flu outbreak.
[Result 9] [0.0639112953337] [339.htm]
Title: H1N1 influenza (swine flu)
Snippet: H1N1 influenza, also known as swine flu, is a newly identified virus that can spread from people who are
infected to others through coughs and sneezes. When people cough or sneeze, they spread germs through the air or
[Result 10] [0.0596803528454] [120.htm]
Title: Pima County Health Department H1N1 Influenza (Swine Flu) Information - (520) 243-7797
Snippet: The Pima County Health Department continues to carefully monitor the cases of Swine Flu nationwide. Steps
everyone can take to prepare for and respond to the Swine Flu should they or someone in their family becomes ill:
Discussion: For the query "Lake County", there is just one document which contains both
tokens; the rest of the documents contain either "Lake" or "County". As expected, in both cases
(with and without stemmer) the SE ranked that particular document (107.htm) at the top, even though
the individual frequencies of the tokens "Lake" and "County" are lower in it than in other
documents.
One problem was also found with this test case: the ranking among the other documents is slightly
confusing. The document (148.htm) with a high occurrence of "County" is lower in the ranking
than the document (175.htm) with a low occurrence of "County". This must be a
consequence of the normalization of the term frequency $tf_{ij}$.
Note that for the without-stemmer case the tokens considered were "lake" and "county", but for the
with-stemmer case the tokens considered were "lake" and "counti".
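A small worked example may help to see how the ranking anomaly above can arise. It assumes that the
term frequency is normalized by document length (the count of a token divided by the total number of
tokens in the document), which may differ from the exact normalization used in the system. The two
toy documents are hypothetical.

```python
from collections import Counter


def normalized_tf(tokens):
    """Map each token to its count divided by the document length.

    Assumption: length normalization. The system's own tf formula may differ.
    """
    counts = Counter(tokens)
    total = len(tokens)
    return {term: count / total for term, count in counts.items()}


# Hypothetical toy documents, not taken from the corpus.
long_doc = ["county"] * 6 + ["flu"] * 54   # 6 occurrences in 60 tokens
short_doc = ["county"] * 2 + ["flu"] * 8   # 2 occurrences in 10 tokens

tf_long = normalized_tf(long_doc)["county"]
tf_short = normalized_tf(short_doc)["county"]

# The shorter document wins on normalized tf despite fewer raw occurrences.
print(tf_long < tf_short)  # prints: True
```

Here the long document mentions "county" three times as often, yet its normalized term frequency
(6/60) is half that of the short document (2/10), so the short document contributes a larger weight
for this token.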
Final Discussion and Conclusion
We see that a search engine based on the vector distance model with cosine similarity weighting
performs well on the test cases above. The TF-IDF weight combines the document frequency and the
corpus frequency, which enhances its performance. The SE presented here is also very fast. In this
section, we conclude the documentation by discussing the qualities, shortcomings and possible
improvements of the current system:
- The corpus used in this SE was very small, and it still contained many PDF and XML documents
with HTML extensions. To understand a real-life scenario (time complexity, data
storage capacity and processing power), the corpus needs to be much larger.
- Stop words are assumed to have negligible discrimination power and are therefore filtered
out. But this can cause problems for a query that includes "the", for example when
searching for bands and musical groups such as "The Who", "The The" or "Take That".
- During tokenization, raw words are taken as tokens. These tokens could be
enriched with semantic and syntactic information, for example by merging two tokens if they
are synonyms.
- The stemmer helps by reducing the sparseness of tokens and making the system faster.
This works well when tokens are just words, but when tokens are also combined with
semantic and syntactic information, stemming might discard some of the meanings. In
that case it might increase the sparseness problem.
- The vector space model proves to be very fast and can be implemented very efficiently. It is a
purely mathematical approach. It provides partial matching of queries to documents. It
combines the document frequency (local) and the corpus frequency (global) using the TF-IDF
weight. According to this weighting scheme, a token occurring frequently in a document but
rarely in the rest of the documents is given a high weight. The resulting documents are
generally relevant.
- The vector space model also has several weaknesses:
o Words are treated as tokens without any semantic (word sense) or syntactic
(word order, phrase structure etc.) information.
o If a term appears in every document of the corpus, its IDF value is zero, which makes
its TF-IDF weight zero no matter what the TF value is. The TF value of a token might
be high for a document, but because its IDF is zero, the token does not add any
information to the vector of the document.
o TF normalization can also create problems: even though a token is more frequent in
one document than in another, that document can still be ranked lower
because of TF normalization (the problem seen in test case 4).
o The vector space model does not enforce any restriction on the occurrence of a token in a
document, or any other such relation. For example, for a query consisting of
token A and token B, the SE can still rank documents that contain token A
frequently but never token B higher than documents that contain both token A and
token B, but less frequently.
- The query itself is treated as a document. All words of the query text are directly
converted to tokens. For a better match, synonym information can be used to extend
the token set. For example, if no token of the query document is present in the corpus, then
instead of reporting a failed search, synonym information can be used to extend the token set of
the query document and the search can be repeated.
- One of the important features of a search engine is to present results attractively so that
users are drawn to click on them. The snippet is one of the most
important parts of a displayed result, since it shows how relevant the document is
to the query. But constructing the snippet depending on the query is a very
difficult task. In the current implementation, a fixed snippet (constant for a document), extracted
from the Meta tag, is shown for all queries. For better presentation, it can be constructed for
each query from the document text itself. Using ODP or the Yahoo Directory is also a good
option.
- In the current evaluation, manual and theoretical approaches have been used. It would be
much better to compare the system with existing search engines and to present the results in
terms of precision and recall against a gold standard.
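
The precision and recall evaluation suggested in the last point could be sketched as follows. This
is only an illustration: the function name precision_recall is hypothetical, and the gold-standard
list below is invented for the example, since no gold standard exists for this corpus.

```python
def precision_recall(retrieved, relevant):
    """Compute (precision, recall) of a result list against a gold standard.

    precision = relevant documents retrieved / documents retrieved
    recall    = relevant documents retrieved / relevant documents
    """
    retrieved_set = set(retrieved)
    relevant_set = set(relevant)
    hits = len(retrieved_set & relevant_set)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall


# Top 4 results of test case 4 (without stemmer).
retrieved = ["107.htm", "179.htm", "341.htm", "175.htm"]
# Hypothetical gold standard for the query, assumed only for illustration.
gold = ["107.htm", "120.htm"]

print(precision_recall(retrieved, gold))  # prints: (0.25, 0.5)
```

Only one of the four retrieved documents is in the assumed gold standard (precision 1/4), and one
of the two gold documents was retrieved (recall 1/2).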


Source Code Documentation
The system's source is written in Python, so Pydoc can be used directly on the source files to
generate source code documentation in HTML format. This documentation has already been generated
and stored, and is available for the following files:
- GlobalInformation.py: File containing global variables; needs to be updated before
running the system.
- SearchEngine_Interface.py: Contains the main entry point; run this file to start the system.
- Database.py: File containing data structure details.
- SearchEngine_Server.py: All methods on the server side, such as tokenization, TF-IDF
calculations and the cosine similarity model.
- SearchEngine_QueryProcessor.py: All methods on the query processing side, such as system
interface creation and XML and HTML result file creation.
- Porter Stemmer.py: External algorithm used for stemming.
- wxPython: External Python library for the GUI. wxPython is not included in the standard
Python distribution.
