P. 1
Undergraduate Thesis

Undergraduate Thesis

|Views: 261|Likes:
Publicado porAmiya Patanaik

More info:

Published by: Amiya Patanaik on Sep 21, 2010
Copyright:Attribution Non-commercial

Availability:

Read on Scribd mobile: iPhone, iPad and Android.
download as PDF, TXT or read online from Scribd
See more
See less

07/21/2012

pdf

text

original

Sections

  • CERTIFICATE
  • Acknowledgement
  • ABSTRACT
  • Contents
  • List of figures and tables
  • 1.1 History of Question Answering Systems
  • 1.4.7 Real time question answering
  • 1.4.8 Multi-lingual (or cross-lingual) question answering
  • 1.6.5 Traditional Metrics – Recall and Precision
  • Chapter2. Question Analysis
  • 2.1 Determining the Expected Answer Type
  • 2.1.2 Manually Constructed rules for question classification
  • 2.1.3 Fully Automatically Constructed Classifiers
  • 2.1.4 Support Vector Machines
  • 2.1.5 Kernel Trick
  • 2.1.6 Nave Bayes Classifier
  • 2.1.7 Datasets
  • 2.1.8 Features
  • 2.1.9 Entropy and Weighted Feature Vector
  • 2.1.10 Experiment Results
  • 2.2 Query Formulation
  • 2.2.1 Stop word for IR query formulation
  • Chapter3. Document Retrieval
  • 3.1 Retrieval from local corpus
  • 3.1.1 Ranking function
  • 3.1.2 Okapi BM25
  • 3.1.3 IDF Information Theoretic Interpretation
  • 3.2 Information retrieval from the web
  • 3.2.1 How many documents to retrieve?
  • Chapter4. Answer Extraction
  • 4.1 Sentence Ranking
  • 4.1.1 WordNet
  • 4.1.2 Sense/Semantic Similarity between words
  • 4.1.2 Sense Net ranking algorithm
  • Chapter5. Implementation and Results
  • 5.1 Results
  • 5.2 Comparisons with other Web Based QA Systems
  • 5.3 Feasibility of the system to be used in real time environment
  • 5.4 Conclusion
  • References

Open Domain Factoid Question

Answering System
By Amiya Patanaik
(05EG1008)




Thesis submitted in partial fulfilment of the
Requirements for the degree of Bachelor of Technology (Honours)

Under the supervision of


Dr. Sudeshna Sarkar
Professor, Department of Computer Science







DEPARTMENT OF ELECTRICAL ENGINEERING
INDIAN INSTITUTE OF TECHNOLOGY
KHARAGPUR – 721302
INDIA


MAY 2009


2


Department of Electrical Engineering
Indian Institute of Technology
Kharagpur-721302


CERTIFICATE

This is to certify that the thesis entitled Open Domain Factoid Question
Answering System is a bonafide record of authentic work carried out by Mr. Amiya
Patanaik under my supervision and guidance for the fulfilment of the requirement for the
award of the degree of Bachelor of Technology (Honours) at the Indian Institute of
Technology, Kharagpur. The work incorporated in this has not been, to the best of my
knowledge, submitted to any other University or Institute for the award of any degree or
diploma.




Dr. Sudeshna Sarkar (Guide)
Professor, Department of Computer Science Date :
Indian Institute of Technology – Kharagpur Place : Kharagpur
INDIA





Dr. S K Das (Co-guide)
Professor, Department of Electrical Engineering Date :
Indian Institute of Technology – Kharagpur Place : Kharagpur
INDIA






3
Acknowledgement

I express my sincere gratitude and indebtedness to my guide, Dr. Sudeshna Sarkar
under whose esteemed guidance and supervision, this work has been completed. This
project work would have been impossible to carry out without her advice and support
throughout.

I would also like to express my heartfelt gratitude to my co-guide Dr. S. K. Das and all
the professors of Electrical and Computer Science Engineering Department for all the
guidance, education and necessary skill set they have endowed me with, throughout my
years of graduation.

Last but not the least; I would like to thank my friends for their help during the course
of my work.


Date:




Amiya Patanaik
05EG1008
Department of Electrical Engineering
IIT Kharagpur - 721302






4
















Dedicated to
my parents and friends














5

ABSTRACT
A question answering (QA) system provides direct answers to user questions by
consulting its knowledge base. Since the early days of artificial intelligence in the 60’s,
researchers have been fascinated with answering natural language questions. However,
the difficulty of natural language processing (NLP) has limited the scope of QA to
domain-specific expert systems. In recent years, the combination of web growth,
improvements in information technology, and the explosive demand for better
information access has reignited the interest in QA systems. The wealth of information
on the web makes it an attractive resource for seeking quick answers to simple, factual
questions such as “who was the first American in space?” or “what is the second tallest
mountain in the world?” Yet today’s most advanced web search services (e.g., Google,
Yahoo, MSN live search and AskJeeves) make it surprisingly tedious to locate answers to
such questions. Question answering aims to develop techniques that can go beyond the
retrieval of relevant Documents in order to return exact answers to natural language
factoid questions, such as “Who is the first woman to be in space?”, “Which is the largest
city in India?”, and “When was first world war fought?”. Answering natural language
questions requires more complex processing of text than employed by current
information retrieval systems.

This thesis investigates a number of techniques for performing open-domain factoid
question answering. We have developed an architecture that augments existing search
engines so that they support natural language question answering and is also capable of
supporting local corpus as a knowledge base. Our system currently supports document
retrieval from Google and Yahoo via their public search engine application
programming interfaces (APIs). We assumed that all the information required to
produce an answer exists in a single sentence and followed a pipelined approach
towards the problem. Various stages in the pipeline include: automatically constructed
question type analysers based on various classifier models, document retrieval, passage
extraction, phrase extraction, sentence and answer ranking. We developed and analyzed
different sentence and answer ranking algorithms, starting with simple ones that
employ surface matching text patterns to more complicated ones using root words, part
of speech (POS) tags and sense similarity metrics. The thesis also presents a feasibility
analysis of our system to be used in real time QA applications.



6
Contents
CERTIFICATE
2
ACKNOWLEDGEMENT
3
DEDICATION
4
ABSTRACT
5
CONTENTS
6
LIST OF FIGURES AND TABLES
8

Chapter 1: Introduction 9
1.1 History of Question Answering Systems 9
1.2 Architecture 10
1.3 Question answering methods 11
1.3.1 Shallow 11
1.3.2 Deep 11
1.4 Issues 12
1.4.1 Question classes 12
1.4.2 Question processing 13
1.4.3 Context and QA 13
1.4.4 Data sources for QA 13
1.4.5 Answer extraction 13
1.4.6 Answer formulation 13
1.4.7 Real time question answering 14
1.4.8 Multi-lingual (or cross-lingual) question answering 14
1.4.9 Interactive QA 14
1.4.10 Advanced reasoning for QA 14
1.4.11 User profiling for QA 14
1.5 A generic framework for QA 15
1.6 Evaluating QA Systems 15
1.6.1 End-to-End Evaluation 16
1.6.2 Mean Reciprocal Rank 16
1.6.3 Confidence Weighted Score 16
1.6.4 Accuracy and coverage 17
1.6.5 Traditional Metrics – Recall and Precision 17
Chapter2: Question Analysis 19
2.1 Determining the Expected Answer Type 19
2.1.1 Question Classes 19
2.1.2 Manually Constructed rules for question classification 20


7
2.1.3 Fully Automatically Constructed Classifiers 20
2.1.4 Support Vector Machines 21
2.1.5 Kernel Trick 22
2.1.6 Naive Bayes Classifier 22
2.1.7 Datasets 24
2.1.8 Features 24
2.1.9 Entropy and Weighted Feature Vector 25
2.1.10 Experiment Results 26
2.2 Query Formulation 27
2.2.1 Stop word for IR query formulation 28
Chapter3. Document Retrieval 29
3.1 Retrieval from local corpus 29
3.1.1 Ranking function 29
3.1.2 Okapi BM25 29
3.1.3 IDF Information Theoretic Interpretation 30
3.2 Information retrieval from the web 30
3.2.1 How many documents to retrieve? 31
Chapter4. Answer Extraction 34
4.1 Sentence Ranking 34
4.1.1 WordNet 34
4.1.2 Sense/Semantic Similarity between words 35
4.1.2 Sense Net ranking algorithm 36
Chapter5. Implementation and Results 38
5.1 Results 38
5.2 Comparisons with other Web Based QA Systems 41
5.3 Feasibility of the system to be used in real time environment 42
5.4 Conclusion 43

APPENDIX A : Web Based Question Set
44
APPENDIX B : Implementation Details
46


REFERENCES
47




8
List of figures and tables
Figures PageNo.

Fig.1.1: A generic framework for question answering 15
Fig.1.2: Sections of a document collection as used for IR evaluation. 18
Fig.2.1: The kernel trick 22
Fig.2.2: Various feature sets extracted from a given question and its
corresponding part of speech tags.
24
Fig.2.3: Question type classifier performance 26
Fig.2.4: JAVA Question Classifier 27
Fig.3.1: Document retrieval framework 31
Fig.3.2: %coverage vs rank 32
Fig.3.3: %coverage vs. average processing time 33
Fig.4.1: Fragment of WordNet taxonomy 35
Fig.4.2: A sense network formed between a sentence and a query 36
Fig.4.3: A sample run for the question “Who performed the first human heart
transplant?”
37
Fig.5.1: Various modules of the QA system along with each ones basic task 38
Fig.9.2: Comparison with other web based QA systems 42
Fig.9.3: Time distribution of each module involved in QA 43

Tables PageNo.

Table 1.1 Coarse and fine grained question categories. 20
Table 2.1: performance of various query expansion modules implemented on
Lucene.
28
Table 3.1: %coverage and average processing time at different ranks 32
Table 5.1: Performance of the system on the web question set 39-41



9
Chapter1. Introduction
In information retrieval, question answering (QA) is the task of automatically answering
a question posed in natural language. To find the answer to a question, a QA computer
program may use either a pre-structured database or a collection of natural language
documents (a text corpus such as the World Wide Web or some local collection).
QA research attempts to deal with a wide range of question types including: fact, list,
definition, How, Why, hypothetical, semantically-constrained, and cross-lingual
questions. Search collections vary from small local document collections, to internal
organization documents, to compiled newswire reports, to the World Wide Web.
* Closed-domain question answering deals with questions under a specific domain
(for example, medicine or automotive maintenance), and can be seen as an easier task
because NLP systems can exploit domain-specific knowledge frequently formalized in
ontologies.
* Open-domain question answering deals with questions about nearly everything, and
can only rely on general ontologies and world knowledge. On the other hand, these
systems usually have much more data available from which to extract the answer.
(Alternatively, closed-domain might refer to a situation where only limited types of
questions are accepted, such as questions asking for descriptive rather than procedural
information.)
QA is regarded as requiring more complex natural language processing (NLP)
techniques than other types of information retrieval such as document retrieval, thus
natural language search engines are sometimes regarded as the next step beyond
current search engines.
1.1 History of Question Answering Systems
Some of the early AI systems were question answering systems. Two of the most famous
QA systems of that time are BASEBALL and LUNAR, both of which were developed in
the 1960s. BASEBALL answered questions about the US baseball league over a period of
one year. LUNAR, in turn, answered questions about the geological analysis of rocks
returned by the Apollo moon missions. Both QA systems were very effective in their
chosen domains. In fact, LUNAR was demonstrated at a lunar science convention in
1971 and it was able to answer 90% of the questions in its domain posed by people
untrained on the system. Further restricted-domain QA systems were developed in the
following years. The common feature of all these systems is that they had a core
database or knowledge system that was hand-written by experts of the chosen domain.
Some of the early AI systems included question-answering abilities. Two of the most
famous early systems are SHRDLU and ELIZA. SHRDLU simulated the operation of a
robot in a toy world (the "blocks world"), and it offered the possibility to ask the robot


10
questions about the state of the world. Again, the strength of this system was the choice
of a very specific domain and a very simple world with rules of physics that were easy
to encode in a computer program. ELIZA, in contrast, simulated a conversation with a
psychologist. ELIZA was able to converse on any topic by resorting to very simple rules
that detected important words in the person's input. It had a very rudimentary way to
answer questions, and on its own it lead to a series of chatter bots such as the ones that
participate in the annual Loebner prize.
The 1970s and 1980s saw the development of comprehensive theories in computational
linguistics, which led to the development of ambitious projects in text comprehension
and question answering. One example of such a system was the Unix Consultant (UC), a
system that answered questions pertaining to the Unix operating system. The system
had a comprehensive hand-crafted knowledge base of its domain, and it aimed at
phrasing the answer to accommodate various types of users. Another project was
LILOG, a text-understanding system that operated on the domain of tourism
information in a German city. The systems developed in the UC and LILOG projects
never went past the stage of simple demonstrations, but they helped the development
of theories on computational linguistics and reasoning.
In the late 1990s the annual Text Retrieval Conference (TREC) included a question-
answering track which has been running until the present. Systems participating in this
competition were expected to answer questions on any topic by searching a corpus of
text that varied from year to year. This competition fostered research and development
in open-domain text-based question answering. The best system of the 2004
competition achieved 77% correct fact-based questions.
In 2007 the annual TREC included a blog data corpus for question answering. The blog
data corpus contained both "clean" English as well as noisy text that include badly-
formed English and spam. The introduction of noisy text moved the question answering
to a more realistic setting. Real-life data is inherently noisy as people are less careful
when writing in spontaneous media like blogs. In earlier years the TREC data corpus
consisted of only newswire data that was very clean.
An increasing number of systems include the World Wide Web as one more corpus of
text. Currently there is an increasing interest in the integration of question answering
with web search. Ask.com is an early example of such a system, and Google and
Microsoft have started to integrate question-answering facilities in their search engines.
One can only expect to see an even tighter integration in the near future.
1.2 Architecture
The first QA systems were developed in the 1960s and they were basically natural-
language interfaces to expert systems that were tailored to specific domains. In
contrast, current QA systems use text documents as their underlying knowledge source
and combine various natural language processing techniques to search for the answers.
Current QA systems typically include a question classifier module that determines the
type of question and the type of answer. After the question is analyzed, the system


11
typically uses several modules that apply increasingly complex NLP techniques on a
gradually reduced amount of text. Thus, a document retrieval module uses search
engines to identify the documents or paragraphs in the document set that are likely to
contain the answer. Subsequently a filter preselects small text fragments that contain
strings of the same type as the expected answer. For example, if the question is "Who
invented Penicillin" the filter returns text that contain names of people. Finally, an
answer extraction module looks for further clues in the text to determine if the answer
candidate can indeed answer the question.
1.3 Question answering methods
QA is very dependent on a good search corpus - for without documents containing the
answer, there is little any QA system can do. It thus makes sense that larger collection
sizes generally lend well to better QA performance, unless the question domain is
orthogonal to the collection. The notion of data redundancy in massive collections, such
as the web, means that nuggets of information are likely to be phrased in many different
ways in differing contexts and documents, leading to two benefits:
(1) By having the right information appear in many forms, the burden on the QA
system to perform complex NLP techniques to understand the text is lessened.
(2) Correct answers can be filtered from false positives by relying on the correct
answer to appear more times in the documents than instances of incorrect ones.
1.3.1 Shallow
Some methods of QA use keyword-based techniques to locate interesting passages and
sentences from the retrieved documents and then filter based on the presence of the
desired answer type within that candidate text. Ranking is then done based on syntactic
features such as word order or location and similarity to query.
When using massive collections with good data redundancy, some systems use
templates to find the final answer in the hope that the answer is just a reformulation of
the question. If you posed the question "What is a dog?", the system would detect the
substring "What is a X" and look for documents which start with "X is a Y". This often
works well on simple "factoid" questions seeking factual tidbits of information such as
names, dates, locations, and quantities.
1.3.2 Deep
However, in the cases where simple question reformulation or keyword techniques will
not suffice, more sophisticated syntactic, semantic and contextual processing must be
performed to extract or construct the answer. These techniques might include named-
entity recognition, relation detection, co reference resolution, syntactic alternations,
word sense disambiguation, logic form transformation, logical inferences (abduction)


12
and commonsense reasoning, temporal or spatial reasoning and so on. These systems
will also very often utilize world knowledge that can be found in ontologies such as
WordNet, or the Suggested Upper Merged Ontology (SUMO) to augment the available
reasoning resources through semantic connections and definitions.
More difficult queries such as Why or How questions, hypothetical postulations,
spatially or temporally constrained questions, dialog queries, badly-worded or
ambiguous questions will all need these types of deeper understanding of the question.
Complex or ambiguous document passages likewise need more NLP techniques applied
to understand the text.
Statistical QA, which introduces statistical question processing and answer extraction
modules, is also growing in popularity in the research community. Many of the lower-
level NLP tools used, such as part-of-speech tagging, parsing, named-entity detection,
sentence boundary detection, and document retrieval, are already available as
probabilistic applications.
AQ (Answer Questioning) Methodology; introduces a working cycle to the QA methods.
This method may be used in conjunction with any of the known or newly founded
methods. AQ Method may be used upon perception of a posed question or answer. The
means by which it is utilized can be manipulated beyond its primary usage; however,
the primary usage is taking an answer and questioning it turning that very answer into
a question. Example; A"I like sushi." Q"(Why do) I like sushi(?)" A"The flavor." Q"(What
about) the flavor of sushi (do) I like?" Inadvertently, this may unveil different methods
of thinking and perception as well. While most would agree that this seems to be the
end-all stratagem, it is only a starting point with endless possibilities. Any number of
question methods may be used to derive the number of WHY as in, A = ∞(Q), the answer
may yield any number of questions to be asked; thereby unveiling an ongoing process
constantly being reborn into the research being performed. The QA methodology
utilizes just the opposite where, 1(Q) = ((∞(A)-∞) = 1(A), supposedly there is only one
true answer in reality everything else is perception or plausibility. Utilized alongside
other forms of communication; debate may be greatly improved. Even this methodology
should be questioned.
1.4 Issues
In 2002 a group of researchers wrote a roadmap of research in question answering. The
following issues were identified.
1.4.1 Question classes
Different types of questions require the use of different strategies to find the answer.
Question classes are arranged hierarchically in taxonomies.


13
1.4.2 Question processing
The same information request can be expressed in various ways - some interrogative,
some assertive. A semantic model of question understanding and processing is needed,
one that would recognize equivalent questions, regardless of the speech act or of the
words, syntactic inter-relations or idiomatic forms. This model would enable the
translation of a complex question into a series of simpler questions, would identify
ambiguities and treat them in context or by interactive clarification.
1.4.3 Context and QA
Questions are usually asked within a context and answers are provided within that
specific context. The context can be used to clarify a question, resolve ambiguities or
keep track of an investigation performed through a series of questions.
1.4.4 Data sources for QA
Before a question can be answered, it must be known what knowledge sources are
available. If the answer to a question is not present in the data sources, no matter how
well we perform question processing, retrieval and extraction of the answer, we shall
not obtain a correct result.
1.4.5 Answer extraction
Answer extraction depends on the complexity of the question, on the answer type
provided by question processing, on the actual data where the answer is searched, on
the search method and on the question focus and context. Given that answer processing
depends on such a large number of factors, research for answer processing should be
tackled with a lot of care and given special importance.
1.4.6 Answer formulation
The result of a QA system should be presented in a way as natural as possible. In some
cases, simple extraction is sufficient. For example, when the question classification
indicates that the answer type is a name (of a person, organization, shop or disease, etc),
a quantity (monetary value, length, size, distance, etc) or a date (e.g. the answer to the
question "On what day did Christmas fall in 1989?") the extraction of a single datum is
sufficient. For other cases, the presentation of the answer may require the use of fusion
techniques that combine the partial answers from multiple documents.


14
1.4.7 Real time question answering
There is need for developing Q&A systems that are capable of extracting answers
from large data sets in several seconds, regardless of the complexity of the question, the
size and multitude of the data sources or the ambiguity of the question.
1.4.8 Multi-lingual (or cross-lingual) question answering
The ability to answer a question posed in one language using an answer corpus in
another language (or even several). This allows users to consult information that they
cannot use directly. See also machine translation.
1.4.9 Interactive QA
It is often the case that the information need is not well captured by a QA system, as
the question processing part may fail to classify properly the question or the
information needed for extracting and generating the answer is not easily retrieved. In
such cases, the questioner might want not only to reformulate the question, but (s)he
might want to have a dialogue with the system.
1.4.10 Advanced reasoning for QA
More sophisticated questioners expect answers which are outside the scope of
written texts or structured databases. To upgrade a QA system with such capabilities,
we need to integrate reasoning components operating on a variety of knowledge bases,
encoding world knowledge and common-sense reasoning mechanisms as well as
knowledge specific to a variety of domains.
1.4.11 User profiling for QA
The user profile captures data about the questioner, comprising context data, domain
of interest, reasoning schemes frequently used by the questioner, common ground
established within different dialogues between the system and the user etc. The profile
may be represented as a predefined template, where each template slot represents a
different profile feature. Profile templates may be nested one within another.



15
1.5 A generic framework for QA
The majority of current question answering systems designed to answer factoid
questions consist of three distinct components:
1. question analysis,
2. document or passage retrieval and finally
3. answer extraction.
While these basic components can be further subdivided into smaller components like
query formation and document pre-processing, a three component architecture
describes the approach taken to building QA systems in the wider literature.














Fig.1.1: A generic framework for question answering.

It should be noted that while the three components address completely separate
aspects of question answering it is often difficult to know where to place the boundary
of each individual component. For example the question analysis component is usually
responsible for generating an IR query from the natural language question which can
then be used by the document retrieval component to select a subset of the available
documents. If, however, an approach to document retrieval requires some form of
iterative process to select good quality documents which involves modifying the IR
query, then it is difficult to decide if the modification should be classed as part of the
question analysis or document retrieval process.
1.6 Evaluating QA Systems
Evaluation is a highly subjective matter when dealing with NLP problems. It is always
easier to evaluate when there is a clearly defined answer, unfortunately with most of
the natural language tasks there is no single answer. A rather impractical and tedious
way of doing this could be to manually search an entire collection of text and mark the
Corpus or
document
collection
Document
Retrieval
Top n text
segments
or
sentences
Answer
Extraction
Answers
Question
Analysis
Question


16
relevant documents. Then the queries can be used to make an evaluation based on
precision and recall. But this is not possible even for the smallest of document
collections and with the size of corpuses like AQUAINT with approximately 1,00,000
articles it is next to impossible.
1.6.1 End-to-End Evaluation
Almost every QA system is concerned with the final answer. So a widely accepted
metric is required to evaluate the performance of our system and compare it with other
existing systems. Most of the recent large scale QA evaluations have taken place as part
of the TREC conferences and hence the evaluation metrics used have been extensively
studied and is used in this study. Following are definitions of numerous metrics for
evaluating factoid questions. Evaluating descriptive questions is much more difficult
than factoids.
1.6.2 Mean Reciprocal Rank
The original evaluation metric used in the QA tracks of TREC 8 and 9 was mean
reciprocal rank (MRR). MRR provides a method for scoring systems which return
multiple competing answers per question. Let Q be the question collection and
i
r the
rank of the first correct answer to question i or 0 if no correct answer is returned. MRR
is then given by:

| |
1
1
| |
Q
i
i
M
r
RR
Q
=
=
¯
(1.1)
As useful as MRR was as an evaluation metric for the early TREC QA evaluations it does
have a number of drawbacks [8], the most important of which are that
- systems are given no credit for retrieving multiple (different) correct answers
and
- As the task required each system to return at least one answer per question; no
credit was given to systems for determining that they did not know or could not
locate an appropriate answer to a question.
1.6.3 Confidence Weighted Score
Following the shortcomings of MRR as an evaluation metric a new evaluation metric
was chosen as the new evaluation metric [9]. Under this evaluation metric a system
returns a single answer for each question. These answers are then sorted before
evaluation so that the answer which the system has most confidence in is placed first.


17
The last answer evaluated will therefore be the one the system has least confidence in.
Given this ordering CWS is formally defined in Equation 1.2:

| |
1
no. of correct in first i answers
| |
Q
i
CWS
i
Q
=
=
¯
(1.2)
CWS therefore rewards systems which can not only provide correct exact answers to
questions but which can also recognise how likely an answer is to be correct and hence
place it early in the sorted list of answers. The main issue with CWS is that it is difficult
to get an intuitive understanding of the performance of a QA system given a CWS score
as it does not relate directly to the number of questions the system was capable of
answering.
1.6.4 Accuracy and coverage
Accuracy of a QA system is a simple evaluation metric with direct correspondence to
number of correct answers. Let
, D q
C be the correct answers for question q known to be
contained in the document collection D and
, ,
S
D q n
F be the first n answers found by
system S for question q from D then accuracy is defined as:
, , ,
| { | }|
( , , )
| |
S
D q n D q S
q Q
accuracy
F C
Q D n
Q
o · =
=
‹

(1.3)
Similarly The coverage of a retrieval system S for a question set Q and document
collection D at rank n is the fraction of the questions for which at least one relevant
document is found within the top n documents:

, , ,
| { | }|
coverage ( , , )
| |
S
D q n D q S
q Q R A
Q D n
Q
o · =
=
‹
(1.4)
1.6.5 Traditional Metrics – Recall and Precision
The standard evaluation measures for IR systems are precision and recall. Let D be the
document (or passage collection),
, D q
A the subset of Dwhich contains relevant
documents for a query q and
, ,
S
D q n
R be the n top-ranked documents (or passages) in D
retrieved by an IR system S (figure 1.2); then

The recall of an IR system S at rank n for a query q is the fraction of the relevant
documents
, D q
A , which have been retrieved:


18

, , ,
,
| |
( , , )
| |
S
D q n D q S
D q
R A
reca q l n
A
l D
·
=
(1.5)
The precision of an IR system S at rank n for a query q is the fraction of the retrieved
documents
, ,
S
D q n
R that are relevant:

, , ,
, ,
| |
( , , )
| |
S
D q n D q S
S
D q n
precisio
R A
D q n
R
n
·
= (1.6)
Clearly given a set of queries Q average recall and precision values can be calculated to
give a more representative evaluation of a specific IR system. Unfortunately these
evaluation metrics although well founded and used throughout the IR community suffer
from two problems when used in conjunction with the large document collections
utilized by QA systems, namely determining the set of relevant documents within a
collection for a given query,
, D q
A . The only accurate way to determine which documents
are relevant to a query is to read every single document in the collection and determine
its relevance. Clearly given the size of the collections over which QA systems are being
operated this is not a feasible proposition. It must be kept in mind that just because a
relevant document is found does not automatically mean the QA system will be able to
identify and extract a correct answer. Therefore it is better to use recall and precision at
the document retrieval stage rather than for the complete system.




















Figure 1.2: Sections of a document collection as used for IR evaluation.
Document Collection/Corpus
Relevant Documents


, D q
A


Retrieved Documents
, ,
S
D q n
R


19
Chapter2. Question Analysis

As the first component in a QA system it could easily be argued that question analysis is
the most important part. Not only is the question analysis component responsible for
determining the expected answer type and for constructing an appropriate query for
use by an IR engine but any mistakes made at this point are likely to render useless any
further processing of a question. If the expected answer type is incorrectly determined
then it is highly unlikely that the system will be able to return a correct answer as most
systems constrain possible answers to only those of the expected answer type. In a
similar way a poorly formed IR query may result in no answer bearing documents being
retrieved and hence no amount of further processing by an answer extraction
component will lead to a correct answer being found.
2.1 Determining the Expected Answer Type
In most QA systems the first stage in processing a previously unseen question is to
determine the semantic type of the expected answer. Determining the expected answer
type for a question implies the existence of a fixed set of answer types which can be
assigned to each new question. The problem of question type classification can be
solved by constructing manual rules or if we have access to large set of annotated-pre
classified questions, using machine learning approaches. We have employed a machine
learning model in our system which employs a feature-weighting model which assigns
different weights to features instead of simple binary values. The main characteristic of
this model is assigning more reasonable weight to features: these weights can be used
to differentiate features from each other according to their contribution to question
classification. Further, we propose to use features initially just as bag of words and later
on both as a bag of words and feature called as partitioned feature model. Results show
that with this new feature-weighting model the SVM-based classifier outperforms the
one without it to a large extent.
2.1.1 Question Classes
We follow the two-layered question taxonomy, which contains 6 coarse grained
categories and 50 fine grained categories, as shown in Table 1. Each coarse grained
category contains a non-overlapping set of fine grained categories. Most question
answering systems use a coarse grained category definition. Usually the number of
question categories is less than 20. However, it is obvious that a fine grained category
definition is more beneficial in locating and verifying the plausible answers.




20
Table 1.1 Coarse and fine grained question categories.
Coarse Fine
ABBR abbreviation, expansion
DESC definition, description, manner, reason
ENTY animal, body, color, creation, currency, disease/medical,
event, food, instrument, language, letter, other, plant,
product, religion, sport, substance, symbol, technique, term,
vehicle, word
HUM description, group, individual, title
LOC city, country, mountain, other, state
NUM code, count, date, distance, money, order, other, percent,
period, speed, temperature, size, weight


2.1.2 Manually Constructed rules for question classification
Often the easiest approach to question classification is a set of manually constructed
rules. This approach allows a simple low coverage classifier to be rapidly developed
without requiring a large amount of hand labelled training data. A number of systems
have taken this approach, many creating sets of regular expressions which only
questions with the same answer type [10],[11]. While these approaches work well for
some questions (for instance questions asking for a date of birth can be reliably
recognised using approximately six well constructed regular expressions) they often
require the examination of a vast number of questions and tend to rely purely on the
text of the question. One possible approach for manually constructing rules for such a
classifier would be to define rule formalism that whilst retaining the relative simplicity
of regular expressions would give access to a richer set of features. As we had access to
large set of pre annotated question samples we have not used this method.
2.1.3 Fully Automatically Constructed Classifiers
As mentioned in the previous section building a set of classification rules to perform
accurate question classification by hand is both a tedious and time-consuming task. An
alternative solution to this problem is to develop an automatic approach to constructing
a question classifier using (possibly hand labelled) training data. A number of different
automatic approaches to question classification have been reported which make use of
one or more machine learning algorithms [6][7][12] including nearest neighbour (NN)
[4], decision trees (DT) and support vector machines (SVM)[7][12] to induce a classifier.
In our system we employed a SVM and Naive Bayes classifier on different feature sets
extracted from the question.

21
2.1.4 Support Vector Machines
Support vector machines (SVMs) are a set of related supervised learning methods used
for classification and regression. Viewing input data as two sets of vectors in an n-
dimensional space, an SVM will construct a separating hyper-plane in that space, one
which maximizes the margin between the two data sets. To calculate the margin, two
parallel hyperplanes are constructed, one on each side of the separating hyper-plane,
which are "pushed up against" the two data sets. Intuitively, a good separation is
achieved by the hyper-plane that has the largest distance to the neighboring data points
of both classes, since in general the larger the margin the lower the generalization error
of the classifier.
We are given some training data, a set of points of the form
D = ൛(ݔ
݅
, ܿ
݅
) | ݔ
݅
߳ ℝ
݌
, ܿ
݅
߳ {−1,1}ൟ
݅=1
݊
(2.1)
where the ܿ
݅
is either 1 or −1, indicating the class to which the point ݔ
݅
belongs. Each ݔ
݅

is a p-dimensional real vector. We want to give the maximum-margin hyperplane which
divides the points having ܿ
݅
= 1 from those having ܿ
݅
= − 1. Any hyperplane can be
written as the set of points ݔ satisfying
w ⋅ ݔ − ܾ = 0 (2.2)
where denotes the dot product. The vector w is a normal vector: it is perpendicular to
the hyperplane. The parameter ܾ/‖ݓ‖ determines the offset of the hyperplane from the
origin along the normal vector w. We want to choose the w and b to maximize the
margin, or distance between the parallel hyperplanes that are as far apart as possible
while still separating the data. These hyperplanes can be described by the equations
w ⋅ ݔ − ܾ = 1 (2.3)

and
w ⋅ ݔ − ܾ = −1 (2.4)
Note that if the training data are linearly separable, we can select the two hyperplanes
of the margin in a way that there are no points between them and then try to maximize
their distance. By using geometry, we find the distance between these two hyperplanes
is 2/‖ݓ‖, so we want to minimize ‖ݓ‖. As we also have to prevent data points falling
into the margin, we add the following constraint: for each i either
w ⋅ ݔ
݅
− ܾ ≥ 1 (2.5)
and
w ⋅ ݔ
݅
− ܾ ≤ −1 (2.6)
This can be rewritten as:

22
ܿ
݅
(w ⋅ ݔ
݅
− ܾ) ≥ 1 , for all 1≤ ݅ ≤ ݊ (2.7)

We can put this together to get the optimization problem:
Minimize in (ݓ, ܾ) ‖ݓ‖ subject to (for any i = 1,……., n)
ܿ
݅
(w ⋅ ݔ
݅
− ܾ) ≥ 1 (2.8)
2.1.5 Kernel Trick
If instead of the Euclidean inner product w ⋅ ݔ
݅
one fed the QP solver with a function
K(w, ݔ
݅
) the boundary between the two classes would then be,
K(x,w) + b = 0 (2.9)
and the set of x e R
d
on that boundary becomes a curved surface embedded in R
d
when
the function K(x,w) is non-linear.
Consider K(x,w) to be the inner product not of the coordinate vectors x and w in R
d
but
of vectors o(x) and o(w) in higher dimensions. The map, o: X ÷ H
is called a feature map from the data space X into the feature space H . The feature
space is assumed to be a Hilbert space of real valued functions defined on X . The data
space is often R
d
but most of the interesting results hold when X is a compact
Riemannian manifold. The following picture illustrates a particularly simple example
where the feature map o(x1,x2)=(x1
2
,\2x1x2,x2
2
) maps data in R
2
into R
3
.

Figure 2.1: The kernel trick, after transformation the data is linearly separable.
2.1.6 Na ve Bayes Classifier

Along with SVM, we also tried Naïve Bayes Classifier[6]. A naive Bayes classifier is a
term in Bayesian statistics dealing with a simple probabilistic classifier based on


23
applying Bayes' theorem with strong (naive) independence assumptions. A more
descriptive term for the underlying probability model would be "independent feature
model". In simple terms, a naive Bayes classifier assumes that the presence (or absence)
of a particular feature of a class is unrelated to the presence (or absence) of any other
feature. For example, the words or features of a given question may are assumed to be
independent to simplify mathematical complexities. Even though these features depend
on the existence of the other features, a naive Bayes classifier considers all of these
properties to independently contribute to the probability that this fruit is an apple.
Depending on the precise nature of the probability model, naive Bayes classifiers can be
trained very efficiently in a supervised learning setting. In many practical applications,
parameter estimation for naive Bayes models uses the method of maximum likelihood;
in other words, one can work with the naive Bayes model without believing in Bayesian
probability or using any Bayesian methods.
Abstractly, the probability model for a classifier is a conditional model

p(C| F_1,…….,F_n),

over a dependent class variable C with a small number of outcomes or classes,
conditional on several feature variables F_1 through F_n. The problem is that if the
number of features n is large or when a feature can take on a large number of values,
then basing such a model on probability tables is infeasible. We therefore reformulate
the model to make it more tractable.
Using Bayes' theorem, we write

p(C | F_1,…….,F_n) =
ܘ(۱) ܘ(۴

,……,۴_ܖ| ۱)
ܘ(۴

,…….,۴_ܖ)
(2.10)

In plain English the above equation can be written as

posterior = (prior*likelihood)/evidence (2.11)

In practice we are only interested in the numerator of that fraction, since the
denominator does not depend on C and the values of the features F_i are given, so that
the denominator is effectively constant. The numerator is equivalent to the joint
probability model

p(C, F_1, ………, F_n),

which can be rewritten as follows, using repeated applications of the definition of
conditional probability:

p(C, F_1, ………., F_n)
= p(C) p(F_1,……….,F_n| C)
= p(C) p(F_1| C) p(F_2,.......,F_n| C, F_1)

24
= p(C) p(F_1| C) p(F_2| C, F_1) p(F_3,.......,F_n| C, F_1, F_2)
= p(C) p(F_1| C) p(F_2| C, F_1) p(F_3| C, F_1, F_2) p(F_4,.......,F_n| C, F_1, F_2, F_3)
= p(C) p(F_1| C) p(F_2| C, F_1) p(F_3| C, F_1, F_2) ...
.... p(F_n| C, F_1, F_2, F_3,.......,F_{n-1}) (2.12)
and so forth. Now the "naive" conditional independence assumptions come into play:
assume that each feature Fi is conditionally independent of every other feature Fj for j
i. This means that
p(F_i | C, F_j) = p(F_i | C), (2.13)
and so the joint model can be expressed as
p(C, F_1, ......., F_n) = p(C) p(F_1| C) p(F_2| C) p(F_3| C) …….. p(F_n| C)
= p(C)∏ p(F_i | C)
݊
݅=1
(2.14)
2.1.7 Datasets
We used the publicly available training and testing datasets provided by Tagged
Question Corpus, Cognitive Computation Group at the Department of Computer Science,
University of Illinois at Urbana-Champaign (UIUC) [5]. All these datasets have been
manually labelled by UIUC [5] according to the coarse and fine grained categories in
Table 1.1. There are about 5,500 labelled questions randomly divided into 5 training
datasets of sizes 1,000, 2,000, 3,000, 4,000 and 5,500 respectively. The testing dataset
contains 2000 labelled questions from the TREC QA track. The TREC QA data is hand
labelled by us.
2.1.8 Features
For each question, we extract two kinds of features: bag-of-words or a mix of POS tags
and words. Every question is represented as feature vectors; the weight associated with
each word varies between 0 and 1. The following example demonstrated different
feature sets considered for a given question and its POS parse.

Figure 2.2: Various feature sets extracted from the given question and its corresponding
part of speech tags.


25
2.1.9 Entropy and Weighted Feature Vector
In information theory the concept of entropy is used as a measure of the uncertainty of
a random variable. Let X be a discrete random variable with respect to alphabet A and
p(x) = Pr(X = x), x ∈ A be the probability function, then the entropy H(X) of the discrete
random variable X is defined as:

H(x) = −Σx∈A {p(x) log p(x)} (2.15)

The larger the entropy H(X) is, the more uncertain the random variable X is. In
information retrieval many methods have been applied to evaluate term’s relevance to
documents, among which entropy-weighting, based on information theoretic ideas, is
proved the most effective and sophisticated. Let fit be the frequency of word i in
document t, ni the total number of occurrences of word i in document collection, N the
number of total documents in the collection, then the confusion (or entropy) of word i
can be measured as follows:

H(i) = Σ(t=1 to N) [fit ni · log(nifit)] (2.16)

The larger the confusion of a word is, the less important it is. The confusion achieves
maximum value log(N) if the word is evenly distributed over all documents, and
minimum value 0 if the word occurs in only one document.
Keeping this in mind to calculate the entropy of a word, certain preprocessing is needed.
Let C be the set of question types. Without loss of generality, it is denoted by C = {1, . . .
,N}. Ci is a set of words extracted from questions of type i, that is to say, Ci represents a
word collection similar to documents. From the viewpoint of representation, each Ci is
the same as a document because both of which are just a collection of words. Therefore
we can also use the idea of entropy to evaluate word’s importance. Let ai be the weight
of word i, fit be the frequency of word i in Ct, ni be the total number of occurrences of
word i in all questions, then ai is defined as:

ai =(1 +1/log(N)Σ(t=1 to N)[fitni·log(fitni)]) (2.17)

Weight of word i is opposite to its entropy: the larger the entropy of word i is, the less
important to question classification it is. In other words, the smaller weight is
associated with word i. Consequently, ai get the maximum value of 1 if word i occurs in
only one set of question type, and the minimum value of 0 if the word is evenly
distributed over all sets. Note that if a word occurs in only one set, for other sets fik is 0.
We use the convention that 0 log 0 = 0, which is easily justified since xlogx → 0 as x → 0.

26
2.1.10 Experiment Results
We tested various algorithms for question classification.
Naïve Bayes Classifier* using Bag of Words feature set (64% accurate on TREC data),
Naïve Bayes Classifier* using Partitioned feature set (69% accurate on TREC data),
Support Vector Machine Classifier using Bag of Words feature set (78% accurate on
TREC data), Support Vector Machine Classifier using Weighted feature set (85%
accurate on TREC data)
It must be noted that the classifiers were NOT trained on TREC data. The classifier
classified questions into six broad classes and fifty coarse classes. Therefore a baseline
(random) classifier is (1/50) = 2% accurate. We employed various smoothing
techniques to Naive Bayes Classifier. The performance without smoothing was too low
and not worth mentioning. While Witten-Bell smoothing worked well, simple add one
smoothing outperformed it. The accuracy reported here are for Naive Bayes Classifier
employing add one smoothing.
We implemented weighted feature set SVM classifier into a cross platform standalone
desktop application (shown below). The application will be made available to public for
evaluation. Training was done on a set of 12788 questions provided by Cognitive
Computation Group at the Department of Computer Science, University of Illinois at
Urbana-Champaign.


Figure 2.3: Classifiers were tested on a set of 2000 TREC questions.
Some sample test runs
Q: What was the name of the first Russian astronaut to do a spacewalk?
Response: HUM -> IND(an Individual)
Q: How much folic acid should an expectant mother get daily?
Response: NUM -> COUNT
0
10
20
30
40
50
60
70
80
90
Chart Showing accuracy of classifier
Baseline Classifier
Naïve Bayes Classifier using
Bag of Words feature
Naïve Bayes Classifier using
Partitioned feature
SVM Classifier using Bag of
Words feature
SVM Classifier using
Weighted feature set


27
Q: What is Francis Scott Key best known for?
Response: DESC -> DESC
Q: What state has the most Indians?
Response: LOC -> STATE
Q: Name a flying mammal.
Response: ENTITY -> ANIMAL


Figure 2.4: JAVA Question Classifier, can be downloaded for evaluation from
http://www.cybergeeks.co.in/projects.php?id=10
2.2 Query Formulation

The question analysis component of a QA system is usually responsible for formulating
a query from a natural language questions to maximise the performance of the IR
engine used by the document retrieval component of the QA system. Most QA systems
start constructing an IR query simply by assuming that the question itself is a valid IR
query, while other systems go for a query expansion. The design of the query expansion
module should be such as to maintain the right balance between recall and precision.
For large corpus, query expansion may not be necessary as even with not so well
formed query recall is sufficient to extract the right answer and query expansion may in
fact reduce precision. But in case of a local small corpus query expansion may be
necessary. In our system when using web as document collection, we pass on the
question as IR query after masking the stop words. When a web corpus is not available
we employ Rocchio Query Expansion [1] method which is implemented in lucene query
expansion module. The table below shows performance of various query expansion
modules implemented on Lucene. The test is carried out on data from NIST TREC
Robust Retrieval Track 2004





28
Tag Combined Topic Set
MAP P10 %no
Lucene QE 0.2433 0.3936 18.10%
Lucene gQE 0.2332 0.3984 14%
KB-R-FIS gQE 0.2322 0.4076 14%
Lucene 0.2 0.37 15%

MAP - mean average precision
P10 - average of precision at 10 documents retrieved
%no - percentage of topics with no relevant in the top 10 retrieved

Lucene QE - lucene with local query expansion
Lucene gQE – Lucene system that utilized Rocchio’s query expansion along with Google.
KB-R-FIS gQE – My Fuzzy Inference System that utilized Rocchio’s query expansion along with Google.

Table 2.1: performance of various query expansion modules implemented on Lucene.

It must be noted that query expansion is internally carried out by the APIs used to
retrieve documents from the web, although because of the proprietary nature their
working is unknown and unpredictable.
2.2.1 Stop word for IR query formulation
Stop words or noise words are words which appear with a high frequency and are
considered to be insignificant for normal IR processes. Unfortunately when it comes to
QA systems high frequency of a word in a collection may not always suggest that it is
insignificant in retrieving the answer. For example the word “first” is widely considered
to be a stop word but is very important when appears in the question “Who was the first
President of India?”. Therefore we manually analyzed 100 TREC QA track questions and
prepared a list of stop words. A partial list of the stop words is shown below.

I a about an are
As at be by com
De en for from how
In is it la of
On or that the this
To was what when where
Who will with und the
www

The list of stop words we obtained is much smaller than standard stop word lists
(although there is no definite list of stop words which all natural language processing
tools incorporate, most of these lists are very similar).


29
Chapter3. Document Retrieval

The text collection over which a QA system works tend to be so large that it is
impossible to process whole of it to retrieve the answer. The task of the document
retrieval module is to select a small set from the collection which can be practically
handled in the later stages. A good retrieval unit will increase precision while
maintaining good enough recall.
3.1 Retrieval from local corpus
All the work presented in this thesis relies upon the Lucene IR engine [13] for local
corpus searches. Lucene is an open source boolean search engine with support for
ranked retrieval results using a TF.IDF based vector space model. One of the main
advantages of using Lucene, over many other IR engines, is that it is relatively easy to
extend to meet the demands of a given research project (as an open source project the
full source code to Lucene is available making modification and extension relatively
straight forward) allowing experiments with different retrieval models or ranking
algorithms to use the same document index.
3.1.1 Ranking function
We employ highly popular Okapi BM25 [3] ranking function for our document retrieval
module. It is based on the probabilistic retrieval framework developed in the 1970s and
1980s by Stephen E. Robertson, Karen Spärck Jones, and others [14].
The name of the actual ranking function is BM25. To set the right context, however, it
usually referred to as "Okapi BM25", since the Okapi information retrieval system,
implemented at London's City University in the 1980s and 1990s, was the first system
to implement this function. BM25, and its newer variants, e.g. BM25F [2] (a version of
BM25 that can take document structure and anchor text into account), represent state-
of-the-art retrieval functions used in document retrieval, such as Web search.
3.1.2 Okapi BM25
BM25 is a bag-of-words retrieval function that ranks a set of documents based on the
query terms appearing in each document, regardless of the inter-relationship between
the query terms within a document (e.g., their relative proximity). It is not a single
function, but actually a whole family of scoring functions, with slightly different
components and parameters. One of the most prominent instantiations of the function
is as follows. Given a query Q, containing keywords ݍ
1
,..., ݍ
݊
, the BM25 score of a
document D is:


30
Score(D,Q) = ∑ ܫܦܨ(ݍ
݅
).
݂(ݍ
݅
,ܦ)∙(݇
1
+1)
݂(ݍ
݅
,ܦ)+݇
1
∙(1−ܾ+ܾ∙
|ܦ|
ܽݒ݈݃݀
)
ܰ
݅=1
(3.1)

where f(qi,D) is qi's term frequency in the document D, | D | is the length of the
document D in words, and avgdl is the average document length in the text collection
from which documents are drawn. k1 and b are free parameters, usually chosen as k1 =
2.0 and b = 0.75. IDF(qi) is the IDF (inverse document frequency) weight of the query
term qi. It is usually computed as:

ܫܦܨ(ݍ
݅
) = ݈݋݃
ܰ−݊(ݍ
݅
)+0.5
݊(ݍ
݅
)+0.5
(3.2)

where N is the total number of documents in the collection, and n(qi) is the number of
documents containing qi. There are several interpretations for IDF and slight variations
on its formula. In the original BM25 derivation, the IDF component is derived from the
Binary Independence Model.
3.1.3 IDF Information Theoretic Interpretation
Here is an interpretation from information theory. Suppose a query term q appears in
n(q) documents. Then a randomly picked document D will contain the term with
probability n(q)/N (where N is again the cardinality of the set of documents in the
collection). Therefore, the information content of the message "D contains q" is:

-log
݊(ݍ)
ܰ
= log
ܰ
݊(ݍ)
(3.3)

Now suppose we have two query terms q1 and q2. If the two terms occur in documents
entirely independently of each other, then the probability of seeing both q1 and q2 in a
randomly picked document D is:
݊(ݍ1)
ܰ

݊(ݍ2)
ܰ

and the information content of such an event is:
෍log
ܰ
݊(ݍ݅)
2
݅=1

With a small variation, this is exactly what is expressed by the IDF component of BM25.
3.2 Information retrieval from the web
Indexing the whole web is a gigantic task which is not possible on a small scale.
Therefore we use public APIs of search engines. We have used Google AJAX Search API
and Yahoo BOSS. Both APIs have relaxed terms of condition and allow access through
code. Moreover there are no limits on number of queries per day when used for


31
educational purposes. The search APIs can return top n documents for a given query.
We read top n uniform resource locators (URLs) and build the collection of documents
to be used for answer retrieval. As the task of reading the URLs over the internet is
inherently slow process, this stage is the most taxing one in terms of runtime. To
accelerate the process we employ multi threaded URL readers so that multiple URLs can
be read simultaneously. Figure 3.1 shows the document retrieval framework.
























Figure 3.1: Document retrieval framework
3.2.1 How many documents to retrieve?

One of the main considerations when doing document retrieval for QA is the amount of
text to retrieve and process for each question. Ideally a system would retrieve a single
text unit that was just large enough to contain a single instance of the exact answer for
every question. Whilst the ideal is not attainable, the document retrieval stage can act as
a filter between the document collections/web and answer extraction components by
retrieving a relatively small set of text collection. Therefore our target is to increasing
coverage with least number of retrieved documents to form the text collection. Lowered
precision is penalized by higher average processing time by later stages. Therefore,
IR Query
Lucene IR
Module
Google/Yahoo
Search APIs
Local
Corpus
Okapi BM25
Ranking function
URL Reader
URL Reader
URL Reader
URL Reader
Top n
Docs
Multi threaded
Reader module

INTERNET


32
criterion for selecting the right collection size depends on coverage and average
processing time. The table below shows percentage coverage, average processing time
at different ranks for Google and Yahoo search APIs. The results are obtained on a set of
30 questions (equally distributed over all question classes) from TREC 04 QA track [5].

Average
Processing
Time*(sec)
%Coverage @rank
Yahoo Google Yahoo Google
0.02 0.021 23 28 1
0.102 0.09 31 48 2
0.27 0.23 37 56 3
0.34 0.37 42 58 4
0.49 0.51 48 64 5
0.82 0.803 49 64 6
1.23 1.1 49 64 7
1.44 1.39 51 66 8
2.01 1.9 51 70 9
2.31 2.2 52 72 10
2.8 2.6 53 72 11
3.22 3.1 53 73 12
3.7 3.4 54 73 13
4.2 4.6 54 73 14
4.77 5.1 55 74 15

*Average time spent by answer retrieval node.

Table 3.1: %coverage and average processing time at different ranks



Figure 3.2: %coverage vs rank
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Yahoo BOSS API 23 31 37 42 48 49 49 51 51 52 53 53 54 54 55
Google AJAX Search
API
28 48 56 58 64 64 64 66 70 72 72 73 73 73 74
0
10
20
30
40
50
60
70
80
%
C
o
v
e
r
a
g
e
%Coverage vs rank


33



Figure 3.3: %coverage vs. average processing time

From the results it is clear that going up to rank 5 ensures a good coverage while
maintaining low processing time. Clearly Google outperforms Yahoo at all ranks.

0
10
20
30
40
50
60
70
80
0
.
0
2
1
0
.
0
9
0
.
2
3
0
.
3
7
0
.
5
1
0
.
8
0
3
1
.
1
1
.
3
9
1
.
9
2
.
2
2
.
6
3
.
1
3
.
4
4
.
6
5
.
1
%Coverage vs Average Processing time(sec)
Google AJAX Search
API
Yahoo BOSS


34
Chapter4. Answer Extraction

The final stage in a QA system, and arguably the most important, is to extract and
present the answers to questions. We employ a named entity (NE) recognizer to filter
out those sentences which could potentially contain answer to the given question. In
our system we have used GATE – A General Architecture for Text Engineering provided
by The Sheffield NLP group [15] as a tool to handle most of the NLP tasks including NE
recognition.
4.1 Sentence Ranking
The sentence ranking is responsible for ranking the sentences and giving a relative
probability estimate to each one. It also registers the frequency of each individual
phrase chunk marked by the NE recognizer for a given question class. The final answer
is the phase chunk with maximum frequency belonging to the sentence with highest
rank. The probability estimate and the retrieved answer’s frequency are used to
compute confidence of the answer.
4.1.1 WordNet
WordNet [16] is the product of a research project at Princeton University which has
attempted to model the lexical knowledge of a native speaker of English. In WordNet
each unique meaning of a word is represented by a synonym set or synset. Each synset
has a gloss that defines the concept of the word. For example the words car, auto,
automobile, and motorcar is a synset that represents the concept define by gloss: four
wheel Motor vehicle, usually propelled by an internal combustion Engine. Many glosses
have examples of usages associated with them, such as "he needs a car to get to work."
In addition to providing these groups of synonyms to represent a concept, WordNet
connects concepts via a variety of semantic relations. These semantic relations for
nouns include:
- Hyponym/Hypernym (IS-A/ HAS A)
- Meronym/Holonym (Part-of / Has-Part)
- Meronym/Holonym (Member-of / Has-Member)
- Meronym/Holonym (Substance-of / Has-Substance)

Figure 4.1 shows a fragment of WordNet taxonomy.


35
4.1.2 Sense/Semantic Similarity between words
We use statistics to compute information content value. We assign a probability to a
concept in taxonomy based on the occurrence of target concept in a given corpus.
The IC value is then calculated by negative log likelihood formula as follow:

( ) log( ( )) IC c P c = ÷
(4.1)
Where c is a concept and p is the probability of encountering c in a given corpus. Basic
idea behind the negative likelihood formula is that the more probable a concept
appears, the less information it conveys, in other words, infrequent words are more
informative then frequent ones. Using this basic idea we compute the sense/semantic
similarity ૆ between two given words based on a similarity metric proposed by Philip
Resnik [17].



Figure 4.1: Fragment of WordNet taxonomy


36
4.1.2 Sense Net ranking algorithm
We consider the sentence under consideration and the given query to be a set of words
similar to a bag of word model. But unlike a bag of word model we give importance to
the order of the words. Stop words are rejected from the set and only the root forms of
the words are taken into account. If W is the ordered set of n words in the given
sentence and Q is the ordered set m words in the query, then we compute a network of
sense weights between all pair of words in W and Q. Therefore we define the sense
network ) ( ,
j i
w q I as:

,
) ( ,
i j i j
w q c I =
(4.2)
Where
,
[0,1]
i j
c ‹ is the value of sense/semantic similarity between and
i j
w W q Q ‹ ‹ .






…… ࢝




) ( ,
j i
w q I






…… ࢗ


Figure 4.2: A sense network formed between a sentence and a query.

Given a sense network ) ( ,
i j
w q I , we define the distance of a word
i
w as
( )
i
d w i = (4.3)
( )
j
d q j = (4.4)
Word with maximum sense similarity with query word
i
q is:

,
( | ) ar ax gm
i j j j i
w j M q c = = (4.5)
And the corresponding value of
, i j
c = ( )
i
V q (4.6)
The exact match score is
ܧ
ݐ݋ݐ݈ܽ
=
∑ ܸ(ݍ
݅
)
݅
݉


Average sense similarity for query word ݍ
݅
with sentence W is
S(ݍ
݅
) =
∑ ξ
݆ ,݅ ݆
݊
(4.7)

Therefore the total average sense per word is
ܵ
ݐ݋ݐ݈ܽ
=
∑ ܵ(ݍ
݅
)
݅
݊
=
∑ ∑ ξ
݅,݆ ݆ ݅
݉݊
(4.8)
Let T = {ordered set of M(ݍ
݅
) ∀ ݅ ∈ [1, ݉]} in increasing order of d(q). Function ࣂ

is the
distance of i
th
element in ܶ then the alignment score is
ܭ
ݐ݋ݐ݈ܽ
=
∑ ݏ݃݊(ߠ
݅+1
−ߠ
݅
)
݉−1
݅=1
݉−1
(4.9)


37
The total average noise is defined as

*( * )
1
total
n E m
total
e
o
o
÷ ÷
= ÷
(4.10)
Where o is the noise decay factor.
Now, ߤ = noise penalty coefficient ߰ = ݁ݔܽܿݐ ݉ܽݐܿℎ ܿ݋݂݂݁݅ܿ݅݁݊ݐ
ߣ = ݏ݁݊ݏ݁ ݏ݈݅݉݅ܽݎ݅ݐݕ ܿ݋݂݂݁݅ܿ݅݁݊ݐ ߥ = ݋ݎ݀݁ݎ ܿ݋݂݂݁݅ܿ݅݁݊ݐ
Total score

ࣁ = ࣒× ࡱ
࢚࢕࢚ࢇ࢒
+ ࣅ × ࡿ
࢚࢕࢚ࢇ࢒
+ ࣆ × ࢾ
࢚࢕࢚ࢇ࢒

+
ࣇ × ࢑
࢚࢕࢚ࢇ࢒
(4.11)

The coefficients are fine tuned depending on the type of corpus. Unlike newswire data
most of the information found on the internet is badly formatted, grammatically
incorrect and most of the time not well formed. So when web is used as the knowledge
base we use the following values of different coefficients: ߤ = 1.0, ߰ = 1.0, ߣ = 0.25 ,
ߥ = 0.125 and noise decay factor o =0.25 but when using local corpus we reduce ߤ to
0.5 and o to 0.1. Once we obtain the total score for each sentence, we sort then
according to these scores. We take top t sentences and consider the plausible answers
within them. If an answer appears with frequency f in sentence ranked r then that
answer gets a confidence score

1
(1 l ) ) ) ( n( C ans f
r
+ =
(4.12)
Again all answers are sorted according to confidence score and top 0 (=5 in our case)
answers are returned along with corresponding sentence and URL (figure 4.3).


Figure 4.3: A sample run for the question “Who performed the first human heart
transplant?”


38
Chapter5. Implementation and Results

Our question answering module is written in JAVA. Use of JAVA makes the software
cross platform and highly portable. It uses various third party APIs for NLP and text
engineering; GATE, Stanford parser, Json and Lucene API to name a few. Each module is
designed keeping in mind space and time constraints. The URL reader module is multi
threaded to keep download time at the minimum. Most of the pre-processing is done via
GATE processing pipeline. More information is provided in appendix B.



Figure 5.1: Various modules of the QnA system along with each ones basic task.
5.1 Results
The idea of building an easily accessible question answering system which uses the web
as a document collection is not new. Most of these systems are accessed via a web
browser. In the later part of the section we compare our system with other web QA
systems. The tests were performed on a small set of fifty web based questions. The
reason we did not use questions from TREC QA is that the TREC questions are now
appearing quite frequently (sometimes with correct answers) in the results of web
search engines. This could have affected the results of any web based study. For this
reason a new collection of fifty questions was assembled to serve as the test set. Also we
don’t have access to AQUAINT corpus which is the knowledgebase for TREC QA systems.
The questions within the new test set were chosen to meet the following criteria:
Multi threaded URL reader implementation
Multi threaded URL reader interface
Stopwords filter class
Computes Sense/Semantic similarity between words
Stores a generic URL along with number of attempts to read it
Trains the weighted feature vector SVM classifier
Load GATE processing resources.
Implements a standard porter stemmer
main class that handles user queries.
Uses Google and Yahoo search engine queries to build the corpus
Sense Net implementation
ArrayList of Ranked Sentences with helper methods
Weighted feature vector SVM classifier.


39
1. Each question should be an unambiguous factoid question with only one known
answer. Some of the questions chosen do have multiple answers although this is
mainly due to incorrect answers appearing in some web documents.
2. The answers to the questions should not be dependent upon the time at which
the question is asked. This explicitly excludes questions such as “Who is the
President of the US?”
These questions are provided in appendix A.
For each question in set, the table below shows the (min) rank at which answer was
obtained. In case the system fails to answer a question we show the reason it failed. Also
time spent on various tasks is shown which would help in determining the feasibility of
the system to be used in real time environment. We used top 5 documents to construct
our corpus which restricts our coverage to 64%. In a way 64% is the accuracy upper
bound of our system.

Question
No.
Answer
obtained
@ rank
Remarks Time in seconds
Document
retrieval
module
#

Pre-
processing
Answer
extraction
module
1 5 8.5 13 0.82
2 NA NE recognizer not
designed to handle
this question.
0 0 0
3 1 11 9.77 0.38
4 4 8.6 10.23 0.41
5 1 6.4 13.33 0.55
6 1 7.8 15 0.51
7 NA NE recognizer not
designed to handle
this question.
0 0 0
8 1 4.1 16.3 1.1
9 1 5.2 11.8 0.43
10 1 6.4 12.23 0.61
11 NA Question Classifier
failed
0 0 0
12 3 8.0 14.5 0.2
13 1 7.37 11.2 0.71
14 1 8.1 15.7 0.88
15 NA Incorrect Answer 6.54 13.5 0.47
16 1 6.9 11.78 0.53
17 5 6.2 17.2 0.91


40
18 1 7.1 14.63 0.42
19 2 6.99 16.1 0.54
20 1 8.2 12.31 0.45
21 NA NE recognizer not
designed to handle
this question.
0 0 0
22 1 7.66 11.9 0.61
23 NA NE recognizer not
designed to handle
this question.
0 0 0
24 NA NE recognizer not
designed to handle
this question.
0 0 0
25 1 Answer changed
recently
11.2 14.7 0.62
26 NA Incorrect Answer 5.5 8 0.23
27 NA NE recognizer not
designed to handle
this question.
0 0 0
28 1 11.7 15.1 0.58
29 1 6.9 10.67 0.43
30 1 7.9 13.83 0.67
31 1 Incorrect Answer 6.67 11.5 0.47
32 4 7.23 14.67 0.65
33 1 7.21 16.23 0.61
34 1 Incorrect Answer 6.8 11.21 0.34
35 1 7.4 12.0 0.36
36 1 8.01 14.8 0.59
37 NA Incorrect Answer 8.11 14.99 0.64
38 NA Incorrect Answer 8.23 11.01 0.34
39 1 6.77 10.2 0.41
40 NA Incorrect Answer 8.4 16.3 0.79
41 1 9.1 11.4 0.53
42 NA Incorrect Answer 6.7 8.22 0.23
43 1 7.8 14.3 0.43
44 NA Incorrect Answer 9.2 16.1 0.62
45 1 7.2 13.8 0.48
46 1 11.2 15.3 0.54
47 NA Incorrect Answer 7.1 12.67 0.38
48 NA Incorrect Answer
mainly because req.
6.99 11.11 0.29


41
answer type was
present in the query
itself
49 2 8.01 12.51 0.46
50 NA Incorrect Answer 7.67 11.02 0.33
Average time spent: 6.6 11.24 0.45
Total number of questions: 50; Number of questions answered@
- Rank 1: 26 – Accuracy 52%
- Rank 2: 28 – Accuracy 56%
- Rank 3: 29 – Accuracy 58%
- Rank 4: 31 – Accuracy 62%
- Rank 5: 32 – Accuracy 64%
Average time spent per question: 18.3 seconds
#time is dependent on network speed

Table 5.1: Performance of the system on the web question set.

As seen, most of the failures were because of the handicapped NE recognizer. The
question classifier failed in only one instance. @Rank 5 the system reached its accuracy
upper bound of 64%.
5.2 Comparisons with other Web Based QA Systems
We compare our system with four web based QA Systems – AnswerBus [18],
AnswerFinder, IONAUT [19] and PowerAnswer[20]. The consistently best performing
system at TREC forms the backbone of the PowerAnswer system from Language
Computer
1
. Unlike our system each answer is a sentence and no attempt is made to
cluster (or remove) sentences which contain the same answer. This gives undue
advantage to the system as it performs the easier task of finding relevant sentences
only. The system called AnswerBus
2
[18] behaves in much the same way as
PowerAnswer, returning full sentences containing duplicated answers. It is claimed that
AnswerBus can correctly answer 70.5% of the TREC 8 question set although we believe
the performance would decrease if exact answers were being evaluated as experience of
the TREC evaluations has shown this to be a harder task than locating answer bearing
sentences. IONAUT
3
[19] uses its own crawler to index the web with specific focus on
entities and the relationships between them in order to provide a richer base for
answering questions than the unstructured documents returned by standard search
engines. The system returns both exact answers and snippets. AnswerFinder is a client
side application that supports natural language questions and queries the Internet via
Google. It returns both exact answer and snippets. This system is the closest to ours.

1. http://www.languagecomputer.com/demos/
2. http://misshoover.si.umich.edu/˜zzheng/qa-new/
3. http://www.ionaut.com:8400


42
The questions from the web question set were presented to the five systems on the
same day, within as short a period of time as was possible, so that the underlying
document collection, in this case the web would be relatively static and hence no system
would benefit from subtle changes in the content of the collection.


Figure 9.2: Comparison of AnswerBus , AnswerFinder , IONAUT +, PowerAnswer
and our system

It is clear from the graph that our system outperforms all but AnswerFinder at rank 1.
This is quite important as the answer returned at rank 1 can be considered to be the
final answer provided by the system. At higher ranks it performs considerably better
than AnswerBus and IONAUT while performing marginally less than AnswerFinder and
PowerAnswer. The results are encouraging but it should be noted that due to the small
number of test questions it is difficult to draw firm conclusions from these experiments.
5.3 Feasibility of the system to be used in real time environment

From table 5.1 it is clear that the system cannot be used for real time purposes as of
now. An average response time of 18.3 seconds is too high. But it must be noted that

43
document retrieval time will be significantly lower for offline – local corpus. More over
the task of post processing can be done offline on the corpus as it is independent of the
query. Once the corpus is pre-processed offline, the actual task of retrieving an answer
is quite low at 0.45 seconds. We believe that if we use our own crawler and pre-process
the documents beforehand, our system can retrieve answers fast enough to be used in
real time systems. The graph below shows percentage of time spent in different tasks.


Figure 9.3: Time distribution of each module involved in QA
5.4 Conclusion
The main motivation behind the work in this thesis was to consider, where possible,
simple approaches to question answering which can be both easily understood and
would operate quickly. We observed that the performance of the system is limited by
the worst performing module of the QA system. So even if a single module fails the
whole system won’t be able to answer. In our case the NE recognizer is the weakest
link. Our NE recognizer recognizes limited sets of answer types which is not enough to
obtain a good enough overall accuracy. We employed machine learning techniques for
question classification whose performance is good enough and any further
improvements won’t be beneficial. We also proposed the Sense Net algorithm as new
way of ranking sentences and answers. Even with the limited capability of NE
recognizer the system is at par with state of the art web QA systems which confirms the
efficacy of the ranking algorithm. The time distribution of various modules shows that
the system is quite fast at the answer extraction stage, if used along with a local corpus
which is pre-processed offline it can be adapted for real time applications. Finally our
current results are encouraging but we acknowledge that due to the small number of
test questions it is difficult to draw firm conclusions from these experiments.
Document
Retrieval
36%
Pre-Processing
61%
Answer
Extraction
3%
time distribution


44
Appendix A
Small Web Based Question Set
Q001: The chihuahua dog derives it’s name from a town in which country? Ans: Mexico
Q002: What is the largest planet in our Solar System? Ans: Jupiter
Q003: In which country does the wild dog, the dingo, live? Ans: Australia or America
Q004: Where would you find budgerigars in their natural habitat? Ans: Australia
Q005: How many stomachs does a cow have? Ans: Four or one with four parts
Q006: How many legs does a lobster have? Ans: Ten
Q007: Charon is the only satellite of which planet in the solar system? Ans: Pluto
Q008: Which scientist was born in Germany in 1879, became a Swiss citizen in 1901 and
later became a US citizen in 1940? Ans: Albert Einstein
Q009: Who shared a Nobel prize in 1945 for his discovery of the antibiotic penicillin?
Ans: Alexander Fleming, Howard Florey or Ernst Chain
Q010: Who invented penicillin in 1928? Ans: Sir Alexander Fleming
Q011: How often does Haley’s comet appear? Ans: Every 76 years or every 75 years
Q012: How many teeth make up a full adult set? Ans: 32
Q013: In degrees centigrade, what is the average human body temperature? Ans: 37, 38
or 37.98
Q014: Who discovered gravitation and invented calculus? Ans: Isaac Newton
Q015: Approximately what percentage of the human body is water? Ans: 80%, 66%,
60% or 70%
Q016: What is the sixth planet from the Sun in the Solar System? Ans: Saturn
Q017: How many carats are there in pure gold? Ans: 24
Q018: How many canine teeth does a human have? Ans: Four
Q019: In which year was the US space station Skylab launched? Ans: 1973
Q020: How many noble gases are there? Ans: 6
Q021: What is the normal colour of sulphur? Ans: Yellow
Q022; Who performed the first human heart transplant? Ans: Dr Christiaan Barnard
Q023: Callisto, Europa, Ganymede and Io are 4 of the 16 moons of which planet? Ans:
Jupiter
Q024: Which planet was discovered in 1930 and has only one known satellite called
Charon? Ans: Pluto
Q025: How many satellites does the planet Uranus have? Ans: 15, 17, 18 or 21
Q026: In computing, if a byte is 8 bits, how many bits is a nibble? Ans: 4
Q027: What colour is cobalt? Ans: blue
Q028: Who became the first American to orbit the Earth in 1962 and returned to Space
in 1997? Ans: John Glenn
Q029: Who invented the light bulb? Ans: Thomas Edison


45
Q030: How many species of elephant are there in the world? Ans: 2
Q031: In 1980 which electronics company demonstrated its latest invention, the
compact disc? Ans: Philips
Q032: Who invented the television? Ans: John Logie Baird
Q033: Which famous British author wrote ”Chitty Chitty Bang Bang”? Ans: Ian Fleming
Q034: Who was the first President of America? Ans: George Washington
Q035: When was Adolf Hitler born? Ans: 1889
Q036: In what year did Adolf Hitler commit suicide? Ans: 1945
Q037: Who did Jimmy Carter succeed as President of the United States? Ans: Gerald
Ford
Q038: For how many years did the Jurassic period last? Ans: 180 million, 195 – 140
million years ago, 208 to 146 million years ago, 205 to 140 million years ago, 205 to 141
million years ago or 205 million years ago to 145 million years ago
Q039: Who was President of the USA from 1963 to 1969? Ans: Lyndon B Johnson
Q040: Who was British Prime Minister from 1974-1976? Ans: Harold Wilson
Q041: Who was British Prime Minister from 1955 to 1957? Ans: Anthony Eden
Q042: What year saw the first flying bombs drop on London? Ans: 1944
Q043: In what year was Nelson Mandela imprisoned for life? Ans: 1964
Q044: In what year was London due to host the Olympic Games, but couldn’t because of
the Second World War? Ans: 1944
Q045: In which year did colour TV transmissions begin in Britain? Ans: 1969
Q046: For how many days were US TV commercials dropped after President Kennedy’s
death as a mark of respect? Ans: 4
Q047: What nationality was the architect Robert Adam? Ans: Scottish
Q048: What nationality was the inventor Thomas Edison? Ans: American
Q049: In which country did the dance the fandango originate? Ans: Spain
Q050: By what nickname was criminal Albert De Salvo better known? Ans: The Boston
Strangler.



46
Appendix B
Implementation Details
We have used Jcreator (http://www.jcreator.com/ ) as the preferred IDE. The code uses
newer features like generics which is not compatible with any version of JAVA prior to
1.5. The following third party APIs are used:

- GATE 4.0 (A General Architecture for Text Engineering) software toolkit
originally developed at the University of Sheffield since 1995 - http://gate.ac.uk/
- Apache Lucene API is a free/open source information retrieval library, originally
created in Java by Doug Cutting - http://lucene.apache.org/
- JSON API, JSON or JavaScript Object Notation, is a lightweight computer data
interchange format. The API brings support to read JSON data. -
http://www.json.org/java/
- LibSVM A Library for Support Vector Machines by Chih-Chung Chang and Chih-
Jen Lin - http://www.csie.ntu.edu.tw/~cjlin/libsvm/
- JWNL is an API for accessing WordNet in multiple formats, as well as relationship
discovery and morphological processing -
http://sourceforge.net/projects/jwordnet/
- Stanford Log-linear Part-Of-Speech Tagger -
http://nlp.stanford.edu/software/tagger.shtml
- WordNet 2.1 is a lexical database for the English language is used to measure
sense/semantic similarity measure - http://wordnet.princeton.edu/

All experiments performed on a Core 2 Duo 1.86 GhZ System with 2GB RAM. Default
stack size may not be sufficient to run the application. Therefore stack size should be
increased to at least 512MB using –Xmx512m command line option. Some classes
present in JWNL API conflict with GATE. To resolve the issue conflicting libraries
belonging to GATE must not be included in the classpath.



47
References
[1] Miles Efron. Query expansion and dimensionality reduction: Notions of
optimality in rocchio relevance feedback and latent semantic indexing.
Information Processing & Management, 44(1):163–180, January 2008.
[2] Stephen E. Robertson, Steve Walker, Susan Jones, Micheline Hancock-Beaulieu,
and Mike Gatford. 1994. Okapi at TREC-3. In Proceedings of the 3rd Text
Retrieval Conference.
[3] Stephen E. Robertson and Steve Walker. 1999. Okapi/Keenbow at TREC-8. In
Proceedings of the 8th Text REtrieval Conference.
[4] Tom M. Mitchell. 1997. Machine Learning. Computer Science Series. McGraw-Hill.
[5] Corpora for Question Answering Task, Cognitive Computation Group at the
Department of Computer Science, University of Illinois at Urbana-Champaign.
[6] Xin Li and Dan Roth. 2002. Learning Question Classifiers. In Proceedings of the
19
th
International Conference on Computational Linguistics (COLING’02), Taipei,
Taiwan.
[7] Kadri Hacioglu and Wayne Ward. 2003. Question Classification with Support
Vector Machines and Error Correcting Codes. In Proceedings of the 2003
Conference of the North American Chapter of the Association for Computational
Linguistics on Human Language Technology (NAACL ’03), pages 28–30,
Morristown, NJ, USA.
[8] Ellen M. Voorhees. 1999. The TREC 8 Question Answering Track Report. In
Proceedings of the 8th Text REtrieval Conference.
[9] Ellen M. Voorhees. 2002. Overview of the TREC 2002 Question Answering Track.
In Proceedings of the 11th Text REtrieval Conference.
[10] Eric Breck, John D. Burger, Lisa Ferro, David House, Marc Light, and Inderjeet
Mani. 1999. A Sys Called Qanda. In Proceedings of the 8th Text REtrieval
Conference.
[11] Richard J. Cooper and Stefan M. R¨uger. 2000. A Simple Question Answering
System. In Proceedings of the 9th Text REtrieval Conference.
[12] Dell Zhang and Wee Sun Lee. 2003. Question Classification using Support Vector
Machines. In Proceedings of the 26th ACM International Conference on Research


48
and Developement in Information Retrieval (SIGIR’03), pages 26–32, Toronto,
Canada.
[13] Hao Wu, Hai Jin, and Xiaomin Ning. An approach for indexing, storing and
retrieving domain knowledge. In SAC ’07: Proceedings of the 2007 ACM
symposium on Applied computing, pages 1381–1382, New York, NY, USA, 2007.
ACM Press.
[14] Karen S. Jones, Steve Walker, and Stephen E. Robertson. A probabilistic model
of information retrieval: development and comparative experiments - part 2.
Information Processing and Management, 36(6):809–840, 2000.
[15] H. Cunningham, K. Humphreys, R. Gaizauskas, and Y. Wilks. Software
infrastructure for natural language processing, 1997.
[16] George A. Miller. 1995. WordNet: A Lexical Database. Communications of the
ACM, 38(11):39–41, November.
[17] Philip Resnik. Semantic similarity in a taxonomy: An information-based
measure and its application to problems of ambiguity in natural language.
Journal of Artificial Intelligence Research, 11:95–130, 1999.
[18] Zhiping Zheng. 2002. AnswerBus Question Answering System. In Proceedings
of the Human Language Technology Conference (HLT 2002), San Diego, CA,
March 24-27.
[19] Steven Abney, Michael Collins, and Amit Singhal. 2000. Answer Extraction. In
Proceedings of the 6th Applied Natural Language Processing Conference (ANLP
2000), pages 296–301, Seattle, Washington, USA.
[20] Dan Moldovan, Sanda Harabagiu, Roxana Girju, Paul Mor?arescu, Finley
Lacatus¸u, Adrian Novischi, Adriana Badulescu, and Orest Bolohan. 2002. LCC
Tools for Question Answering. In Proceedings of the 11th Text REtrieval
Conference.

2

Department of Electrical Engineering Indian Institute of Technology Kharagpur-721302

CERTIFICATE
This is to certify that the thesis entitled Open Domain Factoid Question Answering System is a bonafide record of authentic work carried out by Mr. Amiya Patanaik under my supervision and guidance for the fulfilment of the requirement for the award of the degree of Bachelor of Technology (Honours) at the Indian Institute of Technology, Kharagpur. The work incorporated in this has not been, to the best of my knowledge, submitted to any other University or Institute for the award of any degree or diploma.

Dr. Sudeshna Sarkar (Guide) Professor, Department of Computer Science Indian Institute of Technology – Kharagpur INDIA

Date : Place : Kharagpur

Dr. S K Das (Co-guide) Professor, Department of Electrical Engineering Indian Institute of Technology – Kharagpur INDIA

Date : Place : Kharagpur

3

Acknowledgement
I express my sincere gratitude and indebtedness to my guide, Dr. Sudeshna Sarkar under whose esteemed guidance and supervision, this work has been completed. This project work would have been impossible to carry out without her advice and support throughout. I would also like to express my heartfelt gratitude to my co-guide Dr. S. K. Das and all the professors of Electrical and Computer Science Engineering Department for all the guidance, education and necessary skill set they have endowed me with, throughout my years of graduation. Last but not the least; I would like to thank my friends for their help during the course of my work.

Date:

Amiya Patanaik 05EG1008 Department of Electrical Engineering IIT Kharagpur - 721302

4

Dedicated to my parents and friends

. The thesis also presents a feasibility analysis of our system to be used in real time QA applications. In recent years. factual questions such as “who was the first American in space?” or “what is the second tallest mountain in the world?” Yet today’s most advanced web search services (e.g. and “When was first world war fought?”. the combination of web growth. We have developed an architecture that augments existing search engines so that they support natural language question answering and is also capable of supporting local corpus as a knowledge base. part of speech (POS) tags and sense similarity metrics. We assumed that all the information required to produce an answer exists in a single sentence and followed a pipelined approach towards the problem. This thesis investigates a number of techniques for performing open-domain factoid question answering. Yahoo. starting with simple ones that employ surface matching text patterns to more complicated ones using root words. such as “Who is the first woman to be in space?”. Google. document retrieval. phrase extraction.. MSN live search and AskJeeves) make it surprisingly tedious to locate answers to such questions. researchers have been fascinated with answering natural language questions. The wealth of information on the web makes it an attractive resource for seeking quick answers to simple. Since the early days of artificial intelligence in the 60’s. We developed and analyzed different sentence and answer ranking algorithms. Our system currently supports document retrieval from Google and Yahoo via their public search engine application programming interfaces (APIs). Question answering aims to develop techniques that can go beyond the retrieval of relevant Documents in order to return exact answers to natural language factoid questions. passage extraction. the difficulty of natural language processing (NLP) has limited the scope of QA to domain-specific expert systems. Various stages in the pipeline include: automatically constructed question type analysers based on various classifier models. However. Answering natural language questions requires more complex processing of text than employed by current information retrieval systems.5 ABSTRACT A question answering (QA) system provides direct answers to user questions by consulting its knowledge base. sentence and answer ranking. “Which is the largest city in India?”. and the explosive demand for better information access has reignited the interest in QA systems. improvements in information technology.

4.2 Mean Reciprocal Rank 1.3 Confidence Weighted Score 1.7 Real time question answering 1.1.11 User profiling for QA 1.10 Advanced reasoning for QA 1.6.8 Multi-lingual (or cross-lingual) question answering 1.1 End-to-End Evaluation 1.2 Deep 1.6.6.4 Data sources for QA 1.4.4.1 History of Question Answering Systems 1.3.6 Answer formulation 1.1.1 Question classes 1.4.1 Shallow 1.2 Architecture 1.2 Question processing 1.9 Interactive QA 1.5 A generic framework for QA 1.5 Traditional Metrics – Recall and Precision Chapter2: Question Analysis 2.4 Accuracy and coverage 1.1 Question Classes 2.5 Answer extraction 1.3.4.4.4.6.6 Evaluating QA Systems 1.2 Manually Constructed rules for question classification 2 3 4 5 6 8 9 9 10 11 11 11 12 12 13 13 13 13 13 14 14 14 14 14 15 15 16 16 16 17 17 19 19 19 20 .4.4 Issues 1.3 Question answering methods 1.4.4.6 Contents CERTIFICATE ACKNOWLEDGEMENT DEDICATION ABSTRACT CONTENTS LIST OF FIGURES AND TABLES Chapter 1: Introduction 1.4.3 Context and QA 1.6.1 Determining the Expected Answer Type 2.

1.2 Query Formulation 2.1.1.1.10 Experiment Results 2.1.3 Feasibility of the system to be used in real time environment 5.1 WordNet 4.1 Retrieval from local corpus 3. Implementation and Results 5.8 Features 2.1.1.1 How many documents to retrieve? Chapter4.2 Sense Net ranking algorithm Chapter5.2 Sense/Semantic Similarity between words 4.1 Sentence Ranking 4.1.4 Conclusion APPENDIX A : Web Based Question Set APPENDIX B : Implementation Details REFERENCES 20 21 22 22 24 24 25 26 27 28 29 29 29 29 30 30 31 34 34 34 35 36 38 38 41 42 43 44 46 47 .4 Support Vector Machines 2.5 Kernel Trick 2.1.1.7 Datasets 2.1.1.9 Entropy and Weighted Feature Vector 2.1 Ranking function 3. Answer Extraction 4.6 Naive Bayes Classifier 2.2 Okapi BM25 3.1 Stop word for IR query formulation Chapter3.7 2.2 Comparisons with other Web Based QA Systems 5.1.1 Results 5.2.3 IDF Information Theoretic Interpretation 3.2 Information retrieval from the web 3.2. Document Retrieval 3.1.3 Fully Automatically Constructed Classifiers 2.

Fig.1.2: Comparison with other web based QA systems Fig.2: %coverage vs rank Fig. Table 3.1: %coverage and average processing time at different ranks Table 5.4.1: The kernel trick Fig.4.2: Various feature sets extracted from a given question and its corresponding part of speech tags.1: performance of various query expansion modules implemented on Lucene. Table 2. Fig.3. average processing time Fig.2.1: A generic framework for question answering Fig.2.3.1: Various modules of the QA system along with each ones basic task Fig.3: %coverage vs. Table 1.4: JAVA Question Classifier Fig.3: Time distribution of each module involved in QA 15 18 22 24 26 27 31 32 33 35 36 37 38 42 43 Tables PageNo.1: Performance of the system on the web question set 20 28 32 39-41 .9. Fig.5.2: Sections of a document collection as used for IR evaluation.3: A sample run for the question “Who performed the first human heart transplant?” Fig.3.1: Fragment of WordNet taxonomy Fig.2: A sense network formed between a sentence and a query Fig.2.9.2.1: Document retrieval framework Fig.8 List of figures and tables Figures PageNo.1 Coarse and fine grained question categories.4.1.3: Question type classifier performance Fig.

Introduction In information retrieval. thus natural language search engines are sometimes regarded as the next step beyond current search engines. SHRDLU simulated the operation of a robot in a toy world (the "blocks world"). Two of the most famous early systems are SHRDLU and ELIZA. 1. such as questions asking for descriptive rather than procedural information. medicine or automotive maintenance). semantically-constrained. Further restricted-domain QA systems were developed in the following years. and cross-lingual questions. and can be seen as an easier task because NLP systems can exploit domain-specific knowledge frequently formalized in ontologies. both of which were developed in the 1960s. (Alternatively. In fact. to compiled newswire reports. LUNAR was demonstrated at a lunar science convention in 1971 and it was able to answer 90% of the questions in its domain posed by people untrained on the system. list. to internal organization documents. definition. Why. Search collections vary from small local document collections. To find the answer to a question. QA research attempts to deal with a wide range of question types including: fact. a QA computer program may use either a pre-structured database or a collection of natural language documents (a text corpus such as the World Wide Web or some local collection).1 History of Question Answering Systems Some of the early AI systems were question answering systems. How. and can only rely on general ontologies and world knowledge. to the World Wide Web. Both QA systems were very effective in their chosen domains. these systems usually have much more data available from which to extract the answer. LUNAR. The common feature of all these systems is that they had a core database or knowledge system that was hand-written by experts of the chosen domain.9 Chapter1. Two of the most famous QA systems of that time are BASEBALL and LUNAR. answered questions about the geological analysis of rocks returned by the Apollo moon missions. hypothetical. closed-domain might refer to a situation where only limited types of questions are accepted. Some of the early AI systems included question-answering abilities. and it offered the possibility to ask the robot . question answering (QA) is the task of automatically answering a question posed in natural language. in turn.) QA is regarded as requiring more complex natural language processing (NLP) techniques than other types of information retrieval such as document retrieval. * Open-domain question answering deals with questions about nearly everything. On the other hand. BASEBALL answered questions about the US baseball league over a period of one year. * Closed-domain question answering deals with questions under a specific domain (for example.

The blog data corpus contained both "clean" English as well as noisy text that include badlyformed English and spam. An increasing number of systems include the World Wide Web as one more corpus of text. Systems participating in this competition were expected to answer questions on any topic by searching a corpus of text that varied from year to year. a system that answered questions pertaining to the Unix operating system. In 2007 the annual TREC included a blog data corpus for question answering.2 Architecture The first QA systems were developed in the 1960s and they were basically naturallanguage interfaces to expert systems that were tailored to specific domains. Currently there is an increasing interest in the integration of question answering with web search. and it aimed at phrasing the answer to accommodate various types of users. This competition fostered research and development in open-domain text-based question answering. which led to the development of ambitious projects in text comprehension and question answering. ELIZA. ELIZA was able to converse on any topic by resorting to very simple rules that detected important words in the person's input. In the late 1990s the annual Text Retrieval Conference (TREC) included a questionanswering track which has been running until the present.com is an early example of such a system. a text-understanding system that operated on the domain of tourism information in a German city. The systems developed in the UC and LILOG projects never went past the stage of simple demonstrations. 1. The introduction of noisy text moved the question answering to a more realistic setting. Ask. In contrast. It had a very rudimentary way to answer questions. After the question is analyzed. but they helped the development of theories on computational linguistics and reasoning. Another project was LILOG. Again. The best system of the 2004 competition achieved 77% correct fact-based questions. in contrast. One example of such a system was the Unix Consultant (UC). Real-life data is inherently noisy as people are less careful when writing in spontaneous media like blogs. One can only expect to see an even tighter integration in the near future. and on its own it lead to a series of chatter bots such as the ones that participate in the annual Loebner prize. and Google and Microsoft have started to integrate question-answering facilities in their search engines. simulated a conversation with a psychologist. The system had a comprehensive hand-crafted knowledge base of its domain. The 1970s and 1980s saw the development of comprehensive theories in computational linguistics. the strength of this system was the choice of a very specific domain and a very simple world with rules of physics that were easy to encode in a computer program.10 questions about the state of the world. In earlier years the TREC data corpus consisted of only newswire data that was very clean. Current QA systems typically include a question classifier module that determines the type of question and the type of answer. the system . current QA systems use text documents as their underlying knowledge source and combine various natural language processing techniques to search for the answers.

When using massive collections with good data redundancy. 1. logic form transformation. semantic and contextual processing must be performed to extract or construct the answer. in the cases where simple question reformulation or keyword techniques will not suffice. Finally. This often works well on simple "factoid" questions seeking factual tidbits of information such as names. It thus makes sense that larger collection sizes generally lend well to better QA performance. the burden on the QA system to perform complex NLP techniques to understand the text is lessened. dates. means that nuggets of information are likely to be phrased in many different ways in differing contexts and documents. If you posed the question "What is a dog?".2 Deep However. (2) Correct answers can be filtered from false positives by relying on the correct answer to appear more times in the documents than instances of incorrect ones. For example. 1. a document retrieval module uses search engines to identify the documents or paragraphs in the document set that are likely to contain the answer.3. Subsequently a filter preselects small text fragments that contain strings of the same type as the expected answer. syntactic alternations. relation detection. unless the question domain is orthogonal to the collection. an answer extraction module looks for further clues in the text to determine if the answer candidate can indeed answer the question.3 Question answering methods QA is very dependent on a good search corpus .11 typically uses several modules that apply increasingly complex NLP techniques on a gradually reduced amount of text.for without documents containing the answer.1 Shallow Some methods of QA use keyword-based techniques to locate interesting passages and sentences from the retrieved documents and then filter based on the presence of the desired answer type within that candidate text. co reference resolution. there is little any QA system can do. such as the web. the system would detect the substring "What is a X" and look for documents which start with "X is a Y". leading to two benefits: (1) By having the right information appear in many forms. 1. locations.3. Thus. some systems use templates to find the final answer in the hope that the answer is just a reformulation of the question. word sense disambiguation. more sophisticated syntactic. if the question is "Who invented Penicillin" the filter returns text that contain names of people. Ranking is then done based on syntactic features such as word order or location and similarity to query. The notion of data redundancy in massive collections. These techniques might include namedentity recognition. logical inferences (abduction) . and quantities.

AQ (Answer Questioning) Methodology. 1. or the Suggested Upper Merged Ontology (SUMO) to augment the available reasoning resources through semantic connections and definitions. badly-worded or ambiguous questions will all need these types of deeper understanding of the question. it is only a starting point with endless possibilities." Q"(What about) the flavor of sushi (do) I like?" Inadvertently.4.1 Question classes Different types of questions require the use of different strategies to find the answer. The following issues were identified. thereby unveiling an ongoing process constantly being reborn into the research being performed. such as part-of-speech tagging. Any number of question methods may be used to derive the number of WHY as in. dialog queries. While most would agree that this seems to be the end-all stratagem. the primary usage is taking an answer and questioning it turning that very answer into a question. temporal or spatial reasoning and so on. A"I like sushi. AQ Method may be used upon perception of a posed question or answer. however. 1.12 and commonsense reasoning. supposedly there is only one true answer in reality everything else is perception or plausibility. . is also growing in popularity in the research community. parsing. Question classes are arranged hierarchically in taxonomies. Even this methodology should be questioned." Q"(Why do) I like sushi(?)" A"The flavor. Complex or ambiguous document passages likewise need more NLP techniques applied to understand the text. Example.4 Issues In 2002 a group of researchers wrote a roadmap of research in question answering. Many of the lowerlevel NLP tools used. The means by which it is utilized can be manipulated beyond its primary usage. and document retrieval. These systems will also very often utilize world knowledge that can be found in ontologies such as WordNet. hypothetical postulations. which introduces statistical question processing and answer extraction modules. this may unveil different methods of thinking and perception as well. Statistical QA. spatially or temporally constrained questions. 1(Q) = ((∞(A)-∞) = 1(A). More difficult queries such as Why or How questions. A = ∞(Q). sentence boundary detection. introduces a working cycle to the QA methods. named-entity detection. The QA methodology utilizes just the opposite where. Utilized alongside other forms of communication. This method may be used in conjunction with any of the known or newly founded methods. debate may be greatly improved. the answer may yield any number of questions to be asked. are already available as probabilistic applications.

we shall not obtain a correct result. would identify ambiguities and treat them in context or by interactive clarification. A semantic model of question understanding and processing is needed. on the search method and on the question focus and context. organization. regardless of the speech act or of the words. etc). shop or disease.4. no matter how well we perform question processing.4 Data sources for QA Before a question can be answered. when the question classification indicates that the answer type is a name (of a person.6 Answer formulation The result of a QA system should be presented in a way as natural as possible. the answer to the question "On what day did Christmas fall in 1989?") the extraction of a single datum is sufficient.3 Context and QA Questions are usually asked within a context and answers are provided within that specific context. 1. The context can be used to clarify a question.some interrogative.g. size. on the answer type provided by question processing.2 Question processing The same information request can be expressed in various ways . If the answer to a question is not present in the data sources. resolve ambiguities or keep track of an investigation performed through a series of questions. 1. the presentation of the answer may require the use of fusion techniques that combine the partial answers from multiple documents. length.5 Answer extraction Answer extraction depends on the complexity of the question. simple extraction is sufficient. research for answer processing should be tackled with a lot of care and given special importance. some assertive.13 1. retrieval and extraction of the answer. This model would enable the translation of a complex question into a series of simpler questions. 1. a quantity (monetary value. In some cases. it must be known what knowledge sources are available. distance.4.4.4. For other cases. one that would recognize equivalent questions. 1. Given that answer processing depends on such a large number of factors.4. For example. on the actual data where the answer is searched. syntactic inter-relations or idiomatic forms. etc) or a date (e. .

where each template slot represents a different profile feature. 1. encoding world knowledge and common-sense reasoning mechanisms as well as knowledge specific to a variety of domains.4. To upgrade a QA system with such capabilities.14 1. as the question processing part may fail to classify properly the question or the information needed for extracting and generating the answer is not easily retrieved. 1. See also machine translation.4.11 User profiling for QA The user profile captures data about the questioner. Profile templates may be nested one within another. the size and multitude of the data sources or the ambiguity of the question. The profile may be represented as a predefined template.10 Advanced reasoning for QA More sophisticated questioners expect answers which are outside the scope of written texts or structured databases. 1.8 Multi-lingual (or cross-lingual) question answering The ability to answer a question posed in one language using an answer corpus in another language (or even several). regardless of the complexity of the question. This allows users to consult information that they cannot use directly.7 Real time question answering There is need for developing Q&A systems that are capable of extracting answers from large data sets in several seconds. .4. the questioner might want not only to reformulate the question. In such cases. reasoning schemes frequently used by the questioner. 1.4. but (s)he might want to have a dialogue with the system.4. we need to integrate reasoning components operating on a variety of knowledge bases. domain of interest.9 Interactive QA It is often the case that the information need is not well captured by a QA system. comprising context data. common ground established within different dialogues between the system and the user etc.

1.15 1. an approach to document retrieval requires some form of iterative process to select good quality documents which involves modifying the IR query. While these basic components can be further subdivided into smaller components like query formation and document pre-processing. question analysis.6 Evaluating QA Systems Evaluation is a highly subjective matter when dealing with NLP problems. It is always easier to evaluate when there is a clearly defined answer.1. answer extraction. A rather impractical and tedious way of doing this could be to manually search an entire collection of text and mark the . 2. It should be noted that while the three components address completely separate aspects of question answering it is often difficult to know where to place the boundary of each individual component. For example the question analysis component is usually responsible for generating an IR query from the natural language question which can then be used by the document retrieval component to select a subset of the available documents. unfortunately with most of the natural language tasks there is no single answer.5 A generic framework for QA The majority of current question answering systems designed to answer factoid questions consist of three distinct components: 1.1: A generic framework for question answering. Question Question Analysis Corpus or document collection Document Retrieval Top n text segments or sentences Answer Extraction Answers Fig. however. a three component architecture describes the approach taken to building QA systems in the wider literature. If. then it is difficult to decide if the modification should be classed as part of the question analysis or document retrieval process. document or passage retrieval and finally 3.

no credit was given to systems for determining that they did not know or could not locate an appropriate answer to a question. Following are definitions of numerous metrics for evaluating factoid questions. So a widely accepted metric is required to evaluate the performance of our system and compare it with other existing systems. 1. 1.1 End-to-End Evaluation Almost every QA system is concerned with the final answer. But this is not possible even for the smallest of document collections and with the size of corpuses like AQUAINT with approximately 1. MRR provides a method for scoring systems which return multiple competing answers per question.000 articles it is next to impossible.00.6. . the most important of which are that x systems are given no credit for retrieving multiple (different) correct answers and x As the task required each system to return at least one answer per question. Evaluating descriptive questions is much more difficult than factoids. 1. Under this evaluation metric a system returns a single answer for each question. Then the queries can be used to make an evaluation based on precision and recall. These answers are then sorted before evaluation so that the answer which the system has most confidence in is placed first. Let Q be the question collection and ri the rank of the first correct answer to question i or 0 if no correct answer is returned.6.16 relevant documents.6. Most of the recent large scale QA evaluations have taken place as part of the TREC conferences and hence the evaluation metrics used have been extensively studied and is used in this study. MRR is then given by: MRR ¦r i 1 |Q| 1 i |Q| (1.3 Confidence Weighted Score Following the shortcomings of MRR as an evaluation metric a new evaluation metric was chosen as the new evaluation metric [9].1) As useful as MRR was as an evaluation metric for the early TREC QA evaluations it does have a number of drawbacks [8].2 Mean Reciprocal Rank The original evaluation metric used in the QA tracks of TREC 8 and 9 was mean reciprocal rank (MRR).

2) CWS therefore rewards systems which can not only provide correct exact answers to questions but which can also recognise how likely an answer is to be correct and hence place it early in the sorted list of answers.17 The last answer evaluated will therefore be the one the system has least confidence in.q .q z I}| |Q| (1. The main issue with CWS is that it is difficult to get an intuitive understanding of the performance of a QA system given a CWS score as it does not relate directly to the number of questions the system was capable of answering. 1. Given this ordering CWS is formally defined in Equation 1.q .q be the correct answers for question q known to be S contained in the document collection D and FD .q z I}| |Q| (1.q .4 Accuracy and coverage Accuracy of a QA system is a simple evaluation metric with direct correspondence to number of correct answers.2).q .n be the n top-ranked documents (or passages) in D retrieved by an IR system S (figure 1.q . then The recall of an IR system S at rank n for a query q is the fraction of the relevant documents AD .5 Traditional Metrics – Recall and Precision The standard evaluation measures for IR systems are precision and recall. n) S S |{q‹Q | RD . n) S S |{q‹Q | FD . D.n ˆ CD .4) 1.n ˆ AD . Let D be the document (or passage collection). D.2: ¦ CWS i 1 |Q| no.6.n be the first n answers found by system S for question q from D then accuracy is defined as: accuracy (Q. which have been retrieved: . Let CD .q the subset of D which contains relevant S documents for a query q and RD . AD .3) Similarly The coverage of a retrieval system S for a question set Q and document collection D at rank n is the fraction of the questions for which at least one relevant document is found within the top n documents: coverage (Q.6. of correct in first i answers i |Q| (1.

q | S | RD . q.q | recall ( D. namely determining the set of relevant documents within a collection for a given query. Clearly given the size of the collections over which QA systems are being operated this is not a feasible proposition.q . n) S | AD . Document Collection/Corpus Relevant Documents AD .n that are relevant: precision S ( D.2: Sections of a document collection as used for IR evaluation.6) Clearly given a set of queries Q average recall and precision values can be calculated to give a more representative evaluation of a specific IR system. AD .n | (1. q. The only accurate way to determine which documents are relevant to a query is to read every single document in the collection and determine its relevance. It must be kept in mind that just because a relevant document is found does not automatically mean the QA system will be able to identify and extract a correct answer.q .q | (1.q .n ˆ AD . n) S | RD .q Retrieved Documents S RD .18 S | RD .n Figure 1.q . Unfortunately these evaluation metrics although well founded and used throughout the IR community suffer from two problems when used in conjunction with the large document collections utilized by QA systems. .n ˆ AD .q . Therefore it is better to use recall and precision at the document retrieval stage rather than for the complete system.q .5) The precision of an IR system S at rank n for a query q is the fraction of the retrieved S documents RD .

which contains 6 coarse grained categories and 50 fine grained categories. In a similar way a poorly formed IR query may result in no answer bearing documents being retrieved and hence no amount of further processing by an answer extraction component will lead to a correct answer being found. We have employed a machine learning model in our system which employs a feature-weighting model which assigns different weights to features instead of simple binary values. as shown in Table 1. Further. .19 Chapter2. Most question answering systems use a coarse grained category definition. However. Determining the expected answer type for a question implies the existence of a fixed set of answer types which can be assigned to each new question. it is obvious that a fine grained category definition is more beneficial in locating and verifying the plausible answers. Not only is the question analysis component responsible for determining the expected answer type and for constructing an appropriate query for use by an IR engine but any mistakes made at this point are likely to render useless any further processing of a question. Usually the number of question categories is less than 20. using machine learning approaches. If the expected answer type is incorrectly determined then it is highly unlikely that the system will be able to return a correct answer as most systems constrain possible answers to only those of the expected answer type. 2. Question Analysis As the first component in a QA system it could easily be argued that question analysis is the most important part. The main characteristic of this model is assigning more reasonable weight to features: these weights can be used to differentiate features from each other according to their contribution to question classification.1.1 Determining the Expected Answer Type In most QA systems the first stage in processing a previously unseen question is to determine the semantic type of the expected answer. The problem of question type classification can be solved by constructing manual rules or if we have access to large set of annotated-pre classified questions. Each coarse grained category contains a non-overlapping set of fine grained categories. Results show that with this new feature-weighting model the SVM-based classifier outperforms the one without it to a large extent. we propose to use features initially just as bag of words and later on both as a bag of words and feature called as partitioned feature model.1 Question Classes We follow the two-layered question taxonomy. 2.

many creating sets of regular expressions which only questions with the same answer type [10]. A number of different automatic approaches to question classification have been reported which make use of one or more machine learning algorithms [6][7][12] including nearest neighbour (NN) [4]. technique. As we had access to large set of pre annotated question samples we have not used this method. count.2 Manually Constructed rules for question classification Often the easiest approach to question classification is a set of manually constructed rules. country. size. . product. body. 2. instrument. other. group. order.1. state code. color.[11]. date. distance. temperature. percent. manner.1 Coarse and fine grained question categories. vehicle. speed. An alternative solution to this problem is to develop an automatic approach to constructing a question classifier using (possibly hand labelled) training data. other. description. Coarse ABBR DESC ENTY Fine abbreviation. disease/medical. This approach allows a simple low coverage classifier to be rapidly developed without requiring a large amount of hand labelled training data. title city. language. event. While these approaches work well for some questions (for instance questions asking for a date of birth can be reliably recognised using approximately six well constructed regular expressions) they often require the examination of a vast number of questions and tend to rely purely on the text of the question. mountain. decision trees (DT) and support vector machines (SVM)[7][12] to induce a classifier. weight 2. creation. food. religion. plant. One possible approach for manually constructing rules for such a classifier would be to define rule formalism that whilst retaining the relative simplicity of regular expressions would give access to a richer set of features. In our system we employed a SVM and Naive Bayes classifier on different feature sets extracted from the question. letter. period. individual. currency. other. sport.20 Table 1. word HUM LOC NUM description. term. substance. reason animal. A number of systems have taken this approach. money.1.3 Fully Automatically Constructed Classifiers As mentioned in the previous section building a set of classification rules to perform accurate question classification by hand is both a tedious and time-consuming task. symbol. expansion definition.

By using geometry. The vector w is a normal vector: it is perpendicular to the hyperplane.21 2. )| .2) where denotes the dot product. we can select the two hyperplanes of the margin in a way that there are no points between them and then try to maximize their distance.5) − =1 (2. or distance between the parallel hyperplanes that are as far apart as possible while still separating the data. one which maximizes the margin between the two data sets. We want to choose the w and b to maximize the margin. we find the distance between these two hyperplanes is 2/ .1.4) Note that if the training data are linearly separable.3) . {−1. we add the following constraint: for each i either w⋅ and w⋅ This can be rewritten as: − ≤ −1 (2. To calculate the margin. a set of points of the form D= ( . an SVM will construct a separating hyper-plane in that space. indicating the class to which the point belongs. a good separation is achieved by the hyper-plane that has the largest distance to the neighboring data points of both classes. since in general the larger the margin the lower the generalization error of the classifier. one on each side of the separating hyper-plane. We are given some training data. We want to give the maximum-margin hyperplane which divides the points having = 1 from those having = − 1. The parameter / determines the offset of the hyperplane from the origin along the normal vector w. two parallel hyperplanes are constructed. As we also have to prevent data points falling into the margin. These hyperplanes can be described by the equations w⋅ and w ⋅ − = −1 (2. so we want to minimize . Viewing input data as two sets of vectors in an ndimensional space. Each is a p-dimensional real vector.6) − ≥1 (2. Any hyperplane can be written as the set of points satisfying w⋅ − =0 (2. which are "pushed up against" the two data sets.4 Support Vector Machines Support vector machines (SVMs) are a set of related supervised learning methods used for classification and regression.1) where the is either 1 or −1.1} =1 (2. Intuitively.

n) (w ⋅ − ) ≥1 (2. Rd Consider K(x.8) 2. The following picture illustrates a particularly simple example where the feature map I(x1.5 Kernel Trick If instead of the Euclidean inner product w ⋅ one fed the QP solver with a function K(w .9) d when and the set of x  on that boundary becomes a curved surface embedded in R the function K(x.7) We can put this together to get the optimization problem: Minimize in ( . ) subject to (for any i = 1.w) to be the inner product not of the coordinate vectors x and w in Rd but of vectors I x) and I(w) in higher dimensions. The data space is often Rd but most of the interesting results hold when X is a compact Riemannian manifold.……. K(x. ) the boundary between the two classes would then be.x2)=(x12.22 (w ⋅ − ) ≥ 1 . The map.x22) maps data in R2 into R3. we also tried Naïve Bayes Classifier[6].6 Na ve Bayes Classifier Along with SVM. 2. for all 1≤ ≤ (2. The feature space is assumed to be a Hilbert space of real valued functions defined on X . Figure 2.1..—2x1x2.1. A naive Bayes classifier is a term in Bayesian statistics dealing with a simple probabilistic classifier based on .w) + b = 0 (2.w) is non-linear. after transformation the data is linearly separable. I: X o H is called a feature map from the data space X into the feature space H .1: The kernel trick.

23 applying Bayes' theorem with strong (naive) independence assumptions.……. F_1.……. naive Bayes classifiers can be trained very efficiently in a supervised learning setting. the probability model for a classifier is a conditional model p(C| F_1. _ ) (2.. For example. F_1. Using Bayes' theorem... In many practical applications. The numerator is equivalent to the joint probability model p(C.. ………. we write p(C | F_1.11) In practice we are only interested in the numerator of that fraction. so that the denominator is effectively constant. a naive Bayes classifier assumes that the presence (or absence) of a particular feature of a class is unrelated to the presence (or absence) of any other feature.. _ | ) ..F_n). Even though these features depend on the existence of the other features.…….F_n| C. The problem is that if the number of features n is large or when a feature can take on a large number of values... Depending on the precise nature of the probability model. In simple terms.………. F_n). ………. using repeated applications of the definition of conditional probability: p(C. one can work with the naive Bayes model without believing in Bayesian probability or using any Bayesian methods. Abstractly.. in other words. then basing such a model on probability tables is infeasible.10) In plain English the above equation can be written as posterior = (prior*likelihood)/evidence (2.F_n| C) = p(C) p(F_1| C) p(F_2.. We therefore reformulate the model to make it more tractable. F_n) = p(C) p(F_1.…….. a naive Bayes classifier considers all of these properties to independently contribute to the probability that this fruit is an apple.. parameter estimation for naive Bayes models uses the method of maximum likelihood. conditional on several feature variables F_1 through F_n. which can be rewritten as follows. F_1) . the words or features of a given question may are assumed to be independent to simplify mathematical complexities. since the denominator does not depend on C and the values of the features F_i are given.. over a dependent class variable C with a small number of outcomes or classes. A more descriptive term for the underlying probability model would be "independent feature model".F_n) = ( ) ( ( .

. (2.000..2: Various feature sets extracted from the given question and its corresponding part of speech tags.. F_1.1.. p(F_n| C.F_{n-1}) (2. F_2) ..24 = p(C) p(F_1| C) p(F_2| C.000..000. 2. 3. University of Illinois at Urbana-Champaign (UIUC) [5].14) 2...7 Datasets We used the publicly available training and testing datasets provided by Tagged Question Corpus. . The following example demonstrated different feature sets considered for a given question and its POS parse. The TREC QA data is hand labelled by us. F_1. 4. This means that p(F_i | C. F_1) p(F_3| C. we extract two kinds of features: bag-of-words or a mix of POS tags and words. Cognitive Computation Group at the Department of Computer Science. F_1) p(F_3..8 Features For each question.. F_1. p(F_n| C) = p(C)∏ =1 p(F_i | C) (2. F_2) = p(C) p(F_1| C) p(F_2| C. F_3) = p(C) p(F_1| C) p(F_2| C.. F_j) = p(F_i | C)....... .. F_n) = p(C) p(F_1| C) p(F_2| C) p(F_3| C) …….... 2. There are about 5. F_1) p(F_3| C... the weight associated with each word varies between 0 and 1.. Now the "naive" conditional independence assumptions come into play: assume that each feature Fi is conditionally independent of every other feature Fj for j i..F_n| C. F_3. F_1.12) and so forth. F_2.....500 respectively.1.000 and 5. The testing dataset contains 2000 labelled questions from the TREC QA track. Figure 2. F_2..F_n| C.1.... All these datasets have been manually labelled by UIUC [5] according to the coarse and fine grained categories in Table 1. Every question is represented as feature vectors... F_1.13) and so the joint model can be expressed as p(C. F_2) p(F_4.. F_1.500 labelled questions randomly divided into 5 training datasets of sizes 1.. .

Ci represents a word collection similar to documents. ni be the total number of occurrences of word i in all questions.25 2. x ∈ A be the probability function. that is to say. Ci is a set of words extracted from questions of type i. Note that if a word occurs in only one set. ai get the maximum value of 1 if word i occurs in only one set of question type. the more uncertain the random variable X is. Without loss of generality. Keeping this in mind to calculate the entropy of a word. ni the total number of occurrences of word i in document collection. and minimum value 0 if the word occurs in only one document.N}. Let X be a discrete random variable with respect to alphabet A and p(x) = Pr(X = x). Let ai be the weight of word i.1. the less important it is. which is easily justified since xlogx → 0 as x → 0. the smaller weight is associated with word i. the less important to question classification it is. N the number of total documents in the collection. Therefore we can also use the idea of entropy to evaluate word’s importance. . then ai is defined as: ai =(1 +1/log(N)Σ(t=1 to N)[fitni·log(fitni)]) (2.9 Entropy and Weighted Feature Vector In information theory the concept of entropy is used as a measure of the uncertainty of a random variable. then the entropy H(X) of the discrete random variable X is defined as: H(x) = −Σx∈A {p(x) log p(x)} (2. . fit be the frequency of word i in Ct. . Let fit be the frequency of word i in document t. and the minimum value of 0 if the word is evenly distributed over all sets. We use the convention that 0 log 0 = 0. Consequently. among which entropy-weighting. Let C be the set of question types. for other sets fik is 0. In other words. . each Ci is the same as a document because both of which are just a collection of words. From the viewpoint of representation. based on information theoretic ideas. it is denoted by C = {1. is proved the most effective and sophisticated. certain preprocessing is needed. then the confusion (or entropy) of word i can be measured as follows: H(i) = Σ(t=1 to N) [fit ni · log(nifit)] (2.16) The larger the confusion of a word is. . In information retrieval many methods have been applied to evaluate term’s relevance to documents.17) Weight of word i is opposite to its entropy: the larger the entropy of word i is. The confusion achieves maximum value log(N) if the word is evenly distributed over all documents.15) The larger the entropy H(X) is.

We implemented weighted feature set SVM classifier into a cross platform standalone desktop application (shown below). Support Vector Machine Classifier using Weighted feature set (85% accurate on TREC data) It must be noted that the classifiers were NOT trained on TREC data.26 2.3: Classifiers were tested on a set of 2000 TREC questions.1. University of Illinois at Urbana-Champaign. Support Vector Machine Classifier using Bag of Words feature set (78% accurate on TREC data). The application will be made available to public for evaluation. While Witten-Bell smoothing worked well. The accuracy reported here are for Naive Bayes Classifier employing add one smoothing. Naïve Bayes Classifier* using Partitioned feature set (69% accurate on TREC data). Some sample test runs Q: What was the name of the first Russian astronaut to do a spacewalk? Response: HUM -> IND(an Individual) Q: How much folic acid should an expectant mother get daily? Response: NUM -> COUNT . The performance without smoothing was too low and not worth mentioning. We employed various smoothing techniques to Naive Bayes Classifier. Naïve Bayes Classifier* using Bag of Words feature set (64% accurate on TREC data). The classifier classified questions into six broad classes and fifty coarse classes.10 Experiment Results We tested various algorithms for question classification. 90 80 70 60 50 40 30 20 10 0 Baseline Classifier Naïve Bayes Classifier using Bag of Words feature Naïve Bayes Classifier using Partitioned feature SVM Classifier using Bag of Words feature SVM Classifier using Weighted feature set Chart Showing accuracy of classifier Figure 2. Therefore a baseline (random) classifier is (1/50) = 2% accurate. Training was done on a set of 12788 questions provided by Cognitive Computation Group at the Department of Computer Science. simple add one smoothing outperformed it.

The test is carried out on data from NIST TREC Robust Retrieval Track 2004 . while other systems go for a query expansion. For large corpus. In our system when using web as document collection.php?id=10 2.2 Query Formulation The question analysis component of a QA system is usually responsible for formulating a query from a natural language questions to maximise the performance of the IR engine used by the document retrieval component of the QA system. Response: ENTITY -> ANIMAL Figure 2. can be downloaded for evaluation from http://www.cybergeeks. But in case of a local small corpus query expansion may be necessary. query expansion may not be necessary as even with not so well formed query recall is sufficient to extract the right answer and query expansion may in fact reduce precision.co. When a web corpus is not available we employ Rocchio Query Expansion [1] method which is implemented in lucene query expansion module. we pass on the question as IR query after masking the stop words.in/projects.4: JAVA Question Classifier. The design of the query expansion module should be such as to maintain the right balance between recall and precision. The table below shows performance of various query expansion modules implemented on Lucene. Most QA systems start constructing an IR query simply by assuming that the question itself is a valid IR query.27 Q: What is Francis Scott Key best known for? Response: DESC -> DESC Q: What state has the most Indians? Response: LOC -> STATE Q: Name a flying mammal.

2322 Lucene 0. Table 2.2 MAP .10% 14% 14% 15% MAP Lucene QE 0. most of these lists are very similar).3936 0.2433 Lucene gQE 0.mean average precision P10 . KB-R-FIS gQE – My Fuzzy Inference System that utilized Rocchio’s query expansion along with Google.3984 0. . Unfortunately when it comes to QA systems high frequency of a word in a collection may not always suggest that it is insignificant in retrieving the answer.2.28 Tag Combined Topic Set P10 0.2332 KB-R-FIS gQE 0. I As De In On To Who www a at en is or was will about be for it that what with an by from la the when und are com how of this where the The list of stop words we obtained is much smaller than standard stop word lists (although there is no definite list of stop words which all natural language processing tools incorporate.percentage of topics with no relevant in the top 10 retrieved Lucene QE .average of precision at 10 documents retrieved %no . Therefore we manually analyzed 100 TREC QA track questions and prepared a list of stop words.4076 0. For example the word “first” is widely considered to be a stop word but is very important when appears in the question “Who was the first President of India?”. A partial list of the stop words is shown below. It must be noted that query expansion is internally carried out by the APIs used to retrieve documents from the web. although because of the proprietary nature their working is unknown and unpredictable. 2.lucene with local query expansion Lucene gQE – Lucene system that utilized Rocchio’s query expansion along with Google.1 Stop word for IR query formulation Stop words or noise words are words which appear with a high frequency and are considered to be insignificant for normal IR processes.1: performance of various query expansion modules implemented on Lucene.37 %no 18.

and its newer variants.. e. such as Web search. 3. BM25F [2] (a version of BM25 that can take document structure and anchor text into account). however.IDF based vector space model. represent stateof-the-art retrieval functions used in document retrieval..1.29 Chapter3. One of the most prominent instantiations of the function is as follows.g. but actually a whole family of scoring functions. and others [14]. It is not a single function.1. over many other IR engines. Karen Spärck Jones. It is based on the probabilistic retrieval framework developed in the 1970s and 1980s by Stephen E. it usually referred to as "Okapi BM25". is that it is relatively easy to extend to meet the demands of a given research project (as an open source project the full source code to Lucene is available making modification and extension relatively straight forward) allowing experiments with different retrieval models or ranking algorithms to use the same document index. One of the main advantages of using Lucene. their relative proximity). The name of the actual ranking function is BM25.g.. since the Okapi information retrieval system. the BM25 score of a document D is: . 3. Document Retrieval The text collection over which a QA system works tend to be so large that it is impossible to process whole of it to retrieve the answer.1 Ranking function We employ highly popular Okapi BM25 [3] ranking function for our document retrieval module. Robertson. A good retrieval unit will increase precision while maintaining good enough recall. 3. Lucene is an open source boolean search engine with support for ranked retrieval results using a TF. implemented at London's City University in the 1980s and 1990s.2 Okapi BM25 BM25 is a bag-of-words retrieval function that ranks a set of documents based on the query terms appearing in each document. BM25.. .1 Retrieval from local corpus All the work presented in this thesis relies upon the Lucene IR engine [13] for local corpus searches. was the first system to implement this function.. The task of the document retrieval module is to select a small set from the collection which can be practically handled in the later stages. containing keywords 1 . Given a query Q. regardless of the inter-relationship between the query terms within a document (e. To set the right context. with slightly different components and parameters.

)+ ( . ( . It is usually computed as: ( )= − ( )+0. Both APIs have relaxed terms of condition and allow access through code. There are several interpretations for IDF and slight variations on its formula. | D | is the length of the document D in words. Therefore we use public APIs of search engines. and avgdl is the average document length in the text collection from which documents are drawn.2 Information retrieval from the web Indexing the whole web is a gigantic task which is not possible on a small scale. Then a randomly picked document D will contain the term with probability n(q)/N (where N is again the cardinality of the set of documents in the collection).Q) = ∑ =1 ( ). If the two terms occur in documents entirely independently of each other. 3.0 and b = 0.3) Now suppose we have two query terms q1 and q2. this is exactly what is expressed by the IDF component of BM25.2) where N is the total number of documents in the collection. usually chosen as k1 = 2. In the original BM25 derivation. the IDF component is derived from the Binary Independence Model. )( 1 1 +1) (1− + | | ) (3.75. and n(qi) is the number of documents containing qi. then the probability of seeing both q1 and q2 in a randomly picked document D is: ( 1) ( 2) and the information content of such an event is: 2 log =1 ( ) With a small variation. Moreover there are no limits on number of queries per day when used for .1) where f(qi. We have used Google AJAX Search API and Yahoo BOSS. IDF(qi) is the IDF (inverse document frequency) weight of the query term qi. the information content of the message "D contains q" is: -log ( ) = log ( ) (3. Therefore.5 ( )+0. Suppose a query term q appears in n(q) documents.D) is qi's term frequency in the document D. k1 and b are free parameters.5 (3.30 Score(D.1. 3.3 IDF Information Theoretic Interpretation Here is an interpretation from information theory.

. The search APIs can return top n documents for a given query. Lowered precision is penalized by higher average processing time by later stages. Therefore.2.1 How many documents to retrieve? One of the main considerations when doing document retrieval for QA is the amount of text to retrieve and process for each question.31 educational purposes. Whilst the ideal is not attainable. Ideally a system would retrieve a single text unit that was just large enough to contain a single instance of the exact answer for every question.1 shows the document retrieval framework. As the task of reading the URLs over the internet is inherently slow process. We read top n uniform resource locators (URLs) and build the collection of documents to be used for answer retrieval. Local Corpus Lucene IR Module Okapi BM25 Ranking function Top n Docs IR Query URL Reader Google/Yahoo Search APIs URL Reader URL Reader URL Reader INTERNET Multi threaded Reader module Figure 3. To accelerate the process we employ multi threaded URL readers so that multiple URLs can be read simultaneously. Therefore our target is to increasing coverage with least number of retrieved documents to form the text collection.1: Document retrieval framework 3. this stage is the most taxing one in terms of runtime. the document retrieval stage can act as a filter between the document collections/web and answer extraction components by retrieving a relatively small set of text collection. Figure 3.

39 1.803 1.09 0.23 0.4 4.27 0. average processing time at different ranks for Google and Yahoo search APIs.1: %coverage and average processing time at different ranks %Coverage vs rank 80 70 60 50 40 30 20 10 0 Yahoo BOSS API %Coverage 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 23 31 37 42 48 49 49 51 51 52 53 53 54 54 55 Google AJAX Search 28 48 56 58 64 64 64 66 70 72 72 73 73 73 74 API Figure 3.8 3. The results are obtained on a set of 30 questions (equally distributed over all question classes) from TREC 04 QA track [5].32 criterion for selecting the right collection size depends on coverage and average processing time.37 0.22 3.82 1.49 0.2: %coverage vs rank .44 2.02 0.102 0. The table below shows percentage coverage.2 4.23 1.51 0.6 3.021 0.7 4.1 %Coverage @rank Yahoo 23 31 37 42 48 49 49 51 51 52 53 53 54 54 55 Google 28 48 56 58 64 64 64 66 70 72 72 73 73 73 74 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 *Average time spent by answer retrieval node. Average Processing Time*(sec) Yahoo 0. Table 3.1 3.77 Google 0.31 2.2 2.6 5.9 2.1 1.34 0.01 2.

Clearly Google outperforms Yahoo at all ranks.2 2.51 0.37 0.33 80 70 60 50 40 30 20 10 0 %Coverage vs Average Processing time(sec) Google AJAX Search API Yahoo BOSS From the results it is clear that going up to rank 5 ensures a good coverage while maintaining low processing time.6 5.3: %coverage vs.1 3.23 0. average processing time . 0.021 0.803 1.6 3.09 0.1 1.1 Figure 3.39 1.4 4.9 2.

The probability estimate and the retrieved answer’s frequency are used to compute confidence of the answer.1 Sentence Ranking The sentence ranking is responsible for ranking the sentences and giving a relative probability estimate to each one.1." In addition to providing these groups of synonyms to represent a concept.1 WordNet WordNet [16] is the product of a research project at Princeton University which has attempted to model the lexical knowledge of a native speaker of English. Each synset has a gloss that defines the concept of the word. auto. usually propelled by an internal combustion Engine. 4.34 Chapter4. automobile. Many glosses have examples of usages associated with them. is to extract and present the answers to questions. WordNet connects concepts via a variety of semantic relations. The final answer is the phase chunk with maximum frequency belonging to the sentence with highest rank. For example the words car. In WordNet each unique meaning of a word is represented by a synonym set or synset. It also registers the frequency of each individual phrase chunk marked by the NE recognizer for a given question class. 4. and motorcar is a synset that represents the concept define by gloss: four wheel Motor vehicle. These semantic relations for nouns include: x Hyponym/Hypernym (IS-A/ HAS A) x Meronym/Holonym (Part-of / Has-Part) x Meronym/Holonym (Member-of / Has-Member) x Meronym/Holonym (Substance-of / Has-Substance) Figure 4. and arguably the most important. such as "he needs a car to get to work.1 shows a fragment of WordNet taxonomy. Answer Extraction The final stage in a QA system. We employ a named entity (NE) recognizer to filter out those sentences which could potentially contain answer to the given question. In our system we have used GATE – A General Architecture for Text Engineering provided by The Sheffield NLP group [15] as a tool to handle most of the NLP tasks including NE recognition. .

We assign a probability to a concept in taxonomy based on the occurrence of target concept in a given corpus.1. Using this basic idea we compute the sense/semantic between two given words based on a similarity metric proposed by Philip similarity Resnik [17]. infrequent words are more informative then frequent ones. Figure 4.35 4.2 Sense/Semantic Similarity between words We use statistics to compute information content value. in other words. Basic idea behind the negative likelihood formula is that the more probable a concept appears. The IC value is then calculated by negative log likelihood formula as follow: IC (c) log( P(c)) (4.1: Fragment of WordNet taxonomy . the less information it conveys.1) Where c is a concept and p is the probability of encountering c in a given corpus.

Given a sense network *( wi .3) (4. ]} in increasing order of d(q).i (4. q j ) [i . q j ) .1] is the value of sense/semantic similarity between wi‹W and q j‹Q . q j ) …… Figure 4.2 Sense Net ranking algorithm We consider the sentence under consideration and the given query to be a set of words similar to a bag of word model. we define the distance of a word wi as d (wi ) i d (q j ) j Word with maximum sense similarity with query word qi is: (4. Stop words are rejected from the set and only the root forms of the words are taken into account. S( ) = Therefore the total average sense per word is (4.6) [i . q j ) as: *(wi .36 4.1. j = V (qi ) The exact match score is = Average sense similarity for query word ∑ ( ) with sentence W is ∑ ξ . j‹ [0. But unlike a bag of word model we give importance to the order of the words. j Where [i . Therefore we define the sense network *( wi .7) = ∑ ( ) = ∑ ∑ ξ .4) M (qi ) And the corresponding value of wj | j argmax j[ j . …… (4. (4.8) is the (4. Function distance of ith element in then the alignment score is = −1 ∑ =1 ( −1 +1 − ) .2: A sense network formed between a sentence and a query. If W is the ordered set of n words in the given sentence and Q is the ordered set m words in the query.2) *( wi . then we compute a network of sense weights between all pair of words in W and Q.5) (4.9) Let T = {ordered set of M( ) ∀ ∈ [1.

We take top t sentences and consider the plausible answers within them.3). Figure 4. Unlike newswire data most of the information found on the internet is badly formatted. Now.25 .0. If an answer appears with frequency f in sentence ranked r then that answer gets a confidence score C (ans) 1 (1 ln( f )) r (4.1. Once we obtain the total score for each sentence. = noise penalty coefficient = Total score e D *( n Etotal *m ) 1 (4. we sort then according to these scores.12) Again all answers are sorted according to confidence score and top .5 and D to 0. So when web is used as the knowledge base we use the following values of different coefficients: = 1.0.37 The total average noise is defined as G total Where D is the noise decay factor.10) = = = × + × + × + × (4. grammatically incorrect and most of the time not well formed.(=5 in our case) answers are returned along with corresponding sentence and URL (figure 4. = 0.3: A sample run for the question “Who performed the first human heart transplant?” .125 and noise decay factor D =0. = 1.25 but when using local corpus we reduce to 0.11) The coefficients are fine tuned depending on the type of corpus. = 0.

This could have affected the results of any web based study. Most of the pre-processing is done via GATE processing pipeline.38 Chapter5. Uses Google and Yahoo search engine queries to build the corpus Sense Net implementation ArrayList of Ranked Sentences with helper methods Weighted feature vector SVM classifier. Also we don’t have access to AQUAINT corpus which is the knowledgebase for TREC QA systems. GATE. Each module is designed keeping in mind space and time constraints. Stanford parser. The reason we did not use questions from TREC QA is that the TREC questions are now appearing quite frequently (sometimes with correct answers) in the results of web search engines. The URL reader module is multi threaded to keep download time at the minimum.1 Results The idea of building an easily accessible question answering system which uses the web as a document collection is not new. 5. Most of these systems are accessed via a web browser. Implements a standard porter stemmer main class that handles user queries. The tests were performed on a small set of fifty web based questions. It uses various third party APIs for NLP and text engineering. Use of JAVA makes the software cross platform and highly portable.1: Various modules of the QnA system along with each ones basic task. In the later part of the section we compare our system with other web QA systems. More information is provided in appendix B. For this reason a new collection of fifty questions was assembled to serve as the test set. The questions within the new test set were chosen to meet the following criteria: . Multi threaded URL reader implementation Multi threaded URL reader interface Stopwords filter class Computes Sense/Semantic similarity between words Stores a generic URL along with number of attempts to read it Trains the weighted feature vector SVM classifier Load GATE processing resources. Figure 5. Implementation and Results Our question answering module is written in JAVA. Json and Lucene API to name a few.

77 10.78 17.43 0.88 0.2 6.1 6.51 0 8 9 10 11 12 13 14 15 16 17 1 1 1 NA 3 1 1 NA 1 5 Question Classifier failed 4.4 0 8.9 6.2 15. Each question should be an unambiguous factoid question with only one known answer.5 11.8 0 9. This explicitly excludes questions such as “Who is the President of the US?” These questions are provided in appendix A.2 1. In case the system fails to answer a question we show the reason it failed. the table below shows the (min) rank at which answer was obtained.37 8.8 12.82 0 3 4 5 6 7 1 4 1 1 NA NE recognizer not designed to handle this question. Also time spent on various tasks is shown which would help in determining the feasibility of the system to be used in real time environment.5 11.61 0 0. Some of the questions chosen do have multiple answers although this is mainly due to incorrect answers appearing in some web documents.33 15 0 0.71 0. 8.3 11.6 6.4 7. Answer obtained @ rank 5 NA Remarks Time in seconds Document retrieval module# Preprocessing Answer extraction module 1 2 NE recognizer not designed to handle this question.1 0.38 0. The answers to the questions should not be dependent upon the time at which the question is asked. We used top 5 documents to construct our corpus which restricts our coverage to 64%.54 6.1 5.2 0.5 0 13 0 0.53 0.23 0 14.55 0. 11 8.7 13. In a way 64% is the accuracy upper bound of our system.2 16. For each question in set.23 13.39 1.0 7. Question No.41 0. 2.91 Incorrect Answer .47 0.

NE recognizer not designed to handle this question.59 0.58 0.1 6.48 0. Answer changed recently Incorrect Answer NE recognizer not designed to handle this question.63 16.34 0.62 0.43 0.61 0.2 16.7 8 0 0.40 18 19 20 21 1 2 1 NA 7.7 7.11 8.64 0.1 12.61 0 24 NA 0 0 0 25 26 27 1 NA NA 11.2 7.2 5.67 11.83 11.23 6.8 14.29 .21 12. 11.1 10. NE recognizer not designed to handle this question.62 0.4 8.4 9.3 12.77 8.4 8.1 13.1 6.99 15.67 0.3 16.47 0.42 0.43 0.21 6.54 0.23 0.34 0.1 6.2 7.9 7.22 14.67 13.01 8.7 6.8 7.65 0.38 0.5 14.0 14.41 0.3 11.23 7.66 0 11.9 6.54 0.5 0 14.36 0.53 0.23 0 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 1 1 1 1 4 1 1 1 1 NA NA 1 NA 1 NA 1 NA 1 1 NA NA Incorrect Answer Incorrect Answer Incorrect Answer Incorrect Answer Incorrect Answer Incorrect Answer Incorrect Answer Incorrect Answer Incorrect Answer mainly because req.2 11.01 10.99 11.99 8.67 7.23 11.79 0. 22 23 1 NA 7.67 16.45 0 NE recognizer not designed to handle this question.8 9.2 0 14.31 0 0.9 0 0.8 15.11 0.

languagecomputer.1: Performance of the system on the web question set.com/demos/ http://misshoover. most of the failures were because of the handicapped NE recognizer. As seen.si.02 11. returning full sentences containing duplicated answers. http://www.01 7. @Rank 5 the system reached its accuracy upper bound of 64%.2 Comparisons with other Web Based QA Systems We compare our system with four web based QA Systems – AnswerBus [18].51 11.umich. 2.45 Total number of questions: 50.41 answer type was present in the query itself 49 2 50 NA Average time spent: Incorrect Answer 8.67 6.5% of the TREC 8 question set although we believe the performance would decrease if exact answers were being evaluated as experience of the TREC evaluations has shown this to be a harder task than locating answer bearing sentences. It is claimed that AnswerBus can correctly answer 70. The question classifier failed in only one instance.6 12.com:8400 . The consistently best performing system at TREC forms the backbone of the PowerAnswer system from Language Computer1. Unlike our system each answer is a sentence and no attempt is made to cluster (or remove) sentences which contain the same answer. This system is the closest to ours.33 0. AnswerFinder is a client side application that supports natural language questions and queries the Internet via Google.ionaut. The system called AnswerBus2 [18] behaves in much the same way as PowerAnswer. It returns both exact answer and snippets. AnswerFinder.24 0. Number of questions answered@ x Rank 1: 26 – Accuracy 52% x Rank 2: 28 – Accuracy 56% x Rank 3: 29 – Accuracy 58% x Rank 4: 31 – Accuracy 62% x Rank 5: 32 – Accuracy 64% Average time spent per question: 18. 1. The system returns both exact answers and snippets.46 0. 3.3 seconds #time is dependent on network speed Table 5. 5.edu/˜zzheng/qa-new/ http://www. IONAUT [19] and PowerAnswer[20]. This gives undue advantage to the system as it performs the easier task of finding relevant sentences only. IONAUT3 [19] uses its own crawler to index the web with specific focus on entities and the relationships between them in order to provide a richer base for answering questions than the unstructured documents returned by standard search engines.

The results are encouraging but it should be noted that due to the small number of test questions it is difficult to draw firm conclusions from these experiments. PowerAnswer It is clear from the graph that our system outperforms all but AnswerFinder at rank 1. AnswerFinder and our system . An average response time of 18. IONAUT +.3 Feasibility of the system to be used in real time environment From table 5. Figure 9.1 it is clear that the system cannot be used for real time purposes as of now. within as short a period of time as was possible.3 seconds is too high. This is quite important as the answer returned at rank 1 can be considered to be the final answer provided by the system. At higher ranks it performs considerably better than AnswerBus and IONAUT while performing marginally less than AnswerFinder and PowerAnswer. so that the underlying document collection. 5.2: Comparison of AnswerBus . in this case the web would be relatively static and hence no system would benefit from subtle changes in the content of the collection.42 The questions from the web question set were presented to the five systems on the same day. But it must be noted that .

43 document retrieval time will be significantly lower for offline – local corpus. We employed machine learning techniques for question classification whose performance is good enough and any further improvements won’t be beneficial. In our case the NE recognizer is the weakest link. Even with the limited capability of NE recognizer the system is at par with state of the art web QA systems which confirms the efficacy of the ranking algorithm. We believe that if we use our own crawler and pre-process the documents beforehand. We observed that the performance of the system is limited by the worst performing module of the QA system.4 Conclusion The main motivation behind the work in this thesis was to consider.45 seconds. . Answer Extraction 3% time distribution Document Retrieval 36% Pre-Processing 61% Figure 9. where possible. We also proposed the Sense Net algorithm as new way of ranking sentences and answers. simple approaches to question answering which can be both easily understood and would operate quickly. Our NE recognizer recognizes limited sets of answer types which is not enough to obtain a good enough overall accuracy. the actual task of retrieving an answer is quite low at 0. our system can retrieve answers fast enough to be used in real time systems. Finally our current results are encouraging but we acknowledge that due to the small number of test questions it is difficult to draw firm conclusions from these experiments.3: Time distribution of each module involved in QA 5. Once the corpus is pre-processed offline. So even if a single module fails the whole system won’t be able to answer. The graph below shows percentage of time spent in different tasks. The time distribution of various modules shows that the system is quite fast at the answer extraction stage. More over the task of post processing can be done offline on the corpus as it is independent of the query. if used along with a local corpus which is pre-processed offline it can be adapted for real time applications.

became a Swiss citizen in 1901 and later became a US citizen in 1940? Ans: Albert Einstein Q009: Who shared a Nobel prize in 1945 for his discovery of the antibiotic penicillin? Ans: Alexander Fleming. Who performed the first human heart transplant? Ans: Dr Christiaan Barnard Q023: Callisto. the dingo. 38 or 37. 66%.44 Appendix A Small Web Based Question Set Q001: The chihuahua dog derives it’s name from a town in which country? Ans: Mexico Q002: What is the largest planet in our Solar System? Ans: Jupiter Q003: In which country does the wild dog. if a byte is 8 bits.98 Q014: Who discovered gravitation and invented calculus? Ans: Isaac Newton Q015: Approximately what percentage of the human body is water? Ans: 80%. 17. Howard Florey or Ernst Chain Q010: Who invented penicillin in 1928? Ans: Sir Alexander Fleming Q011: How often does Haley’s comet appear? Ans: Every 76 years or every 75 years Q012: How many teeth make up a full adult set? Ans: 32 Q013: In degrees centigrade. how many bits is a nibble? Ans: 4 Q027: What colour is cobalt? Ans: blue Q028: Who became the first American to orbit the Earth in 1962 and returned to Space in 1997? Ans: John Glenn Q029: Who invented the light bulb? Ans: Thomas Edison . 18 or 21 Q026: In computing. Ganymede and Io are 4 of the 16 moons of which planet? Ans: Jupiter Q024: Which planet was discovered in 1930 and has only one known satellite called Charon? Ans: Pluto Q025: How many satellites does the planet Uranus have? Ans: 15. live? Ans: Australia or America Q004: Where would you find budgerigars in their natural habitat? Ans: Australia Q005: How many stomachs does a cow have? Ans: Four or one with four parts Q006: How many legs does a lobster have? Ans: Ten Q007: Charon is the only satellite of which planet in the solar system? Ans: Pluto Q008: Which scientist was born in Germany in 1879. what is the average human body temperature? Ans: 37. Europa. 60% or 70% Q016: What is the sixth planet from the Sun in the Solar System? Ans: Saturn Q017: How many carats are there in pure gold? Ans: 24 Q018: How many canine teeth does a human have? Ans: Four Q019: In which year was the US space station Skylab launched? Ans: 1973 Q020: How many noble gases are there? Ans: 6 Q021: What is the normal colour of sulphur? Ans: Yellow Q022.

205 to 141 million years ago or 205 million years ago to 145 million years ago Q039: Who was President of the USA from 1963 to 1969? Ans: Lyndon B Johnson Q040: Who was British Prime Minister from 1974-1976? Ans: Harold Wilson Q041: Who was British Prime Minister from 1955 to 1957? Ans: Anthony Eden Q042: What year saw the first flying bombs drop on London? Ans: 1944 Q043: In what year was Nelson Mandela imprisoned for life? Ans: 1964 Q044: In what year was London due to host the Olympic Games. 208 to 146 million years ago. 195 – 140 million years ago.45 Q030: How many species of elephant are there in the world? Ans: 2 Q031: In 1980 which electronics company demonstrated its latest invention. but couldn’t because of the Second World War? Ans: 1944 Q045: In which year did colour TV transmissions begin in Britain? Ans: 1969 Q046: For how many days were US TV commercials dropped after President Kennedy’s death as a mark of respect? Ans: 4 Q047: What nationality was the architect Robert Adam? Ans: Scottish Q048: What nationality was the inventor Thomas Edison? Ans: American Q049: In which country did the dance the fandango originate? Ans: Spain Q050: By what nickname was criminal Albert De Salvo better known? Ans: The Boston Strangler. 205 to 140 million years ago. . the compact disc? Ans: Philips Q032: Who invented the television? Ans: John Logie Baird Q033: Which famous British author wrote ”Chitty Chitty Bang Bang”? Ans: Ian Fleming Q034: Who was the first President of America? Ans: George Washington Q035: When was Adolf Hitler born? Ans: 1889 Q036: In what year did Adolf Hitler commit suicide? Ans: 1945 Q037: Who did Jimmy Carter succeed as President of the United States? Ans: Gerald Ford Q038: For how many years did the Jurassic period last? Ans: 180 million.

csie.ntu.http://lucene.shtml WordNet 2.uk/ Apache Lucene API is a free/open source information retrieval library.jcreator.apache.http://www.json. Some classes present in JWNL API conflict with GATE. http://www.net/projects/jwordnet/ Stanford Log-linear Part-Of-Speech Tagger http://nlp.46 Appendix B Implementation Details We have used Jcreator (http://www.princeton. is a lightweight computer data interchange format.http://wordnet.86 GhZ System with 2GB RAM.5. The following third party APIs are used: x x x x x x x GATE 4. originally created in Java by Doug Cutting .edu/software/tagger. Therefore stack size should be increased to at least 512MB using –Xmx512m command line option.org/ JSON API.http://gate.org/java/ LibSVM A Library for Support Vector Machines by Chih-Chung Chang and ChihJen Lin . Default stack size may not be sufficient to run the application.edu. The code uses newer features like generics which is not compatible with any version of JAVA prior to 1. .stanford.edu/ All experiments performed on a Core 2 Duo 1.ac.1 is a lexical database for the English language is used to measure sense/semantic similarity measure .tw/~cjlin/libsvm/ JWNL is an API for accessing WordNet in multiple formats. To resolve the issue conflicting libraries belonging to GATE must not be included in the classpath. as well as relationship discovery and morphological processing http://sourceforge. JSON or JavaScript Object Notation. The API brings support to read JSON data.com/ ) as the preferred IDE.0 (A General Architecture for Text Engineering) software toolkit originally developed at the University of Sheffield since 1995 .

Marc Light. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology (NAACL ’03). Okapi at TREC-3. In Proceedings of the 19th International Conference on Computational Linguistics (COLING’02). Taiwan. Steve Walker. In Proceedings of the 8th Text REtrieval Conference. 1997. Question Classification with Support Vector Machines and Error Correcting Codes. In Proceedings of the 8th Text REtrieval Conference. 44(1):163–180. [4] Tom M. Micheline Hancock-Beaulieu. 1999. [12] Dell Zhang and Wee Sun Lee. and Mike Gatford. Lisa Ferro. In Proceedings of the 3rd Text Retrieval Conference. Susan Jones. Overview of the TREC 2002 Question Answering Track. The TREC 8 Question Answering Track Report. Voorhees. [2] Stephen E. In Proceedings of the 26th ACM International Conference on Research . 2003. 2003. pages 28–30. David House. 1994. [9] Ellen M. In Proceedings of the 9th Text REtrieval Conference. Machine Learning. and Inderjeet Mani. 2002. [5] Corpora for Question Answering Task. Question Classification using Support Vector Machines. [3] Stephen E. A Simple Question Answering System. 2002. Robertson. Cooper and Stefan M. Mitchell. [7] Kadri Hacioglu and Wayne Ward. 1999. University of Illinois at Urbana-Champaign. John D. [8] Ellen M. In Proceedings of the 11th Text REtrieval Conference. January 2008. Morristown. Learning Question Classifiers. Taipei. Voorhees. Cognitive Computation Group at the Department of Computer Science. [6] Xin Li and Dan Roth. Burger. Query expansion and dimensionality reduction: Notions of optimality in rocchio relevance feedback and latent semantic indexing. [10] Eric Breck. NJ. R¨uger. In Proceedings of the 8th Text REtrieval Conference. 1999. Computer Science Series. McGraw-Hill. A Sys Called Qanda. USA. Robertson and Steve Walker. Information Processing & Management.47 References [1] Miles Efron. Okapi/Keenbow at TREC-8. 2000. [11] Richard J.

Journal of Artificial Intelligence Research. Jones. Miller. Software infrastructure for natural language processing. 11:95–130. March 24-27. AnswerBus Question Answering System. New York. [15] H. and Orest Bolohan. Washington. 2002. Communications of the ACM. Gaizauskas. Semantic similarity in a taxonomy: An information-based measure and its application to problems of ambiguity in natural language. In Proceedings of the 11th Text REtrieval Conference. 38(11):39–41. 1997. Hai Jin. November. 2002. and Xiaomin Ning. Seattle. In SAC ’07: Proceedings of the 2007 ACM symposium on Applied computing. Humphreys. [14] Karen S. [18] Zhiping Zheng. [19] Steven Abney. R. 2000. CA. [20] Dan Moldovan. Michael Collins. 1999. Information Processing and Management. A probabilistic model of information retrieval: development and comparative experiments . [13] Hao Wu. Wilks. Toronto. [16] George A. and Amit Singhal. 1995. K. [17] Philip Resnik. In Proceedings of the Human Language Technology Conference (HLT 2002). 36(6):809–840. USA. An approach for indexing. Sanda Harabagiu. Cunningham. ACM Press. and Stephen E. Finley Lacatus¸u. LCC Tools for Question Answering. WordNet: A Lexical Database. and Y. USA. Roxana Girju. Paul Mor?arescu. storing retrieving domain knowledge.48 and Developement in Information Retrieval (SIGIR’03). Adrian Novischi. Canada. 2000. Answer Extraction. pages 296–301. and . Robertson. pages 26–32. pages 1381–1382. San Diego.part 2. In Proceedings of the 6th Applied Natural Language Processing Conference (ANLP 2000). 2007. NY. Steve Walker. Adriana Badulescu.

You're Reading a Free Preview

Descarga
scribd
/*********** DO NOT ALTER ANYTHING BELOW THIS LINE ! ************/ var s_code=s.t();if(s_code)document.write(s_code)//-->