P. 1
Undergraduate Thesis

Undergraduate Thesis

|Views: 261|Likes:
Publicado porAmiya Patanaik

More info:

Published by: Amiya Patanaik on Sep 21, 2010
Copyright:Attribution Non-commercial

Availability:

Read on Scribd mobile: iPhone, iPad and Android.
download as PDF, TXT or read online from Scribd
See more
See less

07/21/2012

pdf

text

original

Sections

  • CERTIFICATE
  • Acknowledgement
  • ABSTRACT
  • Contents
  • List of figures and tables
  • 1.1 History of Question Answering Systems
  • 1.4.7 Real time question answering
  • 1.4.8 Multi-lingual (or cross-lingual) question answering
  • 1.6.5 Traditional Metrics – Recall and Precision
  • Chapter2. Question Analysis
  • 2.1 Determining the Expected Answer Type
  • 2.1.2 Manually Constructed rules for question classification
  • 2.1.3 Fully Automatically Constructed Classifiers
  • 2.1.4 Support Vector Machines
  • 2.1.5 Kernel Trick
  • 2.1.6 Nave Bayes Classifier
  • 2.1.7 Datasets
  • 2.1.8 Features
  • 2.1.9 Entropy and Weighted Feature Vector
  • 2.1.10 Experiment Results
  • 2.2 Query Formulation
  • 2.2.1 Stop word for IR query formulation
  • Chapter3. Document Retrieval
  • 3.1 Retrieval from local corpus
  • 3.1.1 Ranking function
  • 3.1.2 Okapi BM25
  • 3.1.3 IDF Information Theoretic Interpretation
  • 3.2 Information retrieval from the web
  • 3.2.1 How many documents to retrieve?
  • Chapter4. Answer Extraction
  • 4.1 Sentence Ranking
  • 4.1.1 WordNet
  • 4.1.2 Sense/Semantic Similarity between words
  • 4.1.2 Sense Net ranking algorithm
  • Chapter5. Implementation and Results
  • 5.1 Results
  • 5.2 Comparisons with other Web Based QA Systems
  • 5.3 Feasibility of the system to be used in real time environment
  • 5.4 Conclusion
  • References

Open Domain Factoid Question

Answering System
By Amiya Patanaik
(05EG1008)




Thesis submitted in partial fulfilment of the
Requirements for the degree of Bachelor of Technology (Honours)

Under the supervision of


Dr. Sudeshna Sarkar
Professor, Department of Computer Science







DEPARTMENT OF ELECTRICAL ENGINEERING
INDIAN INSTITUTE OF TECHNOLOGY
KHARAGPUR – 721302
INDIA


MAY 2009


2


Department of Electrical Engineering
Indian Institute of Technology
Kharagpur-721302


CERTIFICATE

This is to certify that the thesis entitled Open Domain Factoid Question
Answering System is a bonafide record of authentic work carried out by Mr. Amiya
Patanaik under my supervision and guidance for the fulfilment of the requirement for the
award of the degree of Bachelor of Technology (Honours) at the Indian Institute of
Technology, Kharagpur. The work incorporated in this has not been, to the best of my
knowledge, submitted to any other University or Institute for the award of any degree or
diploma.




Dr. Sudeshna Sarkar (Guide)
Professor, Department of Computer Science Date :
Indian Institute of Technology – Kharagpur Place : Kharagpur
INDIA





Dr. S K Das (Co-guide)
Professor, Department of Electrical Engineering Date :
Indian Institute of Technology – Kharagpur Place : Kharagpur
INDIA






3
Acknowledgement

I express my sincere gratitude and indebtedness to my guide, Dr. Sudeshna Sarkar
under whose esteemed guidance and supervision, this work has been completed. This
project work would have been impossible to carry out without her advice and support
throughout.

I would also like to express my heartfelt gratitude to my co-guide Dr. S. K. Das and all
the professors of Electrical and Computer Science Engineering Department for all the
guidance, education and necessary skill set they have endowed me with, throughout my
years of graduation.

Last but not the least; I would like to thank my friends for their help during the course
of my work.


Date:




Amiya Patanaik
05EG1008
Department of Electrical Engineering
IIT Kharagpur - 721302






4
















Dedicated to
my parents and friends














5

ABSTRACT
A question answering (QA) system provides direct answers to user questions by
consulting its knowledge base. Since the early days of artificial intelligence in the 60’s,
researchers have been fascinated with answering natural language questions. However,
the difficulty of natural language processing (NLP) has limited the scope of QA to
domain-specific expert systems. In recent years, the combination of web growth,
improvements in information technology, and the explosive demand for better
information access has reignited the interest in QA systems. The wealth of information
on the web makes it an attractive resource for seeking quick answers to simple, factual
questions such as “who was the first American in space?” or “what is the second tallest
mountain in the world?” Yet today’s most advanced web search services (e.g., Google,
Yahoo, MSN live search and AskJeeves) make it surprisingly tedious to locate answers to
such questions. Question answering aims to develop techniques that can go beyond the
retrieval of relevant Documents in order to return exact answers to natural language
factoid questions, such as “Who is the first woman to be in space?”, “Which is the largest
city in India?”, and “When was first world war fought?”. Answering natural language
questions requires more complex processing of text than employed by current
information retrieval systems.

This thesis investigates a number of techniques for performing open-domain factoid
question answering. We have developed an architecture that augments existing search
engines so that they support natural language question answering and is also capable of
supporting local corpus as a knowledge base. Our system currently supports document
retrieval from Google and Yahoo via their public search engine application
programming interfaces (APIs). We assumed that all the information required to
produce an answer exists in a single sentence and followed a pipelined approach
towards the problem. Various stages in the pipeline include: automatically constructed
question type analysers based on various classifier models, document retrieval, passage
extraction, phrase extraction, sentence and answer ranking. We developed and analyzed
different sentence and answer ranking algorithms, starting with simple ones that
employ surface matching text patterns to more complicated ones using root words, part
of speech (POS) tags and sense similarity metrics. The thesis also presents a feasibility
analysis of our system to be used in real time QA applications.



6
Contents
CERTIFICATE
2
ACKNOWLEDGEMENT
3
DEDICATION
4
ABSTRACT
5
CONTENTS
6
LIST OF FIGURES AND TABLES
8

Chapter 1: Introduction 9
1.1 History of Question Answering Systems 9
1.2 Architecture 10
1.3 Question answering methods 11
1.3.1 Shallow 11
1.3.2 Deep 11
1.4 Issues 12
1.4.1 Question classes 12
1.4.2 Question processing 13
1.4.3 Context and QA 13
1.4.4 Data sources for QA 13
1.4.5 Answer extraction 13
1.4.6 Answer formulation 13
1.4.7 Real time question answering 14
1.4.8 Multi-lingual (or cross-lingual) question answering 14
1.4.9 Interactive QA 14
1.4.10 Advanced reasoning for QA 14
1.4.11 User profiling for QA 14
1.5 A generic framework for QA 15
1.6 Evaluating QA Systems 15
1.6.1 End-to-End Evaluation 16
1.6.2 Mean Reciprocal Rank 16
1.6.3 Confidence Weighted Score 16
1.6.4 Accuracy and coverage 17
1.6.5 Traditional Metrics – Recall and Precision 17
Chapter2: Question Analysis 19
2.1 Determining the Expected Answer Type 19
2.1.1 Question Classes 19
2.1.2 Manually Constructed rules for question classification 20


7
2.1.3 Fully Automatically Constructed Classifiers 20
2.1.4 Support Vector Machines 21
2.1.5 Kernel Trick 22
2.1.6 Naive Bayes Classifier 22
2.1.7 Datasets 24
2.1.8 Features 24
2.1.9 Entropy and Weighted Feature Vector 25
2.1.10 Experiment Results 26
2.2 Query Formulation 27
2.2.1 Stop word for IR query formulation 28
Chapter3. Document Retrieval 29
3.1 Retrieval from local corpus 29
3.1.1 Ranking function 29
3.1.2 Okapi BM25 29
3.1.3 IDF Information Theoretic Interpretation 30
3.2 Information retrieval from the web 30
3.2.1 How many documents to retrieve? 31
Chapter4. Answer Extraction 34
4.1 Sentence Ranking 34
4.1.1 WordNet 34
4.1.2 Sense/Semantic Similarity between words 35
4.1.2 Sense Net ranking algorithm 36
Chapter5. Implementation and Results 38
5.1 Results 38
5.2 Comparisons with other Web Based QA Systems 41
5.3 Feasibility of the system to be used in real time environment 42
5.4 Conclusion 43

APPENDIX A : Web Based Question Set
44
APPENDIX B : Implementation Details
46


REFERENCES
47




8
List of figures and tables
Figures PageNo.

Fig.1.1: A generic framework for question answering 15
Fig.1.2: Sections of a document collection as used for IR evaluation. 18
Fig.2.1: The kernel trick 22
Fig.2.2: Various feature sets extracted from a given question and its
corresponding part of speech tags.
24
Fig.2.3: Question type classifier performance 26
Fig.2.4: JAVA Question Classifier 27
Fig.3.1: Document retrieval framework 31
Fig.3.2: %coverage vs rank 32
Fig.3.3: %coverage vs. average processing time 33
Fig.4.1: Fragment of WordNet taxonomy 35
Fig.4.2: A sense network formed between a sentence and a query 36
Fig.4.3: A sample run for the question “Who performed the first human heart
transplant?”
37
Fig.5.1: Various modules of the QA system along with each ones basic task 38
Fig.9.2: Comparison with other web based QA systems 42
Fig.9.3: Time distribution of each module involved in QA 43

Tables PageNo.

Table 1.1 Coarse and fine grained question categories. 20
Table 2.1: performance of various query expansion modules implemented on
Lucene.
28
Table 3.1: %coverage and average processing time at different ranks 32
Table 5.1: Performance of the system on the web question set 39-41



9
Chapter1. Introduction
In information retrieval, question answering (QA) is the task of automatically answering
a question posed in natural language. To find the answer to a question, a QA computer
program may use either a pre-structured database or a collection of natural language
documents (a text corpus such as the World Wide Web or some local collection).
QA research attempts to deal with a wide range of question types including: fact, list,
definition, How, Why, hypothetical, semantically-constrained, and cross-lingual
questions. Search collections vary from small local document collections, to internal
organization documents, to compiled newswire reports, to the World Wide Web.
* Closed-domain question answering deals with questions under a specific domain
(for example, medicine or automotive maintenance), and can be seen as an easier task
because NLP systems can exploit domain-specific knowledge frequently formalized in
ontologies.
* Open-domain question answering deals with questions about nearly everything, and
can only rely on general ontologies and world knowledge. On the other hand, these
systems usually have much more data available from which to extract the answer.
(Alternatively, closed-domain might refer to a situation where only limited types of
questions are accepted, such as questions asking for descriptive rather than procedural
information.)
QA is regarded as requiring more complex natural language processing (NLP)
techniques than other types of information retrieval such as document retrieval, thus
natural language search engines are sometimes regarded as the next step beyond
current search engines.
1.1 History of Question Answering Systems
Some of the early AI systems were question answering systems. Two of the most famous
QA systems of that time are BASEBALL and LUNAR, both of which were developed in
the 1960s. BASEBALL answered questions about the US baseball league over a period of
one year. LUNAR, in turn, answered questions about the geological analysis of rocks
returned by the Apollo moon missions. Both QA systems were very effective in their
chosen domains. In fact, LUNAR was demonstrated at a lunar science convention in
1971 and it was able to answer 90% of the questions in its domain posed by people
untrained on the system. Further restricted-domain QA systems were developed in the
following years. The common feature of all these systems is that they had a core
database or knowledge system that was hand-written by experts of the chosen domain.
Some of the early AI systems included question-answering abilities. Two of the most
famous early systems are SHRDLU and ELIZA. SHRDLU simulated the operation of a
robot in a toy world (the "blocks world"), and it offered the possibility to ask the robot


10
questions about the state of the world. Again, the strength of this system was the choice
of a very specific domain and a very simple world with rules of physics that were easy
to encode in a computer program. ELIZA, in contrast, simulated a conversation with a
psychologist. ELIZA was able to converse on any topic by resorting to very simple rules
that detected important words in the person's input. It had a very rudimentary way to
answer questions, and on its own it lead to a series of chatter bots such as the ones that
participate in the annual Loebner prize.
The 1970s and 1980s saw the development of comprehensive theories in computational
linguistics, which led to the development of ambitious projects in text comprehension
and question answering. One example of such a system was the Unix Consultant (UC), a
system that answered questions pertaining to the Unix operating system. The system
had a comprehensive hand-crafted knowledge base of its domain, and it aimed at
phrasing the answer to accommodate various types of users. Another project was
LILOG, a text-understanding system that operated on the domain of tourism
information in a German city. The systems developed in the UC and LILOG projects
never went past the stage of simple demonstrations, but they helped the development
of theories on computational linguistics and reasoning.
In the late 1990s the annual Text Retrieval Conference (TREC) included a question-
answering track which has been running until the present. Systems participating in this
competition were expected to answer questions on any topic by searching a corpus of
text that varied from year to year. This competition fostered research and development
in open-domain text-based question answering. The best system of the 2004
competition achieved 77% correct fact-based questions.
In 2007 the annual TREC included a blog data corpus for question answering. The blog
data corpus contained both "clean" English as well as noisy text that include badly-
formed English and spam. The introduction of noisy text moved the question answering
to a more realistic setting. Real-life data is inherently noisy as people are less careful
when writing in spontaneous media like blogs. In earlier years the TREC data corpus
consisted of only newswire data that was very clean.
An increasing number of systems include the World Wide Web as one more corpus of
text. Currently there is an increasing interest in the integration of question answering
with web search. Ask.com is an early example of such a system, and Google and
Microsoft have started to integrate question-answering facilities in their search engines.
One can only expect to see an even tighter integration in the near future.
1.2 Architecture
The first QA systems were developed in the 1960s and they were basically natural-
language interfaces to expert systems that were tailored to specific domains. In
contrast, current QA systems use text documents as their underlying knowledge source
and combine various natural language processing techniques to search for the answers.
Current QA systems typically include a question classifier module that determines the
type of question and the type of answer. After the question is analyzed, the system


11
typically uses several modules that apply increasingly complex NLP techniques on a
gradually reduced amount of text. Thus, a document retrieval module uses search
engines to identify the documents or paragraphs in the document set that are likely to
contain the answer. Subsequently a filter preselects small text fragments that contain
strings of the same type as the expected answer. For example, if the question is "Who
invented Penicillin" the filter returns text that contain names of people. Finally, an
answer extraction module looks for further clues in the text to determine if the answer
candidate can indeed answer the question.
1.3 Question answering methods
QA is very dependent on a good search corpus - for without documents containing the
answer, there is little any QA system can do. It thus makes sense that larger collection
sizes generally lend well to better QA performance, unless the question domain is
orthogonal to the collection. The notion of data redundancy in massive collections, such
as the web, means that nuggets of information are likely to be phrased in many different
ways in differing contexts and documents, leading to two benefits:
(1) By having the right information appear in many forms, the burden on the QA
system to perform complex NLP techniques to understand the text is lessened.
(2) Correct answers can be filtered from false positives by relying on the correct
answer to appear more times in the documents than instances of incorrect ones.
1.3.1 Shallow
Some methods of QA use keyword-based techniques to locate interesting passages and
sentences from the retrieved documents and then filter based on the presence of the
desired answer type within that candidate text. Ranking is then done based on syntactic
features such as word order or location and similarity to query.
When using massive collections with good data redundancy, some systems use
templates to find the final answer in the hope that the answer is just a reformulation of
the question. If you posed the question "What is a dog?", the system would detect the
substring "What is a X" and look for documents which start with "X is a Y". This often
works well on simple "factoid" questions seeking factual tidbits of information such as
names, dates, locations, and quantities.
1.3.2 Deep
However, in the cases where simple question reformulation or keyword techniques will
not suffice, more sophisticated syntactic, semantic and contextual processing must be
performed to extract or construct the answer. These techniques might include named-
entity recognition, relation detection, co reference resolution, syntactic alternations,
word sense disambiguation, logic form transformation, logical inferences (abduction)


12
and commonsense reasoning, temporal or spatial reasoning and so on. These systems
will also very often utilize world knowledge that can be found in ontologies such as
WordNet, or the Suggested Upper Merged Ontology (SUMO) to augment the available
reasoning resources through semantic connections and definitions.
More difficult queries such as Why or How questions, hypothetical postulations,
spatially or temporally constrained questions, dialog queries, badly-worded or
ambiguous questions will all need these types of deeper understanding of the question.
Complex or ambiguous document passages likewise need more NLP techniques applied
to understand the text.
Statistical QA, which introduces statistical question processing and answer extraction
modules, is also growing in popularity in the research community. Many of the lower-
level NLP tools used, such as part-of-speech tagging, parsing, named-entity detection,
sentence boundary detection, and document retrieval, are already available as
probabilistic applications.
AQ (Answer Questioning) Methodology; introduces a working cycle to the QA methods.
This method may be used in conjunction with any of the known or newly founded
methods. AQ Method may be used upon perception of a posed question or answer. The
means by which it is utilized can be manipulated beyond its primary usage; however,
the primary usage is taking an answer and questioning it turning that very answer into
a question. Example; A"I like sushi." Q"(Why do) I like sushi(?)" A"The flavor." Q"(What
about) the flavor of sushi (do) I like?" Inadvertently, this may unveil different methods
of thinking and perception as well. While most would agree that this seems to be the
end-all stratagem, it is only a starting point with endless possibilities. Any number of
question methods may be used to derive the number of WHY as in, A = ∞(Q), the answer
may yield any number of questions to be asked; thereby unveiling an ongoing process
constantly being reborn into the research being performed. The QA methodology
utilizes just the opposite where, 1(Q) = ((∞(A)-∞) = 1(A), supposedly there is only one
true answer in reality everything else is perception or plausibility. Utilized alongside
other forms of communication; debate may be greatly improved. Even this methodology
should be questioned.
1.4 Issues
In 2002 a group of researchers wrote a roadmap of research in question answering. The
following issues were identified.
1.4.1 Question classes
Different types of questions require the use of different strategies to find the answer.
Question classes are arranged hierarchically in taxonomies.


13
1.4.2 Question processing
The same information request can be expressed in various ways - some interrogative,
some assertive. A semantic model of question understanding and processing is needed,
one that would recognize equivalent questions, regardless of the speech act or of the
words, syntactic inter-relations or idiomatic forms. This model would enable the
translation of a complex question into a series of simpler questions, would identify
ambiguities and treat them in context or by interactive clarification.
1.4.3 Context and QA
Questions are usually asked within a context and answers are provided within that
specific context. The context can be used to clarify a question, resolve ambiguities or
keep track of an investigation performed through a series of questions.
1.4.4 Data sources for QA
Before a question can be answered, it must be known what knowledge sources are
available. If the answer to a question is not present in the data sources, no matter how
well we perform question processing, retrieval and extraction of the answer, we shall
not obtain a correct result.
1.4.5 Answer extraction
Answer extraction depends on the complexity of the question, on the answer type
provided by question processing, on the actual data where the answer is searched, on
the search method and on the question focus and context. Given that answer processing
depends on such a large number of factors, research for answer processing should be
tackled with a lot of care and given special importance.
1.4.6 Answer formulation
The result of a QA system should be presented in a way as natural as possible. In some
cases, simple extraction is sufficient. For example, when the question classification
indicates that the answer type is a name (of a person, organization, shop or disease, etc),
a quantity (monetary value, length, size, distance, etc) or a date (e.g. the answer to the
question "On what day did Christmas fall in 1989?") the extraction of a single datum is
sufficient. For other cases, the presentation of the answer may require the use of fusion
techniques that combine the partial answers from multiple documents.


14
1.4.7 Real time question answering
There is need for developing Q&A systems that are capable of extracting answers
from large data sets in several seconds, regardless of the complexity of the question, the
size and multitude of the data sources or the ambiguity of the question.
1.4.8 Multi-lingual (or cross-lingual) question answering
The ability to answer a question posed in one language using an answer corpus in
another language (or even several). This allows users to consult information that they
cannot use directly. See also machine translation.
1.4.9 Interactive QA
It is often the case that the information need is not well captured by a QA system, as
the question processing part may fail to classify properly the question or the
information needed for extracting and generating the answer is not easily retrieved. In
such cases, the questioner might want not only to reformulate the question, but (s)he
might want to have a dialogue with the system.
1.4.10 Advanced reasoning for QA
More sophisticated questioners expect answers which are outside the scope of
written texts or structured databases. To upgrade a QA system with such capabilities,
we need to integrate reasoning components operating on a variety of knowledge bases,
encoding world knowledge and common-sense reasoning mechanisms as well as
knowledge specific to a variety of domains.
1.4.11 User profiling for QA
The user profile captures data about the questioner, comprising context data, domain
of interest, reasoning schemes frequently used by the questioner, common ground
established within different dialogues between the system and the user etc. The profile
may be represented as a predefined template, where each template slot represents a
different profile feature. Profile templates may be nested one within another.



15
1.5 A generic framework for QA
The majority of current question answering systems designed to answer factoid
questions consist of three distinct components:
1. question analysis,
2. document or passage retrieval and finally
3. answer extraction.
While these basic components can be further subdivided into smaller components like
query formation and document pre-processing, a three component architecture
describes the approach taken to building QA systems in the wider literature.














Fig.1.1: A generic framework for question answering.

It should be noted that while the three components address completely separate
aspects of question answering it is often difficult to know where to place the boundary
of each individual component. For example the question analysis component is usually
responsible for generating an IR query from the natural language question which can
then be used by the document retrieval component to select a subset of the available
documents. If, however, an approach to document retrieval requires some form of
iterative process to select good quality documents which involves modifying the IR
query, then it is difficult to decide if the modification should be classed as part of the
question analysis or document retrieval process.
1.6 Evaluating QA Systems
Evaluation is a highly subjective matter when dealing with NLP problems. It is always
easier to evaluate when there is a clearly defined answer, unfortunately with most of
the natural language tasks there is no single answer. A rather impractical and tedious
way of doing this could be to manually search an entire collection of text and mark the
Corpus or
document
collection
Document
Retrieval
Top n text
segments
or
sentences
Answer
Extraction
Answers
Question
Analysis
Question


16
relevant documents. Then the queries can be used to make an evaluation based on
precision and recall. But this is not possible even for the smallest of document
collections and with the size of corpuses like AQUAINT with approximately 1,00,000
articles it is next to impossible.
1.6.1 End-to-End Evaluation
Almost every QA system is concerned with the final answer. So a widely accepted
metric is required to evaluate the performance of our system and compare it with other
existing systems. Most of the recent large scale QA evaluations have taken place as part
of the TREC conferences and hence the evaluation metrics used have been extensively
studied and is used in this study. Following are definitions of numerous metrics for
evaluating factoid questions. Evaluating descriptive questions is much more difficult
than factoids.
1.6.2 Mean Reciprocal Rank
The original evaluation metric used in the QA tracks of TREC 8 and 9 was mean
reciprocal rank (MRR). MRR provides a method for scoring systems which return
multiple competing answers per question. Let Q be the question collection and
i
r the
rank of the first correct answer to question i or 0 if no correct answer is returned. MRR
is then given by:

| |
1
1
| |
Q
i
i
M
r
RR
Q
=
=
¯
(1.1)
As useful as MRR was as an evaluation metric for the early TREC QA evaluations it does
have a number of drawbacks [8], the most important of which are that
- systems are given no credit for retrieving multiple (different) correct answers
and
- As the task required each system to return at least one answer per question; no
credit was given to systems for determining that they did not know or could not
locate an appropriate answer to a question.
1.6.3 Confidence Weighted Score
Following the shortcomings of MRR as an evaluation metric a new evaluation metric
was chosen as the new evaluation metric [9]. Under this evaluation metric a system
returns a single answer for each question. These answers are then sorted before
evaluation so that the answer which the system has most confidence in is placed first.


17
The last answer evaluated will therefore be the one the system has least confidence in.
Given this ordering CWS is formally defined in Equation 1.2:

| |
1
no. of correct in first i answers
| |
Q
i
CWS
i
Q
=
=
¯
(1.2)
CWS therefore rewards systems which can not only provide correct exact answers to
questions but which can also recognise how likely an answer is to be correct and hence
place it early in the sorted list of answers. The main issue with CWS is that it is difficult
to get an intuitive understanding of the performance of a QA system given a CWS score
as it does not relate directly to the number of questions the system was capable of
answering.
1.6.4 Accuracy and coverage
Accuracy of a QA system is a simple evaluation metric with direct correspondence to
number of correct answers. Let
, D q
C be the correct answers for question q known to be
contained in the document collection D and
, ,
S
D q n
F be the first n answers found by
system S for question q from D then accuracy is defined as:
, , ,
| { | }|
( , , )
| |
S
D q n D q S
q Q
accuracy
F C
Q D n
Q
o · =
=
‹

(1.3)
Similarly The coverage of a retrieval system S for a question set Q and document
collection D at rank n is the fraction of the questions for which at least one relevant
document is found within the top n documents:

, , ,
| { | }|
coverage ( , , )
| |
S
D q n D q S
q Q R A
Q D n
Q
o · =
=
‹
(1.4)
1.6.5 Traditional Metrics – Recall and Precision
The standard evaluation measures for IR systems are precision and recall. Let D be the
document (or passage collection),
, D q
A the subset of Dwhich contains relevant
documents for a query q and
, ,
S
D q n
R be the n top-ranked documents (or passages) in D
retrieved by an IR system S (figure 1.2); then

The recall of an IR system S at rank n for a query q is the fraction of the relevant
documents
, D q
A , which have been retrieved:


18

, , ,
,
| |
( , , )
| |
S
D q n D q S
D q
R A
reca q l n
A
l D
·
=
(1.5)
The precision of an IR system S at rank n for a query q is the fraction of the retrieved
documents
, ,
S
D q n
R that are relevant:

, , ,
, ,
| |
( , , )
| |
S
D q n D q S
S
D q n
precisio
R A
D q n
R
n
·
= (1.6)
Clearly given a set of queries Q average recall and precision values can be calculated to
give a more representative evaluation of a specific IR system. Unfortunately these
evaluation metrics although well founded and used throughout the IR community suffer
from two problems when used in conjunction with the large document collections
utilized by QA systems, namely determining the set of relevant documents within a
collection for a given query,
, D q
A . The only accurate way to determine which documents
are relevant to a query is to read every single document in the collection and determine
its relevance. Clearly given the size of the collections over which QA systems are being
operated this is not a feasible proposition. It must be kept in mind that just because a
relevant document is found does not automatically mean the QA system will be able to
identify and extract a correct answer. Therefore it is better to use recall and precision at
the document retrieval stage rather than for the complete system.




















Figure 1.2: Sections of a document collection as used for IR evaluation.
Document Collection/Corpus
Relevant Documents


, D q
A


Retrieved Documents
, ,
S
D q n
R


19
Chapter2. Question Analysis

As the first component in a QA system it could easily be argued that question analysis is
the most important part. Not only is the question analysis component responsible for
determining the expected answer type and for constructing an appropriate query for
use by an IR engine but any mistakes made at this point are likely to render useless any
further processing of a question. If the expected answer type is incorrectly determined
then it is highly unlikely that the system will be able to return a correct answer as most
systems constrain possible answers to only those of the expected answer type. In a
similar way a poorly formed IR query may result in no answer bearing documents being
retrieved and hence no amount of further processing by an answer extraction
component will lead to a correct answer being found.
2.1 Determining the Expected Answer Type
In most QA systems the first stage in processing a previously unseen question is to
determine the semantic type of the expected answer. Determining the expected answer
type for a question implies the existence of a fixed set of answer types which can be
assigned to each new question. The problem of question type classification can be
solved by constructing manual rules or if we have access to large set of annotated-pre
classified questions, using machine learning approaches. We have employed a machine
learning model in our system which employs a feature-weighting model which assigns
different weights to features instead of simple binary values. The main characteristic of
this model is assigning more reasonable weight to features: these weights can be used
to differentiate features from each other according to their contribution to question
classification. Further, we propose to use features initially just as bag of words and later
on both as a bag of words and feature called as partitioned feature model. Results show
that with this new feature-weighting model the SVM-based classifier outperforms the
one without it to a large extent.
2.1.1 Question Classes
We follow the two-layered question taxonomy, which contains 6 coarse grained
categories and 50 fine grained categories, as shown in Table 1. Each coarse grained
category contains a non-overlapping set of fine grained categories. Most question
answering systems use a coarse grained category definition. Usually the number of
question categories is less than 20. However, it is obvious that a fine grained category
definition is more beneficial in locating and verifying the plausible answers.




20
Table 1.1 Coarse and fine grained question categories.
Coarse Fine
ABBR abbreviation, expansion
DESC definition, description, manner, reason
ENTY animal, body, color, creation, currency, disease/medical,
event, food, instrument, language, letter, other, plant,
product, religion, sport, substance, symbol, technique, term,
vehicle, word
HUM description, group, individual, title
LOC city, country, mountain, other, state
NUM code, count, date, distance, money, order, other, percent,
period, speed, temperature, size, weight


2.1.2 Manually Constructed rules for question classification
Often the easiest approach to question classification is a set of manually constructed
rules. This approach allows a simple low coverage classifier to be rapidly developed
without requiring a large amount of hand labelled training data. A number of systems
have taken this approach, many creating sets of regular expressions which only
questions with the same answer type [10],[11]. While these approaches work well for
some questions (for instance questions asking for a date of birth can be reliably
recognised using approximately six well constructed regular expressions) they often
require the examination of a vast number of questions and tend to rely purely on the
text of the question. One possible approach for manually constructing rules for such a
classifier would be to define rule formalism that whilst retaining the relative simplicity
of regular expressions would give access to a richer set of features. As we had access to
large set of pre annotated question samples we have not used this method.
2.1.3 Fully Automatically Constructed Classifiers
As mentioned in the previous section building a set of classification rules to perform
accurate question classification by hand is both a tedious and time-consuming task. An
alternative solution to this problem is to develop an automatic approach to constructing
a question classifier using (possibly hand labelled) training data. A number of different
automatic approaches to question classification have been reported which make use of
one or more machine learning algorithms [6][7][12] including nearest neighbour (NN)
[4], decision trees (DT) and support vector machines (SVM)[7][12] to induce a classifier.
In our system we employed a SVM and Naive Bayes classifier on different feature sets
extracted from the question.

21
2.1.4 Support Vector Machines
Support vector machines (SVMs) are a set of related supervised learning methods used
for classification and regression. Viewing input data as two sets of vectors in an n-
dimensional space, an SVM will construct a separating hyper-plane in that space, one
which maximizes the margin between the two data sets. To calculate the margin, two
parallel hyperplanes are constructed, one on each side of the separating hyper-plane,
which are "pushed up against" the two data sets. Intuitively, a good separation is
achieved by the hyper-plane that has the largest distance to the neighboring data points
of both classes, since in general the larger the margin the lower the generalization error
of the classifier.
We are given some training data, a set of points of the form
D = ൛(ݔ
݅
, ܿ
݅
) | ݔ
݅
߳ ℝ
݌
, ܿ
݅
߳ {−1,1}ൟ
݅=1
݊
(2.1)
where the ܿ
݅
is either 1 or −1, indicating the class to which the point ݔ
݅
belongs. Each ݔ
݅

is a p-dimensional real vector. We want to give the maximum-margin hyperplane which
divides the points having ܿ
݅
= 1 from those having ܿ
݅
= − 1. Any hyperplane can be
written as the set of points ݔ satisfying
w ⋅ ݔ − ܾ = 0 (2.2)
where denotes the dot product. The vector w is a normal vector: it is perpendicular to
the hyperplane. The parameter ܾ/‖ݓ‖ determines the offset of the hyperplane from the
origin along the normal vector w. We want to choose the w and b to maximize the
margin, or distance between the parallel hyperplanes that are as far apart as possible
while still separating the data. These hyperplanes can be described by the equations
w ⋅ ݔ − ܾ = 1 (2.3)

and
w ⋅ ݔ − ܾ = −1 (2.4)
Note that if the training data are linearly separable, we can select the two hyperplanes
of the margin in a way that there are no points between them and then try to maximize
their distance. By using geometry, we find the distance between these two hyperplanes
is 2/‖ݓ‖, so we want to minimize ‖ݓ‖. As we also have to prevent data points falling
into the margin, we add the following constraint: for each i either
w ⋅ ݔ
݅
− ܾ ≥ 1 (2.5)
and
w ⋅ ݔ
݅
− ܾ ≤ −1 (2.6)
This can be rewritten as:

22
ܿ
݅
(w ⋅ ݔ
݅
− ܾ) ≥ 1 , for all 1≤ ݅ ≤ ݊ (2.7)

We can put this together to get the optimization problem:
Minimize in (ݓ, ܾ) ‖ݓ‖ subject to (for any i = 1,……., n)
ܿ
݅
(w ⋅ ݔ
݅
− ܾ) ≥ 1 (2.8)
2.1.5 Kernel Trick
If instead of the Euclidean inner product w ⋅ ݔ
݅
one fed the QP solver with a function
K(w, ݔ
݅
) the boundary between the two classes would then be,
K(x,w) + b = 0 (2.9)
and the set of x e R
d
on that boundary becomes a curved surface embedded in R
d
when
the function K(x,w) is non-linear.
Consider K(x,w) to be the inner product not of the coordinate vectors x and w in R
d
but
of vectors o(x) and o(w) in higher dimensions. The map, o: X ÷ H
is called a feature map from the data space X into the feature space H . The feature
space is assumed to be a Hilbert space of real valued functions defined on X . The data
space is often R
d
but most of the interesting results hold when X is a compact
Riemannian manifold. The following picture illustrates a particularly simple example
where the feature map o(x1,x2)=(x1
2
,\2x1x2,x2
2
) maps data in R
2
into R
3
.

Figure 2.1: The kernel trick, after transformation the data is linearly separable.
2.1.6 Na ve Bayes Classifier

Along with SVM, we also tried Naïve Bayes Classifier[6]. A naive Bayes classifier is a
term in Bayesian statistics dealing with a simple probabilistic classifier based on


23
applying Bayes' theorem with strong (naive) independence assumptions. A more
descriptive term for the underlying probability model would be "independent feature
model". In simple terms, a naive Bayes classifier assumes that the presence (or absence)
of a particular feature of a class is unrelated to the presence (or absence) of any other
feature. For example, the words or features of a given question may are assumed to be
independent to simplify mathematical complexities. Even though these features depend
on the existence of the other features, a naive Bayes classifier considers all of these
properties to independently contribute to the probability that this fruit is an apple.
Depending on the precise nature of the probability model, naive Bayes classifiers can be
trained very efficiently in a supervised learning setting. In many practical applications,
parameter estimation for naive Bayes models uses the method of maximum likelihood;
in other words, one can work with the naive Bayes model without believing in Bayesian
probability or using any Bayesian methods.
Abstractly, the probability model for a classifier is a conditional model

p(C| F_1,…….,F_n),

over a dependent class variable C with a small number of outcomes or classes,
conditional on several feature variables F_1 through F_n. The problem is that if the
number of features n is large or when a feature can take on a large number of values,
then basing such a model on probability tables is infeasible. We therefore reformulate
the model to make it more tractable.
Using Bayes' theorem, we write

p(C | F_1,…….,F_n) =
ܘ(۱) ܘ(۴

,……,۴_ܖ| ۱)
ܘ(۴

,…….,۴_ܖ)
(2.10)

In plain English the above equation can be written as

posterior = (prior*likelihood)/evidence (2.11)

In practice we are only interested in the numerator of that fraction, since the
denominator does not depend on C and the values of the features F_i are given, so that
the denominator is effectively constant. The numerator is equivalent to the joint
probability model

p(C, F_1, ………, F_n),

which can be rewritten as follows, using repeated applications of the definition of
conditional probability:

p(C, F_1, ………., F_n)
= p(C) p(F_1,……….,F_n| C)
= p(C) p(F_1| C) p(F_2,.......,F_n| C, F_1)

24
= p(C) p(F_1| C) p(F_2| C, F_1) p(F_3,.......,F_n| C, F_1, F_2)
= p(C) p(F_1| C) p(F_2| C, F_1) p(F_3| C, F_1, F_2) p(F_4,.......,F_n| C, F_1, F_2, F_3)
= p(C) p(F_1| C) p(F_2| C, F_1) p(F_3| C, F_1, F_2) ...
.... p(F_n| C, F_1, F_2, F_3,.......,F_{n-1}) (2.12)
and so forth. Now the "naive" conditional independence assumptions come into play:
assume that each feature Fi is conditionally independent of every other feature Fj for j
i. This means that
p(F_i | C, F_j) = p(F_i | C), (2.13)
and so the joint model can be expressed as
p(C, F_1, ......., F_n) = p(C) p(F_1| C) p(F_2| C) p(F_3| C) …….. p(F_n| C)
= p(C)∏ p(F_i | C)
݊
݅=1
(2.14)
2.1.7 Datasets
We used the publicly available training and testing datasets provided by Tagged
Question Corpus, Cognitive Computation Group at the Department of Computer Science,
University of Illinois at Urbana-Champaign (UIUC) [5]. All these datasets have been
manually labelled by UIUC [5] according to the coarse and fine grained categories in
Table 1.1. There are about 5,500 labelled questions randomly divided into 5 training
datasets of sizes 1,000, 2,000, 3,000, 4,000 and 5,500 respectively. The testing dataset
contains 2000 labelled questions from the TREC QA track. The TREC QA data is hand
labelled by us.
2.1.8 Features
For each question, we extract two kinds of features: bag-of-words or a mix of POS tags
and words. Every question is represented as feature vectors; the weight associated with
each word varies between 0 and 1. The following example demonstrated different
feature sets considered for a given question and its POS parse.

Figure 2.2: Various feature sets extracted from the given question and its corresponding
part of speech tags.


25
2.1.9 Entropy and Weighted Feature Vector
In information theory the concept of entropy is used as a measure of the uncertainty of
a random variable. Let X be a discrete random variable with respect to alphabet A and
p(x) = Pr(X = x), x ∈ A be the probability function, then the entropy H(X) of the discrete
random variable X is defined as:

H(x) = −Σx∈A {p(x) log p(x)} (2.15)

The larger the entropy H(X) is, the more uncertain the random variable X is. In
information retrieval many methods have been applied to evaluate term’s relevance to
documents, among which entropy-weighting, based on information theoretic ideas, is
proved the most effective and sophisticated. Let fit be the frequency of word i in
document t, ni the total number of occurrences of word i in document collection, N the
number of total documents in the collection, then the confusion (or entropy) of word i
can be measured as follows:

H(i) = Σ(t=1 to N) [fit ni · log(nifit)] (2.16)

The larger the confusion of a word is, the less important it is. The confusion achieves
maximum value log(N) if the word is evenly distributed over all documents, and
minimum value 0 if the word occurs in only one document.
Keeping this in mind to calculate the entropy of a word, certain preprocessing is needed.
Let C be the set of question types. Without loss of generality, it is denoted by C = {1, . . .
,N}. Ci is a set of words extracted from questions of type i, that is to say, Ci represents a
word collection similar to documents. From the viewpoint of representation, each Ci is
the same as a document because both of which are just a collection of words. Therefore
we can also use the idea of entropy to evaluate word’s importance. Let ai be the weight
of word i, fit be the frequency of word i in Ct, ni be the total number of occurrences of
word i in all questions, then ai is defined as:

ai =(1 +1/log(N)Σ(t=1 to N)[fitni·log(fitni)]) (2.17)

Weight of word i is opposite to its entropy: the larger the entropy of word i is, the less
important to question classification it is. In other words, the smaller weight is
associated with word i. Consequently, ai get the maximum value of 1 if word i occurs in
only one set of question type, and the minimum value of 0 if the word is evenly
distributed over all sets. Note that if a word occurs in only one set, for other sets fik is 0.
We use the convention that 0 log 0 = 0, which is easily justified since xlogx → 0 as x → 0.

26
2.1.10 Experiment Results
We tested various algorithms for question classification.
Naïve Bayes Classifier* using Bag of Words feature set (64% accurate on TREC data),
Naïve Bayes Classifier* using Partitioned feature set (69% accurate on TREC data),
Support Vector Machine Classifier using Bag of Words feature set (78% accurate on
TREC data), Support Vector Machine Classifier using Weighted feature set (85%
accurate on TREC data)
It must be noted that the classifiers were NOT trained on TREC data. The classifier
classified questions into six broad classes and fifty coarse classes. Therefore a baseline
(random) classifier is (1/50) = 2% accurate. We employed various smoothing
techniques to Naive Bayes Classifier. The performance without smoothing was too low
and not worth mentioning. While Witten-Bell smoothing worked well, simple add one
smoothing outperformed it. The accuracy reported here are for Naive Bayes Classifier
employing add one smoothing.
We implemented weighted feature set SVM classifier into a cross platform standalone
desktop application (shown below). The application will be made available to public for
evaluation. Training was done on a set of 12788 questions provided by Cognitive
Computation Group at the Department of Computer Science, University of Illinois at
Urbana-Champaign.


Figure 2.3: Classifiers were tested on a set of 2000 TREC questions.
Some sample test runs
Q: What was the name of the first Russian astronaut to do a spacewalk?
Response: HUM -> IND(an Individual)
Q: How much folic acid should an expectant mother get daily?
Response: NUM -> COUNT
0
10
20
30
40
50
60
70
80
90
Chart Showing accuracy of classifier
Baseline Classifier
Naïve Bayes Classifier using
Bag of Words feature
Naïve Bayes Classifier using
Partitioned feature
SVM Classifier using Bag of
Words feature
SVM Classifier using
Weighted feature set


27
Q: What is Francis Scott Key best known for?
Response: DESC -> DESC
Q: What state has the most Indians?
Response: LOC -> STATE
Q: Name a flying mammal.
Response: ENTITY -> ANIMAL


Figure 2.4: JAVA Question Classifier, can be downloaded for evaluation from
http://www.cybergeeks.co.in/projects.php?id=10
2.2 Query Formulation

The question analysis component of a QA system is usually responsible for formulating
a query from a natural language questions to maximise the performance of the IR
engine used by the document retrieval component of the QA system. Most QA systems
start constructing an IR query simply by assuming that the question itself is a valid IR
query, while other systems go for a query expansion. The design of the query expansion
module should be such as to maintain the right balance between recall and precision.
For large corpus, query expansion may not be necessary as even with not so well
formed query recall is sufficient to extract the right answer and query expansion may in
fact reduce precision. But in case of a local small corpus query expansion may be
necessary. In our system when using web as document collection, we pass on the
question as IR query after masking the stop words. When a web corpus is not available
we employ Rocchio Query Expansion [1] method which is implemented in lucene query
expansion module. The table below shows performance of various query expansion
modules implemented on Lucene. The test is carried out on data from NIST TREC
Robust Retrieval Track 2004





28
Tag Combined Topic Set
MAP P10 %no
Lucene QE 0.2433 0.3936 18.10%
Lucene gQE 0.2332 0.3984 14%
KB-R-FIS gQE 0.2322 0.4076 14%
Lucene 0.2 0.37 15%

MAP - mean average precision
P10 - average of precision at 10 documents retrieved
%no - percentage of topics with no relevant in the top 10 retrieved

Lucene QE - lucene with local query expansion
Lucene gQE – Lucene system that utilized Rocchio’s query expansion along with Google.
KB-R-FIS gQE – My Fuzzy Inference System that utilized Rocchio’s query expansion along with Google.

Table 2.1: performance of various query expansion modules implemented on Lucene.

It must be noted that query expansion is internally carried out by the APIs used to
retrieve documents from the web, although because of the proprietary nature their
working is unknown and unpredictable.
2.2.1 Stop word for IR query formulation
Stop words or noise words are words which appear with a high frequency and are
considered to be insignificant for normal IR processes. Unfortunately when it comes to
QA systems high frequency of a word in a collection may not always suggest that it is
insignificant in retrieving the answer. For example the word “first” is widely considered
to be a stop word but is very important when appears in the question “Who was the first
President of India?”. Therefore we manually analyzed 100 TREC QA track questions and
prepared a list of stop words. A partial list of the stop words is shown below.

I a about an are
As at be by com
De en for from how
In is it la of
On or that the this
To was what when where
Who will with und the
www

The list of stop words we obtained is much smaller than standard stop word lists
(although there is no definite list of stop words which all natural language processing
tools incorporate, most of these lists are very similar).


29
Chapter3. Document Retrieval

The text collection over which a QA system works tend to be so large that it is
impossible to process whole of it to retrieve the answer. The task of the document
retrieval module is to select a small set from the collection which can be practically
handled in the later stages. A good retrieval unit will increase precision while
maintaining good enough recall.
3.1 Retrieval from local corpus
All the work presented in this thesis relies upon the Lucene IR engine [13] for local
corpus searches. Lucene is an open source boolean search engine with support for
ranked retrieval results using a TF.IDF based vector space model. One of the main
advantages of using Lucene, over many other IR engines, is that it is relatively easy to
extend to meet the demands of a given research project (as an open source project the
full source code to Lucene is available making modification and extension relatively
straight forward) allowing experiments with different retrieval models or ranking
algorithms to use the same document index.
3.1.1 Ranking function
We employ highly popular Okapi BM25 [3] ranking function for our document retrieval
module. It is based on the probabilistic retrieval framework developed in the 1970s and
1980s by Stephen E. Robertson, Karen Spärck Jones, and others [14].
The name of the actual ranking function is BM25. To set the right context, however, it
usually referred to as "Okapi BM25", since the Okapi information retrieval system,
implemented at London's City University in the 1980s and 1990s, was the first system
to implement this function. BM25, and its newer variants, e.g. BM25F [2] (a version of
BM25 that can take document structure and anchor text into account), represent state-
of-the-art retrieval functions used in document retrieval, such as Web search.
3.1.2 Okapi BM25
BM25 is a bag-of-words retrieval function that ranks a set of documents based on the
query terms appearing in each document, regardless of the inter-relationship between
the query terms within a document (e.g., their relative proximity). It is not a single
function, but actually a whole family of scoring functions, with slightly different
components and parameters. One of the most prominent instantiations of the function
is as follows. Given a query Q, containing keywords ݍ
1
,..., ݍ
݊
, the BM25 score of a
document D is:


30
Score(D,Q) = ∑ ܫܦܨ(ݍ
݅
).
݂(ݍ
݅
,ܦ)∙(݇
1
+1)
݂(ݍ
݅
,ܦ)+݇
1
∙(1−ܾ+ܾ∙
|ܦ|
ܽݒ݈݃݀
)
ܰ
݅=1
(3.1)

where f(qi,D) is qi's term frequency in the document D, | D | is the length of the
document D in words, and avgdl is the average document length in the text collection
from which documents are drawn. k1 and b are free parameters, usually chosen as k1 =
2.0 and b = 0.75. IDF(qi) is the IDF (inverse document frequency) weight of the query
term qi. It is usually computed as:

ܫܦܨ(ݍ
݅
) = ݈݋݃
ܰ−݊(ݍ
݅
)+0.5
݊(ݍ
݅
)+0.5
(3.2)

where N is the total number of documents in the collection, and n(qi) is the number of
documents containing qi. There are several interpretations for IDF and slight variations
on its formula. In the original BM25 derivation, the IDF component is derived from the
Binary Independence Model.
3.1.3 IDF Information Theoretic Interpretation
Here is an interpretation from information theory. Suppose a query term q appears in
n(q) documents. Then a randomly picked document D will contain the term with
probability n(q)/N (where N is again the cardinality of the set of documents in the
collection). Therefore, the information content of the message "D contains q" is:

-log
݊(ݍ)
ܰ
= log
ܰ
݊(ݍ)
(3.3)

Now suppose we have two query terms q1 and q2. If the two terms occur in documents
entirely independently of each other, then the probability of seeing both q1 and q2 in a
randomly picked document D is:
݊(ݍ1)
ܰ

݊(ݍ2)
ܰ

and the information content of such an event is:
෍log
ܰ
݊(ݍ݅)
2
݅=1

With a small variation, this is exactly what is expressed by the IDF component of BM25.
3.2 Information retrieval from the web
Indexing the whole web is a gigantic task which is not possible on a small scale.
Therefore we use public APIs of search engines. We have used Google AJAX Search API
and Yahoo BOSS. Both APIs have relaxed terms of condition and allow access through
code. Moreover there are no limits on number of queries per day when used for


31
educational purposes. The search APIs can return top n documents for a given query.
We read top n uniform resource locators (URLs) and build the collection of documents
to be used for answer retrieval. As the task of reading the URLs over the internet is
inherently slow process, this stage is the most taxing one in terms of runtime. To
accelerate the process we employ multi threaded URL readers so that multiple URLs can
be read simultaneously. Figure 3.1 shows the document retrieval framework.
























Figure 3.1: Document retrieval framework
3.2.1 How many documents to retrieve?

One of the main considerations when doing document retrieval for QA is the amount of
text to retrieve and process for each question. Ideally a system would retrieve a single
text unit that was just large enough to contain a single instance of the exact answer for
every question. Whilst the ideal is not attainable, the document retrieval stage can act as
a filter between the document collections/web and answer extraction components by
retrieving a relatively small set of text collection. Therefore our target is to increasing
coverage with least number of retrieved documents to form the text collection. Lowered
precision is penalized by higher average processing time by later stages. Therefore,
IR Query
Lucene IR
Module
Google/Yahoo
Search APIs
Local
Corpus
Okapi BM25
Ranking function
URL Reader
URL Reader
URL Reader
URL Reader
Top n
Docs
Multi threaded
Reader module

INTERNET


32
criterion for selecting the right collection size depends on coverage and average
processing time. The table below shows percentage coverage, average processing time
at different ranks for Google and Yahoo search APIs. The results are obtained on a set of
30 questions (equally distributed over all question classes) from TREC 04 QA track [5].

Average
Processing
Time*(sec)
%Coverage @rank
Yahoo Google Yahoo Google
0.02 0.021 23 28 1
0.102 0.09 31 48 2
0.27 0.23 37 56 3
0.34 0.37 42 58 4
0.49 0.51 48 64 5
0.82 0.803 49 64 6
1.23 1.1 49 64 7
1.44 1.39 51 66 8
2.01 1.9 51 70 9
2.31 2.2 52 72 10
2.8 2.6 53 72 11
3.22 3.1 53 73 12
3.7 3.4 54 73 13
4.2 4.6 54 73 14
4.77 5.1 55 74 15

*Average time spent by answer retrieval node.

Table 3.1: %coverage and average processing time at different ranks



Figure 3.2: %coverage vs rank
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Yahoo BOSS API 23 31 37 42 48 49 49 51 51 52 53 53 54 54 55
Google AJAX Search
API
28 48 56 58 64 64 64 66 70 72 72 73 73 73 74
0
10
20
30
40
50
60
70
80
%
C
o
v
e
r
a
g
e
%Coverage vs rank


33



Figure 3.3: %coverage vs. average processing time

From the results it is clear that going up to rank 5 ensures a good coverage while
maintaining low processing time. Clearly Google outperforms Yahoo at all ranks.

0
10
20
30
40
50
60
70
80
0
.
0
2
1
0
.
0
9
0
.
2
3
0
.
3
7
0
.
5
1
0
.
8
0
3
1
.
1
1
.
3
9
1
.
9
2
.
2
2
.
6
3
.
1
3
.
4
4
.
6
5
.
1
%Coverage vs Average Processing time(sec)
Google AJAX Search
API
Yahoo BOSS


34
Chapter4. Answer Extraction

The final stage in a QA system, and arguably the most important, is to extract and
present the answers to questions. We employ a named entity (NE) recognizer to filter
out those sentences which could potentially contain answer to the given question. In
our system we have used GATE – A General Architecture for Text Engineering provided
by The Sheffield NLP group [15] as a tool to handle most of the NLP tasks including NE
recognition.
4.1 Sentence Ranking
The sentence ranking is responsible for ranking the sentences and giving a relative
probability estimate to each one. It also registers the frequency of each individual
phrase chunk marked by the NE recognizer for a given question class. The final answer
is the phase chunk with maximum frequency belonging to the sentence with highest
rank. The probability estimate and the retrieved answer’s frequency are used to
compute confidence of the answer.
4.1.1 WordNet
WordNet [16] is the product of a research project at Princeton University which has
attempted to model the lexical knowledge of a native speaker of English. In WordNet
each unique meaning of a word is represented by a synonym set or synset. Each synset
has a gloss that defines the concept of the word. For example the words car, auto,
automobile, and motorcar is a synset that represents the concept define by gloss: four
wheel Motor vehicle, usually propelled by an internal combustion Engine. Many glosses
have examples of usages associated with them, such as "he needs a car to get to work."
In addition to providing these groups of synonyms to represent a concept, WordNet
connects concepts via a variety of semantic relations. These semantic relations for
nouns include:
- Hyponym/Hypernym (IS-A/ HAS A)
- Meronym/Holonym (Part-of / Has-Part)
- Meronym/Holonym (Member-of / Has-Member)
- Meronym/Holonym (Substance-of / Has-Substance)

Figure 4.1 shows a fragment of WordNet taxonomy.


35
4.1.2 Sense/Semantic Similarity between words
We use statistics to compute information content value. We assign a probability to a
concept in taxonomy based on the occurrence of target concept in a given corpus.
The IC value is then calculated by negative log likelihood formula as follow:

( ) log( ( )) IC c P c = ÷
(4.1)
Where c is a concept and p is the probability of encountering c in a given corpus. Basic
idea behind the negative likelihood formula is that the more probable a concept
appears, the less information it conveys, in other words, infrequent words are more
informative then frequent ones. Using this basic idea we compute the sense/semantic
similarity ૆ between two given words based on a similarity metric proposed by Philip
Resnik [17].



Figure 4.1: Fragment of WordNet taxonomy


36
4.1.2 Sense Net ranking algorithm
We consider the sentence under consideration and the given query to be a set of words
similar to a bag of word model. But unlike a bag of word model we give importance to
the order of the words. Stop words are rejected from the set and only the root forms of
the words are taken into account. If W is the ordered set of n words in the given
sentence and Q is the ordered set m words in the query, then we compute a network of
sense weights between all pair of words in W and Q. Therefore we define the sense
network ) ( ,
j i
w q I as:

,
) ( ,
i j i j
w q c I =
(4.2)
Where
,
[0,1]
i j
c ‹ is the value of sense/semantic similarity between and
i j
w W q Q ‹ ‹ .






…… ࢝




) ( ,
j i
w q I






…… ࢗ


Figure 4.2: A sense network formed between a sentence and a query.

Given a sense network ) ( ,
i j
w q I , we define the distance of a word
i
w as
( )
i
d w i = (4.3)
( )
j
d q j = (4.4)
Word with maximum sense similarity with query word
i
q is:

,
( | ) ar ax gm
i j j j i
w j M q c = = (4.5)
And the corresponding value of
, i j
c = ( )
i
V q (4.6)
The exact match score is
ܧ
ݐ݋ݐ݈ܽ
=
∑ ܸ(ݍ
݅
)
݅
݉


Average sense similarity for query word ݍ
݅
with sentence W is
S(ݍ
݅
) =
∑ ξ
݆ ,݅ ݆
݊
(4.7)

Therefore the total average sense per word is
ܵ
ݐ݋ݐ݈ܽ
=
∑ ܵ(ݍ
݅
)
݅
݊
=
∑ ∑ ξ
݅,݆ ݆ ݅
݉݊
(4.8)
Let T = {ordered set of M(ݍ
݅
) ∀ ݅ ∈ [1, ݉]} in increasing order of d(q). Function ࣂ

is the
distance of i
th
element in ܶ then the alignment score is
ܭ
ݐ݋ݐ݈ܽ
=
∑ ݏ݃݊(ߠ
݅+1
−ߠ
݅
)
݉−1
݅=1
݉−1
(4.9)


37
The total average noise is defined as

*( * )
1
total
n E m
total
e
o
o
÷ ÷
= ÷
(4.10)
Where o is the noise decay factor.
Now, ߤ = noise penalty coefficient ߰ = ݁ݔܽܿݐ ݉ܽݐܿℎ ܿ݋݂݂݁݅ܿ݅݁݊ݐ
ߣ = ݏ݁݊ݏ݁ ݏ݈݅݉݅ܽݎ݅ݐݕ ܿ݋݂݂݁݅ܿ݅݁݊ݐ ߥ = ݋ݎ݀݁ݎ ܿ݋݂݂݁݅ܿ݅݁݊ݐ
Total score

ࣁ = ࣒× ࡱ
࢚࢕࢚ࢇ࢒
+ ࣅ × ࡿ
࢚࢕࢚ࢇ࢒
+ ࣆ × ࢾ
࢚࢕࢚ࢇ࢒

+
ࣇ × ࢑
࢚࢕࢚ࢇ࢒
(4.11)

The coefficients are fine tuned depending on the type of corpus. Unlike newswire data
most of the information found on the internet is badly formatted, grammatically
incorrect and most of the time not well formed. So when web is used as the knowledge
base we use the following values of different coefficients: ߤ = 1.0, ߰ = 1.0, ߣ = 0.25 ,
ߥ = 0.125 and noise decay factor o =0.25 but when using local corpus we reduce ߤ to
0.5 and o to 0.1. Once we obtain the total score for each sentence, we sort then
according to these scores. We take top t sentences and consider the plausible answers
within them. If an answer appears with frequency f in sentence ranked r then that
answer gets a confidence score

1
(1 l ) ) ) ( n( C ans f
r
+ =
(4.12)
Again all answers are sorted according to confidence score and top 0 (=5 in our case)
answers are returned along with corresponding sentence and URL (figure 4.3).


Figure 4.3: A sample run for the question “Who performed the first human heart
transplant?”


38
Chapter5. Implementation and Results

Our question answering module is written in JAVA. Use of JAVA makes the software
cross platform and highly portable. It uses various third party APIs for NLP and text
engineering; GATE, Stanford parser, Json and Lucene API to name a few. Each module is
designed keeping in mind space and time constraints. The URL reader module is multi
threaded to keep download time at the minimum. Most of the pre-processing is done via
GATE processing pipeline. More information is provided in appendix B.



Figure 5.1: Various modules of the QnA system along with each ones basic task.
5.1 Results
The idea of building an easily accessible question answering system which uses the web
as a document collection is not new. Most of these systems are accessed via a web
browser. In the later part of the section we compare our system with other web QA
systems. The tests were performed on a small set of fifty web based questions. The
reason we did not use questions from TREC QA is that the TREC questions are now
appearing quite frequently (sometimes with correct answers) in the results of web
search engines. This could have affected the results of any web based study. For this
reason a new collection of fifty questions was assembled to serve as the test set. Also we
don’t have access to AQUAINT corpus which is the knowledgebase for TREC QA systems.
The questions within the new test set were chosen to meet the following criteria:
Multi threaded URL reader implementation
Multi threaded URL reader interface
Stopwords filter class
Computes Sense/Semantic similarity between words
Stores a generic URL along with number of attempts to read it
Trains the weighted feature vector SVM classifier
Load GATE processing resources.
Implements a standard porter stemmer
main class that handles user queries.
Uses Google and Yahoo search engine queries to build the corpus
Sense Net implementation
ArrayList of Ranked Sentences with helper methods
Weighted feature vector SVM classifier.


39
1. Each question should be an unambiguous factoid question with only one known
answer. Some of the questions chosen do have multiple answers although this is
mainly due to incorrect answers appearing in some web documents.
2. The answers to the questions should not be dependent upon the time at which
the question is asked. This explicitly excludes questions such as “Who is the
President of the US?”
These questions are provided in appendix A.
For each question in set, the table below shows the (min) rank at which answer was
obtained. In case the system fails to answer a question we show the reason it failed. Also
time spent on various tasks is shown which would help in determining the feasibility of
the system to be used in real time environment. We used top 5 documents to construct
our corpus which restricts our coverage to 64%. In a way 64% is the accuracy upper
bound of our system.

Question
No.
Answer
obtained
@ rank
Remarks Time in seconds
Document
retrieval
module
#

Pre-
processing
Answer
extraction
module
1 5 8.5 13 0.82
2 NA NE recognizer not
designed to handle
this question.
0 0 0
3 1 11 9.77 0.38
4 4 8.6 10.23 0.41
5 1 6.4 13.33 0.55
6 1 7.8 15 0.51
7 NA NE recognizer not
designed to handle
this question.
0 0 0
8 1 4.1 16.3 1.1
9 1 5.2 11.8 0.43
10 1 6.4 12.23 0.61
11 NA Question Classifier
failed
0 0 0
12 3 8.0 14.5 0.2
13 1 7.37 11.2 0.71
14 1 8.1 15.7 0.88
15 NA Incorrect Answer 6.54 13.5 0.47
16 1 6.9 11.78 0.53
17 5 6.2 17.2 0.91


40
18 1 7.1 14.63 0.42
19 2 6.99 16.1 0.54
20 1 8.2 12.31 0.45
21 NA NE recognizer not
designed to handle
this question.
0 0 0
22 1 7.66 11.9 0.61
23 NA NE recognizer not
designed to handle
this question.
0 0 0
24 NA NE recognizer not
designed to handle
this question.
0 0 0
25 1 Answer changed
recently
11.2 14.7 0.62
26 NA Incorrect Answer 5.5 8 0.23
27 NA NE recognizer not
designed to handle
this question.
0 0 0
28 1 11.7 15.1 0.58
29 1 6.9 10.67 0.43
30 1 7.9 13.83 0.67
31 1 Incorrect Answer 6.67 11.5 0.47
32 4 7.23 14.67 0.65
33 1 7.21 16.23 0.61
34 1 Incorrect Answer 6.8 11.21 0.34
35 1 7.4 12.0 0.36
36 1 8.01 14.8 0.59
37 NA Incorrect Answer 8.11 14.99 0.64
38 NA Incorrect Answer 8.23 11.01 0.34
39 1 6.77 10.2 0.41
40 NA Incorrect Answer 8.4 16.3 0.79
41 1 9.1 11.4 0.53
42 NA Incorrect Answer 6.7 8.22 0.23
43 1 7.8 14.3 0.43
44 NA Incorrect Answer 9.2 16.1 0.62
45 1 7.2 13.8 0.48
46 1 11.2 15.3 0.54
47 NA Incorrect Answer 7.1 12.67 0.38
48 NA Incorrect Answer
mainly because req.
6.99 11.11 0.29


41
answer type was
present in the query
itself
49 2 8.01 12.51 0.46
50 NA Incorrect Answer 7.67 11.02 0.33
Average time spent: 6.6 11.24 0.45
Total number of questions: 50; Number of questions answered@
- Rank 1: 26 – Accuracy 52%
- Rank 2: 28 – Accuracy 56%
- Rank 3: 29 – Accuracy 58%
- Rank 4: 31 – Accuracy 62%
- Rank 5: 32 – Accuracy 64%
Average time spent per question: 18.3 seconds
#time is dependent on network speed

Table 5.1: Performance of the system on the web question set.

As seen, most of the failures were because of the handicapped NE recognizer. The
question classifier failed in only one instance. @Rank 5 the system reached its accuracy
upper bound of 64%.
5.2 Comparisons with other Web Based QA Systems
We compare our system with four web based QA Systems – AnswerBus [18],
AnswerFinder, IONAUT [19] and PowerAnswer[20]. The consistently best performing
system at TREC forms the backbone of the PowerAnswer system from Language
Computer
1
. Unlike our system each answer is a sentence and no attempt is made to
cluster (or remove) sentences which contain the same answer. This gives undue
advantage to the system as it performs the easier task of finding relevant sentences
only. The system called AnswerBus
2
[18] behaves in much the same way as
PowerAnswer, returning full sentences containing duplicated answers. It is claimed that
AnswerBus can correctly answer 70.5% of the TREC 8 question set although we believe
the performance would decrease if exact answers were being evaluated as experience of
the TREC evaluations has shown this to be a harder task than locating answer bearing
sentences. IONAUT
3
[19] uses its own crawler to index the web with specific focus on
entities and the relationships between them in order to provide a richer base for
answering questions than the unstructured documents returned by standard search
engines. The system returns both exact answers and snippets. AnswerFinder is a client
side application that supports natural language questions and queries the Internet via
Google. It returns both exact answer and snippets. This system is the closest to ours.

1. http://www.languagecomputer.com/demos/
2. http://misshoover.si.umich.edu/˜zzheng/qa-new/
3. http://www.ionaut.com:8400


42
The questions from the web question set were presented to the five systems on the
same day, within as short a period of time as was possible, so that the underlying
document collection, in this case the web would be relatively static and hence no system
would benefit from subtle changes in the content of the collection.


Figure 9.2: Comparison of AnswerBus , AnswerFinder , IONAUT +, PowerAnswer
and our system

It is clear from the graph that our system outperforms all but AnswerFinder at rank 1.
This is quite important as the answer returned at rank 1 can be considered to be the
final answer provided by the system. At higher ranks it performs considerably better
than AnswerBus and IONAUT while performing marginally less than AnswerFinder and
PowerAnswer. The results are encouraging but it should be noted that due to the small
number of test questions it is difficult to draw firm conclusions from these experiments.
5.3 Feasibility of the system to be used in real time environment

From table 5.1 it is clear that the system cannot be used for real time purposes as of
now. An average response time of 18.3 seconds is too high. But it must be noted that

43
document retrieval time will be significantly lower for offline – local corpus. More over
the task of post processing can be done offline on the corpus as it is independent of the
query. Once the corpus is pre-processed offline, the actual task of retrieving an answer
is quite low at 0.45 seconds. We believe that if we use our own crawler and pre-process
the documents beforehand, our system can retrieve answers fast enough to be used in
real time systems. The graph below shows percentage of time spent in different tasks.


Figure 9.3: Time distribution of each module involved in QA
5.4 Conclusion
The main motivation behind the work in this thesis was to consider, where possible,
simple approaches to question answering which can be both easily understood and
would operate quickly. We observed that the performance of the system is limited by
the worst performing module of the QA system. So even if a single module fails the
whole system won’t be able to answer. In our case the NE recognizer is the weakest
link. Our NE recognizer recognizes limited sets of answer types which is not enough to
obtain a good enough overall accuracy. We employed machine learning techniques for
question classification whose performance is good enough and any further
improvements won’t be beneficial. We also proposed the Sense Net algorithm as new
way of ranking sentences and answers. Even with the limited capability of NE
recognizer the system is at par with state of the art web QA systems which confirms the
efficacy of the ranking algorithm. The time distribution of various modules shows that
the system is quite fast at the answer extraction stage, if used along with a local corpus
which is pre-processed offline it can be adapted for real time applications. Finally our
current results are encouraging but we acknowledge that due to the small number of
test questions it is difficult to draw firm conclusions from these experiments.
Document
Retrieval
36%
Pre-Processing
61%
Answer
Extraction
3%
time distribution


44
Appendix A
Small Web Based Question Set
Q001: The chihuahua dog derives it’s name from a town in which country? Ans: Mexico
Q002: What is the largest planet in our Solar System? Ans: Jupiter
Q003: In which country does the wild dog, the dingo, live? Ans: Australia or America
Q004: Where would you find budgerigars in their natural habitat? Ans: Australia
Q005: How many stomachs does a cow have? Ans: Four or one with four parts
Q006: How many legs does a lobster have? Ans: Ten
Q007: Charon is the only satellite of which planet in the solar system? Ans: Pluto
Q008: Which scientist was born in Germany in 1879, became a Swiss citizen in 1901 and
later became a US citizen in 1940? Ans: Albert Einstein
Q009: Who shared a Nobel prize in 1945 for his discovery of the antibiotic penicillin?
Ans: Alexander Fleming, Howard Florey or Ernst Chain
Q010: Who invented penicillin in 1928? Ans: Sir Alexander Fleming
Q011: How often does Haley’s comet appear? Ans: Every 76 years or every 75 years
Q012: How many teeth make up a full adult set? Ans: 32
Q013: In degrees centigrade, what is the average human body temperature? Ans: 37, 38
or 37.98
Q014: Who discovered gravitation and invented calculus? Ans: Isaac Newton
Q015: Approximately what percentage of the human body is water? Ans: 80%, 66%,
60% or 70%
Q016: What is the sixth planet from the Sun in the Solar System? Ans: Saturn
Q017: How many carats are there in pure gold? Ans: 24
Q018: How many canine teeth does a human have? Ans: Four
Q019: In which year was the US space station Skylab launched? Ans: 1973
Q020: How many noble gases are there? Ans: 6
Q021: What is the normal colour of sulphur? Ans: Yellow
Q022; Who performed the first human heart transplant? Ans: Dr Christiaan Barnard
Q023: Callisto, Europa, Ganymede and Io are 4 of the 16 moons of which planet? Ans:
Jupiter
Q024: Which planet was discovered in 1930 and has only one known satellite called
Charon? Ans: Pluto
Q025: How many satellites does the planet Uranus have? Ans: 15, 17, 18 or 21
Q026: In computing, if a byte is 8 bits, how many bits is a nibble? Ans: 4
Q027: What colour is cobalt? Ans: blue
Q028: Who became the first American to orbit the Earth in 1962 and returned to Space
in 1997? Ans: John Glenn
Q029: Who invented the light bulb? Ans: Thomas Edison


45
Q030: How many species of elephant are there in the world? Ans: 2
Q031: In 1980 which electronics company demonstrated its latest invention, the
compact disc? Ans: Philips
Q032: Who invented the television? Ans: John Logie Baird
Q033: Which famous British author wrote ”Chitty Chitty Bang Bang”? Ans: Ian Fleming
Q034: Who was the first President of America? Ans: George Washington
Q035: When was Adolf Hitler born? Ans: 1889
Q036: In what year did Adolf Hitler commit suicide? Ans: 1945
Q037: Who did Jimmy Carter succeed as President of the United States? Ans: Gerald
Ford
Q038: For how many years did the Jurassic period last? Ans: 180 million, 195 – 140
million years ago, 208 to 146 million years ago, 205 to 140 million years ago, 205 to 141
million years ago or 205 million years ago to 145 million years ago
Q039: Who was President of the USA from 1963 to 1969? Ans: Lyndon B Johnson
Q040: Who was British Prime Minister from 1974-1976? Ans: Harold Wilson
Q041: Who was British Prime Minister from 1955 to 1957? Ans: Anthony Eden
Q042: What year saw the first flying bombs drop on London? Ans: 1944
Q043: In what year was Nelson Mandela imprisoned for life? Ans: 1964
Q044: In what year was London due to host the Olympic Games, but couldn’t because of
the Second World War? Ans: 1944
Q045: In which year did colour TV transmissions begin in Britain? Ans: 1969
Q046: For how many days were US TV commercials dropped after President Kennedy’s
death as a mark of respect? Ans: 4
Q047: What nationality was the architect Robert Adam? Ans: Scottish
Q048: What nationality was the inventor Thomas Edison? Ans: American
Q049: In which country did the dance the fandango originate? Ans: Spain
Q050: By what nickname was criminal Albert De Salvo better known? Ans: The Boston
Strangler.



46
Appendix B
Implementation Details
We have used Jcreator (http://www.jcreator.com/ ) as the preferred IDE. The code uses
newer features like generics which is not compatible with any version of JAVA prior to
1.5. The following third party APIs are used:

- GATE 4.0 (A General Architecture for Text Engineering) software toolkit
originally developed at the University of Sheffield since 1995 - http://gate.ac.uk/
- Apache Lucene API is a free/open source information retrieval library, originally
created in Java by Doug Cutting - http://lucene.apache.org/
- JSON API, JSON or JavaScript Object Notation, is a lightweight computer data
interchange format. The API brings support to read JSON data. -
http://www.json.org/java/
- LibSVM A Library for Support Vector Machines by Chih-Chung Chang and Chih-
Jen Lin - http://www.csie.ntu.edu.tw/~cjlin/libsvm/
- JWNL is an API for accessing WordNet in multiple formats, as well as relationship
discovery and morphological processing -
http://sourceforge.net/projects/jwordnet/
- Stanford Log-linear Part-Of-Speech Tagger -
http://nlp.stanford.edu/software/tagger.shtml
- WordNet 2.1 is a lexical database for the English language is used to measure
sense/semantic similarity measure - http://wordnet.princeton.edu/

All experiments performed on a Core 2 Duo 1.86 GhZ System with 2GB RAM. Default
stack size may not be sufficient to run the application. Therefore stack size should be
increased to at least 512MB using –Xmx512m command line option. Some classes
present in JWNL API conflict with GATE. To resolve the issue conflicting libraries
belonging to GATE must not be included in the classpath.



47
References
[1] Miles Efron. Query expansion and dimensionality reduction: Notions of
optimality in rocchio relevance feedback and latent semantic indexing.
Information Processing & Management, 44(1):163–180, January 2008.
[2] Stephen E. Robertson, Steve Walker, Susan Jones, Micheline Hancock-Beaulieu,
and Mike Gatford. 1994. Okapi at TREC-3. In Proceedings of the 3rd Text
Retrieval Conference.
[3] Stephen E. Robertson and Steve Walker. 1999. Okapi/Keenbow at TREC-8. In
Proceedings of the 8th Text REtrieval Conference.
[4] Tom M. Mitchell. 1997. Machine Learning. Computer Science Series. McGraw-Hill.
[5] Corpora for Question Answering Task, Cognitive Computation Group at the
Department of Computer Science, University of Illinois at Urbana-Champaign.
[6] Xin Li and Dan Roth. 2002. Learning Question Classifiers. In Proceedings of the
19
th
International Conference on Computational Linguistics (COLING’02), Taipei,
Taiwan.
[7] Kadri Hacioglu and Wayne Ward. 2003. Question Classification with Support
Vector Machines and Error Correcting Codes. In Proceedings of the 2003
Conference of the North American Chapter of the Association for Computational
Linguistics on Human Language Technology (NAACL ’03), pages 28–30,
Morristown, NJ, USA.
[8] Ellen M. Voorhees. 1999. The TREC 8 Question Answering Track Report. In
Proceedings of the 8th Text REtrieval Conference.
[9] Ellen M. Voorhees. 2002. Overview of the TREC 2002 Question Answering Track.
In Proceedings of the 11th Text REtrieval Conference.
[10] Eric Breck, John D. Burger, Lisa Ferro, David House, Marc Light, and Inderjeet
Mani. 1999. A Sys Called Qanda. In Proceedings of the 8th Text REtrieval
Conference.
[11] Richard J. Cooper and Stefan M. R¨uger. 2000. A Simple Question Answering
System. In Proceedings of the 9th Text REtrieval Conference.
[12] Dell Zhang and Wee Sun Lee. 2003. Question Classification using Support Vector
Machines. In Proceedings of the 26th ACM International Conference on Research


48
and Developement in Information Retrieval (SIGIR’03), pages 26–32, Toronto,
Canada.
[13] Hao Wu, Hai Jin, and Xiaomin Ning. An approach for indexing, storing and
retrieving domain knowledge. In SAC ’07: Proceedings of the 2007 ACM
symposium on Applied computing, pages 1381–1382, New York, NY, USA, 2007.
ACM Press.
[14] Karen S. Jones, Steve Walker, and Stephen E. Robertson. A probabilistic model
of information retrieval: development and comparative experiments - part 2.
Information Processing and Management, 36(6):809–840, 2000.
[15] H. Cunningham, K. Humphreys, R. Gaizauskas, and Y. Wilks. Software
infrastructure for natural language processing, 1997.
[16] George A. Miller. 1995. WordNet: A Lexical Database. Communications of the
ACM, 38(11):39–41, November.
[17] Philip Resnik. Semantic similarity in a taxonomy: An information-based
measure and its application to problems of ambiguity in natural language.
Journal of Artificial Intelligence Research, 11:95–130, 1999.
[18] Zhiping Zheng. 2002. AnswerBus Question Answering System. In Proceedings
of the Human Language Technology Conference (HLT 2002), San Diego, CA,
March 24-27.
[19] Steven Abney, Michael Collins, and Amit Singhal. 2000. Answer Extraction. In
Proceedings of the 6th Applied Natural Language Processing Conference (ANLP
2000), pages 296–301, Seattle, Washington, USA.
[20] Dan Moldovan, Sanda Harabagiu, Roxana Girju, Paul Mor?arescu, Finley
Lacatus¸u, Adrian Novischi, Adriana Badulescu, and Orest Bolohan. 2002. LCC
Tools for Question Answering. In Proceedings of the 11th Text REtrieval
Conference.

2

Department of Electrical Engineering Indian Institute of Technology Kharagpur-721302

CERTIFICATE
This is to certify that the thesis entitled Open Domain Factoid Question Answering System is a bonafide record of authentic work carried out by Mr. Amiya Patanaik under my supervision and guidance for the fulfilment of the requirement for the award of the degree of Bachelor of Technology (Honours) at the Indian Institute of Technology, Kharagpur. The work incorporated in this has not been, to the best of my knowledge, submitted to any other University or Institute for the award of any degree or diploma.

Dr. Sudeshna Sarkar (Guide) Professor, Department of Computer Science Indian Institute of Technology – Kharagpur INDIA

Date : Place : Kharagpur

Dr. S K Das (Co-guide) Professor, Department of Electrical Engineering Indian Institute of Technology – Kharagpur INDIA

Date : Place : Kharagpur

3

Acknowledgement
I express my sincere gratitude and indebtedness to my guide, Dr. Sudeshna Sarkar under whose esteemed guidance and supervision, this work has been completed. This project work would have been impossible to carry out without her advice and support throughout. I would also like to express my heartfelt gratitude to my co-guide Dr. S. K. Das and all the professors of Electrical and Computer Science Engineering Department for all the guidance, education and necessary skill set they have endowed me with, throughout my years of graduation. Last but not the least; I would like to thank my friends for their help during the course of my work.

Date:

Amiya Patanaik 05EG1008 Department of Electrical Engineering IIT Kharagpur - 721302

4

Dedicated to my parents and friends

Google. We developed and analyzed different sentence and answer ranking algorithms. Since the early days of artificial intelligence in the 60’s.g. Question answering aims to develop techniques that can go beyond the retrieval of relevant Documents in order to return exact answers to natural language factoid questions. We assumed that all the information required to produce an answer exists in a single sentence and followed a pipelined approach towards the problem. However. the combination of web growth. Various stages in the pipeline include: automatically constructed question type analysers based on various classifier models. . such as “Who is the first woman to be in space?”. This thesis investigates a number of techniques for performing open-domain factoid question answering. Our system currently supports document retrieval from Google and Yahoo via their public search engine application programming interfaces (APIs). The thesis also presents a feasibility analysis of our system to be used in real time QA applications. factual questions such as “who was the first American in space?” or “what is the second tallest mountain in the world?” Yet today’s most advanced web search services (e. Answering natural language questions requires more complex processing of text than employed by current information retrieval systems.5 ABSTRACT A question answering (QA) system provides direct answers to user questions by consulting its knowledge base. the difficulty of natural language processing (NLP) has limited the scope of QA to domain-specific expert systems.. document retrieval. improvements in information technology. We have developed an architecture that augments existing search engines so that they support natural language question answering and is also capable of supporting local corpus as a knowledge base. and “When was first world war fought?”. and the explosive demand for better information access has reignited the interest in QA systems. part of speech (POS) tags and sense similarity metrics. starting with simple ones that employ surface matching text patterns to more complicated ones using root words. MSN live search and AskJeeves) make it surprisingly tedious to locate answers to such questions. In recent years. phrase extraction. “Which is the largest city in India?”. The wealth of information on the web makes it an attractive resource for seeking quick answers to simple. researchers have been fascinated with answering natural language questions. sentence and answer ranking. Yahoo. passage extraction.

2 Mean Reciprocal Rank 1.2 Architecture 1.1 Question Classes 2.5 Answer extraction 1.2 Question processing 1.3 Confidence Weighted Score 1.6 Answer formulation 1.5 Traditional Metrics – Recall and Precision Chapter2: Question Analysis 2.6 Evaluating QA Systems 1.1.4.6.4.3 Question answering methods 1.2 Manually Constructed rules for question classification 2 3 4 5 6 8 9 9 10 11 11 11 12 12 13 13 13 13 13 14 14 14 14 14 15 15 16 16 16 17 17 19 19 19 20 .5 A generic framework for QA 1.4.6 Contents CERTIFICATE ACKNOWLEDGEMENT DEDICATION ABSTRACT CONTENTS LIST OF FIGURES AND TABLES Chapter 1: Introduction 1.1.9 Interactive QA 1.4.4.1 Shallow 1.8 Multi-lingual (or cross-lingual) question answering 1.4.11 User profiling for QA 1.10 Advanced reasoning for QA 1.4.1 Determining the Expected Answer Type 2.1 History of Question Answering Systems 1.4 Issues 1.3.6.2 Deep 1.4.6.4.4.4 Accuracy and coverage 1.7 Real time question answering 1.4.6.6.3 Context and QA 1.1 End-to-End Evaluation 1.1 Question classes 1.3.4 Data sources for QA 1.

10 Experiment Results 2. Document Retrieval 3.1.1.1.1.1 How many documents to retrieve? Chapter4.3 IDF Information Theoretic Interpretation 3.2 Information retrieval from the web 3.1 WordNet 4.7 Datasets 2.5 Kernel Trick 2.6 Naive Bayes Classifier 2.8 Features 2.4 Conclusion APPENDIX A : Web Based Question Set APPENDIX B : Implementation Details REFERENCES 20 21 22 22 24 24 25 26 27 28 29 29 29 29 30 30 31 34 34 34 35 36 38 38 41 42 43 44 46 47 .1 Stop word for IR query formulation Chapter3.1.1 Sentence Ranking 4.2 Comparisons with other Web Based QA Systems 5. Answer Extraction 4.3 Feasibility of the system to be used in real time environment 5.2.2 Sense/Semantic Similarity between words 4.1 Retrieval from local corpus 3.2 Sense Net ranking algorithm Chapter5.9 Entropy and Weighted Feature Vector 2.1.4 Support Vector Machines 2.2.1.1.2 Query Formulation 2.1 Results 5.1.1.1. Implementation and Results 5.1.1.1 Ranking function 3.3 Fully Automatically Constructed Classifiers 2.1.2 Okapi BM25 3.7 2.

1: The kernel trick Fig.4.1: Various modules of the QA system along with each ones basic task Fig.3.2.1 Coarse and fine grained question categories.2: A sense network formed between a sentence and a query Fig. Fig.2.3: A sample run for the question “Who performed the first human heart transplant?” Fig.1: Fragment of WordNet taxonomy Fig.1: Performance of the system on the web question set 20 28 32 39-41 .1: performance of various query expansion modules implemented on Lucene.5. Fig.9.3.2.1: %coverage and average processing time at different ranks Table 5.2: %coverage vs rank Fig.2: Various feature sets extracted from a given question and its corresponding part of speech tags. Fig.1: A generic framework for question answering Fig. Table 1.2.3: %coverage vs.3: Time distribution of each module involved in QA 15 18 22 24 26 27 31 32 33 35 36 37 38 42 43 Tables PageNo. Table 2.4: JAVA Question Classifier Fig. Table 3.1.1.4.4.1: Document retrieval framework Fig. average processing time Fig.2: Sections of a document collection as used for IR evaluation.3.9.2: Comparison with other web based QA systems Fig.8 List of figures and tables Figures PageNo.3: Question type classifier performance Fig.

Some of the early AI systems included question-answering abilities. In fact. thus natural language search engines are sometimes regarded as the next step beyond current search engines. Two of the most famous early systems are SHRDLU and ELIZA. How.9 Chapter1. Both QA systems were very effective in their chosen domains. closed-domain might refer to a situation where only limited types of questions are accepted. On the other hand. * Open-domain question answering deals with questions about nearly everything. definition. to compiled newswire reports. to internal organization documents.1 History of Question Answering Systems Some of the early AI systems were question answering systems. medicine or automotive maintenance). to the World Wide Web. QA research attempts to deal with a wide range of question types including: fact. Why. BASEBALL answered questions about the US baseball league over a period of one year. * Closed-domain question answering deals with questions under a specific domain (for example. SHRDLU simulated the operation of a robot in a toy world (the "blocks world"). both of which were developed in the 1960s. Two of the most famous QA systems of that time are BASEBALL and LUNAR. hypothetical. such as questions asking for descriptive rather than procedural information. question answering (QA) is the task of automatically answering a question posed in natural language. these systems usually have much more data available from which to extract the answer. Search collections vary from small local document collections. and cross-lingual questions. and can be seen as an easier task because NLP systems can exploit domain-specific knowledge frequently formalized in ontologies. 1. (Alternatively. answered questions about the geological analysis of rocks returned by the Apollo moon missions. LUNAR. LUNAR was demonstrated at a lunar science convention in 1971 and it was able to answer 90% of the questions in its domain posed by people untrained on the system. The common feature of all these systems is that they had a core database or knowledge system that was hand-written by experts of the chosen domain. and it offered the possibility to ask the robot . To find the answer to a question.) QA is regarded as requiring more complex natural language processing (NLP) techniques than other types of information retrieval such as document retrieval. and can only rely on general ontologies and world knowledge. list. Further restricted-domain QA systems were developed in the following years. semantically-constrained. Introduction In information retrieval. a QA computer program may use either a pre-structured database or a collection of natural language documents (a text corpus such as the World Wide Web or some local collection). in turn.

2 Architecture The first QA systems were developed in the 1960s and they were basically naturallanguage interfaces to expert systems that were tailored to specific domains. Systems participating in this competition were expected to answer questions on any topic by searching a corpus of text that varied from year to year. Real-life data is inherently noisy as people are less careful when writing in spontaneous media like blogs. The introduction of noisy text moved the question answering to a more realistic setting. In earlier years the TREC data corpus consisted of only newswire data that was very clean. but they helped the development of theories on computational linguistics and reasoning. One example of such a system was the Unix Consultant (UC). and on its own it lead to a series of chatter bots such as the ones that participate in the annual Loebner prize. a text-understanding system that operated on the domain of tourism information in a German city. and Google and Microsoft have started to integrate question-answering facilities in their search engines. In the late 1990s the annual Text Retrieval Conference (TREC) included a questionanswering track which has been running until the present. In contrast. An increasing number of systems include the World Wide Web as one more corpus of text. ELIZA was able to converse on any topic by resorting to very simple rules that detected important words in the person's input. and it aimed at phrasing the answer to accommodate various types of users. Current QA systems typically include a question classifier module that determines the type of question and the type of answer.com is an early example of such a system. 1. The blog data corpus contained both "clean" English as well as noisy text that include badlyformed English and spam. Ask. which led to the development of ambitious projects in text comprehension and question answering. In 2007 the annual TREC included a blog data corpus for question answering. Currently there is an increasing interest in the integration of question answering with web search. The 1970s and 1980s saw the development of comprehensive theories in computational linguistics. The system had a comprehensive hand-crafted knowledge base of its domain. the system . simulated a conversation with a psychologist. current QA systems use text documents as their underlying knowledge source and combine various natural language processing techniques to search for the answers. in contrast. ELIZA. After the question is analyzed. Another project was LILOG. The systems developed in the UC and LILOG projects never went past the stage of simple demonstrations. the strength of this system was the choice of a very specific domain and a very simple world with rules of physics that were easy to encode in a computer program. It had a very rudimentary way to answer questions. This competition fostered research and development in open-domain text-based question answering. The best system of the 2004 competition achieved 77% correct fact-based questions. Again. One can only expect to see an even tighter integration in the near future.10 questions about the state of the world. a system that answered questions pertaining to the Unix operating system.

3.11 typically uses several modules that apply increasingly complex NLP techniques on a gradually reduced amount of text. This often works well on simple "factoid" questions seeking factual tidbits of information such as names. Finally. If you posed the question "What is a dog?". such as the web.3. logical inferences (abduction) . leading to two benefits: (1) By having the right information appear in many forms. (2) Correct answers can be filtered from false positives by relying on the correct answer to appear more times in the documents than instances of incorrect ones. the system would detect the substring "What is a X" and look for documents which start with "X is a Y". 1. Thus. in the cases where simple question reformulation or keyword techniques will not suffice.for without documents containing the answer. For example. logic form transformation. there is little any QA system can do. a document retrieval module uses search engines to identify the documents or paragraphs in the document set that are likely to contain the answer. 1. Subsequently a filter preselects small text fragments that contain strings of the same type as the expected answer. locations. co reference resolution. relation detection. if the question is "Who invented Penicillin" the filter returns text that contain names of people. syntactic alternations. the burden on the QA system to perform complex NLP techniques to understand the text is lessened.2 Deep However.1 Shallow Some methods of QA use keyword-based techniques to locate interesting passages and sentences from the retrieved documents and then filter based on the presence of the desired answer type within that candidate text. unless the question domain is orthogonal to the collection. means that nuggets of information are likely to be phrased in many different ways in differing contexts and documents. Ranking is then done based on syntactic features such as word order or location and similarity to query. When using massive collections with good data redundancy. word sense disambiguation. some systems use templates to find the final answer in the hope that the answer is just a reformulation of the question. 1. It thus makes sense that larger collection sizes generally lend well to better QA performance. dates. more sophisticated syntactic. The notion of data redundancy in massive collections. an answer extraction module looks for further clues in the text to determine if the answer candidate can indeed answer the question. semantic and contextual processing must be performed to extract or construct the answer. These techniques might include namedentity recognition. and quantities.3 Question answering methods QA is very dependent on a good search corpus .

or the Suggested Upper Merged Ontology (SUMO) to augment the available reasoning resources through semantic connections and definitions. Example. it is only a starting point with endless possibilities. . badly-worded or ambiguous questions will all need these types of deeper understanding of the question. is also growing in popularity in the research community. 1.4 Issues In 2002 a group of researchers wrote a roadmap of research in question answering. AQ (Answer Questioning) Methodology. The following issues were identified. sentence boundary detection. introduces a working cycle to the QA methods. supposedly there is only one true answer in reality everything else is perception or plausibility. Utilized alongside other forms of communication. hypothetical postulations. 1(Q) = ((∞(A)-∞) = 1(A). such as part-of-speech tagging. however. 1. are already available as probabilistic applications. Complex or ambiguous document passages likewise need more NLP techniques applied to understand the text.1 Question classes Different types of questions require the use of different strategies to find the answer. A = ∞(Q).12 and commonsense reasoning. this may unveil different methods of thinking and perception as well. spatially or temporally constrained questions. parsing." Q"(Why do) I like sushi(?)" A"The flavor. debate may be greatly improved. The means by which it is utilized can be manipulated beyond its primary usage. The QA methodology utilizes just the opposite where. More difficult queries such as Why or How questions.4. the answer may yield any number of questions to be asked. thereby unveiling an ongoing process constantly being reborn into the research being performed. AQ Method may be used upon perception of a posed question or answer. This method may be used in conjunction with any of the known or newly founded methods. Many of the lowerlevel NLP tools used. dialog queries. and document retrieval. Even this methodology should be questioned. A"I like sushi. which introduces statistical question processing and answer extraction modules. These systems will also very often utilize world knowledge that can be found in ontologies such as WordNet. the primary usage is taking an answer and questioning it turning that very answer into a question. Statistical QA. Question classes are arranged hierarchically in taxonomies. temporal or spatial reasoning and so on." Q"(What about) the flavor of sushi (do) I like?" Inadvertently. Any number of question methods may be used to derive the number of WHY as in. named-entity detection. While most would agree that this seems to be the end-all stratagem.

resolve ambiguities or keep track of an investigation performed through a series of questions. etc) or a date (e.3 Context and QA Questions are usually asked within a context and answers are provided within that specific context.4. would identify ambiguities and treat them in context or by interactive clarification. The context can be used to clarify a question. the presentation of the answer may require the use of fusion techniques that combine the partial answers from multiple documents. length.g. on the search method and on the question focus and context.some interrogative.4. no matter how well we perform question processing.6 Answer formulation The result of a QA system should be presented in a way as natural as possible. This model would enable the translation of a complex question into a series of simpler questions. organization. the answer to the question "On what day did Christmas fall in 1989?") the extraction of a single datum is sufficient.4. 1. on the answer type provided by question processing. regardless of the speech act or of the words. For other cases.13 1. one that would recognize equivalent questions. simple extraction is sufficient. In some cases. distance. on the actual data where the answer is searched. syntactic inter-relations or idiomatic forms.4.4. A semantic model of question understanding and processing is needed. 1.5 Answer extraction Answer extraction depends on the complexity of the question. 1. For example. size. If the answer to a question is not present in the data sources. research for answer processing should be tackled with a lot of care and given special importance.2 Question processing The same information request can be expressed in various ways . we shall not obtain a correct result. a quantity (monetary value.4 Data sources for QA Before a question can be answered. . etc). some assertive. Given that answer processing depends on such a large number of factors. it must be known what knowledge sources are available. shop or disease. retrieval and extraction of the answer. 1. when the question classification indicates that the answer type is a name (of a person.

but (s)he might want to have a dialogue with the system. To upgrade a QA system with such capabilities.7 Real time question answering There is need for developing Q&A systems that are capable of extracting answers from large data sets in several seconds. This allows users to consult information that they cannot use directly. the size and multitude of the data sources or the ambiguity of the question.4.9 Interactive QA It is often the case that the information need is not well captured by a QA system. 1. comprising context data.14 1. 1. where each template slot represents a different profile feature.4. 1. the questioner might want not only to reformulate the question.4. we need to integrate reasoning components operating on a variety of knowledge bases. domain of interest. See also machine translation. reasoning schemes frequently used by the questioner.4. . 1. regardless of the complexity of the question. as the question processing part may fail to classify properly the question or the information needed for extracting and generating the answer is not easily retrieved. common ground established within different dialogues between the system and the user etc.8 Multi-lingual (or cross-lingual) question answering The ability to answer a question posed in one language using an answer corpus in another language (or even several).11 User profiling for QA The user profile captures data about the questioner. Profile templates may be nested one within another. The profile may be represented as a predefined template.4. In such cases. encoding world knowledge and common-sense reasoning mechanisms as well as knowledge specific to a variety of domains.10 Advanced reasoning for QA More sophisticated questioners expect answers which are outside the scope of written texts or structured databases.

15 1.1. document or passage retrieval and finally 3. answer extraction. A rather impractical and tedious way of doing this could be to manually search an entire collection of text and mark the . unfortunately with most of the natural language tasks there is no single answer.1: A generic framework for question answering. For example the question analysis component is usually responsible for generating an IR query from the natural language question which can then be used by the document retrieval component to select a subset of the available documents. It is always easier to evaluate when there is a clearly defined answer.6 Evaluating QA Systems Evaluation is a highly subjective matter when dealing with NLP problems. If. 2. a three component architecture describes the approach taken to building QA systems in the wider literature. Question Question Analysis Corpus or document collection Document Retrieval Top n text segments or sentences Answer Extraction Answers Fig. question analysis. however. 1.5 A generic framework for QA The majority of current question answering systems designed to answer factoid questions consist of three distinct components: 1. an approach to document retrieval requires some form of iterative process to select good quality documents which involves modifying the IR query. While these basic components can be further subdivided into smaller components like query formation and document pre-processing. It should be noted that while the three components address completely separate aspects of question answering it is often difficult to know where to place the boundary of each individual component. then it is difficult to decide if the modification should be classed as part of the question analysis or document retrieval process.

6. MRR is then given by: MRR ¦r i 1 |Q| 1 i |Q| (1. the most important of which are that x systems are given no credit for retrieving multiple (different) correct answers and x As the task required each system to return at least one answer per question. But this is not possible even for the smallest of document collections and with the size of corpuses like AQUAINT with approximately 1. no credit was given to systems for determining that they did not know or could not locate an appropriate answer to a question. Following are definitions of numerous metrics for evaluating factoid questions. So a widely accepted metric is required to evaluate the performance of our system and compare it with other existing systems. 1. Under this evaluation metric a system returns a single answer for each question.000 articles it is next to impossible. Most of the recent large scale QA evaluations have taken place as part of the TREC conferences and hence the evaluation metrics used have been extensively studied and is used in this study. Evaluating descriptive questions is much more difficult than factoids.00. MRR provides a method for scoring systems which return multiple competing answers per question.16 relevant documents.3 Confidence Weighted Score Following the shortcomings of MRR as an evaluation metric a new evaluation metric was chosen as the new evaluation metric [9]. Let Q be the question collection and ri the rank of the first correct answer to question i or 0 if no correct answer is returned.6.1) As useful as MRR was as an evaluation metric for the early TREC QA evaluations it does have a number of drawbacks [8]. . These answers are then sorted before evaluation so that the answer which the system has most confidence in is placed first.6. 1.1 End-to-End Evaluation Almost every QA system is concerned with the final answer.2 Mean Reciprocal Rank The original evaluation metric used in the QA tracks of TREC 8 and 9 was mean reciprocal rank (MRR). 1. Then the queries can be used to make an evaluation based on precision and recall.

2: ¦ CWS i 1 |Q| no.n be the n top-ranked documents (or passages) in D retrieved by an IR system S (figure 1. AD . The main issue with CWS is that it is difficult to get an intuitive understanding of the performance of a QA system given a CWS score as it does not relate directly to the number of questions the system was capable of answering. 1.q z I}| |Q| (1.q .4 Accuracy and coverage Accuracy of a QA system is a simple evaluation metric with direct correspondence to number of correct answers.q .6. then The recall of an IR system S at rank n for a query q is the fraction of the relevant documents AD . Let CD . which have been retrieved: .2) CWS therefore rewards systems which can not only provide correct exact answers to questions but which can also recognise how likely an answer is to be correct and hence place it early in the sorted list of answers. D.q . n) S S |{q‹Q | FD .3) Similarly The coverage of a retrieval system S for a question set Q and document collection D at rank n is the fraction of the questions for which at least one relevant document is found within the top n documents: coverage (Q.q the subset of D which contains relevant S documents for a query q and RD .5 Traditional Metrics – Recall and Precision The standard evaluation measures for IR systems are precision and recall.4) 1.n ˆ CD .2).q .q be the correct answers for question q known to be S contained in the document collection D and FD .q z I}| |Q| (1.n ˆ AD .17 The last answer evaluated will therefore be the one the system has least confidence in. Given this ordering CWS is formally defined in Equation 1. D. of correct in first i answers i |Q| (1.6.q . n) S S |{q‹Q | RD .n be the first n answers found by system S for question q from D then accuracy is defined as: accuracy (Q. Let D be the document (or passage collection).

5) The precision of an IR system S at rank n for a query q is the fraction of the retrieved S documents RD . .n | (1.n ˆ AD . n) S | AD .q .n that are relevant: precision S ( D. Clearly given the size of the collections over which QA systems are being operated this is not a feasible proposition.q . q.q | (1.q Retrieved Documents S RD .2: Sections of a document collection as used for IR evaluation.q .n Figure 1. The only accurate way to determine which documents are relevant to a query is to read every single document in the collection and determine its relevance.q | recall ( D.q | S | RD .q . Document Collection/Corpus Relevant Documents AD .q . q. It must be kept in mind that just because a relevant document is found does not automatically mean the QA system will be able to identify and extract a correct answer. AD . n) S | RD . namely determining the set of relevant documents within a collection for a given query. Therefore it is better to use recall and precision at the document retrieval stage rather than for the complete system.6) Clearly given a set of queries Q average recall and precision values can be calculated to give a more representative evaluation of a specific IR system.n ˆ AD .q . Unfortunately these evaluation metrics although well founded and used throughout the IR community suffer from two problems when used in conjunction with the large document collections utilized by QA systems.18 S | RD .

The main characteristic of this model is assigning more reasonable weight to features: these weights can be used to differentiate features from each other according to their contribution to question classification. Results show that with this new feature-weighting model the SVM-based classifier outperforms the one without it to a large extent. Not only is the question analysis component responsible for determining the expected answer type and for constructing an appropriate query for use by an IR engine but any mistakes made at this point are likely to render useless any further processing of a question. Most question answering systems use a coarse grained category definition. which contains 6 coarse grained categories and 50 fine grained categories. Each coarse grained category contains a non-overlapping set of fine grained categories. . The problem of question type classification can be solved by constructing manual rules or if we have access to large set of annotated-pre classified questions. If the expected answer type is incorrectly determined then it is highly unlikely that the system will be able to return a correct answer as most systems constrain possible answers to only those of the expected answer type. Usually the number of question categories is less than 20.1 Question Classes We follow the two-layered question taxonomy. Further. we propose to use features initially just as bag of words and later on both as a bag of words and feature called as partitioned feature model.1. Question Analysis As the first component in a QA system it could easily be argued that question analysis is the most important part. 2. 2. it is obvious that a fine grained category definition is more beneficial in locating and verifying the plausible answers. Determining the expected answer type for a question implies the existence of a fixed set of answer types which can be assigned to each new question. However. using machine learning approaches. We have employed a machine learning model in our system which employs a feature-weighting model which assigns different weights to features instead of simple binary values. as shown in Table 1. In a similar way a poorly formed IR query may result in no answer bearing documents being retrieved and hence no amount of further processing by an answer extraction component will lead to a correct answer being found.19 Chapter2.1 Determining the Expected Answer Type In most QA systems the first stage in processing a previously unseen question is to determine the semantic type of the expected answer.

letter. disease/medical.[11].20 Table 1. vehicle. A number of different automatic approaches to question classification have been reported which make use of one or more machine learning algorithms [6][7][12] including nearest neighbour (NN) [4]. event. product. substance. description. manner. group. date. currency. word HUM LOC NUM description. reason animal. In our system we employed a SVM and Naive Bayes classifier on different feature sets extracted from the question. symbol. This approach allows a simple low coverage classifier to be rapidly developed without requiring a large amount of hand labelled training data. instrument. language. distance. sport. A number of systems have taken this approach. weight 2. term.1 Coarse and fine grained question categories. size. individual. count. period. body. title city. temperature. One possible approach for manually constructing rules for such a classifier would be to define rule formalism that whilst retaining the relative simplicity of regular expressions would give access to a richer set of features. 2. expansion definition. decision trees (DT) and support vector machines (SVM)[7][12] to induce a classifier. technique. While these approaches work well for some questions (for instance questions asking for a date of birth can be reliably recognised using approximately six well constructed regular expressions) they often require the examination of a vast number of questions and tend to rely purely on the text of the question. . An alternative solution to this problem is to develop an automatic approach to constructing a question classifier using (possibly hand labelled) training data. many creating sets of regular expressions which only questions with the same answer type [10].2 Manually Constructed rules for question classification Often the easiest approach to question classification is a set of manually constructed rules. country.1. color. Coarse ABBR DESC ENTY Fine abbreviation. As we had access to large set of pre annotated question samples we have not used this method. order.3 Fully Automatically Constructed Classifiers As mentioned in the previous section building a set of classification rules to perform accurate question classification by hand is both a tedious and time-consuming task. mountain. food. creation. plant. percent. speed. state code. money. other. religion.1. other. other.

a set of points of the form D= ( . since in general the larger the margin the lower the generalization error of the classifier. indicating the class to which the point belongs. The parameter / determines the offset of the hyperplane from the origin along the normal vector w. {−1. we find the distance between these two hyperplanes is 2/ . Viewing input data as two sets of vectors in an ndimensional space. we add the following constraint: for each i either w⋅ and w⋅ This can be rewritten as: − ≤ −1 (2. so we want to minimize .6) − ≥1 (2. one on each side of the separating hyper-plane.2) where denotes the dot product.21 2. an SVM will construct a separating hyper-plane in that space. These hyperplanes can be described by the equations w⋅ and w ⋅ − = −1 (2. one which maximizes the margin between the two data sets.4) Note that if the training data are linearly separable. We are given some training data. We want to choose the w and b to maximize the margin. The vector w is a normal vector: it is perpendicular to the hyperplane. two parallel hyperplanes are constructed.1.5) − =1 (2. Any hyperplane can be written as the set of points satisfying w⋅ − =0 (2.4 Support Vector Machines Support vector machines (SVMs) are a set of related supervised learning methods used for classification and regression. which are "pushed up against" the two data sets. we can select the two hyperplanes of the margin in a way that there are no points between them and then try to maximize their distance. By using geometry. Each is a p-dimensional real vector. As we also have to prevent data points falling into the margin. )| . or distance between the parallel hyperplanes that are as far apart as possible while still separating the data. To calculate the margin. We want to give the maximum-margin hyperplane which divides the points having = 1 from those having = − 1.3) . Intuitively.1) where the is either 1 or −1.1} =1 (2. a good separation is achieved by the hyper-plane that has the largest distance to the neighboring data points of both classes.

Figure 2.x22) maps data in R2 into R3. we also tried Naïve Bayes Classifier[6].1.w) is non-linear. n) (w ⋅ − ) ≥1 (2. The map.…….7) We can put this together to get the optimization problem: Minimize in ( . A naive Bayes classifier is a term in Bayesian statistics dealing with a simple probabilistic classifier based on . 2.22 (w ⋅ − ) ≥ 1 . K(x. The feature space is assumed to be a Hilbert space of real valued functions defined on X .w) to be the inner product not of the coordinate vectors x and w in Rd but of vectors I x) and I(w) in higher dimensions. ) subject to (for any i = 1. The following picture illustrates a particularly simple example where the feature map I(x1. after transformation the data is linearly separable.. I: X o H is called a feature map from the data space X into the feature space H .—2x1x2.1: The kernel trick.9) d when and the set of x  on that boundary becomes a curved surface embedded in R the function K(x.5 Kernel Trick If instead of the Euclidean inner product w ⋅ one fed the QP solver with a function K(w .x2)=(x12. The data space is often Rd but most of the interesting results hold when X is a compact Riemannian manifold. Rd Consider K(x. for all 1≤ ≤ (2.w) + b = 0 (2.6 Na ve Bayes Classifier Along with SVM. ) the boundary between the two classes would then be.1.8) 2.

. The problem is that if the number of features n is large or when a feature can take on a large number of values. the words or features of a given question may are assumed to be independent to simplify mathematical complexities. the probability model for a classifier is a conditional model p(C| F_1.F_n| C. The numerator is equivalent to the joint probability model p(C. F_1) . We therefore reformulate the model to make it more tractable. For example.……. _ ) (2..23 applying Bayes' theorem with strong (naive) independence assumptions... F_1. Depending on the precise nature of the probability model.. a naive Bayes classifier considers all of these properties to independently contribute to the probability that this fruit is an apple.……. F_n) = p(C) p(F_1. Using Bayes' theorem. A more descriptive term for the underlying probability model would be "independent feature model". one can work with the naive Bayes model without believing in Bayesian probability or using any Bayesian methods.. then basing such a model on probability tables is infeasible.F_n| C) = p(C) p(F_1| C) p(F_2.. so that the denominator is effectively constant. which can be rewritten as follows.F_n) = ( ) ( ( .. we write p(C | F_1..……….. using repeated applications of the definition of conditional probability: p(C. Even though these features depend on the existence of the other features. F_1. a naive Bayes classifier assumes that the presence (or absence) of a particular feature of a class is unrelated to the presence (or absence) of any other feature. since the denominator does not depend on C and the values of the features F_i are given.F_n). _ | ) . conditional on several feature variables F_1 through F_n. F_n).…….10) In plain English the above equation can be written as posterior = (prior*likelihood)/evidence (2. ………. in other words. In many practical applications. naive Bayes classifiers can be trained very efficiently in a supervised learning setting. over a dependent class variable C with a small number of outcomes or classes. parameter estimation for naive Bayes models uses the method of maximum likelihood.. Abstractly.. In simple terms.11) In practice we are only interested in the numerator of that fraction. ……….……..

F_1) p(F_3| C. Figure 2. Cognitive Computation Group at the Department of Computer Science.. F_1) p(F_3| C.F_n| C... F_1.F_{n-1}) (2.. The TREC QA data is hand labelled by us..8 Features For each question.F_n| C. F_2.500 labelled questions randomly divided into 5 training datasets of sizes 1.... 2.. F_j) = p(F_i | C).. p(F_n| C.000. 3..14) 2.12) and so forth..500 respectively. F_1... F_1...2: Various feature sets extracted from the given question and its corresponding part of speech tags...000. Every question is represented as feature vectors.24 = p(C) p(F_1| C) p(F_2| C.000. There are about 5. F_1.. . . F_2) .1. (2. 4. This means that p(F_i | C..7 Datasets We used the publicly available training and testing datasets provided by Tagged Question Corpus. F_1) p(F_3.. F_2) = p(C) p(F_1| C) p(F_2| C.. . F_n) = p(C) p(F_1| C) p(F_2| C) p(F_3| C) ……. The following example demonstrated different feature sets considered for a given question and its POS parse. University of Illinois at Urbana-Champaign (UIUC) [5]. F_2. 2. Now the "naive" conditional independence assumptions come into play: assume that each feature Fi is conditionally independent of every other feature Fj for j i.... F_2) p(F_4. F_3) = p(C) p(F_1| C) p(F_2| C. F_1... F_3...... The testing dataset contains 2000 labelled questions from the TREC QA track. the weight associated with each word varies between 0 and 1.. All these datasets have been manually labelled by UIUC [5] according to the coarse and fine grained categories in Table 1.1.13) and so the joint model can be expressed as p(C. p(F_n| C) = p(C)∏ =1 p(F_i | C) (2.1... F_1..000 and 5.. we extract two kinds of features: bag-of-words or a mix of POS tags and words.

We use the convention that 0 log 0 = 0. . From the viewpoint of representation. Without loss of generality. that is to say.16) The larger the confusion of a word is. each Ci is the same as a document because both of which are just a collection of words. certain preprocessing is needed.17) Weight of word i is opposite to its entropy: the larger the entropy of word i is. which is easily justified since xlogx → 0 as x → 0. the more uncertain the random variable X is. Therefore we can also use the idea of entropy to evaluate word’s importance. ni be the total number of occurrences of word i in all questions. x ∈ A be the probability function.1. The confusion achieves maximum value log(N) if the word is evenly distributed over all documents. . Note that if a word occurs in only one set.9 Entropy and Weighted Feature Vector In information theory the concept of entropy is used as a measure of the uncertainty of a random variable. then the confusion (or entropy) of word i can be measured as follows: H(i) = Σ(t=1 to N) [fit ni · log(nifit)] (2. N the number of total documents in the collection.15) The larger the entropy H(X) is. Ci is a set of words extracted from questions of type i. and minimum value 0 if the word occurs in only one document. Consequently. In information retrieval many methods have been applied to evaluate term’s relevance to documents. ni the total number of occurrences of word i in document collection.N}. then ai is defined as: ai =(1 +1/log(N)Σ(t=1 to N)[fitni·log(fitni)]) (2. for other sets fik is 0. Let ai be the weight of word i. . In other words. the smaller weight is associated with word i. . based on information theoretic ideas. is proved the most effective and sophisticated. ai get the maximum value of 1 if word i occurs in only one set of question type. . and the minimum value of 0 if the word is evenly distributed over all sets. Ci represents a word collection similar to documents. among which entropy-weighting. fit be the frequency of word i in Ct.25 2. Let X be a discrete random variable with respect to alphabet A and p(x) = Pr(X = x). it is denoted by C = {1. Let fit be the frequency of word i in document t. the less important it is. Let C be the set of question types. the less important to question classification it is. Keeping this in mind to calculate the entropy of a word. then the entropy H(X) of the discrete random variable X is defined as: H(x) = −Σx∈A {p(x) log p(x)} (2.

Naïve Bayes Classifier* using Partitioned feature set (69% accurate on TREC data).1. We implemented weighted feature set SVM classifier into a cross platform standalone desktop application (shown below). Support Vector Machine Classifier using Bag of Words feature set (78% accurate on TREC data). Therefore a baseline (random) classifier is (1/50) = 2% accurate. University of Illinois at Urbana-Champaign. simple add one smoothing outperformed it.10 Experiment Results We tested various algorithms for question classification. Some sample test runs Q: What was the name of the first Russian astronaut to do a spacewalk? Response: HUM -> IND(an Individual) Q: How much folic acid should an expectant mother get daily? Response: NUM -> COUNT . Support Vector Machine Classifier using Weighted feature set (85% accurate on TREC data) It must be noted that the classifiers were NOT trained on TREC data. While Witten-Bell smoothing worked well. The classifier classified questions into six broad classes and fifty coarse classes. We employed various smoothing techniques to Naive Bayes Classifier. The accuracy reported here are for Naive Bayes Classifier employing add one smoothing. The application will be made available to public for evaluation.26 2. The performance without smoothing was too low and not worth mentioning. Naïve Bayes Classifier* using Bag of Words feature set (64% accurate on TREC data). 90 80 70 60 50 40 30 20 10 0 Baseline Classifier Naïve Bayes Classifier using Bag of Words feature Naïve Bayes Classifier using Partitioned feature SVM Classifier using Bag of Words feature SVM Classifier using Weighted feature set Chart Showing accuracy of classifier Figure 2. Training was done on a set of 12788 questions provided by Cognitive Computation Group at the Department of Computer Science.3: Classifiers were tested on a set of 2000 TREC questions.

In our system when using web as document collection. The design of the query expansion module should be such as to maintain the right balance between recall and precision. For large corpus. while other systems go for a query expansion.in/projects. When a web corpus is not available we employ Rocchio Query Expansion [1] method which is implemented in lucene query expansion module.cybergeeks.27 Q: What is Francis Scott Key best known for? Response: DESC -> DESC Q: What state has the most Indians? Response: LOC -> STATE Q: Name a flying mammal. The table below shows performance of various query expansion modules implemented on Lucene. Most QA systems start constructing an IR query simply by assuming that the question itself is a valid IR query. query expansion may not be necessary as even with not so well formed query recall is sufficient to extract the right answer and query expansion may in fact reduce precision. can be downloaded for evaluation from http://www.2 Query Formulation The question analysis component of a QA system is usually responsible for formulating a query from a natural language questions to maximise the performance of the IR engine used by the document retrieval component of the QA system.4: JAVA Question Classifier. The test is carried out on data from NIST TREC Robust Retrieval Track 2004 . we pass on the question as IR query after masking the stop words.co. Response: ENTITY -> ANIMAL Figure 2. But in case of a local small corpus query expansion may be necessary.php?id=10 2.

percentage of topics with no relevant in the top 10 retrieved Lucene QE .mean average precision P10 .2322 Lucene 0.37 %no 18. KB-R-FIS gQE – My Fuzzy Inference System that utilized Rocchio’s query expansion along with Google.1 Stop word for IR query formulation Stop words or noise words are words which appear with a high frequency and are considered to be insignificant for normal IR processes. 2.3936 0.10% 14% 14% 15% MAP Lucene QE 0. although because of the proprietary nature their working is unknown and unpredictable. most of these lists are very similar).1: performance of various query expansion modules implemented on Lucene. Table 2. Unfortunately when it comes to QA systems high frequency of a word in a collection may not always suggest that it is insignificant in retrieving the answer.lucene with local query expansion Lucene gQE – Lucene system that utilized Rocchio’s query expansion along with Google.2332 KB-R-FIS gQE 0.2433 Lucene gQE 0. Therefore we manually analyzed 100 TREC QA track questions and prepared a list of stop words. . For example the word “first” is widely considered to be a stop word but is very important when appears in the question “Who was the first President of India?”.3984 0.average of precision at 10 documents retrieved %no .28 Tag Combined Topic Set P10 0.2. A partial list of the stop words is shown below. I As De In On To Who www a at en is or was will about be for it that what with an by from la the when und are com how of this where the The list of stop words we obtained is much smaller than standard stop word lists (although there is no definite list of stop words which all natural language processing tools incorporate. It must be noted that query expansion is internally carried out by the APIs used to retrieve documents from the web.4076 0.2 MAP .

1. and others [14].IDF based vector space model. however. Robertson. regardless of the inter-relationship between the query terms within a document (e. It is not a single function. implemented at London's City University in the 1980s and 1990s.1 Retrieval from local corpus All the work presented in this thesis relies upon the Lucene IR engine [13] for local corpus searches. e.. Document Retrieval The text collection over which a QA system works tend to be so large that it is impossible to process whole of it to retrieve the answer. A good retrieval unit will increase precision while maintaining good enough recall. but actually a whole family of scoring functions. One of the most prominent instantiations of the function is as follows. and its newer variants. is that it is relatively easy to extend to meet the demands of a given research project (as an open source project the full source code to Lucene is available making modification and extension relatively straight forward) allowing experiments with different retrieval models or ranking algorithms to use the same document index. the BM25 score of a document D is: . BM25. It is based on the probabilistic retrieval framework developed in the 1970s and 1980s by Stephen E. their relative proximity).. such as Web search. Lucene is an open source boolean search engine with support for ranked retrieval results using a TF. Given a query Q.g. One of the main advantages of using Lucene.. Karen Spärck Jones. 3. containing keywords 1 . 3.. with slightly different components and parameters. was the first system to implement this function. To set the right context. BM25F [2] (a version of BM25 that can take document structure and anchor text into account). . 3. since the Okapi information retrieval system.. The name of the actual ranking function is BM25.1. it usually referred to as "Okapi BM25".1 Ranking function We employ highly popular Okapi BM25 [3] ranking function for our document retrieval module. The task of the document retrieval module is to select a small set from the collection which can be practically handled in the later stages. represent stateof-the-art retrieval functions used in document retrieval.29 Chapter3. over many other IR engines.2 Okapi BM25 BM25 is a bag-of-words retrieval function that ranks a set of documents based on the query terms appearing in each document.g.

1) where f(qi. )( 1 1 +1) (1− + | | ) (3.5 ( )+0. Therefore. 3. Both APIs have relaxed terms of condition and allow access through code. Therefore we use public APIs of search engines. | D | is the length of the document D in words.0 and b = 0. k1 and b are free parameters. this is exactly what is expressed by the IDF component of BM25.3 IDF Information Theoretic Interpretation Here is an interpretation from information theory. and avgdl is the average document length in the text collection from which documents are drawn. and n(qi) is the number of documents containing qi.30 Score(D. 3.1. Suppose a query term q appears in n(q) documents. We have used Google AJAX Search API and Yahoo BOSS. Moreover there are no limits on number of queries per day when used for . It is usually computed as: ( )= − ( )+0. Then a randomly picked document D will contain the term with probability n(q)/N (where N is again the cardinality of the set of documents in the collection). the IDF component is derived from the Binary Independence Model.D) is qi's term frequency in the document D.2 Information retrieval from the web Indexing the whole web is a gigantic task which is not possible on a small scale.Q) = ∑ =1 ( ). ( . then the probability of seeing both q1 and q2 in a randomly picked document D is: ( 1) ( 2) and the information content of such an event is: 2 log =1 ( ) With a small variation. )+ ( .75. usually chosen as k1 = 2. There are several interpretations for IDF and slight variations on its formula. If the two terms occur in documents entirely independently of each other. the information content of the message "D contains q" is: -log ( ) = log ( ) (3.5 (3. IDF(qi) is the IDF (inverse document frequency) weight of the query term qi.2) where N is the total number of documents in the collection.3) Now suppose we have two query terms q1 and q2. In the original BM25 derivation.

Local Corpus Lucene IR Module Okapi BM25 Ranking function Top n Docs IR Query URL Reader Google/Yahoo Search APIs URL Reader URL Reader URL Reader INTERNET Multi threaded Reader module Figure 3. Therefore. . Therefore our target is to increasing coverage with least number of retrieved documents to form the text collection. To accelerate the process we employ multi threaded URL readers so that multiple URLs can be read simultaneously.1: Document retrieval framework 3.31 educational purposes. We read top n uniform resource locators (URLs) and build the collection of documents to be used for answer retrieval.2.1 shows the document retrieval framework. Whilst the ideal is not attainable. this stage is the most taxing one in terms of runtime.1 How many documents to retrieve? One of the main considerations when doing document retrieval for QA is the amount of text to retrieve and process for each question. As the task of reading the URLs over the internet is inherently slow process. The search APIs can return top n documents for a given query. the document retrieval stage can act as a filter between the document collections/web and answer extraction components by retrieving a relatively small set of text collection. Lowered precision is penalized by higher average processing time by later stages. Figure 3. Ideally a system would retrieve a single text unit that was just large enough to contain a single instance of the exact answer for every question.

27 0.1: %coverage and average processing time at different ranks %Coverage vs rank 80 70 60 50 40 30 20 10 0 Yahoo BOSS API %Coverage 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 23 31 37 42 48 49 49 51 51 52 53 53 54 54 55 Google AJAX Search 28 48 56 58 64 64 64 66 70 72 72 73 73 73 74 API Figure 3.9 2.37 0. The results are obtained on a set of 30 questions (equally distributed over all question classes) from TREC 04 QA track [5].51 0.44 2.021 0.6 3.49 0.2 4.31 2.2: %coverage vs rank .1 3.34 0.6 5.1 1.23 1.02 0.8 3.82 1.09 0. Table 3.7 4.102 0.01 2.39 1.803 1.32 criterion for selecting the right collection size depends on coverage and average processing time.2 2.77 Google 0.23 0. Average Processing Time*(sec) Yahoo 0. The table below shows percentage coverage.22 3.1 %Coverage @rank Yahoo 23 31 37 42 48 49 49 51 51 52 53 53 54 54 55 Google 28 48 56 58 64 64 64 66 70 72 72 73 73 73 74 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 *Average time spent by answer retrieval node.4 4. average processing time at different ranks for Google and Yahoo search APIs.

803 1.9 2.33 80 70 60 50 40 30 20 10 0 %Coverage vs Average Processing time(sec) Google AJAX Search API Yahoo BOSS From the results it is clear that going up to rank 5 ensures a good coverage while maintaining low processing time.4 4. 0.1 1.3: %coverage vs.23 0.51 0. average processing time .1 3. Clearly Google outperforms Yahoo at all ranks.39 1.021 0.37 0.6 3.1 Figure 3.09 0.2 2.6 5.

. These semantic relations for nouns include: x Hyponym/Hypernym (IS-A/ HAS A) x Meronym/Holonym (Part-of / Has-Part) x Meronym/Holonym (Member-of / Has-Member) x Meronym/Holonym (Substance-of / Has-Substance) Figure 4.1 Sentence Ranking The sentence ranking is responsible for ranking the sentences and giving a relative probability estimate to each one. is to extract and present the answers to questions.1. and motorcar is a synset that represents the concept define by gloss: four wheel Motor vehicle. usually propelled by an internal combustion Engine. The probability estimate and the retrieved answer’s frequency are used to compute confidence of the answer. We employ a named entity (NE) recognizer to filter out those sentences which could potentially contain answer to the given question. The final answer is the phase chunk with maximum frequency belonging to the sentence with highest rank.1 WordNet WordNet [16] is the product of a research project at Princeton University which has attempted to model the lexical knowledge of a native speaker of English. Answer Extraction The final stage in a QA system. For example the words car. and arguably the most important. It also registers the frequency of each individual phrase chunk marked by the NE recognizer for a given question class. In our system we have used GATE – A General Architecture for Text Engineering provided by The Sheffield NLP group [15] as a tool to handle most of the NLP tasks including NE recognition. 4. automobile. Each synset has a gloss that defines the concept of the word. WordNet connects concepts via a variety of semantic relations. auto.1 shows a fragment of WordNet taxonomy. 4. In WordNet each unique meaning of a word is represented by a synonym set or synset." In addition to providing these groups of synonyms to represent a concept. such as "he needs a car to get to work.34 Chapter4. Many glosses have examples of usages associated with them.

2 Sense/Semantic Similarity between words We use statistics to compute information content value. We assign a probability to a concept in taxonomy based on the occurrence of target concept in a given corpus.35 4.1) Where c is a concept and p is the probability of encountering c in a given corpus. in other words.1.1: Fragment of WordNet taxonomy . Figure 4. Using this basic idea we compute the sense/semantic between two given words based on a similarity metric proposed by Philip similarity Resnik [17]. the less information it conveys. The IC value is then calculated by negative log likelihood formula as follow: IC (c) log( P(c)) (4. infrequent words are more informative then frequent ones. Basic idea behind the negative likelihood formula is that the more probable a concept appears.

Given a sense network *( wi . q j ) …… Figure 4.2) *( wi .2: A sense network formed between a sentence and a query.i (4. ]} in increasing order of d(q). Function distance of ith element in then the alignment score is = −1 ∑ =1 ( −1 +1 − ) .7) = ∑ ( ) = ∑ ∑ ξ .5) (4. we define the distance of a word wi as d (wi ) i d (q j ) j Word with maximum sense similarity with query word qi is: (4. But unlike a bag of word model we give importance to the order of the words. Therefore we define the sense network *( wi .8) is the (4.4) M (qi ) And the corresponding value of wj | j argmax j[ j . Stop words are rejected from the set and only the root forms of the words are taken into account. j = V (qi ) The exact match score is = Average sense similarity for query word ∑ ( ) with sentence W is ∑ ξ .36 4. If W is the ordered set of n words in the given sentence and Q is the ordered set m words in the query. then we compute a network of sense weights between all pair of words in W and Q.6) [i . q j ) [i .9) Let T = {ordered set of M( ) ∀ ∈ [1.3) (4.1.2 Sense Net ranking algorithm We consider the sentence under consideration and the given query to be a set of words similar to a bag of word model. q j ) as: *(wi . S( ) = Therefore the total average sense per word is (4.1] is the value of sense/semantic similarity between wi‹W and q j‹Q . j Where [i . j‹ [0. (4. q j ) . …… (4.

0. Now. = noise penalty coefficient = Total score e D *( n Etotal *m ) 1 (4. Unlike newswire data most of the information found on the internet is badly formatted.5 and D to 0. = 1.37 The total average noise is defined as G total Where D is the noise decay factor. grammatically incorrect and most of the time not well formed.3). = 0.3: A sample run for the question “Who performed the first human heart transplant?” .125 and noise decay factor D =0.0.25 but when using local corpus we reduce to 0. We take top t sentences and consider the plausible answers within them. So when web is used as the knowledge base we use the following values of different coefficients: = 1. Figure 4. = 0.10) = = = × + × + × + × (4. we sort then according to these scores.(=5 in our case) answers are returned along with corresponding sentence and URL (figure 4.1. Once we obtain the total score for each sentence.25 .12) Again all answers are sorted according to confidence score and top . If an answer appears with frequency f in sentence ranked r then that answer gets a confidence score C (ans) 1 (1 ln( f )) r (4.11) The coefficients are fine tuned depending on the type of corpus.

Implementation and Results Our question answering module is written in JAVA. The URL reader module is multi threaded to keep download time at the minimum.38 Chapter5.1: Various modules of the QnA system along with each ones basic task.1 Results The idea of building an easily accessible question answering system which uses the web as a document collection is not new. This could have affected the results of any web based study. Stanford parser. The reason we did not use questions from TREC QA is that the TREC questions are now appearing quite frequently (sometimes with correct answers) in the results of web search engines. Uses Google and Yahoo search engine queries to build the corpus Sense Net implementation ArrayList of Ranked Sentences with helper methods Weighted feature vector SVM classifier. Implements a standard porter stemmer main class that handles user queries. The tests were performed on a small set of fifty web based questions. Also we don’t have access to AQUAINT corpus which is the knowledgebase for TREC QA systems. For this reason a new collection of fifty questions was assembled to serve as the test set. The questions within the new test set were chosen to meet the following criteria: . It uses various third party APIs for NLP and text engineering. Most of these systems are accessed via a web browser. Figure 5. In the later part of the section we compare our system with other web QA systems. Json and Lucene API to name a few. Use of JAVA makes the software cross platform and highly portable. Multi threaded URL reader implementation Multi threaded URL reader interface Stopwords filter class Computes Sense/Semantic similarity between words Stores a generic URL along with number of attempts to read it Trains the weighted feature vector SVM classifier Load GATE processing resources. Each module is designed keeping in mind space and time constraints. GATE. Most of the pre-processing is done via GATE processing pipeline. More information is provided in appendix B. 5.

51 0 8 9 10 11 12 13 14 15 16 17 1 1 1 NA 3 1 1 NA 1 5 Question Classifier failed 4.82 0 3 4 5 6 7 1 4 1 1 NA NE recognizer not designed to handle this question.2 16. In a way 64% is the accuracy upper bound of our system.23 13.53 0. For each question in set.88 0. 11 8.2 1.33 15 0 0.5 0 13 0 0.0 7. The answers to the questions should not be dependent upon the time at which the question is asked.6 6.1 6.2 0. This explicitly excludes questions such as “Who is the President of the US?” These questions are provided in appendix A. Also time spent on various tasks is shown which would help in determining the feasibility of the system to be used in real time environment.3 11.39 1. Each question should be an unambiguous factoid question with only one known answer.1 5.5 11. 8.77 10.55 0.78 17.91 Incorrect Answer .2 6. the table below shows the (min) rank at which answer was obtained.41 0.7 13.38 0.9 6.8 12.61 0 0.2 15. 2. We used top 5 documents to construct our corpus which restricts our coverage to 64%.1 0. Some of the questions chosen do have multiple answers although this is mainly due to incorrect answers appearing in some web documents.4 0 8.47 0. In case the system fails to answer a question we show the reason it failed.23 0 14.8 0 9.43 0. Question No.37 8.4 7.54 6.5 11. Answer obtained @ rank 5 NA Remarks Time in seconds Document retrieval module# Preprocessing Answer extraction module 1 2 NE recognizer not designed to handle this question.71 0.

67 16.31 0 0.29 .23 0.11 8.9 6.1 6.21 6.23 7.4 8.58 0.43 0. Answer changed recently Incorrect Answer NE recognizer not designed to handle this question.2 16.61 0 24 NA 0 0 0 25 26 27 1 NA NA 11.59 0.1 10. NE recognizer not designed to handle this question.23 6.67 7.42 0.3 12.7 8 0 0.61 0.2 7. 11.79 0.41 0.40 18 19 20 21 1 2 1 NA 7.77 8.34 0.83 11.5 0 14.2 7.22 14.48 0.65 0.23 0 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 1 1 1 1 4 1 1 1 1 NA NA 1 NA 1 NA 1 NA 1 1 NA NA Incorrect Answer Incorrect Answer Incorrect Answer Incorrect Answer Incorrect Answer Incorrect Answer Incorrect Answer Incorrect Answer Incorrect Answer mainly because req.4 9.1 12.54 0.23 11.5 14.9 7. 22 23 1 NA 7.2 11.43 0.38 0.2 5. NE recognizer not designed to handle this question.62 0.2 0 14.0 14.99 11.99 15.62 0.01 10.36 0.3 11.8 14.1 6.34 0.8 15.67 13.01 8.47 0.67 0.4 8.54 0.7 7.64 0.7 6.21 12.8 7.63 16.9 0 0.1 6.1 13.67 11.8 9.99 8.3 16.53 0.11 0.45 0 NE recognizer not designed to handle this question.66 0 11.

@Rank 5 the system reached its accuracy upper bound of 64%.umich. most of the failures were because of the handicapped NE recognizer.33 0.6 12. IONAUT3 [19] uses its own crawler to index the web with specific focus on entities and the relationships between them in order to provide a richer base for answering questions than the unstructured documents returned by standard search engines. 2. returning full sentences containing duplicated answers.67 6.5% of the TREC 8 question set although we believe the performance would decrease if exact answers were being evaluated as experience of the TREC evaluations has shown this to be a harder task than locating answer bearing sentences. 5.46 0.45 Total number of questions: 50. Number of questions answered@ x Rank 1: 26 – Accuracy 52% x Rank 2: 28 – Accuracy 56% x Rank 3: 29 – Accuracy 58% x Rank 4: 31 – Accuracy 62% x Rank 5: 32 – Accuracy 64% Average time spent per question: 18. It returns both exact answer and snippets. This gives undue advantage to the system as it performs the easier task of finding relevant sentences only.si. As seen.languagecomputer.02 11.edu/˜zzheng/qa-new/ http://www. AnswerFinder.2 Comparisons with other Web Based QA Systems We compare our system with four web based QA Systems – AnswerBus [18]. The system called AnswerBus2 [18] behaves in much the same way as PowerAnswer. IONAUT [19] and PowerAnswer[20].com:8400 . 3.1: Performance of the system on the web question set.24 0. The system returns both exact answers and snippets.com/demos/ http://misshoover.ionaut. The question classifier failed in only one instance.51 11. 1.01 7. The consistently best performing system at TREC forms the backbone of the PowerAnswer system from Language Computer1. http://www. It is claimed that AnswerBus can correctly answer 70. This system is the closest to ours.41 answer type was present in the query itself 49 2 50 NA Average time spent: Incorrect Answer 8. Unlike our system each answer is a sentence and no attempt is made to cluster (or remove) sentences which contain the same answer. AnswerFinder is a client side application that supports natural language questions and queries the Internet via Google.3 seconds #time is dependent on network speed Table 5.

Figure 9.3 Feasibility of the system to be used in real time environment From table 5. IONAUT +.2: Comparison of AnswerBus . An average response time of 18. At higher ranks it performs considerably better than AnswerBus and IONAUT while performing marginally less than AnswerFinder and PowerAnswer. This is quite important as the answer returned at rank 1 can be considered to be the final answer provided by the system.1 it is clear that the system cannot be used for real time purposes as of now. The results are encouraging but it should be noted that due to the small number of test questions it is difficult to draw firm conclusions from these experiments. within as short a period of time as was possible. so that the underlying document collection. But it must be noted that . AnswerFinder and our system . in this case the web would be relatively static and hence no system would benefit from subtle changes in the content of the collection. PowerAnswer It is clear from the graph that our system outperforms all but AnswerFinder at rank 1. 5.3 seconds is too high.42 The questions from the web question set were presented to the five systems on the same day.

3: Time distribution of each module involved in QA 5. The time distribution of various modules shows that the system is quite fast at the answer extraction stage. where possible. simple approaches to question answering which can be both easily understood and would operate quickly. More over the task of post processing can be done offline on the corpus as it is independent of the query. We employed machine learning techniques for question classification whose performance is good enough and any further improvements won’t be beneficial. Finally our current results are encouraging but we acknowledge that due to the small number of test questions it is difficult to draw firm conclusions from these experiments. if used along with a local corpus which is pre-processed offline it can be adapted for real time applications. the actual task of retrieving an answer is quite low at 0. Even with the limited capability of NE recognizer the system is at par with state of the art web QA systems which confirms the efficacy of the ranking algorithm. Our NE recognizer recognizes limited sets of answer types which is not enough to obtain a good enough overall accuracy. In our case the NE recognizer is the weakest link. We observed that the performance of the system is limited by the worst performing module of the QA system. Answer Extraction 3% time distribution Document Retrieval 36% Pre-Processing 61% Figure 9. We believe that if we use our own crawler and pre-process the documents beforehand. our system can retrieve answers fast enough to be used in real time systems. .4 Conclusion The main motivation behind the work in this thesis was to consider. The graph below shows percentage of time spent in different tasks. Once the corpus is pre-processed offline.43 document retrieval time will be significantly lower for offline – local corpus.45 seconds. So even if a single module fails the whole system won’t be able to answer. We also proposed the Sense Net algorithm as new way of ranking sentences and answers.

Europa. Ganymede and Io are 4 of the 16 moons of which planet? Ans: Jupiter Q024: Which planet was discovered in 1930 and has only one known satellite called Charon? Ans: Pluto Q025: How many satellites does the planet Uranus have? Ans: 15. live? Ans: Australia or America Q004: Where would you find budgerigars in their natural habitat? Ans: Australia Q005: How many stomachs does a cow have? Ans: Four or one with four parts Q006: How many legs does a lobster have? Ans: Ten Q007: Charon is the only satellite of which planet in the solar system? Ans: Pluto Q008: Which scientist was born in Germany in 1879. the dingo. 17. 38 or 37. 18 or 21 Q026: In computing. if a byte is 8 bits. 66%. how many bits is a nibble? Ans: 4 Q027: What colour is cobalt? Ans: blue Q028: Who became the first American to orbit the Earth in 1962 and returned to Space in 1997? Ans: John Glenn Q029: Who invented the light bulb? Ans: Thomas Edison . Who performed the first human heart transplant? Ans: Dr Christiaan Barnard Q023: Callisto. what is the average human body temperature? Ans: 37.44 Appendix A Small Web Based Question Set Q001: The chihuahua dog derives it’s name from a town in which country? Ans: Mexico Q002: What is the largest planet in our Solar System? Ans: Jupiter Q003: In which country does the wild dog. 60% or 70% Q016: What is the sixth planet from the Sun in the Solar System? Ans: Saturn Q017: How many carats are there in pure gold? Ans: 24 Q018: How many canine teeth does a human have? Ans: Four Q019: In which year was the US space station Skylab launched? Ans: 1973 Q020: How many noble gases are there? Ans: 6 Q021: What is the normal colour of sulphur? Ans: Yellow Q022.98 Q014: Who discovered gravitation and invented calculus? Ans: Isaac Newton Q015: Approximately what percentage of the human body is water? Ans: 80%. became a Swiss citizen in 1901 and later became a US citizen in 1940? Ans: Albert Einstein Q009: Who shared a Nobel prize in 1945 for his discovery of the antibiotic penicillin? Ans: Alexander Fleming. Howard Florey or Ernst Chain Q010: Who invented penicillin in 1928? Ans: Sir Alexander Fleming Q011: How often does Haley’s comet appear? Ans: Every 76 years or every 75 years Q012: How many teeth make up a full adult set? Ans: 32 Q013: In degrees centigrade.

208 to 146 million years ago. 205 to 141 million years ago or 205 million years ago to 145 million years ago Q039: Who was President of the USA from 1963 to 1969? Ans: Lyndon B Johnson Q040: Who was British Prime Minister from 1974-1976? Ans: Harold Wilson Q041: Who was British Prime Minister from 1955 to 1957? Ans: Anthony Eden Q042: What year saw the first flying bombs drop on London? Ans: 1944 Q043: In what year was Nelson Mandela imprisoned for life? Ans: 1964 Q044: In what year was London due to host the Olympic Games. 195 – 140 million years ago. but couldn’t because of the Second World War? Ans: 1944 Q045: In which year did colour TV transmissions begin in Britain? Ans: 1969 Q046: For how many days were US TV commercials dropped after President Kennedy’s death as a mark of respect? Ans: 4 Q047: What nationality was the architect Robert Adam? Ans: Scottish Q048: What nationality was the inventor Thomas Edison? Ans: American Q049: In which country did the dance the fandango originate? Ans: Spain Q050: By what nickname was criminal Albert De Salvo better known? Ans: The Boston Strangler. the compact disc? Ans: Philips Q032: Who invented the television? Ans: John Logie Baird Q033: Which famous British author wrote ”Chitty Chitty Bang Bang”? Ans: Ian Fleming Q034: Who was the first President of America? Ans: George Washington Q035: When was Adolf Hitler born? Ans: 1889 Q036: In what year did Adolf Hitler commit suicide? Ans: 1945 Q037: Who did Jimmy Carter succeed as President of the United States? Ans: Gerald Ford Q038: For how many years did the Jurassic period last? Ans: 180 million.45 Q030: How many species of elephant are there in the world? Ans: 2 Q031: In 1980 which electronics company demonstrated its latest invention. . 205 to 140 million years ago.

princeton.com/ ) as the preferred IDE.uk/ Apache Lucene API is a free/open source information retrieval library. JSON or JavaScript Object Notation.net/projects/jwordnet/ Stanford Log-linear Part-Of-Speech Tagger http://nlp. The following third party APIs are used: x x x x x x x GATE 4.0 (A General Architecture for Text Engineering) software toolkit originally developed at the University of Sheffield since 1995 . originally created in Java by Doug Cutting . To resolve the issue conflicting libraries belonging to GATE must not be included in the classpath. Therefore stack size should be increased to at least 512MB using –Xmx512m command line option.org/ JSON API. is a lightweight computer data interchange format.apache.edu/ All experiments performed on a Core 2 Duo 1.edu/software/tagger.edu.shtml WordNet 2. The API brings support to read JSON data.http://www. http://www.http://gate.http://lucene.ac. .jcreator. The code uses newer features like generics which is not compatible with any version of JAVA prior to 1.86 GhZ System with 2GB RAM. as well as relationship discovery and morphological processing http://sourceforge.ntu. Default stack size may not be sufficient to run the application. Some classes present in JWNL API conflict with GATE.org/java/ LibSVM A Library for Support Vector Machines by Chih-Chung Chang and ChihJen Lin .csie.stanford.json.46 Appendix B Implementation Details We have used Jcreator (http://www.http://wordnet.1 is a lexical database for the English language is used to measure sense/semantic similarity measure .tw/~cjlin/libsvm/ JWNL is an API for accessing WordNet in multiple formats.5.

In Proceedings of the 11th Text REtrieval Conference. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology (NAACL ’03). R¨uger. A Sys Called Qanda. In Proceedings of the 8th Text REtrieval Conference. 1999. Robertson. NJ. Overview of the TREC 2002 Question Answering Track. The TREC 8 Question Answering Track Report. Query expansion and dimensionality reduction: Notions of optimality in rocchio relevance feedback and latent semantic indexing. Voorhees. Cooper and Stefan M. Steve Walker. [6] Xin Li and Dan Roth. Voorhees. Information Processing & Management. Cognitive Computation Group at the Department of Computer Science. Machine Learning. 1999. Computer Science Series. University of Illinois at Urbana-Champaign. Okapi at TREC-3. [5] Corpora for Question Answering Task. Question Classification with Support Vector Machines and Error Correcting Codes. 1999. Mitchell.47 References [1] Miles Efron. Lisa Ferro. [7] Kadri Hacioglu and Wayne Ward. 1994. 2002. McGraw-Hill. [10] Eric Breck. 2003. USA. [2] Stephen E. David House. A Simple Question Answering System. Micheline Hancock-Beaulieu. Learning Question Classifiers. In Proceedings of the 19th International Conference on Computational Linguistics (COLING’02). [4] Tom M. 1997. In Proceedings of the 3rd Text Retrieval Conference. 2000. and Mike Gatford. Marc Light. [3] Stephen E. and Inderjeet Mani. Taiwan. Taipei. Burger. 44(1):163–180. 2003. [8] Ellen M. Question Classification using Support Vector Machines. [9] Ellen M. In Proceedings of the 8th Text REtrieval Conference. Morristown. John D. Robertson and Steve Walker. 2002. [11] Richard J. pages 28–30. [12] Dell Zhang and Wee Sun Lee. In Proceedings of the 9th Text REtrieval Conference. In Proceedings of the 8th Text REtrieval Conference. January 2008. Susan Jones. Okapi/Keenbow at TREC-8. In Proceedings of the 26th ACM International Conference on Research .

USA. Semantic similarity in a taxonomy: An information-based measure and its application to problems of ambiguity in natural language. San Diego. Finley Lacatus¸u. Hai Jin. 1997. Michael Collins. Miller. CA. In Proceedings of the 11th Text REtrieval Conference. Journal of Artificial Intelligence Research. Answer Extraction. 2002. WordNet: A Lexical Database. pages 296–301. Cunningham. [17] Philip Resnik. In Proceedings of the Human Language Technology Conference (HLT 2002). An approach for indexing. [19] Steven Abney. 1995. 1999. LCC Tools for Question Answering. Adriana Badulescu. Adrian Novischi. AnswerBus Question Answering System. R. and Orest Bolohan. Humphreys. Software infrastructure for natural language processing. Gaizauskas. Wilks. storing retrieving domain knowledge. In SAC ’07: Proceedings of the 2007 ACM symposium on Applied computing. November. USA. March 24-27. Paul Mor?arescu. 11:95–130. ACM Press. Seattle. and Y. and .part 2. In Proceedings of the 6th Applied Natural Language Processing Conference (ANLP 2000). 2000. pages 1381–1382. [15] H. K. A probabilistic model of information retrieval: development and comparative experiments . 36(6):809–840.48 and Developement in Information Retrieval (SIGIR’03). Steve Walker. [20] Dan Moldovan. Information Processing and Management. [14] Karen S. Washington. 2000. pages 26–32. [13] Hao Wu. and Amit Singhal. 2007. Communications of the ACM. Jones. Canada. Toronto. 38(11):39–41. and Stephen E. [18] Zhiping Zheng. Sanda Harabagiu. Roxana Girju. NY. New York. Robertson. [16] George A. 2002. and Xiaomin Ning.

You're Reading a Free Preview

Descarga
scribd
/*********** DO NOT ALTER ANYTHING BELOW THIS LINE ! ************/ var s_code=s.t();if(s_code)document.write(s_code)//-->