In the context of the web, the task of QF is to locate web pages that may contain the answer to the question, while AE singles out the relevant passages and pinpoints the correct answer. The next two subsections examine the nature of these two tasks and how Mulder attempts to solve them.
However, it is not easy to locate relevant documents with a search engine, let alone find the answer. An under-constrained query returns many irrelevant documents, far too many to process in real time, while an overly specific query may return no documents at all. The QA system is therefore faced with the daunting task of generating queries that are focused enough to retrieve a small set of highly relevant pages.
Existing QA systems mostly select relevant keywords using IR metrics such as TF-IDF. There are two reasons for this. First, less specific queries increase recall, so that an answer is more likely to be contained in one of the returned pages. Second, issuing multiple queries can be expensive. However, keyword queries tend to return very poor results, even with state-of-the-art search engines such as Google, because of the sheer number of documents on the web.
AE is more difficult on the web than on a carefully selected document set (such as the WSJ collection in TREC). The quality of text varies widely, and pages can be poorly composed. Worse, pages can state incorrect "factoids", intentionally or unintentionally: "Elvis killed JFK" was found in numerous humor pages, and "the first American in space was John Glenn" appeared in a number of "common misconceptions about astronomy" pages. The AE engine therefore needs to be fault-tolerant.
Another useful piece of information about a sentence is the relationship between its words. The subject or object of a verb, for example, determines what is being asked in a question. The Link Parser [?], available from CMU, is a popular parser that produces such relationships. It parses a sentence according to the Link Grammar [?] and generates a wide variety of "links" between words. The parser uses a fixed set of grammar rules and a fixed lexicon rather than statistical methods, and its accuracy is relatively low (75%). However, it is relatively fast, and the links it produces are sufficient for identifying the subject and object relationships we need.
Search engines match exact words only and perform no stemming ("kill" does not match "killed"), so Mulder must come up with the exact word forms to place in the query. This requires the ability to synthesize word forms. We use PC-KIMMO for word analysis and synthesis: given a base form, it generates the inflected variants we need, as sketched below.
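The following is a minimal sketch of the kind of word synthesis this step requires. PC-KIMMO's actual rule files and interface are not shown; the hand-written suffix rules below are purely illustrative stand-ins.

```python
# Illustrative stand-in for PC-KIMMO word synthesis (not its real interface).

def synthesize_verb_forms(base):
    """Generate plausible surface forms of a verb for exact-match queries."""
    forms = {base}
    if base.endswith("e"):
        forms.update({base + "d", base[:-1] + "ing", base + "s"})
    else:
        forms.update({base + "ed", base + "ing", base + "s"})
    return sorted(forms)

def expand_query_terms(terms):
    """Expand each query term into its synthesized variants."""
    return {t: synthesize_verb_forms(t) for t in terms}

if __name__ == "__main__":
    # "kill" != "killed" to a search engine without stemming, so the
    # inflected forms must be generated before issuing the query.
    print(expand_query_terms(["kill", "invent"]))
```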
There are many search engines to choose from, of varying quality. Boolean engines, such as the Inktomi-based engines and AltaVista (AV), offer powerful query facilities (NEAR, OR, AND), can combine multiple queries into one, and are fast; on the other hand, their coverage is poor and most have problematic ranking. Google offers wide coverage and good ranking, but has no boolean capability, limited phrase search, and is slower (over one second) on queries involving stopwords. For a QA system, coverage is important, as is good ranking.
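As a rough illustration of the difference in query interfaces, the sketch below renders the same reformulated phrases in the two styles. The exact operator syntax shown is illustrative of boolean engines in general, not a definitive description of either engine's API.

```python
# Sketch: rendering a reformulated query for two styles of search engine.

def google_query(phrase, keywords):
    """Google-style: a quoted phrase plus bare keywords, no boolean operators."""
    return '"%s" %s' % (phrase, " ".join(keywords))

def boolean_query(phrases, keywords):
    """AltaVista-style: alternative phrases joined with OR, required keywords with AND."""
    phrase_part = " OR ".join('"%s"' % p for p in phrases)
    keyword_part = " AND ".join(keywords)
    return "(%s) AND %s" % (phrase_part, keyword_part)

if __name__ == "__main__":
    print(google_query("the first American in space was", ["astronaut"]))
    print(boolean_query(["the first American in space was",
                         "was the first American in space"],
                        ["astronaut"]))
```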
Knowing which words are important allows us to form more powerful queries. Importance can be measured with TF-IDF, or estimated with IDF alone. Obtaining these statistics by repeatedly querying the search engine would be too expensive; instead we run our own mini search engine, using a few encyclopedias as its corpus, to estimate IDF locally. A minimal sketch of this estimation follows.
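The sketch below shows one way to estimate IDF from a local corpus; the tokenization and the handling of unseen words are simplifying assumptions, not Mulder's exact procedure.

```python
# Sketch: estimating IDF from a small local corpus (e.g. encyclopedia articles)
# instead of querying the search engine for document frequencies.

import math
import re
from collections import Counter

def build_idf(documents):
    """documents: list of raw text strings. Returns an idf(word) function."""
    df = Counter()
    for doc in documents:
        df.update(set(re.findall(r"[a-z]+", doc.lower())))
    n = len(documents)

    def idf(word):
        freq = df[word.lower()]
        # Unseen words get the maximum IDF: they are rare and therefore
        # likely to be the most discriminating query terms.
        return math.log(n / freq) if freq else math.log(n + 1)

    return idf

if __name__ == "__main__":
    idf = build_idf(["the mercury program put the first american in space",
                     "the apollo program landed on the moon"])
    for w in ["the", "mercury", "advertising"]:
        print(w, round(idf(w), 2))
```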
Retrieving useful pages from the web is an inherently difficult problem. Our system needs to respond as quickly as possible, so it can examine only a limited number of pages, and irrelevant pages add noise that degrades performance. We therefore choose to invest more effort in the QF engine. Our QF engine is based on the following observations about the web:
Our system implements a voting scheme. The assumption is that despite the noise, the truth will prevail, i.e., the correct answer will be found on more pages than any incorrect one.
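A minimal sketch of the voting idea, under the simplifying assumption that candidate answers are clustered by their normalized text; the candidates and scores shown are made up.

```python
# Sketch: voting over candidate answers extracted from many pages.

from collections import defaultdict

def vote(candidates):
    """candidates: list of (answer_text, score) pairs from different pages."""
    totals = defaultdict(float)
    for text, score in candidates:
        totals[text.strip().lower()] += score
    # The "truth prevails" assumption: the answer repeated across the most
    # (and best-scoring) pages accumulates the highest total.
    return max(totals.items(), key=lambda kv: kv[1])

if __name__ == "__main__":
    print(vote([("Alan Shepard", 0.9), ("John Glenn", 0.7),
                ("alan shepard", 0.8), ("Alan Shepard", 0.6)]))
```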
To remedy this problem, we added the PC-KIMMO word analyzer to the parser. If a word is not known to the parser, it is first given to KIMMO for recognition; if KIMMO recognizes the word, it produces a list of possible analyses. Otherwise, the word is assumed to be a noun: given that KIMMO has a large vocabulary, a word unknown to both systems is most likely a noun.
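The fallback logic can be sketched as follows. Here `kimmo_analyze` is a hypothetical stand-in for the PC-KIMMO recognizer, whose real interface is not shown.

```python
# Sketch: unknown-word handling with a KIMMO lookup and a noun default.

def kimmo_analyze(word):
    """Placeholder for PC-KIMMO: return (lemma, part_of_speech) analyses, or []."""
    toy_lexicon = {"killed": [("kill", "verb")],
                   "astronauts": [("astronaut", "noun")]}
    return toy_lexicon.get(word.lower(), [])

def analyze_unknown(word, parser_lexicon):
    """Resolve a word the Link Parser does not know."""
    if word in parser_lexicon:
        return parser_lexicon[word]
    analyses = kimmo_analyze(word)
    if analyses:
        return analyses
    # KIMMO has a large vocabulary, so a word unknown to both systems is
    # most likely a (proper) noun; default to noun.
    return [(word, "noun")]

if __name__ == "__main__":
    print(analyze_unknown("Mulder", parser_lexicon={}))
```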
After we obtain the structure of the question, we pass the question to the Link Parser to generate the relationships between words. In particular, we want to know which word is the subject or object of the question. This information is used in classifying the type of the question. The classifier is described next.
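As a rough illustration (not Mulder's actual rule set), the sketch below shows how the wh-word and the subject can determine a coarse answer type; the rules are illustrative assumptions only.

```python
# Sketch: rule-based question classification from the wh-word and subject.

def classify_question(tokens, subject=None):
    """Return a coarse answer type: 'numerical', 'date', 'person', 'location', 'noun'."""
    first = tokens[0].lower()
    if first == "how" and len(tokens) > 1 and \
            tokens[1].lower() in ("many", "much", "far", "tall", "long"):
        return "numerical"
    if first == "when":
        return "date"
    if first in ("who", "whom"):
        return "person"
    if first == "where":
        return "location"
    # "What/which X ..." questions take their type from the subject noun.
    if first in ("what", "which") and subject:
        return "noun:" + subject
    return "noun"

if __name__ == "__main__":
    print(classify_question("How many dogs pull a sled ?".split()))
    print(classify_question("Who was the first American in space ?".split()))
```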
Query reformulation is based on transformational grammar [?]: questions are transformed into assertions that are likely to appear verbatim in a page containing the answer. For example, "Who was the first American in space?" becomes the assertion stub "the first American in space was". Two example rules are sketched below.
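The two string-level rules below are a simplified sketch in this spirit; the real rules operate on the parse structure, and the patterns and verb suffixing shown here are illustrative assumptions.

```python
# Sketch: two transformation rules that turn questions into assertion stubs.

import re

def reformulate(question):
    """Turn a question into candidate assertions to use as search phrases."""
    q = question.rstrip("?").strip()
    candidates = []

    # Rule 1: "Who/What was X?" -> "X was"
    m = re.match(r"(?i)(who|what)\s+(was|is|were|are)\s+(.*)", q)
    if m:
        candidates.append("%s %s" % (m.group(3), m.group(2)))

    # Rule 2: "When did S V ...?" -> "S V-ed ..." (proper verb synthesis
    # would come from the morphological component rather than "+ed").
    m = re.match(r"(?i)when\s+did\s+(\w+)\s+(\w+)\s*(.*)", q)
    if m:
        subject, verb, rest = m.groups()
        candidates.append(("%s %sed %s" % (subject, verb, rest)).strip())

    return candidates or [q]

if __name__ == "__main__":
    print(reformulate("Who was the first American in space?"))
    print(reformulate("When did Nixon visit China?"))
```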
We use Google as our back-end search engine because of its wide coverage and good ranking. Mulder issues multiple reformulated queries and fetches the resulting web pages. Since our evaluation uses TREC-8 questions, we must also filter out web pages that mirror the TREC-8 document collection. A sketch of this retrieval step follows.
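The sketch below outlines the retrieval step under stated assumptions: `search`, `fetch`, and `is_trec8_mirror` are hypothetical placeholders, not a real Google API or a complete mirror filter.

```python
# Sketch: issue reformulated queries, filter TREC-8 mirrors, fetch pages in parallel.

from concurrent.futures import ThreadPoolExecutor

def search(query):
    """Placeholder: return a list of result URLs for a query."""
    return []

def fetch(url):
    """Placeholder: return the page text at url."""
    return ""

def is_trec8_mirror(url):
    # Illustrative filter only: a real filter needs a list of known mirrors.
    return "trec" in url.lower()

def retrieve_pages(queries, max_pages=20):
    urls = []
    for q in queries:
        urls.extend(u for u in search(q) if not is_trec8_mirror(u))
    urls = urls[:max_pages]
    with ThreadPoolExecutor(max_workers=8) as pool:
        return list(pool.map(fetch, urls))
```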
Tokenization requires care: some HTML tags are translated into sentence-ending periods. Summarization is keyword-based; candidate summaries are weighted by the IDF scores of the keywords they contain and by the distance between those keywords. The candidates are sorted and the top 20 are forwarded to the parsers. We run multiple parsers in parallel and allow lower-quality parses. Although the parser is one of the faster statistical parsers, it is still quite slow, especially on sentences longer than 20-30 words. We therefore implemented a quality switch that lets us trade quality for time, and in our tests the loss in quality was acceptable. Finally, we look for noun phrases, numerical phrases, or dates in the parsed summaries. Answers that occur in summaries retrieved by transformation-based queries are given the highest score, followed by those from phrase queries. A sketch of the summary scoring follows.
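The sketch below is one plausible instance of "weighted by IDF and by the distance between keywords" (sum of keyword IDF scores divided by the span the keywords cover); the exact weighting Mulder uses is not reproduced here, and the window width and step are assumptions.

```python
# Sketch: keyword-based summary scoring with IDF weights and keyword proximity.

def score_window(window_tokens, keywords, idf):
    """Score a candidate summary window; keywords is a set of lowercase terms."""
    positions = [i for i, t in enumerate(window_tokens) if t.lower() in keywords]
    if not positions:
        return 0.0
    weight = sum(idf(window_tokens[i]) for i in positions)
    span = positions[-1] - positions[0] + 1   # tighter keyword clusters score higher
    return weight / span

def best_summaries(tokens, keywords, idf, width=40, top=20):
    """Slide a fixed-width window over the page and keep the top-scoring windows."""
    step = width // 2
    windows = [tokens[i:i + width]
               for i in range(0, max(1, len(tokens) - width), step)]
    ranked = sorted(windows,
                    key=lambda w: score_window(w, keywords, idf),
                    reverse=True)
    return ranked[:top]
```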
We use a standard IR evaluation question set, the TREC-8 questions, which were originally posed against the TREC-8 document collection. That setting guarantees that every question has an answer in the collection; on the web we have no such guarantee, since not all answers are online. More specific questions, such as "How much did Mercury spend on advertising?", retrieve no relevant pages from Google, even after about 20 minutes of manual search effort.
Our experiments were designed to answer the following questions:

1. How well does Mulder answer questions, and how much effort does it save the user?
2. How well does Mulder fare against competitive systems such as Google and AskJeeves?

We use the following definitions:

- Hit: a hit on a search engine is an entry consisting of a URL and a summary snippet. A hit on Mulder is an answer together with the summary it was extracted from.
- Result: in systems without extraction, a result is any web page containing the answer. In Mulder, a result is an extracted answer on a hit.
- Precision: the number of results divided by the number of pages retrieved. In Mulder it is possible to extract multiple answers from the same page; we count only the highest-scoring answer on each page.
- Recall: the percentage of questions in the TREC-8 question set that are answered.
- Scan distance: the number of words the user must read before reaching an answer. Let text(hit_n, answer) be the number of words read before reaching the answer in the n-th hit; if the answer does not appear in the hit, it is the total number of words in the hit. Let text(doc(hit_n), answer) be the number of words read before reaching the answer in the web page pointed to by hit_n. If the answer occurs in the title or summary snippet of the n-th hit, the scan distance is text(hit_1, answer) + ... + text(hit_n, answer). For Google and AskJeeves, if the answer does not occur on any hit but does appear inside the web page pointed to by the n-th hit, the scan distance is text(hit_1, answer) + ... + text(hit_n, answer) + text(doc(hit_n), answer). That is, we assume the user does not view any of the documents pointed to by the first n-1 hits. Scan distance is an effective measure of QAS performance, since it approximates the time the user spends looking for the answer.
- Result distance: the number of results the user must scan through before reaching the answer.
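A minimal sketch of how scan distance could be computed from these definitions; the hit representation (snippet text plus page text) and whitespace tokenization are simplifications.

```python
# Sketch: scan distance over a ranked list of (snippet, page_text) hits.

def words_before(text, answer):
    """Words read before reaching the answer; all words if the answer is absent."""
    pos = text.lower().find(answer.lower())
    prefix = text if pos < 0 else text[:pos]
    return len(prefix.split())

def scan_distance(hits, answer, follow_pages=False, cutoff=5000):
    """follow_pages=True models Google/AskJeeves users opening the hit's page."""
    distance = 0
    for snippet, page in hits:
        distance += words_before(snippet, answer)
        if answer.lower() in snippet.lower():
            return min(distance, cutoff)
        if follow_pages and answer.lower() in page.lower():
            # The user opens only the page behind the hit that led them there,
            # then reads until the answer (documents of earlier hits are skipped).
            return min(distance + words_before(page, answer), cutoff)
    return cutoff
```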
We impose an upper bound on the number of words scanned; currently it is 5,000.
In the graph below, the y-axis is the scan distance incurred by the user (cut off at 5,000) and the x-axis represents the recall of the system. We compare the performance of four systems:
- Mulder: Mulder, fetching 20 pages and generating 40 summaries.
- Mulder-X: Mulder without extraction, fetching 20 pages.
- Google: Google, fetching 50 pages.
- AJ: AskJeeves, fetching all the pages it returns (fewer than 10).
Observations:
- AskJeeves performs poorly.
- Mulder-X beats Google while looking at only 20 pages.
- Mulder beats all of the other systems; at scan distance 0 it already achieves about 23% recall, rising to 40% at distance 120.
- Increasing Mulder's retrieval to 50 pages yields only a very minor improvement.
- Mulder's maximum recall is about 54%.
Observations:
- AskJeeves again performs poorly.
- At result distance 0, Mulder achieves 31% recall; that is, the answer occurs in the first result 31% of the time. Recall reaches about 45% at distance 5.
- At result distance 0, Mulder-X outperforms Google by 20%. The difference shrinks further down the ranking, as Google begins to retrieve pages similar to Mulder's.
- The 70-80th percentile contains questions for which most reformulated phrases retrieve nothing, which is reflected in the similar performance of Google and Mulder there.
- By this measure Mulder compares unfavorably with the systems without extraction, since for those systems any page containing the answer string (found by grep) counts as a result.