Question Answering Using the World Wide Web

Abstract

....

1 Introduction

    2 Background

    2.1 QA Systems

    A QA system is implemented on top of a knowledge base (KB). It has two components: query formulation (QF) and answer extraction (AE). In QF, the system phrases the question as queries in order to retrieve relevant knowledge from the KB. In AE, the system extracts the single correct answer from the retrieved knowledge. Most QA systems today rely directly on a document collection as their KB, and we will focus on these systems in this paper, but some systems such as [?] take a different approach...

    In the context of the web, QF means locating web pages that may contain the answer to the question, and AE means singling out the relevant passages and pinpointing the correct answer. We examine the nature of these two tasks, and how Mulder attempts to solve them, in the next two subsections.

    2.2 Query Formulation

    For a domain-independent QA system to work, wide coverage is necessary. Of the technologies available today, search engines are the only ones that can provide the extensive coverage required, so we build on top of search engines.

    However, it is not easy to locate relevant documents through a search engine, let alone to find the answer. An under-constrained query will return many irrelevant documents, too many to process in real time and of little help to QA. An overly specific query may not return any documents at all. The QA system is therefore faced with the daunting task of generating queries that are focused enough to return a small set of highly relevant pages.

    Existing QA systems mostly pick the relevant keywords using IR metrics such as TFIDF. There are two reasons for this. First, by using less specific queries, we hope to increase recall so that the answer is contained in one of the web pages returned. Second, it can be expensive to perform multiple queries. However, keyword queries tend to produce very poor quality results, even with state-of-the-art search engines such as Google, because of the sheer number of documents on the web.
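    To make this concrete, here is a minimal sketch of IDF-based keyword selection of the kind these systems use. The IDF values, stopword list, and function name are illustrative only, not taken from any particular system.

```python
import re

# Hypothetical IDF values; in practice they would be estimated from a corpus.
IDF = {"highest": 3.2, "mountain": 4.1, "world": 2.5}

STOPWORDS = {"what", "who", "when", "where", "is", "was", "the", "of", "in", "a"}

def keyword_query(question, max_keywords=4):
    """Pick the most discriminating words of a question as a keyword query."""
    words = re.findall(r"[a-z]+", question.lower())
    content = [w for w in words if w not in STOPWORDS]
    # Rank by IDF: rarer words are assumed to be more informative.
    ranked = sorted(content, key=lambda w: IDF.get(w, 1.0), reverse=True)
    return " ".join(ranked[:max_keywords])

print(keyword_query("What is the highest mountain in the world?"))
# -> "mountain highest world"
```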

    2.3 Answer Extraction

    In AE we attempt to extract the precise answer from the page. Most AE engines use named-entity tagging or shallow syntactic/semantic analysis. They first tag each noun phrase in the sentence with labels such as "country", "name", "company", etc., then do something more.... Most state-of-the-art named-entity taggers work by statistical methods.... Named-entity tagging works for data types such as country or blah that have a limited range of values, but on a document set as large as the web it may suffer in performance. In addition, named-entity taggers require a large hand-tagged training set, and thus need to be updated constantly to keep pace with the web, which is tedious. Therefore our system implements a type checker instead.
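    A minimal sketch of what such a type check could look like, assuming only three coarse answer types (number, date, noun phrase). The patterns and names below are illustrative, not Mulder's actual implementation.

```python
import re

# Illustrative patterns only; a real type checker would be considerably richer.
NUMBER_RE = re.compile(
    r"^\$?\d[\d,.]*\s*(million|billion|percent|%|feet|foot|meters?|miles?)?$", re.I)
DATE_RE = re.compile(
    r"^(January|February|March|April|May|June|July|August|September|"
    r"October|November|December)\s+\d{1,2},\s*\d{4}$", re.I)

def check_type(candidate, expected):
    """Return True if a candidate answer phrase is compatible with the expected type."""
    text = candidate.strip()
    if expected == "number":
        return bool(NUMBER_RE.match(text))
    if expected == "date":
        return bool(DATE_RE.match(text))
    # For plain noun-phrase questions we accept any candidate at this stage.
    return True

print(check_type("29,028 feet", "number"))    # True
print(check_type("July 20, 1969", "date"))    # True
print(check_type("Mount Everest", "number"))  # False
```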

    AE is more difficult on the web than on a carefully selected document collection (such as the WSJ in TREC). The quality of text can vary widely, and pages can be poorly written. Worse, they can even state incorrect "factoids", intentionally or unintentionally. For example, "Elvis killed JFK" was found in numerous humor pages, and "the first American in space was John Glenn" was found in a number of "common misconceptions about astronomy" pages. Therefore the AE engine needs to be fault-tolerant.

    2.4 NLP Tools

    NLP has made much improvement over the years, blah. Many NLP tools are available for development. Here are some that our system uses.
  1. Parsing
    It is extremely useful to uncover the inherent phrasal structure of a sentence. Structural parsing using statistical techniques [?] has made a lot of headway recently, and many parsers report accuracies of 85% or more [?]. We use the Maximum Entropy-Inspired (MEI) parser from Charniak [?], one of the best available, to uncover the syntactic structure of the question and of candidate answers. The parser is among the most accurate statistical parsers available, reporting a ?% accuracy on the Penn Treebank WSJ corpus.
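    To make the notion of phrasal structure concrete, the sketch below displays a hand-written Penn-Treebank-style parse of a sample question, using NLTK's Tree class purely for display; the tree is illustrative, not actual MEI parser output.

```python
from nltk import Tree

# A hand-written Penn-Treebank-style parse of a sample question,
# illustrating the kind of phrasal structure a statistical parser produces.
parse = Tree.fromstring(
    "(SBARQ (WHNP (WP What))"
    " (SQ (VBZ is) (NP (NP (DT the) (NN height)) (PP (IN of) (NP (NNP Everest)))))"
    " (. ?))")

parse.pretty_print()

# Walk the tree to pick out the noun phrases, which are candidate answer anchors.
for subtree in parse.subtrees(lambda t: t.label() == "NP"):
    print(" ".join(subtree.leaves()))
```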

    Another useful piece of information about a sentence is the relationships between its words. The subject or object of a verb, for example, determines what is being asked in a question. The Link Parser [?], available from CMU, is one of the popular parsers that generate such relationships. It parses the sentence using the Link Grammar [?] and generates a wide variety of "links" between words. The parser uses a fixed set of grammar rules and a lexicon rather than statistical methods, and has a relatively low accuracy (75%). However, it is relatively fast and is good for some stuff...

  2. Lexical Synthesis and Analysis
    - search engines work only with the exact words in the query - no stemming (kill ≠ killed)
    - want to come up with the exact words for the query - need to be able to synthesize word forms (see the sketch after this list)
    - word analysis and synthesis with PC-KIMMO
    - what KIMMO does
    - examples
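    A minimal sketch of the word-synthesis idea. This is not PC-KIMMO itself; the inflection rules below are hand-written and purely illustrative of generating the surface forms a search engine needs to see.

```python
# Given a verb stem, generate surface forms for queries, since the search
# engine does no stemming. The rules here are illustrative only.

def synthesize_verb_forms(stem):
    """Generate common English surface forms for a verb stem."""
    forms = {stem}
    if stem.endswith("e"):
        forms.update({stem + "s", stem + "d", stem[:-1] + "ing"})
    else:
        forms.update({stem + "s", stem + "ed", stem + "ing"})
    return sorted(forms)

print(synthesize_verb_forms("kill"))     # ['kill', 'killed', 'killing', 'kills']
print(synthesize_verb_forms("revolve"))  # ['revolve', 'revolved', 'revolves', 'revolving']
```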
    
  3. Semantic Analysis
    The most popular software for semantic analysis is WordNet [?]. WordNet is blah...

    2.5 Information Retrieval on the Web

  4. Search Engines
    - many search engines - some good, some bad
    - boolean queries - Inktomi-based engines, AltaVista (AV)
      advantages: rich query operators (AV: NEAR, OR, AND), can combine multiple queries, fast
      disadvantages: poor coverage, most have problematic ranking
    - Google - wide coverage, good ranking, no boolean capability, limited phrase search, slower (> 1 sec) on queries involving stopwords
    - for a QAS, coverage is important, as well as good ranking
    
  5. Word Statistics
    - knowing which words are important can give us more powerful results
    - TFIDF, or estimate with IDF alone
    - querying the search engine frequently for these statistics is expensive
    - instead, we run our own mini search engine with a few encyclopedias as its corpus (see the sketch after this list)
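    A minimal sketch of estimating IDF from a small local corpus instead of hitting the search engine. The toy corpus and function name are illustrative only.

```python
import math
from collections import Counter

def build_idf(documents):
    """Compute inverse document frequency from a small local corpus
    (e.g. a few encyclopedia articles), so we do not have to query the
    search engine just to learn which words are rare."""
    df = Counter()
    for doc in documents:
        df.update(set(doc.lower().split()))
    n = len(documents)
    return {w: math.log(n / count) for w, count in df.items()}

# Illustrative toy corpus; the real system would use encyclopedia text.
corpus = [
    "everest is the highest mountain in the world",
    "the nile is the longest river in the world",
    "the pacific is the largest ocean in the world",
]
idf = build_idf(corpus)
print(sorted(idf, key=idf.get, reverse=True)[:3])  # rare, informative words first
```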
    

    3 The Mulder Web-QA System

    We have implemented a QAS called Mulder that ... In the following sections, we first present the underlying design principles of Mulder in section X, and then give an overview of the system in section X. We then look at the QF and AE components in more detail in sections X and X.

    Design Principles

    - retrieving useful pages from the web is an inherently difficult problem
    - our system needs to respond as quickly as possible, so we can only look at a limited number of pages
    - retrieving irrelevant web pages increases the noise in our system and degrades performance
    - therefore we choose to invest more effort in the QF engine.
    
    
    Our QF engine is based on the following observations about the web: [Examples needed to verify the points]
    In addition, we note that doing well on query formulation makes answer extraction easier. Focused queries and long phrases tend to produce immediate answers. For example, if a page contains the phrase "is the highest mountain in the world", the answer is very likely to appear just before that phrase. It therefore makes sense to issue a number of queries ranging from specific (long phrases) to general (keywords). Moreover, search engine performance has improved over the years, and a few indexed text searches can be faster than detailed analysis of hundreds of documents, so issuing a few queries (<10) is not a big concern. Our strategy is to make the best guess at what the answer will look like on the target web pages.
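    A sketch of the specific-to-general query strategy. The reformulation rules below are a simplified stand-in for Mulder's actual transformations; the function and stopword list are illustrative.

```python
def query_ladder(question):
    """Return queries ordered from most specific (quoted phrases) to most general (keywords)."""
    q = question.rstrip("?").strip()
    words = q.split()
    queries = []
    # Most specific: a quoted declarative phrase, e.g.
    # "What is the highest mountain in the world" -> "is the highest mountain in the world"
    if words[0].lower() in {"what", "who", "where", "when"} and len(words) > 2:
        queries.append('"' + " ".join(words[1:]) + '"')
    # Less specific: the quoted question body without the wh-word and verb.
    if len(words) > 3:
        queries.append('"' + " ".join(words[2:]) + '"')
    # Most general: plain keywords.
    stop = {"what", "who", "where", "when", "is", "was", "the", "in", "of", "a"}
    queries.append(" ".join(w for w in words if w.lower() not in stop))
    return queries

for q in query_ladder("What is the highest mountain in the world?"):
    print(q)
# "is the highest mountain in the world"
# "the highest mountain in the world"
# highest mountain world
```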

    Our system implements a voting scheme. Our assumption is that despite the noise, the truth will prevail, i.e., the correct answer will be found in more pages than any incorrect one.

    Architecture Overview

    Figure X shows the architecture of Mulder:
    [blurb of what does what, but not how] The system consists of the following components, which we examine in detail in the following subsections:

    Question Parsing

    Mulder uses the MEI parser to discover the structure of the question. The parser itself performs reasonably well, but we found that it has a relatively small vocabulary (it is trained on blah). The parser can guess the part-of-speech of a word based on statistical rules, but we found that the guesses are sometimes too wild. Given a proper lexical analyser, the parser should make much more intelligent guesses. For example, the parser knows the word "neurology", but it does not understand the word "neurological" and guesses that it is a noun. In addition, if a word is not known to the parser, the parser generates a list of all possible tags for the word, which amounts to about n tags. This increases the parser's search space, hence slowing it down.

    To remedy this problem, we added the PC-KIMMO word analyser to the parser. If a word is not known to the parser, the word is first given to KIMMO for recognition. If KIMMO recognizes the word, it produces a list of possible analyses for it. Otherwise, the word is assigned to be a noun: given that KIMMO has a large vocabulary, a word unknown to it is most likely a noun.
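    A rough sketch of this unknown-word handling. `parser_lexicon` and `kimmo_analyze` are hypothetical stand-ins for the MEI parser's vocabulary and the call into PC-KIMMO.

```python
parser_lexicon = {"neurology": ["NN"]}           # illustrative stand-in
def kimmo_analyze(word):                          # illustrative stand-in
    return ["JJ"] if word.endswith("ical") else []

def tags_for(word):
    """Return candidate part-of-speech tags for a word before parsing."""
    if word in parser_lexicon:
        return parser_lexicon[word]      # the parser already knows the word
    analyses = kimmo_analyze(word)       # ask the morphological analyser
    if analyses:
        return analyses                  # use its (much smaller) candidate set
    return ["NN"]                        # unknown words are most likely nouns

print(tags_for("neurology"))      # ['NN']
print(tags_for("neurological"))   # ['JJ']
print(tags_for("mulder"))         # ['NN']
```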

    After we obtain the structure of the question, we pass the question to the Link Parser to generate the relationships between words. In particular, we want to know which word is the subject or object of the question. This information is used in classifying the type of the question. The classifier is described next.

    Question Classification

    Being able to classify the type of the question helps answer extraction, because we then know what type of answer we are looking for. Mulder does not currently use any named-entity tagging techniques to recognize answers; therefore we perform a rather shallow classification - we only try to determine whether the question is looking for a noun phrase, a number, or a date. We use the obvious approach of looking at the question heading: "Where" questions look for noun phrases, "When" questions look for dates, etc. However, for some headings like "What", there are many possibilities. Therefore we use the subject and object information from the Link parser to find out what exactly is being asked. For example, the questions "What height is Everest", "What is Everest's height" and "What is the height of Everest?" are all equivalent, and in all cases the Link parser figures out that "height" is the subject or object of the auxiliary verb "is". This word is then given to WordNet to see if the noun has any magnitude-related hypernyms. If it does, we classify the question as asking for a number. For example, "measure" is a hypernym of "height". We use the same technique for date-related questions, such as "what year". In case of ambiguity, such as "capital", which can mean a city or money, we assume the question looks for a noun. Obviously, word sense disambiguation techniques would help the classifier.
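    A sketch of the hypernym check, using NLTK's WordNet interface as a stand-in for our WordNet lookup; the set of magnitude-related marker words is illustrative.

```python
from nltk.corpus import wordnet as wn

def asks_for_number(subject_word):
    """Classify a question as numerical if its subject/object noun has a
    magnitude-related hypernym in WordNet (e.g. above 'height')."""
    magnitude_markers = {"measure", "quantity", "amount", "magnitude"}
    for synset in wn.synsets(subject_word, pos=wn.NOUN):
        for hyper in synset.closure(lambda s: s.hypernyms()):
            if magnitude_markers & set(hyper.lemma_names()):
                return True
    return False

print(asks_for_number("height"))     # True: a magnitude-related hypernym is found
print(asks_for_number("president"))  # False: no such hypernym
```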
  6. Query Formulation
    The QF component converts the question into a series of search engine queries. As mentioned in section X, we want these transformations to retrieve higher quality web pages and [improve recall]. We perform several different kinds of transformations depending on the format of the question.
  7. Search Engine
    - Google - why
    - multiple queries
    - grab web pages
    - need to filter out some TREC-8 web pages
    
  8. Answer Extraction
    - tokenization - be careful about tokenization; some tags translate to periods.
    - summarization - keyword-based, weighted by the IDF scores of the keywords and the distance of the words from each other (see the scoring sketch after this item). formula. sort, pick top 20.
    - forward to parsers - multiple parsers in parallel, allow lower quality parses.
    While the parser is one of the faster statistical parsers, it is still quite slow, especially on sentences longer than 20-30 words (some data needed). Therefore we implemented a quality switch so that we can trade off time for quality. We tested the setting under ... and found that the trade-off isn't so bad blah...
    - find noun phrases, numerical phrases, or dates. Answers that occur in summaries that came from transformation-based queries are given the highest score, followed by those from phrase queries.
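    A sketch of keyword-based summary scoring, assuming an IDF table of the kind discussed in section 2.5. The weighting formula and window size below are stand-ins, not Mulder's actual formula.

```python
import math

def score_window(window, keywords, idf):
    """Score a window of text: sum of IDF of the query keywords it contains,
    discounted by how far apart those keywords are."""
    positions = [i for i, w in enumerate(window) if w in keywords]
    if not positions:
        return 0.0
    weight = sum(idf.get(window[i], 1.0) for i in positions)
    spread = positions[-1] - positions[0] + 1   # keywords close together score higher
    return weight / math.sqrt(spread)

def summarize(words, keywords, idf, size=40, top=20):
    """Slide a fixed-size window over the page text and keep the top-scoring windows."""
    windows = [words[i:i + size]
               for i in range(0, max(1, len(words) - size + 1), size // 2)]
    ranked = sorted(windows, key=lambda w: score_window(w, keywords, idf), reverse=True)
    return ranked[:top]
```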
    
  9. Decision Making
    This module determines the exact answer given all the evidence. It may involve a voting scheme in which different pieces of evidence are given different weights.
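    A sketch of one possible weighted-voting scheme: identical candidate answers are clustered and their weights summed, with each candidate weighted by the kind of query that produced it. The weights and names are illustrative, not Mulder's actual values.

```python
from collections import defaultdict

QUERY_WEIGHT = {"transformation": 3.0, "phrase": 2.0, "keyword": 1.0}  # illustrative

def vote(candidates):
    """candidates: list of (answer_text, query_kind) pairs extracted from summaries."""
    scores = defaultdict(float)
    for answer, kind in candidates:
        scores[answer.strip().lower()] += QUERY_WEIGHT.get(kind, 1.0)
    return max(scores.items(), key=lambda item: item[1])

candidates = [
    ("Mount Everest", "transformation"),
    ("mount everest", "phrase"),
    ("K2", "keyword"),
]
print(vote(candidates))  # ('mount everest', 5.0)
```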
  10. Result Presentation
    This component presents the answer to the user. It involves printing what Mulder thinks the answer is and how confident Mulder is about it. A list of web pages with the relevant evidence is also presented to the user.

4 Evaluation

We present n experiments that show the capabilities of Mulder. In the first set of experiments, we run Mulder on questions from a test set. In the second experiment, we [run some user study??].

Answering TREC-8 questions

5 Related Work

- Dragomir - Harabagiu...

6 Future Work