In the context of the web, the task of QF is to locate web pages that may contain the answer to the question, while AE singles out the relevant passages and pinpoints the correct answer. The next two subsections examine the nature of these two tasks and how Mulder attempts to solve them.
However, it is not easy to locate relevant documents with a search engine, let alone find the answer. An under-constrained query returns many irrelevant documents, far too many to process in real time, while an overly specific query may return no documents at all. The QA system is therefore faced with the daunting task of generating queries that are focused enough to retrieve a small set of highly relevant pages.
Existing QA systems mostly select relevant keywords using IR metrics such as TF-IDF. There are two reasons for this. First, less specific queries increase recall, so that an answer is more likely to be contained in one of the returned pages. Second, issuing multiple queries can be expensive. However, keyword queries tend to return very poor results, even with state-of-the-art search engines such as Google, because of the sheer number of documents on the web.
AE is more difficult on the web than on a carefully selected document set (such as the WSJ collection in TREC). The quality of text varies widely, and pages can be poorly composed. Worse, pages can state incorrect "factoids", intentionally or unintentionally: "Elvis killed JFK" was found in numerous humor pages, and "the first American in space was John Glenn" appeared in a number of "common misconceptions about astronomy" pages. The AE engine therefore needs to be fault-tolerant.
Another useful piece of information about a sentence is the relationship between its words. The subject or object of a verb, for example, determines what is being asked in a question. The Link Parser [?], available from CMU, is a popular parser that produces such relationships. It parses a sentence according to the Link Grammar [?] and generates a wide variety of "links" between words. The parser uses a fixed set of grammar rules and a fixed lexicon rather than statistical methods, and its accuracy is relatively low (75%). However, it is relatively fast, and the links it produces are sufficient for identifying the subject and object relationships we need.
Search engines match exact words only and perform no stemming ("kill" does not match "killed"), so Mulder must come up with the exact word forms to place in the query. This requires the ability to synthesize word forms. We use PC-KIMMO for word analysis and synthesis: given a base form, it generates the inflected variants we need, as sketched below.
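The following is a minimal sketch of the kind of word synthesis this step requires. PC-KIMMO's actual rule files and interface are not shown; the hand-written suffix rules below are purely illustrative stand-ins.

```python
# Illustrative stand-in for PC-KIMMO word synthesis (not its real interface).

def synthesize_verb_forms(base):
    """Generate plausible surface forms of a verb for exact-match queries."""
    forms = {base}
    if base.endswith("e"):
        forms.update({base + "d", base[:-1] + "ing", base + "s"})
    else:
        forms.update({base + "ed", base + "ing", base + "s"})
    return sorted(forms)

def expand_query_terms(terms):
    """Expand each query term into its synthesized variants."""
    return {t: synthesize_verb_forms(t) for t in terms}

if __name__ == "__main__":
    # "kill" != "killed" to a search engine without stemming, so the
    # inflected forms must be generated before issuing the query.
    print(expand_query_terms(["kill", "invent"]))
```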
There are many search engines to choose from, of varying quality. Boolean engines, such as the Inktomi-based engines and AltaVista (AV), offer powerful query facilities (NEAR, OR, AND), can combine multiple queries into one, and are fast; on the other hand, their coverage is poor and most have problematic ranking. Google offers wide coverage and good ranking, but has no boolean capability, limited phrase search, and is slower (over one second) on queries involving stopwords. For a QA system, coverage is important, as is good ranking.
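As a rough illustration of the difference in query interfaces, the sketch below renders the same reformulated phrases in the two styles. The exact operator syntax shown is illustrative of boolean engines in general, not a definitive description of either engine's API.

```python
# Sketch: rendering a reformulated query for two styles of search engine.

def google_query(phrase, keywords):
    """Google-style: a quoted phrase plus bare keywords, no boolean operators."""
    return '"%s" %s' % (phrase, " ".join(keywords))

def boolean_query(phrases, keywords):
    """AltaVista-style: alternative phrases joined with OR, required keywords with AND."""
    phrase_part = " OR ".join('"%s"' % p for p in phrases)
    keyword_part = " AND ".join(keywords)
    return "(%s) AND %s" % (phrase_part, keyword_part)

if __name__ == "__main__":
    print(google_query("the first American in space was", ["astronaut"]))
    print(boolean_query(["the first American in space was",
                         "was the first American in space"],
                        ["astronaut"]))
```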
Knowing which words are important allows us to form more powerful queries. Importance can be measured with TF-IDF, or estimated with IDF alone. Obtaining these statistics by repeatedly querying the search engine would be too expensive; instead we run our own mini search engine, using a few encyclopedias as its corpus, to estimate IDF locally. A minimal sketch of this estimation follows.
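The sketch below shows one way to estimate IDF from a local corpus; the tokenization and the handling of unseen words are simplifying assumptions, not Mulder's exact procedure.

```python
# Sketch: estimating IDF from a small local corpus (e.g. encyclopedia articles)
# instead of querying the search engine for document frequencies.

import math
import re
from collections import Counter

def build_idf(documents):
    """documents: list of raw text strings. Returns an idf(word) function."""
    df = Counter()
    for doc in documents:
        df.update(set(re.findall(r"[a-z]+", doc.lower())))
    n = len(documents)

    def idf(word):
        freq = df[word.lower()]
        # Unseen words get the maximum IDF: they are rare and therefore
        # likely to be the most discriminating query terms.
        return math.log(n / freq) if freq else math.log(n + 1)

    return idf

if __name__ == "__main__":
    idf = build_idf(["the mercury program put the first american in space",
                     "the apollo program landed on the moon"])
    for w in ["the", "mercury", "advertising"]:
        print(w, round(idf(w), 2))
```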
Retrieving useful pages from the web is an inherently difficult problem. Our system needs to respond as quickly as possible, so it can examine only a limited number of pages, and irrelevant pages add noise that degrades performance. We therefore choose to invest more effort in the QF engine. Our QF engine is based on the following observations about the web:
Our system implements a voting scheme. The assumption is that despite the noise, the truth will prevail, i.e., the correct answer will be found on more pages than any incorrect one.
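A minimal sketch of the voting idea, under the simplifying assumption that candidate answers are clustered by their normalized text; the candidates and scores shown are made up.

```python
# Sketch: voting over candidate answers extracted from many pages.

from collections import defaultdict

def vote(candidates):
    """candidates: list of (answer_text, score) pairs from different pages."""
    totals = defaultdict(float)
    for text, score in candidates:
        totals[text.strip().lower()] += score
    # The "truth prevails" assumption: the answer repeated across the most
    # (and best-scoring) pages accumulates the highest total.
    return max(totals.items(), key=lambda kv: kv[1])

if __name__ == "__main__":
    print(vote([("Alan Shepard", 0.9), ("John Glenn", 0.7),
                ("alan shepard", 0.8), ("Alan Shepard", 0.6)]))
```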
To remedy this problem, we added the PC-KIMMO word analyzer to the parser. If a word is not known to the parser, it is first given to KIMMO for recognition; if KIMMO recognizes the word, it produces a list of possible analyses. Otherwise, the word is assumed to be a noun: given that KIMMO has a large vocabulary, a word unknown to both systems is most likely a noun.
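The fallback logic can be sketched as follows. Here `kimmo_analyze` is a hypothetical stand-in for the PC-KIMMO recognizer, whose real interface is not shown.

```python
# Sketch: unknown-word handling with a KIMMO lookup and a noun default.

def kimmo_analyze(word):
    """Placeholder for PC-KIMMO: return (lemma, part_of_speech) analyses, or []."""
    toy_lexicon = {"killed": [("kill", "verb")],
                   "astronauts": [("astronaut", "noun")]}
    return toy_lexicon.get(word.lower(), [])

def analyze_unknown(word, parser_lexicon):
    """Resolve a word the Link Parser does not know."""
    if word in parser_lexicon:
        return parser_lexicon[word]
    analyses = kimmo_analyze(word)
    if analyses:
        return analyses
    # KIMMO has a large vocabulary, so a word unknown to both systems is
    # most likely a (proper) noun; default to noun.
    return [(word, "noun")]

if __name__ == "__main__":
    print(analyze_unknown("Mulder", parser_lexicon={}))
```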
After we obtain the structure of the question, we pass the question to the Link Parser to generate the relationships between words. In particular, we want to know which word is the subject or object of the question. This information is used in classifying the type of the question. The classifier is described next.
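As a rough illustration (not Mulder's actual rule set), the sketch below shows how the wh-word and the subject can determine a coarse answer type; the rules are illustrative assumptions only.

```python
# Sketch: rule-based question classification from the wh-word and subject.

def classify_question(tokens, subject=None):
    """Return a coarse answer type: 'numerical', 'date', 'person', 'location', 'noun'."""
    first = tokens[0].lower()
    if first == "how" and len(tokens) > 1 and \
            tokens[1].lower() in ("many", "much", "far", "tall", "long"):
        return "numerical"
    if first == "when":
        return "date"
    if first in ("who", "whom"):
        return "person"
    if first == "where":
        return "location"
    # "What/which X ..." questions take their type from the subject noun.
    if first in ("what", "which") and subject:
        return "noun:" + subject
    return "noun"

if __name__ == "__main__":
    print(classify_question("How many dogs pull a sled ?".split()))
    print(classify_question("Who was the first American in space ?".split()))
```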
Query reformulation is based on transformational grammar [?]: questions are transformed into assertions that are likely to appear verbatim in a page containing the answer. For example, "Who was the first American in space?" becomes the assertion stub "the first American in space was". Two example rules are sketched below.
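The two string-level rules below are a simplified sketch in this spirit; the real rules operate on the parse structure, and the patterns and verb suffixing shown here are illustrative assumptions.

```python
# Sketch: two transformation rules that turn questions into assertion stubs.

import re

def reformulate(question):
    """Turn a question into candidate assertions to use as search phrases."""
    q = question.rstrip("?").strip()
    candidates = []

    # Rule 1: "Who/What was X?" -> "X was"
    m = re.match(r"(?i)(who|what)\s+(was|is|were|are)\s+(.*)", q)
    if m:
        candidates.append("%s %s" % (m.group(3), m.group(2)))

    # Rule 2: "When did S V ...?" -> "S V-ed ..." (proper verb synthesis
    # would come from the morphological component rather than "+ed").
    m = re.match(r"(?i)when\s+did\s+(\w+)\s+(\w+)\s*(.*)", q)
    if m:
        subject, verb, rest = m.groups()
        candidates.append(("%s %sed %s" % (subject, verb, rest)).strip())

    return candidates or [q]

if __name__ == "__main__":
    print(reformulate("Who was the first American in space?"))
    print(reformulate("When did Nixon visit China?"))
```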
We use Google as our back-end search engine because of its wide coverage and good ranking. Mulder issues multiple reformulated queries and fetches the resulting web pages. Since our evaluation uses TREC-8 questions, we must also filter out web pages that mirror the TREC-8 document collection. A sketch of this retrieval step follows.
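The sketch below outlines the retrieval step under stated assumptions: `search`, `fetch`, and `is_trec8_mirror` are hypothetical placeholders, not a real Google API or a complete mirror filter.

```python
# Sketch: issue reformulated queries, filter TREC-8 mirrors, fetch pages in parallel.

from concurrent.futures import ThreadPoolExecutor

def search(query):
    """Placeholder: return a list of result URLs for a query."""
    return []

def fetch(url):
    """Placeholder: return the page text at url."""
    return ""

def is_trec8_mirror(url):
    # Illustrative filter only: a real filter needs a list of known mirrors.
    return "trec" in url.lower()

def retrieve_pages(queries, max_pages=20):
    urls = []
    for q in queries:
        urls.extend(u for u in search(q) if not is_trec8_mirror(u))
    urls = urls[:max_pages]
    with ThreadPoolExecutor(max_workers=8) as pool:
        return list(pool.map(fetch, urls))
```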
Tokenization requires care: some HTML tags are translated into sentence-ending periods. Summarization is keyword-based; candidate summaries are weighted by the IDF scores of the keywords they contain and by the distance between those keywords. The candidates are sorted and the top 20 are forwarded to the parsers. We run multiple parsers in parallel and allow lower-quality parses. Although the parser is one of the faster statistical parsers, it is still quite slow, especially on sentences longer than 20-30 words. We therefore implemented a quality switch that lets us trade quality for time, and in our tests the loss in quality was acceptable. Finally, we look for noun phrases, numerical phrases, or dates in the parsed summaries. Answers that occur in summaries retrieved by transformation-based queries are given the highest score, followed by those from phrase queries. A sketch of the summary scoring follows.
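The sketch below is one plausible instance of "weighted by IDF and by the distance between keywords" (sum of keyword IDF scores divided by the span the keywords cover); the exact weighting Mulder uses is not reproduced here, and the window width and step are assumptions.

```python
# Sketch: keyword-based summary scoring with IDF weights and keyword proximity.

def score_window(window_tokens, keywords, idf):
    """Score a candidate summary window; keywords is a set of lowercase terms."""
    positions = [i for i, t in enumerate(window_tokens) if t.lower() in keywords]
    if not positions:
        return 0.0
    weight = sum(idf(window_tokens[i]) for i in positions)
    span = positions[-1] - positions[0] + 1   # tighter keyword clusters score higher
    return weight / span

def best_summaries(tokens, keywords, idf, width=40, top=20):
    """Slide a fixed-width window over the page and keep the top-scoring windows."""
    step = width // 2
    windows = [tokens[i:i + width]
               for i in range(0, max(1, len(tokens) - width), step)]
    ranked = sorted(windows,
                    key=lambda w: score_window(w, keywords, idf),
                    reverse=True)
    return ranked[:top]
```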
We use a standard IR evaluation question set, the TREC-8 questions, which were originally posed against the TREC-8 document collection. That setting guarantees that every question has an answer in the collection; on the web we have no such guarantee, since not all answers are online. More specific questions, such as "How much did Mercury spend on advertising?", retrieve no relevant pages from Google, even after about 20 minutes of manual search effort.
Our experiments were designed to answer the following questions:

1. How well does Mulder answer questions, and how much effort does it save the user?
2. How well does Mulder fare against competitive systems such as Google and AskJeeves?

We use the following definitions:

- Hit: a hit on a search engine is an entry consisting of a URL and a summary snippet. A hit on Mulder is an answer together with the summary it was extracted from.
- Result: in systems without extraction, a result is any web page containing the answer. In Mulder, a result is an extracted answer on a hit.
- Precision: the number of results divided by the number of pages retrieved. In Mulder it is possible to extract multiple answers from the same page; we count only the highest-scoring answer on each page.
- Recall: the percentage of questions in the TREC-8 question set that are answered.
- Scan distance: the number of words the user must read before reaching an answer. Let text(hit_n, answer) be the number of words read before reaching the answer in the n-th hit; if the answer does not appear in the hit, it is the total number of words in the hit. Let text(doc(hit_n), answer) be the number of words read before reaching the answer in the web page pointed to by hit_n. If the answer occurs in the title or summary snippet of the n-th hit, the scan distance is text(hit_1, answer) + ... + text(hit_n, answer). For Google and AskJeeves, if the answer does not occur on any hit but does appear inside the web page pointed to by the n-th hit, the scan distance is text(hit_1, answer) + ... + text(hit_n, answer) + text(doc(hit_n), answer). That is, we assume the user does not view any of the documents pointed to by the first n-1 hits. Scan distance is an effective measure of QAS performance, since it approximates the time the user spends looking for the answer.
- Result distance: the number of results the user must scan through before reaching the answer.
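A minimal sketch of how scan distance could be computed from these definitions; the hit representation (snippet text plus page text) and whitespace tokenization are simplifications.

```python
# Sketch: scan distance over a ranked list of (snippet, page_text) hits.

def words_before(text, answer):
    """Words read before reaching the answer; all words if the answer is absent."""
    pos = text.lower().find(answer.lower())
    prefix = text if pos < 0 else text[:pos]
    return len(prefix.split())

def scan_distance(hits, answer, follow_pages=False, cutoff=5000):
    """follow_pages=True models Google/AskJeeves users opening the hit's page."""
    distance = 0
    for snippet, page in hits:
        distance += words_before(snippet, answer)
        if answer.lower() in snippet.lower():
            return min(distance, cutoff)
        if follow_pages and answer.lower() in page.lower():
            # The user opens only the page behind the hit that led them there,
            # then reads until the answer (documents of earlier hits are skipped).
            return min(distance + words_before(page, answer), cutoff)
    return cutoff
```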
We impose an upper bound on the number of words scanned; currently it is 5,000.
In the graph below, the y-axis is the scan distance incurred by the user (cut off at 5,000) and the x-axis represents the recall of the system. We compare the performance of four systems:
- Mulder: Mulder, fetching 20 pages and generating 40 summaries.
- Mulder-X: Mulder without extraction, fetching 20 pages.
- Google: Google, fetching 50 pages.
- AJ: AskJeeves, fetching all the pages it returns (fewer than 10).
Observations:
- AskJeeves performs poorly.
- Mulder-X beats Google while looking at only 20 pages.
- Mulder beats all of the other systems; at scan distance 0 it already achieves about 23% recall, rising to 40% at distance 120.
- Increasing Mulder's retrieval to 50 pages yields only a very minor improvement.
- Mulder's maximum recall is about 54%.
Observations:
- AskJeeves again performs poorly.
- At result distance 0, Mulder achieves 31% recall; that is, the answer occurs in the first result 31% of the time. Recall reaches about 45% at distance 5.
- At result distance 0, Mulder-X outperforms Google by 20%. The difference shrinks further down the ranking, as Google begins to retrieve pages similar to Mulder's.
- The 70-80th percentile contains questions for which most reformulated phrases retrieve nothing, which is reflected in the similar performance of Google and Mulder there.
- By this measure Mulder compares unfavorably with the systems without extraction, since for those systems any page containing the answer string (found by grep) counts as a result.