Research on Wrapper Induction for Information Extraction
The Internet provides access to numerous sources of useful information in textual form -- telephone directories, event listings, product catalogs, etc. Recently, there has been much interest in building systems that gather such information on a user's behalf. But because these information resources are formatted for use by people, mechanically extracting their content is difficult. Systems that draw on such resources typically rely on hand-coded wrappers, that is, customized procedures for information extraction.
We make three contributions. First, we introduce wrapper induction, a technique for automatically constructing wrappers from labeled examples of a resource's content. Second, we identify a class of wrappers that is efficiently learnable, yet expressive enough to handle 48 percent of a recently surveyed sample of actual Internet resources. Finally, we describe a method for heuristically labeling the examples used by the induction algorithm. We demonstrate, both empirically and analytically (using the PAC computational learning model), that automatic wrapper induction is feasible, and that the system degrades gracefully with imperfect labeling heuristics.
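To make the idea of inducing a wrapper from labeled examples concrete, here is a minimal sketch in Python. It assumes a simple left-right delimiter wrapper: each attribute is located by the string that precedes it and the string that follows it. The names `learn_wrapper` and `extract`, the sample pages, and the delimiter heuristic (longest common suffix and prefix of the surrounding text) are illustrative assumptions, not the wrapper class or the algorithm described in the paper.

```python
from os.path import commonprefix  # character-wise longest common prefix


def common_suffix(strings):
    """Longest string that every element of `strings` ends with."""
    return commonprefix([s[::-1] for s in strings])[::-1]


def learn_wrapper(examples):
    """Induce one (left, right) delimiter pair per attribute.

    `examples` is a list of (page, rows) pairs; `rows` holds the labeled
    tuples of attribute values in page order. Hypothetical simplification:
    each value is located by a sequential substring search.
    """
    k = len(examples[0][1][0])
    lefts = [[] for _ in range(k)]
    rights = [[] for _ in range(k)]
    for page, rows in examples:
        gaps, pos = [], 0  # text between consecutive labeled values
        for row in rows:
            for value in row:
                start = page.index(value, pos)
                gaps.append(page[pos:start])
                pos = start + len(value)
        gaps.append(page[pos:])
        for j in range(len(gaps) - 1):
            lefts[j % k].append(gaps[j])       # text just before the value
            rights[j % k].append(gaps[j + 1])  # text just after the value
    wrapper = [(common_suffix(lefts[i]), commonprefix(rights[i]))
               for i in range(k)]
    if any(not left or not right for left, right in wrapper):
        raise ValueError("examples do not yield non-empty delimiters")
    return wrapper


def extract(page, wrapper):
    """Apply the learned delimiters repeatedly until the page is exhausted."""
    rows, pos = [], 0
    while True:
        row = []
        for left, right in wrapper:
            start = page.find(left, pos)
            if start < 0:
                return rows
            start += len(left)
            end = page.find(right, start)
            if end < 0:
                return rows
            row.append(page[start:end])
            pos = end  # resume at the right delimiter; it may overlap the next left one
        rows.append(tuple(row))


# Made-up training page and labels, for illustration only.
train = [("<b>Congo</b> <i>242</i><br><b>Spain</b> <i>34</i><br>",
          [("Congo", "242"), ("Spain", "34")])]
wrapper = learn_wrapper(train)
result = extract("<b>Egypt</b> <i>20</i><br><b>Belize</b> <i>501</i><br>", wrapper)
```

In this sketch, a single labeled page is enough because both tuples share the same markup. With noisier or less regular pages, more examples would be needed before the common delimiters stop including fragments of the data itself.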
We tested our system on several actual Web sites. The graphs below indicate the number of examples needed to learn a satisfactory wrapper, as a function of increasing oracle noise, for two sites (OKRA and BigBook); see the paper for details.