Semantic Heterogeneity


This project was initiated by Alon Halevy before he left the University of Washington to join Google. Dan Suciu took over after Alon left.

The main aim of this proposal is to develop new methods and systems for generating semantic mappings between disparate data sources. The problem is challenging because the semantics of the data are fully understood only by the designers of the schemas and are not necessarily captured by the schemas themselves. Since a computer program has access only to the schema and some surrounding information, it can, at best, approximate the semantic mappings. In practical terms, the goal is to develop systems that considerably increase the productivity of human designers who need to produce semantic mappings.

Our approach to schema matching is based on the observation that matching tasks are often (at least partially) repetitive. We are frequently asked to match schemas in the same or overlapping domains where we have matched schemas previously. Repetition is especially common in maintenance tasks, where mappings need to be adjusted after schema changes. Just as humans improve at matching tasks over time, our goal is for a system to improve with experience. We capture previous experience as a corpus of schemas and matches. The corpus includes a collection of schemas in a particular domain and a set of mappings between some of the schemas in the corpus. Given such a collection of schemas in a domain, we construct models to recognize distinct elements such as relation and attribute names. Given two new schemas, we predict

Mike Cafarella was the PhD student supported by this grant.

Mike Cafarella investigated the feasibility of extracting a comprehensive structured database of "everything" from Web sources. A critical part of doing so is integrating data from the vast number of extractable Web sources. Data integration in a Web setting is unlike the traditional relational case in several ways, the most important of which are: (1) the data must be extracted and is possibly very dirty; (2) with millions of potential schemas, the user cannot be expected to know them a priori.

We found that we could address these problems with a single framework. In Web search, users are familiar with exploring datasets that are both extremely variable in quality and unknown before the search task begins. The user addresses this problem by iteratively issuing a text search, observing the quality and contents of the results, and modifying the search string and trying again. Our data integration system uses this search/refine sequence at three points:

  1. The user can SEARCH for a relevant table with a query string. For example, "presidents" yields a relevance-ranked list of tables, the first of which is most likely US presidents. We found that because a good data table may not actually contain the search query, the ranking function should use Web-derived statistical correlation between the search query and strings inside the table.
  2. In case the user wants to extend this table by unioning it with other similar tables, the system returns, for each table in the relevance-ranked list, the most-related tables in the corpus. For example, a table of computer science publications from 2008 will be grouped with similar tables from other years. We used a clustering algorithm that measures both content similarity and structural similarity between tables, but found that the problem is relatively insensitive to the similarity metric.
  3. If the user wants to extend the table by joining it to another, she provides another search query string. This second string indicates the "target" of the join. For example, if starting with a table of computer science publications and their authors, the user might indicate the "author" column, and then type "student". The system will then search for tables in the corpus that are about the topic "student" as well as the relevant content in the source table. We found effective algorithms for performing this "topic-directed join" that combine search-style relevance ranking with relational-style join tests.
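
As a rough illustration of the "topic-directed join" in step 3, the Python sketch below combines a search-style topic score with a relational-style join test. It is not the project's algorithm: the `Table` class, the scoring functions, the multiplicative combination of the two scores, and the sample data are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class Table:
    """Hypothetical stand-in for an extracted Web table."""
    name: str
    columns: dict[str, list[str]]  # column name -> cell values


def topic_score(table: Table, topic: str) -> float:
    """Search-style relevance: fraction of topic terms found in the
    table name or column names (a deliberately crude assumption)."""
    terms = topic.lower().split()
    if not terms:
        return 0.0
    text = " ".join([table.name, *table.columns]).lower()
    return sum(term in text for term in terms) / len(terms)


def join_score(source_values: list[str], table: Table):
    """Relational-style join test: the best fraction of source values
    that also appear in a single column of the candidate table."""
    src = {v.lower() for v in source_values}
    best = (0.0, None)
    if not src:
        return best
    for col, values in table.columns.items():
        overlap = len(src & {v.lower() for v in values}) / len(src)
        if overlap > best[0]:
            best = (overlap, col)
    return best


def topic_directed_join(source: Table, join_column: str,
                        topic: str, corpus: list[Table]):
    """Rank candidate tables by topic relevance times joinability."""
    ranked = []
    for cand in corpus:
        if cand is source:
            continue
        overlap, col = join_score(source.columns[join_column], cand)
        score = topic_score(cand, topic) * overlap
        if score > 0:
            ranked.append((score, cand.name, col))
    ranked.sort(key=lambda r: r[0], reverse=True)
    return ranked


# Made-up sample data for illustration only.
pubs = Table("cs publications", {
    "title": ["Paper A", "Paper B"],
    "author": ["Ada Lovelace", "Alan Turing"],
})
students = Table("phd student advisors", {
    "student": ["Grace Hopper", "Edsger Dijkstra"],
    "advisor": ["Alan Turing", "Ada Lovelace"],
})
actors = Table("movie actors", {"actor": ["Alan Turing"]})

result = topic_directed_join(pubs, "author", "student",
                             [pubs, students, actors])
print(result)
```

In this toy corpus only the student table survives: it matches the topic "student" and its "advisor" column joins with the source "author" column, while the actors table joins partially but is off topic. A real system would replace both scoring functions with the search ranking and join tests described above.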


Michael J. Cafarella: Extracting and Querying a Comprehensive Web Database. CIDR 2009

Michael J. Cafarella, Alon Y. Halevy, Nodira Khoussainova: Data Integration for the Relational Web. PVLDB 2(1): 1090-1101 (2009)

Luke McDowell, Michael J. Cafarella: Ontology-driven, unsupervised instance population. J. Web Sem. 6(3): 218-236 (2008)

Michael J. Cafarella, Jayant Madhavan, Alon Y. Halevy: Web-scale extraction of structured data. SIGMOD Record 37(4): 55-61 (2008)

Uncovering the Relational Web. Michael J. Cafarella, Alon Halevy, Yang Zhang, Daisy Zhe Wang, Eugene Wu. Proceedings of the Eleventh International Workshop on the Web and Databases (WebDB), June 2008. Vancouver, Canada.

WebTables: Exploring the Power of Tables on the Web. Michael J. Cafarella, Alon Halevy, Yang Zhang, Daisy Zhe Wang, Eugene Wu. Proceedings of VLDB 2008, August 2008. Auckland, New Zealand.

Navigating Extracted Data with Schema Discovery. Michael J. Cafarella, Dan Suciu, Oren Etzioni. Proceedings of the Tenth International Workshop on the Web and Databases (WebDB), June 2007. Beijing, China.

Supported by:

NSF IIS-0415175