Semantic Heterogeneity
Description
This project was initiated by Alon Halevy before he left the
University of Washington to join Google; Dan Suciu took over after
Alon's departure.
The main aim of this proposal is to develop new methods and systems for generating semantic
mappings between disparate data sources. The problem is challenging because the semantics of the
data are completely understood only by the designers of the schemas, and are not necessarily
captured by the schema itself. Since a computer program has access only to the schema and some
surrounding information, it can at best approximate the semantic mappings. In practical terms, the
goal is to develop systems that considerably increase the productivity of the human designers who
need to produce semantic mappings.
Our approach to schema matching is based on the observation that matching tasks are often (at
least partially) repetitive. We are frequently asked to match schemas in the same or overlapping
domains where we have matched schemas previously. Repetition is especially common in maintenance
tasks, where mappings need to be adjusted after schema changes. Just as humans improve at
matching tasks over time, our goal is for a system to mimic the same behavior. We capture previous
experience as a corpus of schemas and matches: a collection of schemas in a particular domain,
together with a set of mappings between some of them. Given such a collection, we construct
models that recognize distinct elements such as relation and attribute names. Given two new
schemas, we then use these models to predict the semantic mappings between them.
Mike Cafarella was the PhD student supported by this grant. He
investigated the feasibility of extracting a comprehensive structured
database of "everything" from Web sources. A critical part of doing so
is integrating data from the vast number of extractable Web sources.
Data integration in a Web setting differs from the traditional
relational case in several ways, the most important of which are: (1)
the data must be extracted and is possibly very dirty; (2) with
millions of potential schemas, the user cannot be expected to know
them a priori.
We found that we could address these problems within a single
framework. In Web search, users are accustomed to exploring datasets
that are both extremely variable in quality and unknown before the
search task begins. The user copes by iteratively issuing a text
search, observing the quality and contents of the results, then
modifying the search string and trying again. Our data integration
system uses this search/refine sequence at three points:
- The user can SEARCH for a relevant table with a query string. For
example, "presidents" yields a relevance-ranked list of tables, the
first of which is most likely US presidents. We found that, because a
good data table may not actually contain the search query, the ranking
function should use Web-derived statistical correlation between the
search query and the strings inside the table (see the first sketch
after this list).
- In case the user wants to extend this table by unioning it with
other similar tables, the system returns, for each table in the
relevance-ranked list, the most closely related tables in the corpus.
For example, a table of computer science publications from 2008 will
be grouped with similar tables from other years. We used a clustering
algorithm that measures both content and structural similarity between
tables, but found that the problem is relatively insensitive to the
similarity metric (see the second sketch after this list).
- If the user wants to extend the table by joining it to another, she
provides a second search query string, which indicates the "target" of
the join. For example, starting with a table of computer science
publications and their authors, the user might indicate the "author"
column and then type "student". The system then searches for tables in
the corpus that are both about the topic "student" and compatible with
the relevant column of the source table. We found effective algorithms
for performing this "topic-directed join" that combine search-style
relevance ranking with relational-style join tests (see the third
sketch after this list).
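First, a minimal sketch of correlation-based table ranking. The toy co-occurrence counts below
stand in for Web-derived statistics; the numbers, table contents, and function names are
illustrative assumptions, not the system's actual ranking function.

    import math

    def pmi(pair_count, x_count, y_count, total):
        """Pointwise mutual information between a query term and a table string."""
        if pair_count == 0:
            return 0.0
        return math.log((pair_count * total) / (x_count * y_count))

    def rank_tables(query, tables, pair_counts, term_counts, total):
        """Rank tables by summed correlation between query terms and cell
        strings, so a relevant table can score well even if it never
        contains the query string itself."""
        def score(table):
            return sum(
                pmi(pair_counts.get((q, cell), 0),
                    term_counts.get(q, 1),
                    term_counts.get(cell, 1),
                    total)
                for q in query.lower().split()
                for cell in table["cells"])
        return sorted(tables, key=score, reverse=True)

    # Toy statistics: "presidents" co-occurs with "washington" on the Web
    # far more often than with "toaster".
    term_counts = {"presidents": 1_000, "washington": 5_000, "toaster": 3_000}
    pair_counts = {("presidents", "washington"): 400, ("presidents", "toaster"): 2}
    tables = [{"name": "us_presidents", "cells": ["washington"]},
              {"name": "appliances", "cells": ["toaster"]}]
    print(rank_tables("presidents", tables, pair_counts, term_counts, 1_000_000))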
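Second, a minimal sketch of grouping union-compatible tables. The blend of content and
structural similarity and the threshold are illustrative assumptions; as noted above, the
results were relatively insensitive to the exact metric.

    def jaccard(a, b):
        a, b = set(a), set(b)
        return len(a & b) / len(a | b) if a | b else 0.0

    def table_similarity(t1, t2, w=0.5):
        """Blend structural similarity (column names) with content
        similarity (cell values)."""
        return (w * jaccard(t1["columns"], t2["columns"])
                + (1 - w) * jaccard(t1["cells"], t2["cells"]))

    def cluster_tables(tables, threshold=0.3):
        """Greedy clustering: join the first cluster whose seed table is
        similar enough, otherwise start a new cluster."""
        clusters = []
        for t in tables:
            for cluster in clusters:
                if table_similarity(t, cluster[0]) >= threshold:
                    cluster.append(t)
                    break
            else:
                clusters.append([t])
        return clusters

    # Publication tables from different years share a schema, so they
    # cluster together; the recipe table starts its own cluster.
    pubs_2007 = {"columns": ["title", "authors", "year"], "cells": ["sigmod", "2007"]}
    pubs_2008 = {"columns": ["title", "authors", "year"], "cells": ["vldb", "2008"]}
    recipes = {"columns": ["dish", "cook_time"], "cells": ["lasagna", "45min"]}
    for cluster in cluster_tables([pubs_2007, pubs_2008, recipes]):
        print([sorted(t["columns"]) for t in cluster])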
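Third, a minimal sketch of a "topic-directed join" under toy assumptions: the scoring functions,
the blend weight alpha, and the tiny corpus below are all hypothetical stand-ins for the actual
algorithms.

    def topic_relevance(table, topic):
        """Search-style test: fraction of the table's tokens that mention
        the topic string."""
        toks = [t.lower() for t in table["columns"] + table["cells"]]
        return sum(topic.lower() in t for t in toks) / max(1, len(toks))

    def join_compatibility(source_keys, table):
        """Relational-style test: fraction of the source column's values
        that appear in the candidate table, i.e. how well the two join."""
        return (len(set(source_keys) & set(table["cells"]))
                / max(1, len(set(source_keys))))

    def topic_directed_join(source_keys, topic, corpus, alpha=0.5):
        """Pick the corpus table that best combines both tests."""
        def score(t):
            return (alpha * topic_relevance(t, topic)
                    + (1 - alpha) * join_compatibility(source_keys, t))
        return max(corpus, key=score)

    authors = ["smith", "jones"]  # the indicated "author" column of the source table
    corpus = [
        {"columns": ["student", "advisor"], "cells": ["smith", "doe"]},     # on-topic, joins
        {"columns": ["student", "gpa"], "cells": ["lee", "3.9"]},           # on-topic, no join
        {"columns": ["city", "population"], "cells": ["paris", "2000000"]}, # off-topic
    ]
    print(topic_directed_join(authors, "student", corpus))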
Publications:
Michael J. Cafarella: Extracting and Querying a Comprehensive Web
Database. CIDR 2009.
Michael J. Cafarella, Alon Y. Halevy, Nodira Khoussainova: Data
Integration for the Relational Web. PVLDB 2(1): 1090-1101 (2009).
Luke McDowell, Michael J. Cafarella: Ontology-Driven, Unsupervised
Instance Population. J. Web Sem. 6(3): 218-236 (2008).
Michael J. Cafarella, Jayant Madhavan, Alon Y. Halevy: Web-Scale
Extraction of Structured Data. SIGMOD Record 37(4): 55-61 (2008).
Michael J. Cafarella, Alon Y. Halevy, Yang Zhang, Daisy Zhe Wang,
Eugene Wu: Uncovering the Relational Web. Proceedings of the Eleventh
International Workshop on the Web and Databases (WebDB), June 2008,
Vancouver, Canada.
Michael J. Cafarella, Alon Y. Halevy, Yang Zhang, Daisy Zhe Wang,
Eugene Wu: WebTables: Exploring the Power of Tables on the Web.
Proceedings of VLDB 2008, August 2008, Auckland, New Zealand.
Michael J. Cafarella, Dan Suciu, Oren Etzioni: Navigating Extracted
Data with Schema Discovery. Proceedings of the Tenth International
Workshop on the Web and Databases (WebDB), June 2007, Beijing, China.
Supported by:
NSF IIS-0415175