More on Data Management for XML

Alon Levy, University of Washington
May 9th, 1999.

In Jennifer Widom's note, she does an excellent job laying out the background for the XML revolution and some of the research issues that XML raises for the database community. She spells out several important research directions that we must address in order to enable efficient manipulation of XML data. The issues raised include storage and indexing, efficient processing, the definition of a data model and query language, and dealing with issues such as integrity constraints, views and triggers in the context of XML. I especially liked her mention of the area of Databases and Information Retrieval, which is likely to get much more attention in the XML world, and I wholly support her call for a common benchmark for XML and significant performance evaluation.

As she notes, many of these issues are already being studied in the community. In particular, some recent work on storage and indexing of XML data has significantly raised the bar in this area. The definition of a data model and query language(s) for XML is being pursued by several influential committees. Nonetheless, the points raised in her document can keep us busy for quite a while to come.

In this short response I make two points based on what I found missing in Jennifer's note. The first point concerns what I believe will be one of the main applications enabled by XML, namely data integration. XML is providing a platform for organizations to share data in a common syntax (not semantics yet!). In a sense, XML is making the large scale data integration that was envisioned in systems such as TSIMMIS and Information Manifold an imminent reality. However, as I discuss, the context of XML raises several challenges for database research.

Still, this is not satisfying (surely, nobody is surprised to hear that I'm raising new issues in data integration). The second point I raise is whether there is any fundamentally new issue that is raised by the context of XML. My answer is no and yes. I agree with the conclusion that emerges from Jennifer's note that while we have to reconsider many of the issues we face in database systems, there is no fundamentally new issue. However, I do think there is a new way we should be thinking about these problems. Specifically, I argue that we need to reconsider the issues with different measures of performance and complexity (and I do not mean this solely for theoretical analysis).

In database research we are accustomed to measuring our performance in terms of scaleup in the size of the data (a.k.a. data complexity) and in the size of the queries (query complexity). While the first measure is clearly the hallmark of our field, the second is key for tasks such as query optimization. I argue that we should validate our work against two new measures, described below: the number of XML data sources involved, and the degree of irregularity in the data.

Finally, since I have already caught your attention, I will also briefly survey how we are addressing some of these issues in the database group at the University of Washington.

Like Jennifer, I will resist inserting links to related work throughout this document (largely because it's a pain in HTML).

XML and Data Integration

Since XML provides a standard syntax for representing data, it is perceived to be a key enabling technology for the exchange of information on the WWW and within corporations. As a consequence, integration of XML data from multiple external sources is becoming a critical task. The reality is that XML, being only a syntax, advances the prospects of data integration only partially. We no longer have to implement wrappers to extract data from HTML pages, which was a very annoying task with very brittle results. However, XML without agreed-upon DTDs does nothing to support integration at the semantic level: the names and meanings of the tags used in XML documents are arbitrary. As a result, the emergence of XML is fueling activity in various communities to agree on DTDs.
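
To make the gap concrete, here is a small hypothetical example (the documents, tag names and mapping table are all invented for illustration, written as a Python sketch): two sources describe the same book under different vocabularies, and the correspondence between tags has to be supplied from outside, since XML itself says nothing about it.

    import xml.etree.ElementTree as ET

    # Two sources describe the same book under different tag vocabularies.
    source_a = ET.fromstring(
        "<book><title>Principles of Databases</title>"
        "<author>A. Author</author></book>")
    source_b = ET.fromstring(
        "<publication><name>Principles of Databases</name>"
        "<writer>A. Author</writer></publication>")

    # XML fixes the syntax, but the correspondence between tags must
    # still be supplied from outside (here, by hand).
    tag_mapping = {"publication": "book", "name": "title", "writer": "author"}

    def normalize(elem, mapping):
        # Recursively rename tags according to the mapping.
        elem.tag = mapping.get(elem.tag, elem.tag)
        for child in elem:
            normalize(child, mapping)
        return elem

    assert ET.tostring(normalize(source_b, tag_mapping)) == ET.tostring(source_a)

Writing such a mapping by hand is feasible for a handful of sources; the issues below are about what happens when it is not.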

Given this state of affairs, the prospect of integrating data from a large number of sources on the WWW, as envisioned in some of our research prototypes, is becoming a reality. However, several issues need to be considered to enable such integration. I mention just the main ones here; again, some of these are already being addressed by current research.

Languages for source descriptions: A key element of a data integration system is a language for describing the contents and capabilities of data sources. These descriptions provide the semantic mapping between the data in the source and the relations in the mediated schema. We need to develop such languages for describing XML sources. The main novel challenges are that (1) the kinds of data restructuring that occur in XML are richer than in relational data, (2) we need to scale up to a very large number of sources, and (3) we should exploit the knowledge conveyed by accompanying DTDs.

Query reformulation algorithms: We need to develop algorithms for efficiently reformulating user queries (posed on a mediated schema) to queries that refer to the data sources. The known techniques for reformulation from the relational case do not extend easily to languages for querying XML.

Translation among DTDs: There will be a significant need for tools that translate XML data conforming to one DTD into an XML document conforming to a different DTD (presumably with semantically related content); a small sketch of such a translation appears after this list.

Obtaining source descriptions: As the number of data sources grows, we can no longer write their content descriptions manually. We must develop methods for automatically (or at least semi-automatically) computing source descriptions for newly introduced XML data sources.
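
As a rough illustration of the translation problem (the third item above), consider the following sketch; the two DTDs, the element names and the regrouping rule are invented, and a real tool would have to derive such rules from the DTDs themselves rather than have them hard-coded.

    import xml.etree.ElementTree as ET

    # Hypothetical source DTD:  <!ELEMENT person (name, phone)>
    # Hypothetical target DTD:  <!ELEMENT person (name, contact)>
    #                           <!ELEMENT contact (phone)>
    def translate_person(person):
        # Regroup the flat <phone> element under a new <contact> element.
        target = ET.Element("person")
        target.append(person.find("name"))
        contact = ET.SubElement(target, "contact")
        contact.append(person.find("phone"))
        return target

    src = ET.fromstring("<person><name>Ann</name><phone>555-1234</phone></person>")
    print(ET.tostring(translate_person(src)).decode())
    # <person><name>Ann</name><contact><phone>555-1234</phone></contact></person>

Even this trivial regrouping goes beyond tag renaming, which is why translation among DTDs deserves tools of its own.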

Measures of Complexity

The research issues raised in Jennifer's note and here are enough to keep our community busy for quite a while. Without diminishing their importance, I would like to take a step back and ask whether there is something fundamentally new about our research when we address the XML challenge. Or perhaps we should just continue our work on the same problems we have been considering for semi-structured data (albeit with an annoyingly verbose syntax).

My answer is that for the most part, we will work on the same problems (and, as a community that is comfortable with SQL, we'll grow to love the XML syntax). However, as we do that, I argue that we need to consider two new measures of complexity of the techniques we develop (in addition to the standard data and query complexities). These two measures of complexity should drive both our experimental validation and our theoretical analyses.

Number of XML sources

The main uses of XML will involve a large number of XML data sources. Hence, as we develop new methods for manipulating XML data, we need to pay special attention to how they scale up in the number of data sources. Such considerations affect many aspects of our study of XML, but especially our algorithms for query reformulation, query optimization across XML data sources, and the design of query evaluation algorithms for this context.

Degree of irregularity in the data

We claim that the hallmark of XML (and of semi-structured data) is that it can deal with irregularities in the data. But we never clearly define what we mean by irregularity. I argue that we need to distinguish the different kinds of irregularities we encounter in XML data, and analyze our algorithms w.r.t. the degree and type of irregularity.

By far, the most prevalent type of irregularity is that objects have differing subsets of attributes from a possibly large set of attributes. A different kind of irregularity (which is harder to quantify) is that the attributes are structured differently in a tree rooted by the object. This is one of the reasons that languages for querying XML (or semi-structured data) allow regular path expressions.
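
A toy sketch of why regular path expressions help (the documents and the query encoding, a regex over tag paths, are invented for illustration): the author's name sits at different depths and under different intermediate elements in the two records, yet a single path pattern retrieves both.

    import re
    import xml.etree.ElementTree as ET

    # Two "irregular" records: the author's name appears at different
    # depths and under different intermediate elements.
    doc = ET.fromstring(
        "<bib>"
        "<book><author><name>Levy</name></author></book>"
        "<book><authors><author><first/><name>Widom</name></author></authors></book>"
        "</bib>")

    def match_path(root, path_regex):
        # Yield the elements whose root-to-element tag path matches the
        # given regular path expression (encoded here as a regex).
        pattern = re.compile(path_regex)
        def walk(elem, path):
            if pattern.fullmatch(path):
                yield elem
            for child in elem:
                yield from walk(child, path + "." + child.tag)
        yield from walk(root, root.tag)

    # One pattern tolerates both structures: bib.book.(anything)*.name
    for e in match_path(doc, r"bib\.book(\.\w+)*\.name"):
        print(e.text)          # prints: Levy, Widom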

The degree of irregularity in the data has major implications for how the data is most efficiently stored and indexed, for how it is described (e.g., by DataGuides), and, subsequently, for query processing.
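
For tree-shaped data, the essence of a DataGuide-style structural summary fits in a few lines; this is only a sketch of the idea, not the Lore implementation. Note that the summary grows with the number of distinct label paths, i.e., with the irregularity of the data, rather than with its size.

    import xml.etree.ElementTree as ET

    def label_paths(elem, prefix=""):
        # Collect the set of distinct root-to-element label paths;
        # for trees, this set is the heart of a DataGuide-style summary.
        path = prefix + "/" + elem.tag
        paths = {path}
        for child in elem:
            paths |= label_paths(child, path)
        return paths

    doc = ET.fromstring(
        "<bib><book><title/><author/></book>"
        "<book><title/><editor/></book></bib>")
    print(sorted(label_paths(doc)))
    # ['/bib', '/bib/book', '/bib/book/author', '/bib/book/editor', '/bib/book/title']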

Work at the University of Washington

The database group at the University of Washington is pursuing several projects that touch upon the manipulation of XML data. In June 1998 we began developing the Tukwila data integration system. Tukwila focuses on the query optimization and execution aspects of data integration. In particular, Tukwila is based on the idea of adaptive query processing, where query optimization and execution are interleaved, as the sketch below illustrates.
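
The following toy sketch conveys only the flavor of interleaving optimization with execution; it is not Tukwila's architecture, and the sources, the stale estimates and the greedy policy are all invented. The point is that decisions such as join order and early termination are revisited as actual cardinalities are observed.

    def nested_loop_join(left, right, attr):
        # Naive equi-join of two lists of dicts on a shared attribute.
        return [{**l, **r} for l in left for r in right if l[attr] == r[attr]]

    def adaptive_join_all(sources, estimates, attr):
        # Greedy re-planning: always join next the source with the
        # smallest *current* size estimate, record the sizes actually
        # observed, and stop early if the intermediate result is empty.
        remaining = set(sources)
        first = min(remaining, key=estimates.get)
        remaining.remove(first)
        result = list(sources[first])
        while remaining and result:
            name = min(remaining, key=estimates.get)
            remaining.remove(name)
            rows = list(sources[name])
            estimates[name] = len(rows)      # observed, not estimated
            result = nested_loop_join(result, rows, attr)
        return result

    catalogs = {
        "s1": [{"isbn": 1, "title": "A"}, {"isbn": 2, "title": "B"}],
        "s2": [{"isbn": 1, "price": 30}],
        "s3": [{"isbn": 1, "stock": 5}, {"isbn": 2, "stock": 0}],
    }
    stale_estimates = {"s1": 1000, "s2": 10, "s3": 100}
    print(adaptive_join_all(catalogs, stale_estimates, "isbn"))
    # [{'isbn': 1, 'price': 30, 'stock': 5, 'title': 'A'}]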

We are currently addressing the following issues:

Query execution for XML: We are adapting our query execution engine to handle XML data. In addition to designing a suitable algebra for manipulating XML, we are developing efficient algorithms for implementing the operators of the algebra.

Query reformulation: Building on our previous work in query reformulation, we are developing reformulation algorithms that scale up to a large number of data sources, and testing these algorithms experimentally.

Learning source descriptions: We are developing algorithms for automatically learning source descriptions (i.e., mappings between schemas of data sources and a mediated schema). Our approach is based on adapting algorithms recently developed in Machine Learning.

Intensional queries: We are developing algorithms for answering intensional queries to data integration systems (e.g., containment, equivalence and satisfiability queries).

Expressive power: We are studying the relative expressive power of a family of languages for querying semi-structured data (focusing on the effect of restructuring primitives on the expressive power).

Web-site management: We have developed the Tiramisu web-site management system. Like the Strudel system, Tiramisu takes a declarative approach to building web-sites. In contrast to Strudel, Tiramisu allows the implementation of a web-site to be carried out by an ensemble of cooperating web-site building tools, thus leveraging their specific advantages.


My thoughts on XML were influenced by the database group at UW: Corey Anderson, Anhai Doan, Marc Friedman, Zack Ives, Todd Millstein, Rachel Pottinger, Sumeet Sobti and Dan Weld; my Strudel cohorts: Mary Fernandez, Dana Florescu and Dan Suciu; and my colleagues across the lake: Adam Bosworth and Michael Rys.