Database Systems:
A Textbook Case of Research Paying Off

James N. Gray
Senior Researcher
Microsoft Corporation


Industry Profile

The database industry generated about $7 billion in revenue in 1994 and is growing at 35% per year. Among software industries, it is second only to operating system software. All of the leading corporations in this industry are US-based: IBM, Oracle, Sybase, Informix, Computer Associates, and Microsoft. In addition, there are two large specialty vendors, both also US-based: Tandem, selling over $1 billion per year of fault-tolerant transaction processing systems, and AT&T-Teradata, selling about $500 million per year of data mining systems.

In addition to these well-established companies, there is a vibrant group of small companies specializing in application-specific databases -- text retrieval, spatial and geographical data, scientific data, image data, and so on. An emerging group of companies offers object-oriented databases. Desktop databases are another important market, focused on extreme ease of use, small size, and disconnected operation.

A relatively modest federal research investment, complemented by an also-modest industrial research investment, has led directly to our nation's dominance of this key industry.

Historical Perspective

Companies began automating their back-office bookkeeping in the 1960s. COBOL and its record-oriented file model were the workhorses of this effort. Typically, a batch of transactions was applied to the old master tape, producing a new master tape and a printout for the next business day.

During this era, there was considerable experimentation with systems to manage an on-line database that could capture transactions as they happened, rather than in daily batches. At first these systems were ad hoc, but late in the decade "network" and "hierarchical" database products emerged. A network data model standard (DBTG) was defined, which formed the basis for most commercial systems during the 1970s. Indeed, in 1980 DBTG-based Cullinet was the leading software company.

However, there were problems with DBTG. It used a low-level, record-at-a-time procedural language: the programmer had to navigate through the database, following pointers from record to record. If the database was redesigned, as databases often are over the course of a decade, then all the old programs had to be rewritten.

The "relational" data model, enunciated by Ted Codd in a landmark 1970 article, was a major advance over DBTG. The relational model unified data and metadata so that there was only one form of data representation. It defined a non-procedural data access language based on algebra or logic. It was easier for end-users to visualize and understand than the pointers-and-records-based DBTG model. Programs could be written in terms of the "abstract model" of the data, rather than the actual database design; thus, programs were insensitive to changes in the database design.

The research community (both industry and university) embraced the relational data model and extended it during the 1970s. Most significantly, researchers showed that a high-level relational database query language could give performance comparable to the best record-oriented database systems. This research produced a generation of systems and people that formed the basis for IBM's DB2, Ingres, Sybase, Oracle, Informix, and others. The SQL relational database language was standardized between 1982 and 1986. By 1990, virtually all database systems provided an SQL interface (including network, hierarchical, and object-oriented database systems, in addition to relational systems).

Meanwhile the database research agenda moved on to geographically distributed databases and to parallel data access. Theoretical work on distributed databases led to prototypes which in turn led to products. Today, all the major database systems offer the ability to distribute and replicate data among nodes of a computer network.

Research of the 1980s also showed how to execute each of the relational data operators in parallel -- giving hundred-fold and thousand-fold speedups. The results of this research are now beginning to appear in the products of several major database companies.
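
A small sketch of that partitioned-parallelism idea, assuming in-memory rows and an invented (dept, amount) schema, is shown below; it uses Python's multiprocessing module only to illustrate the split-scan-merge pattern, not any particular vendor's engine.

    from multiprocessing import Pool

    rows = [("Sales", 100.0), ("Engineering", 250.0)] * 100_000  # (dept, amount)

    def partial_sum(partition):
        """Scan and aggregate one partition, as one 'processor' would."""
        totals = {}
        for dept, amount in partition:
            totals[dept] = totals.get(dept, 0.0) + amount
        return totals

    def merge(results):
        """Combine the per-partition aggregates into the final answer."""
        final = {}
        for partial in results:
            for dept, amount in partial.items():
                final[dept] = final.get(dept, 0.0) + amount
        return final

    if __name__ == "__main__":
        n_workers = 4
        chunk = len(rows) // n_workers
        partitions = [rows[i * chunk:(i + 1) * chunk] for i in range(n_workers)]
        partitions[-1].extend(rows[n_workers * chunk:])  # leftover rows
        with Pool(n_workers) as pool:
            print(merge(pool.map(partial_sum, partitions)))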

Three Case Studies

The government has funded a number of database research efforts from 1970 to the present. Projects at UCLA gave rise to Teradata and produced many excellent students. Projects at CCA (SDD-1, Daplex, Multibase, and HiPAC) pioneered distributed database technology and object-oriented database technology. Projects at Stanford created deductive database technology, data integration technology, and query optimization technology. Work at CMU gave rise to general transaction models and ultimately to the Transarc Corporation. There have been many other successes from AT&T, the University of Texas at Austin, Brown, Harvard, Maryland, Michigan, MIT, Princeton, and Toronto, among others. It is not possible to enumerate all the contributions here, but we shall highlight three representative research projects that had major impact on the industry.

Ingres

Project Ingres started at UC Berkeley in 1972. Inspired by Codd's work on the relational database model, several faculty members (Stonebraker, Rowe, Wong, and others) started a project to design and build a relational database system. In the course of this work, they invented a query language (QUEL), relational optimization techniques, a language binding technique, and interesting storage strategies. They also pioneered work on distributed databases.

The Ingres academic system formed the basis for the Ingres product now owned by Computer Associates. Students trained on Ingres went on to start or staff all the major database companies (AT&T, Britton Lee, HP, Informix, IBM, Oracle, Tandem, Sybase). The Ingres project went on to investigate distributed databases, database inference, active databases, and extensible databases. It was rechristened Postgres, which is now the basis of the digital library and scientific database efforts within the University of California system. Recently, Postgres spun off to become the basis for a new object-relational system from the startup Illustra Information Technologies.

System R

Codd's ideas were inspired by the problems IBM and its customers were having with the DBTG network data model and with IBM's own hierarchical database product, IMS. Codd's relational model was at first very controversial; people thought that the model was too simplistic and that it could never give good performance. IBM Research management took a gamble and chartered a 10-person effort to prototype a relational system based on Codd's ideas. This group produced a prototype, System R, that eventually grew into the DB2 product series. Along the way, the IBM team pioneered ideas in query optimization, data independence (views), transactions (logging and locking), and security (the grant-revoke model). In addition, the SQL query language from System R was the basis for the standard that emerged.

The System R group went on to investigate distributed databases (project R*) and object-oriented extensible databases (project Starburst). These research projects have pioneered new ideas and algorithms. The results appear in IBM's database products and in those of other vendors.

Gamma

During the 1970s there was great enthusiasm for database machines -- special-purpose computers that would be much faster than general-purpose systems running conventional databases. The problem was that general-purpose systems were improving at 50% per year, so it was difficult for customized systems to compete with them. By 1980, most researchers recognized the futility of special-purpose approaches, and the database machine community switched to research on using arrays of general-purpose processors and disks to process data in parallel. The University of Wisconsin was home to the major proponents of this idea in the US. Funded by the government and industry, they built a parallel database machine called Gamma. That system produced ideas and a generation of students who went on to staff all the database vendors. Today the parallel systems from IBM, Tandem, Oracle, Informix, Sybase, and AT&T all have a direct lineage from the Wisconsin research on parallel database systems. The use of parallel database systems for data mining is the fastest-growing component of the database server industry.

The Gamma project evolved into the Exodus project at Wisconsin (focusing on an extensible object-oriented database). Exodus has in turn evolved into the Paradise system, which combines object-oriented and parallel database techniques to represent, store, and quickly process huge earth-observing satellite databases.

The Future

Database systems continue to be a key aspect of Computer Science & Engineering today. Representing knowledge within a computer is one of the central challenges of the field. Database research has focused primarily on this fundamental issue. Many universities have faculty investigating these problems and offer courses that teach the concepts developed by this research program.

There continues to be active and valuable research on representing and indexing data, adding inference to data search, compiling queries more efficiently, executing queries in parallel, integrating data from heterogeneous data sources, analyzing performance, and extending the transaction model to handle long transactions and workflow (transactions that involve human as well as computer steps). The availability of very-large-scale (tertiary) storage devices has prompted the study of models for queries on very slow devices.

In addition, there is great interest in unifying object-oriented concepts with the relational model. New datatypes (image, document, drawing) are best viewed as the methods that implement them rather than the bytes that represent them. By adding procedures to the database system, one gets active databases, data inference, and data encapsulation. This object-oriented approach is an area of active research and ferment both in academe and in industry.
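
As a rough illustration of "procedures inside the database," the sketch below registers an ordinary function with the engine and invokes it from SQL, so the behavior travels with the data. SQLite via Python's sqlite3 module stands in for an object-relational system, and the table and function names are invented for this example.

    import sqlite3

    def area(width, height):
        """Method-like behavior for a 'drawing' datatype stored as two numbers."""
        return width * height

    conn = sqlite3.connect(":memory:")
    conn.create_function("area", 2, area)   # the procedure lives in the engine
    conn.execute("CREATE TABLE drawings (name TEXT, w REAL, h REAL)")
    conn.execute("INSERT INTO drawings VALUES ('poster', 3.0, 2.0)")

    for name, a in conn.execute("SELECT name, area(w, h) FROM drawings"):
        print(name, a)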

A very modest research investment produced American market dominance in a $7 billion industry -- creating the ideas for the current generation of products, and training the people who built those products. Continuing research is creating the ideas and training the people for the next product generation.


The Author

Dr. James N. Gray is one of the world's leading experts on database and transaction processing computer systems. Over the past three decades he has worked for IBM, Tandem, and Digital Equipment Corporation on systems including System R, SQL/DS, DB2, IMS-Fast Path, Encompass, NonStopSQL, Pathway, TMF, Rdb, DBI, and ACMS -- systems that have defined the progress of the field.

Dr. Gray holds doctorates from U.C. Berkeley and the University of Stuttgart. He is a Fellow of the ACM, a member of the National Research Council's Computer Science and Telecommunications Board, Editor-in-Chief of the VLDB Journal, Editor of the Morgan Kaufmann series on Data Management, editor of the Performance Handbook for Database and Transaction Processing Systems, co-author of Transaction Processing: Concepts and Techniques, and a member of Objectivity's Technical Advisory Board.

In 1995, following six months as McKay Fellow in U.C. Berkeley's Computer Science Division, he joined Microsoft to establish a San Francisco Bay Area laboratory focusing on making Microsoft data servers more scalable, manageable, and fault tolerant.


Copyright 1995, 1996, 1997 by James N. Gray and the Computing Research Association. Contributions to this document by Philip A. Bernstein, David DeWitt, Michael Stonebraker, and Jeffrey D. Ullman are appreciated.