Optimal Query Processing Meets Information Theory: from Proofs to Algorithms

Description

NSF III:Small 1907997: Optimal Query Processing Meets Information Theory: from Proofs to Algorithms

Acknowledgment:
This material is based upon work supported by the National Science Foundation under Grant No. 1907997.

Disclaimer:
Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

NSF Proposal Number: NSF III-1907997

The demand to analyze large datasets increases at a rate faster than our ability to compute meaningful insights from the data. Despite the advances in query processing, query optimization, and novel hardware, efficient query evaluation remains a major challenge for the database community to date. This project will explore news ways to design query processing algorithms, by using techniques from information theory. The field of information theory has been developed over the last seven decades, and has found many applications, ranging from communications, to machine learning, but it has not been used yet in a significant way to support query processing. This project transfers techniques from information theory to the task of processing massive amounts of data.

The project introduces a new paradigm for query evaluation, called "from proofs to algorithms", which maps an information-theoretic proof of a query's upper bounds into optimal query evaluation algorithms. The project pursues four thrusts. In the first thrust it uses information theory to derive new bounds on the query answer based on database statistics. The second thrust develops a mapping from the information theoretic inequalities to relational operators, and from the complete proof of the query size to a complete query plan with provably optimal runtime. The third thrust extends these techniques to finer grained statistics, including sketches, distinct values, heavy hitters. Finally, the fourth thrust extends the algorithm to queries with projections, aggregates, and group-by's.

Principal Investigator: Dan Suciu


Supported by:

NSF III:Small 1907997

Members

Batya Kenig,
Dan Suciu,
Alvin Cheung,
Babak Salimi,
Sudeepa Roy,
Mahmoud Abo Khamis,
Hung Ngo,
Laurel Orr,
Magdalena Balazinska,

Publications

Batya Kenig, Dan Suciu,
Integrity Constraints Revisited: From Exact to Approximate Implication
Unpublished ,2020
Note: To appear in ICDT, 2020
Alvin Cheung, Dan Suciu,
Handling Highly Contended OLTP Workloads Using Fast Dynamic Partitioning
In Proceedings of the 2020 International Conference on Management of Data, SIGMOD Conference 2020, online conference [Portland, OR, USA], June 14-19, 2020, pp. 527--542, 2020
Babak Salimi, Sudeepa Roy, Dan Suciu,
Causal Relational Learning
In Proceedings of the 2020 International Conference on Management of Data, SIGMOD Conference 2020, online conference [Portland, OR, USA], June 14-19, 2020, pp. 241--256, 2020
Mahmoud Abo Khamis, Hung Ngo, Dan Suciu,
Bag Query Containment and Information Theory
In Proceedings of the 39th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, PODS 2020, Portland, OR, USA, June 14-19, 2020, pp. 95--112, 2020
Batya Kenig, Dan Suciu,
Integrity Constraints Revisited: From Exact to Approximate Implication
In 23rd International Conference on Database Theory, ICDT 2020, March 30-April 2, 2020, Copenhagen, Denmark, pp. 18:1--18:20, 2020
Batya Kenig, Babak Salimi, Dan Suciu,
Mining Approximate Acyclic Schemes from Relations
In Proceedings of the 2020 International Conference on Management of Data, SIGMOD Conference 2020, online conference [Portland, OR, USA], June 14-19, 2020, pp. 297--312, 2020
Dan Suciu,
Probabilistic Databases for All
In Proceedings of the 39th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, PODS 2020, Portland, OR, USA, June 14-19, 2020, pp. 19--31, 2020
Laurel Orr, Magdalena Balazinska, Dan Suciu,
Sample Debiasing in the Themis Open World Database System
In Proceedings of the 2020 International Conference on Management of Data, SIGMOD Conference 2020, online conference [Portland, OR, USA], June 14-19, 2020, pp. 257--268, 2020