Query Compilation on Probabilistic Databases

Description

Sponsored by: NSF IIS 1115188

The goal of probabilistic databases is to manage large databases where the data is uncertain. Applications include Web-scale information extraction, RFID systems, scientific data management, biomedical data integration, business intelligence, data cleaning, approximate schema mappings, and data deduplication. Despite the huge demand and the intense recent research on probabilistic databases, no robust probabilistic database systems exists to date. The reason is that the probabilistic inference problem is, in general, intractable. Fortunately, in databases there are two distinct inputs to the probabilistic inference problem: the query and the database instance. This has led recently to the discovery of safe queries , which are queries that can be evaluated efficiently on any input database, and to new probabilistic inference algorithms that exploit the structure of the query. However, unsafe queries remain a major challenge in probabilistic databases.

This project studies novel algorithms for evaluating unsafe queries on probabilistic database, with guaranteed performance. It uses a novel approach, query compilation , which translates the query into one of four targets: OBDDs, FBDDs, d-DNNFs, and circuits using inclusion/exclusion nodes. The project pursues two thrusts: (1) It develops instance-dependent compilation techniques that significantly extends the reach of instance-independent techniques used in safe queries. (2) It develops approximate query compilation techniques , which always run efficiently, even on intractable query, instance pairs, by sacrificing accuracy. These algorithms are conservative, in the sense that they return correct probabilities in all cases when the input query, instance pair is tractable.

The Intellectual Merit of this project consists of new techniques for compiling queries into one of four compilation targets, OBDD, FBDD, d-DNNF, and inclusion/exclusion-based inference, using both exact inference (without performance guarantees), and approximate inference (with performance guarantees). It expands our understanding of probabilistic inference, and leads to practical approaches for probabilistic database engines. As Broader Impact, the project benefits a large class of applications that require general purpose management of uncertain data, ranging from large-scale information extraction systems, to scientific data management, to business intelligence. The project gradually incorporates topics from probabilistic data into into a curriculum for graduate level education; query compilation is already discussed in the PI's book on Probabilistic Databases ( http://dx.doi.org/10.2200/S00362ED1V01Y201105DTM016 ), a graduate-level textbook.

Principal Investigator: Dan Suciu


Supported by:

NSF IIS-1115188

Members

Wolfgang Gatterbauer,
Dan Suciu,
Paul Beame,
Eric Gribkoff,
Prasang Upadhyaya,
Magdalena Balazinska,
Paraschos Koutris,
JerryLi,
Sudeepa Roy,
Abhay Jha,

Publications

Wolfgang Gatterbauer, Dan Suciu,
Approximate Lifted Inference with Probabilistic Databases
Published in PVLDB, vol. 8 , no. 5 , pp. 629--640 , 2015
Paul Beame, Eric Gribkoff, Dan Suciu,
Symmetric Weighted First-Order Model Counting
In Proceedings of the 34th ACM Symposium on Principles of Database Systems, PODS 2015, Melbourne, Victoria, Australia, May 31 - June 4, 2015, pp. 313--328, 2015
Prasang Upadhyaya, Magdalena Balazinska, Dan Suciu,
Automatic Enforcement of Data Use Policies with DataLawyer
In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, Melbourne, Victoria, Australia, May 31 - June 4, 2015, pp. 213--225, 2015
Eric Gribkoff, Dan Suciu,
SlimShot: Probabilistic Inference for Web-Scale Knowledge Bases
Unpublished ,2015
Paraschos Koutris, Dan Suciu,
A Dichotomy on the Complexity of Consistent Query Answering for Atoms with Simple Keys
In ICDT, 2014
Paul Beame, JerryLi, Sudeepa Roy, Dan Suciu,
Model Counting of Query Expressions: Limitations of Propositional Methods
In ICDT, 2014
Wolfgang Gatterbauer, Dan Suciu,
Oblivious Bounds on the Probability of Boolean Functions
Published in ACM TODS, vol. 39 , no. 1 , pp. 191-208 , 2014
Eric Gribkoff, Dan Suciu,
Lifted Probabilistic Inference: A Guide for the Database Researcher
Published in IEEE Data Eng. Bull., vol. 37 , no. 3 , pp. 6--17 , 2014
Paul Beame, JerryLi, Sudeepa Roy, Dan Suciu,
Lower Bounds for Exact Model Counting and Applications in Probabilistic Databases
In UAI, pp. 52-61, 2013
Abhay Jha, Dan Suciu,
On the tractability of query compilation and bounded treewidth
In ICDT, pp. 249-261, 2012
Abhay Jha, Dan Suciu,
Knowledge compilation meets database theory: compiling queries to decision diagrams
In ICDT, pp. 162-173, 2011
Wolfgang Gatterbauer, Abhay Jha, Dan Suciu,
Dissociation and Propagation for Efficient Query Evaluation over Probabilistic Databases
In MUD, 2010