Management of Probabilistic Data

Description

Management of Probabilistic Data

NSF Proposal Number: 0713576

NSF Abstract page: http://www.nsf.gov/awardsearch/showAward.do?AwardNumber=0713576

Databases and data management tools have a deterministic semantics: a data item either is in the data set or is not. But when data comes from multiple sources, or is extracted automatically, it often contains a variety of imprecisions that are difficult to model using the standard deterministic semantics: the same data item may have different representations in different sources and matching algorithms are imprecise; schemas differ across sources and schema matching tools are imprecise or incomplete or both; data at different sources may hold contradictory information; finally, some data, such as sensor data, is inherently probabilistic and hence imprecise. This project represents all sources of imprecision in a single uniform way, as data with a probabilistic semantics, and extendeds today's data management tools to manage efficiently data with probabilistic semantics.

Re-designing databases to handle probabilistic data is a daunting task. This project studies two problems that lie at the core of probabilistic data management: the complexity of the query evaluation problem on probabilistic databases, and the view materialization problem (deciding whether a view can be materialized and whether it can be used in other query plans). The results of this research will consists of a range of fundamental techniques to be used in a general purpose probabilistic query processor.

Intellectual Merits . The project makes new contributions that lie at the intersection of several disparate fields: logic, probability theory, knowledge representation, and traditional query processing and optimization. The project enhances the understanding of the query evaluation problem on probabilistic data, develops new algorithms for efficiently evaluating such queries and for materializing views, while leveraging existing database technology.

Broader Impact . Searching large information spaces (the Web; large collections of scientific databases; Homeland Security data) is one the new and most challenging frontiers in Computer Science. The innovation that is needed to support complex searches that scale to large and heterogeneous information spaces, has to come from the data management research community. This project makes contributions to broaden our ability to search large information spaces. If successful, the project will be one of the pieces that will help data management technology undergo a new paradigm shift, from supporting complex queries with deterministic semantics, to supporting complex explorations with probabilistic semantics.

Principal Investigator: Dan Suciu

Students, past and current:

Related Project: MystiQ

The safe-query compiler for Unions of Conjunctive Queries, written in ocaml: download


Supported by:

NSF IIS-0713576 and a Gift from Microsoft

Members

Paul Beame,
JerryLi,
Sudeepa Roy,
Dan Suciu,
Abhay Jha,
Alexandra Meliou,
Wolfgang Gatterbauer,
Chris Re,
Nilesh Dalvi,
Karl Schnaitter,
Bill Howe,
Kate Moore,
Vibhor Rastogi,
Magdalena Balazinska,
Evan Welbourne,
Nodira Khoussainova,
Daniel Li,
Gaetano Borriello,

Publications

Paul Beame, JerryLi, Sudeepa Roy, Dan Suciu,
Model Counting of Query Expressions: Limitations of Propositional Methods
In ICDT, 2014
Paul Beame, JerryLi, Sudeepa Roy, Dan Suciu,
Lower Bounds for Exact Model Counting and Applications in Probabilistic Databases
In UAI, pp. 52-61, 2013
Abhay Jha, Dan Suciu,
On the tractability of query compilation and bounded treewidth
In ICDT, pp. 249-261, 2012
Alexandra Meliou, Wolfgang Gatterbauer, Dan Suciu,
Reverse Data Management
Published in PVLDB, vol. 4 , no. 12 , pp. 1490-1493 , 2011
Abhay Jha, Dan Suciu,
Knowledge compilation meets database theory: compiling queries to decision diagrams
In ICDT, pp. 162-173, 2011
Chris Re, Dan Suciu,
Understanding Cardinality Estimation using Entropy Maximization
In PODS, 2010
Nilesh Dalvi, Dan Suciu,
The Dichotomy of Probabilistic Inference for Unions of Conjunctive Queries
Unpublished ,2010
Note: to appear in JACM; preliminary version appeared in PODS 2010
Nilesh Dalvi, Karl Schnaitter, Dan Suciu,
Computing query probability with incidence algebras
In PODS, pp. 203-214, 2010
Nilesh Dalvi, Karl Schnaitter, Dan Suciu,
Computing Query Probability with Incidence Algebras
University of Washington, Technical Report, UW-CSE-10-03-02, 2010
Bill Howe, Dan Suciu,
Embracing Uncertainty in Large-Scale Computational Astrophysics
In MUD, 2009
Kate Moore, Vibhor Rastogi, Chris Re, Dan Suciu,
Query Containment of Tier-2 Queries over a Probabilistic Database
In MUD, 2009
Chris Re, Dan Suciu,
General Database Statistics Using Entropy Maximization
In DBPL, pp. 84-99, 2009
Nilesh Dalvi, Chris Re, Dan Suciu,
Queries and Materialized Views on Probabilistic Databases
Unpublished ,2009
Note: to appear in JCSS
Nilesh Dalvi, Chris Re, Dan Suciu,
Probabilistic Databases: Diamonds in the Dirt (Extended Version)
Unpublished ,2009
Nilesh Dalvi, Chris Re, Dan Suciu,
Probabilistic Databases: Diamonds in the Dirt
Published in CACM, vol. 52 , no. 7 , pp. 86-96 , 2009
Chris Re, Dan Suciu,
Approximate Lineage for Probabilistic Databases
In VLDB, 2008
Chris Re, Dan Suciu,
Management of Data with Uncertainties
In CIKM, pp. 3-8, 2008
Chris Re, Magdalena Balazinska, Dan Suciu,
Event Queries on Correlated Probabilistic Streams
In SIGMOD, 2008
Abhay Jha, Vibhor Rastogi, Dan Suciu,
Evaluating Queries in the Presence of Soft Key Constraints
In PODS, 2008
Evan Welbourne, Nodira Khoussainova, Daniel Li, Magdalena Balazinska, Gaetano Borriello, Dan Suciu,
Cascadia: A System for Specifying, Detecting, and Managing RFID Events
In MobySYS, 2008
Dan Suciu,
Probabilistic Databases
Published in SIGACT News, vol. 39 , no. 2 , pp. 111-124 , June , 2008
Chris Re, Nilesh Dalvi, Dan Suciu,
Efficient Top-k Query Evaluation on Probabilistic Data
In ICDE, 2007
Nilesh Dalvi, Dan Suciu,
Management of Probabilistic Data: Foundations and Challenges
In PODS, pp. 1-12, 2007
Note: (invited talk)
Chris Re, Dan Suciu,
Efficient Evaluation of HAVING Queries on a Probabilistic Database
In Proceedings of DBPL, 2007
Dan Suciu,
Tutorial at the EDBT Summer School: Techniques for managing probabilistic data
Unpublished ,2007
Nilesh Dalvi, Chris Re, Dan Suciu,
Query Evaluation on Probabilistic Databases
Published in IEEE Data Engineering Bulletin, vol. 29 , no. 1 , pp. 25-31 , 2006
Nodira Khoussainova, Magdalena Balazinska, Dan Suciu,
Towards correcting input data errors probabilistically using integrity constraints
In MobiDB, pp. 43-50, 2006
Nilesh Dalvi, Dan Suciu,
Efficient Query Evaluation on Probabilistic Databases
In VLDB, 2004