Management of Probabilistic Data
Description
NSF IIS-0713576: Management of Probabilistic Data
Acknowledgment:
This material is based upon work supported by the National Science Foundation under Grant
No. (NSF grant number).
Disclaimer:
Any opinions, findings, and conclusions or recommendations expressed in this material are
those of the author(s) and do not necessarily reflect the views of the National Science
Foundation.
NSF Proposal Number: 0713576
NSF Abstract page:
http://www.nsf.gov/awardsearch/showAward.do?AwardNumber=0713576
Duration:
Databases and data management tools have a deterministic semantics: a
data item either is in the data set or is not. But when data comes
from multiple sources, or is extracted automatically, it often
contains a variety of imprecisions that are difficult to model using
the standard deterministic semantics: the same data item may have
different representations in different sources and matching algorithms
are imprecise; schemas differ across sources and schema matching tools
are imprecise or incomplete or both; data at different sources may
hold contradictory information; finally, some data, such as sensor
data, is inherently probabilistic and hence imprecise. This project
represents all sources of imprecision in a single uniform way, as data
with a probabilistic semantics, and extendeds today's data management
tools to manage efficiently data with probabilistic semantics.
Re-designing databases to handle probabilistic data is a daunting
task. This project studies two problems that lie at the core of
probabilistic data management: the complexity of the query evaluation
problem on probabilistic databases, and the view materialization
problem (deciding whether a view can be materialized and whether it
can be used in other query plans). The results of this research will
consists of a range of fundamental techniques to be used in a general
purpose probabilistic query processor.
Intellectual Merits
. The project makes new contributions that
lie at the intersection of several disparate fields: logic,
probability theory, knowledge representation, and traditional query
processing and optimization. The project enhances the understanding
of the query evaluation problem on probabilistic data, develops new
algorithms for efficiently evaluating such queries and for
materializing views, while leveraging existing database technology.
Broader Impact
. Searching large information spaces (the Web;
large collections of scientific databases; Homeland Security data) is
one the new and most challenging frontiers in Computer Science. The
innovation that is needed to support complex searches that scale to
large and heterogeneous information spaces, has to come from the data
management research community. This project makes contributions to
broaden our ability to search large information spaces. If
successful, the project will be one of the pieces that will help data
management technology undergo a new paradigm shift, from supporting
complex queries with deterministic semantics, to supporting complex
explorations with probabilistic semantics.
Principal Investigator:
Dan Suciu
Students, past and current:
-
Christopher Re
(graduated in 2009; representations and approximations of views on
probabilistic databases)
-
Abhay Jha (query evalution techniques over tables with soft key
constraints; dissociation of unsafe queries into safe queries)
-
Vibhor Rastogi (graduated in 2010; worked on the complexity
aspects of query evaluation over soft key constraints)
-
Nodira
Koussainova
(applications of materialized views to RFID data)
-
Wolfgang
Gatterbauer
(dissociation of unsafe queries into safe
queries)
Related
Project:
MystiQ
The safe-query compiler for Unions of Conjunctive Queries, written in ocaml:
download
Supported by:
NSF IIS-0713576 and a Gift from Microsoft
Members
Paul Beame,
JerryLi,
Sudeepa Roy,
Dan Suciu,
Abhay Jha,
Alexandra Meliou,
Wolfgang Gatterbauer,
Chris Re,
Nilesh Dalvi,
Karl Schnaitter,
Bill Howe,
Kate Moore,
Vibhor Rastogi,
Magdalena Balazinska,
Evan Welbourne,
Nodira Khoussainova,
Daniel Li,
Gaetano Borriello,
Publications
Paul Beame, JerryLi, Sudeepa Roy, Dan Suciu,
Model Counting of Query Expressions: Limitations of Propositional Methods
In ICDT, 2014
Paul Beame, JerryLi, Sudeepa Roy, Dan Suciu,
Lower Bounds for Exact Model Counting and Applications in Probabilistic Databases
In UAI, pp. 52-61, 2013
Abhay Jha, Dan Suciu,
On the tractability of query compilation and bounded treewidth
In ICDT, pp. 249-261, 2012
Alexandra Meliou, Wolfgang Gatterbauer, Dan Suciu,
Reverse Data Management
Published in PVLDB, vol. 4 , no. 12 , pp. 1490-1493 , 2011
Abhay Jha, Dan Suciu,
Knowledge compilation meets database theory: compiling queries to decision diagrams
In ICDT, pp. 162-173, 2011
Chris Re, Dan Suciu,
Understanding Cardinality Estimation using Entropy Maximization
In PODS, 2010
Nilesh Dalvi, Dan Suciu,
The Dichotomy of Probabilistic Inference for Unions of Conjunctive Queries
Unpublished ,2010
Note: to appear in JACM; preliminary version appeared in PODS 2010
Nilesh Dalvi, Karl Schnaitter, Dan Suciu,
Computing query probability with incidence algebras
In PODS, pp. 203-214, 2010
Nilesh Dalvi, Karl Schnaitter, Dan Suciu,
Computing Query Probability with Incidence Algebras
University of Washington,
Technical Report, UW-CSE-10-03-02, 2010
Bill Howe, Dan Suciu,
Embracing Uncertainty in Large-Scale Computational Astrophysics
In MUD, 2009
Kate Moore, Vibhor Rastogi, Chris Re, Dan Suciu,
Query Containment of Tier-2 Queries over a Probabilistic Database
In MUD, 2009
Chris Re, Dan Suciu,
General Database Statistics Using Entropy Maximization
In DBPL, pp. 84-99, 2009
Nilesh Dalvi, Chris Re, Dan Suciu,
Queries and Materialized Views on Probabilistic Databases
Unpublished ,2009
Note: to appear in JCSS
Nilesh Dalvi, Chris Re, Dan Suciu,
Probabilistic Databases: Diamonds in the Dirt (Extended Version)
Unpublished ,2009
Nilesh Dalvi, Chris Re, Dan Suciu,
Probabilistic Databases: Diamonds in the Dirt
Published in CACM, vol. 52 , no. 7 , pp. 86-96 , 2009
Chris Re, Dan Suciu,
Approximate Lineage for Probabilistic Databases
In VLDB, 2008
Chris Re, Dan Suciu,
Management of Data with Uncertainties
In CIKM, pp. 3-8, 2008
Chris Re, Magdalena Balazinska, Dan Suciu,
Event Queries on Correlated Probabilistic Streams
In SIGMOD, 2008
Abhay Jha, Vibhor Rastogi, Dan Suciu,
Evaluating Queries in the Presence of Soft Key Constraints
In PODS, 2008
Evan Welbourne, Nodira Khoussainova, Daniel Li, Magdalena Balazinska, Gaetano Borriello, Dan Suciu,
Cascadia: A System for Specifying, Detecting, and Managing RFID Events
In MobySYS, 2008
Dan Suciu,
Probabilistic Databases
Published in SIGACT News, vol. 39 , no. 2 , pp. 111-124 , June , 2008
Chris Re, Nilesh Dalvi, Dan Suciu,
Efficient Top-k Query Evaluation on Probabilistic Data
In ICDE, 2007
Nilesh Dalvi, Dan Suciu,
Management of Probabilistic Data: Foundations and Challenges
In PODS, pp. 1-12, 2007
Note: (invited talk)
Chris Re, Dan Suciu,
Efficient Evaluation of HAVING Queries on a Probabilistic Database
In Proceedings of DBPL, 2007
Dan Suciu,
Tutorial at the EDBT Summer School: Techniques for managing probabilistic data
Unpublished ,2007
Nilesh Dalvi, Chris Re, Dan Suciu,
Query Evaluation on Probabilistic Databases
Published in IEEE Data Engineering Bulletin, vol. 29 , no. 1 , pp. 25-31 , 2006
Nodira Khoussainova, Magdalena Balazinska, Dan Suciu,
Towards correcting input data errors probabilistically using integrity constraints
In MobiDB, pp. 43-50, 2006
Nilesh Dalvi, Dan Suciu,
Efficient Query Evaluation on Probabilistic Databases
In VLDB, 2004