To facilitate collaboration over sensitive data, we developed DataSynthesizer, a tool that takes a sensitive dataset as input and generates a structurally and statistically similar synthetic dataset with strong privacy guarantees. The data owners need not release their data, while potential collaborators can begin developing models and methods with some confidence that their results will work similarly on the real dataset. The distinguishing feature of DataSynthesizer is its usability: the data owner need not make any assumptions, set any parameters, or take on any liability to start generating and sharing data safely and effectively.
DataSynthesizer consists of three high-level modules: DataDescriber, DataGenerator, and ModelInspector. DataDescriber investigates the data types, correlations, and distributions of the attributes in the private dataset and produces a model of the data, adding noise to the model to preserve privacy. DataGenerator samples from the noisy model and outputs synthetic data. ModelInspector shows an intuitive description of the model computed by DataDescriber, allowing the data owner to evaluate the results of the summarization process and optionally adjust parameters to fine-tune it.
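The describe-then-generate flow can be illustrated with a minimal sketch for a single categorical attribute. This is not DataSynthesizer's API: the function names `describe` and `generate`, the `epsilon` parameter, and the use of Laplace noise on histogram counts are assumptions made for illustration. The sketch also assumes that the attribute's domain is public, that neighboring datasets differ by adding or removing one record (so each count has sensitivity 1), and it ignores correlations between attributes, which the real DataDescriber does take into account.

```python
import random
from collections import Counter


def describe(values, domain, epsilon, rng):
    """Hypothetical stand-in for a describer: build a noisy histogram.

    Laplace noise with scale 1/epsilon is added to each count, then the
    result is clamped at zero and normalized into a probability model.
    """
    counts = Counter(values)
    scale = 1.0 / epsilon
    noisy = {}
    for category in domain:
        # The difference of two exponential draws is a Laplace(0, scale) draw.
        noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
        noisy[category] = max(counts.get(category, 0) + noise, 0.0)
    total = sum(noisy.values())
    if total == 0.0:
        # Every noisy count was clamped to zero: fall back to a uniform model.
        return {category: 1.0 / len(domain) for category in domain}
    return {category: value / total for category, value in noisy.items()}


def generate(model, n, rng):
    """Hypothetical stand-in for a generator: sample n values from the model."""
    categories = list(model)
    weights = [model[category] for category in categories]
    return rng.choices(categories, weights=weights, k=n)


if __name__ == "__main__":
    rng = random.Random(0)
    private_column = ["yes", "no", "yes", "yes", "no", "unknown"]
    model = describe(private_column, ["yes", "no", "unknown"], epsilon=1.0, rng=rng)
    print(model)                      # what an inspector would show the data owner
    print(generate(model, 10, rng))   # synthetic values; the raw column is never released
```

Only the noisy model is passed to `generate`, which mirrors the separation described above: the synthetic data depends on the private dataset solely through the noised summary.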
We have been using DataSynthesizer in an urban science context, where sharing sensitive, legally encumbered data between agencies and with outside collaborators is reported as a major obstacle to data-driven decision-making.
DataSynthesizer is available on GitHub.
@article{howe17bloombergdatasynthesizer, author = {Howe, Bill and Stoyanovich, Julia and Ping, Haoyue and Herman, Bernease and Gee, Matt}, title = {Synthetic Data for Social Good}, journal = {Bloomberg Data for Good Exchange}, year = {2017} }
@inproceedings{ping17datasynthesizer, author = {Ping, Haoyue and Stoyanovich, Julia and Howe, Bill}, title = {DataSynthesizer: Privacy-Preserving Synthetic Datasets}, booktitle = {Proceedings of the 29th International Conference on Scientific and Statistical Database Management}, series = {SSDBM '17}, year = {2017}, isbn = {978-1-4503-5282-6}, location = {Chicago, IL, USA}, pages = {42:1--42:5}, articleno = {42}, numpages = {5}, url = {http://doi.acm.org/10.1145/3085504.3091117}, doi = {10.1145/3085504.3091117}, acmid = {3091117}, publisher = {ACM}, address = {New York, NY, USA}, keywords = {Data Sharing, Differential Privacy, Synthetic Data} }