Privacy-Preserving Synthetic Data

To facilitate collaboration over sensitive data, we developed DataSynthesizer, a tool that takes a sensitive dataset as input and generates a structurally and statistically similar synthetic dataset with strong privacy guarantees. Data owners need not release their data, while potential collaborators can begin developing models and methods with some confidence that their results will carry over to the real dataset. The distinguishing feature of DataSynthesizer is its usability: the data owner need not make any assumptions, set any parameters, or take on any liability to start generating and sharing data safely and effectively.

DataSynthesizer consists of three high-level modules: DataDescriber, DataGenerator, and ModelInspector. DataDescriber investigates the data types, correlations, and distributions of the attributes in the private dataset and produces a model of the data, adding noise to the model to preserve privacy. DataGenerator samples from the noisy model and outputs synthetic data. ModelInspector shows an intuitive description of the model computed by DataDescriber, allowing the data owner to evaluate the results of the summarization process and, optionally, adjust parameters to fine-tune it.
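
The describe, generate, and inspect flow can be illustrated with a minimal sketch for a single categorical attribute. This is not DataSynthesizer's actual code or API: the function names `describe`, `generate`, and `inspect`, the `epsilon` parameter, the explicit `domain` argument, and the toy data are all assumptions made for illustration. The sketch adds Laplace noise to a histogram (a standard way to make counts differentially private, assuming each individual contributes one record), samples from the noisy histogram, and then compares the noisy model with the true distribution.

```python
import random
from collections import Counter


def describe(values, domain, epsilon, rng):
    """Describer step (illustrative): noisy, normalized histogram of one attribute.

    `domain` lists all possible categories up front so that the set of
    categories in the model does not itself depend on the private data.
    """
    counts = Counter(values)
    scale = 1.0 / epsilon  # assumed sensitivity of 1 per count
    noisy = {}
    for v in domain:
        # The difference of two exponential draws with mean `scale`
        # follows a Laplace(0, scale) distribution.
        noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
        noisy[v] = max(0.0, counts[v] + noise)  # clip negative counts to zero
    total = sum(noisy.values())
    if total == 0.0:
        # Degenerate case: fall back to a uniform distribution.
        return {v: 1.0 / len(domain) for v in domain}
    return {v: c / total for v, c in noisy.items()}


def generate(model, n, rng):
    """Generator step (illustrative): sample n synthetic values from the noisy model."""
    categories = list(model)
    weights = [model[v] for v in categories]
    return rng.choices(categories, weights=weights, k=n)


def inspect(values, model):
    """Inspector step (illustrative): pair true and noisy probabilities per category."""
    n = len(values)
    counts = Counter(values)
    return {v: (counts[v] / n, p) for v, p in model.items()}


rng = random.Random(0)  # fixed seed so the sketch is repeatable
private = ["A"] * 60 + ["B"] * 30 + ["C"] * 10  # toy stand-in for a sensitive column
domain = ["A", "B", "C", "D"]

model = describe(private, domain, epsilon=1.0, rng=rng)
synthetic = generate(model, 100, rng)
comparison = inspect(private, model)

for value, (true_p, noisy_p) in comparison.items():
    print(f"{value}: true={true_p:.3f} noisy={noisy_p:.3f}")
```

A smaller `epsilon` means more noise and stronger privacy at the cost of fidelity. The sketch treats one attribute in isolation, whereas the description above also mentions modeling correlations between attributes, which this example does not attempt.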

We have been using DataSynthesizer in an urban science context, where sharing sensitive, legally encumbered data between agencies and with outside collaborators is reported as a major obstacle to data-driven decision-making.

Software

DataSynthesizer is available on GitHub.

Collaborators

  • Julia Stoyanovich, Drexel University
  • Gerome Miklau, UMass Amherst
  • Ariel Rokem, UW eScience Institute

Students

  • Luke Rodriguez, UW iSchool
  • Haoyue Ping, Drexel University

Publications

  1. Synthetic Data for Social Good
    Bill Howe, Julia Stoyanovich, Haoyue Ping, Bernease Herman, Matt Gee.
    Bloomberg Data for Good Exchange 2017
    @article{howe17bloombergdatasynthesizer,
      author = {Howe, Bill and Stoyanovich, Julia and Ping, Haoyue and Herman, Bernease and Gee, Matt},
      title = {Synthetic Data for Social Good},
      journal = {Bloomberg Data for Good Exchange},
      year = {2017}
    }
    
  2. DataSynthesizer: Privacy-Preserving Synthetic Datasets
    Haoyue Ping, Julia Stoyanovich, Bill Howe.
    Proceedings of the 29th International Conference on Scientific and Statistical Database Management 2017
    @inproceedings{ping17datasynthesizer,
      author = {Ping, Haoyue and Stoyanovich, Julia and Howe, Bill},
      title = {DataSynthesizer: Privacy-Preserving Synthetic Datasets},
      booktitle = {Proceedings of the 29th International Conference on Scientific and Statistical Database Management},
      series = {SSDBM '17},
      year = {2017},
      isbn = {978-1-4503-5282-6},
      location = {Chicago, IL, USA},
      pages = {42:1--42:5},
      articleno = {42},
      numpages = {5},
      url = {http://doi.acm.org/10.1145/3085504.3091117},
      doi = {10.1145/3085504.3091117},
      acmid = {3091117},
      publisher = {ACM},
      address = {New York, NY, USA},
      keywords = {Data Sharing, Differential Privacy, Synthetic Data}
    }
    

Last updated: Aug 02, 2021