Designing a Data Reproduction Artifact

2024-10-02

I recently submitted a data reproduction artifact accompanying our paper "Verifying the Option Type with Rely-Guarantee Reasoning." There are things I wish I had done from the get-go, so I'm writing them down in this post for next time. Hopefully, you'll find some of them useful in creating your own data artifacts!

A (Brief) Checklist for a Data Reproduction Artifact

Below is a very brief checklist for preparing a data artifact. The rest of this blog post explains each of the checklist items in detail, so read on if you're interested.

- Include your raw data, the scripts that analyze it, and documentation (a README).
- Test your artifact on machines that have no prior connection to it, ideally with the help of your co-authors and colleagues.
- Anonymize your artifact if the review process is double-blind.
- Double-check your documentation.
- Compress your artifact and upload it to an archival service such as Zenodo.

What Is a Data Reproduction Artifact?

A data reproduction artifact comprises the raw experimental data you present in your paper and a mechanism to generate the results you derived from them.

Your paper generally has data that you analyze and interpret to support your insights. The data might be a single CSV file, a massive JSON database, or even a set of software repositories. Hopefully, you've automated the analysis and generation of your tables and figures from it. That is, you have a set of scripts that operate over your data and produce nicely-formatted tables or figures that you directly paste into your paper (generating the rows of a .tex table is one example).
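
To make that concrete, here is a minimal sketch of such a script for a CSV input; the file names and columns (results.csv with tool, precision, and recall fields) are hypothetical placeholders rather than anything from our artifact:

    #!/bin/sh
    # Hypothetical example: convert results.csv (columns: tool,precision,recall)
    # into LaTeX table rows that can be \input into the paper.
    # Skip the header row, then emit one "tool & precision & recall \\" line per record.
    awk -F, 'NR > 1 { printf "%s & %s & %s \\\\\n", $1, $2, $3 }' results.csv > table-rows.tex

Because the rows are regenerated from the data every time, there's no opportunity for a stale, hand-copied number to sneak into the paper.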

Components of a Data Reproduction Artifact

You want to ensure that your artifact can be understood and used by as many people as possible! Here are some things your data artifact should really have, at the minimum:

- the data itself (or instructions for obtaining it),
- the scripts that analyze the data and generate your tables and figures, and
- documentation (a README).

I'll go over each of these components.

Data

You should ensure that you are allowed to publish your data in a publicly-available dataset. There are only a few legitimate reasons why you might not be able to: the data might expose PII (personally identifiable information), be subject to data protection regulations, or comprise proprietary information or trade secrets. In all of these cases, it's worth investigating whether you can publish your dataset in part, or create an anonymized version that is allowed to be published.

If you're working with data that is already publicly-available, such as public software repositories (e.g., a public repo on GitHub), you should be able to include it without much trouble.

If your dataset is large (i.e., more than a couple of GB), you may not want to distribute it directly in the artifact. Users may not have access to a reliable internet connection or their bandwidth might be poor; a large download should not shut them out of your artifact entirely. You may instead choose to upload the data separately to an archival service (e.g., Zenodo, Anonymous GitHub) and link to it from your artifact. Another alternative, especially when your data are software repositories, is to distribute the references in a file and include a script that clones them onto the user's machine.
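
If you go the clone-script route, the script can be very small. Here is a minimal sketch, assuming the references live in a plain-text file with one URL per line; the file name (repo-list.txt) and target directory (repos/) are placeholders:

    #!/bin/sh
    # Hypothetical example: clone every repository listed in repo-list.txt
    # (one URL per line) into the repos/ directory.
    mkdir -p repos
    while read -r url; do
        git clone "$url" "repos/$(basename "$url" .git)"
    done < repo-list.txt

A nice side effect of this approach is that the file of references doubles as documentation of exactly which subject programs (and, if you pin commits, which versions) your results were computed over.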

Scripts

You should automate the generation of tables and figures in your paper as much as possible; it should ideally be a one-step process. For example, to generate the rows of Table 1 in our paper "Verifying the Option Type with Rely-Guarantee Reasoning," I only need to run the following command:

% ./compute-precision-recall-annotations

after making any updates that would affect our results. Not only is (correctly) automating figure and table generation convenient, it can also be less error-prone than transcribing numbers by hand. Auditing how you compute precision, recall, or any other result is easier when you have a program that anyone can review.

Your scripts should also be clearly documented. I prefer to have a small preamble at the top of each script describing its purpose as well as its inputs and outputs. For example, below is the documentation at the top of the compute-precision-recall-annotations script:

    # This script generates the rows for Table 1.
    # Specifically, it calculates the precision, recall, and number of false
    # positive suppressions for SpotBugs, Error Prone, IntelliJ IDEA, and the Optional Checker.
    # We use the following standard definitions:
    # Precision: TP / (TP + FP)
    # Recall:    TP / (TP + FN)
    #
    # This script runs grep for the relevant error suppression patterns for
    # each directory of subject programs that have been instrumented with
    # SpotBugs, Error Prone, IntelliJ IDEA, and the Optional Checker.
    #
    # Usage:
    #   compute-precision-recall-annotations
    # Result:
    #   eval-statistics.tex
    

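The preamble tells a reader what to expect; the body of such a script doesn't have to be complicated, either. As a rough, hypothetical sketch of the grep-and-count approach for a single tool (the suppression pattern, directory name, and ground-truth count below are placeholders, not the real values from our artifact):

    #!/bin/sh
    # Hypothetical sketch for one tool (SpotBugs). The pattern, directory,
    # and true-positive count are placeholders.
    tp=42                # true positives, established by manual inspection
    fp=$(grep -r '@SuppressFBWarnings' spotbugs-subjects/ | wc -l)
    # Precision = TP / (TP + FP), emitted as a LaTeX macro the paper can use.
    precision=$(echo "scale=2; $tp / ($tp + $fp)" | bc)
    printf '\\newcommand{\\spotbugsPrecision}{%s}\n' "$precision" > eval-statistics.tex
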
It may also be helpful to have a top-level script that your users (and you!) can run to generate all the figures and tables in your paper.
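
Such a top-level script can be as simple as invoking each generation step in turn. A minimal sketch (every script name besides compute-precision-recall-annotations is hypothetical):

    #!/bin/sh
    # Regenerate every table and figure in the paper.
    set -e                                      # stop immediately if any step fails
    ./compute-precision-recall-annotations      # Table 1
    ./generate-runtime-figure                   # hypothetical: Figure 2
    ./generate-case-study-table                 # hypothetical: Table 3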

Documentation

At minimum, you should include a README file that includes the following information; a clearly-labelled section for each is a good way to organize your README:

- the limitations of your artifact and any minimum computing requirements,
- a description of your dataset and how it is structured, and
- a description of your scripts: how to run them and what they produce.

It is reasonable and OK that your artifact may not work on every single operating system under the sun or on hardware dating back to the Apple II. However, you should clearly document the limitations of your artifact so that your users aren't caught by surprise. This is also the place to describe any minimum computing requirements: perhaps your scripts require a certain amount of memory to run within a reasonable time limit, or your generated artifacts take up a surprisingly large amount of disk space.

Your description of your dataset should be thorough. For example, if your dataset is composed of a set of folders, what does each folder contain? Is there a reason for the particular structure of your dataset? You should describe each column or field in your dataset (if you are distributing a CSV or JSON file) if it is not obvious from its name or context. If you're having doubts about the clarity of something, others might be confused, too. Providing more information rather than less is usually a good idea.

It is important to describe the results of your scripts, particularly if they only generate the rows of a table or the body of a figure. If your generated files do not have names that directly correspond to the tables and figures in your paper, you should provide an unambiguous mapping.
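
For example, a "Results" section of the README might contain a mapping like this; everything here except eval-statistics.tex (which holds the rows of Table 1 in our artifact) is a hypothetical placeholder:

    Results
    -------
    eval-statistics.tex   -> rows of Table 1 (precision, recall, suppression counts)
    runtime-plot.pdf      -> Figure 2 (hypothetical)
    case-study-rows.tex   -> rows of Table 3 (hypothetical)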

Checklist for Preparing a Data Artifact

Many conferences have a separate artifact evaluation track, where authors of accepted papers may submit a data artifact to accompany their paper. Every artifact evaluation track is a little different and may require different things, but here's a general checklist of things that are helpful to remember.

Test Your Artifact

You should test your artifact on as many different environments as you have access to; all of these systems should have no prior connection to your artifact (i.e., they should be as close to a regular machine as possible, without any special dependencies already installed). At minimum, your co-authors should be able to take your artifact and reproduce your results by following the instructions and documentation you have provided in the README.
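
If spare machines are hard to come by, one way to approximate a pristine environment is to start from a stock container or VM image and follow your own README from scratch; for example, something along these lines (the image, mount path, and directory name are just one possible choice):

    # Start an interactive shell in a stock Ubuntu container with nothing
    # preinstalled, mounting the artifact read-only so the test can't modify it.
    docker run --rm -it -v "$PWD/artifact:/artifact:ro" ubuntu:24.04 bash

The read-only mount is a cheap way to catch scripts that silently depend on writing into the artifact directory itself.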

Your colleagues in the lab are fantastic test subjects for your artifact, even better if they work on operating systems and hardware that differ from yours. Ask them nicely to write down anything they find confusing or ambiguous as they test your artifact. Additionally, ask them to write down anything they had to do to make your artifact work beyond what was specified in your instructions. Your colleagues are smart people; these are things you'll want to fix before you submit! They should avoid asking you clarifying questions directly: chances are you'll clarify on the spot and forget to update your artifact, which invalidates the hard work your colleagues are doing for you.

Anonymize Your Artifact

Most review committees are double-blind; your submitted artifact should not include any metadata that may de-anonymize you. If any component of your artifact was under version control, you may have a .git directory or two lying around. You can run the following command to recursively delete any .git directories from the root of your artifact directory:

rm -rf `find . -type d -name .git`

You should run a recursive search over your artifact directory for anything else that might de-anonymize you (e.g., usernames, personal or institutional IDs, etc.). Don't skip this step; your computer can create metadata and hidden files in ways that you may not have anticipated.

You should ideally have a script that automates the anonymization process, enabling you and your co-authors to audit it via code review.
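
A minimal sketch of what such a script might look like follows; the identifying strings at the end are placeholders that you would replace with your own usernames, institution, lab name, and so on:

    #!/bin/sh
    # Hypothetical anonymization helper: remove version-control metadata and
    # macOS cruft, then flag any remaining identifying strings for manual review.
    find . -type d -name .git -prune -exec rm -rf {} +
    find . -name .DS_Store -delete
    # Placeholder patterns: substitute your own usernames, institution, etc.
    grep -rni -e 'yourusername' -e 'Your University' . && \
        echo 'WARNING: possible de-anonymizing strings found above.'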

Double-Check Your Documentation

Your documentation is the only way you'll get to communicate with whomever is using or reviewing your artifact. You may have simulated this process by asking your colleagues to test out your artifact without talking to you, but it's always good to ask yourself the following:

- Does the README describe the artifact's limitations and minimum computing requirements?
- Is the dataset described thoroughly enough that a stranger could navigate it?
- Is it clear how to run the scripts, and how their outputs map to the tables and figures in the paper?

Submitting Your Artifact

Archival Services

I use Zenodo to archive my data artifacts if there are no restrictions from the conference or publisher. It is free, convenient, and widely-used by my research community. Additionally, Zenodo will automatically version your artifact whenever you make updates, and generate a DOI that will resolve to the latest version without any work on your end. Here is an example of an anonymously-uploaded file on Zenodo.

Alternate services that support anonymous uploads include the Open Science Framework (OSF), FigShare, Dataverse, and Anonymous GitHub.

Preparing Your Artifact for Submission

You generally do not want to upload a raw, uncompressed artifact, since it might be quite large. You can run the following command to create a .zip archive of your artifact:

zip -r <name_of_resulting_archive>.zip <path_to_artifact>

Creating a zip via the GUI on macOS will result in an irritating hidden __MACOSX directory being included in the zip, so I would avoid it.
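
If stray Finder metadata such as .DS_Store files is already sitting in your artifact directory, zip's -x flag can keep it out of the archive; the archive and directory names here are placeholders, and -x can be repeated for other patterns:

    # Exclude macOS Finder metadata from the archive.
    zip -r artifact.zip artifact -x "*.DS_Store"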

Is data reproduction the same as data replication?

There is often some confusion around reproducing data and replicating data. Some fields use one to mean the other, while others use them interchangeably. I work in computer science, so I follow the ACM's stated definitions, which I summarize below:

- Reproducibility: the results can be obtained by a different team using the original methodology and the original artifacts (the authors' own data and scripts).
- Replicability: the results can be obtained by a different team using a methodology and artifacts that they develop independently.

An additional definition is repeatability, where a measurement (i.e., the results) can be obtained by the original team via the original methodology using the original data.

In the context of my specific data artifact, reproducibility can be summarized further as "same script, same data," while replicability can be summarized as "different script, different data" in pursuit of my original research questions.

Acknowledgments

Thanks to Joseph Wonsil and Michael Ernst for their feedback and comments on earlier versions of this post. Check out Joe's fantastic research on making reproducibility more accessible for research programmers! Mike and I work together at UW to improve programmer productivity; check out some of his latest work here.