HW1: An end-to-end data pipeline and analysis
High-level goal
The high-level goal of this exercise is to develop an end-to-end data analysis pipeline in R, starting from real-world research questions and existing data.
Set up
This is an individual exercise!
This exercise involves four main steps:
- Conceptual modeling: defining the methodology (constructs of interest, proxy measures, etc.) and reasoning about validity and limitations.
- Data wrangling: cleaning and consolidating data that come from different data sources.
- Statistical analysis: analyzing the data with appropriate statistical methods to increase statistical and conclusion validity.
- Conclusions: contextualizing and interpreting the results.
You are free to use any IDE or editor, but we do require:
- The final deliverable (submission) is an executable notebook, with all code cells written in R.
- All four main steps are clearly labeled and documented in the submitted notebook.
- Executing the submitted notebook must successfully reproduce the results of your data analysis pipeline without manual intervention.
- Clean and concise code, with a consistent coding style and use of tidyverse packages for data wrangling.
- Adequate testing of data properties, data transformations, etc.
Getting help:
- Data: Ask general clarification questions on Ed ([HW1] Data questions thread).
- Tidyverse: See the demos on the course website and the linked references. If you can't find an answer, post a general question on Ed ([HW1] Tidyverse questions thread).
- Stats: Ask general clarification questions on Ed ([HW1] Stats questions thread).
- Solution-specific questions: Send the course staff a private Ed message (HW1 category).
Recommended packages:
- tidyverse: See the package descriptions and cheat sheets and the R for Data Science online book.
- assertthat: See the package description.
- effsize: See the package description.
- jsonlite: See the package description.
Background
Given a bug and, typically, a set of tests of which at least one triggers the bug, Automated Program Repair (APR) aims to find a patch such that all tests pass. A plausible patch passes all tests but may still be incorrect. A correct patch is plausible and satisfies the intended specification of the software system.
APR studies have extensively used the Defects4J benchmark. Here are some common observations and conjectures:
- APR tools are more successful for bugs of projects added in Defects4J v1.0 than for bugs of projects added in Defects4J v1.1 and v2.0.
- Weak tests are correlated with APR tools producing plausible yet incorrect patches.
- More complex bugs may be more difficult to patch automatically.
- APR tools may overfit to Defects4J v1.0 (compared to v1.1 and v2.0) but the reasons are unknown.
Instructions
Given the background on APR and the data set described below, your goal is to shed some light on possible explanations for observed differences in APR tool efficacy.
You are free to define your own research questions and hypotheses. You may find the following two high-level questions useful to guide you towards more precise research questions:
- Are there key differences between v1.0 and v1.1/v2.0 of Defects4J, considering dimensions such as code/bug/test characteristics?
- Is APR success associated with certain code/bug/test characteristics?
NOTE: Key grading aspects are design justifications and adequate testing of the analysis pipeline. Additionally, grading will consider the overall execution of the data analysis in the context of its scope. (In particular, a very simple analysis must be very well executed.)
Part 1: Methodology (30%)
- State your research questions and hypotheses.
- State constructs of interest and proxy measures.
- Justify your choices and be explicit about confirmatory data analysis (CDA) vs. exploratory data analysis (EDA).
- Discuss validity (in particular construct validity) and the limitations of your analysis design.
Part 2: Data Wrangling (30%)
- Consolidate all data files, and clean up and test the data.
- Make sure to adequately test data properties and data transformations.
- Add comments that link test assertions to data expectations and the methodology.
NOTE: This step should create a single tibble by adequately merging all five data files.
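One possible shape for this step is sketched below. It assumes the five CSV files sit in the working directory, joins on the documented keys, and shows one assertthat check; your join strategy, key normalization, and tests may well differ.

```r
library(tidyverse)
library(assertthat)

# Read the five pre-processed data files (file paths are assumptions;
# adjust to your project layout).
code_stats     <- read_csv("code_stats.csv")
bug_stats      <- read_csv("bug_stats.csv")
coverage_stats <- read_csv("coverage_stats.csv")
mutation_stats <- read_csv("mutation_stats.csv")
apr_results    <- read_csv("apr_results.csv")

# Test a documented data expectation: PID and BID uniquely identify
# a bug in code_stats.csv.
assert_that(nrow(code_stats) == nrow(distinct(code_stats, PID, BID)))

# bug_stats.csv has one row per buggy file; aggregate to one row per
# bug before merging, since a bug may have more than one buggy file.
bug_per_bug <- bug_stats |>
  group_by(PID, BID) |>
  summarize(across(c(DelLines, InsLines, ModLines), sum), .groups = "drop")

# Left joins keep bugs for which the coverage or mutation analysis
# failed (their columns become NA).
bug_level <- code_stats |>
  left_join(bug_per_bug,    by = c("PID", "BID")) |>
  left_join(coverage_stats, by = c("PID", "BID")) |>
  left_join(mutation_stats, by = c("PID", "BID"))

# apr_results.csv uses lower-case column names and lower-case project
# ids; normalize the join keys (one possible approach). The result is
# one row per patch, with bug-level columns repeated for each patch.
merged <- bug_level |>
  mutate(pid_lower = str_to_lower(PID)) |>
  left_join(apr_results, by = c("pid_lower" = "pid", "BID" = "bid"))
```

Note that the final join duplicates bug-level rows once per patch; aggregating apr_results per bug first is an equally valid design, depending on your unit of analysis.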
Part 3: Statistical Analysis (20%)
- Select adequate statistical methods (incl. effect sizes).
- Add comments that link the statistical methods to the methodology.
- Make sure the notebook computes all results that inform the overall conclusions.
This comprehensive summary provides concrete examples for various statistical methods.
NOTE: We do not expect you to be an expert on statistical methods. Once you have defined your research questions and analysis design, feel free to engage the course staff in your hypothesis formalization process.
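For illustration, here is a minimal sketch of pairing a non-parametric test with an effect size, assuming the merged tibble from Part 2 (the column names come from code_stats.csv; whether this particular comparison is adequate depends on your research questions):

```r
library(tidyverse)
library(effsize)

# Compare test-suite size between bugs added in v1.0 and bugs added
# later. D4JVersionAdded is compared as a string here; if read_csv
# parsed it as numeric, compare against 1.0 instead.
v1_0  <- merged |> filter(D4JVersionAdded == "1.0") |> pull(AllTests)
later <- merged |> filter(D4JVersionAdded != "1.0") |> pull(AllTests)

# Non-parametric location test; count data is rarely normal.
wilcox.test(v1_0, later)

# Report an effect size alongside the p-value; Cliff's delta (from
# the effsize package) is robust for skewed distributions.
cliff.delta(v1_0, later)
```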
Part 4: Conclusions (20%)
- Contextualize and interpret your results.
- Answer your stated research questions.
Data set
The data set is derived from the following data sources and tools:
For simplicity and efficiency, the course staff have pre-processed and consolidated some of the raw data. These pre-processing and consolidation steps produced the five data files below.
Code-level information: code_stats.csv
- PID: See the 17 project ids in Defects4J.
- BID: See the active bug ids in Defects4J.
- D4JVersionAdded: Defects4J version when the project and its bugs were added – 1.0, 1.1, or 2.0.
- ClocSrc: Result of running cloc over all non-test source files.
- ClocTest: Result of running cloc over all test source files.
- RelevantFiles: Number of files touched by any triggering test.
- PatchedFiles: Number of files patched by the bug fix.
- AllTests: Number of all test methods across all test source files.
- RelevantTests: Number of test methods that touch at least one patched file.
- TriggerTests: Number of test methods that trigger the bug.

PID and BID uniquely identify a bug.
Bug information: bug_stats.csv
- PID: See the 17 project ids in Defects4J.
- BID: See the active bug ids in Defects4J.
- DelLines: Number of lines deleted by the bug fix.
- InsLines: Number of lines added by the bug fix.
- ModLines: Number of lines changed by the bug fix.
- FileName: Name of the buggy file.

PID, BID, and FileName uniquely identify a buggy file. (A bug may have more than one buggy file.)
Coverage information: coverage_stats.csv
- PID: See the 17 project ids in Defects4J.
- BID: See the active bug ids in Defects4J.
- LinesTotal: Number of all lines across all files patched by the bug fix – line coverage.
- LinesCovered: Number of covered lines across all files patched by the bug fix – line coverage.
- ConditionsTotal: Number of all conditions across all files patched by the bug fix – condition coverage.
- ConditionsCovered: Number of covered conditions across all files patched by the bug fix – condition coverage.

PID and BID uniquely identify a bug. (The code coverage analysis may have failed for some bugs.)
Mutation information: mutation_stats.csv
- PID: See the 17 project ids in Defects4J.
- BID: See the active bug ids in Defects4J.
- MutantsTotal: Number of all mutants generated across all files patched by the bug fix.
- MutantsCovered: Number of covered mutants across all files patched by the bug fix.
- MutantsDetected: Number of detected mutants across all files patched by the bug fix.

PID and BID uniquely identify a bug. (The mutation analysis may have failed for some bugs.)
If you are unfamiliar with mutation analysis, you can think of: (1) the ratio of detected mutants to covered mutants as an indicator for test assertion strength, and (2) the ratio of covered mutants to all mutants as a code coverage measure (strongly correlated with line coverage).
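For concreteness, a minimal sketch of computing these two ratios with tidyverse, assuming mutation_stats.csv has been read into a tibble named mutation_stats (the derived column names are illustrative, not prescribed):

```r
library(tidyverse)

mutation_stats |>
  mutate(
    # (1) Detected / covered mutants: proxy for test assertion strength.
    #     Watch for division by zero when MutantsCovered is 0.
    AssertionStrength = MutantsDetected / MutantsCovered,
    # (2) Covered / all mutants: a code coverage measure.
    MutantCoverage    = MutantsCovered / MutantsTotal
  )
```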
APR results: apr_results.csv
- tool: Name of APR tool – lower case.
- pid: Defects4J project id – lower case.
- bid: Defects4J bug id.
- patch_id: Patch id.
- is_correct: Patch correctness – Yes = correct patch; No = incorrect yet plausible patch (passes all tests).

tool, pid, bid, and patch_id uniquely identify a repair attempt that produced a patch.
Deliverables
- An executable notebook with all your code and analysis narrative. Acceptable notebook formats are:
- Rmarkdown: .Rmd
- Jupyter: .ipynb
Steps for turn-in
Upload the deliverables to Canvas.