“Do LLMs Generate Useful Test Oracles? An Empirical Study with an Unbiased Dataset” by Davide Molinelli, Luca Di Grazia, Alberto Martin-Lopez, Michael D. Ernst, and Mauro Pezzè. In ASE 2025: Proceedings of the 39th Annual International Conference on Automated Software Engineering, (Seoul, South Korea), Nov. 2025.
The oracle problem, that is, the efficient generation of thorough test oracles, remains open. Popular test case generators, such as EvoSuite and Randoop, rely on implicit, rule-based, and regression oracles, which miss failures that depend on the semantics of the program under test. Specified test oracles shift the cost of generating oracles to the production of formal specifications.
Large Language Models (LLMs) have the potential to overcome these limitations. The few studies that use LLMs to automatically generate test oracles validate them on modest-sized public benchmarks, such as Defects4J, which are likely included in the LLMs' training data, posing severe threats to the validity of the results.
This paper presents an empirical study of the effectiveness of LLMs in generating test oracles. We report the results of experiments with 13,866 test oracles mined from 135 Java projects. These oracles were created after the training cut-off dates of the LLMs used in the experiments and are thus unbiased.
Our experiments indicate that LLMs indeed generate effective oracles that substantially increase the mutation score of the test cases, reaching a mutation score comparable to that of human-designed test oracles. Our results also indicate that the test prefix and the methods called in the program under test provide sufficient information to generate good oracles, while additional code context brings no significant benefit. These findings provide actionable insights into using LLMs for automatic testing and highlight their current limitations in generating complex oracles.
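To make the terminology concrete, the following minimal Java sketch illustrates the test-prefix/oracle split referred to above: the prefix sets up inputs and calls the method under test, and the oracle is the final assertion. The example is hypothetical (the class, method, and values are not taken from the paper's dataset); in the study's setting, the oracle line is what the LLM is asked to generate, given the prefix and the code of the methods it calls.

import static org.junit.Assert.assertEquals;
import org.junit.Test;

// Hypothetical illustration of a test prefix and a test oracle (JUnit 4).
public class CapitalizeTest {

    // Minimal method under test, assumed for the sake of the example.
    static String capitalize(String s) {
        if (s == null || s.isEmpty()) {
            return s;
        }
        return Character.toUpperCase(s.charAt(0)) + s.substring(1);
    }

    @Test
    public void capitalizesFirstLetter() {
        // Test prefix: construct the input and invoke the method under test.
        String result = capitalize("hello world");

        // Test oracle: the assertion that checks the expected behavior.
        // In the study, this is the part the LLM generates from the prefix
        // and the source code of the methods the prefix calls.
        assertEquals("Hello world", result);
    }
}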
BibTeX entry:
@inproceedings{MolinelliDGMLEP2025,
  author    = {Davide Molinelli and Di Grazia, Luca and Alberto Martin-Lopez and Michael D. Ernst and Mauro Pezz{\`e}},
  title     = {Do {LLMs} Generate Useful Test Oracles? An Empirical Study with an Unbiased Dataset},
  booktitle = {ASE 2025: Proceedings of the 39th Annual International Conference on Automated Software Engineering},
  address   = {Seoul, South Korea},
  month     = nov,
  year      = {2025}
}
(This webpage was created with bibtex2web.)