
DafnyBench: A Benchmark for Formal Software Verification

DafnyBench is the largest benchmark for training and evaluating machine learning systems for formal software verification, featuring 750+ programs with 53,000+ lines of code.
computationaltoken.com

Key Statistics

750+ Programs in Benchmark
53,000+ Lines of Code
68% Best Success Rate
10x Verification Cost Reduction

1 Introduction

Large Language Models (LLMs) are accelerating software development through co-pilots and program synthesis tools, but ensuring code reliability remains challenging. Formal verification provides mathematical proof that software meets specifications, yet adoption is limited by high costs and steep learning curves. DafnyBench addresses this gap as the largest benchmark for training and evaluating ML systems in formal verification.

2 Related Work

Existing benchmarks like Clover (66 programs) and dafny-synthesis (153 programs) are insufficient for modern ML training. Mathematical theorem proving benchmarks contain over 100,000 theorems with AI success rates exceeding 82%, highlighting the need for similar scale in software verification.

3 Benchmark Construction

3.1 Dataset Composition

DafnyBench comprises 750+ programs with approximately 53,000 lines of Dafny code, significantly exceeding previous benchmarks in both size and complexity.

3.2 Hint Requirements

Most programs require supplementary hints, such as loop invariants, assertions, and termination measures, for the automated theorem prover to succeed. These hints guide the verification process and represent the additional knowledge needed beyond the core implementation.
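In Dafny, such hints typically appear as invariant, assert, and decreases annotations. A minimal Python sketch of how a benchmark harness might strip them from a program follows; the keyword list and the example program are illustrative assumptions, not DafnyBench's exact preprocessing:

```python
# Sketch of stripping verification hints from Dafny source.
# The keyword list is an illustrative assumption, not
# DafnyBench's exact preprocessing.
HINT_KEYWORDS = ("invariant", "assert", "decreases")

def strip_hints(dafny_source: str) -> str:
    """Drop lines whose first token is a hint keyword."""
    kept = []
    for line in dafny_source.splitlines():
        tokens = line.split()
        if tokens and tokens[0] in HINT_KEYWORDS:
            continue  # remove the hint line
        kept.append(line)
    return "\n".join(kept)

# Hypothetical example program (not taken from the benchmark).
program = """method Inc(x: int) returns (y: int)
  ensures y == x + 1
{
  y := x;
  assert y == x;
  y := y + 1;
}"""

stripped = strip_hints(program)
```

After stripping, the specification (requires/ensures) remains while the proof hints are gone, which is exactly the gap the evaluated models must fill.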

4 LLM Performance Evaluation

4.1 Experimental Setup

The evaluation tests GPT-4 and Claude 3 on auto-generating hints for the Dafny verification engine, measuring success rates across different program complexities and hint requirements.
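The setup above amounts to a retry loop: the model proposes hints, the verifier checks the candidate, and any error message is fed back into the next attempt. In this sketch, propose_hints and run_dafny are hypothetical stand-ins for the LLM call and the Dafny CLI, stubbed so the control flow is runnable:

```python
# Sketch of the evaluation loop with error-message feedback.
# propose_hints and run_dafny are hypothetical stand-ins for the
# LLM call and the Dafny verifier, stubbed here for illustration.

def evaluate(program, propose_hints, run_dafny, max_attempts=3):
    """Return True if some candidate verifies within the attempt budget."""
    feedback = None
    for _ in range(max_attempts):
        candidate = propose_hints(program, feedback)
        verified, error_message = run_dafny(candidate)
        if verified:
            return True
        feedback = error_message  # error feedback guides the next attempt
    return False

# Stub verifier: accepts only candidates containing a loop invariant.
def fake_dafny(candidate):
    if "invariant" in candidate:
        return True, ""
    return False, "Error: cannot prove loop invariant"

# Stub model: supplies an invariant only after seeing an error message.
def fake_model(program, feedback):
    return program + ("\n  invariant i <= n" if feedback else "")

print(evaluate("while i < n {}", fake_model, fake_dafny))  # prints True
```

The stubbed model fails on its first attempt and succeeds on the second, mirroring the paper's observation that error-message feedback improves success rates.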

4.2 Results Analysis

The best model and prompting scheme achieved a 68% success rate. Performance improves with error-message feedback but deteriorates with increased code complexity and hint requirements. The verification success probability follows: $P_{success} = \frac{1}{1 + e^{-(\alpha - \beta \cdot C)}}$, where $C$ represents code complexity and $\alpha$, $\beta$ are model-specific parameters.
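The logistic form can be evaluated directly. In this sketch the parameters $\alpha$ and $\beta$ are arbitrary placeholders chosen for illustration, not values fitted in the paper:

```python
import math

def p_success(complexity, alpha, beta):
    """Logistic verification-success probability from the formula above."""
    return 1.0 / (1.0 + math.exp(-(alpha - beta * complexity)))

# alpha and beta are illustrative placeholders, not fitted values.
alpha, beta = 2.0, 0.04
for c in (0, 25, 50, 100):
    print(f"C = {c:3d}  P = {p_success(c, alpha, beta):.2f}")
```

Setting $C = \alpha/\beta$ gives the 50% crossover; under these placeholder values that falls at $C = 50$, and success probability decreases monotonically with complexity, matching the inverse relationship reported below.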

Verification Success Rate vs. Code Complexity

The chart shows an inverse relationship between code complexity and verification success rate: programs requiring more than 50 lines of hints fall below a 50% success rate, while simpler programs achieve up to 85% verification success.

5 Conclusion and Future Work

DafnyBench enables rapid improvement in formal verification automation. Future work includes expanding benchmark diversity, improving LLM hint generation, and integrating verification directly into compilation processes.

6 Technical Analysis

Industry Analyst Perspective

Cutting to the Chase

DafnyBench isn't just another academic exercise—it's a strategic move to bridge the chasm between AI-generated code and production-ready software. The 68% success rate reveals both the promise and the painful reality: while LLMs can assist verification, we're far from fully automated reliability.

Logical Chain

The research follows a compelling progression: identify the formal verification bottleneck → recognize the ML training data scarcity → build massive benchmark → test current LLM capabilities → establish baseline for future improvements. This mirrors the trajectory of computer vision after ImageNet's introduction, where standardized benchmarks accelerated progress by orders of magnitude.

Highlights and Pain Points

Highlights: The scale is unprecedented—53,000 lines of verified code dwarfs previous efforts. The focus on Dafny is strategic, leveraging its Python-like syntax for broader adoption. The error message feedback mechanism shows practical engineering insight.

Pain Points: The 68% success rate, while impressive, means 32% failure rate—unacceptable for critical systems. The benchmark's complexity distribution isn't clearly stratified, making it difficult to assess where improvements are most needed. Like many academic benchmarks, it may suffer from overfitting risks as models optimize for this specific dataset.

Actionable Insights

For engineering teams: Start integrating formal verification tools now, even if only partially. The drop in verification overhead from 10x to near-zero is coming faster than most organizations realize. For researchers: Focus on the failure cases; understanding why 32% of programs resist verification will reveal fundamental limitations in current approaches. For investors: The formal verification toolchain represents a massive opportunity as software reliability becomes non-negotiable in autonomous systems, healthcare, and finance.

This work sits at the convergence of multiple transformative trends: the industrialization of AI, the crisis of software reliability in critical systems, and the maturation of formal methods. Just as ImageNet revolutionized computer vision, DafnyBench could catalyze similar progress in software verification. The reference to mathematical theorem proving benchmarks achieving 82% success rates suggests we are approximately 4-5 years from similar performance in software verification, based on the historical progression curve from benchmarks like those described in the CycleGAN paper and subsequent rapid improvements.

The technical approach of using hints as intermediate verification targets is particularly insightful. It creates a tractable learning problem for LLMs while maintaining the rigor of full formal verification. This layered approach mirrors successful strategies in other AI domains, such as the use of attention mechanisms in transformer architectures that have driven recent breakthroughs in natural language processing.

However, the research leaves unanswered questions about generalization beyond the Dafny ecosystem and the computational cost of verification at scale. As organizations like NASA and automotive companies increasingly mandate formal verification for safety-critical systems, the economic impact of reducing verification costs from 10x to near-zero could be measured in billions of dollars and, more importantly, prevented catastrophes.

7 Code Implementation

Dafny Verification Example

method ComputeSum(n: int) returns (sum: int)
  requires n >= 0                     // precondition: input must be non-negative
  ensures sum == n * (n + 1) / 2      // postcondition: closed-form sum
{
  sum := 0;
  var i := 0;
  while i <= n
    invariant sum == i * (i - 1) / 2  // hint: sum of the first i-1 naturals
    invariant i <= n + 1              // hint: bounds the loop counter
  {
    sum := sum + i;
    i := i + 1;
  }
}

This Dafny method computes the sum of the first n natural numbers with formal verification. The requires clause specifies the precondition, ensures specifies the postcondition, and the invariant clauses are exactly the kind of hints DafnyBench asks models to supply: without them, the verifier cannot prove the loop correct.

8 Future Applications

Promising applications include integrating formal verification into compilers as a standard final step, verifying autonomous systems for automotive and aerospace, verifying smart contracts for blockchain applications, certifying medical device software, and protecting critical infrastructure.

9 References

  1. Leino, K. R. M. (2010). Dafny: An automatic program verifier for functional correctness. LPAR-16.
  2. Brown, T. B., et al. (2020). Language models are few-shot learners. NeurIPS.
  3. Irving, G., et al. (2016). DeepMath: Deep sequence models for premise selection. NeurIPS.
  4. Avizienis, A., et al. (2004). Basic concepts and taxonomy of dependable and secure computing. IEEE Transactions on Dependable and Secure Computing.
  5. Zhu, J. Y., et al. (2017). Unpaired image-to-image translation using cycle-consistent adversarial networks. ICCV.
  6. Amazon Web Services (2023). Formal Verification in Production Systems.
  7. Microsoft Research (2022). Applying Formal Methods at Scale.