03.03.2026

Formal Verification for AI-Generated Code: Why SRE Teams Should Care


Quick Reference:

# Static analysis with semgrep
semgrep --config auto ./src

# Type checking with mypy (Python)
mypy --strict ./src

# Property-based testing with hypothesis
pytest --hypothesis-profile=ci

# Formal verification with Dafny
dafny verify program.dfy

AI coding assistants now write significant portions of production code. GitHub reports that Copilot generates 46% of code in files where it's enabled. But here's the uncomfortable question: how do you verify code you didn't write and may not fully understand?

Traditional testing catches bugs you anticipate. Formal verification proves properties hold for all possible inputs. As AI-generated code scales, that distinction becomes critical.

The Verification Gap

Consider this AI-generated function:

def calculate_discount(price: float, user_tier: str) -> float:
    if user_tier == "gold":
        return price * 0.8
    elif user_tier == "silver":
        return price * 0.9
    return price

Looks correct. Tests pass. Ships to production.

Three months later, someone discovers that a negative price produces a negative result: the function effectively issues a refund. The AI generated syntactically correct code that violated an implicit business invariant: prices must be non-negative.

This is the verification gap: the space between "code that works for test cases" and "code that's provably correct."
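To make the gap concrete, here is the function again with the kind of hand-written tests that let it ship, followed by the input nobody tested:

```python
def calculate_discount(price: float, user_tier: str) -> float:
    if user_tier == "gold":
        return price * 0.8
    elif user_tier == "silver":
        return price * 0.9
    return price

# Every anticipated test case passes...
assert calculate_discount(100.0, "gold") == 80.0
assert calculate_discount(100.0, "silver") == 90.0

# ...but the implicit invariant "prices are non-negative" is never enforced:
print(calculate_discount(-100.0, "gold"))  # -80.0: a negative "discounted" price
```

No test fails because no test encodes the invariant; the bug only exists relative to a specification that was never written down.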

Why AI Code Amplifies the Problem

AI-generated code has unique verification challenges:

| Challenge | Description |
| --- | --- |
| Volume | More code generated faster means more potential bugs at scale |
| Opacity | Developers may not deeply understand generated code |
| Edge cases | AI optimizes for common patterns, not corner cases |
| Implicit assumptions | AI doesn't know your business invariants |
| Confidence bias | "AI wrote it" can reduce scrutiny |

When a human writes code, they carry mental models of edge cases and constraints. AI models lack this context unless explicitly provided.

Levels of Verification

Verification exists on a spectrum:

Level 1: Static Analysis

Catch common bugs without running code:

# Python
ruff check ./src
mypy --strict ./src

# Go
go vet ./...
staticcheck ./...

# JavaScript/TypeScript
eslint --max-warnings 0 ./src

Static analysis catches type errors, unused variables, and common anti-patterns. It's fast and should run on every commit.

Level 2: Property-Based Testing

Instead of specific test cases, define properties that must hold:

from hypothesis import given, strategies as st

@given(st.floats(min_value=0), st.sampled_from(["gold", "silver", "bronze"]))
def test_discount_never_increases_price(price, tier):
    result = calculate_discount(price, tier)
    assert result <= price
    assert result >= 0

Hypothesis generates hundreds of random inputs, finding edge cases humans miss.
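Note that the `min_value=0` bound above quietly masks the negative-price bug. Drop the bound and the property itself finds the counterexample. A dependency-free sketch of the same idea, with plain random sampling standing in for Hypothesis strategies:

```python
import random

def calculate_discount(price: float, user_tier: str) -> float:
    if user_tier == "gold":
        return price * 0.8
    elif user_tier == "silver":
        return price * 0.9
    return price

def find_counterexample(trials: int = 1000):
    """Randomly sample inputs; return the first one violating the property."""
    rng = random.Random(42)
    for _ in range(trials):
        price = rng.uniform(-1000.0, 1000.0)  # no non-negativity constraint
        tier = rng.choice(["gold", "silver", "bronze"])
        result = calculate_discount(price, tier)
        if not (0 <= result <= price):
            return price, tier, result
    return None

print(find_counterexample())  # a negative price producing a negative "discount"
```

This is the core lesson for AI-generated code: the input ranges you choose for your properties encode the very assumptions you are trying to verify.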

Level 3: Design by Contract

Specify preconditions, postconditions, and invariants:

from icontract import require, ensure

@require(lambda price: price >= 0, "Price must be non-negative")
@ensure(lambda result, price: 0 <= result <= price, "Discount must not exceed price")
def calculate_discount(price: float, user_tier: str) -> float:
    if user_tier == "gold":
        return price * 0.8
    elif user_tier == "silver":
        return price * 0.9
    return price

Contracts are checked at runtime (or statically with appropriate tools), catching violations immediately.
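The mechanism behind icontract fits in a few lines, which is worth seeing once. Below is a minimal hand-rolled sketch of a `require`/`ensure` decorator pair; this is an illustration of the pattern, not the icontract API, and it only inspects the first positional argument for simplicity:

```python
import functools

def require(predicate, message):
    """Raise before the call if the precondition on the first argument fails."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            if not predicate(args[0]):
                raise ValueError(f"Precondition violated: {message}")
            return fn(*args, **kwargs)
        return wrapper
    return decorator

def ensure(predicate, message):
    """Raise after the call if the postcondition (result vs first arg) fails."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            result = fn(*args, **kwargs)
            if not predicate(result, args[0]):
                raise ValueError(f"Postcondition violated: {message}")
            return result
        return wrapper
    return decorator

@require(lambda price: price >= 0, "price must be non-negative")
@ensure(lambda result, price: 0 <= result <= price, "discount must not exceed price")
def calculate_discount(price: float, user_tier: str) -> float:
    if user_tier == "gold":
        return price * 0.8
    elif user_tier == "silver":
        return price * 0.9
    return price
```

With the contract in place, `calculate_discount(-100.0, "gold")` raises `ValueError` at the call site instead of silently returning a negative price.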

Level 4: Formal Verification

Prove mathematically that code satisfies specifications:

method CalculateDiscount(price: real, tier: string) returns (result: real)
  requires price >= 0.0
  ensures 0.0 <= result <= price
{
  if tier == "gold" {
    result := price * 0.8;
  } else if tier == "silver" {
    result := price * 0.9;
  } else {
    result := price;
  }
}

Dafny, TLA+, and similar tools mathematically prove your code meets its specification for all possible inputs.

Practical Implementation for SRE Teams

Full formal verification isn't practical for most codebases. Here's a pragmatic approach:

Tier 1: Everything (CI/CD Required)

# .github/workflows/verify.yml
- name: Static Analysis
  run: |
    ruff check .
    mypy --strict src/
    
- name: Security Scan
  run: semgrep --config auto --error .

Tier 2: Critical Paths (Required for Merge)

- name: Property Tests
  run: pytest tests/property/ --hypothesis-profile=ci
  
- name: Contract Verification
  run: pyicontract-hypothesis test --path src/billing/

Tier 3: Core Invariants (Weekly/Release)

- name: Formal Verification
  run: |
    dafny verify src/core/*.dfy
    java -jar tla2tools.jar specs/consensus.tla  # TLC model checker

What to Verify

Focus verification efforts on code that:

  1. Handles money: Billing, payments, credits
  2. Controls access: Authentication, authorization
  3. Manages state: Distributed consensus, caches
  4. Affects safety: Rate limiting, circuit breakers

For AI-generated code specifically, add verification when:

  • The generated code handles edge cases
  • You don't fully understand why the code works
  • The code will run in production without human review

Tools by Language

| Language | Static | Property Testing | Contracts | Formal |
| --- | --- | --- | --- | --- |
| Python | mypy, ruff | hypothesis | icontract | CrossHair |
| Go | go vet, staticcheck | rapid | - | - |
| Rust | clippy | proptest | contracts | Prusti |
| TypeScript | tsc, eslint | fast-check | io-ts | - |
| Java | SpotBugs | jqwik | cofoja | KeY |

AI-Specific Verification Patterns

Pattern 1: Specification-First Generation

Instead of generating code directly, have AI generate specifications first:

Prompt: "Write a Dafny specification for a function that calculates shipping cost 
based on weight and distance. Include all necessary preconditions and postconditions."

Then generate an implementation that must satisfy the spec.
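The same spec-first flow works in plain Python without a theorem prover: write the contract as executable assertions before any implementation exists, then accept a generated implementation only if it passes. A sketch with a hypothetical shipping-cost spec (the rates and sample inputs are illustrative, not from the article):

```python
def check_shipping_spec(fn) -> bool:
    """Executable specification, written before any implementation exists."""
    for weight, distance in [(1.0, 10.0), (5.0, 100.0), (0.5, 1.0)]:
        cost = fn(weight, distance)
        # Postcondition: cost is strictly positive...
        assert cost > 0
        # ...and monotone in weight: heavier packages never cost less.
        assert fn(weight * 2, distance) >= cost
    return True

# A candidate implementation (e.g. AI-generated) is accepted only if it
# satisfies the spec above. The rate constants are placeholders.
def shipping_cost(weight: float, distance: float) -> float:
    BASE, PER_KG_KM = 5.0, 0.05
    return BASE + PER_KG_KM * weight * distance

print(check_shipping_spec(shipping_cost))  # True
```

The spec stays stable while implementations come and go, which is exactly the property you want when the implementation author is a model.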

Pattern 2: Invariant Extraction

Ask AI to identify invariants in existing code:

Prompt: "What invariants must hold for this billing calculation? 
Express them as Python icontract decorators."

Pattern 3: Test Oracle Generation

Use AI to generate property-based test oracles:

Prompt: "Generate Hypothesis strategies and properties for testing this 
rate limiter implementation. Focus on edge cases around counter wraparound."
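What such a generated oracle might look like, sketched without dependencies: the `TokenBucket` below is a hypothetical stand-in for your rate limiter, and plain random sampling stands in for Hypothesis strategies.

```python
import random

class TokenBucket:
    """Hypothetical rate limiter: at most `capacity` requests per window."""
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.tokens = capacity

    def allow(self) -> bool:
        if self.tokens > 0:
            self.tokens -= 1
            return True
        return False

def check_rate_limit_property(trials: int = 200) -> bool:
    """Oracle: within one window, allowed requests never exceed capacity."""
    rng = random.Random(0)
    for _ in range(trials):
        capacity = rng.randint(1, 10)
        bucket = TokenBucket(capacity)
        requests = rng.randint(0, 50)
        allowed = sum(bucket.allow() for _ in range(requests))
        assert allowed <= capacity, f"limit exceeded: {allowed} > {capacity}"
        assert allowed == min(requests, capacity)
    return True

print(check_rate_limit_property())  # True
```

The value of having AI draft the oracle is coverage: it will happily enumerate boundary conditions (zero capacity, bursts larger than the window) that a human might not think to write down.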

Verification in Production

Runtime verification catches what static analysis misses:

from opentelemetry import trace

def calculate_discount(price: float, user_tier: str) -> float:
    tracer = trace.get_tracer(__name__)
    with tracer.start_as_current_span("calculate_discount") as span:
        # Runtime invariant check
        if price < 0:
            span.set_attribute("invariant.violation", "negative_price")
            raise ValueError(f"Price invariant violated: {price}")
        
        # Delegate to the pure discount logic (implementation not shown here)
        result = _calculate_discount_impl(price, user_tier)
        
        # Post-condition check
        if result > price:
            span.set_attribute("invariant.violation", "discount_exceeds_price")
            raise ValueError(f"Discount invariant violated: {result} > {price}")
        
        return result

Combine with alerting to catch invariant violations in production before they cascade.

The Economics of Verification

Verification has costs:

  • Time: Property tests are slower than unit tests
  • Expertise: Formal methods require specialized knowledge
  • Maintenance: Specifications must evolve with code

But so do bugs:

  • Incident response: Hours to days of engineer time
  • Customer impact: Revenue loss, trust erosion
  • Regulatory risk: Compliance violations, fines

The math favors verification for critical paths. A billing bug that costs $100K to resolve justifies significant verification investment.
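The break-even arithmetic is a simple expected-value comparison: verify when the probability-weighted cost of a bug exceeds the cost of the verification work. A sketch with illustrative numbers (all figures hypothetical):

```python
def verification_breaks_even(bug_probability: float, bug_cost: float,
                             verification_cost: float) -> bool:
    """Verify when the expected cost of the bug exceeds the cost of verifying."""
    return bug_probability * bug_cost > verification_cost

# A 10% chance of a $100K billing bug vs. $5K of verification effort:
print(verification_breaks_even(0.10, 100_000, 5_000))  # True: 10_000 > 5_000
```

The same comparison run over a low-stakes module (say, a 1% chance of a minor bug) comes out the other way, which is why verification effort should be tiered rather than uniform.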

Getting Started

  1. This week: Add static analysis to CI if missing
  2. This month: Implement property-based tests for one critical module
  3. This quarter: Add contracts to AI-generated code in billing/auth
  4. This year: Evaluate formal verification for core algorithms

Start small, measure bug escape rates, and expand verification where it proves valuable.

Conclusion

As AI generates more code, the verification gap widens. Traditional testing catches anticipated bugs; verification catches unanticipated ones.

You don't need to formally verify everything. But for code that handles money, controls access, or runs unsupervised, verification is becoming table stakes.

The question isn't whether to verify AI-generated code. It's which level of verification matches your risk tolerance.


Want AI agents that help verify and test your code? Akmatori can automate code review, run verification suites, and catch issues before they reach production.

Automate incident response and prevent on-call burnout with AI-driven agents!