14.03.2026

Promptfoo: Test and Red Team Your LLM Applications

Deploying LLM-powered applications without proper testing is like shipping code without unit tests. You might get lucky, or you might deploy a chatbot that leaks secrets, generates harmful content, or hallucinates critical information. Promptfoo provides the testing infrastructure that LLM applications desperately need.

What is Promptfoo?

Promptfoo is an open-source CLI and Node.js library for evaluating LLM outputs systematically. Instead of manually checking if your prompts work correctly, you define test cases with expected behaviors and let promptfoo run automated evaluations. The tool also includes red teaming capabilities for identifying security vulnerabilities in AI systems.

Quick Reference

# Install promptfoo
npm install -g promptfoo

# Initialize a new project
promptfoo init --example getting-started

# Run evaluations
promptfoo eval

# View results in browser
promptfoo view

# Run red team security scan
promptfoo redteam run

Key Features

  • Prompt evaluation: Define test cases and assertions to verify LLM outputs meet your requirements
  • Model comparison: Run the same prompts against GPT-4, Claude, Gemini, Llama, and other models side by side
  • Red teaming: Automated vulnerability scanning for prompt injection, jailbreaks, PII leakage, and harmful outputs
  • CI/CD integration: Fail builds when LLM outputs regress or security vulnerabilities are detected
  • Local execution: All evaluations run on your machine, so prompts and data never leave your infrastructure
  • Code scanning: Review pull requests for LLM-related security and compliance issues

Installation

Install promptfoo using your preferred package manager:

# npm (recommended)
npm install -g promptfoo

# Homebrew
brew install promptfoo

# Run without installing
npx promptfoo@latest eval

Setting Up Evaluations

Create a promptfooconfig.yaml file to define your evaluation:

prompts:
  - "Summarize this text in one sentence: {{text}}"
  - "Give a brief summary of: {{text}}"

providers:
  - openai:gpt-4
  - anthropic:claude-3-opus

tests:
  - vars:
      text: "The quick brown fox jumps over the lazy dog."
    assert:
      - type: contains
        value: "fox"
      - type: llm-rubric
        value: "The summary captures the main action"

Run the evaluation:

export OPENAI_API_KEY=sk-...
export ANTHROPIC_API_KEY=sk-ant-...
promptfoo eval

Red Teaming for Security

Promptfoo's red team module scans your LLM application for common vulnerabilities:

# Initialize red team configuration
promptfoo redteam init

# Run vulnerability scan
promptfoo redteam run

The scanner tests for:

  • Prompt injection: Attempts to override system instructions
  • Jailbreaks: Techniques to bypass safety guardrails
  • PII extraction: Attempts to leak personal information
  • Harmful content: Requests for dangerous or illegal outputs
  • Hallucinations: Factually incorrect responses
  • Excessive agency: Unintended actions or tool calls
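The scan is driven by a redteam section in your config, where you choose which vulnerability categories (plugins) and attack techniques (strategies) to test. A sketch (the plugin and strategy identifiers below are illustrative; consult the promptfoo docs for the current names):

```yaml
# promptfooconfig.yaml (excerpt)
redteam:
  purpose: "Customer support chatbot for a retail store"
  plugins:
    - pii                 # personal-information leakage
    - harmful             # dangerous or illegal content
    - hallucination       # fabricated facts
    - excessive-agency    # unintended actions or tool calls
  strategies:
    - jailbreak           # guardrail-bypass attempts
    - prompt-injection    # instruction-override attempts
```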

CI/CD Integration

Add promptfoo to your GitHub Actions workflow:

name: LLM Evaluation
on: [push, pull_request]

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: '20'
      - run: npm install -g promptfoo
      - run: promptfoo eval --ci
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}

The --ci flag outputs machine-readable results and returns non-zero exit codes when assertions fail.
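To keep a history of results across builds, you can write the evaluation output to a file and store it as a build artifact. A sketch extending the final steps of the job above (the artifact name is arbitrary):

```yaml
      - run: promptfoo eval --output results.json
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
      - uses: actions/upload-artifact@v4
        with:
          name: promptfoo-results
          path: results.json
```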

Comparing Models

Run the same prompts against multiple providers to find the best model for your use case:

providers:
  - openai:gpt-4
  - openai:gpt-4-turbo
  - anthropic:claude-3-opus
  - anthropic:claude-3-sonnet
  - ollama:llama3.2

tests:
  - vars:
      query: "Explain Kubernetes pods"
    assert:
      - type: llm-rubric
        value: "Accurate, clear, and concise"
      - type: cost
        threshold: 0.01
      - type: latency
        threshold: 5000

The results view shows side-by-side comparisons with latency, cost, and quality metrics.

Operational Tips

Start by testing your most critical prompts. Define assertions based on actual failure modes you have observed or anticipate. Use the LLM-as-judge feature (llm-rubric) for subjective quality assessments. Run red team scans before any production deployment and after major prompt changes.

For production systems, set up continuous evaluation in CI to catch regressions. Store evaluation results to track quality over time. Consider running scheduled red team scans as part of your security program.
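A scheduled red team scan can be as simple as a cron-triggered workflow (the schedule and secret name are examples):

```yaml
name: Weekly Red Team Scan
on:
  schedule:
    - cron: "0 6 * * 1"  # Mondays at 06:00 UTC
jobs:
  redteam:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npx promptfoo@latest redteam run
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
```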

Conclusion

Promptfoo brings engineering rigor to LLM development. Automated evaluations catch regressions before they reach users. Red teaming identifies security vulnerabilities before attackers do. For any team deploying LLM applications at scale, systematic testing is not optional.

Looking for an AI-powered platform to enhance your SRE workflows? Check out Akmatori, an open-source AI agent designed for infrastructure teams. Built on Gcore infrastructure for reliable global performance.

Automate incident response and prevent on-call burnout with AI-driven agents!