# Promptfoo: Test and Red Team Your LLM Applications

Deploying LLM-powered applications without proper testing is like shipping code without unit tests. You might get lucky, or you might deploy a chatbot that leaks secrets, generates harmful content, or hallucinates critical information. Promptfoo provides the testing infrastructure that LLM applications desperately need.
## What is Promptfoo?
Promptfoo is an open-source CLI and Node.js library for evaluating LLM outputs systematically. Instead of manually checking if your prompts work correctly, you define test cases with expected behaviors and let promptfoo run automated evaluations. The tool also includes red teaming capabilities for identifying security vulnerabilities in AI systems.
## Quick Reference

```bash
# Install promptfoo
npm install -g promptfoo

# Initialize a new project
promptfoo init --example getting-started

# Run evaluations
promptfoo eval

# View results in browser
promptfoo view

# Run red team security scan
promptfoo redteam run
```
## Key Features
- Prompt evaluation: Define test cases and assertions to verify LLM outputs meet your requirements
- Model comparison: Run the same prompts against GPT-4, Claude, Gemini, Llama, and other models side by side
- Red teaming: Automated vulnerability scanning for prompt injection, jailbreaks, PII leakage, and harmful outputs
- CI/CD integration: Fail builds when LLM outputs regress or security vulnerabilities are detected
- Local execution: All evaluations run on your machine, so prompts and data never leave your infrastructure
- Code scanning: Review pull requests for LLM-related security and compliance issues
## Installation

Promptfoo is distributed as a Node.js package. Install it globally, via Homebrew, or run it ad hoc with npx:

```bash
# npm (recommended)
npm install -g promptfoo

# Homebrew
brew install promptfoo

# Run without installing
npx promptfoo@latest eval
```
## Setting Up Evaluations

Create a `promptfooconfig.yaml` file to define your evaluation:
```yaml
prompts:
  - "Summarize this text in one sentence: {{text}}"
  - "Give a brief summary of: {{text}}"

providers:
  - openai:gpt-4
  - anthropic:claude-3-opus

tests:
  - vars:
      text: "The quick brown fox jumps over the lazy dog."
    assert:
      - type: contains
        value: "fox"
      - type: llm-rubric
        value: "The summary captures the main action"
```
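Assertions are not limited to substring checks and LLM rubrics. As a sketch of other commonly used assertion types (type names follow promptfoo's documentation, but verify the exact set against your installed version), a single test can mix deterministic and scripted checks:

```yaml
tests:
  - vars:
      text: "The quick brown fox jumps over the lazy dog."
    assert:
      # Case-insensitive substring match
      - type: icontains
        value: "fox"
      # Regex check: starts with a capital letter, ends with a period
      - type: regex
        value: "^[A-Z].*\\.$"
      # JavaScript assertion: the expression receives the raw output string
      - type: javascript
        value: "output.length <= 120"
```

Deterministic assertions like these are cheap to run on every commit, while `llm-rubric` judgments cost an extra model call per test.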
Run the evaluation:

```bash
export OPENAI_API_KEY=sk-...
export ANTHROPIC_API_KEY=sk-ant-...
promptfoo eval
```
## Red Teaming for Security

Promptfoo's red team module scans your LLM application for common vulnerabilities:

```bash
# Initialize red team configuration
promptfoo redteam init

# Run vulnerability scan
promptfoo redteam run
```
The scanner tests for:
- Prompt injection: Attempts to override system instructions
- Jailbreaks: Techniques to bypass safety guardrails
- PII extraction: Attempts to leak personal information
- Harmful content: Requests for dangerous or illegal outputs
- Hallucinations: Factually incorrect responses
- Excessive agency: Unintended actions or tool calls
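The scan is driven by a `redteam` section in `promptfooconfig.yaml`. A minimal sketch (the plugin and strategy names follow promptfoo's documentation but should be verified against your version; the `purpose` string is illustrative):

```yaml
redteam:
  # Describes the application so generated attacks are contextually relevant
  purpose: "Customer support chatbot for a retail store"
  plugins:
    - pii              # attempts to extract personal information
    - harmful          # probes for dangerous or illegal content
  strategies:
    - jailbreak        # iteratively rewrites prompts to bypass guardrails
    - prompt-injection # embeds instructions that try to override the system prompt
```

Running `promptfoo redteam init` generates a starting configuration like this interactively.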
## CI/CD Integration

Add promptfoo to your GitHub Actions workflow:

```yaml
name: LLM Evaluation
on: [push, pull_request]
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: '20'
      - run: npm install -g promptfoo
      - run: promptfoo eval --ci
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
```
The `--ci` flag outputs machine-readable results and returns a non-zero exit code when assertions fail.
## Comparing Models

Run the same prompts against multiple providers to find the best model for your use case:
```yaml
providers:
  - openai:gpt-4
  - openai:gpt-4-turbo
  - anthropic:claude-3-opus
  - anthropic:claude-3-sonnet
  - ollama:llama3.2

tests:
  - vars:
      query: "Explain Kubernetes pods"
    assert:
      - type: llm-rubric
        value: "Accurate, clear, and concise"
      - type: cost
        threshold: 0.01
      - type: latency
        threshold: 5000
```
The results view shows side-by-side comparisons with latency, cost, and quality metrics.
## Operational Tips

Start by testing your most critical prompts. Define assertions based on actual failure modes you have observed or anticipate. Use the LLM-as-judge feature (`llm-rubric`) for subjective quality assessments. Run red team scans before any production deployment and after major prompt changes.
For production systems, set up continuous evaluation in CI to catch regressions. Store evaluation results to track quality over time. Consider running scheduled red team scans as part of your security program.
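A scheduled red team scan can reuse the same GitHub Actions setup shown above for evaluations. A sketch (the weekly cron cadence is illustrative; adjust to your security program's requirements):

```yaml
name: Scheduled Red Team Scan
on:
  schedule:
    - cron: '0 6 * * 1'  # every Monday at 06:00 UTC
jobs:
  redteam:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: '20'
      - run: npm install -g promptfoo
      - run: promptfoo redteam run
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
```

Scheduled scans catch drift: a prompt change that passed review last month may become vulnerable as attack techniques in the scanner's plugin set are updated.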
## Conclusion
Promptfoo brings engineering rigor to LLM development. Automated evaluations catch regressions before they reach users. Red teaming identifies security vulnerabilities before attackers do. For any team deploying LLM applications at scale, systematic testing is not optional.
