Reference: “GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning” (Agrawal et al., 2025, arXiv:2507.19457)
## What is GEPA?
GEPA is a reflective optimizer that adaptively evolves the textual components (such as prompts) of arbitrary systems. In addition to the scalar scores returned by metrics, users can provide GEPA with textual feedback to guide the optimization process. This feedback gives GEPA visibility into why the system received the score it did, so GEPA can introspect on its traces and identify how to improve. As a result, GEPA can propose high-performing prompts in very few rollouts.

## What Makes GEPA Unique?
Unlike traditional optimizers (COPRO, MIPROv2), GEPA introduces several key innovations:

### 1. Rich Textual Feedback
Instead of just scalar scores (0.8, 0.9), GEPA leverages detailed explanations of why a score was earned. For example:
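The contrast looks roughly like this; a minimal sketch, where `FeedbackMetric` and its fields are stand-ins for the DSRs types rather than the verbatim definitions:

```rust
// Hypothetical stand-in for the DSRs FeedbackMetric type; field names are assumptions.
struct FeedbackMetric {
    score: f32,
    feedback: String,
}

fn main() {
    // Scalar-only signal: the optimizer learns *that* it scored 0.8, not *why*.
    let scalar_only = FeedbackMetric { score: 0.8, feedback: String::new() };

    // GEPA-style signal: the same score plus an explanation the reflection LLM can act on.
    let with_text = FeedbackMetric {
        score: 0.8,
        feedback: "4/5 answers correct; the miss confused net vs. gross revenue \
                   because the prompt never defines the terms."
            .to_string(),
    };

    println!("{} vs. {}: {}", scalar_only.score, with_text.score, with_text.feedback);
}
```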
### 2. Pareto-based Selection

GEPA maintains a diverse set of candidates that excel on different examples, preventing premature convergence:

- Candidate A: Best on examples 1, 3, 5
- Candidate B: Best on examples 2, 4, 6
- Both stay in the population (complementary strengths)
### 3. LLM-driven Reflection
GEPA uses LLMs to analyze execution traces and propose targeted improvements.

### 4. Inference-Time Search
GEPA can optimize at test time, not just training time (see the Inference-Time Search section below).

## Quick Start
### 1. Implement FeedbackEvaluator
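A minimal sketch of what such an evaluator could look like; the trait and struct shapes below follow this guide's terminology but are assumptions, not the verbatim dsrs signatures:

```rust
// Sketch only: trait and struct follow this guide's terminology
// (FeedbackEvaluator, FeedbackMetric); real dsrs signatures may differ.
struct FeedbackMetric {
    score: f32,
    feedback: String,
}

trait FeedbackEvaluator {
    fn evaluate(&self, prediction: &str, expected: &str) -> FeedbackMetric;
}

struct ExactMatch;

impl FeedbackEvaluator for ExactMatch {
    fn evaluate(&self, prediction: &str, expected: &str) -> FeedbackMetric {
        if prediction.trim().eq_ignore_ascii_case(expected.trim()) {
            FeedbackMetric { score: 1.0, feedback: "Exact match.".to_string() }
        } else {
            FeedbackMetric {
                score: 0.0,
                // Say *what* was wrong, not just that it was wrong.
                feedback: format!("Expected '{expected}', got '{prediction}'."),
            }
        }
    }
}

fn main() {
    let m = ExactMatch.evaluate("Paris", "paris");
    println!("{}: {}", m.score, m.feedback);
}
```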
### 2. Configure and Run GEPA
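A runnable skeleton of the overall wiring, with `Gepa::new` and `optimize` as stand-ins for the real DSRs entry points, which may be named differently:

```rust
// Runnable skeleton of the quick-start flow; entry-point names are assumptions.
struct FeedbackMetric {
    score: f32,
    feedback: String,
}

struct Gepa {
    max_rollouts: usize, // total rollout budget for the search
}

impl Gepa {
    fn new(max_rollouts: usize) -> Self {
        Self { max_rollouts }
    }

    // Stub: the real optimizer runs the evolutionary loop described under
    // "Architecture" and returns the best-performing prompt it found.
    fn optimize<F>(&self, seed_prompt: &str, trainset: &[(&str, &str)], metric: F) -> String
    where
        F: Fn(&str, &str) -> FeedbackMetric,
    {
        let _ = (trainset, metric, self.max_rollouts);
        seed_prompt.to_string()
    }
}

fn main() {
    let trainset = [("2+2?", "4"), ("3+3?", "6")];
    let gepa = Gepa::new(150);
    let best = gepa.optimize("Answer the math question.", &trainset, |pred, gold| FeedbackMetric {
        score: if pred == gold { 1.0 } else { 0.0 },
        feedback: format!("expected {gold}, got {pred}"),
    });
    println!("optimized prompt: {best}");
}
```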
## Feedback Helpers
DSRs provides utilities for common feedback patterns.

### Document Retrieval
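A hypothetical helper in this spirit, computing recall and listing correct, incorrect, and missed documents so the score is explainable:

```rust
use std::collections::HashSet;

// Hypothetical helper; not the verbatim DSRs API.
fn retrieval_feedback(retrieved: &[&str], relevant: &[&str]) -> (f32, String) {
    let retrieved_set: HashSet<&str> = retrieved.iter().copied().collect();
    let relevant_set: HashSet<&str> = relevant.iter().copied().collect();

    let correct: Vec<&str> = retrieved_set.intersection(&relevant_set).copied().collect();
    let incorrect: Vec<&str> = retrieved_set.difference(&relevant_set).copied().collect();
    let missed: Vec<&str> = relevant_set.difference(&retrieved_set).copied().collect();

    // Recall as the score; the three lists explain it.
    let recall = correct.len() as f32 / relevant.len().max(1) as f32;
    let feedback = format!(
        "Correctly retrieved: {correct:?}. Incorrect: {incorrect:?}. Missed: {missed:?}."
    );
    (recall, feedback)
}

fn main() {
    let (score, fb) = retrieval_feedback(&["doc1", "doc4"], &["doc1", "doc2"]);
    println!("{score}: {fb}");
}
```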
### Code Generation
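A hypothetical stage-based helper, assigning partial credit by how far the generated code got through the parse, compile, run, test pipeline described later in this guide:

```rust
// Hypothetical stage-based feedback for generated code; scores are illustrative.
enum Stage { Parse, Compile, Run, Test }

fn code_feedback(failed_stage: Option<Stage>, detail: &str) -> (f32, String) {
    match failed_stage {
        None => (1.0, "All stages passed: parse, compile, run, test.".to_string()),
        Some(Stage::Parse) => (0.0, format!("Parse failed: {detail}")),
        Some(Stage::Compile) => (0.25, format!("Parsed, but compilation failed: {detail}")),
        Some(Stage::Run) => (0.5, format!("Compiled, but crashed at runtime: {detail}")),
        Some(Stage::Test) => (0.75, format!("Ran, but tests failed: {detail}")),
    }
}

fn main() {
    let (score, fb) = code_feedback(Some(Stage::Test), "expected 42, got 41 in test_sum");
    println!("{score}: {fb}");
}
```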
### Multi-Objective Optimization
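A hypothetical decomposition helper; the objectives and weights are illustrative, but the pattern is the one this guide recommends, reporting each component alongside the aggregate:

```rust
// Hypothetical multi-objective decomposition: the aggregate score is reported
// together with each component so the reflection step can see what to fix.
fn multi_objective_feedback(correctness: f32, conciseness: f32, style: f32) -> (f32, String) {
    let score = 0.6 * correctness + 0.2 * conciseness + 0.2 * style;
    let feedback = format!(
        "Aggregate {score:.2} = 0.6*correctness({correctness:.2}) \
         + 0.2*conciseness({conciseness:.2}) + 0.2*style({style:.2}). \
         The weakest objective should be addressed first."
    );
    (score, feedback)
}

fn main() {
    let (s, fb) = multi_objective_feedback(1.0, 0.4, 0.8);
    println!("{s:.2}: {fb}");
}
```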
## Configuration Options
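The exact option names in DSRs may differ; an illustrative set of knobs, inferred from what this guide describes, might look like:

```rust
// Illustrative only: field names are inferred from this guide's text,
// not the actual dsrs configuration type.
struct GepaOptions {
    max_rollouts: usize,      // hard budget; GEPA stops when it is exhausted
    minibatch_size: usize,    // trainset examples sampled per iteration
    reflection_model: String, // LLM used for the reflective meta-prompting step
    track_best_outputs: bool, // keep per-example best outputs (see Inference-Time Search)
}

impl Default for GepaOptions {
    fn default() -> Self {
        Self {
            max_rollouts: 200,
            minibatch_size: 4,
            reflection_model: "your-strong-reflection-model".to_string(),
            track_best_outputs: false,
        }
    }
}

fn main() {
    let opts = GepaOptions { max_rollouts: 100, ..Default::default() };
    println!("{} rollouts, minibatch {}", opts.max_rollouts, opts.minibatch_size);
    let _ = (opts.reflection_model, opts.track_best_outputs);
}
```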
## Understanding GEPA Results
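A hypothetical result shape, with fields inferred from what this guide says GEPA tracks (the Pareto frontier, the rollout budget, and the best candidate by average score):

```rust
// Hypothetical shape of an optimization result; field names are assumptions.
struct GepaResult {
    best_prompt: String,   // candidate with the highest average validation score
    best_score: f32,       // its average score on the valset
    frontier_size: usize,  // candidates remaining on the Pareto frontier
    rollouts_used: usize,  // how much of the budget was spent
}

fn main() {
    let result = GepaResult {
        best_prompt: "Answer concisely, citing the source document.".to_string(),
        best_score: 0.87,
        frontier_size: 3,
        rollouts_used: 142,
    };
    println!(
        "best score {:.2} after {} rollouts ({} frontier candidates)",
        result.best_score, result.rollouts_used, result.frontier_size
    );
    let _ = result.best_prompt;
}
```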
## Architecture
### Core Components
**FeedbackMetric**: pairs the scalar score with textual feedback for each rollout (see Implementing Feedback Metrics below)

**Pareto Frontier**:
- Each candidate tracks which examples it wins on
- Sampling is proportional to coverage
- Dominated candidates are automatically pruned
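A compact illustration of the coverage-proportional sampling rule (self-contained, not the DSRs implementation):

```rust
// Candidates are sampled with probability proportional to the number of
// examples they win on. `roll` is a uniform draw in [0, total_wins).
fn sample_proportional<'a>(frontier: &'a [(&'a str, usize)], mut roll: usize) -> &'a str {
    for &(candidate, wins) in frontier {
        if roll < wins {
            return candidate;
        }
        roll -= wins;
    }
    frontier.last().expect("non-empty frontier").0
}

fn main() {
    // Candidate A wins on 3 examples, candidate B on 3 (complementary strengths).
    let frontier = [("candidate_a", 3), ("candidate_b", 3)];
    // With 6 total wins, rolls 0..=2 pick A and rolls 3..=5 pick B.
    assert_eq!(sample_proportional(&frontier, 1), "candidate_a");
    assert_eq!(sample_proportional(&frontier, 4), "candidate_b");
    println!("proportional sampling behaves as described");
}
```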
### Evolutionary Algorithm
1. Initialize the candidate pool with the unoptimized program
2. Iterate until the budget is exhausted:
   - Sample a candidate from the Pareto frontier (proportional to coverage)
   - Sample a minibatch from the training set
   - Collect execution traces with feedback
   - Select a module for targeted improvement
   - LLM reflection: propose a new instruction using reflective meta-prompting
   - Roll out the new candidate; if improved, evaluate it on the validation set
   - Update the Pareto frontier
3. Return the best candidate by average score
## Implementing Feedback Metrics
A well-designed metric is central to GEPA's sample efficiency. The DSRs implementation expects the metric to return a FeedbackMetric struct with both a score and rich textual feedback.
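A sketch of that shape; the actual dsrs definition may carry additional fields:

```rust
// Matches this guide's description (a score plus textual feedback).
pub struct FeedbackMetric {
    pub score: f32,
    pub feedback: String,
}

impl FeedbackMetric {
    // Convenience constructor for the common case.
    pub fn new(score: f32, feedback: impl Into<String>) -> Self {
        Self { score, feedback: feedback.into() }
    }
}

fn main() {
    let m = FeedbackMetric::new(0.75, "3/4 pipeline stages passed; tests failed on empty input.");
    println!("{} -> {}", m.score, m.feedback);
}
```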
### Practical Recipe for GEPA-Friendly Feedback
- Leverage Existing Artifacts: Use logs, unit tests, evaluation scripts, profiler outputs
- Decompose Outcomes: Break scores into per-objective components
- Expose Trajectories: Label pipeline stages with pass/fail and errors
- Ground in Checks: Use validators or LLM-as-a-judge for subjective tasks
- Prioritize Clarity: Focus on error coverage and decision points
## Feedback Examples by Domain
- **Document Retrieval**: List correctly retrieved, incorrect, or missed documents
- **Multi-Objective Tasks**: Decompose aggregate scores to reveal contributions from each objective
- **Stacked Pipelines**: Expose stage-specific failures (parse, compile, run, test); a sample follows below
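For instance, stage-labeled feedback for a stacked code pipeline might read like this (illustrative text, not actual DSRs output):

```text
[parse]   PASS
[compile] PASS
[run]     PASS
[test]    FAIL: test_empty_input expected [], got panic "index out of bounds"
Score: 0.75 (3 of 4 stages passed)
```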
## Best Practices

### Design Feedback for Actionability
### Leverage Domain Knowledge
- Code generation: Show stage-specific failures
- Retrieval: List specific documents missed
- QA: Explain reasoning errors
### Balance Feedback Detail
- Too brief: Not actionable
- Too verbose: Drowns out signal
- Sweet spot: 2-5 lines per issue
### Set Realistic Budgets
## Examples
### Sentiment Analysis
Basic GEPA usage with explicit feedback for sentiment classification.
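A sketch of such a metric; the label set and messages are illustrative:

```rust
// Explicit feedback for sentiment classification: name the error pattern,
// not just the mismatch, so the reflection step has something to act on.
fn sentiment_feedback(predicted: &str, gold: &str, text: &str) -> (f32, String) {
    if predicted == gold {
        (1.0, format!("Correctly labeled '{text}' as {gold}."))
    } else {
        (0.0, format!(
            "Mislabeled '{text}' as {predicted} (gold: {gold}). \
             Check for negation or sarcasm cues before deciding polarity."
        ))
    }
}

fn main() {
    let (score, fb) = sentiment_feedback("positive", "negative", "Great, another delay.");
    println!("{score}: {fb}");
}
```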
### LLM-as-Judge
Using an LLM judge to automatically generate feedback.
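A sketch with the judge call stubbed out; `call_judge_llm` is a placeholder standing in for whatever LLM client your project already uses:

```rust
// Stub: a real implementation would send `prompt` to a judge model and
// return its critique.
fn call_judge_llm(prompt: &str) -> String {
    let _ = prompt;
    "Score: 0.6. The answer is factually correct but ignores the requested format.".to_string()
}

fn judge_feedback(question: &str, answer: &str) -> (f32, String) {
    let critique = call_judge_llm(&format!(
        "Grade this answer from 0 to 1 and explain the main flaw.\n\
         Question: {question}\nAnswer: {answer}"
    ));
    // Naive parse of the leading "Score: X." marker; real code should be more robust.
    let score = critique
        .strip_prefix("Score: ")
        .and_then(|rest| rest.split(". ").next())
        .and_then(|s| s.parse::<f32>().ok())
        .unwrap_or(0.0);
    (score, critique)
}

fn main() {
    let (score, fb) = judge_feedback("Summarize the memo.", "The memo says sales rose 4%.");
    println!("{score}: {fb}");
}
```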
## Comparison with Other Optimizers
| Feature | COPRO | MIPROv2 | GEPA |
|---|---|---|---|
| Feedback Type | Score | Score | Score + Text |
| Selection Strategy | Best | Batch | Pareto |
| Diversity | Low | Medium | High |
| Actionability | Low | Medium | High |
| Compute Cost | Low | Medium | Medium-High |
| Sample Efficiency | Medium | High | Very High |
## When to Use GEPA
- Complex tasks with subtle failure modes
- When you can provide rich feedback
- Multi-objective optimization
- Need for diverse solutions
- Inference-time search
## When to Use Alternatives
- COPRO: Simple tasks, quick iteration
- MIPROv2: Best prompting practices, single objective
## Troubleshooting
### Issue: “GEPA requires FeedbackEvaluator trait”

Your metric type must implement the FeedbackEvaluator trait (see step 1 of Quick Start) so that it returns a FeedbackMetric rather than a bare score.
### Issue: Slow convergence

Make the feedback more specific and actionable (see Best Practices): vague feedback gives the reflection step little to work with.
### Issue: Running out of budget

Spend fewer rollouts per iteration (for example, a smaller minibatch) or start from a smaller training set, and set a realistic total budget up front.
## Inference-Time Search
GEPA can act as a test-time/inference search mechanism. By setting your `valset` to your evaluation batch and using `track_best_outputs=True`, GEPA produces, for each batch element, the highest-scoring outputs found during the evolutionary search.
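A toy illustration of the per-element selection this enables; names like `track_best_outputs` mirror the guide's description, not a confirmed dsrs API:

```rust
fn main() {
    let eval_batch = ["input 1", "input 2", "input 3"];

    // Simulated per-element scores for two candidates found during the search.
    let candidate_outputs = [
        ("candidate_a", [0.9_f32, 0.2, 0.5]),
        ("candidate_b", [0.3, 0.8, 0.7]),
    ];

    // With best-output tracking enabled, GEPA keeps the highest-scoring output
    // per batch element, regardless of which candidate produced it.
    for (i, input) in eval_batch.iter().enumerate() {
        let (best, score) = candidate_outputs
            .iter()
            .map(|(name, scores)| (*name, scores[i]))
            .max_by(|a, b| a.1.total_cmp(&b.1))
            .unwrap();
        println!("{input}: best output from {best} (score {score})");
    }
}
```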