The Pattern
Instead of manually writing feedback rules, use another LLM to evaluate the output and reasoning.
Why Use LLM-as-a-Judge?
Good For
- Subjective quality assessment (writing style, helpfulness, clarity)
- Complex reasoning evaluation (is the logic sound?)
- Tasks where rules are hard to codify
- Analyzing reasoning quality beyond just answer correctness
Complete Example Walkthrough
Full Working Example
See the complete implementation with step-by-step comments
1. Task Signature with Reasoning
2. Judge Signature
3. Module with Embedded Judge
4. FeedbackEvaluator with Judge
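The four pieces above can be sketched in library-agnostic Python. This is a minimal illustration under stated assumptions, not the walkthrough's actual implementation: `Prediction`, `Feedback`, `JUDGE_PROMPT`, `evaluate_with_judge`, and `stub_judge` are hypothetical names, the judge is assumed to be any callable that takes a prompt string and returns a JSON string, and the 50/50 weighting between answer correctness and reasoning quality is an arbitrary choice.

```python
import json
from dataclasses import dataclass
from typing import Callable

# All names below are hypothetical and are not taken from any library's API.


@dataclass
class Prediction:
    """Task output: the model's reasoning plus its final answer."""
    reasoning: str
    answer: str


@dataclass
class Feedback:
    """What the optimizer receives: a numeric score and textual feedback."""
    score: float
    feedback: str


# Judge "signature": inputs are the problem, expected answer, and the
# prediction; the requested output is a quality score and a short critique.
JUDGE_PROMPT = """You are grading a solution to a math word problem.
Problem: {problem}
Expected answer: {expected}
Model reasoning: {reasoning}
Model answer: {answer}
Reply with JSON: {{"reasoning_quality": <0.0-1.0>, "critique": "<short text>"}}"""


def evaluate_with_judge(
    problem: str,
    expected: str,
    pred: Prediction,
    judge_lm: Callable[[str], str],
) -> Feedback:
    """Combine an exact-match answer check with the judge's view of the reasoning."""
    prompt = JUDGE_PROMPT.format(
        problem=problem,
        expected=expected,
        reasoning=pred.reasoning,
        answer=pred.answer,
    )
    raw = judge_lm(prompt)
    try:
        verdict = json.loads(raw)
        # Clamp to [0, 1] in case the judge returns an out-of-range number.
        quality = min(1.0, max(0.0, float(verdict["reasoning_quality"])))
        critique = str(verdict["critique"])
    except (ValueError, KeyError, TypeError):
        # Unparseable judge reply: give no credit for reasoning.
        quality = 0.0
        critique = "Judge reply could not be parsed."

    correct = pred.answer.strip() == expected.strip()
    # Assumed weighting: half for the answer, half for the reasoning.
    score = 0.5 * float(correct) + 0.5 * quality
    status = "correct" if correct else f"incorrect (expected {expected})"
    return Feedback(score=score, feedback=f"Answer {status}. Judge: {critique}")


def stub_judge(prompt: str) -> str:
    """Stand-in for a real judge LM call, used only to make the sketch runnable."""
    return json.dumps(
        {"reasoning_quality": 0.2, "critique": "Jumped from step 2 to the answer."}
    )


fb = evaluate_with_judge(
    "What is 3 * 4?", "12", Prediction(reasoning="3 * 4 = 12", answer="12"), stub_judge
)
print(fb.score, fb.feedback)
```

With this weighting, a correct answer backed by weak reasoning scores lower than a correct answer with sound reasoning, and a wrong answer with mostly sound reasoning still earns partial credit. The critique text travels with the score so the optimizer can use it as feedback.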
Key Benefits
Catches Lucky Guesses
Rewards Partial Progress
Identifies Systematic Issues
The judge notices patterns like:
- “Model consistently skips showing intermediate steps”
- “Model confuses similar concepts (area vs perimeter)”
- “Model doesn’t check units in answers”
Feedback like this points to a concrete prompt change, such as: “Add explicit instruction to show all intermediate steps and verify units”
Cost Considerations
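LLM judges roughly double the number of LM calls during evaluation, since every prediction requires both a task LM call and a judge LM call. The actual cost increase depends on which model serves as the judge and how long its prompts and replies are.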
Optimization Tips
- Use a cheaper model for judging (e.g., gpt-4o-mini instead of gpt-4)
- Judge only failed examples (not ones that passed)
- Cache judge evaluations for identical outputs
- Use parallel evaluation to reduce wall-clock time
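Three of these tips (judge only failures, reuse verdicts for identical outputs, evaluate in parallel) can be combined in one small helper. The sketch below is an illustration under assumptions, not part of the example's code: `evaluate_batch` and `stub_judge` are hypothetical names, the judge is assumed to be a callable taking `(expected, answer, reasoning)` and returning a critique string, and "passed" is assumed to mean an exact match on the answer.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

# Hypothetical types and helpers, not taken from any library's API.
Example = tuple[str, str, str]  # (expected, answer, reasoning)


def evaluate_batch(
    examples: list[Example],
    judge: Callable[[str, str, str], str],
    max_workers: int = 8,
) -> list[str]:
    """Return one feedback string per example, in input order."""
    # Judge only failed examples; building a set also collapses identical
    # (expected, answer, reasoning) triples so each is judged once.
    unique_failures = list(
        {ex for ex in examples if ex[1].strip() != ex[0].strip()}
    )

    # Judge calls are I/O-bound, so a thread pool reduces wall-clock time.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        critiques = dict(
            zip(unique_failures, pool.map(lambda ex: judge(*ex), unique_failures))
        )

    return [
        f"Incorrect (expected {ex[0]}). Judge: {critiques[ex]}"
        if ex in critiques
        else "Correct."
        for ex in examples
    ]


calls: list[str] = []


def stub_judge(expected: str, answer: str, reasoning: str) -> str:
    """Stand-in for a real judge LM call; records each invocation."""
    calls.append(answer)
    return "Reasoning skips a step."


batch = [("12", "12", "3*4"), ("12", "13", "3*4+1"), ("12", "13", "3*4+1")]
print(evaluate_batch(batch, stub_judge))
print(len(calls))
```

In this batch the stub judge is invoked once for three examples: one example passes the explicit check and the two failures are identical. The deduplication here only spans a single batch; a cache that persists across optimization rounds would need a longer-lived store keyed the same way.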
Hybrid Approach
Best results often come from combining explicit checks with LLM judging.
Example Evolution
When you run the example, GEPA will evolve prompts based on judge feedback:
1. Baseline
Instruction: “Solve the math word problem step by step”
Result: Some solutions skip steps
Judge: “Reasoning incomplete, jumped from step 2 to answer”
2. After GEPA
Instruction: “Solve step by step. Show ALL intermediate calculations. Label each step clearly.”
Result: Complete solutions with all steps shown
Judge: “Sound reasoning, all steps shown clearly”
Running the Example
The run shows:
- Baseline performance
- Judge evaluations during optimization
- How feedback evolves the prompt
- Final test with judge analysis