Use an LLM judge to automatically generate rich feedback
This guide explains how to use an LLM judge to automatically generate rich textual feedback for GEPA optimization, making it easier to optimize complex tasks where manual feedback rules are hard to codify.
```rust
#[derive(Signature, Clone, Debug)]
/// Solve math word problems step by step.
struct MathWordProblem {
    #[input]
    pub problem: String,
    #[output]
    pub reasoning: String, // We want to optimize this too
    #[output]
    pub answer: String,
}
```
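The metric below calls a second predictor as a judge, but the guide does not show its signature. A plausible sketch, mirroring the `Signature` derive style above, might look like this (the field names are assumptions chosen to match how the metric uses `MathJudgeInput` and reads `evaluation`):

```rust
// Hypothetical judge signature -- not shown in this guide, so every
// field name here is an assumption matched to the metric's usage.
#[derive(Signature, Clone, Debug)]
/// Judge a student's math solution. Comment on reasoning quality,
/// e.g. "sound reasoning", "partially correct", "minor arithmetic error".
struct MathJudge {
    #[input]
    pub problem: String,
    #[input]
    pub expected_answer: String,
    #[input]
    pub student_answer: String,
    #[input]
    pub student_reasoning: String,
    #[output]
    pub evaluation: String,
}
```

The doc comment doubles as the judge's instruction, so nudging it toward the exact phrases the metric greps for keeps the keyword matching reliable.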
```rust
struct LlmJudgeMetric {
    judge: Predict<MathJudge>,
}

impl TypedMetric<MathWordProblem, MathSolver> for LlmJudgeMetric {
    async fn evaluate(
        &self,
        example: &Example<MathWordProblem>,
        prediction: &Predicted<<MathSolver as Module>::Output>,
    ) -> Result<MetricOutcome> {
        let problem = example.input.problem.clone();
        let expected = example.output.answer.clone();
        let student_answer = prediction.answer.clone();
        let student_reasoning = prediction.reasoning.clone();
        let exact_match = student_answer.trim() == expected.trim();

        let judge_output = self
            .judge
            .call(MathJudgeInput {
                problem: problem.clone(),
                expected_answer: expected.clone(),
                student_answer: student_answer.clone(),
                student_reasoning: student_reasoning.clone(),
            })
            .await;

        let (score, evaluation_text) = match judge_output {
            Ok(evaluation) => {
                let evaluation_text = evaluation.evaluation.clone();
                let evaluation_lc = evaluation_text.to_lowercase();
                let good_reasoning = evaluation_lc.contains("sound reasoning")
                    || evaluation_lc.contains("correct approach")
                    || evaluation_lc.contains("clear");
                let partial_reasoning = evaluation_lc.contains("partially")
                    || evaluation_lc.contains("good start")
                    || evaluation_lc.contains("minor arithmetic")
                    || evaluation_lc.contains("close");
                let score = match (exact_match, good_reasoning, partial_reasoning) {
                    (true, true, _) => 1.0,
                    (true, false, _) => 0.7,
                    (false, true, _) | (false, _, true) => 0.3,
                    (false, false, false) => 0.0,
                };
                (score, evaluation_text)
            }
            Err(err) => {
                let fallback = format!(
                    "judge call failed: {err}; expected={expected}; predicted={student_answer}"
                );
                (exact_match as u8 as f32, fallback)
            }
        };

        let feedback = FeedbackMetric::new(
            score,
            format!(
                "problem={problem}\nexpected={expected}\npredicted={student_answer}\njudge={evaluation_text}"
            ),
        );
        Ok(MetricOutcome::with_feedback(score, feedback))
    }
}
```
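The keyword-to-score mapping is the only part of this metric that involves judgment-call logic, and it can be exercised without an LLM in the loop. Below is a standalone reimplementation of that heuristic (the function name `score_from_judge` is ours, not a library API), useful for sanity-checking the buckets before wiring in the judge:

```rust
// Standalone reimplementation of the scoring heuristic used by the
// metric above. `score_from_judge` is an illustrative name, not part
// of any library.
fn score_from_judge(exact_match: bool, evaluation: &str) -> f32 {
    let eval_lc = evaluation.to_lowercase();
    // Keywords signalling solid reasoning.
    let good = ["sound reasoning", "correct approach", "clear"]
        .iter()
        .any(|kw| eval_lc.contains(kw));
    // Keywords signalling a near-miss.
    let partial = ["partially", "good start", "minor arithmetic", "close"]
        .iter()
        .any(|kw| eval_lc.contains(kw));
    match (exact_match, good, partial) {
        (true, true, _) => 1.0,                      // right answer, sound work
        (true, false, _) => 0.7,                     // right answer, weak work
        (false, true, _) | (false, _, true) => 0.3,  // wrong answer, salvageable
        (false, false, false) => 0.0,                // wrong answer, wrong approach
    }
}

fn main() {
    assert_eq!(score_from_judge(true, "Sound reasoning, all steps shown"), 1.0);
    assert_eq!(score_from_judge(true, "Right answer but muddled work"), 0.7);
    assert_eq!(score_from_judge(false, "Minor arithmetic slip at the end"), 0.3);
    assert_eq!(score_from_judge(false, "Wrong method entirely"), 0.0);
    println!("all score buckets behave as expected");
}
```

Note the asymmetry in the buckets: a correct answer with weak reasoning still scores 0.7, while strong reasoning with a wrong answer only scores 0.3, which biases GEPA toward prompts that get the final answer right.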
GEPA itself no longer owns a special `feedback_metric` hook. The feedback function lives in your `TypedMetric` implementation, and GEPA requires that every evaluation return `MetricOutcome::with_feedback(...)`. That keeps the optimizer generic while preserving the full judge-driven behavior.
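The feedback payload is just structured text that GEPA's reflection step reads verbatim, so it pays to keep the format predictable. A minimal sketch of assembling that string, matching the `format!` call in the metric above (the helper name `render_feedback` is ours):

```rust
// Assembles the feedback text in the same key=value-per-line shape the
// metric above emits. `render_feedback` is an illustrative helper, not
// a library API.
fn render_feedback(problem: &str, expected: &str, predicted: &str, judge: &str) -> String {
    format!("problem={problem}\nexpected={expected}\npredicted={predicted}\njudge={judge}")
}

fn main() {
    let fb = render_feedback(
        "If 3 pens cost $6, what do 5 pens cost?",
        "$10",
        "$10",
        "Sound reasoning, all steps shown clearly",
    );
    // One key=value pair per line keeps the payload easy for the
    // reflection model to parse.
    assert_eq!(fb.lines().count(), 4);
    assert!(fb.contains("expected=$10"));
    println!("{fb}");
}
```

Including the raw problem and both answers alongside the judge's commentary gives the reflection model enough context to propose instruction changes without re-running the task.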
When you run the example, GEPA will evolve prompts based on judge feedback:
1. Baseline
   - Instruction: “Solve the math word problem step by step”
   - Result: Some solutions skip steps
   - Judge: “Reasoning incomplete, jumped from step 2 to answer”

2. After GEPA
   - Instruction: “Solve step by step. Show ALL intermediate calculations. Label each step clearly.”
   - Result: Complete solutions with all steps shown
   - Judge: “Sound reasoning, all steps shown clearly”
The judge’s analysis becomes the signal that drives prompt improvement.