This guide explains how to use an LLM judge to automatically generate rich textual feedback for GEPA optimization, making it easier to optimize complex tasks where manual feedback rules are hard to codify.

The Pattern

Instead of manually writing feedback rules, use another LLM to evaluate the output and reasoning:
Task LM → generates answer + reasoning
Judge LM → analyzes quality and provides textual feedback
GEPA Reflection LM → reads feedback and improves the prompt
→ better Task LM prompt

Why Use LLM-as-a-Judge?

  • Subjective quality assessment (writing style, helpfulness, clarity)
  • Complex reasoning evaluation (is the logic sound?)
  • Tasks where rules are hard to codify
  • Analyzing reasoning quality beyond just answer correctness

Complete Example Walkthrough

Full Working Example

The complete implementation, with step-by-step comments, ships as the 10-gepa-llm-judge example (see "Running the Example" below).

1. Task Signature with Reasoning

#[derive(Signature, Clone, Debug)]
/// Solve math word problems step by step.
struct MathWordProblem {
    #[input]
    pub problem: String,

    #[output]
    pub reasoning: String,  // We want to optimize this too

    #[output]
    pub answer: String,
}
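
For orientation, a direct call to this signature might look like the sketch below. The MathWordProblemInput name is an assumption, mirroring the MathJudgeInput struct that the metric later passes to the judge; check what the Signature derive actually generates in your version.

// Hypothetical usage sketch: MathWordProblemInput mirrors the
// MathJudgeInput struct used later and is an assumed generated type.
let solver = Predict::<MathWordProblem>::new();
let output = solver
    .call(MathWordProblemInput {
        problem: "A train travels 120 miles in 2 hours. What is its speed?".into(),
    })
    .await?;
println!("reasoning: {}", output.reasoning);
println!("answer: {}", output.answer);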

2. Judge Signature

#[derive(Signature, Clone, Debug)]
/// You are an expert math teacher evaluating student work.
struct MathJudge {
    #[input]
    pub problem: String,

    #[input]
    pub expected_answer: String,

    #[input]
    pub student_answer: String,

    #[input]
    pub student_reasoning: String,

    #[output(desc = "Detailed evaluation of the work")]
    pub evaluation: String,  // This becomes the feedback
}

3. Optimized Module

#[derive(Builder, facet::Facet)]
#[facet(crate = facet)]
struct MathSolver {
    #[builder(default = Predict::<MathWordProblem>::new())]
    solver: Predict<MathWordProblem>,  // This gets optimized
}

4. TypedMetric with Judge

struct LlmJudgeMetric {
    judge: Predict<MathJudge>,
}

impl TypedMetric<MathWordProblem, MathSolver> for LlmJudgeMetric {
    async fn evaluate(
        &self,
        example: &Example<MathWordProblem>,
        prediction: &Predicted<<MathSolver as Module>::Output>,
    ) -> Result<MetricOutcome> {
        let problem = example.input.problem.clone();
        let expected = example.output.answer.clone();

        let student_answer = prediction.answer.clone();
        let student_reasoning = prediction.reasoning.clone();
        let exact_match = student_answer.trim() == expected.trim();

        let judge_output = self
            .judge
            .call(MathJudgeInput {
                problem: problem.clone(),
                expected_answer: expected.clone(),
                student_answer: student_answer.clone(),
                student_reasoning: student_reasoning.clone(),
            })
            .await;

        // Turn the judge's free-text evaluation into a numeric score using
        // simple keyword heuristics; brittle, but enough for this example.
        let (score, evaluation_text) = match judge_output {
            Ok(evaluation) => {
                let evaluation_text = evaluation.evaluation.clone();
                let evaluation_lc = evaluation_text.to_lowercase();
                let good_reasoning = evaluation_lc.contains("sound reasoning")
                    || evaluation_lc.contains("correct approach")
                    || evaluation_lc.contains("clear");
                let partial_reasoning = evaluation_lc.contains("partially")
                    || evaluation_lc.contains("good start")
                    || evaluation_lc.contains("minor arithmetic")
                    || evaluation_lc.contains("close");

                // Full credit requires both the right answer and sound
                // reasoning; sound reasoning still earns partial credit
                // when the final answer is wrong.
                let score = match (exact_match, good_reasoning, partial_reasoning) {
                    (true, true, _) => 1.0,
                    (true, false, _) => 0.7,
                    (false, true, _) | (false, _, true) => 0.3,
                    (false, false, false) => 0.0,
                };
                (score, evaluation_text)
            }
            // If the judge call itself fails, fall back to exact-match
            // scoring and surface the error in the feedback text.
            Err(err) => {
                let fallback = format!(
                    "judge call failed: {err}; expected={expected}; predicted={student_answer}"
                );
                ((exact_match as u8 as f32), fallback)
            }
        };

        let feedback = FeedbackMetric::new(
            score,
            format!(
                "problem={problem}\nexpected={expected}\npredicted={student_answer}\njudge={evaluation_text}"
            ),
        );

        Ok(MetricOutcome::with_feedback(score, feedback))
    }
}
GEPA itself no longer owns a special feedback_metric hook. The feedback function lives in your TypedMetric implementation, and GEPA requires every evaluation to return MetricOutcome::with_feedback(...). That keeps the optimizer generic while preserving full judge-driven behavior.
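
As a rough wiring sketch, the metric is handed to the optimizer like any other TypedMetric. The compile call and trainset names below are assumptions about the optimizer's entry point, not confirmed API; adapt them to whatever your version exposes.

let metric = LlmJudgeMetric {
    judge: Predict::<MathJudge>::new(),
};
let optimizer = GEPA::builder()
    .num_iterations(3)
    .minibatch_size(3)
    .build();
// `compile` and `trainset` are assumed names, not confirmed API.
let optimized_solver = optimizer
    .compile(MathSolver::builder().build(), trainset, metric)
    .await?;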

Key Benefits

Right answer, wrong reasoning:
  • Answer: correct (exact match alone would score 1.0)
  • Reasoning: "I just multiplied random numbers"
  • Score: 0.7 (penalized for bad reasoning)
The judge identifies when the model got the right answer for the wrong reasons.

Wrong answer, sound method:
  • Answer: wrong
  • Reasoning: "Correct approach, arithmetic error in final step"
  • Score: 0.3 (partial credit)
The judge recognizes valid methodology even when the final answer is wrong.
The judge notices patterns like:
  • “Model consistently skips showing intermediate steps”
  • “Model confuses similar concepts (area vs perimeter)”
  • “Model doesn’t check units in answers”
GEPA’s reflection can then say:
“Add explicit instruction to show all intermediate steps and verify units”

Cost Considerations

LLM judges double your evaluation cost since every prediction requires both a task LM call and a judge LM call.
Budget accordingly:
GEPA::builder()
    .num_iterations(3)          // Fewer iterations
    .minibatch_size(3)          // Smaller batches
    .max_lm_calls(Some(100))    // Explicit limit
    .build()
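
To put numbers on the doubling: if a run scores 50 predictions, that is 50 task calls plus 50 judge calls, which already consumes the 100-call cap above before any reflection calls are counted. Whether judge calls count against max_lm_calls may depend on your version of the library, so treat this as a rough bound.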

Optimization Tips

  • Use a cheaper model for judging (gpt-4o-mini vs gpt-4)
  • Judge only failed examples (not ones that passed)
  • Cache judge evaluations for identical outputs (see the sketch after this list)
  • Use parallel evaluation to reduce wall-clock time
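
As one way to apply the caching tip, the sketch below memoizes judge evaluations keyed on the judged output. CachedJudge is an illustrative type, not part of the library.

// Illustrative caching sketch (not a library type): identical
// (answer, reasoning) pairs are judged once, then served from memory.
use std::collections::HashMap;
use std::sync::Mutex;

struct CachedJudge {
    judge: Predict<MathJudge>,
    cache: Mutex<HashMap<(String, String), String>>,
}

impl CachedJudge {
    async fn evaluate(&self, input: MathJudgeInput) -> Result<String> {
        let key = (input.student_answer.clone(), input.student_reasoning.clone());
        // Serve a cached evaluation if we have judged this output before.
        if let Some(hit) = self.cache.lock().unwrap().get(&key).cloned() {
            return Ok(hit);
        }
        let evaluation = self.judge.call(input).await?.evaluation;
        self.cache.lock().unwrap().insert(key, evaluation.clone());
        Ok(evaluation)
    }
}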

Hybrid Approach

Best results often come from combining explicit checks with LLM judging:
impl TypedMetric<MySignature, MyModule> for HybridMetric {
    async fn evaluate(
        &self,
        example: &Example<MySignature>,
        prediction: &Predicted<<MyModule as Module>::Output>,
    ) -> Result<MetricOutcome> {
        let mut score = 1.0;
        let mut feedback_parts = vec![];

        // Explicit checks first (fast, cheap, deterministic)
        if !is_valid_json(&prediction.result_json) {
            feedback_parts.push("Invalid JSON format".to_string());
            score = 0.0;
        }

        if score > 0.0 && missing_required_fields(&prediction.result_json) {
            feedback_parts.push("Missing fields: user_id, timestamp".to_string());
            score *= 0.5;
        }

        // Optional judge pass for qualitative scoring
        if score > 0.0 {
            let judge_feedback = self.judge_quality(example, prediction).await?;
            if judge_feedback.to_lowercase().contains("low quality") {
                score *= 0.7;
            }
            feedback_parts.push(judge_feedback);
        }

        let feedback = FeedbackMetric::new(score, feedback_parts.join("\n"));
        Ok(MetricOutcome::with_feedback(score, feedback))
    }
}
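
The is_valid_json and missing_required_fields helpers above are stand-ins. One plausible implementation, assuming serde_json is available (an assumption, not a stated dependency of the example):

// Plausible helper implementations, assuming serde_json is available.
fn is_valid_json(s: &str) -> bool {
    serde_json::from_str::<serde_json::Value>(s).is_ok()
}

fn missing_required_fields(s: &str) -> bool {
    match serde_json::from_str::<serde_json::Value>(s) {
        // Require the two fields named in the feedback message above.
        Ok(value) => value.get("user_id").is_none() || value.get("timestamp").is_none(),
        Err(_) => true,
    }
}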

Example Evolution

When you run the example, GEPA will evolve prompts based on judge feedback:
1. Baseline

   Instruction: “Solve the math word problem step by step”
   Result: Some solutions skip steps
   Judge: “Reasoning incomplete, jumped from step 2 to answer”

2. After GEPA

   Instruction: “Solve step by step. Show ALL intermediate calculations. Label each step clearly.”
   Result: Complete solutions with all steps shown
   Judge: “Sound reasoning, all steps shown clearly”

The judge’s analysis becomes the signal that drives prompt improvement.

Running the Example

OPENAI_API_KEY=your_key cargo run --example 10-gepa-llm-judge
This will show:
  1. Baseline performance
  2. Judge evaluations during optimization
  3. How feedback evolves the prompt
  4. Final test with judge analysis