This guide explains how to use an LLM judge to automatically generate rich textual feedback for GEPA optimization, making it easier to optimize complex tasks where manual feedback rules are hard to codify.

The Pattern

Instead of manually writing feedback rules, use another LLM to evaluate the output and reasoning:
  1. Task LM → generates answer + reasoning
  2. Judge LM → analyzes quality and provides feedback
  3. GEPA Reflection LM → reads feedback and improves prompt
  4. Result → a better Task LM prompt
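
The walkthrough below builds each stage of this loop in turn: the task signature, the judge signature, a module that holds both predictors, and a FeedbackEvaluator that turns the judge's analysis into scored feedback.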

Why Use LLM-as-a-Judge?

LLM-as-a-judge is a good fit for:
  • Subjective quality assessment (writing style, helpfulness, clarity)
  • Complex reasoning evaluation (is the logic sound?)
  • Tasks where rules are hard to codify
  • Analyzing reasoning quality beyond just answer correctness

Complete Example Walkthrough

Full Working Example

The complete implementation, with step-by-step comments, is available as the 10-gepa-llm-judge example (see Running the Example at the end of this guide).

1. Task Signature with Reasoning

#[Signature(cot)]
struct MathWordProblem {
    #[input]
    pub problem: String,
    
    #[output]
    pub reasoning: String,  // We want to optimize this too
    
    #[output]
    pub answer: String,
}

2. Judge Signature

#[Signature]
struct MathJudge {
    /// You are an expert math teacher evaluating student work.
    
    #[input]
    pub problem: String,
    
    #[input]
    pub expected_answer: String,
    
    #[input]
    pub student_answer: String,
    
    #[input]
    pub student_reasoning: String,
    
    #[output(desc = "Detailed evaluation of the work")]
    pub evaluation: String,  // This becomes the feedback
}
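
Note that the doc comment inside the signature acts as the judge's instruction, and the evaluation output is the free-text analysis that step 4 forwards to GEPA as feedback.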

3. Module with Embedded Judge

#[derive(Builder, Optimizable)]
struct MathSolver {
    #[parameter]
    solver: Predict,  // This gets optimized
    
    judge: Predict,   // This stays fixed, just evaluates
    judge_lm: Arc<Mutex<LM>>,
}
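
Only the solver field carries #[parameter], so its prompt is the only thing GEPA rewrites; the judge predictor and its LM are ordinary fields and stay fixed for the entire run.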

4. FeedbackEvaluator with Judge

impl FeedbackEvaluator for MathSolver {
    async fn feedback_metric(&self, example: &Example, prediction: &Prediction) 
        -> FeedbackMetric 
    {
        // Extract inputs and outputs
        let problem = example.get("problem", None).as_str().unwrap();
        let expected = example.get("expected_answer", None).as_str().unwrap();
        let student_answer = prediction.get("answer", None).as_str().unwrap();
        let student_reasoning = prediction.get("reasoning", None).as_str().unwrap();
        
        // Call the judge
        let judge_input = example! {
            "problem": "input" => problem,
            "expected_answer": "input" => expected,
            "student_answer": "input" => student_answer,
            "student_reasoning": "input" => student_reasoning
        };
        
        let judge_output = match self.judge
            .forward_with_config(judge_input, Arc::clone(&self.judge_lm))
            .await 
        {
            Ok(output) => output,
            Err(_) => {
                // Fallback if judge fails
                return FeedbackMetric::new(
                    if student_answer == expected { 1.0 } else { 0.0 },
                    format!("Expected: {}, Got: {}", expected, student_answer)
                );
            }
        };
        
        let judge_evaluation = judge_output
            .get("evaluation", None)
            .as_str()
            .unwrap_or("No evaluation provided")
            .to_string();
        
        // Score based on both correctness AND reasoning quality
        let answer_correct = student_answer.trim() == expected.trim();
        let good_reasoning = judge_evaluation.to_lowercase().contains("sound reasoning") 
            || judge_evaluation.to_lowercase().contains("correct approach");
        
        let score = match (answer_correct, good_reasoning) {
            (true, true) => 1.0,   // Perfect
            (true, false) => 0.7,  // Right answer, flawed reasoning
            (false, true) => 0.3,  // Wrong answer, but valid approach
            (false, false) => 0.0, // Completely wrong
        };
        
        // Combine factual info with judge's analysis
        let feedback = format!(
            "Problem: {}\nExpected: {}\nPredicted: {}\n\
             Answer: {}\n\nReasoning Quality Analysis:\n{}",
            problem, expected, student_answer,
            if answer_correct { "CORRECT" } else { "INCORRECT" },
            judge_evaluation
        );
        
        FeedbackMetric::new(score, feedback)
    }
}
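
The scoring above keyword-matches the judge's free-text evaluation ("sound reasoning", "correct approach"), which is brittle if the judge phrases things differently. One optional hardening, sketched here with a hypothetical MathJudgeWithVerdict signature rather than taken from the example, is to ask the judge for an explicit verdict output and compare that directly:
#[Signature]
struct MathJudgeWithVerdict {
    /// You are an expert math teacher evaluating student work.
    
    #[input]
    pub problem: String,
    
    #[input]
    pub expected_answer: String,
    
    #[input]
    pub student_answer: String,
    
    #[input]
    pub student_reasoning: String,
    
    #[output(desc = "Exactly one of: sound_reasoning, flawed_reasoning")]
    pub verdict: String,  // Compare this string instead of keyword-matching
    
    #[output(desc = "Detailed evaluation of the work")]
    pub evaluation: String,
}
The feedback_metric would then check verdict == "sound_reasoning" while still passing evaluation to GEPA unchanged.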

Key Benefits

Catching right answers reached by wrong reasoning:
Answer: correct (an exact-match metric alone would score 1.0)
Reasoning: "I just multiplied random numbers"
Score: 0.7 (penalized for bad reasoning)
The judge identifies when the model got the right answer for the wrong reasons.

Giving partial credit for valid methodology:
Answer: wrong
Reasoning: "Correct approach, arithmetic error in final step"
Score: 0.3 (partial credit)
The judge recognizes a valid approach even when the final answer is wrong.

Surfacing recurring failure patterns. The judge notices patterns like:
  • “Model consistently skips showing intermediate steps”
  • “Model confuses similar concepts (area vs perimeter)”
  • “Model doesn’t check units in answers”
GEPA’s reflection can then respond with, for example:
“Add explicit instruction to show all intermediate steps and verify units”

Cost Considerations

LLM judges double your evaluation cost since every prediction requires both a task LM call and a judge LM call.
Budget accordingly:
GEPA::builder()
    .num_iterations(3)           // Fewer iterations
    .minibatch_size(3)           // Smaller batches
    .maybe_max_lm_calls(Some(100))  // Explicit limit
    .build()

Optimization Tips

  • Use a cheaper model for judging (gpt-4o-mini vs gpt-4)
  • Judge only failed examples (not ones that passed)
  • Cache judge evaluations for identical outputs (see the sketch after this list)
  • Use parallel evaluation to reduce wall-clock time
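
A minimal sketch of the caching tip, reusing the judge and judge_lm fields from MathSolver above; the judge_cache field, the cache key, and the judge_cached helper are illustrative additions rather than library features:
use std::sync::Arc;

// Assumed extra field on MathSolver (not in the walkthrough struct):
//     judge_cache: std::sync::Mutex<std::collections::HashMap<String, String>>,
impl MathSolver {
    /// Return the stored evaluation if this exact (answer, reasoning) pair has
    /// already been judged; otherwise call the judge LM once and remember it.
    async fn judge_cached(&self, judge_input: Example, cache_key: String) -> String {
        if let Some(hit) = self.judge_cache.lock().unwrap().get(&cache_key) {
            return hit.clone();
        }
        let evaluation = match self
            .judge
            .forward_with_config(judge_input, Arc::clone(&self.judge_lm))
            .await
        {
            Ok(output) => output
                .get("evaluation", None)
                .as_str()
                .unwrap_or("No evaluation provided")
                .to_string(),
            Err(_) => "Judge call failed".to_string(),
        };
        self.judge_cache.lock().unwrap().insert(cache_key, evaluation.clone());
        evaluation
    }
}
feedback_metric would build the key from the answer and reasoning strings and call judge_cached in place of the direct judge call.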

Hybrid Approach

Best results often come from combining explicit checks with LLM judging. In the sketch below, the "output" field name and the is_valid_json, missing_required_fields, and judge_quality helpers are placeholders for your own task-specific logic:
async fn feedback_metric(&self, example: &Example, prediction: &Prediction) 
    -> FeedbackMetric 
{
    // Field name is illustrative; read whatever your signature actually outputs
    let output = prediction.get("output", None).as_str().unwrap_or("").to_string();
    
    let mut feedback_parts: Vec<String> = vec![];
    let mut score = 1.0;
    
    // Explicit checks first (fast, cheap, deterministic)
    if !is_valid_json(&output) {
        feedback_parts.push("Invalid JSON format".to_string());
        score = 0.0;
    }
    
    if missing_required_fields(&output) {
        feedback_parts.push("Missing fields: user_id, timestamp".to_string());
        score *= 0.5;
    }
    
    // Only call the judge if the basic checks pass
    if score > 0.0 {
        let judge_feedback = self.judge_quality(example, prediction).await;
        
        if judge_feedback.contains("low quality") {
            score *= 0.7;
        }
        feedback_parts.push(judge_feedback);
    }
    
    FeedbackMetric::new(score, feedback_parts.join("\n"))
}
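
For completeness, here is a minimal version of the JSON check the sketch assumes, using serde_json; missing_required_fields would follow the same pattern with your task's field names:
use serde_json::Value;

/// True when the model output parses as JSON at all.
fn is_valid_json(output: &str) -> bool {
    serde_json::from_str::<Value>(output).is_ok()
}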

Example Evolution

When you run the example, GEPA will evolve prompts based on judge feedback:
  1. Baseline
     Instruction: “Solve the math word problem step by step”
     Result: Some solutions skip steps
     Judge: “Reasoning incomplete, jumped from step 2 to answer”
  2. After GEPA
     Instruction: “Solve step by step. Show ALL intermediate calculations. Label each step clearly.”
     Result: Complete solutions with all steps shown
     Judge: “Sound reasoning, all steps shown clearly”
The judge’s analysis becomes the signal that drives prompt improvement.

Running the Example

OPENAI_API_KEY=your_key cargo run --example 10-gepa-llm-judge
This will show:
  1. Baseline performance
  2. Judge evaluations during optimization
  3. How feedback evolves the prompt
  4. Final test with judge analysis