GEPA (Genetic-Pareto) is a state-of-the-art reflective prompt optimizer that uses rich textual feedback and evolutionary algorithms to improve LM-powered applications.
Reference: “GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning” (Agrawal et al., 2025, arxiv:2507.19457)

What is GEPA?

GEPA is a reflective optimizer that adaptively evolves the textual components (such as prompts) of arbitrary systems. In addition to the scalar scores returned by metrics, users can provide GEPA with textual feedback to guide the optimization process. Such feedback gives GEPA visibility into why the system received the score it did, which GEPA can then introspect on to identify how to improve. This allows GEPA to propose high-performing prompts in very few rollouts.

What Makes GEPA Unique?

Unlike traditional optimizers (COPRO, MIPROv2), GEPA introduces several key innovations:

1. Rich Textual Feedback

Instead of just scalar scores (0.8, 0.9), GEPA leverages detailed explanations:
Incorrect classification
  Expected: "positive"
  Predicted: "negative"
  Input text: "Great product but shipping was slow"
  May have misunderstood mixed sentiment
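
In DSRs terms, such a diagnosis is simply the feedback string of a FeedbackMetric, constructed the same way as in the Quick Start below (the strings here are illustrative):

FeedbackMetric::new(0.0,
    "Incorrect classification\n\
     Expected: \"positive\"\n\
     Predicted: \"negative\"\n\
     Input text: \"Great product but shipping was slow\"\n\
     May have misunderstood mixed sentiment")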

2. Pareto-based Selection

GEPA maintains a diverse set of candidates that excel on different examples, preventing premature convergence:
  • Candidate A: Best on examples 1, 3, 5
  • Candidate B: Best on examples 2, 4, 6
  • Both stay in the population (complementary strengths)
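
A minimal sketch of the winner bookkeeping behind this selection, assuming scores[candidate][example] holds per-example scores (illustrative only, not the actual DSRs internals):

use std::collections::HashMap;

// For each example, record which candidates achieve the best score on it.
// A candidate survives as long as it wins at least one example.
fn winners_per_example(scores: &[Vec<f32>]) -> HashMap<usize, Vec<usize>> {
    let mut winners = HashMap::new();
    for ex in 0..scores[0].len() {
        let best = scores.iter().map(|c| c[ex]).fold(f32::NEG_INFINITY, f32::max);
        let ids = scores.iter().enumerate()
            .filter(|(_, c)| c[ex] == best)
            .map(|(id, _)| id)
            .collect();
        winners.insert(ex, ids);
    }
    winners
}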

3. LLM-driven Reflection

Uses LLMs to analyze execution traces and propose targeted improvements:
"The current instruction doesn't handle mixed sentiments well. 
Suggest modifying to explicitly consider both positive and negative aspects..."
GEPA can also optimize at test time, not just training time (see Inference-Time Search below).
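
A rough illustration of what the reflection step consumes and produces; this is not the actual DSRs meta-prompt, just the shape of the idea:

// Illustrative: fold the textual feedback from failing rollouts into a
// meta-prompt asking the LM for a revised instruction.
fn reflection_prompt(current_instruction: &str, failures: &[String]) -> String {
    format!(
        "Current instruction:\n{}\n\nObserved failures:\n{}\n\n\
         Propose a revised instruction that addresses these failures.",
        current_instruction,
        failures.join("\n---\n")
    )
}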

Quick Start

1. Implement FeedbackEvaluator

use dspy_rs::*;

#[derive(Builder, Optimizable)]
struct MyModule {
    #[parameter]
    predictor: Predict,
}

impl Module for MyModule {
    async fn forward(&self, inputs: Example) -> Result<Prediction> {
        self.predictor.forward(inputs).await
    }
}

// Implement regular Evaluator for non-GEPA optimizers
impl Evaluator for MyModule {
    async fn metric(&self, example: &Example, prediction: &Prediction) -> f32 {
        let feedback = self.feedback_metric(example, prediction).await;
        feedback.score
    }
}

// Implement FeedbackEvaluator for GEPA
impl FeedbackEvaluator for MyModule {
    async fn feedback_metric(&self, example: &Example, prediction: &Prediction) 
        -> FeedbackMetric 
    {
        let predicted = prediction.get("answer", None).as_str().unwrap_or("");
        let expected = example.get("expected", None).as_str().unwrap_or("");
        
        let correct = predicted == expected;
        let score = if correct { 1.0 } else { 0.0 };
        
        let feedback = if correct {
            format!("Correct answer: {}", predicted)
        } else {
            format!("Incorrect\n  Expected: {}\n  Predicted: {}", expected, predicted)
        };
        
        FeedbackMetric::new(score, feedback)
    }
}

2. Configure and Run GEPA

let gepa = GEPA::builder()
    .num_iterations(20)
    .minibatch_size(25)
    .num_trials(10)
    .temperature(0.9)
    .track_stats(true)
    .maybe_max_rollouts(Some(500))  // Budget control
    .build();

let result = gepa.compile_with_feedback(&mut module, trainset).await?;

println!("Best score: {:.3}", result.best_candidate.average_score());
println!("Best instruction: {}", result.best_candidate.instruction);

Feedback Helpers

DSRs provides utilities for common feedback patterns:

Document Retrieval

use dspy_rs::feedback_helpers::retrieval_feedback;

let feedback = retrieval_feedback(
    &retrieved_docs,
    &expected_docs,
    Some(&all_available_docs)
);

// Output:
// Retrieved 3/5 correct documents (Precision: 0.6, Recall: 0.6, F1: 0.6)
// Correctly retrieved: doc1, doc2, doc3
// Missed: doc4, doc5

Code Generation

use dspy_rs::feedback_helpers::{code_pipeline_feedback, CodeStage, StageResult};

let stages = vec![
    (CodeStage::Parse, StageResult::Success),
    (CodeStage::Compile, StageResult::Success),
    (CodeStage::Execute, StageResult::Failure {
        error: "Division by zero on line 42".to_string(),
    }),
];

let feedback = code_pipeline_feedback(&stages, 0.6);

// Output:
// Parse: Success
// Compile: Success
// Execute: Division by zero on line 42

Multi-Objective Optimization

use dspy_rs::feedback_helpers::multi_objective_feedback;
use std::collections::HashMap;

let mut objectives = HashMap::new();
objectives.insert("accuracy".to_string(), (0.9, "High accuracy".to_string()));
objectives.insert("latency".to_string(), (0.7, "Acceptable latency".to_string()));
objectives.insert("cost".to_string(), (0.8, "Within budget".to_string()));

let mut weights = HashMap::new();
weights.insert("accuracy".to_string(), 2.0);
weights.insert("latency".to_string(), 1.0);
weights.insert("cost".to_string(), 1.0);

let feedback = multi_objective_feedback(&objectives, Some(&weights));

Configuration Options

GEPA::builder()
    .num_iterations(20)          // Number of evolutionary iterations
    .minibatch_size(25)          // Examples per rollout
    .num_trials(10)              // Trials per candidate evaluation
    .temperature(0.9)            // LLM temperature for mutations
    .track_stats(true)           // Track detailed statistics
    .track_best_outputs(false)   // Store best outputs per example
    .maybe_max_rollouts(Some(500))   // Budget: max rollouts
    .maybe_max_lm_calls(Some(1000))  // Budget: max LM calls
    .maybe_prompt_model(Some(lm))    // Separate LM for meta-prompting
    .maybe_valset(Some(examples))    // Validation set
    .build()

Understanding GEPA Results

let result = gepa.compile_with_feedback(&mut module, trainset).await?;

// Best candidate found
println!("Best instruction: {}", result.best_candidate.instruction);
println!("Average score: {:.3}", result.best_candidate.average_score());
println!("Generation: {}", result.best_candidate.generation);

// Resource usage
println!("Total rollouts: {}", result.total_rollouts);
println!("Total LM calls: {}", result.total_lm_calls);

// Evolution over time
for (generation, score) in &result.evolution_history {
    println!("Gen {}: {:.3}", generation, score);
}

Architecture

Core Components

FeedbackMetric
pub struct FeedbackMetric {
    pub score: f32,
    pub feedback: String,
    pub metadata: HashMap<String, serde_json::Value>,
}
ExecutionTrace
pub struct ExecutionTrace {
    pub inputs: Example,
    pub outputs: Option<Prediction>,
    pub feedback: Option<FeedbackMetric>,
    pub intermediate_steps: Vec<(String, serde_json::Value)>,
    pub errors: Vec<String>,
    pub metadata: HashMap<String, serde_json::Value>,
}
ParetoFrontier
  • Each candidate tracks which examples it wins on
  • Sampling is proportional to coverage
  • Automatically prunes dominated candidates
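
The pruning rule is the standard dominance check over per-example score vectors (the example_scores field of GEPACandidate below); a sketch, not the DSRs internals:

// `a` dominates `b` if it scores at least as well on every example
// and strictly better on at least one; dominated candidates are pruned.
fn dominates(a: &[f32], b: &[f32]) -> bool {
    a.iter().zip(b).all(|(x, y)| x >= y) && a.iter().zip(b).any(|(x, y)| x > y)
}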
GEPACandidate
pub struct GEPACandidate {
    pub id: usize,
    pub instruction: String,
    pub module_name: String,
    pub example_scores: Vec<f32>,
    pub parent_id: Option<usize>,
    pub generation: usize,
}

Evolutionary Algorithm

  1. Initialize the candidate pool with the unoptimized program
  2. Iterate:
    • Sample a candidate from Pareto frontier (proportional to coverage)
    • Sample a minibatch from the training set
    • Collect execution traces with feedback
    • Select a module for targeted improvement
    • LLM Reflection: Propose new instruction using reflective meta-prompting
    • Roll out the new candidate; if improved, evaluate on validation set
    • Update the Pareto frontier
  3. Continue until budget is exhausted
  4. Return best candidate by average score
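
A self-contained toy of this loop may help fix the structure. Everything here is a stand-in: `score` replaces a real rollout plus metric, and the format! call replaces LM reflection over collected feedback.

// Toy loop: strings stand in for candidate programs.
fn score(candidate: &str, example: usize) -> f32 {
    ((candidate.len() + example) as f32).sin().abs()
}

fn main() {
    let examples: Vec<usize> = (0..6).collect();
    let avg = |c: &str| examples.iter().map(|&e| score(c, e)).sum::<f32>()
        / examples.len() as f32;
    let mut pool = vec!["initial instruction".to_string()];
    for gen in 0..5 {
        // Sample the candidate that currently wins the most examples.
        let parent_idx = (0..pool.len())
            .max_by_key(|&i| {
                examples.iter()
                    .filter(|&&e| pool.iter().all(|o| score(o, e) <= score(&pool[i], e)))
                    .count()
            })
            .unwrap();
        let parent = pool[parent_idx].clone();
        // "Reflection" stub: real GEPA rewrites the instruction with an LM.
        let child = format!("{} v{}", parent, gen);
        // Accept the child only if it improves the average score.
        if avg(&child) > avg(&parent) {
            pool.push(child);
        }
    }
    println!("pool size after search: {}", pool.len());
}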

Implementing Feedback Metrics

A well-designed metric is central to GEPA’s sample efficiency. The DSRs implementation expects the metric to return a FeedbackMetric struct with both a score and rich textual feedback.

Practical Recipe for GEPA-Friendly Feedback

  • Leverage Existing Artifacts: Use logs, unit tests, evaluation scripts, profiler outputs
  • Decompose Outcomes: Break scores into per-objective components
  • Expose Trajectories: Label pipeline stages with pass/fail and errors
  • Ground in Checks: Use validators or LLM-as-a-judge for subjective tasks
  • Prioritize Clarity: Focus on error coverage and decision points
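
For instance, "Decompose Outcomes" can be as simple as folding per-check results into a single FeedbackMetric, one line per check. The check-result shape below is hypothetical; FeedbackMetric::new is used as in the Quick Start:

use dspy_rs::*;

// Fold per-check results into a fractional score plus line-per-check feedback.
fn checks_to_feedback(results: &[(&str, bool)]) -> FeedbackMetric {
    let passed = results.iter().filter(|(_, ok)| *ok).count();
    let lines: Vec<String> = results.iter()
        .map(|(name, ok)| format!("{}: {}", name, if *ok { "pass" } else { "FAIL" }))
        .collect();
    FeedbackMetric::new(passed as f32 / results.len() as f32, lines.join("\n"))
}

Calling checks_to_feedback(&[("parse", true), ("types", false)]) yields a 0.5 score with a feedback line per check, which tells the reflection step exactly which stage to fix.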

Feedback Examples by Domain

  • Document Retrieval: List correctly retrieved, incorrect, or missed documents
  • Multi-Objective Tasks: Decompose aggregate scores to reveal contributions from each objective
  • Stacked Pipelines: Expose stage-specific failures (parse, compile, run, test)

Best Practices

Design Feedback for Actionability

// BAD: Vague feedback
FeedbackMetric::new(0.5, "Wrong answer")

// GOOD: Specific, actionable feedback
FeedbackMetric::new(0.5, 
    "Incorrect answer\n\
     Expected: 'Paris'\n\
     Predicted: 'France'\n\
     Issue: Returned country instead of city")

Leverage Domain Knowledge

  • Code generation: Show stage-specific failures
  • Retrieval: List specific documents missed
  • QA: Explain reasoning errors

Balance Feedback Detail

  • Too brief: Not actionable
  • Too verbose: Drowns out signal
  • Sweet spot: 2-5 lines per issue

Set Realistic Budgets

// For development/testing
GEPA::builder()
    .num_iterations(5)
    .maybe_max_rollouts(Some(100))
    .build()

// For production optimization
GEPA::builder()
    .num_iterations(20)
    .maybe_max_rollouts(Some(1000))
    .build()

Examples

Sentiment Analysis

Basic GEPA usage with explicit feedback for sentiment classification

LLM-as-Judge

Using an LLM judge to automatically generate feedback

Comparison with Other Optimizers

Feature            | COPRO  | MIPROv2 | GEPA
-------------------|--------|---------|-------------
Feedback Type      | Score  | Score   | Score + Text
Selection Strategy | Best   | Batch   | Pareto
Diversity          | Low    | Medium  | High
Actionability      | Low    | Medium  | High
Compute Cost       | Low    | Medium  | Medium-High
Sample Efficiency  | Medium | High    | Very High

When to Use GEPA

  • Complex tasks with subtle failure modes
  • When you can provide rich feedback
  • Multi-objective optimization
  • Need for diverse solutions
  • Inference-time search

When to Use Alternatives

  • COPRO: Simple tasks, quick iteration
  • MIPROv2: Best prompting practices, single objective

Troubleshooting

Issue: “GEPA requires FeedbackEvaluator trait”

// Solution: Implement both Evaluator and FeedbackEvaluator
impl Evaluator for MyModule {
    async fn metric(&self, example: &Example, prediction: &Prediction) -> f32 {
        self.feedback_metric(example, prediction).await.score
    }
}

impl FeedbackEvaluator for MyModule {
    async fn feedback_metric(&self, example: &Example, prediction: &Prediction) 
        -> FeedbackMetric { ... }
}

Issue: Slow convergence

// Increase minibatch size for a more reliable improvement signal
GEPA::builder().minibatch_size(50).build()

// Increase temperature for more exploration
GEPA::builder().temperature(1.2).build()

Issue: Running out of budget

// Reduce iterations or increase budget
GEPA::builder()
    .num_iterations(10)
    .maybe_max_rollouts(Some(2000))
    .build()

Inference-Time Search

GEPA can act as a test-time/inference search mechanism. Set your valset to your evaluation batch and enable track_best_outputs(true); GEPA then produces, for each batch element, the highest-scoring outputs found during the evolutionary search.

let gepa = GEPA::builder()
    .track_stats(true)
    .track_best_outputs(true)
    .maybe_valset(Some(my_tasks.clone()))
    .build();

let result = gepa.compile_with_feedback(&mut module, my_tasks).await?;

// Access per-task best scores and outputs
let best_scores = result.highest_score_achieved_per_val_task;
let best_outputs = result.best_outputs_valset;

Additional Resources

  • “GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning” (Agrawal et al., 2025, arxiv:2507.19457)