Python DSPy Agent Framework: 5 Fatal Pitfalls in LLM Programming and Auto-Optimization in 2026
The Era of Hand-Written Prompts Is Over
You spend three days crafting a Prompt, and it stops working the moment you switch models. Your carefully designed few-shot examples perform worse on a newer model version. Your Agent chain keeps growing, and every step's Prompt becomes a maintenance burden. In 2026, DSPy (Declarative Self-improving Python) moves LLM programming from hand-written Prompts to declarative programming plus automatic optimization: you define input/output signatures and a metric, and the framework searches for effective Prompts, few-shot examples, and fine-tuning strategies.
This article guides you through building a DSPy-based AI Agent from scratch and solving the 5 most common fatal pitfalls in production.
DSPy Core Concepts
| Concept | Description |
|---|---|
| Signature | Declarative definition of module input/output, e.g., "question -> answer" |
| Module | Composable LLM call unit, similar to PyTorch's nn.Module |
| Teleprompter (Optimizer) | Optimizer that automatically searches for better Prompts and few-shot examples; "teleprompter" is the older name for an optimizer |
| Example | Standardized input/output data sample |
| Metric | Scoring function that evaluates module output quality |
| Adapter | Adapter layer that converts signatures to specific LLM API calls |
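To make the Signature row concrete, here is a toy sketch of what a string signature such as "question -> answer" encodes: input field names on the left of the arrow, output field names on the right. The function `parse_signature` is hypothetical and written only for illustration; it is not DSPy's own parser.

```python
# Toy illustration only: NOT DSPy's parser, just the idea behind a
# string signature such as "context, question -> answer".

def parse_signature(spec: str) -> tuple[list[str], list[str]]:
    """Split 'a, b -> c' into (input field names, output field names)."""
    if spec.count("->") != 1:
        raise ValueError(f"expected exactly one '->' in {spec!r}")
    left, right = spec.split("->")
    inputs = [name.strip() for name in left.split(",") if name.strip()]
    outputs = [name.strip() for name in right.split(",") if name.strip()]
    if not outputs:
        raise ValueError("a signature needs at least one output field")
    return inputs, outputs


inputs, outputs = parse_signature("context, question -> answer")
print(inputs, outputs)  # ['context', 'question'] ['answer']
```

The class-based signatures used later in this article carry the same information, plus types and per-field descriptions.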
DSPy vs Traditional Prompt Engineering
| Dimension | Hand-written Prompt | DSPy Declarative |
|---|---|---|
| Development | Manual writing, trial and error | Declare signatures, auto-optimize |
| Model Migration | Rewrite all Prompts | Swap the LM and re-run the optimizer |
| Maintainability | Low, Prompts scattered | High, signatures as documentation |
| Optimization | Relies on human experience | Automatic search for optimal |
| Multi-step Reasoning | Manual chaining, error-prone | Modular composition with typed fields |
Problem Analysis: 5 Major DSPy Development Challenges
- Poor signature design: Vague input/output field names cause LLM misunderstanding
- Optimizer selection difficulty: BootstrapFewShot, MIPROv2 suit different scenarios
- Multi-step reasoning chain breaks: Type mismatches between modules cause mid-chain crashes
- Inaccurate metric functions: Evaluation criteria misaligned with business goals, optimization goes off-track
- Async concurrency pitfalls: No concurrency control during large-scale optimization triggers API rate limits
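The last challenge above (rate limits during large-scale optimization) is usually handled with bounded retries and exponential backoff. The following is a minimal standard-library sketch, not a DSPy API: the helper `call_with_backoff`, its parameters, and the simulated `flaky` function are all hypothetical names chosen for illustration.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_backoff(
    fn: Callable[[], T],
    retry_on: tuple[type[BaseException], ...] = (Exception,),
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying with exponential backoff plus a little jitter."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Delay doubles each attempt, with up to 10% random jitter
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay * (1 + 0.1 * random.random()))
    raise RuntimeError("unreachable")


# Simulated flaky call: fails twice, then succeeds.
attempts: list[int] = []
delays: list[float] = []


def flaky() -> str:
    attempts.append(1)
    if len(attempts) < 3:
        raise ConnectionError("simulated rate limit")
    return "ok"


# Record delays instead of sleeping so the demo runs instantly.
result = call_with_backoff(flaky, retry_on=(ConnectionError,), sleep=delays.append)
print(result, len(attempts), len(delays))
```

In a real pipeline you would pass the provider's rate-limit exception type as `retry_on` and keep the default `time.sleep`.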
Step-by-Step: Complete DSPy Agent Implementation
Step 1: Environment Setup
pip install dspy-ai==2.6.0
pip install openai==1.35.0
pip install datasets==2.19.0
import dspy

lm = dspy.LM(
    model="openai/gpt-4o-mini",
    api_key="your-api-key",
    temperature=0.7,
    max_tokens=2048,
)
dspy.configure(lm=lm)
Step 2: Define Signatures and Modules
class QuestionAnswer(dspy.Signature):
    """Answer the question based on the given context. If the context doesn't contain the answer, respond with 'Cannot answer'."""

    context: str = dspy.InputField(desc="Context text containing the answer")
    question: str = dspy.InputField(desc="Question to answer")
    answer: str = dspy.OutputField(desc="Brief answer based on context")


class RAGModule(dspy.Module):
    def __init__(self, num_passages: int = 3):
        super().__init__()
        # dspy.Retrieve needs a retrieval model configured, e.g. dspy.configure(rm=...)
        self.retrieve = dspy.Retrieve(k=num_passages)
        self.generate_answer = dspy.ChainOfThought(QuestionAnswer)

    def forward(self, question: str) -> dspy.Prediction:
        passages = self.retrieve(question).passages
        # The signature declares context as str, so join the passage list
        context = "\n".join(passages)
        prediction = self.generate_answer(context=context, question=question)
        return dspy.Prediction(context=passages, answer=prediction.answer)
Step 3: Build Multi-Step Reasoning Agent
class DecomposeQuestion(dspy.Signature):
    """Decompose a complex question into multiple simple sub-questions."""

    question: str = dspy.InputField(desc="Complex question to decompose")
    sub_questions: list[str] = dspy.OutputField(desc="List of decomposed sub-questions")


class SynthesizeAnswer(dspy.Signature):
    """Synthesize a final answer from multiple sub-question answers."""

    original_question: str = dspy.InputField(desc="Original complex question")
    sub_answers: list[str] = dspy.InputField(desc="Answers to each sub-question")
    final_answer: str = dspy.OutputField(desc="Synthesized final answer")


class MultiStepAgent(dspy.Module):
    def __init__(self, num_passages: int = 3):
        super().__init__()
        self.retrieve = dspy.Retrieve(k=num_passages)
        self.decompose = dspy.ChainOfThought(DecomposeQuestion)
        self.sub_answer = dspy.ChainOfThought(QuestionAnswer)
        self.synthesize = dspy.ChainOfThought(SynthesizeAnswer)

    def forward(self, question: str) -> dspy.Prediction:
        decomposed = self.decompose(question=question)
        sub_answers = []
        for sub_q in decomposed.sub_questions:
            context = self.retrieve(sub_q).passages
            sub_pred = self.sub_answer(context="\n".join(context), question=sub_q)
            sub_answers.append(sub_pred.answer)
        final = self.synthesize(
            original_question=question,
            sub_answers=sub_answers,
        )
        return dspy.Prediction(
            sub_questions=decomposed.sub_questions,
            sub_answers=sub_answers,
            answer=final.final_answer,
        )
Step 4: Define Metric Functions
def answer_exact_match(example: dspy.Example, prediction: dspy.Prediction, trace=None) -> float:
    """Exact match metric"""
    return float(
        example.answer.strip().lower() == prediction.answer.strip().lower()
    )


def answer_f1_score(example: dspy.Example, prediction: dspy.Prediction, trace=None) -> float:
    """F1 score metric"""
    pred_tokens = set(prediction.answer.strip().lower().split())
    gold_tokens = set(example.answer.strip().lower().split())
    if not pred_tokens or not gold_tokens:
        return float(pred_tokens == gold_tokens)
    common = pred_tokens & gold_tokens
    if not common:
        return 0.0
    precision = len(common) / len(pred_tokens)
    recall = len(common) / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
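The F1 metric above compares token sets, so a prediction that repeats one correct word many times scores the same as saying it once. One possible variant, sketched below on plain strings, counts tokens as a multiset with `collections.Counter`. The function name `token_f1` is hypothetical; it is an alternative to consider, not part of DSPy.

```python
from collections import Counter


def token_f1(prediction: str, gold: str) -> float:
    """Token-level F1 that counts repeated tokens (multiset overlap)."""
    pred = Counter(prediction.lower().split())
    ref = Counter(gold.lower().split())
    if not pred or not ref:
        # Both empty counts as a match; exactly one empty does not.
        return float(pred == ref)
    overlap = sum((pred & ref).values())  # per-token minimum counts
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Repetition is penalized: precision is 1/3 and recall is 1/2, so F1 is 0.4.
# A set-based F1 would give about 0.67 for the same pair.
print(round(token_f1("the the the", "the cat"), 3))
```

To use it as a DSPy metric, wrap it in a function with the `(example, prediction, trace=None)` signature shown above and pass `prediction.answer` and `example.answer`.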
Step 5: Automatic Optimization
from dspy.teleprompt import BootstrapFewShot, MIPROv2

# Three examples keep the listing short; a real run needs far more data (see Pitfall 2)
trainset = [
    dspy.Example(question="What is DSPy?", answer="A declarative LLM programming framework").with_inputs("question"),
    dspy.Example(question="What does LoRA do?", answer="Reduces LLM fine-tuning memory requirements").with_inputs("question"),
    dspy.Example(question="What does RAG stand for?", answer="Retrieval-Augmented Generation").with_inputs("question"),
]

optimizer_fewshot = BootstrapFewShot(
    metric=answer_exact_match,
    max_bootstrapped_demos=4,
    max_labeled_demos=4,
    max_rounds=3,
)
optimized_module = optimizer_fewshot.compile(
    RAGModule(),
    trainset=trainset,
)

optimizer_mipro = MIPROv2(
    metric=answer_f1_score,
    num_threads=4,
    max_bootstrapped_demos=4,
    max_labeled_demos=4,
    num_candidates=10,
)
# num_trials is an argument of compile(), not of the MIPROv2 constructor
fully_optimized = optimizer_mipro.compile(
    RAGModule(),
    trainset=trainset,
    num_trials=20,
)
Step 6: Evaluation and Deployment
from dspy.evaluate import Evaluate

# The listing reuses trainset as devset for brevity; use a held-out set for a meaningful score
evaluator = Evaluate(
    devset=trainset,
    metric=answer_f1_score,
    num_threads=4,
    display_progress=True,
    display_table=5,
)
score = evaluator(fully_optimized)
print(f"Optimized F1 score: {score:.2f}")

result = fully_optimized(question="What is the core advantage of DSPy framework?")
print(f"Answer: {result.answer}")
Pitfall Guide
Pitfall 1: Missing Signature Field Descriptions
# ❌ Wrong: no description, LLM doesn't know output format
class BadSig(dspy.Signature):
    question: str = dspy.InputField()
    answer: str = dspy.OutputField()

# ✅ Correct: add detailed descriptions to guide LLM output
class GoodSig(dspy.Signature):
    """Answer the question based on context, answer within 50 words."""

    question: str = dspy.InputField(desc="Question asked by the user")
    answer: str = dspy.OutputField(desc="Concise and accurate answer, within 50 words")
Pitfall 2: Insufficient Optimizer Training Data
# ❌ Wrong: insufficient training data, optimizer can't learn effective patterns
trainset = [dspy.Example(question="1+1=?", answer="2").with_inputs("question")]

# ✅ Correct: aim for roughly 50-200 high-quality training examples
# (load_training_data is a placeholder for your own data loader)
trainset = load_training_data(min_size=50)
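Once you have enough examples, it helps to hold some of them out so that the score you report is not measured on the data the optimizer saw. The sketch below shows a seeded shuffle-and-split using only the standard library. The helper `split_examples` and the dictionary-shaped records are assumptions for illustration; in practice the items would be `dspy.Example` objects.

```python
import random
from typing import Sequence, TypeVar

T = TypeVar("T")


def split_examples(
    examples: Sequence[T], dev_fraction: float = 0.2, seed: int = 0
) -> tuple[list[T], list[T]]:
    """Shuffle deterministically and split into (train, dev)."""
    if len(examples) < 2:
        raise ValueError("need at least 2 examples to split")
    if not 0.0 < dev_fraction < 1.0:
        raise ValueError("dev_fraction must be between 0 and 1")
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_dev = max(1, round(len(shuffled) * dev_fraction))
    return shuffled[n_dev:], shuffled[:n_dev]


# Placeholder records standing in for real training data.
examples = [{"question": f"q{i}", "answer": f"a{i}"} for i in range(50)]
train, dev = split_examples(examples)
print(len(train), len(dev))  # 40 10
```

The `train` part would go to the optimizer's `compile` call and the `dev` part to the evaluator.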
Pitfall 3: Overly Lenient Metric Function
# ❌ Wrong: always returns 1.0, optimizer can't distinguish good from bad
def bad_metric(example, prediction, trace=None):
    return 1.0

# ✅ Correct: use a discriminative metric
def good_metric(example, prediction, trace=None):
    return answer_f1_score(example, prediction, trace)
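A quick sanity check for this pitfall is to score a handful of predictions and look at how much the scores vary: a metric that returns nearly the same value for everything gives the optimizer nothing to work with. The sketch below is one way to do that. The helper names and the `0.05` threshold are arbitrary assumptions, not a DSPy convention.

```python
from statistics import pstdev


def metric_spread(scores: list[float]) -> float:
    """Population standard deviation of metric scores (0.0 if empty)."""
    return pstdev(scores) if scores else 0.0


def looks_discriminative(scores: list[float], min_spread: float = 0.05) -> bool:
    """Heuristic: the metric should vary across a sample of predictions."""
    return metric_spread(scores) >= min_spread


constant_scores = [1.0] * 5            # what bad_metric would produce
mixed_scores = [1.0, 0.0, 0.5, 1.0]    # what a useful metric might produce

print(looks_discriminative(constant_scores))  # False
print(looks_discriminative(mixed_scores))     # True
```

A low spread is only a warning sign: a metric can also vary for the wrong reasons, so spot-check a few scored examples by hand as well.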
Pitfall 4: Type Mismatch Between Modules
# ❌ Wrong: sub-module returns list, downstream expects str
class StepA(dspy.Signature):
    items: list[str] = dspy.OutputField()

class StepB(dspy.Signature):
    text: str = dspy.InputField()

# ✅ Correct: do type conversion in forward (a method of your dspy.Module subclass)
def forward(self, question):
    result_a = self.step_a(question=question)
    joined = "\n".join(result_a.items)
    result_b = self.step_b(text=joined)
    return result_b
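When several modules hand values to each other, a small coercion helper keeps the conversion in one place and fails loudly on unexpected types. The following is a sketch in plain Python; `to_text` is a hypothetical helper name, not something DSPy provides.

```python
def to_text(value: object, sep: str = "\n") -> str:
    """Coerce a str or a list/tuple of items into a single string."""
    if isinstance(value, str):
        return value
    if isinstance(value, (list, tuple)):
        return sep.join(str(item) for item in value)
    # Fail loudly rather than silently passing an unexpected type downstream
    raise TypeError(f"cannot convert {type(value).__name__} to text")


print(to_text(["first passage", "second passage"]))
print(to_text("already a string"))
```

Inside `forward`, the call would look like `self.step_b(text=to_text(result_a.items))`.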
Pitfall 5: Not Handling LLM Output Parse Failures
# ❌ Wrong: directly accessing output fields, may throw exceptions
# (both snippets run inside a method of your dspy.Module subclass)
prediction = self.module(question=q)
answer = prediction.answer

# ✅ Correct: add exception handling and default values
try:
    prediction = self.module(question=q)
    answer = prediction.answer if prediction.answer else "Cannot answer"
except Exception as e:
    answer = f"Processing failed: {e}"
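One drawback of putting the error text into `answer` is that callers can no longer tell a real answer from a failure message. A possible refinement, sketched below with the standard library only, returns the answer and the error separately. `AnswerResult`, `safe_answer`, and the stand-in predictors are hypothetical names for illustration.

```python
from dataclasses import dataclass
from types import SimpleNamespace
from typing import Callable, Optional


@dataclass
class AnswerResult:
    answer: Optional[str]
    error: Optional[str] = None

    @property
    def ok(self) -> bool:
        return self.error is None


def safe_answer(predict: Callable[[str], object], question: str) -> AnswerResult:
    """Run predict and report failures separately from the answer text."""
    try:
        prediction = predict(question)
    except Exception as exc:  # narrow this to the errors your stack raises
        return AnswerResult(answer=None, error=f"{type(exc).__name__}: {exc}")
    answer = getattr(prediction, "answer", None)
    if not answer:
        return AnswerResult(answer=None, error="empty or missing 'answer' field")
    return AnswerResult(answer=str(answer))


# Stand-ins for a module call: one succeeds, one raises, one omits the field.
def good(question: str) -> object:
    return SimpleNamespace(answer="42")


def broken(question: str) -> object:
    raise ValueError("could not parse output")


def missing(question: str) -> object:
    return SimpleNamespace()


print(safe_answer(good, "q"))
print(safe_answer(broken, "q"))
print(safe_answer(missing, "q"))
```

Callers can then branch on `result.ok` and decide whether to retry, fall back, or show a default message.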
Error Troubleshooting
| # | Error Message | Cause | Solution |
|---|---|---|---|
| 1 | AssertionError: Signature must have at least one output field | Signature missing output field | Ensure Signature has at least one OutputField |
| 2 | TypeError: Expected str, got list | Type mismatch between modules | Do type conversion in forward |
| 3 | dspy.primitives.assertions.AssertionError | Assertion condition not met | Check dspy.Assert condition logic |
| 4 | openai.RateLimitError | API call rate exceeded | Reduce num_threads or add retry logic |
| 5 | KeyError: 'answer' | LLM output missing expected field | Check signature definition, add field descriptions |
| 6 | ValueError: No demos were bootstrapped | Insufficient training data quality | Increase training data, check metric function |
| 7 | JSONDecodeError | LLM output not JSON format | Use dspy.ChainOfThought instead of dspy.Predict |
| 8 | AttributeError: module has no attribute 'retrieve' | Module not initialized with retriever | Ensure all sub-modules are initialized in `__init__` |
| 9 | TimeoutError: LLM call timed out | LLM response timeout | Set a longer request timeout, or lower max_tokens so responses finish sooner |
| 10 | ImportError: cannot import name 'MIPROv2' | DSPy version too low | Upgrade to dspy-ai>=2.5.0 |
Advanced Optimization
1. Custom Adapter for Local Models
# Sketch only: the hooks that dspy.Adapter expects (names, arguments, return types)
# differ between DSPy versions, so check the base class in your installed release.
class LocalModelAdapter(dspy.Adapter):
    def format(self, signature, demos, inputs):
        prompt = f"Task: {signature.__doc__}\n\n"
        # Few-shot demonstrations first
        for demo in demos:
            for key, val in demo.items():
                prompt += f"{key}: {val}\n"
            prompt += "\n"
        # Then the actual inputs
        for key, val in inputs.items():
            prompt += f"{key}: {val}\n"
        prompt += "\nPlease output:\n"
        for field_name in signature.output_fields:
            prompt += f"{field_name}: "
        return prompt

    def parse(self, signature, completion):
        # Naive "key: value" parsing, one field per line
        outputs = {}
        for line in completion.strip().split("\n"):
            if ":" in line:
                key, val = line.split(":", 1)
                outputs[key.strip()] = val.strip()
        return outputs
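The line-by-line parser above treats every line containing a colon as a new field, so a value such as "The ratio is 3:1" or a multi-line answer gets mangled. The sketch below is a stricter standalone variant that only accepts known field names and keeps continuation lines. `parse_fields` is a hypothetical helper, independent of DSPy's own adapters.

```python
def parse_fields(completion: str, expected: list[str]) -> dict[str, str]:
    """Parse 'name: value' blocks, accepting only the expected field names."""
    fields: dict[str, list[str]] = {}
    current = None
    for line in completion.strip().splitlines():
        key, sep, rest = line.partition(":")
        if sep and key.strip() in expected:
            # A known field name starts a new block
            current = key.strip()
            fields[current] = [rest.strip()]
        elif current is not None:
            # Anything else continues the current field's value
            fields[current].append(line.strip())
    missing = [name for name in expected if name not in fields]
    if missing:
        raise ValueError(f"missing output fields: {missing}")
    return {name: "\n".join(parts).strip() for name, parts in fields.items()}


completion = "reasoning: The ratio is 3:1\nso we pick A\nanswer: A"
parsed = parse_fields(completion, ["reasoning", "answer"])
print(parsed)
```

Raising on missing fields lets the caller decide whether to retry or fall back, instead of discovering the gap later as a `KeyError`.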
2. Assertion-Driven Output Constraints
# Note: newer DSPy releases document dspy.Refine and dspy.BestOfN for output
# constraints, so check that dspy.Assert is available in your installed version.
class ConstrainedQA(dspy.Module):
    def __init__(self):
        super().__init__()
        self.generate = dspy.ChainOfThought(QuestionAnswer)

    def forward(self, question: str, context: str) -> dspy.Prediction:
        result = self.generate(question=question, context=context)
        dspy.Assert(
            len(result.answer) > 0,
            "Answer cannot be empty",
        )
        dspy.Assert(
            len(result.answer) <= 200,
            "Answer cannot exceed 200 characters",
        )
        return result
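The general idea behind constraint-driven generation is simple: generate, check the constraints, and try again if they fail. The sketch below shows that loop in plain Python with a stand-in generator, using the same two constraints as above (non-empty, at most 200 characters). `first_valid` and the canned outputs are hypothetical; this is not how DSPy implements its assertion or refinement features.

```python
from typing import Callable, Optional


def first_valid(
    generate: Callable[[int], str],
    is_valid: Callable[[str], bool],
    max_attempts: int = 3,
) -> Optional[str]:
    """Return the first candidate that passes is_valid, or None."""
    for attempt in range(max_attempts):
        candidate = generate(attempt)
        if is_valid(candidate):
            return candidate
    return None


def within_limits(answer: str) -> bool:
    # Same constraints as the assertions above
    return 0 < len(answer) <= 200


# Canned outputs standing in for successive LLM calls.
canned = ["", "x" * 300, "A short, valid answer"]
chosen = first_valid(lambda i: canned[i], within_limits)
print(chosen)  # A short, valid answer
```

Returning `None` after the last attempt leaves the failure policy (raise, fall back, or log) to the caller.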
3. Cache Optimization to Reduce API Calls
import hashlib
import json


class CachedModule(dspy.Module):
    def __init__(self, module: dspy.Module, cache_dir: str = ".dspy_cache"):
        super().__init__()
        self.module = module
        # cache_dir is stored but not used here: this cache lives in memory only
        self.cache_dir = cache_dir
        self.cache = {}

    def _cache_key(self, **kwargs):
        # Keyword arguments must be JSON-serializable
        content = json.dumps(kwargs, sort_keys=True)
        return hashlib.md5(content.encode()).hexdigest()

    def forward(self, **kwargs):
        key = self._cache_key(**kwargs)
        if key in self.cache:
            return self.cache[key]
        result = self.module(**kwargs)
        self.cache[key] = result
        return result
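The module above accepts a `cache_dir` but keeps everything in memory, so the cache is lost when the process exits. A persistent variant could write each result to a JSON file named after the hash of its inputs. The sketch below shows that idea with the standard library only. `DiskCache` and `fake_llm` are hypothetical, and results are assumed to be JSON-serializable dictionaries.

```python
import hashlib
import json
import tempfile
from pathlib import Path
from typing import Callable


class DiskCache:
    """Stores one JSON file per distinct set of keyword arguments."""

    def __init__(self, cache_dir: str):
        self.dir = Path(cache_dir)
        self.dir.mkdir(parents=True, exist_ok=True)

    def _path(self, kwargs: dict) -> Path:
        payload = json.dumps(kwargs, sort_keys=True, default=str)
        digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
        return self.dir / f"{digest}.json"

    def get_or_compute(self, compute: Callable[..., dict], **kwargs) -> dict:
        path = self._path(kwargs)
        if path.exists():
            # Cache hit: read the stored result instead of calling compute
            return json.loads(path.read_text(encoding="utf-8"))
        result = compute(**kwargs)
        path.write_text(json.dumps(result), encoding="utf-8")
        return result


calls: list[str] = []


def fake_llm(question: str) -> dict:
    """Stand-in for an expensive LLM call; records each invocation."""
    calls.append(question)
    return {"answer": question.upper()}


with tempfile.TemporaryDirectory() as tmp:
    cache = DiskCache(tmp)
    first = cache.get_or_compute(fake_llm, question="what is rag?")
    second = cache.get_or_compute(fake_llm, question="what is rag?")

print(first == second, len(calls))  # True 1
```

Before adding a cache of your own, check whether the built-in caching in your DSPy version already covers your case.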
Comparison Analysis
| Dimension | DSPy | LangChain | LlamaIndex | Raw Prompt |
|---|---|---|---|---|
| Programming Paradigm | Declarative | Imperative chain | Imperative index | Hand-written Prompt |
| Auto Optimization | ✅ Built-in optimizer | ❌ Manual | ❌ Manual | ❌ Fully manual |
| Reproducibility | ✅ Fixed signatures | ⚠️ Template-dependent | ⚠️ Template-dependent | ❌ Hard to reproduce |
| Model Migration | ✅ Swap the LM, re-optimize | ⚠️ Change templates | ⚠️ Change templates | ❌ Rewrite all |
| Learning Curve | Medium | Low | Low | Low |
| Production Ready | ✅ Typed signatures | ⚠️ Flexible but fragile | ✅ Strong for RAG | ❌ High maintenance |
| Community | Fast-growing | Mature | Mature | N/A |
Summary: DSPy is not just another LLM framework. It shifts LLM programming from hand-written Prompts to declarative programming plus automatic optimization. Its core value is threefold: 1) signatures double as documentation, which reduces Prompt maintenance; 2) optimizers search for effective Prompts and examples instead of relying on human intuition; 3) modular composition with typed fields makes multi-step reasoning chains easier to keep intact, provided you convert types between modules (Pitfall 4). A practical path for 2026: validate quickly with ChainOfThought and signatures, then optimize examples with BootstrapFewShot, then run MIPROv2 for full optimization. The key is a high-quality metric function, because it determines whether optimization heads in the right direction.