Rewards
Rewards are the core training signal from the Labs API. Understanding how rewards work helps you build better training pipelines.
Reward Range
All rewards are normalized to the range 0 to 1. The specific meaning of reward values is scenario-dependent.
Reward Components
Rewards are computed from multiple factors. Scoring criteria differ across scenarios and collections—each scenario defines its own evaluation rubric:
Higher Rewards (closer to 1)
- Finding Discovery: Model uncovered a key fact
- Goal Achievement: Model completed a scenario objective
- Correct Tool Use: Model used tools appropriately
- Efficient Progress: Model made relevant progress
Lower Rewards (closer to 0)
- Harmful Recommendation: Dangerous or inappropriate advice
- Incorrect Conclusion: Wrong assessment
- Irrelevant Questions: Off-topic tangents
- Tool Misuse: Incorrect tool invocation
The Reward Signal
Each turn returns a single reward score that evaluates the conversation so far.
The reward is informed by the entire conversation history, not just the latest turn. This gives you a holistic measure of how well the model is performing against the scenario objectives.
Reward Configuration
You can control which scoring dimensions run and their relative weights using reward_config. This is useful for curriculum learning, where you progressively unlock scoring dimensions as training progresses.
quality_weights
Specify which scorers to run and their weights. When quality_weights is provided, any omitted key defaults to 0 rather than its usual default weight, so only the scorers you explicitly include will execute. This saves tokens by skipping unnecessary LLM evaluations.
Available dimensions:
| Key | Description |
|---|---|
| success_metrics | Did the model achieve scenario objectives? |
| failure_metrics | Did the model avoid harmful actions? |
| best_practices | Did the model follow best practices? |
| rubrics | Per-criterion rubric scores |
| discovery | Did the model uncover key facts? |
| output_similarity | How close is the output to the reference? |
| decision_match | Did the model make the correct decisions? |
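As a minimal sketch, a small helper can catch misspelled dimension keys before a request is sent. The function name `build_reward_config` and the non-negative-weight check are assumptions for illustration, not part of the Labs API; only the key names and the `reward_config` / `quality_weights` / `skip_safety_gate` fields come from this page.

```python
from typing import Dict, Optional

# Dimension keys copied from the table above.
KNOWN_DIMENSIONS = {
    "success_metrics", "failure_metrics", "best_practices", "rubrics",
    "discovery", "output_similarity", "decision_match",
}


def build_reward_config(quality_weights: Dict[str, float],
                        skip_safety_gate: Optional[bool] = None) -> dict:
    """Hypothetical helper: validate weights and wrap them in a reward_config payload."""
    unknown = set(quality_weights) - KNOWN_DIMENSIONS
    if unknown:
        raise ValueError(f"Unknown scoring dimensions: {sorted(unknown)}")
    # Assumption: negative weights are not meaningful; the docs do not say.
    if any(weight < 0 for weight in quality_weights.values()):
        raise ValueError("Weights must be non-negative")
    config: dict = {"quality_weights": dict(quality_weights)}
    if skip_safety_gate is not None:
        config["skip_safety_gate"] = skip_safety_gate
    return {"reward_config": config}
```

Because omitted keys are not scored at all, a typo such as `"discovry"` would otherwise silently disable the dimension you meant to train on.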
skip_safety_gate
When true, scope compliance and terminology scorers are skipped (safety gate scores 1.0). Useful when you want to focus training on a single quality dimension without the safety gate penalizing unrelated behavior.
Example: Discovery-Only Curriculum
{
  "reward_config": {
    "quality_weights": {
      "discovery": 1.0
    },
    "skip_safety_gate": true
  }
}
This runs only the discovery scorer, skipping all other dimensions. A 3-phase curriculum might look like:
- Phase 1 - Discovery only: { "discovery": 1.0 }
- Phase 2 - Discovery + quality: { "discovery": 0.5, "success_metrics": 0.3, "rubrics": 0.2 }
- Phase 3 - Omit reward_config entirely for full default scoring
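The three phases above could be driven by the training step. The following is only a sketch: the step boundaries and the function names `reward_config_for_step` and `with_reward_config` are hypothetical, and the request payload shape is assumed to be a plain dictionary.

```python
from typing import Optional

# Phase boundaries are illustrative assumptions; tune them for your run.
PHASE_1_END = 1_000
PHASE_2_END = 5_000


def reward_config_for_step(step: int) -> Optional[dict]:
    """Return the reward_config for a training step, or None to omit it."""
    if step < PHASE_1_END:
        return {"quality_weights": {"discovery": 1.0}}
    if step < PHASE_2_END:
        return {"quality_weights": {
            "discovery": 0.5, "success_metrics": 0.3, "rubrics": 0.2,
        }}
    return None  # Phase 3: omit reward_config for full default scoring


def with_reward_config(payload: dict, step: int) -> dict:
    """Copy a request payload, attaching reward_config only when a phase defines one."""
    out = dict(payload)
    config = reward_config_for_step(step)
    if config is not None:
        out["reward_config"] = config
    return out
```

Returning `None` for the final phase keeps the key out of the request entirely, which matches the "omit reward_config" instruction rather than sending an empty object.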
Reward Shaping Tips
Handle Low Rewards
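Low rewards (close to 0) indicate problematic actions. Use the reward magnitude as the training signal:
# Use reward directly for RL training (REINFORCE-style loss)
# reward: score returned by the API, in [0, 1]
# log_prob: log-probability of the sampled response under the policy
# Higher rewards (close to 1) reinforce the behavior strongly
# Lower rewards (close to 0) reinforce it only weakly
loss = -reward * log_prob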
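Because rewards lie between 0 and 1, the loss above never lowers the probability of a sampled response; a low reward only reinforces it less. One common way to make low-reward responses actively discouraged is to subtract a baseline. The sketch below uses the batch mean reward as the baseline. This is a standard technique that this page does not prescribe; `reinforce_loss` is a hypothetical name, and plain floats stand in for the tensors a real training framework would use.

```python
from statistics import mean
from typing import List


def reinforce_loss(rewards: List[float], log_probs: List[float]) -> float:
    """REINFORCE loss with a mean-reward baseline over a batch of episodes."""
    if not rewards or len(rewards) != len(log_probs):
        raise ValueError("rewards and log_probs must be non-empty and equal length")
    baseline = mean(rewards)
    # Positive advantage: better than the batch average, so reinforce.
    # Negative advantage: worse than average, so discourage.
    advantages = [r - baseline for r in rewards]
    return -mean([a * lp for a, lp in zip(advantages, log_probs)])
```

With this form, an episode scoring 0.2 in a batch averaging 0.6 gets a negative advantage, so minimizing the loss reduces its log-probability.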
Reward Timing
Rewards are computed immediately after each turn submission. Each reward reflects the full conversation up to that point—how well the model is progressing toward the scenario objectives.
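Since each reward scores the whole conversation so far, the difference between consecutive rewards gives a rough view of what an individual turn contributed. The sketch below assumes you have collected the per-turn rewards of one episode into a list; measuring the first turn from 0 is an illustrative choice, and `reward_deltas` is a hypothetical helper.

```python
from typing import List


def reward_deltas(turn_rewards: List[float]) -> List[float]:
    """Change in the cumulative reward at each turn; the first turn is measured from 0."""
    deltas = []
    previous = 0.0
    for reward in turn_rewards:
        deltas.append(reward - previous)
        previous = reward
    return deltas
```

A negative delta marks a turn after which the conversation as a whole scored worse, which is often the first place to look when an episode ends with a low reward.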
Score Detail Levels
Depending on your subscription, API responses include different levels of scoring detail:
| Access Level | Fields Returned |
|---|---|
| reward_only | reward |
| scores | reward, scores |
| full | reward, scores, score_breakdown |
reward
Always present. A single number from 0 to 1.
scores
Present on scores and full tiers. A breakdown by dimension — names are scenario-specific:
{
  "reward": 0.75,
  "scores": {
    "successMetrics": 0.8,
    "failureMetrics": 1.0,
    "rubrics": {
      "score": 0.7,
      "criteria": {
        "informationGathering": 0.8,
        "actionableGuidance": 0.6
      }
    }
  }
}
score_breakdown
Present on full tier only. Textual reasoning per dimension — useful for debugging why a score was assigned:
{
  "score_breakdown": {
    "successMetrics": {
      "reasoning": "The operator gathered all required case details...",
      "strengths": ["Asked about prior treatment history"],
      "improvements": ["Did not confirm injury date"]
    }
  }
}
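Client code that should work on any tier can treat `scores` and `score_breakdown` as optional. The sketch below assumes the response has already been parsed into a dictionary with the top-level fields shown in the examples above, and that every leaf under `scores` is a number. `flatten_scores` and `summarize_response` are hypothetical helpers, not API functions.

```python
from typing import Any, Dict


def flatten_scores(scores: Dict[str, Any], prefix: str = "") -> Dict[str, float]:
    """Flatten nested score dicts into dotted keys such as 'rubrics.criteria.x'."""
    flat: Dict[str, float] = {}
    for key, value in scores.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten_scores(value, name))
        else:
            flat[name] = float(value)
    return flat


def summarize_response(response: Dict[str, Any]) -> Dict[str, Any]:
    """Extract whatever scoring detail the tier provides; only 'reward' is required."""
    return {
        "reward": response["reward"],
        "scores": flatten_scores(response.get("scores", {})),
        "has_breakdown": "score_breakdown" in response,
    }
```

Flattened keys are convenient for logging each dimension as its own metric, and because dimension names are scenario-specific the helper does not hard-code any of them.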
Contact us through the Labs Portal to upgrade your subscription tier.
Debugging Rewards
If rewards seem unexpected:
- Check terminal_reason: Episode may have ended early
- Review tool calls: Incorrect tool use causes low rewards
- Consider scenario objectives: Some scenarios reward different behaviors
Use the /api/v1/compare endpoint to understand relative quality between two responses.
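The checklist above can be partly automated. The following sketch assumes that `terminal_reason` and `scores` appear as top-level fields of a parsed response, which this page does not confirm; the 0.3 threshold and the name `triage_reward` are likewise illustrative.

```python
from typing import Any, Dict, List


def triage_reward(response: Dict[str, Any], low_threshold: float = 0.3) -> List[str]:
    """Return human-readable hints for a response whose reward looks low."""
    hints: List[str] = []
    if response["reward"] >= low_threshold:
        return hints
    terminal_reason = response.get("terminal_reason")  # assumed top-level field
    if terminal_reason:
        hints.append(f"Episode terminated: {terminal_reason}")
    # Only top-level numeric dimensions are checked; nested rubrics are skipped.
    for name, value in response.get("scores", {}).items():
        if isinstance(value, (int, float)) and value < low_threshold:
            hints.append(f"Low score on dimension: {name}")
    if not hints:
        hints.append("No obvious cause; review tool calls and scenario objectives")
    return hints
```

On the reward_only tier the function falls back to the generic hint, since no per-dimension scores are available to inspect.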