Rewards

Rewards are the core training signal from the Labs API. Understanding how rewards work helps you build better training pipelines.

Reward Range

All rewards are normalized to the range 0 to 1. The specific meaning of reward values is scenario-dependent.

Reward Components

Rewards are computed from multiple factors. Scoring criteria differ across scenarios and collections—each scenario defines its own evaluation rubric:

Higher Rewards (closer to 1)

  • Finding Discovery: Model uncovered a key fact
  • Goal Achievement: Model completed a scenario objective
  • Correct Tool Use: Model used tools appropriately
  • Efficient Progress: Model made relevant progress

Lower Rewards (closer to 0)

  • Harmful Recommendation: Dangerous or inappropriate advice
  • Incorrect Conclusion: Wrong assessment
  • Irrelevant Questions: Off-topic tangents
  • Tool Misuse: Incorrect tool invocation

The Reward Signal

Each turn returns a single reward score that evaluates the conversation so far:
{
  "reward": 0.75
}
The reward is informed by the entire conversation history, not just the latest turn. This gives you a holistic measure of how well the model is performing against the scenario objectives.
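As a minimal sketch of consuming this signal, the helper below reads the reward from a turn response that has already been parsed into a dictionary and checks it against the documented 0 to 1 range. The name `read_reward` is hypothetical, and the response is assumed to have exactly the JSON shape shown above.

```python
from typing import Any, Mapping


def read_reward(response: Mapping[str, Any]) -> float:
    """Return the reward from a parsed turn response (hypothetical helper)."""
    if "reward" not in response:
        raise KeyError("response has no 'reward' field")
    reward = float(response["reward"])
    # Rewards are documented as normalized to the range 0 to 1.
    if not 0.0 <= reward <= 1.0:
        raise ValueError(f"reward {reward} is outside the range 0 to 1")
    return reward


print(read_reward({"reward": 0.75}))
```

Failing loudly on an out-of-range value is a design choice for this sketch; a training loop might prefer to log and skip the turn instead.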

Reward Configuration

You can control which scoring dimensions run and their relative weights using reward_config. This is useful for curriculum learning, where you progressively unlock scoring dimensions as training progresses.

quality_weights

Specify which scorers to run and their weights. When quality_weights is provided, omitted keys default to 0 rather than to their usual default weight, so only the scorers you explicitly include will execute. This saves tokens by skipping unnecessary LLM evaluations. Available dimensions:
Key                  Description
success_metrics      Did the model achieve scenario objectives?
failure_metrics      Did the model avoid harmful actions?
best_practices       Did the model follow best practices?
rubrics              Per-criterion rubric scores
discovery            Did the model uncover key facts?
output_similarity    How close is the output to the reference?
decision_match       Did the model make the correct decisions?
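The following sketch shows one way to guard against misspelled dimension keys before sending a request. The helper `build_quality_weights` is hypothetical, and two of its rules are assumptions rather than documented API behavior: weights are required to be non-negative, and zero-weight entries are dropped on the grounds that an omitted key already defaults to 0.

```python
from typing import Dict, Mapping

# Dimension keys taken from the list of available dimensions above.
KNOWN_DIMENSIONS = frozenset({
    "success_metrics", "failure_metrics", "best_practices", "rubrics",
    "discovery", "output_similarity", "decision_match",
})


def build_quality_weights(weights: Mapping[str, float]) -> Dict[str, float]:
    """Validate dimension names and drop zero weights (hypothetical helper)."""
    unknown = set(weights) - KNOWN_DIMENSIONS
    if unknown:
        raise ValueError(f"unknown scoring dimensions: {sorted(unknown)}")
    # Assumption: negative weights are not meaningful for this config.
    if any(w < 0 for w in weights.values()):
        raise ValueError("weights must be non-negative")
    # Omitted keys default to 0, so zero-weight entries can be left out.
    return {name: float(w) for name, w in weights.items() if w > 0}


print(build_quality_weights({"discovery": 1.0, "rubrics": 0}))
```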

skip_safety_gate

When true, the scope compliance and terminology scorers are skipped and the safety gate is treated as scoring 1.0. This is useful when you want to focus training on a single quality dimension without the safety gate penalizing unrelated behavior.

Example: Discovery-Only Curriculum

{
  "reward_config": {
    "quality_weights": {
      "discovery": 1.0
    },
    "skip_safety_gate": true
  }
}
This runs only the discovery scorer, skipping all other dimensions. A 3-phase curriculum might look like:
  1. Phase 1 - Discovery only: { "discovery": 1.0 }
  2. Phase 2 - Discovery + quality: { "discovery": 0.5, "success_metrics": 0.3, "rubrics": 0.2 }
  3. Phase 3 - Omit reward_config entirely for full default scoring
See Training Tips for a complete curriculum learning example.
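The three phases above could be wired into a training loop roughly as follows. This is a sketch under stated assumptions: the function names and the shape of the request payload are hypothetical, and only the `reward_config`, `quality_weights`, and `skip_safety_gate` keys come from the example above.

```python
from typing import Any, Dict, Optional

# Weights for the first two phases, copied from the curriculum listed above.
PHASE_WEIGHTS: Dict[int, Dict[str, float]] = {
    1: {"discovery": 1.0},
    2: {"discovery": 0.5, "success_metrics": 0.3, "rubrics": 0.2},
}


def reward_config_for_phase(phase: int) -> Optional[Dict[str, Any]]:
    """Return the reward_config for a phase, or None for full default scoring."""
    if phase < 1:
        raise ValueError("phase must be 1 or greater")
    if phase >= 3:
        return None  # Phase 3: omit reward_config entirely.
    return {"quality_weights": dict(PHASE_WEIGHTS[phase])}


def attach_reward_config(payload: Dict[str, Any], phase: int) -> Dict[str, Any]:
    """Copy a request payload and add reward_config only when one applies."""
    body = dict(payload)
    config = reward_config_for_phase(phase)
    if config is not None:
        body["reward_config"] = config
    return body


print(attach_reward_config({}, phase=2))
```

Returning None for phase 3, instead of an empty dictionary, keeps the key out of the request altogether, which matches the instruction to omit reward_config.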

Reward Shaping Tips

Handle Low Rewards

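Low rewards (close to 0) indicate problematic actions. Use the reward magnitude as the training signal:
# Use the reward directly as the weight in a REINFORCE-style policy-gradient loss.
# log_prob is the log-probability the policy assigned to the sampled response.
# Higher rewards (close to 1) strongly reinforce the sampled behavior.
# Lower rewards (close to 0) reinforce it only weakly; the weight is never negative.
loss = -reward * log_prob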
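One common way to make low rewards actively push probability away from a response is to subtract a baseline, such as the mean reward of the batch, before weighting the log-probability. The sketch below uses plain floats for illustration; in a real pipeline `log_probs` would be tensors from an autograd framework. The function name is hypothetical, and the mean-reward baseline is an assumption of this example, not something the Labs API prescribes.

```python
from statistics import mean
from typing import List, Sequence


def baseline_losses(rewards: Sequence[float],
                    log_probs: Sequence[float]) -> List[float]:
    """Per-sample policy-gradient losses using a mean-reward baseline."""
    if len(rewards) != len(log_probs):
        raise ValueError("rewards and log_probs must have the same length")
    if not rewards:
        return []
    baseline = mean(rewards)
    # Advantage is positive for above-average rewards and negative otherwise,
    # so below-average responses are discouraged rather than weakly reinforced.
    return [-(r - baseline) * lp for r, lp in zip(rewards, log_probs)]


losses = baseline_losses([0.9, 0.1], [-0.5, -2.0])
print(losses)
```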

Reward Timing

Rewards are computed immediately after each turn submission. Each reward reflects the full conversation up to that point—how well the model is progressing toward the scenario objectives.
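Because each reward scores the whole conversation so far, the change between consecutive turns gives a rough estimate of what a single turn contributed. The helper below is an illustrative sketch: `reward_deltas` is a hypothetical name, and measuring the first turn against 0 is an assumption of this example.

```python
from typing import List, Sequence


def reward_deltas(turn_rewards: Sequence[float]) -> List[float]:
    """Return the change in reward at each turn, starting from 0."""
    deltas: List[float] = []
    previous = 0.0
    for reward in turn_rewards:
        deltas.append(reward - previous)
        previous = reward
    return deltas


# A turn with a negative delta lowered the score of the conversation.
print(reward_deltas([0.25, 0.5, 0.5]))
```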

Score Detail Levels

Depending on your subscription, API responses include different levels of scoring detail:
Access Level    Fields Returned
reward_only     reward
scores          reward, scores
full            reward, scores, score_breakdown

reward

Always present. A single number from 0 to 1.

scores

Present on scores and full tiers. A breakdown by dimension — names are scenario-specific:
{
  "reward": 0.75,
  "scores": {
    "successMetrics": 0.8,
    "failureMetrics": 1.0,
    "rubrics": {
      "score": 0.7,
      "criteria": {
        "informationGathering": 0.8,
        "actionableGuidance": 0.6
      }
    }
  }
}
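For logging, it can help to flatten the nested scores object into dotted keys. This is a sketch that assumes the response has been parsed into Python dictionaries with the shape shown above. `flatten_scores` is a hypothetical helper, and the dimension names in the sample are the scenario-specific ones from the example.

```python
from typing import Any, Dict, Mapping


def flatten_scores(scores: Mapping[str, Any], prefix: str = "") -> Dict[str, float]:
    """Flatten nested score dictionaries into dotted keys."""
    flat: Dict[str, float] = {}
    for name, value in scores.items():
        key = f"{prefix}.{name}" if prefix else name
        if isinstance(value, dict):
            flat.update(flatten_scores(value, key))
        else:
            flat[key] = float(value)
    return flat


response = {
    "reward": 0.75,
    "scores": {
        "successMetrics": 0.8,
        "rubrics": {"score": 0.7, "criteria": {"actionableGuidance": 0.6}},
    },
}
# scores is absent on the reward_only tier, so fall back to an empty dict.
for key, value in flatten_scores(response.get("scores", {})).items():
    print(key, value)
```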

score_breakdown

Present on full tier only. Textual reasoning per dimension — useful for debugging why a score was assigned:
{
  "score_breakdown": {
    "successMetrics": {
      "reasoning": "The operator gathered all required case details...",
      "strengths": ["Asked about prior treatment history"],
      "improvements": ["Did not confirm injury date"]
    }
  }
}
Contact us through the Labs Portal to upgrade your subscription tier.
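On the full tier, a score_breakdown like the one shown above can be turned into a short text report for debugging. This is a minimal sketch: `summarize_breakdown` is a hypothetical helper, and it assumes every dimension entry uses the `reasoning`, `strengths`, and `improvements` fields from the example.

```python
from typing import Any, List, Mapping


def summarize_breakdown(response: Mapping[str, Any]) -> List[str]:
    """Format score_breakdown entries as readable report lines."""
    lines: List[str] = []
    # score_breakdown is only present on the full tier, so default to empty.
    for dimension, detail in response.get("score_breakdown", {}).items():
        lines.append(f"{dimension}: {detail.get('reasoning', '')}")
        for item in detail.get("strengths", []):
            lines.append(f"  + {item}")
        for item in detail.get("improvements", []):
            lines.append(f"  - {item}")
    return lines


sample = {
    "score_breakdown": {
        "successMetrics": {
            "reasoning": "The operator gathered all required case details...",
            "strengths": ["Asked about prior treatment history"],
            "improvements": ["Did not confirm injury date"],
        }
    }
}
print("\n".join(summarize_breakdown(sample)))
```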

Debugging Rewards

If rewards seem unexpected:
  1. Check terminal_reason: Episode may have ended early
  2. Review tool calls: Incorrect tool use causes low rewards
  3. Consider scenario objectives: Some scenarios reward different behaviors
Use the /api/v1/compare endpoint to understand relative quality between two responses.