Rewards

Rewards are the core training signal from the Labs API. Understanding how rewards work helps you build better training pipelines.

Reward Range

All rewards are normalized to the range 0 to 1. The specific meaning of reward values is scenario-dependent.

Reward Components

Rewards are computed from multiple factors. Scoring criteria differ across scenarios and collections, and each scenario defines its own evaluation rubric. In general, the following behaviors move the reward up or down:

Higher Rewards (closer to 1)

Finding Discovery: Model uncovered a key fact
Goal Achievement: Model completed a scenario objective
Correct Tool Use: Model used tools appropriately
Efficient Progress: Model made relevant progress

Lower Rewards (closer to 0)

Harmful Recommendation: Dangerous or inappropriate advice
Incorrect Conclusion: Wrong assessment
Irrelevant Questions: Off-topic tangents
Tool Misuse: Incorrect tool invocation

The Reward Signal

Each turn returns a single reward score that evaluates the conversation so far:

{
  "reward": 0.75
}

The reward is informed by the entire conversation history, not just the latest turn. This gives you a holistic measure of how well the model is performing against the scenario objectives.
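A minimal sketch of reading this score in Python, assuming the response body is already available as a JSON string. The helper name `extract_reward` is hypothetical and not part of the API; the range check simply mirrors the documented 0 to 1 normalization.

```python
import json


def extract_reward(response_body: str) -> float:
    """Return the reward from a turn response, checking the documented 0-1 range."""
    payload = json.loads(response_body)
    reward = float(payload["reward"])
    if not 0.0 <= reward <= 1.0:
        raise ValueError(f"reward {reward} is outside the documented range [0, 1]")
    return reward


reward = extract_reward('{"reward": 0.75}')
print(reward)
```

Failing loudly on an out-of-range value is a design choice for this sketch: it surfaces parsing or integration mistakes before they reach the training loop.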

Reward Configuration

You can control which scoring dimensions run and their relative weights using reward_config. This is useful for curriculum learning, where you progressively unlock scoring dimensions as training progresses.

quality_weights

Specify which scorers to run and their weights. When quality_weights is provided, any omitted key defaults to 0 rather than to its usual default weight, so only the scorers you explicitly include will run. This saves tokens by skipping unnecessary LLM evaluations. Available dimensions:

Key	Description
`success_metrics`	Did the model achieve scenario objectives?
`failure_metrics`	Did the model avoid harmful actions?
`best_practices`	Did the model follow best practices?
`rubrics`	Per-criterion rubric scores
`discovery`	Did the model uncover key facts?
`output_similarity`	How close is the output to the reference?
`decision_match`	Did the model make the correct decisions?
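A small client-side check can catch misspelled keys before a request is sent. The sketch below assumes the table above lists every valid key and that a weight of 0 behaves the same as omitting the key; `KNOWN_DIMENSIONS` and `active_scorers` are hypothetical names, not part of the API.

```python
from typing import Dict, List

# Keys copied from the table above (assumed to be the complete set).
KNOWN_DIMENSIONS = frozenset({
    "success_metrics",
    "failure_metrics",
    "best_practices",
    "rubrics",
    "discovery",
    "output_similarity",
    "decision_match",
})


def active_scorers(quality_weights: Dict[str, float]) -> List[str]:
    """Return the scorers expected to run, rejecting unknown keys."""
    unknown = set(quality_weights) - KNOWN_DIMENSIONS
    if unknown:
        raise ValueError(f"unknown quality_weights keys: {sorted(unknown)}")
    # Assumption: a key with weight 0 is treated like an omitted key.
    return sorted(name for name, weight in quality_weights.items() if weight > 0)


print(active_scorers({"discovery": 0.5, "rubrics": 0.2}))
```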

skip_safety_gate

When true, the scope compliance and terminology scorers are skipped and the safety gate scores 1.0. This is useful when you want to focus training on a single quality dimension without the safety gate penalizing unrelated behavior.

Example: Discovery-Only Curriculum

{
  "reward_config": {
    "quality_weights": {
      "discovery": 1.0
    },
    "skip_safety_gate": true
  }
}

This runs only the discovery scorer, skipping all other dimensions. A 3-phase curriculum might look like:

Phase 1 - Discovery only: { "discovery": 1.0 }
Phase 2 - Discovery + quality: { "discovery": 0.5, "success_metrics": 0.3, "rubrics": 0.2 }
Phase 3 - Omit reward_config entirely for full default scoring
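The three phases can be sketched as a lookup that attaches reward_config to a request body. This is an illustration under assumptions: the helper names and the `base` body are hypothetical, reward_config is assumed to be a top-level key as in the example above, and skip_safety_gate is set only for phase 1 because that is the only phase the example specifies it for.

```python
import copy
from typing import Optional

# Phase settings copied from the curriculum described above.
PHASE_CONFIGS = {
    1: {"quality_weights": {"discovery": 1.0}, "skip_safety_gate": True},
    2: {"quality_weights": {"discovery": 0.5, "success_metrics": 0.3, "rubrics": 0.2}},
}


def reward_config_for_phase(phase: int) -> Optional[dict]:
    """Return the reward_config for a phase, or None for full default scoring."""
    if phase == 3:
        return None  # Phase 3: omit reward_config entirely
    if phase not in PHASE_CONFIGS:
        raise ValueError(f"unknown curriculum phase: {phase}")
    return copy.deepcopy(PHASE_CONFIGS[phase])


def build_request_body(base: dict, phase: int) -> dict:
    """Copy a (hypothetical) request body and attach reward_config if the phase has one."""
    body = dict(base)
    config = reward_config_for_phase(phase)
    if config is not None:
        body["reward_config"] = config
    return body


print(build_request_body({}, 2))
```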

See Training Tips for a complete curriculum learning example.

Reward Shaping Tips

Handle Low Rewards

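Low rewards (close to 0) indicate problematic actions. Use the reward magnitude as the training signal:

# Use reward directly for RL training (REINFORCE-style loss)
# Higher rewards (close to 1) reinforce the sampled behavior strongly
# Lower rewards (close to 0) reinforce it only weakly; because rewards are
# never negative, subtract a baseline if low rewards should actively discourage
loss = -reward * log_prob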
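One common variant subtracts a baseline from the reward so that below-baseline turns push log-probabilities down instead of merely contributing less. The sketch below uses plain Python floats to stay self-contained; the function name and the choice of baseline are illustrative assumptions, not something the API prescribes. In a real pipeline `log_probs` would be tensors from your training framework.

```python
from typing import Sequence


def policy_gradient_loss(
    rewards: Sequence[float],
    log_probs: Sequence[float],
    baseline: float = 0.0,
) -> float:
    """Mean of -(reward - baseline) * log_prob over a batch of sampled actions."""
    if len(rewards) != len(log_probs):
        raise ValueError("rewards and log_probs must have the same length")
    if not rewards:
        raise ValueError("at least one sample is required")
    total = sum(-(r - baseline) * lp for r, lp in zip(rewards, log_probs))
    return total / len(rewards)


rewards = [0.9, 0.1]
log_probs = [-0.2, -1.0]
baseline = sum(rewards) / len(rewards)  # batch mean as a simple baseline

print(policy_gradient_loss(rewards, log_probs))            # no baseline
print(policy_gradient_loss(rewards, log_probs, baseline))  # mean baseline
```

With the mean baseline, the sample with reward 0.1 receives a negative weight, so minimizing the loss lowers its log-probability.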

Reward Timing

Rewards are computed immediately after each turn submission. Each reward reflects the full conversation up to that point—how well the model is progressing toward the scenario objectives.
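Because each reward scores the whole conversation so far, the change between consecutive rewards is one rough way to estimate what a single turn contributed. This is a sketch of that idea, not a method the API prescribes; `reward_deltas` and the starting value of 0.0 are assumptions.

```python
from typing import List, Sequence


def reward_deltas(turn_rewards: Sequence[float], initial: float = 0.0) -> List[float]:
    """Return the change in reward at each turn relative to the previous turn."""
    deltas = []
    previous = initial
    for reward in turn_rewards:
        deltas.append(reward - previous)
        previous = reward
    return deltas


# Rewards returned after turns 1, 2, and 3 of one episode (made-up values).
print(reward_deltas([0.25, 0.5, 0.5]))  # [0.25, 0.25, 0.0]
```

A delta of zero or below flags a turn that did not move the conversation toward the scenario objectives.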

Score Detail Levels

Depending on your subscription, API responses include different levels of scoring detail:

Access Level	Fields Returned
`reward_only`	`reward`
`scores`	`reward`, `scores`
`full`	`reward`, `scores`, `score_breakdown`

reward

Always present. A single number from 0 to 1.

scores

Present on the `scores` and `full` tiers. A breakdown by dimension; names are scenario-specific:

{
  "reward": 0.75,
  "scores": {
    "successMetrics": 0.8,
    "failureMetrics": 1.0,
    "rubrics": {
      "score": 0.7,
      "criteria": {
        "informationGathering": 0.8,
        "actionableGuidance": 0.6
      }
    }
  }
}
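Since dimension names vary by scenario and some values are nested, it can help to flatten `scores` into dotted keys before logging them. The following is a sketch under the assumption that every leaf value is a number; `flatten_scores` is a hypothetical helper, and the missing-key default covers the `reward_only` tier, where `scores` is absent.

```python
from typing import Any, Dict


def flatten_scores(scores: Dict[str, Any], prefix: str = "") -> Dict[str, float]:
    """Flatten nested score dictionaries into {"a.b.c": value} form."""
    flat: Dict[str, float] = {}
    for name, value in scores.items():
        key = f"{prefix}.{name}" if prefix else name
        if isinstance(value, dict):
            flat.update(flatten_scores(value, key))
        elif isinstance(value, (int, float)) and not isinstance(value, bool):
            flat[key] = float(value)
    return flat


response = {
    "reward": 0.75,
    "scores": {
        "successMetrics": 0.8,
        "rubrics": {"score": 0.7, "criteria": {"informationGathering": 0.8}},
    },
}
# response.get(...) returns {} when the tier does not include "scores".
for key, value in flatten_scores(response.get("scores", {})).items():
    print(key, value)
```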

score_breakdown

Present on the `full` tier only. Textual reasoning per dimension, useful for debugging why a score was assigned:

{
  "score_breakdown": {
    "successMetrics": {
      "reasoning": "The operator gathered all required case details...",
      "strengths": ["Asked about prior treatment history"],
      "improvements": ["Did not confirm injury date"]
    }
  }
}
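When debugging, the `improvements` entries are often the quickest thing to scan. The sketch below collects them per dimension, assuming the field layout shown in the example above; `list_improvements` is a hypothetical helper, and it returns an empty list on tiers that do not include `score_breakdown`.

```python
from typing import List


def list_improvements(response: dict) -> List[str]:
    """Collect "dimension: improvement" lines from a response's score_breakdown."""
    lines = []
    for dimension, detail in response.get("score_breakdown", {}).items():
        for item in detail.get("improvements", []):
            lines.append(f"{dimension}: {item}")
    return lines


response = {
    "score_breakdown": {
        "successMetrics": {
            "strengths": ["Asked about prior treatment history"],
            "improvements": ["Did not confirm injury date"],
        }
    }
}
print(list_improvements(response))  # ['successMetrics: Did not confirm injury date']
```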

Debugging Rewards

If rewards seem unexpected:

Check terminal_reason: Episode may have ended early
Review tool calls: Incorrect tool use causes low rewards
Consider scenario objectives: Some scenarios reward different behaviors

Use the /api/v1/compare endpoint to understand relative quality between two responses.
