Training Tips

These best practices will help you get the most out of your training.

Start Simple

Begin with Easy Scenarios

Start with scenarios that have:
  • Fewer tools (or no tools)
  • Shorter expected conversations
  • Clearer success criteria
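
One way to apply these criteria is to sort your candidate scenarios easiest-first before training. The sketch below assumes each scenario is a dict carrying hypothetical `num_tools` and `expected_turns` fields. Neither is a documented field, so substitute whatever metadata you actually have.

```python
def order_by_simplicity(scenarios):
    """Return scenarios sorted easiest-first.

    Assumes hypothetical metadata keys `num_tools` and `expected_turns`;
    a missing value is treated as 0.
    """
    return sorted(
        scenarios,
        key=lambda s: (s.get("num_tools", 0), s.get("expected_turns", 0)),
    )


# Illustrative scenario records (IDs and fields are made up)
scenarios = [
    {"id": "refund-dispute", "num_tools": 3, "expected_turns": 8},
    {"id": "faq-lookup", "num_tools": 0, "expected_turns": 2},
    {"id": "order-status", "num_tools": 1, "expected_turns": 4},
]
ordered_ids = [s["id"] for s in order_by_simplicity(scenarios)]
```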

Mix Scenario Types

Don’t train on just one scenario. Mix scenarios from across your subscribed collections so the model builds robust capabilities instead of fitting a single scenario.
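
A minimal sketch of mixed sampling, assuming you keep a local mapping from collection name to scenario IDs (the mapping and the IDs are hypothetical, not something the API returns in this shape). Choosing the collection first keeps a large collection from dominating each batch.

```python
import random


def sample_mixed_batch(collections, batch_size, rng=None):
    """Sample scenario IDs, picking a collection uniformly for each draw."""
    rng = rng or random.Random()
    # Ignore empty collections so rng.choice never sees an empty list
    names = [name for name, ids in collections.items() if ids]
    if not names:
        raise ValueError("no scenarios available to sample from")
    batch = []
    for _ in range(batch_size):
        name = rng.choice(names)
        batch.append(rng.choice(collections[name]))
    return batch


# Illustrative collections (names and IDs are made up)
collections = {
    "support": ["support-1", "support-2", "support-3"],
    "sales": ["sales-1"],
}
batch = sample_mixed_batch(collections, batch_size=8, rng=random.Random(0))
```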

Working with Rewards

Use Rewards Directly

Rewards are already in the range [0, 1] and calibrated per-scenario. Use them directly without modification:
# Use rewards as-is
reward = result["reward"]  # Already in [0, 1]
Different scenarios may have different reward distributions. This is intentional: each scenario has its own calibrated scoring criteria.
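
Because distributions differ by scenario, it can help to monitor reward statistics per scenario rather than as one global average. The tracker below is a sketch for monitoring only: it leaves the reward you train on untouched. The class name and the scenario IDs are illustrative.

```python
class RewardTracker:
    """Track per-scenario reward means for monitoring (not for rescaling)."""

    def __init__(self):
        self._count = {}
        self._total = {}

    def record(self, scenario_id, reward):
        self._count[scenario_id] = self._count.get(scenario_id, 0) + 1
        self._total[scenario_id] = self._total.get(scenario_id, 0.0) + reward

    def mean(self, scenario_id):
        """Return the mean reward for a scenario, or None if never seen."""
        count = self._count.get(scenario_id, 0)
        return self._total[scenario_id] / count if count else None


tracker = RewardTracker()
tracker.record("scenario-a", 0.25)
tracker.record("scenario-a", 0.75)
tracker.record("scenario-b", 1.0)
```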

Curriculum Learning with reward_config

Instead of computing custom rewards client-side, use reward_config to control which scorers run server-side. Scorers with zero weight are skipped entirely, saving LLM tokens and reducing latency.
# Phase 1: Discovery only (learn to ask the right questions)
phase1_config = {
    "quality_weights": {"discovery": 1.0},
    "skip_safety_gate": True,
}

# Phase 2: Discovery + quality (learn to respond well)
phase2_config = {
    "quality_weights": {
        "discovery": 0.5,
        "success_metrics": 0.3,
        "rubrics": 0.2,
    },
    "skip_safety_gate": True,
}

# Phase 3: Full scoring (no reward_config - use defaults)
phase3_config = None

# Use in turn submission (switch current_phase_config as training progresses)
current_phase_config = phase1_config
result = submit_turn(episode_id, messages, reward_config=current_phase_config)
How it works:
  • When quality_weights is supplied, any key you omit defaults to 0 rather than to its normal default weight
  • Only scorers with weight > 0 execute, so you save tokens
  • skip_safety_gate: True bypasses the scope and terminology checks (useful when focusing on a single dimension)
  • Omit reward_config entirely to use the full default scoring
With skip_safety_gate enabled, the model can drift out of scope without penalty. Re-enable the safety gate periodically to catch regressions.
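
The phases and the periodic safety-gate check can be combined in one small scheduling function. This is a sketch under assumptions: the step boundaries and the check interval are arbitrary illustrative values, and it assumes that setting skip_safety_gate to False re-enables the gate.

```python
# (first training step of the phase, reward_config); boundaries are illustrative
PHASES = [
    (0, {"quality_weights": {"discovery": 1.0}, "skip_safety_gate": True}),
    (1000, {
        "quality_weights": {"discovery": 0.5, "success_metrics": 0.3, "rubrics": 0.2},
        "skip_safety_gate": True,
    }),
    (2000, None),  # full default scoring
]
GATE_CHECK_EVERY = 50  # assumed interval for re-enabling the safety gate


def reward_config_for_step(step):
    """Return the reward_config to use at a given training step."""
    config = None
    for start, phase_config in PHASES:
        if step >= start:
            config = phase_config
    if config is None:
        return None  # final phase: omit reward_config entirely
    config = dict(config)  # copy so the shared phase definition is not mutated
    if step % GATE_CHECK_EVERY == 0:
        # Assumption: False re-enables the scope and terminology checks
        config["skip_safety_gate"] = False
    return config
```

You would then pass `reward_config_for_step(step)` as the `reward_config` argument on each turn submission.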

Custom Rewards from Score Dimensions

You can also construct custom rewards client-side from the returned score dimensions. Dimension names are scenario-specific, and the examples below use placeholder names, so check the actual scores keys returned for your scenario:
# Get score breakdown (dimension names vary by scenario)
scores = result.get("scores", {})

# Option 1: Train on a single dimension
empathy_reward = scores.get("D1_Empathy", 0)

# Option 2: Weight dimensions differently
custom_reward = (
    0.5 * scores.get("D1_Empathy", 0) +
    0.3 * scores.get("D2_Accuracy", 0) +
    0.2 * scores.get("D3_Efficiency", 0)
)

# Option 3: Exclude a mastered dimension to focus on others
focused_reward = (
    scores.get("D2_Accuracy", 0) +
    scores.get("D3_Efficiency", 0)
) / 2
Ignoring dimensions may cause regression in those areas. Monitor all scores even when training on a subset.
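
One way to monitor every dimension is to compare current mean scores against a saved baseline and flag any that dropped. The sketch below uses the same placeholder dimension names as above. The 0.05 tolerance is an arbitrary assumption, so tune it to the noise in your own scores.

```python
def find_regressions(baseline, current, tolerance=0.05):
    """Return {dimension: (baseline, current)} for scores that fell
    by more than `tolerance`. Dimensions missing from `current` are skipped."""
    regressions = {}
    for name, base in baseline.items():
        now = current.get(name)
        if now is not None and base - now > tolerance:
            regressions[name] = (base, now)
    return regressions


# Mean scores before and after training on a subset (made-up numbers)
baseline = {"D1_Empathy": 0.80, "D2_Accuracy": 0.60, "D3_Efficiency": 0.50}
current = {"D1_Empathy": 0.65, "D2_Accuracy": 0.70, "D3_Efficiency": 0.48}
regressed = find_regressions(baseline, current)
```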

Handling Multi-Turn Conversations

Track Context Locally

The API returns only one message per turn. Your client must maintain the full conversation history:
# Initialize with the initial observation from episode creation
conversation = [episode["initial_observation"]]

# After each turn, append both your response and the new observation
conversation.append({"role": "assistant", "content": model_response})
conversation.append(result["observation"])
See the Episodes concept for details on conversation tracking.
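
Put together, a full episode loop might look like the sketch below. Several details are assumptions rather than documented behavior: that `submit_turn` receives the whole conversation, that the episode dict exposes an `episode_id` key, and that the result carries a `done` flag. The two `fake_*` functions are stand-ins so the example runs without the real API.

```python
def run_episode(episode, generate, submit_turn, max_turns=20):
    """Drive one episode while keeping the full history client-side.

    generate(conversation) -> reply text from your model.
    submit_turn(episode_id, messages) -> dict with "observation" and,
    as an assumption here, a "done" flag.
    """
    conversation = [episode["initial_observation"]]
    result = None
    for _ in range(max_turns):
        reply = generate(conversation)
        conversation.append({"role": "assistant", "content": reply})
        result = submit_turn(episode["episode_id"], conversation)
        if result.get("observation") is not None:
            conversation.append(result["observation"])
        if result.get("done"):
            break
    return conversation, result


def fake_generate(conversation):
    # Stand-in for a model call
    return f"reply {len(conversation)}"


def fake_submit_turn(episode_id, messages):
    # Stand-in for the API: ends the episode after two assistant turns
    turn = sum(1 for m in messages if m["role"] == "assistant")
    return {
        "observation": {"role": "user", "content": f"obs {turn}"},
        "done": turn >= 2,
    }


episode = {
    "episode_id": "ep-1",
    "initial_observation": {"role": "user", "content": "hello"},
}
conversation, last = run_episode(episode, fake_generate, fake_submit_turn)
```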

Debugging Training

Log Episode Trajectories

Save trajectories for debugging:
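import json
import os

def log_episode(episode_id: str, trajectory: dict, log_dir: str = "logs"):
    """Write one episode's trajectory to <log_dir>/episode_<episode_id>.json."""
    # open() fails if the directory does not exist, so create it first
    os.makedirs(log_dir, exist_ok=True)
    path = os.path.join(log_dir, f"episode_{episode_id}.json")
    with open(path, "w") as f:
        json.dump({
            "episode_id": episode_id,
            "scenario": trajectory["scenario"],
            "turns": trajectory["turns"],
            "total_reward": trajectory["total_reward"],
            "terminal_reason": trajectory["terminal_reason"],
        }, f, indent=2)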
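
Once trajectories are on disk, a short summary pass makes patterns easier to spot. This sketch assumes the log files follow the `episode_<id>.json` naming and the `total_reward` and `terminal_reason` keys written above. The function name is illustrative.

```python
import json
from collections import Counter
from pathlib import Path


def summarize_logs(log_dir="logs"):
    """Aggregate episode count, mean reward, and terminal reasons."""
    reasons = Counter()
    rewards = []
    for path in sorted(Path(log_dir).glob("episode_*.json")):
        with open(path) as f:
            record = json.load(f)
        reasons[record["terminal_reason"]] += 1
        rewards.append(record["total_reward"])
    mean_reward = sum(rewards) / len(rewards) if rewards else None
    return {
        "episodes": len(rewards),
        "mean_reward": mean_reward,
        "terminal_reasons": dict(reasons),
    }
```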

Check for Common Issues

Symptom            | Likely Cause             | Solution
Rewards stuck at 0 | Model not investigating  | Check conversation flow
All low rewards    | Making harmful decisions | Review tool usage
Quick termination  | Critical errors          | Log and review trajectories
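
These symptoms can be checked automatically over a batch of logged trajectories. The sketch below is a heuristic, not an official diagnostic: it assumes `turns` is a list, and the thresholds (0.2 for "low", fewer than 2 turns for "quick", more than half the batch) are arbitrary choices to adjust for your scenarios.

```python
def diagnose(trajectories, low_reward=0.2, min_turns=2):
    """Return the symptoms from the table that a batch of trajectories shows."""
    if not trajectories:
        return []
    rewards = [t["total_reward"] for t in trajectories]
    symptoms = []
    if all(r == 0 for r in rewards):
        symptoms.append("rewards stuck at 0")
    elif all(r < low_reward for r in rewards):
        symptoms.append("all low rewards")
    # Count episodes that ended before reaching min_turns
    short = sum(1 for t in trajectories if len(t["turns"]) < min_turns)
    if short > len(trajectories) / 2:
        symptoms.append("quick termination")
    return symptoms
```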