Training Tips

These best practices will help you get the most out of your training.

Start Simple

Begin with Easy Scenarios

Start with scenarios that have:

Fewer tools (or no tools)
Shorter expected conversations
Clearer success criteria
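One way to apply these criteria is to rank candidate scenarios before training starts. The sketch below assumes each scenario is described by a dict with the hypothetical fields `num_tools` and `expected_turns`; these names are illustrative and are not part of the API. Clarity of success criteria is hard to quantify, so it is left to manual review here.

```python
from typing import Dict, List


def rank_by_simplicity(scenarios: List[Dict]) -> List[Dict]:
    """Order scenarios from simplest to hardest.

    Assumes hypothetical metadata fields `num_tools` and `expected_turns`;
    substitute whatever metadata you actually have for your scenarios.
    """
    # Fewer tools first, then shorter expected conversations.
    return sorted(scenarios, key=lambda s: (s["num_tools"], s["expected_turns"]))


# Illustrative scenario metadata (made-up IDs and values).
candidates = [
    {"id": "billing_dispute", "num_tools": 4, "expected_turns": 12},
    {"id": "faq_lookup", "num_tools": 0, "expected_turns": 3},
    {"id": "order_status", "num_tools": 1, "expected_turns": 5},
]
ordered = rank_by_simplicity(candidates)
```

Training can then begin with the first few entries of `ordered` and widen as rewards improve.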

Mix Scenario Types

Don't train on just one scenario. Mix scenarios from across your subscribed collections so the model builds capabilities that hold up across tasks.
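A minimal sketch of one mixing strategy: pick a collection uniformly at random, then pick a scenario within it, so that a large collection does not dominate the batch. The collection names and scenario IDs are placeholders, and `sample_mixed_batch` is a hypothetical helper, not an API function.

```python
import random
from typing import Dict, List


def sample_mixed_batch(
    collections: Dict[str, List[str]],
    batch_size: int,
    rng: random.Random,
) -> List[str]:
    """Sample scenario IDs, choosing a collection first and a scenario second."""
    # Ignore collections with no scenarios so rng.choice never sees an empty list.
    names = sorted(name for name, ids in collections.items() if ids)
    if not names:
        raise ValueError("no scenarios available to sample from")
    batch = []
    for _ in range(batch_size):
        collection = rng.choice(names)
        batch.append(rng.choice(collections[collection]))
    return batch


# Placeholder collections and scenario IDs.
subscribed = {
    "support": ["support_01", "support_02", "support_03"],
    "sales": ["sales_01"],
}
batch = sample_mixed_batch(subscribed, batch_size=8, rng=random.Random(0))
```

Passing an explicit `random.Random` instance keeps the sampling reproducible across runs.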

Working with Rewards

Use Rewards Directly

Rewards are already in the range [0, 1] and calibrated per-scenario. Use them directly without modification:

# Use rewards as-is
reward = result["reward"]  # Already in [0, 1]

Different scenarios may have different reward distributions. This is intentional: each scenario has its own calibrated scoring criteria.
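Because distributions differ by scenario, it can help to track reward statistics per scenario for monitoring while still training on the unmodified reward. The sketch below is one possible way to do that; `RewardTracker` and the scenario IDs are hypothetical.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List


class RewardTracker:
    """Collect rewards per scenario for monitoring only (rewards are not altered)."""

    def __init__(self) -> None:
        self._rewards: Dict[str, List[float]] = defaultdict(list)

    def record(self, scenario_id: str, reward: float) -> None:
        self._rewards[scenario_id].append(reward)

    def summary(self) -> Dict[str, Dict[str, float]]:
        """Return count, mean, min, and max for each scenario seen so far."""
        return {
            sid: {"n": len(r), "mean": mean(r), "min": min(r), "max": max(r)}
            for sid, r in self._rewards.items()
        }


tracker = RewardTracker()
tracker.record("scenario_a", 0.25)
tracker.record("scenario_a", 0.75)
tracker.record("scenario_b", 1.0)
stats = tracker.summary()
```

Comparing a scenario against its own history is more informative than comparing raw means across scenarios.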

Curriculum Learning with reward_config

Instead of computing custom rewards client-side, use reward_config to control which scorers run server-side. Scorers with zero weight are skipped entirely, saving LLM tokens and reducing latency.

# Phase 1: Discovery only (learn to ask the right questions)
phase1_config = {
    "quality_weights": {"discovery": 1.0},
    "skip_safety_gate": True,
}

# Phase 2: Discovery + quality (learn to respond well)
phase2_config = {
    "quality_weights": {
        "discovery": 0.5,
        "success_metrics": 0.3,
        "rubrics": 0.2,
    },
    "skip_safety_gate": True,
}

# Phase 3: Full scoring (no reward_config - use defaults)
phase3_config = None

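# Select the config for the current phase, then pass it with each turn
current_phase_config = phase1_config  # switch to phase2_config, then phase3_config
result = submit_turn(episode_id, messages, reward_config=current_phase_config)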

How it works:

Keys omitted from quality_weights default to 0, not to their usual default weight
Only scorers with weight > 0 execute, which saves tokens
skip_safety_gate: true (True in Python) bypasses the scope and terminology checks, which is useful when focusing on a single dimension
Omit reward_config entirely to use the full default scoring

When using skip_safety_gate, the model can drift out of scope without penalty. Re-enable the safety gate periodically to catch regressions.
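The phase schedule and the periodic safety-gate check can be combined in one helper. This is a sketch under stated assumptions: the step thresholds and the `gate_every` interval are arbitrary illustrative values, `reward_config_for_step` is a hypothetical function, and it assumes that passing `skip_safety_gate` as False is accepted and behaves like leaving the gate enabled.

```python
from typing import Optional

PHASE1 = {"quality_weights": {"discovery": 1.0}, "skip_safety_gate": True}
PHASE2 = {
    "quality_weights": {"discovery": 0.5, "success_metrics": 0.3, "rubrics": 0.2},
    "skip_safety_gate": True,
}


def reward_config_for_step(
    step: int,
    phase1_steps: int = 1000,   # illustrative threshold, tune for your run
    phase2_steps: int = 2000,   # illustrative threshold, tune for your run
    gate_every: int = 10,       # re-enable the safety gate on every Nth step
) -> Optional[dict]:
    """Return the reward_config for a training step (None means full default scoring)."""
    if step < phase1_steps:
        base = PHASE1
    elif step < phase1_steps + phase2_steps:
        base = PHASE2
    else:
        return None  # Phase 3: omit reward_config entirely
    if step % gate_every == 0:
        # Same weights, but with the safety gate back on to catch scope drift.
        # Assumption: an explicit False is equivalent to not skipping the gate.
        return {**base, "skip_safety_gate": False}
    return base
```

The returned value can be passed as the `reward_config` argument on each turn. Building a new dict for the gated steps leaves the shared phase configs unmodified.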

Custom Rewards from Score Dimensions

You can also construct custom rewards client-side from the returned score dimensions:

Dimension names are scenario-specific. The examples below use placeholder names. Check the actual scores keys returned for your scenario.

# Get score breakdown (dimension names vary by scenario)
scores = result.get("scores", {})

# Option 1: Train on a single dimension
empathy_reward = scores.get("D1_Empathy", 0)

# Option 2: Weight dimensions differently
custom_reward = (
    0.5 * scores.get("D1_Empathy", 0) +
    0.3 * scores.get("D2_Accuracy", 0) +
    0.2 * scores.get("D3_Efficiency", 0)
)

# Option 3: Exclude a mastered dimension to focus on others
focused_reward = (
    scores.get("D2_Accuracy", 0) +
    scores.get("D3_Efficiency", 0)
) / 2

Ignoring dimensions may cause regression in those areas. Monitor all scores even when training on a subset.
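One way to monitor every dimension while training on a subset is to keep a rolling average per dimension and flag any that falls noticeably below its best level. The sketch below is illustrative; `DimensionMonitor`, the window size, and the tolerance are assumptions, and the dimension names are the same placeholders used above.

```python
from collections import defaultdict, deque
from statistics import mean
from typing import Dict, List


class DimensionMonitor:
    """Flag score dimensions whose rolling average drops below their best level."""

    def __init__(self, window: int = 50, tolerance: float = 0.1) -> None:
        self.window = window
        self.tolerance = tolerance
        self._recent = defaultdict(lambda: deque(maxlen=window))
        self._best: Dict[str, float] = {}

    def update(self, scores: Dict[str, float]) -> List[str]:
        """Record one episode's scores and return the names of regressed dimensions."""
        regressed = []
        for name, value in scores.items():
            recent = self._recent[name]
            recent.append(value)
            if len(recent) < self.window:
                continue  # not enough data for a stable average yet
            average = mean(recent)
            self._best[name] = max(self._best.get(name, average), average)
            if average < self._best[name] - self.tolerance:
                regressed.append(name)
        return regressed


monitor = DimensionMonitor(window=2, tolerance=0.1)
monitor.update({"D1_Empathy": 0.9, "D2_Accuracy": 0.5})
monitor.update({"D1_Empathy": 0.9, "D2_Accuracy": 0.5})
flagged = monitor.update({"D1_Empathy": 0.3, "D2_Accuracy": 0.5})
```

Feed the monitor the full `scores` dict on every episode, including dimensions excluded from the training reward.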

Handling Multi-Turn Conversations

Track Context Locally

The API returns only one message per turn. Your client must maintain the full conversation history:

# Initialize with the initial observation from episode creation
conversation = [episode["initial_observation"]]

# After each turn, append both your response and the new observation
conversation.append({"role": "assistant", "content": model_response})
conversation.append(result["observation"])

See the Episodes concept for details on conversation tracking.
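Put together, an episode loop might look like the sketch below. It is deliberately abstract: `generate` stands in for your model, `submit` stands in for whatever wrapper you write around turn submission, and treating a missing `observation` as the end of the episode is an assumption rather than documented behavior.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]


def run_episode(
    initial_observation: Message,
    generate: Callable[[List[Message]], str],   # hypothetical: model call
    submit: Callable[[str], dict],              # hypothetical: wraps turn submission
    max_turns: int = 20,
) -> List[Message]:
    """Run one episode while keeping the full conversation history locally."""
    conversation = [initial_observation]
    for _ in range(max_turns):
        reply = generate(conversation)
        result = submit(reply)
        # Append both the model's response and the new observation.
        conversation.append({"role": "assistant", "content": reply})
        observation = result.get("observation")
        if observation is None:
            break  # assumption: no observation means the episode has ended
        conversation.append(observation)
    return conversation
```

The `max_turns` cap guards against an episode that never signals completion.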

Debugging Training

Log Episode Trajectories

Save trajectories for debugging:

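import json
from pathlib import Path

def log_episode(episode_id: str, trajectory: dict, log_dir: str = "logs") -> None:
    """Write one episode's trajectory to <log_dir>/episode_<episode_id>.json."""
    directory = Path(log_dir)
    directory.mkdir(parents=True, exist_ok=True)  # create the directory on first use
    with open(directory / f"episode_{episode_id}.json", "w") as f:
        json.dump({
            "episode_id": episode_id,
            "scenario": trajectory["scenario"],
            "turns": trajectory["turns"],
            "total_reward": trajectory["total_reward"],
            "terminal_reason": trajectory["terminal_reason"],
        }, f, indent=2)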

Check for Common Issues

Symptom	Likely Cause	Solution
Rewards stuck at 0	Model not investigating	Check conversation flow
All low rewards	Making harmful decisions	Review tool usage
Quick termination	Critical errors	Log and review trajectories
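The symptoms in the table can be checked automatically over a batch of logged trajectories. The sketch below assumes each trajectory has a numeric `total_reward` and a `turns` list, as in the logging example; `flag_symptoms`, the flag names, and both thresholds are illustrative assumptions.

```python
from typing import Dict, List


def flag_symptoms(
    trajectories: List[Dict],
    low_reward: float = 0.2,   # illustrative cutoff for "low"
    quick_turns: int = 2,      # illustrative cutoff for "quick"
) -> List[str]:
    """Return the names of symptoms present in a batch of trajectories."""
    if not trajectories:
        return []
    rewards = [t["total_reward"] for t in trajectories]
    flags = []
    if all(r == 0 for r in rewards):
        flags.append("rewards_stuck_at_zero")
    elif max(rewards) < low_reward:
        flags.append("all_low_rewards")
    # Flag quick termination when more than half the episodes end early.
    quick = [t for t in trajectories if len(t["turns"]) <= quick_turns]
    if len(quick) > len(trajectories) / 2:
        flags.append("quick_termination")
    return flags
```

A flag only indicates where to look; reviewing the logged trajectories is still needed to find the cause.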
