Training Tips
These best practices will help you get the most out of your training.Start Simple
Begin with Easy Scenarios
Start with scenarios that have:- Fewer tools (or no tools)
- Shorter expected conversations
- Clearer success criteria
Mix Scenario Types
Don’t train on just one scenario—mix scenarios from across your subscribed collections to build robust capabilities.Working with Rewards
Use Rewards Directly
Rewards are already in the range [0, 1] and calibrated per-scenario. Use them directly without modification:Different scenarios may have different reward distributions. This is intentional—each scenario has its own calibrated scoring criteria.
Curriculum Learning with reward_config
Instead of computing custom rewards client-side, usereward_config to control which scorers run server-side. Scorers with zero weight are skipped entirely, saving LLM tokens and reducing latency.
- Omitted
quality_weightskeys default to 0 (not the normal default) - Only scorers with weight > 0 execute, so you save tokens
skip_safety_gate: truebypasses scope and terminology checks (useful when focusing on a single dimension)- Omit
reward_configentirely to use the full default scoring
Custom Rewards from Score Dimensions
You can also construct custom rewards client-side from the returned score dimensions:Dimension names are scenario-specific. The examples below use placeholder names. Check the actual
scores keys returned for your scenario.Handling Multi-Turn Conversations
Track Context Locally
The API returns only one message per turn. Your client must maintain the full conversation history:See the Episodes concept for details on conversation tracking.
Debugging Training
Log Episode Trajectories
Save trajectories for debugging:Check for Common Issues
| Symptom | Likely Cause | Solution |
|---|---|---|
| Rewards stuck at 0 | Model not investigating | Check conversation flow |
| All low rewards | Making harmful decisions | Review tool usage |
| Quick termination | Critical errors | Log and review trajectories |