Architecture

This page explains the architecture and data flow between your training infrastructure and the Labs reward server.

Architecture Overview

Data Flow

What We Send You

Each turn, we return a single message: the simulated persona's response. Earlier turns are not echoed back, so your client must track conversation history locally and include all prior messages when submitting each turn. A typical turn response looks like this:
{
  "turn": 2,
  "observation": {
    "role": "assistant",
    "content": "I've been having this pain in my lower back..."
  },
  "reward": 0.75,
  "scores": {
    "D1_InformationGathering": 0.8,
    "D2_GoalProgress": 0.7
  },
  "episode_complete": false
}
The scores field is optional. When present, it breaks the reward down into individual scoring dimensions; see Training Tips for how to use them.
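Because only the persona's latest message comes back each turn, the client has to fold it into its own history. The following is a minimal sketch of that bookkeeping, using only the fields shown in the example response. The helper name `apply_turn_result` and the shape of `history` (a plain list of message dicts) are assumptions made for illustration, not part of the API.

```python
import json

def apply_turn_result(history, raw_response):
    """Append the returned observation to local history and extract the training signal."""
    result = json.loads(raw_response)
    # The observation is the only message the server returns for this turn.
    history.append(result["observation"])
    return {
        "reward": result["reward"],
        # "scores" is optional, so fall back to an empty dict when it is absent.
        "scores": result.get("scores", {}),
        "done": result["episode_complete"],
    }

# Example response without the optional "scores" field.
raw = json.dumps({
    "turn": 2,
    "observation": {"role": "assistant", "content": "I've been having this pain in my lower back..."},
    "reward": 0.75,
    "episode_complete": False,
})
history = []
signal = apply_turn_result(history, raw)
print(signal["reward"], signal["done"], len(history))
```

Reading `scores` with a default keeps the training loop from failing on responses that omit it.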

What We Never Send

  • Scoring criteria
  • Scenario objectives
  • Optimal action sequences
  • Expected conversation paths
Withholding this information means your model has to learn to investigate rather than memorize.

Integration Patterns
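The snippets in this section call a `post` helper that this page does not define. One possible sketch, using only the Python standard library, is shown here. It assumes the endpoints accept and return JSON; the `BASE_URL` value is a placeholder, and any authentication headers your deployment requires are not covered by this page and would need to be passed in via `headers`.

```python
import json
import urllib.request

BASE_URL = "https://labs.example.com"  # hypothetical placeholder; use your server's address

def build_request(path, payload, base_url=BASE_URL, headers=None):
    """Build a JSON POST request without sending it."""
    body = json.dumps(payload).encode("utf-8")
    all_headers = {"Content-Type": "application/json"}
    all_headers.update(headers or {})  # e.g. credentials, if your deployment needs them
    return urllib.request.Request(base_url + path, data=body, headers=all_headers, method="POST")

def post(path, payload, **kwargs):
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(build_request(path, payload, **kwargs), timeout=30) as resp:
        return json.load(resp)

# Building the request is side-effect free, so it can be inspected offline.
req = build_request("/api/v1/episodes", {"collection_slug": "...", "scenario_slug": "..."})
print(req.get_method(), req.full_url)
```

Splitting request construction from sending makes the helper easy to unit test without a live server.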

Pattern 1: Episode-Based (Multi-Turn)

Best for: Dialogue agents, investigative assistants
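# POST /api/v1/episodes → Submit turns → Collect rewards
episode = post("/api/v1/episodes", {"collection_slug": "...", "scenario_slug": "..."})
episode_id = episode["episode_id"]  # assumed field name; check the create-episode response
history = []  # the client keeps the full conversation; the server returns one message per turn
done = False
while not done:
    history.append(model.generate(history))  # model returns its next message as a dict
    result = post(f"/api/v1/episodes/{episode_id}/turns", {"messages": history})
    history.append(result["observation"])  # simulated persona's reply
    train_on_reward(result["reward"])
    done = result["episode_complete"]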

Pattern 2: Single-Shot Evaluation

Best for: Offline evaluation of complete conversations
# POST /api/v1/evaluate → Get single reward
result = post("/api/v1/evaluate", {
    "collection_slug": "...",
    "scenario_slug": "...",
    "messages": [...]
})
reward = result["reward"]

Pattern 3: Batch Evaluation

Best for: High-throughput scoring (TRL, OpenRLHF)
# POST /api/v1/batch/evaluate → Get rewards for all
result = post("/api/v1/batch/evaluate", {"items": [...]})
rewards = [r["reward"] for r in result["results"]]
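For large rollouts it can help to split the items into fixed-size chunks and check that every item received a reward. The sketch below does that under two assumptions that this page does not confirm: that `results` contains one entry per submitted item in the same order, and that each item is a dict with a `messages` list. The `score_in_batches` helper, the `batch_size` default, and the `fake_post` stand-in are all hypothetical.

```python
def score_in_batches(post, items, batch_size=32):
    """Score items in fixed-size chunks and return rewards in input order."""
    rewards = []
    for start in range(0, len(items), batch_size):
        chunk = items[start:start + batch_size]
        result = post("/api/v1/batch/evaluate", {"items": chunk})
        results = result["results"]
        # Assumption: one result per item, in the same order as submitted.
        if len(results) != len(chunk):
            raise ValueError(f"expected {len(chunk)} results, got {len(results)}")
        rewards.extend(r["reward"] for r in results)
    return rewards

# Stand-in for the real server so the sketch runs offline:
# it rewards each item with (number of messages) / 10.
def fake_post(path, payload):
    return {"results": [{"reward": len(item["messages"]) / 10} for item in payload["items"]]}

items = [{"messages": ["m"] * n} for n in range(1, 6)]
rewards = score_in_batches(fake_post, items, batch_size=2)
print(rewards)
```

Passing `post` in as an argument keeps the chunking logic independent of the HTTP client, so the same function works with a real client or a test double.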

Pattern 4: Pairwise Comparison

Best for: Preference learning (DPO)
# POST /api/v1/compare → Get preference signal
result = post("/api/v1/compare", {
    "collection_slug": "...",
    "scenario_slug": "...",
    "response_a": {"messages": [...]},
    "response_b": {"messages": [...]}
})
winner = result["winner"]  # 'a', 'b', or 'tie'
margin = result["margin"]  # reward difference (useful for DPO)
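To feed a comparison into preference training, the result has to be turned into a chosen/rejected pair. A minimal sketch, assuming you want to drop ties and optionally drop pairs whose margin is too small to be informative, is shown below. The `to_preference_pair` helper, the `min_margin` threshold, and the output keys are illustrative choices rather than part of the API. Since this page does not specify the sign of `margin`, the sketch uses its absolute value.

```python
def to_preference_pair(result, response_a, response_b, min_margin=0.0):
    """Map a compare result to a chosen/rejected pair, or None when there is no clear winner."""
    winner = result["winner"]
    margin = abs(result["margin"])  # sign convention is not documented here, so ignore it
    if winner == "tie" or margin < min_margin:
        return None
    if winner == "a":
        chosen, rejected = response_a, response_b
    else:
        chosen, rejected = response_b, response_a
    return {"chosen": chosen, "rejected": rejected, "margin": margin}

# Hypothetical conversations and compare results for illustration.
resp_a = {"messages": [{"role": "user", "content": "Where exactly does it hurt?"}]}
resp_b = {"messages": [{"role": "user", "content": "Take some rest."}]}

pair = to_preference_pair({"winner": "a", "margin": 0.3}, resp_a, resp_b, min_margin=0.1)
tie = to_preference_pair({"winner": "tie", "margin": 0.0}, resp_a, resp_b)
print(pair["chosen"] is resp_a, tie)
```

Filtering on the margin is optional; with `min_margin=0.0` every non-tie comparison yields a pair.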

Next Steps