Architecture
This page explains the architecture and data flow between your training infrastructure and the Labs reward server.
Architecture Overview
Data Flow
What We Send You
Each turn, we return a single message: the simulated persona's response. Your client must track conversation history locally and include all prior messages when submitting turns.
The scores field provides individual scoring dimensions; see Training Tips for how to use them.
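The history-tracking requirement above can be sketched as a small client-side helper. This is a minimal illustration, not the Labs client: the `EpisodeHistory` class, the `role`/`content` message shape, the role names `agent` and `persona`, and the `fake_send` transport are all assumptions made for the example. Replace `fake_send` with whatever call your integration uses to submit a turn.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Assumed message shape for illustration only; the real payload may differ.
Message = Dict[str, str]


@dataclass
class EpisodeHistory:
    """Client-side record of one conversation (hypothetical helper)."""

    messages: List[Message] = field(default_factory=list)

    def submit_turn(
        self, agent_text: str, send: Callable[[List[Message]], Message]
    ) -> Message:
        # Record the agent's new message before sending.
        self.messages.append({"role": "agent", "content": agent_text})
        # Send the FULL history, since the server returns only one message per turn.
        # A copy is passed so the transport cannot mutate the local record.
        reply = send(list(self.messages))
        # Store the persona's single reply so the next turn includes it.
        self.messages.append(reply)
        return reply


def fake_send(history: List[Message]) -> Message:
    """Stand-in for the real network call; reports how many messages it received."""
    return {"role": "persona", "content": f"received {len(history)} message(s)"}


episode = EpisodeHistory()
episode.submit_turn("Hello", fake_send)
episode.submit_turn("Can you tell me more?", fake_send)
print(len(episode.messages))  # 4: two agent messages and two persona replies
```

Keeping the history in one object per episode makes it harder to submit a turn without the earlier messages, which is the failure this requirement guards against.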
What We Never Send
- Scoring criteria
- Scenario objectives
- Optimal action sequences
- Expected conversation paths
Integration Patterns
Pattern 1: Episode-Based (Multi-Turn)
Best for: Dialogue agents, investigative assistants
Pattern 2: Single-Shot Evaluation
Best for: Offline evaluation of complete conversations
Pattern 3: Batch Evaluation
Best for: High-throughput scoring (TRL, OpenRLHF)
Pattern 4: Pairwise Comparison
Best for: Preference learning (DPO)
Next Steps
- Quickstart Guide — Get running in 5 minutes
- Episodes Concept — Deep dive on episode mechanics
- Integration Guides — Framework-specific setup