Custom Eval Infrastructure at With Him

Behavioral contracts, shadow traffic, and first-party launch gates

TL;DR

With Him does not rely on a generic LLM observability or eval SaaS as the source of truth for “good” assistant behavior. We run a single first-party policy engine that scores every assistant turn, offline against fixtures and online in production, and we feed those signals into telemetry, shadow comparisons, and launch gates tied to our rollout model.

That is not a statement that tools like LangSmith or Braintrust are bad. They are strong products for trace-centric debugging and flexible eval workflows. Our bet is that high-context, high-stakes conversational products need evals that are inseparable from product policy, traffic shaping, and safety intercepts - and that wiring those concerns through a generic middle layer would have duplicated effort without tightening the loop we actually ship on.

The Problem Space

With Him sits at a difficult intersection:

Warm, conversational support - tone and pacing matter as much as factual correctness.
Strict behavioral contracts - bounded questions, response length, when scripture may appear, early-turn discipline.
Safety you cannot get wrong - crisis language, escalation patterns, and intercept paths are part of the contract, not an afterthought.

We do not only care whether a response “scores well” in isolation. We care whether conversations continue well in production, whether contract violations cluster by risk category, and whether a new prompt version moves engagement or safety the wrong way relative to a baseline.

Why Generic Evals Break in High-Context Systems

“Generic” eval stacks are optimized for breadth: many models, many prompts, traces, human labels, LLM-as-judge, experiment tracking. That is the right tradeoff for a platform or a portfolio of unrelated features.

They tend to break down - or at least leave a large gap - when:

Quality is defined by domain-specific rules, not a rubric you can paste into a judge prompt. Turn index after the first user message, scripture timing, stabilization exceptions for high-risk turns, and “meaningful moment” heuristics are product law, not generic NLP metrics.
The eval must match production exactly. If CI scores responses with one implementation and production uses another, you get false confidence. We wanted one policy function invoked from the generation path and from offline runners.
Rollout is part of quality. Shadow traffic, stable bucketing, treatment vs control variants, and go/hold/no-go gates are release infrastructure. A trace UI can show you what happened; it does not automatically enforce which variant is allowed to ship.
Safety intercepts change the text. Evaluating only the raw model completion misses what users actually receive, because an intercept can replace that completion before it is served. Policy evaluation and replacement text need to stay in the same code path.
Latency, data residency, and cost matter when you evaluate every assistant turn and emit rich metadata. First-party evaluation keeps the hot path predictable and keeps sensitive analytical joins in the datastore we already govern.

None of that makes SaaS observability useless - you might still export traces elsewhere for deep dives. It means the authoritative contract and gate logic lived more naturally in-repo for us than as configuration spread across an external product.

Architecture: One Policy Engine, Two Surfaces

At the center is evaluatePromptV2Policy: deterministic scoring of assistant text given runtime context (risk level, turn index, long-form flags, last user message for mirror/heuristic signals, explicit scripture request, etc.). The same function backs offline fixtures and live generation.

Figure: evaluatePromptV2Policy as the single source of truth, surrounded by the paths that call it.

Single source of truth: evaluatePromptV2Policy. The same deterministic behavioral contract runs everywhere assistant quality matters. Contract dimensions: questions, length, scripture, safety, meaning.

Served turn (live assistant response): runtime policy + rollout decision; safety intercepts affect served text.
Offline eval (fixture gates): frozen JSON fixtures; strict CI thresholds.
Shadow path (non-blocking comparison): serve v1 while evaluating v2; shadow failure never blocks chat.
Live launch telemetry (go / hold / no_go gates): engagement + safety deltas; insufficient volume means hold.

Design principle: policy, telemetry fields, and gate inputs stay aligned because they share one implementation - not a spreadsheet of rules and a separate production branch.
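
As a rough illustration of the "one implementation" idea, here is a minimal sketch of a deterministic policy function. Only the name evaluatePromptV2Policy and the kinds of inputs (risk level, turn index, explicit scripture request) come from the description above. The field names, the word budgets, the question-counting heuristic, and the treatment of the stabilization exception as a full exemption are assumptions made for the example, not the production contract.

```typescript
// Hypothetical shapes: field names are illustrative assumptions.
type RiskLevel = "low" | "medium" | "high" | "crisis";

interface PolicyContext {
  riskLevel: RiskLevel;
  turnIndex: number; // assistant turn index, assumed to start at 1
  userRequestedScripture: boolean;
  containsScripture: boolean; // assumed to be detected upstream
}

interface PolicyResult {
  questionCount: number;
  wordCount: number;
  questionCompliant: boolean;
  lengthCompliant: boolean;
  scriptureCompliant: boolean;
  passed: boolean;
}

// Assumed budgets, chosen only to make the example concrete.
const FIRST_THREE_WORD_BUDGET = 60;
const DEFAULT_WORD_BUDGET = 120;

function evaluatePromptV2Policy(text: string, ctx: PolicyContext): PolicyResult {
  const questionCount = (text.match(/\?/g) ?? []).length;
  const trimmed = text.trim();
  const wordCount = trimmed === "" ? 0 : trimmed.split(/\s+/).length;

  const inFirstThree = ctx.turnIndex <= 3;
  const stabilizing = ctx.riskLevel === "high" || ctx.riskLevel === "crisis";

  // At most one question, with an assumed exemption for high/crisis risk.
  const questionCompliant = stabilizing || questionCount <= 1;
  // Stricter length cap while the first-three policy applies.
  const budget = inFirstThree ? FIRST_THREE_WORD_BUDGET : DEFAULT_WORD_BUDGET;
  const lengthCompliant = wordCount <= budget;
  // No proactive scripture in early turns unless the user explicitly asked.
  const scriptureCompliant =
    !(inFirstThree && ctx.containsScripture && !ctx.userRequestedScripture);

  return {
    questionCount,
    wordCount,
    questionCompliant,
    lengthCompliant,
    scriptureCompliant,
    passed: questionCompliant && lengthCompliant && scriptureCompliant,
  };
}
```

Because the function is pure (text and context in, result out), an offline runner and the live generation path can call it with identical arguments and get identical scores.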

Shadow-Mode Workflow

Shadow mode is how we run the next prompt or policy behavior beside the live experience without changing what the user sees. Conceptually:

1. User -> Chat API: The user sends a message into the live chat path.
2. Chat API -> Rollout resolver: PROMPT_V2_MODE and the stable traffic bucket decide the served variant and whether shadow should run.
3. Rollout branch:
   Shadow treatment: Serve v1 while running v2 as runShadowVariant.
   All other modes: Serve the chosen variant only.
4. Chat API -> AI generation: generateFullAIResponse runs for the served variant and returns text plus policy output.
5. Chat API -> Messages + events: Persist the assistant message and emit prompt_v2_turn_evaluated for live telemetry.
6. Async shadow path: If enabled, v2 generation runs best-effort, emits prompt_v2_shadow_evaluated, and never changes the user-visible response.
7. Chat API -> User: Only the served response is returned to the user.
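
Steps 2 and 3 above can be sketched as a small resolver. PROMPT_V2_MODE and runShadowVariant are named in the workflow; the mode values ("off", "shadow", "split", "on"), the treatmentPercent parameter, the choice of hash, and the 100-bucket layout are all assumptions for illustration.

```typescript
// Hypothetical mode values for PROMPT_V2_MODE; the real set is not documented here.
type PromptV2Mode = "off" | "shadow" | "split" | "on";

interface RolloutDecision {
  servedVariant: "v1" | "v2";
  runShadowVariant: boolean;
  bucket: number;
}

// FNV-1a-style 32-bit hash over UTF-16 code units. The same key always maps
// to the same bucket, so a user stays in one arm across requests.
function stableBucket(key: string, buckets: number = 100): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < key.length; i++) {
    hash ^= key.charCodeAt(i);
    // Multiply by the FNV prime 16777619 using shifts, then wrap to 32 bits.
    hash =
      (hash + (hash << 1) + (hash << 4) + (hash << 7) + (hash << 8) + (hash << 24)) >>> 0;
  }
  return hash % buckets;
}

function resolveRollout(
  mode: PromptV2Mode,
  userId: string,
  treatmentPercent: number,
): RolloutDecision {
  const bucket = stableBucket(userId);
  const inTreatment = bucket < treatmentPercent;
  switch (mode) {
    case "on":
      return { servedVariant: "v2", runShadowVariant: false, bucket };
    case "split":
      return { servedVariant: inTreatment ? "v2" : "v1", runShadowVariant: false, bucket };
    case "shadow":
      // Shadow treatment: the user still gets v1; v2 runs on the side.
      return { servedVariant: "v1", runShadowVariant: inTreatment, bucket };
    default:
      return { servedVariant: "v1", runShadowVariant: false, bucket };
  }
}
```

Hashing a stable identifier rather than drawing a random number per request is what keeps treatment vs control comparisons clean: a conversation never flips variants mid-stream.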

Operational detail: shadow generation is non-blocking. If it fails, we log and move on - chat must never depend on shadow success. That choice favors availability and trust over perfect pairwise completeness in telemetry.
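
The non-blocking property can be sketched as follows. The event name prompt_v2_shadow_evaluated comes from the workflow above; the Generate and Emit signatures, the payload shape, and the helper names are assumptions, and the real generateFullAIResponse is stood in for by an injected function.

```typescript
interface TurnResult {
  text: string;
}

// Hypothetical stand-ins for the real generation and event-emission calls.
type Generate = (variant: "v1" | "v2", userMessage: string) => Promise<TurnResult>;
type Emit = (eventName: string, payload: { text: string }) => void;

// Schedules the task without awaiting it. Synchronous throws and rejected
// promises both end up in onError, so the caller never sees a failure.
function runShadowBestEffort(
  task: () => Promise<unknown>,
  onError: (err: unknown) => void,
): void {
  Promise.resolve().then(task).catch(onError);
}

async function handleTurn(
  userMessage: string,
  runShadowVariant: boolean,
  generate: Generate,
  emit: Emit,
  onShadowError: (err: unknown) => void,
): Promise<string> {
  const served = await generate("v1", userMessage);

  if (runShadowVariant) {
    runShadowBestEffort(async () => {
      const shadow = await generate("v2", userMessage);
      emit("prompt_v2_shadow_evaluated", { text: shadow.text });
    }, onShadowError);
  }

  // Only the served response is returned, whatever happens on the shadow path.
  return served.text;
}
```

The trade-off is the one stated above: a failed shadow run leaves a gap in the pairwise telemetry, but the served turn is never delayed or failed because of it.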

Rollout Gate Examples

We use two complementary gate styles: fixture gates (contract-heavy, for offline sign-off) and live telemetry gates (contract + engagement + safety deltas vs baseline).

1. Offline Fixture Gates: Contract-First

The offline runner aggregates policy outputs over a frozen JSON eval set and checks thresholds such as:

Gate concept	What it enforces
Question compliance	At most one question per response, with defined stabilization exceptions for high/crisis risk.
First-three word budget	Stricter length cap in early turns when first-three policy applies.
First-three scripture	No proactive scripture in early turns unless the user explicitly asked.
Meaningful moment rate	Heuristic coverage of identity encouragement, reframes, or grounded mirroring - not vanity length.
Severe safety intercept rate	Intercepts are rare but audited; threshold expectations are set explicitly for the fixture set.

The runner can emit JSON, Markdown, and CSV artifacts for review and --strict exit codes for CI.
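
A minimal sketch of how such a runner might aggregate per-turn policy outputs into gate results and a strict exit code. The row shape, gate names, and every threshold value below are assumptions for illustration; the real thresholds are set per fixture set.

```typescript
// Hypothetical per-fixture policy output; field names are assumptions.
interface FixturePolicyOutput {
  inFirstThree: boolean;
  questionCompliant: boolean;
  firstThreeScriptureOk: boolean;
  meaningfulMoment: boolean;
  severeSafetyIntercept: boolean;
}

interface GateResult {
  gate: string;
  value: number;
  threshold: number;
  passed: boolean;
}

// Share of rows matching the predicate. An empty set yields 0, so
// "at least" gates fail rather than pass vacuously.
function rate(
  rows: FixturePolicyOutput[],
  pick: (row: FixturePolicyOutput) => boolean,
): number {
  return rows.length === 0 ? 0 : rows.filter(pick).length / rows.length;
}

function atLeast(gate: string, value: number, threshold: number): GateResult {
  return { gate, value, threshold, passed: value >= threshold };
}

function atMost(gate: string, value: number, threshold: number): GateResult {
  return { gate, value, threshold, passed: value <= threshold };
}

function runFixtureGates(rows: FixturePolicyOutput[]): GateResult[] {
  const firstThree = rows.filter((row) => row.inFirstThree);
  // Threshold values are placeholders, not the production numbers.
  return [
    atLeast("question_compliance", rate(rows, (row) => row.questionCompliant), 0.95),
    atLeast("first_three_scripture", rate(firstThree, (row) => row.firstThreeScriptureOk), 1.0),
    atLeast("meaningful_moment_rate", rate(rows, (row) => row.meaningfulMoment), 0.5),
    atMost("severe_safety_intercept_rate", rate(rows, (row) => row.severeSafetyIntercept), 0.1),
  ];
}

// What a --strict run would hand to the process exit: 0 only if every gate passed.
function strictExitCode(results: GateResult[]): number {
  return results.every((result) => result.passed) ? 0 : 1;
}
```

Keeping the gate results as plain data makes it straightforward to serialize the same array to JSON, Markdown, or CSV for review.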

2. Live Telemetry Gates: Production-Shaped

Live gates combine event-derived policy summaries (for example question compliance and severe safety intercept rates from prompt_v2_turn_evaluated / related event types) with message-level engagement proxies - such as whether an assistant message was followed by a user reply, including turn-indexed continuation views.
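
The engagement proxy can be sketched as a reply-rate computation over ordered conversations. The Message shape is an assumption, as is the reading of "followed by a user reply" as "the very next message is from the user"; a production query would run against stored messages rather than in-memory arrays.

```typescript
type Role = "user" | "assistant";

// Hypothetical minimal message shape; real records carry more metadata.
interface Message {
  role: Role;
}

interface ReplyStats {
  assistantTurns: number;
  replied: number;
  rate: number;
}

// Counts assistant messages that are immediately followed by a user message.
// When assistantTurn is given, only the Nth assistant message (1-based) of
// each conversation is counted, which gives a turn-indexed continuation view.
function replyStats(conversations: Message[][], assistantTurn?: number): ReplyStats {
  let assistantTurns = 0;
  let replied = 0;

  conversations.forEach((messages) => {
    let seen = 0;
    messages.forEach((message, index) => {
      if (message.role !== "assistant") return;
      seen += 1;
      if (assistantTurn !== undefined && seen !== assistantTurn) return;
      assistantTurns += 1;
      const next = messages[index + 1];
      if (next !== undefined && next.role === "user") replied += 1;
    });
  });

  return {
    assistantTurns,
    replied,
    rate: assistantTurns === 0 ? 0 : replied / assistantTurns,
  };
}
```

Calling replyStats(conversations, 3) on treatment and baseline traffic separately would give the two numbers a turn-3 continuation check compares.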

Example check types:

Assistant-turn reply rate drop - treatment vs baseline; catches “compliant but cold” regressions.
Turn-3 continuation drop - early conversation health without overfitting a single scalar.
Severe safety intercept delta - treatment must not drift worse than baseline by an agreed margin.
Question compliance floor - treatment must still meet a minimum compliance rate at volume.

When sample size is insufficient, the system recommends hold rather than forcing go or no_go - that explicitly encodes statistical humility in release policy.
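
The decision logic, including the hold-on-low-volume rule, can be sketched as below. The go / hold / no_go outcomes and the four check types come from the text; the summary fields, config names, and check identifiers are assumptions, and a simple minimum-count rule stands in for whatever sample-size test is actually used.

```typescript
type Recommendation = "go" | "hold" | "no_go";

// Hypothetical per-variant summary derived from events and messages.
interface VariantSummary {
  assistantTurns: number;
  replyRate: number;
  turn3ContinuationRate: number;
  severeInterceptRate: number;
  questionComplianceRate: number;
}

// Hypothetical gate configuration; margins are agreed per launch.
interface GateConfig {
  minAssistantTurns: number;
  maxReplyRateDrop: number;
  maxTurn3Drop: number;
  maxSevereInterceptIncrease: number;
  minQuestionCompliance: number;
}

interface GateDecision {
  recommendation: Recommendation;
  failedChecks: string[];
}

function evaluateLiveGates(
  treatment: VariantSummary,
  baseline: VariantSummary,
  cfg: GateConfig,
): GateDecision {
  // Not enough volume on either arm: recommend hold instead of guessing.
  if (
    treatment.assistantTurns < cfg.minAssistantTurns ||
    baseline.assistantTurns < cfg.minAssistantTurns
  ) {
    return { recommendation: "hold", failedChecks: ["insufficient_volume"] };
  }

  const failedChecks: string[] = [];
  if (baseline.replyRate - treatment.replyRate > cfg.maxReplyRateDrop) {
    failedChecks.push("assistant_turn_reply_rate_drop");
  }
  if (baseline.turn3ContinuationRate - treatment.turn3ContinuationRate > cfg.maxTurn3Drop) {
    failedChecks.push("turn3_continuation_drop");
  }
  if (treatment.severeInterceptRate - baseline.severeInterceptRate > cfg.maxSevereInterceptIncrease) {
    failedChecks.push("severe_safety_intercept_delta");
  }
  if (treatment.questionComplianceRate < cfg.minQuestionCompliance) {
    failedChecks.push("question_compliance_floor");
  }

  return { recommendation: failedChecks.length === 0 ? "go" : "no_go", failedChecks };
}
```

Checking volume before any delta means a noisy early read can never produce a go or a no_go on its own.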

Why We Did Not Anchor on LangSmith or Braintrust for This Layer

Again: these are capable products used by serious teams. For With Him, the mismatch was shape, not quality of vendor engineering.

We would still have needed to encode theology-aware, crisis-aware, turn-indexed policies in our codebase for generation and intercepts.
We would still have needed to wire those policies to Mongo-backed events, message metadata, shadow traffic, and bucketing for treatment vs control.
We wanted policy + telemetry + gates as one first-party system so that “what we measure” and “what we ship” cannot drift apart unnoticed.

If we had adopted a generic eval platform as the authority, we risked two layers of truth: vendor-side experiments and in-repo behavior. We chose one source of truth and kept the surface area honest for our risk model.

Closing

The best eval stack is not always the most famous one. It is the one that matches your risk surface, release model, and definition of quality.

For With Him, custom infrastructure is what lets us move fast without moving carelessly - and keep behavioral contracts where they belong: first-class citizens of the system, not an external appendix.

This note reflects engineering priorities and architecture as implemented in the With Him server codebase. It is not a benchmark or endorsement statement about third-party products.