ADR-0019: Poll source experiments to auto-populate evaluation datasets

ACCEPTED

Author: Open Chat Studio · Created: 2026-05-31

Context

Evaluation datasets were one-shot — populated via manual session pick, filter import, or CSV upload — and went stale unless someone refreshed them. Teams wanted datasets that continuously absorbed new sessions from a source chatbot matching saved filter criteria. The two viable mechanisms were event-driven hooks (signals on session end and tag changes) or periodic polling. Hooks demand careful coupling to many call sites and still need a backstop for post-hoc tag changes that fire long after a session was created — CustomTaggedItem writes don't bump session.updated_at, so a cursor-based ingester would miss them.

The feature is gated by the existing flag_evaluations waffle flag; teams with the evaluations app get auto-population without a separate opt-in.

Decision

We will run a periodic Celery task (auto_populate_eval_datasets, every 5 minutes) that walks each enabled DatasetAutoPopulationRule, scans its source experiment for recent sessions matching the rule's filter, and appends matches to the parent EvaluationDataset.

  • DatasetAutoPopulationRule is a BaseTeamModel carrying the parent dataset, source experiment, filter query string, enabled flag, and per-rule run metadata: last_run_at, last_run_status (success / error / no_op), last_error, consecutive_failure_count.
  • Rules are restricted to session-mode datasets in v1; message-mode datasets are out of scope.
  • Each tick re-scans within a configurable lookback window (EVALUATIONS_AUTO_POPULATION_LOOKBACK_DAYS, default 30). The scan floor is MAX(rule.created_at, now() - lookback): created_at is the forward-only floor, and the lookback caps per-tick work.
  • No high-water-mark cursor is stored on the rule. Dedup is performed at scan time by excluding sessions whose id is already in dataset.messages. This ensures that sessions which gain a matching tag after creation are picked up on a later tick, as long as they still fall inside the scan window.
  • Each rule is processed inside its own transaction with select_for_update(skip_locked=True), so concurrent beat workers never double-process the same rule. Per-rule exceptions are caught and recorded in a fresh transaction outside the (potentially rolled-back) atomic block.
  • After three consecutive failures the rule is auto-disabled and an ocs_notifications notification fires.
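
The scan floor and scan-time dedup described above can be sketched in plain Python. This is a minimal illustration, not the actual task: the Session class, select_new_sessions, and the single required_tag filter are hypothetical stand-ins for the ORM models and the rule's filter query string, and treating the floor as inclusive is an assumption.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

LOOKBACK_DAYS = 30  # mirrors the EVALUATIONS_AUTO_POPULATION_LOOKBACK_DAYS default


@dataclass(frozen=True)
class Session:
    """Hypothetical stand-in for a session on the source experiment."""
    id: int
    created_at: datetime
    tags: frozenset


def scan_floor(rule_created_at, now, lookback_days=LOOKBACK_DAYS):
    """Return MAX(rule.created_at, now - lookback)."""
    return max(rule_created_at, now - timedelta(days=lookback_days))


def select_new_sessions(sessions, already_in_dataset, required_tag, floor):
    """Keep sessions at or after the floor that match the (simplified) filter
    and are not already in the dataset. No cursor is consulted."""
    return [
        s for s in sessions
        if s.created_at >= floor          # assumption: floor is inclusive
        and required_tag in s.tags        # stand-in for the rule's filter
        and s.id not in already_in_dataset
    ]


def utc(month, day):
    return datetime(2026, month, day, tzinfo=timezone.utc)


now = utc(6, 30)
rule_created = utc(6, 1)
floor = scan_floor(rule_created, now)  # rule is younger than 30 days, so created_at wins

sessions = [
    Session(1, utc(5, 20), frozenset({"qa"})),  # older than the rule: skipped
    Session(2, utc(6, 10), frozenset({"qa"})),  # already in the dataset: skipped
    Session(3, utc(6, 12), frozenset({"qa"})),  # new match
    Session(4, utc(6, 15), frozenset()),        # no matching tag: skipped
]
picked = select_new_sessions(sessions, already_in_dataset={2}, required_tag="qa", floor=floor)
print([s.id for s in picked])  # [3]
```

Because the selection depends only on the current tags and the current dataset contents, re-running it on every tick is idempotent: a session is appended at most once, whenever it first matches.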

Consequences

  • Ingestion latency is bounded by the 5-minute beat cadence; near-real-time ingestion is out of scope.
  • Re-scan cost grows linearly with the lookback window. The 30-day default trades freshness against scan cost; tag changes older than the window will not be picked up.
  • The forward-only floor means rules never absorb sessions older than the rule itself — backfilling historical traffic still requires a manual import.
  • Auto-disable + notification surfaces broken rules without paging an operator.
  • Storing per-tick run state on the rule makes operational status visible directly on the dataset detail page, so no separate audit log is needed.
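
The failure tracking and auto-disable behaviour can be illustrated with a small in-memory sketch. The field names follow the rule metadata listed in the Decision section; RuleState, run_rule, and the ingest and notify callables are hypothetical. Resetting the counter on a non-failing run and treating "no_op" as "zero sessions added" are assumptions, as is leaving out the transaction handling.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Callable, Optional

MAX_CONSECUTIVE_FAILURES = 3


@dataclass
class RuleState:
    """Hypothetical in-memory stand-in for DatasetAutoPopulationRule."""
    enabled: bool = True
    last_run_at: Optional[datetime] = None
    last_run_status: str = ""
    last_error: str = ""
    consecutive_failure_count: int = 0


def run_rule(rule: RuleState, ingest: Callable[[], int],
             notify: Callable[[str], None], now: datetime) -> None:
    """Run one tick for one rule and record the outcome on the rule itself."""
    if not rule.enabled:
        return
    rule.last_run_at = now
    try:
        added = ingest()  # returns the number of sessions appended
    except Exception as exc:
        rule.last_run_status = "error"
        rule.last_error = str(exc)
        rule.consecutive_failure_count += 1
        if rule.consecutive_failure_count >= MAX_CONSECUTIVE_FAILURES:
            rule.enabled = False
            notify(f"rule auto-disabled: {rule.last_error}")
        return
    # Assumption: any non-failing run clears the failure streak.
    rule.last_run_status = "success" if added else "no_op"
    rule.last_error = ""
    rule.consecutive_failure_count = 0


def failing_ingest() -> int:
    raise RuntimeError("filter could not be parsed")


rule = RuleState()
alerts = []
tick = datetime(2026, 6, 1, tzinfo=timezone.utc)
for _ in range(4):  # the fourth tick is a no-op because the rule is disabled
    run_rule(rule, failing_ingest, alerts.append, tick)

print(rule.enabled, rule.consecutive_failure_count, len(alerts))  # False 3 1
```

In the real task the error would be recorded in a fresh transaction after the per-rule atomic block has rolled back; the sketch only shows the state transitions.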

Alternatives considered

  • Event-driven ingestion (signals on ExperimentSession.end() and CustomTaggedItem writes) — rejected for v1: requires hooks at many call sites and still needs a polling backstop for tag-change races. Deferred until polling cadence proves too slow.
  • High-water-mark cursor on the rule — rejected: a last_seen_session_at cursor would skip sessions whose matching tag is added after the cursor advances. NOT IN dataset dedup is the only correct approach when filter criteria depend on mutable state.
  • Lifecycle hook on ExperimentSession.end() — deferred: useful for near-real-time but redundant with polling for v1.
  • Sampling sessions (ingest only N% of matches) — deferred; mentioned in the source issue but not needed for v1.
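
The difference between the rejected cursor design and the chosen re-scan can be shown with a toy timeline. Everything here is illustrative: cursor_ingest and rescan_ingest are hypothetical functions, timestamps are plain integers, and the filter is reduced to a single tag.

```python
from dataclasses import dataclass, field


@dataclass
class Session:
    """Hypothetical session; created_at is a plain integer timestamp."""
    id: int
    created_at: int
    tags: set = field(default_factory=set)


def cursor_ingest(sessions, cursor, tag):
    """Rejected design: only look at sessions created after the cursor,
    then advance the cursor past everything that was seen."""
    picked = [s.id for s in sessions if s.created_at > cursor and tag in s.tags]
    new_cursor = max([cursor] + [s.created_at for s in sessions])
    return picked, new_cursor


def rescan_ingest(sessions, in_dataset, tag):
    """Chosen design: re-scan and exclude whatever is already in the dataset."""
    return [s.id for s in sessions if tag in s.tags and s.id not in in_dataset]


session = Session(id=1, created_at=1)  # created without the matching tag
sessions = [session]

# First tick: nothing matches yet under either design.
cursor_first, cursor = cursor_ingest(sessions, cursor=0, tag="qa")
rescan_first = rescan_ingest(sessions, in_dataset=set(), tag="qa")

# The tag is added after the first tick; created_at does not change.
session.tags.add("qa")

# Second tick: the cursor has already moved past the session.
cursor_second, cursor = cursor_ingest(sessions, cursor=cursor, tag="qa")
rescan_second = rescan_ingest(sessions, in_dataset=set(), tag="qa")

print(cursor_second)  # []
print(rescan_second)  # [1]
```

The cursor variant never revisits the session because nothing about the session row signals that its tags changed, which is the same gap the Context section describes for CustomTaggedItem writes.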