ADR-0019: Poll source experiments to auto-populate evaluation datasets
ACCEPTED
Context
Evaluation datasets were one-shot — populated via manual session pick, filter import, or CSV upload — and went stale unless someone refreshed them. Teams wanted datasets that continuously absorbed new sessions from a source chatbot matching saved filter criteria. The two viable mechanisms were event-driven hooks (signals on session end and tag changes) or periodic polling. Hooks demand careful coupling to many call sites and still need a backstop for post-hoc tag changes that fire long after a session was created — CustomTaggedItem writes don't bump session.updated_at, so a cursor-based ingester would miss them.
The feature is gated by the existing flag_evaluations waffle flag; teams with the evaluations app get auto-population without a separate opt-in.
Decision
We will run a periodic Celery task (auto_populate_eval_datasets, every 5 minutes) that walks each enabled DatasetAutoPopulationRule, scans its source experiment for recent sessions matching the rule's filter, and appends matches to the parent EvaluationDataset.
- DatasetAutoPopulationRule is a BaseTeamModel carrying the parent dataset, source experiment, filter query string, enabled flag, and per-rule run metadata: last_run_at, last_run_status (success/error/no_op), last_error, consecutive_failure_count.
- Rules are restricted to session-mode datasets in v1; message-mode datasets are out of scope.
- Each tick re-scans within a configurable lookback window (EVALUATIONS_AUTO_POPULATION_LOOKBACK_DAYS, default 30). The scan floor is MAX(rule.created_at, now() - lookback): created_at is the forward-only floor, and the lookback caps per-tick work.
- No high-water-mark cursor is stored on the rule. Dedup is performed at scan time by excluding sessions whose id is already in dataset.messages. This guarantees sessions that gain a matching tag after creation are picked up on a later tick.
- Each rule is processed inside its own transaction with select_for_update(skip_locked=True), so concurrent beat workers never double-process the same rule. Per-rule exceptions are caught and recorded in a fresh transaction outside the (potentially rolled-back) atomic block.
- After three consecutive failures the rule is auto-disabled and an ocs_notifications notification fires.
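The scan floor and scan-time dedup described above can be sketched in plain Python. This is an illustration under stated assumptions, not the project's implementation: the Session class, scan_floor, and select_new_sessions are hypothetical names, the rule's filter query string is reduced to a single required tag, and the floor is assumed to be compared against the session's creation time.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Mirrors the documented default of EVALUATIONS_AUTO_POPULATION_LOOKBACK_DAYS.
LOOKBACK_DAYS = 30


@dataclass(frozen=True)
class Session:
    """Hypothetical stand-in for a source-experiment session."""
    id: int
    created_at: datetime
    tags: frozenset = frozenset()


def scan_floor(rule_created_at, now, lookback_days=LOOKBACK_DAYS):
    """Return MAX(rule.created_at, now - lookback)."""
    return max(rule_created_at, now - timedelta(days=lookback_days))


def select_new_sessions(sessions, required_tag, dataset_session_ids,
                        rule_created_at, now):
    """Sessions at or after the floor that match and are not yet in the dataset."""
    floor = scan_floor(rule_created_at, now)
    return [
        s for s in sessions
        if s.created_at >= floor              # assumption: floor applies to created_at
        and required_tag in s.tags            # simplified stand-in for the rule's filter
        and s.id not in dataset_session_ids   # scan-time dedup, no stored cursor
    ]


now = datetime(2024, 6, 1, tzinfo=timezone.utc)
rule_created_at = datetime(2024, 5, 20, tzinfo=timezone.utc)
sessions = [
    Session(1, datetime(2024, 5, 25, tzinfo=timezone.utc), frozenset({"qa"})),
    Session(2, datetime(2024, 5, 10, tzinfo=timezone.utc), frozenset({"qa"})),  # predates the rule
    Session(3, datetime(2024, 5, 26, tzinfo=timezone.utc)),                     # no matching tag
    Session(4, datetime(2024, 5, 27, tzinfo=timezone.utc), frozenset({"qa"})),  # already ingested
]
new = select_new_sessions(sessions, "qa", {4}, rule_created_at, now)
print([s.id for s in new])
```

Only session 1 survives all three checks in this example: session 2 is older than the rule, session 3 does not match, and session 4 is already in the dataset.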
Consequences
- Ingestion latency is bounded by the 5-minute beat cadence; near-real-time ingestion is out of scope.
- Re-scan cost grows linearly with the lookback window. The 30-day default trades freshness against scan cost; tag changes older than the window will not be picked up.
- The forward-only floor means rules never absorb sessions older than the rule itself — backfilling historical traffic still requires a manual import.
- Auto-disable + notification surfaces broken rules without paging an operator.
- Per-tick state on the rule makes operational status visible directly on the dataset detail page, so no separate audit log is needed.
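The per-rule run metadata behind the last two consequences could be maintained roughly as follows. This is a hedged sketch: RuleState and record_run are hypothetical names, and the ADR does not spell out the exact semantics, so the code assumes that no_op means a clean run that added nothing and that any clean run resets the failure counter.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

MAX_CONSECUTIVE_FAILURES = 3  # "three consecutive failures" from the decision


@dataclass
class RuleState:
    """Hypothetical in-memory stand-in for the run metadata on a rule."""
    enabled: bool = True
    last_run_at: Optional[datetime] = None
    last_run_status: Optional[str] = None  # "success", "error" or "no_op"
    last_error: str = ""
    consecutive_failure_count: int = 0


def record_run(state, added_count=0, error=None, now=None):
    """Update run metadata; return True if this run auto-disabled the rule."""
    state.last_run_at = now or datetime.now(timezone.utc)
    if error is None:
        # Assumption: a clean run clears the error and resets the counter.
        state.last_run_status = "success" if added_count > 0 else "no_op"
        state.last_error = ""
        state.consecutive_failure_count = 0
        return False
    state.last_run_status = "error"
    state.last_error = str(error)
    state.consecutive_failure_count += 1
    if state.enabled and state.consecutive_failure_count >= MAX_CONSECUTIVE_FAILURES:
        state.enabled = False  # the caller would send the notification here
        return True
    return False
```

Returning a flag instead of sending the notification inside the function keeps the bookkeeping testable on its own; the caller decides how to notify.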
Alternatives considered
- Event-driven ingestion (signals on ExperimentSession.end() and CustomTaggedItem writes): rejected for v1 because it requires hooks at many call sites and still needs a polling backstop for tag-change races. Deferred until polling cadence proves too slow.
- High-water-mark cursor on the rule: rejected. A last_seen_session_at cursor would skip sessions whose matching tag is added after the cursor advances. NOT IN dataset dedup is the only correct approach when filter criteria depend on mutable state.
- Lifecycle hook on ExperimentSession.end(): deferred. Useful for near-real-time but redundant with polling for v1.
- Sampling sessions (ingest only N% of matches): deferred. Mentioned in the source issue but not needed for v1.
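The difference between the rejected cursor and the chosen dedup can be shown with a toy two-tick simulation. Everything here is illustrative: sessions are plain dicts with integer timestamps, the filter is a single tag check, and tick_with_cursor and tick_with_dedup are hypothetical names rather than anything in the codebase.

```python
def tick_with_cursor(sessions, matches, cursor):
    """High-water-mark variant: only consider sessions newer than the cursor."""
    picked = [s["id"] for s in sessions if s["created_at"] > cursor and matches(s)]
    new_cursor = max([cursor] + [s["created_at"] for s in sessions])
    return picked, new_cursor


def tick_with_dedup(sessions, matches, dataset_ids):
    """Dedup variant: re-scan everything, skip ids already in the dataset."""
    picked = [s["id"] for s in sessions if matches(s) and s["id"] not in dataset_ids]
    dataset_ids.update(picked)
    return picked


def has_tag(session):
    # Simplified stand-in for the rule's saved filter.
    return "qa" in session["tags"]


sessions = [
    {"id": "A", "created_at": 1, "tags": set()},   # untagged at first
    {"id": "B", "created_at": 2, "tags": {"qa"}},
]

# Tick 1: both strategies ingest B.
cursor, dataset_ids = 0, set()
first_cursor, cursor = tick_with_cursor(sessions, has_tag, cursor)
first_dedup = tick_with_dedup(sessions, has_tag, dataset_ids)

# Session A gains the matching tag after the cursor has moved past it.
sessions[0]["tags"].add("qa")

# Tick 2: the cursor variant misses A, the dedup variant picks it up.
second_cursor, cursor = tick_with_cursor(sessions, has_tag, cursor)
second_dedup = tick_with_dedup(sessions, has_tag, dataset_ids)
print(second_cursor, second_dedup)
```

On the second tick the cursor variant returns nothing because A's creation time is already behind the cursor, while the dedup variant returns A because it is a match that is not yet in the dataset.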