ADR-0020: Delta evaluation runs scoped to newly appended messages
ACCEPTED
Context
With auto-population (ADR-0019) appending new rows to datasets continuously, re-running full evaluations on every append would re-evaluate every previously-scored row. Teams want to evaluate only the rows produced by a tick, while keeping manual evaluation workflows unchanged.
Decision
We will add a delta evaluation run type that scopes work to a specific subset of dataset messages, and an opt-in flag on EvaluationConfig that triggers a delta run automatically when the auto-population path appends rows.
- EvaluationRunType.DELTA joins FULL and PREVIEW as a run-type choice.
- EvaluationRun.scoped_messages is a M2M to EvaluationMessage, populated at enqueue for DELTA runs and empty for FULL/PREVIEW. The scope is frozen at enqueue, so concurrent appends mid-flight don't change what the in-flight run evaluates.
- EvaluationConfig.auto_run_on_append (default False) opts a config in. When auto-population appends rows, every opted-in config on the dataset gets a DELTA run scoped to those rows. The trigger is invoked from a transaction.on_commit hook so a rolled-back append never fires evaluations.
- The auto-trigger fires only from the polling path; manual filter-import and CSV-import paths are intentionally untouched.
- EvaluationConfig.run accepts an optional scoped_messages argument that pins the M2M before dispatching run_evaluation_task.
- run_evaluation_task branches on type: PREVIEW samples, DELTA reads scoped_messages, FULL reads the full dataset.
- The results view filters by scoped_messages for DELTA runs so the UI shows only the rows that were evaluated.
- Evaluator-tag rules apply to DELTA runs the same as FULL (the tag-rule gate only skips PREVIEW).
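The type branching described above can be sketched in plain Python. This is a minimal illustration, not the actual task code: the dataclass stands in for the EvaluationRun model, scoped_message_ids stands in for the scoped_messages M2M, and select_message_ids, preview_size and seed are hypothetical names introduced only for this example.

```python
from dataclasses import dataclass
from enum import Enum
import random


class EvaluationRunType(Enum):
    FULL = "full"
    PREVIEW = "preview"
    DELTA = "delta"


@dataclass(frozen=True)
class EvaluationRun:
    run_type: EvaluationRunType
    # Stand-in for the scoped_messages M2M: message ids captured at enqueue.
    scoped_message_ids: frozenset = frozenset()


def select_message_ids(run, dataset_message_ids, preview_size=2, seed=0):
    """Return the ids a run should evaluate, branching on run type."""
    if run.run_type is EvaluationRunType.PREVIEW:
        # PREVIEW evaluates a sample; sample size and seeding are assumptions.
        rng = random.Random(seed)
        k = min(preview_size, len(dataset_message_ids))
        return sorted(rng.sample(sorted(dataset_message_ids), k))
    if run.run_type is EvaluationRunType.DELTA:
        # DELTA reads only the scope pinned at enqueue.
        return sorted(run.scoped_message_ids)
    # FULL reads the whole dataset as it is when the task runs.
    return sorted(dataset_message_ids)


dataset = [1, 2, 3, 4, 5]
delta_run = EvaluationRun(EvaluationRunType.DELTA, frozenset({4, 5}))
full_run = EvaluationRun(EvaluationRunType.FULL)
preview_run = EvaluationRun(EvaluationRunType.PREVIEW)

delta_ids = select_message_ids(delta_run, dataset)      # only the scoped rows
full_ids = select_message_ids(full_run, dataset)        # every row
preview_ids = select_message_ids(preview_run, dataset)  # a small sample
```

In this sketch a DELTA run never consults the dataset at all, which is what keeps its work proportional to the size of the append.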
Consequences
- Per-tick cost is bounded by the append size, not the full dataset.
- Opting in cannot inadvertently re-trigger from a manual import, because the auto-trigger is wired only into the polling path.
- Stable scope semantics: the M2M is captured at enqueue, so a run reports results for exactly the rows it was asked to evaluate even if the dataset grows mid-flight.
- Adding new append entry points later (event-driven ingestion, manual import auto-trigger) requires explicit wiring — they don't inherit the auto-trigger.
- Datasets with many opted-in configs and frequent ticks will see N delta runs per tick; cost discipline is the operator's responsibility.
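To make the fan-out and trigger wiring concrete, here is a rough sketch under stated assumptions. FakeTransaction is a hypothetical stand-in for a database transaction with on-commit hooks (the ADR names transaction.on_commit; this example does not call it), and the config names, polling_append, manual_import and trigger_delta_runs are invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class EvaluationConfig:
    name: str
    auto_run_on_append: bool = False  # opt-in flag, off by default


@dataclass
class FakeTransaction:
    """Hypothetical stand-in for a DB transaction with on-commit hooks."""
    callbacks: list = field(default_factory=list)

    def on_commit(self, func):
        self.callbacks.append(func)

    def commit(self):
        for func in self.callbacks:
            func()
        self.callbacks.clear()

    def rollback(self):
        # A rolled-back append discards its hooks, so nothing is enqueued.
        self.callbacks.clear()


enqueued = []  # (config name, run type, scoped ids) tuples


def trigger_delta_runs(configs, appended_ids):
    """Enqueue one DELTA run per opted-in config, scoped to the new rows."""
    scope = frozenset(appended_ids)
    for config in configs:
        if config.auto_run_on_append:
            enqueued.append((config.name, "DELTA", scope))


def polling_append(txn, configs, appended_ids):
    """Polling path: the only append path wired to the auto-trigger."""
    txn.on_commit(lambda: trigger_delta_runs(configs, appended_ids))


def manual_import(txn, configs, appended_ids):
    """Manual filter/CSV import: registers no trigger at all."""
    return None


configs = [
    EvaluationConfig("toxicity", auto_run_on_append=True),
    EvaluationConfig("helpfulness", auto_run_on_append=True),
    EvaluationConfig("latency"),  # not opted in
]

txn = FakeTransaction()
polling_append(txn, configs, [101, 102])
txn.commit()    # two opted-in configs -> two DELTA runs

txn = FakeTransaction()
polling_append(txn, configs, [103])
txn.rollback()  # rolled back -> no runs

txn = FakeTransaction()
manual_import(txn, configs, [104])
txn.commit()    # manual path -> no runs
```

With N opted-in configs, each committed tick enqueues N runs, so the per-tick cost scales with both the append size and the number of opted-in configs.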
Alternatives considered
- Auto-trigger on every append path (manual filter import, CSV import) — rejected for v1: would change the behaviour of existing manual workflows. Per-path opt-in can be added later.
- Re-run the full evaluation on every append — rejected: re-evaluates already-scored rows and multiplies cost linearly with append frequency.
- Store the scope as a filter expression instead of a M2M — rejected: would re-resolve on access and pick up rows added after enqueue, breaking the stable-scope property.
- Separate EvaluationRun subclass for delta runs — rejected: a type discriminator plus a M2M that stays empty for non-delta runs is simpler and fits the existing run model.
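The difference between the rejected filter-expression scope and the chosen enqueue-time snapshot can be illustrated with a small, self-contained sketch. The row shape, the tick field and the filter predicate are assumptions made for the example; the real scope is the scoped_messages M2M.

```python
# Dataset state at enqueue time: two rows produced by tick 1.
dataset = [{"id": 1, "tick": 1}, {"id": 2, "tick": 1}]


# Rejected alternative: keep a filter expression and resolve it on access.
def filter_scope():
    return [row["id"] for row in dataset if row["tick"] >= 1]


# Chosen design: materialise the matching ids once, at enqueue.
snapshot_scope = frozenset(row["id"] for row in dataset if row["tick"] >= 1)

# A later tick appends a row while the run is still in flight.
dataset.append({"id": 3, "tick": 2})

filter_ids = filter_scope()            # re-resolved: picks up the new row
snapshot_ids = sorted(snapshot_scope)  # frozen: unchanged by the append
```

The filter re-resolves against the grown dataset, while the snapshot still names exactly the rows that were present at enqueue.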