ADR-0014: Minimal read-side concordance view backed by Score
ACCEPTED
Context
The dogfood pilot for "basic concordance" compares an LLM judge's per-session answer against a human reviewer's authoritative answer for one shared categorical field. With Score (ADR-0012) now populated by both subsystems (ADR-0013), the question is what read surface ships first.
The full unified-assessment design proposes persisted concordance configs, multi-source consensus aggregation, and kappa/MAE/confusion-matrix statistics — none of which the pilot needs. We want the smallest read-side view that proves the value layer works end-to-end without locking in those decisions.
Decision
We will ship a single Django TemplateView (ConcordanceView) at /a/<team_slug>/evaluations/concordance/, under a sidebar sub-item in Evaluations.
- Selection state lives entirely in query parameters (?eval=, ?queue=, ?field=, ?show=). There is no persisted config; a comparison is a bookmarkable but disposable URL.
- Candidate fields are the name intersection of the two schemas, narrowed to type: choice on both sides. The eval side is the union of the configured evaluators' output schemas; the human side is the queue's schema (ADR-0015). Numeric and free-text fields are filtered out for v1. A single candidate auto-selects; otherwise the picker renders.
- Two Score queries, joined in Python. The judge query filters source IN (LLM_JUDGE, PROGRAMMATIC); the human query filters source = HUMAN_REVIEW and review.is_authoritative = True (ADR-0016). Each side is reduced to latest Score per target_object_id, then set-intersected into matched / eval-only / human-only buckets.
- Aggregation is "latest Score per target per side," ordered by (created_at, id) for deterministic ties. This is a v1 stand-in for the unified design's per-source consensus (mean / mode).
- is_authoritative is filtered at read time, not denormalised onto Score. Multi-reviewer queues let humans toggle authoritativeness after submission; denormalising would require sync hooks on every toggle. A query-time join is cheap enough at pilot scale.
- Eval-side joins go through automated_result.run.config, not evaluator. Evaluator ↔ EvaluationConfig is many-to-many, so filtering by evaluator would pull in Scores from other configs sharing that evaluator. Joining through run.config keeps scope inside the selected config.
- A ?show= toggle partitions the table (matched|eval_only|human_only|all, default matched). Agreement count and percentage are computed over matched rows only.
- The view is gated by the team-managed waffle flag flag_assessments_concordance, which requires flag_evaluations and flag_human_annotations. Dispatch raises Http404 if any of the three is inactive.
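
The reduce-and-bucket step described above can be sketched in plain Python. This is a minimal illustration, not the view's actual code: ScoreRow, latest_per_target, and concordance are hypothetical names, and the rows are assumed to have already been filtered by source and authoritativeness on each side.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class ScoreRow:
    # Hypothetical stand-in for a Score row; field names are assumptions.
    id: int
    target_object_id: int
    value: str
    created_at: datetime


def latest_per_target(rows):
    """Reduce one side to its latest row per target_object_id."""
    latest = {}
    # Sorting by (created_at, id) makes ties on created_at deterministic:
    # the row with the higher id is seen last and wins.
    for row in sorted(rows, key=lambda r: (r.created_at, r.id)):
        latest[row.target_object_id] = row
    return latest


def concordance(judge_rows, human_rows):
    """Bucket targets and compute agreement over matched targets only."""
    judge = latest_per_target(judge_rows)
    human = latest_per_target(human_rows)
    matched = sorted(judge.keys() & human.keys())
    agree = sum(1 for t in matched if judge[t].value == human[t].value)
    return {
        "matched": matched,
        "eval_only": sorted(judge.keys() - human.keys()),
        "human_only": sorted(human.keys() - judge.keys()),
        "agree": agree,
        "percent": 100.0 * agree / len(matched) if matched else None,
    }


t0 = datetime(2024, 1, 1)
judge_rows = [
    ScoreRow(1, 10, "yes", t0),
    ScoreRow(2, 10, "no", t0),  # same timestamp as id=1; the higher id wins
    ScoreRow(3, 11, "yes", t0),
    ScoreRow(4, 12, "yes", t0),
]
human_rows = [
    ScoreRow(5, 10, "no", t0),
    ScoreRow(6, 11, "no", t0),
    ScoreRow(7, 13, "yes", t0),
]
result = concordance(judge_rows, human_rows)
```

With this sample data, targets 10 and 11 are matched, 12 is eval-only, and 13 is human-only. One of the two matched targets agrees, so the agreement is 1 of 2 (50%). The percentage is None when nothing matches, which avoids a division by zero in the empty state.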
Consequences
- Positive: A reviewer sees side-by-side judge-vs-human values for one field plus an agreement count — the entire dogfood ask — with no new persisted models.
- Positive: All state in query params makes the view deep-linkable, shareable, and trivial to refactor when persisted configs land.
- Positive: The query-time authoritative filter stays correct as reviewers toggle authoritativeness, with no cache-busting or denormalisation upkeep.
- Positive: Joining through run.config means concordance for one config never includes Scores from another config sharing an evaluator.
- Positive: Waffle gating with a dispatch-level 404 keeps the URL invisible to teams not opted in.
- Negative: "Latest Score per target" shows only the most recent answer, so the agreement count measures most-recent agreement, not any temporal aggregation.
- Negative: Numeric and free-text fields are silently filtered from the picker, so a user won't see why an expected field is missing; the unified-design successor will surface numeric concordance with proper metrics.
- Negative: Items in AWAITING_RESOLUTION (no authoritative pick yet) drop out; this is correct for "compare against the resolved human answer," but the empty state must signal when many items are filtered.
- Negative: Not persisting the (eval, queue, field) tuple means power users re-discover their comparison each visit; persisted configs are deferred to the unified design.
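
The picker filtering behind the numeric and free-text negative can be sketched as follows. This is an illustration under assumptions: schemas are modelled as dicts mapping a field name to {"type": ...}, the type names "int" and "text" are placeholders, and candidate_fields is a hypothetical helper. The skipped mapping is not part of the v1 decision; it only makes visible which shared names the filter drops.

```python
def candidate_fields(evaluator_schemas, queue_schema):
    """Return (candidates, skipped) for the field picker.

    candidates: names present on both sides with type "choice" on both.
    skipped: shared names dropped by the type filter, with both types.
    """
    # Eval side is the union of the evaluators' output schemas.
    # Assumption: if two evaluators declare the same name, the later one wins.
    eval_side = {}
    for schema in evaluator_schemas:
        eval_side.update(schema)

    candidates, skipped = [], {}
    for name in sorted(eval_side.keys() & queue_schema.keys()):
        types = (eval_side[name]["type"], queue_schema[name]["type"])
        if types == ("choice", "choice"):
            candidates.append(name)
        else:
            skipped[name] = types
    return candidates, skipped


evaluator_schemas = [
    {"sentiment": {"type": "choice"}, "score": {"type": "int"}},
    {"resolved": {"type": "choice"}},
]
queue_schema = {
    "sentiment": {"type": "choice"},
    "score": {"type": "int"},
    "notes": {"type": "text"},
}
candidates, skipped = candidate_fields(evaluator_schemas, queue_schema)
# A single candidate auto-selects; otherwise the picker would render.
selected = candidates[0] if len(candidates) == 1 else None
```

Here only sentiment survives and auto-selects. score is shared but numeric, so it lands in skipped. resolved and notes each exist on one side only and never reach the type check.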
Alternatives considered
- Persist a ConcordanceConfig model now → rejected; the unified design defines this surface, so a v1 would lock in choices the pilot doesn't need.
- Denormalise is_authoritative onto Score → rejected; needs sync hooks on every authoritative toggle and risks drift for marginal query benefit.
- Filter by automated_result__evaluator__in=... → rejected; the M2M Evaluator ↔ EvaluationConfig relation pulls in foreign Scores. Use automated_result__run__config=eval_config instead.
- Compute the join in SQL (FULL OUTER JOIN or CTE) → rejected; Django ORM full-outer-join support is awkward, and the Python set-intersection on target_object_id is readable and bounded by per-team result count.
- Render Cohen's kappa / MAE / confusion matrix in v1 → rejected; explicit non-goal, and the Score data suffices for these metrics to land later without schema changes.
- CSV / JSONL export of rows → rejected; adds surface area prematurely. The session-row link to session detail is the v1 escape hatch.
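
The evaluator-filter rejection is easiest to see with a small in-memory model. The sketch below uses no ORM: Run, AutomatedResult, config_evaluators, and the config and evaluator names are hypothetical stand-ins. The two functions mimic the intent of the automated_result__evaluator__in and automated_result__run__config lookups named above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    id: int
    config: str  # name of the EvaluationConfig this run belongs to


@dataclass(frozen=True)
class AutomatedResult:
    id: int
    run: Run
    evaluator: str


# Many-to-many: "tone_judge" is shared by both configs.
config_evaluators = {
    "config_a": {"tone_judge"},
    "config_b": {"tone_judge", "length_check"},
}

run_a = Run(1, "config_a")
run_b = Run(2, "config_b")
results = [
    AutomatedResult(1, run_a, "tone_judge"),
    AutomatedResult(2, run_b, "tone_judge"),
    AutomatedResult(3, run_b, "length_check"),
]


def by_evaluator(results, config):
    """Rejected approach: select results whose evaluator is in the config."""
    wanted = config_evaluators[config]
    return [r.id for r in results if r.evaluator in wanted]


def by_run_config(results, config):
    """Chosen approach: select results whose run belongs to the config."""
    return [r.id for r in results if r.run.config == config]


leaky = by_evaluator(results, "config_a")
scoped = by_run_config(results, "config_a")
```

For config_a, the evaluator filter returns results 1 and 2, and result 2 was produced by a config_b run. The run.config filter returns only result 1.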