ADR-0015: Dedicated human_annotations app with queue/item/annotation/aggregate model
ACCEPTED
Context
Structured human review of chat sessions — labelling conversations against a shared rubric, gathering multi-reviewer judgments, exporting them, comparing them to automated judgments — is a recurring need. Tag / CustomTaggedItem gave free-form annotation but no schema, reviewer assignment, review-count tracking, or aggregation.
The larger fork was whether to build this on top of apps.evaluations. That subsystem predates the annotation work and was designed entirely around automated evaluation: Celery-driven, idempotent re-runs over a fixed dataset. When this work began there was no settled picture of how the two would interact, and the immediate need was progress. Its data models and workflows did not fit interactive, per-reviewer, editable human review, and retrofitting them would have meant re-architecting a substantial part of evaluations — so we built a separate peer app. (See Consequences for the hindsight: the crossover workflows that emerged later have made this worth revisiting.)
Decision
A dedicated apps.human_annotations app, organised around four BaseTeamModel records:
AnnotationQueue— rubric + reviewer container: aschema(SanitizedJSONFieldmapping field names toFieldDefinitionentries — the int/float/choice/string unionapps.evaluationsuses for evaluator output);num_reviews_required(1-10,num_reviews_required_rangeCheckConstraint); aQueueStatusenum (ACTIVE/PAUSED/COMPLETED/ARCHIVED); andassigneesM2M toAUTH_USER_MODEL.(team, name)unique.AnnotationItem— the unit of work: anAnnotationItemTypeenum (SESSION/MESSAGE) over mutually-exclusive FKs toexperiments.ExperimentSessionandchat.ChatMessage, with partial unique constraintsunique_session_per_queue/unique_message_per_queue. Carries a denormalisedreview_countand an append-onlyflagsfield.Annotation— one reviewer's submission;(item, reviewer)unique.data(SanitizedJSONField) keys must match the queueschema. AnAnnotationStatusenum (DRAFT/SUBMITTED) gates whether the row counts toward review counts and aggregates.AnnotationQueueAggregate—OneToOneFieldto the queue caching per-field aggregates (see ADR-0017).
Plus:
- Reuse
FieldDefinitionfromapps.evaluations, not a fork. Queue rubrics and evaluator output schemas are the same shape, so one validator covers both — and ADR-0012'sScorewrites type-dispatch uniformly across automated and human paths. - Lock
schemaandnum_reviews_requiredonce any item has a review (items.filter(review_count__gt=0).exists()); only per-fieldrequiredstays mutable. Changing fields mid-queue would silently invalidate prior submissions. - Role-based visibility via
AnnotationQueue.objects.visible_to(user, team): all team queues for holders ofhuman_annotations.add_annotationqueue, else only queues where the user is an assignee. An emptyassigneesset means team-wide, non-empty means restricted — invite-only vs team-wide without a separate flag. - Gate entry points behind the
flag_human_annotationsWaffle flag. It controls discovery (nav, "Add to queue" affordances), not data access; the management views are not flag-gated.
Consequences
- Positive: Stable FK targets for downstream consumers (ADR-0013 score writer, ADR-0014 concordance, CSV/JSONL exports).
Annotationis referenced byassessments.Score.reviewand survives the eventualAnnotation → Reviewrename. - Positive: Sharing
FieldDefinitionmeans one validation, widget mapping, and aggregation dispatch (ADR-0017) across both subsystems; new field types ship in one place. - Positive: Schema-lock with mutable per-field
requiredlets admins re-tighten which fields are mandatory after launch without redefining what was measured. - Negative: A fifth
BaseTeamModelis one more queryset to team-scope; every view must filter throughteam=request.teamorvisible_to(...). - Negative: Locking on first review (not first submission, not
ACTIVE) means a queue stays editable through reviewer assignment and locks the moment work begins — easy to miss; the form disables fields on render to signal it. - Negative:
flagsis append-only and queryable only via JSON operators;AnnotationItemStatus.FLAGGEDis the indexed signal for listings. - Negative (hindsight): Building a peer app rather than extending
apps.evaluationshas been worth revisiting. The crossover workflows — judge-vs-human concordance, human labels as ground truth — mean the two are far less independent than assumed. The sharedScorelayer (ADR-0012), dual-write (ADR-0013), and concordance view (ADR-0014) are the reconciliation surface the split deferred rather than avoided. The right call for making progress without a clear vision, but the cost has been ongoing bridging rather than one upfront design.
Alternatives considered
- Extend
Tag/CustomTaggedItemwith a "queue": rejected — flat strings, no typed schema, no per-reviewer uniqueness, nowhere to hangreview_countor an aggregate. - Fold into
apps.evaluationsas a "manual evaluator": rejected — the main fork, covered in Context. Different lifecycle (worker-driven re-runs vs interactive editable reviews), poor data-model fit, and constant reviewer-vs-evaluator disambiguation in shared code. GenericForeignKeyforAnnotationItem: rejected — only two target types; dedicated FKs make the partial unique constraints andselect_related("session__experiment", "message")clearer.- One global
Annotationshared across both paths: rejected — automated results come fromEvaluationResult.outputwith a worker-driven lifecycle and would force a discriminator on every column. ADR-0012 is the shared layer where the paths converge. - Free-form schema edits any time: rejected — changing "yes/no" to a 1-5 scale after submissions invalidates them with no migration path. Per-field
requiredis the one safe mid-queue change.