ADR-0012: Lean Score value layer in apps/assessments
ACCEPTED
Context
Two independent subsystems produce per-session judgments as opaque JSON: automated evaluation (apps.evaluations, via EvaluationResult.output) and human review (apps.human_annotations per ADR-0015, via Annotation.data). The dogfood pilot for "basic concordance" needs to compare one shared categorical field (an LLM judge's answer versus the human authoritative answer) without an ad-hoc JSON-versus-JSON join.
We do not want to commit to the full unified-assessment design before its larger pieces are ratified. So we introduce only its value-storage layer, using its terminal column names, with enough flexibility that future targets and source types are additive rather than schema-breaking.
Decision
We will add a new Django app apps.assessments with a single model, Score, as the shared typed-value layer:
- One row per (target, field, source). A Score carries a name, a data_type enum (NUMERIC | CATEGORICAL | BOOLEAN), and split-column storage (value_numeric DecimalField(20,6), value_string TextField). A CheckConstraint named score_value_present requires at least one value column populated.
- Target is a GenericForeignKey from day one. target_content_type + target_object_id form the polymorphic target. Only ExperimentSession is exercised in v1; adding Trace or EvaluationMessage later is non-breaking.
- Source provenance via typed FKs plus a source enum. automated_result (FK to evaluations.EvaluationResult) and review (FK to human_annotations.Annotation) are mutually-exclusive nullable FKs recording the producing artefact. source is a TextChoices enum: LLM_JUDGE, PROGRAMMATIC, HUMAN_REVIEW, plus reserved USER_FEEDBACK and SYSTEM with no producer in v1.
- Idempotency enforced by partial unique constraints. score_unique_per_automated_result_field covers (automated_result, name) where automated_result IS NOT NULL; score_unique_per_review_field covers (review, name) where review IS NOT NULL. This lets writers safely delete-and-recreate.
- team denormalised via BaseTeamModel. Set at write time from EvaluationResult.team / Annotation.team so queries scope by team without an extra join.
- Booleans land in value_numeric as 0/1. The data_type=BOOLEAN marker preserves rendering intent while letting aggregation treat booleans as numeric.
- Field names align with the unified design's terminal vocabulary. automated_result and review (not evaluation_result / annotation) are chosen now so eventual model renames change only the FK targets.
- Defer everything else. No Assessment, AssessmentSchema, AssessmentRun, RoutingRule, participant FK, score_config, or comment; each becomes a nullable addition when its use case arrives.
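The database-level guarantees in the list above (a value must be present, and each producing artefact may write a given field only once) can be tried out with a minimal sketch. It uses SQLite from the Python standard library rather than the Django models and migrations this ADR targets, so the table layout, the integer id columns, and the try_insert helper are illustrative assumptions; only the constraint names and column names come from the decision itself.

```python
import sqlite3

# Hypothetical, simplified table: column names follow the ADR, but the generic
# target, team, and FK columns are reduced to plain columns for illustration.
# SQLite's NUMERIC affinity stands in for DecimalField(20,6).
SCHEMA = """
CREATE TABLE score (
    id INTEGER PRIMARY KEY,
    target_content_type TEXT NOT NULL,
    target_object_id INTEGER NOT NULL,
    name TEXT NOT NULL,
    data_type TEXT NOT NULL
        CHECK (data_type IN ('NUMERIC', 'CATEGORICAL', 'BOOLEAN')),
    source TEXT NOT NULL,
    value_numeric NUMERIC,
    value_string TEXT,
    automated_result_id INTEGER,
    review_id INTEGER,
    CONSTRAINT score_value_present
        CHECK (value_numeric IS NOT NULL OR value_string IS NOT NULL)
);
CREATE UNIQUE INDEX score_unique_per_automated_result_field
    ON score (automated_result_id, name)
    WHERE automated_result_id IS NOT NULL;
CREATE UNIQUE INDEX score_unique_per_review_field
    ON score (review_id, name)
    WHERE review_id IS NOT NULL;
"""

INSERT = """
INSERT INTO score (target_content_type, target_object_id, name, data_type,
                   source, value_numeric, value_string,
                   automated_result_id, review_id)
VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)
"""


def try_insert(conn, row):
    """Insert one row; return True on success, False if a constraint rejects it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(INSERT, row)
        return True
    except sqlite3.IntegrityError:
        return False


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# Booleans are stored as 0/1 in value_numeric, as the decision describes.
judge = ("experimentsession", 1, "resolved", "BOOLEAN", "LLM_JUDGE", 1, None, 10, None)
human = ("experimentsession", 1, "resolved", "BOOLEAN", "HUMAN_REVIEW", 0, None, None, 20)
empty = ("experimentsession", 1, "tone", "CATEGORICAL", "LLM_JUDGE", None, None, 11, None)

first_write = try_insert(conn, judge)      # accepted
duplicate_write = try_insert(conn, judge)  # same (automated_result, name): rejected
human_write = try_insert(conn, human)      # different producer: accepted
empty_write = try_insert(conn, empty)      # no value column populated: rejected
row_count = conn.execute("SELECT COUNT(*) FROM score").fetchone()[0]
```

In this sketch a repeated write from the same automated result is rejected outright, which is why a writer that wants to re-run would delete its existing rows first and then recreate them.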
Consequences
- Positive: Both subsystems dual-write into one queryable surface (see ADR-0013); future consumers read one model (concordance per ADR-0014, inter-rater reliability, cross-source aggregation).
- Positive: GenericForeignKey from day one makes adding Trace or EvaluationMessage targets purely additive.
- Positive: Terminal column names make the eventual EvaluationResult / Annotation renames a model rename, not a Score schema migration.
- Positive: Partial unique constraints make re-runs and re-submissions safe at the database layer.
- Negative: A GenericForeignKey is harder for the ORM to optimise than a dedicated FK; the composite index (target_content_type, target_object_id, name, source) mitigates the v1 query pattern.
- Negative: Split columns plus a data_type discriminator cannot represent anything richer than scalar numeric/categorical/boolean (e.g. structured rubrics); that waits for the deferred pieces.
- Negative: Reserved enum values (USER_FEEDBACK, SYSTEM) have no producer in v1; tooling must treat them as forward-compat placeholders.
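To make the "one queryable surface" consequence concrete, here is a rough sketch of how a consumer might compare judge and human answers once both land as typed rows. The actual concordance metric belongs to ADR-0014; the exact-match agreement rate below, the ScoreRow stand-in, and the agreement_rate helper are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoreRow:
    """Hypothetical in-memory stand-in for a Score row (categorical fields only)."""
    target_id: int
    name: str
    source: str        # e.g. "LLM_JUDGE" or "HUMAN_REVIEW"
    value_string: str


def agreement_rate(rows, name, judge_source="LLM_JUDGE", human_source="HUMAN_REVIEW"):
    """Share of targets where judge and human gave the same value for `name`.

    Only targets scored by both sources are counted; returns None if there are none.
    """
    by_source = {judge_source: {}, human_source: {}}
    for row in rows:
        if row.name == name and row.source in by_source:
            by_source[row.source][row.target_id] = row.value_string
    shared = by_source[judge_source].keys() & by_source[human_source].keys()
    if not shared:
        return None
    matches = sum(
        by_source[judge_source][t] == by_source[human_source][t] for t in shared
    )
    return matches / len(shared)


rows = [
    ScoreRow(1, "tone", "LLM_JUDGE", "polite"),
    ScoreRow(1, "tone", "HUMAN_REVIEW", "polite"),
    ScoreRow(2, "tone", "LLM_JUDGE", "polite"),
    ScoreRow(2, "tone", "HUMAN_REVIEW", "curt"),
    ScoreRow(3, "tone", "LLM_JUDGE", "curt"),  # no human review yet: ignored
]
rate = agreement_rate(rows, "tone")  # targets 1 and 2 are shared; one matches
```

The point of the sketch is that the comparison needs no knowledge of EvaluationResult.output or Annotation.data: it reads only name, source, and the typed value.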
Alternatives considered
- Ad-hoc JSON joins between EvaluationResult.output and Annotation.data → rejected; every consumer would re-invent field discovery, type coercion, and idempotency.
- Ship the full unified-assessment model now (Assessment, AssessmentSchema, AssessmentRun, routing tables) → rejected; the pilot needs none of it and it commits us to unratified routing and schema-catalogue decisions.
- Typed FK to ExperimentSession instead of GenericForeignKey → rejected; would force a schema migration once Trace / EvaluationMessage targets come online.
- Single JSON value column instead of split value_numeric / value_string → rejected; gives up cheap numeric aggregation and forces readers to disambiguate types in Python.
- Validate one-of (automated_result, review) via a CheckConstraint → skipped; the partial unique constraints plus the source enum already prevent a malformed idempotent write.
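The split-column alternative above can be illustrated with a small write-side sketch: values are coerced once, when they are stored, so readers can aggregate value_numeric without checking types. The to_value_columns and mean_numeric helpers are hypothetical names, and the strict bool handling is an assumption rather than something this ADR specifies.

```python
from decimal import Decimal


def to_value_columns(data_type, raw):
    """Map a raw judgment to (value_numeric, value_string) for the given data_type."""
    if data_type == "BOOLEAN":
        if not isinstance(raw, bool):
            raise TypeError("BOOLEAN scores expect a bool")
        # Booleans land in value_numeric as 0/1.
        return (Decimal(1) if raw else Decimal(0), None)
    if data_type == "NUMERIC":
        if isinstance(raw, bool):
            raise TypeError("NUMERIC scores do not accept bool")
        # Go through str() so floats such as 0.1 keep their short decimal form.
        return (Decimal(str(raw)), None)
    if data_type == "CATEGORICAL":
        return (None, str(raw))
    raise ValueError(f"unknown data_type: {data_type}")


def mean_numeric(columns):
    """Average value_numeric over rows that have one, without inspecting data_type."""
    values = [numeric for numeric, _ in columns if numeric is not None]
    return sum(values, Decimal(0)) / len(values) if values else None


stored = [
    to_value_columns("BOOLEAN", True),
    to_value_columns("BOOLEAN", False),
    to_value_columns("BOOLEAN", True),
    to_value_columns("BOOLEAN", True),
]
pass_rate = mean_numeric(stored)  # three of four booleans are true
```

With a single JSON column, the equivalent of mean_numeric would have to decide per row whether the stored value is a bool, a number, or a string before it could average anything.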