Same canvas, same answer. New canvas, still the right answer.
AI scoring earns trust by holding up on inputs it has never seen, not by repeating itself on inputs it already knows. We tested for both. Here is what we found.
What a scoring engine has to do
When a business case is submitted for assessment, three things have to be true. The score has to mean something, not just look like it does. The engine has to recognise the same kind of problem on a canvas it has never seen before. And when one thing on a strong canvas changes, the score has to move on the right dimension, not random ones. That is the bar Meridian was tested against. The numbers below are what came back.
By the Numbers
0
Two-step errors on 53 canvases it had never seen
Across 53 newly authored canvases, on both the Starter and Plus models, the engine never confused a strong canvas with a weak one. No PROCEED to REWORK flips. No REWORK to PROCEED flips. This is the result that carries the trust story.
40 / 40
Deliberate flaws introduced. Noticed every time.
We broke one part of a strong canvas at a time and re-ran the assessment. Every single time, the matching quality dimension moved the right way. The engine responds to the actual defect, not to noise.
0.80
Agreement with an independent model
A second AI, built by a different company with different training, assessed the same canvases against the same rubric. On a 0-to-1 agreement scale, the two systems scored 0.80. Two engines, no coordination, substantially aligned.
The bands these numbers were tested against were locked in before the runs started. Nothing was tuned to the results.
What this page does not claim
These results show that the engine generalises, that it responds to the right defects, and that an independent model agrees with it. They do not yet show accuracy against expert human reviewers. That is a separate test, running next on real beta canvases. No “validated against experts” number will appear on this page until that work is complete.
How we tested
The scoring engine was assessed four ways: on canvases it had never seen, on canvases with one thing deliberately broken, by an independent model from a different vendor, and against a locked set of boundary tests in continuous integration. A separate 200-run stability check tests whether the same input keeps returning the same output. Each assessment answers a different question.
Pre-registered before any model run
Expected bands, pass thresholds, and the catastrophic-error gate were all written down before any canvas was scored. Results below are reported as-is, with no tuning to fit the pre-registered bar.
Assessment on unseen canvases
53 newly authored business case canvases, none drawn from any set the engine had been calibrated against. Each scored once per tier model. The pre-registered bar was a soft floor of 85% directional agreement (proceed vs. rework), and a hard gate of zero two-step misses (no strong canvas read as weak, no weak canvas read as strong).
| Model | Directional agreement | Two-step misses | Notes |
|---|---|---|---|
| Plus (Sonnet) | 90.6% | 0 | Passed every pre-registered bar. The model we recommend for any canvas where the result will be acted on. |
| Starter (Haiku) | 81.1% | 0 | Passed the catastrophic-error gate. Runs slightly more generous at the REVIEW / PROCEED boundary: a measured difference, not a manufactured one. |
Plus tier: 90.6% directional agreement. Coming soon.
The Plus tier uses Claude Sonnet, which cleared every pre-registered bar. The paid tier is not yet open. Request access to be first in line when it is.
Every miss was adjacent (one band away), clustered at the REVIEW / PROCEED boundary where the bands are inherently fuzzy. Haiku tipped four moderate canvases into PROCEED that Sonnet correctly held at REVIEW: a top-boundary generosity, not a catastrophic confusion.
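The two pre-registered checks in this section can be sketched in a few lines. This is a minimal illustration, not Meridian's implementation: the band ordering, the function name `evaluate_run`, and the rule that a canvas agrees when its scored band equals the expected band are all assumptions made here. The counts 48 and 43 are simply the whole numbers out of 53 that reproduce the reported 90.6% and 81.1%.

```python
# Illustrative sketch only; names and the agreement rule are assumptions.
BANDS = {"REWORK": 0, "REVIEW": 1, "PROCEED": 2}  # assumed ordering, low to high


def evaluate_run(expected, scored, soft_floor=0.85):
    """Compare scored bands with expected bands for one model run."""
    # Distance 0 = match, 1 = adjacent miss, 2 = two-step miss.
    distances = [abs(BANDS[e] - BANDS[s]) for e, s in zip(expected, scored)]
    agreement = sum(d == 0 for d in distances) / len(distances)
    two_step = sum(d == 2 for d in distances)
    return {
        "agreement": agreement,
        "adjacent_misses": sum(d == 1 for d in distances),
        "two_step_misses": two_step,
        "passes_soft_floor": agreement >= soft_floor,  # 85% bar
        "passes_hard_gate": two_step == 0,             # zero two-step misses
    }


# Arithmetic check on the reported figures (n = 53 canvases).
plus_rate = round(48 / 53 * 100, 1)     # 90.6
starter_rate = round(43 / 53 * 100, 1)  # 81.1
```

Keeping the soft floor and the hard gate as separate flags mirrors how the results are reported: a run can miss the 85% floor and still pass the catastrophic-error gate.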
Assessment sensitivity
We took four strong canvases and deliberately broke them, one element at a time: for example, removing the measurable target, making the risk register generic, or disconnecting the success criteria from the original problem. Each change was tested five times per model. Five defect types, four canvases, and two models give 40 test cells in total. The question: when one specific thing on a canvas gets worse, does Meridian notice the right thing?
| Defect introduced | Target dimension | Result | Notes |
|---|---|---|---|
| Strip the quantitative target | Measurability | 8 / 8 | Perfect isolation. Removing the measurable outcome always moves Measurability the most. |
| Observational morning-after | Audience Clarity | 7 / 8 | Clean. The one near-miss still detected the defect; another dimension moved fractionally more on one Sonnet run. |
| Trivial test on the wrong assumption | Risk and Test Alignment | 8 / 8 | The strongest signal of all. The engine heavily penalises a test that misses the riskiest assumption. |
| Generic risks | Risk and Test Alignment | 6 / 8 | Clean on Sonnet. On Haiku the signal is directionally right and the biggest mover, but inside that dimension’s noise band. |
| Decouple success from problem | Canvas Coherence | 4 / 8 | Coherence dropped detectably every time. Measurability co-moved, because rewriting success criteria changes both dimensions. A rubric-structure fact, not an insensitivity. |
Every one of the 40 cells moved the target dimension in the correct direction. 33 of 40 also cleared the stricter bar of being detectable (above twice the standard error) and being the biggest mover on that canvas. The four near-misses on the success-criteria perturbation reflect that rewriting success criteria touches two rubric dimensions, not an insensitivity in the engine.
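The stricter bar combines three conditions: the target dimension moved the right way, it moved by more than twice its standard error, and it moved more than any other dimension. A minimal sketch of that check follows. The function name `classify_cell` and the example deltas and standard errors are hypothetical; only the tally at the end uses figures from the Result column of the table above.

```python
def classify_cell(deltas, std_errors, target):
    """Classify one perturbation cell.

    deltas: score change per dimension after the defect (negative = worse).
    std_errors: standard error per dimension, same keys as deltas.
    """
    target_delta = deltas[target]
    correct_direction = target_delta < 0
    detectable = abs(target_delta) > 2 * std_errors[target]
    biggest_mover = max(deltas, key=lambda d: abs(deltas[d])) == target
    return {
        "correct_direction": correct_direction,
        "strict_pass": correct_direction and detectable and biggest_mover,
    }


# Hypothetical cell: the quantitative target was stripped from a strong canvas.
example = classify_cell(
    deltas={"Measurability": -2.1, "Canvas Coherence": -0.3, "Audience Clarity": 0.1},
    std_errors={"Measurability": 0.4, "Canvas Coherence": 0.3, "Audience Clarity": 0.3},
    target="Measurability",
)

# Tally of the Result column: strict passes per defect, 8 cells each.
strict_passes = [8, 7, 8, 6, 4]
total_strict = sum(strict_passes)     # 33
total_cells = 8 * len(strict_passes)  # 40
```

A cell where the target dimension drops but a neighbouring dimension drops further counts as correct in direction and still fails the strict bar, which is the pattern described for the success-criteria perturbation.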
Independent model assessment
A model from a different vendor lineage (OpenAI gpt-4.1) was given the same production rubric and the same 53 canvases, and asked to score them independently. The two engines were then compared, band by band.
0.80
Cross-vendor agreement score
71.8%
Exact-band agreement
n = 39
Canvases compared
On the interpretation scale commonly used for agreement statistics, 0.80 falls in the “substantial agreement” range; the same scale is used when validating medical diagnostic tests. OpenAI ran slightly generous at the top boundary, but the overall ordering agreed strongly. This is AI compared against AI, not AI compared against humans: it is the evidence before the expert-reviewer phase, not a replacement for it.
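This page does not name the statistic behind the agreement score. As an illustration only, the sketch below assumes a linear-weighted Cohen's kappa, one common choice when two raters assign ordered bands. The function names and the four-canvas example are hypothetical. The last line checks that 28 matching bands out of 39 reproduces the reported 71.8% exact-band agreement.

```python
from collections import Counter


def exact_agreement(a, b):
    """Share of items on which both raters chose the same band."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def linear_weighted_kappa(a, b, n_bands=3):
    """Linear-weighted Cohen's kappa for bands coded 0..n_bands-1."""
    n = len(a)
    # Mean observed band distance between the two raters.
    observed = sum(abs(x - y) for x, y in zip(a, b)) / n
    # Mean band distance expected by chance, from each rater's band frequencies.
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(
        abs(i - j) * count_a[i] * count_b[j]
        for i in range(n_bands)
        for j in range(n_bands)
    ) / (n * n)
    if expected == 0:
        return 1.0  # both raters used one identical band; treated as full agreement here
    return 1 - observed / expected


# Hypothetical bands for four canvases (0 = REWORK, 1 = REVIEW, 2 = PROCEED).
engine = [0, 1, 2, 2]
other = [0, 1, 2, 1]
kappa = linear_weighted_kappa(engine, other)

exact_pct = round(28 / 39 * 100, 1)  # 71.8
```

A weighted statistic penalises a two-band disagreement more than an adjacent one, which is why it can sit well above the exact-band figure when most disagreements are one band apart.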
Boundary regression tests
19 automated checks that run every time Meridian's code changes. They verify that scoring thresholds, the five-dimension weighting formula, and nine boundary canvases all produce the same results as before. If any future change to the scoring logic shifts these outputs, the build fails before it ships.
19 / 19 automated checks passing
A guardrail against silent scoring drift between releases.
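A snapshot comparison is one simple way to build this kind of guardrail. The sketch below is an assumption about the general approach, not Meridian's test suite: `find_drift`, the canvas identifiers, and the stored outcomes and scores are all made up for illustration.

```python
def find_drift(golden, current):
    """Return canvases whose (outcome, score) differs from the stored snapshot."""
    return {
        canvas_id: (golden[canvas_id], current.get(canvas_id))
        for canvas_id in golden
        if current.get(canvas_id) != golden[canvas_id]
    }


# Hypothetical snapshot of two boundary canvases: (outcome, score).
golden = {"boundary-01": ("REVIEW", 55), "boundary-02": ("PROCEED", 80)}

unchanged = find_drift(golden, dict(golden))
shifted = find_drift(golden, {**golden, "boundary-02": ("REVIEW", 79)})

# In continuous integration the build would fail when drift is non-empty,
# for example: assert not find_drift(golden, current_results)
```

Returning the old and new values side by side, instead of a bare pass or fail, makes it easier to see which boundary moved when a build breaks.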
200 runs, 10 profiles: same input, same answer
Generalisation tells you whether the engine handles canvases it has not seen. Determinism tells you whether it handles the same canvas the same way twice. Both have to be true. The 200 runs below answer the second question: 10 business case canvases, each scored 20 times with identical inputs, across six business domains. Each canvas is a profile, labelled P1 through P10. Two were placed deliberately at the scoring boundaries.
Behaviour at the boundaries
P7 (Operations, borderline low) returned the same recommendation across all 20 runs. P8 (HR, borderline high) returned the same recommendation in 16 of 20 runs. Four runs returned a different outcome, all in the same band, none crossing a threshold. We include it because hiding the hard result would make the 93% figure meaningless. When a canvas scores in the borderline zone, treat the recommendation as indicative and use the section-level feedback to resolve the uncertainty.
| Profile | Avg | Consistency | Outcome |
|---|---|---|---|
| P1 Weak | 20.1 | 100% | REWORK |
| P2 Minimal | 55.9 | 95% | REVIEW |
| P3 Moderate | 67.6 | 95% | REVIEW |
| P4 Strong | 80.1 | 95% | PROCEED |
| P5 Excellent | 81.4 | 84% | PROCEED |
| P6 Weak, critical flags | 12.5 | 100% | REWORK |
| P7 (boundary) Near-threshold low | 55.5 | 100% | REVIEW |
| P8 (boundary) Near-threshold high | 71.1 | 80% | REVIEW |
| P9 Strong | 79.9 | 80% | PROCEED |
| P10 Moderate | 62.9 | 100% | REVIEW |
Consistency: percentage of 20 runs returning the modal recommendation for that profile. The directional measure (proceed vs. rework) held at 100% across all 200 runs. Exact recommendation match was 93% overall; P5, P8, and P9 were the profiles where it fell below 90%.
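The consistency measure is straightforward to compute. In the sketch below, the function name `consistency` is an assumption, and `"OTHER"` is a placeholder for the four P8 runs that did not return the modal recommendation. The overall figure is the mean of the Consistency column in the table above.

```python
from collections import Counter


def consistency(recommendations):
    """Share of runs that returned the most common (modal) recommendation."""
    modal_count = Counter(recommendations).most_common(1)[0][1]
    return modal_count / len(recommendations)


# P8: 16 of 20 runs returned the modal recommendation.
p8_runs = ["REVIEW"] * 16 + ["OTHER"] * 4  # "OTHER" is a placeholder label
p8 = consistency(p8_runs)  # 0.8

# Consistency column from the table, P1 through P10.
table_pct = [100, 95, 95, 95, 84, 100, 100, 80, 80, 100]
overall = round(sum(table_pct) / len(table_pct), 1)  # 92.9, reported as 93%
```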
Three dashboards, three runs each
Three dashboards from different domains, each assessed three times with the two-pass scoring engine. Every score, every outcome, nothing excluded. Two dashboards produced identical scores across all three runs; one sits at the REWORK / REVISE boundary where the engine is inherently less certain.
Scope note: the four assessments above cover the canvas scorer. Dashboard certification accuracy against expert reviewers is part of the next phase of testing.
| Dashboard | Domain | Run 1 | Run 2 | Run 3 |
|---|---|---|---|---|
| Sales Performance | Regional sales tracking | 77 | 77 | 77 |
| Customer Retention | Customer success | 80 | 80 | 80 |
| Workforce Operations | People & workforce analytics | 59 | 59 | 56 |
REWORK is Meridian's lowest certification outcome. REVISE, REFINE, and CERTIFIED follow in ascending order. The two-pass engine (June 2026) runs three parallel score-only calls and takes the median per sub-criterion, separating the score from the prose generation that previously caused drift. Two of three dashboards returned identical scores on every run.
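Taking the median per sub-criterion across three score-only calls can be sketched as follows. The sub-criterion names, the scores, and the function name `median_scores` are hypothetical; the real call format is not described on this page.

```python
from statistics import median


def median_scores(calls):
    """Combine score-only calls by taking the median per sub-criterion.

    calls: list of dicts mapping sub-criterion name to score, one dict per call.
    """
    return {name: median(call[name] for call in calls) for name in calls[0]}


# Three hypothetical parallel calls for one dashboard.
calls = [
    {"clarity": 7, "accuracy": 8},
    {"clarity": 7, "accuracy": 6},
    {"clarity": 9, "accuracy": 7},
]
combined = median_scores(calls)  # {"clarity": 7, "accuracy": 7}
```

With three calls, the median discards a single outlying score on any sub-criterion, so one unusual call cannot move the result on its own.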
Where we fell short, and what we are doing about it
Starter model missed its accuracy target
The pre-registered directional bar was 85%. Starter (Haiku) came in at 81.1%. It passed the critical test: no strong canvas was read as weak, and no weak canvas as strong. But it did not clear its own bar, and we are not moving the bar.
We are calibrating the Starter model's scoring prompts against the same 53-canvas set, and we will re-run the full test before claiming directional parity with Plus.
Canvas bands were set by one reviewer
The expected band for each of the 53 canvases was assigned by an experienced reviewer before any model run. This proves the engine generalises out-of-sample. It does not yet prove agreement with a panel of expert humans.
The next phase records band judgements from multiple expert reviewers on real beta canvases, without seeing Meridian's scores first. Results will appear on this page.
What is being tested next
The next phase brings in real expert reviewers assessing real beta canvases, without seeing Meridian's scores first. Their judgements are then compared against the engine using the agreement measure from the independent model assessment.
Only after that phase passes can Meridian claim agreement with expert reviewers. Until then, the headline above stands as it is: the engine generalises, it responds to the right defects, and an independent model agrees with it.
When the human-accuracy number is ready, it will appear on this page first.