Scoring Confidence

Same canvas, same answer. New canvas, still the right answer.

AI scoring earns trust by holding up on inputs it has never seen, not by repeating itself on inputs it already knows. We tested for both. Here is what we found.

The Promise

What a scoring engine has to do

When a business case is submitted for assessment, three things have to be true. The score has to mean something, not just look like it does. The engine has to recognise the same kind of problem on a canvas it has never seen before. And when one thing on a strong canvas changes, the score has to move on the right dimension, not random ones. That is the bar Meridian was tested against. The numbers below are what came back.

The Headline

By the Numbers

0

Two-step errors on 53 canvases it had never seen

Across 53 newly authored canvases, on both the Starter and Plus models, the engine never confused a strong canvas with a weak one. No PROCEED to REWORK flips. No REWORK to PROCEED flips. This is the result that carries the trust story.

40 / 40

Deliberate flaws introduced. Noticed every time.

We broke one part of a strong canvas at a time and re-ran the assessment. Every single time, the matching quality dimension moved the right way. The engine responds to the actual defect, not to noise.

0.80

Agreement with an independent model

A second AI, built by a different company with different training, assessed the same canvases against the same rubric. On a 0-to-1 agreement scale, the two systems scored 0.80. Two engines, no coordination, substantially aligned.

The bands these numbers were tested against were locked in before the runs started. Nothing was tuned to the results.

The Limit

What this page does not claim

These results show that the engine generalises, that it responds to the right defects, and that an independent model agrees with it. They do not yet show accuracy against expert human reviewers. That is a separate test, running next on real beta canvases. No “validated against experts” number will appear on this page until that work is complete.

Methodology

How we tested

The scoring engine was assessed four ways: on canvases it had never seen, on canvases with one thing deliberately broken, by an independent model from a different vendor, and against a locked set of boundary tests in continuous integration. A separate 200-run stability check tested whether the same input keeps returning the same output. Each assessment answers a different question.

Pre-registered before any model run

Expected bands, pass thresholds, and the catastrophic-error gate were all written down before any canvas was scored. Results below are reported as-is, with no tuning to fit the pre-registered bar.

Assessment 1

Assessment on unseen canvases

53 newly authored business case canvases, none drawn from any set the engine had been calibrated against. Each scored once per tier model. The pre-registered bar was a soft floor of 85% directional agreement (proceed vs. rework), and a hard gate of zero two-step misses (no strong canvas read as weak, no weak canvas read as strong).
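A minimal sketch of how the two pre-registered bars could be checked. It assumes directional agreement is the share of canvases whose scored band equals the expected band, and that the bands are ordered REWORK, REVIEW, PROCEED; the page does not spell out the formula, and the function and field names here are illustrative rather than Meridian's own.

```python
from typing import Sequence

# Assumed band order, lowest to highest.
BANDS = ["REWORK", "REVIEW", "PROCEED"]


def band_distance(expected: str, actual: str) -> int:
    """Number of band steps between the expected and the scored band."""
    return abs(BANDS.index(expected) - BANDS.index(actual))


def evaluate(expected: Sequence[str], actual: Sequence[str],
             soft_floor: float = 0.85) -> dict:
    """Check one model's results against the soft floor and the hard gate."""
    if not expected or len(expected) != len(actual):
        raise ValueError("expected and actual must be non-empty and equal length")
    distances = [band_distance(e, a) for e, a in zip(expected, actual)]
    agreement = sum(d == 0 for d in distances) / len(distances)
    two_step_misses = sum(d >= 2 for d in distances)
    return {
        "agreement": agreement,
        "adjacent_misses": sum(d == 1 for d in distances),
        "two_step_misses": two_step_misses,
        "passes_floor": agreement >= soft_floor,  # soft floor: 85%
        "passes_gate": two_step_misses == 0,      # hard gate: zero two-step misses
    }
```

Computed this way, 48 matches out of 53 gives 90.6% and 43 out of 53 gives 81.1%, which is consistent with the figures reported for the two models below.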

Plus (Sonnet)

90.6%

Directional

0

Two-step misses

Passed every pre-registered bar. The model we recommend for any canvas where the result will be acted on.

Starter (Haiku)

81.1%

Directional

0

Two-step misses

Missed the 85% directional floor but passed the catastrophic-error gate. Runs slightly more generous at the REVIEW / PROCEED boundary: a measured difference, not a manufactured one.

Plus tier: 90.6% directional agreement. Coming soon.

The Plus tier uses Claude Sonnet, which cleared every pre-registered bar. The paid tier is not yet open. Request access to be first in line when it is.


Every miss was adjacent (one band away), clustered at the REVIEW / PROCEED boundary where the bands are inherently fuzzy. Haiku tipped four moderate canvases into PROCEED that Sonnet correctly held at REVIEW: a top-boundary generosity, not a catastrophic confusion.

Assessment 2

Assessment sensitivity

We took four strong canvases and deliberately broke them, one element at a time: remove the measurable target, make the risk register generic, disconnect the success criteria from the original problem. Each change was tested five times per model. 40 tests in total. The question: when one specific thing on a canvas gets worse, does Meridian notice the right thing?

Defect introduced | Target dimension | Result | Notes
Strip the quantitative target | Measurability | 8 / 8 | Perfect isolation. Removing the measurable outcome always moves Measurability the most.
Observational morning-after | Audience Clarity | 7 / 8 | Clean. The one near-miss still detected the defect; another dimension moved fractionally more on one Sonnet run.
Trivial test on the wrong assumption | Risk and Test Alignment | 8 / 8 | The strongest signal of all. The engine heavily penalises a test that misses the riskiest assumption.
Generic risks | Risk and Test Alignment | 6 / 8 | Clean on Sonnet. On Haiku the signal is directionally right and the biggest mover, but inside that dimension’s noise band.
Decouple success from problem | Canvas Coherence | 4 / 8 | Coherence dropped detectably every time. Measurability co-moved, because rewriting success criteria changes both dimensions. A rubric-structure fact, not an insensitivity.

Every one of the 40 cells moved the target dimension in the correct direction. 33 of 40 also cleared the stricter bar of being detectable (above twice the standard error) and being the biggest mover on that canvas. The four near-misses on the success-criteria perturbation reflect that rewriting success criteria touches two rubric dimensions, not an insensitivity in the engine.
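The two bars described above (correct direction, and the stricter "detectable and biggest mover") can be sketched as follows. This is an illustration under stated assumptions, not Meridian's test harness: it assumes each cell has repeated baseline and perturbed scores per dimension, that a defect should lower the target dimension, and that the standard error is that of a difference between two independent means.

```python
from math import sqrt
from statistics import mean, stdev


def classify_cell(baseline: dict[str, list[float]],
                  perturbed: dict[str, list[float]],
                  target: str) -> dict:
    """baseline / perturbed map a dimension name to its repeated scores."""
    # Mean movement of every dimension after the defect is introduced.
    deltas = {dim: mean(perturbed[dim]) - mean(baseline[dim]) for dim in baseline}
    target_delta = deltas[target]

    # Standard error of the difference in means on the target dimension.
    b, p = baseline[target], perturbed[target]
    se = sqrt(stdev(b) ** 2 / len(b) + stdev(p) ** 2 / len(p))

    correct_direction = target_delta < 0           # the defect should lower the score
    detectable = abs(target_delta) > 2 * se        # above twice the standard error
    biggest_mover = max(deltas, key=lambda d: abs(deltas[d])) == target
    return {
        "correct_direction": correct_direction,
        "detectable": detectable,
        "biggest_mover": biggest_mover,
        "strict_pass": correct_direction and detectable and biggest_mover,
    }
```

A cell where the target dimension drops detectably but another dimension drops further would pass the direction check and fail the strict bar, which is the pattern reported for the success-criteria perturbation.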

Assessment 3

Independent model assessment

A model from a different vendor lineage (OpenAI gpt-4.1) was given the same production rubric and the same 53 canvases, and asked to score them independently. The two engines were then compared, band by band.

0.80

Cross-vendor agreement score

71.8%

Exact-band agreement

n = 39

Canvases compared

On the Landis and Koch scale commonly used for kappa-type agreement statistics, 0.61 to 0.80 counts as “substantial agreement”, so 0.80 sits at the top of that band; the same scale is widely used in medical reliability studies. The OpenAI model ran slightly generous at the top boundary, but the overall ordering agreed strongly. This is AI compared against AI, not AI compared against humans: it is the evidence before the expert-reviewer phase, not a replacement for it.
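For readers who want to see how two such numbers can be computed, here is a hedged sketch. The page does not name the statistic behind the 0-to-1 agreement score; the linearly weighted Cohen's kappa below is an assumption chosen because it rewards near-misses between ordered bands, and the band order is likewise assumed.

```python
from collections import Counter
from typing import Sequence

# Assumed band order, lowest to highest.
BANDS = ["REWORK", "REVIEW", "PROCEED"]


def exact_agreement(a: Sequence[str], b: Sequence[str]) -> float:
    """Share of canvases where both raters chose the same band."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def linear_weighted_kappa(a: Sequence[str], b: Sequence[str],
                          categories: Sequence[str] = BANDS) -> float:
    """Cohen's kappa with linear disagreement weights over ordered bands."""
    k, n = len(categories), len(a)
    idx = {c: i for i, c in enumerate(categories)}

    def weight(i: int, j: int) -> float:
        # 0 for the same band, 1 for bands at opposite ends of the scale.
        return abs(i - j) / (k - 1)

    observed = sum(weight(idx[x], idx[y]) for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    # Disagreement expected by chance, from each rater's marginal band frequencies.
    expected = sum(
        (count_a[ci] / n) * (count_b[cj] / n) * weight(idx[ci], idx[cj])
        for ci in categories for cj in categories
    )
    if expected == 0:
        raise ValueError("kappa is undefined when both raters use a single band")
    return 1 - observed / expected
```

Exact-band agreement is the simpler figure: 28 matching bands out of 39 canvases would give 71.8%. A chance-corrected statistic such as kappa is stricter, because it discounts the agreement two raters would reach by guessing from their own band frequencies.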

Assessment 4

Boundary regression tests

19 automated checks that run every time Meridian's code changes. They verify that scoring thresholds, the five-dimension weighting formula, and nine boundary canvases all produce the same results as before. If any future change to the scoring logic shifts these outputs, the build fails before it ships.

19 / 19 automated checks passing

A guardrail against silent scoring drift between releases.

Consistency

200 runs, 10 profiles: same input, same answer

Generalisation tells you whether the engine handles canvases it has not seen. Determinism tells you whether it handles the same canvas the same way twice. Both have to be true. The 200 runs below answer the second question: 10 business case canvases, each scored 20 times with identical inputs, across six business domains. Each canvas is a profile, labelled P1 through P10. Two were placed deliberately at the scoring boundaries.

Behaviour at the boundaries

P7 (Operations, borderline low) returned the same recommendation across all 20 runs. P8 (HR, borderline high) returned the same recommendation in 16 of 20 runs. Four runs returned a different outcome, all in the same band, none crossing a threshold. We include it because hiding the hard result would make the 93% figure meaningless. When a canvas scores in the borderline zone, treat the recommendation as indicative and use the section-level feedback to resolve the uncertainty.

Profile | Description | Avg | Consistency | Outcome
P1 | Weak | 20.1 | 100% | REWORK
P2 | Minimal | 55.9 | 95% | REVIEW
P3 | Moderate | 67.6 | 95% | REVIEW
P4 | Strong | 80.1 | 95% | PROCEED
P5 | Excellent | 81.4 | 84% | PROCEED
P6 | Weak, critical flags | 12.5 | 100% | REWORK
P7 (boundary) | Near-threshold low | 55.5 | 100% | REVIEW
P8 (boundary) | Near-threshold high | 71.1 | 80% | REVIEW
P9 | Strong | 79.9 | 80% | PROCEED
P10 | Moderate | 62.9 | 100% | REVIEW

Consistency: percentage of 20 runs returning the modal recommendation for that profile. The directional measure (proceed vs. rework) held at 100% across all 200 runs. Exact recommendation match was 93% overall, with P5, P8, and P9 the profiles where it fell below 90%.
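The consistency measure defined above is simple to reproduce. A minimal sketch, assuming each profile's runs are available as a list of recommendation labels (the function name is illustrative):

```python
from collections import Counter
from typing import Sequence


def modal_consistency(recommendations: Sequence[str]) -> tuple[str, float]:
    """Return the most common recommendation and the share of runs that returned it."""
    if not recommendations:
        raise ValueError("need at least one run")
    modal, count = Counter(recommendations).most_common(1)[0]
    return modal, count / len(recommendations)
```

A profile where 16 of 20 runs return the same recommendation scores 80%. Because every profile has the same number of runs, the overall exact-match rate is the plain average of the per-profile rates: averaging the Consistency column gives 92.9%, which rounds to the 93% quoted.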

Dashboard Stability

Three dashboards, three runs each

Three dashboards from different domains, each assessed three times with the two-pass scoring engine. Every score, every outcome, nothing excluded. Two dashboards produced identical scores across all three runs; one sits at the REWORK / REVISE boundary where the engine is inherently less certain.

Scope note: the four assessments above cover the canvas scorer. Dashboard certification accuracy against expert reviewers is part of the next phase of testing.

Dashboard | Domain | Type | Run 1 | Run 2 | Run 3 | Certification
Sales Performance | Regional sales tracking | Insights | 77 | 77 | 77 | REFINE (all 3)
Customer Retention | Customer success | Insights | 80 | 80 | 80 | REFINE (all 3)
Workforce Operations | People & workforce analytics | Operational | 59 | 59 | 56 | REWORK (all 3)

Borderline case: Workforce Operations sits at the REWORK / REVISE threshold (≈60), with a spread of 3 points over 3 runs.

REWORK is Meridian's lowest certification outcome. REVISE, REFINE, and CERTIFIED follow in ascending order. The two-pass engine (June 2026) runs three parallel score-only calls and takes the median per sub-criterion, separating the score from the prose generation that previously caused drift. Two of three dashboards returned identical scores on every run.
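The median step can be sketched in a few lines. This is an illustration only: it assumes each score-only call returns a mapping from sub-criterion name to score, and the sub-criterion names used in any example are hypothetical. How the sub-criteria combine into the final dashboard score is not described on this page and is left out.

```python
from statistics import median


def aggregate_scores(calls: list[dict[str, float]]) -> dict[str, float]:
    """Take the median of each sub-criterion across the score-only calls."""
    if not calls:
        raise ValueError("need at least one score-only call")
    keys = calls[0].keys()
    if any(call.keys() != keys for call in calls):
        raise ValueError("every call must score the same sub-criteria")
    # With three calls, the median discards one outlier per sub-criterion.
    return {key: median(call[key] for call in calls) for key in keys}
```

With three calls, a single stray score on any sub-criterion is dropped rather than averaged in, which is one plausible reason a median damps run-to-run drift better than a mean.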

Honest Deviations

Where we fell short, and what we are doing about it

Shortfall

Starter model missed its accuracy target

The pre-registered directional bar was 85%. Starter (Haiku) came in at 81.1%. It passed the critical test: no strong canvas was read as weak, and no weak canvas as strong. But it did not clear its own bar, and we are not moving the bar.

What we are doing

Calibrating the Starter model's scoring prompts against the same 53-canvas set. We will re-run the full test before claiming directional parity with Plus.

Shortfall

Canvas bands were set by one reviewer

The expected band for each of the 53 canvases was assigned by a single experienced reviewer before any model run. This shows the engine generalises out-of-sample against that reviewer's judgement. It does not yet show agreement with a panel of expert humans.

What we are doing

The next phase records band judgements from multiple expert reviewers on real beta canvases, without seeing Meridian's scores first. Results will appear on this page.

What is being tested next

The next phase brings in real expert reviewers assessing real beta canvases, without seeing Meridian's scores first. Their judgements are then compared against the engine using the same agreement measure used in Assessment 3.

Only after that phase passes can Meridian claim agreement with expert reviewers. Until then, the headline above stands as it is: the engine generalises, it responds to the right defects, and an independent model agrees with it.

When the human-accuracy number is ready, it will appear on this page first.
