A cost model earns trust the same way a bridge does. Not because it looks finished, but because it holds under load and you can show your work. The Profitability Trust Score grades a model across seven dimensions, returns a number between 0 and 100, and treats 75 as the line below which the model is not yet safe to make decisions with.

This post explains the seven, why there are seven and not three or twelve, and why we are comfortable saying the framework is not finished.

Why seven, and not a single grade?

A single grade hides where the problem is. We tried it. You end up with a model scoring 68 and a finance director asking, reasonably: sixty-eight of what? The seven dimensions exist so the score points at the weak joint rather than waving at the whole structure.

We landed on seven after reviewing where models actually break across the engagements we have run. Fewer dimensions blurred distinct failures together. More dimensions split hairs that no one acted on. Seven is where the categories stopped overlapping and started being useful in a review meeting.

There is also a practical reason. Seven is roughly the number of distinct concerns a finance team can hold in mind while they argue about a model. Past that, the conversation fragments and people start defending their favourite metric instead of looking at the model whole. The dimensions are a checklist, but they are also a shared vocabulary, and a vocabulary stops working once it has too many words for the thing it describes.

What does each dimension actually catch?

Data quality. Whether the inputs are complete, current, and reconciled to a source you can name. Most weak models are weak here first. If the general ledger says one thing and the model’s cost pools say another, nothing downstream is trustworthy.

Traceability. Whether you can follow a number from the output back to its source without a leap of faith. A cost per unit that cannot be traced to drivers, rates, and quantities is a guess wearing a decimal point.

Allocation logic. Whether costs reach products, services, and customers through drivers that reflect real consumption rather than convenience. This is where most of the judgement lives, and where AI-built models tend to reach for the nearest plausible driver instead of the right one.

Drift. Whether the model still describes the business it was built for. Volumes move, processes change, a product line gets dropped. A model that was right in January and never revisited is quietly wrong by June.

Bias. Whether the structure systematically flatters or punishes particular products, channels, or customers. Allocation choices carry bias even when no one intends it. The dimension exists to make that bias visible rather than baked in.

Robustness. Whether the answers stay sensible when you stress the assumptions. Push a rate up by a fifth, drop a volume, and see whether the margins move in directions a sane person would expect.

Link to outcomes (reconciliation). Whether the model ties back to reality: total allocated cost reconciling to actual cost, modelled margin reconciling to reported margin. A model that does not reconcile is an opinion, however elegant.
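
Two of these checks, robustness and reconciliation, are mechanical enough to sketch in a few lines. What follows is a toy illustration, not how the Trust Score itself measures either dimension: the cost pool, the machine-hours driver, the revenue figures, and every function name are hypothetical. It allocates one overhead pool by a single driver, confirms the allocation ties back to the actual cost, then raises the pool by a fifth and looks at which margins move and which change sign.

```python
from math import isclose

# Hypothetical figures, for illustration only.
overhead_pool = 120_000.0                              # actual cost to be allocated
machine_hours = {"A": 300.0, "B": 500.0, "C": 200.0}   # assumed driver quantities
revenue = {"A": 60_000.0, "B": 70_000.0, "C": 30_000.0}


def allocate(pool, driver):
    """Spread a cost pool across products in proportion to a driver."""
    total = sum(driver.values())
    return {product: pool * qty / total for product, qty in driver.items()}


def margins(pool, driver, revenue):
    """Revenue minus allocated cost, per product."""
    cost = allocate(pool, driver)
    return {product: revenue[product] - cost[product] for product in revenue}


def reconciles(allocated, actual, rel_tol=1e-9):
    """True if total allocated cost ties back to the actual cost."""
    return isclose(sum(allocated.values()), actual, rel_tol=rel_tol)


def flipped(base, stressed):
    """Products whose margin changes sign under stress."""
    return sorted(p for p in base if (base[p] >= 0) != (stressed[p] >= 0))


# Reconciliation: allocated cost must add back up to the pool.
ties_out = reconciles(allocate(overhead_pool, machine_hours), overhead_pool)

# Robustness: raise the pool (and so the rate per hour) by a fifth.
base = margins(overhead_pool, machine_hours, revenue)
stressed = margins(overhead_pool * 1.2, machine_hours, revenue)

print("reconciles:", ties_out)
print("margins that change sign under stress:", flipped(base, stressed))
```

With these made-up numbers every margin falls when cost rises, which is the direction a sane person would expect, but product B goes from profit to loss. That is the kind of thin spot worth naming before anyone relies on B's margin.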

Why is the threshold 75 and not 80 or 60?

Thresholds are judgement calls, and we will say so plainly. We set 75 because below it we have consistently seen at least one dimension weak enough to mislead a real decision, and above it the remaining gaps tend to be refinements rather than faults. It is deliberately not 90. A model does not need to be perfect to be useful. It needs to be honest about where it is thin, and strong enough that the thin parts will not flip a conclusion.

Seventy-five is a floor for decisions, not a finish line. Plenty of models we are happy with sit in the high seventies and have a known, documented soft spot the client has agreed to live with. What matters is that the soft spot is named and bounded, not hidden. A model scoring 82 with one weak dimension everyone understands is safer to use than a model scoring 88 whose weakness no one has located.
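
A small sketch makes the same point in code: report the weakest dimension next to the total, so the soft spot is named rather than averaged away. The aggregation here is an assumption. This post does not say how the seven dimensions combine into the total, so the sketch uses a plain unweighted mean purely for illustration, and the dimension keys, the function name, and the example scores are all hypothetical.

```python
THRESHOLD = 75  # the decision floor described above

# Hypothetical keys for the seven dimensions.
DIMENSIONS = (
    "data_quality",
    "traceability",
    "allocation_logic",
    "drift",
    "bias",
    "robustness",
    "reconciliation",
)


def summarise(scores, threshold=THRESHOLD):
    """Return the total alongside the weakest dimension.

    ASSUMPTION: the total is an unweighted mean of the seven dimension
    scores. The real weighting is not stated in this post.
    """
    missing = [d for d in DIMENSIONS if d not in scores]
    if missing:
        raise ValueError(f"missing dimension scores: {missing}")
    total = sum(scores[d] for d in DIMENSIONS) / len(DIMENSIONS)
    weakest = min(DIMENSIONS, key=lambda d: scores[d])
    return {
        "total": round(total, 1),
        "weakest": weakest,
        "weakest_score": scores[weakest],
        "safe_for_decisions": total >= threshold,
    }


# Made-up scores: a model that clears the floor with one known soft spot.
example = {
    "data_quality": 88,
    "traceability": 85,
    "allocation_logic": 84,
    "drift": 62,
    "bias": 86,
    "robustness": 84,
    "reconciliation": 85,
}

summary = summarise(example)
print(summary)
```

Under these assumptions the example totals 82 and clears the floor, while the summary still points straight at drift as the dimension to document and bound.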

Will the seven change?

Probably. We treat the framework as a working instrument, not scripture. As more AI-built models come through validation, we expect the failure patterns to shift, and the dimensions should shift with them. If a category stops earning its place, we will retire it. If a new failure mode turns out to be common enough, we will name it. The number 75 may move too, though we would want a good reason and a paper trail before touching it.

Takeaway: treat the score as a map of where a model is weak, not a medal. Read the seven dimensions before the total. If you want the full breakdown of how each one is measured, the detail lives at /ai-profitability/trust-score/.