Agent Outcomes dashboard
The Agent Outcomes dashboard answers two questions about your org’s coding-agent spend: what did it cost, and did the work ship? Every Claude Code session run with the cardinal-claude-plugin is attributed to an engineer, a git branch, a pull request, and an initiative (one branch = one initiative), then classified by outcome: merged, in flight, lost (closed or gone stale without merging), or ad hoc (research and diagnostic work that was never headed to a PR).
It works the same on both deployments:
- Cardinal Cloud (SaaS): app.cardinalhq.io, under the Outcomes dashboard.
- Self-hosted Maestro: the same dashboard on your Maestro host; session telemetry flows through your own Lakerunner.
Members see their own sessions; org owners see everyone’s.
Reach out to support@cardinalhq.io for support or to ask questions not answered in our documentation.
Reading the dashboard
The page is one scrolling view: a KPI strip (total spend, sessions, and the merged / in-flight / lost / ad-hoc split), spend-over-time with anomaly markers, an activity heatmap, then panels by user, initiative, repo, pull request, branch, and finally the full session ledger. Click any initiative, PR, or session to open its case file — a drawer with the spend breakdown, the chronological session ledger, and (for initiatives) an LLM-authored TL;DR of the specs its PRs touched.
Two badges flag sessions worth a second look:
- bloated: the session carried unusually heavy context per turn (cache-read tokens per tool call well above the fleet median). The tooltip shows the estimated dollars spent re-reading context above the baseline, and the fix: /clear between sub-tasks, or split the work into focused sessions.
- ⚠ drift: the session got expensive across unrelated bodies of work; a split would have been cheaper.
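As a rough illustration of how a bloated flag and its dollar estimate could be derived, the sketch below compares a session's cache-read tokens per tool call against the fleet median. The 2x multiplier, the per-token price, and every name in the code are assumptions made for the example; the dashboard's actual cutoffs and pricing are not documented here.

```python
from dataclasses import dataclass
from statistics import median

# Hypothetical values, not the dashboard's real settings.
BLOAT_MULTIPLIER = 2.0            # "well above the fleet median" modeled as 2x
CACHE_READ_PRICE_PER_MTOK = 0.30  # assumed dollars per million cache-read tokens


@dataclass
class Session:
    session_id: str
    cache_read_tokens: int
    tool_calls: int

    @property
    def context_per_call(self) -> float:
        # Cache-read tokens carried per tool call; guard against zero calls.
        return self.cache_read_tokens / max(self.tool_calls, 1)


def bloat_report(session: Session, fleet: list[Session]) -> tuple[bool, float]:
    """Return (is_bloated, estimated dollars spent above the fleet baseline)."""
    baseline = median(s.context_per_call for s in fleet)
    is_bloated = session.context_per_call > BLOAT_MULTIPLIER * baseline
    # Tokens re-read beyond what a median session would have carried.
    excess_tokens = max(session.context_per_call - baseline, 0) * session.tool_calls
    excess_dollars = excess_tokens / 1_000_000 * CACHE_READ_PRICE_PER_MTOK
    return is_bloated, round(excess_dollars, 2)
```

With a fleet whose median is 200,000 cache-read tokens per call, a session averaging 1,000,000 per call over 10 calls would be flagged, and the 8,000,000 excess tokens would be priced at the assumed rate.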
Comparing sessions and engineers
Anywhere you see a ⇄ button you can compare two entities side by side: on session rows, inside a session’s case file (including one-click comparison against sibling sessions on the same branch), on the per-user cards (owners only), and on an initiative’s contributor rows (which opens the engineer diff pre-scoped to that initiative). Click ⇄ once to arm the comparison, then pick the second entity; the result is a full-screen diff with a shareable URL.
The diff is built from computed facts, not generated prose:
- a verdict line (“B cost 4.3× A — most of the gap is context carried above the baseline”),
- decision triggers: deterministic rules that fire only when a threshold is crossed, each showing the exact numbers it fired on and ending in an action (set a spend limit, /clear between themes, write down an effort-tier policy),
- an aligned A | B | Δ ledger, token-composition bars, and theme lanes showing where each session’s dollars went (from stored session summaries),
- an optional narrative section — an LLM translation of the numbers above it. It can only restate and corroborate: every sentence cites the facts or stored summaries it draws on, and uncited sentences are discarded server-side. When the stored work summaries weaken a fired trigger, the trigger card is visually demoted (never hidden).
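To make "computed facts, not generated prose" concrete, here is a minimal sketch of a verdict line and a deterministic decision trigger. The `Trigger` type, the metric name, and the threshold used in the usage note are hypothetical; the real rules and their cutoffs are not listed in this document.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Trigger:
    name: str
    metric: str       # key looked up in the computed facts
    threshold: float  # hypothetical cutoff for this example
    action: str       # what the card tells the reader to do


def evaluate(trigger: Trigger, facts: dict[str, float]) -> Optional[str]:
    """Return a trigger card only when the threshold is crossed, else None."""
    value = facts[trigger.metric]
    if value < trigger.threshold:
        return None
    # The card shows the exact numbers it fired on and ends in an action.
    return (f"{trigger.name}: {trigger.metric} = {value:g} "
            f"(threshold {trigger.threshold:g}). Action: {trigger.action}")


def verdict(cost_a: float, cost_b: float) -> str:
    """Build the headline ratio from the two totals."""
    return f"B cost {cost_b / cost_a:.1f}× A"
```

For example, `verdict(10.0, 43.0)` yields "B cost 4.3× A", and a trigger with a threshold of 0.25 on a bloated-session rate stays silent at 0.1 but produces a card at 0.4. Because the same facts always produce the same cards, any narrative layered on top has something fixed to cite.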
The engineer diff compares working styles, not workloads: median session cost, median context per turn, bloated-session rate with recoverable dollars, xhigh-effort spend share, and lost spend. A scope selector narrows both sides to a shared initiative or repo so totals become comparable too. Style rules only fire at n ≥ 10 sessions per side — below that the diff says which rules were suppressed instead of staying silent — and a work-mix overlap score flags when the two engineers simply do different kinds of work.
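The sample-size gate described above might look like the following sketch. Only the n ≥ 10 minimum comes from the text; the function name, the rule names, and the shape of the result are illustrative assumptions.

```python
from statistics import median

MIN_SESSIONS_PER_SIDE = 10  # from the text: style rules only fire at n >= 10


def style_summary(costs_a: list[float], costs_b: list[float],
                  rule_names: list[str]) -> dict:
    """Summarize both sides and report which style rules were suppressed."""
    eligible = min(len(costs_a), len(costs_b)) >= MIN_SESSIONS_PER_SIDE
    return {
        "median_cost_a": median(costs_a),
        "median_cost_b": median(costs_b),
        # Rules are either evaluated or listed as suppressed, never dropped silently.
        "evaluated": list(rule_names) if eligible else [],
        "suppressed": [] if eligible else list(rule_names),
    }
```

Medians are still reported when one side has too few sessions; only the rules that would draw conclusions from them are withheld and named.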
When the two sides aren’t honestly comparable — different models, effort tiers, a 10×+ scope difference, or low work-mix overlap — the diff says so in a banner rather than letting the totals mislead.
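A comparability check of this kind could be sketched as below. The 10× ratio comes from the text, but treating "scope" as total spend, scoring work-mix overlap as the shared fraction of per-theme spend, and the 0.3 overlap floor are all assumptions made for illustration.

```python
from dataclasses import dataclass

SCOPE_RATIO_LIMIT = 10.0  # from the text: a 10x+ scope difference
MIN_OVERLAP = 0.3         # hypothetical floor for work-mix overlap


@dataclass
class Side:
    model: str
    effort_tier: str
    total_spend: float
    theme_share: dict[str, float]  # fraction of spend per theme, summing to ~1


def work_mix_overlap(a: dict[str, float], b: dict[str, float]) -> float:
    """Shared fraction of spend across themes: 1.0 is identical, 0.0 is disjoint."""
    return sum(min(a.get(t, 0.0), b.get(t, 0.0)) for t in set(a) | set(b))


def comparability_warnings(a: Side, b: Side) -> list[str]:
    """Collect the reasons a banner should be shown; empty means comparable."""
    warnings = []
    if a.model != b.model:
        warnings.append("different models")
    if a.effort_tier != b.effort_tier:
        warnings.append("different effort tiers")
    lo, hi = sorted([a.total_spend, b.total_spend])
    if hi > 0 and (lo == 0 or hi / lo >= SCOPE_RATIO_LIMIT):
        warnings.append("10x+ scope difference")
    if work_mix_overlap(a.theme_share, b.theme_share) < MIN_OVERLAP:
        warnings.append("low work-mix overlap")
    return warnings
```

Returning the reasons rather than a single boolean lets the banner say why the totals should not be read at face value.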
There is deliberately no initiative-vs-initiative diff: two initiatives are single tasks with no shared denominator, so every delta just restates “the work was different.” Initiative-level health renders directly on the initiative’s case file instead — a Spend health section (thrashing, overhead spend, lost spend triggers, and the $ split by outcome state) and a “where the money went” table of spend by theme.
Spend limits
Every panel has a gear (⚙) for spend-limit policies at four scopes: session, initiative, engineer, and PR, each with a window (day / week / month / lifetime) and an action (notify, warn, or block). Limits are evaluated as agent sessions run; the plugin surfaces warnings directly in Claude Code. Policies are owner-managed, and every limit records who set it. Compare-view triggers suggest caps where the data supports one (for example, a thrashing initiative gets a suggested cap of current spend +20%).
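One way to picture a spend-limit policy is as a small record plus a check, as in the sketch below. The scopes, windows, actions, and the +20% suggestion come from the text; the field names, the `check` function, and the choice to trigger at spend equal to the limit are assumptions, not the product's schema.

```python
from dataclasses import dataclass
from typing import Literal, Optional

Scope = Literal["session", "initiative", "engineer", "pr"]
Window = Literal["day", "week", "month", "lifetime"]
Action = Literal["notify", "warn", "block"]


@dataclass(frozen=True)
class SpendLimit:
    scope: Scope
    window: Window
    limit_usd: float
    action: Action
    set_by: str  # every limit records who set it


def check(limit: SpendLimit, spend_in_window_usd: float) -> Optional[Action]:
    """Return the configured action once spend reaches the limit, else None."""
    if spend_in_window_usd >= limit.limit_usd:
        return limit.action
    return None


def suggested_cap(current_spend_usd: float) -> float:
    """Suggested cap for a thrashing initiative: current spend +20%."""
    return round(current_spend_usd * 1.20, 2)
```

For an initiative that has spent $50.00 so far, `suggested_cap(50.0)` gives a cap of $60.00, which an owner could then save as an initiative-scoped lifetime limit.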
Requirements
- The cardinal-claude-plugin installed in engineers’ Claude Code environments (see Connect AI clients); it attributes sessions to branches and initiatives. Branch names following <type>/<kebab-name> (feat/, fix/, refactor/, infra/, research/) classify the initiative type.
- A connected GitHub integration, for PR outcome classification (merged / open / closed).
- Lakerunner, which stores the session telemetry the dashboard reads.
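The branch-naming convention in the first requirement can be checked locally with a short script. This is a sketch under assumptions: the regular expression is one strict reading of <type>/<kebab-name> (lowercase letters and digits separated by single hyphens), and the plugin's actual matching rules may be more lenient.

```python
import re
from typing import Optional

# The five prefixes come from the text; the kebab-case pattern is an assumption.
BRANCH_PATTERN = re.compile(
    r"^(feat|fix|refactor|infra|research)/([a-z0-9]+(?:-[a-z0-9]+)*)$"
)


def parse_branch(branch: str) -> Optional[tuple[str, str]]:
    """Return (initiative_type, name) for a conforming branch, else None."""
    match = BRANCH_PATTERN.match(branch)
    if match is None:
        return None
    return match.group(1), match.group(2)
```

Here `parse_branch("feat/add-login")` returns `("feat", "add-login")`, while `parse_branch("main")` returns `None`, meaning no initiative type could be inferred from the name.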