Owner Runbook · Updated June 1, 2026

AI trading agent monitoring is the difference between one bad fill and one silent system failure.

Most operators know they need alerts. Fewer define what the alerts should mean, who acts on them, how fast a live account must be paused, and what evidence proves the system is actually flat, protected, and reconciled again. Monitoring is not dashboard art. It is the owner control plane after the agent goes live.

Short answer: monitor heartbeats, market-data freshness, order acknowledgements, stop status, exposure drift, and reconciliation mismatches; then map each alert to a forced pause, cancel, flatten, revoke, or recovery workflow that the owner can execute outside the model runtime.

Audience: owner + operator
Intent: live-system supervision
Future module: operator console and runbook

Laplace angle: this page sits after risk controls, after wallet and API-key design, and after shadow-mode evaluation. Once the system trades real capital, the next moat is whether the owner can see failure early enough to keep it small.

The six alerts that matter first

Heartbeat loss

If the strategy, signer, or venue adapter stops reporting, the system should assume supervision is degraded before it assumes trading is safe.
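
A minimal sketch of that rule, assuming each component writes a timestamp whenever it reports healthy. The component names, the `last_seen` mapping, and the 30-second budget are illustrative assumptions, not values from a specific deployment.

```python
from typing import Dict, List

# Hypothetical component names; use whatever your deployment actually runs.
EXPECTED = ("strategy", "signer", "venue_adapter")


def stale_components(last_seen: Dict[str, float], now: float,
                     max_age_s: float = 30.0) -> List[str]:
    """Return expected components whose heartbeat is missing or too old."""
    stale = []
    for name in EXPECTED:
        ts = last_seen.get(name)
        # A component that never reported is treated the same as a dead one.
        if ts is None or now - ts > max_age_s:
            stale.append(name)
    return stale


def supervision_ok(last_seen: Dict[str, float], now: float,
                   max_age_s: float = 30.0) -> bool:
    """Supervision counts as healthy only when every component is fresh."""
    return not stale_components(last_seen, now, max_age_s)
```

The check fails closed: a missing timestamp degrades supervision instead of being skipped.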

Stale data

If the market feed, position state, or funding data is old, the agent may still sound coherent while acting on expired context.
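
One way to make "old" measurable is to give each input its own age budget and compare age against it. The sketch below assumes hypothetical budgets in `MAX_AGE_S`; the right numbers depend on the strategy and venue.

```python
from typing import Dict

# Hypothetical age budgets in seconds; tune these per strategy and venue.
MAX_AGE_S = {"market_feed": 2.0, "position_state": 10.0, "funding": 300.0}


def freshness_ratios(ages_s: Dict[str, float]) -> Dict[str, float]:
    """Age divided by budget for each input; above 1.0 means expired.

    An input with no reported age is treated as infinitely old.
    """
    return {
        name: ages_s.get(name, float("inf")) / budget
        for name, budget in MAX_AGE_S.items()
    }


def entries_allowed(ages_s: Dict[str, float]) -> bool:
    """Block new entries as soon as any input exceeds its budget."""
    return all(ratio <= 1.0 for ratio in freshness_ratios(ages_s).values())
```

Using a ratio instead of a raw age lets a 20-second-old position snapshot rank as worse than a 150-second-old funding rate.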

Order ambiguity

"Submitted" is not enough. Alert when orders are unacknowledged, rejected, partially filled without follow-up, or missing a final state.

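The order-ambiguity conditions above can be sketched as a scan over tracked orders. The state names, the `Order` fields, and both timeouts are assumptions for illustration; real venues use their own state vocabularies.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Assumed terminal states; map your venue's vocabulary onto these.
TERMINAL = {"filled", "cancelled", "rejected", "expired"}


@dataclass
class Order:
    client_id: str
    state: str             # e.g. "submitted", "acknowledged", "partially_filled"
    submitted_at: float    # epoch seconds
    last_update_at: float  # epoch seconds of the last venue event


def ambiguous_orders(orders: List[Order], now: float,
                     ack_timeout_s: float = 5.0,
                     idle_timeout_s: float = 60.0) -> List[Tuple[str, str]]:
    """Return (client_id, reason) for every order that needs attention."""
    flagged = []
    for o in orders:
        idle = now - o.last_update_at
        if o.state == "rejected":
            flagged.append((o.client_id, "rejected"))
        elif o.state == "submitted" and now - o.submitted_at > ack_timeout_s:
            flagged.append((o.client_id, "unacknowledged"))
        elif o.state == "partially_filled" and idle > idle_timeout_s:
            flagged.append((o.client_id, "partial_without_follow_up"))
        elif o.state not in TERMINAL and idle > idle_timeout_s:
            flagged.append((o.client_id, "no_final_state"))
    return flagged
```
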
Stop or hedge gap

A live position without the intended stop, reduce-only order, or hedge protection is an incident, not a minor warning.
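
A sketch of a coverage check, assuming the venue exposes open orders with a reduce-only flag and a stop flag. The `Position` and `OpenOrder` shapes are hypothetical, and hedge coverage is left out to keep the example short.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Position:
    symbol: str
    size: float  # signed: positive is long, negative is short


@dataclass
class OpenOrder:
    symbol: str
    side: str          # "buy" or "sell"
    size: float        # unsigned quantity
    reduce_only: bool
    is_stop: bool


def unprotected_positions(positions: List[Position],
                          open_orders: List[OpenOrder]) -> List[Tuple[str, float]]:
    """Return (symbol, uncovered_size) for positions lacking full stop coverage."""
    gaps = []
    for p in positions:
        if p.size == 0:
            continue
        # A long is protected by sell stops, a short by buy stops.
        exit_side = "sell" if p.size > 0 else "buy"
        covered = sum(
            o.size for o in open_orders
            if o.symbol == p.symbol and o.side == exit_side
            and o.reduce_only and o.is_stop
        )
        if covered < abs(p.size):
            gaps.append((p.symbol, abs(p.size) - covered))
    return gaps
```

Any non-empty result is an incident under the rule above, so the caller should page the owner instead of logging a warning.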

Exposure drift

If intended size and actual size diverge, the system is trading a different risk book than the owner approved.
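
A minimal drift check, assuming intended and actual sizes are available as signed quantities per symbol. The 2% relative tolerance is an illustrative default, not a recommendation.

```python
from typing import Dict


def exposure_drift(intended: Dict[str, float], actual: Dict[str, float],
                   tolerance: float = 0.02) -> Dict[str, float]:
    """Return actual minus intended size for symbols drifting beyond tolerance.

    Symbols present on only one side are compared against zero, so an
    unexpected position shows up as drift instead of being ignored.
    """
    drifted = {}
    for symbol in sorted(set(intended) | set(actual)):
        want = intended.get(symbol, 0.0)
        have = actual.get(symbol, 0.0)
        scale = max(abs(want), abs(have))
        if scale == 0:
            continue  # flat on both sides, nothing to compare
        if abs(have - want) / scale > tolerance:
            drifted[symbol] = have - want
    return drifted
```

For example, an intended short of 2.0 ETH against an actual short of 3.0 ETH reports a drift of -1.0, and a position the strategy never asked for reports its full size.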

Reconciliation mismatch

If balances, fills, or positions do not match internal state, freeze trust in the dashboard until the account is reconciled independently.
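
A sketch of an independent reconciliation pass, assuming both sides can be reduced to plain dictionaries. The section names, the `fill_ids` key, and the numeric tolerance are hypothetical; the venue snapshot should come from the venue itself, not from the dashboard being checked.

```python
from typing import Dict, List


def reconcile(internal: Dict, venue: Dict, tol: float = 1e-9) -> List[str]:
    """Return human-readable mismatches between internal and venue state."""
    mismatches = []
    for section in ("balances", "positions"):
        ours = internal.get(section, {})
        theirs = venue.get(section, {})
        for key in sorted(set(ours) | set(theirs)):
            a, b = ours.get(key, 0.0), theirs.get(key, 0.0)
            if abs(a - b) > tol:
                mismatches.append(f"{section}:{key} internal={a} venue={b}")
    our_fills = set(internal.get("fill_ids", []))
    venue_fills = set(venue.get("fill_ids", []))
    for fid in sorted(venue_fills - our_fills):
        mismatches.append(f"fills:{fid} on venue but not in internal state")
    for fid in sorted(our_fills - venue_fills):
        mismatches.append(f"fills:{fid} in internal state but not on venue")
    return mismatches


def dashboard_trusted(internal: Dict, venue: Dict) -> bool:
    """Trust internal state only when reconciliation finds nothing."""
    return not reconcile(internal, venue)
```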

Decision rule: the first monitoring layer should watch for hidden risk, not for strategy opinions. Missed fills, stale state, and broken stops matter more than whether the model still likes BTC.
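
The six alerts can be tied to owner workflows with a simple lookup. The pairings below are illustrative assumptions rather than a prescribed policy; the point of the sketch is that every alert resolves to a concrete action list and that an unrecognized alert still pauses the system.

```python
from enum import Enum
from typing import List


class Action(Enum):
    PAUSE = "pause"
    CANCEL = "cancel"
    FLATTEN = "flatten"
    REVOKE = "revoke"
    RECOVER = "recover"


# Hypothetical pairings; each owner should write and review their own.
PLAYBOOK = {
    "heartbeat_loss": [Action.PAUSE, Action.RECOVER],
    "stale_data": [Action.PAUSE],
    "order_ambiguity": [Action.PAUSE, Action.CANCEL],
    "stop_gap": [Action.PAUSE, Action.FLATTEN],
    "exposure_drift": [Action.PAUSE, Action.FLATTEN],
    "reconciliation_mismatch": [Action.PAUSE, Action.CANCEL, Action.RECOVER],
}


def workflow_for(alert: str) -> List[Action]:
    """Look up the workflow for an alert; unknown alerts fail closed to PAUSE."""
    return PLAYBOOK.get(alert, [Action.PAUSE])
```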

Monitoring stack by layer

Layer | What to watch | Why it matters | Owner action
Process health | Strategy heartbeat, signer service, gateway latency, queue backlog, cron or worker completion | Silent process death creates phantom supervision | Pause new orders and confirm the last known account state
Data quality | Feed freshness, missing candles, funding lag, account-sync lag, timestamp skew | Good prompts on stale inputs still create bad trades | Block new entries until fresh data and account state agree
Venue interaction | Rejections, unacknowledged orders, cancel failures, rate-limit pressure, duplicate client ids | Intent can diverge from accepted venue state quickly | Cancel pending risk and reconcile directly with the venue
Protection status | Stop-loss presence, reduce-only coverage, liquidation buffer, margin stress, hedge parity | The account can be live while the protection layer is broken | Flatten or reduce until protection is re-established
Economic state | Position delta, realized and unrealized P&L, fee spikes, funding drag, cross-strategy heat | Lets the owner catch risk drift before it compounds | Throttle size, disable the strategy, or move capital
Recovery state | Was the alert acknowledged, paused, resolved, rotated, or escalated? | Monitoring without workflow becomes alert accumulation | Run the documented incident path and log the outcome

Severity ladder for live AI trading incidents

Severity | Example | Required response | Resume condition
SEV-1: hidden live risk | Position exists but stop is missing, flat status is uncertain, or reconcile fails after fills | Immediate pause, cancel, flatten, and owner confirmation outside the agent runtime | Account state verified independently and protection restored
SEV-2: execution-path failure | Repeated rejects, rate-limit lockouts, signer degradation, venue acknowledgements missing | Pause new entries, protect or exit open risk, inspect venue and gateway logs | Clean test order path or verified recovery evidence
SEV-3: data or model-quality failure | Feed freshness breach, delayed balances, broken catalyst source, abnormal prompt output | Block new entries and downgrade the system to observation mode | Freshness restored and validation checks pass again
SEV-4: informational drift | Latency increase, mild reconciliation lag, dashboard rendering bug | Log, watch, and fix before it escalates into hidden risk | Issue resolved or threshold tightened

Common failure: teams page themselves for harmless noise but leave real account ambiguity at the same severity. If the owner cannot tell whether the system is actually flat, the incident is already severe.
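
The severity ladder can be sketched as a classifier that always returns the most severe matching level. The `IncidentSignals` fields are hypothetical names for the examples in the ladder; a real system would derive them from its own checks.

```python
from dataclasses import dataclass


@dataclass
class IncidentSignals:
    position_open: bool = False
    stop_missing: bool = False
    flat_status_uncertain: bool = False
    reconcile_failed: bool = False
    repeated_rejects: bool = False
    signer_degraded: bool = False
    acks_missing: bool = False
    feed_stale: bool = False
    balances_delayed: bool = False


def classify(s: IncidentSignals) -> int:
    """Return the SEV level (1 is worst), checking hidden live risk first."""
    if ((s.position_open and s.stop_missing)
            or s.flat_status_uncertain or s.reconcile_failed):
        return 1
    if s.repeated_rejects or s.signer_degraded or s.acks_missing:
        return 2
    if s.feed_stale or s.balances_delayed:
        return 3
    return 4
```

Checking SEV-1 conditions first means a stale feed combined with uncertain flat status is never filed as a data-quality problem.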

The owner runbook for the first 15 minutes

1. Freeze new risk. Disable new order creation before diagnosing root cause. A broken system should not keep compounding uncertainty while you investigate.
2. Verify economic truth from the venue. Pull balances, positions, open orders, and recent fills from the exchange or wallet source of truth, not only from internal UI state.
3. Check protection status. Confirm whether stops, reduce-only exits, or hedges still exist where the strategy expected them to exist.
4. Decide flat, reduced, or supervised hold. If state is ambiguous, flatten. If risk is bounded and visible, reduce and monitor. Supervised hold is acceptable only when the position and protections are fully verified.
5. Preserve the incident receipt. Log the alert, timestamps, actual account state, operator action, and what changed before the system resumes.

Good monitoring design: the runbook should be executable by an owner who is tired at 3 a.m. The system should not require perfect memory or a heroic operator to survive routine failures.
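
Steps 4 and 5 are the easiest to get wrong under stress, so they are worth encoding. The sketch below assumes three yes/no answers gathered in steps 2 and 3; the function names and receipt fields are illustrative, not a fixed schema.

```python
import json
from typing import Dict


def decide_posture(state_verified: bool, protection_verified: bool,
                   risk_bounded: bool) -> str:
    """Pick flatten, reduce, or supervised_hold from verified facts only."""
    if not state_verified:
        return "flatten"          # ambiguous state always flattens
    if protection_verified:
        return "supervised_hold"  # position and protections both verified
    if risk_bounded:
        return "reduce"           # visible, bounded risk: cut size and monitor
    return "flatten"


def incident_receipt(alert: str, action: str, account_state: Dict[str, float],
                     alerted_at: float, acted_at: float) -> str:
    """Serialize the incident record as JSON for the audit log."""
    return json.dumps({
        "alert": alert,
        "action": action,
        "account_state": account_state,
        "alerted_at": alerted_at,
        "acted_at": acted_at,
    }, sort_keys=True)
```

Because the decision takes only booleans, a tired owner answers three questions and the function removes the judgment call.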

What the agent should monitor versus what the owner should own

Agent-visible

Price movement, funding changes, fills, stop status, freshness checks, and strategy-rule violations that can trigger a no-trade or an internal alert.

Gateway-owned

Schema validation, risk caps, signer health, order id uniqueness, retries, and reject handling below the model layer.

Owner-owned

Kill switch, credential revocation, capital movement, venue escalation, incident severity declaration, and the decision to resume or keep the strategy paused.

Publicly reviewable

No-trade lines, incident summaries, reconciliation notes, and post-mortem lessons that explain why the system stayed flat, reduced, or changed policy.

This boundary matters for the same reason it matters in audit trails: without role separation, alerting turns into theater and the owner cannot prove which layer failed.
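
Role separation can be enforced with an explicit permission table. The action names below are hypothetical labels for the responsibilities listed above; the sketch only shows the shape of the check.

```python
# Hypothetical action names grouped by the layer that is allowed to take them.
ALLOWED = {
    "agent": {"no_trade", "raise_internal_alert"},
    "gateway": {"validate_schema", "enforce_risk_cap", "retry_order", "reject_order"},
    "owner": {"kill_switch", "revoke_credentials", "move_capital",
              "declare_severity", "resume_strategy"},
}


def authorize(role: str, action: str) -> bool:
    """Allow an action only if it is listed for that role; unknown roles get nothing."""
    return action in ALLOWED.get(role, set())
```

With this table, `authorize("agent", "resume_strategy")` is false, which keeps the decision to resume with the owner.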

Metrics that belong on an operator dashboard

Metric | Good question | Why it matters
Heartbeat age | How long since each critical process last reported healthy state? | Detects silent failure before the market notices
Data freshness | Which feed is oldest relative to its trading importance? | Prevents live decisions on expired context
Intent-to-fill gap | How many orders are still between proposed, acknowledged, and filled states? | Shows where execution ambiguity is accumulating
Protection coverage | Which live positions lack the intended stop, hedge, or reduce-only exit? | Highlights the incidents that deserve immediate action
Exposure drift | Where does actual position or leverage exceed intended position or leverage? | Lets the owner catch stealth risk and sizing bugs
Resolution time | How long do serious incidents stay unresolved? | Measures whether the runbook works under stress

Red flag: if the dashboard leads with P&L but hides stale-data breaches, missing stops, or unresolved reconciliation mismatches, it is optimized for excitement instead of survival.

How this fits the Laplace stack

This page extends the operator path from venue selection to access design, risk controls, evaluation, audit trails, and the live trading record. Monitoring is where those decisions meet production reality.

Future module supported: this page can expand into a venue-by-venue incident library, a machine-readable alert taxonomy, or an owner-facing console that verifies whether an agent is safe to keep live.

FAQ

What should an AI trading agent monitoring system alert on first?

Start with the failures that create hidden risk: missing heartbeats, stale data, unacknowledged or rejected orders, broken stops, exposure drift, and reconciliation mismatches.

Who owns incident response for an autonomous crypto trading agent?

The human owner or operator does, because the owner controls capital, credentials, venue access, and the right to pause, flatten, revoke, or rotate the system after a failure.

What is the biggest monitoring mistake in AI trading?

Watching dashboards without action thresholds. Monitoring only matters when every serious alert maps to a concrete workflow such as pause, cancel, flatten, or recovery verification.

When should an operator flatten instead of investigate longer?

Flatten when live account state is ambiguous, when stop coverage is broken, or when reconciliation cannot prove the system is carrying the risk the owner believes it is carrying.

Build the runbook before the incident

An autonomous trading system is only as safe as the owner's ability to see hidden risk and shut it down cleanly.