# hu-agent Observability: Silent-Failure Logging + Datadog Metrics

**Status:** Draft — pending review
**Date:** 2026-05-29
**Branch:** `spec/agent-observability` — branches off `develop` as an independent PR (see §4.1 for overlap with the in-flight migration)
**Author:** Iván Gómez Yaury (with Claude)

---

## 1. Motivation

On 2026-05-29 hu-agent appeared "down": it stopped picking up an assigned Jira ticket (SQWH-247) and a teammate reported it as dead. Investigation showed the **process was never down** (ECS task HEALTHY the whole time). The real cause was that the `cursor.bot@humand.co` identity was being torn down as part of the Cursor→Claude migration: the **Jira API token was revoked** (401 `AUTHENTICATED_FAILED`) and the **Slack app was uninstalled** (`account_inactive`).

The incident was hard to diagnose because the bot **went mute**:

1. **`if (queue.isBusy()) return 0;` logs nothing** in all four pollers. While the single-concurrency queue was occupied, every poll returned silently — the dead Jira credential was never even exercised, so nothing was logged.
2. **The Jira pollers cache the bot `accountId` once at boot and never re-validate it.** A credential that dies after a successful boot is invisible until a poll reaches the Jira API (which only happens when the queue is idle), and even then only surfaces as a generic `poll_error`.
3. **At `LOG_LEVEL=info` a healthy/idle poller is essentially silent** — `poll_completed` is commented out, skip paths log at `debug`. There is no positive liveness signal and no metric, so "quiet" is indistinguishable from "wedged".
4. **Persistent failures never escalate.** The Slack/PR pollers log failures at `warn` with backoff but never escalate to `error`, so a permanent outage (`account_inactive`) reads as a recurring warning that's easy to dismiss.

This spec makes every silent failure path observable, and adds Datadog custom metrics so the team can build dashboards and monitors that catch credential death and queue stalls automatically.

## 2. Goals

- Every place a poller or the queue can fail or skip work emits a signal — a **metric always**, and a **throttled structured log** for human detail.
- A dead integration credential (Jira / Slack / GitHub / agent provider) becomes a clear, queryable signal within minutes, independent of queue state and `LOG_LEVEL`.
- A wedged queue (a job holding the single slot too long) is detectable.
- Emit a Datadog custom-metric catalog covering both **operational health** and **agent usage**, so the user can build dashboards and monitors manually.

## 3. Non-Goals

- **Datadog dashboards and monitors are NOT built here.** The user creates them manually in the Datadog UI (us5). At most this work produces example metric definitions / a JSON export on request. No Terraform/IaC for dashboards or monitors.
- No change to the queue's concurrency model or the 100-minute job timeout semantics.
- No new alerting transport in the app (no Slack/PagerDuty calls). Alerting is done via Datadog monitors on the emitted metrics.
- Not fixing the pre-existing backlog of tickets stuck in "In Progress" (separate cleanup).

## 4. Key decisions (agreed)

| Decision | Choice |
|---|---|
| Metric emission | **DogStatsD**, following the Humand workspace convention (`MetricRecorderPort` + adapter, dot-named metrics, tag objects). |
| Metric library | **Spike `dd-trace` under Bun first; fall back to `hot-shots`** if it does not work. Same `MetricRecorderPort` interface either way. |
| Poller refactor | **Shared poll-runner** wrapping the four pollers' guard/skip/error logic, so the mute paths are fixed once and uniformly. |
| Auth checker | **New dedicated module** (`src/health/auth-checker.ts`), decoupled from the queue. |
| `DD_METRICS_ENABLED` default | **On in prod, off in dev/test** (no-op recorder otherwise). |
| Phasing | **Phase 1 = incident remedy; Phase 2 = full usage catalog** (see §10). |
| Branch base | **Off `develop`, independent PR** (not stacked on the migration). Conflicts accepted and deferred (see §4.1). |

### 4.1 Branching & coordination with the migration

This work branches off `develop` and ships as its own PR, rather than stacking on `spec/migrate-agent-to-claude`. The trade-off, accepted explicitly: it touches the same files the migration refactors (the four pollers, `memory-queue.ts`, and the agent-provider layer), so **merge conflicts are expected**. They are deferred — resolved when `writing-plans` is run against the then-current branch, or when they surface at merge time.

Concrete coupling to keep in mind:

- `develop` does **not** yet have the migration's `AGENT_PROVIDER` toggle or `AgentRunner` abstraction (those live only on `spec/migrate-agent-to-claude`). On `develop` the agent layer is Cursor-only.
- Therefore the auth checker's agent-provider probe and the `agent.*{provider}` metrics target **whatever agent layer exists on the target branch**: Cursor (`CURSOR_API_KEY` → `api.cursor.com/v0/models`, `provider:cursor`) today. If the migration merges first, generalize the probe and the `provider` tag to the `AGENT_PROVIDER`/`AgentRunner` abstraction.
- The shared poll-runner (§5.2) and queue instrumentation (§5.4) are written against `develop`'s poller/queue structure; whichever branch merges second resolves the overlap.

## 5. Architecture

### 5.1 Metrics recorder (port + adapter)

Following `humand-main-api`'s convention: a DI-injected port with a Datadog-backed adapter and a no-op fallback.

```ts
// src/metrics/types.ts
export type MetricTags = Record<string, string | number>;
// Referred to as `MetricRecorderPort` elsewhere in this spec.
export interface Metrics {
  increment(name: string, tags?: MetricTags): void;
  count(name: string, value: number, tags?: MetricTags): void;
  gauge(name: string, value: number, tags?: MetricTags): void;
  histogram(name: string, value: number, tags?: MetricTags): void;
  timing(name: string, ms: number, tags?: MetricTags): void;
}
```

- `src/metrics/datadog-recorder.ts` — the real adapter (over `dd-trace`'s `tracer.dogstatsd` **or** `hot-shots`, decided by the spike). Reads agent host/port from env (`DD_AGENT_HOST`, `DD_DOGSTATSD_PORT`, default `localhost:8125`). Applies the `hu_agent.` name prefix and global tags (`env`, `service`).
- `src/metrics/noop-recorder.ts` — does nothing. Used when `DD_METRICS_ENABLED=false` (dev/test) or when the recorder fails to init.
- Wired in `src/index.ts` via DI like the `logger`. **Emission is fire-and-forget and must never throw or block a pipeline.**

Metric names are module-level constants (`src/metrics/names.ts`). All names are namespaced `hu_agent.*`.
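
The fallback behaviour described above can be sketched as follows. This is a minimal illustration, not the actual modules: `NoopRecorder` and `selectRecorder` are hypothetical names, and the real adapter is represented only by a factory function.

```ts
// Sketch only: `NoopRecorder` and `selectRecorder` are hypothetical names.
type MetricTags = Record<string, string | number>;

interface Metrics {
  increment(name: string, tags?: MetricTags): void;
  count(name: string, value: number, tags?: MetricTags): void;
  gauge(name: string, value: number, tags?: MetricTags): void;
  histogram(name: string, value: number, tags?: MetricTags): void;
  timing(name: string, ms: number, tags?: MetricTags): void;
}

// Accepts every call and discards it, so call sites never need an "enabled" check.
class NoopRecorder implements Metrics {
  increment(): void {}
  count(): void {}
  gauge(): void {}
  histogram(): void {}
  timing(): void {}
}

// Returns the real adapter only when metrics are enabled and its factory succeeds.
function selectRecorder(enabled: boolean, createReal: () => Metrics): Metrics {
  if (!enabled) return new NoopRecorder();
  try {
    return createReal();
  } catch {
    // An init failure must not take the app down; degrade to no metrics.
    return new NoopRecorder();
  }
}
```

Because both recorders satisfy the same interface, the DI wiring in `src/index.ts` would only need to call something like `selectRecorder` once at startup.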

### 5.2 Shared poll-runner

All four pollers (`jira`, `jira-ticket-review`, `pr-comment`, `pr-mention`) duplicate the same preamble: `if (polling) return; if (queue.isBusy()) return; if (kind ∈ …) return;`. Extract a shared helper that wraps a poll cycle and uniformly emits:

- `poll.executed{poller}` — the cycle ran (liveness heartbeat).
- `poll.skipped{poller,reason}` — `reason` ∈ `busy | other_pipeline | reentrant`. **Metric every time; log throttled** (~1×/min) with `currentJobKey` + `current_job_age` so a wedged queue is visible.
- `poll.error{poller,error_type}` — the cycle threw.
- `poll.enqueued{poller}` + cycle duration.

Each poller provides only its "do the work" body. This kills the mute `isBusy` path in all four at once.
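
A minimal sketch of such a runner, under stated assumptions: `createPollRunner`, `QueueView`, the log event names, and the `poll.duration` metric name are all hypothetical, and the `other_pipeline` check is omitted for brevity. It shows the two properties that matter here: a metric on every skip, and a skip log that fires at most once per window.

```ts
// Sketch only: every name below is illustrative, not the real module API.
type Tags = Record<string, string | number>;
type SkipReason = "busy" | "other_pipeline" | "reentrant";

interface PollMetrics {
  increment(name: string, tags?: Tags): void;
  count(name: string, value: number, tags?: Tags): void;
  timing(name: string, ms: number, tags?: Tags): void;
}
interface PollLogger {
  info(event: string, fields?: Record<string, unknown>): void;
  error(event: string, fields?: Record<string, unknown>): void;
}
interface QueueView {
  isBusy(): boolean;
  currentJobKey(): string | null;
  currentJobAgeSeconds(): number;
}
interface PollRunnerDeps {
  poller: string;
  metrics: PollMetrics;
  logger: PollLogger;
  queue: QueueView;
  now?: () => number; // injectable clock for tests
  skipLogWindowMs?: number; // assumed default: one minute
}

function createPollRunner(deps: PollRunnerDeps) {
  const { poller, metrics, logger, queue } = deps;
  const now = deps.now ?? Date.now;
  const skipLogWindowMs = deps.skipLogWindowMs ?? 60_000;
  let polling = false;
  let lastSkipLogAt = Number.NEGATIVE_INFINITY;

  // Metric on every skip; log at most once per window.
  const skip = (reason: SkipReason): number => {
    metrics.increment("poll.skipped", { poller, reason });
    if (now() - lastSkipLogAt >= skipLogWindowMs) {
      lastSkipLogAt = now();
      logger.info("poll_skipped", {
        poller,
        reason,
        currentJobKey: queue.currentJobKey(),
        current_job_age: queue.currentJobAgeSeconds(),
      });
    }
    return 0;
  };

  // `body` is the poller's own work; it returns the number of jobs it enqueued.
  return async function runCycle(body: () => Promise<number>): Promise<number> {
    if (polling) return skip("reentrant");
    if (queue.isBusy()) return skip("busy");
    polling = true;
    const startedAt = now();
    metrics.increment("poll.executed", { poller });
    try {
      const enqueued = await body();
      if (enqueued > 0) metrics.count("poll.enqueued", enqueued, { poller });
      return enqueued;
    } catch (err) {
      const errorType = err instanceof Error ? err.name : "unknown";
      metrics.increment("poll.error", { poller, error_type: errorType });
      logger.error("poll_error", { poller, error_type: errorType });
      return 0;
    } finally {
      metrics.timing("poll.duration", now() - startedAt, { poller });
      polling = false;
    }
  };
}
```

Using the error's `name` as `error_type` is one way to keep that tag bounded; the real classification may differ.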

### 5.3 Auth checker (decoupled from the queue — load-bearing)

`src/health/auth-checker.ts`: a periodic job (`croner`, every `AUTH_CHECK_INTERVAL_MS`) that pings each integration and emits `integration.auth_ok{integration}` (gauge `1|0`):

- **Jira** — `getMyself`
- **Slack** — `auth.test`
- **GitHub** — lightweight authenticated call (token/installation check)
- **Agent provider** — the agent layer present on the target branch. On `develop` today: Cursor (`GET api.cursor.com/v0/models`). Generalize to the `AGENT_PROVIDER`/`AgentRunner` abstraction if the migration merges first (see §4.1).

It logs only on **transition** (ok→fail at `error`, fail→ok at `info`) to avoid noise.

> **Why decoupled from the queue / pollers:** the whole point is that it reports even when the queue is wedged. The `isBusy`-gated poll path can be silent for up to ~100 minutes during a stuck job; the auth checker runs on its own timer and is the early-warning signal the incident lacked. Including the **agent provider** credential is deliberate — `CURSOR_API_KEY` is the most load-bearing credential for the fix pipeline, and omitting it would reproduce the exact blind spot this work fixes.
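
The gauge-plus-transition-log behaviour can be sketched like this. All names (`createAuthChecker`, `Probe`, the log event names) are hypothetical, each probe stands in for the real client call (`getMyself`, `auth.test`, and so on), and scheduling is left out. One assumption worth flagging: a first observation that fails is treated as a transition and logged at `error`, while a first observation that succeeds is not logged.

```ts
// Sketch only: names and log events are illustrative.
type Integration = "jira" | "slack" | "github" | "agent";
// Resolves if the credential works, rejects otherwise.
type Probe = () => Promise<void>;

interface GaugeMetrics {
  gauge(name: string, value: number, tags?: Record<string, string | number>): void;
}
interface TransitionLogger {
  info(event: string, fields?: Record<string, unknown>): void;
  error(event: string, fields?: Record<string, unknown>): void;
}

function createAuthChecker(
  probes: Record<Integration, Probe>,
  metrics: GaugeMetrics,
  logger: TransitionLogger,
) {
  const lastOk = new Map<Integration, boolean>();

  // Emits the gauge on every check, but logs only when the state changes.
  function record(integration: Integration, ok: boolean, reason?: string): void {
    metrics.gauge("integration.auth_ok", ok ? 1 : 0, { integration });
    const previous = lastOk.get(integration);
    lastOk.set(integration, ok);
    if (previous === ok) return;
    if (!ok) {
      logger.error("integration_auth_failed", { integration, reason });
    } else if (previous === false) {
      logger.info("integration_auth_recovered", { integration });
    }
  }

  // One failing probe never prevents the others from being checked.
  async function checkAll(): Promise<void> {
    const names = Object.keys(probes) as Integration[];
    await Promise.all(
      names.map(async (integration) => {
        try {
          await probes[integration]();
          record(integration, true);
        } catch (err) {
          record(integration, false, err instanceof Error ? err.message : String(err));
        }
      }),
    );
  }

  return { record, checkAll };
}
```

Nothing in this sketch touches the queue, which is the property the section depends on: `checkAll` can run on its own timer regardless of what the pollers are doing.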

### 5.4 Queue instrumentation

`src/queue/memory-queue.ts` already logs `job_received / job_completed / retry_attempt / job_aborted`. Add metric emission at those points plus a periodic sampler:

- `jobs.received / jobs.completed{job_kind,outcome} / jobs.retried / jobs.deduplicated`
- `jobs.duration{job_kind,outcome}` (histogram)
- Sampler (gauge): `queue.depth`, `queue.busy` (0/1), `queue.current_job_age` (seconds, **the stuck-job signal**); plus `queue.high_load`, which is a count rather than a gauge (see §7)
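
As an illustration of the sampler, the sketch below turns a queue snapshot into the three gauges. `QueueSnapshot` and `sampleQueue` are hypothetical names, and how the real queue exposes its current job's start time is an assumption.

```ts
// Sketch only: the snapshot shape is an assumption about what the queue can expose.
interface QueueSnapshot {
  depth: number;
  busy: boolean;
  currentJobStartedAtMs: number | null; // null when no job holds the slot
}
interface SamplerMetrics {
  gauge(name: string, value: number): void;
}

// Emits one reading of each queue gauge; intended to be called on a fixed interval.
function sampleQueue(snapshot: QueueSnapshot, metrics: SamplerMetrics, nowMs: number): void {
  metrics.gauge("queue.depth", snapshot.depth);
  metrics.gauge("queue.busy", snapshot.busy ? 1 : 0);
  const ageSeconds =
    snapshot.currentJobStartedAtMs === null
      ? 0
      : Math.max(0, (nowMs - snapshot.currentJobStartedAtMs) / 1000);
  metrics.gauge("queue.current_job_age", ageSeconds);
}
```

Reporting `0` when the slot is free keeps the gauge continuous, so a monitor on `queue.current_job_age` would only fire while a job is actually running.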

### 5.5 Pipeline instrumentation (Phase 2)

- `fix-pipeline.ts`: `repo.selected{repo}`, `agent.invocations{provider,phase}`, `agent.duration{provider,phase}`, `fix.build_validation{repo,result}`, `safety.violation{repo,reason}` (observability-only — pipeline does NOT block on violation; enforcement is a separate follow-up), `fix.outcome{verdict}` (emitted at triage before repo selection, so **no `repo` tag**), `pr.opened{repo}`
- `pr-comment-pipeline.ts`: `pr_comment.handled{repo,action}` (action `code_change|reply_only`) + `pr_comment.duration`
- `pr-mention-pipeline.ts`: `pr_mention.handled{repo,action}` (action `code_change|reply_only`) + `pr_mention.duration` + `agent.invocations/duration{provider,phase:"pr_mention"}` (added for symmetry with pr-comment — the agent runs here too)
- `slack-mention-pipeline.ts`: `slack_mention.handled` + `slack_mention.duration` (no tags — cardinality guardrail)
- **`agent.tokens` NOT emitted** — neither Cursor CLI nor cloud client exposes token usage. Revisit when the Claude provider (migration branch) lands.
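
One way to keep the `agent.invocations` / `agent.duration` pair consistent across pipelines is a small wrapper around the agent call. The sketch below is hypothetical (`withAgentMetrics` is not a name from the codebase), and reporting the duration in milliseconds is an assumption.

```ts
// Sketch only: the helper name and the millisecond unit are assumptions.
type AgentTags = { provider: string; phase: string };

interface AgentMetrics {
  increment(name: string, tags?: Record<string, string | number>): void;
  histogram(name: string, value: number, tags?: Record<string, string | number>): void;
}

// Counts the invocation up front and records the duration whether the run succeeds or throws.
async function withAgentMetrics<T>(
  metrics: AgentMetrics,
  tags: AgentTags,
  run: () => Promise<T>,
  now: () => number = Date.now,
): Promise<T> {
  metrics.increment("agent.invocations", tags);
  const startedAt = now();
  try {
    return await run();
  } finally {
    metrics.histogram("agent.duration", now() - startedAt, tags);
  }
}
```

A pipeline could then wrap each phase, for example `withAgentMetrics(metrics, { provider: "cursor", phase: "triage" }, () => runTriage())`, where `runTriage` is a placeholder for the real agent call.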

## 6. Behavior changes (the mute fix)

- **`isBusy()` skip** (all 4 pollers): metric every cycle + throttled log with `currentJobKey` and `current_job_age`.
- **Auth failure in a poller**: from silent/`warn` → **`error` + `poll.auth_failure{poller,integration}` metric**.
- **Cached `accountId`**: **invalidate on auth failure** so it re-resolves (today it caches forever). The auth checker provides independent early detection.
- **Slack/PR pollers**: escalate `warn` → **`error`** after N consecutive failures (today they never escalate).
- **Poll heartbeat**: re-enabled but throttled (summary log every N cycles; metric every cycle).
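
Two of these changes are small enough to sketch directly: escalation after N consecutive failures, and a cached `accountId` that can be invalidated. Both helpers are hypothetical (`createFailureEscalator`, `createAccountIdCache`), and the threshold value is left to the caller.

```ts
// Sketch only: helper names are illustrative.

// Tracks consecutive failures and picks the log level for the next failure.
function createFailureEscalator(threshold: number) {
  let consecutive = 0;
  return {
    onFailure(): "warn" | "error" {
      consecutive += 1;
      return consecutive >= threshold ? "error" : "warn";
    },
    onSuccess(): void {
      consecutive = 0;
    },
  };
}

// Caches the in-flight or resolved lookup until it is explicitly invalidated.
function createAccountIdCache(resolve: () => Promise<string>) {
  let cached: Promise<string> | null = null;
  return {
    get(): Promise<string> {
      if (cached === null) cached = resolve();
      return cached;
    },
    // Call on auth failure so the next poll re-resolves instead of reusing a stale id.
    invalidate(): void {
      cached = null;
    },
  };
}
```

A poller would call `invalidate()` in the same branch that emits `poll.auth_failure`, and `onSuccess()` after any cycle that reaches its integration without error.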

## 7. Metric catalog (full)

Namespace `hu_agent.*`, global tags `env`, `service`. ⭐ = Phase 1 (incident-critical).

**Integration health**
- ⭐ `integration.auth_ok` gauge 0/1 `{integration: jira|slack|github|agent}`
- ⭐ `poll.auth_failure` count `{poller, integration}`

**Queue health**
- ⭐ `queue.current_job_age` gauge (s) · `queue.depth` gauge · `queue.busy` gauge 0/1 · `queue.high_load` count

**Poller health**
- ⭐ `poll.executed` count `{poller}` · ⭐ `poll.skipped` count `{poller, reason}` · `poll.error` count `{poller, error_type}` · `poll.enqueued` count `{poller}`

**Throughput / jobs**
- `jobs.received` / `jobs.completed` count `{job_kind, outcome}` · `jobs.duration` histogram `{job_kind, outcome}` · `jobs.retried` / `jobs.deduplicated` count

**Fix pipeline (Jira → PR)**
- `repo.selected` count `{repo}` · `pr.opened` count `{repo}` · `fix.build_validation` count `{repo, result}` · `fix.outcome` count `{verdict}` (**no `repo` tag** — emitted at triage before repo selection) · `safety.violation` count `{repo, reason}` (**observability-only**, pipeline does not block — enforcement deferred)

**Agent (Cursor/Claude)**
- `agent.invocations` count `{provider, phase}` · `agent.duration` histogram `{provider, phase}`. Phases: `triage`, `fix`, `followup` (fix-pipeline), `pr_comment` (pr-comment-pipeline), `pr_mention` (pr-mention-pipeline), `slack_mention` (slack-mention-pipeline)
- `agent.tokens` count `{provider, model}` — **SKIPPED**: Cursor clients (CLI + cloud) do not expose token usage. Revisit when Claude provider lands (see §10).

**PR-comment, PR-mention & Slack**
- `pr_comment.handled` count `{repo, action}` · `pr_mention.handled` count `{repo, action}` (action `code_change|reply_only` for both) · `slack_mention.handled` count (no tags) · each with `*.duration`

### 7.1 Cardinality guardrail

Datadog bills custom metrics by the number of distinct metric-name and tag-value combinations, so tag cardinality drives cost. **Allowed tags are bounded sets only** (`integration`, `poller`, `reason`, `job_kind`, `outcome`, `repo`, `verdict`, `result`, `provider`, `phase`, `model`, `action`). **Never tag with unbounded values**: no `issue_key`, no PR number, no `user_id`, no `branch`, no commit SHA. Those belong in logs, not metric tags.
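
The guardrail could also be enforced in code rather than by convention alone. The sketch below is one hypothetical way to do it (`sanitizeTags` is not an existing helper): tag keys outside the allowlist are dropped before emission.

```ts
// Sketch only: the allowlist mirrors the keys listed above; the helper itself is hypothetical.
type MetricTags = Record<string, string | number>;

const ALLOWED_TAG_KEYS: ReadonlySet<string> = new Set([
  "integration", "poller", "reason", "job_kind", "outcome", "repo",
  "verdict", "result", "provider", "phase", "model", "action",
]);

// Returns a copy of `tags` containing only allowlisted keys.
function sanitizeTags(tags: MetricTags): MetricTags {
  const kept: MetricTags = {};
  for (const [key, value] of Object.entries(tags)) {
    if (ALLOWED_TAG_KEYS.has(key)) kept[key] = value;
  }
  return kept;
}
```

This checks keys only. An allowed key can still carry unbounded values (for example a free-form `reason`), so call sites would still need to pass values from fixed sets.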

## 8. Configuration

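New configuration. Env vars are validated by the Zod schema in `src/utils/config.ts` and documented in `.env.example`; items marked as constants live in `core/constants.ts` and are not env-driven: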

- `DD_METRICS_ENABLED` (default: `true` in prod, `false` when `NODE_ENV !== "production"`)
- `DD_AGENT_HOST` (default `localhost`), `DD_DOGSTATSD_PORT` (default `8125`)
- `AUTH_CHECK_INTERVAL_MS` (constant in `core/constants.ts`)
- Log-throttle windows for skip/heartbeat (constants in `core/constants.ts`)

The infra already supports this with **no Terraform change**: the datadog-agent sidecar has `DD_DOGSTATSD_NON_LOCAL_TRAFFIC=true` and port `8125/udp` mapped.
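
The default-resolution rules above can be illustrated without the schema library. The sketch below uses plain TypeScript to stay dependency-free, whereas the spec places this logic in the Zod schema; `resolveMetricsConfig` and the `MetricsConfig` shape are hypothetical, and treating only the literal string `"true"` as enabled is an assumption.

```ts
// Sketch only: in the real code these rules would live in the Zod schema.
interface MetricsConfig {
  enabled: boolean;
  agentHost: string;
  dogstatsdPort: number;
}

function resolveMetricsConfig(env: Record<string, string | undefined>): MetricsConfig {
  const isProd = env.NODE_ENV === "production";
  // An explicit value wins; otherwise metrics are on only in production.
  const enabled =
    env.DD_METRICS_ENABLED === undefined ? isProd : env.DD_METRICS_ENABLED === "true";

  const dogstatsdPort = Number(env.DD_DOGSTATSD_PORT ?? "8125");
  if (!Number.isInteger(dogstatsdPort) || dogstatsdPort < 1 || dogstatsdPort > 65535) {
    throw new Error(`Invalid DD_DOGSTATSD_PORT: ${env.DD_DOGSTATSD_PORT}`);
  }

  return { enabled, agentHost: env.DD_AGENT_HOST ?? "localhost", dogstatsdPort };
}
```

Failing fast on a malformed port at startup is a deliberate choice in this sketch: a bad port would otherwise produce the same silent no-emission symptom the spec is trying to eliminate.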

### 8.1 dd-trace spike (hard gate)

> **Spike result (2026-05-29): `hot-shots`.** Under Bun, `dd-trace`'s dogstatsd client inits and `increment()` does not throw, but emits **0 UDP packets** (verified against a local listener) — the exact silent-failure mode this work guards against. `hot-shots` emits correctly under Bun (formatted packets with prefix + tags). Decision: use **`hot-shots`**; `dd-trace`/APM is **not** adopted, so no `tracer.init()` first-import is needed. The rest of this section is retained for the record.

Before committing to `dd-trace`, run a spike with a **real pass criterion**: a test metric must actually **land in Datadog us5** (not merely "imports without error" — `dd-trace` can init and silently fail to emit, or break async context under Bun).

- **Pass** → use `dd-trace`. Note two consequences to handle intentionally: (a) `tracer.init()` must be the **first import in `index.ts`** (it monkey-patches at import time); (b) `DD_TRACE_ENABLED=true` is **already live** in the task def, so adopting `dd-trace` will **start APM traces flowing** — a deliberate prod behavior change, not an accident.
- **Fail** → use `hot-shots` (pure UDP, Bun-safe) behind the same `MetricRecorderPort`. Call sites and DI are identical; only the adapter differs.
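
For anyone repeating the local-listener check, it helps to know what a well-formed packet looks like. The sketch below formats a metric in the DogStatsD datagram shape (`name:value|type|#tag:value,...`); it is an illustration for asserting against a listener, not the adapter's code, and `formatDatagram` is a hypothetical helper.

```ts
// Sketch only: builds the text payload a DogStatsD client would send over UDP.
// Type codes: c = count, g = gauge, h = histogram, ms = timer.
type DogStatsdType = "c" | "g" | "h" | "ms";

function formatDatagram(
  name: string,
  value: number,
  type: DogStatsdType,
  tags: Record<string, string | number> = {},
): string {
  const tagPart = Object.entries(tags)
    .map(([key, tagValue]) => `${key}:${tagValue}`)
    .join(",");
  const base = `${name}:${value}|${type}`;
  return tagPart.length > 0 ? `${base}|#${tagPart}` : base;
}

const example = formatDatagram("hu_agent.poll.executed", 1, "c", { env: "prod", poller: "jira" });
// example === "hu_agent.poll.executed:1|c|#env:prod,poller:jira"
```

A listener that receives nothing at all, as happened with `dd-trace` under Bun, is a different failure from one that receives malformed packets, and the two are worth distinguishing in spike notes.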

## 9. Error handling

- Metric emission is fire-and-forget; the recorder swallows its own errors and never throws into a pipeline.
- The auth checker catches per-integration errors (a failing check emits `auth_ok=0`, it does not crash the loop).
- The no-op recorder guarantees the app runs identically with metrics disabled.
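
The "never throws into a pipeline" guarantee can be sketched as a thin wrapper around any recorder. `makeSafe` and its `onError` hook are hypothetical; the real adapter may swallow errors internally instead.

```ts
// Sketch only: a decorator that contains every error raised by the wrapped recorder.
type MetricTags = Record<string, string | number>;

interface Metrics {
  increment(name: string, tags?: MetricTags): void;
  count(name: string, value: number, tags?: MetricTags): void;
  gauge(name: string, value: number, tags?: MetricTags): void;
  histogram(name: string, value: number, tags?: MetricTags): void;
  timing(name: string, ms: number, tags?: MetricTags): void;
}

function makeSafe(inner: Metrics, onError?: (err: unknown) => void): Metrics {
  const guard = (emit: () => void): void => {
    try {
      emit();
    } catch (err) {
      try {
        onError?.(err);
      } catch {
        // Even the error hook must not leak into the caller.
      }
    }
  };
  return {
    increment: (name, tags) => guard(() => inner.increment(name, tags)),
    count: (name, value, tags) => guard(() => inner.count(name, value, tags)),
    gauge: (name, value, tags) => guard(() => inner.gauge(name, value, tags)),
    histogram: (name, value, tags) => guard(() => inner.histogram(name, value, tags)),
    timing: (name, ms, tags) => guard(() => inner.timing(name, ms, tags)),
  };
}
```

Passing a throttled logger as `onError` would keep recorder failures visible without letting them affect job outcomes.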

## 10. Phasing

**Phase 1 — incident remedy (priority).** Stops the bot from going mute again and gives the monitor foundation:
- `MetricRecorderPort` + adapter (+ spike) + no-op + DI wiring
- Shared poll-runner with `poll.executed/skipped/error/enqueued` + throttled skip logging
- Stop caching `accountId` across auth failures; auth failures → `error` + `poll.auth_failure`
- Auth checker (jira/slack/github/agent) → `integration.auth_ok`
- Queue: `jobs.*` + `queue.*` (incl. `current_job_age`)
- ⭐ metrics from §7

**Phase 2 — full usage catalog (dashboard). ✅ DONE — implemented in `feat/agent-observability`.** Pipelines (`fix`, `pr-comment`, `slack`) instrumented with the usage metrics listed in §5.5 / §7.

**Deviations & follow-ups from Phase 2:**

1. **`agent.tokens` deferred** — neither Cursor CLI nor cloud client exposes token usage counts. Will revisit when the Claude provider (migration branch `spec/migrate-agent-to-claude`) lands, which does surface `usage.input_tokens` / `usage.output_tokens`.
2. **`safety.violation` is observability-only** — the metric is emitted when `getDiffStats()` detects a violation, but the fix pipeline does NOT block on it. Enforcement logic is a separate, intentionally-scoped follow-up.
3. **`pr-mention-pipeline` instrumented for symmetry** — although the original §7 catalog only named `pr_comment.handled`/`slack_mention.handled`, pr-mention runs the same agent and was added (`pr_mention.handled{repo,action}` + `agent.*{phase:"pr_mention"}`) so the @-mention flow isn't a blind spot.
4. **`fix.outcome` carries only `{verdict}`** — emitted at triage (before repo selection), so no `repo` tag is available. This deviates from the original `{repo, verdict}` spec entry in §7.

## 11. Testing

- `bun:test`, mirrored under `test/`. New `test/_helpers/fake-metrics.ts` recorder that captures emitted metrics for assertions.
- Tests: each poller emits the correct metric on `executed/skipped{reason}/error`; auth checker emits `auth_ok` 0/1 per integration and logs only on transition; queue emits the job lifecycle + sampler; `accountId` is re-resolved after an auth failure; skip/heartbeat throttling behaves.
- Remove the new source modules from `coveragePathIgnorePatterns` in `bunfig.toml` and keep the 80% threshold.
- `bun run check` (typecheck + lint + format) and `bun run test:coverage` green before done.
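
A capturing recorder along the lines of `test/_helpers/fake-metrics.ts` might look like the sketch below. The class shape and the `find` helper are assumptions for illustration; the actual helper may differ.

```ts
// Sketch only: records every emission so tests can assert on names, values, and tags.
type MetricTags = Record<string, string | number>;
type MetricKind = "increment" | "count" | "gauge" | "histogram" | "timing";

interface Emitted {
  kind: MetricKind;
  name: string;
  value: number;
  tags: MetricTags;
}

class FakeMetrics {
  readonly emitted: Emitted[] = [];

  increment(name: string, tags: MetricTags = {}): void {
    this.push("increment", name, 1, tags);
  }
  count(name: string, value: number, tags: MetricTags = {}): void {
    this.push("count", name, value, tags);
  }
  gauge(name: string, value: number, tags: MetricTags = {}): void {
    this.push("gauge", name, value, tags);
  }
  histogram(name: string, value: number, tags: MetricTags = {}): void {
    this.push("histogram", name, value, tags);
  }
  timing(name: string, ms: number, tags: MetricTags = {}): void {
    this.push("timing", name, ms, tags);
  }

  // Returns emissions with this name whose tags include every expected pair.
  find(name: string, expected: MetricTags = {}): Emitted[] {
    return this.emitted.filter(
      (m) => m.name === name && Object.entries(expected).every(([k, v]) => m.tags[k] === v),
    );
  }

  private push(kind: MetricKind, name: string, value: number, tags: MetricTags): void {
    this.emitted.push({ kind, name, value, tags: { ...tags } });
  }
}
```

A poller test could then assert, for instance, that `fake.find("poll.skipped", { poller: "jira", reason: "busy" })` has the expected length after a cycle with a busy queue.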

## 12. Risks

- **dd-trace under Bun** — mitigated by the §8.1 spike gate + hot-shots fallback.
- **Adopting dd-trace turns on APM** (env already set) — call out and decide intentionally.
- **Metric cardinality** — bounded by §7.1.
- **Log noise** — bounded by throttling skip/heartbeat logs while keeping metrics per-cycle.