# Parallelism Roadmap

Plan for moving `hu-agent` from the current single-worker model to a safe, observable, multi-worker one.

> **Status:** planning document. No code changes are implied by this file — it is a
> shared reference for sequencing the work and reviewing the blockers.

---

## 1. Current state (as of this writing)

- `QUEUE.CONCURRENCY = 1` — see `src/core/constants.ts:32`.
- Single Bun process. Single in-memory FIFO queue (`MemoryQueue` in `src/queue/memory-queue.ts`).
- N pollers run concurrently on the event loop but only enqueue work when the
  worker is idle:
  - `JiraPoller` — 5 s
  - `JiraTicketReviewPoller` — 30 s
  - `PrCommentPoller` — 30 s
  - `PrMentionPoller` — 60 s
  - `SlackMentionPoller` — 30 s
- Coordination rule: each poller calls `queue.isBusy()` at the start of its cycle
  and skips the whole cycle if a job is currently running
  (`src/poller/pr-comment-poller.ts:75`, etc.). `PrCommentPoller` and
  `PrMentionPoller` additionally skip if the running job is of kind `jira`
  (long fix pipelines).
- The queue can still hold pending jobs (a single poller can enqueue multiple
  jobs in one cycle, e.g. 3 pending comments on the same PR). Other pollers
  see `isBusy = true` until the whole batch drains.
- Deduplication by `jobLogKey` in `enqueue()` — same job key cannot be queued
  twice.
- Retry on failure: up to `MAX_RETRIES = 2`, reinserted at the **front** of
  the queue (`jobs.unshift()`). Timeout per job: `JOB_TIMEOUT_MS = 50 min`.

---

## 2. Goal

Lift `CONCURRENCY` above 1 (start with 2, measure, grow from there) **without**
introducing data corruption, token races, or log chaos.

Non-goals for this roadmap:
- Horizontal scaling (multiple processes / machines). This doc is about
  in-process concurrency only.
- Replacing the in-memory queue with an external broker (Redis, SQS). May come
  later; not required to get parallelism value.

---

## 3. Observability prerequisite — `traceId` per job

**Ship this first**, while `CONCURRENCY` is still 1. It does not change
behavior, but it forces every module to accept a scoped logger and gives us
log correlation before we need it.

Concrete changes:

- Add `traceId: string` (short UUID) to `JobPayload`. Generate in `enqueue()`
  if not set.
- In `MemoryQueue.drain()`, build a `logger.child({ traceId, jobKind, jobKey })`
  before invoking the handler.
- Handler in `src/index.ts:218` passes that logger down into each pipeline
  instead of the global one.
- Retries keep the same `traceId` but bump an `attempt` counter (so a single
  job's full lifecycle — including retries — is greppable as one unit, and
  individual attempts are still distinguishable).
- Later, when `CONCURRENCY > 1`, add `workerId` (stable per worker slot) as a
  second dimension.

Why first:
- Reveals any module that still writes through a "global" logger — those are
  exactly the places that would leak state across workers.
- Gives a direct Datadog query (`traceId:abc123`) for debugging the very first
  parallel run.
- Backward-compatible with single-worker operation, so it can live in `main`
  well before `CONCURRENCY` is bumped.

---

## 4. Blockers before raising `CONCURRENCY`

Listed by severity. Each must be addressed before multiple jobs run
simultaneously.

### 4.1. Shared workdir per repo (critical)

Every pipeline uses `join(workdirPath, repo.name)` — one path per repo,
shared across all jobs touching that repo. See `src/pipeline/pr-mention-pipeline.ts:71`,
`src/pipeline/pr-comment-pipeline.ts:70`, etc. Two jobs on the same repo doing
`checkout` / `pull` / `install` simultaneously will corrupt the working tree.

Options:
- **Git worktrees** (`.worktrees/` is already gitignored — the intent may have
  been this). One worktree per job, shared `.git` objects, isolated working
  tree.
- **Indexed clones**: `workdir/<repo>-<traceId>/`. More disk/network cost, but
  trivially correct.

### 4.2. `configureGitAuth` writes to global git config

`src/repo/manager.ts:48` sets `url.insteadOf` in `--global` scope. Two jobs
running it concurrently with different tokens (e.g. during GitHub App token
rotation) will race; the last writer wins and the other job may push with the
wrong token.

Options:
- Once workdir isolation is in place, move the auth config to `--local` scope
  on the worktree.
- Or serialize this specific operation behind an internal mutex.

### 4.3. `installDependencies()` is not concurrency-safe

`bun/npm/yarn install` against the same `node_modules` from two processes will
corrupt it. Resolved automatically once workdirs are isolated, but consider:
- Keep the package manager **cache** global (`YARN_CACHE_FOLDER`,
  `npm_config_cache`, Bun cache) so per-job installs stay cheap.

### 4.4. `appendPrMemory` race

`memory/<repo>#<pr>.json` is read → mutated → written without a lock
(`src/pr-memory/store.ts`). Two comments on the same PR processed in parallel
will clobber each other.

Cheap fix: file lock, or redesign as append-only writes with a compact step.

### 4.5. Poller coordination relies on `isBusy()`

Current contract: "if a job is running, skip". With `CONCURRENCY = N` this
becomes "if there are N jobs running, skip". Needed:
- Replace `isBusy()` with `hasCapacity()` (or similar) on the queue.
- Revisit the "skip if a jira job is running" rule in
  `pr-comment-poller.ts:79` and `pr-mention-poller.ts:83` — decide what it
  means when several jira jobs run in parallel.

### 4.6. CodeArtifact token refresh (minor)

`configureCodeArtifactAuth` at boot is global. If it expires mid-run, every
in-flight job fails together. Already true today; parallelism just amplifies
the blast radius. Consider refreshing on a timer with some jitter.

### 4.7. Cursor CLI spawn cost

`cursor-agent` subprocesses consume 1–2 GB RSS each. N jobs ≈ N × memory.
Measurement required before picking a production `CONCURRENCY` value. In
local dev, 2–3 is probably the ceiling.

### 4.8. Retry reinserts at the front of the queue

`memory-queue.ts:148` uses `jobs.unshift()`. Fine with one worker. With many
workers, a "toxic" job could retry immediately on another worker and spam
logs. Not dangerous, but worth reviewing.

---

## 5. Migration plan — ordered, each step mergeable on its own

Each step is reversible and does not force the next one. The goal is to land
them as independent PRs and observe production between steps.

1. **Add `traceId` + child loggers on every pipeline.** No behavior change.
   Live on this for ~1 week in prod to confirm logs correlate.
2. **Workdir isolation per job** (git worktrees or traceId-indexed clones),
   still with `CONCURRENCY = 1`. Code now assumes isolation.
3. **Move `configureGitAuth` to worktree-local config**, add lock around
   `appendPrMemory`.
4. **Introduce `hasCapacity(n)` on the queue** and replace `isBusy()` in
   pollers. Default stays `CONCURRENCY = 1`, so no observable change yet.
5. **Raise `CONCURRENCY` to 2 behind an env var.** Watch memory, actual
   concurrency (via `traceId` logs), log readability.
6. **Tune by metrics** — final ceiling, retry policy revisit, any jobKind
   priority rules.

---

## 6. Out of scope (explicitly)

- External queue backends (Redis Streams, SQS, etc.).
- Multi-process / multi-machine horizontal scaling.
- Rewriting pipelines to stream their outputs incrementally.
- Persistent queue across restarts (today, a crash loses in-flight jobs;
  pollers will re-discover pending Jira/GitHub/Slack items on next tick, which
  is the intended recovery path).

---

## 7. Open questions

- Should `fix-pipeline` jobs (long, CPU/mem heavy, ~30–50 min) share a worker
  pool with short jobs (pr-comment, slack-mention, ~1–5 min)? A split pool
  (e.g. 1 "heavy" slot + 2 "light" slots) may be more useful than a flat
  concurrency number. Decide during step 5.
- Do we want a dead-letter destination for jobs that exceed `MAX_RETRIES`?
  Today they are only logged via `job_aborted` in `memory-queue.ts:150`.
- Should `traceId` be propagated into Jira/GitHub comments posted by the bot
  (as a trailing `<!-- traceId: ... -->`) so users can quote it when reporting
  issues?