# Parallelism Roadmap Plan for moving `hu-agent` from the current single-worker model to a safe, observable, multi-worker one. > **Status:** planning document. No code changes are implied by this file — it is a > shared reference for sequencing the work and reviewing the blockers. --- ## 1. Current state (as of this writing) - `QUEUE.CONCURRENCY = 1` — see `src/core/constants.ts:32`. - Single Bun process. Single in-memory FIFO queue (`MemoryQueue` in `src/queue/memory-queue.ts`). - N pollers run concurrently on the event loop but only enqueue work when the worker is idle: - `JiraPoller` — 5 s - `JiraTicketReviewPoller` — 30 s - `PrCommentPoller` — 30 s - `PrMentionPoller` — 60 s - `SlackMentionPoller` — 30 s - Coordination rule: each poller calls `queue.isBusy()` at the start of its cycle and skips the whole cycle if a job is currently running (`src/poller/pr-comment-poller.ts:75`, etc.). `PrCommentPoller` and `PrMentionPoller` additionally skip if the running job is of kind `jira` (long fix pipelines). - The queue can still hold pending jobs (a single poller can enqueue multiple jobs in one cycle, e.g. 3 pending comments on the same PR). Other pollers see `isBusy = true` until the whole batch drains. - Deduplication by `jobLogKey` in `enqueue()` — same job key cannot be queued twice. - Retry on failure: up to `MAX_RETRIES = 2`, reinserted at the **front** of the queue (`jobs.unshift()`). Timeout per job: `JOB_TIMEOUT_MS = 50 min`. --- ## 2. Goal Lift `CONCURRENCY` above 1 (start with 2, measure, grow from there) **without** introducing data corruption, token races, or log chaos. Non-goals for this roadmap: - Horizontal scaling (multiple processes / machines). This doc is about in-process concurrency only. - Replacing the in-memory queue with an external broker (Redis, SQS). May come later; not required to get parallelism value. --- ## 3. Observability prerequisite — `traceId` per job **Ship this first**, while `CONCURRENCY` is still 1. It does not change behavior, but it forces every module to accept a scoped logger and gives us log correlation before we need it. Concrete changes: - Add `traceId: string` (short UUID) to `JobPayload`. Generate in `enqueue()` if not set. - In `MemoryQueue.drain()`, build a `logger.child({ traceId, jobKind, jobKey })` before invoking the handler. - Handler in `src/index.ts:218` passes that logger down into each pipeline instead of the global one. - Retries keep the same `traceId` but bump an `attempt` counter (so a single job's full lifecycle — including retries — is greppable as one unit, and individual attempts are still distinguishable). - Later, when `CONCURRENCY > 1`, add `workerId` (stable per worker slot) as a second dimension. Why first: - Reveals any module that still writes through a "global" logger — those are exactly the places that would leak state across workers. - Gives a direct Datadog query (`traceId:abc123`) for debugging the very first parallel run. - Backward-compatible with single-worker operation, so it can live in `main` well before `CONCURRENCY` is bumped. --- ## 4. Blockers before raising `CONCURRENCY` Listed by severity. Each must be addressed before multiple jobs run simultaneously. ### 4.1. Shared workdir per repo (critical) Every pipeline uses `join(workdirPath, repo.name)` — one path per repo, shared across all jobs touching that repo. See `src/pipeline/pr-mention-pipeline.ts:71`, `src/pipeline/pr-comment-pipeline.ts:70`, etc. Two jobs on the same repo doing `checkout` / `pull` / `install` simultaneously will corrupt the working tree. Options: - **Git worktrees** (`.worktrees/` is already gitignored — the intent may have been this). One worktree per job, shared `.git` objects, isolated working tree. - **Indexed clones**: `workdir/-/`. More disk/network cost, but trivially correct. ### 4.2. `configureGitAuth` writes to global git config `src/repo/manager.ts:48` sets `url.insteadOf` in `--global` scope. Two jobs running it concurrently with different tokens (e.g. during GitHub App token rotation) will race; the last writer wins and the other job may push with the wrong token. Options: - Once workdir isolation is in place, move the auth config to `--local` scope on the worktree. - Or serialize this specific operation behind an internal mutex. ### 4.3. `installDependencies()` is not concurrency-safe `bun/npm/yarn install` against the same `node_modules` from two processes will corrupt it. Resolved automatically once workdirs are isolated, but consider: - Keep the package manager **cache** global (`YARN_CACHE_FOLDER`, `npm_config_cache`, Bun cache) so per-job installs stay cheap. ### 4.4. `appendPrMemory` race `memory/#.json` is read → mutated → written without a lock (`src/pr-memory/store.ts`). Two comments on the same PR processed in parallel will clobber each other. Cheap fix: file lock, or redesign as append-only writes with a compact step. ### 4.5. Poller coordination relies on `isBusy()` Current contract: "if a job is running, skip". With `CONCURRENCY = N` this becomes "if there are N jobs running, skip". Needed: - Replace `isBusy()` with `hasCapacity()` (or similar) on the queue. - Revisit the "skip if a jira job is running" rule in `pr-comment-poller.ts:79` and `pr-mention-poller.ts:83` — decide what it means when several jira jobs run in parallel. ### 4.6. CodeArtifact token refresh (minor) `configureCodeArtifactAuth` at boot is global. If it expires mid-run, every in-flight job fails together. Already true today; parallelism just amplifies the blast radius. Consider refreshing on a timer with some jitter. ### 4.7. Cursor CLI spawn cost `cursor-agent` subprocesses consume 1–2 GB RSS each. N jobs ≈ N × memory. Measurement required before picking a production `CONCURRENCY` value. In local dev, 2–3 is probably the ceiling. ### 4.8. Retry reinserts at the front of the queue `memory-queue.ts:148` uses `jobs.unshift()`. Fine with one worker. With many workers, a "toxic" job could retry immediately on another worker and spam logs. Not dangerous, but worth reviewing. --- ## 5. Migration plan — ordered, each step mergeable on its own Each step is reversible and does not force the next one. The goal is to land them as independent PRs and observe production between steps. 1. **Add `traceId` + child loggers on every pipeline.** No behavior change. Live on this for ~1 week in prod to confirm logs correlate. 2. **Workdir isolation per job** (git worktrees or traceId-indexed clones), still with `CONCURRENCY = 1`. Code now assumes isolation. 3. **Move `configureGitAuth` to worktree-local config**, add lock around `appendPrMemory`. 4. **Introduce `hasCapacity(n)` on the queue** and replace `isBusy()` in pollers. Default stays `CONCURRENCY = 1`, so no observable change yet. 5. **Raise `CONCURRENCY` to 2 behind an env var.** Watch memory, actual concurrency (via `traceId` logs), log readability. 6. **Tune by metrics** — final ceiling, retry policy revisit, any jobKind priority rules. --- ## 6. Out of scope (explicitly) - External queue backends (Redis Streams, SQS, etc.). - Multi-process / multi-machine horizontal scaling. - Rewriting pipelines to stream their outputs incrementally. - Persistent queue across restarts (today, a crash loses in-flight jobs; pollers will re-discover pending Jira/GitHub/Slack items on next tick, which is the intended recovery path). --- ## 7. Open questions - Should `fix-pipeline` jobs (long, CPU/mem heavy, ~30–50 min) share a worker pool with short jobs (pr-comment, slack-mention, ~1–5 min)? A split pool (e.g. 1 "heavy" slot + 2 "light" slots) may be more useful than a flat concurrency number. Decide during step 5. - Do we want a dead-letter destination for jobs that exceed `MAX_RETRIES`? Today they are only logged via `job_aborted` in `memory-queue.ts:150`. - Should `traceId` be propagated into Jira/GitHub comments posted by the bot (as a trailing ``) so users can quote it when reporting issues?