# Design: Promote hu-agent to a real PRD environment + cutover

- **Date:** 2026-05-29
- **Author:** Iván Gómez Yaury
- **Status:** Draft (pending review)
- **Sub-project:** 1 of 2. Sub-project 2 (convert the dev account into an isolated branch sandbox) is a separate spec that depends on this one.

## Context

hu-agent runs as a single long-lived Bun process on ECS Fargate, with a single in-memory queue (concurrency = 1). It polls Jira for tickets **assigned to its bot account** (`jira-poller.ts:88`: `assignee = "" AND status = "…"`), polls GitHub for PR comments/mentions, and polls Slack for mentions, all keyed off one production identity (Jira bot account, GitHub App `2940076`, Slack bot `U0AJU6F4RPV`).

**Today there is only one deployed environment, and it is production in disguise.** `infrastructure/env/` contains only `dev/`, which deploys into AWS account `923929101992` (the "dev" account) but runs with the full production configuration: `infrastructure/app/main.tf` hardcodes `NODE_ENV=production`, `JIRA_POLL_ENABLED=true`, `JIRA_TICKET_REVIEW_ENABLED=true`, `ATTEND_FEEDBACK_IN_PR_ENABLED=true`, `SLACK_MENTIONS_ENABLED=true`, `PR_MENTION_ENABLED=true`, `DAILY_REPORT_ENABLED=true`, real Slack channels, the production GitHub App, and secrets from SSM. So the bot does its real work from the dev account.

The goal is to give hu-agent a real home in the **PRD** account (`887841176879`), the same account where the monolith and `code-reviewer` already run, and then (in sub-project 2) free the dev account to become an isolated branch-testing sandbox.

## Goal

Stand up a production-grade `prd` Terraform environment that runs hu-agent in account `887841176879` with the existing production identities and configuration, add a `prd.yml` deploy workflow mirroring the team's established pattern, and execute a cutover that moves the live workload from the dev account to PRD **without two instances ever processing the same work concurrently**.
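The no-overlap requirement in the goal can be written down as a small invariant that an operator could evaluate before each cutover step. The sketch below is illustrative only: `InstanceState`, `isActive`, `activeInstances`, and `assertNoOverlap` are hypothetical names that do not exist in hu-agent, and it assumes a flag counts as enabled only when its value is the string `"true"`. The flag names are the six toggles listed in the Context section.

```typescript
// Hypothetical operator-side check; not part of the hu-agent codebase.
// Flag names are the six toggles this document lists for the deployed service.
const WORK_FLAGS = [
  "JIRA_POLL_ENABLED",
  "JIRA_TICKET_REVIEW_ENABLED",
  "ATTEND_FEEDBACK_IN_PR_ENABLED",
  "SLACK_MENTIONS_ENABLED",
  "PR_MENTION_ENABLED",
  "DAILY_REPORT_ENABLED",
] as const;

type WorkFlag = (typeof WORK_FLAGS)[number];

interface InstanceState {
  name: string; // e.g. "dev" or "prd"
  desiredCount: number; // ECS desired task count
  env: Partial<Record<WorkFlag, string>>; // env vars as plain strings
}

// Assumption: an instance competes for work when it has at least one task
// and at least one flag set to the string "true".
function isActive(instance: InstanceState): boolean {
  if (instance.desiredCount <= 0) return false;
  return WORK_FLAGS.some((flag) => instance.env[flag] === "true");
}

// Names of the instances that would currently pick up work.
function activeInstances(instances: InstanceState[]): string[] {
  return instances.filter(isActive).map((i) => i.name);
}

// Throws when more than one instance would process the same work.
function assertNoOverlap(instances: InstanceState[]): void {
  const active = activeInstances(instances);
  if (active.length > 1) {
    throw new Error(`Overlap: ${active.join(", ")} would process the same work`);
  }
}
```

Under these assumptions, a state with dev enabled and PRD deployed with every flag off passes the check, while enabling PRD before dev is quiet throws. A state where neither instance is active also passes, which matches the accepted no-coverage gap.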
## Non-Goals

- **No dev-sandbox work.** Provisioning the isolated dev identities (separate Jira account, GitHub App, Slack bot), the dev test-repo config, and repurposing the dev deploy workflow are all sub-project 2. This spec only *neuters* the dev instance during the cutover (before PRD's pollers are enabled) so it stops competing; it does not rebuild it.
- **No new application identities.** PRD reuses the exact production identities the dev account runs today. New identities belong to sub-project 2 (for dev).
- **No app code changes.** This is infrastructure + CI + operational sequencing only. The app already reads everything from env/SSM.
- **No per-branch preview environments.** PRD is a single service; branch isolation is sub-project 2's concern.

## Architecture

### New `infrastructure/env/prd/`

A near-mirror of `infrastructure/env/dev/`. The `infrastructure/app` module is already parameterized by `env`, cluster ARN, ALB ARN, and target group, and the SSM secret ARNs in `app/main.tf` are built from `data.aws_caller_identity.current.account_id`, so they resolve to the PRD account automatically when applied there.

Files:

- **`versions.tf`**: same provider constraints; backend `s3` bucket `humand-terraform-state-prd`, key `service/hu-agent/terraform.tfstate`, region `us-east-1`, `encrypt = true`.
- **`providers.tf`**: `default_tags` with `Environment = "prd"`.
- **`ecr.tf`**: `aws_ecr_repository "hu-agent"` + lifecycle policy (keep last 10), identical to dev but living in the PRD account.
- **`main.tf`**: `module "app"` with `env = "prd"`, the PRD ECS cluster ARN, the PRD private target group ARN, `use_load_balancer = true`, and the same resource sizing as dev (cpu 2048 / memory 16384 / ephemeral 50 GiB / Fargate non-spot) unless we decide PRD needs more.
- **`target_group.tf`**: PRD private target group + a listener rule on the PRD private ALB forwarding `/hu-agent` and `/hu-agent/*` (mirrors dev).
- **`variables.tf`** / **`outputs.tf`**: PRD VPC id, private ALB ARN, health-check path `/hu-agent/health`.

`DD_ENV` becomes `prd` automatically (driven by `var.env`), which separates PRD from dev in Datadog; the existing dashboards filter by `env`.

### Deploy workflow

- **`.github/workflows/prd.yml`**: modeled on hu-agent's own convention (the `environment` input is a JSON blob `{"env","acc"}`, unlike code-reviewer's separate `account` input):

  ```yaml
  name: PRD Environment deployment
  on:
    workflow_dispatch:
      inputs:
        # `ref` below is built as refs/heads/<branch>, so this input takes a branch name, not a tag.
        branch: { description: "Branch to deploy", required: true, default: "release", type: string }
  jobs:
    build-and-deploy:
      uses: ./.github/workflows/deployment.yml
      with:
        environment: '{"env":"prd", "acc":"887841176879"}'
        ref: ${{ format('refs/heads/{0}', github.event.inputs.branch) }}
  ```

  **Manual-first (`workflow_dispatch` only)** for controlled rollout. Auto-deploy on push to `release` (mirroring dev.yml today, and code-reviewer's push-to-`main`) is a small follow-up once PRD is trusted, deliberately deferred so the cutover is human-gated.
- **`deployment.yml`**: add the `prd` JSON blob to the `workflow_dispatch` `environment` choice options. The reusable workflow already resolves the OIDC role as `arn:aws:iam::{acc}:role/github-oidc-{env}` → `github-oidc-prd` in `887841176879` (the same role code-reviewer uses), runs `test:coverage` before deploy, builds an arm64 image tagged with the short SHA, pushes to ECR, and runs `terraform apply`. No changes needed beyond the new choice option.

### Prerequisites (confirmed likely-present; verify with PRD-RO, do not mutate)

The user expects items 1 to 3 to be in place already (code-reviewer and the monolith use them). Item 4 is the exception: it does not exist yet.

1. PRD ECS cluster + private ALB equivalent to dev's `private-dev-services` / `private-services`: **capture the exact ARNs** for `env/prd/main.tf` and `target_group.tf`.
2. Terraform state bucket `humand-terraform-state-prd`.
3. OIDC role `github-oidc-prd` (code-reviewer deploys PRD with it).
4. **SSM parameters `/hu-agent/*` in the PRD account**: these do NOT exist yet and must be created: `jira-email`, `jira-api-token`, `jira-domain`, `github-token` (GitHub App private key PEM), `cursor-api-key`, the Anthropic key if the Claude migration has landed, `slack-bot-token`, `slack-channel-id`, `npm-token-google-sign-in`. **This is the only step that writes to PRD, and it requires the user's explicit OK at execution time** (per the PRD no-touch rule, the user will say when).

These become a verification checklist in the plan, read-only via `PRD-RO` where possible.

## Cutover

The risk is concurrency. The JQL is `assignee = ` and each instance has its own in-memory queue (concurrency = 1), so nothing deduplicates work across instances: two instances sharing the production identity would both grab the same assigned ticket → duplicate PRs, double PR-comment replies, double Slack answers, git races. The sequence is designed so **the two instances are never both polling at once**, accepting a short no-coverage gap instead of any overlap.

1. **Provision PRD with pollers OFF.** Apply `env/prd` with the service deployed but `JIRA_POLL_ENABLED=false`, `JIRA_TICKET_REVIEW_ENABLED=false`, `ATTEND_FEEDBACK_IN_PR_ENABLED=false`, `SLACK_MENTIONS_ENABLED=false`, `PR_MENTION_ENABLED=false`, `DAILY_REPORT_ENABLED=false` (override via `env_vars` in `env/prd/main.tf`, since the shared `app` module defaults them on). Verify: health endpoint green, SSM secrets resolve, the task can reach Jira/GitHub/Slack, and a `POST /hu-agent/api/pipeline/trigger` on a throwaway ticket completes end to end.
2. **Neuter dev FIRST.** Turn the dev instance's pollers off (or scale `desired_count = 0`) and confirm it is quiet (no in-flight jobs; queue drained).
3. **Enable PRD.** Flip the PRD pollers on (remove the overrides → `terraform apply` → redeploy). PRD now owns the workload.
4. **Monitor** `env:prd` in Datadog for a stabilization window. Keep the dev deployment intact and ready to re-enable for immediate rollback if PRD misbehaves (reverse the flip in the same order: PRD pollers off first, then dev back on).
5. **Handoff to sub-project 2.** Once PRD is stable, the dev account is free to be rebuilt as the isolated branch sandbox.

A brief gap where nothing polls (step 2 → 3) is acceptable and far safer than any window of double-processing. The flip is human-gated (manual `workflow_dispatch` + manual env change), not automated.

## Observability & cost

PRD adds a second always-on service (2 vCPU / 16 GB, Fargate non-spot). During the transition both run, so cost roughly doubles. After cutover, the dev service can sit at `desired_count = 0` except when actively testing (sub-project 2). Datadog separates the two by the `env` tag (`prd` vs `dev`); the existing dashboards already template on `env`, so add `prd` to their selectors.

## Risks

- **Double-processing during cutover**: mitigated by the "dev off before PRD on" sequence above. The single biggest risk; the ordering is load-bearing.
- **Missing/incorrect PRD shared infra ARNs** (cluster/ALB): verified read-only before writing `env/prd`; wrong ARNs should surface at `terraform plan` or `apply`, not at runtime.
- **SSM secrets drift**: if a PRD `/hu-agent/*` param is missing or stale, the task fails to boot or authenticates with the wrong credentials. Step 1's verification (health + a trigger run) catches this before the cutover flip.
- **State backend**: `humand-terraform-state-prd` must exist and the OIDC role must be allowed to write to it; confirm before the first `init`.

## Open questions / dependencies

- Exact PRD cluster ARN, private ALB ARN, and VPC id (capture via PRD-RO).
- Whether PRD should run larger than dev's 2 vCPU / 16 GB (default: keep parity).
- Timing of the SSM secret population in PRD (user gives the go at execution time).
- Sub-project 2 (dev isolated sandbox) is the natural follow-up and assumes this cutover is complete.
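The SSM prerequisite and the secrets-drift risk lend themselves to a read-only checklist. What follows is a minimal sketch, assuming the parameter names present in the PRD account have already been listed by some read-only means (for example under `PRD-RO`) and that each parameter sits directly under the `/hu-agent/` prefix. `REQUIRED_PARAMS`, `PREFIX`, and `missingParams` are hypothetical names, and the Anthropic key is left out because this document does not give its parameter name.

```typescript
// Required parameter names, taken from the prerequisites list in this document.
// The Anthropic key is omitted: its parameter name is not stated here.
const REQUIRED_PARAMS = [
  "jira-email",
  "jira-api-token",
  "jira-domain",
  "github-token",
  "cursor-api-key",
  "slack-bot-token",
  "slack-channel-id",
  "npm-token-google-sign-in",
] as const;

// Assumption: every parameter lives directly under this prefix.
const PREFIX = "/hu-agent/";

// Given the full parameter names found in the account (hypothetical input,
// e.g. pasted from a read-only listing), return the required ones that are absent.
function missingParams(presentNames: string[]): string[] {
  const present = new Set(presentNames);
  return REQUIRED_PARAMS
    .map((name) => `${PREFIX}${name}`)
    .filter((fullName) => !present.has(fullName));
}
```

An empty result means every listed parameter exists; it says nothing about whether the values are correct, which is what the health check and trigger run in step 1 are for.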