# Design: Promote hu-agent to a real PRD environment + cutover

- **Date:** 2026-05-29
- **Author:** Iván Gómez Yaury
- **Status:** Draft (pending review)
- **Sub-project:** 1 of 2. Sub-project 2 (convert the dev account into an isolated branch sandbox) is a separate spec that depends on this one.

## Context

hu-agent runs as a single long-lived Bun process on ECS Fargate, with a single
in-memory queue (concurrency = 1). It polls Jira for tickets **assigned to its bot
account** (`jira-poller.ts:88`: `assignee = "<botAccountId>" AND status = "…"`),
polls GitHub for PR comments/mentions, and polls Slack for mentions — all keyed off
one production identity (Jira bot account, GitHub App `2940076`, Slack bot
`U0AJU6F4RPV`).
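
To make the dependence on a single identity concrete, here is a minimal sketch of how such a query could be assembled. `buildAssignedTicketsJql` is a hypothetical helper and the status value is a placeholder; the actual query in `jira-poller.ts` may be built differently.

```typescript
// Hypothetical helper: the real query lives in jira-poller.ts and may differ.
function buildAssignedTicketsJql(botAccountId: string, status: string): string {
  // Escape backslashes and double quotes so each value stays inside its JQL string literal.
  const quote = (value: string): string =>
    `"${value.replace(/\\/g, "\\\\").replace(/"/g, '\\"')}"`;
  return `assignee = ${quote(botAccountId)} AND status = ${quote(status)}`;
}

// Placeholder values: two deployments configured with the same bot account
// build the same query, so Jira returns the same tickets to both.
const devQuery = buildAssignedTicketsJql("bot-account-id", "Some Status");
const prdQuery = buildAssignedTicketsJql("bot-account-id", "Some Status");
console.log(devQuery === prdQuery);
```

Nothing in the query distinguishes one deployment from another, which is why the cutover has to be sequenced rather than run in parallel.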

**Today there is only one deployed environment, and it is production in disguise.**
`infrastructure/env/` contains only `dev/`, which deploys into AWS account
`923929101992` (the "dev" account) but runs with the full production configuration:
`infrastructure/app/main.tf` hardcodes `NODE_ENV=production`, `JIRA_POLL_ENABLED=true`,
`JIRA_TICKET_REVIEW_ENABLED=true`, `ATTEND_FEEDBACK_IN_PR_ENABLED=true`,
`SLACK_MENTIONS_ENABLED=true`, `PR_MENTION_ENABLED=true`, `DAILY_REPORT_ENABLED=true`,
real Slack channels, the production GitHub App, and secrets from SSM. So the bot does
its real work from the dev account.

The goal is to give hu-agent a real home in the **PRD** account (`887841176879`) — the
same account where the monolith and `code-reviewer` already run — and then (in
sub-project 2) free the dev account to become an isolated branch-testing sandbox.

## Goal

Stand up a production-grade `prd` Terraform environment that runs hu-agent in account
`887841176879` with the existing production identities and configuration, add a
`prd.yml` deploy workflow mirroring the team's established pattern, and execute a
cutover that moves the live workload from the dev account to PRD **without two
instances ever processing the same work concurrently**.

## Non-Goals

- **No dev-sandbox work.** Provisioning the isolated dev identities (separate Jira
  account, GitHub App, Slack bot), the dev test-repo config, and repurposing the dev
  deploy workflow are all sub-project 2. This spec only *neuters* the dev instance as
  the final cutover step so it stops competing — it does not rebuild it.
- **No new application identities.** PRD reuses the exact production identities the dev
  account runs today. New identities belong to sub-project 2 (for dev).
- **No app code changes.** This is infrastructure + CI + operational sequencing only.
  The app already reads everything from env/SSM.
- **No per-branch preview environments.** PRD is a single service; branch isolation is
  sub-project 2's concern.

## Architecture

### New `infrastructure/env/prd/`

A near-mirror of `infrastructure/env/dev/`. The `infrastructure/app` module is already
parameterized by `env`, cluster ARN, ALB ARN, and target group, and the SSM secret ARNs
in `app/main.tf` are built from `data.aws_caller_identity.current.account_id`, so they
resolve to the PRD account automatically when applied there. Files:

- **`versions.tf`** — same provider constraints; backend `s3` bucket
  `humand-terraform-state-prd`, key `service/hu-agent/terraform.tfstate`, region
  `us-east-1`, `encrypt = true`.
- **`providers.tf`** — `default_tags` with `Environment = "prd"`.
- **`ecr.tf`** — `aws_ecr_repository "hu-agent"` + lifecycle policy (keep last 10),
  identical to dev but living in the PRD account.
- **`main.tf`** — `module "app"` with `env = "prd"`, the PRD ECS cluster ARN, the PRD
  private target group ARN, `use_load_balancer = true`, and the same resource sizing
  as dev (cpu 2048 / memory 16384 / ephemeral 50 GiB / Fargate non-spot) unless we
  decide PRD needs more.
- **`target_group.tf`** — PRD private target group + a listener rule on the PRD private
  ALB forwarding `/hu-agent` and `/hu-agent/*` (mirrors dev).
- **`variables.tf`** / **`outputs.tf`** — PRD VPC id, private ALB ARN, health-check path
  `/hu-agent/health`.

`DD_ENV` becomes `prd` automatically (driven by `var.env`), which separates PRD from dev
in Datadog — the existing dashboards filter by `env`.
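
The account-driven ARN resolution can be illustrated with a small sketch. It is written in TypeScript for illustration only; the real construction is HCL inside `infrastructure/app/main.tf`. The region argument is an assumption (this spec states `us-east-1` only for the state bucket), and `ssmParameterArn` is a hypothetical name.

```typescript
// Sketch only: mirrors the idea of building SSM ARNs from the caller's account id.
const DEV_ACCOUNT_ID = "923929101992";
const PRD_ACCOUNT_ID = "887841176879";

function ssmParameterArn(accountId: string, region: string, name: string): string {
  // The ARN resource part is "parameter" followed by the slash-prefixed parameter name.
  const path = name.startsWith("/") ? name : `/${name}`;
  return `arn:aws:ssm:${region}:${accountId}:parameter${path}`;
}

// Same parameter name, different account: only the account segment changes.
// The region here is an assumption, not something this spec confirms for the service.
console.log(ssmParameterArn(DEV_ACCOUNT_ID, "us-east-1", "/hu-agent/jira-email"));
console.log(ssmParameterArn(PRD_ACCOUNT_ID, "us-east-1", "/hu-agent/jira-email"));
```

Because the names are identical across accounts, the module needs no PRD-specific secret wiring, but the parameters themselves must exist in the PRD account before the task can boot.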

### Deploy workflow

- **`.github/workflows/prd.yml`** — modeled on hu-agent's own convention (the `environment`
  input is a JSON blob `{"env","acc"}`, unlike code-reviewer's separate `account` input):
  ```yaml
  name: PRD Environment deployment
  on:
    workflow_dispatch:
      inputs:
        branch: { description: "Branch to deploy", required: true, default: "release", type: string }
  jobs:
    build-and-deploy:
      uses: ./.github/workflows/deployment.yml
      with:
        environment: '{"env":"prd", "acc":"887841176879"}'
        ref: ${{ format('refs/heads/{0}', github.event.inputs.branch) }}
  ```
  **Manual-first (`workflow_dispatch` only)** for controlled rollout. Auto-deploy on
  push to `release` (mirroring dev.yml today, and code-reviewer's push-to-`main`) is a
  small follow-up once PRD is trusted: a `push` trigger plus a fallback for `ref`, since
  `github.event.inputs.branch` is empty on push events. It is deliberately deferred so
  the cutover is human-gated.
- **`deployment.yml`** — add the `prd` JSON blob to the `workflow_dispatch`
  `environment` choice options. The reusable workflow already resolves the OIDC role as
  `arn:aws:iam::{acc}:role/github-oidc-{env}` → `github-oidc-prd` in `887841176879`
  (the same role code-reviewer uses), runs `test:coverage` before deploy, builds an
  arm64 image tagged with the short SHA, pushes to ECR, and `terraform apply`s. No
  changes needed beyond the new choice option.
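
The way the `environment` blob maps to the OIDC role and the deploy ref can be sketched as follows. This is an illustration of the naming rule described above, not the workflow's actual implementation (which is GitHub Actions expressions); `parseDeployTarget`, `oidcRoleArn`, and `branchRef` are hypothetical names.

```typescript
interface DeployTarget {
  env: string; // e.g. "prd"
  acc: string; // 12-digit AWS account id
}

// Parse and validate the JSON blob passed as the `environment` input.
function parseDeployTarget(blob: string): DeployTarget {
  const parsed: unknown = JSON.parse(blob);
  if (typeof parsed !== "object" || parsed === null) {
    throw new Error("environment must be a JSON object");
  }
  const { env, acc } = parsed as Record<string, unknown>;
  if (typeof env !== "string" || env.length === 0) {
    throw new Error('environment is missing "env"');
  }
  if (typeof acc !== "string" || !/^\d{12}$/.test(acc)) {
    throw new Error('environment is missing a 12-digit "acc"');
  }
  return { env, acc };
}

// arn:aws:iam::{acc}:role/github-oidc-{env}
function oidcRoleArn(target: DeployTarget): string {
  return `arn:aws:iam::${target.acc}:role/github-oidc-${target.env}`;
}

// Mirrors format('refs/heads/{0}', branch): the input must be a branch name.
function branchRef(branch: string): string {
  return `refs/heads/${branch}`;
}

const prdTarget = parseDeployTarget('{"env":"prd", "acc":"887841176879"}');
console.log(oidcRoleArn(prdTarget));
console.log(branchRef("release"));
```

Because the ref is always prefixed with `refs/heads/`, passing a tag name would produce a ref that does not exist.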

### Prerequisites (confirmed likely-present; verify with PRD-RO, do not mutate)

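Items 1-3 are reported by the user as probably already in place, since code-reviewer and
the monolith use them; they still need verification. Item 4 is the exception: it does not
exist yet.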

1. PRD ECS cluster + private ALB equivalent to dev's `private-dev-services` /
   `private-services` — **capture the exact ARNs** for `env/prd/main.tf` and
   `target_group.tf`.
2. Terraform state bucket `humand-terraform-state-prd`.
3. OIDC role `github-oidc-prd` (code-reviewer deploys PRD with it).
4. **SSM parameters `/hu-agent/*` in the PRD account** — these do NOT exist yet and must
   be created: `jira-email`, `jira-api-token`, `jira-domain`, `github-token` (GitHub App
   private key PEM), `cursor-api-key`, the Anthropic key if the Claude migration has
   landed, `slack-bot-token`, `slack-channel-id`, `npm-token-google-sign-in`. **This is
   the only prerequisite that writes to PRD, so it requires the user's explicit OK at
   execution time** (per the PRD no-touch rule, the user will say when).

These become a verification checklist in the plan, read-only via `PRD-RO` where possible.
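
The SSM part of that checklist could be expressed as a simple set difference. This is a sketch under assumptions: the list of present names would come from a read-only listing of the PRD account (not shown), `missingParameters` is a hypothetical helper, and the Anthropic key is left out because its parameter name is not given in this spec.

```typescript
// Names taken from prerequisite 4; the conditional Anthropic key is intentionally omitted.
const REQUIRED_PARAMETERS: readonly string[] = [
  "jira-email",
  "jira-api-token",
  "jira-domain",
  "github-token",
  "cursor-api-key",
  "slack-bot-token",
  "slack-channel-id",
  "npm-token-google-sign-in",
];

// Return the fully qualified names that are required but not present.
function missingParameters(
  presentNames: readonly string[],
  prefix: string = "/hu-agent/",
): string[] {
  const present = new Set(presentNames);
  return REQUIRED_PARAMETERS
    .map((name) => `${prefix}${name}`)
    .filter((fullName) => !present.has(fullName));
}

// Example: a listing that only contains two of the parameters.
console.log(missingParameters(["/hu-agent/jira-email", "/hu-agent/jira-domain"]));
```

An empty result is the condition for ticking off item 4; anything else is the exact list to create once the user gives the go-ahead.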

## Cutover

The risk is concurrency. The JQL matches on `assignee = <bot>`, and each instance's queue
is in-memory with concurrency = 1, which serializes work inside one process but does
nothing to coordinate two processes. Two instances sharing the production identity would
therefore both grab the same assigned ticket, producing duplicate PRs, double PR-comment
replies, double Slack answers, and git races. The sequence is designed so **the two
instances are never both polling at once**, accepting a short no-coverage gap instead of
any overlap.
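
A toy model shows why a per-process queue cannot prevent the overlap. Everything here is illustrative: the `Instance` shape, the `poll` function, and the ticket key are hypothetical and stand in for the real poller.

```typescript
interface Ticket {
  key: string;
  assignee: string;
}

interface Instance {
  name: string;
  botAccountId: string;
  pollingEnabled: boolean;
  processed: string[]; // lives in this process's memory only
}

function poll(instance: Instance, board: readonly Ticket[]): void {
  if (!instance.pollingEnabled) return;
  for (const ticket of board) {
    // De-duplication only sees this instance's own history.
    const alreadySeen = instance.processed.includes(ticket.key);
    if (ticket.assignee === instance.botAccountId && !alreadySeen) {
      instance.processed.push(ticket.key);
    }
  }
}

// Tickets that both instances picked up.
function doubleProcessed(a: Instance, b: Instance): string[] {
  return a.processed.filter((key) => b.processed.includes(key));
}

const board: Ticket[] = [{ key: "TICKET-1", assignee: "bot" }];
const dev: Instance = { name: "dev", botAccountId: "bot", pollingEnabled: true, processed: [] };
const prd: Instance = { name: "prd", botAccountId: "bot", pollingEnabled: true, processed: [] };
poll(dev, board);
poll(prd, board);
console.log(doubleProcessed(dev, prd)); // both instances handled TICKET-1
```

With `pollingEnabled` set to `false` on either instance the overlap disappears, which is the property the cutover sequence preserves at every step.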

1. **Provision PRD with pollers OFF.** Apply `env/prd` with the service deployed but
   `JIRA_POLL_ENABLED=false`, `JIRA_TICKET_REVIEW_ENABLED=false`,
   `ATTEND_FEEDBACK_IN_PR_ENABLED=false`, `SLACK_MENTIONS_ENABLED=false`,
   `PR_MENTION_ENABLED=false`, `DAILY_REPORT_ENABLED=false` (override via `env_vars` in
   `env/prd/main.tf`, since the shared `app` module defaults them on). Verify: health
   endpoint green, SSM secrets resolve, the task can reach Jira/GitHub/Slack, and a
   `POST /hu-agent/api/pipeline/trigger` on a throwaway ticket completes end to end.
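2. **Neuter dev FIRST.** Turn the dev instance's pollers off and confirm it is quiet (no
   in-flight jobs; queue drained). Scaling to `desired_count = 0` is the alternative, but
   because the queue is in-memory, any job still queued or running is lost with the task,
   so scale down only after the drain.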
3. **Enable PRD.** Flip the PRD pollers on (remove the overrides → `terraform apply` →
   redeploy). PRD now owns the workload.
4. **Monitor** `env:prd` in Datadog for a stabilization window. Keep the dev deployment
   intact and re-enablable for immediate rollback (reverse the flip) if PRD misbehaves.
5. **Handoff to sub-project 2.** Once PRD is stable, the dev account is free to be rebuilt
   as the isolated branch sandbox.
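
The ordering above can be checked mechanically. The sketch below models each step as a pair of polling states and asserts that no step has both instances polling; it also shows the flag map used to switch every poller at once. `pollerEnv`, `CUTOVER`, and `firstOverlap` are hypothetical names, and how the flags reach the task (via `env_vars`) is as described in step 1.

```typescript
const POLLER_FLAGS = [
  "JIRA_POLL_ENABLED",
  "JIRA_TICKET_REVIEW_ENABLED",
  "ATTEND_FEEDBACK_IN_PR_ENABLED",
  "SLACK_MENTIONS_ENABLED",
  "PR_MENTION_ENABLED",
  "DAILY_REPORT_ENABLED",
] as const;

type PollerFlag = (typeof POLLER_FLAGS)[number];

// Build the env overrides that turn every poller on or off together.
function pollerEnv(enabled: boolean): Record<PollerFlag, string> {
  const env = {} as Record<PollerFlag, string>;
  for (const flag of POLLER_FLAGS) env[flag] = String(enabled);
  return env;
}

interface Phase {
  step: string;
  devPolling: boolean;
  prdPolling: boolean;
}

const CUTOVER: readonly Phase[] = [
  { step: "today", devPolling: true, prdPolling: false },
  { step: "1. provision PRD with pollers off", devPolling: true, prdPolling: false },
  { step: "2. neuter dev", devPolling: false, prdPolling: false }, // accepted no-coverage gap
  { step: "3. enable PRD", devPolling: false, prdPolling: true },
  { step: "4. monitor", devPolling: false, prdPolling: true },
];

// The invariant: no phase may have both instances polling.
function firstOverlap(phases: readonly Phase[]): Phase | undefined {
  return phases.find((phase) => phase.devPolling && phase.prdPolling);
}

console.log(pollerEnv(false));
console.log(firstOverlap(CUTOVER));
```

Swapping steps 2 and 3 makes `firstOverlap` return a phase, which is the failure mode the sequence exists to rule out. A rollback follows the same rule in reverse: PRD pollers off first, then dev back on.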

A brief gap where nothing polls (step 2 → 3) is acceptable and far safer than any window
of double-processing. The flip is human-gated (manual `workflow_dispatch` + manual env
change), not automated.

## Observability & cost

PRD adds a second always-on service (2 vCPU / 16 GB, Fargate non-spot). During the
transition both run, so cost roughly doubles. After cutover, the dev service can sit at
`desired_count = 0` except when actively testing (sub-project 2). Datadog separates the
two by the `env` tag (`prd` vs `dev`); the existing dashboards already template on `env`,
so add `prd` to their selectors.

## Risks

- **Double-processing during cutover** — mitigated by the "dev off before PRD on"
  sequence above. The single biggest risk; the ordering is load-bearing.
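- **Missing/incorrect PRD shared infra ARNs** (cluster/ALB): verified read-only before
  writing `env/prd`. An ARN that does not exist fails at `terraform plan` or
  `terraform apply`, not at runtime; an ARN that points at the wrong existing resource
  would not fail, which is why the read-only check comes first.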
- **SSM secrets drift** — if a PRD `/hu-agent/*` param is missing or stale, the task
  fails to boot or authenticates wrong. Step 1's verification (health + a trigger run)
  catches this before the cutover flip.
- **State backend** — `humand-terraform-state-prd` must exist and the OIDC role must be
  allowed to write to it; confirm before the first `init`.
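
Step 1's verification acts as the gate for several of these risks, and it can be written down as an explicit go/no-go check. This is a sketch only: the `StepOneEvidence` shape and `cutoverBlockers` are hypothetical, how each piece of evidence is collected is left open, and treating any 2xx status as "green" is an assumption.

```typescript
interface StepOneEvidence {
  healthStatus: number; // HTTP status returned by /hu-agent/health
  missingParameters: string[]; // /hu-agent/* names that failed to resolve
  unreachableServices: string[]; // any of Jira, GitHub, Slack the task could not reach
  triggerRunCompleted: boolean; // the throwaway-ticket pipeline run finished end to end
}

// Return every reason the cutover flip should not proceed; empty means go.
function cutoverBlockers(evidence: StepOneEvidence): string[] {
  const blockers: string[] = [];
  if (evidence.healthStatus < 200 || evidence.healthStatus >= 300) {
    blockers.push(`health endpoint returned ${evidence.healthStatus}`);
  }
  for (const name of evidence.missingParameters) {
    blockers.push(`SSM parameter not resolved: ${name}`);
  }
  for (const service of evidence.unreachableServices) {
    blockers.push(`cannot reach ${service}`);
  }
  if (!evidence.triggerRunCompleted) {
    blockers.push("manual pipeline trigger did not complete");
  }
  return blockers;
}

const evidence: StepOneEvidence = {
  healthStatus: 200,
  missingParameters: ["/hu-agent/slack-bot-token"],
  unreachableServices: [],
  triggerRunCompleted: false,
};
console.log(cutoverBlockers(evidence));
```

Collecting all blockers rather than stopping at the first gives one complete list to fix before dev is neutered.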

## Open questions / dependencies

- Exact PRD cluster ARN, private ALB ARN, and VPC id (capture via PRD-RO).
- Whether PRD should run larger than dev's 2 vCPU / 16 GB (default: keep parity).
- Timing of the SSM secret population in PRD (user gives the go at execution time).
- Sub-project 2 (dev isolated sandbox) is the natural follow-up and assumes this cutover
  is complete.