pm-agent chat-to-plan round-trip works in prod with no founder intervention

KR: goals#872 configs

Winner: baseline (more candidate issues generated: 4 vs 2)

baseline

Issue count: 4

AC linter pass rate: 50% (2/4)

Total tokens: 4,153

Candidate issues (4)

  1. Repair RUN_ID propagation across chat → tool calls → issue bodies
    ### Summary -- What & Why
    
    Requests from the v0 chat must reliably carry the per-run RUN_ID through every internal step so that issue bodies created in Projects #3 and #1 contain the unique RUN_ID. Without guaranteed propagation, the smoke test cannot verify a round-trip, and test-mode cleanup will leave residue or produce false negatives.
    
    ### Acceptance Criteria (pre-merge)
    
    - [ ] When running the existing measure-kr87.sh (or equivalent local run) against the target v0 chat endpoint with a generated RUN_ID, every issue created in Project #3 (Objectives+KRs) and Project #1 (work issues) contains that RUN_ID in its body as verified by GraphQL queries.
    - [ ] A local/integration run that creates RUN_ID-tagged issues is able to trigger the teardown path and those RUN_ID-tagged issues are closed as NOT_PLANNED by the same run before the run exits.
    
    ### Acceptance Criteria (post-merge)
    
    - [ ] Nightly smoke verification: run measure-kr87.sh; confirm it exits 0. Use GraphQL to verify that ≥1 Objective + ≥1 KR in Project #3 and ≥2 work issues in Project #1 all contain the RUN_ID in their bodies. Confirm teardown closed them as NOT_PLANNED.
    
    ### Prerequisites for Autonomous Execution
    
  2. Add and test few-shot exemplars for v0 plan-generation reliability
    ### Summary -- What & Why
    
    The v0 chat's prompt templates need robust few-shot exemplars so the agent reliably produces the required Objective + KR + sub-issued work issues and surfaces the preview-and-confirm step during a fixture conversation. If exemplars are missing or weak the agent can produce malformed plans and fail the smoke test.
    
    ### Acceptance Criteria (pre-merge)
    
    - [ ] A deterministic fixture conversation (the same one used by measure-kr87.sh) when run against a staging v0 chat with the new exemplars produces a response that includes a preview step and a confirmable plan payload (simulated run shows Objective + KR + work issue metadata).
    - [ ] Automated tests assert that, for the fixture conversation, the agent returns the expected plan shape (fields required by downstream issue creation) for at least 10 consecutive runs to reduce nondeterministic failures.
    
    ### Acceptance Criteria (post-merge)
    
    - [ ] Nightly verification: submit the fixture conversation to the deployed v0 chat endpoint and assert the round-trip (measure-kr87.sh) completes with exit 0 and that created issues include the RUN_ID.
    
    ### Prerequisites for Autonomous Execution
    
  3. Detect and enforce preview-and-confirm flow with automated retry on confirmation failures
    ### Summary -- What & Why
    
    The smoke test requires the chat to surface a preview-and-confirm step and complete confirmation. Currently, intermittent failures in the confirmation step cause noisy diffs and non-zero smoke test exits. Detecting the preview step explicitly and retrying confirmation (idempotently) will make the round-trip robust to transient failures.
    
    ### Acceptance Criteria (pre-merge)
    
    - [ ] Unit/integration tests demonstrate that when the agent returns a preview step, the confirmation action is invoked and either succeeds or, on transient failure, is retried up to a defined short limit with deterministic backoff until success in test runs.
    - [ ] For simulated transient failures injected into the confirmation step, the system either completes the confirmation within retries or surfaces a clear failure that the measure-kr87.sh can detect and clean up.
    
    ### Acceptance Criteria (post-merge)
    
    - [ ] Nightly verification: run measure-kr87.sh and assert the script observes and confirms the preview step and completes with exit 0. If the first confirmation attempt fails in the run, verify that at least one retry attempt occurred and the run still completed successfully.
    
    ### Prerequisites for Autonomous Execution
    
  4. Add a guarded CI job that runs the KR87 smoke test with stronger teardown and observability
    ### Summary -- What & Why
    
    A single, repeatable CI job that runs the measure-kr87.sh smoke test, validates GraphQL post-conditions, and guarantees teardown will make it easy to detect regressions and ensure no residual test issues remain. Without this, regressions slip into prod and the KR cannot be validated automatically.
    
    ### Acceptance Criteria (pre-merge)
    
    - [ ] A new CI job (non-blocking by default) is added to the repo that runs measure-kr87.sh against a configured test endpoint and fails the job if the script exits non-zero.
    - [ ] The CI job demonstrates teardown robustness by creating a RUN_ID, creating issues, and confirming the CI pipeline run shows those issues were closed as NOT_PLANNED by the teardown before the job completes.
    
    ### Acceptance Criteria (post-merge)
    
    - [ ] Nightly verification: CI job runs on schedule, executes measure-kr87.sh, and the CI logs show exit 0. CI logs contain GraphQL query results demonstrating created issues with RUN_ID and subsequent closure as NOT_PLANNED.
    
    ### Prerequisites for Autonomous Execution
    
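The teardown guarantee described in baseline candidate issue 4 could be sketched in shell as follows. This is a minimal illustration under stated assumptions, not the contents of measure-kr87.sh: `run_with_teardown` and the teardown function passed to it are hypothetical names, and the real cleanup (closing RUN_ID-tagged issues as NOT_PLANNED) is assumed to live inside that teardown function.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the original scripts): run a command and
# always run a teardown function afterwards, keeping the command's exit status.
run_with_teardown() {
  local teardown_fn=$1
  shift
  (
    # Double quotes: the function name is expanded now and stored in the trap.
    # The EXIT trap fires when this subshell exits, on success or failure.
    trap "$teardown_fn" EXIT
    "$@"
    exit "$?"   # propagate the command's status out of the subshell
  )
}

# Assumed shape of a teardown function; the gh call is shown as a comment only
# because the issue numbers and repo for a given RUN_ID are not known here:
#   close_run_issues() {
#     gh issue close "$ISSUE_NUMBER" --repo "$REPO" --reason "not planned"
#   }
```

A CI step might then call `run_with_teardown close_run_issues bash measure-kr87.sh`; the step fails exactly when the script exits non-zero, while the teardown still runs.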

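The bounded retry with deterministic backoff named in baseline candidate issue 3 might look like the sketch below. The function name `retry_with_backoff`, its argument order, and the fixed-delay policy are assumptions made for illustration; the source does not specify the retry limit or the backoff schedule.

```shell
#!/usr/bin/env bash
# Hypothetical helper: retry a command up to max_attempts times, sleeping a
# fixed (deterministic) number of seconds between attempts.
# Usage: retry_with_backoff <max_attempts> <delay_seconds> <command> [args...]
retry_with_backoff() {
  local max_attempts=$1 delay_seconds=$2
  shift 2
  local attempt=1
  while true; do
    if "$@"; then
      return 0                      # command succeeded, stop retrying
    fi
    if [ "$attempt" -ge "$max_attempts" ]; then
      # Surface a clear failure that a calling script can detect.
      echo "giving up after $attempt attempts: $*" >&2
      return 1
    fi
    attempt=$((attempt + 1))
    sleep "$delay_seconds"          # fixed delay keeps test runs reproducible
  done
}
```

For example, `retry_with_backoff 3 2 confirm_plan "$RUN_ID"` would try a (hypothetical) `confirm_plan` command at most three times, two seconds apart. Retrying is only safe if the confirmation is idempotent, as the issue summary requires.
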
variant-a

Issue count: 2

AC linter pass rate: 50% (1/2)

Total tokens: 6,403

Candidate issues (2)

  1. Add deterministic RUN_ID smoke-harness + GraphHub verification
    ### Summary -- What & Why
    
    The current end-to-end smoke test can fail nondeterministically because RUN_ID propagation and verification across chat → tools is fragile. Provide a small, deterministic harness that accepts a caller-supplied RUN_ID, runs the chat→plan→issue round-trip in a verifiable test mode, and exposes simple GH/GraphQL checks so autonomous agents can confirm created issues and clean them up. Without this harness we cannot reliably iterate on the agent behavior or validate post-change regressions, slowing delivery.
    
    ### Acceptance Criteria (pre-merge)
    
    - [ ] A new harness script exists at ./scripts/measure/kr87_unit_smoke.sh and exits 0 when invoked with a RUN_ID. Verification command: RUN_ID=kr87-test-$(date -u +%Y%m%dT%H%M%SZ); bash ./scripts/measure/kr87_unit_smoke.sh "$RUN_ID"
    - [ ] After running the harness with a RUN_ID, the created issues are visible via the GH CLI in both target repos. Verification command (runs the harness, then asserts counts): RUN_ID=kr87-test-$(date -u +%Y%m%dT%H%M%SZ); bash ./scripts/measure/kr87_unit_smoke.sh "$RUN_ID"; sleep 5; COUNT_GOALS=$(gh issue list --repo gonzalomelov/goals --search "$RUN_ID" --limit 100 | wc -l); echo "goals_count:$COUNT_GOALS"; COUNT_BDAY=$(gh issue list --repo gonzalomelov/birthdayinvites.app --search "$RUN_ID" --limit 100 | wc -l); echo "birthday_count:$COUNT_BDAY"; test "$COUNT_GOALS" -ge 2 && test "$COUNT_BDAY" -ge 2
    
    ### Acceptance Criteria (post-merge)
    
    - [ ] Nightly verification: the nightly workflow can run the full production smoke script and observe success. Step-by-step: 1) run: bash ./scripts/measure/measure-kr87.sh 2) expect exit status 0. Verification command (nightly): bash ./scripts/measure/measure-kr87.sh
    
    ### Prerequisites for Autonomous Execution
    
  2. Add preview-and-confirm endpoint smoke check + few-shot exemplar validator
    ### Summary -- What & Why
    
    Failures in the agent UI flow often happen because the preview-and-confirm interaction or the few-shot exemplars are not surfaced consistently by the v0 chat endpoint. Add a lightweight, autonomous smoke check that posts a fixture conversation to the v0 chat endpoint and verifies the preview step and final confirmation are present in the agent responses. This catches regressions in the conversational gating before full issue creation runs and speeds iteration on exemplars and response shaping.
    
    ### Acceptance Criteria (pre-merge)
    
    - [ ] A new preview/confirm fixture is available at ./scripts/measure/fixtures/kr87_chat_preview.json and a lightweight check script at ./scripts/measure/kr87_preview_confirm.sh returns exit 0 when the v0 chat responds with a preview + confirmation for the provided RUN_ID. Verification command: RUN_ID=kr87-pc-$(date -u +%Y%m%dT%H%M%SZ); bash ./scripts/measure/kr87_preview_confirm.sh "$RUN_ID"
    - [ ] Direct curl verification that the v0 chat endpoint surfaces preview and confirmation strings for the fixture. Verification command: RUN_ID=kr87-pc-$(date -u +%Y%m%dT%H%M%SZ); curl -s -X POST 'https://pm-agent.bygmv.com/v0/chat' -H 'Content-Type: application/json' -d @scripts/measure/fixtures/kr87_chat_preview.json -o /tmp/kr87_preview.json; grep -q '"preview"' /tmp/kr87_preview.json && grep -q -i 'confirm' /tmp/kr87_preview.json
    
    ### Acceptance Criteria (post-merge)
    
    - [ ] Nightly verification: the nightly workflow can run the preview+confirm check and fail the run if the preview or confirmation is missing. Step-by-step: 1) run: RUN_ID=nightly-kr87-preview; bash ./scripts/measure/kr87_preview_confirm.sh "$RUN_ID" 2) expect exit status 0. Verification command (nightly): RUN_ID=nightly-kr87-preview; bash ./scripts/measure/kr87_preview_confirm.sh "${RUN_ID}"
    
    ### Prerequisites for Autonomous Execution
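
The issue-count assertions in variant-a candidate issue 1 could be factored into a small helper so that each check's failure reaches the exit status. This is a sketch under assumptions: `assert_min_issues` is a hypothetical name, and it relies on `gh issue list` printing one issue per line when its output is piped.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read a listing on stdin (one issue per line), print
# "<label>:<count>", and succeed only if the count reaches the minimum.
# Usage: <listing command> | assert_min_issues <label> <minimum>
assert_min_issues() {
  local label=$1 minimum=$2 count
  # grep -c prints the number of non-empty lines; it exits 1 when that number
  # is 0, so "|| true" keeps the count without aborting the function.
  count=$(grep -c . || true)
  echo "${label}:${count}"
  [ "$count" -ge "$minimum" ]
}
```

Chaining the checks with `&&` makes any shortfall fail the whole command, for example: `gh issue list --repo gonzalomelov/goals --search "$RUN_ID" --limit 100 | assert_min_issues goals_count 2 && gh issue list --repo gonzalomelov/birthdayinvites.app --search "$RUN_ID" --limit 100 | assert_min_issues birthday_count 2`. If the teardown has already closed the issues, adding `--state all` may be needed, since the listing shows only open issues by default.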