Factory ships a novel product end-to-end at a stable public subdomain (functional)

KR: goals#1202 configs

Winner: baseline, more candidate issues generated (4 vs 2)

baseline

Issue count: 4

AC linter pass rate: 100% (4/4)

Total tokens: 4,892

Candidate issues (4)

  1. Generate static /r/<id> result pages and an /about page from committed run data
    ### Summary -- What & Why
    
    Produce a static, read-only product surface that renders one pre-rendered result page per factory run and an /about index that links to them. This makes the ideation pipeline inspectable and testable without any live runner or public form. Without static pages, there is no public proof the factory can surface runs at a stable subdomain and the KR cannot be validated.
    
    ### Acceptance Criteria (pre-merge)
    
    - [ ] The repository build step produces an output directory containing an HTML file per run referenced in data/cycle-9-dogfood/runs/index.json under a stable /r/<id> path
    - [ ] Each generated result HTML file contains both harness names, both candidate-issue lists with at least one title each, and four numeric proxy values (issue count, AC linter pass rate, total tokens used, verification-anchor citation rate) for each harness
    - [ ] An /about HTML file exists in the build output and contains links to at least 6 result pages listed in the runs index
    
    ### Acceptance Criteria (post-merge)
    
    - [ ] Nightly workflow performs HTTP GET to https://harness.bygmv.com/about and asserts HTTP 200 and that the page links to at least 6 /r/<id> result pages
    - [ ] Nightly workflow performs HTTP GET to three sample result pages (from the runs index) and asserts HTTP 200, valid TLS, and that each response body contains both harness names, both candidate issue lists, and the four proxy numbers for each harness
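
    The content assertion above can be sketched as a pure function over a response body. This is a minimal sketch under stated assumptions, not the factory's actual check: the helper name `missingContent`, the sample HTML, and the idea that each proxy appears on the page under a human-readable label are all hypothetical.

    ```javascript
    // Hypothetical helper: returns the required strings that are absent from a page body.
    // An empty result means the page carries every harness name and metric label.
    function missingContent(body, harnessNames, metricLabels) {
      const required = [...harnessNames, ...metricLabels];
      return required.filter((needle) => !body.includes(needle));
    }

    // Made-up sample page; the labels are assumptions about how the renderer names the proxies.
    const labels = ["Issue count", "AC linter pass rate", "Total tokens", "Citation rate"];
    const samplePage = "<h2>baseline</h2><h2>variant-a</h2><p>Issue count: 4</p>";
    console.log(missingContent(samplePage, ["baseline", "variant-a"], labels));
    ```

    A nightly job could fail whenever the returned list is non-empty, which also tells the reader exactly which element is missing.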
    
    ### Prerequisites for Autonomous Execution
    
    - Human: ensure harness.bygmv.com DNS is pointed to the deployment host and TLS is provisioned
    - Human: grant deploy or domain-binding permissions if the hosting/platform requires a manual step
  2. Commit ≥6 dogfood run JSON files to data/cycle-9-dogfood/runs and a validated runs index
    ### Summary -- What & Why
    
    Provide a reproducible dataset of dogfood runs so the static renderer has content to produce the public result pages and the smoke test can assert a minimal history. If the runs dataset is missing or incomplete, the site cannot show the required ≥6 entries and the KR assertion will fail.
    
    ### Acceptance Criteria (pre-merge)
    
    - [ ] The repository contains data/cycle-9-dogfood/runs/index.json referencing at least 6 run ids
    - [ ] For each run id listed in the index, there is a corresponding run JSON file that includes two harness entries, each with a candidate-issue list containing at least one issue title
    - [ ] A quick validation script in the repo verifies the index and each run file are parseable JSON and that each run file exposes the required fields (run id, two harnesses, candidate lists)
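
    The validation criteria above could be sketched roughly as follows. This is an illustration only: the function names and the field names (`id`, `harnesses`, `issues`, `title`) are assumptions about the run file shape, not a confirmed schema.

    ```javascript
    // Hypothetical validator for one parsed run file.
    // Returns a list of human-readable problems; an empty list means the run looks valid.
    function validateRun(run) {
      if (run === null || typeof run !== "object") return ["run is not an object"];
      const errors = [];
      if (typeof run.id !== "string" || run.id.length === 0) errors.push("missing run id");
      if (!Array.isArray(run.harnesses) || run.harnesses.length !== 2) {
        errors.push("expected exactly two harnesses");
        return errors;
      }
      run.harnesses.forEach((harness, i) => {
        // Each harness needs a candidate list with at least one titled issue.
        const hasTitledIssue =
          Array.isArray(harness.issues) &&
          harness.issues.some((issue) => typeof issue.title === "string" && issue.title.trim() !== "");
        if (!hasTitledIssue) errors.push(`harness ${i} has no titled candidate issue`);
      });
      return errors;
    }

    // Hypothetical index check: the index must be an array with at least minRuns entries.
    function indexHasEnoughRuns(index, minRuns = 6) {
      return Array.isArray(index) && index.length >= minRuns;
    }
    ```

    A real script would wrap these in file reads with `JSON.parse` and exit non-zero on the first non-empty error list.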
    
    ### Acceptance Criteria (post-merge)
    
    - [ ] Nightly workflow reads data/cycle-9-dogfood/runs/index.json in the deployed repo and asserts it lists ≥6 runs, and that the deployed /about page links to those runs
    
    ### Prerequisites for Autonomous Execution
    
  3. Add an offline harness-eval that computes and embeds the four proxy metrics per harness
    ### Summary -- What & Why
    
    Provide a deterministic offline evaluator that computes the four proxy numbers per harness for each run and stores them with the run so the static pages can display them. This guarantees the product surface shows public, proven guardrails rather than opaque internals. Without these computed proxies, pages cannot present the required metrics for the KR.
    
    ### Acceptance Criteria (pre-merge)
    
    - [ ] Repository includes an executable offline evaluation script that, when run against the committed run files, produces numeric values for issue count, AC linter pass rate, total tokens used, and verification-anchor citation rate for each harness and writes those values into the run JSON files or into renderer input artifacts
    - [ ] Running the offline evaluator against the committed runs completes without errors and leaves every run with four numeric proxy values for each harness
    - [ ] Unit or smoke tests in the repo assert that render inputs contain numeric proxy values for every harness across the dataset
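
    One way the four proxies might be derived is sketched below. The metric definitions are assumptions inferred from the report (for example, a pass rate of 100% shown alongside 4/4), and the input fields `acLintPassed`, `citesAnchor`, and `totalTokens` are hypothetical. The output keys follow the names used in the variant-a issue in this report.

    ```javascript
    // Hypothetical evaluator for a single harness entry.
    // Assumes each issue carries two boolean flags and the harness records its token total.
    function computeProxies(harness) {
      const issues = harness.issues;
      const count = issues.length;
      const passed = issues.filter((issue) => issue.acLintPassed).length;
      const cited = issues.filter((issue) => issue.citesAnchor).length;
      return {
        issue_count: count,
        // Percentage in the range 0..100; defined as 0 when there are no issues.
        ac_lint_pass_rate: count === 0 ? 0 : (passed / count) * 100,
        total_tokens: harness.totalTokens,
        // Fraction in the range 0..1; defined as 0 when there are no issues.
        citation_rate: count === 0 ? 0 : cited / count,
      };
    }
    ```

    Because the function is pure and reads only committed data, running it twice on the same input yields the same numbers, which is what the static pages rely on.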
    
    ### Acceptance Criteria (post-merge)
    
    - [ ] Nightly workflow requests a sample result page and asserts the response body contains the four numeric proxy values per harness and that the values are parsable numbers
    
    ### Prerequisites for Autonomous Execution
    
  4. Add a CI workflow that runs the KR smoke test (scripts/measure/measure-kr120.sh) and a preview smoke check
    ### Summary -- What & Why
    
    Run automated verification of the KR smoke assertions on PRs and nightly so regressions are detected early. This gives quick feedback that the public surface remains available and correctly structured. Without CI smoke checks, regressions to the static surface or dataset may slip in undetected until the final gate.
    
    ### Acceptance Criteria (pre-merge)
    
    - [ ] A GitHub Actions workflow exists that runs scripts/measure/measure-kr120.sh (or an equivalent verification command) against a configurable base URL and reports success/failure on PRs
    - [ ] The workflow logs indicate the script executed and show which of the four smoke-test assertions passed or failed when run against the configured preview URL
    
    ### Acceptance Criteria (post-merge)
    
    - [ ] Nightly workflow runs scripts/measure/measure-kr120.sh against https://harness.bygmv.com and completes with all four assertions passing (index length, 200+TLS per run page, expected content per run page, /about linking to ≥6 pages)
    
    ### Prerequisites for Autonomous Execution
    
    - Human: ensure the public domain https://harness.bygmv.com is routed to the deployed site so the nightly workflow can reach it

variant-a

Issue count: 2

AC linter pass rate: 100% (2/2)

Total tokens: 5,898

Candidate issues (2)

  1. Add a static, read-only HarnessDiff renderer and an /about index
    ### Summary -- What & Why
    
    Visitors must be able to view pre-rendered HarnessDiff run pages and an about/index page without authentication. This will make the ideation pipeline inspectable and provide the public, proven guardrail claim the KR requires. If we don't ship a static read-only surface, we cannot demonstrate the factory's public, auditable claims and the KR cannot be satisfied.
    
    ### Acceptance Criteria (pre-merge)
    
    - [ ] A runnable Playwright integration test exists and passes locally that serves the static build and asserts: for each sample run page /r/<id> the page returns HTTP 200, contains both harness names, has at least one candidate issue title per harness, and shows the four proxy numbers per harness. Verify by running: cd apps/web && npx playwright test tests/harnessdiff.static.spec.ts --reporter=list
    - [ ] The repo contains at least 6 dogfood run entries committed to the data directory and the count is >=6. Verify by running from repo root: node -e "const idx=require('./data/cycle-9-dogfood/runs/index.json'); if(!Array.isArray(idx)||idx.length<6) process.exit(2); console.log('runs:',idx.length)"
    
    ### Acceptance Criteria (post-merge)
    
    - [ ] Nightly verification: ensure the public surface is reachable and contains links to >=6 result pages. Steps for the nightly job: 1) curl -fsS https://harness.bygmv.com/about -o /tmp/harness-about.html 2) test "$(grep -Eo '/r/[a-z0-9-]+' /tmp/harness-about.html | sort -u | wc -l)" -ge 6 3) For each id found: curl -fsS https://harness.bygmv.com/r/<id> -o /tmp/harness-run.html && grep -q 'Harness' /tmp/harness-run.html && grep -q 'issue_count' /tmp/harness-run.html
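
    The link-count step can also be expressed in Node, where the threshold is a plain numeric comparison and so holds for any count, including two-digit ones. This is a sketch only: the helper names are hypothetical, and the id pattern is the one used in the grep step above.

    ```javascript
    // Hypothetical helper: extracts the distinct /r/<id> paths from an HTML string.
    function resultLinks(html) {
      const matches = html.match(/\/r\/[a-z0-9-]+/g) || [];
      return [...new Set(matches)].sort();
    }

    // Hypothetical helper: true when the page links to at least minLinks distinct result pages.
    function hasEnoughLinks(html, minLinks = 6) {
      return resultLinks(html).length >= minLinks;
    }

    // Made-up /about page with ten result links, one of them repeated.
    const aboutPage =
      Array.from({ length: 10 }, (_, i) => `<a href="/r/run-${i}">run ${i}</a>`).join("") +
      '<a href="/r/run-0">again</a>';
    console.log(resultLinks(aboutPage).length, hasEnoughLinks(aboutPage));
    ```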
    
    ### Prerequisites for Autonomous Execution
    
    - Human: point harness.bygmv.com at the deployment target (DNS and Vercel/project routing) and provide access details if needed; without this the public domain cannot be validated by the nightly
    - Human: ensure the Vercel project is configured to serve the static output from the repository (domain mapping and build environment); otherwise the public site won't be reachable
  2. Create an offline harness-eval pipeline that precomputes candidate lists and the four proxy metrics
    ### Summary -- What & Why
    
    The KR requires pre-rendered results with reproducible proxy numbers (issue count, AC linter pass rate, total tokens used, citation rate). We need a deterministic offline evaluator that consumes raw run traces and emits per-run JSON with the public metrics so the renderer can serve read-only pages. Without this, pages would be non-deterministic or require live runners and the KR's 'public, proven product guardrails' claim cannot be met.
    
    ### Acceptance Criteria (pre-merge)
    
    - [ ] A script is present and runnable that reads the committed raw run traces and emits per-run JSON objects containing exactly two harness entries per run, each harness containing an array of candidate issues (>=1) and numeric fields: issue_count, ac_lint_pass_rate, total_tokens, citation_rate. Verify by running: node scripts/harness/eval.js --runsDir=./data/cycle-9-dogfood/runs && node -e "const fs=require('fs'); const idx=JSON.parse(fs.readFileSync('./data/cycle-9-dogfood/runs/index.json','utf8')); if(!Array.isArray(idx)||idx.length<6) process.exit(2); idx.forEach(r=>{ if(!r.harnesses||r.harnesses.length!==2) process.exit(3); r.harnesses.forEach(h=>{ if(!Array.isArray(h.issues)||h.issues.length<1) process.exit(4); ['issue_count','ac_lint_pass_rate','total_tokens','citation_rate'].forEach(k=>{ if(typeof h[k]!== 'number') process.exit(5); }) }) }); console.log('eval output OK')"
    - [ ] Committing the generated per-run JSON files does not change the existing proven data format (index.json remains a top-level array of run entries) and the eval script is deterministic for the same inputs. Verify determinism by running the eval twice and checking byte-equality of the whole runs directory: rm -rf /tmp/runs_a && node scripts/harness/eval.js --runsDir=./data/cycle-9-dogfood/runs && cp -r data/cycle-9-dogfood/runs /tmp/runs_a && node scripts/harness/eval.js --runsDir=./data/cycle-9-dogfood/runs && diff -r /tmp/runs_a data/cycle-9-dogfood/runs || { echo 'not deterministic'; exit 2; }
    - [ ] Add a small unit test that asserts the eval script exposes the required metric keys and that numeric metrics fall within plausible bounds (0<=ac_lint_pass_rate<=100, total_tokens>0, 0<=citation_rate<=1). Verify by running: npm test -- tests/unit/harness-eval.spec.js
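
    The bounds named in the last criterion could be checked with a helper along these lines. This is a minimal sketch: the function name is hypothetical, and it is not the contents of tests/unit/harness-eval.spec.js, which this report does not show.

    ```javascript
    // Hypothetical check for one harness entry, using the bounds stated in the criterion:
    // 0 <= ac_lint_pass_rate <= 100, total_tokens > 0, 0 <= citation_rate <= 1.
    function metricBoundErrors(harness) {
      const keys = ["issue_count", "ac_lint_pass_rate", "total_tokens", "citation_rate"];
      const errors = keys
        .filter((key) => typeof harness[key] !== "number" || Number.isNaN(harness[key]))
        .map((key) => `${key} is not a number`);
      if (errors.length > 0) return errors;

      if (harness.ac_lint_pass_rate < 0 || harness.ac_lint_pass_rate > 100) {
        errors.push("ac_lint_pass_rate out of range");
      }
      if (harness.total_tokens <= 0) errors.push("total_tokens must be positive");
      if (harness.citation_rate < 0 || harness.citation_rate > 1) {
        errors.push("citation_rate out of range");
      }
      return errors;
    }
    ```

    Note the differing scales: the pass rate is a percentage while the citation rate is a fraction, so a value such as 50 is valid for one and invalid for the other.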
    
    ### Acceptance Criteria (post-merge)
    
    - [ ] Nightly verification: run the eval script in CI and verify that its output still conforms to the structure and count expectations: 1) node scripts/harness/eval.js --runsDir=./data/cycle-9-dogfood/runs 2) node -e "const fs=require('fs'); const idx=JSON.parse(fs.readFileSync('./data/cycle-9-dogfood/runs/index.json','utf8')); if(!Array.isArray(idx)||idx.length<6) process.exit(2); console.log('nightly eval OK')"
    
    ### Prerequisites for Autonomous Execution