# TestMyVibes — coding-agent integration guide

**Audience:** any AI coding agent (Claude Code, Cursor, Codex, Replit
Agent, GPT-Engineer, etc.) working in a project that needs end-to-end
UI tests. **Discoverability targets:** the guide is also served at
`https://testmyvibes.com/docs/coding-agents` and exposed via the
`get_integration_guide` MCP tool, so any agent that has TMV's MCP
server registered can self-discover by calling that tool.

If you're a human reading this looking for setup steps, jump to
"Setup in another project" near the bottom.

---

## What TMV is, in one paragraph

TestMyVibes runs a real headless Chromium against any URL you give it,
driven by a GPT-4o vision agent that fills forms, clicks buttons,
reads emails sent during the test, and reports back what it observed
plus what broke. The output isn't a screenshot grid — it's a
machine-readable artifact (`actionTrail`, `consoleErrors`,
`failedRequests`, optional `bugs[]`) that another coding agent can
read, fix, and re-test against in a closed loop.

The point is to make "did my change actually work in a browser?" a
single MCP tool call away, not a manual smoke test.

## The fix-test-retest loop in 5 calls

This is the canonical pattern. Memorize it.

```ts
1. submit_test({
     url,                       // page where the flow starts
     description,               // plain-English summary
     runner: "ai" | "human",    // EXPLICIT — see "Runner choice" below
     goal,                      // AI runner: success criterion
     agentInstructions,         // AI runner: exact step-by-step
     identityMode: "auto",      // AI runner: let TMV decide whether
                                // inbox/persona are needed; human
                                // runner can use "keep"/"reuse"
     useTestInbox: true,        // AI runner: provision per-job inbox
                                //            (needed for OTP/verify flows)
     projectLabel,              // audit tag
   })
   → returns { jobId, runner, runnerNote, ... }

2. get_test_status({jobId})                    // poll every ~30s
   → returns "pending" | "in_progress" | "completed"

3. get_test_results({jobId})
   → returns the full diagnostic dump

4. <fix bugs in your code, push to deployment>

5. retest_job({jobId})
   → re-runs the SAME test against the new deployment;
     preserves the prior report for before/after comparison
   → loop back to step 2
```
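
Steps 2 and 3 of the loop above can be sketched as a small polling helper. This is a minimal sketch, not TMV's own client: `CallTool` is a hypothetical wrapper around whatever MCP client your agent uses, and the helper assumes `get_test_status` resolves to an object with a `status` field.

```ts
// Hypothetical signature: adapt it to your MCP client's call method.
type CallTool = (name: string, args: Record<string, unknown>) => Promise<unknown>;

type JobStatus = "pending" | "in_progress" | "completed";

const sleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Poll get_test_status until the job completes, then fetch the results.
// Assumption: the status reply is shaped like { status: JobStatus }.
async function waitForResults(
  callTool: CallTool,
  jobId: string,
  pollMs = 30000,
  maxPolls = 40,
): Promise<unknown> {
  for (let attempt = 0; attempt < maxPolls; attempt++) {
    const reply = (await callTool("get_test_status", { jobId })) as { status?: JobStatus };
    if (reply.status === "completed") {
      return callTool("get_test_results", { jobId });
    }
    await sleep(pollMs);
  }
  throw new Error(`Job ${jobId} did not complete after ${maxPolls} polls`);
}
```

After a fix is deployed, the same helper can be reused on the job returned by `retest_job`.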

`retest_job` is the killer step. It keeps the URL, goal, system
prompt, runner choice, and inbox configuration so the agent doesn't
have to re-compose the test after every fix. Compare the new
`actionTrail` to the old one: was step 5 "click failed" before, and
"clicked → navigated" after? That's how you know the fix landed.

## Runner choice — AI vs human

`runner` is required on `submit_test`. It explicitly routes the job
to one of two queues:

| Use AI when... | Use human when... |
|---|---|
| Flow has a clear pass/fail (URL, visible element, email arrival) | Flow needs visual/UX judgment ("does this look professional?") |
| You can write a concrete `goal` and `agentInstructions` | Flow involves complex visual canvases, drag-drop, captchas |
| You want fast turnaround (~1-5 min) | You want fresh eyes on UX (~15-60 min SLA) |
| You're in a fix-test-retest loop | You're doing first-impression / accessibility passes |
| You're testing 50+ flows and need throughput | You're testing 1-2 flows at high quality |

**Default to `runner: "ai"`.** The field must still be set explicitly,
but AI is the right choice for ~90% of coding-agent use cases (signup
flows, login, regression, SDK checks). Switch to human only when you
specifically want human judgment.

For AI runs, the four fields that matter most:

- **`goal`** — concrete stop condition. Bad: "test the signup".
  Good: "Reach a URL containing /dashboard". The agent runs out its
  step budget on exploration if no goal is set.
- **`agentInstructions`** — verbal step-by-step. Pin EXACT field
  values here. Bad: "fill out the form". Good: "Type 'QA Tester'
  into the Name field; type the inbox email I gave you into the
  Email field; check the terms checkbox; click 'Send my code'."
- **`useTestInbox: true`** — provisions a
  `<jobid>-<random>@inbox.testmyvibes.com` address bound to this
  run. Required for OTP / email-verify flows. Skip it for read-only
  tests.
- **`identityMode`** — preferred identity control. Use `"auto"` for
  normal agent calls, `"keep"` to create a retained managed identity
  whose inbox and credentials survive future tests, `"reuse"` with a
  `testIdentityId` to sign in as that user, and `"none"` only for
  read-only tests. Human-runner jobs can also use `"keep"` so the
  checker creates the customer-site account and submits credentials
  back into the managed identity. Do not invent email addresses for
  signup or OTP flows; let TMV provision them.
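
The field guidance above can be checked before a job is submitted. The following is a rough sketch of such a pre-flight check; `lintAiTest` and the `AiTestArgs` shape are hypothetical helpers written for this guide, not part of TMV, and the email pattern is a deliberately loose heuristic.

```ts
// Hypothetical subset of submit_test arguments for an AI run.
interface AiTestArgs {
  url: string;
  runner: "ai";
  goal?: string;
  agentInstructions?: string;
  useTestInbox?: boolean;
}

// Loose heuristic for "something that looks like an email address".
const EMAIL_LIKE = /[^\s@]+@[^\s@]+\.[^\s@]+/;

// Return human-readable problems; an empty array means the args look reasonable.
function lintAiTest(args: AiTestArgs): string[] {
  const problems: string[] = [];
  if (!args.goal || args.goal.trim() === "") {
    problems.push("goal is missing: the agent has no stop condition");
  }
  if (!args.agentInstructions || args.agentInstructions.trim() === "") {
    problems.push("agentInstructions is missing: field values will be invented");
  } else if (EMAIL_LIKE.test(args.agentInstructions)) {
    problems.push("agentInstructions contains a literal email address: let TMV provision the inbox");
  }
  return problems;
}
```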

## Writing a good test prompt

A bad prompt such as "test the signup flow" produces a useless
report.

A good prompt has THREE things:

1. **One flow per test.** Sign up OR checkout OR profile-edit — never
   all three. If you want three flows tested, submit three jobs.

2. **A clear success criterion** in `goal`. Examples:
   - "Reach a URL containing /dashboard"
   - "See a balance amount displayed in the page header"
   - "Receive a 'Welcome' email in the session inbox"

3. **Exact field values** in `agentInstructions`. If you don't
   pin them, the model invents values and the test becomes
   non-reproducible. Example:
   ```
   When the signup form asks for a name, use "QA Tester".
   When asked for a password, use "TestPass!2026".
   For email, use the address I gave you (it's in your run context).
   Click "Send my code" once the form is complete.
   If a verification code appears in an email, type it into the
   six-digit input on the next screen.
   ```

The `useTestInbox: true` flag provisions a per-job
`<jobId-prefix>-<random>@testmyvibes.com` address bound to that one
run. The agent uses it for any "email" field; verification emails
arrive in the session inbox; `wait_for_email` blocks until they do.
Don't try to inspect these inboxes after the test — they're cleaned
up on completion.

For returning-user tests, use managed identities:

```ts
// First run: create the customer-site account and keep it.
submit_test({
  url,
  runner: "ai",
  goal: "Sign up, verify email, reach the dashboard",
  identityMode: "keep",
})
// Later: list_test_identities(), then reuse the returned id.
submit_test({
  url,
  runner: "ai",
  goal: "Log in and update notification settings",
  identityMode: "reuse",
  testIdentityId,
})
```

Managed identities have persistent TMV inboxes and encrypted stored
credentials. They are scoped to the owning TMV account and original
customer-site origin, so a leaked id cannot be used cross-account or
against a different site. They can be created by AI, by human testers,
or by seeding known credentials with `create_test_identity({
username, password })`.

Human-created persona setup:

```ts
submit_test({
  url,
  runner: "human",
  description: "Create a standard buyer account and verify onboarding",
  identityMode: "keep",
})
// The human checker receives the persistent inbox and returns
// createdIdentityCredentials from submit_job_results. Later AI or
// human jobs can reuse the returned testIdentityId.
```

Seat pricing is runner-neutral: one active managed identity is
included free; paid persistent persona seats start at `$2/mo`. AI and
human test execution still bill as normal runs. MCP clients can call
`list_test_identity_plans` and `subscribe_test_identities` to create a
Stripe Checkout subscription URL; payment still requires user approval.

## Reading reports like a fixer

When `get_test_results` returns, you get:

```ts
{
  jobId,
  status: "completed",
  ready: true,
  report: { overallStatus: "PASS"|"FAIL", passCount, failCount, summary },
  aiReport: {
    executiveSummary: string,         // the model's own narrative
    healthScore: 0..100,              // model's confidence — heuristic, not authoritative
    bugs: AiReportBug[],              // model's diagnosis — treat as a HINT, verify against the trail
    actionTrail: [                    // ← THE GROUND TRUTH
      {
        step: number,
        action: "type"|"click"|"navigate"|"scroll_down"|"scroll_up"|"wait_for_email"|"done",
        target: string | null,
        text: string | null,
        observation: string,          // what the model SAW in the screenshot
        outcome: string,              // what ACTUALLY happened on the page
        screenshotPath: string,       // DO Spaces path; auth required to fetch
      }
    ],
    consoleErrors: [                  // browser console errors during the run
      { at, level: "error"|"warning"|"pageerror", text }
    ],
    failedRequests: [                 // any 4xx/5xx or network failure
      { at, method, url, reason }    // reason may include "(redirected from X)"
    ],
    startUrl: string,
    finalReportNote: string | null,
  },
  checklist: ChecklistItem[],
  nextSteps: string | null,           // includes a retest_job() hint when bugs were found
}
```

**`actionTrail`, and each step's `outcome` in particular, is the most
important field.** The model's `bugs[]` array can be wrong: the model
reasons over screenshots and can hallucinate. The `actionTrail`
records what the Puppeteer driver actually did and observed. When they
conflict, trust the trail.
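
Because the trail is the ground truth, a retest is easiest to judge by comparing step outcomes between the old and new reports. The sketch below assumes steps can be matched by their `step` number, which may not hold if the agent takes a different path on the second run; `diffOutcomes` is a hypothetical helper, not a TMV tool.

```ts
// Subset of the actionTrail entry shape shown above.
interface TrailStep {
  step: number;
  action: string;
  outcome: string;
}

interface StepChange {
  step: number;
  before: string;
  after: string;
}

// Return the steps whose outcome text differs between two runs.
// Steps that exist in only one of the two trails are ignored here.
function diffOutcomes(before: TrailStep[], after: TrailStep[]): StepChange[] {
  const afterByStep = new Map<number, TrailStep>(
    after.map((s): [number, TrailStep] => [s.step, s]),
  );
  const changes: StepChange[] = [];
  for (const prev of before) {
    const next = afterByStep.get(prev.step);
    if (next !== undefined && next.outcome !== prev.outcome) {
      changes.push({ step: prev.step, before: prev.outcome, after: next.outcome });
    }
  }
  return changes;
}
```

A change from a failing outcome to a succeeding one at the step you fixed is the signal that the fix landed.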

**`consoleErrors` is where root causes hide.** A real-world example:
TMV ran NVC's signup flow and reported "click does nothing" in the
`bugs[]` array. `consoleErrors` showed `Uncaught SyntaxError: missing
) after argument list` from `/sdk/track.js`. That was the real bug:
the SDK had a build-pipeline string-escape issue that broke
`window.pmAuth` registration, which in turn broke every form on every
page that used the SDK. The fix in the PM source took 5 minutes, but
the bug could not have been found from the visual report alone.
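
One way to surface root causes like this is to group console entries by message so a repeated error stands out. This is a sketch under the assumption that `at` is a timestamp string; `topConsoleErrors` is a hypothetical helper.

```ts
// Shape of a consoleErrors entry as documented above; `at` is assumed to be a string.
interface ConsoleEntry {
  at: string;
  level: "error" | "warning" | "pageerror";
  text: string;
}

// Count error and pageerror entries by message text, most frequent first.
// Warnings are skipped because they rarely explain a broken flow.
function topConsoleErrors(entries: ConsoleEntry[]): Array<{ text: string; count: number }> {
  const counts = new Map<string, number>();
  for (const entry of entries) {
    if (entry.level === "warning") continue;
    counts.set(entry.text, (counts.get(entry.text) ?? 0) + 1);
  }
  return Array.from(counts, ([text, count]) => ({ text, count })).sort(
    (a, b) => b.count - a.count,
  );
}
```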

**`failedRequests` shows you what didn't fire.** No email arrived?
Look for the absence of a POST to your `/api/auth/otp/request`
endpoint. CORS errors? Check the `(redirected from X)` annotation
to find unintended canonicalization hops.
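
Both checks can be scripted. The sketch below assumes the annotation appears literally as `(redirected from X)` inside `reason`; the rest of the `reason` text in the example is made up for illustration, and both helper names are hypothetical.

```ts
// Shape of a failedRequests entry as documented above; `at` is assumed to be a string.
interface FailedRequest {
  at: string;
  method: string;
  url: string;
  reason: string;
}

// Extract X from a reason containing "(redirected from X)", or null if absent.
function redirectSource(reason: string): string | null {
  const match = /\(redirected from ([^)]+)\)/.exec(reason);
  return match?.[1] ?? null;
}

// True if any failed request used this method and its URL contains the path fragment.
function hasFailedCall(
  requests: FailedRequest[],
  method: string,
  pathFragment: string,
): boolean {
  return requests.some(
    (r) => r.method.toUpperCase() === method.toUpperCase() && r.url.includes(pathFragment),
  );
}
```

Note that `failedRequests` only lists requests that failed, so a `false` from `hasFailedCall` means either the request succeeded or it never fired; the `actionTrail` helps tell those apart.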

## Setup in another project

### Option A: use TMV via its MCP server (recommended)

In your project's `.mcp.json` (or `.claude/mcp.json` for Claude
Code), add:

```json
{
  "mcpServers": {
    "testmyvibes": {
      "url": "https://testmyvibes.com/mcp",
      "headers": {
        "Authorization": "Bearer YOUR_TMV_API_KEY"
      }
    }
  }
}
```

`YOUR_TMV_API_KEY` should come from a gitignored env var, NOT be
checked into `.mcp.json` directly. The cleanest pattern:

```json
{ "headers": { "Authorization": "Bearer ${env:TMV_API_KEY}" } }
```

Once registered, Claude Code (or any MCP client) can call the tools
listed below directly. First call should be `get_integration_guide`
to load this doc.

### Identity, billing, and "is this project mine?"

Three-layer auth model. Each layer is a backstop if the previous one
leaks.

**Layer 1 — API key** identifies the account. The key MUST come from
the operator's actual TMV account; anybody else's key has no
relationship to the operator's portfolio.

**Layer 2 — `isStaff` flag on the account.** Set per-account by a
TMV admin. Only staff-flagged accounts are eligible for free
internal-use testing. Non-staff accounts always bill credits, even
if they have an API key.

**Layer 3 — `internalTestDomains` allowlist.** Even with a
staff-flagged key, free testing ONLY applies to URLs whose hostname
matches a suffix in the operator's allowlist. Example: with
`internalTestDomains: ["paradisemodern.com", "newvibecity.com"]`,
tests against `paradisemodern.com`, `pm.newvibecity.com`, and
`www.newvibecityartist.newvibecity.com` are free. Tests against
`somebodyelse.com` still bill credits.
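
The suffix match can be approximated locally to predict whether a target URL would be billed. This is a sketch only: TMV's server-side check is authoritative, and matching on a label boundary (so that `notparadisemodern.com` does not match `paradisemodern.com`) is an assumption made here rather than documented behavior.

```ts
// Return the allowlist entry that covers the URL's hostname, or null if none does.
// Assumption: a domain matches itself and any subdomain, but not a longer
// hostname that merely ends with the same characters.
function matchInternalDomain(targetUrl: string, allowlist: string[]): string | null {
  const host = new URL(targetUrl).hostname.toLowerCase();
  for (const domain of allowlist) {
    const d = domain.toLowerCase();
    if (host === d || host.endsWith("." + d)) {
      return domain;
    }
  }
  return null;
}
```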

**Why three layers:** a leaked staff key without a domain allowlist
match still bills normally. The worst case of a leak is "free
unlimited tests against the operator's OWN sites" — which is
useless to an external attacker but extremely convenient for the
operator's own projects.

**Self-check on first run:**

```ts
const me = await mcp.call("whoami", { targetUrl: PROJECT_URL });
// Returns { userId, email, isStaff, internalTestDomains[],
//           creditsRemaining,
//           billingPreviewForTarget: { billingMode, matchedDomain, reason }
//         }
```

If `billingPreviewForTarget.billingMode === "internal"`, you're set
— submit tests freely. If `"credits"`, every test deducts credits;
budget accordingly or escalate to your operator to add the project's
domain to the allowlist.

**Tag tests with `projectLabel`** so the operator sees which project
burned which tests:

```ts
mcp.call("submit_test", {
  url: PROJECT_URL,
  description: "...",
  runner: "ai",                     // required on every submit_test call
  projectLabel: "pm-claude-code",   // or whatever names YOUR project
  useTestInbox: true,
  goal: "...",
  agentInstructions: "...",
});
```

The label is audit-only — it doesn't affect auth or billing. It's
just so the operator's admin dashboard shows
"`pm-claude-code` ran 47 internal tests today; `nvc-coding-agent`
ran 12" rather than every test being indistinguishable.

### Option B: OAuth (for shared / public use)

If your project will be used by multiple people or you're publishing
it, use OAuth instead of an API key. TMV's MCP server implements
OAuth 2.1 with PKCE. Configure your MCP client to use OAuth flow;
the user signs in once and the client stores the access + refresh
tokens.

### Option C: HTTP (no MCP)

Every MCP tool has an equivalent HTTP endpoint. Useful for CI
pipelines, scripts, or non-MCP-aware agents:

| MCP tool                     | HTTP endpoint                     |
|------------------------------|-----------------------------------|
| `submit_test`                | `POST /v1/jobs`                   |
| `get_test_status`            | `GET /v1/jobs/:id`                |
| `get_test_results`           | `GET /v1/jobs/:id/results`        |
| `retest_job`                 | `POST /v1/jobs/:id/retest`        |
| `list_credit_packs`          | `GET /v1/billing/credit-packs`    |
| `get_credit_balance`         | `GET /v1/billing/balance`         |
| `top_up_credits`             | `POST /v1/billing/checkout`       |
| `list_test_identities`       | MCP only                          |
| `create_test_identity`       | MCP only                          |
| `list_test_identity_plans`   | MCP only                          |
| `subscribe_test_identities`  | MCP only                          |

All require `Authorization: Bearer <api-key>` and are subject to the
same rate limits and credit deductions as the MCP path.
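
For scripts and CI, the requests can be built as plain objects and handed to any HTTP client. The sketch below makes two assumptions that the table does not confirm: that the endpoints live under `https://testmyvibes.com`, and that the `POST /v1/jobs` body is JSON mirroring the `submit_test` arguments.

```ts
interface HttpRequest {
  url: string;
  method: "GET" | "POST";
  headers: Record<string, string>;
  body?: string;
}

// Assumption: the table lists paths only, so this base URL is a guess.
const BASE_URL = "https://testmyvibes.com";

// Build the HTTP equivalent of submit_test (assumed JSON body).
function submitTestRequest(apiKey: string, args: Record<string, unknown>): HttpRequest {
  return {
    url: `${BASE_URL}/v1/jobs`,
    method: "POST",
    headers: {
      Authorization: `Bearer ${apiKey}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify(args),
  };
}

// Build the HTTP equivalent of get_test_results.
function resultsRequest(apiKey: string, jobId: string): HttpRequest {
  return {
    url: `${BASE_URL}/v1/jobs/${encodeURIComponent(jobId)}/results`,
    method: "GET",
    headers: { Authorization: `Bearer ${apiKey}` },
  };
}
```

With Node 18+ or a browser, a request built this way can be sent with `fetch(req.url, { method: req.method, headers: req.headers, body: req.body })`.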

## Things you don't know that you don't know

These are the gotchas that bit early users. Read them; save yourself
a debug session.

### Tests are non-deterministic
Same job, same site, different runs CAN produce different reports.
The vision model samples from a probability distribution. For CI use,
either:
- Use 3-run consensus (submit the same test 3x; trust the report
  only when 2+ runs agree)
- Or write narrower goals (the more specific the success criterion,
  the more stable the result)
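
The 3-run consensus rule can be sketched as a simple majority vote over each run's `report.overallStatus`. `consensus` is a hypothetical helper; returning `null` when the runs do not agree is a design choice made for this sketch.

```ts
type Verdict = "PASS" | "FAIL";

// Return the verdict that at least `minAgree` runs share and that holds a
// strict majority, or null when the runs do not agree strongly enough.
function consensus(verdicts: Verdict[], minAgree = 2): Verdict | null {
  const pass = verdicts.filter((v) => v === "PASS").length;
  const fail = verdicts.length - pass;
  if (pass >= minAgree && pass > fail) return "PASS";
  if (fail >= minAgree && fail > pass) return "FAIL";
  return null;
}
```

A `null` result is a reasonable trigger for tightening the goal or running the test again rather than failing the build.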

### MCP credentials leak in commits
If you check `.mcp.json` into git with a literal API key, that key is
now public. Always use `${env:VAR}` substitution and a gitignored
env file. Better yet, rotate via the TMV admin panel and use OAuth
where possible.

### The agent has a same-origin restriction on explicit navigate
For cross-subdomain SSO flows (e.g. apex → `passport.<bare>` → apex),
the agent must CLICK links — the browser handles cross-origin
redirects naturally. If the agent tries an explicit `navigate` to a
different origin, the action is blocked with `navigate blocked:
<url> is cross-origin or unsafe`. This is intentional safety; design
your tests around clicks for SSO flows.
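
When composing `agentInstructions`, a quick origin comparison shows which hops need a click instead of a `navigate`. This sketch uses the standard `URL` API and only covers the cross-origin half of the rule; whatever TMV treats as "unsafe" is not modeled here.

```ts
// True when targetUrl resolves to the same scheme, host, and port as startUrl.
// Relative targets are resolved against startUrl.
function isSameOrigin(startUrl: string, targetUrl: string): boolean {
  return new URL(startUrl).origin === new URL(targetUrl, startUrl).origin;
}
```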

### Credit budgeting per project
Each test run deducts credits. A typical signup test is ~1-2 credits.
A complex multi-page flow with email waits is ~3-5. `retest_job`
deducts again. For shared TMV accounts across multiple projects,
budget accordingly. Check `get_credit_balance` early in your test
plan and call `top_up_credits` when needed.
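
A rough budget can be worked out before the first submit. The sketch below uses the upper end of the ranges quoted above as per-run costs; the actual deduction per run is decided by TMV, so treat the result as an estimate.

```ts
interface PlannedTest {
  label: string;
  creditsPerRun: number;   // estimate, e.g. 2 for a signup test
  expectedRetests: number; // each retest_job deducts again
}

// Total credits = sum over tests of cost per run times (first run + retests).
function estimateCredits(plan: PlannedTest[]): number {
  return plan.reduce(
    (total, test) => total + test.creditsPerRun * (1 + test.expectedRetests),
    0,
  );
}
```

For example, a signup test at 2 credits with 2 retests plus a multi-page flow at 5 credits with 1 retest comes to 2 × 3 + 5 × 2 = 16 credits, which can be compared against `get_credit_balance`.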

### Same-origin nav vs cross-origin SSO
If your test flow crosses subdomains (`newvibecity.com` →
`passport.newvibecity.com` → `newvibecity.com`), make sure your
`submit_test` URL is the APEX domain. The agent starts there, then
follows clicks. Starting at the SSO subdomain skips the apex-side
initialization most sites depend on.

### Reports are persistent; inboxes are ephemeral
`actionTrail`, `consoleErrors`, `failedRequests`, screenshots — all
stay in the DB / Spaces indefinitely. You can call
`get_test_results(jobId)` for any past run at any time. But the test
inbox provisioned for the run is cleaned up on completion. If you
need to debug an email flow after the fact, you'll need to retest.

### Vision model is GPT-4o
Standard HTML forms (inputs, buttons, checkboxes) — easy. Complex UI
(canvas, drag-and-drop, custom dropdown menus that don't open until
clicked, infinite scroll lists) — harder. The model uses screenshots
plus DOM-aware action helpers, but it's not a substitute for a
purpose-built E2E framework like Playwright when you need
pixel-perfect reproduction.

### pm-shield bypass (Paradise-portfolio sites only)
TMV runs from DigitalOcean. Many cloud-IP ranges are on DNSBLs that
pm-shield rejects with `ip_dnsbl_listed`. For Paradise-portfolio
sites (Paradise Modern + its 500+ customers), TMV sends a signed
bypass token via `x-tmv-shield-bypass` header. Requires
`PARADISE_TMV_SHIELD_BYPASS_SECRET` set on both PM and TMV droplets
(see `paradisemodern/server/shield/tmv-bypass.ts`). If your site
uses a different bot defense (Cloudflare Turnstile, Akamai Bot
Manager, etc.), you'll need an equivalent allowlist for TMV's
egress IP or a custom test-only bypass route.

### Goal must include a STOP condition
Without one, the agent runs until it exhausts its step budget (20 steps
for goal-directed runs). The executive summary in the resulting report
will then say "ran out of steps" instead of "succeeded/failed at a
specific point", which is harder to act on. Examples of good stop
conditions:
- "Reach a URL containing /dashboard, then call done"
- "See the text 'Welcome,' on the page header, then call done"
- "If at any step the page shows an error message, call done with
  an impression starting with 'BLOCKED:' and a description"

### Confidence comes from the trail, not the score
`healthScore` is the model's own self-assessment. It can be 75 on a
test that completely failed (the model was confident in its
interpretation, but its interpretation was wrong). Always validate
against `actionTrail`'s outcomes + `consoleErrors`.
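
That validation can be sketched as a function that ignores `healthScore` and collects evidence from the trail and console instead. The keyword list is an assumption: `outcome` is free-form text, so the hints should be tuned to the wording you actually see in your reports.

```ts
interface Step {
  step: number;
  outcome: string;
}

interface ConsoleEntry {
  level: "error" | "warning" | "pageerror";
  text: string;
}

interface AiReportSubset {
  healthScore: number;
  actionTrail: Step[];
  consoleErrors: ConsoleEntry[];
}

// Assumed keywords that suggest a step did not do what was intended.
const FAILURE_HINTS = ["failed", "blocked", "timed out"];

// Collect concrete evidence of failure, regardless of what healthScore says.
function evidenceOfFailure(report: AiReportSubset): string[] {
  const evidence: string[] = [];
  for (const s of report.actionTrail) {
    const outcome = s.outcome.toLowerCase();
    if (FAILURE_HINTS.some((hint) => outcome.includes(hint))) {
      evidence.push(`step ${s.step}: ${s.outcome}`);
    }
  }
  for (const entry of report.consoleErrors) {
    if (entry.level !== "warning") {
      evidence.push(`console ${entry.level}: ${entry.text}`);
    }
  }
  return evidence;
}
```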

## When TMV is the right tool — and when it isn't

**Use TMV for:**
- End-to-end user flows: signup → email verify → reach dashboard
- "Does this button actually work?" smoke tests after a deploy
- Cross-page navigation: form → confirm → redirect → next step
- Catching regressions in third-party SDK behavior (analytics,
  auth widgets, payment iframes)

**Don't use TMV for:**
- Unit tests (use vitest / jest / your normal dev loop)
- Type / lint checks (use tsc / eslint)
- Pure backend integration (use httpie / supertest)
- Pixel-perfect visual regression — TMV captures screenshots but
  doesn't diff them yet
- Load testing — TMV is one browser at a time
- Mobile-specific rendering bugs — TMV is a 1280x800 desktop
  Chromium

## CLAUDE.md stub (drop into any project root)

If you want every Claude Code session that opens your project to
automatically know about TMV, drop this short block into your
`CLAUDE.md`:

```markdown
## End-to-end UI tests

For browser-based testing (signup flows, multi-page user journeys,
SDK regressions), use TestMyVibes via its MCP server. Full guide:
https://testmyvibes.com/docs/coding-agents — or call
`get_integration_guide` if TMV's MCP is already registered.
```

That's it. Claude reads CLAUDE.md on every session start, so this
becomes self-onboarding for any new agent that opens your project.

## Future: what's coming

- **Anthropic MCP Directory listing** — once TMV is in the directory,
  any Claude Code project will be able to enable TMV with one
  command. No manual `.mcp.json` config needed. (Tracked: task #17.)
- **Paradise Comms thread per test** — TMV reports flow into PM's
  admin home as conversational threads. You can ask follow-up
  questions of a report ("why did the click fail in step 5?") with
  the action trail right there as context. (Shipped 2026-05-20 for
  Paradise Modern; expanding to customer sites next.)
- **Multi-run consensus mode** — single MCP call triggers N parallel
  runs; the consensus report only flags bugs that appear in 2+
  runs. Reduces false positives from LLM stochasticity.
- **Human checker mode** — for flows the AI can't drive (complex
  visual canvases, captcha challenges), TMV routes the job to a
  human tester. Same MCP, same report shape, just longer SLA. (Most
  pieces shipped; opt-in.)

---

This doc lives in TMV's repo at `docs/coding-agent-integration.md`.
Edit it there. The web route and `get_integration_guide` MCP tool
read from the same file at runtime.