BETA Ceetrix is free during beta — get started now

Your Agent Says "Done."
How Do You Know?

You can't. Not without checking every file, re-reading every requirement, and running every test yourself. That's the job gates do for you.

The problem

What "Done" Actually Means

AI coding agents are trained to satisfy you. When they say "done," they mean "I stopped working." Not "I fulfilled every requirement." Not "I tested everything." Not "the design matches the spec."

The result: features that look complete but silently dropped a requirement. Tests that pass but don't cover what users actually do. Code that works but implements the wrong thing.

You don't find out until a user reports it, a teammate notices, or you spend an hour manually reviewing what the agent claimed was finished.

Key concepts

What's a Gate?

A gate is an automated check that runs before the agent is allowed to move forward. Think of it like CI for specifications — the same way your build server won't deploy broken code, gates won't let your agent close work with gaps.

Gate

An automated checkpoint. It runs when the agent tries to create a task, complete a task, or close a story. If the check fails, the agent is blocked and told exactly what to fix.

Story

A unit of work — like "add user authentication." It has requirements (a PRD), a technical design, and implementation tasks. Stories move through stages: proposed → in-progress → QA → done.

Task

A single piece of implementation within a story — like "write the JWT validation middleware." Tasks must have a plan, produce evidence of what changed, and pass their tests.

Dependencies

Requirements trace to design. Design traces to tasks. Tasks trace to tests. Each link is a dependency — change one, and gates flag everything downstream that needs review.

How it works

Two Layers of Verification

Gates run automatically via MCP — the protocol your AI agent uses to talk to Ceetrix. The agent can't bypass them. It can't mark a task as done without declaring what changed. It can't close a story with missing coverage links. It can't skip the design phase.

Structural gates (G0–G7, G9, G11) check that data exists and relationships are complete. Does every requirement have a design section pointing to it? Does every capability have a task? These are deterministic — no judgment call, just graph traversal.
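
A minimal sketch of what a deterministic structural check might look like, in the style of G1. The data model here is hypothetical, not Ceetrix's actual schema:

```python
# Hypothetical sketch of a G1-style structural gate: every PRD requirement
# must be referenced by at least one design section. Pure set logic,
# no judgment call.

def check_prd_coverage(requirements, design_sections):
    """Return the requirement ids that no design section points to."""
    covered = {req_id
               for section in design_sections
               for req_id in section.get("covers", [])}
    return [r["id"] for r in requirements if r["id"] not in covered]

requirements = [{"id": "login"}, {"id": "logout"}, {"id": "session-timeout"}]
design_sections = [{"id": "auth-flow", "covers": ["login"]}]

missing = check_prd_coverage(requirements, design_sections)
# missing == ["logout", "session-timeout"] -> gate fails, naming each gap
```

Because the check is pure graph traversal, it either passes or it names exactly which links are missing.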

Content gates (G8, G10, G12) use an independent LLM to evaluate quality and relevance. Does the PRD have testable acceptance criteria? Does the design actually discuss the requirement it claims to cover, or just link to it? This isn't the agent grading its own homework — it's a separate evaluation.

An honest caveat: the agent self-reports evidence like files_changed and test results. It could fabricate these. Gates raise the cost of lying — the agent must produce structured, reviewable claims instead of just saying "done" — but they don't make fabrication impossible. The audit trail means you can spot-check efficiently instead of reviewing everything.

See it in action

The Full Lifecycle

Watch how gates enforce integrity at every phase — from writing requirements to closing a story.


Task-Level Gates

These fire when the agent creates or completes an individual task. They ensure every task has a plan, produces evidence, and passes its tests.

G0

Test Strategy Exists

Agent must write a test strategy in the design before it can create implementation tasks

Without a test strategy, the agent jumps straight to code. Forcing it to think about verification first changes what it builds.

What rejection looks like

Task creation blocked. Design must include a test_strategy section.

The test strategy should cover:
- Test levels (unit, integration, e2e)
- Multi-step E2E workflows
- Edge cases with expected behavior
G4

Task Has a Plan

Agent must write an implementation plan before starting work

The plan is a claim about intent — which files, which approach, which risks. It can be wrong, but writing it forces the agent to think before coding.

What rejection looks like

Task body required (minimum 20 characters).

Include:
- Approach: How you plan to implement this
- Files: Which files you expect to modify
- Risks: What could go wrong
G5

Completion Evidence

Agent must declare which files it changed and why before marking a task done

The agent provides a structured claim: these files changed, for this reason. It could lie. But the claim creates an audit trail — when something breaks, you can trace which task claimed responsibility.

What rejection looks like

Completion evidence required.

Evidence must contain:
- files_changed: [{ path, change: 'added|modified|deleted' }]
- rationale: 'What was done and why' (min 20 chars)
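
A sketch of how a G5-style evidence check could work. Field names follow the schema above; the validation logic is an assumption, not Ceetrix's implementation:

```python
# Hypothetical G5-style check: the gate verifies that the evidence claim
# is structurally complete, not that it is true.

VALID_CHANGES = {"added", "modified", "deleted"}

def check_completion_evidence(evidence):
    """Return a list of problems; an empty list means the gate passes."""
    problems = []
    files = evidence.get("files_changed", [])
    if not files:
        problems.append("files_changed is empty")
    for f in files:
        if "path" not in f or f.get("change") not in VALID_CHANGES:
            problems.append(f"malformed entry: {f}")
    if len(evidence.get("rationale", "")) < 20:
        problems.append("rationale under 20 characters")
    return problems

evidence = {
    "files_changed": [{"path": "src/auth/jwt.ts", "change": "added"}],
    "rationale": "Added JWT validation middleware for the auth story.",
}
assert check_completion_evidence(evidence) == []
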
G7

Test Results Provided

Agent must report test results with zero failures before completing a test task

The agent self-reports pass/fail counts. It could fabricate these numbers. The gate checks the claim is structurally valid (passed > 0, failed = 0) — not that the agent actually ran the tests.

What rejection looks like

Test results show failures. Cannot complete task.
Passed: 3, Failed: 2

Fix the failing tests before completing this task.
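
The structural rule behind G7 (passed > 0, failed = 0) can be sketched in a few lines. A simplified illustration, not the actual gate:

```python
# Hypothetical G7-style check on self-reported numbers: failed must be 0
# and passed must be positive. The gate does not run the tests itself.

def check_test_results(results):
    """Return a rejection message, or None if the gate passes."""
    passed = results.get("passed", 0)
    failed = results.get("failed", 0)
    if failed > 0:
        return f"Test results show failures. Passed: {passed}, Failed: {failed}"
    if passed == 0:
        return "No passing tests reported."
    return None

assert check_test_results({"passed": 3, "failed": 2}) is not None
assert check_test_results({"passed": 5, "failed": 0}) is None
```
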
G10

Plan Addresses Design (LLM)

An LLM checks whether the task plan actually discusses the capabilities it claims to implement

A task can claim to implement "Token Validation" but write a plan about database migrations. An independent LLM reads both and checks for relevance.

What rejection looks like

Task body does not address its capabilities:
  "Token Validation": No mention of token validation
  or JWT handling in the task plan

Story-Level Gates

These fire when the agent tries to move a story to QA or done. They verify the full specification — from requirements through design to tested implementation. Nothing gets through with gaps.

G1

PRD → Design Links

Every PRD requirement must be referenced by at least one design section

This is a structural check — it verifies the pointers exist, not that the content is good. Without it, the agent silently drops requirements it finds inconvenient.

What rejection looks like

G1 (PRD Coverage): FAILED
  "Logout" — not referenced by any design section
  "Session Timeout" — not referenced
G2

Design → Task Links

Every design capability must have at least one task claiming to implement it

Another structural check. A capability with no task is a design intention that nobody committed to building.

What rejection looks like

G2 (Design Coverage): FAILED
  "Token Validation" — no implementation task
  "Session Refresh" — no implementation task
G3

Test Task Existence

Capabilities that require tests must have test tasks assigned to them

Checks that test tasks exist in the backlog — not that they pass, not that they're good. Just that someone is on the hook for writing them.

What rejection looks like

G3 (Test Coverage): FAILED
  "Token Validation": missing unit, integration test tasks
  "Session Refresh": missing e2e test task
G6

All Tasks Closed

Story cannot move to QA while any task is still open

Structural check — are all tasks in "done" status? Doesn't verify the tasks were done well, just that none were abandoned mid-flight.

What rejection looks like

G6 (All Tasks Done): FAILED
  Pending: Task 1 — Implement auth flow
  Pending: Task 3 — Write integration tests
G8

Content Quality (LLM)

An independent LLM evaluates whether the PRD and design meet a quality checklist

This is a content gate — a separate LLM reads the documents and scores them against criteria like "testable acceptance criteria" and "scope boundaries defined." It's not the agent grading its own homework.

What rejection looks like

G8 (Content Quality): FAILED
  PRD: insufficient — requirements are vague, missing
  acceptance criteria for 3 of 5 features
G9

Strategy Matches Tasks

Test types promised in the strategy prose must have corresponding test tasks

If the strategy says "e2e tests for the login flow" but no e2e test task exists, the promise is empty. The gate extracts the test types named in the strategy prose and checks them against actual test tasks.

What rejection looks like

G9 (Test Strategy Prose Coverage): FAILED
  Strategy describes: unit, integration, e2e
  Tasks cover: unit
  Missing: integration, e2e test tasks
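
Once the test types have been extracted from the prose, the comparison itself is a set difference. A sketch of that final step (extraction not shown):

```python
# Hypothetical G9-style comparison: test types promised in the strategy
# minus test types actually covered by tasks.

def missing_test_types(strategy_types, task_types):
    """Test types the strategy promises but no task covers."""
    return sorted(set(strategy_types) - set(task_types))

promised = ["unit", "integration", "e2e"]  # extracted from strategy prose
covered = ["unit"]                         # test types of actual test tasks

missing = missing_test_types(promised, covered)
# missing == ["e2e", "integration"] -> gate fails, listing each empty promise
```
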
G11

Staleness Detection

Flags when upstream documents changed but downstream artifacts weren't reviewed

If you change a PRD requirement, the design sections covering it are now stale. This gate tracks timestamps — it doesn't know if the change matters, just that someone should look.

What rejection looks like

G11 (Dependency Coherence): FAILED
  Design "auth-flow" is stale — PRD changed Mar 15,
  design last reviewed Mar 10
  Task 4 is stale — design changed Mar 15,
  task last reviewed Mar 12
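
The timestamp comparison at the heart of G11 is simple to state. A sketch under the assumption that each artifact carries a last-changed and last-reviewed date:

```python
# Hypothetical G11-style staleness check: a downstream artifact is stale
# if it was last reviewed before its upstream document last changed.

from datetime import date

def is_stale(upstream_changed, downstream_reviewed):
    """True if the downstream artifact needs a fresh review."""
    return downstream_reviewed < upstream_changed

prd_changed = date(2025, 3, 15)
design_reviewed = date(2025, 3, 10)
assert is_stale(prd_changed, design_reviewed)  # design must be re-reviewed
```

Note the gate flags the need for review without judging whether the change actually matters, exactly as described above.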
G12

Design Content Matches PRD (LLM)

An LLM reads the design prose and checks whether it actually discusses the requirements it claims to cover

G1 checks that pointers exist. G12 checks that the content behind those pointers is relevant. A design section can point to "Session Timeout" but only discuss login — G12 catches that.

What rejection looks like

G12 (Design Semantic Coverage): FAILED
  "auth-implementation" fails to cover "Session Timeout":
  section discusses login flow but never addresses
  session expiry or timeout handling

13 checks. Every transition. Automatic.

Stop manually verifying what your agent built. Let the gates do it.

Try Ceetrix Free