npx zenkit init claude

Disciplined workflows for Claude Code.

Add structured specs, plans, audits, checkpoints, and handoffs to any project. One command. Native slash commands and skills. No daemon, no framework, no lock-in.

npx zenkit init claude

The problem with agent workflows

Most AI-assisted development workflows share the same structural failures. These are not model capability issues — they are protocol gaps.

Drift

Agents wander from the plan. Each step introduces compounding divergence from the intended architecture.

Verbosity

Workflows burn tokens on narration, restating context, and theatrical reasoning instead of producing artifacts.

Hidden uncertainty

Agents report success without distinguishing what was validated from what was assumed.

Lost context

Handoffs between agents or sessions lose assumptions, constraints, and decisions made earlier.

Native Claude Code integration

ZenKit installs as slash commands and skills that Claude Code already supports. No framework to learn. No runtime to manage. Just structured discipline added to your existing workflow.

/zenkit-spec: Define what to build before building it
/zenkit-plan: Break a spec into tasks with acceptance criteria
/zenkit-build: Implement with documented decisions
/zenkit-audit: Review for correctness, security, and alignment
/zenkit-checkpoint: Capture state — what's validated vs assumed
/zenkit-handoff: Transfer context without losing decisions
$ npx zenkit init claude

ZenKit for Claude Code
======================

  created .claude/commands/zenkit-spec.md
  created .claude/commands/zenkit-plan.md
  created .claude/commands/zenkit-build.md
  created .claude/commands/zenkit-audit.md
  created .claude/commands/zenkit-checkpoint.md
  created .claude/commands/zenkit-handoff.md
  created .claude/skills/zenkit-audit/SKILL.md
  created .claude/skills/zenkit-handoff/SKILL.md
  created .claude/skills/zenkit-checkpoint/SKILL.md
  created CLAUDE.md

Done. Start with: /zenkit-spec "your feature"

Core primitives

Six categories of artifacts that compose into disciplined workflows. Each is a plain file — markdown or JSON Schema — readable by humans and machines alike.

Commands

Eight workflow verbs: spec, plan, build, audit, refactor, handoff, checkpoint, ship. Each has a defined input/output contract.

/plan → structured plan with tasks, constraints, and acceptance criteria
Schemas

JSON Schema definitions for handoffs, tasks, audits, checkpoints, and benchmarks. Machine-validateable contracts.

handoff.schema.json validates every agent-to-agent transfer
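As a rough sketch, an abridged handoff contract in draft-07 might look like the following. This is illustrative only, not the shipped schema; the field names mirror the handoff example shown later on this page.

```json
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "title": "Handoff (abridged, illustrative)",
  "type": "object",
  "required": ["context", "assumptions", "decision", "next_agent"],
  "properties": {
    "context": { "type": "string" },
    "assumptions": { "type": "array", "items": { "type": "string" } },
    "decision": { "type": "string" },
    "next_agent": { "type": "string" }
  }
}
```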
Skills

Reusable capabilities: architecture review, security audit, bug triage, prompt pruning, release checks. Composable within commands.

security-review skill triggers during /audit on auth-related changes
Hooks

Automatic checkpoints at workflow boundaries. Pre-change validates plans exist. Post-change validates tests pass. Pre-ship validates all gates.

pre-ship hook blocks deploy if audit findings are unaddressed
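A pre-ship gate of this kind can be sketched in a few lines. The shapes and names below are hypothetical, not ZenKit's actual hook API:

```typescript
// Hypothetical pre-ship hook: block the ship step while any audit
// finding is still open. These types are illustrative, not ZenKit's API.
interface AuditFinding {
  id: string;
  addressed: boolean;
}

// Throws if any finding remains unaddressed, so shipping cannot proceed.
function preShipGate(findings: AuditFinding[]): void {
  const open = findings.filter((f) => !f.addressed);
  if (open.length > 0) {
    throw new Error(`pre-ship blocked: ${open.length} unaddressed finding(s)`);
  }
}
```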
Checkpoints

Explicit state snapshots with gate conditions. Distinguish validated facts from assumptions. Enable bounded rollback.

checkpoint captures git ref, test status, and token spend before shipping
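A checkpoint snapshot along these lines might look like the following. The field names are illustrative, not the shipped checkpoint schema:

```json
{
  "gate": "pre-ship",
  "git_ref": "a1b2c3d",
  "tests": "passing",
  "token_spend": 18200,
  "validated": ["unit tests pass", "all schemas compile"],
  "assumed": ["staging config matches production"]
}
```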
Rubrics

Evaluation criteria for execution quality, verbosity, and architectural alignment. Quantified scoring on a 0-10 scale.

verbosity-score penalizes restating known context or theatrical reasoning
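One way such a rubric could be scored is a simple deduction model: start at 10 and subtract a weight per flagged issue. The issue names and weights below are invented for illustration, not ZenKit's actual rubric:

```typescript
// Illustrative verbosity rubric: deduct a fixed penalty per flagged
// issue, clamped at 0. Issue names and weights are hypothetical.
type VerbosityIssue = "restated_context" | "theatrical_reasoning" | "redundant_summary";

const PENALTY: Record<VerbosityIssue, number> = {
  restated_context: 2,
  theatrical_reasoning: 3,
  redundant_summary: 1,
};

// Returns a score on the 0-10 scale described above.
function verbosityScore(issues: VerbosityIssue[]): number {
  const total = issues.reduce((sum, issue) => sum + PENALTY[issue], 0);
  return Math.max(0, 10 - total);
}
```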

Why lightweight matters

Heavier frameworks solve real problems but introduce their own. Every abstraction layer is a source of drift, a cost multiplier, and a barrier to understanding what actually happened during execution.

Instead of: Orchestration runtime with daemon processes
→ Plain files. No daemon. No runtime dependency.
Instead of: Agent personas with elaborate backstories
→ Agent contracts with bounded responsibility and explicit handoffs.
Instead of: Custom DSL for workflow definition
→ Markdown commands + JSON Schema. Tools you already know.
Instead of: Vendor-locked tool integrations
→ Runtime-agnostic. Works with Claude Code, local runtimes, or custom harnesses.
Instead of: Dashboard-first management layer
→ CLI-first. Files in your repo. Version-controlled with your code.

Benchmark: criteria-driven verification

ZenKit benchmarks verify acceptance criteria against the actual implementation — file contents, schema validity, test execution, and JSON value checks. Not file existence. Not narrative claims.
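The simplest of these checks, a file-contents check, can be sketched as follows. The types and function names are illustrative, not ZenKit's actual runner:

```typescript
// Illustrative sketch of a criteria-driven "contains" check.
// Shapes and names here are hypothetical, not ZenKit's real runner.
import { readFileSync } from "fs";

interface ContainsCheck {
  file: string;   // path to the file under test
  needle: string; // exact string the file must contain
}

// True when the file exists and contains the needle verbatim;
// a missing file counts as a failed check rather than a crash.
function runContainsCheck(check: ContainsCheck): boolean {
  try {
    return readFileSync(check.file, "utf8").includes(check.needle);
  } catch {
    return false;
  }
}
```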

6/6 specs passed
  • claude code pack: 12/12 criteria, 33/33 checks
  • cli tool: 7/7 criteria, 20/20 checks
  • handoff system: 9/9 criteria, 24/24 checks
  • protocol completeness: 10/10 criteria, 37/37 checks
  • schema validator playground: 8/8 criteria, 25/25 checks
  • self audit: 10/10 criteria, 25/25 checks
Detail: Schema Validator Playground
  Status: pass
  Criteria: 8/8
  Checks: 25/25
  Telemetry: ~37,500 tokens, ~$0.29 (estimated)
Acceptance criteria
  ac-1: Schema selector component exists and exports SchemaSelector
    src/components/playground/SchemaSelector.tsx contains 'export function SchemaSelector'
  ac-2: JSON editor component exists and accepts value/onChange props
    src/components/playground/JsonEditor.tsx contains 'export function JsonEditor'
  ac-3: Validation results component displays errors with paths
    src/components/playground/ValidationResults.tsx contains 'err.path'
  ac-4: Playground page wires schema selection, editing, and validation together
    src/app/playground/page.tsx contains 'validateAgainstSchema'
  ac-5: All 6 ZenKit schemas are registered and compilable
    6 schemas found (expected 6), 0 compilation errors
  ac-6: Example data exists and validates for each schema
    valid-handoff.json: valid against handoff.schema.json; schemas.ts registers 'handoff'; schemas.ts registers 'task'; schemas.ts registers 'audit'; schemas.ts registers 'checkpoint'; schemas.ts registers 'benchmark'
  ac-7: Unit tests exist and cover schema validation
    src/lib/__tests__/schemas.test.ts exists
  ac-8: All schemas use consistent draft-07 format
    All 6 schemas use http://json-schema.org/draft-07/schema#
What this does NOT prove
  • Token and cost figures are estimates — no actual API telemetry is captured by this runner
  • Acceptance criteria verify code structure and schema validity, not runtime UI behavior
  • Stage durations reflect verification time, not original implementation time
ZenKit vs baseline (data source: illustrative)

  Metric     ZenKit   Baseline
  Status     pass     pass
  Criteria   8        8
  Checks     25       25

Both modes verify the same codebase — the structural difference is in workflow metadata. A real comparison requires A/B execution.

Structured handoffs

Every agent-to-agent transfer uses the same contract. Context, assumptions, constraints, decisions, risks, and open questions — nothing is lost between stages.

handoff: backend-architect → frontend-architect
{
  "context": "Backend architect completed the user profile API endpoint with data validation and error handling.",
  "assumptions": [
    "PostgreSQL is the primary datastore",
    "Authentication middleware is already in place",
    "Profile images are stored in object storage, not the database"
  ],
  "constraints": [
    "Response time must be under 200ms at p95",
    "No breaking changes to existing /api/v1 endpoints"
  ],
  "decision": "Implemented as a new /api/v1/profile resource with GET/PATCH operations. Used existing ORM patterns rather than raw SQL for consistency.",
  "deliverable": {
    "type": "code",
    "description": "Profile API endpoint with validation, tests, and OpenAPI spec",
    "files_changed": [
      "src/api/profile.ts",
      "src/api/profile.test.ts",
      "docs/openapi.yaml"
    ],
    "validation_status": "passed"
  },
  "risks": [
    {
      "description": "Profile PATCH allows partial updates — concurrent writes could cause data races",
      "severity": "medium",
      "mitigation": "Added optimistic locking via updated_at timestamp check"
    }
  ],
  "open_questions": [
    "Should profile deletion be soft-delete or hard-delete?",
    "Is rate limiting needed on the PATCH endpoint?"
  ],
  "next_agent": "frontend-architect"
}

This handoff is validated against handoff.schema.json before the next agent begins work. Invalid handoffs are rejected.
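The reject-before-work step can be illustrated with a hand-rolled structural check. Real validation would run the handoff through handoff.schema.json in a JSON Schema validator; the function below is only a simplified sketch checking the core fields:

```typescript
// Simplified sketch of the "reject invalid handoffs" gate. The real
// gate validates against handoff.schema.json; this checks only the
// core fields' structure.
interface Handoff {
  context: string;
  assumptions: string[];
  constraints: string[];
  decision: string;
  risks: { description: string; severity: string }[];
  open_questions: string[];
  next_agent: string;
}

// Accepts a handoff only when every core field is present with the
// expected shape, so the next agent never starts from partial context.
function acceptHandoff(h: unknown): h is Handoff {
  const o = h as Record<string, unknown>;
  return (
    typeof o?.context === "string" &&
    Array.isArray(o.assumptions) &&
    Array.isArray(o.constraints) &&
    typeof o.decision === "string" &&
    Array.isArray(o.risks) &&
    Array.isArray(o.open_questions) &&
    typeof o.next_agent === "string"
  );
}
```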

Self-audit, not self-certification

ZenKit uses its own benchmark system to audit itself. This is structured introspection, not proof of correctness. The claims are only as strong as the checks behind them.

What self-audit does
  • Tests whether ZenKit's primitives are expressive enough to describe real work
  • Produces inspectable evidence — run the benchmark yourself and verify
  • Forces the same honesty requirements on ZenKit itself
What self-audit does NOT do
  • Does not prove ZenKit is correct — a system can only check what it knows to check
  • Does not replace independent inspection
  • Does not validate the rubrics themselves

Safeguards: benchmark checks are verifiable, uncertainty is required (not optional), limitations are inherited from specs, illustrative data is labeled, telemetry is never fabricated.

Three layers, adopt what you need

Layer 1: Claude Code pack

Slash commands, skills, and CLAUDE.md. One command, zero dependencies.

npx zenkit init claude
Layer 2: Protocol + CLI

Schemas, benchmarks, validation engine. For teams that want machine-verifiable workflows.

npm install zenkit
Layer 3: MCP server

Dynamic tool calls for validate, benchmark, and checkpoint. Coming soon.
