The Test Document Is the Test
Why I collapsed the test plan and the test code into one file, and why that is an architecture decision rather than a tooling preference.

I build test infrastructure for connected-home hardware. Locks, sensors, hubs, the devices that have to keep working for years after someone installs them and forgets they exist. A few months ago I started a framework to do that work properly, and I want to explain the central decision behind it, because it is the decision most teams get wrong.
The human-readable test document and the executable test are the same file. There is no test plan describing the procedure and no separate code implementing it. You write the procedure once, as a document an engineer, product manager, or QA tester can read in review, and that document is what runs against the hardware. I made that call deliberately. It borrows from tools like Cucumber and pushes the idea to its extreme for hardware devices, and four months of production use have not given me a reason to walk it back.
Why the two-artifact model fails
The conventional setup keeps a test plan for humans and test code for machines. Everyone knows these two drift apart. Nobody treats it as the structural defect it is. They diverge the moment they are written, because there is no mechanism that forces them to agree. The plan is prose, and prose never runs, so nothing ever catches it lying. Six months later the document on the wiki describes a procedure the code stopped performing, and the only people who know are the two engineers who happened to be in the room.
For most software that is a documentation hygiene problem. For hardware validation it is a credibility problem, and credibility is the entire product. When I certify that a lock survives a power-cleared rejoin sequence, the value of that statement is exactly equal to my ability to prove it and reproduce it on demand. A result I cannot reproduce is worth nothing the moment someone competent doubts it.
So I removed the second artifact. There is no plan to fall out of sync with the code, because the plan is the executable. A document that misdescribes what the system does fails when you run it. Staleness stopped being a silent rot and became a test failure, which is feedback engineers actually act on.
What a test looks like
A test case is a Markdown document. It carries its objective, its preconditions, the procedure, and the expected results, in a form a reviewer reads like any other document in a pull request. The executable steps live inline:
id: step-3
type: api
method: POST
url: /devices
expected: Device is created and its ID is captured
validate:
  status: 201
outputs:
  DEVICE_ID: id
That block is simultaneously the line a reviewer reads and the request the runner issues. The outputs field captures the new identifier into DEVICE_ID, and later steps reference ${DEVICE_ID}. The document carries its own state forward. There is no hidden glue file, no separate layer of step definitions to maintain, no translation between what the document says and what the machine does. What you read is what executes.
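A minimal sketch of how that capture-and-substitute mechanism could work, in Python. This is not the framework's actual code: the function names `capture_outputs` and `substitute`, the sample response body, and the follow-up URL are all hypothetical, and the sketch assumes `outputs` maps a variable name to a top-level field of a JSON response.

```python
import re
from typing import Any

# Matches ${NAME} placeholders. The naming rule (upper-case, digits, underscores)
# is an assumption for this sketch.
_VAR = re.compile(r"\$\{([A-Z_][A-Z0-9_]*)\}")


def capture_outputs(outputs: dict[str, str],
                    response_body: dict[str, Any],
                    variables: dict[str, str]) -> None:
    """Copy the named top-level response fields into the shared variable table."""
    for var_name, field in outputs.items():
        variables[var_name] = str(response_body[field])


def substitute(text: str, variables: dict[str, str]) -> str:
    """Replace each ${NAME} with its captured value; fail loudly if it is unknown."""
    def repl(match: re.Match) -> str:
        name = match.group(1)
        if name not in variables:
            raise KeyError(f"${{{name}}} is referenced before any step captured it")
        return variables[name]
    return _VAR.sub(repl, text)


variables: dict[str, str] = {}

# Pretend step-3's POST /devices returned this body (made-up value).
capture_outputs({"DEVICE_ID": "id"}, {"id": "dev-123"}, variables)

# A later step's URL, written in the document with a placeholder.
next_url = substitute("/devices/${DEVICE_ID}/status", variables)
print(next_url)
```

Raising on an unknown variable, rather than leaving the placeholder in the URL, keeps a mistyped reference from turning into a confusing failure against the device.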
The cost, paid knowingly
This decision is not free, and I want to be precise about the bill, because every architecture has a downside and hiding it is its own failure mode.
I gave up writing tests in a general-purpose language with a mature debugger behind it. In exchange I now own a parser, a schema, and a validator, none of which produce a single test on their own, and all of which I have to extend every time the framework needs to do something new. When an engineer needed to drive a Bluetooth handset in the middle of a test sequence, that was my problem to solve in the framework, not theirs to solve in a script. A framework is a standing commitment to keep building it. I knew that going in. I took the commitment because the alternative, invisible drift between intent and execution, was the thing actively destroying confidence in our results, and no amount of debugger convenience compensates for results nobody believes.
This is an architecture decision, not a tooling preference
When testing becomes painful the common instinct is to reach for a better tool. A faster runner, a new library, a cleaner harness. That instinct solves the bad afternoon in front of you and does nothing about the bad year generating those afternoons.
The problem I had was structural. Our testing knowledge did not accumulate. Every run started from zero, every result lived in someone's memory, every reproduction required a human in the loop. No runner fixes that, because it is not a runner-shaped problem. You fix it by deciding where the single source of truth lives and then subordinating everything else to that decision. I decided the document is the source of truth and the code serves the document. That one constraint determines everything downstream, including the part that runs before any device is touched:
$ spec validate ./tests # zero errors, or nothing runs
A malformed test does not get the chance to fail at hour four of an overnight soak run. It gets rejected while the cost of being wrong is a single line of output. The unglamorous machinery, the parsing and validation and structural enforcement, had to be correct before any of the interesting hardware work earned the right to exist on top of it.
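To make the idea concrete, here is a rough Python sketch of the kind of structural check such a gate might perform before any device is touched. The required-field list, the error wording, and the `validate_steps` name are assumptions for illustration; the real `spec validate` rules are not described in this piece.

```python
import re
from typing import Any

# Assumed for this sketch: every step needs these three fields.
REQUIRED_FIELDS = ("id", "type", "expected")
VAR_REF = re.compile(r"\$\{([A-Z_][A-Z0-9_]*)\}")


def validate_steps(steps: list[dict[str, Any]]) -> list[str]:
    """Return every structural error found; an empty list means the run may start."""
    errors: list[str] = []
    seen_ids: set[str] = set()
    captured: set[str] = set()

    for index, step in enumerate(steps, start=1):
        label = str(step.get("id", f"step #{index}"))

        for field in REQUIRED_FIELDS:
            if field not in step:
                errors.append(f"{label}: missing required field '{field}'")

        if "id" in step:
            if step["id"] in seen_ids:
                errors.append(f"{label}: duplicate step id")
            seen_ids.add(step["id"])

        # A ${NAME} reference is only legal if an earlier step captured NAME.
        for value in step.values():
            if isinstance(value, str):
                for name in VAR_REF.findall(value):
                    if name not in captured:
                        errors.append(
                            f"{label}: ${{{name}}} is referenced before any step captures it"
                        )

        # This step's own outputs become visible to the steps after it.
        captured.update(step.get("outputs", {}))

    return errors


steps = [
    {"id": "step-3", "type": "api", "method": "POST", "url": "/devices",
     "expected": "Device is created and its ID is captured",
     "outputs": {"DEVICE_ID": "id"}},
    # Deliberately broken: no 'expected', and a misspelled variable name.
    {"id": "step-4", "type": "api", "method": "GET",
     "url": "/devices/${DEVICE_UUID}"},
]

errors = validate_steps(steps)
for line in errors:
    print(line)
```

Collecting every error instead of stopping at the first one means a single validation pass reports everything wrong with a document, which matters when the alternative is discovering the problems one overnight run at a time.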
Where this is going
I am going to write this build up in sequence, twice a week, including the work that did not survive contact with reality. The completion-detection logic I deleted because it reported a test finished before the device had actually settled. The point where this stopped being a tool I supervised by hand and became a service that runs unattended for days and reports honestly on its own health. The decisions I would make differently if I started again tomorrow.
The next piece answers the question I get within thirty seconds of describing any of this: isn't this just Cucumber? It deserves a real answer, so it gets its own piece.
If you think collapsing the plan and the code into one artifact is the wrong call, tell me where it breaks. That is the conversation worth having.