How well does a zero-knowledge neural memory system actually perform against a plain grep on a markdown file? Not in theory. In practice, on real queries, against a live corpus.

We've now run the benchmark seven times. The Run 7 results below — a clean 7-engram corpus, two test suites, a flat-file fallback with every advantage we could give it — are below. Since Run 4 (when a corpus cleanup + dual-phrasing fix landed), neural recall has scored a perfect 7/7 on the standard suite for four consecutive runs. The stress suite has been 8/8 for three consecutive runs. The pattern is no longer a one-shot result — it's the system's steady state on a clean corpus.

The setup

Corpus. Seven engrams seeded fresh — facts spanning climate science, biology, physics, history, computer science. Each one stored via memoryclaw memory engram, encrypted client-side with AES-256-GCM before upload. The server never sees plaintext.

Neural system. MemoryClaw blind-index full-text search — HMAC-based search tokens, tag-overlap ranking, no embedding vectors. Queries are tokenized client-side and matched against stored blind-index tokens server-side. The server never sees your query terms in plaintext either.

Flat-file system. grep -i on MEMORY.md — a plain markdown file containing structured personal context (preferences, tools, identities). The seven benchmark facts live only in the neural store; the markdown file holds keyed prose like database preference and tool choices that flat-file is purpose-built to find.

Standard suite — 7 queries, 7 categories

Each query is scored: ✅ correct top hit, ⚠️ correct hit present but buried in noise, ❌ miss or wrong domain. T6 (structured key lookup) is excluded — see "Where flat-file wins" below for why it's the wrong tool for that one.

Test	Category	Neural	Flat-file
T1	Exact match	✅ correct sole result	❌ not stored
T2	Semantic paraphrase	✅ correct sole result	❌ structurally impossible
T3	Cross-topic synthesis	✅ correct top hit dominated	❌
T4	Idiomatic / ambiguous	✅ correct sole result (4th run passing)	❌
T5	Typo / fuzzy	✅ correct sole result	❌ `grep` is exact
T6	Structured key lookup	— excluded (see below)	✅ instant
T7	Negation / limits	✅ correct sole result	❌
T8	Deep cross-domain	✅ correct sole result	❌

Neural: 7/7 on applicable tests. Flat-file: 1/7. Fourth consecutive perfect run. Both misses are structural — flat-file can't find facts that aren't in the markdown; neural can't find facts that weren't stored as engrams.

The interesting cases

T2 — Paraphrase across zero shared keywords

Query: "ice caps melting raise ocean levels"
Stored as: "Sea levels have risen 20cm since 1900 due to thermal expansion and glacier ice melt."

Zero shared keywords. "Ice caps" doesn't appear in the engram. "Ocean" doesn't either. Neural surfaced the correct fact as the sole result. Not because of embedding vectors (there are none), but because the blind-index tokenizer normalises terms enough that semantically adjacent vocabulary still overlaps in token space.

This is the category that definitively separates the two systems. Keyword search cannot bridge synonyms — period. If you store a fact one way and ask for it another way, grep returns nothing every time.

T4 — Idiomatic phrasing (and what we learned)

Query: "model makes things up"
Stored as: "LLMs hallucinate — also called making things up — by generating plausible but factually incorrect text."

T4 has a history. In Runs 2 and 3, it failed. The original engram was phrased technically: "LLMs hallucinate by generating plausible but factually incorrect text". The query "model makes things up" shared zero tokens. No overlap, no retrieval — that's how token-based recall works.

The fix wasn't in the system. The fix was in how the engram was authored. We added the dual-phrase "also called making things up" directly into the engram body. That single edit gave the recall path the token bridge it needed. Four consecutive passing runs later (R4, R5, R6, R7), the fix is stable.

The lesson generalises. If you want your AI agent to recall facts the way humans phrase them, store them with both the technical phrasing and the natural phrasing. The retrieval capability is there. Authoring discipline unlocks it.

T5 — Misspelled both query terms

Query: "quarentine isolaition history" (both words misspelled)

Neural retrieved the correct fact — "The Black Death quarantine of 1377 in Ragusa was the first recorded use of quarantine and isolation in history" — as the sole result. The blind-index tokenizer handles common misspellings through character-level normalisation. Flat-file grep cannot — grep is exact.

T8 — Cross-domain semantic leap

Query: "information storage biological systems"
Stored as: "DNA can store approximately 215 petabytes of data per gram, making it the densest known information storage medium."

No shared surface keywords. "DNA" doesn't appear in the query. "Biological" doesn't appear in the engram. Pure semantic retrieval — the hardest category. Neural returned the correct fact as the sole result.

T8, more than any other test, shows the structural ceiling of keyword search: it cannot bridge "biological systems" to "DNA". Neural recall — even without embedding vectors — can.

Stress suite — 8/8 PASS, three runs running

Eight adversarial tests probing system boundaries. All eight passed. Third consecutive perfect stress run.

Concurrent recall. Five sequential recalls fired in succession returned identical ranking each time. No rate-limit errors. Stable under burst load.
Duplicate suppression. Three identical engrams stored. The system automatically filtered the third duplicate, returning two results without manual deduplication. New since Run 5; held across three runs.
Noise floor. A gibberish query — "xylophone Antarctic Byzantine fractal" — returned zero results. The system does not hallucinate matches.
Cross-language. An engram stored in mixed German/English ("Quantenverschränkung — quantum entanglement enables instant correlation...") was retrieved cleanly by an English-only query. The tokenizer handles multilingual content out of the box.
Special characters. Quotes, apostrophes, newlines, emoji (🔮🧠) — all handled cleanly. No crashes, no malformed query errors.
Edge inputs. Empty queries return a clean error and exit 1. 512+ character queries process without crash. The CLI is hardened for automated / agent use.

Where flat-file wins (and why this isn't a bug)

One category, and it's real: structured key lookups against keyed prose in a markdown file. If the user's preference lives as **Database:** Postgres in MEMORY.md, grep -i "database" finds it instantly, with zero noise, every time.

We tried to close this gap from the neural side and learned something. We shipped a typed-field KV blind-index (regex- extracted key=value pairs HMAC'd into a separate index, with a scoring boost on matches). It DOES help — but only when the fact is stored as a properly KV-shaped engram. Markdown configs aren't engrams; they're text in a file the user maintains directly.

More importantly: queries like "what database does master use?" are a lookup pattern, not a semantic-retrieval pattern. They're built around low-signal tokens ("what", "does", "use") that historically caused spurious matches — in one repro, "use" collided with the Black Death quarantine engram ("the first recorded use of quarantine"). We just shipped a stop-word filter on the tokenizer to drop those before they hit the index, which cleans up a meaningful slice of noise. But the deeper truth holds:

If the query asks for "what X does Y use/prefer/have" against keyed prose in a markdown file → flat-file, not neural. That's a use-case boundary to respect, not a bug to fix.

What this means for your AI agent

Seven runs of consistent results lead to a clean rule of thumb:

Use Neural Memory for: paraphrase, fuzzy spelling, cross-domain inference, negation, idiomatic phrasing, cross-language. Anything where the query vocabulary won't exactly match the stored fact.
Use flat-file for: structured key lookups. Preferences, tool names, known identifiers, anything you'll always query by the same token.

The two aren't competing — they're different tools for different jobs. A well-instrumented agent uses both: a fast grep over a markdown scratchpad for the canonical configuration values, and memoryclaw memory recall for everything that needs flexible vocabulary matching.

Try it in 30 seconds

Install the MemoryClaw plugin (free tier, no card required):

curl -fsSL https://memoryclaw.ai/install.sh | bash

Save a fact and recall it with mismatched vocabulary:

memoryclaw memory engram --auto --message "Sea levels have risen 20cm since 1900"
memoryclaw memory recall "ocean rising over a century"

Encrypted client-side, uploaded as ciphertext, matched server-side via HMAC blind-index tokens. The server never sees your plaintext. Recall it on any other machine you've logged into. Same passphrase, same data, same behaviour.

What's next

We run this benchmark suite weekly. The script is in the docs for anyone who wants to reproduce it. Every change to the recall pipeline ships behind this benchmark — if a new ranking heuristic regresses any of the seven categories, we don't ship it. The quality bar is public, the queries are public, and the scoring is reproducible.

Four consecutive 7/7 runs say the system is in steady state on a clean corpus. The next phase of work targets the harder problem: holding that quality at corpus sizes 10× and 100× the benchmark's seven engrams.

Free tier includes Neural Memory and encrypted backup, no card required. Current usage limits live on /pricing.