
A Speed Nerd Met spaCy

How a small Rust and PyO3 crate turns spaCy POS tags into a fast structural diversity meter for generated text.

[Diagram: generated text parsed by spaCy, scored by a Rust POS-window diversity engine.]

A speed nerd met spaCy and said… what if we did it this way?

That sentence is a joke, but only barely.

The problem was not spaCy. spaCy was doing exactly the part I wanted it to do: tokenize text, assign part-of-speech tags, expose dependency labels, and give me stable linguistic features without turning the pipeline into a research project.

The problem was the loop after spaCy.

I was generating hard examples at scale, and I needed to know when the generator was slipping into the same hidden sentence shape over and over. The text looked different. The entities changed. The topic changed. Underneath, the grammar kept repeating.

That is how a dataset gets fluent and weak at the same time.

[Diagram: text moving through spaCy, then a Rust POS-window scoring engine.]
spaCy keeps the linguistic job. Rust takes the repeated scoring loop.

The Job

The crate is called warp_pos_spacy. It is small on purpose.

It does not parse English.

It does not replace spaCy.

It does not pretend POS tags are meaning.

It takes pre-tagged sequences and answers a narrower question quickly:

Are these generated examples collapsing into the same structural pattern?

For the original hard-negation pipeline, the anchor was often the negation point. Around each anchor, the scorer looked at a fixed six-token POS window: two tokens before, the anchor, and three tokens after. That window became a structural species.

PRON AUX PART VERB DET NOUN

Count the species. Measure the distribution. Report whether one pattern is taking over.
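
As a sketch of what that windowing step can look like in Rust: the padding sentinel, function names, and FNV-1a hash below are illustrative assumptions, not the crate's actual internals.

    // Build the fixed six-token window around an anchor:
    // two tags before, the anchor tag, three tags after.
    // Positions past the sentence edge are padded so every
    // window has the same shape.
    fn pos_window(tags: &[&str], anchor: usize) -> Vec<String> {
        (-2i64..=3)
            .map(|offset| {
                let idx = anchor as i64 + offset;
                if idx < 0 || idx as usize >= tags.len() {
                    "PAD".to_string() // assumed boundary sentinel
                } else {
                    tags[idx as usize].to_string()
                }
            })
            .collect()
    }

    // Deterministic species key: FNV-1a over the joined tags.
    // std's DefaultHasher is avoided on purpose, since it is
    // not guaranteed to be stable across Rust versions.
    fn species_hash(window: &[String]) -> u64 {
        let mut h: u64 = 0xcbf2_9ce4_8422_2325;
        for byte in window.join(" ").bytes() {
            h ^= u64::from(byte);
            h = h.wrapping_mul(0x0000_0100_0000_01b3);
        }
        h
    }

Every window that shares a grammar shares a key, so PRON AUX PART VERB DET NOUN becomes one countable species.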

The Technical Shape

The split is the important part.

spaCy does the linguistic work:

  • tokenization
  • POS tagging
  • dependency parsing
  • named entity hooks when needed
  • stable Python pipeline integration through nlp.pipe

Rust does the hot loop:

  • validate anchors
  • pad fixed windows at sentence boundaries
  • hash POS-window species deterministically
  • count species in parallel with Rayon
  • compute Simpson dominance and Shannon entropy
  • return readable top patterns for diagnostics
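
A minimal sketch of the counting step from that list, assuming the windows are already reduced to u64 species keys; the fold/reduce shape is standard Rayon, not the crate's verbatim code.

    use rayon::prelude::*;
    use std::collections::HashMap;

    // Count species keys in parallel: each worker folds its
    // chunk into a private map, then the partial maps merge.
    fn count_species(keys: &[u64]) -> HashMap<u64, u64> {
        keys.par_iter()
            .fold(HashMap::new, |mut local, &k| {
                *local.entry(k).or_insert(0) += 1;
                local
            })
            .reduce(HashMap::new, |mut merged, partial| {
                for (species, count) in partial {
                    *merged.entry(species).or_insert(0) += count;
                }
                merged
            })
    }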

PyO3 is the seam between the two worlds. Python owns the workflow and the spaCy objects. Rust owns the part where millions of tiny windows become counters and metrics.
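
A hedged sketch of what that seam can look like, written against pyo3 0.20-era signatures; the function name and argument shapes are assumptions for illustration, not warp_pos_spacy's published API.

    use pyo3::prelude::*;

    // Python hands over plain data: one POS-tag sequence and
    // one anchor index per example. Only strings and integers
    // cross the boundary; no spaCy objects do.
    #[pyfunction]
    fn score_batch(tags: Vec<Vec<String>>, anchors: Vec<usize>) -> PyResult<(f64, f64, f64)> {
        let _ = (tags, anchors);
        // ... window, hash, count, and measure as sketched above ...
        Ok((0.0, 0.0, 0.0)) // (simpson_d, shannon_h, dominant_share)
    }

    #[pymodule]
    fn warp_pos_spacy(_py: Python, m: &PyModule) -> PyResult<()> {
        m.add_function(wrap_pyfunction!(score_batch, m)?)?;
        Ok(())
    }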

On the original Super-Neg scoring job, that Rust/PyO3/Rayon path was roughly 32,000x faster than the Python loop it replaced: about 3.5 minutes for the Python loop versus about 6.6 ms for the Rust path on the same POS-window math. That number is deliberately scoped. End-to-end throughput still depends on spaCy, the model loaded, text length, CPU/GPU setup, and batch size.

The point is not “Rust beats Python at language.”

The point is that once the language problem has been reduced to arrays, anchors, hashes, and counts, Rust can run the scoring path at memory speed.

What the Metrics Mean

The crate reports a few simple signals:

  • total_windows: how many windows were scored.
  • unique_patterns: how many structural species appeared.
  • dominant_share: how much of the batch belongs to the most common pattern.
  • simpson_d: dominance, where higher means concentration.
  • shannon_h: entropy, where higher means broader variety.
  • collapse_pressure: the plain-English warning light.
  • top_patterns: the repeated POS windows a human can inspect.

Raw hashes are useful for machines. The readable pattern is useful for people.

If the top pattern is eating the batch, I want to see it:

PRON AUX PART VERB DET NOUN

That is the difference between a metric and a diagnostic.
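
For reference, the two headline numbers fall straight out of the species counts. A sketch over the counts from the parallel step; the collapse_pressure threshold here is an illustrative rule of thumb, not the crate's exact formula.

    use std::collections::HashMap;

    // simpson_d: sum of squared pattern shares (higher = concentration).
    // shannon_h: -sum of p * ln(p) over shares (higher = variety).
    // dominant_share: the single biggest pattern's slice of the batch.
    fn diversity_metrics(counts: &HashMap<u64, u64>) -> (f64, f64, f64) {
        let total: u64 = counts.values().sum();
        if total == 0 {
            return (0.0, 0.0, 0.0);
        }
        let n = total as f64;
        let (mut simpson_d, mut shannon_h, mut max_count) = (0.0, 0.0, 0u64);
        for &c in counts.values() {
            let p = c as f64 / n;
            simpson_d += p * p;
            shannon_h -= p * p.ln();
            max_count = max_count.max(c);
        }
        (simpson_d, shannon_h, max_count as f64 / n)
    }

    // Illustrative warning light: one pattern owning more than
    // half the batch reads as collapse pressure. (Assumed
    // threshold, chosen for the example.)
    fn collapse_pressure(dominant_share: f64) -> bool {
        dominant_share > 0.5
    }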

Use Cases

Negation was the wedge, not the whole tool.

Generated text has a habit of finding grooves. That shows up across more than one workflow:

  • synthetic training data that repeats one sentence skeleton
  • evaluation questions that all test the same trick
  • RAG hard negatives that differ lexically but not structurally
  • prompt batches where the model falls into one phrasing move
  • agent traces that repeat the same action pattern
  • domain datasets where a template becomes a shortcut

In each case, the problem is not that repetition exists. Repetition is sometimes correct. The problem is invisible repetition in a place where the data is supposed to cover variety.

warp_pos_spacy is a pressure gauge. It does not decide truth. It does not replace evaluation. It tells you when a structural pattern is becoming too dominant to ignore.

Why POS Still Matters

There is a lazy version of the current AI story that says structure is obsolete because vectors won.

I do not buy it.

Vectors are excellent at soft similarity. They are bad at caring about one small structural hinge when the rest of the sentence is screaming “same thing.” Negation is full of hinges. So are numbers, dates, roles, causality, modality, and scope.

Part-of-speech tags are not glamorous, but they are the kind of boring instrument that keeps a generator honest. If a model produces 10,000 examples with the same hidden grammar, POS windows reveal the repetition. If the batch spreads across many grammatical shapes, the diversity metrics move.

That signal is not the whole truth.

It is a pressure gauge.

And in a generator, pressure gauges matter.

Did something in this log entry spark an argument, a partnership idea, or an infrastructure war story?

Open a channel