Finding Semantically Similar Content

Posted on: 09 October 2025

A common problem in technical documentation management is that writing teams tend to have a lot of stressed-out contributors working on shared content that is scattered across multiple departments and revised over an extended period. This is a recipe for content bloat and redundancy.

All content needs some degree of custodial refactoring through deduplication, either by merging similar content into a single source of truth (SSOT) or by eliminating all but one version. Even before the emergence of AI, maintaining clean, unambiguous documentation was essential to avoid the junk in, junk out scenario. In the foreseeable future, documentation will be converted to vectors wholesale.

Eventually, I'd like to load a large set of documentation into a local LLM that can be searched for semantically similar content. A human would still need to determine the best course of action on a case-by-case basis. As a baby step toward that goal, I came across txtai and put together a prototype that takes a simple Markdown file as input and reports chunks that are semantically identical or similar.

The processing is straightforward:

preprocess_markdown.py - Chunks the Markdown file by heading and writes the chunks to a JSON file.
index_chunks.py - Uses txtai (with the sentence-transformers/all-MiniLM-L6-v2 model) to index the JSON file.
report_similarity.py - Uses txtai to compare paragraph vectors, then reports similarity clusters and their scores.
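
To make the flow concrete, here is a minimal sketch of the three steps in plain Python. It is not the code from the GitHub package: the function names are hypothetical, and toy_vector is a word-count stand-in for the all-MiniLM-L6-v2 embeddings that txtai produces, so it measures word overlap rather than meaning. The 0.8 threshold is likewise an arbitrary assumption.

```python
import json
import math
import re
from collections import Counter
from itertools import combinations

# ATX-style headings only ("# Title" through "###### Title").
HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def chunk_by_heading(markdown: str) -> list[dict]:
    """Step 1: one chunk per heading. Text before the first heading is ignored."""
    chunks, current = [], None
    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            current = {"id": len(chunks), "heading": match.group(2), "text": ""}
            chunks.append(current)
        elif current is not None and line.strip():
            current["text"] = (current["text"] + " " + line.strip()).strip()
    return chunks


def toy_vector(text: str) -> Counter:
    """Step 2 stand-in (assumption): word counts instead of real embeddings."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def similarity_report(chunks: list[dict], threshold: float = 0.8) -> list[tuple]:
    """Step 3: score every pair of chunks and keep those at or above the threshold."""
    vectors = {chunk["id"]: toy_vector(chunk["text"]) for chunk in chunks}
    pairs = []
    for left, right in combinations(chunks, 2):
        score = cosine(vectors[left["id"]], vectors[right["id"]])
        if score >= threshold:
            pairs.append((round(score, 3), left["heading"], right["heading"]))
    return sorted(pairs, reverse=True)  # highest score first


SAMPLE = """# Install
Run the installer and restart the service.

# Setup
Run the installer and then restart the service.

# Licensing
Contact sales to renew your license key.
"""

chunks = chunk_by_heading(SAMPLE)
chunks_json = json.dumps(chunks, indent=2)  # what would be written to the JSON file
report = similarity_report(chunks)
for score, first, second in report:
    print(f"{score}  {first} <-> {second}")
```

On this sample, only the Install and Setup chunks clear the threshold. With real embeddings, two chunks that say the same thing in different words could also score highly, which is the point of using a model rather than word counts.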

If you want to look under the hood, you'll find this package on GitHub.

Postmortem

I set out to use the hierarchical structure of the Markdown file as added context for the vector scores, which proved too ambitious for this first pass. Instead, I fell back to just comparing the paragraph content under each Markdown heading. In retrospect, I might as well have run similarity searches across text in a *.txt file. Ultimately, my goal is to take what I've learned about comparing vectorized Markdown chunks and apply it to DITA, a highly structured XML format with its own metadata that could contribute to better query results.
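
One way the heading hierarchy might be folded in, which this first pass did not get to, is to prefix each chunk with its full heading path before it is embedded. The sketch below is only an illustration under that assumption: the function name, the " > " separator, and the "path: body" format are all hypothetical, and whether this actually improves similarity scores is untested.

```python
import re

# ATX-style headings only ("# Title" through "###### Title").
HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def chunks_with_heading_path(markdown: str, separator: str = " > ") -> list[dict]:
    """Chunk by heading and record each chunk's path of ancestor headings."""
    stack = []   # (level, title) for the headings currently "open"
    chunks = []
    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            level, title = len(match.group(1)), match.group(2)
            # Close any heading at the same or a deeper level first.
            while stack and stack[-1][0] >= level:
                stack.pop()
            stack.append((level, title))
            path = separator.join(name for _, name in stack)
            chunks.append({"path": path, "body": ""})
        elif chunks and line.strip():
            chunks[-1]["body"] = (chunks[-1]["body"] + " " + line.strip()).strip()
    for chunk in chunks:
        # Text handed to the embedding model: heading path plus paragraph content.
        body = chunk["body"]
        chunk["text"] = f"{chunk['path']}: {body}" if body else chunk["path"]
    return chunks


SAMPLE = """# Admin Guide
## Install
### Linux
Use the package manager.
## Upgrade
### Linux
Use the package manager.
"""

contextual = chunks_with_heading_path(SAMPLE)
for chunk in contextual:
    print(chunk["text"])
```

The two "Linux" chunks have identical bodies, but their embedded text now differs by parent heading, so a reviewer can see at a glance that one belongs to Install and the other to Upgrade.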
