Use embedding data from LLMs to determine the "most different" document in relation to a set of documents.
- When working with a corpus of texts, for instance in linguistics or qualitative social research, the order in which those texts are analyzed is essential. In qualitative social research, a common case selection strategy is the "most different case," i.e., a document that is most dissimilar to the ones already analyzed. However, this can be challenging, since it is often not obvious which document actually is the "most different" one.
- Another issue arises when the data corpus is so large that not all documents can be analyzed. In that case, a common strategy is to stop the data analysis once theoretical saturation is reached, that is, the point where analyzing additional documents does not yield new insights. However, an intrinsic danger of this approach is overlooking documents that contain relevant information, simply because in the sequential analysis of documents, the researcher coincidentally only selected documents with similar information.
In both of these scenarios, the automatic identification of documents "most different" from a given set of documents becomes relevant. Effectively, it allows the efficient selection of the unread documents most dissimilar to the set of already read documents.
The Most Different Text Selector implements this via embeddings, the
mathematical vectors underlying LLMs, to calculate a "novelty score" of each
unread document in relation to the set of read documents.
The results are far from objective, but depend on the model used and the training data of the commercial providers. In addition, the interpretation of the novelty scores is limited, since 1. the meaning of the more than 1000 embedding dimensions is unknown, and 2. the aggregation of all those dimensions into a single score conceals which dimensions contributed to it. (If a document is considered "most different," it could be due to topic, genre, or language.) The latter issue, however, can be mitigated by running the analysis only on sets of documents of the same type or language.
Nonetheless, a perfect identification of "most differentness," however it may look, is not needed: for the purpose of efficiently selecting the next document, the baseline for comparison is random selection. And as imperfect as the embedding-based approach may be, it is certainly far better than choosing the next document at random.
Even though Most Different Text Selector was designed with the above
scientific use case in mind, the general idea can also be applied elsewhere:
- Read later apps suggesting which article to read next.
- Legal work with a large amount of documents in the discovery phase.
- Recommendation systems geared toward novelty instead of similarity. ("Similar to what you liked" vs. "Want to try something new?")
Embeddings have been used to calculate the similarity of texts, for example to create a list of "Related articles" for technical documentation.
Since embeddings are basically data points to determine the similarity of texts,
they can also be used to do the opposite: the identification of dissimilar
texts. Most Different Text Selector works as follows:
- It takes a folder with Markdown documents as input.
- In each file, the YAML frontmatter is checked for a `read` boolean key to determine whether the file has already been read.
- For each file, the embedding is determined via the OpenAI API.
- The semantic center of all read documents is determined by calculating the element-wise average vector of their embeddings.
- For each unread document, the distance (cosine similarity) to the semantic center of the read documents is calculated.
- For easier interpretability, the cosine similarity is transformed into a "novelty score" ranging from 0 to 100, with 100 being the most different.
- The results are saved to a file called `./REPORT.md` and written back to the YAML frontmatter of the unread documents.
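The core computation of these steps can be sketched as follows. This is a minimal sketch, not the actual implementation: `Embedding` is simply `number[]`, and the exact mapping from cosine similarity to the 0–100 score is an assumption that linearly inverts a similarity in [-1, 1].

```typescript
type Embedding = number[];

// Semantic center: element-wise average of the read documents' embeddings.
function semanticCenter(embeddings: Embedding[]): Embedding {
	const dims = embeddings[0].length;
	const center: number[] = new Array(dims).fill(0);
	for (const vec of embeddings) {
		for (let i = 0; i < dims; i++) center[i] += vec[i] / embeddings.length;
	}
	return center;
}

// Cosine similarity between two vectors of equal length.
function cosineSimilarity(a: Embedding, b: Embedding): number {
	let dot = 0;
	let normA = 0;
	let normB = 0;
	for (let i = 0; i < a.length; i++) {
		dot += a[i] * b[i];
		normA += a[i] * a[i];
		normB += b[i] * b[i];
	}
	return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Map cosine similarity in [-1, 1] to a novelty score in [0, 100],
// with 100 = most different. (This particular mapping is an assumption.)
function noveltyScore(unread: Embedding, center: Embedding): number {
	const similarity = cosineSimilarity(unread, center);
	return Math.round(((1 - similarity) / 2) * 100);
}
```

Under this mapping, an unread document pointing in the same direction as the semantic center gets a score of 0, and one pointing in the opposite direction gets 100.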
Most Different Text Selector is written in TypeScript instead of Python, to make a potential future implementation as a plugin for Obsidian possible, e.g., to complement qualitative analysis with Quadro.
## Requirements

- OpenAI API key
- node.js
- Documents saved as Markdown files in a folder.
- `read` boolean key in the YAML frontmatter indicating whether the document was read or unread.
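For illustration, the frontmatter of a not-yet-analyzed document could look like this (the title is a made-up example; only the `read` key matters to the tool):

```yaml
---
title: Interview 07
read: false
---
```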
## Usage

- Modify the values in `src/settings.ts`.
- Run in the terminal:

  ```bash
  npm install
  npx tsx ./src/main.ts
  ```

- A summarizing report is saved in the file `./REPORT.md`.
- The ranking of the "most differentness" is saved in the YAML frontmatter of the unread documents under the key `novelty-score`.
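After a run, the frontmatter of an unread document then additionally contains the score (values illustrative):

```yaml
---
title: Interview 07
read: false
novelty-score: 87
---
```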
> [!TIP]
> **Shift of the semantic center**: After reading a sufficient number of documents, the semantic center of the read documents will shift, resulting in outdated novelty scores for the unread documents. It is thus recommended to re-run the analysis once in a while.
## Number of documents
In the free tier, there is a rate limit for OpenAI embeddings of 100 requests per day with the `text-embedding-3-small` model. If you have already paid for your OpenAI account in the past, you are automatically placed in a higher tier with far more requests per day.
- Info on the placement in tiers
- Rate Limit for the model
- Usage limits info (your usage tier is noted at the top of the page)
## Document size
The current OpenAI embedding models have a maximum input size of 8192 tokens, which is about 32,000 characters of English text. Documents longer than that are automatically truncated by Most Different Text Selector.
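Such truncation can be sketched as a simple character cutoff. This is only an illustration, not the actual implementation, which may count tokens instead; the ~4-characters-per-token ratio for English text is an approximation.

```typescript
// Rough character budget for the 8192-token limit, assuming roughly
// 4 characters per English token. An approximation, not a token count.
const MAX_CHARS = 32_000;

// Cut off text that would exceed the embedding model's input limit.
function truncateForEmbedding(text: string): string {
	return text.length > MAX_CHARS ? text.slice(0, MAX_CHARS) : text;
}
```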
All input documents are sent to OpenAI, so be sure not to include sensitive data in the input folder.
## Citation

Please cite this software project as (APA):

```txt
Grieser, C. (2025). Most Different Text Selector [Computer software].
https://github.com/chrisgrieser/most-different-text-selection
```

## About the developer

In my day job, I am a sociologist studying the social mechanisms underlying the digital economy. For my PhD project, I investigate the governance of the app economy and how software ecosystems manage the tension between innovation and compatibility. If you are interested in this subject, feel free to get in touch.