Use embedding data from LLMs to determine the "most different" document in relation to a set of documents.
- When working with a corpus of texts, for instance in linguistics or qualitative social research, the order in which those texts are analyzed is essential. In qualitative social research, a common case selection strategy is the "most different case," i.e., a document that is most dissimilar to the ones already analyzed. However, this can be challenging, since it is often not obvious which document actually is the "most different" one.
- Another issue arises when the data corpus is so large that not all documents can be analyzed. In that case, a common strategy is to stop the data analysis once theoretical saturation is reached, that is, the point where analyzing additional documents does not yield new insights. However, an intrinsic danger of this approach is overlooking documents that contain relevant information, simply because in the sequential analysis of documents, the researcher coincidentally only selected documents with similar information.
In both of these scenarios, the automatic identification of documents "most different" from a given set of documents becomes relevant. Effectively, it allows the efficient selection of the unread documents most dissimilar to the set of already read documents.
The Most Different Text Selector implements this via embeddings, the
mathematical vectors underlying LLMs, to calculate a "novelty score" of each
unread document in relation to the set of read documents.
The results are far from objective, but depend on the model used and the training data of the commercial providers. In addition, the interpretation of the novelty scores is limited, since 1. the meaning of the more than 1000 embedding dimensions is unknown, and 2. the aggregation of all those dimensions into a single score conceals which dimensions contributed to it. (If a document is considered "most different," it could be due to topic, genre, or language.) The latter issue, however, can be mitigated by running the analysis only on sets of documents of the same type or language.
Nonetheless, a perfect identification of "most differentness," however it may look, is not needed: for the purpose of efficiently selecting the next document, the baseline for comparison is random selection. And as imperfect as the embedding-based approach may be, it is certainly far better than choosing the next document at random.
Even though Most Different Text Selector was designed with the above
scientific use case in mind, the general idea can also be applied elsewhere:
- Read later apps suggesting which article to read next.
- Legal work with a large amount of documents in the discovery phase.
- Recommendation systems geared toward novelty instead of similarity. ("Similar to what you liked" vs. "Want to try something new?")
Embeddings have been used to calculate the similarity of texts, for example to create a list of "Related articles" for technical documentation.
Since embeddings are basically data points to determine the similarity of texts,
they can also be used to do the opposite: the identification of dissimilar
texts. Most Different Text Selector works as follows:
- It takes a folder with Markdown documents as input.
- In each file, the YAML frontmatter is checked for a `read` boolean key to determine whether the file has already been read.
- For each file, the embedding is determined via the OpenAI API.
- The semantic center of all read documents is determined by calculating the element-wise average vector of their embeddings.
- For each unread document, the distance (cosine similarity) to the semantic center of the read documents is calculated.
- For easier interpretability, the cosine similarity is transformed into a "novelty score" ranging from 0 to 100, with 100 being the most different.
- The results are saved to a file called `./REPORT.md` and written back to the YAML frontmatter of the unread documents.
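The core computation of these steps can be sketched as follows. This is a minimal sketch, not the actual implementation: `Embedding` is simply `number[]`, and the exact mapping from cosine similarity to the 0–100 score is an assumption that linearly inverts a similarity in [-1, 1].

```typescript
type Embedding = number[];

// Semantic center: element-wise average of the read documents' embeddings.
function semanticCenter(embeddings: Embedding[]): Embedding {
	const dims = embeddings[0].length;
	const center: number[] = new Array(dims).fill(0);
	for (const vec of embeddings) {
		for (let i = 0; i < dims; i++) center[i] += vec[i] / embeddings.length;
	}
	return center;
}

// Cosine similarity between two vectors of equal length.
function cosineSimilarity(a: Embedding, b: Embedding): number {
	let dot = 0;
	let normA = 0;
	let normB = 0;
	for (let i = 0; i < a.length; i++) {
		dot += a[i] * b[i];
		normA += a[i] * a[i];
		normB += b[i] * b[i];
	}
	return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Map cosine similarity in [-1, 1] to a novelty score in [0, 100],
// with 100 = most different. (This particular mapping is an assumption.)
function noveltyScore(unread: Embedding, center: Embedding): number {
	const similarity = cosineSimilarity(unread, center);
	return Math.round(((1 - similarity) / 2) * 100);
}
```

Under this mapping, an unread document pointing in the same direction as the semantic center gets a score of 0, and one pointing in the opposite direction gets 100.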
Most Different Text Selector is written in TypeScript instead of Python, to make a potential future implementation as a plugin for Obsidian possible, e.g., to complement qualitative analysis with Quadro.
## Requirements

- OpenAI API key
- node.js
- Documents saved as Markdown files in a folder.
- `read` boolean key in the YAML frontmatter indicating whether the document was read or unread.
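For illustration, the frontmatter of a not-yet-analyzed document could look like this (the title is a made-up example; only the `read` key matters to the tool):

```yaml
---
title: Interview 07
read: false
---
```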
## Usage

- Modify the values in `src/settings.ts`.
- Run in the terminal:

  ```bash
  npm install
  npx tsx ./src/main.ts
  ```

- A summarizing report is saved in the file `./REPORT.md`.
- The ranking of the "most differentness" is saved in the YAML frontmatter of the unread documents under the key `novelty-score`.
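After a run, the frontmatter of an unread document then additionally contains the score (values illustrative):

```yaml
---
title: Interview 07
read: false
novelty-score: 87
---
```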
> [!TIP]
> **Shift of the semantic center**: After reading a sufficient number of documents, the semantic center of the read documents will shift, resulting in outdated novelty scores for the unread documents. It is thus recommended to re-run the analysis once in a while.
## Number of documents
In the free tier, there is a rate limit for OpenAI embeddings of 100 requests per day with the `text-embedding-3-small` model. If you have already paid for your OpenAI account in the past, you are automatically placed in a higher tier with far more requests per day.
- Info on the placement in tiers
- Rate Limit for the model
- Usage limits info (your usage tier is noted at the top of the page)
## Document size
The current OpenAI embedding models have a maximum input size of 8192 tokens, which is about 32,000 characters of English text. Documents longer than that are automatically truncated by Most Different Text Selector.
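Such truncation can be sketched as a simple character cutoff. This is only an illustration, not the actual implementation, which may count tokens instead; the ~4-characters-per-token ratio for English text is an approximation.

```typescript
// Rough character budget for the 8192-token limit, assuming roughly
// 4 characters per English token. An approximation, not a token count.
const MAX_CHARS = 32_000;

// Cut off text that would exceed the embedding model's input limit.
function truncateForEmbedding(text: string): string {
	return text.length > MAX_CHARS ? text.slice(0, MAX_CHARS) : text;
}
```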
All input documents are sent to OpenAI, so be sure not to include sensitive data in the input folder.
## Citation

Please cite this software project as (APA):

```txt
Grieser, C. (2025). Most Different Text Selector [Computer software].
https://github.com/chrisgrieser/most-different-text-selection
```

## About the developer

In my day job, I am a sociologist studying the social mechanisms underlying the digital economy. For my PhD project, I investigate the governance of the app economy and how software ecosystems manage the tension between innovation and compatibility. If you are interested in this subject, feel free to get in touch.