Feat: txt file input for pageindex #24

clarenceluo78 · 2025-06-06T18:12:56Z

Adds support for txt file input.

txt processing features

get_txt_page_tokens():
- Strategy 1: Form Feed Detection - Splits the text on \f (form feed) characters, which mark page breaks
- Strategy 2: Chapter Detection - (Currently disabled) Detects chapter markers by matching regex patterns on "CHAPTER" (useful for ebooks)
- Strategy 3: Character-based Chunking - Fallback method that splits by character count and breaks at word boundaries rather than mid-word. Defaults to 2000 characters per page
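As a rough illustration of how Strategy 1 and Strategy 3 could fit together, here is a minimal sketch. It is not the code from this PR: the helper name `split_into_pages` and the `chars_per_page` parameter are hypothetical, and the word boundary rule (back up to the last whitespace inside the window) is only one plausible reading of "smart word boundary detection".

```python
def split_into_pages(text: str, chars_per_page: int = 2000) -> list[str]:
    """Hypothetical helper: split plain text into pseudo-pages."""
    # Strategy 1: form feed characters mark explicit page breaks.
    if "\f" in text:
        pages = [part.strip() for part in text.split("\f")]
        return [page for page in pages if page]

    # Strategy 3 (fallback): fixed-size character windows that avoid
    # cutting a word in half where possible.
    pages = []
    start = 0
    while start < len(text):
        end = min(start + chars_per_page, len(text))
        # Only back up if the cut would land in the middle of a word.
        if end < len(text) and not text[end].isspace():
            boundary = max(text.rfind(" ", start, end), text.rfind("\n", start, end))
            if boundary > start:
                end = boundary
        page = text[start:end].strip()
        if page:
            pages.append(page)
        start = end  # always advances, so the loop terminates
    return pages


if __name__ == "__main__":
    print(split_into_pages("page one\fpage two\f"))
    print(split_into_pages("alpha beta gamma delta", chars_per_page=10))
```

A word longer than `chars_per_page` is still cut at the window edge in this sketch, since there is no whitespace to back up to.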

Use Case

Example result: ebooks_percival_keene_structure.json, generated from Percival Keene (http://en.wikipedia.org/wiki/Percival_Keene), which is part of the NarrativeQA dataset (https://github.com/google-deepmind/narrativeqa).

clarenceluo78 · 2025-06-12T16:39:14Z

Modified txt file processing: txt input now supports both token-based and character-based segmentation.

- Token-based segmentation: Uses LlamaIndex's TokenTextSplitter for token-level chunks (default)
- Character-based segmentation: Traditional approach with word boundary detection, used as a fallback
- Configurable parameters: Customize chunk sizes, tokenizers, and overlap settings
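To make the token-based mode concrete, the sketch below shows the general idea of overlapping token windows. It is a stand-in, not LlamaIndex's TokenTextSplitter and not the code in this PR: the function name `split_by_tokens`, the default sizes, and the whitespace tokenizer are all assumptions chosen to keep the example self-contained.

```python
from typing import Callable


def split_by_tokens(
    text: str,
    chunk_size: int = 512,      # assumed default, not taken from the PR
    chunk_overlap: int = 50,    # assumed default, not taken from the PR
    tokenize: Callable[[str], list[str]] = str.split,  # whitespace stand-in tokenizer
) -> list[str]:
    """Hypothetical helper: chunk text into overlapping token windows."""
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= chunk_overlap < chunk_size")

    tokens = tokenize(text)
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_size]
        chunks.append(" ".join(window))
        # Stop once a window reaches the end, to avoid a trailing
        # chunk that only repeats the overlap.
        if start + chunk_size >= len(tokens):
            break
    return chunks


if __name__ == "__main__":
    print(split_by_tokens("a b c d e f g", chunk_size=3, chunk_overlap=1))
```

Passing a different `tokenize` callable is how a real tokenizer could be swapped in here. Note that joining with spaces only reconstructs the text faithfully for whitespace tokens; a subword tokenizer would need its own decode step.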

clarenceluo78 and others added 3 commits June 6, 2025 19:00

feat: add txt support for pageindex

f402361

Merge branch 'VectifyAI:main' into main

07c3997

update token-based segmentation for txt input

1baf575
