@clarenceluo78
Contributor

Added support for txt file input

txt processing features

  • get_txt_page_tokens():
    • Strategy 1: Form Feed Detection - Splits text on \f characters (page breaks)
    • Strategy 2: Chapter Detection - (Currently disabled) Detects chapter markers with regex patterns matching "CHAPTER" (useful for ebooks)
    • Strategy 3: Character-based Chunking - Fallback method that splits by character count and breaks at word boundaries. Defaults to 2000 characters per page
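
To make the fallback order concrete, here is a minimal sketch of Strategy 1 and Strategy 3 as described above. It is not the code from this PR: the function name `split_txt_into_pages` and its `chars_per_page` parameter are hypothetical, and the disabled chapter strategy is left out.

```python
def split_txt_into_pages(text: str, chars_per_page: int = 2000) -> list[str]:
    """Hypothetical helper: split plain text into page-like chunks."""
    # Strategy 1: form feed characters (\f) mark explicit page breaks.
    if "\f" in text:
        return [page.strip() for page in text.split("\f") if page.strip()]

    # Strategy 3: fall back to fixed-size chunks, cutting at whitespace
    # so that words are not split when a boundary is available.
    pages = []
    start = 0
    while start < len(text):
        end = min(start + chars_per_page, len(text))
        if end < len(text):
            # Back up to the last space or newline inside the window.
            cut = max(text.rfind(" ", start, end), text.rfind("\n", start, end))
            if cut > start:
                end = cut
        page = text[start:end].strip()
        if page:
            pages.append(page)
        start = end
    return pages
```

If a window contains no whitespace at all, the sketch falls back to a hard cut at `chars_per_page`, so the loop always makes progress.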

Use Case

Example result: ebooks_percival_keene_structure.json, generated from http://en.wikipedia.org/wiki/Percival_Keene (part of the NarrativeQA dataset: https://github.com/google-deepmind/narrativeqa)

@clarenceluo78
Contributor Author

Modified txt file processing: txt input now supports both token-based and character-based segmentation

  • Token-based segmentation: Uses LlamaIndex's TokenTextSplitter for token-level chunks (default)
  • Character-based segmentation: Traditional approach with word boundary detection as fallback
  • Configurable parameters: Customize chunk sizes, tokenizers, and overlap settings
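
As an illustration of how chunk size and overlap interact in token-based segmentation, here is a self-contained sketch. It does not call LlamaIndex: a whitespace tokenizer stands in for the real tokenizer, and the name `split_by_tokens` and its default values are assumptions made for this example, not the settings used in the PR.

```python
from typing import Callable


def split_by_tokens(
    text: str,
    chunk_size: int = 256,      # assumed default, for illustration only
    chunk_overlap: int = 20,    # assumed default, for illustration only
    tokenize: Callable[[str], list[str]] = str.split,
) -> list[str]:
    """Hypothetical helper: chunk text by token count with overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")

    tokens = tokenize(text)
    step = chunk_size - chunk_overlap  # how far each new chunk advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        # Stop once the current chunk already reaches the last token.
        if start + chunk_size >= len(tokens):
            break
    return chunks
```

Each chunk repeats the last `chunk_overlap` tokens of the previous one, which keeps some context across chunk boundaries at the cost of a little duplication.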
