This repository was archived by the owner on Sep 10, 2025. It is now read-only.
Simplify TokenizerArgs __post_init__: Unnecessarily verbose #1518
Closed
Labels
actionable — Items in the backlog waiting for an appropriate impl/fix
good first issue — Good for newcomers
triaged — This issue has been looked at by a team member, and triaged and prioritized into an appropriate module
Description
🚀 The feature, motivation and pitch
TokenizerArgs.__post_init__ has grown verbose and redundant, and could use some simplification.
torchchat/torchchat/cli/builder.py
Lines 244 to 289 in 1384f7d
```python
class TokenizerArgs:
    tokenizer_path: Optional[Union[Path, str]] = None
    is_sentencepiece: bool = False
    is_tiktoken: bool = False
    is_hf_tokenizer: bool = False
    t: Optional[Any] = None

    def __post_init__(self):
        try:
            from tokenizer.tiktoken import Tokenizer as TiktokenTokenizer

            self.t = TiktokenTokenizer(model_path=str(self.tokenizer_path))
            self.is_tiktoken = True
            self.is_sentencepiece = False
            self.is_hf_tokenizer = False
            return
        except:
            pass

        try:
            from sentencepiece import SentencePieceProcessor

            self.t = SentencePieceProcessor(model_file=str(self.tokenizer_path))
            self.is_tiktoken = False
            self.is_sentencepiece = True
            self.is_hf_tokenizer = False
            return
        except:
            pass

        try:
            from tokenizer.hf_tokenizer import HFTokenizer

            self.t = HFTokenizer(str(self.tokenizer_path))
            self.is_tiktoken = False
            self.is_sentencepiece = False
            self.is_hf_tokenizer = True
            return
        except:
            pass

        self.is_tiktoken = False
        self.is_sentencepiece = False
        self.is_hf_tokenizer = False
        self.t = None
        return
```
Task: Simplify the logic in `__post_init__` to reduce redundancy.
To test, run a model with each tokenizer type:
- python torchchat.py generate llama2
- python torchchat.py generate llama3
- python torchchat.py generate granite-code
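One possible direction, sketched rather than prescribed: drive the three try/except blocks from a single table of (flag, loader) pairs, so each tokenizer backend is described once and the flags stay at their dataclass defaults unless a loader succeeds. The helper names (`_load_tiktoken`, etc.) and the candidate table are invented for illustration; the import paths and constructor keywords are taken from the snippet above.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Any, Optional, Union


@dataclass
class TokenizerArgs:
    tokenizer_path: Optional[Union[Path, str]] = None
    is_sentencepiece: bool = False
    is_tiktoken: bool = False
    is_hf_tokenizer: bool = False
    t: Optional[Any] = None

    def __post_init__(self):
        path = str(self.tokenizer_path)
        # Try each backend in priority order; the first loader that succeeds
        # sets its flag and returns. The other flags keep their False defaults,
        # so the repeated "set everything else False" lines disappear.
        candidates = [
            ("is_tiktoken", self._load_tiktoken),
            ("is_sentencepiece", self._load_sentencepiece),
            ("is_hf_tokenizer", self._load_hf),
        ]
        for flag, loader in candidates:
            try:
                self.t = loader(path)
            except Exception:
                continue
            setattr(self, flag, True)
            return
        self.t = None  # no backend could load this path

    # Imports stay inside the helpers so a missing optional dependency
    # only disables that one backend, as in the original code.
    @staticmethod
    def _load_tiktoken(path):
        from tokenizer.tiktoken import Tokenizer as TiktokenTokenizer
        return TiktokenTokenizer(model_path=path)

    @staticmethod
    def _load_sentencepiece(path):
        from sentencepiece import SentencePieceProcessor
        return SentencePieceProcessor(model_file=path)

    @staticmethod
    def _load_hf(path):
        from tokenizer.hf_tokenizer import HFTokenizer
        return HFTokenizer(path)
```

Note the switch from bare `except:` to `except Exception:`, so `KeyboardInterrupt` and `SystemExit` are no longer swallowed while probing backends.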
Alternatives
No response
Additional context
No response
RFC (Optional)
No response
Metadata
Assignees
Labels
actionable — Items in the backlog waiting for an appropriate impl/fix
good first issue — Good for newcomers
triaged — This issue has been looked at by a team member, and triaged and prioritized into an appropriate module
Type
Projects
Status
Done