Conversation

Maratyszcza (Collaborator)

- Process tokens when Context.process or Context.sample is called, rather than eagerly once max_batch_tokens unprocessed tokens have accumulated (see the first sketch after this list).
- Avoid re-creating command buffers for each batch.
- Avoid redundantly computing output activations for the last token in each batch.
- Avoid invalidating the KV cache on Context.reset: when tokens are appended after Context.reset, match the longest common prefix against the tokens already in the KV cache, so only the suffix needs recomputing (see the second sketch after this list).
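
A minimal Python sketch of the lazy-processing idea from the first bullet. The `Context` class shown here, its `pending` buffer, and the `_process_batch`/`_sample_token` helpers are invented stand-ins for illustration, not the repository's actual implementation:

```python
class Context:
    """Hypothetical context that defers the forward pass until needed."""

    def __init__(self, max_batch_tokens: int = 128):
        self.max_batch_tokens = max_batch_tokens
        self.pending: list[int] = []    # appended but not yet processed tokens
        self.processed: list[int] = []  # tokens already run through the model

    def append(self, token: int) -> None:
        # Only accumulate; the eager version would call self.process()
        # here as soon as len(self.pending) reached max_batch_tokens.
        self.pending.append(token)

    def process(self) -> None:
        # Flush pending tokens in max_batch_tokens-sized batches, but only
        # when the caller explicitly asks for processing.
        while self.pending:
            batch = self.pending[: self.max_batch_tokens]
            del self.pending[: len(batch)]
            self._process_batch(batch)

    def sample(self) -> int:
        # Sampling needs up-to-date activations, so it forces a flush first.
        self.process()
        return self._sample_token()

    def _process_batch(self, batch: list[int]) -> None:
        self.processed.extend(batch)  # stand-in for the real forward pass

    def _sample_token(self) -> int:
        return 0  # stand-in for sampling from the last token's logits
```

Deferring the flush this way means a caller that appends many tokens before sampling pays for one round of batched processing at the point of use, instead of triggering work at an arbitrary buffer-size boundary.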
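
A minimal Python sketch of the prefix-matching idea from the last bullet, assuming the context records which token ids its KV entries were computed from; `common_prefix_len` and `_process_batch` are hypothetical helpers:

```python
def common_prefix_len(a: list[int], b: list[int]) -> int:
    """Length of the longest shared prefix of two token sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


class Context:
    """Hypothetical context whose KV cache survives reset()."""

    def __init__(self):
        self.kv_tokens: list[int] = []  # tokens with valid KV-cache entries
        self.tokens: list[int] = []     # current logical token stream

    def reset(self) -> None:
        # Keep kv_tokens: the cached entries stay valid for whatever prefix
        # the next token stream happens to share with the old one.
        self.tokens = []

    def append(self, token: int) -> None:
        self.tokens.append(token)

    def process(self) -> None:
        # Reuse KV entries for the longest shared prefix; only the suffix
        # beyond it needs to be computed.
        keep = common_prefix_len(self.kv_tokens, self.tokens)
        self.kv_tokens = self.kv_tokens[:keep]
        suffix = self.tokens[keep:]
        if suffix:
            self._process_batch(suffix)  # stand-in for the real forward pass
            self.kv_tokens.extend(suffix)

    def _process_batch(self, batch: list[int]) -> None:
        pass  # the actual implementation would run attention/FFN layers
```

This makes the common "reset, then re-append the same prompt plus a few new tokens" pattern cheap: the shared prompt prefix is served from the cache and only the new tail is recomputed.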
@dkundel-openai merged commit cf427a6 into main on Aug 15, 2025.
@volsgd deleted the marat/lazy-process branch on Aug 16, 2025 at 01:03.