Conversation

Maratyszcza (Collaborator)

- Process tokens when Context.process or Context.sample is called, rather than eagerly once max_batch_tokens unprocessed tokens have accumulated (see the first sketch after this list).
- Avoid re-creating command buffers for each batch.
- Avoid redundantly computing output activations for the last token in each batch.
- Avoid invalidating the KV cache on Context.reset: when tokens are appended after Context.reset, match the longest common prefix against the tokens already in the KV cache, so only the suffix needs recomputing (see the second sketch after this list).
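
A minimal Python sketch of the lazy-processing idea from the first bullet. The `Context` class shown here, its `pending` buffer, and the `_process_batch`/`_sample_token` helpers are invented stand-ins for illustration, not the repository's actual implementation:

```python
class Context:
    """Hypothetical context that defers the forward pass until needed."""

    def __init__(self, max_batch_tokens: int = 128):
        self.max_batch_tokens = max_batch_tokens
        self.pending: list[int] = []    # appended but not yet processed tokens
        self.processed: list[int] = []  # tokens already run through the model

    def append(self, token: int) -> None:
        # Only accumulate; the eager version would call self.process()
        # here as soon as len(self.pending) reached max_batch_tokens.
        self.pending.append(token)

    def process(self) -> None:
        # Flush pending tokens in max_batch_tokens-sized batches, but only
        # when the caller explicitly asks for processing.
        while self.pending:
            batch = self.pending[: self.max_batch_tokens]
            del self.pending[: len(batch)]
            self._process_batch(batch)

    def sample(self) -> int:
        # Sampling needs up-to-date activations, so it forces a flush first.
        self.process()
        return self._sample_token()

    def _process_batch(self, batch: list[int]) -> None:
        self.processed.extend(batch)  # stand-in for the real forward pass

    def _sample_token(self) -> int:
        return 0  # stand-in for sampling from the last token's logits
```

Deferring the flush this way means a caller that appends many tokens before sampling pays for one round of batched processing at the point of use, instead of triggering work at an arbitrary buffer-size boundary.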
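
A minimal Python sketch of the prefix-matching idea from the last bullet, assuming the context records which token ids its KV entries were computed from; `common_prefix_len` and `_process_batch` are hypothetical helpers:

```python
def common_prefix_len(a: list[int], b: list[int]) -> int:
    """Length of the longest shared prefix of two token sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


class Context:
    """Hypothetical context whose KV cache survives reset()."""

    def __init__(self):
        self.kv_tokens: list[int] = []  # tokens with valid KV-cache entries
        self.tokens: list[int] = []     # current logical token stream

    def reset(self) -> None:
        # Keep kv_tokens: the cached entries stay valid for whatever prefix
        # the next token stream happens to share with the old one.
        self.tokens = []

    def append(self, token: int) -> None:
        self.tokens.append(token)

    def process(self) -> None:
        # Reuse KV entries for the longest shared prefix; only the suffix
        # beyond it needs to be computed.
        keep = common_prefix_len(self.kv_tokens, self.tokens)
        self.kv_tokens = self.kv_tokens[:keep]
        suffix = self.tokens[keep:]
        if suffix:
            self._process_batch(suffix)  # stand-in for the real forward pass
            self.kv_tokens.extend(suffix)

    def _process_batch(self, batch: list[int]) -> None:
        pass  # the actual implementation would run attention/FFN layers
```

This makes the common "reset, then re-append the same prompt plus a few new tokens" pattern cheap: the shared prompt prefix is served from the cache and only the new tail is recomputed.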
@dkundel-openai merged commit cf427a6 into main on Aug 15, 2025.
@volsgd deleted the marat/lazy-process branch on Aug 16, 2025 at 01:03.