[WIP] Implement configurable duplicate-text removal methods #3529
Thanks for asking me to work on this. I will get started on it and keep this PR's description up to date as I form a plan and make progress.
Original prompt
Goal
Implement two configurable duplicate-text removal methods for the streaming example (examples/stream/stream.cpp):
Expose a command-line option to choose the dedupe mode and tuning parameters.
Requirements
Add configuration options to whisper_params (examples/stream/stream.cpp):
Add new CLI flags parsing and help text:
--dedupe {none,char,token}
--min-token-overlap N
--min-char-overlap N
--dedupe-history-chars N
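A minimal sketch of how these four flags might be parsed. The struct name `dedupe_params`, its field names, the helper `parse_dedupe_args`, and every default except the "token" mode are assumptions for illustration; in the actual patch the fields would live in whisper_params and the parsing would sit in the example's existing argument loop.

```cpp
#include <cstddef>
#include <cstdint>
#include <exception>
#include <string>
#include <vector>

// Hypothetical holder for the new options (would be fields of whisper_params).
struct dedupe_params {
    std::string dedupe_mode          = "token"; // none | char | token
    int32_t     min_token_overlap    = 2;       // assumed default
    int32_t     min_char_overlap     = 8;       // assumed default
    int32_t     dedupe_history_chars = 4096;    // assumed default
};

// Parses only the dedupe flags. Returns false on an unknown flag,
// a missing value, or a value that cannot be interpreted.
static bool parse_dedupe_args(const std::vector<std::string> & args, dedupe_params & params) {
    for (std::size_t i = 0; i < args.size(); ++i) {
        const std::string & arg = args[i];
        if (i + 1 >= args.size()) {
            return false; // every flag handled here takes a value
        }
        const std::string & val = args[++i];
        try {
            if (arg == "--dedupe") {
                if (val != "none" && val != "char" && val != "token") {
                    return false;
                }
                params.dedupe_mode = val;
            } else if (arg == "--min-token-overlap") {
                params.min_token_overlap = std::stoi(val);
            } else if (arg == "--min-char-overlap") {
                params.min_char_overlap = std::stoi(val);
            } else if (arg == "--dedupe-history-chars") {
                params.dedupe_history_chars = std::stoi(val);
            } else {
                return false;
            }
        } catch (const std::exception &) {
            return false; // std::stoi rejected the value
        }
    }
    return true;
}
```

Rejecting unknown mode names at parse time keeps the later printing logic free of string validation.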
Implement both dedupe methods inside examples/stream/stream.cpp printing logic:
Keep two histories:
a) last_printed_text (string, normalized/capped) used for char-mode and as fallback
b) last_printed_tokens (vector) used for token-mode
Character-level dedupe (existing approach): normalize the text (lowercase, collapse leading whitespace), then find the longest suffix of the history that equals a prefix of the new segment. Require at least min_char_overlap matching characters and prefer matches that end on a word boundary. When a match is found, trim the matched prefix from the segment and append the remainder to the history.
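The character-level step can be sketched as follows. The helper names (`normalize_text`, `longest_overlap`, `trim_duplicate_prefix`) are hypothetical, normalization is ASCII-only, and the sketch requires a word boundary rather than merely preferring one, which is a simplification of the description above.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <string>

// Lowercase the text and drop leading whitespace (ASCII only in this sketch).
static std::string normalize_text(const std::string & s) {
    std::size_t start = 0;
    while (start < s.size() && std::isspace(static_cast<unsigned char>(s[start]))) {
        ++start;
    }
    std::string out = s.substr(start);
    std::transform(out.begin(), out.end(), out.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return out;
}

// Length of the longest suffix of `history` that equals a prefix of `segment`
// and ends on a word boundary; 0 if no such match reaches min_overlap.
// Both inputs are expected to be normalized already.
static std::size_t longest_overlap(const std::string & history, const std::string & segment,
                                   std::size_t min_overlap) {
    const std::size_t max_len = std::min(history.size(), segment.size());
    for (std::size_t len = max_len; len >= min_overlap && len > 0; --len) {
        if (history.compare(history.size() - len, len, segment, 0, len) != 0) {
            continue;
        }
        // Boundary: the match covers the whole segment, or a non-alphanumeric
        // character sits on either side of the cut.
        const bool at_boundary =
            len == segment.size() ||
            !std::isalnum(static_cast<unsigned char>(segment[len])) ||
            !std::isalnum(static_cast<unsigned char>(segment[len - 1]));
        if (at_boundary) {
            return len;
        }
    }
    return 0;
}

// Returns `segment` with the duplicated prefix removed (original casing kept).
// `history_norm` must already be normalized.
static std::string trim_duplicate_prefix(const std::string & history_norm,
                                         const std::string & segment, std::size_t min_overlap) {
    const std::string seg_norm = normalize_text(segment);
    // ASCII lowercasing preserves length, so only leading whitespace shifts offsets.
    const std::size_t lead = segment.size() - seg_norm.size();
    const std::size_t n    = longest_overlap(history_norm, seg_norm, min_overlap);
    return n == 0 ? segment : segment.substr(lead + n);
}
```

With a history of "the quick brown fox" and a new segment " Brown fox jumps over", the nine-character overlap "brown fox" is trimmed and only " jumps over" remains to be printed.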
Token-level dedupe:
A) If API whisper_full_get_token_text(ctx, seg_idx, token_idx) is available in the repository, use it to compute the character length of the first token_count tokens and remove that prefix from the segment text.
B) Otherwise, fall back to character-level trimming: use character-level longest-overlap logic between last_printed_text and the segment text and trim by characters. This keeps behavior safe when token text extraction is not available.
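A sketch of path A, under the assumption that token_count is the number of leading tokens in the new segment that repeat the tail of last_printed_tokens. The helper names are hypothetical, `token_id` is a stand-in for the library's token type, and the per-token texts are passed as a plain vector in place of the strings that whisper_full_get_token_text would return. It also assumes the concatenated token texts reproduce the segment text, which may not hold when special tokens are present.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

using token_id = int32_t; // stand-in for the library's token type

// Number of leading tokens of `segment` that repeat the tail of `history`,
// or 0 if the longest such run is shorter than min_overlap.
static std::size_t token_overlap(const std::vector<token_id> & history,
                                 const std::vector<token_id> & segment,
                                 std::size_t min_overlap) {
    const std::size_t max_len = std::min(history.size(), segment.size());
    for (std::size_t len = max_len; len >= min_overlap && len > 0; --len) {
        const auto tail = history.end() - static_cast<std::ptrdiff_t>(len);
        if (std::equal(tail, history.end(), segment.begin())) {
            return len;
        }
    }
    return 0;
}

// Removes the text of the first `token_count` tokens from `segment_text`.
// token_texts[i] is the text of the i-th token of the segment.
static std::string trim_token_prefix(const std::string & segment_text,
                                     const std::vector<std::string> & token_texts,
                                     std::size_t token_count) {
    std::size_t n_chars = 0;
    for (std::size_t i = 0; i < token_count && i < token_texts.size(); ++i) {
        n_chars += token_texts[i].size();
    }
    // Clamp so a mismatch between token texts and segment text cannot throw.
    return segment_text.substr(std::min(n_chars, segment_text.size()));
}
```

If the summed token length does not line up with the segment text, path B (character-level trimming) is the safer choice.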
After printing the (possibly trimmed) segment output, append the printed token ids (only those that were actually printed) to last_printed_tokens and cap the token history to a reasonable value (e.g. 2048 tokens). Also cap last_printed_text to dedupe_history_chars.
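The history bookkeeping could look roughly like this. Both helper names are hypothetical, and the text cap counts bytes, so a multi-byte UTF-8 character at the cut point may be split; a real implementation might want to trim at a character boundary instead.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Appends the printed text and keeps only the newest `max_chars` bytes.
static void append_capped_text(std::string & history, const std::string & printed,
                               std::size_t max_chars) {
    history += printed;
    if (history.size() > max_chars) {
        history.erase(0, history.size() - max_chars);
    }
}

// Appends the printed token ids and keeps only the newest `max_tokens` entries.
static void append_capped_tokens(std::vector<int32_t> & history,
                                 const std::vector<int32_t> & printed,
                                 std::size_t max_tokens = 2048) {
    history.insert(history.end(), printed.begin(), printed.end());
    if (history.size() > max_tokens) {
        const auto excess = static_cast<std::ptrdiff_t>(history.size() - max_tokens);
        history.erase(history.begin(), history.begin() + excess);
    }
}
```

Only the tokens that survived trimming should be passed as `printed`, so the token history mirrors what the user actually saw.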
Default behavior: the default dedupe mode should be "token"; if token mode is unavailable at runtime, it should fall back to "char". Provide an option to disable dedupe via --dedupe none.
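One way the effective mode might be resolved at startup, with the fallback logged. The enum, the function name, and the `token_text_available` flag are hypothetical; how availability is actually detected is left open here, and the mode string is assumed to have been validated during argument parsing.

```cpp
#include <cstdio>
#include <string>

enum class dedupe_mode { none, chars, tokens };

// Maps the requested mode to the mode actually used.
// Falls back from token mode to char mode when token text is unavailable.
static dedupe_mode resolve_dedupe_mode(const std::string & requested, bool token_text_available) {
    if (requested == "none") {
        return dedupe_mode::none;
    }
    if (requested == "char") {
        return dedupe_mode::chars;
    }
    // "token" (the default) lands here.
    if (!token_text_available) {
        std::fprintf(stderr, "dedupe: token mode unavailable, falling back to char mode\n");
        return dedupe_mode::chars;
    }
    return dedupe_mode::tokens;
}
```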
Keep behavior for timestamps and output file writing consistent with trimming.
Keep the code self-contained to examples/stream/stream.cpp and avoid changing other files. Add helper functions inside stream.cpp.
Add clear comments and logging for what dedupe mode is used and when fallback happens.
Testing and validation
Deliverables
Notes / Assumptions
Please create a PR that implements this change. Make the patch detailed and include comments in the code explaining the dedupe strategies and the fallback behavior when token-level text extraction isn't available.
This pull request was created as a result of the following prompt from Copilot chat.