Support long-context and MTP prefix-cache hits by grimoire · Pull Request #4688 · InternLM/lmdeploy

grimoire · 2026-06-17T06:30:25Z

Summary

This PR is a follow-up to the prefix-cache refactor in #4618.

It enables prefix-cache reuse in two cases that were previously rolled back or disabled (long-context chunked prefill and Spec/MTP), together with the supporting changes they need:

- allow prefix-cache hits to resume long-context chunked prefill from the matched prefix instead of rolling back when the remaining suffix still needs chunking;
- enable prefix caching for Spec/MTP with one-block overlap recompute, so the target model recomputes the hidden-state bridge needed by the draft/MTP path;
- keep matched-but-recomputed overlap blocks private/writable during trie allocation, avoiding writes into shared cached KV blocks;
- handle SSM prefix-cache restore through exact ready checkpoints, including sparse checkpoint cases where the private recompute span may be larger than one block;
- add regression tests for scheduler rollback, chunk flags, cached-token accounting, MTP overlap matching/allocation, VLM boundary expansion, and SSM checkpoint restore.
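As an illustration of the first item, the sketch below splits the uncached suffix into prefill chunks that start at the matched prefix rather than at token 0. It is not code from this PR: `Chunk`, `plan_suffix_chunks`, and the `is_first`/`is_last` flags are hypothetical names, and the real scheduler and inputs maker may track chunk state differently.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    start: int      # first token index computed in this chunk
    end: int        # one past the last token index computed in this chunk
    is_first: bool  # this chunk opens a new chunk chain
    is_last: bool   # this chunk closes the chunk chain


def plan_suffix_chunks(num_tokens: int, num_matched: int, chunk_size: int) -> list[Chunk]:
    """Split the uncached suffix [num_matched, num_tokens) into prefill chunks.

    Hypothetical helper: tokens before num_matched are assumed to be served
    from the prefix cache, so the chain starts at the matched prefix.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= num_matched <= num_tokens:
        raise ValueError("num_matched must lie in [0, num_tokens]")

    chunks = []
    start = num_matched
    while start < num_tokens:
        end = min(start + chunk_size, num_tokens)
        chunks.append(
            Chunk(start, end, is_first=(start == num_matched), is_last=(end == num_tokens))
        )
        start = end
    return chunks


# 10000-token prompt, 4096 tokens matched in the cache, 2048-token chunks.
plan = plan_suffix_chunks(10000, 4096, 2048)
```

In this toy plan the first chunk begins at token 4096 and carries the "first" flag, which mirrors the idea that a prefix-resumed suffix starts a new chunk chain instead of continuing one "mid-chain".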

Copilot

Pull request overview

This PR extends the PyTorch prefix-cache implementation to cover two reuse scenarios that were previously disabled. First, long-context chunked prefill can resume from a prefix-cache hit. Second, prefix caching is enabled for Spec/MTP through a private overlap recompute window (one block, or checkpoint-to-raw-hit for SSM), which safely regenerates the hidden-state “bridge” data without writing into shared cached KV.
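One way to read the "checkpoint-to-raw-hit" window for SSM is sketched below: state can only be restored at a block count that has a ready checkpoint, so the private recompute span runs from that checkpoint up to the raw hit. This is an assumption-laden illustration, not the PR's matching logic; `pick_checkpoint`, `ready_checkpoints`, and `min_overlap_blocks` are hypothetical names.

```python
from bisect import bisect_right
from typing import Optional, Sequence, Tuple


def pick_checkpoint(
    ready_checkpoints: Sequence[int],
    raw_hit_blocks: int,
    min_overlap_blocks: int = 0,
) -> Optional[Tuple[int, int]]:
    """Return (restore_block, private_blocks), or None if no checkpoint qualifies.

    ready_checkpoints: sorted block counts at which a state snapshot is ready
    (hypothetical representation). The chosen checkpoint is the latest one that
    still leaves at least min_overlap_blocks to recompute before the raw hit.
    """
    limit = raw_hit_blocks - min_overlap_blocks
    idx = bisect_right(ready_checkpoints, limit)
    if idx == 0:
        return None  # no usable checkpoint at or before the limit
    restore = ready_checkpoints[idx - 1]
    return restore, raw_hit_blocks - restore


# Sparse checkpoints at 4, 8 and 16 blocks; the trie matched 14 blocks.
result = pick_checkpoint([4, 8, 16], raw_hit_blocks=14, min_overlap_blocks=1)
```

With sparse checkpoints the private span here is six blocks (8 to 14), which shows why the recompute window can be larger than one block in the SSM case.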

Changes:

- Allow accepted prefix-cache hits to proceed even when the remaining suffix still requires long-context chunking (removes the prior rollback condition).
- Add “overlap recompute” support to prefix caching via `match_recompute_blocks` and a private/writable allocation window (`private_recompute_*`) to prevent overwriting shared cached KV during recompute overlap.
- Enable prefix caching for speculative/MTP execution paths and add targeted regression tests for scheduler/trie behavior, chunk flags, and SSM checkpoint restore interactions.
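The overlap recompute idea can be sketched as follows: the last `match_recompute_blocks` matched blocks are not reused from the cache but replaced by freshly allocated, writable blocks, so recomputing them never touches shared KV. Only the name `match_recompute_blocks` comes from this PR; `split_match`, `build_block_table`, and the list-based free pool are hypothetical simplifications of what the block trie does.

```python
def split_match(matched_blocks: int, recompute_blocks: int) -> tuple[int, int]:
    """Return (shared, private) block counts for a prefix-cache hit."""
    if matched_blocks < 0 or recompute_blocks < 0:
        raise ValueError("block counts must be non-negative")
    # The private window cannot be larger than the match itself.
    private = min(recompute_blocks, matched_blocks)
    return matched_blocks - private, private


def build_block_table(
    cached_ids: list[int], recompute_blocks: int, free_ids: list[int]
) -> tuple[list[int], int]:
    """Build a sequence's block table from a hit (hypothetical toy allocator).

    Shared blocks are reused as-is and treated as read-only. The overlap
    window gets fresh blocks taken from free_ids, so its recompute writes
    land in private memory instead of the cached blocks.
    """
    shared, private = split_match(len(cached_ids), recompute_blocks)
    if private > len(free_ids):
        raise RuntimeError("not enough free blocks for the private window")
    table = list(cached_ids[:shared])
    table += [free_ids.pop() for _ in range(private)]
    return table, shared


# Three cached blocks matched; one-block overlap recompute.
free_pool = [98, 99]
table, num_shared = build_block_table([10, 11, 12], 1, free_pool)
```

Here cached block 12 is matched but does not appear in the sequence's table; a fresh block takes its place and is recomputed by the target model.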

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| tests/pytorch/paging/test_scheduler.py | Adds/updates scheduler regression tests for Spec overlap recompute, rollback cleanup, and long-context suffix acceptance after prefix hits. |
| tests/pytorch/paging/test_block_trie.py | Adds tests covering private overlap allocation, boundary cases, multimodal boundary expansion, and SSM checkpoint-to-raw-hit private spans. |
| tests/pytorch/engine/test_inputs_maker.py | Adds a regression test ensuring a prefix-resumed long-context suffix starts a new chunk chain (flags). |
| tests/pytorch/engine/test_executor_base.py | Updates expectation: prefix caching is kept enabled under spec decode. |
| lmdeploy/pytorch/strategies/ar_spec/sequence.py | Sets `match_recompute_blocks = 1` by default for AR-spec sequences to force one-block overlap recompute on hits. |
| lmdeploy/pytorch/paging/seq_states/states.py | Ensures private recompute window fields are cleared when freeing sequences. |
| lmdeploy/pytorch/paging/scheduler.py | Removes rollback condition that rejected prefix hits when long-context chunking would start “mid-chain”; clears private recompute fields on rollback. |
| lmdeploy/pytorch/paging/block_trie.py | Implements private overlap recompute window handling in match/allocation; extends SSM checkpoint matching to account for recompute overlap needs. |
| lmdeploy/pytorch/messages.py | Introduces typed aliases for multimodal extra-hash payloads and adds `match_recompute_blocks` + private recompute window fields to `PrefixCacheState`. |
| lmdeploy/pytorch/engine/executor/base.py | Removes the blanket disabling of prefix caching when speculative decoding is configured. |
| autotest/utils/run_restful_chat.py | Minor docstring punctuation normalization (“A–D” → “A-D”). |

grimoire added 5 commits June 16, 2026 18:19

- e01a174: allow prefix-cache hits to resume long-context chunks
- d5b0805: mtp
- 5421810: add test
- db99457: rename
- 078a62e: better readability

grimoire marked this pull request as ready for review June 23, 2026 08:06

Copilot AI review requested due to automatic review settings June 23, 2026 08:06

grimoire changed the title ~~[WIP] Support long-context and MTP prefix-cache hits~~ Support long-context and MTP prefix-cache hits Jun 23, 2026

Copilot started reviewing on behalf of grimoire June 23, 2026 08:06

Copilot AI reviewed Jun 23, 2026

grimoire wants to merge 5 commits into InternLM:main from grimoire:prefix-caching-part2
