[Roadmap]: H2 timelines and key focus areas #2486

@harryskim


Hi Dynamo developers! We'd like to share the following key timelines and focus areas for Dynamo in H2 2025.

📅 Timeline

We plan to ship 3 major releases (0.5 through 0.7) before reaching GA at the end of this year. Dynamo will continue to release on an approximately biweekly cadence, with the following target dates:

  • v0.4.1: 8/27
  • v0.5.0: 9/17
  • v0.5.1: 10/8
  • v0.6.0: 10/22
  • v0.6.1: 11/5
  • v0.7.0: 11/19

We will likely ship one more minor release (0.7.1) in December.

To improve our ability to track progress, we'll create a detailed roadmap for each major and minor release, such as 0.4.1 and 0.5.0. Each roadmap will include linked issues and pull requests, offering a clearer view of our development plans. This is in addition to the general H2 plan outlined here.

🎯 H2 focus areas

We will focus on making Dynamo work seamlessly across 5 key areas:

  1. Performance
  2. Fault tolerance
  3. K8s deployment
  4. KV cache management and transfer
  5. Scheduling with smart router and planner

These are our key focus areas, but we are also working on other features (e.g., Multi-LoRA and tool calling). As mentioned earlier, a more detailed roadmap for each release will be available in a separate issue.

Here is our progress in these 5 areas as of the v0.4.0 release:

  • Performance
    • Support for both aggregated and disaggregated serving with all inference engines (SGLang, TRT-LLM, and vLLM)
    • AIConfigurator, which estimates aggregated vs. disaggregated serving performance for a given model, configuration, and hardware, and recommends the best configurations
    • [Functional] R1 disaggregated wideEP support with SGLang and TRT-LLM
      • TRT-LLM wideEP E2E tested with EP16
      • SGLang wideEP decode perf tested and verified. Collaborating to improve prefill perf.
    • [Functional] Multimodal E/P/D disaggregation with TRT-LLM and vLLM.
  • Fault tolerance
    • Request migration
    • Instance fault detection and tracking
    • K8s health checks, base metrics for the request handler, and structured logging
  • K8s deployment
    • New component model: moving from the Dynamo SDK to direct interaction with the Dynamo Runtime, using Python commands and arguments for component deployments
    • CRD-based deployment: shifting from the dynamo CLI to a direct, single-CRD approach
    • Pre-built NGC images (SGLang, TRT-LLM, vLLM)
    • Deployment guides (EKS, AKS)
  • KV cache management and transfer
    • NIXL plugins and benchmarks
      • Networking plugins: UCX and Mooncake
      • Storage plugins: GDS, POSIX, S3, 3FS
      • GPU-initiated transfer plugins: DOCA GPUNetIO and UCX-initiated comms
      • NIXLBench and NIXLKVBench
    • KV Block Manager for vLLM spanning GPU memory, host memory, and local disk (PR expected on 8/18; to be included in v0.4.1 on 8/27)
  • Scheduling with smart routing and planning
    • Customizable smart router leveraging KV cache hit rate and KV cache load
    • SLA-based (TTFT and ITL) planner for vLLM that dynamically allocates prefill and decode workers
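The smart router listed above weighs KV cache hit rate against KV cache load when choosing a worker. The sketch below is a minimal, hypothetical illustration of that trade-off, not Dynamo's router implementation or API: `WorkerState`, `block_hashes`, `route`, and the `overlap_weight` parameter are names invented here, and the cost formula (weighted prefill blocks that miss the cache, plus blocks already active on the worker) is an assumption chosen for clarity.

```python
from dataclasses import dataclass


@dataclass
class WorkerState:
    """Hypothetical snapshot of one worker as seen by a router."""
    worker_id: str
    cached_blocks: set[int]  # hashes of KV blocks already resident on this worker
    active_blocks: int       # KV blocks currently in use (a proxy for load)


def block_hashes(tokens: list[int], block_size: int) -> list[int]:
    """Hash each full block of the prompt, chaining in the previous hash so
    that a block only matches when its entire prefix also matches."""
    hashes, prev = [], 0
    for start in range(0, len(tokens) - block_size + 1, block_size):
        prev = hash((prev, tuple(tokens[start:start + block_size])))
        hashes.append(prev)
    return hashes


def route(tokens: list[int], workers: list[WorkerState],
          block_size: int = 4, overlap_weight: float = 1.0):
    """Return the id of the worker with the lowest (assumed) cost:
    overlap_weight * blocks_to_prefill + active_blocks."""
    hashes = block_hashes(tokens, block_size)
    best_id, best_cost = None, float("inf")
    for w in workers:
        # Only the longest cached prefix is reusable: stop at the first miss.
        hits = 0
        for h in hashes:
            if h not in w.cached_blocks:
                break
            hits += 1
        blocks_to_prefill = len(hashes) - hits
        cost = overlap_weight * blocks_to_prefill + w.active_blocks
        if cost < best_cost:
            best_id, best_cost = w.worker_id, cost
    return best_id
```

With a 16-token prompt and `block_size=4`, a worker that caches the whole prefix but has 10 active blocks costs 10, while a worker with an empty cache and 2 active blocks costs 4 + 2 = 6, so the request goes to the less loaded worker. Raising `overlap_weight` (for example to 3, giving 10 vs. 14) shifts the balance toward cache reuse.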

Below is a list of the major features planned leading up to GA:

  • Performance
    • Benchmarks for the top 4 models across all inference engines
      • Performance validation and a 30-day stress test before GA
    • Disaggregated serving with pre-computed KV cache
    • Attention-FFN disaggregated serving (AFD)
  • Fault tolerance
    • Request cancellation & rejection
    • GPU and engine health monitoring
    • High availability and recovery
  • K8s deployment
    • Additional high-quality docs and examples (GKE, ECS, etc.)
    • Grove (+KAI scheduler) integration
    • Inference gateway integration
    • SGLang OME integration
  • KV cache management and transfer
    • NIXL
      • Elastic EP and wideEP fault tolerance
      • AWS Libfabric support
      • NIXL plugin support for top 10 storage vendors
      • Network optimizations for large scale inference (congestion, QoS, etc)
    • KVBM (KV Block Manager)
      • LMCache integration
      • Modular KV Block Manager
      • SGLang and TRT-LLM support
      • Custom eviction policies
  • Scheduling with smart routing and planning
    • Composable frontend (API server, Processor, Router)
    • Standalone router and planner
    • Hierarchical router
    • SLA-based planner for SGLang and TRT-LLM
    • AIConfigurator integration into the planner
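The SLA-based planner that appears in both lists above sizes the prefill and decode pools separately. As a rough, hypothetical sketch (not Dynamo's planner): assume each worker's sustainable throughput while still meeting the TTFT target (prefill) or the ITL target (decode) has been profiled offline, for instance with a tool like AIConfigurator; the worker counts then follow from the predicted token demand. `WorkerProfile`, `plan_workers`, `min_workers`, and all numbers here are illustrative assumptions.

```python
import math
from dataclasses import dataclass


@dataclass
class WorkerProfile:
    """Hypothetical per-worker throughput, profiled while meeting the SLA."""
    prefill_tokens_per_s: float  # input tokens/s one prefill worker sustains within the TTFT target
    decode_tokens_per_s: float   # output tokens/s one decode worker sustains within the ITL target


def plan_workers(request_rate: float, avg_input_len: float, avg_output_len: float,
                 profile: WorkerProfile, min_workers: int = 1) -> tuple[int, int]:
    """Return (num_prefill, num_decode) workers for the predicted load."""
    prefill_demand = request_rate * avg_input_len  # input tokens/s to prefill
    decode_demand = request_rate * avg_output_len  # output tokens/s to generate
    num_prefill = max(min_workers, math.ceil(prefill_demand / profile.prefill_tokens_per_s))
    num_decode = max(min_workers, math.ceil(decode_demand / profile.decode_tokens_per_s))
    return num_prefill, num_decode
```

For example, 10 requests/s with 2,000 input and 500 output tokens each, on workers profiled at 8,000 prefill tokens/s and 1,500 decode tokens/s, gives ceil(2.5) = 3 prefill workers and ceil(3.33) = 4 decode workers. A production planner would likely also need load prediction and some allowance for scaling delays, which this sketch leaves out.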

If there are additional features that should be considered or prioritized, please let us know in the comments. Thank you so much for your ongoing feedback; we will do our best to incorporate it into the GA release of Dynamo in December 🙏.
