5 Hidden Uses of KTransformers: Capabilities of the 17K-Star MoE Inference Framework That Never Made It Into the README

June 12, 2026

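Did you know? In mid-2026, deploying the 671B-parameter DeepSeek-R1 in production still takes eight H100s, roughly $200,000 of hardware. Yet an open-source project from Tsinghua University's MADSys lab was already running a 236B-parameter MoE model on a single workstation in 2024, and in February 2025 it reached a prefill speed of 286 tokens/s for the 671B DeepSeek-R1 on ordinary hardware. That project is kvcache-ai/ktransformers: as of 2026-06-12 it has 17,264 stars and 1,313 forks, and it is released under the Apache-2.0 license. The AI infrastructure narrative of 2026 is dominated by NVIDIA rack-scale systems and ever-growing GPU memory bills. KTransformers is a badly underrated counter-route: it lets you run frontier MoE models on a mix of consumer GPUs and CPU memory, and it ships five production-grade techniques that almost nobody talks about.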

Background: Why CPU/GPU Hybrid Inference Matters in 2026

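By 2026, Mixture-of-Experts (MoE) has become the default architecture for frontier open models. DeepSeek-V3/R1, Qwen3-235B-A22B, Kimi-K2.5, GLM-4.7, and the latest DeepSeek-V4-Flash are all MoE. Intuition says MoE inference still needs H100-class GPUs: each token activates only a few experts, so the active parameter count is small, but the total parameter count that has to sit in memory is huge (671B for DeepSeek-R1, over a trillion for Kimi-K2.5). A CPU-GPU hybrid scheme keeps the "cold" experts in CPU memory and the "hot" experts on the GPU. KTransformers turns that idea into a production-grade framework, and as of v0.6.2 (released 2026-05-03) it supports 9 different MoE models. The architecture was formally published in the 2026 ACM SIGOPS paper "KTransformers: Unleashing the Full Potential of CPU/GPU Hybrid Inference for MoE Models".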
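
The gap between active and total parameters comes from top-k routing: a small router scores every expert for each token, and only the best-scoring few are actually run. The sketch below is a minimal, framework-free illustration of that idea. The logits, the expert count, and the choice to normalize with a softmax over the selected experts are assumptions made for illustration, not details taken from any of the models named above.

```python
import math

def top_k_route(router_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    # Pick the k largest logits; ties resolve to the lower expert index.
    chosen = sorted(range(len(router_logits)),
                    key=lambda i: (-router_logits[i], i))[:k]
    # Softmax over the selected logits only, so the weights sum to 1.
    peak = max(router_logits[i] for i in chosen)
    exps = [math.exp(router_logits[i] - peak) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]

# Hypothetical layer: 8 experts, 2 active per token.
logits = [0.1, 2.0, -1.0, 0.5, 2.0, 0.0, -0.3, 1.2]
routes = top_k_route(logits, k=2)

# Only this fraction of the layer's experts does any work for this token.
active_fraction = len(routes) / len(logits)
print(routes, active_fraction)
```

Every expert that is not selected still has to be stored somewhere, which is exactly the memory that a hybrid scheme can move off the GPU.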

Hidden Use #1: Frequency-Aware CPU-GPU Expert Scheduling

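What most people do: Treat the GPU as a black box and try to cram the whole MoE model into VRAM. When the model is too big, they either buy more GPUs or switch to a smaller model.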

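The hidden trick: KTransformers exposes four explicit expert placement strategies through the --kt-expert-placement-strategy flag. The frequency strategy records expert activation statistics, then places the most frequently activated experts on the GPU and leaves the cold experts in CPU memory. Combined with --kt-enable-dynamic-expert-update, it can also reassign experts at runtime based on the number of prefill tokens.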

# Launch a server with frequency-aware placement
python -m sglang.launch_server \
    --model /path/to/qwen3-next-80b \
    --kt-num-gpu-experts 8 \
    --kt-expert-placement-strategy frequency \
    --init-expert-location /path/to/activation_stats.pt

# Add dynamic reassignment for long-context workloads
python -m sglang.launch_server \
    --model /path/to/qwen3-next-80b \
    --kt-num-gpu-experts 8 \
    --kt-expert-placement-strategy frequency \
    --init-expert-location /path/to/activation_stats.pt \
    --kt-enable-dynamic-expert-update \
    --kt-gpu-prefill-token-threshold 512
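
To make the idea behind the frequency strategy concrete, here is a minimal sketch of frequency-based placement: count how often each expert fires, then keep the top N on the GPU. It is an illustration only, not KTransformers' implementation. The function names and the activation log are hypothetical, and the real format of activation_stats.pt is not assumed here.

```python
from collections import Counter

def place_experts(activation_log, num_experts, num_gpu_experts):
    """Split expert ids into (gpu, cpu) sets by how often each one was activated."""
    counts = Counter(activation_log)
    # Most frequently activated first; ties resolve to the lower expert id.
    ranked = sorted(range(num_experts), key=lambda e: (-counts[e], e))
    return set(ranked[:num_gpu_experts]), set(ranked[num_gpu_experts:])

def gpu_hit_rate(activation_log, gpu_experts):
    """Fraction of activations that land on an expert already resident on the GPU."""
    hits = sum(1 for e in activation_log if e in gpu_experts)
    return hits / len(activation_log)

# Hypothetical trace of expert ids chosen by the router for one layer.
log = [3, 3, 3, 7, 7, 1, 3, 7, 0, 3]

gpu, cpu = place_experts(log, num_experts=8, num_gpu_experts=2)
frequency_rate = gpu_hit_rate(log, gpu)     # hot experts 3 and 7 are on the GPU
naive_rate = gpu_hit_rate(log, {0, 1})      # placement that ignores the statistics
print(sorted(gpu), frequency_rate, naive_rate)
```

With the same GPU budget of two experts, the statistics-driven placement serves far more activations from the GPU than a placement chosen without looking at the trace, which is the effect the flag is after.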

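Result: In the official benchmark on Qwen3-Next-80B-A3B-Instruct-FP8 with 4 RTX 4090s and an Intel Xeon Gold 6454S, at a 50% GPU expert ratio the frequency strategy reaches 76.19 tokens/s, and dynamic-expert-update pushes that to 81.17 tokens/s (against 65.25 tokens/s for the default uniform strategy). At an 80% GPU ratio, the frequency strategy climbs to 100.67 tokens/s.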
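
For a quick sense of scale, the relative gains can be worked out from the three throughput figures quoted above for the 50% GPU expert ratio. The helper name is ours; only the numbers come from the benchmark.

```python
def speedup(tokens_per_s, baseline):
    """Throughput ratio relative to the baseline strategy."""
    return tokens_per_s / baseline

UNIFORM = 65.25  # tokens/s, default uniform strategy at 50% GPU experts

results = {
    "frequency": 76.19,
    "frequency + dynamic update": 81.17,
}

for name, tps in results.items():
    ratio = speedup(tps, UNIFORM)
    print(f"{name}: {ratio:.2f}x ({(ratio - 1) * 100:+.1f}%)")
```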

Sources: kvcache-ai/ktransformers on GitHub, 17,264 stars, 1,313 forks, Apache-2.0 license, last commit 2026-06-07, v0.6.2 released 2026-05-03; benchmark table from doc/en/kt-kernel/experts-sched-Tutorial.md; HN "Show HN: KTransformers-236B Model and 1M Context LLM Inference", 20 points (posted 2024-08-29, 3 comments).

Hidden Use #2: Three-Tier (GPU-CPU-Disk) Prefix Cache Reuse

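What most people do: Rebuild the KV cache from scratch on every request. For long-context workloads (a 100K-token system prompt plus 50K tokens of conversation), every cold start costs several minutes.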

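The hidden trick: KTransformers' balance_serve engine implements a three-tier KV cache hierarchy. Hot prefixes live on the GPU, warm prefixes in CPU memory, and cold prefixes on disk. The attn.page_size and kvc2.cpu_memory_size_GB parameters control how the space is divided. Once enabled, repeated requests that share the same system prompt compute only the incremental part of the KV cache rather than the whole context.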

# ktransformers/configs/config.yaml
attn:
  page_size: 16            # size of one page in the KV cache
  chunk_size: 256
kvc2:
  gpu_only: false          # false enables the Disk + CPU + GPU three-tier store
  utilization_percentage: 1.0
  cpu_memory_size_GB: 500  # CPU memory allotted to the KV cache
  disk_path: /mnt/data/kvc # path on disk where the KV cache is stored

After editing the config, rebuild to enable prefix cache mode:

git submodule update --init --recursive
USE_BALANCE_SERVE=1 bash ./install.sh
# For dual-NUMA systems with 1TB+ of RAM:
USE_BALANCE_SERVE=1 USE_NUMA=1 bash ./install.sh

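Result: Once multi-turn agent workflows and RAG pipelines settle on a stable system prompt, the cached prefix can be reused across thousands of requests. The CPU-GPU-disk hierarchy means the total context you can serve far exceeds GPU memory capacity, with the disk tier acting as a transparent extension.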
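
The mechanism can be pictured as page-granular prefix matching across tiers. The sketch below is a toy model under stated assumptions, not the balance_serve implementation: the class, the hashing scheme, and the token ids are hypothetical, and only the page size of 16 and the GPU, CPU, disk ordering come from the text and config above.

```python
import hashlib

PAGE_SIZE = 16  # same value as attn.page_size in the config above

def page_keys(tokens, page_size=PAGE_SIZE):
    """One key per full page; each key commits to the whole prefix up to that page."""
    keys, running = [], hashlib.sha256()
    full = len(tokens) - len(tokens) % page_size
    for start in range(0, full, page_size):
        page = tokens[start:start + page_size]
        running.update(",".join(map(str, page)).encode() + b"|")
        keys.append(running.copy().hexdigest())
    return keys

class TieredPrefixCache:
    """Toy three-tier index: remembers which tier holds each cached page."""
    ORDER = ("gpu", "cpu", "disk")  # fastest tier is checked first

    def __init__(self):
        self.tiers = {name: set() for name in self.ORDER}

    def store(self, tokens, tier):
        self.tiers[tier].update(page_keys(tokens))

    def match(self, tokens):
        """Return (reusable_token_count, tier of each reusable page)."""
        found = []
        for key in page_keys(tokens):
            tier = next((n for n in self.ORDER if key in self.tiers[n]), None)
            if tier is None:
                break  # the first miss ends the shared prefix
            found.append(tier)
        return len(found) * PAGE_SIZE, found

cache = TieredPrefixCache()
system_prompt = list(range(40))          # hypothetical token ids
cache.store(system_prompt, "cpu")        # only the 2 full pages are cached

request = system_prompt + [900, 901, 902]
reused, tiers = cache.match(request)
to_compute = len(request) - reused
print(reused, tiers, to_compute)
```

In this toy run the second request recomputes only the tokens past the last cached page boundary, which is the "incremental KV cache" behavior described above in miniature.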

Sources: kvcache-ai/ktransformers on GitHub, 17,264 stars, Apache-2.0 license; config format from doc/en/prefix_cache.md; release notes for the v0.2.4 multi-concurrency architecture refactor from doc/en/balance-serve.md.

Hidden Use #3: AMX BF16/INT8 Acceleration (8x Faster Than AVX-512)

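What most people do: Run CPU matrix multiplication with the AVX-512 instructions that llama.cpp and most inference stacks use by default. On consumer CPUs, this caps MoE inference at 60-80 tokens/s.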

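The hidden trick: KTransformers v0.3+ ships built-in AMX (Intel Advanced Matrix Extensions) kernels for BF16 and INT8 quantization. AMX gives each CPU core 8 dedicated tile registers (tmm0-tmm7), each holding up to 16 rows x 64 bytes. A single TDPBF16PS instruction multiplies a 16 x 32 BF16 tile by a 32 x 16 BF16 tile, which is 16 x 16 x 32 = 8,192 multiply-accumulates. Completed in 16 CPU cycles, that works out to 512 multiply-accumulates per cycle per core, 8x the throughput of AVX-512 on the same silicon.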
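
The per-instruction work can be checked from the tile geometry alone. The arithmetic below is a back-of-the-envelope sketch under stated assumptions: the 16-cycle figure is taken from the text, and the AVX-512 baseline of two 512-bit BF16 dot-product instructions issued per cycle is our assumption about the comparison, not something the source spells out.

```python
TILE_ROWS = 16        # rows per tile register
TILE_ROW_BYTES = 64   # bytes per tile row
BF16_BYTES = 2        # a BF16 value is 2 bytes
ISSUE_CYCLES = 16     # cycles per tile multiply, as quoted in the text

# A 64-byte row holds 32 BF16 values, which is the shared (inner) dimension.
inner = TILE_ROW_BYTES // BF16_BYTES

# 16 x 16 outputs, each a dot product over 32 BF16 pairs.
macs_per_instruction = TILE_ROWS * TILE_ROWS * inner
amx_macs_per_cycle = macs_per_instruction // ISSUE_CYCLES

# Assumed AVX-512 baseline: 2 instructions per cycle, each covering
# one 512-bit register of BF16 values (32 multiply-accumulates).
avx512_macs_per_cycle = 2 * (64 // BF16_BYTES)

ratio = amx_macs_per_cycle / avx512_macs_per_cycle
print(macs_per_instruction, amx_macs_per_cycle, avx512_macs_per_cycle, ratio)
```

Under those assumptions the tile instruction does 8,192 multiply-accumulates, or 512 per cycle, and the ratio against the assumed AVX-512 baseline comes out at the 8x quoted in the heading.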

# Install with AMX support enabled
USE_BALANCE_SERVE=1 bash ./install.sh

# Run Qwen3MoE with the AMX backend
# (the values for --model_path and --gguf_path are blank in the source; supply your own paths)
python ktransformers/server/main.py \
    --architectures Qwen3MoeForCausalLM \
    --model_path  \
    --gguf_path  \
    --optimize_config_path ktransformers/optimize/optimize_rules/Qwen3Moe-serve-amx.yaml \
    --backend_type balance_serve

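Result: On a 4th Gen Xeon + RTX 4090 workstation, AMX-enabled KTransformers reaches 347 tokens/s prefill on Qwen3MoE-235B-A22B. On a consumer i9-14900KF with DDR5-4000, the same stack runs the smaller 30B-A3B model smoothly. KTransformers also offers an AVX2-only path (--kt-method), so the same MoE inference stack can move between Sapphire Rapids servers, EPYC workstations, and consumer desktops without changes.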

Sources: kvcache-ai/ktransformers on GitHub, 17,264 stars, Apache-2.0 license; AMX instruction details and the 347 tokens/s prefill benchmark from doc/en/AMX.md; Intel AMX intrinsic reference from the same document; HN "Show HN: KTransformers-671B DeepSeek-R1 on a Single Machine", 14 points (posted 2025-02-10).

Hidden Use #4: Multi-Concurrency balance_serve With Continuous Batching

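What most people do: Run one inference request at a time and treat the LLM like a batch script. Throughput is capped at whatever a single user can squeeze out.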

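The hidden trick: KTransformers v0.2.4 introduced balance_serve, a C++ engine inspired by SGLang with a three-layer architecture: the Server (handles OpenAI-compatible HTTP requests), the Inference Engine (runs chunked prefill), and the Scheduler (does continuous batching in FCFS order). Together with custom flashinfer kernels and CUDA Graphs with variable batch size, aggregate throughput on DeepSeek-R1 0528 rises 130% at 4-way concurrency. Intel engineers validated this on a Xeon6 + MRDIMM-8800 platform, where throughput went from 17 tokens/s for a single user to 40 tokens/s aggregate, and the bottleneck moved to the GPU side.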
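
Continuous batching in FCFS order can be illustrated with a small simulation: whenever a running request finishes, the next waiting request takes its slot at once instead of waiting for the whole batch to drain. This is a simplified sketch, not the balance_serve scheduler. It ignores prefill entirely, assumes every request generates at least one token, and uses hypothetical request names and lengths.

```python
from collections import deque

def continuous_batching(requests, max_batch):
    """Simulate decode steps; requests is a list of (name, tokens_to_generate)
    in arrival order. Returns {name: step at which the request finished}."""
    waiting = deque(requests)
    running = {}    # name -> tokens still to generate
    finished = {}
    step = 0
    while waiting or running:
        # FCFS admission: fill every free slot from the front of the queue.
        while waiting and len(running) < max_batch:
            name, remaining = waiting.popleft()
            running[name] = remaining
        step += 1
        # One decode step produces one token for each running request.
        for name in list(running):
            running[name] -= 1
            if running[name] == 0:
                finished[name] = step
                del running[name]
    return finished

reqs = [("a", 3), ("b", 1), ("c", 2)]   # hypothetical workload
batched = continuous_batching(reqs, max_batch=2)
serial = continuous_batching(reqs, max_batch=1)
print(batched, serial)
```

In the batched run, request c starts as soon as b finishes, so all three requests complete in 3 steps, against 6 steps when they are served one at a time.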

# Pull and run the v0.2.4+ multi-concurrency Docker image
docker pull approachingai/ktransformers:v0.2.4-AVX512
docker run -it --gpus all --privileged --shm-size 64g \
    --name ktrans --network=host -v /mnt:/mnt \
    approachingai/ktransformers:v0.2.4-AVX512 /bin/bash

# Open a second terminal and exec into the container
docker exec -it ktrans bash

# Start the multi-concurrency server
# (the values for --model_path and --gguf_path are blank in the source; supply your own paths)
python ktransformers/server/main.py \
    --architectures Qwen3MoeForCausalLM \
    --model_path  \
    --gguf_path  \
    --optimize_config_path ktransformers/optimize/optimize_rules/Qwen3Moe-serve.yaml \
    --backend_type balance_serve

# Fire several requests concurrently
for i in 1 2 3 4; do
    curl http://localhost:30000/v1/chat/completions \
        -H "Content-Type: application/json" \
        -d '{"model":"model-name","messages":[{"role":"user","content":"Hello!"}],"stream":true}' &
done
wait

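Result: A single KTransformers server can carry a whole team's interactive LLM workload. On the Xeon6 + MRDIMM-8800 testbed, the multi-concurrency path lifts aggregate output throughput from 17 tokens/s to 40 tokens/s, a ratio of 2.35x. The OpenAI-compatible /v1/chat/completions API means existing tools (LangChain, LlamaIndex, Cursor, Continue.dev) can plug in with zero modifications.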
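
For readers who would rather drive the endpoint from code than from curl, here is a standard-library sketch of a concurrent client. The URL and the "model-name" placeholder are taken from the curl example. Turning streaming off and assuming the server returns an OpenAI-style JSON body are our simplifications, and the helper names are hypothetical.

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

URL = "http://localhost:30000/v1/chat/completions"  # endpoint from the curl example

def build_request(prompt, model="model-name"):
    """Build (but do not send) a chat completion request."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # assumption: non-streaming keeps the client simple
    }
    return urllib.request.Request(
        URL,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def ask(prompt):
    """Send one request and return the decoded JSON response."""
    with urllib.request.urlopen(build_request(prompt), timeout=300) as resp:
        return json.load(resp)

def ask_concurrently(prompts, workers=4):
    """Issue the prompts in parallel so the server can batch them together."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(ask, prompts))

# Nothing is sent at import time; with a server running you could call:
#   replies = ask_concurrently(["Hello!"] * 4)
```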

Sources: kvcache-ai/ktransformers on GitHub, 17,264 stars, 1,313 forks, Apache-2.0 license; the 130% throughput gain and the 17 to 40 tokens/s benchmark from doc/en/balance-serve.md; v0.2.4 release notes from the same document; HN "Show HN: KTransformers-236B Model and 1M Context LLM Inference on Local Machines", 20 points (2024-08-29).

Hidden Use #5: MoE LoRA Fine-Tuning Through the LLaMA-Factory SFT Integration (6-12x Faster Than ZeRO-Offload)

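What most people do: Fine-tune MoE models with DeepSpeed's ZeRO-Offload. It works, but CPU offload makes training extremely painful: every optimizer step has to move hundreds of GB of gradients across the PCIe bus.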

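The hidden trick: KTransformers v0.6.1 provides the ktransformers[sft] extra, which integrates directly with LLaMA-Factory. The integration uses KT-Kernel's CPU-optimized INT8/INT4 quantization for optimizer states, plus smart sharding with FSDP2. On the benchmarked MoE SFT workloads it delivers a 6-12x training speedup over ZeRO-Offload while cutting CPU memory use roughly in half.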
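
Part of why LoRA fine-tuning is so much lighter than full fine-tuning is simple counting: a rank-r adapter on a d_out x d_in weight matrix trains two thin matrices instead of the full one. The sketch below works through that count. The layer dimensions and the rank are hypothetical and are not taken from the KTransformers configs.

```python
def full_params(d_in, d_out):
    """Trainable parameters when the whole weight matrix is updated."""
    return d_in * d_out

def lora_params(d_in, d_out, rank):
    """Trainable parameters for a LoRA adapter: A is rank x d_in, B is d_out x rank."""
    return rank * (d_in + d_out)

# Hypothetical projection layer and adapter rank.
d_in, d_out, rank = 4096, 4096, 16

full = full_params(d_in, d_out)
lora = lora_params(d_in, d_out, rank)
fraction = lora / full
print(full, lora, f"{fraction:.2%}")
```

Fewer trainable parameters means smaller gradients and optimizer states per layer, which is the part of the training state that offloading schemes otherwise have to shuttle between CPU and GPU.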

# Install the SFT stack
conda create -n kt-sft python=3.11 -y
conda activate kt-sft
pip install --extra-index-url https://download.pytorch.org/whl/cu130 \
    torch==2.9.1 torchvision==0.24.1 torchaudio==2.9.1

# Install LLaMA-Factory + KT SFT
cd /path/to/LLaMA-Factory
pip install -e .
pip install -r requirements/ktransformers.txt

# Launch MoE LoRA SFT with the FSDP2 + KT INT8 config
# (as written, CUDA_VISIBLE_DEVICES exposes four GPUs; narrow it for a single-card run)
CUDA_VISIBLE_DEVICES=0,1,2,3 accelerate launch \
    --config_file examples/ktransformers/accelerate/fsdp2_kt_int8.yaml \
    src/train.py \
    examples/ktransformers/train_lora/qwen3_5moe_lora_sft_kt.yaml

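Result: On DeepSeek-V3 and DeepSeek-R1, KT SFT runs at 3.7 it/s on 4 RTX 4090s with 80 GB of GPU memory. Qwen3-30B-A3B runs at 8+ it/s on a single RTX 4090 with 24 GB of GPU memory. This makes fine-tuning frontier MoE models practical on consumer-grade cards, a single card in the Qwen3-30B-A3B case, instead of requiring a cluster of eight H100s.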

Sources: kvcache-ai/ktransformers on GitHub, 17,264 stars, Apache-2.0 license; the 6-12x speedup and the 3.7 it/s and 8+ it/s benchmarks from doc/en/SFT/KTransformers-Fine-Tuning_Quick-Start.md and the SFT introduction in the main README; integration PR at hiyouga/LLaMA-Factory#10430; HN Show HN posts total 34 points (20 points on 2024-08-29 plus 14 points on 2025-02-10).

Summary

Five production-grade techniques that turn KTransformers from a research toy into a heavyweight piece of 2026 AI infrastructure:

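CPU-GPU expert scheduling with frequency-aware placement and dynamic reassignment (114.26 tokens/s at a 100% GPU ratio; 81.17 tokens/s with dynamic updates at a 50% ratio)
Three-tier prefix cache reuse across GPU, CPU memory, and disk (turns multi-minute cold starts into incremental updates)
AMX BF16/INT8 acceleration at 8x the throughput of AVX-512 (347 tokens/s prefill on a 4th Gen Xeon + RTX 4090 workstation)
Multi-concurrency balance_serve with continuous batching (130% higher throughput at 4-way concurrency; 17 to 40 tokens/s aggregate on Xeon6)
MoE LoRA fine-tuning through the LLaMA-Factory SFT integration (6-12x faster than ZeRO-Offload; DeepSeek-V3 at 3.7 it/s on 4 RTX 4090s)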

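If you have read the other articles in this series, the pattern will look familiar: "Agent Skills: 5 Hidden Uses in 49K Stars of Workflow Magic" applies the same "five production tricks hidden in a framework" format to engineering skills, "MemPalace: 5 Hidden Uses That Make It the Best-Benchmarked AI Memory System" dissects memory infrastructure at the same depth, and "Goose's 5 Hidden Uses That Turn It Into a Production AI Agent Stack" shows the same pattern at the agent orchestration layer.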

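What is the most underrated MoE inference optimization you have run into in 2026? Leave your throughput numbers, hardware setup, and model in the comments, and we will dig into them in follow-up articles.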
