Kimi K2.6 is out, more powerful than ever


In our previous post, we covered Kimi K2.5. Only two months later, Moonshot AI has released Kimi K2.6, the newest entry in the K2 series and, by the numbers on its own model card, the strongest open-weights agentic coding model available today. It is an open-source, natively multimodal MoE with 256K context, INT4-native quantization, and benchmark results that sit shoulder-to-shoulder with GPT-5.4 and Claude Opus 4.6 on several agentic and coding suites. This post walks through what is actually new, how K2.6 compares to K2.5 and the current frontier closed models, what it takes to run the model locally, and what the fine-tuning story looks like.

What is new in K2.6

Kimi K2.6 is a 1.1T-parameter Mixture-of-Experts model with 32B activated parameters per token. It uses 384 experts (8 selected per token plus 1 shared expert), MLA attention, SwiGLU activations, a 160K vocabulary, and 256K context. Multimodality is handled by a 400M-parameter MoonViT vision encoder, so image and video inputs are native rather than bolted on. The weights are released under a Modified MIT License on Hugging Face, and architecturally the model is a drop-in replacement for K2.5, which means existing vLLM, SGLang, and KTransformers pipelines can be reused with no structural changes.
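The sparsity arithmetic is worth pausing on: only 8 of the 384 routed experts (plus the one shared expert) fire per token, which is how a 1.1T-parameter model keeps activated parameters down to roughly 32B, about 3% of the total. A toy routing sketch, with invented names and purely illustrative logic rather than Moonshot's actual implementation:

```python
import random

# Toy sketch of K2.6-style sparse expert routing: 384 routed experts,
# top-8 selected per token, plus 1 shared expert that always fires.
# Illustrative only — this is not Moonshot's code.
NUM_EXPERTS = 384
TOP_K = 8

def route_token(router_logits: list[float]) -> list[int]:
    """Return the indices of the top-k experts for one token."""
    assert len(router_logits) == NUM_EXPERTS
    ranked = sorted(range(NUM_EXPERTS), key=lambda i: router_logits[i], reverse=True)
    return ranked[:TOP_K]

rng = random.Random(0)
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
active = route_token(logits)
print(len(active))  # 8 routed experts; the shared expert is always active too
```

Each token thus touches 9 expert FFNs out of 385, which is why activated compute stays tractable even though the full parameter set must remain resident.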

Moonshot frames the release around four capability pillars. Long-horizon coding is the headline: K2.6 reliably generalizes across languages (Rust, Go, Python) and task types (front-end, devops, performance optimization), and Moonshot cites an internal case where the model ran 4,000+ tool calls over 12 hours to port and optimize a Qwen inference stack in Zig, ultimately hitting roughly 20% higher throughput than LM Studio. A second case describes a 13-hour autonomous overhaul of an 8-year-old Java matching engine that produced a 185% median throughput gain. Coding-driven design extends the model from pure backend work into front-end generation, including aesthetic landing pages, animations, and lightweight full-stack apps with auth and database operations. Agent Swarm scales horizontally to 300 sub-agents running 4,000 coordinated steps, up from K2.5's 100 sub-agents and 1,500 steps. Finally, Claw Groups is a research preview that lets heterogeneous agents (any model, any device, any toolkit) and humans collaborate in a shared operational space, with K2.6 serving as the adaptive coordinator.

The model also supports two inference modes, Thinking and Instant, plus a preserve_thinking option that keeps full reasoning content across multi-turn interactions, similar in spirit to K2 Thinking's interleaved thinking and multi-step tool call flow.

K2.6 vs K2.5

The jump from K2.5 to K2.6 is most pronounced on agentic and long-horizon benchmarks, not raw math. On BrowseComp, K2.6 moves from 74.9 to 83.2, and with Agent Swarm enabled from 78.4 to 86.3. DeepSearchQA accuracy goes from 77.1 to 83.0. Toolathlon nearly doubles, from 27.8 to 50.0. APEX-Agents more than doubles, from 11.5 to 27.9. On coding, Terminal-Bench 2.0 climbs from 50.8 to 66.7, SWE-Bench Pro from 50.7 to 58.6, and SWE-Bench Verified from 76.8 to 80.2. Reasoning moves more modestly: AIME 2026 goes from 95.8 to 96.4, GPQA-Diamond from 87.6 to 90.5, HLE-Full from 30.1 to 34.7. Vision sees real gains too, with MathVision climbing from 84.2 to 87.4 and V* (with python) from 86.9 to 96.9.

Moonshot's own framing matches what several external beta testers report in the release post: the delta is less about raw IQ and more about stability, instruction following, and the ability to sustain multi-hour autonomous runs without drifting. Several quoted partners describe the improvement as being in the 12–18% range on their internal benchmarks, with one Vercel engineer citing "more than 50% improvement" on a Next.js-specific benchmark.

K2.6 vs Opus 4.6, Opus 4.7, and GPT-5.4

Moonshot's benchmark table compares K2.6 (thinking mode) against GPT-5.4 at xhigh reasoning effort, Claude Opus 4.6 at max effort, and Gemini 3.1 Pro at high thinking. Opus 4.7 does not appear in the comparison table directly, but it is referenced: several of the Opus 4.6 / GPT-5.4 DeepSearchQA numbers are cited from the Claude Opus 4.7 System Card, implying Opus 4.7 is in circulation but was not the reference point Moonshot chose to benchmark against.

Where K2.6 leads: agentic search and tool-use. It posts the top scores on HLE-Full with tools (54.0 vs 52.1 for GPT-5.4 and 53.0 for Opus 4.6), DeepSearchQA F1 (92.5 vs 78.6 vs 91.3), DeepSearchQA accuracy (83.0 vs 63.7 vs 80.6), and SWE-Bench Pro (58.6 vs 57.7 vs 53.4). Agent Swarm mode on BrowseComp hits 86.3, which is the highest number in the table on that row.

Where K2.6 trails: closed-form reasoning and some vision tasks. GPT-5.4 leads on AIME 2026 (99.2 vs 96.4), HMMT (97.7 vs 92.7), and IMO-AnswerBench (91.4 vs 86.0). Gemini 3.1 Pro leads on HLE-Full without tools (44.4), GPQA-Diamond (94.3), and several vision benchmarks. Opus 4.6 leads on Claw Eval pass^3 (70.4) and OJBench (70.7). APEX-Agents is also a clear weakness (27.9 vs 33.3 for GPT-5.4).

The practical read: for agentic coding, deep search, and long-horizon tool use, K2.6 is competitive with or ahead of the frontier closed models. For pure math olympiad-style reasoning and some vision-heavy evals, the frontier closed models still have an edge. Given that K2.6 is open-weight and Moonshot positions it as offering "SOTA-level performance at a fraction of the cost" (per quoted partner KiloClaw), the tradeoff is compelling for teams that can self-host.

How to leverage K2.6 in practice

A few concrete use cases where the K2.6 capability profile matters:

Long-running autonomous coding agents. The Moonshot-reported 12–13 hour autonomous runs with thousands of tool calls, combined with strong SWE-Bench Pro and Terminal-Bench 2.0 scores, suggest K2.6 is a good fit for agents that need to explore a codebase, make cross-file changes, and recover from failed attempts without human babysitting. The recommended agent framework is Kimi Code CLI (kimi.com/code), but the model also works with OpenCode, Augment Code, OpenClaw, Hermes Agent, and Ollama integrations based on partner quotes.

Deep research and multi-source synthesis. The DeepSearchQA and BrowseComp numbers, and the 256K context, make K2.6 attractive for research agents that need to pull from many documents and produce a structured synthesis. The Agent Swarm architecture (300 sub-agents, 4,000 steps) is the lever to pull when a single-threaded agent would time out or lose the thread.

Front-end and lightweight full-stack generation. Coding-driven design is positioned as a direct competitor to tools like v0 and Google AI Studio, with the added ability to generate visual assets via image/video tools and wire up auth plus a database for transactional workflows.

Document-to-skill conversion. K2.6 can ingest a high-quality PDF, slide deck, or spreadsheet and turn it into a reusable "Skill" that preserves structural and stylistic DNA. This is a useful primitive for teams that want to encode house style or domain-specific templates without manual prompt engineering.

Persistent background agents. Moonshot describes a K2.6-backed agent running autonomously for 5 days on internal RL infrastructure monitoring and incident response. Strong instruction following and API interpretation stability are the differentiators here.

For thinking mode, Moonshot recommends temperature = 1.0, top_p = 0.95. For instant mode, temperature = 0.6, top_p = 0.95. Thinking is enabled by default; disable it by passing {'chat_template_kwargs': {'thinking': False}} in extra_body on vLLM/SGLang, or the equivalent for the official API.
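Those recommendations can be packaged into a small helper for any OpenAI-compatible client (vLLM and SGLang both expose such an endpoint). A minimal sketch — the model name is taken from the release, but the helper itself and its name are my own illustration:

```python
# Request-shaping helper for K2.6's two inference modes, for use with any
# OpenAI-compatible client against a vLLM or SGLang deployment.
def kimi_request_kwargs(prompt: str, thinking: bool = True) -> dict:
    """Apply Moonshot's recommended sampling: thinking -> temperature 1.0,
    instant -> temperature 0.6; top_p 0.95 in both modes."""
    kwargs = {
        "model": "moonshotai/Kimi-K2.6",
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 1.0 if thinking else 0.6,
        "top_p": 0.95,
    }
    if not thinking:
        # Thinking is enabled by default; opt out explicitly via extra_body.
        kwargs["extra_body"] = {"chat_template_kwargs": {"thinking": False}}
    return kwargs

# Usage with an OpenAI-style client (endpoint URL is deployment-specific):
# client.chat.completions.create(**kimi_request_kwargs("Summarize this diff.", thinking=False))
```

Keeping the mode switch in one place avoids the easy mistake of lowering temperature for instant mode but forgetting to actually disable thinking, or vice versa.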

Running K2.6 locally

K2.6 is the same architecture as K2.5, so any deployment recipe for K2.5 works unchanged. Weights are 1.1T parameters total, released in BF16 with native INT4 quantization (the same scheme as Kimi-K2-Thinking). That INT4-native piece matters: the release is designed around compressed-tensors checkpoints, not a post-hoc GGUF conversion, so the quality/memory tradeoff is built into the release rather than applied by the community after the fact.

Moonshot officially supports three inference engines: vLLM, SGLang, and KTransformers.

vLLM. Moonshot provides a verified recipe for a single H200 node with 8-way tensor parallelism. The canonical command is vllm serve $MODEL_PATH -tp 8 --mm-encoder-tp-mode data --trust-remote-code --tool-call-parser kimi_k2 --reasoning-parser kimi_k2. vLLM 0.19.1 is the manually verified stable release; nightly wheels are available from https://wheels.vllm.ai/nightly but are considered experimental. Both parsers are required — omitting --reasoning-parser kimi_k2 will break thinking mode.
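Once the server is up with --tool-call-parser kimi_k2, tool definitions use the standard OpenAI function-calling schema, and the parser converts the model's tool-call tokens into structured tool_calls entries for your agent loop. A sketch with a hypothetical run_tests tool (the tool name and parameters are invented for illustration):

```python
import json

# A hypothetical tool in the standard OpenAI function-calling schema, as
# accepted by vLLM/SGLang once the kimi_k2 tool-call parser is enabled.
run_tests_tool = {
    "type": "function",
    "function": {
        "name": "run_tests",
        "description": "Run the project's test suite and return the summary line.",
        "parameters": {
            "type": "object",
            "properties": {
                "path": {"type": "string", "description": "Directory to test."},
            },
            "required": ["path"],
        },
    },
}

# Pass tools=[run_tests_tool] in chat.completions.create; the parser returns
# structured tool_calls that the agent executes and feeds back as tool messages.
payload_fragment = json.dumps({"tools": [run_tests_tool]})
```

The same schema works unchanged across both serving engines, which is part of why K2.6 can be dropped into existing agent frameworks without adapter code.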

SGLang. Supported in v0.5.10 and later stable releases (no nightly required). The install command is uv pip install "sglang>=0.5.10.post1" --prerelease=allow, and the serve command mirrors the vLLM one with TP8 on H200. The official SGLang cookbook entry lives at cookbook.sglang.io/autoregressive/Moonshotai/Kimi-K2.6.

KTransformers. This is the option for teams without 8xH200. KTransformers + SGLang gives CPU+GPU heterogeneous inference: Moonshot reports 640 tokens/sec prefill and 24.5 tokens/sec decode at 48-way concurrency on 8× NVIDIA L20 GPUs paired with 2× Intel 6454S CPUs. The launch flags include --kt-cpuinfer 96, --kt-num-gpu-experts 30, and --kt-method RAWINT4. Put differently: if you have a mid-range GPU cluster plus a fat CPU host, you can serve the full 1.1T model without buying H200s.

For truly local experimentation (single workstation, no H200), the practical path today is the community quantized builds showing up on Hugging Face — there are already 12 quantized variants listed on the model card and llama.cpp, LM Studio, Jan, and Ollama are all listed as compatible runtimes. Expect memory footprints in the hundreds of GBs even at INT4, since 32B activated parameters per token across 384 experts still means all experts need to be resident (or paged). A workstation with 1.5–2 TB of RAM plus a single high-end GPU for prefill, using KTransformers-style offloading, is the realistic floor for running the full model locally. Anything smaller and you are looking at heavy quantization and likely quality loss on long-horizon tasks.
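The "hundreds of GBs" claim follows from back-of-envelope arithmetic, under my simplifying assumption of 0.5 bytes per parameter at INT4 (ignoring quantization scales, KV cache, and activation memory):

```python
# Back-of-envelope weight footprint for a 1.1T-parameter model.
# Simplification: INT4 ~ 0.5 bytes/param, BF16 = 2 bytes/param; quantization
# scales, KV cache, and activations are ignored, so real usage is higher.
TOTAL_PARAMS = 1.1e12
GB = 1e9

int4_gb = TOTAL_PARAMS * 0.5 / GB   # ~550 GB
bf16_gb = TOTAL_PARAMS * 2.0 / GB   # ~2200 GB
print(f"INT4 weights: ~{int4_gb:.0f} GB, BF16 weights: ~{bf16_gb:.0f} GB")
```

At ~550 GB for INT4 weights alone, all 384 experts must sit in RAM or be paged, which is why a 1.5–2 TB host is the realistic floor rather than a luxury.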

The transformers version requirement is >=4.57.1, <5.0.0 if you want to load the weights directly.

Fine-tuning

Yes, K2.6 is fine-tunable, and Moonshot ships an official LoRA SFT path via KTransformers + LLaMA-Factory. The commands are straightforward:

USE_KT=1 llamafactory-cli train examples/train_lora/kimik2_lora_sft_kt.yaml
llamafactory-cli chat examples/inference/kimik2_lora_sft_kt.yaml
llamafactory-cli api examples/inference/kimik2_lora_sft_kt.yaml

Moonshot reports end-to-end LoRA SFT throughput of 44.55 tokens/sec on 2× NVIDIA 4090 + Intel 8488C with 1.97 TB RAM and 200 GB swap. That is genuinely striking: a consumer-GPU pair plus a beefy CPU host is enough to LoRA-tune a 1.1T-parameter MoE, because the KTransformers approach keeps most experts on CPU and moves only active ones to GPU. Full-parameter SFT is not officially documented for K2.6, but the K2.5 SFT Installation Guide (which applies identically given the shared architecture) is linked from the deployment docs and is the reference for more advanced fine-tuning setups. The Hugging Face model page already lists four fine-tuned derivatives of K2.6, so the community is clearly iterating.

For most teams, LoRA SFT with KTransformers is the pragmatic entry point: cheap hardware, official recipe, production-grade inference stack on the other end.

Bottom line

Kimi K2.6 is the clearest signal yet that open-weights models are not just catching up to the frontier but leading on specific, commercially important axes — long-horizon agentic coding, deep search, and tool use. It will not beat GPT-5.4 or Gemini 3.1 Pro on pure reasoning or vision, and APEX-Agents remains a weak spot. But for teams building autonomous coding agents, research agents, or 24/7 background agents, K2.6 offers a rare combination: frontier-grade agentic performance, 256K context, native multimodality, permissive licensing, and a realistic path to both self-hosted inference and fine-tuning on hardware that does not require a datacenter contract.

The weights are on Hugging Face at moonshotai/Kimi-K2.6. The API is live at platform.moonshot.ai. The agent framework is at kimi.com/code. If you have been waiting for the moment to bring an agent pipeline in-house, this is a reasonable one.