K2 Think

Smaller. Smarter. Open.

K2 Think is a 32B open‑weights reasoning system post‑trained with long chain‑of‑thought SFT, reinforcement learning with verifiable rewards, and agentic planning. It rivals much larger models on math, science, and coding while delivering up to ~2,000 tok/s on Cerebras WSE.

Open weights • Apache‑2.0 • MBZUAI Institute of Foundation Models × G42 • Built on Qwen2.5‑32B

Quickstart · Transformers
from transformers import pipeline

model_id = "LLM360/K2-Think"

# Load K2 Think with automatic dtype selection and device placement.
pipe = pipeline(
    "text-generation",
    model=model_id,
    torch_dtype="auto",
    device_map="auto",
)

messages = [
    {"role": "user", "content": "Solve: If x^2 - 5x + 6 = 0, find x."},
]

# Long chain-of-thought answers need a generous token budget.
out = pipe(messages, max_new_tokens=2048)

# The returned conversation ends with the assistant's reply.
print(out[0]["generated_text"][-1])
▶ Tip: the chat template is applied automatically when using pipeline().
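If you prefer explicit control over templating and decoding, the same call can be written with AutoTokenizer and AutoModelForCausalLM. This is a minimal sketch using standard Transformers APIs; the generation settings are illustrative, not prescribed by the model card.

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "LLM360/K2-Think"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

messages = [
    {"role": "user", "content": "Solve: If x^2 - 5x + 6 = 0, find x."},
]

# Apply the chat template by hand and append the assistant generation prompt.
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=2048)

# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))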

Model Size: 32.8B (Qwen2.5‑32B base, Apache‑2.0)
Throughput: ~2,000 tok/s (Cerebras WSE + speculative decoding)
Context: 32K tokens (optimized for long CoT traces)
License: Apache‑2.0 (open weights, usage‑friendly)

Benchmarks

Scores are pass@1, averaged over multiple runs, as reported in the model card and technical report. K2 Think targets competition‑level mathematics while maintaining strong science and coding ability.

AIME 2024: 90.83
AIME 2025: 81.24
HMMT 2025: 73.75
OMNI‑MATH‑HARD: 60.73
GPQA‑Diamond: 71.08
LiveCodeBench v5: 63.97
Sources: model card and technical report. Benchmark setups may vary; see the report for details.

How it works

1) Long CoT SFT

Supervised finetuning on curated long chain‑of‑thought traces teaches structured, step‑by‑step reasoning and stable long outputs.
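To make the data shape concrete, here is an illustrative long‑CoT SFT record; the field names and trace wording are assumptions for illustration, not the actual training schema.

# Illustrative long-CoT SFT record (field names are an assumption, not the
# actual K2 Think data schema): a prompt paired with a full step-by-step
# trace that ends in an explicit final answer.
sft_example = {
    "messages": [
        {"role": "user", "content": "If x^2 - 5x + 6 = 0, find x."},
        {
            "role": "assistant",
            "content": (
                "Step 1: Factor the quadratic: x^2 - 5x + 6 = (x - 2)(x - 3).\n"
                "Step 2: Set each factor to zero: x = 2 or x = 3.\n"
                "Final answer: x = 2 or x = 3."
            ),
        },
    ]
}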

2) RL with Verifiable Rewards

Optimizes directly for correctness on verifiable tasks (Math, Code, Science, Logic, Simulation, Tabular) using public datasets.
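For intuition, a verifiable reward for math can be as simple as exact agreement with a ground‑truth answer after numeric normalization. This is an illustrative sketch, not the reward code from the technical report.

from fractions import Fraction

def math_reward(model_answer: str, reference: str) -> float:
    """Illustrative verifiable reward: 1.0 if the model's final answer
    matches the reference after numeric normalization, else 0.0."""
    def normalize(s: str):
        s = s.strip().rstrip(".")
        try:
            return Fraction(s)   # compare numbers exactly, e.g. "0.5" == "1/2"
        except ValueError:
            return s.lower()     # fall back to case-insensitive string match
    return 1.0 if normalize(model_answer) == normalize(reference) else 0.0

assert math_reward("1/2", "0.5") == 1.0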

3) Agentic Planning

A lightweight planning stage structures the solution path before detailed reasoning for higher reliability.
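One way to picture this stage (an illustrative sketch, not K2 Think's exact prompts): a first pass asks only for a short plan, and a second pass solves the problem while conditioning on that plan. Here pipe is the Transformers pipeline from the quickstart.

def plan_then_solve(pipe, problem: str, max_new_tokens: int = 2048):
    # Pass 1: ask only for a short outline of the solution path.
    plan_msgs = [{"role": "user", "content":
                  f"Outline, in a few short steps, a plan to solve:\n{problem}"}]
    plan = pipe(plan_msgs, max_new_tokens=256)[0]["generated_text"][-1]["content"]

    # Pass 2: solve the problem while following the plan from pass 1.
    solve_msgs = [{"role": "user", "content":
                   f"Problem:\n{problem}\n\nFollow this plan while solving it:\n{plan}"}]
    return pipe(solve_msgs, max_new_tokens=max_new_tokens)[0]["generated_text"][-1]["content"]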

4) Test‑time Scaling

Best‑of‑N sampling boosts pass@k under fixed budgets; long‑trace friendly context (≈32K) preserves solution fidelity.
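Best‑of‑N is straightforward to sketch: sample N independent traces at nonzero temperature, score each with a verifier, and keep the best. The score_fn below is a placeholder for any checker (exact match, unit tests, a reward model); the sampling settings are illustrative.

def best_of_n(pipe, messages, score_fn, n: int = 8, max_new_tokens: int = 2048):
    """Sample n candidate traces and return the highest-scoring one.
    score_fn is a placeholder for any verifier (exact match, unit tests, ...)."""
    candidates = []
    for _ in range(n):
        out = pipe(messages, max_new_tokens=max_new_tokens,
                   do_sample=True, temperature=0.7)
        candidates.append(out[0]["generated_text"][-1]["content"])
    return max(candidates, key=score_fn)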

5) Speculative Decoding

Fast draft + verify decoding dramatically increases throughput without sacrificing quality.
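Transformers exposes speculative (assisted) decoding through the assistant_model argument to generate(). The draft model below is an illustrative choice assumed to share the Qwen2.5 tokenizer; it is not necessarily the draft used in the K2 Think inference stack.

from transformers import AutoModelForCausalLM, AutoTokenizer

target_id = "LLM360/K2-Think"
draft_id = "Qwen/Qwen2.5-0.5B-Instruct"   # illustrative small draft model (assumption)

tokenizer = AutoTokenizer.from_pretrained(target_id)
target = AutoModelForCausalLM.from_pretrained(target_id, torch_dtype="auto", device_map="auto")
draft = AutoModelForCausalLM.from_pretrained(draft_id, torch_dtype="auto", device_map="auto")

inputs = tokenizer("Explain speculative decoding in one paragraph.", return_tensors="pt").to(target.device)

# The draft proposes several tokens per step; the target verifies them in one pass,
# so output quality matches plain decoding while throughput improves.
outputs = target.generate(**inputs, assistant_model=draft, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))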

6) Inference‑Optimized Hardware

Deployment on Cerebras Wafer‑Scale Engine enables near‑instant long responses, often ~2,000 tok/s per request.

Docs & Resources

Model Card

Weights, license, quickstart, benchmark tables.

Technical Report (PDF)

Method, ablations, pass@k, throughput details.

Playground & API

Try K2 Think in the browser; request API access.

GitHub · K2‑Think‑SFT

SFT training code built on LLaMA‑Factory.

GitHub · K2‑Think‑Inference

High‑throughput inference with speculative decoding.

Where K2 Think shines

Competition‑level Math

Olympiad‑style problem solving with long CoT and pass@k strategies (AIME, HMMT, Omni‑MATH‑HARD).

Software Engineering

LiveCodeBench‑friendly coding, debugging, and tool use with agentic planning.

Scientific Reasoning

Strong GPQA‑Diamond results and verifiable domains like logic, simulation, tabular.


FAQ

Is K2 Think fully open‑source?

Yes — weights under Apache‑2.0 on Hugging Face. See the model card for license text and usage notes.

Can I fine‑tune it for my domain?

Yes. Use the SFT repo for instruction/CoT tuning and evaluate with pass@k on your own datasets (see the sketch below). Keep an eye on context length and, if you go beyond SFT, on the verifiable‑reward setup for RL.
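If you report pass@k, the standard unbiased estimator (n samples per problem, c of them correct) can be computed as follows; this is a generic sketch, not code from the K2 Think repositories.

from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n samples drawn, c of them correct.
    Returns the probability that at least one of k sampled attempts is correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 16 samples per problem, 6 correct -> estimated pass@4
print(round(pass_at_k(16, 6, 4), 3))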

Why focus on parameter‑efficiency?

Smaller models cut cost and latency. With the right post‑training + test‑time recipe, K2 Think competes with much larger systems.

How do I reach 2,000 tok/s?

Use the provided inference stack on Cerebras WSE with speculative decoding. Throughput depends on context length, batch size, and hardware; a simple way to measure your own deployment is sketched below.
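A rough single‑request measurement around the Transformers pipeline (not the Cerebras deployment) looks like this; pipe is the pipeline from the quickstart and the token budget is arbitrary.

import time

def measure_throughput(pipe, messages, max_new_tokens: int = 1024) -> float:
    """Rough single-request decode throughput in generated tokens per second."""
    start = time.perf_counter()
    out = pipe(messages, max_new_tokens=max_new_tokens)
    elapsed = time.perf_counter() - start
    reply = out[0]["generated_text"][-1]["content"]   # assistant reply only
    n_tokens = len(pipe.tokenizer.encode(reply))
    return n_tokens / elapsed

# Example: tok_per_s = measure_throughput(pipe, messages)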