K2 Think

Smaller. Smarter. Open.

K2 Think is a 32B open‑weights reasoning system post‑trained with long chain‑of‑thought SFT, reinforcement learning with verifiable rewards, and agentic planning. It rivals much larger models on math, science, and coding while delivering up to ~2,000 tok/s on Cerebras WSE.

Open weights • Apache‑2.0 • MBZUAI Institute of Foundation Models × G42 • Built on Qwen2.5‑32B

Quickstart · Transformers
from transformers import pipeline

model_id = "LLM360/K2-Think"

# Load K2 Think with automatic dtype selection and device placement.
pipe = pipeline(
    "text-generation",
    model=model_id,
    torch_dtype="auto",
    device_map="auto",
)

messages = [
    {"role": "user", "content": "Solve: If x^2 - 5x + 6 = 0, find x."},
]

# Long chain-of-thought answers need a generous token budget.
out = pipe(messages, max_new_tokens=2048)

# The returned conversation ends with the assistant's reply.
print(out[0]["generated_text"][-1])
▶ Tip: the chat template is applied automatically when using pipeline().
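If you prefer explicit control over templating and decoding, the same call can be written with AutoTokenizer and AutoModelForCausalLM. This is a minimal sketch using standard Transformers APIs; the generation settings are illustrative, not prescribed by the model card.

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "LLM360/K2-Think"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

messages = [
    {"role": "user", "content": "Solve: If x^2 - 5x + 6 = 0, find x."},
]

# Apply the chat template by hand and append the assistant generation prompt.
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=2048)

# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))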

Model Size: 32.8B (Qwen2.5‑32B base, Apache‑2.0)
Throughput: ~2,000 tok/s (Cerebras WSE + speculative decoding)
Context: 32K tokens (optimized for long CoT traces)
License: Apache‑2.0 (open weights, usage‑friendly)

Benchmarks

Scores are pass@1, averaged over multiple runs, as reported in the model card and technical report. K2 Think targets competition‑level mathematics while maintaining strong science and coding ability.

AIME 2024: 90.83
AIME 2025: 81.24
HMMT 2025: 73.75
OMNI‑MATH‑HARD: 60.73
GPQA‑Diamond: 71.08
LiveCodeBench v5: 63.97
Sources: model card and technical report. Benchmark setups may vary; see the report for details.

How it works

1) Long CoT SFT

Supervised finetuning on curated long chain‑of‑thought traces teaches structured, step‑by‑step reasoning and stable long outputs.
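To make the data shape concrete, here is an illustrative long‑CoT SFT record; the field names and trace wording are assumptions for illustration, not the actual training schema.

# Illustrative long-CoT SFT record (field names are an assumption, not the
# actual K2 Think data schema): a prompt paired with a full step-by-step
# trace that ends in an explicit final answer.
sft_example = {
    "messages": [
        {"role": "user", "content": "If x^2 - 5x + 6 = 0, find x."},
        {
            "role": "assistant",
            "content": (
                "Step 1: Factor the quadratic: x^2 - 5x + 6 = (x - 2)(x - 3).\n"
                "Step 2: Set each factor to zero: x = 2 or x = 3.\n"
                "Final answer: x = 2 or x = 3."
            ),
        },
    ]
}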

2) RL with Verifiable Rewards

Optimizes directly for correctness on verifiable tasks (Math, Code, Science, Logic, Simulation, Tabular) using public datasets.
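For intuition, a verifiable reward for math can be as simple as exact agreement with a ground‑truth answer after numeric normalization. This is an illustrative sketch, not the reward code from the technical report.

from fractions import Fraction

def math_reward(model_answer: str, reference: str) -> float:
    """Illustrative verifiable reward: 1.0 if the model's final answer
    matches the reference after numeric normalization, else 0.0."""
    def normalize(s: str):
        s = s.strip().rstrip(".")
        try:
            return Fraction(s)   # compare numbers exactly, e.g. "0.5" == "1/2"
        except ValueError:
            return s.lower()     # fall back to case-insensitive string match
    return 1.0 if normalize(model_answer) == normalize(reference) else 0.0

assert math_reward("1/2", "0.5") == 1.0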

3) Agentic Planning

A lightweight planning stage structures the solution path before detailed reasoning for higher reliability.
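One way to picture this stage (an illustrative sketch, not K2 Think's exact prompts): a first pass asks only for a short plan, and a second pass solves the problem while conditioning on that plan. Here pipe is the Transformers pipeline from the quickstart.

def plan_then_solve(pipe, problem: str, max_new_tokens: int = 2048):
    # Pass 1: ask only for a short outline of the solution path.
    plan_msgs = [{"role": "user", "content":
                  f"Outline, in a few short steps, a plan to solve:\n{problem}"}]
    plan = pipe(plan_msgs, max_new_tokens=256)[0]["generated_text"][-1]["content"]

    # Pass 2: solve the problem while following the plan from pass 1.
    solve_msgs = [{"role": "user", "content":
                   f"Problem:\n{problem}\n\nFollow this plan while solving it:\n{plan}"}]
    return pipe(solve_msgs, max_new_tokens=max_new_tokens)[0]["generated_text"][-1]["content"]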

4) Test‑time Scaling

Best‑of‑N sampling boosts pass@k under fixed budgets; long‑trace friendly context (≈32K) preserves solution fidelity.
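Best‑of‑N is straightforward to sketch: sample N independent traces at nonzero temperature, score each with a verifier, and keep the best. The score_fn below is a placeholder for any checker (exact match, unit tests, a reward model); the sampling settings are illustrative.

def best_of_n(pipe, messages, score_fn, n: int = 8, max_new_tokens: int = 2048):
    """Sample n candidate traces and return the highest-scoring one.
    score_fn is a placeholder for any verifier (exact match, unit tests, ...)."""
    candidates = []
    for _ in range(n):
        out = pipe(messages, max_new_tokens=max_new_tokens,
                   do_sample=True, temperature=0.7)
        candidates.append(out[0]["generated_text"][-1]["content"])
    return max(candidates, key=score_fn)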

5) Speculative Decoding

Fast draft + verify decoding dramatically increases throughput without sacrificing quality.
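Transformers exposes speculative (assisted) decoding through the assistant_model argument to generate(). The draft model below is an illustrative choice assumed to share the Qwen2.5 tokenizer; it is not necessarily the draft used in the K2 Think inference stack.

from transformers import AutoModelForCausalLM, AutoTokenizer

target_id = "LLM360/K2-Think"
draft_id = "Qwen/Qwen2.5-0.5B-Instruct"   # illustrative small draft model (assumption)

tokenizer = AutoTokenizer.from_pretrained(target_id)
target = AutoModelForCausalLM.from_pretrained(target_id, torch_dtype="auto", device_map="auto")
draft = AutoModelForCausalLM.from_pretrained(draft_id, torch_dtype="auto", device_map="auto")

inputs = tokenizer("Explain speculative decoding in one paragraph.", return_tensors="pt").to(target.device)

# The draft proposes several tokens per step; the target verifies them in one pass,
# so output quality matches plain decoding while throughput improves.
outputs = target.generate(**inputs, assistant_model=draft, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))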

6) Inference‑Optimized Hardware

Deployment on Cerebras Wafer‑Scale Engine enables near‑instant long responses, often ~2,000 tok/s per request.

Docs & Resources

Model Card

Weights, license, quickstart, benchmark tables.

Technical Report (PDF)

Method, ablations, pass@k, throughput details.

Playground & API

Try K2 Think in the browser; request API access.

GitHub · K2‑Think‑SFT

SFT training code built on LLaMA‑Factory.

GitHub · K2‑Think‑Inference

High‑throughput inference with speculative decoding.

Where K2 Think shines

Competition‑level Math

Olympiad‑style problem solving with long CoT and pass@k strategies (AIME, HMMT, Omni‑MATH‑HARD).

Software Engineering

LiveCodeBench‑friendly coding, debugging, and tool use with agentic planning.

Scientific Reasoning

Strong GPQA‑Diamond results and verifiable domains like logic, simulation, tabular.


FAQ

Is K2 Think fully open‑source?

Yes — weights under Apache‑2.0 on Hugging Face. See the model card for license text and usage notes.

Can I fine‑tune it for my domain?

Yes. Use the SFT repo for instruction/CoT tuning and evaluate with pass@k on your own datasets (see the sketch below). Keep an eye on context length and, if you go beyond SFT, on the verifiable‑reward setup for RL.
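If you report pass@k, the standard unbiased estimator (n samples per problem, c of them correct) can be computed as follows; this is a generic sketch, not code from the K2 Think repositories.

from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n samples drawn, c of them correct.
    Returns the probability that at least one of k sampled attempts is correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 16 samples per problem, 6 correct -> estimated pass@4
print(round(pass_at_k(16, 6, 4), 3))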

Why focus on parameter‑efficiency?

Smaller models cut cost and latency. With the right post‑training + test‑time recipe, K2 Think competes with much larger systems.

How do I reach 2,000 tok/s?

Use the provided inference stack on Cerebras WSE with speculative decoding. Throughput depends on context length, batch size, and hardware; a simple way to measure your own deployment is sketched below.
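A rough single‑request measurement around the Transformers pipeline (not the Cerebras deployment) looks like this; pipe is the pipeline from the quickstart and the token budget is arbitrary.

import time

def measure_throughput(pipe, messages, max_new_tokens: int = 1024) -> float:
    """Rough single-request decode throughput in generated tokens per second."""
    start = time.perf_counter()
    out = pipe(messages, max_new_tokens=max_new_tokens)
    elapsed = time.perf_counter() - start
    reply = out[0]["generated_text"][-1]["content"]   # assistant reply only
    n_tokens = len(pipe.tokenizer.encode(reply))
    return n_tokens / elapsed

# Example: tok_per_s = measure_throughput(pipe, messages)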