K2 Think is a 32B open-weights reasoning system post-trained with long chain-of-thought supervised fine-tuning, reinforcement learning with verifiable rewards, and agentic planning. It rivals much larger models on math, science, and coding while delivering up to ~2,000 tok/s on Cerebras WSE.
Open weights • Apache‑2.0 • MBZUAI Institute of Foundation Models × G42 • Built on Qwen2.5‑32B
from transformers import pipeline
import torch

model_id = "LLM360/K2-Think"

# Load the model with automatic dtype selection and device placement.
pipe = pipeline(
    "text-generation",
    model=model_id,
    torch_dtype="auto",
    device_map="auto",
)

messages = [
    {"role": "user", "content": "Solve: If x^2 - 5x + 6 = 0, find x."},
]

out = pipe(messages, max_new_tokens=2048)
# The pipeline returns the full conversation; print the generated assistant turn.
print(out[0]["generated_text"][-1])
The snippet above uses the Hugging Face Transformers pipeline() API. Reported pass@1 scores are averaged over multiple runs from the model card and technical report. K2 Think targets competition-level mathematics while maintaining strong science and coding ability.
Supervised finetuning on curated long chain‑of‑thought traces teaches structured, step‑by‑step reasoning and stable long outputs.
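To make the data format concrete, here is a minimal sketch of a single long-CoT training example rendered with the model's chat template; the prompt, the short reasoning trace, and the masking note are illustrative assumptions, not the actual K2 Think SFT data.

from transformers import AutoTokenizer

# Tokenizer shared with the quick-start example above.
tokenizer = AutoTokenizer.from_pretrained("LLM360/K2-Think")

# One illustrative prompt / long-CoT-response pair; real SFT data would be
# curated traces, typically far longer than this.
messages = [
    {"role": "user", "content": "If x^2 - 5x + 6 = 0, find x."},
    {"role": "assistant", "content": (
        "Factor the quadratic: x^2 - 5x + 6 = (x - 2)(x - 3). "
        "Setting each factor to zero gives x = 2 or x = 3.\n\n"
        "Answer: x = 2 or x = 3"
    )},
]

# Render the pair in the model's chat format; the resulting token IDs (with
# the prompt portion masked out of the loss) would feed a standard causal-LM
# fine-tuning loop.
input_ids = tokenizer.apply_chat_template(messages, tokenize=True)
print(f"{len(input_ids)} tokens in this training example")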
Reinforcement learning with verifiable rewards optimizes directly for correctness on verifiable tasks (Math, Code, Science, Logic, Simulation, Tabular) using public datasets.
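As a sketch of the idea (not the reward functions actually used in the report), a verifiable reward for a math-style task can be a binary exact-match check on the extracted final answer; the "Answer:" convention below is an assumption.

import re

def extract_final_answer(completion: str) -> str:
    # Assumed convention: the final answer follows the last "Answer:" marker;
    # otherwise fall back to the last non-empty line.
    matches = re.findall(r"Answer:\s*(.+)", completion)
    if matches:
        return matches[-1].strip()
    lines = [line for line in completion.strip().splitlines() if line.strip()]
    return lines[-1].strip() if lines else ""

def verifiable_reward(completion: str, reference: str) -> float:
    # Binary reward: 1.0 if the whitespace-normalized final answer matches the reference.
    predicted = " ".join(extract_final_answer(completion).split())
    return 1.0 if predicted == " ".join(reference.split()) else 0.0

sample = "Factoring gives (x - 2)(x - 3) = 0.\nAnswer: x = 2 or x = 3"
print(verifiable_reward(sample, "x = 2 or x = 3"))  # 1.0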
A lightweight agentic planning stage structures the solution path before detailed reasoning, improving reliability.
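A minimal way to emulate this pattern at inference time is a two-pass prompt: first request a short plan, then solve with the plan in context. The sketch reuses the pipe object from the quick start; the prompt wording is an assumption, not the planning harness described in the report.

# Reuses the `pipe` object from the quick-start snippet above.
question = "If x^2 - 5x + 6 = 0, find x."

# Pass 1: ask only for a brief plan (prompt wording is an assumption).
plan_messages = [{
    "role": "user",
    "content": f"Write a brief numbered plan (no solution yet) for: {question}",
}]
plan = pipe(plan_messages, max_new_tokens=256)[0]["generated_text"][-1]["content"]

# Pass 2: solve with the plan in context.
solve_messages = [{
    "role": "user",
    "content": f"Question: {question}\n\nPlan:\n{plan}\n\nFollow the plan and give the full worked solution.",
}]
answer = pipe(solve_messages, max_new_tokens=2048)[0]["generated_text"][-1]["content"]
print(answer)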
Best-of-N sampling boosts pass@k under fixed budgets; a long-trace-friendly context window (≈32K tokens) preserves solution fidelity.
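A minimal best-of-N sketch, reusing pipe from the quick start and assuming a crude self-consistency vote over final lines as the selection rule (the selection mechanism actually used may differ):

from collections import Counter

# Draw N sampled solutions for one problem.
N = 8
messages = [{"role": "user", "content": "If x^2 - 5x + 6 = 0, find x."}]

candidates = pipe(
    messages,
    max_new_tokens=2048,
    do_sample=True,
    temperature=0.7,
    num_return_sequences=N,
)

# Crude heuristic: treat the last non-empty line of each completion as its answer.
answers = []
for cand in candidates:
    text = cand["generated_text"][-1]["content"]
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    answers.append(lines[-1] if lines else "")

best, votes = Counter(answers).most_common(1)[0]
print(f"Selected answer ({votes}/{N} votes): {best}")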
Speculative decoding, in which a fast draft model proposes tokens that the full model verifies, substantially increases throughput without changing output quality.
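In Hugging Face Transformers this draft + verify pattern maps to assisted generation, sketched below. Using Qwen/Qwen2.5-0.5B-Instruct as the draft model is an assumption (any small model sharing the tokenizer could serve); the production Cerebras stack has its own speculative setup.

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("LLM360/K2-Think")
target = AutoModelForCausalLM.from_pretrained(
    "LLM360/K2-Think", torch_dtype="auto", device_map="auto"
)
# Assumed small same-tokenizer draft model.
draft = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-0.5B-Instruct", torch_dtype="auto", device_map="auto"
)

messages = [{"role": "user", "content": "If x^2 - 5x + 6 = 0, find x."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(target.device)

# The draft model proposes short token runs; the target model verifies them
# in a single forward pass, so outputs match target-only decoding.
out = target.generate(inputs, assistant_model=draft, max_new_tokens=512)
print(tokenizer.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True))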
Deployment on Cerebras Wafer‑Scale Engine enables near‑instant long responses, often ~2,000 tok/s per request.
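For hosted inference, a common pattern is an OpenAI-compatible chat client, as in the sketch below; the base URL, model name, and environment variable are placeholders, so consult the official K2 Think and Cerebras documentation for the real endpoint and authentication.

import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.example-k2think-host.ai/v1",  # placeholder endpoint
    api_key=os.environ["K2_THINK_API_KEY"],             # placeholder env var
)

resp = client.chat.completions.create(
    model="LLM360/K2-Think",
    messages=[{"role": "user", "content": "If x^2 - 5x + 6 = 0, find x."}],
    max_tokens=2048,
)
print(resp.choices[0].message.content)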
Olympiad‑style problem solving with long CoT and pass@k strategies (AIME, HMMT, Omni‑MATH‑HARD).
Competitive coding (LiveCodeBench-style), debugging, and tool use with agentic planning.
Strong GPQA-Diamond results in science, plus verifiable domains such as logic, simulation, and tabular reasoning.
Yes — weights under Apache‑2.0 on Hugging Face. See the model card for license text and usage notes.
Yes. Use the SFT repo for instruction/CoT tuning and evaluate with pass@k on your own datasets. Be mindful of context length and, if you continue RL training, of the reward setup.
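For the pass@k numbers, the standard unbiased estimator (Chen et al., 2021) can be computed per problem from n samples of which c pass your checker, as sketched below.

from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    # n = samples drawn per problem, c = correct samples, k = evaluation budget.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 16 samples per problem, 5 of them correct.
print(pass_at_k(16, 5, 1))            # 0.3125
print(round(pass_at_k(16, 5, 8), 3))  # ~0.987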
Smaller models cut cost and latency. With the right post‑training + test‑time recipe, K2 Think competes with much larger systems.
Use the provided inference stack on Cerebras WSE with speculative decoding. Throughput depends on context, batch size, and hardware.