Cactus Releases 26M Needle Model: Distilled Gemini Tool Calling for Budget Devices
Cactus has open-sourced Needle, a 26-million-parameter function-calling model derived from Google's Gemini architecture, targeting a significant gap in mobile and wearable AI deployment. The model achieves 6000 tokens per second on prefill and 1200 tokens per second on decode when running on consumer hardware—performance metrics that make on-device agentic experiences viable without cloud dependency.
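To get a feel for what those throughput numbers mean in practice, here is a back-of-the-envelope latency estimate. The 6000/1200 tokens-per-second figures come from the announcement; the prompt and output sizes are hypothetical assumptions for a typical tool-calling request:

```python
# Illustrative latency estimate using the throughput figures reported for
# Needle (6000 tok/s prefill, 1200 tok/s decode). The prompt and output
# token counts below are assumptions, not measured values.

PREFILL_TPS = 6000   # tokens/s processing the prompt (system + tool schemas + query)
DECODE_TPS = 1200    # tokens/s generating the structured JSON tool call

def tool_call_latency(prompt_tokens: int, output_tokens: int) -> float:
    """Rough end-to-end latency in seconds, ignoring per-token overheads."""
    return prompt_tokens / PREFILL_TPS + output_tokens / DECODE_TPS

# e.g. a 1500-token prompt (several tool schemas) and a 60-token JSON call:
latency = tool_call_latency(1500, 60)
print(f"{latency * 1000:.0f} ms")  # 0.25 s prefill + 0.05 s decode = 300 ms
```

At these rates, even a prompt stuffed with several tool schemas resolves in well under half a second, which is the property that makes on-device agents feel interactive.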
The engineering rationale behind Needle challenges prevailing assumptions about model scale. The Cactus team argues that tool calling is fundamentally a retrieval-and-assembly task—matching queries to tool names, extracting argument values, and emitting structured JSON—rather than a reasoning-intensive operation. This reframing suggests that massive parameter counts are unnecessary overhead for function-calling workloads. The resulting architecture relies entirely on attention mechanisms and gating, eliminating MLPs altogether. The model was pretrained on 200 billion tokens across 16 TPU v6e pods over 27 hours, followed by post-training on 2 billion tokens of synthesized function-calling data.
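The retrieval-and-assembly framing can be sketched in a few lines of Python. The toy tool registry and the keyword/regex matching below are illustrative stand-ins for what the model learns end-to-end; they are not Needle's actual mechanism, only a picture of why the task is closer to lookup than to open-ended reasoning:

```python
import json
import re

# Toy illustration of tool calling as retrieval-and-assembly:
# 1) retrieve: match the query to a tool name,
# 2) assemble: extract argument values,
# 3) emit: serialize a structured JSON call.
# Tool names, keywords, and patterns here are hypothetical.

TOOLS = {
    "get_weather": {"keywords": {"weather", "forecast"},
                    "args": {"city": r"in ([A-Z][a-z]+)"}},
    "set_timer":   {"keywords": {"timer", "remind"},
                    "args": {"minutes": r"(\d+) minute"}},
}

def call_tool(query: str) -> str:
    words = set(query.lower().split())
    for name, spec in TOOLS.items():
        if words & spec["keywords"]:                   # retrieval: pick a tool
            args = {}
            for arg, pattern in spec["args"].items():  # assembly: pull out values
                match = re.search(pattern, query)
                if match:
                    args[arg] = match.group(1)
            return json.dumps({"name": name, "arguments": args})
    return json.dumps({"name": None, "arguments": {}})

print(call_tool("What's the weather in Lagos today?"))
# → {"name": "get_weather", "arguments": {"city": "Lagos"}}
```

A small model only needs enough capacity to do this matching and extraction robustly across paraphrases, which is the crux of the Cactus argument for dropping from billions of parameters to 26 million.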
The strategic implications extend beyond technical curiosity. By targeting budget smartphones, smartwatches, and emerging wearable form factors like smart glasses, Cactus positions Needle as infrastructure for ambient computing environments where latency, privacy, and offline capability are paramount. The open-source release invites community scrutiny and adaptation, potentially accelerating the proliferation of lightweight agentic applications. Whether this approach can match the reliability of larger models in production environments remains to be seen, but the work signals that the era of treating parameter count as the primary proxy for capability may be ending.