May 13, 2026

Needle Distills Gemini Tool Calling into a 26M Parameter Model

Cactus Compute released Needle, a 26M parameter model that distills Gemini's tool-calling behavior into a compact, deployable artifact. The target is edge and on-device inference where full-scale models are not viable.

The team used knowledge distillation to compress Gemini's structured-output and function-calling capability into a model small enough to run on constrained hardware.

The practical implication: tool calling need not require a cloud round-trip to a frontier model. If a 26M model reliably emits well-formed function call payloads, it can run locally, inside a Docker container, on a microcontroller-class device, or embedded in a mobile app. Latency drops, per-call API costs for those steps disappear, and the inference loop stays on-device.

Distillation at this scale is a meaningful data point. The conventional assumption has been that reliable structured output, especially tool invocation with correct argument types and schema adherence, requires a model with billions of parameters. Needle challenges that assumption directly. If the distilled behavior holds across diverse tool schemas, the approach becomes a template for compressing other narrow LLM capabilities into sub-100M models.

For solo founders and small teams, the cost profile changes. Running a hosted frontier model for every tool call in an agentic loop is expensive at scale. A local 26M model handling routing and dispatch, with larger models reserved for reasoning-heavy steps, is a more defensible architecture for production systems that need to stay within budget.
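The split described above can be sketched as a small router. This is a minimal illustration, not Needle's actual interface: `Step`, `make_router`, and both stub models are hypothetical names, and each model is assumed to be a plain callable that takes a prompt and returns text.

```python
from dataclasses import dataclass
from typing import Callable

# A model is assumed to be any callable mapping a prompt to text output.
Model = Callable[[str], str]


@dataclass
class Step:
    kind: str    # "tool_call" or "reasoning" (hypothetical labels)
    prompt: str


def make_router(local_model: Model, remote_model: Model) -> Callable[[Step], str]:
    """Send tool-call steps to the small local model, everything else to the large one."""
    def route(step: Step) -> str:
        if step.kind == "tool_call":
            return local_model(step.prompt)
        return remote_model(step.prompt)
    return route


# Stand-ins for a local 26M model and a hosted frontier model.
def stub_local(prompt: str) -> str:
    return '{"name": "get_weather", "arguments": {"city": "Oslo"}}'


def stub_remote(prompt: str) -> str:
    return "plan: look up the weather, then summarize"


route = make_router(stub_local, stub_remote)
tool_output = route(Step("tool_call", "What is the weather in Oslo?"))
plan_output = route(Step("reasoning", "Plan a trip to Oslo."))
```

In a real agent loop, how a step gets labeled as a tool call is its own design decision; the sketch assumes that label is already known.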

The release is open-source on GitHub. The team has published the model weights and the distillation approach, which means the methodology is reproducible and forkable. Engineers building agents, copilots, or function-routing layers now have a concrete starting point for on-device tool orchestration.

The open question is generalization. Distillation from a single teacher model can produce brittle behavior outside the training distribution. Teams adopting Needle should benchmark it against their specific tool schemas before committing it to a production path.
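One way to run that benchmark is to measure how often the model's raw output parses as JSON, names a known tool, and supplies correctly typed arguments. The sketch below makes several assumptions: the `{"name": ..., "arguments": ...}` payload shape, the simplified schema format in `TOOLS` (not full JSON Schema), and the function names are all illustrative, and the model is any callable from prompt to text.

```python
import json
from typing import Callable

# Simplified type names mapped to Python types (an assumption, not full JSON Schema).
JSON_TYPES = {"string": str, "integer": int, "number": (int, float), "boolean": bool}

# Hypothetical tool schema used only for illustration.
TOOLS = {
    "get_weather": {
        "properties": {"city": "string", "days": "integer"},
        "required": ["city"],
    }
}


def is_valid_call(raw: str, tools: dict) -> bool:
    """True if raw is JSON naming a known tool with correctly typed arguments."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if not isinstance(call, dict):
        return False
    name, args = call.get("name"), call.get("arguments")
    if not isinstance(name, str) or name not in tools or not isinstance(args, dict):
        return False
    spec = tools[name]
    if any(key not in args for key in spec["required"]):
        return False
    for key, value in args.items():
        expected = spec["properties"].get(key)
        if expected is None:
            return False  # argument not declared in the schema
        if isinstance(value, bool) and expected != "boolean":
            return False  # bool is a subclass of int in Python; reject it explicitly
        if not isinstance(value, JSON_TYPES[expected]):
            return False
    return True


def pass_rate(model: Callable[[str], str], cases: list, tools: dict) -> float:
    """Fraction of (prompt, expected_tool) cases with a valid call to the expected tool."""
    if not cases:
        return 0.0
    passed = 0
    for prompt, expected_tool in cases:
        raw = model(prompt)
        if is_valid_call(raw, tools) and json.loads(raw)["name"] == expected_tool:
            passed += 1
    return passed / len(cases)
```

Running `pass_rate` over prompts drawn from real traffic, rather than hand-written examples, gives a better read on out-of-distribution behavior.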

Source

news.ycombinator.com
