SGR vs Function Calling

👁 The Problem with Function Calling on Local Models (ReAct Agents)

Reality Check: Function Calling works great on OpenAI/Anthropic models (80+ BFCL scores) but fails dramatically on local models under 32B parameters in true ReAct agents with tool_mode="auto", where the model itself decides when to call tools.

BFCL Benchmark Results for Qwen3 Models:

  • Qwen3-8B (FC): Only 15% accuracy in Agentic Web Search mode
  • Qwen3-4B (FC): Only 2% accuracy in Agentic Web Search mode
  • Qwen3-1.7B (FC): Only 4.5% accuracy in Agentic Web Search mode
  • Even with native FC support, smaller models struggle with deciding WHEN to call tools
  • Common result: {"tool_calls": null, "content": "Text instead of tool call"}

Note: Our team is currently working on creating a specialized benchmark for SGR vs ReAct performance on smaller models. Initial testing confirms that the SGR pipeline enables even smaller models to follow complex task workflows.
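
For reference, the failing setup looks roughly like the sketch below. This assumes an OpenAI-compatible local server; the base URL, model name, and web_search tool are illustrative, and tool_choice="auto" is the OpenAI-API spelling of the tool_mode="auto" behavior described above:

```python
from openai import OpenAI

# Illustrative endpoint for a local OpenAI-compatible server (vLLM, Ollama, etc.)
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

tools = [{
    "type": "function",
    "function": {
        "name": "web_search",
        "description": "Search the web for current information.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}]

response = client.chat.completions.create(
    model="qwen3-4b",  # illustrative small local model
    messages=[{"role": "user", "content": "What do BMW X6 models cost today?"}],
    tools=tools,
    tool_choice="auto",  # the model alone decides whether to call the tool
)

message = response.choices[0].message
# On small models this frequently comes back as tool_calls=None plus free text,
# instead of the web_search call the task actually requires:
print(message.tool_calls, message.content)
```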

👁 SGR Solution: Forced Reasoning → Deterministic Execution

# Phase 1: structured reasoning -- constrained decoding guarantees schema-valid JSON
reasoning = model.generate(format="json_schema")
# {"action": "search", "query": "BMW X6 prices", "reason": "need current data"}

# Phase 2: deterministic execution -- plain code, no model uncertainty
result = execute_plan(reasoning.actions)
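
For concreteness, the two phases might look like this against an OpenAI-compatible local server that supports json_schema response formats. This is a minimal sketch, not a reference implementation; the endpoint, model name, schema, and web_search helper are all illustrative:

```python
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

PLAN_SCHEMA = {
    "name": "next_action",
    "schema": {
        "type": "object",
        "properties": {
            "action": {"type": "string", "enum": ["search", "answer"]},
            "query": {"type": "string"},
            "reason": {"type": "string"},
        },
        "required": ["action", "query", "reason"],
        "additionalProperties": False,
    },
}

def web_search(query: str) -> str:
    """Placeholder for a real search tool."""
    return f"(search results for: {query})"

# Phase 1: the schema makes a decision mandatory -- the model cannot
# answer in free text instead of producing a plan.
plan = json.loads(client.chat.completions.create(
    model="qwen3-8b",
    messages=[{"role": "user", "content": "Find current BMW X6 prices"}],
    response_format={"type": "json_schema", "json_schema": PLAN_SCHEMA},
).choices[0].message.content)

# Phase 2: plain Python dispatch -- execution never depends on whether
# the model "felt like" calling a tool.
if plan["action"] == "search":
    result = web_search(plan["query"])
```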

👁 Architecture by Model Size

Model Size | Recommended Approach | FC Accuracy | Why Choose This
-----------|----------------------|-------------|----------------
<14B | Pure SGR + Structured Output | 15-25% | FC practically unusable
14-32B | SGR + FC hybrid | 45-65% | Best of both worlds
32B+ | Native FC with SGR fallback | 85%+ | FC works reliably
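
The table reduces to a simple dispatch rule. A hypothetical sketch, with thresholds taken from the table above and names of our own choosing:

```python
from enum import Enum

class Approach(Enum):
    PURE_SGR = "pure SGR + structured output"
    SGR_FC_HYBRID = "SGR + FC hybrid"
    NATIVE_FC = "native FC with SGR fallback"

def choose_approach(params_billion: float) -> Approach:
    """Pick an execution strategy from model size, per the table above."""
    if params_billion < 14:
        return Approach.PURE_SGR
    if params_billion < 32:
        return Approach.SGR_FC_HYBRID
    return Approach.NATIVE_FC
```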

👁 When to Use SGR vs Function Calling

Use Case | Best Approach | Why
---------|---------------|----
Data analysis & structuring | SGR | Controlled reasoning with visibility
Document processing | SGR | Step-by-step analysis with justification
Local models (<32B) | SGR | Forces reasoning regardless of model limitations
Multi-agent systems | Function Calling | Native agent interruption support
External API interactions | Function Calling | Direct tool access pattern
Production monitoring | SGR | All reasoning steps visible and loggable

👁 Real-World Results

Initial Testing Results:

  • SGR enables even small models to follow structured workflows
  • SGR pipeline provides deterministic execution regardless of model size
  • SGR forces reasoning steps that ReAct leaves to model discretion

Planned Benchmarking:

  • We're developing a comprehensive benchmark comparing SGR vs ReAct across model sizes
  • Initial testing shows promising results for SGR on models as small as 4B parameters
  • Full metrics and performance comparison coming soon

👁 Hybrid Approach: The Best of Both Worlds

The optimal solution for many production systems is a hybrid approach:

  1. SGR for decision making - Determine which tools to use
  2. Function Calling for execution - Get data and provide agent-like experience
  3. SGR for final processing - Structure and format results

This hybrid approach works particularly well for models in the 14-32B range, where Function Calling works some of the time but is not reliable enough on its own; a sketch follows below.
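
A minimal sketch of the three steps, under the same OpenAI-compatible-server assumption as the examples above; the model name, schema, and web_search tool are illustrative:

```python
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
MODEL = "qwen3-14b"  # illustrative mid-size model where FC is only partly reliable

def web_search(query: str) -> str:
    """Placeholder for a real search tool."""
    return f"(search results for: {query})"

TOOLS = [{
    "type": "function",
    "function": {
        "name": "web_search",
        "description": "Search the web for current information.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}]

DECISION_SCHEMA = {
    "name": "tool_decision",
    "schema": {
        "type": "object",
        "properties": {
            "tool": {"type": "string", "enum": ["web_search", "none"]},
            "reason": {"type": "string"},
        },
        "required": ["tool", "reason"],
        "additionalProperties": False,
    },
}

def hybrid_answer(question: str) -> str:
    # 1. SGR for decision making: schema-constrained output forces a choice.
    decision = json.loads(client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": f"Decide which tool to use for: {question}"}],
        response_format={"type": "json_schema", "json_schema": DECISION_SCHEMA},
    ).choices[0].message.content)

    # 2. Function Calling for execution: the tool is already chosen, so we
    #    force it and the model only has to fill in the arguments.
    observation = ""
    if decision["tool"] == "web_search":
        call = client.chat.completions.create(
            model=MODEL,
            messages=[{"role": "user", "content": question}],
            tools=TOOLS,
            tool_choice={"type": "function", "function": {"name": "web_search"}},
        ).choices[0].message.tool_calls[0]
        observation = web_search(**json.loads(call.function.arguments))

    # 3. SGR for final processing: structure the answer from the raw results.
    return client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user",
                   "content": f"Question: {question}\nObservations: {observation}\n"
                              "Answer as JSON with keys 'answer' and 'sources'."}],
        response_format={"type": "json_object"},
    ).choices[0].message.content
```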

Bottom Line: Don't force <32B models to pretend they're GPT-4o in ReAct-style agentic workflows with tool_mode="auto". Let them think structurally through SGR, then execute deterministically.