Program-of-Thought: How Small Models Outperform Large Models

November 30, 2025

Chain-of-Thought (CoT) prompting revolutionized how language models solve complex problems by having them “think step by step.” But there’s a fundamental flaw: language models are terrible at arithmetic.

Enter Program-of-Thought (PoT) prompting — a technique that achieves 12-15% better performance than Chain-of-Thought by doing something surprisingly simple: letting models write code instead of doing math.

The Core Insight

The breakthrough paper “Program of Thoughts Prompting: Disentangling Computation from Reasoning” reveals a critical insight: reasoning and computation are different skills.

Language models excel at:

  • Understanding problems
  • Planning solution strategies
  • Expressing logic in code

Language models struggle with:

  • Multi-digit arithmetic
  • Maintaining precision across calculation steps
  • Avoiding compounding errors

PoT separates these concerns: models do the reasoning and write Python code, while a deterministic runtime (like unsandbox) executes the calculations perfectly.

Chain-of-Thought vs Program-of-Thought

Chain-of-Thought (CoT)

Question: A store had 20 apples. They sold 8, received 15 more,
then sold 12. How many apples remain?

CoT Response:
Let me think step by step:
1. Start with 20 apples
2. After selling 8: 20 - 8 = 12 apples
3. After receiving 15: 12 + 15 = 27 apples
4. After selling 12: 27 - 12 = 15 apples

Answer: 15 apples

Problem: The model must perform arithmetic and reasoning. Small errors compound.

Program-of-Thought (PoT)

# Question: A store had 20 apples. They sold 8, received 15 more,
# then sold 12. How many apples remain?

def solve():
    apples = 20
    apples -= 8  # Sold 8
    apples += 15  # Received 15
    apples -= 12  # Sold 12
    return apples

result = solve()
print(f"Remaining apples: {result}")

Execution (via unsandbox):

Remaining apples: 15

The model focuses purely on understanding and translating to code. The calculation is delegated to Python.
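
For concreteness, here is a minimal sketch of that delegation step: POSTing the generated program to unsandbox’s /execute endpoint. The payload shape and the "stdout" response field follow the examples later in this post; the key is a placeholder.

import requests

UNSANDBOX_KEY = "unsb-sk-xxxx-xxxx-xxxx-xxxx"  # placeholder key

# The model-generated program from above, as a string
code = '''
def solve():
    apples = 20
    apples -= 8   # Sold 8
    apples += 15  # Received 15
    apples -= 12  # Sold 12
    return apples

print(f"Remaining apples: {solve()}")
'''

# Hand the code to unsandbox for deterministic execution
response = requests.post(
    "https://api.unsandbox.com/execute",
    headers={"Authorization": f"Bearer {UNSANDBOX_KEY}"},
    json={"language": "python", "code": code},
)
print(response.json()["stdout"])  # Remaining apples: 15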

Why This Makes Small Models Punch Above Their Weight

Here’s where it gets exciting for practical deployments:

1. Quantized Models Become Viable

A 4-bit quantized Llama 3.1 8B model running locally can now outperform GPT-4 on math problems — not because it’s better at reasoning, but because it doesn’t need to be good at arithmetic.

# Even a heavily quantized model can write this correctly:
def compound_interest(principal, rate, years):
    return principal * (1 + rate) ** years

# unsandbox executes it with perfect precision
print(compound_interest(1000, 0.05, 10))  # ~1628.89

2. Faster Inference, Lower Costs

PoT responses are typically shorter than CoT responses:

  • CoT: ~500 tokens (showing all calculation steps)
  • PoT: ~150 tokens (just the code)

Cost savings: 70% fewer output tokens × cheaper small models = 10-20x cost reduction

3. Deterministic, Auditable Results

Unlike CoT where the model might calculate 127 × 43 differently each time, PoT produces:

result = 127 * 43  # Always 5461, every time
print(result)

This is critical for financial, scientific, and healthcare applications where reproducibility matters.
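
A quick way to check this property end to end is to run the identical snippet twice and compare outputs — a small sketch using the same /execute endpoint as the rest of this post (placeholder key; response fields assumed as in the later examples):

import requests

HEADERS = {"Authorization": "Bearer unsb-sk-xxxx-xxxx-xxxx-xxxx"}  # placeholder key
payload = {"language": "python", "code": "print(127 * 43)"}

# Execute the identical code twice; a deterministic runtime must agree with itself
runs = [
    requests.post("https://api.unsandbox.com/execute",
                  headers=HEADERS, json=payload).json()["stdout"]
    for _ in range(2)
]
assert runs[0] == runs[1]  # both runs print 5461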

Real-World Example: Financial Analysis

Task: Calculate the internal rate of return (IRR) for a series of cash flows.

CoT Approach (GPT-4):

Let's calculate IRR using trial and error...
Try r = 0.10: NPV = -$234.52
Try r = 0.12: NPV = $45.23
Try r = 0.115: NPV = -$12.34
...
[Model struggles with iterative numerical methods]

Result: Incorrect or “I cannot calculate this precisely”

PoT Approach (Llama 3.1 8B + unsandbox):

import numpy_financial as npf

def calculate_irr(cash_flows):
    """
    Calculate internal rate of return for cash flows
    cash_flows: list of cash flows, first element is initial investment (negative)
    """
    return npf.irr(cash_flows)

# Cash flows: -$1000 investment, then $300, $400, $500 returns
cash_flows = [-1000, 300, 400, 500]
irr = calculate_irr(cash_flows)

print(f"IRR: {irr:.2%}")

Result: IRR: 8.90% (mathematically correct, every time)

How unsandbox Enables PoT at Scale

unsandbox is purpose-built for Program-of-Thought workflows:

1. Zero-Trust Execution

# User's code runs in isolated container
# No access to filesystem, network, or other processes
# Automatic resource limits prevent runaway calculations

2. 42+ Language Support

Not all models are best at Python. Some excel at:

  • Julia for numerical computing
  • R for statistical analysis
  • JavaScript for JSON manipulation
  • Rust for performance-critical calculations

For example, a request targeting Julia:

curl https://api.unsandbox.com/execute \
  -H "Authorization: Bearer unsb-sk-xxxx-xxxx-xxxx-xxxx" \
  -d '{
    "language": "julia",
    "code": "# Your Julia code here"
  }'

3. Sub-Second Execution

Average PoT workflow latency:
- Model generates code: ~800ms
- unsandbox executes code: ~150ms
- Total: <1 second

Compare to CoT:
- Model generates reasoning: ~2000ms
- Still might be wrong
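
Your numbers will vary by model and hardware; here is a rough sketch for timing the two stages yourself (model endpoint and payload as in the implementation section below, placeholder keys):

import time
import requests
from openai import OpenAI

client = OpenAI(api_key="YOUR_KEY", base_url="https://api.openai.com/v1")

t0 = time.perf_counter()
response = client.chat.completions.create(
    model="gpt-4.1-nano",
    messages=[{"role": "user",
               "content": "Write Python code that prints 127 * 43. Output ONLY code."}],
)
code = response.choices[0].message.content
t1 = time.perf_counter()

requests.post(
    "https://api.unsandbox.com/execute",
    headers={"Authorization": "Bearer unsb-sk-xxxx"},  # placeholder key
    json={"language": "python", "code": code, "timeout": 5},
)
t2 = time.perf_counter()

print(f"generate: {(t1 - t0) * 1000:.0f}ms  execute: {(t2 - t1) * 1000:.0f}ms")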

4. Self-Consistency Decoding

Run the same PoT prompt 5 times, execute all code samples, return the most common result:

# Example: Self-consistency with voting
# Run multiple code samples and pick the most common result

results = []

# Sample 1
def solve1(): return 42 * 1.15
results.append(solve1())

# Sample 2
def solve2(): return 42 * 1.15
results.append(solve2())

# Sample 3 would fail: def solve3(): return 42 × 1.15  # Syntax error

# Sample 4
def solve4(): return 42 * 1.15
results.append(solve4())

# Sample 5 (model error)
def solve5(): return 42 * 1.51
results.append(solve5())

# Find consensus (most common result)
from collections import Counter
consensus = Counter(results).most_common(1)[0][0]
print(f"Consensus result: {consensus:.1f}")
# Output: Consensus result: 48.3

Result: Even with occasional model errors, consensus voting + code execution yields correct answers.
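
As a reusable loop, the pattern looks like this — a sketch that samples the model several times at non-zero temperature, executes every sample, and votes on the printed outputs (client and key setup as in the implementation section below):

from collections import Counter
import requests

def self_consistent_solve(problem, client, model, n_samples=5):
    """Sample n candidate programs, execute each, return the majority stdout."""
    outputs = []
    for _ in range(n_samples):
        response = client.chat.completions.create(
            model=model,
            temperature=0.8,  # encourage diverse samples
            messages=[{"role": "user",
                       "content": f"Solve with Python code. Output ONLY code.\n\n{problem}"}],
        )
        result = requests.post(
            "https://api.unsandbox.com/execute",
            headers={"Authorization": "Bearer unsb-sk-xxxx"},  # placeholder key
            json={"language": "python",
                  "code": response.choices[0].message.content,
                  "timeout": 5},
        ).json()
        if result.get("stdout"):  # samples that crash simply don't get a vote
            outputs.append(result["stdout"].strip())
    return Counter(outputs).most_common(1)[0][0] if outputs else None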

Implementation Patterns

Here’s a complete PoT workflow that works with any model + unsandbox:

Model Selection Guide

API Models (hosted, pay-per-token):

  • GPT-4.1 Nano: $0.10/1M input, $0.40/1M output - Cheapest option ever
  • GPT-4.1 Mini: $0.40/1M input, $1.60/1M output - Best balance, 1M context
  • GPT-4o Mini: $0.15/1M input, $0.60/1M output - Legacy OpenAI

Local Models (self-hosted, zero API costs):

  • Qwen 3 Coder 30B: Best accuracy, needs RTX 4090/3090 (24GB VRAM)
  • Hermes-3-Llama-3.1-8B: Excellent instruction following, RTX 4090/3090

Universal Implementation

The pattern is identical for all models — only the model endpoint changes:

import requests
from openai import OpenAI  # For API models, or use requests for local

# ===== CONFIGURATION: Choose your model =====
# All models use OpenAI-compatible /v1 endpoints - just change base_url!

# Option 1: OpenAI API models
BASE_URL = "https://api.openai.com/v1"
MODEL_NAME = "gpt-4.1-nano"  # or "gpt-4.1-mini" or "gpt-4o-mini"
API_KEY = "YOUR_OPENAI_KEY"

# Option 2: Free uncloseai.com - Hermes 8B (tested, works!)
# BASE_URL = "https://hermes.ai.unturf.com/v1"
# MODEL_NAME = "adamo1139/Hermes-3-Llama-3.1-8B-FP8-Dynamic"
# API_KEY = "not-needed"

# Option 3: Free uncloseai.com - Qwen Coder 30B (tested, works!)
# BASE_URL = "https://qwen.ai.unturf.com/v1"
# MODEL_NAME = "hf.co/unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF:Q4_K_M"
# API_KEY = "not-needed"

# Option 4: Local Ollama (port 11434)
# BASE_URL = "http://localhost:11434/v1"
# MODEL_NAME = "qwen3-coder:30b-q4"
# API_KEY = "ollama"

# Option 5: Local vLLM (port 18888)
# BASE_URL = "http://localhost:18888/v1"
# MODEL_NAME = "adamo1139/Hermes-3-Llama-3.1-8B-FP8-Dynamic"
# API_KEY = "token-abc123"

# unsandbox API key (get free key at unsandbox.com)
UNSANDBOX_KEY = "unsb-sk-xxxx-xxxx-xxxx-xxxx"

# ===== STEP 1: Generate Code =====
# Same code works for ALL models - just change base_url above!
problem = """Solve this problem by writing Python code.
Output ONLY the code, no explanations.

Problem: A company's revenue grew 15% annually for 3 years,
starting from $1.2M. What's the revenue in year 3?"""

client = OpenAI(api_key=API_KEY, base_url=BASE_URL)
response = client.chat.completions.create(
    model=MODEL_NAME,
    max_tokens=2048,
    messages=[{"role": "user", "content": problem}]
)
code = response.choices[0].message.content

# ===== STEP 2: Execute via unsandbox =====
exec_response = requests.post(
    "https://api.unsandbox.com/execute",
    headers={"Authorization": f"Bearer {UNSANDBOX_KEY}"},
    json={"language": "python", "code": code, "timeout": 5}
)

result = exec_response.json()
print(result["stdout"])  # $1,825,050.00

Per-Model Cost Analysis

Model                             Code Generation   unsandbox    Total/Query   Speed
GPT-4.1 Nano                      ~$0.0003          $0.000023    $0.000323     ~3-4s
GPT-4.1 Mini                      ~$0.0011          $0.000023    $0.001123     ~3-4s
GPT-4o Mini                       ~$0.0002          $0.000023    $0.000223     ~5s
Qwen 3 Coder 30B (local)          $0                $0.000023    $0.000023     ~1-2s
Hermes-3 8B (local)               $0                $0.000023    $0.000023     ~1-2s

Note: unsandbox costs $0.000023 per request for all plans (Dev, Production, Business) - you pay more for higher rate limits, not per-request pricing.

unsandbox pricing tiers (per-request cost at full utilization):

Tier          Monthly Cost   Requests/Month   Cost/Request
Development   $7             302,400          $0.000023
Production    $91            3,931,200        $0.000023
Business      $175           7,560,000        $0.000023

Key insight: Local models eliminate API costs entirely — you only pay for unsandbox execution ($0.000023/query).
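
The per-request figures are just the tier price divided by the monthly quota — easy to verify:

# Per-request cost = monthly price / monthly request quota (tiers from the table above)
tiers = {
    "Development": (7, 302_400),
    "Production": (91, 3_931_200),
    "Business": (175, 7_560_000),
}
for name, (price, quota) in tiers.items():
    print(f"{name}: ${price / quota:.6f}/request")  # all three: ~$0.000023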

Benchmarks: Small Models vs Large Models

Research from the PoT paper shows dramatic improvements:

Dataset                  GPT-3 CoT   GPT-3 PoT   Improvement
GSM8K (math)             68.2%       78.5%       +10.3%
SVAMP (word problems)    74.1%       85.3%       +11.2%
FinQA (financial)        52.3%       67.8%       +15.5%
ConvFinQA                48.6%       63.2%       +14.6%

Key insight: The improvement is larger on harder problems where multi-step arithmetic compounds errors.

Why PoT Works with Modern Models (2025)

The original research used GPT-3 era models, but the technique is even more powerful today:

  • Smaller local models (Qwen 3 Coder 30B, Hermes-3-Llama-3.1-8B) can generate correct Python code
  • Code execution via unsandbox eliminates arithmetic errors completely
  • Cost dramatically lower: Local models + unsandbox (~$0.000023/query) vs GPT-4 API ($0.02-0.05/query)
  • Privacy preserved: Code generation happens locally, only execution goes to unsandbox

The key insight: you don’t need a massive model to write correct code; you just need it to understand the problem and express the logic. The actual computation is handled by the Python runtime, which executes the arithmetic deterministically, every time.

Speed: Smaller Quantized Models Are Faster

Here’s the counterintuitive part: quantized models give you correct answers faster than large API models.

Latency Comparison (Same Problem)

Model                              Generation Time   Execution Time   Total     Result
GPT-4 API (CoT)                    ~3,500ms          N/A              3,500ms   ❌ Sometimes wrong
GPT-4 API (PoT)                    ~2,000ms          ~150ms           2,150ms   ✅ Always correct
Qwen 3 Coder 30B Q4 (Local PoT)    ~800ms            ~150ms           950ms     ✅ Always correct
Hermes-3 8B FP8 (Local PoT)        ~400ms            ~150ms           550ms     ✅ Always correct

Why quantized models are faster:

  1. Smaller memory footprint → faster token generation (Q4/FP8 weights move far less data through GPU memory per token)
  2. Shorter outputs → Code is more concise than step-by-step arithmetic explanations
  3. No API latency → Local inference eliminates network round trips
  4. Parallel execution → unsandbox can execute multiple code snippets simultaneously (see the sketch after this section)

Real-world impact:

Processing 100 math problems:

GPT-4 CoT:       100 × 3,500ms = 350 seconds (5.8 minutes)
Hermes-3 FP8 PoT: 100 × 550ms = 55 seconds (< 1 minute)

You get correct answers 6× faster at 1/200th the cost using a quantized model on a consumer GPU.
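
Batch throughput like this leans on point 4 above (parallel execution). Here is a sketch that fans generated snippets out concurrently, assuming your plan’s rate limit allows it (endpoint and payload as in the earlier examples):

from concurrent.futures import ThreadPoolExecutor
import requests

HEADERS = {"Authorization": "Bearer unsb-sk-xxxx"}  # placeholder key

def execute(code):
    """Run one snippet via unsandbox and return its stdout."""
    return requests.post(
        "https://api.unsandbox.com/execute",
        headers=HEADERS,
        json={"language": "python", "code": code, "timeout": 5},
    ).json().get("stdout")

# 100 model-generated programs (trivial stand-ins here)
snippets = [f"print({i} ** 2)" for i in range(100)]

# Up to 10 requests in flight at once; size the pool to your plan's rate limit
with ThreadPoolExecutor(max_workers=10) as pool:
    results = list(pool.map(execute, snippets))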

Beyond Math: Other PoT Applications

1. Data Analysis

# Model generates pandas code for business intelligence queries
import pandas as pd

df = pd.read_csv('sales_data.csv', parse_dates=['date'])  # parse dates so .dt works
monthly_revenue = df.groupby(df['date'].dt.month)['revenue'].sum()
growth_rate = monthly_revenue.pct_change().mean()

print(f"Average monthly growth: {growth_rate:.2%}")

2. Scientific Computing

# Physics simulations the model can't do in its "head"
import numpy as np
from scipy.integrate import odeint

def projectile_motion(state, t, g=9.81):
    x, vx, y, vy = state
    return [vx, 0, vy, -g]

# Solve trajectory
solution = odeint(projectile_motion, [0, 10, 0, 15], np.linspace(0, 3, 100))
max_height = solution[:, 2].max()
print(f"Max height: {max_height:.2f} meters")

3. String Manipulation

# Complex regex/parsing that models hallucinate
import re

def extract_emails(text):
    pattern = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'
    return re.findall(pattern, text)

text = "Contact us at hello@example.com or support@test.org for help"
emails = extract_emails(text)
print(f"Found {len(emails)} emails: {emails}")
# Output: Found 2 emails: ['hello@example.com', 'support@test.org']

The Economics of PoT

For a company processing 1M queries/month:

Traditional CoT with GPT-4:

  • Input: 500 tokens × $0.01/1K = $5,000
  • Output: 800 tokens × $0.03/1K = $24,000
  • Total: $29,000/month

PoT with Qwen 3 Coder 30B (Q4_K_M) + unsandbox:

  • Model (self-hosted on RTX 4090): ~$0/month (already own GPU)
  • Input: 500 tokens × $0.00 = $0
  • Output: 200 tokens × $0.00 = $0
  • unsandbox execution: 1M × $0.000023 = $23/month
  • Total: $23/month

Savings: $28,977/month (99.9% cost reduction)

Plus:

  • 3x faster (smaller model + shorter output)
  • More accurate on numerical tasks
  • Deterministic results
  • Full privacy (run models locally)

Implementation Tips

1. System Prompt Template

You are a mathematical reasoning assistant.
When solving problems:
1. Write Python code to solve the problem
2. Use descriptive variable names
3. Add comments explaining your logic
4. Output ONLY executable code
5. End with a print() statement showing the result

DO NOT:
- Show arithmetic in comments
- Explain your reasoning in natural language
- Approximate or estimate - write exact calculations
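
One practical wrinkle: even with “Output ONLY executable code” in the prompt, many models still wrap their answer in markdown fences. A small preprocessing helper (an assumption of this post’s pipeline, not an unsandbox feature) strips them before execution:

import re

def extract_code(response_text):
    """Strip markdown fences like ```python ... ``` if the model added them."""
    match = re.search(r"```(?:python)?\s*\n(.*?)```", response_text, re.DOTALL)
    return match.group(1).strip() if match else response_text.strip()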

2. Self-Correction with stderr/stdout Feedback

The killer feature: unsandbox returns both stdout and stderr, enabling automatic self-correction:

import requests

def solve_with_retry(problem, model_client, max_retries=3):
    """
    Generate code, execute it, and retry with error feedback if it fails.
    """
    original_problem = problem  # keep the unmodified problem for retry prompts
    for attempt in range(max_retries):
        # Step 1: Generate code
        response = model_client.chat.completions.create(
            model="gpt-4.1-nano",
            max_tokens=2048,
            messages=[{
                "role": "user",
                "content": f"Solve this problem by writing Python code.\nOutput ONLY the code.\n\n{problem}"
            }]
        )

        code = response.choices[0].message.content

        # Step 2: Execute via unsandbox
        result = requests.post(
            "https://api.unsandbox.com/execute",
            headers={"Authorization": "Bearer unsb-sk-xxxx"},
            json={"language": "python", "code": code, "timeout": 5}
        ).json()

        # Step 3: Check for errors
        if result.get("stderr") == "" and result.get("stdout"):
            # Success!
            return result["stdout"]

        # Step 4: Retry with error feedback
        error_msg = result.get("stderr", "No output produced")
        problem = f"""
Previous attempt failed with error:
{error_msg}

Failed code:
{code}

Original problem: {original_problem}

Fix the syntax or logic error and write corrected Python code.
"""

    return None  # Failed after max retries

Real example:

Attempt 1: Model writes `printt(result)` → stderr: "NameError: name 'printt' is not defined"
Attempt 2: Model fixes typo → `print(result)` → Success!

This dramatically improves accuracy for smaller models that occasionally make syntax errors.
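
Usage is a one-liner once solve_with_retry is defined (client setup as in the implementation section above, placeholder key):

from openai import OpenAI

client = OpenAI(api_key="YOUR_OPENAI_KEY", base_url="https://api.openai.com/v1")
answer = solve_with_retry("What is 17% of 2,340?", client)
print(answer)  # 397.8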

3. Validation

# Run simple assertion tests on the executed result
def validate_solution(code):
    # execute_code_via_unsandbox: your wrapper around the /execute endpoint
    result = execute_code_via_unsandbox(code)

    assert isinstance(result, (int, float)), "Result must be numeric"
    assert result > 0, "Result must be positive"
    assert result < 1_000_000, "Result seems unreasonably large"

    return result

Limitations and Future Directions

Current Limitations

  1. Code Generation Quality: Small models sometimes generate syntactically incorrect code

    • Solution: Multi-sample voting, retry logic, or model fine-tuning
  2. Problem Understanding: Models may misinterpret ambiguous questions

    • Solution: Prompt clarification, few-shot examples
  3. Complex Algorithms: Models struggle with novel algorithmic challenges

    • Solution: Provide library functions, break into sub-problems

The Future: Chain-of-Code

Emerging research shows even better results with hybrid approaches:

# Step 1: Natural language reasoning
# "I need to find the compound annual growth rate..."

# Step 2: Code for calculations
def cagr(start_value, end_value, years):
    return (end_value / start_value) ** (1 / years) - 1

# Step 3: Natural language interpretation
# "A CAGR of 12.5% means the investment grew by about 12.5% per year"

This combines the strengths of both approaches.

Try It Yourself

Quick Start with unsandbox:

  1. Get a free API key: unsandbox.com

  2. Run your first PoT query:

    curl https://api.unsandbox.com/execute \
      -H "Authorization: Bearer YOUR_KEY" \
      -H "Content-Type: application/json" \
      -d '{
        "language": "python",
        "code": "def compound_growth(principal, rate, years):\n    return principal * (1 + rate) ** years\n\nresult = compound_growth(10000, 0.07, 10)\nprint(f\"Final value: ${result:,.2f}\")"
      }'
  3. Integrate with your LLM:

    # See full example in our docs:
    # https://unsandbox.com/docs/python

Example Prompts

Try these with your preferred model + unsandbox:

  1. Finance: “Calculate the present value of receiving $10,000 annually for 20 years at a 5% discount rate”

  2. Statistics: “Generate 1000 random samples from a normal distribution with mean=100, std=15. What percentage fall between 85 and 115?”

  3. Physics: “A ball is thrown at 20 m/s at a 45° angle. How far does it travel before hitting the ground?”

Conclusion

Program-of-Thought represents a paradigm shift: stop asking models to do math; ask them to write code.

The implications are profound:

  • Smaller models become production-viable
  • Quantized models match or exceed large model performance
  • Costs drop by 90-99%
  • Results become deterministic and auditable
  • Local deployment is practical (no massive GPUs needed)

With unsandbox providing secure, fast code execution across 42+ languages, PoT is no longer a research technique — it’s a production-ready strategy for building accurate, affordable AI systems.

The future isn’t bigger models. It’s smarter architecture.


Resources:

Try Program-of-Thought: Get a free API key at unsandbox.com — 1 request per 42 seconds, perfect for experimentation.