November 30, 2025
Program-of-Thought: How Small Models Outperform Large Models
Chain-of-Thought (CoT) prompting revolutionized how language models solve complex problems by having them “think step by step.” But there’s a fundamental flaw: language models are terrible at arithmetic.
Enter Program-of-Thought (PoT) prompting — a technique that achieves 12-15% better performance than Chain-of-Thought by doing something surprisingly simple: letting models write code instead of doing math.
The Core Insight
The breakthrough paper “Program of Thoughts Prompting: Disentangling Computation from Reasoning” reveals a critical insight: reasoning and computation are different skills.
Language models excel at:
- Understanding problems
- Planning solution strategies
- Expressing logic in code
Language models struggle with:
- Multi-digit arithmetic
- Maintaining precision across calculation steps
- Avoiding compounding errors
PoT separates these concerns: models do the reasoning and write Python code, while a deterministic runtime (like unsandbox) executes the calculations perfectly.
Chain-of-Thought vs Program-of-Thought
Chain-of-Thought (CoT)
Question: A store had 20 apples. They sold 8, received 15 more,
then sold 12. How many apples remain?
CoT Response:
Let me think step by step:
1. Start with 20 apples
2. After selling 8: 20 - 8 = 12 apples
3. After receiving 15: 12 + 15 = 27 apples
4. After selling 12: 27 - 12 = 15 apples
Answer: 15 apples
Problem: The model must perform arithmetic and reasoning. Small errors compound.
Program-of-Thought (PoT)
# Question: A store had 20 apples. They sold 8, received 15 more,
# then sold 12. How many apples remain?
def solve():
    apples = 20
    apples -= 8   # Sold 8
    apples += 15  # Received 15
    apples -= 12  # Sold 12
    return apples

result = solve()
print(f"Remaining apples: {result}")
Execution (via unsandbox):
Remaining apples: 15
The model focuses purely on understanding and translating to code. The calculation is delegated to Python.
Why This Makes Small Models Punch Above Their Weight
Here’s where it gets exciting for practical deployments:
1. Quantized Models Become Viable
A 4-bit quantized Llama 3.1 8B model running locally can now outperform GPT-4 on math problems — not because it’s better at reasoning, but because it doesn’t need to be good at arithmetic.
# Even a heavily quantized model can write this correctly:
def compound_interest(principal, rate, years):
    return principal * (1 + rate) ** years
# unsandbox executes it with perfect precision
2. Faster Inference, Lower Costs
PoT responses are typically shorter than CoT responses:
- CoT: ~500 tokens (showing all calculation steps)
- PoT: ~150 tokens (just the code)
Cost savings: 70% fewer output tokens × cheaper small models = 10-20x cost reduction
3. Deterministic, Auditable Results
Unlike CoT where the model might calculate 127 × 43 differently each time, PoT produces:
result = 127 * 43 # Always 5461, every time
print(result)
This is critical for financial, scientific, and healthcare applications where reproducibility matters.
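Reproducibility also makes runs easy to audit. Below is a minimal sketch (assuming the same /execute request shape and stdout/stderr response fields used elsewhere in this post) that stores a hash of the executed code next to its output, so any result can be re-checked later:
import hashlib
import json
import requests

UNSANDBOX_KEY = "unsb-sk-xxxx-xxxx-xxxx-xxxx"

def execute_and_audit(code):
    """Run code via unsandbox and return an auditable record of the run."""
    resp = requests.post(
        "https://api.unsandbox.com/execute",
        headers={"Authorization": f"Bearer {UNSANDBOX_KEY}"},
        json={"language": "python", "code": code, "timeout": 5},
    ).json()
    # The (code hash, stdout) pair is enough to re-run and verify the result later
    return {
        "code_sha256": hashlib.sha256(code.encode()).hexdigest(),
        "stdout": resp.get("stdout", ""),
        "stderr": resp.get("stderr", ""),
    }

record = execute_and_audit("print(127 * 43)")
print(json.dumps(record, indent=2))  # stdout is always "5461" plus a newline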
Real-World Example: Financial Analysis
Task: Calculate the internal rate of return (IRR) for a series of cash flows.
CoT Approach (GPT-4):
Let's calculate IRR using trial and error...
Try r = 0.10: NPV = -$234.52
Try r = 0.12: NPV = $45.23
Try r = 0.115: NPV = -$12.34
...
[Model struggles with iterative numerical methods]
Result: Incorrect or “I cannot calculate this precisely”
PoT Approach (Llama 3.1 8B + unsandbox):
import numpy_financial as npf
def calculate_irr(cash_flows):
    """
    Calculate internal rate of return for cash flows
    cash_flows: list of cash flows, first element is initial investment (negative)
    """
    return npf.irr(cash_flows)
# Cash flows: -$1000 investment, then $300, $400, $500 returns
cash_flows = [-1000, 300, 400, 500]
irr = calculate_irr(cash_flows)
print(f"IRR: {irr:.2%}")
Result: IRR: 8.90% (mathematically correct, every time)
How unsandbox Enables PoT at Scale
unsandbox is purpose-built for Program-of-Thought workflows:
1. Zero-Trust Execution
# User's code runs in isolated container
# No access to filesystem, network, or other processes
# Automatic resource limits prevent runaway calculations
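In practice this means model-generated code can be handed to the API without vetting it first. A minimal sketch, assuming the timeout field shown later in this post acts as a hard cap on runtime:
import requests

UNSANDBOX_KEY = "unsb-sk-xxxx-xxxx-xxxx-xxxx"

# Model-generated code we deliberately don't trust: an infinite loop
untrusted_code = "while True:\n    pass"

resp = requests.post(
    "https://api.unsandbox.com/execute",
    headers={"Authorization": f"Bearer {UNSANDBOX_KEY}"},
    json={"language": "python", "code": untrusted_code, "timeout": 5},  # cap runtime in seconds
).json()

# A run cut off by the sandbox should come back as a failure rather than hanging the caller;
# treat it like any other failed sample (retry it, or drop it from a voting pool).
print(resp.get("stdout", ""), resp.get("stderr", ""))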
2. 42+ Language Support
Not all models are best at Python. Some excel at:
- Julia for numerical computing
- R for statistical analysis
- JavaScript for JSON manipulation
- Rust for performance-critical calculations
curl https://api.unsandbox.com/execute \
  -H "Authorization: Bearer unsb-sk-xxxx-xxxx-xxxx-xxxx" \
  -d '{
    "language": "julia",
    "code": "# Your Julia code here"
  }'
3. Sub-Second Execution
Average PoT workflow latency:
- Model generates code: ~800ms
- unsandbox executes code: ~150ms
- Total: <1 second
Compare to CoT:
- Model generates reasoning: ~2000ms
- Still might be wrong
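These numbers will vary with your model and hardware. A simple timing harness for your own setup might look like the sketch below (the endpoint, model name, and keys are placeholders; the full client setup is covered in the implementation section later in this post):
import time
import requests
from openai import OpenAI

client = OpenAI(api_key="YOUR_KEY", base_url="https://api.openai.com/v1")  # or a local endpoint
UNSANDBOX_KEY = "unsb-sk-xxxx-xxxx-xxxx-xxxx"

t0 = time.perf_counter()
code = client.chat.completions.create(
    model="gpt-4.1-nano",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Write Python that prints 127 * 43. Output only the code."}],
).choices[0].message.content
t1 = time.perf_counter()

run = requests.post(
    "https://api.unsandbox.com/execute",
    headers={"Authorization": f"Bearer {UNSANDBOX_KEY}"},
    json={"language": "python", "code": code, "timeout": 5},
).json()
t2 = time.perf_counter()

print(f"generation: {(t1 - t0) * 1000:.0f} ms, execution: {(t2 - t1) * 1000:.0f} ms")
print(run.get("stdout", ""))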
4. Self-Consistency Decoding
Run the same PoT prompt 5 times, execute all code samples, return the most common result:
# Example: Self-consistency with voting
# Run multiple code samples and pick the most common result
results = []
# Sample 1
def solve1(): return 42 * 1.15
results.append(solve1())
# Sample 2
def solve2(): return 42 * 1.15
results.append(solve2())
# Sample 3 would fail: def solve3(): return 42 × 1.15 # Syntax error
# Sample 4
def solve4(): return 42 * 1.15
results.append(solve4())
# Sample 5 (model error)
def solve5(): return 42 * 1.51
results.append(solve5())
# Find consensus (most common result)
from collections import Counter
consensus = Counter(results).most_common(1)[0][0]
print(f"Consensus result: {consensus:.1f}")
# Output: Consensus result: 48.3
Result: Even with occasional model errors, consensus voting + code execution yields correct answers.
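The hardcoded samples above just illustrate the voting step. In a real pipeline you would sample the model several times at a nonzero temperature, execute every sample, and vote on the outputs. A sketch of that loop (model name, keys, and prompt are placeholders):
import requests
from collections import Counter
from openai import OpenAI

client = OpenAI(api_key="YOUR_KEY", base_url="https://api.openai.com/v1")  # or a local endpoint
UNSANDBOX_KEY = "unsb-sk-xxxx-xxxx-xxxx-xxxx"
prompt = "Write Python that prints 42 increased by 15%. Output ONLY the code."

outputs = []
for _ in range(5):
    code = client.chat.completions.create(
        model="gpt-4.1-nano",
        temperature=0.7,  # nonzero temperature so the samples differ
        max_tokens=512,
        messages=[{"role": "user", "content": prompt}],
    ).choices[0].message.content
    run = requests.post(
        "https://api.unsandbox.com/execute",
        headers={"Authorization": f"Bearer {UNSANDBOX_KEY}"},
        json={"language": "python", "code": code, "timeout": 5},
    ).json()
    if run.get("stdout"):  # samples that crash simply get no vote
        outputs.append(run["stdout"].strip())

consensus, votes = Counter(outputs).most_common(1)[0]
print(f"Consensus ({votes}/{len(outputs)} votes): {consensus}")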
Implementation Patterns
Here’s a complete PoT workflow that works with any model + unsandbox: first, a quick model selection guide, then the universal implementation.
Model Selection Guide
API Models (hosted, pay-per-token):
- GPT-4.1 Nano: $0.10/1M input, $0.40/1M output - Cheapest option ever
- GPT-4.1 Mini: $0.40/1M input, $1.60/1M output - Best balance, 1M context
- GPT-4o Mini: $0.15/1M input, $0.60/1M output - Legacy OpenAI
Local Models (self-hosted, zero API costs):
- Qwen 3 Coder 30B: Best accuracy, needs RTX 4090/3090 (24GB VRAM)
- Hermes-3-Llama-3.1-8B: Excellent instruction following, RTX 4090/3090
Universal Implementation
The pattern is identical for all models — only the model endpoint changes:
import requests
from openai import OpenAI # For API models, or use requests for local
# ===== CONFIGURATION: Choose your model =====
# All models use OpenAI-compatible /v1 endpoints - just change base_url!
# Option 1: OpenAI API models
BASE_URL = "https://api.openai.com/v1"
MODEL_NAME = "gpt-4.1-nano" # or "gpt-4.1-mini" or "gpt-4o-mini"
API_KEY = "YOUR_OPENAI_KEY"
# Option 2: Free uncloseai.com - Hermes 8B (tested, works!)
# BASE_URL = "https://hermes.ai.unturf.com/v1"
# MODEL_NAME = "adamo1139/Hermes-3-Llama-3.1-8B-FP8-Dynamic"
# API_KEY = "not-needed"
# Option 3: Free uncloseai.com - Qwen Coder 30B (tested, works!)
# BASE_URL = "https://qwen.ai.unturf.com/v1"
# MODEL_NAME = "hf.co/unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF:Q4_K_M"
# API_KEY = "not-needed"
# Option 4: Local Ollama (port 11434)
# BASE_URL = "http://localhost:11434/v1"
# MODEL_NAME = "qwen3-coder:30b-q4"
# API_KEY = "ollama"
# Option 5: Local vLLM (port 18888)
# BASE_URL = "http://localhost:18888/v1"
# MODEL_NAME = "adamo1139/Hermes-3-Llama-3.1-8B-FP8-Dynamic"
# API_KEY = "token-abc123"
# unsandbox API key (get free key at unsandbox.com)
UNSANDBOX_KEY = "unsb-sk-xxxx-xxxx-xxxx-xxxx"
# ===== STEP 1: Generate Code =====
# Same code works for ALL models - just change base_url above!
problem = """Solve this problem by writing Python code.
Output ONLY the code, no explanations.
Problem: A company's revenue grew 15% annually for 3 years,
starting from $1.2M. What's the revenue in year 3?"""
client = OpenAI(api_key=API_KEY, base_url=BASE_URL)
response = client.chat.completions.create(
    model=MODEL_NAME,
    max_tokens=2048,
    messages=[{"role": "user", "content": problem}]
)
code = response.choices[0].message.content
# ===== STEP 2: Execute via unsandbox =====
exec_response = requests.post(
    "https://api.unsandbox.com/execute",
    headers={"Authorization": f"Bearer {UNSANDBOX_KEY}"},
    json={"language": "python", "code": code, "timeout": 5}
)
result = exec_response.json()
print(result["stdout"])  # $1,825,050.00
Per-Model Cost Analysis
| Model | Code Generation | unsandbox | Total/Query | Speed |
|---|---|---|---|---|
| GPT-4.1 Nano | ~$0.0003 | $0.000023 | $0.000323 | ~3-4s |
| GPT-4.1 Mini | ~$0.0011 | $0.000023 | $0.001123 | ~3-4s |
| GPT-4o Mini | ~$0.0002 | $0.000023 | $0.000223 | ~5s |
| Qwen 3 Coder 30B (local) | $0 | $0.000023 | $0.000023 | ~1-2s |
| Hermes-3 8B (local) | $0 | $0.000023 | $0.000023 | ~1-2s |
Note: unsandbox costs $0.000023 per request for all plans (Dev, Production, Business) - you pay more for higher rate limits, not per-request pricing.
unsandbox pricing tiers (per-request cost at full utilization):
| Tier | Monthly Cost | Requests/Month | Cost/Request |
|---|---|---|---|
| Development | $7 | 302,400 | $0.000023 |
| Production | $91 | 3,931,200 | $0.000023 |
| Business | $175 | 7,560,000 | $0.000023 |
Key insight: Local models eliminate API costs entirely — you only pay for unsandbox execution ($0.000023/query).
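To plug in your own numbers, the per-query math is simply model tokens times price plus the flat execution fee. A small estimator (the token counts below are rough assumptions):
def cost_per_query(input_tokens, output_tokens,
                   input_price_per_1m, output_price_per_1m,
                   unsandbox_per_request=0.000023):
    """Estimate the cost of one PoT query: model tokens plus one sandbox execution."""
    model_cost = (input_tokens * input_price_per_1m
                  + output_tokens * output_price_per_1m) / 1_000_000
    return model_cost + unsandbox_per_request

# GPT-4.1 Nano at roughly 1,000 input / 500 output tokens per query
print(f"${cost_per_query(1000, 500, 0.10, 0.40):.6f}")  # $0.000323, as in the table above
# A local model: token cost is zero, only the execution fee remains
print(f"${cost_per_query(1000, 500, 0.0, 0.0):.6f}")    # $0.000023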
Benchmarks: Small Models vs Large Models
Research from the PoT paper shows dramatic improvements:
| Dataset | GPT-3 CoT | GPT-3 PoT | Improvement |
|---|---|---|---|
| GSM8K (math) | 68.2% | 78.5% | +10.3% |
| SVAMP (word problems) | 74.1% | 85.3% | +11.2% |
| FinQA (financial) | 52.3% | 67.8% | +15.5% |
| ConvFinQA | 48.6% | 63.2% | +14.6% |
Key insight: The improvement is larger on harder problems where multi-step arithmetic compounds errors.
Why PoT Works with Modern Models (2025)
The original research used GPT-3 era models, but the technique is even more powerful today:
- Smaller local models (Qwen 3 Coder 30B, Hermes-3-Llama-3.1-8B) can generate correct Python code
- Code execution via unsandbox eliminates arithmetic errors completely
- Cost dramatically lower: Local models + unsandbox (~$0.000023/query) vs GPT-4 API ($0.02-0.05/query)
- Privacy preserved: Code generation happens locally, only execution goes to unsandbox
The key insight: you don’t need a massive model to write correct code - you just need it to understand the problem and express the logic. The actual computation is handled by Python, which never makes mistakes.
Speed: Smaller Quantized Models Are Faster
Here’s the counterintuitive part: quantized models give you correct answers faster than large API models.
Latency Comparison (Same Problem)
| Model | Generation Time | Execution Time | Total | Result |
|---|---|---|---|---|
| GPT-4 API (CoT) | ~3,500ms | N/A | 3,500ms | ❌ Sometimes wrong |
| GPT-4 API (PoT) | ~2,000ms | ~150ms | 2,150ms | ✅ Always correct |
| Qwen 3 Coder 30B Q4 (Local PoT) | ~800ms | ~150ms | 950ms | ✅ Always correct |
| Hermes-3 8B FP8 (Local PoT) | ~400ms | ~150ms | 550ms | ✅ Always correct |
Why quantized models are faster:
- Smaller memory footprint → faster token generation (Q4/FP8 runs entirely in GPU cache)
- Shorter outputs → Code is more concise than step-by-step arithmetic explanations
- No API latency → Local inference eliminates network round trips
- Parallel execution → unsandbox can execute multiple code snippets simultaneously (see the sketch below)
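To exploit that last point from Python, you can fan several /execute calls out over a thread pool. A minimal sketch using the same request shape as the rest of this post:
import requests
from concurrent.futures import ThreadPoolExecutor

UNSANDBOX_KEY = "unsb-sk-xxxx-xxxx-xxxx-xxxx"

def run(code):
    """Execute one snippet via unsandbox and return its stdout."""
    resp = requests.post(
        "https://api.unsandbox.com/execute",
        headers={"Authorization": f"Bearer {UNSANDBOX_KEY}"},
        json={"language": "python", "code": code, "timeout": 5},
    ).json()
    return resp.get("stdout", "").strip()

snippets = [f"print({n} * 1.15)" for n in (10, 20, 30, 40)]

# Each call spends most of its time waiting on the network and the sandbox,
# so a small thread pool overlaps the requests instead of running them one by one.
with ThreadPoolExecutor(max_workers=4) as pool:
    for out in pool.map(run, snippets):
        print(out)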
Real-world impact:
Processing 100 math problems:
GPT-4 CoT: 100 × 3,500ms = 350 seconds (5.8 minutes)
Hermes-3 FP8 PoT: 100 × 550ms = 55 seconds (< 1 minute)
You get correct answers 6× faster at 1/200th the cost using a quantized model on a consumer GPU.
Beyond Math: Other PoT Applications
1. Data Analysis
# Model generates pandas code for business intelligence queries
import pandas as pd
df = pd.read_csv('sales_data.csv', parse_dates=['date'])  # parse dates so the .dt accessor works
monthly_revenue = df.groupby(df['date'].dt.month)['revenue'].sum()
growth_rate = monthly_revenue.pct_change().mean()
print(f"Average monthly growth: {growth_rate:.2%}")
2. Scientific Computing
# Physics simulations the model can't do in its "head"
import numpy as np
from scipy.integrate import odeint
def projectile_motion(state, t, g=9.81):
    x, vx, y, vy = state
    return [vx, 0, vy, -g]
# Solve trajectory
solution = odeint(projectile_motion, [0, 10, 0, 15], np.linspace(0, 3, 100))
max_height = solution[:, 2].max()
print(f"Max height: {max_height:.2f} meters")
3. String Manipulation
# Complex regex/parsing that models hallucinate
import re
def extract_emails(text):
    pattern = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'
    return re.findall(pattern, text)
text = "Contact us at hello@example.com or support@test.org for help"
emails = extract_emails(text)
print(f"Found {len(emails)} emails: {emails}")
# Output: Found 2 emails: ['hello@example.com', 'support@test.org']
The Economics of PoT
For a company processing 1M queries/month:
Traditional CoT with GPT-4:
- Input: 500 tokens × $0.01/1K = $5,000
- Output: 800 tokens × $0.03/1K = $24,000
- Total: $29,000/month
PoT with Qwen 3 Coder 30B (Q4_K_M) + unsandbox:
- Model (self-hosted on RTX 4090): ~$0/month (already own GPU)
- Input: 500 tokens × $0.00 = $0
- Output: 200 tokens × $0.00 = $0
- unsandbox execution: 1M × $0.000023 = $23/month
- Total: $23/month
Savings: $28,977/month (99.9% cost reduction)
Plus:
- 3x faster (smaller model + shorter output)
- More accurate on numerical tasks
- Deterministic results
- Full privacy (run models locally)
Implementation Tips
1. System Prompt Template
You are a mathematical reasoning assistant.
When solving problems:
1. Write Python code to solve the problem
2. Use descriptive variable names
3. Add comments explaining your logic
4. Output ONLY executable code
5. End with a print() statement showing the result
DO NOT:
- Show arithmetic in comments
- Explain your reasoning in natural language
- Approximate or estimate - write exact calculations
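Wired into the OpenAI-compatible client from the implementation above, the template goes in as the system message and the raw problem as the user message. A sketch (model name, key, and the sample problem are placeholders):
from openai import OpenAI

client = OpenAI(api_key="YOUR_KEY", base_url="https://api.openai.com/v1")  # or a local endpoint

SYSTEM_PROMPT = """You are a mathematical reasoning assistant.
When solving problems:
1. Write Python code to solve the problem
2. Use descriptive variable names
3. Add comments explaining your logic
4. Output ONLY executable code
5. End with a print() statement showing the result"""

response = client.chat.completions.create(
    model="gpt-4.1-nano",
    max_tokens=2048,
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": "A car travels 180 km in 2.5 hours. What is its average speed in km/h?"},
    ],
)
code = response.choices[0].message.content  # hand this straight to unsandbox /execute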
2. Self-Correction with stderr/stdout Feedback
The killer feature: unsandbox returns both stdout and stderr, enabling automatic self-correction:
import requests

def solve_with_retry(problem, model_client, max_retries=3):
    """
    Generate code, execute it, and retry with error feedback if it fails.
    """
    for attempt in range(max_retries):
        # Step 1: Generate code
        response = model_client.chat.completions.create(
            model="gpt-4.1-nano",
            max_tokens=2048,
            messages=[{
                "role": "user",
                "content": f"Solve this problem by writing Python code.\nOutput ONLY the code.\n\n{problem}"
            }]
        )
        code = response.choices[0].message.content

        # Step 2: Execute via unsandbox
        result = requests.post(
            "https://api.unsandbox.com/execute",
            headers={"Authorization": "Bearer unsb-sk-xxxx"},
            json={"language": "python", "code": code, "timeout": 5}
        ).json()

        # Step 3: Check for errors
        if result.get("stderr") == "" and result.get("stdout"):
            # Success!
            return result["stdout"]

        # Step 4: Retry with error feedback
        error_msg = result.get("stderr", "No output produced")
        problem = f"""
Previous attempt failed with error:
{error_msg}

Failed code:
{code}

Original problem: {problem}

Fix the syntax or logic error and write corrected Python code.
"""
    return None  # Failed after max retries
Real example:
Attempt 1: Model writes `printt(result)` → stderr: "NameError: name 'printt' is not defined"
Attempt 2: Model fixes typo → `print(result)` → Success!
This dramatically improves accuracy for smaller models that occasionally make syntax errors.
3. Validation
# Run simple assertion tests
def validate_solution(code, expected_properties):
    result = execute_code_via_unsandbox(code)  # thin wrapper around the /execute call shown above
    assert isinstance(result, (int, float)), "Result must be numeric"
    assert result > 0, "Result must be positive"
    assert result < 1_000_000, "Result seems unreasonably large"
    return result
Limitations and Future Directions
Current Limitations
- Code Generation Quality: Small models sometimes generate syntactically incorrect code
  - Solution: Multi-sample voting, retry logic, or model fine-tuning
- Problem Understanding: Models may misinterpret ambiguous questions
  - Solution: Prompt clarification, few-shot examples
- Complex Algorithms: Models struggle with novel algorithmic challenges
  - Solution: Provide library functions, break into sub-problems
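For the second limitation, a single worked example in the prompt usually pins down the intended interpretation and output format. A sketch of such a few-shot prompt (the example problems are purely illustrative):
FEW_SHOT_PROMPT = """Solve each problem by writing Python code. Output ONLY the code.

Problem: A store had 20 apples. They sold 8, received 15 more, then sold 12. How many remain?
Code:
apples = 20 - 8 + 15 - 12
print(apples)

Problem: {problem}
Code:"""

# Fill in the real question and send the prompt exactly as in the universal implementation above
prompt = FEW_SHOT_PROMPT.format(
    problem="A tank holds 500 L, drains 12 L/min for 20 minutes, then is refilled with 150 L. How much water remains?"
)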
The Future: Chain-of-Code
Emerging research shows even better results with hybrid approaches:
# Step 1: Natural language reasoning
# "I need to find the compound annual growth rate..."
# Step 2: Code for calculations
def cagr(start_value, end_value, years):
    return (end_value / start_value) ** (1 / years) - 1
# Step 3: Natural language interpretation
# "A CAGR of 12.5% means the investment grew by about 12.5% per year"
This combines the strengths of both approaches.
Try It Yourself
Quick Start with unsandbox:
- Get a free API key: unsandbox.com
- Run your first PoT query:
curl https://api.unsandbox.com/execute \
  -H "Authorization: Bearer YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "language": "python",
    "code": "def compound_growth(principal, rate, years):\n return principal * (1 + rate) ** years\n\nresult = compound_growth(10000, 0.07, 10)\nprint(f\"Final value: ${result:,.2f}\")"
  }'
- Integrate with your LLM:
# See full example in our docs:
# https://unsandbox.com/docs/python
Example Prompts
Try these with your preferred model + unsandbox:
- Finance: “Calculate the present value of receiving $10,000 annually for 20 years at a 5% discount rate”
- Statistics: “Generate 1000 random samples from a normal distribution with mean=100, std=15. What percentage fall between 85 and 115?”
- Physics: “A ball is thrown at 20 m/s at a 45° angle. How far does it travel before hitting the ground?”
Conclusion
Program-of-Thought represents a paradigm shift: stop asking models to do math; ask them to write code.
The implications are profound:
- Smaller models become production-viable
- Quantized models match or exceed large model performance
- Costs drop by 90-99%
- Results become deterministic and auditable
- Local deployment is practical (no massive GPUs needed)
With unsandbox providing secure, fast code execution across 42+ languages, PoT is no longer a research technique — it’s a production-ready strategy for building accurate, affordable AI systems.
The future isn’t bigger models. It’s smarter architecture.
Resources:
Try Program-of-Thought: Get a free API key at unsandbox.com — 1 request per 42 seconds, perfect for experimentation.