Save 70% of Your AI Agent's Tokens: Context Tiering & Lean Loading

English version: Want the technical deep-dive? Check out the GitHub tutorial.
Why Is Your AI Agent Wasting Tokens?
Pause. Think about this.
Every time you ask "what's the disk usage?", your AI agent loads 50,000 tokens of context (conversation history, all memory, the entire workspace) just to answer a question that needs 150 tokens.
It doesn't make sense.
It's like calling a mechanic and making them re-read your entire car's service history before they'll check your oil level.
That means 99.7% of the loaded context was completely irrelevant to the question.
The Numbers Don't Lie
Multiply that by 200 queries per day, 30 days per month. That's $1,500/month gone just because your agent is too lazy to think about what it actually needs.
What if we could fix that?
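The $1,500 figure above can be reproduced with some back-of-envelope arithmetic. The sketch below assumes a price of $5 per million input tokens; the article does not state which model or rate it used, so treat that number as a hypothetical that happens to match the total.

```python
# Back-of-envelope monthly cost of naive context loading.
# PRICE_PER_MILLION_TOKENS is an assumed rate (not stated in the article);
# $5 per million input tokens happens to reproduce the $1,500 figure.
TOKENS_PER_QUERY = 50_000
QUERIES_PER_DAY = 200
DAYS_PER_MONTH = 30
PRICE_PER_MILLION_TOKENS = 5.00  # USD, hypothetical


def monthly_cost(tokens_per_query: int) -> float:
    """Return the monthly input-token cost in USD for a fixed per-query load."""
    monthly_tokens = tokens_per_query * QUERIES_PER_DAY * DAYS_PER_MONTH
    return monthly_tokens / 1_000_000 * PRICE_PER_MILLION_TOKENS


naive = monthly_cost(TOKENS_PER_QUERY)  # 300M tokens per month
print(f"Naive loading: ${naive:,.2f}/month")
```

Swap in your own model's price and query volume to see what naive loading costs you.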
Context Tiering: The Solution
The idea is stupidly simple:
"Before you load context, ask: what's the MINIMUM I actually need to answer this?"
Instead of dumping everything, we tier the context loading based on query complexity.
The Four Tiers
Think of it like this — your brain does this automatically:
Tier 0: Zero Shot (0 tokens overhead). Simple status checks, single facts. "What's CPU load?" doesn't need conversation history.
Tier 1: Memory Lookup (200-500 tokens). Recent context, things you discussed earlier today. Load only today's memory, search for relevance.
Tier 2: JIT Loading (1,000-5,000 tokens). Specific project files, targeted context. Find only the files that actually matter to this query.
Tier 3: Full Session (10,000-80,000 tokens). Complex multi-file analysis, architecture decisions. When you genuinely need everything.
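If you want the four tiers in code, one option is a small lookup table. This is only a sketch: the names `Tier`, `TIERS`, and `cheapest_tier_for` are made up for illustration, and the token ranges are copied from the descriptions above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    level: int
    name: str
    min_tokens: int  # lower end of the expected context overhead
    max_tokens: int  # upper end of the expected context overhead


# Ranges copied from the tier descriptions above.
TIERS = (
    Tier(0, "Zero Shot", 0, 0),
    Tier(1, "Memory Lookup", 200, 500),
    Tier(2, "JIT Loading", 1_000, 5_000),
    Tier(3, "Full Session", 10_000, 80_000),
)


def cheapest_tier_for(needed_tokens: int) -> Tier:
    """Return the lowest tier whose upper bound covers the estimated need."""
    for tier in TIERS:
        if needed_tokens <= tier.max_tokens:
            return tier
    return TIERS[-1]  # anything larger still maps to Full Session


print(cheapest_tier_for(350).name)  # Memory Lookup
```

The point of the lookup is the same as the rule above: always pick the smallest tier that can do the job.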
The Decision Flow
Benchmark Results: Real Numbers
We tested this for 7 days. Mixed workload, 200 queries per day. Here's what we found.
Setup
Results by Scenario
Simple Status Queries (50 per day)
Medium Workflows (80 per day)
Complex Analysis (70 per day)
Daily Totals
Token Savings per Day (thousands):
Monthly Impact
That's $1,398 saved every month. What could you do with that?
- 2 additional Claude Max seats ($299/month each)
- 12 months of OpenClaw Pro
- 15 VPS instances
Latency Improvements
Response Time Comparison (ms):
Cost by Model
The savings scale differently per model:
Even on cheaper models, the absolute savings are significant. On Kimi 2.5, you save $280/month — that's basically your AI subscription cost covered.
Implementation Patterns
Pattern 1: Lean Query Router
def route_to_tier(query: str) -> dict:
    """Route query to appropriate context tier."""
    q = query.lower()
    # Keyword lists are matched as substrings of the lowercased query.
    simple = ['what is', 'show me', 'list', 'is running',
              'disk', 'cpu', 'memory', 'status', 'time']
    memory_kw = ['yesterday', 'last week', 'previously',
                 'earlier', 'we were', 'did we']
    file_kw = ['in the file', 'in project', 'in code',
               'analyze', 'audit', 'review']
    # Tier 0: Simple status, unless the query also refers to memory or files
    if any(s in q for s in simple):
        if not any(s in q for s in memory_kw + file_kw):
            return {"tier": 0, "context": {}, "tokens": 50}
    # Tier 1: Memory (load_tier1 is not shown here)
    if any(s in q for s in memory_kw):
        return load_tier1(query)
    # Tier 2: JIT files (load_tier2 is not shown here)
    if any(s in q for s in file_kw):
        return load_tier2(query)
    # No keyword matched: default to the leanest tier
    return {"tier": 0, "context": {}, "tokens": 50}
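One thing to watch with keyword routing like this: Python's `in` on strings is a substring check, so 'list' also matches "listen" and 'time' matches "sometimes". A possible refinement, sketched here with regex word boundaries, is below. `matches_any` is a hypothetical helper, not part of the original router.

```python
import re


def matches_any(query: str, keywords: list[str]) -> bool:
    """True if any keyword appears in the query as a whole word or phrase."""
    q = query.lower()
    return any(
        re.search(rf"\b{re.escape(kw)}\b", q) is not None
        for kw in keywords
    )


simple = ["list", "time", "status"]

print("list" in "listen to the logs")                # True: substring false positive
print(matches_any("listen to the logs", simple))     # False
print(matches_any("list running services", simple))  # True
```

Whether the extra strictness is worth it depends on your query mix; for a first pass, plain substring checks may be good enough.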
Pattern 2: Token Budget
def execute_with_budget(query: str, max_tokens: int = 5000) -> dict:
    """Execute with hard token ceiling."""
    tier_data = route_to_tier(query)
    # Trim the context if the routed tier exceeds the budget
    # (compress_to_budget is not shown here).
    if tier_data["tokens"] > max_tokens:
        tier_data = compress_to_budget(tier_data, max_tokens)
    result = model.generate(
        system=get_system_prompt(),
        context=tier_data["context"],
        query=query
    )
    return {
        "result": result,
        "tokens_used": tier_data["tokens"],  # context tokens only
        "tier": tier_data["tier"]
    }
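The article doesn't show `compress_to_budget`, so here is one way it might look. This is a minimal sketch under two assumptions: that `tier_data["context"]` is a list of text snippets ordered most relevant first, and that roughly 4 characters make one token. Use your model's real tokenizer for anything serious.

```python
def estimate_tokens(text: str) -> int:
    """Very rough estimate: about 4 characters per token (assumption)."""
    return max(1, len(text) // 4)


def compress_to_budget(tier_data: dict, max_tokens: int) -> dict:
    """Keep the highest-ranked snippets that fit under max_tokens.

    Assumes tier_data["context"] is a list of strings, most relevant first.
    """
    kept, used = [], 0
    for snippet in tier_data["context"]:
        cost = estimate_tokens(snippet)
        if used + cost > max_tokens:
            break  # stop at the first snippet that would exceed the budget
        kept.append(snippet)
        used += cost
    return {**tier_data, "context": kept, "tokens": used}


data = {"tier": 2, "context": ["a" * 400, "b" * 400, "c" * 400], "tokens": 300}
trimmed = compress_to_budget(data, max_tokens=250)
print(len(trimmed["context"]), trimmed["tokens"])  # 2 200
```

Dropping whole snippets is the bluntest option; summarizing the overflow is another, at the price of an extra model call.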
Pattern 3: Memory-Backed Lean Loading
def lean_load_with_memory(query: str) -> dict:
    """Load only today's relevant memories."""
    today_mem = load_today_memories()
    # Search today's memories first, capped at 400 tokens
    relevant = semantic_search(
        query=query,
        corpus=today_mem,
        max_tokens=400
    )
    if relevant["sufficient"]:
        return {
            "tier": 1,
            "context": relevant["content"],
            "tokens": relevant["tokens"],
            "source": "memory"
        }
    # Fallback to workspace files
    relevant_files = find_relevant_files(
        query=query,
        max_tokens=1500
    )
    return {
        "tier": 2,
        "context": relevant_files,
        "tokens": sum(f.tokens for f in relevant_files),
        "source": "workspace"
    }
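To try this pattern without an embedding model, you can stub `semantic_search` with plain word overlap. To be clear, the sketch below is a keyword-overlap stand-in, not real semantic search. The return shape (`sufficient`, `content`, `tokens`) is inferred from how the pattern above uses it, and "sufficient" is assumed to mean "at least one entry matched".

```python
def semantic_search(query: str, corpus: list[str], max_tokens: int = 400) -> dict:
    """Stand-in for semantic search: rank entries by shared words with the query."""
    query_words = set(query.lower().split())
    scored = []
    for entry in corpus:
        overlap = len(query_words & set(entry.lower().split()))
        if overlap > 0:
            scored.append((overlap, entry))
    scored.sort(key=lambda pair: pair[0], reverse=True)

    content, tokens = [], 0
    for _, entry in scored:
        cost = max(1, len(entry) // 4)  # rough 4-chars-per-token assumption
        if tokens + cost > max_tokens:
            break
        content.append(entry)
        tokens += cost
    # "sufficient" here simply means at least one entry matched (assumption).
    return {"sufficient": bool(content), "content": content, "tokens": tokens}


memories = [
    "deployed the billing service to staging",
    "lunch order for Friday",
    "billing service crashed after deploy",
]
hit = semantic_search("why did the billing service crash", memories)
print(hit["sufficient"], len(hit["content"]))  # True 2
```

Once the flow works end to end, replace the stub with an embedding-based search and keep the same return shape.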
Mistakes to Avoid
1. Over-Caching Memory
❌ Bad:
# Loading everything "just in case"
all_memories = load_all_memories() # 50 files, 500K tokens
✅ Good:
# Load only what this query needs
relevant = semantic_search(query, corpus=today_memory, max_tokens=400)
2. Full Session for Simple Queries
❌ Bad:
# Loading 75,000 tokens for a disk check?!
session = load_full_session_history()
workspace = load_entire_workspace()
return process(query, session, workspace)
✅ Good:
# Zero context needed
result = run_command(query)
return format_result(result) # 50 tokens overhead
3. No Monitoring
❌ Bad:
# Blind execution
model.generate(query)
✅ Good:
# Track everything
result = model.generate(query)
log_query(query=query, tier=tier, tokens=tokens_used,
          latency=latency, cost=cost)
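`log_query` isn't defined in the article. One minimal way to write it, using only the standard library, is to emit one JSON line per query. The field names and the logger name below are illustrative assumptions, and the model call is replaced by a placeholder string.

```python
import json
import logging
import time

logging.basicConfig(level=logging.INFO, format="%(message)s")
logger = logging.getLogger("agent.metrics")  # hypothetical logger name


def log_query(query: str, tier: int, tokens: int,
              latency: float, cost: float) -> dict:
    """Emit one JSON record per query so tiers can be compared later."""
    record = {
        "ts": time.time(),
        "query": query,
        "tier": tier,
        "tokens": tokens,
        "latency_s": round(latency, 4),
        "cost_usd": round(cost, 6),
    }
    logger.info(json.dumps(record))
    return record


start = time.perf_counter()
answer = "disk usage: 42%"  # placeholder for result = model.generate(query)
latency = time.perf_counter() - start
log_query("disk usage?", tier=0, tokens=50, latency=latency, cost=0.00025)
```

JSON lines are easy to aggregate later, which is what you need for the baseline comparison in the checklist.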
Quick Start Checklist
Before you optimize:
- Instrument your agent first. You can't save what you can't measure.
- Classify your query mix. Run for one day with naive loading. Categorize each query as simple/medium/complex. This is your baseline.
- Implement tier routing. Start simple — keyword-based. Tier 0 for status checks, Tier 1 for memory queries. No ML needed.
- Set token budgets per tier:
  - Tier 0 = 200 tokens max
  - Tier 1 = 2,000 tokens max
  - Tier 2 = 8,000 tokens max
- Add semantic memory search. Replace blanket loads with targeted search. Biggest gains here.
- Monitor for one week. Compare against baseline. Adjust thresholds.
- Re-classify monthly. Query patterns change.
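The per-tier budgets from the checklist fit in a small table that you can check your logs against. A sketch, assuming each log record carries `tier` and `tokens` fields; the names `TIER_BUDGETS`, `within_budget`, and `over_budget` are hypothetical.

```python
# Ceilings copied from the checklist above. Tier 3 is left uncapped here
# because the checklist gives no budget for it (assumption).
TIER_BUDGETS = {0: 200, 1: 2_000, 2: 8_000}


def within_budget(tier: int, tokens: int) -> bool:
    """Return True if a context load fits its tier's ceiling."""
    limit = TIER_BUDGETS.get(tier)
    return limit is None or tokens <= limit


def over_budget(records: list[dict]) -> list[dict]:
    """Return the log records whose token count exceeds their tier budget."""
    return [r for r in records if not within_budget(r["tier"], r["tokens"])]


logs = [
    {"tier": 0, "tokens": 50},
    {"tier": 1, "tokens": 2_400},
    {"tier": 2, "tokens": 7_900},
]
print(over_budget(logs))  # [{'tier': 1, 'tokens': 2400}]
```

Running this over a week of logs shows which tier's threshold needs adjusting.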
Next Steps
Want to deploy your own AI agent? SumoPod makes it easy. The VPS comes ready to use, just plug it in:
→ SumoPod — One-Click AI Agent VPS
Technical deep-dive (English): → Token-Efficient AI Agents: Context Tiering on GitHub
Complete OpenClaw tutorials: → OpenClaw Troubleshooting Guide → OpenClaw Gateway Setup → OpenClaw Session Maintenance
Part of OpenClaw SumoPod series — deploy your own AI agent on VPS.
Zainul Fanani
Founder, Radian Group. Engineering & tech enthusiast.
