Save 70% on AI Agent Tokens: Context Tiering & Lean Loading

How to cut an AI agent's token spend from $2,130 to $732 per month without losing functionality. Real benchmark data inside.
Zainul Fanani
📅 15 Apr 2026

English version: Want the technical deep-dive? Check out the GitHub tutorial.


Why Is Your AI Agent Wasting Tokens?

Pause. Think about this.

Every time you ask "what's the disk usage?", your AI agent loads 50,000 tokens of context (conversation history, all memory, the entire workspace) just to answer a question that needs 150 tokens.

It doesn't make sense.

It's like calling a mechanic and making them re-read your entire car's service history before they'll check your oil level.

That means 99.7% of the loaded context was completely irrelevant to the question.

The Numbers Don't Lie

Multiply that by 200 queries per day, 30 days per month. That's $1,500/month gone just because your agent is too lazy to think about what it actually needs.
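To make the arithmetic concrete, here is a back-of-the-envelope sketch. The 50,000 and 150 token counts and the 200 queries per day come from the text above. The price of $5 per million input tokens is an assumption, picked because it reproduces the $1,500 figure; it is not a quoted rate for any particular model.

```python
# Back-of-the-envelope cost of naive context loading.
LOADED_TOKENS = 50_000      # context loaded per query (from the example above)
NEEDED_TOKENS = 150         # tokens the answer actually needs
QUERIES_PER_DAY = 200
DAYS_PER_MONTH = 30
PRICE_PER_MILLION = 5.00    # ASSUMPTION: USD per 1M input tokens

wasted_fraction = (LOADED_TOKENS - NEEDED_TOKENS) / LOADED_TOKENS
monthly_tokens = LOADED_TOKENS * QUERIES_PER_DAY * DAYS_PER_MONTH
monthly_cost = monthly_tokens / 1_000_000 * PRICE_PER_MILLION

print(f"wasted context: {wasted_fraction:.1%}")
print(f"tokens per month: {monthly_tokens:,}")
print(f"cost per month: ${monthly_cost:,.0f}")
```

Plug in your own model's price to see what naive loading costs on your setup.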

What if we could fix that?


Context Tiering: The Solution

The idea is stupidly simple:

"Before you load context, ask: what's the MINIMUM I actually need to answer this?"

Instead of dumping everything, we tier the context loading based on query complexity.

The Four Tiers

Think of it like this — your brain does this automatically:


Tier 0: Zero Shot (0 tokens overhead)
Simple status checks, single facts. "What's CPU load?" doesn't need conversation history.

Tier 1: Memory Lookup (200-500 tokens)
Recent context, things you discussed earlier today. Load only today's memory, search for relevance.

Tier 2: JIT Loading (1,000-5,000 tokens)
Specific project files, targeted context. Find only the files that actually matter to this query.

Tier 3: Full Session (10,000-80,000 tokens)
Complex multi-file analysis, architecture decisions. When you genuinely need everything.
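If it helps to see the tiers as data, here is a minimal sketch. The token ranges are the ones listed above; the `Tier` class, the `TIERS` list, and `cheapest_tier_for` are hypothetical names made up for this illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Tier:
    level: int
    name: str
    min_tokens: int
    max_tokens: int

# Token ranges copied from the tier descriptions above.
TIERS = [
    Tier(0, "Zero Shot", 0, 0),
    Tier(1, "Memory Lookup", 200, 500),
    Tier(2, "JIT Loading", 1_000, 5_000),
    Tier(3, "Full Session", 10_000, 80_000),
]

def cheapest_tier_for(needed_tokens: int) -> Tier:
    """Return the lowest tier whose upper bound covers the needed context."""
    for tier in TIERS:
        if needed_tokens <= tier.max_tokens:
            return tier
    return TIERS[-1]  # anything larger still falls back to Full Session

print(cheapest_tier_for(300).name)
```

The point of the lookup is the direction of travel: start from the cheapest tier and only climb when the query really needs more.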


The Decision Flow

[Diagram: tier decision flow]


Benchmark Results: Real Numbers

We tested this for 7 days. Mixed workload, 200 queries per day. Here's what we found.

Setup

Results by Scenario

Simple Status Queries (50 per day)

Medium Workflows (80 per day)

Complex Analysis (70 per day)

Daily Totals

Token Savings per Day (thousands):


Monthly Impact

That's $1,398 saved every month. What could you do with that?

  • 2 additional Claude Max seats ($299/month each)
  • 12 months of OpenClaw Pro
  • 15 VPS instances
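For reference, the monthly figure follows directly from the before and after costs quoted at the top of this post. A quick sketch of the arithmetic (variable names are just for illustration):

```python
baseline_cost = 2_130   # USD/month before tiering (from the intro)
tiered_cost = 732       # USD/month after tiering (from the intro)

monthly_savings = baseline_cost - tiered_cost
cost_reduction = monthly_savings / baseline_cost
annual_savings = monthly_savings * 12

print(f"${monthly_savings:,}/month, {cost_reduction:.1%} lower bill, "
      f"${annual_savings:,}/year")
```

In dollar terms that works out to roughly a 66% lower bill.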

Latency Improvements

Response Time Comparison (ms):


Cost by Model

The savings scale differently per model:

Even on cheaper models, the absolute savings are significant. On Kimi 2.5, you save $280/month — that's basically your AI subscription cost covered.


Implementation Patterns

Pattern 1: Lean Query Router

python
def route_to_tier(query: str) -> dict:
    """Route query to appropriate context tier."""
    q = query.lower()

    simple = ['what is', 'show me', 'list', 'is running',
              'disk', 'cpu', 'memory', 'status', 'time']

    memory_kw = ['yesterday', 'last week', 'previously',
                 'earlier', 'we were', 'did we']

    file_kw = ['in the file', 'in project', 'in code',
               'analyze', 'audit', 'review']

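    # Tier 0: simple status check with no memory or file keywords
    is_simple = any(s in q for s in simple)
    needs_more = any(s in q for s in memory_kw + file_kw)
    if is_simple and not needs_more:
        return {"tier": 0, "context": {}, "tokens": 50}

    # Tier 1: memory lookup (load_tier1 is defined elsewhere)
    if any(s in q for s in memory_kw):
        return load_tier1(query)

    # Tier 2: JIT file loading (load_tier2 is defined elsewhere)
    if any(s in q for s in file_kw):
        return load_tier2(query)

    # No keyword matched: default to the leanest tier
    return {"tier": 0, "context": {}, "tokens": 50}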
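One caveat with plain substring checks: `'time' in q` also matches "sometimes", and `'list' in q` matches "listen". Here is a minimal sketch of a stricter matcher that uses word boundaries. `matches_any` is a hypothetical helper, not part of the router above.

```python
import re

def matches_any(query: str, keywords: list[str]) -> bool:
    """True if any keyword appears in the query as a whole word or phrase."""
    q = query.lower()
    return any(
        re.search(rf"\b{re.escape(kw)}\b", q) is not None
        for kw in keywords
    )

simple = ['what is', 'show me', 'list', 'is running',
          'disk', 'cpu', 'memory', 'status', 'time']

print(matches_any("Sometimes the build hangs", simple))  # False: "time" is only a substring
print(matches_any("What time is it?", simple))           # True
```

Swapping `any(s in q for s in ...)` for a helper like this keeps the router keyword-based while cutting down on accidental Tier 0 hits.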

Pattern 2: Token Budget

python
def execute_with_budget(query: str, max_tokens: int = 5000) -> dict:
    """Execute with hard token ceiling."""
    tier_data = route_to_tier(query)

    if tier_data["tokens"] > max_tokens:
        tier_data = compress_to_budget(tier_data, max_tokens)

    result = model.generate(
        system=get_system_prompt(),
        context=tier_data["context"],
        query=query
    )

    return {
        "result": result,
        "tokens_used": tier_data["tokens"],
        "tier": tier_data["tier"]
    }
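`compress_to_budget` is not shown above, so here is one possible sketch. It assumes `tier_data["context"]` is a dict mapping a source name to its text, and it estimates tokens at roughly 4 characters each. Both are assumptions for illustration; a real setup would use the model's own tokenizer and probably summarize rather than truncate.

```python
def estimate_tokens(text: str) -> int:
    """Crude estimate: ~4 characters per token (ASSUMPTION, not a real tokenizer)."""
    return max(1, len(text) // 4) if text else 0

def compress_to_budget(tier_data: dict, max_tokens: int) -> dict:
    """Keep context entries in order until the budget is spent, truncating the last one."""
    kept, used = {}, 0
    for name, text in tier_data["context"].items():
        remaining = max_tokens - used
        if remaining <= 0:
            break
        cost = estimate_tokens(text)
        if cost > remaining:
            text = text[: remaining * 4]   # hard truncate to fit the budget
            cost = estimate_tokens(text)
        kept[name] = text
        used += cost
    # Return a new dict so the caller's tier_data is left untouched.
    return {**tier_data, "context": kept, "tokens": used}

data = {"tier": 2, "context": {"a": "x" * 400, "b": "y" * 400}, "tokens": 200}
print(compress_to_budget(data, 150)["tokens"])
```

Truncation is the bluntest option, but it gives you a hard ceiling you can reason about before adding anything smarter.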

Pattern 3: Memory-Backed Lean Loading

python
def lean_load_with_memory(query: str) -> dict:
    """Load only today's relevant memories."""
    today_mem = load_today_memories()
    relevant = semantic_search(
        query=query,
        corpus=today_mem,
        max_tokens=400
    )

    if relevant["sufficient"]:
        return {
            "tier": 1,
            "context": relevant["content"],
            "tokens": relevant["tokens"],
            "source": "memory"
        }

    # Fallback to workspace files
    relevant_files = find_relevant_files(
        query=query,
        max_tokens=1500
    )

    return {
        "tier": 2,
        "context": relevant_files,
        "tokens": sum(f.tokens for f in relevant_files),
        "source": "workspace"
    }
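The pattern above leans on a `semantic_search` helper that returns `content`, `tokens`, and `sufficient`. As a stand-in, here is a sketch that uses plain word overlap instead of embeddings, so it is not truly semantic. The `min_score` threshold, the list-of-strings corpus, and the 4-characters-per-token estimate are all assumptions.

```python
def semantic_search(query: str, corpus: list[str], max_tokens: int = 400,
                    min_score: float = 0.3) -> dict:
    """Rank memory entries by word overlap with the query and pack the best ones."""
    q_words = set(query.lower().split())
    scored = []
    for entry in corpus:
        overlap = len(q_words & set(entry.lower().split()))
        score = overlap / len(q_words) if q_words else 0.0
        if score > 0:
            scored.append((score, entry))
    scored.sort(key=lambda pair: pair[0], reverse=True)

    picked, tokens = [], 0
    for _score, entry in scored:
        cost = max(1, len(entry) // 4)      # ASSUMPTION: ~4 chars per token
        if tokens + cost > max_tokens:
            continue                        # skip entries that would bust the budget
        picked.append(entry)
        tokens += cost

    best = scored[0][0] if scored else 0.0
    return {
        "content": "\n".join(picked),
        "tokens": tokens,
        "sufficient": bool(picked) and best >= min_score,
    }

memories = ["deployed nginx config to staging", "lunch order was pizza"]
print(semantic_search("did we deploy nginx to staging", memories)["sufficient"])
```

Once the interface is in place you can swap the scoring for an embedding-based search without touching the callers.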

Mistakes to Avoid

1. Over-Caching Memory

❌ Bad:

python
# Loading everything "just in case"
all_memories = load_all_memories()  # 50 files, 500K tokens

✅ Good:

python
# Load only what this query needs
today_mem = load_today_memories()
relevant = semantic_search(query=query, corpus=today_mem, max_tokens=400)

2. Full Session for Simple Queries

❌ Bad:

python
# Loading 75,000 tokens for a disk check?!
session = load_full_session_history()
workspace = load_entire_workspace()
return process(query, session, workspace)

✅ Good:

python
# Zero context needed
result = run_command(query)
return format_result(result)  # 50 tokens overhead

3. No Monitoring

❌ Bad:

python
# Blind execution
model.generate(query)

✅ Good:

python
# Track everything
result = model.generate(query)
log_query(query=query, tier=tier, tokens=tokens_used,
          latency=latency, cost=cost)
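The "good" snippet leaves `tier`, `tokens_used`, `latency`, and `cost` undefined, so here is a minimal sketch of where they could come from. `tracked_generate`, the in-memory `query_log`, and the $5 per million token price are assumptions for illustration. `generate` is any callable that takes the query and returns text.

```python
import time

# ASSUMPTION: illustrative price, USD per 1M tokens.
PRICE_PER_MILLION = 5.00
query_log: list[dict] = []

def log_query(**fields) -> None:
    """Append one record; swap in a real logger or metrics sink as needed."""
    query_log.append(fields)

def tracked_generate(generate, query: str, tier: int, context_tokens: int) -> str:
    """Call `generate(query)` and record tier, tokens, latency, and estimated cost."""
    start = time.perf_counter()
    result = generate(query)
    latency_ms = (time.perf_counter() - start) * 1000
    log_query(
        query=query,
        tier=tier,
        tokens=context_tokens,
        latency=latency_ms,
        cost=context_tokens / 1_000_000 * PRICE_PER_MILLION,
    )
    return result

# Usage with a dummy model call:
tracked_generate(lambda q: "42% used", "disk usage?", tier=0, context_tokens=50)
print(query_log[-1]["tier"], query_log[-1]["tokens"])
```

With one record per query you can group by tier later and see where the tokens actually go.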

Quick Start Checklist

Before you optimize:

  • Instrument your agent first. You can't save what you can't measure.
  • Classify your query mix. Run for one day with naive loading. Categorize each query as simple/medium/complex. This is your baseline.
  • Implement tier routing. Start simple — keyword-based. Tier 0 for status checks, Tier 1 for memory queries. No ML needed.
  • Set token budgets per tier:
    • Tier 0 = 200 tokens max
    • Tier 1 = 2,000 tokens max
    • Tier 2 = 8,000 tokens max
  • Add semantic memory search. Replace blanket loads with targeted search. Biggest gains here.
  • Monitor for one week. Compare against baseline. Adjust thresholds.
  • Re-classify monthly. Query patterns change.
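The budget and classification steps in the checklist can be sketched in a few lines. The ceilings are the ones listed above. `TIER_BUDGETS`, `over_budget`, and `classify_mix` are hypothetical names, and leaving Tier 3 uncapped is an assumption, since the checklist gives it no budget.

```python
# Per-tier ceilings from the checklist; Tier 3 is left uncapped here
# because the checklist does not give it a budget (ASSUMPTION).
TIER_BUDGETS = {0: 200, 1: 2_000, 2: 8_000}

def over_budget(tier: int, tokens: int) -> bool:
    """True if a query's context exceeds the ceiling for its tier."""
    budget = TIER_BUDGETS.get(tier)
    return budget is not None and tokens > budget

def classify_mix(records: list[dict]) -> dict[int, float]:
    """Share of queries per tier, e.g. for the baseline or the monthly re-check."""
    total = len(records)
    counts: dict[int, int] = {}
    for rec in records:
        counts[rec["tier"]] = counts.get(rec["tier"], 0) + 1
    return {tier: n / total for tier, n in sorted(counts.items())} if total else {}

day = [{"tier": 0}, {"tier": 0}, {"tier": 1}, {"tier": 2}]
print(classify_mix(day))
print(over_budget(0, 201))
```

Run `classify_mix` on a day of logged queries for the baseline, then again each month to catch drift in the query pattern.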

Next Steps

Want to deploy your own AI agent? SumoPod makes it easy. The VPS comes ready to use, just plug in:

SumoPod — One-Click AI Agent VPS

Technical deep-dive (English): Token-Efficient AI Agents: Context Tiering on GitHub

Full OpenClaw tutorials:

  • OpenClaw Troubleshooting Guide
  • OpenClaw Gateway Setup
  • OpenClaw Session Maintenance


Part of OpenClaw SumoPod series — deploy your own AI agent on VPS.

Zainul Fanani

Founder, Radian Group. Engineering & tech enthusiast.

© 2026 Catatan Fanani. All rights reserved.