Save 70% of Your AI Agent's Tokens: Context Tiering & Lean Loading

English version: Want the technical deep-dive? Check out the GitHub tutorial.

Why Is Your AI Agent Wasting Tokens?

Pause. Think about this.

Every time you ask "what's the disk usage?", your AI agent loads 50,000 tokens of context (conversation history, every memory, the entire workspace) just to answer a question that needs 150 tokens.

It doesn't make sense.

It's like calling a mechanic and making them re-read your entire car's service history before they'll check your oil level.

That means 99.7% of the loaded context was completely irrelevant to the question.

The Numbers Don't Lie

Multiply that by 200 queries per day, 30 days per month. That's $1,500/month gone just because your agent is too lazy to think about what it actually needs.
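
If you want to sanity-check those figures, here is a quick back-of-the-envelope sketch. The post doesn't state a per-token price, so the "implied price" below is only what the $1,500 figure works out to if every loaded token is billed at one flat input rate (an assumption, not a quoted price).

```python
# Back-of-the-envelope check of the figures quoted above.
TOKENS_LOADED = 50_000       # context loaded per query (from the example)
TOKENS_NEEDED = 150          # tokens the answer actually required
QUERIES_PER_DAY = 200
DAYS_PER_MONTH = 30
QUOTED_MONTHLY_COST = 1_500  # USD, as stated in the post

wasted_share = 1 - TOKENS_NEEDED / TOKENS_LOADED
monthly_tokens = TOKENS_LOADED * QUERIES_PER_DAY * DAYS_PER_MONTH

# Assumption: one flat input rate applies to every loaded token.
implied_price_per_million = QUOTED_MONTHLY_COST / (monthly_tokens / 1_000_000)

print(f"Irrelevant context: {wasted_share:.1%}")
print(f"Tokens per month: {monthly_tokens:,}")
print(f"Implied price: ${implied_price_per_million:.2f} per million tokens")
```

Under that assumption, 300 million tokens a month at $5 per million gives the $1,500 quoted above.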

What if we could fix that?

Context Tiering: The Solution

The idea is stupidly simple:

"Before you load context, ask: what's the MINIMUM I actually need to answer this?"

Instead of dumping everything, we tier the context loading based on query complexity.

The Four Tiers

Think of it like this — your brain does this automatically:

Tier 0 — Zero Shot (0 tokens overhead) Simple status checks, single facts. "What's CPU load?" doesn't need conversation history.

Tier 1 — Memory Lookup (200-500 tokens) Recent context, things you discussed earlier today. Load only today's memory, search for relevance.

Tier 2 — JIT Loading (1,000-5,000 tokens) Specific project files, targeted context. Find only the files that actually matter to this query.

Tier 3 — Full Session (10,000-80,000 tokens) Complex multi-file analysis, architecture decisions. When you genuinely need everything.
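
To make the tiers concrete, here is a minimal sketch that encodes the four token ranges above as data. The `Tier` class and the `smallest_tier_for` helper are hypothetical names made up for illustration; the only thing taken from the post is the ranges themselves.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    level: int
    name: str
    min_tokens: int
    max_tokens: int


# Token ranges copied from the tier descriptions above.
TIERS = [
    Tier(0, "Zero Shot", 0, 0),
    Tier(1, "Memory Lookup", 200, 500),
    Tier(2, "JIT Loading", 1_000, 5_000),
    Tier(3, "Full Session", 10_000, 80_000),
]


def smallest_tier_for(needed_tokens: int) -> Tier:
    """Return the lowest tier whose ceiling covers the estimated need."""
    for tier in TIERS:
        if needed_tokens <= tier.max_tokens:
            return tier
    return TIERS[-1]  # anything larger still tops out at Full Session


print(smallest_tier_for(300).name)
```

Note that the ranges leave gaps (for example 500 to 1,000 tokens). In this sketch, an estimate that falls in a gap simply rounds up to the next tier.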

The Decision Flow

Benchmark Results: Real Numbers

We tested this for 7 days. Mixed workload, 200 queries per day. Here's what we found.

Setup

Results by Scenario

Simple Status Queries (50 per day)

Medium Workflows (80 per day)

Complex Analysis (70 per day)

Daily Totals

Token Savings per Day (thousands):

Monthly Impact

That's $1,398 saved every month. What could you do with that?

2 additional Claude Max seats ($299/month each)
12 months of OpenClaw Pro
15 VPS instances

Latency Improvements

Response Time Comparison (ms):

Cost by Model

The savings scale differently per model:

Even on cheaper models, the absolute savings are significant. On Kimi 2.5, you save $280/month — that's basically your AI subscription cost covered.

Implementation Patterns

Pattern 1: Lean Query Router

python

def route_to_tier(query: str) -> dict:
    """Route query to appropriate context tier."""
    q = query.lower()

    simple = ['what is', 'show me', 'list', 'is running',
              'disk', 'cpu', 'memory', 'status', 'time']

    memory_kw = ['yesterday', 'last week', 'previously',
                 'earlier', 'we were', 'did we']

    file_kw = ['in the file', 'in project', 'in code',
               'analyze', 'audit', 'review']

    # Tier 0: Simple status
    if any(s in q for s in simple):
        if not any(s in q for s in memory_kw + file_kw):
            return {"tier": 0, "context": {}, "tokens": 50}

    # Tier 1: Memory
    if any(s in q for s in memory_kw):
        return load_tier1(query)

    # Tier 2: JIT files
    if any(s in q for s in file_kw):
        return load_tier2(query)

    # Default: no keyword matched, so stay lean and load no context
    return {"tier": 0, "context": {}, "tokens": 50}
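
One caveat with the router above: `s in q` is a plain substring test, so "time" also matches "sometimes" and "list" matches "enlisted". A possible refinement, sketched here with a hypothetical `matches_any` helper (not part of the original router), is to match whole words or phrases using the standard `re` module.

```python
import re


def matches_any(query: str, keywords: list[str]) -> bool:
    """True if any keyword appears in the query as a whole word or phrase."""
    q = query.lower()
    return any(
        # \b anchors the keyword at word boundaries on both sides.
        re.search(rf"\b{re.escape(kw.lower())}\b", q) is not None
        for kw in keywords
    )


print(matches_any("What time is it?", ["time"]))
print(matches_any("Sometimes the build fails", ["time"]))
```

You could swap `any(s in q for s in simple)` for `matches_any(q, simple)` and keep the rest of the routing logic unchanged.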

Pattern 2: Token Budget

python

def execute_with_budget(query: str, max_tokens: int = 5000) -> dict:
    """Execute with hard token ceiling."""
    tier_data = route_to_tier(query)

    if tier_data["tokens"] > max_tokens:
        tier_data = compress_to_budget(tier_data, max_tokens)

    result = model.generate(
        system=get_system_prompt(),
        context=tier_data["context"],
        query=query
    )

    return {
        "result": result,
        "tokens_used": tier_data["tokens"],
        "tier": tier_data["tier"]
    }
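
The post never shows `compress_to_budget`, so here is one way it might look. This is a sketch under two assumptions: the context is a dict of name to text, and token counts are estimated at roughly 4 characters per token (a crude heuristic, not a real tokenizer). It keeps entries in order and truncates the last one that fits.

```python
def estimate_tokens(text: str) -> int:
    """Rough estimate: about 4 characters per token (heuristic, not a tokenizer)."""
    return max(1, len(text) // 4) if text else 0


def compress_to_budget(tier_data: dict, max_tokens: int) -> dict:
    """Keep context entries in order until the budget is spent."""
    kept: dict[str, str] = {}
    used = 0
    for name, text in tier_data["context"].items():
        remaining = max_tokens - used
        if remaining <= 0:
            break
        cost = estimate_tokens(text)
        if cost > remaining:
            text = text[: remaining * 4]  # cut the entry to fit what is left
            cost = estimate_tokens(text)
        kept[name] = text
        used += cost
    # Copy the other fields (tier, source, ...) and overwrite context and tokens.
    return {**tier_data, "context": kept, "tokens": used}


data = {"tier": 2, "context": {"a": "x" * 100, "b": "y" * 100}, "tokens": 50}
print(compress_to_budget(data, 30)["tokens"])
```

Plain truncation is the bluntest option. Summarizing or dropping the least relevant entries first would preserve more signal, at the cost of extra work per query.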

Pattern 3: Memory-Backed Lean Loading

python

def lean_load_with_memory(query: str) -> dict:
    """Load only today's relevant memories."""
    today_mem = load_today_memories()
    relevant = semantic_search(
        query=query,
        corpus=today_mem,
        max_tokens=400
    )

    if relevant["sufficient"]:
        return {
            "tier": 1,
            "context": relevant["content"],
            "tokens": relevant["tokens"],
            "source": "memory"
        }

    # Fallback to workspace files
    relevant_files = find_relevant_files(
        query=query,
        max_tokens=1500
    )

    return {
        "tier": 2,
        "context": relevant_files,
        "tokens": sum(f.tokens for f in relevant_files),
        "source": "workspace"
    }
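
`semantic_search` is left undefined in the pattern above. A real version would probably use embeddings; the sketch below swaps in plain keyword overlap instead, just to show the return shape (`content`, `tokens`, `sufficient`) the pattern relies on. The function name `keyword_memory_search`, the `min_overlap` threshold, and the 4-characters-per-token estimate are all assumptions for illustration.

```python
import re


def _words(text: str) -> set[str]:
    """Lowercased alphanumeric words in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def keyword_memory_search(query: str, corpus: list[str],
                          max_tokens: int = 400,
                          min_overlap: int = 2) -> dict:
    """Rank memory entries by words shared with the query, within a token cap."""
    query_words = _words(query)
    scored = sorted(
        ((len(query_words & _words(entry)), entry) for entry in corpus),
        key=lambda pair: pair[0],
        reverse=True,
    )
    picked, tokens = [], 0
    for score, entry in scored:
        cost = len(entry) // 4 + 1  # rough estimate, about 4 chars per token
        if score < min_overlap or tokens + cost > max_tokens:
            continue  # too weak a match, or it would blow the budget
        picked.append(entry)
        tokens += cost
    return {"content": picked, "tokens": tokens, "sufficient": bool(picked)}


memories = [
    "Deployed nginx config to staging server",
    "Lunch order: two pizzas",
    "Fixed the nginx staging TLS certificate",
]
print(keyword_memory_search("what did we change in nginx staging?", memories))
```

When nothing clears the overlap threshold, `sufficient` comes back `False`, which is the signal Pattern 3 uses to fall back to workspace files.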

Mistakes to Avoid

1. Over-Caching Memory

❌ Bad:

python

# Loading everything "just in case"
all_memories = load_all_memories()  # 50 files, 500K tokens

✅ Good:

python

# Load only what this query needs
relevant = semantic_search(query, corpus=today_memory, max_tokens=400)

2. Full Session for Simple Queries

❌ Bad:

python

# Loading 75,000 tokens for a disk check?!
session = load_full_session_history()
workspace = load_entire_workspace()
return process(query, session, workspace)

✅ Good:

python

# Zero context needed
result = run_command(query)
return format_result(result)  # 50 tokens overhead

3. No Monitoring

❌ Bad:

python

# Blind execution
model.generate(query)

✅ Good:

python

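import time

# Track everything (tier, tokens_used and cost come from your router and billing data)
start = time.perf_counter()
result = model.generate(query)
latency = time.perf_counter() - start
log_query(query=query, tier=tier, tokens=tokens_used,
          latency=latency, cost=cost)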
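
`log_query` isn't defined in the post. As a minimal sketch, it could append one structured record per query to an in-memory list and let you roll the numbers up per tier. The record fields mirror the keyword arguments used above; `QUERY_LOG` and `summarize_by_tier` are hypothetical names, and a real setup would likely write to a file or metrics system instead.

```python
import time
from collections import defaultdict

QUERY_LOG: list[dict] = []  # in-memory store, for illustration only


def log_query(query: str, tier: int, tokens: int,
              latency: float, cost: float) -> dict:
    """Append one structured record per query."""
    record = {"ts": time.time(), "query": query, "tier": tier,
              "tokens": tokens, "latency": latency, "cost": cost}
    QUERY_LOG.append(record)
    return record


def summarize_by_tier(records: list[dict]) -> dict[int, dict]:
    """Aggregate query count, tokens and cost per tier."""
    summary = defaultdict(lambda: {"queries": 0, "tokens": 0, "cost": 0.0})
    for r in records:
        bucket = summary[r["tier"]]
        bucket["queries"] += 1
        bucket["tokens"] += r["tokens"]
        bucket["cost"] += r["cost"]
    return dict(summary)


log_query("disk usage?", tier=0, tokens=50, latency=0.12, cost=0.0001)
log_query("what did we do earlier?", tier=1, tokens=400, latency=0.5, cost=0.002)
print(summarize_by_tier(QUERY_LOG))
```

A per-tier summary like this is what tells you whether most of your spend really sits in the tiers you expect.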

Quick Start Checklist

Before you optimize:

Instrument your agent first. You can't save what you can't measure.
Classify your query mix. Run for one day with naive loading. Categorize each query as simple/medium/complex. This is your baseline.
Implement tier routing. Start simple — keyword-based. Tier 0 for status checks, Tier 1 for memory queries. No ML needed.
Set token budgets per tier:
- Tier 0 = 200 tokens max
- Tier 1 = 2,000 tokens max
- Tier 2 = 8,000 tokens max
Add semantic memory search. Replace blanket loads with targeted search. Biggest gains here.
Monitor for one week. Compare against baseline. Adjust thresholds.
Re-classify monthly. Query patterns change.
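
The budget and baseline steps in the checklist can be sketched in a few lines. The ceilings are the ones listed above; the checklist gives no number for Tier 3, so this sketch treats it as uncapped (an assumption). `over_budget` and `savings_vs_baseline` are hypothetical helper names.

```python
# Max tokens per tier, taken from the checklist. Tier 3 has no listed ceiling.
TIER_BUDGETS = {0: 200, 1: 2_000, 2: 8_000}


def over_budget(tier: int, tokens: int) -> bool:
    """True when a query exceeded the ceiling for its tier."""
    limit = TIER_BUDGETS.get(tier)
    return limit is not None and tokens > limit


def savings_vs_baseline(baseline_tokens: int, tiered_tokens: int) -> float:
    """Fraction of tokens saved relative to the naive-loading baseline."""
    if baseline_tokens <= 0:
        raise ValueError("baseline_tokens must be positive")
    return 1 - tiered_tokens / baseline_tokens


print(over_budget(0, 250))
print(f"{savings_vs_baseline(1_000, 300):.0%}")
```

Feed it the token totals from your one-day baseline and your one-week tiered run to see how close you land to the headline number.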

Next Steps

Want to deploy your own AI agent? SumoPod makes it easy: a ready-to-use VPS, just plug in and go.

→ SumoPod — One-Click AI Agent VPS

Technical deep-dive (English): → Token-Efficient AI Agents: Context Tiering on GitHub

Complete OpenClaw tutorials: → OpenClaw Troubleshooting Guide → OpenClaw Gateway Setup → OpenClaw Session Maintenance

Part of OpenClaw SumoPod series — deploy your own AI agent on VPS.
