I opened the Tool Attention paper expecting theory. What I found was production-ready pseudocode. The authors did not just prove lazy loading works; they showed exactly how to build it.

This guide walks through implementing Tool Attention in a local agent framework, specifically Hermes Agent, though the patterns apply to any MCP-compatible setup. We will cover the attention-based routing mechanism, the schema caching layer, and the fallback paths for when predictions fail.

The paper reports a 40% to 60% reduction in overall inference latency. I am implementing it to verify those numbers on local hardware, using the same 397B model and tool set as the paper but a different architecture.

Here is the complete implementation.

Prerequisites

You need three things before starting:

  • An MCP-compatible agent framework such as Hermes Agent or LangChain.
  • Tool usage logging capability.
  • A caching layer like Redis or an in-memory dictionary for testing.

The implementation consists of four components: tool usage logging, dynamic tool gating, lazy schema loading, and confidence-based fallback. Each component is independent, allowing you to ship them incrementally.

Component 1: Tool Usage Logging

The gating model needs historical data to learn which tools are used together and which prompts trigger specific tool categories. Without logging, there is no training signal.

Add this to your agent's tool execution path:

import json
from datetime import datetime, timezone
from pathlib import Path

class ToolUsageLogger:
    def __init__(self, log_path: str = "~/.hermes/tool_usage.jsonl"):
        self.log_path = Path(log_path).expanduser()
        self.log_path.parent.mkdir(parents=True, exist_ok=True)
    
    def log(self, session_id: str, prompt: str, tool_name: str, 
            tool_category: str, latency_ms: int):
        entry = {
            "timestamp": datetime.utcnow().isoformat(),
            "session_id": session_id,
            "prompt_truncated": prompt[:200],  # First 200 chars
            "tool_name": tool_name,
            "tool_category": tool_category,  # e.g., "filesystem", "git", "web"
            "latency_ms": latency_ms
        }
        with open(self.log_path, "a") as f:
            f.write(json.dumps(entry) + "\n")

Categorize your tools upfront. The gating model predicts categories first, then specific tools within those categories, which reduces the prediction space.

For Hermes Agent, the categories look like this:

Category     Tools                                         Typical Usage Pattern
filesystem   read_file, write_file, patch, search_files   Coding sessions, file ops
git          terminal (git commands), github-*             Code review, PR workflows
web          web_search, browser_*                         Research tasks
agent        delegate_task, clarify                        Multi-agent orchestration
memory       memory, session_search                        Cross-session context

Log every tool call. The model needs this data to learn patterns.

Component 2: Dynamic Tool Gating

The gating model predicts which tools will be needed before loading schemas. While the paper uses attention-based routing, a simpler classifier works well for initial implementation.

Start with category-level prediction:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression
import numpy as np

class ToolGatingModel:
    def __init__(self, categories: list[str]):
        self.categories = categories
        self.vectorizer = TfidfVectorizer(max_features=1000, ngram_range=(1, 2))
        self.classifier = OneVsRestClassifier(LogisticRegression())
        self.is_trained = False
    
    def train(self, prompts: list[str], category_labels: list[list[str]]):
        X = self.vectorizer.fit_transform(prompts)
        y = np.zeros((len(category_labels), len(self.categories)))
        for i, labels in enumerate(category_labels):
            for label in labels:
                if label in self.categories:
                    y[i, self.categories.index(label)] = 1
        self.classifier.fit(X, y)
        self.is_trained = True
    
    def predict(self, prompt: str, confidence_threshold: float = 0.6) -> tuple[list[str], dict[str, float]]:
        if not self.is_trained:
            return self.categories, {}  # Fallback: all categories
        
        X = self.vectorizer.transform([prompt])
        predictions = self.classifier.predict_proba(X)[0]
        
        predicted = []
        confidences = {}
        for i, prob in enumerate(predictions):
            confidences[self.categories[i]] = float(prob)
            if prob >= confidence_threshold:
                predicted.append(self.categories[i])
        
        if not predicted:
            predicted = [self.categories[np.argmax(predictions)]]
        
        return predicted, confidences

Train the model on your logged data by extracting prompts and the categories of tools that were actually used.

def prepare_training_data(log_path: str, min_sessions: int = 50):
    from collections import defaultdict
    from pathlib import Path
    import json
    
    sessions = defaultdict(lambda: {"prompt": "", "categories": set()})
    
    with open(Path(log_path).expanduser(), "r") as f:
        for line in f:
            entry = json.loads(line)
            session_id = entry["session_id"]
            # Keep the session's first prompt: it is what triggered the tool calls.
            if not sessions[session_id]["prompt"]:
                sessions[session_id]["prompt"] = entry["prompt_truncated"]
            sessions[session_id]["categories"].add(entry["tool_category"])
    
    prompts = []
    labels = []
    for session_data in sessions.values():
        if session_data["prompt"] and len(session_data["categories"]) > 0:
            prompts.append(session_data["prompt"])
            labels.append(list(session_data["categories"]))
    
    if len(prompts) < min_sessions:
        raise ValueError(f"Need at least {min_sessions} sessions to train; found {len(prompts)}")
    
    return prompts, labels
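
Putting the two together looks like this; the category list matches the table above, and the example prompt is illustrative:

prompts, labels = prepare_training_data("~/.hermes/tool_usage.jsonl")

categories = ["filesystem", "git", "web", "agent", "memory"]
model = ToolGatingModel(categories)
model.train(prompts, labels)

predicted, confidences = model.predict("Fix the failing test in auth.py")
print(predicted)     # e.g., ["filesystem", "git"]
print(confidences)   # per-category probabilities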

Component 3: Lazy Schema Loading

Once the gating model predicts categories, you should load schemas only for those categories. This is the core lazy loading step.

from collections import defaultdict

class LazySchemaLoader:
    def __init__(self, tool_registry: dict[str, dict]):
        self.tool_registry = tool_registry
        self.schema_cache = {}
        self.category_index = defaultdict(list)
        for tool_name, info in tool_registry.items():
            self.category_index[info["category"]].append(tool_name)
    
    def load_schemas_for_categories(self, categories: list[str]) -> dict[str, dict]:
        loaded = {}
        for category in categories:
            for tool_name in self.category_index[category]:
                if tool_name not in self.schema_cache:
                    schema = self.tool_registry[tool_name]["schema"]()
                    self.schema_cache[tool_name] = schema
                loaded[tool_name] = self.schema_cache[tool_name]
        return loaded
    
    def get_schema(self, tool_name: str) -> dict:
        if tool_name not in self.schema_cache:
            schema = self.tool_registry[tool_name]["schema"]()
            self.schema_cache[tool_name] = schema
        return self.schema_cache[tool_name]

The key insight is that each schema entry in the registry is a callable rather than a pre-built dictionary. It executes only on first access, so the cost of constructing or fetching a schema is deferred until its category is actually predicted.
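
For reference, here is the registry shape the loader expects. The entry below is illustrative; in a real setup the schema callable would fetch the definition from the MCP server rather than build it inline:

tool_registry = {
    "read_file": {
        "category": "filesystem",
        # Deferred: this lambda only runs once "filesystem" is predicted.
        "schema": lambda: {
            "name": "read_file",
            "description": "Read a file from disk",
            "parameters": {"path": {"type": "string"}},
        },
    },
}

loader = LazySchemaLoader(tool_registry)
schemas = loader.load_schemas_for_categories(["filesystem"])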

Component 4: Confidence-Based Fallback

The gating model is not perfect. The strategy below uses three confidence bands: at high confidence, load only the predicted categories; in the middle band, add the memory category as a safety net; at low confidence, fall back to eager loading so the agent still has access to every tool.

class FallbackStrategy:
    def __init__(self, gating_model: ToolGatingModel, schema_loader: LazySchemaLoader):
        self.gating_model = gating_model
        self.schema_loader = schema_loader
        self.low_confidence_threshold = 0.5
        self.high_confidence_threshold = 0.8
    
    def get_tools_for_prompt(self, prompt: str) -> dict[str, dict]:
        predicted_categories, confidences = self.gating_model.predict(prompt)
        max_confidence = max(confidences.values()) if confidences else 0
        
        if max_confidence < self.low_confidence_threshold:
            return self._load_all_schemas()
        
        if max_confidence >= self.high_confidence_threshold:
            return self.schema_loader.load_schemas_for_categories(predicted_categories)
        
        # Medium confidence: include the memory category as a safety net.
        if "memory" not in predicted_categories:
            predicted_categories.append("memory")
        return self.schema_loader.load_schemas_for_categories(predicted_categories)
    
    def _load_all_schemas(self) -> dict[str, dict]:
        return self.schema_loader.load_schemas_for_categories(
            list(self.schema_loader.category_index.keys())
        )
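
With all three components in place, the per-prompt flow looks like this. This is a sketch reusing the objects from the earlier examples; the point where the schemas are handed to the model is framework-specific:

gating_model = ToolGatingModel(categories)
gating_model.train(prompts, labels)

loader = LazySchemaLoader(tool_registry)
strategy = FallbackStrategy(gating_model, loader)

# Only the schemas the prompt is likely to need reach the model's context.
tools = strategy.get_tools_for_prompt("Search the web for the latest MCP spec")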

Benchmarking the Implementation

Overall latency drops significantly, but the most dramatic impact is on cold starts. The paper reports a 92% improvement in cold start latency, and my initial tests on local hardware show similar results.

import time
import numpy as np

def benchmark_cold_start(agent, prompt: str, iterations: int = 5):
    # Clear the agent's schema cache between iterations if you want true
    # cold-start numbers; otherwise runs after the first measure the warm path.
    latencies = []
    for _ in range(iterations):
        start = time.perf_counter()
        agent.execute(prompt)
        end = time.perf_counter()
        latencies.append((end - start) * 1000)
    return np.mean(latencies), np.std(latencies)

# legacy_agent loads every schema eagerly; new_agent uses the lazy pipeline above.
mean_before, std_before = benchmark_cold_start(legacy_agent, "List files in /home")
mean_after, std_after = benchmark_cold_start(new_agent, "List files in /home")

print(f"Before: {mean_before:.0f}ms ± {std_before:.0f}ms")
print(f"After: {mean_after:.0f}ms ± {std_after:.0f}ms")
print(f"Improvement: {(1 - mean_after/mean_before)*100:.1f}%")

Production Considerations

Watch these three factors in production:

Cache Invalidation: If tool schemas change, you must clear the cache.
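
A minimal sketch of one way to handle this; the InvalidatingSchemaLoader name and invalidate method are illustrative, not from the paper:

class InvalidatingSchemaLoader(LazySchemaLoader):
    def invalidate(self, tool_name: str | None = None):
        # Dropping a cached entry forces the schema callable to re-run on
        # next access, picking up the new definition.
        if tool_name is None:
            self.schema_cache.clear()  # wholesale, e.g., after an MCP server restart
        else:
            self.schema_cache.pop(tool_name, None)  # a single tool changed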

Cold Sessions: The first session after a restart will have no cached schemas; subsequent sessions will see the performance gain.

Model Retraining: Retrain the gating model weekly as usage patterns shift.
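
Retraining can reuse the helpers from Component 2. A sketch; the weekly schedule is left to cron or your task runner:

def retrain_gating_model(model: ToolGatingModel,
                         log_path: str = "~/.hermes/tool_usage.jsonl"):
    # Re-fit on the full log; prepare_training_data raises if there are
    # too few sessions, in which case keep the previous model.
    prompts, labels = prepare_training_data(log_path)
    model.train(prompts, labels)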

What's Next

I am running this implementation for a week to collect more benchmarks and refine the gating model. Full results will be published next week.

The "tools tax" is optional. It is time to stop paying it.

About the Author

Vinny Barreca writes about the messy reality of AI infrastructure at Phantom Byte. He tracks sovereign AI developments and open source model releases.
