Is Grep All You Need? How Agent Harnesses Reshape Agentic Search


📝 ARTICLE INFORMATION

  • Title: Is Grep All You Need? How Agent Harnesses Reshape Agentic Search
  • Authors: Sahil Sen, Akhil Kasturi, Elias Lumer, Anmol Gulati, Vamse Kumar Subbiah (PwC)
  • Publication: arXiv preprint, May 2026
  • URL: https://arxiv.org/abs/2605.15184

🎯 HOOK

Most teams building RAG systems default to vector search. This paper runs a controlled comparison of grep and vector retrieval across four agent harnesses and finds that grep wins on accuracy whenever results are delivered inline. The more surprising result: switching harnesses (Chronos vs. Claude Code vs. Codex vs. Gemini CLI) shifts accuracy by roughly as much as switching retrievers. The agent scaffold is not passive infrastructure.

💡 ONE-SENTENCE TAKEAWAY

Lexical search (grep) consistently outperforms vector retrieval on long-memory QA when results are delivered inline, but the choice of agent harness and tool-calling architecture (inline vs. file-based) shifts accuracy as much as the retrieval strategy itself.

📖 SUMMARY

This paper from PwC systematically compares grep and vector retrieval across two experiments using a 116-question subset of the LongMemEval benchmark.

Experiment 1 tests four agent harnesses (Chronos, Claude Code, Codex, Gemini CLI) with five LLMs (Claude Opus 4.6, Claude Haiku 4.5, GPT-5.4, Gemini 3.1 Pro, Gemini 3.1 Flash-Lite) under both inline and file-based tool delivery. The results: inline grep beats inline vector on every single harness-model pair. The widest gap is Chronos + Gemini Flash-Lite (86.2% vs 62.9%). The narrowest is Claude Code + Claude Opus 4.6 (76.7% vs 75.0%). But the harness matters enormously: Claude Opus 4.6 hits 93.1% under Chronos but only 76.7% under Claude Code.

Experiment 2 tests how both retrieval methods degrade as irrelevant noise is mixed in. Grep holds up better across the board, and both methods show surprisingly graceful degradation as more distractor sessions are added.

The paper’s central contribution is showing that “retrieval” in agent systems is really retrieval-plus-orchestration. The harness shapes system prompts, tool descriptions, context construction, and termination criteria. Reporting BM25 vs. ANN in a static pipeline misses the variance introduced by agent scaffolding.

🔍 INSIGHTS

Core Insights:

  • Grep is the default winner on literal-recall tasks. LongMemEval answers are often specific facts, dates, and preferences. Grep surfaces exact matches without an embedding bottleneck. Vector search can drown in semantically “near” distractors.

  • Harness choice shifts accuracy as much as retriever choice. Moving Claude Opus 4.6 between Chronos and Claude Code changes accuracy from 93.1% to 76.7%. That 16.4-point gap is comparable to the gap between grep and vector within a single harness. The scaffold is not neutral infrastructure.

  • File-based delivery reshuffles the comparison entirely. When results are written to files instead of injected inline, vector search beats grep on 5 out of 10 harness-model pairs. But accuracy can collapse independently of retrieval quality when the read-integrate-retry cycle breaks, as happened with Codex + GPT-5.4 (93.1% inline grep to 55.2% programmatic grep).

  • Weaker models depend more on retrieval mode. Claude Haiku 4.5 shows an 11.2-point inline grep-vector gap on Claude Code (55.2% vs 44.0%), and Gemini 3.1 Flash-Lite shows the widest gap of all under Chronos (86.2% vs 62.9%). Stronger models are more robust to retriever choice.

Broader Connections:

  • The “Lost in the Middle” problem applies to tool results too. Inline delivery competes with conversation history for context window space. File-based delivery decouples result size from context pressure but adds a multi-step workflow the agent must execute reliably.

  • Chronos’s category-conditioned prompting may explain its advantage. Custom harnesses can optimize system prompts per question type (temporal reasoning vs preference recall). Provider CLIs use generic prompts. That domain-specific adaptation is a real advantage.

  • This mirrors the BEIR benchmark findings. Thakur et al. (2021) showed BM25 remains a competitive baseline against dense retrieval in zero-shot settings. The same pattern holds inside agent loops.

  • For practitioners: if your task rewards literal span recovery, grep may be all you need. But measure both your harness and your retriever, because each can move accuracy by a comparable margin.
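
The inline vs. file-based distinction can be sketched as two ways of handing the same hits back to the agent. This is a minimal illustration, not the paper's harness code: the function names, the `results.txt` filename, and the 2,000-character budget are all assumptions made for the example.

```python
import os
import tempfile

def deliver_inline(hits, max_chars=2000):
    """Inline delivery: hits go straight into the tool message, so they
    compete with conversation history for context (budget is hypothetical)."""
    return "\n".join(hits)[:max_chars]

def deliver_to_file(hits, directory):
    """File-based delivery: hits are written to disk and the tool message
    carries only a pointer. The agent must open the file in a later step."""
    path = os.path.join(directory, "results.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(hits))
    return f"Wrote {len(hits)} hits to {path}"

def read_artifact(path):
    """The extra step that file-based delivery adds: locate and read the artifact."""
    with open(path, encoding="utf-8") as f:
        return f.read().splitlines()

# Toy hits standing in for retrieved conversation snippets.
hits = [f"session {i}: user mentioned the trip to Lisbon" for i in range(100)]

inline_msg = deliver_inline(hits)  # truncated to fit the budget
with tempfile.TemporaryDirectory() as tmp:
    file_msg = deliver_to_file(hits, tmp)
    recovered = read_artifact(os.path.join(tmp, "results.txt"))  # all hits survive
```

The inline message is cut off at the budget, while the file keeps every hit but only pays off if the agent reliably performs the follow-up read.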

🛠️ FRAMEWORKS & MODELS

Experimental Design:

| Dimension | Conditions |
|---|---|
| Retrieval Strategy | grep-only, vector-only |
| Agent Harnesses | Chronos (custom), Claude Code, Codex, Gemini CLI |
| Tool Delivery | inline (stdout), programmatic (file-based) |
| Models | Claude Opus 4.6, Claude Haiku 4.5, GPT-5.4, Gemini 3.1 Pro, Gemini 3.1 Flash-Lite |
| Benchmark | LongMemEval-S (116 questions, 6 categories) |
| Evaluation | GPT-4o grader with category-conditioned rubrics |

Experiment 1: Key Results (accuracy %):

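| Model | Harness | Grep Inline | Vector Inline | Grep File | Vector File |
|---|---|---|---|---|---|
| Claude Opus 4.6 | Chronos | 93.1 | 83.6 | 80.2 | 81.9 |
| Claude Opus 4.6 | Claude Code | 76.7 | 75.0 | 68.1 | 79.3 |
| GPT-5.4 | Chronos | 89.7 | 81.9 | 87.1 | 75.0 |
| GPT-5.4 | Codex CLI | 93.1 | 75.9 | 55.2 | 67.2 |
| Gemini 3.1 Pro | Chronos | 91.4 | 82.8 | 79.3 | 76.7 |
| Gemini 3.1 Flash-Lite | Chronos | 86.2 | 62.9 | 85.3 | 72.4 |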
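
The retriever, harness, and delivery effects can be read straight off these rows with a little arithmetic. The sketch below only restates the numbers listed above; the variable names are illustrative, and it covers just the six rows shown here, not all ten harness-model pairs in the paper.

```python
# Accuracy (%) copied from the Experiment 1 rows above.
# Tuple order: (grep inline, vector inline, grep file, vector file).
results = {
    ("Claude Opus 4.6", "Chronos"): (93.1, 83.6, 80.2, 81.9),
    ("Claude Opus 4.6", "Claude Code"): (76.7, 75.0, 68.1, 79.3),
    ("GPT-5.4", "Chronos"): (89.7, 81.9, 87.1, 75.0),
    ("GPT-5.4", "Codex CLI"): (93.1, 75.9, 55.2, 67.2),
    ("Gemini 3.1 Pro", "Chronos"): (91.4, 82.8, 79.3, 76.7),
    ("Gemini 3.1 Flash-Lite", "Chronos"): (86.2, 62.9, 85.3, 72.4),
}

# Retriever effect: inline grep minus inline vector, per row.
inline_gap = {pair: round(acc[0] - acc[1], 1) for pair, acc in results.items()}

# Harness effect: same model and retriever (inline grep), different harness.
harness_gap = round(
    results[("Claude Opus 4.6", "Chronos")][0]
    - results[("Claude Opus 4.6", "Claude Code")][0],
    1,
)

# Delivery effect: rows where file-based vector beats file-based grep.
file_vector_wins = [pair for pair, acc in results.items() if acc[3] > acc[2]]

for pair, gap in inline_gap.items():
    print(pair, "inline grep - vector:", gap)
print("Opus 4.6 harness gap (Chronos - Claude Code):", harness_gap)
print("File-based rows where vector wins:", len(file_vector_wins))
```

Among these rows the inline retriever gap ranges from 1.7 to 23.3 points, and the 16.4-point harness gap for Claude Opus 4.6 sits inside that range.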

Experiment 2: Context Scaling (noise robustness):

The experiment measures grep-only and vector-only accuracy across session limits (s5, s10, s20, s30, full). Grep degrades more gracefully than vector as distractor sessions accumulate, though both methods show moderate robustness.
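
One way to picture the session limits is as a cap on how many sessions the agent can search, with the evidence sessions always kept and the rest filled with distractors. This is a hedged sketch: it assumes "s5" means at most five sessions, and the sampling procedure, the `cap_sessions` helper, and the session dictionaries are hypothetical rather than taken from the paper.

```python
import random

# Assumed reading of the labels: sN = at most N sessions, full = no cap.
LIMITS = {"s5": 5, "s10": 10, "s20": 20, "s30": 30, "full": None}

def cap_sessions(sessions, evidence_ids, limit, seed=0):
    """Keep every evidence session, then fill up to `limit` with randomly
    sampled distractors. limit=None keeps the full history."""
    if limit is None or limit >= len(sessions):
        return list(sessions)
    evidence = [s for s in sessions if s["id"] in evidence_ids]
    distractors = [s for s in sessions if s["id"] not in evidence_ids]
    n_fill = max(0, limit - len(evidence))  # never drop evidence to hit the cap
    kept = evidence + random.Random(seed).sample(distractors, n_fill)
    return sorted(kept, key=lambda s: s["id"])  # restore original order

# Toy history: 40 sessions, two of which contain the answer.
sessions = [{"id": i, "text": f"session {i}"} for i in range(40)]
evidence_ids = {3, 17}

haystacks = {name: cap_sessions(sessions, evidence_ids, limit)
             for name, limit in LIMITS.items()}
sizes = {name: len(h) for name, h in haystacks.items()}
```

Keeping the evidence fixed while varying only the distractor count is what lets accuracy differences be attributed to noise rather than to missing answers.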

💬 QUOTES

  1. “Across Chronos and the provider CLIs, grep generally yields higher accuracy than vector retrieval in our comparisons in experiment 1; at the same time, overall scores still depend strongly on which harness and tool-calling style is used, even when the underlying conversation data are the same.”

    Context: Abstract summarizing the core finding. Significance: The sentence that should give every RAG practitioner pause.

  2. “Retrieval mode is not measured in isolation: the harness shapes the system prompt, tool descriptions, and how hits are rendered back into the chat, all of which influence how the model schedules queries and decides when to stop.”

    Context: Section 4.1.4 discussion. Significance: Explains why harness effects are as large as retriever effects.

  3. “Programmatic delivery changes the task from ‘read the tool message’ to ‘locate, open, and integrate an artifact.’ When that second stage is brittle, accuracy can collapse independently of retrieval quality.”

    Context: Section 4.1.4 on file-based tool delivery. Significance: A caution for anyone designing file-based agent workflows.

  4. “Lexical and dense search optimize different failure modes in an agent loop, not only in a ranking metric.”

    Context: Section 4.1.4. Significance: Grep punishes vocabulary mismatch. Vector punishes semantic distractors. Neither is universally better; the task determines which failure mode matters.

⚡ APPLICATIONS

For RAG System Designers:

  • Default to grep on tasks where answers are literal spans (names, dates, IDs, preferences). Add vector search as a fallback for paraphrase-heavy queries.
  • If you use file-based tool delivery, test the full read-integrate loop separately from retrieval quality. The paper shows accuracy can collapse at the workflow level even when retrieval is fine.
  • Profile your harness before your retriever. Moving Claude Opus 4.6 from Chronos to Claude Code drops inline grep accuracy by 16.4 points (93.1% to 76.7%), a shift on the same scale as swapping retrievers.
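
The grep-first, vector-as-fallback advice might look like the sketch below. It is an illustration under stated assumptions: the function names are hypothetical, and the fallback uses a bag-of-words cosine score as a stand-in for a real embedding model, which a production system would replace.

```python
import math
import re
from collections import Counter

def grep_search(terms, sessions):
    """Case-insensitive literal match: sessions containing any of the terms."""
    if not terms:
        return []
    pattern = re.compile("|".join(re.escape(t) for t in terms), re.IGNORECASE)
    return [s for s in sessions if pattern.search(s)]

def _bag_of_words(text):
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def _cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def similarity_search(query, sessions, top_k=1):
    """Stand-in for vector retrieval (NOT a real embedding model)."""
    q = _bag_of_words(query)
    ranked = sorted(sessions, key=lambda s: _cosine(q, _bag_of_words(s)), reverse=True)
    return ranked[:top_k]

def retrieve(query, terms, sessions):
    """Try literal matching first; fall back to similarity only on a miss."""
    hits = grep_search(terms, sessions)
    if hits:
        return "grep", hits
    return "fallback", similarity_search(query, sessions)

sessions = [
    "I booked the flight to Lisbon for 2024-03-14.",
    "My favorite tea is oolong.",
    "We talked about hiking boots and trail maps.",
]

# Literal span: the search term appears verbatim in the history.
mode1, hits1 = retrieve("When is the Lisbon flight?", ["Lisbon"], sessions)
# Vocabulary mismatch: "beverage" never appears, so the fallback runs.
mode2, hits2 = retrieve("what is my favorite tea", ["beverage"], sessions)
```

The ordering reflects the section's point: literal matching is cheap and precise when the answer is a span, and the similarity path only has to cover the cases grep misses.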

For Benchmark Designers:

  • Report both retrieval strategy and harness architecture in evaluation results. Accuracy without the harness context is missing a confounding variable.
  • Include an inline vs. programmatic tool delivery dimension. The interaction with retrieval strategy is real and measurable.
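
A result record that carries harness and delivery alongside the retriever could be as simple as the sketch below. The `EvalRecord` class and its field names are hypothetical; the two accuracy values are the Codex CLI + GPT-5.4 grep numbers reported above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EvalRecord:
    """One accuracy number together with every condition that produced it."""
    model: str
    harness: str
    retriever: str   # "grep" or "vector"
    delivery: str    # "inline" or "file"
    accuracy: float  # percent

    def condition(self):
        return f"{self.model} / {self.harness} / {self.retriever} / {self.delivery}"

records = [
    EvalRecord("GPT-5.4", "Codex CLI", "grep", "inline", 93.1),
    EvalRecord("GPT-5.4", "Codex CLI", "grep", "file", 55.2),
]

# Same model, harness, and retriever: only the delivery mode differs.
inline, file_based = records
delivery_drop = round(inline.accuracy - file_based.accuracy, 1)

for r in records:
    print(f"{r.condition()}: {r.accuracy}%")
print("Drop from inline to file-based delivery:", delivery_drop)
```

With all four dimensions in the record, a 37.9-point drop like this one is attributable to delivery mode instead of being mistaken for a retriever effect.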

For Developers Building Agent Systems:

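  • Consider Chronos-style category-conditioned prompting if you control the harness. In the Experiment 1 results, the paper’s custom harness beat the provider CLI for the same backbone in most conditions. The one exception is inline grep with GPT-5.4, where Codex CLI scores 93.1% against 89.7% for Chronos.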
  • Be suspicious of “default to vector” recommendations. They assume a fixed pipeline that does not match how agents actually use retrieval tools.
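  • Simple retrieval with a well-tuned harness can outperform more sophisticated retrieval with a generic scaffold: Claude Opus 4.6 with inline grep under Chronos (93.1%) beats every Claude Code configuration reported for the same model, including file-based vector retrieval (79.3%).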
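
Category-conditioned prompting can be sketched as a lookup from question type to an extra instruction. The prompt text, category keys, and `build_system_prompt` helper below are hypothetical and are not the actual Chronos prompts; only the two question types named earlier in this summary are included.

```python
# Hypothetical prompt fragments for illustration only.
BASE_PROMPT = "Answer the question using the retrieved conversation history."

CATEGORY_HINTS = {
    "temporal_reasoning": "Collect every dated mention first, then order the events before answering.",
    "preference_recall": "Prefer the user's most recent stated preference over older ones.",
}

def build_system_prompt(category):
    """Append a category-specific hint when one exists; otherwise fall back
    to the generic prompt, which is roughly what a provider CLI would use."""
    hint = CATEGORY_HINTS.get(category)
    return BASE_PROMPT if hint is None else f"{BASE_PROMPT}\n{hint}"

temporal_prompt = build_system_prompt("temporal_reasoning")
generic_prompt = build_system_prompt("unknown_category")
```

The design choice is that unknown categories degrade to the generic prompt instead of failing, so the conditioning is an optimization layered on a working default.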

📚 REFERENCES

Primary Source:

  • Sen, S., Kasturi, A., Lumer, E., Gulati, A., & Subbiah, V. K. (2026). “Is Grep All You Need? How Agent Harnesses Reshape Agentic Search.” arXiv:2605.15184. https://arxiv.org/abs/2605.15184

Key References Cited:

  • Wu et al. (2025). “LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory.”
  • Sen et al. (2026). “Chronos: Temporal-Aware Conversational Agents with Structured Event Retrieval for Long-Term Memory.”
  • Lewis et al. (2020). “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.”
  • Gao et al. (2024). “Retrieval-Augmented Generation for Large Language Models: A Survey.”
  • Thakur et al. (2021). “BEIR: A Heterogenous Benchmark for Zero-Shot Evaluation of Information Retrieval Models.”
  • Liu et al. (2024). “Lost in the Middle: How Language Models Use Long Contexts.”
  • Yao et al. (2023). “ReAct: Synergizing Reasoning and Acting in Language Models.”
  • Packer et al. (2023). “MemGPT: Towards LLMs as Operating Systems.”
