How AI Searches Through Your Codebase
An exploration and systematic introduction to the different techniques AI software uses to search across your codebase.
When VS Code Copilot first came out, I saw a little animation at the corner of my screen saying that the codebase was "indexing". I didn't think much of it at the time. Then I noticed Cursor doing the same thing. As my codebase grew larger and more complex, I got curious about how AI was so good at finding relevant code. Looking into it, I realized that AI was semantically searching through and surfacing the right snippets to work with. Later, I started using Claude Code and noticed it didn't mention anything about indexing. I wanted to do a deeper dive into how different AI tools search and interact with the codebase I'm working on.
Modern AI coding assistants need to understand your codebase to actually be useful. The question is: how do they know which parts of your codebase are relevant to your objective? The naive approach would be to stuff the context window with the entire codebase, but that quickly becomes impractical as projects grow. Instead, we need targeted search techniques that help the AI find the right code snippets. There are two primary approaches:
- Grep-based search — Fast, on-demand text pattern matching (used by Claude Code, Codex CLI, Cline)
- Semantic indexing — Pre-computed embeddings enabling meaning-based search (used by Cursor, VS Code Copilot, Roocode)
In this post, I explore both in detail and compare the two.
Part 1: Grep-Based Search
The simplest way to find code is grep, which stands for Global Regular Expression Print. It's been a staple of Unix command lines for decades. It's fast, requires no setup, and works on any codebase immediately. Grep scans each file line by line, checks each line against a pattern (a literal string or regex), and outputs exactly which lines match. There's no pre-indexing and no semantic understanding — grep just finds exact text matches. All it requires is terminal access, making it ideal for quick lookups and exploration.
How Grep Works
Let's say we have a codebase with a couple of files in it. We can use grep to search for a specific pattern or literal string: grep scans through each file line by line, and every time a line matches, it records the match and where it was found.
AI agents are effective at using grep to explore codebases. Given access to the command, they can formulate search queries and quickly iterate by refining them based on results. The flow is simple: the user asks a question, the agent reasons about what patterns to search for, executes grep, and collects matching lines with file paths. From there, it reads relevant snippets and synthesizes them into a coherent answer — no additional infrastructure required.
Grep: Strengths and Limitations
Strengths
- Deterministic results: Function names, identifiers, and exact patterns need exact matches — grep excels at this with zero hallucinations
- Scales to massive repos: Whether you have 100 files or 100,000, grep churns through them relatively quickly
- Minimal infrastructure: No vector databases, no embedding models, no external services
- Always fresh: Every search reads the current state of your code, never a stale index
Limitations
- Token bloat: Dumping large amounts of raw code into an LLM eats context and drives up costs as repositories grow
- Time-consuming exploration: With tools like Codex CLI, agents can spend significant time iteratively grepping through the codebase to build understanding
- No semantic understanding: Grep can't find "authentication logic" if the code uses `login`, `signIn`, or `verifyCredentials`. It only finds exact text matches
Part 2: Semantic Indexing
The other approach to understanding your codebase is to build a semantic index. Instead of returning exact text matches, semantic search finds code using vector embeddings that capture meaning. Cursor, VS Code Copilot, Roocode, and Kilo Code all use this approach. They pre-index your codebase so that at query time, they can quickly find semantically relevant snippets. Semantic search requires a much larger upfront investment: a vector store, an embedding model, and an indexing pipeline.
The Indexing Pipeline
The pipeline transforms raw source code into searchable vectors:
- Scan — Traverse the codebase and read source files
- Detect — Identify programming languages
- Chunk — Parse code into semantic units using AST analysis
- Embed — Convert chunks into high-dimensional vectors
- Store — Save vectors in a database for fast retrieval
At a high level, these are the core components needed to implement semantic search.
What Gets Indexed (and What Doesn't)
The indexer is smart about what it processes. It automatically excludes:[1][2]
- Binary files and images — Non-text content that can't be meaningfully embedded
- Large files (>1MB) — Files too large to process efficiently
- Git internals — `.git` folders and repository metadata
- Dependencies — `node_modules`, `vendor`, `venv`, and other package directories
- Ignored files — Anything matching `.gitignore` patterns
This filtering ensures the index stays focused on your code, not third-party libraries or generated artifacts.
Incremental Updates
Re-indexing an entire codebase on every change would be painfully slow. Modern indexers use several strategies to stay fast:[1][2][3]
| Strategy | How It Works |
|---|---|
| File Watching | Monitors your workspace for changes in real-time |
| Smart Updates | Only reprocesses modified files, not the entire codebase |
| Hash-based Caching | Compares file hashes to avoid reprocessing unchanged content |
| Branch Aware | Automatically handles Git branch switches and updates the index accordingly |
| Multi-Folder Workspaces | Each folder maintains its own index with separate settings and status |
This means after the initial indexing, updates happen in seconds rather than minutes, even for large codebases. Some tools also use smart techniques to detect which files have changed; Cursor, for example, uses a Merkle tree.[4][5][6]
A Merkle tree works like a fingerprinting system for your codebase:
- Each file gets a unique cryptographic hash (fingerprint)
- Pairs of hashes are combined into parent hashes
- This continues until you have a single root hash representing the entire codebase
When any file changes, its hash changes and that change propagates up through all parent hashes to the root. By comparing root hashes, Cursor can instantly detect that something changed, then walk down the tree to find exactly which files need re-indexing. This approach significantly reduces bandwidth and processing time. In a workspace with 50,000 files, only the branches where hashes differ need to be examined.
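A toy version of that fingerprinting scheme makes the propagation concrete. Real implementations mirror the directory structure; here the leaves are simply a sorted list of file hashes:

```python
import hashlib

def h(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def merkle_root(leaf_hashes):
    """Pairwise-combine hashes level by level until a single root remains."""
    level = list(leaf_hashes)
    while len(level) > 1:
        if len(level) % 2:                    # odd count: carry the last hash up
            level.append(level[-1])
        level = [h((a + b).encode()) for a, b in zip(level[::2], level[1::2])]
    return level[0]

files = {"a.py": b"def f(): pass", "b.py": b"x = 1",
         "c.py": b"y = 2", "d.py": b"z = 3"}
root_before = merkle_root([h(c) for _, c in sorted(files.items())])

files["b.py"] = b"x = 42"                     # one file changes...
root_after = merkle_root([h(c) for _, c in sorted(files.items())])
# ...and the change propagates up to the root, so one comparison detects it.
```

Because the roots differ, the indexer knows *something* changed; walking down only the branches whose hashes differ pinpoints the modified file without rehashing everything else.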
Another smart technique that Cursor uses is index reuse. Cursor found that clones average 92% similarity across users within an organization.[6] Instead of rebuilding every index from scratch, Cursor can securely reuse a teammate's existing index:
| Repo Size | Without Reuse | With Reuse |
|---|---|---|
| Median | 7.87 seconds | 525 ms |
| 90th percentile | 2.82 minutes | 1.87 seconds |
| 99th percentile | 4.03 hours | 21 seconds |
This works through similarity hashing — the client computes a simhash from its Merkle tree and searches for matching indexes from teammates. Cryptographic proofs ensure you only see results for code you actually have locally. This reduces the number of files that need to be re-indexed, making the process much faster for large teams.
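Simhash is a similarity-preserving hash: unlike a cryptographic hash, nearly identical inputs produce fingerprints that differ in only a few bits, so Hamming distance tracks how similar two file sets are. A minimal sketch (the feature sets here are made-up file lists, not Cursor's actual inputs):

```python
import hashlib

def simhash(features, bits=64):
    """Sum each feature's hash bits as +1/-1 votes; the sign of each total is one output bit."""
    counts = [0] * bits
    for feature in features:
        fh = int(hashlib.sha256(feature.encode()).hexdigest(), 16)
        for i in range(bits):
            counts[i] += 1 if (fh >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if counts[i] > 0)

def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")

repo_a = [f"file_{i}.py" for i in range(100)]
repo_b = repo_a[:92] + [f"other_{i}.py" for i in range(8)]   # ~92% overlap
repo_c = [f"unrelated_{i}.rs" for i in range(100)]
# repo_a and repo_b land close together in Hamming space; repo_c lands far away,
# so a client can cheaply find a teammate's near-matching index.
```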
Embedding Caching & Privacy
Smart caching makes re-indexing fast, while privacy measures protect sensitive information:[4][5][6]
Caching Strategy
- Embeddings are cached by the hash of each chunk's content
- When code hasn't changed, the cached embedding is reused
- Indexing the same codebase a second time is nearly instant
- Teams benefit from shared caches across developers
Privacy Measures
- Only embeddings and metadata are stored remotely — raw source code stays local
- File paths are obfuscated before transmission (e.g., `src/payments/invoice.py` → `a9f3/x72k/qp1m.f4`)
- Path obfuscation hides sensitive details while preserving directory structure for filtering
- Users can control what's indexed via `.cursorignore` or similar ignore files
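The caching strategy above boils down to a lookup table keyed by each chunk's content hash. A minimal sketch, where `embed_fn` stands in for a real embedding model:

```python
import hashlib

class EmbeddingCache:
    """Cache embeddings by content hash so unchanged chunks are never re-embedded."""

    def __init__(self, embed_fn):
        self.embed_fn = embed_fn
        self.store = {}       # content hash -> embedding vector
        self.misses = 0       # counts actual embedding-model calls

    def get(self, chunk_text):
        key = hashlib.sha256(chunk_text.encode()).hexdigest()
        if key not in self.store:
            self.misses += 1                          # only pay for new/changed content
            self.store[key] = self.embed_fn(chunk_text)
        return self.store[key]

cache = EmbeddingCache(embed_fn=lambda text: [float(len(text))])  # toy embedding
cache.get("def login(): ...")
cache.get("def login(): ...")   # identical content: served from cache, no model call
```

Because the key is derived from content rather than file path, moving or renaming a file costs nothing, and a team-shared store gives every developer cache hits for code anyone has already indexed.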
Understanding Abstract Syntax Trees
Before we can chunk code intelligently, we need to understand its structure. Tree-sitter is a parser that transforms source code into an Abstract Syntax Tree (AST) — a hierarchical representation of the code's structure.
What the AST Reveals
- Hierarchical structure: Classes contain methods, functions contain statements
- Language-agnostic concepts: Functions, classes, and types are recognized across languages
- Semantic boundaries: Each node represents a complete syntactic unit
This tree structure is what enables intelligent code chunking — instead of splitting at arbitrary line numbers, we can split at meaningful boundaries.
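You can see this structure directly using Python's built-in `ast` module as a lightweight stand-in for tree-sitter (tree-sitter exposes the same ideas across many languages; the sample class here is invented for illustration):

```python
import ast

source = """
class PaymentService:
    def charge(self, amount):
        if amount <= 0:
            raise ValueError("amount must be positive")
        return {"status": "charged"}
"""

tree = ast.parse(source)

# Every class and function node is a complete syntactic unit with a known line span.
for node in ast.walk(tree):
    if isinstance(node, (ast.ClassDef, ast.FunctionDef)):
        print(f"{type(node).__name__}: {node.name} "
              f"(lines {node.lineno}-{node.end_lineno})")
```

The parser reports that the class spans lines 2–6 and the method lines 3–6: exactly the semantic boundaries a chunker wants, with the nesting (method inside class) preserved.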
Smart Code Chunking with Tree-sitter
Why This Matters
- Naive chunking splits at arbitrary line boundaries, often cutting functions in half
- Tree-sitter chunking respects semantic boundaries — each chunk is a complete function, class, or method
- Better chunks mean better embeddings, which means more accurate search results
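A boundary-respecting chunker can be sketched with Python's built-in `ast` module standing in for tree-sitter: instead of cutting every N lines, split only at top-level function and class definitions, so each chunk is a complete unit (the sample source is invented for illustration):

```python
import ast

def chunk_python(source):
    """Split source at top-level function/class boundaries, keeping each unit whole."""
    lines = source.splitlines()
    chunks = []
    for node in ast.parse(source).body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # lineno/end_lineno give the node's full span, 1-indexed and inclusive
            chunks.append("\n".join(lines[node.lineno - 1 : node.end_lineno]))
    return chunks

source = """\
import os

def read_config(path):
    return os.path.getsize(path)

class Server:
    def start(self):
        return "running"
"""

chunks = chunk_python(source)
# Two chunks: the complete function and the complete class, never split mid-body.
```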
A note on tooling: Libraries like Chonkie provide ready-made AST-aware code chunking out of the box — handling tree-sitter parsing, semantic boundary detection, and chunk extraction automatically. For production use, that's often the right choice.
Visualizing Code Embeddings
To generate these embeddings, I used jina-code-embeddings-1.5b-Q8_0[9], a model designed for code retrieval (text-to-code, code-to-code, and code-to-text). I ran it locally with Llama.cpp and embedded one of my personal projects. I also set up basic chunking using tree-sitter to split the code into functions and classes. Every function, class, and code chunk gets converted into a 1536-dimensional vector. Below, we visualize these using t-SNE to reduce them to 2D.
What You're Seeing
- Each point is a code chunk (function, class, method, or file)
- Colors represent programming languages or chunk types
- Proximity indicates semantic similarity — nearby points have similar meaning
- Clusters form naturally around related functionality
How Semantic Search Works
With embeddings stored, semantic search becomes possible. Given a natural language query:
- The query gets embedded into the same vector space as the code
- The system finds the nearest neighbors using cosine similarity
- Results are ranked by semantic similarity, not keyword matching
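The retrieval step itself is short. Below is a brute-force nearest-neighbor sketch with hand-made 3-dimensional vectors standing in for real embeddings (which have hundreds or thousands of dimensions, and are produced by the embedding model, not written by hand); production systems use approximate nearest-neighbor indexes instead of scanning every vector:

```python
import math

def cosine(a, b):
    """Cosine similarity: the angle between two vectors, ignoring their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def search(query_vec, index, top_k=2):
    """Rank stored chunks by cosine similarity to the query vector."""
    scored = [(cosine(query_vec, vec), name) for name, vec in index]
    return [name for _, name in sorted(scored, reverse=True)[:top_k]]

# Toy index: (chunk name, embedding) pairs. The two auth-related functions
# sit close together in the space even though they share no keywords.
index = [
    ("login()",             [0.9, 0.1, 0.0]),
    ("verifyCredentials()", [0.8, 0.2, 0.1]),
    ("renderChart()",       [0.0, 0.1, 0.9]),
]
query = [0.85, 0.15, 0.05]   # pretend embedding of "authentication logic"
```

Note what grep could never do here: the query string "authentication logic" appears nowhere in the code, yet both auth functions outrank the chart renderer because their vectors point in the same direction as the query's.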
Part 3: Comparing the Approaches
Both grep-based search and semantic indexing have their place — the right choice depends on your priorities.[7][8]
The Trade-off
As Nick Baumann from Cline explains, indexing introduces complexity that may not be worth the trade-off:[7]
"We don't index your codebase, and this choice isn't an oversight — it's a fundamental design decision."
The creator of Claude Code echoed similar concerns, noting that indexing introduces problems around security, privacy, staleness, and reliability. Cline's blog articulates three key challenges with RAG-based approaches:[7]
- Code doesn't think in chunks — When you chunk code for embeddings, you're tearing apart its logic. A function call might be in one chunk, its definition in another, and the critical context that explains why it exists scattered across a dozen fragments.
- Indexes decay while code evolves — Software development moves fast. Functions get refactored, dependencies update, entire modules get rewritten. An index is a snapshot frozen in time — every merge is a potential divergence between reality and your AI's understanding.
- Security becomes a liability — Your codebase isn't just text — it's your competitive advantage. Creating vector embeddings means creating a secondary representation of your IP that needs to be stored somewhere, doubling your security surface.
Cursor, on the other hand, acknowledges these challenges but argues that semantic search remains valuable:[8]
- Faster results: Compute happens during indexing (offline) rather than at runtime, so searches are faster and cheaper
- Better accuracy: Custom-trained models retrieve more relevant results than string matching
- Fewer follow-ups: Users send fewer clarifying messages and use fewer tokens compared to grep-only search
- Conceptual matching: Find code by what it does, not just what it's named
As Cursor notes: "Agent uses both grep and semantic search together. Grep excels at finding exact patterns, while semantic search excels at finding conceptually similar code." Modern tools increasingly combine both approaches — grep for precision and semantic search for discovery. The future likely isn't one or the other, but intelligent orchestration of both.