Priyanshu Mahey.

How AI Searches Through Your Codebase

An exploration and systematic introduction to the different techniques AI software uses to search across your codebase.


When VS Code Copilot first came out, I saw a little animation in the corner of my screen saying that the codebase was "indexing". This was a while back and I didn't think much of it. Then I noticed Cursor doing the same thing. It wasn't until I started using Claude Code, which never mentioned indexing at all, that I got curious about how different AI tools actually search and interact with the codebase I'm building.

Modern AI coding assistants need to understand your codebase to actually be useful. The question is, how do they know which parts of your codebase are relevant to your objective? The naive approach would be to stuff the entire codebase into the context window, but that quickly becomes impractical as projects grow. Instead, we need targeted search techniques that help the AI find the right code snippets to work with. There are two primary approaches:

  1. Grep-based search — Fast, on-demand text pattern matching (used by Claude Code, Codex CLI, Cline)
  2. Semantic indexing — Pre-computed embeddings enabling meaning-based search (used by Cursor, VS Code Copilot, Roocode)

The goal of this blog post is to explore both approaches in detail and understand why different tools choose one over the other. This isn't a deep dive into the accuracy or performance of each method, since that's best left to empirical benchmarks. Instead, I focus on implementation, compiling information from various sources.


Part 1: Grep-Based Search

The simplest way to find code is grep, which stands for Global Regular Expression Print. It's nothing new; it's been a staple of Unix command lines for decades. Grep scans each file line by line, checks it against a pattern (a literal string or a regex), and outputs exactly which lines match. There's no pre-indexing and no understanding of the code: grep just finds exact text matches. It needs nothing more than terminal access, which makes it fast, ready to use on any codebase immediately, and ideal for quick lookups and exploration.

How Grep Works

Let's say we have a codebase with a couple of files in it. We can use grep to search for literal strings or regex patterns. Grep walks through each file line by line, and every time a line matches, it records the file path, line number, and matching text. Here are the example source files we'll search through:

src/services/auth.ts

```typescript
import { hash, verify } from 'crypto';
import { User, Session } from './types';

export async function authenticateUser(
  email: string,
  password: string
): Promise<Session | null> {
  const user = await findUserByEmail(email);
  if (!user) return null;

  const isValid = await verify(password, user.passwordHash);
  if (!isValid) return null;

  return createSession(user);
}

export async function validateSession(
  token: string
): Promise<User | null> {
  const session = await findSession(token);
  if (!session || isExpired(session)) {
    return null;
  }
  return session.user;
}

function isExpired(session: Session): boolean {
  return Date.now() > session.expiresAt;
}
```

src/services/database.ts

```typescript
import { Pool, QueryResult } from 'pg';

const pool = new Pool({
  connectionString: process.env.DATABASE_URL,
  max: 20
});

export async function query<T>(
  sql: string,
  params?: unknown[]
): Promise<T[]> {
  const client = await pool.connect();
  try {
    const result = await client.query(sql, params);
    return result.rows as T[];
  } finally {
    client.release();
  }
}

export async function findUserByEmail(
  email: string
): Promise<User | null> {
  const users = await query<User>(
    'SELECT * FROM users WHERE email = $1',
    [email]
  );
  return users[0] || null;
}

export async function findSession(
  token: string
): Promise<Session | null> {
  const sessions = await query<Session>(
    'SELECT * FROM sessions WHERE token = $1',
    [token]
  );
  return sessions[0] || null;
}
```

src/routes/api.ts

```typescript
import { Router, Request, Response } from 'express';
import { authenticateUser, validateSession } from '../services/auth';

const router = Router();

router.post('/login', async (req: Request, res: Response) => {
  const { email, password } = req.body;

  const session = await authenticateUser(email, password);
  if (!session) {
    return res.status(401).json({ error: 'Invalid credentials' });
  }

  res.json({ token: session.token });
});

router.get('/me', async (req: Request, res: Response) => {
  const token = req.headers.authorization?.split(' ')[1];
  if (!token) {
    return res.status(401).json({ error: 'No token provided' });
  }

  const user = await validateSession(token);
  if (!user) {
    return res.status(401).json({ error: 'Invalid session' });
  }

  res.json({ user });
});

export default router;
```

AI agents are pretty good at using tools like grep to explore codebases. Given access to the grep command, they can easily formulate search queries. Even if they miss, they can quickly iterate by refining their queries based on results. The flow is simple. Once the user asks a question, the agent reasons about what patterns to search for. It then executes grep and collects the matching lines with file paths. From there, the agent reads relevant snippets and synthesizes them into a coherent answer. This technique is incredibly simple and requires no additional infrastructure beyond the grep command itself.

Here's what that loop looks like on the example codebase above:

Agent reasoning: "I need to find authentication-related code. I'll search for 'authenticate' or 'auth' patterns."
Tool call: grep -rnE 'authenticate|auth' src/
Results: four matching lines, one in src/services/auth.ts and three in src/routes/api.ts
Agent response: the agent reads the surrounding snippets and explains how authentication is wired together.
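
As a sketch, a grep tool exposed to an agent can be as small as a thin wrapper around the command itself. This assumes a Unix-like environment with grep on the PATH; the function name and return format here are made up for illustration, not taken from any particular tool.

```python
import subprocess

def grep_tool(pattern: str, path: str = ".", max_results: int = 50) -> str:
    """Run grep recursively and return 'file:line:text' matches for the agent."""
    result = subprocess.run(
        ["grep", "-rnE", "--include=*.ts", "--include=*.py", pattern, path],
        capture_output=True, text=True,
    )
    # grep exits with status 1 when nothing matches; treat that as an empty result
    lines = result.stdout.splitlines()[:max_results]
    return "\n".join(lines) if lines else "No matches found."

# The agent loop: the model proposes a pattern, reads the matches,
# and decides whether to refine the query or open specific files.
print(grep_tool(r"authenticate|auth", "src/"))
```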

Grep: Strengths and Limitations

Strengths

  • Deterministic results: Function names, identifiers, and exact patterns need exact matches, and grep excels at this with zero hallucinations
  • Scales to massive repos: Whether you have 100 files or 100,000, grep churns through them relatively quickly
  • Minimal infrastructure: No vector databases, no embedding models, no external services
  • Always fresh: Every search reads the current state of your code, never a stale index

Limitations

  • Token bloat: Dumping large amounts of raw code into an LLM eats context and drives up costs as repositories grow
  • Time-consuming exploration: With tools like Codex CLI, agents can spend significant time iteratively grepping through the codebase to build understanding
  • No semantic understanding: Grep can't find "authentication logic" if the code uses login, signIn, or verifyCredentials. It only finds exact text matches

Part 2: Semantic Indexing

The other approach to understanding your codebase is to build a semantic index. Instead of returning exact text matches, semantic search finds code through vector embeddings that attempt to capture meaning. Cursor, VS Code Copilot, Roocode, and Kilo Code all use this approach. They pre-index your codebase so that at query time, they can quickly find semantically relevant snippets. Semantic indexing requires a much larger upfront investment in infrastructure and computation: a vector store, an embedding model, and an indexing pipeline.

The Indexing Pipeline

Figure: Codebase indexing pipeline (scan files → detect language → chunk with tree-sitter → embed into vectors → store in a vector DB)

The pipeline transforms raw source code into searchable vectors:

  1. Scan — Traverse the codebase and read source files
  2. Detect — Identify programming languages
  3. Chunk — Parse code into semantic units using AST analysis
  4. Embed — Convert chunks into high-dimensional vectors
  5. Store — Save vectors in a database for fast retrieval

At a high level, these are the core components we need to implement semantic search; a rough sketch of the whole pipeline follows below.
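
To make the pipeline concrete, here's a rough sketch in Python. The chunk() and embed() functions below are toy placeholders (real indexers use AST-aware chunking and an actual embedding model), and the in-memory list stands in for a vector database.

```python
from pathlib import Path

EXTENSIONS = {".py": "python", ".ts": "typescript", ".rs": "rust"}  # detect by suffix

def chunk(text: str, max_lines: int = 40) -> list[str]:
    """Toy chunker: fixed-size line windows (real pipelines chunk along the AST)."""
    lines = text.splitlines()
    return ["\n".join(lines[i:i + max_lines]) for i in range(0, len(lines), max_lines)]

def embed(text: str) -> list[float]:
    """Toy stand-in: real pipelines call an embedding model here."""
    vec = [0.0] * 64
    for i, byte in enumerate(text.encode()):
        vec[i % 64] += byte / 255.0
    return vec

def index_codebase(root: str) -> list[dict]:
    vectors = []                                    # stand-in for a vector DB
    for path in Path(root).rglob("*"):              # 1. scan
        lang = EXTENSIONS.get(path.suffix)          # 2. detect
        if lang is None or not path.is_file():
            continue
        for piece in chunk(path.read_text(errors="ignore")):   # 3. chunk
            vectors.append({                        # 4. embed + 5. store
                "file": str(path), "language": lang,
                "text": piece, "embedding": embed(piece),
            })
    return vectors
```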


What Gets Indexed (and What Doesn't)

The indexer is smart about what it processes. It automatically excludes:[1][2]

  • Binary files and images — Non-text content that can't be meaningfully embedded
  • Large files (>1MB) — Files too large to process efficiently
  • Git internals — .git folders and repository metadata
  • Dependencies — node_modules, vendor, venv, and other package directories
  • Ignored files — Anything matching .gitignore patterns

This filtering ensures the index stays focused on your code, not third-party libraries or generated artifacts.
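
As a sketch, the exclusion rules above might look something like this. The directory list and 1 MB cut-off mirror the bullets; full .gitignore support is usually delegated to a library and is omitted here.

```python
from pathlib import Path

SKIP_DIRS = {".git", "node_modules", "vendor", "venv"}
MAX_BYTES = 1_000_000  # roughly the 1 MB cut-off mentioned above

def should_index(path: Path) -> bool:
    """Heuristic: return True if a file is worth embedding."""
    if any(part in SKIP_DIRS for part in path.parts):
        return False                    # git internals and dependency folders
    if path.stat().st_size > MAX_BYTES:
        return False                    # files too large to process efficiently
    with path.open("rb") as f:
        if b"\0" in f.read(1024):
            return False                # null byte => almost certainly binary
    return True
```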


Incremental Updates

Re-indexing an entire codebase on every change would be painfully slow. Modern indexers use several strategies to stay fast:[1][2][3]

| Strategy | How It Works |
| --- | --- |
| File Watching | Monitors your workspace for changes in real-time |
| Smart Updates | Only reprocesses modified files, not the entire codebase |
| Hash-based Caching | Compares file hashes to avoid reprocessing unchanged content |
| Branch Aware | Automatically handles Git branch switches and updates the index accordingly |
| Multi-Folder Workspaces | Each folder maintains its own index with separate settings and status |

This means after the initial indexing, updates happen in seconds rather than minutes — even for large codebases.


How Cursor Detects Changes: Merkle Trees

Cursor uses a clever data structure called a Merkle tree to efficiently detect which files have changed.[4][5][6]

A Merkle tree works like a fingerprinting system for your codebase:

  1. Each file gets a unique cryptographic hash (fingerprint)
  2. Pairs of hashes are combined into parent hashes
  3. This continues until you have a single root hash representing the entire codebase

When any file changes, its hash changes — and that change propagates up through all parent hashes to the root. By comparing root hashes, Cursor can instantly detect that something changed, then walk down the tree to find exactly which files need re-indexing.

This approach significantly reduces bandwidth and processing time. In a workspace with 50,000 files, only the branches where hashes differ need to be examined — not the entire codebase.
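
Here's a toy version of the idea, not Cursor's actual implementation: hash every file, combine the hashes pairwise up to a single root, and compare roots to decide whether anything needs re-indexing.

```python
import hashlib
from pathlib import Path

def sha256(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def file_hashes(root: str) -> list[str]:
    """Leaf level: one content hash per file, in a stable order."""
    return [sha256(p.read_bytes())
            for p in sorted(Path(root).rglob("*")) if p.is_file()]

def merkle_root(hashes: list[str]) -> str:
    """Combine hashes pairwise until a single root hash remains."""
    if not hashes:
        return sha256(b"")
    while len(hashes) > 1:
        if len(hashes) % 2:                 # odd count: carry the last hash up
            hashes.append(hashes[-1])
        hashes = [sha256((a + b).encode())
                  for a, b in zip(hashes[0::2], hashes[1::2])]
    return hashes[0]

# Comparing two root hashes is a single equality check; when they differ,
# walking down the tree pinpoints which subtrees (and files) changed.
# root_hash = merkle_root(file_hashes("src"))  # compare against the stored root
```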


Embedding Caching & Privacy

Smart caching makes re-indexing fast, while privacy measures protect sensitive information:[4][5][6]

Caching Strategy

  • Embeddings are cached by the hash of each chunk's content (see the sketch after this list)
  • When code hasn't changed, the cached embedding is reused
  • Indexing the same codebase a second time is nearly instant
  • Teams benefit from shared caches across developers
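
A minimal sketch of that caching strategy, where embed_chunk is a hypothetical stand-in for whatever embedding model call you use:

```python
import hashlib

embedding_cache: dict[str, list[float]] = {}    # content hash -> embedding

def cached_embedding(chunk: str, embed_chunk) -> list[float]:
    """Re-embed a chunk only if its content has changed."""
    key = hashlib.sha256(chunk.encode()).hexdigest()
    if key not in embedding_cache:              # cache miss: call the model
        embedding_cache[key] = embed_chunk(chunk)
    return embedding_cache[key]                 # cache hit: reuse the stored vector
```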

Privacy Measures

  • Only embeddings and metadata are stored remotely — raw source code stays local
  • File paths are obfuscated before transmission (e.g., src/payments/invoice.py becomes a9f3/x72k/qp1m.f4)
  • Path obfuscation hides sensitive details while preserving directory structure for filtering
  • Users can control what's indexed via .cursorignore or similar ignore files

Index Reuse Across Teams

Here's a powerful optimization: most teams work from near-identical copies of the same codebase. Cursor found that clones average 92% similarity across users within an organization.[6]

Instead of rebuilding every index from scratch, Cursor can securely reuse a teammate's existing index:

| Repo Size | Without Reuse | With Reuse |
| --- | --- | --- |
| Median | 7.87 seconds | 525 ms |
| 90th percentile | 2.82 minutes | 1.87 seconds |
| 99th percentile | 4.03 hours | 21 seconds |

This works through similarity hashing — the client computes a simhash from its Merkle tree and searches for matching indexes from teammates. Cryptographic proofs ensure you only see results for code you actually have locally.


Understanding Abstract Syntax Trees

Before we can chunk code intelligently, we need to understand its structure. Tree-sitter is a parser that transforms source code into an Abstract Syntax Tree (AST) — a hierarchical representation of the code's structure.

Consider this Python example:

```python
class DataProcessor:
    """Process and transform data."""

    def __init__(self, config):
        self.config = config
        self.cache = {}

    def process(self, data):
        if data is None:
            raise ValueError("No data")

        for item in data:
            result = self._transform(item)
            self.cache[item.id] = result

        return self.cache

    def _transform(self, item):
        return item.value * 2
```

Parsed into an AST, every construct becomes a node: the file itself, the class, each function, control-flow statements, variables, properties, and return statements.

What the AST Reveals

  • Hierarchical structure: Classes contain methods, functions contain statements
  • Language-agnostic concepts: Functions, classes, and types are recognized across languages
  • Semantic boundaries: Each node represents a complete syntactic unit

This tree structure is what enables intelligent code chunking — instead of splitting at arbitrary line numbers, we can split at meaningful boundaries.


Smart Code Chunking with Tree-sitter

data_processor.py

```python
import numpy as np
from typing import List, Optional

class DataProcessor:
    """Handles data processing operations."""

    def __init__(self, config: dict):
        self.config = config
        self.cache = {}

    def normalize(self, data: List[float]) -> np.ndarray:
        """Normalize data to [0, 1] range."""
        arr = np.array(data)
        min_val, max_val = arr.min(), arr.max()
        return (arr - min_val) / (max_val - min_val)

    def transform(self, data: List[float],
                  scale: float = 1.0) -> np.ndarray:
        """Apply transformation to data."""
        normalized = self.normalize(data)
        return normalized * scale

def calculate_stats(values: List[float]) -> dict:
    """Calculate basic statistics."""
    arr = np.array(values)
    return {
        "mean": float(arr.mean()),
        "std": float(arr.std()),
        "min": float(arr.min()),
        "max": float(arr.max())
    }
```

Tree-sitter parses this file into an AST whose top-level nodes are the imports, the DataProcessor class (with its __init__, normalize, and transform methods), and the calculate_stats function. Chunking at those boundaries produces five chunks: the imports, DataProcessor.__init__, DataProcessor.normalize, DataProcessor.transform, and calculate_stats. Each chunk is a complete semantic unit.

Why This Matters

  • Naive chunking splits at arbitrary line boundaries, often cutting functions in half
  • Tree-sitter chunking respects semantic boundaries — each chunk is a complete function, class, or method
  • Better chunks mean better embeddings, which means more accurate search results

A note on tooling: Libraries like Chonkie provide ready-made AST-aware code chunking out of the box — handling tree-sitter parsing, semantic boundary detection, and chunk extraction automatically. For production use, that's often the right choice.
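
To make the chunking step concrete without pulling in tree-sitter itself, here's a sketch of the same idea using Python's built-in ast module. It only works for Python source, which is exactly the limitation tree-sitter removes by providing grammars for dozens of languages.

```python
import ast

def chunk_python(source: str) -> list[dict]:
    """Split a Python file into chunks at function/class boundaries."""
    tree = ast.parse(source)
    chunks, lines = [], source.splitlines()
    for node in tree.body:                              # top-level nodes only
        if isinstance(node, ast.ClassDef):
            # one chunk per method, named Class.method
            for item in node.body:
                if isinstance(item, (ast.FunctionDef, ast.AsyncFunctionDef)):
                    chunks.append({
                        "name": f"{node.name}.{item.name}",
                        "text": "\n".join(lines[item.lineno - 1:item.end_lineno]),
                    })
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            chunks.append({
                "name": node.name,
                "text": "\n".join(lines[node.lineno - 1:node.end_lineno]),
            })
    return chunks

# Applied to data_processor.py above, this yields chunks such as
# DataProcessor.__init__, DataProcessor.normalize, DataProcessor.transform,
# and calculate_stats, each one a complete semantic unit.
```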


Visualizing Code Embeddings

To set up these code embeddings, I used jina-code-embeddings-1.5b-Q8_0[9]. This model is designed for code retrieval, covering text-to-code, code-to-code, and code-to-text. To run it locally, I used Llama.cpp and embedded one of my personal projects. In addition, I set up basic chunking using tree-sitter to split the code into functions and classes. Every function, class, and code chunk gets converted into a 1536-dimensional vector, and below, I visualize the result using t-SNE to reduce it down to 2D.
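
The projection step itself is only a few lines. Here's a sketch using scikit-learn and matplotlib, assuming the chunk embeddings and their labels have already been computed and saved; the file names are placeholders.

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

embeddings = np.load("embeddings.npy")                 # (n_chunks, 1536) matrix
labels = np.load("labels.npy", allow_pickle=True)      # language/type per chunk

# Reduce 1536 dimensions down to 2 for plotting
points = TSNE(n_components=2, perplexity=30, random_state=42).fit_transform(embeddings)

for label in np.unique(labels):
    mask = labels == label
    plt.scatter(points[mask, 0], points[mask, 1], s=10, label=label)
plt.legend()
plt.title("Code chunks projected to 2D with t-SNE")
plt.show()
```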


What You're Seeing

  • Each point is a code chunk (function, class, method, or file)
  • Colors represent programming languages or chunk types
  • Proximity indicates semantic similarity — nearby points have similar meaning
  • Clusters form naturally around related functionality

How Semantic Search Works

With embeddings stored, semantic search becomes possible. Given a natural language query, the search proceeds as follows (see the sketch after this list):

  1. The query gets embedded into the same vector space as the code
  2. The system finds the nearest neighbors using cosine similarity
  3. Results are ranked by semantic similarity, not keyword matching
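
A minimal sketch of that retrieval step with plain numpy, assuming embed() wraps the same model used during indexing and that chunks and vectors come from the index built earlier:

```python
import numpy as np

def cosine_search(query: str, chunks: list[str], vectors: np.ndarray,
                  embed, top_k: int = 5) -> list[tuple[float, str]]:
    """Embed the query, then rank code chunks by cosine similarity."""
    q = np.asarray(embed(query))                           # query -> same vector space
    q = q / np.linalg.norm(q)
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    scores = v @ q                                         # cosine similarity
    best = np.argsort(scores)[::-1][:top_k]                # nearest neighbours first
    return [(float(scores[i]), chunks[i]) for i in best]

# e.g. cosine_search("where do we validate session tokens?", chunks, vectors, embed)
```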

Part 3: Comparing the Approaches

Both grep-based search and semantic indexing have their place — the right choice depends on your priorities.[7][8]

Why Some Tools Skip Indexing

As Nick Baumann from Cline explains, indexing introduces complexity that may not be worth the trade-off:[7]

"We don't index your codebase, and this choice isn't an oversight — it's a fundamental design decision."

The creator of Claude Code echoed similar concerns, noting that implementing indexing introduces problems around security, privacy, staleness, and reliability.

The Case Against RAG for Code

Cline's blog articulates three key challenges with retrieval-augmented generation (RAG) approaches:[7]

  1. Code doesn't think in chunks — When you chunk code for embeddings, you're tearing apart its logic. A function call might be in one chunk, its definition in another, and the critical context that explains why it exists scattered across a dozen fragments.

  2. Indexes decay while code evolves — Software development moves fast. Functions get refactored, dependencies update, entire modules get rewritten. An index is a snapshot frozen in time — every merge is a potential divergence between reality and your AI's understanding.

  3. Security becomes a liability — Your codebase isn't just text — it's your competitive advantage. Creating vector embeddings means creating a secondary representation of your IP that needs to be stored somewhere, doubling your security surface.

Cursor's documentation highlights why semantic search remains valuable:[8]

  • Faster results: Compute happens during indexing (offline) rather than at runtime, so searches are faster and cheaper
  • Better accuracy: Custom-trained models retrieve more relevant results than string matching
  • Fewer follow-ups: Users send fewer clarifying messages and use fewer tokens compared to grep-only search
  • Conceptual matching: Find code by what it does, not just what it's named

As Cursor notes: "Agent uses both grep and semantic search together. Grep excels at finding exact patterns, while semantic search excels at finding conceptually similar code."

The Trade-off

| Approach | Best For | Challenges |
| --- | --- | --- |
| Grep-based | Exact matches, massive repos, privacy-sensitive environments, zero setup | Token usage, exploration time, no semantic understanding |
| Semantic indexing | Conceptual queries, large teams, frequently searched codebases | Index staleness, security surface, infrastructure complexity |

Modern tools increasingly combine both approaches — using grep for precision and semantic search for discovery. The future likely isn't one or the other, but intelligent orchestration of both.



References