Priyanshu Mahey.

How AI Searches Through Your Codebase

An exploration and systematic introduction to the different techniques AI software uses to search across your codebase.


When VS Code Copilot first came out, I saw a little animation in the corner of my screen saying that the codebase was "indexing". This was a while back and I didn't think much of it. Then I noticed Cursor doing the same thing. It wasn't until I started using Claude Code, which never mentioned indexing at all, that I got curious about how different AI tools actually search and interact with the codebase I'm building.

Modern AI coding assistants need to understand your codebase to actually be useful. The question is, how do they know which parts of your codebase are relevant to your objective? The naive approach would be to stuff the entire codebase into the context window, but that quickly becomes impractical as projects grow. Instead, we need targeted search techniques that help the AI find the right code snippets to work with. There are two primary approaches:

  1. Grep-based search — Fast, on-demand text pattern matching (used by Claude Code, Codex CLI, Cline)
  2. Semantic indexing — Pre-computed embeddings enabling meaning-based search (used by Cursor, VS Code Copilot, Roocode)

The goal of this blog post is to explore both approaches in detail and understand why different tools choose one over the other. This isn't a deep dive into the accuracy or performance of each method, since that's best left to empirical benchmarks. Instead, I focus on implementation, compiling information from various sources.


Part 1: Grep-Based Search

The simplest way to find code is grep, which stands for Global Regular Expression Print. It's nothing new; it's been a staple of Unix command lines for decades. Grep scans each file line by line, checks it against a pattern (a literal string or a regex), and outputs exactly which lines match. There's no pre-indexing and no understanding of the code: grep just finds exact text matches. It needs nothing more than terminal access, which makes it fast, ready to use on any codebase immediately, and ideal for quick lookups and exploration.

How Grep Works

Let's say we have a codebase with a couple of files in it. We can use grep to search for literal strings or regex patterns. Grep walks through each file line by line, and every time a line matches, it records the file path, line number, and matching text. Here are the example source files we'll search through:

src/services/auth.ts

```typescript
import { hash, verify } from 'crypto';
import { User, Session } from './types';

export async function authenticateUser(
  email: string,
  password: string
): Promise<Session | null> {
  const user = await findUserByEmail(email);
  if (!user) return null;

  const isValid = await verify(password, user.passwordHash);
  if (!isValid) return null;

  return createSession(user);
}

export async function validateSession(
  token: string
): Promise<User | null> {
  const session = await findSession(token);
  if (!session || isExpired(session)) {
    return null;
  }
  return session.user;
}

function isExpired(session: Session): boolean {
  return Date.now() > session.expiresAt;
}
```

src/services/database.ts

```typescript
import { Pool, QueryResult } from 'pg';

const pool = new Pool({
  connectionString: process.env.DATABASE_URL,
  max: 20
});

export async function query<T>(
  sql: string,
  params?: unknown[]
): Promise<T[]> {
  const client = await pool.connect();
  try {
    const result = await client.query(sql, params);
    return result.rows as T[];
  } finally {
    client.release();
  }
}

export async function findUserByEmail(
  email: string
): Promise<User | null> {
  const users = await query<User>(
    'SELECT * FROM users WHERE email = $1',
    [email]
  );
  return users[0] || null;
}

export async function findSession(
  token: string
): Promise<Session | null> {
  const sessions = await query<Session>(
    'SELECT * FROM sessions WHERE token = $1',
    [token]
  );
  return sessions[0] || null;
}
```

src/routes/api.ts

```typescript
import { Router, Request, Response } from 'express';
import { authenticateUser, validateSession } from '../services/auth';

const router = Router();

router.post('/login', async (req: Request, res: Response) => {
  const { email, password } = req.body;

  const session = await authenticateUser(email, password);
  if (!session) {
    return res.status(401).json({ error: 'Invalid credentials' });
  }

  res.json({ token: session.token });
});

router.get('/me', async (req: Request, res: Response) => {
  const token = req.headers.authorization?.split(' ')[1];
  if (!token) {
    return res.status(401).json({ error: 'No token provided' });
  }

  const user = await validateSession(token);
  if (!user) {
    return res.status(401).json({ error: 'Invalid session' });
  }

  res.json({ user });
});

export default router;
```

AI agents are pretty good at using tools like grep to explore codebases. Given access to the grep command, they can easily formulate search queries. Even if they miss, they can quickly iterate by refining their queries based on results. The flow is simple. Once the user asks a question, the agent reasons about what patterns to search for. It then executes grep and collects the matching lines with file paths. From there, the agent reads relevant snippets and synthesizes them into a coherent answer. This technique is incredibly simple and requires no additional infrastructure beyond the grep command itself.

Here's what that loop looks like on the example codebase above:

Agent reasoning: "I need to find authentication-related code. I'll search for 'authenticate' or 'auth' patterns."
Tool call: grep -rnE 'authenticate|auth' src/
Results: four matching lines, one in src/services/auth.ts and three in src/routes/api.ts
Agent response: the agent reads the surrounding snippets and explains how authentication is wired together.
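
As a sketch, a grep tool exposed to an agent can be as small as a thin wrapper around the command itself. This assumes a Unix-like environment with grep on the PATH; the function name and return format here are made up for illustration, not taken from any particular tool.

```python
import subprocess

def grep_tool(pattern: str, path: str = ".", max_results: int = 50) -> str:
    """Run grep recursively and return 'file:line:text' matches for the agent."""
    result = subprocess.run(
        ["grep", "-rnE", "--include=*.ts", "--include=*.py", pattern, path],
        capture_output=True, text=True,
    )
    # grep exits with status 1 when nothing matches; treat that as an empty result
    lines = result.stdout.splitlines()[:max_results]
    return "\n".join(lines) if lines else "No matches found."

# The agent loop: the model proposes a pattern, reads the matches,
# and decides whether to refine the query or open specific files.
print(grep_tool(r"authenticate|auth", "src/"))
```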

Grep: Strengths and Limitations

Strengths

  • Deterministic results: Function names, identifiers, and exact patterns need exact matches, and grep excels at this with zero hallucinations
  • Scales to massive repos: Whether you have 100 files or 100,000, grep churns through them relatively quickly
  • Minimal infrastructure: No vector databases, no embedding models, no external services
  • Always fresh: Every search reads the current state of your code, never a stale index

Limitations

  • Token bloat: Dumping large amounts of raw code into an LLM eats context and drives up costs as repositories grow
  • Time-consuming exploration: With tools like Codex CLI, agents can spend significant time iteratively grepping through the codebase to build understanding
  • No semantic understanding: Grep can't find "authentication logic" if the code uses login, signIn, or verifyCredentials. It only finds exact text matches

Part 2: Semantic Indexing

The other approach to understanding your codebase is to build a semantic index. Instead of returning exact text matches, semantic search finds code through vector embeddings that attempt to capture meaning. Cursor, VS Code Copilot, Roocode, and Kilo Code all use this approach. They pre-index your codebase so that at query time, they can quickly find semantically relevant snippets. Semantic indexing requires a much larger upfront investment in infrastructure and computation: a vector store, an embedding model, and an indexing pipeline.

The Indexing Pipeline

Figure: Codebase indexing pipeline (scan files → detect language → chunk with tree-sitter → embed into vectors → store in a vector DB)

The pipeline transforms raw source code into searchable vectors:

  1. Scan — Traverse the codebase and read source files
  2. Detect — Identify programming languages
  3. Chunk — Parse code into semantic units using AST analysis
  4. Embed — Convert chunks into high-dimensional vectors
  5. Store — Save vectors in a database for fast retrieval

At a high level, these are the core components we need to implement semantic search; a rough sketch of the whole pipeline follows below.
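
To make the pipeline concrete, here's a rough sketch in Python. The chunk() and embed() functions below are toy placeholders (real indexers use AST-aware chunking and an actual embedding model), and the in-memory list stands in for a vector database.

```python
from pathlib import Path

EXTENSIONS = {".py": "python", ".ts": "typescript", ".rs": "rust"}  # detect by suffix

def chunk(text: str, max_lines: int = 40) -> list[str]:
    """Toy chunker: fixed-size line windows (real pipelines chunk along the AST)."""
    lines = text.splitlines()
    return ["\n".join(lines[i:i + max_lines]) for i in range(0, len(lines), max_lines)]

def embed(text: str) -> list[float]:
    """Toy stand-in: real pipelines call an embedding model here."""
    vec = [0.0] * 64
    for i, byte in enumerate(text.encode()):
        vec[i % 64] += byte / 255.0
    return vec

def index_codebase(root: str) -> list[dict]:
    vectors = []                                    # stand-in for a vector DB
    for path in Path(root).rglob("*"):              # 1. scan
        lang = EXTENSIONS.get(path.suffix)          # 2. detect
        if lang is None or not path.is_file():
            continue
        for piece in chunk(path.read_text(errors="ignore")):   # 3. chunk
            vectors.append({                        # 4. embed + 5. store
                "file": str(path), "language": lang,
                "text": piece, "embedding": embed(piece),
            })
    return vectors
```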


What Gets Indexed (and What Doesn't)

The indexer is smart about what it processes. It automatically excludes:[1][2]

  • Binary files and images — Non-text content that can't be meaningfully embedded
  • Large files (>1MB) — Files too large to process efficiently
  • Git internals — .git folders and repository metadata
  • Dependencies — node_modules, vendor, venv, and other package directories
  • Ignored files — Anything matching .gitignore patterns

This filtering ensures the index stays focused on your code, not third-party libraries or generated artifacts.
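
As a sketch, the exclusion rules above might look something like this. The directory list and 1 MB cut-off mirror the bullets; full .gitignore support is usually delegated to a library and is omitted here.

```python
from pathlib import Path

SKIP_DIRS = {".git", "node_modules", "vendor", "venv"}
MAX_BYTES = 1_000_000  # roughly the 1 MB cut-off mentioned above

def should_index(path: Path) -> bool:
    """Heuristic: return True if a file is worth embedding."""
    if any(part in SKIP_DIRS for part in path.parts):
        return False                    # git internals and dependency folders
    if path.stat().st_size > MAX_BYTES:
        return False                    # files too large to process efficiently
    with path.open("rb") as f:
        if b"\0" in f.read(1024):
            return False                # null byte => almost certainly binary
    return True
```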


Incremental Updates

Re-indexing an entire codebase on every change would be painfully slow. Modern indexers use several strategies to stay fast:[1][2][3]

| Strategy | How It Works |
| --- | --- |
| File Watching | Monitors your workspace for changes in real-time |
| Smart Updates | Only reprocesses modified files, not the entire codebase |
| Hash-based Caching | Compares file hashes to avoid reprocessing unchanged content |
| Branch Aware | Automatically handles Git branch switches and updates the index accordingly |
| Multi-Folder Workspaces | Each folder maintains its own index with separate settings and status |

This means after the initial indexing, updates happen in seconds rather than minutes — even for large codebases.


How Cursor Detects Changes: Merkle Trees

Cursor uses a clever data structure called a Merkle tree to efficiently detect which files have changed.[4][5][6]

A Merkle tree works like a fingerprinting system for your codebase:

  1. Each file gets a unique cryptographic hash (fingerprint)
  2. Pairs of hashes are combined into parent hashes
  3. This continues until you have a single root hash representing the entire codebase

When any file changes, its hash changes — and that change propagates up through all parent hashes to the root. By comparing root hashes, Cursor can instantly detect that something changed, then walk down the tree to find exactly which files need re-indexing.

This approach significantly reduces bandwidth and processing time. In a workspace with 50,000 files, only the branches where hashes differ need to be examined — not the entire codebase.
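
Here's a toy version of the idea, not Cursor's actual implementation: hash every file, combine the hashes pairwise up to a single root, and compare roots to decide whether anything needs re-indexing.

```python
import hashlib
from pathlib import Path

def sha256(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def file_hashes(root: str) -> list[str]:
    """Leaf level: one content hash per file, in a stable order."""
    return [sha256(p.read_bytes())
            for p in sorted(Path(root).rglob("*")) if p.is_file()]

def merkle_root(hashes: list[str]) -> str:
    """Combine hashes pairwise until a single root hash remains."""
    if not hashes:
        return sha256(b"")
    while len(hashes) > 1:
        if len(hashes) % 2:                 # odd count: carry the last hash up
            hashes.append(hashes[-1])
        hashes = [sha256((a + b).encode())
                  for a, b in zip(hashes[0::2], hashes[1::2])]
    return hashes[0]

# Comparing two root hashes is a single equality check; when they differ,
# walking down the tree pinpoints which subtrees (and files) changed.
# root_hash = merkle_root(file_hashes("src"))  # compare against the stored root
```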


Embedding Caching & Privacy

Smart caching makes re-indexing fast, while privacy measures protect sensitive information:[4][5][6]

Caching Strategy

  • Embeddings are cached by the hash of each chunk's content (see the sketch after this list)
  • When code hasn't changed, the cached embedding is reused
  • Indexing the same codebase a second time is nearly instant
  • Teams benefit from shared caches across developers
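
A minimal sketch of that caching strategy, where embed_chunk is a hypothetical stand-in for whatever embedding model call you use:

```python
import hashlib

embedding_cache: dict[str, list[float]] = {}    # content hash -> embedding

def cached_embedding(chunk: str, embed_chunk) -> list[float]:
    """Re-embed a chunk only if its content has changed."""
    key = hashlib.sha256(chunk.encode()).hexdigest()
    if key not in embedding_cache:              # cache miss: call the model
        embedding_cache[key] = embed_chunk(chunk)
    return embedding_cache[key]                 # cache hit: reuse the stored vector
```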

Privacy Measures

  • Only embeddings and metadata are stored remotely — raw source code stays local
  • File paths are obfuscated before transmission (e.g., src/payments/invoice.py becomes a9f3/x72k/qp1m.f4)
  • Path obfuscation hides sensitive details while preserving directory structure for filtering
  • Users can control what's indexed via .cursorignore or similar ignore files

Index Reuse Across Teams

Here's a powerful optimization: most teams work from near-identical copies of the same codebase. Cursor found that clones average 92% similarity across users within an organization.[6]

Instead of rebuilding every index from scratch, Cursor can securely reuse a teammate's existing index:

| Repo Size | Without Reuse | With Reuse |
| --- | --- | --- |
| Median | 7.87 seconds | 525 ms |
| 90th percentile | 2.82 minutes | 1.87 seconds |
| 99th percentile | 4.03 hours | 21 seconds |

This works through similarity hashing — the client computes a simhash from its Merkle tree and searches for matching indexes from teammates. Cryptographic proofs ensure you only see results for code you actually have locally.


Understanding Abstract Syntax Trees

Before we can chunk code intelligently, we need to understand its structure. Tree-sitter is a parser that transforms source code into an Abstract Syntax Tree (AST) — a hierarchical representation of the code's structure.

Consider this Python example:

```python
class DataProcessor:
    """Process and transform data."""

    def __init__(self, config):
        self.config = config
        self.cache = {}

    def process(self, data):
        if data is None:
            raise ValueError("No data")

        for item in data:
            result = self._transform(item)
            self.cache[item.id] = result

        return self.cache

    def _transform(self, item):
        return item.value * 2
```

Parsed into an AST, every construct becomes a node: the file itself, the class, each function, control-flow statements, variables, properties, and return statements.

What the AST Reveals

  • Hierarchical structure: Classes contain methods, functions contain statements
  • Language-agnostic concepts: Functions, classes, and types are recognized across languages
  • Semantic boundaries: Each node represents a complete syntactic unit

This tree structure is what enables intelligent code chunking — instead of splitting at arbitrary line numbers, we can split at meaningful boundaries.


Smart Code Chunking with Tree-sitter

data_processor.py

```python
import numpy as np
from typing import List, Optional

class DataProcessor:
    """Handles data processing operations."""

    def __init__(self, config: dict):
        self.config = config
        self.cache = {}

    def normalize(self, data: List[float]) -> np.ndarray:
        """Normalize data to [0, 1] range."""
        arr = np.array(data)
        min_val, max_val = arr.min(), arr.max()
        return (arr - min_val) / (max_val - min_val)

    def transform(self, data: List[float],
                  scale: float = 1.0) -> np.ndarray:
        """Apply transformation to data."""
        normalized = self.normalize(data)
        return normalized * scale

def calculate_stats(values: List[float]) -> dict:
    """Calculate basic statistics."""
    arr = np.array(values)
    return {
        "mean": float(arr.mean()),
        "std": float(arr.std()),
        "min": float(arr.min()),
        "max": float(arr.max())
    }
```

Tree-sitter parses this file into an AST whose top-level nodes are the imports, the DataProcessor class (with its __init__, normalize, and transform methods), and the calculate_stats function. Chunking at those boundaries produces five chunks: the imports, DataProcessor.__init__, DataProcessor.normalize, DataProcessor.transform, and calculate_stats. Each chunk is a complete semantic unit.

Why This Matters

  • Naive chunking splits at arbitrary line boundaries, often cutting functions in half
  • Tree-sitter chunking respects semantic boundaries — each chunk is a complete function, class, or method
  • Better chunks mean better embeddings, which means more accurate search results

A note on tooling: Libraries like Chonkie provide ready-made AST-aware code chunking out of the box — handling tree-sitter parsing, semantic boundary detection, and chunk extraction automatically. For production use, that's often the right choice.
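
To make the chunking step concrete without pulling in tree-sitter itself, here's a sketch of the same idea using Python's built-in ast module. It only works for Python source, which is exactly the limitation tree-sitter removes by providing grammars for dozens of languages.

```python
import ast

def chunk_python(source: str) -> list[dict]:
    """Split a Python file into chunks at function/class boundaries."""
    tree = ast.parse(source)
    chunks, lines = [], source.splitlines()
    for node in tree.body:                              # top-level nodes only
        if isinstance(node, ast.ClassDef):
            # one chunk per method, named Class.method
            for item in node.body:
                if isinstance(item, (ast.FunctionDef, ast.AsyncFunctionDef)):
                    chunks.append({
                        "name": f"{node.name}.{item.name}",
                        "text": "\n".join(lines[item.lineno - 1:item.end_lineno]),
                    })
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            chunks.append({
                "name": node.name,
                "text": "\n".join(lines[node.lineno - 1:node.end_lineno]),
            })
    return chunks

# Applied to data_processor.py above, this yields chunks such as
# DataProcessor.__init__, DataProcessor.normalize, DataProcessor.transform,
# and calculate_stats, each one a complete semantic unit.
```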


Visualizing Code Embeddings

To set up these code embeddings, I used jina-code-embeddings-1.5b-Q8_0[9]. This model is designed for code retrieval, covering text-to-code, code-to-code, and code-to-text. To run it locally, I used Llama.cpp and embedded one of my personal projects. In addition, I set up basic chunking using tree-sitter to split the code into functions and classes. Every function, class, and code chunk gets converted into a 1536-dimensional vector, and below, I visualize the result using t-SNE to reduce it down to 2D.
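
The projection step itself is only a few lines. Here's a sketch using scikit-learn and matplotlib, assuming the chunk embeddings and their labels have already been computed and saved; the file names are placeholders.

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

embeddings = np.load("embeddings.npy")                 # (n_chunks, 1536) matrix
labels = np.load("labels.npy", allow_pickle=True)      # language/type per chunk

# Reduce 1536 dimensions down to 2 for plotting
points = TSNE(n_components=2, perplexity=30, random_state=42).fit_transform(embeddings)

for label in np.unique(labels):
    mask = labels == label
    plt.scatter(points[mask, 0], points[mask, 1], s=10, label=label)
plt.legend()
plt.title("Code chunks projected to 2D with t-SNE")
plt.show()
```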


What You're Seeing

  • Each point is a code chunk (function, class, method, or file)
  • Colors represent programming languages or chunk types
  • Proximity indicates semantic similarity — nearby points have similar meaning
  • Clusters form naturally around related functionality

How Semantic Search Works

With embeddings stored, semantic search becomes possible. Given a natural language query, the search proceeds as follows (see the sketch after this list):

  1. The query gets embedded into the same vector space as the code
  2. The system finds the nearest neighbors using cosine similarity
  3. Results are ranked by semantic similarity, not keyword matching
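
A minimal sketch of that retrieval step with plain numpy, assuming embed() wraps the same model used during indexing and that chunks and vectors come from the index built earlier:

```python
import numpy as np

def cosine_search(query: str, chunks: list[str], vectors: np.ndarray,
                  embed, top_k: int = 5) -> list[tuple[float, str]]:
    """Embed the query, then rank code chunks by cosine similarity."""
    q = np.asarray(embed(query))                           # query -> same vector space
    q = q / np.linalg.norm(q)
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    scores = v @ q                                         # cosine similarity
    best = np.argsort(scores)[::-1][:top_k]                # nearest neighbours first
    return [(float(scores[i]), chunks[i]) for i in best]

# e.g. cosine_search("where do we validate session tokens?", chunks, vectors, embed)
```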

Part 3: Comparing the Approaches

Both grep-based search and semantic indexing have their place — the right choice depends on your priorities.[7][8]

Why Some Tools Skip Indexing

As Nick Baumann from Cline explains, indexing introduces complexity that may not be worth the trade-off:[7]

"We don't index your codebase, and this choice isn't an oversight — it's a fundamental design decision."

The creator of Claude Code echoed similar concerns, noting that implementing indexing introduces problems around security, privacy, staleness, and reliability.

The Case Against RAG for Code

Cline's blog articulates three key challenges with retrieval-augmented generation (RAG) approaches:[7]

  1. Code doesn't think in chunks — When you chunk code for embeddings, you're tearing apart its logic. A function call might be in one chunk, its definition in another, and the critical context that explains why it exists scattered across a dozen fragments.

  2. Indexes decay while code evolves — Software development moves fast. Functions get refactored, dependencies update, entire modules get rewritten. An index is a snapshot frozen in time — every merge is a potential divergence between reality and your AI's understanding.

  3. Security becomes a liability — Your codebase isn't just text — it's your competitive advantage. Creating vector embeddings means creating a secondary representation of your IP that needs to be stored somewhere, doubling your security surface.

Cursor's documentation highlights why semantic search remains valuable:[8]

  • Faster results: Compute happens during indexing (offline) rather than at runtime, so searches are faster and cheaper
  • Better accuracy: Custom-trained models retrieve more relevant results than string matching
  • Fewer follow-ups: Users send fewer clarifying messages and use fewer tokens compared to grep-only search
  • Conceptual matching: Find code by what it does, not just what it's named

As Cursor notes: "Agent uses both grep and semantic search together. Grep excels at finding exact patterns, while semantic search excels at finding conceptually similar code."

The Trade-off

| Approach | Best For | Challenges |
| --- | --- | --- |
| Grep-based | Exact matches, massive repos, privacy-sensitive environments, zero setup | Token usage, exploration time, no semantic understanding |
| Semantic indexing | Conceptual queries, large teams, frequently searched codebases | Index staleness, security surface, infrastructure complexity |

Modern tools increasingly combine both approaches — using grep for precision and semantic search for discovery. The future likely isn't one or the other, but intelligent orchestration of both.



References