I Tested 3 Local Search Engines on My Notes. Here's What Actually Works.

I have almost 2,000 notes in my Obsidian vault. PDFs of research papers, course notes, random ideas, technical references. And every time I need to find something specific — a quote from a paper, a definition I wrote down months ago, a concept I know I captured somewhere — I end up scrolling, guessing filenames, or running basic text searches that return everything except what I need.

So I decided to build and benchmark three fundamentally different approaches to local document search, all running 100% on my machine. No cloud APIs. No sending my notes anywhere. Just fast, local retrieval.

Obsidian's built-in search is keyword-based. If I search for "cognitive endurance," it finds it. But if I search for "the ability to sustain mental effort over time" — which is exactly what cognitive endurance means — it finds nothing. The words don't match even though the meaning does.

I wanted something that understands what I'm asking, not just what I'm typing. And I wanted it in under a second.

The Three Approaches

I set up a proper experiment: 40 documents (10 PDFs + 30 markdown notes), 10 test queries spanning exact phrases, conceptual questions, cross-document lookups, and specific fact retrieval. Then I threw three very different search engines at it.

1. litesearch

litesearch is a young project (65 stars, 4 commits at the time I tested it) that takes the pure embedding approach. It chunks your documents, embeds them with nomic-embed-text-v1.5 (a 768-dimensional model), stores them in a local Qdrant Edge database, and searches by cosine similarity.

The idea: "Forget keywords. Just find chunks of text that mean the same thing as your query."
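To make "searches by cosine similarity" concrete, here is a minimal sketch of nearest-neighbor ranking over embedded chunks. It is not litesearch's code: the chunk ids and the 3-dimensional vectors are made up for illustration (a real model emits 768 dimensions), and a real index would not scan every vector in pure Python.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)

def top_k(query_vec, chunks, k=3):
    """Rank (chunk_id, vector) pairs by similarity to the query, best first."""
    scored = [(cosine_similarity(query_vec, vec), chunk_id) for chunk_id, vec in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(chunk_id, score) for score, chunk_id in scored[:k]]

# Toy "embeddings" (hypothetical values, chosen only to show the ranking).
chunks = [
    ("endurance.md#0", [0.9, 0.1, 0.0]),
    ("bellman.md#2", [0.0, 0.2, 0.9]),
    ("drilling.md#1", [0.6, 0.6, 0.1]),
]
results = top_k([1.0, 0.2, 0.0], chunks, k=2)
for chunk_id, score in results:
    print(f"{score:.3f}  {chunk_id}")
```

The query never has to share a word with the chunk. All that matters is that the two vectors point in a similar direction.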

2. An Agentic Search CLI (Built from Scratch)

Nothing small enough existed for this, so I built a ~300-line Python CLI that uses Ollama as a local LLM brain. It has four tools available to it:

  • search_semantic — KNN vector search over embedded chunks
  • grep_files — exact text pattern matching
  • read_section — read specific line ranges from documents
  • list_documents — see what's indexed

The LLM (qwen3.5 running locally) decides which tools to call, refines its queries, reads context around matches, and synthesizes an answer with cited quotes.

The idea: "Let an AI reason about what to search for, not just retrieve blindly."
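The tool layer of such an agent can be sketched without any LLM at all. The snippet below is an illustration under assumptions, not the actual CLI: the in-memory `DOCS` corpus, the `dispatch` helper, and the `{"name": ..., "arguments": {...}}` call shape are all hypothetical, and `search_semantic` is left out because it needs an embedding model.

```python
import re

# Hypothetical in-memory corpus: filename -> list of lines.
DOCS = {
    "endurance.md": ["# Cognitive endurance", "The ability to sustain mental effort over time."],
    "rl-notes.md": ["# RL", "The Bellman equation defines the value function recursively."],
}

def list_documents():
    """Return the names of all indexed documents."""
    return sorted(DOCS)

def grep_files(pattern):
    """Return (filename, line_number, line) for every line matching the regex."""
    rx = re.compile(pattern, re.IGNORECASE)
    return [(name, i, line)
            for name in sorted(DOCS)
            for i, line in enumerate(DOCS[name], start=1)
            if rx.search(line)]

def read_section(filename, start, end):
    """Return lines start..end (1-indexed, inclusive) of one document."""
    return DOCS[filename][start - 1:end]

TOOLS = {
    "list_documents": list_documents,
    "grep_files": grep_files,
    "read_section": read_section,
}

def dispatch(call):
    """Run one tool call of the assumed shape {"name": ..., "arguments": {...}}."""
    tool = TOOLS.get(call["name"])
    if tool is None:
        return {"error": f"unknown tool: {call['name']}"}
    return {"result": tool(**call.get("arguments", {}))}

# The model would emit calls like these; here they are written by hand.
hits = dispatch({"name": "grep_files", "arguments": {"pattern": "bellman"}})
context = dispatch({"name": "read_section",
                    "arguments": {"filename": "rl-notes.md", "start": 1, "end": 2}})
```

In an agent loop, each result is appended to the conversation and the model decides whether to call another tool or write its answer. Every one of those decisions is a full round-trip to the model.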

3. ck (Hybrid Search)

ck is a Rust binary that combines two search paradigms: BM25 keyword search (via tantivy, the Rust search engine library) and vector embedding search (via bge-small-en-v1.5). It merges results using Reciprocal Rank Fusion: a passage that ranks highly in both systems gets boosted to the top.

The idea: "Combine the precision of keywords with the understanding of semantics."
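Reciprocal Rank Fusion is simple enough to sketch in a few lines. Each document gets 1 / (k + rank) from every ranking it appears in, and the contributions are summed. The constant k = 60 below is the commonly used default, not a value confirmed for ck, and the filenames are made up.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several best-first rankings into one list of document ids.

    Each document scores sum(1 / (k + rank)) over the rankings that contain it.
    k=60 is a conventional default (an assumption here, not ck's known setting).
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))

bm25_ranking = ["a.md", "b.md", "c.md"]    # hypothetical keyword results
vector_ranking = ["b.md", "d.md", "a.md"]  # hypothetical embedding results
fused = reciprocal_rank_fusion([bm25_ranking, vector_ranking])
print(fused)
```

Documents that appear in both lists ("a.md" and "b.md" here) outrank documents that appear in only one. Only ranks are used, so the BM25 and cosine scores never need to be put on a common scale.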

The Results

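| Metric            | litesearch | agentic (semantic) | agentic (LLM) | ck (hybrid) |
|-------------------|------------|--------------------|---------------|-------------|
| Median query time | 636ms      | 4,669ms            | 59,490ms      | 195ms       |
| Hit rate          | 70%        | 90%                | 40%           | 70%         |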

Three things jumped out immediately:

Speed: ck Destroys Everything Else

195 milliseconds. That's perceptibly instant. You type a query, you get results before your finger leaves the Enter key. litesearch was 3x slower, the semantic-only search was 24x slower (mostly cold-start loading the embedding model), and the LLM agent took almost a minute per query.

The difference is architectural. ck is a compiled Rust binary with an inverted index already memory-mapped. litesearch loads an ONNX embedding model per query. The Python semantic search loads PyTorch's sentence-transformers from scratch every time. A persistent daemon would fix the Python cold-start, but ck doesn't even need one.
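The cold-start cost comes from repeating an expensive load on every invocation. A minimal sketch of the fix, assuming a long-lived process: cache the loaded model so only the first query pays for it. The `get_model` loader below is a hypothetical stand-in that returns a toy function, not a real embedding model.

```python
import functools

LOAD_COUNT = 0  # counts how often the expensive load actually runs

@functools.lru_cache(maxsize=1)
def get_model():
    """Stand-in for an expensive model load (hypothetical; no real model here)."""
    global LOAD_COUNT
    LOAD_COUNT += 1
    # A real implementation would construct the embedding model at this point.
    return lambda text: [float(len(text)), float(text.count(" "))]

def embed(text):
    """Embed text with the cached model; only the first call triggers a load."""
    return get_model()(text)

first = embed("cognitive endurance")
second = embed("bellman equation value function")
print(LOAD_COUNT)
```

The cache only helps while the process stays alive, which is why a one-shot CLI needs a daemon (or a fast-loading binary) to benefit.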

The LLM Agent Was the Worst Performer

This was the surprising one. I expected the agentic approach (an LLM that can reason about what to search for, refine queries, and combine semantic and keyword results) to produce the best results. Instead, it had a 40% hit rate and took roughly a minute per query (median 59.5 seconds).

Why? Two reasons:

  1. Speed kills reasoning. qwen3.5 at 6.6GB is a "thinking" model. It spends tokens on internal reasoning, leaving less budget for actually answering. At roughly a minute per query, the model is doing 5-10 tool calls, each with a round-trip to Ollama.

  2. Measurement artifact. The benchmark checks whether expected source filenames appear in the output. The LLM agent often found the right content but paraphrased the source instead of citing the exact filename. When I read the actual outputs manually, the answers were often excellent — richly cited with quotes and context. The quality was highest, even if the automated metric didn't capture it.
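The measurement artifact is easy to reproduce. Below is a minimal sketch of a filename-based hit check. It assumes a query counts as a hit when any expected filename appears in the output, which may differ from the actual benchmark runner, and the sample outputs are invented.

```python
def is_hit(output, expected_files):
    """True if any expected source filename appears verbatim in the output."""
    lowered = output.lower()
    return any(name.lower() in lowered for name in expected_files)

def hit_rate(runs):
    """runs: list of (output_text, expected_filenames). Returns the fraction of hits."""
    return sum(is_hit(out, exp) for out, exp in runs) / len(runs)

# Hypothetical outputs for the same query from two engines.
runs = [
    ("endurance.md: 'the ability to sustain mental effort'", ["endurance.md"]),
    # Right content, but the source is paraphrased instead of named:
    ("Your note on cognitive endurance defines it as sustained effort.", ["endurance.md"]),
]
rate = hit_rate(runs)
print(rate)
```

The second output answers the question correctly and still scores as a miss, because the check only looks for the literal filename.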

Direct Semantic Search Beat the LLM Agent

The "dumb" approach (just embed the query and find the nearest neighbors) achieved a 90% hit rate. No reasoning, no query refinement, no multi-step tool calling. Just brute-force cosine similarity over 11,000 embeddings.

This is a useful lesson: for a personal knowledge base of a few thousand documents, sophisticated reasoning adds latency without improving retrieval. The embedding space is small enough that KNN search is already very accurate. The LLM's intelligence only adds value when you need multi-hop reasoning across documents.

The Per-Query Breakdown

This is where it gets interesting. Each engine has a different failure mode:

| Query                                                           | ck   | litesearch | semantic | LLM agent |
|-----------------------------------------------------------------|------|------------|----------|-----------|
| "cognitive endurance" (exact phrase)                            | HIT  | HIT        | HIT      | HIT       |
| "strategies for effective practice and drilling" (conceptual)   | HIT  | MISS       | HIT      | MISS      |
| "Bellman equation value function" (technical term)              | MISS | HIT        | HIT      | MISS      |
| "LLMs to improve student learning" (cross-document)             | MISS | MISS       | HIT      | HIT       |
| "vibe coding vs AI programming" (specific fact)                 | MISS | MISS       | MISS     | MISS      |

The pattern: ck excels at queries that contain keywords present in the text. It missed "Bellman equation" because the notes use different phrasing. Embedding-based search excels at conceptual matching: the semantic search found "drilling strategies" when the document says "optimal drilling to learn," although litesearch missed that particular query. The LLM agent excels at cross-document synthesis, finding "LLMs for student learning" across multiple notes that each cover part of the answer.

And everyone missed "vibe coding vs AI programming" because the actual note title is "vibe coding vs ai assisted programming" — the word "assisted" matters more than you'd think to embedding models.

What I'm Actually Using Now

I wrapped ck in a shell script called docsearch and aliased it globally. My daily workflow:

# Quick search from anywhere
docsearch "context engineering RAG"

# Interactive browsing (fzf-like TUI)
docsearch tui

# Add a new paper
docsearch add ~/Downloads/new-paper.pdf

The wrapper handles PDF-to-markdown conversion automatically (using pymupdf4llm), manages the corpus directory, and sets the right Rust toolchain PATH. 195ms queries, hybrid search, zero cloud dependencies.

For deep research questions where I need to find and synthesize across multiple documents, I still fall back to the semantic search approach. But for 90% of my daily "where did I write that?" moments, ck is perfect.

The Takeaway

If you're building local document search:

  1. Start with hybrid search (BM25 + vector). Pure semantic misses exact terms. Pure keyword misses meaning. Hybrid gets both.
  2. Don't overthink it with LLM agents for retrieval. They're great for synthesis and reasoning, but for finding passages, a good embedding model + fast index is faster and more reliable.
  3. Speed matters more than you think. The difference between 200ms and 5 seconds isn't just comfort — it changes whether you actually use the tool. A 200ms search becomes a reflex. A 5-second search becomes "maybe I'll just scroll through my notes instead."
  4. Pre-convert PDFs to markdown. Every tool handles plain text well. PDF parsing is where things break. Convert once, search forever.
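"Convert once, search forever" can be sketched as a small batch step that skips PDFs already converted. This is an illustration, not the docsearch wrapper itself: the `convert_corpus` name and directory layout are hypothetical, and the converter is passed in as a callable so the sketch runs without any PDF library installed.

```python
from pathlib import Path

def convert_corpus(src_dir, out_dir, convert):
    """Convert each PDF in src_dir to a .md file in out_dir, skipping fresh ones.

    `convert` is any callable mapping a PDF path (str) to markdown text.
    Returns the names of the markdown files written on this run.
    """
    src_dir, out_dir = Path(src_dir), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for pdf in sorted(src_dir.glob("*.pdf")):
        target = out_dir / (pdf.stem + ".md")
        if target.exists() and target.stat().st_mtime >= pdf.stat().st_mtime:
            continue  # already converted and not older than the source PDF
        target.write_text(convert(str(pdf)), encoding="utf-8")
        written.append(target.name)
    return written
```

With pymupdf4llm installed, `pymupdf4llm.to_markdown` accepts a file path and returns markdown text, so it can be passed directly as `convert`. The indexer then only ever sees plain markdown.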

The full experiment code, benchmark runner, and docsearch wrapper are all in a single repo if you want to reproduce this. Every search runs 100% locally — your notes never leave your machine.