RAG vs Obsidian-Based Memory: A Comparative Analysis
Two paradigms for giving language models knowledge they didn’t have at training time — one statistical, one curated. When does each one win?
1. Introduction
Large language models have a memory problem. Their parametric memory (the weights of the network) is fixed at training time, opaque, and expensive to update. Their working memory (the context window) is finite, even at a million tokens, and gets noisier the more you stuff into it [12]. Any serious application that needs the model to know things it wasn’t trained on, or to remember things between sessions, has to bolt on an external memory.
Two contrasting designs dominate that bolt-on layer in 2026. The first is retrieval-augmented generation (RAG): split a corpus into chunks, embed them as vectors, store them in a database, retrieve the top-k most similar to the query at runtime, and inject them into the prompt [1]. RAG is the default architecture for enterprise document Q&A, chatbots over technical manuals, and most “chat with your PDF” products. The second is what we’ll call an Obsidian-based memory system: a hand-curated graph of Markdown files with YAML front-matter and explicit wikilinks, organised into typed cognitive layers, loaded selectively by name rather than by similarity [14]. This is the design pattern behind several recent “language agent” memory architectures, most directly the CoALA framework [7] and the MIRIX system [8].
The two paradigms are often framed as opponents, but that framing is misleading. RAG is a retrieval mechanism; the Obsidian-style vault is a knowledge representation. They answer different questions: RAG asks “what passages in the corpus are similar to this query?”, while a curated vault asks “what notes does this agent know exist, and which should it pull in given the task?”. The interesting comparison is therefore between two complete systems — one statistical and unsupervised, one structural and human-curated — and the tradeoffs they make on retrieval semantics, maintenance cost, scale, and trust.
This paper lays out both architectures in self-contained form, compares them on a fixed set of axes, and ends with decision guidance for which to choose when. The honest answer at the bottom is “both, layered” — but the layering only makes sense once you understand each side properly.
2. Foundations of RAG
2.1 Definition and motivation
Retrieval-augmented generation was introduced by Lewis et al. at FAIR in 2020 [1]. The original formulation combined a pretrained sequence-to-sequence model (BART) with a non-parametric memory: a dense vector index of Wikipedia passages accessed by a learned retriever (DPR). The model retrieves passages conditional on the query, then generates an answer conditioned on both the query and the retrieved passages. The key insight is the split between parametric memory (encoded in weights) and non-parametric memory (an external corpus that can be edited without retraining).
RAG addresses three concrete limitations of vanilla LLMs:
- Knowledge cutoff. Model weights freeze at training time. RAG lets you serve answers grounded in documents the model has never seen.
- Hallucination mitigation. When generation is conditioned on retrieved evidence, fabrication rates drop — provided the retrieval is on-topic and the prompt instructs the model to cite or refuse.
- Context-window economics. Even with million-token contexts, attention dilutes over long inputs (the “lost in the middle” effect [12]), and inference cost scales with input length. Retrieving 5–20 relevant chunks beats stuffing the whole manual.
2.2 The core pipeline
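The pipeline runs in six stages: chunking, embedding, vector storage, retrieval, re-ranking, and context injection with generation. Each stage has its own design space: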
Chunking. Documents are split into passages, typically 200–800 tokens with some overlap. Strategies range from fixed-size sliding windows to recursive splitters that respect Markdown headings, to semantic chunkers that group sentences by embedding similarity. Chunking is where most RAG systems silently fail: a chunk that bisects a definition or strips a table from its caption is irretrievable no matter how good the embeddings are [13].
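A minimal sketch of the fixed-size sliding-window strategy, assuming whitespace-separated words stand in for tokens (a real pipeline would count tokens with the embedding model's own tokenizer). The function name `chunk_words` and its default sizes are illustrative and do not come from any particular library.

```python
def chunk_words(text: str, size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into windows of `size` words; consecutive windows share `overlap` words."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # this window already reaches the end of the text
    return chunks
```

The overlap is what gives a definition that straddles a window boundary a chance of appearing whole in at least one chunk; it does not help with tables or captions, which need a structure-aware splitter.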
Embedding. Each chunk is mapped to a dense vector by a sentence-embedding model, such as OpenAI’s text-embedding-3-large, Cohere’s embed-v3, the open-source BGE family, or smaller local models like bge-micro-v2 (the 384-dimensional model bundled with Obsidian’s Smart Connections plugin [15]). The same model must encode both documents and queries; mixing models silently breaks similarity.
Vector store. Embeddings live in a database tuned for approximate nearest-neighbour (ANN) search. The 2026 landscape spans managed SaaS (Pinecone), open-source server-mode systems (Weaviate, Qdrant, Milvus), embedded libraries (Chroma, FAISS), and Postgres extensions (pgvector) [11]. The common indexing algorithm is HNSW (hierarchical navigable small world), which trades a small recall hit for sublinear query time. Choice depends on scale, filtering needs, and whether you want a managed service or to self-host.
Retrieval. The query is embedded with the same model and compared against the stored vectors by cosine similarity (or dot product / Euclidean). The top-k chunks (typically k = 5 to 20) are returned. Pure dense retrieval misses queries that hinge on rare keywords or exact identifiers, which is why hybrid search has become standard: run dense ANN and a sparse BM25 lexical search in parallel, then merge with Reciprocal Rank Fusion (RRF), which scores each candidate as 1/(k + rank) in each list and sums the scores across lists [10]. Here k is a smoothing constant (60 in the original paper), not the top-k cutoff. Because RRF uses only ranks, it needs no score normalisation across retrievers, and it ships natively in most major vector DBs.
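The fusion step is small enough to write out. The sketch below assumes each retriever returns a best-first list of chunk ids; the ids and the function name are hypothetical, and `k = 60` is the constant used in the original RRF paper.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse best-first lists of ids. Ranks are 1-based; k is the RRF smoothing constant."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

dense = ["c3", "c1", "c7"]   # hypothetical dense (ANN) result
sparse = ["c1", "c9", "c3"]  # hypothetical BM25 result
fused = reciprocal_rank_fusion([dense, sparse])
# fused order: c1, c3, c9, c7 (ids found by both retrievers rise to the top)
```

Only ranks enter the score, so cosine similarities and BM25 scores never have to be put on a common scale.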
Re-ranking. The top-N retrieved passages (say N = 50) are scored by a cross-encoder, a model that takes the (query, passage) pair jointly and outputs a relevance score. Cross-encoders are too slow to run over the whole corpus but excellent at re-ordering a shortlist. Re-ranking can lift NDCG@3 by 10–20 points on noisy corpora [5].
Context injection and generation. The final 3–10 chunks are formatted into the prompt with separators and source labels, the LLM is instructed to answer using only the supplied context (and optionally to cite), and generation proceeds. The recall-then-filter pattern (retrieve broadly, then squeeze to 3–5 high-precision passages) is the production default in 2026 [12].
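As an illustration of the injection step, the following sketch formats re-ranked chunks with numbered source labels and separators. The instruction wording, the `---` separator, and the function name are assumptions made for the example, not a recommended template.

```python
def build_prompt(question: str, chunks: list[tuple[str, str]]) -> str:
    """Assemble a grounded prompt from (source_label, text) pairs, best-first."""
    blocks = [
        f"[{i}] source: {label}\n{text}"
        for i, (label, text) in enumerate(chunks, start=1)
    ]
    context = "\n---\n".join(blocks)
    return (
        "Answer using only the context below. Cite sources as [n]. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
```

Numbering the chunks in the prompt is what makes citations checkable afterwards: a cited `[2]` can be mapped back to the exact passage the model was shown.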
2.3 Variants: naive, advanced, modular, GraphRAG, agentic
Gao et al.’s 2023 survey [2] introduced the now-standard taxonomy:
- Naive RAG. Index–retrieve–generate, no frills. Suffers from low precision and recall, mismatched query/document phrasing, and brittle prompt assembly.
- Advanced RAG. Adds pre-retrieval optimisations (query rewriting, query expansion, HyDE) and post-retrieval optimisations (re-ranking, context compression, deduplication). HyDE (Hypothetical Document Embeddings [3]) is a particularly clean trick: the LLM first writes a hypothetical answer to the query, that fictional document is embedded, and the embedding is used to retrieve, because answer-shaped vectors retrieve answer-shaped chunks more reliably than question-shaped vectors.
- Modular RAG. The pipeline is decomposed into reorderable components (search, memory, fusion, routing, predict, task-adapter), allowing patterns like Rewrite-Retrieve-Read or iterative retrieval loops.
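The HyDE step from the list above can be sketched in a few lines. The three callables (`generate`, `embed`, `search`) are hypothetical stand-ins for an LLM call, an embedding call, and a vector-store lookup; this is a single-sample simplification, not the published implementation.

```python
from typing import Callable, Sequence

Vector = Sequence[float]

def hyde_retrieve(
    query: str,
    generate: Callable[[str], str],               # hypothetical LLM call
    embed: Callable[[str], Vector],               # hypothetical embedding call
    search: Callable[[Vector, int], list[str]],   # hypothetical vector-store lookup
    top_k: int = 5,
) -> list[str]:
    """Retrieve with the embedding of a generated answer instead of the raw query."""
    hypothetical = generate(f"Write a short passage that answers: {query}")
    # The fictional passage, not the question, is what gets embedded.
    return search(embed(hypothetical), top_k)
```

The generated passage may be factually wrong; it is used only as a search key and is never shown to the user.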
Two more recent branches deserve attention:
GraphRAG (Microsoft Research, 2024) [4] tackles a known weakness of vector RAG: it cannot answer “global” questions like “What are the main themes in this corpus?”, because no single chunk contains the answer. GraphRAG uses an LLM to build an entity-and-relation knowledge graph over the source documents, then pre-generates community summaries via graph clustering. At query time it routes between local retrieval (vector) and global retrieval (community summary), giving substantial gains on sense-making questions over million-token corpora.
Agentic RAG wraps retrieval in a decision loop. Self-RAG [6] trains the LLM to emit special reflection tokens that decide whether to retrieve, evaluate the retrieved passages, and critique the draft answer. Corrective RAG (CRAG) inserts a lightweight retrieval evaluator: if the retrieved documents score below a relevance threshold, it triggers a fallback search (often web) before generating. Both turn the pipeline from a fixed sequence into an adaptive one, at the cost of additional latency and orchestration complexity.
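The corrective pattern described above reduces to a threshold check and a fallback. This is a deliberately simplified sketch: the evaluator, the fallback, and the 0.5 threshold are hypothetical, and the published CRAG method is more elaborate than a single cutoff.

```python
from typing import Callable

def corrective_retrieve(
    query: str,
    retrieve: Callable[[str], list[str]],
    score: Callable[[str, str], float],     # hypothetical relevance evaluator, 0..1
    fallback: Callable[[str], list[str]],   # hypothetical fallback, e.g. web search
    threshold: float = 0.5,
) -> list[str]:
    """Keep passages the evaluator accepts; if none pass, use the fallback source."""
    passages = retrieve(query)
    kept = [p for p in passages if score(query, p) >= threshold]
    return kept if kept else fallback(query)
```

Each extra branch adds a model call, which is where the latency cost mentioned above comes from.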
2.4 Evaluation
RAG evaluation is hard because a pipeline can fail at either stage (retrieval or generation), and end-to-end answer accuracy hides which one is broken. The dominant open-source framework is RAGAS [9], which decomposes quality into four largely reference-free, LLM-judged metrics:
- Faithfulness. Fraction of claims in the answer that are supported by the retrieved context. Targets hallucination.
- Answer relevancy. Whether the answer addresses the actual question.
- Context precision. Whether the relevant chunks are ranked at the top of the retrieved set.
- Context recall. Whether the retrieved set covers all the information needed to answer.
The RAGAS score is typically the mean of the four. Production teams usually pair these with golden-set evaluation (curated Q-A-context triples) and online metrics (click-through on cited sources, thumbs-up/down).
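The arithmetic behind these metrics can be sketched as follows. In RAGAS the per-claim verdicts and the other three scores come from an LLM judge; here they are hard-coded, purely illustrative values, and the function and variable names are not the library's API.

```python
from statistics import mean

def faithfulness(claim_supported: list[bool]) -> float:
    """Fraction of answer claims that the retrieved context supports."""
    if not claim_supported:
        return 0.0  # assumption: an answer with no extractable claims scores zero
    return sum(claim_supported) / len(claim_supported)

# Illustrative values only; a real run would obtain these from an LLM judge.
scores = {
    "faithfulness": faithfulness([True, True, False, True]),  # 3 of 4 claims supported
    "answer_relevancy": 0.90,
    "context_precision": 0.80,
    "context_recall": 0.95,
}
overall = mean(scores.values())  # simple mean of the four, roughly 0.85 here
```

Reporting the four numbers separately is usually more useful than the mean: low context recall points at retrieval, while low faithfulness with high recall points at generation.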
2.5 Common failure modes
Empirically, the same handful of failure modes recur across RAG deployments:
- Chunk-boundary loss. The answer spans two chunks and neither alone is retrievable.
- Retrieval miss. The query and the relevant chunk use different vocabulary; cosine similarity buries the right passage at rank 47.
- Context dilution. Top-k = 20 floods the prompt with marginally relevant chunks; the LLM averages over them and produces a vague, hedged answer [12].
- Lost in the middle. Even when the right chunk is in context, attention is biased toward the start and end of long prompts.
- Stale index. Documents change; embeddings don’t auto-refresh. Drift is silent until a user catches a wrong answer.
- Orphan facts. A chunk says “the deductible is then doubled” without the chunk explaining what “the deductible” is.
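Of the failure modes above, the stale index is the easiest to detect mechanically. One possible approach, sketched here under the assumption that a content hash is recorded for each document at embedding time, is to compare stored hashes with the current text; the function names and the in-memory dictionaries are hypothetical.

```python
import hashlib

def content_hash(text: str) -> str:
    """Stable fingerprint of a document's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def stale_documents(current: dict[str, str], indexed_hashes: dict[str, str]) -> list[str]:
    """Ids whose current text differs from the hash recorded at embedding time.

    Documents that were never indexed are reported as stale too.
    """
    return sorted(
        doc_id
        for doc_id, text in current.items()
        if indexed_hashes.get(doc_id) != content_hash(text)
    )
```

Running such a check on a schedule turns silent drift into a list of documents to re-chunk and re-embed.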
3. Foundations of an Obsidian-Based Memory System
3.1 The substrate: Markdown, YAML, wikilinks
Obsidian is a local-first note-taking application whose vault is just a folder of plain Markdown files on disk [14]. There is no proprietary database; the user owns the files. Three primitives carry the structural load:
- Markdown body. Human-readable text with headings, lists, code blocks. Parseable by anything.
- YAML front-matter. A block of structured metadata at the top of each note (tags, type, status, dates, custom keys). Acts as the typed slot system.
- Wikilinks. The syntax [[Some Note]] creates an explicit, named edge to another note. The set of all wikilinks defines a directed graph over the vault; Obsidian renders both forward links and backlinks (incoming references) for every note.
This means the memory system is, simultaneously, a filesystem (folders for organisation), a typed document store (YAML), and a knowledge graph (wikilinks). Retrieval can be performed by file path, by metadata filter, by graph traversal, or — with plugins — by full-text or semantic search. No single mechanism is privileged.
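To make the three primitives concrete, the sketch below pulls front-matter, body, and wikilink targets out of one note. It assumes flat `key: value` front-matter only (a real vault would want a proper YAML parser) and handles the `[[Note]]`, `[[Note|alias]]`, and `[[Note#Heading]]` link forms; `parse_note` is a hypothetical helper, not part of Obsidian.

```python
import re

# Captures the target note name; heading (#...) and alias (|...) parts are skipped.
WIKILINK = re.compile(r"\[\[([^\]|#]+)(?:#[^\]|]*)?(?:\|[^\]]*)?\]\]")

def parse_note(text: str) -> tuple[dict[str, str], str, list[str]]:
    """Return (front_matter, body, wikilink_targets) for one Markdown note."""
    meta: dict[str, str] = {}
    body = text
    if text.startswith("---\n"):
        header, sep, rest = text[4:].partition("\n---\n")
        if sep:  # only treat it as front-matter if the closing fence exists
            body = rest
            for line in header.splitlines():
                key, colon, value = line.partition(":")
                if colon:
                    meta[key.strip()] = value.strip()
    links = [m.group(1).strip() for m in WIKILINK.finditer(body)]
    return meta, body, links

note = "---\ntype: semantic\ntags: rag\n---\nSee [[Chunking]] and [[Vector Store|the store]].\n"
meta, body, links = parse_note(note)
# meta  -> {"type": "semantic", "tags": "rag"}
# links -> ["Chunking", "Vector Store"]
```

Everything downstream (metadata filters, graph traversal, backlink computation) can be built on these three return values.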
3.2 Typed cognitive layers
What turns a pile of Markdown into a memory system is the imposition of a typed structure on top of the filesystem. The intellectual lineage here is the CoALA framework (Sumers, Yao, Narasimhan, Griffiths, 2024) [7], which adapted classical cognitive-architecture distinctions to language agents. CoALA distinguishes working memory (the immediate prompt context) from long-term memory, and subdivides long-term into episodic (timestamped events), semantic (consolidated facts and concepts), and procedural (how-to knowledge, skills, callable tools).
MIRIX (Wang & Chen, 2025) [8] extends CoALA to a six-component schema explicitly designed for production agents: Core (persistent persona and user facts), Episodic (timestamped events), Semantic (abstract concepts and entities), Procedural (task-execution knowledge), Resource (documents, files, multimodal artefacts), and Knowledge Vault (secrets, credentials, structured records). MIRIX reports outperforming RAG baselines by 35% on an LLM-as-judge benchmark while cutting retrieval storage by 99.9%, with the gains coming from typed routing rather than raw embedding similarity.
An Obsidian-based memory vault operationalises this taxonomy as folders. Each layer is a directory; each note carries a type: field in its front-matter that mirrors the folder; loading rules determine what gets pulled into context based on the task being executed. The promotion path, in which episodic notes get distilled into semantic notes during periodic consolidation passes, is a deliberate human-in-the-loop ritual, not an automatic clustering job.
3.3 Loading by selective reading
The critical departure from RAG is the retrieval mechanism. In a curated vault, the agent does not embed the user’s query and search for similar vectors. Instead, the agent reads a small index note (an explicit listing of what exists) and a small core note (the always-on persona and user facts), then decides — by name, by tag, by wikilink traversal — which additional notes to read. Retrieval is by explicit address, not statistical proximity.
This has three immediate consequences. First, retrieval is deterministic: given the same task and the same vault state, the agent loads the same notes. Second, retrieval is explainable: every loaded note can be justified (“I read this because the index pointed to it” or “I followed a wikilink from the previous note”). Third, retrieval is whole-note: the unit of context is the file the human wrote, not a 500-token slice of it. Boundary loss does not exist as a failure mode, because there are no boundaries to cross.
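Loading by address can be sketched as a bounded graph walk. The vault is modelled as an in-memory dict from note name to text, and the loader follows every wikilink up to `max_hops`; a real agent would choose which links to follow based on the task, so this shows only the deterministic skeleton, with hypothetical note names.

```python
import re

WIKILINK = re.compile(r"\[\[([^\]|#]+)")  # captures the target note name

def load_context(vault: dict[str, str], start: list[str], max_hops: int = 1) -> list[str]:
    """Breadth-first load by name: start notes, then link targets up to max_hops away."""
    loaded: list[str] = []
    frontier = list(start)
    for _ in range(max_hops + 1):
        next_frontier: list[str] = []
        for name in frontier:
            if name in loaded or name not in vault:
                continue  # skip repeats and links to notes that do not exist
            loaded.append(name)
            next_frontier += [m.group(1).strip() for m in WIKILINK.finditer(vault[name])]
        frontier = next_frontier
    return loaded

vault = {
    "index": "Always read [[core]]. RAG material lives in [[rag-notes]].",
    "core": "User prefers Python.",
    "rag-notes": "Details in [[chunking]].",
    "chunking": "Fixed-size windows with overlap.",
}
loaded = load_context(vault, ["index"], max_hops=1)
# loaded -> ["index", "core", "rag-notes"]
```

The same vault and the same start list always yield the same load order, and each entry in `loaded` is either a start note or reachable by a named link from an earlier one, which is the explanation trail described above.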
3.4 Human-in-the-loop consolidation
The curated vault depends on consolidation work that an embedding pipeline avoids. Episodic notes (daily logs, session transcripts, decision records) accumulate quickly. Without distillation they become noise. The maintenance ritual is to periodically promote durable patterns from episodic to semantic (“we’ve decided this three times; write it down once, in canonical form”), to update procedural notes when workflows change, and to retire stale resources. In practice this is shared between the human and the LLM agent itself: Claude can propose consolidations, and the human approves them.
The cost is real. A curated vault scales sublinearly in information density but superlinearly in human attention. RAG inverts this: it scales linearly in storage and compute, with near-zero per-document curation, but pays at retrieval quality. Which trade is better depends on the corpus, which is the subject of the next section.
4. Comparison
The two systems make different choices on a fairly stable set of axes. Below, the prose treatment first, then a compact table.
4.1 Retrieval mechanism
RAG retrieves by similarity: cosine distance in an embedding space, optionally fused with BM25 and re-ranked. The mechanism is statistical, learned, and opaque — you can inspect which chunks were returned but not why they were ranked above the alternatives. An Obsidian vault retrieves by explicit address: filename, wikilink target, tag, folder. The mechanism is symbolic and fully inspectable. With Smart Connections or similar plugins, semantic search can be layered on top, but the primary access path remains the named link.
4.2 Granularity
RAG operates on chunks — sub-document fragments whose size is a tuning parameter. This is excellent for surgical retrieval (the one paragraph that defines an API) and poor for context (the whole policy the paragraph belongs to). Obsidian operates on whole notes, sized by the human who wrote them. A well-curated note is already the right unit of context for one concept; cross-note context is reconstructed by following wikilinks.
4.3 Determinism and explainability
Curated retrieval is deterministic and trivially auditable. RAG retrieval depends on the embedding model, the index state, the chunking parameters, the re-ranker, and (with HyDE or query rewriting) on stochastic LLM output. It is reproducible only when every component is pinned, and the explanation for any given top-k is a similarity score, not a reason.
4.4 Maintenance cost
Once a chunking and embedding pipeline is wired, RAG is largely self-maintaining: new documents come in, get chunked and embedded, and the index updates. The maintenance work is monitoring drift and re-embedding when models change. A curated vault demands continuous human attention: writing notes well in the first place, linking them, periodically consolidating episodic into semantic. This is the central trade.
4.5 Scalability
RAG scales to millions of documents: pgvector comfortably handles single-digit millions of vectors; Pinecone, Weaviate, Qdrant and Milvus run into the billions with sharding [11]. An Obsidian vault is bounded by what a human can curate, which is somewhere in the hundreds to low tens of thousands of notes. Past that, the wikilink graph becomes unmaintainable and you’ve reinvented a worse vector store.
4.6 Latency and infrastructure
A managed RAG stack adds 50–500 ms of retrieval latency, plus the network hop to a vector DB and the re-ranker model. An Obsidian vault on local disk loads notes at filesystem speed — typically < 5 ms per note for a small handful — and requires no infrastructure beyond the agent itself. For a personal assistant operating on a few thousand notes, the local vault is unambiguously faster and cheaper.
4.7 Drift, staleness, freshness
RAG indices go stale silently: documents change, embeddings don’t auto-update, the LLM cheerfully cites yesterday’s policy. Detection requires re-indexing pipelines or freshness metadata. A curated vault makes staleness a human concern by design — the act of editing a note is the update. Wikilinks break visibly when targets are renamed (Obsidian flags broken links), so structural drift surfaces immediately.
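The structural check is simple to reproduce outside Obsidian, for example in a pre-commit hook on a Git-tracked vault. The sketch assumes the vault is a dict from note name to text and that links resolve by exact note name; Obsidian's own resolution rules are more lenient, and `broken_links` is a hypothetical helper.

```python
import re

WIKILINK = re.compile(r"\[\[([^\]|#]+)")  # captures the target note name

def broken_links(vault: dict[str, str]) -> dict[str, list[str]]:
    """Map each note to the wikilink targets that do not exist in the vault."""
    existing = set(vault)
    report: dict[str, list[str]] = {}
    for name, text in vault.items():
        targets = {m.group(1).strip() for m in WIKILINK.finditer(text)}
        missing = sorted(targets - existing)
        if missing:
            report[name] = missing
    return report
```

An empty report means every named edge in the graph still has a target.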
4.8 Privacy and locality
RAG with a cloud vector DB means your documents are embedded by a cloud model (OpenAI, Cohere, Voyage) and stored as vectors in a managed service. Most enterprise RAG deployments accept this; many regulated industries cannot. An Obsidian vault is plain Markdown on the user’s disk; Smart Connections embeddings are computed locally with a bundled model and stored in a hidden vault directory [15]. The entire memory is grep-able, version-controllable in Git, and never leaves the machine unless the user chooses to sync it.
4.9 Suitability for agentic workflows
Agents need to write back to memory, not just read from it. Writing a new chunk to a vector DB is straightforward but the new chunk is invisible to anything except similarity search; there is no explicit place it “belongs”. Writing a new note to a typed vault, by contrast, forces the agent to pick a layer, a filename, a set of wikilinks to existing concepts; the act of writing is itself an act of consolidation. This is why CoALA [7] and MIRIX [8] both reach for typed memory layers rather than a flat vector index.
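A typed write-back can be sketched as a function that refuses to write until a layer, a name, and at least one link are supplied. The layer names follow the taxonomy in Section 3.2; the one-link minimum, the `Related:` line, and the function itself are assumptions made for illustration.

```python
from pathlib import Path

# Layer names follow Section 3.2; the exact folder spelling is an assumption.
LAYERS = {"core", "episodic", "semantic", "procedural", "resource", "knowledge-vault"}

def write_note(vault: Path, layer: str, name: str, body: str, links: list[str]) -> Path:
    """Create <vault>/<layer>/<name>.md with a type field that mirrors the folder."""
    if layer not in LAYERS:
        raise ValueError(f"unknown layer: {layer}")
    if not links:
        raise ValueError("a new note must link to at least one existing concept")
    path = vault / layer / f"{name}.md"
    path.parent.mkdir(parents=True, exist_ok=True)
    related = ", ".join(f"[[{target}]]" for target in links)
    path.write_text(
        f"---\ntype: {layer}\n---\n{body}\n\nRelated: {related}\n",
        encoding="utf-8",
    )
    return path
```

Because the result is a plain file, every agent write shows up as a diff in version control and can be reviewed or reverted like any other change.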
4.10 Hybrid approaches
The interesting deployments are not pure. Smart Connections runs semantic search over an Obsidian vault using local embeddings: you get the curated structure for primary access and similarity fallback for the moments when you’ve forgotten what a note is called [15]. On the RAG side, GraphRAG [4] inserts an LLM-built knowledge graph between the documents and the retriever, converging from the opposite direction toward the same structural-plus-statistical hybrid. The dichotomy in this paper’s title is, in practice, a spectrum.
4.11 Side-by-side
| Axis | RAG (vector pipeline) | Obsidian-based memory |
|---|---|---|
| Primary retrieval | Cosine similarity over chunk embeddings; hybrid with BM25 | Explicit address: filename, wikilink, tag, folder |
| Unit of context | Chunk (200–800 tokens, tuned) | Whole note (human-sized) |
| Determinism | Probabilistic; depends on model, index state, re-ranker | Deterministic; same task → same loads |
| Explainability | Similarity score (opaque why) | Symbolic justification (named link, tag, index) |
| Curation cost | Near zero per document | High; continuous human attention |
| Scalability sweet spot | 10⁴ – 10⁹ documents | 10² – 10⁴ notes |
| Infrastructure | Vector DB, embedding model, re-ranker, pipeline | Folder of .md files |
| Latency | ~50–500 ms retrieval + network | Filesystem read; typically < 5 ms per note |
| Privacy posture | Often cloud embeddings + managed DB | Local plaintext by default |
| Staleness handling | Re-index pipelines; silent drift risk | Editing the note is the update |
| Failure modes | Chunk boundary loss, retrieval miss, context dilution, lost-in-middle, stale index | Curation debt, orphan notes, broken links, human-bandwidth ceiling |
| Agent write-back | Append chunk to index (untyped) | Create typed note in named layer (forces consolidation) |
| Best at | Unstructured corpora at scale, enterprise Q&A | Personal/agent memory, durable concepts, audited workflows |
5. When to Use Which
The choice usually collapses onto a small number of corpus and use-case dimensions.
Choose RAG when:
- The corpus is large (> 10,000 documents), unstructured, and authored by many people you don’t control.
- Queries are unpredictable and arrive in natural language from end users; you cannot enumerate them in advance.
- Surgical retrieval is required: pull the one paragraph that defines a parameter, not the whole 200-page manual.
- The corpus updates frequently and curation is not a viable cost line.
- Typical applications: enterprise document Q&A, customer-support deflection, code search across a monorepo, scientific literature review.
Choose an Obsidian-based memory when:
- The corpus is personal or organisational, sized in the hundreds to low thousands of notes.
- You are building an agent that needs to remember across sessions (facts about the user, decisions, procedures), not just answer questions over a static corpus.
- Determinism and explainability matter (regulated workflows, audit trails, agent debugging).
- Privacy is a hard constraint (sensitive personal data, attorney-client material, medical history).
- You want the LLM to write back to memory, and you want those writes to be inspectable and reversible.
- Typical applications: personal knowledge management, agent persistence, project memory, methodology vaults.
The hybrid case is the common case. If you have a curated agent memory of, say, 500 notes plus a 50,000-document research archive, the right architecture is both: load the curated vault by explicit address for persona, user facts, and procedures; run vector RAG over the research archive for ad-hoc fact retrieval. Smart Connections inside Obsidian is a small-scale version of exactly this. GraphRAG is a large-scale version where the structural layer is built automatically rather than by hand.
The wrong move is to use one architecture for a job the other is built for. Vector RAG over 200 personal notes is over-engineered and brittle; a curated vault over 5 million research papers is impossible. Pick by corpus size and curation budget first, by query pattern second.
6. Conclusion
RAG and Obsidian-based memory are not rival answers to the same question. RAG is a way of indexing unstructured corpora so that an LLM can pull relevant fragments out of them; it scales beautifully and demands almost no human attention per document, paying for that with opaque retrieval, chunk-boundary failures, and silent drift. An Obsidian-based vault is a way of writing a memory in the first place — choosing names, layers, links — so that an LLM can read it by address; it scales poorly with corpus size but unlocks deterministic, explainable, privacy-local memory at the scale of a person or a small team.
The convergence is visible from both sides. MIRIX and CoALA take agent memory away from flat embedding stores and toward typed cognitive layers because typing lets the agent know what kind of thing it is reading. GraphRAG takes vector RAG toward LLM-built knowledge graphs because pure similarity cannot answer global questions over a corpus. Both directions point at the same thing: structure on top of (or instead of) statistics.
For the kind of system this archive cares about — a long-running personal agent with durable memory of its user, working alongside large external corpora — the honest architecture is layered. A small, hand-curated Obsidian vault carries identity, methodology, and consolidated semantic knowledge; vector RAG (with re-ranking and hybrid search) extends the agent’s reach into research archives and documentation it does not own. The curated layer earns its curation cost by being the agent’s self-model; the RAG layer earns its scale by being everything else.
What you should not do is mistake an embedding index for a memory. An embedding index is a search engine. A memory has shape, type, and a history of consolidation. Build the shape first; let similarity fill in the gaps.
7. References
- [1] Lewis, P. et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. NeurIPS 2020. https://arxiv.org/abs/2005.11401
- [2] Gao, Y. et al. (2023). Retrieval-Augmented Generation for Large Language Models: A Survey. arXiv:2312.10997. https://arxiv.org/abs/2312.10997
- [3] Gao, L., Ma, X., Lin, J., Callan, J. (2022). Precise Zero-Shot Dense Retrieval without Relevance Labels (HyDE). arXiv:2212.10496; ACL 2023. https://arxiv.org/abs/2212.10496
- [4] Edge, D. et al. (2024). From Local to Global: A Graph RAG Approach to Query-Focused Summarization. arXiv:2404.16130. https://arxiv.org/abs/2404.16130; project page: https://microsoft.github.io/graphrag/
- [5] Mao, S. et al. (2024). RaFe: Ranking Feedback Improves Query Rewriting for RAG. EMNLP 2024 Findings. https://arxiv.org/abs/2405.14431
- [6] Singh, A. et al. (2025). Agentic Retrieval-Augmented Generation: A Survey on Agentic RAG. arXiv:2501.09136. https://arxiv.org/abs/2501.09136
- [7] Sumers, T. R., Yao, S., Narasimhan, K., Griffiths, T. L. (2024). Cognitive Architectures for Language Agents. Transactions on Machine Learning Research. arXiv:2309.02427. https://arxiv.org/abs/2309.02427
- [8] Wang, Y., Chen, X. (2025). MIRIX: Multi-Agent Memory System for LLM-Based Agents. arXiv:2507.07957. https://arxiv.org/abs/2507.07957
- [9] Es, S., James, J., Espinosa Anke, L., Schockaert, S. (2023). RAGAS: Automated Evaluation of Retrieval Augmented Generation. Documentation: https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/
- [10] Cormack, G., Clarke, C., Buettcher, S. (2009). Reciprocal Rank Fusion outperforms Condorcet and individual rank learning methods. SIGIR. Practitioner reference: https://superlinked.com/vectorhub/articles/optimizing-rag-with-hybrid-search-reranking
- [11] Vector database landscape overview (Pinecone, Weaviate, Qdrant, Milvus, Chroma, FAISS, pgvector). DataCamp 2026 review: https://www.datacamp.com/blog/the-top-5-vector-databases
- [12] Liu, N. F. et al. (2023). Lost in the Middle: How Language Models Use Long Contexts. arXiv:2307.03172. https://arxiv.org/abs/2307.03172
- [13] Weaviate. Chunking Strategies to Improve LLM RAG Pipeline Performance. https://weaviate.io/blog/chunking-strategies-for-rag
- [14] Obsidian Help. Internal links and Backlinks. https://help.obsidian.md/links · https://help.obsidian.md/plugins/backlinks
- [15] Petro, B. Smart Connections for Obsidian (plugin and documentation). https://smartconnections.app/ · https://github.com/brianpetro/obsidian-smart-connections
Note on sources: references [1]–[8] and [12] are peer-reviewed or arXiv preprints. Reference [9] is the open-source RAGAS documentation; the framework was first described in Es et al., EACL 2024 demo track. Reference [10] is the original RRF paper (Cormack et al., SIGIR 2009) plus a current practitioner survey for the RAG context. References [11], [13], [14], [15] are vendor or community documentation, used here for product behaviour and current practice, not theoretical claims.
