How AI Apps Are Built
Every AI application you use — Cursor, Perplexity, Notion AI, Lovable, ChatGPT — is built from the same three architectural layers: a model (reasoning), a retrieval system (memory), and an orchestration layer (action). Understanding these three layers demystifies what feels like magic and gives you a mental model for using AI tools more effectively.
This guide walks through how four major AI apps actually work, without code. If you want to first discover which AI products are worth studying, start with Where to Find AI Projects in 2026 or compare sources in Best Tools for Discovering AI Projects.
Layer 1: The Model — What AI Actually "Thinks" With
The foundation of every AI app is a large language model (LLM). Think of it as an extraordinarily well-read prediction engine: it was trained on trillions of words of text, learning the statistical relationships between words, ideas, and reasoning patterns. When you give it text, it generates what's most likely to come next, based on everything it has learned.
This is both the power and the limitation of LLMs:
The power: LLMs can reason, write, summarize, translate, and generate code because they've seen so many examples of humans doing these things that they've learned the underlying patterns.
The limitation: LLMs only know what they were trained on. They don't know about events after their training cutoff. They don't know about your specific codebase. They don't know about your company's internal documents. They generate confident-sounding text even when they're wrong — because prediction doesn't distinguish "I know this" from "I'm extrapolating."
This limitation is why the second layer exists.
Layer 2: Retrieval — How AI Apps Give LLMs Memory
The retrieval layer is what most AI apps are actually built around. It solves the LLM's core limitation by giving the model access to specific, relevant, up-to-date information before it generates a response.
The dominant architecture is Retrieval-Augmented Generation (RAG):
User question → Search a database for relevant documents → Pass those documents to the LLM as context → LLM generates an answer grounded in those documents
The "database" in this pattern isn't a traditional database — it's a vector database that stores documents as embeddings (numerical representations of meaning, not just text). This enables semantic search: finding documents that are conceptually related to a query, not just textually matching.
Why vector databases matter: If you search a traditional database for "authorization logic," you get results containing that exact phrase. If you search a vector database for "authorization logic," you get results about authentication, permissions, access control, and security — because the embeddings capture meaning, not just keywords.
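The idea above can be sketched in a few lines. This is a toy illustration, not any real vector database: the "embeddings" are hand-made 3-dimensional vectors (real models use hundreds of dimensions), but the ranking mechanism — cosine similarity between vectors — is the same one semantic search relies on.

```python
import math

def cosine(a, b):
    """Cosine similarity: how aligned two embedding vectors are."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Toy embeddings: the first dimension loosely encodes "security-related",
# the last loosely encodes "visual/layout-related".
docs = {
    "auth middleware checks permissions": [0.9, 0.1, 0.0],
    "access control for admin routes":    [0.8, 0.2, 0.1],
    "CSS grid layout tips":               [0.0, 0.1, 0.9],
}
query = [0.85, 0.15, 0.05]  # imagined embedding of "authorization logic"

ranked = sorted(docs, key=lambda d: cosine(query, docs[d]), reverse=True)
print(ranked[0])  # the auth doc ranks first despite zero keyword overlap
```

Note that the top result shares no words with the query — the match happens in vector space, which is exactly what a keyword search cannot do.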
How Cursor Works: RAG for Code
Cursor is the most architecturally interesting AI coding tool because its core challenge is unique: indexing an entire codebase semantically, in real time, for any project size.
Here's how it works:
Step 1: Semantic chunking. Instead of splitting code arbitrarily, Cursor breaks it into meaningful units — individual functions, classes, and modules. Each chunk captures a coherent piece of logic, which makes retrieval more accurate.
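Cursor's actual chunker isn't public, but the principle — split on syntactic boundaries rather than arbitrary character counts — can be demonstrated with Python's built-in `ast` module:

```python
import ast

source = '''
def login(user, password):
    return check(user, password)

class Session:
    def refresh(self):
        ...
'''

# Walk top-level definitions and keep each one as a coherent chunk,
# instead of slicing the file every N characters.
tree = ast.parse(source)
chunks = []
for node in tree.body:
    if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
        chunks.append(ast.get_source_segment(source, node))

for chunk in chunks:
    print("--- chunk ---")
    print(chunk)
```

Each chunk is a complete function or class, so an embedding of it represents one unit of logic — a query about sessions retrieves the `Session` class, not half of it plus half of something unrelated.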
Step 2: Merkle tree synchronization. To detect changes without re-indexing everything, Cursor builds a cryptographic tree of file hashes. Every 10 minutes, it compares the current tree to the stored one. Only the files that changed get re-indexed. On a 50,000-file codebase, this cuts bandwidth from transferring all file metadata to just transmitting hash mismatches.
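A minimal sketch of that change-detection idea, assuming a flattened tree (per-file leaf hashes plus a single root hash; a real Merkle tree also has interior nodes per directory, so mismatches can be localized without comparing every leaf):

```python
import hashlib

def h(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def tree(files: dict) -> dict:
    """Hash every file, then hash the sorted leaf hashes into one root."""
    leaves = {path: h(content) for path, content in files.items()}
    root = h("".join(leaves[p] for p in sorted(leaves)).encode())
    return {"leaves": leaves, "root": root}

def changed_paths(old: dict, new: dict) -> list:
    if old["root"] == new["root"]:
        return []  # roots match: the whole codebase is unchanged
    return [p for p, digest in new["leaves"].items()
            if old["leaves"].get(p) != digest]

v1 = tree({"auth.py": b"def login(): ...", "ui.py": b"render()"})
v2 = tree({"auth.py": b"def login(): check()", "ui.py": b"render()"})
print(changed_paths(v1, v2))  # -> ['auth.py']: only this file is re-indexed
```

The key property: one root-hash comparison answers "did anything change at all?", and only on a mismatch does the diff descend to individual files.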
Step 3: Embedding generation and caching. Changed code chunks are converted to vector embeddings and stored in a remote vector database (Turbopuffer). Unchanged chunks reuse cached embeddings — if the code didn't change, the embedding doesn't need to be regenerated.
Step 4: Index reuse across teams. Codebases within organizations are typically 92% similar between team members. When a new developer joins, Cursor's system finds the closest existing index in the organization and reuses it — cutting time-to-first-query from hours to seconds for large codebases.
Step 5: Query time. When you ask Cursor a question, it converts your question to an embedding, searches the vector database for semantically similar code chunks, and passes the most relevant results to the LLM as context. The model has never seen your codebase — it's seeing the relevant chunks retrieved for this specific query.
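The query-time flow in Step 5 reduces to four moves: embed, search, assemble a prompt, generate. A sketch with stand-ins — `FakeIndex`, the `Chunk` type, and the lambda `embed`/`llm` are all invented here to make the shape runnable, not real Cursor internals:

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    score: float

class FakeIndex:
    """Stand-in for a vector database; returns pre-ranked chunks."""
    def __init__(self, chunks):
        self.chunks = chunks
    def search(self, q_vec, top_k):
        return sorted(self.chunks, key=lambda c: c.score, reverse=True)[:top_k]

def answer(question, index, embed, llm, k=2):
    q_vec = embed(question)                    # 1. embed the question
    chunks = index.search(q_vec, top_k=k)      # 2. semantic search
    context = "\n\n".join(c.text for c in chunks)
    prompt = ("Answer using only this code:\n" # 3. ground the model
              f"{context}\n\nQuestion: {question}")
    return llm(prompt)                         # 4. generate

index = FakeIndex([Chunk("def login(): ...", 0.9), Chunk("css = {}", 0.1)])
reply = answer("How does auth work?", index,
               embed=lambda q: [0.0], llm=lambda p: p[:40])
```

The point of the sketch: the LLM only ever sees `prompt` — a handful of retrieved chunks — never the codebase itself.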
What this means in practice: Cursor works best in well-documented, popular languages and frameworks because the LLM's base knowledge (layer 1) matches the patterns it retrieves (layer 2). Obscure languages or very novel architectural patterns work less well because the LLM has less baseline knowledge to reason with.
How Perplexity Works: RAG for the Web
Perplexity's architectural challenge is different from Cursor's: it needs to search the entire web at query time, retrieve the most relevant pages, and generate a grounded answer with citations — all in under 2 seconds.
Its pipeline has five stages:
Stage 1: Query understanding. Perplexity first classifies your query — is this a factual question? A comparison? A how-to? This classification affects which retrieval strategy it uses.
Stage 2: Hybrid retrieval. Rather than pure semantic search, Perplexity combines two approaches: vector-based search (semantic meaning) and keyword-based search. It merges the results — approximately 30 vector results and 20 keyword results — to get ~50 candidate documents. This hybrid approach maximizes recall and prevents "domain overfitting" where one retrieval method misses relevant documents the other would catch.
Stage 3: Reranking. Not all 50 documents are equally useful. A specialized reranking model scores them for relevance to the specific query, filtering down to the most valuable sources.
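Perplexity hasn't published its exact merge logic, but a standard technique for combining two ranked lists is Reciprocal Rank Fusion (RRF), which rewards documents that rank well in either list. A minimal sketch:

```python
def rrf(vector_hits, keyword_hits, k=60):
    """Reciprocal Rank Fusion: merge two ranked lists of doc IDs.
    Each appearance adds 1/(k + rank); k=60 is the conventional constant."""
    scores = {}
    for hits in (vector_hits, keyword_hits):
        for rank, doc in enumerate(hits):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits  = ["a", "b", "c"]   # semantic matches, best first
keyword_hits = ["c", "a", "d"]   # exact-term matches, best first
merged = rrf(vector_hits, keyword_hits)
print(merged)  # -> ['a', 'c', 'b', 'd']
```

Document `a` wins because both retrieval methods rank it highly, while `b` and `d` — each found by only one method — still survive into the candidate pool, which is exactly the recall benefit the hybrid approach is after.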
Stage 4: Multi-model orchestration. Perplexity doesn't use a single LLM. A router trained with reinforcement learning dynamically selects which model to use for each query — optimizing for the combination of quality, latency, and cost that best fits the query type.
Stage 5: Context fusion and citation mapping. A Context Fusion Engine takes the retrieved documents and the LLM's generated response, mapping each claim to its source document. This is why every Perplexity response includes numbered citations with confidence scores — it's not just showing sources, it's verifying that each claim is grounded in a retrieved document.
The key insight: Perplexity's quality advantage over a raw LLM isn't the model — it's the retrieval and verification pipeline. The model is interchangeable (they route to different models dynamically). The multi-stage pipeline that grounds answers in real sources is the architectural moat.
How Notion AI Works: RAG for Personal Knowledge
Notion AI's challenge is giving an LLM access to your workspace — potentially thousands of documents, databases, and notes — and enabling natural language queries across all of it.
The architecture follows the same RAG pattern, but with a workspace-specific twist:
Document indexing: Every page in your workspace gets chunked and embedded. The vector database is personal — it contains only your workspace's content.
Permission-aware retrieval: When you ask Notion AI a question, it only retrieves documents you have permission to see. The retrieval layer enforces access control — the LLM never "sees" documents outside your permissions, even indirectly.
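That ordering matters: access control is applied in the retrieval layer, before anything reaches the model. A toy sketch of the idea (the document store and `readers` field are invented for illustration):

```python
docs = [
    {"id": "roadmap",  "readers": {"alice", "bob"}, "score": 0.9},
    {"id": "salaries", "readers": {"alice"},        "score": 0.8},
]

def retrieve(user, top_k=1):
    """Filter by permission *before* ranking results for the LLM."""
    allowed = [d for d in docs if user in d["readers"]]
    allowed.sort(key=lambda d: d["score"], reverse=True)
    return [d["id"] for d in allowed[:top_k]]

print(retrieve("bob"))  # -> ['roadmap']: bob never sees 'salaries'
```

Because the restricted document is dropped before prompt assembly, the model can't leak it even indirectly — it was never in context.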
MCP integration: In 2025, Notion shipped a hosted MCP (Model Context Protocol) server that exposes workspace actions in a format AI coding tools like Cursor can invoke. This enables workflows like "Create a Notion page with these specs" directly from a Cursor chat — the retrieval flows in both directions.
Why it sometimes gives generic answers: When Notion AI responds in a way that doesn't reference your specific documents, it's usually because the retrieval step didn't find highly relevant content for your query. The model falls back to its general knowledge. Prompting more specifically — mentioning document names, using terms you actually use in your workspace — helps the retrieval step find the right context.
How Lovable Works: Generating Entire Applications
Lovable's architecture is different from the retrieval-focused apps above. Its challenge isn't finding the right information — it's generating coherent, deployable code across multiple files and services from a single natural language description.
The intent-to-execution pipeline:
- Intent parsing: Your natural language description is analyzed to extract product requirements — what pages, what features, what integrations, what data model.
- Architecture planning: Before generating any code, Lovable's system plans the full stack: which frontend framework, which database, what authentication pattern, what API structure.
- Multi-file code generation: Code is generated not just for one file but for the complete application — React components, database schema, API routes, authentication config, deployment scripts.
- Consistency verification: Generated code is checked for consistency across files — import paths, type definitions, data model alignment.
The scale problem: Lovable processes over 1 billion tokens per minute at peak. At API rates, that's a massive infrastructure cost. Their engineering solution: sophisticated load balancing across multiple LLM providers (Anthropic, Google Vertex, Amazon Bedrock), using "project-level affinity" — keeping consecutive requests from the same project on the same provider to maintain prompt caching effectiveness.
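The affinity part of that scheme can be sketched with deterministic hashing — same project ID, same provider, every time, so the provider's prompt cache stays warm. This is an illustration of the affinity idea only; Lovable's real balancer presumably also weighs provider load and health, and the provider names here are just labels:

```python
import hashlib

PROVIDERS = ["anthropic", "vertex", "bedrock"]  # illustrative labels

def route(project_id: str) -> str:
    """Project-level affinity: hash the project ID to pick a provider,
    so consecutive requests from one project hit the same cache."""
    digest = hashlib.sha256(project_id.encode()).digest()
    return PROVIDERS[digest[0] % len(PROVIDERS)]

assert route("proj-42") == route("proj-42")  # stable across sessions
```

The design choice is the classic cache-affinity trade-off: random routing would balance load slightly better, but every provider switch discards the cached prompt prefix and pays full token cost again.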
Why sessions sometimes feel inconsistent: Lovable's system may route your project to a different LLM provider between sessions. The underlying model may differ. This is why establishing a clear project rules document at the start of any Lovable project — specifying your stack, folder structure, and conventions — anchors the AI's decisions across session boundaries.
Layer 3: Orchestration — How AI Apps Take Action
The third architectural layer is what transforms an AI from a text generator into an agent: the ability to take actions in the world.
Orchestration is the system that lets an LLM:
- Call external APIs
- Search the web
- Write and execute code
- Read and modify files
- Interact with databases and services
Most AI chatbots only have layers 1 and 2. AI agents have all three.
| Tool | Model | Retrieval | Orchestration |
|---|---|---|---|
| ChatGPT (no search) | ✅ | ❌ | ❌ |
| Perplexity | ✅ | ✅ Web retrieval | Partial (search) |
| Cursor | ✅ | ✅ Codebase index | Partial (file edits) |
| Claude Code | ✅ | ✅ File system | ✅ Full agent |
| Lovable | ✅ | ✅ Template context | ✅ App generation |
Claude Code and similar terminal agents represent the fullest expression of this architecture: given a goal, they search the codebase for context, plan a series of actions, write code, run tests, observe the results, and iterate — without human intervention at each step.
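The loop those agents run can be reduced to a few lines. This is the shape of the pattern, not any real agent's implementation — `fake_llm` and the `tools` dict are stand-ins:

```python
def agent_loop(goal, llm, tools, max_steps=10):
    """Minimal plan-act-observe loop: the skeleton of a terminal agent."""
    history = [f"Goal: {goal}"]
    for _ in range(max_steps):
        decision = llm("\n".join(history))         # model picks next action
        if decision["action"] == "finish":
            return decision["result"]
        tool = tools[decision["action"]]           # e.g. read_file, run_tests
        observation = tool(**decision["args"])     # act in the world
        history.append(f"Did {decision['action']}, saw: {observation}")
    return "step limit reached"

def fake_llm(transcript):
    # Stand-in for a real model: run the tests once, then finish.
    if "saw:" not in transcript:
        return {"action": "run_tests", "args": {}}
    return {"action": "finish", "result": "tests pass"}

tools = {"run_tests": lambda: "2 passed"}
print(agent_loop("fix the bug", fake_llm, tools))  # -> tests pass
```

Everything that makes a real agent capable lives in two places this sketch stubs out: the quality of the model's decisions and the richness of the tool set.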
What This Means for How You Use AI Tools
Understanding AI architecture changes how you interact with these tools:
Better prompting: Knowing that Cursor retrieves context via semantic search means prompting with technical vocabulary produces better results than prompting in plain English. "How is authentication handled?" gets better results than "How does login work?"
Debugging AI failures: When Perplexity gives a wrong or hallucinated answer, it's usually a retrieval failure (it didn't find the right sources) or a context fusion failure (the LLM didn't stay grounded in the retrieved documents). Prompting more specifically, or asking Perplexity to search for specific sources, addresses retrieval failures.
Better product decisions: If you're deciding whether to build an AI feature that needs access to private company documents, you now know you need a RAG pipeline with document indexing — not just an API call to an LLM. That's a meaningful implementation decision that affects timeline and cost. And if your goal is educational rather than architectural, Where to Learn AI Without Coding is the better starting point before diving deeper here.
Research before building: Before building any AI product, understanding the architecture of similar products shows you what technical decisions you'll face. HowWorks breaks down the architecture of real AI products — including the products in this guide — so you can research those decisions before you commit to implementing them yourself.
The Common Pattern Across All of Them
Every AI app in this guide — Cursor, Perplexity, Notion AI, Lovable — follows the same underlying architecture:
User input → [Optional: Retrieve relevant context from a database or search] → Pass input + context to an LLM → [Optional: Execute actions based on LLM output] → Return result to user
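That pipeline fits in one function. The sketch below is deliberately generic — swap in a real vector search for `retrieve`, a real model call for `llm`, and a tool executor for `act`, and you have the skeleton of any app in this guide:

```python
def ai_app(user_input, retrieve=None, llm=None, act=None):
    """The shared three-layer skeleton: retrieval and action are optional."""
    context = retrieve(user_input) if retrieve else ""        # layer 2
    output = llm(f"Context: {context}\nUser: {user_input}")   # layer 1
    return act(output) if act else output                     # layer 3

# A plain chatbot: model only.
chat = ai_app("hi", llm=lambda p: "hello")

# A RAG app: retrieval + model, no actions.
rag = ai_app("what changed?",
             retrieve=lambda q: "release notes v2",
             llm=lambda p: f"Based on: {p}")
```

The same function expresses both a bare chatbot and a RAG app — which is the article's point: the differences between products live almost entirely in what gets plugged into `retrieve` and `act`.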
The model itself is the least interesting part. OpenAI, Anthropic, and Google all offer world-class LLMs at competitive prices. The architectural moat — what makes Cursor better than a different coding tool using the same model — is in the retrieval pipeline (how well it indexes and searches your codebase) and the orchestration layer (how coherently it applies changes across files).
The data pipeline is the product.
Related Reading on HowWorks
- The AI Tech Stack Explained for Non-Technical Founders — The five-layer framework for understanding any AI product's infrastructure
- How to Build an App Like Perplexity — Deep dive into one of the architectures described in this article
- How Notion Was Built: Block Model, Architecture, and Sync Pipeline — Detailed breakdown of Notion's AI integration architecture
- How Top Tech Products Are Built: A Guide for Non-Developers — Research methodology for studying any product's architecture