
How AI Apps Are Built: A Plain-Language Architecture Guide (2026)

Cursor, Perplexity, Notion AI, and Lovable look like magic from the outside. They're not. Here's how the most-used AI apps actually work — their architecture, why they made the technical decisions they did, and what that means for you.

By HowWorks Team

Key takeaways

  • Every AI app is built from three atomic components: a model (the reasoning layer), a retrieval system (the memory layer), and an orchestration layer (the action layer). Everything else is variations on this pattern.
  • RAG — Retrieval-Augmented Generation — is the architecture behind most AI products you use daily. Cursor, Perplexity, Notion AI, and enterprise AI assistants all use it. It's why they can answer questions about specific documents or codebases.
  • The difference between AI products that work and AI products that hallucinate is usually the quality of the retrieval layer, not the model. The model is a commodity. The data pipeline is the moat.
  • Lovable processes over 1 billion tokens per minute at peak. Understanding how it routes those tokens explains why your prompts sometimes behave differently across sessions.
  • Understanding how AI products are architecturally built — not just using them — is the difference between being a passive consumer of AI and being able to make informed product decisions about AI.

How AI Apps Are Built

Every AI application you use — Cursor, Perplexity, Notion AI, Lovable, ChatGPT — is built from the same three architectural layers: a model (reasoning), a retrieval system (memory), and an orchestration layer (action). Understanding these three layers demystifies what feels like magic and gives you a mental model for using AI tools more effectively.

This guide walks through how four major AI apps actually work, without code. If you want to first discover which AI products are worth studying, start with Where to Find AI Projects in 2026 or compare sources in Best Tools for Discovering AI Projects.


Layer 1: The Model — What AI Actually "Thinks" With

The foundation of every AI app is a large language model (LLM). Think of it as an extraordinarily well-read prediction engine: it was trained on trillions of words of text, learning the statistical relationships between words, ideas, and reasoning patterns. When you give it text, it generates what's most likely to come next, based on everything it has learned.

This is both the power and the limitation of LLMs:

The power: LLMs can reason, write, summarize, translate, and generate code because they've seen so many examples of humans doing these things that they've learned the underlying patterns.

The limitation: LLMs only know what they were trained on. They don't know about events after their training cutoff. They don't know about your specific codebase. They don't know about your company's internal documents. They generate confident-sounding text even when they're wrong — because prediction doesn't distinguish "I know this" from "I'm extrapolating."

This limitation is why the second layer exists.


Layer 2: Retrieval — How AI Apps Give LLMs Memory

The retrieval layer is what most AI apps are actually built around. It solves the LLM's core limitation by giving the model access to specific, relevant, up-to-date information before it generates a response.

The dominant architecture is Retrieval-Augmented Generation (RAG):

User question → Search a database for relevant documents → Pass those documents to the LLM as context → LLM generates an answer grounded in those documents

The "database" in this pattern isn't a traditional database — it's a vector database that stores documents as embeddings (numerical representations of meaning, not just text). This enables semantic search: finding documents that are conceptually related to a query, not just textually matching.

Why vector databases matter: If you search a traditional database for "authorization logic," you get results containing that exact phrase. If you search a vector database for "authorization logic," you get results about authentication, permissions, access control, and security — because the embeddings capture meaning, not just keywords.
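The idea above can be sketched in a few lines. This is a toy illustration, not any product's implementation: the "embeddings" are tiny hand-made vectors (real ones have hundreds or thousands of dimensions), and the filenames and query vector are invented for the example.

```python
import math

# Toy "embeddings": hand-made 3-dimensional vectors standing in for what a
# real embedding model would produce. All names here are hypothetical.
DOCS = {
    "auth_middleware.py": [0.9, 0.8, 0.1],   # about permissions/access control
    "login_form.tsx":     [0.7, 0.9, 0.2],   # about authentication UI
    "invoice_report.sql": [0.1, 0.0, 0.9],   # about billing, unrelated
}

def cosine(a, b):
    """Cosine similarity: how closely two vectors point in the same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def semantic_search(query_vec, k=2):
    """Return the k documents whose embeddings are closest to the query."""
    ranked = sorted(DOCS, key=lambda name: cosine(query_vec, DOCS[name]), reverse=True)
    return ranked[:k]

# A made-up query embedding for "authorization logic": the two auth-related
# files rank highest even though neither contains the word "authorization".
print(semantic_search([0.8, 0.85, 0.15]))
```

Neither filename matches the query textually; the match happens in embedding space, which is the whole point of semantic search.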


How Cursor Works: RAG for Code

Cursor is the most architecturally interesting AI coding tool because its core challenge is unique: indexing an entire codebase semantically, in real time, for any project size.

Here's how it works:

Step 1: Semantic chunking. Instead of splitting code arbitrarily, Cursor breaks it into meaningful units — individual functions, classes, and modules. Each chunk captures a coherent piece of logic, which makes retrieval more accurate.

Step 2: Merkle tree synchronization. To detect changes without re-indexing everything, Cursor builds a cryptographic tree of file hashes. Every 10 minutes, it compares the current tree to the stored one. Only the files that changed get re-indexed. On a 50,000-file codebase, this cuts bandwidth from transferring all file metadata to just transmitting hash mismatches.
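The change-detection idea can be sketched with file hashes. This is a simplified illustration, not Cursor's actual code: a real Merkle tree hashes pairwise up a tree so a mismatch can be localized subtree by subtree, while this sketch only compares a single root plus per-file hashes.

```python
import hashlib

def h(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def root_hash(file_hashes: dict) -> str:
    """Combine per-file hashes into one root digest (flattened Merkle idea).
    If two roots match, the whole trees match — no per-file comparison needed."""
    combined = "".join(f"{path}:{digest}" for path, digest in sorted(file_hashes.items()))
    return h(combined.encode())

def changed_files(old: dict, new: dict) -> list:
    """Return only the files whose content hash differs — these are the only
    ones that need re-chunking and re-embedding."""
    if root_hash(old) == root_hash(new):
        return []  # cheap fast path: nothing changed anywhere
    return sorted(path for path in new if old.get(path) != new[path])

old_index = {"a.py": h(b"def f(): pass"), "b.py": h(b"x = 1")}
new_index = {"a.py": h(b"def f(): pass"), "b.py": h(b"x = 2")}
print(changed_files(old_index, new_index))  # only b.py needs re-indexing
```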

Step 3: Embedding generation and caching. Changed code chunks are converted to vector embeddings and stored in a remote vector database (Turbopuffer). Unchanged chunks reuse cached embeddings — if the code didn't change, the embedding doesn't need to be regenerated.

Step 4: Index reuse across teams. Codebases within organizations are typically 92% similar between team members. When a new developer joins, Cursor's system finds the closest existing index in the organization and reuses it — cutting time-to-first-query from hours to seconds for large codebases.

Step 5: Query time. When you ask Cursor a question, it converts your question to an embedding, searches the vector database for semantically similar code chunks, and passes the most relevant results to the LLM as context. The model has never seen your codebase — it's seeing the relevant chunks retrieved for this specific query.

What this means in practice: Cursor works best in well-documented, popular languages and frameworks because the LLM's base knowledge (layer 1) matches the patterns it retrieves (layer 2). Obscure languages or very novel architectural patterns work less well because the LLM has less baseline knowledge to reason with.


How Perplexity Works: RAG for the Web

Perplexity's architectural challenge is different from Cursor's: it needs to search the entire web at query time, retrieve the most relevant pages, and generate a grounded answer with citations — all in under 2 seconds.

Its pipeline has five stages:

Stage 1: Query understanding. Perplexity first classifies your query — is this a factual question? A comparison? A how-to? This classification affects which retrieval strategy it uses.

Stage 2: Hybrid retrieval. Rather than pure semantic search, Perplexity combines two approaches: vector-based search (semantic meaning) and keyword-based search. It merges the results — approximately 30 vector results and 20 keyword results — to get ~50 candidate documents. This hybrid approach maximizes recall and prevents "domain overfitting" where one retrieval method misses relevant documents the other would catch.
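The merge step can be sketched as a deduplicating union. This is an illustrative simplification under the article's stated budgets (~30 vector results, ~20 keyword results); the document IDs and function shape are invented for the example.

```python
def hybrid_merge(vector_hits, keyword_hits, vector_k=30, keyword_k=20):
    """Union the top vector results and top keyword results, deduplicating
    by document id while remembering which retriever(s) found each one.
    A document found by both methods is a strong candidate for reranking."""
    candidates = {}
    for doc_id in vector_hits[:vector_k]:
        candidates.setdefault(doc_id, set()).add("vector")
    for doc_id in keyword_hits[:keyword_k]:
        candidates.setdefault(doc_id, set()).add("keyword")
    return candidates

vec = ["doc1", "doc2", "doc3"]   # semantically similar pages
kw  = ["doc3", "doc4"]           # exact keyword matches
merged = hybrid_merge(vec, kw)
print(merged)  # doc3 was surfaced by both retrievers
```

The dictionary of candidates is what a reranking model (Stage 3) would then score and filter.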

Stage 3: Reranking. Not all 50 documents are equally useful. A specialized reranking model scores them for relevance to the specific query, filtering down to the most valuable sources.

Stage 4: Multi-model orchestration. Perplexity doesn't use a single LLM. A reinforcement learning-based router dynamically selects which model to use for each query — optimizing for the combination of quality, latency, and cost that best fits the query type.

Stage 5: Context fusion and citation mapping. A Context Fusion Engine takes the retrieved documents and the LLM's generated response, mapping each claim to its source document. This is why every Perplexity response includes numbered citations with confidence scores — it's not just showing sources, it's verifying that each claim is grounded in a retrieved document.

The key insight: Perplexity's quality advantage over a raw LLM isn't the model — it's the retrieval and verification pipeline. The model is interchangeable (they route to different models dynamically). The multi-stage pipeline that grounds answers in real sources is the architectural moat.


How Notion AI Works: RAG for Personal Knowledge

Notion AI's challenge is giving an LLM access to your workspace — potentially thousands of documents, databases, and notes — and enabling natural language queries across all of it.

The architecture follows the same RAG pattern, but with a workspace-specific twist:

Document indexing: Every page in your workspace gets chunked and embedded. The vector database is personal — it contains only your workspace's content.

Permission-aware retrieval: When you ask Notion AI a question, it only retrieves documents you have permission to see. The retrieval layer enforces access control — the LLM never "sees" documents outside your permissions, even indirectly.
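Enforcing access control in the retrieval layer can be sketched as a filter over search results. The ACL shape and document names here are hypothetical; the point is that disallowed documents never reach the LLM's context window at all.

```python
def permission_aware_retrieve(query_hits, user, acl):
    """Filter retrieved documents down to those the user may read.
    `acl` maps document id -> set of users allowed to read it (a made-up
    shape for illustration; real systems use richer permission models)."""
    return [doc for doc in query_hits if user in acl.get(doc, set())]

acl = {
    "roadmap":  {"alice", "bob"},
    "salaries": {"alice"},        # restricted page
}
hits = ["roadmap", "salaries"]    # what raw vector search returned
print(permission_aware_retrieve(hits, "bob", acl))  # salaries is filtered out
```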

MCP integration: In 2025, Notion built a hosted MCP (Model Context Protocol) server that converts their workspace actions to a format that AI coding tools like Cursor can invoke. This enables workflows like: "Create a Notion page with these specs" directly from a Cursor chat — the retrieval flows in both directions.

Why it sometimes gives generic answers: When Notion AI responds in a way that doesn't reference your specific documents, it's usually because the retrieval step didn't find highly relevant content for your query. The model falls back to its general knowledge. Prompting more specifically — mentioning document names, using terms you actually use in your workspace — helps the retrieval step find the right context.


How Lovable Works: Generating Entire Applications

Lovable's architecture is different from the retrieval-focused apps above. Its challenge isn't finding the right information — it's generating coherent, deployable code across multiple files and services from a single natural language description.

The intent-to-execution pipeline:

  1. Intent parsing: Your natural language description is analyzed to extract product requirements — what pages, what features, what integrations, what data model.
  2. Architecture planning: Before generating any code, Lovable's system plans the full stack: which frontend framework, which database, what authentication pattern, what API structure.
  3. Multi-file code generation: Code is generated not just for one file but for the complete application — React components, database schema, API routes, authentication config, deployment scripts.
  4. Consistency verification: Generated code is checked for consistency across files — import paths, type definitions, data model alignment.

The scale problem: Lovable processes over 1 billion tokens per minute at peak. At API rates, that's a massive infrastructure cost. Their engineering solution: sophisticated load balancing across multiple LLM providers (Anthropic, Google Vertex, Amazon Bedrock), using "project-level affinity" — keeping consecutive requests from the same project on the same provider to maintain prompt caching effectiveness.
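Project-level affinity can be sketched as deterministic routing by project ID. This is a minimal illustration, not Lovable's implementation: a production router would also weigh live load, provider health, and cost, all omitted here; the provider names come from the article.

```python
import hashlib

PROVIDERS = ["anthropic", "google-vertex", "amazon-bedrock"]

def route(project_id: str, providers=PROVIDERS) -> str:
    """Hash the project id to pick a provider, so consecutive requests from
    the same project land on the same provider and its prompt cache stays warm."""
    digest = int(hashlib.sha256(project_id.encode()).hexdigest(), 16)
    return providers[digest % len(providers)]

# The same project always routes to the same provider...
assert route("project-42") == route("project-42")
# ...while different projects spread across providers.
print({p: route(p) for p in ["project-1", "project-2", "project-3"]})
```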

Why sessions sometimes feel inconsistent: Lovable's system may route your project to a different LLM provider between sessions. The underlying model may differ. This is why establishing a clear project rules document at the start of any Lovable project — specifying your stack, folder structure, and conventions — anchors the AI's decisions across session boundaries.


Layer 3: Orchestration — How AI Apps Take Action

The third architectural layer is what transforms an AI from a text generator into an agent: the ability to take actions in the world.

Orchestration is the system that lets an LLM:

  • Call external APIs
  • Search the web
  • Write and execute code
  • Read and modify files
  • Interact with databases and services

Most AI chatbots only have layers 1 and 2. AI agents have all three.

Tool                  Model   Retrieval             Orchestration
ChatGPT (no search)   ✅      —                     —
Perplexity            ✅      ✅ Web retrieval       Partial (search)
Cursor                ✅      ✅ Codebase index      Partial (file edits)
Claude Code           ✅      ✅ File system         ✅ Full agent
Lovable               ✅      ✅ Template context    ✅ App generation

Claude Code and similar terminal agents represent the fullest expression of this architecture: given a goal, they search the codebase for context, plan a series of actions, write code, run tests, observe the results, and iterate — without human intervention at each step.
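The plan-act-observe loop described above can be sketched in a few lines. This is a generic illustration, not any vendor's agent: the `llm` callable is a scripted stand-in that returns either a tool call or a final answer, and the tool and its output are stubs.

```python
def agent_loop(goal, tools, llm, max_steps=5):
    """Minimal plan-act-observe loop. `llm` is a stand-in callable that,
    given the goal and observations so far, returns either
    ("call", tool_name, arg) or ("done", answer)."""
    observations = []
    for _ in range(max_steps):
        decision = llm(goal, observations)
        if decision[0] == "done":
            return decision[1]
        _, tool_name, arg = decision
        result = tools[tool_name](arg)            # act: run the chosen tool
        observations.append((tool_name, result))  # observe: feed result back
    return "step budget exhausted"

# Stub tool and a scripted "model" to show the control flow only.
tools = {"run_tests": lambda _: "1 failure in test_auth"}

def scripted_llm(goal, observations):
    if not observations:
        return ("call", "run_tests", None)  # first: gather context
    return ("done", f"Observed: {observations[-1][1]}")

print(agent_loop("fix the failing test", tools, scripted_llm))
```

The `max_steps` cap matters in practice: an agent that can act autonomously also needs a budget that stops it from looping forever.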


What This Means for How You Use AI Tools

Understanding AI architecture changes how you interact with these tools:

Better prompting: Knowing that Cursor retrieves context via semantic search means prompting with technical vocabulary produces better results than prompting in plain English. "How is authentication handled?" gets better results than "How does login work?"

Debugging AI failures: When Perplexity gives a wrong or hallucinated answer, it's usually a retrieval failure (it didn't find the right sources) or a context fusion failure (the LLM didn't stay grounded in the retrieved documents). Prompting more specifically, or asking Perplexity to search for specific sources, addresses retrieval failures.

Better product decisions: If you're deciding whether to build an AI feature that needs access to private company documents, you now know you need a RAG pipeline with document indexing — not just an API call to an LLM. That's a meaningful implementation decision that affects timeline and cost. And if your goal is educational rather than architectural, Where to Learn AI Without Coding is the better starting point before diving deeper here.

Research before building: Before building any AI product, understanding the architecture of similar products shows you what technical decisions you'll face. HowWorks breaks down the architecture of real AI products — including the products in this guide — so you can research those decisions before you commit to implementing them yourself.


The Common Pattern Across All of Them

Every AI app in this guide — Cursor, Perplexity, Notion AI, Lovable — follows the same underlying architecture:

User input → [Optional: Retrieve relevant context from a database or search] → Pass input + context to an LLM → [Optional: Execute actions based on LLM output] → Return result to user
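That shared skeleton can be expressed as one function with pluggable layers. The callables below are stubs invented for illustration; each one stands in for the retrieval, model, and orchestration layers described throughout this guide.

```python
def ai_app(user_input, retrieve=None, generate=None, act=None):
    """The common pattern: optional retrieval, then generation, then optional
    action. Each layer is a pluggable callable — chatbots supply only
    `generate`, RAG apps add `retrieve`, agents add `act`."""
    context = retrieve(user_input) if retrieve else []
    output = generate(user_input, context)
    return act(output) if act else output

# Stubs wiring the layers together (not real APIs):
retrieve = lambda q: ["doc about " + q]
generate = lambda q, ctx: f"answer to {q!r} grounded in {ctx}"
print(ai_app("how does login work", retrieve=retrieve, generate=generate))
```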

The model itself is the least interesting part. OpenAI, Anthropic, and Google all offer world-class LLMs at competitive prices. The architectural moat — what makes Cursor better than a different coding tool using the same model — is in the retrieval pipeline (how well it indexes and searches your codebase) and the orchestration layer (how coherently it applies changes across files).

The data pipeline is the product.



FAQ

How do AI apps like ChatGPT and Perplexity actually work?

AI apps like ChatGPT use a large language model (LLM) as their core reasoning engine — a system trained on vast amounts of text that predicts the most likely next word given everything it has seen. Perplexity adds a retrieval layer before the model: it searches the web, retrieves relevant pages, and feeds those to the model as context before generating an answer. This is why Perplexity cites sources and ChatGPT (without search enabled) can't tell you what happened last week.

What is RAG and why does it matter?

RAG stands for Retrieval-Augmented Generation. It's the architecture pattern most AI apps use to give LLMs access to specific, up-to-date, or private information beyond their training data. The basic flow: user asks a question → the system retrieves relevant documents from a database → those documents are given to the LLM as context → the LLM generates an answer grounded in those documents. Cursor uses RAG to understand your codebase. Notion AI uses it to answer questions about your documents. Enterprise AI assistants use it to answer questions about internal company knowledge.

How does Cursor understand my entire codebase?

Cursor builds a semantic index of your codebase using embeddings — vector representations of code chunks that capture meaning, not just text. It breaks your code into meaningful units (functions, classes), converts each to a vector, and stores them in a vector database. When you ask Cursor a question, it retrieves the most semantically relevant code chunks from this index and feeds them to the LLM as context. Every 10 minutes, it checks for file changes using a Merkle tree and re-indexes only what changed.

How does Perplexity find sources and answer questions?

Perplexity uses a multi-stage retrieval pipeline: it combines vector-based search (semantic meaning) with keyword search to retrieve around 50 candidate documents, then passes those through a reranking model to select the most relevant ones, then passes those to an LLM for answer generation. A context fusion engine maps each generated claim back to its source document for citations. The whole pipeline runs in seconds.

What is the difference between an AI chatbot and an AI agent?

A chatbot generates text responses — it takes your message and produces a reply. An AI agent can take actions: search the web, write and run code, call APIs, create files, send messages. Perplexity is partly an agent (it searches and retrieves before answering). Claude Code is a full agent: it reads your files, writes code, runs tests, and fixes errors autonomously. The architectural difference is an 'action layer' — a system that lets the model interact with external tools and environments, not just generate text.

Why do AI apps like Lovable and Cursor cost money to run?

AI apps pay LLM providers per token — roughly per word of input and output. Lovable processes over 1 billion tokens per minute at peak. At typical API rates, that's $1,000-$10,000+ per minute in raw model costs. This is why AI tool pricing has subscription tiers, credit systems, and usage limits. The infrastructure cost of running AI at scale is the primary reason most serious AI apps charge monthly fees — the cost structure is fundamentally different from traditional software.

Do I need to understand AI architecture to use AI tools effectively?

No, but understanding architecture changes how you use the tools. Knowing that Cursor's quality depends on its index quality explains why projects in well-documented languages work better than obscure ones. Knowing that Perplexity retrieves before it generates explains why some queries get better answers than others. Knowing that Lovable uses prompt caching explains why keeping sessions consistent produces more coherent code. Architecture knowledge doesn't make AI tools work — it helps you work with them more intentionally.
