Why Perplexity, and What You Can Learn from It
Perplexity reached 100 million monthly active users in roughly two years by doing one thing differently from every other AI chatbot: it always cites its sources.
That single design constraint defined everything else about the product. To cite sources, you need to retrieve them first. To retrieve them accurately, you need a sophisticated search pipeline. To scale that pipeline to 100 million users, you need multi-model routing, caching, and reranking infrastructure that took years to build.
Understanding how Perplexity works architecturally isn't just interesting — it's competitive intelligence. If you're building any AI product that needs access to current or specific information, Perplexity's design decisions show you the tradeoffs you'll face, whether you're building a domain-specific search tool, an AI research assistant, or any RAG-powered product.
The Core Insight: Retrieval Before Generation
The fundamental difference between Perplexity and a raw LLM is the sequence:
Standard LLM (ChatGPT without search):
User question → LLM generates answer from training knowledge
Perplexity:
User question → Search → Retrieve documents → LLM generates answer grounded in documents → Map claims to sources → Return cited answer
Every architectural decision in Perplexity flows from this sequence. The question "how does Perplexity work?" is really the question "how does each step in this pipeline work, and what tradeoffs did the team make at each step?"
The Five-Stage Pipeline
Stage 1: Query Understanding
Before Perplexity retrieves anything, it classifies the incoming query:
- Query type: Factual question? Comparison? How-to? Opinion? This classification determines which retrieval strategy to use.
- Recency requirement: Does this query need current information (news, stock prices) or is historical information sufficient?
- Domain signals: Is this a technical, medical, legal, or general query? Domain affects which sources to prioritize.
- Search depth: Does this require a simple search (one relevant page is sufficient) or a comprehensive search (multiple sources needed for a complete answer)?
This classification step is often underestimated. A query like "what is RAG?" is different from "what's the latest research on RAG?" — the retrieval strategy needs to differ. Most RAG implementations skip this step and use a single retrieval strategy for all queries.
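To make the classification step concrete, here is a minimal rule-based sketch. A production system would use a small LLM or a trained classifier rather than keyword rules, and the marker words below are illustrative assumptions, not Perplexity's actual signals — but the output (type, recency flag, depth) mirrors the signals described above:

```python
import re

# Hypothetical recency markers; a real classifier would be learned, not a word list.
RECENCY_MARKERS = {"latest", "today", "current", "news", "price", "recent"}

def classify_query(query: str) -> dict:
    """Return routing signals for a query: type, recency need, search depth."""
    lowered = query.lower()
    tokens = set(re.findall(r"[a-z0-9]+", lowered))
    if lowered.startswith(("how do", "how to", "how can")):
        qtype = "how-to"
    elif " vs " in lowered or "compare" in tokens:
        qtype = "comparison"
    else:
        qtype = "factual"
    return {
        "type": qtype,
        "needs_recency": bool(tokens & RECENCY_MARKERS),
        "depth": "comprehensive" if qtype != "factual" else "simple",
    }

print(classify_query("what is RAG?"))
print(classify_query("what's the latest research on RAG?"))
```

Note how the two RAG queries from the paragraph above produce different recency flags — that single bit is enough to change the retrieval strategy downstream.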
Stage 2: Hybrid Retrieval
This is Perplexity's primary architectural moat: the quality of its retrieval system.
The problem with pure vector search: Semantic search finds documents conceptually related to your query — but can miss obvious keyword matches. If you search for "Perplexity Series B round amount," pure vector search might return documents about AI fundraising broadly rather than the specific fact you need.
The problem with pure keyword search: It misses conceptual connections. "AI search engine" and "LLM-powered search" describe the same thing, but keyword search won't match them.
Perplexity's solution: hybrid retrieval
Combine both:
- Run a vector search (semantic, embeddings-based) on the web index
- Run a keyword search on the same query
- Merge the results: approximately 30 vector results + 20 keyword results = 50 candidate documents
The merge strategy matters: they don't just concatenate the two result lists; they deduplicate and interleave them by relevance score. This maximizes recall (finding relevant documents) while preventing either method from dominating.
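One common way to merge ranked lists without calibrating two incompatible scoring scales against each other is reciprocal rank fusion (RRF). The source doesn't say Perplexity uses RRF specifically, but it illustrates the dedupe-and-interleave idea:

```python
from collections import defaultdict

def rrf_merge(vector_hits, keyword_hits, k=60, limit=50):
    """Merge two ranked lists of doc IDs with reciprocal rank fusion.

    A document appearing in both lists accumulates score from each,
    which deduplicates it while boosting its final rank.
    """
    scores = defaultdict(float)
    for hits in (vector_hits, keyword_hits):
        for rank, doc_id in enumerate(hits):
            scores[doc_id] += 1.0 / (k + rank + 1)
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:limit]

merged = rrf_merge(
    vector_hits=["a", "b", "c", "d"],
    keyword_hits=["c", "e", "a"],
)
```

Documents "a" and "c" appear in both lists, so they surface once each, near the top; "d", found only deep in one list, ranks last.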
For founders: This hybrid approach is worth implementing from day one, even for domain-specific products. Pure vector search will consistently miss exact-match queries. Pure keyword search will miss semantic queries. The hybrid is not significantly harder to implement, and the quality difference is immediately visible in edge cases.
Stage 3: Reranking
50 candidate documents is too many to pass to an LLM as context. Perplexity's reranking stage selects the most relevant 5-10 documents from the 50 candidates.
How reranking works: A specialized model (smaller and faster than the generation LLM) scores each candidate document against the original query. This reranker is trained specifically for the task of "given this query, how relevant is this document?" — it's better at this specific task than a general LLM.
Why reranking matters: The generation quality of the final answer depends heavily on the quality of the context you give the LLM. Passing 50 mediocre documents produces a worse answer than passing 5 excellent ones. The reranking stage is the quality filter that makes the generation stage work.
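The reranking stage itself is structurally simple — score every candidate against the query, keep the top k. The sketch below uses a toy word-overlap score as a stand-in for a trained reranker model; only the scoring function changes in production:

```python
def rerank(query, candidates, score_fn, top_k=5):
    """Score each candidate against the query and keep the best top_k."""
    scored = sorted(candidates, key=lambda doc: score_fn(query, doc), reverse=True)
    return scored[:top_k]

def overlap_score(query, doc):
    """Toy relevance: fraction of query words present in the document.
    A stand-in for a cross-encoder or managed rerank API."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / max(len(q), 1)

docs = [
    "Perplexity raised a Series B round",
    "A general overview of AI fundraising",
    "Perplexity Series B round amount disclosed",
]
top = rerank("Perplexity Series B round amount", docs, overlap_score, top_k=1)
```

Even this toy scorer correctly demotes the "AI fundraising broadly" document from the hybrid-retrieval example earlier — the real gain comes from swapping `overlap_score` for a model trained on relevance judgments.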
For founders: Cohere's Rerank API and similar services offer managed reranking without building the model yourself. At early scale, this is the right approach — the cost is low, the quality improvement is real, and the engineering time saved is significant.
Stage 4: Multi-Model Generation
This is the stage that makes Perplexity economically viable at scale, and one of the more sophisticated architectural decisions in the product.
The problem: At 100 million monthly users, every request running through GPT-4o would cost tens of millions per month in API fees. But using a weaker model for all queries reduces answer quality on complex questions.
The solution: dynamic model routing
A routing component analyzes each query and selects which LLM to use:
- Simple factual queries ("What is the capital of France?") → fast, cheap model (GPT-4o-mini or equivalent)
- Complex research queries ("Explain the mechanism by which statins reduce cardiovascular risk") → high-capability model (Claude Opus, GPT-4o)
- Reasoning-heavy queries → model optimized for chain-of-thought
- Code queries → model optimized for technical content
Perplexity's router uses reinforcement learning (PPO algorithm) to optimize model selection based on query complexity, latency requirements, and cost targets. The routing decision is made in milliseconds.
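A rule-based sketch of the routing decision is below. To be clear about the gap: Perplexity's actual router is a learned RL policy, not if/else rules, and the model names and costs here are placeholders — but the shape of the decision (query features in, model tier out) is the same:

```python
from dataclasses import dataclass

@dataclass
class Route:
    model: str
    est_cost_per_1k_tokens: float  # illustrative numbers, not real pricing

def route_query(query: str, features: dict) -> Route:
    """Pick a model tier from query features. Hypothetical heuristics."""
    if features.get("domain") == "code":
        return Route("code-tuned-model", 0.002)
    if features.get("reasoning_heavy") or len(query.split()) > 25:
        return Route("frontier-model", 0.010)
    return Route("small-fast-model", 0.0005)

r = route_query("What is the capital of France?", {})
```

The economics follow directly: if most traffic is simple queries routed to the cheap tier, the blended cost per query falls by an order of magnitude versus sending everything to the frontier model.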
The result: Average LLM cost per query is dramatically lower than using the best model for everything, while average quality is significantly higher than using the cheapest model for everything.
For founders: You don't need this on day one. Start with a single model. Implement tiered routing when: (a) your LLM costs are material to your margins, and (b) you can characterize query types well enough to route them accurately.
Stage 5: Context Fusion and Citation Mapping
This is the stage that produces Perplexity's most visible product differentiator: numbered citations for every claim.
The naive approach: Generate an answer, then append source links at the bottom. Problem: which part of the answer came from which source? Users can't verify individual claims.
Perplexity's approach: A Context Fusion Engine maps each generated claim to its source document after generation. Each sentence or claim is associated with the specific document it came from, enabling inline numbered citations.
This requires the LLM generation step to produce output in a format that supports claim-level attribution, plus a post-processing step that resolves which generated tokens correspond to which retrieved document.
The technical implementation is non-trivial. The product experience is transformative: users can verify every factual claim with one click.
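As a rough intuition for the post-processing step, here is a heuristic that attributes each generated sentence to the source with the highest word overlap. Real claim-level attribution works at the token level and is far more robust; this is only a sketch of the mapping problem:

```python
import re

def map_citations(answer_sentences, sources):
    """Attach a numbered citation to each sentence by picking the source
    with the highest word overlap. A stand-in for real token-level attribution."""
    cited = []
    for sent in answer_sentences:
        words = set(re.findall(r"\w+", sent.lower()))
        best = max(
            range(len(sources)),
            key=lambda i: len(words & set(re.findall(r"\w+", sources[i].lower()))),
        )
        cited.append(f"{sent} [{best + 1}]")
    return cited

sources = [
    "Perplexity cites its sources in every answer.",
    "Hybrid retrieval combines vector and keyword search.",
]
out = map_citations(
    ["Perplexity always cites its sources.",
     "Its retrieval combines vector and keyword search."],
    sources,
)
```

Each sentence ends up tagged with the bracketed index of its best-matching source, which is exactly the inline-citation format users see in the product.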
The Architecture at a Glance
User Query
↓
[Query Classification]
↓
[Hybrid Retrieval: Vector (30) + Keyword (20) → 50 candidates]
↓
[Reranking: 50 → 5-10 high-quality documents]
↓
[Model Router: Select LLM based on query complexity]
↓
[Generation: LLM produces answer grounded in retrieved docs]
↓
[Context Fusion: Map claims to source documents]
↓
Cited Answer with Source Links
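For orientation, the diagram above can be written as glue code with every stage stubbed out. Every function body here is a placeholder for real infrastructure, not Perplexity's internals — the point is the shape of the pipeline:

```python
def retrieve_hybrid(query):
    # Stage 2 stand-in: a real system merges vector + keyword results here.
    return [f"doc-{i}" for i in range(50)]

def rerank(query, candidates, top_k):
    # Stage 3 stand-in: a real reranker would score and sort these.
    return candidates[:top_k]

def select_model(features):
    # Stage 4 stand-in: tier names are hypothetical.
    return "small-fast-model" if features["type"] == "factual" else "frontier-model"

def generate(model, query, context):
    # Stage 5a stand-in: an LLM call grounded in the retrieved context.
    return f"[{model}] answer to {query!r} grounded in {len(context)} docs"

def attach_citations(answer, context):
    # Stage 5b stand-in: claim-level mapping reduced to appended refs.
    refs = " ".join(f"[{i + 1}]" for i in range(len(context)))
    return f"{answer} {refs}"

def pipeline(query):
    features = {"type": "factual"}                # Stage 1: classification
    candidates = retrieve_hybrid(query)           # Stage 2: 50 candidates
    context = rerank(query, candidates, top_k=5)  # Stage 3: quality filter
    model = select_model(features)                # Stage 4: routing
    answer = generate(model, query, context)      # Stage 5a: grounded generation
    return attach_citations(answer, context)      # Stage 5b: citation mapping

result = pipeline("what is RAG?")
```

Each stub is independently replaceable, which is also why the stages make a useful build order: you can ship with naive versions of Stages 1, 4, and 5 while investing first in retrieval and reranking quality.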
What You Can Build Now vs. What Takes Years
| Feature | Buildable Today | Complexity |
|---|---|---|
| Single-corpus RAG search (your documents) | ✅ | Low |
| Web search + answer generation | ✅ (via Bing/Google APIs) | Medium |
| Basic citations | ✅ | Medium |
| Hybrid retrieval (vector + keyword) | ✅ | Medium |
| Reranking | ✅ (Cohere API) | Low |
| Multi-model routing | ⚠️ (at small scale) | High |
| Citation mapping at claim level | ⚠️ | High |
| Sub-2-second responses at 100M MAU | ❌ (infrastructure scale) | Very High |
The minimum viable version of a domain-specific Perplexity — AI search over a specific corpus with cited answers — is genuinely buildable in 4-8 weeks with a small team. The production version that handles general web search at scale is a multi-year infrastructure investment.
The Founding Insight Worth Copying
Perplexity's competitive advantage over ChatGPT is not the model quality — both use frontier models. It's the trust architecture: every answer is verifiable. Users know where the information came from and can check it.
This trust architecture is a product design choice before it's a technical architecture choice. Perplexity decided that AI search products should be verifiable, then built the technical infrastructure to support that decision.
The lesson for founders: What is the trust architecture for your AI product? What would make users trust the output enough to act on it? The answer to that question often drives the most important architectural decisions.
The Domain-Specific Opportunity
The most actionable insight from Perplexity's architecture for early-stage founders is the domain-specific version.
General web search is a solved problem (Google), and general AI search is now Perplexity's territory. But AI search over a specific corpus — with the citations and grounding that make Perplexity trustworthy — remains largely unsolved for most specialized domains.
Examples of underserved domain-specific AI search:
- AI search over clinical trial data and medical literature
- AI search over legal case law and regulatory filings
- AI search over engineering documentation and technical standards
- AI search over academic research in a narrow field
- AI search over a company's internal knowledge base
In each of these, users currently get generic results from general web search. A product with Perplexity-like retrieval quality and citation reliability, applied to a specific corpus, produces dramatically better results for that specific use case.
The technical advantage of domain-specific: your retrieval problem is simpler (fixed corpus, known domain vocabulary), your reranking is more accurate (you can fine-tune for domain-specific relevance), and your citation mapping is more reliable (controlled source set).
The Stack to Start With
For a domain-specific AI search MVP:
- Foundation model: Claude 3.5 Sonnet or GPT-4o (via API)
- Vector search: Supabase pgvector or Pinecone
- Keyword search: Elasticsearch or Typesense (open-source, self-hosted)
- Hybrid merge: Custom Python, or LlamaIndex's QueryFusionRetriever
- Reranking: Cohere Rerank API (managed, no ML required)
- Web framework: Next.js + Vercel AI SDK (streaming responses)
- Auth and database: Supabase
This stack can be deployed by a 1-2 person team in 4-8 weeks. It won't match Perplexity's performance at 100M MAU, but it will produce a genuinely useful product for your target domain.
How to Research Before Building
Before building any component of a search-and-retrieval AI product, spend time understanding the architectural decisions in products that have already solved similar problems.
HowWorks shows how real AI products are built — including the retrieval and orchestration patterns that distinguish production systems from demos. The decisions that separate good AI search products from bad ones are visible in the architecture choices made before the first line of code was written.
Related Reading on HowWorks
- How AI Apps Are Built — The full architecture breakdown of Cursor, Perplexity, Notion AI, and Lovable
- The AI Tech Stack Explained for Non-Technical Founders — The five layers of every AI product and which decisions matter
- How to Build an App Like Linear — Architecture, sync strategy, and performance culture at Linear