How Products Are Built · 15 min read

How to Build an App Like Perplexity: Architecture, Stack, and Tradeoffs

Perplexity went from zero to 100 million monthly users with a simple insight: always cite your sources. Here's a complete breakdown of how Perplexity is architecturally built — the tech stack, the retrieval pipeline, the model routing logic — and what founders can learn from its design decisions.

By HowWorks Team

Key takeaways

  • Perplexity reached 100 million monthly active users by doing one thing differently from ChatGPT: always citing sources. That single design decision defined the product, the architecture, and the competitive moat.
  • Perplexity's core architecture is a five-stage RAG pipeline: query understanding → hybrid retrieval (vector + keyword) → reranking → multi-model generation → citation mapping. Each stage is a decision that affects quality, latency, and cost.
  • The hardest problem in an AI search product isn't the model — it's the retrieval. Model quality is commoditized. Retrieval quality is the moat. Perplexity invests disproportionately in the retrieval and reranking stages.
  • Multi-model routing (dynamically selecting which LLM to use based on query complexity) is the architecture pattern that lets Perplexity balance quality and cost at scale. You don't need this on day one — but understanding it shows you where AI search products compete.
  • The minimum viable version of a Perplexity-like product (a domain-specific AI search with citations) is buildable in weeks. The production version (100M MAU, sub-2-second responses, multi-model routing) requires years and tens of millions in infrastructure.

Why Perplexity, and What You Can Learn from It

Perplexity reached 100 million monthly active users in roughly two years by doing one thing differently from every other AI chatbot: it always cites its sources.

That single design constraint defined everything else about the product. To cite sources, you need to retrieve them first. To retrieve them accurately, you need a sophisticated search pipeline. To scale that pipeline to 100 million users, you need multi-model routing, caching, and reranking infrastructure that took years to build.

Understanding how Perplexity works architecturally isn't just interesting — it's competitive intelligence. If you're building any AI product that needs access to current or specific information, Perplexity's design decisions show you the tradeoffs you'll face, whether you're building a domain-specific search tool, an AI research assistant, or any RAG-powered product.


The Core Insight: Retrieval Before Generation

The fundamental difference between Perplexity and a raw LLM is the sequence:

Standard LLM (ChatGPT without search):

User question → LLM generates answer from training knowledge

Perplexity:

User question → Search → Retrieve documents → LLM generates answer grounded in documents → Map claims to sources → Return cited answer

Every architectural decision in Perplexity flows from this sequence. The question "how does Perplexity work?" is really the question "how does each step in this pipeline work, and what tradeoffs did the team make at each step?"


The Five-Stage Pipeline

Stage 1: Query Understanding

Before Perplexity retrieves anything, it classifies the incoming query:

  • Query type: Factual question? Comparison? How-to? Opinion? This classification determines which retrieval strategy to use.
  • Recency requirement: Does this query need current information (news, stock prices) or is historical information sufficient?
  • Domain signals: Is this a technical, medical, legal, or general query? Domain affects which sources to prioritize.
  • Search depth: Does this require a simple search (one relevant page is sufficient) or a comprehensive search (multiple sources needed for a complete answer)?

This classification step is often underestimated. A query like "what is RAG?" is different from "what's the latest research on RAG?" — the retrieval strategy needs to differ. Most RAG implementations skip this step and use a single retrieval strategy for all queries.
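The classification step above can be sketched with a few heuristics. This is a toy stand-in, not Perplexity's actual classifier (which would be a trained model or an LLM call); the cue lists and field names are illustrative assumptions.

```python
import re

# Heuristic query classifier -- a simplified stand-in for a trained
# classification step. The cue word lists below are illustrative only.
RECENCY_CUES = {"latest", "today", "current", "news", "price"}
DOMAIN_CUES = {
    "medical": {"statin", "clinical", "dosage", "symptom"},
    "legal": {"statute", "regulation", "filing"},
    "technical": {"api", "rag", "llm", "algorithm"},
}

def classify_query(query: str) -> dict:
    tokens = set(re.findall(r"[a-z0-9]+", query.lower()))
    q = query.lower()
    if q.startswith(("what is", "who is", "when")):
        qtype = "factual"
    elif " vs " in q or "compare" in q:
        qtype = "comparison"
    elif q.startswith("how"):
        qtype = "how-to"
    else:
        qtype = "open-ended"
    domain = next(
        (name for name, cues in DOMAIN_CUES.items() if tokens & cues),
        "general",
    )
    return {
        "type": qtype,
        "needs_recent": bool(tokens & RECENCY_CUES),
        "domain": domain,
        # Comparisons and open-ended queries warrant a deeper search.
        "depth": "comprehensive" if qtype in ("comparison", "open-ended") else "simple",
    }
```

Note how the two RAG queries from the paragraph above diverge: "what is RAG?" classifies as factual with no recency requirement, while "what's the latest research on RAG?" trips the recency cue and gets a different retrieval strategy.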


Stage 2: Hybrid Retrieval

This is Perplexity's primary architectural moat: the quality of its retrieval system.

The problem with pure vector search: Semantic search finds documents conceptually related to your query — but can miss obvious keyword matches. If you search for "Perplexity Series B round amount," pure vector search might return documents about AI fundraising broadly rather than the specific fact you need.

The problem with pure keyword search: It misses conceptual connections. "AI search engine" and "LLM-powered search" describe the same thing, but keyword search won't match them.

Perplexity's solution: hybrid retrieval

Combine both:

  • Run a vector search (semantic, embeddings-based) on the web index
  • Run a keyword search on the same query
  • Merge the results: approximately 30 vector results + 20 keyword results = 50 candidate documents

The merge strategy matters: they don't just concatenate the two result lists, they deduplicate and interleave them by relevance score. This maximizes recall (finding relevant documents) while preventing either method from dominating.
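One common way to implement this dedupe-and-interleave merge is reciprocal rank fusion (RRF), which combines ranked lists without letting either retriever's raw scores dominate. Whether Perplexity uses RRF specifically is an assumption; this is a minimal sketch of the pattern.

```python
def reciprocal_rank_fusion(vector_hits, keyword_hits, k=60, top_n=50):
    """Merge two ranked lists of document IDs into one deduplicated
    ranking. Each document scores 1/(k + rank + 1) per list it appears
    in; k=60 is the constant from the original RRF paper and damps the
    influence of top ranks so neither retriever dominates."""
    scores = {}
    for hits in (vector_hits, keyword_hits):
        for rank, doc_id in enumerate(hits):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    merged = sorted(scores, key=scores.get, reverse=True)
    return merged[:top_n]

# Documents found by both retrievers float to the top of the merged list.
candidates = reciprocal_rank_fusion(
    vector_hits=["d3", "d7", "d1"],
    keyword_hits=["d7", "d9"],
)
# candidates -> ["d7", "d3", "d9", "d1"]
```

Here `d7` wins because both retrievers surfaced it, which is exactly the recall-maximizing behavior described above.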

For founders: This hybrid approach is worth implementing from day one, even for domain-specific products. Pure vector search will consistently miss exact-match queries. Pure keyword search will miss semantic queries. The hybrid is not significantly harder to implement, and the quality difference is immediately visible in edge cases.


Stage 3: Reranking

50 candidate documents is too many to pass to an LLM as context. Perplexity's reranking stage selects the most relevant 5-10 documents from the 50 candidates.

How reranking works: A specialized model (smaller and faster than the generation LLM) scores each candidate document against the original query. This reranker is trained specifically for the task of "given this query, how relevant is this document?" — it's better at this specific task than a general LLM.

Why reranking matters: The generation quality of the final answer depends heavily on the quality of the context you give the LLM. Passing 50 mediocre documents produces a worse answer than passing 5 excellent ones. The reranking stage is the quality filter that makes the generation stage work.
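The shape of the reranking stage looks like this. The token-overlap score below is a toy stand-in for a trained cross-encoder (which is what production rerankers actually use); only the stage's interface, score-then-keep-top-k, is the point.

```python
def rerank(query: str, documents: list[str], top_k: int = 5) -> list[str]:
    """Score each candidate document against the query and keep the
    top_k best. Jaccard token overlap is a toy scoring function standing
    in for a trained cross-encoder reranker."""
    query_tokens = set(query.lower().split())

    def score(doc: str) -> float:
        doc_tokens = set(doc.lower().split())
        if not doc_tokens:
            return 0.0
        return len(query_tokens & doc_tokens) / len(query_tokens | doc_tokens)

    return sorted(documents, key=score, reverse=True)[:top_k]

docs = [
    "perplexity series b funding round",
    "general ai fundraising trends",
    "weather today",
]
top = rerank("perplexity series b round amount", docs, top_k=2)
# top[0] -> "perplexity series b funding round"
```

Swapping the toy scorer for a managed reranking API keeps this interface intact, which is why starting with a managed service is low-risk.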

For founders: Cohere's Rerank API and similar services offer managed reranking without building the model yourself. At early scale, this is the right approach — the cost is low, the quality improvement is real, and the engineering time saved is significant.


Stage 4: Multi-Model Generation

This is the stage that makes Perplexity economically viable at scale, and one of the more sophisticated architectural decisions in the product.

The problem: At 100 million monthly users, every request running through GPT-4o would cost tens of millions per month in API fees. But using a weaker model for all queries reduces answer quality on complex questions.

The solution: dynamic model routing

A routing component analyzes each query and selects which LLM to use:

  • Simple factual queries ("What is the capital of France?") → fast, cheap model (GPT-4o-mini or equivalent)
  • Complex research queries ("Explain the mechanism by which statins reduce cardiovascular risk") → high-capability model (Claude Opus, GPT-4o)
  • Reasoning-heavy queries → model optimized for chain-of-thought
  • Code queries → model optimized for technical content

Perplexity's router uses reinforcement learning (PPO algorithm) to optimize model selection based on query complexity, latency requirements, and cost targets. The routing decision is made in milliseconds.

The result: Average LLM cost per query is dramatically lower than using the best model for everything, while average quality is significantly higher than using the cheapest model for everything.

For founders: You don't need this on day one. Start with a single model. Implement tiered routing when: (a) your LLM costs are material to your margins, and (b) you can characterize query types well enough to route them accurately.
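When you do reach that point, tiered routing can start as simple rules over the query classification. This sketch is a rule-based stand-in for Perplexity's RL-trained router; the model-tier names and classification fields (`has_code`, `needs_reasoning`, `depth`) are illustrative assumptions, not real identifiers.

```python
from dataclasses import dataclass

@dataclass
class Route:
    model: str   # model-tier identifier; names here are illustrative
    reason: str

def route_query(classification: dict) -> Route:
    """Rule-based stand-in for an RL-trained router: pick a model tier
    from the classification produced by the query-understanding stage."""
    if classification.get("has_code"):
        return Route("code-tuned-model", "code query")
    if classification.get("needs_reasoning"):
        return Route("reasoning-model", "chain-of-thought required")
    if classification.get("depth") == "comprehensive":
        return Route("frontier-model", "complex research query")
    return Route("small-fast-model", "simple factual query")
```

Starting with rules makes the cost/quality tradeoff legible; an RL-trained router like Perplexity's only pays off once you have enough traffic to learn from.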


Stage 5: Context Fusion and Citation Mapping

This is the stage that produces Perplexity's most visible product differentiator: numbered citations for every claim.

The naive approach: Generate an answer, then append source links at the bottom. Problem: which part of the answer came from which source? Users can't verify individual claims.

Perplexity's approach: A Context Fusion Engine maps each generated claim to its source document after generation. Each sentence or claim is associated with the specific document it came from, enabling inline numbered citations.

This requires the LLM generation step to produce output in a format that supports claim-level attribution, plus a post-processing step that resolves which generated tokens correspond to which retrieved document.

The technical implementation is non-trivial. The product experience is transformative: users can verify every factual claim with one click.
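A naive version of the attribution step can be sketched as sentence-level overlap matching. This is far simpler than a real context fusion engine, which constrains the LLM's output format and resolves attribution during post-processing, but it shows the shape of the mapping.

```python
import re

def map_citations(answer: str, sources: dict[int, str]) -> list[tuple[str, int]]:
    """Attach a source number to each sentence of a generated answer by
    picking the source document with the greatest token overlap. A toy
    stand-in for claim-level attribution."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]
    cited = []
    for sentence in sentences:
        s_tokens = set(sentence.lower().split())
        best = max(
            sources,
            key=lambda i: len(s_tokens & set(sources[i].lower().split())),
        )
        cited.append((sentence, best))
    return cited

sources = {
    1: "perplexity reached 100 million monthly users",
    2: "hybrid retrieval combines vector and keyword search",
}
result = map_citations(
    "Perplexity reached 100 million users. "
    "It uses hybrid retrieval combining vector and keyword search.",
    sources,
)
# result -> [("Perplexity reached 100 million users.", 1),
#            ("It uses hybrid retrieval combining vector and keyword search.", 2)]
```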


The Architecture at a Glance

User Query
    ↓
[Query Classification]
    ↓
[Hybrid Retrieval: Vector (30) + Keyword (20) → 50 candidates]
    ↓
[Reranking: 50 → 5-10 high-quality documents]
    ↓
[Model Router: Select LLM based on query complexity]
    ↓
[Generation: LLM produces answer grounded in retrieved docs]
    ↓
[Context Fusion: Map claims to source documents]
    ↓
Cited Answer with Source Links

What You Can Build Now vs. What Takes Years

| Feature | Buildable Today | Complexity |
| --- | --- | --- |
| Single-corpus RAG search (your documents) | ✅ | Low |
| Web search + answer generation | ✅ (via Bing/Google APIs) | Medium |
| Basic citations | ✅ | Medium |
| Hybrid retrieval (vector + keyword) | ✅ | Medium |
| Reranking | ✅ (Cohere API) | Low |
| Multi-model routing | ⚠️ (at small scale) | High |
| Citation mapping at claim level | ⚠️ | High |
| Sub-2-second responses at 100M MAU | ❌ (infrastructure scale) | Very High |

The minimum viable version of a domain-specific Perplexity — AI search over a specific corpus with cited answers — is genuinely buildable in 4-8 weeks with a small team. The production version that handles general web search at scale is a multi-year infrastructure investment.


The Founding Insight Worth Copying

Perplexity's competitive advantage over ChatGPT is not the model quality — both use frontier models. It's the trust architecture: every answer is verifiable. Users know where the information came from and can check it.

This trust architecture is a product design choice before it's a technical architecture choice. Perplexity decided that AI search products should be verifiable, then built the technical infrastructure to support that decision.

The lesson for founders: What is the trust architecture for your AI product? What would make users trust the output enough to act on it? The answer to that question often drives the most important architectural decisions.


The Domain-Specific Opportunity

The most actionable insight from Perplexity's architecture for early-stage founders is the domain-specific version.

General web search is a solved problem (Google) and AI general search is now Perplexity's territory. But AI search over a specific corpus — with the citations and grounding that make Perplexity trustworthy — is largely unsolved for most specialized domains.

Examples of underserved domain-specific AI search:

  • AI search over clinical trial data and medical literature
  • AI search over legal case law and regulatory filings
  • AI search over engineering documentation and technical standards
  • AI search over academic research in a narrow field
  • AI search over a company's internal knowledge base

In each of these, users currently get generic results from general web search. A product with Perplexity-like retrieval quality and citation reliability, applied to a specific corpus, produces dramatically better results for that specific use case.

The technical advantage of domain-specific: your retrieval problem is simpler (fixed corpus, known domain vocabulary), your reranking is more accurate (you can fine-tune for domain-specific relevance), and your citation mapping is more reliable (controlled source set).


The Stack to Start With

For a domain-specific AI search MVP:

  • Foundation model: Claude 3.5 Sonnet or GPT-4o (via API)
  • Vector search: Supabase pgvector or Pinecone
  • Keyword search: Elasticsearch or Typesense (open-source, self-hosted)
  • Hybrid merge: Custom Python, or LlamaIndex's QueryFusionRetriever
  • Reranking: Cohere Rerank API (managed, no ML required)
  • Web framework: Next.js + Vercel AI SDK (streaming responses)
  • Auth and database: Supabase

This stack can be deployed by a 1-2 person team in 4-8 weeks. It won't match Perplexity's performance at 100M MAU, but it will produce a genuinely useful product for your target domain.


How to Research Before Building

Before building any component of a search-and-retrieval AI product, spend time understanding the architectural decisions in products that have already solved similar problems.

HowWorks shows how real AI products are built — including the retrieval and orchestration patterns that distinguish production systems from demos. The decisions that separate good AI search products from bad ones are visible in the architecture choices made before the first line of code was written.



FAQ

How does Perplexity work?

Perplexity is a retrieval-augmented generation (RAG) system applied to web search. When you ask it a question: (1) it classifies the query to determine search strategy; (2) it searches the web using a hybrid of vector-based and keyword-based retrieval, merging the results into roughly 50 candidate documents; (3) a reranking model scores the candidates for relevance; (4) a model router selects which LLM to use based on query complexity; (5) the selected LLM generates an answer grounded in the retrieved documents; (6) a context fusion engine maps each claim to its source for citations. The whole pipeline runs in under 2 seconds.

What is Perplexity's tech stack?

Perplexity's public disclosures and engineering blog posts reveal: Python backend for the retrieval and generation pipeline, a proprietary hybrid search index combining vector search with keyword search, a custom reranking model, dynamic LLM routing across GPT, Claude, and proprietary models, and a React/Next.js frontend with streaming response display. Their 'online LLMs' are fine-tuned models specifically optimized for the cite-grounded-answer task.

How much does it cost to build an AI search product like Perplexity?

A domain-specific MVP (AI search over a specific dataset, with citations, for one use case) can be built for $5,000-$20,000 in engineering time using open-source tools and managed APIs. A production AI search product at 10,000 MAU runs $200-$2,000/month in infrastructure. Perplexity's infrastructure at 100M MAU involves tens of millions per year — primarily LLM API costs and compute for the retrieval pipeline. The cost curve is steep: the architecture that works at 100 users needs significant changes at 1,000,000.

What is the hardest technical problem in building an AI search product?

Retrieval quality. It's easy to set up a RAG pipeline that returns answers. It's very hard to build a retrieval system that consistently finds the most relevant documents for diverse, ambiguous real-world queries. Perplexity's hybrid retrieval (combining vector and keyword approaches), their reranking stage, and their continuous fine-tuning of the retrieval components represent years of accumulated improvement. Model quality is commoditized — any team can use GPT-4o. Retrieval quality is not commoditized.

What can I build that's similar to Perplexity but more focused?

Domain-specific AI search — Perplexity for a specific corpus — is one of the highest-value AI product categories in 2026. Examples: AI search over academic papers in your field, AI search over your company's documentation, AI search over legal or medical literature, AI search over your product's knowledge base. These products have a retrieval simplification advantage (your corpus is fixed, not the entire web), don't need multi-model routing at early stage, and serve users who currently get generic results from general web search.

What is the difference between Perplexity and ChatGPT architecturally?

ChatGPT (without search) uses only its training knowledge — no retrieval at query time. Perplexity retrieves web sources before every response, grounding its answers in real-time information and providing citations. Architecturally: ChatGPT = Foundation Model only. Perplexity = Foundation Model + Real-time Retrieval + Citation Mapping. The architectural choice produces different products: ChatGPT is better for creative tasks, Perplexity is better for factual research where you need verifiable sources.
