AI Code Research · 13 min read

Why You Can't Read Other People's Code (And You're Not Stupid)

Reading code is genuinely, measurably harder than writing it. Cognitive load theory explains the asymmetry, why even open-source maintainers can't read their own work, and what actually helps when 'just read the code' isn't enough.

By AI Code Research

Key takeaways

  • Reading code is empirically harder than writing it: Robert C. Martin puts the read-to-write time ratio at well over 10:1, and developers spend less than a third of their time actually writing new code.
  • Cognitive load theory explains the asymmetry: when you write, the mental model is in your head before the code exists; when you read, you have to reconstruct it from the artifact alone.
  • Even open-source maintainers admit their own codebases are 'fairly convoluted' and direct new contributors to external tools rather than the source.
  • 'Just read the code' is bad advice. Better: read tests first, trace one path end-to-end, pair with someone for the first week, use chunking to skip implementation details until they matter.
  • When no senior engineer is available to pair with you, an AI agent that reads the source on your behalf is the next best thing.

The first time you cloned a repo at a new job, you probably felt smart. You'd been hired. You knew your stack. You'd shipped real things.

Then you opened the file tree. A hundred folders. A legacy/ directory that contained something called legacy-v3/. A README that said "see internal docs." There were no internal docs.

You opened a file at random. It imported six things from places you hadn't found yet. There was a function called process() that was four hundred lines long.

You stared at it for an hour. You didn't understand it. You felt dumb.

This article exists to tell you: you weren't dumb. You were reading.

You're Not Stupid. You're Reading.

Reading code is genuinely, measurably harder than writing it.

"Indeed, the ratio of time spent reading versus writing is well over 10 to 1. We are constantly reading old code as part of the effort to write new code." — Robert C. Martin, Clean Code

A 2024 New Stack analysis of developer time found that developers spend about 32% of their time writing new code, less than a third. The other 68% goes to everything else, and a large chunk of that is reading.

Why is reading harder than writing?

When you write code, you have the context. You know what you're trying to do. You're choosing the variable names. You're laying out the structure. The mental model is in your head before the code exists.

When you read code, the mental model isn't given to you. You have to reconstruct it from the artifact. You're a detective in a house someone else built, where every door might lead to another house.

This is asymmetric, and the asymmetry compounds with every layer.

The Cognitive Load Tax

There's actual research on this.

Cognitive Load Theory, originally formulated by educational psychologist John Sweller in 1988, distinguishes three kinds of mental load: intrinsic (the inherent complexity of what you're learning), extraneous (load imposed by how the material is presented), and germane (productive mental work of building schema).

Applied to code reading, that maps to:

  • Intrinsic load: the algorithm itself. If a function implements RSA, you have to understand RSA. There's a floor.
  • Extraneous load: naming, formatting, dead code, abandoned abstractions, files in the wrong folder, comments that lie. This is the load that should be zero — but isn't. (The sketch after this list makes this one concrete.)
  • Germane load: the "ah, I see" moment when your brain builds the right mental model and stops fighting the artifact.
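
To make the extraneous tax concrete, here are two versions of the same trivial logic. Both are hypothetical, invented for illustration; the intrinsic load (the date arithmetic) is identical, and only the extraneous load differs:

```python
from datetime import date, timedelta

# High extraneous load: the logic is trivial, but the names force you
# to reconstruct the author's intent from scratch.
def proc(d, n, flg=True):
    tmp = d + timedelta(days=n)
    if flg:
        return tmp
    return d

# Same intrinsic load, near-zero extraneous load: the names carry the model.
def extend_deadline(deadline: date, extra_days: int, approved: bool = True) -> date:
    if approved:
        return deadline + timedelta(days=extra_days)
    return deadline

print(extend_deadline(date(2025, 1, 10), extra_days=5))  # 2025-01-15
```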

Recent research published in IEEE and ACM venues uses eye tracking to measure cognitive load while developers read real source: see, for example, "Estimating Developers' Cognitive Load at a Fine-grained Level Using Eye-Tracking Measures" (ICPC 2022) and a 2021 systematic mapping study of developer cognitive load (on ScienceDirect).

The findings are consistent: when extraneous load is high, comprehension is slow. When the codebase has accumulated layers of half-finished refactors, abandoned patterns, and inconsistent naming, your brain spends most of its energy on the wrapper, not the meat.

You're not stupid. You're paying tax.

Even Maintainers Can't Read Their Own Code

The most reassuring evidence comes from the people who actually wrote the code.

On Hacker News, an open-source maintainer wrote, in a thread about AI documentation tools:

"I maintain open source projects and frequently direct volunteers to use DeepWiki to explore those (fairly convoluted) codebases."

Read that again. The person who wrote and maintains the codebase calls their own code "fairly convoluted" and points new contributors at a third-party tool to read it.

The reason isn't laziness. It's that the maintainer has the mental model in their head. Loading that model into someone else's head — that's the hard part. The codebase is the artifact, but the artifact alone doesn't transfer the model.

You see this everywhere:

  • LLVM contributors are routinely told to read research papers before the code, in a specific order, because the code without the papers is impenetrable
  • The Linux kernel ships a curated Documentation/ directory nearly as large as some entire early kernels
  • Major frameworks like React publish explanation videos because watching someone walk through the source is faster than reading it cold

The codebase alone doesn't teach you. Even the people who wrote it know that.

"Just Read the Code" Is Bad Advice

The advice you'll get from senior engineers is some variant of "just read the code." This is bad advice — not because they're wrong about reading being valuable, but because the advice is incomplete in three specific ways:

  1. You can't read all of it. As one engineer put it on Hacker News, in a now-classic 2022 thread titled "It's Harder to Read Code Than Write It": "It takes a lot of time, and there's no way you can dig through more than a fraction of a large codebase." A 200K-LOC project at 100 LOC/minute reading speed (a generous pace) is over 33 hours of reading. That's most of a work-week. You don't have a work-week.
  2. You don't know where to start. A large codebase has thousands of plausible entry points. Most are wrong. Without a guide, you'll pick the wrong door three times before finding the right one.
  3. You don't know what you're looking for. Reading without a question is like reading a dictionary. You finish more confused than when you started.

The actual technique senior engineers use isn't "just read the code." It's chunking, scaffolding, and asking. The GitHub Engineers' guide to learning new codebases — written by people who routinely onboard onto repos with millions of LOC — lists tactics like:

  • Read the tests first
  • Pair with someone for the first week
  • Master one module before touching another
  • "Understand what code does without necessarily knowing exactly how it does it"

Notice what's missing from that list: "read the code start to finish." Nobody says that. Nobody does it.

What Actually Helps

If "just read the code" is bad advice, here's better advice. Most of it works without any new tools.

Read tests, not implementation files

Tests document the contract. Implementation details live under the contract. Start with tests; they're written for clarity in a way the implementation usually isn't. They tell you what the code is supposed to do and what edge cases the original author cared about.
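
As a sketch of what that looks like in practice (the pricing module and every name in it are hypothetical), a good test file reads like a spec:

```python
# test_discounts.py: hypothetical tests for an imagined pricing module.
# Each test name is a sentence about the contract; together they document
# what apply_discount is supposed to do before you ever open its source.
import pytest

from pricing import apply_discount  # hypothetical module under test


def test_discount_reduces_price():
    # The basic contract: 10% off $100 is $90.
    assert apply_discount(price=100.0, percent=10) == 90.0


def test_discount_is_capped_at_free():
    # An edge case the original author cared about: never go negative.
    assert apply_discount(price=100.0, percent=150) == 0.0


def test_negative_discount_is_rejected():
    # Invalid input raises instead of silently marking the price up.
    with pytest.raises(ValueError):
        apply_discount(price=100.0, percent=-5)
```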

Trace one path end-to-end, not the whole tree

Don't try to read every file. Pick a real user action — "what happens when someone clicks login?" — and follow only the code that runs in that path. You'll touch maybe 1% of the codebase, and that 1% will teach you more about the architecture than reading 50%. Repeat for two or three different user actions and you'll have a working mental model of the whole system.
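
Here's the shape of what a traced path looks like once you've stripped it down. Everything below is hypothetical, a stand-in for the route, service, and repository files you'd actually be hopping between:

```python
# A hypothetical login path, flattened into one file for illustration.
# In a real repo these would be three or four files; the trace is the
# chain route -> service -> repository -> response, and nothing else.

USERS = {"ada@example.com": "correct-horse"}  # stand-in for the users table


def find_by_email(email: str) -> str | None:
    # repos/users.py in the imagined codebase: a single DB read.
    return USERS.get(email)


def authenticate(email: str, password: str) -> bool:
    # services/auth.py: the credential check, nothing more.
    stored = find_by_email(email)
    return stored is not None and stored == password


def login_route(email: str, password: str) -> dict:
    # routes/auth.py: the HTTP entry point. Start the trace here.
    if authenticate(email, password):
        return {"status": 200, "session": f"session-for-{email}"}
    return {"status": 401}


print(login_route("ada@example.com", "correct-horse"))  # {'status': 200, ...}
```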

Pair with someone for the first week

The mental model you're reconstructing already exists, fully formed, in another engineer's head. Borrowing it costs you a 30-minute conversation. Reconstructing it from scratch costs you a week. Senior engineers underestimate how much context they carry; if you ask, most are willing to spend an hour walking you through the bones.

Use chunking — ask "what does this do" not "how does it do it"

This is the technique GitHub engineers explicitly recommend. You don't need to know how a function implements its logic before you trust it. Just trust that it does what its name implies, treat it as a black box, and move on. Save the deep dive for the parts that surprise you, where the name and the behavior diverge.
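
A hedged sketch of what chunked reading feels like (every name below is invented): you understand checkout() completely without ever opening the three functions it calls.

```python
from dataclasses import dataclass

# Hypothetical checkout flow, invented to illustrate chunking.

@dataclass
class Cart:
    items: list[str]
    total: float


def checkout(cart: Cart, card: str) -> str:
    validate_cart(cart)            # the "what" is in the name; skip the "how"
    reserve_inventory(cart.items)  # black box until it surprises you
    return charge_card(card, cart.total)


# Stubs so the sketch runs. In a real repo, these are the functions
# you deliberately have NOT read yet.
def validate_cart(cart: Cart) -> None:
    if not cart.items:
        raise ValueError("empty cart")


def reserve_inventory(items: list[str]) -> None:
    pass


def charge_card(card: str, amount: float) -> str:
    return f"receipt: charged {amount} to {card}"


print(checkout(Cart(items=["book"], total=12.0), card="visa-4242"))
```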

Find what already exists before you write anything new

If you're about to add a feature, search the codebase for half-built versions of it first. Codebases over a year old usually contain two or three abandoned attempts at the thing you're trying to build. Knowing those attempts exist — and why they were abandoned — saves you a sprint and prevents you from being the fourth person to abandon it.
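
One concrete way to run that search is git's pickaxe, which finds every commit that ever added or removed a given string, including on dead branches. A minimal sketch, wrapped in Python to match the other examples; "rate_limit" is a hypothetical search term:

```python
# Hunt for earlier attempts at a feature with git's pickaxe (-S), which
# surfaces commits that added or removed the given string anywhere in history.
import subprocess

result = subprocess.run(
    ["git", "log", "--all", "--oneline", "-S", "rate_limit"],  # hypothetical term
    capture_output=True,
    text=True,
    check=True,
)
# A cluster of old commits with no surviving code is an abandoned attempt.
# Read those diffs (git show <sha>) before becoming the fourth person to try.
print(result.stdout)
```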

These are the moves engineers actually use. They're not glamorous. They're not fast. But they work.

Or Have an Engineer Read It For You

Here's the thing nobody admits about that list above: every single tactic depends on having a person — pair, mentor, senior engineer — who already has the model.

Most engineers, most of the time, don't. The senior who could pair with you is busy. The maintainer is in a different time zone. The team that wrote the abandoned attempts left two years ago. You're alone with a 200K-LOC repo and a Friday deadline.

That's why we built AI Code Research. Not because we think you can't read code. You obviously can. You did earlier, when your brain spent an hour trying to reconstruct what process() was doing.

We built it because the most useful engineer in the world is one who reads what you don't have time to read. Point AI Code Research at any GitHub repo and ask — "what does this do?" / "how should I migrate it?" / "what's the architecture?" — and the agent opens the source, reads what's there, and returns the answer an engineer would, in plain English, in roughly 60 seconds.

It's not a replacement for a senior engineer. It's the senior engineer you didn't have.

For the longer version of what AI Code Research is and how it differs from ChatGPT or DeepWiki, see What Is AI Code Research?.

You Were Always Going to Feel This

Reading other people's code feels bad because the work is hard, not because you're bad at it. You're paying cognitive load tax on every layer of abstraction someone before you didn't bother to clean up. The maintainers feel it too. The senior engineers feel it too. They've just been paying the tax long enough to develop scar tissue and tactics.

The next time you open a repo and feel dumb, remember: you're not. You're reading.

That's the hardest mode software engineering has.


If you'd rather not read all of it yourself: we built a tool for that.


FAQ

Why is reading code harder than writing it?

Because writing happens with full context, and reading happens without it. When you write, the mental model exists in your head before the code does — you choose the names, the structure, the abstractions. When you read, that model isn't given to you. You have to reconstruct it from the artifact, which compounds in difficulty with every layer of abstraction someone before you didn't bother to clean up. Cognitive load theory calls this the 'extraneous load' tax, and it's measurable in eye-tracking studies.

How long does it actually take to onboard to a large codebase?

Weeks to months, even for senior engineers. There's no single industry number because it varies wildly with codebase size and quality, but the consensus across engineering blogs (GitHub, Stripe, Shopify) is that the first month is mostly orientation, the second month is real contribution, and full fluency takes around six months. The fastest paths involve pairing with someone, reading tests first, and tracing single user-facing flows end-to-end rather than trying to understand the whole tree.

What's the most useful technique for understanding unfamiliar code?

Trace one path end-to-end. Pick a real user action — 'what happens when someone clicks login?' — and follow only the code that runs in that path. You'll touch maybe 1% of the codebase, and that 1% will teach you more about the architecture than reading 50%. The technique is a form of 'chunking': you treat un-traced functions as black boxes, trusting they do what their name implies until you have a reason to look closer.

Is it normal to feel lost in a new codebase?

Yes. Universally. Even the engineers who wrote the codebase feel lost in their own old code after a year — that's why version-control archeology (git blame, git log) exists. The feeling of being lost isn't a signal you're underqualified. It's a signal that the codebase is what it always was: a frozen artifact of decisions, half-refactors, and context that lives in someone else's head. The goal isn't to feel un-lost. It's to develop tactics fast enough that lost stops mattering.

Can AI tools really read code, or are they just summarizing blog posts?

Most AI chat tools (ChatGPT, generic Claude, Perplexity) summarize from training data — which means blog posts and Stack Overflow answers about a project, not the source itself. AI Code Research is different: it opens the actual GitHub repository at the time you ask, reads the source files, and returns an engineer's answer grounded in what the code does now. The honest version: for closed-source tools we research the public docs and SDK code instead, and we tell you upfront when we're working from the public surface rather than the source.
