Yann LeCun: LLMs Are Wrong — World Models and JEPA
Every few months, someone announces that large language models have crossed a threshold — they pass the bar exam, write production code, reason through complex problems — and every few months, Yann LeCun looks at the evidence and says the same thing: you're confusing the map for the territory.
LeCun is not a doomer. His claim is not that AI will harm humanity; it's that the current architecture cannot reach genuine intelligence, which is a more specific and more interesting claim. He's also not a skeptic in the dismissive sense, as his record makes clear.
Who Is Yann LeCun?
LeCun pioneered the convolutional neural network, the architecture behind most major image recognition systems of the past two decades and a foundation of modern computer vision. He is Chief AI Scientist at Meta and a professor at NYU's Courant Institute. He shared the 2018 Turing Award, computer science's highest honor, with Hinton and Bengio. He has been publicly and specifically critical of the current LLM paradigm for years, not as a contrarian gesture but as a technically grounded position, one he's backed with a concrete alternative research agenda.
So when he says that large language models cannot lead to genuine intelligence, it's worth understanding exactly what he means.
The Core Argument
LeCun's critique of LLMs is structural, not superficial. The dominant paradigm, autoregressive next-token prediction, trains models to predict the next token given every token that came before it. Do this at massive scale, across most of the internet, and you get systems that produce fluent, coherent, often impressive text.
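To make the objective concrete, here is a minimal sketch of that training signal. The generic `model` callable, which maps token ids to per-position vocabulary logits, is an assumed interface on my part, not any specific library's API:

```python
import torch.nn.functional as F

def next_token_loss(model, tokens):
    # tokens: (batch, seq_len) integer ids.
    logits = model(tokens[:, :-1])            # predict from each prefix: (batch, seq_len-1, vocab)
    targets = tokens[:, 1:]                   # the "next token" at every position
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),  # flatten positions into one batch
        targets.reshape(-1),
    )
```

Everything an LLM learns is downstream of this one number: the loss falls when the model assigns probability to the text humans actually wrote, whether or not that text tracks reality.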
What you do not get, LeCun argues, is understanding — because next-token prediction trains a model of text, not a model of the world. Text is a lossy, abstract representation of reality, encoding what humans chose to write down, which is not the same as how reality works. An LLM learns the statistical regularities of that representation: what words follow other words, what arguments follow what premises, what answers follow what questions. It does this extraordinarily well. What it cannot do — and this is LeCun's central claim — is model the underlying structure of reality that the text is about.
The failure modes he points to are specific. LLMs fail at physical reasoning: ask one what happens when you push a glass to the edge of a table, and it will frequently hedge, confabulate, or simply be wrong. They hallucinate confidently, because their training objective rewarded coherent text, not true text. They struggle with multi-step planning that requires tracking the real consequences of actions in sequence. Their knowledge is frozen at training time; they don't update from experience the way any living system does.
From LeCun's perspective, these aren't engineering bugs to be fixed with more compute or better RLHF. They're symptoms of an architectural mismatch between next-token prediction and genuine understanding — a mismatch that scaling doesn't resolve, because the problem is in what the objective function optimizes for, not how large the model is.
What He Proposes Instead: JEPA and World Models
LeCun's alternative centers on two ideas: world models and the Joint Embedding Predictive Architecture (JEPA).
Why Not Predict Tokens or Pixels?
Token space and pixel space are high-dimensional and noisy: there are infinitely many ways a video frame could look, infinitely many ways a sentence could continue. Training a model to predict in that raw space forces it to model enormous amounts of irrelevant variation, which means most of its capacity is spent encoding noise rather than structure.
JEPA takes a different route: instead of predicting raw tokens or pixels, the model predicts in abstract embedding space — a compressed representation of the world's underlying structure. The model learns what is invariant and meaningful across different surface representations, rather than memorizing surface statistics. The result is a system that aims to encode the structure of reality, not the statistics of how humans write about it.
In a JEPA setup, the system learns to map different views or segments of the same underlying situation into a shared representation space and to predict one representation from another. The hope — and I want to be honest that this is still largely a research bet, not a demonstrated result — is that this forces the model to capture the latent causal and physical structure that generates the observations.
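A minimal sketch of that setup in PyTorch might look like the following. The tiny MLP encoders, the mean-squared loss, and the exponential-moving-average target update are illustrative assumptions on my part, not Meta's published I-JEPA code; the point to notice is where the loss lives: between two embeddings, never between raw pixels or tokens.

```python
import torch
import torch.nn as nn

EMB_DIM = 128

def make_encoder(in_dim=784, emb_dim=EMB_DIM):
    return nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(), nn.Linear(256, emb_dim))

encoder = make_encoder()                        # trained by gradient descent
target_encoder = make_encoder()                 # updated only as a slow copy of `encoder`
target_encoder.load_state_dict(encoder.state_dict())
for p in target_encoder.parameters():
    p.requires_grad_(False)

predictor = nn.Sequential(nn.Linear(EMB_DIM, EMB_DIM), nn.ReLU(), nn.Linear(EMB_DIM, EMB_DIM))
opt = torch.optim.Adam(list(encoder.parameters()) + list(predictor.parameters()), lr=1e-4)

def jepa_step(view_a, view_b, ema=0.996):
    # view_a, view_b: two views of the same underlying situation
    # (e.g. two crops of one image, or adjacent video segments).
    s_a = encoder(view_a)                        # context embedding
    with torch.no_grad():
        s_b = target_encoder(view_b)             # target embedding, no gradient
    loss = ((predictor(s_a) - s_b) ** 2).mean()  # predict in embedding space
    opt.zero_grad()
    loss.backward()
    opt.step()
    with torch.no_grad():                        # EMA keeps the target slow-moving,
        for p, tp in zip(encoder.parameters(), target_encoder.parameters()):
            tp.mul_(ema).add_(p, alpha=1.0 - ema)  # one standard anti-collapse device
    return loss.item()
```

Because the target never has to be reconstructed pixel by pixel, the encoder is free to discard unpredictable surface detail and keep only what one view tells you about the other.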
Grounding: Why Text Alone Isn't Enough
The other pillar of LeCun's critique is grounding: genuine understanding, he argues, requires sensory and physical experience, not just exposure to text.
A language model learns the word "heavy" by processing sentences that contain the word "heavy." A child learns "heavy" by picking up objects, feeling resistance, watching things fall. These are fundamentally different kinds of learning: one builds a statistical association between symbols, the other builds a causal model of physical reality.
This is why video occupies a central place in LeCun's vision. A baby learns physics not by reading a physics textbook but by dropping things, watching objects move and collide, navigating space with their own body. The statistics of physical reality are embedded in video in a way that text cannot capture — the weight of objects, the behavior of liquids, the arc of thrown things. Grounding AI in video and sensory data, rather than text alone, is what LeCun argues is required for any system that will genuinely understand the world.
World models are the integration point: internal simulations of how reality works that allow an agent to plan, predict, and reason about counterfactuals. Not "what text comes next" but "what happens if I do this." The distinction is not semantic — it's the difference between a system that retrieves plausible continuations and one that models causes.
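Computationally, "what happens if I do this" looks roughly like model-predictive control: roll candidate action sequences forward through a learned dynamics model in representation space, score the predicted outcomes, and act. In the sketch below, `dynamics` and `cost` are hypothetical learned modules, and the random-shooting planner is the simplest possible choice; this is the generic recipe, not LeCun's specific architecture.

```python
import torch

def plan(dynamics, cost, state, horizon=10, n_candidates=256, action_dim=4):
    # state: (1, state_dim) latent state produced by an encoder.
    # dynamics(s, a) -> next latent states; cost(s) -> (n_candidates,) predicted badness.
    actions = torch.randn(n_candidates, horizon, action_dim)  # random candidate plans
    s = state.expand(n_candidates, -1)
    total = torch.zeros(n_candidates)
    for t in range(horizon):
        s = dynamics(s, actions[:, t])   # simulate the consequence of acting
        total = total + cost(s)          # accumulate predicted cost
    best = total.argmin()
    return actions[best, 0]              # execute the first step, then replan
```

Nothing in this loop consults text; the quantity being optimized is a predicted consequence of an action.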
The Fair Counterargument
LeCun's critics are numerous and technically credible. The emergent reasoning capabilities of frontier LLMs have surprised researchers who argued such capabilities were impossible at any scale. Chain-of-thought prompting, extended reasoning chains, and tool use suggest that something more structured than pure pattern matching may be happening inside these systems.
I was more skeptical of LLMs two years ago than I am now, and the benchmark results since then have been genuinely humbling. The honest position is that we don't fully understand what's happening inside these models. Interpretability research isn't close to answering whether LLMs do something that genuinely deserves to be called reasoning or something that merely resembles it from the outside.
But LeCun's directional bet is interesting regardless of who turns out to be right. If he's wrong, the emergent reasoning is real and we'll understand why — and the path to genuine AI runs straight through scaling the current paradigm. If he's right, the scaling trajectory hits a ceiling because the objective function is wrong, and the path forward requires something closer to what he's describing: models that learn world structure rather than text statistics, grounded in sensory data rather than human-generated language, optimizing for representation rather than prediction.
The LLM era taught us that scale applied to the right objective unlocks surprising capabilities. The open question is whether next-token prediction is the right objective — or whether it's a ladder that gets you high enough to see what you actually need to build.